# Baseline App Analysis: Calm User Reviews

**Milestone 3: Data Analysis (Part 2)**

This notebook serves as the "control group" for our main analysis. We run the same topic-modeling pipeline on user reviews of **Calm**, a popular wellness app with no conversational interface.

**Objective:**
By identifying the primary complaint themes for a baseline app, we can compare them to our conversational app findings. This will allow us to isolate the failure modes that are **unique to conversational AI**.

# 📦 Install the required packages quietly (uncomment and run these lines if needed)
# !pip install bertopic[visualization] --quiet
# !pip install hdbscan --quiet
# !pip install umap-learn --quiet

In [1]:
import warnings
import os
import pandas as pd
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader, TensorDataset
from tqdm.auto import tqdm
import plotly.express as px
from IPython.display import display

warnings.filterwarnings("ignore")
print("✅ Setup complete. All libraries are ready.")

✅ Setup complete. All libraries are ready.


In [3]:
# Locate the repo root by walking up from the working directory until '1_datasets' is found
cwd = os.getcwd()
while not os.path.exists(os.path.join(cwd, "1_datasets")):
    parent = os.path.dirname(cwd)
    if parent == cwd:
        raise FileNotFoundError("Could not find repo root containing '1_datasets'.")
    cwd = parent

REPO_ROOT = cwd
print(f"Repo root detected at: {REPO_ROOT}")

# --- BUILD PATH TO BASELINE CSV ---
DATA_PATH = os.path.join(
    REPO_ROOT, "1_datasets", "all_datasets", "baseline_app_dataset.csv"
)

# --- LOAD CSV ---
df_base = pd.read_csv(DATA_PATH)

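# --- CLEAN DATA ---
df_base = df_base.dropna(subset=["review_text"]).copy()
df_base["review_text"] = df_base["review_text"].astype(str).str.lower().str.strip()
# Keep reviews longer than 15 characters; .copy() avoids SettingWithCopyWarning
# when new columns are added to df_base later on.
df_base = df_base[df_base["review_text"].str.len() > 15].copy()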

# --- CONVERT TO LIST OF DOCUMENTS ---
docs_base = df_base["review_text"].tolist()
print(f"Successfully loaded and cleaned {len(docs_base)} documents for modeling.")

Repo root detected at: c:\Users\azizt\OneDrive\Desktop\ET6-CDSP-group-20-repo
Successfully loaded and cleaned 8506 documents for modeling.
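
The repo-root search in the cell above can also be wrapped in a small reusable function. The following is a minimal sketch using `pathlib`; the function name `find_repo_root` and its `marker` parameter are illustrative assumptions and do not appear in the notebook.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "1_datasets") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    `find_repo_root` and `marker` are hypothetical names used for illustration.
    """
    current = Path(start).resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"Could not find repo root containing '{marker}'.")
```

Calling `find_repo_root(Path.cwd())` would then play the role of the `while` loop, and the CSV path could be built with the `/` operator instead of `os.path.join`.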


In [4]:
# --- Step 3: Configure the Topic Model ---
print("Configuring a reproducible BERTopic model...")

# A. Define a Stopword List (Identical to the first analysis)
stop_words = [
    "app",
    "replika",
    "wysa",
    "woebot",
    "calm",
    "bot",
    "ai",
    "like",
    "feel",
    "good",
    "great",
    "nice",
    "love",
    "best",
    "amazing",
    "awesome",
    "fun",
    "ok",
    "cool",
    "me",
    "it",
    "and",
    "to",
    "the",
    "my",
    "is",
    "of",
    "with",
    "that",
    "for",
    "you",
    "but",
    "so",
    "on",
    "was",
    "this",
    "have",
    "in",
    "be",
    "as",
    "at",
    "not",
    "just",
    "are",
    "get",
    "want",
    "use",
    "go",
    "know",
    "say",
    "see",
    "think",
    "really",
    "even",
    "also",
]

# B. Define Deterministic Components (Identical)
random_seed = 42
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=random_seed,
)
hdbscan_model = HDBSCAN(
    min_cluster_size=30,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)  # min_cluster_size can be smaller for smaller dataset
vectorizer_model = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))

# C. Initialize the final BERTopic model (Identical)
topic_model_base = BERTopic(
    language="multilingual",
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    # min_topic_size is not used when a custom hdbscan_model is passed;
    # HDBSCAN's min_cluster_size above sets the minimum topic size.
    min_topic_size=30,
    verbose=True,
)

print("BERTopic model is configured and ready for training.")

Configuring a reproducible BERTopic model...
BERTopic model is configured and ready for training.
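
To see what the vectorizer settings do to a review, the sketch below approximates `CountVectorizer`'s default word analyzer with the standard library: lowercase the text, keep tokens of two or more word characters, drop stop words, then build 1-grams and 2-grams from what remains. It is a rough stand-in for illustration, not scikit-learn's implementation; `approx_ngrams` and `demo_stop_words` are hypothetical names, and the stop-word set is only a small excerpt of the list above.

```python
import re

# scikit-learn's default token_pattern: tokens of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def approx_ngrams(text, stop_words, ngram_range=(1, 2)):
    """Approximate CountVectorizer's analyzer: tokenize, filter, then build n-grams."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of size n over the tokens that survived stop-word removal
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i : i + n]))
    return grams


# Small excerpt of the notebook's stop-word list (hypothetical subset)
demo_stop_words = {"the", "want", "it", "not", "really", "this", "is"}
demo_grams = approx_ngrams("I don't want the 7 day free trial", demo_stop_words)
print(demo_grams)
```

Because single-character tokens are discarded, "don't" is reduced to "don" and "7" disappears, which is a likely explanation for terms such as `don` and `day free` in the topic representations further down.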


In [6]:
print(
    f"Training BERTopic on {len(docs_base)} baseline documents. This will take a few minutes..."
)

topics_base, probs_base = topic_model_base.fit_transform(docs_base)

print("\nBaseline model training complete.")

2025-08-23 21:35:27,785 - BERTopic - Embedding - Transforming documents to embeddings.


Training BERTopic on 8506 baseline documents. This will take a few minutes...


Batches:   0%|          | 0/266 [00:00<?, ?it/s]

2025-08-23 21:36:50,165 - BERTopic - Embedding - Completed ✓
2025-08-23 21:36:50,166 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-23 21:37:17,863 - BERTopic - Dimensionality - Completed ✓
2025-08-23 21:37:17,865 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-08-23 21:37:18,119 - BERTopic - Cluster - Completed ✓
2025-08-23 21:37:18,122 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-08-23 21:37:18,591 - BERTopic - Representation - Completed ✓


Baseline model training complete.


In [7]:
# --- Step 5: Review the Discovered Topics ---
print("Displaying the baseline topic overview...")

topic_info_base = topic_model_base.get_topic_info()
display(topic_info_base)

print("\n--- Detailed View of Top 10 Baseline Topics ---")
for topic_id in range(10):
    if topic_id in topic_model_base.get_topics():
        print(f"\n--- Words for Baseline Topic #{topic_id} ---")
        print(topic_model_base.get_topic(topic_id))

Displaying the baseline topic overview...


|   | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 2379 | -1_sleep_can_free_pay | [sleep, can, free, pay, they, subscription, no... | [for the content, it's pretty great i must say... |
| 1 | 0 | 1691 | 0_free_pay_if_can | [free, pay, if, can, content, everything, ther... | [it's ok but as a free user you have nothing t... |
| 2 | 1 | 679 | 1_trial_free trial_free_day | [trial, free trial, free, day, day free, charg... | [want to cancel 7 day trial, 7 day trial so it... |
| 3 | 2 | 488 | 2_meditation_meditations_find_more | [meditation, meditations, find, more, or, ther... | [i've had a subscription for this app for 2 ye... |
| 4 | 3 | 402 | 3_subscription_cancel_account_email | [subscription, cancel, account, email, they, n... | [i subscribed to the calm app with high hopes,... |
| 5 | 4 | 239 | 4_listen_sounds_sound_music | [listen, sounds, sound, music, free, pay, can,... | [if you don't pay for premium you can't listen... |
| 6 | 5 | 219 | 5_subscription_subscribe_without_monthly subsc... | [subscription, subscribe, without, monthly sub... | [requires a subscription, can't do anything un... |
| 7 | 6 | 173 | 6_sounds_noise_sound_soundscapes | [sounds, noise, sound, soundscapes, white, aud... | [been a user for many year, but the app is get... |
| 8 | 7 | 146 | 7_free_free free_nothing free_nothing | [free, free free, nothing free, nothing, free ... | [it's not really free, this is not free, this ... |
| 9 | 8 | 133 | 8_cancel_cancel subscription_subscription_impo... | [cancel, cancel subscription, subscription, im... | [impossible to cancel the subscription 😕, impo... |



--- Detailed View of Top 10 Baseline Topics ---

--- Words for Baseline Topic #0 ---
[('free', np.float64(0.016521271140252345)), ('pay', np.float64(0.012474851000288285)), ('if', np.float64(0.01120306233161068)), ('can', np.float64(0.010012146921566347)), ('content', np.float64(0.00995063830968243)), ('everything', np.float64(0.009752252072441098)), ('there', np.float64(0.008526276041306418)), ('premium', np.float64(0.008322438177745067)), ('they', np.float64(0.008246295506983126)), ('don', np.float64(0.008240746810188528))]

--- Words for Baseline Topic #1 ---
[('trial', np.float64(0.05220221868982305)), ('free trial', np.float64(0.03814593460312458)), ('free', np.float64(0.026170376254311196)), ('day', np.float64(0.023394299768316956)), ('day free', np.float64(0.015974074806825854)), ('charged', np.float64(0.014366904210447472)), ('cancel', np.float64(0.014332084502104683)), ('day trial', np.float64(0.013812263691551465)), ('they', np.float64(0.013463526095739019)), ('up', np.float
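
The topic overview is easier to read as shares of the corpus. The sketch below is plain arithmetic on the counts printed in the overview (topics -1 through 8 only, since those are the rows shown) and the 8506 documents reported at load time; the variable and function names are illustrative.

```python
# Topic sizes copied from the overview above; topic -1 is BERTopic's outlier bucket
topic_counts = {-1: 2379, 0: 1691, 1: 679, 2: 488, 3: 402,
                4: 239, 5: 219, 6: 173, 7: 146, 8: 133}
total_docs = 8506  # number of documents passed to fit_transform


def share(count, total=total_docs):
    """Fraction of all documents that fall into a topic."""
    return count / total


outlier_share = share(topic_counts[-1])
# Documents covered by the nine largest real topics (IDs 0 to 8)
shown_clustered = sum(c for t, c in topic_counts.items() if t != -1)

for topic_id, count in topic_counts.items():
    print(f"Topic {topic_id:>2}: {count:>5} docs ({share(count):.1%})")
print(f"Topics 0-8 together: {shown_clustered} docs ({share(shown_clustered):.1%})")
```

Roughly 28% of the reviews land in the outlier bucket, which is worth keeping in mind when reading the theme counts later in the notebook.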

In [8]:
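# --- PART A: Detailed mapping of Topic IDs to specific names ---

print("Mapping final Topic IDs to names and high-level themes for Baseline App...")
# Assign the raw topic IDs from the model to the baseline DataFrame
df_base["topic_id"] = topics_base
# NOTE: topic IDs are specific to one fitted model. Check every ID below against
# get_topic_info() for the current run before relying on this mapping (in the overview
# printed above, topic 1 is the free-trial topic and topic 2 is the meditation topic).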
topic_id_to_name_base = {
    0: "Paywall & Lack of Free Content",
    1: "Meditation Content Issues",
    2: "Free Trial Complaints & Charges",
    3: "Redundant: Free Trial Complaints",
    4: "Paywalled Audio & Music",
    5: "Technical: App Crashing / Not Opening",
    6: "Refund & Billing Issues",
    7: "Forced Subscription to Use",
    8: "Soundscape & Audio Feature Issues",
    9: "Subscription Cancellation Problems",
    10: "Misleading 'Free' Label",
    11: "Redundant: Refund & Billing Issues",
    12: "Content Available Free on YouTube",
    13: "False Advertising",
    14: "Generic Negative Feedback ('doesn't work')",
    15: "Redundant: Subscription Cancellation",
    16: "Price is Too Expensive",
    17: "Redundant: Paywall & Lack of Free Content",
    18: "Redundant: Subscription Cancellation",
    19: "Free Version is Useless",
    20: "Issues with Sleep Content",
    21: "Features Locked Behind Paywall",
    22: "Redundant: Issues with Sleep Content",
    23: "Exploiting Anxiety for Money",
    24: "Redundant: Paywalled Features",
    25: "App Uninstalled Due to Cost",
    26: "Generic Positive/Neutral Feedback",
    27: "Redundant: App Uninstalled",
    28: "Redundant: Paywalled Sleep Content",
    29: "Redundant: Paywall & Lack of Free Content",
    30: "Redundant: Paywall & Lack of Free Content",
    31: "Technical Issues with Wearables (Watch)",
    32: "Meta: 1-Star Review Complaints",
    33: "Perceived as a Scam",
    34: "Price is Too Expensive ($70/year)",
    35: "Noise/Junk Topic (mm, diva, bb)",
    36: "Redundant: Paywall & Lack of Free Content",
    37: "Technical Issues with Casting (Google Home)",
    38: "Redundant: Locked/Paid Features",
    39: "Account & Login Issues",
    40: "Noise/Junk Topic (crickets)",
    41: "Issues with Specific Narrator (Tamara Levitt)",
    42: "Redundant: Locked/Paid Features",
    43: "Redundant: Account & Login Issues",
    44: "Redundant: Sound/Audio Issues",
}

df_base["topic_name"] = df_base["topic_id"].map(topic_id_to_name_base)
# Assign the result back instead of using inplace=True on a column selection
df_base["topic_name"] = df_base["topic_name"].fillna("Specific/Niche Complaint")


# --- PART B: Mapping specific names to high-level themes (This remains largely the same) ---
topic_name_to_theme_base = {
    # Theme 1: Monetization & Value
    "Paywall & Lack of Free Content": "Monetization & Value",
    "Free Trial Complaints & Charges": "Monetization & Value",
    "Paywalled Audio & Music": "Monetization & Value",
    "Forced Subscription to Use": "Monetization & Value",
    "Refund & Billing Issues": "Monetization & Value",
    "Misleading 'Free' Label": "Monetization & Value",
    "Subscription Cancellation Problems": "Monetization & Value",
    "Price is Too Expensive": "Monetization & Value",
    "Content Available Free on YouTube": "Monetization & Value",
    "False Advertising": "Monetization & Value",
    "Features Locked Behind Paywall": "Monetization & Value",
    "App Uninstalled Due to Cost": "Monetization & Value",
    "Perceived as a Scam": "Monetization & Value",
    "Exploiting Anxiety for Money": "Monetization & Value",
    "Free Version is Useless": "Monetization & Value",
    "Price is Too Expensive ($70/year)": "Monetization & Value",
    "Redundant: Free Trial Complaints": "Monetization & Value",
    "Redundant: Refund & Billing Issues": "Monetization & Value",
    "Redundant: Subscription Cancellation": "Monetization & Value",
    "Redundant: Paywall & Lack of Free Content": "Monetization & Value",
    "Redundant: Paywalled Sleep Content": "Monetization & Value",
    "Redundant: Locked/Paid Features": "Monetization & Value",
    # Theme 2: Content-Specific Issues
    "Meditation Content Issues": "Content-Specific Issues",
    "Soundscape & Audio Feature Issues": "Content-Specific Issues",
    "Issues with Sleep Content": "Content-Specific Issues",
    "Issues with Specific Narrator (Tamara Levitt)": "Content-Specific Issues",
    "Redundant: Issues with Sleep Content": "Content-Specific Issues",
    "Redundant: Sound/Audio Issues": "Content-Specific Issues",
    # Theme 3: Technical Performance & Bugs
    "Technical: App Crashing / Not Opening": "Technical Performance",
    "Technical Issues with Wearables (Watch)": "Technical Performance",
    "Technical Issues with Casting (Google Home)": "Technical Performance",
    "Account & Login Issues": "Technical Performance",
    "Redundant: Account & Login Issues": "Technical Performance",
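    # Theme 4: Other
    # Note: "Redundant: Paywalled Features" (topic 24) and "Redundant: App Uninstalled"
    # (topic 27) have no entry in this dictionary, so the fallback below assigns
    # them to "Other/Misc.".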
    "Generic Negative Feedback ('doesn't work')": "Other/Misc.",
    "Generic Positive/Neutral Feedback": "Other/Misc.",
    "Meta: 1-Star Review Complaints": "Other/Misc.",
    "Noise/Junk Topic (mm, diva, bb)": "Other/Misc.",
    "Noise/Junk Topic (crickets)": "Other/Misc.",
    "Specific/Niche Complaint": "Other/Misc.",
}

df_base["theme"] = df_base["topic_name"].map(topic_name_to_theme_base)
df_base.loc[df_base["topic_id"] == -1, "theme"] = "Outliers / Generic"
df_base["theme"] = df_base["theme"].fillna("Other/Misc.")  # Catch any leftovers

print("\n--- Final, Corrected Theme Distribution for Baseline App (Calm) ---")
display(df_base["theme"].value_counts())

Mapping final Topic IDs to names and high-level themes for Baseline App...

--- Final, Corrected Theme Distribution for Baseline App (Calm) ---


theme
Monetization & Value       4593
Outliers / Generic         2379
Content-Specific Issues     943
Other/Misc.                 332
Technical Performance       259
Name: count, dtype: int64
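
Because the theme is assigned through two chained dictionaries, a topic name that is missing from the second one silently ends up in "Other/Misc.". A quick set difference catches this. The sketch below uses a four-entry excerpt of the two dictionaries from the cell above; `unmapped_names` and the `*_excerpt` variables are hypothetical names.

```python
def unmapped_names(id_to_name, name_to_theme):
    """Return topic names that have no theme assigned, in sorted order."""
    return sorted(set(id_to_name.values()) - set(name_to_theme))


# Excerpt of topic_id_to_name_base (topics 21, 24, 25, 27)
id_to_name_excerpt = {
    21: "Features Locked Behind Paywall",
    24: "Redundant: Paywalled Features",
    25: "App Uninstalled Due to Cost",
    27: "Redundant: App Uninstalled",
}
# Matching excerpt of topic_name_to_theme_base
name_to_theme_excerpt = {
    "Features Locked Behind Paywall": "Monetization & Value",
    "App Uninstalled Due to Cost": "Monetization & Value",
}

missing = unmapped_names(id_to_name_excerpt, name_to_theme_excerpt)
print(missing)
```

Running the same check on the full dictionaries, as `unmapped_names(topic_id_to_name_base, topic_name_to_theme_base)`, before building the `theme` column would list every name that will fall through to the fallback.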

In [9]:
# --- Emotional Analysis: Measuring Sentiment by Theme (Robust Version) ---

# This cell uses a pre-trained RoBERTa sentiment model (trained on tweets) to label
# each review as negative, neutral, or positive.
print("Setting up Transformer sentiment analysis pipeline for Baseline App...")

# 1. LOAD TOKENIZER AND MODEL
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Using device: {device}")


# 2. TOKENIZE THE DATA
print("Tokenizing all baseline reviews...")
review_list_base = df_base["review_text"].tolist()

inputs_base = tokenizer(
    review_list_base, padding=True, truncation=True, max_length=512, return_tensors="pt"
)
inputs_base = {key: val.to(device) for key, val in inputs_base.items()}
print("Tokenization complete.")


# 3. PERFORM INFERENCE IN BATCHES
print("Running model inference in batches...")
all_logits_base = []
batch_size = 32
dataset_base = TensorDataset(inputs_base["input_ids"], inputs_base["attention_mask"])
loader_base = DataLoader(dataset_base, batch_size=batch_size)

with torch.no_grad():
    for batch in tqdm(loader_base, desc="Analyzing Batches"):
        input_ids, attention_mask = batch
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        all_logits_base.append(outputs.logits)

all_logits_base = torch.cat(all_logits_base, dim=0)
probabilities = torch.nn.functional.softmax(all_logits_base, dim=-1)
predictions = torch.argmax(probabilities, dim=-1)
print("Inference complete.")


# 4. PROCESS THE RESULTS
id_to_label = model.config.id2label
predicted_labels = [id_to_label[pred.item()] for pred in predictions]
label_to_score = {"positive": 1, "neutral": 0, "negative": -1}
sentiment_scores = [label_to_score[label] for label in predicted_labels]

df_base["sentiment_score"] = sentiment_scores
print("Sentiment scores added to the baseline DataFrame.")

Setting up Transformer sentiment analysis pipeline for Baseline App...


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Using device: cpu
Tokenizing all baseline reviews...
Tokenization complete.
Running model inference in batches...


Analyzing Batches:   0%|          | 0/266 [00:00<?, ?it/s]

Inference complete.
Sentiment scores added to the baseline DataFrame.
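
The post-processing step (softmax, argmax, label, score) can be followed by hand on a single review. The sketch below reproduces it with the standard library instead of `torch`. The label order in `LABELS` is an assumption about the model's `id2label` mapping, the logits are made-up numbers, and `expected_score` is an alternative soft score that the notebook does not use.

```python
import math

# Assumed id2label order for the sentiment model: 0=negative, 1=neutral, 2=positive
LABELS = ("negative", "neutral", "positive")
LABEL_TO_SCORE = {"positive": 1, "neutral": 0, "negative": -1}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_score(logits):
    """Score used in the notebook: map the most probable label to -1, 0, or 1."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_TO_SCORE[LABELS[best]]


def expected_score(logits):
    """Alternative (not in the notebook): P(positive) - P(negative)."""
    probs = softmax(logits)
    return sum(p * LABEL_TO_SCORE[label] for p, label in zip(probs, LABELS))


example_logits = [2.0, 0.5, -1.0]  # made-up logits for one review
print(hard_score(example_logits), round(expected_score(example_logits), 3))
```

The hard score discards the model's confidence, so a review that is 51% negative and one that is 99% negative both count as -1 when the per-theme averages are computed.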


In [11]:
# Use the 'df_base' DataFrame, which already has the 'theme' and 'topic_name' columns.
# Strip the "Redundant: " prefix so that the chart labels are clean.
df_base["topic_name"] = df_base["topic_name"].str.replace(
    "Redundant: ", "", regex=False
)
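
Stripping the prefix merges a redundant topic with its original only when the remaining text matches the original name exactly. "Redundant: Free Trial Complaints", for instance, becomes "Free Trial Complaints", which is still a different label from "Free Trial Complaints & Charges". The sketch below shows one way to fold such variants together; the `ALIASES` table is a hypothetical reading of which names were meant to match and is not part of the notebook.

```python
PREFIX = "Redundant: "

# Hypothetical alias table: stripped name -> intended canonical name
ALIASES = {
    "Free Trial Complaints": "Free Trial Complaints & Charges",
    "Subscription Cancellation": "Subscription Cancellation Problems",
}


def canonical_name(name):
    """Drop the 'Redundant: ' prefix, then map known variants to one label."""
    stripped = name[len(PREFIX):] if name.startswith(PREFIX) else name
    return ALIASES.get(stripped, stripped)


for raw in ("Redundant: Free Trial Complaints",
            "Redundant: Refund & Billing Issues",
            "Price is Too Expensive"):
    print(f"{raw!r} -> {canonical_name(raw)!r}")
```

Applied with `df_base["topic_name"].map(canonical_name)`, this would make the treemaps and the sunburst chart count each complaint once instead of showing near-duplicate tiles.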

### Deep Dive 1: Why Are Users Upset About Monetization?

In [12]:
# Filter for the theme
monetization_df = df_base[df_base["theme"] == "Monetization & Value"]
# Get the breakdown
monetization_breakdown = monetization_df["topic_name"].value_counts().reset_index()
monetization_breakdown.columns = ["Specific Complaint", "Review Count"]

# Visualize
fig_monetization = px.treemap(
    monetization_breakdown.head(10),
    path=[px.Constant("Monetization Complaints"), "Specific Complaint"],
    values="Review Count",
    title="<b>Breakdown of Monetization Complaints</b>",
    color_discrete_sequence=px.colors.sequential.Reds_r,
)
fig_monetization.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig_monetization.show()

### Deep Dive 2: What Content Are Users Complaining About?

In [13]:
# Filter for the theme
content_df = df_base[df_base["theme"] == "Content-Specific Issues"]
# Get the breakdown
content_breakdown = content_df["topic_name"].value_counts().reset_index()
content_breakdown.columns = ["Specific Complaint", "Review Count"]

# Visualize
fig_content = px.treemap(
    content_breakdown.head(10),
    path=[px.Constant("Content Complaints"), "Specific Complaint"],
    values="Review Count",
    title="<b>Breakdown of Content-Specific Complaints</b>",
    color_discrete_sequence=px.colors.sequential.Blues_r,
)
fig_content.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig_content.show()

### Deep Dive 3: What Technical Problems Are Users Facing?

In [14]:
# Filter for the theme
technical_df = df_base[df_base["theme"] == "Technical Performance"]
# Get the breakdown
technical_breakdown = technical_df["topic_name"].value_counts().reset_index()
technical_breakdown.columns = ["Specific Complaint", "Review Count"]

# Visualize
fig_technical = px.treemap(
    technical_breakdown.head(10),
    path=[px.Constant("Technical Complaints"), "Specific Complaint"],
    values="Review Count",
    title="<b>Breakdown of Technical Performance Complaints</b>",
    color_discrete_sequence=px.colors.sequential.Greens_r,
)
fig_technical.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig_technical.show()

In [15]:
# --- The Ultimate Breakdown: Sunburst Chart ---
print("Creating a hierarchical Sunburst chart to show theme breakdowns...")

# We need a DataFrame with three columns: theme, topic_name, and count.
# Build it by grouping the themed reviews directly, excluding the catch-all themes.
breakdown_df = df_base[~df_base["theme"].isin(["Outliers / Generic", "Other/Misc."])]
sunburst_data = (
    breakdown_df.groupby(["theme", "topic_name"]).size().reset_index(name="count")
)

# Create the Sunburst chart
fig_sunburst = px.sunburst(
    sunburst_data,
    path=["theme", "topic_name"],  # This defines the hierarchy
    values="count",
    title="<b>Anatomy of a Negative Review: A Hierarchical View of Complaints</b>",
    color="theme",  # Color the inner ring by theme
    color_discrete_sequence=px.colors.qualitative.Pastel,
)

fig_sunburst.update_layout(margin=dict(t=50, l=25, r=25, b=25))
fig_sunburst.show()

Creating a hierarchical Sunburst chart to show theme breakdowns...


In [16]:
# --- The "Problem Priority Matrix": Sentiment vs. Frequency ---
print("Creating a Sentiment vs. Frequency scatter plot...")

# 1. Get the Frequency data
theme_freq = df_base["theme"].value_counts().reset_index()
theme_freq.columns = ["Theme", "Frequency (Number of Reviews)"]

# 2. Get the Sentiment data
theme_sent = df_base.groupby("theme")["sentiment_score"].mean().reset_index()
theme_sent.columns = ["Theme", "Average Sentiment Score"]

# 3. Merge them into one DataFrame
priority_df = pd.merge(theme_freq, theme_sent, on="Theme")
priority_df = priority_df[
    ~priority_df["Theme"].isin(["Outliers / Generic", "Other/Misc."])
]


# 4. Create the Scatter Plot
fig_scatter = px.scatter(
    priority_df,
    x="Frequency (Number of Reviews)",
    y="Average Sentiment Score",
    text="Theme",  # Label each point with the theme name
    size="Frequency (Number of Reviews)",  # Make bubbles bigger for more frequent topics
    color="Average Sentiment Score",
    color_continuous_scale="Reds_r",
    title="<b>Problem Priority Matrix: Which Complaints Matter Most?</b>",
    template="plotly_white",
)

# Add annotations to create quadrants
fig_scatter.update_traces(textposition="top center")
fig_scatter.add_vline(
    x=priority_df["Frequency (Number of Reviews)"].mean(),
    line_dash="dash",
    annotation_text="Avg. Frequency",
)
fig_scatter.add_hline(
    y=priority_df["Average Sentiment Score"].mean(),
    line_dash="dash",
    annotation_text="Avg. Sentiment",
)
fig_scatter.update_layout(title_x=0.5)
fig_scatter.show()

Creating a Sentiment vs. Frequency scatter plot...
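
The two dashed lines split the plot into four quadrants, and the same split can be written as a small rule. In the sketch below the frequencies come from the theme distribution earlier in the notebook, but the sentiment values are placeholders chosen for illustration and are not results of this analysis; `quadrant` is a hypothetical helper.

```python
from statistics import mean

# Frequencies: from the theme distribution above.
# Sentiment values: PLACEHOLDERS for illustration only, not notebook results.
themes = {
    "Monetization & Value": {"frequency": 4593, "sentiment": -0.8},
    "Content-Specific Issues": {"frequency": 943, "sentiment": -0.5},
    "Technical Performance": {"frequency": 259, "sentiment": -0.7},
}

# Same reference lines as the plot: the mean across the plotted themes
avg_frequency = mean(t["frequency"] for t in themes.values())
avg_sentiment = mean(t["sentiment"] for t in themes.values())


def quadrant(frequency, sentiment):
    """Place a theme relative to the average frequency and average sentiment."""
    freq_side = "frequent" if frequency >= avg_frequency else "rare"
    mood_side = "more negative" if sentiment < avg_sentiment else "less negative"
    return f"{freq_side}, {mood_side}"


quadrants = {name: quadrant(**values) for name, values in themes.items()}
for name, label in quadrants.items():
    print(f"{name}: {label}")
```

With only three themes, both averages are dominated by the largest one, so the quadrant boundaries should be read as a rough guide.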


# **TIME SERIES ANALYSIS**

In [17]:
# 1. Prepare the data
df_base["date"] = pd.to_datetime(df_base["date"], errors="coerce")
df_base.dropna(subset=["date"], inplace=True)
plot_df_time_base = df_base[
    ~df_base["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
]

# 2. Group by month (period-based, so it does not depend on the "M"/"ME" frequency alias)
monthly_trends_base = (
    plot_df_time_base.assign(
        month=plot_df_time_base["date"].dt.to_period("M").dt.to_timestamp()
    )
    .groupby(["month", "theme"])
    .size()
    .reset_index(name="review_count")
)

# 3. Create the Visualization
fig_timeseries_base = px.line(
    monthly_trends_base,
    x="month",
    y="review_count",
    color="theme",
    markers=True,
    title="<b>Baseline App (Calm): Evolution of User Complaints Over Time</b>",
    labels={"month": "Month", "review_count": "Number of Negative Reviews"},
    template="plotly_white",
)

fig_timeseries_base.update_layout(title_x=0.5, legend_title_text="Complaint Theme")
fig_timeseries_base.show()
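
The monthly aggregation behind the line chart amounts to counting reviews per (month, theme) pair. The sketch below does the same bucketing with the standard library rather than pandas, on a handful of made-up rows, which makes it easy to check by hand what the chart is counting. `monthly_counts` and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import date


def monthly_counts(rows):
    """Count reviews per (first day of month, theme), sorted by month then theme."""
    counts = Counter()
    for day, theme in rows:
        month_start = date(day.year, day.month, 1)
        counts[(month_start, theme)] += 1
    return dict(sorted(counts.items()))


# Made-up (date, theme) rows for illustration only
sample_rows = [
    (date(2024, 1, 5), "Monetization & Value"),
    (date(2024, 1, 20), "Monetization & Value"),
    (date(2024, 1, 9), "Technical Performance"),
    (date(2024, 2, 1), "Monetization & Value"),
]

sample_counts = monthly_counts(sample_rows)
for (month, theme), n in sample_counts.items():
    print(month.isoformat(), theme, n)
```

Months with no reviews for a theme simply have no entry, so the plotted line connects the neighbouring months directly instead of dropping to zero.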

In [18]:
# --- Final Step: Save the Enriched DataFrame ---

# Define the output filename
output_filename_base = "baseline_app_themed_and_scored.csv"

# Save the DataFrame to a new CSV file
df_base.to_csv(output_filename_base, index=False)

print(f"Successfully saved the final, enriched dataset to: {output_filename_base}")
print("You can now find this file in the 'Output' section of this notebook on Kaggle.")

Successfully saved the final, enriched dataset to: baseline_app_themed_and_scored.csv
You can now find this file in the 'Output' section of this notebook on Kaggle.
