# Final Synthesis: Answering Our Research Question

Having completed the separate analyses of both conversational and baseline apps, we can now synthesize the findings to directly address our project's central research question:

> *What are the most prevalent themes of user-reported conversational failure in leading mental health chatbots, and what do these themes reveal about the gap between user expectations for emotional support and current algorithmic capabilities?*

Our data provides a clear, evidence-based answer.

### Objective Fulfilled: Comparing Conversational vs. Non-Conversational Themes

Our first objective was to compare the prevalence of complaint themes. The complaint-frequency chart in Comparative Analysis 1 visualizes the difference between the two app categories.

**Key Finding:** Both app types share common frustrations around **Monetization** and **Technical Performance**, but conversational apps add a large category of failure that the baseline app lacks: **"AI Performance & Quality."** This theme does not appear in the baseline app's reviews and accounts for a major portion of all specific complaints for chatbots, which indicates that conversational failures are a distinct and significant problem.
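
One way to check that a theme is present in one app category and absent from the other is to compare the sets of theme labels. The sketch below uses small made-up label lists rather than the project's data; `conv_labels` and `base_labels` are hypothetical names introduced only for illustration.

```python
# Hypothetical theme labels, one per review (illustrative only, not project data).
conv_labels = [
    "Monetization & Value",
    "AI Performance & Quality",
    "Technical Performance",
    "AI Performance & Quality",
]
base_labels = [
    "Monetization & Value",
    "Technical Performance",
    "Monetization & Value",
]

# Themes that occur in the conversational list but never in the baseline list.
unique_to_conv = sorted(set(conv_labels) - set(base_labels))
# Themes that occur in both lists.
shared = sorted(set(conv_labels) & set(base_labels))

print("Unique to conversational apps:", unique_to_conv)
print("Shared by both app types:", shared)
```

With the loaded datasets, the same check could be written as `set(df_conv["theme"]) - set(df_base["theme"])`.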

### Objective Fulfilled: Juxtaposing Quantitative and Qualitative Evidence

To understand the *nature* of these failures, we combine our quantitative findings with qualitative user voices.

#### Insight 1: Monetization is a Universal Annoyance, but Not the Core Story
Both app types are heavily criticized for their business models. However, the emotional weight of this issue differs.

#### Insight 2: The "Gap Between Expectation and Reality" for AI is Palpable
The "AI Performance & Quality" theme directly reveals the gap between what users *expect* from an AI companion and what the algorithms *deliver*.

*   **Expectation:** Users seek memory, personality, and genuine interaction.
*   **Reality (as told by users):** They experience robotic, repetitive, and forgetful behavior.

> **Quantitative Evidence:** "AI Performance & Quality" is one of the most frequent complaint themes.
>
> **Qualitative Evidence (User Voice):** *"My experience with this app has been very bad. the ai does not do as you ask, nor follow through with simple instructions... my experience has caused me to stress out on some occasions."*

In [1]:
import pandas as pd
import os
import plotly.express as px

In [2]:
print("Loading the final, themed datasets...")

# --- DETECT REPO ROOT ---
cwd = os.getcwd()
while not os.path.exists(os.path.join(cwd, "1_datasets")):
    parent = os.path.dirname(cwd)
    if parent == cwd:
        raise FileNotFoundError("Could not find repo root containing '1_datasets'.")
    cwd = parent

REPO_ROOT = cwd
DATA_PATH = os.path.join(REPO_ROOT, "1_datasets", "all_datasets")
print(f"Repo root detected at: {REPO_ROOT}")

# --- Load Conversational App Data ---
conv_file = os.path.join(DATA_PATH, "conversational_apps_themed_and_scored.csv")
if os.path.exists(conv_file):
    df_conv = pd.read_csv(conv_file)
    print(f"Loaded {len(df_conv)} themed reviews for Conversational apps.")
else:
    print(f"ERROR: Conversational themed dataset not found at {conv_file}.")
    df_conv = pd.DataFrame()

# --- Load Baseline App Data ---
base_file = os.path.join(DATA_PATH, "baseline_app_themed_and_scored.csv")
if os.path.exists(base_file):
    df_base = pd.read_csv(base_file)
    print(f"Loaded {len(df_base)} themed reviews for the Baseline app.")
else:
    print(f"ERROR: Baseline themed dataset not found at {base_file}.")
    df_base = pd.DataFrame()

Loading the final, themed datasets...
Repo root detected at: c:\Users\azizt\OneDrive\Desktop\ET6-CDSP-group-20-repo
Loaded 20178 themed reviews for Conversational apps.
Loaded 8506 themed reviews for the Baseline app.


### Comparative Analysis 1: What Do Users Complain About?

First, we compare the relative frequency of each high-level complaint theme between the two app types. We use percentages (`normalize=True`) to ensure a fair comparison, as the total number of reviews is different.
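
To make the reasoning concrete, here is a minimal plain-Python sketch of what a normalized count computes. The helper `theme_shares` and the two label lists are hypothetical and exist only for this illustration; the real analysis relies on pandas.

```python
from collections import Counter


def theme_shares(labels):
    """Return each theme's share of the total number of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {theme: count / total for theme, count in counts.items()}


# Two made-up samples with the same mix of themes but very different sizes.
big_sample = ["Monetization & Value"] * 60 + ["Technical Performance"] * 40
small_sample = ["Monetization & Value"] * 6 + ["Technical Performance"] * 4

big = theme_shares(big_sample)
small = theme_shares(small_sample)

# Raw counts differ by a factor of ten, but the shares are directly comparable.
print(big)
print(small)
```

Comparing raw counts would make the larger dataset look worse on every theme, while the shares show that the two samples have the same complaint profile.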

In [3]:
# 1. Get the theme distribution for both app types
conv_themes = (
    df_conv[
        ~df_conv["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
    ]["theme"]
    .value_counts(normalize=True)
    .reset_index()
)
conv_themes.columns = ["Theme", "Percentage"]
conv_themes["App Type"] = "Conversational"

base_themes = (
    df_base[
        ~df_base["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
    ]["theme"]
    .value_counts(normalize=True)
    .reset_index()
)
base_themes.columns = ["Theme", "Percentage"]
base_themes["App Type"] = "Baseline (Calm)"

# 2. Combine the data for plotting
comparison_df = pd.concat([conv_themes, base_themes])
comparison_df["Percentage Label"] = (comparison_df["Percentage"] * 100).round(1).astype(
    str
) + "%"

# 3. Create the comparative bar chart
fig_freq = px.bar(
    comparison_df,
    x="Percentage",
    y="Theme",
    color="App Type",
    barmode="group",
    orientation="h",
    title="<b>The Unique Failures of AI: Complaint Frequency by App Type</b>",
    labels={"Percentage": "Percentage of All Specific Complaints"},
    text="Percentage Label",
    template="plotly_white",
)
fig_freq.update_layout(
    yaxis={"categoryorder": "total ascending"},
    title_x=0.5,
    legend_title_text="App Type",
    xaxis_tickformat=".0%",
)
fig_freq.show()

### Comparative Analysis 2: Which Complaints Cause More Frustration?

Next, we compare the average emotional sentiment for each complaint theme. This tells us not just what users complain about, but how deeply frustrating each issue is.
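
As a rough illustration of the per-theme average, the sketch below groups made-up sentiment scores by theme and ranks the themes from most to least negative. The `reviews` list and its scores are invented for this example and do not come from the project's datasets.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (theme, sentiment_score) pairs; the scores are made up.
reviews = [
    ("Monetization & Value", -0.5),
    ("Monetization & Value", -0.25),
    ("AI Performance & Quality", -0.75),
    ("AI Performance & Quality", -0.25),
]

# Collect the scores that belong to each theme.
scores_by_theme = defaultdict(list)
for theme, score in reviews:
    scores_by_theme[theme].append(score)

# Average per theme: a plain-Python counterpart of a grouped mean.
avg_sentiment = {theme: mean(scores) for theme, scores in scores_by_theme.items()}

# Rank themes with the most negative average first.
ranked = sorted(avg_sentiment.items(), key=lambda item: item[1])
for theme, avg in ranked:
    print(f"{theme}: {avg:.3f}")
```

An average hides how many reviews stand behind it, so a theme with only a handful of reviews can look more extreme than it is; reading the averages alongside the frequency chart helps guard against that.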

In [4]:
# 1. Get sentiment data for both app types
conv_sentiment = df_conv.groupby("theme")["sentiment_score"].mean().reset_index()
conv_sentiment["App Type"] = "Conversational"

base_sentiment = df_base.groupby("theme")["sentiment_score"].mean().reset_index()
base_sentiment["App Type"] = "Baseline (Calm)"

# 2. Combine for plotting
combined_sentiment = pd.concat([conv_sentiment, base_sentiment])
plot_df_sent = combined_sentiment[
    ~combined_sentiment["theme"].isin(
        ["Outliers / Generic", "Other/Misc.", "Uncategorized"]
    )
]

# 3. Create the comparative sentiment chart
fig_sent = px.bar(
    plot_df_sent,
    x="sentiment_score",
    y="theme",
    color="App Type",
    barmode="group",
    orientation="h",
    title="<b>Comparative Emotional Impact: Average Sentiment by App Type</b>",
    labels={
        "sentiment_score": "Average Sentiment (More Negative is Worse)",
        "theme": "Complaint Theme",
    },
    template="plotly_white",
)
fig_sent.update_layout(
    yaxis={"categoryorder": "total ascending"},
    title_x=0.5,
    legend_title_text="App Type",
)
fig_sent.show()

In [5]:
# --- Final Visualization: The Complaint "Fingerprint" Comparison ---

# 1. Get the percentage distributions for both app types (this part is the same)
conv_themes = (
    df_conv[
        ~df_conv["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
    ]["theme"]
    .value_counts(normalize=True)
    .reset_index()
)
conv_themes.columns = ["Theme", "Percentage of Complaints"]
conv_themes["App Type"] = "Conversational AI"

base_themes = (
    df_base[
        ~df_base["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
    ]["theme"]
    .value_counts(normalize=True)
    .reset_index()
)
base_themes.columns = ["Theme", "Percentage of Complaints"]
base_themes["App Type"] = "Baseline App (Calm)"

# 2. Combine the data
comparison_df = pd.concat([conv_themes, base_themes])

# 3. Create the Grouped Bar Chart
fig_final_comp = px.bar(
    comparison_df,
    x="Theme",
    y="Percentage of Complaints",
    color="App Type",
    barmode="group",  # This creates the side-by-side bars
    title='<b>The Different "Complaint Fingerprints": Conversational AI vs. Baseline App</b>',
    labels={"Percentage of Complaints": "Share of All Specific Complaints"},
    text_auto=".1%",  # Automatically format text on bars as percentage
    template="plotly_white",
)

# 4. Make it Beautiful and Clear
fig_final_comp.update_layout(
    title_x=0.5,
    yaxis_tickformat=".0%",  # Format y-axis as percentage
    legend_title_text="App Type",
    font=dict(size=12),
    xaxis_title=None,  # Remove the "Theme" x-axis title for a cleaner look
)
fig_final_comp.show()

In [6]:
# --- Advanced Viz 2: The Emotional Landscape (Corrected) ---

print("Creating the comparative 'Emotional Landscape' scatter plot...")

# --- SETUP: We must first create the priority dataframes from our loaded data ---

# 1. Create the priority dataframe for CONVERSATIONAL apps
conv_freq = (
    df_conv[
        ~df_conv["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
    ]["theme"]
    .value_counts()
    .reset_index()
)
conv_freq.columns = ["Theme", "Frequency (Number of Reviews)"]
conv_sent = df_conv.groupby("theme")["sentiment_score"].mean().reset_index()
conv_sent.columns = ["Theme", "Average Sentiment Score"]
priority_df_conv = pd.merge(conv_freq, conv_sent, on="Theme")
priority_df_conv["App Type"] = "Conversational"


# 2. Create the priority dataframe for the BASELINE app
base_freq = (
    df_base[
        ~df_base["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
    ]["theme"]
    .value_counts()
    .reset_index()
)
base_freq.columns = ["Theme", "Frequency (Number of Reviews)"]
base_sent = df_base.groupby("theme")["sentiment_score"].mean().reset_index()
base_sent.columns = ["Theme", "Average Sentiment Score"]
priority_df_base = pd.merge(base_freq, base_sent, on="Theme")
priority_df_base["App Type"] = "Baseline (Calm)"


# --- VISUALIZATION: Now that the data is prepared, we can create the plot ---

# 3. Combine the two priority dataframes for plotting
emotional_landscape_df = pd.concat([priority_df_conv, priority_df_base])

# 4. Create the comparative scatter plot
fig_landscape = px.scatter(
    emotional_landscape_df,
    x="Frequency (Number of Reviews)",
    y="Average Sentiment Score",
    text="Theme",
    color="App Type",
    size="Frequency (Number of Reviews)",
    title="<b>The Emotional Landscape of User Complaints: Conversational vs. Baseline</b>",
    template="plotly_white",
    hover_name="Theme",
    height=700,  # Make the chart taller to prevent text overlap
)

fig_landscape.update_traces(textposition="top center", textfont=dict(size=11))
fig_landscape.update_layout(title_x=0.5, legend_title_text="App Type")
fig_landscape.show()

Creating the comparative 'Emotional Landscape' scatter plot...


In [7]:
# --- Visualization: Emotional Impact for Conversational Apps ---

# Get the sentiment data for conversational apps (this code is from the setup of the previous chart)
conv_sent = (
    df_conv[
        ~df_conv["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
    ]
    .groupby("theme")["sentiment_score"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)
conv_sent.columns = ["Theme", "Average Sentiment Score"]


# Create a clean, compelling bar chart
fig_conv_sentiment = px.bar(
    conv_sent,
    x="Average Sentiment Score",
    y="Theme",
    orientation="h",
    title="<b>Conversational Apps: Which Complaints are Most Painful?</b>",
    labels={"Average Sentiment Score": "Average Sentiment (More Negative is Worse)"},
    text=conv_sent["Average Sentiment Score"].apply(
        lambda x: f"{x:.3f}"
    ),  # Format text on bars
    template="plotly_white",
)

fig_conv_sentiment.update_traces(
    marker_color="#ADD8E6",  # A consistent light blue
    textposition="outside",
)
fig_conv_sentiment.update_layout(title_x=0.5, font=dict(size=12))
fig_conv_sentiment.show()

In [8]:
# --- Visualization: Emotional Impact for Baseline App ---

# Get the sentiment data for the baseline app
base_sent = (
    df_base[
        ~df_base["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
    ]
    .groupby("theme")["sentiment_score"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)
base_sent.columns = ["Theme", "Average Sentiment Score"]


# Create the corresponding bar chart for the baseline app
fig_base_sentiment = px.bar(
    base_sent,
    x="Average Sentiment Score",
    y="Theme",
    orientation="h",
    title="<b>Baseline App (Calm): Which Complaints are Most Painful?</b>",
    labels={"Average Sentiment Score": "Average Sentiment (More Negative is Worse)"},
    text=base_sent["Average Sentiment Score"].apply(lambda x: f"{x:.3f}"),
    template="plotly_white",
)

fig_base_sentiment.update_traces(
    marker_color="#FFB6C1",  # A consistent light red/pink
    textposition="outside",
)
fig_base_sentiment.update_layout(title_x=0.5, font=dict(size=12))
fig_base_sentiment.show()

In [9]:
# --- Advanced Viz 3: A Tale of Two Complaints ---


def compare_user_voices(theme_name, n_samples=2, max_chars=400):
    """
    Print a side-by-side comparison of user reviews for the same theme
    from both conversational and baseline apps.
    """
    print(f"--- COMPARATIVE DEEP DIVE: '{theme_name}' ---")

    for label, df in [("CONVERSATIONAL APP", df_conv), ("BASELINE APP", df_base)]:
        print(f"\n--- {label} COMPLAINTS ---")
        # Drop missing review text so slicing below never hits a NaN
        reviews = df.loc[df["theme"] == theme_name, "review_text"].dropna()
        # Cap the sample size so small themes do not raise a ValueError
        samples = reviews.sample(min(n_samples, len(reviews))).tolist()
        for i, sample in enumerate(samples):
            # Append an ellipsis only when the review was actually truncated
            suffix = "..." if len(sample) > max_chars else ""
            print(f'Sample {i + 1}: "{sample[:max_chars]}{suffix}"')
    print("-" * 50)


# Run the comparison for the themes shared by both app types
compare_user_voices("Technical Performance")
compare_user_voices("Monetization & Value")

--- COMPARATIVE DEEP DIVE: 'Technical Performance' ---

--- CONVERSATIONAL APP COMPLAINTS ---
Sample 1: "please please please do don't install this app !! even if you did , don't give the usual passwords you use inside the app because soon as you enter your email and password they will try to hack into our social media..."
Sample 2: "trash.replica b4>>>>. rn :🗑..."

--- BASELINE APP COMPLAINTS ---
Sample 1: "the app's ux is annoying. it shows you what they want you to see instead of being able to drive your own content organization. every time i open the app, i get 3 different popups before i can get back to the main page. i don't need the app asking me how i slept, how i liked the story, and suggesting random things. the kids content is also hard to find. if it weren't for the good stories the app wo..."
Sample 2: "it doesn't work on my new samsung galaxy s22 ultra...."
--------------------------------------------------
--- COMPARATIVE DEEP DIVE: 'Monetization & Value' ---

--- CONVER

In [10]:
# This assumes you have 'df_conv' and 'df_base' loaded and themed.
# --- The Final Comparative Time Series Chart ---

# 1. Prepare conversational data
df_conv["date"] = pd.to_datetime(df_conv["date"], errors="coerce")
conv_time = df_conv[
    ~df_conv["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
]
# "ME" (month end) is the current alias; "M" is deprecated in recent pandas versions
conv_trends = (
    conv_time.groupby([pd.Grouper(key="date", freq="ME"), "theme"])
    .size()
    .reset_index(name="review_count")
)
conv_trends["month"] = conv_trends["date"].dt.to_period("M").dt.to_timestamp()
conv_trends["App Type"] = "Conversational"

# 2. Prepare baseline data
df_base["date"] = pd.to_datetime(df_base["date"], errors="coerce")
base_time = df_base[
    ~df_base["theme"].isin(["Outliers / Generic", "Other/Misc.", "Uncategorized"])
]
base_trends = (
    base_time.groupby([pd.Grouper(key="date", freq="ME"), "theme"])
    .size()
    .reset_index(name="review_count")
)
base_trends["month"] = base_trends["date"].dt.to_period("M").dt.to_timestamp()
base_trends["App Type"] = "Baseline (Calm)"

# 3. Combine the data
combined_trends = pd.concat([conv_trends, base_trends])

# 4. Create a unique 'series' column for plotting
combined_trends["series"] = (
    combined_trends["App Type"] + " - " + combined_trends["theme"]
)

# 5. Create the Visualization
fig_comp_time = px.line(
    combined_trends,
    x="month",
    y="review_count",
    color="series",  # Plot each combination as a separate line
    title="<b>Comparative Complaint Trends: Conversational AI vs. Baseline App</b>",
    labels={"month": "Month", "review_count": "Number of Negative Reviews"},
    template="plotly_white",
    # Use a custom color map to group related themes visually
    color_discrete_map={
        "Conversational - Monetization & Value": "red",
        "Baseline (Calm) - Monetization & Value": "lightcoral",
        "Conversational - Technical Performance": "purple",
        "Baseline (Calm) - Technical Performance": "plum",
        "Conversational - AI Performance & Quality": "blue",
    },
)

# Add the key annotation
fig_comp_time.add_vline(
    x=pd.to_datetime("2023-02-01").timestamp() * 1000,
    line_dash="dash",
    line_color="black",
    annotation_text="Replika ERP Update",
)

fig_comp_time.update_layout(title_x=0.5, legend_title_text="Theme & App Type")
fig_comp_time.show()


'M' is deprecated and will be removed in a future version, please use 'ME' instead.





In [11]:
# This cell prepares the data. We run it once.
# 'combined_trends' is the DataFrame we created in the last step.

# Add the key annotation information as a dictionary for easy reuse
event_annotation = {
    "x": pd.to_datetime("2023-02-01").timestamp() * 1000,
    "line_dash": "dash",
    "line_color": "black",
    "annotation_text": "Replika ERP Update",
}
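
The `x` value in `event_annotation` is the event date expressed as milliseconds since the Unix epoch, which appears to be a workaround for placing an annotated vertical line on a Plotly date axis. The standard-library sketch below shows the same conversion and a round trip back to a date. It assumes the date is interpreted as midnight UTC, which matches how pandas treats a timezone-naive timestamp; `event_date` and `event_ms` are illustrative names.

```python
from datetime import datetime, timezone

# Event date, assumed to mean midnight UTC.
event_date = datetime(2023, 2, 1, tzinfo=timezone.utc)

# Seconds since the epoch, scaled to milliseconds for the date axis.
event_ms = event_date.timestamp() * 1000
print(event_ms)

# Round trip: convert the millisecond value back to a calendar date.
recovered = datetime.fromtimestamp(event_ms / 1000, tz=timezone.utc)
print(recovered.date().isoformat())
```

Making the timezone explicit avoids a pitfall of the standard library, where a naive `datetime` is converted using the machine's local timezone and could shift the line by several hours.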

In [12]:
# --- Comparison 1: Monetization & Value ---
plot_data = combined_trends[combined_trends["theme"] == "Monetization & Value"]

fig1 = px.line(
    plot_data,
    x="month",
    y="review_count",
    color="App Type",
    title="<b>Complaint Trends: Monetization & Value</b>",
    labels={"review_count": "Number of Negative Reviews"},
    template="plotly_white",
    color_discrete_map={"Conversational": "red", "Baseline (Calm)": "lightcoral"},
)
fig1.add_vline(**event_annotation)
fig1.update_layout(title_x=0.5, legend_title_text="App Type")
fig1.show()

In [13]:
# --- Comparison 2: Technical Performance ---
plot_data = combined_trends[combined_trends["theme"] == "Technical Performance"]

fig2 = px.line(
    plot_data,
    x="month",
    y="review_count",
    color="App Type",
    title="<b>Complaint Trends: Technical Performance</b>",
    labels={"review_count": "Number of Negative Reviews"},
    template="plotly_white",
    color_discrete_map={"Conversational": "purple", "Baseline (Calm)": "plum"},
)
fig2.add_vline(**event_annotation)
fig2.update_layout(title_x=0.5, legend_title_text="App Type")
fig2.show()

In [14]:
# --- Comparison 3: The Unique Failures of Conversational AI ---
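# Keep only conversational rows so each line tracks a single app type
plot_data = combined_trends[
    (combined_trends["App Type"] == "Conversational")
    & combined_trends["theme"].isin(
        ["AI Performance & Quality", "Feature-Specific Issues"]
    )
]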

fig3 = px.line(
    plot_data,
    x="month",
    y="review_count",
    color="theme",
    title="<b>Unique Complaint Themes for Conversational AI</b>",
    labels={"review_count": "Number of Negative Reviews"},
    template="plotly_white",
    color_discrete_map={
        "AI Performance & Quality": "blue",
        "Feature-Specific Issues": "cyan",
    },
)
fig3.add_vline(**event_annotation)
fig3.update_layout(title_x=0.5, legend_title_text="Complaint Theme")
fig3.show()

### Conclusion: Answering the Main Research Question

1.  **What are the most prevalent themes of conversational failure?**
    Our analysis identifies two primary themes: **Failures of AI Intellect** (poor memory, repetitive loops, inability to understand context) and **Failures of AI Persona** (robotic responses, creepy behavior, breakdown of the "friend" illusion).

2.  **What does this reveal about the gap between user expectation and algorithmic capability?**
    It reveals a fundamental disconnect. Users, particularly those seeking emotional support, instinctively project human-like expectations onto these chatbots. They expect a partner who remembers, learns, and empathizes. The current algorithmic reality, however, is often a system optimized for simple, scripted interactions. This gap is the primary source of unique, emotionally charged user dissatisfaction with mental health chatbots. While technical bugs and high prices are frustrating, the failure to meet these deep-seated conversational expectations constitutes a critical failure of the core product promise.