# 📊 Task 2 – Sentiment & Thematic Analysis  
📘 Version: 2025-06-08

Quantify user sentiment and identify key themes in cleaned Google Play reviews for three Ethiopian banks (CBE, BOA, Dashen) to uncover satisfaction drivers and pain points.

### This notebook covers:
- Loading pre-cleaned reviews (`data/cleaned/reviews_all_banks_cleaned.csv`)
- Computing review-level sentiment scores and labels (VADER, with optional DistilBERT/FinBERT fallback)
- Aggregating mean sentiment by bank and star-rating
- Extracting significant keywords & phrases (TF-IDF unigrams/bigrams, spaCy noun chunks)
- Assigning reviews to 3–5 rule-based themes per bank (e.g., Account Access, Transaction Performance, UI/UX)
- Exporting an enriched CSV (`data/outputs/reviews_with_sentiment_themes.csv`) for reporting and visualization


In [4]:
# ------------------------------------------------------------------------------
# 🛠 Ensure Notebook Runs from Project Root (for src/ imports to work)
# ------------------------------------------------------------------------------

import os
import sys

# If running from /notebooks/, move up to project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
    print("📂 Changed working directory to project root")

# Add project root to sys.path so `src/` modules can be imported
project_root = os.getcwd()
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"✅ Added to sys.path: {project_root}")

# Optional: verify file presence to confirm we're in the right place
expected_path = "data/raw"
print(
    "📁 Output path ready"
    if os.path.exists(expected_path)
    else f"⚠️ Output path not found: {expected_path}"
)

📂 Changed working directory to project root
✅ Added to sys.path: c:\Users\admin\Documents\GIT Repositories\b5w2-customer-ux-analytics-challenge
📁 Output path ready


In [5]:
# ------------------------------------------------------------------------------
# 📦 Core Libraries
# ------------------------------------------------------------------------------
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# ------------------------------------------------------------------------------
# 🧠 NLP & Text Processing (Task 2)
# ------------------------------------------------------------------------------
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sentiment analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Optional: transformer-based sentiment
# from transformers import pipeline

# SymSpell for spelling correction (if you re‐normalize in‐pipeline)
from symspellpy.symspellpy import SymSpell

# spaCy for noun‐chunk/theme extraction (if needed)
import spacy

# ------------------------------------------------------------------------------
# 🔧 Display & Config
# ------------------------------------------------------------------------------
from IPython.display import display

# ------------------------------------------------------------------------------
# ⚙️ Optional: Download NLTK resources if running for first time
# ------------------------------------------------------------------------------
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 📥 Load Cleaned Google Play Reviews Dataset

This step initializes the core reviews dataset used throughout the sentiment & thematic analysis pipeline:

- Loads `reviews_all_banks_cleaned.csv` from the `data/cleaned/` directory.  
- Automatically parses the `date` column (UTC) into datetime objects.  
- Wraps `pandas.read_csv()` in a fault-tolerant `ReviewDataLoader` class with UTF-8 → latin1 fallback.  
- Includes verbose diagnostics to confirm encoding, row/column counts, and schema.  

This ensures a reliable, reproducible foundation for all downstream sentiment scoring and theme extraction tasks.  


In [6]:
# ------------------------------------------------------------------------------
# 📥 Load Cleaned Google Play Reviews Dataset
# ------------------------------------------------------------------------------
from src.nlp.review_loader import ReviewDataLoader

DATA_PATH = "data/cleaned/reviews_all_banks_cleaned.csv"

try:
    loader = ReviewDataLoader(path=DATA_PATH, verbose=True)
    df_reviews = loader.load()
    print(
        f"✅ Successfully loaded {len(df_reviews):,} cleaned reviews into `df_reviews`"
    )
except Exception as e:
    print(f"❌ Failed to load reviews dataset: {e}")
    df_reviews = pd.DataFrame()  # gracefully degrade for further diagnosis


📄 File loaded: data/cleaned/reviews_all_banks_cleaned.csv
📦 Encoding used: utf-8
🔢 Shape: 1,200 rows × 12 columns
🧪 Columns: review, rating, date, bank, source, reviewId, userName, userImage, appVersion, repliedAt, replyContent, thumbsUpCount

✅ Successfully loaded 1,200 cleaned reviews into `df_reviews`


In [7]:
df_reviews

Unnamed: 0,review,rating,date,bank,source,reviewId,userName,userImage,appVersion,repliedAt,replyContent,thumbsUpCount
0,"""Why don’t your ATMs support account-to-accoun...",4,2025-06-06,CBE,Google Play,be2cb2ac-bbe0-4175-81c4-9f6c86afdaaa,Aim4 Beyond,https://play-lh.googleusercontent.com/a/ACg8oc...,,,,0
1,what is this app problem???,1,2025-06-05,CBE,Google Play,8efd71e9-59cd-41ce-8c5c-12052dee9ad0,zakir man,https://play-lh.googleusercontent.com/a/ACg8oc...,5.1.0,,,0
2,the app is proactive and a good connections.,5,2025-06-05,CBE,Google Play,b12d0383-9b27-4e49-a94d-277a43b15800,Yesuf Ahmed,https://play-lh.googleusercontent.com/a/ACg8oc...,5.1.0,,,0
3,I cannot send to cbebirr app. through this app.,3,2025-06-05,CBE,Google Play,dd9f9e37-177a-46df-b877-d0edaa9aed29,Yonas Mekonnen,https://play-lh.googleusercontent.com/a-/ALV-U...,,,,0
4,good,4,2025-06-05,CBE,Google Play,8e34703c-203c-4180-8b32-bfd0b3f0c871,Yibrah Yebo,https://play-lh.googleusercontent.com/a/ACg8oc...,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1195,It has a Good performance but need more upgrad...,4,2025-01-17,Dashen,Google Play,40cc8813-3573-4cc3-bbba-3cd722362717,Leul Tube,https://play-lh.googleusercontent.com/a-/ALV-U...,1.0.4,,,75
1196,It is a very wonderful work that has saved its...,5,2025-01-17,Dashen,Google Play,42006315-95d0-4ccb-9a5c-a7ac1053d977,Fantabil Deresse,https://play-lh.googleusercontent.com/a-/ALV-U...,1.0.4,,,5
1197,“Life-changing!” I can’t imagine going back to...,5,2025-01-17,Dashen,Google Play,b969925e-d6eb-4620-9bb2-6a7868cc4d57,Dawit Alemayehu,https://play-lh.googleusercontent.com/a/ACg8oc...,,,,4
1198,Pro max,5,2025-01-17,Dashen,Google Play,4082204a-22e9-4568-93dc-2e4f572a0a35,Meba Abiye,https://play-lh.googleusercontent.com/a/ACg8oc...,,,,7


## 🔄 Normalize Review Text & Split per Bank

This step applies our Tier-1 `TextNormalizer` to standardize raw review content and prepares bank-specific datasets for targeted analysis:

- Instantiates `TextNormalizer` (SymSpell + spaCy) with defensive checks  
- Detects the source text column (`corrected_review` if present, otherwise `review`)  
- Normalizes each review into a new `normalized_review` column, with per-row error logging  
- Splits the normalized DataFrame into `df_combined` plus `df_cbe`, `df_boa`, and `df_dashen`  
- Prints review-count diagnostics for each bank to confirm correct partitioning  

This ensures you have uniformly cleaned, lemmatized, and spell-corrected text ready for downstream sentiment scoring and theme extraction.  


In [8]:
# ------------------------------------------------------------------------------
# 🔄 Normalize Review Text & Split per Bank (Robust Column Handling)
# ------------------------------------------------------------------------------
from src.nlp.text_normalizer import TextNormalizer

# 1️⃣ Instantiate the normalizer
try:
    normalizer = TextNormalizer(use_symspell=True, use_spacy=True)
    print("🔧 TextNormalizer initialized successfully")
except Exception as e:
    print(f"❌ Failed to initialize TextNormalizer: {e}")
    raise

# 2️⃣ Determine which text column to normalize
if "corrected_review" in df_reviews.columns:
    source_col = "corrected_review"
elif "review" in df_reviews.columns:
    source_col = "review"
else:
    raise KeyError(
        "⚠️ Neither 'corrected_review' nor 'review' column found in df_reviews"
    )

print(f"ℹ️ Normalizing text from column: `{source_col}`")

# 3️⃣ Apply normalization with per-row error handling
normalized_texts = []
for idx, text in enumerate(df_reviews[source_col].astype(str), start=1):
    try:
        normalized = normalizer.normalize(text)
    except Exception as ex:
        print(f"⚠️ Normalization error at row {idx}: {ex}")
        normalized = ""  # fallback
    normalized_texts.append(normalized)
    if idx % 500 == 0:
        print(f"⏱️ Normalized {idx:,} reviews so far")

# 4️⃣ Assign the normalized column
df_reviews["normalized_review"] = normalized_texts
print(f"✅ Completed normalization for {len(df_reviews):,} reviews")

# 5️⃣ Split into per-bank DataFrames
df_combined = df_reviews.copy()  # full dataset
df_cbe = df_reviews[df_reviews["bank"] == "CBE"].reset_index(drop=True)
df_boa = df_reviews[df_reviews["bank"] == "BOA"].reset_index(drop=True)
df_dashen = df_reviews[df_reviews["bank"] == "Dashen"].reset_index(drop=True)

# 6️⃣ Verify split counts
print(
    f"📊 Review counts by bank:\n"
    f"  • CBE:    {len(df_cbe):,}\n"
    f"  • BOA:    {len(df_boa):,}\n"
    f"  • Dashen: {len(df_dashen):,}\n"
    f"  • TOTAL:  {len(df_combined):,}"
)

  import pkg_resources  # To locate symspell dictionaries


🔧 TextNormalizer initialized successfully
ℹ️ Normalizing text from column: `review`
⏱️ Normalized 500 reviews so far
⏱️ Normalized 1,000 reviews so far
✅ Completed normalization for 1,200 reviews
📊 Review counts by bank:
  • CBE:    400
  • BOA:    400
  • Dashen: 400
  • TOTAL:  1,200


"""
sentiment_classifier.py – Ensemble Sentiment Analysis Module (B5W2)
--------------------------------------------------------------------
Combines DistilBERT, VADER, and TextBlob into an equal-weight ensemble for robust review sentiment scoring.
Implements star-rating rules with expanded allowances for 4★–5★, computes uncertainty, and flags significant mismatches.

Responsibilities:
- Load local DistilBERT SST-2 model (PyTorch) via `AutoTokenizer` & `AutoModelForSequenceClassification`
- Compute signed sentiment scores in [-1, +1] from:
    • DistilBERT (P_pos – P_neg)
    • VADER compound
    • TextBlob polarity
- Build an **equal-weight** ensemble score and discrete label (±0.05 thresholds)
- Calculate **uncertainty** as the standard deviation of the three scorers
- Apply **rating-based rules**:
    • ★★★★★/★★★★ → allow any label (positive, neutral, negative)  
    • ★★★        → allow any label  
    • ★★         → neutral or negative  
    • ★          → negative only
- **Flag** reviews where the ensemble label is disallowed by the rating and the ensemble score deviates from the rule label by >0.5
- Expose `run(df, text_col="normalized_review")` to augment a DataFrame with:
    ['bert','vader','textblob','ensemble','label','uncertainty','rule_label','flag']

Author: Nabil Mohamed
"""


In [9]:
# ------------------------------------------------------------------------------
# 📝 Sentiment Ensembling on Combined Data (Local Model Path)
# ------------------------------------------------------------------------------
from src.nlp.sentiment_classifier import SentimentEnsembler

# Path to your local DistilBERT SST-2 model files
MODEL_DIR = "models/distilbert-base-uncased-finetuned-sst-2-english"

# 1️⃣ Instantiate the ensembler and load models from disk
ensembler = SentimentEnsembler(model_path=MODEL_DIR, device="cpu")
ensembler.tokenizer = ensembler.tokenizer.from_pretrained(
    MODEL_DIR, local_files_only=True
)
ensembler.model = ensembler.model.from_pretrained(MODEL_DIR, local_files_only=True).to(
    ensembler.device
)
print(f"🔧 Loaded DistilBERT model from `{MODEL_DIR}`")

# 2️⃣ Run ensemble sentiment on the combined DataFrame
df_enriched = ensembler.run(df_combined, text_col="normalized_review")
print(f"✅ Ensemble sentiment computed for {len(df_enriched):,} reviews")

# 3️⃣ Split enriched DataFrame into per-bank subsets
df_cbe = df_enriched[df_enriched["bank"] == "CBE"].reset_index(drop=True)
df_boa = df_enriched[df_enriched["bank"] == "BOA"].reset_index(drop=True)
df_dashen = df_enriched[df_enriched["bank"] == "Dashen"].reset_index(drop=True)

# 4️⃣ Confirm the splits
print(
    f"📊 Enriched review counts by bank:\n"
    f"  • CBE:    {len(df_cbe):,}\n"
    f"  • BOA:    {len(df_boa):,}\n"
    f"  • Dashen: {len(df_dashen):,}\n"
    f"  • TOTAL:  {len(df_enriched):,}"
)

# 5️⃣ (Optional) Save enriched results
df_enriched.to_csv("data/outputs/reviews_enriched_all.csv", index=False)
# df_cbe.to_csv("data/outputs/reviews_enriched_cbe.csv", index=False)
# df_boa.to_csv("data/outputs/reviews_enriched_boa.csv", index=False)
# df_dashen.to_csv("data/outputs/reviews_enriched_dashen.csv", index=False)

  from .autonotebook import tqdm as notebook_tqdm
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


🔧 Loaded DistilBERT model from `models/distilbert-base-uncased-finetuned-sst-2-english`
✅ Ensemble sentiment computed for 1,200 reviews
📊 Enriched review counts by bank:
  • CBE:    400
  • BOA:    400
  • Dashen: 400
  • TOTAL:  1,200


## 📦 Keyword, Key Phrase & Theme Extraction Module

This module provides three reusable, object-oriented classes to support thematic analysis of normalized reviews:

1. **KeywordExtractor**  
   - Uses TF-IDF (unigrams + bigrams) to surface the top-N keywords across a corpus  
   - Accepts a custom stopword list for domain-specific filtering  

2. **KeyPhraseExtractor**  
   - Leverages spaCy’s noun-chunk parser (full pipeline with dependency parsing)  
   - Extracts and cleans key phrases, removing stopwords and non-alphanumeric noise  
   - Aggregates across documents to return the top-N most frequent noun-phrases  

3. **ThemeExtractor**  
   - Applies rule-based theme tagging using per-bank seed keyword maps  
   - Supports an expanded set of themes (e.g. Concise Feedback, Connectivity Issues, Functionality, Usability, Performance, Security, Notifications, Stability & Bugs, etc.)  
   - Tags each review with one or more themes, defaulting to “Other” when no seeds match  

**Usage:**  
- Instantiate each extractor with your custom stopwords and seed maps  
- Call `extract_keywords(texts)` or `extract_top_phrases(texts)` for global or per-bank analysis  
- Use `ThemeExtractor.tag_corpus(df)` to add a `themes` column to your enriched review DataFrame  

This design ensures modular, testable, and scalable extraction of keywords, key phrases, and actionable themes for your Task 2 pipeline.  


In [10]:
# ------------------------------------------------------------------------------
# 🛠 Reload and Import Keyword, Keyphrase & Theme Extractors
# ------------------------------------------------------------------------------
import importlib

# Import your modules
import src.nlp.stopwords as sw_module
import src.nlp.keyword_theme_extractor as kte_module

# Force-reload so notebook picks up any recent edits
importlib.reload(sw_module)
importlib.reload(kte_module)

# Bring the classes into the notebook namespace
KeywordExtractor = kte_module.KeywordExtractor
KeyPhraseExtractor = kte_module.KeyPhraseExtractor
ThemeExtractor = kte_module.ThemeExtractor

# Verify that everything is in place
print("✅ COMBINED_STOPWORDS length:", len(sw_module.COMBINED_STOPWORDS))
print("✅ Loaded classes:", KeywordExtractor, KeyPhraseExtractor, ThemeExtractor)

✅ COMBINED_STOPWORDS length: 207
✅ Loaded classes: <class 'src.nlp.keyword_theme_extractor.KeywordExtractor'> <class 'src.nlp.keyword_theme_extractor.KeyPhraseExtractor'> <class 'src.nlp.keyword_theme_extractor.ThemeExtractor'>


In [11]:
# ------------------------------------------------------------------------------
# 🗝️ Keyword, Keyphrase & Theme Extraction (Using Enriched Splits)
# ------------------------------------------------------------------------------
from src.nlp.keyword_theme_extractor import (
    KeywordExtractor,
    KeyPhraseExtractor,
    ThemeExtractor,
)
from src.nlp.stopwords import COMBINED_STOPWORDS
from IPython.display import display
import pandas as pd
import spacy  # need full pipeline for noun_chunks

# 1️⃣ Re-split the ENRICHED DataFrame so each has `normalized_review`
df_cbe = df_enriched[df_enriched["bank"] == "CBE"].reset_index(drop=True)
df_boa = df_enriched[df_enriched["bank"] == "BOA"].reset_index(drop=True)
df_dashen = df_enriched[df_enriched["bank"] == "Dashen"].reset_index(drop=True)

# 2️⃣ Prepare the text corpora from those enriched splits
texts_all = df_enriched["normalized_review"].dropna().tolist()
texts_cbe = df_cbe["normalized_review"].dropna().tolist()
texts_boa = df_boa["normalized_review"].dropna().tolist()
texts_dashen = df_dashen["normalized_review"].dropna().tolist()

# 3️⃣ Initialize your extractors using the centralized stopword set
kw_extractor = KeywordExtractor(stopwords=list(COMBINED_STOPWORDS), max_features=30)
phrase_extractor = KeyPhraseExtractor(stopwords=list(COMBINED_STOPWORDS))

# 🔧 Reload the spaCy pipeline with parser enabled for noun_chunks
phrase_extractor.nlp = spacy.load("en_core_web_sm")

# 4️⃣ Define per-bank theme seeds
seed_map = {
    "CBE": {
        "Account Access": ["login", "otp", "password", "pin"],
        "Transaction Perf.": ["slow", "delay", "timeout", "transfer", "fee"],
        "UI/UX": ["interface", "screen", "navigate", "button"],
        "Stability": ["crash", "freeze", "error", "bug"],
        "Support": ["help", "support", "customer"],
    },
    "BOA": {
        # … BOA-specific seeds …
    },
    "Dashen": {
        # … Dashen-specific seeds …
    },
}
theme_extractor = ThemeExtractor(seed_map=seed_map)

# 5️⃣ Extract & display GLOBAL results
global_keywords = kw_extractor.extract_keywords(texts_all)
global_phrases = phrase_extractor.extract_top_phrases(texts_all, top_n=20)
df_all_tagged = theme_extractor.tag_corpus(df_enriched)

print("🔑 Global Top 30 TF-IDF Keywords:")
print(global_keywords, "\n")

print("🧠 Global Top 20 Noun Phrases:")
print(global_phrases, "\n")

print("📊 Global Theme Distribution:")
display(df_all_tagged["themes"].explode().value_counts().to_frame("count"))

# 6️⃣ Extract & display PER-BANK results using enriched splits
for bank, texts, df_bank in [
    ("CBE", texts_cbe, df_cbe),
    ("BOA", texts_boa, df_boa),
    ("Dashen", texts_dashen, df_dashen),
]:
    print(f"\n🏦 {bank} Top 30 TF-IDF Keywords:")
    print(kw_extractor.extract_keywords(texts), "\n")

    print(f"🏦 {bank} Top 20 Noun Phrases:")
    print(phrase_extractor.extract_top_phrases(texts, top_n=20), "\n")

    df_bank_tagged = theme_extractor.tag_corpus(df_bank)
    print(f"🏦 {bank} Theme Distribution:")
    display(df_bank_tagged["themes"].explode().value_counts().to_frame("count"))

🔑 Global Top 30 TF-IDF Keywords:
['amazing', 'bad', 'banking', 'dash', 'dash super', 'developer', 'easy', 'easy use', 'excellent', 'experience', 'fast', 'feature', 'fix', 'good', 'great', 'like', 'money', 'need', 'nice', 'option', 'super', 'thank', 'time', 'transaction', 'transfer', 'update', 'use', 'user', 'work', 'wow'] 

🧠 Global Top 20 Noun Phrases:
['work', 'money', 'boa', 'love', 'transaction', 'developer option', 'life', 'easy use', 'improvement', 'good job', 'time', 'tel birr mesa', 'use', 'crash', 'screenshot', 'waw', 'payment', 'good', 'big problem', 'good easy use'] 

📊 Global Theme Distribution:


Unnamed: 0_level_0,count
themes,Unnamed: 1_level_1
Other,1155
UI/UX,20
Transaction Perf.,15
Support,8
Stability,7



🏦 CBE Top 30 TF-IDF Keywords:
['amazing', 'bad', 'banking', 'dash', 'dash super', 'developer', 'easy', 'easy use', 'excellent', 'experience', 'fast', 'feature', 'fix', 'good', 'great', 'like', 'money', 'need', 'nice', 'option', 'super', 'thank', 'time', 'transaction', 'transfer', 'update', 'use', 'user', 'work', 'wow'] 

🏦 CBE Top 20 Noun Phrases:
['work', 'money', 'easy use', 'love', 'good', 'good job', 'life', 'improvement', 'screenshot feature', 'screenshot', 'birr', 'country', 'kenya nigeria south africa', 'physically old security layer', 'fraud attempt', 'space', 'abib', 'eng ida key fete', 'safety', 'facilitate client'] 

🏦 CBE Theme Distribution:


Unnamed: 0_level_0,count
themes,Unnamed: 1_level_1
Other,355
UI/UX,20
Transaction Perf.,15
Support,8
Stability,7



🏦 BOA Top 30 TF-IDF Keywords:
['amazing', 'bad', 'banking', 'dash', 'dash super', 'developer', 'easy', 'easy use', 'excellent', 'experience', 'fast', 'feature', 'fix', 'good', 'great', 'like', 'money', 'need', 'nice', 'option', 'super', 'thank', 'time', 'transaction', 'transfer', 'update', 'use', 'user', 'work', 'wow'] 

🏦 BOA Top 20 Noun Phrases:
['work', 'boa', 'money', 'developer option', 'love', 'guy', 'long time', 'time', 'great boa', 'transaction', 'download', 'device', 'crash', 'problem', 'boa system', 'long piss fix problem', 'half', 'kind social experiment test patience build sleep', 'different career path', 'open android'] 

🏦 BOA Theme Distribution:


Unnamed: 0_level_0,count
themes,Unnamed: 1_level_1
Other,400



🏦 Dashen Top 30 TF-IDF Keywords:
['amazing', 'bad', 'banking', 'dash', 'dash super', 'developer', 'easy', 'easy use', 'excellent', 'experience', 'fast', 'feature', 'fix', 'good', 'great', 'like', 'money', 'need', 'nice', 'option', 'super', 'thank', 'time', 'transaction', 'transfer', 'update', 'use', 'user', 'work', 'wow'] 

🏦 Dashen Top 20 Noun Phrases:
['transaction', 'payment', 'money', 'life', 'love', 'waw', 'banking', 'customer', 'ethiopia innovation', 'seamless shopping experience', 'expectation marketplace', 'new update commerce section', 'balance transfer money', 'spending', 'fast easy use', 'tel birr mesa', 'work', 'gad', 'real life changer', 'simple robust feature'] 

🏦 Dashen Theme Distribution:


Unnamed: 0_level_0,count
themes,Unnamed: 1_level_1
Other,400
