# Task 2 — Sentiment and Thematic Analysis (Fintech App Reviews)

This notebook satisfies the Task‑2 requirements:

## Sentiment
- Model: `distilbert-base-uncased-finetuned-sst-2-english`
- Output: `p_pos`, `p_neg`, signed `sentiment_score = p_pos - p_neg`
- Labeling with **neutral margin**: if `|p_pos - p_neg| < margin` → `NEUTRAL`
- Aggregations: by **bank**, and by **bank × rating**

## Themes
- Keyword extraction with **TF‑IDF** (1–2 grams)
- Manual/rule‑based clustering into **3–5 themes per bank**
- Example reviews per theme

## Outputs
Written to `data/processed/task2/`:
- `reviews_task2_scored.csv`
- `task2_sentiment_by_bank.csv`
- `task2_sentiment_by_bank_rating.csv`
- `task2_keywords_tfidf_by_bank.csv`
- `task2_theme_examples.csv`

---

## Important (your setup)
This notebook expects **`PYTHONPATH=src`** and therefore imports project modules as:

- `from bank_reviews... import ...`

If any import fails, the notebook falls back to self‑contained implementations so you can still deliver results.


In [None]:
from __future__ import annotations

import sys
from pathlib import Path

import pandas as pd

# --- Resolve project root (repo root) ---
# This notebook lives in notebooks/, so the project root is assumed to be the parent of notebooks/
PROJECT_ROOT = Path.cwd()
# If running from within notebooks/ directory, go one level up
if PROJECT_ROOT.name == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent

print("PROJECT_ROOT:", PROJECT_ROOT.resolve())

# --- Ensure `src/` is on sys.path so `import bank_reviews` works ---
SRC_DIR = PROJECT_ROOT / "src"
if SRC_DIR.exists() and str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))


import re
import hashlib
from dataclasses import dataclass

import numpy as np


pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 140)

# Heavy deps used only for sentiment; OK if they are installed via requirements-nlp.txt
HAS_TRANSFORMERS = True
try:
    from transformers import pipeline
except Exception as e:
    HAS_TRANSFORMERS = False
    print('Transformers not importable. Install torch + transformers for distilBERT sentiment.')
    print('Error:', e)

from sklearn.feature_extraction.text import TfidfVectorizer

## 0) Prefer project modules (`bank_reviews.*`) when available

Because you run with `PYTHONPATH=src`, these imports should work.

If they don’t (e.g., wrong working directory), the notebook uses fallbacks.


In [None]:
USE_PROJECT_MODULES = True

try:
    # utilities
    from bank_reviews.utils.text import make_review_id as make_review_id_project
    from bank_reviews.utils.text import normalize_text as normalize_text_project

    # sentiment
    from bank_reviews.nlp.sentiment import SentimentConfig as SentimentConfigProject
    from bank_reviews.nlp.sentiment import add_sentiment_columns as add_sentiment_columns_project
    from bank_reviews.nlp.sentiment import label_from_probs as label_from_probs_project

    # themes
    from bank_reviews.nlp.themes import ThemeConfig as ThemeConfigProject
    from bank_reviews.nlp.themes import add_theme_columns as add_theme_columns_project

    # keywords
    from bank_reviews.nlp.keywords import top_keywords_by_bank as top_keywords_by_bank_project

    # metrics
    from bank_reviews.analysis.metrics import sentiment_aggregates_by_bank as sent_agg_by_bank_project
    from bank_reviews.analysis.metrics import sentiment_aggregates_by_bank_rating as sent_agg_by_bank_rating_project

    # examples (shares this try block, so a failed import here switches everything to fallbacks)
    from bank_reviews.analysis.scenarios import sample_theme_examples as sample_theme_examples_project

    print('Using project modules: bank_reviews.*')

except Exception as e:
    USE_PROJECT_MODULES = False
    print('Could not import project modules; using fallback implementations.')
    print('Error:', e)


## 1) Paths & load Task‑1 cleaned data

Expected input: `data/processed/reviews_task1_clean.csv`

Columns: `review, rating, date, bank, source`


In [None]:
PROJECT_ROOT = Path.cwd()
if PROJECT_ROOT.name == "notebooks":
    PROJECT_ROOT = PROJECT_ROOT.parent

IN_CSV = PROJECT_ROOT / "data" / "processed" / "reviews_task1_clean.csv"
OUT_DIR = PROJECT_ROOT / "data" / "processed" / "task2"
OUT_DIR.mkdir(parents=True, exist_ok=True)


print('Project root:', PROJECT_ROOT)
print('Input CSV:', IN_CSV, 'exists=', IN_CSV.exists())
print('Output dir:', OUT_DIR)

df = pd.read_csv(IN_CSV)
df.head()

In [None]:
required_cols = {'review','rating','date','bank','source'}
missing = required_cols - set(df.columns)
if missing:
    raise ValueError(f'Missing required columns: {missing}. Found: {list(df.columns)}')

df = df.copy()
df['review'] = df['review'].astype(str)
df['bank'] = df['bank'].astype(str)
df['source'] = df['source'].astype(str)
df['date'] = df['date'].astype(str)
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

print('Rows:', len(df))
df[['bank','source','rating']].value_counts().head(10)

## 2) Create deterministic `review_id`

We use your project’s `make_review_id()` if available; otherwise we fall back to a SHA‑1 hash over:

`bank||source||date||rating||review`
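
As a minimal sketch of why hashing the joined fields gives a deterministic ID: the helper name and sample values below are invented for illustration, and the project's own `make_review_id()` may use a different scheme.

```python
import hashlib


def review_id(bank: str, source: str, date: str, rating: str, review: str) -> str:
    """SHA-1 hex digest over the '||'-joined fields (illustrative helper)."""
    raw = f"{bank}||{source}||{date}||{rating}||{review}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


# Hypothetical sample values, not taken from the dataset.
a = review_id("BankA", "google_play", "2024-01-01", "5.0", "Great app")
b = review_id("BankA", "google_play", "2024-01-01", "5.0", "Great app")
c = review_id("BankA", "google_play", "2024-01-01", "5", "Great app")

print(a == b)   # identical fields give the same ID
print(a == c)   # "5.0" vs "5" changes the ID
print(len(a))   # SHA-1 hex digests are 40 characters long
```

Two consequences follow from this design: exact duplicate rows share one ID, and the string form of `rating` (`5` vs `5.0`) depends on the column dtype, so IDs stay stable across runs only if that dtype stays the same.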


In [None]:
def make_review_id_fallback(*, review: str, bank: str, source: str, date: str, rating: float | int | str) -> str:
    raw = f"{bank}||{source}||{date}||{rating}||{review}"
    return hashlib.sha1(raw.encode('utf-8')).hexdigest()

def normalize_text_fallback(s: str) -> str:
    s = str(s).strip()
    s = re.sub(r"\s+", " ", s)
    return s

make_review_id = make_review_id_project if USE_PROJECT_MODULES else make_review_id_fallback
normalize_text = normalize_text_project if USE_PROJECT_MODULES else normalize_text_fallback

df['review_id'] = [
    make_review_id(review=r, bank=b, source=s, date=d, rating=rt)
    for r,b,s,d,rt in zip(df['review'], df['bank'], df['source'], df['date'], df['rating'])
]

df[['review_id','bank','rating','review']].head()

## 3) Sentiment analysis (distilBERT)

Preferred path:
- Use your project’s `add_sentiment_columns(df, SentimentConfig())`

Fallback path:
- Run `transformers.pipeline(..., top_k=None)` and compute neutral labels via margin.
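
The margin rule can be sketched on hand-written scores, with no model required. The two score lists below are made up and only mimic the per-label output shape of a text-classification pipeline called with `top_k=None`; the helper name is hypothetical. Assuming the two scores sum to 1, as they do for a softmax over two labels, `sentiment_score = 2 * p_pos - 1`, so a margin of 0.15 labels a review `NEUTRAL` when `p_pos` lies between 0.425 and 0.575.

```python
from __future__ import annotations

NEUTRAL_MARGIN = 0.15  # same value as the fallback cell in this notebook


def label_from_scores(scores: list[dict], margin: float = NEUTRAL_MARGIN) -> tuple[float, str]:
    """Turn one review's per-label scores into (sentiment_score, label)."""
    by_label = {d["label"].upper(): float(d["score"]) for d in scores}
    p_pos = by_label.get("POSITIVE", 0.0)
    p_neg = by_label.get("NEGATIVE", 0.0)
    score = p_pos - p_neg
    if abs(score) < margin:
        return score, "NEUTRAL"
    return score, "POSITIVE" if score > 0 else "NEGATIVE"


# Made-up scores for illustration only.
confident = [{"label": "POSITIVE", "score": 0.98}, {"label": "NEGATIVE", "score": 0.02}]
borderline = [{"label": "NEGATIVE", "score": 0.55}, {"label": "POSITIVE", "score": 0.45}]

print(label_from_scores(confident))
print(label_from_scores(borderline))
```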


In [None]:
# Fallback implementation (only used if project module not available)
NEUTRAL_MARGIN = 0.15
BATCH_SIZE = 32
MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'

def label_from_probs_fallback(p_pos: float, p_neg: float, neutral_margin: float = NEUTRAL_MARGIN) -> str:
    if abs(p_pos - p_neg) < neutral_margin:
        return 'NEUTRAL'
    return 'POSITIVE' if p_pos > p_neg else 'NEGATIVE'

def add_sentiment_columns_fallback(df_in: pd.DataFrame) -> pd.DataFrame:
    if not HAS_TRANSFORMERS:
        raise RuntimeError('Transformers not available. Install torch + transformers.')

    clf = pipeline(
        'sentiment-analysis',
        model=MODEL_NAME,
        device=-1,
        top_k=None,  # replaces deprecated return_all_scores=True
    )

    texts = df_in['review'].astype(str).map(normalize_text).tolist()
    p_pos_list, p_neg_list, score_list, label_list = [], [], [], []

    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i:i+BATCH_SIZE]
        # Truncate to the model's maximum input length so very long reviews do not raise.
        results = clf(batch, truncation=True)
        for r in results:
            score_map = {d['label'].upper(): float(d['score']) for d in r}
            p_pos = score_map.get('POSITIVE', 0.0)
            p_neg = score_map.get('NEGATIVE', 0.0)
            s = p_pos - p_neg
            lab = label_from_probs_fallback(p_pos, p_neg, NEUTRAL_MARGIN)

            p_pos_list.append(p_pos)
            p_neg_list.append(p_neg)
            score_list.append(s)
            label_list.append(lab)

    out = df_in.copy()
    out['p_pos'] = p_pos_list
    out['p_neg'] = p_neg_list
    out['sentiment_score'] = score_list
    out['sentiment_label'] = label_list
    return out

if USE_PROJECT_MODULES:
    # Uses your configured model + neutral margin in SentimentConfig
    df = add_sentiment_columns_project(df, SentimentConfigProject())
    label_from_probs = label_from_probs_project
else:
    df = add_sentiment_columns_fallback(df)
    label_from_probs = label_from_probs_fallback

df[['sentiment_label','sentiment_score','p_pos','p_neg']].head()

In [None]:
# KPI: sentiment coverage (target 90%+)
coverage = df['sentiment_label'].notna().mean()
print('Sentiment coverage:', round(coverage * 100, 2), '%')
df['sentiment_label'].value_counts(dropna=False)

## 4) Aggregate sentiment by bank and rating

Preferred path: use your `bank_reviews.analysis.metrics` functions.

Fallback path: compute groupby aggregates in-notebook.


In [None]:
def sent_agg_by_bank_fallback(df_in: pd.DataFrame) -> pd.DataFrame:
    return (
        df_in.groupby('bank', as_index=False)
        .agg(
            n_reviews=('review', 'size'),
            mean_sentiment_score=('sentiment_score', 'mean'),
            pos_rate=('sentiment_label', lambda s: (
                s == 'POSITIVE').mean()),
            neg_rate=('sentiment_label', lambda s: (
                s == 'NEGATIVE').mean()),
            neutral_rate=('sentiment_label', lambda s: (
                s == 'NEUTRAL').mean()),
        )
    )


def sent_agg_by_bank_rating_fallback(df_in: pd.DataFrame) -> pd.DataFrame:
    return (
        df_in.groupby(['bank', 'rating'], as_index=False)
        .agg(
            n_reviews=('review', 'size'),
            mean_sentiment_score=('sentiment_score', 'mean'),
            pos_rate=('sentiment_label', lambda s: (
                s == 'POSITIVE').mean()),
            neg_rate=('sentiment_label', lambda s: (
                s == 'NEGATIVE').mean()),
            neutral_rate=('sentiment_label', lambda s: (
                s == 'NEUTRAL').mean()),
        )
    )


if USE_PROJECT_MODULES:
    sent_by_bank = sent_agg_by_bank_project(df)
    sent_by_bank_rating = sent_agg_by_bank_rating_project(df)
else:
    sent_by_bank = sent_agg_by_bank_fallback(df)
    sent_by_bank_rating = sent_agg_by_bank_rating_fallback(df)

display(sent_by_bank.sort_values('mean_sentiment_score', ascending=False))
display(sent_by_bank_rating.sort_values(['bank', 'rating']))

## 5) Keyword extraction (TF‑IDF 1–2 grams)

Preferred path: `bank_reviews.nlp.keywords.top_keywords_by_bank`.

Fallback path: compute TF‑IDF mean scores per bank.


In [None]:
def top_keywords_by_bank_fallback(
    df_in: pd.DataFrame,
    top_k: int = 30,
    ngram_range=(1,2),
    min_df: int = 2,
    max_df: float = 0.95,
) -> pd.DataFrame:
    rows = []
    for bank, dfg in df_in.groupby('bank'):
        texts = dfg['review'].astype(str).map(normalize_text).tolist()
        if len(texts) < 3:
            continue
        vec = TfidfVectorizer(
            stop_words='english',
            ngram_range=ngram_range,
            min_df=min_df,
            max_df=max_df,
        )
        X = vec.fit_transform(texts)
        terms = np.array(vec.get_feature_names_out())
        scores = np.asarray(X.mean(axis=0)).ravel()
        idx = np.argsort(scores)[::-1][:top_k]
        for rank, j in enumerate(idx, start=1):
            rows.append({
                'bank': bank,
                'rank': rank,
                'term': terms[j],
                'tfidf_mean': float(scores[j]),
                'n_docs': len(texts),
            })
    return pd.DataFrame(rows)

if USE_PROJECT_MODULES:
    kw_by_bank = top_keywords_by_bank_project(df, top_k=30)
else:
    kw_by_bank = top_keywords_by_bank_fallback(df, top_k=30)

kw_by_bank.head(25)

## 6) Theme assignment (3–5 themes)

Preferred path: use your `bank_reviews.nlp.themes.add_theme_columns`.

Fallback path: rule-based lexicon scoring (weights: multi-word phrases=2, single words=1) and assign:
- `theme_primary`
- `themes` (up to 2)
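
Plain substring tests can over-match short keywords: `pin` occurs inside `shopping`, and `fee` inside `feels`. The sketch below shows whole-word phrase matching with `re` and the 2/1 weighting. It is an illustration under that assumption, not the logic of the project's `add_theme_columns`, and the helper names are hypothetical.

```python
from __future__ import annotations

import re


def phrase_hits(text: str, phrases: list[str]) -> list[str]:
    """Return the phrases that occur in `text` as whole words or word sequences."""
    t = text.lower()
    return [p for p in phrases if re.search(rf"\b{re.escape(p.lower())}\b", t)]


def score(text: str, phrases: list[str]) -> float:
    """Multi-word phrases count 2, single words count 1."""
    return sum(2.0 if " " in p else 1.0 for p in phrase_hits(text, phrases))


# Small subset of the fallback lexicon, enough to show the effect.
phrases = ["pin", "log in", "fee"]

print(phrase_hits("I forgot my PIN and cannot log in", phrases))
print(phrase_hits("Great for shopping, feels fast", phrases))
print(score("I forgot my PIN and cannot log in", phrases))
```

Whole-word matching trades recall for precision: inflected forms such as `crashing` no longer match `crash`, so the lexicon has to list the variants that matter.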


In [None]:
THEME_LEXICON_FALLBACK = {
    'ACCOUNT_ACCESS': [
        'login', 'log in', 'sign in', 'signin', 'password', 'pin', 'otp', 'verification', 'biometric', 'fingerprint'
    ],
    'TXN_PERFORMANCE': [
        'transfer', 'send money', 'transaction', 'pending', 'failed', 'failure', 'slow', 'delay', 'reversal', 'charged', 'fee', 'debit', 'credit'
    ],
    'STABILITY_BUGS': [
        'crash', 'crashes', 'bug', 'bugs', 'freeze', 'frozen', 'hang', 'stuck', 'error', 'not working', 'keeps stopping'
    ],
    'UX_UI': [
        'ui', 'ux', 'interface', 'design', 'layout', 'navigation', 'update', 'upgrade', 'easy to use', 'user friendly'
    ],
    'SUPPORT_SERVICE': [
        'support', 'customer service', 'call center', 'agent', 'help', 'response', 'respond', 'no one', 'branch'
    ],
}

def theme_scores_fallback(text: str, lexicon: dict[str, list[str]]) -> dict[str, float]:
    t = normalize_text(text).lower()
    scores = {k: 0.0 for k in lexicon}
    for theme, phrases in lexicon.items():
        for p in phrases:
            p2 = p.strip().lower()
            if not p2:
                continue
            # Multi-word phrases are more specific, so they count double.
            w = 2.0 if ' ' in p2 else 1.0
            # Match whole words only, so 'pin' does not fire on 'shopping' or 'ui' on 'quick'.
            if re.search(rf"\b{re.escape(p2)}\b", t):
                scores[theme] += w
    return scores

def assign_themes_fallback(text: str, lexicon: dict[str, list[str]], top_n: int = 2, min_score: float = 1.0) -> tuple[str | None, str]:
    sc = theme_scores_fallback(text, lexicon)
    items = sorted(sc.items(), key=lambda kv: kv[1], reverse=True)
    items = [(k,v) for k,v in items if v >= min_score]
    if not items:
        return None, ''
    primary = items[0][0]
    top = [k for k,_ in items[:top_n]]
    return primary, '|'.join(top)

if USE_PROJECT_MODULES:
    df = add_theme_columns_project(df, ThemeConfigProject())
else:
    primary, themes = [], []
    for txt in df['review'].astype(str):
        p, t = assign_themes_fallback(txt, THEME_LEXICON_FALLBACK, top_n=2, min_score=1.0)
        primary.append(p)
        themes.append(t)
    df['theme_primary'] = primary
    df['themes'] = themes

df[['bank','rating','theme_primary','themes','review']].head(10)

In [None]:
# KPI: 3+ themes per bank
theme_counts = (
    df.dropna(subset=['theme_primary'])
      .groupby('bank')['theme_primary']
      .nunique()
      .rename('n_unique_themes')
      .reset_index()
)
theme_counts

## 7) Examples per theme per bank

Preferred: `bank_reviews.analysis.scenarios.sample_theme_examples`.

Fallback: pick the 3 longest reviews per `(bank, theme_primary)`.


In [None]:
def sample_theme_examples_fallback(df_in: pd.DataFrame, n_per_theme: int = 3) -> pd.DataFrame:
    d = df_in.dropna(subset=['theme_primary']).copy()
    d['review_len'] = d['review'].astype(str).str.len()
    d = d.sort_values(['bank','theme_primary','review_len'], ascending=[True, True, False])
    out = (
        d.groupby(['bank','theme_primary'], as_index=False)
         .head(n_per_theme)
         .loc[:, ['bank','theme_primary','review_id','rating','sentiment_label','sentiment_score','review']]
    )
    return out

if USE_PROJECT_MODULES:
    examples = sample_theme_examples_project(df, n_per_theme=3)
else:
    examples = sample_theme_examples_fallback(df, n_per_theme=3)

examples.head(20)

## 8) Save outputs to CSV

Minimum essentials:
- sentiment scores for **≥ 400 reviews**
- **≥ 2 themes per bank** via keywords/themes

KPIs:
- sentiment scored for **90%+** reviews
- **3+ themes per bank** with examples
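
The thresholds above can be turned into explicit pass/fail checks. This is a minimal sketch: `check_kpis` is a hypothetical helper, and the counts passed to it are invented for illustration rather than taken from a real run.

```python
from __future__ import annotations


def check_kpis(n_scored: int, n_total: int, themes_per_bank: dict[str, int]) -> dict[str, bool]:
    """Compare run statistics against the thresholds listed above."""
    coverage = n_scored / n_total if n_total else 0.0
    # The per-bank requirements hold only if the weakest bank meets them.
    fewest_themes = min(themes_per_bank.values(), default=0)
    return {
        "min_400_scored": n_scored >= 400,
        "min_2_themes_per_bank": fewest_themes >= 2,
        "kpi_90pct_coverage": coverage >= 0.90,
        "kpi_3_themes_per_bank": fewest_themes >= 3,
    }


# Hypothetical numbers for illustration only.
result = check_kpis(n_scored=1180, n_total=1200, themes_per_bank={"BankA": 4, "BankB": 2})
for name, ok in result.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
```

In the notebook, the inputs would come from `df['sentiment_score'].notna().sum()`, `len(df)`, and the per-bank `nunique()` of `theme_primary`.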


In [None]:
out_reviews = OUT_DIR / 'reviews_task2_scored.csv'
out_bank = OUT_DIR / 'task2_sentiment_by_bank.csv'
out_bank_rating = OUT_DIR / 'task2_sentiment_by_bank_rating.csv'
out_keywords = OUT_DIR / 'task2_keywords_tfidf_by_bank.csv'
out_examples = OUT_DIR / 'task2_theme_examples.csv'

# KPI checks
sent_cov = df['sentiment_score'].notna().mean() if 'sentiment_score' in df.columns else 0.0
n_scored = int(df['sentiment_score'].notna().sum()) if 'sentiment_score' in df.columns else 0
print('Sentiment coverage:', round(sent_cov*100, 2), '%')
print('N sentiment-scored reviews:', n_scored)

theme_per_bank = (
    df.dropna(subset=['theme_primary'])
      .groupby('bank')['theme_primary']
      .nunique()
)
print('Unique themes per bank:')
print(theme_per_bank)

# Write files
df.to_csv(out_reviews, index=False)
sent_by_bank.to_csv(out_bank, index=False)
sent_by_bank_rating.to_csv(out_bank_rating, index=False)
kw_by_bank.to_csv(out_keywords, index=False)
examples.to_csv(out_examples, index=False)

print('Wrote:')
for p in [out_reviews, out_bank, out_bank_rating, out_keywords, out_examples]:
    print(' -', p, 'exists=', p.exists())