# Task 2: Sentiment and Thematic Analysis

## Objective
Quantify review sentiment and identify themes to uncover satisfaction drivers and pain points.

**Components:**
1. **Sentiment Analysis** - Classify reviews as Positive/Negative/Neutral
   - VADER (rule-based, fast)
   - DistilBERT (transformer-based, accurate)
2. **Thematic Analysis** - Extract themes from reviews
   - TF-IDF keyword extraction
   - Theme mapping to business categories

## 1. Setup and Imports

In [None]:
# Add src directory to path
import sys
import os
sys.path.insert(0, os.path.abspath('../src'))

# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Project imports
from config import DATA_PATHS, THEME_KEYWORDS, BANK_NAMES
from sentiment_analyzer import SentimentAnalyzer
from theme_analyzer import ThemeAnalyzer

# Display settings
pd.set_option('display.max_colwidth', 100)
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')

# Create output directory
os.makedirs('../data/processed', exist_ok=True)

print("Setup complete!")

## 2. Load Processed Reviews

In [None]:
# Load the preprocessed reviews from Task 1
df = pd.read_csv('../data/processed/reviews_processed.csv')

print(f"Loaded {len(df)} reviews")
print(f"\nReviews per bank:")
print(df['bank_name'].value_counts())
print(f"\nColumns: {list(df.columns)}")

In [None]:
# Preview the data
df.head()

## 3. Sentiment Analysis

We'll apply two methods and compare their results:
1. **VADER** - Fast, rule-based, tuned for short informal text such as social media posts
2. **DistilBERT** - Transformer-based, uses sentence context; slower, but typically more accurate on nuanced or mixed reviews

### 3.1 VADER Sentiment Analysis

In [None]:
# Initialize VADER analyzer
vader_analyzer = SentimentAnalyzer(method='vader')

# Analyze all reviews
df_vader = vader_analyzer.analyze_dataframe(df)

In [None]:
# VADER results preview
df_vader[['review_text', 'rating', 'sentiment_label_vader', 'sentiment_score_vader']].head(10)
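
As a rough illustration of how a VADER score becomes a label: VADER produces a `compound` score in [-1, 1], and its authors suggest cutoffs of ±0.05 for positive and negative text. The sketch below applies those cutoffs. Whether `SentimentAnalyzer` uses the same thresholds is an assumption, and `label_from_compound` is a hypothetical helper, not part of the project code.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The 0.05 default follows the cutoff suggested by VADER's authors;
    the project's SentimentAnalyzer may use different values (assumption).
    """
    if compound >= threshold:
        return "POSITIVE"
    if compound <= -threshold:
        return "NEGATIVE"
    return "NEUTRAL"


# Hypothetical compound scores, for illustration only
for score in (0.62, 0.0, -0.41):
    print(score, label_from_compound(score))
```

Widening the threshold moves more borderline reviews into NEUTRAL, which is worth keeping in mind when comparing VADER's label counts with DistilBERT's.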

### 3.2 DistilBERT Sentiment Analysis

**Note:** This will take longer than VADER (a few minutes for 1000+ reviews). The model will be downloaded on first run (~250MB).

In [None]:
# Initialize DistilBERT analyzer
distilbert_analyzer = SentimentAnalyzer(method='distilbert')

# Analyze all reviews (this takes longer)
df_sentiment = distilbert_analyzer.analyze_dataframe(df_vader)

In [None]:
# DistilBERT results preview
df_sentiment[['review_text', 'rating', 'sentiment_label_vader', 'sentiment_label_distilbert']].head(10)
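
A Hugging Face sentiment pipeline typically returns one dictionary per text with a `label` and a confidence `score`. The sketch below shows one way such a prediction could be turned into a signed score, plus an optional NEUTRAL band for low-confidence predictions. Both helpers and the `min_confidence` value are hypothetical; how `SentimentAnalyzer` post-processes the model output is not shown in this notebook.

```python
def signed_score(prediction: dict) -> float:
    """Convert {'label': ..., 'score': ...} into a score in [-1, 1].

    Positive predictions keep their confidence; negative ones are negated.
    """
    confidence = float(prediction["score"])
    if prediction["label"].upper() == "POSITIVE":
        return confidence
    return -confidence


def label_with_neutral_band(prediction: dict, min_confidence: float = 0.6) -> str:
    """Return NEUTRAL when the model is not confident enough (assumed cutoff)."""
    if float(prediction["score"]) < min_confidence:
        return "NEUTRAL"
    return prediction["label"].upper()


# Hypothetical predictions in the pipeline's output shape
examples = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.55},
]
for pred in examples:
    print(signed_score(pred), label_with_neutral_band(pred))
```

Without a band like this, a binary model only ever emits POSITIVE or NEGATIVE, which is why the plotting cells later check whether a NEUTRAL column exists.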

### 3.3 Compare VADER vs DistilBERT

In [None]:
# Agreement rate between the two methods
agreement = (df_sentiment['sentiment_label_vader'] == df_sentiment['sentiment_label_distilbert']).mean()
print(f"Agreement rate: {agreement * 100:.1f}%")

# Cross-tabulation of the two label sets (rows: VADER, columns: DistilBERT)
print("\nCross-tabulation (VADER vs DistilBERT):")
pd.crosstab(df_sentiment['sentiment_label_vader'], 
            df_sentiment['sentiment_label_distilbert'], 
            margins=True)
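
The raw agreement rate does not account for agreement that would happen by chance, for example when both methods label most reviews POSITIVE. Cohen's kappa is a common chance-corrected alternative. The sketch below is a minimal pure-Python version that works on two plain lists of labels (for instance `df_sentiment['sentiment_label_vader'].tolist()`); `cohen_kappa` is a hypothetical helper, not part of the project code.

```python
import math
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(labels_a)
    # Observed agreement: share of positions where both labels match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if the two labelings were independent
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if math.isclose(expected, 1.0):
        return float("nan")  # undefined when both sides use a single label
    return (observed - expected) / (1 - expected)


# Hypothetical labels, for illustration only
vader = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
bert = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
print(cohen_kappa(vader, bert))
```

A kappa near 1 indicates strong agreement, while a value near 0 means the methods agree about as often as chance would predict.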

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# VADER distribution
ax1 = axes[0]
vader_counts = df_sentiment['sentiment_label_vader'].value_counts()
colors = {'POSITIVE': '#2ecc71', 'NEGATIVE': '#e74c3c', 'NEUTRAL': '#95a5a6'}
ax1.bar(vader_counts.index, vader_counts.values, 
        color=[colors.get(x, '#333') for x in vader_counts.index], edgecolor='black')
ax1.set_title('VADER Sentiment Distribution', fontsize=14, fontweight='bold')
ax1.set_xlabel('Sentiment')
ax1.set_ylabel('Count')
for i, (label, count) in enumerate(vader_counts.items()):
    ax1.text(i, count + 10, str(count), ha='center', fontweight='bold')

# DistilBERT distribution
ax2 = axes[1]
distilbert_counts = df_sentiment['sentiment_label_distilbert'].value_counts()
ax2.bar(distilbert_counts.index, distilbert_counts.values,
        color=[colors.get(x, '#333') for x in distilbert_counts.index], edgecolor='black')
ax2.set_title('DistilBERT Sentiment Distribution', fontsize=14, fontweight='bold')
ax2.set_xlabel('Sentiment')
ax2.set_ylabel('Count')
for i, (label, count) in enumerate(distilbert_counts.items()):
    ax2.text(i, count + 10, str(count), ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('../data/processed/sentiment_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

### 3.4 Sentiment by Bank

In [None]:
# Sentiment distribution by bank (using DistilBERT as primary)
fig, ax = plt.subplots(figsize=(12, 6))

sentiment_by_bank = df_sentiment.groupby(['bank_name', 'sentiment_label_distilbert']).size().unstack(fill_value=0)

# Reorder columns and pick colors by label, so each sentiment keeps its color
# even when NEUTRAL is absent
sentiment_colors = {'POSITIVE': '#2ecc71', 'NEUTRAL': '#95a5a6', 'NEGATIVE': '#e74c3c'}
col_order = [c for c in ['POSITIVE', 'NEUTRAL', 'NEGATIVE'] if c in sentiment_by_bank.columns]
sentiment_by_bank = sentiment_by_bank[col_order]

sentiment_by_bank.plot(kind='bar', ax=ax, color=[sentiment_colors[c] for c in col_order],
                       edgecolor='black', width=0.8)

ax.set_title('Sentiment Distribution by Bank (DistilBERT)', fontsize=14, fontweight='bold')
ax.set_xlabel('Bank', fontsize=12)
ax.set_ylabel('Number of Reviews', fontsize=12)
ax.legend(title='Sentiment', bbox_to_anchor=(1.02, 1), loc='upper left')
ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('../data/processed/sentiment_by_bank.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Sentiment percentage by bank
print("Sentiment Percentage by Bank (DistilBERT):")
print("=" * 60)

for bank in df_sentiment['bank_name'].unique():
    bank_df = df_sentiment[df_sentiment['bank_name'] == bank]
    total = len(bank_df)
    pos = len(bank_df[bank_df['sentiment_label_distilbert'] == 'POSITIVE'])
    neg = len(bank_df[bank_df['sentiment_label_distilbert'] == 'NEGATIVE'])
    
    print(f"\n{bank}:")
    print(f"  Positive: {pos} ({pos/total*100:.1f}%)")
    print(f"  Negative: {neg} ({neg/total*100:.1f}%)")
    print(f"  Satisfaction Score: {pos/total*100:.1f}%")

### 3.5 Sentiment vs Rating Correlation

In [None]:
# How well does sentiment match the star rating?
fig, ax = plt.subplots(figsize=(10, 6))

# Group by rating and sentiment
rating_sentiment = df_sentiment.groupby(['rating', 'sentiment_label_distilbert']).size().unstack(fill_value=0)

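# Normalize to percentages
rating_sentiment_pct = rating_sentiment.div(rating_sentiment.sum(axis=1), axis=0) * 100

# Order columns explicitly: unstack() sorts them alphabetically (NEGATIVE first),
# which would otherwise pair NEGATIVE with green
sentiment_colors = {'POSITIVE': '#2ecc71', 'NEUTRAL': '#95a5a6', 'NEGATIVE': '#e74c3c'}
col_order = [c for c in ['POSITIVE', 'NEUTRAL', 'NEGATIVE'] if c in rating_sentiment_pct.columns]
rating_sentiment_pct = rating_sentiment_pct[col_order]

rating_sentiment_pct.plot(kind='bar', stacked=True, ax=ax,
                          color=[sentiment_colors[c] for c in col_order],
                          edgecolor='black')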

ax.set_title('Sentiment Distribution by Star Rating', fontsize=14, fontweight='bold')
ax.set_xlabel('Star Rating', fontsize=12)
ax.set_ylabel('Percentage', fontsize=12)
ax.legend(title='Sentiment', bbox_to_anchor=(1.02, 1), loc='upper left')
ax.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.savefig('../data/processed/sentiment_vs_rating.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nObservation: We expect 1-2 star reviews to be mostly NEGATIVE,")
print("and 4-5 star reviews to be mostly POSITIVE.")
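
The expectation stated above can also be checked numerically. The sketch below maps star ratings to the sentiment we would expect and reports the share of reviews whose predicted label disagrees. The helpers are hypothetical, and treating 3-star reviews as having no clear expectation is an assumption.

```python
def expected_label(rating: int):
    """Expected sentiment for a star rating; None for 3 stars (assumption)."""
    if rating <= 2:
        return "NEGATIVE"
    if rating >= 4:
        return "POSITIVE"
    return None


def mismatch_rate(ratings, labels) -> float:
    """Share of 1-2 and 4-5 star reviews whose label contradicts the rating."""
    comparable = [
        (expected_label(rating), label)
        for rating, label in zip(ratings, labels)
        if expected_label(rating) is not None
    ]
    if not comparable:
        return 0.0
    return sum(expected != label for expected, label in comparable) / len(comparable)


# Hypothetical ratings and predicted labels, for illustration only
ratings = [1, 5, 3, 4, 2]
labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
print(f"Mismatch rate: {mismatch_rate(ratings, labels):.0%}")
```

Reviews that land in the mismatch group are good candidates for manual inspection, since they often contain sarcasm, mixed feedback, or a rating entered by mistake.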

## 4. Thematic Analysis

Now we'll identify WHAT users are talking about using:
- TF-IDF keyword extraction
- Mapping to predefined business themes
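
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the textbook formula: term frequency within a review multiplied by the log of inverse document frequency. It is a simplified variant for illustration; library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization, and the method used inside `ThemeAnalyzer` is not shown here. `tfidf_top_terms` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_top_terms(documents, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([term for term, _ in ranked[:top_n]])
    return results


# Hypothetical reviews, for illustration only
reviews = [
    "app crashes during transfer",
    "transfer is fast",
    "app login fails",
]
print(tfidf_top_terms(reviews, top_n=2))
```

Terms that appear in many reviews (here `app` and `transfer`) receive a lower weight than terms that are specific to one review, which is what makes the top-ranked terms useful as theme keywords.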

In [None]:
# Show predefined themes
print("Predefined Themes and Keywords:")
print("=" * 60)
for theme, keywords in THEME_KEYWORDS.items():
    print(f"\n{theme}:")
    print(f"  Keywords: {', '.join(keywords[:5])}...")

In [None]:
# Initialize theme analyzer
theme_analyzer = ThemeAnalyzer()

# Analyze themes
df_final = theme_analyzer.analyze_dataframe(df_sentiment)

In [None]:
# Preview theme results
df_final[['review_text', 'primary_theme', 'themes', 'matched_keywords']].head(10)
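
The mapping from keywords to themes can be sketched as a simple lookup: a review belongs to every theme for which at least one keyword appears in its text, and the theme with the most keyword hits becomes the primary theme. This is an assumed rule for illustration; the actual logic lives in `ThemeAnalyzer`, and the theme names and keywords below are hypothetical rather than the contents of `THEME_KEYWORDS`.

```python
def match_themes(text, theme_keywords):
    """Return {theme: [matched keywords]} for themes with at least one hit."""
    lowered = text.lower()
    matches = {}
    for theme, keywords in theme_keywords.items():
        # Plain substring matching: simple, but "log" would also match "login"
        hits = [kw for kw in keywords if kw.lower() in lowered]
        if hits:
            matches[theme] = hits
    return matches


def primary_theme(matches, default="Other"):
    """Pick the theme with the most keyword hits (first one wins on ties)."""
    if not matches:
        return default
    return max(matches, key=lambda theme: len(matches[theme]))


# Hypothetical theme dictionary, for illustration only
example_keywords = {
    "Account Access": ["login", "password"],
    "Transactions": ["transfer", "payment"],
}
found = match_themes("Login fails after password reset, transfer is fine", example_keywords)
print(found)
print(primary_theme(found))
```

Substring matching keeps the sketch short; matching on word boundaries would avoid false hits on short keywords.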

### 4.1 Theme Distribution

In [None]:
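# Overall theme distribution
import ast

all_themes = []
for review_themes in df_final['themes']:
    if isinstance(review_themes, list):
        all_themes.extend(review_themes)
    elif isinstance(review_themes, str):
        # Handle string representation of a list (e.g. after a CSV round trip)
        try:
            all_themes.extend(ast.literal_eval(review_themes))
        except (ValueError, SyntaxError):
            pass  # Skip values that are not valid list literals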

theme_counts = Counter(all_themes)

fig, ax = plt.subplots(figsize=(12, 6))

themes = [t[0] for t in theme_counts.most_common()]
counts = [t[1] for t in theme_counts.most_common()]

colors = sns.color_palette('Set2', len(themes))
bars = ax.barh(themes, counts, color=colors, edgecolor='black')

# Add count labels
for bar, count in zip(bars, counts):
    ax.text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2, 
            str(count), va='center', fontweight='bold')

ax.set_xlabel('Number of Reviews', fontsize=12)
ax.set_title('Theme Distribution Across All Reviews', fontsize=14, fontweight='bold')
ax.invert_yaxis()  # Most common at top

plt.tight_layout()
plt.savefig('../data/processed/theme_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

### 4.2 Themes by Bank

In [None]:
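# Theme distribution by bank
import ast

banks = df_final['bank_name'].unique()

# One panel per bank; squeeze=False keeps `axes` 2-D even for a single bank
fig, axes = plt.subplots(1, len(banks), figsize=(6 * len(banks), 6), squeeze=False)

for idx, bank in enumerate(banks):
    ax = axes[0, idx]
    bank_df = df_final[df_final['bank_name'] == bank]
    
    bank_themes = []
    for review_themes in bank_df['themes']:
        if isinstance(review_themes, list):
            bank_themes.extend(review_themes)
        elif isinstance(review_themes, str):
            # Handle string representation of a list
            try:
                bank_themes.extend(ast.literal_eval(review_themes))
            except (ValueError, SyntaxError):
                pass  # Skip values that are not valid list literals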
    
    if bank_themes:
        bank_theme_counts = Counter(bank_themes).most_common(5)
        themes = [t[0] for t in bank_theme_counts]
        counts = [t[1] for t in bank_theme_counts]
        
        ax.barh(themes, counts, color=sns.color_palette('Set2', len(themes)), edgecolor='black')
        ax.set_xlabel('Count')
        ax.set_title(f'{bank}\nTop 5 Themes', fontsize=12, fontweight='bold')
        ax.invert_yaxis()

plt.tight_layout()
plt.savefig('../data/processed/themes_by_bank.png', dpi=300, bbox_inches='tight')
plt.show()

### 4.3 Theme-Sentiment Correlation

**Key question:** Which themes are associated with positive vs negative sentiment?

In [None]:
# Calculate theme-sentiment correlation
correlation_df = theme_analyzer.get_theme_sentiment_correlation(df_final)
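
For intuition about what such a theme-sentiment table contains, the sketch below computes, for each theme, the share of its reviews labelled POSITIVE and NEGATIVE. It is an assumed reconstruction in pure Python: the real `get_theme_sentiment_correlation` may differ, and the `count` field is an addition for illustration (the `positive_pct` and `negative_pct` names mirror the columns used in the plotting cell).

```python
from collections import defaultdict


def theme_sentiment_shares(rows):
    """rows: iterable of (themes, label) pairs, where themes is a list of names."""
    counts = defaultdict(lambda: {"POSITIVE": 0, "NEGATIVE": 0, "total": 0})
    for themes, label in rows:
        for theme in themes:
            counts[theme]["total"] += 1
            if label in ("POSITIVE", "NEGATIVE"):
                counts[theme][label] += 1

    # A review with several themes contributes to each of them
    return {
        theme: {
            "positive_pct": 100 * c["POSITIVE"] / c["total"],
            "negative_pct": 100 * c["NEGATIVE"] / c["total"],
            "count": c["total"],
        }
        for theme, c in counts.items()
    }


# Hypothetical (themes, label) pairs, for illustration only
sample = [
    (["Reliability", "Transactions"], "POSITIVE"),
    (["Reliability"], "NEGATIVE"),
    (["Reliability"], "NEGATIVE"),
    (["Transactions"], "NEUTRAL"),
]
for theme, stats in theme_sentiment_shares(sample).items():
    print(theme, stats)
```

Because NEUTRAL reviews count toward the total but not toward either share, the two percentages need not sum to 100.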

In [None]:
# Visualize theme-sentiment correlation
if correlation_df is not None and len(correlation_df) > 0:
    fig, ax = plt.subplots(figsize=(12, 6))
    
    # Sort ascending by negative percentage so the biggest pain points end up at the top of the chart
    correlation_df = correlation_df.sort_values('negative_pct', ascending=True)
    
    themes = correlation_df['theme']
    pos_pct = correlation_df['positive_pct']
    neg_pct = correlation_df['negative_pct']
    
    y_pos = range(len(themes))
    
    # Create horizontal bar chart
    ax.barh(y_pos, pos_pct, color='#2ecc71', label='Positive', edgecolor='black')
    ax.barh(y_pos, -neg_pct, color='#e74c3c', label='Negative', edgecolor='black')
    
    ax.set_yticks(y_pos)
    ax.set_yticklabels(themes)
    ax.set_xlabel('Percentage of Reviews', fontsize=12)
    ax.set_title('Theme-Sentiment Correlation\n(Left = Negative, Right = Positive)', 
                 fontsize=14, fontweight='bold')
    ax.axvline(x=0, color='black', linewidth=0.5)
    ax.legend(loc='lower right')
    
    # Add percentage labels
    for i, (p, n) in enumerate(zip(pos_pct, neg_pct)):
        ax.text(p + 2, i, f'{p:.0f}%', va='center', fontsize=9)
        ax.text(-n - 8, i, f'{n:.0f}%', va='center', fontsize=9)
    
    plt.tight_layout()
    plt.savefig('../data/processed/theme_sentiment_correlation.png', dpi=300, bbox_inches='tight')
    plt.show()

## 5. Key Insights

In [None]:
# Generate key insights
print("=" * 60)
print("KEY INSIGHTS FROM SENTIMENT & THEMATIC ANALYSIS")
print("=" * 60)

# 1. Overall sentiment
total = len(df_final)
pos_total = len(df_final[df_final['sentiment_label_distilbert'] == 'POSITIVE'])
neg_total = len(df_final[df_final['sentiment_label_distilbert'] == 'NEGATIVE'])

print(f"\n1. OVERALL SENTIMENT")
print(f"   Total reviews analyzed: {total}")
print(f"   Positive: {pos_total} ({pos_total/total*100:.1f}%)")
print(f"   Negative: {neg_total} ({neg_total/total*100:.1f}%)")

# 2. Best and worst performing bank
print(f"\n2. BANK COMPARISON")
bank_scores = {}
for bank in df_final['bank_name'].unique():
    bank_df = df_final[df_final['bank_name'] == bank]
    pos = len(bank_df[bank_df['sentiment_label_distilbert'] == 'POSITIVE'])
    bank_scores[bank] = pos / len(bank_df) * 100

best_bank = max(bank_scores, key=bank_scores.get)
worst_bank = min(bank_scores, key=bank_scores.get)

print(f"   Best satisfaction: {best_bank} ({bank_scores[best_bank]:.1f}% positive)")
print(f"   Needs improvement: {worst_bank} ({bank_scores[worst_bank]:.1f}% positive)")

# 3. Top pain points (themes with highest negative %)
print(f"\n3. TOP PAIN POINTS (Themes with the highest share of negative reviews)")
if correlation_df is not None:
    pain_points = correlation_df.nlargest(3, 'negative_pct')
    for _, row in pain_points.iterrows():
        print(f"   - {row['theme']}: {row['negative_pct']:.1f}% negative")

# 4. Satisfaction drivers (themes with highest positive %)
print(f"\n4. SATISFACTION DRIVERS (Themes with the highest share of positive reviews)")
if correlation_df is not None:
    drivers = correlation_df.nlargest(3, 'positive_pct')
    for _, row in drivers.iterrows():
        print(f"   - {row['theme']}: {row['positive_pct']:.1f}% positive")

## 6. Sample Reviews by Theme

In [None]:
# Show sample reviews for the first four configured themes
print("Sample Reviews by Theme")
print("=" * 60)

for theme in list(THEME_KEYWORDS.keys())[:4]:  # First 4 themes in config order, not ranked by frequency
    print(f"\n{theme}")
    print("-" * 40)
    
    # Find reviews with this theme
    theme_reviews = df_final[df_final['primary_theme'] == theme]
    
    if len(theme_reviews) > 0:
        # Show one positive and one negative
        pos_review = theme_reviews[theme_reviews['sentiment_label_distilbert'] == 'POSITIVE'].head(1)
        neg_review = theme_reviews[theme_reviews['sentiment_label_distilbert'] == 'NEGATIVE'].head(1)
        
        if len(pos_review) > 0:
            print(f"  [POSITIVE] \"{pos_review['review_text'].values[0][:150]}...\"")
        if len(neg_review) > 0:
            print(f"  [NEGATIVE] \"{neg_review['review_text'].values[0][:150]}...\"")
    else:
        print("  No reviews found for this theme.")

## 7. Save Results

In [None]:
# Save final results with sentiment and themes
output_path = '../data/processed/reviews_with_sentiment_themes.csv'
df_final.to_csv(output_path, index=False)
print(f"Results saved to {output_path}")

# Show final columns
print(f"\nFinal dataset columns:")
for col in df_final.columns:
    print(f"  - {col}")

In [None]:
# Final dataset preview
df_final[['bank_name', 'review_text', 'rating', 'sentiment_label_distilbert', 'primary_theme']].head(10)
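
One caveat when reloading the saved file: CSV has no list type, so a `themes` list is written as its string form and comes back as a string. The sketch below demonstrates the round trip with the standard `csv` module and a small parser that restores the list. `parse_themes` is a hypothetical helper; the same idea would apply after `pd.read_csv`, for example via `df['themes'].apply(parse_themes)`.

```python
import ast
import csv
import io


def parse_themes(value):
    """Return a list of themes from either a list or its string representation."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []  # Not a valid Python literal
        return list(parsed) if isinstance(parsed, (list, tuple)) else []
    return []  # Missing values such as NaN


# Write a hypothetical row to an in-memory CSV, then read it back
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["review_text", "themes"])
writer.writerow(["App keeps crashing", ["Reliability", "User Experience"]])
buffer.seek(0)

row = next(csv.DictReader(buffer))
print(type(row["themes"]).__name__)  # The list was stored as text
print(parse_themes(row["themes"]))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to Python literals, which is safer for data read from files.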

## 8. Task 2 Summary

### Completed:
- ✅ Sentiment analysis using VADER (rule-based)
- ✅ Sentiment analysis using DistilBERT (transformer-based)
- ✅ Comparison of both methods
- ✅ Thematic analysis with TF-IDF keyword extraction
- ✅ Theme-sentiment correlation analysis
- ✅ Visualizations for all analyses

### Key Deliverables:
- Sentiment scores for 90%+ of reviews
- 3+ themes identified per bank
- Pain points and satisfaction drivers identified

### Next Steps (Task 3):
- Store results in PostgreSQL database
- Create database schema for reviews, sentiment, and themes