# Task 2: Sentiment and Thematic Analysis

This notebook covers sentiment analysis and thematic extraction from customer reviews.

## Objectives:
- Perform sentiment analysis using a DistilBERT model
- Extract themes and keywords from reviews
- Identify satisfaction drivers and pain points
- Prepare data for insights generation

## Models Used:
- **Sentiment Analysis**: `distilbert-base-uncased-finetuned-sst-2-english` (Hugging Face)
- **Thematic Analysis**: TF-IDF and spaCy NLP


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("✅ Libraries imported successfully")


## Step 1: Load Preprocessed Data


In [None]:
# Load cleaned data
df = pd.read_csv('../data/processed/reviews_cleaned.csv')
print(f"✅ Loaded {len(df)} reviews for analysis")
print(f"\n📊 Data Overview:")
print(f"   Banks: {df['bank'].unique().tolist()}")
print(f"   Rating Range: {df['rating'].min()} - {df['rating'].max()} stars")


## Step 2: Sentiment Analysis

**Note**: The full sentiment analysis is done using `sentiment_analysis.py`. This notebook demonstrates the process and loads results.
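
The standalone script is not reproduced in this notebook. As a rough sketch of what the scoring step could look like (an assumption, not the contents of `sentiment_analysis.py`), the snippet below wraps a Hugging Face `sentiment-analysis` pipeline for the checkpoint named above and converts its output into `sentiment_label` and `sentiment_score` values. The helper names `load_classifier` and `score_reviews`, the batch size, and the convention that `sentiment_score` is the probability of the POSITIVE class are all illustrative assumptions.

```python
from typing import Callable, Dict, List

MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def load_classifier() -> Callable[[List[str]], List[Dict]]:
    """Build the Hugging Face pipeline (downloads the model on first use)."""
    # Imported lazily so score_reviews can be exercised without transformers installed
    from transformers import pipeline

    clf = pipeline("sentiment-analysis", model=MODEL_NAME)
    # truncation=True clips long reviews to the model's maximum input length
    return lambda texts: clf(texts, truncation=True)


def score_reviews(texts: List[str], classifier, batch_size: int = 32) -> List[Dict]:
    """Return one {'sentiment_label', 'sentiment_score'} dict per review.

    `classifier` is any callable that maps a list of strings to a list of
    {'label': ..., 'score': ...} dicts, which is the pipeline's output format.
    """
    rows = []
    for start in range(0, len(texts), batch_size):
        batch = [str(t) for t in texts[start:start + batch_size]]
        for result in classifier(batch):
            label = result["label"].upper()
            confidence = float(result["score"])
            # Assumption: store P(positive), so 0.5 acts as the neutral midpoint
            positive_prob = confidence if label == "POSITIVE" else 1.0 - confidence
            rows.append({"sentiment_label": label, "sentiment_score": positive_prob})
    return rows
```

With the real model, the call might look like `score_reviews(df['review'].tolist(), load_classifier())`, where the `review` column name is an assumption about the cleaned CSV.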

### 2.1 Load Sentiment Analysis Results


In [None]:
# Load data with sentiment analysis
try:
    df_sentiment = pd.read_csv('../data/processed/reviews_with_sentiment.csv')
    print(f"✅ Loaded {len(df_sentiment)} reviews with sentiment analysis")
    print(f"\n💭 Sentiment Distribution:")
    print(df_sentiment['sentiment_label'].value_counts())
    print(f"\n📊 Average Sentiment Score: {df_sentiment['sentiment_score'].mean():.3f}")
except FileNotFoundError:
    print("⚠️  Sentiment analysis file not found.")
    print("💡 Please run: python task2_analysis/sentiment_analysis.py")
    df_sentiment = df.copy()


### 2.2 Sentiment Analysis Visualization


In [None]:
# Sentiment distribution by bank
if 'sentiment_label' in df_sentiment.columns:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Sentiment distribution
    ax1 = axes[0]
    sentiment_by_bank = pd.crosstab(df_sentiment['bank'], df_sentiment['sentiment_label'])
    sentiment_by_bank.plot(kind='bar', ax=ax1, color=['#FF6B6B', '#6BCB77'], width=0.8)
    ax1.set_title('Sentiment Distribution by Bank', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Bank', fontsize=12)
    ax1.set_ylabel('Number of Reviews', fontsize=12)
    ax1.legend(title='Sentiment', title_fontsize=11)
    ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)
    ax1.grid(axis='y', alpha=0.3)
    
    # Average sentiment score
    if 'sentiment_score' in df_sentiment.columns:
        ax2 = axes[1]
        avg_sentiment = df_sentiment.groupby('bank')['sentiment_score'].mean().sort_values(ascending=False)
        colors = ['#6BCB77' if x > 0.5 else '#FF6B6B' for x in avg_sentiment]
        avg_sentiment.plot(kind='bar', ax=ax2, color=colors, width=0.6)
        ax2.set_title('Average Sentiment Score by Bank', fontsize=14, fontweight='bold')
        ax2.set_xlabel('Bank', fontsize=12)
        ax2.set_ylabel('Average Sentiment Score', fontsize=12)
        ax2.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='Neutral (0.5)')
        ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)
        ax2.legend()
        ax2.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()


### 2.3 Sentiment by Rating


In [None]:
# Cross-tabulate sentiment labels against star ratings
if 'sentiment_label' in df_sentiment.columns:
    print("="*60)
    print("💭 SENTIMENT BY RATING")
    print("="*60)
    
    sentiment_rating = pd.crosstab(df_sentiment['rating'], df_sentiment['sentiment_label'])
    print("\n", sentiment_rating)
    
    # Visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    sentiment_rating.plot(kind='bar', ax=ax, color=['#FF6B6B', '#6BCB77'], width=0.8)
    ax.set_title('Sentiment Distribution by Rating', fontsize=14, fontweight='bold')
    ax.set_xlabel('Rating', fontsize=12)
    ax.set_ylabel('Number of Reviews', fontsize=12)
    ax.legend(title='Sentiment', title_fontsize=11)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
    ax.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()
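
Raw counts can be hard to compare across ratings because star levels differ in volume. A minimal sketch, assuming the same `rating` and `sentiment_label` columns, row-normalizes the crosstab and flags reviews where the stars and the model disagree. The cutoffs (4 or more stars counted as positive, 2 or fewer as negative) and both helper names are illustrative assumptions, and the demo rows are made up.

```python
import pandas as pd


def sentiment_share_by_rating(frame: pd.DataFrame) -> pd.DataFrame:
    """Share of each sentiment label within every star rating (rows sum to 1)."""
    return pd.crosstab(frame["rating"], frame["sentiment_label"], normalize="index")


def rating_sentiment_mismatches(frame: pd.DataFrame, high: int = 4, low: int = 2) -> pd.DataFrame:
    """Rows where the star rating and the predicted label point in opposite directions."""
    label = frame["sentiment_label"].str.upper()
    high_but_negative = (frame["rating"] >= high) & (label == "NEGATIVE")
    low_but_positive = (frame["rating"] <= low) & (label == "POSITIVE")
    return frame[high_but_negative | low_but_positive]


# Tiny made-up example, not real review data
demo = pd.DataFrame({
    "rating": [5, 5, 1, 1, 3],
    "sentiment_label": ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"],
})
print(sentiment_share_by_rating(demo))
print(rating_sentiment_mismatches(demo))
```

Mismatched rows are worth reading by hand: they often contain sarcasm, mixed feedback, or ratings entered by mistake.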


## Step 3: Thematic Analysis

**Note**: The full thematic analysis is done using `thematic_analysis.py`. This notebook demonstrates the process and loads results.

### 3.1 Load Thematic Analysis Results


In [None]:
# Load data with themes
import ast  # literal_eval parses list strings without executing arbitrary code

try:
    df_themes = pd.read_csv('../data/processed/reviews_with_themes.csv')
    print(f"✅ Loaded {len(df_themes)} reviews with thematic analysis")
    
    # Extract and count themes
    all_themes = []
    for themes in df_themes['themes']:
        if pd.notna(themes):
            try:
                if isinstance(themes, str):
                    # The CSV stores lists as strings such as "['a', 'b']"
                    theme_list = ast.literal_eval(themes) if themes.startswith('[') else [themes]
                else:
                    theme_list = themes
                all_themes.extend(theme_list)
            except (ValueError, SyntaxError, TypeError):
                # Skip rows whose themes value is malformed
                pass
    
    print(f"\n🏷️  Top 10 Themes Across All Banks:")
    theme_counts = Counter(all_themes)
    for theme, count in theme_counts.most_common(10):
        print(f"   {theme}: {count} reviews")
        
except FileNotFoundError:
    print("⚠️  Thematic analysis file not found.")
    print("💡 Please run: python task2_analysis/thematic_analysis.py")
    df_themes = df_sentiment.copy()


### 3.2 Theme Distribution by Bank


In [None]:
# Analyze themes by bank
import ast  # literal_eval parses list strings without executing arbitrary code

if 'themes' in df_themes.columns:
    theme_data = []
    for _, row in df_themes.iterrows():
        if pd.notna(row['themes']):
            try:
                if isinstance(row['themes'], str):
                    themes = ast.literal_eval(row['themes']) if row['themes'].startswith('[') else [row['themes']]
                else:
                    themes = row['themes']
                for theme in themes:
                    theme_data.append({'bank': row['bank'], 'theme': theme})
            except (ValueError, SyntaxError, TypeError):
                # Skip rows whose themes value is malformed
                pass
    
    if theme_data:
        theme_df = pd.DataFrame(theme_data)
        theme_counts = theme_df.groupby(['bank', 'theme']).size().unstack(fill_value=0)
        
        # Visualization
        fig, ax = plt.subplots(figsize=(12, 8))
        theme_counts.plot(kind='barh', ax=ax, width=0.8, colormap='Set3')
        ax.set_title('Theme Distribution by Bank', fontsize=14, fontweight='bold')
        ax.set_xlabel('Number of Reviews', fontsize=12)
        ax.set_ylabel('Bank', fontsize=12)
        ax.legend(title='Theme', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
        ax.grid(axis='x', alpha=0.3)
        plt.tight_layout()
        plt.show()
        
        # Print top themes per bank
        print("\n" + "="*60)
        print("üè∑Ô∏è  TOP THEMES BY BANK")
        print("="*60)
        for bank in df_themes['bank'].unique():
            bank_themes = theme_df[theme_df['bank'] == bank]['theme'].value_counts().head(3)
            print(f"\n{bank}:")
            for theme, count in bank_themes.items():
                print(f"   {theme}: {count} reviews")
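
To connect themes to the objective of identifying satisfaction drivers and pain points, themes can be crossed with sentiment. This is a minimal sketch, assuming `themes` has already been parsed into Python lists and that the same frame carries a `sentiment_label` column with POSITIVE and NEGATIVE values. The helper name `theme_sentiment_summary`, the `min_reviews` cutoff, and the demo rows are hypothetical.

```python
import pandas as pd


def theme_sentiment_summary(frame: pd.DataFrame, min_reviews: int = 1) -> pd.DataFrame:
    """Per bank and theme: review count and share of negative reviews."""
    exploded = (
        frame.explode("themes")            # one row per (review, theme) pair
        .dropna(subset=["themes"])         # reviews without themes drop out
        .rename(columns={"themes": "theme"})
        .reset_index(drop=True)
    )
    exploded["negative"] = exploded["sentiment_label"].str.upper().eq("NEGATIVE")
    summary = (
        exploded.groupby(["bank", "theme"])["negative"]
        .agg(reviews="size", negative_share="mean")
        .reset_index()
    )
    summary = summary[summary["reviews"] >= min_reviews]
    # High negative_share suggests a pain point, low suggests a satisfaction driver
    return summary.sort_values(["bank", "negative_share"], ascending=[True, False])


# Made-up rows for illustration only
demo_themes = pd.DataFrame({
    "bank": ["A", "A", "A"],
    "themes": [["Performance", "Account Access"], ["Performance"], ["UI"]],
    "sentiment_label": ["NEGATIVE", "POSITIVE", "POSITIVE"],
})
print(theme_sentiment_summary(demo_themes))
```

Raising `min_reviews` avoids labelling a theme a pain point on the strength of one or two reviews.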


## Task 2 Summary

✅ **Completed Steps:**
1. Sentiment analysis using a DistilBERT model
2. Sentiment distribution analysis by bank and rating
3. Thematic analysis using TF-IDF and keyword extraction
4. Theme identification and categorization
5. Data preparation for insights generation

✅ **KPIs Achieved:**
- Sentiment scores for 90%+ of reviews
- 3+ themes identified per bank
- Modular analysis pipeline
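
The first two KPIs can be checked directly against the processed data. A minimal sketch, assuming a `sentiment_score` column and a `themes` column already parsed into lists; the function name `check_kpis` is hypothetical, its default thresholds simply mirror the list above, and the demo rows are made up.

```python
import pandas as pd


def check_kpis(frame: pd.DataFrame, min_coverage: float = 0.90, min_themes: int = 3) -> dict:
    """Report sentiment coverage and the number of distinct themes per bank."""
    coverage = frame["sentiment_score"].notna().mean()
    themes_per_bank = (
        frame.explode("themes")
        .dropna(subset=["themes"])
        .groupby("bank")["themes"]
        .nunique()
    )
    # Banks with no themes at all vanish from the groupby, so count them as 0
    themes_per_bank = themes_per_bank.reindex(frame["bank"].unique(), fill_value=0)
    return {
        "sentiment_coverage": float(coverage),
        "coverage_ok": bool(coverage >= min_coverage),
        "themes_per_bank": themes_per_bank.to_dict(),
        "themes_ok": bool((themes_per_bank >= min_themes).all()),
    }


# Made-up rows for illustration only
demo_kpi = pd.DataFrame({
    "bank": ["A", "A", "B", "B"],
    "sentiment_score": [0.9, 0.2, None, 0.7],
    "themes": [["x", "y"], ["z"], ["x"], []],
})
print(check_kpis(demo_kpi))
```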

**Next Step**: Proceed to Task 3 for Database Storage or Task 4 for Insights and Recommendations
