# Sentiment Analysis Comparison
## Overview
In this analysis, we'll compare three different sentiment analysis approaches:
1. DistilBERT (Transformer-based)
2. VADER (Rule-based)
3. TextBlob (Lexicon-based)

Each method has its strengths:
- DistilBERT: Often more accurate, but computationally intensive
- VADER: Good for social media text, handles emojis and slang
- TextBlob: Simple and fast, good baseline

We'll analyze reviews from three Ethiopian banks:
- Commercial Bank of Ethiopia (CBE)
- Bank of Abyssinia (BOA)
- Dashen Bank

## Data Loading and Preparation
First, we'll load our cleaned review data and prepare it for analysis. We'll:
1. Load the CSV files for each bank
2. Combine them into a single DataFrame
3. Perform basic text preprocessing
4. Create a sample for testing (to avoid memory issues with DistilBERT)

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import os

# Create analyzed directory if it doesn't exist
analyzed_dir = Path('../data/analyzed')
analyzed_dir.mkdir(parents=True, exist_ok=True)

# Load the cleaned data
data_path = Path('../data/cleaned')
cbe_df = pd.read_csv(data_path / 'Commercial_Bank_of_Ethiopia_cleaned_data.csv')
boa_df = pd.read_csv(data_path / 'Bank_of_Abyssinia_cleaned_data.csv')
dashen_df = pd.read_csv(data_path / 'Dashen_Bank_cleaned_data.csv')

# Add bank name column to each dataframe
cbe_df['bank'] = 'CBE'
boa_df['bank'] = 'BOA'
dashen_df['bank'] = 'Dashen'

# Combine all dataframes
all_reviews = pd.concat([cbe_df, boa_df, dashen_df], ignore_index=True)

print("Total number of reviews:", len(all_reviews))
print("\nReviews per bank:")
print(all_reviews['bank'].value_counts())
print("\nRating distribution:")
print(all_reviews['rating'].value_counts().sort_index())
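
Steps 3 and 4 of the plan above (basic preprocessing and sampling) do not appear in the loading cell. The following is a minimal sketch of what they might look like. `basic_clean` and `sample_indices` are hypothetical helpers, not part of the original notebook, and the cleaning rules (collapsing whitespace, mapping missing values to empty strings) are assumptions.

```python
import math
import random
import re

def basic_clean(text):
    """Return a whitespace-normalized string; missing values become ''."""
    # pandas represents missing text as float NaN, so guard against it
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return ""
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", str(text)).strip()

def sample_indices(n_rows, sample_size, seed=42):
    """Pick a reproducible random subset of row positions."""
    rng = random.Random(seed)
    # Never ask for more rows than exist
    return sorted(rng.sample(range(n_rows), min(sample_size, n_rows)))

print(basic_clean("  Great   app!\nLove it  "))
print(sample_indices(10, 3))
```

In the notebook, the cleaning step could be applied with `all_reviews['review'].apply(basic_clean)`, and `all_reviews.sample(n=500, random_state=42)` is the pandas counterpart of `sample_indices`. The sample size of 500 is an arbitrary placeholder.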

## TextBlob Sentiment Analysis
TextBlob provides a simple API for common NLP tasks. For sentiment analysis, it:
- Uses a lexicon-based analyzer by default (`PatternAnalyzer`) rather than a trained neural model
- Returns polarity (-1 to +1) and subjectivity (0 to 1)
- Is fast and easy to use

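We'll:
1. Calculate TextBlob polarity scores
2. Categorize reviews as positive, negative, or neutral
3. Plot the sentiment distribution by bank and against star ratings

The comparison and correlation with VADER are left to the comparative analysis, once both sets of scores exist.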

In [None]:
from textblob import TextBlob
import time

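# Function to get TextBlob sentiment
def get_textblob_sentiment(text):
    # TextBlob raises a TypeError on non-string input (e.g. NaN from an
    # empty review), so treat such rows as neutral
    if not isinstance(text, str):
        return 0.0
    blob = TextBlob(text)
    return blob.sentiment.polarity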

# Function to categorize sentiment
def categorize_textblob_sentiment(score):
    if score > 0.1:
        return 'positive'
    elif score < -0.1:
        return 'negative'
    else:
        return 'neutral'

# Start timing
start_time = time.time()

# Calculate TextBlob sentiment
all_reviews['textblob_score'] = all_reviews['review'].apply(get_textblob_sentiment)
all_reviews['textblob_sentiment'] = all_reviews['textblob_score'].apply(categorize_textblob_sentiment)

# End timing
textblob_time = time.time() - start_time

# Save TextBlob results
textblob_results = all_reviews[['review', 'rating', 'date', 'bank', 'textblob_score', 'textblob_sentiment']]
textblob_results.to_csv(analyzed_dir / 'textblob_analysis.csv', index=False)

# Create visualizations
plt.figure(figsize=(15, 5))

# Plot 1: Sentiment Distribution by Bank
plt.subplot(1, 2, 1)
sentiment_by_bank = pd.crosstab(all_reviews['bank'], all_reviews['textblob_sentiment'])
# Pass the current axes explicitly; otherwise DataFrame.plot opens a new figure
sentiment_by_bank.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('TextBlob Sentiment Distribution by Bank')
plt.xlabel('Bank')
plt.ylabel('Number of Reviews')
plt.legend(title='Sentiment')

# Plot 2: Sentiment vs Rating
plt.subplot(1, 2, 2)
sns.boxplot(x='rating', y='textblob_score', data=all_reviews)
plt.title('TextBlob Sentiment Scores vs Star Ratings')
plt.xlabel('Star Rating')
plt.ylabel('TextBlob Polarity Score')

plt.tight_layout()
plt.savefig(analyzed_dir / 'textblob_analysis.png')
plt.show()

print(f"TextBlob processing time: {textblob_time:.2f} seconds")
print("\nTextBlob Sentiment Distribution:")
print(all_reviews['textblob_sentiment'].value_counts(normalize=True))

## VADER Sentiment Analysis
VADER (Valence Aware Dictionary and sEntiment Reasoner) is specifically attuned to sentiments expressed in social media. It:
- Handles emojis and slang
- Considers punctuation and capitalization
- Provides compound scores between -1 (negative) and +1 (positive)

We'll use VADER to:
1. Calculate sentiment scores for each review
2. Categorize reviews as positive, negative, or neutral
3. Analyze the distribution of sentiments across banks and ratings

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import time

# Initialize VADER
vader = SentimentIntensityAnalyzer()

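# Function to get VADER sentiment
def get_vader_sentiment(text):
    # Non-string values (e.g. NaN from an empty review) are scored as neutral
    if not isinstance(text, str):
        return 0.0
    scores = vader.polarity_scores(text)
    return scores['compound']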

# Function to categorize sentiment
def categorize_vader_sentiment(score):
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Start timing
start_time = time.time()

# Calculate VADER sentiment
all_reviews['vader_score'] = all_reviews['review'].apply(get_vader_sentiment)
all_reviews['vader_sentiment'] = all_reviews['vader_score'].apply(categorize_vader_sentiment)

# End timing
vader_time = time.time() - start_time

# Save VADER results
vader_results = all_reviews[['review', 'rating', 'date', 'bank', 'vader_score', 'vader_sentiment']]
vader_results.to_csv(analyzed_dir / 'vader_analysis.csv', index=False)

# Create visualizations
plt.figure(figsize=(15, 5))

# Plot 1: Sentiment Distribution by Bank
plt.subplot(1, 2, 1)
sentiment_by_bank = pd.crosstab(all_reviews['bank'], all_reviews['vader_sentiment'])
# Pass the current axes explicitly; otherwise DataFrame.plot opens a new figure
sentiment_by_bank.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('VADER Sentiment Distribution by Bank')
plt.xlabel('Bank')
plt.ylabel('Number of Reviews')
plt.legend(title='Sentiment')

# Plot 2: Sentiment vs Rating
plt.subplot(1, 2, 2)
sns.boxplot(x='rating', y='vader_score', data=all_reviews)
plt.title('VADER Sentiment Scores vs Star Ratings')
plt.xlabel('Star Rating')
plt.ylabel('VADER Compound Score')

plt.tight_layout()
plt.savefig(analyzed_dir / 'vader_analysis.png')
plt.show()

print(f"VADER processing time: {vader_time:.2f} seconds")
print("\nVADER Sentiment Distribution:")
print(all_reviews['vader_sentiment'].value_counts(normalize=True))
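
With both TextBlob and VADER scores available, the correlation between the two methods can be measured. The sketch below is dependency-free and runs on toy numbers rather than the review data; `pearson` and `agreement_rate` are hypothetical helper names introduced for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Correlation is undefined when either list is constant
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def agreement_rate(labels_a, labels_b):
    """Fraction of positions where the two label lists match."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# Toy values standing in for the textblob_score and vader_score columns
textblob_scores = [0.8, -0.4, 0.0, 0.3]
vader_scores = [0.9, -0.6, 0.1, 0.2]
print(f"Pearson r: {pearson(textblob_scores, vader_scores):.3f}")
print(agreement_rate(["positive", "negative"], ["positive", "neutral"]))
```

On the actual DataFrame, `all_reviews['textblob_score'].corr(all_reviews['vader_score'])` computes the same Pearson coefficient, and `pd.crosstab` of the two sentiment columns shows where the labels disagree.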

## DistilBERT Sentiment Analysis
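DistilBERT is a smaller, faster distilled version of BERT. The checkpoint used here:
- Uses the transformer architecture
- Is pre-trained on a large corpus and fine-tuned on SST-2 (English movie-review sentences) for binary sentiment classification
- Can capture context that lexicon-based methods miss, such as negation and word order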

We'll:
1. Load the pre-trained model
2. Score each review (the cell below processes them one at a time)
3. Compare results with VADER and TextBlob

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import torch
import time

# Initialize DistilBERT
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
sentiment_pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

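# Function to get DistilBERT sentiment
def get_distilbert_sentiment(text):
    # truncation=True keeps long reviews within the model's 512-token limit;
    # str() guards against non-string values such as NaN
    result = sentiment_pipeline(str(text), truncation=True)[0]
    # Convert label to a signed score: positive confidence > 0, negative < 0
    score = result['score'] if result['label'] == 'POSITIVE' else -result['score']
    return score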

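# Function to categorize sentiment
# Note: this checkpoint is a binary classifier, so the winning label's
# confidence is always >= 0.5 and the 'neutral' branch is almost never reached
def categorize_distilbert_sentiment(score):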
    if score > 0.5:
        return 'positive'
    elif score < -0.5:
        return 'negative'
    else:
        return 'neutral'

# Start timing
start_time = time.time()

# Calculate DistilBERT sentiment
all_reviews['distilbert_score'] = all_reviews['review'].apply(get_distilbert_sentiment)
all_reviews['distilbert_sentiment'] = all_reviews['distilbert_score'].apply(categorize_distilbert_sentiment)

# End timing
distilbert_time = time.time() - start_time

# Save DistilBERT results
distilbert_results = all_reviews[['review', 'rating', 'date', 'bank', 'distilbert_score', 'distilbert_sentiment']]
distilbert_results.to_csv(analyzed_dir / 'distilbert_analysis.csv', index=False)

# Create visualizations
plt.figure(figsize=(15, 5))

# Plot 1: Sentiment Distribution by Bank
plt.subplot(1, 2, 1)
sentiment_by_bank = pd.crosstab(all_reviews['bank'], all_reviews['distilbert_sentiment'])
# Pass the current axes explicitly; otherwise DataFrame.plot opens a new figure
sentiment_by_bank.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('DistilBERT Sentiment Distribution by Bank')
plt.xlabel('Bank')
plt.ylabel('Number of Reviews')
plt.legend(title='Sentiment')

# Plot 2: Sentiment vs Rating
plt.subplot(1, 2, 2)
sns.boxplot(x='rating', y='distilbert_score', data=all_reviews)
plt.title('DistilBERT Sentiment Scores vs Star Ratings')
plt.xlabel('Star Rating')
plt.ylabel('DistilBERT Score')

plt.tight_layout()
plt.savefig(analyzed_dir / 'distilbert_analysis.png')
plt.show()

print(f"DistilBERT processing time: {distilbert_time:.2f} seconds")
print("\nDistilBERT Sentiment Distribution:")
print(all_reviews['distilbert_sentiment'].value_counts(normalize=True))
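
Scoring reviews one at a time with `.apply` is the slowest way to run a transformer. The sketch below shows one way to batch the work instead. `iter_batches`, `signed_score`, and `score_in_batches` are hypothetical helpers, and `fake_predict` is a stand-in for the model so the example runs without downloading weights.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def signed_score(result):
    """Map a {'label', 'score'} dict to a signed score in [-1, 1]."""
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]

def score_in_batches(texts, predict_fn, batch_size=32):
    """Run predict_fn on each batch and collect signed scores in input order."""
    scores = []
    for batch in iter_batches(texts, batch_size):
        scores.extend(signed_score(r) for r in predict_fn(batch))
    return scores

# Stand-in for the real model: returns one result dict per input text
def fake_predict(batch):
    return [
        {"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 0.9}
        for text in batch
    ]

print(score_in_batches(["good app", "bad app", "good bank"], fake_predict, batch_size=2))
```

With the real model, `predict_fn` could be `lambda batch: sentiment_pipeline(batch, truncation=True)`, since the pipeline accepts a list of strings and returns one result per input. The batch size of 32 is an assumption to tune against available memory.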

## Comparative Analysis
We'll compare the three methods along these dimensions:
1. Correlation between different methods
2. Distribution of sentiments across banks
3. Accuracy against star ratings
4. Processing time and resource usage
5. Handling of specific cases (emojis, slang, etc.)
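
Accuracy against star ratings needs a rule for turning a rating into an expected sentiment. The sketch below assumes that 4 to 5 stars means positive, 1 to 2 stars means negative, and 3 stars means neutral. That mapping is an illustrative choice, not something the data prescribes, and both function names are hypothetical.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to an expected sentiment label (assumed mapping)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"

def accuracy_vs_ratings(ratings, predicted_labels):
    """Share of reviews whose predicted label matches the rating-derived label."""
    expected = [rating_to_sentiment(r) for r in ratings]
    hits = sum(e == p for e, p in zip(expected, predicted_labels))
    return hits / len(expected)

# Toy example: the third prediction disagrees with its 3-star rating
ratings = [5, 1, 3, 4]
predicted = ["positive", "negative", "positive", "positive"]
print(accuracy_vs_ratings(ratings, predicted))
```

Applied to the notebook, the same function could take `all_reviews['rating']` together with each of the `textblob_sentiment`, `vader_sentiment`, and `distilbert_sentiment` columns. Because the DistilBERT checkpoint is binary, it will rarely match 3-star reviews under this mapping.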

This will help us understand:
- Which method is most suitable for our use case
- Trade-offs between accuracy and performance
- How to best combine methods for optimal results
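
One simple way to combine the methods is a majority vote over the three labels. This is only a sketch of one possible scheme; `majority_vote` is a hypothetical helper, and falling back to neutral when all three methods disagree is an assumption.

```python
from collections import Counter

def majority_vote(labels, fallback="neutral"):
    """Return the most common label, or fallback when the top count is tied."""
    counts = Counter(labels).most_common()
    # A tie for first place means no label has a majority
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]

print(majority_vote(["positive", "positive", "negative"]))
print(majority_vote(["positive", "negative", "neutral"]))
```

On the DataFrame this could be applied row-wise, for example with `all_reviews[['textblob_sentiment', 'vader_sentiment', 'distilbert_sentiment']].apply(lambda row: majority_vote(list(row)), axis=1)`.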