# 🎭 Sentiment Analysis Deep Dive

This notebook explores advanced sentiment analysis techniques including:
- Multiple classification algorithms
- Feature engineering strategies
- Model evaluation and comparison
- Real-world application scenarios

## 📚 Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from wordcloud import WordCloud
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
nltk.download('stopwords', quiet=True)

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## 📊 Create Comprehensive Dataset

Let's create a small, balanced toy dataset of 30 movie reviews (15 positive, 15 negative) for sentiment analysis:

In [None]:
# Extended movie reviews dataset
positive_reviews = [
    "This movie is absolutely fantastic and amazing",
    "Great film with excellent acting and superb direction",
    "I love this movie so much, brilliant performance",
    "Outstanding cinematography and incredible storyline",
    "Perfect movie with amazing special effects",
    "Excellent script and wonderful character development",
    "Brilliant acting and fantastic plot twists",
    "Amazing movie that exceeded all my expectations",
    "Superb direction and outstanding performances",
    "Incredible film with great emotional depth",
    "Best movie I have seen this year",
    "Wonderful story and excellent execution",
    "Fantastic movie with great chemistry between actors",
    "Amazing visual effects and compelling narrative",
    "Perfect blend of action and emotion"
]

negative_reviews = [
    "This is a terrible and awful movie",
    "Worst film I have ever seen, completely boring",
    "I hate this movie, waste of time",
    "Poor storyline and terrible acting",
    "Disappointing movie with bad direction",
    "Horrible script and weak performances",
    "Awful movie with no redeeming qualities",
    "Terrible plot and boring characters",
    "Bad movie that failed to deliver",
    "Worst acting and poor cinematography",
    "Disappointing film with weak storyline",
    "Boring movie with terrible dialogue",
    "Poor execution and bad special effects",
    "Awful direction and weak character development",
    "Terrible movie that wasted great potential"
]

# Combine reviews and create labels
all_reviews = positive_reviews + negative_reviews
labels = [1] * len(positive_reviews) + [0] * len(negative_reviews)

# Create DataFrame
df = pd.DataFrame({
    'text': all_reviews,
    'sentiment': labels,
    'sentiment_label': ['Positive' if label == 1 else 'Negative' for label in labels]
})

print(f"Dataset size: {len(df)}")
print(f"Positive reviews: {sum(labels)}")
print(f"Negative reviews: {len(labels) - sum(labels)}")
print("\nSample data:")
print(df.head())

## 🔍 Exploratory Data Analysis

In [None]:
# Basic statistics
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

print("Text Statistics:")
print(df.groupby('sentiment_label')[['text_length', 'word_count']].agg(['mean', 'std', 'min', 'max']).round(2))

In [None]:
# Visualize text statistics
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# Sentiment distribution (fix the bar order so each color matches its label)
df['sentiment_label'].value_counts().reindex(['Negative', 'Positive']).plot(kind='bar', ax=ax1, color=['red', 'green'], alpha=0.7)
ax1.set_title('Sentiment Distribution')
ax1.set_xlabel('Sentiment')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=0)

# Text length distribution by sentiment
sns.boxplot(data=df, x='sentiment_label', y='text_length', ax=ax2)
ax2.set_title('Text Length Distribution by Sentiment')

# Word count distribution by sentiment
sns.boxplot(data=df, x='sentiment_label', y='word_count', ax=ax3)
ax3.set_title('Word Count Distribution by Sentiment')

# Text length histogram
df[df['sentiment_label'] == 'Positive']['text_length'].hist(alpha=0.7, label='Positive', ax=ax4, color='green')
df[df['sentiment_label'] == 'Negative']['text_length'].hist(alpha=0.7, label='Negative', ax=ax4, color='red')
ax4.set_title('Text Length Histogram')
ax4.set_xlabel('Text Length')
ax4.set_ylabel('Frequency')
ax4.legend()

plt.tight_layout()
plt.show()

## 🧹 Advanced Text Preprocessing

In [None]:
def advanced_text_preprocessing(text):
    """
    Lowercase the text, strip non-letter characters, drop English
    stop words, and stem each remaining token.
    """
    # Convert to lowercase
    text = text.lower()
    
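    # Replace special characters and digits with a space so that hyphenated
    # words such as "mind-blowing" are split instead of being glued together
    text = re.sub(r'[^a-z\s]', ' ', text)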
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Tokenize
    words = text.split()
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    
    # Stemming
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    
    return ' '.join(words)

# Apply preprocessing
df['processed_text'] = df['text'].apply(advanced_text_preprocessing)

print("Original vs Processed Text Examples:")
for i in range(3):
    print(f"\nOriginal: {df.iloc[i]['text']}")
    print(f"Processed: {df.iloc[i]['processed_text']}")
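One caveat with the stop-word step above: NLTK's English stop-word list includes negations such as "not" and "no", so a review like "Not the worst movie" loses its negation before it reaches the model. A minimal sketch of one way around this is shown here. It assumes a small hard-coded stop-word set as a stand-in for `stopwords.words('english')`, and the helper name `preprocess_keep_negations` is hypothetical, not part of the notebook.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
STOP_WORDS = {"the", "but", "is", "a", "this", "not", "no"}
# Words we want to keep even though they appear in the stop-word list
NEGATIONS = {"not", "no", "never"}


def preprocess_keep_negations(text, stop_words=STOP_WORDS, negations=NEGATIONS):
    """Lowercase, strip non-letters, and drop stop words except negations."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    removable = stop_words - negations
    return " ".join(word for word in text.split() if word not in removable)


result = preprocess_keep_negations("Not the worst movie but definitely disappointing")
print(result)
```

Keeping negations pays off mainly when bigrams are used, since "not worst" can then become a feature of its own.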

## ☁️ Word Clouds

In [None]:
# Create word clouds for positive and negative sentiments
positive_text = ' '.join(df[df['sentiment'] == 1]['processed_text'])
negative_text = ' '.join(df[df['sentiment'] == 0]['processed_text'])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Positive word cloud
wordcloud_pos = WordCloud(width=400, height=400, background_color='white', 
                          colormap='Greens').generate(positive_text)
ax1.imshow(wordcloud_pos, interpolation='bilinear')
ax1.set_title('Positive Reviews Word Cloud', fontsize=16)
ax1.axis('off')

# Negative word cloud
wordcloud_neg = WordCloud(width=400, height=400, background_color='white',
                          colormap='Reds').generate(negative_text)
ax2.imshow(wordcloud_neg, interpolation='bilinear')
ax2.set_title('Negative Reviews Word Cloud', fontsize=16)
ax2.axis('off')

plt.tight_layout()
plt.show()

## 🎯 Feature Engineering Comparison

In [None]:
# Split the data
X = df['processed_text']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training set positive ratio: {y_train.mean():.2f}")
print(f"Test set positive ratio: {y_test.mean():.2f}")

In [None]:
# Define different vectorizers
vectorizers = {
    'BoW (1-gram)': CountVectorizer(max_features=1000, ngram_range=(1, 1)),
    'BoW (1-2 gram)': CountVectorizer(max_features=1000, ngram_range=(1, 2)),
    'TF-IDF (1-gram)': TfidfVectorizer(max_features=1000, ngram_range=(1, 1)),
    'TF-IDF (1-2 gram)': TfidfVectorizer(max_features=1000, ngram_range=(1, 2)),
    'TF-IDF (1-3 gram)': TfidfVectorizer(max_features=1000, ngram_range=(1, 3))
}

# Store feature matrices
feature_matrices = {}

for name, vectorizer in vectorizers.items():
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    feature_matrices[name] = {
        'train': X_train_vec,
        'test': X_test_vec,
        'vectorizer': vectorizer
    }
    print(f"{name}: {X_train_vec.shape[1]} features")

## 🤖 Multiple Algorithm Comparison

In [None]:
# Define algorithms
algorithms = {
    'Multinomial NB': MultinomialNB(),
    'Bernoulli NB': BernoulliNB(),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM': SVC(random_state=42, probability=True),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100)
}

# Store results
results = []

# Test each combination
for feature_name, feature_data in feature_matrices.items():
    X_train_features = feature_data['train']
    X_test_features = feature_data['test']
    
    for algo_name, algorithm in algorithms.items():
        # Train the model
        algorithm.fit(X_train_features, y_train)
        
        # Make predictions
        y_pred = algorithm.predict(X_test_features)
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        
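        # Cross-validation score. Note: the vectorizer was already fit on the full
        # training set, so each fold sees vocabulary from its held-out rows and the
        # scores are slightly optimistic.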
        cv_scores = cross_val_score(algorithm, X_train_features, y_train, cv=5, scoring='accuracy')
        cv_mean = cv_scores.mean()
        cv_std = cv_scores.std()
        
        results.append({
            'Features': feature_name,
            'Algorithm': algo_name,
            'Test_Accuracy': accuracy,
            'CV_Mean': cv_mean,
            'CV_Std': cv_std
        })

# Create results DataFrame
results_df = pd.DataFrame(results)
results_df = results_df.round(4)

print("Algorithm Performance Comparison:")
print(results_df.to_string(index=False))
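The cross-validation above runs on matrices that were vectorized once on the whole training set. A common alternative is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so the vocabulary and IDF weights are refit inside every fold. The sketch below assumes a tiny stand-in corpus (`texts`, `targets`) instead of the notebook's `X_train` and `y_train`, and a 3-fold split chosen only because the stand-in corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in corpus; in the notebook, pass X_train and y_train instead
texts = [
    "fantastic amazing movie", "great film excellent acting",
    "love movie brilliant performance", "outstanding incredible storyline",
    "perfect movie amazing effects", "wonderful excellent story",
    "terrible awful movie", "worst film boring",
    "hate movie waste time", "poor storyline terrible acting",
    "disappointing movie bad direction", "horrible script weak performances",
]
targets = [1] * 6 + [0] * 6

# The vectorizer is a pipeline step, so it is fit on the training folds only
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, random_state=42),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, targets, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

The same pipeline object can later be fit on raw text and used for prediction, which removes the need to carry a separate vectorizer around.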

In [None]:
# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# Test accuracy comparison
pivot_test = results_df.pivot(index='Algorithm', columns='Features', values='Test_Accuracy')
sns.heatmap(pivot_test, annot=True, cmap='RdYlGn', ax=ax1, fmt='.3f')
ax1.set_title('Test Accuracy Heatmap')
ax1.set_xlabel('Feature Engineering')
ax1.set_ylabel('Algorithm')

# CV accuracy comparison
pivot_cv = results_df.pivot(index='Algorithm', columns='Features', values='CV_Mean')
sns.heatmap(pivot_cv, annot=True, cmap='RdYlGn', ax=ax2, fmt='.3f')
ax2.set_title('Cross-Validation Accuracy Heatmap')
ax2.set_xlabel('Feature Engineering')
ax2.set_ylabel('Algorithm')

plt.tight_layout()
plt.show()

## 🎯 Best Model Analysis

In [None]:
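# Find best performing combination by test accuracy (ties resolve to the first row).
# Selecting on the test set makes the reported accuracy optimistic.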
best_result = results_df.loc[results_df['Test_Accuracy'].idxmax()]
print("Best Performing Model:")
print(f"Features: {best_result['Features']}")
print(f"Algorithm: {best_result['Algorithm']}")
print(f"Test Accuracy: {best_result['Test_Accuracy']:.4f}")
print(f"CV Mean: {best_result['CV_Mean']:.4f} ± {best_result['CV_Std']:.4f}")

# Train the best model
best_features = best_result['Features']
best_algo = best_result['Algorithm']

X_train_best = feature_matrices[best_features]['train']
X_test_best = feature_matrices[best_features]['test']
best_vectorizer = feature_matrices[best_features]['vectorizer']

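# Clone the estimator so later cells that refit the shared objects in
# `algorithms` cannot overwrite the best model
from sklearn.base import clone
best_model = clone(algorithms[best_algo])
best_model.fit(X_train_best, y_train)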
y_pred_best = best_model.predict(X_test_best)

In [None]:
# Detailed evaluation of best model
print("Detailed Classification Report:")
print(classification_report(y_test, y_pred_best, target_names=['Negative', 'Positive']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'], 
            yticklabels=['Negative', 'Positive'])
plt.title(f'Confusion Matrix - {best_algo} with {best_features}')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

## 📊 ROC Curve Analysis

In [None]:
# Generate ROC curves for different algorithms
plt.figure(figsize=(12, 8))
colors = ['red', 'blue', 'green', 'orange', 'purple']

# Use TF-IDF (1-2 gram) features for comparison
feature_name = 'TF-IDF (1-2 gram)'
X_train_roc = feature_matrices[feature_name]['train']
X_test_roc = feature_matrices[feature_name]['test']

from sklearn.base import clone

for i, (algo_name, algorithm) in enumerate(algorithms.items()):
    # Train a fresh copy so the shared estimators in `algorithms` are not
    # refit on a different feature matrix
    model = clone(algorithm)
    model.fit(X_train_roc, y_train)
    
    # Get positive-class scores
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test_roc)[:, 1]
    else:
        y_proba = model.decision_function(X_test_roc)
    
    # Calculate ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC curve
    plt.plot(fpr, tpr, color=colors[i], lw=2, 
             label=f'{algo_name} (AUC = {roc_auc:.3f})')

# Plot diagonal line
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curves Comparison - {feature_name}')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

## 🔍 Feature Importance Analysis

In [None]:
# Analyze feature importance for Logistic Regression
if best_algo == 'Logistic Regression':
    feature_names = best_vectorizer.get_feature_names_out()
    coefficients = best_model.coef_[0]
    
    # Create feature importance DataFrame
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'coefficient': coefficients,
        'abs_coefficient': np.abs(coefficients)
    }).sort_values('abs_coefficient', ascending=False)
    
    # Plot top features
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))
    
    # Top positive features
    top_positive = feature_importance.nlargest(15, 'coefficient')
    ax1.barh(range(len(top_positive)), top_positive['coefficient'], color='green', alpha=0.7)
    ax1.set_yticks(range(len(top_positive)))
    ax1.set_yticklabels(top_positive['feature'])
    ax1.set_title('Top 15 Positive Sentiment Features')
    ax1.set_xlabel('Coefficient Value')
    
    # Top negative features
    top_negative = feature_importance.nsmallest(15, 'coefficient')
    ax2.barh(range(len(top_negative)), top_negative['coefficient'], color='red', alpha=0.7)
    ax2.set_yticks(range(len(top_negative)))
    ax2.set_yticklabels(top_negative['feature'])
    ax2.set_title('Top 15 Negative Sentiment Features')
    ax2.set_xlabel('Coefficient Value')
    
    plt.tight_layout()
    plt.show()
    
    print("\nTop 10 Most Important Features (by absolute coefficient):")
    print(feature_importance.head(10)[['feature', 'coefficient']].to_string(index=False))

## 🧪 Model Testing with New Examples

In [None]:
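def predict_sentiment(text, model, vectorizer, preprocessor):
    """
    Predict sentiment for new text.

    Returns the predicted label (0 or 1) and the class probabilities
    ordered as [negative, positive]. The model must support predict_proba.
    """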
    # Preprocess text
    processed_text = preprocessor(text)
    
    # Vectorize
    text_vector = vectorizer.transform([processed_text])
    
    # Predict
    prediction = model.predict(text_vector)[0]
    probability = model.predict_proba(text_vector)[0]
    
    return prediction, probability

# Test with new examples
test_examples = [
    "This movie is absolutely incredible and mind-blowing",
    "Terrible film with awful acting and boring plot",
    "Great movie with some amazing scenes",
    "Not the worst movie but definitely disappointing",
    "Outstanding performance by all actors",
    "Complete waste of time and money",
    "Good movie with excellent direction",
    "Poor storyline but decent acting"
]

print("Sentiment Predictions on New Examples:")
print("=" * 80)

for text in test_examples:
    prediction, probability = predict_sentiment(text, best_model, best_vectorizer, advanced_text_preprocessing)
    sentiment = "Positive" if prediction == 1 else "Negative"
    confidence = max(probability)
    
    print(f"\nText: {text}")
    print(f"Predicted Sentiment: {sentiment}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Probabilities: Negative={probability[0]:.3f}, Positive={probability[1]:.3f}")

## 📈 Model Performance Summary

In [None]:
# Summary statistics
summary_stats = results_df.groupby('Algorithm').agg({
    'Test_Accuracy': ['mean', 'std', 'max'],
    'CV_Mean': ['mean', 'std', 'max']
}).round(4)

print("Algorithm Performance Summary:")
print(summary_stats)

# Best algorithm by average performance
avg_performance = results_df.groupby('Algorithm')['Test_Accuracy'].mean().sort_values(ascending=False)
print("\nAlgorithms Ranked by Average Test Accuracy:")
for i, (algo, score) in enumerate(avg_performance.items(), 1):
    print(f"{i}. {algo}: {score:.4f}")

## 🎯 Key Insights and Recommendations

### **Performance Insights:**
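- **Best Overall Performance**: The best algorithm-feature combination is the row of `results_df` with the highest test accuracy; with only 9 test reviews, small differences between combinations are not meaningful
- **Feature Engineering Impact**: N-grams and TF-IDF often outperform unigram Bag of Words on larger corpora, but on a dataset this small the gap can vanish or reverse, so check the heatmaps
- **Algorithm Strengths**: Different algorithms excel with different feature representations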

### **Model Selection Guidelines:**
- **For Interpretability**: Logistic Regression with TF-IDF
- **For Small Datasets**: Naive Bayes (especially Multinomial NB)
- **For Complex Patterns**: SVM or Random Forest
- **For Speed**: Naive Bayes or Logistic Regression

### **Feature Engineering Recommendations:**
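- **Prefer TF-IDF** over raw counts as a default, but compare both on your own data
- **Include bigrams** (1-2 grams) for better context capture
- **Preprocess consistently**: apply the same cleaning, stop-word removal, and stemming at training and prediction time
- **Consider removing domain-specific stop words** (for example, "movie" and "film" in a review corpus)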

### **Next Steps for Improvement:**
1. **Larger Dataset**: Collect more diverse reviews
2. **Advanced Features**: Sentiment lexicons, POS tags
3. **Deep Learning**: Try LSTM, BERT for comparison
4. **Ensemble Methods**: Combine multiple models
5. **Hyperparameter Tuning**: Grid search for optimal parameters
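The hyperparameter tuning step could look roughly like the sketch below, which uses `GridSearchCV` over a pipeline so that the n-gram range and the regularization strength are tuned together. The stand-in corpus, the parameter grid, and the 3-fold split are all assumptions made for illustration; a real search would use the full training data and a grid suited to it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical stand-in corpus; in the notebook, use X_train and y_train
texts = [
    "fantastic amazing movie", "great film excellent acting",
    "love movie brilliant performance", "outstanding incredible storyline",
    "perfect movie amazing effects", "wonderful excellent story",
    "terrible awful movie", "worst film boring",
    "hate movie waste time", "poor storyline terrible acting",
    "disappointing movie bad direction", "horrible script weak performances",
]
targets = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Illustrative grid: step name, double underscore, parameter name
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(texts, targets)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is a pipeline refit on all the data passed to `fit`, so it accepts raw text directly.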

### **Real-World Application Tips:**
- **Monitor Performance**: Regularly evaluate on new data
- **Handle Class Imbalance**: Use appropriate sampling techniques
- **Domain Adaptation**: Retrain for different domains (movies vs products)
- **Confidence Thresholds**: Set minimum confidence for predictions
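The confidence-threshold tip can be sketched as a small helper that abstains whenever the top class probability is too low. The function name `predict_with_threshold`, the 0.7 default, and the example probabilities are assumptions for illustration; the input is expected in the `[negative, positive]` order returned by `predict_proba` in this notebook.

```python
import numpy as np


def predict_with_threshold(probabilities, threshold=0.7):
    """Map class probabilities to labels, abstaining below the threshold.

    `probabilities` has shape (n_samples, 2), ordered [negative, positive].
    """
    probabilities = np.asarray(probabilities, dtype=float)
    labels = ["Negative", "Positive"]
    top_class = probabilities.argmax(axis=1)
    top_prob = probabilities.max(axis=1)
    return [
        labels[cls] if prob >= threshold else "Uncertain"
        for cls, prob in zip(top_class, top_prob)
    ]


# Hypothetical probabilities, as predict_proba might return them
example_proba = [[0.10, 0.90], [0.45, 0.55], [0.80, 0.20]]
decisions = predict_with_threshold(example_proba, threshold=0.7)
print(decisions)
```

Reviews marked "Uncertain" can be routed to manual review or simply left unlabeled, depending on the cost of a wrong prediction.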