# Section 4 - Sentiment Analysis (Classical)

Reddit comments often carry emotional tone as well as thematic content. In this exercise, we score each comment with a lexicon-based sentiment model and examine how sentiment varies across subreddits and by author gender.

**Objectives:**
- Use VADER (pre-trained sentiment analysis model) to classify comments as positive, negative, or neutral
- Analyze overall sentiment distribution
- Visualize sentiment distribution per subreddit
- Investigate correlation between sentiment and gender

## 1. Setup and Data Loading

In [None]:
# Install required packages
!pip install pandas numpy matplotlib seaborn nltk scipy tqdm

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Download VADER lexicon
nltk.download('vader_lexicon')

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

In [None]:
# Load the supervised dataset
# Note: Using ORIGINAL body text, not the cleaned version!
# For sentiment analysis, we need the original text with stopwords intact

df_supervised = pd.read_csv('../data/data_supervised.csv')
df_target = pd.read_csv('../data/target_supervised.csv')

print(f"Supervised dataset shape: {df_supervised.shape}")
print(f"Target dataset shape: {df_target.shape}")
print(f"\nColumns in supervised data: {df_supervised.columns.tolist()}")
print(f"Columns in target data: {df_target.columns.tolist()}")

In [None]:
# Preview the data
print("Sample comments:")
df_supervised.head()

## 2. Preprocessing for Sentiment Analysis

**Important Note:** For sentiment analysis, we should NOT remove negation words like "not", "never", "no" as they drastically change the sentiment of a sentence.

Example:
- "I am happy" → Positive
- "I am NOT happy" → Negative (if we remove "not", we lose this information!)

We apply minimal preprocessing that preserves sentiment-bearing words.
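
To make the risk concrete, here is a minimal sketch of what a typical stopword-removal step does to the two example sentences. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical and deliberately tiny; they are not part of this notebook's pipeline. Standard lists such as NLTK's English stopwords do include negations like "not" and "no".

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
# Real lists (e.g. NLTK's English stopwords) also contain "not" and "no".
STOPWORDS = {"i", "am", "not", "the", "a"}

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

positive = remove_stopwords("I am happy")
negated = remove_stopwords("I am NOT happy")

print(positive)
print(negated)
print("Indistinguishable after cleaning:", positive == negated)
```

Both sentences collapse to the same token list, so any sentiment model applied afterwards has no way to tell them apart.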

In [None]:
import html
import re

def preprocess_for_sentiment(text):
    """
    Minimal preprocessing for sentiment analysis.
    Preserves negation words and sentiment-bearing content.
    """
    if pd.isna(text):
        return ""
    
    text = str(text)
    
    # Decode HTML entities (e.g., &amp; -> &)
    text = html.unescape(text)
    
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    # Remove subreddit and user references (r/... and u/...),
    # without clipping ordinary words such as "or/and"
    text = re.sub(r'(?<!\w)/?[ru]/\w+', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # NOTE: We do NOT:
    # - Remove stopwords (especially negations like 'not', 'never', 'no')
    # - Lowercase (VADER handles case sensitivity for emphasis detection)
    # - Remove punctuation (! and ? carry sentiment information)
    
    return text

# Apply preprocessing
df_supervised['body_sentiment'] = df_supervised['body'].apply(preprocess_for_sentiment)

print("Preprocessing complete!")
print(f"Empty bodies after preprocessing: {(df_supervised['body_sentiment'] == '').sum()}")

In [None]:
# Show comparison between original and preprocessed text
def truncate(text, limit=200):
    """Shorten long text for display; str() guards against NaN bodies."""
    text = str(text)
    return f"{text[:limit]}..." if len(text) > limit else text

comparison_df = df_supervised[['body', 'body_sentiment']].head(5)
for idx, row in comparison_df.iterrows():
    print(f"--- Comment {idx} ---")
    print(f"Original: {truncate(row['body'])}")
    print(f"Preprocessed: {truncate(row['body_sentiment'])}")
    print()

## 3. Sentiment Analysis with VADER

**VADER (Valence Aware Dictionary and sEntiment Reasoner)** is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media.

VADER returns 4 scores:
- `neg`: Negative sentiment proportion
- `neu`: Neutral sentiment proportion  
- `pos`: Positive sentiment proportion
- `compound`: Normalized, weighted composite score (-1 to +1)

Classification thresholds (standard):
- Positive: compound >= 0.05
- Negative: compound <= -0.05
- Neutral: -0.05 < compound < 0.05
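
To make the compound score less opaque: as far as I know, the reference VADER implementation squashes the summed word valences into the range (-1, 1) with x / sqrt(x² + α), using α = 15. The sketch below reproduces only that normalization step and then applies the thresholds listed above. The names `normalize_valence` and `label_from_compound` are illustrative, and the raw sums are made-up inputs rather than actual VADER output.

```python
import math

def normalize_valence(raw_sum, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

def label_from_compound(compound):
    """Apply the standard +/-0.05 thresholds to a compound score."""
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

# Made-up raw valence sums, chosen to land in each of the three classes
for raw_sum in (-4.0, -0.1, 0.0, 0.1, 4.0):
    compound = normalize_valence(raw_sum)
    print(f"raw={raw_sum:+.1f} -> compound={compound:+.4f} -> {label_from_compound(compound)}")
```

Because the mapping is steep near zero and flat at the extremes, a single strongly valenced word is often enough to push a short comment out of the neutral band.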

In [None]:
# Initialize VADER
sia = SentimentIntensityAnalyzer()

# Test VADER on sample sentences
test_sentences = [
    "I love this! It's amazing!",
    "This is terrible and I hate it.",
    "It's okay, nothing special.",
    "I am NOT happy about this.",  # Test negation handling
    "I am happy about this."
]

print("VADER Test Results:")
print("-" * 80)
for sentence in test_sentences:
    scores = sia.polarity_scores(sentence)
    print(f"Text: {sentence}")
    print(f"Scores: {scores}")
    print()

In [None]:
from tqdm import tqdm
tqdm.pandas()

def get_sentiment_scores(text):
    """Get VADER sentiment scores for a text."""
    if not text:
        return {'neg': 0, 'neu': 1, 'pos': 0, 'compound': 0}
    return sia.polarity_scores(text)

def classify_sentiment(compound_score):
    """Classify sentiment based on compound score."""
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Apply sentiment analysis to all comments
print("Analyzing sentiment for all comments...")
sentiment_scores = df_supervised['body_sentiment'].progress_apply(get_sentiment_scores)

# Extract individual scores
df_supervised['sentiment_neg'] = sentiment_scores.apply(lambda x: x['neg'])
df_supervised['sentiment_neu'] = sentiment_scores.apply(lambda x: x['neu'])
df_supervised['sentiment_pos'] = sentiment_scores.apply(lambda x: x['pos'])
df_supervised['sentiment_compound'] = sentiment_scores.apply(lambda x: x['compound'])

# Classify sentiment
df_supervised['sentiment_label'] = df_supervised['sentiment_compound'].apply(classify_sentiment)

print("\nSentiment analysis complete!")

In [None]:
# Preview results
df_supervised[['body_sentiment', 'sentiment_compound', 'sentiment_label']].head(10)

## 4. Overall Sentiment Distribution (Task 1b)

What is the overall sentiment distribution across all comments?

In [None]:
# Calculate sentiment distribution
sentiment_counts = df_supervised['sentiment_label'].value_counts()
sentiment_percentages = df_supervised['sentiment_label'].value_counts(normalize=True) * 100

print("Overall Sentiment Distribution:")
print("=" * 40)
for label in ['positive', 'neutral', 'negative']:
    count = sentiment_counts.get(label, 0)
    pct = sentiment_percentages.get(label, 0)
    print(f"{label.capitalize():10s}: {count:>7,} comments ({pct:.2f}%)")
print(f"{'Total':10s}: {len(df_supervised):>7,} comments")

In [None]:
# Visualize sentiment distribution
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# 1. Bar chart of sentiment counts
colors = {'positive': '#2ecc71', 'neutral': '#3498db', 'negative': '#e74c3c'}
order = ['positive', 'neutral', 'negative']

ax1 = axes[0]
bars = ax1.bar(order, [sentiment_counts.get(s, 0) for s in order],
               color=[colors[s] for s in order], edgecolor='black', linewidth=1.2)
ax1.set_xlabel('Sentiment', fontsize=12)
ax1.set_ylabel('Number of Comments', fontsize=12)
ax1.set_title('Sentiment Distribution (Counts)', fontsize=14, fontweight='bold')

# Add value labels on bars
for bar, label in zip(bars, order):
    height = bar.get_height()
    ax1.annotate(f'{int(height):,}',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                ha='center', va='bottom', fontsize=11, fontweight='bold')

# 2. Pie chart
ax2 = axes[1]
sizes = [sentiment_counts.get(s, 0) for s in order]
explode = (0.02, 0.02, 0.02)
ax2.pie(sizes, labels=[s.capitalize() for s in order], autopct='%1.1f%%',
        colors=[colors[s] for s in order], explode=explode,
        startangle=90, textprops={'fontsize': 11})
ax2.set_title('Sentiment Distribution (Percentage)', fontsize=14, fontweight='bold')

# 3. Distribution of compound scores
ax3 = axes[2]
ax3.hist(df_supervised['sentiment_compound'], bins=50, color='steelblue', 
         edgecolor='black', alpha=0.7)
ax3.axvline(x=0.05, color='green', linestyle='--', label='Positive threshold (0.05)')
ax3.axvline(x=-0.05, color='red', linestyle='--', label='Negative threshold (-0.05)')
ax3.set_xlabel('Compound Score', fontsize=12)
ax3.set_ylabel('Frequency', fontsize=12)
ax3.set_title('Distribution of Compound Scores', fontsize=14, fontweight='bold')
ax3.legend(loc='upper right')

plt.tight_layout()
plt.savefig('sentiment_overall_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

# Statistics on compound scores
print("\nCompound Score Statistics:")
print(df_supervised['sentiment_compound'].describe())

## 5. Sentiment Distribution per Subreddit (Task 1c)

Visualize sentiment distribution per subreddit using bar charts or heatmaps.

In [None]:
# Calculate sentiment distribution per subreddit
subreddit_sentiment = df_supervised.groupby(['subreddit', 'sentiment_label']).size().unstack(fill_value=0)

# Calculate percentages
subreddit_sentiment_pct = subreddit_sentiment.div(subreddit_sentiment.sum(axis=1), axis=0) * 100

# Get top 20 subreddits by comment count
top_subreddits = df_supervised['subreddit'].value_counts().head(20).index.tolist()

print(f"Total unique subreddits: {df_supervised['subreddit'].nunique()}")
print(f"\nTop 20 subreddits by comment count:")
print(df_supervised['subreddit'].value_counts().head(20))

In [None]:
# Bar chart: Top 20 subreddits sentiment distribution
fig, ax = plt.subplots(figsize=(14, 8))

# Filter to top subreddits and reorder columns
plot_data = subreddit_sentiment_pct.loc[top_subreddits][['positive', 'neutral', 'negative']]

# Stacked bar chart
plot_data.plot(kind='barh', stacked=True, ax=ax,
               color=[colors['positive'], colors['neutral'], colors['negative']],
               edgecolor='black', linewidth=0.5)

ax.set_xlabel('Percentage (%)', fontsize=12)
ax.set_ylabel('Subreddit', fontsize=12)
ax.set_title('Sentiment Distribution by Subreddit (Top 20)', fontsize=14, fontweight='bold')
ax.legend(title='Sentiment', loc='lower right')
ax.set_xlim(0, 100)

plt.tight_layout()
plt.savefig('sentiment_by_subreddit_bar.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Heatmap: Sentiment distribution across top subreddits
fig, ax = plt.subplots(figsize=(10, 12))

# Prepare data for heatmap
heatmap_data = subreddit_sentiment_pct.loc[top_subreddits][['positive', 'neutral', 'negative']]

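# Create heatmap (sequential colormap: every column is a share of comments,
# so a red-to-green scale would paint a high negative share green)
sns.heatmap(heatmap_data, annot=True, fmt='.1f', cmap='Blues',
            ax=ax, linewidths=0.5,
            cbar_kws={'label': 'Percentage (%)'})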

ax.set_xlabel('Sentiment', fontsize=12)
ax.set_ylabel('Subreddit', fontsize=12)
ax.set_title('Sentiment Distribution Heatmap (Top 20 Subreddits)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('sentiment_by_subreddit_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Average compound score per subreddit
subreddit_compound = df_supervised.groupby('subreddit')['sentiment_compound'].agg(['mean', 'std', 'count'])
subreddit_compound = subreddit_compound.sort_values('mean', ascending=False)

# Filter subreddits with at least 100 comments for reliability
subreddit_compound_filtered = subreddit_compound[subreddit_compound['count'] >= 100]

print("Most Positive Subreddits (min 100 comments):")
print(subreddit_compound_filtered.head(10))

print("\nMost Negative Subreddits (min 100 comments):")
print(subreddit_compound_filtered.tail(10))

In [None]:
# Visualize top 10 most positive and negative subreddits
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Most positive
top_positive = subreddit_compound_filtered.head(10)
ax1 = axes[0]
bars1 = ax1.barh(top_positive.index, top_positive['mean'], color='#2ecc71', edgecolor='black')
ax1.set_xlabel('Average Compound Score', fontsize=12)
ax1.set_title('Top 10 Most Positive Subreddits', fontsize=14, fontweight='bold')
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

# Most negative
top_negative = subreddit_compound_filtered.tail(10)
ax2 = axes[1]
bars2 = ax2.barh(top_negative.index, top_negative['mean'], color='#e74c3c', edgecolor='black')
ax2.set_xlabel('Average Compound Score', fontsize=12)
ax2.set_title('Top 10 Most Negative Subreddits', fontsize=14, fontweight='bold')
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.savefig('sentiment_extreme_subreddits.png', dpi=300, bbox_inches='tight')
plt.show()

## 6. Sentiment Correlation with Gender (Task 1d)

Does sentiment correlate with gender? Do male or female users tend to post more positive or negative comments?
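
When many comments are compared, a significance test alone says little about how large a difference is, so it can help to report an effect size next to the p-value. The sketch below computes the Mann-Whitney U statistic and the rank-biserial correlation by hand on toy scores. The helper names `mann_whitney_u` and `rank_biserial` and the toy values are illustrative assumptions; they are not taken from the dataset or from the analysis that follows.

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: each pair with x > y counts 1, each tie counts 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def rank_biserial(xs, ys):
    """Effect size in [-1, 1]; 0 means neither group tends to score higher."""
    u = mann_whitney_u(xs, ys)
    return 2.0 * u / (len(xs) * len(ys)) - 1.0

# Toy compound scores, made up for illustration
group_a = [0.6, 0.2, -0.1, 0.4]
group_b = [0.1, -0.3, 0.0, 0.2]

u_stat = mann_whitney_u(group_a, group_b)
effect = rank_biserial(group_a, group_b)
print(f"U = {u_stat}, rank-biserial r = {effect:.3f}")
```

The pairwise loop is quadratic, so it is only meant to show the definition; for the full dataset the U statistic returned by `stats.mannwhitneyu` can be plugged into the same formula.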

In [None]:
# Merge sentiment data with gender information
# First, we need to map authors to their genders

# Get unique authors from supervised data
authors_supervised = df_supervised['author'].unique()
print(f"Unique authors in supervised data: {len(authors_supervised)}")
print(f"Number of entries in target file: {len(df_target)}")

In [None]:
# Check the structure of target file
print("Target file structure:")
print(df_target.head())
print(f"\nTarget columns: {df_target.columns.tolist()}")

In [None]:
# Create author-gender mapping
# Assuming target file has 'author' and 'gender' columns
# If structure is different, adjust accordingly

# Option 1: If target has author column
if 'author' in df_target.columns:
    author_gender_map = df_target.set_index('author')['gender'].to_dict()
else:
    # Option 2: If target is aligned with unique authors
    # Note: groupby sorts authors alphabetically, so this assumes
    # df_target follows the same sorted order
    unique_authors = df_supervised.groupby('author').first().reset_index()['author']
    author_gender_map = dict(zip(unique_authors, df_target['gender']))

# Map gender to comments
df_supervised['gender'] = df_supervised['author'].map(author_gender_map)

# Check mapping success
print(f"Comments with gender info: {df_supervised['gender'].notna().sum()}")
print(f"Comments without gender info: {df_supervised['gender'].isna().sum()}")

# Gender distribution (0=male, 1=female)
print(f"\nGender distribution in comments:")
print(df_supervised['gender'].value_counts())

In [None]:
# Map numeric gender to labels
df_supervised['gender_label'] = df_supervised['gender'].map({0: 'Male', 1: 'Female'})

# Calculate sentiment distribution by gender
gender_sentiment = df_supervised.groupby(['gender_label', 'sentiment_label']).size().unstack(fill_value=0)
gender_sentiment_pct = gender_sentiment.div(gender_sentiment.sum(axis=1), axis=0) * 100

print("Sentiment Distribution by Gender (Counts):")
print(gender_sentiment)

print("\nSentiment Distribution by Gender (Percentages):")
print(gender_sentiment_pct.round(2))

In [None]:
# Statistical comparison of compound scores by gender
male_compound = df_supervised[df_supervised['gender_label'] == 'Male']['sentiment_compound']
female_compound = df_supervised[df_supervised['gender_label'] == 'Female']['sentiment_compound']

print("Compound Score Statistics by Gender:")
print("=" * 50)
print(f"\nMale users:")
print(f"  Mean: {male_compound.mean():.4f}")
print(f"  Std:   {male_compound.std():.4f}")
print(f"  Median: {male_compound.median():.4f}")
print(f"  Count: {len(male_compound):,}")

print(f"\nFemale users:")
print(f"  Mean: {female_compound.mean():.4f}")
print(f"  Std:   {female_compound.std():.4f}")
print(f"  Median: {female_compound.median():.4f}")
print(f"  Count: {len(female_compound):,}")

# Statistical test (Mann-Whitney U test - non-parametric)
statistic, p_value = stats.mannwhitneyu(male_compound, female_compound, alternative='two-sided')
print(f"\nMann-Whitney U Test:")
print(f"  Statistic: {statistic:,.0f}")
print(f"  P-value: {p_value:.6f}")
print(f"  Significant difference (α=0.05): {'Yes' if p_value < 0.05 else 'No'}")

In [None]:
# Visualizations for gender-sentiment correlation
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Bar chart: Sentiment distribution by gender
ax1 = axes[0, 0]
x = np.arange(3)
width = 0.35
sentiments = ['positive', 'neutral', 'negative']

male_pct = [gender_sentiment_pct.loc['Male', s] for s in sentiments]
female_pct = [gender_sentiment_pct.loc['Female', s] for s in sentiments]

bars1 = ax1.bar(x - width/2, male_pct, width, label='Male', color='#3498db', edgecolor='black')
bars2 = ax1.bar(x + width/2, female_pct, width, label='Female', color='#e91e63', edgecolor='black')

ax1.set_xlabel('Sentiment', fontsize=12)
ax1.set_ylabel('Percentage (%)', fontsize=12)
ax1.set_title('Sentiment Distribution by Gender', fontsize=14, fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels([s.capitalize() for s in sentiments])
ax1.legend()
ax1.set_ylim(0, max(max(male_pct), max(female_pct)) * 1.1)

# Add value labels
for bar in bars1:
    ax1.annotate(f'{bar.get_height():.1f}%', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                ha='center', va='bottom', fontsize=9)
for bar in bars2:
    ax1.annotate(f'{bar.get_height():.1f}%', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                ha='center', va='bottom', fontsize=9)

# 2. Box plot: Compound score distribution by gender
ax2 = axes[0, 1]
df_supervised.boxplot(column='sentiment_compound', by='gender_label', ax=ax2,
                      patch_artist=True,
                      boxprops=dict(facecolor='lightblue'),
                      medianprops=dict(color='red', linewidth=2))
ax2.set_xlabel('Gender', fontsize=12)
ax2.set_ylabel('Compound Score', fontsize=12)
ax2.set_title('Compound Score Distribution by Gender', fontsize=14, fontweight='bold')
plt.suptitle('')  # Remove automatic title

# 3. Violin plot: Detailed distribution
ax3 = axes[1, 0]
sns.violinplot(data=df_supervised, x='gender_label', y='sentiment_compound',
               order=['Male', 'Female'], palette=['#3498db', '#e91e63'], ax=ax3)
ax3.set_xlabel('Gender', fontsize=12)
ax3.set_ylabel('Compound Score', fontsize=12)
ax3.set_title('Sentiment Distribution (Violin Plot)', fontsize=14, fontweight='bold')
ax3.axhline(y=0, color='gray', linestyle='--', alpha=0.5)

# 4. KDE plot: Density comparison
ax4 = axes[1, 1]
sns.kdeplot(data=male_compound, ax=ax4, label='Male', color='#3498db', fill=True, alpha=0.3)
sns.kdeplot(data=female_compound, ax=ax4, label='Female', color='#e91e63', fill=True, alpha=0.3)
ax4.axvline(x=male_compound.mean(), color='#3498db', linestyle='--', label=f'Male Mean ({male_compound.mean():.3f})')
ax4.axvline(x=female_compound.mean(), color='#e91e63', linestyle='--', label=f'Female Mean ({female_compound.mean():.3f})')
ax4.set_xlabel('Compound Score', fontsize=12)
ax4.set_ylabel('Density', fontsize=12)
ax4.set_title('Compound Score Density by Gender', fontsize=14, fontweight='bold')
ax4.legend()

plt.tight_layout()
plt.savefig('sentiment_by_gender.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Aggregate by user: Average sentiment per user
user_sentiment = df_supervised.groupby(['author', 'gender_label']).agg({
    'sentiment_compound': 'mean',
    'sentiment_pos': 'mean',
    'sentiment_neg': 'mean',
    'body': 'count'
}).rename(columns={'body': 'num_comments'}).reset_index()

print(f"\nUser-level sentiment statistics:")
print(user_sentiment.groupby('gender_label')['sentiment_compound'].describe())

In [None]:
# User-level comparison
male_users = user_sentiment[user_sentiment['gender_label'] == 'Male']['sentiment_compound']
female_users = user_sentiment[user_sentiment['gender_label'] == 'Female']['sentiment_compound']

fig, ax = plt.subplots(figsize=(10, 6))

sns.kdeplot(data=male_users, ax=ax, label=f'Male (n={len(male_users)})', 
            color='#3498db', fill=True, alpha=0.3)
sns.kdeplot(data=female_users, ax=ax, label=f'Female (n={len(female_users)})', 
            color='#e91e63', fill=True, alpha=0.3)

ax.axvline(x=male_users.mean(), color='#3498db', linestyle='--', linewidth=2)
ax.axvline(x=female_users.mean(), color='#e91e63', linestyle='--', linewidth=2)

ax.set_xlabel('Average Compound Score per User', fontsize=12)
ax.set_ylabel('Density', fontsize=12)
ax.set_title('User-Level Average Sentiment by Gender', fontsize=14, fontweight='bold')
ax.legend()

# Statistical test at user level
stat_user, p_user = stats.mannwhitneyu(male_users, female_users, alternative='two-sided')
ax.text(0.02, 0.98, f'Mann-Whitney U p-value: {p_user:.4f}',
        transform=ax.transAxes, fontsize=10, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.savefig('sentiment_by_gender_user_level.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\nUser-level Mann-Whitney U Test:")
print(f"  P-value: {p_user:.6f}")
print(f"  Significant difference (α=0.05): {'Yes' if p_user < 0.05 else 'No'}")

## 7. Summary and Conclusions

In [None]:
print("="*60)
print("SECTION 4 - SENTIMENT ANALYSIS SUMMARY")
print("="*60)

print("\n1. PREPROCESSING:")
print("   - Used minimal preprocessing for sentiment analysis")
print("   - Preserved negation words (not, never, no) - crucial for sentiment!")
print("   - Kept punctuation (! ?) and case sensitivity")
print("   - Only removed URLs and Reddit-specific references")

print("\n2. OVERALL SENTIMENT DISTRIBUTION:")
for label in ['positive', 'neutral', 'negative']:
    pct = sentiment_percentages.get(label, 0)
    print(f"   - {label.capitalize()}: {pct:.1f}%")

print("\n3. SUBREDDIT ANALYSIS:")
print(f"   - Analyzed {df_supervised['subreddit'].nunique()} unique subreddits")
print(f"   - Most positive subreddit: {subreddit_compound_filtered.index[0]}")
print(f"   - Most negative subreddit: {subreddit_compound_filtered.index[-1]}")

print("\n4. GENDER CORRELATION:")
print(f"   - Male avg compound score: {male_compound.mean():.4f}")
print(f"   - Female avg compound score: {female_compound.mean():.4f}")
print(f"   - Difference: {abs(male_compound.mean() - female_compound.mean()):.4f}")
print(f"   - Statistical significance (p<0.05): {'Yes' if p_value < 0.05 else 'No'}")

more_positive = 'Female' if female_compound.mean() > male_compound.mean() else 'Male'
if p_value < 0.05:
    print(f"   - Finding: {more_positive} users tend to post slightly more positive comments")
else:
    print(f"   - Finding: No significant difference ({more_positive} users have the higher mean)")

print("\n" + "="*60)

In [None]:
# Save results for Section 2 integration
sentiment_features = df_supervised[['author', 'subreddit', 'sentiment_compound', 
                                     'sentiment_pos', 'sentiment_neg', 'sentiment_neu',
                                     'sentiment_label', 'gender']]
sentiment_features.to_csv('sentiment_features.csv', index=False)

# Save user-level aggregated sentiment
user_sentiment.to_csv('user_sentiment_aggregated.csv', index=False)

print("Saved files:")
print("  - sentiment_features.csv (comment-level)")
print("  - user_sentiment_aggregated.csv (user-level)")