# SECTION 4 - SENTIMENT ANALYSIS

### PART 1 - CLASSICAL ANALYSIS
Try using VADER from NLTK or HuggingFace transformers. The aim is to classify each comment in the supervised dataset as positive,
negative, or neutral.
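
As a rough illustration of the three-way labelling, the sketch below maps a VADER `compound` score (a value between -1 and 1 returned by `SentimentIntensityAnalyzer.polarity_scores`) to a class. The function name `label_from_compound` is hypothetical, and the ±0.05 cut-offs are the conventional defaults suggested by the VADER authors, not something this notebook prescribes; they can be tuned.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a sentiment class.

    The threshold of 0.05 is a commonly used convention (an assumption here).
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Example with hand-picked scores (not computed from real comments)
labels = [label_from_compound(score) for score in (0.62, -0.40, 0.01)]
```

A wider neutral band (for example a threshold of 0.1) would move borderline comments into the neutral class, which matters when comparing distributions across subreddits.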

a. Note that for sentiment analysis removing some stopwords (e.g., “not”, “never”, “no”) may
be harmful. Slightly adjust your preprocessing pipeline if you deem it useful.

b. What is the overall sentiment distribution across all comments?

c. Visualize sentiment distribution per subreddit using bar charts or heatmaps.

d. Does sentiment correlate with gender?
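
For point (a), one possible adjustment is to keep stopword removal but exclude negation words from the list. The sketch below is only an illustration: `STOPWORDS` is a tiny made-up list standing in for a real one (such as NLTK's English stopwords), and `remove_stopwords_keep_negations` is a hypothetical helper.

```python
# Tiny illustrative stopword list (an assumption, standing in for a real list)
STOPWORDS = {"the", "a", "is", "this", "was", "it", "not", "no", "never"}

# Negation words that carry sentiment and should survive filtering
NEGATIONS = {"not", "no", "never", "nor"}

# Stopwords that are safe to drop for sentiment analysis
SAFE_STOPWORDS = STOPWORDS - NEGATIONS


def remove_stopwords_keep_negations(text: str) -> str:
    """Drop stopwords from whitespace-separated tokens, keeping negations."""
    tokens = text.split()
    kept = [tok for tok in tokens if tok.lower() not in SAFE_STOPWORDS]
    return " ".join(kept)


filtered = remove_stopwords_keep_negations("This is not a good movie")
```

With a full stopword list, contracted negations such as "don't" would need the same treatment if the list contains them. For a lexicon-based tool like VADER, skipping stopword removal entirely is usually the simpler option.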

## 0. Setup and Data

In [3]:
# Install required packages
!pip install pandas numpy matplotlib seaborn nltk scipy



In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Download VADER lexicon
nltk.download('vader_lexicon')
# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\rocca\AppData\Roaming\nltk_data...


In [5]:
# Load the supervised dataset
# For sentiment analysis, we need the original text with stopwords intact

df_supervised = pd.read_csv('../data/data_supervised.csv')
df_target = pd.read_csv('../data/target_supervised.csv')

print(f"Supervised dataset shape: {df_supervised.shape}")
print(f"Target dataset shape: {df_target.shape}")
print(f"\nColumns in supervised data: {df_supervised.columns.tolist()}")
print(f"Columns in target data: {df_target.columns.tolist()}")

Supervised dataset shape: (296042, 4)
Target dataset shape: (5000, 2)

Columns in supervised data: ['author', 'subreddit', 'created_utc', 'body']
Columns in target data: ['author', 'gender']
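
The two tables share the `author` column, so gender labels can be attached to comments with a join. The sketch below uses toy data; the author names, comment texts, and the 0/1 gender encoding are invented for illustration, and it assumes the target table holds exactly one row per author.

```python
import pandas as pd

# Toy stand-ins for df_supervised and df_target (all values are made up)
comments = pd.DataFrame({
    "author": ["alice", "alice", "bob", "carol"],
    "body": ["great!", "not good", "meh", "love it"],
})
targets = pd.DataFrame({
    "author": ["alice", "bob", "carol"],
    "gender": [1, 0, 1],  # encoding is an assumption
})

# validate="many_to_one" raises if an author appears more than once in targets
merged = comments.merge(targets, on="author", how="left", validate="many_to_one")

# Number of comments contributed by each author
comments_per_author = merged.groupby("author").size()
```

Because each author contributes many comments, comment-level sentiment scores are not independent observations. Aggregating to one score per author before testing for a gender effect is one way to account for that.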


In [6]:
# Preview the data
print("Sample comments:")
df_supervised.head()

Sample comments:


| | author | subreddit | created_utc | body |
|---|---|---|---|---|
| 0 | Shamus_Aran | mylittlepony | 1388534000.0 | I don't think we'd get nearly as much fanficti... |
| 1 | Riddance | sex | 1388534000.0 | Thanks. I made it up, that's how I got over my... |
| 2 | Secret_Wizard | DragonsDogma | 1388534000.0 | Are you sure you aren't confusing Cyclops (the... |
| 3 | Penultimatum | malefashionadvice | 1388534000.0 | dont do this to me bro |
| 4 | 7-SE7EN-7 | todayilearned | 1388534000.0 | That's what we do when we can't find a mate |


## 1. Preprocessing
It is important NOT to remove negation words such as "not", "never", and "no", because they can reverse the sentiment of a sentence.

Thus we apply minimal preprocessing that preserves sentiment-bearing words.

In [7]:
import html
import re

def preprocess_for_sentiment(text):
    if pd.isna(text):
        return ""
    text = str(text)
    # Decode HTML entities (e.g., &amp; -> &)
    text = html.unescape(text)
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Remove subreddit and user references (r/... and u/...)
    text = re.sub(r'r/\w+|u/\w+', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply preprocessing
df_supervised['body_sentiment'] = df_supervised['body'].apply(preprocess_for_sentiment)

print("Preprocessing complete!")
print(f"Empty bodies after preprocessing: {(df_supervised['body_sentiment'] == '').sum()}")

Preprocessing complete!
Empty bodies after preprocessing: 987
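
Two side effects of the regular expressions in `preprocess_for_sentiment` are worth knowing about. The pattern `r/\w+|u/\w+` has no boundary check, so it also matches inside ordinary text such as "either/or". The URL pattern's `\S+` swallows the closing parenthesis of a Markdown link, which is why the preview of Comment 2 ends up with a dangling `[these guys](`. The sketch below shows one possible stricter variant; `strip_reddit_markup` and the compiled pattern names are hypothetical, and switching to it would likely change the counts reported above.

```python
import re

# Markdown link [text](url): keep the visible text, drop the URL
MD_LINK = re.compile(r'\[([^\]]+)\]\((?:https?://|www\.)[^)\s]*\)')
# Bare URLs, same pattern as in the notebook
URL = re.compile(r'https?://\S+|www\.\S+')
# r/name or u/name (optionally with a leading slash), only when not
# preceded by a word character or another slash
REDDIT_REF = re.compile(r'(?<![\w/])/?[ru]/\w+')


def strip_reddit_markup(text: str) -> str:
    """Remove links and subreddit/user references without damaging words."""
    text = MD_LINK.sub(r'\1', text)
    text = URL.sub('', text)
    text = REDDIT_REF.sub('', text)
    return re.sub(r'\s+', ' ', text).strip()


cleaned_link = strip_reddit_markup(
    "I'm talking about [these guys](http://i.imgur.com/c3YKPdI.jpg) Maybe"
)
cleaned_refs = strip_reddit_markup("see r/python and /u/someone")
untouched = strip_reddit_markup("either/or works")
```

Keeping the link text matters for sentiment because the anchor words ("these guys" here) are part of what the author wrote, while the URL itself carries no sentiment.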


In [8]:
# Show comparison between original and preprocessed text
comparison_df = df_supervised[['body', 'body_sentiment']].head(5)
for idx, row in comparison_df.iterrows():
    # Cast to str so slicing also works if a body is missing (NaN)
    original = str(row['body'])
    cleaned = row['body_sentiment']
    print(f"--- Comment {idx} ---")
    # Truncate long texts to 200 characters for display
    print(f"Original: {original[:200]}..." if len(original) > 200 else f"Original: {original}")
    print(f"Preprocessed: {cleaned[:200]}..." if len(cleaned) > 200 else f"Preprocessed:  {cleaned}")
    print()

--- Comment 0 ---
Original: I don't think we'd get nearly as much fanfiction and pictures shipping Ban-Ban and Lyro. Just saying.
Preprocessed:  I don't think we'd get nearly as much fanfiction and pictures shipping Ban-Ban and Lyro. Just saying.

--- Comment 1 ---
Original: Thanks. I made it up, that's how I got over my first heart break. 
Preprocessed:  Thanks. I made it up, that's how I got over my first heart break.

--- Comment 2 ---
Original: Are you sure you aren't confusing Cyclops (the easiest boss monster) for Ogres? I'm talking about [these guys](http://i.imgur.com/c3YKPdI.jpg)

Maybe I'm just a bad player... But every time I faced on...
Preprocessed: Are you sure you aren't confusing Cyclops (the easiest boss monster) for Ogres? I'm talking about [these guys]( Maybe I'm just a bad player... But every time I faced one on my first playthrough, all m...

--- Comment 3 ---
Original: dont do this to me bro
Preprocessed:  dont do this to me bro

--- Comment 4 ---
Original: That's what we do when we can't find a mate
Preprocessed:  That's what we do when we can't find a mate

## 3. Sentiment Analysis with VADER