# Cross-Platform Emotion Classification — Preprocessing & EDA

This notebook covers data loading, cleaning, text preprocessing, automated emotion annotation, and exploratory data analysis (EDA) for Twitter and Reddit mental health data.

**Steps:**
1. Load raw datasets
2. Initial cleaning (duplicates, nulls, irrelevant content)
3. Text preprocessing (emoji removal, tokenisation, lemmatisation)
4. Emotion annotation using a pre-trained transformer model
5. Exploratory Data Analysis (emotion distributions, text length, word counts, word clouds)
6. Save cleaned datasets for modelling

> **Note:** Raw data files are not included in this repository as they were collected via the Twitter and Reddit APIs and may contain sensitive content. To reproduce, collect data using `tweepy` (Twitter) and `praw` (Reddit) targeting mental health related keywords and subreddits.

## 1. Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from transformers import pipeline
import torch
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

print('All imports successful')

## 2. Load Raw Data

Data was collected via the Twitter API (using Tweepy) and Reddit API (using PRAW), targeting mental health related keywords and subreddits.

In [None]:
mh_reddit = pd.read_csv('../data/MHData_Reddit.csv')
mh_twitter = pd.read_csv('../data/MHData_Twitter.csv')

print(f'Reddit dataset:  {mh_reddit.shape[0]:,} rows, {mh_reddit.shape[1]} columns')
print(f'Twitter dataset: {mh_twitter.shape[0]:,} rows, {mh_twitter.shape[1]} columns')

print('\nReddit columns:', mh_reddit.columns.tolist())
print('Twitter columns:', mh_twitter.columns.tolist())

In [None]:
mh_reddit.head()

In [None]:
mh_twitter.head()

## 3. Initial Cleaning

### 3a. Remove Duplicates and Irrelevant Columns

- **Twitter:** Drop `user_id` and `created_at` (not needed for NLP)
- **Reddit:** Drop metadata columns, rename `body` to `text` to standardise across platforms

In [None]:
# Twitter: remove duplicates and drop metadata columns
mh_twitter_clean = mh_twitter.drop_duplicates(subset='text').drop(columns=['user_id', 'created_at'])

# Reddit: remove duplicates, drop metadata, rename body to text
mh_reddit_clean = mh_reddit.drop_duplicates(subset='body')
mh_reddit_clean = mh_reddit_clean.drop(columns=['title', 'score', 'id', 'url', 'comms_num', 'created'])
mh_reddit_clean = mh_reddit_clean.rename(columns={'body': 'text'})

print(f'Twitter after deduplication: {mh_twitter_clean.shape}')
print(f'Reddit after deduplication:  {mh_reddit_clean.shape}')

### 3b. Handle Missing Values

In [None]:
print('Twitter missing values:')
print(mh_twitter_clean.isnull().sum())
print('\nReddit missing values:')
print(mh_reddit_clean.isnull().sum())

# Drop nulls from Reddit (Twitter has none)
mh_reddit_clean = mh_reddit_clean.dropna()
print(f'\nReddit after dropping nulls: {mh_reddit_clean.shape}')

### 3c. Remove Organisational and Governmental Content

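Some posts are from organisations (charities, government bodies, awareness campaigns) rather than individuals experiencing mental health issues. These are removed with a case-insensitive substring match against a keyword list, so that the dataset better reflects personal expression. Because the match is on substrings, short keywords such as `our` or `gov` also match inside longer words (`your`, `hour`), so the filter discards some personal posts as well.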

In [None]:
custom_keywords = [
    'awareness', 'gov', 'official', 'ministry', 'our', 'organization', 'organisation',
    'follow', 'donate', 'we need', 'government', 'startup', 'contest', 'initiative',
    'department', 'movement', 'organize', 'organise', 'workshop', 'conference', 'seminar',
    'study', 'research', 'services', 'tickets', 'suicideprevention', 'campus', 'program',
    'reports', 'survey', 'statistics', 'industry', 'invited', 'join', 'series',
    'register', 'enroll', 'launching', 'launch'
]

def contains_keywords(text, keywords):
    # Case-insensitive substring match: True if any keyword occurs anywhere in the text
    text = text.lower()
    return any(keyword.lower() in text for keyword in keywords)

mh_twitter_clean = mh_twitter_clean[~mh_twitter_clean['text'].apply(lambda x: contains_keywords(x, custom_keywords))]
mh_reddit_clean  = mh_reddit_clean[~mh_reddit_clean['text'].apply(lambda x: contains_keywords(x, custom_keywords))]

print(f'Twitter after keyword filtering: {mh_twitter_clean.shape}')
print(f'Reddit after keyword filtering:  {mh_reddit_clean.shape}')
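
If the substring match turns out to be too aggressive, one possible alternative is to match keywords only as whole words or phrases. The sketch below is an assumption, not the filter used in this notebook: `build_keyword_pattern` and `contains_whole_keyword` are hypothetical helpers, and the keyword list is a shortened demo list.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word or phrase."""
    # Longest keywords first so multi-word phrases are tried before their parts
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b', flags=re.IGNORECASE)

def contains_whole_keyword(text, pattern):
    """True if the text contains at least one keyword bounded by word boundaries."""
    return pattern.search(text) is not None

# Shortened demo list (hypothetical); the notebook's list is longer
demo_keywords = ['our', 'gov', 'we need', 'donate']
pattern = build_keyword_pattern(demo_keywords)

personal_post = 'Your words helped me for an hour'
campaign_post = 'Join our campaign and donate today'

print(contains_whole_keyword(personal_post, pattern))
print(contains_whole_keyword(campaign_post, pattern))
```

The trade-off is that stems such as `gov` no longer catch `government`, so a whole-word list needs the full forms spelled out. It could be applied with `df['text'].apply(lambda t: contains_whole_keyword(t, pattern))`.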

## 4. Text Preprocessing

Applying a standard NLP preprocessing pipeline to both datasets:
- Remove emojis, URLs, mentions, special characters and numbers
- Lowercase
- Tokenise
- Remove stopwords
- Lemmatise

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Strip URLs and mentions first, while ':', '/', '.' and '@' are still in the text;
    # once special characters are removed, '@\w+' can no longer find mentions
    text = re.sub(r'http\S+|www\S+', '', text, flags=re.IGNORECASE)  # Remove URLs
    text = re.sub(r'@\w+', '', text)                                 # Remove mentions
    # Remove Unicode emojis
    emoji_pattern = re.compile(
        "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF"
        "\U0001F1E0-\U0001F1FF\U00002700-\U000027BF\U0001F900-\U0001F9FF"
        "\U00002600-\U000026FF\U000025A0-\U00002BEF]+",
        flags=re.UNICODE
    )
    text = emoji_pattern.sub(r'', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)   # Remove special characters and numbers
    text = text.lower()                         # Lowercase
    tokens = nltk.word_tokenize(text)           # Tokenise
    tokens = [w for w in tokens if w not in stop_words]           # Remove stopwords
    tokens = [lemmatizer.lemmatize(w) for w in tokens]            # Lemmatise
    return ' '.join(tokens)

print('Preprocessing function defined.')

In [None]:
# Apply to both datasets
mh_twitter_clean['clean_text'] = mh_twitter_clean['text'].apply(preprocess_text)
mh_reddit_clean['clean_text']  = mh_reddit_clean['text'].apply(preprocess_text)

print('Sample before preprocessing:')
print(mh_twitter_clean['text'].iloc[0])
print('\nSample after preprocessing:')
print(mh_twitter_clean['clean_text'].iloc[0])
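
The order of the regex steps matters: once `@` has been stripped as a special character, the mention pattern `@\w+` has nothing left to match and usernames stay in the text. A small standalone sketch of the two orderings follows. Both helper names are hypothetical, the sample post is made up, and tokenisation, stopword removal and lemmatisation are left out.

```python
import re

def strip_symbols_first(text):
    """Special characters removed before URLs and mentions."""
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'http\S+|www\S+', '', text, flags=re.IGNORECASE)
    text = re.sub(r'@\w+', '', text)  # never matches: '@' is already gone
    return ' '.join(text.lower().split())

def strip_mentions_first(text):
    """URLs and mentions removed while their marker characters are still present."""
    text = re.sub(r'http\S+|www\S+', '', text, flags=re.IGNORECASE)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return ' '.join(text.lower().split())

sample = 'Thanks @someuser, this helped: https://example.com/post/123'

print(strip_symbols_first(sample))   # username survives
print(strip_mentions_first(sample))  # username removed
```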

## 5. Emotion Annotation

Using the pre-trained `bhadresh-savani/distilbert-base-uncased-emotion` model from Hugging Face to automatically label each post with one of six emotions: **joy, sadness, anger, fear, love, surprise**.

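Posts are classified in batches (16 per call by default), with inference on the GPU when one is available and on the CPU otherwise. `truncation=True` cuts inputs that exceed the model's maximum sequence length. The top label and its confidence are stored as `emotion` and `score`.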

In [None]:
device = 0 if torch.cuda.is_available() else -1
print(f'Using device: {"GPU" if device == 0 else "CPU"}')

classifier = pipeline(
    'text-classification',
    model='bhadresh-savani/distilbert-base-uncased-emotion',
    tokenizer='bhadresh-savani/distilbert-base-uncased-emotion',
    truncation=True,
    device=device
)

In [None]:
def classify_batch(texts):
    results = classifier(texts)
    return [(r['label'], r['score']) for r in results]

def apply_batch_processing(df, text_column, batch_size=16):
    emotions, scores = [], []
    for i in range(0, len(df), batch_size):
        batch = df[text_column].iloc[i:i+batch_size].tolist()
        batch = [t if pd.notna(t) else '' for t in batch]
        batch_results = classify_batch(batch)
        batch_emotions, batch_scores = zip(*batch_results)
        emotions.extend(batch_emotions)
        scores.extend(batch_scores)
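    # Reuse df's index so the labels align row-for-row when assigned back to the (filtered) frame
    return pd.DataFrame({'emotion': emotions, 'score': scores}, index=df.index)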

In [None]:
# This step is computationally expensive — expect several minutes on CPU
mh_twitter_clean[['emotion', 'score']] = apply_batch_processing(mh_twitter_clean, 'clean_text')
mh_reddit_clean[['emotion', 'score']]  = apply_batch_processing(mh_reddit_clean, 'clean_text')

# Drop any rows where annotation failed
mh_twitter_clean = mh_twitter_clean.dropna(subset=['emotion'])
mh_reddit_clean  = mh_reddit_clean.dropna(subset=['emotion'])

print(f'Twitter annotated: {mh_twitter_clean.shape}')
print(f'Reddit annotated:  {mh_reddit_clean.shape}')
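
pandas aligns on index labels when a Series or DataFrame is assigned to columns, so labels built from plain lists only line up with a filtered frame if they carry that frame's index. A toy illustration, assuming made-up posts and labels:

```python
import pandas as pd

# Toy frame whose index has gaps, as happens after drop_duplicates, dropna or boolean filtering
posts = pd.DataFrame({'clean_text': ['a', 'b', 'c']}, index=[0, 5, 9])

# Labels built from plain lists get a fresh RangeIndex 0..2
labels_fresh = pd.DataFrame({'emotion': ['joy', 'fear', 'anger']})

# Same labels, but carrying the index of the frame they were computed from
labels_aligned = pd.DataFrame({'emotion': ['joy', 'fear', 'anger']}, index=posts.index)

misaligned = posts.copy()
misaligned['emotion'] = labels_fresh['emotion']    # aligns on index: only label 0 matches
aligned = posts.copy()
aligned['emotion'] = labels_aligned['emotion']     # every row receives its own label

print(misaligned)
print(aligned)
```

Passing `index=df.index` when building the label frame, or calling `reset_index(drop=True)` on the posts before annotation, are two ways to keep rows and labels aligned.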

## 6. Exploratory Data Analysis

### 6a. Emotion Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

sns.countplot(x='emotion', data=mh_twitter_clean,
              order=mh_twitter_clean['emotion'].value_counts().index, ax=axes[0])
axes[0].set_title('Emotion Distribution — Twitter')
axes[0].tick_params(axis='x', rotation=45)

sns.countplot(x='emotion', data=mh_reddit_clean,
              order=mh_reddit_clean['emotion'].value_counts().index, ax=axes[1])
axes[1].set_title('Emotion Distribution — Reddit')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print('Twitter emotion counts:')
print(mh_twitter_clean['emotion'].value_counts())
print('\nReddit emotion counts:')
print(mh_reddit_clean['emotion'].value_counts())
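
If the two datasets differ in size, raw counts are hard to compare across platforms, and proportions may be easier to read. A minimal stdlib sketch on made-up labels follows; `emotion_proportions` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def emotion_proportions(labels):
    """Return {label: share of posts}, ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}

# Made-up labels standing in for the 'emotion' columns
twitter_labels = ['sadness', 'sadness', 'joy', 'fear']
reddit_labels = ['sadness', 'anger', 'anger', 'anger', 'joy', 'joy', 'fear', 'fear']

twitter_shares = emotion_proportions(twitter_labels)
reddit_shares = emotion_proportions(reddit_labels)

print(twitter_shares)
print(reddit_shares)
```

On the real frames, `df['emotion'].value_counts(normalize=True)` gives the same shares directly.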

### 6b. Text Length and Word Count Distribution

In [None]:
mh_twitter_clean['text_length'] = mh_twitter_clean['clean_text'].apply(len)
mh_reddit_clean['text_length']  = mh_reddit_clean['clean_text'].apply(len)
mh_twitter_clean['word_count']  = mh_twitter_clean['clean_text'].apply(lambda x: len(x.split()))
mh_reddit_clean['word_count']   = mh_reddit_clean['clean_text'].apply(lambda x: len(x.split()))

fig, axes = plt.subplots(2, 2, figsize=(16, 10))

sns.histplot(mh_twitter_clean['text_length'], bins=50, kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Twitter — Text Length Distribution')
axes[0, 0].set_xlabel('Character Count')

sns.histplot(mh_reddit_clean['text_length'], bins=50, kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Reddit — Text Length Distribution')
axes[0, 1].set_xlabel('Character Count')

sns.histplot(mh_twitter_clean['word_count'], bins=50, kde=True, color='green', ax=axes[1, 0])
axes[1, 0].set_title('Twitter — Word Count Distribution')
axes[1, 0].set_xlabel('Word Count')

sns.histplot(mh_reddit_clean['word_count'], bins=50, kde=True, color='green', ax=axes[1, 1])
axes[1, 1].set_title('Reddit — Word Count Distribution')
axes[1, 1].set_xlabel('Word Count')

plt.tight_layout()
plt.show()

for platform, df in [('Twitter', mh_twitter_clean), ('Reddit', mh_reddit_clean)]:
    print(f'{platform} — Text length: {df["text_length"].min()}–{df["text_length"].max()} chars | Word count: {df["word_count"].min()}–{df["word_count"].max()} words')
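
Minimum and maximum are driven by single extreme posts, so quartiles may describe the typical post length better. A stdlib-only sketch on made-up texts follows; `length_summary` is a hypothetical helper, not part of the notebook.

```python
import statistics

def length_summary(texts):
    """Summarise word counts with min, quartiles and max."""
    word_counts = [len(t.split()) for t in texts]
    # method='inclusive' treats the data as the full population, not a sample
    q1, median, q3 = statistics.quantiles(word_counts, n=4, method='inclusive')
    return {'min': min(word_counts), 'q1': q1, 'median': median,
            'q3': q3, 'max': max(word_counts)}

# Made-up cleaned texts with 1 to 5 words
sample_texts = [
    'tired',
    'feel alone',
    'cant sleep again',
    'really want talk someone',
    'bad day work nobody listens',
]

summary = length_summary(sample_texts)
print(summary)
```

On the real frames, `df['word_count'].describe()` reports the same kind of summary.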

### 6c. Word Clouds by Emotion

Visualising the most frequent terms for each emotion class across both platforms.

In [None]:
def plot_wordclouds(df, platform_name):
    emotions = df['emotion'].unique()
    for emotion in sorted(emotions):
        text = ' '.join(df[df['emotion'] == emotion]['clean_text'])
        if text.strip():
            wc = WordCloud(width=800, height=400, background_color='white').generate(text)
            plt.figure(figsize=(10, 5))
            plt.imshow(wc, interpolation='bilinear')
            plt.axis('off')
            plt.title(f'Word Cloud — {emotion} ({platform_name})')
            plt.show()

plot_wordclouds(mh_twitter_clean, 'Twitter')
plot_wordclouds(mh_reddit_clean, 'Reddit')
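
Word clouds show relative frequency visually; the underlying counts can also be listed directly, which makes platforms easier to compare term by term. A minimal stdlib sketch on made-up rows follows; `top_terms` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def top_terms(rows, n=3):
    """rows: iterable of (emotion, clean_text) pairs -> {emotion: [(term, count), ...]}"""
    counters = defaultdict(Counter)
    for emotion, clean_text in rows:
        counters[emotion].update(clean_text.split())
    return {emotion: counter.most_common(n) for emotion, counter in counters.items()}

# Made-up rows standing in for the 'emotion' and 'clean_text' columns
sample_rows = [
    ('sadness', 'feel alone tonight'),
    ('sadness', 'feel tired feel empty'),
    ('joy', 'good day today'),
]

result = top_terms(sample_rows, n=1)
print(result)
```

With the real data, the pairs could come from `zip(df['emotion'], df['clean_text'])`.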

## 7. Save Cleaned Datasets

Saving the annotated and cleaned datasets for use in the modelling notebook.

In [None]:
mh_twitter_clean.to_csv('../data/mh_twitter_clean.csv', index=False)
mh_reddit_clean.to_csv('../data/mh_reddit_clean.csv', index=False)

print(f'Saved Twitter: {mh_twitter_clean.shape[0]:,} rows')
print(f'Saved Reddit:  {mh_reddit_clean.shape[0]:,} rows')
print('Columns:', mh_twitter_clean.columns.tolist())
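
One thing to check when reloading these files: a post that consisted only of stopwords, symbols or a URL has an empty `clean_text`, and by default `pd.read_csv` parses empty fields as `NaN`. A small in-memory sketch with made-up rows:

```python
import io
import pandas as pd

# Made-up rows: the first post is reduced to an empty string by preprocessing
df = pd.DataFrame({
    'text': ['I am', 'so tired today'],
    'clean_text': ['', 'tired today'],
})

# Round-trip through CSV in memory instead of touching the data directory
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded['clean_text'].isna().sum())  # empty string came back as a missing value

# One possible handling: drop rows with no usable text
usable = reloaded.dropna(subset=['clean_text'])
print(len(usable))
```

In the modelling notebook, `dropna(subset=['clean_text'])` or `pd.read_csv(..., keep_default_na=False)` are two possible ways to handle such rows.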