# Cross-Platform Emotion Classification — Modelling

This notebook builds and evaluates machine learning models for emotion classification using preprocessed Twitter and Reddit mental health data.

**Steps:**
1. Load cleaned datasets
2. TF-IDF vectorisation and label encoding
3. Class imbalance handling (random oversampling)
4. Single-platform evaluation (Twitter vs Twitter, Reddit vs Reddit)
5. Cross-platform evaluation (train on one, test on the other)
6. Results summary and key insights

## 1. Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

import warnings
warnings.filterwarnings('ignore')

print('All imports successful')

## 2. Load Cleaned Data

In [None]:
mh_twitter = pd.read_csv('../data/mh_twitter_clean.csv')
mh_reddit  = pd.read_csv('../data/mh_reddit_clean.csv')

# Drop any rows with missing clean_text
mh_twitter = mh_twitter.dropna(subset=['clean_text'])
mh_reddit  = mh_reddit.dropna(subset=['clean_text'])

print(f'Twitter: {mh_twitter.shape}')
print(f'Reddit:  {mh_reddit.shape}')
print('\nTwitter emotion classes:', sorted(mh_twitter['emotion'].unique()))
print('Reddit emotion classes: ', sorted(mh_reddit['emotion'].unique()))

## 3. TF-IDF Vectorisation

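Text is converted to numerical features with TF-IDF over unigrams and bigrams (`ngram_range=(1, 2)`), with English stop words removed.

For single-platform evaluation, each dataset is vectorised independently. For cross-platform evaluation, the vectoriser is fitted on the combined data so that both platforms share the same feature space. Note that each vectoriser is fitted on the full dataset before the train/test split in Section 5, so test-set vocabulary and document frequencies contribute to the features.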

In [None]:
# Single-platform vectorisation (fit separately)
vectorizer_twitter = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
vectorizer_reddit  = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))

X_twitter = vectorizer_twitter.fit_transform(mh_twitter['clean_text'])
X_reddit  = vectorizer_reddit.fit_transform(mh_reddit['clean_text'])

print(f'Twitter TF-IDF matrix: {X_twitter.shape}')
print(f'Reddit TF-IDF matrix:  {X_reddit.shape}')
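
The need for a shared feature space can be seen on a toy corpus. The sketch below uses invented sentences (not the project data) and assumes the same `TfidfVectorizer` settings as above. Two independently fitted vectorisers learn different vocabularies, so their columns are not comparable, whereas a single vectoriser fitted on the combined text gives both platforms identical columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences, invented for illustration only
twitter_docs = ["feeling sad today", "happy happy day"]
reddit_docs = ["anxious and sad for months", "therapy made me feel happy again"]

params = dict(stop_words='english', ngram_range=(1, 2))

# Fitted separately: each vectoriser learns its own vocabulary and column order
vec_tw = TfidfVectorizer(**params).fit(twitter_docs)
vec_rd = TfidfVectorizer(**params).fit(reddit_docs)
print('Twitter-only vocabulary size:', len(vec_tw.vocabulary_))
print('Reddit-only vocabulary size: ', len(vec_rd.vocabulary_))

# Fitted on the combined text: one vocabulary, so columns line up across platforms
vec_shared = TfidfVectorizer(**params).fit(twitter_docs + reddit_docs)
X_tw = vec_shared.transform(twitter_docs)
X_rd = vec_shared.transform(reddit_docs)
print('Shared feature space:', X_tw.shape, X_rd.shape)
```

A model trained on `X_tw` can be applied to `X_rd` only because both matrices come from `vec_shared`.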

## 4. Label Encoding

In [None]:
label_encoder = LabelEncoder()

y_twitter = mh_twitter['emotion']
y_reddit  = mh_reddit['emotion']

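# Fit on the union of both platforms so that a label present on only one
# platform does not raise a ValueError in transform()
label_encoder.fit(pd.concat([y_twitter, y_reddit]))
y_twitter_encoded = label_encoder.transform(y_twitter)
y_reddit_encoded  = label_encoder.transform(y_reddit)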

print('Emotion label mapping:')
for i, cls in enumerate(label_encoder.classes_):
    print(f'  {i} → {cls}')
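
One thing to watch when a single `LabelEncoder` serves two datasets: `transform()` raises a `ValueError` for any label that was not seen during fitting. The following sketch uses invented labels to show the failure and one way around it, fitting on the union of both label sets.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for illustration only
twitter_labels = ['joy', 'sadness', 'joy']
reddit_labels = ['sadness', 'fear']   # 'fear' never appears in the Twitter labels

# Fitting on one platform only: transform() rejects labels it has not seen
enc_tw_only = LabelEncoder().fit(twitter_labels)
try:
    enc_tw_only.transform(reddit_labels)
    unseen_label_error = False
except ValueError:
    unseen_label_error = True
print('Unseen label raised ValueError:', unseen_label_error)

# Fitting on the union of both label sets avoids the problem
enc_union = LabelEncoder().fit(twitter_labels + reddit_labels)
print(list(enc_union.classes_))
print(enc_union.transform(reddit_labels))
```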

## 5. Train/Test Split and Oversampling

Each dataset is split 80/20 into training and test sets. Oversampling with `RandomOverSampler` is applied **only to the training set**, so the test set keeps its original class distribution and no duplicated rows leak into it.

In [None]:
X_tw_train, X_tw_test, y_tw_train, y_tw_test = train_test_split(
    X_twitter, y_twitter_encoded, test_size=0.2, random_state=42)

X_rd_train, X_rd_test, y_rd_train, y_rd_test = train_test_split(
    X_reddit, y_reddit_encoded, test_size=0.2, random_state=42)

ros = RandomOverSampler(random_state=42)
X_tw_train_res, y_tw_train_res = ros.fit_resample(X_tw_train, y_tw_train)
X_rd_train_res, y_rd_train_res = ros.fit_resample(X_rd_train, y_rd_train)

print(f'Twitter train (after oversampling): {X_tw_train_res.shape}')
print(f'Reddit train (after oversampling):  {X_rd_train_res.shape}')
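
Conceptually, random oversampling duplicates minority-class rows until every class matches the largest one. The sketch below is a plain-Python illustration of that idea on invented labels. It is not imbalanced-learn's implementation, and `random_oversample` is a hypothetical helper. It also shows why resampling after the split keeps the test rows untouched.

```python
import random
from collections import Counter

def random_oversample(labels, seed=42):
    """Return row indices in which every class appears as often as the largest class.

    Minority-class rows are duplicated by sampling with replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(rows) for rows in by_class.values())

    resampled = []
    for rows in by_class.values():
        resampled.extend(rows)
        # Top up with random duplicates until the class reaches the target size
        resampled.extend(rng.choices(rows, k=target - len(rows)))
    return resampled

# Toy labels and a fixed toy split, invented for illustration only
labels = ['joy'] * 6 + ['sadness'] * 3 + ['fear'] * 1
train_idx, test_idx = list(range(0, 8)), list(range(8, 10))

# Resample the training rows only, then map back to the original row indices
train_labels = [labels[i] for i in train_idx]
picked = random_oversample(train_labels)
balanced_train_idx = [train_idx[i] for i in picked]

print(Counter(labels[i] for i in balanced_train_idx))
print('Overlap with test rows:', set(balanced_train_idx) & set(test_idx))
```

Because only training indices are duplicated, no copy of a test row can end up in the training set.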

### Class Distribution Before and After Oversampling

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

n_classes = len(label_encoder.classes_)

for ax, (y_before, y_after, title) in zip(axes, [
    (y_tw_train, y_tw_train_res, 'Twitter'),
    (y_rd_train, y_rd_train_res, 'Reddit')
]):
    # minlength keeps one bar per encoded class even if a class is absent from a split
    before = np.bincount(y_before, minlength=n_classes)
    after = np.bincount(y_after, minlength=n_classes)
    x = np.arange(n_classes)
    w = 0.35
    ax.bar(x, before, w, label='Before', color='royalblue')
    ax.bar(x + w, after, w, label='After', color='orange')
    ax.set_xticks(x + w/2)
    ax.set_xticklabels(label_encoder.classes_, rotation=45)
    ax.set_title(f'{title} — Class Distribution Before/After Oversampling')
    ax.set_ylabel('Samples')
    ax.legend()

plt.tight_layout()
plt.show()

## 6. Helper Functions

In [None]:
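def evaluate_model(model, X_train, y_train, X_test, y_test, title):
    """Fit `model`, print accuracy and a per-class report, and return the test predictions."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f'\n=== {title} ===')
    print(f'Accuracy: {acc:.4f}')
    # Pass labels explicitly so the report still works if a class is missing from y_test
    labels = np.arange(len(label_encoder.classes_))
    print(classification_report(y_test, y_pred, labels=labels,
                                target_names=label_encoder.classes_, zero_division=0))
    return y_pred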

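def plot_confusion_matrix(y_true, y_pred, title, cmap='Blues', class_names=None):
    """Plot a confusion matrix with one row and one column per encoded class."""
    if class_names is None:
        class_names = label_encoder.classes_
    # Fix the label order so the matrix always matches the tick labels
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(len(class_names)))
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap=cmap,
                xticklabels=class_names,
                yticklabels=class_names)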
    plt.title(title)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()

## 7. Single-Platform Evaluation

Each model is trained and tested on the same platform to establish baseline performance. Training uses the oversampled training split; testing uses the untouched 20% test split.

### 7a. SVM — Twitter

In [None]:
svm_twitter = SVC(kernel='linear', random_state=42)
y_tw_pred_svm = evaluate_model(svm_twitter, X_tw_train_res, y_tw_train_res, X_tw_test, y_tw_test,
                                'SVM — Twitter (Single-Platform)')
plot_confusion_matrix(y_tw_test, y_tw_pred_svm, 'SVM Confusion Matrix — Twitter', cmap='Blues')

### 7b. SVM — Reddit

In [None]:
svm_reddit = SVC(kernel='linear', random_state=42)
y_rd_pred_svm = evaluate_model(svm_reddit, X_rd_train_res, y_rd_train_res, X_rd_test, y_rd_test,
                                'SVM — Reddit (Single-Platform)')
plot_confusion_matrix(y_rd_test, y_rd_pred_svm, 'SVM Confusion Matrix — Reddit', cmap='Reds')

### 7c. Naive Bayes — Twitter

In [None]:
nb_twitter = MultinomialNB()
y_tw_pred_nb = evaluate_model(nb_twitter, X_tw_train_res, y_tw_train_res, X_tw_test, y_tw_test,
                               'Naive Bayes — Twitter (Single-Platform)')
plot_confusion_matrix(y_tw_test, y_tw_pred_nb, 'Naive Bayes Confusion Matrix — Twitter', cmap='Purples')

### 7d. Naive Bayes — Reddit

In [None]:
nb_reddit = MultinomialNB()
y_rd_pred_nb = evaluate_model(nb_reddit, X_rd_train_res, y_rd_train_res, X_rd_test, y_rd_test,
                               'Naive Bayes — Reddit (Single-Platform)')
plot_confusion_matrix(y_rd_test, y_rd_pred_nb, 'Naive Bayes Confusion Matrix — Reddit', cmap='Greens')

## 8. Cross-Platform Evaluation

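To test whether emotion patterns learned on one platform generalise to the other, we:
1. Fit TF-IDF on **combined** Twitter + Reddit data so that both platforms share one feature space
2. Train an SVM on the 80% training split of one platform and test it on the 20% test split of the other

This assessment of cross-platform generalisability is the core contribution of this project. Two caveats apply when comparing against Section 7: the combined vocabulary and IDF weights include text from the test platform, and class imbalance is handled here with `class_weight='balanced'` rather than random oversampling.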

In [None]:
# Fit TF-IDF on combined data for shared feature space
combined_text = mh_twitter['clean_text'].tolist() + mh_reddit['clean_text'].tolist()
vectorizer_combined = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
vectorizer_combined.fit(combined_text)

X_twitter_combined = vectorizer_combined.transform(mh_twitter['clean_text'])
X_reddit_combined  = vectorizer_combined.transform(mh_reddit['clean_text'])

# Fit label encoder on combined labels
combined_labels = mh_twitter['emotion'].tolist() + mh_reddit['emotion'].tolist()
label_encoder_combined = LabelEncoder()
label_encoder_combined.fit(combined_labels)

y_twitter_enc = label_encoder_combined.transform(mh_twitter['emotion'])
y_reddit_enc  = label_encoder_combined.transform(mh_reddit['emotion'])

# Split into train/test
X_tw_train_c, X_tw_test_c, y_tw_train_c, y_tw_test_c = train_test_split(
    X_twitter_combined, y_twitter_enc, test_size=0.2, random_state=42)

X_rd_train_c, X_rd_test_c, y_rd_train_c, y_rd_test_c = train_test_split(
    X_reddit_combined, y_reddit_enc, test_size=0.2, random_state=42)

print(f'Combined vocabulary size: {len(vectorizer_combined.vocabulary_):,}')
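
A stricter variant of the cross-platform setup, shown here only as a sketch on invented sentences, fits the vectoriser on the source platform alone and reuses it on the target. This is an alternative to the combined fit above, not what the notebook does. It keeps target-platform text out of the vocabulary and IDF weights, at the cost of ignoring words that occur only on the target platform.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Toy data, invented for illustration only
source_text = ["so sad and lonely", "sad all day", "happy and excited", "what a happy day"]
source_y = ["sadness", "sadness", "joy", "joy"]
target_text = ["i have felt sad for months", "finally feeling happy again"]
target_y = ["sadness", "joy"]

# Fit the vectoriser on the source platform only, then reuse it on the target
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
X_source = vectorizer.fit_transform(source_text)
X_target = vectorizer.transform(target_text)   # words unseen in the source are ignored

model = SVC(kernel='linear', class_weight='balanced', random_state=42)
model.fit(X_source, source_y)
target_pred = model.predict(X_target)

print('Target predictions:', list(target_pred))
print(f'Cross-platform accuracy: {accuracy_score(target_y, target_pred):.4f}')
```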

### 8a. Train on Twitter, Test on Reddit

In [None]:
svm_cross_tw = SVC(kernel='linear', class_weight='balanced', random_state=42)
svm_cross_tw.fit(X_tw_train_c, y_tw_train_c)

y_rd_pred_cross = svm_cross_tw.predict(X_rd_test_c)
print('SVM — Train on Twitter, Test on Reddit:')
print(f'Accuracy: {accuracy_score(y_rd_test_c, y_rd_pred_cross):.4f}')
print(classification_report(y_rd_test_c, y_rd_pred_cross, target_names=label_encoder_combined.classes_))

plot_confusion_matrix(y_rd_test_c, y_rd_pred_cross,
                      'SVM — Train: Twitter | Test: Reddit', cmap='RdBu')

### 8b. Train on Reddit, Test on Twitter

In [None]:
svm_cross_rd = SVC(kernel='linear', class_weight='balanced', random_state=42)
svm_cross_rd.fit(X_rd_train_c, y_rd_train_c)

y_tw_pred_cross = svm_cross_rd.predict(X_tw_test_c)
print('SVM — Train on Reddit, Test on Twitter:')
print(f'Accuracy: {accuracy_score(y_tw_test_c, y_tw_pred_cross):.4f}')
print(classification_report(y_tw_test_c, y_tw_pred_cross, target_names=label_encoder_combined.classes_))

plot_confusion_matrix(y_tw_test_c, y_tw_pred_cross,
                      'SVM — Train: Reddit | Test: Twitter', cmap='PiYG')

## 9. Results Summary

This section collects the test accuracy of all six configurations in a single table.

In [None]:
results = pd.DataFrame({
    'Setup': [
        'SVM — Twitter (Single)',
        'SVM — Reddit (Single)',
        'Naive Bayes — Twitter (Single)',
        'Naive Bayes — Reddit (Single)',
        'SVM — Train Twitter, Test Reddit (Cross)',
        'SVM — Train Reddit, Test Twitter (Cross)'
    ],
    'Accuracy': [
        accuracy_score(y_tw_test, y_tw_pred_svm),
        accuracy_score(y_rd_test, y_rd_pred_svm),
        accuracy_score(y_tw_test, y_tw_pred_nb),
        accuracy_score(y_rd_test, y_rd_pred_nb),
        accuracy_score(y_rd_test_c, y_rd_pred_cross),
        accuracy_score(y_tw_test_c, y_tw_pred_cross)
    ]
})

results['Accuracy'] = results['Accuracy'].apply(lambda x: f'{x:.4f}')
print(results.to_string(index=False))

## 10. Key Findings

- **SVM outperforms Naive Bayes in the single-platform setting:** this holds on both Twitter and Reddit. Naive Bayes was not evaluated cross-platform in Section 8, so the comparison does not extend to that setting
- **Reddit shows more genuine mental health content:** sadness is the dominant emotion, consistent with the focused, topic-specific nature of mental health subreddits
- **Twitter data is noisier:** joy dominates, likely because mental health hashtags are also used by people who are not describing their own experience
- **Cross-platform generalisation is limited:** models trained on one platform lose significant accuracy when tested on the other (compare the single and cross rows in Section 9), which points to linguistic and contextual differences between the platforms
- **Sadness is the most cross-platform-detectable emotion:** it keeps relatively strong recall in the cross-platform reports in Section 8, which suggests its linguistic markers are more consistent across the two platforms
- **Future improvements:** fine-tuned transformer models (e.g. BERT, RoBERTa), better data filtering, and domain adaptation techniques could improve cross-platform performance
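
Per-class claims such as the one about sadness recall are easiest to check by placing single-platform and cross-platform recall side by side. The sketch below does this with `recall_score(average=None)` on invented labels and predictions. The values are placeholders, not results from this project, and `per_class_recall` is a hypothetical helper.

```python
from sklearn.metrics import recall_score

# Toy labels and predictions, invented for illustration only
classes = ['anger', 'joy', 'sadness']
y_true = ['sadness', 'sadness', 'joy', 'joy', 'anger']
pred_single = ['sadness', 'sadness', 'joy', 'joy', 'anger']    # stand-in for a same-platform model
pred_cross = ['sadness', 'sadness', 'sadness', 'joy', 'joy']   # stand-in for a cross-platform model

def per_class_recall(y_true, y_pred):
    """Map each class name to its recall."""
    scores = recall_score(y_true, y_pred, labels=classes, average=None, zero_division=0)
    return dict(zip(classes, scores))

single = per_class_recall(y_true, pred_single)
cross = per_class_recall(y_true, pred_cross)
for name in classes:
    print(f'{name:8s} single={single[name]:.2f}  cross={cross[name]:.2f}')
```

In the notebook, the same comparison could be made between `y_tw_test` / `y_tw_pred_svm` and `y_tw_test_c` / `y_tw_pred_cross`, using the encoded class indices as `labels`.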