<a href="https://colab.research.google.com/github/Meet-Amin/Movie-Recommender-System/blob/main/reddit_mental_health_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Reddit Mental Health Classification Project
# Meet Amin

This project aims to predict mental health conditions based on textual data collected from Reddit. Natural language processing techniques are applied to preprocess and extract meaningful features from user posts. Machine learning models are then trained to classify and predict potential mental health risks. The study demonstrates how social media data can be leveraged for early detection and awareness of mental health concerns.

> **Course Project:** Text Classification with Hugging Face & Transformers  
> **Dataset:** `kamruzzaman-asif/reddit-mental-health-classification` (Hugging Face Datasets)

---

## Table of Contents
1. [Problem Definition](#problem)
2. [Data Understanding](#data-understanding)
3. [Data Preparation](#data-preparation)
4. [Modeling](#modeling)
5. [Evaluation](#evaluation)
6. [Conclusion & Future Work](#conclusion)



## 1. Problem Definition <a id="problem"></a>

Mental health is a sensitive topic, and people frequently use online communities such as Reddit to share their emotions, struggles, and experiences.  
However, the volume of posts is too large to manually identify which ones may need urgent attention or belong to a specific mental health category.

In this project, the goal is to:

- Build a **text classification model** that predicts a label for each Reddit post related to mental health.
- Use the publicly available **`kamruzzaman-asif/reddit-mental-health-classification`** dataset from Hugging Face.
- Fine-tune a **transformer-based model (DistilBERT)** for this classification task.
- Evaluate the model with appropriate metrics and visualizations.

This notebook follows a standard machine learning workflow:

1. **Data Understanding:** Explore the dataset, label distribution, and basic statistics.  
2. **Data Preparation:** Clean and split the data into training and validation sets.  
3. **Modeling:** Fine-tune a pretrained language model for text classification.  
4. **Evaluation:** Measure performance using accuracy, precision, recall, and F1-score with clear charts.  
5. **Conclusion:** Summarize the results and discuss limitations and future improvements.


## 2. Data Understanding <a id="data-understanding"></a>

In [None]:

# =======================
# 2.1 Import Libraries
# =======================

# Basic libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Hugging Face datasets
from datasets import load_dataset

# Modeling utilities
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Better display options
pd.set_option("display.max_colwidth", 200)
sns.set(style="whitegrid")

# Make plots a bit larger by default
plt.rcParams["figure.figsize"] = (8, 5)


In [None]:

# =======================
# 2.2 Load the Dataset
# =======================

# Load the Reddit mental health dataset from Hugging Face
ds = load_dataset("kamruzzaman-asif/reddit-mental-health-classification")

# Show available splits and general info
ds


In [None]:

# =======================
# 2.3 Convert to pandas and Inspect
# =======================

# Convert the train split to a pandas DataFrame for exploration
df = ds["train"].to_pandas()

print("Shape of training data:", df.shape)
df.head()


In [None]:

# Overview of columns and data types
df.info()


In [None]:

# Quick descriptive statistics for all columns (include="all" covers text as well as numeric columns)
df.describe(include="all")



### 2.4 Encode Labels as Integers (Important Fix)

The original dataset stores labels as **strings**.  
However, PyTorch and Hugging Face's Trainer expect labels to be **integers**.

Here, we:

- Create a mapping from string labels → integer IDs (`label2id`).  
- Store the reverse mapping in `id2label`.  
- Replace the `label` column with the numeric ID version.


In [None]:

# =======================
# 2.4 Encode labels as integers
# =======================

# Keep a copy of the original string labels
df["label_str"] = df["label"]

# Get sorted list of unique string labels
unique_labels = sorted(df["label_str"].unique())
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {v: k for k, v in label2id.items()}

print("Label mapping (string -> id):")
print(label2id)

# Replace 'label' column with integer IDs
df["label"] = df["label_str"].map(label2id)

# Check the first few rows
df[["text", "label_str", "label"]].head()



### 2.5 Target Labels and Basic Statistics

In this section, we look at:

- How many examples we have in the training data.
- How many **unique labels/classes** are present.
- Whether the classes are **balanced or imbalanced**.


In [None]:

# Count of each numeric label
label_counts = df["label"].value_counts().sort_index()
label_counts


In [None]:

# Bar plot of label distribution (absolute counts)
plt.figure()
sns.barplot(x=label_counts.index, y=label_counts.values)
plt.xlabel("Label (id)")
plt.ylabel("Number of examples")
plt.title("Label Distribution (Counts)")
plt.tight_layout()
plt.show()

# Bar plot of label distribution (percentage)
label_percent = (label_counts / label_counts.sum()) * 100

plt.figure()
sns.barplot(x=label_percent.index, y=label_percent.values)
plt.xlabel("Label (id)")
plt.ylabel("Percentage (%)")
plt.title("Label Distribution (Percentage)")
plt.tight_layout()
plt.show()



### 2.6 Text Length Analysis

Now we explore how long the posts are. This can give us a sense of:

- Whether posts are typically short (a few words) or long (several paragraphs).
- How to choose a **maximum sequence length** for the transformer model.


In [None]:

# Compute length of each post in characters
df["text_len"] = df["text"].astype(str).str.len()

print("Text length (characters) - summary:")
df["text_len"].describe()


In [None]:

# Histogram of text length
plt.figure()
sns.histplot(df["text_len"], bins=50, kde=True)
plt.xlabel("Text length (characters)")
plt.ylabel("Frequency")
plt.title("Distribution of Post Lengths")
plt.tight_layout()
plt.show()


In [None]:

# Boxplot of text length per numeric label
plt.figure(figsize=(9, 5))
sns.boxplot(x="label", y="text_len", data=df)
plt.xlabel("Label (id)")
plt.ylabel("Text length (characters)")
plt.title("Text Length by Label")
plt.tight_layout()
plt.show()
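
Character counts are only a proxy for what the model sees, because the transformer truncates by tokens rather than characters. The sketch below shows one way to turn a length distribution into a maximum sequence length: take the length that covers a chosen share of posts and cap it at the model limit. The helper name `suggest_max_length`, the 95% coverage, the cap of 512, and the toy posts are all assumptions for illustration, and whitespace splitting is only a stand-in for the real tokenizer.

```python
import math

def suggest_max_length(texts, tokenize=str.split, coverage=0.95, cap=512):
    """Return a token length that covers `coverage` of the texts, capped at `cap`.

    `tokenize` is any callable mapping a string to a list of tokens. Whitespace
    splitting is a stand-in; subword tokenizers usually produce more tokens.
    """
    lengths = sorted(len(tokenize(t)) for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Nearest-rank percentile: smallest length with at least `coverage`
    # of the texts at or below it
    rank = max(1, math.ceil(coverage * len(lengths)))
    return min(lengths[rank - 1], cap)

# Toy posts (made up for illustration): 18 short, 1 medium, 1 very long
posts = ["i feel fine"] * 18 + ["word " * 40, "word " * 900]
suggested = suggest_max_length(posts)
print("Suggested max length:", suggested)
```

Once the tokenizer from Section 4 is loaded, passing `tokenize=tokenizer.tokenize` and `df["text"]` would give an estimate in real subword tokens. Special tokens added by the model are not counted here, so a small margin on top of the result is a reasonable precaution.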



### 2.7 Sample Posts

It is always useful to **read a few raw examples** to understand what the data looks like in practice.


In [None]:

# Show a few random examples from the dataset
df.sample(5)[["text", "label_str", "label"]]



**Observations (to fill in after running the notebook):**

- Total number of training samples: *write here*.  
- Number of unique labels: *write here*.  
- Are some labels much more frequent than others? *comment on imbalance*.  
- Typical length of posts (median): *write value from summary output*.  
- Any interesting patterns from the sample posts (for example, posts containing strong emotional language, questions, etc.).


## 3. Data Preparation <a id="data-preparation"></a>


The goal of data preparation is to:

1. Clean the text minimally (lowercasing, removing URLs, and collapsing extra whitespace).  
2. Create **training** and **validation** splits.  
3. Prepare the data in a format compatible with Hugging Face Transformers.

We keep the text cleaning simple to avoid accidentally removing important context.


In [None]:

# =======================
# 3.1 Basic Text Cleaning
# =======================
import re

def clean_text(t):
    """Basic cleaning: lowercase, remove URLs, and extra whitespace."""
    if not isinstance(t, str):
        return ""
    t = t.lower()
    # Remove URLs
    t = re.sub(r"http\S+|www\.\S+", "", t)
    # Remove extra spaces
    t = re.sub(r"\s+", " ", t).strip()
    return t

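# Apply cleaning to the text column.
# clean_text already maps non-string values (e.g. NaN) to "", whereas
# astype(str) would first turn them into the literal string "nan".
df["clean_text"] = df["text"].apply(clean_text)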

df[["text", "clean_text"]].head()


In [None]:

# =======================
# 3.2 Train / Validation Split
# =======================

train_df, val_df = train_test_split(
    df,
    test_size=0.2,
    stratify=df["label"],
    random_state=42
)

print("Training size:", len(train_df))
print("Validation size:", len(val_df))


In [None]:

# Check label distribution in train and validation sets
print("Train label distribution:")
print(train_df["label"].value_counts(normalize=True).sort_index())

print("\nValidation label distribution:")
print(val_df["label"].value_counts(normalize=True).sort_index())



**Note:**

- We used **stratified splitting**, which keeps the proportion of each label similar in both training and validation sets.
- This makes the validation metrics more reliable, because rare classes appear in the validation set in roughly the same proportion as in the training set.
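
To make the idea of stratification concrete, here is a minimal pure-Python sketch that splits each class separately so every class contributes about the same fraction to the validation set. It is a simplified illustration, not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with about `test_size` of every class in validation."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                       # shuffle within the class only
        n_val = round(len(indices) * test_size)    # per-class validation share
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy imbalanced labels: 80% class 0, 15% class 1, 5% class 2
labels = [0] * 80 + [1] * 15 + [2] * 5
train_idx, val_idx = stratified_split(labels)

val_counts = Counter(labels[i] for i in val_idx)
print("Train / validation sizes:", len(train_idx), len(val_idx))
print("Validation counts per class:", dict(val_counts))
```

Even the rarest class keeps a representative in the validation set, which a purely random split of a small, imbalanced dataset does not guarantee.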


## 4. Modeling <a id="modeling"></a>


In this section we:

1. Convert the pandas DataFrames into Hugging Face `Dataset` objects.  
2. Tokenize the text using a pretrained tokenizer (`distilbert-base-uncased`).  
3. Fine-tune a `DistilBertForSequenceClassification` model on our training data.


In [None]:

# This cell may not be needed on some platforms
!pip install -q transformers accelerate


In [None]:
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import torch

In [None]:

# =======================
# 4.1 Create HF Datasets and Tokenizer
# =======================

train_hf = Dataset.from_pandas(train_df[["clean_text", "label"]])
val_hf   = Dataset.from_pandas(val_df[["clean_text", "label"]])

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_batch(batch):
    return tokenizer(
        batch["clean_text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

train_tokenized = train_hf.map(tokenize_batch, batched=True)
val_tokenized   = val_hf.map(tokenize_batch, batched=True)

# Remove the original text column from the tokenized dataset
train_tokenized = train_tokenized.remove_columns(["clean_text"])
val_tokenized   = val_tokenized.remove_columns(["clean_text"])

# Set format for PyTorch
train_tokenized.set_format("torch")
val_tokenized.set_format("torch")


In [None]:

# =======================
# 4.2 Load the Classification Model (uses label2id / id2label)
# =======================

num_labels = len(label2id)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)


## 5. Evaluation <a id="evaluation"></a>


We will train the model and then evaluate it using:

- **Accuracy**  
- **Precision, Recall, F1-score (weighted)**  
- A **confusion matrix** to visualize where the model gets confused between classes.


In [None]:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }


In [None]:

# =======================
# 5.1 Training Configuration
# =======================

batch_size = 16

training_args = TrainingArguments(
    output_dir="./reddit-mental-health-model",
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=50,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    compute_metrics=compute_metrics
)

In [None]:
import torch
from transformers import TrainingArguments, Trainer

# This cell replaces the training_args and trainer from 5.1 with a lighter demo setup.

# 1. Use a smaller subset so it trains fast
# Adjust the numbers if you want a bit more data
small_train_dataset = train_tokenized.select(range(2000))
small_eval_dataset  = val_tokenized.select(range(500))

# 2. Define lighter training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,                 # 1 epoch is enough for demo
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,

    logging_steps=20,
    eval_strategy="steps",              # called evaluation_strategy in older transformers releases
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,

    report_to="none",                   # disable wandb, etc.
    fp16=torch.cuda.is_available(),     # mixed precision only on a CUDA GPU; not supported on CPU
    # max_steps is intentionally not set: a positive value overrides num_train_epochs
    # instead of capping it (500 steps would be about 4 passes over 2,000 examples
    # at batch size 16 on one device). The Trainer uses a GPU automatically if one is available.
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    tokenizer=tokenizer,                # newer transformers releases call this processing_class
    compute_metrics=compute_metrics,
)

# =======================
# 5.2 Train the Model
# =======================
train_result = trainer.train()
train_result

In [None]:

# =======================
# 5.3 Final Evaluation on Validation Set
# =======================

eval_results = trainer.evaluate()
eval_results


In [None]:
from IPython.display import display



**How to interpret these results (to discuss in your own words):**

- **Accuracy:** Overall proportion of correct predictions.  
- **Precision:** Of the posts predicted in a certain class, how many were actually in that class.  
- **Recall:** Of the posts that truly belong to a class, how many did the model correctly find.  
- **F1-score:** Harmonic mean of precision and recall (balances both).  
- **Confusion matrix:** Shows which classes are often confused with each other.
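
The definitions above can be checked by hand on a small example. The sketch below builds a confusion matrix and the weighted precision, recall, and F1 in plain Python; the helper names and the toy labels are assumptions made up for illustration, and undefined ratios are counted as 0, mirroring the `zero_division=0` setting used in `compute_metrics`.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Confusion matrix: rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def weighted_scores(matrix):
    """Support-weighted precision, recall, and F1 from a confusion matrix."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        tp = matrix[c][c]
        support = sum(matrix[c])                    # posts that truly belong to class c
        predicted = sum(row[c] for row in matrix)   # posts predicted as class c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                    # class share of the data
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# Toy labels for three classes (made up for illustration)
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 2, 2, 1]

matrix = confusion_counts(y_true, y_pred, n_classes=3)
accuracy = sum(matrix[c][c] for c in range(3)) / len(y_true)
precision, recall, f1 = weighted_scores(matrix)

for row in matrix:
    print(row)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Weighted recall always equals accuracy, which is why those two numbers coincide in the evaluation output. For the real model, the same inputs could be obtained from `trainer.predict(...)`, using the argmax of its `predictions` as `y_pred` and its `label_ids` as `y_true`.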


## 6. Conclusion & Future Work <a id="conclusion"></a>


### 6.1 Summary of Findings

In this project, we:

1. Loaded and explored the **Reddit mental health** dataset.  
2. Observed the **label distribution**, typical **post lengths**, and sample posts.  
3. Encoded string labels into **integer IDs** to make them compatible with PyTorch.  
4. Performed minimal text cleaning and created **train/validation** splits.  
5. Fine-tuned a **DistilBERT** model for sequence classification.  
6. Evaluated performance using accuracy, precision, recall, F1-score, and a confusion matrix.

After running the notebook, you should summarize here in your own words:

- The final evaluation metrics (accuracy, F1, etc.).  
- Which classes the model handled well and which were more difficult.  
- Any signs of class imbalance impacting performance.



### 6.2 Limitations

Some possible limitations to mention:

- **Dataset Size & Bias:** The dataset may not fully represent all types of mental health content on the internet.  
- **Label Noise:** Labels might not be perfect, especially if they were created automatically or by a small group of annotators.  
- **Context:** Reddit posts can be highly contextual; sometimes the label may depend on previous posts or comments that are not included.  
- **Model Complexity:** DistilBERT is powerful but still has limitations when detecting subtle emotions or sarcasm.



### 6.3 Future Work

Ideas to improve this project:

- Experiment with **different transformer models** (e.g., BERT, RoBERTa).  
- Use **class weighting** or **focal loss** if the dataset is highly imbalanced.  
- Try more advanced **text cleaning** or domain-specific pre-processing.  
- Perform **hyperparameter tuning** (learning rate, batch size, epochs).  
- Deploy the model as a simple **web app** to classify new Reddit-style posts in real time.
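
As a starting point for the class weighting idea, the sketch below computes "balanced" weights, where each class is weighted by `n_samples / (n_classes * class_count)` so that rare classes count more in the loss. The helper name `balanced_class_weights` and the toy label counts are assumptions for illustration only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Toy imbalanced labels: 80% class 0, 15% class 1, 5% class 2
labels = [0] * 80 + [1] * 15 + [2] * 5
weights = balanced_class_weights(labels)

for class_id, weight in weights.items():
    print(f"class {class_id}: weight {weight:.3f}")
```

In this notebook, `train_df["label"].tolist()` could be passed in, and the resulting weights (ordered by class ID) could then be supplied as the `weight` tensor of `torch.nn.CrossEntropyLoss` inside a `Trainer` subclass that overrides `compute_loss`. That wiring is not shown here and would need to be tested against the installed transformers version.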
