<a href="https://colab.research.google.com/github/Malumma01/reddit-stress-detection-nlp/blob/main/stressnlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project applies Natural Language Processing (NLP) techniques to Reddit posts to classify whether the author of a post is expressing stress.


## 1 Importing the necessary libraries


In [None]:
# import the necessary libraries
import pandas as pd
import numpy as np
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer







[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2 Loading the dataset

In [None]:
# Loading the dataset
df = pd.read_csv('/content/stress.csv')

In [None]:
df.head()

Unnamed: 0,subreddit,post_id,sentence_range,text,id,label,confidence,social_timestamp,social_karma,syntax_ari,...,lex_dal_min_pleasantness,lex_dal_min_activation,lex_dal_min_imagery,lex_dal_avg_activation,lex_dal_avg_imagery,lex_dal_avg_pleasantness,social_upvote_ratio,social_num_comments,syntax_fk_grade,sentiment
0,ptsd,8601tu,"(15, 20)","He said he had not felt that way before, sugge...",33181,1,0.8,1521614353,5,1.806818,...,1.0,1.125,1.0,1.77,1.52211,1.89556,0.86,1,3.253573,-0.002742
1,assistance,8lbrx9,"(0, 5)","Hey there r/assistance, Not sure if this is th...",2606,0,1.0,1527009817,4,9.429737,...,1.125,1.0,1.0,1.69586,1.62045,1.88919,0.65,2,8.828316,0.292857
2,ptsd,9ch1zh,"(15, 20)",My mom then hit me with the newspaper and it s...,38816,1,0.8,1535935605,2,7.769821,...,1.0,1.1429,1.0,1.83088,1.58108,1.85828,0.67,0,7.841667,0.011894
3,relationships,7rorpp,"[5, 10]","until i met my new boyfriend, he is amazing, h...",239,1,0.6,1516429555,0,2.667798,...,1.0,1.125,1.0,1.75356,1.52114,1.98848,0.5,5,4.104027,0.141671
4,survivorsofabuse,9p2gbc,"[0, 5]",October is Domestic Violence Awareness Month a...,1421,1,0.8,1539809005,24,7.554238,...,1.0,1.125,1.0,1.77644,1.64872,1.81456,1.0,1,7.910952,-0.204167


In [None]:
df.columns


Index(['subreddit', 'post_id', 'sentence_range', 'text', 'id', 'label',
       'confidence', 'social_timestamp', 'social_karma', 'syntax_ari',
       ...
       'lex_dal_min_pleasantness', 'lex_dal_min_activation',
       'lex_dal_min_imagery', 'lex_dal_avg_activation', 'lex_dal_avg_imagery',
       'lex_dal_avg_pleasantness', 'social_upvote_ratio',
       'social_num_comments', 'syntax_fk_grade', 'sentiment'],
      dtype='object', length=116)

In [None]:
df.shape


(2838, 116)

## 3 Data Cleaning

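Stress is often expressed through emotional words, repetition, and negativity. Cleaning the text removes noise such as links, punctuation, and casing differences so that a model can pick up these patterns more easily.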

In [None]:
# checking for null values
df.isnull().sum()

Unnamed: 0,0
subreddit,0
post_id,0
sentence_range,0
text,0
id,0
...,...
lex_dal_avg_pleasantness,0
social_upvote_ratio,0
social_num_comments,0
syntax_fk_grade,0


In [None]:
# drop rows that contain missing values
df.dropna(inplace=True)

In [None]:
# normalize text
import re

def normalize_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove links
    text = re.sub(r'[^a-z\s]', ' ', text)       # keep only letters
    text = re.sub(r'\s+', ' ', text).strip()    # remove extra spaces
    return text

df['normalized_text'] = df['text'].apply(normalize_text)
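
To make the effect of `normalize_text` concrete, the sketch below runs the same function on a made-up sentence (the sample string is an assumption, not a post from the dataset).

```python
import re

def normalize_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove links
    text = re.sub(r'[^a-z\s]', ' ', text)        # keep only letters
    text = re.sub(r'\s+', ' ', text).strip()     # collapse extra spaces
    return text

# Hypothetical example post
sample = "I can't sleep!!! Check https://example.com NOW... 3am again :("
normalized = normalize_text(sample)
print(normalized)  # i can t sleep check now am again
```

Note that the apostrophe in "can't" is replaced by a space, leaving the fragments "can" and "t". Both appear in NLTK's English stop word list, so the negation in such contractions is likely lost in the later steps. Expanding contractions before normalizing would be one possible way to keep it.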


In [None]:
# tokenization
df['tokens'] = df['normalized_text'].apply(lambda x: x.split())


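Stop word removal drops very common words that add little meaning. The negations "not" and "no" are taken out of the stop word list first, because they can reverse the meaning of a phrase.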

In [None]:
# removing stopwords

stop_words = set(stopwords.words('english'))
stop_words.discard('not')
stop_words.discard('no')

df['tokens'] = df['tokens'].apply(
    lambda words: [w for w in words if w not in stop_words]
)
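
The short sketch below illustrates why the negations are taken out of the stop word list. It uses a small hand-picked stop word set as a stand-in for NLTK's English list, and the example sentence is made up.

```python
# Hypothetical subset standing in for stopwords.words('english')
stop_words = {"i", "am", "the", "is", "a", "not", "no", "at", "all"}

tokens = "i am not okay at all".split()

# With the full stop word set, the negation disappears.
without_negation = [w for w in tokens if w not in stop_words]

# With "not" and "no" exempted, as in the cell above, the negation survives.
kept = stop_words - {"not", "no"}
with_negation = [w for w in tokens if w not in kept]

print(without_negation)  # ['okay']
print(with_negation)     # ['not', 'okay']
```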


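Lemmatization is applied only to the text features to normalize word forms; it has no effect on the target labels used for classification. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so verb forms such as "running" are left unchanged here.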

In [None]:
# lemmatization
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

df['tokens'] = df['tokens'].apply(
    lambda words: [lemmatizer.lemmatize(w) for w in words]
)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
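# removing very short words (two characters or fewer)
# note: this also drops "no", which the stopword step above deliberately kept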
df['tokens'] = df['tokens'].apply(
    lambda words: [w for w in words if len(w) > 2]
)
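
One side effect of the length filter is worth seeing on a small example: "no" has only two characters, so it is removed here even though it was exempted from stop word removal. The sketch below uses made-up tokens. The `keep_short` whitelist is a hypothetical variant, not something the notebook does.

```python
tokens = ["no", "not", "sleep", "at", "all", "ok"]

# Filter as used above: keep tokens longer than two characters.
filtered = [w for w in tokens if len(w) > 2]

# Hypothetical variant: exempt selected short tokens from the filter.
keep_short = {"no"}
filtered_keep = [w for w in tokens if len(w) > 2 or w in keep_short]

print(filtered)       # ['not', 'sleep', 'all']
print(filtered_keep)  # ['no', 'not', 'sleep', 'all']
```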


In [None]:
# reconstruct clean text
df['clean_text'] = df['tokens'].apply(lambda x: " ".join(x))


Text preprocessing was performed using tokenization, stopword removal, and lemmatization to reduce noise and normalize the Reddit posts before feature extraction.

## 4 Feature extraction

In [None]:
# Apply TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,     # limits vocabulary size
    ngram_range=(1,1)      # unigrams only for now
)

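# note: the vectorizer is fitted on all posts, before the train/test split,
# so the test posts also contribute to the vocabulary and IDF weights
X = tfidf.fit_transform(df['clean_text'])
y = df['label']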
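
For intuition about what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF weighting on two made-up documents. It follows what is, to my understanding, scikit-learn's default recipe (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). It is an illustration only and not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

docs = ["stress work stress", "work sleep"]  # hypothetical cleaned posts
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # term count times smoothed inverse document frequency
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, c in counts.items()}
    # scale the row to unit Euclidean length
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

Terms that are frequent within a post but rare across posts ("stress" in the first document) receive the largest weights, while terms shared by every post ("work") are down-weighted.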


In [None]:
# Train–Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


Text features were extracted using TF-IDF vectorization to convert cleaned Reddit posts into numerical representations suitable for machine learning models.
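
In the cells above, the vectorizer is fitted on all posts before the split, so the test posts influence the vocabulary and IDF weights. One common way to avoid this is to split the raw text first and wrap TF-IDF and the classifier in a scikit-learn pipeline, as sketched below. The toy `texts` and `labels` are placeholders for `df['clean_text']` and `df['label']`. This is an alternative arrangement, not what the notebook ran, and the reported metrics could change under it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for df['clean_text'] and df['label']
texts = [
    "cannot sleep anxious work", "panic attack night scared",
    "overwhelmed deadline crying", "not cope pressure exam",
    "enjoyed walk park today", "great dinner friend weekend",
    "new recipe turned well", "relaxing holiday beach family",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test posts stay unseen.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training posts only, then the classifier.
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 1)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(train_texts, y_train)

predictions = pipeline.predict(test_texts)
print(len(predictions), "test posts classified")
```

A further benefit of the pipeline is that the same object can be passed to cross-validation utilities, which refit the vectorizer inside each fold.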

## 5 Modelling

In [None]:
# to build the baseline model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


## 6 Model evaluation

In [None]:
# model evaluation
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[285 120]
 [ 99 348]]
              precision    recall  f1-score   support

           0       0.74      0.70      0.72       405
           1       0.74      0.78      0.76       447

    accuracy                           0.74       852
   macro avg       0.74      0.74      0.74       852
weighted avg       0.74      0.74      0.74       852



The model achieved an overall accuracy of 74%. Assuming label 1 marks stressed posts, recall for that class was 78% (348 of 447), so the model identified most stressed posts but still missed 99 of them. Recall for the non-stressed class (label 0) was 70%. The model therefore performs reasonably well as a baseline, but further improvement is needed to reduce false negatives.
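
The headline figures can be rechecked directly from the confusion matrix printed above. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predicted labels, in the order 0 then 1) and that label 1 marks stressed posts.

```python
# Values copied from the confusion matrix output above
tn, fp = 285, 120   # true label 0: predicted 0, predicted 1
fn, tp = 99, 348    # true label 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_1 = tp / (tp + fn)       # share of label-1 posts that were found
precision_1 = tp / (tp + fp)    # share of label-1 predictions that were right
recall_0 = tn / (tn + fp)

print(f"accuracy      {accuracy:.2f}")     # 0.74
print(f"recall (1)    {recall_1:.2f}")     # 0.78
print(f"precision (1) {precision_1:.2f}")  # 0.74
print(f"recall (0)    {recall_0:.2f}")     # 0.70
```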