# Model Exploration
The objective of this project is to evaluate three approaches to classifying text by mental health category: a naive keyword baseline, a classical (non-deep-learning) machine learning approach, and a neural network-based deep learning approach.

In [None]:
# Imports
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


### Naive Approach
Classify each text into a mental health category based on the predefined keywords it contains.

In [71]:
# Load data
train_df = pd.read_csv('data/processed/train.csv')
val_df = pd.read_csv('data/processed/validation.csv')
test_df = pd.read_csv('data/processed/test.csv')

# Fill NaN values
train_df['text'] = train_df['text'].fillna('')
val_df['text'] = val_df['text'].fillna('')
test_df['text'] = test_df['text'].fillna('')

In [72]:
keywords = { 
    0: ['stress', 'tired', 'pressure', 'overworked'], # Stress
    1: ['depressed', 'depression', 'hopeless', 'suicidal', 'worthless', 'sad'], # Depression
    2: ['manic', 'bipolar', 'high energy', 'mood swing'], # Bipolar
    3: ['personality', 'narcissistic', 'borderline'],  # Personality Disorder
    4: ['anxious', 'anxiety', 'panic', 'fear', 'worry', 'nervous'] # Anxiety
}
# Keywords generated by ChatGPT on 10/27/2025

In [73]:
def naive_classifier(text):
    '''Assign label based on presence of keywords.'''
    text = str(text).lower()
    for label, words in keywords.items():
        for w in words:
            if w in text:
                return int(label)  # force integer
    return 101  # assign 101 (which is not a real class) if no keywords found
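
One limitation of the function above is that `w in text` is a substring test, so `'stress'` also fires on "distress" and `'sad'` on "sadly", and the first class in dictionary order wins whenever several classes match. The following is a minimal sketch of an alternative, assuming whole-word matching and a count-based vote are acceptable. The names `demo_keywords`, `NO_MATCH`, and `keyword_vote_classifier` are hypothetical, and the keyword list is abbreviated for illustration.

```python
import re

# Abbreviated, illustrative copy of the keyword dictionary (assumption).
demo_keywords = {
    0: ['stress', 'tired'],
    1: ['depressed', 'sad'],
    4: ['anxious', 'anxiety', 'panic'],
}
NO_MATCH = 101  # same sentinel value as the original function

# One compiled pattern per class; \b anchors each keyword at word boundaries.
patterns = {
    label: re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')
    for label, words in demo_keywords.items()
}

def keyword_vote_classifier(text):
    '''Return the label with the most whole-word keyword hits, or NO_MATCH.'''
    text = str(text).lower()
    counts = {label: len(p.findall(text)) for label, p in patterns.items()}
    best = max(counts, key=counts.get)  # ties resolve to the first label
    return best if counts[best] > 0 else NO_MATCH

print(keyword_vote_classifier('I feel anxious and panic when tired'))
print(keyword_vote_classifier('in distress, sadly'))
```

Whole-word matching is stricter than the substring test, so it may trade recall for precision: "stressed" no longer matches `'stress'` unless the inflected form is added to the list.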

In [83]:
test_df['predicted'] = test_df['text'].apply(naive_classifier) # Apply classifier
accuracy = accuracy_score(test_df['target'], test_df['predicted']) # Calculate accuracy

print('Naive Accuracy:', round(accuracy, 4))
print(classification_report(test_df['target'], test_df['predicted'], zero_division=0))
test_df[['text', 'target', 'predicted']].head(10)

Naive Accuracy: 0.3624
              precision    recall  f1-score   support

           0       0.51      0.65      0.57       118
           1       0.44      0.31      0.36       121
           2       1.00      0.30      0.46       118
           3       0.60      0.05      0.09       120
           4       0.62      0.51      0.56       119
         101       0.00      0.00      0.00         0

    accuracy                           0.36       596
   macro avg       0.53      0.30      0.34       596
weighted avg       0.63      0.36      0.41       596



| | text | target | predicted |
|---|---|---|---|
| 0 | don’t know many time i’ve fantasized relations... | 1 | 0 |
| 1 | like hard make friend keeping seems almost imp... | 3 | 101 |
| 2 | couple get pregnant easily despite trying long... | 0 | 0 |
| 3 | anyone else total fucking mess get drunk like ... | 3 | 101 |
| 4 | homework usual nothing mind actually blank sud... | 0 | 0 |
| 5 | ditto | 1 | 101 |
| 6 | fucked say liked kid way got one cant handle m... | 0 | 101 |
| 7 | im sure due disorder imposter syndrome old chr... | 3 | 101 |
| 8 | reminder caffeine form exacerbates anxiety peo... | 4 | 4 |
| 9 | considering fact ive nearly died time hospital... | 2 | 101 |


The naive baseline achieves an accuracy of 0.3624. When no keyword matches, the classifier returns the filler class 101, which never appears in the targets and therefore always counts as an error. Class 2 (bipolar) has perfect precision (1.00) but low recall (0.30), and class 3 (personality disorder) is almost never recovered (recall 0.05). Simple keyword detection captures some patterns of mental health issues, but it struggles with texts that describe a condition in more nuanced language.
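
The filler class also affects the macro averages in the report above, because the unweighted mean is taken over six classes, including class 101 with zero support. As a rough check, the sketch below recomputes the macro averages from the rounded per-class values in the report, with and without the filler class. The helper `macro_avg` is a hypothetical name, and results are approximate because the inputs are already rounded.

```python
# Per-class (precision, recall) copied from the naive classification report.
report = {
    0: (0.51, 0.65),
    1: (0.44, 0.31),
    2: (1.00, 0.30),
    3: (0.60, 0.05),
    4: (0.62, 0.51),
    101: (0.00, 0.00),  # filler class, support 0
}

def macro_avg(rows):
    '''Unweighted mean of precision and recall over the given classes.'''
    n = len(rows)
    precision = sum(p for p, _ in rows) / n
    recall = sum(r for _, r in rows) / n
    return round(precision, 2), round(recall, 2)

with_filler = macro_avg(list(report.values()))
real_only = macro_avg([v for k, v in report.items() if k != 101])
print('macro avg incl. 101:', with_filler)
print('macro avg excl. 101:', real_only)
```

Including class 101 reproduces the reported macro averages (0.53 precision, 0.30 recall), while averaging over the five real classes gives roughly 0.63 and 0.36. In scikit-learn, passing `labels=[0, 1, 2, 3, 4]` to `classification_report` should restrict the report to the real classes.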

### Classical Machine Learning Approach
Train a logistic regression classifier on TF-IDF features of the text.
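
TF-IDF weights a term by how often it occurs in a document and how rare it is across the corpus. To make that concrete, here is a minimal pure-Python sketch. It assumes the smoothed formula that, to my understanding, `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The toy corpus is invented for illustration, and plain whitespace splitting is a simplification of the library's tokenizer.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not taken from the dataset).
docs = [
    'panic attack again today',
    'feeling tired and tired of work',
    'panic before work',
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    '''L2-normalized TF-IDF weights for one tokenized document.'''
    tf = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f'{term:8s} {w:.3f}')
```

In the first toy document, "panic" appears in two of the three documents and so receives a lower weight than "attack", which appears in only one.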

In [75]:
# Load data
train_df = pd.read_csv('data/processed/train.csv')
val_df = pd.read_csv('data/processed/validation.csv')
test_df = pd.read_csv('data/processed/test.csv')

# Fill NaN values
train_df['text'] = train_df['text'].fillna('')
val_df['text'] = val_df['text'].fillna('')
test_df['text'] = test_df['text'].fillna('')

In [76]:
# TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_df['text'])
X_val = vectorizer.transform(val_df['text'])
X_test = vectorizer.transform(test_df['text'])

y_train = train_df['target']
y_val = val_df['target']
y_test = test_df['target']

In [77]:
# Train classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 1000 |


In [None]:
y_pred = model.predict(X_test) # Make predictions
accuracy = accuracy_score(y_test, y_pred) # Calculate accuracy

print("Classical ML Accuracy:", round(accuracy, 4))
print(classification_report(y_test, y_pred))

Classical ML Accuracy: 0.7215
              precision    recall  f1-score   support

           0       0.65      0.86      0.74       118
           1       0.70      0.69      0.70       121
           2       0.86      0.69      0.76       118
           3       0.73      0.68      0.70       120
           4       0.73      0.68      0.70       119

    accuracy                           0.72       596
   macro avg       0.73      0.72      0.72       596
weighted avg       0.73      0.72      0.72       596



The logistic regression model clearly outperforms the naive baseline, reaching an accuracy of 0.7215 compared with 0.3624. Its precision and recall are also more balanced across classes (per-class F1 between 0.70 and 0.76), which suggests it captures nuances in the text that keyword matching misses.
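
With 596 test examples, it is worth checking that the gap between the two accuracies is larger than sampling noise. The sketch below computes approximate 95% confidence intervals with the normal approximation for a proportion. The helper `normal_ci` is a hypothetical name, and this is only a rough check: both models are scored on the same test set, so a paired test such as McNemar's would be more appropriate.

```python
import math

n = 596  # test-set size from the classification reports
accuracies = {'naive': 0.3624, 'logistic regression': 0.7215}

def normal_ci(p, n, z=1.96):
    '''Approximate 95% confidence interval for a proportion.'''
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

intervals = {name: normal_ci(p, n) for name, p in accuracies.items()}
for name, (lo, hi) in intervals.items():
    print(f'{name}: [{lo:.3f}, {hi:.3f}]')
```

Each interval is roughly four percentage points wide on either side, and the two intervals do not overlap, which is consistent with a real difference between the models.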

### Neural Network-based Deep Learning Approach
Fine-tune a BERT transformer model to predict distress from the full text context.

### Analysis