# Model Exploration
The objective of this project is to evaluate three approaches to classifying text by mental health condition: a naive keyword approach, a classical (non-deep-learning) machine learning approach, and a neural network-based deep learning approach.

In [28]:
# Imports
import pandas as pd

### Naive Approach
Classify each text as expressing a particular kind of mental health distress if it contains predefined keywords for that category.

In [49]:
# Load data
train_df = pd.read_csv('data/processed/train.csv')
val_df = pd.read_csv('data/processed/validation.csv')
test_df = pd.read_csv('data/processed/test.csv')

In [50]:
keywords = {
    0: ['stress', 'tired', 'pressure', 'overworked'],  # Stress
    1: ['depressed', 'depression', 'hopeless', 'suicidal', 'worthless', 'sad'],  # Depression
    2: ['manic', 'bipolar', 'high energy', 'mood swing'],  # Bipolar
    3: ['personality', 'narcissistic', 'borderline'],  # Personality Disorder
    4: ['anxious', 'anxiety', 'panic', 'fear', 'worry', 'nervous'],  # Anxiety
}
# Keywords generated by ChatGPT on 10/27/2025

In [None]:
def naive_classifier(text):
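    '''Return the label of the first keyword group found in the text.

    Matching is by case-insensitive substring, and groups are checked in
    dictionary order, so earlier labels win when several groups match.
    '''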
    text = str(text).lower()
    for label, words in keywords.items():
        for w in words:
            if w in text:
                return int(label)  # force integer
    return 101  # assign 101 (which is not a real class) if no keywords found
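
Because `naive_classifier` matches substrings, a short keyword such as `'sad'` also fires inside unrelated words like "disadvantage". The sketch below is one possible variant, not the version used for the results in this notebook, that requires whole-word matches using `re` word boundaries. The names `patterns` and `naive_classifier_whole_word` are hypothetical. Note the trade-off: inflected forms such as "stressed" no longer match unless they are added to the keyword lists, so the accuracy reported below would change.

```python
import re

# Same keyword groups as in the cell above.
keywords = {
    0: ['stress', 'tired', 'pressure', 'overworked'],  # Stress
    1: ['depressed', 'depression', 'hopeless', 'suicidal', 'worthless', 'sad'],  # Depression
    2: ['manic', 'bipolar', 'high energy', 'mood swing'],  # Bipolar
    3: ['personality', 'narcissistic', 'borderline'],  # Personality Disorder
    4: ['anxious', 'anxiety', 'panic', 'fear', 'worry', 'nervous'],  # Anxiety
}

# Hypothetical helper: one compiled whole-word pattern per label.
patterns = {
    label: re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')
    for label, words in keywords.items()
}

def naive_classifier_whole_word(text):
    '''Like naive_classifier, but a keyword must match as a whole word.'''
    text = str(text).lower()
    for label, pattern in patterns.items():
        if pattern.search(text):
            return int(label)
    return 101  # same filler class as the original when nothing matches

print(naive_classifier_whole_word('a clear disadvantage'))  # 'sad' is not matched inside a longer word
```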

In [52]:
test_df['predicted'] = test_df['text'].apply(naive_classifier) # Apply classifier
accuracy = (test_df['predicted'] == test_df['target']).mean() # Calculate accuracy

print('Naive Accuracy:', round(accuracy, 4))
test_df[['text', 'target', 'predicted']].head(10)

Naive Accuracy: 0.3624


| | text | target | predicted |
|---|---|---|---|
| 0 | don’t know many time i’ve fantasized relations... | 1 | 0 |
| 1 | like hard make friend keeping seems almost imp... | 3 | 101 |
| 2 | couple get pregnant easily despite trying long... | 0 | 0 |
| 3 | anyone else total fucking mess get drunk like ... | 3 | 101 |
| 4 | homework usual nothing mind actually blank sud... | 0 | 0 |
| 5 | ditto | 1 | 101 |
| 6 | fucked say liked kid way got one cant handle m... | 0 | 101 |
| 7 | im sure due disorder imposter syndrome old chr... | 3 | 101 |
| 8 | reminder caffeine form exacerbates anxiety peo... | 4 | 4 |
| 9 | considering fact ive nearly died time hospital... | 2 | 101 |


The naive baseline achieved an accuracy of 0.3624 on the test set. When no keyword matches, the classifier returns a filler class (101) that is never a true label, so those predictions always count as incorrect. The accuracy shows that simple keyword detection captures some patterns of mental health issues, but it struggles to distinguish between conditions when the language is more nuanced.
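
Since filler predictions always count as errors, it can help to separate how often the classifier fires at all (coverage) from how often it is right when it does. The sketch below shows one way to compute that split. The function name `baseline_breakdown` is hypothetical, and the input is only the ten preview rows shown above, so the numbers illustrate the calculation and are not representative of the full test set.

```python
def baseline_breakdown(targets, predictions, fallback=101):
    """Split overall accuracy into coverage and accuracy on covered rows."""
    pairs = list(zip(targets, predictions))
    # Rows where at least one keyword matched (prediction is a real class).
    matched = [(t, p) for t, p in pairs if p != fallback]
    coverage = len(matched) / len(pairs)
    overall = sum(t == p for t, p in pairs) / len(pairs)
    matched_acc = sum(t == p for t, p in matched) / len(matched) if matched else 0.0
    return {
        'coverage': coverage,
        'overall_accuracy': overall,
        'matched_accuracy': matched_acc,
    }

# Target and predicted values copied from the ten preview rows above.
targets = [1, 3, 0, 3, 0, 1, 0, 3, 4, 2]
predictions = [0, 101, 0, 101, 0, 101, 101, 101, 4, 101]

stats = baseline_breakdown(targets, predictions)
print(stats)
```

On these ten rows the classifier fires on 4 of 10 texts and is correct on 3 of those 4. The same function should work on the full data by passing `test_df['target']` and `test_df['predicted']`.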

### Classical Machine Learning Approach
Train a classical machine learning classifier on TF-IDF features of the text.
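
To make "TF-IDF features" concrete, here is a minimal pure-Python sketch of the textbook weighting, term frequency times `log(N / df)`. It is an illustration only, not the feature pipeline of this project: the function name `tf_idf` and the toy corpus are hypothetical. In practice a library vectorizer such as scikit-learn's `TfidfVectorizer` would likely be used instead. Its defaults apply a smoothed IDF and L2 normalization, so its values differ from the ones computed here.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, not taken from the dataset.
docs = [
    ['panic', 'attack', 'panic'],
    ['feel', 'sad'],
    ['panic', 'sad', 'sad'],
]

weights = tf_idf(docs)
print(weights[0])
```

In the first document, "attack" gets a higher weight than "panic" even though "panic" appears twice, because "attack" occurs in only one document. A term that appears in every document gets a weight of zero, which is what lets a downstream classifier focus on class-specific vocabulary.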

###ANALYSIS###

### Neural Network-based Deep Learning Approach
Fine-tune a BERT transformer model to predict distress from the full text context.

###ANALYSIS###