# Reddit Depression Detection Final Project
Link to the paper: https://dl.acm.org/doi/pdf/10.1145/3578503.3583621

## Project explanation
- Explore 13 specific symptoms of depression through language gathered from Reddit subreddits.
- Goal: produce models that can detect these symptoms in text, which could help with early detection of mental health issues.
- Use LDA topic distributions and RoBERTa embeddings to extract deeper language patterns that help predict symptoms in text data.
- Use a random forest classifier to evaluate both feature sets.

## Installations and Imports

In [40]:
!pip install happiestfuntokenizing
!pip install transformers torch



In [39]:
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import Counter

from gensim.models import LdaMulticore
from gensim import corpora

from torch.utils.data import DataLoader

from transformers import RobertaTokenizer, DistilBertModel
import torch

from tabulate import tabulate

from sklearn.model_selection import cross_validate, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from happiestfuntokenizing.happiestfuntokenizing import Tokenizer

# IF USING COLAB DRIVE
#from google.colab import drive
#drive.mount('/content/drive')
#FILEPATH = '/content/drive/MyDrive/Brown University/Coursework/Fall 2024/CSCI 1460- Computational Linguistics/Final Project/student.pkl'

# ADJUST FILE PATHS
FILEPATH = '/content/student.pkl'
ROBERTA_FILEPATH = '/content/roberta_batch_embeddings.pkl'
LDA_FILEPATH = '/content/lda_feature_embeddings.pkl'


## Preprocessing

In [3]:
def load_data():
  """Load pickles"""
  df = pd.read_pickle(FILEPATH)
  return df

def load_embeddings():
  roberta_embeddings = pd.read_pickle(ROBERTA_FILEPATH)
  lda_embeddings = pd.read_pickle(LDA_FILEPATH)
  return roberta_embeddings, lda_embeddings

In [4]:
# List of depression subreddits in the paper
depression_subreddits = ["Anger",
    "anhedonia", "DeadBedrooms",
    "Anxiety", "AnxietyDepression", "HealthAnxiety", "PanicAttack",
    "DecisionMaking", "shouldi",
    "bingeeating", "BingeEatingDisorder", "EatingDisorders", "eating_disorders", "EDAnonymous",
    "chronicfatigue", "Fatigue",
    "ForeverAlone", "lonely",
    "cry", "grief", "sad", "Sadness",
    "AvPD", "SelfHate", "selfhelp", "socialanxiety", "whatsbotheringyou",
    "insomnia", "sleep",
    "cfs", "ChronicPain", "Constipation", "EssentialTremor", "headaches", "ibs", "tinnitus",
    "AdultSelfHarm", "selfharm", "SuicideWatch",
    "Guilt", "Pessimism", "selfhelp", "whatsbotheringyou"
]

In [5]:
def dataset_generation():
  """Build control and symptom datasets"""
  df = load_data()

  # Data split into "13" symptoms
  anger_data = df[df['subreddit'].isin(['Anger'])]
  anhedonia_data = df[df['subreddit'].isin(['anhedonia', 'DeadBedrooms'])]
  anxiety_data = df[df['subreddit'].isin(['Anxiety', 'AnxietyDepression', 'HealthAnxiety', 'PanicAttack'])]
  concen_deficit_data = df[df['subreddit'].isin(['DecisionMaking', 'shouldi'])]
  disordered_eating_data = df[df['subreddit'].isin(['bingeeating', 'BingeEatingDisorder', 'EatingDisorders', 'eating_disorders', 'EDAnonymous'])]
  fatigue_data = df[df['subreddit'].isin(['chronicfatigue', 'Fatigue'])]
  loneliness_data = df[df['subreddit'].isin(['ForeverAlone', 'lonely'])]
  sad_mood_data = df[df['subreddit'].isin(['cry', 'grief', 'sad', 'Sadness'])]
  self_loathing = df[df['subreddit'].isin(['AvPD', 'SelfHate', 'selfhelp', 'socialanxiety', 'whatsbotheringyou'])]
  sleep_problem_data = df[df['subreddit'].isin(['insomnia', 'sleep'])]
  somatic_data = df[df['subreddit'].isin(['cfs', 'ChronicPain', 'Constipation', 'EssentialTremor', 'headaches', 'ibs', 'tinnitus'])]
  suicidal_thoughts_data = df[df['subreddit'].isin(['AdultSelfHarm', 'selfharm', 'SuicideWatch'])]
  worthlessness_data = df[df['subreddit'].isin(['Guilt', 'Pessimism'])]

  # Full Depression Data set including all Symptoms
  depression_data = df[df['subreddit'].isin(depression_subreddits)]

  # Control Data- Idea from https://tinyurl.com/ControlData
  # "Only [keep] non-mental health posts by authors that were at least 180 days
  # older than their index (earliest) post in a mental health subreddit"

  # Create a DataFrame that links each author to their earliest post in depression_data
  earliest_post = (
      depression_data.groupby('author')['created_utc'] # groups author with utc time
      .min() # earliest value (smallest utc for that author)
      .reset_index() # turns it into an indexed df
      .rename(columns={'created_utc': 'earliest_mental_health_post'}) # renames columns
  )

  # Merge new df that includes earliest post with original df on the author name
  control_df = df.merge(earliest_post, on='author', how='left')

  SECONDS_IN_180_DAYS = 180 * 24 * 60 * 60 # 180 days in seconds

  control_data = control_df[
    (~control_df['subreddit'].isin(depression_subreddits)) &  # exclude mental health subreddits
    (control_df['created_utc'] < control_df['earliest_mental_health_post'] - SECONDS_IN_180_DAYS)  # at least 180 days before index post
  ]

  # Our dictionary for all our symptoms, including a Control and all symptoms category (Depression)
  symptom_data ={
      "Control": control_data['text'].tolist(),
      "Depression": depression_data['text'].tolist(),
      "Anger": anger_data['text'].tolist(),
      "Anhedonia": anhedonia_data['text'].tolist(),
      "Anxiety": anxiety_data['text'].tolist(),
      "Concentration deficit": concen_deficit_data['text'].tolist(),
      "Disordered Eating": disordered_eating_data['text'].tolist(),
      "Fatigue": fatigue_data['text'].tolist(),
      "Loneliness": loneliness_data['text'].tolist(),
      "Sad Mood": sad_mood_data['text'].tolist(),
      "Self-loathing": self_loathing['text'].tolist(),
      "Sleep problem": sleep_problem_data['text'].tolist(),
      "Somatic complaint": somatic_data['text'].tolist(),
      "Suicidal thoughts and attempts": suicidal_thoughts_data['text'].tolist(),
      "Worthlessness": worthlessness_data['text'].tolist()
  }

  return symptom_data

In [19]:
# TEST datasorting
symptom_data = dataset_generation()
print(symptom_data['Control'][0])

Man, I do love me some Bandicoot crash. 
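
The 180-day control rule in `dataset_generation` can be easier to follow on a tiny example. The sketch below uses a hypothetical four-post DataFrame (the authors, timestamps, texts, and the one-item subreddit list are invented for illustration) and applies the same merge-and-filter steps. An author with no mental health post gets `NaN` as the index time, so the comparison evaluates to `False` and their posts drop out of the control set.

```python
import pandas as pd

DAY = 24 * 60 * 60
SECONDS_IN_180_DAYS = 180 * DAY

# Hypothetical toy data, not taken from the project dataset
posts = pd.DataFrame({
    "author": ["ann", "ann", "ann", "bob"],
    "subreddit": ["gaming", "gaming", "Anxiety", "gaming"],
    "created_utc": [0, 300 * DAY, 365 * DAY, 10 * DAY],
    "text": ["old gaming post", "recent gaming post", "index post", "no index post"],
})
mental_health = ["Anxiety"]  # stand-in for depression_subreddits

# Earliest mental health post per author (the "index" post)
index_post = (
    posts[posts["subreddit"].isin(mental_health)]
    .groupby("author")["created_utc"].min()
    .rename("earliest_mental_health_post")
    .reset_index()
)

# Left merge keeps every post; authors without an index post get NaN
merged = posts.merge(index_post, on="author", how="left")

is_control = (
    ~merged["subreddit"].isin(mental_health)
    & (merged["created_utc"] < merged["earliest_mental_health_post"] - SECONDS_IN_180_DAYS)
)
control_texts = merged.loc[is_control, "text"].tolist()
print(control_texts)
```

Only ann's first post survives: her second gaming post is within 180 days of her index post, and bob never posted in a mental health subreddit.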


## LDA Tokenization

In [9]:
def tokenize(tokenizer, symptom_data):
  """Tokenize Symptom data using HappyTokenizer for LDA"""

  tokenized_data ={} # dictionary- key: Symptom and value: list[list[token]]

  # Loop through all the texts for each symptom
  for symptom, texts in tqdm(symptom_data.items(), desc="Tokenizing Symptoms", ncols=100):
    tokens_list = []
    # Loop through lists of text in each individual symptom
    for text in texts:
      tokens = tokenizer.tokenize(text) # Tokenize sentence which makes all lower case
      if tokens == []: # If sentence is tokenized to be empty, ignore it.
        continue
      tokens_list.append(tokens) # add tokenized text to list
    tokenized_data[symptom] = tokens_list # add tokenized list to dic

  return tokenized_data

def stop_words(tokenized_data, top_n = 100):
  """Find the top_n most frequent tokens in the Control posts to use as stop words"""

  # Control Vocab
  vocab = [] # get all words in our vocab
  for sentence in tokenized_data['Control']:
    vocab.extend(sentence)

  # Count top 100 vocab
  word_counts = Counter(vocab)
  top_100 = [word for word, count in word_counts.most_common(top_n)]
  return top_100

def remove_stop_words(tokenized_data, top_100):
  """Remove stop words from our tokenized data"""
  stop_word_set = set(top_100) # a set gives constant-time membership checks
  processed_data = {}

  # loop through our tokenized data
  for symptom, sentences in tqdm(tokenized_data.items(), desc="Removing Stop Words", ncols=100):
    filtered_sentences = []
    for sentence in sentences:
      # remove all words that are in the top 100 stop words
      filtered_sentence = [word for word in sentence if word not in stop_word_set]
      filtered_sentences.append(filtered_sentence)
    processed_data[symptom] = filtered_sentences

  return processed_data

In [20]:
# TEST tokenization

happy_tokenizer = Tokenizer() # tokenizer already produces lowercase tokens
tokenized_data = tokenize(happy_tokenizer, symptom_data) # tokenize
# test tokenization
print(tokenized_data['Control'][0:2])

# control length
print(len(tokenized_data['Control']))
print(len(tokenized_data['Depression'])) # depression length

Tokenizing Symptoms: 100%|██████████████████████████████████████████| 15/15 [02:24<00:00,  9.62s/it]

[['man', ',', 'i', 'do', 'love', 'me', 'some', 'bandicoot', 'crash', '.'], ['how', 'good', 'is', 'this', 'pc', 'for', 'my', '700-750', '$', 'budget', '?', 'want', 'it', 'for', 'gaming', 'on', 'high', '/', 'ultra', 'settings', ',', 'thanks', '!', 'https://www.youtube.com/watch', 'https://www.youtube.com/watch', '?', 'v', '=', 'y_ulqrs', '76xs', '&', 'amp', ';', 't', '=', '110s']]
4369
94514





In [21]:
# TEST stop word removal

# top 100 stop words from control
top_100 = stop_words(tokenized_data)
print("STOP WORDS:", top_100)

# remove the stop words from every symptom's posts
processed_data = remove_stop_words(tokenized_data, top_100)
print()
print(processed_data['Control'][0:2])


STOP WORDS: ['.', ',', 'i', 'the', 'to', 'and', 'a', 'of', '?', 'my', 'in', 'it', 'is', 'for', 'that', '*', 'this', 'but', 'on', 'you', ')', 'with', 'was', 'have', '(', 'me', 'so', 'be', '"', '-', "i'm", 'or', 'just', '[', ']', 'if', 'not', 'what', 'like', '!', 'are', 'as', 'at', '/', ':', 'do', 'about', 'up', 'out', 'can', 'all', 'he', 'from', 'we', 'they', ';', 'her', 'how', 'would', 'she', 'get', 'when', 'one', 'an', 'know', 'had', "don't", "it's", 'there', 'some', 'been', 'will', 'time', "i've", 'any', 'because', 'no', 'more', 'am', 'want', 'your', 'has', 'really', 'people', 'now', 'them', 'amp', '&', 'who', 'other', 'only', 'think', 'by', 'even', 'his', 'back', 'much', '|', 'good', 'then']


Removing Stop Words: 100%|██████████████████████████████████████████| 15/15 [00:30<00:00,  2.01s/it]


[['man', 'love', 'bandicoot', 'crash'], ['pc', '700-750', '$', 'budget', 'gaming', 'high', 'ultra', 'settings', 'thanks', 'https://www.youtube.com/watch', 'https://www.youtube.com/watch', 'v', '=', 'y_ulqrs', '76xs', 't', '=', '110s']]
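
The key design point in the cells above is that the stop-word list is derived from the Control posts only and then applied to every symptom. A minimal sketch of that idea on hypothetical toy token lists (the posts and the cutoff of 2 are made up for illustration):

```python
from collections import Counter

# Hypothetical tokenized posts
control_posts = [["i", "love", "crash", "."], ["i", "want", "a", "pc", "."]]
symptom_posts = [["i", "cannot", "sleep", "."]]

# Count token frequencies in the control posts only
counts = Counter(token for post in control_posts for token in post)

# Treat the 2 most frequent control tokens as stop words
stop_set = {token for token, _ in counts.most_common(2)}

# Apply the same stop-word set to a different group of posts
filtered = [[t for t in post if t not in stop_set] for post in symptom_posts]
print(sorted(stop_set), filtered)
```

Because punctuation is frequent in almost any text, it tends to land in the stop-word set, which is why most punctuation disappears from the processed posts.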





In [None]:
# TEST
# If we wanted to feed the stop-word-filtered data to RoBERTa, we would rebuild sentences from the token lists.
print(symptom_data['Control'][0])
testing_dic = {}
for symptom, posts in processed_data.items():
  list_of_sentences = []
  for post in posts:
    sentence = " ".join(post)
    list_of_sentences.append(sentence)
  testing_dic[symptom] = list_of_sentences

print(testing_dic['Control'][0])


Man, I do love me some Bandicoot crash. 
man love bandicoot crash


## Reddit Topics with LDA

 - The paper uses MALLET; here we use a different LDA implementation (gensim's `LdaMulticore`).

In [10]:
# LdaMulticore (imported above) is gensim's multi-core LDA implementation.

def create_corpus(tokenized_data):
  """
  Produces a dictionary and bow of our data corpus to train our LDA model.
  """

  # Combine the control posts with the posts from all symptoms
  combined_data = tokenized_data['Control'] + tokenized_data['Depression']
  # Build the vocabulary: maps every unique token across all posts to an integer id
  id2word = corpora.Dictionary(combined_data)
  # Bag of words per post: list of (token_id, count_in_post) pairs that share the vocabulary ids
  corpus = [id2word.doc2bow(text) for text in combined_data]

  return id2word, corpus

def train_lda(id2word, corpus, num_topics=200):
  """Train a gensim LDA model with num_topics topics on the bag-of-words corpus."""
  lda_model = LdaMulticore(corpus=corpus, id2word=id2word, num_topics=num_topics, passes=2)
  return lda_model

In [22]:
#TEST id2word and corpus creation
dictionary, corpus = create_corpus(processed_data)

In [31]:
print(dictionary[0]) # {0: "bandicoot", 1: "crash"}
print(processed_data['Control'][0])
print(corpus[0])

bandicoot
['man', 'love', 'bandicoot', 'crash']
[(0, 1), (1, 1), (2, 1), (3, 1)]
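
To make the `(token_id, count)` pairs printed above concrete, here is a pure-Python stand-in for the dictionary and bag-of-words step. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and assigning ids in sorted order per document is an assumption chosen so that the toy output matches the ids printed above.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each unique token (sorted within each document)."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], c) for t, c in counts.items())

docs = [["man", "love", "bandicoot", "crash"], ["crash", "crash", "pc"]]
token2id = build_vocab(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

The second post reuses id 1 for `crash` with a count of 2, which is what lets LDA compare word usage across posts.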


In [32]:
#TEST lda model
lda_model = train_lda(dictionary, corpus)

In [37]:
# Preview topics
topics = lda_model.print_topics(num_topics=200, num_words=10)
for ith_topic, topic in enumerate(topics):
    print(ith_topic, "Topics:", topic[1])

0 Topics: 0.039*"trial" + 0.025*"pot" + 0.023*"pace" + 0.022*"saved" + 0.019*"tip" + 0.018*"humiliating" + 0.015*"destructive" + 0.014*"ignorant" + 0.013*"solo" + 0.012*"criticism"
1 Topics: 0.026*"injury" + 0.024*"liver" + 0.023*"finger" + 0.021*"migraines" + 0.021*"scan" + 0.020*"solutions" + 0.017*"ct" + 0.010*"sweats" + 0.010*"ah" + 0.009*"wired"
2 Topics: 0.076*"social" + 0.041*"anxiety" + 0.017*"awkward" + 0.016*"feel" + 0.015*"being" + 0.012*"shy" + 0.011*"interactions" + 0.011*"interaction" + 0.010*"conversations" + 0.008*"judging"
3 Topics: 0.095*"wanna" + 0.030*"feel" + 0.019*"die" + 0.014*"pointless" + 0.012*"everyone" + 0.011*"everything" + 0.011*"alone" + 0.010*"someone" + 0.010*"myself" + 0.009*"belong"
4 Topics: 0.023*"partly" + 0.021*"inevitable" + 0.016*"agency" + 0.012*"landlord" + 0.012*"keto" + 0.009*"weaker" + 0.009*"bitterness" + 0.008*"observe" + 0.007*"annual" + 0.007*"bless"
5 Topics: 0.015*"autism" + 0.015*"their" + 0.014*"science" + 0.012*"knowledge" + 0.012*

In [11]:
# LDA Generate feature matrix using topic distribution for each Symptom

def topic_distribution(lda_model, symptom_posts, dictionary):
  """
  Produces the topic distribution of each post for a symptom.

  ROWS: posts
  COLUMNS: topics
  Values: weight of the topic in the post
  """

  # Initialize a feature matrix with one column per topic in the trained model
  M = np.zeros((len(symptom_posts), lda_model.num_topics), dtype=np.float64)

  # For a specific symptom create a feature matrix
  for post_idx, post in enumerate(symptom_posts):

    # Convert the post to a bag-of-words representation (token_id, count_in_post)
    bow = dictionary.doc2bow(post)
    # Use the trained topic model to get the (topic_id, weight) pairs for the post;
    # topics below the model's minimum probability are omitted and stay 0 in M
    post_topics = lda_model[bow]

    # Loop through the topic distribution for the post
    for topic_id, weight in post_topics:
      # place the weight of the topic in the post's row
      M[post_idx, topic_id] = weight

  return M

def feature_matrix(lda_model, processed_data, dictionary):
  """Produce the per-post topic distributions for every symptom."""

  lda_features = {}
  for symptom, posts in processed_data.items():
    print(symptom)

    if symptom != 'Depression': # skip 'Depression': it is the union of the individual symptom sets
      M = topic_distribution(lda_model, posts, dictionary)
      lda_features[symptom] = M

  return lda_features

In [38]:
# TEST LDA embedding
lda_features = feature_matrix(lda_model, processed_data, dictionary)

Control
Depression
Anger
Anhedonia
Anxiety
Concentration deficit
Disordered Eating
Fatigue
Loneliness
Sad Mood
Self-loathing
Sleep problem
Somatic complaint
Suicidal thoughts and attempts
Worthlessness
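
The feature matrix built by `topic_distribution` is a dense version of the sparse `(topic_id, weight)` pairs returned for each post. A small sketch with hypothetical hand-written pairs (no trained model involved; `dense_topic_rows` is an illustrative helper, not part of the project code):

```python
import numpy as np

def dense_topic_rows(sparse_rows, num_topics):
    """Scatter (topic_id, weight) pairs into a [num_posts, num_topics] matrix."""
    M = np.zeros((len(sparse_rows), num_topics), dtype=np.float64)
    for post_idx, pairs in enumerate(sparse_rows):
        for topic_id, weight in pairs:
            M[post_idx, topic_id] = weight
    return M

# Hypothetical topic distributions for three posts over four topics
sparse_rows = [[(0, 0.7), (3, 0.3)], [(2, 1.0)], []]
M = dense_topic_rows(sparse_rows, num_topics=4)
print(M)
```

Topics that are missing from a post's pair list simply stay at 0, so a row can sum to less than 1 when low-probability topics are omitted (the last row here is all zeros).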


In [None]:
# DOWNLOAD data for future testing
with open('lda_features.pkl', 'wb') as f:
    pickle.dump(lda_features, f)

## RoBERTa Embeddings

In [12]:
# Initialize model, tokenizer and device
from transformers import RobertaModel

model_name = 'distilroberta-base'

tokenizer = RobertaTokenizer.from_pretrained(model_name)
# distilroberta-base is a RoBERTa checkpoint, so it is loaded with RobertaModel.
# Loading it with DistilBertModel leaves the embedding and transformer weights randomly initialized.
model = RobertaModel.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

In [13]:
class SymptomDataset(torch.utils.data.Dataset):
  """
  Minimal dataset that returns raw posts. Batching (size 32) is handled by the
  DataLoader, and collate_fn applies the RoBERTa tokenizer, padding to the
  longest sequence in the batch and truncating to the model's max length.
  """

  def __init__(self, posts):
      self.posts = posts

  def __len__(self):
      return len(self.posts)

  def __getitem__(self, idx):
      return self.posts[idx]  # Return raw post

# Custom collate function for padding
def collate_fn(batch):
    tokenized = tokenizer(
        batch,
        padding=True,         # Pad to the longest sequence in the batch
        truncation=True,      # Truncate sequences longer than the model's max length
        return_tensors="pt"   # Return PyTorch tensors
    )
    return tokenized

In [14]:
def RoBERTa_embeddings(symptom_data, layer_num = 5):
  """
  For each symptom, embed every raw post by mean-pooling the hidden states of
  layer `layer_num` (default: 5) of the RoBERTa model.
  Returns a dict mapping symptom -> list of [batch_size, embedding_size] tensors.
  """
  roberta_embeddings = {}

  for symptom, posts in symptom_data.items():
    print(symptom)
    if symptom != 'Depression':

      # wrap the raw posts in a dataset
      train_dataset = SymptomDataset(posts)
      # iterate over the posts in batches of 32
      train_dataloader = DataLoader(train_dataset, batch_size=32, collate_fn=collate_fn)

      symptom_embedding_list = [] # list of per-batch post embeddings
      progress_bar = tqdm(train_dataloader, desc="Processing Symptoms", ncols=100)

      for batch in progress_bar:
        # each post must be a single raw string, not a list of tokens, so that
        # the tokenizer output has shape [batch_size, sequence_length]
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        # run the pre-trained model without gradients (no training involved)
        with torch.no_grad():
          outputs = model(input_ids=input_ids, attention_mask = attention_mask, output_hidden_states= True)
          hidden_states = outputs.hidden_states # tuple: embedding output, then one entry per layer

        # [batch_size, sequence_length, embedding_size]
        layer_embeddings = hidden_states[layer_num] # output of transformer layer `layer_num`

        # [batch_size, embedding_size]
        post_embedding = layer_embeddings.mean(dim=1) # mean over the sequence length (padding positions included)

        symptom_embedding_list.append(post_embedding.cpu())

      # assign inside the if-block: 'Depression' is skipped and gets no entry
      roberta_embeddings[symptom] = symptom_embedding_list

  return roberta_embeddings
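
`RoBERTa_embeddings` averages over the full padded sequence, so padding positions contribute to each post's vector. One common alternative is a masked mean that uses the attention mask to average only real tokens. The sketch below illustrates the arithmetic with NumPy on hypothetical toy arrays; `masked_mean` is an illustrative helper, not part of the notebook, and the value 100.0 is an exaggerated stand-in for a padding vector.

```python
import numpy as np

def masked_mean(hidden, attention_mask):
    """Average hidden states over real tokens only.

    hidden:         [batch_size, sequence_length, embedding_size]
    attention_mask: [batch_size, sequence_length], 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # [batch, seq, 1]
    summed = (hidden * mask).sum(axis=1)                     # [batch, emb]
    counts = np.clip(mask.sum(axis=1), 1.0, None)            # avoid division by zero
    return summed / counts

# Two toy "posts": the first has one padding position at the end
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean(hidden, mask)
plain = hidden.mean(axis=1)  # what an unmasked mean would give
print(pooled)
print(plain)
```

With PyTorch tensors the same idea can be written with `attention_mask.unsqueeze(-1)` and `.sum(dim=1)`. Whether it changes the AUC scores reported here has not been tested.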


In [None]:
roberta_embeddings = RoBERTa_embeddings(symptom_data, layer_num = 5)

In [None]:
# DOWNLOAD data for future testing
with open('roberta_embeddings_punctuation.pkl', 'wb') as f:
    pickle.dump(roberta_embeddings, f)

In [None]:
# TEST roberta_embeddings
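# each symptom maps to a list of per-batch tensors, so concatenate them before checking the shape
print(torch.cat(roberta_embeddings['Control']).shape)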

## Evaluate Embeddings - AUC

In [15]:
def LDA_AUC(X, y):
  """
  Runs 5-fold cross validation with random forest to evaluate LDA's embedding
  performance.
  """
  rf_classifier = RandomForestClassifier(
      max_depth=20, # Limit tree depth to reduce model complexity
      min_samples_split=40, # Increase the minimum number of samples required to split a node
      min_samples_leaf=20 # Increase the minimum number of samples per leaf node
  )
  cv = KFold(n_splits=5, shuffle=True)
  results = cross_validate(rf_classifier, X=X, y=y, cv=cv, scoring='roc_auc', return_train_score=True)
  return np.mean(results['test_score'])

def ROBERTA_AUC(X, y):
  """
  Runs 5-fold cross validation with random forest to evaluate RoBERTa's embedding
  performance.
  """
  rf_classifier = RandomForestClassifier(
    # n_estimators=50,            # Reduce the number of trees to prevent overfitting
    max_depth=10,                # Limit tree depth to reduce model complexity
    min_samples_split=100,       # Increase the minimum number of samples required to split a node
    min_samples_leaf=50        # Increase the minimum number of samples per leaf node
  )
  cv = KFold(n_splits=5, shuffle=True)
  results = cross_validate(rf_classifier, X=X, y=y, cv=cv, scoring='roc_auc', return_train_score=True)
  return np.mean(results['test_score'])

def evaluate_embeddings(lda_features, roberta_embeddings):
  lda_auc_scores = []
  roberta_auc_scores = []
  symptoms = []

  lda_control = lda_features['Control']
  roberta_control = np.vstack(roberta_embeddings['Control'])

  for symptom in tqdm(lda_features, desc="Evaluating LDA model", ncols=100):
      if symptom not in ['Control', 'Depression', 'Concentration deficit', 'Fatigue', 'Suicidal thoughts and attempts']:

          lda_symptom_embedding = lda_features[symptom]

          # Produces our Input data by concatenating our control and current symptom embeddings
          X = np.concatenate((lda_control, lda_symptom_embedding), axis=0)
          # Produces our labels by using 0 as control and 1 as symptom "positive"
          y = np.concatenate((np.zeros(len(lda_control)), np.ones(len(lda_symptom_embedding))), axis=0)

          lda_auc = LDA_AUC(X, y)
          lda_auc_scores.append(lda_auc)
          symptoms.append(symptom)

  for symptom in tqdm(roberta_embeddings, desc="Evaluating RoBERTa model", ncols=100):
      if symptom not in ['Control', 'Depression', 'Concentration deficit', 'Fatigue', 'Suicidal thoughts and attempts']:

          roberta_symptom_embedding = np.vstack(roberta_embeddings[symptom])

          # Produces our Input data by concatenating our control and current symptom embeddings
          X = np.concatenate((roberta_control, roberta_symptom_embedding), axis=0)
          # Produces our labels by using 0 as control and 1 as symptom "positive"
          y = np.concatenate((np.zeros(len(roberta_control)), np.ones(len(roberta_symptom_embedding))), axis=0)

          roberta_auc = ROBERTA_AUC(X, y)
          roberta_auc_scores.append(roberta_auc)

  # Create DataFrame
  auc_data = {
      'Symptom': symptoms,
      'LDA AUC': lda_auc_scores,
      'RoBERTa AUC': roberta_auc_scores,
  }
  auc_df = pd.DataFrame(auc_data)

  # Pretty print the DataFrame using tabulate
  table = tabulate(auc_df, headers='keys', tablefmt='fancy_grid', showindex=False)
  print(table)
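
The `roc_auc` score used above can be read as the probability that a randomly chosen symptom post receives a higher classifier score than a randomly chosen control post. A minimal pure-Python sketch of that pairwise definition, using made-up labels and scores (this is an illustration of the metric, not scikit-learn's implementation):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (0 = control, 1 = symptom) and classifier scores
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

Here 4 of the 6 positive/negative pairs are ordered correctly, giving an AUC of about 0.667. A score of 0.5 corresponds to random ranking and 1.0 to perfect separation.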


## Main
Run `main`. Pass `True` to load previously saved embeddings from disk, or `False` to generate them from scratch.


In [17]:
def main(load_data = False):

  # generate dictionary containing lists of posts

  print("Generate Symptom Data")
  symptom_data = dataset_generation() # generate our original dictionary
  print("--------------------------------------")

  print("Preprocess our dataset for LDA\n")
  happy_tokenizer = Tokenizer() # tokenizer already produces lowercase tokens
  tokenized_data = tokenize(happy_tokenizer, symptom_data) # tokenize
  top_100 = stop_words(tokenized_data) # get top 100 stop words
  processed_data = remove_stop_words(tokenized_data, top_100) # remove stop words (this also drops the punctuation that ranks among the top 100 tokens)

  print("-------------------------------------")
  # Preload Embeddings
  if load_data:
    roberta_embeddings, lda_features = load_embeddings()
    evaluate_embeddings(lda_features, roberta_embeddings)

  else:
    # Generate Embeddings
    # LDA embeddings
    dictionary, corpus = create_corpus(processed_data)
    print("Training LDA model \n")
    print("-------------------------")
    lda_model = train_lda(dictionary, corpus)
    print("Produces LDA feature matrix\n")
    lda_features = feature_matrix(lda_model, processed_data, dictionary)
    print("-------------------------")

    # RoBERTa embeddings
    print("Generating Roberta embeddings\n")
    roberta_embeddings = RoBERTa_embeddings(symptom_data)
    print("-------------------------")

    print("Evaluating models\n")
    # Produces a table that represents the AUC scores of both the LDA and roberta embeddings
    evaluate_embeddings(lda_features, roberta_embeddings)

main(True)

Generate Symptom Data
--------------------------------------
Preprocess our dataset for LDA



Tokenizing Symptoms: 100%|██████████████████████████████████████████| 15/15 [02:22<00:00,  9.49s/it]
Removing Stop Words: 100%|██████████████████████████████████████████| 15/15 [00:29<00:00,  1.99s/it]


-------------------------------------
╒═══════════════════╤═══════════╤═══════════════╕
│ Symptom           │   LDA AUC │   RoBERTa AUC │
╞═══════════════════╪═══════════╪═══════════════╡
│ Anger             │  0.949425 │      0.892395 │
├───────────────────┼───────────┼───────────────┤
│ Anhedonia         │  0.965538 │      0.909087 │
├───────────────────┼───────────┼───────────────┤
│ Anxiety           │  0.937916 │      0.892172 │
├───────────────────┼───────────┼───────────────┤
│ Disordered Eating │  0.961068 │      0.87232  │
├───────────────────┼───────────┼───────────────┤
│ Loneliness        │  0.879236 │      0.825268 │
├───────────────────┼───────────┼───────────────┤
│ Sad Mood          │  0.844092 │      0.845463 │
├───────────────────┼───────────┼───────────────┤
│ Self-loathing     │  0.875498 │      0.85112  │
├───────────────────┼───────────┼───────────────┤
│ Sleep problem     │  0.976371 │      0.882789 │
├───────────────────┼───────────┼───────────────┤
│ Somatic co

## Ethical Discussion

### Benefits:
- NLP systems can process large amounts of data and can therefore surface insights into mental health and symptom patterns.
- They could support early warning systems.

### Drawbacks:
- NLP models are not human and can still lack contextual understanding.
- Mental health is unique to each individual, and the data can be biased toward certain demographics depending on where it is collected.
- Mental health is extremely complicated, so we shouldn't over-rely on NLP for diagnoses.

### Harms:
- Mining user data can be extremely dangerous from a privacy perspective.
- False positives and false negatives can have serious real-world consequences for people's lives.
- The models could be misused, for example for targeted advertising.
