# COGS 118A - Final Project

# Self-Supervised Learning on Social Media Posts: Mental Health Disorder Classification

## Group members

- Gilberto Robles
- Soyon Kim
- Allan Tan
- Jason Sheu

Mental health patients struggle with financial, psychological, and logistical burdens when seeking professional help. As a result, social media, particularly Reddit, has become a popular outlet for people to seek help anonymously. To help make mental healthcare accessible and affordable online, we aim to use supervised machine learning to detect self-harming and destructive behavior and to classify potential mental disorders using the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) diagnostic criteria. Our model could also be used in a professional environment to assist care providers with diagnostic information. We use the Reddit Mental Health Dataset, which consists of posts from 28 different subreddits (15 mental health support groups) from 2018-2020. In our methods, we run a multi-class text classification study that compares the performance of K-Nearest Neighbors and Support Vector Machines at finding patterns in posts from people seeking support in mental health related subreddits. Posts are thereby related to mental disorders and harmful behavior so that users can be warned about (or potentially given a preliminary diagnosis from) the content of their posts and then directed to helpful resources in an accessible, private, and preventative manner.

# Background

### The Mental Health Epidemic
The current mental healthcare system places various financial, psychological, and logistical burdens on those seeking professional help for their mental disorders.  

To list a few:
- There is a huge shortage of therapists and psychiatrists, leading to months-long and even year-long waitlists for an incoming patient’s first appointment. Because many clinicians also do not accept insurance, patients have great difficulty connecting with a clinician in the first place.
- There is a strong social stigma against seeking professional help for mental disorders. That is, the fear of admitting to one’s issues and being labeled as “disabled” leads to anosognosia, the denial of or “lack of insight” into one’s own mental health issues.
- There is also the burden on incoming patients to find the best-fitting therapist in terms of location, specialization, cost, gender, age, and culture. When switching clinicians, the psychological burden of repeated intake sessions where one must elaborate on their mental health history can also be extremely cumbersome.

### Previous Research: Machine Learning and Mental Health
Because outlets for mental health struggles are lacking, many turn to the internet to anonymously share their difficulties and build support communities. For example, Reddit has user-established support groups for mental health conditions such as addiction, alcoholism, bipolar disorder, anxiety, depression, eating disorders, and post-traumatic stress disorder. Meanwhile, Human-Machine Interaction (HMI) is the field that studies computer and robot technology with a focus on the relationship between people and machines. Because it is so closely tied to people, HMI has focused on enhancing human health, particularly mental health.

For example, 
- Chikersal et al. developed automatic depression detection through machine learning on biosensor feedback [2]. There are also various products on the market that use machine learning and other intelligent methods to personalize user preferences and treatment, including mindfulness apps like Headspace, Calm, and UCLA Mindful and therapeutic robots like Paro, Hugvie, Pepper, Carebot, and QTRobot [4, 5].
- Cheng et al. created "Psychologist in a Pocket", a mobile-based app for depression screening, through lexicon development and content validation.

For the development of our project and our own lexicon, we also consulted Li et al., "Automatic Construction of a Depression-Domain Lexicon Based on Microblogs: Text Mining Study", as well as "Lexicon-based method to detect signs of depression in text" on GitHub by Pablo Gamallo.

Additionally, while developing our self-supervised model for text classification, we consulted the Keras documentation example "End-to-end Masked Language Modeling with BERT" to train a language model in a self-supervised setting (without human-annotated labels).


# Problem Statement

In this project, we want to tackle the lack of accessible, affordable, and preventative mental health support resources through popular social media sites. Specifically, we want to assist Internet users struggling with mental disorders who have exhibited a range of qualifying behavior traits, as per DSM-5, through an automated system that classifies them based on their Reddit posts. Using KNN and SVM text classification models, the system would learn words and phrases commonly used by people who clearly meet the qualifying criteria, as opposed to people in the same Reddit thread who do not exhibit any concerning behavior. Then, given a Reddit post, the system will try to detect whether the post meets concerning criteria, at which point the user will be notified and directed to relevant resources. For this project, we primarily focus on a single model that can differentiate between people who display signs of mental illness and those who do not.

# Data

Link to dataset: https://zenodo.org/record/3941387/files/depression_2018_features_tfidf_256.csv?download=1

Dataset size: 24535 rows, 350 columns. Many of these columns are tf-idf statistics that we will not be using. The column that we are interested in is primarily the 'post' column, which is the column containing the text of the post made by a user.

The dataset we plan to use is text that was scraped from Reddit’s mental health support subreddits. The link to the dataset can be found as item 3 in the footnotes [3]. The data was originally created to examine the effects of COVID-19 on mental health. It contains posts from 28 different subreddits, or 15 mental health support groups, and dates range from 2018-2020. The link provided includes a variety of mental health disorders.

For our project in particular, we will be analyzing the posts associated with self-harming behavior as per DSM-5 criteria. For example, in the case of "Major depressive disorder", the files we examine would be:
- depression_2018_features_tfidf_256.csv
- depression_2019_features_tfidf_256.csv
- etc.

Regarding these two datasets in particular, they have about 24500 and 33500 observations, respectively. They also have a wide array of variables concerning the text and the text’s metadata. Examples include:
- the subreddit the text was scraped from
- the username of the poster
- actual text itself
- date posted
- unique words
- syllables

In total there are 350 variables per observation. A single observation corresponds to one post on the subreddit and carries all of the aforementioned variables. Because our project revolves around textual analysis and classification, we will predominantly focus on the text portion of each observation. As a result, some critical variables are:
- the text of the post
- the date it was posted
- number of words
- number of times a “trigger” word such as gun or suicide is used, etc.

The text of the post will be in string format and the date will also be a string in MM/DD/YYYY format. Other variables will be in integer or float format as they are primarily responsible for keeping track of word frequencies. Because we are only using a few of the variables out of the 350 available, we will need to remove the unnecessary ones. Additionally, to preserve anonymity, we will be removing the author variable from each observation. Some additional cleaning could also take the form of removing special characters from the text of each observation or making all characters lowercase.


# Proposed Solution

Internet users have been posting large volumes of text discussing their mental illness or self-harm urges. In this project, to address the lack of accessible, affordable, and preventative mental health support resources, we aim to develop a mental disorder detection system that helps users by providing automatic warnings about the onset of harmful behavior described explicitly by the user.

### Preprocessing

##### Data Splitting and Cleaning
The first step in tackling this difficult dataset is preprocessing. We take 4 sub-datasets with tens of thousands of samples each, namely: ADHD, Anxiety, Alcoholism, and Depression.

Since each data point comes with extra information we will not be using, we extract only the Reddit post text that we use in our analysis.

Since this is an NLP task, we need to clean the data of unwanted stopwords, which are irrelevant to understanding the sentiment of the text. We also remove capitalization and symbols.
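
The cleaning step described above can be sketched in a few lines of plain Python. This is only an illustration: `STOPWORDS` is a tiny hand-written stand-in for NLTK's English stopword list, and `clean_post` is a hypothetical helper name. The sketch lowercases and strips punctuation before the stopword check, so capitalized tokens such as "The" are removed as well.

```python
import string

# Stand-in for nltk.corpus.stopwords.words("english"); illustrative only.
STOPWORDS = {"i", "the", "a", "and", "to", "my", "is", "of"}

def clean_post(text, stops=STOPWORDS):
    """Lowercase, drop punctuation and digits, then remove stopwords."""
    text = text.lower()
    # Remove punctuation and digits in one pass.
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    # split() with no arguments also collapses repeated whitespace.
    return " ".join(word for word in text.split() if word not in stops)

print(clean_post("The meds help, and I can focus 2 hours a day!"))
# meds help can focus hours day
```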


### Term Frequency-Inverse Document Frequency (TF-IDF)

The original plan for our project was to use TF-IDF as an information retrieval (IR) tool to quantify the importance or relevance of words or phrases and to compare those scores against their relevance scores in pre-labeled mental disorder lexicons. Unfortunately, the lexicons we found from researchers in the same field were restricted from public use, so we could not carry this stage forward on its own as a pre-classification task. While text vectorization remains relevant for establishing a representative vocabulary from the training data, we decided to take the approach a step further.
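
To make the TF-IDF weighting concrete, here is a small worked sketch using the textbook form, term frequency times ln(N / document frequency). The toy posts and the helper name `tf_idf` are assumptions made for illustration, and the function expects non-empty documents. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * ln(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = ["cannot focus work", "cannot sleep night", "focus meds help focus"]
for doc_scores in tf_idf(docs):
    print({term: round(score, 3) for term, score in doc_scores.items()})
```

A term that appears in only one post ("work") scores higher than one shared across posts ("cannot"), and a term that appears in every post would score zero.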


### BERT-Like Masked Language Model for Self-Supervised Text Classification

Since our only options at this point were to stop the project at the previous step, hand-label thousands of samples from our data, or move to a neural network, we decided to do the latter.

In this stage we create a pretraining model with the Keras TextVectorization and MultiHeadAttention layers to build a BERT-like Transformer-encoder network, which lets us derive labels from the data automatically. The model takes the token IDs from our vectorization as inputs (including masked tokens) and predicts the correct IDs for the masked input tokens.
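
As a rough illustration of how masking turns unlabeled text into a prediction task, the sketch below hides a random subset of token IDs and keeps the originals as targets. It is a simplified stand-in for the NumPy function used later in the notebook. The helper name `mask_tokens`, the mask ID 999, and the treatment of IDs 0 and 1 as special tokens are assumptions. The sketch also omits the "replace with a random token" and "leave unchanged" variants of BERT-style masking.

```python
import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels); labels are -1 where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        # Ids 0 and 1 are treated as padding / unknown and never masked.
        if tok > 1 and rng.random() < mask_prob:
            masked.append(mask_id)   # hide the token from the model
            labels.append(tok)       # the original id becomes the target
        else:
            masked.append(tok)
            labels.append(-1)        # ignored by the loss
    return masked, labels

ids = [12, 7, 45, 3, 98, 7, 0, 0]    # toy encoded post, padded with 0
masked, labels = mask_tokens(ids, mask_id=999, mask_prob=0.5)
print(masked)
print(labels)
```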

### Training and Testing with KNN and SVM

In this section our goal is to create a cross-examination study that compares the classification accuracy of K-Nearest Neighbors against a non-linear Support Vector Machine. Since neural networks take hours or even days to train, we only used a simple vocabulary layer to pretrain our model; to generalize these results to the entire Reddit Mental Health Dataset, we opt for faster and more efficient classification models. In the model selection process we compare KNN and SVM.
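
One way such a comparison could be set up with scikit-learn is sketched below. The eight posts and their labels are invented toy data, and the hyperparameters (`n_neighbors=3`, RBF kernel with `C=1.0`, 2-fold cross-validation) are placeholder assumptions rather than tuned values from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy, hand-written stand-ins for labeled posts (0 = ADHD-like, 1 = alcohol-like).
posts = [
    "cannot focus on homework at all", "forgot my meds and lost focus again",
    "my attention drifts during every lecture", "focus problems ruin my study time",
    "i drank again last night", "trying to stay sober this week",
    "one drink turned into ten drinks", "sober for thirty days now",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "svm_rbf": SVC(kernel="rbf", C=1.0, gamma="scale"),
}

results = {}
for name, clf in models.items():
    # The vectorizer sits inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, posts, labels, cv=2, scoring="f1_macro")
    results[name] = scores.mean()
    print(f"{name}: mean macro-F1 = {results[name]:.2f}")
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from the held-out fold into training, which would otherwise inflate both models' scores.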

### Machine Learning and Online Mental Health Support
The end goal of this project is to use such a classification model to create a private and optional alert and support system for people who might be looking online for mental health support. By correctly classifying who could use this help, mental health support can become more widely accessible for those who are already searching, and we can point them in the right direction with disorder-specific information. Our proposed solution would be an optional and private alert through Reddit.

# Evaluation Metrics

The evaluation metrics we use to quantify the performance of our model are precision, recall, and F1 score. Because we plan on leveraging sentiment analysis on the textual content of each observation, we will likely begin with logistic regression as a baseline.  
Precision, recall, and F1 matter to us because we are performing classification and want to minimize the number of false positives and false negatives. We use F1 score as an overall indicator of model performance, alongside regular accuracy on the training and testing sets.
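
For reference, the three metrics can be computed directly from paired true and predicted labels. The sketch below handles a single positive class; the helper name `precision_recall_f1` and the example labels are illustrative assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from paired labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # 3 TP, 1 FP, 1 FN -> (0.75, 0.75, 0.75)
```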

# Results

### Import libraries/dataset

In [None]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
import string
from sklearn.feature_extraction.text import TfidfVectorizer

#### ADHD Data

In [None]:
# import data from drive
data_path = '.'
adhd_2018_df = pd.read_csv(data_path + '/adhd/adhd_2018.csv')
adhd_2019_df = pd.read_csv(data_path + '/adhd/adhd_2019.csv')
adhd_post_df = pd.read_csv(data_path + '/adhd/adhd_post.csv')
adhd_pre_df  = pd.read_csv(data_path + '/adhd/adhd_pre.csv')

# join all data into one DataFrame
adhd_dataset = pd.concat([adhd_2018_df, adhd_2019_df, adhd_post_df, adhd_pre_df])

#### Anxiety Data

In [None]:
# import data from drive
anxiety_2018_df = pd.read_csv(data_path + '/anxiety/anxiety_2018.csv')
anxiety_2019_df = pd.read_csv(data_path + '/anxiety/anxiety_2019.csv')
anxiety_post_df = pd.read_csv(data_path + '/anxiety/anxiety_post.csv')
anxiety_pre_df  = pd.read_csv(data_path + '/anxiety/anxiety_pre.csv')

# join all data into one DataFrame
anxiety_dataset = pd.concat([anxiety_2018_df, anxiety_2019_df, anxiety_post_df, anxiety_pre_df])

#### Alcoholism Data

In [None]:
# import data from drive
alcoholism_2018_df = pd.read_csv(data_path + '/alcoholism/alcoholism_2018.csv')
alcoholism_2019_df = pd.read_csv(data_path + '/alcoholism/alcoholism_2019.csv')
alcoholism_post_df = pd.read_csv(data_path + '/alcoholism/alcoholism_post.csv')
alcoholism_pre_df  = pd.read_csv(data_path + '/alcoholism/alcoholism_pre.csv')

# join all data into one DataFrame
alcoholism_dataset = pd.concat([alcoholism_2018_df, alcoholism_2019_df, alcoholism_post_df, alcoholism_pre_df])

#### Depression Data

In [None]:
# import data from drive
depression_2018_df = pd.read_csv(data_path + "/depression/depression_2018.csv")
depression_2019_df = pd.read_csv(data_path + "/depression/depression_2019.csv")
depression_post_df = pd.read_csv(data_path + "/depression/depression_post.csv")
depression_pre_df  = pd.read_csv(data_path + "/depression/depression_pre.csv")

# join all data into one DataFrame
depression_dataset = pd.concat([depression_2018_df, depression_2019_df, depression_post_df, depression_pre_df])

#### Example of post from dataset

In [None]:
# length of our data
print('adhd', len(adhd_dataset))
print('anxiety', len(anxiety_dataset))
print('alcoholism', len(alcoholism_dataset))
print('depression', len(depression_dataset))

In [None]:
# use positional indexing: after pd.concat the index label 1 is repeated once per source file
adhd_post = adhd_dataset["post"].iloc[1]
adhd_post

#### Data noise reduction and format simplification

In [None]:
adhd_posts = adhd_dataset["post"].tolist()
anxiety_posts = anxiety_dataset["post"].tolist()
alcoholism_posts = alcoholism_dataset["post"].tolist()
depression_posts = depression_dataset["post"].tolist()

### Preprocessing: Data Splitting and Cleaning

In [None]:
# functions to remove stopwords, punctuation, and digits from posts

def remove_stops(text, stops):
    # compare in lowercase so capitalized stopwords ("I", "The") are also dropped
    words = [word for word in text.split() if word.lower() not in stops]
    final = " ".join(words)
    # strip punctuation, then digits
    final = final.translate(str.maketrans("", "", string.punctuation))
    final = "".join([i for i in final if not i.isdigit()])
    # collapse the double spaces left behind by removed tokens
    while "  " in final:
        final = final.replace("  ", " ")
    return final

def clean_docs(docs):
    # a set makes the per-word membership test fast
    stops = set(stopwords.words("english"))
    return [remove_stops(doc, stops) for doc in docs]

In [None]:
cleaned_adhd_docs = clean_docs(adhd_posts)
cleaned_anxiety_docs = clean_docs(anxiety_posts)
cleaned_alcoholism_docs = clean_docs(alcoholism_posts)
cleaned_depression_docs = clean_docs(depression_posts)

In [None]:
# check for any data loss

print('adhd', len(cleaned_adhd_docs))
print('anxiety', len(cleaned_anxiety_docs))
print('alcoholism', len(cleaned_alcoholism_docs))
print('depression', len(cleaned_depression_docs))

#### Compare clean vs unclean samples

In [None]:
# From ADHD dataset
adhd_posts[1]

In [None]:
cleaned_adhd_docs[1]

In [None]:
# From alcoholism dataset
alcoholism_posts[1]

In [None]:
cleaned_alcoholism_docs[1]

### Term Frequency-Inverse Document Frequency (TF-IDF)
We ran into a number of difficulties over the course of this project. Although we did not end up using our TF-IDF results directly, we include them here because they are the basis for the very similar text vectorization layer in the next section. TF-IDF was not a direct predecessor to the following steps, but it established a useful foundation for the structure of the next vectorization technique.

In [None]:
from sklearn.feature_extraction import text

#custom_stopwords = 'drive/MyDrive/Courses UCSD/SPRING 2022/COGS 118A/FinalProject/stop_words_english.txt'
custom_stopwords = 'stop_words_english.txt'
with open(custom_stopwords, 'r') as f:
    more_stop_words = [line.strip() for line in f]
my_stop_words = text.ENGLISH_STOP_WORDS.union(more_stop_words)

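vectorizer = TfidfVectorizer(
    lowercase=True,
    max_features=300,
    # recent scikit-learn versions expect a list here, not a frozenset
    stop_words=list(my_stop_words),
)

vectors = vectorizer.fit_transform(cleaned_adhd_docs)
# get_feature_names() was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()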
dense = vectors.todense()
denselist = dense.tolist()
adhd_tfidf = pd.DataFrame(denselist, columns=feature_names)

In [None]:
adhd_tfidf

In [None]:
#visualize only the keywords

all_keywords = []

for description in denselist:
    x = 0
    keywords = []
    for word in description:
        if word > 0:
            keywords.append(feature_names[x])
        x = x+1
    all_keywords.append(keywords)

In [None]:
print(all_keywords[1])

In [None]:
## if we are interested in the n features with the highest TF IDF scores

top_n = 300
top_n_features = sorted(list(zip(feature_names, 
                                  vectors.sum(0).getA1())), 
                              key=lambda x: x[1], reverse=True)[:top_n]

In [None]:
for feature in top_n_features:
    if feature[0] == 'suicide':
        print(feature)

In [None]:
# Extract our TF-IDF seed words (note: `vectors` above was fit on the cleaned ADHD posts)
my_seed_words = []
for feature in top_n_features:
    my_seed_words.append(feature[0])
print(my_seed_words)

In [None]:
# TF-IDF seed words from existing study
depression_true_seed_words = ['myself', 'really', 'depression', 'hope', 'life', 'forever', 'pain', 'sad', 'live', 'mood']

In [None]:
# Calculate Cosine Similarity between the two seed word lists
from collections import Counter

# count word occurrences
our_vals = Counter(my_seed_words)
true_vals = Counter(depression_true_seed_words)

# convert to word-vectors over the union of both vocabularies
words  = list(our_vals.keys() | true_vals.keys())
our_vect = [our_vals.get(word, 0) for word in words]
true_vect = [true_vals.get(word, 0) for word in words]

# find cosine
len_our  = sum(av*av for av in our_vect) ** 0.5     # Euclidean norm of our vector
len_true  = sum(bv*bv for bv in true_vect) ** 0.5   # Euclidean norm of the reference vector
dot    = sum(av*bv for av,bv in zip(our_vect, true_vect))
cosine = dot / (len_our * len_true)

In [None]:
print(cosine)
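
Because each seed-word list contains every word at most once, the count vectors above are binary, and the cosine reduces to the number of shared words divided by the geometric mean of the two list sizes. The sketch below shows that shortcut; the helper name `keyword_cosine` and both word lists are hypothetical examples, not results from our data.

```python
import math

def keyword_cosine(words_a, words_b):
    """Cosine similarity of two keyword lists treated as binary word vectors."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a or not set_b:
        return 0.0
    # With 0/1 vectors the dot product is the number of shared words
    # and each norm is the square root of the list's size.
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

ours = ["focus", "meds", "life", "work"]        # hypothetical top TF-IDF terms
reference = ["life", "pain", "sad", "hope"]     # hypothetical reference terms
print(keyword_cosine(ours, reference))  # 1 shared word / sqrt(4 * 4) = 0.25
```

With 300 extracted terms and only 10 reference terms, the score is capped at 10 / sqrt(300 * 10), roughly 0.18, even if every reference word is recovered, so low values are expected from the list sizes alone.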

### BERT-Like Masked Language Model for Self-Supervised Text Classification

The following code creates labels automatically using Masked Language Modeling with one neural network layer, since Reddit posts unfortunately do not come pre-labeled.


In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from dataclasses import dataclass
import re
from pprint import pprint

In [None]:
# make the adhd features into a tensor
adhd_features = tf.constant(cleaned_adhd_docs)

# make the adhd labels into a tensor
# these are temporary labels just to make a temporary dataset
adhd_labels = tf.constant(np.random.choice([0, 1], size=(len(cleaned_adhd_docs),), p=[1./3, 2./3]))

# initialize a tensorflow dataset for text features and labels
# we will use this dataset to extract a lexicon out of all data samples
# so that we can train a neural network with it
adhd_dataset = tf.data.Dataset.from_tensor_slices((adhd_features, adhd_labels))

In [None]:
## this just displays the first couple data points and their classification label

for text_batch, label_batch in adhd_dataset.take(3):
        print(text_batch.numpy())
        print(label_batch.numpy())

#### Dataset Prep: Vocabulary and Mask Layer

In [None]:
@dataclass
class Config:
    MAX_LEN = 256
    BATCH_SIZE = 32
    LR = 0.001
    VOCAB_SIZE = 30000
    EMBED_DIM = 128
    NUM_HEAD = 8  # used in bert model
    FF_DIM = 128  # used in bert model
    NUM_LAYERS = 1


config = Config()

In [None]:
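## data cleaning from capitalization and symbols

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # keep "[" and "]" so the "[mask]" token survives standardization;
    # otherwise it becomes "mask" and the mask token id lookup is wrong
    punctuation = string.punctuation.replace("[", "").replace("]", "")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(punctuation)}]", ""
    )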

In [None]:
def get_vectorize_layer(texts, vocab_size, max_seq, special_tokens=["[MASK]"]):
    """Build Text vectorization layer

    Args:
      texts (list): List of string i.e input texts
      vocab_size (int): vocab size
      max_seq (int): Maximum sequence length.
      special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].

    Returns:
        layers.Layer: Return TextVectorization Keras Layer
    """

    # initialize vocabulary layer, creates a lexicon to adapt our model with
    vectorize_layer = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        standardize=custom_standardization,
        output_sequence_length=max_seq,
    )

    # use the entire dataset (no labels) and create a useful lexicon out of it:
    vectorize_layer.adapt(texts)

    # Insert mask token in vocabulary
    vocab = vectorize_layer.get_vocabulary()
    vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"]
    vectorize_layer.set_vocabulary(vocab)
    return vectorize_layer

#### Text Vectorization with Vocabulary Layer
- The layer builds a vocabulary of all string tokens seen in the dataset, sorted by occurrence count, with ties broken by sort order of the tokens (high to low).
- It computes the most frequent tokens occurring in the input dataset.
- We use this 'vocab' to train our model.
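
A plain-Python sketch of what such a vocabulary layer does may help here. It assumes ID 0 is reserved for padding and ID 1 for unknown tokens (as in Keras `TextVectorization` with integer output), breaks ties alphabetically for determinism (which may differ from Keras), and uses the hypothetical helper names `build_vocab` and `encode`.

```python
from collections import Counter

def build_vocab(docs, max_tokens):
    """Map the most frequent tokens to ids; 0 = padding, 1 = unknown."""
    counts = Counter(token for doc in docs for token in doc.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    vocab = ["", "[UNK]"] + ranked[: max_tokens - 2]
    return {token: idx for idx, token in enumerate(vocab)}

def encode(doc, token_to_id, max_len):
    """Turn a post into a fixed-length list of ids (truncate or pad with 0)."""
    ids = [token_to_id.get(token, 1) for token in doc.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

docs = ["focus meds focus", "meds help", "sleep help focus"]
token_to_id = build_vocab(docs, max_tokens=5)
print(token_to_id)
print(encode("focus sleep meds", token_to_id, max_len=5))  # [2, 1, 4, 0, 0]
```

Here "sleep" falls outside the 5-token vocabulary, so it is encoded as the unknown ID 1, and the sequence is padded with zeros up to the fixed length.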

In [None]:
'''ADHD'''
# run our cleaned ADHD data through the vocab layer
adhd_vectorize_layer = get_vectorize_layer(
    cleaned_adhd_docs,
    config.VOCAB_SIZE,
    config.MAX_LEN,
    special_tokens=["[mask]"],
)

# Get mask token id for masked language model
adhd_mask_token_id = adhd_vectorize_layer(["[mask]"]).numpy()[0][0]

In [None]:
'''ANXIETY'''
# run our cleaned anxiety data through the vocab layer
anxiety_vectorize_layer = get_vectorize_layer(
    cleaned_anxiety_docs,
    config.VOCAB_SIZE,
    config.MAX_LEN,
    special_tokens=["[mask]"],
)

# Get mask token id for masked language model
anxiety_mask_token_id = anxiety_vectorize_layer(["[mask]"]).numpy()[0][0]

In [None]:
'''ALCOHOLISM'''
# run our cleaned alcoholism data through the vocab layer
alcoholism_vectorize_layer = get_vectorize_layer(
    cleaned_alcoholism_docs,
    config.VOCAB_SIZE,
    config.MAX_LEN,
    special_tokens=["[mask]"],
)

# Get mask token id for masked language model
alcoholism_mask_token_id = alcoholism_vectorize_layer(["[mask]"]).numpy()[0][0]

In [None]:
#NOT ENOUGH RAM ON COLAB

'''DEPRESSION'''
# run our cleaned depression data through the vocab layer
depression_vectorize_layer = get_vectorize_layer(
    cleaned_depression_docs,
    config.VOCAB_SIZE,
    config.MAX_LEN,
    special_tokens=["[mask]"],
)

# Get mask token id for masked language model
depression_mask_token_id = depression_vectorize_layer(["[mask]"]).numpy()[0][0]

#### Encoding and Self-Classification with Masked Language Modeling
Code sample from Keras Official Documentation: https://keras.io/examples/nlp/masked_language_modeling/

In [None]:
'''
This is the function which creates automatic labels for our dataset
by using the vectorization and vocab layer we created previously
'''
def get_masked_input_and_labels(encoded_texts, mask_token_id):
    # 15% BERT masking
    inp_mask = np.random.rand(*encoded_texts.shape) < 0.15
    # Do not mask special tokens
    inp_mask[encoded_texts <= 2] = False
    # Set targets to -1 by default, it means ignore
    labels = -1 * np.ones(encoded_texts.shape, dtype=int)
    # Set labels for masked tokens
    labels[inp_mask] = encoded_texts[inp_mask]

    # Prepare input
    encoded_texts_masked = np.copy(encoded_texts)
    # Set input to [MASK] which is the last token for the 90% of tokens
    # This means leaving 10% unchanged
    inp_mask_2mask = inp_mask & (np.random.rand(*encoded_texts.shape) < 0.90)
    encoded_texts_masked[
        inp_mask_2mask
    ] = mask_token_id  # mask token is the last in the dict

    # Set 10% to a random token
    inp_mask_2random = inp_mask_2mask & (np.random.rand(*encoded_texts.shape) < 1 / 9)
    encoded_texts_masked[inp_mask_2random] = np.random.randint(
        3, mask_token_id, inp_mask_2random.sum()
    )

    # Prepare sample_weights to pass to .fit() method
    sample_weights = np.ones(labels.shape)
    sample_weights[labels == -1] = 0

    # y_labels would be same as encoded_texts i.e input tokens
    y_labels = np.copy(encoded_texts)

    return encoded_texts_masked, y_labels, sample_weights
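
The function above returns `sample_weights` that are 0 wherever no token was masked, which is what restricts the training loss to the masked positions even though `y_labels` contains every token. The sketch below shows the idea with a weighted cross-entropy in plain Python. The helper name `masked_cross_entropy`, the toy probabilities, and the choice to normalize by the sum of weights are assumptions for illustration.

```python
import math

def masked_cross_entropy(probs, targets, weights):
    """Average -log p(target) over positions whose weight is non-zero."""
    total, norm = 0.0, 0.0
    for position_probs, target, weight in zip(probs, targets, weights):
        total += weight * -math.log(position_probs[target])
        norm += weight
    return total / norm if norm else 0.0

# Toy predictions over a 4-token vocabulary for a 3-token post.
probs = [
    [0.70, 0.10, 0.10, 0.10],   # unmasked position, weight 0
    [0.10, 0.20, 0.60, 0.10],   # masked position, true id 2
    [0.25, 0.25, 0.25, 0.25],   # masked position, true id 0
]
targets = [0, 2, 0]
weights = [0, 1, 1]
print(round(masked_cross_entropy(probs, targets, weights), 4))
```

Changing the prediction at the unmasked first position leaves the loss unchanged, so the model is only rewarded for recovering the hidden tokens.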

#### Create Automatic Labels

In [None]:
'''ADHD'''
# Prepare data for masked language model for the unlabeled ADHD dataset
x_all_adhd = adhd_vectorize_layer(cleaned_adhd_docs).numpy()
x_masked_adhd_train, y_masked_adhd_labels, adhd_sample_weights = get_masked_input_and_labels(
    x_all_adhd, adhd_mask_token_id)

# formulate our new self-labeled dataset
mlm_adhd_ds = tf.data.Dataset.from_tensor_slices(
    (x_masked_adhd_train, y_masked_adhd_labels, adhd_sample_weights))
mlm_adhd_ds = mlm_adhd_ds.shuffle(1000).batch(config.BATCH_SIZE)

In [None]:
'''Anxiety'''
# Prepare data for masked language model for the unlabeled anxiety dataset
x_all_anxiety = anxiety_vectorize_layer(cleaned_anxiety_docs).numpy()
x_masked_anxiety_train, y_masked_anxiety_labels, anxiety_sample_weights = get_masked_input_and_labels(
    x_all_anxiety, anxiety_mask_token_id)

# formulate our new self-labeled dataset
mlm_anxiety_ds = tf.data.Dataset.from_tensor_slices(
    (x_masked_anxiety_train, y_masked_anxiety_labels, anxiety_sample_weights))
mlm_anxiety_ds = mlm_anxiety_ds.shuffle(1000).batch(config.BATCH_SIZE)

In [None]:
'''Alcoholism'''
# Prepare data for masked language model for the unlabeled alcoholism dataset
x_all_alcoholism = alcoholism_vectorize_layer(cleaned_alcoholism_docs).numpy()
x_masked_alcoholism_train, y_masked_alcoholism_labels, alcoholism_sample_weights = get_masked_input_and_labels(
    x_all_alcoholism, alcoholism_mask_token_id)

# formulate our new self-labeled dataset
mlm_alcoholism_ds = tf.data.Dataset.from_tensor_slices(
    (x_masked_alcoholism_train, y_masked_alcoholism_labels, alcoholism_sample_weights))
mlm_alcoholism_ds = mlm_alcoholism_ds.shuffle(1000).batch(config.BATCH_SIZE)

In [None]:
'''Depression'''
# Prepare data for masked language model for the unlabeled depression dataset
x_all_depression = depression_vectorize_layer(cleaned_depression_docs).numpy()
x_masked_depression_train, y_masked_depression_labels, depression_sample_weights = get_masked_input_and_labels(
    x_all_depression, depression_mask_token_id)

# formulate our new self-labeled dataset
mlm_depression_ds = tf.data.Dataset.from_tensor_slices(
    (x_masked_depression_train, y_masked_depression_labels, depression_sample_weights))
mlm_depression_ds = mlm_depression_ds.shuffle(1000).batch(config.BATCH_SIZE)

Self-labeled datasets (ADHD, anxiety, alcoholism)

In [None]:
# number of batches (of size config.BATCH_SIZE) in each self-labeled dataset
print('adhd', len(mlm_adhd_ds))
print('anxiety', len(mlm_anxiety_ds))
print('alcoholism', len(mlm_alcoholism_ds))

### Create BERT Model (Pretraining Model) for masked language modeling
It will take token ids as inputs (including masked tokens) and it will predict the correct ids for the masked input tokens.  
Code sample from Keras Official Documentation: https://keras.io/examples/nlp/masked_language_modeling/

In [None]:
## please note this bert module is from Keras Documentation
## it is included here because tensorflow or keras do not have a set of
## functions we can just use for this, we have to include them here

def bert_module(query, key, value, i):
    # Multi headed self-attention
    attention_output = layers.MultiHeadAttention(
        num_heads=config.NUM_HEAD,
        key_dim=config.EMBED_DIM // config.NUM_HEAD,
        name="encoder_{}/multiheadattention".format(i),
    )(query, key, value)
    attention_output = layers.Dropout(0.1, name="encoder_{}/att_dropout".format(i))(
        attention_output
    )
    attention_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}/att_layernormalization".format(i)
    )(query + attention_output)

    # Feed-forward layer
    ffn = keras.Sequential(
        [
            layers.Dense(config.FF_DIM, activation="relu"),
            layers.Dense(config.EMBED_DIM),
        ],
        name="encoder_{}/ffn".format(i),
    )
    ffn_output = ffn(attention_output)
    ffn_output = layers.Dropout(0.1, name="encoder_{}/ffn_dropout".format(i))(
        ffn_output
    )
    sequence_output = layers.LayerNormalization(
        epsilon=1e-6, name="encoder_{}/ffn_layernormalization".format(i)
    )(attention_output + ffn_output)
    return sequence_output


def get_pos_encoding_matrix(max_len, d_emb):
    pos_enc = np.array(
        [
            [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)]
            if pos != 0
            else np.zeros(d_emb)
            for pos in range(max_len)
        ]
    )
    pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2])  # dim 2i
    pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2])  # dim 2i+1
    return pos_enc


loss_fn = keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE
)
loss_tracker = tf.keras.metrics.Mean(name="loss")


class MaskedLanguageModel(tf.keras.Model):
    def train_step(self, inputs):
        if len(inputs) == 3:
            features, labels, sample_weight = inputs
        else:
            features, labels = inputs
            sample_weight = None

        with tf.GradientTape() as tape:
            predictions = self(features, training=True)
            loss = loss_fn(labels, predictions, sample_weight=sample_weight)

        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Compute our own metrics
        loss_tracker.update_state(loss, sample_weight=sample_weight)

        # Return a dict mapping metric names to current value
        return {"loss": loss_tracker.result()}

    @property
    def metrics(self):
        # We list our `Metric` objects here so that `reset_states()` can be
        # called automatically at the start of each epoch
        # or at the start of `evaluate()`.
        # If you don't implement this property, you have to call
        # `reset_states()` yourself at the time of your choosing.
        return [loss_tracker]


def create_masked_language_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)

    word_embeddings = layers.Embedding(
        config.VOCAB_SIZE, config.EMBED_DIM, name="word_embedding"
    )(inputs)
    position_embeddings = layers.Embedding(
        input_dim=config.MAX_LEN,
        output_dim=config.EMBED_DIM,
        weights=[get_pos_encoding_matrix(config.MAX_LEN, config.EMBED_DIM)],
        name="position_embedding",
    )(tf.range(start=0, limit=config.MAX_LEN, delta=1))
    embeddings = word_embeddings + position_embeddings

    encoder_output = embeddings
    for i in range(config.NUM_LAYERS):
        encoder_output = bert_module(encoder_output, encoder_output, encoder_output, i)

    mlm_output = layers.Dense(config.VOCAB_SIZE, name="mlm_cls", activation="softmax")(
        encoder_output
    )
    mlm_model = MaskedLanguageModel(inputs, mlm_output, name="masked_bert_model")

    optimizer = keras.optimizers.Adam(learning_rate=config.LR)
    mlm_model.compile(optimizer=optimizer)
    return mlm_model


id2token = dict(enumerate(alcoholism_vectorize_layer.get_vocabulary()))
token2id = {y: x for x, y in id2token.items()}
mask_token_id = alcoholism_mask_token_id

# anxiety_id2token = dict(enumerate(anxiety_vectorize_layer.get_vocabulary()))
# anxiety_token2id = {y: x for x, y in anxiety_id2token.items()}

# alcoholism_id2token = dict(enumerate(alcoholism_vectorize_layer.get_vocabulary()))
# alcoholism_token2id = {y: x for x, y in alcoholism_id2token.items()}

# depression_id2token = dict(enumerate(depression_vectorize_layer.get_vocabulary()))
# depression_token2id = {y: x for x, y in depression_id2token.items()}



## optional text generator
class MaskedTextGenerator(keras.callbacks.Callback):
    def __init__(self, sample_tokens, top_k=5):
        self.sample_tokens = sample_tokens
        self.k = top_k

    def decode(self, tokens):
        return " ".join([id2token[t] for t in tokens if t != 0])

    def convert_ids_to_tokens(self, id):
        return id2token[id]

    def on_epoch_end(self, epoch, logs=None):
        prediction = self.model.predict(self.sample_tokens)

        masked_index = np.where(self.sample_tokens == mask_token_id)
        masked_index = masked_index[1]
        mask_prediction = prediction[0][masked_index]

        top_indices = mask_prediction[0].argsort()[-self.k :][::-1]
        values = mask_prediction[0][top_indices]

        for i in range(len(top_indices)):
            p = top_indices[i]
            v = values[i]
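            # Use the tokens stored on the callback (already a NumPy array),
            # not an undefined global `sample_tokens`.
            tokens = np.copy(self.sample_tokens[0])
            tokens[masked_index[0]] = p
            result = {
                "input_text": self.decode(self.sample_tokens[0]),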
                "prediction": self.decode(tokens),
                "probability": v,
                "predicted mask token": self.convert_ids_to_tokens(p),
            }
            pprint(result)
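
The `get_pos_encoding_matrix` helper in the cell above fills the position-embedding weights with sinusoidal encodings: each pair of dimensions 2i and 2i+1 uses the sine and cosine of pos / 10000^(2i / d_emb), and row 0 is left as zeros. As a rough illustration, the same table can be built in vectorized NumPy. The function name `sinusoidal_position_matrix` and the small sizes used below are our own assumptions for this sketch, not part of the notebook.

```python
import numpy as np


def sinusoidal_position_matrix(max_len, d_emb):
    """Vectorized sketch of the sinusoidal position-encoding table."""
    positions = np.arange(max_len)[:, np.newaxis]  # shape (max_len, 1)
    dims = np.arange(d_emb)[np.newaxis, :]         # shape (1, d_emb)
    # Dimensions 2i and 2i+1 share the same frequency 1 / 10000^(2i / d_emb).
    angles = positions / np.power(10000.0, 2 * (dims // 2) / d_emb)
    table = np.zeros((max_len, d_emb))
    # Row 0 stays all zeros, mirroring the notebook's helper.
    table[1:, 0::2] = np.sin(angles[1:, 0::2])  # even dimensions
    table[1:, 1::2] = np.cos(angles[1:, 1::2])  # odd dimensions
    return table


pos = sinusoidal_position_matrix(max_len=4, d_emb=6)
print(pos.shape)  # (4, 6)
```

Because the table depends only on `max_len` and `d_emb`, it can be computed once and passed as the initial weights of the position embedding, which is what the model-building function above does.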

In [None]:
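## this callback can show us the evolution of our training
## NOTE: MaskedTextGenerator decodes with the global id2token and mask_token_id,
## which were built from the alcoholism vocabulary above; rebuild them from
## adhd_vectorize_layer before using this callback with the ADHD model.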
adhd_sample_tokens = adhd_vectorize_layer(["Lately I have been feeling [mask] and I do not know what to do"])
generator_callback = MaskedTextGenerator(adhd_sample_tokens.numpy())

bert_masked_adhd_model = create_masked_language_bert_model()
bert_masked_adhd_model.summary()

In [None]:
# BERT model on a much smaller dataset

## this callback can show us the evolution of our training
alcoholism_sample_tokens = alcoholism_vectorize_layer(["I am so happy to be [mask] now. Daily drinking was ruining my life."])
generator_callback = MaskedTextGenerator(alcoholism_sample_tokens.numpy())

bert_masked_alcoholism_model = create_masked_language_bert_model()
bert_masked_alcoholism_model.summary()
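
The `MaskedTextGenerator` callback created above looks up the `[mask]` position in the sample sentence and prints the `top_k` most probable vocabulary entries for it after each epoch. The selection step can be sketched with NumPy alone. The toy vocabulary, the probabilities, and the function name `top_k_mask_predictions` are hypothetical and only stand in for a real model's output.

```python
import numpy as np


def top_k_mask_predictions(probabilities, token_ids, mask_token_id, id_to_token, k=5):
    """Return the k most likely (token, probability) pairs for the first masked position.

    probabilities: array of shape (seq_len, vocab_size) for a single sequence.
    token_ids: array of shape (seq_len,) holding the input token ids.
    """
    masked_positions = np.where(token_ids == mask_token_id)[0]
    if masked_positions.size == 0:
        return []  # nothing to predict if the sentence has no mask token
    distribution = probabilities[masked_positions[0]]
    # argsort is ascending, so take the last k indices and reverse them.
    top_ids = distribution.argsort()[-k:][::-1]
    return [(id_to_token[int(i)], float(distribution[i])) for i in top_ids]


# Toy example with a hypothetical 5-token vocabulary.
vocab = {0: "", 1: "[UNK]", 2: "[mask]", 3: "sober", 4: "tired"}
tokens = np.array([3, 2, 4, 0])
probs = np.full((4, 5), 0.2)
probs[1] = [0.05, 0.05, 0.1, 0.6, 0.2]  # distribution at the masked position
result = top_k_mask_predictions(probs, tokens, mask_token_id=2, id_to_token=vocab, k=2)
print(result)
```

In the notebook the probabilities come from `self.model.predict(...)`, and the vocabulary lookup uses `id2token`.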

### Train and save model

In [None]:
# Training takes roughly 15 hours, so the calls below are left commented out

#bert_masked_adhd_model.fit(mlm_adhd_ds, epochs=2, callbacks=[generator_callback])
#bert_masked_adhd_model.save(data_path + "/adhd/bert_mlm_adhd.h5")

In [None]:
# bert_masked_alcoholism_model.fit(mlm_alcoholism_ds, epochs=2, callbacks=[generator_callback])
# bert_masked_alcoholism_model.save(data_path + "/alcoholism/bert_mlm_alcoholism.h5")

In [None]:
# Load pretrained bert model
alcoholism_mlm_model = keras.models.load_model(
    data_path+"/alcoholism/bert_mlm_alcoholism.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)
pretrained_bert_model = tf.keras.Model(
    alcoholism_mlm_model.input, alcoholism_mlm_model.get_layer("encoder_0/ffn_layernormalization").output
)

# Freeze it
pretrained_bert_model.trainable = False

In [None]:
def create_classifier_bert_model():
    inputs = layers.Input((config.MAX_LEN,), dtype=tf.int64)
    sequence_output = pretrained_bert_model(inputs)
    pooled_output = layers.GlobalMaxPooling1D()(sequence_output)
    hidden_layer = layers.Dense(64, activation="relu")(pooled_output)
    outputs = layers.Dense(1, activation="sigmoid")(hidden_layer)
    classifer_model = keras.Model(inputs, outputs, name="classification")
    optimizer = keras.optimizers.Adam()
    classifer_model.compile(
        optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
    )
    return classifer_model

alcoholism_classifer_model = create_classifier_bert_model()
alcoholism_classifer_model.summary()
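
The classifier above turns the encoder's per-token output into one fixed-size vector per post with `layers.GlobalMaxPooling1D` before the dense layers. A small NumPy sketch of that reduction follows. The array sizes are made up for illustration, and `global_max_pool` is a hypothetical helper rather than a Keras function.

```python
import numpy as np


def global_max_pool(sequence_output):
    """Reduce (batch, seq_len, embed_dim) to (batch, embed_dim) by taking the max over seq_len."""
    return sequence_output.max(axis=1)


# Hypothetical encoder output: 3 posts, 8 tokens each, 16-dimensional embeddings.
rng = np.random.default_rng(0)
fake_encoder_output = rng.normal(size=(3, 8, 16))

pooled = global_max_pool(fake_encoder_output)
print(pooled.shape)  # (3, 16)
```

The pooled result has one row per post and one column per embedding dimension, which is the two-dimensional layout that dense layers and scikit-learn estimators take as input.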

In [None]:
# for text_batch, label_batch, weights in mlm_adhd_ds.take(1):
#     for i in range(1):
#         print('text vector\n', text_batch.numpy()[i])
#         print('label\n', label_batch.numpy()[i])
        

In [None]:
# Our model previously split 32% of the data for this auto-classification task

DATASET_SIZE = len(list(mlm_adhd_ds))
DATASET_SIZE

In [None]:
# Control the train/test split of the ADHD TensorFlow dataset:
# the first 70% of its elements go to training and the remainder to testing

train_size = int(0.7 * DATASET_SIZE)
test_size = int(0.3 * DATASET_SIZE)

adhd_train_dataset = mlm_adhd_ds.take(train_size)
adhd_test_dataset = mlm_adhd_ds.skip(train_size)

print("train", len(list(adhd_train_dataset)))
print("test", len(list(adhd_test_dataset)))

In [None]:
# split tensorflow datasets into x and y arrays to use with sklearn

# training data split into text vectorizations and vectorized labels
adhd_train_x = np.array([list(x[0].numpy()) for x in list(adhd_train_dataset)])
adhd_train_y = np.array([x[1].numpy() for x in list(adhd_train_dataset)])

# test data split into text vectorizations and vectorized labels
adhd_test_x = np.array([list(x[0].numpy()) for x in list(adhd_test_dataset)])
adhd_test_y = np.array([x[1].numpy() for x in list(adhd_test_dataset)])

In [None]:
# view the length of our dataset to make sure it's the right number of training samples

len(adhd_train_x), len(adhd_train_y)

In [None]:
# PROBLEM
# we have a dataset with a very weird shape, making it difficult to put into
# any sklearn fit() function. We need to reduce the dimensionality of the data
# somehow to be able to train with it

np.array(adhd_train_x).shape, np.array(adhd_train_y).shape
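
One possible cause of the shape problem described above, assuming `mlm_adhd_ds` is a batched dataset (the loop over `text_batch, label_batch, weights` earlier suggests this, but we have not confirmed it), is that each element collected into `adhd_train_x` is a whole batch rather than a single post. Stacking batches gives a three-dimensional array, while scikit-learn's `fit()` expects features of shape `(n_samples, n_features)`. The sketch below uses made-up NumPy batches to show how concatenating along the first axis restores a two-dimensional array. `batches_to_2d` is a hypothetical helper.

```python
import numpy as np


def batches_to_2d(batches):
    """Concatenate a list of (batch_size, seq_len) arrays into one (n_samples, seq_len) array."""
    return np.concatenate([np.asarray(b) for b in batches], axis=0)


# Hypothetical stand-in for iterating over a batched dataset:
# two full batches of 4 sequences and a final partial batch of 2, each of length 6.
rng = np.random.default_rng(0)
feature_batches = [rng.integers(0, 100, size=(n, 6)) for n in (4, 4, 2)]

stacked = np.stack(feature_batches[:2])  # shape (2, 4, 6): one entry per batch
flat = batches_to_2d(feature_batches)    # shape (10, 6): one row per post
print(stacked.shape, flat.shape)
```

This addresses only the shape of the features. A masked-language-model dataset pairs each sequence with per-token targets, whereas the KNN and SVM classifiers used here need one class label per post, so a separate label array would still be required.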

In [None]:
DATASET_SIZE = len(list(mlm_alcoholism_ds))
print("data size ", DATASET_SIZE)

train_size = int(0.7 * DATASET_SIZE)
test_size = int(0.3 * DATASET_SIZE)

alcoholism_train_dataset = mlm_alcoholism_ds.take(train_size)
alcoholism_test_dataset = mlm_alcoholism_ds.skip(train_size)

print("train", len(list(alcoholism_train_dataset)))
print("test", len(list(alcoholism_test_dataset)))

### Training models

In [None]:
# Train the classifier with frozen BERT stage
alcoholism_classifer_model.fit(
    alcoholism_train_dataset,
    epochs=5,
    validation_data=alcoholism_test_dataset,
)

# Unfreeze the BERT model for fine-tuning
pretrained_bert_model.trainable = True
optimizer = keras.optimizers.Adam()
alcoholism_classifer_model.compile(
    optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
)
alcoholism_classifer_model.fit(
    alcoholism_train_dataset,
    epochs=5,
    validation_data=alcoholism_test_dataset,
)

#### KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(adhd_train_x, adhd_train_y)

#### SVM

In [None]:
from sklearn.svm import SVC

# degree and gamma are ignored by the linear kernel, so they are omitted
clf = SVC(C=1.0, kernel='linear')

clf.fit(adhd_train_x, adhd_train_y)

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

lin_svm = OneVsRestClassifier(LinearSVC(random_state=0)).fit(adhd_train_x, adhd_train_y)

# Discussion

### Interpreting the result

Unfortunately, we ran into too many implementation issues to produce concrete results for the evaluation metrics we wanted to analyze. The choice of dataset proved to be one of the main obstacles. Because the dataset is unlabeled, we either had to hand-label thousands of rows or find another way to use the data in our supervised learning models. Initially, we planned to use tf-idf, together with lexicons developed by other researchers, to generate labels from the text we had already collected. However, we could not get past this stage because many of these lexicons are private. We then tried deep learning, using a BERT Transformer-Encoder network architecture to label the datasets automatically. This was mostly successful in that we were able to label our datasets. However, when we reached the model training and testing phase, we found that the output of the BERT Transformer-Encoder network was unusable for our planned KNN and SVM models. We were unable to solve this bug and ran out of time to complete our implementation.

### Limitations

Overall, our project had many limitations and shortcomings. With more time, we would have liked either to find a different, labeled dataset or to keep working on our neural network auto-labeling code so that it outputs labels usable by the algorithms we wanted to test. We believe the approach shows promise and is worth investigating further. Our initial attempts did find ways to auto-label our datasets, which points to potential research on deep neural networks trained on internet text posts to discover mental health issues. Furthermore, we were limited by the amount of RAM available and often ran out of memory while generating labels and training our neural networks. With better hardware, we could expand the project to generate the results we were hoping for.

### Ethics & Privacy

Because our project and data potentially involve sensitive or intimate information about a person’s mental health, there are clear ethical and privacy concerns. To preserve anonymity, we plan to remove the authors’ usernames from each observation, or replace them with anonymous IDs, during our data cleaning process.
Additionally, the text can be processed further to remove sensitive information such as names and addresses. It is also important to note that the data was originally collected using Reddit’s API, which pulls from publicly available subreddits. This means we are not in violation of any major privacy standards, and all of the data we use can be found on the open web.
In terms of medical ethics, we hope our model can be used simply as a preventative aid and optional resource for Reddit users who may require a mental health diagnosis. Most importantly, it should never carry more weight than the opinion of a medical professional.

### Conclusion

Overall, we wanted to create a model that could correctly identify mental health disorders from text posts found on the internet. We wanted this to be a tool that could support people struggling with mental health issues by pointing them in the right direction to seek professional medical help. However, we faced numerous challenges throughout the implementation of our project and could not bring it to fruition. With more time, we would want to find a better-labeled dataset to use for supervised learning, as well as apply deep learning techniques to see if we can better detect mental health disorders. Once achieved, this project would fit with other work in this field [2, 3] to improve the detection of mental health issues and help prevent them from worsening across a diverse population. In the future, a practical extension of this experiment would be private notification alerts sent to Reddit users whose posts are detected as indicating a potential for self-harm or a possible mental disorder.

# Footnotes
[1] Alegria, Margarita, Jackson, James S. (James Sidney), Kessler, Ronald C., and Takeuchi, David. Collaborative Psychiatric Epidemiology Surveys (CPES), 2001-2003 [United States]. Inter-university Consortium for Political and Social Research [distributor], 2016-03-23. https://doi.org/10.3886/ICPSR20240.v8  
[2] Prerna Chikersal, Afsaneh Doryab, Michael Tumminia, Daniella K. Villalba, Janine M. Dutcher, Xinwen Liu, Sheldon Cohen, Kasey G. Creswell, Jennifer Mankoff, J. David Creswell, Mayank Goel, Anind Dey. 2020. Detecting Depression and Predicting its Onset Using Longitudinal Symptoms Captured by Passive Sensing: A Machine Learning Approach With Robust Feature Selection. ACM Transactions on Computer-Human Interaction (TOCHI), 2020.  
[3] Low, D. M., Rumker, L., Torous, J., Cecchi, G., Ghosh, S. S., & Talkar, T. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of medical Internet research, 22(10), e22635. https://zenodo.org/record/3941387#.YmXlUNPMKDU  
[4] Kuwamura, Kaiko, et al. "Hugvie: A medium that fosters love." 2013 IEEE RO-MAN. IEEE, 2013.  
[5] Šabanović, Selma, et al. "PARO robot affects diverse interaction modalities in group sensory therapy for older adults with dementia." 2013 IEEE 13th international conference on rehabilitation robotics (ICORR). IEEE, 2013.  