<a href="https://colab.research.google.com/github/CombustingRats/mental_health_classifier/blob/main/mental_health_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Mental Health Condition Classification

The goal of this project is to correctly identify a possible mental disorder based on text input.

The dataset is taken from https://huggingface.co/datasets/solomonk/reddit_mental_health_posts. Each post is labeled with the subreddit it was posted to.

The model is the pretrained bert-base-uncased checkpoint, fine-tuned on the data above.

Purely for experimentation. Do not take this as serious medical advice or use it to inform medical decisions.

## Load packages and credentials

### Installation and imports

In [1]:
!pip install transformers
!pip install datasets


In [2]:
from transformers import pipeline, AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer, AutoConfig, AutoModelForSequenceClassification
from datasets import load_dataset, Dataset, ClassLabel, load_metric
import torch
import pandas as pd
import numpy as np

### Sign in to Hugging Face

In [3]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [4]:
mental_health_ds = load_dataset('solomonk/reddit_mental_health_posts')

Using custom data configuration solomonk--reddit_mental_health_posts-954e1c5cc1be8399


Downloading and preparing dataset csv/solomonk--reddit_mental_health_posts to /root/.cache/huggingface/datasets/solomonk___csv/solomonk--reddit_mental_health_posts-954e1c5cc1be8399/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/solomonk___csv/solomonk--reddit_mental_health_posts-954e1c5cc1be8399/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


In [5]:
mental_health_ds

DatasetDict({
    train: Dataset({
        features: ['author', 'body', 'created_utc', 'id', 'num_comments', 'score', 'subreddit', 'title', 'upvote_ratio', 'url'],
        num_rows: 151288
    })
})

## Basic cleaning

In [6]:
df = pd.DataFrame(mental_health_ds['train'])

In [7]:
# Drop posts whose body was removed or deleted, since they carry no text

df = df[(df['body'] != '[removed]') & (df['body'] != '[deleted]')]

In [8]:
df.isnull().sum()

author             0
body            1609
created_utc        0
id                 0
num_comments       0
score              0
subreddit          0
title              0
upvote_ratio       0
url                0
dtype: int64

In [9]:
df['subreddit'].unique()

array(['ADHD', 'aspergers', 'depression', 'OCD', 'ptsd'], dtype=object)

In [10]:
df.dropna(inplace=True, axis=0)

In [11]:
mental_health_ds = Dataset.from_pandas(df)
mental_health_ds

Dataset({
    features: ['author', 'body', 'created_utc', 'id', 'num_comments', 'score', 'subreddit', 'title', 'upvote_ratio', 'url', '__index_level_0__'],
    num_rows: 87078
})

In [12]:
# Split into train (70%) and test (30%) sets

mental_health_ds = mental_health_ds.train_test_split(test_size=0.3)
mental_health_ds

DatasetDict({
    train: Dataset({
        features: ['author', 'body', 'created_utc', 'id', 'num_comments', 'score', 'subreddit', 'title', 'upvote_ratio', 'url', '__index_level_0__'],
        num_rows: 60954
    })
    test: Dataset({
        features: ['author', 'body', 'created_utc', 'id', 'num_comments', 'score', 'subreddit', 'title', 'upvote_ratio', 'url', '__index_level_0__'],
        num_rows: 26124
    })
})
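
The split sizes above can be checked by hand. This is a small arithmetic sketch, not a call into the `datasets` library. It assumes the test share is rounded up and the remainder goes to the training set, which matches the row counts printed above.

```python
import math

num_rows = 87078   # rows left after cleaning, from the output above
test_size = 0.3    # same value passed to train_test_split

# Assumption: the test share is rounded up, the rest is used for training.
n_test = math.ceil(num_rows * test_size)
n_train = num_rows - n_test

print(n_train, n_test)
```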

In [13]:
print(mental_health_ds['train'][1]['body'])


I was diagnosed with OCD about 2 years ago because of my intrusive thoughts and some rituals that I had at that time. I have intrusive thoughts all the time I can't live normally. 

One of my psychologist friends told me that if I am not struggling with rituals it means it's not OCD. 
What do you think?


## Tokenize Text

In [14]:
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentence = tokenizer(mental_health_ds['train'][3]['body'])
print(tokenized_sentence)

{'input_ids': [101, 7592, 1010, 1045, 1521, 1049, 1037, 2484, 2095, 2214, 24665, 4215, 3076, 1012, 1045, 1521, 2310, 2467, 2018, 4390, 7995, 1998, 2893, 11116, 1013, 23760, 8081, 4383, 1012, 2026, 8619, 6749, 2008, 1045, 2131, 16330, 2005, 5587, 1013, 4748, 14945, 1012, 1045, 2031, 5427, 1010, 2021, 1045, 1521, 1049, 6603, 2876, 1521, 1056, 2009, 2022, 2488, 2000, 2424, 1037, 2334, 18146, 2030, 7522, 2030, 2000, 2224, 1037, 2326, 2066, 7592, 3805, 2030, 2589, 1029, 4283, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [15]:
print(tokenizer.convert_ids_to_tokens(tokenized_sentence['input_ids']))

['[CLS]', 'hello', ',', 'i', '’', 'm', 'a', '24', 'year', 'old', 'gr', '##ad', 'student', '.', 'i', '’', 've', 'always', 'had', 'trouble', 'focusing', 'and', 'getting', 'distracted', '/', 'hyper', 'fix', '##ated', '.', 'my', 'advisor', 'recommended', 'that', 'i', 'get', 'evaluated', 'for', 'add', '/', 'ad', '##hd', '.', 'i', 'have', 'insurance', ',', 'but', 'i', '’', 'm', 'wondering', 'wouldn', '’', 't', 'it', 'be', 'better', 'to', 'find', 'a', 'local', 'psychiatrist', 'or', 'physician', 'or', 'to', 'use', 'a', 'service', 'like', 'hello', 'ahead', 'or', 'done', '?', 'thanks', '!', '[SEP]']


In [16]:
str_to_int = {
    'ADHD' : 0,
    'aspergers': 1,
    'depression': 2,
    'OCD': 3,
    'ptsd': 4
}

int_to_str = {item: key for key,item in str_to_int.items()}

In [17]:
def tokenize_function(batch):
    tokenized_batch = tokenizer(batch["body"], truncation=True)
    tokenized_batch['label'] = [str_to_int[label] for label in batch['subreddit']]
    return tokenized_batch

In [18]:
tokenized_dataset = mental_health_ds.map(tokenize_function, batched=True)

In [19]:
# Cast the label column to ClassLabel

tokenized_dataset = tokenized_dataset.cast_column('label', ClassLabel(num_classes=5, names=list(str_to_int.keys())))

In [20]:
tokenized_dataset['train']['label'][0]

3

## Padding

In [21]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [23]:
# Try the data collator on a few samples, keeping only the model inputs and the label

samples = tokenized_dataset['train'][:10]
samples = {k:v for k,v in samples.items() if k in ['input_ids', 'token_type_ids', 'attention_mask','label']}

In [24]:
[len(x) for x in samples['input_ids']]

[126, 74, 124, 78, 512, 512, 379, 354, 208, 116]

In [25]:
# Add a padding token only if the tokenizer does not already define one

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

In [26]:
batch = data_collator(samples)

In [27]:
{k : v.shape for k,v in batch.items()}

{'attention_mask': torch.Size([10, 512]),
 'input_ids': torch.Size([10, 512]),
 'labels': torch.Size([10]),
 'token_type_ids': torch.Size([10, 512])}
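
The `[10, 512]` shapes above come from dynamic padding: every sequence in a batch is padded to the length of the longest one, which here is 512. The following is a minimal pure-Python sketch of that idea, not the `DataCollatorWithPadding` implementation. `pad_batch` is a hypothetical helper, and a pad id of 0 is used for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask is 1 for real tokens and 0 for padding positions.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy batch reusing the first four lengths printed above (126, 74, 124, 78).
toy_batch = [[1] * n for n in (126, 74, 124, 78)]
padded = pad_batch(toy_batch)

# Every row now has the length of the longest sequence in this batch.
print([len(row) for row in padded["input_ids"]])
```

Because padding is applied per batch, a batch of short posts stays short. Only batches that contain a truncated 512-token post are padded to the full length.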

In [29]:
tokenized_dataset = tokenized_dataset.remove_columns(['author', 'body', 'created_utc', 'id', 'num_comments', 'score', 'subreddit', 'title', 'upvote_ratio', 'url', '__index_level_0__'])

In [30]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
        num_rows: 60954
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
        num_rows: 26124
    })
})

## Model Building and Training

### Thinning down the dataset

In [32]:
train_sample = tokenized_dataset['train'].shuffle().select(list(range(0,10000)))
test_sample = tokenized_dataset['test'].shuffle().select(list(range(0,1000)))

In [33]:
tokenized_dataset['train'].select(list(range(0,1000)))

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 1000
})

### Setting Training Arguments and Config, instantiating model

In [40]:
training_args = TrainingArguments('mental_health_trainer', 
                                  save_strategy='epoch', 
                                  evaluation_strategy='epoch',
                                  push_to_hub=True)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [35]:
# To let the model know how to name our labels

config = AutoConfig.from_pretrained(checkpoint, label2id=str_to_int, id2label=int_to_str)

In [36]:
# The config already carries the five labels, so num_labels is not passed separately
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, config=config)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### Setting training metrics

In [37]:
def compute_metrics(eval_preds):
    """Return accuracy plus support-weighted precision, recall and F1."""
    metric_acc = load_metric("accuracy")
    metric_prec = load_metric("precision")
    metric_rec = load_metric("recall")
    metric_f1 = load_metric("f1")

    # eval_preds holds the raw logits and the true label ids
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    accuracy = metric_acc.compute(predictions=predictions, references=labels)['accuracy']
    precision = metric_prec.compute(predictions=predictions, references=labels, average='weighted')['precision']
    recall = metric_rec.compute(predictions=predictions, references=labels, average='weighted')['recall']
    f1 = metric_f1.compute(predictions=predictions, references=labels, average='weighted')['f1']

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
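
To make `average='weighted'` concrete: each metric is computed per class and then averaged with weights proportional to how many true examples each class has. The following is a pure-Python sketch of that definition, not the library implementation. `weighted_scores` is a hypothetical helper and the labels are made up.

```python
from collections import Counter


def weighted_scores(predictions, references):
    """Support-weighted precision, recall and F1 over the classes in references."""
    support = Counter(references)
    total = len(references)
    precision = recall = f1 = 0.0
    for cls, n_true in support.items():
        true_pos = sum(p == cls and r == cls for p, r in zip(predictions, references))
        n_pred = sum(p == cls for p in predictions)
        p_cls = true_pos / n_pred if n_pred else 0.0
        r_cls = true_pos / n_true
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = n_true / total  # share of true examples belonging to this class
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only.
refs = [0, 0, 1, 1, 2, 2, 2, 2]
preds = [0, 1, 1, 1, 2, 2, 2, 0]
scores = weighted_scores(preds, refs)
accuracy = sum(p == r for p, r in zip(preds, refs)) / len(refs)
print(scores, accuracy)
```

One consequence of this definition is that support-weighted recall always equals accuracy, because the per-class weights cancel the per-class denominators.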

### Instantiating the trainer and training

In [41]:
trainer = Trainer(
    model,
    training_args,
    train_dataset = train_sample,
    eval_dataset = test_sample,
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics,
)

/content/mental_health_trainer is already a clone of https://huggingface.co/edmundhui/mental_health_trainer. Make sure you pull the latest changes with `repo.git_pull()`.


In [42]:
trainer.train()

***** Running training *****
  Num examples = 10000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3750


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.5722 | 0.479368 | 0.87 | 0.8968 | 0.87 | 0.873099 |
| 2 | 0.3398 | 0.557856 | 0.86 | 0.877139 | 0.86 | 0.860307 |
| 3 | 0.1438 | 0.689282 | 0.87 | 0.872192 | 0.87 | 0.869857 |


***** Running Evaluation *****
  Num examples = 100
  Batch size = 8


Saving model checkpoint to mental_health_trainer/checkpoint-1250
Configuration saved in mental_health_trainer/checkpoint-1250/config.json
Model weights saved in mental_health_trainer/checkpoint-1250/pytorch_model.bin
tokenizer config file saved in mental_health_trainer/checkpoint-1250/tokenizer_config.json
Special tokens file saved in mental_health_trainer/checkpoint-1250/special_tokens_map.json
tokenizer config file saved in mental_health_trainer/tokenizer_config.json
Special tokens file saved in mental_health_trainer/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
Saving model checkpoint to mental_health_trainer/checkpoint-2500
Configuration saved in mental_health_trainer/checkpoint-2500/config.json
Model weights saved in mental_health_trainer/checkpoint-2500/pytorch_model.bin
tokenizer config file saved in mental_health_trainer/checkpoint-2500/tokenizer_config.json
Special tokens file saved in mental_health_trainer/checkpoint-2500/special

TrainOutput(global_step=3750, training_loss=0.35635265502929686, metrics={'train_runtime': 2538.5157, 'train_samples_per_second': 11.818, 'train_steps_per_second': 1.477, 'total_flos': 6669810571866768.0, 'train_loss': 0.35635265502929686, 'epoch': 3.0})
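
The step counts in the training log follow directly from the dataset size, the batch size and the number of epochs. Here is a short arithmetic sketch using the values printed in the log.

```python
import math

num_examples = 10_000  # "Num examples" in the training log
batch_size = 8         # "Total train batch size"
num_epochs = 3         # "Num Epochs"

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

With `save_strategy='epoch'`, a checkpoint is written at the end of each epoch, which is why the log shows `checkpoint-1250` and `checkpoint-2500`.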

In [None]:
# upload the model to the huggingface hub

trainer.push_to_hub()

## Testing

In [43]:
test_set = tokenized_dataset['test'].shuffle().select(list(range(1000)))

In [44]:
predictions = trainer.predict(test_set)
preds = np.argmax(predictions.predictions, axis=-1)

***** Running Prediction *****
  Num examples = 1000
  Batch size = 8


In [45]:
ground_truth = [int_to_str[label] for label in test_set['label']]
str_preds = [int_to_str[pred] for pred in preds]

In [46]:
# Decode each example's token ids back into readable text
test_sentences = []
for i in range(len(test_set)):
    tokens = tokenizer.convert_ids_to_tokens(test_set[i]['input_ids'])
    test_sentences.append(tokenizer.convert_tokens_to_string(tokens))

In [47]:
test_dataframe = pd.DataFrame({"sentence":test_sentences, "prediction":str_preds, "ground_truth": ground_truth})

In [50]:
test_dataframe

| | sentence | prediction | ground_truth |
|---|---|---|---|
| 0 | [CLS] i don't know if this is a common thing a... | aspergers | aspergers |
| 1 | [CLS] i was so frustrated inside when she said... | aspergers | aspergers |
| 2 | [CLS] i was reading on one of the adhd subs ab... | ADHD | aspergers |
| 3 | [CLS] i was just interviewed for a software en... | aspergers | ADHD |
| 4 | [CLS] not to be that overly offended person, b... | ptsd | ptsd |
| ... | ... | ... | ... |
| 995 | [CLS] i have always enjoyed driving and am the... | ADHD | ADHD |
| 996 | [CLS] it's like the title says. i don't hate m... | depression | depression |
| 997 | [CLS] i started concerta about 3 weeks ago and... | ADHD | ADHD |
| 998 | [CLS] like i get very upset i guess you would ... | OCD | OCD |


### Inspecting entries that the model got wrong

In [58]:
# Filter the misclassified rows once, then look at the third one
wrong = test_dataframe[test_dataframe['prediction'] != test_dataframe['ground_truth']]
print(wrong.iloc[2]['sentence'])
print(wrong.iloc[2]['ground_truth'])

[CLS] since 2016 i have been struggling with this odd phenomenon that has caused me a world of turmoil, in 2016 i had a bad reaction to an anxiety attack, my reaction included severe dissociation, pacing around, as well as obsessing about it for up to a week, i experienced images of me pacing in my head, and it was stuck in a loop, during these episodes i would feel dissociation, extreme disconnection and detachment from reality, severe anxiety and distress. eventually the images no longer occurred, instead i was than obsessing over the psychical sensations such as the dissociation, which all date back to that traumatic event for me. in 2017, i had episodes where i would just feel as if i had an ‘ aura ’ or almost as something was latched onto me, i would obsess over this sensation, to the point this sensation would cause even more dissociation. after taking a combination of zoloft, seroquel, risperidone, i finally had solace in my life and no longer focused on these sensations. some o
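
Reading individual mistakes is useful, and counting which label pairs get confused gives a quicker overview. The following is a minimal sketch using only the standard library. `confusion_pairs` is a hypothetical helper and the labels are toy values, not results from the model above.

```python
from collections import Counter


def confusion_pairs(predictions, ground_truth):
    """Count (true label, predicted label) pairs for misclassified examples only."""
    return Counter(
        (truth, pred)
        for pred, truth in zip(predictions, ground_truth)
        if pred != truth
    )


# Toy labels for illustration only.
toy_preds = ["ADHD", "aspergers", "ptsd", "ADHD", "OCD"]
toy_truth = ["aspergers", "ADHD", "ptsd", "aspergers", "OCD"]

pairs = confusion_pairs(toy_preds, toy_truth)
for (true_label, predicted_label), count in pairs.most_common():
    print(f"{true_label} -> {predicted_label}: {count}")
```

In the notebook, the same helper could be called with `test_dataframe['prediction']` and `test_dataframe['ground_truth']`.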

## Using Pipeline

In [None]:
classifier = pipeline(model="edmundhui/mental_health_trainer")

In [None]:
classifier("I feel nervous about school")
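
For a multi-class model such as this one, a text-classification pipeline typically returns the most probable label and its softmax score. The following pure-Python sketch shows that last step in isolation. `logits_to_prediction` is a hypothetical helper and the logits are made up, not output from the trained model.

```python
import math

# Same id-to-label mapping as int_to_str defined earlier in the notebook.
int_to_str = {0: "ADHD", 1: "aspergers", 2: "depression", 3: "OCD", 4: "ptsd"}


def logits_to_prediction(logits, id2label):
    """Apply softmax to the logits and return the top label with its probability."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Made-up logits for illustration only.
example = logits_to_prediction([0.2, -1.0, 2.5, 0.1, -0.3], int_to_str)
print(example)
```

The score is a probability over the five subreddit labels only, so even an unrelated sentence is always assigned one of them.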