# Project Name: EmpathAI - AI for Mental Wellness

Team Members: Debankitha Basu, Shreevidhya Shambanna, Ziyang Song

**Project Overview:**

Our project aims to develop a mental health chatbot that assists users by providing support and information on common mental health issues. Utilizing natural language processing (NLP) and machine learning (ML), the chatbot will interact with users in a conversational manner, understanding their concerns and offering guidance or resources.


**Data Sources:** [MentalHealthChat Dataset by Hizardev](https://huggingface.co/datasets/hizardev/MentalHealthChat)

**Other Data Sources:** [Mental health datasets on Hugging Face](https://huggingface.co/datasets?sort=trending&search=mental)

# **Pre-processing for BERT**

**Load the tokenizer**

In [None]:
from transformers import BertTokenizer
from datasets import load_dataset
import numpy as np

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



**Define a function to tokenize the dataset**

In [None]:
# Define a function to tokenize the dataset
def tokenize_function(examples):
    # Tokenize the "text" column, padding or truncating every example to 512 tokens
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

**Tokenize the dataset**

In [None]:
# Convert the combined DataFrame to a Hugging Face Dataset
from datasets import Dataset, concatenate_datasets

# Define the chunk size
chunk_size = 100000  # Adjust this based on system's memory capacity

# Create a list to store the tokenized chunks
tokenized_chunks = []

# Process and tokenize the DataFrame in chunks
for start_idx in range(0, len(combined_df), chunk_size):
    end_idx = start_idx + chunk_size
    chunk_df = combined_df[start_idx:end_idx]

    # Convert the chunk to a Hugging Face Dataset
    chunk_dataset = Dataset.from_pandas(chunk_df)

    # Tokenize the chunk
    tokenized_chunk = chunk_dataset.map(tokenize_function, batched=True, batch_size=100)

    # Add the tokenized chunk to the list
    tokenized_chunks.append(tokenized_chunk)

# Concatenate the tokenized chunks into a single dataset
tokenized_dataset = concatenate_datasets(tokenized_chunks)
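The chunk boundaries used by the loop above can be checked in isolation. The following is a minimal sketch that only computes the row ranges; `chunk_bounds` is a hypothetical helper, and the row count of 606645 is taken from the dataset summary printed later in this notebook. The explicit clamp is for clarity only, since pandas slicing already stops at the last row.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, end) row ranges that cover n_rows in order."""
    for start in range(0, n_rows, chunk_size):
        # Clamp the end so the final chunk stops at the last row
        yield start, min(start + chunk_size, n_rows)

bounds = list(chunk_bounds(606645, 100000))
sizes = [end - start for start, end in bounds]
print(len(bounds), sizes[-1])  # 7 6645
```

With a chunk size of 100000 this gives six full chunks and one final chunk of 6645 rows.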




In [None]:
# Save the tokenized dataset to a directory
tokenized_dataset.save_to_disk('/content/drive/MyDrive/tokenized_dataset')

In [None]:
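import shutil
from google.colab import files

# files.download() expects a single file, so archive the dataset directory first
archive_path = shutil.make_archive('/content/tokenized_dataset', 'zip', '/content/drive/MyDrive/tokenized_dataset')
files.download(archive_path)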


In [None]:
!ls drive/MyDrive/tokenized_dataset

cache-10c222975469ab1a.arrow  data-00003-of-00008.arrow  data-00007-of-00008.arrow
data-00000-of-00008.arrow     data-00004-of-00008.arrow  dataset_info.json
data-00001-of-00008.arrow     data-00005-of-00008.arrow  state.json
data-00002-of-00008.arrow     data-00006-of-00008.arrow


In [None]:
from datasets import load_from_disk

# Specify the directory containing your dataset
dataset_directory = '/content/drive/MyDrive/tokenized_dataset'

# Load the dataset
tokenized_dataset = load_from_disk(dataset_directory)

# Now you can work with `tokenized_dataset`
print(tokenized_dataset)

Dataset({
    features: ['text', 'preprocessed_text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 606645
})


In [None]:
# Print the first example of the tokenized dataset

print(tokenized_dataset[0])

{'text': "<s> [INST] Okay so I keep dragging the damage me through the dirt and I'm like sorry we can't stop like we gotta keep going. Cause duh I gotta, and I swear my mental health is getting a billion times worst cause I need to get diagnosed again for other shit to make sure. nnBut yeah anxiety and depression is great and at this point I don't even think I have anxiety like my anxiety turned into paranoia deadass. Because I don't feel the same type of anxiety I use to . I get more scared and paranoia asf. Like I get scared ppl gonna hurt me and shit. And paranoia of people looking at me and the shit around me. Well Actually I can't say that, I think I still do have a bit anxiety left in me. nI'm so fucking lazy bc of depression like I can't feel myself at all, I mean I was never able to but this time I'm so fucking lowwwwwwww as hell like to the point idgaf if I get hit by a car when crossing a street. nI feel so angry and sad and just wanna scream on the top of my lungs. And also 

#### The dataset is already tokenized for BERT (`input_ids`, `token_type_ids`, `attention_mask`), so we can pass these inputs through the model and derive a sentence embedding that represents each text entry as a single vector.

#### **Step 1: Calculate Sentence Embeddings**
#### To derive sentence embeddings from BERT's output, we average the model's last hidden states across all tokens. This step, known as mean pooling, yields a single vector for each text entry.

In [None]:
# Work with a smaller, manageable sample of the dataset to determine the best parameters

sampled_dataset = tokenized_dataset.shuffle(seed=42).select(range(10000))  # Sample 10,000 rows randomly


In [None]:
from transformers import BertModel, BertTokenizer
from torch.utils.data import DataLoader
import torch
import numpy as np

# Load pre-trained model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased')

# Ensure the model is in evaluation mode
model.eval()

# Check if CUDA is available and move the model to GPU if it is
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Prepare the dataset for PyTorch
sampled_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

# Create a DataLoader for batch processing
data_loader = DataLoader(sampled_dataset, batch_size=32, shuffle=False)

def extract_embeddings(input_ids, attention_mask):
    # Ensure tensors are on the correct device
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.cpu()

# Process batches and collect embeddings
all_embeddings = []
for batch in data_loader:
    embeddings = extract_embeddings(batch['input_ids'], batch['attention_mask'])
    all_embeddings.append(embeddings.numpy())  # Convert embeddings to numpy arrays

# Stack all batch embeddings into a single numpy array
all_embeddings = np.vstack(all_embeddings)

print("Shape of all embeddings:", all_embeddings.shape)

Shape of all embeddings: (10000, 768)
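Because every example is padded to 512 tokens, `last_hidden_state.mean(dim=1)` also averages the vectors at padding positions. A common variant, which is not what the cell above does, weights each token by its attention mask so that only real tokens contribute. The sketch below illustrates the idea with NumPy on a made-up toy batch; `masked_mean_pool` is a hypothetical name.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real tokens only.

    hidden_states: (batch, seq_len, dim) array
    attention_mask: (batch, seq_len) array of 0/1 flags
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: one sequence with 2 real tokens followed by 2 padding positions
hidden = np.array([[[1.0, 3.0], [3.0, 5.0], [100.0, 100.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0, 0]])

pooled = masked_mean_pool(hidden, mask)
plain = hidden.mean(axis=1)
print(pooled)  # [[2. 4.]]
print(plain)   # [[51. 52.]]
```

The plain mean is pulled toward the padding vectors, while the masked mean reflects only the two real tokens. The output shape is unchanged, so the clustering step would work the same way with either version.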


#### **Clustering the Embeddings with K-Means**

In [None]:
from sklearn.cluster import KMeans

k = 10

# Perform K-Means clustering
kmeans = KMeans(n_clusters=k, random_state=42)
cluster_labels = kmeans.fit_predict(all_embeddings)

# Cluster labels are now associated with each embedding
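The value `k = 10` is fixed by hand above. One possible way to compare candidate values of `k` is the silhouette score; the sketch below is an assumption about how that could be done, not something the notebook itself runs. It uses synthetic 2-D points as a stand-in for `all_embeddings`, and the candidate range 2 to 6 is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for `all_embeddings`: three well-separated 2-D blobs
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real sentence embeddings the scores are usually much closer together than on this toy data, so the result is best treated as a hint rather than a definitive answer.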

In [12]:
%pip install sentence-transformers



In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm  # Import tqdm

# Load a pre-trained sentence-transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to encode texts in batches with tqdm progress bar for the outer loop
def encode_texts_in_batches(texts, batch_size=100):
    embeddings = []
    for i in tqdm(range(0, len(texts), batch_size)):  # Add tqdm here for progress bar
        batch_texts = texts[i:i + batch_size]
        batch_embeddings = model.encode(batch_texts, show_progress_bar=False)
        embeddings.append(batch_embeddings)
    return np.vstack(embeddings)

# Encode the preprocessed text of the sampled dataset
texts = sampled_dataset['preprocessed_text']
embeddings = encode_texts_in_batches(texts)


In [None]:
import pandas as pd
from sklearn.cluster import KMeans

# Define the number of clusters (intents)
num_clusters = 10

# Perform K-means clustering on the sentence-transformer embeddings
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Convert the sampled Dataset to a pandas DataFrame
tokenized_dataset_copy = sampled_dataset.to_pandas()

# Add the cluster labels as categorical intent IDs
tokenized_dataset_copy['intent_id'] = cluster_labels
tokenized_dataset_copy['intent_id'] = pd.Categorical(tokenized_dataset_copy['intent_id'])



In [8]:
from torch.utils.data import Dataset, DataLoader
import torch

class IntentClassificationDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx], dtype=torch.long) for key, val in self.encodings.items()}  # Ensure dtype=torch.long
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)  # Also ensure labels are torch.long
        return item

    def __len__(self):
        return len(self.labels)



In [3]:
tokenized_dataset_copy = pd.read_csv('tokenized_dataset_copy.csv')

In [4]:
tokenized_dataset_copy

Unnamed: 0,text,preprocessed_text,input_ids,token_type_ids,attention_mask,intent_id
0,"\nsince i was a kid, i’ve been afraid of pictu...",since i was a kid i ve been afraid of pictures...,[ 101 2144 1045 2001 1037 4845 1010 10...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,5
1,I tried to slit my wrist. I couldn’t put enoug...,i tried to slit my wrist i couldn t put enough...,[ 101 1045 2699 2000 18036 2026 7223 10...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,2
2,I hope you feel better. I can relate to the co...,i hope you feel better i can relate to the con...,[ 101 1045 3246 2017 2514 2488 1012 10...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,2
3,This summer after I graduated my parents told ...,this summer after i graduated my parents told ...,[ 101 2023 2621 2044 1045 3852 2026 30...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,5
4,I've had it pretty bad this year. I've been in...,i ve had it pretty bad this year i ve been in ...,[ 101 1045 1005 2310 2018 2009 3492 29...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,6
...,...,...,...,...,...,...
9995,About 1 1/2 years ago I had such a strong urge...,about 1 1 2 years ago i had such a strong urge...,[ 101 2055 1015 1015 1013 1016 2086 32...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,5
9996,I often get mentally stuck in remembering thin...,i often get mentally stuck in remembering thin...,[ 101 1045 2411 2131 10597 5881 1999 103...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,0
9997,<s>[INST] <<SYS>>\nYou are a helpful and joyou...,s inst sys you are a helpful and joyous mental...,[ 101 1026 1055 1028 1031 16021 2102 10...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,7
9998,Everytime my boyfriend says he was hanging out...,everytime my boyfriend says he was hanging out...,[ 101 2296 7292 2026 6898 2758 2002 20...,[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...,[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...,8


In [5]:

tokenized_dataset_copy.reset_index(drop=True, inplace=True)


In [6]:
tokenized_dataset_copy



In [None]:
# tokenized_dataset_copy.to_csv('tokenized_dataset_copy.csv', index=False)

In [None]:
# from google.colab import files

# files.download('tokenized_dataset_copy.csv')


In [10]:

encodings = {
    'input_ids': tokenized_dataset_copy['input_ids'].tolist(),
    'attention_mask': tokenized_dataset_copy['attention_mask'].tolist(),
    'token_type_ids': tokenized_dataset_copy['token_type_ids'].tolist(),
}
labels = tokenized_dataset_copy['intent_id'].tolist()

# Split the dataset
train_size = int(0.8 * len(labels))
val_size = len(labels) - train_size

train_dataset = IntentClassificationDataset({k: v[:train_size] for k, v in encodings.items()}, labels[:train_size])
val_dataset = IntentClassificationDataset({k: v[train_size:] for k, v in encodings.items()}, labels[train_size:])

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4)
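One caveat when the DataFrame is reloaded from CSV: array-valued columns come back as strings such as `"[ 101 2144 1045 ...]"` rather than integer sequences, which is the form shown in the table output above, and `torch.tensor` cannot convert a string. The sketch below assumes each cell holds whitespace-separated integers inside square brackets; `parse_id_string` is a hypothetical helper.

```python
def parse_id_string(cell):
    """Turn a string like '[ 101 2144 102]' into a list of ints."""
    # Drop the surrounding brackets, then split on any whitespace (including newlines)
    return [int(tok) for tok in cell.strip().strip('[]').split()]

example = "[ 101 2144 1045\n 2001  102]"
ids = parse_id_string(example)
print(ids)  # [101, 2144, 1045, 2001, 102]
```

If that assumption holds, applying it before building `encodings`, for example `tokenized_dataset_copy['input_ids'].apply(parse_id_string)`, would give the list-of-ints form that `IntentClassificationDataset` expects. A cell containing a literal `...` would raise a `ValueError`, which makes truncated data easy to spot.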

In [13]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import EvalPrediction

def compute_metrics(eval_pred: EvalPrediction):
    """Compute metrics for intent classification."""
    # Extract predictions and labels from the evaluation prediction
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)

    # Calculate accuracy
    accuracy = accuracy_score(labels, predictions)

    # Calculate precision, recall, and F1 score (weighted)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }
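To make the weighted averages concrete, here is a small worked example with made-up logits for four examples and three intent classes. It repeats the same scikit-learn calls as `compute_metrics`; the numbers are purely illustrative, and `zero_division=0` is an added assumption to keep the call quiet when a class is never predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up logits for 4 examples and 3 intent classes
logits = np.array([[2.0, 1.0, 0.0],
                   [0.0, 3.0, 1.0],
                   [1.0, 0.0, 4.0],
                   [3.0, 0.0, 1.0]])
labels = np.array([0, 1, 2, 1])

predictions = logits.argmax(axis=-1)  # [0, 1, 2, 0]
accuracy = accuracy_score(labels, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, predictions, average='weighted', zero_division=0
)
print(accuracy, precision, recall)  # 0.75 0.875 0.75
```

Three of four predictions are correct, so accuracy is 0.75. Weighted recall always equals accuracy, while weighted precision differs because class 0 is predicted twice but is correct only once. Since the labels here are K-Means cluster IDs, these metrics measure agreement with the clustering rather than with human-annotated intents.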

In [None]:
# !pip install accelerate -U
# !pip install transformers[torch] -U


In [None]:
# import accelerate
# print(accelerate.__version__)


In [None]:
#pip install --upgrade transformers




In [None]:
#pip install cloud-tpu-client


In [None]:
# import transformers
# import torch

# print(transformers.__version__)
# print(torch.__version__)


4.38.2
2.2.1+cu121


In [14]:
num_train_epochs = 3  # Define the number of epochs


In [17]:
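# transformers.AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# `model` must refer to the classifier being fine-tuned.
# weight_decay=0.0 matches the default of the deprecated transformers.AdamW.
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)

# Linear warmup for the first 500 steps, then linear decay to zero
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=500, num_training_steps=len(train_loader) * num_train_epochs)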
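The effect of a linear warmup schedule can be sketched without any library. The helper below, `linear_warmup_multiplier`, is a hypothetical re-implementation of the multiplier such a schedule applies to the base learning rate. It assumes the 10,000-row sample, so the 80% split gives 8,000 training rows.

```python
def linear_warmup_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# 8,000 training rows / batch size 4 = 2,000 steps per epoch, times 3 epochs
total_steps = (8000 // 4) * 3
base_lr = 5e-5

for step in (0, 250, 500, 3250, 6000):
    print(step, base_lr * linear_warmup_multiplier(step, 500, total_steps))
```

Under these assumptions the learning rate climbs from 0 to 5e-5 over the first 500 steps, is back at half of that by step 3250, and reaches 0 at step 6000.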



