**Problem Statement:**

Case Study: Mental Health Status Prediction for Clinical Patients using Natural Language Processing (NLP)





  **Background:**

  Mental health disorders such as anxiety, depression, and stress affect millions of people worldwide, yet many patients do not receive a timely diagnosis or treatment because symptoms are subtle and mental health assessments are complex. Mental health professionals often rely on patients' self-reported statements to understand their emotional state, but interpreting these statements manually is time-consuming and prone to inconsistency.

  To address this issue, healthcare providers are exploring AI-based solutions that assist clinicians by automating parts of the diagnostic process with Natural Language Processing (NLP) techniques. By analyzing the language patients use in their statements, such a system can surface signals associated with mental health conditions and flag patients who may need further evaluation or treatment.


**Business Challenge:**

A leading healthcare provider wants to implement an AI-based system to help its mental health professionals better diagnose and monitor the mental well-being of their patients. The system will analyze text statements provided by patients during consultations, therapy sessions, or through online surveys. These statements could include descriptions of how patients are feeling, their emotional state, or details of their daily life.

The goal of the system is to classify these statements into mental health categories such as "Anxiety," "Depression," "Normal," and other relevant conditions. This will allow clinicians to quickly identify patients at risk and take timely action. Such a system can also be integrated into telemedicine platforms to monitor patients remotely and offer additional layers of support.



**Objective:**

The objective of this project is to develop a deep learning-based NLP model that predicts the mental health status of patients from their written or spoken (transcribed) statements. The model will assist clinicians by automatically classifying each statement into one of the seven categories present in the dataset: "Normal," "Depression," "Suicidal," "Anxiety," "Bipolar," "Stress," and "Personality disorder." This system will help healthcare professionals prioritize patients needing immediate care and ensure timely interventions.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import zipfile  # Standard-library module for working with ZIP archives

# Extract all contents of the ZIP file into '/content';
# the context manager closes the archive automatically
with zipfile.ZipFile('/content/drive/MyDrive/mental_health/archive (21) (2).zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv('/content/Combined Data.csv')

In [None]:
df=df.drop(columns=['Unnamed: 0'])

In [None]:
df.isna().sum()

Unnamed: 0,0
statement,362
status,0


In [None]:
df[df['statement'].isna()]

Unnamed: 0,statement,status
293,,Anxiety
572,,Anxiety
595,,Anxiety
1539,,Normal
2448,,Normal
...,...,...
52838,,Anxiety
52870,,Anxiety
52936,,Anxiety
53010,,Anxiety


In [None]:
df.dropna(inplace=True)

In [None]:
df.isna().sum()

Unnamed: 0,0
statement,0
status,0


In [None]:
df

Unnamed: 0,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety
...,...,...
53038,Nobody takes me seriously I’ve (24M) dealt wit...,Anxiety
53039,"selfishness ""I don't feel very good, it's lik...",Anxiety
53040,Is there any way to sleep better? I can't slee...,Anxiety
53041,"Public speaking tips? Hi, all. I have to give ...",Anxiety


In [None]:
#Data Cleaning

In [None]:
df['statement']=df['statement'].str.lower()

In [None]:
import re

In [None]:
def remove_punc(text):
  # Strip ASCII punctuation; '-', '[', ']' and '\' are escaped so each is matched literally
  return re.sub(r'[!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~]', '', text)

In [None]:
df['statement']=df['statement'].apply(remove_punc)
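
A quick check of the punctuation step on a made-up sentence may be useful here. This is only a sketch: it builds the character class from `string.punctuation` with `re.escape` rather than writing it by hand, and the helper name `strip_punctuation` is hypothetical.

```python
import re
import string

# Character class covering every ASCII punctuation mark, safely escaped
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + ']')

def strip_punctuation(text):
    """Remove ASCII punctuation; Unicode marks such as ’ are left untouched."""
    return PUNCT_RE.sub('', text)

sample = "i can't sleep... what's wrong with me?!"
cleaned = strip_punctuation(sample)
print(cleaned)  # i cant sleep whats wrong with me
```

Note that typographic apostrophes (’) are not ASCII punctuation, so tokens such as `i’ve` pass through unchanged unless they are handled separately.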

In [None]:
chat_words={
  "LOL": "Laugh Out Loud",
  "BRB": "Be Right Back",
  "OMG": "Oh My God",
  "TTYL": "Talk To You Later",
  "IDK": "I Don't Know",
  "FYI": "For Your Information",
  "BTW": "By The Way",
  "ROFL": "Rolling On the Floor Laughing",
  "SMH": "Shaking My Head",
  "IMO": "In My Opinion",
  "IMHO": "In My Humble Opinion",
  "ICYMI": "In Case You Missed It",
  "DM": "Direct Message",
  "GTG": "Got To Go",
  "G2G": "Got To Go",
  "FTW": "For The Win",
  "AFK": "Away From Keyboard",
  "NP": "No Problem",
  "AMA": "Ask Me Anything",
  "FOMO": "Fear Of Missing Out",
  "OOTD": "Outfit Of The Day",
  "TL;DR": "Too Long; Didn't Read"
}

In [None]:
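def chatword_rep(text):
  # The statements were lowercased earlier, so look tokens up by their upper-cased form
  words = []
  for word in text.split():
    expansion = chat_words.get(word.upper())
    # Lowercase the expansion to stay consistent with the rest of the text
    words.append(expansion.lower() if expansion else word)
  return ' '.join(words)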

In [None]:
df['statement']=df['statement'].apply(chatword_rep)  # assign the result back so the replacement takes effect
df['statement']

Unnamed: 0,statement
0,oh my gosh
1,trouble sleeping confused mind restless heart ...
2,all wrong back off dear forward doubt stay in ...
3,ive shifted my focus to something else but im ...
4,im restless and restless its been a month now ...
...,...
53038,nobody takes me seriously i’ve 24m dealt with ...
53039,selfishness i dont feel very good its like i d...
53040,is there any way to sleep better i cant sleep ...
53041,public speaking tips hi all i have to give a p...
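
The chat-word step only has an effect when the lookup keys match the casing of the text. A minimal sketch, assuming the statements are already lowercased and using a hypothetical two-entry dictionary keyed in lowercase:

```python
# Hypothetical lookup, keyed in lowercase to match lowercased statements
CHAT_WORDS = {"idk": "i do not know", "omg": "oh my god"}

def expand_chat_words(text):
    """Replace each whitespace-separated token that appears in CHAT_WORDS."""
    return ' '.join(CHAT_WORDS.get(tok, tok) for tok in text.split())

expanded = expand_chat_words("omg idk what to do")
print(expanded)  # oh my god i do not know what to do
```

Entries that contain punctuation (for example a key like `tl;dr`) would need to be stored in their punctuation-free form if this step runs after punctuation removal.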


In [None]:
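def remove_extra(text):
  # Collapse any run of whitespace into a single space (replacing it with '' would join words)
  return re.sub(r'\s+',' ',text).strip()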

In [None]:
df['statement']=df['statement'].apply(remove_extra)
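
One pitfall worth illustrating with whitespace cleanup: replacing a double space with an empty string joins the neighbouring words, whereas collapsing runs of whitespace to a single space keeps them apart. The helper name `collapse_whitespace` below is hypothetical.

```python
import re

def collapse_whitespace(text):
    """Turn any run of spaces, tabs or newlines into one space."""
    return re.sub(r'\s+', ' ', text).strip()

naive = re.sub('  ', '', "restless  heart")   # deletes the gap entirely
collapsed = collapse_whitespace("restless  heart \n cant   sleep ")
print(naive)      # restlessheart
print(collapsed)  # restless heart cant sleep
```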

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [None]:
def stop_word(text):
  # Keep only the tokens that are not stop words (no empty placeholders, so no extra spaces)
  return ' '.join(word for word in text.split() if word not in stop_words)

In [None]:
df['statement']=df['statement'].apply(stop_word)
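
The NLTK English list shown above contains negations such as `no`, `nor` and `not`, so removing every stop word can flip the apparent meaning of a statement. The sketch below is one possible way to keep negations; the small `STOP_WORDS` set is a stand-in for the real list and the function name is hypothetical.

```python
# Tiny stand-in for the NLTK English list (assumption: the real list is much larger)
STOP_WORDS = {'i', 'do', 'not', 'no', 'nor', 'to', 'the', 'am', 'a'}
NEGATIONS = frozenset({'not', 'no', 'nor'})  # words we may want to keep

def drop_stop_words(text, keep=frozenset()):
    """Remove stop words from whitespace-separated text, except those in `keep`."""
    return ' '.join(t for t in text.split() if t not in STOP_WORDS or t in keep)

plain = drop_stop_words("i am not happy")
kept = drop_stop_words("i am not happy", keep=NEGATIONS)
print(plain)  # happy
print(kept)   # not happy
```

Whether keeping negations helps this particular classifier is an open question; it would need to be checked against validation metrics.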

In [None]:
df['statement'].isna().sum()

0

In [None]:
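import gensim
import nltk
from nltk import word_tokenize
from gensim.utils import simple_preprocess

nltk.download('punkt')

# Build one token list per statement; Word2Vec expects a list of token lists
data = []
for doc in df['statement']:
    words = word_tokenize(doc)
    # simple_preprocess lowercases each word and drops very short or very long tokens
    data.append([tok for word in words for tok in simple_preprocess(word)])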

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(df['status'])
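
Because a prediction is later mapped back to a label by index, it helps to see how `LabelEncoder` numbers the classes. A small sketch using the seven status labels from this dataset (the variable names are illustrative only):

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Normal', 'Depression', 'Suicidal', 'Anxiety',
          'Bipolar', 'Stress', 'Personality disorder']

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ is sorted, so position i of a softmax output corresponds to classes_[i]
mapping = {str(c): i for i, c in enumerate(enc.classes_)}
print(mapping)
print(enc.inverse_transform([3])[0])  # Normal
```

The integer codes follow alphabetical order of the label strings, not their order of appearance or their frequency.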

In [None]:
!pip install gradio

Collecting gradio
  Downloading gradio-5.1.0-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0 (from gradio)
  Downloading fastapi-0.115.2-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.4.0 (from gradio)
  Downloading gradio_client-1.4.0-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting huggingface-hub>=0.25.1 (from gradio)
  Downloading huggingface_hub-0.26.0-py3-none-any.whl.metadata (13 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.10.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata

In [None]:
import os
# KAGGLE_CONFIG_DIR must name the directory that contains kaggle.json, not the file itself
os.environ['KAGGLE_CONFIG_DIR'] = "/content"


In [None]:
!kaggle datasets download -d nabapadma/word2vec-model

Dataset URL: https://www.kaggle.com/datasets/nabapadma/word2vec-model
License(s): unknown
Downloading word2vec-model.zip to /content
 73% 2.00M/2.75M [00:01<00:00, 1.70MB/s]
100% 2.75M/2.75M [00:01<00:00, 1.92MB/s]


In [None]:
df['status'].value_counts()


Unnamed: 0_level_0,count
status,Unnamed: 1_level_1
Normal,16343
Depression,15404
Suicidal,10652
Anxiety,3841
Bipolar,2777
Stress,2587
Personality disorder,1077
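
The counts above show a strong class imbalance. As a rough illustration, the "balanced" class-weight heuristic (the same formula scikit-learn's `compute_class_weight('balanced', ...)` applies) can be worked out directly from these counts:

```python
counts = {
    'Normal': 16343, 'Depression': 15404, 'Suicidal': 10652,
    'Anxiety': 3841, 'Bipolar': 2777, 'Stress': 2587,
    'Personality disorder': 1077,
}

total = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: weight = total / (n_classes * class_count)
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label:22s} {w:.2f}")
```

Rare classes receive weights above 1 and common classes below 1, so each class contributes equally to the weighted loss in aggregate.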


In [None]:
import os

# List all files in the /content/drive/MyDrive/ directory
files = os.listdir('/content/drive/MyDrive/')
print("Files in /content/drive/MyDrive/:", files)


Files in /content/drive/MyDrive/: ['ml_20241018_184737', 'ml', 'ml_20241018_181626', 'word2vec_model', 'modelnlp2.keras', 'modelnlp.keras']


In [None]:
# Anxiety:
# "My thoughts are spiraling, and I can't seem to calm the constant sense of dread."

# Normal:
# "I feel balanced and present, able to take on whatever comes my way with ease."

# Depression:
# "Everything feels heavy, like I'm sinking into a darkness I can’t escape."

# Suicidal:
# "The pain feels overwhelming, and I struggle to see any light or reason to continue."

# Stress:
# "The weight of everything is pressing down on me, and I feel like I’m on the verge of breaking."

# Bipolar:
# "One moment, I'm bursting with energy and ideas, the next, I'm trapped in a pit of exhaustion and hopelessness."

# Personality Disorder:
# "I feel fractured, like my sense of self shifts from one extreme to another, leaving me confused and lost."

In [None]:
import os

# List all files in the directory to check if the model file is present
files = os.listdir('/content/drive/MyDrive/')
print("Files in /content/drive/MyDrive/:", files)


Files in /content/drive/MyDrive/: ['ml_20241018_184737', 'ml', 'ml_20241018_181626', 'word2vec_model', 'modelnlp2.keras']


In [4]:
import zipfile
import pandas as pd
import numpy as np
import re
import gensim
from nltk import word_tokenize, WordNetLemmatizer, download, corpus
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint, EarlyStopping
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight
from collections import Counter
import nltk
import tensorflow as tf
import gradio as gr

# Download NLTK resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))

# Load and extract ZIP file
with zipfile.ZipFile('/content/drive/MyDrive/mental_health/archive (21) (2).zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

# Load the data
df = pd.read_csv('/content/Combined Data.csv')
df = df.drop(columns=['Unnamed: 0'])
df.dropna(inplace=True)
df['statement'] = df['statement'].str.lower()

# Function to remove punctuation
def remove_punc(text):
    return re.sub(r'[^\w\s]', '', text)

# Function to remove extra spaces
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stopwords]

# Function to preprocess text
def preprocess_text(text):
    text = text.lower()
    text = remove_punc(text)
    text = remove_extra_spaces(text)
    tokens = word_tokenize(text)
    tokens = remove_stopwords(tokens)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Apply preprocessing
df['statement'] = df['statement'].apply(preprocess_text)

# Create and train the Word2Vec model
word2vec_model = gensim.models.Word2Vec(
    sentences=df['statement'],
    vector_size=100,
    window=10,
    min_count=2,
    epochs=10
)
word2vec_model.save('/content/drive/MyDrive/word2vec_model')

# Function to get document vectors
def document_vector(doc):
    # key_to_index is a dict, so membership tests are O(1); index_to_key is a list
    doc = [word for word in doc if word in word2vec_model.wv.key_to_index]
    if len(doc) == 0:
        return np.zeros(word2vec_model.vector_size)
    else:
        return np.mean(word2vec_model.wv[doc], axis=0)

# Generate document vectors
X = np.array([document_vector(doc) for doc in df['statement']])

# Encode labels
encoder = LabelEncoder()
df['status'] = df['status'].astype(str)
y = encoder.fit_transform(df['status'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Reshape the oversampled data for LSTM input
X_train_smote = X_train_smote.reshape(X_train_smote.shape[0], 1, 100)
X_test = X_test.reshape(X_test.shape[0], 1, 100)

# Define the model architecture
model1 = Sequential([
    LSTM(50, input_shape=(1, 100), return_sequences=True),
    Dropout(0.4),
    LSTM(50, return_sequences=True),
    Dropout(0.4),
    LSTM(50),
    Dropout(0.4),
    Dense(7, activation='softmax')
])

# Compile the model
learning_rate = 0.0001
optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)
model1.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

# Compute class weights (after SMOTE every class has the same count, so these all equal 1.0)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train_smote), y=y_train_smote)
class_weights = dict(enumerate(class_weights))

# Define the callbacks
checkpoint = ModelCheckpoint('/content/drive/MyDrive/modelnlp.keras', monitor='val_accuracy', save_best_only=True, verbose=1)
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

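# Train the model with checkpointing and early stopping
# Note: the test split doubles as the validation set here, so the reported
# val_accuracy also drives model selection and is an optimistic estimate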
model1.fit(
    X_train_smote, y_train_smote,
    validation_data=(X_test, y_test),
    epochs=50,  # Reduced number of epochs
    class_weight=class_weights,
    callbacks=[checkpoint, early_stopping]
)

# Preprocess and vectorize a single input text for prediction
def preprocess_and_vectorize(text):
    text = text.lower()
    text = remove_punc(text)
    text = remove_extra_spaces(text)
    tokens = word_tokenize(text)
    tokens = remove_stopwords(tokens)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    vectors = [word2vec_model.wv[token] for token in tokens if token in word2vec_model.wv]
    if len(vectors) == 0:
        # Return the same (1, 1, vector_size) shape the model expects
        return np.zeros((1, 1, word2vec_model.vector_size))
    return np.mean(vectors, axis=0).reshape(1, 1, -1)

def predict(text):
    # Apply preprocessing and vectorization
    preprocessed_vector = preprocess_and_vectorize(text)

    # Make prediction
    prediction = model1.predict(preprocessed_vector)

    # Get predicted class index
    predicted_class_index = np.argmax(prediction, axis=1)[0]

    # Map index to class label
    predicted_class_label = encoder.classes_[predicted_class_index]

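    # Manual adjustment: force "Normal" when a positive word appears as a whole word
    # (case-insensitive; avoids matching substrings such as "goodbye")
    positive_words = {"happy", "good", "joyful"}
    if positive_words & set(re.findall(r"\w+", text.lower())):
        predicted_class_label = "Normal"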

    return predicted_class_label

# Gradio interface setup
gr_interface = gr.Interface(
    fn=predict,
    inputs="text",
    outputs="text",
    title="Mental Status Prediction",
    description="Enter your text to determine the predicted mental health status."
)

# Launch Gradio app
gr_interface.launch()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
  super().__init__(**kwargs)


Epoch 1/50
2484/2485 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.2530 - loss: 1.8945
Epoch 1: val_accuracy improved from -inf to 0.37362, saving model to /content/drive/MyDrive/modelnlp.keras
2485/2485 ━━━━━━━━━━━━━━━━━━━━ 21s 6ms/step - accuracy: 0.2530 - loss: 1.8945 - val_accuracy: 0.3736 - val_loss: 1.5999
Epoch 2/50
2481/2485 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.4573 - loss: 1.4716
Epoch 2: val_accuracy improved from 0.37362 to 0.52673, saving model to /content/drive/MyDrive/modelnlp.keras
2485/2485 ━━━━━━━━━━━━━━━━━━━━ 14s 6ms/step - accuracy: 0.4574 - loss: 1.4715 - val_accuracy: 0.5267 - val_loss: 1.3029
Epoch 3/50
2475/2485 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.5558 - loss: 1.2447
Epoch 3: val_accuracy improved from 0.52673 to 0.57621, saving model to /content/drive/MyDrive/modelnlp.keras
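
The log reports overall accuracy only. With imbalanced classes, per-class recall can tell a different story, so a sketch of how it might be computed is given here. The labels are made-up toy values, not results from this model, and `per_class_recall` is a hypothetical helper.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    """Recall for each class: correct predictions / true instances."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1          # rows = true label, columns = prediction
    support = cm.sum(axis=1)
    # Leave recall at 0 for classes that never occur in y_true
    return np.divide(np.diag(cm), support,
                     out=np.zeros(n_classes), where=support > 0)

# Toy labels (made up): class 0 is common, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = np.mean(np.array(y_true) == np.array(y_pred))
recall = per_class_recall(y_true, y_pred, n_classes=2)
print(accuracy, recall)
```

In this toy case accuracy is 0.8 even though the rare class is never detected, which is the kind of failure a single accuracy figure can hide.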



In [5]:
model1.summary()
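
The training cell feeds the LSTM one averaged Word2Vec vector per statement, reshaped to a single time step. The sketch below, with hypothetical 3-dimensional toy embeddings in place of the trained model, illustrates two consequences of mean pooling: word order is lost, and the fallback for unknown words has to keep the same shape as a normal input.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model
toy_vectors = {
    'cant':  np.array([1.0, 0.0, 0.0]),
    'sleep': np.array([0.0, 1.0, 0.0]),
}
DIM = 3

def mean_vector(tokens):
    """Average the vectors of known tokens; all zeros if none are known."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return np.zeros((1, 1, DIM))
    return np.mean(known, axis=0).reshape(1, 1, DIM)

a = mean_vector(['cant', 'sleep'])
b = mean_vector(['sleep', 'cant'])     # same words, different order
c = mean_vector(['unknownword'])       # nothing in the vocabulary
print(a.shape, np.array_equal(a, b), c.shape)
```

Since each input has only one time step, the recurrent layers have no sequence to read; feeding per-token vectors as a padded sequence would be one alternative, though that is a different design from the one used in this notebook.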