**Problem Statement:**

Case Study: Mental Health Status Prediction for Clinical Patients using Natural Language Processing (NLP)





  **Background:**

  Mental health disorders such as anxiety, depression, and stress affect millions of people worldwide, yet many patients do not receive a proper diagnosis or timely treatment because symptoms are subtle and mental health assessments are complex. Mental health professionals often rely on self-reported statements from patients to understand their emotional state, but interpreting these statements manually is time-consuming and prone to inconsistency.

  To address this issue, healthcare providers are exploring AI-based solutions that can assist clinicians by automating parts of the diagnostic process through Natural Language Processing (NLP) techniques. By analyzing the language used by patients in their statements, it is possible to detect underlying mental health conditions and flag patients who may need further evaluation or treatment.


**Business Challenge:**

A leading healthcare provider wants to implement an AI-based system to help its mental health professionals better diagnose and monitor the mental well-being of their patients. The system will analyze text statements provided by patients during consultations, therapy sessions, or through online surveys. These statements could include descriptions of how patients are feeling, their emotional state, or details of their daily life.

The goal of the system is to classify these statements into mental health categories such as "Anxiety," "Depression," "Normal," and other relevant conditions. This will allow clinicians to quickly identify patients at risk and take timely action. Such a system can also be integrated into telemedicine platforms to monitor patients remotely and offer additional layers of support.



**Objective:**

The objective of this project is to develop a deep learning-based NLP model that predicts the mental health status of patients from their written or spoken statements. The model assigns each statement to one of the categories present in the dataset ("Normal," "Depression," "Suicidal," "Anxiety," "Bipolar," "Stress," and "Personality disorder"). This will help healthcare professionals prioritize patients who need immediate care and intervene in time.

In [None]:
import zipfile  # Import the zipfile module to work with ZIP files

# Extract all contents of the ZIP file into '/content'; the with-block closes the file automatically
with zipfile.ZipFile('/content/archive (21).zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv('/content/Combined Data.csv')

In [None]:
df=df.drop(columns=['Unnamed: 0'])

In [None]:
df.isna().sum()

Unnamed: 0,0
statement,362
status,0


In [None]:
df[df['statement'].isna()]

Unnamed: 0,statement,status
293,,Anxiety
572,,Anxiety
595,,Anxiety
1539,,Normal
2448,,Normal
...,...,...
52838,,Anxiety
52870,,Anxiety
52936,,Anxiety
53010,,Anxiety


In [None]:
df.dropna(inplace=True)

In [None]:
df.isna().sum()

Unnamed: 0,0
statement,0
status,0


In [None]:
df

Unnamed: 0,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety
...,...,...
53038,Nobody takes me seriously I’ve (24M) dealt wit...,Anxiety
53039,"selfishness ""I don't feel very good, it's lik...",Anxiety
53040,Is there any way to sleep better? I can't slee...,Anxiety
53041,"Public speaking tips? Hi, all. I have to give ...",Anxiety


In [None]:
#Data Cleaning

In [None]:
df['statement']=df['statement'].str.lower()

In [None]:
import re

In [None]:
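import string

def remove_punc(text):
  # Strip every ASCII punctuation character; re.escape avoids accidental ranges such as ',-.'
  return re.sub(f'[{re.escape(string.punctuation)}]','',text)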

In [None]:
df['statement']=df['statement'].apply(remove_punc)

In [None]:
chat_words={
  "LOL": "Laugh Out Loud",
  "BRB": "Be Right Back",
  "OMG": "Oh My God",
  "TTYL": "Talk To You Later",
  "IDK": "I Don't Know",
  "FYI": "For Your Information",
  "BTW": "By The Way",
  "ROFL": "Rolling On the Floor Laughing",
  "SMH": "Shaking My Head",
  "IMO": "In My Opinion",
  "IMHO": "In My Humble Opinion",
  "ICYMI": "In Case You Missed It",
  "DM": "Direct Message",
  "GTG": "Got To Go",
  "G2G": "Got To Go",
  "FTW": "For The Win",
  "AFK": "Away From Keyboard",
  "NP": "No Problem",
  "AMA": "Ask Me Anything",
  "FOMO": "Fear Of Missing Out",
  "OOTD": "Outfit Of The Day",
  "TL;DR": "Too Long; Didn't Read"
}

In [None]:
def chatword_rep(text):
  # The text was lowercased earlier, so look each token up by its uppercase form
  l=[]
  for i in text.split():
    if i.upper() in chat_words:
      l.append(chat_words[i.upper()].lower())
    else:
      l.append(i)
  return ' '.join(l)

In [None]:
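# Assign the result back; apply() returns a new Series and does not modify df in place
df['statement']=df['statement'].apply(chatword_rep)
df['statement']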

Unnamed: 0,statement
0,oh my gosh
1,trouble sleeping confused mind restless heart ...
2,all wrong back off dear forward doubt stay in ...
3,ive shifted my focus to something else but im ...
4,im restless and restless its been a month now ...
...,...
53038,nobody takes me seriously i’ve 24m dealt with ...
53039,selfishness i dont feel very good its like i d...
53040,is there any way to sleep better i cant sleep ...
53041,public speaking tips hi all i have to give a p...
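
Chat-word expansion only has an effect when the lookup is done on whole tokens and ignores case, because the statements were lowercased earlier. The following is a minimal sketch of that idea, assuming a shortened, hypothetical dictionary `CHAT_WORDS` (not the full list above) and a helper name `expand_chat_words` chosen only for illustration:

```python
# Hypothetical, shortened lookup table; keys are lowercase because the text is lowercased first.
CHAT_WORDS = {
    "omg": "oh my god",
    "idk": "i dont know",
    "u": "you",
}

def expand_chat_words(text):
    """Replace whole tokens found in CHAT_WORDS; leave every other token unchanged."""
    return " ".join(CHAT_WORDS.get(token.lower(), token) for token in text.split())

# Whole-token lookup leaves ordinary words intact.
expanded = expand_chat_words("OMG u have trouble sleeping")

# Substring replacement, by contrast, rewrites letters inside words.
corrupted = "trouble".replace("u", "you")

print(expanded)
print(corrupted)
```

Token-level lookup is the safer choice here: a short key such as "u" would otherwise match inside almost every word.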


In [None]:
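def remove_extra(text):
  # Collapse runs of whitespace to one space; replacing them with '' would glue neighbouring words together
  return re.sub(r'\s+',' ',text).strip()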

In [None]:
df['statement']=df['statement'].apply(remove_extra)

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = set(stopwords.words('english'))
stop_words

In [None]:
def stop_word(text):
  # Keep only tokens that are not stop words (appending '' would leave double spaces behind)
  l2=[]
  for i in text.split():
    if i not in stop_words:
      l2.append(i)
  return ' '.join(l2)

In [None]:
df['statement']=df['statement'].apply(stop_word)

In [None]:
df['statement'].isna().sum()

0
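
The cleaning steps above (lowercasing, punctuation removal, stop-word removal, whitespace normalization) can be combined into a single function. This is only a sketch under stated assumptions: `STOP_WORDS` is a tiny hypothetical set standing in for NLTK's English list, and `clean_statement` is an illustrative name that does not appear in the notebook.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"i", "my", "and", "the", "to", "a", "is"}

def clean_statement(text):
    """Lowercase, strip ASCII punctuation, drop stop words, and normalize whitespace."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    # split() with no argument discards repeated whitespace, so no separate step is needed
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_statement("Trouble sleeping,  confused mind, and  restless heart!")
print(cleaned)
```

Splitting on whitespace and re-joining with a single space removes the need for a dedicated extra-space function.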

In [None]:
import gensim
import nltk
from nltk import word_tokenize

nltk.download('punkt')

# Build one token list per statement, which is the input format Word2Vec expects
data = []
for doc in df['statement']:
    data.append(word_tokenize(doc))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(df['status'])
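
`LabelEncoder` assigns integer codes in sorted class order, so any later code that maps a predicted index back to a label has to use the same order (after fitting, `encoder.classes_` holds it and `encoder.inverse_transform` applies it). The sketch below reproduces that ordering with plain Python; the variable names are illustrative and the status names are the ones in this dataset's `status` column.

```python
# Status names from the dataset, deliberately listed in a non-alphabetical order.
statuses = ["Anxiety", "Normal", "Depression", "Suicidal", "Stress", "Bipolar", "Personality disorder"]

# LabelEncoder stores its classes sorted, which sorted() reproduces here.
classes = sorted(set(statuses))
label_to_index = {label: index for index, label in enumerate(classes)}
index_to_label = dict(enumerate(classes))

for index, label in index_to_label.items():
    print(index, label)
```

With this ordering, index 3 is "Normal" and index 6 is "Suicidal", which is worth keeping in mind when reading class-count output keyed by integer code.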

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import zipfile  # Import the zipfile module to work with ZIP files
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import gensim
from nltk import word_tokenize, WordNetLemmatizer
from gensim.utils import simple_preprocess
import nltk
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint

# Load and extract ZIP file
zip_ref = zipfile.ZipFile('/content/archive (21).zip', 'r')
zip_ref.extractall('/content')
zip_ref.close()

# Load the data
df = pd.read_csv('/content/Combined Data.csv')
df = df.drop(columns=['Unnamed: 0'])
df.dropna(inplace=True)
df['statement'] = df['statement'].str.lower()

# Download NLTK resources
nltk.download('punkt')
nltk.download('wordnet')  # Download wordnet for lemmatization

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to remove punctuation
def remove_punc(text):
    return re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]', '', text)

# Preprocess: remove punctuation
df['statement'] = df['statement'].apply(remove_punc)

# Function to tokenize and lemmatize the text (done after all preprocessing)
def tokenize_and_lemmatize(text):
    tokens = word_tokenize(text)  # Tokenize the text
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Apply lemmatization
    return lemmatized_tokens

# Apply tokenization and lemmatization after cleaning
data = []
for doc in df['statement']:
    cleaned_doc = tokenize_and_lemmatize(doc)
    data.append(cleaned_doc)  # Collect tokenized and lemmatized sentences

# Create and train the Word2Vec model
model = gensim.models.Word2Vec(
    sentences=data,
    vector_size=50,  # Choose an appropriate vector size
    window=10,
    min_count=2,
    epochs=10
)

# Function to get document vectors
def document_vector(doc):
    doc = [word for word in word_tokenize(doc) if word in model.wv.key_to_index]  # dict lookup; index_to_key is a list and is scanned linearly
    if len(doc) == 0:
        return np.zeros(model.vector_size)
    else:
        return np.mean(model.wv[doc], axis=0)

# Generate document vectors
X = []
for doc in tqdm(df['statement']):
    X.append(document_vector(doc))

# Convert X to numpy array and reshape for LSTM
X = np.array(X).reshape(len(X), 1, 50)  # Shape it for LSTM (samples, timesteps, features)

# Encode labels
encoder = LabelEncoder()
y = encoder.fit_transform(df['status'])

from imblearn.over_sampling import SMOTE
from collections import Counter

# Train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

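# Fit SMOTE to the training data; SMOTE expects 2D input, so flatten first and restore the LSTM shape afterwards
X_train_smote, y_train_smote = smote.fit_resample(X_train.reshape(len(X_train), -1), y_train)
X_train_smote = X_train_smote.reshape(len(X_train_smote), 1, 50)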

# Check the new class distribution
print(f"Original class distribution: {Counter(y_train)}")
print(f"New class distribution after SMOTE: {Counter(y_train_smote)}")


# Define the model
model1 = Sequential()
model1.add(LSTM(50, input_shape=(1, 50), return_sequences=False))
model1.add(Dropout(0.5))
model1.add(Dense(7, activation='softmax'))  # Adjust based on number of classes
model1.summary()

# Compile the model
model1.compile(loss='sparse_categorical_crossentropy',
               optimizer='adam',
               metrics=['accuracy'])

# Define the checkpoint callback
checkpoint = ModelCheckpoint(filepath='/content/model1_best.keras',
                             monitor='val_accuracy',
                             save_best_only=True,
                             verbose=1)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


KeyboardInterrupt: 

In [None]:
import zipfile  # Import the zipfile module to work with ZIP files
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import gensim
from nltk import word_tokenize, WordNetLemmatizer, download, corpus
from gensim.utils import simple_preprocess
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint
from imblearn.over_sampling import SMOTE
from collections import Counter

# Load and extract ZIP file
zip_ref = zipfile.ZipFile('/content/archive (21).zip', 'r')
zip_ref.extractall('/content')
zip_ref.close()

# Load the data
df = pd.read_csv('/content/Combined Data.csv')
df = df.drop(columns=['Unnamed: 0'])
df.dropna(inplace=True)
df['statement'] = df['statement'].str.lower()

# Download NLTK resources
download('punkt')
download('wordnet')  # Download wordnet for lemmatization
download('stopwords')  # Download stopwords

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stopwords = set(corpus.stopwords.words('english'))

# Function to remove punctuation
def remove_punc(text):
    return re.sub(r'[^\w\s]', '', text)

# Function to remove extra spaces
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stopwords]

# Function to handle chatwords (optional: add more replacements if needed)
def handle_chatwords(text):
    replacements= {
        "u": "you",
        "r": "are",
        "ur": "your",
        "cuz": "because",
        "pls": "please",
        "thx": "thanks",
        "LOL": "Laugh Out Loud",
        "BRB": "Be Right Back",
        "OMG": "Oh My God",
        "TTYL": "Talk To You Later",
        "IDK": "I Don't Know",
        "FYI": "For Your Information",
        "BTW": "By The Way",
        "ROFL": "Rolling On the Floor Laughing",
        "SMH": "Shaking My Head",
        "IMO": "In My Opinion",
        "IMHO": "In My Humble Opinion",
        "ICYMI": "In Case You Missed It",
        "DM": "Direct Message",
        "GTG": "Got To Go",
        "G2G": "Got To Go",
        "FTW": "For The Win",
        "AFK": "Away From Keyboard",
        "NP": "No Problem",
        "AMA": "Ask Me Anything",
        "FOMO": "Fear Of Missing Out",
        "OOTD": "Outfit Of The Day",
        "TL;DR": "Too Long; Didn't Read"
          }
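    # Replace whole tokens only; str.replace would also rewrite letters inside words (every "u" would become "you")
    lookup = {chatword.lower(): replacement.lower() for chatword, replacement in replacements.items()}
    return ' '.join(lookup.get(token, token) for token in text.split())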

# Function to preprocess text
def preprocess_text(text):
    text = text.lower()  # Convert to lower case
    text = remove_punc(text)  # Remove punctuation
    text = handle_chatwords(text)  # Handle chat words
    text = remove_extra_spaces(text)  # Remove extra spaces
    tokens = word_tokenize(text)  # Tokenize
    tokens = remove_stopwords(tokens)  # Remove stopwords
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize
    return tokens

# Apply preprocessing
df['statement'] = df['statement'].apply(preprocess_text)

# Create and train the Word2Vec model
model = gensim.models.Word2Vec(
    sentences=df['statement'],
    vector_size=50,  # Choose an appropriate vector size
    window=10,
    min_count=2,
    epochs=10
)

# Function to get document vectors
def document_vector(doc):
    doc = [word for word in doc if word in model.wv.key_to_index]  # dict lookup; index_to_key is a list and is scanned linearly
    if len(doc) == 0:
        return np.zeros(model.vector_size)
    else:
        return np.mean(model.wv[doc], axis=0)

# Generate document vectors
X = []
for doc in tqdm(df['statement']):
    X.append(document_vector(doc))

# Convert X to numpy array and reshape for LSTM
X = np.array(X).reshape(len(X), 1, 50)  # Shape it for LSTM (samples, timesteps, features)

# Encode labels
encoder = LabelEncoder()
y = encoder.fit_transform(df['status'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

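# Fit SMOTE to the training data; SMOTE expects 2D input, so flatten first and restore the LSTM shape afterwards
X_train_smote, y_train_smote = smote.fit_resample(X_train.reshape(len(X_train), -1), y_train)
X_train_smote = X_train_smote.reshape(len(X_train_smote), 1, 50)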

# Check the new class distribution
print(f"Original class distribution: {Counter(y_train)}")
print(f"New class distribution after SMOTE: {Counter(y_train_smote)}")

# Define the model
model1 = Sequential()
model1.add(LSTM(50, input_shape=(1, 50), return_sequences=False))
model1.add(Dropout(0.5))
model1.add(Dense(7, activation='softmax'))  # Adjust based on number of classes
model1.summary()

# Compile the model
model1.compile(loss='sparse_categorical_crossentropy',
               optimizer='adam',
               metrics=['accuracy'])

# Define the checkpoint callback
checkpoint = ModelCheckpoint(filepath='/content/model1_best.keras',
                             monitor='val_accuracy',
                             save_best_only=True,
                             verbose=1)
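
The feature pipeline in this cell turns each statement into one averaged word vector and then adds a single timestep axis for the LSTM. The sketch below shows the same two steps on toy data; the 3-dimensional `toy_vectors` dictionary and `VECTOR_SIZE` are hypothetical stand-ins for the trained Word2Vec model and its `vector_size`.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the notebook's Word2Vec model uses 50 dimensions here.
toy_vectors = {
    "restless": np.array([1.0, 0.0, 2.0]),
    "sleep": np.array([3.0, 2.0, 0.0]),
}
VECTOR_SIZE = 3

def document_vector(tokens):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [toy_vectors[token] for token in tokens if token in toy_vectors]
    if not known:
        return np.zeros(VECTOR_SIZE)
    return np.mean(known, axis=0)

docs = [["restless", "sleep", "unknownword"], ["unknownword"]]
X = np.array([document_vector(doc) for doc in docs])

# One timestep per document: (samples, timesteps, features)
X_lstm = X.reshape(len(X), 1, VECTOR_SIZE)
print(X_lstm.shape)
```

Because every document is reduced to one averaged vector, the LSTM receives a sequence of length 1 and cannot use word order; the recurrent layer then behaves much like a dense layer on the averaged features.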


In [None]:
# Train the model with checkpointing
model1.fit(X_train_smote,y_train_smote,
           validation_data=(X_test, y_test),
           epochs=20,
           callbacks=[checkpoint])

In [None]:
!pip install gradio



In [None]:
import numpy as np
import re
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import gradio as gr
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from sklearn.preprocessing import LabelEncoder

# Load the saved NLP model
model = load_model('/content/model1_best.keras')

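# Initialize or load your tokenizer
# Caution: this Tokenizer is never fitted, so texts_to_sequences() returns empty sequences and every input
# is padded to all zeros. The classifier was trained on averaged Word2Vec vectors, not on token IDs.
tokenizer = Tokenizer(num_words=5000)
# Load tokenizer if saved, e.g., tokenizer = load_tokenizer('/path/to/tokenizer.pkl')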

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))

# Class labels in the order LabelEncoder assigned them during training (sorted alphabetically)
class_labels = sorted(['Anxiety', 'Normal', 'Depression', 'Suicidal', 'Stress', 'Bipolar', 'Personality disorder'])
encoder = LabelEncoder()
encoder.classes_ = np.array(class_labels)

# Function to remove extra spaces
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stopwords]

# Preprocessing steps (same as training)
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = remove_extra_spaces(text)  # Remove extra spaces
    tokens = word_tokenize(text)  # Tokenize
    tokens = remove_stopwords(tokens)  # Remove stopwords
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize
    sequences = tokenizer.texts_to_sequences([' '.join(lemmatized_tokens)])
    padded_sequence = pad_sequences(sequences, maxlen=50)  # Adjust maxlen to match the model's input size

    # Reshape to (1, 1, 50) as required by the model
    padded_sequence = np.reshape(padded_sequence, (1, 1, 50))
    return padded_sequence

# Prediction function
def predict(text):
      # Apply preprocessing
      preprocessed_text = preprocess_text(text)

      # Make prediction
      prediction = model.predict(preprocessed_text)

      # Get predicted class index
      predicted_class_index = np.argmax(prediction, axis=1)[0]

      # Map index to class label
      predicted_class_label = encoder.classes_[predicted_class_index]

      return predicted_class_label

# Create Gradio interface
gr_interface = gr.Interface(
    fn=predict,
    inputs="text",
    outputs="text",  # Output will be the predicted class label
    title="NLP Model Deployment",
    description="Enter text for prediction"
)

# Launch Gradio app
gr_interface.launch()


In [None]:
import numpy as np
import re
from tensorflow.keras.models import load_model
import gensim
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from sklearn.preprocessing import LabelEncoder

# Load the saved NLP model
model = load_model('/content/model1_best.keras')

# Load the Word2Vec model
word2vec_model = gensim.models.Word2Vec.load('/content/word2vec_model')

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))

# Class labels in the order LabelEncoder assigned them during training (sorted alphabetically)
class_labels = sorted(['Anxiety', 'Normal', 'Depression', 'Suicidal', 'Stress', 'Bipolar', 'Personality disorder'])
encoder = LabelEncoder()
encoder.classes_ = np.array(class_labels)

# Function to remove extra spaces
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stopwords]

# Function to convert text to Word2Vec vectors
def text_to_word2vec(text):
    tokens = word_tokenize(text)
    tokens = remove_stopwords(tokens)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    vectors = [word2vec_model.wv[token] for token in lemmatized_tokens if token in word2vec_model.wv]
    if len(vectors) == 0:
        return np.zeros(word2vec_model.vector_size)
    return np.mean(vectors, axis=0)

# Preprocessing and converting to Word2Vec vectors
def preprocess_and_vectorize(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = remove_extra_spaces(text)  # Remove extra spaces
    vector = text_to_word2vec(text)
    return np.reshape(vector, (1, 1, word2vec_model.vector_size))  # Reshape for model input

# Prediction function with error handling
def predict(text):
    try:
        # Apply preprocessing and vectorization
        preprocessed_vector = preprocess_and_vectorize(text)

        # Make prediction
        prediction = model.predict(preprocessed_vector)

        # Get predicted class index
        predicted_class_index = np.argmax(prediction, axis=1)[0]

        # Map index to class label
        predicted_class_label = encoder.classes_[predicted_class_index]

        return predicted_class_label
    except Exception as e:
        return str(e)

# Create Gradio interface
gr_interface = gr.Interface(
    fn=predict,
    inputs="text",
    outputs="text",  # Output will be the predicted class label
    title="NLP Model Deployment",
    description="Enter text for prediction"
)

# Launch Gradio app
gr_interface.launch()


FileNotFoundError: [Errno 2] No such file or directory: '/content/word2vec_model'

In [None]:
df['status'].value_counts()


Unnamed: 0_level_0,count
status,Unnamed: 1_level_1
Normal,16343
Depression,15404
Suicidal,10652
Anxiety,3841
Bipolar,2777
Stress,2587
Personality disorder,1077
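
The counts above show a strong class imbalance. As a small worked example, the snippet below converts them to shares and computes the ratio between the largest and smallest class; the numbers are copied from the `value_counts()` output and the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {
    "Normal": 16343, "Depression": 15404, "Suicidal": 10652, "Anxiety": 3841,
    "Bipolar": 2777, "Stress": 2587, "Personality disorder": 1077,
}
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<22}{share:6.1%}")
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```

The total of 52,681 rows matches the number of statements left after dropping missing values.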


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
100%|██████████| 52681/52681 [02:44<00:00, 320.57it/s]


Original class distribution: Counter({3: 11356, 2: 10867, 6: 7431, 0: 2707, 1: 1963, 5: 1807, 4: 745})
New class distribution after SMOTE: Counter({2: 11356, 3: 11356, 6: 11356, 1: 11356, 0: 11356, 5: 11356, 4: 11356})


  super().__init__(**kwargs)


Epoch 1/20
2479/2485 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.5295 - loss: 1.3302

ValueError: Exception encountered when calling Sequential.call().

Invalid input shape for input Tensor("IteratorGetNext:0", shape=(None, 50), dtype=float32). Expected shape (None, 1, 50), but input has incompatible shape (None, 50)

Arguments received by Sequential.call():
  • inputs=tf.Tensor(shape=(None, 50), dtype=float32)
  • training=False
  • mask=None
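
The error above is raised with `training=False`, which suggests it comes from the validation pass: the validation input has shape `(None, 50)` while the model was declared with `input_shape=(1, 50)`. A minimal sketch of adding the missing timestep axis, using a zero-filled stand-in array rather than the real test set:

```python
import numpy as np

n_features = 50  # matches the LSTM input size reported in the error message

# Stand-in for a 2D test matrix of shape (samples, features)
X_test_2d = np.zeros((4, n_features), dtype="float32")

# Add the single timestep axis the LSTM input layer was declared with: (samples, 1, features)
X_test_3d = X_test_2d.reshape(X_test_2d.shape[0], 1, n_features)

# np.expand_dims is an equivalent way to insert the axis.
same_shape = np.expand_dims(X_test_2d, axis=1).shape == X_test_3d.shape
print(X_test_3d.shape, same_shape)
```

The same reshape has to be applied to anything passed as `validation_data` and to any vector passed to `model.predict`.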

In [None]:
import numpy as np
import re
from tensorflow.keras.models import load_model
import gensim
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
from sklearn.preprocessing import LabelEncoder
import gradio as gr

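# Load the saved classifier that was trained on 100-dimensional Word2Vec vectors,
# so its input size matches the Word2Vec model loaded below
model = load_model('/content/drive/MyDrive/modelnlp2.keras')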

# Load the Word2Vec model
word2vec_model = gensim.models.Word2Vec.load('/content/drive/MyDrive/word2vec_model')

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))

# Class labels in the order LabelEncoder assigned them during training (sorted alphabetically)
class_labels = sorted(['Anxiety', 'Normal', 'Depression', 'Suicidal', 'Stress', 'Bipolar', 'Personality disorder'])
encoder = LabelEncoder()
encoder.classes_ = np.array(class_labels)

# Function to remove extra spaces
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stopwords]

# Function to convert text to Word2Vec vectors
def text_to_word2vec(text):
    tokens = word_tokenize(text)
    tokens = remove_stopwords(tokens)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    vectors = [word2vec_model.wv[token] for token in lemmatized_tokens if token in word2vec_model.wv]
    if len(vectors) == 0:
        return np.zeros(word2vec_model.vector_size)
    return np.mean(vectors, axis=0)

# Preprocessing and converting to Word2Vec vectors
def preprocess_and_vectorize(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = remove_extra_spaces(text)  # Remove extra spaces
    vector = text_to_word2vec(text)
    return np.reshape(vector, (1, 1, word2vec_model.vector_size))  # Reshape for model input

# Prediction function with error handling
def predict(text):
    try:
        # Apply preprocessing and vectorization
        preprocessed_vector = preprocess_and_vectorize(text)

        # Make prediction
        prediction = model.predict(preprocessed_vector)

        # Get predicted class index
        predicted_class_index = np.argmax(prediction, axis=1)[0]

        # Map index to class label
        predicted_class_label = encoder.classes_[predicted_class_index]

        return predicted_class_label
    except Exception as e:
        return str(e)

# Create Gradio interface
gr_interface = gr.Interface(
    fn=predict,
    inputs="text",
    outputs="text",  # Output will be the predicted class label
    title="NLP Model Deployment",
    description="Enter text for prediction"
)

# Launch Gradio app
gr_interface.launch()

In [None]:
import zipfile
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import gensim
from nltk import word_tokenize, WordNetLemmatizer, download, corpus
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint
from imblearn.over_sampling import SMOTE
from collections import Counter
# Load and extract ZIP file
zip_ref = zipfile.ZipFile('/content/archive (21).zip', 'r')
zip_ref.extractall('/content')
zip_ref.close()

# Load the data
df = pd.read_csv('/content/Combined Data.csv')
df = df.drop(columns=['Unnamed: 0'])
df.dropna(inplace=True)
df['statement'] = df['statement'].str.lower()

# Download NLTK resources
download('punkt')
download('wordnet')
download('stopwords')

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stopwords = set(corpus.stopwords.words('english'))

# Function to remove punctuation
def remove_punc(text):
    return re.sub(r'[^\w\s]', '', text)

# Function to remove extra spaces
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stopwords]

# Function to handle chatwords
def handle_chatwords(text):
    replacements = {
        "u": "you",
        "r": "are",
        "ur": "your",
        "cuz": "because",
        "pls": "please",
        "thx": "thanks",
        "LOL": "Laugh Out Loud",
        "BRB": "Be Right Back",
        "OMG": "Oh My God",
        "TTYL": "Talk To You Later",
        "IDK": "I Don't Know",
        "FYI": "For Your Information",
        "BTW": "By The Way",
        "ROFL": "Rolling On the Floor Laughing",
        "SMH": "Shaking My Head",
        "IMO": "In My Opinion",
        "IMHO": "In My Humble Opinion",
        "ICYMI": "In Case You Missed It",
        "DM": "Direct Message",
        "GTG": "Got To Go",
        "G2G": "Got To Go",
        "FTW": "For The Win",
        "AFK": "Away From Keyboard",
        "NP": "No Problem",
        "AMA": "Ask Me Anything",
        "FOMO": "Fear Of Missing Out",
        "OOTD": "Outfit Of The Day",
        "TL;DR": "Too Long; Didn't Read"
    }
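    # Replace whole tokens only; str.replace would also rewrite letters inside words (every "u" would become "you")
    lookup = {chatword.lower(): replacement.lower() for chatword, replacement in replacements.items()}
    return ' '.join(lookup.get(token, token) for token in text.split())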

# Function to preprocess text
def preprocess_text(text):
    text = text.lower()
    text = remove_punc(text)
    text = handle_chatwords(text)
    text = remove_extra_spaces(text)
    tokens = word_tokenize(text)
    tokens = remove_stopwords(tokens)
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Apply preprocessing
df['statement'] = df['statement'].apply(preprocess_text)

# Create and train the Word2Vec model
word2vec_model = gensim.models.Word2Vec(
    sentences=df['statement'],
    vector_size=100,
    window=10,
    min_count=2,
    epochs=10
)


# Save the trained Word2Vec model
word2vec_model.save('/content/drive/MyDrive/word2vec_model')
# Function to get document vectors
def document_vector(doc):
    doc = [word for word in doc if word in word2vec_model.wv.key_to_index]  # dict lookup; index_to_key is a list and is scanned linearly
    if len(doc) == 0:
        return np.zeros(word2vec_model.vector_size)
    else:
        return np.mean(word2vec_model.wv[doc], axis=0)

# Generate document vectors
X = []
for doc in tqdm(df['statement']):
    X.append(document_vector(doc))

# Convert X to a 2D numpy array of shape (samples, features), which is what SMOTE expects
X = np.array(X)
X_flattened = X.reshape(X.shape[0], -1)

# Encode labels
encoder = LabelEncoder()
y = encoder.fit_transform(df['status'])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_flattened, y, test_size=0.3, random_state=42)

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Fit SMOTE to the flattened training data
X_train_smote_flattened, y_train_smote = smote.fit_resample(X_train, y_train)

# Reshape the oversampled data back to 3D
X_train_smote = X_train_smote_flattened.reshape(X_train_smote_flattened.shape[0], 1, 100)

# Check the new class distribution
print(f"Original class distribution: {Counter(y_train)}")
print(f"New class distribution after SMOTE: {Counter(y_train_smote)}")

# # Define the model
# model1 = Sequential()
# model1.add(LSTM(50, input_shape=(1, 50), return_sequences=False))
# model1.add(Dropout(0.5))
# model1.add(Dense(7, activation='softmax'))
# model1.summary()

# # Compile the model
# model1.compile(loss='sparse_categorical_crossentropy',
#                optimizer='adam',
#                metrics=['accuracy'])

# # Define the checkpoint callback
# checkpoint = ModelCheckpoint(filepath='/content/drive/MyDrive/Bestmodel.keras',
#                              monitor='val_accuracy',
#                              save_best_only=True,
#                              verbose=1)
# # Train the model with checkpointing
# model1.fit(X_train_smote, y_train_smote,
#            validation_data=(X_test.reshape(X_test.shape[0], 1, 50), y_test),  # Reshape X_test for LSTM
#            epochs=20,
#            callbacks=[checkpoint])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
100%|██████████| 52681/52681 [02:03<00:00, 426.69it/s]


Original class distribution: Counter({3: 11356, 2: 10867, 6: 7431, 0: 2707, 1: 1963, 5: 1807, 4: 745})
New class distribution after SMOTE: Counter({2: 11356, 3: 11356, 6: 11356, 1: 11356, 0: 11356, 5: 11356, 4: 11356})
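
The two distributions above make it possible to work out how much of the training set is synthetic. The sketch below does that arithmetic with the counts copied from the output; the batch size of 32 is an assumption (the Keras default for `fit()`), since the notebook does not set it.

```python
import math
from collections import Counter

# Training-set class counts copied from the output above.
original = Counter({3: 11356, 2: 10867, 6: 7431, 0: 2707, 1: 1963, 5: 1807, 4: 745})

majority = max(original.values())
# SMOTE's default strategy raises every minority class to the majority count.
balanced = Counter({label: majority for label in original})

n_original = sum(original.values())
n_balanced = sum(balanced.values())
n_synthetic = n_balanced - n_original

batch_size = 32  # assumed Keras default; not set explicitly in the notebook
steps_per_epoch = math.ceil(n_balanced / batch_size)

print(n_original, n_balanced, n_synthetic, steps_per_epoch)
```

Under these assumptions the resampled training set has 79,492 rows, of which 42,616 (a little over half) are synthetic, and the resulting 2,485 steps per epoch agree with the step count shown in the training logs.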


In [None]:
# Initialize the model
model1 = Sequential()

# Add the first LSTM layer with return_sequences=True to stack more LSTM layers
model1.add(LSTM(units=50, return_sequences=True,input_shape=(1, 100)))
model1.add(Dropout(0.4))
# Add the second LSTM layer
model1.add(LSTM(units=50, return_sequences=True))
model1.add(Dropout(0.4))

model1.add(LSTM(units=50))
model1.add(Dropout(0.4))
model1.add(Dense(7, activation='softmax'))
model1.summary()

  super().__init__(**kwargs)


In [None]:
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import LSTM, Dropout, Dense, Bidirectional
# model1 = Sequential()
# # Bidirectional LSTM
# model1.add(Bidirectional(LSTM(units=50, return_sequences=True, dropout=0.2, recurrent_dropout=0.2), input_shape=(None, 100)))
# model1.add(LSTM(units=50, dropout=0.2, recurrent_dropout=0.2))
# model1.add(Dense(7, activation='softmax'))
# model1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model1.summary()




In [None]:
# Compile the model
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint

# RMSprop with a small learning rate
learning_rate = 0.0001
optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)

# sparse_categorical_crossentropy expects integer class labels (0-6), not one-hot vectors
model1.compile(loss='sparse_categorical_crossentropy',
               optimizer=optimizer,
               metrics=['accuracy'])

# Define the checkpoint callback
checkpoint = ModelCheckpoint(filepath='/content/drive/MyDrive/modelnlp2.keras',
                             monitor='val_accuracy',
                             save_best_only=True,
                             verbose=1)

In [None]:
# Train the model with checkpointing
model1.fit(X_train_smote, y_train_smote,
           validation_data=(X_test.reshape(X_test.shape[0], 1, 100), y_test),  # Reshape X_test for LSTM
           epochs=200,
           callbacks=[checkpoint])


Epoch 1/200
2475/2485 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.3989 - loss: 1.7308
Epoch 1: val_accuracy improved from -inf to 0.55350, saving model to /content/drive/MyDrive/modelnlp2.keras
2485/2485 ━━━━━━━━━━━━━━━━━━━━ 22s 7ms/step - accuracy: 0.3993 - loss: 1.7298 - val_accuracy: 0.5535 - val_loss: 1.2417
Epoch 2/200
2482/2485 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.6000 - loss: 1.1254
Epoch 2: val_accuracy improved from 0.55350 to 0.60146, saving model to /content/drive/MyDrive/modelnlp2.keras
2485/2485 ━━━━━━━━━━━━━━━━━━━━ 22s 7ms/step - accuracy: 0.6000 - loss: 1.1253 - val_accuracy: 0.6015 - val_loss: 1.1085
Epoch 3/200
2485/2485 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.6332 - loss: 1.0282
Epoch 3: val_accuracy improved from 0.60146 to 0.62126, saving model to /content/drive/MyDrive/modelnlp2

<keras.src.callbacks.history.History at 0x7ccc143b2500>
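
With `save_best_only=True` and `monitor='val_accuracy'`, the checkpoint file is overwritten only when validation accuracy improves on the best value seen so far. The sketch below replays that rule in plain Python on the three validation accuracies from the log above. The function name `epochs_that_save` is hypothetical, and this illustrates the behaviour rather than reproducing Keras' implementation.

```python
def epochs_that_save(val_accuracies):
    """Return the 1-based epochs at which a best-only checkpoint would be written."""
    best = float("-inf")  # matches the "improved from -inf" message in epoch 1
    saved = []
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:  # strict improvement only
            best = acc
            saved.append(epoch)
    return saved


logged = [0.55350, 0.60146, 0.62126]  # epochs 1-3 from the log above

print(epochs_that_save(logged))           # every logged epoch improved
print(epochs_that_save(logged + [0.61]))  # hypothetical worse 4th epoch: no new save
```

Because only the best epoch is kept on disk, later epochs that score worse do not overwrite the saved model. For a 200-epoch run, an `EarlyStopping` callback monitoring the same metric may be worth considering to end training once improvement stalls.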

In [None]:
# Mount Google Drive to access the saved models
from google.colab import drive
drive.mount('/content/drive')

# Install dependencies before importing them
!pip install gradio
!pip install nltk

import re

import gensim
import gradio as gr
import nltk
import numpy as np
from nltk import word_tokenize, WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import load_model

# Load the saved NLP model
# Note: the training cell above checkpoints to modelnlp2.keras, while this cell loads
# modelnlp.keras; confirm this is the intended file.
model1 = load_model('/content/drive/MyDrive/modelnlp.keras')
# Load the Word2Vec model
word2vec_model = gensim.models.Word2Vec.load('/content/drive/MyDrive/word2vec_model')

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))

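# Manually define the class labels in the index order used during training.
# Assumption: the training labels were encoded with LabelEncoder, which sorts class
# names alphabetically, so the indices follow alphabetical order rather than the order
# in which the classes are usually listed. Verify against the fitted encoder's classes_.
class_labels = ['Anxiety', 'Bipolar', 'Depression', 'Normal', 'Personality disorder', 'Stress', 'Suicidal']
encoder = LabelEncoder()
encoder.classes_ = np.array(class_labels)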

# Function to remove extra spaces
def remove_extra_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Function to remove stopwords
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stopwords]

# Function to convert text to Word2Vec vectors
def text_to_word2vec(text):
    tokens = word_tokenize(text)
    tokens = remove_stopwords(tokens)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    vectors = [word2vec_model.wv[token] for token in lemmatized_tokens if token in word2vec_model.wv]
    if len(vectors) == 0:
        return np.zeros(word2vec_model.vector_size)
    return np.mean(vectors, axis=0)

# Preprocessing and converting to Word2Vec vectors
def preprocess_and_vectorize(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = remove_extra_spaces(text)  # Remove extra spaces
    vector = text_to_word2vec(text)
    return np.reshape(vector, (1, 1, word2vec_model.vector_size))  # Reshape for model input

# Prediction function with basic input validation
def predict(text):
    # Guard against empty input, which would otherwise be scored as an all-zero vector
    if not text or not text.strip():
        return "Please enter a statement."

    # Apply preprocessing and vectorization
    preprocessed_vector = preprocess_and_vectorize(text)

    # Make prediction (one row of 7 class probabilities)
    prediction = model1.predict(preprocessed_vector)

    # Get predicted class index
    predicted_class_index = np.argmax(prediction, axis=1)[0]

    # Map index to class label
    predicted_class_label = encoder.classes_[predicted_class_index]

    return predicted_class_label


# Create Gradio interface
gr_interface = gr.Interface(
    fn=predict,
    inputs="text",
    outputs="text",  # Output will be the predicted class label
    title="Mental Status Prediction",
    description="Describe how you are feeling."
)
# Launch Gradio app
gr_interface.launch()


Mounted at /content/drive
Collecting gradio
  Downloading gradio-4.42.0-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.112.2-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.3.0 (from gradio)
  Downloading gradio_client-1.3.0-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.10.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4/50.4 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-mu

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://a1a2559bff73bd2e76.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
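
The inference cell above turns each statement into a single vector by cleaning the text and averaging the vectors of the words the embedding model knows. The sketch below shows that mean-pooling step in plain Python. The 3-dimensional `toy_vectors` are hypothetical stand-ins for the trained Word2Vec vectors, whitespace splitting stands in for `word_tokenize`, and stopword removal and lemmatization are left out for brevity.

```python
import re

# Hypothetical 3-dimensional embeddings standing in for the trained Word2Vec vectors.
toy_vectors = {
    "feel": [0.2, 0.4, 0.0],
    "anxious": [0.8, 0.0, 0.6],
}
VECTOR_SIZE = 3


def clean(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


def mean_vector(text):
    tokens = clean(text).split()  # whitespace split stands in for word_tokenize
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        # Same zero-vector fallback as text_to_word2vec when no token is in the vocabulary.
        return [0.0] * VECTOR_SIZE
    # Average each dimension across the known tokens.
    return [sum(dim) / len(known) for dim in zip(*known)]


print(mean_vector("I feel  ANXIOUS!"))
print(mean_vector("???"))
```

Averaging discards word order, so two statements containing the same words in a different order map to the same vector. That is also why the model receives a single timestep per statement.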




In [None]:
# Anxiety:
# "My thoughts are spiraling, and I can't seem to calm the constant sense of dread."

# Normal:
# "I feel balanced and present, able to take on whatever comes my way with ease."

# Depression:
# "Everything feels heavy, like I'm sinking into a darkness I can’t escape."

# Suicidal:
# "The pain feels overwhelming, and I struggle to see any light or reason to continue."

# Stress:
# "The weight of everything is pressing down on me, and I feel like I’m on the verge of breaking."

# Bipolar:
# "One moment, I'm bursting with energy and ideas, the next, I'm trapped in a pit of exhaustion and hopelessness."

# Personality Disorder:
# "I feel fractured, like my sense of self shifts from one extreme to another, leaving me confused and lost."
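
The statements above can double as a quick smoke test of the deployed `predict` function. The sketch below is hypothetical: `check_samples` and `stub_predict` are illustrative names, and the stub is a keyword rule standing in for the trained model so that the snippet runs on its own. In the notebook, pass `predict` in place of `stub_predict`.

```python
samples = {
    "Anxiety": "My thoughts are spiraling, and I can't seem to calm the constant sense of dread.",
    "Normal": "I feel balanced and present, able to take on whatever comes my way with ease.",
    "Depression": "Everything feels heavy, like I'm sinking into a darkness I can't escape.",
}


def check_samples(predict_fn, labelled_samples):
    """Return (expected, predicted, matched) for every sample statement."""
    results = []
    for expected, text in labelled_samples.items():
        predicted = predict_fn(text)
        results.append((expected, predicted, predicted == expected))
    return results


def stub_predict(text):
    # Placeholder rule, NOT the trained model: keyword lookup for demonstration only.
    lowered = text.lower()
    if "dread" in lowered:
        return "Anxiety"
    if "darkness" in lowered:
        return "Depression"
    return "Normal"


for expected, predicted, matched in check_samples(stub_predict, samples):
    status = "OK" if matched else "MISMATCH"
    print(f"{expected:<12} -> {predicted:<12} {status}")
```

A handful of hand-written statements is only a spot check; it does not replace evaluation on the held-out test set.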