## Sentiment Analysis for Mental Health
The dataset combines raw data from multiple sources, cleaned and compiled into a single resource for developing chatbots and performing sentiment analysis.

In [1]:
#Importing all the necessary libraries
import pandas as pd
import numpy as np

In [2]:
#Loading the datafile
data = pd.read_csv("Combined Data.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,statement,status
0,0,oh my gosh,Anxiety
1,1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,3,I've shifted my focus to something else but I'...,Anxiety
4,4,"I'm restless and restless, it's been a month n...",Anxiety


In [3]:
data = data.drop(['Unnamed: 0'], axis=1)
data.head()

Unnamed: 0,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety


## Data Preprocessing & Cleaning

In [4]:
#Checking the data dimensions
data.shape

(53043, 2)

In [5]:
#Checking for null values in each column
data.isnull().sum()

statement    362
status         0
dtype: int64

There are 362 null values in our statement column which can be dropped.

In [6]:
#Dropping all the null values
data = data.dropna()

In [7]:
#Confirming that no null values remain
data.isnull().sum()


statement    0
status       0
dtype: int64

All the null values have been removed.

In [8]:
#Checking the data dimensions
data.shape

(52681, 2)
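The drop above removes every row with a null in either column, which leaves 53043 - 362 = 52681 rows. As a minimal sketch on a made-up toy frame (the `toy` data and variable names are illustrative assumptions, not part of the dataset), the same step can be restricted to the `statement` column and followed by an index reset so the row labels stay contiguous:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "statement": ["oh my gosh", None, "trouble sleeping"],
    "status": ["Anxiety", "Normal", "Anxiety"],
})

rows_before = len(toy)

# Drop only rows whose 'statement' is missing, then renumber the index
toy_clean = toy.dropna(subset=["statement"]).reset_index(drop=True)

rows_dropped = rows_before - len(toy_clean)
print(rows_dropped)
```

Restricting the drop with `subset` would matter only if other columns could legitimately hold nulls; here `status` has none, so both forms give the same result.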

In [9]:
#Output feature is a categorical Feature
data['status'].unique()

array(['Anxiety', 'Normal', 'Depression', 'Suicidal', 'Stress', 'Bipolar',
       'Personality disorder'], dtype=object)

Performing label encoding on the output feature, status

In [10]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
label_encoder = LabelEncoder()

# Encode the status labels as integers (classes are numbered in sorted order)
data['Category_encoded'] = label_encoder.fit_transform(data['status'])

# Show the assigned values
category_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Category Mapping:")
print(category_mapping)


Category Mapping:
{'Anxiety': 0, 'Bipolar': 1, 'Depression': 2, 'Normal': 3, 'Personality disorder': 4, 'Stress': 5, 'Suicidal': 6}


#### Label encoding applied to the dataset
Anxiety: 0,
Bipolar: 1,
Depression: 2,
Normal: 3,
Personality disorder: 4,
Stress: 5,
Suicidal: 6
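Because the encoder numbers the classes in sorted order, the mapping can be reversed later to turn encoded predictions back into readable labels. The sketch below assumes a hypothetical list of predicted codes (`predicted_codes` is made up for illustration) and refits an encoder on the seven class names listed above:

```python
from sklearn.preprocessing import LabelEncoder

# The seven class names reported by data['status'].unique()
statuses = ['Anxiety', 'Normal', 'Depression', 'Suicidal', 'Stress',
            'Bipolar', 'Personality disorder']

encoder = LabelEncoder()
encoder.fit(statuses)  # classes_ are stored in sorted order

# Hypothetical model output: encoded predictions for three statements
predicted_codes = [0, 3, 6]

# Map the integer codes back to the original status names
predicted_labels = encoder.inverse_transform(predicted_codes)
print(list(predicted_labels))
```

Keeping the fitted encoder around (rather than a hand-copied dictionary) avoids the mapping drifting out of sync if the set of classes changes.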

In [11]:
data.head()

Unnamed: 0,statement,status,Category_encoded
0,oh my gosh,Anxiety,0
1,"trouble sleeping, confused mind, restless hear...",Anxiety,0
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety,0
3,I've shifted my focus to something else but I'...,Anxiety,0
4,"I'm restless and restless, it's been a month n...",Anxiety,0


In [12]:
#Dropping status column from the dataset

df = data.drop(['status'], axis=1)
df.head()

Unnamed: 0,statement,Category_encoded
0,oh my gosh,0
1,"trouble sleeping, confused mind, restless hear...",0
2,"All wrong, back off dear, forward doubt. Stay ...",0
3,I've shifted my focus to something else but I'...,0
4,"I'm restless and restless, it's been a month n...",0


In [13]:
df['Category_encoded'].value_counts()

Category_encoded
3    16343
2    15404
6    10652
0     3841
1     2777
5     2587
4     1077
Name: count, dtype: int64

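The classes are imbalanced: Normal (3) has 16,343 rows while Personality disorder (4) has only 1,077. To compensate, we compute class weights with the inverse class frequency method, where each class's weight is the total number of records divided by that class's count.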

In [14]:
# Value counts for each category, ordered by encoded label
value_counts = df['Category_encoded'].value_counts().sort_index().to_dict()

# Total number of records
total_records = len(df)

# Calculate class weights as total_records / count (inverse class frequency)
class_weights = {
    category: round(total_records / count, 2)
    for category, count in value_counts.items()
}

print(class_weights)

{0: 13.72, 1: 18.97, 2: 3.42, 3: 3.22, 4: 48.91, 5: 20.36, 6: 4.95}


In [15]:
#Define class weights (increase weights for categories 0, 1, 5, 4)
class_weights = {0: 13.72, 1: 18.97, 2: 3.42, 3: 3.22, 4: 48.91, 5: 20.36, 6:4.95}
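For comparison, scikit-learn ships a related heuristic, `compute_class_weight` with `class_weight="balanced"`, which computes N / (K * n_c) for N records, K classes, and n_c records in class c. The sketch below is an illustration rather than part of the original workflow: it rebuilds a label vector from the counts above and checks that the two schemes differ only by the constant factor K = 7, so they rank the classes identically.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the value_counts() output above
counts = {0: 3841, 1: 2777, 2: 15404, 3: 16343, 4: 1077, 5: 2587, 6: 10652}
total = sum(counts.values())
n_classes = len(counts)

# Inverse class frequency, as used in this notebook: N / n_c
inverse_freq = {c: total / n for c, n in counts.items()}

# Rebuild a label vector with the same counts so scikit-learn can be queried
labels = np.repeat(list(counts.keys()), list(counts.values()))
classes = np.array(sorted(counts))
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

# 'balanced' is N / (K * n_c), i.e. the inverse frequency divided by K
ratio = [inverse_freq[c] / balanced[i] for i, c in enumerate(classes)]
print(np.round(ratio, 6))
```

Whether the constant factor matters depends on the model: for many loss functions only the relative weights affect which classes are emphasised, while the absolute scale interacts with the learning rate or regularisation strength.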

In [16]:
#Converting all statements to lowercase
df['statement']=df['statement'].str.lower()
df.head()

Unnamed: 0,statement,Category_encoded
0,oh my gosh,0
1,"trouble sleeping, confused mind, restless hear...",0
2,"all wrong, back off dear, forward doubt. stay ...",0
3,i've shifted my focus to something else but i'...,0
4,"i'm restless and restless, it's been a month n...",0


In [17]:
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\allen.harry\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
## Removing special characters: keep only letters, digits, spaces and hyphens
df['statement']=df['statement'].apply(lambda x:re.sub(r'[^a-zA-Z0-9 -]+', '',x))

## Remove the stopwords, building the set once so each lookup is fast
stop_words=set(stopwords.words('english'))
df['statement']=df['statement'].apply(lambda x:" ".join([y for y in x.split() if y not in stop_words]))

## Remove any additional spaces
df['statement']=df['statement'].apply(lambda x: " ".join(x.split()))
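To make the effect of these three steps concrete, here is a minimal sketch that applies them to a single sentence. The helper `clean_statement` and the small `STOPWORDS` set are illustrative assumptions; the notebook itself uses NLTK's full English stopword list.

```python
import re

# Small stand-in stopword list (an assumption; the notebook uses NLTK's English list)
STOPWORDS = {"i", "and", "it", "a", "been", "my", "the"}

def clean_statement(text: str) -> str:
    """Lowercase, strip special characters, drop stopwords, collapse spaces."""
    text = text.lower()
    # Keep only letters, digits, spaces and hyphens
    text = re.sub(r"[^a-z0-9 -]+", "", text)
    # split() with no argument also collapses repeated whitespace
    return " ".join(word for word in text.split() if word not in STOPWORDS)

cleaned = clean_statement("I'm restless   and restless, it's been a month!")
print(cleaned)  # im restless restless its month
```

One thing the example surfaces: apostrophes are stripped before stopwords are filtered, so a contraction such as "i'm" becomes "im" and may slip past a stopword list that only contains the apostrophised form.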

In [19]:
df.to_csv('Combined_cleaned.csv', index=False)

Lemmatization

In [20]:
## Lemmatizer (WordNetLemmatizer needs the WordNet corpus to be downloaded once)
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer=WordNetLemmatizer()

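In [21]:
#Reloading the cleaned data saved above; statements left empty by the cleaning are read back as NaN
df1 = pd.read_csv('Combined_cleaned.csv')
df1.isna().sum()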

In [None]:
df1=df1.dropna()

In [None]:
#Writing a function to lemmatize the data
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

In [None]:
df1['statement']=df1['statement'].apply(lambda x:lemmatize_words(x))

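In [None]:
#Saving the lemmatized data and loading it back
#(assumption: Combined_cleaned1.csv holds the lemmatized output; the notebook does not show where it is written)
df1.to_csv('Combined_cleaned1.csv', index=False)
df1 = pd.read_csv('Combined_cleaned1.csv')
df1.head()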

Splitting the DataFrame into training and test sets

In [None]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df1['statement'],df1['Category_encoded'],
                                              test_size=0.20)
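Given the class imbalance noted earlier, a plain random split can leave the rarest classes under-represented in the test set, and without a seed the split changes on every run. A possible variant, sketched here on made-up toy data (the `toy` frame and the seed value 42 are arbitrary assumptions), passes `stratify` and `random_state` to `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data (made up): 15 rows of class 3 and 5 rows of class 4
toy = pd.DataFrame({
    "statement": [f"text {i}" for i in range(20)],
    "Category_encoded": [3] * 15 + [4] * 5,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["statement"], toy["Category_encoded"],
    test_size=0.20,
    stratify=toy["Category_encoded"],  # keep class proportions in both splits
    random_state=42,                   # arbitrary seed, chosen for repeatability
)

print(y_te.value_counts().to_dict())
```

With stratification the 3:1 class ratio of the toy data is preserved in both the training and the test portion, whichever rows end up in each.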

Importing Word2Vec from the Gensim library