# **Mental Health Prediction**

##### **Authors**:
<ul type='square'> 
    <li> Alice Wamuyu</li>
    <li> Eugene Kuloba </li>
    <li> Fridah Kimathi </li>
    <li> Karen Amanya  </li>
    <li> Nicholus Magak  </li>
    <li> Nobert Akwir </li>
</ul>

#  **1. Business Understanding**

## **Objectives**

> ### **General Objective**

> ### **Specific Objectives**
<ul type='square'>
    <li >  </li>
    <li>  </li>
    <li>  </li>
    <li>  </li>
    <li>  </li>
    <li>  </li>
</ul>



### **Importing the required libraries**

In [86]:
import pandas as pd
import string
from textblob import TextBlob, Word
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer



from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import  accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay, log_loss
# To install textblob, run in a terminal: conda install -c conda-forge textblob

# **2. Data Understanding**

The data used in this project is from the <a href="https://zindi.africa/competitions/basic-needs-basic-rights-kenya-tech4mentalhealth/data">Basic Needs Basic Rights Kenya - Tech4MentalHealth</a> competition hosted by Zindi Africa. It consists of statements and questions from students at multiple universities across Kenya who reported suffering from mental health challenges. Each training statement is labelled with one of four categories: Depression, Alcohol, Suicide or Drugs. The wording of the statements is intended to respond to the prompting question, “What is on your mind?”

#### **Loading the data**

In [34]:
train_df = pd.read_csv('Data/Train.csv')
validation_df = pd.read_csv('Data/Test.csv')

In [35]:
# shape of the datasets
print(f'The train data shape: {train_df.shape}')
print(f'The test data shape: {validation_df.shape}')

The train data shape: (616, 3)
The test data shape: (309, 2)


In [36]:
# the columns in the datasets
print(f'The train data columns: \n {train_df.columns} \n')
print(f'The test data columns: \n {validation_df.columns}')

The train data columns: 
 Index(['ID', 'text', 'label'], dtype='object') 

The test data columns: 
 Index(['ID', 'text'], dtype='object')


In [37]:
# the info
# info() prints its summary directly and returns None, so it is called outside the f-string
print('The train data info:')
train_df.info()
print('\nThe test data info:')
validation_df.info()

The train data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 616 entries, 0 to 615
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      616 non-null    object
 1   text    616 non-null    object
 2   label   616 non-null    object
dtypes: object(3)
memory usage: 14.6+ KB

The test data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      309 non-null    object
 1   text    309 non-null    object
dtypes: object(2)
memory usage: 5.0+ KB


In [38]:
# class proportions
train_df['label'].value_counts(normalize=True)

# There is class imbalance: Depression makes up about 57% of the statements

Depression    0.571429
Alcohol       0.227273
Suicide       0.107143
Drugs         0.094156
Name: label, dtype: float64
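
The proportions above show that Depression makes up more than half of the training statements. As a rough illustration of one way to compensate, the sketch below computes "balanced" class weights, `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight='balanced'`. The class counts are derived here from the printed proportions and the 616 training rows, and the helper name `balanced_class_weights` is hypothetical.

```python
# Class counts derived from the proportions above (616 training rows);
# treat them as an assumption if the data changes.
class_counts = {"Depression": 352, "Alcohol": 140, "Suicide": 66, "Drugs": 58}

def balanced_class_weights(counts):
    """Return n_samples / (n_classes * count) for every class."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

weights = balanced_class_weights(class_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

A dictionary like this could be passed as `class_weight` to estimators that accept it, such as `DecisionTreeClassifier` or `RandomForestClassifier`, provided its keys match the labels used for fitting.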

In [39]:
train_df['length'] =  train_df['text'].apply(len)
train_df['length']

0      39
1      28
2      57
3      22
4      51
       ..
611    36
612    30
613    24
614    16
615    31
Name: length, Length: 616, dtype: int64

In [40]:
train_df.describe()

# The shortest statement is 8 characters long
# The longest statement is 196 characters long

Unnamed: 0,length
count,616.0
mean,39.813312
std,21.438797
min,8.0
25%,26.0
50%,35.0
75%,48.25
max,196.0
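
`len` applied to a string counts characters, so the `length` column summarised above is a character count. A minimal sketch, using three statements from the training data, of how character and word counts could be kept side by side (the column names `n_chars` and `n_words` are hypothetical):

```python
import pandas as pd

sample = pd.DataFrame({
    "text": [
        "Why do I get hallucinations?",
        "Why is life important?",
        "How can someone stop it?",
    ]
})

# Character count: what train_df['text'].apply(len) measures
sample["n_chars"] = sample["text"].str.len()
# Word count: number of whitespace-separated tokens
sample["n_words"] = sample["text"].str.split().str.len()

print(sample)
```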


In [41]:
# Viewing the longest statement (196 characters)

train_df[train_df['length'] == 196]

Unnamed: 0,ID,text,label,length
194,J55053XP,I am financially constrained over school fees ...,Depression,196


In [42]:
# Printing the full text of the longest statement

print(train_df['text'].iloc[194])

I am financially constrained over school fees and my  family background is not stable with a lot of debts…I have an elderly brother who could easily support me but has no job even after graduating


# **3. Data Preparation**

 #### **i. Correcting spelling mistakes**

In [43]:
def correct_sent(text):
    """Return the text with TextBlob's spelling correction applied."""
    return str(TextBlob(text).correct())

# Note: correct() can also change valid words (e.g. 'How' becomes 'Now' in row 4 below)
train_df['corrected_sent'] = train_df['text'].apply(correct_sent)
train_df.head()

Unnamed: 0,ID,text,label,length,corrected_sent
0,SUAVK39Z,I feel that it was better I dieAm happy,Depression,39,I feel that it was better I die happy
1,9JDAGUV3,Why do I get hallucinations?,Drugs,28,Why do I get hallucinations?
2,419WR1LQ,I am stresseed due to lack of financial suppor...,Depression,57,I am stressed due to lack of financial support...
3,6UY7DX6Q,Why is life important?,Suicide,22,Why is life important?
4,FYC0FTFB,How could I be helped to go through the depres...,Depression,51,Now could I be helped to go through the depres...


In [44]:
# TODO: find a way to handle the extra punctuation (e.g. '…')
print(train_df['corrected_sent'].iloc[194] + '\n \n')

print(train_df['corrected_sent'].iloc[48])

I am financially constrained over school fees and my  family background is not stable with a lot of debts…I have an elderly brother who could easily support me but has no job even after granulating
 

I am facing a lot of challenges in life financially, emotional, psycologically and with no solutions…Now can I safely look for solutions about depression on goose
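
The corrected sentences above show that automatic correction can also replace valid words (for example, "graduating" became "granulating"). One hedged way to audit this is to list every token that changed between the original and the corrected text. The helper `changed_words` below is a hypothetical name, it uses only the standard library, and it assumes that correction leaves the number of tokens unchanged. The example pair is taken from the training data.

```python
import string

def changed_words(original, corrected):
    """Return (before, after) pairs for tokens that differ, ignoring case and punctuation."""
    strip = str.maketrans("", "", string.punctuation)
    before = original.lower().translate(strip).split()
    after = corrected.lower().translate(strip).split()
    if len(before) != len(after):
        # Token counts differ (a word was split or merged), so a positional diff is unreliable
        return None
    return [(b, a) for b, a in zip(before, after) if b != a]

print(changed_words("How can someone stop it?", "now can someone stop it"))
# [('how', 'now')]
```

Reviewing such pairs before modelling makes it easier to decide whether certain words should be exempted from correction.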


 #### **ii. Changing text to lowercase**

In [45]:

train_df['corrected_sent'] = train_df['corrected_sent'].apply(lambda x: x.lower())
train_df.head()

Unnamed: 0,ID,text,label,length,corrected_sent
0,SUAVK39Z,I feel that it was better I dieAm happy,Depression,39,i feel that it was better i die happy
1,9JDAGUV3,Why do I get hallucinations?,Drugs,28,why do i get hallucinations?
2,419WR1LQ,I am stresseed due to lack of financial suppor...,Depression,57,i am stressed due to lack of financial support...
3,6UY7DX6Q,Why is life important?,Suicide,22,why is life important?
4,FYC0FTFB,How could I be helped to go through the depres...,Depression,51,now could i be helped to go through the depres...


 #### **iii. Removing the punctuation marks**

In [46]:
# Checking for texts that contain the ellipsis character '…' (shown as 'â€¦' when the file is mis-decoded)

for x in train_df['corrected_sent']:
    if '…' in x:
        print(x)

i feel hopeless, unworthy and useless …now do i cope with stress and forge the past?
i am facing a lot of challenges in life financially, emotional, psycologically and with no solutions…now can i safely look for solutions about depression on goose
there i get money for my needs…there do i get money  for personal needs?
i am financially constrained over school fees and my  family background is not stable with a lot of debts…i have an elderly brother who could easily support me but has no job even after granulating
i feel desperate…why is the world so unfair
by relatives deny me…i wonder if i am part of my family?


In [47]:
# Replacing the ellipsis character '…' (shown as 'â€¦' when mis-decoded) with a space
train_df['corrected_sent'] = train_df['corrected_sent'].apply(lambda x: x.replace('…', ' '))

In [48]:
# removing the other standard punctuation marks

punc_to_rem = string.punctuation  # all ASCII punctuation marks

train_df['corrected_sent'] = train_df['corrected_sent'].apply(lambda x: x.translate(str.maketrans('', '', punc_to_rem)))

train_df.head()

Unnamed: 0,ID,text,label,length,corrected_sent
0,SUAVK39Z,I feel that it was better I dieAm happy,Depression,39,i feel that it was better i die happy
1,9JDAGUV3,Why do I get hallucinations?,Drugs,28,why do i get hallucinations
2,419WR1LQ,I am stresseed due to lack of financial suppor...,Depression,57,i am stressed due to lack of financial support...
3,6UY7DX6Q,Why is life important?,Suicide,22,why is life important
4,FYC0FTFB,How could I be helped to go through the depres...,Depression,51,now could i be helped to go through the depres...
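
The two steps above (replacing the ellipsis character with a space, then deleting ASCII punctuation) can also be expressed as a single translation table. A minimal sketch, assuming the same behaviour is wanted; `clean_punctuation` is a hypothetical helper name, and collapsing repeated whitespace is an extra step that the cells above do not perform:

```python
import string

# Map the ellipsis character to a space and delete every ASCII punctuation mark
PUNCT_TABLE = str.maketrans({"…": " ", **{ch: None for ch in string.punctuation}})

def clean_punctuation(text):
    """Apply the table, then collapse any repeated whitespace."""
    return " ".join(text.translate(PUNCT_TABLE).split())

print(clean_punctuation("i feel desperate…why is the world so unfair"))
# i feel desperate why is the world so unfair
```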


 #### **iv. Removing stop words and lemmatizing**

In [49]:
# Downloading the necessary nltk packages. Uncomment to download

#nltk.download('stopwords')
#nltk.download('wordnet')
#nltk.download('omw-1.4')
#nltk.download('punkt')

In [50]:
stopwords = set(nltk.corpus.stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

def remove_stopwords(x):
    # Drop stop words, then lemmatize each remaining word as a verb ('v')
    sent = [wordnet_lemmatizer.lemmatize(i, 'v') for i in x.split() if i not in stopwords]
    return ' '.join(sent)

train_df['no_stopwords'] = train_df['corrected_sent'].apply(remove_stopwords)
train_df.head()

Unnamed: 0,ID,text,label,length,corrected_sent,no_stopwords
0,SUAVK39Z,I feel that it was better I dieAm happy,Depression,39,i feel that it was better i die happy,feel better die happy
1,9JDAGUV3,Why do I get hallucinations?,Drugs,28,why do i get hallucinations,get hallucinations
2,419WR1LQ,I am stresseed due to lack of financial suppor...,Depression,57,i am stressed due to lack of financial support...,stress due lack financial support school
3,6UY7DX6Q,Why is life important?,Suicide,22,why is life important,life important
4,FYC0FTFB,How could I be helped to go through the depres...,Depression,51,now could i be helped to go through the depres...,could help go depression


In [51]:
# Checking the same two statements after stop-word removal and lemmatization
print(train_df['no_stopwords'].iloc[194] + '\n \n')

print(train_df['no_stopwords'].iloc[48])

financially constrain school fee family background stable lot debts elderly brother could easily support job even granulate
 

face lot challenge life financially emotional psycologically solutions safely look solutions depression goose
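
In the output above, "family background is not stable" was reduced to "family background stable", because "not" is part of NLTK's English stop-word list. If negations matter for the classifier, one option is to exempt them from removal. A minimal sketch follows; the small `STOPWORDS` set is a hypothetical stand-in for `nltk.corpus.stopwords.words('english')` so that the example runs without downloads, and the `KEEP` set is likewise an assumption.

```python
# Hypothetical stand-in for the NLTK English stop-word list
STOPWORDS = {"i", "my", "is", "a", "and", "with", "of", "but", "has", "not", "no"}
# Negations to keep even though they are stop words
KEEP = {"not", "no"}

def remove_stopwords_keep_negations(text):
    """Drop stop words except the negations listed in KEEP."""
    return " ".join(w for w in text.split() if w not in STOPWORDS or w in KEEP)

print(remove_stopwords_keep_negations("my family background is not stable"))
# family background not stable
```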


 #### **v. Tokenizing the sentences**

In [52]:
# tokenizing the sentences
train_df['tokenized_text'] = train_df['no_stopwords'].apply(nltk.word_tokenize)
train_df.head()

Unnamed: 0,ID,text,label,length,corrected_sent,no_stopwords,tokenized_text
0,SUAVK39Z,I feel that it was better I dieAm happy,Depression,39,i feel that it was better i die happy,feel better die happy,"[feel, better, die, happy]"
1,9JDAGUV3,Why do I get hallucinations?,Drugs,28,why do i get hallucinations,get hallucinations,"[get, hallucinations]"
2,419WR1LQ,I am stresseed due to lack of financial suppor...,Depression,57,i am stressed due to lack of financial support...,stress due lack financial support school,"[stress, due, lack, financial, support, school]"
3,6UY7DX6Q,Why is life important?,Suicide,22,why is life important,life important,"[life, important]"
4,FYC0FTFB,How could I be helped to go through the depres...,Depression,51,now could i be helped to go through the depres...,could help go depression,"[could, help, go, depression]"


 #### **vi. Vectorization**

In [57]:
countvec = CountVectorizer(ngram_range=(1,1))

count_data = countvec.fit_transform(train_df.no_stopwords)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cv_df = pd.DataFrame(count_data.toarray(), columns=countvec.get_feature_names_out())
cv_df

Unnamed: 0,abandon,able,abuse,academic,accept,add,addict,addition,adduct,adults,...,worry,worst,worth,worthless,would,wrap,wrong,yet,young,youths
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
611,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
612,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
613,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
614,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [61]:
train_df

Unnamed: 0,ID,text,label,length,corrected_sent,no_stopwords,tokenized_text
0,SUAVK39Z,I feel that it was better I dieAm happy,Depression,39,i feel that it was better i die happy,feel better die happy,"[feel, better, die, happy]"
1,9JDAGUV3,Why do I get hallucinations?,Drugs,28,why do i get hallucinations,get hallucinations,"[get, hallucinations]"
2,419WR1LQ,I am stresseed due to lack of financial suppor...,Depression,57,i am stressed due to lack of financial support...,stress due lack financial support school,"[stress, due, lack, financial, support, school]"
3,6UY7DX6Q,Why is life important?,Suicide,22,why is life important,life important,"[life, important]"
4,FYC0FTFB,How could I be helped to go through the depres...,Depression,51,now could i be helped to go through the depres...,could help go depression,"[could, help, go, depression]"
...,...,...,...,...,...,...,...
611,BOHSNXCN,What should I do to stop alcoholism?,Alcohol,36,that should i do to stop alcoholism,stop alcoholism,"[stop, alcoholism]"
612,GVDXRQPY,How to become my oldself again,Suicide,30,now to become my oneself again,become oneself,"[become, oneself]"
613,IO4JHIQS,How can someone stop it?,Alcohol,24,now can someone stop it,someone stop,"[someone, stop]"
614,1DS3P1XO,I feel unworthy,Depression,16,i feel unworthy,feel unworthy,"[feel, unworthy]"


In [62]:
train_label = train_df['label']
train_label

0      Depression
1           Drugs
2      Depression
3         Suicide
4      Depression
          ...    
611       Alcohol
612       Suicide
613       Alcohol
614    Depression
615    Depression
Name: label, Length: 616, dtype: object

In [65]:
X_train, X_test, y_train, y_test = train_test_split(cv_df, train_label, test_size=0.2, random_state=42)


In [88]:
label_encoder = LabelEncoder()
y_train_cont = pd.Series(label_encoder.fit_transform(y_train))
y_test_cont = pd.Series(label_encoder.transform(y_test))
y_train_cont.unique()

array([1, 0, 3, 2])
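
`LabelEncoder` assigns integers in sorted order of the class names, so with the four labels in this dataset the mapping should be Alcohol to 0, Depression to 1, Drugs to 2 and Suicide to 3. A small sketch of how to inspect the mapping and decode integer codes back to names:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Depression", "Drugs", "Depression", "Suicide", "Alcohol"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the names in the order of their integer codes
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original names
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```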

# **4. Modelling**

In [78]:
#Models to be tested
models = { 'Model' : ['Baseline Decision Tree', 'Baseline KNN Classifier', 'Baseline Random Forest Classifier',\
                      'Baseline Adaboost Classifier', 'Baseline Gradient Boost', 'baseline XGBoost Classifier',\
                        'XGBoost Classifier-Grid Search', 'Final Model-Random Forest Classifier'],
          'Train Accuracy Score(%)': [0, 0, 0, 0, 0, 0, 0, 0],
          'Test Accuracy Score(%)': [0, 0, 0, 0, 0, 0, 0, 0]}

#Dataframe holding the model names and accuracy score
df_model_results = pd.DataFrame(models, columns=['Model','Train Accuracy Score(%)', 'Test Accuracy Score(%)'])

#Function to fill the dataframe holding model names and accuracy score

def model_results(model_type, y_train, y_train_pred, y_test, y_test_pred):
    """Store the train and test accuracy (in %) for `model_type` and return the results table."""
    index_val = df_model_results[df_model_results['Model'] == model_type].index

    # Scale to a percentage before rounding to avoid values such as 56.99999999999999
    df_model_results.loc[index_val, 'Train Accuracy Score(%)'] = round(accuracy_score(y_train, y_train_pred) * 100, 0)
    df_model_results.loc[index_val, 'Test Accuracy Score(%)'] = round(accuracy_score(y_test, y_test_pred) * 100, 0)

    return df_model_results

In [93]:
baseline_dt = DecisionTreeClassifier(random_state=0)

baseline_dt.fit(X_train, y_train_cont)

y_test_pred = baseline_dt.predict(X_test)
y_train_pred = baseline_dt.predict(X_train)

# Multiclass log loss needs the full (n_samples, n_classes) probability matrix.
# Passing a single column (y_test_pred[:, 1]) raised:
# ValueError: y_true and y_pred contain different number of classes 4, 2.
y_test_proba = baseline_dt.predict_proba(X_test)
loss = log_loss(y_test_cont, y_test_proba, labels=baseline_dt.classes_)
print(f'Test log loss: {loss}')

model_results('Baseline Decision Tree', y_train_cont, y_train_pred, y_test_cont, y_test_pred)

#Observation:
     # Although Suicide has more training data than Drugs, the model performs worse on Suicide than on Drugs
     # There is overfitting
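
For a multiclass target, log loss is the mean negative log of the probability assigned to each sample's true class, so `log_loss` needs the full `(n_samples, n_classes)` probability matrix rather than a single column. A worked sketch with made-up probabilities for three samples and four classes:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 3])
# One row per sample, one column per class (0..3); each row sums to 1
proba = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.2, 0.2, 0.5],
])

# Manual computation: pick the probability of the true class in every row
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

# scikit-learn: pass labels explicitly because class 2 never appears in y_true
sk_loss = log_loss(y_true, proba, labels=[0, 1, 2, 3])

print(round(manual, 4), round(sk_loss, 4))
```

The `labels` argument is what the error message asks for when the true labels and the probability columns do not line up.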

In [80]:
baseline_knn = KNeighborsClassifier()

baseline_knn.fit(X_train, y_train)

y_test_pred = baseline_knn.predict(X_test)
y_train_pred = baseline_knn.predict(X_train)

print('*********************************************************************')
print(confusion_matrix(y_test, y_test_pred))

print('*********************************************************************')
print(classification_report(y_test, y_test_pred))

print('*********************************************************************')
model_results('Baseline KNN Classifier',y_train, y_train_pred, y_test, y_test_pred)

# Observation
 # KNN performs worse than decision tree

*********************************************************************
[[24  4  0  0]
 [ 9 53  0  1]
 [ 7  3  7  0]
 [ 3  5  0  8]]
*********************************************************************
              precision    recall  f1-score   support

     Alcohol       0.56      0.86      0.68        28
  Depression       0.82      0.84      0.83        63
       Drugs       1.00      0.41      0.58        17
     Suicide       0.89      0.50      0.64        16

    accuracy                           0.74       124
   macro avg       0.82      0.65      0.68       124
weighted avg       0.79      0.74      0.74       124

*********************************************************************


Unnamed: 0,Model,Train Accuracy Score(%),Test Accuracy Score(%)
0,Baseline Decision Tree,99.0,81.0
1,Baseline KNN Classifier,87.0,74.0
2,Baseline Random Forest Classifier,0.0,0.0
3,Baseline Adaboost Classifier,0.0,0.0
4,Baseline Gradient Boost,0.0,0.0
5,baseline XGBoost Classifier,0.0,0.0
6,XGBoost Classifier-Grid Search,0.0,0.0
7,Final Model-Random Forest Classifier,0.0,0.0
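
The precision and recall values in the report follow directly from the confusion matrix: recall divides each diagonal entry by its row total (true counts), and precision divides it by its column total (predicted counts). A short check using the KNN matrix printed above:

```python
import numpy as np

# KNN confusion matrix from the output above; rows = true class, columns = predicted class
classes = ["Alcohol", "Depression", "Drugs", "Suicide"]
cm = np.array([
    [24,  4, 0, 0],
    [ 9, 53, 0, 1],
    [ 7,  3, 7, 0],
    [ 3,  5, 0, 8],
])

recall = cm.diagonal() / cm.sum(axis=1)
precision = cm.diagonal() / cm.sum(axis=0)
accuracy = cm.diagonal().sum() / cm.sum()

for name, p, r in zip(classes, precision, recall):
    print(f"{name:<11} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Read this way, the first column of the matrix shows that 19 of the 43 statements predicted as Alcohol belong to other classes, which explains the low Alcohol precision.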


In [81]:
baseline_rf = RandomForestClassifier(random_state=0)

baseline_rf.fit(X_train, y_train)

y_test_pred = baseline_rf.predict(X_test)
y_train_pred = baseline_rf.predict(X_train)

print('*********************************************************************')
print(confusion_matrix(y_test, y_test_pred))

print('*********************************************************************')
print(classification_report(y_test, y_test_pred))

print('*********************************************************************')
model_results('Baseline Random Forest Classifier',y_train, y_train_pred, y_test, y_test_pred)

# Observation 
 # Random Forest performs better than Decision Tree and KNN


*********************************************************************
[[25  2  1  0]
 [ 2 60  0  1]
 [ 2  3 12  0]
 [ 3  6  0  7]]
*********************************************************************
              precision    recall  f1-score   support

     Alcohol       0.78      0.89      0.83        28
  Depression       0.85      0.95      0.90        63
       Drugs       0.92      0.71      0.80        17
     Suicide       0.88      0.44      0.58        16

    accuracy                           0.84       124
   macro avg       0.86      0.75      0.78       124
weighted avg       0.85      0.84      0.83       124

*********************************************************************


Unnamed: 0,Model,Train Accuracy Score(%),Test Accuracy Score(%)
0,Baseline Decision Tree,99.0,81.0
1,Baseline KNN Classifier,87.0,74.0
2,Baseline Random Forest Classifier,99.0,84.0
3,Baseline Adaboost Classifier,0.0,0.0
4,Baseline Gradient Boost,0.0,0.0
5,baseline XGBoost Classifier,0.0,0.0
6,XGBoost Classifier-Grid Search,0.0,0.0
7,Final Model-Random Forest Classifier,0.0,0.0


In [82]:
baseline_adaboost = AdaBoostClassifier(random_state=0)

baseline_adaboost.fit(X_train, y_train)

y_test_pred = baseline_adaboost.predict(X_test)
y_train_pred = baseline_adaboost.predict(X_train)

print('*********************************************************************')
print(confusion_matrix(y_test, y_test_pred))

print('*********************************************************************')
print(classification_report(y_test, y_test_pred))

print('*********************************************************************')
model_results('Baseline Adaboost Classifier',y_train, y_train_pred, y_test, y_test_pred)

# Observations:
 # Adaboost performs worse than decision tree and random forest but better than KNN


*********************************************************************
[[22  3  3  0]
 [ 1 60  0  2]
 [ 1  7  9  0]
 [ 0 10  1  5]]
*********************************************************************
              precision    recall  f1-score   support

     Alcohol       0.92      0.79      0.85        28
  Depression       0.75      0.95      0.84        63
       Drugs       0.69      0.53      0.60        17
     Suicide       0.71      0.31      0.43        16

    accuracy                           0.77       124
   macro avg       0.77      0.65      0.68       124
weighted avg       0.78      0.77      0.76       124

*********************************************************************


Unnamed: 0,Model,Train Accuracy Score(%),Test Accuracy Score(%)
0,Baseline Decision Tree,99.0,81.0
1,Baseline KNN Classifier,87.0,74.0
2,Baseline Random Forest Classifier,99.0,84.0
3,Baseline Adaboost Classifier,85.0,77.0
4,Baseline Gradient Boost,0.0,0.0
5,baseline XGBoost Classifier,0.0,0.0
6,XGBoost Classifier-Grid Search,0.0,0.0
7,Final Model-Random Forest Classifier,0.0,0.0


In [83]:
baseline_gradientboost = GradientBoostingClassifier(random_state=0)

baseline_gradientboost.fit(X_train, y_train)

y_test_pred = baseline_gradientboost.predict(X_test)
y_train_pred = baseline_gradientboost.predict(X_train)

print('*********************************************************************')
print(confusion_matrix(y_test, y_test_pred))

print('*********************************************************************')
print(classification_report(y_test, y_test_pred))

print('*********************************************************************')
model_results('Baseline Gradient Boost',y_train, y_train_pred, y_test, y_test_pred)

# Observations:
 # Gradient Boost matches Random Forest on test accuracy (84%) and performs better than Decision Tree, Adaboost and KNN


*********************************************************************
[[25  2  0  1]
 [ 0 60  0  3]
 [ 2  3 12  0]
 [ 1  8  0  7]]
*********************************************************************
              precision    recall  f1-score   support

     Alcohol       0.89      0.89      0.89        28
  Depression       0.82      0.95      0.88        63
       Drugs       1.00      0.71      0.83        17
     Suicide       0.64      0.44      0.52        16

    accuracy                           0.84       124
   macro avg       0.84      0.75      0.78       124
weighted avg       0.84      0.84      0.83       124

*********************************************************************


Unnamed: 0,Model,Train Accuracy Score(%),Test Accuracy Score(%)
0,Baseline Decision Tree,99.0,81.0
1,Baseline KNN Classifier,87.0,74.0
2,Baseline Random Forest Classifier,99.0,84.0
3,Baseline Adaboost Classifier,85.0,77.0
4,Baseline Gradient Boost,97.0,84.0
5,baseline XGBoost Classifier,0.0,0.0
6,XGBoost Classifier-Grid Search,0.0,0.0
7,Final Model-Random Forest Classifier,0.0,0.0
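
With 124 held-out statements, a single split gives a fairly noisy accuracy estimate (the Suicide class has only 16 test rows). One hedged way to compare the baselines more robustly is stratified k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the vectorized text, so the scores it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (cv_df, train_label): 4 classes, 200 samples
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Each fold keeps roughly the same class proportions as the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.2f} std={scores.std():.2f}")
```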


# **5. Evaluation**

# **Conclusion and recommendations**