## Mental Health Detection using KNN 

[Dataset](https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey)

[Google Colab Notebook](https://colab.research.google.com/drive/1GklIS35u5_EcfCAsmByh-d_0PXoxEyXQ)

[GitHub](https://github.com/z5208980/machine-learning-health/tree/main/mental_health)


Mental health concerns the wellness of the mind and can affect a person both emotionally and physically. In Australia, mental illness affects one in five individuals, and it is especially prevalent in the tech industry, where rapid development creates heavy demands. These pressures and social responsibilities can erode a person's self-esteem and overall satisfaction with life, and can ultimately lead to suicide.


The dataset can help us understand whether an individual will require mental health care, based on attributes such as their gender, age group, and history of mental illness. It comes from a 2014 survey of individuals' perspectives on mental health in the tech workplace.

Here the column `treatment` is the *target*: it indicates whether the individual has sought treatment for a mental health condition. We will use the model to predict individuals' attitudes toward mental health, namely whether they have considered treating a mental health condition. The features include:

- Age: Age of the individual in years
- Gender: Gender of the individual
- family_history: Does the individual have a family history of mental illness? (Yes or No)
- work_interfere: Does the individual's mental health condition interfere with their work?
- tech_company: Is the individual's employer primarily a tech company?
- seek_help: Does the individual's employer provide resources to learn more about mental health issues and how to seek help?
- leave: How easy is it for the individual to take medical leave for a mental health condition?
- mental_health_interview: Is the individual willing to bring up a mental health issue in an interview?
- phys_health_interview: Is the individual willing to bring up a physical health issue in an interview?

In [1]:
import numpy as np
import pandas as pd
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import binarize, LabelEncoder, MinMaxScaler, StandardScaler
from sklearn import metrics
from sklearn.metrics import accuracy_score, mean_squared_error, precision_recall_curve
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load and inspect the data

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/z5208980/machine-learning-health/main/mental_health/data/raw.csv')

print(f"There are {df.shape[0]} rows and {df.shape[1]} columns, including the target")

# Preview the first five rows
df.head(5)

There are 1259 rows and 27 columns, including the target


| | Timestamp | Age | Gender | Country | state | self_employed | family_history | treatment | work_interfere | no_employees | ... | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37 | Female | United States | IL | NaN | No | Yes | Often | 6-25 | ... | Somewhat easy | No | No | Some of them | Yes | No | Maybe | Yes | No | NaN |
| 1 | 2014-08-27 11:29:37 | 44 | M | United States | IN | NaN | No | No | Rarely | More than 1000 | ... | Don't know | Maybe | No | No | No | No | No | Don't know | No | NaN |
| 2 | 2014-08-27 11:29:44 | 32 | Male | Canada | NaN | NaN | No | No | Rarely | 6-25 | ... | Somewhat difficult | No | No | Yes | Yes | Yes | Yes | No | No | NaN |
| 3 | 2014-08-27 11:29:46 | 31 | Male | United Kingdom | NaN | NaN | Yes | Yes | Often | 26-100 | ... | Somewhat difficult | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes | NaN |
| 4 | 2014-08-27 11:30:22 | 31 | Male | United States | TX | NaN | No | No | Never | 100-500 | ... | Don't know | No | No | Some of them | Yes | Yes | Yes | Don't know | No | NaN |


In [3]:
# Count the null values in each column
df.isnull().sum()

Timestamp                       0
Age                             0
Gender                          0
Country                         0
state                         515
self_employed                  18
family_history                  0
treatment                       0
work_interfere                264
no_employees                    0
remote_work                     0
tech_company                    0
benefits                        0
care_options                    0
wellness_program                0
seek_help                       0
anonymity                       0
leave                           0
mental_health_consequence       0
phys_health_consequence         0
coworkers                       0
supervisor                      0
mental_health_interview         0
phys_health_interview           0
mental_vs_physical              0
obs_consequence                 0
comments                     1095
dtype: int64
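
The counts above show that `work_interfere`, which is later used as a feature, has 264 missing values, and `self_employed` has 18. The notebook does not fill these in. One possible treatment, sketched here on a small made-up frame, is to give missing answers their own category before encoding. The placeholder label `'Unknown'` and the choice of the most common value for `self_employed` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    'work_interfere': ['Often', np.nan, 'Rarely', np.nan],
    'self_employed': ['No', 'Yes', np.nan, 'No'],
})

# Treat a missing answer as its own category (the label is an assumption)
toy['work_interfere'] = toy['work_interfere'].fillna('Unknown')

# Fill the few missing self_employed answers with the most common value
most_common = toy['self_employed'].mode()[0]
toy['self_employed'] = toy['self_employed'].fillna(most_common)

print(toy.isnull().sum())
```

Whether a separate category or an imputed value is the better choice depends on why the answer is missing, so this is a starting point rather than a recommendation.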

In [4]:
# Process the data

# Remove unnecessary features
df.drop(columns=['state', 'comments', 'Timestamp'], inplace=True)

# Remove age outliers (negative ages and ages above 100)
df.drop(df[df['Age'] < 0].index, inplace=True)
df.drop(df[df['Age'] > 100].index, inplace=True)

# Scale the Age column to the range [0, 1]. Note: this improved the KNN model's accuracy from ~89% to ~95%
scaler = MinMaxScaler()
df['Age'] = scaler.fit_transform(df[['Age']])

np.sort(df.Age.unique())

array([0.        , 0.04477612, 0.08955224, 0.19402985, 0.20895522,
       0.2238806 , 0.23880597, 0.25373134, 0.26865672, 0.28358209,
       0.29850746, 0.31343284, 0.32835821, 0.34328358, 0.35820896,
       0.37313433, 0.3880597 , 0.40298507, 0.41791045, 0.43283582,
       0.44776119, 0.46268657, 0.47761194, 0.49253731, 0.50746269,
       0.52238806, 0.53731343, 0.55223881, 0.56716418, 0.58208955,
       0.59701493, 0.6119403 , 0.62686567, 0.64179104, 0.65671642,
       0.67164179, 0.68656716, 0.71641791, 0.73134328, 0.74626866,
       0.76119403, 0.7761194 , 0.79104478, 0.82089552, 0.8358209 ,
       0.85074627, 0.89552239, 1.        ])
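
`MinMaxScaler` maps each value to `(x - min) / (max - min)`, so the smallest age in the column becomes 0 and the largest becomes 1, as the array above shows. A minimal sketch on made-up ages, checking the formula by hand against scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[18.0], [25.0], [40.0], [60.0]])  # made-up ages

# scikit-learn's result
scaled = MinMaxScaler().fit_transform(ages)

# The same transformation written out by hand
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(np.allclose(scaled, manual))
```

Distance-based models such as KNN are sensitive to feature ranges, which is one plausible reason the scaling helped.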

In [5]:
# Normalise the free-text Gender values to one of: male, female, other
df['Gender'] = df['Gender'].str.lower()

m = ['male','m','make','male-ish', 'maile', 'cis male','mal','male (cis)','guy (-ish) ^_^','male ','man','msle','mail','malr','cis man']
f = ['female','f','woman','cis female','femake','female ', 'cis-female/femme','female (cis)','femail']
o = ['other', 'trans-female','queer/she/they','non-binary','nah', 'all', 'enby', 'fluid', 'genderqueer','androgyne','agender', 'male leaning androgynous','trans woman','neuter','something kinda male?', 'female (trans)', 'queer', 'a little about you','p','ostensibly male, unsure what that really means']

# Assign through df.loc so the values are written to df itself, not to a temporary copy
df.loc[df.Gender.isin(m), 'Gender'] = 'male'
df.loc[df.Gender.isin(f), 'Gender'] = 'female'
df.loc[df.Gender.isin(o), 'Gender'] = 'other'

df['Gender'].value_counts()

male      989
female    247
other      18
Name: Gender, dtype: int64
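
The lists above include entries such as `'male '` and `'female '` with a trailing space, because the values are lower-cased but not trimmed. An alternative, sketched below on made-up answers, is to strip whitespace first and map through a dictionary. The `lookup` table is hypothetical and far shorter than a real one would be, and sending every unrecognised value to `'other'` is an assumption rather than what the notebook does.

```python
import pandas as pd

# Made-up raw answers in the style of the survey's free-text Gender field
raw = pd.Series(['Male ', 'F', 'cis male', 'Woman', 'non-binary', 'M'])

# Hypothetical lookup table; a real one would list every spelling seen in the data
lookup = {'male': 'male', 'm': 'male', 'cis male': 'male',
          'female': 'female', 'f': 'female', 'woman': 'female'}

# Trim whitespace, lower-case, then map; anything unrecognised becomes 'other'
cleaned = raw.str.strip().str.lower().map(lookup).fillna('other')

print(cleaned.value_counts())
```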

In [6]:
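# Categorical text columns that need integer encoding
encoding_cols = ['Gender', 'family_history', 'treatment', 'work_interfere', 'tech_company', 'seek_help', 'leave', 'mental_health_interview', 'phys_health_interview']
feature_cols = ['Age'] + encoding_cols

# Note: 'treatment' is in encoding_cols, so it is a column of X as well as the target y
X = df[feature_cols].copy()  # copy so the encoded values are not written into a slice of df
y = df.treatment

# Encode each categorical column as integer labels
encoder = LabelEncoder()
for col in encoding_cols:
    X[col] = encoder.fit_transform(X[col])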
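
Because `treatment` appears in `encoding_cols`, it is also one of the columns of `X`, so a model trained on `X` can read the answer straight from its inputs. A minimal sketch of keeping the target out of the feature matrix follows, on a made-up frame with a reduced set of columns. The names `target_col` and `categorical_cols` are illustrative and do not come from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows with a subset of the survey columns
toy = pd.DataFrame({
    'Age': [0.2, 0.4, 0.6, 0.8],
    'family_history': ['No', 'Yes', 'No', 'Yes'],
    'leave': ["Don't know", 'Somewhat easy', 'Very easy', "Don't know"],
    'treatment': ['No', 'Yes', 'No', 'Yes'],
})

target_col = 'treatment'
categorical_cols = ['family_history', 'leave']  # target deliberately left out
feature_cols = ['Age'] + categorical_cols

X = toy[feature_cols].copy()
for col in categorical_cols:
    # LabelEncoder assigns integers in sorted order of the distinct values
    X[col] = LabelEncoder().fit_transform(X[col])

# Encode the target separately: 'No' -> 0, 'Yes' -> 1
y = LabelEncoder().fit_transform(toy[target_col])

print(X.columns.tolist(), y)
```

With the target excluded, the reported accuracy reflects what the model learns from the other answers only.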


In [7]:
# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=200)

# Standardise the features: fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
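
`cross_val_score` is imported at the top of the notebook but never used. As a sketch of how scaling and the classifier could be evaluated together over several folds, the two can be wrapped in a scikit-learn pipeline so that the scaler is never fitted on held-out rows. The synthetic data from `make_classification` is an assumption standing in for the encoded survey features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded survey features
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

# The scaler is re-fitted on the training part of every fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')

print(scores.mean())
```

Averaging over folds gives a steadier estimate than a single 75/25 split.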

The chosen model is **LogisticRegression**, which yields 92% accuracy in both training and testing. No parameters need to be set; the defaults are used.

## Using the model



In [8]:
def fit_and_report(model):
  # Fit the model on the training split and print its accuracy on the test split
  model.fit(X_train, y_train)
  y_pred_class = model.predict(X_test)

  print('RESULT')
  print('Accuracy:', metrics.accuracy_score(y_test, y_pred_class))

  return model

def LR():
  return fit_and_report(LogisticRegression())

def RFC():
  return fit_and_report(RandomForestClassifier())

def GBC():
  return fit_and_report(GradientBoostingClassifier())

model = LR()
# model = RFC()
# model = GBC()

# Save the fitted model to disk
filename = '/content/sample_data/model.sav'
with open(filename, 'wb') as file:
  pickle.dump(model, file)

RESULT
Accuracy: 1.0
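
Accuracy alone does not show how the errors split between the two classes. A sketch of a fuller report using `confusion_matrix` and `classification_report` follows. It runs on synthetic data rather than the survey, so its numbers say nothing about the model above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded survey features
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=200)

clf = LogisticRegression().fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred))
```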


In [None]:
# Load the saved model
with open('/content/sample_data/model.sav', 'rb') as file:
  model = pickle.load(file)

# Predict a single (already scaled) row from the training set
row = 456
sample = [list(X_train[row])]
output = model.predict(sample)

print("X=%s, Predicted=%s, Actually=%s" % (sample[0], output[0], y_train.iloc[row]))

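
The saved model expects inputs that were already encoded and standardised, so a new respondent's answers would need to pass through the same fitted `StandardScaler` first. One option is to pickle a pipeline so the scaler travels with the model. The sketch below uses synthetic data and a temporary file in place of `/content/sample_data/model.sav`, and the file name `pipeline.sav` is hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded survey features
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

# Scaler and classifier are fitted and saved as one object
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), 'pipeline.sav')  # hypothetical file name
with open(path, 'wb') as file:
    pickle.dump(pipeline, file)

with open(path, 'rb') as file:
    restored = pickle.load(file)

# The restored pipeline scales the unscaled row itself before predicting
sample = X_demo[:1]
print(restored.predict(sample), pipeline.predict(sample))
```

The label encoders would need to be saved in the same way if predictions are to be made from raw survey answers.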
