## Problem Statement
Take an algorithm and datasets, measure the algorithm's performance, and then modify the algorithm to improve on the baseline accuracy.

## Solution
I took two health-related datasets ("HospitalCosts.csv" and "mental_health.csv") and performed the following steps:
(1) Data Preparation
(2) Data Cleansing
(3) Data Mining (Algorithm)
(4) Model Evaluation
(5) Enhance Accuracy by Modifying the Algorithm

In [2]:
# let's import the necessary libraries
# for data processing
import pandas as pd
# for linear algebra
import numpy as np
# for data visualization
import matplotlib.pyplot as plt

In [83]:
# read the dataset
# the predicted variable is mental_health_consequence; the other columns are features
df = pd.read_csv("MentalHealth.csv")

In [84]:
# show the first 20 rows of dataset
df.head(20)

| | Age | Gender | Country | state | family_history | treatment | work_interfere | remote_work | tech_company | benefits | care_options | seek_help | anonymity | mental_health_consequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | Female | United States | IL | No | Yes | Often | No | Yes | Yes | Not sure | Yes | Yes | No |
| 1 | 44 | M | United States | IN | No | No | Rarely | No | No | Don't know | No | Don't know | Don't know | Maybe |
| 2 | 32 | Male | Canada | NaN | No | No | Rarely | No | Yes | No | No | No | Don't know | No |
| 3 | 31 | Male | United Kingdom | NaN | Yes | Yes | Often | No | Yes | No | Yes | No | No | Yes |
| 4 | 31 | Male | United States | TX | No | No | Never | Yes | Yes | Yes | No | Don't know | Don't know | No |
| 5 | 33 | Male | United States | TN | Yes | No | Sometimes | No | Yes | Yes | Not sure | Don't know | Don't know | No |
| 6 | 35 | Female | United States | MI | Yes | Yes | Sometimes | Yes | Yes | No | No | No | No | Maybe |
| 7 | 39 | M | Canada | NaN | No | No | Never | Yes | Yes | No | Yes | No | Yes | No |
| 8 | 42 | Female | United States | IL | Yes | Yes | Sometimes | No | Yes | Yes | Yes | No | No | Maybe |
| 9 | 23 | Male | Canada | NaN | No | No | Never | No | Yes | Don't know | No | Don't know | Don't know | No |


In [85]:
# let's look at the data information, i.e. column data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 14 columns):
Age                          1259 non-null int64
Gender                       1259 non-null object
Country                      1259 non-null object
state                        744 non-null object
family_history               1259 non-null object
treatment                    1259 non-null object
work_interfere               995 non-null object
remote_work                  1259 non-null object
tech_company                 1259 non-null object
benefits                     1259 non-null object
care_options                 1259 non-null object
seek_help                    1259 non-null object
anonymity                    1259 non-null object
mental_health_consequence    1259 non-null object
dtypes: int64(1), object(13)
memory usage: 137.8+ KB


In [86]:
# show the number of rows and columns in the dataset
df.shape
# 1259 rows and 14 columns

(1259, 14)

In [88]:
# Data Cleansing
# finding the number of null values in the data set
df.isnull().sum()

Age                            0
Gender                         0
Country                        0
state                        515
family_history                 0
treatment                      0
work_interfere               264
remote_work                    0
tech_company                   0
benefits                       0
care_options                   0
seek_help                      0
anonymity                      0
mental_health_consequence      0
dtype: int64

In [89]:
# there are 515 null values in the state column and 264 in the work_interfere column
# let's fill them by carrying the previous row's value forward (forward fill)
df.ffill(inplace=True)
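
Forward fill copies whatever value sits in the previous row, so a row with no state can inherit one from an unrelated respondent. The sketch below uses a made-up toy frame (not the survey file) to compare forward fill with a hypothetical alternative, an explicit "Unknown" placeholder category; which one suits this dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up for illustration).
toy = pd.DataFrame({
    "Country": ["United States", "Canada", "United Kingdom", "United States"],
    "state": ["IL", np.nan, np.nan, "TX"],
})

# Forward fill copies the previous row's value, so Canada and the UK inherit "IL".
forward_filled = toy.ffill()

# Alternative (an assumption, not the notebook's choice): mark missing values
# with an explicit placeholder category.
placeholder_filled = toy.fillna({"state": "Unknown"})

print(forward_filled["state"].tolist())      # ['IL', 'IL', 'IL', 'TX']
print(placeholder_filled["state"].tolist())  # ['IL', 'Unknown', 'Unknown', 'TX']
```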

In [90]:
# now let's check again
df.isnull().sum()

Age                          0
Gender                       0
Country                      0
state                        0
family_history               0
treatment                    0
work_interfere               0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
seek_help                    0
anonymity                    0
mental_health_consequence    0
dtype: int64

In [91]:
# the null values have now been filled

In [92]:
# let's one-hot encode the qualitative (categorical) variables
# to do this, we first have to find the type of every column in the dataset

In [93]:
# temporarily drop the Age column because it is a quantitative variable and we need only the qualitative variables here
age = df.pop('Age') 

In [94]:
label = df.pop('mental_health_consequence')

In [96]:
# change the type of the remaining columns to category
for c in df.columns:
    df[c] = df[c].astype('category')

In [97]:
# you can see that the types have changed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 12 columns):
Gender            1259 non-null category
Country           1259 non-null category
state             1259 non-null category
family_history    1259 non-null category
treatment         1259 non-null category
work_interfere    1259 non-null category
remote_work       1259 non-null category
tech_company      1259 non-null category
benefits          1259 non-null category
care_options      1259 non-null category
seek_help         1259 non-null category
anonymity         1259 non-null category
dtypes: category(12)
memory usage: 20.7 KB


In [98]:
# let's do the one-hot encoding now
df_encoded = pd.get_dummies(df, columns=list(df.columns))
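
To make the encoding step concrete, here is a small sketch on a made-up two-column frame. Each category becomes its own 0/1 column named `<column>_<value>`. The `dtype=int` argument is an addition for this illustration so that the output shows 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Made-up sample with two categorical columns named like the survey's.
toy = pd.DataFrame({
    "treatment": ["Yes", "No", "Yes"],
    "remote_work": ["No", "No", "Yes"],
})

# One indicator column per (column, category) pair.
encoded = pd.get_dummies(toy, columns=list(toy.columns), dtype=int)

print(encoded.columns.tolist())
# ['treatment_No', 'treatment_Yes', 'remote_work_No', 'remote_work_Yes']
print(encoded.iloc[0].tolist())
# [0, 1, 1, 0]
```

A column with many distinct spellings of the same answer produces one indicator column per spelling, so consolidating such values before encoding keeps the feature count down.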

In [99]:
# yes, the data has been encoded.
df_encoded

Unnamed: 0,Gender_A little about you,Gender_Agender,Gender_All,Gender_Androgyne,Gender_Cis Female,Gender_Cis Male,Gender_Cis Man,Gender_Enby,Gender_F,Gender_Femake,...,benefits_Yes,care_options_No,care_options_Not sure,care_options_Yes,seek_help_Don't know,seek_help_No,seek_help_Yes,anonymity_Don't know,anonymity_No,anonymity_Yes
0,0,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,1,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,1,1,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0
1255,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,0,0,1
1256,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,1,0,0
1257,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,0


In [100]:
# Age is the quantitative column that we set aside earlier
# now it is time to combine it with the encoded columns
df_num = pd.DataFrame(age, columns=['Age'])

In [101]:
df_num

| | Age |
|---|---|
| 0 | 37 |
| 1 | 44 |
| 2 | 32 |
| 3 | 31 |
| 4 | 31 |
| ... | ... |
| 1254 | 26 |
| 1255 | 32 |
| 1256 | 34 |
| 1257 | 46 |


In [102]:
# now combine the numerical data, the categorical data, and the label
df = pd.concat([df_num,df_encoded,label],axis=1)

In [106]:
# now encode the output label as integers
# most ML algorithms need numeric inputs; scikit-learn classifiers also accept string labels, but integer codes are convenient
df.mental_health_consequence = df.mental_health_consequence.map({
    'No':0,
    'Yes':1,
    'Maybe':2,
})
# 0 means NO
# 1 means Yes
# 2 means Maybe

In [108]:
df

Unnamed: 0,Age,Gender_A little about you,Gender_Agender,Gender_All,Gender_Androgyne,Gender_Cis Female,Gender_Cis Male,Gender_Cis Man,Gender_Enby,Gender_F,...,care_options_No,care_options_Not sure,care_options_Yes,seek_help_Don't know,seek_help_No,seek_help_Yes,anonymity_Don't know,anonymity_No,anonymity_Yes,mental_health_consequence
0,37,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0
1,44,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,1,0,0,2
2,32,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,0,0
3,31,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,1,0,1
4,31,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,26,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,0,0
1255,32,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,1,0
1256,34,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,1,0,0,1
1257,46,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,1,0,0,1


In [109]:
# now let's apply the data mining algorithm
# we will use KNN: it computes the distance between a sample and the training points,
# then predicts the majority class among the k nearest neighbours
# you can learn more about KNN at the following link
# https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
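
As a rough illustration of the idea, the sketch below implements a KNN classifier by hand on made-up 2-D points: Euclidean distance to every training point, then a majority vote among the k closest. The function name `knn_predict` and the data are hypothetical; the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(int(train_y[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up training points: two near the origin (class 0), three near (1, 1) (class 1).
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
train_y = np.array([0, 0, 1, 1, 1])

pred_near_origin = knn_predict(train_x, train_y, np.array([0.05, 0.1]), k=3)
pred_near_one = knn_predict(train_x, train_y, np.array([1.0, 0.9]), k=3)
print(pred_near_origin, pred_near_one)  # 0 1
```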

In [197]:
from sklearn.neighbors import KNeighborsClassifier

In [111]:
# before applying KNN, we have to divide the dataset into train and test splits; let's do that

In [198]:
from sklearn.model_selection import train_test_split

In [113]:
# separating the input features and output label 
y = df.pop('mental_health_consequence')
x = df

In [167]:
# converting into train and test dataset
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state = 42)
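
To see what `test_size=0.2` means for a dataset of this size, the sketch below splits placeholder arrays with the same number of rows as the survey data (1259). The arrays are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the survey data (1259).
features = np.arange(1259).reshape(-1, 1)
labels = np.zeros(1259, dtype=int)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

print(len(f_train), len(f_test))  # 1007 252
```

With imbalanced classes, passing `stratify=labels` keeps the class proportions similar in both parts; the notebook does not do this, so treat it as an optional variation.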

In [215]:
# now let's apply the KNN algorithm
knn = KNeighborsClassifier()

In [216]:
# training phase 
knn.fit(x_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [217]:
# getting accuracy
knn.score(x_train,y_train)
# the accuracy is 0.61, i.e. 61%; note that this is measured on the training set, not on x_test

0.6137040714995035
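
The score above is computed on `x_train`, the same rows the model was fitted on. The sketch below uses synthetic data from `make_classification` (an assumption, not the survey data) to compare training and held-out accuracy for a few values of k. With k = 1 every training point is its own nearest neighbour, so the training accuracy is perfect by construction and says little about generalization.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the real features and labels.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # (training accuracy, held-out accuracy) for this k
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(k, scores[k])
```

In the notebook's terms, the held-out figure would come from `knn.score(x_test, y_test)`.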

In [171]:
# Let's improve the accuracy
# to improve the accuracy, we should clean the dataset further, so let's do that :)
# we are going to normalize the dataset
# normalization rescales features that are on different scales to a common range
# you can learn more about normalization at the following link
# https://en.wikipedia.org/wiki/Normalization_(statistics)
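
Min-max normalization maps each value to `(value - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below applies that formula to a few made-up ages, then adds one made-up extreme value to show how a single outlier squeezes every other value toward 0. The helper name `min_max_scale` is hypothetical.

```python
import numpy as np


def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


ages = [23, 31, 37, 44]
scaled = min_max_scale(ages)
print(scaled)  # first entry is 0.0, last entry is 1.0

# One made-up extreme value dominates the range and flattens everything else.
ages_with_outlier = ages + [99999999999]
scaled_with_outlier = min_max_scale(ages_with_outlier)
print(scaled_with_outlier)
```

If the scaled column comes out as tiny, nearly identical numbers, checking the column's minimum and maximum for outliers is a reasonable next step.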

In [218]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

In [219]:
np.array(x.Age).reshape(-1,1)

array([[1.76299997e-08],
       [1.76999997e-08],
       [1.75799997e-08],
       ...,
       [1.75999997e-08],
       [1.77199997e-08],
       [1.75099997e-08]])

In [220]:
# apply min-max scaling to normalize the Age column to the range [0, 1]
norm_pipeline = Pipeline([
    ('normalize',MinMaxScaler(feature_range=(0,1))),
])
# fit_transform expects a 2-D input; ravel() flattens the result back into a single column
x['Age'] = norm_pipeline.fit_transform(x[['Age']]).ravel()
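
The cell above fits the scaler on the full dataset before the train/test split. A common alternative, sketched here on synthetic data (an assumption, not the notebook's method), puts the scaler and the classifier in one `Pipeline` so that the scaling parameters are learned from the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data; the first feature is blown up to mimic a column on a larger scale.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("normalize", MinMaxScaler(feature_range=(0, 1))),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X_train, y_train)  # the scaler sees the training rows only

test_accuracy = pipe.score(X_test, y_test)  # test rows are scaled with the training min/max
print(test_accuracy)
```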

In [221]:
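# after normalizing the data
# the Age values all sit near 1.76e-08 instead of spreading over [0, 1], which suggests extreme outliers in Age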
x

Unnamed: 0,Age,Gender_A little about you,Gender_Agender,Gender_All,Gender_Androgyne,Gender_Cis Female,Gender_Cis Male,Gender_Cis Man,Gender_Enby,Gender_F,...,benefits_Yes,care_options_No,care_options_Not sure,care_options_Yes,seek_help_Don't know,seek_help_No,seek_help_Yes,anonymity_Don't know,anonymity_No,anonymity_Yes
0,1.763000e-08,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,1,0,0,1
1,1.770000e-08,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,1,0,0
2,1.758000e-08,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0
3,1.757000e-08,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
4,1.757000e-08,0,0,0,0,0,0,0,0,0,...,1,1,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,1.752000e-08,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0
1255,1.758000e-08,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,0,0,1
1256,1.760000e-08,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,1,0,0
1257,1.772000e-08,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,0


In [222]:
# let's apply the train/test split and KNN again

In [223]:
# converting into train and test dataset
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state = 42)

In [224]:
x_train

Unnamed: 0,Age,Gender_A little about you,Gender_Agender,Gender_All,Gender_Androgyne,Gender_Cis Female,Gender_Cis Male,Gender_Cis Man,Gender_Enby,Gender_F,...,benefits_Yes,care_options_No,care_options_Not sure,care_options_Yes,seek_help_Don't know,seek_help_No,seek_help_Yes,anonymity_Don't know,anonymity_No,anonymity_Yes
243,1.751000e-08,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,1
514,1.764000e-08,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,1,0,0,1
966,1.752000e-08,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0
199,1.754000e-08,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0
270,1.756000e-08,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1044,1.755000e-08,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,1,0,0
1095,1.762000e-08,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0
1130,1.768000e-08,0,0,0,0,0,0,0,0,1,...,1,0,0,1,0,1,0,1,0,0
860,1.758000e-08,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0


In [226]:
# now let's apply the KNN algorithm again
# this time with K = 3
knn = KNeighborsClassifier(n_neighbors=3)

In [227]:
# training phase 
knn.fit(x_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [228]:
# getting accuracy
knn.score(x_train,y_train)
# the accuracy is 0.67, i.e. 67%, again measured on the training set

0.6722939424031777
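
Both scores in this notebook come from `knn.score(x_train, y_train)`, and training accuracy for KNN tends to rise as k shrinks. One way to choose k on firmer ground is cross-validation. The sketch below runs `GridSearchCV` over a few candidate values on synthetic data; the data and the candidate list are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class data standing in for the encoded survey features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

pipe = Pipeline([
    ("normalize", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

candidate_k = [1, 3, 5, 7, 9, 11]
search = GridSearchCV(
    pipe, {"knn__n_neighbors": candidate_k}, cv=5, scoring="accuracy"
)
search.fit(X, y)

# Mean accuracy over the 5 held-out folds for the best-performing k.
print(search.best_params_, round(search.best_score_, 3))
```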

## WOOOOOO! Yes, we have enhanced the accuracy.
### It was 61% at first and is now 67% (both measured on the training set).
#### We improved the algorithm by (1) normalizing the dataset and (2) changing the number of neighbours from 5 to 3.