# Introduction

In this notebook we will analyze the possibility to run a machine learning algorithm for prediction of depression. The dataset can be found at http://zindi.africa/competitions/busara-mental-health-prediction-challenge. 

In [1]:
from google.colab import drive 
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive


In [2]:
#import libraries 
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import Imputer

from sklearn.svm import SVC

import xgboost as XGB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')





In [0]:
#import data 
df =  pd.read_csv('gdrive/My Drive/Colab Notebooks/Busura/train.csv')

We want to make sure that the data is in a suitable datatype. For example age should not be string. 

In [4]:
df.dtypes


surveyid                     int64
village                      int64
survey_date                 object
femaleres                    int64
age                        float64
married                      int64
children                     int64
hhsize                       int64
edu                          int64
hh_children                  int64
hh_totalmembers            float64
cons_nondurable            float64
asset_livestock            float64
asset_durable              float64
asset_phone                float64
asset_savings              float64
asset_land_owned_total     float64
asset_niceroof               int64
cons_allfood               float64
cons_ownfood               float64
cons_alcohol               float64
cons_tobacco               float64
cons_med_total             float64
cons_med_children          float64
cons_ed                    float64
cons_social                float64
cons_other                 float64
ent_wagelabor                int64
ent_ownfarm         

As I will, for simpliity, only use the training data provided by Busura I will remove the surveyid. On their website you can upload the file and get the results for the performance of your algorithm. However, the testing dataset does not provide the labels for the target variables as does are supposed to be predicted and the evaluated. 

In [0]:
df.drop(columns=['surveyid'],inplace = True)

Encode the survey dates.

In [0]:
le = LabelEncoder()
df['survey_date'] = df['survey_date'].apply(lambda x: str(x))
le.fit(df['survey_date'])
df['survey_date'] = le.transform(df['survey_date'])


Determine which columns have NaN values.

In [0]:
colNull = df.isnull().sum()
colNull = [keys for keys, values in colNull.items() if values > 0]

Using different interpolation methods to 

In [0]:
interpolation_methods = ['linear','slinear', 'quadratic', 'cubic', 'polynomial', 'spline', 'piecewise_polynomial']

In [0]:
df_linear = df
df_slinear = df
df_quad = df
df_cubic = df
df_poly = df
df_spline = df
df_piecewise = df

df_interpolation_methods = [df_linear,df_slinear,df_quad,df_cubic,df_poly,df_spline,df_piecewise]


df_interpolation_methods = {
		"linear" : df,
		"slinear" : df,
		"quadratic" :df,
		"cubic" : df,
    "piecewise_polynomial": df
	}


In [0]:
#function for interpolation
def interpolation(df1, method_used):
  for i in colNull:
    df1[i] = df1[i].interpolate(method = method_used)
  df1.dropna(inplace=True)
  df1.reset_index(drop=True, inplace=True)
  return df1 

We can also try to instead impute the data by using mean and median umputation methods. 

In [0]:
#Impute the data set. We try 3 different methods: mean, median, and knn.

def perform_mean_imputation(df):
    fill_NaN = Imputer(missing_values=np.nan, strategy='mean', axis=1)
    imputed_DF = pd.DataFrame(fill_NaN.fit_transform(df))
    imputed_DF.columns = df.columns
    imputed_DF.index = df.index
    return imputed_DF
  
def perform_median_imputation(df):
    fill_NaN = Imputer(missing_values=np.nan, strategy='median', axis=1)
    imputed_DF = pd.DataFrame(fill_NaN.fit_transform(df))
    imputed_DF.columns = df.columns
    imputed_DF.index = df.index 
    return imputed_DF



Begginng of the machine learning algorithm section. Now we split the dataset for training and testing. 

In [0]:
def split_dataset(features, target):  
  X_train, X_test, y_train, y_test = train_test_split( features, target, test_size=0.4, random_state=0)
  return X_train, X_test, y_train, y_test

Check if the dataset is balanced. 

In [13]:
df['depressed'].value_counts()

0    950
1    193
Name: depressed, dtype: int64

As the dataset is not balanced, we want to balance it to limit bias. We will use oversampling done by SMOTE as undersampling would result in a small dataset. 

In [0]:
def apply_smote(X_train,y_train):
  sm = SMOTE(random_state=2)
  X_train, y_train = sm.fit_sample(X_train, y_train.ravel())
  return X_train, y_train

Perform interpolation for all methods as well as imputation by mean and median. 

In [0]:
for key,value in df_interpolation_methods.items():
  df_interpolation_methods[key] = interpolation(value, key)
  
df_interpolation_methods['Mean Imputed'] = perform_mean_imputation(df)
df_interpolation_methods['Median Imputed'] = perform_median_imputation(df)

For each dataframe in the dictionary we will create a test and training sample. 

In [0]:
for key,value in df_interpolation_methods.items():
  y = value['depressed']
  X = value.drop(labels=['depressed'], axis=1)
  X_train, X_test, y_train, y_test = split_dataset(X, y)
  X_train, y_train = apply_smote(X_train,y_train)
  df_interpolation_methods[key] = [X_train, X_test, y_train, y_test]

  

Perform Random Forest and GradientBoosting for all dataframes in the dict.

In [0]:
def ML_models(X_train, X_test, y_train, y_test):
  model = RandomForestClassifier(random_state=3, n_estimators=20)
  model.fit(X_train, y_train)
  rf_pred = model.predict(X_test)
  d = {'RandomForest':rf_pred}
  
  model = GradientBoostingClassifier(n_estimators=90, max_depth=3, random_state=8) 
  model.fit(X_train,y_train)
  gb_pred = model.predict(X_test)
  d['GradientBoosting'] = gb_pred
  
  return d 
  


Evaluation of the methods by confusion matrix.

In [47]:
for key,value in df_interpolation_methods.items():
  print('------------------------')
  print(key)
  [X_train, X_test, y_train, y_test] = df_interpolation_methods[key]
  d = ML_models(X_train, X_test, y_train, y_test)
  for key2, pred_value in d.items():
    cm = confusion_matrix(y_test, pred_value)
    print('----------')
    print(key2)
    print('True positive = ', cm[0][0])
    print('False positive = ', cm[0][1])
    print('False negative = ', cm[1][0])
    print('True negative = ', cm[1][1])

------------------------
linear
----------
RandomForest
True positive =  352
False positive =  19
False negative =  72
True negative =  6
----------
GradientBoosting
True positive =  351
False positive =  20
False negative =  75
True negative =  3
------------------------
slinear
----------
RandomForest
True positive =  352
False positive =  19
False negative =  72
True negative =  6
----------
GradientBoosting
True positive =  351
False positive =  20
False negative =  75
True negative =  3
------------------------
quadratic
----------
RandomForest
True positive =  352
False positive =  19
False negative =  72
True negative =  6
----------
GradientBoosting
True positive =  351
False positive =  20
False negative =  75
True negative =  3
------------------------
cubic
----------
RandomForest
True positive =  352
False positive =  19
False negative =  72
True negative =  6
----------
GradientBoosting
True positive =  351
False positive =  20
False negative =  75
True negative =  3
-----

The work could be improved by altering and trying other machine learnign algorithms. But right not linear interpolation and RandomForest seems to work the best.