Ranked 86th among 2053 participants by Analytics Vidhya

# McKinsey-Analytics - Stroke Probability Prediction
## -Ajaj Ahmed(14th April'18)

### Problem Statement
Your Client, a chain of hospitals aiming to create the next generation of healthcare for its patients, has retained McKinsey to help achieve its vision. The company brings the best doctors and enables them to provide proactive health care for its patients. One such investment is a Center of Data Science Excellence.

In this case, your client wants a study of one of the critical diseases, stroke. Stroke is a disease that affects the arteries leading to and within the brain. A stroke occurs when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot or bursts (ruptures). When that happens, part of the brain cannot get the blood (and oxygen) it needs, so brain cells die.

Over the last few years, the Client has captured several health, demographic and lifestyle details about its patients. This includes details such as age and gender, along with several health parameters (e.g. hypertension, body mass index) and lifestyle-related variables (e.g. smoking status, occupation type).

The Client wants you to predict the probability of stroke happening to their patients. This will help doctors take proactive health measures for these patients.

## Evaluation Metric
I have used two ways to calculate the score:
1) normalized Gini index
2) ROC-AUC
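
For a binary target the two metrics are closely related: the normalized Gini index equals 2 × AUC − 1, so they rank models the same way. The sketch below checks this on a handful of made-up labels and scores (the arrays and function names are illustrative, not from the competition), using a rank-based AUC written in plain NumPy and assuming no tied scores.

```python
import numpy as np

def auc_from_ranks(y_true, scores):
    """ROC-AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort() + 1  # 1-based ranks, ascending by score
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def normalized_gini(y_true, scores):
    """Normalized Gini for a binary target, expressed as 2 * AUC - 1."""
    return 2 * auc_from_ranks(y_true, scores) - 1

# Made-up example: two positives among six rows.
y_demo = np.array([0, 0, 1, 0, 1, 0])
s_demo = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.05])

auc_demo = auc_from_ranks(y_demo, s_demo)
gini_demo = normalized_gini(y_demo, s_demo)
print(auc_demo, gini_demo)
```

A model that ranks at random has an AUC of 0.5 and therefore a normalized Gini of 0.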

## Public and Private Split
Test data is further randomly divided into Public (30%) and Private (70%) data.

Your initial responses will be checked and scored on the Public data.
The final rankings would be based on your private score which will be published once the competition is over.

# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Loading Data and Its Description

In [2]:
train = pd.read_csv('D://Analytics Vidya//McKinsey Challenge//train.csv')
test = pd.read_csv('D://Analytics Vidya//McKinsey Challenge//test.csv')

In [3]:
train.head(5)

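| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30669 | Male | 3.0 | 0 | 0 | No | children | Rural | 95.12 | 18.0 | NaN | 0 |
| 1 | 30468 | Male | 58.0 | 1 | 0 | Yes | Private | Urban | 87.96 | 39.2 | never smoked | 0 |
| 2 | 16523 | Female | 8.0 | 0 | 0 | No | Private | Urban | 110.89 | 17.6 | NaN | 0 |
| 3 | 56543 | Female | 70.0 | 0 | 0 | Yes | Private | Rural | 69.04 | 35.9 | formerly smoked | 0 |
| 4 | 46136 | Male | 14.0 | 0 | 0 | No | Never_worked | Rural | 161.28 | 19.1 | NaN | 0 |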


In [4]:
train.describe()

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|---|
| count | 43400.0 | 43400.0 | 43400.0 | 43400.0 | 43400.0 | 41938.0 | 43400.0 |
| mean | 36326.14235 | 42.217894 | 0.093571 | 0.047512 | 104.48275 | 28.605038 | 0.018041 |
| std | 21072.134879 | 22.519649 | 0.291235 | 0.212733 | 43.111751 | 7.77002 | 0.133103 |
| min | 1.0 | 0.08 | 0.0 | 0.0 | 55.0 | 10.1 | 0.0 |
| 25% | 18038.5 | 24.0 | 0.0 | 0.0 | 77.54 | 23.2 | 0.0 |
| 50% | 36351.5 | 44.0 | 0.0 | 0.0 | 91.58 | 27.7 | 0.0 |
| 75% | 54514.25 | 60.0 | 0.0 | 0.0 | 112.07 | 32.9 | 0.0 |
| max | 72943.0 | 82.0 | 1.0 | 1.0 | 291.05 | 97.6 | 1.0 |


In [5]:
# collect the ids of the test data
test_id = test['id'].to_frame()

# Missing Values

In [6]:
print("Missing values in train data")
for col in train.columns:
    print('No. of null values in ' + col + ': '+
         str(train[pd.isnull(train[col])].shape[0]))


Missing values in train data
No. of null values in id: 0
No. of null values in gender: 0
No. of null values in age: 0
No. of null values in hypertension: 0
No. of null values in heart_disease: 0
No. of null values in ever_married: 0
No. of null values in work_type: 0
No. of null values in Residence_type: 0
No. of null values in avg_glucose_level: 0
No. of null values in bmi: 1462
No. of null values in smoking_status: 13292
No. of null values in stroke: 0


In [7]:
print('Missing values in test data:')
for col in test.columns:
    print('No. of null values in ' + col + ': '+
         str(test[pd.isnull(test[col])].shape[0]))

Missing values in test data:
No. of null values in id: 0
No. of null values in gender: 0
No. of null values in age: 0
No. of null values in hypertension: 0
No. of null values in heart_disease: 0
No. of null values in ever_married: 0
No. of null values in work_type: 0
No. of null values in Residence_type: 0
No. of null values in avg_glucose_level: 0
No. of null values in bmi: 591
No. of null values in smoking_status: 5751


In [8]:
print("Missing ratio in train data")
per_mis = (train.isnull().sum()/len(train))*100
ratio_mis = per_mis.sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :ratio_mis})
missing_data

Missing ratio in train data


| | Missing Ratio |
|---|---|
| smoking_status | 30.626728 |
| bmi | 3.368664 |
| stroke | 0.0 |
| avg_glucose_level | 0.0 |
| Residence_type | 0.0 |
| work_type | 0.0 |
| ever_married | 0.0 |
| heart_disease | 0.0 |
| hypertension | 0.0 |
| age | 0.0 |


In [9]:
print("Missing ratio in test data")
per_mis = (test.isnull().sum()/len(test))*100
ratio_mis = per_mis.sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :ratio_mis})
missing_data

Missing ratio in test data


| | Missing Ratio |
|---|---|
| smoking_status | 30.917693 |
| bmi | 3.177249 |
| avg_glucose_level | 0.0 |
| Residence_type | 0.0 |
| work_type | 0.0 |
| ever_married | 0.0 |
| heart_disease | 0.0 |
| hypertension | 0.0 |
| age | 0.0 |
| gender | 0.0 |


# Fill Missing Values

##### In the 'bmi' column I have filled missing values with the column's median

In [10]:
# Median is computed on the training data only and reused for the test data.
bmi_median = train['bmi'].median()
train['bmi'] = train['bmi'].fillna(bmi_median)
test['bmi'] = test['bmi'].fillna(bmi_median)

##### In 'smoking_status' I have replaced missing values with 'never_smoked', because most of the missing values belong to children with age < 10, who don't smoke

In [11]:
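# Note: 'never_smoked' differs from the existing 'never smoked' label,
# so the filled rows end up in their own dummy column later on.
train['smoking_status'] = train['smoking_status'].fillna('never_smoked')
# Each frame is filled from its own column.
test['smoking_status'] = test['smoking_status'].fillna('never_smoked')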
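
The assumption behind this fill (that missing smoking status mostly belongs to young children) can be checked before imputing. A minimal sketch on a made-up frame follows; the `demo` data and the cutoff of 10 years are illustrative, and on the real data the same two lines would be run against `train` before the fill.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real training data.
demo = pd.DataFrame({
    "age": [3, 8, 14, 45, 70, 6],
    "smoking_status": [np.nan, np.nan, np.nan,
                       "never smoked", "formerly smoked", np.nan],
})

# Among rows with a missing smoking status, what share is younger than 10?
missing = demo["smoking_status"].isna()
share_under_10 = (demo.loc[missing, "age"] < 10).mean()
print(share_under_10)
```

If the share turns out to be low on the real data, a separate 'unknown' category may be a safer fill than a smoking label.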

## Dummies of the categorical variables

In [12]:
train = pd.get_dummies(train)
test = pd.get_dummies(test)
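
Encoding train and test separately can produce different columns when a category appears in only one of the two files. One way to guard against that, sketched here on made-up frames, is to reindex the encoded test frame to the training columns. The frame names are hypothetical; in this notebook the training columns would first have to exclude 'stroke', which the test data does not have.

```python
import pandas as pd

# Made-up frames: the 'Other' category appears only in the training data.
train_demo = pd.DataFrame({"age": [30, 52, 41],
                           "gender": ["Male", "Female", "Other"]})
test_demo = pd.DataFrame({"age": [25, 67],
                          "gender": ["Female", "Male"]})

train_enc = pd.get_dummies(train_demo)
test_enc = pd.get_dummies(test_demo)

# Give the test frame exactly the training columns, in the same order;
# dummy columns that are absent from test are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```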

Now to the final part. I have scored the models in two ways: first with a normalized Gini function, and second with roc_auc_score.

### Defined a gini_normalized function

In [13]:
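def gini(actual, pred, cmpcol=0, sortcol=1):
    # Unnormalized Gini coefficient of `pred` as a ranking of `actual`.
    assert len(actual) == len(pred)
    # Columns: actual, prediction, original position (used as a tie-breaker).
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by prediction, descending; ties keep their original order.
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    totalLosses = data[:, 0].sum()
    giniSum = data[:, 0].cumsum().sum() / totalLosses
    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)

def gini_normalized(mod, X_test, y_true):
    # Normalized Gini computed from hard class predictions.
    a = y_true
    p = mod.predict(X_test)
    return gini(a, p) / gini(a, a)

def gini_normalized_proba(mod, X_test, y_true):
    # Normalized Gini computed from the predicted probability of class 1.
    a = y_true
    p = mod.predict_proba(X_test)
    return gini(a, p[:, 1]) / gini(a, a)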

Split the data into training and validation sets, defining X and y along the way.

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.iloc[:,train.columns!='stroke'],
                                                    train['stroke'], test_size=0.33, random_state=42)
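
Only about 1.8% of the training rows have stroke = 1 (the mean of 'stroke' in the describe output), so a purely random split can leave the two parts with slightly different positive rates. The `stratify` argument of `train_test_split` keeps the rate roughly equal on both sides. A minimal sketch on synthetic labels follows; the arrays are made up and only the split call mirrors the one above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows with 2% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Both parts keep a positive rate close to the overall 2%.
print(y_tr.mean(), y_val.mean())
```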

### Importing Dependencies

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC                
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import roc_auc_score

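Here I have used 4 different models and calculated the Gini score for each. The scores come from gini_normalized, which scores the hard class labels returned by predict rather than predicted probabilities. Each output line shows the training score first, then the validation score.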

In [24]:
classifers = [LogisticRegressionCV(),RandomForestClassifier(),
              GradientBoostingClassifier(),DecisionTreeClassifier()]

In [25]:
print('starting training...')
rows = []
for clf in classifers:
    clf.fit(X_train, y_train)
    test_score = gini_normalized(clf, X_test, y_test)
    train_score = gini_normalized(clf, X_train, y_train)
    print(train_score, test_score)  # training score first, then validation score
    rows.append({'test_score': test_score, 'train_score': train_score})
# Collect the scores in one DataFrame, one row per classifier.
df = pd.DataFrame(rows, columns=['test_score', 'train_score'])


starting training...
0.0054113206195254325 0.00674104008196793
0.7613710198784204 0.007570312989233045
0.03201725173958567 0.012275743900700482
1.0 0.05030025168645919


### ROC_AUC's Turn

I have used the same 4 models and calculated accuracy and ROC-AUC for each. Here roc_auc_score receives the hard class labels from predict.

In [26]:
for clf in classifers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(accuracy_score(y_test, y_pred),roc_auc_score(y_test, y_pred) )

0.9813573523250942 0.5
0.9811478843736908 0.49989327641408754
0.9808685937718196 0.50158806294693
0.9577572964669739 0.5192062483261984
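
The ROC-AUC values above sit near 0.5 because `roc_auc_score` is a ranking metric: hard labels from `predict` collapse every row to 0 or 1, and with so few positives almost every row is predicted 0. Passing the class-1 probability from `predict_proba` usually gives a more informative score. The sketch below compares the two on synthetic imbalanced data; `make_classification` and the plain `LogisticRegression` are stand-ins chosen for illustration, not the competition data or the models used here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 5% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_clusters_per_class=1,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels versus AUC from class-1 probabilities.
auc_labels = roc_auc_score(y_val, model.predict(X_val))
auc_proba = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(auc_labels, auc_proba)
```

The same change applied to the loop above would mean passing `clf.predict_proba(X_test)[:, 1]` as the second argument of `roc_auc_score`.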


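The decision tree has the highest validation scores, but its training Gini of 1.0 shows that it has memorized the training data. GBM has the next-best validation scores in both cases with far less overfitting, so we will use GBM to predict 'stroke' (the label) here.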

In [19]:
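clf = GradientBoostingClassifier().fit(X_train, y_train)
# Use the training feature columns in the same order; any dummy column absent from test is set to 0.
test_features = test.reindex(columns=X_train.columns, fill_value=0)
test['stroke'] = clf.predict_proba(test_features)[:, 1]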

In [20]:
# Write the id and the predicted stroke probability to the submission file.
test.to_csv('Submission.csv', columns=['id', 'stroke'], index=False)

#### This is my solution to the online hackathon McKinsey Analytics - Healthcare Analytics, organised by Analytics Vidhya. My public score was 0.835, with rank 142. I was surprised when the private leaderboard was updated: I ranked 86th among 2000+ participants with a score of 0.853.