## About the dataset

The dataset comes from Kaggle: https://www.kaggle.com/datasets/mitishaagarwal/patient. How a hospital treats a patient depends on the disorder, disease, or infection they have. It is important to treat the patient on time, and the choice of treatment also depends on what suits the patient's body. Survival after treatment depends on many features, such as gender, age, BMI, height, weight, type of issue, type of admission, and past medical history. Predicting survival is therefore valuable when treating patients. Using the features available in the dataset, I predict whether a patient will survive, which could help the hospital make more accurate predictions in the future.

## Setting up the environment

In [7]:
import pandas as pd
from pandas import MultiIndex, Int16Dtype
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go

%matplotlib inline

import plotly.tools as tls
import plotly.offline as py
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')
import dvc.api


pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.



## Data versioning using DVC

#### Steps performed to push the data and pull the data using DVC

1. cd Desktop
2. cd Assignment-2
3. git init
4. dvc init
5. pwd
6. dvc add Dataset1.csv
7. git add Dataset1.csv.dvc .gitignore
8. git commit -m "adding the dataset1"
9. dvc remote add -d storage gdrive://1g1ufBaObvC7_QPxvVRse95RZ-FjyGTtr
10. git add .dvc/config
11. git commit -m "versioning the dataset"
12. dvc remote add -d myremote /tmp/dvcstore
13. git commit .dvc/config -m "configure local remote"
14. dvc push
15. dvc pull

#### Repeating the same steps to add dataset2
1. dvc add Dataset2.csv
2. git add Dataset2.csv.dvc .gitignore
3. git commit -m "adding the dataset2"
4. dvc push
5. dvc pull
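
The steps above rely on Git tracking only a small `.dvc` pointer file while the data itself goes to the remote. The pointer records a content hash (MD5) of the data file. The following is a simplified sketch of that idea only, not DVC's actual implementation; the helper `file_md5` and the toy CSV contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two toy "dataset versions" written to a temporary directory (made-up rows).
with tempfile.TemporaryDirectory() as tmp:
    v1 = Path(tmp) / "Dataset1.csv"
    v2 = Path(tmp) / "Dataset2.csv"
    v1.write_text("id,hospital_death\n1,0\n2,1\n")
    v2.write_text("id,hospital_death\n2,1\n3,0\n")
    hash_v1 = file_md5(v1)
    hash_v2 = file_md5(v2)
    hash_v1_again = file_md5(v1)

print(hash_v1 != hash_v2)        # different content gives a different hash
print(hash_v1 == hash_v1_again)  # identical content gives the same hash
```

Because the hash changes whenever the file content changes, committing the pointer file is enough for Git to record which version of the data belongs to which commit.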

#### Steps performed to start mlflow
1. cd Desktop
2. cd Assignment-2
3. mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts --host 0.0.0.0 -p 8000
4. Started logging the parameters and accuracies at http://127.0.0.1:8000/

#### Reference used
1. https://dvc.org/doc/start
2. https://www.mlflow.org/docs/latest/quickstart.html

In [28]:
# Read the dataset through the DVC API from the GitHub repository
# (the data itself is stored on the Google Drive remote).
# dvc.api.read returns the file contents as a string.
data = dvc.api.read(
    'Dataset.csv',
    repo='https://github.com/Keerthiagasthya/Assignment2')
type(data)

AssertionError: 
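
In text mode `dvc.api.read` returns the file contents as a string, so when the call succeeds the text can be parsed in memory instead of being written to an intermediate file first. A minimal sketch, assuming a short made-up CSV string (`csv_text`) in place of the real download:

```python
import io

import pandas as pd

# Stand-in for the string that dvc.api.read(...) would return (made-up rows).
csv_text = "encounter_id,age,hospital_death\n66154,68.0,0\n114252,77.0,0\n"

# Parse the in-memory text directly; no temporary file is needed.
frame = pd.read_csv(io.StringIO(csv_text))
print(frame.shape)
```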

## Load the dataset

In [3]:
#patientsData = download_data_dvc_drive()
data = pd.read_csv('../Assignment-2/Dataset.csv')

## Feature Engineering

Categorical values are converted, numerical columns are transformed, and nulls are removed. Null values in categorical columns are replaced by the mode, and those in numerical columns are replaced by the mean.
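
The imputation rule described above can be illustrated on a toy frame. This is a sketch with made-up values; only the column names `gender`, `diabetes_mellitus` and `bmi` are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gender": ["M", "F", None, "M"],          # categorical
    "diabetes_mellitus": [0, 1, np.nan, 0],   # binary flag, treated as categorical
    "bmi": [22.0, 28.0, np.nan, 31.0],        # continuous
})

cat_cols = ["gender", "diabetes_mellitus"]
num_cols = ["bmi"]

filled = toy.copy()
for col in cat_cols:
    # mode() returns a Series; its first entry is the most frequent value
    filled[col] = filled[col].fillna(filled[col].mode()[0])
for col in num_cols:
    # mean() skips missing values by default
    filled[col] = filled[col].fillna(filled[col].mean())

print(filled)
```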

In [4]:
numerical_cat = [
 'elective_surgery',
 'apache_post_operative',
 'arf_apache',
 'gcs_unable_apache',
 'intubated_apache',
 'ventilated_apache',
 'aids',
 'cirrhosis',
 'diabetes_mellitus',
 'hepatic_failure',
 'immunosuppression',
 'leukemia',
 'lymphoma',
 'solid_tumor_with_metastasis']

categorical = ['ethnicity',
 'gender',
 'icu_type',
 'apache_3j_bodysystem',
 'apache_2_bodysystem']

In [5]:
data.nunique()[data.nunique() == 2].index.tolist()

['hospital_death',
 'elective_surgery',
 'gender',
 'apache_post_operative',
 'arf_apache',
 'gcs_unable_apache',
 'intubated_apache',
 'ventilated_apache',
 'aids',
 'cirrhosis',
 'diabetes_mellitus',
 'hepatic_failure',
 'immunosuppression',
 'leukemia',
 'lymphoma',
 'solid_tumor_with_metastasis']

In [6]:
data.select_dtypes(include='O').columns.values.tolist()

['ethnicity',
 'gender',
 'hospital_admit_source',
 'icu_admit_source',
 'icu_stay_type',
 'icu_type',
 'apache_3j_bodysystem',
 'apache_2_bodysystem']

In [7]:
not_numeric = data[numerical_cat + categorical + ['hospital_death']].columns.tolist()
numeric_only = data.drop(not_numeric,axis=1).columns.tolist()
numeric_only

['encounter_id',
 'patient_id',
 'hospital_id',
 'age',
 'bmi',
 'height',
 'hospital_admit_source',
 'icu_admit_source',
 'icu_id',
 'icu_stay_type',
 'pre_icu_los_days',
 'readmission_status',
 'weight',
 'albumin_apache',
 'apache_2_diagnosis',
 'apache_3j_diagnosis',
 'bilirubin_apache',
 'bun_apache',
 'creatinine_apache',
 'fio2_apache',
 'gcs_eyes_apache',
 'gcs_motor_apache',
 'gcs_verbal_apache',
 'glucose_apache',
 'heart_rate_apache',
 'hematocrit_apache',
 'map_apache',
 'paco2_apache',
 'paco2_for_ph_apache',
 'pao2_apache',
 'ph_apache',
 'resprate_apache',
 'sodium_apache',
 'temp_apache',
 'urineoutput_apache',
 'wbc_apache',
 'd1_diasbp_invasive_max',
 'd1_diasbp_invasive_min',
 'd1_diasbp_max',
 'd1_diasbp_min',
 'd1_diasbp_noninvasive_max',
 'd1_diasbp_noninvasive_min',
 'd1_heartrate_max',
 'd1_heartrate_min',
 'd1_mbp_invasive_max',
 'd1_mbp_invasive_min',
 'd1_mbp_max',
 'd1_mbp_min',
 'd1_mbp_noninvasive_max',
 'd1_mbp_noninvasive_min',
 'd1_resprate_max',
 'd1_r

In [8]:
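for col in numerical_cat:
    # Nullable integer dtype keeps the 0/1 flags as integers despite missing values
    data[col] = data[col].astype('Int64')

for col in numerical_cat:
    # Replace missing flags with the most frequent value (mode)
    data[col] = data[col].fillna(data[col].mode()[0])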

In [9]:
data[numeric_only].isna().sum(axis=0).sort_values(ascending=False)

h1_bilirubin_min      84619
h1_bilirubin_max      84619
h1_lactate_max        84369
h1_lactate_min        84369
h1_albumin_max        83824
                      ...  
icu_id                    0
icu_stay_type             0
pre_icu_los_days          0
readmission_status        0
encounter_id              0
Length: 166, dtype: int64

In [10]:
# Count nulls per column once, then split the columns at the 11000-null threshold
null_counts = data[numeric_only].isna().sum(axis=0).sort_values()
split_one = null_counts[null_counts < 11000].index.tolist()

# >= so that a column with exactly 11000 nulls is not left out of both lists
split_two = null_counts[null_counts >= 11000].index.tolist()

In [11]:
data[categorical].nunique()
#using one-hot encoding because the categories are nominal and most have more than two unique values

ethnicity                6
gender                   2
icu_type                 8
apache_3j_bodysystem    11
apache_2_bodysystem     10
dtype: int64
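
As an illustration of what one-hot encoding does with columns like these, here is a small sketch on made-up values. The category labels are placeholders, and `handle_unknown="ignore"` is an assumption added here so that a category unseen during fitting encodes as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "icu_type": ["type_a", "type_b", "type_c"],  # placeholder labels
})

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy).toarray()

# One output column per category: 2 for gender plus 3 for icu_type
n_categories = [len(c) for c in encoder.categories_]
print(encoded.shape, n_categories)

# A category not seen during fit is encoded as zeros for that feature
unseen = pd.DataFrame({"gender": ["F"], "icu_type": ["type_z"]})
unseen_encoded = encoder.transform(unseen).toarray()
print(unseen_encoded)
```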

In [12]:
data.columns = [x.lower() for x in data.columns.tolist()]
data = data.loc[:,~data.columns.duplicated()]

In [13]:
t = data['arf_apache'].dtype
for col in tqdm(data.columns.tolist()):
    if data[col].values.dtype == 'uint8' or t == data[col].values.dtype:
        data[col] = data[col].astype(int)

100%|██████████████████████████████████████| 186/186 [00:00<00:00, 12715.19it/s]


In [14]:
data.dtypes

encounter_id                     int64
patient_id                       int64
hospital_id                      int64
hospital_death                   int64
age                            float64
                                ...   
leukemia                         int64
lymphoma                         int64
solid_tumor_with_metastasis      int64
apache_3j_bodysystem            object
apache_2_bodysystem             object
Length: 186, dtype: object

## Building a Model


Instead of manually deleting rows to create dataset versions, I split the data into two subsets, saved each as an Excel file at a defined path, converted the Excel files to CSV, and then pushed and pulled the data with DVC.

In [15]:
#df_1 = data.iloc[:30000,:]
#df_2 = data.iloc[1000:,:] 
#writer = pd.ExcelWriter(r'/Users/keerthanakn/Desktop/Assignment-2/Dataset1.xlsx')
#df_1.to_excel(writer)
#writer.save()
#writer1 = pd.ExcelWriter(r'/Users/keerthanakn/Desktop/Assignment-2/Dataset2.xlsx')
#df_2.to_excel(writer1)
#writer1.save()
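
A possible shortcut for the split is to write each slice straight to CSV with `index=False`, skipping the Excel round trip. Columns named `Unnamed: 0` are typical of an index that was written out and then read back as data, and `index=False` avoids that. The sketch below uses a toy frame and a temporary directory; the slice boundaries are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"encounter_id": range(10), "hospital_death": [0, 1] * 5})

# Two overlapping slices stand in for the two dataset versions.
version_1 = toy.iloc[:6, :]
version_2 = toy.iloc[2:, :]

with tempfile.TemporaryDirectory() as tmp:
    path_1 = Path(tmp) / "Dataset1.csv"
    path_2 = Path(tmp) / "Dataset2.csv"
    # index=False keeps the row index out of the file
    version_1.to_csv(path_1, index=False)
    version_2.to_csv(path_2, index=False)
    reloaded_1 = pd.read_csv(path_1)
    reloaded_2 = pd.read_csv(path_2)

print(list(reloaded_1.columns), len(reloaded_1), len(reloaded_2))
```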

In [16]:
#read_file = pd.read_excel ("../Assignment-2/Dataset1.xlsx")
#read_file.to_csv ("../Assignment-2/Dataset1.csv", index = None,header=True)
#df_1 = pd.DataFrame(pd.read_csv("../Assignment-2/Dataset1.csv"))
#df_1
df_1 = pd.read_csv("../Assignment-2/Dataset1.csv")
df_1

Unnamed: 0.1,Unnamed: 0,encounter_id,patient_id,hospital_id,hospital_death,age,bmi,elective_surgery,ethnicity,gender,...,aids,cirrhosis,diabetes_mellitus,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,apache_3j_bodysystem,apache_2_bodysystem
0,0,66154,25312,118,0,68.0,22.730000,0,Caucasian,M,...,0,0,1,0,0,0,0,0,Sepsis,Cardiovascular
1,1,114252,59342,81,0,77.0,27.420000,0,Caucasian,F,...,0,0,1,0,0,0,0,0,Respiratory,Respiratory
2,2,119783,50777,118,0,25.0,31.950000,0,Caucasian,F,...,0,0,0,0,0,0,0,0,Metabolic,Metabolic
3,3,79267,46918,118,0,81.0,22.640000,1,Caucasian,F,...,0,0,0,0,0,0,0,0,Cardiovascular,Cardiovascular
4,4,92056,34377,33,0,19.0,,0,Caucasian,M,...,0,0,0,0,0,0,0,0,Trauma,Trauma
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,29995,10576,76044,79,0,72.0,67.814990,0,Caucasian,M,...,0,0,1,0,0,0,0,0,Sepsis,Cardiovascular
29996,29996,61815,85029,171,0,40.0,36.707204,0,Caucasian,M,...,0,0,0,0,0,0,0,0,Cardiovascular,Cardiovascular
29997,29997,6232,68236,128,0,52.0,30.700050,0,Caucasian,M,...,0,0,0,0,0,0,0,0,Respiratory,Respiratory
29998,29998,47745,125650,112,0,78.0,22.269388,1,Caucasian,M,...,0,0,0,0,0,0,0,0,Cardiovascular,Undefined diagnoses


In [17]:
# read_file = pd.read_excel ("../Assignment-2/Dataset2.xlsx")
# read_file.to_csv("../Assignment-2/Dataset2.csv", 
#                   index = None,
#                   header=True)
df_2 = pd.read_csv("../Assignment-2/Dataset2.csv")
df_2

Unnamed: 0.1,Unnamed: 0,encounter_id,patient_id,hospital_id,hospital_death,age,bmi,elective_surgery,ethnicity,gender,...,aids,cirrhosis,diabetes_mellitus,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,apache_3j_bodysystem,apache_2_bodysystem
0,1000,102042,30579,151,1,59.0,24.273289,0,Caucasian,M,...,0,0,1,0,0,0,0,0,Sepsis,Cardiovascular
1,1001,85189,13206,118,0,56.0,21.419753,0,Caucasian,M,...,0,0,0,1,0,0,0,1,Sepsis,Cardiovascular
2,1002,10878,112743,83,0,58.0,,0,Caucasian,F,...,0,0,0,0,0,0,0,0,Neurological,Neurologic
3,1003,86366,39312,118,0,,30.556815,0,Caucasian,M,...,0,0,1,0,0,0,0,0,Cardiovascular,Cardiovascular
4,1004,72568,79848,118,0,34.0,30.180226,0,Asian,F,...,0,0,0,0,0,0,0,0,Genitourinary,Renal/Genitourinary
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90708,91708,91592,78108,30,0,75.0,23.060250,0,Caucasian,M,...,0,0,1,0,0,0,0,1,Sepsis,Cardiovascular
90709,91709,66119,13486,121,0,56.0,47.179671,0,Caucasian,F,...,0,0,0,0,0,0,0,0,Sepsis,Cardiovascular
90710,91710,8981,58179,195,0,48.0,27.236914,0,Caucasian,M,...,0,0,1,0,0,0,0,0,Metabolic,Metabolic
90711,91711,33776,120598,66,0,,23.297481,0,Caucasian,F,...,0,0,0,0,0,0,0,0,Respiratory,Respiratory


In [18]:
X = df_1.drop(['hospital_death'], axis=1)
y = df_1['hospital_death']

X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

In [19]:
print(X_train1.shape)
print(y_train1.shape)

(24000, 186)
(24000,)


In [20]:
y_train1.value_counts()

0    22062
1     1938
Name: hospital_death, dtype: int64

In [21]:
y_test1.value_counts()

0    5532
1     468
Name: hospital_death, dtype: int64
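
The class counts above show that only about 8% of patients belong to class 1 (1938 of 24000 in training, 468 of 6000 in test). With such an imbalance, passing `stratify=y` to `train_test_split` keeps the class ratio the same in both partitions. A sketch on synthetic labels, not the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))       # synthetic features
y = np.array([1] * 80 + [0] * 920)   # 8% positives, similar to the dataset

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```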

In [22]:
X = df_2.drop(['hospital_death'], axis=1)
y = df_2['hospital_death']

X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

In [23]:
print(X_train2.shape)
print(y_train2.shape)

(72570, 186)
(72570,)


In [24]:
y_train2.value_counts()

0    66296
1     6274
Name: hospital_death, dtype: int64

In [25]:
y_test2.value_counts()

0    16556
1     1587
Name: hospital_death, dtype: int64

In [26]:
from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from xgboost.sklearn import XGBClassifier

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import time
from sklearn import  metrics
from sklearn.model_selection import GridSearchCV


In [27]:
preprocess_pipeline = ColumnTransformer(transformers=
                                        [('num', SimpleImputer(strategy='median'),numerical_cat),
                                        ('cat',OneHotEncoder(),categorical)]
                                       )
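
One detail worth knowing about `ColumnTransformer`: its `remainder` parameter defaults to `'drop'`, so any column not listed in a transformer (for example `age` or `bmi`) is not passed to the model. The sketch below compares the default with `remainder='passthrough'` on a toy frame; the helper `build` and the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "aids": [0.0, np.nan, 1.0, 0.0],
    "gender": ["M", "F", "F", "M"],
    "age": [68.0, 77.0, 25.0, 81.0],  # not listed in any transformer
})


def build(remainder):
    """Build the same two-branch transformer with a given remainder policy."""
    return ColumnTransformer(
        transformers=[
            ("num", SimpleImputer(strategy="median"), ["aids"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
        ],
        remainder=remainder,
    )


dropped = build("drop").fit_transform(toy)        # default behaviour
kept = build("passthrough").fit_transform(toy)    # unlisted columns are kept

print(dropped.shape, kept.shape)
```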

## Hyperparameter tuning using mlflow

Advantages of using a Pipeline
1. Pipelines enforce the order of the steps required in a project and help preprocess the raw data.
2. They help automate the workflow and produce results faster.
3. They make the work accessible to other team members.
4. They make the runs logged in MLflow easier to read and understand.
5. They make the work reproducible.

Reference
3. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
4. https://mkai.org/what-are-the-benefits-of-a-machine-learning-pipeline/
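
To make the reproducibility point concrete, the sketch below fits a small pipeline on toy training rows and then calls `predict` on raw test rows that still contain a missing value. The imputation median learned during `fit` is reused automatically. `DecisionTreeClassifier` is a lightweight stand-in for the XGBoost model, and all values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "aids": [0, 0, 1, np.nan, 1, 0],
    "gender": ["M", "F", "F", "M", "F", "M"],
})
y_train = [0, 0, 1, 0, 1, 0]
test = pd.DataFrame({"aids": [np.nan, 1], "gender": ["F", "M"]})

pipe = Pipeline(steps=[
    ("prep", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["aids"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ])),
    ("clf", DecisionTreeClassifier(max_depth=2, random_state=0)),
])

pipe.fit(train, y_train)

# The raw test frame goes in as-is; preprocessing is applied inside the pipeline.
preds = pipe.predict(test)
learned_median = pipe.named_steps["prep"].named_transformers_["num"].statistics_[0]
print(preds, learned_median)
```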

In [28]:
param_test1 = {
    'max_depth': [4, 6],
    'min_child_weight': [3, 5]
}
# max_depth and min_child_weight set on the estimator are overridden by the grid
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1, n_estimators=140, max_depth=5,
        min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
        objective='binary:logistic', eval_metric="logloss", nthread=4,
        scale_pos_weight=1, seed=27, use_label_encoder=False),
    param_grid=param_test1, scoring='roc_auc', n_jobs=3, cv=5)


In [29]:
model = Pipeline(steps=[
("transformed_data",preprocess_pipeline),
("gridcvsearch", gsearch1)])
#model.fit(X_train, y_train)
#model.grid_scores_, model.best_params_, model.best_score

In [30]:
model.fit(X_train1, y_train1)


Pipeline(steps=[('transformed_data',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='median'),
                                                  ['elective_surgery',
                                                   'apache_post_operative',
                                                   'arf_apache',
                                                   'gcs_unable_apache',
                                                   'intubated_apache',
                                                   'ventilated_apache', 'aids',
                                                   'cirrhosis',
                                                   'diabetes_mellitus',
                                                   'hepatic_failure',
                                                   'immunosuppression',
                                                   'leukemia', 'lymphoma',
                                  

In [31]:
gsearch1.best_params_

{'max_depth': 4, 'min_child_weight': 3}
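
In the cells above, the grid search is the final step of the Pipeline. An alternative arrangement, sketched here on synthetic data, wraps the whole pipeline in `GridSearchCV` so that preprocessing is refit inside every cross-validation fold. Parameters are then addressed as `<step name>__<parameter>`. `GradientBoostingClassifier` and `min_samples_leaf` are stand-ins chosen so the example runs with scikit-learn alone; they are not equivalent to `XGBClassifier` and `min_child_weight`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem (about 10% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=27)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", GradientBoostingClassifier(n_estimators=20, random_state=27)),
])

# Keys use the "<step>__<param>" convention to reach into the pipeline.
param_grid = {"clf__max_depth": [2, 4], "clf__min_samples_leaf": [3, 5]}

search = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best = search.best_params_
print(best, search.best_score_)
```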

In [36]:
param_test1 = {
    'max_depth': [4, 6],
    'min_child_weight': [3, 5]
}
# Same search space as gsearch1, fitted later on the second dataset version
gsearch2 = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1, n_estimators=140, max_depth=5,
        min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
        objective='binary:logistic', eval_metric="logloss", nthread=4,
        scale_pos_weight=1, seed=27, use_label_encoder=False),
    param_grid=param_test1, scoring='roc_auc', n_jobs=3, cv=5)

In [38]:
model1 = Pipeline(steps=[
("transformed_data",preprocess_pipeline),
("gridcvsearch", gsearch2)])

In [39]:
model1.fit(X_train2, y_train2)

Pipeline(steps=[('transformed_data',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='median'),
                                                  ['elective_surgery',
                                                   'apache_post_operative',
                                                   'arf_apache',
                                                   'gcs_unable_apache',
                                                   'intubated_apache',
                                                   'ventilated_apache', 'aids',
                                                   'cirrhosis',
                                                   'diabetes_mellitus',
                                                   'hepatic_failure',
                                                   'immunosuppression',
                                                   'leukemia', 'lymphoma',
                                  

In [40]:
gsearch2.best_params_

{'max_depth': 4, 'min_child_weight': 3}

In [41]:
# Evaluate the pipeline trained on Dataset1 on its held-out test split
prediction1 = model.predict(X_test1)
precision_score1 = metrics.precision_score(y_test1, prediction1, average='weighted')
recall_score1 = metrics.recall_score(y_test1, prediction1, average='weighted')
f1_score1 = metrics.f1_score(y_test1, prediction1, average='weighted')
confusion_matrix1 = metrics.confusion_matrix(y_test1, prediction1)
accuracy_score1 = metrics.accuracy_score(y_test1, prediction1)

In [42]:
# Evaluate the pipeline trained on Dataset2 on its held-out test split
prediction2 = model1.predict(X_test2)
precision_score2 = metrics.precision_score(y_test2, prediction2, average='weighted')
recall_score2 = metrics.recall_score(y_test2, prediction2, average='weighted')
f1_score2 = metrics.f1_score(y_test2, prediction2, average='weighted')
confusion_matrix2 = metrics.confusion_matrix(y_test2, prediction2)
accuracy_score2 = metrics.accuracy_score(y_test2, prediction2)

In [43]:
print("precision_score: ",precision_score1)
print("recall_score: ",recall_score1)
print("f1_score: ",f1_score1)
print("confusion_matrix1:\n",confusion_matrix1)
print("accuracy_score: ",accuracy_score1) 

precision_score:  0.9045374183827223
recall_score:  0.9235
f1_score:  0.8908888941566908
confusion_matrix1:
 [[5523    9]
 [ 450   18]]
accuracy_score:  0.9235


In [44]:
print("precision_score: ",precision_score2)
print("recall_score: ",recall_score2)
print("f1_score: ",f1_score2)
print("confusion_matrix2:\n",confusion_matrix2)
print("accuracy_score: ",accuracy_score2)

precision_score:  0.8839287972563444
recall_score:  0.9130794245714601
f1_score:  0.8755853632582867
confusion_matrix2:
 [[16522    34]
 [ 1543    44]]
accuracy_score:  0.9130794245714601
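
Because the scores above use `average='weighted'`, they are dominated by the majority class. Recomputing the class-1 figures directly from the two confusion matrices gives a clearer picture: recall for class 1 is roughly 3.8% and 2.8%, and accuracy is only marginally above what always predicting class 0 would achieve. The sketch below does that arithmetic; the helper `minority_report` is hypothetical.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true, columns: predicted).
cm1 = np.array([[5523, 9], [450, 18]])
cm2 = np.array([[16522, 34], [1543, 44]])


def minority_report(cm):
    """Return accuracy, majority-class baseline, and class-1 recall/precision."""
    tn, fp, fn, tp = cm.ravel()
    total = cm.sum()
    return {
        "accuracy": (tn + tp) / total,
        "baseline_accuracy": (tn + fp) / total,  # always predicting class 0
        "recall_class_1": tp / (tp + fn),
        "precision_class_1": tp / (tp + fp),
    }


report1 = minority_report(cm1)
report2 = minority_report(cm2)
print(report1)
print(report2)
```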


## Logging the accuracy, scores and best parameters using MLflow

In [45]:
import os

import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://127.0.0.1:8000/")
os.environ['MLFLOW_TRACKING_URI'] = 'http://127.0.0.1:8000/'

mlflow.set_experiment('XGB classifier')

with mlflow.start_run(run_name="survival prediction"):
    # model is the pipeline fitted on Dataset1
    predictions = model.predict(X_test1)

    print("precision_score",precision_score1)
    print("recall_score",recall_score1)
    print("f1_score",f1_score1)
    print("confusion_matrix1",confusion_matrix1)
    print("accuracy_score",accuracy_score1)

    best_param = gsearch1.best_params_
    print(best_param)
    mlflow.log_param("max_depth",best_param['max_depth'])
    # log the value under the XGBoost parameter name it belongs to
    mlflow.log_param("min_child_weight",best_param['min_child_weight'])
    mlflow.log_metric("precision_score",precision_score1)
    mlflow.log_metric("recall_score",recall_score1)
    mlflow.log_metric("f1_score",f1_score1)
    mlflow.log_metric("accuracy_score",accuracy_score1)

    mlflow.sklearn.log_model(sk_model = model ,artifact_path="", registered_model_name="XGB classifier1")

2022/06/14 09:12:21 INFO mlflow.tracking.fluent: Experiment with name 'XGB classifier' does not exist. Creating a new experiment.


precision_score 0.9045374183827223
recall_score 0.9235
f1_score 0.8908888941566908
confusion_matrix1 [[5523    9]
 [ 450   18]]
accuracy_score 0.9235
{'max_depth': 4, 'min_child_weight': 3}


Successfully registered model 'XGB classifier1'.
2022/06/14 09:12:24 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: XGB classifier1, version 1
Created version '1' of model 'XGB classifier1'.


In [49]:
import os

import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://127.0.0.1:8000/")
os.environ['MLFLOW_TRACKING_URI'] = 'http://127.0.0.1:8000/'

mlflow.set_experiment('XGB classifier')

with mlflow.start_run(run_name="survival prediction"):
    predictions = model1.predict(X_test2)

    print("precision_score",precision_score2)
    print("recall_score",recall_score2)
    print("f1_score",f1_score2)
    print("confusion_matrix2",confusion_matrix2)
    print("accuracy_score",accuracy_score2)

    best_param = gsearch2.best_params_
    print(best_param)
    mlflow.log_param("max_depth",best_param['max_depth'])
    # log the value under the XGBoost parameter name it belongs to
    mlflow.log_param("min_child_weight",best_param['min_child_weight'])
    mlflow.log_metric("precision_score",precision_score2)
    mlflow.log_metric("recall_score",recall_score2)
    mlflow.log_metric("f1_score",f1_score2)
    mlflow.log_metric("accuracy_score",accuracy_score2)

    # model1 is the pipeline fitted on Dataset2
    mlflow.sklearn.log_model(sk_model = model1 ,artifact_path="", registered_model_name="XGB classifier2")

precision_score 0.8839287972563444
recall_score 0.9130794245714601
f1_score 0.8755853632582867
confusion_matrix2 [[16522    34]
 [ 1543    44]]
accuracy_score 0.9130794245714601
{'max_depth': 4, 'min_child_weight': 3}


Registered model 'XGB classifier2' already exists. Creating a new version of this model...
2022/06/14 09:45:47 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: XGB classifier2, version 3
Created version '3' of model 'XGB classifier2'.


## References


1. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
2. https://www.mlflow.org/docs/latest/quickstart.html
3. https://www.mlflow.org/docs/latest/index.html
4. https://mkai.org/what-are-the-benefits-of-a-machine-learning-pipeline/
5. https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html
6. https://dvc.org/doc/start