<center>
<img src="https://images.medicinenet.com/images/slideshow/visual-guide-to-stroke-s2-diagram-of-a-stroke.jpg">
</center>

### <p style="background-color:pink; font-family:newtimeroman; font-size:120%; text-align:center;color:white;border-radius: 15px">Context</p>

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

<b>Data information:</b>

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12. stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

### <p style="background-color:pink; font-family:newtimeroman; font-size:120%; text-align:center;color:white;border-radius: 15px">Table of contents</p>

* [1. Loading Data 💎](#1)
* [2. EDA 📊](#2)
* [3. Models ⚙️](#3)
    * [3.1 Xgboost 🛠](#3.1)
        * [3.1.1 Xgboost Round 1 🛠](#3.1.1)
        * [3.1.2 Xgboost Round 2 🛠](#3.1.2)
        * [3.1.3 Xgboost Round 3 🛠](#3.1.3)
        * [3.1.4 Xgboost Round 4 🛠](#3.1.4)
        * [3.1.5 Xgboost Round 5 🛠](#3.1.5)
        * [3.1.6 Xgboost Round 6 🛠](#3.1.6)
* [4. Take away notes ⚙️](#4)

### ⬇️ Importing Libraries

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import numpy as np 
import pandas as pd 
warnings.filterwarnings('ignore')
import os

import plotly
import plotly.express as ex
import plotly.graph_objs as go
import xgboost as xgb
from xgboost import XGBClassifier
# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2 and are not used below, so they are not imported
from sklearn.metrics import accuracy_score,confusion_matrix,roc_auc_score,ConfusionMatrixDisplay,precision_score,recall_score,f1_score,classification_report,roc_curve,auc,precision_recall_curve,average_precision_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id='1'></a>
### <p style="background-color:maroon; font-family:arial; font-size:160%; text-align:center; border-radius: 15px;color:white">Loading dataset</p>

In [None]:
data=pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

### Check the shape

In [None]:
display(data.shape)
display(data.head())

### Data info.

In [None]:
data.info()

In [None]:
data['hypertension'].unique()

In [None]:
data['heart_disease'].unique()

#### Hypertension and heart_disease seem to have binary values.


<a id='2'></a>
### <p style="background-color:maroon; font-family:arial; font-size:160%; text-align:center; border-radius: 15px;color:white">Let's do some EDA</p>

#### <i><p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:left;color:black;border-radius: 15px">&nbsp;&nbsp;&nbsp;&nbsp;Let's check missing values first</p></i>

In [None]:
data.isna().sum()
### There are a few missing values in the bmi column

### Replace nulls in BMI with median values

In [None]:
data[data['bmi'].notna()]['bmi'].median()

In [None]:
# median() skips NaN by default; assigning the result back avoids calling inplace=True on a column selection
data['bmi'] = data['bmi'].fillna(data['bmi'].median())
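
As a minimal sketch of the median imputation above, here is the same step on a small made-up frame. The values are illustrative and not taken from the dataset, and `toy` and `bmi_median` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the bmi column (hypothetical values)
toy = pd.DataFrame({"bmi": [22.0, np.nan, 30.0, 26.0, np.nan]})

# Series.median() ignores NaN by default, so no explicit notna() filter is needed
bmi_median = toy["bmi"].median()

# Fill the missing entries and assign the result back to the column
toy["bmi"] = toy["bmi"].fillna(bmi_median)

print(bmi_median)
print(toy["bmi"].tolist())
```

The median is preferred here over the mean because it is less sensitive to the extreme values that the box plot is meant to reveal.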

In [None]:
data.describe()

#### <i><p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:left;color:black;border-radius: 15px">&nbsp;&nbsp;&nbsp;&nbsp;Let's plot the box plot for numerical columns</p></i>

In [None]:
data[['age','avg_glucose_level','bmi']]

In [None]:
ex.box(data_frame=data,y=['age','avg_glucose_level','bmi'],template='ggplot2',title='Boxplot')

#### <i><p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:left;color:black;border-radius: 15px">&nbsp;&nbsp;&nbsp;&nbsp;Count plots</p></i>

### <p style="color:darkblue">Gender </p>

In [None]:
data.groupby('gender')['id'].count().reset_index().rename({'id':'count'},axis=1)

In [None]:
from plotly.subplots import make_subplots
fig=go.Figure()
fig.add_trace(go.Bar(
    x=data.groupby('gender')['id'].count().reset_index().rename({'id':'count'},axis=1)['gender'],
    y=data.groupby('gender')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'],
    name='Gender Count',
    marker_color='orange',
    text=data.groupby('gender')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'],
    textposition='inside',
    yaxis='y1'
))
fig.update_layout(
    title="Gender Wise distribution",
    xaxis_title="Gender",
    yaxis_title="Counts",
    template='ggplot2',
    font=dict(
        size=20,
        color="Black",  
    ),
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    plot_bgcolor='white',
)
fig.show()

### <p style="color:darkblue">Hypertension </p>

In [None]:
data.groupby('hypertension')['id'].count().reset_index().rename({'id':'count'},axis=1)

In [None]:
from plotly.subplots import make_subplots
fig=go.Figure()
fig.add_trace(go.Bar(
    x=data.groupby('hypertension')['id'].count().reset_index().rename({'id':'count'},axis=1)['hypertension'],
    y=data.groupby('hypertension')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'],
    name='Hypertension Count',
    marker_color='maroon',
    text=data.groupby('hypertension')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'],
    textposition='inside',
    yaxis='y1'
))
fig.update_layout(
    title="Hypertension Wise distribution",
    xaxis_title="Hypertension",
    yaxis_title="Counts",
    template='ggplot2',
    font=dict(
        size=20,
        color="Black",  
    ),
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    plot_bgcolor='white',
)
fig.show()

### <p style="color:darkblue"> Heart disease</p>

In [None]:
data.groupby('heart_disease')['id'].count().reset_index().rename({'id':'count'},axis=1)

In [None]:
from plotly.subplots import make_subplots
fig=go.Figure()
fig.add_trace(go.Bar(
    x=data.groupby('heart_disease')['id'].count().reset_index().rename({'id':'count'},axis=1)['heart_disease'],
    y=data.groupby('heart_disease')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'],
    name='Heart Disease Count',
    marker_color='pink',
    text=data.groupby('heart_disease')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'],
    textposition='inside',
    yaxis='y1'
))
fig.update_layout(
    title="Heart Disease Wise distribution",
    xaxis_title="Heart Disease",
    yaxis_title="Counts",
    template='ggplot2',
    font=dict(
        size=20,
        color="Black",  
    ),
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    plot_bgcolor='white',
)
fig.show()

### <p style="color:darkblue">Ever married</p>

In [None]:
from plotly.subplots import make_subplots
fig=go.Figure()
fig.add_trace(go.Bar(
    x=data.groupby('ever_married')['id'].count().reset_index().rename({'id':'count'},axis=1)['ever_married'],
    y=data.groupby('ever_married')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'],
    name='Ever Married Count',
    marker_color='lightgreen',
    text=data.groupby('ever_married')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'],
    textposition='inside',
    yaxis='y1'
))
fig.update_layout(
    title="Ever Married Wise distribution",
    xaxis_title="Ever Married",
    yaxis_title="Counts",
    template='ggplot2',
    font=dict(
        size=20,
        color="Black",  
    ),
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=False),
    plot_bgcolor='white',
)
fig.show()
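
The count plots above repeat the same `groupby(...).count().reset_index().rename(...)` expression three times per figure. A possible way to compute the counts once is sketched below; `category_counts` is a hypothetical helper name and the toy frame is made up for illustration.

```python
import pandas as pd

def category_counts(frame, column):
    """Return a two-column frame: the category value and the number of rows with it."""
    counts = frame[column].value_counts().sort_index()
    # Name the index after the column, then move it into a regular column
    return counts.rename_axis(column).reset_index(name="count")

# Toy data (hypothetical values)
toy = pd.DataFrame({"ever_married": ["Yes", "No", "Yes", "Yes", "No"]})
counts = category_counts(toy, "ever_married")
print(counts)
```

The resulting frame could then be passed to `go.Bar` as `x=counts[column]`, `y=counts['count']` and `text=counts['count']`.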

### <p style="color:darkblue">Work type</p>

In [None]:
# Split the data by target class so that each class gets its own trace
df1 = data[data["stroke"] == 1]
df2 = data[data["stroke"] == 0]

# df1 holds the stroke=1 rows and df2 the stroke=0 rows, so the trace names must match
trace1 = go.Bar(x=df1.groupby('work_type')['id'].count().reset_index().rename({'id':'count'},axis=1)['work_type'], 
                y=df1.groupby('work_type')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'], 
                name="Stroke1")
trace2 = go.Bar(x=df2.groupby('work_type')['id'].count().reset_index().rename({'id':'count'},axis=1)['work_type'], 
                y=df2.groupby('work_type')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'], 
                name="Stroke0")
# Fill out  data with our traces
d = [trace1, trace2]
# Create layout and specify title, legend and so on
layout = go.Layout(title="Work Type Wise distribution",
                   xaxis=dict(title="Work Type"),
                   yaxis=dict(title="Counts"),
                   legend=dict(x=1.0, y=0.5),
                   # Here annotations need to create legend title
                   annotations=[
                                dict(
                                    xref="paper",
                                    yref="paper",
                                    text="Stroke",
                                    showarrow=False
                                )],
                   barmode="group",
                   template='ggplot2')
# Create figure with all prepared data for plot
fig = go.Figure(data=d, layout=layout)
fig.show()

### <p style="color:darkblue">Residence type</p>

In [None]:
data.info()

In [None]:
# Split the data by target class so that each class gets its own trace
df1 = data[data["stroke"] == 1]
df2 = data[data["stroke"] == 0]

# Create two traces: df1 holds the stroke=1 rows and df2 the stroke=0 rows
trace1 = go.Bar(x=df1.groupby('Residence_type')['id'].count().reset_index().rename({'id':'count'},axis=1)['Residence_type'], 
                y=df1.groupby('Residence_type')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'], 
                name="Stroke1")
trace2 = go.Bar(x=df2.groupby('Residence_type')['id'].count().reset_index().rename({'id':'count'},axis=1)['Residence_type'], 
                y=df2.groupby('Residence_type')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'], 
                name="Stroke0")
# Fill out  data with our traces
d = [trace1, trace2]
# Create layout and specify title, legend and so on
layout = go.Layout(title="Residence Types Wise distribution",
                   xaxis=dict(title="Residence types"),
                   yaxis=dict(title="Counts"),
                   legend=dict(x=1.0, y=0.5),
                   # Here annotations need to create legend title
                   annotations=[
                                dict(
                                    xref="paper",
                                    yref="paper",
                                    x=1.1,
                                    y=0.6,
                                    text="Stroke",
                                    showarrow=False
                                )],
                   barmode="group",
                   template='ggplot2')
# Create figure with all prepared data for plot
fig = go.Figure(data=d, layout=layout)
fig.show()

### <p style="color:darkblue">Smoking Status</p>

In [None]:
# Split the data by target class so that each class gets its own trace
df1 = data[data["stroke"] == 1]
df2 = data[data["stroke"] == 0]

# df1 holds the stroke=1 rows and df2 the stroke=0 rows, so the trace names must match
trace1 = go.Bar(x=df1.groupby('smoking_status')['id'].count().reset_index().rename({'id':'count'},axis=1)['smoking_status'], 
                y=df1.groupby('smoking_status')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'], 
                name="Stroke1")
trace2 = go.Bar(x=df2.groupby('smoking_status')['id'].count().reset_index().rename({'id':'count'},axis=1)['smoking_status'], 
                y=df2.groupby('smoking_status')['id'].count().reset_index().rename({'id':'count'},axis=1)['count'], 
                name="Stroke0")
# Fill out  data with our traces
d = [trace1, trace2]
# Create layout and specify title, legend and so on
layout = go.Layout(title="Smoking Status Wise distribution",
                   xaxis=dict(title="Smoking Status"),
                   yaxis=dict(title="Counts"),
                   legend=dict(x=1.0, y=0.5),
                   # Here annotations need to create legend title
                   annotations=[
                                dict(
                                    xref="paper",
                                    yref="paper",
                                    x=1.09,
                                    y=0.6,
                                    text='Stroke',
                                    showarrow=False
                                )],
                   barmode="group",
                   template='ggplot2')
# Create figure with all prepared data for plot
fig = go.Figure(data=d, layout=layout)
fig.show()

#### <i><p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:left;color:black;border-radius: 15px">&nbsp;&nbsp;&nbsp;&nbsp;Pair Plots</p></i>

In [None]:
data.head()

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
sns.pairplot(data[['gender','age','hypertension','heart_disease','avg_glucose_level','bmi','stroke']],hue='stroke',kind='kde')
plt.show()

#### <i><p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:left;color:black;border-radius: 15px">&nbsp;&nbsp;&nbsp;&nbsp;Insights from EDA</p></i>

* Age, hypertension, and heart disease do not seem to separate stroke cases on their own: in the pair plot above, the density peaks for stroke=1 and stroke=0 overlap, which suggests these variables are not strong enough to separate or explain stroke.
* The distribution of residence type does not differ significantly between stroke and non-stroke patients.
* There is a little variation in smoking status for stroke=1.
* A lot of extreme values are observed in avg_glucose_level.
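
Raw counts can hide differences when one class is much larger than the other, so one way to check the categorical insights above is to compare the stroke rate within each category. The sketch below uses `pd.crosstab` with row normalisation on a made-up frame; the values are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Residence_type": ["Urban", "Urban", "Rural", "Rural", "Urban", "Rural", "Urban", "Rural"],
    "stroke":         [0, 1, 0, 0, 1, 1, 0, 0],
})

# normalize="index" divides each row by its total, so every row sums to 1
rates = pd.crosstab(toy["Residence_type"], toy["stroke"], normalize="index")
print(rates)
```

The column labelled `1` is then the share of stroke cases within each residence type, which is easier to compare across categories than the grouped bar heights.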


<a id='3'></a>
### <p style="background-color:maroon; font-family:arial; font-size:160%; text-align:center; border-radius: 15px;color:white">Models</p>

<a id='3.1'></a>
### <p style="color:darkblue">Xgboost</p>

In [None]:
data.head()

In [None]:
### Generate Label encoders
le = LabelEncoder()
data['gender'] = le.fit_transform(data['gender'])
data['ever_married'] = le.fit_transform(data['ever_married'])
data['work_type'] = le.fit_transform(data['work_type'])
data['Residence_type'] = le.fit_transform(data['Residence_type'])
data['smoking_status'] = le.fit_transform(data['smoking_status'])
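
The cell above reuses a single `LabelEncoder`, so after the last `fit_transform` only the `smoking_status` mapping is still available. If the mappings are needed later (for example to read the one-hot column names), one option is to keep one encoder per column. This is a sketch on made-up data; `encoders` and `mapping` are names introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
})

encoders = {}  # column name -> fitted LabelEncoder
for col in ["gender", "ever_married"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder assigns codes in sorted order of the original labels
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)
```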

In [None]:
X = data.iloc[:,1:-1].copy()  # copy so that later column assignments do not act on a slice of data
Y = data.iloc[:,-1]

print('X Shape', X.shape)
print('Y Shape',Y.shape)

In [None]:
Y.unique()

In [None]:
### One-hot encode the columns gender, work_type and smoking_status
X['gender']=X['gender'].astype(object)
X['work_type']=X['work_type'].astype(object)
X['smoking_status']=X['smoking_status'].astype(object)
### dtype=int keeps the dummy columns numeric (recent pandas versions default to bool)
X=pd.concat([X,pd.get_dummies(X[['gender','work_type','smoking_status']],dtype=int)],axis=1).drop(['gender','work_type','smoking_status'],axis=1)

In [None]:
X.shape

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=123)

print('Shape of X_train',X_train.shape)
print('Shape of X_test',X_test.shape)
print('Shape of y_train',y_train.shape)
print('Shape of y_test',y_test.shape)

In [None]:
X.columns

In [None]:
### Define independent variable
predictors = ['age', 'hypertension', 'heart_disease', 'ever_married',
       'Residence_type', 'avg_glucose_level', 'bmi', 'gender_0', 'gender_1',
       'gender_2', 'work_type_0', 'work_type_1', 'work_type_2', 'work_type_3',
       'work_type_4', 'smoking_status_0', 'smoking_status_1',
       'smoking_status_2', 'smoking_status_3']

### Set a fixed number of iterations and fit the model

In [None]:
Y.value_counts()
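
The class counts matter because the models below pass `scale_pos_weight=19.5`. The XGBoost documentation suggests the ratio of negative to positive instances as a typical value for this parameter, and the 19.5 used here is presumably that ratio for this dataset. A minimal sketch on a made-up target (the counts are hypothetical):

```python
import pandas as pd

# Toy target with a heavy class imbalance (hypothetical counts)
y_toy = pd.Series([0] * 195 + [1] * 10)

counts = y_toy.value_counts()

# Ratio of negative (0) to positive (1) instances
scale_pos_weight = counts.loc[0] / counts.loc[1]
print(scale_pos_weight)
```

Computing the ratio on `y_train` rather than on the full target would keep test-set information out of the model configuration.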

<a id='3.1.1'></a>
### Round 1

In [None]:
## Define default model with 1000 estimators and pass these params to the CV method of XGB to get the optimal n_estimators.
## Pass this optimal n_estimator to the fit method of Xgb on train data
xgb1 = xgb.XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 nthread=4,
 seed=27,
 scale_pos_weight=19.5)

### Run CV to get the number of iterations based on early stopping rounds

In [None]:
### Define a function that runs xgb.cv to find the optimal number of iterations, keeping the other parameters fixed
### Inputs: an XGBClassifier holding the parameters to evaluate, the training features and the training labels
def modelfit(alg, dtrainX, dtrainY,predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    cvresult = None ## stays None when useTrainCV is False instead of raising an UnboundLocalError
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrainX[predictors].values, label=dtrainY)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
                          metrics={'auc'},early_stopping_rounds=early_stopping_rounds)
    return cvresult ## DataFrame with one row per boosting round, up to the best iteration found by early stopping
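
The idea behind using early stopping to pick `n_estimators` can be illustrated with a simplified rule: keep adding rounds while the validation score improves, and stop once it has not improved for a given number of rounds. The sketch below is a plain-Python illustration of that rule, not XGBoost's implementation; `best_round`, `patience` and the score list are hypothetical.

```python
def best_round(scores, patience):
    """Return the 1-based round with the best score, stopping the scan
    once `patience` consecutive rounds bring no improvement."""
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx + 1

# Made-up per-round validation AUC values
toy_auc = [0.70, 0.75, 0.78, 0.77, 0.76, 0.79, 0.74]

print(best_round(toy_auc, patience=2))
print(best_round(toy_auc, patience=3))
```

A larger `patience` tolerates temporary dips and can find a later, better round, at the cost of more boosting iterations. This is the role `early_stopping_rounds=50` plays in `modelfit`.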

In [None]:
### Returns the CV history; its number of rows is the optimal number of trees to grow
n_est=modelfit(xgb1, X_train, y_train,predictors)

In [None]:
### check the returned dataframe
n_est.shape[0]### 12 iterations

In [None]:
### Now set the optimal n_estimators 
xgb1.set_params(n_estimators=n_est.shape[0])

In [None]:
#Fit the algorithm on the data
xgb1.fit(X_train[predictors], y_train)

#Predict training set:
dtrain_predictions = xgb1.predict(X_train[predictors])

#Print model report:
print("\nModel Report Train")
print("Accuracy score : %.4g" % accuracy_score(y_train, dtrain_predictions))
print("precision_score  : %.4g" % precision_score(y_train, dtrain_predictions))
print("recall score : %.4g" % recall_score(y_train, dtrain_predictions))
print("F1 score : %.4g" % f1_score(y_train, dtrain_predictions))
print("Auc score : %.4g" % roc_auc_score(y_train, dtrain_predictions))
print("classification report :{}".format(classification_report(y_train, dtrain_predictions)))

### Insights from Round 1

* scale_pos_weight lifts the F1 score to 0.33 on the training set.
* Recall for the positive class (1) is close to 92%.
* Overall accuracy is 81%.
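
To see how a recall above 90% can coexist with an F1 score near 0.33, it helps to compute the metrics directly from confusion-matrix counts. The counts below are hypothetical, chosen only to land close to the figures quoted above; they are not the notebook's actual output.

```python
def summarize(tp, fp, fn, tn):
    """Compute the standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for an imbalanced problem with 100 positives and 1900 negatives
report = summarize(tp=92, fp=370, fn=8, tn=1530)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

With these counts, most positives are found, but only about one in five positive predictions is correct, and that low precision is what pulls the F1 score down.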

<a id='3.1.2'></a>
### Round 2 : Fine tuning model complexity using depth,min_child_weights,gamma

In [None]:
### Use grid search, keeping n_estimators from above (12), and tune max_depth, gamma and min_child_weight
## Define the grid

param_test1 = {
    'max_depth':np.arange(3,10,2),
    'min_child_weight':np.arange(1,6,2),
    'gamma':[i/10.0 for i in range(0,5)]
}

### Base estimator with Default values and n_estimators=12
gsearch1 = GridSearchCV(estimator = xgb.XGBClassifier( learning_rate =0.1, n_estimators=12, max_depth=5,
                                                      min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                      objective= 'binary:logistic', nthread=4, seed=27,scale_pos_weight=19.5), 
                                    param_grid = param_test1,scoring='f1',
                                    n_jobs=4,
                                    cv=5)

In [None]:
gsearch1.fit(X_train[predictors],y_train.values)
gsearch1.best_params_, gsearch1.best_score_
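
Before launching a grid search it can be useful to count how many model fits it implies. The sketch below enumerates a grid with the same value ranges as `param_test1` using `itertools.product`; `range` is used in place of `np.arange` to keep the example dependency-free.

```python
from itertools import product

# Same value ranges as param_test1
param_grid = {
    "max_depth": list(range(3, 10, 2)),         # 3, 5, 7, 9
    "min_child_weight": list(range(1, 6, 2)),   # 1, 3, 5
    "gamma": [i / 10.0 for i in range(0, 5)],   # 0.0 to 0.4
}
cv_folds = 5

combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds

print(len(combinations), "parameter combinations")
print(n_fits, "cross-validation fits")
```

`GridSearchCV` also refits the best combination on the full training data by default, which adds one more fit on top of this count.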

### Insight Round 2
* Lower F1 score compared to the Round 1 model.
* Tune depth and gamma a bit more, keeping the other parameters at their defaults with n_estimators=12.

<a id='3.1.3'></a>
### Round 3

In [None]:
### This round try the param value found from above with adjacent values for max_depth,min_child_weight and gamma
param_test2 = {
 'max_depth':[8,9,10],
 'gamma':[0.1,0.2,0.3]
}
gsearch2 = GridSearchCV(estimator = xgb.XGBClassifier( learning_rate=0.1, n_estimators=12, max_depth=5,
                                                      min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                      objective= 'binary:logistic', nthread=4,seed=27,scale_pos_weight=19.5), 
                        param_grid = param_test2, 
                        scoring='f1',
                        n_jobs=4, cv=5)


In [None]:
gsearch2.fit(X_train[predictors],y_train)
gsearch2.best_params_, gsearch2.best_score_

### Insight Round 3
* No improvement; set the parameters back to their defaults.

<a id='3.1.4'></a>
### Round 4

In [None]:
### Fit the Xgb with these parameters and get the optimal n_estimators 
## Pass this optimal n_estimator to the fit method of Xgb on train data
xgb2 = xgb.XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=8,
 min_child_weight=1,
 gamma=0.2,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 seed=27,
 scale_pos_weight=19.5)

In [None]:
### Returns the CV history; its number of rows is the optimal number of trees to grow
n_est_1=modelfit(xgb2, X_train, y_train,predictors)

In [None]:
n_est_1.shape[0]

In [None]:
### Now set the optimal n_estimators 
xgb2.set_params(n_estimators=n_est_1.shape[0])

In [None]:
#Fit the algorithm on the data
xgb2.fit(X_train[predictors], y_train)

#Predict training set:
dtrain_predictions = xgb2.predict(X_train[predictors])

#Print model report:
print("\nModel Report Train")
print("Accuracy score : %.4g" % accuracy_score(y_train, dtrain_predictions))
print("precision_score  : %.4g" % precision_score(y_train, dtrain_predictions))
print("recall score : %.4g" % recall_score(y_train, dtrain_predictions))
print("F1 score : %.4g" % f1_score(y_train, dtrain_predictions))
print("Auc score : %.4g" % roc_auc_score(y_train, dtrain_predictions))
print("classification report :{}".format(classification_report(y_train, dtrain_predictions)))

### Insight Round 4
* The F1 score improves over the Round 1 benchmark, reaching 0.61 on the training set.
* Set max_depth=8, gamma=0.2, n_estimators=29 and tune the regularization parameters (L1 and L2).

<a id='3.1.5'></a>
### Round 5

In [None]:
param_test3 = {
    'reg_alpha':[0.5, 1, 5, 10, 50],### regularization L1
    'reg_lambda':[5e-4, 1e-3, 5e-3] ### regularization L2
}
gsearch3 = GridSearchCV(estimator = xgb.XGBClassifier( learning_rate =0.1, n_estimators=29, max_depth=8,
                                                      min_child_weight=1, gamma=0.2, subsample=0.8, 
                                                      colsample_bytree=0.8,
                                                      objective= 'binary:logistic', nthread=4,seed=27,scale_pos_weight=19.5), 

                        param_grid = param_test3, 
                        scoring='f1',
                        n_jobs=4, cv=5)

gsearch3.fit(X_train[predictors],y_train)
gsearch3.best_params_, gsearch3.best_score_

<a id='3.1.6'></a>
### Round 6

In [None]:
### Fit the Xgb with these parameters and get the optimal n_estimators 
## Pass this optimal n_estimator to the fit method of Xgb on train data
xgb3 = xgb.XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=8,
 min_child_weight=1,
 gamma=0.2,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,reg_alpha=10,reg_lambda=0.0005,
 seed=27,scale_pos_weight=19.5)

In [None]:
### Returns the CV history; its number of rows is the optimal number of trees to grow
n_est_2=modelfit(xgb3, X_train, y_train,predictors)

In [None]:
n_est_2.shape[0]

In [None]:
### Now set the optimal n_estimators 
xgb3.set_params(n_estimators=n_est_2.shape[0])

In [None]:
#Fit the algorithm on the data
xgb3.fit(X_train[predictors], y_train)

#Predict training set:
dtrain_predictions = xgb3.predict(X_train[predictors])

#Print model report:
print("\nModel Report Train")
print("Accuracy score : %.4g" % accuracy_score(y_train, dtrain_predictions))
print("precision_score  : %.4g" % precision_score(y_train, dtrain_predictions))
print("recall score : %.4g" % recall_score(y_train, dtrain_predictions))
print("F1 score : %.4g" % f1_score(y_train, dtrain_predictions))
print("Auc score : %.4g" % roc_auc_score(y_train, dtrain_predictions))
print("classification report :{}".format(classification_report(y_train, dtrain_predictions)))

### Choose xgb2 model with F1 score 0.6

In [None]:
xgb2

In [None]:
#Fit the algorithm on the data
xgb2.fit(X_train[predictors], y_train)

#Predict test set:
dtest_predictions = xgb2.predict(X_test[predictors])

#Print model report:
print("\nModel Report Test")
print("Accuracy score : %.4g" % accuracy_score(y_test, dtest_predictions))
print("precision_score  : %.4g" % precision_score(y_test, dtest_predictions))
print("recall score : %.4g" % recall_score(y_test, dtest_predictions))
print("F1 score : %.4g" % f1_score(y_test, dtest_predictions))
print("Auc score : %.4g" % roc_auc_score(y_test, dtest_predictions))
print("classification report :{}".format(classification_report(y_test, dtest_predictions)))
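
The reports above pass hard class predictions to `roc_auc_score`. ROC AUC is usually computed from scores or probabilities, and the two can differ. The sketch below shows this on made-up numbers; with the fitted model, the scores would presumably come from `predict_proba(...)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.3, 0.45, 0.4, 0.9]

# Hard predictions at the usual 0.5 threshold
hard = [int(s >= 0.5) for s in scores]

auc_scores = roc_auc_score(y_true, scores)  # uses the full ranking of the scores
auc_hard = roc_auc_score(y_true, hard)      # only sees a single operating point

print(auc_scores, auc_hard)
```

Thresholding discards the ranking information among samples on the same side of the cut-off, so the value computed from hard labels reflects one operating point rather than the whole ROC curve.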


<a id='4'></a>
### <p style="background-color:maroon; font-family:arial; font-size:160%; text-align:center; border-radius: 15px;color:white">Take away notes</p>

* The model can be improved further by fine-tuning min_child_weight and colsample_bytree.
* Use stratified sampling when splitting into train and test sets.
* Try other algorithms such as LightGBM and CatBoost.
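
The stratified-sampling note above can be sketched with the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. The arrays below are made up for illustration; in the notebook the call would use `X`, `Y` and `stratify=Y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with 10% positives (hypothetical)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

# Both splits keep the original share of positives
print(y_tr.mean(), y_te.mean())
```

With a rare positive class, an unstratified split can by chance leave the test set with noticeably more or fewer positives than the training set, which makes the test metrics harder to interpret.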