<h1 style="background-color: rgb(200, 75, 90);padding:10px;border-radius:10px 10px 10px 10px;text-align:center">Brain Stroke Prediction</h1>

<img src="https://static.toiimg.com/photo/msid-87343087/87343087.jpg?79752" >

<span style="border-left: 6px solid rgb(200, 100, 120);
  height: 30px;font-size:30px;"></span>
<span style="color:rgb(200, 50, 70);margin: 10px"><b>Context</b></span>

****A stroke is a medical condition in which poor blood flow to the brain causes cell death. There are two main types of stroke: ischemic, due to lack of blood flow, and hemorrhagic, due to bleeding. Both cause parts of the brain to stop functioning properly. Signs and symptoms of a stroke may include an inability to move or feel on one side of the body, problems understanding or speaking, dizziness, or loss of vision to one side. Signs and symptoms often appear soon after the stroke has occurred. If symptoms last less than one or two hours, the stroke is a transient ischemic attack (TIA), also called a mini-stroke. A hemorrhagic stroke may also be associated with a severe headache. The symptoms of a stroke can be permanent. Long-term complications may include pneumonia and loss of bladder control.****

****The main risk factor for stroke is high blood pressure. Other risk factors include high blood cholesterol, tobacco smoking, obesity, diabetes mellitus, a previous TIA, end-stage kidney disease, and atrial fibrillation. An ischemic stroke is typically caused by blockage of a blood vessel, though there are also less common causes. A hemorrhagic stroke is caused by either bleeding directly into the brain or into the space between the brain's membranes. Bleeding may occur due to a ruptured brain aneurysm. Diagnosis is typically based on a physical exam and supported by medical imaging such as a CT scan or MRI scan. A CT scan can rule out bleeding, but may not necessarily rule out ischemia, which early on typically does not show up on a CT scan. Other tests such as an electrocardiogram (ECG) and blood tests are done to determine risk factors and rule out other possible causes. Low blood sugar may cause similar symptoms.****

****Prevention includes decreasing risk factors, surgery to open up the arteries to the brain in those with problematic carotid narrowing, and warfarin in people with atrial fibrillation. Aspirin or statins may be recommended by physicians for prevention. A stroke or TIA often requires emergency care. An ischemic stroke, if detected within three to four and a half hours, may be treatable with a medication that can break down the clot. Some hemorrhagic strokes benefit from surgery. Treatment to attempt recovery of lost function is called stroke rehabilitation, and ideally takes place in a stroke unit; however, these are not available in much of the world.****

<span style="border-left: 6px solid rgb(200, 100, 120);
  height: 30px;font-size:30px;"></span>
<span style="color:rgb(200, 50, 70);margin: 10px"><b>Attribute Information</b></span>

<li style='font-size:16px'>gender: "Male", "Female" or "Other"
<li style='font-size:16px'>age: age of the patient
<li style='font-size:16px'>hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
<li style='font-size:16px'>heart_disease: 0 if the patient doesn't have any heart disease, 1 if the patient has a heart disease
<li style='font-size:16px'>ever_married: "No" or "Yes"
<li style='font-size:16px'>work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
<li style='font-size:16px'>Residence_type: "Rural" or "Urban"
<li style='font-size:16px'>avg_glucose_level: average glucose level in blood
<li style='font-size:16px'>bmi: body mass index
<li style='font-size:16px'>smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
<li style='font-size:16px'>stroke: 1 if the patient had a stroke or 0 if not
<br/>

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

<span style="border-left: 6px solid rgb(100, 220, 120);
  height: 30px;font-size:30px;"></span>
<span style="color:rgb(50, 150, 70);margin: 10px"><b>Importing Libraries and Loading Data</b></span>

In [None]:
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Resampling, model selection and metrics
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, f1_score, precision_recall_curve

# Gradient-boosting classifiers
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

%matplotlib inline

In [None]:
data = pd.read_csv('../input/full-filled-brain-stroke-dataset/full_data.csv')

In [None]:
data.head()

<span style="border-left: 6px solid rgb(100, 220, 120);
  height: 30px;font-size:30px;"></span>
<span style="color:rgb(50, 150, 70);margin: 10px"><b>Exploratory Data Analysis</b></span>

In [None]:
data.info()

In [None]:
data.describe()

##### Finding the categorical variables in the data.

In [None]:
cat_ = data.select_dtypes(include='O').keys()
cat_

##### Finding the unique categories in every categorical variable.

In [None]:
for c in cat_:
    print(f'{c}:  {data[c].unique()}')

##### Encoding all categorical variables.

In [None]:
# One-hot encode the columns that have more than two categories
data = pd.get_dummies(data, columns=['work_type', 'smoking_status'])
# Turn the two-valued columns into 0/1 flags (any gender other than 'Male' becomes 0)
data['gender'] = (data['gender'] == 'Male').astype(int)
data['ever_married'] = (data['ever_married'] == 'Yes').astype(int)
data['Residence_type'] = (data['Residence_type'] == 'Urban').astype(int)
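
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame. The rows are illustrative and not taken from the dataset, and `toy` and `encoded` are hypothetical names; `dtype=int` is passed only so the indicator columns print as 0/1.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "ever_married": ["Yes", "No", "Yes"],
    "work_type": ["Private", "Govt_job", "Private"],
})

# One-hot encoding: one indicator column per category of work_type.
encoded = pd.get_dummies(toy, columns=["work_type"], dtype=int)

# Binary flags: 1 for the named value, 0 for everything else.
encoded["gender"] = (encoded["gender"] == "Male").astype(int)
encoded["ever_married"] = (encoded["ever_married"] == "Yes").astype(int)

print(encoded)
```

Note that with this scheme "Other" ends up in the same group as "Female", because only "Male" maps to 1.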

In [None]:
data.head()

##### Correlation of all columns with the target variable stroke.

In [None]:
data.corr()['stroke'].sort_values(ascending=False)

<span style="border-left: 6px solid rgb(100, 220, 120);
  height: 30px;font-size:30px;"></span>
<span style="color:rgb(50, 150, 70);margin: 10px"><b>Target Visualization</b></span>

#### It seems that older people are more prone to brain strokes.

In [None]:
plt.figure(figsize=(12, 6))
ax = sns.histplot(x=data['age'][data['stroke']== 1], kde=True)
plt.xlabel('Age')
plt.ylabel('Count')
ax.lines[0].set_color('crimson')
plt.show()

#### A higher percentage of people have a stroke if they already suffer from some kind of heart disease.

In [None]:
# Stroke counts (index 0: no stroke, 1: stroke) for people without and with heart disease
a = data['stroke'][data['heart_disease'] == 0].value_counts()
b = data['stroke'][data['heart_disease'] == 1].value_counts()
# Percentage of each group that had a stroke
c, d = 100*a[1]/(a[0]+a[1]), 100*b[1]/(b[0]+b[1])

sns.barplot(x=['No heart disease', 'Heart disease'], y=[c, d])
plt.ylabel('Stroke cases (%)')
plt.show()
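
The same percentages can also be computed in one step, because the mean of a 0/1 column is the fraction of ones. This is a minimal sketch on made-up rows; `toy` and `stroke_pct` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only: 4 people without heart disease, 4 with.
toy = pd.DataFrame({
    "heart_disease": [0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":        [0, 0, 0, 1, 0, 1, 1, 0],
})

# Mean of the 0/1 stroke column within each group, as a percentage.
stroke_pct = 100 * toy.groupby("heart_disease")["stroke"].mean()
print(stroke_pct)
```

On the real data, the same pattern would apply to `hypertension` or any other 0/1 column.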

#### A higher percentage of people also have a stroke if they have hypertension.

In [None]:
# Stroke counts (index 0: no stroke, 1: stroke) for people without and with hypertension
a = data['stroke'][data['hypertension'] == 0].value_counts()
b = data['stroke'][data['hypertension'] == 1].value_counts()
# Percentage of each group that had a stroke
c, d = 100*a[1]/(a[0]+a[1]), 100*b[1]/(b[0]+b[1])

sns.barplot(x=['No hypertension', 'Hypertension'], y=[c, d])
plt.ylabel('Stroke cases (%)')
plt.show()

In [None]:
sns.lmplot(x='stroke', y='hypertension', data=data)
plt.show()

#### Strokes are also more likely in people who have a high glucose level, or are at risk of diabetes, than in people with a normal glucose level.

In [None]:
sns.lmplot(x='stroke', y='avg_glucose_level', data=data)
plt.show()

#### Smoking also increases the risk of brain stroke.

In [None]:
data.loc[:, ['stroke', 'smoking_status_smokes', 'smoking_status_formerly smoked', 'smoking_status_Unknown', 'smoking_status_never smoked']].corr()['stroke'].sort_values(ascending=False)

#### Women account for more of the stroke cases than men. This is a raw count, so it also reflects how many women and men are in the data.

In [None]:
plt.figure(figsize=(12, 6))
ax = sns.countplot(x=data['gender'][data['stroke']== 1])
plt.title('0: Female       1: Male')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
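
A count plot of stroke cases does not account for group size, so it can point the other way from the per-group stroke rate. The sketch below uses made-up rows to show the difference; `toy`, `counts` and `rates` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only: 6 women (0) and 2 men (1).
toy = pd.DataFrame({
    "gender": [0, 0, 0, 0, 0, 0, 1, 1],
    "stroke": [1, 1, 0, 0, 0, 0, 1, 0],
})

# Number of stroke cases per gender (what a count plot shows).
counts = toy.loc[toy["stroke"] == 1, "gender"].value_counts()

# Stroke rate within each gender (takes group size into account).
rates = toy.groupby("gender")["stroke"].mean()

print(counts)
print(rates)
```

Here women have more stroke cases (2 against 1) but a lower stroke rate (about 33% against 50%), which is why comparing rates is usually the safer reading.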

#### Surprisingly, people who have been married account for more of the stroke cases. This is an association in the counts, not evidence that marriage itself causes strokes.

In [None]:
plt.figure(figsize=(12, 6))
ax = sns.countplot(x=data['ever_married'][data['stroke']== 1])
plt.xlabel('Ever Married')
plt.ylabel('Count')
plt.title('0: No       1: Yes')
plt.show()

#### A high BMI (more weight relative to height) can also increase the chance of having a stroke.

In [None]:
sns.lmplot(x='stroke', y='bmi', data=data)
plt.show()

<span style="border-left: 6px solid rgb(100, 220, 120);
  height: 30px;font-size:30px;"></span>
<span style="color:rgb(50, 150, 70);margin: 10px"><b>Data Preprocessing</b></span>

In [None]:
sns.countplot(x='stroke', data=data)
plt.title('0: No       1: Yes')
plt.show()

#### The data is imbalanced.

In [None]:
X = data.drop(['stroke'], axis=1)
y = data['stroke']

#### Oversampling the data using the Synthetic Minority Oversampling Technique (SMOTE).

In [None]:
print('Before:')
print(y.value_counts())
smt = SMOTE(random_state=42)
X_smt, y_smt = smt.fit_resample(X, y)
print('\n\nAfter:')
print(y_smt.value_counts())
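
For intuition, SMOTE creates each synthetic row by picking a minority sample, picking one of its nearest minority neighbours, and placing a new point somewhere on the line between them. The sketch below is a simplified NumPy illustration of that idea, not the imbalanced-learn implementation; `smote_sketch`, its parameters and the toy points are all hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between the minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest neighbours per row

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6)
print(synthetic.shape)
```

Because the new rows are interpolated, 0/1 columns such as the one-hot encoded ones can take fractional values after resampling.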

#### Splitting data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_smt, y_smt, test_size=0.2, random_state=42)
(y_train.shape, y_test.shape)
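
Since the split above is applied to the oversampled data, synthetic rows can end up in the test set, which may make the scores look better than they would on real data. A common alternative is to split first and oversample only the training part. The sketch below illustrates that order on made-up data, using plain random duplication as a stand-in for SMOTE so that it stays self-contained; all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up imbalanced data: 90 negatives and 10 positives.
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 90 + [1] * 10)

# 1) Split first: shuffle the row indices and hold out 20% as the test set.
order = rng.permutation(len(y_demo))
test_idx, train_idx = order[:20], order[20:]
X_tr, y_tr = X_demo[train_idx], y_demo[train_idx]
X_te, y_te = X_demo[test_idx], y_demo[test_idx]

# 2) Oversample the training part only (random duplication, standing in for SMOTE).
minority = np.flatnonzero(y_tr == 1)
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = rng.choice(minority, size=n_extra, replace=True)
X_tr_bal = np.vstack([X_tr, X_tr[extra]])
y_tr_bal = np.concatenate([y_tr, y_tr[extra]])

print(np.bincount(y_tr_bal), len(y_te))
```

In this notebook that would amount to calling `train_test_split` on `X` and `y` first and then `smt.fit_resample` on the training part only, leaving the test set untouched.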

<span style="border-left: 6px solid rgb(100, 220, 120);
  height: 30px;font-size:30px;"></span>
<span style="color:rgb(50, 150, 70);margin: 10px"><b>Model Building</b></span>

In [None]:
xgb_model = XGBClassifier(random_state=42)
lgbm_model = LGBMClassifier(random_state=42)
cat_model = CatBoostClassifier(random_state=42, verbose=False)

#### Fine-tuning the model hyperparameters

In [None]:
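def fine_tune(model, param_grid):
    """Grid-search param_grid with 5-fold CV; return the best parameters and the best RMSE."""
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
    grid_search.fit(X_train, y_train)
    # best_score_ is the negated MSE, so flip the sign before taking the square root
    return grid_search.best_params_, np.sqrt(-grid_search.best_score_)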
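
A note on reading this score: the labels and the predicted classes are both 0 or 1, so each squared error is 1 for a mistake and 0 otherwise. The mean squared error is therefore the misclassification rate, and the RMSE is its square root. The sketch below checks this on made-up labels; the arrays are illustrative only.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])   # two mistakes out of ten

mse = np.mean((y_true - y_hat) ** 2)     # mean squared error
error_rate = np.mean(y_true != y_hat)    # fraction of wrong predictions
rmse = np.sqrt(mse)

print(mse, error_rate, rmse)
```

A lower RMSE here means higher accuracy, so ranking models by it is equivalent to ranking them by accuracy.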

In [None]:
param_grid_1 = [{'n_estimators': [20, 40, 60, 80, 100], 'max_depth': [2, 4, 6, 8], 'max_leaves': [50, 100, 200]}]
param_grid_2 = [{'n_estimators': [20, 40, 60, 80, 100], 'max_depth': [2, 3, 4, 6, 8], 'num_leaves': [50, 100, 200]}]
param_grid_3 = [{'n_estimators': [20, 40, 60, 80, 100], 'max_depth': [2, 3, 4, 6, 8]}]

a = fine_tune(xgb_model, param_grid_1)
b = fine_tune(lgbm_model, param_grid_2)
c = fine_tune(cat_model, param_grid_3)

In [None]:
print('RMSE values \n')
print('xgboost:      ', a[1])
print('lightgbm:     ', b[1])
print('catboost:     ', c[1])

In [None]:
print('XGBoost performs best, with the hyperparameters set to:', a[0])

In [None]:
model = XGBClassifier(max_depth=8, max_leaves=50, n_estimators=80)

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
print(f'Classification Report \n\n{classification_report(y_test,y_pred)}')

In [None]:
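# precision_recall_curve expects scores, so use the predicted probability of the positive class
y_score = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_score)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()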
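
A precision-recall curve is traced by sweeping a decision threshold over predicted scores and computing precision and recall at each threshold, which is why it needs scores rather than hard 0/1 labels to show more than a single operating point. The sketch below does this by hand on made-up scores; `pr_points` and the arrays are hypothetical.

```python
import numpy as np

def pr_points(y_true, scores, thresholds):
    """Return one (precision, recall) pair per threshold."""
    points = []
    for t in thresholds:
        pred = scores >= t                   # predict positive at or above the threshold
        tp = np.sum(pred & (y_true == 1))    # true positives
        precision = tp / pred.sum()          # assumes at least one positive prediction
        recall = tp / (y_true == 1).sum()
        points.append((float(precision), float(recall)))
    return points

# Made-up labels and scores for illustration only.
y_demo = np.array([0, 0, 1, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

points = pr_points(y_demo, scores_demo, thresholds=(0.3, 0.5, 0.75))
for p, r in points:
    print(f"precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold lowers recall in this example, which is the trade-off the curve is meant to show.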

In [None]:
print(f'In this case XGBoost performed best, with an F1-score of {f1_score(y_test, y_pred):.2f}.')

In this case XGBoost performed best, with an F1-score of 0.97.
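
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive and false-negative counts; `f1_from_counts` is a hypothetical helper and the counts are made up, not taken from this model.

```python
def f1_from_counts(tp, fp, fn):
    """F1-score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

print(f1_from_counts(tp=8, fp=2, fn=2))
print(f1_from_counts(tp=9, fp=1, fn=3))
```

Because it is a harmonic mean, the F1-score stays low unless precision and recall are both high.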

<h1 style="color:rgb(50, 200, 200);margin: 10px;font-size:25px;text-align:center">Thank you for checking out this notebook. I would love to have your suggestions.</h1>