# **Classification of stroke diseases**

> **In the first stage, we carried out a detailed exploratory analysis of the stroke dataset, and the results were very useful. You can view that kernel at the following link:
https://www.kaggle.com/alimohammedbakhiet/eda-for-stroke-dataset**

> **In the second stage, we will apply machine learning algorithms and neural networks to that data and aim for the highest possible accuracy on the test data.**

> **Let's have fun...**

> **In the first step we will do the usual things, reading and cleaning the data, and then we will move on to splitting the data and building the models:**

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
data=pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

In [None]:
data.head(5)

In [None]:
data.shape

**We will now drop some columns, replace others, and transform the data.**

In [None]:
data=data.drop(["id"], axis=1)

**To handle the column with missing values, we fill the gaps with the column's median, and then we drop the original column and keep the new, filled column in its place.**

In [None]:
# Here we create a new column with the missing values filled in; the original is dropped in the next cell.
bmi_median = data.bmi.median()
data['bmi_median'] = data.bmi.fillna(bmi_median)

In [None]:
# Here I drop the original column; we no longer need it.
data=data.drop(["bmi"], axis=1)

**Note: we explained in the first part of the project why we chose the median to fill the null values. As a quick reminder, it was because of the normal distribution of both columns.**
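
The median fill described above can be sketched on a toy column. The numbers below are made up for illustration and do not come from the stroke dataset; `toy` and `toy_median` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made-up numbers, not from the dataset).
toy = pd.DataFrame({"bmi": [22.0, np.nan, 31.5, 27.0, np.nan, 40.0]})

# median() skips NaN by default, so it is computed from the four known values.
toy_median = toy["bmi"].median()

# Fill the gaps in a new column and confirm nothing is missing afterwards.
toy["bmi_median"] = toy["bmi"].fillna(toy_median)
print(toy_median)
print(toy["bmi_median"].isnull().sum())
```

Because the median is computed only from the known values, the filled rows sit in the middle of the observed range and are not pulled around by extreme values.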

In [None]:
data.isnull().sum()

# **Here I convert the data.**

**I also explained the data conversion process in the first part.**

In [None]:
data["gender"]=data["gender"].map({"Male":0 , "Female":1 , "Other":2})
data["ever_married"]=data["ever_married"].map({"Yes":1 , "No":0 })
data["work_type"]=data["work_type"].map({'Private':0, 'Self-employed':1, 'Govt_job':2, 'children':3, 'Never_worked':4 })
data["smoking_status"]=data["smoking_status"].map({'formerly smoked':0, 'never smoked':1, 'smokes':2, 'Unknown':3 })
data["Residence_type"]=data["Residence_type"].map({'Urban':0, 'Rural':1})
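
One thing worth checking after a dictionary-based `map` is whether every category was covered, since any value missing from the dictionary becomes `NaN`. The sketch below uses a made-up toy column (`toy_gender` is a hypothetical name) to show the check.

```python
import pandas as pd

# Toy column; "Unknown" is deliberately left out of the mapping below.
toy_gender = pd.Series(["Male", "Female", "Unknown", "Female"])
mapping = {"Male": 0, "Female": 1}

encoded = toy_gender.map(mapping)

# Any category missing from the dictionary becomes NaN, so count them.
n_unmapped = int(encoded.isnull().sum())
print(n_unmapped)
```

In this notebook, the later call to `data.isnull().sum()` style checks would reveal the same problem if a category had been missed.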

In [None]:
data.head()

# **Data scaling**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

In [None]:
# Here I separate the target column from the rest of the columns (an initial separation of the data).
target=data["stroke"]
features=data.drop(["stroke"],axis=1)

In [None]:
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
X = scaler.fit_transform(features)
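
`StandardScaler` subtracts each column's mean and divides by its standard deviation. The cell above fits it on the full feature matrix; a common alternative, sketched here on synthetic data, is to fit it on the training split only and reuse those statistics for the held-out rows. All `toy_*` names are hypothetical and the data is random.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, only for illustration.
rng = np.random.default_rng(0)
toy_X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_y = rng.integers(0, 2, size=200)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

# Learn the mean and standard deviation from the training rows only...
toy_scaler = StandardScaler()
toy_X_train_scaled = toy_scaler.fit_transform(toy_X_train)
# ...and apply those same statistics to the held-out rows.
toy_X_test_scaled = toy_scaler.transform(toy_X_test)

print(toy_X_train_scaled.mean(axis=0).round(3))
print(toy_X_train_scaled.std(axis=0).round(3))
```

With this ordering, the held-out rows do not influence the scaling statistics, which keeps the test score a little more honest.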

# **Let's start building some machine learning algorithms:**

**Before applying algorithms to the data, let's first look at how well the candidate algorithms perform on the data we have.**

In [None]:
# Here I complete the data separation
x_train,x_test_and_val, y_train, y_test_and_val = train_test_split(X,target,test_size=0.25,random_state=0)

In [None]:
print(x_train.shape,x_test_and_val.shape,y_train.shape,y_test_and_val.shape)

In [None]:
# This further split provides a validation set for evaluating the neural network.
x_val ,x_test ,y_val , y_test= train_test_split(x_test_and_val,y_test_and_val,test_size=0.2,random_state=0)

In [None]:
print(x_val.shape, y_val.shape ,x_test.shape ,y_test.shape )
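
To see what proportions the two `train_test_split` calls above produce, here is a sketch on a synthetic array of 1000 rows with the same `test_size` values. The `toy_*`, `tr_*`, `val_*` and `te_*` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic rows, two features, alternating labels.
toy_X = np.arange(2000).reshape(1000, 2)
toy_y = np.arange(1000) % 2

# First split: 75% train, 25% held out.
tr_X, rest_X, tr_y, rest_y = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)
# Second split: the held-out part becomes 80% validation, 20% test.
val_X, te_X, val_y, te_y = train_test_split(
    rest_X, rest_y, test_size=0.2, random_state=0
)

print(len(tr_X), len(val_X), len(te_X))  # 750 200 50
```

So the overall proportions are roughly 75% training, 20% validation, and 5% test.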

In [None]:
%pip install lazypredict

In [None]:
from lazypredict.Supervised import LazyClassifier

In [None]:
model = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = model.fit(x_train, x_test, y_train, y_test)

In [None]:
print(models)

**Very good. The results tell us that many algorithms perform well on this data and reach a high level of accuracy, so we will not try many of them; we will apply just a few, with a little tuning, to demonstrate what can be done.**

# **1 . Random Forest Algorithm**

**Since our data contains some outliers, as I noted in the first stage, I'm going to use Random Forest because it handles outliers well.**

**Here we use a handy tool called GridSearchCV to try a set of candidate values and pick the combination that works best for the data.**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
RandomForestClassifierModel=RandomForestClassifier()

In [None]:
parameters = {
    "n_estimators":[50,70,100,150,200],
    "max_depth":[7,11,13,15,32,None]
    
}

In [None]:
# I pass the classifier, the parameter grid, and the number of cross-validation folds to GridSearchCV.
cv = GridSearchCV(RandomForestClassifierModel,parameters,cv=5)
cv.fit(x_train, y_train)

In [None]:
# This helper prints the best parameters and the mean score of every combination tried.
# It is named display_results so it does not shadow IPython's built-in display().
def display_results(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    # Distinct loop names keep the lists above from being overwritten.
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean,3)} + or -{round(std,3)} for the {param}')

In [None]:
display_results(cv)
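
As an alternative to printing every line, the same `cv_results_` dictionary can be loaded into a pandas DataFrame and sorted by rank. This is a minimal sketch on synthetic data with a much smaller grid than the one above; `toy_grid`, `toy_search` and `results_table` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a deliberately tiny grid so the search runs quickly.
toy_X, toy_y = make_classification(n_samples=200, n_features=6, random_state=0)
toy_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

toy_search = GridSearchCV(RandomForestClassifier(random_state=0), toy_grid, cv=3)
toy_search.fit(toy_X, toy_y)

# One row per parameter combination, best-ranked combination first.
results_table = (
    pd.DataFrame(toy_search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results_table)
```

A table like this makes it easier to see how close the runner-up combinations are to the best one.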

**I will now set the model's hyperparameters, which play a major role in the accuracy we will reach.**

In [None]:
# max_features='sqrt' matches the old 'auto' setting for classifiers;
# min_impurity_split is omitted because it was removed in scikit-learn 1.0.
RandomForestClassifierModel=RandomForestClassifier(n_estimators=70, criterion='gini', max_depth=7,
                                min_samples_split=2, min_samples_leaf=1,min_weight_fraction_leaf=0.0,
                                max_features='sqrt',max_leaf_nodes=7,min_impurity_decrease=0.0,
                                bootstrap=True,oob_score=False, n_jobs=-1,
                                random_state=0, verbose=0,warm_start=True)
RandomForestClassifierModel.fit(x_train, y_train)

# **The results we obtained**

In [None]:
#Calculating Details
print('RandomForestClassifierModel Train Score is : ' , RandomForestClassifierModel.score(x_train, y_train))
# And now we will see the accuracy of the model in the test data.
print('RandomForestClassifierModel Test Score is : ' , RandomForestClassifierModel.score(x_test, y_test))
# This instruction calculates the percentage of importance for each of the features.
print('RandomForestClassifierModel features importances are : ' , RandomForestClassifierModel.feature_importances_)
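
The raw `feature_importances_` array is easier to read when each value is paired with a column name. Below is a sketch on synthetic data; `toy_names` and `toy_forest` are hypothetical, and in this notebook the names would presumably come from `features.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with made-up column names.
toy_names = ["f0", "f1", "f2", "f3", "f4"]
toy_X, toy_y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
toy_forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# Pair each importance with its name and sort from most to least important.
toy_importances = pd.Series(
    toy_forest.feature_importances_, index=toy_names
).sort_values(ascending=False)
print(toy_importances)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.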

**Excellent. The accuracy on the test data exceeded the 95% mark, with a difference of only 0.002 percent from the accuracy on the training data. This puts us on the safe side: the model we built does not appear to suffer from overfitting, and this is very good.**

**Now let's test the model we built together. Let's go.**

In [None]:
#Calculating Prediction
y_pred = RandomForestClassifierModel.predict(x_test)
y_pred_prob = RandomForestClassifierModel.predict_proba(x_test)
print('Predicted Value for RandomForestClassifierModel is : ' , y_pred[:15])
print("real values of y_test>>>>>>>>>>>>>>>>>>>>>>>>>>>is : \n" ,y_test[:15] )
print('Prediction Probabilities Value for RandomForestClassifierModel is : ' , y_pred_prob[:1])

**Now let's count all the classification errors and display them in our favorite tool, the confusion matrix.**

In [None]:
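#Calculating Confusion Matrix
from sklearn.metrics import confusion_matrix,classification_report,ConfusionMatrixDisplay
confusion_matrix(y_test,y_pred)

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it.
ConfusionMatrixDisplay.from_estimator(RandomForestClassifierModel,x_test,y_test);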

**In the end, the model got 12 predictions wrong out of 256, so we have 244 correct predictions. I would like to send a message to the model we built and tell it: it's okay, you did a good job, and tomorrow you will do a much better job than this.**
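
The counts quoted above translate directly into an accuracy figure, and the four cells of a binary confusion matrix can be unpacked to compute other measures such as recall. The first part of this sketch uses the numbers from the text; the second part uses made-up toy labels (`toy_true`, `toy_pred`), not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Counts quoted in the text: 12 wrong out of 256 test predictions.
n_total, n_wrong = 256, 12
accuracy_from_counts = (n_total - n_wrong) / n_total
print(round(accuracy_from_counts, 4))  # 0.9531

# Toy labels (made up) to show how the four cells are read.
toy_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
toy_pred = np.array([0, 0, 1, 0, 1, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

# Recall: the share of actual positives that the model found.
toy_recall = tp / (tp + fn)
print(tn, fp, fn, tp, round(toy_recall, 3))
```

Looking at recall alongside accuracy shows how many of the actual positive cases were caught, which accuracy alone does not reveal.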

# **Let's build a deep learning algorithm**

**To begin with, I will not spend much effort on the neural network. I will build a small, uncomplicated network that simply serves the purpose, and we will not over-tune it to get higher results, because a small network is appropriate for the data we have.**

In [None]:
# Now we are going to use a neural network for classification
# Here we import the libraries that we need.
# (The unused KerasClassifier and cross_val_score imports are dropped;
# keras.wrappers.scikit_learn is no longer available in recent Keras releases.)
import keras
from keras.models import Sequential # empty neural network
from keras.layers import Dense # fully connected layer
from keras.layers import Dropout
from keras import regularizers

In [None]:
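# Apply the same row-wise L2 normalization to the validation set so it matches the training input.
x_train, x_val = keras.utils.normalize(x_train, axis=1), keras.utils.normalize(x_val, axis=1)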
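
With its default settings and `axis=1`, `keras.utils.normalize` divides each row by its L2 norm. The NumPy sketch below reproduces that behavior on a toy array so the effect is visible; `l2_normalize_rows` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def l2_normalize_rows(x):
    """Divide every row by its Euclidean (L2) norm."""
    norms = np.linalg.norm(x, ord=2, axis=1, keepdims=True)
    # Leave all-zero rows unchanged instead of dividing by zero.
    norms[norms == 0] = 1.0
    return x / norms

toy = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
toy_normalized = l2_normalize_rows(toy)
print(toy_normalized)
```

After this step every non-zero row has length 1, so only the direction of each sample's feature vector is kept.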

In [None]:
model= Sequential([
    Dense(100, activation='relu', input_shape=(10,)),
    Dropout(0.5),
    Dense(100, activation='relu'),
    Dropout(0.5),
    Dense(50, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='relu'),
    Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
              
hist = model.fit(x_train, y_train,
          batch_size=5, epochs=5,
          validation_data=(x_val, y_val))
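
The final layer above uses a sigmoid activation and the model is trained with binary cross-entropy. As a rough sketch of what those two pieces compute, here they are in plain NumPy on three made-up values. The `eps` clipping is an assumption added for numerical safety, and `toy_logits` and `toy_labels` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    # Clip so that log(0) can never occur.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

toy_logits = np.array([2.0, -1.0, 0.0])   # raw outputs before the sigmoid
toy_labels = np.array([1.0, 0.0, 1.0])    # true classes

toy_probs = sigmoid(toy_logits)
toy_loss = binary_crossentropy(toy_labels, toy_probs)
print(toy_probs.round(3), round(toy_loss, 4))
```

The loss shrinks as the predicted probability for the true class approaches 1, which is what the optimizer pushes the network toward during training.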

In [None]:
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper right')
plt.show()

In [None]:
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='lower right')
plt.show()

**It was a quick neural network, built to get fast but good results.**

**In the end, the results we got from that network are very good considering how little effort we put into building it.**
**We also modified this simple code and fixed some of the problems it had.**

# **The end**

**I applied a few quick algorithms and reached an accuracy of 95% on the evaluation data.
Thank you for your time.**