# Machine Learning Approach for Predicting Stroke

### Data Collection

The data for stroke prediction were collected from the Kaggle stroke prediction dataset: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset. The dataset contains 5110 observations and 12 attributes. The first column contains the unique ID number of each patient.

### Step-1: Identifying the problem and data cleaning

#### Getting started: Load Libraries

In [1]:
import numpy as np   # linear algebra
import pandas as pd  # data loading and manipulation

# Read the CSV file; index_col=False tells pandas not to use the first column (id) as the index
data_frame = pd.read_csv("stroke-data.csv", index_col=False)

#### Load data

In [2]:
data_frame.shape

(5110, 12)

In [3]:
data_frame.describe()

| | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|---|---|---|---|---|---|---|---|
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 4909.0 | 5110.0 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 7.854067 | 0.21532 |
| min | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 23.5 | 0.0 |
| 50% | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 28.1 | 0.0 |
| 75% | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


The dataset has 5110 records, each with 12 columns.
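
Note that `describe()` summarizes only numeric columns by default, which is why its output covers 7 of the 12 columns. As a minimal sketch on a small made-up frame (the values below are illustrative and not taken from the dataset), passing `exclude=["number"]` summarizes the remaining text columns instead:

```python
import pandas as pd

# Toy frame with hypothetical values, mixing numeric and text columns
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "gender": ["Male", "Female", "Male", "Female"],
    "smoking_status": ["never smoked", "smokes", "never smoked", "never smoked"],
})

numeric_summary = toy.describe()                    # numeric columns only (default)
text_summary = toy.describe(exclude=["number"])     # count, unique, top, freq

print(numeric_summary.columns.tolist())
print(text_summary)
```

For text columns the summary reports the number of distinct values, the most frequent value, and its frequency rather than means and quantiles.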

#### Getting data information

In [4]:
data_frame.drop('id',axis=1,inplace=True)
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB


A concise summary of the data is shown above. It gives the data type of each column, the number of non-null values in each column, and how much memory the DataFrame uses.
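
The same three pieces of information can also be retrieved individually, which is handy when they are needed as values rather than printed text. The following is a minimal sketch on a made-up three-row frame (the column names mirror the dataset, the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; one bmi entry is missing
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
    "stroke": [1, 1, 0],
})

dtypes = toy.dtypes                                # data type of each column
non_null = toy.notnull().sum()                     # non-null count per column
memory_bytes = toy.memory_usage(deep=True).sum()   # total bytes, index included

print(dtypes)
print(non_null)
print(memory_bytes)
```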

#### Checking for missing values

In [5]:
data_frame.isnull().any()

gender               False
age                  False
hypertension         False
heart_disease        False
ever_married         False
work_type            False
Residence_type       False
avg_glucose_level    False
bmi                   True
smoking_status       False
stroke               False
dtype: bool

Missing values were detected in the body mass index (`bmi`) column. Next, we check the percentage of missing data in each column.

In [6]:
def missing_data_table(data_frame):
    # Number of missing values per column, largest first
    total_number = data_frame.isnull().sum().sort_values(ascending=False)
    # Fraction of rows with a missing value in each column
    percent = total_number / len(data_frame)
    # Both Series are indexed by column name, so the index labels each row
    missing_value_data_frame = pd.DataFrame({"total_number": total_number, "percent_missing": percent})
    return missing_value_data_frame
missing_data_table(data_frame)

| | total_number | percent_missing |
|---|---|---|
| bmi | 201 | 0.039335 |
| gender | 0 | 0.0 |
| age | 0 | 0.0 |
| hypertension | 0 | 0.0 |
| heart_disease | 0 | 0.0 |
| ever_married | 0 | 0.0 |
| work_type | 0 | 0.0 |
| Residence_type | 0 | 0.0 |
| avg_glucose_level | 0 | 0.0 |
| smoking_status | 0 | 0.0 |
| stroke | 0 | 0.0 |
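
The missing fraction can also be computed in one step: the mean of a boolean mask is the share of `True` values, so `isnull().mean()` gives the fraction of missing entries per column. Here is a minimal sketch on a made-up frame (values are illustrative); on the real data the same expression should reproduce 201 / 5110 ≈ 0.0393 for `bmi`:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; half of the bmi entries are missing
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Fraction of missing values per column, largest first
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
# The same figures expressed as percentages
missing_percent = (missing_fraction * 100).round(2)

print(missing_percent)
```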


In [7]:
data_frame.dropna(inplace=True)
data_frame.isnull().sum()

gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [8]:
print("Sample Number after removing missing data:%d"%(data_frame.shape[0]))
print("Feature After removing missing data:%d"%(data_frame.shape[1]))

Sample Number after removing missing data:4909
Feature After removing missing data:11


After dropping the rows with missing values, the sample size fell from 5110 to 4909. Imputation was not performed: only about 3.9% of the `bmi` values are missing (far below 50%), so dropping those rows loses little data, whereas imputing would alter the feature by introducing artificial values.
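
To make this trade-off concrete, the sketch below contrasts the two options on a small made-up frame (values are illustrative). `dropna` removes incomplete rows and leaves the observed values untouched. Median imputation, shown only for comparison and not used in this analysis, keeps every row but fills the gap with an artificial value, which here reduces the spread of `bmi`:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; one bmi entry is missing
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0],
})

# Option 1 (used here): drop rows that contain a missing value
dropped = toy.dropna()

# Option 2 (for comparison only): fill missing bmi with the column median
imputed = toy.assign(bmi=toy["bmi"].fillna(toy["bmi"].median()))

print(len(dropped), len(imputed))               # rows kept by each option
print(toy["bmi"].std(), imputed["bmi"].std())   # spread before and after imputing
```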

#### Saving the clean version of data

In [9]:
data_frame.to_csv("clean_data.csv")
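
One detail worth knowing when the file is read back: by default `to_csv` also writes the row index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below illustrates this on a made-up frame, using in-memory buffers in place of files; whether to pass `index=False` depends on how later steps load the file.

```python
import io
import pandas as pd

# Toy frame with hypothetical values; the index has a gap, as after dropna()
toy = pd.DataFrame({"age": [67.0, 80.0], "stroke": [1, 0]}, index=[0, 2])

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```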