# __Stroke Prediction__


## __Contents:__
> 1. Overview
> 2. Installation
> 3. Read the dataset
> 4. Explore Information
> 5. Cleaning the dataset
>> 1. Handling whitespaces
>> 2. Handling the Missing Values
>> 3. Handling the Outliers
> 6. Exploratory Data Analysis
>> 1. Questions & Answers



## __Overview:__

This project uses the [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset/code?datasetId=1120859&sortBy=voteCount) from Kaggle.


# <hr>

## __Installation:__

In [2]:
# !pip install plotly  # plotly.express ships inside the plotly package
# !pip install seaborn

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# <hr>

## __Read the Dataset:__

In [4]:
# Read Dataset
df = pd.read_csv('data/stroke-data.csv')
df_eda = df.copy()
df.head()

|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [5]:
df_eda.smoking_status.value_counts()

never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64

# <hr>

## __Cleaning the Dataset:__

### Handling whitespaces:

In [6]:
# strip whitespace from the column names
# (str.strip() returns a new Index, so assign it back)
df.columns = df.columns.str.strip()
df.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [7]:
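# strip whitespace in the values of the string columns
# (df_eda was copied from df before cleaning, so clean both frames)
cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
for frame in (df, df_eda):
    frame[cols] = frame[cols].apply(lambda x: x.str.strip())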

### Handling the Missing Values:

In [8]:
# Check for null values
df_eda.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [9]:
df_eda.bmi.mean()

28.893236911794673

In [10]:
df_eda.bmi.median()

28.1

In [11]:
# fill missing bmi values with the column mean (about 28.89)
df_eda['bmi'] = df_eda['bmi'].fillna(df_eda['bmi'].mean())

In [12]:
df_eda.isna().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [13]:
df_eda.smoking_status.value_counts()

never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64

### Handling the Outliers:

In [14]:
# Find Outliers
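
The outlier cell above is empty in the original notebook. One common option is the interquartile-range (IQR) rule, sketched below. The helper name `iqr_outliers`, the multiplier `k = 1.5`, and the toy values are assumptions made for illustration; they are not the author's method and not values from the stroke dataset.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Made-up values for illustration only (not taken from the stroke dataset).
toy = pd.DataFrame({"bmi": [22.0, 24.5, 27.1, 28.0, 29.3, 31.0, 97.6]})
mask = iqr_outliers(toy["bmi"])
print(toy[mask])  # rows flagged as outliers
```

In the notebook the same helper could be applied to the numeric columns, for example `iqr_outliers(df_eda['bmi'])` or `iqr_outliers(df_eda['avg_glucose_level'])`. Whether flagged rows should be dropped, capped, or kept is a separate decision.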

# <hr>

## __Exploratory Data Analysis:__

### Questions & Answers:

#### 1. Does age have an impact on strokes?

In [15]:
# Code: total number of strokes recorded at each age
age_group = df_eda.groupby('age')[['stroke']].sum()
age_group

| age | stroke |
|---|---|
| 0.08 | 0 |
| 0.16 | 0 |
| 0.24 | 0 |
| 0.32 | 0 |
| 0.40 | 0 |
| ... | ... |
| 78.00 | 21 |
| 79.00 | 17 |
| 80.00 | 17 |
| 81.00 | 14 |


In [16]:
# Visual
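
Summing strokes per exact age gives one row per distinct age and ignores how many people are in each row. A possible alternative, sketched here, buckets age into bands and reports a stroke rate per band. The function name `stroke_rate_by_age_band`, the 10-year band width, and the toy data are assumptions for illustration only.

```python
import pandas as pd


def stroke_rate_by_age_band(frame: pd.DataFrame, width: int = 10) -> pd.DataFrame:
    """Bucket `age` into left-closed bands of `width` years and report
    the number of people, the number of strokes, and the stroke rate."""
    upper = int(frame["age"].max() // width + 1) * width
    bins = list(range(0, upper + width, width))
    bands = pd.cut(frame["age"], bins=bins, right=False)
    out = frame.groupby(bands, observed=True)["stroke"].agg(
        people="count", strokes="sum"
    )
    out["stroke_rate"] = out["strokes"] / out["people"]
    return out


# Made-up rows for illustration only (not taken from the stroke dataset).
toy = pd.DataFrame({
    "age":    [5.0, 34.0, 45.0, 52.0, 58.0, 67.0, 71.0, 79.0],
    "stroke": [0,   0,    0,    0,    1,    1,    0,    1],
})
rates = stroke_rate_by_age_band(toy)
print(rates)
```

On the real data, `stroke_rate_by_age_band(df_eda)` would return a small table that could be passed to a bar chart, for example with `sns.barplot`.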


#### 2. Do body mass index and glucose level contribute to a stroke?

In [17]:
# Code: strokes per (bmi, avg_glucose_level) pair, largest first
bmi_group = df_eda.groupby(['bmi', 'avg_glucose_level'])[['stroke']].sum().sort_values(by='stroke', ascending=False)
bmi_group.head(60)

| bmi | avg_glucose_level | stroke |
|---|---|---|
| 28.893237 | 101.45 | 2 |
| 26.8 | 218.46 | 1 |
| 26.6 | 199.2 | 1 |
| 26.6 | 228.7 | 1 |
| 20.3 | 104.47 | 1 |
| 29.7 | 84.2 | 1 |
| 29.7 | 80.43 | 1 |
| 29.7 | 79.79 | 1 |
| 20.2 | 213.03 | 1 |
| 29.6 | 97.76 | 1 |


In [18]:
# Visual
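
Because `bmi` and `avg_glucose_level` are continuous, grouping on their raw values yields almost one group per person, and the only row above with more than one stroke has the mean-imputed BMI (28.893237). A possible alternative is to bin both variables first, as sketched below. The BMI cut points follow the usual 18.5 / 25 / 30 convention; the median split for glucose, the function name `stroke_rate_grid`, and the toy data are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def stroke_rate_grid(frame: pd.DataFrame) -> pd.DataFrame:
    """Stroke rate for each (BMI band, glucose band) combination.
    Rows with a missing bmi are dropped by groupby."""
    bmi_band = pd.cut(
        frame["bmi"],
        bins=[0, 18.5, 25, 30, np.inf],
        right=False,
        labels=["underweight", "normal", "overweight", "obese"],
    ).rename("bmi_band")
    # Assumption: split glucose at the sample median rather than at a
    # clinical threshold.
    median = frame["avg_glucose_level"].median()
    glucose_band = pd.Series(
        np.where(frame["avg_glucose_level"] > median,
                 "above median", "at or below median"),
        index=frame.index,
        name="glucose_band",
    )
    grouped = frame.groupby([bmi_band, glucose_band], observed=True)["stroke"]
    return grouped.mean().unstack()


# Made-up rows for illustration only (not taken from the stroke dataset).
toy = pd.DataFrame({
    "bmi":               [17.0, 22.0, 23.5, 27.0, 28.5, 31.0, 35.0, 33.0],
    "avg_glucose_level": [80,   85,   210,  90,   220,  95,   230,  240],
    "stroke":            [0,    0,    1,    0,    1,    0,    1,    0],
})
grid = stroke_rate_grid(toy)
print(grid)
```

Combinations with no rows show up as `NaN`. The resulting grid is small enough to read directly or to draw as a heatmap with `sns.heatmap`.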


#### 3. Assumption: Smoking can induce a stroke. Is it true?

In [19]:
# Code: total number of strokes in each smoking-status group
smoke_group = df_eda.groupby('smoking_status')[['stroke']].sum()
smoke_group

| smoking_status | stroke |
|---|---|
| Unknown | 47 |
| formerly smoked | 70 |
| never smoked | 90 |
| smokes | 42 |


In [20]:
# Visual
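
Raw stroke counts are hard to compare because the groups differ in size: "never smoked" has the most strokes but also the most people. The sketch below combines the two outputs already printed in this notebook (the `value_counts()` result and the stroke sums) into a rate per group. Only the variable names are new.

```python
import pandas as pd

# Group sizes, copied from the value_counts() output earlier in the notebook.
people = pd.Series(
    {"never smoked": 1892, "Unknown": 1544, "formerly smoked": 885, "smokes": 789},
    name="people",
)
# Stroke totals, copied from the groupby output above.
strokes = pd.Series(
    {"Unknown": 47, "formerly smoked": 70, "never smoked": 90, "smokes": 42},
    name="strokes",
)

# concat aligns the two Series on their index labels.
summary = pd.concat([people, strokes], axis=1)
summary["stroke_rate"] = summary["strokes"] / summary["people"]
print(summary.sort_values("stroke_rate", ascending=False).round(4))
```

Directly on the data, `df_eda.groupby('smoking_status')['stroke'].mean()` should give the same rates in one step. These are unadjusted rates, so they do not account for age or other factors.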


#### 4. Assumption: A person with heart disease is prone to a stroke. Is it true?

In [21]:
# Code: total number of strokes with (1) and without (0) heart disease
heart_group = df_eda.groupby('heart_disease')[['stroke']].sum()
heart_group

| heart_disease | stroke |
|---|---|
| 0 | 202 |
| 1 | 47 |


In [22]:
# Visual
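
The counts above do not show how many people are in each `heart_disease` group, so they cannot answer the question on their own. One way to compare the groups is a row-normalised cross-tabulation, sketched here with `pd.crosstab`. The toy rows are made up for illustration and are not taken from the stroke dataset.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the stroke dataset).
toy = pd.DataFrame({
    "heart_disease": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":        [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize="index" makes each row sum to 1, so column 1 holds the share
# of people in that heart_disease group who had a stroke.
table = pd.crosstab(toy["heart_disease"], toy["stroke"], normalize="index")
print(table)
```

In the notebook the equivalent call would be `pd.crosstab(df_eda['heart_disease'], df_eda['stroke'], normalize='index')`.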


#### 5. Assumption: A heavy workload raises blood pressure, which could lead to a stroke. Is it true?

In [23]:
# Code
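
This cell is empty in the original notebook. The dataset has no direct workload measure, so the sketch below treats `work_type` as a rough proxy and reports the hypertension rate and stroke rate for each work type. The proxy choice, the function name `workload_summary`, and the toy rows are all assumptions for illustration.

```python
import pandas as pd


def workload_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per work type: group size, share with hypertension, share with stroke."""
    return frame.groupby("work_type").agg(
        people=("stroke", "size"),
        hypertension_rate=("hypertension", "mean"),
        stroke_rate=("stroke", "mean"),
    )


# Made-up rows for illustration only (not taken from the stroke dataset).
toy = pd.DataFrame({
    "work_type":    ["Private"] * 4 + ["Self-employed"] * 4,
    "hypertension": [0, 0, 1, 1, 0, 1, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 0, 1, 1],
})
workload = workload_summary(toy)
print(workload)
```

Calling `workload_summary(df_eda)` would show whether work types with more hypertension also have more strokes. A pattern there would still be an association, not evidence that workload causes either.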

In [24]:
# Visual

#### 6. Assumption: Males are the most susceptible to strokes because of high work-related stress. Is it true?

In [25]:
# Code
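
This cell is also empty in the original notebook. A possible starting point is a pivot table of stroke rate by `gender` and `work_type`, with overall margins, sketched below. Using `work_type` as a stand-in for work-related stress is an assumption, and the toy rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the stroke dataset).
toy = pd.DataFrame({
    "gender":    ["Male", "Male", "Male", "Male",
                  "Female", "Female", "Female", "Female"],
    "work_type": ["Private", "Private", "Self-employed", "Self-employed",
                  "Private", "Private", "Self-employed", "Self-employed"],
    "stroke":    [0, 1, 1, 1, 0, 0, 0, 1],
})

# Mean of the 0/1 stroke flag is the stroke rate in each cell;
# margins=True adds an "All" row and column with the overall rates.
pivot = toy.pivot_table(
    index="gender", columns="work_type", values="stroke",
    aggfunc="mean", margins=True,
)
print(pivot)
```

The same call on `df_eda` would show the stroke rate per gender within each work type, which makes it easier to see whether any gender gap persists across work types.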

In [26]:
# Visual
