# Data Cleaning and Exploration

Resources:
- https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
- https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
- https://towardsdatascience.com/7-data-types-a-better-way-to-think-about-data-types-for-machine-learning-939fae99a689

In [1]:
# import all packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
! ls

README.md                           healthcare-dataset-stroke-data.csv
data_cleaning_and_exploration.ipynb


The dataset file is healthcare-dataset-stroke-data.csv.

In [5]:
# load the data
data = pd.read_csv('healthcare-dataset-stroke-data.csv')

# sneak peek at our data
data.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [7]:
# we can drop the id column: it is only a record identifier and doesn't affect the output (stroke)
data.drop("id", axis=1, inplace=True)

data.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [8]:
# Data size
number_of_entries, number_of_attributes = data.shape

print(f'There are entries for {number_of_entries} individuals')
print(f'There are {number_of_attributes} attributes')
print(f'There is 1 dependent variable and {number_of_attributes - 1} independent variables')

There are entries for 5110 individuals
There are 11 attributes
There is 1 dependent variable and 10 independent variables


In [9]:
# information about the data types in our dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB
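
The `Non-Null Count` column above shows that `bmi` has 4909 non-null values out of 5110, so 201 entries are missing, while every other column is complete. A minimal sketch of counting missing values per column follows; `toy` and its values are made up for illustration and are not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    'age': [67.0, 61.0, 80.0, 49.0],
    'bmi': [36.6, np.nan, 32.5, np.nan],
    'stroke': [1, 1, 1, 0],
})

# isna() marks missing cells; summing per column counts them
missing_per_column = toy.isna().sum()
print(missing_per_column)

# mean() of the boolean mask gives the share of missing bmi values
missing_bmi_share = toy['bmi'].isna().mean()
print(f'{missing_bmi_share:.0%} of bmi values are missing')
```

On the notebook's own frame the equivalent call would be `data.isna().sum()`.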


In [10]:
# insights about the dependent variable
data.stroke.unique()

array([1, 0])

The dependent variable is either 1 or 0, indicating a stroke or no stroke.

In [11]:
data.stroke.value_counts()

0    4861
1     249
Name: stroke, dtype: int64

The data is highly imbalanced: there are 4861 entries where the patient did not have a stroke and only 249 where the patient did. Class 0 outnumbers class 1 by about 19.5 to 1, so stroke cases make up only about 4.9% of the dataset.
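
The size of the imbalance can be checked with a few lines of arithmetic. The sketch below uses the two counts reported by `value_counts()` above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
no_stroke = 4861
stroke = 249
total = no_stroke + stroke

ratio = no_stroke / stroke           # majority-to-minority ratio
stroke_share = stroke / total        # fraction of positive cases
no_stroke_share = no_stroke / total  # fraction of negative cases

print(f'total entries: {total}')                     # 5110
print(f'class 0 : class 1 = {ratio:.2f} : 1')        # 19.52 : 1
print(f'stroke share = {stroke_share:.2%}')          # 4.87%
print(f'no-stroke share = {no_stroke_share:.2%}')    # 95.13%
```

With these shares, a classifier that always predicts 0 would already score about 95% accuracy, which is worth keeping in mind when a model is evaluated on this data.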

## Univariate Variable Analysis

We will analyze the numerical and categorical data types:
- numerical -> data represented by numbers (int, float, etc.)
- categorical -> all other data, particularly discrete labeled groups
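
One way to apply this split in pandas is to separate columns by dtype. The sketch below is an assumption about how that could look, not the notebook's own method: `toy` is a small made-up frame rather than the real dataset, and the extra check for 0/1 columns is added because such flags are stored as numbers but behave like labeled groups.

```python
import pandas as pd

# Toy frame with a mix of dtypes (values are made up)
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'age': [67.0, 61.0, 80.0, 49.0],
    'hypertension': [0, 0, 0, 1],
    'smoking_status': ['formerly smoked', 'never smoked', 'never smoked', 'smokes'],
})

# 'number' matches every integer and float dtype
numerical_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Numeric columns whose only values are 0 and 1 are really binary flags
binary_cols = [
    col for col in numerical_cols
    if set(toy[col].dropna().unique()) <= {0, 1}
]

print('numerical:  ', numerical_cols)
print('categorical:', categorical_cols)
print('binary flags among numerical:', binary_cols)
```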

### Numerical Variables



In [14]:
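# keep only the columns with a numeric dtype
# note: hypertension, heart_disease and stroke are int64 columns, so they are selected too
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = data.select_dtypes(include=numerics)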

newdf

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
0,67.0,0,1,228.69,36.6,1
1,61.0,0,0,202.21,,1
2,80.0,0,1,105.92,32.5,1
3,49.0,0,0,171.23,34.4,1
4,79.0,1,0,174.12,24.0,1
...,...,...,...,...,...,...
5105,80.0,1,0,83.75,,0
5106,81.0,0,0,125.20,40.0,0
5107,35.0,0,0,82.99,30.6,0
5108,51.0,0,0,166.29,25.6,0
