# Handle missing data

#### Context
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

#### Attribute Information
1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

In [155]:
import pandas as pd
import numpy as np

In [156]:
data = pd.read_csv('./healthcare-dataset-stroke-data.csv')
data_copy = data.copy()  # independent copy, used later for the imputation example
data.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [157]:
data.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

### How many missing data points do we have?


In [158]:
data.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

#### Some values in the bmi column are missing. Let's see what percentage of the column that is, to get a better sense of scale

In [159]:
# How many total missing values do we have?
total_cells = np.prod(data['bmi'].shape)
total_missing = data["bmi"].isnull().sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print('{:.2f} %'.format(percent_missing))

3.93 %


#### 3.93% of the values in the 'bmi' column are missing
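
The same percentage can be computed for every column at once, which helps when more than one column has gaps. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the stroke dataset); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the stroke data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "smoking_status": ["smokes", "Unknown", "never smoked", "smokes"],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values.
percent_missing_by_column = toy.isnull().mean() * 100
print(percent_missing_by_column.round(2))
```

Note that the string "Unknown" in `smoking_status` is not counted here, because pandas only treats NaN/None as missing.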

### What might be going on, and how will it affect our analysis?
##### Is this value missing because it wasn't recorded or because it doesn't exist?
*  If it is missing because it really does not exist, it doesn't make sense to guess the value; keep it as NaN.
*  But if it is missing because of a recording error, we may try to guess the value (one way is to use the column mean instead).
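
One rough way to probe this question is to check whether the rows with a missing value look different from the rest, for example by comparing the outcome rate in both groups. The sketch below uses made-up values (the `toy` frame and the `bmi_state` column are hypothetical names, not part of the dataset); a large gap between the two groups would suggest that dropping or naively filling those rows could bias the analysis.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from the stroke dataset.
toy = pd.DataFrame({
    "bmi":    [36.6, np.nan, 32.5, np.nan, 24.0, 29.0],
    "stroke": [1,    1,      0,    1,      0,    0],
})

# Label each row by whether bmi was recorded, then compare stroke rates per group.
toy["bmi_state"] = np.where(toy["bmi"].isnull(), "missing", "recorded")
stroke_rate_by_state = toy.groupby("bmi_state")["stroke"].mean()
print(stroke_rate_by_state)
```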


In [160]:
# let's take a look at the column with missing values 
pd.DataFrame(data['bmi'])

Unnamed: 0,bmi
0,36.6
1,
2,32.5
3,34.4
4,24.0
...,...
5105,
5106,40.0
5107,30.6
5108,25.6


##### Looking at the dataset documentation, you will find that bmi is a value calculated for every person, so it is most likely missing because of an error in recording it
* It should be safe to drop those rows, since any guess may bias our results (only about 4% of the rows are affected, so dropping them is probably not a problem)

In [161]:
# Drop the rows where bmi is NaN (equivalently, keep only the rows where it is not NaN)
# data.dropna(inplace=True)
data = data[data['bmi'].notna()]
data.head(10)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1
10,12109,Female,81.0,1,0,Yes,Private,Rural,80.43,29.7,never smoked,1
11,12095,Female,61.0,0,1,Yes,Govt_job,Rural,120.46,36.8,smokes,1


##### The 201 rows with a missing bmi have been dropped; the dataset now has 4909 rows
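
A quick way to confirm a drop like this is to compare row counts before and after. The sketch below does so on a small made-up frame (`toy`, `cleaned` and the values are assumptions for illustration). It uses `dropna(subset=...)`, which only looks at the named column, and resets the index so the remaining rows are numbered consecutively.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing bmi values (made up for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "bmi": [36.6, np.nan, 32.5, np.nan, 24.0],
})

rows_before = len(toy)

# Drop only the rows where bmi is NaN; other columns are not inspected.
cleaned = toy.dropna(subset=["bmi"]).reset_index(drop=True)

rows_dropped = rows_before - len(cleaned)
print(rows_before, len(cleaned), rows_dropped)
```

Resetting the index is optional; without it the original labels are kept, which is why the output above jumps from 0 to 2 and from 7 to 9.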

In [162]:
# Let's look at the other approach, just to learn about it; we won't use it in our case
# Fill the NaN values in bmi with the column mean
mean_val = data_copy['bmi'].mean()
data_copy.fillna({'bmi': mean_val})

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.600000,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.893237,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.500000,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.400000,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.000000,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,28.893237,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.000000,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.600000,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.600000,formerly smoked,0
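
One detail worth noting about the cell above: `fillna` returns a new object and leaves the original untouched unless the result is assigned back. The following is a minimal sketch on made-up values (`toy` and `filled` are hypothetical names) that shows this behaviour.

```python
import numpy as np
import pandas as pd

# Toy column with two gaps (values made up for illustration).
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, np.nan, 24.0]})

# mean() skips NaN values by default.
mean_bmi = toy["bmi"].mean()

# Passing a dict restricts the fill to the named column.
filled = toy.fillna({"bmi": mean_bmi})

print(toy["bmi"].isnull().sum())     # the original still contains its NaN values
print(filled["bmi"].isnull().sum())  # the returned copy does not
```

To keep the result, assign it back, for example `toy = toy.fillna({"bmi": mean_bmi})`.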


##### Another way
##### Replace each NaN with the value that comes directly after it in the same column, then replace any remaining NaN values with 0


In [163]:
# data.bfill(axis=0).fillna(0)
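
To see what backward fill does, here is a small sketch on a made-up Series (the values are assumptions for illustration). Each NaN takes the next valid value below it; a trailing NaN has nothing after it, so it stays NaN until the final `fillna(0)`.

```python
import numpy as np
import pandas as pd

# Made-up bmi values with gaps at the start, middle and end.
s = pd.Series([np.nan, 32.5, np.nan, np.nan, 24.0, np.nan], name="bmi")

# Backward fill first, then replace whatever is still missing with 0.
filled = s.bfill().fillna(0)
print(filled.tolist())  # [32.5, 32.5, 24.0, 24.0, 24.0, 0.0]
```

Backward fill assumes that neighbouring rows are related, which is reasonable for ordered data such as time series but questionable when rows are independent patients.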

# THANK YOU, WE ARE DONE