# Data Cleaning Steps:

1. **Handling Missing Values**: Filling in, interpolating, or removing missing data points to ensure completeness.
   
2. **Correcting Errors**: Identifying and fixing typographical errors, incorrect values, or inconsistencies within the dataset.

3. **Removing Duplicates**: Identifying and eliminating duplicate records to avoid redundancy and ensure data accuracy.

4. **Standardizing Formats**: Ensuring consistent formats for dates, numbers, and categorical values. For example, converting all date entries to a standard format (e.g., YYYY-MM-DD).

5. **Filtering Outliers**: Identifying and addressing outliers that may indicate errors or exceptional cases that need special handling.

6. **Normalizing Data**: Transforming data into a common scale or format, which might involve scaling numerical values or converting text to a uniform case.

7. **Validating Data**: Ensuring that data adheres to defined rules or constraints, such as checking if email addresses are in the correct format.

8. **Resolving Inconsistencies**: Addressing discrepancies in data entries, such as variations in naming conventions or conflicting information across records.

9. **Consolidating Data**: Merging data from different sources or combining related fields to create a more coherent dataset.
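Two of the steps above, removing duplicates (step 3) and validating data (step 7), can be sketched on a toy table. The `raw` frame, its values, and the deliberately loose email pattern are all assumptions made for illustration; none of them come from the stroke dataset used below.

```python
import pandas as pd

# Hypothetical toy records (not from the stroke dataset).
raw = pd.DataFrame({
    "email": ["a@example.com", "a@example.com", "not-an-email", "b@example.com"],
    "age": [34, 34, 51, 29],
})

# Step 3: drop rows that are exact duplicates of an earlier row.
deduped = raw.drop_duplicates().reset_index(drop=True)

# Step 7: flag values that break a rule. This pattern only checks for
# "something@something.something"; it is not a full email validator.
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"
deduped["email_ok"] = deduped["email"].str.fullmatch(EMAIL_PATTERN)

print(deduped)
```

Flagging invalid rows in a separate column, rather than dropping them immediately, keeps the decision about how to handle them explicit.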

# Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
import string
import os

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv(os.path.join("data", "heart_stroke_data.csv"))

In [3]:
df.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [4]:
df.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

## Numerical Data Cleaning

- not normalizing
- not transforming
- not scaling
- not imputing missing values
- bringing values to the same unit or format (e.g., a salary recorded as $100 vs. ₹100)
- replacing `np.nan` with a sentinel value far above or below the valid range
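The last two points can be sketched on a toy salary table. The `salaries` frame and the exchange rate `INR_PER_USD` are assumptions chosen only for illustration; they are not part of the stroke dataset, and the rate is not a real quote.

```python
import numpy as np
import pandas as pd

# Hypothetical salary records in mixed currencies.
salaries = pd.DataFrame({
    "amount": [100.0, 100.0, np.nan],
    "currency": ["USD", "INR", "USD"],
})

# Assumed, illustrative exchange rate; substitute a real rate before use.
INR_PER_USD = 80.0
units_per_usd = salaries["currency"].map({"USD": 1.0, "INR": INR_PER_USD})

# Bring every amount to the same unit (USD).
salaries["amount_usd"] = salaries["amount"] / units_per_usd

# Mark missing amounts with a sentinel that cannot be a real salary.
salaries["amount_usd"] = salaries["amount_usd"].fillna(-1)

print(salaries)
```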

In [5]:
NUM_COLS = ["age", "bmi", "avg_glucose_level"]

In [6]:
df[NUM_COLS].isna().sum()

age                    0
bmi                  201
avg_glucose_level      0
dtype: int64

In [7]:
print(df["age"].unique())

[6.70e+01 6.10e+01 8.00e+01 4.90e+01 7.90e+01 8.10e+01 7.40e+01 6.90e+01
 5.90e+01 7.80e+01 5.40e+01 5.00e+01 6.40e+01 7.50e+01 6.00e+01 5.70e+01
 7.10e+01 5.20e+01 8.20e+01 6.50e+01 5.80e+01 4.20e+01 4.80e+01 7.20e+01
 6.30e+01 7.60e+01 3.90e+01 7.70e+01 7.30e+01 5.60e+01 4.50e+01 7.00e+01
 6.60e+01 5.10e+01 4.30e+01 6.80e+01 4.70e+01 5.30e+01 3.80e+01 5.50e+01
 1.32e+00 4.60e+01 3.20e+01 1.40e+01 3.00e+00 8.00e+00 3.70e+01 4.00e+01
 3.50e+01 2.00e+01 4.40e+01 2.50e+01 2.70e+01 2.30e+01 1.70e+01 1.30e+01
 4.00e+00 1.60e+01 2.20e+01 3.00e+01 2.90e+01 1.10e+01 2.10e+01 1.80e+01
 3.30e+01 2.40e+01 3.40e+01 3.60e+01 6.40e-01 4.10e+01 8.80e-01 5.00e+00
 2.60e+01 3.10e+01 7.00e+00 1.20e+01 6.20e+01 2.00e+00 9.00e+00 1.50e+01
 2.80e+01 1.00e+01 1.80e+00 3.20e-01 1.08e+00 1.90e+01 6.00e+00 1.16e+00
 1.00e+00 1.40e+00 1.72e+00 2.40e-01 1.64e+00 1.56e+00 7.20e-01 1.88e+00
 1.24e+00 8.00e-01 4.00e-01 8.00e-02 1.48e+00 5.60e-01 4.80e-01 1.60e-01]


In [8]:
print(df["bmi"].unique())

[36.6  nan 32.5 34.4 24.  29.  27.4 22.8 24.2 29.7 36.8 27.3 28.2 30.9
 37.5 25.8 37.8 22.4 48.9 26.6 27.2 23.5 28.3 44.2 25.4 22.2 30.5 26.5
 33.7 23.1 32.  29.9 23.9 28.5 26.4 20.2 33.6 38.6 39.2 27.7 31.4 36.5
 33.2 32.8 40.4 25.3 30.2 47.5 20.3 30.  28.9 28.1 31.1 21.7 27.  24.1
 45.9 44.1 22.9 29.1 32.3 41.1 25.6 29.8 26.3 26.2 29.4 24.4 28.  28.8
 34.6 19.4 30.3 41.5 22.6 56.6 27.1 31.3 31.  31.7 35.8 28.4 20.1 26.7
 38.7 34.9 25.  23.8 21.8 27.5 24.6 32.9 26.1 31.9 34.1 36.9 37.3 45.7
 34.2 23.6 22.3 37.1 45.  25.5 30.8 37.4 34.5 27.9 29.5 46.  42.5 35.5
 26.9 45.5 31.5 33.  23.4 30.7 20.5 21.5 40.  28.6 42.2 29.6 35.4 16.9
 26.8 39.3 32.6 35.9 21.2 42.4 40.5 36.7 29.3 19.6 18.  17.6 19.1 50.1
 17.7 54.6 35.  22.  39.4 19.7 22.5 25.2 41.8 60.9 23.7 24.5 31.2 16.
 31.6 25.1 24.8 18.3 20.  19.5 36.  35.3 40.1 43.1 21.4 34.3 27.6 16.5
 24.3 25.7 21.9 38.4 25.9 54.7 18.6 24.9 48.2 20.7 39.5 23.3 64.8 35.1
 43.6 21.  47.3 16.6 21.6 15.5 35.6 16.7 41.9 16.4 17.1 29.2 37.9 44.6
 39.6 4

In [9]:
# Replace NaN with -1, a sentinel chosen because it is not a realistic BMI value
df["bmi"] = df["bmi"].fillna(-1)

In [10]:
print(df["bmi"].unique())

[36.6 -1.  32.5 34.4 24.  29.  27.4 22.8 24.2 29.7 36.8 27.3 28.2 30.9
 37.5 25.8 37.8 22.4 48.9 26.6 27.2 23.5 28.3 44.2 25.4 22.2 30.5 26.5
 33.7 23.1 32.  29.9 23.9 28.5 26.4 20.2 33.6 38.6 39.2 27.7 31.4 36.5
 33.2 32.8 40.4 25.3 30.2 47.5 20.3 30.  28.9 28.1 31.1 21.7 27.  24.1
 45.9 44.1 22.9 29.1 32.3 41.1 25.6 29.8 26.3 26.2 29.4 24.4 28.  28.8
 34.6 19.4 30.3 41.5 22.6 56.6 27.1 31.3 31.  31.7 35.8 28.4 20.1 26.7
 38.7 34.9 25.  23.8 21.8 27.5 24.6 32.9 26.1 31.9 34.1 36.9 37.3 45.7
 34.2 23.6 22.3 37.1 45.  25.5 30.8 37.4 34.5 27.9 29.5 46.  42.5 35.5
 26.9 45.5 31.5 33.  23.4 30.7 20.5 21.5 40.  28.6 42.2 29.6 35.4 16.9
 26.8 39.3 32.6 35.9 21.2 42.4 40.5 36.7 29.3 19.6 18.  17.6 19.1 50.1
 17.7 54.6 35.  22.  39.4 19.7 22.5 25.2 41.8 60.9 23.7 24.5 31.2 16.
 31.6 25.1 24.8 18.3 20.  19.5 36.  35.3 40.1 43.1 21.4 34.3 27.6 16.5
 24.3 25.7 21.9 38.4 25.9 54.7 18.6 24.9 48.2 20.7 39.5 23.3 64.8 35.1
 43.6 21.  47.3 16.6 21.6 15.5 35.6 16.7 41.9 16.4 17.1 29.2 37.9 44.6
 39.6 4
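Because -1 is only a marker for a missing BMI, any later statistic has to leave it out. The following is a minimal sketch on a toy series (the values mirror the first rows shown earlier, but the series itself is an illustrative stand-in, not the real column):

```python
import numpy as np
import pandas as pd

# Toy BMI values that use -1 as the "missing" sentinel.
bmi = pd.Series([36.6, -1.0, 32.5, 34.4, 24.0])

naive_mean = bmi.mean()            # pulled down by the sentinel
valid_mean = bmi[bmi >= 0].mean()  # sentinel rows excluded

# Alternatively, turn the sentinel back into NaN, which pandas skips by default.
restored = bmi.replace(-1, np.nan)

print(naive_mean, valid_mean, restored.mean())
```

Filtering with `bmi >= 0` and converting back to `np.nan` give the same mean; which one is more convenient depends on how the column is used downstream.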

In [11]:
print(df["avg_glucose_level"].unique())

[228.69 202.21 105.92 ...  82.99 166.29  85.28]


## Categorical Data Cleaning

- correcting typos
- replacing `np.nan` with `n/a`

In [12]:
CAT_COLS = ["gender", "hypertension", "heart_disease", "ever_married", 
            "work_type", "Residence_type", "smoking_status", "stroke"]

In [13]:
df[CAT_COLS].isna().sum()

gender            0
hypertension      0
heart_disease     0
ever_married      0
work_type         0
Residence_type    0
smoking_status    0
stroke            0
dtype: int64

In [14]:
for i in CAT_COLS:
    print(f"{i} : {df[i].unique()}")

gender : ['Male' 'Female' 'Other']
hypertension : [0 1]
heart_disease : [1 0]
ever_married : ['Yes' 'No']
work_type : ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
Residence_type : ['Urban' 'Rural']
smoking_status : ['formerly smoked' 'never smoked' 'smokes' 'Unknown']
stroke : [1 0]


- `gender` is not normalized
- `ever_married` is not normalized
- `work_type` is not normalized & requires correction in naming convention
- `Residence_type` is not normalized
- `smoking_status` is not normalized & requires correction in naming convention

In [15]:
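def correct_naming_convention(x):
    """
    Correct the naming convention of a single label:
    lowercase it and replace spaces and hyphens with underscores.
    """
    return str(x).strip().lower().replace("-", "_").replace(" ", "_")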

In [16]:
# normalizing text
df["gender"] = df["gender"].map(lambda x: x.lower())

In [17]:
# normalizing text
df["ever_married"] = df["ever_married"].map(lambda x: x.lower())

In [18]:
# correcting naming convention
df["work_type"] = df["work_type"].replace('Self-employed', 'self_employed')

# normalizing text
df["work_type"] = df["work_type"].map(lambda x: x.lower())

In [19]:
# normalizing text
df["Residence_type"] = df["Residence_type"].map(lambda x: x.lower())

In [20]:
# correcting naming convention
df["smoking_status"] = df["smoking_status"].replace({'formerly smoked': 'formerly_smoked', 'never smoked': 'never_smoked'})

# normalizing text
df["smoking_status"] = df["smoking_status"].map(lambda x: x.lower())

In [21]:
for i in CAT_COLS:
    print(f"{i} : {df[i].unique()}")

gender : ['male' 'female' 'other']
hypertension : [0 1]
heart_disease : [1 0]
ever_married : ['yes' 'no']
work_type : ['private' 'self_employed' 'govt_job' 'children' 'never_worked']
Residence_type : ['urban' 'rural']
smoking_status : ['formerly_smoked' 'never_smoked' 'smokes' 'unknown']
stroke : [1 0]
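Eyeballing the printed labels works for a handful of columns; the same check can also be expressed as a comparison against an allowed set. The sketch below is an assumption-laden illustration: `ALLOWED` copies two label sets from the output above, and `demo` is a made-up frame with one deliberately wrong value.

```python
import pandas as pd

# Allowed labels copied from the output above; extend for the other columns.
ALLOWED = {
    "gender": {"male", "female", "other"},
    "ever_married": {"yes", "no"},
}

# Hypothetical frame with one label ("Male") that was never lowercased.
demo = pd.DataFrame({
    "gender": ["male", "female", "Male"],
    "ever_married": ["yes", "no", "yes"],
})

# Collect, per column, any label that is not in the allowed set.
unexpected = {}
for col, allowed in ALLOWED.items():
    extra = sorted(set(demo[col]) - allowed)
    if extra:
        unexpected[col] = extra

print(unexpected)
```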


## Text Data Cleaning

- text extraction
- correcting typos
- replacing `np.nan` with `n/a`

In [22]:
# this dataset has no text columns
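Since the stroke data has nothing to clean here, the points above can only be shown on made-up input. In this sketch the `notes` series and the blood-pressure pattern are hypothetical; they illustrate replacing `np.nan` with `n/a` and extracting values from text.

```python
import numpy as np
import pandas as pd

# Hypothetical free-text notes (no such column exists in the stroke dataset).
notes = pd.Series(["  BP 140/90, follow-up in 2 weeks ", np.nan, "bp 120/80"])

# Replace missing text with the "n/a" placeholder, then tidy whitespace and case.
clean = notes.fillna("n/a").str.strip().str.lower()

# Text extraction: pull the two readings out with named regex groups.
bp = clean.str.extract(r"(?P<systolic>\d{2,3})/(?P<diastolic>\d{2,3})")

print(clean.tolist())
print(bp)
```

Rows without a match, such as the `n/a` placeholder, come back as missing values in `bp`, and the extracted readings are still strings until they are converted explicitly.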

## Date Data Cleaning

- formatting dates
- extracting basic features like day, month, year
- correcting typos
- replacing `np.nan` with `n/a`

In [23]:
# this dataset has no date columns
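As with text, the date steps can only be illustrated on invented input. The `visits` frame and its day/month/year input format are assumptions for this sketch; it parses dates, extracts day, month and year, and writes a standard `YYYY-MM-DD` string with `n/a` for unusable entries.

```python
import pandas as pd

# Hypothetical admission dates (no such column exists in the stroke dataset).
visits = pd.DataFrame({"admitted": ["05/01/2024", "17/02/2024", "31/02/2024", None]})

# Parse with an explicit format; impossible dates (31 Feb) and missing
# entries become NaT instead of raising an error.
parsed = pd.to_datetime(visits["admitted"], format="%d/%m/%Y", errors="coerce")

# Basic features; rows with NaT yield missing values here.
visits["year"] = parsed.dt.year
visits["month"] = parsed.dt.month
visits["day"] = parsed.dt.day

# Standard YYYY-MM-DD text, with "n/a" where the date is missing or invalid.
visits["admitted_iso"] = parsed.dt.strftime("%Y-%m-%d").fillna("n/a")

print(visits)
```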

## Correcting Column Names

In [24]:
df.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [25]:
column_names = {
    'id'                : 'Id', 
    'gender'            : 'Gender', 
    'age'               : 'Age', 
    'hypertension'      : 'Hypertension', 
    'heart_disease'     : 'Heart_disease', 
    'ever_married'      : 'Ever_married',
    'work_type'         : 'Work_type', 
    'Residence_type'    : 'Residence_type',
    'avg_glucose_level' : 'Avg_glucose_level', 
    'bmi'               : 'Bmi',
    'smoking_status'    : 'Smoking_status', 
    'stroke'            : 'Stroke'
}

In [26]:
# Rename specific columns
df = df.rename(columns=column_names)
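Every entry in the mapping above capitalizes the first letter and leaves the rest lowercase, so the same result could plausibly be obtained by passing a function to `rename`. This is a sketch on an empty stand-in frame (`demo` is hypothetical), not a replacement for the explicit mapping, which remains easier to audit.

```python
import pandas as pd

# Empty stand-in frame with the same column names as the notebook's DataFrame.
cols = ['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
        'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
        'smoking_status', 'stroke']
demo = pd.DataFrame(columns=cols)

# str.capitalize upper-cases the first character and lower-cases the rest.
renamed = demo.rename(columns=str.capitalize)

print(list(renamed.columns))
```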

In [27]:
df.head()

| | Id | Gender | Age | Hypertension | Heart_disease | Ever_married | Work_type | Residence_type | Avg_glucose_level | Bmi | Smoking_status | Stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | male | 67.0 | 0 | 1 | yes | private | urban | 228.69 | 36.6 | formerly_smoked | 1 |
| 1 | 51676 | female | 61.0 | 0 | 0 | yes | self_employed | rural | 202.21 | -1.0 | never_smoked | 1 |
| 2 | 31112 | male | 80.0 | 0 | 1 | yes | private | rural | 105.92 | 32.5 | never_smoked | 1 |
| 3 | 60182 | female | 49.0 | 0 | 0 | yes | private | urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | female | 79.0 | 1 | 0 | yes | self_employed | rural | 174.12 | 24.0 | never_smoked | 1 |


## Saving Cleaned Data

In [28]:
df.to_csv(os.path.join("data", "heart_stroke_clean_data.csv"), index=False)