Before performing EDA (Exploratory Data Analysis), some initial data cleaning is necessary so you can trust what you’re analyzing.

**Basic cleaning includes:**

- Loading the data correctly (e.g., proper parsing of dates and text encoding)

- Checking for and handling:

    - Missing values (at least identifying them)

    - Incorrect data types (e.g., string instead of numeric)

    - Obvious outliers or garbage values (e.g., negative age)

    - Duplicate rows

> 🧼 This stage is often called "initial" or "light" cleaning. It helps avoid misleading results during EDA.

## Loading the data correctly

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

**Import the CSV data as a pandas DataFrame**

In [2]:
df = pd.read_csv("data/healthcare-dataset-stroke-data.csv")

In [3]:
df.shape

(5110, 12)

In [4]:
df.head()

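|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |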
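Loading options are easiest to see on a tiny example. The sketch below is not the notebook's actual load step: it reads a hypothetical three-row sample from memory, treats the made-up marker `unknown` as missing in addition to pandas' default markers (which already include `N/A`), and pins the types of two columns. The sample rows, the `unknown` marker, and the chosen dtypes are assumptions for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
raw_csv = StringIO(
    "id,gender,age,bmi,stroke\n"
    "9046,Male,67,36.6,1\n"
    "51676,Female,61,N/A,1\n"
    "31112,Male,80,unknown,1\n"
)

sample = pd.read_csv(
    raw_csv,
    na_values=["unknown"],                    # extra missing-value marker (assumed)
    dtype={"id": "int64", "age": "float64"},  # pin the types we are sure about
)

print(sample.dtypes)
print(sample["bmi"].isna().sum())
```

When reading from a file path, the `encoding` and `parse_dates` parameters of `pd.read_csv` cover the encoding and date-parsing cases mentioned in the checklist.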


## Identifying missing values

In [5]:
df.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [9]:
np.round(df.isna().sum()/df.shape[0]*100, 3)

id                   0.000
gender               0.000
age                  0.000
hypertension         0.000
heart_disease        0.000
ever_married         0.000
work_type            0.000
Residence_type       0.000
avg_glucose_level    0.000
bmi                  3.933
smoking_status       0.000
stroke               0.000
dtype: float64

- The `bmi` column has 201 missing values (about 3.9% of rows); every other column is complete.
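The count and the percentage can also be combined into one small report. This is a minimal sketch on a hypothetical toy frame (`toy` and the name `missing` are illustrative, not part of the notebook); with the real data you would pass `df` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use its own `df`.
toy = pd.DataFrame(
    {
        "age": [67.0, 61.0, 80.0, 49.0],
        "bmi": [36.6, np.nan, 32.5, np.nan],
        "stroke": [1, 1, 1, 0],
    }
)

missing = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        # The mean of a boolean mask is the fraction of True values.
        "pct_missing": (toy.isna().mean() * 100).round(3),
    }
)

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```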


## Incorrect data types

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [19]:
for i in df.columns:
    print(f"column : {i}")
    n = df[i].nunique()
    if n < 10:
        print(f"Unique : {df[i].unique()}")
    else:
        print(f"Unique Count : {n}")
    print(f"column dtype: {df[i].dtype}")
    print(f"Missing Values: {df[i].isna().sum()}")
    print('-'*25)

column : id
Unique Count : 5110
column dtype: int64
Missing Values: 0
-------------------------
column : gender
Unique : ['Male' 'Female' 'Other']
column dtype: object
Missing Values: 0
-------------------------
column : age
Unique Count : 104
column dtype: float64
Missing Values: 0
-------------------------
column : hypertension
Unique : [0 1]
column dtype: int64
Missing Values: 0
-------------------------
column : heart_disease
Unique : [1 0]
column dtype: int64
Missing Values: 0
-------------------------
column : ever_married
Unique : ['Yes' 'No']
column dtype: object
Missing Values: 0
-------------------------
column : work_type
Unique : ['Private' 'Self-employed' 'Govt_job' 'children' 'Never_worked']
column dtype: object
Missing Values: 0
-------------------------
column : Residence_type
Unique : ['Urban' 'Rural']
column dtype: object
Missing Values: 0
-------------------------
column : avg_glucose_level
Unique Count : 3979
column dtype: float64
Missing Values: 0
-----------------

- Data types are correct: the numeric columns are stored as `int64` or `float64`, and the categorical columns as `object`.
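Reading `df.info()` by eye works for twelve columns, but the same check can be written down so it is repeatable. The following is a rough sketch, assuming a hand-written mapping of expected kinds; `expected_kinds`, `find_dtype_mismatches`, and the toy data (with a deliberately broken `bmi` column) are all hypothetical.

```python
import pandas as pd

# Hypothetical expectations for a few columns; extend to the full schema as needed.
expected_kinds = {"age": "numeric", "bmi": "numeric", "gender": "text"}

toy = pd.DataFrame(
    {
        "age": [67.0, 61.0],
        "bmi": ["36.6", "N/A"],  # deliberately wrong: numbers stored as strings
        "gender": ["Male", "Female"],
    }
)


def find_dtype_mismatches(frame, expected):
    """Return {column: actual dtype} for columns that break the expectation."""
    mismatches = {}
    for column, kind in expected.items():
        is_numeric = pd.api.types.is_numeric_dtype(frame[column])
        if (kind == "numeric") != is_numeric:
            mismatches[column] = str(frame[column].dtype)
    return mismatches


mismatches = find_dtype_mismatches(toy, expected_kinds)
print(mismatches)
```

An empty dictionary means every listed column has the kind of dtype you expected.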

## Obvious outliers or garbage values (e.g., negative age)

In [22]:
for i in df.columns:
    print(f"column : {i}")
    print(f"Unique : {df[i].unique()}")
    print('-'*25)

column : id
Unique : [ 9046 51676 31112 ... 19723 37544 44679]
-------------------------
column : gender
Unique : ['Male' 'Female' 'Other']
-------------------------
column : age
Unique : [6.70e+01 6.10e+01 8.00e+01 4.90e+01 7.90e+01 8.10e+01 7.40e+01 6.90e+01
 5.90e+01 7.80e+01 5.40e+01 5.00e+01 6.40e+01 7.50e+01 6.00e+01 5.70e+01
 7.10e+01 5.20e+01 8.20e+01 6.50e+01 5.80e+01 4.20e+01 4.80e+01 7.20e+01
 6.30e+01 7.60e+01 3.90e+01 7.70e+01 7.30e+01 5.60e+01 4.50e+01 7.00e+01
 6.60e+01 5.10e+01 4.30e+01 6.80e+01 4.70e+01 5.30e+01 3.80e+01 5.50e+01
 1.32e+00 4.60e+01 3.20e+01 1.40e+01 3.00e+00 8.00e+00 3.70e+01 4.00e+01
 3.50e+01 2.00e+01 4.40e+01 2.50e+01 2.70e+01 2.30e+01 1.70e+01 1.30e+01
 4.00e+00 1.60e+01 2.20e+01 3.00e+01 2.90e+01 1.10e+01 2.10e+01 1.80e+01
 3.30e+01 2.40e+01 3.40e+01 3.60e+01 6.40e-01 4.10e+01 8.80e-01 5.00e+00
 2.60e+01 3.10e+01 7.00e+00 1.20e+01 6.20e+01 2.00e+00 9.00e+00 1.50e+01
 2.80e+01 1.00e+01 1.80e+00 3.20e-01 1.08e+00 1.90e+01 6.00e+00 1.16e+00
 1.00e+00

### 1. Negative or zero values (which don't make sense)

In [26]:

# Non-positive (negative or zero) values
print(df[df['age'] <= 0])

Empty DataFrame
Columns: [id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, stroke]
Index: []


In [27]:
# Non-positive (negative or zero) values
print(df[df['avg_glucose_level'] <= 0])

Empty DataFrame
Columns: [id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, stroke]
Index: []


In [28]:
# Non-positive (negative or zero) values
print(df[df['bmi'] <= 0])

Empty DataFrame
Columns: [id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, stroke]
Index: []


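- No negative or zero values were found in `age`, `avg_glucose_level`, or `bmi`. (Missing `bmi` values are not flagged by this check, because comparisons with `NaN` evaluate to `False`.)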
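The three per-column checks above can be collapsed into a single pass. This is one possible sketch on hypothetical toy data (the planted `0.0` and `-5.0` values are invented to give the check something to find); with the real data, replace `toy` with `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two planted errors and one missing value.
toy = pd.DataFrame(
    {
        "age": [67.0, 0.0, 80.0],
        "avg_glucose_level": [228.69, 202.21, -5.0],
        "bmi": [36.6, np.nan, 32.5],
    }
)

numeric_cols = ["age", "avg_glucose_level", "bmi"]

# Comparisons with NaN evaluate to False, so missing values are never flagged.
non_positive_counts = (toy[numeric_cols] <= 0).sum()
print(non_positive_counts)

# Rows with at least one non-positive value, for manual inspection.
suspect_rows = toy[(toy[numeric_cols] <= 0).any(axis=1)]
print(suspect_rows)
```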

### 2. Unusually high values

**Realistic Age Range (Human Lifespan):**

| Range     | Meaning                                |
| --------- | -------------------------------------- |
| **0–120** | Realistic, though age > 100 is rare    |
| **> 120** | Very likely a data entry or unit error |
| **< 0**   | Impossible — definitely a data error   |


In [42]:
print(df[df['age'] > 120])

Empty DataFrame
Columns: [id, gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, stroke]
Index: []


⚠️ **Unusual / Suspicious Values for `avg_glucose_level` (assuming mg/dL):**

| Range                               | Possible Issue                                                   |
| ----------------------------------- | ---------------------------------------------------------------- |
| **< 40**                            | Rare, possible hypoglycemia or data error                        |
| **> 250–300**                       | Very high, may indicate poorly controlled diabetes or unit error |
| **> 600**                           | Extremely rare, possibly a data or entry error                   |
| **Unrealistic values (e.g. >1000)** | Almost certainly data error                                      |


In [43]:
df[df['avg_glucose_level'] > 300]

|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |


📈 **BMI in practice:**
- A BMI of 40 or above falls in the class III (severe) obesity range, historically called morbid obesity

- BMI > 60–70 is extremely rare

- The highest medically recorded BMIs in history are around 120–150

In [44]:
df[df['bmi'] > 100]

|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
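The upper-limit checks for `age`, `avg_glucose_level`, and `bmi` follow the same pattern, so they can be driven from one table of bounds. The sketch below assumes the upper limits used in the cells above (120, 300, 100) and a lower limit of 0, which is an added assumption; `plausible_bounds`, `out_of_range`, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Upper limits mirror the thresholds used above; the lower limits of 0 are an assumption.
plausible_bounds = {
    "age": (0, 120),
    "avg_glucose_level": (0, 300),
    "bmi": (0, 100),
}

# Hypothetical toy frame with one implausible age and one implausible glucose value.
toy = pd.DataFrame(
    {
        "age": [67.0, 130.0, 0.32],
        "avg_glucose_level": [228.69, 105.92, 350.0],
        "bmi": [36.6, np.nan, 97.6],
    }
)


def out_of_range(frame, bounds):
    """Count non-missing values outside [low, high] for each column."""
    counts = {}
    for column, (low, high) in bounds.items():
        values = frame[column].dropna()
        # Series.between is inclusive on both ends by default.
        counts[column] = int((~values.between(low, high)).sum())
    return counts


range_report = out_of_range(toy, plausible_bounds)
print(range_report)
```

Values flagged this way are candidates for review, not automatic deletions: a glucose reading above 300 may be a real measurement.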


### 3. Strings in numeric columns

In [45]:
# Rows whose `age` value is not a Python int or float (e.g., a stray string)
df[~df['age'].apply(lambda x: isinstance(x, (int, float)))]

|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |


In [46]:
# Rows whose `avg_glucose_level` value is not a Python int or float (e.g., a stray string)
df[~df['avg_glucose_level'].apply(lambda x: isinstance(x, (int, float)))]

|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |


In [47]:
# Rows whose `bmi` value is not a Python int or float (e.g., a stray string)
df[~df['bmi'].apply(lambda x: isinstance(x, (int, float)))]

|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
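Because these three columns already have dtype `float64`, they cannot contain strings, so the empty results are expected. The check becomes useful when a numeric column has been loaded as `object`. In that case, one common alternative (not the method used in this notebook) is to coerce with `pd.to_numeric` and look at what failed to convert. The series below is hypothetical.

```python
import pandas as pd

# Hypothetical raw column in which numbers arrived as text.
raw_age = pd.Series(["67", "61.5", "unknown", None, "80"], name="age")

# errors="coerce" turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(raw_age, errors="coerce")

# A value is a "bad string" if it was present before conversion but is NaN afterwards.
bad_mask = raw_age.notna() & converted.isna()
print(raw_age[bad_mask])
```

Separating `bad_mask` from ordinary missing values keeps genuine gaps (here, the `None` entry) out of the list of garbage entries.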


## Duplicate rows

In [20]:
print(f"Duplicate rows: {df.duplicated().sum()}")

Duplicate rows: 0


In [21]:
df = df.drop_duplicates()
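One caveat: every row has a unique `id`, so a full-row comparison can only catch a record that was duplicated together with its `id`. A possible extra check, sketched here on hypothetical toy data, is to look for rows that are identical in every column except `id`. Whether such rows are true duplicates or just two similar patients is a judgment call.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 differ only in `id`.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "gender": ["Male", "Male", "Female"],
        "age": [67.0, 67.0, 61.0],
        "stroke": [1, 1, 1],
    }
)

# Full-row comparison finds nothing because `id` is unique.
full_row_dupes = toy.duplicated().sum()
print(full_row_dupes)

feature_cols = [c for c in toy.columns if c != "id"]
# keep=False marks every member of a duplicate group, not just the later copies.
same_features = toy[toy.duplicated(subset=feature_cols, keep=False)]
print(same_features)
```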

## Save data after basic cleaning

In [48]:
df.to_csv("data/heart_stroke_data.csv", index=False)
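Passing `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the file is read back. A quick round-trip check can confirm that nothing changed on disk. This sketch writes a hypothetical toy frame to a temporary directory rather than to the notebook's `data/` folder; the file name `cleaned.csv` is made up.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the cleaned data.
toy = pd.DataFrame(
    {"id": [9046, 51676], "bmi": [36.6, np.nan], "gender": ["Male", "Female"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned.csv"  # hypothetical file name
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same shape and same columns means no stray index column was written.
assert reloaded.shape == toy.shape
assert list(reloaded.columns) == list(toy.columns)
print(reloaded.isna().sum())
```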