
# Stroke Prediction — Data Cleaning & Preparation

**Author:** _Your Name_  
**Dataset:** Stroke Prediction Dataset (`data/stroke.csv`)

## Objective
Prepare and clean the Stroke Prediction dataset for machine learning analysis.  
We'll handle missing values, encode categorical features, and scale numerical variables to produce a reliable dataset for EDA and modeling.

## Steps
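1. Load the dataset  
2. Inspect the data  
3. Handle missing values  
4. Clean and normalize categorical columns  
5. Check for outliers  
6. Encode categorical variables  
7. Scale numerical columns  
8. Review the target distribution  
9. Export the cleaned dataset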

## Tools
- Python: `pandas`, `numpy`, `matplotlib`
- Scikit-learn: `StandardScaler`


## 1) Load Dataset

In [None]:

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 140)

DATA_PATH = 'data/stroke.csv'
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"Dataset not found at {DATA_PATH}. Please add it to the data/ folder.")

df = pd.read_csv(DATA_PATH)
print('Raw dataset shape:', df.shape)
df.head()


## 2) Basic Inspection

In [None]:

df.info()
df.describe(include='all').T.head(15)


## 3) Handle Missing Values

In [None]:

print("Missing values per column:")
print(df.isnull().sum())

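# Fill missing BMI values with the column median.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which recent pandas versions warn about and may not apply to df.
if 'bmi' in df.columns:
    df['bmi'] = df['bmi'].fillna(df['bmi'].median())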
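
Median imputation hides which rows were originally missing. The sketch below shows one optional way to keep that information by adding an indicator column before filling. The toy values and the `bmi_missing` column name are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented).
toy = pd.DataFrame({"bmi": [22.0, np.nan, 31.5, 27.0, np.nan]})

# Record which rows were missing before they are overwritten.
# 'bmi_missing' is a hypothetical column name.
toy["bmi_missing"] = toy["bmi"].isna().astype(int)

# median() skips NaN by default, so it uses only the observed values.
bmi_median = toy["bmi"].median()
toy["bmi"] = toy["bmi"].fillna(bmi_median)

print(toy)
```

Whether the indicator is worth keeping depends on the model; it can simply be dropped later if it adds nothing.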


## 4) Clean and Normalize Categorical Columns

In [None]:

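# Normalize column names as well, so later lookups such as 'residence_type'
# still match if the CSV header uses different capitalization.
df.columns = df.columns.str.strip().str.lower()

# Trim whitespace and lowercase every string-valued column.
cat_cols = df.select_dtypes('object').columns
for c in cat_cols:
    df[c] = df[c].str.strip().str.lower()

# Print the remaining categories to spot unexpected values.
for c in cat_cols:
    print(f"{c}: {df[c].unique()}")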
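
As a small illustration of why the strip and lowercase step matters, the sketch below uses invented values to show spelling variants collapsing into a single category.

```python
import pandas as pd

# Invented values: three spellings of "yes" plus one "No".
raw = pd.Series([" Yes", "yes ", "YES", "No"])

# Same normalization as in the cell above.
clean = raw.str.strip().str.lower()

print("distinct before:", raw.nunique())
print("distinct after: ", clean.nunique())
```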


## 5) Outlier Check

In [None]:

numeric_cols = ['age', 'avg_glucose_level', 'bmi']
for col in numeric_cols:
    if col in df.columns:
        plt.figure(figsize=(6,3))
        # Drop NaN first; boxplot cannot draw a box from data containing NaN.
        plt.boxplot(df[col].dropna(), vert=False)
        plt.title(f'{col} Distribution')
        plt.tight_layout()
        plt.show()
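
The boxplots give a visual impression only. To put a number on it, one option is to count the points that fall outside the same 1.5 × IQR whiskers that matplotlib draws by default. The helper name `iqr_outlier_mask` and the toy values below are assumptions made for this sketch.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Invented glucose-like values with one extreme reading.
toy = pd.Series(
    [80.0, 85.0, 90.0, 95.0, 100.0, 105.0, 110.0, 280.0],
    name="avg_glucose_level",
)

mask = iqr_outlier_mask(toy)
print(f"{toy.name}: {int(mask.sum())} of {len(toy)} values flagged")
```

A flagged value is not necessarily an error; in clinical data, extreme readings can be genuine, so this is a prompt for inspection rather than a rule for removal.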


## 6) Encode Categorical Variables

In [None]:

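# Manual binary encodings
binary_map = {
    'gender': {'male': 1, 'female': 0},
    'ever_married': {'yes': 1, 'no': 0},
    'residence_type': {'urban': 1, 'rural': 0}
}
for col, mapping in binary_map.items():
    if col in df.columns:
        # Series.map returns NaN for any value missing from the mapping,
        # so report such values instead of losing them silently.
        unmapped = set(df[col].dropna().unique()) - set(mapping)
        if unmapped:
            print(f"Warning: {col} has unmapped values {unmapped}; they become NaN")
        df[col] = df[col].map(mapping)

# One-hot encoding for the remaining multi-category variables.
# drop_first=True drops one level per variable to avoid redundant columns.
multi_cat = [c for c in df.select_dtypes('object').columns if c not in binary_map]
df = pd.get_dummies(df, columns=multi_cat, drop_first=True)
print('After encoding shape:', df.shape)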
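
One thing to watch with dictionary-based encoding is that `Series.map` returns `NaN` for any category missing from the mapping. The sketch below demonstrates this on invented data (the `'other'` value is an assumption for illustration) and also shows which level `drop_first=True` removes.

```python
import pandas as pd

# Invented rows; 'other' is deliberately absent from the mapping below.
toy = pd.DataFrame({
    "gender": ["male", "female", "other", "female"],
    "work_type": ["private", "self-employed", "private", "govt_job"],
})

gender_map = {"male": 1, "female": 0}
encoded = toy["gender"].map(gender_map)

# Values not covered by the mapping come back as NaN.
unmapped = toy.loc[encoded.isna(), "gender"].unique().tolist()
print("Unmapped gender values:", unmapped)

# get_dummies orders the levels and drop_first removes the first one.
dummies = pd.get_dummies(
    toy[["work_type"]], columns=["work_type"], drop_first=True, dtype=int
)
print(dummies.columns.tolist())
```

If unmapped values show up in the real data, they need an explicit decision (extend the mapping, drop the rows, or one-hot encode the column) before modeling.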


## 7) Scale Numerical Columns

In [None]:

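from sklearn.preprocessing import StandardScaler

# Keep only the columns that exist, so the preview below cannot raise KeyError.
scale_cols = [c for c in ['age', 'avg_glucose_level', 'bmi'] if c in df.columns]

# Fit a single scaler on all numeric columns at once. Refitting inside a loop
# would leave the scaler holding only the last column's mean and variance.
scaler = StandardScaler()
if scale_cols:
    df[scale_cols] = scaler.fit_transform(df[scale_cols])

df[scale_cols].head()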
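
Scaling the full dataset before it is split lets test-set statistics influence the transformation. If that matters for the later modeling notebooks, one alternative is to fit the scaler on the training split only, as in this sketch. The random toy data, the 80/20 split, and the `random_state` values are assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy data in roughly plausible ranges (not the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.uniform(1, 80, size=100),
    "avg_glucose_level": rng.uniform(60, 250, size=100),
    "bmi": rng.uniform(15, 45, size=100),
})
scale_cols = ["age", "avg_glucose_level", "bmi"]

# Split first, then copy so the assignments below do not touch `toy`.
train, test = train_test_split(toy, test_size=0.2, random_state=0)
train, test = train.copy(), test.copy()

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
scaler = StandardScaler()
train[scale_cols] = scaler.fit_transform(train[scale_cols])
test[scale_cols] = scaler.transform(test[scale_cols])

print(train[scale_cols].mean().round(3))
```

With this approach the exported file would hold unscaled values, and scaling would move into the modeling step.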


## 8) Target Distribution

In [None]:

if 'stroke' in df.columns:
    stroke_counts = df['stroke'].value_counts(normalize=True) * 100
    stroke_counts.plot(kind='bar', color=['lightblue', 'salmon'])
    plt.title('Stroke Class Distribution (%)')
    plt.xlabel('Stroke (0=No, 1=Yes)')
    plt.ylabel('Percentage')
    plt.tight_layout()
    plt.show()
    print('Class distribution (%):\n', stroke_counts)
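
If the classes turn out to be imbalanced, class weights are one way to compensate at modeling time. The sketch below computes weights with the `n_samples / (n_classes * class_count)` heuristic, which is what scikit-learn's `class_weight='balanced'` option uses. The 95/5 split is invented for illustration and is not the dataset's actual distribution.

```python
import pandas as pd

# Invented labels with a 95/5 split.
y = pd.Series([0] * 95 + [1] * 5, name="stroke")

counts = y.value_counts().sort_index()
n_samples = len(y)
n_classes = counts.size

# Rarer classes receive proportionally larger weights.
class_weight = (n_samples / (n_classes * counts)).to_dict()
print(class_weight)
```

The resulting dictionary can be passed to estimators that accept a `class_weight` argument.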


## 9) Export Cleaned Dataset

In [None]:

OUT_DIR = 'data_cleaned'
os.makedirs(OUT_DIR, exist_ok=True)
out_path = os.path.join(OUT_DIR, 'stroke_cleaned.csv')
df.to_csv(out_path, index=False)
print('Cleaned dataset saved to:', out_path)
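
A quick round-trip check can confirm that the exported file reloads with the expected shape and no unexpected missing values. This sketch writes an invented frame to a temporary directory so it does not touch `data_cleaned/`.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the cleaned frame.
toy = pd.DataFrame({"age": [0.5, -1.2], "bmi": [0.1, 0.3], "stroke": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "stroke_cleaned.csv")
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same shape, same columns, and no missing values after reloading.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
n_missing = int(reloaded.isna().sum().sum())
print(same_shape, same_columns, n_missing)
```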



## Appendix
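- Missing `bmi` values filled with the column median  
- Binary columns mapped to 0/1; remaining categorical columns one-hot encoded with `drop_first=True`  
- `age`, `avg_glucose_level`, and `bmi` standardized with `StandardScaler`  
- Target imbalance noted; address it at modeling time (e.g., SMOTE or class weighting)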

Next: **02_stroke_eda.ipynb** — exploratory data analysis & feature relationships
