<a href="https://colab.research.google.com/github/Shadabur-Rahaman/30-days-ml-projects/blob/main/Day2_Advanced_EDA_Feature_Engineering/notebooks/Day2_Advanced_EDA_Feature_Engineering_Cleaned.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📓 Day 2 – Advanced EDA & Feature Engineering

**Dataset Used:** Insurance charges dataset (`insurance.csv` from Kaggle `mirichoi0218/insurance`)

In [None]:
# ========== 1. Setup ==========
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Optional: for clean visuals
sns.set(style="whitegrid")

# Download via kagglehub (public datasets usually need no API key; private ones require Kaggle credentials)
import kagglehub

# Download latest version of the dataset
path = kagglehub.dataset_download("mirichoi0218/insurance")

print("Path to dataset files:", path)

# List all files in the downloaded directory
for file in os.listdir(path):
    print(file)

# Load the dataset (typically it's 'insurance.csv')
df = pd.read_csv(os.path.join(path, "insurance.csv"))
df.head()


## 2. Automated EDA Report

We'll use two popular EDA libraries: `ydata-profiling` (formerly `pandas-profiling`) and `sweetviz`.

- These tools quickly show distributions, missing values, correlations, warnings, etc.

In [None]:
import pandas as pd
from ydata_profiling import ProfileReport

# Reload the dataset from the kagglehub download directory
# (`os` and `path` come from the setup cell)
df = pd.read_csv(os.path.join(path, "insurance.csv"))

# Generate profiling report
profile = ProfileReport(df, title="YData Profiling Report", explorative=True)
profile.to_file("ydata_profile_report.html")

# Display the report inside notebook (optional)
profile.to_notebook_iframe()


In [None]:
import pandas as pd
import sweetviz as sv

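# Reload from the kagglehub download directory (`os` and `path` come from the setup cell)
df = pd.read_csv(os.path.join(path, "insurance.csv"))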

report = sv.analyze(df)
report.show_html("sweetviz_report.html")


## 3. Manual EDA

Here we'll:
- Check shape, datatypes, and missing values
- Get descriptive stats
- Plot distributions and correlations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Overview
print(df.shape)
df.info()  # info() prints its summary directly and returns None
print(df.describe(include='all'))

# Missing values per column
print(df.isnull().sum())

# Correlation heatmap (numeric columns only)
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
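
The heatmap can be read more systematically by ranking every numeric column against a single column of interest. The sketch below is illustrative only: the helper name `rank_correlations` and the toy values are made up, and in this notebook you would presumably call it with `df` and `'charges'`.

```python
import pandas as pd

def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    # Order by absolute strength but keep the sign of each correlation
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic toy data, not the insurance dataset
toy = pd.DataFrame({
    "x_strong": [1, 2, 3, 4, 5, 6],
    "x_weak":   [2, 1, 2, 1, 2, 1],
    "label":    ["a", "b", "a", "b", "a", "b"],  # non-numeric, ignored by corr
    "y":        [10, 21, 29, 41, 50, 62],
})

ranked = rank_correlations(toy, "y")
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top, which a plain descending sort would push to the bottom.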

## 4. Data Visualization

Helps in detecting skewness, outliers, class imbalance, and patterns.

In [None]:
# Distribution of numerical features
df.hist(bins=20, figsize=(15, 10))
plt.tight_layout()  # keep subplot titles from overlapping
plt.show()

# Pairplot (optional for small datasets)
# sns.pairplot(df)

# Categorical countplot, e.g. for the 'smoker' column
# sns.countplot(x='smoker', data=df)
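
Skewness and outliers can also be checked numerically alongside the plots. A minimal sketch on a synthetic sample, assuming the common 1.5 × IQR rule for outliers (the helper `iqr_outlier_mask` and the values are hypothetical, not part of the notebook):

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Synthetic right-skewed sample (not the insurance data)
sample = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 50], name="value")

skewness = sample.skew()                      # > 0 suggests a long right tail
outliers = sample[iqr_outlier_mask(sample)]   # values beyond the IQR fences

print(f"skewness: {skewness:.2f}")
print("outliers:", outliers.tolist())
```

For class imbalance, `df['some_column'].value_counts(normalize=True)` gives the share of each category in the same spirit.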

## 5. Feature Engineering

Let's create meaningful features using domain knowledge, interaction terms, and encoding.

In [None]:
import numpy as np
import pandas as pd

# Example 1: Create a new feature 'FamilySize' using 'children' + 1 (assuming 1 = yourself)
df['FamilySize'] = df['children'] + 1

# Example 2: Log transformation on 'charges' to reduce skewness
df['charges_log'] = np.log1p(df['charges'])

# Example 3: Binning age
df['AgeBin'] = pd.cut(df['age'], bins=[0,12,18,35,60,100], labels=['Child','Teen','Young','Adult','Senior'])

# Example 4: One-hot encoding for categorical variables like 'sex' and 'region'
df = pd.get_dummies(df, columns=['sex', 'region'], drop_first=True)

# Example 5: Interaction feature between age and bmi
df['Age*BMI'] = df['age'] * df['bmi']
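
Two of the transformations above have details that are easy to miss: `pd.cut` uses right-inclusive bins by default, and `np.log1p` can be undone with `np.expm1`. The sketch below illustrates both on made-up values (not the insurance data), reusing the bin edges from Example 3.

```python
import numpy as np
import pandas as pd

# Synthetic ages chosen to sit on and just past the bin edges
ages = pd.Series([12, 13, 18, 19, 35, 36, 60, 61])
bins = [0, 12, 18, 35, 60, 100]
labels = ["Child", "Teen", "Young", "Adult", "Senior"]

# Default right=True: each bin is (left, right], so 18 is 'Teen' and 19 is 'Young'
age_bin = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"age": ages, "AgeBin": age_bin}))

# log1p compresses a long right tail; expm1 maps predictions back to the original scale
charges = pd.Series([1_000.0, 2_000.0, 4_000.0, 8_000.0, 60_000.0])
charges_log = np.log1p(charges)
recovered = np.expm1(charges_log)

print("skew before:", round(charges.skew(), 2))
print("skew after: ", round(charges_log.skew(), 2))
```

If a model is later trained on `charges_log`, its predictions need the `np.expm1` step before they can be compared with raw charges.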


## 6. Handle Missing Values

Strategies:
- Fill with median/mean/mode
- Drop if too many missing
- Use domain rules

In [None]:
# Fill missing 'age' values with the median (if any are missing).
# Assigning the result avoids chained `inplace=True`, which is deprecated in recent pandas.
df['age'] = df['age'].fillna(df['age'].median())

# Template step for datasets with a mostly empty column (e.g. Titanic's 'Cabin').
# The insurance dataset has no such column, so this is skipped.
# df.dropna(subset=['Cabin'], inplace=True)  # Not applicable
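
The three strategies listed above can be sketched together on a small synthetic frame. The column names mirror the insurance data but the values are made up, and the 50% drop threshold is an assumption rather than a rule:

```python
import numpy as np
import pandas as pd

# Synthetic frame with gaps (values are made up for illustration)
raw = pd.DataFrame({
    "age":    [25, np.nan, 40, 31, np.nan],
    "smoker": ["no", "yes", None, "no", "no"],
    "notes":  [None, None, None, "x", None],   # 80% missing
})

missing_share = raw.isna().mean()  # fraction of missing values per column

# 1. Drop columns where more than half the values are missing (assumed threshold)
cleaned = raw.drop(columns=missing_share[missing_share > 0.5].index)

# 2. Numeric column: fill with the median
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# 3. Categorical column: fill with the most frequent value
cleaned["smoker"] = cleaned["smoker"].fillna(cleaned["smoker"].mode().iloc[0])

print(cleaned)
```

The median is usually preferred over the mean for skewed numeric columns, since a few extreme values do not shift it.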

## 7. Save Cleaned Dataset

In [None]:
df.to_csv("cleaned_dataset.csv", index=False)
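
One thing to keep in mind when reloading the saved file: CSV stores only text, so a categorical column such as `AgeBin` comes back without its category order. A minimal sketch using an in-memory buffer in place of a file on disk (the two-row frame is made up for illustration):

```python
import io
import pandas as pd

categories = ["Young", "Adult"]
frame = pd.DataFrame({
    "age": [19, 45],
    "AgeBin": pd.Categorical(["Young", "Adult"], categories=categories, ordered=True),
})

buffer = io.StringIO()            # in-memory stand-in for cleaned_dataset.csv
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Values survive the round trip, but the categorical dtype does not
print(frame["AgeBin"].dtype, "->", reloaded["AgeBin"].dtype)

# Restore the ordered categories explicitly after loading
reloaded["AgeBin"] = pd.Categorical(reloaded["AgeBin"], categories=categories, ordered=True)
```

If preserving dtypes matters, a binary format such as Parquet or pickle is an alternative to re-casting after every load.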

## 8. Summary of Insights

- Key relationships to read off the heatmap:
  - age, bmi, and children vs charges
  - children → basis for the FamilySize feature

- Engineered Features:
  - FamilySize, AgeBin, Age*BMI, charges_log
  - One-hot encoded categorical variables (sex, region)

- Missing Data:
  - Age filled using median (if any values are missing)
  - No columns dropped

- Dataset ready for modeling tomorrow 🚀