# 📘 Day 4: Data Cleaning and Preprocessing

## 🧼 Topics Covered
- Handling missing data
- Removing duplicates
- Feature scaling
- Encoding categorical variables

## 📂 Load Titanic Dataset — Choose Your Source

### Option 1: Seaborn (built-in)

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load Titanic dataset from seaborn
df = sns.load_dataset('titanic')
df.head()

### Option 2: GitHub Raw CSV (no account needed)

In [None]:
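# Run the imports in Option 1 first (pandas, StandardScaler, LabelEncoder)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
# This CSV capitalizes its headers ('Age', 'Fare', ...); lowercase them to match the cells below
df.columns = df.columns.str.lower()
df.head()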

### Option 3: Kaggle Titanic Dataset (requires manual download)

In [None]:
# Upload 'train.csv' manually to your notebook environment
# from kaggle.com/c/titanic/data
# Then run:
# df = pd.read_csv('train.csv')
# df.columns = df.columns.str.lower()  # headers are capitalized ('Age', 'Fare', ...)
# df.head()

## 🔍 Check for Missing Values

In [None]:
df.isnull().sum()

In [None]:
df.info()
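
Raw counts are easier to judge as percentages. The sketch below uses a small made-up frame (`toy`, not the real Titanic data) to rank columns by the share of values that are missing.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "deck": [np.nan, "C", np.nan, np.nan],
})

# isnull() yields booleans, so mean() is the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Columns near the top of this ranking are candidates for dropping; columns with only a few gaps are usually better filled.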

## 🧼 Handle Missing Data

In [None]:
# Fill 'age' with median if it exists
if 'age' in df.columns:
    df['age'] = df['age'].fillna(df['age'].median())

# Fill 'embarked' with mode if it exists
if 'embarked' in df.columns:
    df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Drop 'deck' if it exists and more than half of its values are missing
if 'deck' in df.columns and df['deck'].isnull().mean() > 0.5:
    df = df.drop(columns=['deck'])

df.isnull().sum()
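
Why the median rather than the mean for 'age'? A minimal sketch on made-up ages (the `ages` series is illustrative only) shows how a single extreme value pulls the mean, and therefore the filled value, away from the typical passenger.

```python
import numpy as np
import pandas as pd

# Made-up ages: three young passengers, one much older, one missing
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan])

# The mean is dragged upward by the 80-year-old; the median is not
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print("Filled with mean:  ", mean_filled.iloc[-1])
print("Filled with median:", median_filled.iloc[-1])
```

For skewed columns the median is often the safer default, though neither choice is right for every dataset.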

## 🧹 Remove Duplicates

In [None]:
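# Caution: if the data has no unique ID column (as in the seaborn version),
# distinct passengers with identical values are treated as duplicates
df = df.drop_duplicates()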
df.shape
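
Before dropping anything, it can help to look at which rows would be removed. The sketch below uses a made-up frame with a hypothetical `passenger_id` column to show the difference between rows that only look alike on some features and rows that are identical in every column.

```python
import pandas as pd

# Made-up data; 'passenger_id' is a hypothetical identifier column
toy = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4],
    "age": [28.0, 28.0, 40.0, 40.0],
    "fare": [7.9, 7.9, 13.0, 26.0],
})

# keep=False marks every member of a duplicate group, not just the later copies
features = ["age", "fare"]
dupe_mask = toy.duplicated(subset=features, keep=False)
print(toy[dupe_mask])

# With the identifier included, no row is an exact duplicate
n_true_dupes = toy.duplicated().sum()
print("Exact duplicates:", n_true_dupes)

# Dropping on the feature subset removes one of the look-alike rows
deduped = toy.drop_duplicates(subset=features).reset_index(drop=True)
```

Whether look-alike rows are real duplicates or simply two similar passengers is a judgment call that depends on the dataset.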

## 🏷️ Encode Categorical Variables

In [None]:
# Use one encoder per column so each mapping can be inverted later
encoders = {}

# Encode 'sex' and 'embarked' if present
for col in ['sex', 'embarked']:
    if col in df.columns:
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])

df.select_dtypes(include='object').head()
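
Label encoding turns categories into integers assigned in sorted order, which can suggest a ranking (for example C < Q < S) that the ports do not really have. One common alternative, sketched here on a made-up `toy` frame, is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Made-up embarkation ports for illustration
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# One 0/1 column per category; no ordering is implied between ports
one_hot = pd.get_dummies(toy, columns=["embarked"], prefix="embarked", dtype=int)
print(one_hot)
```

Integer codes are fine for a two-valued column like 'sex'; one-hot columns tend to suit nominal features with three or more values, at the cost of a wider table.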

## ⚖️ Feature Scaling

In [None]:
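scaler = StandardScaler()

# Standardize age and fare (mean 0, standard deviation 1) if they exist.
# Fitting once on both columns keeps their parameters in a single scaler.
num_cols = [col for col in ['age', 'fare'] if col in df.columns]
if num_cols:
    df[num_cols] = scaler.fit_transform(df[num_cols])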

df.describe()
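
To see what `StandardScaler` computes, the sketch below reproduces it by hand on a few made-up fares: subtract the column mean, then divide by the population standard deviation (`ddof=0`). Note that pandas' `.std()` defaults to `ddof=1`, so the argument has to be set explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up fares for illustration
toy = pd.DataFrame({"fare": [7.25, 8.05, 13.0, 26.0, 71.28]})

# Scaled by scikit-learn
scaled = StandardScaler().fit_transform(toy[["fare"]]).ravel()

# Scaled by hand: z = (x - mean) / population standard deviation
manual = ((toy["fare"] - toy["fare"].mean()) / toy["fare"].std(ddof=0)).to_numpy()

print(np.allclose(scaled, manual))
```

Each scaled value is a z-score: the number of standard deviations a fare sits above or below the mean.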

## 🎮 Game: Find the Odd Passenger

In [None]:
# 'fare' was standardized above, so these thresholds are z-scores:
# flag passengers more than 2.5 standard deviations from the mean fare
if 'fare' in df.columns:
    outliers = df[df['fare'].abs() > 2.5]
else:
    outliers = pd.DataFrame()

print("🚨 Unusual Fare Passengers:")
if not outliers.empty:
    # Show only the columns that are actually present
    show_cols = [c for c in ['age', 'fare', 'sex', 'embarked'] if c in outliers.columns]
    print(outliers[show_cols])
else:
    print("No extreme outliers found or 'fare' column missing.")
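
A z-score threshold is one way to play this game. Another common rule, sketched below on made-up unscaled fares, uses the interquartile range (IQR): anything more than 1.5 × IQR outside the middle 50% of values is flagged. The 1.5 multiplier is a convention, not something this notebook prescribes.

```python
import pandas as pd

# Made-up, unscaled fares for illustration
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 512.33], name="fare")

# Quartiles and interquartile range
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1

# Fences: values beyond these are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

Because quartiles are less affected by extreme values than the mean and standard deviation, the IQR rule can be steadier on heavily skewed columns such as fares.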

## ✅ Summary
- Cleaned missing data
- Removed duplicates
- Encoded and scaled features
- Created a game to detect outliers