# Titanic Dataset - Data Cleaning & Preprocessing

## Objective:
This notebook demonstrates the data cleaning and preprocessing of the Titanic dataset.
The goal is to prepare the data for Machine Learning by handling missing values, encoding
categorical features, scaling numerical data, and removing outliers.

**Tools Used:**
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler

sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an older alias
%matplotlib inline

In [None]:
# Load the Titanic dataset
df = pd.read_csv("Titanic-Dataset.csv")
df.head()

In [None]:
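# Basic info: column dtypes and non-null counts (info() prints directly)
df.info()

# Statistical summary; display() is needed because a notebook cell
# only renders its last expression automatically
display(df.describe(include='all'))

# Missing values per column
display(df.isnull().sum())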
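
Raw missing counts are easier to compare as a share of rows. The sketch below shows one way to compute that share; the `toy` frame and its values are made up for illustration and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() yields booleans, so the column mean is the fraction of missing entries.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `df`, a column with a very high share is a candidate for dropping, while a column with a low share is a candidate for imputation.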

In [None]:
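# Fill missing 'Age' values with the median age.
# Assign the result back: calling fillna(inplace=True) on a selected column
# is chained assignment, which recent pandas versions warn about or ignore.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing 'Embarked' values with the most frequent port (the mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])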

# Drop 'Cabin' due to high missing rate
df.drop(columns='Cabin', inplace=True)

# Verify missing values
df.isnull().sum()
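
A single global median ignores differences between passenger groups. As a possible refinement, not something this notebook does, the sketch below fills each missing age with the median of its own group. It assumes a grouping column named `Pclass`, and the `toy` values are invented.

```python
import numpy as np
import pandas as pd

# Toy data: two passenger classes with one missing age each (values are invented).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
})

# transform("median") returns a Series aligned with toy, holding each row's group median.
group_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing entries; existing ages are left untouched.
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Whether the group-wise median helps depends on how strongly age varies across the chosen groups.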

In [None]:
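# LabelEncoder assigns integer codes in sorted label order; one encoder per column keeps each mapping
sex_encoder = LabelEncoder()
df['Sex'] = sex_encoder.fit_transform(df['Sex'])                 # female=0, male=1
embarked_encoder = LabelEncoder()
df['Embarked'] = embarked_encoder.fit_transform(df['Embarked'])  # C=0, Q=1, S=2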
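
Integer codes imply an order (0 < 1 < 2) that the embarkation ports do not have, which some models may pick up on. A common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on an invented `toy` frame; it is an illustration rather than the approach taken in this notebook.

```python
import pandas as pd

# Toy column with the three port codes (row values are invented).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1. For linear models, passing `drop_first=True` removes one redundant column.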

In [None]:
# Standardize 'Age' and 'Fare' to zero mean and unit variance
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
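
Fitting the scaler on the full dataset lets statistics from future test rows leak into training. If the data is later split, one option is to fit on the training part only and reuse those statistics for the test part. The sketch below assumes an 80/20 split and uses randomly generated `toy` values.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy values standing in for 'Age' and 'Fare'.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=100),
    "Fare": rng.uniform(5, 500, size=100),
})

train, test = train_test_split(toy, test_size=0.2, random_state=0)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only.
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)
# Apply the same training statistics to the test rows.
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)
print(train_scaled.describe())
```

The test columns will generally not have exactly zero mean or unit variance, which is expected.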

In [None]:
# Boxplots for visualizing outliers
for col in ['Age', 'Fare']:
    sns.boxplot(data=df, x=col)
    plt.title(f"Boxplot for {col}")
    plt.show()

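# Remove outliers using the IQR method
def remove_outliers(data, column):
    """Keep rows whose `column` value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    return data[(data[column] >= Q1 - 1.5 * IQR) & (data[column] <= Q3 + 1.5 * IQR)]

# Columns are filtered one after another, so the 'Fare' bounds are computed
# on the rows that survive the 'Age' filter.
# Note: dropping rows after scaling means 'Age' and 'Fare' no longer have
# exactly zero mean and unit variance.
for col in ['Age', 'Fare']:
    df = remove_outliers(df, col)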

print("Final dataset shape:", df.shape)
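
Dropping rows shrinks the dataset. One alternative is to clip extreme values to the same IQR bounds so that every row is kept. The helper `clip_outliers` below is a hypothetical name introduced for this sketch, and the `toy` values are invented.

```python
import pandas as pd

def clip_outliers(data, column, k=1.5):
    """Return `column` with values limited to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    return data[column].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy fares with one extreme value (values are invented).
toy = pd.DataFrame({"Fare": [7.0, 8.0, 9.0, 10.0, 11.0, 500.0]})
toy["Fare"] = clip_outliers(toy, "Fare")
print(toy)
```

All six rows remain and only the extreme fare is pulled in to the upper bound. Whether clipping or dropping is preferable depends on whether the extreme values are errors or genuine observations.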

In [None]:
df.to_csv("Cleaned_Titanic_Dataset.csv", index=False)
print("Cleaned dataset saved as 'Cleaned_Titanic_Dataset.csv'")

## Summary:

- Filled missing 'Age' values with the median and missing 'Embarked' values with the mode
- Dropped 'Cabin' due to excessive missing data
- Label-encoded the 'Sex' and 'Embarked' columns
- Standardized 'Age' and 'Fare'
- Visualized outliers in 'Age' and 'Fare' with boxplots and removed them using the 1.5 × IQR rule
- Saved the cleaned dataset as 'Cleaned_Titanic_Dataset.csv' for model training