# 🚢 Titanic Dataset ETL Pipeline
This notebook implements a complete ETL (Extract, Transform, Load) process on the Titanic dataset.

### Steps Involved:
After the dataset is loaded, the pipeline runs in four steps:
1. Clean missing data
2. Encode categorical variables
3. Engineer new features
4. Save the cleaned dataset

## 🔍 ETL Process Overview
Below is a visual representation of the ETL pipeline:

![ETL Process](ETL%20Process%20for%20Titanic%20Dataset%20-%20visual%20selection.png)

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (assuming it's in the same directory)
df = pd.read_csv('Titanic-Dataset.csv')
df.head()
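
Before cleaning, it can help to check which columns actually contain missing values. The sketch below is illustrative only: `missing_report` is a hypothetical helper, and `sample` is a small made-up frame that borrows a few Titanic column names rather than the real file.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; the notebook itself reads Titanic-Dataset.csv instead.
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Cabin': [None, 'C85', None, 'C123'],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, largest first, omitting complete columns."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)

report = missing_report(sample)
print(report)
```

Calling `missing_report(df)` on the loaded data should show which columns the cleaning step needs to handle.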

## 🧼 Step 1: Clean Missing Data

In [None]:
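# Fill missing values: median for Age, most frequent value for Embarked,
# and an explicit 'Unknown' placeholder for Cabin
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Cabin'] = df['Cabin'].fillna('Unknown')

# Drop the Ticket column, which the rest of this pipeline does not use
df = df.drop(columns=['Ticket'])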
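
One possible variation is to wrap the same fills in a function that works on a copy, so the cell can be re-run without mutating the loaded frame. This is a sketch under assumptions: `clean_missing` is a hypothetical name, and the sample rows (including the ticket strings) are invented for illustration.

```python
import numpy as np
import pandas as pd

def clean_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the fills from the cleaning step to a copy of the frame."""
    out = frame.copy()
    out['Age'] = out['Age'].fillna(out['Age'].median())
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode()[0])
    out['Cabin'] = out['Cabin'].fillna('Unknown')
    # errors='ignore' keeps the call safe if Ticket was already dropped
    return out.drop(columns=['Ticket'], errors='ignore')

# Invented sample rows using the same column names as the dataset
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Embarked': ['S', None, 'S'],
    'Cabin': [None, 'C85', None],
    'Ticket': ['T1', 'T2', 'T3'],
})

cleaned = clean_missing(sample)
print(cleaned)
```

Because the function returns a new frame, `sample` keeps its original missing values and `Ticket` column, which makes before-and-after comparisons straightforward.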

## 🔤 Step 2: Encode Categorical Variables

In [None]:
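# LabelEncoder assigns integer codes in sorted label order
# (for example female -> 0, male -> 1).
# The same encoder is refit for each column, so le.classes_ only
# reflects the most recently encoded column.
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df['Embarked'] = le.fit_transform(df['Embarked'])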
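
Since a single encoder is refit for every column, the code-to-label mapping of earlier columns is not retained. If those mappings are needed later (for plot legends, say), one alternative is `pd.factorize` with `sort=True`, which also assigns codes in sorted label order. The sketch below swaps in that technique on a made-up frame; the `mappings` dictionary is a hypothetical addition, not part of the original notebook.

```python
import pandas as pd

# Made-up rows using two of the dataset's categorical columns
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

mappings = {}
for col in ['Sex', 'Embarked']:
    # sort=True assigns integer codes in sorted label order
    codes, uniques = pd.factorize(sample[col], sort=True)
    sample[col] = codes
    # Keep code -> original label so the encoding can be reversed later
    mappings[col] = dict(enumerate(uniques))

print(sample)
print(mappings)
```

The stored mapping can later be applied with something like `sample['Sex'].map(mappings['Sex'])` to recover the original labels.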

## 🧠 Step 3: Feature Engineering

In [None]:
# Create FamilySize and IsAlone
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

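# Extract the title (the word followed by a period) from each name.
# A raw string avoids an invalid-escape warning for '\.'.
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Title'] = le.fit_transform(df['Title'])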
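
To see what these features produce, here is a small sketch on invented inputs. The names are hypothetical and only mimic the "Surname, Title. Given names" pattern that the regular expression relies on.

```python
import pandas as pd

# Invented names in the 'Surname, Title. Given names' format
names = pd.Series([
    'Smith, Mr. John',
    'Jones, Mrs. Mary (Anne Brown)',
    'Taylor, Miss. Emma',
    'Clark, the Countess. of (Jane Doe)',
])

# Capture the first run of letters that follows a space and ends with a period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())

# FamilySize counts the passenger plus siblings/spouses and parents/children
family = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})
family['FamilySize'] = family['SibSp'] + family['Parch'] + 1
family['IsAlone'] = (family['FamilySize'] == 1).astype(int)
print(family)
```

The pattern returns only the first match, so a name with a multi-word prefix such as "the Countess." yields the word directly before the period.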

## 📊 Optional: Quick Visualization

In [None]:
# Sex was label-encoded above, so the legend shows 0 (female) and 1 (male)
sns.countplot(x='Survived', hue='Sex', data=df)
plt.title('Survival Count by Sex')
plt.show()

## 💾 Step 4: Save Cleaned Dataset

In [None]:
df.to_csv('cleaned_titanic_data.csv', index=False)
print('✅ Cleaned data saved to cleaned_titanic_data.csv')
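
After saving, a quick round-trip check can confirm that the file reloads with the expected shape and without missing values. The sketch below uses a made-up frame and a temporary directory, so it does not touch `cleaned_titanic_data.csv`; the file name `cleaned_sample.csv` is a placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned frame
cleaned = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Sex': [1, 0, 0],
    'Cabin': ['Unknown', 'C85', 'Unknown'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'cleaned_sample.csv'
    cleaned.to_csv(path, index=False)
    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
no_missing = not reloaded.isna().any().any()
print(same_shape, same_columns, no_missing)
```

Writing with `index=False` matters here: without it, the reloaded frame would gain an extra unnamed index column and the shape check would fail.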