# Titanic Dataset - Data Cleaning & Preprocessing
This walkthrough shows how we clean and prepare the Titanic dataset step-by-step so that it’s ready for machine learning models. Let’s break it down!


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler


## 1. Importing the Essentials
We start by importing the tools we'll use:
- pandas and numpy help us work with the data.
- seaborn and matplotlib are for plotting graphs.
- LabelEncoder converts category labels into integers, and StandardScaler standardizes numeric columns for ML.


In [2]:
df = pd.read_csv('Titanic-Dataset.csv')


## 2. Loading the Titanic Dataset
We load the Titanic dataset from a CSV file and store it in a variable called `df`.


In [3]:
df.info()  # info() prints its summary itself and returns None, so no print() is needed
print(df.isnull().sum())


## 3. Taking a First Look at the Data
We check:
- What columns we have and what kind of data they contain (`info()`).
- Which columns are missing values (`isnull().sum()`).
This helps us understand what needs cleaning.
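
To judge how serious each gap is, it can help to look at the share of missing values rather than the raw count. The sketch below uses a small made-up frame (`toy` and its values are hypothetical stand-ins, not rows from the real file) and ranks the columns by missing fraction.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame; the values are invented.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() yields booleans, so mean() is the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same two calls on `df` would show which columns are worth imputing and which are mostly empty.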


In [4]:
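# Assign the result back: recent pandas versions warn about fillna(..., inplace=True)
# on a column selection, and with Copy-on-Write it does not update df.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])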
df.drop(columns=['Cabin'], inplace=True)


## 4. Fixing Missing Values
Here's how we handle missing data:
- Age: Filled with the median, which is less affected than the mean by a skewed distribution or a few extreme ages.
- Embarked: Filled with the most frequent value (the mode).
- Cabin: Dropped because most of its values are missing, so it adds little usable information as is.
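
As a rough illustration of these three steps, here is a sketch on a tiny invented frame (`toy` and its values are assumptions for the example). It assigns the filled column back rather than using `inplace=True`, and it shows how the median ignores the one very high age that pulls the mean upward.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age, one missing port, a mostly empty Cabin column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 80.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

age_median = toy["Age"].median()  # middle of the known ages
age_mean = toy["Age"].mean()      # pulled upward by the 80-year-old

toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
toy = toy.drop(columns=["Cabin"])

print(toy)
```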


In [5]:
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)


## 5. Turning Text into Numbers
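Most ML models need numeric input, so:
- Sex is turned into numbers using Label Encoding. LabelEncoder assigns codes in sorted label order, so `female` becomes 0 and `male` becomes 1.
- Embarked is split into separate indicator columns using One-Hot Encoding (`Embarked_Q`, `Embarked_S`). With `drop_first=True` the first category (`Embarked_C`) is dropped to avoid redundancy: a row with 0 in both remaining columns embarked at C.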
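
To check which number each label received, the fitted encoder's `classes_` attribute can be inspected. The sketch below runs both encodings on a small invented frame (`toy` and `sex_mapping` are names made up for this example).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows using the same label spellings as the Titanic columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

le = LabelEncoder()
toy["Sex"] = le.fit_transform(toy["Sex"])

# classes_ is sorted, so a label's position in it is its integer code.
sex_mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(sex_mapping)

# drop_first=True removes the first category column (Embarked_C here).
toy = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)
print(toy.columns.tolist())
```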


In [6]:
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])


## 6. Scaling the Numbers
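We standardize `Age` and `Fare` so that each has:
- A mean of 0 and a standard deviation of 1.
This puts both columns on a comparable scale, so models that rely on distances or gradient-based optimization are not dominated by the column with the larger raw range.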
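
A quick way to confirm the effect is to check the column means and standard deviations after scaling. The sketch below does that on invented values (`toy` and `manual` are hypothetical names) and compares the result with the formula (x - mean) / std written out by hand. StandardScaler divides by the population standard deviation, which is `ddof=0` in pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical ages and fares on very different raw scales.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy[["Age", "Fare"]])  # NumPy array, one column per feature

# The same computation by hand, using the population standard deviation.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

If the data is later split into training and test sets, a common practice is to fit the scaler on the training portion only and reuse it on the test portion.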


In [7]:
sns.boxplot(x=df['Fare'])
plt.show()
# Fare was standardized above, so 3 means three standard deviations above the mean
df = df[df['Fare'] < 3]


## 7. Spotting and Removing Outliers
The boxplot helps us visualize outliers in the `Fare` column.
Because `Fare` was standardized in the previous step, the cutoff of 3 means three standard deviations above the mean fare, not a fare of 3. Rows at or above that cutoff are removed, though this might be too strict and throw away useful data.
Better Option: Use the IQR method instead to remove only extreme outliers.
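
A minimal sketch of the IQR method is shown below on a handful of invented fares (the `fares` values are hypothetical). The 1.5 multiplier is the conventional boxplot whisker rule and is an assumption here, so a larger factor such as 3.0 could be used to drop only the most extreme points.

```python
import pandas as pd

# Hypothetical fares with one very large value.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="Fare")

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50% of the values

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values inside the [lower, upper] fences.
kept = fares[fares.between(lower, upper)]
print(kept)
```

On the full frame, the equivalent filter would be `df[df['Fare'].between(lower, upper)]` with the fences computed from `df['Fare']`. The rule selects the same rows whether it is applied before or after standardization.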


In [8]:
print(df.head())


## 8. Looking at the Cleaned Data
Finally, we take a peek at the cleaned-up data to make sure everything looks good.
With these steps, we've handled missing values, converted text to numbers, scaled the features, and dealt with outliers. All of these are crucial for getting the dataset ML-ready!
