<a href="https://colab.research.google.com/github/OcSpice/Sentiment-Dataset-EDA/blob/main/Titanic_Data_Cleaning_(Train_%26_Test).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Titanic Data Cleaning & Preprocessing**

## 1️⃣ **Title & Objective**

##### **Level 1 – Task 1: Data Cleaning & Preprocessing (Titanic Dataset)**

##### 🎯 **Objective**
To clean and preprocess both **train.csv** and **test.csv** by:  
- Identifying and handling **missing values**  
- Removing **duplicate records**  
- Standardizing **categorical formats**
- Ensuring both datasets are aligned for future analysis/modeling

## 2️⃣ **Import Libraries & Load Data**

In [9]:
import pandas as pd
import numpy as np

In [16]:
# Load Titanic train & test datasets
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')

In [17]:
print("Train shape:", train.shape)
print("Test shape:", test.shape)

Train shape: (891, 12)
Test shape: (418, 11)


In [18]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 3️⃣ **Inspect Missing Values**

#### **Step 1: Missing Value Overview**
- **Train.csv**: Age (177), Cabin (687), Embarked (2)  
- **Test.csv**: Age (86), Cabin (327), Fare (1)

In [19]:
print("Missing values in Train:\n", train.isnull().sum())
print("\nMissing values in Test:\n", test.isnull().sum())

Missing values in Train:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Missing values in Test:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
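
Raw counts are easier to compare across datasets of different sizes when expressed as a share of rows. The following is a minimal sketch of such a summary; the helper name `missing_summary` and the small `demo` frame are hypothetical stand-ins, not part of the original notebook or the real Titanic data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Tiny illustrative frame (hypothetical values)
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})
demo_summary = missing_summary(demo)
print(demo_summary)
```

Calling `missing_summary(train)` and `missing_summary(test)` in the notebook would give the same view for the real data.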


In [20]:
# Summary statistics for the test set (`df` was undefined; the counts below match test)
test.describe(include="all").T

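| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PassengerId | 418.0 | NaN | NaN | NaN | 1100.5 | 120.810458 | 892.0 | 996.25 | 1100.5 | 1204.75 | 1309.0 |
| Pclass | 418.0 | NaN | NaN | NaN | 2.26555 | 0.841838 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 |
| Name | 418.0 | 418.0 | Peter, Master. Michael J | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sex | 418.0 | 2.0 | male | 266.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 418.0 | NaN | NaN | NaN | 29.599282 | 12.70377 | 0.17 | 23.0 | 27.0 | 35.75 | 76.0 |
| SibSp | 418.0 | NaN | NaN | NaN | 0.447368 | 0.89676 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| Parch | 418.0 | NaN | NaN | NaN | 0.392344 | 0.981429 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| Ticket | 418.0 | 363.0 | PC 17608 | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Fare | 417.0 | NaN | NaN | NaN | 35.627188 | 55.907576 | 0.0 | 7.8958 | 14.4542 | 31.5 | 512.3292 |
| Cabin | 91.0 | 76.0 | B57 B59 B63 B66 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |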


## 4️⃣ **Handle Missing Values**

#### **Step 2: Handling Missing Values**
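- Filled **Age** with the median, computed separately for train and test.  
- Filled **Embarked** with the mode in train (test has no missing `Embarked` values).  
- Filled **Fare** with the median in test (train has no missing `Fare` values).  
- Dropped **Cabin** in both datasets because most values are missing (687 of 891 in train, 327 of 418 in test).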

In [21]:
# Train set
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
train = train.drop(columns=['Cabin'])  # too many missing

In [22]:
# Test set
test['Age'] = test['Age'].fillna(test['Age'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
test = test.drop(columns=['Cabin'])
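
The cells above fill each dataset using its own statistics. If the data is later used for modeling, a common variation is to compute the fill values on the training set only and reuse them for the test set, so that nothing about the test distribution influences preprocessing. The sketch below shows that variation; it is not what this notebook does, and the function `impute_from_train` and the demo frames are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_from_train(train_df, test_df, median_cols, mode_cols):
    """Fill gaps in both frames with statistics computed on train_df only."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in median_cols:
        fill = train_df[col].median()      # median ignores NaN by default
        train_out[col] = train_out[col].fillna(fill)
        test_out[col] = test_out[col].fillna(fill)
    for col in mode_cols:
        fill = train_df[col].mode()[0]     # most frequent non-null value
        train_out[col] = train_out[col].fillna(fill)
        test_out[col] = test_out[col].fillna(fill)
    return train_out, test_out

# Hypothetical stand-in data, not the real Titanic values
demo_train = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Fare": [7.0, 8.0, 9.0, 10.0],
    "Embarked": ["S", "S", np.nan, "C"],
})
demo_test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Fare": [np.nan, 20.0],
    "Embarked": ["Q", np.nan],
})

clean_train, clean_test = impute_from_train(
    demo_train, demo_test, median_cols=["Age", "Fare"], mode_cols=["Embarked"]
)
print(clean_test)
```

The function works on copies, so the input frames are left untouched and the two strategies can be compared side by side.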

## 5️⃣ **Remove Duplicates**

#### **Step 3: Duplicate Removal**
Checked both datasets for fully duplicated rows. Neither contained any, so `drop_duplicates()` leaves them unchanged.

In [23]:
print("Duplicates before:", train.duplicated().sum(), test.duplicated().sum())

Duplicates before: 0 0


In [24]:
train = train.drop_duplicates()
test = test.drop_duplicates()

In [25]:
print("Duplicates after:", train.duplicated().sum(), test.duplicated().sum())

Duplicates after: 0 0
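
`duplicated()` with no arguments only flags rows that match in every column. As a complementary check, duplicates can also be looked for on an identifier column, which catches records that differ only in stray whitespace or formatting. This is a minimal sketch on a hypothetical `demo` frame; using `PassengerId` as the key is an assumption about how the data should be unique.

```python
import pandas as pd

# Hypothetical frame: rows 1 and 2 share an ID but differ by a trailing space
demo = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["A", "B", "B ", "C"],
})

# Full-row comparison misses the near-duplicate
full_row_dupes = int(demo.duplicated().sum())

# Comparing on the key column finds it
key_dupes = int(demo.duplicated(subset=["PassengerId"]).sum())

# Keep the first record for each ID and rebuild a clean index
deduped = (
    demo.drop_duplicates(subset=["PassengerId"], keep="first")
        .reset_index(drop=True)
)

print("Full-row duplicates:", full_row_dupes)
print("Key duplicates:", key_dupes)
print(deduped)
```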


## 6️⃣ **Standardize Formats**

#### **Step 4: Standardizing Formats**
- Converted `Sex` to lowercase in both datasets.  
- Converted `Embarked` codes to uppercase in train only.  
- Converted `Survived` to the `category` dtype in train.

In [26]:
# Convert 'Sex' to lowercase
train['Sex'] = train['Sex'].str.lower()
test['Sex'] = test['Sex'].str.lower()

In [27]:
# Ensure 'Embarked' uppercase (only in train)
if 'Embarked' in train.columns:
    train['Embarked'] = train['Embarked'].str.upper()

In [28]:
# Convert 'Survived' to categorical in train
train['Survived'] = train['Survived'].astype('category')
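
Case conversion alone does not remove leading or trailing whitespace, which can also split one category into several. The sketch below combines both steps in a small helper; `standardize_text` and the `demo` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def standardize_text(series: pd.Series, case: str = "lower") -> pd.Series:
    """Strip surrounding whitespace, then apply a single letter case."""
    cleaned = series.str.strip()
    return cleaned.str.lower() if case == "lower" else cleaned.str.upper()

# Hypothetical messy values, not taken from the real dataset
demo = pd.DataFrame({
    "Sex": [" Male", "female ", "MALE"],
    "Embarked": ["s", " C", "q "],
    "Survived": [0, 1, 1],
})

demo["Sex"] = standardize_text(demo["Sex"], case="lower")
demo["Embarked"] = standardize_text(demo["Embarked"], case="upper")
demo["Survived"] = demo["Survived"].astype("category")

print(demo)
print(demo.dtypes)
```

The same helper could be applied to `Embarked` in the test set so that both datasets follow one convention.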

## 7️⃣ **Final Checks**

#### ✅ **Final Clean Datasets**
Both **train.csv** and **test.csv** are now cleaned:
- Missing values handled
- Duplicates checked (none found)
- Formats standardized

In [29]:
print("Train Info:")
train.info()
print("\nTest Info:")
test.info()

Train Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int64   
 1   Survived     891 non-null    category
 2   Pclass       891 non-null    int64   
 3   Name         891 non-null    object  
 4   Sex          891 non-null    object  
 5   Age          891 non-null    float64 
 6   SibSp        891 non-null    int64   
 7   Parch        891 non-null    int64   
 8   Ticket       891 non-null    object  
 9   Fare         891 non-null    float64 
 10  Embarked     891 non-null    object  
dtypes: category(1), float64(2), int64(4), object(4)
memory usage: 70.7+ KB

Test Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   

In [30]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |
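
The objective also calls for the two datasets to be aligned. One way to verify that after cleaning is to compare the column lists (ignoring the target) and confirm that no nulls remain. The following is a sketch under those assumptions; `check_alignment` and the demo frames are hypothetical names introduced for illustration.

```python
import pandas as pd

def check_alignment(train_df, test_df, target="Survived"):
    """Compare feature columns of two frames and report remaining nulls."""
    train_features = [c for c in train_df.columns if c != target]
    test_features = list(test_df.columns)
    has_nulls = bool(train_df.isnull().any().any() or test_df.isnull().any().any())
    return {
        # True only if names AND order match
        "same_columns": train_features == test_features,
        "missing_in_test": sorted(set(train_features) - set(test_features)),
        "extra_in_test": sorted(set(test_features) - set(train_features)),
        "no_missing_values": not has_nulls,
    }

# Hypothetical stand-in frames
demo_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
demo_test = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})
demo_bad = pd.DataFrame({"PassengerId": [3, 4]})

aligned_report = check_alignment(demo_train, demo_test)
misaligned_report = check_alignment(demo_train, demo_bad)
print(aligned_report)
print(misaligned_report)
```

Running `check_alignment(train, test)` at the end of the notebook would flag any column that was dropped or renamed in only one of the two datasets.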
