# 🐼 Pandas Handbook

## 05 - Data Cleaning

Check out the official [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)  

This notebook uses the [Titanic - Machine Learning from Disaster dataset](https://www.kaggle.com/competitions/titanic/data) from Kaggle to demonstrate how to clean the data with pandas.

## 📚 Table of Contents
---

🧼 **Inspecting Missing Data**  
📥 **Handling Missing Data on Import**  
🧽 **Cleaning Data Types**  
🗑️ **Dropping Missing or Unwanted Data**  
🧯 **Filling Missing Data**  
🧯 **Detecting and Cleaning Invalid Categorical Values**  
🧽 **Cleaning the Age Column**  
🕵️‍♂️ **Comparing Cleaned Age Data**  
🧽 **Cleaning the Cabin Column**  
🧽 **Cleaning the Embarked Column**  
🧼 **Cleaned Titanic DataFrame**  
👉 **Next Topic: Data Modifying**

---

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
data_raw = "../data/raw/"
csv_file = "titanic.csv"
import_path = os.path.join(data_raw, csv_file)
df = pd.read_csv(import_path, index_col="PassengerId")

### 🧼 Inspecting Missing Data  

Run some data inspection methods to get an overview of the DataFrame and identify what needs to be cleaned.

```df.isna()``` – Returns a DataFrame of the same shape indicating where values are `NaN` (True = missing).   
```df[df['COLUMN'].isna()]``` – Filters rows where the specified column has missing values.   
```df.isna().sum()``` – Counts missing values per column.  
```df.isna().sum().sum()``` – Total count of missing values in entire DataFrame.   
```df[df.isna().any(axis=1)]``` – Filters rows with *any* missing values.   
```df[df.isna().all(axis=1)]``` – Filters rows where *all* columns are missing.   
```pd.isna(value)``` – Checks if a scalar value is NaN.   
```pd.notna(value)``` – Checks if a scalar value is *not* NaN.   

In [3]:
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [5]:
df.isna().head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,True,False


In [6]:
df[df['Age'].isna()].head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q


In [7]:
df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [8]:
df.isna().sum().sum()

np.int64(866)

In [9]:
df[df.isna().any(axis=1)].head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [10]:
df[df.isna().all(axis=1)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


In [11]:
len(df[df.isna().any(axis=1)])

708

In [12]:
without_age_cabin = df[df['Age'].isna() & df['Cabin'].isna()]
without_age_cabin.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,158.0,158.0,0.0,158.0,158.0,158.0
mean,0.259494,2.759494,,0.601266,0.189873,18.658806
std,0.439751,0.601806,,1.711482,0.554678,26.614283
min,0.0,1.0,,0.0,0.0,0.0
25%,0.0,3.0,,0.0,0.0,7.75
50%,0.0,3.0,,0.0,0.0,8.05
75%,1.0,3.0,,0.0,0.0,16.1
max,1.0,3.0,,8.0,2.0,227.525


In [13]:
first_passenger = df.loc[1, "Cabin"]
pd.isna(first_passenger)

True

In [14]:
first_passenger = df.loc[1, "Cabin"]
pd.notna(first_passenger)

False

### 📥 Handling Missing Data on Import

```pd.read_csv(filepath, keep_default_na=False)``` – Disables the default conversion of markers such as `"NA"`, `"NaN"` and empty fields to `NaN`; if `na_values` is not given, nothing is parsed as missing.  
```pd.read_csv(filepath, na_values=['val1', 'val2'])``` – Specifies additional strings to treat as `NaN` during import.  
```df.replace('VALUE', np.nan, inplace=True)``` – Replaces specific values with `NaN` in-place.  
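
The effect of these import options is easier to see on a tiny file. The following is a minimal sketch that assumes a hypothetical three-row CSV held in memory (`csv_text` and its columns are made up for illustration and are not part of the Titanic data):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real file
csv_text = "id,city,score\n1,NA,10\n2,,20\n3,Missing,30\n"

# Default behaviour: "NA" and the empty field both become NaN
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: nothing is converted, the values stay strings
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# na_values adds "Missing" on top of the default markers
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=["Missing"])

print(default_df["city"].isna().sum())  # missing values with defaults
print(raw_df["city"].isna().sum())      # missing values with defaults disabled
print(custom_df["city"].isna().sum())   # missing values with the extra marker
```

With `keep_default_na=False`, the empty field is kept as an empty string, which is why `Age` and `Cabin` show up as fully non-null `object` columns in the Titanic example.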

In [15]:
a_df = pd.read_csv(import_path, index_col="PassengerId", keep_default_na=False)
a_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       891 non-null    object 
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     891 non-null    object 
 10  Embarked  891 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 83.5+ KB


In [16]:
na_vals = ["C", "Missing"]
b_df = pd.read_csv(import_path, index_col="PassengerId", na_values=na_vals)
b_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  721 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In [17]:
b_df['Embarked'].unique()

array(['S', nan, 'Q'], dtype=object)

In [18]:
b_df.replace('Q', np.nan, inplace=True)

In [19]:
b_df['Embarked'].unique()

array(['S', nan], dtype=object)

### 🧽 Cleaning Data Types

```df.dtypes``` – Shows data types of each column.  
```df.select_dtypes(include=['type'])``` – Selects columns with a specific data type.  
```df.select_dtypes(exclude=['type'])``` – Excludes columns of a specific data type.  
```df['COLUMN'] = df['COLUMN'].astype('type')``` – Converts a column to the specified data type (e.g., `'int'`, `'float'`, `'object'`).  
```pd.to_numeric(df['COLUMN'])``` – Converts values to numbers.    
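
Real-world columns often contain entries that cannot be parsed as numbers. As a hedged sketch (the `raw_ages` Series is invented for illustration), `pd.to_numeric` with `errors="coerce"` turns such entries into `NaN` instead of raising an error:

```python
import pandas as pd

# Hypothetical column of ages read in as strings, with one unparseable entry
raw_ages = pd.Series(["22", "38.5", "unknown", None], name="Age")

# errors="coerce" converts unparseable values to NaN instead of raising
ages = pd.to_numeric(raw_ages, errors="coerce")

print(ages.dtype)         # float64
print(ages.isna().sum())  # counts "unknown" and None as missing
```

The coerced values can then be handled with the usual missing-data tools such as `dropna` or `fillna`.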

In [20]:
c_df = df.copy()
c_df.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

In [21]:
c_df.select_dtypes(include=['int']).head()

PassengerId,Survived,Pclass,SibSp,Parch
1,0,3,1,0
2,1,1,1,0
3,1,3,0,0
4,1,1,1,0
5,0,3,0,0


In [22]:
c_df.select_dtypes(exclude=['object']).head()

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
1,0,3,22.0,1,0,7.25
2,1,1,38.0,1,0,71.2833
3,1,3,26.0,0,0,7.925
4,1,1,35.0,1,0,53.1
5,0,3,35.0,0,0,8.05


In [23]:
c_df['Survived'] = c_df['Survived'].astype(str)

In [24]:
c_df.dtypes

Survived     object
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

In [25]:
c_df['Survived'] = pd.to_numeric(c_df['Survived'])

In [26]:
c_df.dtypes

Survived      int64
Pclass        int64
Name         object
Sex          object
Age         float64
SibSp         int64
Parch         int64
Ticket       object
Fare        float64
Cabin        object
Embarked     object
dtype: object

### 🗑️ Dropping Missing or Unwanted Data

```df.drop(index=LABEL)``` – Drops a row by index label.  
```df.dropna(axis=0)``` – Drops rows with *any* missing values.  
```df.dropna(axis=1)``` – Drops columns with *any* missing values.  
```df.dropna(how='all')``` – Drops rows/columns only if *all* values are missing.  
```df.dropna(subset=['COLUMN_1', 'COLUMN_2'], how='any')``` – Drops rows where *any* of the specified columns is missing.  
```df.drop(columns=['COLUMN_1', 'COLUMN_2'])``` – Drops specified columns.  
```df.drop_duplicates(inplace=True)``` – Removes duplicate rows in-place.  
```mask = df['COLUMN'] == value``` – Builds a Boolean mask that is `True` for rows where the column equals the value.  
```df.drop(index=df[mask].index)``` – Drops the rows that match the mask.
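
Dropping rows by a Boolean mask gives the same result as keeping the rows where the mask is false. A minimal sketch, assuming a small hypothetical `toy_df` rather than the full Titanic data:

```python
import pandas as pd

# Hypothetical four-passenger frame for illustration
toy_df = pd.DataFrame(
    {"Survived": [0, 1, 1, 0], "Pclass": [3, 1, 3, 2]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Boolean mask: True for the rows that should be removed
mask = toy_df["Survived"] == 0

# Variant 1: drop the matching rows by their index labels
dropped = toy_df.drop(index=toy_df[mask].index)

# Variant 2: keep only the rows where the mask is False
kept = toy_df[~mask]

print(dropped.equals(kept))  # True
```

The second variant is usually shorter; the `drop` form can be handy when the index labels to remove are already known.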

In [27]:
drop_a_df = df.copy()
drop_a_df.drop(index=1).head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [28]:
drop_b_df = df.copy()
drop_b_df.dropna(axis=0).head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [29]:
drop_c_df = df.copy()
drop_c_df.dropna(axis=1).head()

PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare
1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833
3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1
5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05


In [30]:
drop_d_df = df.copy()
drop_d_df.dropna(axis='index', how='all', subset=['Age', 'Cabin']).head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [31]:
drop_e_df = df.copy()
drop_e_df.dropna(axis='index', how='any', subset=['Age', 'Cabin']).head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [32]:
drop_f_df = df.copy()
drop_f_df.drop(columns=['Age', 'Cabin'], inplace=True)
drop_f_df.head()

PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.25,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,C
3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.925,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1,S
5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.05,S


In [33]:
drop_g_df = df.copy()
drop_g_df = pd.concat([drop_g_df, drop_g_df])
len(drop_g_df)

1782

In [34]:
drop_g_df.drop_duplicates(inplace=True)
len(drop_g_df)

891

In [35]:
survived_df = df.copy()
dead_filter = survived_df['Survived'] == 0
survived_df.drop(index=survived_df[dead_filter].index).head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


### 🧯 Filling Missing Data  

```df.fillna(0)``` – Replaces all missing values with `0`.  
```df.fillna({'COLUMN': VALUE})``` – Replaces missing values only in specified column(s).  
```df['COLUMN'].fillna(df['COLUMN'].mean())``` – Fills missing values in a column with the mean.  
```df['COLUMN'].fillna(df['COLUMN'].median())``` – Fills missing values with the median.  
```df.groupby(['COLUMN'])['TARGET_COLUMN'].transform(lambda x: x.fillna(x.median()))``` – Fills missing values with grouped medians.  
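
To make the difference between a global fill and a grouped fill concrete, here is a small sketch on a hypothetical six-row frame (`toy_df` and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age in each passenger class
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
})

# Global fill: every gap gets the median of all known ages (32.0)
overall = toy_df["Age"].fillna(toy_df["Age"].median())

# Grouped fill: each gap gets the median of its own class (45.0 and 22.0)
grouped = toy_df.groupby("Pclass")["Age"].transform(
    lambda s: s.fillna(s.median())
)

print(overall.tolist())  # [40.0, 32.0, 50.0, 20.0, 24.0, 32.0]
print(grouped.tolist())  # [40.0, 45.0, 50.0, 20.0, 24.0, 22.0]
```

`transform` returns a Series aligned with the original index, so the result can be assigned straight back to the column.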

In [36]:
fill_a_df = df.copy()
fill_a_df.fillna(0).head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,0,S


In [37]:
fill_b_df = df.copy()
fill_b_df.fillna({'Age': 0}, inplace=True)
fill_b_df.head(6)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,0.0,0,0,330877,8.4583,,Q


### 🧯 Detecting and Cleaning Invalid Categorical Values

```df['COLUMN'].str.strip().str.lower().str.title()``` – Strips surrounding whitespace and normalizes capitalization to title case.  
```df['COLUMN'] = df['COLUMN'].replace('invalid', np.nan)``` – Replaces invalid values with `NaN` (assigning the result back is more reliable than `inplace=True` on a selected column).  
```df['COLUMN'] = df['COLUMN'].replace(r'^\s*$', np.nan, regex=True)``` – Replaces empty or whitespace-only strings with `NaN`.
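
The Titanic `Sex` column is already clean, so the detection step is easier to show on made-up data. A minimal sketch, assuming a hypothetical `sex` Series and a hypothetical set of `allowed` categories:

```python
import numpy as np
import pandas as pd

# Hypothetical messy categorical column
sex = pd.Series([" male", "Female", "FEMALE ", "m", "   ", None], name="Sex")

# Normalize whitespace and case, then turn blank strings into NaN
normalized = sex.str.strip().str.lower()
normalized = normalized.replace(r"^\s*$", np.nan, regex=True)

# Anything that is present but not in the allowed set is invalid
allowed = {"male", "female"}
invalid = normalized[normalized.notna() & ~normalized.isin(allowed)]

print(invalid.tolist())         # ['m']
print(normalized.isna().sum())  # blank string and None are both missing
```

Listing the invalid values first makes it possible to decide case by case whether to map them to a valid category or replace them with `NaN`.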

In [38]:
san_a_df = df.copy()
san_a_df['Sex'].unique()

array(['male', 'female'], dtype=object)

In [39]:
san_a_df['Sex'] = san_a_df['Sex'].str.strip().str.lower().str.title()
san_a_df['Sex'].unique()

array(['Male', 'Female'], dtype=object)

In [40]:
san_b_df = df.copy()
san_b_df['Sex'] = san_b_df['Sex'].replace({'male': ' '})

In [41]:
san_b_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",,35.0,0,0,373450,8.05,,S


In [42]:
san_b_df['Sex'] = san_b_df['Sex'].replace(r'^\s*$', 'Male', regex=True)

In [43]:
san_b_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",Male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",Male,35.0,0,0,373450,8.05,,S


### 🧽 Cleaning the Age Column

To handle missing values in the `Age` column, we explored several common strategies:

**1. `clean_median_age_df`**  
Missing ages are filled with the median age of all passengers.  
- **Why median?** It’s robust to outliers and better represents the “typical” passenger age.
- **Result:** The age distribution remains realistic, and no missing values remain in `Age`.

**2. `clean_mean_age_df`**  
Missing ages are filled with the mean age.  
- **Why mean?** It uses the average, which can be skewed by very young or old ages.
- **Result:** The filled ages may slightly distort the age distribution if many outliers exist.

**3. `clean_related_age_df`**  
Missing ages are filled by estimating from related columns (here, the median age per `Survived`, `Pclass` and `Sex` group).  
- **Why conditional fill?** This leverages patterns in the data, providing more context-aware and potentially more accurate imputations.
- **Result:** The filled ages reflect group-specific medians, improving realism for modeling.

**4. `clean_droped_df`**  
All rows with missing `Age` values are dropped from the DataFrame.  
- **Why drop?** This is the strictest approach, ensuring all remaining rows have complete data.
- **Result:** No imputed values, only original data, but the dataset shrinks and valuable information may be lost.

**Summary:**  
Each method trades off realism, simplicity, and dataset size.  
- **Median and mean filling** are quick but can mask true variability.
- **Conditional filling** is more sophisticated, respecting group differences.
- **Dropping rows** guarantees no imputation but at the cost of losing data.
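
The point that a constant fill can mask variability can be checked on a few numbers. This is a rough sketch on a hypothetical `ages` Series, not on the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with three missing entries
ages = pd.Series([2.0, 22.0, 28.0, 35.0, 71.0, np.nan, np.nan, np.nan])

# Fill every gap with the median of the known values (28.0)
filled = ages.fillna(ages.median())

# The median is unchanged, but the spread shrinks because the
# filled values all sit at the centre of the distribution
print(ages.median(), filled.median())
print(ages.std() > filled.std())  # True
```

The more values are imputed with a single constant, the stronger this shrinking effect becomes.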

In [44]:
df['Age'].mean()

np.float64(29.69911764705882)

In [45]:
df['Age'].median()

np.float64(28.0)

In [46]:
without_age_df = df.copy()
without_age_filter = without_age_df['Age'].isna()
without_age_df[without_age_filter].head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S
20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.225,,C
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q


In [47]:
clean_median_age_df = df.copy()
clean_median_age_df.fillna({'Age': clean_median_age_df['Age'].median()}, inplace=True)
clean_median_age_df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [48]:
clean_mean_age_df = df.copy()
clean_mean_age_df.fillna({'Age': clean_mean_age_df['Age'].mean()}, inplace=True)
clean_mean_age_df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [49]:
clean_related_age_df = df.copy()
relation_filter = ['Survived','Pclass', 'Sex']
clean_related_age_df['Age'] = clean_related_age_df.groupby(relation_filter)['Age'].transform(lambda x: x.fillna(x.median()))
clean_related_age_df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [50]:
clean_droped_df = df.copy()
clean_droped_df = clean_droped_df.dropna(subset=['Age'])
clean_droped_df.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       529
Embarked      2
dtype: int64

### 🕵️‍♂️ Comparing Cleaned Age Data

In [51]:
print(f"Raw data set median age: >> {df['Age'].median()} << vs mean: >> {df['Age'].mean()} <<")
print(f"Clean median data set median age: >> {clean_median_age_df['Age'].median()} << vs mean: >> {clean_median_age_df['Age'].mean()} <<")
print(f"Clean mean data set median age: >> {clean_mean_age_df['Age'].median()} << vs mean: >> {clean_mean_age_df['Age'].mean()} <<")
print(f"Clean related data set median age: >> {clean_related_age_df['Age'].median()} << vs mean: >> {clean_related_age_df['Age'].mean()} <<")
print(f"Clean droped data set median age: >> {clean_droped_df['Age'].median()} << vs mean: >> {clean_droped_df['Age'].mean()} <<")

Raw data set median age: >> 28.0 << vs mean: >> 29.69911764705882 <<
Clean median data set median age: >> 28.0 << vs mean: >> 29.36158249158249 <<
Clean mean data set median age: >> 29.69911764705882 << vs mean: >> 29.69911764705882 <<
Clean related data set median age: >> 26.0 << vs mean: >> 29.071459034792365 <<
Clean droped data set median age: >> 28.0 << vs mean: >> 29.69911764705882 <<


### 🧽 Cleaning the Cabin Column

Missing values in the `Cabin` column are replaced with `'Unknown'`, ensuring every row has a value for `Cabin` and making the data easier to work with for analysis and modeling.

In [52]:
clean_cabin_df = df.copy()
without_cabin_filter = clean_cabin_df['Cabin'].isna()
clean_cabin_df[without_cabin_filter].head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [53]:
clean_cabin_df['Cabin'] = clean_cabin_df['Cabin'].fillna('Unknown')
clean_cabin_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Unknown,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Unknown,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Unknown,S


### 🧽 Cleaning the Embarked Column

Filters like `cabin_filter`, `ticket_filter`, etc. are used to check for patterns or correlations in the missing data before cleaning. Since no useful relationship is found, missing values in the `Embarked` column are replaced with `'Unkown Port'`, and all codes are mapped to their full port names, ensuring consistency and completeness.

In [54]:
without_embarked_df = df.copy()
without_embarked_filter = without_embarked_df['Embarked'].isna()
without_embarked_df[without_embarked_filter]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [55]:
cabin_filter = without_embarked_df['Cabin'].fillna('Unknown').str.contains('B2')
without_embarked_df[cabin_filter]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
541,1,1,"Crosby, Miss. Harriet R",female,36.0,0,2,WE/P 5735,71.0,B22,S
691,1,1,"Dick, Mr. Albert Adrian",male,31.0,1,0,17474,57.0,B20,S
746,0,1,"Crosby, Capt. Edward Gifford",male,70.0,1,1,WE/P 5735,71.0,B22,S
782,1,1,"Dick, Mrs. Albert Adrian (Vera Gillespie)",female,17.0,1,0,17474,57.0,B20,S
830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [56]:
ticket_filter = without_embarked_df['Ticket'].fillna('Unknown').str.contains('1135')
without_embarked_df[ticket_filter]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
167,1,1,"Chibnall, Mrs. (Edith Martha Bowerman)",female,,0,1,113505,55.0,E33,S
253,0,1,"Stead, Mr. William Thomas",male,62.0,0,0,113514,26.55,C87,S
352,0,1,"Williams-Lambert, Mr. Fletcher Fellows",male,,0,0,113510,35.0,C128,S
357,1,1,"Bowerman, Miss. Elsie Edith",female,22.0,0,1,113505,55.0,E33,S
378,0,1,"Widener, Mr. Harry Elkins",male,27.0,0,2,113503,211.5,C82,C
783,0,1,"Long, Mr. Milton Clyde",male,29.0,0,0,113501,30.0,D6,S
830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [57]:
icard_filter = without_embarked_df['Name'].fillna('Unknown').str.contains('Icard')
stone_filter = without_embarked_df['Name'].fillna('Unknown').str.contains('Stone')
evelyn_filter = without_embarked_df['Name'].fillna('Unknown').str.contains('Evelyn')
nelson_filter = without_embarked_df['Name'].fillna('Unknown').str.contains('Nelson')
# Combine the Boolean masks with a logical OR
without_embarked_df[stone_filter | icard_filter | evelyn_filter | nelson_filter]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
320,1,1,"Spedden, Mrs. Frederic Oakley (Margaretta Corn...",female,40.0,1,1,16966,134.5,E34,C
622,1,1,"Kimball, Mr. Edwin Nelson Jr",male,42.0,1,0,11753,52.5542,D19,S
830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [58]:
price_filter = (without_embarked_df['Fare'] >= 79) & (without_embarked_df['Fare'] <= 81)
without_embarked_df[price_filter]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
140,0,1,"Giglio, Mr. Victor",male,24.0,0,0,PC 17593,79.2,B86,C
257,1,1,"Thorne, Mrs. Gertrude Maybelle",female,,0,0,PC 17585,79.2,,C
263,0,1,"Taussig, Mr. Emil",male,52.0,1,1,110413,79.65,E67,S
559,1,1,"Taussig, Mrs. Emil (Tillie Mandelbaum)",female,39.0,1,1,110413,79.65,E67,S
586,1,1,"Taussig, Miss. Ruth",female,18.0,0,2,110413,79.65,E68,S
588,1,1,"Frolicher-Stehli, Mr. Maxmillian",male,60.0,1,1,13567,79.2,B41,C
790,0,1,"Guggenheim, Mr. Benjamin",male,46.0,0,0,PC 17593,79.2,B82 B84,C
830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [59]:
clean_embarked_df = df.copy()
clean_embarked_df.fillna({'Embarked': 'Unkown Port'}, inplace=True)
clean_embarked_df['Embarked'] = clean_embarked_df['Embarked'].replace({'S': 'Southampton', 
                                                                           'C': 'Cherbourg',
                                                                           'Q': 'Queenstown'})
clean_embarked_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,Southampton
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,Cherbourg
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,Southampton
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,Southampton
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,Southampton


### 🧼 Cleaned Titanic DataFrame

In this section, the final cleaned Titanic DataFrame is assembled by combining the individually cleaned columns: `Age` (median fill), `Cabin`, and `Embarked`. After these steps, the dataset contains no missing values in any column, making it ready for analysis or modeling.

In [60]:
cleaned_df = df.copy()
cleaned_df['Age'] = clean_median_age_df['Age']
cleaned_df['Embarked'] = clean_embarked_df['Embarked']
cleaned_df['Cabin'] = clean_cabin_df['Cabin']
cleaned_df.isna().sum()

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Cabin       0
Embarked    0
dtype: int64

In [61]:
data_processed = "../data/processed/"
export_path = os.path.join(data_processed, 'clean_titanic.csv')
cleaned_df.to_csv(export_path)

### 👉 Next Topic: [Data Modifying](./06-data-modifying.ipynb)

Learn how to modify data with pandas.