# Cleaning & Preprocessing (Titanic Dataset)

The aim of this project is to prepare and clean the Titanic dataset, by handling missing values, fixing data types and removing unwanted data to get the data ready for analysis.

# *Libraries & Dataset*

In [1]:
# Import libraries

import pandas as pd
import numpy as np

In [2]:
# Read the dataset

df = pd.read_csv("titanic.csv")

# *Inspecting Raw Data*

Column Definitions

PassengerId - A unique ID assigned to each passenger.

Survived - Survival status: 0 = Did not survive, 1 = Survived.

Pclass - Passenger class: 1 = First class, 2 = Second class, 3 = Third class.

Name - Name of the passenger.

Sex - Gender of the passenger: male or female.

Age - Age of the passenger. 

SibSp - Number of siblings or spouses aboard the Titanic with the passenger.

Parch - Number of parents or children aboard the Titanic with the passenger.

Ticket - Ticket number.

Fare - Price paid by the passenger for the ticket.

Cabin - Cabin number where the passenger stayed.

Embarked - Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton.

In [3]:
# Show the first 5 rows of the raw data

df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Showing the columns, non-null values per column and the data types per column

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The output of df.info() shows that the Pclass column, which represents the passenger’s ticket class, is stored as an integer (int64). Because its values denote distinct classes rather than numeric quantities, it is better treated as a categorical variable and will be converted to a categorical type during cleaning. Similarly, the Sex column, which holds the values 'male' or 'female', is stored as a generic object (string) type and will also be converted to a categorical type to reduce memory usage and simplify analysis.

In [5]:
# Dataset dimensions rows/columns

df.shape

(891, 12)

In [6]:
# Finding missing values per column

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The missing-value counts show that most columns are complete. The exceptions are Age (177 of 891 values missing, about 20%), Cabin (687 missing, about 77%) and Embarked (2 missing). The missing Age values will be imputed during cleaning so that the column is complete for analysis.
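The raw counts are easier to compare when expressed as a share of rows. The sketch below assumes a small made-up frame named `toy` (not the real data) and uses the fact that the mean of a boolean column is the fraction of `True` values. Applied to the real `df`, the counts reported above correspond to about 19.9% for Age (177 of 891) and 77.1% for Cabin (687 of 891).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() returns a boolean frame; the column mean is the fraction missing.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the columns that need the most attention at the top.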

# *Data Cleaning*

In [7]:
# Checking significance of 'Ticket'

df['Ticket'].describe()

count        891
unique       681
top       347082
freq           7
Name: Ticket, dtype: object

With 681 unique values across 891 rows, the Ticket column is close to a per-passenger identifier and holds little value for further analysis.

In [8]:
# Removing unwanted columns

df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

Removing irrelevant columns streamlines the dataset so that it includes only features that can provide meaningful insights. Each passenger’s PassengerId is unique, making it unsuitable for drawing any general conclusions. The Name column likewise acts as an identifier rather than a feature; it holds little analytical value and has been removed to avoid complications during machine learning model development. Finally, the Ticket column is excluded because of its high number of unique values, and the Cabin column because it is missing for 687 of the 891 passengers.
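One way to make the "identifier-like" argument concrete is to look at the share of distinct values per column. This is a minimal sketch on a made-up frame named `toy`, not the author's actual check. On the real data, the figures reported above give a ratio of about 0.76 for Ticket (681 unique values in 891 rows).

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A/5 21171", "PC 17599", "113803", "113803"],
    "Sex": ["male", "female", "female", "male"],
})

# Share of distinct values per column: 1.0 means every row is unique.
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)
```

A ratio near 1.0 suggests the column behaves like an ID, while a low ratio suggests a small set of repeating categories.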

In [9]:
# Filling in missing values

df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

In [10]:
# Updated missing values per column

df.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

From the data inspection, it was identified that the Age and Cabin columns contained a significant number of missing values, while the Embarked column had only two missing entries. Since the Cabin column has been removed, the focus shifted to handling missing values in Age and Embarked.

The missing values in the Age column were replaced with the median, which represents the middle value of the distribution. Although the mean was considered as an alternative, it was deemed less suitable due to its sensitivity to outliers. For the Embarked column, the missing values were filled with the mode, the most frequent category in the column. Given there were only two missing values, this replacement is unlikely to introduce significant bias.
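The effect of a single extreme value on the two candidate fill values can be seen in a small sketch. The series `ages` and `ports` below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative ages with one extreme value and one missing entry.
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, np.nan])

mean_age = ages.mean()      # pulled upwards by the 80.0
median_age = ages.median()  # middle value, unaffected by the 80.0
print(mean_age, median_age)

# Fill the gap with the median, as in the cleaning step above.
filled_ages = ages.fillna(median_age)

# For a categorical column, the most frequent value is used instead.
ports = pd.Series(["S", "C", "S", None])
filled_ports = ports.fillna(ports.mode()[0])
```

Note that `mode()` returns a Series because several values can tie for most frequent; `[0]` takes the first of them.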

In [11]:
# Data types per column

df.dtypes

Survived      int64
Pclass        int64
Sex          object
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Embarked     object
dtype: object

In [12]:
# Changing data types

df['Pclass'] = df['Pclass'].astype('category')
df['Sex'] = df['Sex'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')

In [13]:
# Updated data types

df.dtypes

Survived       int64
Pclass      category
Sex         category
Age          float64
SibSp          int64
Parch          int64
Fare         float64
Embarked    category
dtype: object

As noted earlier during the raw data inspection, certain columns required data type conversions. The Sex and Embarked columns were originally stored as objects (strings) and were converted to categorical types. Since these columns contain a limited set of repeating values, converting them to categories improves memory efficiency and analytical performance. The Pclass column, initially an integer, was also converted to a categorical type because its values (1, 2, or 3) are labels for passenger classes rather than numeric quantities. Treating Pclass as a numeric variable could mislead machine learning models by implying that the differences between classes are meaningful numeric distances, which does not reflect reality.
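The memory claim can be checked with a short sketch. It assumes a made-up series of 1,000 repeating strings rather than the real column, so the exact byte counts will differ from those of the Titanic data.

```python
import pandas as pd

# 1,000 repeating strings stored as generic Python objects.
sex_obj = pd.Series(["male", "female"] * 500, dtype=object)

# The same values stored as a category: small integer codes plus a lookup table.
sex_cat = sex_obj.astype("category")

# deep=True counts the memory of the string objects themselves.
obj_bytes = sex_obj.memory_usage(deep=True)
cat_bytes = sex_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)

# The distinct labels are stored once.
print(sex_cat.cat.categories.tolist())
```

The saving grows with the number of rows, since each additional row costs only one small integer code in the categorical version.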

The Survived column, although numeric, is a binary indicator (0 for deceased, 1 for survived). It is retained as an integer type to allow calculation of statistics such as the mean survival rate, which would not be straightforward if it were treated as a categorical variable.
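Because Survived is coded as 0 or 1, its mean is the share of survivors. The sketch below uses only the five rows shown by `df.head()` earlier, so the resulting rates are not representative of the full dataset. The frame name `head_rows` is an assumption for illustration.

```python
import pandas as pd

# Survived and Sex for the five rows displayed by df.head() above.
head_rows = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Sex": pd.Categorical(["male", "female", "female", "female", "male"]),
})

# Mean of a 0/1 column is the proportion of ones.
overall_rate = head_rows["Survived"].mean()

# Rate per group; observed=True limits the result to categories present in the data.
rate_by_sex = head_rows.groupby("Sex", observed=True)["Survived"].mean()
print(overall_rate)
print(rate_by_sex)
```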

# *Additional Column*

In [14]:
# Creating a new column

df['FamilyMembers'] = df['SibSp'] + df['Parch'] + 1

In [15]:
# Display updated columns

df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked', 'FamilyMembers'],
      dtype='object')

A new column was created to represent the total number of family members aboard by combining the SibSp (siblings/spouses) and Parch (parents/children) columns, plus one to include the individual passenger themselves. This aggregated feature may be a useful indicator of survival, as it consolidates two related variables, each of which is zero for many passengers, into a single family-size value.

For now, the original SibSp and Parch columns will be retained, but they may be removed later depending on their impact on model performance during the machine learning phase.
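As a worked example of the FamilyMembers calculation, the sketch below applies the same formula to the SibSp and Parch values of the five rows shown by `df.head()` earlier. The frame name `head_rows` is an assumption for illustration.

```python
import pandas as pd

# SibSp and Parch for the five rows displayed by df.head() above.
head_rows = pd.DataFrame({
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# Same formula as in the notebook: relatives aboard plus the passenger.
head_rows["FamilyMembers"] = head_rows["SibSp"] + head_rows["Parch"] + 1
family_sizes = head_rows["FamilyMembers"].tolist()
print(family_sizes)  # [2, 2, 1, 2, 1]

# A passenger travelling alone has a family size of exactly 1.
travelling_alone = (head_rows["FamilyMembers"] == 1).tolist()
```

Because of the added one, the column can never be smaller than 1, which is a quick sanity check to run on the full dataset.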

# *Save Cleaned Dataset*

In [16]:
# Saving CSV

df.to_csv("titanic_cleaned.csv", index=False)

The cleaned dataset has been saved as "titanic_cleaned.csv" in the working folder.

In [17]:
# Saving Feather

df.to_feather("titanic_cleaned.feather")

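The cleaned dataset has also been saved as "titanic_cleaned.feather" in the working folder. Unlike CSV, the Feather format preserves column dtypes, including the categorical columns created during cleaning. Note that to_feather requires the pyarrow package to be installed.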
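The reason for keeping a dtype-preserving copy can be illustrated with a CSV round trip. This is a minimal sketch on a made-up frame named `toy`, written to an in-memory buffer instead of a file. It shows that categorical columns come back as plain types unless the dtype is requested again on load.

```python
import io

import pandas as pd

# Toy frame with two categorical columns; the values are illustrative only.
toy = pd.DataFrame({
    "Pclass": pd.Categorical([3, 1, 3]),
    "Sex": pd.Categorical(["male", "female", "female"]),
})

# Write to an in-memory CSV and read it back with default settings.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)  # the category dtype is not kept

# The dtype can be requested explicitly when reading the CSV.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Pclass": "category", "Sex": "category"})
print(restored.dtypes)
```

When restoring categories this way, it is worth checking `restored["Pclass"].cat.categories`, since labels parsed from text are not guaranteed to match the original integer labels.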