Let's now apply everything we learned from the EDA and the Auto EDA libraries analysis

Import the required libraries

In [43]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

Print library versions to avoid conflicts

In [44]:
print(f'Pandas Version: {pd.__version__}') 
print(f'Numpy Version: {np.__version__}') 
print(f'Matplotlib version: {matplotlib.__version__}')
print(f'Seaborn version: {sns.__version__}')

Pandas Version: 1.5.3
Numpy Version: 1.23.5
Matplotlib version: 3.6.3
Seaborn version: 0.12.2


Add some configurations

In [45]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.precision', 3)
pd.set_option('plotting.backend', 'matplotlib') 
pd.options.mode.chained_assignment = None
np.set_printoptions(suppress=True)
pd.options.display.float_format = '{:.4f}'.format

%matplotlib inline
plt.rcParams["figure.figsize"] = (15,7)

Load the train and test datasets

In [46]:
ds_titanic_train = pd.read_csv(r"C:\Users\Administrador\Documents\IA\Proyectos\Titanic\Datasets\train.csv", encoding = 'unicode_escape')
ds_work = ds_titanic_train.copy()

ds_titanic_test = pd.read_csv(r"C:\Users\Administrador\Documents\IA\Proyectos\Titanic\Datasets\test.csv", encoding = 'unicode_escape')
ds_test = ds_titanic_test.copy()

# Remove unnecessary features

We found that Ticket doesn't give relevant information about whether a passenger survived or not. Even repeated tickets don't tell us anything about that, so let's remove Ticket from our dataset

In [49]:
# errors="ignore" keeps this cell safe to re-run once Ticket is already gone
ds_work.drop(columns = ["Ticket"], inplace = True, errors = "ignore")

Passenger names don't give any relevant information about survival either, so let's remove them as well

In [None]:
ds_work.drop(["Name"], axis = 1, inplace = True)

Finally, since Cabin has 77% null values and doesn't give relevant information, let's remove it too

In [None]:
ds_work.drop(["Cabin"], axis = 1, inplace = True)
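
The three drops above could also be written as a single call. This is a minimal sketch on a toy frame (the values are made up for illustration, and `toy` and `cols_to_drop` are hypothetical names). Passing `errors="ignore"` means columns that were already removed are skipped instead of raising a `KeyError`, so the cell can be run more than once.

```python
import pandas as pd

# Toy frame standing in for ds_work (values are illustrative only)
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Name": ["Braund", "Cumings"],
    "Cabin": [None, "C85"],
})

cols_to_drop = ["Ticket", "Name", "Cabin"]
toy = toy.drop(columns=cols_to_drop, errors="ignore")

# Running the same drop a second time is a no-op instead of an error
toy = toy.drop(columns=cols_to_drop, errors="ignore")
print(list(toy.columns))
```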

# Treating null values

In [29]:
ds_work.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Out of 891 records, 687 have null values for the Cabin feature (77% of the dataset). Since it doesn't seem to be relevant to our analysis, we can remove it.

In [30]:
ds_work.drop(["Cabin"], axis = 1, inplace = True)
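
The percentages quoted in this section can be checked with a short calculation. The sketch below uses the null counts printed above and the 891 rows of the train set. On the real frame, `ds_work.isnull().mean() * 100` should give the same figures directly.

```python
import pandas as pd

# Null counts copied from the isnull().sum() output above
null_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891

# Share of missing values per feature, in percent
null_pct = (null_counts / n_rows * 100).round(1)
print(null_pct)
```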

Age seems like a more relevant feature. Removing the 177 rows with missing ages out of 891 (20% of the dataset) would lose valuable information, so we replace the null values with zeros instead.

In [31]:
ds_work["Age"] = ds_work["Age"].fillna(0)

For Embarked, we can check the rows with NaN values

In [32]:
ds_work[ds_work["Embarked"].isnull()]

     PassengerId  Survived  Pclass                                       Name     Sex   Age  SibSp  Parch  Ticket  Fare Embarked
61            62         1       1                        Icard, Miss. Amelie  female  38.0      0      0  113572  80.0      NaN
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0      NaN


We could drop these rows since they don't have any valuable information

In [33]:
# Only Embarked still has null values, so restrict the drop to that column
ds_work.dropna(subset = ["Embarked"], inplace = True)

Now we should not have any null values left

In [34]:
ds_work.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

We check the new shape of the dataset; only 2 rows and 1 feature were eliminated

In [35]:
ds_work.shape

(889, 11)

# Optimizing memory usage

Information about the features: number of non-null values, data type of each feature, and total memory usage

In [36]:
ds_work.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  889 non-null    int64  
 1   Survived     889 non-null    int64  
 2   Pclass       889 non-null    int64  
 3   Name         889 non-null    object 
 4   Sex          889 non-null    object 
 5   Age          889 non-null    float64
 6   SibSp        889 non-null    int64  
 7   Parch        889 non-null    int64  
 8   Ticket       889 non-null    object 
 9   Fare         889 non-null    float64
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB


Memory usage of each feature

In [37]:
ds_work.memory_usage()

Index          7112
PassengerId    7112
Survived       7112
Pclass         7112
Name           7112
Sex            7112
Age            7112
SibSp          7112
Parch          7112
Ticket         7112
Fare           7112
Embarked       7112
dtype: int64
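
Every column reports 7112 bytes because 889 rows × 8 bytes covers both the 64-bit numbers and, for `object` columns, only the 8-byte pointers to the Python strings. A minimal sketch on a toy frame (illustrative values, hypothetical name `toy`) shows how `memory_usage(deep=True)` also counts the strings themselves:

```python
import pandas as pd

# Toy frame with one object (string) column and one int64 column
toy = pd.DataFrame({
    "Name": pd.Series(
        ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], dtype="object"
    ),
    "Pclass": pd.Series([3, 3], dtype="int64"),
})

shallow = toy.memory_usage()        # object columns: pointer size only
deep = toy.memory_usage(deep=True)  # also counts the Python string objects
print(shallow["Name"], deep["Name"])
print(shallow["Pclass"], deep["Pclass"])
```

The numeric column reports the same size both ways, while the string column is larger under `deep=True`.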

Now, we change the data type of some features to save memory. The integer conversions are lossless. float16 keeps only about three significant decimal digits, so Fare is stored with some rounding.

These features have a small value range (PassengerId from 1 to 891; Survived only 0 and 1; Pclass from 1 to 3; SibSp from 0 to 8; Parch from 0 to 6; Fare from 0 to 512.3292; Age from 0 to 80), so we can use a smaller data type for them

In [38]:
ds_work["PassengerId"] = ds_work["PassengerId"].astype("int16")
ds_work["Survived"] = ds_work["Survived"].astype("int8")
ds_work["Pclass"] = ds_work["Pclass"].astype("int8")
ds_work["SibSp"] = ds_work["SibSp"].astype("int8")
ds_work["Parch"] = ds_work["Parch"].astype("int8")
ds_work["Fare"] = ds_work["Fare"].astype("float16")
ds_work["Age"] = ds_work["Age"].astype("float16")
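
Before downcasting, it helps to check what each target type can hold. The sketch below prints the integer limits with `np.iinfo` and shows what happens to the largest fare (512.3292) in float16. The float32 comparison is only an illustrative alternative, not something this notebook uses.

```python
import numpy as np

# Representable integer ranges of the target dtypes
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128 127
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)  # -32768 32767

# float16 has an 11-bit significand, so values between 512 and 1024
# are spaced 0.5 apart and the largest fare gets rounded
max_fare = 512.3292
fare_f16 = float(np.float16(max_fare))
fare_f32 = float(np.float32(max_fare))  # illustrative alternative
print(fare_f16)
print(round(fare_f32, 4))
```

All integer features fit comfortably in the chosen types. For Fare, float32 would keep about four decimals at twice the memory of float16.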

These features only take a few specific values (Sex takes male and female; Embarked takes S, C or Q), so we can convert them to categorical features

In [39]:
ds_work["Sex"] = ds_work["Sex"].astype("category")
ds_work["Embarked"] = ds_work["Embarked"].astype("category")
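
A categorical column stores each distinct label once and keeps a small integer code per row. This is a minimal sketch with made-up values (the name `sex` is hypothetical):

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"], dtype="category")

print(list(sex.cat.categories))  # ['female', 'male']
print(sex.cat.codes.dtype)       # int8
print(sex.cat.codes.tolist())    # [1, 0, 0, 1]
```

With only a handful of categories the codes fit in one byte per row, plus a small fixed cost for the category labels.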

We check again the memory usage of the entire dataset and of each feature

In [40]:
ds_work.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  889 non-null    int16   
 1   Survived     889 non-null    int8    
 2   Pclass       889 non-null    int8    
 3   Name         889 non-null    object  
 4   Sex          889 non-null    category
 5   Age          889 non-null    float16 
 6   SibSp        889 non-null    int8    
 7   Parch        889 non-null    int8    
 8   Ticket       889 non-null    object  
 9   Fare         889 non-null    float16 
 10  Embarked     889 non-null    category
dtypes: category(2), float16(2), int16(1), int8(4), object(2)
memory usage: 31.5+ KB


In [41]:
ds_work.memory_usage()

Index          7112
PassengerId    1778
Survived        889
Pclass          889
Name           7112
Sex            1013
Age            1778
SibSp           889
Parch           889
Ticket         7112
Fare           1778
Embarked       1021
dtype: int64

Even though the dataset is small, we reduced the memory usage from 83.3 KB to 31.5 KB, a reduction of about 62%
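
The totals can be reproduced from the per-column byte counts in the two `memory_usage()` outputs above:

```python
# Per-column byte counts copied from the two memory_usage() outputs above
before = [7112] * 12
after = [7112, 1778, 889, 889, 7112, 1013, 1778, 889, 889, 7112, 1778, 1021]

kb_before = sum(before) / 1024
kb_after = sum(after) / 1024
reduction = 1 - sum(after) / sum(before)
print(f"{kb_before:.1f} KB -> {kb_after:.1f} KB ({reduction:.1%} saved)")
```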

# Saving dataset

Now let's save our dataset

In [42]:
ds_work.to_csv(r"C:\Users\Administrador\Documents\IA\Proyectos\Titanic\Datasets\processed_ds.csv", encoding = 'latin-1', index = False, header = True, decimal = '.', sep = ',')
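
CSV is a plain-text format, so the optimized data types are not stored in the file. When the processed dataset is read back, pandas infers the types again unless they are passed explicitly. This is a minimal sketch using an in-memory buffer and a toy frame (illustrative values, hypothetical names) instead of the real file:

```python
import io
import pandas as pd

# Toy frame with the same kind of optimized dtypes as ds_work
toy = pd.DataFrame({"Pclass": [1, 3], "Sex": ["female", "male"]})
toy["Pclass"] = toy["Pclass"].astype("int8")
toy["Sex"] = toy["Sex"].astype("category")

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dtypes are inferred again
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"Pclass": "int8", "Sex": "category"})

print(plain.dtypes)
print(typed.dtypes)
```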