# 📈 Titanic - Classification Model
## A classic classification problem: predicting Titanic survivors

## 📜 1. Summary

In this study, we analyze the Titanic survivors dataset, clean the data, and test three classification models:

1 - Logistic Regression

2 - Decision Tree (DecisionTreeClassifier)

3 - Gradient Boosting (XGBClassifier)

> **Results:** Accuracy on the validation set for the three models and the prediction for the test data.


## 📂 2. Dataset  

The data comes from the "Titanic extended dataset (Kaggle + Wikipedia)", available on [Kaggle](https://www.kaggle.com/datasets/pavlofesenko/titanic-extended).

## 🛠️ 3. Data Information 

### Libraries 🐍

Python libraries used:

In [65]:
import re

import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from xgboost import XGBClassifier

## 🛠️ 3.1 Loading Data

In [66]:
df = pd.read_csv('C:/Users/Gabriel/OneDrive/Área de Trabalho/Análise de Dados/DataSets/Titanic/full.csv')

## 🛠️ 3.2 Accessing Data

In [67]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Embarked | WikiId | Name_wiki | Age_wiki | Hometown | Boarded | Destination | Lifeboat | Body | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | ... | S | 691.0 | Braund, Mr. Owen Harris | 22.0 | Bridgerule, Devon, England | Southampton | Qu'Appelle Valley, Saskatchewan, Canada | | | 3.0 |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | C | 90.0 | Cumings, Mrs. Florence Briggs (née Thayer) | 35.0 | New York, New York, US | Cherbourg | New York, New York, US | 4 | | 1.0 |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | ... | S | 865.0 | Heikkinen, Miss Laina | 26.0 | Jyväskylä, Finland | Southampton | New York City | 14? | | 3.0 |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | ... | S | 127.0 | Futrelle, Mrs. Lily May (née Peel) | 35.0 | Scituate, Massachusetts, US | Southampton | Scituate, Massachusetts, US | D | | 1.0 |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | ... | S | 627.0 | Allen, Mr. William Henry | 35.0 | Birmingham, West Midlands, England | Southampton | New York City | | | 3.0 |


## 🛠️ 3.3 Accessing Data Info

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
 12  WikiId       1304 non-null   float64
 13  Name_wiki    1304 non-null   object 
 14  Age_wiki     1302 non-null   float64
 15  Hometown     1304 non-null   object 
 16  Boarded      1304 non-null   object 
 17  Destination  1304 non-null   object 
 18  Lifeboat     502 non-null    object 
 19  Body  

## 🧹 5. Cleaning the Data

We clean the dataset by dropping the columns that are not used in this analysis. Name_wiki is kept for now because the passenger titles are extracted from it later.

In [69]:
df.drop(columns=['Name', 'Ticket', 'Cabin', 'Body', 'Class', 'Hometown', 'Boarded', 'Destination', 'WikiId', 'Lifeboat', 'Age_wiki'], inplace=True)

In [70]:
df

| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Name_wiki |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Braund, Mr. Owen Harris |
| 1 | 2 | 1.0 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cumings, Mrs. Florence Briggs (née Thayer) |
| 2 | 3 | 1.0 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Heikkinen, Miss Laina |
| 3 | 4 | 1.0 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | Futrelle, Mrs. Lily May (née Peel) |
| 4 | 5 | 0.0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Allen, Mr. William Henry |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | 1305 | | 3 | male | | 0 | 0 | 8.0500 | S | Spector, Mr. Woolf |
| 1305 | 1306 | | 1 | female | 39.0 | 0 | 0 | 108.9000 | C | and maid, Doña Fermina Oliva y Ocana |
| 1306 | 1307 | | 3 | male | 38.5 | 0 | 0 | 7.2500 | S | Sæther, Mr. Simon Sivertsen |
| 1307 | 1308 | | 3 | male | | 0 | 0 | 8.0500 | S | Ware, Mr. Frederick William |


In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Sex          1309 non-null   object 
 4   Age          1046 non-null   float64
 5   SibSp        1309 non-null   int64  
 6   Parch        1309 non-null   int64  
 7   Fare         1308 non-null   float64
 8   Embarked     1307 non-null   object 
 9   Name_wiki    1304 non-null   object 
dtypes: float64(3), int64(4), object(3)
memory usage: 102.4+ KB


In [72]:
df.isnull().sum()

PassengerId      0
Survived       418
Pclass           0
Sex              0
Age            263
SibSp            0
Parch            0
Fare             1
Embarked         2
Name_wiki        5
dtype: int64
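
The raw counts above can be easier to compare when shown next to the share of rows they represent. The following is a minimal sketch on a toy frame (the values are illustrative, not taken from the Titanic data, and the names `toy` and `missing` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

On the real DataFrame, the same two expressions would be applied to `df` instead of `toy`.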

We have a lot of missing data. The 418 missing Survived values belong to the unlabeled test rows, so let's deal with the other columns:

Handling the "Fare" column:

In [None]:
# Calculating the median of the 'Fare' column
median_fare = df['Fare'].median()

# Filling the null value with the median
# (assigning back avoids the chained in-place fillna that newer pandas versions warn about)
df['Fare'] = df['Fare'].fillna(median_fare)


Processing the "Age" column:
Let's use a conditional approach that estimates a missing age from the title in the passenger's name (Mr., Mrs., Master.).
Passengers with the title Master (usually boys) are younger, while those with Mrs. (married women) tend to be older.

In [74]:
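import re
# Function to extract the title from a name.
# The pattern requires a trailing period, so names written without one
# (e.g. "Heikkinen, Miss Laina") return an empty string.
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+\.)', name)
    if title_search:
        return title_search.group(1)
    return ""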

In [75]:
# Converting column 'Name_wiki' to string
df['Name_wiki'] = df['Name_wiki'].astype(str)

In [76]:
df['Title'] = df['Name_wiki'].apply(get_title)

In [77]:
df.head()

| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Name_wiki | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Braund, Mr. Owen Harris | Mr. |
| 1 | 2 | 1.0 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cumings, Mrs. Florence Briggs (née Thayer) | Mrs. |
| 2 | 3 | 1.0 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Heikkinen, Miss Laina | |
| 3 | 4 | 1.0 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | Futrelle, Mrs. Lily May (née Peel) | Mrs. |
| 4 | 5 | 0.0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Allen, Mr. William Henry | Mr. |
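
Note that row 2 has an empty Title: Name_wiki writes "Miss Laina" without a period, so the pattern above finds nothing. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to make the period optional and take the word that follows the comma. The helper `get_title_lenient` is hypothetical, and it returns titles without the period, so the `replace` mappings used later would need the same adjustment. It would also return a first name for entries that carry no title at all.

```python
import re

# Hypothetical helper (not part of the original notebook): the period after
# the title is optional, so "Miss Laina" and "Mr. Owen" are both matched.
TITLE_PATTERN = re.compile(r",\s*([A-Za-z]+)\.?\s")

def get_title_lenient(name: str) -> str:
    """Return the word right after the comma, without a trailing period."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else ""

examples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss Laina",
    "Cumings, Mrs. Florence Briggs (née Thayer)",
]
titles = [get_title_lenient(n) for n in examples]
print(titles)  # ['Mr', 'Miss', 'Mrs']
```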


In [78]:
# Mapping rare titles to more common categories
df['Title'] = df['Title'].replace(['Lady.', 'Countess.', 'Capt.', 'Col.', 'Don.', 'Dr.', 'Major.', 'Rev.', 'Sir.', 'Jonkheer.', 'Dona.'], 'Rare')
df['Title'] = df['Title'].replace('Mlle.', 'Miss.')
df['Title'] = df['Title'].replace('Ms.', 'Miss.')
df['Title'] = df['Title'].replace('Mme.', 'Mrs.')

In [79]:
# Calculating the median age for each title
median_ages_by_title = df.groupby('Title')['Age'].median()

In [80]:
# Imputing missing 'Age' values with the title median
for title in df['Title'].unique():
    df.loc[(df['Age'].isnull()) & (df['Title'] == title), 'Age'] = median_ages_by_title[title]

In [None]:
# Treating any remaining cases (titles with no known ages) with the overall median
if df['Age'].isnull().sum() > 0:
    overall_median_age = df['Age'].median()
    df['Age'] = df['Age'].fillna(overall_median_age)
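
The loop and fallback above can also be written without explicit iteration. The sketch below uses `groupby(...).transform("median")` on a toy frame (illustrative values, hypothetical names); it is meant as an equivalent formulation of the same idea, not as a replacement for the notebook's code.

```python
import numpy as np
import pandas as pd

# Toy data: one title group ("") has no known ages at all.
toy = pd.DataFrame({
    "Title": ["Mr.", "Mr.", "Mr.", "Mrs.", "Mrs.", ""],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, np.nan],
})

# Median age of each title group, broadcast back to every row.
group_median = toy.groupby("Title")["Age"].transform("median")

# Step 1: fill from the group median.
toy["Age"] = toy["Age"].fillna(group_median)
# Step 2: fall back to the overall median for groups without any known age.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```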

In [82]:
df

| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Name_wiki | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Braund, Mr. Owen Harris | Mr. |
| 1 | 2 | 1.0 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cumings, Mrs. Florence Briggs (née Thayer) | Mrs. |
| 2 | 3 | 1.0 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Heikkinen, Miss Laina | |
| 3 | 4 | 1.0 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | Futrelle, Mrs. Lily May (née Peel) | Mrs. |
| 4 | 5 | 0.0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Allen, Mr. William Henry | Mr. |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | 1305 | | 3 | male | 29.0 | 0 | 0 | 8.0500 | S | Spector, Mr. Woolf | Mr. |
| 1305 | 1306 | | 1 | female | 39.0 | 0 | 0 | 108.9000 | C | and maid, Doña Fermina Oliva y Ocana | |
| 1306 | 1307 | | 3 | male | 38.5 | 0 | 0 | 7.2500 | S | Sæther, Mr. Simon Sivertsen | Mr. |
| 1307 | 1308 | | 3 | male | 29.0 | 0 | 0 | 8.0500 | S | Ware, Mr. Frederick William | Mr. |


## 5.1 Feature Engineering
Feature Engineering: Let's transform the Embarked and Sex columns into numeric ones (Pclass is already an integer).


In [None]:
# Filling in the missing values of Embarked with the most common value (mode)
moda_embarque = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(moda_embarque)

In [84]:
# Checking that the null values are filled
print("Null values in 'Embarked' after imputation:")
print(df['Embarked'].isnull().sum())

Null values in 'Embarked' after imputation:
0


In [85]:
# Now, let's convert the 'Embarked' column to numbers using One-Hot Encoding
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

In [86]:
df.head()

| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Name_wiki | Title | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | male | 22.0 | 1 | 0 | 7.25 | Braund, Mr. Owen Harris | Mr. | False | True |
| 1 | 2 | 1.0 | 1 | female | 38.0 | 1 | 0 | 71.2833 | Cumings, Mrs. Florence Briggs (née Thayer) | Mrs. | False | False |
| 2 | 3 | 1.0 | 3 | female | 26.0 | 0 | 0 | 7.925 | Heikkinen, Miss Laina | | False | True |
| 3 | 4 | 1.0 | 1 | female | 35.0 | 1 | 0 | 53.1 | Futrelle, Mrs. Lily May (née Peel) | Mrs. | False | True |
| 4 | 5 | 0.0 | 3 | male | 35.0 | 0 | 0 | 8.05 | Allen, Mr. William Henry | Mr. | False | True |
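
Only Embarked_Q and Embarked_S appear because `drop_first=True` removes the first category, so a passenger who boarded at C is encoded as False in both columns (row 1 above). A small sketch on a toy column (illustrative data, hypothetical names) shows the effect in isolation:

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# drop_first=True removes the first category in sorted order ("C"),
# which becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(encoded.columns.tolist())  # ['Embarked_Q', 'Embarked_S']
```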


Label Encoding. 
Let's map male to 0 and female to 1.

In [87]:
# Using the pandas .map() method
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

Feature Engineering: Creating the FamilySize Category

One of the most valuable features we can create is family size. It's believed that passengers traveling in larger families had a different survival rate than those traveling alone or in small families. We can calculate this by adding the number of parents/children (Parch) and the number of siblings/spouses (SibSp), and then adding the individual (+1).

In [88]:
# Creating the feature 'FamilySize'
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

In [89]:
# Creating the feature 'IsAlone'
df['IsAlone'] = 0
df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1

In [90]:
# Displaying the first few rows to see the new columns
print("DataFrame with the new features 'FamilySize' and 'IsAlone':")
print(df[['SibSp', 'Parch', 'FamilySize', 'IsAlone']].head())

DataFrame with the new features 'FamilySize' and 'IsAlone':
   SibSp  Parch  FamilySize  IsAlone
0      1      0           2        0
1      1      0           2        0
2      0      0           1        1
3      1      0           2        0
4      0      0           1        1
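
The two features can also be derived in a single vectorized step each. This is a minimal sketch on toy values (the frame `toy` is hypothetical), producing the same kind of 0/1 flag as the two-step assignment above:

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size counts the passenger plus siblings/spouses and parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# The boolean comparison cast to int yields 1 for solo travelers, 0 otherwise.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```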


Final removal of columns to begin feature selection for the model.

In [91]:
# List of columns to remove
columns_to_drop = ['Name_wiki', 'Title', 'SibSp', 'Parch']

# Removing the columns
df.drop(columns_to_drop, axis=1, inplace=True)

## ⚙️6. Classification Model
Starting the classification model

Separating Training and Test Data


In [92]:
# Importing the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [93]:
# Splitting back into training and testing based on the original number of rows
df_train = df.iloc[:891].copy()
df_test = df.iloc[891:].copy()

# The Survived column exists only in the original df_train
df_test.drop('Survived', axis=1, inplace=True)

In [94]:
# Defining the features (X) and the target variable (y) for training
X_train = df_train.drop('Survived', axis=1)
y_train = df_train['Survived']

# The features for the test set are the same, but without the 'Survived' column
X_test = df_test
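
The split above relies on the first 891 rows being the labeled ones, which matches the 891 non-null Survived values reported by `df.info()`. A variant that derives the split from the data itself is sketched below on a toy frame. It also sets PassengerId aside instead of passing it to the model; that is a suggestion only, since the notebook keeps PassengerId among the features and the accuracies reported later were obtained that way. All names ending in `_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0.0, 1.0, np.nan, np.nan],
    "Pclass": [3, 1, 3, 2],
})

# Rows with a known target are training rows; the rest are test rows.
is_train = toy["Survived"].notna()
train, test = toy[is_train], toy[~is_train]

# Keep the identifier and the target out of the feature matrix.
feature_cols = [c for c in toy.columns if c not in ("PassengerId", "Survived")]
X_train_toy = train[feature_cols]
y_train_toy = train["Survived"].astype(int)
X_test_toy = test[feature_cols]

print(X_train_toy.shape, X_test_toy.shape)
```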

## 6.1 Logistic Regression model

For the first model, Logistic Regression is a natural choice.
It's a simple, robust, and effective baseline for binary classification problems like this one.

In [None]:
# Instantiating the model
model = LogisticRegression(max_iter=200)

# Training the model with training data
model.fit(X_train, y_train)

# Making predictions on the test set
y_pred = model.predict(X_test)
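
Because the features sit on very different scales (Fare and PassengerId reach into the hundreds or thousands, while Sex is 0 or 1), the solver may need many iterations to converge. One common option, shown here as a sketch on synthetic data and not as the notebook's method, is to standardize the features inside a pipeline. The names `X_demo`, `y_demo`, and `scaled_lr` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_demo[:, 0] *= 1000  # mimic one column with much larger values

# The scaler is fitted on the training data only, then applied before the model.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
scaled_lr.fit(X_demo, y_demo)

demo_score = scaled_lr.score(X_demo, y_demo)
print(f"Training accuracy on synthetic data: {demo_score:.2f}")
```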

Let's check the accuracy of the Logistic Regression model on the training data:

In [96]:
# Making predictions on training data
y_train_pred = model.predict(X_train)

# Comparing predictions to true training values
train_accuracy = accuracy_score(y_train, y_train_pred)

In [97]:
print(f"Model accuracy on training data: {train_accuracy:.2f}")

Model accuracy on training data: 0.81


One model is not enough!
Let's try a Decision Tree next:

## 6.2 Decision Tree Model

In [98]:
# Importing the Decision Tree library
from sklearn.tree import DecisionTreeClassifier

In [99]:
# Instantiating the model
# We use a 'random_state' parameter to ensure results are reproducible
tree_model = DecisionTreeClassifier(random_state=42)

# Training the model with the same data we used before
tree_model.fit(X_train, y_train)

In [100]:
# Making predictions on training data to assess accuracy
y_train_pred_tree = tree_model.predict(X_train)

In [101]:
# Calculating accuracy
train_accuracy_tree = accuracy_score(y_train, y_train_pred_tree)

print(f"Decision Tree Model Accuracy on Training Data: {train_accuracy_tree:.2f}")


Decision Tree Model Accuracy on Training Data: 1.00
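
A training accuracy of 1.00 usually means the unrestricted tree has memorized the training rows, so it says little about unseen passengers. The sketch below illustrates the gap on synthetic, deliberately noisy data (not the Titanic set) by comparing training accuracy with 5-fold cross-validated accuracy; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y); not the Titanic set.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, flip_y=0.2, random_state=42
)

# A fully grown tree fits every training point, noise included.
tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
train_acc = tree.score(X_demo, y_demo)

# Cross-validation scores each fold with a tree that never saw that fold.
cv_acc = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5
).mean()

print(f"train={train_acc:.2f}  cv={cv_acc:.2f}")
```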


Let's visualize the decision tree to understand the rules the model has learned.

In [None]:
# The Python package also needs the Graphviz system binaries to render images
%pip install graphviz

In [103]:
import graphviz
from sklearn.tree import export_graphviz

# Names for classes (improves visualization)
class_names = ['Did not survive', 'Survived']

# Exporting the tree graph
dot_data = export_graphviz(
    tree_model,
    out_file=None,
    feature_names=list(X_train.columns),  # plain list of column names
    class_names=class_names,
    filled=True,
    rounded=True,
    special_characters=True
)

In [104]:
# Rendering the tree and saving it to disk (graphviz renders to PDF by default)
graph = graphviz.Source(dot_data)
graph.render("decision_tree_titanic", view=True)

print("The tree view was saved as 'decision_tree_titanic.pdf' and should have opened automatically.")

The tree view was saved as 'decision_tree_titanic.pdf' and should have opened automatically.


## 6.3 Gradient Boosting
XGBClassifier

Let's try one more model, this time Gradient Boosting with XGBoost:

In [105]:
# Importing the XGBoost library
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [None]:
# Instantiating the model
# 'use_label_encoder' is deprecated in recent XGBoost releases, so it is omitted
# 'eval_metric=logloss' sets the evaluation metric
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)

# Training the model
xgb_model.fit(X_train, y_train)

In [107]:
# Making predictions on training data to assess accuracy
y_train_pred_xgb = xgb_model.predict(X_train)

# Calculating accuracy
train_accuracy_xgb = accuracy_score(y_train, y_train_pred_xgb)

print(f"XGBoost model accuracy on training data: {train_accuracy_xgb:.2f}")

XGBoost model accuracy on training data: 1.00


## 📊7. Model Validation

In [108]:
# Splitting the training set into training and validation
# 20% of the training data will be used for validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
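
With a fairly small and imbalanced target, a random split can leave the validation set with a different survivor ratio than the training set. Passing `stratify` to `train_test_split` keeps the class proportions equal in both parts. The sketch below demonstrates this on toy labels; the notebook's own split does not use it, and the names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 70 of class 0, 30 of class 1.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 70/30 ratio in both resulting sets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_va.mean())
```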

In [None]:
# 1. Logistic Regression Model
lr_model = LogisticRegression(max_iter=200)
lr_model.fit(X_train, y_train)
y_pred_lr_val = lr_model.predict(X_val)
acc_lr = accuracy_score(y_val, y_pred_lr_val)

In [110]:
# 2. Decision Tree Model
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
y_pred_tree_val = tree_model.predict(X_val)
acc_tree = accuracy_score(y_val, y_pred_tree_val)

In [None]:
# 3. XGBoost Model
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb_val = xgb_model.predict(X_val)
acc_xgb = accuracy_score(y_val, y_pred_xgb_val)

## 🎯7.1 Comparing the Models' Accuracy

Now, finally!
Let's compare our 3 models using the validation set:

In [112]:
print(f"Accuracy on the validation set:\n")
print(f"Logistic Regression: {acc_lr:.2f}")
print(f"Decision Tree: {acc_tree:.2f}")
print(f"XGBoost: {acc_xgb:.2f}")

Accuracy on the validation set:

Logistic Regression: 0.80
Decision Tree: 0.75
XGBoost: 0.83
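
The three fit-and-score cells follow the same pattern, so they can be collapsed into a loop over a dictionary of estimators. The sketch below runs on synthetic data and swaps in scikit-learn's `GradientBoostingClassifier` for XGBoost to stay self-contained; that substitution and all variable names are assumptions, and the scores it prints are unrelated to the Titanic results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 80/20 like the notebook.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each candidate on the training part and score it on the validation part.
scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_va, estimator.predict(X_va))

best_name = max(scores, key=scores.get)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
print(f"Best on validation: {best_name}")
```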


## ✨8. Prediction

Let's make the prediction using our test data.

We will use the predictions from the XGBoost model because it had the best validation accuracy (0.83).

In [113]:
# Make predictions on the test set with the models fitted in the validation step
y_pred_lr = lr_model.predict(X_test)
y_pred_tree = tree_model.predict(X_test)
y_pred_xgb = xgb_model.predict(X_test)

In [114]:
# The test set (df_test) must have the 'PassengerId' column for the submission file
submission = pd.DataFrame({
    'PassengerId': df_test['PassengerId'],
    'Survived': y_pred_xgb
})

In [115]:
# Save the DataFrame to a CSV file without the index
submission.to_csv('submission.csv', index=False)
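
Before uploading, it can help to confirm that the file has the expected shape: one row per test passenger, unique IDs, and integer 0/1 predictions. The following is a minimal sketch of such checks on stand-in values (the IDs, predictions, and the name `submission_demo` are hypothetical):

```python
import numpy as np
import pandas as pd

passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = np.array([0.0, 1.0, 0.0])  # stand-in for model output

# Cast to int so the CSV contains 0/1 rather than 0.0/1.0.
submission_demo = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

# Basic sanity checks on the submission frame.
assert submission_demo["Survived"].isin([0, 1]).all()
assert submission_demo["PassengerId"].is_unique

# With no path argument, to_csv returns the CSV content as a string.
csv_text = submission_demo.to_csv(index=False)
print(csv_text)
```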

In [116]:
print(submission.head(200))

      PassengerId  Survived
891           892         0
892           893         0
893           894         0
894           895         0
895           896         0
...           ...       ...
1086         1087         0
1087         1088         1
1088         1089         0
1089         1090         0
1090         1091         0

[200 rows x 2 columns]


This is our prediction using our best model, XGBoost.