# Assignment No. 2 - Supervised Learning

<img src="../docs/intro.webp" alt="titanic" style="width:700px;"/>

## Introduction

The RMS Titanic dataset is a classic example used in data science and machine learning to explore factors influencing survival rates during the tragic sinking of the Titanic. The objective is to build predictive models that can accurately determine whether a passenger survived or not based on various attributes such as socio-economic status, age, gender, and more.

In this supervised learning task, we aim to develop and compare machine learning algorithms to predict survival outcomes using historical passenger data. By leveraging techniques like decision trees, neural networks, and support vector machines, we will analyze and evaluate the effectiveness of these models in classifying passengers into survival categories.

Through this exercise, we seek to gain insights into the factors that contributed significantly to survival during this historical event, demonstrating the practical application of supervised learning techniques in understanding complex real-world scenarios.

[Titanic Problem Kaggle Page](https://www.kaggle.com/datasets/sakshisatre/titanic-dataset?resource=download)

## Importing libraries

First, we need to install the libraries used in this project.

To do so, run the following command in the terminal (make sure you are in the project's root directory):

```pip install -r requirements.txt```

Then, we can import the libraries we will use in this project.

We also disable the warnings, to make the notebook cleaner.

In [291]:
import warnings  # used below to silence library warnings
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

from imblearn.over_sampling import SMOTE

# Suppress all warnings to keep the notebook output clean
warnings.filterwarnings('ignore')

## Create a dataframe with the dataset from the csv file

In [292]:
df = pd.read_csv('../dataset/Titanic Dataset.csv')

df.head()

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


## Meaning of the columns:

- Pclass: Ticket class indicating the socio-economic status of the passenger. It is categorized into three classes: 1 = Upper, 2 = Middle, 3 = Lower.

- Survived: A binary indicator that shows whether the passenger survived (1) or not (0) during the Titanic disaster. This is the target variable for analysis.

- Name: The full name of the passenger, including title (e.g., Mr., Mrs., etc.).

- Sex: The gender of the passenger, denoted as either male or female.

- Age: The age of the passenger in years.

- SibSp: The number of siblings or spouses aboard the Titanic for the respective passenger.

- Parch: The number of parents or children aboard the Titanic for the respective passenger.

- Ticket: The ticket number assigned to the passenger.

- Fare: The fare paid by the passenger for the ticket.

- Cabin: The cabin number assigned to the passenger, if available.

- Embarked: The port of embarkation for the passenger. It can take one of three values: C = Cherbourg, Q = Queenstown, S = Southampton.

- Boat: If the passenger survived, this column contains the identifier of the lifeboat they were rescued in.

- Body: If the passenger did not survive, this column contains the identification number of their recovered body, if applicable.

- Home.dest: The passenger's home and destination, where known (e.g., "Montreal, PQ / Chesterville, ON").

## Data preprocessing


In [293]:
df.describe()

| | pclass | survived | age | sibsp | parch | fare | body |
|---|---|---|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 | 121.0 |
| mean | 2.294882 | 0.381971 | 29.881138 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.413493 | 1.041658 | 0.86556 | 51.758668 | 97.696922 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 72.0 |
| 50% | 3.0 | 0.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 155.0 |
| 75% | 3.0 | 1.0 | 39.0 | 1.0 | 0.0 | 31.275 | 256.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 328.0 |


In [294]:
# Check missing values
df.isna().any()

pclass       False
survived     False
name         False
sex          False
age           True
sibsp        False
parch        False
ticket       False
fare          True
cabin         True
embarked      True
boat          True
body          True
home.dest     True
dtype: bool
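
`isna().any()` only reports whether a column has gaps, not how many. As a minimal sketch, the per-column count and share of missing values can be computed as follows; the `toy` frame and its values are illustrative assumptions, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values only)
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 7.90],
    "sex": ["female", "male", "female", "male"],
})

missing_count = toy.isna().sum()   # number of NaN values per column
missing_share = toy.isna().mean()  # fraction of NaN values per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Applied to `df`, the same two calls would show which columns are mostly empty and which have only a handful of gaps.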

### Dropping unused columns

In [295]:
# Drop 'boat' and 'body' (they are only filled in once the outcome is known),
# plus 'name' and 'embarked', which are not used as features here
df = df.drop(columns=['body', 'embarked', 'boat', 'name'])

df.shape

(1309, 10)

### Encode the target variable
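We pass the target variable through `LabelEncoder` so that it is guaranteed to hold integer class labels. Note that `survived` is already numeric (0/1), as the `df.describe()` output above shows, so this step leaves the values unchanged.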

In [296]:
encoder = LabelEncoder()

df['survived'] = encoder.fit_transform(df['survived'])
df.head()

| | pclass | survived | sex | age | sibsp | parch | ticket | fare | cabin | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | St Louis, MO |
| 1 | 1 | 1 | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | Montreal, PQ / Chesterville, ON |


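### Handling NaN values

Missing ages are filled with the mean age of passengers of the same sex. The other remaining columns with missing values (`fare`, `cabin`, `home.dest`) are left unchanged by this cell.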

In [297]:
# Calculate the mean age for men and women separately
mean_age_male = df[df['sex'] == 'male']['age'].mean()
mean_age_female = df[df['sex'] == 'female']['age'].mean()

# Replace missing values in the 'age' column based on gender
df.loc[(df['sex'] == 'male') & (df['age'].isnull()), 'age'] = mean_age_male
df.loc[(df['sex'] == 'female') & (df['age'].isnull()), 'age'] = mean_age_female
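
The same group-wise imputation can be written without listing each category by hand. The sketch below uses `groupby(...).transform("mean")` on a toy frame; the `toy` data is an illustrative assumption, and this is an alternative formulation rather than the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy data: one missing age per sex (illustrative values only)
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "age": [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# For every row, the mean age of that row's sex group (NaN values are skipped)
group_mean_age = toy.groupby("sex")["age"].transform("mean")

# Fill only the missing ages; existing values are left untouched
toy["age"] = toy["age"].fillna(group_mean_age)
print(toy)
```

This form extends to any number of groups, for example grouping by both `sex` and `pclass`, without adding one `df.loc[...]` line per category.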

## Algorithms

### Transform train dataset

In [298]:
df_extracted = df.copy()
features = df_extracted.drop(['survived'],axis=1)
labels = df['survived']

np.unique(labels, return_counts=True)

(array([0, 1]), array([809, 500]))
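
The counts above show that the classes are imbalanced. A short calculation makes the imbalance concrete and gives a reference point for later models: a classifier that always predicts the majority class already reaches the majority share as its accuracy. The variable names below are illustrative.

```python
import numpy as np

# Class counts taken from the output above: 0 = did not survive, 1 = survived
counts = np.array([809, 500])

shares = counts / counts.sum()    # fraction of each class
majority_baseline = shares.max()  # accuracy of always predicting the majority class

print(shares.round(3), round(float(majority_baseline), 3))
```

Any model trained on this data should be compared against that baseline of roughly 0.62, not against 0.5.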

In [299]:
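# Hold out 20% for testing; fix the seed and stratify so the class ratio is preserved
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
# Oversample the minority class on the training split only
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)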

ValueError: could not convert string to float: 'male'
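
The error occurs because SMOTE creates synthetic samples by interpolating between numeric feature vectors, so string columns such as `sex` cannot be processed. The remaining missing values in `fare` would likely be rejected as well. One possible fix, sketched on toy data, is to make every feature numeric before resampling. Dropping `ticket`, `cabin` and `home.dest`, the 0/1 mapping for `sex`, and the median fill for `fare` are all assumptions of this sketch, and `to_numeric_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy rows with the same columns as `features` (illustrative values only)
toy_features = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "sex": ["female", "male", "male", "female"],
    "age": [29.0, 30.0, 25.0, 2.0],
    "sibsp": [0, 1, 0, 1],
    "parch": [0, 2, 0, 2],
    "ticket": ["24160", "113781", "A/5 21171", "PC 17599"],
    "fare": [211.34, np.nan, 13.0, 7.9],
    "cabin": ["B5", "C22 C26", None, None],
    "home.dest": ["St Louis, MO", None, None, None],
})

def to_numeric_features(frame):
    """Hypothetical helper: return a copy that contains only numeric, gap-free columns."""
    # Assumption: free-text columns are dropped rather than encoded
    out = frame.drop(columns=["ticket", "cabin", "home.dest"])
    # Assumption: binary encoding of sex
    out["sex"] = out["sex"].map({"male": 0, "female": 1})
    # Assumption: missing fares are replaced by the median fare
    out["fare"] = out["fare"].fillna(out["fare"].median())
    return out

numeric_features = to_numeric_features(toy_features)
print(numeric_features.dtypes)
```

With `features` transformed this way before the split, the `fit_resample` call above should no longer fail on string values.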

In [None]:
y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)
