# Assignment No. 2 - Supervised Learning

<img src="../docs/intro.webp" alt="titanic" style="width:700px;"/>

## Introduction

The RMS Titanic dataset is a classic data science and machine learning benchmark for exploring which factors influenced passenger survival when the ship sank. The objective is to build predictive models that determine whether a passenger survived, based on attributes such as socio-economic status, age, and sex.

In this supervised learning task, we develop and compare machine learning algorithms that predict survival from historical passenger data. Using decision trees, neural networks, and support vector machines, we evaluate how well each model classifies passengers as survivors or non-survivors.

Through this exercise, we aim to identify the factors that contributed most to survival and to show how supervised learning techniques apply to a real-world problem.

[Titanic Problem Kaggle Page](https://www.kaggle.com/datasets/sakshisatre/titanic-dataset?resource=download)

## Importing libraries

First, we need to install the libraries used in this project.

To do so, run the following command in the terminal (make sure you are in the project's root directory):

```pip install -r requirements.txt```

Then we can import them.

We also suppress warnings to keep the notebook output clean.

In [15]:
import warnings # Needed to ignore warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import math

warnings.filterwarnings('ignore')

## Create a DataFrame from the CSV file

In [16]:
df = pd.read_csv('../dataset/Titanic Dataset.csv')

df.head(15)

|  | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 |  | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11 |  | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON |
| 5 | 1 | 1 | Anderson, Mr. Harry | male | 48.0 | 0 | 0 | 19952 | 26.55 | E12 | S | 3 |  | New York, NY |
| 6 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 |  | Hudson, NY |
| 7 | 1 | 0 | Andrews, Mr. Thomas Jr | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S |  |  | Belfast, NI |
| 8 | 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.0 | 2 | 0 | 11769 | 51.4792 | C101 | S | D |  | Bayside, Queens, NY |
| 9 | 1 | 0 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 |  | C |  | 22.0 | Montevideo, Uruguay |


## Data preprocessing


In [17]:
df.describe()

|  | pclass | survived | age | sibsp | parch | fare | body |
|---|---|---|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 | 121.0 |
| mean | 2.294882 | 0.381971 | 29.881138 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.413493 | 1.041658 | 0.86556 | 51.758668 | 97.696922 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 72.0 |
| 50% | 3.0 | 0.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 155.0 |
| 75% | 3.0 | 1.0 | 39.0 | 1.0 | 0.0 | 31.275 | 256.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 328.0 |


In [18]:
# Check missing values
df.isna().any()

pclass       False
survived     False
name         False
sex          False
age           True
sibsp        False
parch        False
ticket       False
fare          True
cabin         True
embarked      True
boat          True
body          True
home.dest     True
dtype: bool
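
`df.isna().any()` only reports whether a column has gaps, not how many. A minimal sketch of a per-column count and percentage follows. It runs on a small made-up frame so it is self-contained; the `toy` data and the `summary` name are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 26.55],
    "sex": ["female", "male", "female", "male"],
})

missing_count = toy.isna().sum()        # number of missing entries per column
missing_pct = toy.isna().mean() * 100   # share of missing entries per column, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Keep only the columns that actually have gaps, largest share first.
summary = summary[summary["missing"] > 0].sort_values("percent", ascending=False)
print(summary)
```

Applied to the real data, the same two calls would be `df.isna().sum()` and `df.isna().mean() * 100`.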

### Filtering out outliers

In [19]:
percentage_null_embarked = (df['embarked'].isnull().sum() / len(df)) * 100
print(percentage_null_embarked)



0.15278838808250572
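
To read this percentage as a row count, the formula can be inverted. The short check below assumes the frame has 1309 rows, which is the `count` that `df.describe()` reports for the fully populated `pclass` column; the variable names are illustrative.

```python
total_rows = 1309                  # row count, taken from the df.describe() output
pct_missing = 0.15278838808250572  # value printed by the cell above

# Invert the percentage formula: missing = pct / 100 * total
missing_rows = round(pct_missing / 100 * total_rows)
print(missing_rows)

# Sanity check: recomputing the percentage from the recovered count matches.
assert abs(missing_rows / total_rows * 100 - pct_missing) < 1e-9
```

This gives 2, so only two passengers have no recorded port of embarkation.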


In [20]:
average_age = df['age'].mean()
average_age


29.881137667304014
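
A column mean like this is often used to fill in missing values. Whether this notebook does so is an assumption here. The sketch below shows mean imputation on a made-up `toy` frame, using the fact that `Series.mean()` skips `NaN` entries by default.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; two entries are missing.
toy = pd.DataFrame({"age": [20.0, np.nan, 40.0, np.nan, 30.0]})

average_age = toy["age"].mean()           # NaN values are skipped by default
filled = toy["age"].fillna(average_age)   # replace each NaN with the mean

print(average_age)
print(filled.tolist())
```

Mean imputation keeps the column mean unchanged but reduces its variance, so the median or a group-wise mean may be worth comparing.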