# Decision Trees - Challenge

### Predict which passengers survived the sinking of the Titanic*

*The dataset you are going to use comes from a Kaggle competition. 

This is a binary classification problem: the task is to predict whether a passenger survived the sinking of the Titanic. The dataset consists of features that describe the passengers, and these features are used to predict the target label, the column **Survived**. The dataset contains 1,309 records.

### The Dataset Description

*titanic.csv* is composed of the following columns:

* **Name**: Passenger's name.
* **PassengerId**: Passenger's identification number.
* **Age**: The age of a passenger in years.
* **Sex**: The gender of a passenger.
* **SibSp**: The number of siblings/spouses of a passenger (Sibling = brother, sister, stepbrother, stepsister, Spouse = husband, wife).
* **Parch**: The number of parents/children of a passenger (Parent = mother, father, Child = daughter, son, stepdaughter, stepson. For children who travelled only with a nanny, Parch=0).
* **Pclass**: The class of the ticket the passenger purchased (a proxy for socio-economic status: 1=1st (Upper), 2=2nd (Middle), 3=3rd (Lower)).
* **Ticket**: The ticket number of a passenger.
* **Fare**: The fare the passenger paid.
* **Cabin**: The cabin number of a passenger.
* **Embarked**: The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton).
* **Survived**: Whether the passenger survived or not (0=No, 1=Yes).

Here are some tips that might help you in building your classifier:

* Familiarise yourself with the dataset content. Check whether the features are categorical or numerical, whether any columns contain missing data, and whether you need to normalise the data. If you need help, check this website: https://scikit-learn.org/stable/. Note that categorical features should be converted to *dummy* variables. This might be helpful: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html


* Explore your dataset. Before training the classification model, it is advisable to plot the distribution of the features, check the correlation between them, and understand how important each feature is for solving the given problem.


* Import the decision tree classifier and tune the hyper-parameters of interest: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier


* Split the data into training and test sets. The training data is used to fit the classification model you choose, after which the model is used to make predictions on the test data. Check: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html or, if you are interested in more advanced data splitting, check: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html


* When you reach the stage of evaluating the performance of your classifier, define the metrics you want to observe, such as accuracy, the confusion matrix, or the ROC curve. Again, check https://scikit-learn.org/.
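
The splitting, tuning, and evaluation tips above can be sketched roughly as follows. This is only an illustration, not a reference solution: the feature matrix is synthetic random data standing in for an already-encoded Titanic feature table, and the parameter grid, the 25% test size, and the use of `GridSearchCV` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an encoded feature matrix (NOT the Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out 25% of the rows for the final evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Tune two hyper-parameters with 5-fold cross-validation on the training set only.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# Evaluate the best tree found by the search on the held-out test set.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Best parameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
print("Confusion matrix:\n", cm)
```

Keeping the test set out of the hyper-parameter search means the reported test accuracy is not influenced by the tuning itself.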

Have fun!

In [13]:
# Here are some packages that might be useful for your work.
# Feel free to add more packages if needed.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

path_data = "https://s3-eu-west-1.amazonaws.com/faculty-client-teaching-materials/non-linear-algorithms/titanic.csv"

In [14]:
# Read the dataset as .csv and check its size.
df = pd.read_csv(path_data)

print("Dimensions of the dataset: {}".format(df.shape))

Dimensions of the dataset: (1309, 12)


In [15]:
# Check the names of the columns.
df.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'Survived'],
      dtype='object')

In [16]:
# Check the type of each column.
df.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Survived         int64
dtype: object
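
Once the column types are known, a natural next check is which columns have missing values and which are numerical versus text. The sketch below shows one way to do this; it uses a small hand-made frame (`toy`, with made-up values and only a few of the Titanic column names) so that it runs without downloading anything. The same calls should apply to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the Titanic column names; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Survived": [0, 1, 1, 0],
})

# Count the missing entries in each column.
missing_counts = toy.isna().sum()

# Separate numerical columns from the remaining (text) columns by dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print("Numerical:", numeric_cols)
print("Text:", text_cols)
```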

In [17]:
# Check the content of the dataframe.
df.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S | 0 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S | 0 |
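
A table like this still has to be turned into a numerical feature matrix before a decision tree can be trained on it. The following is a minimal sketch of one possible approach, not the intended solution: it again uses a made-up toy frame (`toy`), and the choices to drop `PassengerId` and `Name`, fill `Age` with the median, and fill `Embarked` with the most frequent value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking part of the Titanic layout; all values are made up.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 1],
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
    "Survived": [0, 1, 1, 0],
})

# Separate the target from the features and drop identifier-like columns.
y = toy["Survived"]
X = toy.drop(columns=["PassengerId", "Name", "Survived"])

# Fill missing values: median for Age, most frequent value for Embarked.
X["Age"] = X["Age"].fillna(X["Age"].median())
X["Embarked"] = X["Embarked"].fillna(X["Embarked"].mode()[0])

# Convert the categorical columns to dummy (one-hot) variables.
X = pd.get_dummies(X, columns=["Sex", "Embarked"])

print(X)
```

In a full workflow, fill values such as the median would normally be computed on the training split only and then applied to the test split, so that no information leaks from the test data.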
