# 2nd Homework - Data preprocessing and binary classification (deadline November 8th 23:59)

  * In this homework you will handle features of various data types.
  * The features should be preprocessed and transformed into a numeric representation before an ML model is trained on the data.
  
> **The homework is deliberately open-ended to leave room for invention. Thinking of the _exact solution path_ is part of the assignment. Originality will be taken into account in the evaluation.**

## Data source

Your task is to predict the survival of Titanic passengers. The training data is in the file **data.csv** and the evaluation data (without target values) is in **evaluation.csv**.

### Features
* survived - 0 = No, 1 = Yes, **target variable**
* pclass - passenger's class, 1 = first, 2 = second, 3 = third
* name
* sex
* age - in years
* sibsp - number of siblings / spouses onboard
* parch - number of parents / children onboard
* ticket - ticket number
* fare - ticket fare
* cabin - cabin number
* embarked - port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton
* home.dest - Home/destination

## Instructions

**Basic points of the assignment (8 points)**:
  * Load **data.csv** in a Jupyter notebook. Split the data into subsets suitable for ML model training.
  * Explore and transform particular features into a format suitable for the selected classification method.
  * You can create new features (based on the existing ones), e.g. you can create a column with the length of the passenger's name. You can drop some features entirely too.
  * Handle the missing values in the dataset.
  * Select a suitable classification method from the lectures. Train it on the training set and tune the hyperparameters. Compute its accuracy on the training and validation sets.
  * Load the data from the file **evaluation.csv**. Compute predictions for these data (the file contains no target variable values). Create a file **results.csv** and save your predictions in two columns: ID and the predicted survival. Upload this file to the repository alongside the Jupyter notebook.
  * A possible head of the file **results.csv**:
  
```
ID,survived
1000,0
1001,1
...
```
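
One way to produce a file in this shape is sketched below. The prediction values are invented for illustration and the helper name `predictions_to_csv` is hypothetical; only the two column names, `ID` and `survived`, come from the assignment.

```python
import pandas as pd

# Hypothetical predictions indexed by passenger ID (values are made up).
predictions = pd.Series(
    [0, 1, 1],
    index=pd.Index([1000, 1001, 1002], name="ID"),
    name="survived",
)


def predictions_to_csv(pred: pd.Series) -> str:
    """Return CSV text with the two required columns, ID and survived."""
    frame = pred.astype(int).rename("survived").rename_axis("ID").reset_index()
    # to_csv() without a path returns the CSV content as a string.
    return frame.to_csv(index=False)


csv_text = predictions_to_csv(predictions)
print(csv_text)
```

Passing a path instead, for example `frame.to_csv("results.csv", index=False)`, writes the same content to disk.
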
**Additional tasks** for extra points (choose any; the maximum for the homework is 12 points):
  * (up to 4 points) Apply all of the classification methods discussed in the lectures to the problem. Select the best one based on the accuracy on the validation set. Use cross-validation to estimate the real accuracy of the best model. Use this model for prediction on the **evaluation.csv** data.
  * (up to 4 points) Try to use at least two advanced methods for filling missing values in the `age` feature. Explore the impact of these methods on the performance of the trained model. Use the method which you find to perform the best for the prediction on the **evaluation.csv** data.
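
A minimal sketch of how imputation methods for `age` could be compared with cross-validation is given below. It runs on synthetic data rather than on **data.csv**, and the choice of `KNNImputer` and `IterativeImputer` (with a median baseline) is an assumption about what counts as "advanced" here, not something the assignment prescribes. Placing the imputer inside a pipeline means it is refitted on the training part of every fold.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for numeric Titanic features (not the real data).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "fare": rng.gamma(2.0, 15.0, size=n),
    "age": rng.normal(30.0, 12.0, size=n).clip(1, 80),
})
y = (X["age"] < 25).astype(int).to_numpy()   # toy target that depends on age
X.loc[rng.random(n) < 0.2, "age"] = np.nan   # remove roughly 20 % of the ages

imputers = {
    "median": SimpleImputer(strategy="median"),   # simple baseline
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

scores = {}
for name, imputer in imputers.items():
    model = make_pipeline(imputer, DecisionTreeClassifier(max_depth=4, random_state=0))
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```

The same loop can be pointed at the real feature matrix once all other columns are numeric.
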
  
## Submission notes

  * Follow instructions at https://courses.fit.cvut.cz/BIE-VZD/homeworks/index.html
  * Submit **Jupyter Notebook** (possibly with additional scripts) and file **results.csv** with the test predictions.
  * The reviewer may allow you to finish or correct your homework to earn additional points. However, the first version is crucial.

In [17]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv('data.csv', index_col='ID')
display(data.head())
data.info()

| ID | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Dorking, Mr. Edward Arthur | male | 19.0 | 0 | 0 | A/5. 10482 | 8.05 | NaN | S | England Oglesby, IL |
| 1 | 1 | 2 | Smith, Miss. Marion Elsie | female | 40.0 | 0 | 0 | 31418 | 13.0 | NaN | S | NaN |
| 2 | 0 | 3 | Hegarty, Miss. Hanora "Nora" | female | 18.0 | 0 | 0 | 365226 | 6.75 | NaN | Q | NaN |
| 3 | 0 | 3 | Sage, Mr. John George | male | NaN | 1 | 9 | CA. 2343 | 69.55 | NaN | S | NaN |
| 4 | 0 | 3 | Cacic, Miss. Marija | female | 30.0 | 0 | 0 | 315084 | 8.6625 | NaN | S | NaN |


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   survived   1000 non-null   int64  
 1   pclass     1000 non-null   int64  
 2   name       1000 non-null   object 
 3   sex        1000 non-null   object 
 4   age        797 non-null    float64
 5   sibsp      1000 non-null   int64  
 6   parch      1000 non-null   int64  
 7   ticket     1000 non-null   object 
 8   fare       1000 non-null   float64
 9   cabin      226 non-null    object 
 10  embarked   998 non-null    object 
 11  home.dest  554 non-null    object 
dtypes: float64(2), int64(4), object(6)
memory usage: 101.6+ KB


### Preparing the data

In [3]:
print('\nNumber of NaN values in the columns:')
data.isnull().sum(axis=0)


Number of NaN values in the columns:


survived       0
pclass         0
name           0
sex            0
age          203
sibsp          0
parch          0
ticket         0
fare           0
cabin        774
embarked       2
home.dest    446
dtype: int64

In [4]:
display(data[data['embarked'].isnull()][['name', 'embarked']])
data.groupby('embarked').describe()

| ID | name | embarked |
|---|---|---|
| 218 | Stone, Mrs. George Nelson (Martha Evelyn) | NaN |
| 346 | Icard, Miss. Amelie | NaN |


| embarked | survived count | survived mean | survived std | survived min | survived 25% | survived 50% | survived 75% | survived max | pclass count | pclass mean | ... | parch 75% | parch max | fare count | fare mean | fare std | fare min | fare 25% | fare 50% | fare 75% | fare max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | 219.0 | 0.557078 | 0.497869 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 219.0 | 1.922374 | ... | 1.0 | 3.0 | 219.0 | 58.906604 | 77.285609 | 4.0125 | 13.825 | 27.7208 | 78.73335 | 512.3292 |
| Q | 88.0 | 0.375 | 0.486897 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 88.0 | 2.875 | ... | 0.0 | 5.0 | 88.0 | 12.833241 | 13.462257 | 6.75 | 7.75 | 7.7646 | 12.35 | 90.0 |
| S | 691.0 | 0.341534 | 0.474568 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 691.0 | 2.367583 | ... | 0.0 | 9.0 | 691.0 | 27.714338 | 38.656831 | 0.0 | 8.05 | 13.0 | 27.825 | 263.0 |


`S` is the largest group in the `embarked` feature, so let's replace the NaN values with `S`.

In [6]:
# S is the largest group in the embarked feature, so replace the 2 NaNs with S,
# then map the ports to integer codes.
data['embarked'] = data['embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

# Mark missing ages with the sentinel value -1.
# Note: astype(int) also truncates fractional ages (e.g. 0.75 becomes 0).
data['age'] = data['age'].fillna(-1).astype(int)

# Create a new feature family_size (the passenger plus relatives on board).
data['family_size'] = data['sibsp'] + data['parch'] + 1
# Note: drop() returns a new DataFrame; the result is not assigned here,
# so sibsp and parch remain in data.
data.drop(columns=['sibsp', 'parch'])

# Label-encode the string columns; missing values get the code -1.
object_dtype = data.select_dtypes(['object']).columns
data[object_dtype] = data[object_dtype].astype('category').apply(lambda x: x.cat.codes)

data.info()
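
The `cat.codes` encoding above derives its codes from the categories present in the frame it is applied to, so running the same cell separately on **evaluation.csv** could assign different codes to the same string. A possible alternative, sketched here on made-up toy frames, is to fit scikit-learn's `OrdinalEncoder` once on the training data and reuse it. The `"missing"` placeholder and the choice of `-1` for unseen categories are assumptions, not part of the original solution.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for data.csv and evaluation.csv (values are made up).
train = pd.DataFrame({"sex": ["male", "female", "female"],
                      "cabin": ["C85", None, "E46"]})
evaluation = pd.DataFrame({"sex": ["female", "male"],
                           "cabin": ["E46", "B20"]})

cat_cols = ["sex", "cabin"]
# Categories unseen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Replace missing strings with an explicit placeholder before encoding.
encoder.fit(train[cat_cols].fillna("missing"))
train_codes = encoder.transform(train[cat_cols].fillna("missing"))
eval_codes = encoder.transform(evaluation[cat_cols].fillna("missing"))

print(eval_codes)
```

Here `E46` receives the same code in both frames, while `B20`, which never occurs in the training frame, is mapped to `-1`.
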

### Training the decision tree

In [18]:
Xdata = data.drop(columns='survived')
ydata = data.survived 

rd_seed = 363
Xtrain, Xtest, ytrain, ytest = train_test_split(Xdata, ydata, test_size=0.25, random_state=rd_seed) 
Xtrain, Xvalid, ytrain, yvalid = train_test_split(Xtrain, ytrain, test_size=0.25, random_state=rd_seed) 
print(f"original data:   {Xdata.shape}")
print(f"train data:   {Xtrain.shape}")
print(f"validation data:   {Xvalid.shape}")
print(f"test data:   {Xtest.shape}")

original data:   (1000, 12)
train data:   (562, 12)
validation data:   (188, 12)
test data:   (250, 12)
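
The sizes follow from applying `test_size=0.25` twice: 25 % of the 1000 rows go to the test set, and 25 % of the remaining 750 rows (rounded up to 188) go to the validation set, leaving 562 rows for training. The sketch below reproduces the arithmetic on dummy arrays; the names `X_rest` and `y_rest` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 dummy rows, the same row count as data.csv.
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000, dtype=int)

# First split: hold out 25 % of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=363)
# Second split: hold out 25 % of the remainder as the validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=363)

sizes = {"train": len(X_train), "validation": len(X_valid), "test": len(X_test)}
print(sizes)  # {'train': 562, 'validation': 188, 'test': 250}
```
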


### Hyperparameter tuning - `max_depth` and `criterion`
