# Úkol č. 2 - předzpracování dat a binární klasifikace (do 20. dubna)

  * Cílem tohoto úkolu je vyzkoušet si naučit prediktivní model pro binární klasifikaci.
  * Budete se muset vypořádat s příznaky, které jsou různých typů a které bude třeba nějakým způsobem převést do číselné reprezentace.
    
> **Úkoly jsou zadány tak, aby Vám daly prostor pro invenci. Vymyslet _jak přesně_ budete úkol řešit, je důležitou součástí zadání a originalita či nápaditost bude také hodnocena!**

## Zdroj dat

Budeme se zabývat predikcí přežití pasažérů Titaniku.
K dispozici máte trénovací data v souboru **data.csv** a data na vyhodnocení v souboru **evaluation.csv**.

#### Seznam příznaků:
* survived - zda přežil, 0 = Ne, 1 = Ano, **vysvětlovaná proměnná**, kterou chcete predikovat
* pclass - Třída lodního lístku, 1 = první, 2 = druhá, 3 = třetí
* name - jméno
* sex - pohlaví
* age - věk v letech
* sibsp	- počet sourozenců / manželů, manželek na palubě
* parch - počet rodičů / dětí na palubě
* ticket - číslo lodního lístku
* fare - cena lodního lístku
* cabin	- číslo kajuty
* embarked	- místo nalodění, C = Cherbourg, Q = Queenstown, S = Southampton
* home.dest - Bydliště/Cíl

## Pokyny k vypracování

**Základní body zadání**, za jejichž (poctivé) vypracování získáte **8 bodů**:
  * V Jupyter notebooku načtěte data ze souboru **data.csv**. Vhodným způsobem si je rozdělte na trénovací, testovací a případně i validační množinu (preferujeme ale použití cross-validation).
  * Projděte si jednotlivé příznaky a transformujte je do vhodné podoby pro použití ve vybraném klasifikačním modelu.
  * Podle potřeby si můžete vytvářet nové příznaky (na základě existujících), například tedy můžete vytvořit příznak měřící délku jména. Některé příznaky můžete také úplně zahodit.
  * Nějakým způsobem se vypořádejte s chybějícími hodnotami.
  * Následně si vyberte vhodný klasifikační model z přednášek. Najděte vhodné hyperparametry a určete jeho přesnost (accuracy) na trénovací množině. Také určete jeho přesnost na testovací/vaidační množině.
  * Načtěte vyhodnocovací data ze souboru **evaluation.csv**. Napočítejte predikce pro tyto data (vysvětlovaná proměnná v nich již není). Vytvořte **results.csv** soubor, ve kterém tyto predikce uložíte do dvou sloupců: ID, predikce přežití. Tento soubor nahrajte do repozitáře.

**Další body zadání** za případné další body  (můžete si vybrat, maximum bodů za úkol je každopádně 12 bodů):
  * (až +4 body) Aplikujte všechny klasifikační modely z přednášek a určete (na základě přesnosti na validační množině), který je nejlepší. Přesnost tohoto nejlepšího modelu odhadněte pomocí testovací množiny. K predikcím na vyhodnocovacích datech využijte tento model.
  * (až +4 body) Zkuste použít nějaké (alespoň dvě) netriviální metody doplňování chybějících hodnot u věku. Zaměřte pozornost na vliv těchto metod na přesnost predikce výsledného modelu. K predikcím na vyhodnocovacích datech využijte ten přístup, který Vám vyjde jako nejlepší.
  
 **jiné bonusové zadání (Druhá možnosť bonusových bodov ostáva rovnaká):** 

- Aplikujte všechny klasifikační modely z přednášek - rozhodovací strom, random forest, adaboost, KNN, logistická regresia
- Pre učenie rôznych modelov a ich porovnávanie, je pre Vás, ale tak isto aj pre mňa ako opravujúceho jednoduchšie a prehladnejšie vytvoriť jednu funkciu. HINT: pomocou parametra môžete predávať aj model, ktorý chcete učiť.
	

## Poznámky k odevzdání

  * Řiďte se pokyny ze stránky https://courses.fit.cvut.cz/BI-VZD/homeworks/index.html.
  * Odevzdejte nejen Jupyter Notebook, ale i _csv_ soubor(y) s predikcemi pro vyhodnocovací data.
  * Opravující Vám může umožnit úkol dodělat či opravit a získat tak další body. **První verze je ale důležitá a bude-li odbytá, budete za to penalizováni**

# Solution:

## Notes

nice overall description in detail, great visualisation:
https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling

nice machine learning algorithms explanation, more simple:
https://www.kaggle.com/stuartwaller/lost-overwhelmed-start-here

well structured, great visualisation:
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

family size:
https://www.kaggle.com/lperez/titanic-a-deeper-look-on-family-size

**useful code**:
- shape(), head(), info(), describe(), nunique(), isna().sum()

## Import, data reading

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

Training data:

In [2]:
train_df = pd.read_csv('data.csv')

To be evaluated:

In [3]:
test_df = pd.read_csv('evaluation.csv')

## Modifying both training and testing dataset


For that, I join both datasets under a new variable name **full_data** (excluding the 'survived' column from training data).


In [4]:
full_data = pd.concat([train_df.drop('survived', axis=1),test_df], axis=0).reset_index(drop=True)
# display(full_data.head())
# full_data.shape

At first, I drop **ticket** and **home.dest** columns. I also decided to drop **'cabin'**, since one hot encoding would add too many dummies and could result in overfitting.

In [5]:
full_data = full_data.drop(['ticket', 'home.dest', 'cabin'], axis=1)
# display(full_data)
# full_data.shape

Extracting **title** from **name**. Since there are only 4 major categories **Mr, Miss, Mrs and Master** (Master - for male children, an old form of invitation, not used anymore), I decided to create a new category **Rare** for the remaining titles. I am performing conversion to dummy variables for these 5 categories. When finished, I drop the column **name**.

In [6]:
# name: extract title, drop name, one hot encoding for title

full_data['title'] = None

# title extraction
for name in range(len(full_data)):
        full_data['title'].iloc[name] = ((full_data.name.iloc[name].split(','))[1].split())[0].strip('.')
        
# ## display title categories
# full_data.title.unique()

# ## see frequency of occurences:
# display(full_data.title.value_counts())

# Mr, Miss, Mrs and Master - most common
# other types -> new label 'Rare'
for title in range(len(full_data.title)):
    if full_data.title.iloc[title] in ['Mr', 'Miss','Mrs','Master']:
        pass
    else:
        full_data.title.iloc[title] = 'Rare'
# display(full_data.title.value_counts())

# one hot encoding for title
titles_dummies = pd.get_dummies(full_data.title, prefix='Title')
full_data = pd.concat([full_data, titles_dummies], axis=1)

# drop column name
full_data = full_data.drop('name', axis=1)

# display(full_data)
# full_data.shape

**Sex** from string to numeric values (female 1, male 0)

In [7]:
# sex: from string to numeric (female 1, male 0)
full_data.sex.replace({'female': 1,'male': 0}, inplace=True)

# display(full_data)
# full_data.dtypes

**Fare**: fill missing data with median value + log() all values to reduce skewness of data. Visualisations:
https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling.

In [8]:
# fare: fill missing with median value

## missing data:
# display(full_data[full_data.fare.isnull()])

# since only one fare value is missing, i decided to fill it with fare median value for passenger class 3 (since )
pclass_fare = full_data.loc[:,['fare', 'pclass']][full_data.pclass==3]
median = pclass_fare.describe().fare['50%']
full_data.loc[:,'fare'][full_data.loc[:,'fare'].isnull()] = median

# transforming fare data by log function to reduce skewness of data
full_data.fare = full_data.fare.map(lambda i: np.log(i) if i > 0 else 0)
# full_data

**Embarked**: fill missing values with median; converting all three (C, Q, S) to dummies. When finished, I drop the column 'embarked'.

In [9]:
# embarked: fill missing values with median; dummies (one hot encoding) 

# two values missing:
full_data.isnull().sum().embarked

# S is the most common (689 / 998 values) - I'll assign this value to the missing values
embarked = full_data.embarked.describe()
full_data.loc[:,'embarked'][full_data.loc[:,'embarked'].isnull()] = embarked.top

# convert column 'embarked' to dummies
embarked_dummies = pd.get_dummies(full_data.embarked, prefix='Emb')
full_data = pd.concat([full_data, embarked_dummies], axis=1)

# drop embarked column
full_data = full_data.drop(['embarked'], axis=1)

# display(full_data)

###  dealing with missing "age" values two ways

### <1st_way>: filling missing values based on *pclass, title, sibsp* and *parch* data

**Age**: filling missing values based on **pclass, title, sibsp** and **parch** data. Inspired by https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling

In [10]:
# age: fill missing values

# taking indexes of rows with missing values in the column 'age'
index_NaN_age = list(full_data["age"][full_data["age"].isnull()].index)

# filling missing values
for i in index_NaN_age :
    age_med = full_data["age"].median() # in case of no match
    age_pred = full_data["age"][((full_data['sibsp'] == full_data.iloc[i]["sibsp"])
                                  & (full_data['parch'] == full_data.iloc[i]["parch"]) 
                                  & (full_data['pclass'] == full_data.iloc[i]["pclass"])
                                  & (full_data['title'] == full_data.iloc[i]["title"]))].median()
    if not np.isnan(age_pred) :
        full_data['age'].iloc[i] = age_pred
    else :
        full_data['age'].iloc[i] = age_med
        
# # check
# full_data.isna().sum()

### </1st_way>

Dropping the **title** column as it is no longer needed


In [11]:
full_data = full_data.drop(['title'], axis=1)

**Sibsp and parch**: use both to determine family size and create **largeF** (large family) column with binary values (based on this kaggle article: https://www.kaggle.com/lperez/titanic-a-deeper-look-on-family-size).

In [12]:
# sibsp and parch: use both to determine family size and create largeFamily column with binary values DONE

# Create a family size column
full_data["fsize"] = full_data["sibsp"] + full_data["parch"] + 1

# transforming values to binary: large family (5+ members) => 1, less => 0
full_data.loc[full_data['fsize'] <= 4, 'fsize'] = 0
full_data.rename(columns={'fsize':'largeF'}, inplace=True)

Dropping columns **sibsp** and **parch** as they're no longer needed

In [13]:
full_data = full_data.drop(['sibsp', 'parch'], axis=1)

### <2nd_way>: filling missing "age" values with KNN method

In [14]:
full_data.age.isnull().sum()

0

#### Copied from the lecture 5:

The idea is this (assume we want to fill missing values in `Age` column):
  * Split the dataset into two parts: 
    * `D1` = contaning the lines with missing values in `Age` column, 
    * `D2` = the rest of the data.
  * Save the column `D2.Age` to `Y` and the remaining columns to `X` (exclude some columns if needed). The same columns of `D1` save to `X2`.
  * Fit a model (we use the kNN) to predict `Y` using `X`.
  * Use this model to predict the missing values of `Age` using the `X2` data.

###  </2nd_way>

### Final product of modification:

In [15]:
# full_data.head()
# full_data.describe()
# full_data.info()

### Dataset division: train, evaluation + survived

In [16]:
# training data
train_data = full_data[:1000]
train_data_no_ID = train_data.drop('ID', axis=1)
survived = train_df.loc[:,'survived']

# testing data
eval_data = full_data[1000:]
eval_no_ID = eval_data.drop('ID', axis=1)

### cross-validation

scikit documentation
>... training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
> However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. 
> A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.

\+ https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

### Function for trying different classification models

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import sklearn.metrics as metrics
import sklearn.metrics as confusion_matrix

def model_data(model, param_grid, Xdata, ydata):
    rd_seed = 333
    test_size = 0.25
    Xtrain, Xtest, ytrain, ytest = train_test_split(Xdata, ydata, test_size=test_size, random_state=rd_seed)
    
    grid_search = GridSearchCV(estimator = model, param_grid = param_grid, cv = 10, n_jobs = -1)
    grid_search.fit(Xtrain, ytrain)
    
    best_params = grid_search.best_params_
    best_model = grid_search.best_estimator_
    
    print(best_params)
    print('accuracy score (train): {0:.6f}'.format(best_model.score(Xtrain, ytrain)))
    print('accuracy score (test): {0:.6f}'.format(best_model.score(Xtest, ytest)))

### Function for trying multiple classification models at once

In [18]:
# function - multiple models
def try_different_models(models_with_grids, Xdata, ydata):
    '''requires 'models' in form of {model:{parameter_grid}}'''
    for m, p in models_with_grids.items():
        model = m
        param_grid = p
        print(str(model).split('(')[0],':')
        model_data(model, param_grid, Xdata, ydata)
        print('')

### Data

In [19]:
Xdata = train_data_no_ID
ydata = survived

In [20]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [21]:
models_with_params = {
    DecisionTreeClassifier(random_state=42) : {'max_depth' : range(1,101),
                              'criterion': ['entropy', 'gini']},
    
    RandomForestClassifier(random_state=42) : {'max_depth' : range(1,5), 
                              'n_estimators': range(1,60,2)},
    
    AdaBoostClassifier() : {'n_estimators': range(1,100,5),
                          'learning_rate': [0.01, 0.05, 0.1, 0.3, 0.5, 1]},
    
    LogisticRegression() : {},
    
    KNeighborsClassifier() : {'n_neighbors': range(1,20),
                            'p': range(1,5),
                            'weights': ['uniform', 'distance']
                           }
}

### Applying models

In [22]:
try_different_models(models_with_params, Xdata, ydata)

DecisionTreeClassifier :
{'criterion': 'entropy', 'max_depth': 5}
accuracy score (train): 0.833333
accuracy score (test): 0.836000

RandomForestClassifier :
{'max_depth': 4, 'n_estimators': 13}
accuracy score (train): 0.822667
accuracy score (test): 0.836000

AdaBoostClassifier :
{'learning_rate': 0.1, 'n_estimators': 86}
accuracy score (train): 0.808000
accuracy score (test): 0.808000

LogisticRegression :
{}
accuracy score (train): 0.810667
accuracy score (test): 0.820000

KNeighborsClassifier :
{'n_neighbors': 15, 'p': 1, 'weights': 'uniform'}
accuracy score (train): 0.810667
accuracy score (test): 0.808000



## Prediction for evaluation.csv using chosen model

### Using the best model, parameters and missing "Age" values filling method
Eventhough RandomForestClassifier with "Age 2" option of filling missing values reached the highest test accuracy (84 %), it was not the only information I took into account. The final decision is based on both test accuracy and the difference between test and train accuracy. That's why **I picked the DecisionTreeClassifier with option "Age 1"** as the "best solution".
Overview of the accuracy scores for different models and missing Age values filling methods
 process is listed below.

In [27]:
import sklearn.metrics as metrics
rd_seed = 333
test_size = 0.25

Xtrain, Xtest, ytrain, ytest = train_test_split(Xdata, ydata, test_size=test_size, random_state=rd_seed)

model = DecisionTreeClassifier(max_depth=5, criterion='entropy', random_state=42)
model.fit(Xtrain, ytrain)
ypredicted = model.predict(Xtest)

print('accuracy score (train): {0:.6f}'.format(metrics.accuracy_score(ytrain, model.predict(Xtrain))))
print('accuracy score (test): {0:.6f}'.format(metrics.accuracy_score(ytest, ypredicted)))

accuracy score (train): 0.833333
accuracy score (test): 0.836000


#### Evaluation and writing into .csv file

In [70]:
final = pd.DataFrame()
final['ID'] = eval_data.ID
final['yeval'] = model.predict(eval_no_ID)
final.reset_index(drop=True, inplace=True)
final.to_csv('results.csv', index=False)

### Overview of the accuracy scores for different models and missing Age values filling methods

#### AGE 1

DecisionTreeClassifier :
{'criterion': 'entropy', 'max_depth': 5}
accuracy score (train): 0.833333
accuracy score (test): 0.836000

RandomForestClassifier :
{'max_depth': 4, 'n_estimators': 13}
accuracy score (train): 0.822667
accuracy score (test): 0.836000

AdaBoostClassifier :
{'learning_rate': 0.1, 'n_estimators': 86}
accuracy score (train): 0.808000
accuracy score (test): 0.808000

LogisticRegression :
{}
accuracy score (train): 0.810667
accuracy score (test): 0.820000

KNeighborsClassifier :
{'n_neighbors': 15, 'p': 1, 'weights': 'uniform'}
accuracy score (train): 0.810667
accuracy score (test): 0.808000

#### AGE 2 and k=5 for KNN

DecisionTreeClassifier :
{'criterion': 'entropy', 'max_depth': 3}
accuracy score (train): 0.810667
accuracy score (test): 0.828000

RandomForestClassifier :
{'max_depth': 3, 'n_estimators': 15}
accuracy score (train): 0.816000
accuracy score (test): 0.816000

AdaBoostClassifier :
{'learning_rate': 1, 'n_estimators': 71}
accuracy score (train): 0.829333
accuracy score (test): 0.816000

LogisticRegression :
{}
accuracy score (train): 0.809333
accuracy score (test): 0.824000

KNeighborsClassifier :
{'n_neighbors': 15, 'p': 1, 'weights': 'uniform'}
accuracy score (train): 0.813333
accuracy score (test): 0.816000


#### AGE 2 and k=10 for KNN

DecisionTreeClassifier :
{'criterion': 'entropy', 'max_depth': 5}
accuracy score (train): 0.836000
accuracy score (test): 0.812000

RandomForestClassifier :
{'max_depth': 4, 'n_estimators': 13}
accuracy score (train): 0.822667
accuracy score (test): 0.840000

AdaBoostClassifier :
{'learning_rate': 1, 'n_estimators': 26}
accuracy score (train): 0.814667
accuracy score (test): 0.800000

LogisticRegression :
{}
accuracy score (train): 0.810667
accuracy score (test): 0.820000

KNeighborsClassifier :
{'n_neighbors': 13, 'p': 1, 'weights': 'uniform'}
accuracy score (train): 0.812000
accuracy score (test): 0.812000

#### AGE 2  and k=30 for KNN

DecisionTreeClassifier :
{'criterion': 'entropy', 'max_depth': 3}
accuracy score (train): 0.810667
accuracy score (test): 0.828000

RandomForestClassifier :
{'max_depth': 4, 'n_estimators': 13}
accuracy score (train): 0.821333
accuracy score (test): 0.840000

AdaBoostClassifier :
{'learning_rate': 1, 'n_estimators': 71}
accuracy score (train): 0.832000
accuracy score (test): 0.804000

LogisticRegression :
{}
accuracy score (train): 0.810667
accuracy score (test): 0.820000

KNeighborsClassifier :
{'n_neighbors': 13, 'p': 1, 'weights': 'uniform'}
accuracy score (train): 0.806667
accuracy score (test): 0.820000