| Variable | Definition | Key | Type |
| :- | :- | :- | :- |
| survived | Survived | 0 = No, 1 = Yes | Numerical |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | Numerical |
| name | Name | | String |
| sex | Sex | | String |
| age | Age in years | | Numerical |
| sibsp | # of siblings / spouses aboard the Titanic | | Numerical |
| parch | # of parents / children aboard the Titanic | | Numerical |
| ticket | Ticket number | | String |
| fare | Passenger fare | | Numerical |
| cabin | Cabin number | | String |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | String |
| boat | Lifeboat (if survived) | | Numerical |
| body | Body number (if did not survive and body was recovered) | | Numerical |
| home.dest | Destination | | String |


### Read CSV

In [1]:
import pandas as pd

In [2]:
dataset = pd.read_csv('titanic.csv', encoding='utf-8', na_values='?')

### Numeric transformer

In this step, we need to handle missing numeric values and scale the numeric columns. This can be done in the same way as shown in the presentation.

- https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

See example below:

In [3]:
dataset.loc[dataset['age'].isnull()].head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
15,1,0,"Baumann, Mr. John D",male,,0,0,PC 17318,25.925,,S,,,"New York, NY"
37,1,1,"Bradley, Mr. George ('George Arthur Brayton')",male,,0,0,111427,26.55,,S,9.0,,"Los Angeles, CA"
40,1,0,"Brewe, Dr. Arthur Jackson",male,,0,0,112379,39.6,,C,,,"Philadelphia, PA"
46,1,0,"Cairns, Mr. Alexander",male,,0,0,113798,31.0,,S,,,
59,1,1,"Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev...",female,,0,0,17770,27.7208,,C,5.0,,"New York, NY"


In [None]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
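
One possible shape for the numeric transformer is sketched below. It runs on a small made-up frame rather than on `titanic.csv`, and both the `toy_numeric` name and the median imputation strategy are assumptions chosen for illustration; the mean or a constant may suit the data better.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the numeric Titanic columns (values are made up).
toy_numeric = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, 58.0],
    'fare': [7.25, 71.28, np.nan, 26.55],
})

# Step 1 fills missing values with the column median (an assumed strategy);
# step 2 rescales each column to zero mean and unit variance.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

scaled = numeric_transformer.fit_transform(toy_numeric)
print(scaled.shape)            # (4, 2)
print(np.isnan(scaled).any())  # False: no missing values remain
```

The median is less sensitive to outliers such as very high fares, which is why it is used in this sketch.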

### Categorical transformer

In this step, we need to handle text values. The first step of the pipeline should fill in missing values, and the second should convert the text values into one-hot encoded categorical features.

- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

See example below:

In [4]:
dataset.loc[dataset['embarked'].isnull()].head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
168,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,6,,
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,6,,"Cincinatti, OH"


In [None]:
from sklearn.preprocessing import OneHotEncoder
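
The two steps described above could look roughly like this. The frame `toy_categorical` is made up, and filling gaps with the most frequent value is an assumption; a constant placeholder such as `'missing'` would work as well.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the text columns (values are made up).
toy_categorical = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'embarked': ['S', np.nan, 'C', 'S'],
})

# Step 1 fills gaps with the most frequent value of each column (assumed
# strategy); step 2 turns every category into its own 0/1 column.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

encoded = categorical_transformer.fit_transform(toy_categorical)
dense = encoded.toarray()  # the encoder returns a sparse matrix by default
print(dense.shape)         # (4, 4): two columns for sex, two for embarked
```

Setting `handle_unknown='ignore'` keeps the encoder from raising an error if a category appears at prediction time that was not seen during fitting.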

### Column transformer

This step combines the previous two steps.

- https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

In [None]:
from sklearn.compose import ColumnTransformer
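
A minimal sketch of how the two transformers might be combined follows. The column grouping is an assumption (here `pclass` is treated as numeric, though encoding it as a category is equally defensible), and the toy rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed grouping of the feature columns.
numeric_features = ['pclass', 'age', 'sibsp', 'parch', 'fare']
categorical_features = ['sex', 'embarked']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Each transformer is applied only to its own list of columns.
# sparse_threshold=0 forces a dense array as the result.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
    ],
    sparse_threshold=0,
)

# Toy rows with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    'pclass': [1, 3, 2, 3, 1],
    'sex': ['male', 'female', 'female', 'male', 'female'],
    'age': [22.0, np.nan, 35.0, 58.0, 4.0],
    'sibsp': [1, 0, 0, 0, 1],
    'parch': [0, 0, 2, 0, 1],
    'fare': [71.28, 7.25, np.nan, 8.05, 26.55],
    'embarked': ['C', 'S', np.nan, 'S', 'Q'],
})

features = preprocessor.fit_transform(toy)
print(features.shape)  # (5, 10): 5 numeric + 2 sex + 3 embarked columns
```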

### Feature selection

In this step, we can automatically select the features that matter most for predicting whether a passenger survived and discard the irrelevant ones.

For example, we might suspect that the ticket price was not critical for survival. But we might be wrong, and the price could turn out to be extremely important.

There is a mathematical solution to this problem. We can determine the statistical significance of each attribute and keep only the significant ones. Such an approach can save computational resources, shorten training time, and, most importantly, can lead to better results.

- https://scikit-learn.org/stable/modules/feature_selection.html

In [None]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
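
The imports above suggest model-based selection, which can be sketched as follows. Note that this ranks features by the coefficients of an L1-penalised linear model rather than by a formal significance test. The data comes from `make_classification`, and `C=0.1` is an assumed value; smaller values of `C` drop more features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic data: 8 features, of which only 3 carry information.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0,
    random_state=0,
)

# The L1 penalty pushes the coefficients of unhelpful features to zero;
# SelectFromModel then keeps only the features with non-zero weight.
# penalty='l1' requires dual=False in LinearSVC.
selector = SelectFromModel(
    LinearSVC(C=0.1, penalty='l1', dual=False, max_iter=5000, random_state=0)
)
X_selected = selector.fit_transform(X_toy, y_toy)

print(X_toy.shape[1], '->', X_selected.shape[1], 'features kept')
print(selector.get_support())  # boolean mask of the kept columns
```

In the full exercise, such a selector would sit in the pipeline between the column transformer and the classifier.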

### Train-test split

In this step, we need to split the data into two sets: one for training and one for testing.

- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = dataset.loc[:, ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = dataset.loc[:, 'survived']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Best model selection

In this step, we can try a few classifiers and choose the best one.

- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Feel free to add a few more.

Hint: Use the pipeline a few times to try each model, and save the best one.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
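
The hint above might be followed roughly as in this sketch. It uses a synthetic frame instead of `titanic.csv`, with labels produced by a made-up rule, and it compares models by 5-fold cross-validation on the training set. The helper `make_preprocessor` and the dictionary `candidates` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (made-up values, not real passengers).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'age': rng.uniform(1, 70, n),
    'fare': rng.uniform(5, 100, n),
    'sex': rng.choice(['male', 'female'], n),
    'embarked': rng.choice(['C', 'Q', 'S'], n),
})
toy.loc[::10, 'age'] = np.nan                           # inject missing values
toy['survived'] = (toy['sex'] == 'female').astype(int)  # made-up labelling rule

X = toy.drop(columns='survived')
y = toy['survived']
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

def make_preprocessor():
    """Hypothetical helper: a fresh preprocessing step for each pipeline."""
    numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                        ('scaler', StandardScaler())])
    categorical = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                            ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    return ColumnTransformer([('num', numeric, ['age', 'fare']),
                              ('cat', categorical, ['sex', 'embarked'])])

candidates = {
    'decision_tree': DecisionTreeClassifier(random_state=1),
    'knn': KNeighborsClassifier(),
    'logistic_regression': LogisticRegression(max_iter=1000),
}

# Score every candidate with 5-fold cross-validation on the training set.
scores = {}
for name, classifier in candidates.items():
    model = Pipeline([('preprocessor', make_preprocessor()),
                      ('classifier', classifier)])
    scores[name] = cross_val_score(model, x_train, y_train, cv=5).mean()

# Refit the winner on the whole training set and check it on the test set.
best_name = max(scores, key=scores.get)
final_model = Pipeline([('preprocessor', make_preprocessor()),
                        ('classifier', candidates[best_name])])
final_model.fit(x_train, y_train)
test_accuracy = final_model.score(x_test, y_test)
print(best_name, round(test_accuracy, 3))
```

Choosing the model by cross-validation keeps the test set untouched until the final check, so the reported accuracy is not biased by the selection itself.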

### Prediction

In [None]:
from random import randrange

In [None]:
to_predict = dataset.iloc[[randrange(len(dataset))]]
to_predict

In [None]:
to_be_predicted = to_predict.loc[:, ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

In [None]:
# final_model: the best fitted pipeline saved in the model selection step
print('survived' if final_model.predict(to_be_predicted)[0] == 1 else 'not survived')