# 0. Overview
- Dataset: https://www.kaggle.com/c/titanic/overview
- Reference: https://www.kaggle.com/rbud613/taitanic-eda/comments#846113
- Prerequisites:
    - Knowledge of NumPy and pandas;
    - Knowledge of machine learning concepts;
    - Passion to learn;
    - Readiness to ask questions (of both Google and your Tech Lead).
- Objective:
    - Understand the basic procedure for building an ML model (for a classification problem).

---------

# 1. Get Started
- **Overview**: In this part, you will **read in** the data needed for this project.

## 1.1. Getting the Data
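- First, let's see what the data we are going to work with looks like.
- As you will see, this dataset contains information about many Titanic passengers.
- This part has **1 TODO**.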

In [None]:
import numpy as np
import pandas as pd

**TODO 1.1:** Use the pandas library to read in the data we need (i.e. `data/train.csv` & `data/test.csv`)
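
A minimal sketch of how `pd.read_csv` behaves, run on a small in-memory CSV so that it works without the Kaggle files. The two rows are made up for illustration; reading a real file works the same way, with a file path in place of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, used only to illustrate pd.read_csv
csv_text = "PassengerId,Survived,Pclass,Sex\n1,0,3,male\n2,1,1,female\n"

# pd.read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)   # (rows, columns)
print(demo.dtypes)  # column types are inferred from the text
```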

In [None]:
# Your Code Starts Here
train = ...
test = ...

In [None]:
# First few lines of training set (i.e. the data WITH known labels)
train.head()

In [None]:
# First few lines of test set (i.e. the data WITHOUT known labels)
test.head()

![Data Dictionary](./data/data_dictionary.png)


- Variable names (aka passenger info)
    - pclass: ticket class
        - 1st = Upper
        - 2nd = Middle
        - 3rd = Lower

    - age: age in years. Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

    - sibsp: relatives of the same generation (siblings, spouses). The dataset defines family relations in this way...
        - Sibling = brother, sister, stepbrother, stepsister
        - Spouse = husband, wife (mistresses and fiancés were ignored)

    - parch: relatives of a different generation (children, parents). The dataset defines family relations in this way...
        - Parent = mother, father
        - Child = daughter, son, stepdaughter, stepson
        - Some children travelled only with a nanny, therefore parch=0 for them.

- **Take a few minutes to look at the dataset and get an idea of it**
---------

# 2. Prediction
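## 2.0 Overview
- In this part, we carry out the core step of the project: using the passengers' information to predict whether they survived the sinking of the Titanic.
- This part has **1 TODO**.

- The data has been split into two parts:
  - Training set (data/train.csv)
  - Test set (data/test.csv)

- The **training set** is used to train the machine learning model, so it comes with the **ground-truth** class labels: for every passenger in the training set we know **whether they survived** (the **Survived** column).

- The **test set** is used to test our model, so its ground-truth labels are **not provided**. Your main job in this part is to use the model you trained to predict whether each passenger in the test set survived the disaster.

- The **final submission** must follow the format of `gender_submission.csv`. It is a set of predictions that **assume** all and only female passengers survive, as an example of what a submission file should look like.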
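
To make the "all and only female passengers survive" rule concrete, here is a small sketch that builds such a baseline on three made-up passengers and renders it as CSV text. The passenger rows are hypothetical; only the column names `PassengerId`, `Sex` and `Survived` come from the dataset.

```python
import pandas as pd

# Hypothetical passengers, for illustration only
toy_test = pd.DataFrame({
    'PassengerId': [892, 893, 894],
    'Sex': ['male', 'female', 'male'],
})

# Baseline rule: predict 1 (survived) for females, 0 for everyone else
baseline = pd.DataFrame({
    'PassengerId': toy_test['PassengerId'],
    'Survived': (toy_test['Sex'] == 'female').astype(int),
})

# to_csv() without a path returns the CSV text; index=False leaves out the row index
csv_text = baseline.to_csv(index=False)
print(csv_text)
```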

`References`: 
- 1. https://www.kaggle.com/c/titanic/overview
- 2. https://www.kaggle.com/ash316/eda-to-prediction-dietanic

`Submission to:` https://www.kaggle.com/c/titanic/submit

In [None]:
# Example of final submission file - this is the kind of file you want to output at the end of the project
pd.read_csv('data/gender_submission.csv').head()

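**TODO 2.0**
- 1. Read in the train and test tables (using pandas)
- 2. Add a column 'dataset' to both train and test to indicate which dataset each row belongs to (later cells filter on the values `"Train"` and `"Test"`)
- 3. Combine the two tables into one (using pd.concat)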
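
The three steps can be sketched on two tiny made-up tables. The frames `toy_train` and `toy_test` and their values are hypothetical; the point is only to show what `pd.concat` does when one table lacks the `Survived` column.

```python
import pandas as pd

# Hypothetical miniature train/test tables
toy_train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Fare': [7.25, 71.28]})
toy_test = pd.DataFrame({'PassengerId': [3], 'Fare': [8.05]})

# Tag every row with the table it came from
toy_train['dataset'] = 'Train'
toy_test['dataset'] = 'Test'

# Stack the rows; ignore_index=True renumbers the combined index 0..n-1.
# Columns missing from one table (here Survived) are filled with NaN.
toy_data = pd.concat([toy_train, toy_test], ignore_index=True)
print(toy_data)
```

Keeping both tables in one frame means each feature transformation only has to be written once; the `dataset` column lets us split them apart again later.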

In [None]:
# step 1 读取 train 和 test 两个表格（使用 Pandas）
train = ...
test = ...

# step 2 在 train 和 test 这两个表格中都加入一列 'dataset' 用于表示他们所属于的数据集


# step 3 将这两个表格合并到一起（使用 pd.concat）
data = ...
data.head(2)

## 2.1 Feature Understanding
- In this part, we look at how the features relate to one another, which gives us a useful reference for feature selection.
- This part has **no TODO**.
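
As a small illustration of what a correlation matrix contains, the sketch below computes one on a made-up frame. Correlation is only defined for numeric columns, so the text column is filtered out first with `select_dtypes` (one possible way to do it; the values are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0],
    'Pclass': [3, 3, 2, 1],
    'Sex': ['male', 'female', 'female', 'male'],  # text column, no correlation defined
})

# Keep only the numeric columns, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include='number').corr()
print(corr)
```

Each entry lies between -1 and 1; in this toy data `Fare` and `Pclass` are negatively correlated because the higher fares belong to the lower class numbers.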

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
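# Correlation matrix of the numeric columns only; pandas >= 2.0 raises an error on text columns otherwise
# (the numeric_only argument requires pandas >= 1.5)
sns.heatmap(data.corr(numeric_only = True), annot = True, cmap = 'RdYlGn', linewidths = 0.2)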
fig = plt.gcf()
fig.set_size_inches(10, 8)
plt.show()

## 2.2. Feature Engineering & Pre-processing
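- In this part, we carry out the second step of the pipeline: **feature engineering**. It is also the most important part of traditional machine learning, because the features constructed here **directly determine** how well the model can perform.
- Common feature engineering techniques include:
    - **Imputation**: handling missing values
    - **Binning**: grouping numeric values into a small number of intervals (bins)
    - **Feature Split & Combination**: splitting or combining certain features
    - **One-Hot Encoding**: encoding categorical data
    - **Pipeline**: chaining several feature transformations so that they run together
- The Titanic dataset is fairly simple and only exercises some of these techniques, so I suggest the following [article](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114), where feature engineering is summarized more fully.
- This part has **7 TODOs**.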

### 2.2.1 Initials / Ages (Imputation/填补)
- The first thing we'd like to investigate is whether age affects one's survival.
- As you may remember, there are many `NaN` values in the `Age` column. To fill those blanks with an appropriate age, we will extract the initial (i.e. the title, such as Mr or Miss) from every person's name and compute the average age for each initial.
- Then we will fill the `NaN` values in the `Age` column with the average age of that person's initial. (**Think about this: does this make sense?**)

In [None]:
# First, see how many NaN values the Age column has
data.Age.isnull().sum()

In [None]:
import re # Use regular expressions

# Do Not modify this cell, just run it.
data['initial'] = data.Name.apply(lambda s: re.findall(r'([A-Za-z]+)\.', s)[0]) 
data[['Name', 'initial']].head()

- For more information about regular expressions, here is a useful link: [Python regular expression](https://www.tutorialspoint.com/python/python_reg_expressions.htm)
- **You are not required to know regular expressions for this project, but you are encouraged to learn them on your own (it takes a while to understand them!).**
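
For the curious, here is a short sketch of what the pattern `([A-Za-z]+)\.` used above does, applied to three made-up names that follow the same "Surname, Title. Given names" layout as the `Name` column.

```python
import re

# Hypothetical names in the "Surname, Title. Given names" layout
names = ['Smith, Mr. John', 'Doe, Miss. Jane', 'Brown, Master. Tom']

# [A-Za-z]+ matches a run of letters; \. requires a literal dot right after it.
# The parentheses capture only the letters, and [0] takes the first match.
titles = [re.findall(r'([A-Za-z]+)\.', name)[0] for name in names]
print(titles)  # ['Mr', 'Miss', 'Master']
```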

- Now, take a look at our results of initials:

In [None]:
# As you can see here, there are too many unique initials,
# and some of them can be manually merged into the same group
data.initial.unique() 

In [None]:
# DO NOT modify this cell, just run it (try to understand how the Series.replace() method is used!)
data.initial = data.initial.replace(
    ['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don', 'Dona'],
    ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr', 'Miss']
)

In [None]:
data.initial.unique() # Looks good now!

**TODO 2.1:** For each initial group, compute the average age, then fill every `NaN` in `Age` with the average age of the group that passenger belongs to
- Hint: make use of the `DataFrame.groupby()` method, [link to doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
- Hint: make use of the `DataFrame.fillna()` method, [link to doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
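
A sketch of group-wise mean imputation on made-up data. It uses `groupby(...).transform('mean')`, which is one possible approach and not necessarily the one this notebook intends; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'initial': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss'],
    'Age': [30.0, 40.0, np.nan, 20.0, np.nan],
})

# transform('mean') returns a Series aligned with toy that holds each row's group mean
# (NaN values are skipped when the mean is computed)
group_mean = toy.groupby('initial')['Age'].transform('mean')

# fillna with a Series only fills the missing positions, matched by index
toy['Age'] = toy['Age'].fillna(group_mean)
print(toy)
```

The missing `Mr` age becomes 35.0 (the mean of 30 and 40) and the missing `Miss` age becomes 20.0.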

In [None]:
# Your Code Starts Here


In [None]:
# Your Code Starts Here
data.loc[(data.Age.isnull()) & (data.initial == 'Mr'), 'Age'] = 33
data.loc[(data.Age.isnull()) & (data.initial == 'Mrs'), 'Age'] = ...
data.loc[(data.Age.isnull()) & (data.initial == 'Master'), 'Age'] = ...
data.loc[(data.Age.isnull()) & (data.initial == 'Miss'), 'Age'] = ...
data.loc[(data.Age.isnull()) & (data.initial == 'Other'), 'Age'] = ...

### 2.2.2 Age band (Binning/分区)
- In this part, we split age into 5 classes, which turns a complex (continuous) feature into a simple (categorical) one.
- **TODO 2.2:** Implement the function `handle_age_band` and apply it to the `Age` column of the `data` dataframe. Hint: use the `Series.apply()` method
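
For comparison, pandas also offers `pd.cut` for binning. The sketch below is an alternative to a hand-written function, not the solution this TODO asks for; the bin edges mirror the five bands used in this section and the ages are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 16, 17, 32.5, 48, 64, 80])  # hypothetical ages

# Bins are right-closed: (0, 16], (16, 32], (32, 48], (48, 64], (64, inf)
# include_lowest=True also puts an age of exactly 0 into the first bin
edges = [0, 16, 32, 48, 64, np.inf]

# labels=False returns the integer bin index instead of an interval label
bands = pd.cut(ages, bins=edges, labels=False, include_lowest=True)
print(bands.tolist())  # [0, 0, 1, 2, 2, 3, 4]
```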

In [None]:
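# Function that maps an age to its age band
# (ages can be fractional, so the bands are defined by their upper bounds)
# age <= 16:       0
# 16 < age <= 32:  1
# 32 < age <= 48:  2
# 48 < age <= 64:  3
# age > 64:        4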
def handle_age_band(age):
    if age <= 16:
        return 0
    # Your Code Starts Here  
    return -1 # Modify here
    # Your Code Ends Here


# Your Code Starts Here
data['age_band'] = ...
# Your Code Ends Here
data.head(2)

### 2.2.3 Family Size (Feature Combination)
- We would also like to investigate whether family size has an impact on a passenger's survival.

- **TODO 2.3:** Here we create a new feature called `family_size`, which is the sum of the `Parch` and `SibSp` columns for each row; also create another column called `alone`, where 1 means traveling without family and 0 otherwise.
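
A sketch of feature combination on a made-up frame: two count columns are added row by row, and a 0/1 flag is derived from the sum. The frame `toy` is hypothetical, and the lambda is just one way of writing the flag.

```python
import pandas as pd

toy = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})  # hypothetical rows

# Column arithmetic is element-wise, so this adds the two counts row by row
toy['family_size'] = toy['SibSp'] + toy['Parch']

# apply() calls the function once per value of the Series
toy['alone'] = toy['family_size'].apply(lambda n: 1 if n == 0 else 0)
print(toy)
```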

In [None]:
# Your Code Starts Here
data['family_size'] = ...
data['alone'] = ... # hint: make use of the Series.apply() method on the family_size column
# Your Code Ends Here

In [None]:
data.head(2)

### 2.2.4 Checking null values again

In [None]:
data.isna().sum()

- **TODO 2.4:** Fill in the remaining NaN values in the dataset
- Hint: make use of the `DataFrame.fillna()` method, [link to doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)
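
A sketch of two common fill strategies on made-up data: the most frequent value for a categorical column and the median for a numeric one. The frame `toy` and its values are hypothetical; which strategy suits which column is a judgment call.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', np.nan],
    'Fare': [7.25, np.nan, 8.05, 53.10],
})

# mode() returns the most frequent value(s); [0] takes the first one
most_common_port = toy['Embarked'].mode()[0]

# median() skips NaN by default
fare_median = toy['Fare'].median()

toy['Embarked'] = toy['Embarked'].fillna(most_common_port)
toy['Fare'] = toy['Fare'].fillna(fare_median)
print(toy)
```

The median is less sensitive than the mean to a few very large values, which is one reason it is often preferred for skewed columns such as fares.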

In [None]:
# Examples
data.Embarked = data.Embarked.fillna('S') # Because 'S' is the most common port of embarkation.

# TODO: handle other null values, if there are any; if not, just skip this step
# Note: nulls in the 'Survived' column are fine (they are the test-set rows, which have no label).

# Your Code Starts Here
fare_median = ...
data.Fare = data.Fare.fillna(fare_median) # Fill the nan values in fare columns with median value of passenger fare.
# Your Code Ends Here

In [None]:
# Drop un-needed features
data = data.drop(columns = ['Name', 'Ticket', 'PassengerId', 'Cabin', 'Age'])

In [None]:
data.isna().sum()

### 2.2.5 One-hot Encoding
- In this section, we will look at another important feature engineering process, specially designed for categorical data: **One-hot Encoding**.
- The idea behind one-hot encoding is simple: instead of a single column of categorical labels, we use a separate 0/1 column for each label (see the example below).
- Through one-hot encoding, we can transform data from a format that the model cannot take as input (e.g. text labels) into numeric vectors that the model can compute with.

In [None]:
# Original data
original = pd.DataFrame({
    'name': ['Joey', 'Scott', 'Jasmine', 'Alan', 'Mao'],
    'Major': ['DataSci', 'DataSci', 'CogsSci', 'DataSci', 'CompSci']
})
onehot = pd.DataFrame({
    'name': ['Joey', 'Scott', 'Jasmine', 'Alan', 'Mao'],
    'DataSci': [1, 1, 0, 1, 0],
    'CogsSci': [0, 0, 1, 0, 0],
    'CompSci': [0, 0, 0, 0, 1]
})

display(original, onehot)

- The way to obtain these encoded features is simple: scikit-learn already provides a transformer for this:
    - [One-hot Encoder from Scikit](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
    - **Note**: by default, the output of this transformer is a sparse matrix rather than a DataFrame, which makes it a little harder to inspect and debug.
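
A minimal sketch of `OneHotEncoder` applied to the `Major` column from the toy example above. The names `majors` and `enc_demo` are only for illustration; `.toarray()` converts the sparse result into a regular NumPy array so it can be printed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

majors = pd.DataFrame({'Major': ['DataSci', 'DataSci', 'CogsSci', 'DataSci', 'CompSci']})

# handle_unknown='ignore' encodes categories unseen during fit as an all-zero row
enc_demo = OneHotEncoder(handle_unknown='ignore')
enc_demo.fit(majors)
print(enc_demo.categories_)  # learned categories, in sorted order

encoded = enc_demo.transform(majors).toarray()  # sparse matrix -> dense array
print(encoded)

# A category that was not present during fit
unseen = enc_demo.transform(pd.DataFrame({'Major': ['Math']})).toarray()
print(unseen)
```

Each row of `encoded` contains exactly one 1, in the column of its category; the unseen category produces a row of zeros.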

In [None]:
# DataFrame.query() is a quick way to filter the rows of a dataframe, much like a SQL WHERE clause.
train_df = data.query('dataset == "Train"').drop(columns = ['dataset'])

# We don't need survived column for test set.
test_df = data.query('dataset == "Test"').drop(columns = ['Survived', 'dataset']) 

In [None]:
X_train, y_train = train_df.drop(columns = ['Survived']), train_df.Survived.values

- **TODO 2.5:** One-hot encode the `initial` column

In [None]:
from sklearn.preprocessing import OneHotEncoder
enc = ... # remember to set 'handle_unknown' to 'ignore'!
enc.fit(...)

In [None]:
enc.categories_

In [None]:
enc.transform(...) # As you can see, the result is a sparse matrix

In [None]:
print('Shape of encoded:', enc.transform(train_df[['initial']]).todense().shape)
enc.transform(train_df[['initial']]).todense() # take a look at it, it's a dense matrix now

### 2.2.6 Normalization
- **TODO 2.6:** Normalize the `Fare` column (here: standardize it to zero mean and unit variance with `StandardScaler`)
- [Scikit Standard Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
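
To show what `StandardScaler` computes, the sketch below scales four made-up fares and compares the result with the formula z = (x - mean) / std written by hand. The values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical fares; scikit-learn expects a 2-D array of shape (n_samples, n_features)
fares = np.array([[7.25], [71.28], [8.05], [53.10]])

scaler = StandardScaler()
scaled = scaler.fit_transform(fares)

# Same result by hand; np.std uses the population std (ddof=0) by default, as StandardScaler does
by_hand = (fares - fares.mean(axis=0)) / fares.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, by_hand))
```

In a real workflow the scaler is fitted on the training data only and then reused, via `transform`, on validation and test data.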

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
ss = ...
ss.fit(...)

In [None]:
ss.transform(...)[:5]

In [None]:
train_df[['Fare']].head()

### 2.2.7 Pipeline & ColumnTransformer
- **TODO 2.7:** Put the one-hot encoding and the other feature transforms together into a Pipeline
- [Scikit Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
- [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
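
A sketch of how `Pipeline` and `ColumnTransformer` fit together, on a made-up two-column frame. The column choice here (`Fare` as numeric, `initial` as categorical) and all names ending in `_demo` or `_pipe` are assumptions for illustration; the real feature lists are for you to decide.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 8.05, 53.10],
    'initial': ['Mr', 'Mrs', 'Miss', 'Mr'],
})

# One small pipeline per kind of column
num_pipe = Pipeline(steps=[('scaler', StandardScaler())])
cat_pipe = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Each entry is (name, transformer, list of columns the transformer applies to)
ct_demo = ColumnTransformer(transformers=[
    ('num', num_pipe, ['Fare']),
    ('cat', cat_pipe, ['initial']),
])

features = ct_demo.fit_transform(toy)
print(features.shape)  # (4, 4): 1 scaled column + 3 one-hot columns
```

Columns that are not listed in any transformer are dropped by default (`remainder='drop'`), so every feature you want to keep must appear in one of the lists.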

In [None]:
train_df.head(2) # The Survived column is not needed as a feature here!

In [None]:
X_train, y_train = train_df.drop(columns = ['Survived']), train_df.Survived.values
X_test = test_df

In [None]:
# handle application features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Your Code Starts Here
num_feat = [...]
num_trans = Pipeline(
    steps = [
        ('scalar', StandardScaler())
    ]
)

one_hot = OneHotEncoder(handle_unknown = 'ignore')
cat_feat = [...]
cat_trans = Pipeline(
    steps = [
        ('onehot', one_hot)
    ]
)

ct = ColumnTransformer(
    transformers = [
        ('num', ..., ...),
        ('cat', ..., ...)
    ]
)
ct = ct.fit(X_train)
# Your Code Ends Here

In [None]:
X_train = ct.transform(X_train)
X_test = ct.transform(X_test)

In [None]:
print(X_train.shape, y_train.shape, X_test.shape)

## 2.3. Model Selection
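- In this part, we train candidate models and pick the most suitable one. Before we start, we need to define the criterion (metric) by which we judge a model. Note that this is not the same thing as the loss function:
    - Loss function: used during training; depends on the specific machine learning task (e.g. classification / regression)
    - Metric: used after training to evaluate the model; also depends on the specific task (e.g. classification / regression)
- This part has **1 TODO** in total.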
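
The difference between a loss and a metric can be sketched with two hypothetical sets of predicted probabilities. Binary cross-entropy (log loss) stands in for the loss and accuracy for the metric; both helper functions are written by hand for illustration and the numbers are made up.

```python
import numpy as np

y_true = np.array([1, 0, 1, 0])

# Two hypothetical models: both rank every example correctly,
# but the first one is much more confident
p_confident = np.array([0.9, 0.1, 0.8, 0.2])
p_hesitant = np.array([0.6, 0.4, 0.55, 0.45])

def log_loss(y, p):
    """Binary cross-entropy: the kind of quantity minimized during training."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def accuracy(y, p, threshold=0.5):
    """Fraction of correct hard predictions: the kind of quantity reported afterwards."""
    return float(np.mean((p >= threshold).astype(int) == y))

for name, p in [('confident', p_confident), ('hesitant', p_hesitant)]:
    print(name, 'loss =', round(log_loss(y_true, p), 3), 'acc =', accuracy(y_true, p))
```

Both models reach the same accuracy, yet their losses differ: the loss is sensitive to how confident the predictions are, while accuracy only looks at which side of the threshold they fall on.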

In [None]:
#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.naive_bayes import GaussianNB #Naive bayes

from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix

In [None]:
X, y = X_train.copy(), y_train.copy() # Save the training set for future use
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.33, random_state = 42)

- After creating the validation set, we try some scikit-learn models:
    - `SVC` (Support Vector Classification)
    - `Logistic Regression`
- For both of them, we will try various hyperparameter values in order to find the best model

In [None]:
# SVC model
model = svm.SVC(kernel = 'rbf', C = 1, gamma = 0.1)
model.fit(X_train, y_train)
prediction1 = model.predict(X_val)
print('Val acc for rbf SVM is ', metrics.accuracy_score(y_val, prediction1)) # signature: accuracy_score(y_true, y_pred)

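- **TODO 2.8**:
    - Step 1: read the following two documentation pages: [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) and [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), paying particular attention to the hyperparameters `C` and `gamma`
    - Step 2: copy the code from the cell above (the SVM model) into the cell below, try different values of `C` and `gamma`, and observe how the model performs. Append one record per setting to `all_result`; the cell after it sorts these records by an `acc` field
    - Step 3: try Logistic Regression on the same classification task (try `C` = 0.1, 1, 10 and keep the one w/ best acc)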
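
One possible shape for such an experiment loop, shown on synthetic data from `make_classification` so that it runs on its own. The candidate values for `C` and `gamma`, the record keys other than `acc`, and all variable names are assumptions for illustration, not recommended settings.

```python
import pandas as pd
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

results = []
for C in [0.1, 1, 10]:
    for gamma in [0.01, 0.1, 1]:
        clf = svm.SVC(kernel='rbf', C=C, gamma=gamma)
        clf.fit(X_tr, y_tr)
        acc = metrics.accuracy_score(y_va, clf.predict(X_va))
        # One record per hyperparameter setting
        results.append({'model': 'SVC', 'C': C, 'gamma': gamma, 'acc': acc})

# Best settings first
summary = pd.DataFrame(results).sort_values(by='acc', ascending=False)
print(summary.head())
```

Storing each run as a dictionary keeps the hyperparameters next to the score, so the best setting can be read directly from the first row of the sorted table.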

In [None]:
all_result = []
# Your Code Starts Here - SVC
...
# Your Code Ends Here

In [None]:
display(pd.DataFrame(all_result).sort_values(by = 'acc', ascending = False).head())

In [None]:
all_result = []
# Your Code Starts Here - Logistic Regression
...
# Your Code Ends Here

In [None]:
display(pd.DataFrame(all_result).sort_values(by = 'acc', ascending = False).head())

## 2.4. Model Testing
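- In this part, we pick the most suitable model according to the metric chosen earlier
- This part has **1 TODO**
- **TODO 2.9:** From the models and hyperparameters tried above, choose the best model together with its best hyperparameters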

In [None]:
# Your Code Starts Here
model = ... # Pick the best model
# Your Code Ends Here

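# Note: X_train / y_train here are the training split from section 2.3; the full training set was saved as X, y
model.fit(X_train, y_train)
prediction = model.predict(X_test)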

test_passenger_id = pd.read_csv('data/test.csv').PassengerId.values
test['PassengerId'] = test_passenger_id
test['Survived'] = prediction.astype('int')
test.head(2)

In [None]:
# prediction.astype('int')

## 2.5. Export Prediction & Submission
- Submit to this website: https://www.kaggle.com/c/titanic/submit

In [None]:
# Make sure your submission looks like this
pd.read_csv('data/gender_submission.csv').head()

In [None]:
test[['PassengerId', 'Survived']].to_csv('data/submission.csv', index = False)

In [None]:
test[['PassengerId', 'Survived']]

# The End.