# Module 3. Introduction to Machine Learning
Machine learning is the process of letting a program discover the patterns hidden in data by itself. Traditionally, we get a computer to do something by giving it explicit instructions. In machine learning we specify a goal instead, and the computer finds a suitable method through repeated trial and correction. The field is broad: it covers everything from traditional statistical tools such as regression to more modern approaches such as deep learning. In this module we go through some of these techniques using the Titanic dataset.

In [None]:
import pandas as pd
import numpy as np

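The data used here comes from a classic Kaggle competition: Titanic.
The dataset contains each passenger's personal information, ticket information, and whether they survived the shipwreck. The goal of the competition is to train a model that predicts the chance of survival from the personal and ticket data.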

In [None]:
# load the dataset as usual
# if you are working locally and titanic.csv is in the same folder, change the path to "titanic.csv"
df = pd.read_csv("https://raw.githubusercontent.com/JumpingSquid/py_tutorial/master/titanic.csv")
df.head()

As you can see in the column names, the column "Survived" indicates whether the passenger survived. Below we try several models to see how well we can predict it.
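
Before training anything, it can help to know how a trivial guess would score, so that later accuracy numbers have a reference point. The sketch below uses a small made-up `Survived` column (an assumption for illustration, not the real data) to compute the share of each class; the larger share is the accuracy of always predicting the majority class.

```python
import pandas as pd

# Hypothetical toy labels, not the real Titanic data: 1 = survived, 0 = did not.
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

# Share of each class in the column.
class_share = toy["Survived"].value_counts(normalize=True)
print(class_share)

# Accuracy of a "model" that always predicts the most common class.
baseline_accuracy = class_share.max()
print("majority-class baseline accuracy:", baseline_accuracy)
```

On the real data, the same lines can be run with `df` in place of `toy`.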

## 1. Train and Evaluate
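In practice, a machine learning workflow consists of (1) collecting data, (2) preprocessing the data, (3) training a model, and (4) evaluating the model. We start with (3), model training.
When training a model, we usually have some data that was collected and labeled in advance to serve as examples for the computer to learn from. This is called the training sample.
Once a model has been trained on this data, we use it to make predictions on real data.
However, machine learning has an important property: if the model is very flexible (has many parameters), then in theory, given enough training time, it can end up fitting the training sample perfectly without learning a pattern that carries over to new data.
We call this problem overfitting.  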

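Think of a student who takes the same exam a thousand times without the questions ever changing. Eventually the student scores 100 simply by memorizing each question and its answer, without really understanding why the answers are correct.
For this reason, we usually split the data into a train set and a test set before training, to make sure the model does not score well merely because it has overfitted the train set.
The train set is used to train the model, and the test set is held back to evaluate the model's accuracy on data it has not seen, which is how overfitting is detected.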
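
The memorizing student can be reproduced in a few lines. The sketch below is an illustration on synthetic data; the data, the 20% label noise, and the helper `predict_1nn` are all assumptions and not part of the Titanic exercise. A 1-nearest-neighbour model simply stores the train set and answers with the label of the closest stored point, so it is perfect on the train set and noticeably worse on unseen rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the label follows the sign of the first feature,
# but 20% of the labels are flipped at random (noise that cannot be learned).
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y = np.where(flip, 1 - y, y)

X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

def predict_1nn(X_query):
    """Return the label of the closest training point for each query row."""
    dists = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dists.argmin(axis=1)]

train_acc = (predict_1nn(X_train) == y_train).mean()
test_acc = (predict_1nn(X_test) == y_test).mean()
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The gap between the two numbers is the signal to look for: a high train score says little unless the test score is close to it.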

In [None]:
# split the dataset into training set and test set by 8:2
split_n = int(0.8 * len(df))
df_train = df.iloc[:split_n, :]
df_test = df.iloc[split_n:, :]

print("sample size of training set:", len(df_train))
print("sample size of test set:", len(df_test))
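
The split above takes the first 80% of the rows, which is only a fair sample if the file is not sorted in some meaningful way. A possible alternative is to draw the train rows at random. The sketch below uses a made-up ten-row frame and pandas' `DataFrame.sample`; the names `toy`, `toy_train`, and `toy_test` are placeholders for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Draw 80% of the rows at random; random_state makes the draw reproducible.
toy_train = toy.sample(frac=0.8, random_state=0)
# The rows that were not drawn form the test set.
toy_test = toy.drop(toy_train.index)

print("sample size of training set:", len(toy_train))
print("sample size of test set:", len(toy_test))
```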

### 1.1 Preprocessing
We first select three features: Pclass, Sex, and Age. Since gender is stored as a string, we need to convert it to 1/0 (male = 1, female = 0).
Then we need to fill the missing (NaN) values in the Age column; the code below fills them with 0.

In [None]:
# keep the three selected features
train_X = df_train.loc[:, ["Pclass", "Sex", "Age"]]
# encode gender as a number: female -> 0, male -> 1
train_X = train_X.replace("female", 0)
train_X = train_X.replace("male", 1)
# fill missing values (here: missing ages) with 0
train_X = train_X.fillna(0)
train_y = df_train.Survived

# apply the same steps to the test set
test_X = df_test.loc[:, ["Pclass", "Sex", "Age"]]
test_X = test_X.replace("female", 0)
test_X = test_X.replace("male", 1)
test_X = test_X.fillna(0)
test_y = df_test.Survived
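
Filling missing ages with 0 is simple, but 0 also looks like a real age, and any later rule such as `Age < 20` would treat those passengers as children. One common alternative, sketched here on made-up rows, is to fill with the median age of the train set. The frames `toy_train` and `toy_test` and the helper `preprocess` are hypothetical names, and this is not the approach the rest of this module uses.

```python
import pandas as pd

nan = float("nan")
# Hypothetical toy rows with some ages missing.
toy_train = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                          "Age": [22.0, nan, 38.0, 26.0]})
toy_test = pd.DataFrame({"Sex": ["female", "male"], "Age": [nan, 35.0]})

sex_codes = {"female": 0, "male": 1}
# The median is computed on the train rows only, so nothing leaks from the test set.
age_median = toy_train["Age"].median()

def preprocess(frame):
    """Encode Sex as 0/1 and fill missing ages with the train-set median."""
    out = frame.copy()
    out["Sex"] = out["Sex"].map(sex_codes)
    out["Age"] = out["Age"].fillna(age_median)
    return out

toy_train_X = preprocess(toy_train)
toy_test_X = preprocess(toy_test)
print(toy_test_X)
```

Wrapping the steps in one function also guarantees that the train and test sets are processed identically.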

### Linear Regression
We first train a linear regression model as a benchmark. When performing machine learning, it is important to choose the right features. Features are the variables that the model learns from. In this case, for instance, a passenger's gender, age, and class might be good features for predicting the chance of survival. Domain knowledge is no guarantee, but it usually helps you figure out which features are good.
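
A quick, informal way to check whether a candidate feature carries information is to compare the survival rate across its values. The sketch below does this for gender on a small made-up frame (the rows are an assumption for illustration, not the real data); a large difference between the groups suggests the feature is worth including.

```python
import pandas as pd

# Hypothetical toy rows, not the real Titanic data.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group.
survival_by_sex = toy.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```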

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
clf_linear = LinearRegression().fit(train_X, train_y)
print("R squared on the training set is", clf_linear.score(train_X, train_y))

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
print("The mean squared error for the linear regression is", mean_squared_error(test_y, clf_linear.predict(test_X)))
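
To make the two numbers above less of a black box, the sketch below computes them by hand with NumPy on four made-up values (the arrays are assumptions for illustration). The mean squared error is the average squared difference between label and prediction, and R squared is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

# Hypothetical labels and predictions.
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.8, 0.3, 0.6, 0.9])

# Mean squared error: average of the squared residuals.
mse = np.mean((y_true - y_pred) ** 2)

# R squared: share of the variance in y_true explained by the predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R squared:", r_squared)
```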

### Logistic Regression
Logistic regression is usually a better fit than linear regression when the target is binary, because it models the probability of each class directly and its predictions are valid class labels.
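
The difference comes from the logistic (sigmoid) function: logistic regression passes a linear combination of the features through it, so the output always lies between 0 and 1 and can be read as a probability, whereas a linear regression output has no such bound. The sketch below applies the function to a few made-up linear scores; the helper `sigmoid` and the score values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (intercept + weighted sum of the features).
scores = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
probs = sigmoid(scores)
print(probs)

# Turn probabilities into class labels with a 0.5 threshold.
labels = (probs > 0.5).astype(int)
print(labels)
```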

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf_logistic = LogisticRegression(solver="lbfgs").fit(train_X, train_y)

In [None]:
print("The mean squared error for the logistic regression is", mean_squared_error(test_y, clf_logistic.predict(test_X)))

In [None]:
print("logistic:", clf_logistic.predict(test_X)[:5])
print("linear:", clf_linear.predict(test_X)[:5])

In [None]:
print("Number of correct guesses for logistic reg:", sum(clf_logistic.predict(test_X) == test_y))
print("Accuracy:", sum(clf_logistic.predict(test_X) == test_y) / len(test_y))

In [None]:
binary_output = np.where(clf_linear.predict(test_X) > 0.5, 1, 0)
print("Number of correct guesses for linear reg:", sum(binary_output == test_y))
print("Accuracy:", sum(binary_output == test_y) / len(test_y))
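
When both the predictions and the labels are 0 or 1, the squared error of a single prediction is 1 if it is wrong and 0 if it is right. The mean squared error of hard class predictions is therefore simply the error rate, that is, 1 minus the accuracy. The sketch below checks this on five made-up values (the arrays are assumptions for illustration).

```python
import numpy as np

# Hypothetical 0/1 labels and 0/1 predictions.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

accuracy = np.mean(y_pred == y_true)
mse = np.mean((y_true - y_pred) ** 2)

print("accuracy:", accuracy)
print("MSE:", mse)
print("accuracy + MSE:", accuracy + mse)
```

For this reason, accuracy is usually the easier number to read when comparing classifiers.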

### Random Forest
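Random forest is one of the most powerful prediction models among the traditional ML tools. It trains many decision trees, each on a random resample of the training data, and combines their predictions.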
![random_forest_1](https://github.com/JumpingSquid/py_tutorial/raw/master/image/rf_ilus.png)
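
The two ingredients of a random forest can be sketched without training any trees. In the illustration below, `bootstrap_idx` shows how one tree's training rows are drawn with replacement, and `tree_votes` holds made-up 0/1 predictions from three hypothetical trees that are combined by majority vote. All values are assumptions; scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes, but the idea is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 8

# Bootstrap sample: row indices drawn with replacement, so some rows
# appear several times and others not at all.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
print("rows used by one tree:", bootstrap_idx)

# Hypothetical 0/1 predictions from three trees for five passengers.
tree_votes = np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
])

# Majority vote: predict 1 when more than half of the trees say 1.
forest_pred = (tree_votes.mean(axis=0) > 0.5).astype(int)
print("forest prediction:", forest_pred)
```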

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf_rf = RandomForestClassifier(n_estimators=150, max_depth=3, random_state=0)
clf_rf.fit(train_X, train_y)

In [None]:
print("The mean squared error for the random forest is", mean_squared_error(test_y, clf_rf.predict(test_X)))
print("Number of correct guesses for random forest:", sum(clf_rf.predict(test_X) == test_y))
print("Accuracy:", sum(clf_rf.predict(test_X) == test_y) / len(test_y))

## 2. Feature Engineering
After training the models on the three basic features, we may want to add more features to improve accuracy. In practice, we can transform or combine one or more existing features to create a new one. This is called feature engineering. In general, when a dataset has a limited sample size and few features, feature engineering is often the key to improving a model's performance.

In [None]:
print(train_X.head())

In [None]:
train_X["old"] = np.where(train_X.Age > 55, 1, 0)
train_X["young"] = np.where(train_X.Age < 20, 1, 0)
print(train_X.head())

test_X["old"] = np.where(test_X.Age > 55, 1, 0)
test_X["young"] = np.where(test_X.Age < 20, 1, 0)

In [None]:
train_X["old_man"] = np.where((train_X.old == 1)&(train_X.Sex==1), 1, 0)
train_X["young_man"] = np.where((train_X.young == 1)&(train_X.Sex==1), 1, 0)
print(train_X.head())

test_X["old_man"] = np.where((test_X.old == 1)&(test_X.Sex==1), 1, 0)
test_X["young_man"] = np.where((test_X.young == 1)&(test_X.Sex==1), 1, 0)
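
The same four features are built twice above, once for the train set and once for the test set. One way to keep the two in sync is to put the steps in a single function and call it on both frames. The sketch below uses a hypothetical helper name, `add_age_features`, and a made-up three-row frame; the thresholds are the ones used above.

```python
import numpy as np
import pandas as pd

def add_age_features(X):
    """Return a copy of X with the age-group and interaction indicators added."""
    X = X.copy()
    X["old"] = np.where(X["Age"] > 55, 1, 0)
    X["young"] = np.where(X["Age"] < 20, 1, 0)
    # Interaction features: age group combined with Sex == 1 (male).
    X["old_man"] = np.where((X["old"] == 1) & (X["Sex"] == 1), 1, 0)
    X["young_man"] = np.where((X["young"] == 1) & (X["Sex"] == 1), 1, 0)
    return X

# Hypothetical toy rows in the same layout as train_X.
toy = pd.DataFrame({"Pclass": [3, 1, 2], "Sex": [1, 0, 1], "Age": [60.0, 18.0, 10.0]})
toy_fe = add_age_features(toy)
print(toy_fe)
```

Because the function works on a copy, the input frame is left unchanged and the call can be repeated safely.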

In [None]:
# refit all three models on the extended feature set
clf_linear = LinearRegression().fit(train_X, train_y)
clf_logistic = LogisticRegression(solver="lbfgs").fit(train_X, train_y)
clf_rf.fit(train_X, train_y)

# threshold the linear regression output at 0.5 to get 0/1 predictions
binary_output = np.where(clf_linear.predict(test_X) > 0.5, 1, 0)
print("Number of correct guesses for linear reg:", sum(binary_output == test_y))
print("Number of correct guesses for logistic reg:", sum(clf_logistic.predict(test_X) == test_y))
print("Number of correct guesses for random forest:", sum(clf_rf.predict(test_X) == test_y))