# Module 3. Introduction to Machine Learning
Machine learning is the process of letting a program discover the hidden patterns in data by itself.
In the past, when we wanted a computer to do something for us, we had to define explicit rules in advance.
The idea of machine learning is instead to give the computer a goal and let it find a suitable way to handle the task through training, that is, through repeated attempts and corrections.
The field is broad: it covers traditional statistical tools such as regression as well as more modern approaches such as deep learning.
In this module we will go through some of these techniques using the Titanic dataset.

At its core, machine learning (or "modelling") is about finding the relationship between some independent variables X and a dependent variable y. Mathematically,
this means we want to find a function f such that y = f(X). The function can take many forms. Linear regression, for example, assumes the linear relationship y = X * W + e.
Deep learning can be seen as an extension of this linear model. There are also non-linear models, such as decision trees, that
predict y by checking a series of conditions and following whichever branch each condition selects.
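
To make y = X * W + e concrete, here is a minimal sketch that generates synthetic data from known weights and then recovers them with least squares. The data, the weights `W_true`, and the noise level are all made up for illustration; they have nothing to do with the Titanic dataset.

```python
import numpy as np

# synthetic data: 200 samples with 2 features each (illustrative only)
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 2))

# assumed "true" weights and a small random error term e
W_true = np.array([2.0, -1.0])
e = rng.normal(scale=0.1, size=n_samples)
y = X @ W_true + e

# least squares finds the W that minimizes the squared error of X @ W against y
W_hat, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print("true weights:     ", W_true)
print("estimated weights:", W_hat)
```

With enough samples and little noise, the estimated weights should land close to the ones used to generate the data.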

In [None]:
import pandas as pd
import numpy as np

We first use the data from Kaggle's classic competition: Titanic.
The Titanic dataset includes the passengers' personal information, their ticket information, and whether they survived the shipwreck.
The goal of the competition is to train a model that uses the personal and ticket information to predict survival as accurately as possible.

In [None]:
# load the data as usual
# if you are working in a local environment and titanic.csv is in the same folder, change the path to "titanic.csv"
df = pd.read_csv("https://raw.githubusercontent.com/JumpingSquid/py_tutorial/master/data/titanic.csv")
df.head()

As you can see from the column names, the variable "Survived" indicates whether the passenger survived.
We will try different models below to see how well we can predict it.

## 1. Train and Evaluate
In practice, a machine learning task includes (1) data collection, (2) data preprocessing, (3) model training, and (4) model evaluation.
We start with the third part, model training. In general, when we want to train a model, we have a set of data that was collected and labelled with the target value in advance;
these are the training samples that the computer learns from. We then use the trained model for inference, that is, for predicting on real data.
But there is a problem: overfitting. If a model is very flexible (it has many parameters), then in principle, given enough training time, it can be trained to match the training samples perfectly.
Overfitting means that the model performs very well on the training samples but worse on other data.

Imagine a student who takes the same exam, with exactly the same questions, a thousand times. The student can score 100 simply by memorizing the questions and answers, without really understanding why the answers are correct.

For this reason, before training a model we first split the dataset into a train set and a test set.
The train set is used to train the model, and the test set is used to evaluate its accuracy and to check that a high score on the train set is not merely the result of overfitting.
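
The following sketch illustrates overfitting on made-up data rather than on the Titanic dataset. It fits a simple model (a straight line) and a very flexible one (a degree-9 polynomial) to ten noisy points. The sine-shaped ground truth, the noise level, and the polynomial degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(x):
    # hypothetical ground truth: a sine curve plus random noise
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# ten training points and fifty separate test points
x_train = np.linspace(0.0, 1.0, 10)
y_train = make_data(x_train)
x_test = rng.uniform(0.0, 1.0, size=50)
y_test = make_data(x_test)

def mse(coefs, x, y):
    # mean squared error of the polynomial with coefficients `coefs`
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # 2 parameters
flexible = np.polyfit(x_train, y_train, deg=9)  # 10 parameters, one per training point

print("simple   train MSE:", mse(simple, x_train, y_train), "test MSE:", mse(simple, x_test, y_test))
print("flexible train MSE:", mse(flexible, x_train, y_train), "test MSE:", mse(flexible, x_test, y_test))
```

The flexible model has as many parameters as there are training points, so it can pass through every one of them, much like the student who memorized the answers. Its error on the training points is therefore close to zero, while its error on the unseen test points is not.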

In [None]:
# split the dataset into training set and test set by 8:2
split_n = int(0.8 * len(df))
df_train = df.iloc[:split_n, :]
df_test = df.iloc[split_n:, :]

print("sample size of training set:", len(df_train))
print("sample size of test set:", len(df_test))
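
The split above takes the first 80% of the rows in file order. If the rows happen to be sorted in some meaningful way, the two sets could differ systematically, so shuffling before splitting is a common precaution. Below is a minimal sketch on a hypothetical toy table (`toy` and its values are made up); scikit-learn's `train_test_split` offers a similar shuffled split.

```python
import pandas as pd

# hypothetical toy table standing in for the real data
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
})

# shuffle all rows reproducibly, then split 8:2 as before
shuffled = toy.sample(frac=1.0, random_state=0).reset_index(drop=True)
split_n = int(0.8 * len(shuffled))
toy_train = shuffled.iloc[:split_n, :]
toy_test = shuffled.iloc[split_n:, :]

print("sample size of training set:", len(toy_train))
print("sample size of test set:", len(toy_test))
```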

### 1.1 Feature Selection and Preprocessing
When performing machine learning, it is important to choose the right features.
Features are the variables that the model uses to predict the target value.
Using every available variable at once is the most direct approach, but variables usually differ in how complete they are, so
selecting in advance a few variables that are relevant to the target and mostly complete simplifies training.
In this case, a passenger's gender, age, and class might be good features for predicting the chance of survival.
Domain knowledge is no guarantee, but it can usually help you figure out which features are good.

We first select three features: Pclass (ticket class), Sex, and Age.
An important point in machine learning is that, whatever form the original data takes (numbers, text, or something else), the program can only work with numbers.
Since gender is stored as a string, we need to convert it to numbers before handing it to the model; a common approach is to encode it as 0/1.
Then we need to fill the NaN values in the Age column.

In [None]:
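# features: keep three columns and encode Sex as 0 (female) / 1 (male)
train_X = df_train.loc[:, ["Pclass", "Sex", "Age"]].copy()
train_X["Sex"] = train_X["Sex"].map({"female": 0, "male": 1})
# fill the missing values (the missing ages) with 0
train_X = train_X.fillna(0)
# target: 1 if the passenger survived, 0 otherwise
train_y = df_train.Survived

# apply exactly the same preprocessing to the test set
test_X = df_test.loc[:, ["Pclass", "Sex", "Age"]].copy()
test_X["Sex"] = test_X["Sex"].map({"female": 0, "male": 1})
test_X = test_X.fillna(0)
test_y = df_test.Survived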
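
Filling missing ages with 0 is simple, but it treats those passengers as newborns. One common alternative, shown here only as a sketch and not as the method used in this module, is to fill with the median age. The table `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical toy rows with one missing age
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "male", "female"],
    "Age": [29.0, np.nan, 40.0, 4.0],
})

toy_X = toy.copy()
# encode the string column as 0/1
toy_X["Sex"] = toy_X["Sex"].map({"female": 0, "male": 1})
# median() skips NaN, so it is computed from the known ages only
age_median = toy_X["Age"].median()
toy_X["Age"] = toy_X["Age"].fillna(age_median)

print(toy_X)
```

Whichever fill value is chosen, the same value should be used for both the train set and the test set.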

### 1.2 Model Selection

#### Linear Regression
We first train a linear regression model as a benchmark.
As mentioned earlier, machine learning covers many techniques, so we usually pick a basic, easy-to-build model as the baseline for comparison, and linear regression fits that description.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
clf_linear = LinearRegression().fit(train_X, train_y)
# score() returns R squared, computed here on the training set
print("R squared on the training set is", clf_linear.score(train_X, train_y))

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
# mean_squared_error expects the true values first, then the predictions
print("The mean squared error for the linear regression is", mean_squared_error(test_y, clf_linear.predict(test_X)))

#### Logistic Regression
Compared with linear regression, logistic regression is better suited to the binary case, where y lies between 0 and 1.
Common examples include probabilities and purchase decisions (buy or not), and it is also widely used on websites, for example to judge whether a visitor is a real person or a crawler bot.
Among simple models, logistic regression is often considered to have predictive power comparable to that of more complex models.
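
Logistic regression keeps its output between 0 and 1 by passing a linear combination of the features through the sigmoid function. The sketch below shows only that step. The weights `W` and the intercept `b` are hypothetical values picked for illustration, not values learned from the Titanic data.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights for (Pclass, Sex, Age) and a hypothetical intercept
W = np.array([-1.0, -2.5, -0.03])
b = 4.0

# two made-up passengers: (1st class, female, 30) and (3rd class, male, 25)
X = np.array([
    [1, 0, 30.0],
    [3, 1, 25.0],
])

# linear part first, then the sigmoid turns it into a probability-like score
p = sigmoid(X @ W + b)
print("predicted probabilities:", p)
print("predicted classes:", np.where(p > 0.5, 1, 0))
```

A linear regression would output `X @ W + b` directly, which can fall below 0 or above 1; the sigmoid is what keeps the result interpretable as a probability.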

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
clf_logistic = LogisticRegression(solver="lbfgs").fit(train_X, train_y)

In [None]:
print("The mean squared error for the logistic regression is", mean_squared_error(test_y, clf_logistic.predict(test_X)))

In [None]:
print("logistic:", clf_logistic.predict(test_X)[:5])
print("linear:", clf_linear.predict(test_X)[:5])

In [None]:
# accuracy = number of correct predictions / number of test samples
print("Number of correct guesses for logistic reg:", sum(clf_logistic.predict(test_X) == test_y))
print("Accuracy: ", sum(clf_logistic.predict(test_X) == test_y)/len(df_test))

In [None]:
# linear regression outputs continuous values, so threshold them at 0.5 to get a 0/1 prediction
binary_output = np.where(clf_linear.predict(test_X) > 0.5, 1, 0)
print("Number of correct guesses for linear reg:", sum(binary_output == test_y))
print("Accuracy: ", sum(binary_output == test_y)/len(df_test))
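
The cells above report both a mean squared error and an accuracy. When predictions and labels are both 0/1, the two are directly related: each wrong prediction contributes a squared error of 1, so the mean squared error equals 1 minus the accuracy. A small sketch with made-up numbers (the arrays `y_true` and `y_score` are hypothetical):

```python
import numpy as np

# hypothetical true labels and continuous model outputs
y_true = np.array([0, 1, 1, 1, 0])
y_score = np.array([0.2, 0.7, 0.4, 0.9, 0.1])

# threshold at 0.5 to get 0/1 predictions
y_pred = np.where(y_score > 0.5, 1, 0)

accuracy = float(np.mean(y_pred == y_true))
mse = float(np.mean((y_pred - y_true) ** 2))

print("accuracy:", accuracy)
print("mean squared error of the 0/1 predictions:", mse)
```

This is why the mean squared error of the logistic regression's class predictions carries the same information as its accuracy, whereas for the raw linear regression outputs it measures something different.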

#### Decision Tree and Random Forest
When we want higher accuracy and have sufficient computing resources, models based on decision trees are usually among the best choices.
In data prediction competitions such as Kaggle, the winning teams often use some kind of
decision-tree-based algorithm.
A single decision tree has limited predictive power; however, if we train hundreds or thousands of decision trees and aggregate
their predictions in some way, we may get better performance. That is the starting point of the random forest model.
![random_forest_1](https://github.com/JumpingSquid/py_tutorial/raw/master/image/rf_ilus.png)
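
One simple way to aggregate many trees is a majority vote. The sketch below uses a made-up table of votes from three imaginary trees; it illustrates the idea only and is not necessarily how scikit-learn combines the trees internally.

```python
import numpy as np

# hypothetical 0/1 predictions: one row per tree, one column per passenger
votes = np.array([
    [1, 0, 1, 1],  # tree 1
    [1, 1, 0, 1],  # tree 2
    [0, 0, 1, 1],  # tree 3
])

# share of trees voting "survived" for each passenger
share = votes.mean(axis=0)

# majority vote: predict 1 when more than half of the trees say 1
forest_prediction = (share > 0.5).astype(int)

print("share of trees voting 1:", share)
print("aggregated prediction:", forest_prediction)
```

Each tree is wrong on some passengers, but as long as the trees make different mistakes, the majority can still be right.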

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf_rf = RandomForestClassifier(n_estimators=150, max_depth=3, random_state=0)
clf_rf.fit(train_X, train_y)

In [None]:
print("The mean squared error for the random forest is", mean_squared_error(test_y, clf_rf.predict(test_X)))
print("Number of correct guesses for random forest:", sum(clf_rf.predict(test_X) == test_y))
print("Accuracy: ", sum(clf_rf.predict(test_X) == test_y)/len(df_test))

## 2. Feature Engineering
After training the models on three basic features, we may want to add more features to improve accuracy.
In practice, we can transform an existing feature or combine several existing features to create a new one.
This is called feature engineering. In general, when a dataset has a limited sample size or a limited number of features,
feature engineering can do a lot to improve a model's performance.

In [None]:
print(train_X.head())

In [None]:
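# flag older (Age > 55) and younger (Age < 20) passengers
# note: missing ages were filled with 0 during preprocessing, so those rows are also flagged as young
train_X["old"] = np.where(train_X.Age > 55, 1, 0)
train_X["young"] = np.where(train_X.Age < 20, 1, 0)
print(train_X.head())

# build the same features for the test set
test_X["old"] = np.where(test_X.Age > 55, 1, 0)
test_X["young"] = np.where(test_X.Age < 20, 1, 0)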

In [None]:
train_X["old_man"] = np.where((train_X.old == 1)&(train_X.Sex==1), 1, 0)
train_X["young_man"] = np.where((train_X.young == 1)&(train_X.Sex==1), 1, 0)
print(train_X.head())

test_X["old_man"] = np.where((test_X.old == 1)&(test_X.Sex==1), 1, 0)
test_X["young_man"] = np.where((test_X.young == 1)&(test_X.Sex==1), 1, 0)
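
Features such as `old_man` combine two existing 0/1 features and are often called interaction features. For 0/1 columns, the `np.where` condition used above gives the same result as simply multiplying the two columns, as this sketch on a hypothetical toy table shows (`toy` and its values are made up):

```python
import numpy as np
import pandas as pd

# hypothetical toy rows: Sex is 1 for male, 0 for female
toy = pd.DataFrame({
    "Sex": [1, 0, 1, 0],
    "Age": [60.0, 70.0, 10.0, 30.0],
})
toy["old"] = np.where(toy.Age > 55, 1, 0)

# the interaction written as a condition, as in the cell above
toy["old_man_where"] = np.where((toy.old == 1) & (toy.Sex == 1), 1, 0)

# the same interaction written as a product of the two 0/1 columns
toy["old_man_product"] = toy.old * toy.Sex

print(toy)
```

The product is 1 only when both columns are 1, which is exactly the "old and male" condition.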

In [None]:
# retrain all three models on the extended feature set
clf_linear = LinearRegression().fit(train_X, train_y)
clf_logistic = LogisticRegression(solver="lbfgs").fit(train_X, train_y)
clf_rf.fit(train_X, train_y)

# compare the number of correct predictions on the test set
binary_output = np.where(clf_linear.predict(test_X) > 0.5, 1, 0)
print("Number of correct guesses for linear reg:", sum(binary_output == test_y))
print("Number of correct guesses for logistic reg:", sum(clf_logistic.predict(test_X) == test_y))
print("Number of correct guesses for random forest:", sum(clf_rf.predict(test_X) == test_y))

## [DANGER ZONE] Deep Learning with TensorFlow

![neural_network_1](https://github.com/JumpingSquid/py_tutorial/raw/master/image/nn1.png)
![neural_network_2](https://github.com/JumpingSquid/py_tutorial/raw/master/image/nn2.png)
![neural_network_3](https://github.com/JumpingSquid/py_tutorial/raw/master/image/inception_v2_resnet.png)

This example comes from the official TensorFlow tutorial
(https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/classification.ipynb).

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

In [None]:
plt.figure()
plt.imshow(train_images[0])
plt.colorbar()
plt.grid(False)
plt.show()


In [None]:
train_images = train_images / 255.0
test_images = test_images / 255.0


In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=10)


In [None]:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
probability_model = tf.keras.Sequential([model,
                                         tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)
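
The last `Dense(10)` layer of the model outputs raw scores (logits), one per class, and the `Softmax` layer added above turns them into probabilities that sum to 1. Here is a minimal NumPy sketch of that conversion, with made-up logits for three classes instead of ten:

```python
import numpy as np

def softmax(logits):
    # subtracting the maximum does not change the result but avoids overflow in exp
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# hypothetical raw scores for three classes
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print("probabilities:", probs)
print("sum of probabilities:", probs.sum())
print("most likely class index:", int(np.argmax(probs)))
```

Softmax preserves the ordering of the scores, so the class with the largest logit is also the class with the largest probability; that is why `np.argmax` on the predictions picks the predicted label.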

In [None]:
test_image_id = 1
plt.figure()
plt.imshow(test_images[test_image_id])

print(class_names[np.argmax(predictions[test_image_id])])
print(class_names[test_labels[test_image_id]])

Some cool examples:
https://richzhang.github.io/colorization/
https://shunsukesaito.github.io/PIFuHD/

Any comments? https://forms.gle/qTjUWM2oC2VL1iaf8