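# Titanic - Model Selection / Evaluation
This lesson covers how to implement a binary classification model. Titanic is a well-known introductory machine learning competition on Kaggle whose goal is to predict, from passenger data, whether each passenger survived the shipwreck. Through this project you will learn to manipulate data with the Python packages pandas and numpy, visualize it with matplotlib and seaborn, and build models with scikit-learn.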

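### Environment Notes
Before running this example, confirm that the Jupyter notebook is configured correctly. From the main menu choose "Edit" > "Notebook settings" > "Runtime type" and select "Python 3", change the "Hardware accelerator" drop-down from "None" to "GPU", then click "Save".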

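### Course Structure
In the Titanic project, learners build a machine learning model and use it to predict passenger survival. The work consists of four main steps:

>1.   How to perform data preprocessing (Preprocessing)

>2.   How to carry out exploratory data analysis (Exploratory Data Analysis)

>3.   How to apply feature engineering (Feature Engineering)

>4.   How to select a model and evaluate its performance (Model & Inference)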

---

**4.1 Load the Required Packages**

---

In [1]:
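# 4-1
# First, load the required packages. `import (package_name) as (xxx)` is commonly used to shorten a package name so it is easier to call later

import pandas as pd # core data types are Series and DataFrame; builds on numpy and adds more advanced operations
import numpy as np # package for working with array data
import matplotlib.pyplot as plt # basic plotting package
import seaborn as sns # higher-level visualization package built on matplotlib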

from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV

import warnings
plt.style.use('ggplot')
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
# 4-2
# The preprocessed data is stored as .xls files, so it is read with pd.read_excel('file name') rather than pd.read_csv

# Training data
train = pd.read_excel("train/train_new.xls")

# Test data
test = pd.read_excel("test/test_new.xls")

---

**4.2 Load the Machine Learning Models**

---

In [3]:
# 4-3
# Machine learning modules

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

In [4]:
# 4-4

train.head()

Unnamed: 0.1,Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,0,0,3,0,1,0,0,1,0,3
1,1,1,1,1,2,3,1,3,0,2
2,2,1,3,1,1,1,0,2,1,3
3,3,1,1,1,2,3,0,3,0,2
4,4,0,3,0,2,1,0,1,1,6


In [5]:
train = train.drop("Unnamed: 0",axis=1)

In [6]:
# 4-5

test.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,0,892,3,0,2,0,2,1,1,6
1,1,893,3,1,2,0,0,3,0,6
2,2,894,2,0,3,1,2,1,1,6
3,3,895,3,0,1,1,0,1,1,3
4,4,896,3,1,1,1,0,3,0,3


In [7]:
test = test.drop("Unnamed: 0",axis=1)

---

**4.3 Modeling and Prediction**

---
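With data preprocessing and feature engineering finished, the next step is modeling. The model has to be chosen according to the problem being solved. First, the goal is classification: predicting whether a passenger survived. Second, every training example has a label, so this is a supervised classification problem. The following commonly used models are tried here; the theory behind each model and how to tune its parameters are not covered in detail:
- Logistic Regression
- k-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
- Naive Bayes classifier
- Perceptron
- Decision Tree
- Random Forest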
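
Every model in the list above exposes the same scikit-learn `fit` / `predict` / `score` interface, so candidates can be compared in a single loop. The following is a minimal sketch under stated assumptions: the data is synthetic (`make_classification` stands in for the Titanic features), and the names `X_demo`, `y_demo`, `candidates`, and `train_acc` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 891 rows and 8 features, the same shape as X_train
X_demo, y_demo = make_classification(n_samples=891, n_features=8, random_state=0)

# Hypothetical subset of the candidate models, keyed by display name
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its training accuracy as a percentage
train_acc = {}
for name, model in candidates.items():
    model.fit(X_demo, y_demo)
    train_acc[name] = round(model.score(X_demo, y_demo) * 100, 2)

print(train_acc)
```

The per-model cells that follow do the same thing one estimator at a time, which makes it easier to inspect each model individually.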

---

**4.4 Prepare the Training and Test Data**

---

In [8]:
# 4-6
# Survival is what we want to predict, so the dependent variable Y is Survived and the independent variables X are the remaining columns

X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
X_test  = test.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

((891, 8), (891,), (418, 8))

In [9]:
X_train.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,3,0,1,0,0,1,0,3
1,1,1,2,3,1,3,0,2
2,3,1,1,1,0,2,1,3
3,1,1,2,3,0,3,0,2
4,3,0,2,1,0,1,1,6


### Logistic Regression

In [10]:
# 4-7
logreg = LogisticRegression()

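# Fit the model on the training data
logreg.fit(X_train, Y_train)

# Predict on the test data (it has no Survived labels, so accuracy cannot be computed on it)
Y_pred = logreg.predict(X_test)

# Training accuracy as a percentage, rounded to two decimal places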
acc_log = round(logreg.score(X_train, Y_train)*100 , 2) 
acc_log

80.25
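
The score above is computed on the same rows the model was fitted on, so it tends to be optimistic. Because the Kaggle test set carries no `Survived` column, one common way to get a more honest estimate is to hold out part of the training data. The sketch below illustrates the idea on synthetic data; `X_demo`, `y_demo`, and the 80/20 split are assumptions, and a decision tree is used only because it memorizes training data readily, which makes the gap easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), same shape as the Titanic training features
X_demo, y_demo = make_classification(n_samples=891, n_features=8, random_state=0)

# Hold out 20% of the labelled rows as a validation set, keeping class proportions
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
fit_acc = tree.score(X_fit, y_fit)   # accuracy on rows the model has seen
val_acc = tree.score(X_val, y_val)   # accuracy on rows the model has not seen

print(f"fit: {fit_acc:.3f}, validation: {val_acc:.3f}")
```

The validation figure is usually the better guide when ranking models against each other.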

### k-Nearest Neighbors

In [11]:
# 4-8
# n_neighbors can be tuned to find the best-performing value

knn = KNeighborsClassifier(n_neighbors = 3) 
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

84.06
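
One way to tune `n_neighbors` is to score several values with cross-validation and keep the best. This is a rough sketch on synthetic data, not the notebook's own procedure; the candidate values, the 5-fold setting, and the names `k_scores` and `best_k` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Mean 5-fold cross-validated accuracy for each candidate neighbor count
k_scores = {}
for k in [1, 3, 5, 7, 9]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    k_scores[k] = cross_val_score(knn_k, X_demo, y_demo, cv=5, scoring="accuracy").mean()

# Pick the neighbor count with the highest mean accuracy
best_k = max(k_scores, key=k_scores.get)
print(best_k, round(k_scores[best_k], 3))
```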

### Support Vector Machines

In [12]:
# 4-7

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train)*100,2)
acc_svc

83.84

### Gaussian Naive Bayes

In [13]:
# 4-8

gnb = GaussianNB()
gnb.fit(X_train, Y_train)
Y_pred = gnb.predict(X_test)
acc_gnb = round(gnb.score(X_train, Y_train)*100, 2)
acc_gnb

73.4

### Perceptron

In [14]:
# 4-9
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train)*100, 2)
acc_perceptron

75.2

### Linear SVC

In [15]:
# 4-10
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

81.03

### Stochastic Gradient Descent

In [16]:
# 4-11
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

70.71

### Decision Tree

In [17]:
# 4-12
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

86.76

### Random Forest

In [18]:
# 4-13
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

86.76

In [19]:
# 4-14
# List the models by training accuracy, from highest to lowest
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gnb, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score
3,Random Forest,86.76
8,Decision Tree,86.76
1,KNN,84.06
0,Support Vector Machines,83.84
7,Linear SVC,81.03
2,Logistic Regression,80.25
5,Perceptron,75.2
4,Naive Bayes,73.4
6,Stochastic Gradient Descent,70.71


In [29]:
# 4-15
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,0,3,0,1,0,0,1,0,3
1,1,1,1,2,3,1,3,0,2
2,1,3,1,1,1,0,2,1,3
3,1,1,1,2,3,0,3,0,2
4,0,3,0,2,1,0,1,1,6


In [30]:
# 4-16
test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,892,3,0,2,0,2,1,1,6
1,893,3,1,2,0,0,3,0,6
2,894,2,0,3,1,2,1,1,6
3,895,3,0,1,1,0,1,1,3
4,896,3,1,1,1,0,3,0,3


In [32]:
# 4-17
X_test  = test.drop("PassengerId", axis=1)
X_test.head()

Unnamed: 0,Pclass,Sex,Age,Fare,Embarked,Title,IsAlone,Age*Class
0,3,0,2,0,2,1,1,6
1,3,1,2,0,0,3,0,6
2,2,0,3,1,2,1,1,6
3,3,0,1,1,0,1,1,3
4,3,1,1,1,0,3,0,3


In [33]:
# 4-18
from sklearn.ensemble import RandomForestClassifier

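# Base random forest; oob_score=True also computes the out-of-bag accuracy estimate
rf = RandomForestClassifier(oob_score=True, random_state=1, n_jobs=-1)

# Hyperparameter combinations to try (2 * 3 * 6 * 5 = 180 candidates)
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 4, 10, 12, 16, 20],
    "n_estimators": [50, 100, 400, 700, 1000],
}
# 3-fold cross-validated grid search scored by accuracy
gs = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=3, n_jobs=-1)

# Column 0 of train is Survived (the label); the remaining columns are the features
gs = gs.fit(train.iloc[:, 1:], train.iloc[:, 0])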

print(gs.best_score_)
print(gs.best_params_)

0.819304152637486
{'criterion': 'gini', 'min_samples_leaf': 1, 'min_samples_split': 16, 'n_estimators': 50}
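
Besides `best_score_` and `best_params_`, a fitted `GridSearchCV` keeps the score of every candidate in `cv_results_`, which helps show how close the runners-up are. The sketch below uses a deliberately tiny grid on synthetic data so it runs quickly; `X_demo`, `y_demo`, `small_grid`, and `gs_demo` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A much smaller grid than the notebook's, chosen only to keep the example fast
small_grid = {"min_samples_split": [2, 10], "n_estimators": [10, 30]}
gs_demo = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid=small_grid,
    scoring="accuracy",
    cv=3,
)
gs_demo.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(gs_demo.cv_results_)[columns].sort_values("rank_test_score")
print(results)
```

When several candidates score within one standard deviation of the best, the cheaper configuration (fewer trees, for example) is often a reasonable choice.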


In [60]:
# 4-19
 
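# Refit a forest with fixed hyperparameters. Note: n_estimators and min_samples_split here differ from the best_params_ printed by the grid search above
rf = RandomForestClassifier(criterion='gini',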
                             n_estimators=1000,
                             min_samples_split=12,
                             min_samples_leaf=1,
                             oob_score=True,
                             random_state=1,
                             n_jobs=-1) 

rf.fit(train.iloc[:, 1:], train.iloc[:, 0])
print("%.4f" % rf.oob_score_)

0.8182
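
The out-of-bag (OOB) score printed above is computed for each training row using only the trees whose bootstrap sample did not include that row, so it behaves like a built-in validation estimate. A small sketch on synthetic data (an assumption, as are the names `X_demo`, `y_demo`, and `rf_demo`) contrasts it with plain training accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# oob_score=True asks the forest to compute the out-of-bag accuracy while fitting
rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
rf_demo.fit(X_demo, y_demo)

train_acc = rf_demo.score(X_demo, y_demo)  # accuracy on rows seen during fitting
print(f"train: {train_acc:.4f}, OOB: {rf_demo.oob_score_:.4f}")
```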


In [61]:
# 4-20

pd.concat((pd.DataFrame(train.iloc[:, 1:].columns, columns = ['variable']), 
           pd.DataFrame(rf.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)[:20]

Unnamed: 0,variable,importance
5,Title,0.289186
1,Sex,0.24589
0,Pclass,0.150629
3,Fare,0.107262
7,Age*Class,0.087535
4,Embarked,0.048257
2,Age,0.042325
6,IsAlone,0.028916
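
The same ranking can be produced a little more directly with a pandas `Series` indexed by column name. This is an alternative sketch, not the notebook's code: the data is synthetic, the column names are borrowed from the table above purely as labels, and `rf_demo` and `importances` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names reused from the notebook as labels; the values are synthetic (assumption)
feature_names = ["Pclass", "Sex", "Age", "Fare", "Embarked", "Title", "IsAlone", "Age*Class"]
X_arr, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_demo, y_demo)

# Importances indexed by feature name, largest first
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 across features, so each value can be read as a share of the forest's total split gain.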


In [64]:
# 4-21

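rf_res = rf.predict(X_test)
# `submit` is not defined earlier in this notebook; building it from the test PassengerId column is an assumption
submit = pd.DataFrame({'PassengerId': test['PassengerId']})
submit['Survived'] = rf_res
submit['Survived'] = submit['Survived'].astype(int)
submit.to_csv('submit.csv', index=False)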

----