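# Titanic - Data Preprocessing
In this lesson we learn how to build a binary classification model. Titanic is a well-known beginner-level machine learning competition on Kaggle; the goal is to use the passenger data to predict whether each passenger survived the shipwreck. Through this project you will learn to manipulate data with the Python packages pandas and numpy, visualize it with matplotlib and seaborn, and build models with scikit-learn.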

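### Environment notes
Before running this example, check that the Jupyter notebook settings are correct: from the main menu choose "Edit" > "Notebook settings", set "Runtime type" to "Python 3", change the "Hardware accelerator" drop-down from "None" to "GPU", and click "Save".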

### Course structure
In the Titanic project we build a machine learning model that predicts passenger survival. The work consists of four steps:

>1.   Data preprocessing

>2.   Exploratory data analysis (EDA)

>3.   Feature engineering

>4.   Model selection and evaluation (Model & Inference)

---

**1.1 Load the required packages**

---

In [4]:
# 1-1
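# Load the required packages first; import (package_name) as (xxx) gives a package a shorter alias that is easier to call later

import pandas as pd # main data types are Series and DataFrame; builds on numpy and adds higher-level operations
import numpy as np # package for working with array data
import matplotlib.pyplot as plt # basic plotting package
import seaborn as sns # higher-level visualization built on matplotlib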

from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV

import warnings
plt.style.use('ggplot')
warnings.filterwarnings('ignore')

%matplotlib inline

---

**1.2 Load the dataset**

---
Download the data from https://www.kaggle.com/c/titanic/data. There are three csv files: test, train, and gender_submission.

In [5]:
# 1-2
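# pandas can read a csv file with pd.read_csv('file name')

# Training data
train = pd.read_csv("train/train.csv")

# Test data
test = pd.read_csv("test/test.csv")

# Combine the training and test data so that later preprocessing is applied to both
# (DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True also resets the index)
data = pd.concat([train, test], ignore_index=True)

# Use type() to check the object's type and confirm that the loaded file is a DataFrame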
type(test)

pandas.core.frame.DataFrame
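
Combining the training and test rows, and splitting them apart again later, can be tried on toy data without the Kaggle files. The sketch below is only an illustration: the two small frames and their values are made up, and splitting on the missing `Survived` label is one assumed way to recover the two parts, not necessarily the approach used in the rest of this course. `pd.concat` is a standard pandas function.

```python
import pandas as pd

# Made-up stand-ins for train.csv and test.csv (values are illustrative only)
toy_train = pd.DataFrame({"PassengerId": [1, 2, 3],
                          "Survived": [0, 1, 1],
                          "Age": [22.0, 38.0, 26.0]})
toy_test = pd.DataFrame({"PassengerId": [4, 5],
                         "Age": [34.5, 47.0]})

# Stack the rows; ignore_index=True renumbers the index 0..n-1
toy_data = pd.concat([toy_train, toy_test], ignore_index=True)

# Test rows carry no label, so Survived is NaN there
is_train = toy_data["Survived"].notna()
train_part = toy_data[is_train]
test_part = toy_data[~is_train].drop(columns="Survived")

print(toy_data.shape)                    # (5, 3)
print(len(train_part), len(test_part))   # 3 2
```

Because the test rows have no label, `Survived` in the combined frame holds NaN for them and the column is stored as float rather than int.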

---

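**1.3 Inspect the data types**

---
After loading the data, we want a rough picture of what it looks like. The columns attribute lists the fields (features) in the passenger data; a description of each feature is available at the link above.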

In [6]:
# 1-3

train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [7]:
# 1-4

train.shape

(891, 12)

In [8]:
# 1-5

test.shape

(418, 11)
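
The test set has one column fewer than the training set. A quick way to see which one is to compare the two column indexes. The sketch below uses two made-up empty frames carrying only a few of the real column names, so the frames are an assumption; `Index.difference` is a standard pandas method.

```python
import pandas as pd

# Made-up frames carrying only a few of the real column names
toy_train = pd.DataFrame(columns=["PassengerId", "Survived", "Pclass", "Name"])
toy_test = pd.DataFrame(columns=["PassengerId", "Pclass", "Name"])

# Columns present in the training frame but absent from the test frame
missing_in_test = toy_train.columns.difference(toy_test.columns)
print(list(missing_in_test))  # ['Survived']
```

On the real data, the same comparison should point to Survived, the label to be predicted, which accounts for the 12 versus 11 columns shown above.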

---

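**1.4 Feature meanings**

---
- Pclass: ticket class (first class, second class, ...)
- Name: passenger name
- Sex: sex
- Age: age
- SibSp: number of siblings or spouses the passenger had aboard
- Parch: number of parents or children the passenger had aboard
- Ticket: ticket number
- Fare: fare paid
- Cabin: cabin number
- Embarked: port of embarkation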

In [9]:
# 1-6
# head(n) shows the first n rows; to see the last n rows, use .tail(n)

train.head(6)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
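
As a small illustration of head and tail, the sketch below uses a made-up ten-row frame rather than the Titanic data, so only the row positions are meaningful.

```python
import pandas as pd

# Made-up frame with ten rows
toy = pd.DataFrame({"PassengerId": range(1, 11)})

first_three = toy.head(3)   # rows at positions 0, 1, 2
last_two = toy.tail(2)      # rows at positions 8, 9

print(first_three["PassengerId"].tolist())  # [1, 2, 3]
print(last_two["PassengerId"].tolist())     # [9, 10]
```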


---

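**1.5 Inspect the data distribution**

---
- Survived is binary (0 = died, 1 = survived), so its mean gives the overall survival rate, about 0.38
- The Age distribution shows few elderly passengers (three quarters of the known ages are 38 or below, and the oldest passenger is 80); about 30% had siblings or a spouse aboard; more than 70% travelled without parents or children (the 75th percentile of Parch is 0)
- A very small number of passengers paid very high fares (the 75th percentile of Fare is 31.0, while the maximum is 512.33)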

In [10]:
# 1-7

train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
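
The observations in this section are read off the summary table, but the same shares can be computed directly. The sketch below uses a small made-up frame (not the real Titanic data), so the printed numbers only illustrate the technique: the mean of a 0/1 column is the share of ones, and the mean of a boolean mask is the share of rows that satisfy the condition.

```python
import pandas as pd

# Made-up rows; the values are illustrative, not taken from train.csv
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "SibSp":    [1, 1, 0, 0, 0],
    "Parch":    [0, 0, 0, 2, 0],
})

survival_rate = toy["Survived"].mean()   # share of rows with Survived == 1
with_sibsp = (toy["SibSp"] > 0).mean()   # share travelling with a sibling or spouse
no_parch = (toy["Parch"] == 0).mean()    # share with no parent or child aboard

print(survival_rate, with_sibsp, no_parch)  # 0.4 0.4 0.8
```

Replacing `toy` with the real `train` frame would give the exact figures behind the rounded percentages quoted in the bullet points.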


In [11]:
# 1-8

test.describe()

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 418.0 | 418.0 | 332.0 | 418.0 | 418.0 | 417.0 |
| mean | 1100.5 | 2.26555 | 30.27259 | 0.447368 | 0.392344 | 35.627188 |
| std | 120.810458 | 0.841838 | 14.181209 | 0.89676 | 0.981429 | 55.907576 |
| min | 892.0 | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 996.25 | 1.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 1100.5 | 3.0 | 27.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1204.75 | 3.0 | 39.0 | 1.0 | 0.0 | 31.5 |
| max | 1309.0 | 3.0 | 76.0 | 8.0 | 9.0 | 512.3292 |


---

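**1.6 Inspect the distribution of object-type columns**

---
- unique is the number of distinct values; top is the most frequent value; freq is how many times top occurs
- Every passenger name is different; the male-to-female ratio is roughly 2:1 (577 of 891 passengers are male)
- Ticket has 681 distinct values in 891 rows, so about 24% of the rows repeat a ticket number that already appeared
- Cabin has many repeated values (147 distinct values among 204 non-missing entries), which indicates that several passengers shared a cabin
- Embarked is the port of embarkation; most passengers boarded at port S (644 of 889)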

In [12]:
# 1-9
# Passing include=['O'] shows only the features whose dtype is object

train.describe(include=['O']) 

| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Skoog, Mrs. William (Anna Bernhardina Karlsson) | male | 1601 | C23 C25 C27 | S |
| freq | 1 | 577 | 7 | 4 | 644 |
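
The unique, top, and freq rows can be reproduced by hand, which also makes the "share of repeated values" reasoning explicit. The sketch below uses a made-up series of ticket labels, so the numbers are illustrative only; `nunique`, `value_counts`, and `duplicated` are standard pandas methods.

```python
import pandas as pd

# Made-up ticket labels; two of them appear more than once
tickets = pd.Series(["A1", "A1", "B2", "C3", "C3", "C3", "D4"], name="Ticket")

n_unique = tickets.nunique()             # what describe() reports as "unique"
counts = tickets.value_counts()
top = counts.idxmax()                    # most frequent value ("top")
freq = counts.max()                      # its number of occurrences ("freq")

# Share of rows whose ticket already appeared on an earlier row,
# which equals 1 - n_unique / len(tickets)
repeat_share = tickets.duplicated().mean()

print(n_unique, top, freq)       # 4 C3 3
print(round(repeat_share, 3))    # 0.429
```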


In [13]:
# 1-10
# Inspect basic information about the data: row counts, missing values, and dtypes

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [14]:
# 1-11

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [15]:
# 1-12
# Which columns have missing values?

train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [16]:
# 1-13
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
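
Raw counts are easier to compare once they are expressed as a share of the rows. The sketch below shows one way to do that on a made-up frame (the values are not from the Titanic files): `isnull().mean()` gives the fraction of missing entries per column, and the filter keeps only the columns that actually have gaps.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps (np.nan / None mark missing entries)
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.28, 7.93, 53.10],
})

missing_count = toy.isnull().sum()    # number of missing entries per column
missing_share = toy.isnull().mean()   # fraction of missing entries per column

# Keep only the columns that have gaps, largest share first
report = missing_share[missing_share > 0].sort_values(ascending=False)
for column, share in report.items():
    print(column, share)
# Cabin 0.75
# Age 0.5
```

Applied to the training counts shown above, this corresponds to roughly 19.9% missing for Age (177 of 891), 77.1% for Cabin (687 of 891), and 0.2% for Embarked (2 of 891).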

---------------------