# Assignment: (Kaggle) Titanic Survival Prediction

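# [Assignment Objectives]
- Try completing three data operations on each of three feature types, and observe the results
- Think about which of the three feature types should be the most complex / hardest to handle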

# [Assignment Focus]
- Complete the remaining eight type x operation combinations (In[6]-In[13], Out[6]-Out[13])
- Think about which feature type should be the most complex

In [32]:
# Load the basic packages
import pandas as pd
import numpy as np

# Read the training and test data
data_path = 'data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')
df_train.shape

(891, 12)

In [33]:
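# Reshape the data into the format used for training / prediction
train_Y = df_train['Survived']
ids = df_test['PassengerId']
# drop(..., axis=1) removes the named columns and returns a new DataFrame
df_train = df_train.drop(['PassengerId', 'Survived'], axis=1)
df_test = df_test.drop(['PassengerId'], axis=1)
# After the two drops, train and test share the same columns, so they can be concatenated
df = pd.concat([df_train, df_test])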
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
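
To make the drop and concat steps above concrete, here is a minimal sketch on made-up toy frames (the names toy_train / toy_test and their values are illustrative, not the Titanic data). It suggests that drop with axis=1 returns a new frame without the named columns, and that pd.concat keeps each frame's original row labels unless ignore_index=True is passed.

```python
import pandas as pd

# Toy stand-ins for the train / test frames (hypothetical values)
toy_train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1]})
toy_test = pd.DataFrame({'PassengerId': [3], 'Pclass': [2]})

# axis=1 means "drop columns"; the original frames are left unchanged
train_x = toy_train.drop(['PassengerId', 'Survived'], axis=1)
test_x = toy_test.drop(['PassengerId'], axis=1)

# Both frames now share the same columns, so they stack cleanly
stacked = pd.concat([train_x, test_x])
print(list(stacked.columns))
print(list(stacked.index))       # row labels restart for the test rows

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([train_x, test_x], ignore_index=True)
print(list(renumbered.index))
```

The repeated row labels are harmless for the column-wise statistics used in this assignment, but they matter as soon as rows are selected by label.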


In [34]:
# Show the dtypes of the columns and how many columns have each dtype
dtype_df = df.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df = dtype_df.groupby("Column Type").aggregate('count').reset_index()
dtype_df

Unnamed: 0,Column Type,Count
0,int64,3
1,float64,2
2,object,5


In [35]:
# Having confirmed that only three dtypes (int64, float64, object) are present, store the column names in three separate lists
int_features = []
float_features = []
object_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64':
        float_features.append(feature)
    elif dtype == 'int64':
        int_features.append(feature)
    else:
        object_features.append(feature)
print(f'{len(int_features)} Integer Features : {int_features}\n')
print(f'{len(float_features)} Float Features : {float_features}\n')
print(f'{len(object_features)} Object Features : {object_features}')

3 Integer Features : ['Pclass', 'SibSp', 'Parch']

2 Float Features : ['Age', 'Fare']

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
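
As an alternative to the explicit loop, pandas can split columns by dtype with select_dtypes. The sketch below uses a small hypothetical frame (toy) rather than the Titanic data, and is only one possible way to get the same three lists.

```python
import pandas as pd

# Toy frame with one column of each kind (hypothetical values)
toy = pd.DataFrame({
    'Pclass': pd.Series([3, 1], dtype='int64'),
    'Age': pd.Series([22.0, 38.0], dtype='float64'),
    'Sex': ['male', 'female'],
})

int_cols = toy.select_dtypes(include='int64').columns.tolist()
float_cols = toy.select_dtypes(include='float64').columns.tolist()
# Everything that is neither int64 nor float64, mirroring the else branch of the loop
other_cols = toy.select_dtypes(exclude=['int64', 'float64']).columns.tolist()

print(int_cols, float_cols, other_cols)
```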


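# Assignment 1
* Run the assignment code and observe what problems arise across the nine operations obtained by applying (mean / max / nunique) to columns of the three types (int / float / object).  
Then try to explain why the code blocks that fail raise an error.  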

In [6]:
# Example: mean of the integer (int) features
df[int_features].mean()

Pclass    2.294882
SibSp     0.498854
Parch     0.385027
dtype: float64

In [7]:
# List, in order, the remaining operations for the three feature types (int / float / object) x three methods (mean / max / nunique)

df[int_features].nunique()

Pclass    3
SibSp     7
Parch     8
dtype: int64

In [8]:
df[object_features].nunique()

Name        1307
Sex            2
Ticket       929
Cabin        186
Embarked       3
dtype: int64

In [28]:
df[object_features].mean()

Series([], dtype: float64)

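#### Problem found:
* (In[28] & Out[28]): taking the mean of the object columns returns an empty Series
* Likely cause: the object columns hold strings (plus NaN for missing values), and an arithmetic mean is not defined for strings. Older pandas versions silently drop such non-numeric columns from the result, which leaves an empty Series; pandas 2.0 and later raise a TypeError instead.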
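
The behaviour can be reproduced on a tiny hypothetical frame (toy, with made-up rows). This is a sketch rather than a statement about any specific pandas release: depending on the installed version, the plain mean() call either returns an empty Series or raises TypeError, so both outcomes are handled. Passing numeric_only=True states the intent explicitly.

```python
import pandas as pd

# Two string-valued columns (hypothetical sample rows)
toy = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'C']})

try:
    result = toy.mean()
    print('mean() returned:', result.to_dict())
except TypeError as err:
    print('mean() raised TypeError:', err)

# Restricting the reduction to numeric columns leaves nothing to average here
numeric_mean = toy.mean(numeric_only=True)
print(len(numeric_mean))
```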

In [11]:
df[float_features].nunique()

Age      98
Fare    281
dtype: int64

In [12]:
df[float_features].max()

Age      80.0000
Fare    512.3292
dtype: float64

In [13]:
df[float_features].mean()

Age     29.881138
Fare    33.295479
dtype: float64

In [14]:
df[int_features].max()

Pclass    3
SibSp     8
Parch     9
dtype: int64

In [9]:
df[object_features].max()

Name      van Melkebeke, Mr. Philemon
Sex                              male
Ticket                      WE/P 5735
dtype: object

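#### Problems found:
1. The max output is missing the Cabin and Embarked columns

2. It is not obvious how the maximum of an object column is determined (In[9] & Out[9])

First, look at how Cabin and Embarked behave on their own: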

In [44]:
print(df[object_features])
df['Embarked'].max()

                                                  Name     Sex  \
0                              Braund, Mr. Owen Harris    male   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female   
2                               Heikkinen, Miss. Laina  female   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   
4                             Allen, Mr. William Henry    male   
..                                                 ...     ...   
413                                 Spector, Mr. Woolf    male   
414                       Oliva y Ocana, Dona. Fermina  female   
415                       Saether, Mr. Simon Sivertsen    male   
416                                Ware, Mr. Frederick    male   
417                           Peter, Master. Michael J    male   

                 Ticket Cabin Embarked  
0             A/5 21171   NaN        S  
1              PC 17599   C85        C  
2      STON/O2. 3101282   NaN        S  
3                113803  C123        S  
4 

TypeError: '>=' not supported between instances of 'str' and 'float'

In [45]:
df['Cabin'].max()

TypeError: '>=' not supported between instances of 'float' and 'str'

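The error messages show that the Cabin and Embarked columns hold values of two Python types, str and float. The floats are the missing values: pandas stores a missing entry as NaN, which is a float, and a str cannot be ordered against a float. This is presumably also why these two columns are absent from the output of In[9].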
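
The mixed-type explanation can be checked with a short sketch. The Series below is a hypothetical sample, not the real Embarked column; it shows that NaN is a float, that comparing a str with a float raises the same kind of TypeError, and that dropping the missing entries first makes max() well defined.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value
embarked = pd.Series(['S', 'C', np.nan, 'Q'])

print(type(np.nan))              # NaN is a Python float
print(embarked.isna().sum())     # number of missing entries

# Ordering a str against a float is what triggers the error
try:
    'S' >= float('nan')
except TypeError as err:
    print('TypeError:', err)

# With the missing entries removed, only strings remain and they can be ordered
print(embarked.dropna().max())
```

Whether to drop, fill, or flag the missing values is a modelling decision; dropna() is used here only to make the comparison possible.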

Print the minimum and the sorted columns to investigate:

In [15]:
df[object_features].min()

Name      Abbing, Mr. Anthony
Sex                    female
Ticket                 110152
dtype: object

In [26]:
ob = df[object_features]

ob = ob.sort_values('Ticket',ascending = False)
print(ob.head())

ob = ob.sort_values('Name',ascending = False)
print(ob.head())

ob = ob.sort_values('Sex',ascending = False)
print(ob.head())

ob = ob.sort_values('Cabin',ascending = False)
print(ob.head())

ob = ob.sort_values('Embarked',ascending = False)
ob.head()

                                                  Name     Sex       Ticket  \
745                       Crosby, Capt. Edward Gifford    male    WE/P 5735   
540                            Crosby, Miss. Harriet R  female    WE/P 5735   
219                                 Harris, Mr. Walter    male    W/C 14208   
14   Chaffee, Mrs. Herbert Fuller (Carrie Constance...  female  W.E.P. 5734   
92                         Chaffee, Mr. Herbert Fuller    male  W.E.P. 5734   

    Cabin Embarked  
745   B22        S  
540   B22        S  
219   NaN        S  
14    E31        S  
92    E31        S  
                                              Name     Sex         Ticket  \
868                    van Melkebeke, Mr. Philemon    male         345777   
153                van Billiard, Mr. Austin Blyler    male       A/5. 851   
192              van Billiard, Master. Walter John    male       A/5. 851   
344            van Billiard, Master. James William    male       A/5. 851   
15   del Carlo

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
339,"Blackwell, Mr. Stephen Weart",male,113784,T,S
873,"Vander Cruyssen, Mr. Victor",male,345765,,S
189,"Veal, Mr. James",male,28221,,S
333,"Vander Planke, Mr. Leo Edmondus",male,345764,,S
302,"Phillips, Mr. Escott Robert",male,S.O./P.P. 2,,S


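2. Tentative answer:
    * Name -- strings are compared character by character from left to right by Unicode code point, so a later character gives a larger value. Lowercase letters rank above all uppercase letters, which is why "van Melkebeke" (lowercase v) is the maximum (string)
    * Sex -- "male" > "female", because m comes after f (string)
    * Ticket -- tickets that start with a letter > purely numeric tickets, because digits rank below letters; otherwise compared the same way as Name (string)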
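
The ordering rules above can be checked directly in plain Python, since object columns of strings are compared the same way. In this sketch the first and last names come from the outputs above, while "Zabour, Miss. Hileni" is added purely for illustration.

```python
names = ['Abbing, Mr. Anthony', 'Zabour, Miss. Hileni', 'van Melkebeke, Mr. Philemon']

# Code points: uppercase letters come before lowercase letters
print(ord('Z'), ord('v'))

# Lowercase 'v' outranks uppercase 'Z', so the "van ..." name is the maximum
print(max(names))

# 'm' comes after 'f', and digits come before letters
print('male' > 'female')
print('110152' < 'A/5 21171')

# A case-insensitive comparison gives the more intuitive alphabetical maximum
print(max(names, key=str.lower))
```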

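# Assignment 2
* Think about it: try to name one or more data types beyond the five types covered today. Can the new type you named be placed in one of the three broad categories?  
So, of the three broad feature categories, which should be the most complex to handle?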

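Answer: 
* Time-based features are probably the most complex to handle: the values are ordered in time, so they need numeric comparison, but they are also periodic and have to be grouped at different scales (for example hour, weekday, or month).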
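
One common way to respect the periodicity mentioned in the answer is to map a cyclic quantity such as the hour of day onto a circle with sine and cosine. The sketch below is an illustration under assumptions: the timestamps are made up, and the column names hour_sin / hour_cos and the helper gap are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps: midnight, noon, and 23:00
times = pd.to_datetime(pd.Series([
    '2021-01-01 00:00', '2021-01-01 12:00', '2021-01-01 23:00',
]))
hour = times.dt.hour

# Map the 24-hour cycle onto the unit circle
angle = 2 * np.pi * hour / 24
encoded = pd.DataFrame({
    'hour': hour,
    'hour_sin': np.sin(angle),
    'hour_cos': np.cos(angle),
})

def gap(i, j):
    """Euclidean distance between rows i and j in (sin, cos) space."""
    a = encoded.loc[i, ['hour_sin', 'hour_cos']].to_numpy(dtype=float)
    b = encoded.loc[j, ['hour_sin', 'hour_cos']].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

print(gap(0, 2))   # midnight vs 23:00
print(gap(0, 1))   # midnight vs noon
```

As raw integers, hour 23 looks far from hour 0; in the encoded space the two are neighbours, while noon sits on the opposite side of the circle.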