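# Assignment: (Kaggle) Titanic Survival Prediction

# [Assignment Goals]
- Try all three data operations on each of the three feature types and observe the results
- Consider which of the three feature types is likely the most complex / hardest to handle

# [Key Points]
- Complete the remaining eight type x operation combinations (In[6]~In[13], Out[6]~Out[13])
- Consider which feature type should be the most complex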

In [1]:
# Load the basic packages
import pandas as pd
import numpy as np

# Read the training and test data
data_path = 'data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')
df_train.shape

(891, 12)

In [2]:
# Reshape the data into the format used for training / prediction
train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
# Show the dtypes of the columns and how many columns have each dtype
dtype_df = df.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df = dtype_df.groupby("Column Type").aggregate('count').reset_index()
dtype_df

| | Column Type | Count |
|---|---|---|
| 0 | int64 | 3 |
| 1 | float64 | 2 |
| 2 | object | 5 |
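
The same per-dtype tally can probably be obtained more directly with `value_counts()` on the dtypes Series. The sketch below runs on a small made-up frame (`toy` and its values are assumptions for illustration, not the Titanic data):

```python
import pandas as pd

# Hypothetical stand-in frame; column names mimic the Titanic data, values are made up
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# toy.dtypes is a Series (column name -> dtype); value_counts() tallies columns per dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```

This avoids the `reset_index` / `groupby` round trip, at the cost of returning a Series rather than a two-column DataFrame.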


In [4]:
# Having confirmed there are only three dtypes (int64, float64, object), store the column names in three separate lists
int_features = []
float_features = []
object_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64':
        float_features.append(feature)
    elif dtype == 'int64':
        int_features.append(feature)
    else:
        object_features.append(feature)
print(f'{len(int_features)} Integer Features : {int_features}\n')
print(f'{len(float_features)} Float Features : {float_features}\n')
print(f'{len(object_features)} Object Features : {object_features}')

3 Integer Features : ['Pclass', 'SibSp', 'Parch']

2 Float Features : ['Age', 'Fare']

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
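
As an alternative to the explicit loop, pandas provides `DataFrame.select_dtypes`. A minimal sketch on a made-up frame (`toy` and its values are assumptions, not the Titanic data):

```python
import pandas as pd

# Hypothetical stand-in frame; values are made up for illustration
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "SibSp": [1, 1, 0],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# Select column names by dtype instead of looping over df.dtypes
int_cols = toy.select_dtypes(include="int64").columns.tolist()
float_cols = toy.select_dtypes(include="float64").columns.tolist()
# Everything that is not numeric ends up in the "other" bucket
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(int_cols, float_cols, other_cols)
```

The loop in the cell above has the same effect; `select_dtypes` mainly saves a few lines when more dtypes need to be separated.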


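# Assignment 1
* Run the assignment code and observe the nine operations obtained by applying mean / max / nunique to each of the three column types (int / float / object).  
What problems arise? Try to explain why the problematic blocks fail.  
Ans.: The code is in In [5] below. `mean` cannot be computed for object columns because they are not numeric; in the pandas version used here the call returns an empty Series rather than raising, while newer pandas releases are expected to raise a TypeError instead. `max` on object columns compares strings lexicographically by character code, which is not the same notion as a numeric maximum. `Cabin` and `Embarked` are absent from that output, most likely because they contain missing values (NaN) that cannot be ordered against strings.  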
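
To illustrate the point about `max` on object columns, here is a minimal sketch using made-up values (the `embarked` Series is a hypothetical stand-in for the real column). It shows that Python orders strings by character code, and that a string cannot be ordered against a float NaN, which is a plausible reason some object columns drop out of the `max` output.

```python
import pandas as pd

# Hypothetical values, loosely modelled on the Embarked column (one entry missing)
embarked = pd.Series(["S", "C", None, "Q"], dtype=object)

# Strings are compared by Unicode code point, so lowercase "a" sorts after uppercase "Z"
largest_letter = max(["Z", "a"])
print(largest_letter)

# A str and a float NaN have no defined ordering, so the comparison raises TypeError
try:
    max(["S", float("nan")])
    mixed_ok = True
except TypeError as err:
    mixed_ok = False
    print(f"TypeError: {err}")

# Dropping the missing entries first makes the string maximum well defined
largest_port = embarked.dropna().max()
print(largest_port)
```

Calling `dropna()` (or filling missing values) before `max` is one way to get a result for every object column, assuming a lexicographic maximum is actually meaningful for that feature.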



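# Assignment 2
* Think of one or more data types beyond the five types covered today. Can the new type you came up with be placed in one of the three major categories?  
So which of the three major feature categories should be the most complex to handle?  
Ans.:  
1) Coordinate data is a tuple (a sequence / container type). It is not one of the data types covered in the course material, and it needs appropriate preprocessing before any further analysis.  

2) Of the three major categories, categorical features do not support the full set of statistical operations that integer and float features do, so they are the most complex to handle.    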
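
As a rough illustration of the preprocessing mentioned in 1), the sketch below splits a tuple-valued coordinate column into two numeric columns. The frame `places`, the column names, and the values are all hypothetical. Note that pandas would store such tuples under the `object` dtype, so by dtype alone the column falls into the object bucket.

```python
import pandas as pd

# Hypothetical coordinate feature stored as (latitude, longitude) tuples
places = pd.DataFrame({"coord": [(25.03, 121.56), (51.51, -0.13), (40.71, -74.01)]})

# Tuples are generic Python objects, so the column dtype is object
print(places["coord"].dtype)

# Expand each tuple into two float columns that support mean / max / etc.
latlon = pd.DataFrame(places["coord"].tolist(), columns=["lat", "lon"], index=places.index)
places = places.join(latlon)

print(places[["lat", "lon"]].max())
```

Once split, `lat` and `lon` behave like any other float feature, which is why the tuple form is the awkward one to work with.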
  
  
  

In [5]:
# Example: mean of the integer (int) features

# operations of integer type
print(f"mean of int_features:\n{df[int_features].mean(axis = 0)}\n")
print(f"max of int_features:\n{df[int_features].max(axis = 0)}\n")
print(f"nunique of int_features:\n{df[int_features].nunique(axis = 0)}\n")

# operations of float type
print(f"mean of float_features:\n{df[float_features].mean(axis = 0)}\n")
print(f"max of float_features:\n{df[float_features].max(axis = 0)}\n")
print(f"nunique of float_features:\n{df[float_features].nunique(axis = 0)}\n")

# operations of object type
print(f"mean of object_features:\n{df[object_features].mean(axis = 0)}\n")       # mean is undefined for object (non-numeric) columns
print(f"max of object_features:\n{df[object_features].max(axis = 0)}\n")         # max compares strings lexicographically, not numerically
print(f"nunique of object_features:\n{df[object_features].nunique(axis = 0)}\n")


mean of int_features:
Pclass    2.294882
SibSp     0.498854
Parch     0.385027
dtype: float64

max of int_features:
Pclass    3
SibSp     8
Parch     9
dtype: int64

nunique of int_features:
Pclass    3
SibSp     7
Parch     8
dtype: int64

mean of float_features:
Age     29.881138
Fare    33.295479
dtype: float64

max of float_features:
Age      80.0000
Fare    512.3292
dtype: float64

nunique of float_features:
Age      98
Fare    281
dtype: int64

mean of object_features:
Series([], dtype: float64)

max of object_features:
Name      van Melkebeke, Mr. Philemon
Sex                              male
Ticket                      WE/P 5735
dtype: object

nunique of object_features:
Name        1307
Sex            2
Ticket       929
Cabin        186
Embarked       3
dtype: int64



In [6]:
# List, in order, the remaining operations for the three feature types (int / float / object) x three methods (mean / max / nunique)
"""
Your Code Here
"""

# operations of integer type
print(f"median of int_features:\n{df[int_features].median(axis = 0)}\n")
print(f"min of int_features:\n{df[int_features].min(axis = 0)}\n")
print(f"std of int_features:\n{df[int_features].std(axis = 0)}\n")
print(f"var of int_features:\n{df[int_features].var(axis = 0)}\n")
print(f"quantile of int_features:\n{df[int_features].quantile(axis = 0, q = 0.25)}\n")
print("==========================")

# operations of float type
print(f"median of float_features:\n{df[float_features].median(axis = 0)}\n")
print(f"min of float_features:\n{df[float_features].min(axis = 0)}\n")
print(f"std of float_features:\n{df[float_features].std(axis = 0)}\n")
print(f"var of float_features:\n{df[float_features].var(axis = 0)}\n")
print(f"quantile of float_features:\n{df[float_features].quantile(axis = 0, q = 0.25)}\n")
print("==========================")

# operations of object type
print(f"median of object_features:\n{df[object_features].median(axis = 0)}\n")   # median is undefined for object columns
print(f"min of object_features:\n{df[object_features].min(axis = 0)}\n")         # min compares strings lexicographically, not numerically
print(f"std of object_features:\n{df[object_features].std(axis = 0)}\n")         # std is undefined for object columns
print(f"var of object_features:\n{df[object_features].var(axis = 0)}\n")         # var is undefined for object columns
#print(f"quantile of object_features:\n{df[object_features].quantile(axis = 0, q = 0.25)}\n")  # quantile is not supported for object columns
print("==========================")




median of int_features:
Pclass    3.0
SibSp     0.0
Parch     0.0
dtype: float64

min of int_features:
Pclass    1
SibSp     0
Parch     0
dtype: int64

std of int_features:
Pclass    0.837836
SibSp     1.041658
Parch     0.865560
dtype: float64

var of int_features:
Pclass    0.701969
SibSp     1.085052
Parch     0.749195
dtype: float64

quantile of int_features:
Pclass    2.0
SibSp     0.0
Parch     0.0
Name: 0.25, dtype: float64

median of float_features:
Age     28.0000
Fare    14.4542
dtype: float64

min of float_features:
Age     0.17
Fare    0.00
dtype: float64

std of float_features:
Age     14.413493
Fare    51.758668
dtype: float64

var of float_features:
Age      207.748787
Fare    2678.959738
dtype: float64

quantile of float_features:
Age     21.0000
Fare     7.8958
Name: 0.25, dtype: float64

median of object_features:
Series([], dtype: float64)

min of object_features:
Name      Abbing, Mr. Anthony
Sex                    female
Ticket                 110152
dtype: object