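# Assignment: (Kaggle) Titanic Survival Prediction

# [Assignment Goals]
- Try carrying out three data operations on each of three different feature types, and observe the results
- Think about which of these three feature types should be the most complex / hardest to handle

# [Key Points]
- Complete the remaining eight type x operation combinations (In[6]~In[13], Out[6]~Out[13])
- Think about which feature type should be the most complex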

In [1]:
# Load the basic packages
import pandas as pd
import numpy as np

# Read the training and test data
data_path = 'D:\\dataimport\\'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_test.csv')
df_train.shape

(891, 12)

In [2]:
# Reshape the data into the format used for training / prediction
train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# Show the column dtypes and how many columns have each dtype
dtype_df = df.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df = dtype_df.groupby("Column Type").aggregate('count').reset_index()
dtype_df

Unnamed: 0,Column Type,Count
0,int64,3
1,float64,2
2,object,5


In [4]:
# Having confirmed there are only three dtypes (int64, float64, object), store the column names in three separate lists
int_features = []
float_features = []
object_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64':
        float_features.append(feature)
    elif dtype == 'int64':
        int_features.append(feature)
    else:
        object_features.append(feature)
print(f'{len(int_features)} Integer Features : {int_features}\n')
print(f'{len(float_features)} Float Features : {float_features}\n')
print(f'{len(object_features)} Object Features : {object_features}')

3 Integer Features : ['Pclass', 'SibSp', 'Parch']

2 Float Features : ['Age', 'Fare']

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
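
As an alternative to the loop above, `DataFrame.select_dtypes` can pull out column names by type. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not read from the Titanic files).

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'SibSp': [1, 1, 0],
    'Age': [22.0, 38.0, 26.0],
    'Sex': ['male', 'female', 'female'],
})

# Select column names by dtype instead of looping over df.dtypes
int_cols = toy.select_dtypes(include='int64').columns.tolist()
float_cols = toy.select_dtypes(include='float64').columns.tolist()
# Everything that is not numeric falls into the remaining group
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(int_cols, float_cols, other_cols)
```

Like the `else` branch of the loop, the last group collects every non-numeric column, whatever its exact dtype.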


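# Assignment 1
* Try running the assignment code and observe what problems arise across the nine operations when columns of the three types (int / float / object) are each put through (mean / max / nunique).  
Also try to explain why the code blocks that raise an error do so.  

# Assignment 2
* Think it over and try to name one or more data types beyond the five covered today. Can the new types you named be placed in one of the three major categories?  
So, among the three major feature categories, which one should be the most complex to handle?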

In [5]:
# Example: mean of the integer (int) features
df[int_features].mean()

Pclass    2.294882
SibSp     0.498854
Parch     0.385027
dtype: float64

In [5]:
df[int_features].max()

Pclass    3
SibSp     8
Parch     9
dtype: int64

In [8]:
np.unique(df[int_features])

array([0, 1, 2, 3, 4, 5, 6, 8, 9], dtype=int64)
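
Note that `np.unique` applied to a DataFrame flattens all columns into one array, so the result above is the set of distinct values across the three columns taken together. If a per-column count of distinct values is wanted, as the assignment's "nunique" suggests, `DataFrame.nunique` is one option. A sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic values
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2], 'SibSp': [1, 1, 0, 0]})

# np.unique flattens the whole frame into one sorted array of distinct values
all_values = np.unique(toy)

# nunique counts distinct values separately for each column
per_column = toy.nunique()

print(all_values)
print(per_column)
```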

In [11]:
df[float_features].mean()

Age     29.881138
Fare    33.295479
dtype: float64

In [12]:
df[float_features].max()

Age      80.0000
Fare    512.3292
dtype: float64

In [14]:
np.unique(df[float_features])

array([0.000000e+00, 1.700000e-01, 3.300000e-01, 4.200000e-01,
       6.700000e-01, 7.500000e-01, 8.300000e-01, 9.200000e-01,
       1.000000e+00, 2.000000e+00, 3.000000e+00, 3.170800e+00,
       4.000000e+00, 4.012500e+00, 5.000000e+00, 6.000000e+00,
       6.237500e+00, 6.437500e+00, 6.450000e+00, 6.495800e+00,
       6.750000e+00, 6.858300e+00, 6.950000e+00, 6.975000e+00,
       7.000000e+00, 7.045800e+00, 7.050000e+00, 7.054200e+00,
       7.125000e+00, 7.141700e+00, 7.225000e+00, 7.229200e+00,
       7.250000e+00, 7.283300e+00, 7.312500e+00, 7.495800e+00,
       7.520800e+00, 7.550000e+00, 7.575000e+00, 7.579200e+00,
       7.629200e+00, 7.650000e+00, 7.720800e+00, 7.725000e+00,
       7.729200e+00, 7.733300e+00, 7.737500e+00, 7.741700e+00,
       7.750000e+00, 7.775000e+00, 7.779200e+00, 7.787500e+00,
       7.795800e+00, 7.800000e+00, 7.820800e+00, 7.829200e+00,
       7.850000e+00, 7.854200e+00, 7.875000e+00, 7.879200e+00,
       7.887500e+00, 7.895800e+00, 7.925000e+00, 8.0000

In [15]:
df[object_features].mean()

Series([], dtype: float64)
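
The empty `Series` above is what the pandas version used in this notebook returns: the string columns are dropped and nothing is left to average. More recent pandas releases are expected to raise a `TypeError` for the same call (an assumption about the reader's environment, not something shown in the notebook). A sketch that behaves the same either way is to select the numeric columns explicitly before averaging; `toy` is hypothetical data.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one string column
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Sex': ['male', 'female', 'female'],
})

# Keep only numeric columns, so the string column never reaches mean()
numeric_mean = toy.select_dtypes(include='number').mean()

# With only string columns, nothing numeric is left to average
empty_mean = toy[['Sex']].select_dtypes(include='number').mean()

print(numeric_mean)
print(len(empty_mean))
```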

In [16]:
df[object_features].max()

Name      van Melkebeke, Mr. Philemon
Sex                              male
Ticket                      WE/P 5735
dtype: object
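
Only three of the five object columns appear in the output above. A likely explanation (an assumption, since the notebook does not state it) is that `Cabin` and `Embarked` contain missing values, which pandas stores as float `NaN`, so their entries cannot all be compared as strings. The sketch below inspects a hypothetical column shaped like `Cabin`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column shaped like Cabin: strings plus a missing value
cabin = pd.Series(['C85', np.nan, 'C123'], dtype=object)

# Missing entries are stored as float NaN, so the column mixes str and float
type_names = cabin.map(lambda v: type(v).__name__).tolist()

# Dropping the missing entries first leaves only strings to compare
cabin_max = cabin.dropna().max()

print(type_names)
print(cabin_max)
```

String comparison is lexicographic, so the maximum here is decided character by character rather than by any numeric meaning of the cabin number.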

In [18]:
df[object_features]

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,"Allen, Mr. William Henry",male,373450,,S
5,"Moran, Mr. James",male,330877,,Q
6,"McCarthy, Mr. Timothy J",male,17463,E46,S
7,"Palsson, Master. Gosta Leonard",male,349909,,S
8,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,347742,,S
9,"Nasser, Mrs. Nicholas (Adele Achem)",female,237736,,C


In [22]:
np.unique(df[object_features])  # TypeError occurs due to inconsistent types (str mixed with float NaN)

TypeError: '<' not supported between instances of 'float' and 'str'
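
The error above can be reproduced on a small scale, and `DataFrame.nunique` offers one way around it. This is a sketch on a hypothetical toy frame, not the notebook's own solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Cabin has a missing value, like the real column
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Cabin': [np.nan, 'C85', 'C123'],
}, dtype=object)

# np.unique sorts its input; comparing a str with a float NaN is what
# raises the TypeError, so the call is wrapped to keep the script running
try:
    np.unique(toy['Cabin'])
    sort_failed = False
except TypeError as err:
    sort_failed = True
    print('TypeError:', err)

# nunique skips NaN by default and counts distinct values per column
distinct_counts = toy.nunique()
print(distinct_counts)
```

Columns without missing values, such as `Name`, `Sex`, and `Ticket`, hold only strings, which is why `np.unique` works on them individually in the cells that follow.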

In [23]:
df[object_features].columns

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

In [25]:
np.unique(df[object_features].Name)

array(['Abbing, Mr. Anthony', 'Abbott, Master. Eugene Joseph',
       'Abbott, Mr. Rossmore Edward', ...,
       'van Billiard, Master. Walter John',
       'van Billiard, Mr. Austin Blyler', 'van Melkebeke, Mr. Philemon'],
      dtype=object)

In [26]:
np.unique(df[object_features].Sex)

array(['female', 'male'], dtype=object)

In [27]:
np.unique(df[object_features].Ticket)

array(['110152', '110413', '110465', '110469', '110489', '110564',
       '110813', '111163', '111240', '111320', '111361', '111369',
       '111426', '111427', '111428', '112050', '112051', '112052',
       '112053', '112058', '112059', '112277', '112377', '112378',
       '112379', '112901', '113028', '113038', '113043', '113044',
       '113050', '113051', '113054', '113055', '113056', '113059',
       '113501', '113503', '113505', '113509', '113510', '113514',
       '113572', '113760', '113767', '113773', '113776', '113778',
       '113780', '113781', '113783', '113784', '113786', '113787',
       '113788', '113789', '113790', '113791', '113792', '113794',
       '113795', '113796', '113798', '113800', '113801', '113803',
       '113804', '113806', '113807', '11668', '11751', '11752', '11753',
       '11755', '11765', '11767', '11769', '11770', '11771', '11774',
       '11778', '11813', '11967', '1222', '12233', '12460', '12749',
       '13049', '13050', '13213', '13214', '13236',

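Assignment 2
Think it over and try to name one or more data types beyond the five covered today. Can the new types you named be placed in one of the three major categories?
So, among the three major feature categories, which one should be the most complex to handle?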

In [None]:
# Boolean is another type commonly seen in datasets.
# The object type is the most complicated: a single object column can hold both strings and floats (such as NaN) at the same time.
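
To illustrate the answer above, a boolean column can be used as 0/1 integers once cast. `is_alone` is a hypothetical feature invented for this sketch and is not a column in the Titanic files.

```python
import pandas as pd

# Hypothetical boolean feature; not a column in the Titanic files
toy = pd.DataFrame({'is_alone': [True, False, True, True]})

# A bool column has its own dtype, separate from int64 / float64 / object
print(toy.dtypes)

# Cast to int to use it as a numeric 0/1 feature
as_int = toy['is_alone'].astype(int)
print(as_int.mean())
```

Note that the loop in In [4] would file such a column under `object_features`, since its dtype is neither `int64` nor `float64`.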