Example: (Kaggle) Titanic
 ===
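The Titanic survival data is used below to look at the different kinds of features.
The dataset has columns of three dtypes, 'int64', 'float64', and 'object'; the column names of each dtype are collected in a Python list.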

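## Assignment Goals

Apply the three data operations to each of the three feature types and observe the results.
Think about which of the three feature types should be the most complex and hardest to handle.

## Key Points of the Example

1. How to see which column dtypes the current DataFrame contains, and how many columns there are of each

2. How to split the column names by dtype

3. How to display only the columns of a particular dtype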
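
As a quick illustration of the first point, here is a minimal sketch that counts columns per dtype in one line. The `toy` frame and its values are made up for this example and only stand in for the Titanic data; the notebook itself uses a `groupby`-based approach further down.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# df.dtypes is a Series of dtypes indexed by column name,
# so value_counts() gives the number of columns per dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```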

## Thinking Flow

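1. Training needs labels and the submission needs identifiers, so first pull out "Survived" (the label, from the training set) and "PassengerId" (the identity, from the test set) for later use with the model

2. Then drop "PassengerId" and "Survived" from the datasets (data_train & data_test) and merge them into a single DataFrame

3. Finally, reset the index of the dtype listing, group the columns into int64, float64, and object, and keep the column names of each dtype apart so that only the columns of a given dtype can be displayed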

## Resources
[Predictive Analysis of Survival Rate on Titanic - Kaggle](https://www.kaggle.com/beiqiwang/predictive-analysis-of-survival-rate-on-titanic)

* log1p(x): ```computes log(1 + x); it stays accurate when x is very close to 0, where evaluating log(1 + x) directly loses precision```

![](https://i.stack.imgur.com/ycPOC.png)

[What is the purpose of numpy.log1p() - Stackoverflow](https://stackoverflow.com/questions/49538185/what-is-the-purpose-of-numpy-log1p)
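
To make the precision point concrete, the sketch below compares the two ways of computing log(1 + x) for a tiny x. The value `1e-10` is an arbitrary choice for illustration.

```python
import numpy as np

x = 1e-10

# 1 + x is rounded to the nearest double before the log is taken,
# so part of x is already lost
naive = np.log(1 + x)

# log1p works on x directly and keeps the small value intact
stable = np.log1p(x)

print(naive, stable)

# np.expm1 (exp(y) - 1) is the inverse of np.log1p
restored = np.expm1(np.log1p(5.0))
```

This is why `log1p` / `expm1` are a common pair for transforming a skewed, non-negative regression target and mapping predictions back.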

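* nunique(): ```counts the number of distinct values in each column (or in each row with axis=1); missing values are ignored by default```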

[Pandas.DataFrame.nunique() - GeeksforGeeks](https://www.geeksforgeeks.org/python-pandas-dataframe-nunique/)
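
A small sketch of how `nunique()` behaves, including its treatment of missing values. The `toy` frame is hypothetical and only borrows column names from the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame with a column that contains missing values
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Cabin": ["C85", None, "C85", None],
})

per_column = toy.nunique()            # distinct non-missing values per column
with_nan = toy.nunique(dropna=False)  # count the missing value as its own value
per_row = toy.nunique(axis=1)         # distinct non-missing values within each row

print(per_column, with_nan, per_row, sep="\n\n")
```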


* drop(): ```removes the given rows or columns (axis=1 for columns) and returns a new DataFrame```

[Pandas.DataFrame.drop - pandas documents](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

* concat(): ```joins several DataFrames or Series along an axis; ignore_index=True renumbers the resulting index```

[Pandas.concat - pandas documents](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)
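
The way `drop()` and `concat()` work together in this notebook can be sketched on a tiny example. The two frames below are made up; only the column names mirror the Titanic files.

```python
import pandas as pd

# Hypothetical miniature train/test sets
train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
test = pd.DataFrame({"PassengerId": [3], "Pclass": [2]})

# drop() returns a new DataFrame; the original is left untouched
train_X = train.drop(columns=["PassengerId", "Survived"])
test_X = test.drop(columns=["PassengerId"])

# Stack the rows; ignore_index=True renumbers them 0..n-1
combined = pd.concat([train_X, test_X], ignore_index=True)
print(combined)
```

`drop(columns=[...])` is equivalent to `drop([...], axis=1)`, which is the form used in the notebook cell.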

* aggregate(): ```applies one or more aggregation functions (e.g. 'count', 'mean') over an axis or over each group```

[Pandas.DataFrame.aggregate - pandas documents](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html)


* groupby(): ```splits the rows into groups by key so that each group can be aggregated separately```

[Pandas.DataFrame.groupby - pandas documents](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

[Groupby: Get statistics for each group(such as count, mean, etc) - Stackoverflow](https://stackoverflow.com/questions/19384532/get-statistics-for-each-group-such-as-count-mean-etc-using-pandas-groupby)
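
A minimal sketch of `groupby()` followed by `aggregate()`, assuming a made-up frame with `Sex` and `Fare` columns:

```python
import pandas as pd

# Hypothetical fares, two passengers per group
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# One row per group, one column per aggregation function
stats = toy.groupby("Sex")["Fare"].aggregate(["count", "mean"])
print(stats)
```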

* dtypes: ```an attribute (not a method) holding the dtype of each column, as a Series indexed by column name```

[Pandas.DataFrame.dtypes - pandas documents](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html)

* zip(): ```pairs up the elements of several iterables and yields them one tuple at a time```

[zip() function - W3Schools](https://www.w3schools.com/python/ref_func_zip.asp)

In [3]:
import pandas as pd
import numpy as np

In [18]:
# Load the training and test data
data_train = pd.read_csv('titanic_train.csv')
data_test = pd.read_csv('titanic_test.csv')
print(data_train.shape)

(891, 12)


In [19]:
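# Training needs train_X and train_Y; the prediction output needs ids (to identify each prediction) and test_X
# Pull out train_Y and ids first, then merge the feature columns of train_X and test_X into df for feature engineering
# Survived is a 0/1 label, so it is kept as-is (np.log1p only suits a skewed continuous target)
train_Y = data_train['Survived']
# print(data_train['Survived'])
ids = data_test['PassengerId']

# axis=1 means the labels being dropped are columns
# Survived only exists in data_train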
data_train = data_train.drop(['PassengerId', 'Survived'] , axis=1)
data_test = data_test.drop(['PassengerId'] , axis=1)
df = pd.concat([data_train, data_test], ignore_index = True)
df.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [21]:
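# Show the dtypes of the columns and how many columns there are of each
# df.dtypes : a Series with the column names as index and each column's dtype as value
# .reset_index() : by default moves the existing index into a new column; if the index need not be kept, write .reset_index(drop=True)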
dtype_df = df.dtypes.reset_index() 
dtype_df.columns = ["Count", "Column Type"]
dtype_df = dtype_df.groupby("Column Type").aggregate('count').reset_index()
dtype_df

| | Column Type | Count |
|---|---|---|
| 0 | int64 | 3 |
| 1 | float64 | 2 |
| 2 | object | 5 |


## Understanding What the Code Does
What DataFrame.dtypes, DataFrame.columns, and zip(df.dtypes, df.columns) actually hold

In [22]:
print(dtype_df)
print(df.columns)
print(zip(df.dtypes, df.columns))

  Column Type  Count
0       int64      3
1     float64      2
2      object      5
Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')
<zip object at 0x0000017881344C48>
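
Printing a `zip` object only shows its memory address because `zip` is a lazy iterator. A minimal sketch of how to look inside it, using a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"Pclass": [3, 1], "Fare": [7.25, 71.28]})

pairs = zip(toy.dtypes, toy.columns)  # lazy iterator, no tuples built yet
first_pass = list(pairs)              # materialise the (dtype, column name) tuples
second_pass = list(pairs)             # the iterator is exhausted after one pass

print(first_pass)
print(second_pass)
```

Because the iterator can be consumed only once, the `for` loop in the next cell builds a fresh `zip(...)` rather than reusing a stored one.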


In [11]:
# Having confirmed there are only three dtypes (int64, float64, object), store the column names of each in its own list
int_features = []
float_features = []
object_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64':
        float_features.append(feature)
    elif dtype == 'int64':
        int_features.append(feature)
    else:
        object_features.append(feature)
print(f'{len(int_features)} Integer Features : {int_features}\n')
print(f'{len(float_features)} Float Features : {float_features}\n')
print(f'{len(object_features)} Object Features : {object_features}')

3 Integer Features : ['Pclass', 'SibSp', 'Parch']

2 Float Features : ['Age', 'Fare']

5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
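
The same split can also be written without an explicit loop. The sketch below uses `DataFrame.select_dtypes` on a made-up frame; it is an alternative to the loop above, not what the notebook ran.

```python
import pandas as pd

# Hypothetical toy frame with one column of each dtype
toy = pd.DataFrame({
    "Pclass": pd.Series([3, 1], dtype="int64"),
    "Fare": pd.Series([7.25, 71.28], dtype="float64"),
    "Sex": pd.Series(["male", "female"], dtype="object"),
})

# select_dtypes returns the sub-DataFrame whose columns match the dtype
int_cols = toy.select_dtypes(include="int64").columns.tolist()
float_cols = toy.select_dtypes(include="float64").columns.tolist()
# Everything that is not numeric, mirroring the `else` branch of the loop
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(int_cols, float_cols, other_cols)
```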


---
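## Assignment 1

Run the assignment code and observe: applying the three operations (mean / max / nunique) to each of the three column types gives nine operations in total. What problems come up?
Try to explain the cause for the code blocks that go wrong.

* The problems are all in the object columns. Strings cannot be averaged, so with the pandas version used here mean() silently drops every column and returns an empty Series (Series([], dtype: float64)); pandas 2.0 and later raise a TypeError instead. max() does run, but it compares strings lexicographically (character by character), which is rarely meaningful, and Cabin is missing from its result, most likely because a mix of NaN and strings cannot be compared. Only nunique() gives a meaningful result for every column.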

In [12]:
df[int_features].head()
print(df[int_features].head().mean(), '\n')
print(df[int_features].head().max(), '\n')
print(df[int_features].head().nunique(), '\n')

Pclass    2.2
SibSp     0.6
Parch     0.0
dtype: float64 

Pclass    3
SibSp     1
Parch     0
dtype: int64 

Pclass    2
SibSp     2
Parch     1
dtype: int64 



In [24]:
print(df[object_features].head(), '\n')

print(df[object_features].head().mean(), '\n')
print(df[object_features].head().max(), '\n')
print(df[object_features].head().nunique(), '\n')

                                                Name     Sex  \
0                            Braund, Mr. Owen Harris    male   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   
2                             Heikkinen, Miss. Laina  female   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   
4                           Allen, Mr. William Henry    male   

             Ticket Cabin Embarked  
0         A/5 21171   NaN        S  
1          PC 17599   C85        C  
2  STON/O2. 3101282   NaN        S  
3            113803  C123        S  
4            373450   NaN        S   

Series([], dtype: float64) 

Name        Heikkinen, Miss. Laina
Sex                           male
Ticket            STON/O2. 3101282
Embarked                         S
dtype: object 

Name        5
Sex         2
Ticket      5
Cabin       2
Embarked    2
dtype: int64 
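
To avoid depending on how a given pandas version treats non-numeric columns, the intent can be stated explicitly. This is a minimal sketch on a hypothetical frame; `numeric_only=True` asks `mean()` to skip columns that cannot be averaged.

```python
import pandas as pd

# Hypothetical toy frame mixing text and numeric columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 7.92],
})

# Only numeric columns take part, so no error and no silent surprises
numeric_mean = toy.mean(numeric_only=True)

# max() on strings is a lexicographic comparison: 'm' sorts after 'f'
max_sex = toy["Sex"].max()

print(numeric_mean)
print(max_sex)
```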



In [14]:
print(df[float_features].head().mean(), '\n')
print(df[float_features].head().max(), '\n')
print(df[float_features].head().nunique(), '\n')

Age     31.20000
Fare    29.52166
dtype: float64 

Age     38.0000
Fare    71.2833
dtype: float64 

Age     4
Fare    5
dtype: int64 



---
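## Assignment 2

Think about it: try to name one or more data types other than the five covered today. Can the new type be placed in one of the three broad categories?
And of the three broad categories of features, which should be the most complicated to handle?

1. object is the most complicated to handle, because defining an order and a categorization for the values is a lot of work

2. Example: complex numbers (a+bi). A complex number needs two numbers to represent it, so although it is conceptually numeric, it has no natural ordering and can only be classed as a "categorical column". Other, odder examples can likewise be classed as categorical columns, which is why categorical columns should be the hardest to handle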
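
The claim that complex numbers cannot be ordered can be checked directly in Python. A minimal sketch with two arbitrary values:

```python
# Two arbitrary complex values for illustration
values = [3 + 4j, 1 - 2j]

# Python defines no ordering for complex numbers, so max() raises TypeError
try:
    largest = max(values)
    comparable = True
except TypeError:
    comparable = False

# A derived real-valued feature, such as the magnitude, can be ordered
magnitudes = [abs(v) for v in values]

print(comparable, magnitudes)
```

Deriving an ordered quantity like the magnitude is one way to turn such a value back into a numeric feature, at the cost of discarding part of the information.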