# [D4: EDA: Column Data Types and How to Handle Them](https://ai100.cupoy.com/mission/D4)
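* General data types fall into two groups: "discrete variables" and "continuous variables".
* The three most common column dtypes in a Pandas DataFrame are float64, int64, and object (which includes strings and is used to represent categorical variables).
* Other formats also exist, such as dates and booleans.
* How does a model handle categorical data, and how can it be represented? See [Label Encoder vs. One Hot Encoder in Machine Learning](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621)
  * The article introduces [Label Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) and [One Hot Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
    * Both come from the Scikit Learn library in Python.
  * Both aim to convert non-numeric data (categories, text, etc.) into a form that predictive models can understand.
  * Label Encoder:
    * It maps n categories directly to n integers, each integer standing for one category (scikit-learn's LabelEncoder uses 0 to n-1).
    * The drawback is that a model may mistakenly treat the codes as ordered, e.g. 1<2<3.
    * One Hot Encoder overcomes this drawback.
  * One Hot Encoder:
    * It replaces n categories with n columns, one per category. In each row exactly one column is 1 and all the others are 0.
    * For example, with 5 categories, an item labeled as the third category is represented as [0, 0, 1, 0, 0].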
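
The difference between the two encoders described above can be seen on a tiny example. This is a minimal sketch, assuming scikit-learn is installed; the `colors` array is hypothetical toy data and does not come from the course dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data: five samples drawn from three categories.
colors = np.array(["red", "green", "blue", "green", "red"])

# LabelEncoder sorts the distinct classes and maps them to 0..n-1.
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(colors)
classes = list(label_encoder.classes_)

# OneHotEncoder expects 2-D input (one column per feature) and returns
# a sparse matrix by default, so it is converted to a dense array here.
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(colors.reshape(-1, 1)).toarray()

print("classes:", classes)
print("label encoded:", labels)
print("one-hot encoded:")
print(onehot)
```

The label-encoded integers carry an implied order that has no meaning for colors, while each one-hot row contains a single 1 and treats every category symmetrically.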

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
# Set data_path and read app_train
dir_data = './data/'
f_app_train = os.path.join(dir_data, 'application_train.csv')
app_train = pd.read_csv(f_app_train)
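
Once a DataFrame is loaded, its column dtypes can be inspected to find the categorical columns that need encoding. The sketch below uses a small hypothetical frame (`toy`, with made-up column names) in place of `app_train`, so it runs without the CSV file. The string column is created with an explicit `object` dtype, because the dtype pandas infers for strings may differ between versions.

```python
import pandas as pd

# Hypothetical stand-in for app_train; column names are illustrative only.
toy = pd.DataFrame({
    "income": [1200.5, 980.0, 1500.25],   # floating-point column (continuous)
    "children": [0, 2, 1],                # integer column (discrete)
    "weekday": pd.Series(["MONDAY", "FRIDAY", "MONDAY"], dtype="object"),
})

# How many columns there are of each dtype.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Names of the object (string / categorical) columns.
object_cols = toy.select_dtypes(include=["object"]).columns.tolist()

# Number of distinct categories in each object column.
unique_per_object_col = toy[object_cols].nunique()
print(unique_per_object_col)
```

The same three calls (`dtypes.value_counts()`, `select_dtypes`, `nunique`) should work unchanged on `app_train`.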

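## Homework
Apply One Hot encoding to the data slice sub_train below, and observe how the number of columns (using shape) and the column names (using head) change before and after the conversion.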

In [3]:
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
print(sub_train.shape)
sub_train.head()

(307511, 1)


| | WEEKDAY_APPR_PROCESS_START |
|---|---|
| 0 | WEDNESDAY |
| 1 | MONDAY |
| 2 | MONDAY |
| 3 | WEDNESDAY |
| 4 | THURSDAY |


In [4]:
"""
Your Code Here
"""
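# One-hot encode the weekday column: one new column per distinct weekday
sub_train = pd.get_dummies(sub_train)
print(f'sub_train.shape = {sub_train.shape}')
print(f'sub_train.columns = {sub_train.columns}')

# Each encoded column should contain only the indicator values 0 and 1
for col in sub_train:
    print(f'Unique values in {col}: {sub_train[col].unique()}')
sub_train.head()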

sub_train.shape = (307511, 7)
sub_train.columns = Index(['WEEKDAY_APPR_PROCESS_START_FRIDAY',
       'WEEKDAY_APPR_PROCESS_START_MONDAY',
       'WEEKDAY_APPR_PROCESS_START_SATURDAY',
       'WEEKDAY_APPR_PROCESS_START_SUNDAY',
       'WEEKDAY_APPR_PROCESS_START_THURSDAY',
       'WEEKDAY_APPR_PROCESS_START_TUESDAY',
       'WEEKDAY_APPR_PROCESS_START_WEDNESDAY'],
      dtype='object')
Unique values in WEEKDAY_APPR_PROCESS_START_FRIDAY: [0 1]
Unique values in WEEKDAY_APPR_PROCESS_START_MONDAY: [0 1]
Unique values in WEEKDAY_APPR_PROCESS_START_SATURDAY: [0 1]
Unique values in WEEKDAY_APPR_PROCESS_START_SUNDAY: [0 1]
Unique values in WEEKDAY_APPR_PROCESS_START_THURSDAY: [0 1]
Unique values in WEEKDAY_APPR_PROCESS_START_TUESDAY: [0 1]
Unique values in WEEKDAY_APPR_PROCESS_START_WEDNESDAY: [1 0]


| | WEEKDAY_APPR_PROCESS_START_FRIDAY | WEEKDAY_APPR_PROCESS_START_MONDAY | WEEKDAY_APPR_PROCESS_START_SATURDAY | WEEKDAY_APPR_PROCESS_START_SUNDAY | WEEKDAY_APPR_PROCESS_START_THURSDAY | WEEKDAY_APPR_PROCESS_START_TUESDAY | WEEKDAY_APPR_PROCESS_START_WEDNESDAY |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
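
As a sanity check on the result above, the number of one-hot columns should equal the number of distinct categories, and the original label can be recovered from each row. The sketch below works on a hypothetical five-row frame that mirrors the `head()` output, not on the full dataset. Passing `dtype=int` is an addition here: it keeps the 0/1 values shown above on newer pandas versions, where `get_dummies` returns booleans by default.

```python
import pandas as pd

# Hypothetical five-row sample mirroring the head() output above.
toy = pd.DataFrame({
    "WEEKDAY_APPR_PROCESS_START": [
        "WEDNESDAY", "MONDAY", "MONDAY", "WEDNESDAY", "THURSDAY",
    ]
})

n_categories = toy["WEEKDAY_APPR_PROCESS_START"].nunique()

# dtype=int forces 0/1 integers instead of the boolean default in pandas >= 2.0.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.shape)

# One column per distinct category, and exactly one 1 in every row.
print(encoded.shape[1] == n_categories)
print((encoded.sum(axis=1) == 1).all())

# Recover the original label: take the column holding the 1, strip the prefix.
recovered = encoded.idxmax(axis=1).str.replace(
    "WEEKDAY_APPR_PROCESS_START_", "", regex=False
)
print(recovered.tolist())
```

On the full `sub_train` the same check should give 7 columns, one for each day of the week.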
