# Objective
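A. During initial EDA, a few questions inevitably come up:

    How many columns are there of each data type?

    How many classes does each categorical column (object in pandas) have?

    How does a model handle categorical data, and how can it be represented?
    

Apply One Hot Encoding to the data subset sub_train below, and compare the number of columns (using shape) and the column names (using head) before and after the conversion.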
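
The questions above can be tried on a small stand-in table before touching the full dataset. The sketch below assumes a hypothetical three-row DataFrame named `toy`; only its column names are borrowed from `application_train.csv`, and the values are illustrative.

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from application_train.csv
toy = pd.DataFrame({
    'CODE_GENDER': ['M', 'F', 'M'],
    'WEEKDAY_APPR_PROCESS_START': ['WEDNESDAY', 'MONDAY', 'MONDAY'],
    'CNT_CHILDREN': [0, 0, 1],
    'AMT_CREDIT': [406597.5, 1293502.5, 135000.0],
})

# How many columns are there of each data type?
dtype_counts = toy.dtypes.astype(str).value_counts()

# How many distinct classes does each categorical column have?
cat_cols = toy.select_dtypes(exclude='number').columns
class_counts = toy[cat_cols].nunique()

# One-hot encode the categorical columns and compare the column counts
encoded = pd.get_dummies(toy, columns=list(cat_cols), dtype=int)
print(dtype_counts)
print(class_counts)
print(toy.shape, '->', encoded.shape)
print(encoded.columns.tolist())
```

Each categorical column is replaced by one 0/1 column per class, so the column count grows with the number of classes while the row count stays the same.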

## Key points:
    - Solve the "OS path" problem
    - Learn what "One Hot Encoding" does
    - Learn to extract values from a DataFrame with data.iloc[] & data.loc[]
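
As a minimal sketch of the last point: `.loc` and `.iloc` are indexers used with square brackets, not functions. The DataFrame `toy` below is hypothetical, with the index set to a few `SK_ID_CURR`-style labels so that label-based and position-based selection are easy to tell apart.

```python
import pandas as pd

# Hypothetical toy rows; the index labels mimic SK_ID_CURR values
toy = pd.DataFrame(
    {'CODE_GENDER': ['M', 'F', 'M'], 'CNT_CHILDREN': [0, 0, 1]},
    index=[100002, 100003, 100004],
)

# .loc selects by label: row label 100003, column label 'CODE_GENDER'
by_label = toy.loc[100003, 'CODE_GENDER']

# .iloc selects by integer position: second row, first column
by_position = toy.iloc[1, 0]

# Slices differ: .loc includes the end label, .iloc excludes the end position
loc_rows = toy.loc[100002:100003]   # 2 rows
iloc_rows = toy.iloc[0:1]           # 1 row

print(by_label, by_position, len(loc_rows), len(iloc_rows))
```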

## Method: 
- [OS module](https://www.geeksforgeeks.org/os-module-python-examples/) 
- [How to set os path in python](https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f#:~:text=To%20use%20it%2C%20you%20just,for%20the%20current%20operating%20system.)
- [Label Encoder vs. One Hot Encoder in Learning](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621)

## Thought flow
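    - First, understand how "os.path" handles relative paths in Python: a relative path is resolved against the current working directory.
    
    - Try the following approaches:  1. Put the .csv file in the same directory as the program and read it directly: pd.read_csv('application.csv')  2. Use backslashes in the path, written as a raw string so that '\a' is not read as an escape sequence: pd.read_csv(r'data\application.csv')  3. Use an absolute path: pd.read_csv(r'C:\...\application.csv')
    - There are two main reasons to build paths with os.path.join: 1. The location can be controlled flexibly through variables, which makes later changes easier. 2. The program is more portable: path separators can differ between machines (for example, macOS uses '/' and Windows uses '\'), and os.path.join handles that for you.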
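
The path handling described above can be sketched without reading any file. The folder name `data` and the variable names below are assumptions for illustration; the last part shows why a backslash path should be a raw string, since `'\a'` in a normal string is the bell character.

```python
import os.path
from pathlib import Path

# Hypothetical folder and file names, used only for illustration
dir_data = 'data'
file_name = 'application_train.csv'

# os.path.join inserts the separator of the current operating system
joined = os.path.join(dir_data, file_name)

# pathlib offers the same thing with the / operator
as_path = Path(dir_data) / file_name

# '\a' inside a normal string is an escape sequence (the bell character),
# so a Windows-style path needs a raw string (or doubled backslashes)
plain = 'data\application.csv'
raw = r'data\application.csv'

print(joined)
print(as_path)
print(len(plain), len(raw))
```

The raw string is one character longer than the plain one because the plain version collapses `\a` into a single control character, which is why such a path fails to open.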

# Question: A(One Hot Encoding)

In [88]:
import os.path
import numpy as np
import pandas as pd
from pathlib import Path

## 1. OS path question and how to solve it
[Day6 os.path.join setup question](https://www.cupoy.com/qa/club/ai_tw/0000016D6BA22D97000000016375706F795F72656C656173654B5741535354434C5542/0000017392209885000000336375706F795F72656C656173655155455354)

The code below walks through the OS path options in detail.

In [128]:
# Set data_path and read app_train
# # dir_data = './data/'
# dir_data = 'ML_marathon/Day_006'
# f_app_train = os.path.join(dir_data, 'application_train.csv')
# app_train = pd.read_csv(f_app_train)

# data_folder = os.path.join('ML_marathon', 'Day_006')
# file_to_open = os.path.join(data_folder, r'application_train.csv')
# app_train = pd.read_csv(file_to_open)

# f = open(file_to_open)
# print(f.read())

app_train = pd.read_csv('application_train.csv')
# app_train = pd.read_csv(r'./data/application_train.csv')
app_train

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,456251,0,Cash loans,M,N,N,0,157500.0,254700.0,27558.0,...,0,0,0,0,,,,,,
307507,456252,0,Cash loans,F,N,Y,0,72000.0,269550.0,12001.5,...,0,0,0,0,,,,,,
307508,456253,0,Cash loans,F,N,Y,0,153000.0,677664.0,29979.0,...,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
307509,456254,1,Cash loans,F,N,Y,0,171000.0,370107.0,20205.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


## To extract specific data (specific columns or rows) from the CSV, first load it into a DataFrame

In [122]:
# Take the data we want and apply one hot encoding to it
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
gen_train = pd.DataFrame(app_train['CODE_GENDER'])
name_train = pd.DataFrame(app_train['NAME_EDUCATION_TYPE'])

print(sub_train.shape)
print(sub_train.head())

print(gen_train.head())
# print(genF_train.head())
print(name_train.head())

(307511, 1)
  WEEKDAY_APPR_PROCESS_START
0                  WEDNESDAY
1                     MONDAY
2                     MONDAY
3                  WEDNESDAY
4                   THURSDAY
  CODE_GENDER
0           M
1           F
2           M
3           F
4           M
             NAME_EDUCATION_TYPE
0  Secondary / secondary special
1               Higher education
2  Secondary / secondary special
3  Secondary / secondary special
4  Secondary / secondary special
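
A small sketch of the column selection used above, on a hypothetical `toy` DataFrame: single brackets return a Series, while double brackets return a one-column DataFrame, which matches the result of wrapping the Series in `pd.DataFrame(...)`.

```python
import pandas as pd

# Hypothetical toy data standing in for app_train
toy = pd.DataFrame({
    'CODE_GENDER': ['M', 'F', 'M'],
    'CNT_CHILDREN': [0, 0, 1],
})

as_series = toy['CODE_GENDER']              # single brackets -> Series
as_frame = toy[['CODE_GENDER']]             # double brackets -> DataFrame
wrapped = pd.DataFrame(toy['CODE_GENDER'])  # the form used in the cell above

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
print(as_frame.equals(wrapped))
```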


## 2. Test - One Hot Encoding (Pandas.get_dummies) 

In [137]:
# app_train = pd.get_dummies(app_train)
app_train = pd.read_csv('application_train.csv')
gen_train = pd.DataFrame(app_train['CODE_GENDER'])
data_gen = pd.get_dummies(gen_train)

print('Check gender - all genders\n')
print(data_gen[:5])


print('\n\nCheck gender - single gender\n')
print(data_gen['CODE_GENDER_F'].head())
print(data_gen['CODE_GENDER_M'].head())
# print(app_train['NAME_EDUCATION_TYPE_Academic degree'].head())

Check gender - all genders

   CODE_GENDER_F  CODE_GENDER_M  CODE_GENDER_XNA
0              0              1                0
1              1              0                0
2              0              1                0
3              1              0                0
4              0              1                0


Check gender - single gender

0    0
1    1
2    0
3    1
4    0
Name: CODE_GENDER_F, dtype: uint8
0    1
1    0
2    1
3    0
4    1
Name: CODE_GENDER_M, dtype: uint8
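
The `uint8` dtype in the output above depends on the pandas version; pandas 2.0 and later return `bool` columns from `get_dummies` by default. A minimal sketch, assuming a hypothetical four-row column named `gen`, that pins the dtype explicitly so the 0/1 output looks the same either way:

```python
import pandas as pd

# Hypothetical toy column standing in for app_train['CODE_GENDER']
gen = pd.DataFrame({'CODE_GENDER': ['M', 'F', 'M', 'XNA']})

# Passing dtype explicitly keeps the 0/1 output independent of the pandas version
dummies = pd.get_dummies(gen, dtype='uint8')

print(dummies.columns.tolist())
print(dummies.dtypes.unique())
print(dummies.sum(axis=1).tolist())  # every row has exactly one 1
```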


In [91]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import dask.dataframe
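
The two scikit-learn encoders imported above are not used in the cells shown here. As a hedged sketch of how they differ, assuming a hypothetical array `genders` rather than the real column: `LabelEncoder` maps each class to one integer, while `OneHotEncoder` produces one 0/1 column per class.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical values standing in for app_train['CODE_GENDER']
genders = np.array(['M', 'F', 'M', 'XNA'])

# Label encoding: one integer per class (classes are sorted alphabetically)
label_enc = LabelEncoder()
labels = label_enc.fit_transform(genders)

# One-hot encoding: one 0/1 column per class; the encoder expects a 2-D input
onehot_enc = OneHotEncoder()
onehot = onehot_enc.fit_transform(genders.reshape(-1, 1)).toarray()

print(label_enc.classes_, labels)
print(onehot)
```

Label encoding implies an order between classes (F < M < XNA here), which is usually not meaningful for nominal data; one-hot encoding avoids that at the cost of extra columns.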

## HW - One Hot Encoding for days (Pandas.get_dummies)

In [130]:
## One Hot Encoding
app_train = pd.read_csv('application_train.csv')
day_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
data_day = pd.get_dummies(day_train)
print(data_day[:5])

# Compare the shape before and after encoding
print('\nComparing the difference of the shape')
print('Original Data Sheet', day_train.shape)
print('One Hot Encoding', data_day.shape)

   WEEKDAY_APPR_PROCESS_START_FRIDAY  WEEKDAY_APPR_PROCESS_START_MONDAY  \
0                                  0                                  0   
1                                  0                                  1   
2                                  0                                  1   
3                                  0                                  0   
4                                  0                                  0   

   WEEKDAY_APPR_PROCESS_START_SATURDAY  WEEKDAY_APPR_PROCESS_START_SUNDAY  \
0                                    0                                  0   
1                                    0                                  0   
2                                    0                                  0   
3                                    0                                  0   
4                                    0                                  0   

   WEEKDAY_APPR_PROCESS_START_THURSDAY  WEEKDAY_APPR_PROCESS_START_TUESDAY  \
0       