# Feature Engineering for Categorical Variables

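## Assignment Code
In this assignment you are asked to complete the following:
1. Find a dataset you want to explore on Kaggle and use it for this assignment.
2. Pick two suitable categorical variables and apply Label Encoding to each, using two different implementation approaches.
3. Pick two suitable categorical variables and apply One-Hot Encoding to each, using two different implementation approaches.
4. Pick two suitable categorical variables and apply Ordinal Encoding to each, using two of the three implementation approaches.
5. Pick two suitable categorical variables, then write and run Frequency Encoding.
6. Pick two suitable categorical variables, then write and run Feature Combination.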

In [24]:
import numpy as np
import pandas as pd


## Load the Data

In [33]:
# Load the data
raw_data = pd.read_csv("/Users/a7890123119/Downloads/Beatles.csv") # fill in the path to your data file here

raw_data.info()
raw_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 285 entries, 0 to 284
Data columns (total 45 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   URI                                  285 non-null    object 
 1   Title                                285 non-null    object 
 2   Year                                 277 non-null    object 
 3   Album                                284 non-null    object 
 4   Popularity                           285 non-null    float64
 5   Duration                             285 non-null    float64
 6   Key                                  285 non-null    float64
 7   Mode                                 285 non-null    float64
 8   Tempo                                285 non-null    float64
 9   Time_signature                       285 non-null    float64
 10  Valence                              285 non-null    float64
 11  Danceability                    

Unnamed: 0,URI,Title,Year,Album,Popularity,Duration,Key,Mode,Tempo,Time_signature,...,Weeks at No1 in UK (The Guardian),Highest position (Billboard),Weeks at No1 (Billboard),Top 50 (Billboard),Top 50 (Ultimate classic rock),Top 50 (Rolling Stone),Top 50 (NME),Top 50 (Top50songs.org),"Top 50 (USA today, 2017)","Top 50 (Vulture, by Bill Wyman)"
0,spotify:track:2FDEHIMkjxFLzj688M2I3h,(You're So Square) Baby I Don't Care - Studio Jam,,The Beatles,29.0,43.0,9.0,1.0,112.173,4.0,...,,,,,,,,,,
1,spotify:track:2HvTGx5fzFGpHSyRNvXd9T,12-bar Original,1965.0,Anthology 2,31.0,175.0,9.0,1.0,122.678,4.0,...,,,,,,,,,,
2,spotify:track:34dsKRJHIadrrNdCDtMwGn,A Beginning - Anthology 3 Version,,Anthology 3,29.0,50.0,0.0,1.0,90.588,3.0,...,,,,,,,,,,
3,spotify:track:0hKRSZhUGEhKU6aNSPBACZ,A Day in the Life,1967.0,Sgt. Pepper's Lonely Hearts Club Band,65.0,335.0,4.0,0.0,163.219,4.0,...,,,,,1.0,1.0,2.0,18.0,1.0,1.0
4,spotify:track:5J2CHimS7dWYMImCHkEFaJ,A Hard Day's Night,1964.0,A Hard Day's Night,71.0,152.0,0.0,1.0,138.514,4.0,...,3.0,1.0,2.0,8.0,18.0,11.0,19.0,19.0,15.0,41.0


In [34]:
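# For ease of learning, missing values were meant to be removed at this step;
# this cell only previews the data, so they remain (see the empty entries below)
raw_data.head() # preview the first few rows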




Unnamed: 0,URI,Title,Year,Album,Popularity,Duration,Key,Mode,Tempo,Time_signature,...,Weeks at No1 in UK (The Guardian),Highest position (Billboard),Weeks at No1 (Billboard),Top 50 (Billboard),Top 50 (Ultimate classic rock),Top 50 (Rolling Stone),Top 50 (NME),Top 50 (Top50songs.org),"Top 50 (USA today, 2017)","Top 50 (Vulture, by Bill Wyman)"
0,spotify:track:2FDEHIMkjxFLzj688M2I3h,(You're So Square) Baby I Don't Care - Studio Jam,,The Beatles,29.0,43.0,9.0,1.0,112.173,4.0,...,,,,,,,,,,
1,spotify:track:2HvTGx5fzFGpHSyRNvXd9T,12-bar Original,1965.0,Anthology 2,31.0,175.0,9.0,1.0,122.678,4.0,...,,,,,,,,,,
2,spotify:track:34dsKRJHIadrrNdCDtMwGn,A Beginning - Anthology 3 Version,,Anthology 3,29.0,50.0,0.0,1.0,90.588,3.0,...,,,,,,,,,,
3,spotify:track:0hKRSZhUGEhKU6aNSPBACZ,A Day in the Life,1967.0,Sgt. Pepper's Lonely Hearts Club Band,65.0,335.0,4.0,0.0,163.219,4.0,...,,,,,1.0,1.0,2.0,18.0,1.0,1.0
4,spotify:track:5J2CHimS7dWYMImCHkEFaJ,A Hard Day's Night,1964.0,A Hard Day's Night,71.0,152.0,0.0,1.0,138.514,4.0,...,3.0,1.0,2.0,8.0,18.0,11.0,19.0,19.0,15.0,41.0


## Split the Raw Data into Training and Test Sets

In [35]:

from sklearn.model_selection import train_test_split

trainData, testData = train_test_split(raw_data, test_size=0.25, random_state=214)
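
As a quick sanity check on the split, the sketch below runs the same `train_test_split` call on a synthetic 285-row frame. The `toy` DataFrame and its contents are made up for illustration. Taking `.copy()` of each part is an optional habit rather than something the notebook does: the later cells add encoded columns to the training frame, and working on an explicit copy keeps pandas from warning about writing to a slice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for raw_data: 285 rows, the same size as the Beatles table
toy = pd.DataFrame({"Title": [f"song_{i}" for i in range(285)]})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=214)

# Copy each part so that adding encoded columns later modifies an
# independent DataFrame rather than a slice of the original
toy_train = toy_train.copy()
toy_test = toy_test.copy()

print(len(toy_train), len(toy_test))  # 213 training rows, 72 test rows
```

With 285 rows and `test_size=0.25`, the test set is rounded up to 72 rows, which leaves the 213 training rows seen in the outputs further down.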

In [37]:
def print_data(before, after) -> None:
    print_limit = 0

    for former, latter in zip(before, after):
        print("Before: {}, After: {}".format(former, latter))
        print_limit += 1

        if print_limit == 5:
            break

## Label Encoding

This section introduces two ways to perform Label Encoding:
1. Use the LabelEncoder class from sklearn
2. Use a dictionary

Example: apply Label Encoding to Title

In [40]:
# Method 1: use the LabelEncoder class from sklearn
from sklearn.preprocessing import LabelEncoder

category_var1 = 'Title'

label_encoder = LabelEncoder()
trainData[category_var1 + '_label_encoded'] = label_encoder.fit_transform(trainData[category_var1])

print_data(trainData[category_var1], trainData[category_var1 + '_label_encoded'])

Before: You Can't Do That, After: 207
Before: When I Get Home, After: 199
Before: That Means a Lot, After: 173
Before: Some Other Guy, After: 164
Before: Taxman, After: 170


In [39]:
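# Method 2: use a dictionary
category_var2 = 'Title'

# Map each category to an integer code; sorting the categories gives the same codes as LabelEncoder
category_var2_label_dict = {label: code for code, label in enumerate(sorted(trainData[category_var2].unique()))}
trainData[category_var2 + '_label_encoded'] = trainData[category_var2].map(category_var2_label_dict)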

print_data(trainData[category_var2], trainData[category_var2 + '_label_encoded'])

Before: You Can't Do That, After: 207
Before: When I Get Home, After: 199
Before: That Means a Lot, After: 173
Before: Some Other Guy, After: 164
Before: Taxman, After: 170
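
To make the relationship between the two Label Encoding approaches concrete, here is a minimal sketch on a handful of made-up titles (the `titles` Series is an assumption for illustration, not part of the Beatles data). It relies on the fact that LabelEncoder assigns codes in sorted order of the categories, so a dictionary built from the sorted unique values should produce the same codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up titles for illustration only
titles = pd.Series(["Taxman", "Rain", "Help!", "Rain", "Taxman"])

# Method 1: LabelEncoder assigns codes in sorted order of the categories
encoded_sklearn = LabelEncoder().fit_transform(titles)

# Method 2: build the same mapping by hand with a dictionary
label_dict = {label: code for code, label in enumerate(sorted(titles.unique()))}
encoded_dict = titles.map(label_dict)

print(label_dict)
print(list(encoded_sklearn), list(encoded_dict))
```

If the dictionary were built from `unique()` without sorting, the codes would follow the order of first appearance instead, which is still a valid Label Encoding but no longer matches LabelEncoder.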


## One-Hot Encoding

This section introduces two ways to perform One-Hot Encoding:
1. Use the OneHotEncoder class from sklearn
2. Use a dictionary

Example: apply One-Hot Encoding to the Title variable

In [17]:
# Method 1: use the OneHotEncoder class from sklearn


> Note: OneHotEncoder expects two-dimensional input, so reshape the data to 2-D before transforming it.
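
The reshaping step in the note can be sketched as follows. The `albums` Series is made up for illustration; the point is only that a 1-D column has to become an `(n_samples, 1)` array before OneHotEncoder accepts it.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up albums for illustration only
albums = pd.Series(["Help!", "Revolver", "Help!", "Abbey Road"])

# A Series is 1-D; reshape its values to (n_samples, 1) before encoding
albums_2d = albums.values.reshape(-1, 1)

encoder = OneHotEncoder()
onehot = encoder.fit_transform(albums_2d).toarray()  # sparse matrix -> dense array

print(albums_2d.shape)      # shape of the reshaped input
print(encoder.categories_)  # one array of categories per input column
print(onehot)
```

Each row of `onehot` contains a single 1 in the column of its category, and the columns follow the sorted order stored in `encoder.categories_`.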

In [21]:
# Method 2: use a dictionary


## Ordinal Encoding

This section introduces three ways to perform Ordinal Encoding:
1. Use sklearn's LabelEncoder with a custom category order (less recommended)
2. Use sklearn's OrdinalEncoder with a custom category order (recommended)
3. Use a dictionary
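
A minimal sketch of approaches 2 and 3 from the list above, under stated assumptions: the `popularity_level` column, its three categories, and their order are all hypothetical and do not come from the Beatles data. Approach 1 is left out here.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered variable; the category names and their order are assumptions
df = pd.DataFrame({"popularity_level": ["low", "high", "medium", "low"]})
level_order = ["low", "medium", "high"]

# Approach 2: OrdinalEncoder with an explicit category order (one list per column)
ordinal_encoder = OrdinalEncoder(categories=[level_order])
df["level_ordinal_sklearn"] = ordinal_encoder.fit_transform(df[["popularity_level"]]).ravel()

# Approach 3: a dictionary that maps each category to its rank
level_dict = {level: rank for rank, level in enumerate(level_order)}
df["level_ordinal_dict"] = df["popularity_level"].map(level_dict)

print(df)
```

Both columns carry the same ranks (low = 0, medium = 1, high = 2); OrdinalEncoder returns them as floats, while the dictionary lookup keeps them as integers.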


In [50]:
# One-Hot Encoding, Method 1: sklearn's OneHotEncoder (the input is reshaped to 2-D first)
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()

category_var1_onehot_encoded = onehot_encoder.fit_transform(trainData[category_var1].values.reshape((-1, 1))).toarray()

print_data(trainData[category_var1], category_var1_onehot_encoded.tolist())
onehot_encoder = OneHotEncoder()

category_var2_onehot_encoded = onehot_encoder.fit_transform(trainData[category_var2].values.reshape((-1, 1))).toarray()

print_data(trainData[category_var2], category_var2_onehot_encoded.tolist())

Before: You Can't Do That, After: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 

In [48]:


# One-Hot Encoding, Method 2: use a dictionary that maps each category to its one-hot vector
category_var1_classes = trainData[category_var1].unique()
category_var1_onehot_dict = {content: [1 if idx == encoded_idx else 0 for encoded_idx in range(len(category_var1_classes))]
                             for idx, content in enumerate(category_var1_classes)}

category_var1_onehot_encoded_dict = trainData[category_var1].apply(lambda x: category_var1_onehot_dict[x])

print_data(trainData[category_var1], category_var1_onehot_encoded_dict)

category_var2_classes = trainData[category_var2].unique()
category_var2_onehot_dict = {content: [1 if idx == encoded_idx else 0 for encoded_idx in range(len(category_var2_classes))]
                             for idx, content in enumerate(category_var2_classes)}

category_var2_onehot_encoded_dict = trainData[category_var2].apply(lambda x: category_var2_onehot_dict[x])

print_data(trainData[category_var2], category_var2_onehot_encoded_dict)

Before: You Can't Do That, After: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Before: When I Get Home, After: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## Frequency Encoding

This section implements Frequency Encoding mainly with a dictionary. The steps are:
1. Count the occurrences of each category
2. Convert the counts into a dictionary
3. Convert the categories

In [51]:
# Steps 1 and 2: count each category and store the counts as a dictionary
category_var1_frequency_dict = trainData[category_var1].value_counts().to_dict()


# Step 3: convert the categories (categories missing from the dictionary are encoded as 0)
trainData[category_var1 + '_frequency_dict'] = trainData[category_var1].apply(
    lambda x: category_var1_frequency_dict.get(x, 0))

print_data(trainData[category_var1], trainData[category_var1 + '_frequency_dict'])

Before: You Can't Do That, After: 1
Before: When I Get Home, After: 1
Before: That Means a Lot, After: 1
Before: Some Other Guy, After: 1
Before: Taxman, After: 1


In [52]:
# Steps 1 and 2: count each category and store the counts as a dictionary
category_var2_frequency_dict = trainData[category_var2].value_counts().to_dict()


# Step 3: convert the categories (categories missing from the dictionary are encoded as 0)
trainData[category_var2 + '_frequency_dict'] = trainData[category_var2].apply(
    lambda x: category_var2_frequency_dict.get(x, 0))

print_data(trainData[category_var2], trainData[category_var2 + '_frequency_dict'])

Before: You Can't Do That, After: 1
Before: When I Get Home, After: 1
Before: That Means a Lot, After: 1
Before: Some Other Guy, After: 1
Before: Taxman, After: 1
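
Every title in the training data is unique, so all the counts above come out as 1. The sketch below uses made-up album values with repeats to show what the encoding looks like in a more typical case, and how a category that never appears in the training data ends up as 0. The two Series are assumptions for illustration only.

```python
import pandas as pd

# Made-up training and test values for illustration only
train_albums = pd.Series(["Help!", "Revolver", "Help!", "Help!", "Revolver"])
test_albums = pd.Series(["Help!", "Let It Be"])  # "Let It Be" never appears in training

# Steps 1 and 2: count each category in the training data and store it as a dict
frequency_dict = train_albums.value_counts().to_dict()

# Step 3: look up each value; categories missing from the dict get 0
train_encoded = train_albums.apply(lambda x: frequency_dict.get(x, 0))
test_encoded = test_albums.apply(lambda x: frequency_dict.get(x, 0))

print(frequency_dict)
print(train_encoded.tolist(), test_encoded.tolist())
```

Building the dictionary from the training data only, and then reusing it on the test data, keeps information from the test set out of the encoding.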


## Feature Combination

This section shows how to combine categorical features with a dictionary. The steps are:
1. Build the combination rule and store it as a dictionary
2. Convert the categorical variables

In [53]:
# Step 1: collect the categories of categorical variable 1 and categorical variable 2
category_var1_feature_class = trainData[category_var1].unique()
category_var2_feature_class = trainData[category_var2].unique()

# Step 2: build a nested dictionary covering every pair of categories
combination_dict = {
    category_var1_class: {
        category_var2_class: "{}_{}".format(category_var1_class, category_var2_class) for category_var2_class in category_var2_feature_class
        } for category_var1_class in category_var1_feature_class
}

# Step 3: create the new feature in the DataFrame
trainData[category_var1 + '_' + category_var2] = trainData.apply(lambda x: combination_dict[x[category_var1]][x[category_var2]], axis = 1)
print(trainData[category_var1 + '_' + category_var2])

273                  You Can't Do That_You Can't Do That
261                      When I Get Home_When I Get Home
230                    That Means a Lot_That Means a Lot
217                        Some Other Guy_Some Other Guy
225                                        Taxman_Taxman
                             ...                        
111    I Want You (She's So Heavy)_I Want You (She's ...
118              I'll Follow the Sun_I'll Follow the Sun
97                   How Do You Do It?_How Do You Do It?
193                                            Rain_Rain
117                    I'll Cry Instead_I'll Cry Instead
Name: Title_Title, Length: 213, dtype: object
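
Because both variables in the cell above are `Title`, each combined value is just the title repeated. The same dictionary technique is easier to read with two distinct variables, as in this small sketch. The `Album` and `Mode` values here are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical two-column example; the values are made up
df = pd.DataFrame({
    "Album": ["Help!", "Revolver", "Help!"],
    "Mode": ["major", "minor", "minor"],
})

# Step 1: build the combination rule as a nested dictionary
combination_dict = {
    album: {mode: "{}_{}".format(album, mode) for mode in df["Mode"].unique()}
    for album in df["Album"].unique()
}

# Step 2: look up the combined label for every row
df["Album_Mode"] = df.apply(lambda row: combination_dict[row["Album"]][row["Mode"]], axis=1)

print(df["Album_Mode"].tolist())
```

The nested dictionary holds one entry for every possible pair of categories, so its size grows with the product of the two category counts. That is manageable for variables with few levels but gets large for something like song titles.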
