# Feature Engineering for Categorical Variables

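## Assignment Code
In this assignment, students are asked to complete the following:
1. Find a dataset you would like to explore on Kaggle and use it for this assignment.
2. Pick two suitable categorical variables and apply Label Encoding to each, using two different implementations.
3. Pick two suitable categorical variables and apply One-Hot Encoding to each, using two different implementations.
4. Pick two suitable categorical variables and apply Ordinal Encoding to each, using two of the three implementations covered.
5. Pick two suitable categorical variables, then implement and run Frequency Encoding.
6. Pick two suitable categorical variables, then implement and run Feature Combination.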

In [1]:
import pandas as pd

## Load the Data

In [2]:
import os

folder = './data/'
path = os.path.join(folder, 'all_video_games.csv')

raw_data = pd.read_csv(path)
raw_data.head()

Unnamed: 0,Title,Release Date,Developer,Publisher,Genres,Genres Splitted,Product Rating,User Score,User Ratings Count,Platforms Info
0,Ziggurat (2012),2/17/2012,Action Button Entertainment,Freshuu Inc.,Action,['Action'],,6.9,14.0,"[{'Platform': 'iOS (iPhone/iPad)', 'Platform M..."
1,4X4 EVO 2,11/15/2001,Terminal Reality,Gathering,Auto Racing Sim,"['Auto', 'Racing', 'Sim']",Rated E For Everyone,,,"[{'Platform': 'Xbox', 'Platform Metascore': '5..."
2,MotoGP 2 (2001),1/22/2002,Namco,Namco,Auto Racing Sim,"['Auto', 'Racing', 'Sim']",Rated E For Everyone,5.8,,"[{'Platform': 'PlayStation 2', 'Platform Metas..."
3,Gothic 3,11/14/2006,Piranha Bytes,Aspyr,Western RPG,"['Western', 'RPG']",Rated T For Teen,7.5,832.0,"[{'Platform': 'PC', 'Platform Metascore': '63'..."
4,Siege Survival: Gloria Victis,5/18/2021,FishTankStudio,Black Eye Games,RPG,['RPG'],,6.5,10.0,"[{'Platform': 'PC', 'Platform Metascore': '69'..."


In [3]:
# To keep the exercise simple, drop rows with missing values first
raw_data = raw_data.dropna()

## Split the Raw Data into Training and Test Sets

In [4]:
from sklearn.model_selection import train_test_split

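trainData, testData = train_test_split(raw_data, test_size=0.25, random_state=214)
# Work on explicit copies so that adding encoded columns later does not trigger SettingWithCopyWarning
trainData, testData = trainData.copy(), testData.copy()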

In [5]:
def print_data(before, after) -> None:
    print_limit = 0

    for former, latter in zip(before, after):
        print("Before: {}, After: {}".format(former, latter))
        print_limit += 1

        if print_limit == 5:
            break

## Label Encoding

This section introduces two ways to perform Label Encoding:
1. Using the LabelEncoder class from sklearn
2. Using a dictionary

Example: apply Label Encoding to the Genres and Developer variables

In [6]:
# Method 1: use the LabelEncoder class from sklearn
from sklearn.preprocessing import LabelEncoder

category_var1 = 'Genres'

label_encoder = LabelEncoder()
trainData[category_var1 + '_label_encoded'] = label_encoder.fit_transform(trainData[category_var1])

print_data(trainData[category_var1], trainData[category_var1 + '_label_encoded'])

Before: 2D Platformer, After: 2
Before: Party, After: 67
Before: Auto Racing, After: 19
Before: 2D Platformer, After: 2
Before: Real-Time Strategy, After: 75


In [7]:
category_var2 = 'Developer'

trainData[category_var2 + '_label_encoded'] = label_encoder.fit_transform(trainData[category_var2])

print_data(trainData[category_var2], trainData[category_var2 + '_label_encoded'])

Before: Steel Wool Games, After: 1764
Before: Sonic Team, After: 1722
Before: Criterion Games, After: 417
Before: Tozai Games, After: 1940
Before: Creative Assembly, After: 409


In [8]:
# Method 2: use a dictionary (codes follow the order of first appearance)
category_var1_encoded_dict = {content: idx for idx, content in enumerate(trainData[category_var1].unique())}

trainData[category_var1 + '_label_encoded_dict'] = trainData[category_var1].apply(
    lambda x: category_var1_encoded_dict[x])

print_data(trainData[category_var1], trainData[category_var1 + '_label_encoded_dict'])

Before: 2D Platformer, After: 0
Before: Party, After: 1
Before: Auto Racing, After: 2
Before: 2D Platformer, After: 0
Before: Real-Time Strategy, After: 3


In [9]:
category_var2_encoded_dict = {content: idx for idx, content in enumerate(trainData[category_var2].unique())}

trainData[category_var2 + '_label_encoded_dict'] = trainData[category_var2].apply(
    lambda x: category_var2_encoded_dict[x])

print_data(trainData[category_var2], trainData[category_var2 + '_label_encoded_dict'])

Before: Steel Wool Games, After: 0
Before: Sonic Team, After: 1
Before: Criterion Games, After: 2
Before: Tozai Games, After: 3
Before: Creative Assembly, After: 4
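
The cells above encode only the training data. Below is a minimal sketch of how the dictionary from Method 2 might be reused on a test set, assuming unseen categories should receive a sentinel code of -1. The toy values and the names `train_genres`, `test_genres`, and `genre_to_code` are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Toy stand-ins for the training and test columns (hypothetical values)
train_genres = pd.Series(["Action", "RPG", "Action", "Puzzle"])
test_genres = pd.Series(["RPG", "Racing", "Action"])  # "Racing" never appears in training

# Build the mapping from the training data only
genre_to_code = {genre: idx for idx, genre in enumerate(train_genres.unique())}

# Series.map returns NaN for keys missing from the dict; replace those with -1
train_codes = train_genres.map(genre_to_code)
test_codes = test_genres.map(genre_to_code).fillna(-1).astype(int)

print(train_codes.tolist())  # [0, 1, 0, 2]
print(test_codes.tolist())   # [1, -1, 0]
```

Using `map` instead of `apply` with a direct dictionary lookup avoids a `KeyError` when the test set contains a category that the training set does not.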


## One-Hot Encoding

This section introduces two ways to perform One-Hot Encoding:
1. Using the OneHotEncoder class from sklearn
2. Using a dictionary

Example: apply One-Hot Encoding to the Genres and Developer variables

In [10]:
# Method 1: use the OneHotEncoder class from sklearn
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()

category_var1_onehot_encoded = onehot_encoder.fit_transform(trainData[category_var1].values.reshape((-1, 1))).toarray()

print_data(trainData[category_var1], category_var1_onehot_encoded.tolist())

Before: 2D Platformer, After: [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Before: Party, After: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0

In [11]:
onehot_encoder = OneHotEncoder()

category_var2_onehot_encoded = onehot_encoder.fit_transform(trainData[category_var2].values.reshape((-1, 1))).toarray()

print_data(trainData[category_var2], category_var2_onehot_encoded.tolist())

Before: Steel Wool Games, After: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0

> Note: OneHotEncoder expects two-dimensional input, so reshape the data to 2-D before transforming it.
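
As a small illustration of the 2-D requirement, the sketch below selects the column with double brackets, which yields a one-column DataFrame instead of a 1-D Series. The toy frame is hypothetical, and `handle_unknown="ignore"` is an optional setting chosen here to show what happens to a category not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with a hypothetical 'Genres' column
toy = pd.DataFrame({"Genres": ["Action", "RPG", "Action", "Puzzle"]})

# Double brackets select a DataFrame of shape (4, 1) rather than a 1-D Series
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy[["Genres"]]).toarray()

# categories_ holds the sorted category labels, one array per input column
print(encoder.categories_[0].tolist())  # ['Action', 'Puzzle', 'RPG']
print(encoded.shape)                    # (4, 3)

# With handle_unknown="ignore", an unseen category becomes an all-zero row
unseen = encoder.transform(pd.DataFrame({"Genres": ["Racing"]})).toarray()
print(unseen.tolist())  # [[0.0, 0.0, 0.0]]
```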

In [12]:
# Method 2: use a dictionary that maps each category to its one-hot vector
category_var1_classes = trainData[category_var1].unique()
category_var1_onehot_dict = {content: [1 if idx == encoded_idx else 0 for encoded_idx in range(len(category_var1_classes))]
                             for idx, content in enumerate(category_var1_classes)}

category_var1_onehot_encoded_dict = trainData[category_var1].apply(lambda x: category_var1_onehot_dict[x])

print_data(trainData[category_var1], category_var1_onehot_encoded_dict)

Before: 2D Platformer, After: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Before: Party, After: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Before: Auto Racing, After: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [13]:
category_var2_classes = trainData[category_var2].unique()
category_var2_onehot_dict = {content: [1 if idx == encoded_idx else 0 for encoded_idx in range(len(category_var2_classes))]
                             for idx, content in enumerate(category_var2_classes)}

category_var2_onehot_encoded_dict = trainData[category_var2].apply(lambda x: category_var2_onehot_dict[x])

print_data(trainData[category_var2], category_var2_onehot_encoded_dict)

Before: Steel Wool Games, After: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
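
For comparison, pandas also offers `pd.get_dummies`, which is a different technique from the two methods used in this notebook. The sketch below uses hypothetical toy data and assumes the test columns should be aligned to the training columns, with categories unseen in training dropped.

```python
import pandas as pd

# Hypothetical toy columns
train = pd.Series(["Action", "RPG", "Action"], name="Genres")
test = pd.Series(["RPG", "Racing"], name="Genres")  # "Racing" is unseen in training

# One indicator column per category found in the training data
train_dummies = pd.get_dummies(train, prefix="Genres", dtype=int)

# Align the test columns to the training columns; missing columns are filled with 0
test_dummies = pd.get_dummies(test, prefix="Genres", dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0)

print(train_dummies.columns.tolist())  # ['Genres_Action', 'Genres_RPG']
print(test_dummies.values.tolist())    # [[0, 1], [0, 0]]
```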

## Ordinal Encoding

This section introduces three ways to perform Ordinal Encoding:
1. Using sklearn's LabelEncoder with a custom category order (less recommended)
2. Using sklearn's OrdinalEncoder with a custom category order (recommended)
3. Using a dictionary


In [14]:
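# Method 1: use sklearn's LabelEncoder
# Note: LabelEncoder assigns codes in sorted (alphabetical) order and has no option for a custom order.
# custom_order is a placeholder that is never passed to the encoder, and its values do not
# match the categories in this column, so the codes below reflect alphabetical order only.
category_var3 = 'Product Rating'
custom_order = ['Low', 'Medium', 'High']
label_encoder_custom_order = LabelEncoder()
trainData[category_var3 + '_custom_order'] = label_encoder_custom_order.fit_transform(
    trainData[category_var3].astype(str))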

print_data(trainData[category_var3], trainData[category_var3 + '_custom_order'])

Before: Rated E +10 For Everyone +10, After: 0
Before: Rated T For Teen, After: 4
Before: Rated E +10 For Everyone +10, After: 0
Before: Rated E For Everyone, After: 1
Before: Rated T For Teen, After: 4


In [15]:
# Method 2: use sklearn's OrdinalEncoder
# Without the `categories` argument, OrdinalEncoder also orders the categories alphabetically
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()

trainData[category_var3 + '_ordinal_encoded'] = ordinal_encoder.fit_transform(trainData[[category_var3]].astype(str))

print_data(trainData[category_var3], trainData[category_var3 + '_ordinal_encoded'])

Before: Rated E +10 For Everyone +10, After: 0.0
Before: Rated T For Teen, After: 4.0
Before: Rated E +10 For Everyone +10, After: 0.0
Before: Rated E For Everyone, After: 1.0
Before: Rated T For Teen, After: 4.0
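
The cell above calls `OrdinalEncoder()` with no arguments, so the codes follow alphabetical order. A minimal sketch of passing an explicit order through the `categories` parameter follows. The `level` column and its Low/Medium/High labels are hypothetical; for a real column, the list would need to contain every category present in the data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered labels, used only for illustration
toy = pd.DataFrame({"level": ["Medium", "Low", "High", "Low"]})
custom_order = ["Low", "Medium", "High"]

# `categories` takes one list per input column, in the desired order
encoder = OrdinalEncoder(categories=[custom_order])
toy["level_encoded"] = encoder.fit_transform(toy[["level"]]).ravel()

print(toy["level_encoded"].tolist())  # [1.0, 0.0, 2.0, 0.0]
```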


In [16]:
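# Method 3: use a dictionary
# unique() returns categories in order of first appearance, so these codes do not reflect a meaningful ranking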
category_var3_ordinal_encoded_dict = {content: idx for idx, content in enumerate(trainData[category_var3].unique())}

trainData[category_var3 + '_ordinal_encoded_dict'] = trainData[category_var3].apply(
    lambda x: category_var3_ordinal_encoded_dict[x])

print_data(trainData[category_var3], trainData[category_var3 + '_ordinal_encoded_dict'])

Before: Rated E +10 For Everyone +10, After: 0
Before: Rated T For Teen, After: 1
Before: Rated E +10 For Everyone +10, After: 0
Before: Rated E For Everyone, After: 2
Before: Rated T For Teen, After: 1
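
If the codes are meant to carry a ranking, the dictionary can instead be built from an explicitly ordered list. This is a sketch on hypothetical Low/Medium/High labels, not on the actual rating categories.

```python
import pandas as pd

# Hypothetical ordered labels, used only for illustration
levels = pd.Series(["Medium", "Low", "High", "Low"])
custom_order = ["Low", "Medium", "High"]

# The position in custom_order becomes the ordinal code
order_dict = {label: rank for rank, label in enumerate(custom_order)}
encoded = levels.map(order_dict)

print(order_dict)        # {'Low': 0, 'Medium': 1, 'High': 2}
print(encoded.tolist())  # [1, 0, 2, 0]
```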


## Frequency Encoding

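This section implements Frequency Encoding with a dictionary, in three steps:
1. Count the number of occurrences of each category
2. Convert the counts into a dictionary
3. Map each category to its count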

In [17]:
# Steps 1-2: count the occurrences of each category and store the counts in a dictionary
category_var1_frequency_dict = trainData[category_var1].value_counts().to_dict()

# Step 3: map each category to its count (categories missing from the dictionary are encoded as 0)
trainData[category_var1 + '_frequency_dict'] = trainData[category_var1].apply(
    lambda x: category_var1_frequency_dict.get(x, 0))

print_data(trainData[category_var1], trainData[category_var1 + '_frequency_dict'])

Before: 2D Platformer, After: 350
Before: Party, After: 89
Before: Auto Racing, After: 169
Before: 2D Platformer, After: 350
Before: Real-Time Strategy, After: 189


In [18]:
# Steps 1-2: count the occurrences of each category and store the counts in a dictionary
category_var2_frequency_dict = trainData[category_var2].value_counts().to_dict()

# Step 3: map each category to its count (categories missing from the dictionary are encoded as 0)
trainData[category_var2 + '_frequency_dict'] = trainData[category_var2].apply(
    lambda x: category_var2_frequency_dict.get(x, 0))

print_data(trainData[category_var2], trainData[category_var2 + '_frequency_dict'])

Before: Steel Wool Games, After: 2
Before: Sonic Team, After: 28
Before: Criterion Games, After: 11
Before: Tozai Games, After: 2
Before: Creative Assembly, After: 21
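
The fallback to 0 matters mostly when the training counts are applied to other data. Below is a sketch on hypothetical toy values, assuming the counts come from the training set only. It also shows a relative-frequency variant, which is an optional alternative to raw counts.

```python
import pandas as pd

# Hypothetical toy columns
train_dev = pd.Series(["Namco", "Crytek", "Namco", "Namco", "Crytek", "Aspyr"])
test_dev = pd.Series(["Crytek", "Unknown Studio", "Namco"])

# Counts are computed from the training data only
frequency_dict = train_dev.value_counts().to_dict()

# Categories absent from the training data map to NaN, which is replaced with 0
test_frequency = test_dev.map(frequency_dict).fillna(0).astype(int)
print(test_frequency.tolist())  # [2, 0, 3]

# Optional variant: relative frequency (share of rows) instead of raw counts
relative = train_dev.map(train_dev.value_counts(normalize=True))
print(relative.round(2).tolist())  # [0.5, 0.33, 0.5, 0.5, 0.33, 0.17]
```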


## Feature Combination

This section combines two categorical features using a dictionary, in the following steps:
1. Build the combination rule and store it as a dictionary
2. Convert the categorical variables

In [19]:
# Step 1. Collect the categories of the first and second categorical variables
category_var1_feature_class = trainData[category_var1].unique()
category_var2_feature_class = trainData[category_var2].unique()

# Step 2. Build a nested dictionary covering every pair of categories
combination_dict = {
    category_var1_class: {
        category_var2_class: "{}_{}".format(category_var1_class, category_var2_class)
        for category_var2_class in category_var2_feature_class
    } for category_var1_class in category_var1_feature_class
}

# Step 3. Create the new feature in the DataFrame
trainData[category_var1 + '_' + category_var2] = trainData.apply(
    lambda x: combination_dict[x[category_var1]][x[category_var2]], axis=1)
print(trainData[category_var1 + '_' + category_var2])


7990             2D Platformer_Steel Wool Games
442                            Party_Sonic Team
7631                Auto Racing_Criterion Games
10571                 2D Platformer_Tozai Games
3974       Real-Time Strategy_Creative Assembly
                          ...                  
5525                       Puzzle_Creat Studios
2394        Real-Time Strategy_Mad Doc Software
12867    First-Person Adventure_Terri Vellimann
2555         Open-World Action_SCE Japan Studio
2701                                 FPS_Crytek
Name: Genres_Developer, Length: 6813, dtype: object
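
The nested dictionary stores an entry for every possible pair of categories, which can grow large when both variables have many levels. As a lighter alternative, the sketch below builds the same kind of combined label by concatenating the two columns directly. The toy frame reuses a few pairs from the output above, and the approach is a suggestion, not the method used in this notebook.

```python
import pandas as pd

# Toy frame reusing a few category pairs shown in the output above
toy = pd.DataFrame({
    "Genres": ["Party", "FPS", "Puzzle"],
    "Developer": ["Sonic Team", "Crytek", "Creat Studios"],
})

# Element-wise string concatenation yields one combined label per row
toy["Genres_Developer"] = toy["Genres"].astype(str) + "_" + toy["Developer"].astype(str)

print(toy["Genres_Developer"].tolist())
# ['Party_Sonic Team', 'FPS_Crytek', 'Puzzle_Creat Studios']
```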
