# Feature Engineering for Categorical Variables

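## Assignment Code
In this assignment, complete the following tasks:
1. Find a dataset you would like to explore on the Kaggle platform and use it for this assignment.
2. Choose two suitable categorical variables and apply Label Encoding using two different implementations.
3. Choose two suitable categorical variables and apply One-Hot Encoding using two different implementations.
4. Choose two suitable categorical variables and apply Ordinal Encoding using two of the three implementations covered.
5. Choose two suitable categorical variables, then write and run Frequency Encoding.
6. Choose two suitable categorical variables, then write and run Feature Combination.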

In [21]:
import numpy as np
import pandas as pd

## Load the Data

In [22]:
# Load the data
file_path = r"D:\Github\ML100Days\train_house.csv"
raw_data = pd.read_csv(file_path)
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [5]:
# For simplicity, missing values could be dropped here (currently disabled); the Kaggle file is already the training data
#raw_data = raw_data.dropna()


## Split the Raw Data into Training and Test Sets

In [3]:
from sklearn.model_selection import train_test_split

trainData, testData = train_test_split(raw_data, test_size = 0.25, random_state = 214)

## Label Encoding

Two ways to perform Label Encoding are introduced here:
1. Using the LabelEncoder class from scikit-learn
2. Using a Python dictionary

Example: apply Label Encoding to Neighborhood and HouseStyle

In [23]:
# Method 1: LabelEncoder from scikit-learn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# ★ Pick two categorical columns
cat1 = 'Neighborhood'
cat2 = 'HouseStyle'
le = LabelEncoder()

# fit_transform refits the encoder on every call, so after the second call
# `le` only holds the mapping for cat2
trainData[cat1 + '_label_sk'] = le.fit_transform(trainData[cat1])
trainData[cat2 + '_label_sk'] = le.fit_transform(trainData[cat2])

print("===== Label Encoding (Method 1: sklearn LabelEncoder) =====\n")

print(f"[{cat1}] original value vs. label-encoded value:")
print(trainData[[cat1, cat1 + '_label_sk']].head(10))
print("\n")

print(f"[{cat2}] original value vs. label-encoded value:")
print(trainData[[cat2, cat2 + '_label_sk']].head(10))
print("\n")

print("=== Done: Method 1 Label Encoding results ===")

===== Label Encoding (Method 1: sklearn LabelEncoder) =====

[Neighborhood] original value vs. label-encoded value:
     Neighborhood  Neighborhood_label_sk
1091      Somerst                     21
420       Mitchel                     11
74        OldTown                     17
845        Sawyer                     19
803       NridgHt                     16
843         NAmes                     12
826       BrkSide                      3
408       NridgHt                     16
339         NAmes                     12
247         NAmes                     12


[HouseStyle] original value vs. label-encoded value:
     HouseStyle  HouseStyle_label_sk
1091     2Story                    5
420      SFoyer                    6
74       2Story                    5
845      SFoyer                    6
803      2Story                    5
843      1Story                    2
826      1.5Unf                    1
408      2Story                    5
339      1Story                    2
247      1Story                    2


=== Done: Method 1 Label Enco

In [11]:
# Method 2: a Python dictionary
# Codes are assigned in order of first appearance, so they differ from LabelEncoder's sorted codes
cat1_dict = {cat: idx for idx, cat in enumerate(trainData[cat1].unique())}
trainData[cat1 + '_label_dict'] = trainData[cat1].apply(lambda x: cat1_dict[x])

# HouseStyle
cat2_dict = {cat: idx for idx, cat in enumerate(trainData[cat2].unique())}
trainData[cat2 + '_label_dict'] = trainData[cat2].apply(lambda x: cat2_dict[x])

print("Label Encoding done!")
trainData[[cat1, cat1 + '_label_sk', cat1 + '_label_dict']].head()


Label Encoding done!


     Neighborhood  Neighborhood_label_sk  Neighborhood_label_dict
1091      Somerst                     21                        0
420       Mitchel                     11                        1
74        OldTown                     17                        2
845        Sawyer                     19                        3
803       NridgHt                     16                        4
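
Because the data was split into trainData and testData, a mapping learned on the training split can meet categories it has never seen. The following is a minimal sketch on toy data; the `train` and `test` frames and the `-1` sentinel are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for trainData / testData (hypothetical values)
train = pd.DataFrame({"Neighborhood": ["NAmes", "OldTown", "NAmes", "Sawyer"]})
test = pd.DataFrame({"Neighborhood": ["Sawyer", "Veenker"]})  # "Veenker" is unseen

# Build the mapping from the training split only, in order of first appearance
label_map = {cat: idx for idx, cat in enumerate(train["Neighborhood"].unique())}

train["Neighborhood_label"] = train["Neighborhood"].map(label_map)

# Categories missing from the mapping become NaN after map();
# -1 is an arbitrary sentinel chosen for this sketch
test["Neighborhood_label"] = (
    test["Neighborhood"].map(label_map).fillna(-1).astype(int)
)

print(label_map)
print(test)
```

Using `map` instead of `apply(lambda x: d[x])` avoids a `KeyError` when a category is absent from the dictionary.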


## One-Hot Encoding

Two ways to perform One-Hot Encoding are introduced here:
1. Using the OneHotEncoder class from scikit-learn
2. Using a Python dictionary

Example: apply One-Hot Encoding to MSZoning and BldgType

In [12]:
# ★ Pick two categorical columns
cat3 = 'MSZoning'
cat4 = 'BldgType'

In [30]:
# Method 1: OneHotEncoder from scikit-learn
from sklearn.preprocessing import OneHotEncoder

# `sparse_output` requires scikit-learn >= 1.2; older versions use `sparse=False`
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_array = ohe.fit_transform(trainData[[cat3]])

ohe_df = pd.DataFrame(
    ohe_array,
    columns=[cat3 + "_" + str(c) for c in ohe.categories_[0]],
    index=trainData.index
)

# Note: re-running this cell appends the same columns again, which is the most
# likely reason the result header below repeats `MSZoning_C (all)`
trainData = pd.concat([trainData, ohe_df], axis=1)

print("===== Method 1: sklearn OneHotEncoder =====\n")

print(f"[Original column: {cat3}] first 10 rows:")
print(trainData[cat3].head(10))
print("\n")

print(f"[Column names produced by One-Hot Encoding {cat3}]:")
print(ohe_df.columns.tolist())
print("\n")

print("[First 10 rows of the One-Hot result]:")
print(trainData[ohe_df.columns].head(10))
print("\n")

===== Method 1: sklearn OneHotEncoder =====

[Original column: MSZoning] first 10 rows:
1091    FV
420     RM
74      RM
845     RL
803     RL
843     RL
826     RM
408     RL
339     RL
247     RL
Name: MSZoning, dtype: object


[Column names produced by One-Hot Encoding MSZoning]:
['MSZoning_C (all)', 'MSZoning_FV', 'MSZoning_RH', 'MSZoning_RL', 'MSZoning_RM', 'MSZoning_nan']


[First 10 rows of the One-Hot result]:
      MSZoning_C (all)  MSZoning_C (all)  MSZoning_C (all)  MSZoning_C (all)  \
1091               0.0               0.0               0.0               0.0   
420                0.0               0.0               0.0               0.0   
74                 0.0               0.0               0.0               0.0   
845                0.0               0.0               0.0               0.0   
803                0.0               0.0               0.0               0.0   
843                0.0               0.0               0.0               0.0   
826                0.0               0.0               0.0               0.0   
408   



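> Note: OneHotEncoder expects 2-D input, so select the column with double brackets (for example `trainData[[cat3]]`), which passes a one-column DataFrame rather than a 1-D Series.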
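
The shape difference behind this note can be checked on a toy column. This is a small sketch with made-up values; it relies on the encoder's default sparse output and converts it with `.toarray()`, so it does not depend on the name of the dense-output parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column (hypothetical values)
df = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})

series_1d = df["MSZoning"]    # single brackets: 1-D Series, shape (4,)
frame_2d = df[["MSZoning"]]   # double brackets: 2-D DataFrame, shape (4, 1)
print(series_1d.shape, frame_2d.shape)

ohe = OneHotEncoder(handle_unknown="ignore")
# The default result is a SciPy sparse matrix; toarray() makes it dense
encoded = ohe.fit_transform(frame_2d).toarray()

print(list(ohe.categories_[0]))  # categories are stored in sorted order
print(encoded)
```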

In [28]:
# Method 2: a Python dictionary
unique_vals = trainData[cat4].unique()
# Category -> index mapping (kept for reference; the loop below only needs unique_vals)
cat4_ohe_dict = {val: idx for idx, val in enumerate(unique_vals)}

# Create one 0/1 indicator column per category
# (a NaN category never equals anything, so its column stays all 0)
for val in unique_vals:
    col_name = f"{cat4}_{val}"
    trainData[col_name] = (trainData[cat4] == val).astype(int)

print(f"[{cat4}: hand-built One-Hot columns]:")
print([f"{cat4}_{val}" for val in unique_vals][:10])  # print only the first 10 column names
print(trainData[[cat4] + [f"{cat4}_{val}" for val in unique_vals]].head())

[BldgType: hand-built One-Hot columns]:
['BldgType_Twnhs', 'BldgType_Duplex', 'BldgType_1Fam', 'BldgType_TwnhsE', 'BldgType_2fmCon', 'BldgType_nan']
     BldgType  BldgType_Twnhs  BldgType_Duplex  BldgType_1Fam  \
1091    Twnhs               1                0              0   
420    Duplex               0                1              0   
74       1Fam               0                0              1   
845      1Fam               0                0              1   
803      1Fam               0                0              1   

      BldgType_TwnhsE  BldgType_2fmCon  BldgType_nan  
1091                0                0             0  
420                 0                0             0  
74                  0                0             0  
845                 0                0             0  
803                 0                0             0  
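
For comparison, pandas also ships a helper for the same transformation. The sketch below uses `pd.get_dummies`, which is an alternative to the two methods above and is not used in this notebook; the toy values are assumptions.

```python
import pandas as pd

# Toy column with one missing value (hypothetical values)
df = pd.DataFrame({"BldgType": ["1Fam", "Duplex", "1Fam", None]})

# One indicator column per observed category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df["BldgType"], prefix="BldgType", dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

By default the missing value gets no column of its own, so its row is all zeros.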


## Ordinal Encoding

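Three ways to perform Ordinal Encoding are introduced here:
1. Using LabelEncoder from scikit-learn with a custom category order (not recommended)
2. Using OrdinalEncoder from scikit-learn with a custom category order (recommended)
3. Using a Python dictionary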
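
Why the first option is discouraged can be seen on a toy column: LabelEncoder keeps its classes in sorted order regardless of the order they were fitted in, whereas OrdinalEncoder honors the `categories` list. A minimal sketch with made-up rows:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy quality column (hypothetical values)
quality = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", "Fa"]})
qual_order = ["na", "Po", "Fa", "TA", "Gd", "Ex"]  # worst to best

# LabelEncoder: the fitted classes end up sorted, not in qual_order
le = LabelEncoder().fit(qual_order)
label_codes = le.transform(quality["ExterQual"])
print(list(le.classes_))
print(label_codes)

# OrdinalEncoder: codes follow the position of each value in qual_order
oe = OrdinalEncoder(categories=[qual_order])
ordinal_codes = oe.fit_transform(quality[["ExterQual"]]).ravel()
print(ordinal_codes)

# Dictionary: the same order written out explicitly
qual_map = {q: i for i, q in enumerate(qual_order)}
dict_codes = quality["ExterQual"].map(qual_map)
print(dict_codes.tolist())
```

Only the last two approaches yield codes where a larger number means better quality.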


In [41]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

cat5 = 'ExterQual'
cat6 = 'KitchenQual'

trainData[cat5] = trainData[cat5].fillna("na")
trainData[cat6] = trainData[cat6].fillna("na")

# Custom order (from worst to best)
qual_order = ["na", "Po", "Fa", "TA", "Gd", "Ex"]
qual_map = {"na": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

In [42]:
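# Method 1: LabelEncoder from scikit-learn with a custom category order
# ExterQual/KitchenQual grades: Ex > Gd > TA > Fa > Po
# Caveat: LabelEncoder stores its classes in sorted order, so fitting on qual_order
# does not preserve the custom order (in the output below, Ex -> 0, Gd -> 2, TA -> 4)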

le = LabelEncoder()
le.fit(qual_order)

trainData[cat5 + "_ord_label"] = le.transform(trainData[cat5])
trainData[cat6 + "_ord_label"] = le.transform(trainData[cat6])

In [43]:
# Method 2: OrdinalEncoder from scikit-learn
oe = OrdinalEncoder(categories=[qual_order])

trainData[cat5 + "_ord_oe"] = oe.fit_transform(trainData[[cat5]])
trainData[cat6 + "_ord_oe"] = oe.fit_transform(trainData[[cat6]])

In [44]:
# Method 3: a Python dictionary
trainData[cat5 + "_ord_dict"] = trainData[cat5].map(qual_map)
trainData[cat6 + "_ord_dict"] = trainData[cat6].map(qual_map)

In [45]:
print("===== ExterQual =====")
print(trainData[[cat5, cat5 + "_ord_label", cat5 + "_ord_oe", cat5 + "_ord_dict"]].head())
print("\n")

print("===== KitchenQual =====")
print(trainData[[cat6, cat6 + "_ord_label", cat6 + "_ord_oe", cat6 + "_ord_dict"]].head())


===== ExterQual =====
     ExterQual  ExterQual_ord_label  ExterQual_ord_oe  ExterQual_ord_dict
1091        Gd                    2               4.0                   4
420         TA                    4               3.0                   3
74          Gd                    2               4.0                   4
845         TA                    4               3.0                   3
803         Ex                    0               5.0                   5


===== KitchenQual =====
     KitchenQual  KitchenQual_ord_label  KitchenQual_ord_oe  \
1091          Gd                      2                 4.0   
420           TA                      4                 3.0   
74            TA                      4                 3.0   
845           TA                      4                 3.0   
803           Ex                      0                 5.0   

      KitchenQual_ord_dict  
1091                     4  
420                      3  
74                       3  
845          

## Frequency Encoding

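Frequency Encoding is implemented here mainly with a Python dictionary, in three steps:
1. Count the occurrences of each category
2. Convert the result into a dictionary
3. Map each category to its value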
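
The three steps can be sketched on toy data as follows. The `train` and `test` frames are made up for illustration; as in the notebook code, relative frequencies are used (`normalize=True`) and categories that are absent from the dictionary are set to 0.

```python
import pandas as pd

# Toy stand-ins for the training and test splits (hypothetical values)
train = pd.DataFrame({"Exterior1st": ["VinylSd", "VinylSd", "MetalSd", "HdBoard"]})
test = pd.DataFrame({"Exterior1st": ["MetalSd", "Stucco"]})  # "Stucco" is unseen

# Steps 1 and 2: relative frequency of each category, as a dictionary
freq_dict = train["Exterior1st"].value_counts(normalize=True).to_dict()

# Step 3: replace each category by its frequency; unmatched categories become 0
train["Exterior1st_freq"] = train["Exterior1st"].map(freq_dict).fillna(0)
test["Exterior1st_freq"] = test["Exterior1st"].map(freq_dict).fillna(0)

print(freq_dict)
print(test)
```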

In [18]:
cat7 = 'Neighborhood'   # neighborhood
cat8 = 'Exterior1st'    # exterior wall material

In [49]:
# Step 1 and Step 2: compute the relative frequency of each category as a dictionary
freq_dict_cat7 = trainData[cat7].value_counts(normalize=True).to_dict()
freq_dict_cat8 = trainData[cat8].value_counts(normalize=True).to_dict()

print("[Neighborhood category frequency dictionary]")
print(freq_dict_cat7)

print("[Exterior1st category frequency dictionary]")
print(freq_dict_cat8)


# Step 3: map each category to its frequency value
# Categories with no match in the dictionary (including missing values) are set to 0
trainData[cat7 + "_freq"] = trainData[cat7].map(freq_dict_cat7).fillna(0)
trainData[cat8 + "_freq"] = trainData[cat8].map(freq_dict_cat8).fillna(0)
print("===== Frequency Encoding result (Neighborhood) =====")
print(trainData[[cat7, cat7 + "_freq"]].head(10))
print("\n")

print("===== Frequency Encoding result (Exterior1st) =====")
print(trainData[[cat8, cat8 + "_freq"]].head(10))
print("\n")

[Neighborhood category frequency dictionary]
{'NAmes': 0.1589041095890411, 'CollgCr': 0.10319634703196347, 'OldTown': 0.0821917808219178, 'Edwards': 0.07214611872146119, 'NridgHt': 0.0593607305936073, 'Somerst': 0.057534246575342465, 'NWAmes': 0.0547945205479452, 'Gilbert': 0.052968036529680365, 'Sawyer': 0.04657534246575343, 'BrkSide': 0.0365296803652968, 'SawyerW': 0.03561643835616438, 'Mitchel': 0.034703196347031964, 'Crawfor': 0.0319634703196347, 'IDOTRR': 0.029223744292237442, 'NoRidge': 0.029223744292237442, 'Timber': 0.02009132420091324, 'ClearCr': 0.017351598173515982, 'SWISU': 0.017351598173515982, 'StoneBr': 0.014611872146118721, 'BrDale': 0.011872146118721462, 'MeadowV': 0.01004566210045662, 'Veenker': 0.00821917808219178, 'Blmngtn': 0.0073059360730593605, 'NPkVill': 0.006392694063926941, 'Blueste': 0.0018264840182648401}
[Exterior1st category frequency dictionary]
{'VinylSd': 0.354337899543379, 'MetalSd': 0.15616438356164383, 'HdBoard': 0.1506849315068493, 'Wd Sdng': 0.13972602739726028, 'Plywood': 0.

## Feature Combination

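Feature Combination of categorical features is implemented here with a Python dictionary, in two steps:
1. Build the combination rule as a dictionary
2. Transform the categorical variables with it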
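
The two steps can be sketched on a toy frame as follows. The rows are made up for illustration, and the `-1` fallback for combinations missing from the dictionary is an assumption of this sketch.

```python
import pandas as pd

# Toy frame with two categorical columns (hypothetical values)
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "OldTown"],
    "HouseStyle": ["1Story", "2Story", "1Story"],
})

# Step 1: one integer code per "A_B" pair, over all pairs of observed categories
combo_dict = {}
for a in df["Neighborhood"].unique():
    for b in df["HouseStyle"].unique():
        combo_dict[f"{a}_{b}"] = len(combo_dict)

# Step 2: build the combined key for each row and look up its code
df["combo"] = df["Neighborhood"] + "_" + df["HouseStyle"]
df["combo_code"] = df["combo"].map(combo_dict).fillna(-1).astype(int)

print(combo_dict)
print(df)
```

The dictionary covers every pair in the Cartesian product, so it also holds codes for pairs that never occur in the data, such as `OldTown_2Story` here.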

In [50]:
# The two categorical columns
cat_a = 'Neighborhood'
cat_b = 'HouseStyle'

In [51]:
# Step 1. Extract the categories of the first and second categorical variables
unique_a = trainData[cat_a].unique()
unique_b = trainData[cat_b].unique()

print("[Neighborhood categories]")
print(unique_a)
print("\n")

print("[HouseStyle categories]")
print(unique_b)


[Neighborhood categories]
['Somerst' 'Mitchel' 'OldTown' 'Sawyer' 'NridgHt' 'NAmes' 'BrkSide'
 'NWAmes' 'Gilbert' 'SawyerW' 'StoneBr' 'Veenker' 'Edwards' 'CollgCr'
 'BrDale' 'NoRidge' 'IDOTRR' 'ClearCr' 'Crawfor' 'Timber' 'MeadowV'
 'SWISU' 'Blmngtn' 'NPkVill' 'Blueste' nan]


[HouseStyle categories]
['2Story' 'SFoyer' '1Story' '1.5Unf' '1.5Fin' 'SLvl' '2.5Unf' '2.5Fin' nan]


In [52]:
# Step 2. Build a dictionary over every pair of categories
# key = "categoryA_categoryB"
# value = code (an increasing integer)
combination_dict = {}

idx = 0
for a in unique_a:
    for b in unique_b:
        combination_dict[f"{a}_{b}"] = idx
        idx += 1

print("[Feature Combination dictionary (first 20 entries)]")
print(dict(list(combination_dict.items())[:20]))


[Feature Combination dictionary (first 20 entries)]
{'Somerst_2Story': 0, 'Somerst_SFoyer': 1, 'Somerst_1Story': 2, 'Somerst_1.5Unf': 3, 'Somerst_1.5Fin': 4, 'Somerst_SLvl': 5, 'Somerst_2.5Unf': 6, 'Somerst_2.5Fin': 7, 'Somerst_nan': 8, 'Mitchel_2Story': 9, 'Mitchel_SFoyer': 10, 'Mitchel_1Story': 11, 'Mitchel_1.5Unf': 12, 'Mitchel_1.5Fin': 13, 'Mitchel_SLvl': 14, 'Mitchel_2.5Unf': 15, 'Mitchel_2.5Fin': 16, 'Mitchel_nan': 17, 'OldTown_2Story': 18, 'OldTown_SFoyer': 19}


In [53]:
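# Step 3. Create the new feature in the DataFrame
# Build the combined string (a missing value in either column yields NaN, not "..._nan")
trainData['Neighborhood_HouseStyle'] = trainData[cat_a] + "_" + trainData[cat_b]

# Map the combined string to its code (-1 when there is no match)
trainData['Neighborhood_HouseStyle_code'] = (
    trainData['Neighborhood_HouseStyle'].map(combination_dict).fillna(-1)
)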


In [54]:
print(trainData[['Neighborhood', 'HouseStyle', 
                 'Neighborhood_HouseStyle', 
                 'Neighborhood_HouseStyle_code']].head(10))

     Neighborhood HouseStyle Neighborhood_HouseStyle  \
1091      Somerst     2Story          Somerst_2Story   
420       Mitchel     SFoyer          Mitchel_SFoyer   
74        OldTown     2Story          OldTown_2Story   
845        Sawyer     SFoyer           Sawyer_SFoyer   
803       NridgHt     2Story          NridgHt_2Story   
843         NAmes     1Story            NAmes_1Story   
826       BrkSide     1.5Unf          BrkSide_1.5Unf   
408       NridgHt     2Story          NridgHt_2Story   
339         NAmes     1Story            NAmes_1Story   
247         NAmes     1Story            NAmes_1Story   

      Neighborhood_HouseStyle_code  
1091                           0.0  
420                           10.0  
74                            18.0  
845                           28.0  
803                           36.0  
843                           47.0  
826                           57.0  
408                           36.0  
339                           47.0  
247          