# Handling Categorical Values

#### Let's get our dataset and important libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
# Save the file path to a variable for easier access
melbourne_file_path = '../hitchhikersGuideToMachineLearning/home-data-for-ml-course/train.csv'
# Read the data and store it in a DataFrame named train_data
train_data = pd.read_csv(melbourne_file_path)
# Show the first five rows of the data
train_data.head()


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


Know your Data!
# Step 1: Preliminary investigation

Run the code cell below without changes.

In [3]:
# Shape of training data (num_rows, num_columns)
print(train_data.shape)

(1460, 81)


In [4]:
# Number of missing values in each column of training data
missing_val_count_by_column = (train_data.isnull().sum())
missing_val_count_by_column[missing_val_count_by_column > 0]

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

In [5]:
# Drop every column that contains at least one missing value
cols_with_missing = [col for col in train_data.columns
                     if train_data[col].isnull().any()]

train_data = train_data.drop(cols_with_missing, axis=1)


### Types of Data
#### Numerical data:
Numerical data have meaning as a measurement, such as a person’s height, weight, IQ, or blood pressure; or they are a count, such as the number of stock shares a person owns, how many teeth a dog has, or how many pages of your favorite book you can read before you fall asleep.

#### Categorical data:
Categorical data represent characteristics such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as “1” indicating male and “2” indicating female), but those numbers don’t have mathematical meaning.

In [6]:
train_data.dtypes

Id                int64
MSSubClass        int64
MSZoning         object
LotArea           int64
Street           object
                  ...  
MoSold            int64
YrSold            int64
SaleType         object
SaleCondition    object
SalePrice         int64
Length: 62, dtype: object

###### In pandas, numerical columns typically have the int64 or float64 dtype, while categorical columns typically have the object dtype.
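
As a small illustration of the dtype rule above, the sketch below builds a toy DataFrame (the values are made up, and the name `object_cols_toy` is hypothetical) and separates its columns by dtype with `select_dtypes`.

```python
import pandas as pd

# Toy frame with one numeric column and one text column (made-up values).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": pd.Series(["RL", "RM", "RL"], dtype=object),
})

print(toy.dtypes)

# Separate the column names by dtype.
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
object_cols_toy = toy.select_dtypes(include=["object"]).columns.tolist()

print("Numeric columns:", numeric_cols)
print("Object columns:", object_cols_toy)
```

The dtype is only a heuristic. A column such as `MSSubClass` is stored as int64 even though its values are codes for dwelling types rather than measurements.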



### Select the target variable, then split the data into training and validation sets

In [7]:
X = train_data.drop(['SalePrice'] , axis =1)
y = train_data.SalePrice

In [8]:
#Splitting in training and Validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,random_state=0)

In [9]:
X_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
618,619,20,RL,11694,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,108,0,0,260,0,0,7,2007,New,Partial
870,871,20,RL,6600,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,0,8,2009,WD,Normal
92,93,30,RL,13360,Pave,IR1,HLS,AllPub,Inside,Gtl,...,0,44,0,0,0,0,8,2009,WD,Normal
817,818,20,RL,13265,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,...,59,0,0,0,0,0,7,2008,WD,Normal
302,303,20,RL,13704,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,81,0,0,0,0,0,1,2006,WD,Normal


In [10]:
X_valid.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
529,530,20,RL,32668,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,...,0,200,0,0,0,0,3,2007,WD,Alloca
491,492,50,RL,9490,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,32,0,0,0,0,8,2006,WD,Normal
459,460,50,RL,7015,Pave,IR1,Bnk,AllPub,Corner,Gtl,...,0,248,0,0,0,0,7,2009,WD,Normal
279,280,60,RL,10005,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,117,0,0,0,0,0,3,2008,WD,Normal
655,656,160,RM,1680,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,0,3,2010,WD,Family


To compare different approaches to dealing with categorical variables, you'll use the same `score_dataset()` function from the tutorial.  This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
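
To make the metric concrete, here is a minimal sketch that computes the MAE by hand on three made-up prices and checks it against `mean_absolute_error`. The arrays `y_true` and `y_pred` are hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted sale prices.
y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 140000, 330000])

# MAE is the average of the absolute differences.
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

A lower MAE means the predictions are, on average, closer to the true prices.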

#### We are all set to LAUNCH!

In [42]:
print("MAE from NO Approach:") 
print(score_dataset(X_train, X_valid, y_train, y_valid))

MAE from NO Approach:


ValueError: could not convert string to float: 'RL'

# Step 2: Apply different Techniques



# 1: Drop columns with categorical data

You'll get started with the most straightforward approach.  Use the code cell below to preprocess the data in `X_train` and `X_valid` to remove columns with categorical data.  Set the preprocessed DataFrames to `drop_X_train` and `drop_X_valid`, respectively.  

In [12]:
# Fill in the lines below: drop columns in training and validation data
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

Run the next code cell without changes to obtain the MAE for this approach.

In [13]:
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
17952.591404109586


# 2: Label encoding

Before jumping into label encoding, we'll investigate the dataset.  The code cell below prints the unique entries of the `'Condition2'` column in both the training and validation sets.

In [44]:
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())

Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']

Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']


If you now write code to: 
- fit a label encoder to the training data, and then 
- use it to transform both the training and validation data, 

you'll get an error.  Can you see why this is the case?  (_You'll need  to use the above output to answer this question._)

This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue.  For instance, you can write a custom label encoder to deal with new categories.  The simplest approach, however, is to drop the problematic categorical columns.  
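
The failure can be reproduced on a toy example. The sketch below uses made-up lists (`train_values` and `valid_values` are hypothetical names) in which the validation data contains a category that the encoder never saw during fitting.

```python
from sklearn.preprocessing import LabelEncoder

train_values = ["Norm", "Feedr", "Norm", "PosN"]
valid_values = ["Norm", "RRAn"]  # 'RRAn' never appears in train_values

encoder = LabelEncoder()
encoder.fit(train_values)

try:
    encoder.transform(valid_values)
    failed = False
except ValueError as err:
    # The encoder has no integer assigned to the unseen label.
    failed = True
    print("LabelEncoder failed:", err)
```

The encoder only knows the labels it saw in `fit`, so any new label in the validation data has no integer to map to.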

Run the code cell below to save the problematic columns to a Python list `bad_label_cols`.  Likewise, columns that can be safely label encoded are stored in `good_label_cols`.

In [15]:
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if 
                   set(X_train[col]) == set(X_valid[col])]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

Categorical columns that will be label encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'BldgType', 'HouseStyle', 'ExterQual', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Functional', 'SaleType', 'Foundation', 'Condition2', 'Heating', 'Exterior2nd', 'HeatingQC', 'RoofStyle', 'ExterCond', 'Exterior1st', 'LandSlope', 'Neighborhood', 'RoofMatl', 'Utilities', 'Condition1']


Use the next code cell to label encode the data in `X_train` and `X_valid`.  Set the preprocessed DataFrames to `label_X_train` and `label_X_valid`, respectively.  
- We have provided code below to drop the categorical columns in `bad_label_cols` from the dataset. 
- You should label encode the categorical columns in `good_label_cols`.  

In [16]:
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder 
label_encoder = LabelEncoder()
# Fit on each training column, then apply the same mapping to the validation column
for col in good_label_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])    


Run the next code cell without changes to obtain the MAE for this approach.

In [17]:
print("MAE from Approach 2A (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2A (Label Encoding):
17675.942500000005


##### Alternatively, you can use the following robust label encoder, which assigns unseen categories to an 'Unknown' class

In [18]:
class LabelEncoderExt(object):
    def __init__(self):
        """
        Differs from LabelEncoder by handling unseen classes: an 'Unknown' class
        is added during fit, and transform assigns its id to any new value.
        """
        self.label_encoder = LabelEncoder()

    def fit(self, data_list):
        """
        Fit the encoder on all unique values plus the extra 'Unknown' class.
        :param data_list: a list (or Series) of strings
        :return: self
        """
        self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_

        return self

    def transform(self, data_list):
        """
        Transform data_list to an array of ids, assigning unseen values to the 'Unknown' class.
        :param data_list: a list (or Series) of strings
        :return: array of integer ids
        """
        known_classes = set(self.label_encoder.classes_)
        new_data_list = [x if x in known_classes else 'Unknown' for x in data_list]

        return self.label_encoder.transform(new_data_list)

In [19]:
def robustlabelencoder(train, test):
    # Use the columns of the frame passed in, not the global X_train
    object_cols = [col for col in train.columns if train[col].dtype == "object"]

    label_enc = LabelEncoderExt()
    for col in object_cols:
        # Refit on each training column, then reuse that mapping for both sets
        label_enc.fit(train[col])
        train[col] = label_enc.transform(train[col])
        test[col] = label_enc.transform(test[col])

    return train, test


In [20]:
#apply robust label encoder
Robustlabel_X_train,Robustlabel_X_valid = robustlabelencoder(X_train.copy(),X_valid.copy())



In [21]:
Robustlabel_X_train.describe()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
count,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,...,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0,1168.0
mean,738.685788,56.605308,3.02911,10589.672945,0.995719,1.925514,2.787671,0.000856,3.04024,0.069349,...,48.044521,23.02226,3.218322,14.528253,2.118151,50.936644,6.30137,2007.819349,8.412671,3.78339
std,421.609683,42.172322,0.631242,10704.180793,0.065316,1.416792,0.694786,0.02926,1.612162,0.294745,...,68.619199,63.153093,27.916593,54.009608,36.482294,550.380636,2.725977,1.335971,1.777488,1.085149
min,1.0,20.0,0.0,1300.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,0.0,0.0
25%,373.75,20.0,3.0,7589.5,1.0,0.0,3.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,9.0,4.0
50%,749.5,50.0,3.0,9512.5,1.0,3.0,3.0,0.0,4.0,0.0,...,26.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,9.0,4.0
75%,1108.75,70.0,3.0,11601.5,1.0,3.0,3.0,0.0,4.0,0.0,...,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,9.0,4.0
max,1460.0,190.0,4.0,215245.0,1.0,3.0,3.0,1.0,4.0,2.0,...,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,9.0,5.0


Run the next code cell without changes to obtain the MAE for this approach.

In [22]:
print("MAE from Approach 2B (Robust Label Encoding):")
print(score_dataset(Robustlabel_X_train, Robustlabel_X_valid, y_train, y_valid))

MAE from Approach 2B (Robust Label Encoding):
17187.390513698632


# 3: One-hot encoding

In this step, you'll experiment with one-hot encoding.  But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.

## But Why Only 10?



#### Investigating cardinality

So far, you've tried two different approaches to dealing with categorical variables.  And, you've seen that encoding categorical data yields better results than removing columns from the dataset.

Soon, you'll try one-hot encoding.  Before then, there's one additional topic we need to cover.  Begin by running the next code cell without changes.  

In [23]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('Street', 2),
 ('Utilities', 2),
 ('CentralAir', 2),
 ('LandSlope', 3),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('ExterQual', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('ExterCond', 5),
 ('HeatingQC', 5),
 ('Condition2', 6),
 ('RoofStyle', 6),
 ('Foundation', 6),
 ('Heating', 6),
 ('Functional', 6),
 ('SaleCondition', 6),
 ('RoofMatl', 7),
 ('HouseStyle', 8),
 ('Condition1', 9),
 ('SaleType', 9),
 ('Exterior1st', 15),
 ('Exterior2nd', 16),
 ('Neighborhood', 25)]

The output above shows, for each column with categorical data, the number of unique values in the column.  For instance, the `'Street'` column in the training data has two unique values: `'Grvl'` and `'Pave'`, corresponding to a gravel road and a paved road, respectively.

We refer to the number of unique entries of a categorical variable as the **cardinality** of that categorical variable.  For instance, the `'Street'` variable has cardinality 2.


For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset.  For this reason, we typically will only one-hot encode columns with relatively low cardinality.  Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.

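As an example, consider a dataset with 10,000 rows that contains one categorical column with 100 unique entries.  One-hot encoding replaces that single column with 100 new columns, one per unique entry, whereas label encoding keeps a single column.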
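
The size of this expansion can be checked with a short sketch. It builds a synthetic column (the `cat_0` to `cat_99` values are made up) and one-hot encodes it with `pd.get_dummies`, which is used here only for brevity.

```python
import pandas as pd

n_rows, n_categories = 10_000, 100

# Synthetic categorical column in which every one of the 100 categories occurs.
column = pd.Series([f"cat_{i % n_categories}" for i in range(n_rows)], name="feature")

one_hot = pd.get_dummies(column)

entries_before = column.size                           # 10,000 entries in one column
entries_after = one_hot.shape[0] * one_hot.shape[1]    # rows x new columns
entries_added = entries_after - entries_before

print(one_hot.shape, entries_added)
```

Under these assumptions the single column turns into 100 columns, so the table grows from 10,000 to 1,000,000 entries for this feature alone.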


Run the code cell below without changes to set `low_cardinality_cols` to a Python list containing the columns that will be one-hot encoded.  Likewise, `high_cardinality_cols` contains a list of categorical columns that will be dropped from the dataset.

In [24]:
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Neighborhood', 'Exterior2nd', 'Exterior1st']


Use the next code cell to one-hot encode the data in `X_train` and `X_valid`.  Set the preprocessed DataFrames to `OH_X_train` and `OH_X_valid`, respectively.  
- The full list of categorical columns in the dataset can be found in the Python list `object_cols`.
- You should only one-hot encode the categorical columns in `low_cardinality_cols`.  All other categorical columns should be dropped from the dataset. 

In [25]:
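from sklearn.preprocessing import OneHotEncoder

# One-hot encode the low-cardinality columns; categories unseen during fit become all zeros.
# Note: scikit-learn >= 1.2 uses `sparse_output`; older releases use `sparse=False` instead.
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe_col_train = pd.DataFrame(ohe.fit_transform(X_train[low_cardinality_cols]))
ohe_col_valid = pd.DataFrame(ohe.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed the index; put it back
ohe_col_train.index = X_train.index
ohe_col_valid.index = X_valid.index

# Remove all categorical columns (the low-cardinality ones are replaced by one-hot columns)
nX_train = X_train.drop(object_cols, axis=1)
nX_valid = X_valid.drop(object_cols, axis=1)

# Add the one-hot encoded columns to the numerical features
OH_X_train = pd.concat([nX_train, ohe_col_train], axis=1)
OH_X_valid = pd.concat([nX_valid, ohe_col_valid], axis=1)

# Recent scikit-learn versions require all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)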
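
The `handle_unknown='ignore'` option used in the cell above is what lets one-hot encoding cope with categories that only occur in the validation data. The sketch below shows the effect on a toy column; the `'Dirt'` value is made up, and the default sparse output is converted with `.toarray()` so the example does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
toy_valid = pd.DataFrame({"Street": ["Pave", "Dirt"]})  # 'Dirt' is unseen in training

toy_ohe = OneHotEncoder(handle_unknown="ignore")
toy_ohe.fit(toy_train)

# Columns follow the sorted training categories: ['Grvl', 'Pave'].
encoded_valid = toy_ohe.transform(toy_valid).toarray()
print(toy_ohe.categories_)
print(encoded_valid)
```

The unseen category is encoded as a row of zeros instead of raising an error, which is why no columns had to be dropped for this approach.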



In [26]:
print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

MAE from Approach 3 (One-Hot Encoding):
17514.224246575344


In [27]:
X_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
618,619,20,RL,11694,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,108,0,0,260,0,0,7,2007,New,Partial
870,871,20,RL,6600,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,0,8,2009,WD,Normal
92,93,30,RL,13360,Pave,IR1,HLS,AllPub,Inside,Gtl,...,0,44,0,0,0,0,8,2009,WD,Normal
817,818,20,RL,13265,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,...,59,0,0,0,0,0,7,2008,WD,Normal
302,303,20,RL,13704,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,81,0,0,0,0,0,1,2006,WD,Normal


In [28]:
X_valid.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
529,530,20,RL,32668,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,...,0,200,0,0,0,0,3,2007,WD,Alloca
491,492,50,RL,9490,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,32,0,0,0,0,8,2006,WD,Normal
459,460,50,RL,7015,Pave,IR1,Bnk,AllPub,Corner,Gtl,...,0,248,0,0,0,0,7,2009,WD,Normal
279,280,60,RL,10005,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,117,0,0,0,0,0,3,2008,WD,Normal
655,656,160,RM,1680,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,0,3,2010,WD,Family


# 4: Count encoding

Here, encode the categorical features using the count of each value in the data set. Using `CountEncoder` from the `category_encoders` library, fit the encoding using the categorical feature columns defined in `cat_cols`. Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed `"_count"`.

In [29]:
#Splitting in training and Validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,random_state=0)

In [45]:
import category_encoders as ce
cat_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Create the count encoder
count_enc = ce.CountEncoder(cols = cat_cols)

# Learn encoding from the training set
count_encoded = count_enc.fit(X_train[cat_cols])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
Count_X_train= X_train.join(count_enc.transform(X_train[cat_cols]).add_suffix('_count'))

Count_X_valid = X_valid.join(count_enc.transform(X_valid[cat_cols]).add_suffix('_count'))

Count_X_train = Count_X_train.drop(cat_cols,axis=1)
Count_X_valid = Count_X_valid.drop(cat_cols,axis=1)




Run the next cell and inspect the results.

In [46]:
missing_val_count_by_column = (Count_X_valid.isnull().sum())
missing_val_count_by_column[missing_val_count_by_column > 0]

Condition2_count    3
RoofMatl_count      1
Functional_count    1
dtype: int64

These columns contain categories that appear in the validation set but not in the training set, so the encoder has no count for them and returns missing values. Since this is count encoding, we can replace the missing counts with 1. This is not a very graceful solution!
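
The missing values reported above can be reproduced with plain pandas. The following is a minimal sketch of the idea behind count encoding on made-up data, not the internals of `CountEncoder`; the Series names are hypothetical.

```python
import pandas as pd

toy_train = pd.Series(["Norm", "Norm", "Feedr", "Norm"], name="Condition2")
toy_valid = pd.Series(["Norm", "Feedr", "RRAn"], name="Condition2")  # 'RRAn' is unseen

# Learn the counts from the training data only.
counts = toy_train.value_counts()

# Replace each category with its training count; unseen categories become NaN.
train_count = toy_train.map(counts)
valid_count = toy_valid.map(counts)

# Same workaround as in the notebook: treat unseen categories as a count of 1.
valid_count_filled = valid_count.fillna(1)

print(valid_count.tolist(), valid_count_filled.tolist())
```

Filling with 1 treats an unseen category like a category that occurred once in training, which is a rough but simple approximation.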

In [47]:
print("MAE from Approach 4 (Count Encoding):") 
print(score_dataset(Count_X_train, Count_X_valid.fillna(1), y_train, y_valid))

MAE from Approach 4 (Count Encoding):
17183.800034246575


# 5: Target encoding

Here you'll try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. Create the target encoder from the `category_encoders` library. Then, learn the encodings from the training dataset, apply the encodings to all the datasets and retrain the model.

In [34]:
cat_features = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Create the target encoder. You can find this easily by using tab completion.
# Start typing ce. then press Tab to bring up a list of classes and functions.
target_enc = ce.TargetEncoder(cols=cat_features)

# Learn encoding from the training set. Use y_train (SalePrice) as the target.
target_enc.fit(X_train[cat_features], y_train)

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
target_X_train = X_train.join(target_enc.transform(X_train[cat_features]).add_suffix('_target'))
target_X_valid = X_valid.join(target_enc.transform(X_valid[cat_features]).add_suffix('_target'))

# Drop the original categorical columns
target_X_train = target_X_train.drop(cat_features, axis=1)
target_X_valid = target_X_valid.drop(cat_features, axis=1)
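
To show the idea behind target encoding, here is a simplified sketch in plain pandas: each category is replaced by the mean target value observed for it in the training data. This is an assumption-laden illustration on made-up data; the library's `TargetEncoder` additionally smooths the category means, which this sketch leaves out.

```python
import pandas as pd

# Made-up training data.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "SalePrice": [200000, 100000, 300000, 250000],
})

# Mean target per category, learned from the training data only.
category_means = toy.groupby("Street")["SalePrice"].mean()
global_mean = toy["SalePrice"].mean()

# Encode validation categories; fall back to the global mean for unseen ones.
toy_valid = pd.Series(["Grvl", "Pave", "Dirt"])  # 'Dirt' is a made-up unseen category
encoded = toy_valid.map(category_means).fillna(global_mean)

print(encoded.tolist())
```

Because the encoding is learned from the target, it should be fitted on the training data only, as the cell above does, so that validation targets do not leak into the features.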


In [36]:
print("MAE from Approach 5 (Target Encoding):")
print(score_dataset(target_X_train, target_X_valid, y_train, y_valid))

MAE from Approach 5 (Target Encoding):
17358.692328767123


# 6: CatBoost encoding

The CatBoost encoder is supposed to work well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and score the model on the encoded data again.

In [40]:
cat_features = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Create the CatBoost encoder
cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

# Learn encoding from the training set
cb_enc.fit(X_train[cat_features], y_train)

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
cb_X_train = X_train.join(cb_enc.transform(X_train[cat_features]).add_suffix('_cb'))
cb_X_valid = X_valid.join(cb_enc.transform(X_valid[cat_features]).add_suffix('_cb'))

# Drop the original categorical columns
cb_X_train = cb_X_train.drop(cat_features, axis=1)
cb_X_valid = cb_X_valid.drop(cat_features, axis=1)
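
CatBoost-style encoding is similar to target encoding, but the value for each row is computed only from the rows that come before it. The sketch below illustrates this ordered target statistic on made-up data. The prior (the global mean) and the weight `a = 1.0` are assumptions chosen for the illustration, and the library's exact formula and defaults may differ.

```python
import pandas as pd

# Made-up categories and targets, in row order.
cats = pd.Series(["a", "b", "a", "a"])
y = pd.Series([1.0, 2.0, 3.0, 5.0])

prior = y.mean()  # assumed prior: global mean of the target
a = 1.0           # assumed weight of the prior

# Sum and count of the target over *earlier* rows of the same category.
prev_sum = y.groupby(cats).cumsum() - y
prev_count = cats.groupby(cats).cumcount()

ordered_encoding = (prev_sum + prior * a) / (prev_count + a)
print(ordered_encoding.tolist())
```

Since a row's own target never enters its encoded value, this scheme is less prone to leaking the target than a plain per-category mean.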


Run the next code cell without changes to obtain the MAE for this approach.

In [41]:
print("MAE from Approach 6 (Catboost Encoding):") 
print(score_dataset(cb_X_train, cb_X_valid, y_train, y_valid))

MAE from Approach 6 (Catboost Encoding):
17320.384383561646


A few other encoding techniques:
- Backward Difference Coding
- BaseN
- Binary
- Hashing
- Helmert Coding
- James-Stein Encoder
- Leave One Out
- M-estimate
- Ordinal
- Polynomial Coding
- Sum Coding
- Weight of Evidence

# Step 3: Devising the Ultimate Strategy

### Comparing All Solutions
Which encoding should one choose? Which encoding will work better?
This is discussed in the article [link].
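
As a starting point for the comparison, the sketch below collects the MAE values printed earlier in this notebook into a pandas Series and sorts them. The labels are shorthand for the approaches above, and the variable names are hypothetical.

```python
import pandas as pd

# MAE values copied from the outputs of the cells above.
mae_scores = pd.Series({
    "1: Drop categorical columns": 17952.591404109586,
    "2A: Label encoding": 17675.942500000005,
    "2B: Robust label encoding": 17187.390513698632,
    "3: One-hot encoding": 17514.224246575344,
    "4: Count encoding": 17183.800034246575,
    "5: Target encoding": 17358.692328767123,
    "6: CatBoost encoding": 17320.384383561646,
})

# Lower MAE is better, so sort in ascending order.
ranking = mae_scores.sort_values()
best_approach = ranking.index[0]

print(ranking.round(2))
print("Lowest MAE:", best_approach)
```

These numbers come from a single train/validation split with one model, and several of the differences are small, so the ranking should be read as indicative rather than conclusive.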
