# Categorical Features

AirBnB is a company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. In this section, we'll be using AirBnB New York City data to learn about how to handle categorical variables. Each row in this dataset will correspond to a specific home or apartment. The dataset will contain a number of variables surrounding an AirBnB home, such as price, number of reviews, minimum nights required, etc. 

In this notebook, we'll cover the following categorical variable encoding techniques:
- One Hot Encoding
- Ordinal Encoding
- Frequency Encoding
- Target Encoding
- Probability Ratio Encoding
- Weight of Evidence Encoder
- Binning

This a companion notebook for the 365 Data Science Course "Machine Learning Process A-Z". In the course, there is a video walkthrough of this notebook as well as theory and definitions of each of the techinques. We've designed this notebook to be a stand alone learning tool, but if you're interested in the additional features of the paid course, you can access it at a discount here: https://365datascience.com/learn-machine-learning-process-a-z/

Let's get started!

## Import Libraries

We'll first need to import the relevant libraries.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

## Load Data

Next, we'll need to load our AirBnb dataset. 

In [2]:
df = pd.read_csv("airbnb_dataset_ml_process.csv")

df.head()

Unnamed: 0.1,Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive
0,0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,non-expensive
1,1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,expensive
2,2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365,non-expensive
3,3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,non-expensive
4,4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0,non-expensive


## Data Pre-Processing

In [3]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df = df.dropna()

## Cross-Validation

In [4]:
from sklearn.model_selection import train_test_split

X = df.drop(['price'], axis = 1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

## One Hot Encoding

The first technique we'll dive into is one hot encoding. One hot encoding is the simplest form of encoding categorical variables. In this case, we'll look at the categorical column `expensive`. 

### One Categorical Variable

You'll see, that this column is denoted by two values, either expensive or not expensive. Let's say we wanted to one hot encode this column, what would it look like? 

In [5]:
#create a categorical based on a threshold we set
#we are using a lambda function for this. Lambdas allow us to write a function without defining it
#we are also using a ternary operator. This allows us to write a conditional statement in 1 line of code 
df['expensive'] = df['price'].apply(lambda x: 'expensive' if x > 175 else 'non-expensive')

In [6]:
df.groupby(['expensive']).mean()['price']

expensive
expensive        311.277369
non-expensive     92.569628
Name: price, dtype: float64

Creating a one-hot encoded variable will turn each one of these values into its own column, and then denote 0 or 1 if the row contains the column:

In [7]:
dummies = pd.get_dummies(df['expensive'])

hot_encoded_single_df = pd.concat([df.drop(['expensive'],axis=1),dummies], axis = 1)

dummies.head()

Unnamed: 0,expensive,non-expensive
0,0,1
1,1,0
3,0,1
4,0,1
5,1,0


In [8]:
X = hot_encoded_single_df.drop(['price'], axis = 1)
y = hot_encoded_single_df['price']

X_train_dummy, X_test_dummy, y_train_dummy, y_test_dummy = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [9]:
X_train_dummy.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,non-expensive
32728,25785298,Spacious room in Williamsburg (10min to Manhat...,889683,Rachel,Brooklyn,Williamsburg,40.71264,-73.94889,Private room,1,71,2019-06-21,5.37,2,51,0,1
26365,21005699,Vics cafe,45044964,Victor,Queens,Queens Village,40.72517,-73.76105,Shared room,1,11,2019-06-03,0.51,2,125,0,1
22249,17940198,"Cozy, queen sized bedroom in Bed-Stuy",24383863,Sophia,Brooklyn,Bedford-Stuyvesant,40.69471,-73.94406,Private room,2,24,2019-06-25,0.98,1,1,0,1
6170,4513084,Beautiful Sunny Private Penthouse Suite,23401472,Heather,Manhattan,Harlem,40.82624,-73.94527,Entire home/apt,30,1,2016-07-30,0.03,2,244,0,1
27267,21540496,Stylish 1 BD - 10 min to Manhattan & Central Park,156587568,Anna,Queens,Astoria,40.76129,-73.92009,Entire home/apt,1,56,2019-06-14,2.73,1,126,0,1


### Multiple Categorical Variables

Now, let's look at a different column with many, many possible categories: `neighbourhood`. You'll see that if we one hot encoded neighborhood, this would create 221 new columns. For some models, this would significantly increase the complexity of both the dataset and the model when training. This is called high **cardinality**. This could result in overfitting, large memory consumption or slow training times: 

In [10]:
mult_hot_enc = pd.get_dummies(df['neighbourhood'])

hot_encoded_df = pd.concat([df,mult_hot_enc], axis = 1)

mult_cols = mult_hot_enc.columns

mult_cols

Index(['Allerton', 'Arden Heights', 'Arrochar', 'Arverne', 'Astoria',
       'Bath Beach', 'Battery Park City', 'Bay Ridge', 'Bay Terrace',
       'Bay Terrace, Staten Island',
       ...
       'Westchester Square', 'Westerleigh', 'Whitestone', 'Williamsbridge',
       'Williamsburg', 'Willowbrook', 'Windsor Terrace', 'Woodhaven',
       'Woodlawn', 'Woodside'],
      dtype='object', length=218)

In [11]:
X = hot_encoded_df.drop(['price'], axis = 1)
y = hot_encoded_df['price']

X_train_me, X_test_me, y_train_me, y_test_me = train_test_split(X, y, test_size = 0.33, random_state = 42)

### Ordinal Encoding

There are a number of solutions to solving the cardinality problem. The first one is ordinal encoding. Ordinal encoding is a method of replacing categories with numbers. However, these numbers have an inherent ordering to them. Think, high school -> college -> grad_school as categorical variables with na inherent ordering. Luckily, sklearn has already written an implementation of ordinal encoding:

In [12]:
df_ord = df.copy()

ord_encodings = {
    'Entire home/apt':0,
    'Private room':1,
    'Shared room':2,
}

df_ord['room_type_ord_enc'] = df_ord['room_type'].map(ord_encodings)

In [13]:
X = df_ord.drop(['price'], axis = 1)
y = df_ord['price']

X_train_ord, X_test_ord, y_train_ord, y_test_ord = train_test_split(X, y, test_size = 0.33, random_state = 42)

### Frequency Encoding

Another technique to address the cardinality issue is frequency encoding. Rather than replace our categorical variables with ordinal variables, we're going to replace our categories with the frequency in which they occur. See here, we'll see how many times each neighborhood appears:

In [14]:
frequency_encoder = X_train.groupby(['neighbourhood']).size()

frequency_encoder

neighbourhood
Allerton            25
Arden Heights        4
Arrochar            14
Arverne             43
Astoria            480
                  ... 
Willowbrook          1
Windsor Terrace     79
Woodhaven           44
Woodlawn             8
Woodside           119
Length: 215, dtype: int64

Then, we can replace our categories with these different frequencies:

In [15]:
X_test['neighbourhood'] = X_test['neighbourhood'].map(frequency_encoder)

X_test['neighbourhood']

5576      580.0
7729      284.0
2020     2143.0
4195     2143.0
9758     2143.0
          ...  
43519     669.0
10297     669.0
36169     550.0
32098     995.0
37697    1489.0
Name: neighbourhood, Length: 12811, dtype: float64

At the time of writing this notebook, I couldn't find an implementation of this, so I wrote one on my own:

In [16]:
class FrequencyEncoder:
    def fit(self, train_df, column):
        self.train_df = train_df
        self.column = column
        self.frequency_encodings = None
        
    def _compute_frequency(self):
        self.frequency_encodings = self.train_df.groupby([self.column]).size()
        
    def transform(self, test_df, column, fillna = True):
        self._compute_frequency()
        
        col_name = column + '_freq'
        test_df[col_name] = test_df[column].map(self.frequency_encodings)
        
        if fillna is True:
            test_df[col_name] = test_df[col_name].fillna(0)
        
        return test_df

fe = FrequencyEncoder()
fe.fit(X_train, column = 'neighbourhood')

# Transform Datasets
X_train_freq_enc = fe.transform(X_train, column='neighbourhood')
X_test_freq_enc = fe.transform(X_test, column='neighbourhood')

X_test_freq_enc['neighbourhood_freq']

5576     0.0
7729     0.0
2020     0.0
4195     0.0
9758     0.0
        ... 
43519    0.0
10297    0.0
36169    0.0
32098    0.0
37697    0.0
Name: neighbourhood_freq, Length: 12811, dtype: float64

### Target Encoding

Another method of encoding is called Target Encoding. Earlier, we learned about frequency encoding, where we encode the number of instances a category occurs as its value. Rather than encode the number of instances a category occurs, we can encode the mean of our target variable, like this:

In [17]:
df.groupby(['neighbourhood']).mean()['price']

neighbourhood
Allerton            90.594595
Arden Heights       67.250000
Arrochar           118.250000
Arverne            158.515152
Astoria            116.018336
                      ...    
Willowbrook        249.000000
Windsor Terrace    130.687500
Woodhaven           62.722222
Woodlawn            60.090909
Woodside            89.852941
Name: price, Length: 218, dtype: float64

Similar to before, I couldn't find an implementation online, so I wrote one myself that you can re-use:

In [18]:
df_te = df.copy()

class TargetEncoder:
    def fit(self, train_df, target_col, categ_col):
        self.train_df = train_df
        self.target_col = target_col
        self.categ_col = categ_col
        self.target_encodings = None
        
    def transform(self, test_df, column = None, fillna = True):
        if column is None:
            column = self.categ_col
        
        self.target_encodings = self.train_df.groupby([self.categ_col]).mean()[self.target_col]

        col_name = column + '_target_enc'
        test_df.loc[:,col_name] = test_df[column].map(self.target_encodings)
        
        if fillna is True:
            test_df[col_name] = test_df[col_name].fillna(0)
            
        return test_df

    
df_train, df_test = train_test_split(df, test_size = 0.33, random_state = 42)

te = TargetEncoder()
te.fit(df_train, target_col = 'price', categ_col = 'neighbourhood')

X_train_te = te.transform(df_train)
X_test_te = te.transform(df_test)

X_test_te

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,neighbourhood_target_enc
5576,4053517,Two-Bedroom Greenpoint Apartment,3967335,Molly,Brooklyn,Greenpoint,40.72527,-73.95016,Entire home/apt,174,2,31,2019-06-02,0.54,2,12,non-expensive,150.825862
7729,5849991,3 bedroom Apt at $249 per Night.,9898029,Anthony,Brooklyn,East Flatbush,40.65041,-73.92574,Entire home/apt,249,3,10,2019-06-03,0.65,5,156,expensive,116.848592
2020,894015,Boldera: Your Home Away From Home,4622027,Damon And Kent,Brooklyn,Bedford-Stuyvesant,40.68194,-73.92896,Entire home/apt,107,2,147,2019-06-22,1.89,1,27,non-expensive,108.736351
4195,2730497,Entire Private Garden Floor of Brownstone,13974214,Jonathan,Brooklyn,Bedford-Stuyvesant,40.68058,-73.93856,Entire home/apt,130,3,92,2019-06-19,1.45,1,248,non-expensive,108.736351
9758,7500571,Gorgeous 1 Bedroom in Bed-Stuy,39288710,Audra,Brooklyn,Bedford-Stuyvesant,40.68544,-73.93872,Entire home/apt,102,3,64,2019-06-24,1.38,1,296,non-expensive,108.736351
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43519,33701468,Best Location! Spacious 3BR in Center of NYC!,253128090,Harvy Jake,Manhattan,Midtown,40.76167,-73.96944,Entire home/apt,149,2,7,2019-06-14,2.63,1,163,non-expensive,265.881913
10297,7880998,NoMad One Bedroom on Park Ave,41543076,Kelsey,Manhattan,Midtown,40.74321,-73.98474,Entire home/apt,175,7,1,2016-07-01,0.03,1,0,non-expensive,265.881913
36169,28754650,NYC Chelsea,216891728,Sonia,Manhattan,Chelsea,40.74580,-73.99570,Entire home/apt,140,32,29,2019-07-02,3.94,1,2,non-expensive,222.045455
32098,25075529,Comfy cozy studio in LES,164668733,Amalia,Manhattan,East Village,40.72228,-73.98584,Entire home/apt,125,5,3,2018-08-19,0.24,1,0,non-expensive,174.694472


### Probability Ratio Encoding

Probability Ratio Encoding is similar to target encoding. But rather than using the mean of the target, we're looking at the probability this category is going to be a positive label. Here, we've written out our own implementation:

In [19]:
class ProbabilityRatioEncoder:
    def fit(self, train_df, categ_col, target_col):
        self.train_df = train_df.copy()
        self.categ_col = categ_col
        self.target_col = target_col
        self.ratio_encoder = None
        
    def transform(self, test_df, constant = 0, fillna = True):
        totals = self.train_df.groupby([self.categ_col]).size() 
        sums = self.train_df.groupby([self.categ_col]).sum()[self.target_col]
        self.ratio_encoder = ((sums+ constant)/totals)/(1-(sums+ constant)/totals)
        
        col_name = self.categ_col + '_prob_ratio'
        test_df.loc[:,col_name] = test_df[self.categ_col].map(self.ratio_encoder)
        
        if fillna is True:
            test_df.replace([np.inf, -np.inf], np.nan, inplace=True)
            test_df[col_name] = test_df[col_name].fillna(0)
            
        return test_df

In [20]:
## Ignore, creating a categorical target variable
target_variable = pd.get_dummies(df['expensive'])

df_pe = pd.concat([df.drop(['expensive'],axis = 1),target_variable], axis = 1)
df_pe = df_pe.drop(['non-expensive'], axis = 1)

In [21]:
df_train, df_test = train_test_split(df_pe, test_size = 0.33, random_state = 42)

pre = ProbabilityRatioEncoder()

pre.fit(df_train, 'neighbourhood','expensive')
df_train_pe = pre.transform(df_train)
df_test_pe = pre.transform(df_test)

df_test_pe

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,neighbourhood_prob_ratio
5576,4053517,Two-Bedroom Greenpoint Apartment,3967335,Molly,Brooklyn,Greenpoint,40.72527,-73.95016,Entire home/apt,174,2,31,2019-06-02,0.54,2,12,0,0.269147
7729,5849991,3 bedroom Apt at $249 per Night.,9898029,Anthony,Brooklyn,East Flatbush,40.65041,-73.92574,Entire home/apt,249,3,10,2019-06-03,0.65,5,156,1,0.075758
2020,894015,Boldera: Your Home Away From Home,4622027,Damon And Kent,Brooklyn,Bedford-Stuyvesant,40.68194,-73.92896,Entire home/apt,107,2,147,2019-06-22,1.89,1,27,0,0.113247
4195,2730497,Entire Private Garden Floor of Brownstone,13974214,Jonathan,Brooklyn,Bedford-Stuyvesant,40.68058,-73.93856,Entire home/apt,130,3,92,2019-06-19,1.45,1,248,0,0.113247
9758,7500571,Gorgeous 1 Bedroom in Bed-Stuy,39288710,Audra,Brooklyn,Bedford-Stuyvesant,40.68544,-73.93872,Entire home/apt,102,3,64,2019-06-24,1.38,1,296,0,0.113247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43519,33701468,Best Location! Spacious 3BR in Center of NYC!,253128090,Harvy Jake,Manhattan,Midtown,40.76167,-73.96944,Entire home/apt,149,2,7,2019-06-14,2.63,1,163,0,1.207921
10297,7880998,NoMad One Bedroom on Park Ave,41543076,Kelsey,Manhattan,Midtown,40.74321,-73.98474,Entire home/apt,175,7,1,2016-07-01,0.03,1,0,0,1.207921
36169,28754650,NYC Chelsea,216891728,Sonia,Manhattan,Chelsea,40.74580,-73.99570,Entire home/apt,140,32,29,2019-07-02,3.94,1,2,0,1.182540
32098,25075529,Comfy cozy studio in LES,164668733,Amalia,Manhattan,East Village,40.72228,-73.98584,Entire home/apt,125,5,3,2018-08-19,0.24,1,0,0,0.554688


### Weight of Evidence Encoding

Weight of Evidence encoding is similar to probability ratio encoding. The only difference, is we're applying a log transform on top of the probability ratio transformation:

In [22]:
class WeightofEvidenceEncoder:
    def fit(self, train_df, categ_col, target_col):
        self.train_df = train_df
        self.categ_col = categ_col
        self.target_col = target_col
        
    def transform(self, test_df, constant = 0, fillna = True):
        totals = self.train_df.groupby([self.categ_col]).size() 
        sums = self.train_df.groupby([self.categ_col]).sum()[self.target_col]

        woe_encoder = np.log(((sums+ constant)/totals)/(1-(sums+ constant)/totals))
        
        col_name = self.categ_col + '_woe'
        test_df.loc[:,col_name] = test_df[self.categ_col].map(woe_encoder)
        
        if fillna is True:
            test_df.replace([np.inf, -np.inf], np.nan, inplace=True)
            test_df[col_name] = test_df[col_name].fillna(0)
        return test_df

In [23]:
## Ignore, creating a categorical target variable
target_variable = pd.get_dummies(df['expensive'])

df_woe = pd.concat([df.drop(['expensive'],axis = 1),target_variable], axis = 1)
df_woe = df_woe.drop(['non-expensive'], axis = 1)

In [24]:
df_train, df_test = train_test_split(df_woe, test_size = 0.33, random_state = 42)

woe = WeightofEvidenceEncoder()

woe.fit(df_train, 'neighbourhood','expensive')
df_train_woe = woe.transform(df_train, constant = 0.01)
df_test_woe = woe.transform(df_test, constant = 0.01)

df_test_woe

  result = getattr(ufunc, method)(*inputs, **kwargs)


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,neighbourhood_woe
5576,4053517,Two-Bedroom Greenpoint Apartment,3967335,Molly,Brooklyn,Greenpoint,40.72527,-73.95016,Entire home/apt,174,2,31,2019-06-02,0.54,2,12,0,-1.312396
7729,5849991,3 bedroom Apt at $249 per Night.,9898029,Anthony,Brooklyn,East Flatbush,40.65041,-73.92574,Entire home/apt,249,3,10,2019-06-03,0.65,5,156,1,-2.579679
2020,894015,Boldera: Your Home Away From Home,4622027,Damon And Kent,Brooklyn,Bedford-Stuyvesant,40.68194,-73.92896,Entire home/apt,107,2,147,2019-06-22,1.89,1,27,0,-2.178135
4195,2730497,Entire Private Garden Floor of Brownstone,13974214,Jonathan,Brooklyn,Bedford-Stuyvesant,40.68058,-73.93856,Entire home/apt,130,3,92,2019-06-19,1.45,1,248,0,-2.178135
9758,7500571,Gorgeous 1 Bedroom in Bed-Stuy,39288710,Audra,Brooklyn,Bedford-Stuyvesant,40.68544,-73.93872,Entire home/apt,102,3,64,2019-06-24,1.38,1,296,0,-2.178135
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43519,33701468,Best Location! Spacious 3BR in Center of NYC!,253128090,Harvy Jake,Manhattan,Midtown,40.76167,-73.96944,Entire home/apt,149,2,7,2019-06-14,2.63,1,163,0,0.188961
10297,7880998,NoMad One Bedroom on Park Ave,41543076,Kelsey,Manhattan,Midtown,40.74321,-73.98474,Entire home/apt,175,7,1,2016-07-01,0.03,1,0,0,0.188961
36169,28754650,NYC Chelsea,216891728,Sonia,Manhattan,Chelsea,40.74580,-73.99570,Entire home/apt,140,32,29,2019-07-02,3.94,1,2,0,0.167738
32098,25075529,Comfy cozy studio in LES,164668733,Amalia,Manhattan,East Village,40.72228,-73.98584,Entire home/apt,125,5,3,2018-08-19,0.24,1,0,0,-0.589307


### Binning

The last technique is called binning. This is where we take a continuous variable and bin them into different buckets, thus, transforming this continuous variable into a categorical variable:

In [25]:
# df.loc[:,'last_review'] = df['last_review']

def reviews_bin(x):
    if x < 50:
        return 'less_50'
    if x >= 50 and x < 100:
        return '50_to_100'
    if x >= 100 and x < 150:
        return '100_to_150'
    if x >= 150 and x < 200:
        return '150_to_200'
    if x >= 200:
        return '200_plus'
    
df['number_of_reviews'].map(reviews_bin)

0          less_50
1          less_50
3         200_plus
4          less_50
5        50_to_100
           ...    
48782      less_50
48790      less_50
48799      less_50
48805      less_50
48852      less_50
Name: number_of_reviews, Length: 38821, dtype: object

## Comparison

In [26]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

## Multiple Hot Encoded Variable
features = ['id', 'host_id', 'minimum_nights', 'number_of_reviews',
       'calculated_host_listings_count', 'availability_365',
'latitude', 'longitude', 'reviews_per_month'] + list(mult_cols)

lr = LinearRegression()
lr.fit(X_train_me[features], y_train_dummy)
pred_single_me = lr.predict(X_test_me[features])

multi_dummy_score = mean_absolute_error(pred_single_me, y_test_me)

multi_dummy_score

70.38943058456395

In [27]:
from sklearn.linear_model import LinearRegression


features = ['id', 'host_id', 'minimum_nights', 'number_of_reviews',
       'calculated_host_listings_count', 'availability_365',
'latitude', 'longitude', 'reviews_per_month', 'room_type_ord_enc']

## Ordinal Encoding
lr = LinearRegression()
lr.fit(X_train_ord[features], y_train_ord)
pred_single_ord = lr.predict(X_test_ord[features])

ord_score = mean_absolute_error(pred_single_ord, y_test_ord)

ord_score

63.19185794926268

In [28]:
## Frequency Encoding
features = ['id', 'host_id', 'minimum_nights', 'number_of_reviews',
       'calculated_host_listings_count', 'availability_365',
'latitude', 'longitude', 'reviews_per_month', 'neighbourhood_freq']

lr = LinearRegression()
lr.fit(X_train_freq_enc[features], y_train)
pred_freq_enc = lr.predict(X_test_freq_enc[features])

freq_score = mean_absolute_error(pred_freq_enc, y_test)

freq_score

77.84292568008051

In [29]:
## Target Encoding
features = ['id', 'host_id', 'minimum_nights', 'number_of_reviews',
       'calculated_host_listings_count', 'availability_365',
'latitude', 'longitude', 'reviews_per_month', 'neighbourhood_target_enc']

lr = LinearRegression()
lr.fit(X_train_te[features], y_train)
pred_te_enc = lr.predict(X_test_te[features])

te_score = mean_absolute_error(pred_te_enc, y_test)

te_score

70.42050163360693

In [30]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

## Probability Ratio Encoding
features = ['id', 'host_id', 'minimum_nights', 'number_of_reviews',
       'calculated_host_listings_count', 'availability_365',
'latitude', 'longitude', 'reviews_per_month', 'neighbourhood_prob_ratio']

lr = LogisticRegression()

X_train_pe = df_train_pe[features]
y_train_pe = df_train_pe['expensive']

X_test_pe = df_test_pe[features]
y_test_pe = df_test_pe['expensive']

lr.fit(X_train_pe[features], y_train_pe)
pred_pe_enc = lr.predict(X_test_pe[features])

pe_score = accuracy_score(pred_pe_enc, y_test_pe)

pe_score

0.7692607915072984

In [31]:
## Weight of Evidence
features = ['id', 'host_id', 'minimum_nights', 'number_of_reviews',
       'calculated_host_listings_count', 'availability_365',
'latitude', 'longitude', 'reviews_per_month', 'neighbourhood_woe']

lr = LogisticRegression()

X_train_woe = df_train_woe[features]
y_train_woe = df_train_woe['expensive']

X_test_woe = df_test_woe[features]
y_test_woe = df_test_woe['expensive']

lr.fit(X_train_woe[features], y_train_woe)
pred_woe_enc = lr.predict(X_test_woe[features])

woe_score = accuracy_score(pred_woe_enc, y_test_woe)

woe_score

0.7692607915072984

## Summary

In this section, you learned about a variety of categorical variable feature engineering techniques:

- Hot Encoding

- Ordinal Encoding

- Target Encoding

- Probability Ratio Encoding

- Weight of Evidence Encoding

- Binning

## External Resources
- [All about Categorical Variable Encoding by Baijayanta Roy](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)
- [Ordinal and One-Hot Encodings for Categorical Data by Jason Brownlee](https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/)
- [Feature Engineering Ordinal Variables by Sheng Jun](https://towardsdatascience.com/feature-engineering-ordinal-variables-bfea697f5eee)
- [Target Encoding by Ryan Holbrook](https://www.kaggle.com/code/ryanholbrook/target-encoding)
- [Weight of Evidence Coding by Bruce Lund](https://www.mwsug.org/proceedings/2016/AA/MWSUG-2016-AA15.pdf)

## Related Course Workbooks - Machine Learning Process A-Z
- [**Dealing with Missing Values - Section 5.1**](https://www.kaggle.com/code/kenjee/dealing-with-missing-values-section-5-1)
- [**Dealing with Outliers - Section 5.2**](https://www.kaggle.com/code/kenjee/dealing-with-outliers-section-5-2)
- [**Basic EDA Example - Section 6**](https://www.kaggle.com/code/kenjee/basic-eda-example-section-6)
- [**Categorical Feature Engineering - Section 7.1**](https://www.kaggle.com/code/kenjee/categorical-feature-engineering-section-7-1)
- [**Numeric Feature Engineering - Section 7.2**](https://www.kaggle.com/kenjee/numeric-feature-engineering-section-7-2)
- [**Cross Validation Foundations - Section 8**](https://www.kaggle.com/code/kenjee/cross-validation-foundations-section-8)
- [**Feature Selection - Section 9**](https://www.kaggle.com/code/kenjee/feature-selection-section-9)
- [**Dealing with Imbalanced Data - Section 10**](https://www.kaggle.com/code/kenjee/dealing-with-imbalanced-data-section-10)
- [**Model Building Example - Section 11**](https://www.kaggle.com/code/kenjee/model-building-example-section-11)
- [**Model Evaluation (Classification) - Section 11**](https://www.kaggle.com/code/kenjee/model-evaluation-classification-section-12)
- [**Model Evlauation (Regression) - Section 11**](https://www.kaggle.com/code/kenjee/model-evaluation-regression-12)