# Categorical Features

AirBnB is a company that operates an online marketplace for lodging (primarily homestays for vacation rentals) and tourism activities. In this section, we'll use AirBnB New York City data to learn how to handle categorical variables. Each row in this dataset corresponds to a specific home or apartment, and the columns describe the listing: price, number of reviews, minimum nights required, etc. 

In this notebook, we'll cover the following categorical variable encoding techniques:
- One Hot Encoding
- Ordinal Encoding
- Frequency Encoding
- Target Encoding
- Probability Ratio Encoding
- Weight of Evidence Encoding
- Binning

Let's get started! 

## Import Libraries

We'll first need to import the relevant libraries.

In [3]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

## Load Data

Next, we'll need to load our AirBnB dataset. 

In [4]:
df = pd.read_csv("airbnb_dataset_ml_process.csv")

## One Hot Encoding

The first technique we'll dive into is one hot encoding. One hot encoding is the simplest form of encoding categorical variables. In this case, we'll look at the categorical column `expensive`. 

### One Categorical Variable

You'll see that this column takes one of two values, either `expensive` or `non-expensive`. Let's say we wanted to one hot encode this column. What would it look like? 

In [17]:
df['expensive']

0        non-expensive
1            expensive
2        non-expensive
3        non-expensive
4        non-expensive
             ...      
48890    non-expensive
48891    non-expensive
48892    non-expensive
48893    non-expensive
48894    non-expensive
Name: expensive, Length: 48895, dtype: object

One hot encoding turns each of these values into its own column, which holds 1 if the row has that value and 0 otherwise:

In [15]:
# dtype=int gives 0/1 indicators (recent pandas versions return True/False by default)
dummies = pd.get_dummies(df['expensive'], dtype=int)

dummies.head()

Unnamed: 0,expensive,non-expensive
0,0,1
1,1,0
2,0,1
3,0,1
4,0,1
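
For a two-valued column like this one, the two indicator columns are redundant: each is always 1 minus the other. A minimal sketch on a made-up toy Series (illustrative values, not the AirBnB data) shows how the `drop_first` option of `pd.get_dummies` keeps a single column instead:

```python
import pandas as pd

# Toy stand-in for the `expensive` column (illustrative values only)
toy = pd.Series(['non-expensive', 'expensive', 'non-expensive'], name='expensive')

# Full one hot encoding: one indicator column per category
full = pd.get_dummies(toy, dtype=int)

# drop_first=True drops the first category (in sorted order), leaving one column
compact = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full)
print(compact)
```

Whether to drop a column is a modelling choice: linear models without regularization usually benefit from it, while tree-based models are typically indifferent.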


### Multiple Categorical Variables

Now, let's look at a column with many possible categories: `neighbourhood`. A column with a large number of distinct values is said to have high **cardinality**. One hot encoding `neighbourhood` creates 221 new columns, one per neighbourhood. For some models, this significantly increases the complexity of both the dataset and the model, which can lead to overfitting, large memory consumption, or slow training times: 

In [19]:
# dtype=int gives 0/1 indicators (recent pandas versions return True/False by default)
mult_hot_enc = pd.get_dummies(df['neighbourhood'], dtype=int)

hot_encoded_df = pd.concat([df,mult_hot_enc], axis = 1)

mult_hot_enc

Unnamed: 0,Allerton,Arden Heights,Arrochar,Arverne,Astoria,Bath Beach,Battery Park City,Bay Ridge,Bay Terrace,"Bay Terrace, Staten Island",...,Westerleigh,Whitestone,Williamsbridge,Williamsburg,Willowbrook,Windsor Terrace,Woodhaven,Woodlawn,Woodrow,Woodside
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48891,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48892,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48893,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
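
Before one hot encoding, it can help to check how many columns each variable would produce. The sketch below uses a small made-up DataFrame (the rows are illustrative, not taken from the dataset) and counts distinct values with `nunique()`:

```python
import pandas as pd

# Toy data with two categorical columns (illustrative values only)
toy = pd.DataFrame({
    'neighbourhood': ['Harlem', 'Midtown', 'Harlem', 'Astoria', 'Bushwick', 'Harlem'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room',
                  'Entire home/apt', 'Private room', 'Shared room'],
})

# Distinct categories per column = number of one hot columns each would create
cardinality = toy.nunique()
print(cardinality)

# One hot encode both columns; each new column is prefixed with its source column name
encoded = pd.get_dummies(toy, columns=['neighbourhood', 'room_type'], dtype=int)
print(encoded.shape)
```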


### Ordinal Encoding

There are a number of ways to address the cardinality problem. The first is ordinal encoding, which replaces each category with a number, so a single column is enough however many categories there are. Because a model will treat these numbers as ordered, the technique is best suited to categories with an inherent ordering, such as high school -> college -> grad school. Luckily, sklearn already has an implementation of ordinal encoding:

In [21]:
from sklearn.preprocessing import OrdinalEncoder

# Select the column to encode as a 2D array (OrdinalEncoder expects 2D input)
data = np.asarray(df[['room_type']])

# Define the ordinal encoder; by default, categories are numbered in sorted order
encoder = OrdinalEncoder()
# Fit the encoder and transform the data
result = encoder.fit_transform(data)
print(pd.DataFrame(result))


         0
0      1.0
1      0.0
2      1.0
3      0.0
4      0.0
...    ...
48890  1.0
48891  1.0
48892  0.0
48893  2.0
48894  1.0

[48895 rows x 1 columns]


In [24]:
# Add the encoded column back to the original dataframe
ord_encoded = pd.DataFrame(result)
ord_encoded.columns = ['room_type_ord_encoded']

ord_encoded_df = pd.concat([df, ord_encoded], axis = 1)

ord_encoded

Unnamed: 0,room_type_ord_encoded
0,1.0
1,0.0
2,1.0
3,0.0
4,0.0
...,...
48890,1.0
48891,1.0
48892,0.0
48893,2.0
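
When the categories do have a natural order, that order can be passed to the encoder explicitly. The sketch below assumes a hypothetical `education` column (not part of the AirBnB data) and uses the `categories` parameter of `OrdinalEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical education column with a natural order (not part of the AirBnB data)
education = pd.DataFrame({'education': ['college', 'high_school', 'grad_school', 'college']})

# Pass the categories in the intended order; otherwise they are numbered in sorted order
order = [['high_school', 'college', 'grad_school']]
ordered_encoder = OrdinalEncoder(categories=order)

education_encoded = ordered_encoder.fit_transform(education[['education']])
print(education_encoded.ravel())
```

With the explicit order, `high_school` maps to 0, `college` to 1, and `grad_school` to 2, so the numeric distance between codes reflects the intended ranking.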


### Frequency Encoding

Another technique to address the cardinality issue is frequency encoding. Rather than replacing our categories with ordinal numbers, we replace each category with the number of times it occurs in the data. First, let's count how many times each neighbourhood appears: 

In [25]:
frequency_encoded = df.groupby(['neighbourhood']).size()

frequency_encoded

neighbourhood
Allerton            42
Arden Heights        4
Arrochar            21
Arverne             77
Astoria            900
                  ... 
Windsor Terrace    157
Woodhaven           88
Woodlawn            11
Woodrow              1
Woodside           235
Length: 221, dtype: int64

Then, we can replace our categories with these different frequencies:

In [31]:
df['neighbourhood'].apply(lambda x: frequency_encoded[x])

0         175
1        1545
2        2658
3         572
4        1117
         ... 
48890    3714
48891    2465
48892    2658
48893    1958
48894    1958
Name: neighbourhood, Length: 48895, dtype: int64

At the time of writing this notebook, I couldn't find an implementation of this, so I wrote one on my own:

In [26]:
class FrequencyEncoder:
    def fit(self, train_df, column):
        # Count how often each category occurs in the training data
        self.column = column
        self.frequencies = train_df[column].value_counts()
        return self

    def transform(self, test_df, column=None):
        if column is None:
            column = self.column
        # Categories not seen during fit get a frequency of 0.
        # Note: the new column is added to test_df in place.
        col_name = column + '_freq'
        test_df.loc[:, col_name] = test_df[column].map(self.frequencies).fillna(0).astype(int)
        return test_df

fe = FrequencyEncoder()
fe.fit(df, column='neighbourhood')
df_freq_enc = fe.transform(df, column='neighbourhood')

df_freq_enc['neighbourhood_freq']

0         175
1        1545
2        2658
3         572
4        1117
         ... 
48890    3714
48891    2465
48892    2658
48893    1958
48894    1958
Name: neighbourhood_freq, Length: 48895, dtype: int64
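
A common variant encodes the relative frequency (the share of rows) instead of the raw count, which keeps values comparable across datasets of different sizes. The sketch below assumes a made-up train/test split and learns the shares from the training rows only:

```python
import pandas as pd

# Toy train/test split (illustrative values only)
train = pd.DataFrame({'neighbourhood': ['Harlem', 'Harlem', 'Midtown', 'Astoria', 'Harlem']})
test = pd.DataFrame({'neighbourhood': ['Harlem', 'Astoria', 'Woodrow']})  # 'Woodrow' is unseen

# Share of training rows per category
shares = train['neighbourhood'].value_counts(normalize=True)

# Series.map returns NaN for categories missing from `shares`; treat them as share 0
test_freq = test['neighbourhood'].map(shares).fillna(0.0)
print(test_freq.tolist())
```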

### Target Encoding

Another method of encoding is called target encoding. Earlier, we learned about frequency encoding, where each category is replaced by the number of times it occurs. Instead of that count, we can encode each category as the mean of the target variable (here, `price`) over the rows in that category, like this:

In [29]:
df.groupby('neighbourhood')['price'].mean()


neighbourhood
Allerton            87.595238
Arden Heights       67.250000
Arrochar           115.000000
Arverne            171.779221
Astoria            117.187778
                      ...    
Windsor Terrace    138.993631
Woodhaven           67.170455
Woodlawn            60.090909
Woodrow            700.000000
Woodside            85.097872
Name: price, Length: 221, dtype: float64

Similar to before, I couldn't find an implementation online, so I wrote one myself that you can re-use:

In [25]:
class TargetEncoder:
    def fit(self, train_df, target_col, categ_col):
        self.target_col = target_col
        self.categ_col = categ_col
        # Mean of the target for each category, learned from the training data
        self.means = train_df.groupby(categ_col)[target_col].mean()
        # Fallback value for categories not seen during fit
        self.global_mean = train_df[target_col].mean()
        return self

    def transform(self, test_df, column=None):
        if column is None:
            column = self.categ_col

        # Note: the new column is added to test_df in place
        col_name = column + '_target_enc'
        test_df.loc[:, col_name] = test_df[column].map(self.means).fillna(self.global_mean)
        return test_df

# Fitting and transforming the same rows lets the target leak into the feature;
# in practice, fit on the training split only.
te = TargetEncoder()
te.fit(df, target_col='price', categ_col='neighbourhood')

te_df = te.transform(df)

te_df


Unnamed: 0.1,Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,...,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,expensive.1,non-expensive,neighbourhood_prob_ratio,neighbourhood_target_enc
0,0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,...,9,2018-10-19,0.21,6,365,non-expensive,0,1,0.091429,92.885714
1,1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,...,45,2019-05-21,0.38,2,355,expensive,1,0,0.623948,282.719094
2,2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,...,0,,,1,365,non-expensive,0,1,0.132807,118.974041
3,3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,...,270,2019-07-05,4.64,1,194,non-expensive,0,1,0.243007,181.893357
4,4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,...,9,2018-11-19,0.10,1,0,non-expensive,0,1,0.168308,133.198747
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,...,0,,,2,9,non-expensive,0,1,0.100969,107.678244
48891,48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,...,0,,,2,36,non-expensive,0,1,0.063692,84.800406
48892,48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,...,0,,,1,27,non-expensive,0,1,0.132807,118.974041
48893,48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,...,0,,,6,2,non-expensive,0,1,0.466803,204.794178
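
A category with very few rows gets a noisy mean: in the output above, Woodrow has a single listing and a mean price of 700. One common remedy, which is not what the encoder above does, is to shrink each category mean toward the global mean. The sketch below uses made-up prices and a hypothetical smoothing strength `m`:

```python
import pandas as pd

# Toy data: one well-populated category and one with a single listing (illustrative values)
toy = pd.DataFrame({
    'neighbourhood': ['Harlem', 'Harlem', 'Harlem', 'Harlem', 'Woodrow'],
    'price': [100, 120, 110, 130, 700],
})

m = 2  # hypothetical smoothing strength: acts like m extra listings priced at the global mean
global_mean = toy['price'].mean()

stats = toy.groupby('neighbourhood')['price'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

toy['neighbourhood_target_enc'] = toy['neighbourhood'].map(smoothed)
print(smoothed)
```

The larger `m` is, the more small categories are pulled toward the global mean; categories with many rows are barely affected.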


### Probability Ratio Encoding

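Probability Ratio Encoding is similar to target encoding, but it applies to a binary target: rather than the mean of a numeric target, we use the probability that a row in the category has a positive label. The implementation below encodes that probability directly; the ratio form of the technique divides it by the probability of a negative label, P(1) / P(0). Here, we've written out our own implementation: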

In [5]:
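class ProbabilityRatioEncoder:
    def fit(self, train_df, categ_col, target_col):
        self.categ_col = categ_col
        self.target_col = target_col
        grouped = train_df.groupby(categ_col)
        # Rows per category
        self.totals = grouped.size()
        # Positive labels per category (the target must be 0/1).
        # numeric_only=True skips text columns, including the string `expensive` column.
        self.sums = grouped.sum(numeric_only=True)[target_col]
        return self

    def transform(self, test_df, constant=0):
        # Share of positive labels per category; `constant` is added to the positive count
        ratio_encoder = (self.sums + constant) / self.totals

        # Note: the new column is added to test_df in place
        col_name = self.categ_col + '_prob_ratio'
        test_df.loc[:, col_name] = test_df[self.categ_col].map(ratio_encoder)
        return test_df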

In [12]:
# Create a binary (0/1) target: get_dummies yields numeric `expensive` and `non-expensive` indicator columns
target_variable = pd.get_dummies(df['expensive'], dtype=int)

df = pd.concat([df,target_variable], axis = 1)

In [27]:
pre = ProbabilityRatioEncoder()

pre.fit(df, 'neighbourhood','expensive')
pre_df = pre.transform(df)

pre_df


Unnamed: 0.1,Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,...,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,expensive.1,non-expensive,neighbourhood_prob_ratio,neighbourhood_target_enc
0,0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,...,9,2018-10-19,0.21,6,365,non-expensive,0,1,0.091429,92.885714
1,1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,...,45,2019-05-21,0.38,2,355,expensive,1,0,0.623948,282.719094
2,2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,...,0,,,1,365,non-expensive,0,1,0.132807,118.974041
3,3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,...,270,2019-07-05,4.64,1,194,non-expensive,0,1,0.243007,181.893357
4,4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,...,9,2018-11-19,0.10,1,0,non-expensive,0,1,0.168308,133.198747
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,...,0,,,2,9,non-expensive,0,1,0.100969,107.678244
48891,48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,...,0,,,2,36,non-expensive,0,1,0.063692,84.800406
48892,48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,...,0,,,1,27,non-expensive,0,1,0.132807,118.974041
48893,48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,...,0,,,6,2,non-expensive,0,1,0.466803,204.794178
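
To make the difference between the encoded probability and the ratio form concrete, here is a small worked sketch on made-up data. The `is_expensive` column name is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Toy binary target per listing (illustrative values only)
toy = pd.DataFrame({
    'neighbourhood': ['Midtown', 'Midtown', 'Midtown', 'Midtown',
                      'Harlem', 'Harlem', 'Harlem', 'Harlem'],
    'is_expensive': [1, 1, 1, 0, 1, 0, 0, 0],
})

# P(positive | category): the mean of a 0/1 target
p_pos = toy.groupby('neighbourhood')['is_expensive'].mean()

# Ratio form: P(positive) / P(negative); left as NaN when a category has no negatives
p_neg = 1 - p_pos
prob_ratio = p_pos / p_neg.replace(0, np.nan)

print(p_pos)
print(prob_ratio)
```

Here Midtown has three positives out of four listings, so its probability is 0.75 and its ratio is 3; Harlem has one out of four, giving 0.25 and one third.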


### Weight of Evidence Encoding

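Weight of Evidence encoding is similar to probability ratio encoding. The only difference is that we apply a log transform on top of the probability ratio transformation. Note that the implementation below takes the log of the share of positive labels in each category, whereas the textbook definition of weight of evidence is the log of the ratio between a category's share of all positives and its share of all negatives: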

In [28]:
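class WeightofEvidenceEncoder:
    def fit(self, train_df, categ_col, target_col):
        self.categ_col = categ_col
        self.target_col = target_col
        grouped = train_df.groupby(categ_col)
        # Rows per category
        self.totals = grouped.size()
        # Positive labels per category (the target must be 0/1).
        # numeric_only=True skips text columns, including the string `expensive` column.
        self.sums = grouped.sum(numeric_only=True)[target_col]
        return self

    def transform(self, test_df, constant=0):
        # Log of the share of positive labels per category.
        # With constant=0, categories with no positives give log(0) = -inf.
        woe_encoder = np.log((self.sums + constant) / self.totals)

        # Note: the new column is added to test_df in place
        col_name = self.categ_col + '_woe'
        test_df.loc[:, col_name] = test_df[self.categ_col].map(woe_encoder)
        return test_df

woe = WeightofEvidenceEncoder()

woe.fit(df, 'neighbourhood','expensive')
woe_df = woe.transform(df)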

woe_df


Unnamed: 0.1,Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,...,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,expensive.1,non-expensive,neighbourhood_prob_ratio,neighbourhood_target_enc,neighbourhood_woe
0,0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,...,2018-10-19,0.21,6,365,non-expensive,0,1,0.091429,92.885714,-2.392197
1,1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,...,2019-05-21,0.38,2,355,expensive,1,0,0.623948,282.719094,-0.471688
2,2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,...,,,1,365,non-expensive,0,1,0.132807,118.974041,-2.018861
3,3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,...,2019-07-05,4.64,1,194,non-expensive,0,1,0.243007,181.893357,-1.414665
4,4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,...,2018-11-19,0.10,1,0,non-expensive,0,1,0.168308,133.198747,-1.781960
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,...,,,2,9,non-expensive,0,1,0.100969,107.678244,-2.292939
48891,48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,...,,,2,36,non-expensive,0,1,0.063692,84.800406,-2.753701
48892,48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,...,,,1,27,non-expensive,0,1,0.132807,118.974041,-2.018861
48893,48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,...,,,6,2,non-expensive,0,1,0.466803,204.794178,-0.761848
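
For comparison, the textbook weight of evidence for a category is the log of (its share of all positives) over (its share of all negatives). The sketch below computes it on made-up data; the `is_expensive` column and the smoothing constant `eps` are assumptions for illustration, not part of the encoder above:

```python
import numpy as np
import pandas as pd

# Toy binary target per listing (illustrative values only)
toy = pd.DataFrame({
    'neighbourhood': ['Midtown', 'Midtown', 'Midtown', 'Midtown',
                      'Harlem', 'Harlem', 'Harlem', 'Harlem'],
    'is_expensive': [1, 1, 1, 0, 1, 0, 0, 0],
})

grouped = toy.groupby('neighbourhood')['is_expensive']
pos = grouped.sum()            # positives per category
neg = grouped.count() - pos    # negatives per category

eps = 0.5  # hypothetical smoothing constant to avoid log(0) and division by zero

# Share of all positives and share of all negatives that fall in each category
pos_share = (pos + eps) / (pos.sum() + eps * len(pos))
neg_share = (neg + eps) / (neg.sum() + eps * len(neg))

woe_values = np.log(pos_share / neg_share)
print(woe_values)
```

A positive value means the category holds a larger share of the positives than of the negatives; a negative value means the opposite.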


### Binning

The last technique is called binning. This is where we take a numeric variable and group its values into a fixed set of buckets, transforming it into a categorical variable:

In [36]:
df.loc[:,'last_review'] = pd.to_datetime(df['last_review'])

def reviews_bin(x):
    if x < 50:
        return 'less_50'
    if x >= 50 and x < 100:
        return '50_to_100'
    if x >= 100 and x < 150:
        return '100_to_150'
    if x >= 150 and x < 200:
        return '150_to_200'
    if x >= 200:
        return '200_plus'
    
df['number_of_reviews'].map(reviews_bin)

0         less_50
1         less_50
2         less_50
3        200_plus
4         less_50
           ...   
48890     less_50
48891     less_50
48892     less_50
48893     less_50
48894     less_50
Name: number_of_reviews, Length: 48895, dtype: object
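
The same bucketing can also be expressed with `pd.cut`, which takes the bin edges and labels directly. This is a sketch on a few made-up review counts, assuming non-negative values (a negative value would fall outside every bin and become NaN):

```python
import numpy as np
import pandas as pd

# A few illustrative review counts
reviews = pd.Series([9, 45, 0, 270, 50, 149, 200], name='number_of_reviews')

edges = [0, 50, 100, 150, 200, np.inf]
labels = ['less_50', '50_to_100', '100_to_150', '150_to_200', '200_plus']

# right=False makes each bin [lower, upper), matching the `x >= lower and x < upper` checks
binned = pd.cut(reviews, bins=edges, labels=labels, right=False)
print(binned)
```

Unlike the hand-written function, `pd.cut` returns an ordered categorical, so the buckets keep their order when sorted or plotted.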

## Comparison

In [74]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run_logistic_regression_pipeline(df, feature_cols):
    # Drop repeated column names so df['expensive'] is the original string column
    df = df.loc[:, ~df.columns.duplicated()]

    # Use only the encoded feature(s); drop rows with missing or infinite values
    X = df[feature_cols].replace([np.inf, -np.inf], np.nan).dropna()
    # Classify the binary `expensive` label rather than the raw price
    y = (df.loc[X.index, 'expensive'] == 'expensive').astype(int)

    # Shuffled, stratified split instead of slicing the first 30,000 rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_train, y_train)
    preds = lr.predict(X_test)

    return accuracy_score(y_test, preds)

# Caveat: the encoders above were fit on the full dataset, so the target-based
# encodings have seen the test rows and these scores are optimistic.

# One hot encoding (one indicator column per neighbourhood)
hot_encode_accuracy_score = run_logistic_regression_pipeline(hot_encoded_df, list(mult_hot_enc.columns))

# Ordinal encoding (note: this encodes room_type, not neighbourhood)
ordinal_accuracy_score = run_logistic_regression_pipeline(ord_encoded_df, ['room_type_ord_encoded'])

# Target encoding
target_encode_accuracy_score = run_logistic_regression_pipeline(te_df, ['neighbourhood_target_enc'])

# Probability ratio encoding
prob_ratio_accuracy_score = run_logistic_regression_pipeline(pre_df, ['neighbourhood_prob_ratio'])

# Weight of evidence encoding (pass the encoded DataFrame, not the encoder object)
woe_accuracy_score = run_logistic_regression_pipeline(woe_df, ['neighbourhood_woe'])


print('Hot Encoded Accuracy Score: %.3f' % hot_encode_accuracy_score)
print('Ordinal Encoded Accuracy Score: %.3f' % ordinal_accuracy_score)
print('Target Encoded Accuracy Score: %.3f' % target_encode_accuracy_score)
print('Probability Ratio Encoded Accuracy Score: %.3f' % prob_ratio_accuracy_score)
print('Weight of Evidence Encoded Accuracy Score: %.3f' % woe_accuracy_score)

## Summary

In this section, you learned about a variety of feature engineering techniques for categorical variables:

- One Hot Encoding
- Ordinal Encoding
- Frequency Encoding
- Target Encoding
- Probability Ratio Encoding
- Weight of Evidence Encoding
- Binning