# Categorical Features

AirBnB is a company that operates an online marketplace for lodging, primarily homestays for vacation rentals, and tourism activities. In this section, we'll use AirBnB New York City data to learn how to handle categorical variables. Each row in the dataset corresponds to a specific home or apartment, with variables such as price, number of reviews, and minimum nights required. The main techniques covered are:

- One Hot Encoding
- Ordinal Encoding
- Frequency Encoding



## Import Libraries

We'll first need to import the relevant libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [8]:
# Load Data
df = pd.read_csv("../data/AirBnB_dataset_ML_process.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive
0,0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,non-expensive
1,1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,expensive
2,2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365,non-expensive
3,3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,non-expensive
4,4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0,non-expensive


In [9]:
# df info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Unnamed: 0                      48895 non-null  int64  
 1   id                              48895 non-null  int64  
 2   name                            48879 non-null  object 
 3   host_id                         48895 non-null  int64  
 4   host_name                       48874 non-null  object 
 5   neighbourhood_group             48895 non-null  object 
 6   neighbourhood                   48895 non-null  object 
 7   latitude                        48895 non-null  float64
 8   longitude                       48895 non-null  float64
 9   room_type                       48895 non-null  object 
 10  price                           48895 non-null  int64  
 11  minimum_nights                  48895 non-null  int64  
 12  number_of_reviews               

In [10]:
# null values
df.isnull().sum()

Unnamed: 0                            0
id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
expensive                             0
dtype: int64

In [11]:
# shape of the dataset
df.shape

(48895, 18)

## One Hot Encoding

The first technique is one hot encoding, one of the simplest ways to encode a categorical variable: each category becomes its own binary column. We'll start with the categorical column `expensive`.

### One Categorical Variable

This column takes one of two values, `expensive` or `non-expensive`. What would it look like if we one hot encoded it?

In [12]:
df["expensive"].value_counts()

expensive
non-expensive    36718
expensive        12177
Name: count, dtype: int64

In [13]:
# One hot encoding turns each of these values into its own column, then marks each row True (1) or False (0) depending on whether it has that value:
dummies = pd.get_dummies(df['expensive'])
dummies.head()

Unnamed: 0,expensive,non-expensive
0,False,True
1,True,False
2,False,True
3,False,True
4,False,True
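
For a two-valued column, the two indicator columns are exact complements, so one of them is redundant. The sketch below uses a small toy DataFrame (the values are illustrative, not taken from the dataset) to show the `dtype` and `drop_first` arguments of `pd.get_dummies`:

```python
import pandas as pd

# Toy stand-in for the `expensive` column (illustrative values only)
toy = pd.DataFrame({"expensive": ["non-expensive", "expensive", "non-expensive"]})

# One column per category, stored as 0/1 integers instead of booleans
full = pd.get_dummies(toy["expensive"], dtype=int)

# drop_first=True drops the first category's column; for a two-valued
# column the remaining indicator still carries all the information
compact = pd.get_dummies(toy["expensive"], drop_first=True, dtype=int)

print(full)
print(compact)
```

Whether to drop a column depends on the model: linear models often benefit from it, while tree-based models usually do not need it.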


### Multiple Categorical Variables

Now, let's look at a column with many possible categories: `neighbourhood`. It has 221 unique values, so one hot encoding it creates 221 new columns. A categorical variable with this many distinct values is said to have high **cardinality**. For some models, one hot encoding such a column significantly increases the complexity of both the dataset and the model, which can lead to overfitting, large memory consumption, or slow training times:

In [16]:
# Total number of unique values
df['neighbourhood'].nunique()

221

In [17]:
mul_ohe = pd.get_dummies(df["neighbourhood"])
mul_ohe.head()

Unnamed: 0,Allerton,Arden Heights,Arrochar,Arverne,Astoria,Bath Beach,Battery Park City,Bay Ridge,Bay Terrace,"Bay Terrace, Staten Island",...,Westerleigh,Whitestone,Williamsbridge,Williamsburg,Willowbrook,Windsor Terrace,Woodhaven,Woodlawn,Woodrow,Woodside
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [18]:
# Concatenate the dummy columns with the original DataFrame to see the result

ohe_df = pd.concat([df, mul_ohe], axis=1)
ohe_df.head()

Unnamed: 0.1,Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,...,Westerleigh,Whitestone,Williamsbridge,Williamsburg,Willowbrook,Windsor Terrace,Woodhaven,Woodlawn,Woodrow,Woodside
0,0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,...,False,False,False,False,False,False,False,False,False,False
1,1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,...,False,False,False,False,False,False,False,False,False,False
2,2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,...,False,False,False,False,False,False,False,False,False,False
3,3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,...,False,False,False,False,False,False,False,False,False,False
4,4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,...,False,False,False,False,False,False,False,False,False,False


In [19]:
# The column count is the previous 18 plus the 221 new columns

ohe_df.shape

(48895, 239)
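
As an alternative to `pd.concat`, `pd.get_dummies` can encode a named column of a DataFrame directly through its `columns` argument. The sketch below, on a toy DataFrame with made-up rows, shows that the original column is dropped and the new columns are prefixed with its name, so the width grows by the number of categories minus one:

```python
import pandas as pd

# Toy data; the rows are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "neighbourhood": ["Harlem", "Midtown", "Harlem", "Kensington"],
    "price": [150, 225, 80, 149],
})

# columns= encodes the named column: the original column is dropped and
# each new column is prefixed with the source column name
encoded = pd.get_dummies(toy, columns=["neighbourhood"])

n_categories = toy["neighbourhood"].nunique()
print(encoded.shape)          # rows, (original columns - 1 + n_categories)
print(list(encoded.columns))
```

The prefix also avoids name clashes when a category value happens to match an existing column name.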

### Ordinal Encoding

There are several ways to address the cardinality problem. The first is ordinal encoding, which replaces each category with a number. These numbers carry an inherent ordering, so the method fits best when the categories are themselves ordered, for example high school -> college -> grad_school. Luckily, sklearn already provides an implementation of ordinal encoding:

In [20]:
from sklearn.preprocessing import OrdinalEncoder



In [22]:
# we will work on room type
df["room_type"].unique()

array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)

In [26]:
# Fit the encoder on room_type and replace each category with a numeric code.
# The double brackets keep the input 2-D, which OrdinalEncoder expects.
encoder = OrdinalEncoder()

result = encoder.fit_transform(df[['room_type']])
print(pd.DataFrame(result))

         0
0      1.0
1      0.0
2      1.0
3      0.0
4      0.0
...    ...
48890  1.0
48891  1.0
48892  0.0
48893  2.0
48894  1.0

[48895 rows x 1 columns]


In [29]:
df["room_type_ord_encoded"] = result
df['room_type_ord_encoded'].head()

0    1.0
1    0.0
2    1.0
3    0.0
4    0.0
Name: room_type_ord_encoded, dtype: float64

In [31]:
df.drop(columns="room_type", inplace=True)

In [32]:
df['room_type_ord_encoded']

0        1.0
1        0.0
2        1.0
3        0.0
4        0.0
        ... 
48890    1.0
48891    1.0
48892    0.0
48893    2.0
48894    1.0
Name: room_type_ord_encoded, Length: 48895, dtype: float64
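
Note that the codes above follow the sorted order of the category names (`Entire home/apt` is 0, `Private room` is 1, `Shared room` is 2), which is how `OrdinalEncoder` assigns codes by default. When the categories have a meaningful order, it can be passed explicitly through the `categories` argument. A minimal sketch, using a hypothetical `education` column that is not part of the dataset:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical education column, used only to illustrate an ordered category
toy = pd.DataFrame({"education": ["college", "high school", "grad_school", "college"]})

# Default behaviour: categories are sorted, so codes follow alphabetical order
default_codes = OrdinalEncoder().fit_transform(toy[["education"]]).ravel()

# Passing the categories explicitly makes the codes follow the intended order
ordered = OrdinalEncoder(categories=[["high school", "college", "grad_school"]])
ordered_codes = ordered.fit_transform(toy[["education"]]).ravel()

print(default_codes)
print(ordered_codes)
```

With the explicit list, `high school` maps to 0, `college` to 1, and `grad_school` to 2, matching the ordering described in the text.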

### Frequency Encoding

Another technique to address the cardinality issue is frequency encoding. Rather than replacing categories with ordinal codes, we replace each category with the number of times it occurs. First, let's count how many times each neighbourhood appears:

In [34]:
freq_encod = df.groupby(["neighbourhood"]).size()
freq_encod

neighbourhood
Allerton            42
Arden Heights        4
Arrochar            21
Arverne             77
Astoria            900
                  ... 
Windsor Terrace    157
Woodhaven           88
Woodlawn            11
Woodrow              1
Woodside           235
Length: 221, dtype: int64

In [35]:
# Then, we can replace our categories with these different frequencies:
df['neighbourhood'].apply(lambda x: freq_encod[x])

0         175
1        1545
2        2658
3         572
4        1117
         ... 
48890    3714
48891    2465
48892    2658
48893    1958
48894    1958
Name: neighbourhood, Length: 48895, dtype: int64

In [39]:
df['neighbourhood'].size

48895
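
Dividing each count by the number of rows gives a relative frequency, which is sometimes easier to compare across datasets of different sizes. The sketch below uses toy `train` and `test` Series (illustrative values) and also shows one possible way to handle a category that never appeared in the training data; filling with 0 is an assumption, not the only reasonable choice:

```python
import pandas as pd

# Toy data; the values are illustrative
train = pd.Series(["Harlem", "Midtown", "Harlem", "Harlem"], name="neighbourhood")
test = pd.Series(["Midtown", "Harlem", "Astoria"], name="neighbourhood")

# Share of training rows in each category (count divided by number of rows)
rel_freq = train.value_counts(normalize=True)

# map looks each value up in rel_freq; categories absent from train become
# NaN, which is replaced with 0 here (an illustrative choice)
test_encoded = test.map(rel_freq).fillna(0.0)

print(test_encoded.tolist())
```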

In [43]:
class FrequencyEncoder:
    def fit(self, train_df, column):
        # Learn the category counts from the training data only
        self.column = column
        self.frequencies = train_df[column].value_counts()
        return self

    def transform(self, test_df, column):
        # Work on a copy so the caller's DataFrame is not modified
        encoded = test_df.copy()
        # Map each category to its training count; categories unseen in training get 0
        encoded[column + '_freq'] = encoded[column].map(self.frequencies).fillna(0).astype('int64')
        return encoded

fe = FrequencyEncoder()
fe.fit(df, column='neighbourhood')
df_freq_enc = fe.transform(df, column='neighbourhood')

df_freq_enc['neighbourhood_freq']

0         175
1        1545
2        2658
3         572
4        1117
         ... 
48890    3714
48891    2465
48892    2658
48893    1958
48894    1958
Name: neighbourhood_freq, Length: 48895, dtype: int64

### Target Encoding

Another method of encoding is called target encoding. Frequency encoding replaces each category with the number of times it occurs; target encoding instead replaces each category with the mean of the target variable over the rows in that category. Taking `price` as the target here:

In [51]:
df['price']

0        149
1        225
2        150
3         89
4         80
        ... 
48890     70
48891     40
48892    115
48893     55
48894     90
Name: price, Length: 48895, dtype: int64
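
A minimal sketch of the idea on a toy DataFrame (the rows are made up), assuming `price` is the target and `neighbourhood` is the column to encode:

```python
import pandas as pd

# Toy data; the rows are illustrative
toy = pd.DataFrame({
    "neighbourhood": ["Harlem", "Midtown", "Harlem", "Midtown", "Kensington"],
    "price": [150, 225, 80, 175, 149],
})

# Mean of the target within each category
category_means = toy.groupby("neighbourhood")["price"].mean()

# Replace each category with its mean target value
toy["neighbourhood_target"] = toy["neighbourhood"].map(category_means)

print(toy)
```

Because the encoding is built from the target itself, the means are normally computed on the training split only and then applied to held-out data, so that information about the target does not leak into evaluation.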

There are other types of encoding as well:

- Target Encoding
- Probability Ratio Encoding
- Weight of Evidence Encoder
- Binning

### Probability Ratio Encoding

Probability ratio encoding is similar to target encoding. Rather than the mean of the target, it uses the ratio between the probability that a row in the category has a positive label and the probability that it has a negative label.
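
A minimal sketch on toy data (the rows are made up), assuming `expensive` is the positive label and `room_type` is the column to encode. A category with no negative rows would produce a division by zero, which this sketch does not handle:

```python
import pandas as pd

# Toy data; the rows are illustrative
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt", "Private room"],
    "expensive": ["non-expensive", "expensive", "non-expensive",
                  "expensive", "non-expensive", "expensive"],
})

# Treat "expensive" as the positive label (1) and everything else as 0
is_positive = (toy["expensive"] == "expensive").astype(int)

# P(positive) within each category, then the ratio P(positive) / P(negative)
p_pos = is_positive.groupby(toy["room_type"]).mean()
ratio = p_pos / (1 - p_pos)

toy["room_type_ratio"] = toy["room_type"].map(ratio)
print(ratio)
```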

### Weight of Evidence Encoding

Weight of Evidence encoding is similar to probability ratio encoding. The only difference is that we apply a log transform on top of the probability ratio.
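
The sketch below follows the definition given in the text, taking the natural log of the per-category ratio P(positive) / P(negative). The toy data and the `label` column are assumptions for illustration. Some references define Weight of Evidence from each category's share of all positives and all negatives instead, which shifts every value by the same constant:

```python
import numpy as np
import pandas as pd

# Toy data; `label` is a hypothetical binary target (1 = positive)
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt", "Private room"],
    "label": [0, 1, 0, 1, 0, 1],
})

# P(positive) within each category
p_pos = toy.groupby("room_type")["label"].mean()

# Log of the probability ratio; undefined when a category has no
# positives or no negatives, which this sketch does not handle
woe = np.log(p_pos / (1 - p_pos))

toy["room_type_woe"] = toy["room_type"].map(woe)
print(woe)
```

Positive values mark categories where positives outnumber negatives, and negative values mark the opposite.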


### Binning

The last technique is called binning. Here we take a continuous variable and group its values into buckets, transforming the continuous variable into a categorical one.
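
A minimal sketch with `pd.cut`, using a few illustrative prices. The bin edges and the labels `low`, `medium`, and `high` are arbitrary choices made for this example:

```python
import pandas as pd

# A few illustrative prices
prices = pd.Series([149, 225, 150, 89, 80], name="price")

# Fixed edges produce the intervals (0, 100], (100, 200], (200, 300]
labels = ["low", "medium", "high"]
price_bin = pd.cut(prices, bins=[0, 100, 200, 300], labels=labels)

print(price_bin.tolist())
```

`pd.qcut` is a related function that picks the edges from quantiles, so each bucket holds roughly the same number of rows.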
