# Categorical Features

In [2]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

## Load Data

In [3]:
df = pd.read_csv("AB_NYC_2019.csv")

## The Dataset

In [4]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


## Real-Life Applications

Airbnb is a company that operates an online marketplace for lodging (primarily homestays for vacation rentals) and tourism activities. In this section, we'll use Airbnb's New York City data to learn how to handle categorical variables.

Before we dive into anything related to the actual analysis, let's walk through the dataset and its business applications. 

**Pricing Recommendations**: Airbnb doesn't set its prices; hosts do. We can therefore use this data to predict the price of a stay and provide price recommendations to hosts. Price recommendations help hosts decide on a suitable price. Since Airbnb earns money when users book with a host, a good price point would ideally maximize both host and Airbnb earnings while minimizing the cost to the user.

**Room Recommendations**: From a user perspective, we often aren't looking for one specific apartment or homestay, but we'd like to see similar listings when making a booking. This dataset could also be used to recommend similar rooms to a user. Why is this valuable? From a business perspective, good room recommendations would likely drive more bookings, and more bookings mean more revenue for both hosts and Airbnb. If users can't find listings they like, they may well go to a different platform.

**Supply/Demand**: In Airbnb's case, supply is the number of available listings and demand is the number of people who want to book. Since Airbnb is a two-sided marketplace, it needs to make sure that if there is a lot of demand in one neighborhood, there are hosts in that neighborhood to meet it. If Airbnb does not have enough hosts in the area, it is leaving money on the table. If a neighborhood has many hosts but little demand, those hosts aren't earning money, which will ultimately erode their satisfaction with the platform and may lead them to leave. We could use this data to forecast supply growth and availability in order to understand supply and demand.



## One Hot Encoding

### One Categorical Variable

In [60]:
df['expensive'] = df['price'].apply(lambda x: 'expensive' if x > 175 else 'non-expensive')

In [23]:
dummies = pd.get_dummies(df['expensive'])

dummies.head()

Unnamed: 0,expensive,non-expensive
0,0,1
1,1,0
2,0,1
3,0,1
4,0,1


### Multiple Categorical Variables

In [22]:
pd.get_dummies(df['neighbourhood'])

Unnamed: 0,Allerton,Arden Heights,Arrochar,Arverne,Astoria,Bath Beach,Battery Park City,Bay Ridge,Bay Terrace,"Bay Terrace, Staten Island",...,Westerleigh,Whitestone,Williamsbridge,Williamsburg,Willowbrook,Windsor Terrace,Woodhaven,Woodlawn,Woodrow,Woodside
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48891,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48892,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48893,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
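
The cell above encodes a single column with many categories. To one-hot encode several categorical columns in one call, `pd.get_dummies` accepts a `columns` argument. The sketch below uses a small made-up frame (the `toy` values are illustrative, not taken from the dataset) so it can run without the CSV.

```python
import pandas as pd

# Toy stand-in for the listings table; values are illustrative only.
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
    "price": [149, 225, 150, 55],
})

# Encode both categorical columns at once; numeric columns pass through unchanged.
# Each new column is named <original column>_<category>.
encoded = pd.get_dummies(toy, columns=["neighbourhood_group", "room_type"], dtype=int)

print(encoded.columns.tolist())
```

On the full dataset, applying this to `neighbourhood` would add one column per neighbourhood, which is one reason the later encodings in this notebook can be more practical for high-cardinality columns.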


### Ordinal Encoding

In [23]:
from sklearn.preprocessing import OrdinalEncoder

# Select the column to encode as a 2D array (the encoder expects shape [n_samples, n_features])
data = np.asarray(df[['room_type']])

# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(pd.DataFrame(result))

         0
0      1.0
1      0.0
2      1.0
3      0.0
4      0.0
...    ...
48890  1.0
48891  1.0
48892  0.0
48893  2.0
48894  1.0

[48895 rows x 1 columns]


In [25]:
ord_encoded = pd.DataFrame(result)

pd.concat([df, ord_encoded], axis = 1)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,0
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,1.0
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,0.0
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365,1.0
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,0.0
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9,1.0
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36,1.0
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27,0.0
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2,2.0
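
In the output above, the default encoder assigned codes in alphabetical order (`Entire home/apt` = 0, `Private room` = 1, `Shared room` = 2). If the codes are meant to carry an order, `OrdinalEncoder` accepts an explicit `categories` list. The ordering used below (by amount of private space) and the column name `room_type_ord` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for df[['room_type']]; values are illustrative only.
toy = pd.DataFrame(
    {"room_type": ["Private room", "Entire home/apt", "Shared room", "Private room"]}
)

# Assumed ordering, from least to most private space (not defined by the dataset).
room_order = ["Shared room", "Private room", "Entire home/apt"]

# One list of categories per encoded column.
encoder = OrdinalEncoder(categories=[room_order])
toy["room_type_ord"] = encoder.fit_transform(toy[["room_type"]]).ravel()

print(toy)
```

Giving the new column a descriptive name also avoids the unnamed `0` column produced by the `pd.concat` call above.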


### Frequency Encoding

In [28]:
frequency_encoded = df.groupby(['neighbourhood']).size()

frequency_encoded

neighbourhood
Allerton            42
Arden Heights        4
Arrochar            21
Arverne             77
Astoria            900
                  ... 
Windsor Terrace    157
Woodhaven           88
Woodlawn            11
Woodrow              1
Woodside           235
Length: 221, dtype: int64

In [31]:
df['neighbourhood'].apply(lambda x: frequency_encoded[x])

0         175
1        1545
2        2658
3         572
4        1117
         ... 
48890    3714
48891    2465
48892    2658
48893    1958
48894    1958
Name: neighbourhood, Length: 48895, dtype: int64

In [41]:
class FrequencyEncoder:
    def fit(self, train_df, column):
        # Count how often each category occurs in the training data
        self.column = column
        self.frequencies = train_df.groupby(column).size()

    def transform(self, test_df, column):
        # Look up each row's category count and store it in a new column
        col_name = column + '_freq'
        test_df.loc[:, col_name] = test_df[column].apply(lambda x: self.frequencies[x])
        return test_df

fe = FrequencyEncoder()
fe.fit(df, column='neighbourhood')
fe.transform(df, column='neighbourhood')

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood_freq
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,175,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,175
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,1545,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,1545
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,2658,40.80902,-73.94190,Private room,150,3,0,,,1,365,2658
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,572,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,572
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,1117,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0,1117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,3714,40.67853,-73.94995,Private room,70,2,0,,,2,9,3714
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,2465,40.70184,-73.93317,Private room,40,4,0,,,2,36,2465
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,2658,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27,2658
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,1958,40.75751,-73.99112,Shared room,55,1,0,,,6,2,1958
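
Looking up each category by label in the fitted counts raises a `KeyError` when a category did not appear in the training data. A minimal sketch of one way to handle that, assuming a count of 0 is an acceptable encoding for unseen categories; the `train` and `test` frames are made up for illustration.

```python
import pandas as pd

# Illustrative train/test frames; "Woodrow" does not occur in train.
train = pd.DataFrame({"neighbourhood": ["Harlem", "Harlem", "Midtown", "Astoria", "Harlem"]})
test = pd.DataFrame({"neighbourhood": ["Harlem", "Astoria", "Woodrow"]})

# Counts learned from the training data only.
counts = train["neighbourhood"].value_counts()

# Series.map returns NaN for labels missing from `counts`;
# here they are replaced with 0 (an assumption, other defaults are possible).
test["neighbourhood_freq"] = test["neighbourhood"].map(counts).fillna(0).astype(int)

print(test)
```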


### Target Encoding

In [50]:
class TargetEncoder:
    def fit(self, train_df, target_col, categ_col):
        # Compute the mean target value per category from the training data
        self.target_col = target_col
        self.categ_col = categ_col
        self.means = train_df.groupby(categ_col)[target_col].mean()

    def transform(self, test_df, column=None):
        if column is None:
            column = self.categ_col

        # Replace each category with its mean target value
        col_name = column + '_target_enc'
        test_df.loc[:, col_name] = test_df[column].apply(lambda x: self.means[x])
        return test_df

te = TargetEncoder()
te.fit(df, target_col='price', categ_col='neighbourhood')
te.transform(df)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,neighbourhood_target_enc
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,92.885714
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,282.719094
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365,118.974041
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,181.893357
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0,133.198747
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9,107.678244
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36,84.800406
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27,118.974041
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2,204.794178
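
The cell above fits and transforms the same frame, so each row's own price contributes to its encoding. A minimal sketch of the usual precaution, assuming separate training and test frames and assuming the overall training mean is a reasonable fallback for unseen categories; all values below are made up.

```python
import pandas as pd

# Illustrative frames; "Woodrow" does not occur in train.
train = pd.DataFrame({
    "neighbourhood": ["Harlem", "Harlem", "Midtown", "Midtown", "Astoria"],
    "price": [100, 140, 300, 260, 90],
})
test = pd.DataFrame({"neighbourhood": ["Midtown", "Harlem", "Woodrow"]})

# Statistics are learned from the training data only.
category_means = train.groupby("neighbourhood")["price"].mean()
global_mean = train["price"].mean()

# Unseen categories fall back to the overall training mean (an assumption).
test["neighbourhood_target_enc"] = (
    test["neighbourhood"].map(category_means).fillna(global_mean)
)

print(test)
```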


### Probability Ratio Encoding

In [6]:
df['expensive'] = df['price'].apply(lambda x: 1 if x > 175 else 0)

In [23]:
class ProbabilityRatioEncoder:
    def fit(self, train_df, categ_col, target_col):
        self.train_df = train_df
        self.categ_col = categ_col
        self.target_col = target_col

    def transform(self, test_df, constant=0):
        # Rows per category, and positive (target == 1) rows per category
        totals = self.train_df.groupby(self.categ_col).size()
        sums = self.train_df.groupby(self.categ_col)[self.target_col].sum()

        # Share of positive rows per category; `constant` is added to the numerator only
        ratio_encoder = (sums + constant) / totals

        col_name = self.categ_col + '_prob_ratio'
        test_df.loc[:, col_name] = test_df[self.categ_col].apply(lambda x: ratio_encoder[x])
        return test_df

In [22]:
pre = ProbabilityRatioEncoder()

pre.fit(df, 'neighbourhood','expensive')
pre.transform(df)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,neighbourhood_prob_ratio
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,0,0.091429
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,1,0.623948
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365,0,0.132807
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,0,0.243007
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0,0,0.168308
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9,0,0.100969
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36,0,0.063692
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27,0,0.132807
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2,0,0.466803
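
The `neighbourhood_prob_ratio` column above holds the share of expensive listings per neighbourhood, P(expensive = 1). Probability ratio encoding is also commonly defined as the ratio P(1) / P(0). A minimal sketch of that variant on made-up data, assuming every category has at least one non-expensive listing (otherwise the division is by zero):

```python
import pandas as pd

# Illustrative data: 4 listings per neighbourhood, binary target.
toy = pd.DataFrame({
    "neighbourhood": ["Harlem"] * 4 + ["Midtown"] * 4,
    "expensive": [1, 0, 0, 0, 1, 1, 1, 0],
})

# P(expensive = 1) per category is the mean of the binary target.
p1 = toy.groupby("neighbourhood")["expensive"].mean()

# Ratio of positive to negative probability; undefined where p1 == 1.
ratio = p1 / (1 - p1)

toy["neighbourhood_prob_ratio"] = toy["neighbourhood"].map(ratio)
print(ratio)
```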


### Weight of Evidence Encoding

In [26]:
class WeightofEvidenceEncoder:
    def fit(self, train_df, categ_col, target_col):
        self.train_df = train_df
        self.categ_col = categ_col
        self.target_col = target_col

    def transform(self, test_df, constant=0):
        # Rows per category, and positive (target == 1) rows per category
        totals = self.train_df.groupby(self.categ_col).size()
        sums = self.train_df.groupby(self.categ_col)[self.target_col].sum()

        # Log of the share of positive rows per category.
        # With constant=0, categories with no positives give log(0) = -inf
        # and NumPy emits a divide-by-zero RuntimeWarning.
        woe_encoder = np.log((sums + constant) / totals)

        col_name = self.categ_col + '_woe'
        test_df.loc[:, col_name] = test_df[self.categ_col].apply(lambda x: woe_encoder[x])
        return test_df

pre = WeightofEvidenceEncoder()

pre.fit(df, 'neighbourhood','expensive')
pre.transform(df)


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,expensive,neighbourhood_prob_ratio,neighbourhood_woe
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,0,0.091429,-2.392197
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,1,0.623948,-0.471688
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365,0,0.132807,-2.018861
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,0,0.243007,-1.414665
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0,0,0.168308,-1.781960
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9,0,0.100969,-2.292939
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36,0,0.063692,-2.753701
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27,0,0.132807,-2.018861
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2,0,0.466803,-0.761848
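
The `neighbourhood_woe` column above is the log of the share of expensive listings in each neighbourhood. Weight of evidence is more commonly defined as the log of the ratio between a category's share of all events and its share of all non-events. A minimal sketch of that definition on made-up data follows; sign conventions vary between sources, and categories with zero events or zero non-events would need smoothing that is not shown here.

```python
import numpy as np
import pandas as pd

# Illustrative data: 4 listings per neighbourhood, binary target.
toy = pd.DataFrame({
    "neighbourhood": ["Harlem"] * 4 + ["Midtown"] * 4,
    "expensive": [1, 0, 0, 0, 1, 1, 1, 0],
})

grouped = toy.groupby("neighbourhood")["expensive"]
events = grouped.sum()                  # expensive listings per category
non_events = grouped.count() - events   # non-expensive listings per category

# Share of all events / non-events that fall in each category.
event_share = events / events.sum()
non_event_share = non_events / non_events.sum()

# WoE = ln(event share / non-event share); positive means "more expensive than average".
woe = np.log(event_share / non_event_share)

toy["neighbourhood_woe"] = toy["neighbourhood"].map(woe)
print(woe)
```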


### Binning

In [36]:
df['last_review'] = pd.to_datetime(df['last_review'])

def reviews_bin(x):
    # Map a review count to a labelled bin; each bin is [low, high)
    if x < 50:
        return 'less_50'
    if x < 100:
        return '50_to_100'
    if x < 150:
        return '100_to_150'
    if x < 200:
        return '150_to_200'
    if x >= 200:
        return '200_plus'

df['number_of_reviews'].map(reviews_bin)

0         less_50
1         less_50
2         less_50
3        200_plus
4         less_50
           ...   
48890     less_50
48891     less_50
48892     less_50
48893     less_50
48894     less_50
Name: number_of_reviews, Length: 48895, dtype: object
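
The same bins can be produced without a hand-written function by using `pd.cut`. The sketch below reuses the bin edges and labels from `reviews_bin` on a small made-up series, and assumes review counts are non-negative (a negative value would fall outside the first bin and become `NaN`).

```python
import pandas as pd

# Illustrative review counts, including values on the bin edges.
reviews = pd.Series([9, 45, 0, 270, 50, 149, 200])

edges = [0, 50, 100, 150, 200, float("inf")]
labels = ["less_50", "50_to_100", "100_to_150", "150_to_200", "200_plus"]

# right=False makes each bin [low, high), matching the if-chain in reviews_bin.
binned = pd.cut(reviews, bins=edges, labels=labels, right=False)

print(binned)
```

The result is an ordered categorical rather than a plain string column, which keeps the bin order available for sorting or ordinal encoding.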