# Introduction

This project is a regression problem: predicting `reviews_per_month`, used as a proxy for a listing's popularity, from the [New York City Airbnb listings from 2019 dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts in creating more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

# Understand data and preprocessing

In this problem, we try to predict the number of reviews an Airbnb listing receives per month (`reviews_per_month`), so that we can estimate how popular a listing will be once it is created. There are 16 variables in total, a mix of numeric, categorical, and text features:
- `id`: Numeric listing identifier; not related to our problem and should be dropped.
- `name`: Name (title) of the listing as free text; should be transformed with bag-of-words.
- `host_id`: Unique ID of the host.
- `host_name`: Name of the host; not related to the target and ethically problematic, so it should be dropped.
- `neighbourhood_group`: Categorical data of neighbourhood groups.
- `neighbourhood`: Categorical data of neighbourhoods.
- `latitude`: Latitude of the property.
- `longitude`: Longitude of the property.
- `room_type`: Type of room (categorical).
- `price`: Price of the listing.
- `minimum_nights`: Minimum number of nights per stay.
- `number_of_reviews`: Number of reviews the property has received.
- `last_review`: Date of the most recent review.
- `reviews_per_month`: Number of reviews per month; this is the target we want to predict.
- `calculated_host_listings_count`: Number of listings the host has.
- `availability_365`: Availability of the listing over the year.

After reading the data into a pandas DataFrame, I inspected it with `airbnb.head()` and `airbnb.info()`. The dataset is a good size, with 48,895 observations in total, and some features have missing values that need to be handled.

I then considered which features to keep. Some features should not be useful for prediction, such as `id` and `last_review`. Another, `host_name`, is unrelated to the target and should not be used for ethical reasons. I therefore decided to drop these 3 features.

Next, I considered how to handle the missing values. There are several ways to deal with them, but I decided to drop the rows with missing values because I am not sure what values should be used to fill them. One could argue for filling them with zero, but since we cannot talk to a domain expert or to the person who collected the data, I think it is safer to drop these rows. Even after dropping them, we still have a good amount of data: 38,837 observations.

If we could discuss this with a domain expert, we might learn more about how to treat these missing values. Other methods we could consider are:
- filling with zero
- filling with the mean value
- a KNN imputer

We could try these methods as well if time permits.
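
The three alternatives listed above can be sketched with scikit-learn's imputers. This is a minimal illustration on a small made-up frame (the values and the `toy` name are assumptions, not the Airbnb data). Note also that in this dataset most of the missing values are in the target, `reviews_per_month`, so whether imputing is appropriate at all remains a judgment call.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with one missing value in the second column
toy = pd.DataFrame(
    {
        "number_of_reviews": [0.0, 10.0, 20.0, 30.0],
        "reviews_per_month": [0.5, 1.0, np.nan, 3.0],
    }
)

# Option 1: fill missing values with zero
zero_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(toy)

# Option 2: fill missing values with the column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)

# Option 3: fill with the average of the 2 nearest rows (by the observed columns)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

print(zero_filled[2, 1], mean_filled[2, 1], knn_filled[2, 1])
```

In a real pipeline the imputer would be fit on the training split only, so that no information from the test split leaks into the fill values.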

I also renamed the column `availability_365` to `availability` to make it more meaningful and easier to understand.

Lastly, after checking the data, I find it odd to have `number_of_reviews` as a feature, since in general it carries much the same information as `reviews_per_month`, which we want to predict: if `number_of_reviews` is high, `reviews_per_month` tends to be high as well. The Spearman correlation coefficient between the two, 0.71, supports this.
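
For reference, the Spearman coefficient is the Pearson correlation computed on ranks. The sketch below shows this on a few made-up values (not the full dataset), so it is not expected to reproduce the 0.71 figure.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame(
    {
        "number_of_reviews": [9, 45, 270, 74, 3],
        "reviews_per_month": [0.21, 0.38, 4.64, 0.59, 0.40],
    }
)

# Spearman correlation = Pearson correlation of the ranks
ranks = toy.rank()
spearman_manual = ranks["number_of_reviews"].corr(ranks["reviews_per_month"])

# pandas computes the same quantity directly
spearman_pandas = toy.corr(method="spearman").loc[
    "number_of_reviews", "reviews_per_month"
]

print(round(spearman_manual, 3), round(spearman_pandas, 3))
```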

In [1]:
# import libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action="ignore", category=DeprecationWarning)
warnings.simplefilter(action="ignore", category=UserWarning)

import altair as alt
# Handle large data sets without embedding them in the notebook
alt.data_transformers.enable('data_server')
# Include an image for each plot since Gradescope only supports displaying plots as images
alt.renderers.enable('mimetype')

from wordcloud import WordCloud
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
import xgboost as xg
from lightgbm.sklearn import LGBMRegressor
from catboost import CatBoostRegressor

from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFE, RFECV
from sklearn.feature_selection import SequentialFeatureSelector

from sklearn.model_selection import RandomizedSearchCV

In [2]:
#1.1 Load the data into a DataFrame and take a preliminary look
airbnb = pd.read_csv('data/AB_NYC_2019.csv')
airbnb.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [3]:
#1.2 Check dtypes and missing values with .info()
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [4]:
#1.3 Drop some features: they are unrelated to the prediction or raise ethical concerns
airbnb = airbnb.drop(columns = ['id','last_review','host_name'])
airbnb.head()

Unnamed: 0,name,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,2787,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Skylit Midtown Castle,2845,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,4632,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,1,365
3,Cozy Entire Floor of Brownstone,4869,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
4,Entire Apt: Spacious Studio/Loft by central park,7192,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,1,0


In [5]:
#1.4 Check NA/missing-value data
airbnb.isna().sum()

name                                 16
host_id                               0
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [6]:
#1.5 Drop rows with NA/missing values; we still have a good amount of data
airbnb = airbnb.dropna()
airbnb.isna().sum()

name                              0
host_id                           0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64

In [7]:
#1.6 Review the updated data
airbnb.head()

Unnamed: 0,name,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,2787,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Skylit Midtown Castle,2845,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
3,Cozy Entire Floor of Brownstone,4869,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
4,Entire Apt: Spacious Studio/Loft by central park,7192,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,1,0
5,Large Cozy 1 BR Apartment In Midtown East,7322,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,0.59,1,129


In [8]:
#1.7 Change column names to be meaningful names
airbnb.rename(columns = {'availability_365':'availability'}, inplace = True)
airbnb.head()

Unnamed: 0,name,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability
0,Clean & quiet apt home by the park,2787,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Skylit Midtown Castle,2845,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
3,Cozy Entire Floor of Brownstone,4869,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
4,Entire Apt: Spacious Studio/Loft by central park,7192,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,1,0
5,Large Cozy 1 BR Apartment In Midtown East,7322,Manhattan,Murray Hill,40.74767,-73.975,Entire home/apt,200,3,74,0.59,1,129


# Data splitting

- I started by splitting the data into train and test sets with a 70%-30% ratio, but training the models took a very long time and my laptop eventually hung.

- I then tried a 50%-50% split, but some models, such as RidgeCV, still took a very long time to run. I finally settled on a 30%-70% train-test ratio. Even with 30% of the data for training, we still have a good amount of data to train the model: 11,651 observations.

- Please note that we can increase the training set size if we have more powerful resources or want better results.
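
As a sketch of how the ratio maps to `train_test_split`, the example below splits a made-up 10-row frame (the `toy` data is an assumption for illustration) with `test_size=0.7`, which keeps 30% of the rows for training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the cleaned listings
toy = pd.DataFrame({"price": range(10), "reviews_per_month": range(10)})

# test_size=0.7 holds out 70% of the rows, leaving 30% for training
toy_train, toy_test = train_test_split(toy, test_size=0.7, random_state=123)

print(len(toy_train), len(toy_test))
```

Raising the training share only requires lowering `test_size`, for example `test_size=0.3` for a 70%-30% split.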

In [10]:
#2.1 Splitting data with a 30-70 train-test ratio
train_df, test_df = train_test_split(airbnb, test_size=0.7, random_state=123)
train_df.head()

Unnamed: 0,name,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability
21772,Luxurious Manhattan 1 Bedroom,3191545,Manhattan,Hell's Kitchen,40.76589,-73.98434,Entire home/apt,199,30,6,0.23,23,365
17591,Sunny room in modern building located in Bedstuy,252059,Brooklyn,Bedford-Stuyvesant,40.69181,-73.94312,Private room,60,5,3,0.08,1,0
15594,"Penthouse 2BR w skylight,terrace, and a roof d...",68094795,Manhattan,Upper East Side,40.76532,-73.9615,Entire home/apt,399,4,55,1.47,2,237
41598,Convenient 2 bedroom apt near Times Sq. 1C,190921808,Manhattan,Hell's Kitchen,40.75393,-73.99667,Entire home/apt,500,3,2,0.81,47,352
29321,Dina airbnb 61 street east D,164886138,Manhattan,Upper East Side,40.76015,-73.96227,Entire home/apt,140,2,4,0.53,11,354


# EDA

For numerical features:
- `latitude` and `longitude` are roughly normally distributed, so they do not need any pre-processing.
- `price` is right-skewed, so it is better to apply a log transformation to make its distribution closer to normal.
- `reviews_per_month` is right-skewed as well; since it is our target, we can use a transformed target regressor.
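
One way to apply a log transform to the target is scikit-learn's `TransformedTargetRegressor`. The sketch below is an illustration on synthetic data, not this notebook's final pipeline: the `Ridge` regressor, the synthetic arrays, and the choice of `np.log1p`/`np.expm1` (which tolerates zeros, unlike a plain `np.log`) are all assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(123)

# Synthetic right-skewed feature and target (stand-ins for price and reviews_per_month)
price = rng.lognormal(mean=5.0, sigma=0.5, size=200)
y = np.exp(0.5 * np.log(price) - 2.0 + rng.normal(scale=0.1, size=200))

# Log-transform the skewed feature by hand
X = np.log1p(price).reshape(-1, 1)

# The regressor is fit on log1p(y); predictions are mapped back with expm1
model = TransformedTargetRegressor(
    regressor=Ridge(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X, y)
preds = model.predict(X)

print(preds[:3])
```

Because the inverse transform is applied automatically, the predictions come back on the original `reviews_per_month` scale, so error metrics stay interpretable.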

For categorical features (`neighbourhood_group`, `neighbourhood`, `room_type`, `minimum_nights`, `number_of_reviews`, `availability`), some categories have low value counts; we can consider dropping these low-count categories.
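
One possible way to handle low-count categories is to collapse them into a single bucket rather than dropping the rows. The sketch below does this with pandas on made-up values; the threshold of 2 and the `"Other"` label are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical neighbourhood column with two rare categories
neighbourhood = pd.Series(
    ["Harlem", "Harlem", "Midtown", "Midtown", "Midtown", "Kensington", "Woodside"]
)

min_count = 2  # assumed threshold; categories seen fewer times are collapsed

counts = neighbourhood.value_counts()
rare = counts[counts < min_count].index

# Replace rare categories with a single "Other" label
collapsed = neighbourhood.where(~neighbourhood.isin(rare), "Other")

print(collapsed.value_counts())
```

In practice the counts should be computed on the training split only, and the same mapping then applied to the test split.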

`name` is a text feature, which we can explore with a word cloud to see some common words in the listing names.
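
To make the bag-of-words idea concrete, the sketch below runs `CountVectorizer` on a few listing names copied from the data preview earlier in this notebook. The `stop_words="english"` setting is an assumption, not necessarily what the final pipeline uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A few listing names copied from the data preview
names = [
    "Clean & quiet apt home by the park",
    "Skylit Midtown Castle",
    "Cozy Entire Floor of Brownstone",
    "Large Cozy 1 BR Apartment In Midtown East",
]

# Each name becomes a row of word counts; common English words are ignored
vec = CountVectorizer(stop_words="english")
bow = vec.fit_transform(names)

print(bow.shape)
print(sorted(vec.vocabulary_))
```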


Finally, I chose **MAE** as my metric because it lets us communicate to users how far, on average, our model's predictions deviate from the `reviews_per_month` target, for example 0.5 reviews off.
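
As a quick illustration of how MAE is read, the sketch below computes it by hand with NumPy. The predictions are made-up numbers, not model output.

```python
import numpy as np

# True reviews_per_month values (from the data preview) and hypothetical predictions
y_true = np.array([0.21, 0.38, 4.64, 0.10])
y_pred = np.array([0.50, 0.50, 3.50, 0.60])

# MAE is the average absolute difference between prediction and truth
mae = np.mean(np.abs(y_true - y_pred))

print(f"MAE: {mae:.2f} reviews per month")
```

This is the same quantity that `sklearn.metrics.mean_absolute_error` computes.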

In [11]:
#3.1 Checking train_df
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11651 entries, 21772 to 18946
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   name                            11651 non-null  object 
 1   host_id                         11651 non-null  int64  
 2   neighbourhood_group             11651 non-null  object 
 3   neighbourhood                   11651 non-null  object 
 4   latitude                        11651 non-null  float64
 5   longitude                       11651 non-null  float64
 6   room_type                       11651 non-null  object 
 7   price                           11651 non-null  int64  
 8   minimum_nights                  11651 non-null  int64  
 9   number_of_reviews               11651 non-null  int64  
 10  reviews_per_month               11651 non-null  float64
 11  calculated_host_listings_count  11651 non-null  int64  
 12  availability                

In [None]:
#3.2 Pulling all features
train_df.columns

In [None]:
#3.3 Separate features into each group
numerical_features = ['latitude', 'longitude', 'price', 
                      'reviews_per_month','calculated_host_listings_count']
discreteized_features = ['']
categorical_features = ['host_id', 'neighbourhood_group', 'neighbourhood', 'room_type', 
                        'minimum_nights', 'number_of_reviews', 'availability']
text_features = ['name']
drop_features = ['']

In [None]:
#3.3.1 Numerical feature distributions - latitude - normally distributed
latitude_hist = alt.Chart(train_df).mark_bar().encode(
     x = alt.X('latitude', type='quantitative', bin=alt.Bin(maxbins=40)),
     y = 'count()',
).properties(
    width=400,
    height=150
)
# Show the plot
latitude_hist

In [None]:
#3.3.2 Numerical feature distributions - longitude - quite normally distributed
longitude_hist = alt.Chart(train_df).mark_bar().encode(
     x = alt.X('longitude', type='quantitative', bin=alt.Bin(maxbins=40)),
     y = 'count()',
).properties(
    width=400,
    height=150
)
# Show the plot
longitude_hist

In [None]:
#3.3.3 Numerical feature distributions - price - right-skew distributed
price_hist = alt.Chart(train_df).mark_bar().encode(
     x = alt.X('price', type='quantitative', bin=alt.Bin(maxbins=200), scale=alt.Scale(domain=(0,1000))),
     y = 'count()',
).properties(
    width=400,
    height=150
)
# Show the plot
price_hist

In [None]:
#3.3.4 Numerical feature distributions - log(price) - better normally distributed
# Since price is right-skewed, we transform it with a log transformation
log_train_df = train_df.copy()
log_train_df['price'] = np.log(train_df['price'])
log_price_hist = alt.Chart(log_train_df).mark_bar().encode(
     x = alt.X('price', type='quantitative', bin=alt.Bin(maxbins=100)),
     y = 'count()',
).properties(
    width=400,
    height=150
)
# Show the plot
log_price_hist

In [None]:
#3.3.5 Numerical feature distributions - reviews_per_month - right-skew distributed
reviews_per_month_hist = alt.Chart(train_df).mark_bar().encode(
     x = alt.X('reviews_per_month', type='quantitative', bin=alt.Bin(maxbins=100), scale=alt.Scale(domain=(0,15))),
     y = alt.Y('count()'),
).properties(
    width=400,
    height=150
)
# Show the plot
reviews_per_month_hist

In [None]:
#3.3.6 Numerical feature distributions - log(reviews_per_month) - better normally distributed
# Since reviews_per_month is right-skewed, we transform it with a log transformation
log_train_df = train_df.copy()
log_train_df['reviews_per_month'] = np.log(train_df['reviews_per_month'])
log_reviews_per_month_hist = alt.Chart(log_train_df).mark_bar().encode(
     x = alt.X('reviews_per_month', type='quantitative', bin=alt.Bin(maxbins=100)),
     y = 'count()',
).properties(
    width=400,
    height=150
)
# Show the plot
log_reviews_per_month_hist

In [None]:
#3.3.7 Categorical feature distributions - host_id - value_counts()
# We might keep them all since they are all unique values
train_df['host_id'].value_counts()

In [None]:
#3.3.8 Categorical feature distributions - neighbourhood_group
neighbourhood_group_hist = alt.Chart(train_df).mark_bar().encode(
    x = 'count()',
    y = alt.Y('neighbourhood_group', type='nominal', sort='-x')
     ,
).properties(
    width=400,
    height=150
)
# Show the plot
neighbourhood_group_hist

In [None]:
#3.3.9 Categorical feature distributions - neighbourhood_group - value_counts()
# We might keep it all
train_df['neighbourhood_group'].value_counts()

In [None]:
#3.3.10 Categorical feature distributions - neighbourhood
neighbourhood_hist = alt.Chart(train_df).mark_bar().encode(
    x = alt.X('count()'),
    y = alt.Y('neighbourhood', type='nominal', sort='-x')
).properties(
    width=400,
    height=1500
)
# Show the plot
neighbourhood_hist

In [None]:
#3.3.11 Categorical feature distributions - neighbourhood - value_counts()
# We might drop values with fewer than 10 occurrences
train_df['neighbourhood'].value_counts()

In [None]:
#3.3.12 Categorical feature distributions - room_type
room_type_hist = alt.Chart(train_df).mark_bar().encode(
    x = alt.X('count()'),
    y = alt.Y('room_type', type='nominal', sort='-x')
).properties(
    width=400,
    height=100
)
# Show the plot
room_type_hist

In [None]:
#3.3.13 Categorical feature distributions - room_type - value_counts()
# We might keep it all
train_df['room_type'].value_counts()

In [None]:
#3.3.14 Categorical feature distributions - minimum_nights - right-skew distributed
minimum_nights_hist = alt.Chart(train_df).mark_bar().encode(
     x = alt.X('minimum_nights', type='quantitative', bin=alt.Bin(maxbins=500), scale=alt.Scale(domain=(0,50))),
     y = 'count()',
).properties(
    width=400,
    height=150
)
# Show the plot
minimum_nights_hist

In [None]:
#3.3.15 Categorical feature distributions - minimum_nights - value_counts()
# We might drop values with fewer than 10 occurrences
train_df['minimum_nights'].value_counts()

In [None]:
#3.3.16 Categorical feature distributions - number_of_reviews - right-skew distributed (OHE)
number_of_reviews_hist = alt.Chart(train_df).mark_bar().encode(
     x = alt.X('number_of_reviews', type='quantitative', bin=alt.Bin(maxbins=150), scale=alt.Scale(domain=(0,250))),
     y = alt.Y('count()'),
).properties(
    width=400,
    height=150
)
# Show the plot
number_of_reviews_hist

In [None]:
#3.3.17 Categorical feature distributions - availability - uniformly distributed
availability_hist = alt.Chart(train_df).mark_bar().encode(
     x = alt.X('availability', type='quantitative', bin=alt.Bin(maxbins=200)),
     y = alt.Y('count()'),
).properties(
    width=400,
    height=150
)
# Show the plot
availability_hist

In [None]:
#3.3.18 Categorical feature distributions - availability - value_counts()
# We might drop values with fewer than 10 occurrences
train_df['availability'].value_counts()

In [None]:
#3.3.19 Text feature wordcloud - name
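# Join all names into one string; str() on a Series only gives its truncated printout
wordcloud_name = WordCloud().generate(' '.join(train_df['name'].astype(str)))
plt.imshow(wordcloud_name, interpolation='bilinear')
plt.axis("off")
plt.show()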

In [None]:
#3.4 Correlation matrix : Pearson
train_df.select_dtypes('number').corr('pearson').style.background_gradient().format(precision=2)

In [None]:
#3.5 Correlation matrix : Spearman
train_df.select_dtypes('number').corr('spearman').style.background_gradient().format(precision=2)

# Feature engineering