# **Assignment 2**

This assignment focuses on Airbnbs in Berlin, Germany. The goal is to help a company operating small and mid-size apartments hosting 2-6 guests. The company is set to price their new apartments not on the market. This assignment will build a prediction model for prices, and discuss the modeling decisions and compare the results to those of the case study. 

Task
• You may use other variables we used in class.

• You may do different feature engineering depending on the selected environment.

• You may make other sample design decisions!

• In each case, document your steps!

• Have at least 3 different models and compare performance

• Argue for your choice of models
- One model must be theoretically profound linear regression via OLS.
- One model must be Random Forest or any boosting algorithm

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import os
from pathlib import Path
import sys
from patsy import dmatrices
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.inspection import PartialDependenceDisplay
from sklearn.inspection import partial_dependence
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error

In [81]:
# DATA IMPORT - FROM GITHUB
data = pd.read_csv('https://github.com/Iandrewburg/Assignment_1/raw/main/Assignment_2/berlin_airbnb.csv')

In [82]:
data = data[(data.accommodates >= 2) & (data.accommodates <= 6)]


In [83]:
# clean missing values or 0 values for price
data = data[data['price'] != 0]
data = data.dropna(subset=['price'])


In [84]:
# List of all unique columns for the three models
columns = list(set(
    ['accommodates', 'beds', 'review_scores_rating', 'host_is_superhost',
     'latitude', 'longitude', 'host_since', 'number_of_reviews',
     'availability_365', 'minimum_nights', 'maximum_nights', 'property_type',
     'room_type', 'beds', 'price']
))

# Creating a new DataFrame with only the selected columns
data = data[columns]
data


Unnamed: 0,latitude,review_scores_rating,host_since,price,property_type,minimum_nights,accommodates,beds,availability_365,number_of_reviews,longitude,maximum_nights,host_is_superhost,room_type
0,52.534710,4.63,19/10/2008,$83.00,Entire rental unit,63,4,2.0,15,148,13.418100,184,f,Entire home/apt
2,52.500010,4.69,28/07/2011,$116.00,Entire rental unit,2,4,2.0,336,570,13.303490,365,f,Entire home/apt
4,52.434300,4.40,31/07/2011,$100.00,Entire rental unit,183,2,2.0,364,15,13.230370,730,f,Entire home/apt
5,52.503120,4.72,20/12/2009,$90.00,Entire condo,93,4,1.0,225,48,13.435080,365,f,Entire home/apt
6,52.492810,4.66,31/07/2011,$45.00,Entire rental unit,93,2,2.0,0,58,13.349510,365,f,Entire home/apt
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13322,52.552905,,20/01/2018,$47.00,Entire rental unit,1,2,1.0,231,0,13.400229,20,f,Entire home/apt
13323,52.522356,,15/12/2023,$87.00,Entire rental unit,2,2,1.0,65,0,13.426044,365,f,Entire home/apt
13324,52.485782,,14/12/2023,$58.00,Entire rental unit,1,3,1.0,171,0,13.335904,365,f,Entire home/apt
13325,52.544105,,15/03/2015,$161.00,Entire rental unit,1,6,3.0,11,0,13.373386,365,f,Entire home/apt


In [85]:
data.property_type.value_counts()

property_type
Entire rental unit                    4455
Private room in rental unit            969
Entire condo                           602
Entire serviced apartment              353
Room in hotel                          218
Entire loft                            198
Private room in condo                  133
Private room in home                   113
Entire home                            105
Entire guesthouse                       66
Room in boutique hotel                  64
Shared room in hostel                   49
Private room in hostel                  42
Entire vacation home                    40
Room in aparthotel                      39
Room in serviced apartment              36
Private room in loft                    32
Private room in bed and breakfast       29
Private room in townhouse               25
Houseboat                               22
Tiny home                               17
Private room in guesthouse              16
Entire bungalow                         

In [86]:
data.room_type.value_counts()

room_type
Entire home/apt    5961
Private room       1648
Hotel room          114
Shared room          64
Name: count, dtype: int64

In [87]:
# filtering out property_type categories with counts less than 100
property_type_value_counts = data['property_type'].value_counts()
to_remove = property_type_value_counts[property_type_value_counts < 100].index
data = data[~data['property_type'].isin(to_remove)]

# property type dummies
property_dummies = pd.get_dummies(data['property_type'], prefix='d_type')
data = pd.concat([data, property_dummies], axis=1)



In [88]:
# room type dummies
room_dummies = pd.get_dummies(data['room_type'], prefix= 'd_room')
data = pd.concat([data, room_dummies], axis=1)




In [91]:
def clean_all_column_names(df):
    df.rename(columns=lambda x: x.replace(" ", "_").lower(), inplace=True)
    return df

data = clean_all_column_names(data)


In [92]:
data.dtypes

latitude                              float64
review_scores_rating                  float64
host_since                             object
price                                  object
property_type                          object
minimum_nights                          int64
accommodates                            int64
beds                                  float64
availability_365                        int64
number_of_reviews                       int64
longitude                             float64
maximum_nights                          int64
host_is_superhost                      object
room_type                              object
d_type_entire_condo                      bool
d_type_entire_home                       bool
d_type_entire_loft                       bool
d_type_entire_rental_unit                bool
d_type_entire_serviced_apartment         bool
d_type_private_room_in_condo             bool
d_type_private_room_in_home              bool
d_type_private_room_in_rental_unit

In [90]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7146 entries, 0 to 13326
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   latitude                            7146 non-null   float64
 1   review_scores_rating                5647 non-null   float64
 2   host_since                          7146 non-null   object 
 3   price                               7146 non-null   object 
 4   property_type                       7146 non-null   object 
 5   minimum_nights                      7146 non-null   int64  
 6   accommodates                        7146 non-null   int64  
 7   beds                                7051 non-null   float64
 8   availability_365                    7146 non-null   int64  
 9   number_of_reviews                   7146 non-null   int64  
 10  longitude                           7146 non-null   float64
 11  maximum_nights                      7146 non-nu

In [79]:
# Convert 't' to 1 and 'f' to 0
data['host_is_superhost'] = data['host_is_superhost'].map({'t': 1, 'f': 0})


In [93]:
data = data.rename(columns={
    'review_scores_rating': 'n_review_scores_rating',
    'host_since': 'date_host_start',
    'minimum_nights': 'n_minimum_nights',
    'accommodates': 'n_accommodates',
    'beds': 'n_beds',
    'availability_365': 'n_availability_365',
    'number_of_reviews': 'n_number_of_reviews',
    'maximum_nights': 'n_maximum_nights',
    'room_type': 'f_room_type',
    'property_type': 'f_property_type'
})


In [94]:
data.columns

Index(['latitude', 'n_review_scores_rating', 'date_host_start', 'price',
       'f_property_type', 'n_minimum_nights', 'n_accommodates', 'n_beds',
       'n_availability_365', 'n_number_of_reviews', 'longitude',
       'n_maximum_nights', 'host_is_superhost', 'f_room_type',
       'd_type_entire_condo', 'd_type_entire_home', 'd_type_entire_loft',
       'd_type_entire_rental_unit', 'd_type_entire_serviced_apartment',
       'd_type_private_room_in_condo', 'd_type_private_room_in_home',
       'd_type_private_room_in_rental_unit', 'd_type_room_in_hotel',
       'd_room_entire_home/apt', 'd_room_hotel_room', 'd_room_private_room'],
      dtype='object')

In [97]:
def convert_d_columns_to_int(df):
    d_columns = df.columns[df.columns.str.startswith('d_')]
    df[d_columns] = df[d_columns].astype(int)
    return df

data = convert_d_columns_to_int(data)


In [99]:
data.price

0         $83.00
2        $116.00
4        $100.00
5         $90.00
6         $45.00
          ...   
13322     $47.00
13323     $87.00
13324     $58.00
13325    $161.00
13326     $94.00
Name: price, Length: 7146, dtype: object

In [100]:
# Remove the dollar sign and convert to integer
data['price'] = data['price'].str.lstrip('$').astype(float).astype(int)


ValueError: could not convert string to float: '1,260.00'

In [98]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7146 entries, 0 to 13326
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   latitude                            7146 non-null   float64
 1   n_review_scores_rating              5647 non-null   float64
 2   date_host_start                     7146 non-null   object 
 3   price                               7146 non-null   object 
 4   f_property_type                     7146 non-null   object 
 5   n_minimum_nights                    7146 non-null   int64  
 6   n_accommodates                      7146 non-null   int64  
 7   n_beds                              7051 non-null   float64
 8   n_availability_365                  7146 non-null   int64  
 9   n_number_of_reviews                 7146 non-null   int64  
 10  longitude                           7146 non-null   float64
 11  n_maximum_nights                    7146 non-nu

In [96]:
data.isna().sum().sum()

1621

### EDA

In [12]:
data = data[(data.accommodates >= 2) & (data.accommodates <= 6)]


In [14]:
# copy a variable - purpose later, see at variable importance
data['accommodates_copy'] = data['accommodates']

***numerical variables***

In [15]:
# too long to display and read
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,11233.0,2.633452e+17,3.936411e+17,3176.0,16972870.0,38422670.0,6.85234e+17,1.04908e+18
scrape_id,11233.0,20231200000000.0,0.0,20231200000000.0,20231200000000.0,20231200000000.0,20231200000000.0,20231200000000.0
description,0.0,,,,,,,
host_id,11233.0,142395400.0,163743400.0,1581.0,12818040.0,57628840.0,243483400.0,551079400.0
host_listings_count,11226.0,15.13264,71.38824,1.0,1.0,1.0,3.0,1083.0
host_total_listings_count,11226.0,16.94539,77.11085,1.0,1.0,2.0,5.0,1143.0
latitude,11233.0,52.51003,0.03284366,52.36904,52.49084,52.50989,52.53233,52.65611
longitude,11233.0,13.4032,0.06592717,13.10758,13.36761,13.41224,13.43815,13.71796
accommodates,11233.0,2.921303,1.233958,2.0,2.0,2.0,4.0,6.0
bathrooms,0.0,,,,,,,


***categorical variables***

In [18]:
data.room_type.value_counts()

room_type
Entire home/apt    7966
Private room       3078
Hotel room          114
Shared room          75
Name: count, dtype: int64

In [19]:
data.property_type.value_counts()

property_type
Entire rental unit                    6237
Private room in rental unit           2301
Entire condo                           722
Entire serviced apartment              358
Entire loft                            247
Room in hotel                          220
Private room in condo                  183
Private room in home                   127
Entire home                            126
Entire guesthouse                       67
Room in boutique hotel                  64
Shared room in hostel                   49
Private room in loft                    45
Entire vacation home                    45
Private room in hostel                  43
Room in aparthotel                      39
Room in serviced apartment              36
Private room in bed and breakfast       32
Private room in townhouse               29
Entire townhouse                        24
Houseboat                               23
Shared room in rental unit              20
Entire bungalow                         

In [20]:
data.number_of_reviews.value_counts()

number_of_reviews
0      2364
1       902
2       644
3       507
4       390
       ... 
493       1
653       1
655       1
435       1
387       1
Name: count, Length: 456, dtype: int64

In [21]:
data.neighbourhood_cleansed.value_counts()

neighbourhood_cleansed
Alexanderplatz               760
Frankfurter Allee Süd FK     612
Tempelhofer Vorstadt         554
Brunnenstr. Süd              487
Reuterstraße                 393
                            ... 
Allende-Viertel                3
MV 1                           3
Neu-Hohenschönhausen Süd       2
Hellersdorf-Süd                1
Neu-Hohenschönhausen Nord      1
Name: count, Length: 136, dtype: int64

***split train and test***
- train is where we do it all, incl CV

- first pick a smaller than usual training set so that models run faster and check if works
- if works, start anew without these two lines

In [22]:
data_train, data_holdout = train_test_split( data, train_size=0.7, random_state=42)

In [23]:
data_train.shape, data_holdout.shape

((7863, 76), (3370, 76))