# **Assignment 2**

This assignment focuses on Airbnb listings in Berlin, Germany. The goal is to help a company that operates small and mid-size apartments hosting 2-6 guests set prices for new apartments that are not yet on the market. The assignment builds a price prediction model, discusses the modeling decisions, and compares the results to those of the case study.

Task
• You may use other variables we used in class.

• You may do different feature engineering depending on the selected environment.

• You may make other sample design decisions!

• In each case, document your steps!

• Have at least 3 different models and compare their performance.

• Argue for your choice of models:
- One model must be a theoretically well-grounded linear regression estimated via OLS.
- One model must be a Random Forest or any boosting algorithm.
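
The model requirements above could be organised as a small comparison harness. The following is a minimal sketch on synthetic data, not the assignment's actual pipeline: the data-generating process, the coefficients, and the hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the listings data (hypothetical columns and coefficients).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "n_accommodates": rng.integers(2, 7, size=n),
    "n_beds": rng.integers(1, 5, size=n),
    "host_is_superhost": rng.integers(0, 2, size=n),
})
y = 30 + 20 * X["n_accommodates"] + 5 * X["n_beds"] + rng.normal(0, 15, size=n)

models = {
    "ols": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "boosting": GradientBoostingRegressor(random_state=42),
}

# 5-fold cross-validated RMSE for each model (lower is better).
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

print(pd.Series(cv_rmse).round(2))
```

Keeping all candidates in one dictionary means every model is scored on the same folds with the same metric, which makes the comparison easier to defend.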

In [3]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import os
from pathlib import Path
import sys
from patsy import dmatrices
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.inspection import PartialDependenceDisplay
from sklearn.inspection import partial_dependence
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error

In [4]:
# DATA IMPORT - FROM GITHUB
data = pd.read_csv('https://github.com/Iandrewburg/Assignment_1/raw/main/Assignment_2/berlin_airbnb.csv')

In [5]:
data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,3176.0,https://www.airbnb.com/rooms/3176,20231200000000.0,19/12/2023,previous scrape,Rental unit in Berlin · ★4.63 · 1 bedroom · 2 ...,,The neighbourhood is famous for its variety of...,https://a0.muscache.com/pictures/243355/84afcf...,3718,...,4.69,4.92,4.62,First name and Last name: Nicolas Krotz <br/> ...,f,1,1,0,0,0.84
1,9991.0,https://www.airbnb.com/rooms/9991,20231200000000.0,19/12/2023,city scrape,Rental unit in Berlin · ★5.0 · 4 bedrooms · 7 ...,,Prenzlauer Berg is an amazing neighbourhood wh...,https://a0.muscache.com/pictures/42799131/59c8...,33852,...,5.0,4.86,4.86,03/Z/RA/003410-18,f,1,1,0,0,0.07
2,183988.0,https://www.airbnb.com/rooms/183988,20231200000000.0,19/12/2023,city scrape,Rental unit in Berlin · ★4.69 · 1 bedroom · 2 ...,,,https://a0.muscache.com/pictures/1041e6fd-c369...,882801,...,4.79,4.72,4.62,04/Z/ZA/004232-16,f,1,1,0,0,3.92
3,14325.0,https://www.airbnb.com/rooms/14325,20231200000000.0,19/12/2023,city scrape,Rental unit in Berlin · ★4.68 · Studio · 1 bed...,,,https://a0.muscache.com/pictures/508703/24988a...,55531,...,4.85,4.6,4.45,,f,4,4,0,0,0.16
4,186663.0,https://www.airbnb.com/rooms/186663,20231200000000.0,19/12/2023,city scrape,Rental unit in Berlin · ★4.40 · 1 bedroom · 2 ...,,,https://a0.muscache.com/pictures/1757562/947b4...,897302,...,4.73,4.87,4.0,,f,4,4,0,0,0.11


In [6]:
data.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13327 entries, 0 to 13326
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            13327 non-null  float64
 1   listing_url                                   13327 non-null  object 
 2   scrape_id                                     13327 non-null  float64
 3   last_scraped                                  13327 non-null  object 
 4   source                                        13327 non-null  object 
 5   name                                          13327 non-null  object 
 6   description                                   0 non-null      float64
 7   neighborhood_overview                         6867 non-null   object 
 8   picture_url                                   13327 non-null  object 
 9   host_id                                       13327 non-null 

In [8]:
data.room_type.value_counts()

room_type
Entire home/apt    8700
Private room       4287
Shared room         208
Hotel room          132
Name: count, dtype: int64

In [9]:
data.property_type.value_counts()

property_type
Entire rental unit             6775
Private room in rental unit    3274
Entire condo                    770
Entire serviced apartment       415
Entire loft                     270
                               ... 
Private room in boat              1
Island                            1
Shared room in loft               1
Entire chalet                     1
Casa particular                   1
Name: count, Length: 65, dtype: int64
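
Most of the 65 property types above occur only a handful of times. One way to handle such a long tail is to keep only the frequent categories, or to pool the rest into a single level. The sketch below uses a toy series and a hypothetical threshold `min_count`; neither is taken from the assignment data.

```python
import pandas as pd

# Toy property types (made-up counts, for illustration only).
property_type = pd.Series(
    ["Entire rental unit"] * 6 + ["Entire condo"] * 3 + ["Island"] + ["Entire chalet"]
)

min_count = 3  # hypothetical cut-off
counts = property_type.value_counts()
rare = counts[counts < min_count].index

# Option 1: drop listings that belong to rare categories.
kept = property_type[~property_type.isin(rare)]

# Option 2: keep every listing but pool rare categories into one level.
pooled = property_type.where(~property_type.isin(rare), "Other")

print(kept.value_counts())
print(pooled.value_counts())
```

Dropping shrinks the sample, while pooling keeps every row at the cost of a heterogeneous "Other" group. Which is preferable depends on how the model will be used.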

In [13]:
def airbnb_cleaner(data): 
    
    # filter the data
    data = data[(data.accommodates >= 2) & (data.accommodates <= 6)]
    
    # List of all unique columns for the three models
    columns = list(set(
        ['accommodates', 'beds', 'review_scores_rating', 'host_is_superhost',
         'latitude', 'longitude', 'host_since', 'number_of_reviews',
         'availability_365', 'minimum_nights', 'maximum_nights', 'property_type',
         'room_type', 'price']
    ))

    # Creating a new DataFrame with only the selected columns
    data = data[columns]
    
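    # drop rows with a missing price
    data = data.dropna(subset=['price']).copy()
    
    # clean up the price column: strip the literal '$' and thousands separators, then convert
    data['price'] = (data['price'].str.replace('$', '', regex=False)
                                  .str.replace(',', '', regex=False)
                                  .astype(float).astype(int))
    
    # drop zero prices (only meaningful once the column is numeric)
    data = data[data['price'] != 0]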
    
    
    # filtering out property_type categories with counts less than 100
    property_type_value_counts = data['property_type'].value_counts()
    to_remove = property_type_value_counts[property_type_value_counts < 100].index
    data = data[~data['property_type'].isin(to_remove)]

    # property type dummies
    property_dummies = pd.get_dummies(data['property_type'], prefix='d_type')
    data = pd.concat([data, property_dummies], axis=1)

    # room type dummies
    room_dummies = pd.get_dummies(data['room_type'], prefix= 'd_room')
    data = pd.concat([data, room_dummies], axis=1)

    data.rename(columns=lambda x: x.replace(" ", "_").lower(), inplace=True)
    
    data = data.rename(columns={
        'review_scores_rating': 'n_review_scores_rating',
        'host_since': 'date_host_start',
        'minimum_nights': 'n_minimum_nights',
        'accommodates': 'n_accommodates',
        'beds': 'n_beds',
        'availability_365': 'n_availability_365',
        'number_of_reviews': 'n_number_of_reviews',
        'maximum_nights': 'n_maximum_nights',
        'room_type': 'f_room_type',
        'property_type': 'f_property_type'
    })
    
    # Convert 'date_host_start' to a date variable
    data['date_host_start'] = pd.to_datetime(data['date_host_start'])

    # Convert 'n_beds' to integer
    data['n_beds'] = data['n_beds'].fillna(0).astype(int)  # Assumes NaN values should be treated as 0

    # Convert 'host_is_superhost' to integer ('t' to 1, 'f' to 0)
    data['host_is_superhost'] = data['host_is_superhost'].map({'t': 1, 'f': 0}).fillna(0).astype(int)  # Assumes NaN values should be treated as 0

    # Fill missing 'n_review_scores_rating' with 0 (assumes listings without a rating should be treated as 0)
    data['n_review_scores_rating'] = data['n_review_scores_rating'].fillna(0)
    

    # changing all dummies to be int
    d_columns = data.columns[data.columns.str.startswith('d_')]
    data[d_columns] = data[d_columns].astype(int)
    
    return data

data = airbnb_cleaner(data)
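
The raw price column arrives as text, so a comparison such as `price != 0` only becomes meaningful once the column is numeric. Below is a minimal sketch of that parsing step on made-up values; the helper name `parse_price` is hypothetical and not part of the notebook.

```python
import pandas as pd

def parse_price(raw: pd.Series) -> pd.Series:
    """Turn strings like '$1,250.00' into floats; unparseable values become NaN."""
    cleaned = (
        raw.str.replace("$", "", regex=False)   # literal dollar sign, not a regex anchor
           .str.replace(",", "", regex=False)   # thousands separator
    )
    return pd.to_numeric(cleaned, errors="coerce")

raw = pd.Series(["$85.00", "$1,250.00", None, "$0.00"])
price = parse_price(raw)

# Drop missing and zero prices only after the column is numeric.
price = price[price.notna() & (price != 0)]
print(price.tolist())
```

Passing `regex=False` matters because `$` is an end-of-string anchor in a regular expression, and the default for `regex` has differed between pandas versions.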
    

In [14]:
data.dtypes

f_property_type                               object
n_accommodates                                 int64
n_availability_365                             int64
date_host_start                       datetime64[ns]
longitude                                    float64
n_number_of_reviews                            int64
n_review_scores_rating                       float64
price                                          int32
n_maximum_nights                               int64
latitude                                     float64
f_room_type                                   object
host_is_superhost                              int32
n_minimum_nights                               int64
n_beds                                         int32
d_type_entire_condo                            int32
d_type_entire_home                             int32
d_type_entire_loft                             int32
d_type_entire_rental_unit                      int32
d_type_entire_serviced_apartment              

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7146 entries, 0 to 13326
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype         
---  ------                              --------------  -----         
 0   f_property_type                     7146 non-null   object        
 1   n_accommodates                      7146 non-null   int64         
 2   n_availability_365                  7146 non-null   int64         
 3   date_host_start                     7146 non-null   datetime64[ns]
 4   longitude                           7146 non-null   float64       
 5   n_number_of_reviews                 7146 non-null   int64         
 6   n_review_scores_rating              7146 non-null   float64       
 7   price                               7146 non-null   int32         
 8   n_maximum_nights                    7146 non-null   int64         
 9   latitude                            7146 non-null   float64       
 10  f_room_type                 

In [16]:
data.isna().sum().sum()

0
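
The count of zero missing values is partly a result of the `fillna(0)` calls in `airbnb_cleaner`, so it does not mean the source data were complete. One possible alternative, sketched here on made-up values, is to record which entries were imputed in a flag column before filling. The `flag_` prefix and the helper name `fill_with_flag` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def fill_with_flag(df: pd.DataFrame, column: str, fill_value=0) -> pd.DataFrame:
    """Fill missing values in `column` and add a 0/1 flag marking imputed rows."""
    out = df.copy()
    out[f"flag_{column}"] = out[column].isna().astype(int)
    out[column] = out[column].fillna(fill_value)
    return out

toy = pd.DataFrame({"n_review_scores_rating": [4.8, np.nan, 4.2, np.nan]})
toy = fill_with_flag(toy, "n_review_scores_rating")
print(toy)
```

With the flag included as a predictor, a model can treat "no rating yet" differently from an actual rating of 0.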

In [17]:
data.shape

(7146, 26)

### EDA

In [14]:
# copy a variable; it is used later when examining variable importance
data['n_accommodates_copy'] = data['n_accommodates']
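
A duplicated column is a convenient way to see how variable importance behaves when two features are perfectly correlated. The sketch below illustrates this on synthetic data and is not the notebook's model: the data-generating process, the forest settings, and the helper name `importances` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "n_accommodates": rng.integers(2, 7, size=n),
    "noise": rng.normal(size=n),
})
y = 25 * X["n_accommodates"] + rng.normal(0, 5, size=n)

def importances(features: pd.DataFrame) -> pd.Series:
    """Fit a small forest and return mean permutation importances per column."""
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(features, y)
    result = permutation_importance(model, features, y, n_repeats=5, random_state=0)
    return pd.Series(result.importances_mean, index=features.columns)

imp_single = importances(X)

# Add an exact copy of the informative column and refit.
X_dup = X.assign(n_accommodates_copy=X["n_accommodates"])
imp_dup = importances(X_dup)

print(imp_single.round(3))
print(imp_dup.round(3))
```

When one copy is permuted, the forest can still rely on splits made on the other copy, so the measured importance of each is expected to be lower than that of the column on its own. This is why importances of correlated features are often better read as a group.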

***numerical variables***

In [15]:
# too long to display and read
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 11233.0 | 2.633452e+17 | 3.936411e+17 | 3176.0 | 16972870.0 | 38422670.0 | 6.85234e+17 | 1.04908e+18 |
| scrape_id | 11233.0 | 20231200000000.0 | 0.0 | 20231200000000.0 | 20231200000000.0 | 20231200000000.0 | 20231200000000.0 | 20231200000000.0 |
| description | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| host_id | 11233.0 | 142395400.0 | 163743400.0 | 1581.0 | 12818040.0 | 57628840.0 | 243483400.0 | 551079400.0 |
| host_listings_count | 11226.0 | 15.13264 | 71.38824 | 1.0 | 1.0 | 1.0 | 3.0 | 1083.0 |
| host_total_listings_count | 11226.0 | 16.94539 | 77.11085 | 1.0 | 1.0 | 2.0 | 5.0 | 1143.0 |
| latitude | 11233.0 | 52.51003 | 0.03284366 | 52.36904 | 52.49084 | 52.50989 | 52.53233 | 52.65611 |
| longitude | 11233.0 | 13.4032 | 0.06592717 | 13.10758 | 13.36761 | 13.41224 | 13.43815 | 13.71796 |
| accommodates | 11233.0 | 2.921303 | 1.233958 | 2.0 | 2.0 | 2.0 | 4.0 | 6.0 |
| bathrooms | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


***categorical variables***

In [18]:
data.room_type.value_counts()

room_type
Entire home/apt    7966
Private room       3078
Hotel room          114
Shared room          75
Name: count, dtype: int64

In [19]:
data.property_type.value_counts()

property_type
Entire rental unit                    6237
Private room in rental unit           2301
Entire condo                           722
Entire serviced apartment              358
Entire loft                            247
Room in hotel                          220
Private room in condo                  183
Private room in home                   127
Entire home                            126
Entire guesthouse                       67
Room in boutique hotel                  64
Shared room in hostel                   49
Private room in loft                    45
Entire vacation home                    45
Private room in hostel                  43
Room in aparthotel                      39
Room in serviced apartment              36
Private room in bed and breakfast       32
Private room in townhouse               29
Entire townhouse                        24
Houseboat                               23
Shared room in rental unit              20
Entire bungalow                         

In [20]:
data.number_of_reviews.value_counts()

number_of_reviews
0      2364
1       902
2       644
3       507
4       390
       ... 
493       1
653       1
655       1
435       1
387       1
Name: count, Length: 456, dtype: int64

In [21]:
data.neighbourhood_cleansed.value_counts()

neighbourhood_cleansed
Alexanderplatz               760
Frankfurter Allee Süd FK     612
Tempelhofer Vorstadt         554
Brunnenstr. Süd              487
Reuterstraße                 393
                            ... 
Allende-Viertel                3
MV 1                           3
Neu-Hohenschönhausen Süd       2
Hellersdorf-Süd                1
Neu-Hohenschönhausen Nord      1
Name: count, Length: 136, dtype: int64

***split train and test***
- All model building, including cross-validation, happens on the training set.

- To check that the code works, first pick a smaller-than-usual training set so that the models run faster.
- If everything works, start anew without the two subsampling lines.

In [22]:
data_train, data_holdout = train_test_split(data, train_size=0.7, random_state=42)

In [23]:
data_train.shape, data_holdout.shape

((7863, 76), (3370, 76))
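
The split sizes follow from applying the 70/30 rule to 11,233 rows. As a sketch of the intended workflow, the code below reproduces those sizes on a placeholder frame and draws cross-validation folds from the training part only. The frame contents and the choice of five folds are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

# Placeholder frame with the same number of rows as the working sample.
frame = pd.DataFrame({"row": np.arange(11233)})

train, holdout = train_test_split(frame, train_size=0.7, random_state=42)
print(train.shape, holdout.shape)

# Cross-validation folds are built from the training set only;
# the holdout set stays untouched until the final evaluation.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(valid_idx) for _, valid_idx in kfold.split(train)]
print(fold_sizes)
```

Because the holdout rows never enter any fold, the final holdout error is not influenced by model selection or tuning.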