# Business Understanding and Set-up

## Background and Key Question

**Airbnb**

Airbnb is an American online platform offering vacation rentals, primarily homestays, as an alternative to traditional hotel or hostel stays. More recently, Airbnb has also started offering "experiences" at popular tourist destinations. Airbnb neither owns the listings nor hosts the experiences itself; it acts as an intermediary broker and earns a commission on each booking. Like many other disruptive services, Airbnb has not been without controversy: it gives owners and renters an opportunity to make use of periods in which their space would otherwise sit empty, but it also incentivizes the dedicated (ab)use of living space as full-time rentals.

**Key Question**

Given the listings data for Berlin at a specific point in time, **can we accurately predict a listing's price** in order to provide future hosts with a solid pricing estimate? (Aim: offer a tool that gives new users a first indication of pricing and potential earnings without requiring an Airbnb account.)

**Assumptions**

As the data consists of publicly accessible information only and does not include figures such as actual occupancy, several assumptions were necessary, and predictions cannot be as precise as they would be with complete information. Nevertheless, the wealth of available features made it possible to experiment extensively and work productively on the basis of these assumptions.

**Notes**

The final key question was derived iteratively, with the target adapted based on insights from the analysis. Hence, the analysis evaluates all feasible features; the decision to include only those known to a new user beforehand was made later on.

Currently, predictive modeling is performed as a **regression on Target = PRICE_LOG**.

This notebook previously contained predictive modeling for various distinct target options, namely:

- Modeling: Binary Classification (**PRICE_BINARY**)
- Modeling: Multi-Class Classification (**PRICE_CLASS**)
- Modeling: Regression (**PRICE_LOG**)

**As a consequence, the notebook still includes target selection via the Dashboard, although this is no longer required to run the code. The target variable should be kept as PRICE_LOG (targeting another feature is possible, but not intended).**

## Feature Glossary

This glossary only lists the key_features used for the final model prediction and does not cover all 105 original features of the dataset.

| **FEATURE** | **DESCRIPTION** |
| :----- | :----- |
| **TARGET: price_log** | log of price per guests_included for a listing |
| **accommodates_per_bed** | created from "accommodates" and "beds" - how many people the property accommodates per available bed |
| **am_balcony** | created from "amenities" - does the listing have a balcony? |
| **am_breakfast** | created from "amenities" - does the rent include breakfast? |
| **am_child_friendly** | created from "amenities" - is the listing flagged as child friendly? |
| **am_elevator** | created from "amenities" - does the listing have an elevator? |
| **am_essentials** | created from "amenities" - does the host provide essentials (e.g. towels, bed linen, ...)? |
| **am_pets_allowed** | created from "amenities" - are pets allowed at the location? |
| **am_private_entrance** | created from "amenities" - do guests have a private entrance? |
| **am_smoking_allowed** | created from "amenities" - is smoking allowed at the location? |
| **am_tv** | created from "amenities" - is a TV available? |
| **bathrooms_log** | log of number of shared (0.5) or dedicated (1.0) bathrooms available |
| **bedrooms** | number of shared (0.5) or dedicated (1.0) bedrooms available |
| **calc_host_lst_count_sqrt_log** | Log of sqrt of number of listings the host owns in total |
| **cancellation_policy** | "flexible", "moderate", "strict" or "super strict" cancellation policies set by the host |
| **guests_included_calc** | number of guests included in the listing price (not necessarily equal to "accommodates", as host may charge for extra_people) |
| **host_is_superhost** | does the host fulfill the criteria and is flagged as a superhost (e.g. high ratings, response rate, ...)? |
| **instant_bookable** | is the listing instant bookable (e.g. without manual confirmation by the host)? |
| **maximum_nights** | what is the maximum number of nights specified by the host? |
| **minimum_nights_sqrt** | square root of the minimum number of nights specified by the host |
| **property_type** | which type of property is the listing (e.g. apartment, house, hotel, ...)? |
| **room_type** | which type of room is listed (e.g. entire home/apt, private room, shared room, ...)? |
| **wk_mth_discount** | created from weekly_price and monthly_price - does the host offer a discount for weekly or monthly stays (in %)? |
| **zipcode** | at which zipcode is the listing located? |
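
The target definition in the glossary can be sketched as follows. This is a minimal illustration only: it assumes raw `price` strings of the form `$1,500.00` (as they appear in the listings export) and the natural logarithm. The helper name `to_price_log` is hypothetical and not taken from the notebooks.

```python
import numpy as np
import pandas as pd

def to_price_log(price, guests_included):
    """Hypothetical helper: log of price per included guest."""
    # Strip the currency symbol and thousands separator, then convert to float
    price_num = pd.to_numeric(price.str.replace(r"[$,]", "", regex=True))
    # Assumption: natural log; the notebooks may use a different base or log1p
    return np.log(price_num / guests_included)

toy = pd.DataFrame({"price": ["$60.00", "$1,500.00"], "guests_included": [1, 2]})
toy["price_log"] = to_price_log(toy["price"], toy["guests_included"])

# Predictions on the log scale can be mapped back to a price per guest with np.exp
toy["price_per_guest"] = np.exp(toy["price_log"])
```

Working on the log scale dampens the influence of very expensive listings; any prediction has to be back-transformed before it is shown to a host.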

## Dataset Glossary

| **DATASET** | **DESCRIPTION** |
| :----- | :----- |
| **data_raw** | Originally imported dataset listings.csv.gz (February 2020) |
| **data** | Naming for main working dataset throughout all notebooks |
| **data_clean** | Export from Notebook 1_Clean, import for Notebooks 2_EDA_Clean and 3_Feature_Engineering |
| **data_engineered** | Export from Notebook 3_Feature_Engineering, import for Notebook 4-EDA_Engineered and 5_Predictive_Modeling |
| **best_model_xyz** | Saves of best models from various algorithms |



## Target Feature(s) and Metric(s)

**Target**:
- Feature: price_log
- Metric: neg_median_absolute_error
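
As a rough illustration of how this metric behaves, the sketch below scores a baseline regressor with scikit-learn's built-in `neg_median_absolute_error` scorer. The data is synthetic and only stands in for `price_log`; none of the variable names come from the notebooks.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered features and the price_log target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 4.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=200)

# scikit-learn scorers follow "greater is better", so the error is negated
scores = cross_val_score(DummyRegressor(strategy="median"), X, y,
                         scoring="neg_median_absolute_error", cv=5)

# Flip the sign to report the median absolute error itself
median_ae = -scores.mean()
```

Because the scorer is negated, a value closer to zero indicates a better model, and hyperparameter searches can maximize it directly.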

## Libraries and Dashboard

In [1]:
# Import libraries

# General / Cleaning / Feature Engineering
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import pyplot
import math
import joblib
from numpy import loadtxt
import os, glob
from datetime import datetime, timedelta
%matplotlib inline

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler, PolynomialFeatures, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline, make_pipeline  # Same, but with the latter it is not necessary to name estimator and transformer
from sklearn.compose import ColumnTransformer
from sklearn.cluster import DBSCAN

# Feature Selection
from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from lightgbm import LGBMClassifier

# Predictive Modeling (Models)
from sklearn.dummy import DummyClassifier, DummyRegressor
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, cross_val_predict, cross_val_score, cross_validate, KFold
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.svm import SVC, NuSVC, SVR
from sklearn.linear_model import LinearRegression, LogisticRegression, PassiveAggressiveRegressor, ElasticNet, SGDRegressor, RANSACRegressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, GradientBoostingRegressor, VotingClassifier, RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, IsolationForest
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from xgboost import XGBClassifier, XGBRegressor
from scipy.stats import randint
from sklearn.multiclass import OneVsRestClassifier
from catboost import CatBoostRegressor

# Evaluation Metrics
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, make_scorer, fbeta_score, accuracy_score, confusion_matrix, f1_score, precision_recall_curve, recall_score, precision_score, roc_auc_score
from scipy.sparse import csr_matrix
import scipy.stats as stats

# Neural Networks
from keras import models, layers, optimizers, regularizers
from keras.utils.vis_utils import model_to_dot
from keras.wrappers.scikit_learn import KerasRegressor
from IPython.display import SVG

Using TensorFlow backend.


In [2]:
# Dashboard
dataset_loc = "paris"  # "berlin", "paris", "amsterdam", "barcelona"
dataset_date = "2020-03-16"  # berlin: "2020-03-17", paris: "2020-03-16", amsterdam: "2020-03-14", barcelona: "2020-03-16", 
model_run = "2020-08-26"               # date of dataset/model creation (determines subfolder for saves of datasets/models)
target = 'price_log'             # for regression: 'occupancy_rate', 'price_log' | for classification: 'price_class', 'occupancy_class'
# !Important: Please select features for prediction under "Preprocessing"!
scoring = 'neg_median_absolute_error'  # for regression: 'neg_mean_squared_error', 'r2', 'neg_mean_poisson_deviance', 'neg_median_absolute_error' | for classification: "f1(_micro, _macro, _weighted for multiclass)", "recall", "precision", "accuracy", "roc_auc"
test_size = 0.2
random_state = 42

#occ_thr = 0.3  # Threshold for when a listing is deemed a "permanent rental"
review_rate = 0.5  # Assumed share of bookings that were followed up by a user review (feature engineering of occupancy)

pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)
pd.options.display.max_seq_items = 300
sns.set(style="white")

As mentioned above, the **target must be declared explicitly**, as the notebook was originally set up to support analysis and modeling on varying target features.

## Global functions and variables

In [3]:
# "save_load": Function for saving and loading datasets/models (joblib)
def save_load(data=False, title="unknown", file_format="pkl", function="save", dataset_loc=dataset_loc, dataset_date=dataset_date, model_run=model_run):
    if function=="save":
        if file_format=="pkl":
            joblib.dump(data, f"data/{dataset_loc}_{dataset_date}/{model_run}/{title}.pkl")
        elif file_format=="app":
            joblib.dump(data, f"data/{dataset_loc}_{dataset_date}/{title}.pkl")
        else:
            print("Please enter a valid file_format (default is pkl; 'app' for data used in web app).")
    elif function=="load":
        if file_format=="pkl":
            return joblib.load(f"data/{dataset_loc}_{dataset_date}/{model_run}/{title}.pkl")
        elif file_format=="csv":
            return pd.read_csv(f"data/{dataset_loc}_{dataset_date}/{title}.csv")
        elif file_format=="csv.gz":
            return pd.read_csv(f"data/{dataset_loc}_{dataset_date}/{title}.csv.gz")
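        elif file_format=="geojson":
            # GeoJSON is JSON rather than CSV: flatten the feature list into a DataFrame
            # (assumes the file is a FeatureCollection with a "features" key)
            import json
            with open(f"data/{dataset_loc}_{dataset_date}/{title}.geojson", encoding="utf-8") as f:
                return pd.json_normalize(json.load(f)["features"])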
        else:
            print("Please enter a valid file_format (default is pkl).")

In [4]:
# "model_eval": Function for final evaluation of "best model"
def model_eval(y, y_pred, model="reg"):
    """
    Please always specify the type of model:
    Regression: model="reg"
    Binary Classification: model="bclf"
    Multiclass Classification: model="clf"
    """
    if model=="reg":
        print("MSE: {:.2f}".format(mean_squared_error(y, y_pred)))
        # np.sqrt avoids the `squared` argument, which was deprecated in later scikit-learn releases
        print("RMSE: {:.2f}".format(np.sqrt(mean_squared_error(y, y_pred))))
        print("MAE: {:.2f}".format(mean_absolute_error(y, y_pred)))
        print("R2: {:.2f}".format(r2_score(y, y_pred)))
        print("MAPE: {:.2f}".format(mean_absolute_percentage_error(y, y_pred)))
        print("MAPE median: {:.2f}".format(median_absolute_percentage_error(y, y_pred)))

    elif model=="bclf":
        print("Accuracy: {:.2f}".format(accuracy_score(y, y_pred)))
        print("Recall: {:.2f}".format(recall_score(y, y_pred)))
        print("Precision: {:.2f}".format(precision_score(y, y_pred)))
        print("F1 Score: {:.2f}".format(f1_score(y, y_pred)))
        print("ROC/AUC: {:.2f}".format(roc_auc_score(y, y_pred)))
        print("Confusion Matrix: \n" + str(confusion_matrix(y, y_pred)))

    elif model=="clf":
        print("Accuracy: {:.2f}".format(accuracy_score(y, y_pred)))
        print("Recall: {:.2f}".format(recall_score(y, y_pred, average='weighted')))
        print("Precision: {:.2f}".format(precision_score(y, y_pred, average='weighted')))
        print("F1 Score: {:.2f}".format(f1_score(y, y_pred, average='weighted')))
        print("Confusion Matrix: \n" + str(confusion_matrix(y, y_pred)))
    
    else:
        print("Please revise your parameters (e.g. provide a valid model).")

In [5]:
# "print_target_setting": Function for printing current setting for TARGET and the corresponding features
def print_target_setting():
    target_upper = target.upper()
    print(f"You are currently using \033[1m{target_upper}\033[0m as the target and \033[1m{scoring}\033[0m for scoring to predict prices for \033[1m{dataset_loc}\033[0m on \033[1m{dataset_date}\033[0m\n")
    print(f"You are currently using these features for its prediction:\n\033[1m{pred_features}\033[0m\n")
    issues_found = False
    if 'price_class' in pred_features:
        issues_found = True
        print(f"WARNING: Please remove \033[1m'price_class'\033[0m from features before proceeding with target \033[1m{target_upper}\033[0m")
    if 'price_binary' in pred_features:
        issues_found = True
        print(f"WARNING: Please remove \033[1m'price_binary'\033[0m from features before proceeding with target \033[1m{target_upper}\033[0m")
    if "occupancy_class" in pred_features and "occupancy_rate" in pred_features:
        issues_found = True
        print(f"WARNING: Please remove \033[1m'occupancy_class'\033[0m or \033[1m'occupancy_rate'\033[0m from features before proceeding with target \033[1m{target_upper}\033[0m")
    # Only report a clean selection if none of the checks above fired
    if not issues_found:
        print("No issues with your selection of pred_features have been detected. Please make sure to manually check for correctness nevertheless.")

In [6]:
# "mean_absolute_percentage_error": Function for mean absolute percentage error (MAPE)
def mean_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [7]:
# "median_absolute_percentage_error": Function for median absolute percentage error (MAPE median)
def median_absolute_percentage_error(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.median(np.abs((y_true - y_pred) / y_true)) * 100
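
The difference between the two percentage errors can be seen on a toy example. The numbers below are invented for illustration; the computation mirrors the two helper functions above.

```python
import numpy as np

# Four invented prices; the last prediction is off by a factor of two
y_true = np.array([100.0, 200.0, 50.0, 80.0])
y_pred = np.array([110.0, 180.0, 50.0, 160.0])

# Absolute percentage error per listing
ape = np.abs((y_true - y_pred) / y_true) * 100

mape_mean = ape.mean()        # pulled up by the single large miss
mape_median = np.median(ape)  # robust against that outlier
```

Here the mean comes out at roughly 30% while the median stays at roughly 10%. Note that both helpers divide by `y_true`, so a true value of exactly zero (for example a log price of 0) leads to a division by zero.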

In [8]:
# "get_feat_importances": Function for retrieving feature importances
def get_feat_importances(model, column_names):
    # Build a one-column DataFrame of importances, indexed by feature name
    feat_importances = pd.DataFrame(model.feature_importances_,
                                    columns=['weight'],
                                    index=column_names)
    feat_importances.sort_values('weight', inplace=True, ascending=False)
    return feat_importances
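
A possible way to use this helper is sketched below on synthetic data. The helper is repeated in compact form so that the snippet runs on its own; the toy feature names (`signal`, `weak`, `noise`) and the choice of a random forest are assumptions made for illustration, not the model used in the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def get_feat_importances(model, column_names):
    # Same logic as the helper above, repeated so the sketch is self-contained
    feat_importances = pd.DataFrame(model.feature_importances_,
                                    columns=["weight"], index=column_names)
    return feat_importances.sort_values("weight", ascending=False)

# Synthetic data: one strong feature, one weak feature, one pure-noise feature
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
y_toy = 3.0 * X_toy["signal"] + 0.3 * X_toy["weak"] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_toy, y_toy)
importances = get_feat_importances(rf, X_toy.columns)
```

The helper works with any fitted estimator that exposes `feature_importances_`, which covers the tree-based models imported above.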

In [9]:
# "clf_learning_curves": Function to evaluate classification model based on learning curves
def clf_learning_curves(model):
    # Fit model on training data (uses the global X_train_prep, y_train, X_test_prep and y_test)
    eval_set = [(X_train_prep, y_train), (X_test_prep, y_test)]
    model.fit(X_train_prep, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)

    # Make predictions for test data
    y_pred = model.predict(X_test_prep)
    predictions = [round(value) for value in y_pred]

    # Evaluate predictions
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
    
    # Retrieve performance metrics
    results = model.evals_result()
    epochs = len(results['validation_0']['error'])
    x_axis = range(0, epochs)
    
    # Plot log loss
    fig, ax = pyplot.subplots()
    ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
    ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
    ax.legend()
    pyplot.ylabel('Log Loss')
    pyplot.title('XGBoost Log Loss')
    pyplot.show()
    
    # Plot classification error
    fig, ax = pyplot.subplots()
    ax.plot(x_axis, results['validation_0']['error'], label='Train')
    ax.plot(x_axis, results['validation_1']['error'], label='Test')
    ax.legend()
    pyplot.ylabel('Classification Error')
    pyplot.title('XGBoost Classification Error')
    pyplot.show()

# Data Mining

## Data Checks

The monthly data consists of several files, which can be briefly inspected here (based on dataset_loc and dataset_date):

- listings.csv.gz
- listings.csv
- reviews.csv.gz
- reviews.csv
- calendar.csv.gz
- neighbourhoods.csv
- neighbourhoods.geojson

**listings.csv.gz**

In [10]:
# Display contents of listings.csv.gz as well as its shape
#data_listings_gz_insp = save_load(title="listings", file_format="csv.gz", function="load")
#print(data_listings_gz_insp.shape)
#data_listings_gz_insp.head(3)

**listings.csv**

In [11]:
# Display contents of listings.csv as well as its shape
#data_listings_insp = save_load(title="listings", file_format="csv", function="load")
#print(data_listings_insp.shape)
#data_listings_insp.head(2)

**reviews.csv.gz**

In [12]:
# Display contents of reviews.csv.gz as well as its shape
#data_reviews_gz_insp = save_load(title="reviews", file_format="csv.gz", function="load")
#print(data_reviews_gz_insp.shape)
#data_reviews_gz_insp.head(2)

**reviews.csv**

In [13]:
# Display contents of reviews.csv as well as its shape
#data_reviews_insp = save_load(title="reviews", file_format="csv", function="load")
#print(data_reviews_insp.shape)
#data_reviews_insp.head(2)

**calendar.csv.gz**

In [14]:
# Display contents of calendar.csv.gz as well as its shape
#data_cal_insp = save_load(title="calendar", file_format="csv.gz", function="load")
#print(data_cal_insp.shape)
#data_cal_insp.head(2)

**neighbourhoods.csv**

In [15]:
# Display contents of neighbourhoods.csv as well as its shape
#data_neighb_insp = save_load(title="neighbourhoods", file_format="csv", function="load")
#print(data_neighb_insp.shape)
#data_neighb_insp.head(2)

**neighbourhoods.geojson**

In [16]:
# Display contents of neighbourhoods.geojson as well as its shape
#data_neighb_geojson_insp = save_load(title="neighbourhoods", file_format="geojson", function="load")
#print(data_neighb_geojson_insp.shape)
#data_neighb_geojson_insp.head(2)

## Data Import

**Create main dataset (listings snapshot selected via dataset_loc and dataset_date in the Dashboard)**

In [17]:
# Import dataset as DataFrame (as csv-file)
data_raw = save_load(title="listings", file_format="csv.gz", function="load")


In [18]:
# Assign data_raw to data (in order to always keep a freshly imported data_raw) and set id as index
data = data_raw.copy()
data.set_index('id', inplace=True)

In [19]:
# Create path for saving datasets/models (including missing parent folders; no error if it already exists)
os.makedirs(f"data/{dataset_loc}_{dataset_date}/{model_run}/", exist_ok=True)

# Data Cleaning

## Pre-cleaning

Before cleaning the main dataset, the columns were briefly reviewed, and those deemed unhelpful for further analysis were dropped up front in order to focus time and effort on the relevant ones.

In [20]:
# Display shape of "data"
data.shape

(67323, 105)

The dataset includes a whopping **105 features** and 67,323 listings for Paris as of March 2020.

In [21]:
# Display head(1) of "data"
data.head(1)

id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,access,interaction,house_rules,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
3109,https://www.airbnb.com/rooms/3109,20200315231126,2020-03-16,zen and calm,Appartement très calme de 50M2 Utilisation de ...,I bedroom appartment in Paris 14,I bedroom appartment in Paris 14 Good restaura...,none,Good restaurants very close the Montparnasse S...,,RER B Metro Ligne 13 Pernety Metro Ligne 13 Pl...,"A la demande, vous pouvez avoir accès à la cha...",yes I can help you out,,,,https://a0.muscache.com/im/pictures/baeae9e2-c...,,3631,https://www.airbnb.com/users/show/3631,Anne,2008-10-14,"Paris, Île-de-France, France",,within a few hours,100%,40%,f,https://a0.muscache.com/im/users/3631/profile_...,https://a0.muscache.com/im/users/3631/profile_...,Alésia,1.0,1.0,"['email', 'phone', 'facebook', 'reviews']",t,f,"Paris, Île-de-France, France",XIV Arrondissement,Observatoire,,Paris,Île-de-France,75014,Paris,"Paris, France",FR,France,48.83349,2.31852,f,Apartment,Entire home/apt,2,1.0,0.0,1.0,Real Bed,"{Internet,Wifi,Kitchen,""Paid parking off premi...",,$60.00,$490.00,,$150.00,$60.00,1,$0.00,2,30,2,2,30,30,2.0,30.0,3 weeks ago,t,16,46,76,351,2020-03-16,9,1,2016-12-27,2019-10-24,100.0,10.0,10.0,10.0,10.0,10.0,10.0,t,,"{""translation missing: en.occupancy.taxes.juri...",f,f,flexible,f,f,1,1,0,0,0.23


At first sight, most features appear relatively informative and well structured. **amenities** stores a list of items as a single string, which needs to be parsed later on.
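
One possible way to unpack such a string is sketched below. It assumes the `{item,"quoted item",...}` layout visible in the sample row; the helper name `parse_amenities` is hypothetical and the notebooks may take a different approach.

```python
import csv
import pandas as pd

def parse_amenities(raw):
    """Hypothetical helper: '{TV,"Cable TV",Wifi}' -> ['TV', 'Cable TV', 'Wifi']."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader copes with double-quoted items that contain spaces or commas
    return next(csv.reader([inner]))

toy = pd.DataFrame({"amenities": ['{TV,"Cable TV",Internet,Wifi}',
                                  '{Internet,Wifi,Kitchen}']})
toy["amenities_list"] = toy["amenities"].apply(parse_amenities)

# Binary flag in the spirit of the am_* features from the glossary
toy["am_tv"] = toy["amenities_list"].apply(lambda items: "TV" in items)
```

Matching on parsed list items rather than on substrings avoids false positives, for example "TV" being found inside "Cable TV" for a listing that only offers the latter.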

In [22]:
# Display columns of "data"
#data.columns

In [23]:
# Define columns to keep after pre-cleaning
select_columns = [
    'accommodates', 'amenities', 'availability_365', 'availability_90',
    'bathrooms', 'bed_type', 'bedrooms', 'beds',
    'calculated_host_listings_count', 'cancellation_policy', 'cleaning_fee',
    'description', 'experiences_offered', 'extra_people', 'first_review',
    'guests_included', 'has_availability', 'host_acceptance_rate',
    'host_has_profile_pic', 'host_identity_verified', 'host_is_superhost',
    'host_listings_count', 'host_location', 'host_response_rate',
    'host_response_time', 'house_rules', 'instant_bookable', 'interaction',
    'is_business_travel_ready', 'is_location_exact', 'last_review', 'latitude',
    'listing_url', 'longitude', 'maximum_nights', 'minimum_nights',
    'monthly_price', 'name', 'neighborhood_overview', 'neighbourhood_cleansed',
    'notes', 'number_of_reviews', 'number_of_reviews_ltm', 'price',
    'property_type', 'require_guest_phone_verification',
    'require_guest_profile_picture', 'requires_license',
    'review_scores_accuracy', 'review_scores_checkin',
    'review_scores_cleanliness', 'review_scores_communication',
    'review_scores_location', 'review_scores_rating', 'review_scores_value',
    'reviews_per_month', 'room_type', 'security_deposit', 'space',
    'square_feet', 'summary', 'transit', 'weekly_price', 'zipcode'
]

In [24]:
# Drop unnecessary columns and sort the remaining columns alphabetically
drop_columns = [el for el in data.columns if el not in select_columns]
data.drop(labels=drop_columns, inplace=True, axis=1)
data = data.reindex(sorted(data.columns, reverse=False), axis=1)

As mentioned above, 64 of the 105 features (roughly 60%) are kept for further analysis and the rest are dropped, at least for the time being.

## Inspection

In [25]:
# Display shape of "data"
data.shape

(67323, 64)

The feature set is now reduced to **64 features**.

In [26]:
# Display head(5) of remaining "data"
data.head(5)

id,accommodates,amenities,availability_365,availability_90,bathrooms,bed_type,bedrooms,beds,calculated_host_listings_count,cancellation_policy,cleaning_fee,description,experiences_offered,extra_people,first_review,guests_included,has_availability,host_acceptance_rate,host_has_profile_pic,host_identity_verified,host_is_superhost,host_listings_count,host_location,host_response_rate,host_response_time,house_rules,instant_bookable,interaction,is_business_travel_ready,is_location_exact,last_review,latitude,listing_url,longitude,maximum_nights,minimum_nights,monthly_price,name,neighborhood_overview,neighbourhood_cleansed,notes,number_of_reviews,number_of_reviews_ltm,price,property_type,require_guest_phone_verification,require_guest_profile_picture,requires_license,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,reviews_per_month,room_type,security_deposit,space,square_feet,summary,transit,weekly_price,zipcode
3109,2,"{Internet,Wifi,Kitchen,""Paid parking off premi...",351,76,1.0,Real Bed,0.0,1.0,1,flexible,$60.00,I bedroom appartment in Paris 14 Good restaura...,none,$0.00,2016-12-27,1,t,40%,t,f,f,1.0,"Paris, Île-de-France, France",100%,within a few hours,,f,yes I can help you out,f,f,2019-10-24,48.83349,https://www.airbnb.com/rooms/3109,2.31852,30,2,,zen and calm,Good restaurants very close the Montparnasse S...,Observatoire,,9,1,$60.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,100.0,10.0,0.23,Entire home/apt,$150.00,I bedroom appartment in Paris 14,,Appartement très calme de 50M2 Utilisation de ...,RER B Metro Ligne 13 Pernety Metro Ligne 13 Pl...,$490.00,75014
5396,2,"{Internet,Wifi,Kitchen,Heating,Washer,""Smoke d...",32,32,1.0,Pull-out Sofa,0.0,1.0,1,strict_14_with_grace_period,$36.00,"Cozy, well-appointed and graciously designed s...",none,$0.00,2009-06-30,1,t,100%,t,t,f,1.0,"Istanbul, İstanbul, Turkey",100%,within an hour,This is a small flat in a very old building th...,t,We expect guests to operate rather independent...,f,t,2020-03-01,48.851,https://www.airbnb.com/rooms/5396,2.35869,2,1,"$2,000.00",Explore the heart of old Paris,"You are within walking distance to the Louvre,...",Hôtel-de-Ville,The staircase leading up to the apartment is n...,215,48,$115.00,Apartment,f,f,t,8.0,9.0,8.0,9.0,10.0,90.0,8.0,1.65,Entire home/apt,$0.00,"Small, well appointed studio apartment at the ...",,"Cozy, well-appointed and graciously designed s...",The flat is close to two or three major metro ...,$600.00,75004
7397,4,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Paid par...",238,45,1.0,Real Bed,2.0,2.0,1,moderate,$50.00,"VERY CONVENIENT, WITH THE BEST LOCATION ! PLEA...",none,$10.00,2011-04-08,2,t,86%,t,t,t,2.0,"Paris, Île-de-France, France",100%,within an hour,ELECTRICITY INCLUDED FOR NORMAL USING. PLEASE ...,f,,f,t,2020-02-26,48.85758,https://www.airbnb.com/rooms/7397,2.35275,23,4,"$2,200.00",MARAIS - 2ROOMS APT - 2/4 PEOPLE,,Hôtel-de-Ville,Important: Be conscious that an apartment in a...,268,29,$119.00,Apartment,f,f,t,10.0,10.0,9.0,10.0,10.0,94.0,10.0,2.46,Entire home/apt,$200.00,PLEASE ASK ME BEFORE TO MAKE A REQUEST !!! No ...,,"VERY CONVENIENT, WITH THE BEST LOCATION !",Metro station HÖTEL-DE-VILLE is 100 meters close.,,75004
7964,2,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Buzzer/w...",274,1,1.0,Real Bed,1.0,1.0,1,strict_14_with_grace_period,$60.00,Very large & nice apartment all for you! - Su...,none,$20.00,2010-05-10,2,t,0%,t,t,f,0.0,"Paris, Île-de-France, France",100%,within a few hours,Respect.,f,"We are there to welcome you, give you keys and...",f,t,2015-09-14,48.87417,https://www.airbnb.com/rooms/7964,2.34245,365,6,,Large & sunny flat with balcony !,,Opéra,,6,0,$130.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,96.0,10.0,0.05,Entire home/apt,$500.00,hello ! We have a great 75 square meter apartm...,0.0,Very large & nice apartment all for you! - Su...,,,75009
9359,2,"{Internet,Wifi,Kitchen,Elevator,Heating,Essent...",15,0,1.0,Real Bed,1.0,1.0,1,strict_14_with_grace_period,$200.00,Location! Location! Location! Just bring your ...,none,$30.00,,1,t,,t,f,f,3.0,"New York, New York, United States",100%,within a day,This residence is limited to TWO people and is...,f,"I am available 18 hours a day by text, have lo...",f,t,,48.85899,https://www.airbnb.com/rooms/9359,2.34735,365,180,"$1,480.00","Cozy, Central Paris: WALK or VELIB EVERYWHERE !",,Louvre,Velib station outside.,0,0,$75.00,Apartment,t,t,t,,,,,,,,,Entire home/apt,"$1,500.00","Since I live in the USA, it is difficult to ma...",350.0,Location! Location! Location! Just bring your ...,All the major metros and RERs are at Les Halle...,,75001


In [27]:
# Describe data (summary)
data.describe().round(2).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
accommodates,67323.0,3.07,1.58,1.0,2.0,2.0,4.0,22.0
availability_365,67323.0,88.13,124.26,0.0,0.0,7.0,162.0,365.0
availability_90,67323.0,23.03,31.88,0.0,0.0,0.0,46.0,90.0
bathrooms,67272.0,1.13,0.65,0.0,1.0,1.0,1.0,50.0
bedrooms,67179.0,1.09,0.99,0.0,1.0,1.0,1.0,50.0
beds,66787.0,1.67,1.13,0.0,1.0,1.0,2.0,50.0
calculated_host_listings_count,67323.0,8.43,34.14,1.0,1.0,1.0,1.0,301.0
guests_included,67323.0,1.51,1.14,1.0,1.0,1.0,2.0,100.0
host_listings_count,67315.0,14.22,82.91,0.0,1.0,1.0,2.0,1232.0
latitude,67323.0,48.86,0.02,48.81,48.85,48.87,48.88,48.91


`data.describe()` gives a positive first impression of the dataset quality. Some features have missing values, and not all features that should be numeric appear in the summary, which suggests issues with datatypes.

In [28]:
# List datatypes (data.info()) (pre-cleaning)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67323 entries, 3109 to 42913034
Data columns (total 64 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   accommodates                      67323 non-null  int64  
 1   amenities                         67323 non-null  object 
 2   availability_365                  67323 non-null  int64  
 3   availability_90                   67323 non-null  int64  
 4   bathrooms                         67272 non-null  float64
 5   bed_type                          67323 non-null  object 
 6   bedrooms                          67179 non-null  float64
 7   beds                              66787 non-null  float64
 8   calculated_host_listings_count    67323 non-null  int64  
 9   cancellation_policy               67323 non-null  object 
 10  cleaning_fee                      50501 non-null  object 
 11  description                       66206 non-null  object 
 12

As expected, several numerical and date features are currently stored as objects and need to be converted (e.g. cleaning_fee, extra_people, first_review, price, ...). In addition, quite a few columns have missing values.

In [29]:
# Show maximum/minimum value for each numerical column
#num_features = list(data.columns[data.dtypes!=object])
#data[num_features].max()
#data[num_features].min()

Several rows with unusually high values can be identified and may in some cases be dropped at a certain threshold during data handling. Some particular features include:

| **FEATURE** | **MAX_VALUE** |
| :----- | :----- |
| **calculated_host_listings_count** | 57 |
| **accommodates** | 24 |
| **bedrooms** | 12 |
| **beds** | 24 |
| **minimum_nights** | 1.124 |
| **maximum_nights** | 10.000 |
| **number_of_reviews_ltm** | 590 (potentially misleading; the listing actually had fewer reviews on Airbnb) |
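
The commented-out check in the cell above could be run along the following lines. This is only a minimal sketch on a toy frame: `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the listings frame (illustrative values only)
toy = pd.DataFrame({
    "accommodates": [2, 4, 16],
    "minimum_nights": [1, 3, 1000],
    "name": ["a", "b", "c"],
})

# Keep numeric columns only, then show minimum and maximum side by side
extremes = toy.select_dtypes(include="number").agg(["min", "max"]).T
print(extremes)
```

Sorting `extremes` by the `max` column is one way to spot features whose upper end looks implausible.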

In [30]:
# List unique entries per column
data.nunique()

accommodates                           19
amenities                           59492
availability_365                      366
availability_90                        91
bathrooms                              19
bed_type                                5
bedrooms                               14
beds                                   19
calculated_host_listings_count         67
cancellation_policy                     8
cleaning_fee                          214
description                         65140
experiences_offered                     1
extra_people                          102
first_review                         2981
guests_included                        17
has_availability                        1
host_acceptance_rate                   99
host_has_profile_pic                    2
host_identity_verified                  2
host_is_superhost                       2
host_listings_count                    96
host_location                        2193
host_response_rate                

Three main insights from unique values:

- Some columns have only 1 value and can be dropped
- Some other columns have 2 values and appear to be true/false (i.e. can be recoded as 1/0)
- Certain columns have a high number of unique values, which can probably be clustered into a few relevant ones (e.g. cancellation_policy, property_type)
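
The first two insights can be derived directly from `nunique()`. The following is a minimal sketch on a toy frame; the frame and the variable names `constant_cols` and `binary_cols` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the listings frame (illustrative values only)
toy = pd.DataFrame({
    "has_availability": ["t", "t", "t", "t"],
    "host_is_superhost": ["t", "f", "f", "t"],
    "property_type": ["Apartment", "Loft", "House", "Villa"],
})

n_unique = toy.nunique()

# Columns with a single value carry no information and can be dropped
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with exactly two values are candidates for 1/0 recoding
binary_cols = n_unique[n_unique == 2].index.tolist()

print(constant_cols, binary_cols)
```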

In [31]:
# List missing values (pre-cleaning)


def count_missing(data):
    null_cols = data.columns[data.isnull().any(axis=0)]
    X_null = data[null_cols].isnull().sum()
    X_null = X_null.sort_values(ascending=False)
    print(X_null)


count_missing(data)

square_feet                    66291
monthly_price                  61479
weekly_price                   58069
notes                          47202
house_rules                    34607
interaction                    31916
host_response_rate             30116
host_response_time             30116
neighborhood_overview          23910
transit                        21405
host_acceptance_rate           19841
space                          19834
security_deposit               19682
cleaning_fee                   16822
review_scores_checkin          15090
review_scores_location         15089
review_scores_value            15089
review_scores_accuracy         15074
review_scores_communication    15068
review_scores_cleanliness      15063
review_scores_rating           15020
last_review                    13923
reviews_per_month              13923
first_review                   13923
summary                         2837
description                     1117
zipcode                          608
b

Various features have many missing values. In particular, there is an observable cut: many features have more than 4.500 missing values, while the rest have fewer than 1.000. The former (except for review_scores) shall be removed, the latter imputed.

**Conclusions (selection)**

- **host_response_rate** and **host_response_time** are unfortunately not available for almost half of the dataset; in the current version the columns are kept and their missing values are filled (see the cleaning function below)
- **review_scores** are difficult to replace if they do not exist, but at 0 they would distort the modeling. Hence, missing values are set to the median of the column
- listings without **name** and the few rows without enhanced **host information** (e.g. superhost), **bedrooms** or **bathrooms** are removed; they are not substantial in number
- missing values for **summary** and **description** are replaced with "" and kept in order to calculate length during feature engineering
- several features with missing values will be directly converted to 1/0 for simplification (**house_rules, security_deposit, space, cleaning_fee, monthly_price, weekly_price**)
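
The split into "remove" and "impute" candidates can be sketched as below. This is an illustration only: the toy frame and the `threshold` value are assumptions, and the actual selection in this notebook was made manually.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings frame (illustrative values only)
toy = pd.DataFrame({
    "square_feet": [np.nan, np.nan, np.nan, 350.0],
    "summary": ["x", None, "y", "z"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

threshold = 2  # hypothetical cut between "many" and "few" missing values

# Count missing values and keep only columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]

heavy = missing[missing >= threshold].index.tolist()  # candidates for removal
light = missing[missing < threshold].index.tolist()   # candidates for imputation
print(heavy, light)
```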

## Define data cleaning functions

**Handle missing/incorrect values**

In [32]:
# Define function for filling missing/incorrect values

def cln_fill_missing_val(data):
   
    # Fill missing fee/price values with "0" (still a string; converted to float later)
    data.security_deposit.fillna("0", inplace=True)
    data.cleaning_fee.fillna("0", inplace=True)
    data.monthly_price.fillna("0", inplace=True)
    data.weekly_price.fillna("0", inplace=True)

    
    # Fill missing values of "beds" with 0 and then set all with "bed_type" Real Bed to at least 1, those with value "0" to 0.5
    data.beds.fillna(0, inplace=True)
    data.beds = np.where((data.beds == 0) & (data.bed_type == "Real Bed"), 1,
                         data.beds)
    data.beds = np.where((data.beds == 0), 0.5, data.beds)

    
    # Set all with "bathrooms" and "bedrooms" 0 to at least 0.5
    data.bathrooms = np.where(data.bathrooms == 0, 0.5, data.bathrooms)
    data.bedrooms = np.where(data.bedrooms == 0, 0.5, data.bedrooms)

    
    # Fill review_scores with median
    data.review_scores_rating.fillna(data.review_scores_rating.median(),
                                     inplace=True)
    data.review_scores_value.fillna(data.review_scores_value.median(),
                                    inplace=True)
    data.review_scores_checkin.fillna(data.review_scores_checkin.median(),
                                      inplace=True)
    data.review_scores_location.fillna(data.review_scores_location.median(),
                                       inplace=True)
    data.review_scores_communication.fillna(
        data.review_scores_communication.median(), inplace=True)
    data.review_scores_accuracy.fillna(data.review_scores_accuracy.median(),
                                       inplace=True)
    data.review_scores_cleanliness.fillna(
        data.review_scores_cleanliness.median(), inplace=True)

    
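    # Fill host_response/acceptance columns with "unknown"/a numeric placeholder
    # NOTE: the rate columns are still "%" strings at this point, so the median of
    # review_scores_rating is used as fill value, not the rate columns' own median
    data.host_response_time.fillna("unknown", inplace=True)
    data.host_acceptance_rate.fillna(data.review_scores_rating.median(),
                                     inplace=True)
    data.host_response_rate.fillna(data.review_scores_rating.median(),
                                   inplace=True)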

    
    # Fill missing text values with ""
    data.description.fillna("", inplace=True)
    data.interaction.fillna("", inplace=True)
    data.house_rules.fillna("", inplace=True)
    data.neighborhood_overview.fillna("", inplace=True)
    data.notes.fillna("", inplace=True)
    data.space.fillna("", inplace=True)
    data.summary.fillna("", inplace=True)
    data.transit.fillna("", inplace=True)
    
    return data
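
The seven near-identical `review_scores_*` calls above could also be written as one vectorized step. The following is a minimal sketch on a toy frame, not the notebook's actual implementation; selecting the columns by the `review_scores_` prefix is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings frame (illustrative values only)
toy = pd.DataFrame({
    "review_scores_rating": [90.0, np.nan, 100.0, 96.0],
    "review_scores_value": [8.0, 10.0, np.nan, 10.0],
    "price": [50.0, 60.0, 70.0, 80.0],
})

# Select all review score columns by prefix
review_cols = [c for c in toy.columns if c.startswith("review_scores_")]

# fillna with a Series of medians fills each column with its own median
toy[review_cols] = toy[review_cols].fillna(toy[review_cols].median())
print(toy)
```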

**Handle wrong/varying datatypes**

In [33]:
# Define function for changing datatypes

def cln_chg_datatypes(data):
    
    # Convert numeric objects to float
    data.cleaning_fee = [
        float(i.strip("$").replace(",", "")) for i in data.cleaning_fee
    ]
    data.extra_people = [
        float(i.strip("$").replace(",", "")) for i in data.extra_people
    ]
    data.host_acceptance_rate = [
        float(str(i).strip("%")) for i in data.host_acceptance_rate
    ]
    data.host_response_rate = [
        float(str(i).strip("%")) for i in data.host_response_rate
    ]
    data.monthly_price = [
        float(i.strip("$").replace(",", "")) for i in data.monthly_price
    ]
    data.price = [float(i.strip("$").replace(",", "")) for i in data.price]
    data.security_deposit = [
        float(i.strip("$").replace(",", "")) for i in data.security_deposit
    ]
    data.weekly_price = [
        float(i.strip("$").replace(",", "")) for i in data.weekly_price
    ]

    
    # Convert varying zipcode datatypes to string
    data.zipcode = ["zip_" + str(i)[:5] for i in data.zipcode]

    
    # Convert date objects to datetime
    # (pd.to_datetime also works on pandas versions that reject 'datetime64[D]')
    data.first_review = pd.to_datetime(data.first_review)
    data.last_review = pd.to_datetime(data.last_review)
    
    return data
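
To illustrate what the conversion does to individual values, here is a minimal sketch using pandas string methods instead of list comprehensions. The sample values are made up, and the sketch assumes every entry is already a string (the function above wraps values in `str()` because the filled rate columns mix strings and floats).

```python
import pandas as pd

# Illustrative raw values as they appear in the object columns
prices = pd.Series(["$2,200.00", "$119.00", "0"])
rates = pd.Series(["86%", "100%", "95.0"])

# Remove the currency symbol and thousands separator, then cast to float
prices_num = prices.str.replace(r"[$,]", "", regex=True).astype(float)

# Remove the trailing percent sign, then cast to float
rates_num = rates.str.rstrip("%").astype(float)

print(prices_num.tolist(), rates_num.tolist())
```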

**Add select amenities as column to data**

In [34]:
# Define function for selecting amenities

def cln_sel_amenities(data):
  
    # Create temporary list with all amenities per listing
    amenities_temp = [
        data.amenities[i].strip("{").strip("}").split(',') for i in data.index
    ]

    
    # Add all amenities to single list in order to count occurrences
    amenities = []
    for lst in amenities_temp:
        for item in lst:
            amenities.append(item)
    amenities = pd.Series(amenities)

    
    # Display count of individual amenities
    #print(amenities.value_counts())

    
    # Add select amenities as distinct columns to data
    data.loc[data.amenities.str.contains('Balcony|Patio'), 'am_balcony'] = 1
    data.am_balcony.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains(
        'Beach view|Beachfront|Lake access|Mountain view|Ski-in/Ski-out|Waterfront'
    ), 'am_nature_and_views'] = 1
    data.am_nature_and_views.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('Breakfast'), 'am_breakfast'] = 1
    data.am_breakfast.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('TV'), 'am_tv'] = 1
    data.am_tv.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('Coffee maker|Espresso machine'),
             'am_coffee_machine'] = 1
    data.am_coffee_machine.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('Cooking basics'),
             'am_cooking_basics'] = 1
    data.am_cooking_basics.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('Dishwasher|Dryer|Washer'),
             'am_white_goods'] = 1
    data.am_white_goods.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('Elevator'), 'am_elevator'] = 1
    data.am_elevator.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('Essentials'), 'am_essentials'] = 1
    data.am_essentials.fillna(0, inplace=True)

    data.loc[
        data.amenities.str.contains('Family/kid friendly|Children|children'),
        'am_child_friendly'] = 1
    data.am_child_friendly.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('parking'), 'am_parking'] = 1
    data.am_parking.fillna(0, inplace=True)

    # Parentheses are escaped so that the literal amenities "Cat(s)"/"Dog(s)" match
    data.loc[data.amenities.str.contains(r'Pets|pet|Cat\(s\)|Dog\(s\)'),
             'am_pets_allowed'] = 1
    data.am_pets_allowed.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('Private entrance'),
             'am_private_entrance'] = 1
    data.am_private_entrance.fillna(0, inplace=True)

    data.loc[data.amenities.str.contains('Smoking allowed'),
             'am_smoking_allowed'] = 1
    data.am_smoking_allowed.fillna(0, inplace=True)
    
    return data

Not every amenity will have a significant impact on the price. For the purpose of this analysis, an initial selection was made and then refined using some great [previous work](https://github.com/L-Lewis/Airbnb-neural-network-price-prediction/blob/master/Airbnb-price-prediction.ipynb) on selecting relevant amenities. Additionally, most amenities with a split of more than 90/10 between 1/0 have been **removed (strikethrough in the list)**, except for some that were deemed substantial (breakfast, essentials, nature and views).

| **NEW COLUMN** | **PREVIOUS AMENITY/IES** |
| :----- | :----- |
| <s>**am_check_in_24h**</s> | <s>24-hour check-in</s> |
| <s>**am_air_con**</s> | <s>Air conditioning/central air conditioning</s> |
| **am_balcony** | Balcony/patio or balcony |
| **am_nature_and_views** | Beach view/beachfront/lake access/mountain view/ski-in ski-out/waterfront (i.e. great location/views) |
| **am_breakfast** | Breakfast |
| **am_tv** | Cable TV/TV |
| **am_coffee_machine** | Coffee maker/espresso machine |
| **am_cooking_basics** | Cooking basics |
| **am_white_goods** | Dishwasher/Dryer/Washer/Washer and dryer |
| **am_elevator** | Elevator |
| <s>**am_gym**</s> | <s>Exercise equipment/gym/private gym/shared gym</s> |
| **am_essentials** | Essentials |
| **am_child_friendly** | Family/kid friendly, or anything containing 'children' |
| **am_parking** | Free parking on premises/free street parking/outdoor parking/paid parking off premises/paid parking on premises |
| <s>**am_outdoor_space**</s> | <s>Garden or backyard/outdoor seating/sun loungers/terrace</s> |
| <s>**am_wellness**</s> | <s>Hot tub/jetted tub/private hot tub/sauna/shared hot tub/pool/private pool/shared pool</s> |
| <s>**am_internet**</s> | <s>Internet/pocket wifi/wifi</s> |
| **am_pets_allowed** | Pets allowed/cat(s)/dog(s)/pets live on this property/other pet(s) |
| **am_private_entrance** | Private entrance |
| <s>**am_secure**</s> | <s>Safe/security system</s> |
| <s>**am_self_check_in**</s> | <s>Self check-in</s> |
| **am_smoking_allowed** | Smoking allowed |
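
The flagging and the 90/10 balance check described above can be sketched as follows. This is an illustration under assumptions: the toy amenity strings, the `patterns` dictionary and the variable `unbalanced` are hypothetical and only mirror the approach of `cln_sel_amenities`.

```python
import pandas as pd

# Toy amenity strings in the same "{a,b,c}" format as the dataset
toy = pd.DataFrame({
    "amenities": [
        '{TV,Wifi,"Cat(s)"}',
        '{Wifi,Kitchen}',
        '{TV,"Pets allowed",Elevator}',
        '{Wifi,"Dog(s)"}',
    ]
})

# One regex per new column; literal parentheses must be escaped
patterns = {
    "am_tv": r"TV",
    "am_pets_allowed": r"Pets|pet|Cat\(s\)|Dog\(s\)",
}
for col, pattern in patterns.items():
    toy[col] = toy.amenities.str.contains(pattern).astype(int)

# Share of listings with the amenity; flag columns more lopsided than 90/10
shares = toy[list(patterns)].mean()
unbalanced = shares[(shares > 0.9) | (shares < 0.1)].index.tolist()
print(shares, unbalanced)
```

Building the flag with `astype(int)` avoids the separate `fillna(0)` step, since `str.contains` already returns `False` for non-matching rows.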

**Drop irrelevant rows**

Many of the decisions below are based on assumptions and judgement derived from the analysis, EDA and intuition.

In [35]:
# Define function for dropping irrelevant rows

def cln_drop_rows(data):
    
    # Drop missing rows of features with few missing values
    data.dropna(subset=[
        "name", "host_is_superhost", "bedrooms", "bathrooms",
        "neighbourhood_cleansed", "zipcode"
    ],
                inplace=True)

    
    # Drop missing rows of zipcode
    data = data[data.zipcode != "zip_nan"]

    
    # Remove "poor" listings (value above/below a certain threshold)
    data = data[data.price < 500]
    data = data[data.price >= 10]
    data = data[data.minimum_nights <= 100]

    
    # Remove listings where "accommodates" is lower than "guests_included"
    data = data[data.accommodates - data.guests_included >= 0]

    
    # Remove listings where "accommodates" > 10 (outliers)
    data = data[data.accommodates <= 10]

    
    # Remove listings where "accommodates" - "beds" < 0
    data = data[data.accommodates - data.beds >= 0]

    
    # Remove listings where "bedrooms" - "beds" > 2
    data = data[data.bedrooms - data.beds <= 2]

    
    # Remove listings where "beds" - "bedrooms" > 10
    data = data[data.beds - data.bedrooms <= 10]

    
    # Remove listings where "monthly_price" is more than 30x "price"
    data = data[data.monthly_price / data.price <= 30]

    
    # Remove listings where "weekly_price" is more than 7x "price"
    data = data[data.weekly_price / data.price <= 7]

    
    # Remove "inactive" or "new" listings with no reviews in last twelve months
    data = data[data.number_of_reviews_ltm != 0]

    
    # Remove listings with no "availability_365" and no reviews in last three months
    # (cutoff is 90 days before dataset_date)
    cutoff_date = datetime.strptime(dataset_date, "%Y-%m-%d") - timedelta(days=3 * 30)
    data = data[(data.availability_365 != 0) | (data.last_review > cutoff_date)]
    
    return data

**Drop irrelevant columns**

Remove low-frequency classes from categorical columns: neighbourhood_cleansed and zipcode both have a large number of unique values, many with fewer than 10 occurrences. Hence, all classes with a share of <0.25% are bundled as "other".

In [36]:
# Define function for dropping irrelevant columns

def cln_drop_cols(data):    
    
    # Change neighbourhood_cleansed values that make up <0.25% of data to "nb_other"
    data = data.apply(lambda x: x.mask(
        x.map(x.value_counts()) < (0.0025 * len(data)), 'nb_other')
                      if x.name == 'neighbourhood_cleansed' else x)
    
    
    # Change zipcodes that make up <0.25% of data to "other"
    data = data.apply(lambda x: x.mask(
        x.map(x.value_counts()) < (0.0025 * len(data)), 'zip_other')
                      if x.name == 'zipcode' else x)
    
    
    # Drop irrelevant columns
    data.drop(
        [
            "bed_type",
            "experiences_offered",
            "has_availability",
    #        "host_acceptance_rate",
            "host_location",
    #        "host_response_rate",
    #        "host_response_time",  
    #        "number_of_reviews", 
    #        "number_of_reviews_ltm",
            "requires_license",
            "is_business_travel_ready",
            "host_has_profile_pic",
            "host_listings_count",
            "require_guest_profile_picture",
            "require_guest_phone_verification",
            "reviews_per_month",
            "square_feet"
        ],
        inplace=True,
        axis=1)
    
    return data
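
The bundling of low-frequency classes in `cln_drop_cols` can also be expressed per column rather than through `DataFrame.apply`. The helper `bundle_rare` below is a hypothetical sketch, not part of the notebook; it computes shares over non-missing values, which may differ slightly from the original if a column contains NaN.

```python
import pandas as pd


def bundle_rare(series, min_share, other_label):
    """Replace classes with a relative frequency below min_share by other_label."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare, replace the rest
    return series.where(~series.isin(rare), other_label)


# Toy zipcode column: shares are 60%, 30% and 10%
zips = pd.Series(["zip_75004"] * 6 + ["zip_75011"] * 3 + ["zip_75020"])
bundled = bundle_rare(zips, min_share=0.2, other_label="zip_other")
print(bundled.value_counts())
```

With the notebook's threshold, the call would use `min_share=0.0025`.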

Explanation for selection of dropped columns:

| **FEATURE(S)** | **NOTES** |
| :----- | :----- | 
| **bed_type** | over 97% of values were "Real Bed", hence little added value |
| **experiences_offered** | all values are "none" |
| **has_availability** | all values are "t" |
| **requires_license, host_has_profile_pic** | almost all values are "t" |
| **is_business_travel_ready** | all values are "f" |
| **require_guest_xxx** | almost all values are "f" |
| **host_listings_count** | calculated_host_listings_count appears to be a sanitized version (ranges from 1 to 55) of host_listings_count (has values 0 and highest is 1397) |
| **other host_xyz** | too many missing values |
| **reviews_per_month** | number_of_reviews_ltm kept as better measure |
| **square_feet** | too many missing values |
| <s>**property_type**</s> | <s>90% of values are "apartment", too many unique values to sensibly classify</s> kept instead |

## Apply data cleaning functions

In [37]:
# Bundle data cleaning steps as function "data_cleaning"
def data_cleaning(data):
    data = cln_fill_missing_val(data)
    data = cln_chg_datatypes(data)
    data = cln_sel_amenities(data)
    data = cln_drop_rows(data)
    data = cln_drop_cols(data)
    
    return data
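
Because several steps drop rows, it can help to see how many listings each step removes. The following is a minimal sketch of such a wrapper; `run_steps` and the two toy steps are hypothetical and stand in for the `cln_*` functions defined above.

```python
import pandas as pd


def run_steps(data, steps):
    """Apply each step in order and log (step name, rows before, rows after)."""
    log = []
    for step in steps:
        before = len(data)
        data = step(data)
        log.append((step.__name__, before, len(data)))
    return data, log


# Two toy cleaning steps (illustrative thresholds)
def drop_cheap(data):
    return data[data.price >= 10]


def drop_large(data):
    return data[data.accommodates <= 10]


toy = pd.DataFrame({"price": [5, 50, 80, 120], "accommodates": [2, 2, 12, 4]})
cleaned, log = run_steps(toy, [drop_cheap, drop_large])
print(log)
```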

In [38]:
# Apply data cleaning to dataset
data = data_cleaning(data)



## Final Check, Cleaning and Export

In [39]:
# Sort columns in dataset
data = data.reindex(sorted(data.columns, reverse=False), axis=1)

In [40]:
# List datatypes (data.info()) (post-cleaning)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29165 entries, 5396 to 42860297
Data columns (total 66 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   accommodates                    29165 non-null  int64         
 1   am_balcony                      29165 non-null  float64       
 2   am_breakfast                    29165 non-null  float64       
 3   am_child_friendly               29165 non-null  float64       
 4   am_coffee_machine               29165 non-null  float64       
 5   am_cooking_basics               29165 non-null  float64       
 6   am_elevator                     29165 non-null  float64       
 7   am_essentials                   29165 non-null  float64       
 8   am_nature_and_views             29165 non-null  float64       
 9   am_parking                      29165 non-null  float64       
 10  am_pets_allowed                 29165 non-null  float64       
 

Many columns are only kept for EDA or feature engineering and will be dropped afterwards. The remaining columns are prepared and ready for further processing.

In [41]:
# List missing values (post-cleaning)

def count_missing(data):
    null_cols = data.columns[data.isnull().any(axis=0)]
    X_null = data[null_cols].isnull().sum()
    X_null = X_null.sort_values(ascending=False)
    print(X_null)

count_missing(data)
#data.isnull().sum()

Series([], dtype: float64)


As shown above, no missing values remain after cleaning.

In [42]:
# Display cleaned dataset
print(data.shape)
data.head(3)

(29165, 66)


id,accommodates,am_balcony,am_breakfast,am_child_friendly,am_coffee_machine,am_cooking_basics,am_elevator,am_essentials,am_nature_and_views,am_parking,am_pets_allowed,am_private_entrance,am_smoking_allowed,am_tv,am_white_goods,amenities,availability_365,availability_90,bathrooms,bedrooms,beds,calculated_host_listings_count,cancellation_policy,cleaning_fee,description,extra_people,first_review,guests_included,host_acceptance_rate,host_identity_verified,host_is_superhost,host_response_rate,host_response_time,house_rules,instant_bookable,interaction,is_location_exact,last_review,latitude,listing_url,longitude,maximum_nights,minimum_nights,monthly_price,name,neighborhood_overview,neighbourhood_cleansed,notes,number_of_reviews,number_of_reviews_ltm,price,property_type,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,room_type,security_deposit,space,summary,transit,weekly_price,zipcode
5396,2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"{Internet,Wifi,Kitchen,Heating,Washer,""Smoke d...",32,32,1.0,0.5,1.0,1,strict_14_with_grace_period,36.0,"Cozy, well-appointed and graciously designed s...",0.0,2009-06-30,1,100.0,t,f,100.0,within an hour,This is a small flat in a very old building th...,t,We expect guests to operate rather independent...,t,2020-03-01,48.851,https://www.airbnb.com/rooms/5396,2.35869,2,1,2000.0,Explore the heart of old Paris,"You are within walking distance to the Louvre,...",Hôtel-de-Ville,The staircase leading up to the apartment is n...,215,48,115.0,Apartment,8.0,9.0,8.0,9.0,10.0,90.0,8.0,Entire home/apt,0.0,"Small, well appointed studio apartment at the ...","Cozy, well-appointed and graciously designed s...",The flat is close to two or three major metro ...,600.0,zip_75004
7397,4,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Paid par...",238,45,1.0,2.0,2.0,1,moderate,50.0,"VERY CONVENIENT, WITH THE BEST LOCATION ! PLEA...",10.0,2011-04-08,2,86.0,t,t,100.0,within an hour,ELECTRICITY INCLUDED FOR NORMAL USING. PLEASE ...,f,,t,2020-02-26,48.85758,https://www.airbnb.com/rooms/7397,2.35275,23,4,2200.0,MARAIS - 2ROOMS APT - 2/4 PEOPLE,,Hôtel-de-Ville,Important: Be conscious that an apartment in a...,268,29,119.0,Apartment,10.0,10.0,9.0,10.0,10.0,94.0,10.0,Entire home/apt,200.0,PLEASE ASK ME BEFORE TO MAKE A REQUEST !!! No ...,"VERY CONVENIENT, WITH THE BEST LOCATION !",Metro station HÖTEL-DE-VILLE is 100 meters close.,0.0,zip_75004
9952,2,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,"{TV,Internet,Wifi,Kitchen,""Paid parking off pr...",305,45,1.0,1.0,1.0,1,moderate,30.0,"Je suis une dame retraitée, qui propose un agr...",0.0,2013-03-19,1,100.0,f,f,100.0,within an hour,DO NOT USE FIREPLACE,t,Host will only be present to hand the keys but...,t,2020-01-21,48.86227,https://www.airbnb.com/rooms/9952,2.37134,120,5,1300.0,Paris petit coin douillet,"Vibrant neighborhood, full of bars, cafés, fre...",Popincourt,,25,8,75.0,Apartment,10.0,10.0,10.0,10.0,10.0,98.0,10.0,Entire home/apt,250.0,Make your stay in Paris a perfect experience. ...,"Je suis une dame retraitée, qui propose un agr...",The closest metro stations: Oberkampf (metro l...,0.0,zip_75011


The number of listings has been reduced substantially (from 67323 to 29165) by removing rows deemed to be irrelevant or distorting.

**Export data_clean**

In [43]:
# Export dataset for further use in 2_EDA_Clean and 3_Feature_Engineering
data_clean = data.copy()
save_load(data_clean, title="data_clean", function="save")

# Feature Engineering

In [44]:
# Import data_clean
data = save_load(title="data_clean", function="load")

In [45]:
# Import reviews.csv and convert date to datetime
data_rev = save_load(title="reviews", file_format="csv", function="load")
data_rev.date = pd.to_datetime(data_rev.date)
#print(data_rev.shape)
#data_rev.head(3)

## Define feature engineering functions

**Change column content**

- Reduce cancellation_policy and property_type classes
- Replace "0" values in monthly_price and weekly_price
- Recalculate guests_included (many listings specify guests_included=1 while accommodates is higher and no extra fee is charged)

In [46]:
# Define function for adapting existing features

def feat_adapt(data):

    # Reduce cancellation_policy to 4 classes
    data.cancellation_policy.replace(
        ["strict_14_with_grace_period", "super_strict_60", "super_strict_30"],
        ["strict", "super_strict", "super_strict"],
        inplace=True)

    
    # Reduce property_type to 6 classes, as per Airbnb classification (see listing creation in pdf)
    data.property_type.replace(["Condominium", "Loft", "Vacation home"],
                               "Apartment",
                               inplace=True)
    data.property_type.replace(
        ["Aparthotel", "Hostel", "Hotel", "Resort", "Serviced apartment"],
        "Boutique hotel",
        inplace=True)
    data.property_type.replace([
        "Casa particular (Cuba)", "Farm stay", "Nature lodge",
        "Pension (South Korea)"
    ],
                               "Bed and breakfast",
                               inplace=True)
    data.property_type.replace([
        "Bungalow", "Cabin", "Chalet", "Cottage", "Dome house", "Earth house",
        "Houseboat", "Hut", "Lighthouse", "Tiny house", "Townhouse", "Villa"
    ],
                               "House",
                               inplace=True)
    data.property_type.replace(["Guesthouse", "Guest suite"],
                               "Secondary unit",
                               inplace=True)
    data.property_type.replace([
        "Barn", "Boat", "Bus", "Camper/RV", "Campsite", "Castle", "Cave",
        "Igloo", "Island", "Plane", "Tent", "Tipi", "Train", "Treehouse",
        "Windmill", "Yurt"
    ],
                               "Unique space",
                               inplace=True)

    
    # Drop all listings that are not in the above 6 classes
    # (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
    data = data[data.property_type.isin([
        "Apartment", "Boutique hotel", "Bed and breakfast", "House",
        "Secondary unit", "Unique space"
    ])].copy()

    
    # Replace "0" values of "monthly_price" and "weekly_price" with 30x/7x "price"
    data["monthly_price"] = np.where(data.monthly_price == 0, data.price * 30,
                                     data.monthly_price)
    data["weekly_price"] = np.where(data.weekly_price == 0, data.price * 7,
                                    data.weekly_price)

    
    # Re-calculate "guests_included_calc" to be identical to "accommodates" where "extra_people"==0
    data["guests_included_calc"] = np.where(data.extra_people == 0,
                                            data.accommodates,
                                            data.guests_included)
    
    return data
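
The repeated `replace` calls for `property_type` can alternatively be driven by a single mapping. The sketch below shows the idea on a toy frame; `type_map` contains only a few of the mappings from `feat_adapt`, and the names `type_map` and `keep` are assumptions for illustration.

```python
import pandas as pd

# Partial mapping from raw property types to the reduced classes
type_map = {
    "Condominium": "Apartment",
    "Loft": "Apartment",
    "Hostel": "Boutique hotel",
    "Hotel": "Boutique hotel",
    "Townhouse": "House",
    "Villa": "House",
}

# The 6 classes that are kept after the reduction
keep = {
    "Apartment", "Boutique hotel", "Bed and breakfast", "House",
    "Secondary unit", "Unique space"
}

toy = pd.DataFrame(
    {"property_type": ["Loft", "Apartment", "Villa", "Other", "Hotel"]})

# Values not present in the mapping are left unchanged by replace
toy["property_type"] = toy.property_type.replace(type_map)

# Drop everything outside the 6 classes; copy to get an independent frame
toy = toy[toy.property_type.isin(keep)].copy()
print(toy.property_type.tolist())
```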

**Convert binary features to 1/0**

In [47]:
# Define function for engineering binary features

def feat_bin(data):
    
    # Convert t/f to 1/0 for various features
    data.host_is_superhost.replace(["t", "f"], [1, 0], inplace=True)
    data.host_identity_verified.replace(["t", "f"], [1, 0], inplace=True)
    data.is_location_exact.replace(["t", "f"], [1, 0], inplace=True)
    data.instant_bookable.replace(["t", "f"], [1, 0], inplace=True)
    
    
    # Change availability_365 to 1/0
    data.availability_365 = np.where(data.availability_365 != 0, 1, 0)
    
    
    # Create 1/0 for text descriptions
    data["description_exist"] = np.where(data.description != "", 1, 0)
    data["house_rules_exist"] = np.where(data.house_rules != "", 1, 0)
    data["interaction_exist"] = np.where(data.interaction != "", 1, 0)
    data["neighborhood_overview_exist"] = np.where(
        data.neighborhood_overview != "", 1, 0)
    data["notes_exist"] = np.where(data.notes != "", 1, 0)
    data["space_exist"] = np.where(data.space != "", 1, 0)
    data["summary_exist"] = np.where(data.summary != "", 1, 0)
    data["transit_exist"] = np.where(data.transit != "", 1, 0)
     
    return data
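
As a small illustration of the binary encoding, the sketch below applies the same t/f mapping to several columns at once and derives an `_exist` flag from a text column. The toy frame is made up; `flag_cols` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the listings frame (illustrative values only)
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", "t"],
    "instant_bookable": ["f", "f", "t"],
    "summary": ["Nice flat", "", "Central"],
})

# Map t/f to 1/0 for all flag columns in one step
flag_cols = ["host_is_superhost", "instant_bookable"]
toy[flag_cols] = toy[flag_cols].apply(lambda col: col.map({"t": 1, "f": 0}))

# 1 if a text exists, 0 if it is the empty string set during cleaning
toy["summary_exist"] = (toy.summary != "").astype(int)
print(toy)
```

Unlike `replace`, `map` turns any value other than "t" or "f" into NaN, which makes unexpected entries visible.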

**Create numerical features**

In [48]:
# Define function for engineering numerical features

def feat_num(data):

    # Retrieve "listing_no" from "listing_url"
    data["listing_no"] = [int(el.split("/")[-1]) for el in data.listing_url]

    
    # Calculate "price_calc" for one person from "price", "guests_included" and "extra_people",
    # then remove listings where "price_calc" ends up being <= 5
    data["price_calc"] = data.price - 0.5 * data.extra_people * (
        data.guests_included - 1)
    # .copy() avoids SettingWithCopyWarning on the column assignments that follow
    data = data[data.price_calc > 5].copy()

    
    # Calculate "price_extra_people" (price) for additional persons from "price", "guests_included", "extra_people" and "accommodates"
    data["price_extra_people"] = (
        data.extra_people * (data.accommodates - data.guests_included) +
        (0.5 * data.extra_people *
         (data.guests_included - 1))) / (data.accommodates - 1)
    data["price_extra_people"] = data["price_extra_people"].fillna(0)

    
    # Calculate "price_extra_fees" as the sum of "security_deposit" and "cleaning_fee"
    data["price_extra_fees"] = 0 + data.security_deposit + data.cleaning_fee

    
    # Calculate "descr_detail" as measure for how well the listing is described
    exist_cols = [
        "description_exist", "house_rules_exist", "interaction_exist",
        "neighborhood_overview_exist", "notes_exist", "space_exist",
        "summary_exist", "transit_exist"
    ]
    data["descr_detail"] = data[exist_cols].sum(axis=1)
    
    
    # Calculate "accommodates_per_bed" as feature to de-correlate "accommodates", "beds" and "bedrooms"
    data["accommodates_per_bed"] = data.accommodates / data.beds
    
    
    # Calculate "wk_mth_discount" from "monthly_price" and "weekly_price" with "price"
    data["wk_mth_discount"] = ((data.price * 30 - data.monthly_price) /
                               (data.price * 30) +
                               (data.price * 7 - data.weekly_price) /
                               (data.price * 7)) / 2

    
    # Calculate "first_review_days" as days between "first_review" and the dataset date
    data["first_review_days"] = (datetime.strptime(dataset_date,
                                                   '%Y-%m-%d')) - data.first_review
    data.first_review_days = [i.days for i in data.first_review_days]
    
    
    # Calculate "last_review_days" as days between "last_review" and the dataset date
    data["last_review_days"] = (datetime.strptime(dataset_date,
                                                  '%Y-%m-%d')) - data.last_review
    data.last_review_days = [i.days for i in data.last_review_days]
    
    
    # Calculate "review_scores_calc" as proxy considering number of reviews and penalizing new/inactive listings
    new_bias = [math.sqrt(el/50) for el in data.last_review_days]
    data["review_scores_calc"] = data.review_scores_rating - new_bias
    new_bias = []
    for reviews in data.number_of_reviews_ltm:
        if reviews < 10:
            new_bias.append(-3 + math.sqrt(reviews))
        else:
            new_bias.append(0)
    data.review_scores_calc = data.review_scores_calc + new_bias
        
    return data
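
The two price formulas in `feat_num` are easier to follow with numbers plugged in. The sketch below restates them as plain functions for a single listing. The function names and the example listing are hypothetical, and the reading that half of the extra-person fee is attributed to each included guest beyond the first is an interpretation of the formula, not something the notebook states.

```python
def price_per_person(price, extra_people, guests_included):
    """Base price for one person, mirroring the "price_calc" formula."""
    return price - 0.5 * extra_people * (guests_included - 1)


def price_per_extra_person(extra_people, guests_included, accommodates):
    """Average surcharge per additional person, mirroring "price_extra_people"."""
    if accommodates == 1:
        return 0.0  # stands in for the notebook's 0/0 result, which fillna replaces by 0
    charged = extra_people * (accommodates - guests_included)
    implied = 0.5 * extra_people * (guests_included - 1)
    return (charged + implied) / (accommodates - 1)


# Hypothetical listing: 100 per night, 3 guests included, 20 per extra guest, sleeps 5
base = price_per_person(price=100, extra_people=20, guests_included=3)
extra = price_per_extra_person(extra_people=20, guests_included=3, accommodates=5)
```

For this listing the base price is 100 - 0.5 x 20 x 2 = 80, and the average surcharge is (40 + 20) / 4 = 15 per additional person.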

**Create categorical features**

In [49]:
# Define function for engineering categorical features

def feat_cat(data):
    
    # TODO (not implemented yet): categorize listings by "state" (basic, moderate, luxurious)
    
    
    # Create word counts ("<column>_len") for each text column and "text_len" as their
    # normalized average (each count is scaled by its column maximum)
    data["description_len"] = [len(i.split()) for i in data.description]
    data["house_rules_len"] = [len(i.split()) for i in data.house_rules]
    data["interaction_len"] = [len(i.split()) for i in data.interaction]
    data["neighborhood_overview_len"] = [
        len(i.split()) for i in data.neighborhood_overview
    ]
    data["notes_len"] = [len(i.split()) for i in data.notes]
    data["space_len"] = [len(i.split()) for i in data.space]
    data["summary_len"] = [len(i.split()) for i in data.summary]
    data["transit_len"] = [len(i.split()) for i in data.transit]
    data["text_len"] = (
        data.description_len / data.description_len.max() +
        data.house_rules_len / data.house_rules_len.max() +
        data.interaction_len / data.interaction_len.max() +
        data.neighborhood_overview_len / data.neighborhood_overview_len.max() +
        data.notes_len / data.notes_len.max() + data.space_len /
        data.space_len.max() + data.summary_len / data.summary_len.max() +
        data.transit_len / data.transit_len.max()) / 8
    data.text_len = data.text_len / data.text_len.max()
    
    
    # Categorize listings as "review_scores_class" by "review_scores_rating"
    review_scores_class = []
    for score in data.review_scores_rating:
        if score == 0:
            review_scores_class.append(0)
        elif score <= 89:
            review_scores_class.append(1)
        elif score <= 93:
            review_scores_class.append(2)
        elif score <= 96:
            review_scores_class.append(3)
        elif score <= 99:
            review_scores_class.append(4)
        else:
            review_scores_class.append(5)
    data["review_scores_class"] = review_scores_class
    
    
    # Categorize listings as "review_scores_class_new" by "review_scores_calc"
    review_scores_class_new = []
    for score in data.review_scores_calc:
        if score == 0:
            review_scores_class_new.append(0)
        elif score <= 89:
            review_scores_class_new.append(1)
        elif score <= 92.5:
            review_scores_class_new.append(2)
        elif score <= 96:
            review_scores_class_new.append(3)
        elif score <= 98:
            review_scores_class_new.append(4)
        else:
            review_scores_class_new.append(5)
    data["review_scores_class_new"] = review_scores_class_new
    
    
    # Categorize listings as "price_class" by "price_calc"
    price_class = []
    for price in data.price_calc:
        if price <= 20:
            price_class.append(1)
        elif price <= 30:
            price_class.append(2)
        elif price <= 40:
            price_class.append(3)
        elif price <= 50:
            price_class.append(4)
        elif price <= 60:
            price_class.append(5)
        elif price <= 70:
            price_class.append(6)
        elif price <= 80:
            price_class.append(7)
        elif price <= 90:
            price_class.append(8)
        elif price <= 100:
            price_class.append(9)
        elif price <= 150:
            price_class.append(10)
        else:
            price_class.append(11)
    data["price_class"] = price_class
        
    return data
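
The if/elif chains in `feat_cat` can also be written as a single binning call. The following is a sketch with `pd.cut`, shown for the `review_scores_class` thresholds. The sample scores are made up, and the sketch assumes scores are never negative (a negative score would land in class 0 here but in class 1 in the loop).

```python
import numpy as np
import pandas as pd

# Hypothetical review scores on the 0-100 scale (0 = no rating available)
scores = pd.Series([0, 50, 89, 90, 93, 96, 99, 100])

# Right-closed bins reproduce the chain:
# == 0 -> 0, <= 89 -> 1, <= 93 -> 2, <= 96 -> 3, <= 99 -> 4, else 5
bins = [-np.inf, 0, 89, 93, 96, 99, np.inf]
review_scores_class = pd.cut(scores, bins=bins, labels=False)

classes = review_scores_class.tolist()
```

With `labels=False`, `pd.cut` returns the integer position of each bin, so the class numbers match the values appended in the loop.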

**Convert text columns into meaningful information**

In [50]:
# Define function for engineering text features

def feat_text(data):

    # FUTURE WORK: derive features from the free-text columns (e.g. inspect data.description.sample(5))
    
    return data

**Create log/sqrt from existing features**

Certain features with relatively high skew (see 2_Clean) are now replaced by their log or square-root transform.

In [51]:
# Define function for creating log/sqrt from skewed features

def feat_log_sqrt(data):
    
    # Create log "bathrooms_log" for numerical feature "bathrooms"
    data["bathrooms_log"] = [math.log(el) for el in data["bathrooms"]]
    
    
    # Create sqrt and log "calc_host_lst_count_sqrt_log" for numerical feature "calculated_host_listings_count"
    data["calc_host_lst_count_sqrt_log"] = [
        math.log(math.sqrt(el)) for el in data["calculated_host_listings_count"]
    ]
    
    
    # Create sqrt "first_review_days_sqrt" for numerical feature "first_review_days"
    data["first_review_days_sqrt"] = [
        math.sqrt(el) for el in data.first_review_days
    ]
    
    
    # Create sqrt "last_review_days_sqrt" for numerical feature "last_review_days"
    data["last_review_days_sqrt"] = [math.sqrt(el) for el in data.last_review_days]
    
    
    # Create sqrt "minimum_nights_sqrt" for numerical feature "minimum_nights"
    data["minimum_nights_sqrt"] = [math.sqrt(el) for el in data["minimum_nights"]]
    
    
    # Create sqrt for numerical feature "number_of_reviews_ltm"; despite its "_log" suffix,
    # the column "number_of_reviews_ltm_log" holds the square root
    data["number_of_reviews_ltm_log"] = [
        math.sqrt(el) for el in data["number_of_reviews_ltm"]
    ]
    
    
    # Create sqrt "price_extra_fees_sqrt" for numerical feature "price_extra_fees"
    data["price_extra_fees_sqrt"] = [
        math.sqrt(el) for el in data["price_extra_fees"]
    ]
    
    
    # Create log "price_log" for numerical feature "price"
    data["price_log"] = [math.log(el) for el in data["price"]]
    
    
    # Create log "price_calc_log" for numerical feature "price_calc"
    data["price_calc_log"] = [math.log(el) for el in data["price_calc"]]
    
    
    # Create sqrt "review_scores_rating_sqrt" for numerical feature "review_scores_rating"
    review_max = data.review_scores_rating.max()
    data["review_scores_rating_sqrt"] = [
        math.sqrt(review_max - el) for el in data.review_scores_rating
    ]
    review_log_max = data.review_scores_rating_sqrt.max()
    data["review_scores_rating_sqrt"] = [(review_log_max - el)
                                         for el in data.review_scores_rating_sqrt]
    #data["review_scores_rating_sqrt"].plot(kind='hist', bins=50, figsize=(12,6), facecolor='grey',edgecolor='black');
    
    
    # Create sqrt "text_len_sqrt" for numerical feature "text_len"
    data["text_len_sqrt"] = [math.sqrt(el) for el in data["text_len"]]
        
    return data
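
The two-step transform used for `review_scores_rating_sqrt` is the least obvious one in `feat_log_sqrt`, so here is a small worked sketch. The ratings are made up and NumPy is used instead of list comprehensions; the arithmetic follows the function.

```python
import numpy as np

# Hypothetical left-skewed ratings on a 0-100 scale
ratings = np.array([100.0, 96.0, 84.0, 64.0])

# Step 1: reflect around the maximum so the long tail points right, then take sqrt
reflected_sqrt = np.sqrt(ratings.max() - ratings)      # [0, 2, 4, 6]

# Step 2: reflect back so that higher ratings again map to higher values
rating_sqrt = reflected_sqrt.max() - reflected_sqrt    # [6, 4, 2, 0]
```

The second reflection keeps the ordering of the original feature, so a higher rating still corresponds to a higher transformed value.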

**Calculate occupancy rate**

**Occupancy rate initially played a major role in designing the predictive model. However, it was deemed too uncertain an estimate to serve as a feature for price prediction, or as a target to be predicted from price. It is kept in this notebook for reference and potential future work.**

Calculation of **occupancy rate** is inspired by the **San Francisco model**, which is also applied by [Inside AirBnB](http://insideairbnb.com/about.html):

- (**A**) Determine the **average length of stay for Berlin**
- (**B**) Calculate **reviews relevant for considered timeframe**
- (**C**) Determine **active months in timeframe** from price (not relevant if only 1 month)
- (**D**) Estimate **# of bookings in considered timeframe** using (**B**)
- (**E**) **Occupancy rate** = (**D**)x(**A**) / ((**C**)/months x time span)

Read more about the core idea behind the calculations of the model [here](https://sfbos.org/sites/default/files/FileCenter/Documents/52601-BLA.ShortTermRentals.051315.pdf). Assumptions were adapted for the purpose of this analysis, mainly due to the core idea of considering only the two most recent years.

**Notes**:
- **(A)**: This model assumes an average length of stay in Berlin of around **3 nights**, unless a listing specifies a higher minimum stay. In 2016, an average of [4.6](https://www.airbnbcitizen.com/wp-content/uploads/2016/04/airbnb-community-berlin-en.pdf) nights was reported. Inside AirBnB uses 3 nights for cities where no current data is available, but [6.3 nights](http://insideairbnb.com/berlin/#) for its Berlin visualization.
- **(D)**: Estimate **# of bookings in considered timeframe** by dividing (**B**) by an assumed 50% review rate (i.e. one review corresponds to two bookings).
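
Steps (A), (D) and (E) can be combined into a small worked example. This is a sketch for a single listing under the stated assumptions (3 nights per stay, 50% review rate); the function name is hypothetical, and the 90-day window is taken from the implementation in this notebook, which counts reviews over the last three months.

```python
def occupancy_rate(reviews, avg_nights=3, review_rate=0.5, days=90):
    """Estimate occupancy as booked nights over available days, capped at 100%."""
    bookings_est = reviews / review_rate          # (D): one review ~ two bookings at 50%
    nights_booked = bookings_est * avg_nights     # (D) x (A)
    return min(nights_booked / days, 1.0)         # (E), capped at 1


# Hypothetical listings with 4 and 30 reviews in the 90-day window
low = occupancy_rate(reviews=4)    # 8 bookings x 3 nights / 90 days
high = occupancy_rate(reviews=30)  # 60 bookings x 3 nights exceeds 90 days, so capped
```

Four reviews correspond to an estimated 24 booked nights, i.e. an occupancy of roughly 27%, while 30 reviews would imply more nights than the window contains and are capped at 100%.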

In [52]:
# Define function for calculating occupancy rate based on formula above

def feat_occupancy(data): 

    # (**A**) Determine the **average length of stay for Berlin**
    # Add column to main dataframe for avg length of stay, being either a) the avg of min and max if maximum_nights is 5 or lower, b) minimum_nights if higher than 3 or c) 3 nights otherwise
    avg_nights = []
    for idx in data.index:
        if data.maximum_nights[idx] <= 5:
            avg_nights.append(
                (data.maximum_nights[idx] + data.minimum_nights[idx]) / 2)
        elif data.minimum_nights[idx] > 3:
            avg_nights.append(data.minimum_nights[idx])
        else:
            avg_nights.append(3)
    data["avg_nights"] = avg_nights
    
    
    # (**B**) Calculate **reviews in considered timeframe**
    # Keep only reviews within a specified timeframe (see Dashboard)
    data_rev_count = data_rev[(data_rev.date > ((datetime.strptime(dataset_date, "%Y-%m-%d"))-timedelta(3 * 30)).strftime("%Y-%m-%d"))
                              & (data_rev.date < dataset_date)]
    # Count reviews per listing and save as table with column "reviews_3mth"
    # (naming the Series directly keeps the column name stable across pandas versions)
    data_rev_count = data_rev_count.listing_id.value_counts().rename(
        "reviews_3mth").to_frame()
    
    # Merge review count to "data"
    data = pd.merge(data,
                    data_rev_count,
                    how="left",
                    left_index=True,
                    right_index=True)  # Add column to main dataset
    data["reviews_3mth"] = data["reviews_3mth"].fillna(0)
    
    
    # (**C**) Determine **active months and relevant months** from price
    # Count the months where listings were online with a price (not relevant if 1 mth)
    data["active_months"] = 1
    relevant_mths = 1
    
    
    # (**D**) Estimate **# of bookings in considered timeframe**
    # Calculate bookings estimate and replace NaN with 0
    data["bookings_est"] = data.reviews_3mth / review_rate
    data["bookings_est"] = data["bookings_est"].fillna(0)
    
    
    # (**E**) **Occupancy rate** = (**D**)x(**A**) / ((**C**)/months x time span)
    # Calculate occupancy rate
    data["occupancy_rate"] = data.bookings_est * data.avg_nights / (
        data.active_months / relevant_mths * 90)
    
    
    # Modify occupancy rate
    # Cap occupancy at 100%
    occupancy_temp = []
    for rate in data.occupancy_rate:
        if rate < 1:
            occupancy_temp.append(rate)
        else:
            occupancy_temp.append(1)
    data.occupancy_rate = occupancy_temp
    
    # Split occupancy into 2 classes according to threshold (splitting into temporary and permanent rentals)
    occupancy_class = []
    for rate in data.occupancy_rate:
        if rate < 0.3:
            occupancy_class.append(0)
        else:
            occupancy_class.append(1)
    data["occupancy_class"] = occupancy_class
    
    # Show occupancy split
    print(data.occupancy_class.value_counts())
    
    return data
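
The row-by-row loop for `avg_nights` can be expressed without iteration. The following is a sketch using `np.select` on made-up listings; it follows the same three rules as the loop but is not the notebook's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical listings
toy = pd.DataFrame({
    "minimum_nights": [1, 2, 7, 1],
    "maximum_nights": [3, 30, 60, 1125],
})

conditions = [
    toy["maximum_nights"] <= 5,   # short maximum stay: midpoint of min and max
    toy["minimum_nights"] > 3,    # long minimum stay: use the minimum
]
choices = [
    (toy["maximum_nights"] + toy["minimum_nights"]) / 2,
    toy["minimum_nights"],
]
# np.select picks the first matching condition per row; default covers all other rows
toy["avg_nights"] = np.select(conditions, choices, default=3)

avg_nights = toy["avg_nights"].tolist()
```

Because `np.select` evaluates the conditions in order, the priority of the rules is the same as in the if/elif chain.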

**Drop irrelevant columns**

In [53]:
# Define function for dropping irrelevant columns

def feat_drop_cols(data):
    
    # Drop further columns
    data.drop(
        [
            "active_months",
            "amenities",
            "am_coffee_machine",
            "am_cooking_basics",
            "am_parking",
            "availability_365",
            "avg_nights",
            "bathrooms",
            "beds",
            "bookings_est",
            "calculated_host_listings_count",
            "cleaning_fee",
            "descr_detail",
            "description",
            "description_exist",
            "description_len",
            "extra_people",
            "first_review",
            "first_review_days",
            "guests_included",
            "host_identity_verified",
            "house_rules",
            "house_rules_exist",
            "house_rules_len",
            "interaction",
            "interaction_exist",
            "interaction_len",
            "is_location_exact",
            "last_review",
            "last_review_days",
            "listing_url",
            "minimum_nights",
            "monthly_price",
            "name",
            "neighborhood_overview",
            "neighborhood_overview_exist",
            "neighborhood_overview_len",
            "notes",
            "notes_exist",
            "notes_len",
            "number_of_reviews",
            "number_of_reviews_ltm",
            #        "occupancy_class",
            #        "price",
            "price_calc",
            #        "price_avg", "price_diff", "price_diff_perc",
            "price_extra_fees",
            'review_scores_accuracy',
            'review_scores_checkin',
            'review_scores_cleanliness',
            'review_scores_communication',
            "review_scores_rating",
            'review_scores_value',
            "reviews_3mth",
            "security_deposit",
            "space",
            "space_exist",
            "space_len",
            "summary",
            "summary_exist",
            "summary_len",
            "text_len",
            "transit",
            "transit_exist",
            "transit_len",
            "weekly_price"
        ],
        inplace=True,
        axis=1)
        
    return data

A large number of features are dropped: some have been replaced by engineered versions, some were deemed irrelevant, and some turned out to be highly correlated with other features during EDA. Notes on specific features:

| **DROPPED FEATURE** | **REASONING** |
| :----- | :----- |
| **am_coffee_machine** | high correlation (>0.3) with >5 other features |
| **am_parking** | high correlation (>0.3) with >5 other features |
| **availability_365** | high correlation (>0.3) with >5 other features |
| **descr_detail** | dropped in favour of **text_len** |
| **review_scores_xxx** | high correlation with review_scores_rating |
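
The criterion "high correlation (>0.3) with >5 other features" can be checked with a short helper. This is a sketch only: the function name and the toy data are hypothetical, and the notebook's EDA may have derived the counts differently.

```python
import pandas as pd


def count_high_corr(df, threshold=0.3):
    """For each column, count the other columns with |Pearson r| above threshold."""
    corr = df.corr().abs()
    # Subtract 1 to exclude each column's correlation with itself
    return (corr > threshold).sum(axis=1) - 1


# Hypothetical toy features: b and c are linear in a, d is uncorrelated with a
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [-1, -2, -3, -4],
    "d": [1, -1, -1, 1],
})
high_corr_counts = count_high_corr(toy)
```

On the real dataset, features whose count exceeds 5 would be the candidates for dropping under this rule.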


## Apply feature engineering functions

In [54]:
# Bundle feature engineering steps as function "feature_engineering"
def feature_engineering(data):
    data = feat_adapt(data)
    data = feat_bin(data)
    data = feat_num(data)
    data = feat_cat(data)
    data = feat_text(data)
    data = feat_log_sqrt(data)
    data = feat_occupancy(data)  
    data = feat_drop_cols(data)  
    return data

In [55]:
# Apply feature engineering to dataset
data = feature_engineering(data)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


0    21084
1     7990
Name: occupancy_class, dtype: int64


## Final Check, Cleaning and Export

In [56]:
# Sort columns in dataset
data = data.reindex(sorted(data.columns, reverse=False), axis=1)

In [57]:
# Review datatypes (data.info()) (post-engineering)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29074 entries, 5396 to 42860297
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   accommodates                  29074 non-null  int64  
 1   accommodates_per_bed          29074 non-null  float64
 2   am_balcony                    29074 non-null  float64
 3   am_breakfast                  29074 non-null  float64
 4   am_child_friendly             29074 non-null  float64
 5   am_elevator                   29074 non-null  float64
 6   am_essentials                 29074 non-null  float64
 7   am_nature_and_views           29074 non-null  float64
 8   am_pets_allowed               29074 non-null  float64
 9   am_private_entrance           29074 non-null  float64
 10  am_smoking_allowed            29074 non-null  float64
 11  am_tv                         29074 non-null  float64
 12  am_white_goods                29074 non-null  float64


In [58]:
# Display engineered dataset
print(data.shape)
data.head(3)

(29074, 51)


id,accommodates,accommodates_per_bed,am_balcony,am_breakfast,am_child_friendly,am_elevator,am_essentials,am_nature_and_views,am_pets_allowed,am_private_entrance,am_smoking_allowed,am_tv,am_white_goods,availability_90,bathrooms_log,bedrooms,calc_host_lst_count_sqrt_log,cancellation_policy,first_review_days_sqrt,guests_included_calc,host_acceptance_rate,host_is_superhost,host_response_rate,host_response_time,instant_bookable,last_review_days_sqrt,latitude,listing_no,longitude,maximum_nights,minimum_nights_sqrt,neighbourhood_cleansed,number_of_reviews_ltm_log,occupancy_class,occupancy_rate,price,price_calc_log,price_class,price_extra_fees_sqrt,price_extra_people,price_log,property_type,review_scores_calc,review_scores_class,review_scores_class_new,review_scores_location,review_scores_rating_sqrt,room_type,text_len_sqrt,wk_mth_discount,zipcode
5396,2,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,32,0.0,0.5,0.0,strict,62.545983,2,100.0,0,100.0,within an hour,1,3.872983,48.851,5396,2.35869,2,1.0,Hôtel-de-Ville,6.928203,0,0.266667,115.0,4.744932,10,6.0,0.0,4.744932,Apartment,89.452277,2,2,10.0,5.781994,Entire home/apt,0.63066,0.337474,zip_75004
7397,4,2.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,45,0.0,2.0,0.0,moderate,57.140179,2,86.0,1,100.0,within an hour,0,4.358899,48.85758,7397,2.35275,23,2.0,Hôtel-de-Ville,5.385165,1,0.533333,119.0,4.736198,10,15.811388,8.333333,4.779123,Apartment,93.383559,3,3,10.0,6.494782,Entire home/apt,0.579981,0.191877,zip_75004
9952,2,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,45,0.0,1.0,0.0,moderate,50.537115,2,100.0,0,100.0,within an hour,1,7.416198,48.86227,9952,2.37134,120,2.236068,Popincourt,2.828427,0,0.222222,75.0,4.317488,7,16.733201,0.0,4.317488,Apartment,96.779618,4,4,10.0,7.530058,Entire home/apt,0.607842,0.211111,zip_75011


**Export data_engineered**

In [59]:
# Export dataset for further use in 4_EDA_Engineered and 5_Predictive_Modeling
data_engineered = data.copy()
save_load(data_engineered, title="data_engineered", function="save")

# Preprocessing (Train/Test Split and Pipeline)

In [60]:
# Import data_engineered
data = save_load(title="data_engineered", function="load")

In [61]:
# Feature selection

#... by removing certain features
all_features = [
    el for el in data.columns if el not in [
        'occupancy_rate', 'occupancy_class', 'listing_no', 'price_log',
        'price_class', 'price_binary', "review_scores_class_new",
        "review_scores_class", "review_scores_calc", "neighbourhood_cleansed"
    ]
]

#... by only considering certain features
key_features = [
    "accommodates_per_bed", "am_balcony", "am_breakfast", "am_child_friendly",
    "am_elevator", "am_essentials", "am_pets_allowed", "am_private_entrance",
    "am_smoking_allowed", "am_tv", "bathrooms_log", "bedrooms",
    "calc_host_lst_count_sqrt_log", "cancellation_policy",
    "guests_included_calc", "host_is_superhost", "instant_bookable",
    "maximum_nights", "minimum_nights_sqrt", "property_type", "room_type",
    "wk_mth_discount", "zipcode"
]

# select features for predictive modeling from above: [all_features, key_features]
pred_features = key_features

#Display columns:
#all_features
#key_features

Please make sure to carefully select the features you want to include in modeling via the cell above. Below you will see the output and potential issues with the selection, if detected.

In [62]:
# Print target setting and feature selection
print_target_setting()

You are currently using PRICE_LOG as the target and neg_median_absolute_error for scoring to predict prices for paris on 2020-03-16

You are currently using these features for its prediction:
['accommodates_per_bed', 'am_balcony', 'am_breakfast', 'am_child_friendly', 'am_elevator', 'am_essentials', 'am_pets_allowed', 'am_private_entrance', 'am_smoking_allowed', 'am_tv', 'bathrooms_log', 'bedrooms', 'calc_host_lst_count_sqrt_log', 'cancellation_policy', 'guests_included_calc', 'host_is_superhost', 'instant_bookable', 'maximum_nights', 'minimum_nights_sqrt', 'property_type', 'room_type', 'wk_mth_discount', 'zipcode']

No issues with your selection of pred_features have been detected. Please make sure to manually check for correctness nevertheless.


In [63]:
# Drop columns
drop_columns = [el for el in data.columns if el not in pred_features]
drop_columns.remove(target)
data.drop(labels=drop_columns, inplace=True, axis=1)

In [64]:
# Drop rows (optional, just temporary)
#data = data[data.number_of_reviews_ltm_log>1.7]

## Preprocessing pipeline

In [65]:
# Create list for categorical predictors/features (used in "Scaling with Preprocessing Pipeline")
cat_features = list(data.columns[data.dtypes == object])
#cat_features.remove("neighbourhood")
#cat_features.remove("zipcode")
cat_features

['cancellation_policy', 'property_type', 'room_type', 'zipcode']

In [66]:
# Create list for numerical predictors/features (removing target column, used in "Scaling with Preprocessing Pipeline")
num_features = list(data.columns[data.dtypes != object])
num_features.remove(target)
num_features

['accommodates_per_bed',
 'am_balcony',
 'am_breakfast',
 'am_child_friendly',
 'am_elevator',
 'am_essentials',
 'am_pets_allowed',
 'am_private_entrance',
 'am_smoking_allowed',
 'am_tv',
 'bathrooms_log',
 'bedrooms',
 'calc_host_lst_count_sqrt_log',
 'guests_included_calc',
 'host_is_superhost',
 'instant_bookable',
 'maximum_nights',
 'minimum_nights_sqrt',
 'wk_mth_discount']

In [67]:
# Build preprocessor pipeline
# Pipeline for numerical features
num_pipeline = Pipeline([('imputer_num', SimpleImputer(strategy='median')),
                         ('std_scaler', StandardScaler())])

# Pipeline for categorical features
cat_pipeline = Pipeline([
    ('imputer_cat', SimpleImputer(strategy='constant', fill_value='missing')),
    ('1hot', OneHotEncoder(drop='first', handle_unknown='error'))
])

# Complete pipeline
preprocessor = ColumnTransformer([('num', num_pipeline, num_features),
                                  ('cat', cat_pipeline, cat_features)])
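
To see what the preprocessor produces, the same structure can be run on a tiny frame. This is a sketch on made-up data: the toy column values are hypothetical, and `sparse_threshold=0` is an addition that forces a dense array for easy inspection (the notebook's preprocessor uses the default).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical feature
toy = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
    "room_type": pd.Series(
        ["Entire home/apt", "Private room", "Private room", "Shared room"],
        dtype=object),
})

toy_num = Pipeline([("imputer_num", SimpleImputer(strategy="median")),
                    ("std_scaler", StandardScaler())])
toy_cat = Pipeline([
    ("imputer_cat", SimpleImputer(strategy="constant", fill_value="missing")),
    ("1hot", OneHotEncoder(drop="first")),
])

# sparse_threshold=0 always returns a dense array (assumption for readability)
toy_preprocessor = ColumnTransformer([("num", toy_num, ["bedrooms"]),
                                      ("cat", toy_cat, ["room_type"])],
                                     sparse_threshold=0)
toy_prep = toy_preprocessor.fit_transform(toy)
```

The result has one scaled numerical column plus two one-hot columns: three room types minus the first category, which `drop='first'` removes.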

## Train/test split

In [68]:
# Define predictors and target variable
X = data.drop([target], axis=1)
y = data[target]

In [69]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_size,
                                                    random_state=random_state,
                                                    shuffle=True)
#                                                   stratify=y)  # Use stratify=y only for classification targets with imbalanced labels (check with value_counts())

In [70]:
# Saving preprocessed X_train and X_test
X_train_prep_preprocessor = preprocessor.fit(X_train)

X_train_prep = X_train_prep_preprocessor.transform(X_train)
X_train_num_prep = num_pipeline.fit_transform(X_train[num_features])
X_test_prep = X_train_prep_preprocessor.transform(X_test)

In [71]:
# Get feature names from pipeline after one-hot encoding as "column_names"
onehot_columns = list(preprocessor.named_transformers_['cat']['1hot'].get_feature_names(cat_features))
column_names = num_features + onehot_columns

## Save preprocessor and X_test

In [72]:
X_train_prep

array([[-0.73234519, -0.34477044, -0.37555558, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.04692952, -0.34477044, -0.37555558, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.60547894, -0.34477044,  2.66272172, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 3.16402836,  2.90048071, -0.37555558, ...,  0.        ,
         0.        ,  0.        ],
       [-0.73234519, -0.34477044, -0.37555558, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.04692952, -0.34477044, -0.37555558, ...,  0.        ,
         0.        ,  0.        ]])

In [73]:
# Save preprocessor
save_load(X_train_prep_preprocessor, title="preprocessor", function="save")

In [74]:
# Save X_test
save_load(X_test, title="X_test", function="save")

## Feature selection (optional, on classification)

In [75]:
# Calculate "y_sel" from "price_log" (y) to get classification task
price_feat_sel = []
for price in y:
    if price <= 3.4:
        price_feat_sel.append(0)
    elif price <= 3.7:
        price_feat_sel.append(1)
    elif price <= 4:
        price_feat_sel.append(2)
    elif price <= 4.3:
        price_feat_sel.append(3)
    elif price <= 4.6:
        price_feat_sel.append(4)
    elif price <= 4.9:
        price_feat_sel.append(5)
    else:
        price_feat_sel.append(6)
y_sel = price_feat_sel
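
The binning of `price_log` into classes can also be done in one call. The sketch below uses `np.digitize` with the same edges as the if/elif chain; the sample log prices are made up.

```python
import numpy as np

# Upper bin edges taken from the if/elif chain
edges = [3.4, 3.7, 4.0, 4.3, 4.6, 4.9]

# Hypothetical log prices
log_prices = np.array([3.0, 3.4, 3.5, 4.0, 4.31, 4.9, 5.5])

# right=True makes each bin include its upper edge, matching the "<=" comparisons
y_sel_toy = np.digitize(log_prices, edges, right=True)
```

Values above the last edge fall into class 6, just like the final `else` branch of the loop.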

**Pearson's correlation**

In [76]:
# Define function
def cor_selector(X, y_sel):
    cor_list = []
    # calculate the correlation with the target for each numerical feature
    for i in num_features:
        cor = np.corrcoef(X[i], y_sel)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # feature names sorted by ascending absolute correlation
    # (index into X[num_features] so positions match cor_list; no top-k cut-off is applied)
    cor_feature = X[num_features].iloc[:, np.argsort(np.abs(cor_list))].columns.tolist()
    # feature selection? False for not selected, True for selected
    cor_support = [True if i in cor_feature else False for i in num_features]
    return cor_support, cor_feature

In [77]:
# Execute function
cor_support, cor_feature = cor_selector(X, y_sel)
print(str(len(cor_feature)), 'selected features')

19 selected features
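
Since all 19 numerical features are reported as selected, the Pearson step does not narrow the set by itself. If a cut-off is wanted, a top-k variant could look like the sketch below. The function name, the parameter `k` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def top_k_by_correlation(X, y, k):
    """Return (support mask, names) of the k features with the largest |Pearson r| to y."""
    cors = np.array([np.corrcoef(X[col], y)[0, 1] for col in X.columns])
    cors = np.nan_to_num(cors)                      # constant columns yield NaN -> 0
    top_idx = np.argsort(np.abs(cors))[-k:]         # positions of the k strongest features
    selected = X.columns[top_idx].tolist()
    support = [col in selected for col in X.columns]
    return support, selected


# Hypothetical toy data: "strong" tracks y exactly, "weak" loosely, "noise" only faintly
toy_X = pd.DataFrame({
    "strong": [1, 2, 3, 4, 5, 6],
    "weak": [1, 3, 2, 5, 4, 6],
    "noise": [1, -1, 1, -1, 1, -1],
})
toy_y = np.array([1, 2, 3, 4, 5, 6])

support, selected = top_k_by_correlation(toy_X, toy_y, k=2)
```

With `k=2`, the mask marks "strong" and "weak" and leaves out "noise", so the Pearson column of the comparison table would then differentiate between features.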


**Chi-squared**

In [78]:
# Perform chi-squared
X_norm = MinMaxScaler().fit_transform(X[num_features])
chi_selector = SelectKBest(chi2, k=10)
chi_selector.fit(X_norm, y_sel)
chi_support = chi_selector.get_support()
chi_feature = X[num_features].loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')

10 selected features


**Recursive feature elimination**

In [79]:
# Perform RFE
rfe_selector = RFE(estimator=LogisticRegression(max_iter=3000), n_features_to_select=10, step=10, verbose=5)
rfe_selector.fit(X_norm, y_sel)
rfe_support = rfe_selector.get_support()
rfe_feature = X[num_features].loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')

Fitting estimator with 19 features.
10 selected features


**SelectFromModel: Logistic regression (L2 penalty)**

In [80]:
# Fit model
embedded_lr_selector = SelectFromModel(LogisticRegression(penalty="l2", max_iter=3000), max_features=10)
embedded_lr_selector.fit(X_norm, y_sel)

SelectFromModel(estimator=LogisticRegression(max_iter=3000), max_features=10)

In [81]:
# Evaluate features
embedded_lr_support = embedded_lr_selector.get_support()
embedded_lr_feature = X[num_features].loc[:,embedded_lr_support].columns.tolist()
print(str(len(embedded_lr_feature)), 'selected features')

5 selected features


**SelectFromModel: Tree-based**

In [82]:
# Fit model
embedded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=10)
embedded_rf_selector = embedded_rf_selector.fit(X_norm, y_sel)

In [83]:
# Evaluate features
embedded_rf_support = embedded_rf_selector.get_support()
embedded_rf_feature = X[num_features].loc[:,embedded_rf_support].columns.tolist()
print(str(len(embedded_rf_feature)), 'selected features')

6 selected features


**SelectFromModel: LightGBM**

In [84]:
# Fit model
lgbc=LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
            reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)

embedded_lgb_selector = SelectFromModel(lgbc, max_features=10)
embedded_lgb_selector = embedded_lgb_selector.fit(X_norm, y_sel)

In [85]:
# Evaluate features
embedded_lgb_support = embedded_lgb_selector.get_support()
embedded_lgb_feature = X[num_features].loc[:,embedded_lgb_support].columns.tolist()
print(str(len(embedded_lgb_feature)), 'selected features')

6 selected features


**Feature evaluation**

In [86]:
# Put all selections together
feature_selection_df = pd.DataFrame({'Feature':num_features, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embedded_lr_support,
                                    'Random Forest':embedded_rf_support, 'LightGBM':embedded_lgb_support})

In [87]:
# Count how many selectors picked each feature (boolean columns only)
feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)

In [88]:
# display the top features
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(15)

| | Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | LightGBM | Total |
|---|---|---|---|---|---|---|---|---|
| 1 | guests_included_calc | True | True | True | True | True | True | 6 |
| 2 | calc_host_lst_count_sqrt_log | True | True | True | False | True | True | 5 |
| 3 | bedrooms | True | True | True | True | True | False | 5 |
| 4 | accommodates_per_bed | True | False | True | True | True | True | 5 |
| 5 | minimum_nights_sqrt | True | False | True | False | True | True | 4 |
| 6 | maximum_nights | True | False | True | False | True | True | 4 |
| 7 | wk_mth_discount | False | False | True | True | False | True | 3 |
| 8 | host_is_superhost | True | True | True | False | False | False | 3 |
| 9 | bathrooms_log | True | False | True | True | False | False | 3 |
| 10 | am_tv | True | True | True | False | False | False | 3 |
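
The ranking is a simple vote count: each selector contributes one boolean per feature and the `Total` column is their sum. A minimal sketch of that idea in plain Python, using three rows copied from the table above. The helper `rank_by_votes` and its `min_votes` threshold are hypothetical additions, and ties are broken alphabetically here.

```python
# Votes in the order Pearson, Chi-2, RFE, Logistics, Random Forest, LightGBM
votes = {
    "guests_included_calc": [True, True, True, True, True, True],
    "minimum_nights_sqrt": [True, False, True, False, True, True],
    "am_tv": [True, True, True, False, False, False],
}

def rank_by_votes(votes, min_votes=0):
    """Sort features by number of True votes and drop those below min_votes."""
    totals = {feature: sum(flags) for feature, flags in votes.items()}
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(feature, total) for feature, total in ranked if total >= min_votes]

ranking = rank_by_votes(votes, min_votes=4)
print(ranking)
```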


# Modeling: Regression ("price_log")

## Apply Regression Models

In [89]:
# Print current setting for TARGET
print_target_setting()

You are currently using PRICE_LOG as the target and neg_median_absolute_error for scoring to predict prices for paris on 2020-03-16

You are currently using these features for its prediction:
['accommodates_per_bed', 'am_balcony', 'am_breakfast', 'am_child_friendly', 'am_elevator', 'am_essentials', 'am_pets_allowed', 'am_private_entrance', 'am_smoking_allowed', 'am_tv', 'bathrooms_log', 'bedrooms', 'calc_host_lst_count_sqrt_log', 'cancellation_policy', 'guests_included_calc', 'host_is_superhost', 'instant_bookable', 'maximum_nights', 'minimum_nights_sqrt', 'property_type', 'room_type', 'wk_mth_discount', 'zipcode']

No issues with your selection of pred_features have been detected. Please make sure to manually check for correctness nevertheless.


In [90]:
# Select models for comparison
regmodels = {
    'Baseline':
    DummyRegressor(strategy='mean'),
    'LinReg':
    LinearRegression(),
    'Passive Aggressive':
    PassiveAggressiveRegressor(),
    #        'RANSAC' : RANSACRegressor(),
    'ElasticNet':
    ElasticNet(),
    'Stochastic Gradient Descent':
    SGDRegressor(max_iter=1000, tol=1e-3),
    'Decision Tree':
    DecisionTreeRegressor(criterion="mse",
                          max_depth=3,
                          random_state=random_state),
    'Random Forest':
    RandomForestRegressor(random_state=random_state,
                          max_features='sqrt',
                          n_jobs=-1),
    'Gradient Boost':
    GradientBoostingRegressor(random_state=random_state),
    'XGBoost':
    XGBRegressor(),
    'AdaBoost':
    AdaBoostRegressor(random_state=random_state),
    'SVR':
    SVR(),
    'CatBoost':
    CatBoostRegressor()
}

In [91]:
# Calculate and display results
results = pd.DataFrame(columns=['Model', 'MSE', 'RMSE', 'R2', 'MAE', 'MAPE', 'MAPE median'])
i = 0
for m in regmodels.items():
    # Build a full pipeline with our preprocessor and a regressor
    pipe = Pipeline([('preprocessor', preprocessor), (m[0], m[1])])
    # Make out-of-fold predictions on the training set using 5-fold cross-validation
    y_train_pred = cross_val_predict(pipe,
                                     X_train,
                                     y_train.values.ravel(),
                                     cv=5,
                                     verbose=4,
                                     n_jobs=-1)
    # Calculate metrics
    temp = pd.DataFrame(
        {
            'Model':
            m[0],
            'MSE':
            "{:.2f}".format(mean_squared_error(y_train, y_train_pred)),
            'RMSE':
            "{:.2f}".format(
                mean_squared_error(y_train, y_train_pred, squared=False)),
            'R2':
            "{:.2f}".format(r2_score(y_train, y_train_pred)),
            'MAE':
            "{:.2f}".format(mean_absolute_error(y_train, y_train_pred)),
            'MAPE': mean_absolute_percentage_error(y_train, y_train_pred),
            'MAPE median': median_absolute_percentage_error(y_train, y_train_pred)
        },
        index=[i])
    i += 1
    results = pd.concat([results, temp])
results

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    3.0s remaining:    4.6s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.2s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    0.3s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    0.3s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    0.3s remaining:    0.5s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_

| | Model | MSE | RMSE | R2 | MAE | MAPE | MAPE median |
|---|---|---|---|---|---|---|---|
| 0 | Baseline | 0.3 | 0.54 | -0.0 | 0.43 | 9.37526 | 7.952435 |
| 1 | LinReg | 0.12 | 0.34 | 0.6 | 0.26 | 5.820351 | 4.701508 |
| 2 | Passive Aggressive | 0.25 | 0.5 | 0.14 | 0.39 | 8.65667 | 7.027504 |
| 3 | ElasticNet | 0.3 | 0.54 | -0.0 | 0.43 | 9.37526 | 7.952435 |
| 4 | Stochastic Gradient Descent | 0.12 | 0.35 | 0.58 | 0.27 | 5.954404 | 4.772655 |
| 5 | Decision Tree | 0.17 | 0.41 | 0.42 | 0.32 | 7.17444 | 5.857215 |
| 6 | Random Forest | 0.11 | 0.33 | 0.63 | 0.26 | 5.637085 | 4.479338 |
| 7 | Gradient Boost | 0.12 | 0.34 | 0.61 | 0.26 | 5.805203 | 4.704215 |
| 8 | XGBoost | 0.12 | 0.34 | 0.61 | 0.26 | 5.808623 | 4.707482 |
| 9 | AdaBoost | 0.16 | 0.4 | 0.47 | 0.31 | 6.979245 | 5.704251 |
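
`mean_absolute_percentage_error` and `median_absolute_percentage_error` are called above but defined elsewhere in the notebook. A minimal sketch of what such helpers might look like, assuming they return percentages; the `_sketch` names and the toy values are hypothetical.

```python
from statistics import mean, median

def absolute_percentage_errors(y_true, y_pred):
    """Absolute error of each prediction as a percentage of the true value."""
    return [abs((t - p) / t) * 100 for t, p in zip(y_true, y_pred)]

def mape_mean_sketch(y_true, y_pred):
    return mean(absolute_percentage_errors(y_true, y_pred))

def mape_median_sketch(y_true, y_pred):
    return median(absolute_percentage_errors(y_true, y_pred))

# Toy log-price values (hypothetical); the last prediction is a clear outlier
y_true_toy = [4.0, 5.0, 4.0, 5.0]
y_pred_toy = [4.2, 4.5, 4.0, 3.0]
mape_mean_value = mape_mean_sketch(y_true_toy, y_pred_toy)
mape_median_value = mape_median_sketch(y_true_toy, y_pred_toy)
print(mape_mean_value, mape_median_value)
```

The median is far less sensitive to the single outlier than the mean. Note also that with `PRICE_LOG` as target these percentages are relative to the log value, not to the price itself.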


## Reg Model 1: XGBoost

In [92]:
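# Create pipeline to use in RandomizedSearchCV and GridSearchCV
# Note: bootstrap, max_features and scoring are not among the keys returned by
# XGBRegressor().get_params() in the next section, so they are unlikely to affect the fit
pipeline_xgb_reg = Pipeline([('preprocessor', preprocessor),
                             ('xgb_reg',
                              XGBRegressor(n_estimators=182,
                                           learning_rate=0.45,
                                           random_state=random_state,
                                           max_depth=5,
                                           gamma=0.3,
                                           bootstrap=True,
                                           max_features=21,
                                           scoring=scoring,
                                           n_jobs=-1))])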

### Hyperparameter Pre-Tuning with RandomizedSearchCV

In [93]:
# Display possible hyperparameters for XGBoost Regressor
test_xgb_reg = XGBRegressor()
test_xgb_reg.get_params().keys()

dict_keys(['base_score', 'booster', 'colsample_bylevel', 'colsample_bynode', 'colsample_bytree', 'gamma', 'importance_type', 'learning_rate', 'max_delta_step', 'max_depth', 'min_child_weight', 'missing', 'n_estimators', 'n_jobs', 'nthread', 'objective', 'random_state', 'reg_alpha', 'reg_lambda', 'scale_pos_weight', 'seed', 'silent', 'subsample', 'verbosity'])

**Default values for XGBRegressor** (as base for hyperparameter search):

max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective='reg:linear', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, importance_type='gain'

In [94]:
# Define hyperparameter distribution
param_distribs_xgb_reg = {
    'xgb_reg__n_estimators': randint(low=80, high=300),
    'xgb_reg__bootstrap': [True, False],
    'xgb_reg__gamma': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'xgb_reg__max_depth': randint(low=1, high=7),
    'xgb_reg__max_features': randint(low=1, high=40),
    'xgb_reg__learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
}
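
Before running a search, it can help to compare the keys of the search space with the parameter names the estimator reports. The sketch below does this in plain Python with the key list printed by `get_params()` above; the helper `unknown_search_keys` is hypothetical.

```python
# Parameter names as printed by XGBRegressor().get_params().keys() above
valid_keys = {
    'base_score', 'booster', 'colsample_bylevel', 'colsample_bynode',
    'colsample_bytree', 'gamma', 'importance_type', 'learning_rate',
    'max_delta_step', 'max_depth', 'min_child_weight', 'missing',
    'n_estimators', 'n_jobs', 'nthread', 'objective', 'random_state',
    'reg_alpha', 'reg_lambda', 'scale_pos_weight', 'seed', 'silent',
    'subsample', 'verbosity',
}

# Keys of the search space defined above, in the same order
search_keys = [
    'xgb_reg__n_estimators', 'xgb_reg__bootstrap', 'xgb_reg__gamma',
    'xgb_reg__max_depth', 'xgb_reg__max_features', 'xgb_reg__learning_rate',
]

def unknown_search_keys(search_keys, valid_keys, prefix='xgb_reg__'):
    """Return the search-space parameters the estimator does not list."""
    names = [key[len(prefix):] for key in search_keys]
    return [name for name in names if name not in valid_keys]

unknown = unknown_search_keys(search_keys, valid_keys)
print(unknown)
```

Parameters flagged this way probably do not change the model, so the search budget spent on them is wasted. In the listed keys, `subsample` and `colsample_bytree` look like the closest counterparts to row and column sampling.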

In [95]:
# Create and fit RandomizedSearchCV, save "best_model"
rnd_xgb_reg = RandomizedSearchCV(pipeline_xgb_reg,
                                 param_distribs_xgb_reg,
                                 cv=5,
                                 scoring=scoring,
                                 n_iter=10,
                                 return_train_score=True,
                                 verbose=4,
                                 n_jobs=-1,
                                 random_state=random_state)

best_model_rnd_xgb_reg = rnd_xgb_reg.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   56.7s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.9min finished




In [96]:
# Display best_score_, best_params_ and best_estimator_
print('Best score:\n{:.2f}'.format(rnd_xgb_reg.best_score_))
print("Best parameters:\n{}".format(rnd_xgb_reg.best_params_))
#print("Best estimator:\n{}".format(rnd_xgb_reg.best_estimator_))

Best score:
-0.20
Best parameters:
{'xgb_reg__bootstrap': False, 'xgb_reg__gamma': 0.2, 'xgb_reg__learning_rate': 0.5, 'xgb_reg__max_depth': 4, 'xgb_reg__max_features': 3, 'xgb_reg__n_estimators': 130}


### Hyperparameter Tuning with GridSearchCV

In [97]:
# Define hyperparameter grid
param_grid_xgb_reg = {
#    'xgb_reg__bootstrap': [True, False],
#    'xgb_reg__n_estimators': [190, 230, 290],
#    'xgb_reg__max_features': [40, 45],
    'xgb_reg__max_depth': [4, 5],
    'xgb_reg__learning_rate': [0.42, 0.45, 0.48]
}

In [98]:
# Create and fit GridSearchCV, save "best_model"
grid_xgb_reg = GridSearchCV(pipeline_xgb_reg,
                            param_grid_xgb_reg,
                            cv=5,
                            scoring=scoring,
                            return_train_score=True,
                            verbose=4,
                            n_jobs=-1)

grid_xgb_reg = grid_xgb_reg.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   36.0s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   51.8s finished




In [99]:
# Assign result to best model
best_model_xgb_reg = grid_xgb_reg.best_estimator_['xgb_reg']

In [100]:
# Display best_score_, best_params_ and best_estimator_
print('Best score:\n{:.2f}'.format(grid_xgb_reg.best_score_))
print("Best parameters:\n{}".format(grid_xgb_reg.best_params_))
#print("Best estimator:\n{}".format(grid_xgb_reg.best_estimator_))

Best score:
-0.20
Best parameters:
{'xgb_reg__learning_rate': 0.45, 'xgb_reg__max_depth': 5}


### Feature Importances

In [101]:
# Display feature importances
fi_xgb_reg = get_feat_importances(best_model_xgb_reg, column_names=column_names)
fi_xgb_reg

| Feature | weight |
|---|---|
| bedrooms | 0.188327 |
| guests_included_calc | 0.073738 |
| room_type_Private room | 0.058437 |
| am_tv | 0.054736 |
| bathrooms_log | 0.046783 |
| zipcode_zip_75003 | 0.038951 |
| zipcode_zip_75019 | 0.036275 |
| zipcode_zip_75020 | 0.034876 |
| zipcode_zip_75007 | 0.033526 |
| zipcode_zip_75004 | 0.033263 |


### Final Evaluation Best Model

In [102]:
# Load existing model
#best_model_xgb_reg = save_load(title="best_model_xgb_reg", function="load")
#load_best_cv = save_load(title="best_cv_xgb_reg", function="load")

**Learning Curves (Overfitting)**

**Training Set**

In [103]:
# Predict target with "best model"
y_train_pred_xgb_reg = best_model_xgb_reg.predict(X_train_prep)

In [104]:
# Final evaluation of "best model"
model_eval(y_train, y_train_pred_xgb_reg, model="reg")

MSE: 0.08
RMSE: 0.29
MAE: 0.22
R2: 0.72
MAPE: 4.87
MAPE median: 3.88


**Testing Set**

In [105]:
# Predict target with "best model"
y_test_pred_xgb_reg = best_model_xgb_reg.predict(X_test_prep)

In [106]:
# Final evaluation of "best model"
model_eval(y_test, y_test_pred_xgb_reg, model="reg")

MSE: 0.10
RMSE: 0.32
MAE: 0.24
R2: 0.64
MAPE: 5.38
MAPE median: 4.21
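
Because the target is the natural logarithm of the price, MSE, RMSE and MAE above are in log units. A rough way to read them is to exponentiate: an absolute error of `e` on the log scale corresponds to a multiplicative factor of `exp(e)` on the price. A small sketch, assuming the natural log (the notebook back-transforms with `math.exp`); the function name is hypothetical.

```python
import math

def log_error_to_factor(log_error):
    """Convert an absolute error on the natural-log scale to a price factor."""
    return math.exp(log_error)

test_mae_log = 0.24  # MAE on the test set, taken from the output above
factor = log_error_to_factor(test_mae_log)
over = 100 * (factor - 1)        # percentage when the prediction is too high
under = 100 * (1 - 1 / factor)   # percentage when the prediction is too low
print(f"factor x{factor:.2f}: about {over:.0f}% above or {under:.0f}% below")
```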


In [107]:
# Display confidence interval (scipy stats)
confidence = 0.95
squared_errors = (y_test_pred_xgb_reg - y_test)**2
np.sqrt(
    stats.t.interval(confidence,
                     len(squared_errors) - 1,
                     loc=squared_errors.mean(),
                     scale=stats.sem(squared_errors)))

array([0.31258574, 0.3273949 ])
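
The cell above builds a confidence interval around the mean squared error and takes the square root of both bounds, which gives an interval for the RMSE. The sketch below reproduces the idea with the standard library only. It swaps the Student t quantile for a normal quantile, an approximation that should be close for large test sets; the function name and the toy values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def rmse_confidence_interval(y_true, y_pred, confidence=0.95):
    """Approximate confidence interval for the RMSE (normal approximation)."""
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    n = len(squared_errors)
    centre = mean(squared_errors)
    sem = stdev(squared_errors) / math.sqrt(n)  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = max(centre - z * sem, 0.0)  # a mean squared error cannot be negative
    upper = centre + z * sem
    return math.sqrt(lower), math.sqrt(upper)

# Toy log-price values (hypothetical)
y_true_toy = [4.0, 4.5, 5.0, 4.2, 4.8, 4.1]
y_pred_toy = [4.2, 4.4, 4.7, 4.3, 5.0, 4.0]
ci_low, ci_high = rmse_confidence_interval(y_true_toy, y_pred_toy)
rmse_toy = math.sqrt(mean((p - t) ** 2 for t, p in zip(y_true_toy, y_pred_toy)))
print(ci_low, rmse_toy, ci_high)
```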

**Median Price Intervals**

In [108]:
# Save MAPE_median as variable
MAPE_median_xgb_reg = (median_absolute_percentage_error(y_test, y_test_pred_xgb_reg))/100

In [109]:
# Calculate price interval for MAPE median
y_pred_interval_xgb_reg = tuple([(round(math.exp(el-el*MAPE_median_xgb_reg),2),round(math.exp(el+el*MAPE_median_xgb_reg),2)) for el in y_test_pred_xgb_reg])
y_pred_interval_xgb_reg

((67.14, 97.16),
 (68.72, 99.64),
 (61.74, 88.69),
 (72.18, 105.11),
 (59.87, 85.77),
 (98.98, 148.19),
 (53.48, 75.85),
 (62.05, 89.17),
 (107.83, 162.66),
 (73.97, 107.96),
 (225.05, 362.15),
 (63.4, 91.28),
 (51.74, 73.18),
 (66.45, 96.07),
 (98.05, 146.67),
 (71.22, 103.58),
 (85.45, 126.3),
 (83.05, 122.44),
 (144.96, 224.43),
 (93.12, 138.67),
 (66.19, 95.65),
 (75.78, 110.83),
 (62.74, 90.24),
 (111.34, 168.43),
 (89.31, 132.5),
 (66.23, 95.72),
 (102.24, 153.51),
 (218.72, 351.08),
 (123.59, 188.68),
 (167.27, 262.24),
 (91.55, 136.14),
 (80.95, 119.07),
 (147.7, 229.04),
 (105.72, 159.21),
 (60.7, 87.06),
 (230.89, 372.38),
 (46.98, 65.88),
 (96.11, 143.53),
 (74.93, 109.47),
 (152.07, 236.43),
 (129.98, 199.31),
 (47.99, 67.42),
 (68.81, 99.78),
 (89.36, 132.58),
 (65.14, 94.01),
 (92.92, 138.35),
 (50.5, 71.26),
 (56.71, 80.85),
 (74.46, 108.73),
 (150.93, 234.5),
 (56.35, 80.29),
 (122.89, 187.52),
 (100.87, 151.26),
 (35.03, 47.87),
 (89.43, 132.71),
 (182.9, 289.01),
 (83
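
Each tuple above is obtained by shifting the log prediction by plus or minus its median percentage error and exponentiating the result. A minimal sketch of that computation for a single value; the function name and the example prediction of 80 are hypothetical, while 4.21 % is the test-set `MAPE median` reported above.

```python
import math

def price_interval(log_pred, rel_error):
    """Back-transform log_pred * (1 -/+ rel_error) from the log scale to a price."""
    lower = math.exp(log_pred * (1 - rel_error))
    upper = math.exp(log_pred * (1 + rel_error))
    return round(lower, 2), round(upper, 2)

log_pred = math.log(80.0)  # hypothetical prediction: a price of 80 on the log scale
low, high = price_interval(log_pred, rel_error=0.0421)
print(low, high)
```

Since the percentage is applied to the log value, the interval is asymmetric around the point prediction and its relative width grows with the predicted price.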

**Save Model and Params**

In [110]:
# Save best model and cv
save_load(best_model_xgb_reg, title="best_model_xgb_reg", function="save")
save_load(grid_xgb_reg, title="best_cv_xgb_reg", function="save")
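
`save_load` is a helper defined elsewhere in the notebook. For readers without that cell at hand, a hypothetical stand-in based on `pickle` could look like the sketch below. The file naming, the `folder` argument and the `.pkl` extension are assumptions and may differ from the notebook's own implementation.

```python
import pickle
import tempfile
from pathlib import Path

def save_load_sketch(obj=None, title="model", function="save", folder="."):
    """Pickle obj to <folder>/<title>.pkl or load it back (hypothetical helper)."""
    path = Path(folder) / f"{title}.pkl"
    if function == "save":
        with path.open("wb") as fh:
            pickle.dump(obj, fh)
        return path
    if function == "load":
        with path.open("rb") as fh:
            return pickle.load(fh)
    raise ValueError("function must be 'save' or 'load'")

# Round trip with a stand-in object instead of a fitted estimator
with tempfile.TemporaryDirectory() as tmp:
    params = {"learning_rate": 0.45, "max_depth": 5}
    save_load_sketch(params, title="best_params_demo", function="save", folder=tmp)
    restored = save_load_sketch(title="best_params_demo", function="load", folder=tmp)
print(restored)
```

Pickled estimators are tied to the library versions that created them, so it is worth recording those versions next to the saved files.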

## Reg Model 2: Support Vector Machines

In [111]:
# Create pipeline to use in RandomizedSearchCV and GridSearchCV
pipeline_svm_reg = Pipeline([('preprocessor', preprocessor),
                             ('svm_reg',
                              SVR(kernel='rbf',
                                  C=50,
                                  degree=4,
                                  gamma=0.005,
                                  epsilon=0.3))])

### Hyperparameter Pre-Tuning with RandomizedSearchCV

In [112]:
# Display possible hyperparameters for Support Vector Machine
test_svr_reg = SVR()
test_svr_reg.get_params().keys()

dict_keys(['C', 'cache_size', 'coef0', 'degree', 'epsilon', 'gamma', 'kernel', 'max_iter', 'shrinking', 'tol', 'verbose'])

**Default values for Support Vector Machine** (as base for hyperparameter search):

kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1

In [113]:
# Define hyperparameter distribution
param_distribs_svm_reg = {
#    'svm_reg__kernel': ['linear', 'poly', 'rbf'],
    'svm_reg__C': [0.1, 0.5, 0.8, 1, 1.5, 2, 3, 5, 10, 50, 100],        # initial: [0.1, 0.5, 1, 2, 5, 10, 50, 100, 500, 1000]
    'svm_reg__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 1],   # initial: [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 1]
    'svm_reg__epsilon': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9],            # initial: [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9]
    'svm_reg__degree': randint(low=1, high=5)   # only used by the 'poly' kernel; ignored with kernel='rbf'
}

In [114]:
# Create and fit RandomizedSearchCV, save "best_model"
rnd_svm_reg = RandomizedSearchCV(pipeline_svm_reg,
                                 param_distribs_svm_reg,
                                 cv=5,
                                 scoring=scoring,
                                 n_iter=10,
                                 return_train_score=True,
                                 verbose=4,
                                 n_jobs=-1,
                                 random_state=random_state)

best_model_rnd_svm_reg = rnd_svm_reg.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  9.2min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 24.3min finished


In [115]:
# Display best_score_, best_params_ and best_estimator_
print('Best score:\n{:.2f}'.format(rnd_svm_reg.best_score_))
print("Best parameters:\n{}".format(rnd_svm_reg.best_params_))
#print("Best estimator:\n{}".format(rnd_svm_reg.best_estimator_))

Best score:
-0.21
Best parameters:
{'svm_reg__C': 0.1, 'svm_reg__degree': 4, 'svm_reg__epsilon': 0.1, 'svm_reg__gamma': 0.05}


### Hyperparameter Tuning with GridSearchCV

In [116]:
# Define hyperparameter grid
param_grid_svm_reg = {
#    'svm_reg__kernel': ['linear', 'poly', 'rbf'],
    'svm_reg__gamma': [0.003, 0.005, 0.007],
    'svm_reg__C': [40, 50, 60],
    'svm_reg__degree': [3, 4, 5]
}
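
scikit-learn documents `degree` as a parameter of the polynomial kernel only. With `kernel='rbf'`, as set in the pipeline, varying it should not change the model, so it only multiplies the number of fits. The sketch below counts the candidates with and without `degree`; the helper `count_fits` is hypothetical.

```python
from itertools import product

# Same values as the grid above, without the pipeline prefix
grid = {"gamma": [0.003, 0.005, 0.007], "C": [40, 50, 60], "degree": [3, 4, 5]}

def count_fits(grid, cv=5):
    """Return (number of candidates, number of fits) for an exhaustive grid."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates, n_candidates * cv

full = count_fits(grid)
without_degree = count_fits({k: v for k, v in grid.items() if k != "degree"})
print(full, without_degree)
```

Given that the random search above already needed about 24 minutes for 50 fits, dropping `degree` while the kernel stays `rbf` would cut the grid search to a third of its size.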

In [None]:
# Create and fit GridSearchCV, save "best_model"
grid_svm_reg = GridSearchCV(pipeline_svm_reg,
                            param_grid_svm_reg,
                            cv=5,
                            scoring=scoring,
                            return_train_score=True,
                            verbose=4,
                            n_jobs=-1)

grid_svm_reg = grid_svm_reg.fit(X_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  4.2min


In [None]:
# Assign result to best model
best_model_svm_reg = grid_svm_reg.best_estimator_["svm_reg"]

In [None]:
# Display best_score_, best_params_ and best_estimator_
print('Best score:\n{:.2f}'.format(grid_svm_reg.best_score_))
print("Best parameters:\n{}".format(grid_svm_reg.best_params_))
#print("Best estimator:\n{}".format(grid_svm_reg.best_estimator_))

### Feature Importances

In [None]:
# Display feature importances
#fi_svm_reg = get_feat_importances(best_model_svm_reg, column_names=column_names)
#fi_svm_reg

### Final Evaluation Best Model

In [None]:
# Load existing model
#load_best_model = load_model(title="best_model_svm_reg_01", dataset_loc=dataset_loc, dataset_date=dataset_date)
#load_best_cv = load_model(title="best_cv_svm_reg_01", dataset_loc=dataset_loc, dataset_date=dataset_date)

**Learning Curves (Overfitting)**

**Training Set**

In [None]:
# Predict target with "best model"
y_train_pred_svm_reg = best_model_svm_reg.predict(X_train_prep)

In [None]:
# Final evaluation of "best model"
model_eval(y_train, y_train_pred_svm_reg, model="reg")

**Testing Set**

In [None]:
# Predict target with "best model"
y_test_pred_svm_reg = best_model_svm_reg.predict(X_test_prep)

In [None]:
# Final evaluation of "best model"
model_eval(y_test, y_test_pred_svm_reg, model="reg")

In [None]:
# Display confidence interval (scipy stats)
confidence = 0.95
squared_errors = (y_test_pred_svm_reg - y_test)**2
np.sqrt(
    stats.t.interval(confidence,
                     len(squared_errors) - 1,
                     loc=squared_errors.mean(),
                     scale=stats.sem(squared_errors)))

**Median Price Intervals**

In [None]:
# Save MAPE_median as variable
MAPE_median_svm_reg = (median_absolute_percentage_error(y_test, y_test_pred_svm_reg))/100

In [None]:
# Calculate price interval for MAPE median
y_pred_interval_svm_reg = tuple([(round(math.exp(el-el*MAPE_median_svm_reg),2),round(math.exp(el+el*MAPE_median_svm_reg),2)) for el in y_test_pred_svm_reg])
y_pred_interval_svm_reg

**Save Model and Params**

In [None]:
# Save best model and cv
save_load(best_model_svm_reg, title="best_model_svm_reg", function="save")
save_load(grid_svm_reg, title="best_cv_svm_reg", function="save")

## Reg Model 3: Random Forest

In [None]:
# Create pipeline to use in RandomizedSearchCV and GridSearchCV
pipeline_rf_reg = Pipeline([('preprocessor', preprocessor),
                            ('rf_reg',
                             RandomForestRegressor(n_estimators=1500,
                                                   max_features='sqrt',
                                                   random_state=random_state,
                                                   max_depth=4,
                                                   min_samples_split=10,
                                                   min_samples_leaf=1,
                                                   bootstrap=False,
                                                   n_jobs=-1))])

### Hyperparameter Pre-Tuning with RandomizedSearchCV

In [None]:
# Display possible hyperparameters for Random Forest Regressor
test_rf_reg = RandomForestRegressor()
test_rf_reg.get_params().keys()

**Default values for Random Forest Regressor** (as base for hyperparameter search):

n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None

In [None]:
# Define hyperparameter distribution
param_distribs_rf_reg = {
    'rf_reg__n_estimators': [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
    'rf_reg__max_features': ['auto', 'sqrt'],
    'rf_reg__max_depth': [None, 1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 75, 100],
    'rf_reg__min_samples_split': [2, 5, 10],
    'rf_reg__min_samples_leaf': [1, 2, 4],
    'rf_reg__bootstrap': [True, False]
}

In [None]:
# Create and fit RandomizedSearchCV, save "best_model"
rnd_rf_reg = RandomizedSearchCV(pipeline_rf_reg,
                                 param_distribs_rf_reg,
                                 cv=4,
                                 scoring=scoring,
                                 n_iter=8,
                                 return_train_score=True,
                                 verbose=4,
                                 n_jobs=-1,
                                 random_state=random_state)

rnd_rf_reg = rnd_rf_reg.fit(X_train, y_train)

In [None]:
# Assign result to best model
best_model_rnd_rf_reg = rnd_rf_reg.best_estimator_['rf_reg']

In [None]:
# Display best_score_, best_params_ and best_estimator_
print('Best score:\n{:.2f}'.format(rnd_rf_reg.best_score_))
print("Best parameters:\n{}".format(rnd_rf_reg.best_params_))
#print("Best estimator:\n{}".format(rnd_rf_reg.best_estimator_))

### Hyperparameter Tuning with GridSearchCV

In [None]:
# Define hyperparameter grid
#param_grid_rf_reg = {
#    'rf_reg__n_estimators': [1200, 2000],
#    'rf_reg__max_features': ['auto', 'sqrt'],
#    'rf_reg__max_depth': [10, 15],
#    'rf_reg__min_samples_split': [6, 10],
#    'rf_reg__min_samples_leaf': [1, 2],
#    'rf_reg__bootstrap': [True, False]
#}

In [None]:
# Create and fit GridSearchCV, save "best_model"
#grid_rf_reg = GridSearchCV(pipeline_rf_reg,
#                            param_grid_rf_reg,
#                            cv=5,
#                            scoring=scoring,
#                            return_train_score=True,
#                            verbose=4,
#                            n_jobs=-1)

#grid_rf_reg = grid_rf_reg.fit(X_train, y_train)

In [None]:
# Assign result to best model
#best_model_rf_reg = grid_rf_reg.best_estimator_['rf_reg']

In [None]:
# Display best_score_, best_params_ and best_estimator_
#print('Best score:\n{:.2f}'.format(grid_rf_reg.best_score_))
#print("Best parameters:\n{}".format(grid_rf_reg.best_params_))
#print("Best estimator:\n{}".format(grid_rf_reg.best_estimator_))

### Feature Importances

In [None]:
# Display feature importances
fi_rf_reg = get_feat_importances(best_model_rnd_rf_reg, column_names=column_names)
fi_rf_reg

### Final Evaluation Best Model

In [None]:
# Load existing model
#load_best_model = save_load(title="best_model_rf_reg", function="load")
#load_best_cv = save_load(title="best_cv_rf_reg", function="load")

**Learning Curves (Overfitting)**

**Training Set**

In [None]:
# Predict target with "best model"
y_train_pred_rf_reg = best_model_rnd_rf_reg.predict(X_train_prep)

In [None]:
# Final evaluation of "best model"
model_eval(y_train, y_train_pred_rf_reg, model="reg")

**Testing Set**

In [None]:
# Predict target with "best model"
y_test_pred_rf_reg = best_model_rnd_rf_reg.predict(X_test_prep)

In [None]:
# Final evaluation of "best model"
model_eval(y_test, y_test_pred_rf_reg, model="reg")

In [None]:
# Display confidence interval (scipy stats)
confidence = 0.95
squared_errors = (y_test_pred_rf_reg - y_test)**2
np.sqrt(
    stats.t.interval(confidence,
                     len(squared_errors) - 1,
                     loc=squared_errors.mean(),
                     scale=stats.sem(squared_errors)))

**Median Price Intervals**

In [None]:
# Save MAPE_median as variable
MAPE_median_rf_reg = (median_absolute_percentage_error(y_test, y_test_pred_rf_reg))/100

In [None]:
# Calculate price interval for MAPE median
y_pred_interval_rf_reg = tuple([(round(math.exp(el-el*MAPE_median_rf_reg),2),round(math.exp(el+el*MAPE_median_rf_reg),2)) for el in y_test_pred_rf_reg])
y_pred_interval_rf_reg

**Save Model and Params**

In [None]:
# Save best model and cv
save_load(best_model_rnd_rf_reg, title="best_model_rf_reg", function="save")
save_load(rnd_rf_reg, title="best_cv_rf_reg", function="save")

## Reg Model 4: CatBoost

In [None]:
# Create pipeline to use in RandomizedSearchCV and GridSearchCV
pipeline_cat_reg = Pipeline([('preprocessor', preprocessor),
                             ('cat_reg',
                              CatBoostRegressor(n_estimators=150,
                                                learning_rate=0.3,
                                                l2_leaf_reg=4,
                                                loss_function="RMSE",
                                                random_state=random_state,
                                                depth=4))])

### Hyperparameter Pre-Tuning with RandomizedSearchCV

In [None]:
# Display possible hyperparameters for CatBoost Regressor
test_cat_reg = CatBoostRegressor()
test_cat_reg.get_params().keys()

**Default values for CatBoostRegressor** (as base for hyperparameter search):

iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None, model_size_reg=None, rsm=None, loss_function='RMSE', border_count=None, feature_border_type=None, per_float_feature_quantization=None, input_borders=None, output_borders=None, fold_permutation_block=None, od_pval=None, od_wait=None, od_type=None, nan_mode=None, counter_calc_method=None, leaf_estimation_iterations=None, leaf_estimation_method=None, thread_count=None, random_seed=None, use_best_model=None, best_model_min_trees=None, verbose=None, silent=None, logging_level=None, metric_period=None, ctr_leaf_count_limit=None, store_all_simple_ctr=None, max_ctr_complexity=None, has_time=None, allow_const_label=None, target_border=None, one_hot_max_size=None, random_strength=None, name=None, ignored_features=None, train_dir=None, custom_metric=None, eval_metric=None, bagging_temperature=None, save_snapshot=None, snapshot_file=None, snapshot_interval=None, fold_len_multiplier=None, used_ram_limit=None, gpu_ram_part=None, pinned_memory_size=None, allow_writing_files=None, final_ctr_computation_mode=None, approx_on_full_history=None, boosting_type=None, simple_ctr=None, combinations_ctr=None, per_feature_ctr=None, ctr_description=None, ctr_target_border_count=None, task_type=None, device_config=None, devices=None, bootstrap_type=None, subsample=None, mvs_reg=None, sampling_frequency=None, sampling_unit=None, dev_score_calc_obj_block_size=None, dev_efb_max_buckets=None, sparse_features_conflict_fraction=None, max_depth=None, n_estimators=None, num_boost_round=None, num_trees=None, colsample_bylevel=None, random_state=None, reg_lambda=None, objective=None, eta=None, max_bin=None, gpu_cat_features_storage=None, data_partition=None, metadata=None, early_stopping_rounds=None, cat_features=None, grow_policy=None, min_data_in_leaf=None, min_child_samples=None, max_leaves=None, num_leaves=None, score_function=None, leaf_estimation_backtracking=None, ctr_history_unit=None, monotone_constraints=None, 
feature_weights=None, penalties_coefficient=None, first_feature_use_penalties=None, per_object_feature_penalties=None, model_shrink_rate=None, model_shrink_mode=None, langevin=None, diffusion_temperature=None, boost_from_average=None

In [None]:
# Define hyperparameter distribution
param_distribs_cat_reg = {
    'cat_reg__n_estimators': randint(low=130, high=180),    # initial: randint(low=10, high=200)
    'cat_reg__l2_leaf_reg': randint(low=2, high=11),       # initial: randint(low=1, high=15)
    'cat_reg__depth': randint(low=4, high=6),             # initial: randint(low=1, high=15)
    'cat_reg__learning_rate': [0.15, 0.18, 0.2, 0.22, 0.25, 0.27, 0.3] # initial: [0.01, 0.02, 0.05, 0.1, 0.2, 0.3]
}
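
Assuming `randint` here is `scipy.stats.randint`, the upper bound is exclusive: `randint(low=4, high=6)` only yields 4 and 5. The sketch below illustrates the same half-open convention with `random.randrange` from the standard library as a stand-in; the helper name is hypothetical.

```python
import random

def sample_half_open(low, high, n_samples=1000, seed=0):
    """Collect the distinct integers drawn from the half-open range [low, high)."""
    rng = random.Random(seed)
    return {rng.randrange(low, high) for _ in range(n_samples)}

depth_values = sample_half_open(4, 6)
l2_values = sample_half_open(2, 11)
print(sorted(depth_values), min(l2_values), max(l2_values))
```

Under that assumption the search above covers `depth` in {4, 5}, `l2_leaf_reg` from 2 to 10 and `n_estimators` from 130 to 179. To include the upper value, `high` has to be raised by one.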

In [None]:
# Create and fit RandomizedSearchCV, save "best_model"
rnd_cat_reg = RandomizedSearchCV(pipeline_cat_reg,
                                 param_distribs_cat_reg,
                                 cv=5,
                                 scoring=scoring,
                                 n_iter=15,
                                 return_train_score=True,
                                 verbose=4,
                                 n_jobs=-1,
                                 random_state=random_state)

best_model_rnd_cat_reg = rnd_cat_reg.fit(X_train, y_train)

In [None]:
# Display best_score_, best_params_ and best_estimator_
print('Best score:\n{:.2f}'.format(rnd_cat_reg.best_score_))
print("Best parameters:\n{}".format(rnd_cat_reg.best_params_))
#print("Best estimator:\n{}".format(rnd_cat_reg.best_estimator_))

### Hyperparameter Tuning with GridSearchCV

In [None]:
# Define hyperparameter grid
param_grid_cat_reg = {
#    'cat_reg__n_estimators': [150, 155],
#    'cat_reg__l2_leaf_reg': [3, 4],
    'cat_reg__depth': [4, 5],
#    'cat_reg__learning_rate': [0.15, 0.2, 0.25]
}

In [None]:
# Create and fit GridSearchCV, save "best_model"
grid_cat_reg = GridSearchCV(pipeline_cat_reg,
                            param_grid_cat_reg,
                            cv=5,
                            scoring=scoring,
                            return_train_score=True,
                            verbose=4,
                            n_jobs=-1)

grid_cat_reg = grid_cat_reg.fit(X_train, y_train)

In [None]:
# Assign result to best model
best_model_cat_reg = grid_cat_reg.best_estimator_['cat_reg']

In [None]:
# Display best_score_, best_params_ and best_estimator_
print('Best score:\n{:.2f}'.format(grid_cat_reg.best_score_))
print("Best parameters:\n{}".format(grid_cat_reg.best_params_))
#print("Best estimator:\n{}".format(grid_cat_reg.best_estimator_))

### Feature Importances

In [None]:
# Display feature importances
fi_cat_reg = get_feat_importances(best_model_cat_reg, column_names=column_names)
fi_cat_reg

### Final Evaluation Best Model

In [None]:
# Load existing model
#load_best_model = save_load(title="best_model_cat_reg", function="load")
#load_best_cv = save_load(title="best_cv_cat_reg", function="load")

**Learning Curves (Overfitting)**

**Training Set**

In [None]:
# Predict target with "best model"
y_train_pred_cat_reg = best_model_cat_reg.predict(X_train_prep)

In [None]:
# Final evaluation of "best model"
model_eval(y_train, y_train_pred_cat_reg, model="reg")

**Testing Set**

In [None]:
# Predict target with "best model"
y_test_pred_cat_reg = best_model_cat_reg.predict(X_test_prep)

In [None]:
# Final evaluation of "best model"
model_eval(y_test, y_test_pred_cat_reg, model="reg")

In [None]:
# Display confidence interval (scipy stats)
confidence = 0.95
squared_errors = (y_test_pred_cat_reg - y_test)**2
np.sqrt(
    stats.t.interval(confidence,
                     len(squared_errors) - 1,
                     loc=squared_errors.mean(),
                     scale=stats.sem(squared_errors)))
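The cell above builds a 95% t-interval around the mean squared error and takes the square root of both bounds, which yields an approximate confidence interval for the RMSE. The sketch below reproduces that calculation on synthetic data so it can be run on its own; the arrays `y_true` and `y_pred` are made up for illustration and do not come from the Airbnb dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for log prices and their predictions (assumption)
rng = np.random.default_rng(42)
y_true = rng.normal(loc=4.0, scale=0.5, size=500)
y_pred = y_true + rng.normal(scale=0.3, size=500)

confidence = 0.95
squared_errors = (y_pred - y_true) ** 2

# Point estimate of the RMSE
rmse = np.sqrt(squared_errors.mean())

# t-interval for the mean squared error, then square root of both bounds
lower, upper = np.sqrt(
    stats.t.interval(confidence,
                     len(squared_errors) - 1,
                     loc=squared_errors.mean(),
                     scale=stats.sem(squared_errors)))

print(f"RMSE: {rmse:.3f}, 95% interval: [{lower:.3f}, {upper:.3f}]")
```

Because the interval is centered on the mean of the squared errors, the RMSE point estimate always lies inside it.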

**Median Price Intervals**

In [None]:
# Save MAPE_median as variable
MAPE_median_cat_reg = (median_absolute_percentage_error(y_test, y_test_pred_cat_reg))/100

In [None]:
# Calculate price interval for MAPE median
y_pred_interval_cat_reg = tuple(
    (round(math.exp(el - el * MAPE_median_cat_reg), 2),
     round(math.exp(el + el * MAPE_median_cat_reg), 2))
    for el in y_test_pred_cat_reg)
y_pred_interval_cat_reg
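The interval logic above can be sketched in isolation as follows. The notebook's `median_absolute_percentage_error` helper is defined elsewhere, so the `median_ape` function here is only an assumption about its behavior (returning a fraction instead of a percentage), and `price_interval` is a hypothetical name for the per-prediction step. The sample values are invented.

```python
import math
from statistics import median


def median_ape(y_true, y_pred):
    """Median absolute percentage error as a fraction (0.05 means 5%)."""
    return median(abs((t - p) / t) for t, p in zip(y_true, y_pred))


def price_interval(log_pred, mape):
    """Widen a log-price prediction by +/- mape, then convert to prices."""
    lower = math.exp(log_pred - log_pred * mape)
    upper = math.exp(log_pred + log_pred * mape)
    return round(lower, 2), round(upper, 2)


# Invented log prices and predictions, for illustration only
y_true_log = [4.0, 4.5, 3.8]
y_pred_log = [4.2, 4.4, 3.8]

mape = median_ape(y_true_log, y_pred_log)
intervals = [price_interval(p, mape) for p in y_pred_log]
print(intervals)
```

Since the margin is applied on the log scale and then exponentiated, the resulting price interval is not symmetric around the point prediction, and it widens as the predicted log price grows.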

**Save Model and Params**

In [None]:
# Save best model and cv
save_load(best_model_cat_reg, title="best_model_cat_reg", function="save")
save_load(grid_cat_reg, title="best_cv_cat_reg", function="save")

## NN Model 1: Neural Networks

In [None]:
# Build the model
model_nn_seq = models.Sequential()

# First hidden layer (also declares the input shape)
model_nn_seq.add(
    layers.Dense(128,
                 input_shape=(X_train_prep.shape[1], ),
                 kernel_regularizer=regularizers.l1(0.005),
                 activation='relu'))

# Hidden Layers
model_nn_seq.add(
    layers.Dense(256,
                 kernel_regularizer=regularizers.l1(0.005),
                 activation='relu'))
model_nn_seq.add(
    layers.Dense(256,
                 kernel_regularizer=regularizers.l1(0.005),
                 activation='relu'))
model_nn_seq.add(
    layers.Dense(512,
                 kernel_regularizer=regularizers.l1(0.005),
                 activation='relu'))

# Output Layer
model_nn_seq.add(layers.Dense(1, activation='linear'))

# Compile the model
model_nn_seq.compile(loss='mean_absolute_percentage_error',
                     optimizer='adam',
                     metrics=['mean_absolute_percentage_error'])

# Model summary (summary() prints directly and returns None)
model_nn_seq.summary()

# Visualize the neural network
#SVG(model_to_dot(model_nn_seq, show_layer_names=False, show_shapes=True).create(prog='dot', format='svg'))

In [None]:
# Create pipeline with preprocessing
#pipeline_nn_reg = Pipeline([('preprocessor', preprocessor),
#                             ('nn_reg',
#                              KerasRegressor(build_fn=model_nn_seq, epochs=20, batch_size=250))])

In [None]:
# Define hyperparameter grid
#param_grid_nn_reg = {
#    'xgb_reg__bootstrap': [True, False],
#    'nn_reg__epochs': [10, 20]
#}

In [None]:
# Create and fit GridSearchCV, save "best_model"
#grid_nn_reg = GridSearchCV(pipeline_nn_reg,
#                            param_grid_nn_reg,
#                            cv=5,
#                            scoring=scoring,
#                            return_train_score=True,
#                            verbose=4,
#                            n_jobs=-1)

#grid_nn_reg = grid_nn_reg.fit(X_train, y_train)

In [None]:
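# Train the model (fit() returns a History object, not the model)
#model_nn_seq_start = time.time()

history_nn_reg = model_nn_seq.fit(X_train_prep,
                                  y_train,
                                  epochs=20,
                                  batch_size=256,
                                  validation_split=0.2)

# Keep a reference to the trained model itself so it can be saved later
best_model_nn_reg = model_nn_seq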

#model_nn_seq_end = time.time()

#print(f"Time taken to run: {round((model_nn_seq_end - model_nn_seq_start)/60,1)} minutes")

**Training Set**

In [None]:
# Predict target with "best model"
y_train_pred_nn_reg = model_nn_seq.predict(X_train_prep).ravel()

In [None]:
# Final evaluation of "best model"
model_eval(y_train, y_train_pred_nn_reg, model="reg")

**Testing Set**

In [None]:
# Predict target with "best model"
y_test_pred_nn_reg = model_nn_seq.predict(X_test_prep).ravel()

In [None]:
# Final evaluation of "best model"
model_eval(y_test, y_test_pred_nn_reg, model="reg")

In [None]:
# Display confidence interval (scipy stats)
confidence = 0.95
squared_errors = (y_test_pred_nn_reg - y_test)**2
np.sqrt(
    stats.t.interval(confidence,
                     len(squared_errors) - 1,
                     loc=squared_errors.mean(),
                     scale=stats.sem(squared_errors)))

**Median Price Intervals**

In [None]:
# Save MAPE_median as variable
MAPE_median_nn_reg = (median_absolute_percentage_error(y_test, y_test_pred_nn_reg))/100

In [None]:
# Calculate price interval for MAPE median
y_pred_interval_nn_reg = tuple(
    (round(math.exp(el - el * MAPE_median_nn_reg), 2),
     round(math.exp(el + el * MAPE_median_nn_reg), 2))
    for el in y_test_pred_nn_reg)
y_pred_interval_nn_reg

In [None]:
# Evaluate the model
#model_nn_seq.model_evaluation(model_nn_seq, skip_epochs=2, X_train=X_train, X_test=X_test)

#score_nn_seq = model_nn_seq.evaluate(X_train_prep, y_train,verbose=1)
#print(score_nn_seq)

**Save Model and Params**

In [None]:
# Save best model
save_load(best_model_nn_reg, title="best_model_nn_reg", function="save")

## Final Evaluation with Testing Set

In [None]:
#best_models_reg = [best_model_xgb_reg, best_model_svm_reg]

In [None]:
# Transform X_test for final evaluation
#X_test_prep = preprocessor.transform(X_test)

In [None]:
# Predict target with "best model"
#y_pred_rf_reg = best_model_rf_reg.predict(X_test_prep)

In [None]:
# Final evaluation of "best model"
#print("MSE: {:.2f}".format(mean_squared_error(y_test, y_pred_rf_reg))),
#print("RMSE: {:.2f}".format(mean_squared_error(y_test, y_pred_rf_reg, squared=False))),
#print("MAE: {:.2f}".format(mean_absolute_error(y_test, y_pred_rf_reg))),
#print("R2: {:.2f}".format(r2_score(y_test, y_pred_rf_reg))),

In [None]:
# Illustrate best model
#fig, axes = plt.subplots(1, 2, figsize = (14, 6))
#axes = axes.flatten()

#y_pred = best_model.predict(X_test_prep)
#axes[0].scatter(y_test, y_pred)
#axes[0].set_xlabel('y_test')
#axes[0].set_ylabel('y_pred')

#coef = best_model.best_estimator_.named_steps['xgb'].coef_
#mean_coef = np.mean(coef)
#axes[1].plot(coef, 'o')
#axes[1].set_xlabel('coefficient index')
#axes[1].set_ylabel('coefficient size')
#axes[1].axhline(y = mean_coef, color = 'red', linestyle = '--', alpha = 0.5)
#plt.show()

In [None]:
# Display confidence interval (scipy stats)
#confidence = 0.95
#squared_errors = (y_pred_rf_reg - y_test) ** 2
#np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
#                         loc=squared_errors.mean(),
#                         scale=stats.sem(squared_errors)))

# Model Selection for Web App

**Best model**

In [None]:
# Load or assign best model
#APP_best_model = save_load(title="best_model_xgb_reg", function="load", dataset_loc=dataset_loc, dataset_date=dataset_date, model_run=model_run)
APP_best_model = best_model_xgb_reg

In [None]:
# Save best model and cv
save_load(APP_best_model, title="APP_best_model", file_format="app", function="save")

**Preprocessor**

In [None]:
# Load or assign preprocessor
#APP_preprocessor = save_load(title="preprocessor", function="load", dataset_loc=dataset_loc, dataset_date=dataset_date, model_run=model_run)
APP_preprocessor = preprocessor

In [None]:
# Save preprocessor
save_load(APP_preprocessor, title="APP_preprocessor", file_format="app", function="save")

**X_test**

In [None]:
# Load or assign X_test
#APP_X_test = save_load(title="X_test", function="load", dataset_loc=dataset_loc, dataset_date=dataset_date, model_run=model_run)
APP_X_test = X_test

In [None]:
# Save X_test
save_load(APP_X_test, title="APP_X_test", file_format="app", function="save")

**MAPE_median**

In [None]:
# Load or assign MAPE_median
#APP_MAPE_median = save_load(title="MAPE_median_xgb_reg", function="load", dataset_loc=dataset_loc, dataset_date=dataset_date, model_run=model_run)
APP_MAPE_median = MAPE_median_xgb_reg

In [None]:
# Save MAPE_median
save_load(APP_MAPE_median, title="APP_MAPE_median", file_format="app", function="save")

# Future Work

**Predictive modeling**
- Apply further models and adapt current ones (e.g. NN)
- Examine other prediction targets (e.g. occupancy rate)

**Feature engineering**
- Explore NLP for text fields (descriptions, reviews, ...)
- Scrape listing photos and analyze quality
- Enhance current feature set

**Lean structure**
- Remove remaining redundancies wherever possible (e.g. pack repeated steps into functions, apply more pipelines, ...)

**Cloud**
- Move both model creation and app into the cloud (GCP)

**Automation**
- Build a workflow to automatically retrain the model monthly with new datasets
- Use automated outlier detection
- Use automated feature engineering

**Replicability**
- Apply analysis to other cities and compare results