# In-depth analysis: Prediction of booking scores 

Now, using all the available features, we will try to infer relations between them that can be used to predict booking scores.

In [32]:
import gzip
import json
import csv
import pandas as pd
import numpy as np
from scipy import stats
from functools import reduce
import seaborn as sns
import matplotlib.pyplot as plt

In [33]:
from sklearn.preprocessing import LabelEncoder

In [34]:
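# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines='skip'` (pandas >= 1.3) skips malformed rows instead
listing = pd.read_csv('../Data/raw/listings.csv.gz', 
                      compression='gzip',
                      on_bad_lines='skip', 
                      low_memory=False)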

In [35]:
listing.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       ...
       'instant_bookable', 'is_business_travel_ready', 'cancellation_policy',
       'require_guest_profile_picture', 'require_guest_phone_verification',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype='object', length=106)

The Seaborn package offers the function `pairplot`, which creates scatterplots for every pair of variables in its input. If you pass it your dataset, you get the visual relation between all the variables and can start your analysis quickly. As we saw before, this dataset contains 106 columns in different formats, and many of them are neither categorical nor numerical. Thus, we carefully select the columns that make sense for the next analysis, using the previous analysis as a reference.

In this section we use `df_dataset`, which keeps only a subset of the columns:

In [36]:
list_columns = ['id',
                'host_id',
                'host_since', 
                'host_response_time', 
                'host_response_rate', 
                'host_is_superhost',
                'neighbourhood_cleansed', 
                'room_type',
                'property_type', 
                'accommodates', 
                'bathrooms', 
                'bedrooms', 
                'beds', 
                'bed_type', 
                'amenities', 
                'price', 
                'extra_people', 
                'minimum_nights',
                'maximum_nights',
                'calendar_updated',
                'has_availability',
                'availability_30',
                'availability_60',
                'availability_90',
                'availability_365',
                'number_of_reviews',
                'number_of_reviews_ltm',
                'first_review',
                'last_review',
                'review_scores_rating',
                'review_scores_accuracy',
                'review_scores_cleanliness',
                'review_scores_checkin',
                'review_scores_communication',
                'review_scores_location',
                'review_scores_value',
                'calculated_host_listings_count',
                'calculated_host_listings_count_entire_homes',
                'calculated_host_listings_count_private_rooms',
                'calculated_host_listings_count_shared_rooms',
                'reviews_per_month']

In [37]:
# Work on an explicit copy so later column assignments do not touch `listing`
df_dataset = listing.loc[:, list_columns].copy()

In [38]:
df_dataset.head()

Unnamed: 0,id,host_id,host_since,host_response_time,host_response_rate,host_is_superhost,neighbourhood_cleansed,room_type,property_type,accommodates,...,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2818,3159,2008-09-24,within an hour,100%,t,Oostelijk Havengebied - Indische Buurt,Private room,Apartment,2,...,10.0,10.0,10.0,9.0,10.0,1,0,1,0,2.13
1,20168,59484,2009-12-02,within an hour,100%,f,Centrum-Oost,Private room,Townhouse,2,...,10.0,10.0,10.0,10.0,9.0,2,0,2,0,2.57
2,25428,56142,2009-11-20,within an hour,100%,f,Centrum-West,Entire home/apt,Apartment,3,...,10.0,10.0,10.0,10.0,10.0,2,2,0,0,0.13
3,27886,97647,2010-03-23,within an hour,100%,t,Centrum-West,Private room,Houseboat,2,...,10.0,10.0,10.0,10.0,10.0,1,0,1,0,2.14
4,28871,124245,2010-05-13,within an hour,100%,t,Centrum-West,Private room,Apartment,2,...,10.0,10.0,10.0,10.0,10.0,3,0,3,0,2.81


Transforming the non-numerical columns `host_response_rate`, `host_is_superhost`, `price` and `extra_people` to numerical:

In [39]:
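def str_rate2int(rate):
    """Convert a percentage string such as '100%' to a float; leave non-string values (e.g. NaN) unchanged."""
    if isinstance(rate, str):
        return float(rate.replace("%", ""))
    return rate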

In [40]:
df_dataset['host_response_rate_float'] = df_dataset.host_response_rate.apply(str_rate2int)

In [41]:
def str2boolean(row):
    if row == 't':
        return 1
    elif row == 'f':
        return 0
    else:
        return np.nan

In [42]:
df_dataset['superhost'] = df_dataset.host_is_superhost.apply(str2boolean)

In [43]:
def price2float(string_price):
    return float(string_price.split('.')[0].replace('$', '').replace(',', ''))

In [44]:
df_dataset['price_float'] = df_dataset.price.apply(price2float)

In [45]:
df_dataset['extra_people_float'] = df_dataset.extra_people.apply(price2float)
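
Note that `price2float` drops the cents and assumes every value is a string, so a missing price (read by pandas as a float `NaN`) would raise an error. A minimal sketch of a more defensive variant is shown below; the name `parse_price` is hypothetical and this is not the function used in the rest of the notebook:

```python
import math

def parse_price(value):
    """Convert a price string such as '$1,250.00' to a float.

    Non-string inputs (for example NaN for a missing price) are passed
    through as floats; None becomes NaN.
    """
    if value is None:
        return math.nan
    if not isinstance(value, str):
        return float(value)
    # Remove the currency symbol and thousands separators, keep the cents
    return float(value.replace('$', '').replace(',', '').strip())

print(parse_price('$1,250.00'))
print(parse_price('$59.50'))
print(parse_price(float('nan')))
```

It could be applied in the same way as above, for example `df_dataset.price.apply(parse_price)`.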

In [46]:
df_dataset.drop(columns=['host_response_rate', 'host_is_superhost', 'price', 'extra_people'], inplace=True)

In [47]:
df_dataset.head()

Unnamed: 0,id,host_id,host_since,host_response_time,neighbourhood_cleansed,room_type,property_type,accommodates,bathrooms,bedrooms,...,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,host_response_rate_float,superhost,price_float,extra_people_float
0,2818,3159,2008-09-24,within an hour,Oostelijk Havengebied - Indische Buurt,Private room,Apartment,2,1.5,1.0,...,10.0,1,0,1,0,2.13,100.0,1.0,59.0,20.0
1,20168,59484,2009-12-02,within an hour,Centrum-Oost,Private room,Townhouse,2,1.0,1.0,...,9.0,2,0,2,0,2.57,100.0,0.0,80.0,0.0
2,25428,56142,2009-11-20,within an hour,Centrum-West,Entire home/apt,Apartment,3,1.0,1.0,...,10.0,2,2,0,0,0.13,100.0,0.0,125.0,10.0
3,27886,97647,2010-03-23,within an hour,Centrum-West,Private room,Houseboat,2,1.0,1.0,...,10.0,1,0,1,0,2.14,100.0,1.0,155.0,0.0
4,28871,124245,2010-05-13,within an hour,Centrum-West,Private room,Apartment,2,1.0,1.0,...,10.0,3,0,3,0,2.81,100.0,1.0,75.0,0.0


In [48]:
df_dataset.dropna(subset=['host_response_time'], inplace=True)

In [49]:
df_dataset['host_response_time'].unique()

array(['within an hour', 'within a day', 'within a few hours',
       'a few days or more'], dtype=object)

#### Label Encoding

Label encoding maps each class label to an integer between 0 and n_classes-1. Because it is applied to several columns, the function `label_encoding` is built to avoid redundant code:

In [50]:
def label_encoding(df, target_column, replace_column=False):
    """
    Receive a dataframe and the name of the column to encode, and return
    the encoded values as an array. If replace_column is True, the original
    column is dropped from the dataframe in place.
    """
    lb_make = LabelEncoder()
    encoded_name = lb_make.fit_transform(df[target_column])
    if replace_column:
        df.drop(columns=[target_column], inplace=True)
        
    return encoded_name

In [51]:
df_dataset['host_response_time_encode'] = label_encoding(df_dataset, 
                                                         'host_response_time', 
                                                         replace_column=True)

In [52]:
df_dataset['neighbourhood_cleansed_encode'] = label_encoding(df_dataset, 
                                                             'neighbourhood_cleansed', 
                                                             replace_column=True)

In [53]:
df_dataset['room_type_encode'] = label_encoding(df_dataset, 
                                                'room_type', 
                                                replace_column=True)

In [54]:
df_dataset['property_type_encode'] = label_encoding(df_dataset, 
                                                    'property_type', 
                                                    replace_column=True)

In [55]:
df_dataset['bed_type_encode'] = label_encoding(df_dataset, 
                                               'bed_type', 
                                               replace_column=True)

In [56]:
df_dataset.head()

Unnamed: 0,id,host_id,host_since,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,maximum_nights,...,reviews_per_month,host_response_rate_float,superhost,price_float,extra_people_float,host_response_time_encode,neighbourhood_cleansed_encode,room_type_encode,property_type_encode,bed_type_encode
0,2818,3159,2008-09-24,2,1.5,1.0,2.0,"{Internet,Wifi,""Paid parking off premises"",""Bu...",3,15,...,2.13,100.0,1.0,59.0,20.0,3,14,2,1,4
1,20168,59484,2009-12-02,2,1.0,1.0,1.0,"{TV,Internet,Wifi,""Paid parking off premises"",...",1,1000,...,2.57,100.0,0.0,80.0,0.0,3,4,2,29,4
2,25428,56142,2009-11-20,3,1.0,1.0,1.0,"{TV,""Cable TV"",Internet,Wifi,Kitchen,Elevator,...",14,60,...,0.13,100.0,0.0,125.0,10.0,3,5,0,1,4
3,27886,97647,2010-03-23,2,1.0,1.0,1.0,"{TV,Internet,Wifi,Breakfast,Heating,""Smoke det...",2,730,...,2.14,100.0,1.0,155.0,0.0,3,5,2,21,4
4,28871,124245,2010-05-13,2,1.0,1.0,1.0,"{Internet,Wifi,""Pets live on this property"",Ca...",2,1825,...,2.81,100.0,1.0,75.0,0.0,3,5,2,1,4
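
`LabelEncoder` assigns the integer codes following the sorted order of the labels, which is consistent with `within an hour` being encoded as 3 in the table above. The sketch below reproduces that mapping in plain Python; `label_encode` is a hypothetical helper written only for illustration and is not part of scikit-learn:

```python
def label_encode(values):
    """Map each distinct label to its position in the sorted list of labels."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

response_times = ['within an hour',
                  'within a day',
                  'within a few hours',
                  'a few days or more',
                  'within an hour']

codes, mapping = label_encode(response_times)
print(mapping)
print(codes)
```

The codes are therefore alphabetical, not ordered by how fast the host responds: `a few days or more` gets 0 and `within an hour` gets 3. This is worth keeping in mind before feeding the encoded column to a model that interprets the numbers as an ordered scale.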


According to `analysis_neighborhoods`, the main amenities are the following:
Wifi, Heating, Washer, "Smoke detector", Shampoo, Hangers, "Hair dryer", Iron, "Laptop friendly workspace", "Hot water", TV, Kitchen, "Carbon monoxide detector", "First aid kit", "Fire extinguisher", "Private entrance", Essentials, "Bed linens", Stove, Oven, "Cooking basics", "Dishes and silverware", Dishwasher, "Coffee maker", Microwave, Dryer, "Family/kid friendly", "Cable TV", Refrigerator, "Host greets you".

We use these amenities to create new categories:
1. Safety: "Smoke detector", "Carbon monoxide detector", "Fire extinguisher", "First aid kit"
2. Entertainment: Wifi, TV, "Cable TV", "Laptop friendly workspace"
3. Personal care: Essentials, Shampoo, "Hair dryer"
4. Host: Heating, "Host greets you"
5. Kitchen: Refrigerator, Microwave, "Coffee maker", Dishwasher, "Dishes and silverware", Oven, Kitchen, "Cooking basics"
6. Comfort: Washer, Stove, Dryer, Iron, Hangers, "Hot water", "Bed linens"
7. Family/kid: "Family/kid friendly"
8. Entrance: "Private entrance"
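
One possible way to turn the categories above into features is to count, for each listing, how many amenities of each category it offers. The sketch below assumes the raw `amenities` format seen in the table output (`{TV,"Cable TV",Wifi}`); the names `AMENITY_CATEGORIES`, `parse_amenities` and `count_by_category` are hypothetical and only illustrate the idea:

```python
import csv

# Category membership copied from the list above
# ('Heating' is the spelling used in the raw amenities strings)
AMENITY_CATEGORIES = {
    'safety': {'Smoke detector', 'Carbon monoxide detector',
               'Fire extinguisher', 'First aid kit'},
    'entertainment': {'Wifi', 'TV', 'Cable TV', 'Laptop friendly workspace'},
    'personal_care': {'Essentials', 'Shampoo', 'Hair dryer'},
    'host': {'Heating', 'Host greets you'},
    'kitchen': {'Refrigerator', 'Microwave', 'Coffee maker', 'Dishwasher',
                'Dishes and silverware', 'Oven', 'Kitchen', 'Cooking basics'},
    'comfort': {'Washer', 'Stove', 'Dryer', 'Iron', 'Hangers',
                'Hot water', 'Bed linens'},
    'family_kid': {'Family/kid friendly'},
    'entrance': {'Private entrance'},
}

def parse_amenities(raw):
    """Split a string like '{TV,"Cable TV",Wifi}' into a set of amenity names."""
    inner = raw.strip().strip('{}')
    if not inner:
        return set()
    # csv.reader handles the quoted names that contain spaces or commas
    return set(next(csv.reader([inner])))

def count_by_category(raw):
    """Return the number of amenities per category for one listing."""
    amenities = parse_amenities(raw)
    return {name: len(amenities & members)
            for name, members in AMENITY_CATEGORIES.items()}

example = '{TV,"Cable TV",Internet,Wifi,Kitchen,"Smoke detector"}'
print(count_by_category(example))
```

Amenities that belong to no category, such as `Internet` in the example, are simply ignored. With pandas, the result could be expanded into columns with something like `df_dataset.amenities.apply(count_by_category).apply(pd.Series)`.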

"Paid parking off premises" 4861
"Buzzer/wireless intercom" 2551
"Safety card" 2231
"Lock on bedroom door" 3022
"24-hour check-in" 1650
"Extra pillows and blankets" 3165
"Single level home" 1446
"Garden or backyard" 1861
"No stairs or steps to enter" 3306
"Paid parking on premises" 1540
"Long term stays allowed" 2107
Elevator 2028
"Indoor fireplace" 1109
"Private living room" 2918
"Well-lit path to entrance" 1548
Breakfast 1260
"Self check-in" 1317
"Smoking allowed" 1247
"Pets allowed" 1542
"Luggage dropoff allowed" 3189
Bathtub 1728
"High chair" 1550
Crib 1095
"Children’s books and toys" 1827
"Wide hallways" 1033
Waterfront 1104
"Air conditioning" 1036
"Free parking on premises" 1987
"Wide entrance for guests" 1041