# EDA on Airbnb Listings in Toronto
The research interest lies in what makes a good Airbnb listing in Toronto.

In [285]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

In [286]:
listings = pd.read_csv("listings.csv")
row, col = listings.shape
row
col

74

In [287]:
# Split training and testing dataset
train_index_end = int(row*0.75)
train_indx = np.arange(train_index_end)
# Start at train_index_end so that no row falls between the two sets
test_indx = np.arange(train_index_end, row)
train_index_end, test_indx[-1]

(11313, 15083)
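
The split above takes the first 75% of rows in file order. If the file happens to be sorted (for example by listing id), the two sets may differ systematically. A minimal sketch of a shuffled alternative follows; the helper name `split_positions` and the seed are assumptions for illustration and do not come from the notebook.

```python
import numpy as np

def split_positions(n_rows, train_frac=0.75, seed=0):
    """Return shuffled, disjoint train/test row positions covering every row."""
    rng = np.random.default_rng(seed)   # fixed seed keeps the split reproducible
    order = rng.permutation(n_rows)     # random ordering of 0 .. n_rows - 1
    cut = int(n_rows * train_frac)
    return order[:cut], order[cut:]

# 15084 rows, matching the last test index (15083) reported above
train_pos, test_pos = split_positions(15084)
print(len(train_pos), len(test_pos))
```

The two arrays could then be passed to `listings.iloc[...]` in place of `train_indx` and `test_indx`.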

In [288]:
train_df = listings.iloc[train_indx]
test_df = listings.iloc[test_indx]

In [289]:
train_df.shape

(11313, 74)

In [290]:
train_df.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'description',
       'neighborhood_overview', 'picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_upd

In [291]:
# Drop variables that are highly likely irrelevant or difficult to process
var_to_drop = [
       'listing_url', 'scrape_id', 'last_scraped', 'description',
       'neighborhood_overview', 'picture_url', 'neighbourhood', 'latitude',
       'longitude', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped'      
]

train_df.drop(var_to_drop, inplace=True, axis = 1)
train_df.shape

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df.drop(var_to_drop, inplace=True, axis = 1)


(11313, 50)
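
The SettingWithCopyWarning above arises because `train_df` was created by slicing `listings`, so pandas cannot tell whether an in-place change is meant for the slice or for the original frame. A common way to avoid it is to take an explicit copy before modifying anything. The sketch below uses a small made-up frame (`listings_demo`) rather than the real data.

```python
import pandas as pd

# Made-up stand-in for the listings table
listings_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "listing_url": ["a", "b", "c", "d"],
    "price": ["$10.00", "$20.00", "$30.00", "$40.00"],
})

# .copy() makes train_demo an independent frame, so later edits are unambiguous
train_demo = listings_demo.iloc[:3].copy()

# Reassigning the result of drop() avoids inplace=True on a slice
train_demo = train_demo.drop(columns=["listing_url"])
train_demo["price"] = train_demo["price"].str.lstrip("$").astype(float)

print(train_demo)
print(listings_demo["price"].tolist())  # the source frame is left untouched
```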

## Get Data Type and Integrity Information

In [292]:
train_df.iloc[:,0:10].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11313 entries, 0 to 11312
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  11313 non-null  int64 
 1   name                11312 non-null  object
 2   host_id             11313 non-null  int64 
 3   host_url            11313 non-null  object
 4   host_name           11307 non-null  object
 5   host_since          11307 non-null  object
 6   host_location       11303 non-null  object
 7   host_about          6587 non-null   object
 8   host_response_time  5588 non-null   object
 9   host_response_rate  5588 non-null   object
dtypes: int64(2), object(8)
memory usage: 972.2+ KB


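* name has one missing value
* Neighborhood overview (already dropped above) had nearly 4000 missing values
* host_about is missing in about 4,700 rows; host_response_time and host_response_rate are each missing in about 5,700 rows
* id and host_id are integer valued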

In [293]:
train_df.iloc[:,10:20].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11313 entries, 0 to 11312
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   host_acceptance_rate       5776 non-null   object 
 1   host_is_superhost          11307 non-null  object 
 2   host_thumbnail_url         11307 non-null  object 
 3   host_picture_url           11307 non-null  object 
 4   host_neighbourhood         9706 non-null   object 
 5   host_listings_count        11307 non-null  float64
 6   host_total_listings_count  11307 non-null  float64
 7   host_verifications         11313 non-null  object 
 8   host_has_profile_pic       11307 non-null  object 
 9   host_identity_verified     11307 non-null  object 
dtypes: float64(2), object(8)
memory usage: 972.2+ KB


* Every variable in this range except host_verifications has null values
* host_acceptance_rate and host_neighbourhood have the most missing values (5,537 and 1,607). Possibly they are not required to create a profile.
* The others have no or only a few missing values.
* host_verifications has no nulls; maybe every host needs to verify an ID
* host_has_profile_pic should be boolean but is stored as object; need to check how it is encoded
* Many variables are URLs; for simplicity, **URL columns will be removed**.

In [294]:
train_df.iloc[:,20:30].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11313 entries, 0 to 11312
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   neighbourhood_cleansed        11313 non-null  object 
 1   neighbourhood_group_cleansed  0 non-null      float64
 2   property_type                 11313 non-null  object 
 3   room_type                     11313 non-null  object 
 4   accommodates                  11313 non-null  int64  
 5   bathrooms                     0 non-null      float64
 6   bathrooms_text                11306 non-null  object 
 7   bedrooms                      10509 non-null  float64
 8   beds                          11271 non-null  float64
 9   amenities                     11313 non-null  object 
dtypes: float64(4), int64(1), object(5)
memory usage: 972.2+ KB


* The neighbourhood_group_cleansed and bathrooms columns are entirely empty. Remove them
* The others have only a few null values (at most 804, in bedrooms), so filling them is not a big problem

In [295]:
train_df.iloc[:,30:40].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11313 entries, 0 to 11312
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   price                      11313 non-null  object 
 1   number_of_reviews          11313 non-null  int64  
 2   number_of_reviews_ltm      11313 non-null  int64  
 3   number_of_reviews_l30d     11313 non-null  int64  
 4   first_review               9501 non-null   object 
 5   last_review                9501 non-null   object 
 6   review_scores_rating       9501 non-null   float64
 7   review_scores_accuracy     9330 non-null   float64
 8   review_scores_cleanliness  9332 non-null   float64
 9   review_scores_checkin      9329 non-null   float64
dtypes: float64(4), int64(3), object(3)
memory usage: 972.2+ KB


* price through number_of_reviews_l30d have no null values
* The remaining variables have roughly 1,800 to 2,000 missing values and need completion
* price should be float or integer but is object, probably because the $ sign is included

In [296]:
train_df.iloc[:,40:50].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11313 entries, 0 to 11312
Data columns (total 10 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   review_scores_communication                   9329 non-null   float64
 1   review_scores_location                        9328 non-null   float64
 2   review_scores_value                           9329 non-null   float64
 3   license                                       2633 non-null   object 
 4   instant_bookable                              11313 non-null  object 
 5   calculated_host_listings_count                11313 non-null  int64  
 6   calculated_host_listings_count_entire_homes   11313 non-null  int64  
 7   calculated_host_listings_count_private_rooms  11313 non-null  int64  
 8   calculated_host_listings_count_shared_rooms   11313 non-null  int64  
 9   reviews_per_month                             9501 non-null  

* license has too many missing values (8,680) and should be removed
* The review scores have around 2,000 missing values
* The listing-count variables have no null values
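
The observations above were read off `info()` ten columns at a time. As a complement, the missing counts for every column can be collected into one sorted table. This is a sketch with a hypothetical helper `missing_summary` and made-up demo values, not output from the real data.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """One row per column: dtype, missing count, and missing percentage."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Made-up values that mimic three of the columns discussed above
demo = pd.DataFrame({
    "price": ["$469.00", "$94.00", "$72.00", "$45.00"],
    "license": [None, None, None, "STR-0000"],
    "bedrooms": [1.0, np.nan, 2.0, 1.0],
})
summary = missing_summary(demo)
print(summary)
```

Calling `missing_summary(train_df)` would list the mostly empty columns first, which makes the drop-or-impute decision easier to review.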

### Decisions
* Remove columns: neighbourhood_group_cleansed, bathrooms, license
* Keep variables with as few as roughly 5,500 non-null values, correlate them with being a good host, and use that correlation to impute the missing values
* Remove variables with _url in their names
* Combine the multiple review scores into a single synthesized score.
* host_about might be a long paragraph of text, which is difficult to process, so we will leave it out

#### Correcting by Removing Bad Data

In [297]:
null_col = ["neighbourhood_group_cleansed", "bathrooms", "license"]
url_col = list(filter(lambda x: "_url" in x, train_df.columns))
col_to_remove = null_col + url_col + ["host_about"]
col_to_remove

['neighbourhood_group_cleansed',
 'bathrooms',
 'license',
 'host_url',
 'host_thumbnail_url',
 'host_picture_url',
 'host_about']

In [298]:
train_df.drop(col_to_remove, axis=1, inplace=True)


In [299]:
row, col = train_df.shape
row, col

(11313, 43)

#### Convert Price to Numerical Data

In [300]:
train_df.price.head(10).values

array(['$469.00', '$94.00', '$72.00', '$45.00', '$75.00', '$125.00',
       '$100.00', '$70.00', '$70.00', '$130.00'], dtype=object)

In [301]:
# Extract the whole-dollar digits before the decimal point (still stored as strings)
numerical_price = train_df.price.str.extract(r'(\d+)\.(\d+)')
numerical_price = numerical_price.loc[:, 0]
numerical_price

0        469
1         94
2         72
3         45
4         75
        ... 
11308     89
11309     40
11310     99
11311    109
11312     55
Name: 0, Length: 11313, dtype: object

In [302]:
train_df.price = numerical_price
train_df.price


0        469
1         94
2         72
3         45
4         75
        ... 
11308     89
11309     40
11310     99
11311    109
11312     55
Name: price, Length: 11313, dtype: object

In [303]:
# Set type to integer
train_df.price = train_df.price.astype('int64')
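
The pattern `(\d+)\.(\d+)` keeps only the digits immediately before the decimal point, so a price written with a thousands separator, such as "$1,250.00", would be read as 250. The maximum price of 999 in the later numeric summary would be consistent with that, although the raw data is not shown here to confirm it. A hedged alternative is to strip the currency symbol and separators and let pandas parse the number; `parse_price` and the demo values are assumptions for illustration.

```python
import pandas as pd

def parse_price(series):
    """Turn strings like "$1,250.00" into floats; unparseable values become NaN."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Demo values; the third one is made up to show the thousands separator
demo = pd.Series(["$469.00", "$94.00", "$1,250.00", None])
prices = parse_price(demo)
print(prices)
```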


#### Combining Scores into One Unified Metric

* Rating is the overall experience of travelers, assign 0.025
* Cleanliness is a big part, assign 0.2
* Accuracy relates to faithful representation, assign 0.2
* Communication reflects attitude and warmth, assign 0.15
* Check-in shouldn't be a big problem in most places, assign 0.05
* Value means cost-effectiveness, assign 0.2
* Location affects safety, noise, and convenience, assign 0.175

In [304]:
combine_score = (train_df.review_scores_rating*0.025 + 
train_df.review_scores_accuracy*0.2 + 
train_df.review_scores_cleanliness*0.2 + 
train_df.review_scores_checkin*0.05 + 
train_df.review_scores_communication*0.15+ 
train_df.review_scores_location*0.175+ 
train_df.review_scores_value*0.2)

In [305]:
combine_score[:10], combine_score.info()

<class 'pandas.core.series.Series'>
Int64Index: 11313 entries, 0 to 11312
Series name: None
Non-Null Count  Dtype  
--------------  -----  
9328 non-null   float64
dtypes: float64(1)
memory usage: 434.8 KB


(0    5.00000
 1    4.86650
 2    4.71875
 3    4.89450
 4    4.90550
 5        NaN
 6    4.70125
 7    4.07500
 8    4.78775
 9        NaN
 dtype: float64,
 None)
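
Keeping the weights in one dictionary makes it easy to check that they sum to 1 and to keep the description and the computation in step. The sketch below copies the weights from the code cell above; the helper name `combined_score` and the demo frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Weights copied from the code cell above
WEIGHTS = {
    "review_scores_rating": 0.025,
    "review_scores_accuracy": 0.2,
    "review_scores_cleanliness": 0.2,
    "review_scores_checkin": 0.05,
    "review_scores_communication": 0.15,
    "review_scores_location": 0.175,
    "review_scores_value": 0.2,
}

def combined_score(df, weights=WEIGHTS):
    """Weighted sum of the score columns; NaN if any component is missing."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total}, expected 1")
    weighted = df[list(weights)].mul(pd.Series(weights), axis="columns")
    return weighted.sum(axis=1, skipna=False)

# Made-up scores: one fully rated listing, one with a missing component
demo = pd.DataFrame({col: [5.0, 4.0] for col in WEIGHTS})
demo.loc[1, "review_scores_value"] = np.nan
scores = combined_score(demo)
print(scores)
```

With `skipna=False`, a listing that lacks any one component gets a missing combined score, which is the same behaviour as the chained arithmetic in the notebook.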

In [306]:
train_df[list(filter(lambda x: "review_scores" in x, train_df.columns.values))].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11313 entries, 0 to 11312
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   review_scores_rating         9501 non-null   float64
 1   review_scores_accuracy       9330 non-null   float64
 2   review_scores_cleanliness    9332 non-null   float64
 3   review_scores_checkin        9329 non-null   float64
 4   review_scores_communication  9329 non-null   float64
 5   review_scores_location       9328 non-null   float64
 6   review_scores_value          9329 non-null   float64
dtypes: float64(7)
memory usage: 965.1 KB


Each review score column has at least 9,328 non-null values, and the combined score is populated for exactly 9,328 rows, because it is NaN whenever any one component is missing.

In [307]:
score_band = pd.Series(map(lambda x: "good" if x >= 4 else ("fair" if x >= 3 else "poor"), combine_score))
score_band.head(6)

0    good
1    good
2    good
3    good
4    good
5    poor
dtype: object
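
In the output above, row 5 is labelled "poor" even though its combined score is NaN: every comparison with NaN is False, so the lambda falls through to its last branch. If unreviewed listings should stay missing rather than count as poor, one option is `pd.cut` with the same thresholds. The helper name `band_scores` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def band_scores(scores):
    """Label scores as poor (< 3), fair (3 to < 4) or good (>= 4); NaN stays NaN."""
    return pd.cut(
        scores,
        bins=[-np.inf, 3, 4, np.inf],
        labels=["poor", "fair", "good"],
        right=False,  # intervals are closed on the left: [3, 4) is "fair"
    )

labels = band_scores(pd.Series([5.0, 4.0, 3.5, 2.9, np.nan]))
print(labels)
```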

In [308]:
train_df["score_band"] = score_band
train_df.drop(list(filter(lambda x: "review_scores" in x, train_df.columns.values)), axis = 1, inplace=True)
train_df.columns


Index(['id', 'name', 'host_id', 'host_name', 'host_since', 'host_location',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_cleansed', 'property_type', 'room_type', 'accommodates',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'first_review', 'last_review', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month',
       'score_band'],
      dtype='object')

## Describe the Data

In [309]:
train_df.iloc[:,0:10].describe(include="O")

| | name | host_name | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost |
|---|---|---|---|---|---|---|---|---|
| count | 11312 | 11307 | 11307 | 11303 | 5588 | 5588 | 5776 | 11307 |
| unique | 11171 | 4032 | 2855 | 293 | 4 | 65 | 87 | 2 |
| top | Private room in a shared hostel suite downtown | David | 2011-07-06 | Toronto, Ontario, Canada | within an hour | 100% | 100% | f |
| freq | 8 | 89 | 49 | 8769 | 2802 | 3716 | 1751 | 8249 |


* Some listing names are duplicated (11,171 unique names across 11,312 listings); the same listing may appear more than once, or a host may reuse one name for several units
* Host names repeat: "David" appears on 89 listing records, although several different hosts may share that first name
* host_since should be a date type; need to check its current type
* Host locations are concentrated: 11,303 listings share 293 unique locations. The top value is a city rather than a street address, so the values need further inspection
* host_response_time has only 4 unique values, given as text such as "within an hour". Need to inspect it further
* Response rate and acceptance rate are stored as text because of the % sign. Need to inspect the values and remove the % sign.
* Need to convert host_is_superhost into boolean values

### Correcting object-valued variables in the first 10 variables

In [310]:
train_df.host_name.nunique()

4032

In [311]:
test_date = train_df.host_since.values[0]
type(test_date)

str

In [312]:
train_df.host_since = pd.to_datetime(train_df.host_since, infer_datetime_format=True)


In [313]:
train_df.host_response_time.unique()

array([nan, 'within a few hours', 'within an hour', 'within a day',
       'a few days or more'], dtype=object)

In [314]:
# Host response time is ordinal, so map each label to an integer code; missing values get code 4
train_df.host_response_time = train_df.host_response_time.map(
    {'within a few hours':1, 'within an hour':2, 'within a day':3,
    'a few days or more':4, np.NaN: 4})
train_df.host_response_time


0        4
1        4
2        4
3        4
4        1
        ..
11308    4
11309    4
11310    4
11311    4
11312    4
Name: host_response_time, Length: 11313, dtype: int64
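
The mapping above gives "within a few hours" code 1 and "within an hour" code 2, which reverses their natural order, and it gives missing values the same code as "a few days or more". The sketch below encodes the labels from fastest to slowest and keeps missing values missing, assuming that ordering is what was intended. `RESPONSE_ORDER` and `encode_response_time` are illustrative names.

```python
import pandas as pd

# Fastest to slowest; labels taken from the unique() output above
RESPONSE_ORDER = [
    "within an hour",
    "within a few hours",
    "within a day",
    "a few days or more",
]

def encode_response_time(series):
    """Map labels to 1..4 in speed order; unknown or missing labels become NaN."""
    cat = pd.Categorical(series, categories=RESPONSE_ORDER, ordered=True)
    codes = pd.Series(cat.codes, index=series.index, dtype="float64")
    return codes.where(codes >= 0) + 1  # Categorical uses -1 for missing

demo = pd.Series(["within a few hours", "within an hour", None, "a few days or more"])
codes = encode_response_time(demo)
print(codes)
```

Whether missing response times should later be imputed or given their own code is a modelling choice this sketch leaves open.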

In [315]:
train_df.host_acceptance_rate.unique()

array([nan, '54%', '100%', '82%', '50%', '97%', '0%', '96%', '94%', '92%',
       '75%', '33%', '99%', '80%', '61%', '86%', '95%', '68%', '91%',
       '18%', '65%', '90%', '98%', '72%', '71%', '69%', '67%', '93%',
       '40%', '27%', '73%', '76%', '57%', '10%', '85%', '64%', '56%',
       '83%', '81%', '47%', '74%', '62%', '17%', '70%', '88%', '22%',
       '25%', '79%', '89%', '30%', '63%', '42%', '41%', '20%', '78%',
       '60%', '29%', '35%', '77%', '52%', '14%', '87%', '55%', '46%',
       '38%', '59%', '51%', '31%', '8%', '84%', '13%', '36%', '44%',
       '43%', '5%', '58%', '9%', '15%', '24%', '53%', '21%', '11%', '48%',
       '66%', '45%', '34%', '39%', '16%'], dtype=object)

In [316]:
train_df.host_acceptance_rate = train_df.host_acceptance_rate.str.rstrip("%")
train_df.host_acceptance_rate.unique()


array([nan, '54', '100', '82', '50', '97', '0', '96', '94', '92', '75',
       '33', '99', '80', '61', '86', '95', '68', '91', '18', '65', '90',
       '98', '72', '71', '69', '67', '93', '40', '27', '73', '76', '57',
       '10', '85', '64', '56', '83', '81', '47', '74', '62', '17', '70',
       '88', '22', '25', '79', '89', '30', '63', '42', '41', '20', '78',
       '60', '29', '35', '77', '52', '14', '87', '55', '46', '38', '59',
       '51', '31', '8', '84', '13', '36', '44', '43', '5', '58', '9',
       '15', '24', '53', '21', '11', '48', '66', '45', '34', '39', '16'],
      dtype=object)

In [317]:
def rate_band(rate):
    if rate == np.NaN:
        return 4

    if rate >= 80:
        return 1
    elif rate >= 60:
        return 2
    else :
        return 3
indx_for_convertion = train_df.host_acceptance_rate.notnull()
train_df.host_acceptance_rate.loc[indx_for_convertion] = train_df.host_acceptance_rate.loc[indx_for_convertion].astype("int32")


In [318]:
train_df.host_acceptance_rate = train_df.host_acceptance_rate.map(rate_band)
train_df.host_acceptance_rate.shape


(11313,)
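
`rate == np.NaN` is always False, because NaN never compares equal to anything, itself included. The first branch of `rate_band` therefore never fires and missing rates fall through to band 3, which matches the maximum of 3 reported for `host_acceptance_rate` in the numeric summary further down. A minimal sketch of the same function with an explicit missing-value test, assuming band 4 is meant to mark missing rates:

```python
import pandas as pd

def rate_band(rate):
    """Band a percentage: 1 (>= 80), 2 (>= 60), 3 (below 60), 4 (missing)."""
    if pd.isna(rate):  # True for None, NaN and pd.NA
        return 4
    if rate >= 80:
        return 1
    elif rate >= 60:
        return 2
    return 3

# Made-up rates in the same "NN%" format as the column
raw = pd.Series(["54%", None, "100%", "75%"])
numeric = pd.to_numeric(raw.str.rstrip("%"), errors="coerce")
bands = numeric.map(rate_band)
print(bands.tolist())
```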

In [319]:
# Convert response rate
train_df.host_response_rate = train_df.host_response_rate.str.rstrip("%")
indx_for_convertion = train_df.host_response_rate.notnull()


In [320]:
train_df.host_response_rate = train_df.host_response_rate.astype("Int32").map(rate_band)
train_df.host_response_rate.dtype
type(train_df.host_response_rate[0]), train_df.host_response_rate

TypeError: boolean value of NA is ambiguous
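
The TypeError above comes from the nullable `Int32` dtype: missing entries become `pd.NA`, comparisons with `pd.NA` evaluate to `pd.NA`, and that value cannot be used as the condition of an `if`. One way around it, sketched here with a hypothetical helper `band_rate_column`, is to stay vectorised and let `pd.cut` assign the bands, then give missing rates their own band.

```python
import numpy as np
import pandas as pd

def band_rate_column(series):
    """Band "NN%" strings: 1 (>= 80), 2 (60 to < 80), 3 (< 60), 4 (missing)."""
    numeric = pd.to_numeric(series.str.rstrip("%"), errors="coerce")
    bands = pd.cut(
        numeric,
        bins=[-np.inf, 60, 80, np.inf],
        labels=[3, 2, 1],  # one label per bin, lowest bin first
        right=False,       # [60, 80) is band 2, so exactly 80 lands in band 1
    )
    return bands.astype("float64").fillna(4).astype("int64")

demo = pd.Series(["100%", "75%", "59%", None])
banded = band_rate_column(demo)
print(banded.tolist())
```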

In [None]:
train_df.host_response_rate.loc[train_df.host_response_rate.notnull()] = train_df.host_response_rate.loc[train_df.host_response_rate.notnull()].apply(int)
train_df.host_response_rate[4]


100

In [None]:
train_df.iloc[:, 0:10].describe()

| | host_response_time | host_acceptance_rate |
|---|---|---|
| count | 11313.0 | 11313.0 |
| mean | 3.067445 | 2.258729 |
| std | 1.136834 | 0.929808 |
| min | 1.0 | 1.0 |
| 25% | 2.0 | 1.0 |
| 50% | 4.0 | 3.0 |
| 75% | 4.0 | 3.0 |
| max | 4.0 | 3.0 |


In [None]:
train_df.iloc[:, 0:10].describe(include = "O")

| | name | host_name | host_location | host_response_rate | host_is_superhost |
|---|---|---|---|---|---|
| count | 11312 | 11307 | 11303 | 5588 | 11307 |
| unique | 11171 | 4032 | 293 | 65 | 2 |
| top | Private room in a shared hostel suite downtown | David | Toronto, Ontario, Canada | 100 | f |
| freq | 8 | 89 | 8769 | 3716 | 8249 |


In [None]:
# Convert id and host_id from integers to strings (they are identifiers, not quantities)
train_df.id = str(train_df.id)
train_df.host_id = str(train_df.host_id)


In [None]:
train_df.iloc[:, :11].describe(include = "O")

| | id | name | host_id | host_name | host_location | host_response_rate | host_is_superhost | host_neighbourhood |
|---|---|---|---|---|---|---|---|---|
| count | 11313 | 11312 | 11313 | 11307 | 11303 | 5588 | 11307 | 9706 |
| unique | 1 | 11171 | 1 | 4032 | 293 | 65 | 2 | 165 |
| top | 0 1419\n1 8077\n2 ... | Private room in a shared hostel suite downtown | 0 1565\n1 22795\n2 ... | David | Toronto, Ontario, Canada | 100 | f | Downtown Toronto |
| freq | 11313 | 8 | 11313 | 89 | 8769 | 3716 | 8249 | 1128 |
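
The summary above reports a single unique value for `id` and `host_id`, and the `top` entry looks like a printed Series. That is consistent with `str(train_df.id)` turning the entire column into one string, which is then assigned to every row. Element-wise conversion is done with `astype(str)`; a small sketch on made-up ids:

```python
import pandas as pd

ids = pd.Series([1419, 8077, 12604])

# str() on a Series returns its printed representation as ONE string
whole_series_as_text = str(ids)

# astype(str) converts each element separately
ids_as_text = ids.astype(str)

print(type(whole_series_as_text).__name__)
print(ids_as_text.tolist())
```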


Describing and correcting the first 10 columns is now complete.

### Describe Numerically Valued Variables from 11 to 20

In [None]:
train_df.iloc[:, 11:21].describe()

| | host_listings_count | host_total_listings_count | accommodates | bathrooms | bedrooms |
|---|---|---|---|---|---|
| count | 11307.0 | 11307.0 | 11313.0 | 11290.0 | 10509.0 |
| mean | 4.921818 | 4.921818 | 3.029966 | 1.207263 | 1.433343 |
| std | 16.455443 | 16.455443 | 1.966904 | 0.510238 | 0.819825 |
| min | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 25% | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 |
| 50% | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 |
| 75% | 3.0 | 3.0 | 4.0 | 1.0 | 2.0 |
| max | 272.0 | 272.0 | 16.0 | 6.0 | 9.0 |


### Correcting object-valued columns 11 to 20

In [None]:
train_df.iloc[:, 11:21].describe(include = "O")

| | host_verifications | host_has_profile_pic | host_identity_verified | neighbourhood_cleansed | property_type | room_type | bathrooms_text |
|---|---|---|---|---|---|---|---|
| count | 11313 | 11307 | 11307 | 11313 | 11313 | 11313 | 11306 |
| unique | 324 | 2 | 2 | 140 | 58 | 4 | 25 |
| top | ['email', 'phone'] | t | t | Waterfront Communities-The Island | Entire rental unit | Entire home/apt | 1 bath |
| freq | 1110 | 11273 | 9028 | 1771 | 2534 | 7333 | 5819 |


In [None]:
train_df.iloc[:, 11:21].describe()

| | host_listings_count | host_total_listings_count | accommodates |
|---|---|---|---|
| count | 11307.0 | 11307.0 | 11313.0 |
| mean | 4.921818 | 4.921818 | 3.029966 |
| std | 16.455443 | 16.455443 | 1.966904 |
| min | 0.0 | 0.0 | 1.0 |
| 25% | 1.0 | 1.0 | 2.0 |
| 50% | 1.0 | 1.0 | 2.0 |
| 75% | 3.0 | 3.0 | 4.0 |
| max | 272.0 | 272.0 | 16.0 |


In [322]:
train_df["verify_by_email"] = \
    train_df.host_verifications.str.contains("email")

train_df["verify_by_phone"] = \
    train_df.host_verifications.str.contains("phone")

train_df["verify_by_gov_id"] = \
    train_df.host_verifications.str.contains("government_id")

train_df["verify_by_work_email"] = \
    train_df.host_verifications.str.contains("work_email")

train_df["verify_by_identity_manual"] = \
    train_df.host_verifications.str.contains("identity_manual")

train_df["verify_by_reviews"] = \
    train_df.host_verifications.str.contains("reviews")

In [None]:
train_df.host_verifications.explode().str.split()[10:20]

10    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
11    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
12    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
13    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
14    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
15                    [['email',, 'phone',, 'reviews']]
16    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
17    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
18    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
19    [['email',, 'phone',, 'reviews',, 'jumio',, 'g...
Name: host_verifications, dtype: object

In [None]:
train_df.host_verifications = \
    train_df.host_verifications.str.lstrip("[").str.rstrip("]")


In [335]:
train_df["host_verification"] = train_df.host_verifications.str.split(",")


In [336]:
train_df["veri_counts"] = train_df["host_verification"].map(lambda x: len(x))


In [337]:
train_df["veri_counts"]

0        5
1        4
2        7
3        7
4        7
        ..
11308    5
11309    6
11310    5
11311    4
11312    2
Name: veri_counts, Length: 11313, dtype: int64
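
Stripping the brackets and splitting on commas leaves quote characters and spaces inside each item, and an empty list `[]` would become `['']`, which has length 1. Since each raw value appears to be a Python-style list literal, `ast.literal_eval` from the standard library can parse it directly. A sketch under that assumption (`parse_verifications` is an illustrative name, and the demo values are made up):

```python
import ast
import pandas as pd

def parse_verifications(raw):
    """Parse a list literal such as "['email', 'phone']" into a Python list."""
    if pd.isna(raw):
        return []
    parsed = ast.literal_eval(raw)
    return list(parsed) if parsed else []

raw = pd.Series(["['email', 'phone']", "[]", None])
parsed = raw.map(parse_verifications)

veri_counts = parsed.map(len)                           # number of methods
has_phone = parsed.map(lambda items: "phone" in items)  # exact-item membership

print(veri_counts.tolist(), has_phone.tolist())
```

Membership tests on the parsed list match whole items, whereas `str.contains("email")` on the raw text also matches `work_email`.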

In [338]:
train_df.drop("host_verifications", axis = 1, inplace = True)


In [None]:
# Replace all columns with "t" and "f" by boolean values
bool_col_list = []
for col in train_df.columns:
    # print(col)
    if train_df[col].dtype == "object":
        t_equal_value = train_df[col].str.fullmatch(r"^t$")
        if t_equal_value.any():
            bool_col_list.append(col)

print(bool_col_list)

['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable']


In [None]:
for name in bool_col_list:
    train_df[name] = train_df[name].map({"t": True, "f": False}).astype(bool)

train_df[bool_col_list].describe()

| | host_is_superhost | host_has_profile_pic | host_identity_verified | instant_bookable |
|---|---|---|---|---|
| count | 11313 | 11313 | 11313 | 11313 |
| unique | 2 | 2 | 2 | 2 |
| top | False | True | True | False |
| freq | 8249 | 11279 | 9034 | 8580 |
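
Comparing this table with the earlier counts shows a side effect: `host_has_profile_pic` went from 11,273 `t` values to 11,279 `True`, and `host_identity_verified` from 9,028 to 9,034. The six rows that were missing became `True`, because NaN is truthy under `.astype(bool)`. If missing flags should stay missing, pandas' nullable `"boolean"` dtype is one option; a sketch on made-up values:

```python
import pandas as pd

flags = pd.Series(["t", "f", None, "t"])
mapped = flags.map({"t": True, "f": False})  # the missing entry maps to NaN

# Plain bool has no missing value, so NaN is coerced to True
lossy = mapped.astype(bool)

# The nullable dtype keeps the missing entry as <NA>
nullable = mapped.astype("boolean")

print(lossy.tolist())
print(nullable.isna().tolist())
```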


In [None]:
train_df.iloc[:, 11:21].describe()

| | host_listings_count | host_total_listings_count | accommodates | bedrooms |
|---|---|---|---|---|
| count | 11307.0 | 11307.0 | 11313.0 | 10509.0 |
| mean | 4.921818 | 4.921818 | 3.029966 | 1.433343 |
| std | 16.455443 | 16.455443 | 1.966904 | 0.819825 |
| min | 0.0 | 0.0 | 1.0 | 1.0 |
| 25% | 1.0 | 1.0 | 2.0 | 1.0 |
| 50% | 1.0 | 1.0 | 2.0 | 1.0 |
| 75% | 3.0 | 3.0 | 4.0 | 2.0 |
| max | 272.0 | 272.0 | 16.0 | 9.0 |


In [None]:
train_df.iloc[:, 11:21].describe(include = "O")

| | neighbourhood_cleansed | property_type | room_type | bathrooms_text |
|---|---|---|---|---|
| count | 11313 | 11313 | 11313 | 11306 |
| unique | 140 | 58 | 4 | 25 |
| top | Waterfront Communities-The Island | Entire rental unit | Entire home/apt | 1 bath |
| freq | 1771 | 2534 | 7333 | 5819 |


In [326]:
# Convert Bathroom to numerical and change the name
train_df.bathrooms_text = train_df.bathrooms_text.str.extract(r'(\d+)')
train_df.rename({"bathrooms_text":"bathrooms"}, axis = 1, inplace=True)
train_df.bathrooms[:10]



0    3
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    2
Name: bathrooms, dtype: object
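
The extraction above keeps only the first run of digits, so fractional and text-only values such as "1.5 baths" or "Half-bath" lose information. A hedged sketch of a parser that keeps them, on toy strings; the helper name `parse_bathrooms` and the rule that any digit-free "half" entry counts as 0.5 are assumptions for illustration, not something the notebook does.

```python
import pandas as pd

def parse_bathrooms(text: pd.Series) -> pd.Series:
    """Parse strings like '1.5 shared baths' into floats."""
    # Capture an integer or decimal number; expand=False returns a Series.
    number = text.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    number = pd.to_numeric(number, errors="coerce")
    # Assumption: entries with no digits that mention "half" mean 0.5.
    is_half = text.str.contains("half", case=False, na=False)
    return number.mask(number.isna() & is_half, 0.5)

sample = pd.Series(["3 baths", "1.5 shared baths", "Half-bath", None])
parsed = parse_bathrooms(sample)
print(parsed)
```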

In [None]:
# errors="ignore" is deprecated in recent pandas; "coerce" turns unparseable values into NaN
train_df.bathrooms = \
    pd.to_numeric(train_df.bathrooms, errors = 'coerce')



So far so good
### Describe Variables from 21 to 31

In [None]:
train_df.iloc[:, 21:31].describe()

Unnamed: 0,beds,price,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,calculated_host_listings_count
count,11271.0,11313.0,11313.0,11313.0,11313.0,11313.0
mean,1.642002,134.723769,32.906126,2.954389,0.290374,3.635198
std,1.114177,118.647768,58.59625,9.027414,1.101991,7.399178
min,0.0,0.0,0.0,0.0,0.0,1.0
25%,1.0,65.0,2.0,0.0,0.0,1.0
50%,1.0,100.0,10.0,0.0,0.0,1.0
75%,2.0,159.0,37.0,2.0,0.0,3.0
max,13.0,999.0,828.0,183.0,17.0,71.0


In [None]:
train_df.iloc[:, 11:21].describe(include = "O")

Unnamed: 0,neighbourhood_cleansed,property_type,room_type
count,11313,11313,11313
unique,140,58,4
top,Waterfront Communities-The Island,Entire rental unit,Entire home/apt
freq,1771,2534,7333


### Describe Object Columns from 21 to 31

In [None]:
train_df.iloc[:, 21:31].describe(include = "O")

Unnamed: 0,amenities,first_review,last_review
count,11313,9501,9501
unique,10634,2082,1860
top,"[""Long term stays allowed""]",2020-01-01,2020-01-01
freq,47,38,103


In [None]:
train_df.first_review = pd.to_datetime(train_df.first_review, errors = "coerce")
train_df.last_review = pd.to_datetime(train_df.last_review, errors = "coerce")



### Describe columns from 31 to 41

In [None]:
train_df.iloc[:, 31:41].describe(include="all")

Unnamed: 0,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,score_band,verify_by_email,verify_by_phone,verify_by_gov_id,verify_by_work_email,verify_by_identity_manual
count,11313.0,11313.0,11313.0,9501.0,11313,11313,11313,11313,11313,11313
unique,,,,,3,2,2,2,2,2
top,,,,,good,True,True,True,False,False
freq,,,,,9121,10601,11235,8764,9632,7325
mean,2.291877,1.246531,0.04349,1.110394,,,,,,
std,6.830478,3.049472,0.654424,2.483943,,,,,,
min,0.0,0.0,0.0,0.01,,,,,,
25%,0.0,0.0,0.0,0.13,,,,,,
50%,1.0,0.0,0.0,0.45,,,,,,
75%,1.0,1.0,0.0,1.38,,,,,,


## Assumptions
* Rating band correlates with verification type
* Rating band correlates with the number of host verifications
* Rating band correlates with number of reviews
* Rating band correlates with number of listings
* Rating band correlates with the listing's location
* Rating band correlates with type of amenities
* Rating band correlates with the number of amenities
* Rating band correlates with price
* Rating band correlates with room_type and property type
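
One way to screen several of these assumptions at once, sketched here on made-up data: group by `score_band`, take the mean of each candidate column, and rank the columns by the spread between bands. The helper name `band_means` and the toy values are assumptions for illustration; only `score_band` and the `verify_by_*` column names come from the notebook.

```python
import pandas as pd

def band_means(df: pd.DataFrame, cols: list[str], band_col: str = "score_band") -> pd.DataFrame:
    """Mean of each column per band, plus the max-min spread across bands."""
    means = df.groupby(band_col)[cols].mean().T   # rows: columns, columns: bands
    means["spread"] = means.max(axis=1) - means.min(axis=1)
    return means.sort_values("spread", ascending=False)

# Toy data: phone verification is universal, government ID differs by band.
toy = pd.DataFrame({
    "score_band": ["good", "good", "fair", "poor", "poor", "good"],
    "verify_by_phone": [True, True, True, True, True, True],
    "verify_by_gov_id": [True, True, False, False, False, True],
})
summary = band_means(toy, ["verify_by_phone", "verify_by_gov_id"])
print(summary)
```

A large spread is only a hint that a column is worth a closer look; it says nothing about sample size or significance.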

## Analysis by pivoting columns

### Correlate with Verification Types and Quantity

In [None]:
##  Rating band correlates with verification type
train_df[["verify_by_phone", "score_band"]].groupby("score_band").verify_by_phone.mean()

score_band
fair    0.986755
good    0.994628
poor    0.986771
Name: verify_by_phone, dtype: float64

Phone verification is near-universal in every band (98.7% to 99.5%), so it does not distinguish between score bands

In [None]:
train_df[["verify_by_email", "score_band"]].groupby("score_band").verify_by_email.mean()

score_band
fair    0.920530
good    0.948799
poor    0.885840
Name: verify_by_email, dtype: float64

Weak relationship between score band and email verification: 88.6% (poor), 92.1% (fair), 94.9% (good)

In [None]:
train_df[["verify_by_gov_id", "score_band"]].groupby("score_band").verify_by_gov_id.mean()

score_band
fair    0.695364
good    0.821182
poor    0.572758
Name: verify_by_gov_id, dtype: float64

A relationship exists between score band and government ID verification: 57.3% (poor), 69.5% (fair), 82.1% (good)

In [None]:
train_df[["verify_by_work_email", "score_band"]].groupby("score_band").verify_by_work_email.mean()

score_band
fair    0.112583
good    0.159522
poor    0.102401
Name: verify_by_work_email, dtype: float64

Weak relationship between score band and work email verification: 10.2% (poor), 11.3% (fair), 16.0% (good)

In [None]:
train_df[["verify_by_identity_manual", "score_band"]].groupby("score_band").verify_by_identity_manual.mean()

score_band
fair    0.298013
good    0.380112
poor    0.233219
Name: verify_by_identity_manual, dtype: float64

Some relationship exists: 23.3% (poor), 29.8% (fair), 38.0% (good)

In [323]:
train_df[["verify_by_reviews", "score_band"]].groupby("score_band").verify_by_reviews.mean()

score_band
fair    0.549669
good    0.675145
poor    0.394904
Name: verify_by_reviews, dtype: float64

Some relationship exists: 39.5% (poor), 55.0% (fair), 67.5% (good)

In [340]:
train_df[["veri_counts", "score_band"]].groupby("score_band").veri_counts.mean()


score_band
fair    5.039735
good    5.763403
poor    4.425282
Name: veri_counts, dtype: float64

Weak relationship: the mean verification count rises from 4.43 (poor) to 5.04 (fair) to 5.76 (good), probably not significant
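
Whether a gap like this is significant can be checked rather than guessed. A minimal sketch of a two-sided permutation test on the difference in group means, using NumPy only; the function name `perm_test_mean_diff`, the seed, and the toy arrays are assumptions for illustration and are not part of the notebook. With three bands, one would run it pairwise or use a dedicated multi-group test instead.

```python
import numpy as np

def perm_test_mean_diff(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference mean(a) - mean(b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the gap.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        hits += diff >= observed
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Toy verification counts for two bands (made-up values).
good = np.tile([5, 6, 7], 20)
poor = np.tile([3, 4, 5], 20)
p_value = perm_test_mean_diff(good, poor)
print(p_value)
```

A small p-value says the gap is unlikely under random labelling; it does not say the gap is large enough to matter for modelling.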

In [342]:
## Relationship between government ID and verification counts
train_df[["veri_counts", "verify_by_gov_id"]].groupby("verify_by_gov_id").veri_counts.mean()


verify_by_gov_id
False    2.584543
True     6.363875
Name: veri_counts, dtype: float64

Hosts with government ID verification have about 2.5 times as many verifications on average (6.36 vs. 2.58) as hosts without it. So government ID is an influential factor.
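
The ratio can be checked directly from the two group means printed above. This is only the arithmetic; the variable names are chosen here for illustration.

```python
# Group means of veri_counts copied from the output above.
mean_with_gov_id = 6.363875
mean_without_gov_id = 2.584543

ratio = mean_with_gov_id / mean_without_gov_id
gap = mean_with_gov_id - mean_without_gov_id
print(round(ratio, 2), round(gap, 2))
```

If `veri_counts` includes government ID itself (an assumption, since its construction is not shown here), one of the extra verifications in the gap is mechanical rather than informative.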

Decision: Drop verify by email, phone, work email, and the number of verifications


In [344]:
# Reassigning the result of drop() avoids the in-place SettingWithCopyWarning
train_df = train_df.drop(columns=["verify_by_email", "verify_by_phone",
                                  "verify_by_work_email", "veri_counts"])


### Correlate with number of reviews

In [345]:
train_df[["score_band", "number_of_reviews"]].groupby("score_band").number_of_reviews.mean()

score_band
fair     3.152318
good    40.731718
poor     0.135718
Name: number_of_reviews, dtype: float64

Strong relationship with the number of reviews: good listings average 40.7 reviews, against 3.2 for fair and 0.14 for poor

In [346]:
train_df[["score_band", "number_of_reviews_l30d"]].groupby("score_band").number_of_reviews_l30d.mean()

score_band
fair    0.039735
good    0.359500
poor    0.000000
Name: number_of_reviews_l30d, dtype: float64

In [347]:
train_df[["score_band", "number_of_reviews_ltm"]].groupby("score_band").number_of_reviews_ltm.mean()

score_band
fair    0.284768
good    3.659138
poor    0.002450
Name: number_of_reviews_ltm, dtype: float64

Large differences across bands in reviews over the last twelve months (`number_of_reviews_ltm`), small differences in the last 30 days (`number_of_reviews_l30d`)

Decision: keep the twelve-month review count; drop the 30-day count and the cumulative number of reviews (the cumulative count depends on how long the host has been listed, per host_since, and the 30-day count is sensitive to short bursts of recent review activity)
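
The point about cumulative reviews depending on tenure can be made concrete by dividing the review count by the time a listing has been collecting reviews. A minimal sketch on toy data, assuming a hypothetical reference date `ref_date` (in practice something like the scrape date) and using `first_review` as the start of the window. The dataset already has a `reviews_per_month` column, so this only illustrates the idea.

```python
import pandas as pd

# Toy listings: an old listing, a recent one, and one without reviews.
toy = pd.DataFrame({
    "number_of_reviews": [120, 12, 0],
    "first_review": pd.to_datetime(["2015-01-01", "2021-01-01", None]),
})
ref_date = pd.Timestamp("2022-01-01")  # hypothetical reference date

# Average month length of 30.44 days; NaT propagates to NaN.
months_active = (ref_date - toy["first_review"]).dt.days / 30.44
toy["reviews_per_active_month"] = toy["number_of_reviews"] / months_active
print(toy)
```

The first toy listing has ten times as many reviews as the second, yet their monthly rates are of the same order, which is why the raw cumulative count is a poor quality signal.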

In [348]:
# Keep number_of_reviews_ltm, as decided above; drop the 30-day and cumulative counts
train_df = train_df.drop(columns=["number_of_reviews_l30d", "number_of_reviews"])


### Correlating score band with number of listings

In [349]:
## check if the calculated listing count is the sum of detailed types of listing counts
train_df.calculated_host_listings_count == train_df.calculated_host_listings_count_entire_homes+\
    train_df.calculated_host_listings_count_private_rooms+\
         train_df.calculated_host_listings_count_shared_rooms

0        True
1        True
2        True
3        True
4        True
         ... 
11308    True
11309    True
11310    True
11311    True
11312    True
Length: 11313, dtype: bool
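
The truncated Series above shows only the first and last rows, so it cannot confirm that every row matches. A minimal sketch of collapsing the same check into a single answer with `.all()`, on a toy frame with shortened, hypothetical column names:

```python
import pandas as pd

toy = pd.DataFrame({
    "total": [3, 1, 5],
    "entire_homes": [2, 1, 0],
    "private_rooms": [1, 0, 4],
    "shared_rooms": [0, 0, 1],
})

# Row-wise sum of the detailed counts, compared against the total.
parts = toy[["entire_homes", "private_rooms", "shared_rooms"]].sum(axis=1)
matches = toy["total"].eq(parts)

print(matches.all())   # True only if every row matches
print(toy[~matches])   # rows that violate the identity, if any
```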

In [352]:
train_df[["calculated_host_listings_count", "score_band"]].groupby("score_band").calculated_host_listings_count.mean()

score_band
fair    4.059603
good    3.927640
poor    2.296913
Name: calculated_host_listings_count, dtype: float64

No clear pattern: fair (4.06) and good (3.93) are close, and only poor is lower (2.30). Now pivot each listing quantity individually

In [367]:
train_df[["calculated_host_listings_count_entire_homes", "score_band"]].groupby("score_band").calculated_host_listings_count_entire_homes.mean()

score_band
fair    1.768212
good    2.531959
poor    1.257717
Name: calculated_host_listings_count_entire_homes, dtype: float64

In [355]:
train_df[["calculated_host_listings_count_entire_homes", "score_band"]].groupby("score_band").calculated_host_listings_count_entire_homes.sum()

score_band
fair      267
good    23094
poor     2567
Name: calculated_host_listings_count_entire_homes, dtype: int64

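Better-rated listings come from hosts with more entire-home listings on average (good 2.53, fair 1.77, poor 1.26); the sums mostly reflect how many listings fall in each band.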

In [370]:
train_df[["calculated_host_listings_count_shared_rooms", "score_band"]].groupby("score_band").\
    calculated_host_listings_count_shared_rooms.mean()

score_band
fair    0.251656
good    0.035194
poor    0.065164
Name: calculated_host_listings_count_shared_rooms, dtype: float64

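No obvious correlation: fair-rated listings have the highest average (0.25), while good (0.04) and poor (0.07) are both near zero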

In [361]:
train_df[["calculated_host_listings_count_private_rooms", "score_band"]].groupby("score_band").\
    calculated_host_listings_count_private_rooms.sum()

score_band
fair      293
good    11861
poor     1948
Name: calculated_host_listings_count_private_rooms, dtype: int64

In [368]:
train_df[["calculated_host_listings_count_private_rooms", "score_band"]].groupby("score_band").\
    calculated_host_listings_count_private_rooms.mean()

score_band
fair    1.940397
good    1.300406
poor    0.954434
Name: calculated_host_listings_count_private_rooms, dtype: float64

Only a few fair-rated listings exist (151), but their hosts average more private and shared rooms. In contrast, the good band has lower per-host averages, but with 9,121 listings its totals are much larger.
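
The interplay between group size, mean, and sum is easier to read when all three are computed side by side. A minimal sketch with `agg` on toy data; the shortened column name `private_rooms` and the values are made up for illustration.

```python
import pandas as pd

# Toy data: a small "fair" group with high values, a large "good" group with low values.
toy = pd.DataFrame({
    "score_band": ["fair"] * 2 + ["good"] * 8,
    "private_rooms": [3, 1, 1, 1, 2, 1, 0, 1, 2, 0],
})

# count, mean, and sum in one table: sum = count * mean for each band.
summary = toy.groupby("score_band")["private_rooms"].agg(["count", "mean", "sum"])
print(summary)
```

Here the fair group has the higher mean but the smaller sum, simply because it contains fewer rows, which is the same effect seen in the real bands.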

Decision: keep only the entire-home listing count

In [371]:
# Reassigning the result of drop() avoids the in-place SettingWithCopyWarning
train_df = train_df.drop(columns=["calculated_host_listings_count",
                                  "calculated_host_listings_count_shared_rooms",
                                  "calculated_host_listings_count_private_rooms"])


In [372]:
train_df.shape

(11313, 35)