In [1]:
# Imports

import numpy as np
import pandas as pd

import pandas_profiling
from sklearn.model_selection import train_test_split

In [2]:
# Loading the data set (forward slash keeps the path portable across operating systems)
data = pd.read_csv('data/listings.csv')

In [3]:
data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,10080,https://www.airbnb.com/rooms/10080,20191109094845,2019-11-09,D1 - Million Dollar View 2 BR,"Stunning two bedroom, two bathroom apartment. ...","Bed setup: 2 x queen, option to add up to 2 tw...","Stunning two bedroom, two bathroom apartment. ...",none,,...,t,f,strict_14_with_grace_period,f,f,43,43,0,0,0.16
1,13188,https://www.airbnb.com/rooms/13188,20191109094845,2019-11-09,Garden level studio in ideal loc.,Garden level studio suite with garden patio - ...,Very Close (3min walk) to Nat Bailey baseball ...,Garden level studio suite with garden patio - ...,none,The uber hip Main street area is a short walk ...,...,t,f,moderate,f,f,1,1,0,0,1.9
2,13357,https://www.airbnb.com/rooms/13357,20191109094845,2019-11-09,! Wow! 2bed 2bath 1bed den Harbour View Apartm...,Very spacious and comfortable with very well k...,"Mountains and harbour view 2 bedroom,2 bath,1 ...",Very spacious and comfortable with very well k...,none,Amanzing bibrant professional neighbourhood. C...,...,f,f,strict_14_with_grace_period,t,t,3,1,2,0,0.48
3,13490,https://www.airbnb.com/rooms/13490,20191109094845,2019-11-09,Vancouver's best kept secret,This apartment rents for one month blocks of t...,"Vancouver city central, 700 sq.ft., main floor...",This apartment rents for one month blocks of t...,none,"In the heart of Vancouver, this apartment has ...",...,f,f,strict_14_with_grace_period,f,f,1,1,0,0,0.82
4,14267,https://www.airbnb.com/rooms/14267,20191109094845,2019-11-09,EcoLoft Vancouver,"The Ecoloft is located in the lovely, family r...",West Coast Modern Laneway House Loft: We call ...,"The Ecoloft is located in the lovely, family r...",none,We live in the centre of the city of Vancouver...,...,t,f,strict_14_with_grace_period,f,f,1,1,0,0,0.28


In [4]:
len(data.columns)

106

After a quick look at the number of features, we realized that 106 columns is far more than our research question needs, so our team decided to manually select features based on relevance and professional judgement.

In [5]:
# Preprocessing

# List of chosen features
selected = ['id', 'host_id', 'host_response_rate', 'host_is_superhost', 'property_type', 
             'host_identity_verified', 'neighbourhood_cleansed', 'instant_bookable', 'cancellation_policy', 
            'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price'
           ]

data.drop(data.columns.difference(selected), axis = 1, inplace = True)

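# Convert price from a string with '$' and thousands separators to float;
# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
data.price = data.price.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)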

# Splitting data into train and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = 'price'), 
                                                    data[['price']], 
                                                    test_size = 0.2)

With the data preprocessed and collected into a reasonable, explorable format, it's time to answer the most important questions: what is our research question, and will this data set help us find the answer?
In our proposal we came up with the question: How high should one's price per night of stay be for a rental unit in Vancouver? The wrangled data set now contains features that will indisputably influence the answer (like the number of rooms or beds in a unit) and features whose influence we're curious about (e.g. whether a host has 'superhost' status). So we believe that the data set is reasonable for the research question.

In [6]:
X_train['neighbourhood_cleansed'].unique()

array(['Downtown', 'Victoria-Fraserview', 'Kensington-Cedar Cottage',
       'West End', 'Kitsilano', 'Arbutus Ridge', 'Sunset',
       'Renfrew-Collingwood', 'Mount Pleasant', 'Dunbar Southlands',
       'Riley Park', 'Marpole', 'Strathcona', 'Downtown Eastside',
       'Hastings-Sunrise', 'Grandview-Woodland', 'Fairview', 'Killarney',
       'Kerrisdale', 'South Cambie', 'Oakridge', 'West Point Grey',
       'Shaughnessy'], dtype=object)

In [7]:
X_train.head()

Unnamed: 0,id,host_id,host_response_rate,host_is_superhost,host_identity_verified,neighbourhood_cleansed,property_type,accommodates,bathrooms,bedrooms,beds,instant_bookable,cancellation_policy
2475,24186439,4502616,100%,t,f,Downtown,Condominium,2,1.0,1.0,2.0,t,strict_14_with_grace_period
4914,36422330,218797364,100%,t,f,Victoria-Fraserview,House,1,1.0,1.0,1.0,f,flexible
2999,27409742,141025179,100%,t,f,Downtown,Condominium,16,1.0,3.0,7.0,t,strict_14_with_grace_period
925,10618317,36640532,,f,t,Kensington-Cedar Cottage,House,4,1.0,1.0,1.0,f,moderate
2878,26853674,61155359,100%,f,t,Downtown,Apartment,6,1.0,2.0,4.0,t,strict_14_with_grace_period


In [8]:
pandas_profiling.ProfileReport(X_train)



Above is the pandas profiling report, which works as a good starting point for our EDA. It shows that our data has missing values that we'll have to deal with later, depending on the model we choose. The highest percentages of missing values are in the features 'bedrooms', 'beds' and 'host_response_rate', two of which are expected to have a high influence on the predicted value. Property type has 21 unique values, so it might be reasonable to keep only a few of them for the purpose of this project.
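The two observations above (missing values and the many property types) can also be checked and acted on directly in pandas. The cell below is only a sketch on a toy frame: the values are made up for illustration, and the helper `collapse_rare`, its `top_n` cutoff and the 'Other' label are our own assumptions rather than a decision we have made for the model.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train; the values are illustrative, not taken from listings.csv
toy = pd.DataFrame({
    'bedrooms': [1.0, np.nan, 2.0, 1.0, np.nan, 3.0],
    'beds': [1.0, 2.0, np.nan, 1.0, 2.0, 4.0],
    'property_type': ['House', 'Apartment', 'House', 'Boat', 'Apartment', 'Tiny house'],
})

# Percentage of missing values per column, highest first
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

def collapse_rare(series, top_n=2, other_label='Other'):
    """Keep the top_n most frequent categories and relabel everything else.

    Hypothetical helper; note that missing values are also relabelled.
    """
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), other_label)

toy['property_type_grouped'] = collapse_rare(toy['property_type'])
print(missing_pct)
print(toy['property_type_grouped'].value_counts())
```

On the real data the same calls would run on X_train, with `top_n` chosen after looking at the category frequencies in the report.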

In [9]:
X_train['neighbourhood_cleansed'].unique()

array(['Downtown', 'Victoria-Fraserview', 'Kensington-Cedar Cottage',
       'West End', 'Kitsilano', 'Arbutus Ridge', 'Sunset',
       'Renfrew-Collingwood', 'Mount Pleasant', 'Dunbar Southlands',
       'Riley Park', 'Marpole', 'Strathcona', 'Downtown Eastside',
       'Hastings-Sunrise', 'Grandview-Woodland', 'Fairview', 'Killarney',
       'Kerrisdale', 'South Cambie', 'Oakridge', 'West Point Grey',
       'Shaughnessy'], dtype=object)

One of the most challenging and variable parts of the data is the number of neighbourhoods that Airbnb distinguishes in Vancouver and the greater area, which is 23. A potential solution is to combine neighbouring ones into bigger areas, which would make the model less specific.
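Combining neighbourhoods could be done with a plain lookup table, as sketched below. The area names and the assignment of neighbourhoods to them are hypothetical and cover only four of the 23 values; a real grouping would need to be agreed on by the team. The helper `to_area` and its 'Other' fallback are likewise assumptions made for this example.

```python
import pandas as pd

# Hypothetical grouping; the area labels and assignments are illustrative only
area_map = {
    'Downtown': 'Central',
    'West End': 'Central',
    'Kitsilano': 'West Side',
    'West Point Grey': 'West Side',
}

def to_area(neighbourhoods, mapping, default='Other'):
    """Map each neighbourhood to its area; unmapped values fall back to default."""
    return neighbourhoods.map(mapping).fillna(default)

sample = pd.Series(['Downtown', 'Kitsilano', 'Sunset', 'West End'],
                   name='neighbourhood_cleansed')
areas = to_area(sample, area_map)
print(areas.tolist())
```

Applied to the training set, this would look like `X_train['area'] = to_area(X_train['neighbourhood_cleansed'], area_map)`, where `area` is a new column name we would introduce.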