# Exploratory Data Analysis for Vancouver AirBnB Dataset

The purpose of this exploratory data analysis (EDA) is to understand:
- **Preliminary feature selection**: Our dataset contains many features, many of which are categorical variables. Which variables should we select (based on our understanding of AirBnB and collective personal experience) to begin training our model with?
- **The composition of our dataset**: What are the types of hosts and properties that are represented? This will help us understand the validity of the model we will develop to predict an appropriate nightly price for a new AirBnB property in Vancouver. 
- **Missing data and data preprocessing required**: How much missing data is there? Is it an acceptable amount and how will we treat missing data? What is the range of values in our features? Will we have to preprocess to normalize these values for our model to work well?


## Imports and reading in the data

Note that the data is not included in the GitHub repository. To reproduce this notebook, first run the `data.py` script to download the data to your local machine.

In [31]:
# Imports
import numpy as np
import pandas as pd
import altair as alt
import os

import pandas_profiling
from sklearn.model_selection import train_test_split

In [34]:
#Loading the data set
data = pd.read_csv(os.path.join('data', 'listings.csv.gz'))

In [28]:
data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,10080,https://www.airbnb.com/rooms/10080,20191109094845,2019-11-09,D1 - Million Dollar View 2 BR,"Stunning two bedroom, two bathroom apartment. ...","Bed setup: 2 x queen, option to add up to 2 tw...","Stunning two bedroom, two bathroom apartment. ...",none,,...,t,f,strict_14_with_grace_period,f,f,43,43,0,0,0.16
1,13188,https://www.airbnb.com/rooms/13188,20191109094845,2019-11-09,Garden level studio in ideal loc.,Garden level studio suite with garden patio - ...,Very Close (3min walk) to Nat Bailey baseball ...,Garden level studio suite with garden patio - ...,none,The uber hip Main street area is a short walk ...,...,t,f,moderate,f,f,1,1,0,0,1.9
2,13357,https://www.airbnb.com/rooms/13357,20191109094845,2019-11-09,! Wow! 2bed 2bath 1bed den Harbour View Apartm...,Very spacious and comfortable with very well k...,"Mountains and harbour view 2 bedroom,2 bath,1 ...",Very spacious and comfortable with very well k...,none,Amanzing bibrant professional neighbourhood. C...,...,f,f,strict_14_with_grace_period,t,t,3,1,2,0,0.48
3,13490,https://www.airbnb.com/rooms/13490,20191109094845,2019-11-09,Vancouver's best kept secret,This apartment rents for one month blocks of t...,"Vancouver city central, 700 sq.ft., main floor...",This apartment rents for one month blocks of t...,none,"In the heart of Vancouver, this apartment has ...",...,f,f,strict_14_with_grace_period,f,f,1,1,0,0,0.82
4,14267,https://www.airbnb.com/rooms/14267,20191109094845,2019-11-09,EcoLoft Vancouver,"The Ecoloft is located in the lovely, family r...",West Coast Modern Laneway House Loft: We call ...,"The Ecoloft is located in the lovely, family r...",none,We live in the centre of the city of Vancouver...,...,t,f,strict_14_with_grace_period,f,f,1,1,0,0,0.28


## Preliminary feature selection

Let's understand the number of data points and features our dataset has.

In [22]:
print("Our dataset has", data.shape[0],"number of rows and", data.shape[1],"number of columns")

Our dataset has 6181  number of rows and 106  number of columns


As we expected, our dataset has many more features than we would want to start with in our baseline model. Simpler models are generally easier to interpret, so we want to prune our dataset to only include features that:
1. We would reasonably expect to know when considering setting up a new AirBnB property
2. We believe matter most to nightly price, based on our understanding of the AirBnB booking system

We decided on three categories of features: 
1. **Host-related information** such as host response rate to requests, whether the host is a superhost, whether the host identity has been verified
2. **Property-related information** such as property type, the neighborhood, number of people who can be accommodated, number of bathrooms, bedrooms and beds
3. **Booking-related information** such as whether the property can be instantly booked, the cancellation policy



The dataset also offers several potential price-related target variables, including monthly, weekly and nightly price. We chose nightly price as our target variable since we believe it is the one most commonly used when booking a property.

We select these features and our target variable from our data below. Then we split our data into X and y training and test sets before doing further EDA. 

In [35]:
# Preprocessing part

#List of chosen features
selected = ['id', 'host_id', 'host_response_rate', 'host_is_superhost', 'property_type', 
             'host_identity_verified', 'neighbourhood_cleansed', 'instant_bookable', 'cancellation_policy', 
            'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price'
           ]

data.drop(data.columns.difference(selected), axis = 1, inplace = True)

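#Changing format of price column 
#regex=False treats '$' as a literal character, not the regex end-of-string anchor
data.price = data.price.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)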

#Changing format of host response rate column
data['host_response_rate'] = data['host_response_rate'].str.rstrip('%').astype('float') / 100.0

#Splitting data into test and train 
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = 'price'), 
                                                    data[['price']], 
                                                    test_size = 0.2)
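
To illustrate what the two format conversions above do, here is a minimal sketch on a few made-up values (the sample strings are assumptions, not rows from the dataset). Passing `regex=False` makes pandas treat `$` as a literal character rather than as the regular-expression end-of-string anchor.

```python
import pandas as pd

# Toy values invented for illustration; not taken from listings.csv.gz
toy = pd.DataFrame({
    "price": ["$85.00", "$1,250.00", "$40.00"],
    "host_response_rate": ["100%", "75%", None],
})

# Strip the currency symbol and thousands separator, then convert to float
toy["price"] = (
    toy["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Convert percentage strings to fractions between 0 and 1; missing values stay NaN
toy["host_response_rate"] = (
    toy["host_response_rate"].str.rstrip("%").astype(float) / 100.0
)

print(toy)
```

Missing response rates remain missing after the conversion, so they still show up in any later missing-data check.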

In [8]:
X_train.head()

Unnamed: 0,id,host_id,host_response_rate,host_is_superhost,host_identity_verified,neighbourhood_cleansed,property_type,accommodates,bathrooms,bedrooms,beds,instant_bookable,cancellation_policy
2939,27225603,180741245,100%,f,f,Shaughnessy,Serviced apartment,3,1.0,1.0,2.0,f,moderate
4538,35262414,265475133,100%,t,f,Downtown,Apartment,3,1.0,1.0,1.0,f,strict_14_with_grace_period
4452,34989892,156460473,100%,f,t,Oakridge,House,2,2.0,1.0,1.0,t,moderate
4633,35573546,11080535,75%,f,t,Downtown Eastside,Apartment,2,1.0,1.0,1.0,t,moderate
1011,12005783,64206430,100%,f,t,Downtown,Apartment,2,1.0,1.0,1.0,f,strict_14_with_grace_period


## Composition of our dataset and missing data

To begin our EDA, we generate a Pandas Profiling report on the training features. We want to understand the composition of the hosts, properties and bookings in our dataset.

In [36]:
pandas_profiling.ProfileReport(X_train)



**Hosts:**

**Properties:** 

**Bookings:**

## Missing data

We can see that our data has missing values that we will have to handle before fitting whichever model we choose. The features with the highest percentage of missing values are `bedrooms`, `beds` and `host_response_rate`, two of which are expected to have a high influence on the predicted value.
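
The share of missing values per feature can also be checked directly, without the profile report. The sketch below uses a small invented frame as a stand-in for `X_train`; the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train; the values are invented for illustration
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 2.0, 1.0],
    "beds": [1.0, 2.0, np.nan, np.nan],
    "accommodates": [2, 4, 3, 2],
})

# Percentage of missing values per column, largest first
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applying the same expression to `X_train` would give the per-feature percentages to compare against the report.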

## Further feature processing

With the preprocessing done and the data collected into a reasonable, explorable format, it is time to answer the most important questions: what is our research question, and will this dataset help us find the answer?
In our proposal we came up with the question: How high should one's price per night of stay be for a rental unit in Vancouver? The wrangled dataset now contains features that will indisputably influence the answer (such as the number of rooms or beds in a unit) and features whose influence we are curious about (e.g. whether a host has 'superhost' status). We therefore believe that the dataset is reasonable for the research question.

In [7]:
X_train['neighbourhood_cleansed'].unique()

array(['Shaughnessy', 'Downtown', 'Oakridge', 'Downtown Eastside',
       'Mount Pleasant', 'Hastings-Sunrise', 'Kitsilano',
       'Dunbar Southlands', 'Renfrew-Collingwood', 'Marpole', 'West End',
       'Grandview-Woodland', 'Kensington-Cedar Cottage', 'Riley Park',
       'Kerrisdale', 'West Point Grey', 'Fairview', 'Killarney',
       'Arbutus Ridge', 'Victoria-Fraserview', 'South Cambie', 'Sunset',
       'Strathcona'], dtype=object)

One of the most challenging and variable parts of the data is the number of neighbourhoods that AirBnB classifies in Vancouver and the greater area, which is 23. A potential solution is to combine neighbouring ones into larger areas, at the cost of making the model less specific.
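
Combining neighbourhoods could be done with a simple lookup table. The sketch below is only an illustration: the area labels and the assignment of neighbourhoods to them are hypothetical, and a real grouping would need to be decided from a map of Vancouver.

```python
import pandas as pd

# Hypothetical grouping for illustration only; the area labels and the
# assignment of neighbourhoods to them are assumptions, not part of the dataset
area_map = {
    "Downtown": "Central",
    "West End": "Central",
    "Kitsilano": "West Side",
    "West Point Grey": "West Side",
    "Hastings-Sunrise": "East Side",
    "Grandview-Woodland": "East Side",
}

# Toy rows invented for illustration
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Downtown", "Kitsilano", "Marpole", "West End"]
})

# Map each neighbourhood to its broader area; unmapped ones fall back to "Other"
toy["area"] = toy["neighbourhood_cleansed"].map(area_map).fillna("Other")
print(toy["area"].value_counts())
```

Using an explicit `"Other"` fallback keeps rows with unmapped neighbourhoods in the data instead of turning them into missing values.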

Property type has 21 unique values, so it might be reasonable to keep only a few of these categories for the purpose of this project.
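
One possible way to reduce the number of property types is to keep the frequent ones and lump the rest into a single category. This is a minimal sketch on invented values; the `min_count` cutoff is a hypothetical parameter that would need to be tuned on the real data.

```python
import pandas as pd

# Toy property types invented for illustration
toy = pd.Series(
    ["Apartment", "House", "Apartment", "Boat", "House", "Apartment", "Cabin"],
    name="property_type",
)

# Hypothetical cutoff: keep categories seen at least this many times
min_count = 2

counts = toy.value_counts()
frequent = counts[counts >= min_count].index

# Replace every infrequent category with a single "Other" label
lumped = toy.where(toy.isin(frequent), "Other")
print(lumped.value_counts())
```

To avoid leaking information from the test set, the list of frequent categories would be computed on the training data only and then applied to both splits.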