# Capstone Project | Battle of the Neighbourhoods
-----------
## Seattle's Housing Market & Airbnb
_By Konstantina Vasileva_ | _See blog post on [Medium](https://medium.com/@Konstantinna/a-tale-of-two-pacific-cities-exploring-the-effects-of-airbnb-listings-on-the-rental-market-e4c806f65ebb)_


<img src="https://images.unsplash.com/photo-1516156008625-3a9d6067fab5?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1350&q=80">




##  Introduction
### Background
Seattle has changed significantly since Amazon moved into its South Lake Union headquarters in 2010. Prices and rents skyrocketed, sending the city's housing market into a downward spiral. Over the past two years this trend has started to taper off, owing to a surge in new construction and a slow-down in rent growth. Nevertheless, housing remains an issue. A combination of factors shapes the housing market in Seattle and other big cities across the world: land availability (constrained by geography and urban planning), housing subsidies and taxes, increasingly exorbitant market rents, housing supply, mortgage interest rates, construction costs, and internal and international migration. Supply is also increasingly affected by demand from foreign investors and private buyers (primarily from China). Last but not least, the growth of services like Airbnb affects supply by taking entire properties off the conventional rental market.

###  Problem 
Renting a place to live in major developed cities like Seattle is increasingly difficult. Analysing all variables that affect rental prices is beyond the scope of this capstone project, so at this first stage I will focus on claims that Airbnb listings exacerbate the housing crisis by "squandering precious long-term rental housing stock". So far there has not been a proper comparison between entire place listings (which do take properties out of the long-term rental market) and single room listings, which people rent out on Airbnb to supplement their income while they still live on the property.
A higher number of listings corresponding to higher rental prices in a neighbourhood might be due to factors other than a supply shortage. For example, investors and private owners of properties in expensive neighbourhoods might be more tempted to list one or more of them on Airbnb. It is therefore interesting to check whether single room and entire place Airbnb clusters correspond to different rental price clusters across neighbourhoods.

### Interest
Housing issues and soaring rental prices are increasingly becoming a problem across developed cities around the world. Open data projects and data insights on the topic can be used to inform public policy or generate productive debate on the topic.


##  Data Acquisition & Cleaning
###  Data Sources


* _Airbnb & GeoData_: [Inside Airbnb](http://insideairbnb.com/get-the-data.html) is an independent, non-commercial Open Source data tool which provides Airbnb listings data to the public. I used it to download a .csv with current Airbnb listings in Seattle (updated in September 2019).  __It also features geo coordinates and neighbourhoods__, which I supplemented with a [geojson file from SeattleIO on github](https://github.com/seattleio/seattle-boundaries-data/blob/master/data/neighborhoods.geojson).
* _Rental data_: collected from [RentCafe](https://www.rentcafe.com/average-rent-market-trends/us/wa/seattle/)

In [None]:
# Step 1 | Loading the key libraries
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

!conda install -c anaconda pywget

In [3]:
!wget -O airbnb.csv http://data.insideairbnb.com/united-states/wa/seattle/2019-09-22/visualisations/listings.csv
!wget -O rent.csv https://raw.githubusercontent.com/Konstantinna/Coursera_Capstone/master/rent_seattle.csv


'wget' is not recognized as an internal or external command,
operable program or batch file.
'wget' is not recognized as an internal or external command,
operable program or batch file.
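
The `wget` error above indicates that the shell command is not available on this machine (it is typically missing on Windows). A minimal sketch of a pure-Python alternative, assuming the same two URLs as in the cell above; the `download` helper and the `SOURCES` dictionary are hypothetical names, not part of the original notebook:

```python
import urllib.request
from pathlib import Path

# Same URLs as in the wget cell; the target file names match the later read_csv calls.
SOURCES = {
    "airbnb.csv": "http://data.insideairbnb.com/united-states/wa/seattle/2019-09-22/visualisations/listings.csv",
    "rent.csv": "https://raw.githubusercontent.com/Konstantinna/Coursera_Capstone/master/rent_seattle.csv",
}


def download(url: str, dest: str) -> Path:
    """Fetch `url` and save the response body to `dest`; return the saved path."""
    path = Path(dest)
    with urllib.request.urlopen(url) as response:
        path.write_bytes(response.read())
    return path

# To fetch both files (requires network access):
# for name, url in SOURCES.items():
#     download(url, name)
```

Because it relies only on the standard library, this approach should behave the same on Windows, macOS and Linux, provided the URLs are still reachable.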


In [3]:
df = pd.read_csv('airbnb.csv')

for col in df.columns: 
    print(col)

df.head()

id
name
host_id
host_name
neighbourhood_group
neighbourhood
latitude
longitude
room_type
price
minimum_nights
number_of_reviews
last_review
reviews_per_month
calculated_host_listings_count
availability_365


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2318,Casa Madrona - Urban Oasis 1 block from the park!,2536,Megan,Central Area,Madrona,47.61082,-122.29082,Entire home/apt,296,30,28,2019-08-30,0.21,2,84
1,5682,"Cozy Studio, min. to downtown -WiFi",8993,Maddy,Delridge,South Delridge,47.52398,-122.35989,Entire home/apt,48,3,462,2018-11-24,3.99,1,0
2,6606,"Fab, private seattle urban cottage!",14942,Joyce,Other neighborhoods,Wallingford,47.65411,-122.33761,Entire home/apt,90,2,147,2019-09-07,1.19,3,85
3,9419,Glorious sun room w/ memory foambed,30559,Angielena,Other neighborhoods,Georgetown,47.55062,-122.32014,Private room,62,2,144,2019-09-02,1.29,8,365
4,9460,Downtown Convention Center B&B -- Free Minibar,30832,Siena,Downtown,First Hill,47.61265,-122.32936,Private room,99,3,443,2019-09-02,3.62,4,150


In [4]:
df_rent = pd.read_csv('rent.csv')
df_rent.head()

Unnamed: 0,Neighborhood,AVGrent
0,The Highlands,"$1,295"
1,Richmond Beach,"$1,295"
2,Innis Arden,"$1,295"
3,Rainier View,"$1,379"
4,Zenith,"$1,466"


In [None]:
# download Seattle neighbourhoods geojson file
!wget --quiet https://github.com/seattleio/seattle-boundaries-data/raw/master/data/neighborhoods.geojson
    
print('GeoJSON file downloaded!')

GeoJSON file downloaded!


###  Data Cleaning  & Feature Selection

Since the rent dataset names its neighbourhood column differently (`Neighborhood`), I renamed it to `neighbourhood` so that the rent data can be merged with the Airbnb listings. The Airbnb data included features which are not relevant to the current analysis, so I dropped the columns `name` (the listing name), `host_name`, `minimum_nights`, `number_of_reviews`, `last_review`, `reviews_per_month` and `availability_365`.



In [None]:
df_rent.rename(columns={"Neighborhood":"neighbourhood"}, inplace = True)
df_rent.head()

Unnamed: 0,neighbourhood,AVGrent
0,The Highlands,"$1,295"
1,Richmond Beach,"$1,295"
2,Innis Arden,"$1,295"
3,Rainier View,"$1,379"
4,Zenith,"$1,466"


The remaining features `neighbourhood_group`, `neighbourhood`, `latitude`, `longitude`, `room_type` and `price` are directly relevant to analysing whether entire place or single room listings are clustered in neighbourhoods with higher rental prices. What is more, the columns `host_id` and `calculated_host_listings_count` identify property owners with multiple listed properties. This matters because the relationship between Airbnb listings and rising neighbourhood prices may go both ways: owners with higher income and more properties are likely to list property in neighbourhoods where they can earn more from it.
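
As a rough illustration of how `calculated_host_listings_count` can flag multi-listing hosts, here is a minimal sketch using the five sample rows shown in the output further down. The `multi_host` column and the `multi_share` variable are hypothetical names introduced only for this example:

```python
import pandas as pd

# Five sample listings copied from the notebook output (not the full dataset).
sample = pd.DataFrame({
    "host_id": [8993, 14942, 30559, 30832, 31481],
    "neighbourhood": ["South Delridge", "Wallingford", "Georgetown",
                      "First Hill", "Fairmount Park"],
    "calculated_host_listings_count": [1, 3, 8, 4, 2],
})

# A listing belongs to a multi-listing host when that host has more than one listing.
sample["multi_host"] = sample["calculated_host_listings_count"] > 1

# Share of the sample listings that belong to multi-listing hosts.
multi_share = sample["multi_host"].mean()
print(multi_share)
```

On the full dataframe the same flag could be averaged per neighbourhood with `groupby('neighbourhood')`.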

Cleaning the dataframe and __removing columns with data which is not relevant for the current analysis__

In [None]:
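# keep only the columns needed for the analysis
df = pd.DataFrame(df, columns=['id','neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'host_id', 'calculated_host_listings_count'])
# NOTE: read_csv already consumed the header line, so this removes the first listing (index 0), not a header row
df.drop(df.index[0],inplace=True)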

df.head()

#Optional code for cleaning Missing data (this database does not have missing values)
#clean = df[df['name'] == 'Not assigned'].index
#df.drop(clean, inplace= True)

#Optional grouping
#df_grouped = df.groupby('neighbourhood_group').agg({'neighbourhood': ','.join}).reset_index()
#df_grouped.head()

Unnamed: 0,id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,host_id,calculated_host_listings_count
1,5682,Delridge,South Delridge,47.52398,-122.35989,Entire home/apt,48,8993,1
2,6606,Other neighborhoods,Wallingford,47.65411,-122.33761,Entire home/apt,90,14942,3
3,9419,Other neighborhoods,Georgetown,47.55062,-122.32014,Private room,62,30559,8
4,9460,Downtown,First Hill,47.61265,-122.32936,Private room,99,30832,4
5,9531,West Seattle,Fairmount Park,47.55539,-122.38474,Entire home/apt,165,31481,2


In [None]:
# Removing Hotel & Shared rooms
indexNames = df[ df['room_type'] == 'Hotel room' ].index
indexNames2 = df[ df['room_type'] == 'Shared room' ].index
# Delete these row indexes from dataFrame
df.drop(indexNames, inplace=True)
df.drop(indexNames2, inplace=True)


The resulting dataframe has 8728 rows and 9 columns (the header row is not counted in the shape):

In [None]:
for col in df.columns: 
    print(col)
    
df.shape

id
neighbourhood_group
neighbourhood
latitude
longitude
room_type
price
host_id
calculated_host_listings_count


(8728, 9)

In [None]:
#Looking for NaN/Null values & Counting them
#pd.DataFrame.isna(df)
#df.isnull().sum()
#df.isnull().sum()

#Random Sample from the 9K listings
#dfr = pd.DataFrame.sample(df, n = 500)
#dfr.head()

# one hot encoding of the room type (one 0/1 indicator column per category)
seattle_onehot = pd.get_dummies(df['room_type'], dtype=int)

# attach the indicator columns row by row (index-aligned);
# merging on 'room_type' would pair every listing with every other listing of the same type
df = pd.concat([df, seattle_onehot], axis=1)

df.head()

In [1]:
df.rename(columns={'Entire home/apt':'entire','Private room':'part'}, inplace=True)
df.head()



### I need to further clean up the data as I want to analyse only neighbourhoods for which I know the latest average rental price. I also want a cleaned list grouped by neighbourhood in addition to my full list by listing item.

In [None]:
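# mean of the numeric columns per neighbourhood (text columns are skipped)
seattle_meanprice = df.groupby('neighbourhood').mean(numeric_only=True).reset_index()
# listing and host ids are identifiers, so their means carry no meaning
seattle_meanprice.drop(['host_id', 'id'], axis = 1, inplace = True)
seattle_meanprice.head()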

In [None]:
# count the unique listings (id) and unique hosts (host_id) in each neighbourhood
seattle_hosts = df.groupby('neighbourhood')[['id', 'host_id']].nunique()

#rename host_id count to number of hosts in this neighbourhood
seattle_hosts.rename(columns={'host_id':'host_num'}, inplace=True)
seattle_hosts.head()


In [None]:
#Merge hosts and mean price data
dfmerge = pd.merge(seattle_meanprice, seattle_hosts, how ='inner', on ='neighbourhood') 
dfmerge.head()

In [None]:
#Group features back together and sort by neighbourhood
seattle_grouped = df.groupby('neighbourhood').count().reset_index()

#create a clean database by merging the csv with rent data and the cleaned listings csv

dfclean = pd.merge(df_rent, dfmerge, how ='inner', on ='neighbourhood') 


In [None]:
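# AVGrent is stored as text such as "$1,295", so convert it to a number first
dfclean['AVGrent'] = pd.to_numeric(
    dfclean['AVGrent'].astype(str).str.replace(r'[$,]', '', regex=True), errors='coerce')
# clean up entries without AVGrent data (missing or zero)
dfclean = dfclean[dfclean['AVGrent'] > 0].copy()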
dfclean.head()

In [None]:
#Rename the merged columns
dfclean.rename(columns={'price':'AVGprice','calculated_host_listings_count':'AVGhostlistings','id':'listing_num'}, inplace=True)
dfclean.head()

### Ultimately, there is rental data for 81 out of 88 neighbourhoods in the Airbnb listing database for Seattle.
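
One way to see which neighbourhoods lack rental data is a left merge with `indicator=True`. The sketch below uses a few neighbourhood names from the outputs above as toy inputs; the frames `listing_names` and `rent_names` are hypothetical stand-ins for the real dataframes:

```python
import pandas as pd

# Toy stand-ins: neighbourhoods present in the listings vs. in the rent table.
listing_names = pd.DataFrame({"neighbourhood": ["Madrona", "Wallingford", "Georgetown"]})
rent_names = pd.DataFrame({"neighbourhood": ["Madrona", "Wallingford", "Zenith"]})

# indicator=True adds a `_merge` column telling where each row was found.
check = listing_names.merge(rent_names, on="neighbourhood", how="left", indicator=True)

# Listing neighbourhoods without a match in the rent table.
unmatched = check.loc[check["_merge"] == "left_only", "neighbourhood"].tolist()
matched_count = int((check["_merge"] == "both").sum())
print(f"{matched_count} of {len(check)} neighbourhoods have rent data; missing: {unmatched}")
```

Inspecting the unmatched names can also reveal spelling differences between the two sources that an inner merge would silently drop.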

In [None]:
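dffinal = pd.merge(df_rent, df, how ='inner', on ='neighbourhood') 
# convert AVGrent from text such as "$1,295" to a number, then drop entries without rent data
dffinal['AVGrent'] = pd.to_numeric(
    dffinal['AVGrent'].astype(str).str.replace(r'[$,]', '', regex=True), errors='coerce')
dffinal = dffinal[dffinal['AVGrent'] > 0].copy()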

dffinal.rename(columns={'price':'AVGprice','calculated_host_listings_count':'AVGhostlistings','id':'listing_num'}, inplace=True)

dffinal.head()

### Map of Seattle

To map the analysed data, we first need a Folium map of the city, centred on its geographic coordinates.

In [None]:
#Conda Folium Install
#!conda install -c conda-forge folium=0.5.0 --yes 

#pip alternative install (must run before the imports below)
!pip install folium
import folium # map rendering library
import os

#PLUGINS
from folium import plugins
#from folium.plugins import MeasureControl
#from folium.plugins import FloatImage
#import datetime

print('Libraries imported.')


In [None]:
#Seattle coordinates
latitude = 47.6062
longitude = -122.335167

# create map and display it
seattle_map = folium.Map(location=[latitude, longitude], zoom_start=13,
                        tiles = 'Stamen Terrain')


We can also map listings on the map, using different colours for _entire place_ and _single room_ listings

In [None]:
seattle_geo = r'neighborhoods.geojson' # geojson file
#from folium import plugins

airbnb = folium.Map(location = [latitude, longitude], zoom_start = 12,
                   tiles='Stamen Terrain')

# one marker colour per room type
room_colors = {'Entire home/apt': 'red', 'Private room': 'blue'}

# loop through the listings and add each one to the map
for lat, lng, room in zip(dffinal.latitude, dffinal.longitude, dffinal.room_type):
    airbnb.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color=room_colors.get(room, 'gray'),
            fill=True,
            fill_color=room_colors.get(room, 'gray'),
            fill_opacity=0.6
        )
    )

# display the map
airbnb

In [None]:
listings = plugins.MarkerCluster().add_to(seattle_map)

# loop through the dataframe and add each data point to the marker cluster
for lat, lng, label, in zip(dfclean.latitude, dfclean.longitude, dfclean.AVGrent):
    folium.Marker(
        location=[lat, lng],
        popup=str(label),
        icon=folium.Icon(color='green',icon='house', prefix='fa')
    ).add_to(listings)

# display map
seattle_map

## Regression

In [None]:
import matplotlib.pyplot as plt
import pylab as pl

%matplotlib inline


In [None]:
plt.scatter(dfclean.AVGprice,dfclean.neighbourhood, color='blue')
plt.xlabel("Average Airbnb price by neighbourhood")
plt.ylabel("Neighbourhood")
plt.show()
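
The scatter plot above only shows prices by neighbourhood; a regression needs two numeric variables. A minimal sketch of a simple linear fit, assuming the share of entire place listings as predictor and the average rent as outcome. The numbers are made up for illustration and are not taken from the dataset:

```python
import numpy as np

# Made-up values for four hypothetical neighbourhoods (NOT from the dataset).
entire_share = np.array([0.2, 0.4, 0.6, 0.8])          # share of entire place listings
avg_rent = np.array([1300.0, 1500.0, 1700.0, 1900.0])  # average monthly rent in USD

# Least-squares line: avg_rent ~ intercept + slope * entire_share
slope, intercept = np.polyfit(entire_share, avg_rent, deg=1)

# Pearson correlation between the two variables.
r = np.corrcoef(entire_share, avg_rent)[0, 1]

print(f"rent = {intercept:.0f} + {slope:.0f} * entire_share (r = {r:.2f})")
```

With the real data, the two arrays would come from the `entire` and `AVGrent` columns of `dfclean`, once `AVGrent` has been converted to numbers.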


__One hot encoding for room type__

__Grouping rows by neighborhood as well as calculating the mean of the frequency of occurrence of each category__
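
The two steps named above can be sketched on a toy table. The neighbourhood names are borrowed from the sample output, while the room types assigned to them here are made up; `toy` and `shares` are hypothetical names:

```python
import pandas as pd

# Toy listings: one row per listing (room types are illustrative, not real data).
toy = pd.DataFrame({
    "neighbourhood": ["Madrona", "Madrona", "Georgetown",
                      "Georgetown", "Georgetown", "First Hill"],
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Private room", "Entire home/apt", "Private room"],
})

# Step 1: one-hot encode the room type (one 0/1 column per category).
onehot = pd.get_dummies(toy["room_type"], dtype=int)

# Step 2: group by neighbourhood; the mean of a 0/1 column is the share of that room type.
shares = (
    pd.concat([toy[["neighbourhood"]], onehot], axis=1)
    .groupby("neighbourhood")
    .mean()
)
print(shares)
```

Each row of `shares` sums to 1, so the columns can be read as the room-type mix of a neighbourhood.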

## k-Means Clustering of the obtained results

1. Setting up 6 clusters

In [None]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 6

# drop the neighbourhood name so that only numeric columns remain
seattle_grouped_clustering = seattle_grouped.drop('neighbourhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(seattle_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
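
k-means is distance-based, so features measured on very different scales (a share between 0 and 1 versus a rent in dollars) are usually standardised first. A minimal sketch with made-up neighbourhood-level features and only two clusters; the feature choice is an assumption for illustration, not the notebook's actual input:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up features per neighbourhood: [share of entire place listings, average rent in USD].
features = np.array([
    [0.85, 2400.0],
    [0.80, 2300.0],
    [0.35, 1400.0],
    [0.30, 1350.0],
])

# Standardise each column to zero mean and unit variance so rent does not dominate the distance.
scaled = StandardScaler().fit_transform(features)

# Fit k-means with two clusters; n_init is set explicitly to keep behaviour stable across versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = model.labels_
print(labels)
```

The cluster numbers themselves are arbitrary; only which rows share a label is meaningful.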

2. Adding labels

In [None]:
# add clustering labels to the neighbourhood names
seattle_labels = seattle_grouped[['neighbourhood']].copy()
seattle_labels['Cluster Labels'] = kmeans.labels_

# merge the labels with the cleaned neighbourhood data (rent, mean price, latitude/longitude)
seattle_merged = dfclean.merge(seattle_labels, on='neighbourhood', how='inner')

seattle_merged.head() # check the last columns!

3. Examining clusters

In [None]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 0]



In [None]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 1]


In [None]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 2]


In [None]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 3]


In [None]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 4]


In [None]:
seattle_merged.loc[seattle_merged['Cluster Labels'] == 5]


## Visualizing results

In [None]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters (one colour per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one marker per neighbourhood, coloured by its cluster label
for lat, lon, poi, cluster in zip(seattle_merged['latitude'], seattle_merged['longitude'], seattle_merged['neighbourhood'], 
                                  seattle_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters