# Introduction

In my analysis, I will be exploring the following:

- Based on reviews from the last year, what makes a good user experience at an Airbnb rental? What makes a poor one? What advice would help new Airbnb hosts earn high reviews and consistent renters?
- What seasonal Airbnb trends appear in the San Francisco area? Which months over the last year were the slowest? Which were the busiest? Which areas were the most popular during these times? How have prices changed over time?
- Are Airbnbs that are available for monthly rent competitive with the local rental market?
- Can we build a predictor for listing prices? Which features allow hosts to charge more?

In [2]:
#Read in libraries
import dask.dataframe as dd
import swifter

import pandas as pd
import pandas_profiling

import re

import numpy as np
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [4]:
#Set pandas backend to create interactive plots
#pd.options.plotting.backend = 'hvplot'

#What's the default backend? Not all plots work with this

In [5]:
#Set plot aesthetics for notebook
sns.set(style='whitegrid', palette='pastel', color_codes=True)

#Increase number of columns and rows displayed by Pandas
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',100)

#Set float format
pd.options.display.float_format = '{:.0f}'.format

#Ignore warnings
import warnings; warnings.simplefilter('ignore')

**Read in Data**

In [6]:
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate/'

In [None]:
#Read in Airbnb Listings Data
listings = pd.read_csv(path + '12_27_2019_Listings_Cleaned.csv',index_col=0, low_memory=False, sep=',')

#Parse dates
parse_dates = ['date']

#Read in Airbnb Calendar and Reviews data
calendar = pd.read_csv(path + '12_23_2019_Calendar_Cleaned.csv', sep = ',',
                       parse_dates=parse_dates, low_memory=False,index_col=0)

reviews = pd.read_csv(path + '12_24_2019_Reviews_Cleaned.csv', sep=',',
                      parse_dates=parse_dates,index_col=0)

In [None]:
#Read in Zillow data
zillow = pd.read_csv(path + '12_29_2019_Zillow_Cleaned.csv', parse_dates=['Date'],
                     index_col=0, sep=',')

# Data Preview

**Airbnb Listings Data**

In [None]:
#Preview listings data
display(listings.head().T)

**Airbnb Reviews Data**

In [None]:
#Preview reviews data
display(reviews.head())

**Airbnb Calendar Data**

In [None]:
#Preview calendar data
display(calendar.head())

**Zillow Data**

In [None]:
#Preview zillow data
display(zillow.head())

# Data Exploration

- How has Airbnb grown over the last year? Which neighborhoods have shown the most growth?
- What are the metrics for the different neighborhoods in SF?
- Are Airbnbs that are available for monthly rent competitive with the local rent market? Is Airbnb a legitimate option to consider for short-term living, as opposed to finding month-to-month leases?
- What are the different data distributions for the different outcome variables we are interested in exploring?



#### What are the different data distributions for the different outcome variables we are interested in exploring?


date

In [None]:
calendar.date.hist()

In [None]:
#plot hist
listings.review_scores_rating.hist(bins = 80)

#plot the mean
mean = np.mean(listings.review_scores_rating)

plt.axvline(mean, color='r', linewidth=2, linestyle='--', label= str(round(mean,2)))

Listings price

In [None]:
#KDE
ax = sns.kdeplot(listings.price, shade=True, color="grey")
#plot the mean
mean = np.mean(listings.price)

plt.axvline(mean, color='r', linewidth=2, linestyle='--', label= str(round(mean,2)))

In [None]:
listings.monthly_price.hist(bins = 30)

#plot the mean
mean = np.mean(listings.monthly_price)

plt.axvline(mean, color='r', linewidth=2, linestyle='--', label= str(round(mean,2)))

In [None]:
listings.weekly_price.hist(bins = 30)

#plot the mean
mean = np.mean(listings.weekly_price)

plt.axvline(mean, color='r', linewidth=2, linestyle='--', label= str(round(mean,2)))

#### How has Airbnb grown over the last year (11/2018 - 10/2019)?

Let's begin by taking a look at the number of unique listings per month available for rent from Airbnb hosts.

In [None]:
#Convert date to month year and add to calendar
#calendar['month_year'] = pd.to_datetime(calendar['date']).dt.to_period('M') #would this work in time series?
calendar.set_index('date', inplace=True)

#Extract date info from index
calendar['year'] = calendar.index.year
calendar['month'] = calendar.index.month
#weekday_name was removed in pandas 1.0; day_name() returns the same labels
calendar['weekday'] = calendar.index.day_name()

calendar.head()

In [None]:
#Keep calendar data only through December 2019
last_year = calendar.loc[(calendar.year < 2020)]

In [None]:
#Group last year by date and get a count of unique listings per day
last_year_daily = last_year.groupby(['date'])['listing_id'].agg(['nunique']).reset_index()

In [None]:
#Rename columns
last_year_daily = last_year_daily.rename(columns = {'nunique': 'listings'})

In [None]:
last_year_monthly = last_year.reset_index()
last_year_monthly.head()

In [None]:
#Convert date to a monthly period and count unique listings per month
last_year_monthly['month_year'] = pd.to_datetime(last_year_monthly['date']).dt.to_period('M') #would this work in time series?
last_year_monthly = last_year_monthly.groupby('month_year')['listing_id'].agg(['nunique']).reset_index()

#Rename columns
last_year_monthly = last_year_monthly.rename(columns = {'nunique': 'listings'})
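
The monthly aggregation above can be checked on a small, made-up calendar. This is a minimal sketch, not the notebook's data: `toy_calendar` and its values are hypothetical, and it uses `nunique()` with `reset_index(name=...)` as one way to get the renamed count column in a single step.

```python
import pandas as pd

# Hypothetical toy calendar: one row per listing per day it appears
toy_calendar = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-05', '2019-01-20', '2019-01-20',
                            '2019-02-03', '2019-02-03', '2019-02-10']),
    'listing_id': [1, 1, 2, 1, 2, 3],
})

# Bucket each day into its calendar month
toy_calendar['month_year'] = toy_calendar['date'].dt.to_period('M')

# Count distinct listings per month; a listing that appears on several
# days of the same month is counted once
monthly_counts = (toy_calendar.groupby('month_year')['listing_id']
                  .nunique()
                  .reset_index(name='listings'))

print(monthly_counts)
```

Counting distinct ids rather than rows matters here because the calendar holds one row per listing per day, so a plain row count would mostly measure the number of days in the month.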

In [None]:
#Set 538 plot style
plt.style.use('fivethirtyeight')


#Plot daily unique listings data
ax = last_year_daily.plot(x='date', y='listings', kind = 'line',style='o-', markersize=.5,
          label= 'Daily Unique Listings',figsize = (15,10),
         linewidth = 1.5)
 

#plot weekly


#Plot monthly unique listings
last_year_monthly.plot(x='month_year', y='listings',kind = 'line',style='o-', markersize= 10,
          label= 'Monthly Unique Listings',
         linewidth = 1.5 , ax=ax)

#Set fontdict
fontdict={'weight' : 'bold',
          'size': 17}

#Set x and y labels
ax.set_xlabel('Month',fontdict=fontdict)
ax.set_ylabel('Count', fontdict=fontdict)

#Format yticks
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])

#Set Title
ax.set_title('Growth of Airbnb Listings in San Francisco', fontweight = 'bold', fontsize=22)

#Adjust plot margins
ax.margins(0,.12)

#Mute vertical grid lines
ax.grid(False, which ='major', axis = 'x')

#Add Text
xs,ys=last_year_monthly['month_year'], last_year_monthly['listings']

for x,y in zip(xs,ys):
    label = '{:,}'.format(y)
    plt.annotate(label, (x,y),textcoords="offset points",fontsize = 9, xytext=(-40,20), ha='center')

#Set legend
plt.legend(title='Legend', frameon = True, loc='upper left');

Over the last year, there has been significant growth in the number of listings available for rent month to month. Let us look into the number of nights booked by users over the last year.

**Growth by neighborhood September 2018 - December 2019**

We'll merge our calendar data with the listings data to capture each listing's neighborhood.

In [None]:
#Convert calendar to dask to improve speed on merge
calendar_dd = dd.from_pandas(calendar, npartitions=3)


In [None]:
#capture listing id and neighbourhood_cleansed from listings for merge
neighborhoods = listings[['id', 'neighbourhood_cleansed']]

calendar_dd = calendar_dd.reset_index()

#Merge with calendar_dd
neighborhood_growth = calendar_dd.merge(neighborhoods, left_on ='listing_id', right_on='id')

#Drop redundant columns
neighborhood_growth= neighborhood_growth.drop(columns = ['id'])
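
An inner merge like the one above drops any calendar row whose `listing_id` has no match in the listings table, without warning. The sketch below shows one way to surface those rows before dropping them. It uses plain pandas rather than dask, and `toy_calendar`, `toy_listings`, and all their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy inputs; listing 4 has calendar rows but no listings entry
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 2, 4],
    'date': pd.to_datetime(['2019-12-01', '2019-12-02',
                            '2019-12-01', '2019-12-01']),
})
toy_listings = pd.DataFrame({
    'id': [1, 2, 3],
    'neighbourhood_cleansed': ['Mission', 'Noe Valley', 'Marina'],
})

# A left merge with indicator=True keeps every calendar row and records
# where each row came from; validate checks that listing ids are unique
merged = toy_calendar.merge(toy_listings, left_on='listing_id', right_on='id',
                            how='left', indicator=True,
                            validate='many_to_one')

# Calendar listings with no neighborhood information
unmatched = merged.loc[merged['_merge'] == 'left_only', 'listing_id'].unique()

# Equivalent of the inner merge, after the unmatched ids have been inspected
matched = merged[merged['_merge'] == 'both'].drop(columns=['id', '_merge'])

print(unmatched)
print(matched)
```

If `unmatched` is empty on the real data, the inner merge loses nothing; otherwise the neighborhood counts understate the calendar by those listings.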

In [None]:
neighborhood_growth=neighborhood_growth[(neighborhood_growth['date'] < '2020-01-01')]

#convert to pandas
neighborhood_growth_pd = neighborhood_growth.compute()

#Group by neighborhood, year and month and get a count of unique listings per month
neighborhood_growth_pd = neighborhood_growth_pd.groupby(['neighbourhood_cleansed', 'year', 'month'])['listing_id'].agg(['nunique']).reset_index()

#Rename columns
neighborhood_growth_pd =neighborhood_growth_pd.rename(columns = {'neighbourhood_cleansed': 'neighborhoods',
                                                                 'nunique': 'listings'})

In [None]:
neighborhood_growth_pd = neighborhood_growth_pd.sort_values(by=['year','month', 'listings'], ascending=False)

In [None]:
#Set fig size
f, ax = plt.subplots(figsize = (6,15))

#Set style and color palette
sns.set(style='whitegrid')

#Set color codes for modern data
sns.set_color_codes('pastel')

#Plot data from 2019-12
j = sns.barplot(x = 'listings', y = 'neighborhoods', color = 'b',
            data = neighborhood_growth_pd.loc[(neighborhood_growth_pd.month == 12 ) & 
                                         (neighborhood_growth_pd.year == 2019)],
                label = 'December 2019')

#Set color codes for older data
sns.set_color_codes('dark')

#Plot data from 2018-09
g = sns.barplot(x = 'listings', y = 'neighborhoods', color = 'b',
            data = neighborhood_growth_pd.loc[(neighborhood_growth_pd.month == 9 ) & 
                                         (neighborhood_growth_pd.year == 2018)],
                label = 'September 2018')

#Set legend info
ax.legend(frameon = True, ncol = 2, loc= 'lower center');

#Set Labels
ax.set(ylabel="", xlabel="Airbnb Listings")
ax.set_title('Explosion of Airbnb', fontweight = 'bold', fontsize = 18)


sns.despine(left=True, bottom=True)


**Prices by property type**


In [None]:
#Set ggplot plot style
plt.style.use('ggplot')

#Counts of property types
prop_count = listings.groupby('property_type')['id'].count().sort_values(ascending = False).reset_index()
prop_count.plot(x = 'property_type', y = 'id', kind = 'bar')

In [None]:
#Get top 15 common prop types
prop_list=list(prop_count.property_type.head(15))

#sort top 15 by median value
test = listings[listings.property_type.isin(prop_list)].groupby('property_type')['price'].median().sort_values(ascending = False).reset_index()

In [None]:
test_list = test.property_type.tolist()

In [None]:
#Set 538 plot style
plt.style.use('fivethirtyeight')

#Set Figure
f, ax = plt.subplots(figsize= (10,5))

#Plot
g= sns.boxplot(x="property_type", y="price", order=test_list, 
               width = .4,palette=sns.light_palette((210, 90, 60), input="husl"),
             data=listings[listings.property_type.isin(test_list)], ax=ax)

#Set Title
ax.set_title('15 Most common properties with listing price', fontweight = 'bold', fontsize=22)

#Set x and y Labels
ax.set_ylabel('Price per Night($)',fontdict=fontdict)
ax.set_xlabel('Property Type',fontdict=fontdict)

#Rotate x_ticklabels
g.set_xticklabels(g.get_xticklabels(), rotation=45, horizontalalignment='right');

#Format ticks on y-axis

**Heat map**

In [None]:
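#Correlate numeric columns only; recent pandas versions raise on non-numeric columns
corr = listings.select_dtypes(include='number').corr()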
plt.figure(figsize=(15, 15))
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,linewidths=.1,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

# Zillow

In [38]:
zillow.dtypes

City                   object
County                 object
Metro                  object
Zip                     int64
State                  object
SizeRank                int64
Bedrooms                int64
Date           datetime64[ns]
Median_Rent           float64
dtype: object

In [39]:
zillow.set_index(keys = 'Date', inplace=True)

In [40]:
zillow['Year'] = zillow.index.year
zillow['Month'] = zillow.index.month
zillow['Weekday_Name'] = zillow.index.day_name()

In [41]:
zillow.head()

| Date | City | County | Metro | Zip | State | SizeRank | Bedrooms | Median_Rent | Year | Month | Weekday_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010-03-01 | Virginia Beach | Virginia Beach City | Virginia Beach-Norfolk-Newport News | 23462 | VA | 83 | 3 | 1200 | 2010 | 3 | Monday |
| 2010-04-01 | Virginia Beach | Virginia Beach City | Virginia Beach-Norfolk-Newport News | 23462 | VA | 83 | 3 | 1250 | 2010 | 4 | Thursday |
| 2010-05-01 | Virginia Beach | Virginia Beach City | Virginia Beach-Norfolk-Newport News | 23462 | VA | 83 | 3 | 1200 | 2010 | 5 | Saturday |
| 2010-06-01 | Virginia Beach | Virginia Beach City | Virginia Beach-Norfolk-Newport News | 23462 | VA | 83 | 3 | 1250 | 2010 | 6 | Tuesday |
| 2010-07-01 | Virginia Beach | Virginia Beach City | Virginia Beach-Norfolk-Newport News | 23462 | VA | 83 | 3 | 1225 | 2010 | 7 | Thursday |


# Time Series Analysis

For this analysis, we will isolate the rows in our Zillow data that are in the zipcode or city of the listings data.

In [42]:
#Capture zip information from listings
listings_zip = list(listings.zipcode.unique())

#Capture city information from listings 
listings_cities = list(listings.city.unique())

#Capture SF Data
sf_zillow = zillow[zillow['Zip'].isin(listings_zip) | zillow['City'].isin(listings_cities)]

#Check
print(sf_zillow.shape)
display(sf_zillow.tail())

(2015, 11)


| Date | City | County | Metro | Zip | State | SizeRank | Bedrooms | Median_Rent | Year | Month | Weekday_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-11-01 | San Francisco | San Francisco | San Francisco-Oakland-Hayward | 94105 | CA | 2269 | 2 | 5800 | 2019 | 11 | Friday |
| 2019-11-01 | Sausalito | Marin | San Francisco-Oakland-Hayward | 94965 | CA | 2434 | 2 | 3650 | 2019 | 11 | Friday |
| 2019-11-01 | San Francisco | San Francisco | San Francisco-Oakland-Hayward | 94158 | CA | 2477 | 2 | 5288 | 2019 | 11 | Friday |
| 2019-11-01 | San Francisco | San Francisco | San Francisco-Oakland-Hayward | 94110 | CA | 82 | 3 | 5575 | 2019 | 11 | Friday |
| 2019-11-01 | San Francisco | San Francisco | San Francisco-Oakland-Hayward | 94115 | CA | 518 | 3 | 6500 | 2019 | 11 | Friday |
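
One thing worth checking in a zip-or-city filter like this is that both sides of `isin` share a dtype. `zillow.Zip` is `int64`, and if the listings zip codes were read in as strings (an assumption; the listings dtypes are not shown here), the zip half of the filter can fail to match and only the city half contributes. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy Zillow rows with integer zip codes
toy_zillow = pd.DataFrame({
    'Zip': [94110, 94965, 94158, 23462],
    'City': ['San Francisco', 'Sausalito', 'San Francisco', 'Virginia Beach'],
})

# Assumed shape of the listings values: zip codes as strings, some unusable
listing_zips = ['94110', '94965', 'unknown']
listing_cities = ['San Francisco']

# Comparing integer zips against string zips directly
naive_zip_match = toy_zillow['Zip'].isin(listing_zips)

# Coerce the listings zips to integers first, discarding unparseable values
zips_as_int = (pd.to_numeric(pd.Series(listing_zips), errors='coerce')
               .dropna()
               .astype('int64'))

mask = toy_zillow['Zip'].isin(zips_as_int) | toy_zillow['City'].isin(listing_cities)
subset = toy_zillow[mask]

print(naive_zip_match.sum(), 'zip matches before coercion')
print(subset)
```

Printing the number of zip matches separately from the city matches makes it easy to see which half of the condition is selecting rows.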


In [1]:
#SF vs rest of US rent
#Note: this cell needs the imports from the top of the notebook (plt) to have been run

plt.style.use('ggplot')

fig, ax = plt.subplots(figsize = (12,5))

#Mean of median rents across the SF-area rows
sf_zillow.groupby('Date')['Median_Rent'].mean().plot(x_compat=True, kind = 'line', ax=ax, label='SF area')

#Mean of median rents across all regions in the Zillow data
zillow.loc['2011-09-01':'2019-11-01'].groupby(['Date'])['Median_Rent'].mean().plot(x_compat=True, kind = 'line', color= 'g', ax=ax, label='All regions')

#SizeRank ranks each region by population size (a lower rank means a larger region),
#so SizeRank < 10 keeps only the largest regions
zillow[zillow.SizeRank < 10].loc['2011-09-01':'2019-11-01'].groupby(['Date'])['Median_Rent'].mean().plot(x_compat=True, kind = 'line', ax=ax, label='SizeRank < 10')

ax.legend()

Airbnb monthly rent vs. Bay Area rent

In [None]:
#Capture id and monthly_price from listings data for merge with calendar data
to_merge = listings[['id','monthly_price']]



In [None]:
monthly_listings_dd = calendar_dd.merge(to_merge, left_on='listing_id', right_on='id')

In [None]:
monthly_listings = monthly_listings_dd.compute()

In [None]:
#Keep rows that have a monthly price and drop exact duplicates
test_pd = monthly_listings[monthly_listings.monthly_price.notna()].drop_duplicates()

In [None]:
#Convert date to month-year format
test_pd
#test_pd['date'] = pd.to_datetime(test_pd['date']).dt.to_period('M')

In [None]:
test_pd.set_index('date', inplace=True)

In [None]:
test_pd.sort_index(inplace=True)

In [None]:
test_pd['Year'] = test_pd.index.year
test_pd['Month'] = test_pd.index.month
test_pd['Weekday Name'] = test_pd.index.day_name()

In [None]:
test_pd.head()

In [None]:
test_pd.loc['2018-09':'2019-12'].groupby('date')['monthly_price'].mean().plot()

In [None]:
sf_zillow.loc['2018-09':'2019-12'].groupby(['Date'])['Median_Rent'].mean().plot(kind = 'line')

In [None]:
#test_pd.loc['2018-09':'2019-12'].groupby('date')['monthly_price'].mean().plot(kind = 'box')
sns.boxplot(x='Month', y = 'monthly_price', data = test_pd)

In [None]:
sf_zillow.tail()

In [None]:
calendar.head()

* Comparing monthly rent of Airbnb to Zillow

In [None]:
listings.head(2)

In [None]:
#test = sf_zillow.groupby(['Bedrooms', 'Date'])['Median_Rent'].mean().reset_index()
fig, ax = plt.subplots(figsize=(10,5))

sf_zillow.groupby("Bedrooms")['Median_Rent'].plot(kind="line", ax=ax)
plt.legend(title='Bedrooms', frameon = True, loc='upper right', labels=['studio', 'single','two','three'])
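
To put the two sources side by side, one option is to compare medians per bedroom count. The sketch below is only an illustration on made-up numbers: `toy_airbnb`, `toy_zillow`, the `bedrooms` column on the Airbnb side, and the `premium_pct` name are all assumptions, not results from the notebook's data.

```python
import pandas as pd

# Hypothetical Airbnb listings with a monthly price and a bedroom count
toy_airbnb = pd.DataFrame({
    'bedrooms': [1, 1, 2, 2],
    'monthly_price': [3000.0, 3400.0, 5000.0, 5600.0],
})

# Hypothetical Zillow rows with a median rent per bedroom count
toy_zillow = pd.DataFrame({
    'Bedrooms': [1, 1, 2, 2],
    'Median_Rent': [3100.0, 3300.0, 4400.0, 4600.0],
})

# Median per bedroom count on each side
airbnb_median = (toy_airbnb.groupby('bedrooms')['monthly_price']
                 .median().rename('airbnb_monthly'))
zillow_median = (toy_zillow.groupby('Bedrooms')['Median_Rent']
                 .median().rename('zillow_rent'))

# Align the two series on bedroom count
comparison = pd.concat([airbnb_median, zillow_median], axis=1)

# Percentage by which the Airbnb monthly price exceeds the Zillow rent
comparison['premium_pct'] = (comparison['airbnb_monthly']
                             / comparison['zillow_rent'] - 1) * 100

print(comparison.round(1))
```

Medians are used instead of means so that a handful of very expensive listings does not dominate the comparison.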


# Other variables and relationships worth exploring

### Principal Component Analysis