Data Cleaning - Aggregated Airbnb Calendar Data

# Introduction

In this notebook, I clean an aggregation of Airbnb calendar data for the San Francisco area. The data covers January 2018 through December 2019.

* The aggregation source code [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/blob/master/Airbnb%20Raw%20Data%20Aggregation.ipynb)

* Raw data [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data)

## Read in libraries, set notebook preferences, and read in data

In [5]:
#Read in libraries
import dask.dataframe as dd
import swifter
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

**Settings for Notebook**

In [6]:
#Set Pandas options
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

#Set plot style
plt.style.use('ggplot')

**Read in Data**

In [None]:
#Set path to get aggregated calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Calendar_Raw_Aggregated.csv'

#Set date columns for parsing
parse_dates = ['date']

#Read in calendar data
calendar = pd.read_csv(path, sep = ',', dtype = {"listing_id" : "object"}, parse_dates=parse_dates,index_col=0, low_memory=False)

## Preview Data

In [None]:
#Preview calendar data
display(calendar.head())

In [None]:
#Print shape and dtypes of calendar data
print('Calendar data shape:',calendar.shape)
print('\nCalendar data types: \n',calendar.dtypes)

# Data Cleaning

## Drop columns not needed for Time Series Analysis

In [None]:
#Drop columns not needed for time series analysis
calendar.drop(columns = ['adjusted_price','maximum_nights'], inplace = True)

#View
display(calendar)

## Column cleaning and data type correction

**Price column**

In [None]:
#Remove $ and , from price 
calendar['price']=calendar['price'].replace('[,$]','', regex=True)

#Convert string to numeric in one vectorized call; unparseable values become NaN
calendar['price'] = pd.to_numeric(calendar['price'], errors='coerce')

**Available column**

In [None]:
#Encode 't' as 1 and any other value as 0 in available column
calendar.available = calendar.available.swifter.apply(lambda x:  1 if x =='t' else 0)
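
The two cleaning steps above can also be written without `swifter`, using vectorized pandas operations only. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not the real calendar data):

```python
import pandas as pd

# Toy frame standing in for the calendar data (values are made up)
toy = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", None, "n/a"],
    "available": ["t", "f", "f", "t"],
})

# Strip '$' and ',' and convert the whole column at once;
# anything that cannot be parsed becomes NaN
toy["price"] = pd.to_numeric(
    toy["price"].str.replace(r"[,$]", "", regex=True), errors="coerce"
)

# Encode 't' as 1 and every other value as 0
toy["available"] = toy["available"].eq("t").astype(int)

print(toy)
```

Working on whole columns avoids a Python-level call per row, which tends to matter on a calendar table of this size.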

**Set date as index**

## Missing Data

**Stats of missing values per column**

In [None]:
#Create function that captures missing data stats of a pandas df
def missing_stats(df):
    #Create empty dataframe to store missing value stats in
    missing = pd.DataFrame()
    #Capture total number and total % of missing data per column
    missing['Total'] = df.isnull().sum().sort_values(ascending=False)
    missing['Percent'] = missing['Total'] / len(df) * 100
    return missing

#Check
missing_stats(calendar)

**View rows with missing price values**

In [None]:
calendar[calendar['price'].isnull()]

It appears that if available is false, price will be missing. Let's test that theory:

In [None]:
#Capture rows with missing price data and assign to df
missing_price = calendar[calendar['price'].isnull()]

#View 
print('Missing price values with availability: ', len(missing_price[missing_price.available == 1]))
print('Missing price values w/o availability: ', len(missing_price[missing_price.available == 0]))

Not a bad hunch :)
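
The same check can be expressed as a single cross-tabulation of availability against missing prices. This is a rough sketch on a made-up toy frame (the frame `toy` and the labels `missing`/`present` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the calendar data (values are made up)
toy = pd.DataFrame({
    "available": [1, 0, 0, 1, 0],
    "price": [100.0, np.nan, np.nan, np.nan, np.nan],
})

# Label each row by whether its price is missing
price_status = toy["price"].isnull().map({True: "missing", False: "present"})

# Count rows for every combination of availability and price status
counts = pd.crosstab(toy["available"], price_status,
                     rownames=["available"], colnames=["price"])
print(counts)
```

If the hunch holds on the real data, nearly all of the `missing` counts should fall in the `available == 0` row.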

**Drop rows where price is missing and available is false**

In [None]:
#Flag rows that are missing a price and are unavailable
remove = calendar['price'].isnull() & (calendar['available'] == 0)

#Drop flagged rows with a boolean mask (safe even if index labels repeat)
calendar = calendar[~remove]

#Check calendar shape
print('Current calendar data shape:',calendar.shape)

#View updated missing stats
missing_stats(calendar)

**Replace remaining missing values with median of price**

In [None]:
#Fill remaining na with median of price column
calendar['price'] = calendar['price'].fillna(calendar['price'].median())

#View updated missing stats
missing_stats(calendar)

## Outliers

**Percentiles for the price of a one night stay in San Francisco**

In [None]:
#Capture stats and percentile of price
print("median ",calendar.price.median())
calendar['price'].describe(percentiles=[.01,0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,.95,.99])

**How many rows are above the 99th percentile?**

In [None]:
q = calendar["price"].quantile(0.99)

len(calendar[calendar["price"] > q])

**Outlier detection using z-scores**

In [None]:
#Display distributions of calendar data
calendar['price'].hist(bins = 1000)
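
The histogram shows the shape of the distribution but does not compute z-scores. A minimal sketch of z-score flagging on made-up prices follows; the cutoff of 3 standard deviations is an assumed, conventional choice rather than a value taken from this notebook:

```python
import pandas as pd

# Toy prices (made up): twenty ordinary values and one extreme value
prices = pd.Series([float(p) for p in range(100, 200, 5)] + [10_000.0])

# z-score: distance from the mean in units of the standard deviation
z_scores = (prices - prices.mean()) / prices.std()

# Flag values more than `threshold` standard deviations from the mean
threshold = 3  # assumed cutoff
outliers = prices[z_scores.abs() > threshold]
print(outliers)
```

Because price is heavily right-skewed, the mean and standard deviation are themselves pulled upward by extreme values, so z-scores may under-flag outliers here; percentile or IQR rules are a common alternative.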



Distribution of prices above the 99th percentile

In [None]:
calendar.price[calendar["price"] > 1252.00].hist(bins = 1_000)

Questions about the rows above the 99th percentile:
* Do they set a minimum-nights requirement?
* Is the price an aggregate for the whole stay? (It seems so.)

The data should reflect the price of a one-night stay; rows that do not follow this rule should be removed.

In [None]:
calendar[calendar['minimum_nights']>1].sort_values(by='price', ascending=False)

In [None]:
bigballers = calendar[calendar["price"] > 1252.00].groupby(['listing_id'])['price'].max()

bigballers = bigballers.reset_index().sort_values(by='price', ascending = False)

listofballers = bigballers['listing_id'][0:200].to_list()


In [None]:
calendar[calendar["listing_id"].isin(listofballers)].sort_values(by='price', ascending=False).head(200)

Read in the listings data and inspect the listings with extreme prices to confirm whether those values are errors that should be removed from the data.

In [None]:
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated'

In [None]:
listings = pd.read_csv(path + '/01_04_2020_Listings_Raw_Aggregated.csv', index_col=0 )

In [None]:
listings[listings.id == 1059961]

### Price Outlier Removal

**Identify and remove outliers using IQR**

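Removing outliers should bring the distribution of price closer to normal. Note that adjusted_price was dropped earlier, so only price is affected, and that the IQR code below is commented out and therefore not applied.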

In [None]:
# #Calculate IQR of price
# q25 = calendar['price'].quantile(0.25)
# q75= calendar['price'].quantile(0.75)
# iqr = q75 - q25

# #Print percentiles
# print('Percentiles: 25th={:.3f}, 75th={:.3f} \nIQR= {:.3f}'.format(q25, q75, iqr))

# #Calculate outlier cutoffs
# cut_off = 1.5 * iqr
# lower, upper = q25 - cut_off, q75 + cut_off

# #Identify outliers
# is_outlier = (calendar.price < lower) | (calendar.price > upper)
# print("Number of outliers identified: {}".format(is_outlier.sum()))

# #Remove outliers
# print('Non-outlier observations: {}'.format((~is_outlier).sum()))

# #Update df
# calendar = calendar[~is_outlier]
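
To make the IQR rule concrete, here is a small worked sketch on made-up prices. It assumes the conventional fences of 1.5 × IQR below the 25th percentile and above the 75th percentile; the series `prices` is hypothetical:

```python
import pandas as pd

# Toy prices (made up), with one extreme value
prices = pd.Series([80.0, 100.0, 120.0, 150.0, 180.0, 200.0, 250.0, 5000.0])

# Interquartile range
q25, q75 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q75 - q25

# Fences: 1.5 * IQR beyond each quartile
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

# Keep only values inside the fences (bounds inclusive)
kept = prices[prices.between(lower, upper)]
print("lower={:.2f}, upper={:.2f}, kept={}".format(lower, upper, len(kept)))
```

A boolean mask built with `between` filters rows directly, which avoids building Python lists of values and scales better than an `isin` lookup on a large table.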

In [None]:
#Updated calendar Shape
print('New calendar shape: ', calendar.shape)

#Set ggplot plot style
plt.style.use('ggplot')

#Plot updated prices from calendar data
plot = calendar.price.plot(kind = 'hist', bins=1000, figsize=(12,7), label = 'Price',
                   legend = True)

#Get plot object
ax = plt.gca()

#Capture mean
mean = np.mean(calendar.price)

#Plot mean on histogram
ax.axvline(mean, color='white', linestyle='dashed', linewidth=2.5, label = "Avg/Night ${:}".format(str(round(mean,2))))

#Format x-axis
ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "${:,}".format(int(x))))

#Format y ticks
ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.set_ylabel('')

#Mute vertical grid lines
ax.grid(False, which ='major', axis = 'x')

#Set Title
ax.set_title('Airbnb per Night Rental Prices in San Francisco', fontweight= 'normal', fontsize = 18)

#Show legend
plt.legend(frameon = True, loc='upper right');

#Save plot to png
# fig = plot.get_figure()
# fig.savefig(path + r'\Airbnb per Night Rental Prices in San Francisco.png',bbox_inches = 'tight')

In [None]:
#Set path to write cleaned calendar data
# path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\01_04_2020_Calendar_Cleaned.csv'

# #Write calendar to path
# calendar.to_csv(path, sep=',')