Data Cleaning - Aggregated Airbnb Calendar Data

# Introduction

In this notebook, I clean an aggregated set of Airbnb calendar data for the San Francisco area. The data covers January 2018 through December 2019.

* The aggregation source code [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/blob/master/Airbnb%20Raw%20Data%20Aggregation.ipynb)

* Raw data [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data)

## Read in libraries, set notebook preferences, and read in data

In [1]:
#Read in libraries
import dask.dataframe as dd
import swifter
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

**Settings for Notebook**

In [2]:
#Set Pandas options
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

#Set plot style
plt.style.use('ggplot')

**Read in Data**

In [None]:
#Set path to get aggregated calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Calendar_Raw_Aggregated.csv'

#Set date columns for parsing
parse_dates = ['date']

#Read in calendar data
calendar = pd.read_csv(path, sep = ',', dtype = {"listing_id" : "object"}, parse_dates=parse_dates,index_col=0, low_memory=False)

## Preview Data

In [None]:
#Preview calendar data
display(calendar.head())

In [None]:
#Print shape and dtypes of calendar data
print('Calendar data shape:',calendar.shape)
print('\nCalendar data types: \n',calendar.dtypes)

# Data Cleaning

## Drop columns not needed for Time Series Analysis

In [None]:
#Drop columns not needed for time series analysis
calendar.drop(columns = ['adjusted_price','maximum_nights', 'minimum_nights'], inplace = True)

#View
display(calendar)

## Column cleaning and data type correction

**Price column**

In [None]:
#Remove $ and , from price 
calendar['price']=calendar['price'].replace('[,$]','', regex=True)

#Convert string to numeric (vectorized; unparseable values become NaN)
calendar['price'] = pd.to_numeric(calendar['price'], errors='coerce')
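
To illustrate what the price cleaning does, here is a minimal sketch on a handful of made-up values. The name `sample_prices` and its contents are hypothetical and are not taken from the real calendar file.

```python
import pandas as pd

# Hypothetical sample values mimicking raw price strings (not real data)
sample_prices = pd.Series(['$1,250.00', '$85.00', None, 'n/a'])

# Strip '$' and ',' with a vectorized string replace
stripped = sample_prices.str.replace(r'[,$]', '', regex=True)

# Convert the whole series at once; values that cannot be parsed become NaN
cleaned_prices = pd.to_numeric(stripped, errors='coerce')

print(cleaned_prices)
```

Because `errors='coerce'` silently turns bad strings into NaN, it is worth checking the missing-value counts afterwards, which the Missing Data section does.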

**Available column**

In [None]:
#Encode available as 1 for 't' and 0 for anything else (including 'f')
calendar.available = calendar.available.swifter.apply(lambda x: 1 if x == 't' else 0)
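
A possible alternative, sketched on made-up values, is a vectorized comparison. `sample_available` is a hypothetical stand-in for the real column. Note that, as in the cell above, missing values end up as 0.

```python
import pandas as pd

# Hypothetical sample of the raw 't'/'f' flags, with one missing value
sample_available = pd.Series(['t', 'f', 't', None])

# eq('t') compares every element at once; anything that is not 't' maps to 0
available_flag = sample_available.eq('t').astype(int)

print(available_flag.tolist())
```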

## Missing Data

**Stats of missing values per column**

In [None]:
#Create function that captures missing data stats of a pandas df
def missing_stats(df):
    #Create empty dataframe to store missing value stats in
    missing = pd.DataFrame()
    #Capture total number and total % of missing data per column
    missing['Total'] = df.isnull().sum().sort_values(ascending=False)
    missing['Percent'] = missing['Total'] / len(df) * 100
    return missing

#Check
missing_stats(calendar)
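
For reference, the same two statistics can be sketched on a toy frame using `isna().sum()` and `isna().mean()`. The frame `toy` below is made up for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: two of the four prices are missing
toy = pd.DataFrame({
    'price': [100.0, np.nan, np.nan, 80.0],
    'available': [1, 0, 0, 1],
})

# The mean of a boolean mask is the fraction of True values, i.e. the share missing
summary = pd.DataFrame({
    'Total': toy.isna().sum(),
    'Percent': toy.isna().mean() * 100,
}).sort_values('Total', ascending=False)

print(summary)
```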

**View rows with missing price values**

In [None]:
calendar[calendar['price'].isnull()]

It appears that price is missing whenever available is false. Let's test that theory:

In [None]:
#Capture rows with missing price data and assign to df
missing_price = calendar[calendar['price'].isnull()]

#View 
print('Missing price values with availability: ', len(missing_price[missing_price.available == 1]))
print('Missing price values w/o availability: ', len(missing_price[missing_price.available == 0]))

Not a bad hunch :)
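
The same check can also be sketched as a single cross-tabulation of availability against missing price. The toy frame below is made up, so its counts say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy data: price is missing for both unavailable rows and one available row
toy = pd.DataFrame({
    'price': [100.0, np.nan, np.nan, 80.0, np.nan],
    'available': [1, 0, 0, 1, 1],
})

# Rows: available flag; columns: whether price is missing
table = pd.crosstab(
    toy['available'],
    toy['price'].isna(),
    rownames=['available'],
    colnames=['price_missing'],
)

print(table)
```

A table like this shows both directions at once: how many unavailable rows lack a price, and how many available rows do.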

**Drop rows where price is missing and available is false**

In [None]:
#Get indices of rows with missing values in price and available == 0
remove = missing_price[missing_price.available == 0].index.tolist()

#Drop
calendar = calendar.drop(remove)

#Check calendar shape
print('Current calendar data shape:',calendar.shape)

#View updated missing stats
missing_stats(calendar)
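
One caveat worth sketching: dropping by index label removes every row that shares a label. Whether the aggregated calendar file actually has repeated index labels is not shown here, so the duplicated index in the toy frame below is purely an assumption for illustration. A boolean mask avoids the issue either way.

```python
import numpy as np
import pandas as pd

# Toy frame with a deliberately duplicated index label (0 appears twice)
toy = pd.DataFrame(
    {'price': [np.nan, 120.0, np.nan], 'available': [0, 1, 1]},
    index=[0, 0, 1],
)

# Condition: price missing and listing unavailable
mask = toy['price'].isna() & (toy['available'] == 0)

# Label-based drop removes every row labelled 0, including the valid 120.0 row
by_label = toy.drop(toy[mask].index.tolist())

# Mask-based filter removes only the rows that satisfy the condition
by_mask = toy[~mask]

print(len(by_label), len(by_mask))
```

If the index is unique, both approaches give the same result.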

**Replace remaining missing values with median of price**

In [None]:
#Fill remaining na with median of price column
calendar['price'] = calendar['price'].fillna(calendar['price'].median())

#View updated missing stats
missing_stats(calendar)

## Outliers

**Percentiles for the price of a one night stay in San Francisco**

In [None]:
#Capture stats and percentile of price
print("median ",calendar.price.median())
print(calendar['price'].describe(percentiles=[.01,0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,.95,.99]))

#How many rows are above the 99th percentile?
q = calendar["price"].quantile(0.99)
print('\nNumber of rows above 99th percentile: ', len(calendar[calendar["price"] > q]))

**Note:** The Airbnb data also contains listings for Airbnb Luxe, Airbnb's foray into super-luxury travel. It is a new premium tier that lets you book fully serviced mega homes and vacation experiences. Each property on Airbnb Luxe is unique, pristine and, of course, luxurious. Some even come with private butlers, special amenities, and helicopters!

Because of this, we will treat the extreme values as legitimate luxury listings rather than outliers.
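
If it later helps to analyze these listings separately, one option is to flag them rather than remove them. This is only a sketch on made-up prices: the column name `above_p99` is hypothetical, and the 99th percentile cutoff is an arbitrary choice, not a definition of a Luxe listing.

```python
import pandas as pd

# Toy prices with one extreme value (made-up numbers)
toy = pd.DataFrame({'price': [50.0, 80.0, 120.0, 150.0, 10000.0]})

# Threshold at the 99th percentile of price
threshold = toy['price'].quantile(0.99)

# Keep every row, but mark the ones above the threshold
toy['above_p99'] = toy['price'] > threshold

print(toy)
```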

**Zero Price Removal**

In [None]:
#Remove rows where price == 0 (most likely a data error, rent cannot be $0)
calendar = calendar[calendar['price'] >0]

#View updated metrics of price
print("median ",calendar.price.median())
print(calendar['price'].describe(percentiles=[.01,0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,.95,.99]))

#How many rows are above the 99th percentile?
q = calendar["price"].quantile(0.99)
print('\nNumber of rows above 99th percentile: ', len(calendar[calendar["price"] > q]))

# Write file to CSV

In [None]:
#Set path to write cleaned calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\2020_0130_Calendar_Cleaned.csv'

#Write cleaned calendar data to path
calendar.to_csv(path, sep=',')