Data Cleaning - Aggregated Airbnb Calendar Data

# Introduction

In this notebook, I clean an aggregation of Airbnb calendar data for the San Francisco area. The data covers 1/2018 through 12/2019.

* The aggregation source code is [here](https://github.com/KishenSharma6/Airbnb-Analysis/blob/master/Project%20Codes/01.%20Raw%20Data%20Aggregation%20Scripts/2020_0129_Airbnb_Raw_Data_Aggregation.ipynb)

* The raw data is [here](https://github.com/KishenSharma6/Airbnb-Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data)

## Read in libraries, read in data, and set notebook preferences

**Read in libraries**

In [1]:
#Read in libraries
import swifter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Read in Data**

In [3]:
#Set path to get aggregated calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Calendar_Raw_Aggregated.csv'

#Set date columns for parsing
parse_dates = ['date']

#Read in calendar data
calendar = pd.read_csv(path, sep = ',', dtype = {"listing_id" : "object"}, parse_dates=parse_dates,index_col=0, low_memory=False)

**Settings for Notebook**

In [2]:
#Set Pandas options
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

#Set plot style
plt.style.use('ggplot')

## Preview Data

In [4]:
#Preview calendar data
display(calendar.head())

Unnamed: 0,adjusted_price,available,date,listing_id,maximum_nights,minimum_nights,price
0,,t,2019-04-05,21190709,,,"$5,000.00"
1,,t,2019-04-04,21190709,,,"$5,000.00"
2,,t,2019-04-03,21190709,,,"$5,000.00"
3,,t,2019-04-02,21190709,,,"$5,000.00"
4,,t,2019-04-01,21190709,,,"$5,000.00"


In [5]:
#Print shape and dtypes of calendar data
print('Calendar data shape:',calendar.shape)
print('\nCalendar data types: \n',calendar.dtypes)

Calendar data shape: (26178939, 7)

Calendar data types: 
 adjusted_price            object
available                 object
date              datetime64[ns]
listing_id                object
maximum_nights           float64
minimum_nights           float64
price                     object
dtype: object


# Data Cleaning

## Drop columns not needed for Time Series Analysis

In [6]:
#Drop columns not needed for time series analysis
calendar.drop(columns = ['adjusted_price','maximum_nights', 'minimum_nights'], inplace = True)

#View
display(calendar)

Unnamed: 0,available,date,listing_id,price
0,t,2019-04-05,21190709,"$5,000.00"
1,t,2019-04-04,21190709,"$5,000.00"
2,t,2019-04-03,21190709,"$5,000.00"
3,t,2019-04-02,21190709,"$5,000.00"
4,t,2019-04-01,21190709,"$5,000.00"
...,...,...,...,...
61385197,f,2020-09-06,38533186,$255.00
61385198,f,2020-09-07,38533186,$342.00
61385199,f,2020-09-08,38533186,$455.00
61385200,f,2020-09-09,38533186,$522.00
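Since three of the seven columns are dropped right away, they could also be skipped when the file is read. The sketch below is only an illustration on a two-row in-memory CSV (the second row is made up, and `wanted` is a hypothetical name); it assumes the same column names as the preview above and uses the `usecols` argument of `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the aggregated CSV (second row is made up)
csv_text = "\n".join([
    ",adjusted_price,available,date,listing_id,maximum_nights,minimum_nights,price",
    '0,,t,2019-04-05,21190709,,,"$5,000.00"',
    "1,,f,2019-04-04,21190709,,,",
])

# Only the columns needed for the time series analysis
wanted = ["available", "date", "listing_id", "price"]

# usecols skips the other columns at parse time
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=wanted,
    dtype={"listing_id": "object"},
    parse_dates=["date"],
)

print(sample.columns.tolist())
```

On a file with roughly 26 million rows, reading fewer columns should reduce memory use, though I have not measured the difference here.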


## Column cleaning and data type correction

**Price column**

In [7]:
#Remove $ and , from price 
calendar['price']=calendar['price'].replace('[,$]','', regex=True)

#Convert string to numeric
calendar['price'] =calendar['price'].swifter.apply(pd.to_numeric, errors='coerce')
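As a point of comparison, the same conversion can be done without `swifter`, because `pd.to_numeric` already operates on a whole Series at once. A minimal sketch on a toy Series (the values in `raw_prices` are made up for illustration):

```python
import pandas as pd

# Toy stand-in for calendar['price']; values are illustrative only
raw_prices = pd.Series(["$5,000.00", "$255.00", None, "n/a"], dtype="object")

# Strip '$' and ',' with a vectorized string replace
stripped = raw_prices.str.replace(r"[,$]", "", regex=True)

# Convert the whole Series in one call; unparseable entries become NaN
clean_prices = pd.to_numeric(stripped, errors="coerce")

print(clean_prices.tolist())
```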

**Available column**

In [8]:
#Replace 't' and 'f' in available column with 1 and 0
calendar.available = calendar.available.swifter.apply(lambda x:  1 if x =='t' else 0)
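The same flag can also be built with a vectorized comparison instead of a row-wise `apply`. A small sketch on made-up values (`available_raw` and `available_flag` are hypothetical names):

```python
import pandas as pd

# Toy stand-in for calendar['available']
available_raw = pd.Series(["t", "f", "t", "f"])

# Compare the whole column to 't', then cast the booleans to 1/0
available_flag = (available_raw == "t").astype(int)

print(available_flag.tolist())  # [1, 0, 1, 0]
```

Like the lambda above, this maps anything other than `'t'` to 0.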


## Missing Data

**Stats of missing values per column**

In [9]:
#Create function that captures missing data stats of a pandas df
def missing_stats(df):
    #Create empty dataframe to store missing value stats in
    missing = pd.DataFrame()
    #Capture total number and total % of missing data per column
    missing['Total'] = df.isnull().sum().sort_values(ascending=False)
    missing['Percent'] = missing['Total'] / len(df) * 100
    return missing

#Check
missing_stats(calendar)

Unnamed: 0,Total,Percent
price,4837396,18.48
listing_id,0,0.0
date,0,0.0
available,0,0.0
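For reference, the same two statistics can be computed a little more compactly, since the mean of a boolean `isna()` mask is the fraction of missing values. This is a sketch on a toy frame, and `missing_summary` is a hypothetical helper name, not part of the notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "Total": df.isna().sum(),
        "Percent": df.isna().mean() * 100,  # mean of booleans = share missing
    })
    return summary.sort_values("Total", ascending=False)

# Toy data: half of the prices are missing
toy = pd.DataFrame({
    "price": [100.0, np.nan, np.nan, 80.0],
    "available": [1, 0, 0, 1],
})

stats = missing_summary(toy)
print(stats)
```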


**View rows with missing price values**

In [10]:
calendar[calendar['price'].isnull()]

Unnamed: 0,available,date,listing_id,price
1146,0,2018-11-03,6938818,
1147,0,2018-11-02,6938818,
1148,0,2018-11-01,6938818,
1149,0,2018-10-31,6938818,
1150,0,2018-10-30,6938818,
...,...,...,...,...
58489652,0,2019-09-03,28201824,
58489653,0,2019-09-04,28201824,
58489654,0,2019-09-05,28201824,
58489655,0,2019-09-06,28201824,


It appears that if available is false, price will be missing. Let's test that theory:

In [11]:
#Capture rows with missing price data and assign to df
missing_price = calendar[calendar['price'].isnull()]

#View 
print('Missing price values with availability: ', len(missing_price[missing_price.available == 1]))
print('Missing price values w/o availability: ', len(missing_price[missing_price.available == 0]))

Missing price values with availability:  322
Missing price values w/o availability:  4837074


Not a bad hunch :)
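The same check can be expressed as a single cross-tabulation of "price is missing" against `available`. A sketch on a made-up five-row frame, assuming `available` has already been converted to 1/0:

```python
import numpy as np
import pandas as pd

# Toy data, illustrative only
toy = pd.DataFrame({
    "available": [1, 0, 0, 1, 0],
    "price": [100.0, np.nan, np.nan, np.nan, 80.0],
})

# Rows: whether price is missing; columns: availability flag
table = pd.crosstab(
    toy["price"].isna(),
    toy["available"],
    rownames=["price_missing"],
    colnames=["available"],
)

print(table)
```

The `True` row of the table gives the two counts printed above in one step.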

**Drop rows where price is missing and available is false**

In [12]:
#Get indices of rows with missing values in price and available == 0
remove = missing_price[missing_price.available == 0].index.tolist()

#Drop
calendar = calendar.drop(remove)

#Check calendar shape
print('Current calendar data shape:',calendar.shape)

#View updated missing stats
missing_stats(calendar)

Current calendar data shape: (21341865, 4)


Unnamed: 0,Total,Percent
price,322,0.0
listing_id,0,0.0
date,0,0.0
available,0,0.0
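An alternative to collecting index labels is to filter with a boolean mask. The mask selects rows by position rather than by label, so it does not rely on the index being unique and avoids building a Python list of several million labels. A sketch on toy data (the duplicated index label is deliberate):

```python
import numpy as np
import pandas as pd

# Toy data with a deliberately duplicated index label (1)
toy = pd.DataFrame(
    {
        "available": [1, 0, 0, 1],
        "price": [100.0, np.nan, 50.0, np.nan],
    },
    index=[0, 1, 1, 2],
)

# True only where price is missing AND the listing is unavailable
mask = toy["price"].isna() & (toy["available"] == 0)

# Keep everything else
kept = toy[~mask]

print(kept)
```

Only the single row matching both conditions is removed; the other row labelled `1` stays.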


**Replace remaining missing values with median of price**

In [13]:
#Fill remaining na with median of price column
calendar['price'] = calendar['price'].fillna(calendar['price'].median())

#View updated missing stats
missing_stats(calendar)

Unnamed: 0,Total,Percent
price,0,0.0
listing_id,0,0.0
date,0,0.0
available,0,0.0
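With only 322 rows affected, the global median is a reasonable fill. One variation, which this notebook does not use, is to try each listing's own median first and fall back to the global median only when a listing has no observed price. A sketch on made-up data (`price_filled` is a hypothetical column name):

```python
import numpy as np
import pandas as pd

# Toy data: listing 'c' has no observed price at all
toy = pd.DataFrame({
    "listing_id": ["a", "a", "a", "b", "b", "c"],
    "price": [100.0, np.nan, 120.0, 300.0, np.nan, np.nan],
})

# Median per listing, aligned back to every row of that listing
listing_median = toy.groupby("listing_id")["price"].transform("median")

# Global median as a fallback
global_median = toy["price"].median()

toy["price_filled"] = toy["price"].fillna(listing_median).fillna(global_median)

print(toy)
```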


## Outliers

**Percentiles for the price of a one-night stay in San Francisco**

In [14]:
#Capture stats and percentile of price
print("median ",calendar.price.median())
print(calendar['price'].describe(percentiles=[.01,0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,.95,.99]))

#How many rows are above the 99th percentile?
q = calendar["price"].quantile(0.99)
print('\nNumber of rows above 99th percentile: ', len(calendar[calendar["price"] > q]))

median  159.0
count   21341865.00
mean         230.78
std          600.20
min            0.00
1%            38.00
10%           71.00
20%           94.00
30%          115.00
40%          136.00
50%          159.00
60%          186.00
70%          220.00
80%          278.00
90%          400.00
95%          590.00
99%         1252.00
max       220713.00
Name: price, dtype: float64

Number of rows above 99th percentile:  213311


**Note** The Airbnb data also contains listings for Airbnb Luxe, Airbnb's premium tier for super-luxury travel. Luxe lets you book fully serviced mega homes and vacation experiences, and some properties come with private butlers, special amenities, and even helicopters.

Because of this, I treat the extreme values as legitimate luxury listings rather than as outliers, and leave them in the data.
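If a later analysis needs to separate these high-priced rows without dropping them, one option is to flag them. The sketch below uses made-up prices and a hypothetical `is_extreme` flag; the 0.99 cutoff mirrors the percentile check above and is not a recommendation.

```python
import pandas as pd

# Toy prices, illustrative only
prices = pd.Series([50.0, 100.0, 150.0, 200.0, 10000.0])

# 99th percentile of the toy prices (pandas interpolates linearly by default)
q = prices.quantile(0.99)

# Flag, rather than remove, the rows above the cutoff
is_extreme = prices > q

print(q, int(is_extreme.sum()))
```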

**Zero Price Removal**

In [15]:
#Remove rows where price == 0 (most likely a data error, rent cannot be $0)
calendar = calendar[calendar['price'] >0]

#View updated metrics of price
print("median ",calendar.price.median())
print(calendar['price'].describe(percentiles=[.01,0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,.95,.99]))

#How many rows are above the 99th percentile?
q = calendar["price"].quantile(0.99)
print('\nNumber of rows above 99th percentile: ', len(calendar[calendar["price"] > q]))

median  159.0
count   21341193.00
mean         230.79
std          600.21
min           10.00
1%            38.00
10%           71.00
20%           94.00
30%          115.00
40%          136.00
50%          159.00
60%          186.00
70%          220.00
80%          278.00
90%          400.00
95%          590.00
99%         1252.00
max       220713.00
Name: price, dtype: float64

Number of rows above 99th percentile:  213311


# Write file to CSV

In [None]:
#Set path to write cleaned calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\2020_0130_Calendar_Cleaned.csv'

#Write cleaned calendar data to path
calendar.to_csv(path, sep=',')
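CSV does not store data types, so whoever reads this file back has to restate them, as in the original `read_csv` call. The round trip below is a sketch on a two-row sample written to a temporary directory; the file name `calendar_cleaned_sample.csv` is made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row stand-in for the cleaned calendar data
cleaned = pd.DataFrame({
    "available": [1, 1],
    "date": pd.to_datetime(["2019-04-05", "2019-04-04"]),
    "listing_id": ["21190709", "21190709"],
    "price": [5000.0, 5000.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "calendar_cleaned_sample.csv"

    # Write with the index, as the notebook does
    cleaned.to_csv(out_path, sep=",")

    # Read back, restoring the index, the string IDs, and the dates
    reloaded = pd.read_csv(
        out_path,
        index_col=0,
        dtype={"listing_id": "object"},
        parse_dates=["date"],
    )

print(reloaded.dtypes)
```

Without the `dtype` argument, `listing_id` would be read back as an integer column.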