Data Cleaning - Aggregated Airbnb Calendar Data

# Introduction

In this notebook, I clean an aggregation of Airbnb calendar data for the San Francisco area. The aggregated data covers 10/2018 through 03/2020.

The aggregation source code is available [here](https://github.com/KishenSharma6/Airbnb-Analysis/blob/master/Project%20Codes/01.%20Raw%20Data%20Aggregation%20Scripts/2020_0129_Airbnb_Raw_Data_Aggregation.ipynb).

The raw data is available [here](https://github.com/KishenSharma6/Airbnb-Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data).

## Read in libraries, read in data, and set notebook preferences

**Read in libraries**

In [31]:
#Read in libraries
import swifter
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

**Read in Data**

In [32]:
#Set path to get aggregated calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\04_07_2020_Calendar_Raw_Aggregated.csv'

#Set date columns for parsing
parse_dates = ['date']

#Read in calendar data
calendar = pd.read_csv(path, sep = ',', dtype = {"listing_id" : "object"}, parse_dates=parse_dates,index_col=0, low_memory=False)
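The `dtype` and `parse_dates` arguments above decide how `listing_id` and `date` come out of the file. As a minimal sketch, here is the same call on a hypothetical two-row in-memory sample (the sample rows and the name `sample_csv` are illustrative, not taken from the real file):

```python
import io
import pandas as pd

# Hypothetical sample laid out like the aggregated calendar file
sample_csv = io.StringIO(
    ",adjusted_price,available,date,listing_id,maximum_nights,minimum_nights,price\n"
    "0,$80.00,f,2019-04-03,187730,120.0,3.0,$80.00\n"
    "1,$82.00,t,2019-04-05,187730,120.0,3.0,$82.00\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={"listing_id": "object"},  # keep IDs as strings, not integers
    parse_dates=["date"],            # convert the date column to datetime
    index_col=0,                     # first (unnamed) column becomes the index
)

print(sample.dtypes)
```

Price stays a string at this stage because of the `$` prefix; it is converted later in the notebook.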


**Settings for Notebook**

In [33]:
#Set Pandas options
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

#Set plot style
plt.style.use('ggplot')

## Preview Data

In [34]:
#Preview calendar data
display(calendar.head())

Unnamed: 0,adjusted_price,available,date,listing_id,maximum_nights,minimum_nights,price
0,$80.00,f,2019-04-03,187730,120.0,3.0,$80.00
1,$80.00,f,2019-04-04,187730,120.0,3.0,$80.00
2,$82.00,t,2019-04-05,187730,120.0,3.0,$82.00
3,$82.00,t,2019-04-06,187730,120.0,3.0,$82.00
4,$81.00,t,2019-04-07,187730,120.0,3.0,$81.00


In [35]:
#Print shape and dtypes of calendar data
print('Calendar data shape:',calendar.shape)
print('\nCalendar data types: \n',calendar.dtypes)

Calendar data shape: (23980063, 7)

Calendar data types: 
 adjusted_price            object
available                 object
date              datetime64[ns]
listing_id                object
maximum_nights           float64
minimum_nights           float64
price                     object
dtype: object


# Data Cleaning

## Drop columns not needed for Time Series Analysis

In [36]:
#Drop columns not needed for time series analysis
calendar.drop(columns = ['adjusted_price','maximum_nights', 'minimum_nights'], inplace = True)

#View
display(calendar)

Unnamed: 0,available,date,listing_id,price
0,f,2019-04-03,187730,$80.00
1,f,2019-04-04,187730,$80.00
2,t,2019-04-05,187730,$82.00
3,t,2019-04-06,187730,$82.00
4,t,2019-04-07,187730,$81.00
...,...,...,...,...
50499076,f,2020-09-06,38533186,$255.00
50499077,f,2020-09-07,38533186,$342.00
50499078,f,2020-09-08,38533186,$455.00
50499079,f,2020-09-09,38533186,$522.00
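If the unused columns are known up front, they can also be skipped at read time instead of being dropped afterwards. A minimal sketch, assuming a hypothetical sample without the unnamed index column (`raw` and `keep` are illustrative names):

```python
import io
import pandas as pd

# Hypothetical sample; the real file also carries an unnamed index column
raw = io.StringIO(
    "adjusted_price,available,date,listing_id,maximum_nights,minimum_nights,price\n"
    "$80.00,f,2019-04-03,187730,120.0,3.0,$80.00\n"
    "$82.00,t,2019-04-05,187730,120.0,3.0,$82.00\n"
)

# Only these columns are parsed; the rest are never loaded into memory
keep = ["available", "date", "listing_id", "price"]
slim = pd.read_csv(
    raw,
    usecols=keep,
    dtype={"listing_id": "object"},
    parse_dates=["date"],
)

print(slim.columns.tolist())
```

On a file with roughly 24 million rows, skipping three columns at read time may reduce memory use noticeably, though I have not measured it here.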


## Filter dates

In [37]:
#Keep dates before May 1, 2020
calendar = calendar.loc[(calendar['date'] < '2020-05-01')]

#View updated calendar shape
print('Updated calendar shape:', calendar.shape)

Updated calendar shape: (17112559, 4)
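The cutoff in the filter is exclusive, and a string such as `'05-01-2020'` is read month-first by pandas by default, so it means May 1, 2020. A small sketch on hypothetical dates makes both points explicit:

```python
import pandas as pd

# Hypothetical dates around the cutoff
dates = pd.Series(pd.to_datetime(["2020-04-30", "2020-05-01", "2020-09-09"]))

# ISO format avoids any month/day ambiguity
cutoff = pd.Timestamp("2020-05-01")

# Strictly-less-than: the cutoff day itself is excluded
kept = dates[dates < cutoff]

# The month-first string resolves to the same day
same_cutoff = pd.Timestamp("05-01-2020") == cutoff

print(kept.dt.strftime("%Y-%m-%d").tolist(), same_cutoff)
```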


## Column cleaning and data type correction

**Price column**

In [38]:
#Remove $ and , from the price strings
calendar['price'] = calendar['price'].replace('[,$]', '', regex=True)

#Convert string to numeric; values that cannot be parsed become NaN
calendar['price'] = calendar['price'].swifter.apply(pd.to_numeric, errors='coerce')
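`pd.to_numeric` also accepts a whole Series, so the conversion can be done in one vectorized call rather than element by element. A minimal sketch on hypothetical price strings:

```python
import pandas as pd

# Hypothetical raw price strings, including a missing and an unparseable value
prices = pd.Series(["$80.00", "$1,250.00", None, "n/a"])

# Strip dollar signs and thousands separators
cleaned = prices.str.replace(r"[,$]", "", regex=True)

# Vectorized conversion; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric)
```

This is an alternative to the `swifter` call, not a description of what the notebook ran.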

**Available column**

In [39]:
#Replace 't' and 'f' in available column with 1 and 0
calendar.available = calendar.available.swifter.apply(lambda x:  1 if x =='t' else 0)
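The same 1/0 encoding can be produced without a Python-level lambda by comparing the whole column at once. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical availability flags as they appear in the raw data
available = pd.Series(["t", "f", "t"])

# Element-wise comparison gives booleans; astype(int) turns them into 1/0
flags = available.eq("t").astype(int)

print(flags.tolist())
```

As with the lambda, anything other than `'t'` ends up as 0.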


## Missing Data

**Stats of missing values per column**

In [40]:
#Create function that captures missing data stats of a pandas DataFrame
def missing_stats(df):
    #Create empty DataFrame to store missing value stats in
    missing = pd.DataFrame()
    #Capture total number and percentage of missing values per column
    missing['Total'] = df.isnull().sum().sort_values(ascending=False)
    missing['Percent'] = missing['Total'] / len(df) * 100
    return missing

#Check
missing_stats(calendar)

Unnamed: 0,Total,Percent
price,2049409,11.98
listing_id,0,0.0
date,0,0.0
available,0,0.0
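The percentage column is simply the mean of the boolean missing-value mask, which gives a compact way to cross-check the function. A sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: half of the prices are missing
df = pd.DataFrame({
    "price": [80.0, np.nan, np.nan, 81.0],
    "available": [0, 1, 0, 1],
})

# isna() gives a True/False mask; sum counts it, mean gives the share
stats = pd.DataFrame({
    "Total": df.isna().sum(),
    "Percent": df.isna().mean() * 100,
}).sort_values("Total", ascending=False)

print(stats)
```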


### Missing price values

**View rows with missing price values**

In [41]:
#View rows with missing price
calendar[calendar.price.isnull()]

Unnamed: 0,available,date,listing_id,price
70150,1,2019-10-17,59831,
70151,1,2019-10-18,59831,
70152,1,2019-10-19,59831,
70153,1,2019-10-20,59831,
70154,1,2019-10-21,59831,
...,...,...,...,...
44838165,0,2019-01-04,28844783,
44838166,0,2019-01-03,28844783,
44838167,0,2019-01-02,28844783,
44838168,0,2019-01-01,28844783,


It appears that if available is false, price will be missing. Let's test that theory:

In [42]:
#Capture rows with missing price data and assign to df
missing_price = calendar[calendar['price'].isnull()]

#View 
print('Missing price values with availability: ', len(missing_price[missing_price.available == 1]))
print('Missing price values w/o availability: ', len(missing_price[missing_price.available == 0]))

Missing price values with availability:  322
Missing price values w/o availability:  2049087


Not a bad hunch :) Of the 2,049,409 missing prices, 2,049,087 fall on unavailable dates. We'll leave those rows alone at this step and drop only the 322 rows that are available but have no price.
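The two counts above can also be produced in one step by grouping the missing-price mask by availability. A minimal sketch on a hypothetical five-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three missing prices, one of them on an available date
df = pd.DataFrame({
    "available": [1, 0, 0, 1, 0],
    "price": [np.nan, np.nan, np.nan, 82.0, 80.0],
})

# Summing a boolean mask per group counts the missing prices in each group
missing_by_availability = df["price"].isna().groupby(df["available"]).sum()

print(missing_by_availability)
```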

**Drop missing values where available is true**

In [43]:
#Reset calendar index (the old index is kept as an 'index' column and dropped later)
calendar.reset_index(inplace=True)

#Get index labels of rows that are available but have no price
drop_idx = calendar.loc[(calendar['price'].isnull()) & (calendar['available'] == 1)].index

#Remove
calendar.drop(index=drop_idx, inplace=True)

#Check
calendar.loc[(calendar['price'].isnull()) & (calendar['available'] == 1)]

Unnamed: 0,index,available,date,listing_id,price
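An alternative that avoids resetting the index is to filter with the negated boolean mask directly. This is a sketch on a hypothetical frame, not the code the notebook ran:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left by earlier filtering
df = pd.DataFrame(
    {"available": [1, 0, 1], "price": [np.nan, np.nan, 82.0]},
    index=[10, 20, 30],
)

# Rows that are available but have no price
bad = df["price"].isna() & df["available"].eq(1)

# Keep everything else; no reset_index or label lookup is needed
df = df[~bad]

print(df)
```

Because the mask is aligned on the existing index, this works regardless of whether the index is contiguous.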


## Misc cleaning

In [54]:
#Keep rows with price > 0 (a price of 0 is most likely a data error or inactive account)
#Note: NaN > 0 is False, so rows with a missing price are dropped here as well
calendar = calendar[calendar['price'] > 0]

calendar.head()

date,index,available,listing_id,price
2019-04-03,0,0,187730,80.0
2019-04-04,1,0,187730,80.0
2019-04-05,2,1,187730,82.0
2019-04-06,3,1,187730,82.0
2019-04-07,4,1,187730,81.0
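A comparison against `NaN` always evaluates to False, so a `price > 0` filter removes missing prices along with zeros. The sketch below, on hypothetical values, shows that behaviour next to a variant that would keep the missing rows:

```python
import numpy as np
import pandas as pd

# Hypothetical prices: one valid, one zero, one missing
prices = pd.Series([80.0, 0.0, np.nan])

# NaN > 0 is False, so only the first row survives
positive_only = prices[prices > 0]

# Variant that drops zeros but keeps missing prices
positive_or_missing = prices[prices.isna() | (prices > 0)]

print(len(positive_only), len(positive_or_missing))
```

Which variant is appropriate depends on whether unavailable dates should stay in the time series.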


In [56]:
#Drop index column
calendar.drop(columns = ['index'], inplace = True)

## Set date as index

In [47]:
calendar.set_index('date', inplace=True)

# Write file to CSV

In [57]:
#View calendar shape and head
print('Final calendar shape:', calendar.shape)
display(calendar.head())

Final calendar shape: (15062791, 3)


date,available,listing_id,price
2019-04-03,0,187730,80.0
2019-04-04,0,187730,80.0
2019-04-05,1,187730,82.0
2019-04-06,1,187730,82.0
2019-04-07,1,187730,81.0


In [58]:
#Set path to write cleaned calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\2020_0407_Calendar_Cleaned.csv'

#Write calendar to path
calendar.to_csv(path, sep=',')
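CSV does not store dtypes, so whoever reads the cleaned file back needs to restore the date index and the string `listing_id`. A minimal round-trip sketch using an in-memory buffer in place of the real path (the frame `cleaned` is a hypothetical two-row stand-in):

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned calendar, indexed by date
cleaned = pd.DataFrame(
    {"available": [0, 1], "listing_id": ["187730", "187730"], "price": [80.0, 82.0]},
    index=pd.DatetimeIndex(["2019-04-03", "2019-04-05"], name="date"),
)

# Write to a buffer instead of disk
buffer = io.StringIO()
cleaned.to_csv(buffer)
buffer.seek(0)

# Restore the date index and keep listing_id as a string
restored = pd.read_csv(
    buffer,
    index_col="date",
    parse_dates=["date"],
    dtype={"listing_id": "object"},
)

print(restored.index.dtype, restored["listing_id"].tolist())
```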