Data Cleaning - Aggregated Airbnb Calendar Data

# Introduction

In this notebook, I clean an aggregation of Airbnb calendar data for the San Francisco area. The cleaned data covers January 2018 through December 2019.

* The aggregation source code is available [here](https://github.com/KishenSharma6/Airbnb-Analysis/blob/master/Project%20Codes/01.%20Raw%20Data%20Aggregation%20Scripts/2020_0129_Airbnb_Raw_Data_Aggregation.ipynb)

* The raw data is available [here](https://github.com/KishenSharma6/Airbnb-Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data)

## Read in libraries, read in data, and set notebook preferences

**Read in libraries**

In [1]:
#Read in libraries
import swifter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Read in Data**

In [2]:
#Set path to get aggregated calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Calendar_Raw_Aggregated.csv'

#Set date columns for parsing
parse_dates = ['date']

#Read in calendar data
calendar = pd.read_csv(path, sep = ',', dtype = {"listing_id" : "object"}, parse_dates=parse_dates,index_col=0, low_memory=False)


**Settings for Notebook**

In [3]:
#Set Pandas options
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

#Set plot style
plt.style.use('ggplot')

## Preview Data

In [4]:
#Preview calendar data
display(calendar.head())

Unnamed: 0,adjusted_price,available,date,listing_id,maximum_nights,minimum_nights,price
0,,f,2018-04-01,10988680,,,
1,,f,2018-03-31,10988680,,,
2,,f,2018-03-30,10988680,,,
3,,f,2018-03-29,10988680,,,
4,,f,2018-03-28,10988680,,,


In [5]:
#Print shape and dtypes of calendar data
print('Calendar data shape:',calendar.shape)
print('\nCalendar data types: \n',calendar.dtypes)

Calendar data shape: (36420835, 7)

Calendar data types: 
 adjusted_price            object
available                 object
date              datetime64[ns]
listing_id                object
maximum_nights           float64
minimum_nights           float64
price                     object
dtype: object


# Data Cleaning

## Drop columns not needed for Time Series Analysis

In [6]:
#Drop columns not needed for time series analysis
calendar.drop(columns = ['adjusted_price','maximum_nights', 'minimum_nights'], inplace = True)

#View
display(calendar)

Unnamed: 0,available,date,listing_id,price
0,f,2018-04-01,10988680,
1,f,2018-03-31,10988680,
2,f,2018-03-30,10988680,
3,f,2018-03-29,10988680,
4,f,2018-03-28,10988680,
...,...,...,...,...
105104167,f,2020-09-06,38533186,$255.00
105104168,f,2020-09-07,38533186,$342.00
105104169,f,2020-09-08,38533186,$455.00
105104170,f,2020-09-09,38533186,$522.00


## Filter dates

In [7]:
#Keep dates in 2018 and 2019
calendar = calendar.loc[(calendar['date'] < '2020')]

#View updated calendar shape
print('Updated calendar shape:', calendar.shape)

Updated calendar shape: (29351938, 4)
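The filter above sets only an upper bound on the date. As a minimal sketch, assuming a toy frame with made-up values (`toy`, `start`, `end`, and `toy_filtered` are illustrative names, not part of the notebook), the 2018-2019 window can also be written with explicit bounds on both sides:

```python
import pandas as pd

# Toy frame standing in for the calendar data (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-12-31", "2018-01-01", "2019-12-31", "2020-01-01"]),
    "listing_id": ["a", "b", "c", "d"],
})

# Explicit bounds for the 2018-2019 window; between() includes both endpoints
start, end = pd.Timestamp("2018-01-01"), pd.Timestamp("2019-12-31")
in_window = toy["date"].between(start, end)

# .copy() returns an independent frame, which can avoid pandas'
# SettingWithCopyWarning on later column assignments in versions that emit it
toy_filtered = toy.loc[in_window].copy()
print(toy_filtered)
```

This assumes the dates carry no time-of-day component, as in the preview above; otherwise timestamps later on the final day would fall outside the upper bound.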


## Column cleaning and data type correction

**Price column**

In [8]:
#Remove $ and , from price
calendar['price'] = calendar['price'].replace('[,$]', '', regex=True)

#Convert string to numeric
calendar['price'] = calendar['price'].swifter.apply(pd.to_numeric, errors='coerce')
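For comparison, here is a hedged sketch of the same cleaning step done with plain pandas on a whole column at once, without `swifter`. The sample strings and the names `prices`, `stripped`, and `numeric` are made up for illustration:

```python
import pandas as pd

# Made-up sample of raw price strings, including a missing and a malformed value
prices = pd.Series(["$255.00", "$1,342.00", None, "n/a"])

# Strip dollar signs and thousands separators
stripped = prices.str.replace(r"[,$]", "", regex=True)

# Convert the whole column in one call; unparseable values become NaN
numeric = pd.to_numeric(stripped, errors="coerce")
print(numeric)
```

`pd.to_numeric` accepts a full Series, so it does not need to be applied row by row.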

**Available column**

In [9]:
#Replace 't' and 'f' in available column with True and False
calendar.available = calendar.available.swifter.apply(lambda x: x == 't')
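The flag conversion can also be sketched as a single vectorized comparison. This is an illustrative alternative on made-up values (`available` and `as_bool` are hypothetical names), not the notebook's own method:

```python
import pandas as pd

# Made-up sample of the raw 't'/'f' flags, including a missing value
available = pd.Series(["t", "f", "t", None])

# Element-wise equality: 't' becomes True, anything else (including missing) False
as_bool = available.eq("t")
print(as_bool)
```

As in the cell above, any value other than `'t'` ends up as `False`, so missing flags are treated as unavailable.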





## Missing Data

**Stats of missing values per column**

In [10]:
#Create function that captures missing data stats of a pandas DataFrame
def missing_stats(df):
    #Create empty dataframe to store missing value stats in
    missing = pd.DataFrame()
    #Capture total number and total % of missing data per column
    missing['Total'] = df.isnull().sum().sort_values(ascending=False)
    missing['Percent'] = (missing['Total'] / df.isnull().count()) * 100
    return missing

#Check
missing_stats(calendar)

Unnamed: 0,Total,Percent
price,9401115,32.03
listing_id,0,0.0
date,0,0.0
available,0,0.0
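A compact variant of the same summary is sketched below. It relies on the fact that the mean of a boolean mask is the fraction of `True` values. The function name `missing_summary` and the toy frame are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    mask = df.isna()
    summary = pd.DataFrame({
        "Total": mask.sum(),           # number of missing values per column
        "Percent": mask.mean() * 100,  # share of missing values per column
    })
    return summary.sort_values("Total", ascending=False)

# Toy frame with made-up values
toy = pd.DataFrame({
    "price": [1.0, np.nan, np.nan, 4.0],
    "available": [True, False, False, True],
})
summary = missing_summary(toy)
print(summary)
```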


**View rows with missing price values**

In [11]:
#View rows with missing price
calendar[calendar.price.isnull()]

Unnamed: 0,available,date,listing_id,price
0,False,2018-04-01,10988680,
1,False,2018-03-31,10988680,
2,False,2018-03-30,10988680,
3,False,2018-03-29,10988680,
4,False,2018-03-28,10988680,
...,...,...,...,...
102208622,False,2019-09-03,28201824,
102208623,False,2019-09-04,28201824,
102208624,False,2019-09-05,28201824,
102208625,False,2019-09-06,28201824,


It appears that price is missing whenever available is false. Let's test that theory:

In [12]:
#Capture rows with missing price data and assign to df
missing_price = calendar[calendar['price'].isnull()]

#View 
print('Missing price values with availability: ', len(missing_price[missing_price.available == True]))
print('Missing price values w/o availability: ', len(missing_price[missing_price.available == False]))

Missing price values with availability:  102
Missing price values w/o availability:  9401013


Not a bad hunch :) We'll leave those values as is for our time series analysis. We will, however, drop the 102 rows where price is missing even though the listing is available.
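The relationship between availability and missing prices can also be read off a single cross-tabulation. The sketch below uses a toy frame with made-up values, and the labels (`available`, `unavailable`, `missing`, `present`) are chosen only for readability:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "available": [True, True, False, False, False],
    "price": [100.0, np.nan, np.nan, np.nan, 80.0],
})

# Turn both flags into readable labels before tabulating
availability = toy["available"].map({True: "available", False: "unavailable"})
price_state = toy["price"].isna().map({True: "missing", False: "present"})

# Rows: availability, columns: whether the price is missing
table = pd.crosstab(availability, price_state)
print(table)
```

Each cell counts the rows for one combination, so the "available" and "missing" cell corresponds to the rows that get dropped.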

**Drop missing values where available is true**

In [13]:
#Reset calendar index
calendar.reset_index(inplace=True)

#Get index of rows that contain missing prices in available rooms
drop = calendar.loc[(calendar['price'].isnull()) & (calendar['available'] == True)].index.to_list()

#Remove
calendar.drop(calendar.index[drop], inplace=True)

#Check
calendar.loc[(calendar['price'].isnull()) & (calendar['available'] == True)]

Unnamed: 0,index,available,date,listing_id,price
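The same rows can be removed with a boolean mask, without collecting index labels first. This is a minimal sketch on made-up data (`toy`, `bad`, and `toy_clean` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "available": [True, True, False],
    "price": [100.0, np.nan, np.nan],
})

# Rows that are available but have no price
bad = toy["price"].isna() & toy["available"]

# Keep everything else; drop=True discards the old index instead of
# adding it as a new column
toy_clean = toy.loc[~bad].reset_index(drop=True)
print(toy_clean)
```

Passing `drop=True` is a design choice here: without it, `reset_index` stores the old index as an extra `index` column, as seen in the output above.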


## Zero Price Removal

In [14]:
#Keep only rows with a positive price (a $0 price is most likely a data error)
#Note: NaN > 0 evaluates to False, so rows with a missing price are dropped here as well
calendar = calendar[calendar['price'] > 0]
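A comparison against `NaN` is always `False`, so a `price > 0` filter removes rows with a missing price along with the zero-priced ones. The sketch below, on made-up values, shows that behavior next to a hypothetical variant that keeps missing prices:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one valid, one zero, one missing price
toy = pd.DataFrame({"price": [120.0, 0.0, np.nan]})

# price > 0 is False for both 0.0 and NaN, so only the valid row survives
positive_only = toy[toy["price"] > 0]

# Hypothetical variant: drop zero prices but keep missing ones
positive_or_missing = toy[toy["price"].isna() | (toy["price"] > 0)]

print(len(positive_only), len(positive_or_missing))
```

Which variant is appropriate depends on whether unavailable dates, which carry no price, are needed downstream.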

# Write file to CSV

In [15]:
#View final calendar shape
print('Final calendar shape:', calendar.shape)

Final calendar shape: (19950576, 5)


In [16]:
#Set path to write cleaned calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\2020_0130_Calendar_Cleaned.csv'

#Write cleaned calendar data to path
calendar.to_csv(path, sep=',')
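By default, `to_csv` writes the DataFrame index as an unnamed first column, which then has to be handled with `index_col=0` when the file is read back. As a hedged sketch, assuming the index is not needed downstream, `index=False` avoids that. An in-memory buffer stands in for the file, and the toy values are made up:

```python
import io

import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "listing_id": ["10988680", "38533186"],
    "date": pd.to_datetime(["2018-04-01", "2019-09-06"]),
    "price": [255.0, 342.0],
})

# Write without the index; StringIO stands in for a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read back with the same dtype and date parsing used earlier in the notebook
buffer.seek(0)
round_trip = pd.read_csv(buffer, dtype={"listing_id": "object"}, parse_dates=["date"])
print(round_trip.dtypes)
```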