Data Cleaning - Aggregated Airbnb Calendar Data

# Introduction

In this notebook, I clean an aggregation of Airbnb's calendar data for the San Francisco area. The data covers 1/2018 through 12/2019.

* The aggregation source code [here](https://github.com/KishenSharma6/Airbnb-Analysis/blob/master/Project%20Codes/01.%20Raw%20Data%20Aggregation%20Scripts/2020_0129_Airbnb_Raw_Data_Aggregation.ipynb)

* Raw data [here](https://github.com/KishenSharma6/Airbnb-Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data)

## Read in libraries, read in data, and set notebook preferences

**Read in libraries**

In [None]:
#Read in libraries
import swifter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Read in Data**

In [None]:
#Set path to get aggregated calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Calendar_Raw_Aggregated.csv'

#Set date columns for parsing
parse_dates = ['date']

#Read in calendar data
calendar = pd.read_csv(path, sep = ',', dtype = {"listing_id" : "object"}, parse_dates=parse_dates,index_col=0, low_memory=False)

**Settings for Notebook**

In [None]:
#Set Pandas options
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

#Set plot style
plt.style.use('ggplot')

## Preview Data

In [None]:
#Preview calendar data
display(calendar.head())

In [None]:
#Print shape and dtypes of calendar data
print('Calendar data shape:',calendar.shape)
print('\nCalendar data types: \n',calendar.dtypes)

# Data Cleaning

## Drop columns not needed for Time Series Analysis

In [None]:
#Drop columns not needed for time series analysis
calendar.drop(columns = ['adjusted_price','maximum_nights', 'minimum_nights'], inplace = True)

#View
display(calendar)

## Filter dates

In [None]:
# #Keep dates in 2018 and 2019
# calendar = calendar.loc[(calendar['date'] > '2017-12-31') & (calendar['date'] < '2020')]

# #View updated calendar shape
# print('Updated calendar shape:', calendar.shape)

## Column cleaning and data type correction

**Price column**

In [None]:
#Remove $ and , from price
calendar['price'] = calendar['price'].replace('[,$]', '', regex=True)

#Convert string to numeric in one vectorized call (unparseable values become NaN)
calendar['price'] = pd.to_numeric(calendar['price'], errors='coerce')
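
As a quick sanity check of the price cleaning logic, here is a minimal sketch on a made-up Series. The values in `sample_price` are invented for illustration and are not taken from the calendar data. It strips `$` and `,` and then converts to numeric with `errors='coerce'`, so anything that cannot be parsed ends up as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, invented for illustration
sample_price = pd.Series(['$1,250.00', '$85.00', np.nan, 'n/a'])

# Remove $ and , then convert; unparseable strings become NaN
stripped = sample_price.replace('[,$]', '', regex=True)
cleaned = pd.to_numeric(stripped, errors='coerce')

print(cleaned)
```

The last two entries come out as NaN: one was already missing, and `'n/a'` is coerced.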

**Available column**

In [None]:
#Replace 't' and 'f' in available column with True and False
calendar['available'] = calendar['available'] == 't'

## Missing Data

**Stats of missing values per column**

In [None]:
#Create function that captures missing data stats of a pandas df
def missing_stats(df):
    #Create empty dataframe to store missing value stats in
    missing = pd.DataFrame()
    #Capture total number and total % of missing data per column
    missing['Total'] = df.isnull().sum().sort_values(ascending=False)
    missing['Percent'] = df.isnull().mean() * 100
    return missing

#Check
missing_stats(calendar)
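
To make the two statistics concrete, here is a small sketch on a toy frame. The `toy` data is invented for illustration. The count of missing values per column is `isnull().sum()`, and the percentage is the mean of the same boolean mask times 100.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, invented for illustration
toy = pd.DataFrame({'price': [100.0, np.nan, np.nan, 80.0],
                    'available': [True, False, False, True]})

# Count and percentage of missing values per column
toy_missing = pd.DataFrame({'Total': toy.isnull().sum(),
                            'Percent': toy.isnull().mean() * 100})

print(toy_missing)
```

Here `price` has 2 of 4 values missing (50%), while `available` has none.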

**View rows with missing price values**

In [None]:
#View rows with missing price
calendar[calendar.price.isnull()]

It appears that price is missing when a listing is unavailable (available is False). Let's test that theory:

In [None]:
#Capture rows with missing price data and assign to df
missing_price = calendar[calendar['price'].isnull()]

#View 
print('Missing price values with availability: ', len(missing_price[missing_price.available == True]))
print('Missing price values w/o availability: ', len(missing_price[missing_price.available == False]))

Not a bad hunch :) We'll leave the missing prices on unavailable dates as is for our time series analysis. We will, however, drop the 102 rows where the price is missing even though the listing is available.
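
The same check can also be written as a single cross-tabulation of "is the price missing" against "is the listing available". This is a minimal sketch on an invented toy frame, so the labels and counts below are illustrative only and are not the real calendar numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, invented for illustration
toy = pd.DataFrame({'price': [100.0, np.nan, np.nan, np.nan, 80.0],
                    'available': [True, False, False, True, True]})

# Give both dimensions readable labels
price_status = toy['price'].isnull().map({True: 'missing', False: 'present'})
avail_status = toy['available'].map({True: 'available', False: 'unavailable'})

# Rows: price status, columns: availability
table = pd.crosstab(price_status, avail_status)

print(table)
```

The `missing` / `available` cell is the one that corresponds to the rows we want to drop.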

**Drop missing values where available is true**

In [None]:
#Reset calendar index
calendar.reset_index(inplace=True)

#Flag rows that contain missing prices in available rooms
missing_available = calendar['price'].isnull() & (calendar['available'] == True)

#Remove
calendar = calendar[~missing_available]

#Check
calendar.loc[(calendar['price'].isnull()) & (calendar['available'] == True)]

## Zero Price Removal

In [None]:
#Remove rows where price == 0 (most likely a data error, rent cannot be $0)
#Keep rows with missing price, which belong to unavailable dates
calendar = calendar[(calendar['price'] > 0) | (calendar['price'].isnull())]
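
One thing to keep in mind when filtering on price: a comparison such as `> 0` evaluates to False for NaN, so a plain `> 0` filter drops missing prices along with the zeros. Here is a minimal sketch on invented values, showing the difference between the plain filter and one that explicitly keeps missing prices.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, invented for illustration
toy_price = pd.Series([0.0, 120.0, np.nan])

# Plain filter: drops the zero AND the NaN
kept_plain = toy_price[toy_price > 0]

# Filter that explicitly keeps missing prices: drops only the zero
kept_with_nan = toy_price[(toy_price > 0) | toy_price.isnull()]

print(len(kept_plain), len(kept_with_nan))
```

Which of the two is right depends on whether the missing prices should survive this step.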

# Write file to CSV

In [None]:
#View final calendar shape
print('Final calendar shape:', calendar.shape)

In [None]:
#Set path to write cleaned calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\2020_0130_Calendar_Cleaned.csv'

#Write cleaned calendar data to path
calendar.to_csv(path, sep=',')