Data Cleaning - Aggregated Airbnb Listings

# Introduction

In this notebook, I clean an aggregation of Airbnb listings data. The data covers the San Francisco area and consists of calendar data from 12/2018 through 12/2019.

The aggregation source code can be found [here](https://github.com/KishenSharma6/Airbnb-Analysis/blob/master/Project%20Codes/01.%20Raw%20Data%20Aggregation%20Scripts/2020_0129_Airbnb_Raw_Data_Aggregation.ipynb).

The raw data can be found [here](https://github.com/KishenSharma6/Airbnb-Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data).

## Read in libraries, read in data, and set notebook preferences

**Read in libraries**

In [None]:
#Read in libraries
import pandas as pd
import swifter
import numpy as np

**Set notebook preferences**

In [None]:
#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#Set options for pandas
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',200)

**Read in Data**

In [None]:
#Set path to get aggregated listings data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Listings_Raw_Aggregated.csv'

#List columns with date information to parse
dates = ['calendar_last_scraped', 'first_review', 'host_since', 'last_review']

#Read in Airbnb listings data, keeping review scores as objects
listings = pd.read_csv(path, index_col=0, low_memory=False,
                       dtype={'review_scores_accuracy': 'object',
                              'review_scores_checkin': 'object',
                              'review_scores_cleanliness': 'object',
                              'review_scores_communication': 'object',
                              'review_scores_location': 'object',
                              'review_scores_rating': 'object',
                              'review_scores_value': 'object'},
                       sep=',', parse_dates=dates)


## Preview Data

In [None]:
print('Listings shape:', listings.shape)
display(listings.head())

In [None]:
listings.filter(regex='review')

In [None]:
#View data types
listings.dtypes

# Data Cleaning

## Column removal for collinearity or homogeneous values

**Test for and remove collinear features**

In [None]:
#Create a correlation matrix over numeric columns only
#(recent pandas versions raise on non-numeric columns unless numeric_only=True)
corr_matrix = listings.corr(numeric_only=True).abs()

#Select upper triangle of matrix (np.bool was removed from NumPy; use the builtin bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

#Find features with correlation greater than 0.9
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]

print('Columns with a correlation > .9:\n', to_drop)

#Drop
listings.drop(columns=to_drop, inplace=True)

#View updated listings shape
print('\nUpdated listings shape: ', listings.shape)
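
To make the upper-triangle filter concrete, here is a minimal sketch on a toy frame. The column names and values are made up for illustration and are not taken from the listings data. Because only the cells above the diagonal are kept, each correlated pair is inspected once and only the later column of the pair is flagged.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'b' is an exact multiple of 'a'
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10],
                    'c': [5, 3, 4, 1, 2]})

# Absolute pairwise correlations
corr = toy.corr().abs()

# Boolean mask that is True strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is flagged if it correlates > 0.9 with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

Here 'a' survives and 'b' is flagged, so which member of a pair is kept depends on column order.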

**Remove columns with homogeneous values**

In [None]:
#Capture columns with homogeneous values and store as list in cols
cols = list(listings.columns[listings.nunique() == 1])

#Drop cols
listings.drop(columns=cols, inplace=True)

#View updated listings shape
print('Updated listings shape: ',listings.shape)
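
One detail worth keeping in mind: `nunique()` ignores missing values by default, so a column holding a single value plus some NaNs also counts as homogeneous. The sketch below uses a made-up toy frame to show the difference between the default and `dropna=False`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns) with one gap in the second column
toy = pd.DataFrame({'constant': ['x', 'x', 'x'],
                    'constant_with_gap': ['y', np.nan, 'y'],
                    'varied': [1, 2, 3]})

# Default: NaN is not counted as a distinct value
constant_cols = toy.columns[toy.nunique() == 1].tolist()

# Stricter variant: NaN counts as its own value
strict_cols = toy.columns[toy.nunique(dropna=False) == 1].tolist()

print(constant_cols)
print(strict_cols)
```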

**Check for additional columns with mostly homogeneous values**

In [None]:
#Capture columns with at most two unique values and store in cols
cols = listings.columns[listings.nunique() <= 2]

#Check
display(listings[cols].head())

In [None]:
#Explore values in country, country_code, jurisdiction_names, and market
print(listings.groupby('country')['country'].count())
print('\n',listings.groupby('country_code')['country_code'].count())
print('\n',listings.groupby('jurisdiction_names')['jurisdiction_names'].count())
print('\n',listings.groupby('market')['market'].count())

In [None]:
#Drop these cols: all listings are in SF, so stray values are likely errors tied to the host's location
listings.drop(columns=['country','country_code','jurisdiction_names','market'], inplace = True)

#Updated listings shape
print('Updated listings shape:', listings.shape)

### Removing redundant columns

Columns city, street, and smart_location encode the same information, as do neighbourhood and neighbourhood_cleansed.

I keep the city and neighbourhood_cleansed columns and drop the rest.

In [None]:
#Cols to drop
cols = ['street', 'smart_location','neighbourhood']

#Dropping redundant columns
listings.drop(columns=cols, inplace=True)

#Updated listings shape
print('Updated listings shape:', listings.shape)

## Column removal for containing unusable/unnecessary data

Columns containing URL links or web-scrape information are not needed for this analysis.

In [None]:
#Drop cols ending in url
listings = listings.drop(columns=listings.filter(regex='url$').columns)

#Check
listings.head(3)

In [None]:
#Drop cols containing scrape
listings = listings.drop(columns=listings.filter(regex='scrape').columns)

#Check
listings.head(3)
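
As a quick illustration of how the two patterns differ, the sketch below runs them against a made-up set of column names (only the names matter, so the toy frame has no rows). `url$` is anchored to the end of the name, while `scrape` matches anywhere in it.

```python
import pandas as pd

# Toy frame with hypothetical column names and no rows
toy = pd.DataFrame(columns=['listing_url', 'picture_url', 'url_notes',
                            'last_scraped', 'calendar_last_scraped', 'price'])

# 'url$' only matches names that end in 'url'
url_cols = toy.filter(regex='url$').columns.tolist()

# 'scrape' matches names containing 'scrape' anywhere
scrape_cols = toy.filter(regex='scrape').columns.tolist()

trimmed = toy.drop(columns=url_cols + scrape_cols)
print(trimmed.columns.tolist())
```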

## Data formatting

### Formatting continuous variables 

In [None]:
#List cols stored as strings that contain $, %, or thousands separators
cols = ['cleaning_fee', 'extra_people', 'host_response_rate', 'monthly_price', 'price', 'security_deposit',
        'weekly_price']

#Remove $, %, and commas, then convert cols to floats
listings[cols] = listings[cols].replace(r'[$,%]', '', regex=True).astype('float64')

#Check
print('Cols dtypes:\n', listings[cols].dtypes)
display(listings[cols].head(3))
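
The same strip-and-cast step can be sketched on a couple of made-up values to show what it does to currency strings, percentages, and missing entries. The values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the raw string formats
toy = pd.DataFrame({'price': ['$1,250.00', '$85.00', np.nan],
                    'host_response_rate': ['100%', '90%', np.nan]})

# Strip $, commas, and %, then cast; NaN passes through untouched
cleaned = toy.replace(r'[$,%]', '', regex=True).astype('float64')
print(cleaned)
```

Note that percentages end up on a 0 to 100 scale rather than 0 to 1.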

### Formatting string variables

In [None]:
#Cols with troublesome punctuation
cols = ['amenities', 'host_verifications']

#Replace runs of punctuation with a single space (raw string avoids invalid-escape warnings)
listings[cols] = listings[cols].replace(r'[^\w\s]+', ' ', regex=True)

### Formatting boolean variables

In [None]:
#List of columns to convert t's to True and f's to False
cols = ['host_has_profile_pic', 'host_identity_verified', 'host_is_superhost', 'instant_bookable',
        'is_location_exact', 'require_guest_phone_verification', 'require_guest_profile_picture',
        'requires_license']

#Strip white space in strings
listings[cols] = listings[cols].apply(lambda x: x.str.strip())

#Create dictionary to map True and False
mymap = {'t': True, 'f': False}

#Replace t's and f's column by column; values outside mymap are left unchanged
#(Series.map is used because DataFrame.applymap is deprecated in recent pandas)
listings[cols] = listings[cols].apply(lambda col: col.map(lambda s: mymap.get(s, s)))

#Convert cols to bool (note: any remaining NaN becomes True under astype('bool'))
listings[cols] = listings[cols].astype('bool')

#Check
print('Cols dtypes:\n', listings[cols].dtypes)
display(listings[cols].head(3))
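
One caveat with the conversion above: `astype('bool')` has no way to represent a missing value, so any NaN left in these columns is silently turned into `True`. The sketch below shows this on a toy series and, as an alternative I am only suggesting here, pandas' nullable `'boolean'` dtype, which keeps the gap as `<NA>`.

```python
import numpy as np
import pandas as pd

# Toy flag column (hypothetical values) with one missing entry
flags = pd.Series(['t', 'f', np.nan], dtype='object')

# Map t/f to booleans; the missing entry stays NaN
mapped = flags.map({'t': True, 'f': False})

# Plain bool: NaN is truthy, so it becomes True
as_bool = mapped.astype('bool')

# Nullable boolean: the missing entry is preserved as <NA>
as_nullable = mapped.astype('boolean')

print(as_bool.tolist())
print(as_nullable.isna().tolist())
```

If missing flags should stay visible to later missing-value checks, the nullable dtype is the safer choice; otherwise the plain `bool` cast is fine as long as the NaN-to-True behaviour is intended.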

## Missing Values

### Create a missing data tracker

In [None]:
def missing_tracker(df):
    #Returns a df containing the count and share (0 to 1) of missing values per col in df.
    #Also captures dtype per col in df for easier cleaning
    missing = pd.DataFrame()
    missing['total'] = df.isna().sum()
    missing['missing%'] = missing['total']/len(df)
    missing['dtype'] = df.dtypes
    #Keep cols with more than one missing value, sorted by count
    missing = missing[missing.total > 1].sort_values(by='total', ascending=False)
    return missing

#View missing data in listings
missing = missing_tracker(listings)
display(missing)

### Remove columns missing more than 40% of data

In [None]:
#Get names of cols with more than 40% of values missing
cols = missing[missing['missing%'] > .40].index.tolist()

#Drop cols
listings.drop(columns=cols, inplace=True)

#Update and display missing values
missing = missing_tracker(listings)
display(missing)

### Resolve floats

In [None]:
#Subset floats from listings
floats = missing[missing['dtype'] == 'float64'].index.tolist()

#View stats
print('Median values : \n', listings[floats].median())
listings[floats].describe()

In [None]:
#Fill with median rather than mean, since Airbnb Luxe listings pull the distributions upward
listings[floats] = listings[floats].fillna(listings[floats].median())

#Update and display missing values
missing = missing_tracker(listings)
display(missing)
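
To illustrate why the median is the more robust fill value when a few very expensive listings are present, here is a small sketch with made-up prices, one of which plays the role of a luxury outlier.

```python
import numpy as np
import pandas as pd

# Toy prices (hypothetical): one extreme value and one missing entry
prices = pd.Series([100.0, 120.0, 150.0, 10000.0, np.nan])

# The mean is dragged up by the outlier, the median is not
mean_fill = prices.fillna(prices.mean())
median_fill = prices.fillna(prices.median())

print('Filled with mean:  ', mean_fill.iloc[-1])
print('Filled with median:', median_fill.iloc[-1])
```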

### Resolve objects

In [None]:
#Preview object columns from listings that contain missing values
objects = missing.loc[missing.dtype == 'object'].index.to_list()
listings[objects].head(3)

In [None]:
#Text entry variables/host information to fill with 'Unavailable'
unavailable = ['notes','license','access','interaction','transit','house_rules','space',
               'summary','description','host_about','host_location', 'host_name','host_neighbourhood','neighborhood_overview']
#Fill 
listings[unavailable] = listings[unavailable].fillna('Unavailable')

#Categorical variables to fill with the mode of the column
mode = ['review_scores_value', 'review_scores_location', 'review_scores_checkin', 'review_scores_accuracy',
        'review_scores_cleanliness', 'review_scores_communication', 'review_scores_rating',
        'host_response_time', 'cancellation_policy', 'city', 'state']
#Fill by reassigning each column (inplace fillna on a selected column is unreliable in recent pandas)
for col in mode:
    listings[col] = listings[col].fillna(listings[col].mode()[0])
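
A small note on `mode()[0]`: when two values tie for most frequent, `mode()` returns all of them in sorted order, so `[0]` picks the first in that ordering. The sketch below uses made-up category values to show this.

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical values) with a two-way tie
policy = pd.Series(['strict', 'flexible', 'strict', 'flexible', np.nan])

# mode() ignores NaN and returns every tied value, sorted
modes = policy.mode().tolist()

# Filling with the first mode resolves the tie alphabetically
filled = policy.fillna(policy.mode()[0])

print(modes)
print(filled.tolist())
```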

In [None]:
#Reverse engineer missing zipcodes from latitude and longitude
from uszipcode import SearchEngine

#Instantiate SearchEngine
zipsearch = SearchEngine(simple_zipcode=True)

#Write function that finds zip given lat and long data
def get_zipcode(lat, lon):
    result = zipsearch.by_coordinates(lat=lat, lng=lon, returns=1)
    #Return NaN when no zipcode is found near the coordinates
    return result[0].zipcode if result else np.nan

#Subset coordinates of rows missing a zipcode
temp = listings.loc[listings.zipcode.isna(), ['latitude', 'longitude']].copy()

#Apply get_zipcode and assign to temp.zipcode
temp['zipcode'] = temp.swifter.apply(lambda x: get_zipcode(x.latitude, x.longitude), axis=1)

#Fill missing zipcodes in listings with the values from temp
listings['zipcode'] = listings['zipcode'].combine_first(temp['zipcode'])
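
The final step relies on `combine_first` aligning on the index: existing zipcodes are kept, and only the gaps are filled from the looked-up values. A minimal sketch with made-up index labels and zipcodes:

```python
import numpy as np
import pandas as pd

# Toy zipcode column (hypothetical values) with two gaps
zipcode = pd.Series(['94110', np.nan, '94103', np.nan], index=[10, 11, 12, 13])

# Looked-up values, indexed by the rows that were missing
recovered = pd.Series(['94117', '94122'], index=[11, 13])

# Existing values win; gaps are filled by matching index labels
filled = zipcode.combine_first(recovered)
print(filled.tolist())
```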

Missing dates will be left as is for the time being.

In [None]:
#Updated listings shape
print('Updated listings shape:', listings.shape)

## Column Specific Cleaning

This section cleans individual columns whose value issues were spotted in the Pandas Profiling report.

### City Column

In [None]:
#View values in city column
listings.groupby('city')['city'].count()

In [None]:
#Strip white space
listings.city = listings.city.str.strip()

#Replace neighborhood information with San Francisco and correct Daly City spelling
#(reassign instead of inplace replace, which is unreliable on a selected column in recent pandas)
listings['city'] = listings['city'].replace(r'^(B|San|No|V|[^a-zA-Z]).*', 'San Francisco', regex=True)
listings['city'] = listings['city'].replace(r'^D.*', 'Daly City', regex=True)


#Check
listings.groupby('city')['city'].count()
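
To show what the two patterns catch, the sketch below applies them to a handful of made-up city strings. Note that the first pattern matches anything starting with "San", so a value such as "San Mateo" would also be relabelled; that is acceptable only if such values do not occur in the data or are meant to be folded into San Francisco.

```python
import pandas as pd

# Toy city values (hypothetical); the escaped entry is San Francisco in Chinese
city = pd.Series([' San Francisco', 'San Francisco, Hayes Valley',
                  'Noe Valley - San Francisco', 'Daly city',
                  '\u65e7\u91d1\u5c71', 'Oakland', 'San Mateo'])

# Strip white space, then apply the two anchored patterns in order
city = city.str.strip()
city = city.replace(r'^(B|San|No|V|[^a-zA-Z]).*', 'San Francisco', regex=True)
city = city.replace(r'^D.*', 'Daly City', regex=True)

print(city.tolist())
```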

### calendar_updated column

In [None]:
#Convert 'a week ago' to '1 week ago' in calendar_updated
listings['calendar_updated'] = listings['calendar_updated'].replace('a week ago', '1 week ago')

### Price column

In [None]:
#View stats over price
print('Median Price : ', listings.price.median())
listings.price.describe(percentiles=[.1,.2,.3,.4,.5,.6,.7,.8,.9])

In [None]:
#Remove rows where price = 0 (treated as data-entry errors)
listings = listings[listings['price'] >0]

## Renaming some column names

In [None]:
#Shorten the calculated_host_listings_count prefix to chlc
listings.rename(columns={'calculated_host_listings_count': 'chlc',
                         'calculated_host_listings_count_private_rooms': 'chlc_private_rooms',
                         'calculated_host_listings_count_shared_rooms': 'chlc_shared_rooms'},
                inplace=True)

# Write out file

In [None]:
print('Final shape of listings is:',listings.shape)

In [None]:
#Set path to write listings
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\2020_0201_Listings_Cleaned.csv'

#Write listings to path
listings.to_csv(path, sep=',')