Data Cleaning - Aggregated Airbnb Listings

# Introduction

In this notebook, I clean an aggregation of Airbnb listings data for the San Francisco area, covering calendar data from 12/2018 through 12/2019.

The aggregation source code can be found [here](https://github.com/KishenSharma6/Airbnb-Analysis/blob/master/Project%20Codes/01.%20Raw%20Data%20Aggregation%20Scripts/2020_0129_Airbnb_Raw_Data_Aggregation.ipynb).

Raw data can be found [here](https://github.com/KishenSharma6/Airbnb-Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data).

## Read in libraries, read in data, and set notebook preferences

**Read in libraries**

In [1]:
#Read in libraries
import pandas as pd
import swifter
import numpy as np

**Set notebook preferences**

In [2]:
#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#Set options for pandas
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',200)

**Read in Data**

In [3]:
#Set path to get aggregated listings data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Listings_Raw_Aggregated.csv'

#list columns with date information to parse
dates = ['calendar_last_scraped', 'first_review', 'host_since', 'last_review']

#Read in Airbnb listings Data
listings = pd.read_csv(path, index_col=0, low_memory=False,
                       dtype={'review_scores_accuracy': 'object',
                              'review_scores_checkin': 'object',
                              'review_scores_cleanliness': 'object',
                              'review_scores_communication': 'object',
                              'review_scores_location': 'object',
                              'review_scores_rating': 'object',
                              'review_scores_value': 'object'},
                       sep=',', parse_dates=dates)
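
As a small illustration of what the `dtype` and `parse_dates` arguments do, the sketch below reads a tiny in-memory CSV. The sample rows, `csv_text`, `score_cols`, and `score_dtypes` are made up for this example and are not part of the real dataset; the dtype mapping is built with a comprehension, which may be easier to maintain than typing each key by hand.

```python
import io
import pandas as pd

# Tiny stand-in for the aggregated CSV (hypothetical rows, not real data)
csv_text = """,id,first_review,review_scores_rating,review_scores_value
0,958,2009-07-23,97.0,10.0
1,5858,2009-05-03,98.0,9.0
2,7918,,,
"""

# Build the dtype mapping from a list of column names
score_cols = ['review_scores_rating', 'review_scores_value']
score_dtypes = {col: 'object' for col in score_cols}

# Same style of call as above: first column as index, scores kept as
# objects, and the date column parsed into datetimes
sample = pd.read_csv(io.StringIO(csv_text), index_col=0,
                     dtype=score_dtypes, parse_dates=['first_review'])
print(sample.dtypes)
```

The score columns stay as strings here, and the empty date in the last row is read as `NaT`.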


## Preview Data

In [4]:
print('Listings shape:', listings.shape)
display(listings.head())

Listings shape: (98796, 106)


Unnamed: 0,access,accommodates,amenities,availability_30,availability_365,availability_60,availability_90,bathrooms,bed_type,bedrooms,beds,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,calendar_last_scraped,calendar_updated,cancellation_policy,city,cleaning_fee,country,country_code,description,experiences_offered,extra_people,first_review,guests_included,has_availability,host_about,host_acceptance_rate,host_has_profile_pic,host_id,host_identity_verified,host_is_superhost,host_listings_count,host_location,host_name,host_neighbourhood,host_picture_url,host_response_rate,host_response_time,host_since,host_thumbnail_url,host_total_listings_count,host_url,host_verifications,house_rules,id,instant_bookable,interaction,is_business_travel_ready,is_location_exact,jurisdiction_names,last_review,last_scraped,latitude,license,listing_url,longitude,market,maximum_maximum_nights,maximum_minimum_nights,maximum_nights,maximum_nights_avg_ntm,medium_url,minimum_maximum_nights,minimum_minimum_nights,minimum_nights,minimum_nights_avg_ntm,monthly_price,name,neighborhood_overview,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,notes,number_of_reviews,number_of_reviews_ltm,picture_url,price,property_type,require_guest_phone_verification,require_guest_profile_picture,requires_license,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,reviews_per_month,room_type,scrape_id,security_deposit,smart_location,space,square_feet,state,street,summary,thumbnail_url,transit,weekly_price,xl_picture_url,zipcode
0,*Full access to patio and backyard (shared wit...,3,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Pets liv...",0,77,0,1,1.0,Real Bed,1.0,2.0,1,1.0,0.0,0.0,2019-04-03,a week ago,moderate,San Francisco,$100.00,United States,US,New update: the house next door is under const...,none,$25.00,2009-07-23,2,t,We are a family with 2 boys born in 2009 and 2...,,t,1169,t,t,1.0,"San Francisco, California, United States",Holly,Duboce Triangle,https://a0.muscache.com/im/pictures/efdad96a-3...,100%,within an hour,2008-07-31,https://a0.muscache.com/im/pictures/efdad96a-3...,1.0,https://www.airbnb.com/users/show/1169,"['email', 'phone', 'facebook', 'reviews', 'kba']",* No Pets - even visiting guests for a short t...,958,t,A family of 4 lives upstairs with their dog. N...,f,t,"{""SAN FRANCISCO""}",2019-03-16,2019-04-03,37.76931,STR-0001256,https://www.airbnb.com/rooms/958,-122.43386,San Francisco,30.0,1.0,30,30.0,,30.0,1.0,1,1.0,"$4,200.00","Bright, Modern Garden Unit - 1BR/1B",*Quiet cul de sac in friendly neighborhood *St...,Duboce Triangle,Western Addition,,Due to the fact that we have children and a do...,183,51.0,https://a0.muscache.com/im/pictures/b7c2a199-4...,$170.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,97.0,10.0,1.55,Entire home/apt,20190400000000.0,$100.00,"San Francisco, CA","Newly remodeled, modern, and bright garden uni...",,CA,"San Francisco, CA, United States",New update: the house next door is under const...,,*Public Transportation is 1/2 block away. *Ce...,"$1,120.00",,94117
1,"Our deck, garden, gourmet kitchen and extensiv...",5,"{Internet,Wifi,Kitchen,Heating,""Family/kid fri...",0,0,0,0,1.0,Real Bed,2.0,3.0,1,1.0,0.0,0.0,2019-04-03,4 months ago,strict_14_with_grace_period,San Francisco,$100.00,United States,US,We live in a large Victorian house on a quiet ...,none,$0.00,2009-05-03,2,t,Philip: English transplant to the Bay Area and...,,t,8904,t,f,2.0,"San Francisco, California, United States",Philip And Tania,Bernal Heights,https://a0.muscache.com/im/users/8904/profile_...,80%,within a day,2009-03-02,https://a0.muscache.com/im/users/8904/profile_...,2.0,https://www.airbnb.com/users/show/8904,"['email', 'phone', 'reviews', 'kba', 'work_ema...","Please respect the house, the art work, the fu...",5858,f,,f,t,"{""SAN FRANCISCO""}",2017-08-06,2019-04-03,37.74511,,https://www.airbnb.com/rooms/5858,-122.42102,San Francisco,60.0,30.0,60,60.0,,60.0,30.0,30,30.0,"$5,500.00",Creative Sanctuary,I love how our neighborhood feels quiet but is...,Bernal Heights,Bernal Heights,,All the furniture in the house was handmade so...,111,0.0,https://a0.muscache.com/im/pictures/17714/3a7a...,$235.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,98.0,9.0,0.92,Entire home/apt,20190400000000.0,,"San Francisco, CA",We live in a large Victorian house on a quiet ...,,CA,"San Francisco, CA, United States",,,The train is two blocks away and you can stop ...,"$1,600.00",,94110
2,,2,"{TV,Internet,Wifi,Kitchen,""Free street parking...",30,365,60,90,4.0,Real Bed,1.0,1.0,9,0.0,9.0,0.0,2019-04-03,17 months ago,strict_14_with_grace_period,San Francisco,$50.00,United States,US,Nice and good public transportation. 7 minute...,none,$12.00,2009-08-31,1,t,7 minutes walk to UCSF. 15 minutes walk to US...,,t,21994,t,f,10.0,"San Francisco, California, United States",Aaron,Cole Valley,https://a0.muscache.com/im/users/21994/profile...,100%,within a few hours,2009-06-17,https://a0.muscache.com/im/users/21994/profile...,10.0,https://www.airbnb.com/users/show/21994,"['email', 'phone', 'reviews', 'jumio', 'govern...","No party, No smoking, not for any kinds of smo...",7918,f,,f,t,"{""SAN FRANCISCO""}",2016-11-21,2019-04-03,37.76669,,https://www.airbnb.com/rooms/7918,-122.4525,San Francisco,60.0,32.0,60,60.0,,60.0,32.0,32,32.0,"$1,685.00",A Friendly Room - UCSF/USF - San Francisco,"Shopping old town, restaurants, McDonald, Whol...",Cole Valley,Haight Ashbury,,Please email your picture id with print name (...,17,0.0,https://a0.muscache.com/im/pictures/26356/8030...,$65.00,Apartment,f,f,t,8.0,9.0,8.0,9.0,9.0,85.0,8.0,0.15,Private room,20190400000000.0,$200.00,"San Francisco, CA",Room rental-sunny view room/sink/Wi Fi (inner ...,,CA,"San Francisco, CA, United States",Nice and good public transportation. 7 minute...,,N Juda Muni and bus stop. Street parking.,$485.00,,94117
3,,2,"{TV,Internet,Wifi,Kitchen,""Free street parking...",30,365,60,90,4.0,Real Bed,1.0,1.0,9,0.0,9.0,0.0,2019-04-03,17 months ago,strict_14_with_grace_period,San Francisco,$50.00,United States,US,Nice and good public transportation. 7 minute...,none,$12.00,2014-09-08,1,t,7 minutes walk to UCSF. 15 minutes walk to US...,,t,21994,t,f,10.0,"San Francisco, California, United States",Aaron,Cole Valley,https://a0.muscache.com/im/users/21994/profile...,100%,within a few hours,2009-06-17,https://a0.muscache.com/im/users/21994/profile...,10.0,https://www.airbnb.com/users/show/21994,"['email', 'phone', 'reviews', 'jumio', 'govern...",no pet no smoke no party inside the building,8142,f,,f,t,"{""SAN FRANCISCO""}",2018-09-12,2019-04-03,37.76487,,https://www.airbnb.com/rooms/8142,-122.45183,San Francisco,90.0,32.0,90,90.0,,90.0,32.0,32,32.0,"$1,685.00",Friendly Room Apt. Style -UCSF/USF - San Franc...,,Cole Valley,Haight Ashbury,,Please email your picture id with print name (...,8,1.0,https://a0.muscache.com/im/pictures/27832/3b1f...,$65.00,Apartment,f,f,t,9.0,10.0,9.0,10.0,9.0,93.0,9.0,0.14,Private room,20190400000000.0,$200.00,"San Francisco, CA",Room rental Sunny view Rm/Wi-Fi/TV/sink/large ...,,CA,"San Francisco, CA, United States",Nice and good public transportation. 7 minute...,,"N Juda Muni, Bus and UCSF Shuttle. small shopp...",$490.00,,94117
4,Guests have access to everything listed and sh...,5,"{TV,Internet,Wifi,Kitchen,Heating,""Family/kid ...",30,90,60,90,1.5,Real Bed,2.0,2.0,2,2.0,0.0,0.0,2019-04-03,4 months ago,strict_14_with_grace_period,San Francisco,$225.00,United States,US,Pls email before booking. Interior featured i...,none,$150.00,2009-09-25,2,t,Always searching for a perfect piece at Europe...,,t,24215,t,f,2.0,"San Francisco, California, United States",Rosy,Alamo Square,https://a0.muscache.com/im/users/24215/profile...,100%,within an hour,2009-07-02,https://a0.muscache.com/im/users/24215/profile...,2.0,https://www.airbnb.com/users/show/24215,"['email', 'phone', 'reviews', 'kba']",House Manual and House Rules will be provided ...,8339,f,,f,t,"{""SAN FRANCISCO""}",2018-08-11,2019-04-03,37.77525,STR-0000264,https://www.airbnb.com/rooms/8339,-122.43637,San Francisco,1125.0,7.0,1125,1125.0,,1125.0,7.0,7,7.0,,Historic Alamo Square Victorian,,Western Addition/NOPA,Western Addition,,tax ID on file tax ID on file,27,1.0,https://a0.muscache.com/im/pictures/6f84a7c2-e...,$785.00,House,t,t,t,10.0,10.0,10.0,10.0,10.0,97.0,9.0,0.23,Entire home/apt,20190400000000.0,$0.00,"San Francisco, CA",Please send us a quick message before booking ...,,CA,"San Francisco, CA, United States",Pls email before booking. Interior featured i...,,,,,94117


In [5]:
listings.filter(regex='review')

Unnamed: 0,first_review,last_review,number_of_reviews,number_of_reviews_ltm,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,reviews_per_month
0,2009-07-23,2019-03-16,183,51.0,10.0,10.0,10.0,10.0,10.0,97.0,10.0,1.55
1,2009-05-03,2017-08-06,111,0.0,10.0,10.0,10.0,10.0,10.0,98.0,9.0,0.92
2,2009-08-31,2016-11-21,17,0.0,8.0,9.0,8.0,9.0,9.0,85.0,8.0,0.15
3,2014-09-08,2018-09-12,8,1.0,9.0,10.0,9.0,10.0,9.0,93.0,9.0,0.14
4,2009-09-25,2018-08-11,27,1.0,10.0,10.0,10.0,10.0,10.0,97.0,9.0,0.23
...,...,...,...,...,...,...,...,...,...,...,...,...
98791,NaT,NaT,0,0.0,,,,,,,,
98792,NaT,NaT,0,0.0,,,,,,,,
98793,NaT,NaT,0,0.0,,,,,,,,
98794,NaT,NaT,0,0.0,,,,,,,,


In [6]:
#View data types
listings.dtypes

access                                                  object
accommodates                                             int64
amenities                                               object
availability_30                                          int64
availability_365                                         int64
availability_60                                          int64
availability_90                                          int64
bathrooms                                              float64
bed_type                                                object
bedrooms                                               float64
beds                                                   float64
calculated_host_listings_count                           int64
calculated_host_listings_count_entire_homes            float64
calculated_host_listings_count_private_rooms           float64
calculated_host_listings_count_shared_rooms            float64
calendar_last_scraped                           datetim

# Data Cleaning

## Column removal for collinearity or homogeneous values

**Test for and remove collinear features**

In [7]:
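#Create a correlation matrix from the numeric columns
corr_matrix = listings.select_dtypes(include='number').corr().abs()

#Select upper triangle of matrix (np.bool was removed from NumPy, so use the built-in bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))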

#Find features with correlation greater than 0.9
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]

print('Columns with a correlation > .9:\n', to_drop)

#Drop
listings.drop(columns=to_drop,inplace=True)

#View updated listings shape
print('\nUpdated listings shape: ',listings.shape)

Columns with a correlation > .9:
 ['availability_60', 'availability_90', 'calculated_host_listings_count_entire_homes', 'host_total_listings_count', 'maximum_nights', 'maximum_nights_avg_ntm', 'minimum_maximum_nights', 'minimum_minimum_nights', 'minimum_nights', 'minimum_nights_avg_ntm']

Updated listings shape:  (98796, 96)
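
To make the upper-triangle filter easier to follow, here is a minimal sketch on a toy frame. The names `toy`, `mask`, and `to_drop_toy` and all values are invented for illustration. Because only the entries above the diagonal are kept, the later column of each highly correlated pair is the one flagged, which is consistent with `availability_30` surviving above while `availability_60` and `availability_90` are dropped.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'b' is an exact multiple of 'a', 'c' is only loosely related
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
})

# Absolute pairwise correlations
corr = toy.corr().abs()

# Boolean mask that is True strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is flagged if it correlates > 0.90 with any earlier column
to_drop_toy = [col for col in upper.columns if (upper[col] > 0.90).any()]
print(to_drop_toy)
```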


**Remove columns with homogeneous values**

In [8]:
#Capture columns with homogeneous values and store as list in cols
cols = list(listings.columns[listings.nunique() == 1])

#Drop cols
listings.drop(columns=cols, inplace=True)

#View updated listings shape
print('Updated listings shape: ',listings.shape)

Updated listings shape:  (98796, 93)


**Check for additional columns with mostly homogeneous values**

In [9]:
#Capture columns with at most two unique values and store in cols
cols = listings.columns[listings.nunique() <= 2]

#Check
display(listings[cols].head())

Unnamed: 0,country,country_code,host_acceptance_rate,host_has_profile_pic,host_identity_verified,host_is_superhost,instant_bookable,is_location_exact,jurisdiction_names,market,medium_url,neighbourhood_group_cleansed,require_guest_phone_verification,require_guest_profile_picture,requires_license,thumbnail_url,xl_picture_url
0,United States,US,,t,t,t,t,t,"{""SAN FRANCISCO""}",San Francisco,,,f,f,t,,
1,United States,US,,t,t,f,f,t,"{""SAN FRANCISCO""}",San Francisco,,,f,f,t,,
2,United States,US,,t,t,f,f,t,"{""SAN FRANCISCO""}",San Francisco,,,f,f,t,,
3,United States,US,,t,t,f,f,t,"{""SAN FRANCISCO""}",San Francisco,,,f,f,t,,
4,United States,US,,t,t,f,f,t,"{""SAN FRANCISCO""}",San Francisco,,,t,t,t,,


In [10]:
#Explore values in country, country_code, jurisdiction_names, and market
print(listings.groupby('country')['country'].count())
print('\n',listings.groupby('country_code')['country_code'].count())
print('\n',listings.groupby('jurisdiction_names')['jurisdiction_names'].count())
print('\n',listings.groupby('market')['market'].count())

country
Mexico               5
United States    98791
Name: country, dtype: int64

 country_code
MX        5
US    98791
Name: country_code, dtype: int64

 jurisdiction_names
{"SAN FRANCISCO"}          98165
{"Solano County"," CA"}        1
Name: jurisdiction_names, dtype: int64

 market
D.C.                 2
San Francisco    98516
Name: market, dtype: int64
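
Note that `groupby(...).count()` skips missing values: the `market` counts above sum to 98,518 of the 98,796 rows. As a hedged alternative, `value_counts(dropna=False)` reports missing entries as their own row. The toy frame below is hypothetical.

```python
import pandas as pd

# Toy column (hypothetical values) with one missing entry
toy = pd.DataFrame({'market': ['San Francisco', 'San Francisco', None, 'D.C.']})

# groupby + count ignores the missing row
grouped = toy.groupby('market')['market'].count()

# value_counts with dropna=False keeps it visible
counts = toy['market'].value_counts(dropna=False)
print(counts)
```

Comparing `grouped.sum()` with `len(toy)` is a quick way to see how many rows a grouped count has left out.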


In [11]:
#Drop these cols: the data pertains to SF, and the few other values may be errors caused by the host's location
listings.drop(columns=['country','country_code','jurisdiction_names','market'], inplace = True)

#Updated listings shape
print('Updated listings shape:', listings.shape)

Updated listings shape: (98796, 89)


### Removing redundant columns

The columns city, street, and smart_location encode the same information, as do neighbourhood and neighbourhood_cleansed.

I keep the city and neighbourhood_cleansed columns and drop the rest.

In [12]:
#Cols to drop
cols = ['street', 'smart_location','neighbourhood']

#Dropping redundant columns
listings.drop(columns=cols, inplace=True)

#Updated listings shape
print('Updated listings shape:', listings.shape)

Updated listings shape: (98796, 86)


## Column removal for containing unusable/unnecessary data

Columns containing URLs or web-scrape information are not needed for this analysis.

In [13]:
#Drop cols ending in url
listings = listings[listings.columns.drop(list(listings.filter(regex='url$')))]

#Check
listings.head(3)

Unnamed: 0,access,accommodates,amenities,availability_30,availability_365,bathrooms,bed_type,bedrooms,beds,calculated_host_listings_count,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,calendar_last_scraped,calendar_updated,cancellation_policy,city,cleaning_fee,description,extra_people,first_review,guests_included,host_about,host_acceptance_rate,host_has_profile_pic,host_id,host_identity_verified,host_is_superhost,host_listings_count,host_location,host_name,host_neighbourhood,host_response_rate,host_response_time,host_since,host_verifications,house_rules,id,instant_bookable,interaction,is_location_exact,last_review,last_scraped,latitude,license,longitude,maximum_maximum_nights,maximum_minimum_nights,monthly_price,name,neighborhood_overview,neighbourhood_cleansed,neighbourhood_group_cleansed,notes,number_of_reviews,number_of_reviews_ltm,price,property_type,require_guest_phone_verification,require_guest_profile_picture,requires_license,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,reviews_per_month,room_type,scrape_id,security_deposit,space,square_feet,state,summary,transit,weekly_price,zipcode
0,*Full access to patio and backyard (shared wit...,3,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Pets liv...",0,77,1.0,Real Bed,1.0,2.0,1,0.0,0.0,2019-04-03,a week ago,moderate,San Francisco,$100.00,New update: the house next door is under const...,$25.00,2009-07-23,2,We are a family with 2 boys born in 2009 and 2...,,t,1169,t,t,1.0,"San Francisco, California, United States",Holly,Duboce Triangle,100%,within an hour,2008-07-31,"['email', 'phone', 'facebook', 'reviews', 'kba']",* No Pets - even visiting guests for a short t...,958,t,A family of 4 lives upstairs with their dog. N...,t,2019-03-16,2019-04-03,37.76931,STR-0001256,-122.43386,30.0,1.0,"$4,200.00","Bright, Modern Garden Unit - 1BR/1B",*Quiet cul de sac in friendly neighborhood *St...,Western Addition,,Due to the fact that we have children and a do...,183,51.0,$170.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,97.0,10.0,1.55,Entire home/apt,20190400000000.0,$100.00,"Newly remodeled, modern, and bright garden uni...",,CA,New update: the house next door is under const...,*Public Transportation is 1/2 block away. *Ce...,"$1,120.00",94117
1,"Our deck, garden, gourmet kitchen and extensiv...",5,"{Internet,Wifi,Kitchen,Heating,""Family/kid fri...",0,0,1.0,Real Bed,2.0,3.0,1,0.0,0.0,2019-04-03,4 months ago,strict_14_with_grace_period,San Francisco,$100.00,We live in a large Victorian house on a quiet ...,$0.00,2009-05-03,2,Philip: English transplant to the Bay Area and...,,t,8904,t,f,2.0,"San Francisco, California, United States",Philip And Tania,Bernal Heights,80%,within a day,2009-03-02,"['email', 'phone', 'reviews', 'kba', 'work_ema...","Please respect the house, the art work, the fu...",5858,f,,t,2017-08-06,2019-04-03,37.74511,,-122.42102,60.0,30.0,"$5,500.00",Creative Sanctuary,I love how our neighborhood feels quiet but is...,Bernal Heights,,All the furniture in the house was handmade so...,111,0.0,$235.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,98.0,9.0,0.92,Entire home/apt,20190400000000.0,,We live in a large Victorian house on a quiet ...,,CA,,The train is two blocks away and you can stop ...,"$1,600.00",94110
2,,2,"{TV,Internet,Wifi,Kitchen,""Free street parking...",30,365,4.0,Real Bed,1.0,1.0,9,9.0,0.0,2019-04-03,17 months ago,strict_14_with_grace_period,San Francisco,$50.00,Nice and good public transportation. 7 minute...,$12.00,2009-08-31,1,7 minutes walk to UCSF. 15 minutes walk to US...,,t,21994,t,f,10.0,"San Francisco, California, United States",Aaron,Cole Valley,100%,within a few hours,2009-06-17,"['email', 'phone', 'reviews', 'jumio', 'govern...","No party, No smoking, not for any kinds of smo...",7918,f,,t,2016-11-21,2019-04-03,37.76669,,-122.4525,60.0,32.0,"$1,685.00",A Friendly Room - UCSF/USF - San Francisco,"Shopping old town, restaurants, McDonald, Whol...",Haight Ashbury,,Please email your picture id with print name (...,17,0.0,$65.00,Apartment,f,f,t,8.0,9.0,8.0,9.0,9.0,85.0,8.0,0.15,Private room,20190400000000.0,$200.00,Room rental-sunny view room/sink/Wi Fi (inner ...,,CA,Nice and good public transportation. 7 minute...,N Juda Muni and bus stop. Street parking.,$485.00,94117


In [14]:
#Drop cols containing scrape
listings = listings[listings.columns.drop(list(listings.filter(regex='scrape')))]

#Check
listings.head(3)

Unnamed: 0,access,accommodates,amenities,availability_30,availability_365,bathrooms,bed_type,bedrooms,beds,calculated_host_listings_count,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,calendar_updated,cancellation_policy,city,cleaning_fee,description,extra_people,first_review,guests_included,host_about,host_acceptance_rate,host_has_profile_pic,host_id,host_identity_verified,host_is_superhost,host_listings_count,host_location,host_name,host_neighbourhood,host_response_rate,host_response_time,host_since,host_verifications,house_rules,id,instant_bookable,interaction,is_location_exact,last_review,latitude,license,longitude,maximum_maximum_nights,maximum_minimum_nights,monthly_price,name,neighborhood_overview,neighbourhood_cleansed,neighbourhood_group_cleansed,notes,number_of_reviews,number_of_reviews_ltm,price,property_type,require_guest_phone_verification,require_guest_profile_picture,requires_license,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,reviews_per_month,room_type,security_deposit,space,square_feet,state,summary,transit,weekly_price,zipcode
0,*Full access to patio and backyard (shared wit...,3,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Pets liv...",0,77,1.0,Real Bed,1.0,2.0,1,0.0,0.0,a week ago,moderate,San Francisco,$100.00,New update: the house next door is under const...,$25.00,2009-07-23,2,We are a family with 2 boys born in 2009 and 2...,,t,1169,t,t,1.0,"San Francisco, California, United States",Holly,Duboce Triangle,100%,within an hour,2008-07-31,"['email', 'phone', 'facebook', 'reviews', 'kba']",* No Pets - even visiting guests for a short t...,958,t,A family of 4 lives upstairs with their dog. N...,t,2019-03-16,37.76931,STR-0001256,-122.43386,30.0,1.0,"$4,200.00","Bright, Modern Garden Unit - 1BR/1B",*Quiet cul de sac in friendly neighborhood *St...,Western Addition,,Due to the fact that we have children and a do...,183,51.0,$170.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,97.0,10.0,1.55,Entire home/apt,$100.00,"Newly remodeled, modern, and bright garden uni...",,CA,New update: the house next door is under const...,*Public Transportation is 1/2 block away. *Ce...,"$1,120.00",94117
1,"Our deck, garden, gourmet kitchen and extensiv...",5,"{Internet,Wifi,Kitchen,Heating,""Family/kid fri...",0,0,1.0,Real Bed,2.0,3.0,1,0.0,0.0,4 months ago,strict_14_with_grace_period,San Francisco,$100.00,We live in a large Victorian house on a quiet ...,$0.00,2009-05-03,2,Philip: English transplant to the Bay Area and...,,t,8904,t,f,2.0,"San Francisco, California, United States",Philip And Tania,Bernal Heights,80%,within a day,2009-03-02,"['email', 'phone', 'reviews', 'kba', 'work_ema...","Please respect the house, the art work, the fu...",5858,f,,t,2017-08-06,37.74511,,-122.42102,60.0,30.0,"$5,500.00",Creative Sanctuary,I love how our neighborhood feels quiet but is...,Bernal Heights,,All the furniture in the house was handmade so...,111,0.0,$235.00,Apartment,f,f,t,10.0,10.0,10.0,10.0,10.0,98.0,9.0,0.92,Entire home/apt,,We live in a large Victorian house on a quiet ...,,CA,,The train is two blocks away and you can stop ...,"$1,600.00",94110
2,,2,"{TV,Internet,Wifi,Kitchen,""Free street parking...",30,365,4.0,Real Bed,1.0,1.0,9,9.0,0.0,17 months ago,strict_14_with_grace_period,San Francisco,$50.00,Nice and good public transportation. 7 minute...,$12.00,2009-08-31,1,7 minutes walk to UCSF. 15 minutes walk to US...,,t,21994,t,f,10.0,"San Francisco, California, United States",Aaron,Cole Valley,100%,within a few hours,2009-06-17,"['email', 'phone', 'reviews', 'jumio', 'govern...","No party, No smoking, not for any kinds of smo...",7918,f,,t,2016-11-21,37.76669,,-122.4525,60.0,32.0,"$1,685.00",A Friendly Room - UCSF/USF - San Francisco,"Shopping old town, restaurants, McDonald, Whol...",Haight Ashbury,,Please email your picture id with print name (...,17,0.0,$65.00,Apartment,f,f,t,8.0,9.0,8.0,9.0,9.0,85.0,8.0,0.15,Private room,$200.00,Room rental-sunny view room/sink/Wi Fi (inner ...,,CA,Nice and good public transportation. 7 minute...,N Juda Muni and bus stop. Street parking.,$485.00,94117
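
The two drops above could also be expressed as a single boolean selection on the column names. This is only a sketch on a hypothetical, empty frame (`toy`, `pattern`, and `kept` are illustrative names); note that the pattern `scrape` also matches `calendar_last_scraped` and `last_scraped`, just as it does in the cells above.

```python
import pandas as pd

# Hypothetical frame with only column names, no rows
toy = pd.DataFrame(columns=['id', 'listing_url', 'picture_url',
                            'calendar_last_scraped', 'last_scraped',
                            'scrape_id', 'price'])

# Names ending in 'url' or containing 'scrape' anywhere
pattern = r'url$|scrape'

# Keep every column whose name does NOT match the pattern
kept = toy.loc[:, ~toy.columns.str.contains(pattern, regex=True)]
print(list(kept.columns))
```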


## Data formatting

### Formatting continuous variables 

In [15]:
#Create list of cols that contain $, %, or thousands separators
cols = ['cleaning_fee','extra_people','host_response_rate','monthly_price', 'price', 'security_deposit',
        'weekly_price']

#Remove $, %, and commas, then convert cols to floats
listings[cols] = listings[cols].replace('[$,%]', '', regex=True).astype('float64')

#Check
print('Cols dtypes:\n', listings[cols].dtypes)
display(listings[cols].head(3))

Cols dtypes:
 cleaning_fee          float64
extra_people          float64
host_response_rate    float64
monthly_price         float64
price                 float64
security_deposit      float64
weekly_price          float64
dtype: object


Unnamed: 0,cleaning_fee,extra_people,host_response_rate,monthly_price,price,security_deposit,weekly_price
0,100.0,25.0,100.0,4200.0,170.0,100.0,1120.0
1,100.0,0.0,80.0,5500.0,235.0,,1600.0
2,50.0,12.0,100.0,1685.0,65.0,200.0,485.0
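
The same replace-then-cast pattern on a toy frame, to show how the character class `[$,%]` handles prices with thousands separators and percentages. The values below are hypothetical; missing entries pass through as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy strings in the same shape as the raw columns (hypothetical values)
toy = pd.DataFrame({
    'price': ['$170.00', '$1,120.00', np.nan],
    'host_response_rate': ['100%', '80%', np.nan],
})

# Strip '$', ',' and '%', then cast the remaining digits to float
cleaned = toy.replace(r'[$,%]', '', regex=True).astype('float64')
print(cleaned)
```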


### Formatting string variables

In [16]:
#cols with troublesome punctuation
cols = ['amenities', 'host_verifications']

#Remove punctuation
listings[cols] = listings[cols].replace(r'[^\w\s]+', ' ', regex = True)
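
A quick sketch of what this pattern does to a single amenities string (the sample string and the names `amenities`, `cleaned`, and `tokens` are illustrative). Every run of characters that is neither a word character nor whitespace becomes one space, so multi-word amenities such as "Cable TV" are no longer distinguishable from two separate entries afterwards.

```python
import pandas as pd

# One raw amenities value in the style shown in the preview (hypothetical sample)
amenities = pd.Series(['{TV,"Cable TV",Internet,Wifi}'])

# Replace each run of punctuation with a single space
cleaned = amenities.replace(r'[^\w\s]+', ' ', regex=True)

# Splitting on whitespace shows the resulting tokens
tokens = cleaned.str.split().iloc[0]
print(tokens)
```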

### Formatting boolean variables

In [17]:
#List of columns to convert t's to 1's and f's to 0's
cols = ['host_has_profile_pic','host_identity_verified','host_is_superhost', 'instant_bookable',
       'is_location_exact', 'require_guest_phone_verification', 'require_guest_profile_picture', 'requires_license']

#Strip white space in strings
listings[cols] = listings[cols].apply(lambda x: x.str.strip())

#Create dictionary to map True and False
mymap = {'t':1, 'f':0}

#Replace t's and f's, leaving any other value unchanged
listings[cols] = listings[cols].applymap(lambda s: mymap.get(s, s))

#Convert cols to bool (note: any remaining missing values become True under astype('bool'))
listings[cols] = listings[cols].astype('bool')

#Check
print('Cols dtypes:\n', listings[cols].dtypes)
display(listings[cols].head(3))

Cols dtypes:
 host_has_profile_pic                bool
host_identity_verified              bool
host_is_superhost                   bool
instant_bookable                    bool
is_location_exact                   bool
require_guest_phone_verification    bool
require_guest_profile_picture       bool
requires_license                    bool
dtype: object


Unnamed: 0,host_has_profile_pic,host_identity_verified,host_is_superhost,instant_bookable,is_location_exact,require_guest_phone_verification,require_guest_profile_picture,requires_license
0,True,True,True,True,True,False,False,True
1,True,True,False,False,True,False,False,True
2,True,True,False,False,True,False,False,True
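
One caveat with casting to plain `bool` is that missing values are truthy, so a `NaN` ends up as `True`. The sketch below contrasts that with pandas' nullable `'boolean'` dtype, which keeps missing entries as `<NA>`. This is an alternative under the assumption that preserving missingness matters for later steps, not what the notebook does; the series `flags` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy flag column (hypothetical) with one missing value
flags = pd.Series(['t', 'f', np.nan], name='host_is_superhost')

# Route used above: map to 1/0, then cast to plain bool (NaN becomes True)
plain = flags.map({'t': 1, 'f': 0}).astype('bool')

# Alternative: map to True/False and use the nullable boolean dtype
nullable = flags.map({'t': True, 'f': False}).astype('boolean')

print(plain.tolist())
print(nullable.isna().tolist())
```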


## Missing Values

### Create a missing data tracker

In [18]:
def missing_tracker(df):
    """Return a DataFrame with the count and fraction of missing values per col in df.

    Also captures the dtype of each col for easier cleaning.
    Only cols with more than one missing value are kept.
    """
    missing = pd.DataFrame()
    missing['total'] = df.isna().sum()
    missing['missing%'] = missing['total']/len(df)
    missing['dtype'] = df.dtypes
    missing = missing[missing.total > 1].sort_values(by ='total',ascending = False)
    return missing

#View missing data in listings
missing = missing_tracker(listings)
display(missing)

Unnamed: 0,total,missing%,dtype
host_acceptance_rate,98796,1.0,float64
neighbourhood_group_cleansed,98796,1.0,float64
square_feet,97135,0.983188,float64
monthly_price,84400,0.854286,float64
weekly_price,84211,0.852373,float64
notes,37704,0.381635,object
license,35884,0.363213,object
access,34194,0.346107,object
interaction,32349,0.327432,object
transit,29043,0.293969,object


### Remove columns missing more than 40% of data

In [19]:
#Get names of cols with more than 40% of values missing
cols = missing[missing['missing%'] > .40].index.tolist()

#Drop cols
listings.drop(columns=cols, inplace=True)

#Update and display missing values
missing = missing_tracker(listings)
display(missing)

Unnamed: 0,total,missing%,dtype
notes,37704,0.381635,object
license,35884,0.363213,object
access,34194,0.346107,object
interaction,32349,0.327432,object
transit,29043,0.293969,object
house_rules,26112,0.264302,object
neighborhood_overview,25842,0.261569,object
host_about,23828,0.241184,object
security_deposit,20161,0.204067,float64
review_scores_value,20054,0.202984,object
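
The threshold-based drop can also be written as a single column selection, sketched here on a hypothetical frame. Columns whose missing fraction is at most 0.40 are kept, mirroring the `> .40` rule used above; `toy` and `kept` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 'square_feet' is 80% missing, 'notes' is 20% missing
toy = pd.DataFrame({
    'square_feet': [np.nan, np.nan, np.nan, np.nan, 900.0],
    'notes': ['a', 'b', None, 'd', 'e'],
    'price': [170.0, 235.0, 65.0, 65.0, 785.0],
})

# Keep columns whose fraction of missing values does not exceed 40%
kept = toy.loc[:, toy.isna().mean() <= 0.40]
print(list(kept.columns))
```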


### Resolve floats

In [20]:
#Subset float cols with missing values from listings
floats = missing[missing['dtype'] == 'float64'].index.tolist()

#View stats
print('Median values : \n', listings[floats].median())
listings[floats].describe()

Median values : 
 security_deposit                                250.00
reviews_per_month                                 1.06
host_response_rate                              100.00
cleaning_fee                                     95.00
maximum_maximum_nights                          180.00
maximum_minimum_nights                            4.00
calculated_host_listings_count_private_rooms      0.00
number_of_reviews_ltm                             3.00
calculated_host_listings_count_shared_rooms       0.00
bathrooms                                         1.00
beds                                              1.00
host_listings_count                               2.00
bedrooms                                          1.00
dtype: float64


Unnamed: 0,security_deposit,reviews_per_month,host_response_rate,cleaning_fee,maximum_maximum_nights,maximum_minimum_nights,calculated_host_listings_count_private_rooms,number_of_reviews_ltm,calculated_host_listings_count_shared_rooms,bathrooms,beds,host_listings_count,bedrooms
count,78635.0,79352.0,83421.0,88013.0,91724.0,91724.0,91724.0,91724.0,91724.0,98580.0,98700.0,98733.0,98768.0
mean,470.245565,1.868642,96.34821,107.24485,130725.2,13134.38,2.951932,14.23937,0.483723,1.357172,1.760912,72.690711,1.346914
std,756.537857,2.031312,11.428452,83.618957,15896010.0,1143728.0,8.034275,22.558361,2.761046,0.841602,1.183153,258.170663,0.934376
min,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.32,100.0,50.0,29.0,2.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
50%,250.0,1.06,100.0,95.0,180.0,4.0,0.0,3.0,0.0,1.0,1.0,2.0,1.0
75%,500.0,2.82,100.0,150.0,1125.0,30.0,2.0,19.0,0.0,1.5,2.0,10.0,2.0
max,5100.0,23.68,100.0,700.0,2147484000.0,100000000.0,87.0,276.0,27.0,14.0,30.0,1814.0,30.0


In [21]:
#Fill with median values rather than means, since Airbnb Luxe listings pull the means upward
listings[floats] = listings[floats].fillna(listings[floats].median())

#Update and display missing values
missing = missing_tracker(listings)
display(missing)

Unnamed: 0,total,missing%,dtype
notes,37704,0.381635,object
license,35884,0.363213,object
access,34194,0.346107,object
interaction,32349,0.327432,object
transit,29043,0.293969,object
house_rules,26112,0.264302,object
neighborhood_overview,25842,0.261569,object
host_about,23828,0.241184,object
review_scores_value,20054,0.202984,object
review_scores_checkin,20050,0.202943,object
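
The summary table above shows how far a few extreme values can drag the mean away from the median (compare the `mean` and `50%` rows for `maximum_maximum_nights`). A minimal sketch of the difference this makes when filling missing values follows; the numbers are made up for illustration and are not taken from the listings data.

```python
import pandas as pd

# Toy fee column with one missing entry and one extreme value
# (illustrative numbers only, not from the listings data).
fees = pd.Series([50.0, 95.0, 150.0, None, 5000.0])

# Mean of the observed values is dominated by the 5000.0 entry.
mean_fill = fees.fillna(fees.mean())

# Median of the observed values ignores how extreme the largest entry is.
median_fill = fees.fillna(fees.median())

print("Filled with mean:  ", mean_fill[3])
print("Filled with median:", median_fill[3])
```

With the mean, the missing row inherits a value far above every typical entry; with the median it lands between the two middle observations.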


### Resolve objects

In [22]:
#Preview object columns from listings that contain missing values
objects = missing.loc[missing.dtype == 'object'].index.to_list()
listings[objects].head(3)

Unnamed: 0,notes,license,access,interaction,transit,house_rules,neighborhood_overview,host_about,review_scores_value,review_scores_checkin,review_scores_location,review_scores_accuracy,review_scores_cleanliness,review_scores_communication,review_scores_rating,host_response_time,space,host_neighbourhood,zipcode,summary,description,host_location,city,state,host_name,cancellation_policy
0,Due to the fact that we have children and a do...,STR-0001256,*Full access to patio and backyard (shared wit...,A family of 4 lives upstairs with their dog. N...,*Public Transportation is 1/2 block away. *Ce...,* No Pets - even visiting guests for a short t...,*Quiet cul de sac in friendly neighborhood *St...,We are a family with 2 boys born in 2009 and 2...,10.0,10.0,10.0,10.0,10.0,10.0,97.0,within an hour,"Newly remodeled, modern, and bright garden uni...",Duboce Triangle,94117,New update: the house next door is under const...,New update: the house next door is under const...,"San Francisco, California, United States",San Francisco,CA,Holly,moderate
1,All the furniture in the house was handmade so...,,"Our deck, garden, gourmet kitchen and extensiv...",,The train is two blocks away and you can stop ...,"Please respect the house, the art work, the fu...",I love how our neighborhood feels quiet but is...,Philip: English transplant to the Bay Area and...,9.0,10.0,10.0,10.0,10.0,10.0,98.0,within a day,We live in a large Victorian house on a quiet ...,Bernal Heights,94110,,We live in a large Victorian house on a quiet ...,"San Francisco, California, United States",San Francisco,CA,Philip And Tania,strict_14_with_grace_period
2,Please email your picture id with print name (...,,,,N Juda Muni and bus stop. Street parking.,"No party, No smoking, not for any kinds of smo...","Shopping old town, restaurants, McDonald, Whol...",7 minutes walk to UCSF. 15 minutes walk to US...,8.0,9.0,9.0,8.0,8.0,9.0,85.0,within a few hours,Room rental-sunny view room/sink/Wi Fi (inner ...,Cole Valley,94117,Nice and good public transportation. 7 minute...,Nice and good public transportation. 7 minute...,"San Francisco, California, United States",San Francisco,CA,Aaron,strict_14_with_grace_period


In [23]:
#Text entry variables/host information to fill with 'Unavailable'
unavailable = ['notes','license','access','interaction','transit','house_rules','space',
               'summary','description','host_about','host_location', 'host_name','host_neighbourhood','neighborhood_overview']
#Fill 
listings[unavailable] = listings[unavailable].fillna('Unavailable')

#Categorical variables to fill with the mode of the column
mode = ['review_scores_value', 'review_scores_location', 'review_scores_checkin', 'review_scores_accuracy',
        'review_scores_cleanliness', 'review_scores_communication', 'review_scores_rating','host_response_time', 'cancellation_policy','city','state']
#Fill
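for col in mode:
    #Assign the result back; inplace fillna on a selected column may not update listings in newer pandas versions
    listings[col] = listings[col].fillna(listings[col].mode()[0])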
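
The `[0]` in the mode fill is needed because `Series.mode()` returns a Series (there can be ties) rather than a single value. A small sketch on a toy column, with made-up values:

```python
import pandas as pd

# Toy categorical column with one missing entry (illustrative values only).
policy = pd.Series(['moderate', 'strict', 'moderate', None, 'flexible'])

# mode() returns a Series of the most frequent value(s), sorted;
# [0] picks the first one.
most_common = policy.mode()[0]

# Assign the filled column back instead of relying on inplace=True.
filled = policy.fillna(most_common)

print(most_common)
print(filled.tolist())
```

If two values tie for most frequent, `mode()[0]` silently picks the first in sorted order, which is worth keeping in mind for columns with few distinct values.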

In [24]:
#Recover missing zipcodes from latitude/longitude using uszipcode
from uszipcode import SearchEngine

#Instantiate SearchEngine
zipsearch = SearchEngine(simple_zipcode=True)

#Return the zipcode closest to the given coordinates, or None if no match is found
def get_zipcode(lat, lon):
    result = zipsearch.by_coordinates(lat = lat, lng = lon, returns = 1)
    return result[0].zipcode if result else None

#Select the coordinates of rows with a missing zipcode
temp = listings.loc[listings.zipcode.isna(), ['latitude', 'longitude']].copy()

#Apply get_zipcode row-wise and store the result in temp.zipcode
temp['zipcode'] = temp.swifter.apply(lambda x: get_zipcode(x.latitude, x.longitude), axis=1)

#Fill missing values in listings.zipcode with the values recovered in temp.zipcode
listings.zipcode = listings.zipcode.combine_first(temp.zipcode)
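
The backfill relies on `combine_first` aligning on the index, so only rows with a missing zipcode are touched. Here is a minimal sketch of that pattern on a toy frame. `lookup_zip` is a hypothetical stand-in for the coordinate lookup (it returns a fixed value and does not call uszipcode), and the coordinates are illustrative only.

```python
import pandas as pd

# Toy frame: the row at index 1 has no zipcode (illustrative values only).
df = pd.DataFrame({
    'zipcode': ['94117', None, '94110'],
    'latitude': [37.77, 37.76, 37.74],
    'longitude': [-122.43, -122.44, -122.42],
})

def lookup_zip(lat, lon):
    """Hypothetical stand-in for a coordinate-to-zipcode lookup."""
    return '94114'

# Keep only the rows that need a zipcode; their original index is preserved.
missing = df[df['zipcode'].isna()]

# Look up a zipcode for each of those rows.
recovered = missing.apply(lambda r: lookup_zip(r.latitude, r.longitude), axis=1)

# combine_first keeps existing values and fills gaps from `recovered`,
# matching rows by index label.
df['zipcode'] = df['zipcode'].combine_first(recovered)

print(df['zipcode'].tolist())
```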


Missing dates will be left as is for the time being.

In [25]:
#Updated listings shape
print('Updated listings shape:', listings.shape)

Updated listings shape: (98796, 70)


## Column Specific Cleaning

This section cleans individual columns in the listings data where the Pandas Profiling report flagged problematic values.

### City Column

In [26]:
#View values in city column
listings.groupby('city')['city'].count()

city
Bay Area                             3
Bernal Heights, San Francisco        9
Brisbane                             3
Daily city                           1
Daly City                          481
Daly City                            7
Noe Valley - San Francisco          13
Nor cal                              2
San Francisco                    98198
San Francisco                       33
San Francisco, Hayes Valley         13
San Franscisco                       1
San Fàncisco                         2
San Jose                             5
South San Francisco                 22
Vallejo                              1
旧金山                                  1
舊金山                                  1
Name: city, dtype: int64

In [27]:
#Strip white space
listings.city = listings.city.str.strip()

#Map neighborhood labels, misspellings, non-Latin spellings, and any other value starting with B, San, No, or V to San Francisco
listings.city = listings.city.replace('^(B|San|No|V|[^a-zA-Z]).*', 'San Francisco', regex=True)
#Correct Daly City spelling
listings.city = listings.city.replace('^D.*', 'Daly City', regex=True)


#Check
listings.groupby('city')['city'].count()

city
Daly City                489
San Francisco          98285
South San Francisco       22
Name: city, dtype: int64
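
To make the effect of the two patterns easier to see, here is a sketch that applies the same strip and regex replacements to a handful of the city values listed earlier. The toy Series is only a sample, not the full column.

```python
import pandas as pd

# A sample of raw city values from the listing above, including
# trailing whitespace, misspellings, and a non-Latin spelling.
cities = pd.Series([
    'San Francisco ',
    'Daly City ',
    'Daily city',
    'San Franscisco',
    'South San Francisco',
    '旧金山',
    'Noe Valley - San Francisco',
])

# Remove leading/trailing whitespace so duplicates collapse.
cities = cities.str.strip()

# Values starting with B, San, No, V, or a non-Latin character
# become 'San Francisco'; the pattern consumes the whole string.
cities = cities.replace('^(B|San|No|V|[^a-zA-Z]).*', 'San Francisco', regex=True)

# Values starting with D become 'Daly City'.
cities = cities.replace('^D.*', 'Daly City', regex=True)

print(cities.value_counts())
```

Because the first pattern is anchored at the start of the string, 'South San Francisco' is left alone. The same pattern also maps every other value starting with B, San, No, or V to San Francisco, which in the full column includes Brisbane, San Jose, and Vallejo.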

### calendar_updated column

In [28]:
#Convert 'a week ago' to '1 week ago' in calendar_updated
listings['calendar_updated'] = listings['calendar_updated'].replace('a week ago', '1 week ago')

### Price column

In [29]:
#View stats over price
print('Median Price : ', listings.price.median())
listings.price.describe(percentiles=[.1,.2,.3,.4,.5,.6,.7,.8,.9])

Median Price :  150.0


count    98796.000000
mean       216.110096
std        314.895913
min          0.000000
10%         70.000000
20%         90.000000
30%        110.000000
40%        131.000000
50%        150.000000
60%        180.000000
70%        215.000000
80%        269.000000
90%        390.000000
max      25000.000000
Name: price, dtype: float64

In [30]:
#Remove rows where price is 0 (treated as data entry errors)
listings = listings[listings['price'] > 0].copy()
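
A quick way to confirm how many rows a filter like this removes is to compare lengths before and after. The sketch below uses a toy frame with made-up prices; taking a `.copy()` of the filtered result is an optional precaution so that later in-place operations act on an independent frame.

```python
import pandas as pd

# Toy frame with two zero-priced rows (illustrative values only).
toy = pd.DataFrame({'id': [1, 2, 3, 4],
                    'price': [150.0, 0.0, 95.0, 0.0]})

n_before = len(toy)

# Keep only rows with a strictly positive price.
toy = toy[toy['price'] > 0].copy()

n_dropped = n_before - len(toy)
print('Rows dropped:', n_dropped)
```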

## Renaming some column names

In [31]:
#Shorten the calculated_host_listings_count prefix to chlc
listings.rename(columns={'calculated_host_listings_count': 'chlc',
                         'calculated_host_listings_count_private_rooms': 'chlc_private_rooms',
                         'calculated_host_listings_count_shared_rooms': 'chlc_shared_rooms'}, inplace=True)

# Write out file

In [32]:
print('Final shape of listings is:',listings.shape)

Final shape of listings is: (98781, 70)


In [33]:
#Set path to write listings
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\2020_0201_Listings_Cleaned.csv'

#Write listings to path
listings.to_csv(path, sep=',')
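
Since `to_csv` writes the index as the first column by default, the cleaned file needs `index_col=0` when it is read back, as was done when loading the raw aggregate. A minimal round-trip sketch follows, using an in-memory buffer instead of a file path and a toy frame with made-up values:

```python
import io
import pandas as pd

# Toy frame with a non-default index (illustrative values only).
df = pd.DataFrame({'price': [150.0, 95.0]}, index=[10, 42])

# Write to an in-memory buffer; the index becomes the first CSV column.
buffer = io.StringIO()
df.to_csv(buffer, sep=',')

# Rewind and read back, telling pandas the first column is the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Without `index_col=0`, the index would come back as an extra `Unnamed: 0` column.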