# Data Cleaning - Airbnb Reviews

## Introduction

In this notebook, I clean the Airbnb reviews data stored in SF_Reviews_Nov2018_Oct2019.csv.

**Read in necessary libraries**

In [58]:
#Read in libraries
import dask.dataframe as dd
import swifter

import pandas as pd

import re

import numpy as np
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

**Set Additional Settings for Notebook**

In [59]:
#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#Set plot aesthetics for notebook
sns.set(style='whitegrid', palette='pastel', color_codes=True)

#Increase number of columns and rows displayed by Pandas
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows',100)

#Ignore all remaining warnings (warnings is already imported above)
warnings.simplefilter('ignore')

**Read in Data**

In [60]:
#Set path to the aggregated reviews data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\SF Airbnb Raw Data - Aggregated\SF_Reviews_Nov2018_Oct2019.csv'

#Parse dates
parse_dates = ['date']

#Read in Airbnb Review Data
reviews = pd.read_csv(path, sep='\t', parse_dates=parse_dates,index_col=0)


## Data Preview

In [61]:
reviews.head()

Unnamed: 0,comments,date,id,listing_id,reviewer_id,reviewer_name
0,"Our experience was, without a doubt, a five st...",2009-07-23,5977,958,15695,Edmund C
1,Returning to San Francisco is a rejuvenating t...,2009-08-03,6660,958,26145,Simon
2,We were very pleased with the accommodations a...,2009-09-27,11519,958,25839,Denis
3,We highly recommend this accomodation and agre...,2009-11-05,16282,958,33750,Anna
4,Holly's place was great. It was exactly what I...,2010-02-13,26008,958,15416,Venetia


In [62]:
# #Create Pandas Profiling Report for reviews data
# profile = reviews.profile_report(title='Airbnb Reviews Report', check_correlation_pearson= False, 
# correlations={'pearson': False,
# 'spearman': False,
# 'kendall': False,
# 'phi_k': False,
# 'cramers': False,
# 'recoded':False}, 
# plot={'histogram':{'bayesian_blocks_bins': False}})

# #Write profile to an HTML file
# profile.to_file(output_file="Airbnb Reviews Report.html")

# #View pandas profile for reviews data
# profile

## Reviews Preparation for Natural Language Processing

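Add review_scores_rating from the listings data to the reviews data. The listings data only has a review score tied to the most recent review (last_review) for each listing, so a review receives a score only when its listing_id and date match a listing's id and last_review. This means that many reviews will not have a score. Because the merge is a left join, those reviews are kept with missing values at first and are removed later, when rows with a missing review_rating are dropped.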

In [63]:
#Set path to get cleaned listings data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\listings_cleaned.csv'

#Parse dates
parse_dates = ['last_review']

#Read in Airbnb cleaned_listings Data
listings = pd.read_csv(path,index_col=0, parse_dates=parse_dates, low_memory=False, sep='\t')

In [64]:
#Check listings
listings.head()

Unnamed: 0,accommodates,amenities,availability_30,availability_365,bathrooms,bed_type,bedrooms,beds,calculated_host_listings_count,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,calendar_last_scraped,calendar_updated,cancellation_policy,city,cleaning_fee,description,extra_people,first_review,guests_included,host_about,host_has_profile_pic,host_id,host_identity_verified,host_is_superhost,host_listings_count,host_location,host_name,host_neighbourhood,host_response_rate,host_response_time,host_since,host_verifications,house_rules,id,instant_bookable,is_location_exact,last_review,latitude,longitude,maximum_maximum_nights,name,neighborhood_overview,neighbourhood,neighbourhood_cleansed,number_of_reviews,number_of_reviews_ltm,price,property_type,require_guest_phone_verification,require_guest_profile_picture,requires_license,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value,reviews_per_month,room_type,security_deposit,space,summary,transit,zipcode
0,3,"TV,Cable TV,Internet,Wifi,Kitchen,Pets live on...",0,77,1.0,Real Bed,1.0,2.0,1,0.0,0.0,2019-04-03,1 week ago,moderate,San Francisco,100.0,New update: the house next door is under const...,25.0,2009-07-23,2,We are a family with 2 boys born in 2009 and 2...,1,1169,1,1,1.0,"San Francisco, California, United States",Holly,Duboce Triangle,100.0,within an hour,2008-07-31,"email, phone, facebook, reviews, kba",* No Pets - even visiting guests for a short t...,958,1,1,2019-03-16,37.76931,-122.43386,30.0,"Bright, Modern Garden Unit - 1BR/1B",*Quiet cul de sac in friendly neighborhood *St...,Duboce Triangle,Western Addition,183,51.0,170.0,Apartment,0,0,1,10.0,10.0,10.0,10.0,10.0,97.0,10.0,1.55,Entire home/apt,100.0,"Newly remodeled, modern, and bright garden uni...",New update: the house next door is under const...,*Public Transportation is 1/2 block away. *Ce...,94117.0
1,5,"Internet,Wifi,Kitchen,Heating,Family/kid frien...",0,0,1.0,Real Bed,2.0,3.0,1,0.0,0.0,2019-04-03,4 months ago,strict_14_with_grace_period,San Francisco,100.0,We live in a large Victorian house on a quiet ...,0.0,2009-05-03,2,Philip: English transplant to the Bay Area and...,1,8904,1,0,2.0,"San Francisco, California, United States",Philip And Tania,Bernal Heights,80.0,within a day,2009-03-02,"email, phone, reviews, kba, work_email","Please respect the house, the art work, the fu...",5858,0,1,2017-08-06,37.74511,-122.42102,60.0,Creative Sanctuary,I love how our neighborhood feels quiet but is...,Bernal Heights,Bernal Heights,111,0.0,235.0,Apartment,0,0,1,10.0,10.0,10.0,10.0,10.0,98.0,9.0,0.92,Entire home/apt,,We live in a large Victorian house on a quiet ...,,The train is two blocks away and you can stop ...,94110.0
2,2,"TV,Internet,Wifi,Kitchen,Free street parking,H...",30,365,4.0,Real Bed,1.0,1.0,9,9.0,0.0,2019-04-03,17 months ago,strict_14_with_grace_period,San Francisco,50.0,Nice and good public transportation. 7 minute...,12.0,2009-08-31,1,7 minutes walk to UCSF. 15 minutes walk to US...,1,21994,1,0,10.0,"San Francisco, California, United States",Aaron,Cole Valley,100.0,within a few hours,2009-06-17,"email, phone, reviews, jumio, government_id","No party, No smoking, not for any kinds of smo...",7918,0,1,2016-11-21,37.76669,-122.4525,60.0,A Friendly Room - UCSF/USF - San Francisco,"Shopping old town, restaurants, McDonald, Whol...",Cole Valley,Haight Ashbury,17,0.0,65.0,Apartment,0,0,1,8.0,9.0,8.0,9.0,9.0,85.0,8.0,0.15,Private room,200.0,Room rental-sunny view room/sink/Wi Fi (inner ...,Nice and good public transportation. 7 minute...,N Juda Muni and bus stop. Street parking.,94117.0
3,2,"TV,Internet,Wifi,Kitchen,Free street parking,H...",30,365,4.0,Real Bed,1.0,1.0,9,9.0,0.0,2019-04-03,17 months ago,strict_14_with_grace_period,San Francisco,50.0,Nice and good public transportation. 7 minute...,12.0,2014-09-08,1,7 minutes walk to UCSF. 15 minutes walk to US...,1,21994,1,0,10.0,"San Francisco, California, United States",Aaron,Cole Valley,100.0,within a few hours,2009-06-17,"email, phone, reviews, jumio, government_id",no pet no smoke no party inside the building,8142,0,1,2018-09-12,37.76487,-122.45183,90.0,Friendly Room Apt. Style -UCSF/USF - San Franc...,,Cole Valley,Haight Ashbury,8,1.0,65.0,Apartment,0,0,1,9.0,10.0,9.0,10.0,9.0,93.0,9.0,0.14,Private room,200.0,Room rental Sunny view Rm/Wi-Fi/TV/sink/large ...,Nice and good public transportation. 7 minute...,"N Juda Muni, Bus and UCSF Shuttle. small shopp...",94117.0
5,6,"TV,Cable TV,Internet,Wifi,Kitchen,Free parking...",0,20,1.0,Real Bed,2.0,3.0,1,0.0,0.0,2019-04-03,yesterday,moderate,San Francisco,125.0,"Fully furnished 2BR, 1BA flat in beautiful Vic...",0.0,2009-08-14,1,"We are a family of three who love live music,...",1,25601,0,0,1.0,"San Francisco, California, United States",Sandy,Western Addition/NOPA,90.0,within a day,2009-07-14,"email, phone, facebook, reviews","No smoking, as I'm quite allergic. Please put ...",8567,0,1,2019-03-30,37.78471,-122.44555,365.0,Lovely 2BR flat Great Location,"The neighborhood is very centrally located, cl...",Western Addition/NOPA,Western Addition,32,5.0,255.0,Apartment,0,0,1,9.0,10.0,8.0,10.0,9.0,90.0,9.0,0.27,Entire home/apt,0.0,"Fully furnished 2BR, 1BA flat in beautiful Vic...",,We're 2 blocks from several bus lines that can...,94115.0


**Merge review_scores_rating from listings to corresponding reviews**

In [65]:
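#Left-merge reviews with listings: a review gets a score only when its listing_id and date
#match a listing's id and last_review
review_scores = reviews.merge(listings.loc[:, ['last_review', 'id', 'review_scores_rating']],
                              how='left', left_on=['listing_id', 'date'],
                              right_on=['id', 'last_review'], suffixes=('_review', '_listings'))
#Check
review_scores.head()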

Unnamed: 0,comments,date,id_review,listing_id,reviewer_id,reviewer_name,last_review,id_listings,review_scores_rating
0,"Our experience was, without a doubt, a five st...",2009-07-23,5977,958,15695,Edmund C,NaT,,
1,Returning to San Francisco is a rejuvenating t...,2009-08-03,6660,958,26145,Simon,NaT,,
2,We were very pleased with the accommodations a...,2009-09-27,11519,958,25839,Denis,NaT,,
3,We highly recommend this accomodation and agre...,2009-11-05,16282,958,33750,Anna,NaT,,
4,Holly's place was great. It was exactly what I...,2010-02-13,26008,958,15416,Venetia,NaT,,
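
A quick way to see how many reviews actually picked up a score is to run the same kind of left merge with indicator=True. The snippet below is only a sketch on made-up toy data: reviews_demo, listings_demo and their values are hypothetical stand-ins, not rows from the real files.

```python
import pandas as pd

# Toy stand-ins for the reviews and listings tables (values are made up)
reviews_demo = pd.DataFrame({
    'listing_id': [958, 958, 5858],
    'date': pd.to_datetime(['2019-03-10', '2019-03-16', '2017-08-06']),
    'comments': ['Great stay', 'Lovely flat', 'Very quiet'],
})
listings_demo = pd.DataFrame({
    'id': [958, 5858],
    'last_review': pd.to_datetime(['2019-03-16', '2017-08-06']),
    'review_scores_rating': [97.0, 98.0],
})

# Left merge on (listing id, review date); indicator=True adds a `_merge` column
merged_demo = reviews_demo.merge(
    listings_demo,
    how='left',
    left_on=['listing_id', 'date'],
    right_on=['id', 'last_review'],
    indicator=True,
)

# Count reviews that found a matching listing row versus those that did not
match_counts = merged_demo['_merge'].value_counts()
print(match_counts)
```

Rows marked left_only are reviews without a matching (id, last_review) pair, which is why their review_scores_rating is missing.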


### Cleaning Merged Data Set for NLP

In [66]:
#View original reviews shape (before the merge)
print('Original reviews data shape:',reviews.shape)

#View missing values in review_scores
print('Missing values:', review_scores.isna().sum())

Original reviews data shape: (430766, 6)
Missing values: comments                   429
date                         0
id_review                    0
listing_id                   0
reviewer_id                  0
reviewer_name                1
last_review             397661
id_listings             397661
review_scores_rating    398084
dtype: int64


In [67]:
#Drop the merge helper columns from review_scores
review_scores.drop(columns=['last_review', 'id_listings'], inplace=True)

#Rename columns
review_scores.rename(columns={'review_scores_rating':'review_rating'}, inplace=True)

#Drop duplicate rows
review_scores.drop_duplicates(inplace=True)

#Strip leading and trailing white space
review_scores.comments = review_scores.comments.str.strip()

#View updated reviews shape and missing values
print('Updated reviews data shape:',review_scores.shape)
print('Missing values: \n', review_scores.isna().sum())

Updated reviews data shape: (430798, 7)
Missing values: 
 comments            378
date                  0
id_review             0
listing_id            0
reviewer_id           0
reviewer_name         1
review_rating    397772
dtype: int64
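
The shape above has more rows (430798) than the original reviews data (430766), even after dropping duplicates. One possible explanation, which this notebook does not verify, is that some (id, last_review) pairs appear more than once in the listings data with different scores, so a single review matches several listing rows. The sketch below shows that effect on hypothetical toy frames (left_demo and right_demo are made up) and how validate='many_to_one' would flag it.

```python
import pandas as pd

# Toy data: listing id 1 appears twice on the right-hand side (values are made up)
left_demo = pd.DataFrame({'listing_id': [1, 2], 'comments': ['a', 'b']})
right_demo = pd.DataFrame({'id': [1, 1, 2],
                           'review_scores_rating': [90.0, 95.0, 80.0]})

# Count right-hand rows whose merge key is not unique
n_dup_keys = right_demo.duplicated(subset=['id'], keep=False).sum()

# A left merge repeats the left row once per matching right row
inflated = left_demo.merge(right_demo, how='left',
                           left_on='listing_id', right_on='id')
print(len(left_demo), '->', len(inflated))

# validate='many_to_one' raises MergeError when right-hand keys are not unique
try:
    left_demo.merge(right_demo, how='left', left_on='listing_id',
                    right_on='id', validate='many_to_one')
    merge_is_many_to_one = True
except pd.errors.MergeError:
    merge_is_many_to_one = False
```

If this turns out to be the cause in the real data, de-duplicating listings on the merge keys before merging would keep the review count unchanged.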


In [68]:
#Replace every character that is not an ASCII letter or digit with a space
review_scores.comments = review_scores.comments.str.replace(r'[^a-zA-Z0-9]', ' ', regex=True)

#Remove any remaining punctuation from comments
review_scores.comments = review_scores.comments.str.replace(r'[^\w\s]+', '', regex=True)

#Replace empty comments with nan
review_scores.comments = review_scores.comments.replace('', np.nan)

#Remove rows with missing comments and/or review_rating
review_scores.dropna(subset=['comments', 'review_rating'], inplace=True)

#View updated reviews shape
print('Updated reviews data shape:',review_scores.shape)
print('Missing values: \n', review_scores.isna().sum())

Updated reviews data shape: (32994, 7)
Missing values: 
 comments         0
date             0
id_review        0
listing_id       0
reviewer_id      0
reviewer_name    0
review_rating    0
dtype: int64
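
The text cleaning steps above can be collected into one helper. This is a minimal sketch, not the code that produced the counts in this notebook: clean_comments and min_len are hypothetical names, and unlike the cell above it also collapses repeated whitespace and strips the result, so comments made up only of symbols become missing instead of strings of spaces. Row counts could therefore differ slightly from the ones shown here.

```python
import pandas as pd

def clean_comments(comments: pd.Series, min_len: int = 3) -> pd.Series:
    """Hypothetical helper: keep ASCII letters and digits, tidy whitespace,
    and set comments shorter than min_len characters to missing."""
    cleaned = (
        comments
        .str.replace(r'[^a-zA-Z0-9]', ' ', regex=True)  # symbols -> spaces
        .str.replace(r'\s+', ' ', regex=True)           # collapse whitespace runs
        .str.strip()                                    # trim both ends
    )
    # Missing values compare as False here, so they stay missing
    return cleaned.where(cleaned.str.len() >= min_len)

# Made-up example comments
demo = pd.Series(["Holly's place was great!", '!!!', 'ok', None])
result = clean_comments(demo)
print(result)
```

Calling dropna on the result would then remove the symbol-only, too-short and missing comments in a single step.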


In [69]:
#Keep only rows where comments are longer than 2 characters
review_scores = review_scores[review_scores.comments.str.len() > 2]


In [72]:
#View updated reviews shape
print('Updated reviews data shape:',review_scores.shape)

#View review_scores
display(review_scores)

Updated reviews data shape: (32944, 7)


Unnamed: 0,comments,date,id_review,listing_id,reviewer_id,reviewer_name,review_rating
168,Holly s place was perfect for our family of 3 ...,2018-10-28,342468218,958,189494409,Jason,97.0
171,This was a great place for me to stay travelin...,2018-11-16,349186191,958,28251745,Naomi,97.0
175,Holly s place is in a fantastic part of San Fr...,2018-12-18,359967184,958,41685622,Nick,97.0
178,Holly s description of the apartment was total...,2019-01-11,400345753,958,201398768,Barbara,97.0
179,Holly s apartment is in a safe neighborhood on...,2019-02-17,413667035,958,25563110,Melissa,97.0
...,...,...,...,...,...,...,...
462838,I stayed briefly in San Francisco and the room...,2019-09-09,526970773,38107361,27753648,Michael,60.0
462839,This place was great It was very nice and cle...,2019-09-08,526254371,38127644,291912447,Kevin,100.0
462840,We really enjoyed our stay at Astro s place T...,2019-09-07,525402646,38183193,10350895,Pariece,100.0
462843,Nice and stylish place,2019-09-06,524844122,38287564,119685705,Donut,100.0


In [73]:
# #Set path to write cleaned review_scores
# path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\review_scores_cleaned.csv'

# #Write review_scores to path
# review_scores.to_csv(path, sep='\t')