Data Cleaning - Aggregated Airbnb Reviews

# Introduction

In the following notebook, I will be cleaning an aggregation of Airbnb Reviews data from different users about their stay in the San Francisco area. This aggregation consists of reviews data from 12/2018 through 12/2019.

The aggregation source code can be found [here](https://github.com/KishenSharma6/Airbnb-Analysis/blob/master/Project%20Codes/01.%20Raw%20Data%20Aggregation%20Scripts/2020_0129_Airbnb_Raw_Data_Aggregation.ipynb)

Raw data can be found [here](https://github.com/KishenSharma6/Airbnb-Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data)

## Read in libraries, data, and set notebook preferences

**Read in libraries**

In [16]:
#Read in libraries
import pandas as pd
import numpy as np

import swifter

**Read in Data**

In [17]:
#Set path to get aggregated Calendar data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Reviews_Raw_Aggregated.csv'

#Read in Airbnb Review Data
reviews = pd.read_csv(path, sep=',', parse_dates=['date'],
                      dtype = {'id':'object','listing_id':'object','reviewer_id':'object'},
                      index_col=0)              

**Set Notebook preferences**

In [18]:
#Set Pandas options
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows',100)

#Ignore warnings
import warnings; warnings.simplefilter('ignore')

## Data Preview

In [19]:
#View data shape and view head
print('Reviews data shape:', reviews.shape)
display(reviews.head())

Reviews data shape: (458157, 6)


Unnamed: 0,comments,date,id,listing_id,reviewer_id,reviewer_name
0,"Our experience was, without a doubt, a five st...",2009-07-23,5977,958,15695,Edmund C
1,Returning to San Francisco is a rejuvenating t...,2009-08-03,6660,958,26145,Simon
2,We were very pleased with the accommodations a...,2009-09-27,11519,958,25839,Denis
3,We highly recommend this accomodation and agre...,2009-11-05,16282,958,33750,Anna
4,Holly's place was great. It was exactly what I...,2010-02-13,26008,958,15416,Venetia


In [20]:
#Show data types
reviews.dtypes

comments                 object
date             datetime64[ns]
id                       object
listing_id               object
reviewer_id              object
reviewer_name            object
dtype: object

# Data Cleaning

## Column removal

In [21]:
#Drop columns not needed for NLP
reviews.drop(columns = ['id','listing_id', 'reviewer_id','reviewer_name'], inplace = True)

## Missing data

**Check for and remove missing data**

In [22]:
#Replace blank comments with NAN
reviews.comments.replace('^\s*$', np.nan, regex=True, inplace=True)

#View missing values
print('Missing values: \n', reviews.isna().sum())

Missing values: 
 comments    480
date          0
dtype: int64


In [23]:
#Remove rows with NA in comments
reviews  = reviews[~reviews.comments.isna()]

#Check
print('Missing values: \n', reviews.isna().sum())

Missing values: 
 comments    0
date        0
dtype: int64


## Language Detection

Not all comments are in English. We will assign comment languages to a 'language' column and remove non-English review data

In [24]:
#Import language detection library
from langdetect import detect

# write the function that detects the language
def language_detection(text):
    try:
        return detect(text)
    except:
        return None

#Apply function to reviews, remove non-English comments, drop language column
reviews['language']=reviews.comments.swifter.apply(language_detection)
reviews =reviews[reviews.language == 'en']
reviews.drop(columns = 'language', inplace = True)

#Check
print('Updated reviews shape:', reviews.shape)
display(reviews.head())

HBox(children=(FloatProgress(value=0.0, description='Pandas Apply', max=457677.0, style=ProgressStyle(descript…


Updated reviews shape: (425525, 2)


Unnamed: 0,comments,date
0,"Our experience was, without a doubt, a five st...",2009-07-23
1,Returning to San Francisco is a rejuvenating t...,2009-08-03
2,We were very pleased with the accommodations a...,2009-09-27
3,We highly recommend this accomodation and agre...,2009-11-05
4,Holly's place was great. It was exactly what I...,2010-02-13


## Comments cleaning

In [25]:
#Remove \n,\r and \t
reviews.comments.replace('(\\n|\\t|\\r)', ' ',regex=True, inplace=True)

#Replace new blank comments with NAN
reviews.comments.replace('^\s*$', np.nan, regex=True, inplace=True)

#Remove new rows with NA in comments
reviews  = reviews[~reviews.comments.isna()]

#Strip trailing and leading whitespace
reviews['comments'].str.strip()

#Remove rows where comments character string < 3
reviews = reviews[reviews.comments.apply(len) > 3].sort_values(by='comments')

#Check
display(reviews.head(10))

Unnamed: 0,comments,date
19330,...,2013-12-01
143113,Stop and book it now. Rea (Website hi...,2017-06-07
1021372,So I moved to SF in late May from Mich...,2013-06-02
64636,"This was the perfect home from home, o...",2014-10-16
174143,We loved our time in beautiful SF! The ...,2018-08-10
15734,Brian's house was awesome! I took my girl...,2011-10-22
77378,Delilah and Karen were both very sweet an...,2014-10-28
294659,"I am kind of a picky person, but I prefer...",2018-05-23
100011,"I was greeted enthusiastically, and was t...",2015-04-10
3935,I was looking for a quiet flat within 3 m...,2012-06-24


# Write to CSV

In [26]:
#View final shape of reviews data
print('Final reviews data shape', reviews.shape)

#Set path to write listings
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\2020_0131_Reviews_Cleaned.csv'

#Write listings to path
reviews.to_csv(path, sep=',')

Final reviews data shape (425509, 2)
