# Data Cleaning - Aggregated Airbnb Reviews

## Introduction

In this notebook, I clean an aggregation of Airbnb review data for the San Francisco area. The aggregation covers reviews from 11/2018 through 12/2019.

The aggregation source code can be found [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/blob/master/Airbnb%20Raw%20Data%20Aggregation.ipynb)

Raw data can be found [here](https://github.com/KishenSharma6/Airbnb-SF_ML_-_Text_Analysis/tree/master/Data/01_Raw/SF%20Airbnb%20Raw%20Data)

**Read in necessary libraries**

In [10]:
#Read in libraries
import pandas as pd
import pandas_profiling

import re

import numpy as np

**Set Additional Settings for Notebook**

In [11]:
#Suppress future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#Increase number of columns and rows displayed by Pandas
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

#Ignore all remaining warnings
warnings.simplefilter('ignore')

**Read in Data**

In [12]:
#Set path to the aggregated reviews data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\01_Raw\SF Airbnb Raw Data\SF Airbnb Raw Data - Aggregated\01_04_2020_Reviews_Raw_Aggregated.csv'

#Parse dates
parse_dates = ['date']

#Read in Airbnb Review Data
reviews = pd.read_csv(path, sep=',', parse_dates=parse_dates,index_col=0)

## Pandas Profiling Report

In [13]:
# #Create Pandas Profiling Report for reviews data
# profile = reviews.profile_report(title='Airbnb Reviews Report', check_correlation_pearson= False, 
# correlations={'pearson': False,
# 'spearman': False,
# 'kendall': False,
# 'phi_k': False,
# 'cramers': False,
# 'recoded':False}, 
# plot={'histogram':{'bayesian_blocks_bins': False}})

# #Write profile to an HTML file
# profile.to_file(output_file="Airbnb Reviews Report.html")

# #View pandas profile for reviews data
# profile



## Data Cleaning

**Missing Data**

In [14]:
#Print current reviews shape
print('Original shape of reviews:', reviews.shape,end='\n\n')

#Replace blank (empty or whitespace-only) comments with NaN
reviews['comments'] = reviews['comments'].replace(r'^\s*$', np.nan, regex=True)

#View missing values
print('Missing values: \n', reviews.isna().sum())

Original shape of reviews: (458157, 6)

Missing values: 
 comments         480
date               0
id                 0
listing_id         0
reviewer_id        0
reviewer_name      1
dtype: int64
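
The regex replacement matters because `isna()` does not treat empty or whitespace-only strings as missing. A minimal sketch on hypothetical toy data (the `toy` Series is an illustration, not part of the reviews dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: an empty string, a whitespace-only string, and a real comment
toy = pd.Series(['', '   \n', 'Great stay'], name='comments')

# isna() alone does not flag empty or whitespace-only strings
missing_before = toy.isna().sum()

# Strings that match ^\s*$ (nothing but whitespace) become NaN
blank_as_nan = toy.replace(r'^\s*$', np.nan, regex=True)
missing_after = blank_as_nan.isna().sum()

print('Missing before replace:', missing_before)
print('Missing after replace:', missing_after)
```

On the toy data, only the two blank entries are converted, and the real comment is left untouched.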


In [15]:
#Remove rows with NA in comments
reviews = reviews[reviews.comments.notna()].copy()

#reviewer_name has no bearing on the comments or the listing scores, so fill the missing name with '-'
reviews['reviewer_name'] = reviews['reviewer_name'].fillna('-')

**Comment Anomalies**

The reviews data, as it is, does not contain review scores. This information is located within the listings dataset. In another notebook, we will merge the two datasets and perform an NLP analysis.

In the meantime, there are a few things to check before we can consider this data clean enough to run a text analysis on. The comments need to be checked for the following, which we will consider removing:

- Very short comments
- Newlines, tabs, and stray spaces
- Punctuation
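
To make the effect of these checks concrete, here is a minimal sketch that applies them to a single string with the standard `re` module. The helper `clean_comment` and the sample text are hypothetical and only illustrate the idea; the notebook itself works on the whole `comments` column.

```python
import re

def clean_comment(text):
    """Hypothetical helper: apply the cleaning steps to one string."""
    text = re.sub(r'[^\w\s]+', '', text)   # drop punctuation
    text = re.sub(r'[\n\t\r]', ' ', text)  # newlines, tabs, carriage returns -> space
    return text.strip()                    # trim leading and trailing whitespace

sample = "Great place!\r\nWould stay again :)\t"
cleaned = clean_comment(sample)
print(repr(cleaned))

# Very short comments carry little signal for text analysis
short = clean_comment("ok!")
print(repr(short), len(short))
```

Note that consecutive whitespace inside a comment is not collapsed by these steps, and that removing punctuation also removes apostrophes (for example, `don't` becomes `dont`).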

In [16]:
#Remove punctuation
reviews['comments'] = reviews['comments'].replace(r'[^\w\s]+', '', regex=True)

#Replace \n, \r and \t with a space
reviews['comments'] = reviews['comments'].replace(r'[\n\t\r]', ' ', regex=True)

#Replace newly blank comments with NaN
reviews['comments'] = reviews['comments'].replace(r'^\s*$', np.nan, regex=True)

#Remove new rows with NA in comments
reviews = reviews[reviews.comments.notna()].copy()

#Remove rows where the comment is 3 characters or fewer
reviews = reviews[reviews.comments.str.len() > 3].copy()

#Strip leading and trailing whitespace, then sort by comment text
reviews['comments'] = reviews['comments'].str.strip()
reviews = reviews.sort_values(by='comments')

#Print current reviews shape
print('Current shape of reviews:', reviews.shape)

Current shape of reviews: (456864, 6)


**Non-English Languages**

In [17]:
#Check whether a string contains only ASCII characters (used here as a rough proxy for English text)
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

#Flag comments made up entirely of ASCII characters
is_english = reviews.comments.apply(isEnglish)

#How many rows contain non-ASCII characters?
print('Rows containing non-English characters: ', (~is_english).sum())

#Remove rows containing non-ASCII characters
reviews = reviews[is_english]

#Print current reviews shape
print('Current shape of reviews:', reviews.shape)

Rows containing non-English characters:  27842
Current shape of reviews: (429022, 6)
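
The ASCII test is only a rough proxy for language. A small sketch with hypothetical sample strings shows where it can disagree with the actual language of a comment; `is_ascii` below is an illustrative stand-in that uses the same encode/decode test as the check in this notebook.

```python
def is_ascii(text):
    """Return True when every character in `text` is ASCII."""
    try:
        text.encode('utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    return True

# Hypothetical samples, not taken from the reviews data
samples = [
    'Lovely cafe near the park',   # English, ASCII only
    'Lovely café near the park',   # English, but contains an accented letter
    'Muy buen apartamento',        # Spanish, yet ASCII only
    '非常好的房子',                  # Chinese, non-ASCII
]

results = {text: is_ascii(text) for text in samples}
for text, flag in results.items():
    print(flag, '|', text)
```

In other words, the filter drops some English comments that contain accented characters and keeps comments in other languages that happen to use only unaccented Latin letters. A dedicated language-detection step would be needed to separate those cases.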


In [18]:
#Set path to write the cleaned reviews
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Air BnB - SF\Data\02_Intermediate\01_04_2020_Reviews_Cleaned.csv'

#Write reviews to path
reviews.to_csv(path, sep=',')
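
Because `to_csv` writes the index as the first column by default, a later notebook can read the cleaned file back the same way the raw aggregate is read here, with `index_col=0` and `parse_dates`. The sketch below assumes that reading pattern and uses an in-memory buffer with hypothetical toy rows instead of the real file path.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the cleaned reviews
toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-05', '2019-02-10']),
    'comments': ['Great stay', 'Very clean'],
})

# Write to an in-memory buffer; the index becomes the first, unnamed column
buffer = io.StringIO()
toy.to_csv(buffer, sep=',')
buffer.seek(0)

# Read it back, restoring the index and the datetime column
restored = pd.read_csv(buffer, sep=',', parse_dates=['date'], index_col=0)

print(restored.columns.tolist())
print(restored.dtypes)
```

Without `index_col=0`, the saved index would come back as an extra `Unnamed: 0` column.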