## Data Preprocessing and Cleaning

Before cleaning the review text, we need to preprocess and clean the other columns so that they can be used for EDA and sentiment analysis. We will use the CSV file that we scraped from TripAdvisor. A custom function has been defined to handle each column in the dataset. The explanation for each function appears above its definition.

In [28]:
import pandas as pd
import numpy as np
import re
import datetime

In [29]:
df = pd.read_csv('../data/oberoi_delhi_reviews.csv')
df.head()

| Unnamed: 0 | Customer Review | Date Of Stay | Customer Rating | Owner Responded |
|---|---|---|---|---|
| 0 | Excellent and highly recommended! We only sp... | Date of stay: November 2022 | bubble_50 | NaN |
| 1 | This was our favourite hotel whilst we visited... | Date of stay: August 2022 | bubble_50 | Dear Guest, Thank you for choosing to stay wi... |
| 2 | We stayed 2 nights at the New Delhi Oberoi. Wh... | Date of stay: November 2022 | bubble_50 | NaN |
| 3 | The Very BEST!! Quality of service and food is... | Date of stay: December 2022 | bubble_50 | Dear Guest, I am delighted you had a memorab... |
| 4 | The service , food, location, cleanliness is e... | Date of stay: December 2022 | bubble_50 | Dear Guest, I am delighted you had a memorab... |


The customer rating is stored as a CSS class in the HTML. For example, bubble_50 means the user has given the property a 5-star rating. TripAdvisor review ratings do not use decimals (like 4.5 or 3.5). Using the class value, we will map each rating to one of the 5 categories defined by TripAdvisor ('Excellent', 'Very Good', 'Average', 'Poor', 'Terrible').

In [30]:
# Custom function to map the rating CSS class to a TripAdvisor rating label
def set_customer_rating(customer_rating) :
  # pd.isna is used because np.NAN was removed in NumPy 2.0
  if pd.isna(customer_rating) :
    return None
  elif customer_rating == 'bubble_50':
    return 'Excellent'
  elif customer_rating == 'bubble_40':
    return 'Very Good'
  elif customer_rating == 'bubble_30':
    return 'Average'
  elif customer_rating == 'bubble_20':
    return 'Poor'
  elif customer_rating == 'bubble_10':
    return 'Terrible'
  # Any unrecognised class falls through and returns None

In [31]:
df['Customer Rating'] = df['Customer Rating'].apply(set_customer_rating)
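
The same mapping can also be written as a dictionary lookup with Series.map, which leaves missing or unrecognised values as NaN. This is a minimal sketch on a made-up sample, not the notebook's own code; the names RATING_LABELS and sample are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative lookup table: CSS class -> TripAdvisor rating label
RATING_LABELS = {
    'bubble_50': 'Excellent',
    'bubble_40': 'Very Good',
    'bubble_30': 'Average',
    'bubble_20': 'Poor',
    'bubble_10': 'Terrible',
}

# Made-up sample standing in for df['Customer Rating']
sample = pd.Series(['bubble_50', 'bubble_30', np.nan, 'bubble_10'])

# Values that are missing or absent from the dictionary become NaN
labels = sample.map(RATING_LABELS)
print(labels.tolist())
```

Keeping the labels in one dictionary makes it easy to see all five categories at a glance and to reuse the mapping elsewhere.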

The Owner Responded column records whether the owner has responded to a particular review. Almost all the responses are boilerplate, templated replies, so their content has very little value to us. Hence we will only check whether the owner (i.e. Oberoi Delhi) has or has not responded to a particular review.

In [32]:
# Set owner's response from collected data (True or False)
def set_owners_response(owners_response) :
  # pd.isna is used because np.NaN was removed in NumPy 2.0
  if pd.isna(owners_response) :
    return False
  else :
    return True

In [33]:
df['Owner Responded'] = df['Owner Responded'].apply(set_owners_response)
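
As a possible alternative to applying a function row by row, pandas can produce the same True/False flags in one vectorised call with Series.notna. The sketch below uses a made-up sample; the variable names and the response text are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the scraped 'Owner Responded' column
responses = pd.Series([np.nan, 'Dear Guest, thank you for your review', np.nan])

# notna() is True wherever a response text exists, False where it is missing
responded = responses.notna()
print(responded.tolist())
```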

Next, we extract the month and year from the scraped date string and store them in MM/YY format. This information is especially useful for performing EDA on the reviews.

In [34]:
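# Extract the month and year from the scraped string and format as MM/YY
import datetime
def set_review_date(date_string) :
  # pd.isna is used because np.NaN was removed in NumPy 2.0
  if pd.isna(date_string) :
    return np.nan
  else :
    # 'Date of stay: November 2022' -> 'November 2022'
    extracted_date = date_string.partition(': ')[2]
    return datetime.datetime.strptime(extracted_date, '%B %Y').strftime('%m/%y')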

In [35]:
df['Date Of Stay'] = df['Date Of Stay'].apply(set_review_date)
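
Note that MM/YY strings do not sort chronologically (for example, '01/23' sorts before '11/22'). If date ordering matters for the EDA, one option is to keep a real datetime column and format it only for display. This is a sketch on made-up rows, assuming an English locale for the month names; raw, month_text and stay_dates are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the scraped 'Date Of Stay' column
raw = pd.Series(['Date of stay: November 2022', np.nan, 'Date of stay: August 2022'])

# Keep only the text after ': ' (missing values stay missing)
month_text = raw.str.partition(': ')[2]

# Parse 'Month YYYY' into timestamps; missing values become NaT
stay_dates = pd.to_datetime(month_text, format='%B %Y')

# Chronological sorting now works; format as MM/YY only when displaying
print(stay_dates.sort_values().dt.strftime('%m/%y').tolist())
```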

Finally, we write the cleaned data to a CSV file. Note that we have not yet performed any cleaning operation on the reviews themselves. That will be the next step of our analysis.

In [36]:
df.to_csv('../data/cleaned_data.csv', index = False)

Next, we check for missing values.

In [38]:
data = pd.read_csv('../data/cleaned_data.csv')
data.isnull().sum()

Customer Review    0
Date Of Stay       3
Customer Rating    0
Owner Responded    0
dtype: int64

We find that three reviews don't have a date of stay attached to them. While we could remove these, doing so would discard their review text. As topic modelling and sentiment analysis are our primary objectives for this exercise, we will let these reviews stay as they are.
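
One way to keep every review for text analysis while still doing date-based EDA is to filter out the undated rows only when the date is needed. The sketch below uses a made-up frame with the same column names; the review texts and the variable names all_reviews and dated are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the cleaned data
data = pd.DataFrame({
    'Customer Review': ['Great stay', 'Good food', 'Lovely staff'],
    'Date Of Stay': ['11/22', np.nan, '12/22'],
})

# Topic modelling and sentiment analysis can use every review
all_reviews = data['Customer Review']

# Date-based EDA uses only the rows that have a date of stay
dated = data.dropna(subset=['Date Of Stay'])
print(len(all_reviews), len(dated))
```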