## Reviews Data Cleaning

Punctuation and casing within reviews are a hindrance to language processing because they pollute the dataset with dominant and redundant tokens. We therefore lowercase the entire review dataset and remove all punctuation. Customer reviews also contain many words that have little or no value in topic modelling or sentiment analysis. These words, called stopwords, should be removed from the dataset before we perform any kind of analysis. We use a combination of pre-built Python packages and custom-defined stopwords to clean our reviews.

In [2]:
import pandas as pd
from textblob import Word
from gensim.parsing.preprocessing import remove_stopwords

We start by lowercasing all the words in the dataset, using Python's built-in string method lower() for this operation.

In [3]:
df = pd.read_csv('../data/cleaned_data.csv')
# Lowercase every word; split()/join also collapses repeated whitespace
df['Customer Review'] = df['Customer Review'].apply(lambda review: " ".join(word.lower() for word in review.split()))
df['Customer Review']

0       excellent and highly recommended! we only spen...
1       this was our favourite hotel whilst we visited...
2       we stayed 2 nights at the new delhi oberoi. wh...
3       the very best!! quality of service and food is...
4       the service , food, location, cleanliness is e...
                              ...                        
2891    stayed at the oberoi for a long weekend with f...
2892    the oberoi hotels have established a world-cla...
2893    i've been through a few hotels in delhi over t...
2894    this is a beautiful and luxurious hotel. unfor...
2895    hotel is fine and food is good although pricey...
Name: Customer Review, Length: 2896, dtype: object
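
As a minimal sketch of what the lowercasing step does to a single review, the snippet below applies the same split, lower and join logic to a made-up string. The function name `lowercase_and_normalize` and the sample text are illustrative and not part of the notebook.

```python
def lowercase_and_normalize(text: str) -> str:
    """Lowercase each word and collapse runs of whitespace into single spaces."""
    # str.split() with no arguments splits on any whitespace (spaces, tabs, newlines)
    return " ".join(word.lower() for word in text.split())


sample = "Excellent  and Highly\tRecommended!"
print(lowercase_and_normalize(sample))
# excellent and highly recommended!
```

Note that punctuation is untouched at this stage; it is handled in the next step.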

The next step is to remove all the punctuation from the dataset. We use str.replace with regex matching for this operation.

In [10]:
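# Raw string avoids invalid-escape warnings; the pattern matches anything that is not a word character or whitespace
df['Customer Review'] = df['Customer Review'].str.replace(r'[^\w\s]', '', regex=True)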
df['Customer Review']

0       excellent highly recommended spent nights left...
1       favourite whilst visited recently refurbished ...
2       stayed 2 nights new upgraded presidential suit...
3       best quality service food notch quality spread...
4       service food location cleanliness excellent be...
                              ...                        
2891    stayed long weekend friends arrived 1am luggag...
2892    hotels established worldclass reputation deser...
2893    ive hotels yearshilton hyatts marriott sofitel...
2894    beautiful luxurious unfortunately price id exp...
2895    fine food good pricey compared competition wat...
Name: Customer Review, Length: 2896, dtype: object
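
To make the effect of the pattern concrete, here is a small sketch using Python's `re` module on an invented sentence. Deleting punctuation outright fuses tokens that were joined by it, which is why forms such as "worldclass" and "ive" appear in the output above. The second function is a possible variant, not what the notebook does: it replaces punctuation with a space instead. Both function names are hypothetical.

```python
import re

PUNCTUATION = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace


def strip_punctuation(text: str) -> str:
    """Delete punctuation outright, as the notebook cell does."""
    return PUNCTUATION.sub("", text)


def punctuation_to_space(text: str) -> str:
    """Variant: replace punctuation with a space, then collapse whitespace."""
    return " ".join(PUNCTUATION.sub(" ", text).split())


sample = "a world-class reputation; i've stayed twice."
print(strip_punctuation(sample))
# a worldclass reputation ive stayed twice
print(punctuation_to_space(sample))
# a world class reputation i ve stayed twice
```

Which behaviour is preferable depends on the downstream analysis; the underscore counts as a word character under `\w`, so neither version removes it.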

Let's first check the 15 most common words that appear in the corpus. We will then remove stopwords using the gensim package and, after that, define our custom stopwords.

In [7]:
freq = pd.Series(" ".join(df['Customer Review']).split()).value_counts()[:15]
freq

the       17666
and        9939
a          6237
to         6112
was        4742
in         4733
of         4374
is         4095
i          3754
hotel      3377
we         2882
for        2762
at         2615
with       2274
oberoi     2263
dtype: int64
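
The same frequency count can be sketched without pandas using `collections.Counter`, which may be easier to follow on a toy corpus. The helper `top_words` and the three sample reviews are assumptions made for illustration only.

```python
from collections import Counter


def top_words(reviews, n=15):
    """Return the n most frequent whitespace-separated tokens across all reviews."""
    counts = Counter(word for review in reviews for word in review.split())
    return counts.most_common(n)


reviews = ["the service was great", "the food was good", "the staff"]
print(top_words(reviews, n=2))
# [('the', 3), ('was', 2)]
```

As in the real corpus, function words such as "the" dominate the counts before stopwords are removed.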

We now remove stopwords using the gensim package, which ships with a list of common English stopwords. This saves us the time of manually defining every stopword.

In [8]:
import gensim
# gensim's built-in English stopword list
stop_words = gensim.parsing.preprocessing.STOPWORDS
# Remove stopwords
df['Customer Review'] = df['Customer Review'].apply(lambda review: " ".join(word for word in review.split() if word not in stop_words))
# Show the most common words after removal of stopwords
freq = pd.Series(" ".join(df['Customer Review']).split()).value_counts()[:15]
freq

hotel        3377
oberoi       2263
service      1976
staff        1967
delhi        1755
room         1571
stay         1386
good         1093
food         1072
great        1060
rooms         984
stayed        900
excellent     882
new           866
best          726
dtype: int64
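
The filtering itself is a simple membership test per token. The sketch below shows the mechanism on one sentence; `TOY_STOPWORDS` is a tiny hand-picked stand-in for the gensim list, used here only so the example runs without extra packages.

```python
# Hypothetical stand-in for gensim's much larger stopword list
TOY_STOPWORDS = frozenset({"the", "and", "a", "to", "was", "in", "of", "is", "we", "at"})


def drop_stopwords(text: str, stopwords) -> str:
    """Keep only the tokens that are not in the stopword collection."""
    return " ".join(word for word in text.split() if word not in stopwords)


print(drop_stopwords("we stayed 2 nights at the new delhi oberoi", TOY_STOPWORDS))
# stayed 2 nights new delhi oberoi
```

Using a set (or frozenset) keeps each membership check cheap, which matters when the filter runs over every token of every review.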

Finally, we provide our own custom stopwords. For example, Oberoi, Delhi and hotel are all stopwords in this exercise, as they provide no value in topic modelling or sentiment analysis. A few others have also been added below.

In [9]:
# Domain-specific words that appear in almost every review
custom_stopwords = ['hotel', 'oberoi', 'delhi', 'india', 'oberois', 'city']
df['Customer Review'] = df['Customer Review'].apply(lambda review: " ".join(word for word in review.split() if word not in custom_stopwords))
# Show the most common words after removal of the custom stopwords
freq = pd.Series(" ".join(df['Customer Review']).split()).value_counts()[:15]
freq

service       1976
staff         1967
room          1571
stay          1386
good          1093
food          1072
great         1060
rooms          984
stayed         900
excellent      882
new            866
best           726
breakfast      702
restaurant     686
hotels         680
dtype: int64
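
Because the filter is an exact string match, only the listed forms are removed: "hotel" disappears but "hotels" remains in the counts above. The sketch below illustrates this and also shows one possible way to merge a general and a custom list into a single set so the text is filtered in one pass. `BASE_STOPWORDS` is a small invented stand-in for the gensim list, and `remove_all_stopwords` is a hypothetical helper.

```python
# Invented stand-in for a general-purpose stopword list
BASE_STOPWORDS = frozenset({"the", "was", "in", "among"})
# Custom list taken from the notebook cell above
CUSTOM_STOPWORDS = {"hotel", "oberoi", "delhi", "india", "oberois", "city"}

# Union of both collections, so a single pass removes either kind
ALL_STOPWORDS = BASE_STOPWORDS | CUSTOM_STOPWORDS


def remove_all_stopwords(text: str) -> str:
    """Drop tokens that exactly match a general or custom stopword."""
    return " ".join(word for word in text.split() if word not in ALL_STOPWORDS)


print(remove_all_stopwords("the hotel was among the best hotels in delhi"))
# best hotels
```

Whether plural forms such as "hotels" should also be listed depends on whether lemmatization is applied afterwards.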

We write these cleaned reviews to a CSV file. The next step is lemmatization of the review data.

In [7]:
df.to_csv('../data/cleaned_reviews_1.csv', index=False)