# Notebook 3
The purpose of this notebook is to (a) get a feel for the review comments dataset and (b) process the review comments so that they can be used to train the model in Notebook 4.

In [2]:
import pandas as pd
import numpy as np
import nltk
import re
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize as wt 
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import timeit

In [3]:
#We use the Snowball stemmer for this project instead of the more commonly used Porter stemmer.
#Snowball (sometimes called Porter2) is a later revision of the Porter algorithm and also supports languages other than English.
snowball_stemmer = SnowballStemmer(language='english')
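
As a quick, hedged illustration of how the two stemmers can differ, the sketch below runs both on a few sample words. The word list is made up for illustration and is not taken from the reviews dataset; neither stemmer needs an NLTK data download.

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

# Illustrative words only (not taken from the reviews dataset)
sample_words = ['generously', 'fairly', 'running', 'location']

# Collect (word, porter stem, snowball stem) for a side-by-side comparison
stem_comparison = [(w, porter.stem(w), snowball.stem(w)) for w in sample_words]
for word, porter_stem, snowball_stem in stem_comparison:
    print(f'{word:12} porter={porter_stem:10} snowball={snowball_stem}')
```

Where the two columns disagree, the Snowball stem is the one that will appear in the cleaned comments produced later in this notebook.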

In [6]:
#Read in reviews
reviews = pd.read_csv('reviews.csv')

#Download nltk packages
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DanielJoseph.Onsiter\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DanielJoseph.Onsiter\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package punkt is already up-to-date!


True

# Preliminary data analysis of the review comments

In [7]:
#preview the data
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,1178162,4724140,2013-05-21,4298113,Olivier,My stay at islam's place was really cool! Good...
1,1178162,4869189,2013-05-29,6452964,Charlotte,Great location for both airport and city - gre...
2,1178162,5003196,2013-06-06,6449554,Sebastian,We really enjoyed our stay at Islams house. Fr...
3,1178162,5150351,2013-06-15,2215611,Marine,The room was nice and clean and so were the co...
4,1178162,5171140,2013-06-16,6848427,Andrew,Great location. Just 5 mins walk from the Airp...


listing_id appears to correspond to the id column in the listings dataset.
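
One way to check this correspondence, sketched with two tiny made-up frames. `listings_demo` and `reviews_demo` are hypothetical stand-ins, since the listings file is not loaded in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the listings and reviews tables
listings_demo = pd.DataFrame({'id': [1178162, 2000001, 2000002]})
reviews_demo = pd.DataFrame({'listing_id': [1178162, 1178162, 2000002]})

# True only if every reviewed listing_id appears in the listings id column
all_matched = reviews_demo['listing_id'].isin(listings_demo['id']).all()

# Listings that have no reviews at all
has_review = listings_demo['id'].isin(reviews_demo['listing_id'])
unreviewed = listings_demo.loc[~has_review, 'id'].tolist()

print(all_matched, unreviewed)
```

If `all_matched` were False on the real data, the two columns would not be a clean key for joining reviews to listings.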

In [10]:
#Check the number of unique listings that have reviews
reviews.listing_id.unique().shape

(2829,)

In [11]:
#Check for null values
reviews.isnull().sum()

listing_id        0
id                0
date              0
reviewer_id       0
reviewer_name     0
comments         53
dtype: int64

# Cleaning and transforming the data

In [12]:
#Remove comments that are NaN
reviews = reviews[~reviews['comments'].isnull()].reset_index()
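
An equivalent way to drop the missing comments, sketched on a toy frame (`demo` is hypothetical). Note that a plain `reset_index()` keeps the old index as an `index` column, which this notebook drops later on; `reset_index(drop=True)` would discard it immediately.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the reviews table
demo = pd.DataFrame({'id': [1, 2, 3],
                     'comments': ['Great stay', np.nan, 'Clean room']})

# Same filter as in the cell above: keep rows whose comment is not missing
kept_with_index = demo[~demo['comments'].isnull()].reset_index()

# dropna on the one column, discarding the old index instead of keeping it
kept_clean = demo.dropna(subset=['comments']).reset_index(drop=True)

print(list(kept_with_index.columns))
print(list(kept_clean.columns))
```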

In [9]:
#Comments are cleaned by:
#1. Replacing non-alphabetic characters with spaces
#2. Lowercasing the review and splitting it into individual words
#3. Converting the words to their stem form
#4. Removing stop words (the stemmed form is checked against the stop word list)
#5. Recombining the remaining words into a single string
#6. Appending the string to 'cleaned_comments'
#Method adapted from https://medium.com/swlh/text-classification-using-the-bag-of-words-approach-with-nltk-and-scikit-learn-9a731e5c4e2f
#Courtesy of Charles Rajendran

cleaned_comments = []
st1 = timeit.default_timer()
for i in range(reviews.shape[0]):
    if (i%1000 == 0):
        print(i)
        
    review_comment = re.sub('[^a-zA-Z]',' ',reviews['comments'][i])
    review_comment = wt(review_comment.lower())
    review_comment = [snowball_stemmer.stem(word) for word in review_comment if snowball_stemmer.stem(word) not in stopwords.words('english')]
    review_comment = " ".join(review_comment)
    cleaned_comments.append(review_comment)

st2 = timeit.default_timer()
print('time taken: ' + str(st2-st1) +'s')

0
1000
...
68000
time taken: 2111.136779600056s


Processing takes a long time (about 2,111 seconds, roughly 35 minutes, in the run above), which is why it has been separated from the model training notebook. There is almost certainly a faster way to do this, but this was the best I was able to do.
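
Most of the run time likely comes from two things in the loop above: `stopwords.words('english')` is called again for every word, and each kept word is stemmed twice. A minimal sketch of a faster variant follows, under some assumptions: `cached_stem`, `clean_comment` and `demo_stop_words` are hypothetical names, the tiny stop word set is only for illustration, and `str.split()` replaces `word_tokenize`, which is close but not guaranteed to be identical on every input. It has not been timed on the real dataset.

```python
import re
from functools import lru_cache
from nltk.stem.snowball import SnowballStemmer

snowball_stemmer = SnowballStemmer(language='english')

# Cache stems so each distinct word is stemmed only once across all reviews
@lru_cache(maxsize=None)
def cached_stem(word):
    return snowball_stemmer.stem(word)

def clean_comment(text, stop_words):
    """Keep letters only, lowercase, stem, drop stop-word stems, rejoin."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    stems = (cached_stem(word) for word in words)
    return ' '.join(stem for stem in stems if stem not in stop_words)

# Illustrative stop words only; in the notebook this would be built once,
# before the loop, as set(stopwords.words('english'))
demo_stop_words = {'the', 'was', 'and'}
print(clean_comment('The room was very nice and clean!', demo_stop_words))
```

Because the stop word check runs on stems, a stop word whose stem differs from the word itself can slip through, which is why "veri" (from "very") shows up in the cleaned comments.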

## Using the scikit-learn count vectorizer to encode the comment words as frequency counts

By default, the scikit-learn count vectorizer turns every distinct word across all comments into a column, and each column counts how many times that word occurs in a given review comment.

If we converted every word in cleaned_comments into a column this way, we would have ~30000 columns.
To prevent this, we only keep words that appear in at least 0.05% of reviews (about 34 reviews).

In [11]:
#Using the vectorizer to transform the cleaned comments
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1),min_df=0.0005)
reviews_vectorized = vectorizer.fit_transform(cleaned_comments)
reviews_vectorized = pd.DataFrame(data=reviews_vectorized.toarray(),columns=list(vectorizer.get_feature_names_out()))

In [12]:
#combine comments & vectorized reviews
review_cleaned = pd.DataFrame(data=cleaned_comments ,columns=['review_cleaned'])
review = pd.concat([review_cleaned,reviews_vectorized],axis=1)

In [13]:
new_columns = ['reviewcomment_' + column for column in review.columns]

# Rename the columns representing review comment words with reviewcomment_*
review = review.rename(columns=dict(zip(review.columns, new_columns)))
review = review.rename(columns={'reviewcomment_review_cleaned':'review_cleaned'})

#Merging the other columns that contain metadata
reviews = pd.concat([reviews,review],axis=1)
reviews = reviews.drop(columns=['index'])
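
The rename-then-rename-back step above can also be avoided by prefixing only the word-count columns before concatenating. A sketch with toy frames (`toy_text` and `toy_counts` are made up), using pandas' `add_prefix`:

```python
import pandas as pd

# Toy stand-ins for review_cleaned and reviews_vectorized
toy_text = pd.DataFrame({'review_cleaned': ['great locat', 'clean room']})
toy_counts = pd.DataFrame({'great': [1, 0], 'room': [0, 1]})

# Prefix only the word-count columns, then join them onto the text column
toy_review = pd.concat([toy_text, toy_counts.add_prefix('reviewcomment_')], axis=1)
print(list(toy_review.columns))
```

The prefix also keeps word columns such as a stemmed "id" or "date" from colliding with the metadata columns of the same name.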

In [14]:
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,review_cleaned,reviewcomment_abbiamo,reviewcomment_aber,reviewcomment_abil,...,reviewcomment_zentrum,reviewcomment_zero,reviewcomment_zimmer,reviewcomment_zoe,reviewcomment_zona,reviewcomment_zone,reviewcomment_zu,reviewcomment_zum,reviewcomment_zur,reviewcomment_zwei
0,1178162,4724140,2013-05-21,4298113,Olivier,My stay at islam's place was really cool! Good...,stay islam place realli cool good locat min aw...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1178162,4869189,2013-05-29,6452964,Charlotte,Great location for both airport and city - gre...,great locat airport citi great amen hous plus ...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1178162,5003196,2013-06-06,6449554,Sebastian,We really enjoyed our stay at Islams house. Fr...,realli enjoy stay islam hous outsid hous look ...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1178162,5150351,2013-06-15,2215611,Marine,The room was nice and clean and so were the co...,room nice clean commod veri close airport metr...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1178162,5171140,2013-06-16,6848427,Andrew,Great location. Just 5 mins walk from the Airp...,great locat min walk airport station good food...,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
#Output to csv
reviews.to_csv("reviews_processed.csv",index=False)