## Web Scraping to Collect Review Data

You are given the following URLs of a website that lists reviews of a well-known restaurant in Singapore.

Your task is to scrape the following data from each review on those pages:
 - The review header
 - The review footer (the body text of the review card)
 - The review number (the rating shown on the card)
 - Optional: The date of review
 - Optional: A Boolean indicator for mobile users

A visual aid is given to help you identify the relevant data from the website:

![screenshot](./assets/Review_Card.png)

Use your knowledge of HTML and BeautifulSoup to obtain the data and pass it into a DataFrame.

NOTE: Some cards on the webpage have tags with the same class as the tags you are looking for. Be sure to restrict your search to the relevant HTML elements only.

Hint: Use the element inspector in Chrome's Developer Tools to help you.
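
The note about restricting the search can be illustrated with a small, self-contained sketch. The markup below is hypothetical and much simpler than the real page: the `review-container` and `sidebar` class names are assumptions made for illustration, and only `ui_column is-9`, `noQuotes` and `partial_entry` come from the scraping code in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one real review card and one sidebar card
# that reuses the same "noQuotes" class.
html = """
<div class="review-container">
  <div class="ui_column is-9">
    <span class="ui_bubble_rating bubble_40"></span>
    <span class="noQuotes">Worth a visit</span>
    <p class="partial_entry">The food was good.</p>
  </div>
</div>
<div class="sidebar">
  <span class="noQuotes">Sponsored listing</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A page-wide search also picks up the sidebar element.
all_headers = [tag.get_text(strip=True) for tag in soup.find_all(class_="noQuotes")]

# Searching inside each review container keeps only the real review headers.
review_headers = []
for card in soup.find_all("div", class_="review-container"):
    header = card.find(class_="noQuotes")
    if header is not None:
        review_headers.append(header.get_text(strip=True))

print(all_headers)
print(review_headers)
```

Calling `find` on a parent element rather than on the whole `soup` is what limits the search to that element's descendants.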


In [360]:
from bs4 import BeautifulSoup
import requests
import re, os, string
from datetime import datetime

base_url = "https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d11961130-Reviews-Hawker_Chan-Singapore.html"

urls = [
    f"https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d11961130-Reviews-or{i*10}-Hawker_Chan-Singapore.html"
    for i in range(1, 48)
]

urls = [base_url, *urls]


def get_HTML(url):
    response = requests.get(url)
    html = response.text
    return html


def get_HTML_item_from_urls(urls):
    '''
    For each URL, look for the review cards.
    For each review card, look for the following attributes:
    - The rating
    - The card header
    - The card body
    - Optional: The date of review
    - Optional: A Boolean Indicator for Mobile Users
    '''
    res = []
    for url in urls:
        html_string = get_HTML(url)
        soup = BeautifulSoup(html_string, "lxml")
        for element in soup(attrs={"class": "ui_column is-9"}):
            obj = {}
            # Find Review Number
            review = (
                element.find(attrs={"class": "ui_bubble_rating bubble_50"})
                or element.find(attrs={"class": "ui_bubble_rating bubble_40"})
                or element.find(attrs={"class": "ui_bubble_rating bubble_30"})
                or element.find(attrs={"class": "ui_bubble_rating bubble_20"})
                or element.find(attrs={"class": "ui_bubble_rating bubble_10"})
            )
            # If the review number can't be found, this is not a review card, so skip it
            if review is None:
                continue
            #Get Review number using regex
            num = int(
                re.findall(
                    r'(?<=<span class="ui_bubble_rating bubble_)(\d+).*(?=>)',
                    str(review),
                )[0]
            )
            #Find Card Header
            card_head = element.find(attrs={"class": "noQuotes"})
            #Find Card Body
            card_body = element.find(attrs={"class": "partial_entry"})
            #Find Date of Review
            review_date_raw_string = element.find(attrs={"class": "prw_rup prw_reviews_stay_date_hsx"})
            review_date_string = re.findall(r'(?<=Date of visit: )(.*)',review_date_raw_string.text)[0]
            review_date = datetime.strptime(review_date_string,"%B %Y").date()
            #Find Mobile Users
            is_mobile_user = element.find(attrs={"class": "viaMobile"})
            obj.update(
                header=card_head.text,
                body=card_body.text,
                review_num=num,
                review_date=review_date,
                is_mobile_user=bool(is_mobile_user),
            )
            res.append(obj)
    return res
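
The two regex steps in the function above can be isolated and checked on their own. The following is a minimal sketch using only the standard library; the helper names `parse_rating` and `parse_visit_date` are hypothetical, and the inputs imitate the class list and the "Date of visit" text that the scraper reads. Unlike the function above, these helpers return `None` when nothing matches instead of raising an error.

```python
import re
from datetime import datetime


def parse_rating(class_names):
    """Return the number encoded in a class such as 'bubble_40', or None."""
    for name in class_names:
        match = re.fullmatch(r"bubble_(\d+)", name)
        if match:
            return int(match.group(1))
    return None


def parse_visit_date(text):
    """Return the visit month as a date (first day of the month), or None."""
    match = re.search(r"Date of visit:\s*(\w+ \d{4})", text)
    if match is None:
        return None
    # "%B %Y" has no day component, so strptime defaults to the 1st.
    return datetime.strptime(match.group(1), "%B %Y").date()


print(parse_rating(["ui_bubble_rating", "bubble_40"]))
print(parse_visit_date("Date of visit: March 2020"))
print(parse_visit_date("no date here"))
```

This also explains why every `review_date` in the resulting DataFrame falls on the first day of a month: the page only exposes the month and year of the visit.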


In [210]:
# my_text = get_HTML_item_from_urls(urls[:5])
my_text = get_HTML_item_from_urls(urls)

In [211]:
import pandas as pd

data = pd.DataFrame(my_text)
data.head()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   header          478 non-null    object
 1   body            478 non-null    object
 2   review_num      478 non-null    int64 
 3   review_date     478 non-null    object
 4   is_mobile_user  478 non-null    bool  
dtypes: bool(1), int64(1), object(3)
memory usage: 15.5+ KB


The results should be presented as a pandas DataFrame, as follows:

In [215]:
data.head()

|  | header | body | review_num | review_date | is_mobile_user |
|---|---|---|---|---|---|
| 0 | Superb Comfort food during COVID-19 | With stricter travel advisories across the glo... | 50 | 2020-03-01 | True |
| 1 | Glad I tried it, but didn’t finish it | I got part of the neck of the chicken which I ... | 30 | 2020-03-01 | True |
| 2 | Good enough for its price, New Hawker Chan out... | Like others, drawn here for the cheapest Miche... | 40 | 2020-01-01 | False |
| 3 | Hawker stall is our preferable taste | We didn’t have in the restaurant at 78 smith s... | 40 | 2020-02-01 | True |
| 4 | Good value and quite nice, maybe overrated | I'll start by saying this is probably a bit ov... | 40 | 2020-02-01 | True |


Lastly, make sure to persist your data in a `csv` file or equivalent:

In [214]:
data.to_csv("./data/HawkerChanReviews.csv",index=False)

## NLP with Review Data 

In [802]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score

data = pd.read_csv("./data/HawkerChanReviews.csv")

In [803]:
data.head()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   header          478 non-null    object
 1   body            478 non-null    object
 2   review_num      478 non-null    int64 
 3   review_date     478 non-null    object
 4   is_mobile_user  478 non-null    bool  
dtypes: bool(1), int64(1), object(3)
memory usage: 15.5+ KB


Are there any missing values at all?

In [804]:
print("Missing Values:", data.isnull().any(),sep="\n")

Missing Values:
header            False
body              False
review_num        False
review_date       False
is_mobile_user    False
dtype: bool


First, let's rescale the review number from the scraped values (10 to 50) to a 1 to 5 scale:

In [805]:
data["review_num"] = data["review_num"]//10

Next, let's make sure that there is no unnecessary leading or trailing whitespace in the text columns:

In [806]:
for column in data.select_dtypes(include=["object"]).columns:
    data[column] = data[column].apply(lambda text: text.strip())

In [807]:
data

|  | header | body | review_num | review_date | is_mobile_user |
|---|---|---|---|---|---|
| 0 | Superb Comfort food during COVID-19 | With stricter travel advisories across the glo... | 5 | 2020-03-01 | True |
| 1 | Glad I tried it, but didn’t finish it | I got part of the neck of the chicken which I ... | 3 | 2020-03-01 | True |
| 2 | Good enough for its price, New Hawker Chan out... | Like others, drawn here for the cheapest Miche... | 4 | 2020-01-01 | False |
| 3 | Hawker stall is our preferable taste | We didn’t have in the restaurant at 78 smith s... | 4 | 2020-02-01 | True |
| 4 | Good value and quite nice, maybe overrated | I'll start by saying this is probably a bit ov... | 4 | 2020-02-01 | True |
| ... | ... | ... | ... | ... | ... |
| 473 | Best chicken ever, at a very affordable price | Not sure I would give a Michelin star to this ... | 5 | 2017-01-01 | False |
| 474 | Fast food with Michelin star | Long queue to get in, but it is worth. Custome... | 4 | 2017-01-01 | True |
| 475 | "Hits the spot😋" | An expansion of the original hawker stall we h... | 4 | 2017-01-01 | True |
| 476 | Worth a visit | Definitely worth a visit, the food was good an... | 4 | 2016-12-01 | True |


Can we discriminate data based on the header only? 

In [808]:
value_counts_header = data["header"].value_counts()
value_counts_header.head(15)

Disappointing                                         4
Michelin star?                                        2
Very good                                             2
Delicious                                             2
Dinner                                                2
Lunch                                                 2
Nothing special                                       2
Worth a visit                                         1
Tasty chicken - no Michelin star                      1
best soya sauce chicken and charsiew michelin star    1
Cheap affordable Michelin lunch                       1
Definitely skip. Toa Payoh outlet.                    1
You have to Try!                                      1
Really good Char Siew and soy chicken                 1
Outlet at toa payoh hub                               1
Name: header, dtype: int64

In [809]:
data.loc[data.header.isin(value_counts_header[value_counts_header > 1].index)]

|  | header | body | review_num | review_date | is_mobile_user |
|---|---|---|---|---|---|
| 60 | Nothing special | If you like rice, cold chicken and soy sauce t... | 2 | 2019-07-01 | False |
| 73 | Nothing special | The Rice and Chicken signature dish is good bu... | 3 | 2019-07-01 | True |
| 75 | Michelin star? | My husband and I wanted to try this place as w... | 3 | 2019-07-01 | False |
| 99 | Michelin star? | This is a very popular restaurant at the top o... | 4 | 2019-05-01 | True |
| 113 | Disappointing | Perhaps I expected too much but having spent t... | 2 | 2019-03-01 | True |
| 117 | Very good | Belly pork on the pork and noodles dish was su... | 5 | 2019-03-01 | True |
| 160 | Disappointing | Had high expectations but so disappointed. Lon... | 1 | 2018-11-01 | True |
| 171 | Delicious | We braved the long line to try this. It was d... | 3 | 2017-12-01 | True |
| 239 | Lunch | Had lunch with two colleagues. The three of us... | 4 | 2018-06-01 | True |
| 339 | Disappointing | Queued for 40 mins to get lukewarm chicken ric... | 1 | 2017-12-01 | True |


Perhaps not. In the instances where the headers are identical, the review numbers still differ. The differences are small, however, which indicates that reviewers who wrote the same header held broadly similar views when rating. This suggests that information from the body is needed to supplement the analysis of review patterns. The last thing to note is that identical headers are rare, and since their ratings are close, it might be forgivable to analyse the headers alone.
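
To put a number on how close the ratings are, the spread per header can be computed with a `groupby`. The sketch below uses only the ten duplicated-header rows displayed above (the display is truncated, so the counts are lower than in `value_counts_header`); the names `dupes` and `spread` are illustrative.

```python
import pandas as pd

# (header, review_num) pairs copied from the duplicated-header rows shown above.
dupes = pd.DataFrame(
    {
        "header": [
            "Nothing special", "Nothing special", "Michelin star?",
            "Michelin star?", "Disappointing", "Very good", "Disappointing",
            "Delicious", "Lunch", "Disappointing",
        ],
        "review_num": [2, 3, 3, 4, 2, 5, 1, 3, 4, 1],
    }
)

# Count, minimum and maximum rating per header, plus the gap between them.
spread = dupes.groupby("header")["review_num"].agg(["count", "min", "max"])
spread["range"] = spread["max"] - spread["min"]

print(spread.sort_values("count", ascending=False))
```

On the full dataset the same aggregation could be run on `data` directly, which would show whether the small spread holds beyond the rows displayed here.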

Let's now consider the effect of the comments on the ratings (the `review_num` column): 

First, we need to hold out a test set in order to be able to gauge the generalisation ability of our model:

In [810]:
#Isolate Target From Predictors

pred = data.drop(["review_num","review_date","is_mobile_user"],axis=1)
target = data.review_num

# create training and testing vars using a random seed to reproduce experiment results
X_train, X_test, y_train, y_test = train_test_split(pred, target, test_size=0.2, random_state=5)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(382, 2) (382,)
(96, 2) (96,)
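
One optional refinement, not used in this notebook, is a stratified split. If the ratings are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both sets. The sketch below uses hypothetical, made-up ratings purely to show the effect.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced ratings (not the real review data).
ratings = [5] * 40 + [4] * 30 + [3] * 20 + [2] * 5 + [1] * 5
texts = [f"review {i}" for i in range(len(ratings))]

# stratify=ratings preserves the rating proportions in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, ratings, test_size=0.2, random_state=5, stratify=ratings
)

print(Counter(y_tr))
print(Counter(y_te))
```

With only a handful of low ratings, an unstratified split can leave a class almost absent from the test set, which makes the test score noisier.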


## Preprocessing

### Stopword and Punctuation Removal

In [811]:
def remove_stop_words_punc(data_string):
    all_stopwords = stopwords.words("english") # from nltk
    puncs = list(string.punctuation)
    puncs += "’"

    final = []
    text = word_tokenize(data_string.lower())
    for token in text:
        if token not in all_stopwords and token not in puncs:
            final.append(token)

    return ' '.join(final)

# X_train is derived from `data`, so work on an explicit copy to avoid
# pandas' SettingWithCopyWarning when assigning the cleaned columns.
X_train = X_train.copy()
X_train["body"] = X_train["body"].apply(remove_stop_words_punc)
X_train["header"] = X_train["header"].apply(remove_stop_words_punc)
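
The NLTK-based function above needs the tokenizer and stopword resources to be downloaded first. To show the idea without those downloads, here is a dependency-free sketch. It is not equivalent to the NLTK version: `STOPWORDS` is a tiny hypothetical list, and the regex tokenizer is a simplification of `word_tokenize`.

```python
import re
import string

# Hypothetical mini stopword list; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "of", "to", "be", "is", "it", "and", "but"}
# ASCII punctuation plus the curly apostrophe that appears in the reviews.
PUNCTUATION = set(string.punctuation) | {"’"}


def simple_clean(text):
    """Lowercase, tokenize, and drop stopwords and punctuation tokens."""
    # Split into runs of word characters and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS and t not in PUNCTUATION]
    return " ".join(kept)


print(simple_clean("A bit of a disappointment, to be honest!"))
print(simple_clean("Michelin star?"))
```

The membership tests are done against sets, which is a cheap improvement that would also apply to the NLTK version when the stopword list is long.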


### Stemming  

In [812]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stem_word(data_text):
    return " ".join([porter.stem(word) for word in word_tokenize(data_text)])

stem_word(X_train.loc[37,"header"])

'bit disappoint honest'

### Lemmatizing

In [813]:
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()


def lemmatize_word(data_text):
    return " ".join([lemma.lemmatize(word) for word in word_tokenize(data_text)])

lemmatize_word(X_train.loc[37,"header"])

'bit disappointment honest'

In [814]:
data.loc[37,:]

header                       A bit of a disappointment to be honest
body              After reading dozens of reviews and watching t...
review_num                                                        3
review_date                                              2019-09-01
is_mobile_user                                                 True
Name: 37, dtype: object

### Putting it all together

In [826]:
def TextPreprocessorFn(text,strategy="lemma"):
    lemma = WordNetLemmatizer() 
    stemmer = PorterStemmer()
    all_stopwords = stopwords.words("english")
    puncs = list(string.punctuation)
    puncs += ["“","’","”","-"]
    doc = word_tokenize(text.lower())
    removed = [t for t in doc if (t not in puncs) and (t not in all_stopwords)]
    if strategy == "lemma":
        return " ".join([lemma.lemmatize(token) for token in removed])
    else:
        return " ".join([stemmer.stem(token) for token in removed])

In [827]:
X_train["header"].apply(TextPreprocessorFn,strategy="lemma").values

array(['great pork belly so-so chicken', '5 michelin star',
       'michel-in michel-out', 'hype', 'good value taste', 'tragic',
       'michelin star', 'original pretty amazing',
       'singapore one find chicken rice 2 plate',
       'best chicken rice ever', "wo n't find michellin star street food",
       'fast food michelin star', 'eaten better food', 'much',
       'hawker chan famous chicken', 'cool try worth hype',
       'good experience', 'michelin star',
       'super sedap soya chicken noodle',
       "terrible service horrible food michelin worthy n't bother",
       'family lunch', 'cheap tasty -- life reputation', 'fantastic',
       'overhyped', 'rude bad service', 'worth',
       'michelin star chicken rice noodle', 'good must visit', 'alright',
       'tasty chicken', 'disappointing given hype', 'ok nothing special',
       'superb comfort food covid-19', 'nothing special',
       'e everything need', 'could use quality control',
       'cheapest michelin star restau

### Putting it all together into a Pipeline (Optional)

In [828]:
from sklearn.base import TransformerMixin, BaseEstimator

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self,strategy="lemma"):
        """
        Text preprocessing transformer includes steps:
            1. Text normalization
            2. Punctuation removal
            3. Stop words removal
            4. Lemmatization or stemming, depending on `strategy`
        """
        self.strategy = strategy

    def fit(self, X, y=None):
        self._lemma = WordNetLemmatizer() 
        self._stemmer = PorterStemmer()
        self._all_stopwords = stopwords.words("english")
        self._puncs = list(string.punctuation)
        self._puncs += ["“","’","”","-"]
        self._vectorized_process = np.vectorize(self._preprocess_text)
        return self

    def transform(self, X, *_):
        X_copy = X.copy().values
        
        return self._vectorized_process(X_copy)
    
    
    def _preprocess_text(self, text):
        doc = word_tokenize(text.lower())
        removed_punct = self._remove_punct(doc)
        removed_stop_words = self._remove_stop_words(removed_punct)
        if self.strategy == "lemma":
            return self._lemmatize(removed_stop_words)
        else:
            return self._stem(removed_stop_words)

    def _remove_punct(self, doc):
        return [t for t in doc if t not in self._puncs]

    def _remove_stop_words(self, doc):
        return [t for t in doc if t not in self._all_stopwords]

    def _lemmatize(self, doc):
        return ' '.join([self._lemma.lemmatize(t) for t in doc])
    
    def _stem(self,doc):
        return ' '.join([self._stemmer.stem(t) for t in doc])
    
#     def _more_tags(self):
#         return {'multioutput_only': True,
#                 'non_deterministic': True}

In [899]:
processor = TextPreprocessor()
processed_X_train = X_train.copy()
processed_X_train.loc[:,"header"] = processor.fit_transform(X_train["header"])
processed_X_train.loc[:,"body"] = processor.fit_transform(X_train["body"])

# The processor was already fitted above, so only transform the test set
processed_X_test = X_test.copy()
processed_X_test.loc[:,"header"] = processor.transform(X_test["header"])
processed_X_test.loc[:,"body"] = processor.transform(X_test["body"])
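
The cells above apply the transformer by hand, column by column. One possible way to wire the per-column steps into an actual scikit-learn `Pipeline` is with a `ColumnTransformer`, sketched below on hypothetical toy data. The text-cleaning step is left out so the sketch does not depend on NLTK resources, and the names `toy`, `labels`, `features` and `clf` are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews; the notebook would pass X_train and y_train instead.
toy = pd.DataFrame(
    {
        "header": ["great chicken", "disappointing", "worth a visit", "nothing special"],
        "body": [
            "tender chicken and tasty sauce",
            "long queue and cold rice",
            "good food at a fair price",
            "average food, would not return",
        ],
    }
)
labels = [5, 1, 4, 2]

# Each text column gets its own TF-IDF vocabulary; the outputs are concatenated.
features = ColumnTransformer(
    [
        ("header_tfidf", TfidfVectorizer(), "header"),
        ("body_tfidf", TfidfVectorizer(), "body"),
    ]
)

clf = Pipeline([("features", features), ("model", LogisticRegression(max_iter=1000))])
clf.fit(toy, labels)

print(clf.predict(toy.head(2)))
```

A benefit of this layout is that fitting happens only inside `fit`, so cross-validation utilities can refit the vectorizers on each training fold without touching the held-out data.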

### Creating the Training and Testing Tf-Idf Matrices

In [927]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_head = TfidfVectorizer()
vectorizer_body = TfidfVectorizer()
tfidf_matrix_header_train = vectorizer_head.fit_transform(processed_X_train["header"])
tfidf_matrix_body_train = vectorizer_body.fit_transform(processed_X_train["body"])


In [928]:
tfidf_matrix_body_train

<382x1739 sparse matrix of type '<class 'numpy.float64'>'
	with 7745 stored elements in Compressed Sparse Row format>

In [929]:
tfidf_matrix_header_train

<382x418 sparse matrix of type '<class 'numpy.float64'>'
	with 1236 stored elements in Compressed Sparse Row format>

In [930]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(len(vectorizer_head.get_feature_names_out()))
print(len(vectorizer_body.get_feature_names_out()))

418
1739


In [931]:
import scipy

X_train_vec = scipy.sparse.hstack([tfidf_matrix_header_train,tfidf_matrix_body_train])

print(X_train_vec.shape)

(382, 2157)


In [932]:
# Reuse the vectorizers fitted on the training data so the test columns match

tfidf_matrix_header_test = vectorizer_head.transform(processed_X_test["header"])
tfidf_matrix_body_test = vectorizer_body.transform(processed_X_test["body"])

X_test_vec = scipy.sparse.hstack([tfidf_matrix_header_test,tfidf_matrix_body_test])
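
The reason for reusing the fitted vectorizers can be seen on a toy example. In the sketch below (hypothetical sentences, illustrative variable names), the test text contains a word that never occurs in training; it is simply ignored, and the test matrix keeps the same number of columns as the training matrix.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts standing in for the processed header and body columns.
train_headers = ["tasty chicken", "long queue"]
train_bodies = ["chicken rice was tasty", "queue was long and slow"]
test_headers = ["tasty noodles"]  # "noodles" never appears in training
test_bodies = ["slow queue"]

# Vocabularies are learned from the training texts only.
vec_head = TfidfVectorizer().fit(train_headers)
vec_body = TfidfVectorizer().fit(train_bodies)

# Header and body features are placed side by side, as in the cells above.
X_train_demo = sparse.hstack(
    [vec_head.transform(train_headers), vec_body.transform(train_bodies)]
)
X_test_demo = sparse.hstack(
    [vec_head.transform(test_headers), vec_body.transform(test_bodies)]
)

print(X_train_demo.shape, X_test_demo.shape)
print("noodles" in vec_head.vocabulary_)
```

Fitting a new vectorizer on the test set would produce a different vocabulary and column order, so the classifier's learned weights would no longer line up with the features.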

### Creating the Classifier

In [933]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vec, y_train)

LogisticRegression()

In [934]:
model.score(X_train_vec, y_train)

0.9764397905759162

In [935]:
preds = model.predict(X_test_vec)
model.score(X_test_vec, y_test)

0.40625

In [936]:
preds

array([5, 4, 3, 3, 5, 4, 3, 4, 4, 5, 5, 4, 4, 5, 5, 4, 5, 4, 3, 4, 4, 2,
       3, 4, 5, 5, 4, 4, 4, 3, 4, 4, 5, 5, 4, 4, 5, 4, 1, 3, 5, 5, 4, 3,
       5, 5, 5, 1, 3, 5, 5, 3, 4, 3, 4, 4, 4, 4, 4, 5, 5, 5, 4, 5, 2, 4,
       5, 3, 5, 3, 4, 4, 5, 3, 1, 4, 5, 4, 3, 5, 3, 5, 5, 3, 3, 4, 3, 3,
       3, 4, 4, 5, 3, 3, 5, 3])

In [937]:
y_test

419    4
139    4
227    3
226    4
122    5
      ..
281    5
375    3
293    3
98     1
53     5
Name: review_num, Length: 96, dtype: int64
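
The training accuracy (about 0.976) is far higher than the test accuracy (0.40625), which suggests the model is overfitting the training reviews. Because the ratings are ordinal, it can also help to check how often a prediction is off by at most one point. The sketch below is a minimal illustration on the first five predictions and true ratings visible in the outputs above; the helper names are hypothetical, and five rows are far too few to draw conclusions from.

```python
# First five predictions and true ratings shown in the outputs above.
preds_head = [5, 4, 3, 3, 5]
y_test_head = [4, 4, 3, 4, 5]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true rating exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one point away from the true rating."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1)
    return hits / len(y_true)


print(accuracy(y_test_head, preds_head))
print(within_one_accuracy(y_test_head, preds_head))
```

Applied to the full `y_test` and `preds`, the second metric would indicate whether the misclassifications are mostly near misses or large errors.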