## Web Scraping to Collect Review Data

You are given the following URLs of a website that lists reviews of a well-known restaurant in Singapore.

Scrape each URL for the following data from every review:
 - The review header
 - The review body
 - The review number (the rating)
 - Optional: The date of review
 - Optional: A Boolean indicator for mobile users

A visual aid is given to help you identify the relevant data from the website:

![screenshot](./assets/Review_Card.png)

Use your knowledge of HTML and BeautifulSoup to obtain the data and pass it into a DataFrame.

NOTE: Some cards on the webpage may contain tags with the same class as the tags you are looking for. Be sure to restrict your search to the relevant HTML elements only.

Hint: Use the element inspector in Chrome's Developer Tools to help you.
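
The note about restricting the search can be illustrated with a minimal sketch. The HTML below is a hypothetical, simplified snippet that only borrows the class names used later in this notebook; the real page structure may differ. It compares a page-wide search with a search scoped to each review card.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the class names mirror the ones used in this
# notebook, but the structure of the real page may differ.
html = """
<div class="sidebar"><span class="noQuotes">Sponsored: try our buffet</span></div>
<div class="review-container">
  <div class="ui_column is-9">
    <span class="ui_bubble_rating bubble_40"></span>
    <span class="noQuotes">Worth a visit</span>
    <p class="partial_entry">The food was good.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# A page-wide search also picks up the unrelated sidebar element.
all_headers = [tag.text for tag in soup.find_all(class_="noQuotes")]

# Restricting the search to each review card keeps only the review headers.
card_headers = []
for card in soup.select("div.ui_column.is-9"):
    header = card.find(class_="noQuotes")
    if header is not None:
        card_headers.append(header.text)

print(all_headers)
print(card_headers)
```

The scoped loop is the same idea as iterating over the card containers first and calling `find` on each card rather than on the whole document.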


In [14]:
from bs4 import BeautifulSoup
import requests
import re, os, string
from datetime import datetime

base_url = "https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d11961130-Reviews-Hawker_Chan-Singapore.html"

urls = [
    f"https://www.tripadvisor.com.sg/Restaurant_Review-g294265-d11961130-Reviews-or{i*10}-Hawker_Chan-Singapore.html"
    for i in range(1, 48)
]

urls = [base_url, *urls]


def get_HTML(url):
    # Fail fast on network stalls and on HTTP error responses
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def get_HTML_item_from_urls(urls):
    '''
    For each URL, look for the review cards.
    For each review card, look for the following attributes:
    - The rating
    - The card header
    - The card body
    - Optional: The date of review
    - Optional: A Boolean indicator for mobile users
    '''
    res = []
    for url in urls:
        html_string = get_HTML(url)
        soup = BeautifulSoup(html_string, "lxml")
        for element in soup(attrs={"class": "ui_column is-9"}):
            obj = {}
            # Find Review Number
            review = (
                element.find(attrs={"class": "ui_bubble_rating bubble_50"})
                or element.find(attrs={"class": "ui_bubble_rating bubble_40"})
                or element.find(attrs={"class": "ui_bubble_rating bubble_30"})
                or element.find(attrs={"class": "ui_bubble_rating bubble_20"})
                or element.find(attrs={"class": "ui_bubble_rating bubble_10"})
            )
            # If the review number can't be found, move on: this is not the tag you are looking for
            if review is None:
                continue
            # Get the review number using regex
            num = int(
                re.findall(
                    r'(?<=<span class="ui_bubble_rating bubble_)(\d+).*(?=>)',
                    str(review),
                )[0]
            )
            #Find Card Header
            card_head = element.find(attrs={"class": "noQuotes"})
            #Find Card Body
            card_body = element.find(attrs={"class": "partial_entry"})
            #Find Date of Review (stays None when a card shows no visit date)
            review_date_tag = element.find(attrs={"class": "prw_rup prw_reviews_stay_date_hsx"})
            date_match = re.search(r"Date of visit: (.*)", review_date_tag.text) if review_date_tag is not None else None
            review_date = datetime.strptime(date_match.group(1), "%B %Y").date() if date_match else None
            #Find Mobile Users
            is_mobile_user = element.find(attrs={"class": "viaMobile"})
            obj.update(
                header=card_head.text,
                body=card_body.text,
                review_num=num,
                review_date=review_date,
                is_mobile_user=bool(is_mobile_user),
            )
            res.append(obj)
    return res
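
As a side note, the rating and the visit date can also be parsed with two small helpers instead of a regex over the whole tag string. This is only a sketch: `parse_rating` and `parse_visit_date` are hypothetical names, and the class list would come from something like `tag.get("class", [])`, which BeautifulSoup returns as a list of class names.

```python
import re
from datetime import date, datetime


def parse_rating(class_names):
    """Return the bubble rating (e.g. 40) found in a list of CSS class names, or None."""
    for name in class_names:
        match = re.fullmatch(r"bubble_(\d+)", name)
        if match:
            return int(match.group(1))
    return None


def parse_visit_date(text):
    """Return the first day of the visit month from 'Date of visit: March 2020', or None."""
    match = re.search(r"Date of visit:\s*(\w+ \d{4})", text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%B %Y").date()


print(parse_rating(["ui_bubble_rating", "bubble_40"]))
print(parse_visit_date("Date of visit: March 2020") == date(2020, 3, 1))
```

Returning `None` for a missing value lets the caller decide whether to skip the card or keep the record with a gap.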


In [210]:
# my_text = get_HTML_item_from_urls(urls[:5])
my_text = get_HTML_item_from_urls(urls)

In [2]:
import pandas as pd

data = pd.DataFrame(my_text)
data.head()
data.info()

The data of the results should be published in a pandas data frame format, as follows:

In [3]:
data.head()

Lastly, make sure to persist your data in a `csv` file or equivalent:

In [214]:
os.makedirs("./data", exist_ok=True)  # create the output folder if it is missing
data.to_csv("./data/HawkerChanReviews.csv",index=False)
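
One thing to keep in mind when persisting to CSV is that dates are written as text. The sketch below uses invented toy rows and an in-memory buffer (so nothing is written to disk) to show that the date column comes back as strings unless `parse_dates` is passed to `pd.read_csv`.

```python
import io
from datetime import date

import pandas as pd

# Toy rows shaped like the scraped records; the values are illustrative only.
toy = pd.DataFrame(
    {
        "header": ["Worth a visit", "Nothing special"],
        "review_num": [40, 20],
        "review_date": [date(2016, 12, 1), date(2019, 7, 1)],
    }
)

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the date column is read back as plain strings.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# With parse_dates the column is restored to a datetime dtype.
buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=["review_date"])

print(as_text["review_date"].dtype)
print(as_dates["review_date"].dtype)
```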

## NLP with Review Data 

In [68]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score

data = pd.read_csv("./data/HawkerChanReviews.csv")

In [69]:
data.head()
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   header          478 non-null    object
 1   body            478 non-null    object
 2   review_num      478 non-null    int64 
 3   review_date     478 non-null    object
 4   is_mobile_user  478 non-null    bool  
dtypes: bool(1), int64(1), object(3)
memory usage: 15.5+ KB


Are there any missing values at all?

In [70]:
print("Missing Values:", data.isnull().any(),sep="\n")

Missing Values:
header            False
body              False
review_num        False
review_date       False
is_mobile_user    False
dtype: bool


First, let's rescale the review number from the scraped 10-50 bubble scale to a 1-5 scale:

In [71]:
data["review_num"] = data["review_num"]//10

Next, let's make sure that there is no unnecessary whitespace in the text columns:

In [72]:
for column in data.select_dtypes(include=["object"]).columns:
    data[column] = data[column].str.strip()

In [73]:
data

Unnamed: 0,header,body,review_num,review_date,is_mobile_user
0,Superb Comfort food during COVID-19,With stricter travel advisories across the glo...,5,2020-03-01,True
1,"Glad I tried it, but didn’t finish it",I got part of the neck of the chicken which I ...,3,2020-03-01,True
2,"Good enough for its price, New Hawker Chan out...","Like others, drawn here for the cheapest Miche...",4,2020-01-01,False
3,Hawker stall is our preferable taste,We didn’t have in the restaurant at 78 smith s...,4,2020-02-01,True
4,"Good value and quite nice, maybe overrated",I'll start by saying this is probably a bit ov...,4,2020-02-01,True
...,...,...,...,...,...
473,"Best chicken ever, at a very affordable price",Not sure I would give a Michelin star to this ...,5,2017-01-01,False
474,Fast food with Michelin star,"Long queue to get in, but it is worth. Custome...",4,2017-01-01,True
475,"""Hits the spot😋""",An expansion of the original hawker stall we h...,4,2017-01-01,True
476,Worth a visit,"Definitely worth a visit, the food was good an...",4,2016-12-01,True


Can we discriminate between ratings based on the header alone?

In [74]:
value_counts_header = data["header"].value_counts()
value_counts_header.head(15)

Disappointing                                                            4
Dinner                                                                   2
Delicious                                                                2
Lunch                                                                    2
Nothing special                                                          2
Very good                                                                2
Michelin star?                                                           2
Fun food                                                                 1
Awesome for cheap eats                                                   1
Worth the visit                                                          1
Get Better deal from their Original Hawker Stall in Chinatown Complex    1
Good value and quite nice, maybe overrated                               1
Why the hype?                                                            1
Simple & tasty           

In [75]:
data.loc[data.header.isin(value_counts_header[value_counts_header > 1].index)]

Unnamed: 0,header,body,review_num,review_date,is_mobile_user
60,Nothing special,"If you like rice, cold chicken and soy sauce t...",2,2019-07-01,False
73,Nothing special,The Rice and Chicken signature dish is good bu...,3,2019-07-01,True
75,Michelin star?,My husband and I wanted to try this place as w...,3,2019-07-01,False
99,Michelin star?,This is a very popular restaurant at the top o...,4,2019-05-01,True
113,Disappointing,Perhaps I expected too much but having spent t...,2,2019-03-01,True
117,Very good,Belly pork on the pork and noodles dish was su...,5,2019-03-01,True
160,Disappointing,Had high expectations but so disappointed. Lon...,1,2018-11-01,True
171,Delicious,We braved the long line to try this. It was d...,3,2017-12-01,True
239,Lunch,Had lunch with two colleagues. The three of us...,4,2018-06-01,True
339,Disappointing,Queued for 40 mins to get lukewarm chicken ric...,1,2017-12-01,True


Perhaps not. Where the headers are identical, the review numbers still differ, so the header alone does not determine the rating. This suggests that more information from the body is needed to supplement the analysis of review patterns. However, the ratings under an identical header are close (within one star in the rows shown above), indicating that reviewers who wrote the same header held similar general viewpoints. Lastly, duplicated headers are rare and their ratings are similar, so it might be forgivable to get away with analysing the headers alone.
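
The claim about closeness can be checked with a small group-by. This is a sketch over a hand-copied subset of the duplicated-header rows shown in the output above (only headers that appear at least twice there), not over the full dataset.

```python
import pandas as pd

# Rows copied from the duplicated-header output above (headers shown 2+ times).
dupes = pd.DataFrame(
    {
        "header": [
            "Nothing special", "Nothing special",
            "Michelin star?", "Michelin star?",
            "Disappointing", "Disappointing", "Disappointing",
        ],
        "review_num": [2, 3, 3, 4, 2, 1, 1],
    }
)

# For each header, report how many reviews share it and how far apart the ratings are.
spread = dupes.groupby("header")["review_num"].agg(["count", "min", "max"])
spread["range"] = spread["max"] - spread["min"]
print(spread)
```

On the full data, the same aggregation could be run on `data` after filtering to headers whose count exceeds one.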

Let's now consider the effect of the comments on the ratings (the `review_num` column): 

First, we need to hold out a test set in order to be able to gauge the generalisation ability of our model:

In [76]:
#Isolate Target From Predictors

pred = data.drop(["review_num","review_date","is_mobile_user"],axis=1)
target = data.review_num

# create training and testing vars using a random seed to reproduce experiment results
X_train, X_test, y_train, y_test = train_test_split(pred, target, test_size=0.2, random_state=5)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(382, 2) (382,)
(96, 2) (96,)


## Preprocessing

### Stopword and Punctuation Removal

In [77]:
def remove_stop_words_punc(data_string):
    all_stopwords = stopwords.words("english") # from nltk
    puncs = list(string.punctuation)
    puncs += "’"

    final = []
    text = word_tokenize(data_string.lower())
    for token in text:
        if token not in all_stopwords and token not in puncs:
            final.append(token)

    return ' '.join(final)

print("Body:",X_train["body"].apply(remove_stop_words_punc),"\n",sep="\n")
print("Header:",X_train["header"].apply(remove_stop_words_punc),"\n",sep="\n")

Body:
312    pretty good pork belly chicken good michelin s...
296    else say ... 5 michelin star meal little hard ...
330    judge local asian food seriously understand pl...
176    singapore full hawker food part experience go ...
305    went toa payoh outlet ordered mixed platter ro...
                             ...                        
400    visited tai seng outlet lunch still queue thou...
118    sign front says michelin star queues inside at...
189    ate 2 nights singapore recently went almost cl...
206    good taste tried noodle rice vendor read vendo...
355    plain good something would expect michelin sta...
Name: body, Length: 382, dtype: object


Header:
312                       great pork belly so-so chicken
296                                      5 michelin star
330                                 michel-in michel-out
176                                                 hype
305                                     good value taste
                             ... 

### Stemming  

In [81]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def stem_word(data_text):
    return " ".join([porter.stem(word) for word in word_tokenize(data_text)])

stem_word(X_train.loc[35,"header"])

'tasti street food , friendli staff'

### Lemmatizing

In [82]:
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()


def lemmatize_word(data_text):
    return " ".join([lemma.lemmatize(word) for word in word_tokenize(data_text)])

lemmatize_word(X_train.loc[35,"header"])

'Tasty street food , friendly staff'

In [84]:
data.loc[35,:]

header                            Tasty street food, friendly staff
body              I don‘t know why snobbish people are going on ...
review_num                                                        5
review_date                                              2019-11-01
is_mobile_user                                                 True
Name: 35, dtype: object
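
The difference between the two outputs above is that a stemmer strips suffixes by rule and may produce non-words, whereas `WordNetLemmatizer.lemmatize` looks words up as nouns by default and leaves adjectives such as "friendly" unchanged. The sketch below reproduces only the stemming side, and it splits on whitespace instead of using `word_tokenize` so that it needs no NLTK data downloads.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Plain whitespace splitting keeps this sketch free of NLTK data downloads;
# the notebook itself uses word_tokenize.
words = "tasty street food friendly staff".split()
stems = [porter.stem(word) for word in words]

print(stems)
```

Stems such as "tasti" are fine as vocabulary keys for a vectorizer, but they are harder to read when inspecting features, which is one reason to prefer the lemma strategy for exploration.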

### Putting it all together

In [85]:
def TextPreprocessorFn(text,strategy="lemma"):
    lemma = WordNetLemmatizer() 
    stemmer = PorterStemmer()
    all_stopwords = stopwords.words("english")
    puncs = list(string.punctuation)
    puncs += ["“","’","”","-"]
    doc = word_tokenize(text.lower())
    removed = [t for t in doc if (t not in puncs) and (t not in all_stopwords)]
    if strategy == "lemma":
        return " ".join([lemma.lemmatize(token) for token in removed])
    else:
        return " ".join([stemmer.stem(token) for token in removed])

In [86]:
X_train["header"].apply(TextPreprocessorFn,strategy="lemma").values

array(['great pork belly so-so chicken', '5 michelin star',
       'michel-in michel-out', 'hype', 'good value taste', 'tragic',
       'michelin star', 'original pretty amazing',
       'singapore one find chicken rice 2 plate',
       'best chicken rice ever', "wo n't find michellin star street food",
       'fast food michelin star', 'eaten better food', 'much',
       'hawker chan famous chicken', 'cool try worth hype',
       'good experience', 'michelin star',
       'super sedap soya chicken noodle',
       "terrible service horrible food michelin worthy n't bother",
       'family lunch', 'cheap tasty -- life reputation', 'fantastic',
       'overhyped', 'rude bad service', 'worth',
       'michelin star chicken rice noodle', 'good must visit', 'alright',
       'tasty chicken', 'disappointing given hype', 'ok nothing special',
       'superb comfort food covid-19', 'nothing special',
       'e everything need', 'could use quality control',
       'cheapest michelin star restau

### Putting it all together into a Transformer (Optional)

In [87]:
from sklearn.base import TransformerMixin, BaseEstimator

class TextPreprocessor(TransformerMixin, BaseEstimator):
    def __init__(self,strategy="lemma"):
        """
        Text preprocessing transformer includes steps:
            1. Text normalization (lowercasing and tokenization)
            2. Punctuation removal
            3. Stop words removal
            4. Lemmatization (strategy="lemma") or stemming (any other value)
        """
        self.strategy = strategy

    def fit(self, X, y=None):
        self._lemma = WordNetLemmatizer() 
        self._stemmer = PorterStemmer()
        self._all_stopwords = stopwords.words("english")
        self._puncs = list(string.punctuation)
        self._puncs += ["“","’","”","-"]
        self._vectorized_process = np.vectorize(self._preprocess_text)
        return self

    def transform(self, X, *_):
        X_copy = X.copy().values
        
        return self._vectorized_process(X_copy)
    
    
    def _preprocess_text(self, text):
        doc = word_tokenize(text.lower())
        removed_punct = self._remove_punct(doc)
        removed_stop_words = self._remove_stop_words(removed_punct)
        if self.strategy == "lemma":
            return self._lemmatize(removed_stop_words)
        else:
            return self._stem(removed_stop_words)

    def _remove_punct(self, doc):
        return [t for t in doc if t not in self._puncs]

    def _remove_stop_words(self, doc):
        return [t for t in doc if t not in self._all_stopwords]

    def _lemmatize(self, doc):
        return ' '.join([self._lemma.lemmatize(t) for t in doc])
    
    def _stem(self,doc):
        return ' '.join([self._stemmer.stem(t) for t in doc])
    
#     def _more_tags(self):
#         return {'multioutput_only': True,
#                 'non_deterministic': True}
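
One benefit of wrapping preprocessing in a transformer is that it can be chained with a vectorizer in a `Pipeline`. The sketch below uses `SimpleCleaner`, a hypothetical stand-in that only lowercases and strips punctuation (so it needs no NLTK corpora), and invented example sentences; it is not the `TextPreprocessor` class defined above.

```python
import string

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class SimpleCleaner(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for TextPreprocessor: lowercases and strips punctuation."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit only returns self.
        return self

    def transform(self, X):
        table = str.maketrans("", "", string.punctuation)
        return [text.lower().translate(table) for text in X]


pipeline = Pipeline(
    [
        ("clean", SimpleCleaner()),
        ("tfidf", TfidfVectorizer()),
    ]
)

train_docs = ["Great chicken!", "Long queue, great food."]
test_docs = ["Great queue?"]

train_matrix = pipeline.fit_transform(train_docs)  # fits the vectorizer on training text only
test_matrix = pipeline.transform(test_docs)        # reuses the training vocabulary

print(sorted(pipeline.named_steps["tfidf"].vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Calling `transform` (not `fit_transform`) on the test documents guarantees that the test matrix has the same columns as the training matrix.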

### Creating the Training and Testing Tf-Idf Matrices

#### Using the `TfidfVectorizer` itself:



First, initialize the vectorizer (be sure to add the Text Preprocessing Function as a parameter). Then fit and transform the training data.

In [196]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_head = TfidfVectorizer(preprocessor=TextPreprocessorFn,min_df=5, max_df=0.7)
vectorizer_body = TfidfVectorizer(preprocessor=TextPreprocessorFn,min_df=5, max_df=0.7)
tfidf_matrix_header_train = vectorizer_head.fit_transform(X_train["header"])
tfidf_matrix_body_train = vectorizer_body.fit_transform(X_train["body"])




In [197]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(len(vectorizer_head.get_feature_names_out()))
print(len(vectorizer_body.get_feature_names_out()))

57
325
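
The vocabulary sizes above depend heavily on `min_df` and `max_df`. As a minimal sketch on an invented four-document corpus, the block below shows how each parameter prunes terms: an integer `min_df` is a minimum document count, and a float `max_df` is a maximum share of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
docs = ["good chicken rice", "good chicken noodle", "bad service", "good value"]

# No filtering: every distinct term is kept.
vocab_all = sorted(TfidfVectorizer().fit(docs).vocabulary_)

# min_df=2 drops terms that occur in fewer than 2 documents.
vocab_min = sorted(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

# max_df=0.7 additionally drops terms that occur in more than 70% of documents
# ("good" appears in 3 of 4 documents, i.e. 75%).
vocab_both = sorted(TfidfVectorizer(min_df=2, max_df=0.7).fit(docs).vocabulary_)

print(vocab_all)
print(vocab_min)
print(vocab_both)
```

With only a few hundred reviews, `min_df=5` removes most rare words, which explains why the header vocabulary is so much smaller than the body vocabulary.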


Combine the two matrices together:

In [198]:
import scipy.sparse  # import the submodule explicitly so scipy.sparse.hstack is available

X_train_vec = scipy.sparse.hstack([tfidf_matrix_header_train,tfidf_matrix_body_train])

print(X_train_vec.shape)


(382, 382)


Now using the same vectorizer, transform the test data:

In [199]:
tfidf_matrix_header_test = vectorizer_head.transform(X_test["header"])
tfidf_matrix_body_test = vectorizer_body.transform(X_test["body"])

X_test_vec = scipy.sparse.hstack([tfidf_matrix_header_test,tfidf_matrix_body_test])

print(X_test_vec.shape)

(96, 382)
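
The fit-on-train, transform-on-test pattern with manual stacking can also be expressed with scikit-learn's `ColumnTransformer`, which applies one vectorizer per column and concatenates the results. This is an alternative sketch on invented toy frames, not the approach used in this notebook; the variable names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy frames with the same two text columns as the notebook; contents are invented.
train = pd.DataFrame(
    {
        "header": ["great chicken", "long queue", "great value"],
        "body": ["the chicken was tender", "we waited an hour", "cheap and tasty rice"],
    }
)
test = pd.DataFrame({"header": ["great queue"], "body": ["tender rice"]})

# A column name given as a plain string (not a list) hands each vectorizer a
# 1-D column of strings, which is what text vectorizers expect.
combined = ColumnTransformer(
    [
        ("head", TfidfVectorizer(), "header"),
        ("body", TfidfVectorizer(), "body"),
    ]
)

X_train_toy = combined.fit_transform(train)  # learns both vocabularies from train only
X_test_toy = combined.transform(test)        # same columns, in the same order

n_head = len(combined.named_transformers_["head"].vocabulary_)
n_body = len(combined.named_transformers_["body"].vocabulary_)
print(X_train_toy.shape, X_test_toy.shape, n_head + n_body)
```

The number of output columns equals the sum of the two vocabulary sizes, just as with the horizontal stack of the header and body matrices.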


#### Using the Transformer Method

First, preprocess the data using the transformer. Then use the Tf-Idf vectorizer to fit and transform the training data.

In [200]:
processor = TextPreprocessor()
processed_X_train = X_train.copy()
processed_X_train.loc[:,"header"] = processor.fit_transform(X_train["header"])
processed_X_train.loc[:,"body"] = processor.fit_transform(X_train["body"])

processed_X_test = X_test.copy()
# The processor is already fitted, so only transform the test data
processed_X_test.loc[:,"header"] = processor.transform(X_test["header"])
processed_X_test.loc[:,"body"] = processor.transform(X_test["body"])


from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_head = TfidfVectorizer(min_df=5, max_df=0.7)
vectorizer_body = TfidfVectorizer(min_df=5, max_df=0.7)
tfidf_matrix_header_train = vectorizer_head.fit_transform(processed_X_train["header"])
tfidf_matrix_body_train = vectorizer_body.fit_transform(processed_X_train["body"])


In [201]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(len(vectorizer_head.get_feature_names_out()))
print(len(vectorizer_body.get_feature_names_out()))

57
325


In [202]:
import scipy.sparse  # import the submodule explicitly so scipy.sparse.hstack is available

X_train_vec = scipy.sparse.hstack([tfidf_matrix_header_train,tfidf_matrix_body_train])

print(X_train_vec.shape)

(382, 382)


In [203]:
tfidf_matrix_header_test = vectorizer_head.transform(processed_X_test["header"])
tfidf_matrix_body_test = vectorizer_body.transform(processed_X_test["body"])

X_test_vec = scipy.sparse.hstack([tfidf_matrix_header_test,tfidf_matrix_body_test])

In [204]:
print(tfidf_matrix_body_train)

  (0, 157)	0.29373391789748027
  (0, 306)	0.30456928190755983
  (0, 137)	0.4710190245132066
  (0, 313)	0.2958035350168892
  (0, 270)	0.18056820434260834
  (0, 174)	0.16268481506488705
  (0, 48)	0.14985683984885734
  (0, 211)	0.27863544653978
  (0, 124)	0.439352129007043
  (0, 214)	0.3992530299946493
  (1, 274)	0.21069258889269624
  (1, 234)	0.2808655289565301
  (1, 230)	0.15252865358058867
  (1, 155)	0.2679925437198441
  (1, 140)	0.27407071337612515
  (1, 156)	0.28856885551813505
  (1, 268)	0.2679925437198441
  (1, 132)	0.13345289981625905
  (1, 196)	0.2251907310347061
  (1, 160)	0.28856885551813505
  (1, 108)	0.22800539925769311
  (1, 131)	0.2624941704516007
  (1, 153)	0.24460156369856614
  (1, 171)	0.17070544443054525
  (1, 239)	0.21513241402100713
  :	:
  (379, 194)	0.19610887079602962
  (379, 255)	0.18970907543816953
  (379, 313)	0.19886165523662352
  (379, 48)	0.10074517608172555
  (380, 271)	0.4449398916867505
  (380, 224)	0.5056800605012912
  (380, 281)	0.3941780182047317
  (380

### Creating the Classifier

Let's now create a logistic regression model:

In [205]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vec, y_train);

Score it against the training data:

In [206]:
model.score(X_train_vec, y_train)

0.8036649214659686

Now use it to predict and score it against the test data:

In [207]:
preds = model.predict(X_test_vec)
model.score(X_test_vec, y_test)

0.40625

What do the results above tell you about the model?

- With its default L2 regularization, the logistic regression model does not predict exact user ratings well: test accuracy is about 0.41.
- The model overfits: training accuracy is about 0.80, roughly twice the test accuracy.
- Possible rationale: a linear model might not be geometrically complex enough to separate the rating classes based on textual information.
- Vectorizing into Tf-idf is a simplistic bag-of-words approach that fails to capture the context of words. Richer representations such as word embeddings might capture the nuances of the dataset better.
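
One way to probe overfitting in a regularized linear model is to compare training accuracy with cross-validated accuracy for several values of the inverse regularization strength `C`. The sketch below uses a synthetic binary dataset from `make_classification` as a stand-in with more features than samples; it is not the review data, and the values of `C` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a wide feature matrix: few samples, many features.
X, y = make_classification(
    n_samples=200, n_features=300, n_informative=10, random_state=0
)

results = {}
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # accuracy on held-out folds
    train_acc = model.fit(X, y).score(X, y)             # accuracy on the data it was fitted on
    results[C] = (train_acc, cv_acc)
    print(f"C={C:<5} train={train_acc:.2f} cv={cv_acc:.2f}")
```

A large gap between the two numbers for a given `C` points to overfitting, and a smaller `C` (stronger regularization) is one lever to try before moving to a more complex model.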

As a proxy, instead of requiring the exact rating, we can count a prediction as correct if it falls within one star of the actual rating:

In [209]:
np.mean(np.abs(y_test.values - preds)<2)

0.8020833333333334

The result above tells us that while the model is poor at predicting users' exact ratings, it does capture the general sentiment that users express in their comments.
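
When reading a within-one-star score, it helps to compare it with a constant baseline, because ratings concentrated around one value make this metric easy to satisfy. The sketch below uses invented ratings and a hypothetical helper, `within_k_accuracy`; for integer ratings, `<= 1` is the same test as the `< 2` used above.

```python
import numpy as np


def within_k_accuracy(y_true, y_pred, k=1):
    """Share of predictions whose absolute error is at most k stars."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred) <= k))


# Invented ratings, for illustration only.
y_true = np.array([5, 4, 4, 3, 1, 5, 2, 4])
y_pred = np.array([4, 4, 5, 5, 3, 5, 2, 3])

# Baseline: always predict the most frequent rating. In a real workflow this
# rating should be taken from the training labels, not from the test labels.
values, counts = np.unique(y_true, return_counts=True)
baseline = np.full_like(y_true, values[np.argmax(counts)])

model_score = within_k_accuracy(y_true, y_pred)
baseline_score = within_k_accuracy(y_true, baseline)
print(model_score, baseline_score)
```

In this toy case the constant baseline scores as well as the predictions, which shows why the relaxed metric is best reported next to such a baseline.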