### In-class API Assessment - Use Web Scraping to Extract Yelp Reviews

#### Team Members: Helen Jiang, Rachel Zhou, Sophia Zhao

#### Use the Yelp Fusion API to iterate through a set of Vancouver-area restaurants, then download their reviews and star ratings as training data for a very simple Naive Bayes classifier that sorts reviews into two sentiment classes, 'positive' and 'negative'.
#### Normally only 3 reviews per restaurant are readily available, so we aim to retrieve as many reviews as possible, together with the numeric rating given to each review.

In [0]:
import requests
import json

import random
from pprint import pprint

from bs4 import BeautifulSoup

from nltk.sentiment import SentimentAnalyzer
import nltk.sentiment.util
from nltk.classify import NaiveBayesClassifier

import pandas as pd

In [0]:
YELP_TOKEN = 'x17InQX2Y5ZCUwwJu_k1qk5Ue-zElSwvCZmANSbvRXFCeQfDiZv8vgFntjl9QVEM9iggpB-5Sv58qbl3KbxaSH_ZlrExmz7-jod9m9bP_XYpos-pLMfajFsb6-UCXnYx'

In [0]:
# Get 15 pages of search results; each page returns up to 50 businesses
r = []
for i in range(15):
    offset = 50 * i  # offsets count from zero, so page i starts at 50*i
    rr = requests.get(
        "https://api.yelp.com/v3/businesses/search",
        params={"location": "Vancouver", "offset": offset, "limit": 50},
        headers={"Authorization": "Bearer %s" % YELP_TOKEN},
    )
    r.append(rr)
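
The paging arithmetic behind the search loop can be sketched on its own. This is a minimal illustration, assuming the search offset counts from zero; `search_page_params` is a hypothetical helper, not part of the notebook or of any Yelp library. Each dict it returns could be passed as the `params` argument of `requests.get`.

```python
def search_page_params(location, pages, limit=50):
    """Build one query-parameter dict per page of search results."""
    # Assumption: offsets count from zero, so page i starts at i * limit.
    return [
        {"location": location, "offset": page * limit, "limit": limit}
        for page in range(pages)
    ]


params = search_page_params("Vancouver", pages=15)
print(params[0]["offset"], params[-1]["offset"])  # 0 700
```

With 15 pages of 50 results, the last page starts at offset 700, so the loop covers results 0 through 749.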

In [0]:
# Extract the Yelp page URL of each restaurant
links = []
for resp in r:
    # Iterate over whatever was returned, in case a page holds fewer than 50 results
    for business in resp.json()['businesses']:
        links.append(business['url'])

In [0]:
## Number of Restaurants Searched
len(links)

750

In [0]:
## Create a function that extracts each review together with its star (rating value)
## Only the reviews on the first page of each restaurant are extracted
def get_reviews_stars(url):
    """Return a list of (review_text, 'positive' or 'negative') tuples for one Yelp page."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    reviews = soup.find_all(itemprop="review")
    review_comments = [rev.find('p').text for rev in reviews]
    ratings = soup.find_all(itemprop="ratingValue")
    # Remove the first rating (the overall restaurant rating) and
    # the last 10 ratings (ads shown on each page)
    ratings = ratings[1:len(ratings) - 10]
    # Re-parse the remaining tags and read the star value from each <meta content="...">
    soup1 = BeautifulSoup(str(ratings), 'html.parser')
    stars = [tag['content'] for tag in soup1.find_all('meta')]
    starposneg = []
    for s in stars:
        # A rating above 3 stars counts as positive; 3 stars or fewer as negative
        starposneg.append('positive' if int(float(s)) > 3 else 'negative')
    # zip pairs reviews with labels by position and stops at the shorter list
    return list(zip(review_comments, starposneg))
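
The labelling rule inside `get_reviews_stars` can be isolated and checked by hand. The sketch below assumes the star value arrives as text such as `'4.0'`; `label_from_stars` and its `threshold` parameter are hypothetical names introduced only for illustration.

```python
def label_from_stars(star_text, threshold=3):
    """Map a star rating given as text (e.g. '4.0') to a sentiment label."""
    stars = int(float(star_text))  # '4.0' -> 4.0 -> 4
    return 'positive' if stars > threshold else 'negative'


for s in ['5.0', '4.0', '3.0', '1.0']:
    print(s, label_from_stars(s))
```

Note that a 3-star review falls into the 'negative' class under this rule, which is a design choice worth keeping in mind when reading the classifier's results.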

In [0]:
results=[]
for link in links:
    results.extend(get_reviews_stars(link))

In [0]:
results

[("If more places did brunch like this, I probably wouldn't rag on it the way I do.\n\nAs those of you who read my reviews know, I consider brunch to be the most middling of meals. To be fair, it's a prejudice I've formed while living in San Diego, where brunch is just an excuse to wake up hung over at noon and then go swill another fifth of tequila without putting shoes on. If you're familiar with the area, check out the scene on a Saturday or Sunday in North Park sometime.\n\nThus, the prospect of brunch in Vancouver was not one I was looking forward to, but one I walked away from pleasantly surprised and gastronomically satisfied, thanks to the venue: Cafe Medina, a bustling cafe with competent service and delicious food. The crowd of people was also a refreshing change--all grownups behaving normally and no one swinging from the light fixtures blotto. Speaking of light fixtures, they're hung from a high ceiling that opens to a dining room, a spacious bar, a waffle station, and an o

In [0]:
## Imbalanced dataset resolution:
## randomly drop positive reviews to reach a Positive:Negative = 2:1 ratio
neg = []
pos = []
for review in results:
    if review[1] == 'positive':
        pos.append(review)
    else:
        neg.append(review)

# Sample twice as many positive reviews as there are negative ones
# (random.sample raises ValueError if fewer than 2*len(neg) positives exist)
pos_selected = random.sample(pos, 2 * len(neg))
balanced = neg + pos_selected
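
A slightly more defensive variant of the balancing step is sketched below on toy data. It is an illustration under assumptions, not the notebook's method: `downsample_positive`, `ratio`, and `seed` are hypothetical names, the sample size is capped so sampling cannot fail when positives are scarce, and a fixed seed makes the selection repeatable.

```python
import random


def downsample_positive(records, ratio=2, seed=0):
    """Keep every negative record and at most ratio * len(negatives) positives."""
    neg = [rec for rec in records if rec[1] == 'negative']
    pos = [rec for rec in records if rec[1] == 'positive']
    k = min(len(pos), ratio * len(neg))  # never request more than exist
    rng = random.Random(seed)            # fixed seed, so the sample is repeatable
    return neg + rng.sample(pos, k)


toy = [('bad', 'negative')] + [('good %d' % i, 'positive') for i in range(5)]
balanced_toy = downsample_positive(toy)
print(len(balanced_toy))  # 3
```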

In [0]:
balanced

[("I was very disappointed by the quality of food for the price that I am paying. (I am not saying the food is bad, but I just won't pay that much for this quality). You are able to get better quality for cheaper! For the price that I am paying, I expected better....\n\nI paid around $50 for a lunch set + a pint of beer.\n\nThe only good side is that it is close to Canada Place. If you are able to get a seat on the patio, will get a nice view of the area and North Vancouver.\n",
  'negative'),
 ('Okay, so I like others have heard so many great things about this restaurant and how there are long lineups to get in. Well, I finally got a chance to go for the first time and it is nothing special. I have been to better places in Vancouver that serve great breakfast and brunch food. \n\nHad their smoked salmon benny and it was okay; the shredded hash browns tasted like the potatoes was not clean prior to shredding it and had no flavour to it. One positive thing with the Benny is the hollanda

In [0]:
## Number of records left after balancing dataset
len(balanced)

10407

In [0]:
## export as .json file
a=pd.DataFrame(balanced,columns=['Review','Pos/Neg'])
df=a.reset_index(drop=True)
df.to_json('API Assessment.json')

### Split reviews into lists of words & build the model

In [0]:
# Split each review on single spaces; the label is carried over unchanged
splited=[(x.split(' '), y) for (x,y) in balanced]

In [0]:
## Inspect the first tokenized record
splited[0]

(['I',
  'was',
  'very',
  'disappointed',
  'by',
  'the',
  'quality',
  'of',
  'food',
  'for',
  'the',
  'price',
  'that',
  'I',
  'am',
  'paying.',
  '(I',
  'am',
  'not',
  'saying',
  'the',
  'food',
  'is',
  'bad,',
  'but',
  'I',
  'just',
  "won't",
  'pay',
  'that',
  'much',
  'for',
  'this',
  'quality).',
  'You',
  'are',
  'able',
  'to',
  'get',
  'better',
  'quality',
  'for',
  'cheaper!',
  'For',
  'the',
  'price',
  'that',
  'I',
  'am',
  'paying,',
  'I',
  'expected',
  'better....\n\nI',
  'paid',
  'around',
  '$50',
  'for',
  'a',
  'lunch',
  'set',
  '+',
  'a',
  'pint',
  'of',
  'beer.\n\nThe',
  'only',
  'good',
  'side',
  'is',
  'that',
  'it',
  'is',
  'close',
  'to',
  'Canada',
  'Place.',
  'If',
  'you',
  'are',
  'able',
  'to',
  'get',
  'a',
  'seat',
  'on',
  'the',
  'patio,',
  'will',
  'get',
  'a',
  'nice',
  'view',
  'of',
  'the',
  'area',
  'and',
  'North',
  'Vancouver.\n'],
 'negative')
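
In the output above, splitting on single spaces leaves punctuation and newlines attached to words (for example `'paying.'` and `'better....\n\nI'`). A possible alternative, shown only as a sketch, is a small regular-expression tokenizer that emits punctuation as separate tokens and discards whitespace. `simple_tokenize` is a hypothetical helper and was not used to produce the results in this notebook.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    # A word may contain internal apostrophes (won't); any other
    # non-whitespace character becomes its own one-character token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


print(simple_tokenize("I just won't pay that much.\n\nI paid around $50."))
```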

In [0]:
# Random train:test = 7:3 split
random.shuffle(splited)
cut = round(len(splited) * 0.7)  # index where the training portion ends
training = splited[:cut]
test = splited[cut:]

print("Training: %d, Testing: %d" % (len(training), len(test)))

sentim_analyzer = SentimentAnalyzer()

Training: 7285, Testing: 3122
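
Because `random.shuffle` is unseeded, each run produces a different split. A repeatable version of the 7:3 split could look like the sketch below; `train_test_split_list` and its `seed` parameter are hypothetical names, and the integers stand in for the tokenized reviews.

```python
import random


def train_test_split_list(items, train_fraction=0.7, seed=42):
    """Shuffle a copy of items with a fixed seed and cut it into two parts."""
    items = list(items)                 # copy, so the caller's data is not reordered
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


train_part, test_part = train_test_split_list(range(10407))
print(len(train_part), len(test_part))  # 7285 3122
```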


In [0]:
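# Append _NEG to words that follow a negation, then collect every word in the training set
# (all_words_neg holds negation-marked words, not only words from negative reviews)
all_words_neg = sentim_analyzer.all_words([nltk.sentiment.util.mark_negation(doc) for doc in training])
all_words_neg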

["I've",
 'tried',
 'coming',
 'to',
 'this',
 'restaurant',
 'twice',
 'and',
 'both',
 'times',
 'it',
 'was',
 'packed.',
 'I',
 'finally',
 'got',
 'to',
 'try',
 'this',
 'restaurant',
 'today',
 'and',
 'was',
 'excited',
 'to',
 'see',
 'what',
 'the',
 'hype',
 'was',
 'about.',
 'Honestly,',
 'it',
 "wasn't",
 'really_NEG',
 'anything_NEG',
 'special_NEG',
 'or_NEG',
 'spectacular._NEG',
 '\n\nYou_NEG',
 'order_NEG',
 'at_NEG',
 'the_NEG',
 'front_NEG',
 'and_NEG',
 'find_NEG',
 'a_NEG',
 'seat_NEG',
 'and_NEG',
 'food_NEG',
 'is_NEG',
 'served_NEG',
 'to_NEG',
 'you._NEG',
 'The_NEG',
 'servers_NEG',
 'were_NEG',
 'great._NEG',
 'The_NEG',
 'ladies_NEG',
 'were_NEG',
 'so_NEG',
 'friendly_NEG',
 'and_NEG',
 'came_NEG',
 'by_NEG',
 'to_NEG',
 'collect_NEG',
 'our_NEG',
 'dishes_NEG',
 '(even_NEG',
 'though_NEG',
 "it's_NEG",
 'half_NEG',
 'self_NEG',
 'served)_NEG',
 'and_NEG',
 'asked_NEG',
 'if_NEG',
 'everything_NEG',
 'was_NEG',
 'okay._NEG',
 'We_NEG',
 "didn't_NEG",
 'fi
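
The output above shows the `_NEG` suffix continuing across sentence boundaries (for example `'spectacular._NEG'` followed by `'The_NEG'`). One plausible reading is that the negation scope only ends at a token that consists purely of punctuation, which never occurs when punctuation stays attached to words. The sketch below is a simplified stand-in written to illustrate that behaviour; it is not NLTK's implementation, and `mark_negation_sketch` and both patterns are assumptions.

```python
import re

# Assumed, shortened negation list; tokens ending in n't also count.
NEGATION = re.compile(r"^(?:never|no|not|nothing)$|n't$", re.IGNORECASE)
# Assumed rule: only a token that is a single punctuation mark closes the scope.
CLAUSE_END = re.compile(r"^[.:;!?]$")


def mark_negation_sketch(tokens):
    """Append _NEG to tokens between a negation and the next punctuation token."""
    marked, in_scope = [], False
    for tok in tokens:
        if CLAUSE_END.match(tok):
            in_scope = False
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
            in_scope = bool(NEGATION.search(tok))
    return marked


separate = mark_negation_sketch(["it", "wasn't", "special", ".", "Servers", "were", "great", "."])
attached = mark_negation_sketch(["it", "wasn't", "special.", "Servers", "were", "great."])
print(separate)
print(attached)
```

With punctuation as separate tokens, only `special` is marked; with punctuation attached, every later token is marked, including `great.`.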

In [0]:
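# Build the vocabulary from sufficiently frequent words (min_freq=4),
# then register a feature extractor that checks each review for those words
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
sentim_analyzer.add_feat_extractor(nltk.sentiment.util.extract_unigram_feats, unigrams=unigram_feats)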
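
Conceptually, this step builds a vocabulary of frequent words and then describes each review by which vocabulary words it contains. The standard-library sketch below approximates that idea on toy data. `frequent_unigrams` and `contains_features` are hypothetical helpers, and the strict `>` comparison against `min_freq` and the `contains(word)` key format are assumptions about the library's behaviour rather than confirmed details.

```python
from collections import Counter


def frequent_unigrams(words, min_freq):
    """Return words seen more than min_freq times, most frequent first."""
    counts = Counter(words)
    # Assumption: the threshold is strict (count > min_freq).
    return [w for w, c in counts.most_common() if c > min_freq]


def contains_features(document, unigrams):
    """Build one boolean 'contains(word)' feature per vocabulary word."""
    present = set(document)
    return {"contains(%s)" % w: (w in present) for w in unigrams}


vocab = frequent_unigrams(["good", "good", "good", "bad", "bad", "meh"], min_freq=1)
features = contains_features(["good", "service"], vocab)
print(vocab)
print(features)
```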

In [0]:
training_set = sentim_analyzer.apply_features(training)
test_set = sentim_analyzer.apply_features(test)

In [0]:
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)
for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Training classifier
Evaluating NaiveBayesClassifier results...
Accuracy: 0.7616912235746316
F-measure [negative]: 0.6313181367690782
F-measure [positive]: 0.8239469947941315
Precision [negative]: 0.6493374108053007
Precision [positive]: 0.8131714152265297
Recall [negative]: 0.6142719382835101
Recall [positive]: 0.8350119904076738
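
As a consistency check on the reported metrics, the F-measure for each class should equal the harmonic mean of that class's precision and recall. The sketch below recomputes it from the precision and recall values printed above; `f_measure` is a hypothetical helper written for this check.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    return 2 * precision * recall / (precision + recall)


f_neg = f_measure(0.6493374108053007, 0.6142719382835101)
f_pos = f_measure(0.8131714152265297, 0.8350119904076738)
print(round(f_neg, 4), round(f_pos, 4))  # 0.6313 0.8239
```

Both values agree with the reported F-measures. The gap between the two classes suggests the classifier handles positive reviews noticeably better than negative ones, which is worth weighing against the 2:1 class ratio in the data.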


In [0]:
# Citation
# Title: <Day1.ipynb>
# Author: <Adrian Petrescu>
# Date: <2019-Dec-15>
# Availability: <https://github.com/apetresc/rotman-api> or <https://rac.rotman.utoronto.ca/jupyter/user/ut_fzhao/lab?>