# Web Scraping 

In this notebook, I use sentiment analysis on Rotten Tomatoes movie reviews to predict users' star ratings. The first step is to scrape the Rotten Tomatoes website: for each review, I parse the user ID, the review text and the star rating, and add these features to a dataframe.

### Importing Modules

In [2]:
import pandas as pd
from collections import defaultdict
import pickle

import requests
import urllib.request
import re
import json

# Parsing The Raw Data

First, we send a GET request to the chosen page, whose address is stored in the url variable. Next, we run a regular expression search over the returned HTML to find the movie ID, which is needed later to request further review pages. Rotten Tomatoes shows ten reviews per page, so scraping many reviews at once requires a way to move to the next page automatically.
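The extraction step described above can be tried offline before sending any requests. The following is a minimal sketch: the sample_html fragment and its values are made up for illustration, and the real page markup may differ.

```python
import json
import re

# Hypothetical page fragment; the real Rotten Tomatoes markup may differ.
sample_html = """
<script>
    root.RottenTomatoes.context.movieReview = {"title": "Example Movie", "movieId": "abc-123"};
</script>
"""

# Raw string so that \s reaches the regex engine unchanged.
# "." does not match newlines, so the capture stays on a single line.
match = re.search(r'movieReview\s=\s(.*);', sample_html)
if match is None:
    raise ValueError('movieReview object not found in page')

# The captured group is a JSON object, so it can be parsed directly.
movie_data = json.loads(match.group(1))
print(movie_data['title'], movie_data['movieId'])
```

Checking for a missing match before calling group(1) gives a clearer error than an AttributeError if the page layout changes.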

In [4]:
# Defining the URL and requesting the HTML document
url = 'https://www.rottentomatoes.com/m/black_panther_2018/reviews?type=user'
response = requests.get(url)

# Searching the HTML for the embedded movieReview JSON object
# to extract the movie name and movie ID
# (raw string so that \s is passed to the regex engine unchanged)
html_data  = json.loads(re.search(r'movieReview\s=\s(.*);', response.text).group(1))
movie_name = html_data['title']
movie_id   = html_data['movieId']
print(movie_name)


# Function to fetch one page of user reviews, starting after the given cursor
def getReviews(endCursor):
    r = requests.get(f'https://www.rottentomatoes.com/napi/movie/{movie_id}/reviews/user',
    params = {
        "direction": "next",
        "endCursor": endCursor,
        "startCursor": ""
    })
    return r.json()

# Empty reviews list and starting cursor (empty string requests the first page)
reviews = []
endCursor = ''

# Looping over review pages until the final page
while True:
    result = getReviews(endCursor)
    reviews.extend(result['reviews'])
    if not result['pageInfo']['hasNextPage']:
        break
    endCursor = result['pageInfo']['endCursor']

Black Panther
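The cursor-based paging used above can be sketched without network access. In this illustration, fetch_page and collect_all are hypothetical helpers and the page contents are invented; only the pageInfo, hasNextPage and endCursor keys are taken from the code above.

```python
def fetch_page(end_cursor):
    """Stand-in for the HTTP call; returns one fake page of results."""
    pages = {
        '': {'reviews': ['r1', 'r2'],
             'pageInfo': {'hasNextPage': True, 'endCursor': 'c1'}},
        'c1': {'reviews': ['r3', 'r4'],
               'pageInfo': {'hasNextPage': True, 'endCursor': 'c2'}},
        'c2': {'reviews': ['r5'],
               'pageInfo': {'hasNextPage': False}},
    }
    return pages[end_cursor]


def collect_all(fetch, max_pages=1000):
    """Follow cursors until the last page or until max_pages is reached."""
    collected = []
    cursor = ''  # an empty cursor requests the first page
    for _ in range(max_pages):
        page = fetch(cursor)
        collected.extend(page['reviews'])
        info = page['pageInfo']
        if not info.get('hasNextPage'):
            break
        cursor = info['endCursor']
    return collected


all_reviews = collect_all(fetch_page)
print(all_reviews)  # ['r1', 'r2', 'r3', 'r4', 'r5']
```

The max_pages cap is a design choice of this sketch: it keeps the loop from running forever if a server keeps reporting another page.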


# Creating Review Dataframe

In [9]:
# Empty data dictionary
data = defaultdict(list)

# Keeping only reviewers whose user ID is nine characters long
users_all = [review['user']['userId'] for review in reviews]
idx = [i for i, user_id in enumerate(users_all) if len(user_id) == 9]
users = [users_all[i] for i in idx]
data['user'].extend(users)

# Super reviewer flag (converted to 0/1)
super_reviewer = [int(reviews[i]['isSuperReviewer']) for i in idx]
data['super_reviewer'].extend(super_reviewer)

# Profanity flag (converted to 0/1)
profanity = [int(reviews[i]['hasProfanity']) for i in idx]
data['profanity'].extend(profanity)

# Written review
data['review'].extend([reviews[i]['review'] for i in idx])

# Star rating: strip the 'STAR_' prefix and turn the remaining '_' into a decimal point
star_rating = [reviews[i]['rating'] for i in idx]
star_rating = [float(x.replace('STAR_','').replace('_','.')) for x in star_rating]
data['rating'].extend(star_rating)

# Creating dataframe of reviews, dropping the first three rows and saving it to disk
df = pd.DataFrame(data)
df = df.iloc[3:]
df.to_pickle('reviews.pkl')

| | user | super_reviewer | profanity | review | rating |
|---|---|---|---|---|---|
| 3 | 978824977 | 0 | 0 | Good but not that surprising. | 4.5 |
| 4 | 906471241 | 0 | 0 | It was exciting to see this in theaters with m... | 5.0 |
| 5 | 978925578 | 0 | 0 | I'm not a huge Marvel fan, but this movie is V... | 4.0 |
| 6 | 978898527 | 0 | 0 | It was absolutely appaling!!!!! I have never b... | 0.5 |
| 7 | 977911687 | 0 | 0 | Best movie of all time? Best drama of all tim... | 1.5 |
| ... | ... | ... | ... | ... | ... |
| 12529 | 907803058 | 0 | 0 | A particular type of movie transports us to a ... | 5.0 |
| 12530 | 903748135 | 0 | 0 | Marvel Studios is on a roll lately. Their last... | 4.5 |
| 12531 | 905445336 | 0 | 0 | Black Panther has flashy action and one-liners... | 4.0 |
| 12532 | 802139043 | 0 | 0 | Outstanding plot, costumes, characters, and pe... | 5.0 |
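The ID filter and the rating conversion used when building the dataframe can be checked on a few hand-made records. This is a small sketch: parse_star_rating is a hypothetical helper, the sample records are invented, and the exact label forms (such as 'STAR_4_5' or 'STAR_5') are assumptions inferred from the replace calls above.

```python
def parse_star_rating(label):
    """Convert a label such as 'STAR_4_5' to the float 4.5."""
    return float(label.replace('STAR_', '').replace('_', '.'))


# Invented records that mimic the nested layout used above.
sample_reviews = [
    {'user': {'userId': '978824977'}, 'rating': 'STAR_4_5'},
    {'user': {'userId': 'abcdef12-3456'}, 'rating': 'STAR_3'},  # not nine characters
    {'user': {'userId': '906471241'}, 'rating': 'STAR_5'},
]

# Keep only records whose user ID is nine characters long.
kept = [r for r in sample_reviews if len(r['user']['userId']) == 9]
ratings = [parse_star_rating(r['rating']) for r in kept]
print(ratings)  # [4.5, 5.0]
```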
