# Comparing Review Sentiment to User Ratings

Since learning how to perform sentiment analysis, I've been looking for a project that uses it in a somewhat unique way. Analyzing IMDB reviews is nothing new, but I wondered how the sentiment of a movie's user reviews compares with its overall user rating. For this project, I will be using the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis model. This model measures not only whether a given input is positive or negative, but also how strongly that sentiment is conveyed in the text.

## Importing Packages

We will need requests and BeautifulSoup to scrape the data, and pandas and numpy to process it later on. Most of the web scraping code I will be using is a modified version of the code from [this](https://shravan-kuchkula.github.io/scrape_imdb_movie_reviews/) tutorial.

In [3]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import itertools

pd.options.display.max_colwidth=500

This is the url list to be iterated over. The urls correspond to the top 250 and bottom 250 movies by user rating which have at least 50,000 votes.

In [44]:
url_list = ["https://www.imdb.com/search/title/?title_type=feature&num_votes=50000,&view=simple&sort=user_rating,asc&count=250",
           "https://www.imdb.com/search/title/?title_type=feature&num_votes=50000,&view=simple&sort=user_rating,desc&count=250"]

This code (which is almost entirely taken from [this tutorial](https://shravan-kuchkula.github.io/scrape_imdb_movie_reviews/)) pulls the movie tag for each movie from the urls listed above. Once this is done, the next code block iterates over the tag list to create a list of movie urls.

In [100]:
all_movie_tags = []

for url in url_list:
    # Get html soup for a given url
    response = requests.get(url)
    movies_soup = BeautifulSoup(response.text, 'html.parser')

    # find all a-tags with class:None
    movie_tags = movies_soup.find_all('a', attrs={'class': None})

    # filter the a-tags to get just the titles
    # (tag.get avoids a KeyError on a-tags with no href; `and` is the boolean operator)
    movie_tags = [tag.get('href', '') for tag in movie_tags
                  if tag.get('href', '').startswith('/title') and tag.get('href', '').endswith('/')]

    # remove duplicate links
    movie_tags = list(dict.fromkeys(movie_tags))

    all_movie_tags = all_movie_tags + movie_tags

# Prints number of movie urls
print(len(all_movie_tags))

500
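
To make the filtering and de-duplication step easier to follow, here is a minimal sketch that runs the same logic on a plain list of href strings instead of a scraped page. The helper name `title_links` and the sample list are my own for illustration (only the two `/title/tt.../` ids come from the output further down); `dict.fromkeys` keeps the first occurrence of each link and preserves order.

```python
def title_links(hrefs):
    """Keep hrefs shaped like '/title/.../' and drop repeats, preserving order."""
    kept = [h for h in hrefs
            if h and h.startswith('/title') and h.endswith('/')]
    # dict keys are unique and keep insertion order, so this removes duplicates
    return list(dict.fromkeys(kept))


# Hypothetical sample of hrefs as they might appear on a search page
sample = ['/title/tt7886848/', '/title/tt7886848/', '/title/tt7886848/reviews',
          None, '/chart/top/', '/title/tt5988370/']

print(title_links(sample))
# ['/title/tt7886848/', '/title/tt5988370/']
```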


In [46]:
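# Makes movie links
base_url = "https://www.imdb.com"
movie_links = [base_url + tag for tag in all_movie_tags]
print("There are a total of " + str(len(movie_links)) + " movie user reviews")
print("Displaying 10 user reviews links")
movie_links[:10]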

There are a total of 500 movie user reviews
Displaying 10 user reviews links


['https://www.imdb.com/title/tt7886848/',
 'https://www.imdb.com/title/tt5988370/',
 'https://www.imdb.com/title/tt1213644/',
 'https://www.imdb.com/title/tt0362165/',
 'https://www.imdb.com/title/tt0799949/',
 'https://www.imdb.com/title/tt2574698/',
 'https://www.imdb.com/title/tt0185183/',
 'https://www.imdb.com/title/tt1098327/',
 'https://www.imdb.com/title/tt0466342/',
 'https://www.imdb.com/title/tt1073498/']

## Extracting data from the urls

Now that we have the urls for each movie's page, we can use them to pull the rating from the main page and the title and review text from the reviews page. We access the reviews page by appending the "reviews" suffix to the main url.

We then take the title, user rating, and review data and create a data frame with them.

In [47]:
data = []

for movie in movie_links:
    response = requests.get(movie + 'reviews')
    movie_soup = BeautifulSoup(response.text, 'html.parser')
    
    tit_response = requests.get(movie)
    tit_soup = BeautifulSoup(tit_response.text, 'html.parser')
    
    title = movie_soup.find_all('h3', itemprop='name')[0].getText().replace('\n','')
    reviews = movie_soup.find_all("div", {"class": "text show-more__control"})
    rating = tit_soup.find_all('span', itemprop='ratingValue')[0].getText()
    
    for review in reviews:
        data_app = [title, rating, review.getText()]
        data.append(data_app)
        
df = pd.DataFrame(data, columns=['title', 'rating', 'review_text'])
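
The loop above assumes every page still has the elements we look for, and that every movie url ends with a slash. As a hedged sketch (these two helpers, `reviews_url` and `first_text`, are hypothetical names and not part of the original tutorial), the url building and the "take the first match" step could be made a little more defensive like this:

```python
def reviews_url(movie_url):
    """Build the reviews page url whether or not movie_url ends with '/'."""
    return movie_url.rstrip('/') + '/reviews'


def first_text(elements, default=None):
    """Return the cleaned text of the first matched element, or default if none matched.

    `elements` is expected to be a list like the one returned by find_all,
    where each item has a getText() method.
    """
    if not elements:
        return default
    return elements[0].getText().replace('\n', '').strip()


print(reviews_url('https://www.imdb.com/title/tt7886848/'))
# https://www.imdb.com/title/tt7886848/reviews
```

With helpers like these, a movie whose page layout differs would yield `None` for the title or rating instead of stopping the whole scrape with an `IndexError`, and those rows could be filtered out afterwards.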

In [48]:
df.head()

Unnamed: 0,title,rating,review_text
0,Sadak 2 (2020),1.1,"This is a total ridiculous movie with a worst plot and there wasn't even a minute that this movie could be enjoyed. No story, no character development nothing whatsoever.The only plus point about this movie was it's soundtrack which was decent enough. My suggestion would be to skip this movie and rewatch some other comedy or drama movie..\nMy Rating : 0.5/5"
1,Sadak 2 (2020),1.1,"Movie said to be sequel have no connection from first part , it's just a ploy to grab eyeball , story is nt clear and screenplay is uneven , mahesh bhatt returned to direction after 21 years to give such a disaster, Alia bhatt hamming in every scene Sanjay dutt doesn't look interested Aditya roy kapur is wasted he is no where in second half I think bhatt have lost their touch in past few years It will add up to their flop listBetter give it miss their many better stuff available on ott"
2,Sadak 2 (2020),1.1,"The real pandemic is watching Sadak 2. Storyline, Screenplay everything is so Disappointing. Pain in the head. Even actors in this film have failed to impress."
3,Sadak 2 (2020),1.1,"Understand philosophically, Sadak 2 is the story of two broken characters coming together, who spiritually and mentally support each other.The story is straightforward, but the screenplay has been so entangled that for some time one does not understand where the film is going.The initial scenes of Sanjay Dutt and Aaliya confuse. However, later the picture starts to clear. However, by then the film is bored and runs on a very worn-out pattern.sadak 2 movie review\nThe writing of Pushpdeep Bha..."
4,Sadak 2 (2020),1.1,"#FinalVerdictThe first question that crosses your mind after having watched Sadak 2 is, for whom has director Mahesh Bhatt made this film? Is it for the Indian audience - the upper strata, the commoners, the hoi polloi? Or is it targetted at the international audience?Sadak 2 is neither novel nor experimental. It falls flat on its face! While bits and pieces of the first half are somewhat watchable, the film goes completely awry in its post-interval portions. A disaster of epic proportion! S..."


In [49]:
df.tail()

Unnamed: 0,title,rating,review_text
12466,Chinatown (1974),8.1,"As is often the case with any Jack Nicholson film, Jack was the greatest part of this film. While it is said to be a crime thriller meant to keep audiences on their toes with its action and drama, which is not really the effect that Chinatown has on the audience. This film actually makes more of a statement on the social and political situations in the United States (in this case L.A). If audiences walk into this film expecting a mindless crime thriller, then they will be sorely disappointed..."
12467,Chinatown (1974),8.1,"'Chinatown' is one of the best films of the 70s and without doubt one of the most memorable in the crime/detective genre. This is a first-rate picture all round with very few faults, if any. It's an intelligent mystery, complex yet relatively easy to follow, and has no difficulty in holding your attention from start to finish.Part of what makes 'Chinatown' so memorable is just how perfect it is in appearance. The cinematography is on another level to anything else I've seen from the 70s - ea..."
12468,Chinatown (1974),8.1,"By now it's only redundant to heap more praise on this film. The writing, acting, cinematography, direction, editing, etc. seamlessly come together as if predestined. And yes, I think Polanski's decision to go with a downbeat ending was the correct choice - that final scene is unforgettable. What I'd like to focus on is Faye Dunaway's remarkable contribution to the film. She reportedly did not get along with Polanski, in fact, was labeled ""difficult"" on several of her movies. Yet she turned ..."
12469,Chinatown (1974),8.1,"Very, very difficult to find any faults in this landmark film. The script is captivating; the soundtrack haunting; the cinematography and photography perfectly captures the hazy atmosphere of LA, especially at dusk; and the production values are first class. ""Chinatown"" is well deserving of its status as a 20th century classic in film.Through its story of a civic department run by greedy officials, ""Chinatown"" captures moral decay in 1930's LA so well. This is the key theme of the film. Ther..."
12470,Chinatown (1974),8.1,"Spoilers herein.Polanski is worth watching no matter what he does. Sometimes, the film is relatively free of context, like the nearly perfect `Ninth Gate.' But watching those take work because you have to cocreate the world.Sometimes the film is set in the context of a genre where the metanarrative is about how it sets within the genre. `Rosemary's Baby' was great because it played with everything that came before, adding great portions of architectural evil and fey vulnerability.Noir revolu..."


Now we save the raw data into a csv.

In [50]:
df.to_csv('reviews.csv')

Since we're going to aggregate the review data by title later, we also create a new dataframe containing only the title and user rating for each movie, and back it up to another .csv file. More on this later.

In [55]:
df_small = df[['title','rating']].drop_duplicates()
df_small.to_csv('movies.csv')

## VADER Sentiment Analysis

Now we perform the sentiment analysis on each review. To keep things simple, we will only save the compound score of each review rather than the separate positive, neutral, and negative components.
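
For intuition about why the compound score always lands between -1 and 1, here is a small sketch of the normalization step as I understand it from the VADER reference implementation: the summed word valences `x` are squashed with `x / sqrt(x^2 + alpha)`, with `alpha = 15` by default. The function name `normalize_compound` is my own, and the input values are made-up sums, not real reviews.

```python
import math


def normalize_compound(valence_sum, alpha=15):
    """Squash a raw valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp to guard against floating point drift at the extremes
    return max(-1.0, min(1.0, score))


# Larger raw sums move toward 1 (or -1) but never pass it
for raw in (0.0, 1.0, 4.0, -4.0, 20.0):
    print(raw, round(normalize_compound(raw), 4))
```

This is also why long, strongly worded reviews tend to pile up near -1 or 1: once the raw sum is large, adding more sentiment words barely changes the compound score.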

In [57]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # VADER lexicon; needed once before first use
analyzer = SentimentIntensityAnalyzer()

In [63]:
rev_data = []

for review in df['review_text']:
    rev_data.append(analyzer.polarity_scores(review)['compound'])

rev_df = pd.DataFrame(rev_data, columns=['score'])
rev_df.head()

Unnamed: 0,score
0,-0.6369
1,-0.9146
2,-0.844
3,0.937
4,-0.9635


Now we join the scores to our main dataframe, which we again back up as a .csv file.

In [64]:
df = df.join(rev_df)
df.head()

Unnamed: 0,title,rating,review_text,score
0,Sadak 2 (2020),1.1,"This is a total ridiculous movie with a worst plot and there wasn't even a minute that this movie could be enjoyed. No story, no character development nothing whatsoever.The only plus point about this movie was it's soundtrack which was decent enough. My suggestion would be to skip this movie and rewatch some other comedy or drama movie..\nMy Rating : 0.5/5",-0.6369
1,Sadak 2 (2020),1.1,"Movie said to be sequel have no connection from first part , it's just a ploy to grab eyeball , story is nt clear and screenplay is uneven , mahesh bhatt returned to direction after 21 years to give such a disaster, Alia bhatt hamming in every scene Sanjay dutt doesn't look interested Aditya roy kapur is wasted he is no where in second half I think bhatt have lost their touch in past few years It will add up to their flop listBetter give it miss their many better stuff available on ott",-0.9146
2,Sadak 2 (2020),1.1,"The real pandemic is watching Sadak 2. Storyline, Screenplay everything is so Disappointing. Pain in the head. Even actors in this film have failed to impress.",-0.844
3,Sadak 2 (2020),1.1,"Understand philosophically, Sadak 2 is the story of two broken characters coming together, who spiritually and mentally support each other.The story is straightforward, but the screenplay has been so entangled that for some time one does not understand where the film is going.The initial scenes of Sanjay Dutt and Aaliya confuse. However, later the picture starts to clear. However, by then the film is bored and runs on a very worn-out pattern.sadak 2 movie review\nThe writing of Pushpdeep Bha...",0.937
4,Sadak 2 (2020),1.1,"#FinalVerdictThe first question that crosses your mind after having watched Sadak 2 is, for whom has director Mahesh Bhatt made this film? Is it for the Indian audience - the upper strata, the commoners, the hoi polloi? Or is it targetted at the international audience?Sadak 2 is neither novel nor experimental. It falls flat on its face! While bits and pieces of the first half are somewhat watchable, the film goes completely awry in its post-interval portions. A disaster of epic proportion! S...",-0.9635


In [65]:
df.to_csv('reviews.csv')

Since we're analyzing the aggregate sentiment across all reviews, we take the mean of the sentiment scores for each movie and merge that score into our small dataframe (which, again, we save).

In [69]:
mean_scores = []

for title in df_small['title']:
    score = df[df['title'] == title]['score'].mean()
    mean_scores.append([title, score])

score_frame = pd.DataFrame(mean_scores, columns=['title', 'sentiment'])
    
df_small = df_small.merge(score_frame)
df_small.head()

Unnamed: 0,title,rating,sentiment
0,Sadak 2 (2020),1.1,-0.444752
1,Reis (2017),1.4,-0.384672
2,Disaster Movie (2008),1.9,-0.153744
3,Son of the Mask (2005),2.2,0.332328
4,Epic Movie (2007),2.4,-0.105588
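
As an aside, the per-title loop above can also be written with pandas' `groupby`, which computes every mean in one pass. This is only a sketch on toy data: the titles, ratings, and scores below are made up, and `toy_reviews` / `toy_movies` stand in for `df` and `df_small`.

```python
import pandas as pd

# Toy stand-ins for the review-level and movie-level dataframes
toy_reviews = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B"],
    "score": [0.5, -0.5, 0.2],
})
toy_movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "rating": [7.0, 3.0],
})

# Mean sentiment score per title, keeping titles in order of first appearance
mean_sentiment = (
    toy_reviews.groupby("title", sort=False)["score"]
    .mean()
    .reset_index()
    .rename(columns={"score": "sentiment"})
)

# Attach the mean sentiment to the movie-level dataframe
merged = toy_movies.merge(mean_sentiment, on="title")
print(merged)
```

Passing `on="title"` makes the join key explicit rather than relying on the two dataframes happening to share exactly one column name.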


In [82]:
df_small.to_csv('movies.csv')

In [1]:
# This block is to reload the csv file we want to graph later

#import pandas as pd
#import numpy as np

#df_small = pd.read_csv('movies.csv')

## Graphing the Data
We import our graphical packages first. Then, we modify the data types of the rating and title data to make sure the ratings are plotted as numbers and that the names display with the correct formatting. We also set up our Plotly graph to display the title of each movie when it is hovered over.

In [4]:
import matplotlib.pyplot as plt
import plotly.express as px

df_small['rating'] = df_small['rating'].astype('float64')
df_small['title'] = df_small['title'].str.replace('                  ',' ')

fig = px.scatter(df_small, x="sentiment", y="rating",
                 labels={"sentiment": "Review Sentiment Score",
                         "rating": "IMDB Rating"},
                 title="IMDB Movie Ratings vs. Review Sentiment Score",
                 hover_data=['title'])
fig.show()

## Observations
There appear to be a few broad categories of movie ratings vs. reviews. Keep in mind that the descriptions of these categories are generalizations, and some films will not fit any of these descriptions very well.
* **High Rating, Positive Sentiment:** Movies that most people think highly of and which were well received critically.
    * Examples: The Shawshank Redemption, The Lord of the Rings: The Return of the King, Toy Story, Casablanca
* **Low Rating, Positive Sentiment:** Movies that were not critically well received but are "fun." Think "low brow" or "turn your brain off" type movies. Also includes "good bad" movies.
    * Examples: Daddy Day Care, Gods of Egypt, Flubber, Judge Dredd, Paul Blart: Mall Cop, Twilight
* **High Rating, Negative Sentiment:** Movies that are critically well received and well executed, but which have a reputation of being "depressing" or "brutal."
    * Examples: Platoon, Come and See, Saving Private Ryan, 12 Years a Slave, Grave of the Fireflies
* **Low Rating, Negative Sentiment:** Movies that are generally seen as absolute trash, and are substantially less entertaining than "dumb but fun" or "good bad" movies. There is usually something incredibly objectionable either about the movie itself or surrounding it. Some movies in this category are victims of review brigading (e.g., Gunday).
    * Examples: A Serbian Film, A Good Day to Die Hard, Resident Evil: The Final Chapter, Battlefield Earth
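
The four categories above can be sketched as a simple labeling function. The cut-offs are my assumptions for illustration (a rating of 5.0 and a sentiment score of 0.0); the graph itself does not define exact boundaries, and `quadrant` is a hypothetical helper name.

```python
def quadrant(rating, sentiment, rating_cut=5.0, sentiment_cut=0.0):
    """Label a movie by which side of each (assumed) cut-off it falls on."""
    rating_label = "High Rating" if rating >= rating_cut else "Low Rating"
    sentiment_label = ("Positive Sentiment" if sentiment >= sentiment_cut
                       else "Negative Sentiment")
    return rating_label + ", " + sentiment_label


# Rating and sentiment values taken from the df_small output earlier in this post
print(quadrant(1.1, -0.444752))  # Sadak 2 (2020): Low Rating, Negative Sentiment
print(quadrant(2.2, 0.332328))   # Son of the Mask (2005): Low Rating, Positive Sentiment
```

To label every movie, something like `df_small.apply(lambda row: quadrant(row['rating'], row['sentiment']), axis=1)` should work, though movies sitting close to either cut-off will not fit their label very well.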

## Limitations

#### Any movie that was review brigaded positively or negatively will essentially be in the "wrong spot"
Review brigading is a term for when a group of people flood the reviews and ratings for a particular piece of media or product in order to alter its apparent reception. A specific example present in the graph is Gunday, which is listed with an IMDB Rating of 2.4 and sentiment score of -0.56978. This movie was apparently well received by critics, but was [review brigaded by Bangladeshis on social media](https://en.wikipedia.org/wiki/Gunday) in an effort to bury it due to a historical inaccuracy in the film. It is possible that other movies in this analysis have their ratings or reviews skewed due to such activity.