<a href="https://colab.research.google.com/github/Alan-Crowetz/Movie-Reviews/blob/main/MovieReviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rotten Tomatoes Movie Review Analysis

## Goals

## Getting the Data

To start off, we need some Rotten Tomatoes data to analyze. At a minimum, we should try to get the movie name, genre, reviewer name, reviewer score, and audience score. That should give us a good start toward achieving our goals.

Luckily for us, a kind soul on Kaggle has already gone through the trouble of scraping movie and reviewer data off of Rotten Tomatoes! We can go ahead and download the dataset as CSV from the following link: https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset

The data contains two CSV files, one covering critic reviews and the other covering movie information.
Also of note is that the dataset is CC0: Public Domain, which means we can use it however we like.

This saves us the time of manually collecting and aggregating the data ourselves. Unfortunately, it does come with some downsides:
1. The scrape was run on 10-31-2020, so if we want newer data, we'll have to scrape it again ourselves.
2. The uploader noted that the scrape took several days, so the score aggregates for some movies may be inconsistent because reviews were posted while the program was running.
    
With that in mind, let's go ahead and read the two files using pandas and store them in dataframes for analysis.

In [None]:
import pandas as pd

criticdf = pd.read_csv(r'rotten_tomatoes_critic_reviews.csv')
moviedf = pd.read_csv(r'rotten_tomatoes_movies.csv')

## Initial Observations

Alright, now we have them both in an easily editable and analyzable format. Let's take a quick look at them both:

## Critic Dataset

In [None]:
criticdf.head()

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Andrew L. Urban,False,Urban Cinefile,Fresh,,2010-02-06,A fantasy adventure that fuses Greek mythology...
1,m/0814255,Louise Keller,False,Urban Cinefile,Fresh,,2010-02-06,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,m/0814255,,False,FILMINK (Australia),Fresh,,2010-02-09,With a top-notch cast and dazzling special eff...
3,m/0814255,Ben McEachen,False,Sunday Mail (Australia),Fresh,3.5/5,2010-02-09,Whether audiences will get behind The Lightnin...
4,m/0814255,Ethan Alter,True,Hollywood Reporter,Rotten,,2010-02-10,What's really lacking in The Lightning Thief i...


In [None]:
criticdf.describe()

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
count,1130017,1111488,1130017,1130017,1130017,824081,1130017,1064211
unique,17712,11108,2,2230,2,814,8015,949181
top,m/star_wars_the_rise_of_skywalker,Emanuel Levy,False,New York Times,Fresh,3/5,2000-01-01,Parental Content Review
freq,992,8173,841481,13293,720210,90273,48019,267


Okay, the critic data seems pretty self-explanatory. We have a movie link that corresponds to a specific movie, the critic's name and publisher, the review score, whether the review was Fresh or Rotten, whether the critic is considered a top critic, the review date, and the actual content of the review.

We can use the movie link to join against the movie dataset and get genre information for our genre analysis. We'll also probably want to create a unique ID for critics that combines name and publisher, so reviewers with the same name aren't counted as one person.
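Here's a rough sketch of what that join could look like. It runs on two tiny made-up dataframes rather than the real files: the column names come from the dataset, but the critic names and the `reviews_with_genres` variable are placeholders of my own. It's one reasonable way to do it, not necessarily the final approach.

```python
import pandas as pd

# Toy stand-ins for criticdf and moviedf. Column names match the real files;
# the critic names are hypothetical.
critics = pd.DataFrame({
    "rotten_tomatoes_link": ["m/0878835", "m/0878835", "m/10"],
    "critic_name": ["Critic A", "Critic B", "Critic A"],
    "review_type": ["Fresh", "Rotten", "Fresh"],
})
movies = pd.DataFrame({
    "rotten_tomatoes_link": ["m/0878835", "m/10"],
    "genres": ["Comedy", "Comedy, Romance"],
})

# A left join keeps every review, even if its link has no match in the movie table.
# validate="many_to_one" raises an error if a link appears twice in `movies`.
reviews_with_genres = critics.merge(
    movies[["rotten_tomatoes_link", "genres"]],
    on="rotten_tomatoes_link",
    how="left",
    validate="many_to_one",
)
print(reviews_with_genres)
```

The `validate` argument is a cheap safety net: if the movie file ever contained duplicate links, the join would silently multiply reviews without it.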

We could also probably drop the review date and content for this analysis, although it would definitely be an interesting project to run a sentiment analysis on the content compared to the score, or even to see whether the length of time from the movie's release date to the review date is in any way correlated with the score... Scope creep is a dangerous animal.

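The biggest issue with the dataset is not with the data itself, but rather with the way critics review movies. In the first five rows alone, four of the reviews only give a binary "Fresh" or "Rotten" rather than a numeric score. The `describe()` output above tells the same story: there are 824,081 scores for 1,130,017 reviews, so roughly 27% of reviews carry no score at all. And the scores that do exist aren't consistent either. In fact, let's get the number of different review scores that movies have gotten.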

In [None]:
criticdf['review_score'].value_counts()

3/5         90273
4/5         83659
3/4         72366
2/5         60174
2/4         47546
            ...  
9.50/20         1
0.58/1          1
6.89/10         1
2.7/4           1
8.458/10        1
Name: review_score, Length: 814, dtype: int64

Wow, there are over 800 distinct score values that reviewers use to quantify movies. Who rates a movie 5.5542/10!? We'll need a way to standardize this across the board for all 1.13 million reviews. Okay, that's enough problems to conquer for the first dataset. Let's take a look at the second.
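Before moving on, here's a minimal sketch of what that standardization could look like: squash every score onto a 0 to 1 scale. It assumes scores are either fractions like `3/5` or letter grades like `B+` (both appear in the data). The `normalize_score` helper and the evenly spaced letter-grade mapping are my own assumptions, not anything defined by the dataset or by Rotten Tomatoes.

```python
import math

# Assumed ordering of letter grades, worst to best. The even spacing is an
# illustrative choice, not something the dataset defines.
LETTER_GRADES = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]
LETTER_SCALE = {g: i / (len(LETTER_GRADES) - 1) for i, g in enumerate(LETTER_GRADES)}


def normalize_score(raw):
    """Return a review score as a float in [0, 1], or None if it can't be parsed."""
    # Missing scores show up as NaN (a float) in the dataframe.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    text = str(raw).strip().upper()
    if text in LETTER_SCALE:
        return LETTER_SCALE[text]
    if "/" in text:
        numerator, _, denominator = text.partition("/")
        try:
            num, den = float(numerator), float(denominator)
        except ValueError:
            return None
        # Reject nonsense such as "6/0" or "7/5" instead of guessing.
        if not (math.isfinite(num) and math.isfinite(den)) or den <= 0 or not 0 <= num <= den:
            return None
        return num / den
    # Anything else (e.g. a bare number with no scale) is left unparsed.
    return None


for raw in ["3/5", "3.5/5", "9.50/20", "B+", float("nan")]:
    print(raw, "->", normalize_score(raw))
```

On the real data this could be applied with something like `criticdf['review_score'].map(normalize_score)`. Returning `None` for anything unrecognized is deliberate: it's easier to count and inspect the leftovers than to untangle bad guesses later.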

## Movie Dataset

In [None]:
moviedf.head()

Unnamed: 0,rotten_tomatoes_link,movie_title,movie_info,critics_consensus,content_rating,genres,directors,authors,actors,original_release_date,...,production_company,tomatometer_status,tomatometer_rating,tomatometer_count,audience_status,audience_rating,audience_count,tomatometer_top_critics_count,tomatometer_fresh_critics_count,tomatometer_rotten_critics_count
0,m/0814255,Percy Jackson & the Olympians: The Lightning T...,"Always trouble-prone, the life of teenager Per...",Though it may seem like just another Harry Pot...,PG,"Action & Adventure, Comedy, Drama, Science Fic...",Chris Columbus,"Craig Titley, Chris Columbus, Rick Riordan","Logan Lerman, Brandon T. Jackson, Alexandra Da...",2010-02-12,...,20th Century Fox,Rotten,49.0,149.0,Spilled,53.0,254421.0,43,73,76
1,m/0878835,Please Give,Kate (Catherine Keener) and her husband Alex (...,Nicole Holofcener's newest might seem slight i...,R,Comedy,Nicole Holofcener,Nicole Holofcener,"Catherine Keener, Amanda Peet, Oliver Platt, R...",2010-04-30,...,Sony Pictures Classics,Certified-Fresh,87.0,142.0,Upright,64.0,11574.0,44,123,19
2,m/10,10,"A successful, middle-aged Hollywood songwriter...",Blake Edwards' bawdy comedy may not score a pe...,R,"Comedy, Romance",Blake Edwards,Blake Edwards,"Dudley Moore, Bo Derek, Julie Andrews, Robert ...",1979-10-05,...,Waner Bros.,Fresh,67.0,24.0,Spilled,53.0,14684.0,2,16,8
3,m/1000013-12_angry_men,12 Angry Men (Twelve Angry Men),Following the closing arguments in a murder tr...,Sidney Lumet's feature debut is a superbly wri...,NR,"Classics, Drama",Sidney Lumet,Reginald Rose,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",1957-04-13,...,Criterion Collection,Certified-Fresh,100.0,54.0,Upright,97.0,105386.0,6,54,0
4,m/1000079-20000_leagues_under_the_sea,"20,000 Leagues Under The Sea","In 1866, Professor Pierre M. Aronnax (Paul Luk...","One of Disney's finest live-action adventures,...",G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,Earl Felton,"James Mason, Kirk Douglas, Paul Lukas, Peter L...",1954-01-01,...,Disney,Fresh,89.0,27.0,Upright,74.0,68918.0,5,24,3


In [None]:
moviedf.describe()

Unnamed: 0,runtime,tomatometer_rating,tomatometer_count,audience_rating,audience_count,tomatometer_top_critics_count,tomatometer_fresh_critics_count,tomatometer_rotten_critics_count
count,17398.0,17668.0,17668.0,17416.0,17415.0,17712.0,17712.0,17712.0
mean,102.214048,60.884763,57.139801,60.55426,143940.1,14.586326,36.374831,20.703139
std,18.702511,28.443348,68.370047,20.543369,1763577.0,15.146349,52.601038,30.248435
min,5.0,0.0,5.0,0.0,5.0,0.0,0.0,0.0
25%,90.0,38.0,12.0,45.0,707.5,3.0,6.0,3.0
50%,99.0,67.0,28.0,63.0,4277.0,8.0,16.0,8.0
75%,111.0,86.0,75.0,78.0,24988.0,23.0,44.0,24.0
max,266.0,100.0,574.0,100.0,35797640.0,69.0,497.0,303.0


Okay, right away it looks like there's a lot of information here that we don't need for this analysis. The fields we're most interested in are the audience score, genres, runtime, and content rating. The rest would probably be too granular for this analysis.
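As a sketch of that trimming step, the snippet below keeps only the columns of interest and splits the comma-separated `genres` string so each genre gets its own row. The rows are entirely made up (only the column names come from the dataset), and keeping `movie_title` for readability is my own assumption.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the real movie file.
movies = pd.DataFrame({
    "rotten_tomatoes_link": ["m/example_a", "m/example_b"],
    "movie_title": ["Example A", "Example B"],
    "genres": ["Comedy", "Comedy, Romance"],
    "runtime": [90.0, 111.0],
    "content_rating": ["R", "PG"],
    "audience_rating": [64.0, 53.0],
    "tomatometer_rating": [87.0, 67.0],
})

KEEP = ["rotten_tomatoes_link", "movie_title", "genres",
        "runtime", "content_rating", "audience_rating"]
slim = movies[KEEP].copy()

# Turn "Comedy, Romance" into ["Comedy", "Romance"], then one row per genre.
slim["genres"] = slim["genres"].str.split(", ")
by_genre = slim.explode("genres").rename(columns={"genres": "genre"})
print(by_genre)
```

One row per movie and genre makes a later per-genre `groupby` straightforward, at the cost of counting a multi-genre movie once in each of its genres.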

We also won't bother with the critic scores, since we'll be calculating them from the other dataset and there may be inconsistencies anyway.
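For a rough idea of how that recalculation might go, here's a sketch that assumes the critic score is simply the percentage of a movie's reviews marked Fresh. The reviews are made up, and the `summary` and `fresh_pct` names are placeholders.

```python
import pandas as pd

# Hypothetical reviews; column names match the critic file.
reviews = pd.DataFrame({
    "rotten_tomatoes_link": ["m/example_a"] * 4 + ["m/example_b"] * 2,
    "review_type": ["Fresh", "Fresh", "Fresh", "Rotten", "Rotten", "Fresh"],
})

# Flag Fresh reviews, then take the count and the share of Fresh per movie.
summary = (
    reviews.assign(is_fresh=reviews["review_type"].eq("Fresh"))
    .groupby("rotten_tomatoes_link")["is_fresh"]
    .agg(review_count="size", fresh_share="mean")
)
summary["fresh_pct"] = 100 * summary["fresh_share"]
print(summary)
```

Keeping `review_count` next to the percentage helps later on, since a score built from five reviews deserves less trust than one built from five hundred.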

## Formatting the data

Alright, let's go ahead and do some formatting based on the comments above. I'm going to do this quick and dirty and just edit the dataframes directly, since this will likely be a one-time analysis and the datasets are (relatively) small.

In [None]:
# Drop the review date and review content since they're outside of scope
criticdf = criticdf.drop(columns=['review_date', 'review_content'])

# TODO: combine the critic and publisher name into a key, then drop the critic name
# I'm keeping the publisher name for now to see if there are any publisher contrarians as well



In [None]:
criticdf

Unnamed: 0,rotten_tomatoes_link,critic_name,top_critic,publisher_name,review_type,review_score
0,m/0814255,Andrew L. Urban,False,Urban Cinefile,Fresh,
1,m/0814255,Louise Keller,False,Urban Cinefile,Fresh,
2,m/0814255,,False,FILMINK (Australia),Fresh,
3,m/0814255,Ben McEachen,False,Sunday Mail (Australia),Fresh,3.5/5
4,m/0814255,Ethan Alter,True,Hollywood Reporter,Rotten,
...,...,...,...,...,...,...
1130012,m/zulu_dawn,Chuck O'Leary,False,Fantastica Daily,Rotten,2/5
1130013,m/zulu_dawn,Ken Hanke,False,"Mountain Xpress (Asheville, NC)",Fresh,3.5/5
1130014,m/zulu_dawn,Dennis Schwartz,False,Dennis Schwartz Movie Reviews,Fresh,B+
1130015,m/zulu_dawn,Christopher Lloyd,False,Sarasota Herald-Tribune,Rotten,3.5/5
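The critic key mentioned in the comments above isn't built yet. A minimal sketch of one way to do it is below, using three rows from the `head()` output. The `critic_key` column name, the `" | "` separator, and the `"Unknown"` placeholder for missing names are all my own assumptions.

```python
import pandas as pd

# Three rows from the critic data shown earlier, including one with no critic name.
critics = pd.DataFrame({
    "critic_name": ["Andrew L. Urban", "Ben McEachen", None],
    "publisher_name": ["Urban Cinefile", "Sunday Mail (Australia)", "FILMINK (Australia)"],
})

# Fill missing names first, otherwise the concatenated key would be NaN too.
critics["critic_key"] = (
    critics["critic_name"].fillna("Unknown") + " | " + critics["publisher_name"]
)

# Drop the critic name but keep the publisher for publisher-level analysis.
critics = critics.drop(columns=["critic_name"])
print(critics)
```

Note that every unnamed review from a given publisher collapses into a single "Unknown" critic for that publisher, which may or may not be what we want once we start looking for contrarians.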


## Searching for Contrarians

## Finding movies that audiences and reviewers disagree on

## Finding reviewers that hate certain genres

## Final Summarization