## Overview

Should we trust online reviews and ratings websites like Fandango? Do different rating websites reflect similar ratings?

This data analysis is based on the FiveThirtyEight article [Be Suspicious Of Online Movie Ratings, Especially Fandango’s](http://fivethirtyeight.com/features/fandango-movies-ratings/).


----

### The Data

There are two CSV files: one with Fandango stars and displayed ratings, and one with aggregate movie ratings from other sites such as Metacritic, IMDb, and Rotten Tomatoes.

#### all_sites_scores.csv

`all_sites_scores.csv` contains every film that has a Rotten Tomatoes rating, an RT User rating, a Metacritic score, a Metacritic User score, an IMDb score, and at least 30 fan reviews on Fandango. The data from Fandango was pulled on Aug. 24, 2015.

----

#### fandango_scrape.csv

`fandango_scrape.csv` contains every film FiveThirtyEight (538) pulled from Fandango.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Exploring Fandango's dataset

Let's check if our analysis agrees with the article's conclusion.

In [None]:
fandango = pd.read_csv("/kaggle/input/fandango-rating-discrepancy/fandango_scrape.csv")
fandango.head()

In [None]:
fandango.describe()

In [None]:
fandango.info()

**Scatterplot showing the relationship between rating and votes.**

In [None]:
plt.figure(figsize=(11,5))
sns.scatterplot(data=fandango,x='RATING',y='VOTES');

**The correlation between the columns can be calculated with:**

In [None]:
# FILM is a text column, so restrict the correlation to numeric columns
fandango.corr(numeric_only=True)

**Getting the year from the film titles:**

In [None]:
# Take the text after the last "(" and strip the closing ")" as a literal character
fandango['YEAR'] = fandango['FILM'].apply(lambda title: title.split('(')[-1]).str.replace(')', '', regex=False)
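
The split-based approach assumes every title ends in `(YEAR)`. As a more defensive alternative, the sketch below uses `Series.str.extract` with a regular expression, so a title without a trailing four-digit year yields a missing value instead of a stray string. The sample titles are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical titles in the "Title (YEAR)" format, plus one malformed entry
films = pd.DataFrame({
    "FILM": [
        "Film A (2015)",
        "Film B (2014)",
        "Film C (Director's Cut) (2015)",
        "Untitled Project",
    ]
})

# Capture four digits inside the final pair of parentheses; no match gives NaN
films["YEAR"] = films["FILM"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# value_counts() ignores missing values by default
year_counts = films["YEAR"].value_counts()
print(year_counts)
```

Titles that fail to match can then be inspected with `films[films["YEAR"].isna()]` before plotting.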

In [None]:
fandango['YEAR'].value_counts()

In [None]:
plt.figure(figsize=(5,4))
sns.countplot(data=fandango,x='YEAR')

**What are the 10 movies with the highest number of votes?**

In [None]:
fandango.nlargest(10,'VOTES')

**Creating a DataFrame of only reviewed films by removing any films that have zero votes.**

In [None]:
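# .copy() makes an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
reviewed = fandango[fandango['VOTES'] > 0].copy()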

**We can visualize the difference between the STARS and RATING distributions by creating a KDE plot.**

In [None]:
plt.figure(figsize=(10,4),dpi=200)
sns.kdeplot(data=reviewed,x='RATING',clip=[0,5],fill=True,label='True Rating')
sns.kdeplot(data=reviewed,x='STARS',clip=[0,5],fill=True,label='Stars Displayed')

plt.legend(loc=(1.05,0.75))

**Calculating this difference with STARS-RATING.**

In [None]:
reviewed['STARS_DIFF'] = (reviewed['STARS'] - reviewed['RATING']).round(1)
reviewed
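
One way to read these differences is to compare the displayed stars with what conventional rounding to the nearest half star would produce. Under that rule the gap could never exceed 0.25 stars, so larger gaps point to a different rounding rule. The sketch below uses invented numbers, and `NEAREST_HALF` and `ROUND_UP_HALF` are hypothetical helper columns that do not exist in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings and displayed stars (not real dataset rows)
toy = pd.DataFrame({
    "RATING": [4.1, 4.5, 3.6, 4.0],
    "STARS": [4.5, 4.5, 4.0, 5.0],
})

# Conventional rounding to the nearest half star
toy["NEAREST_HALF"] = (toy["RATING"] * 2).round() / 2
# Always rounding up to the next half star
toy["ROUND_UP_HALF"] = np.ceil(toy["RATING"] * 2) / 2
# Same difference as in the notebook: displayed stars minus true rating
toy["STARS_DIFF"] = (toy["STARS"] - toy["RATING"]).round(1)

# Largest gap that nearest-half rounding could explain for these rows
max_nearest_gap = (toy["NEAREST_HALF"] - toy["RATING"]).abs().max()
# Rows where the displayed stars exceed even the round-up rule
beyond_round_up = toy[toy["STARS"] > toy["ROUND_UP_HALF"]]

print(toy)
print(beyond_round_up)
```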

**With a count plot we can see the number of times a certain difference occurs:**

In [None]:
plt.figure(figsize=(10,4))
sns.countplot(data=reviewed,x='STARS_DIFF')

**The plot shows the largest differences: one movie displayed a star rating a full star higher than its true rating!**

In [None]:
reviewed[reviewed['STARS_DIFF']==1]

## Comparison between Fandango Ratings and Other Sites

Using the second dataset, we can compare the scores from Fandango with those from other movie sites to see how they behave.


In [None]:
all_sites = pd.read_csv("/kaggle/input/fandango-rating-discrepancy/all_sites_scores.csv")

**Exploring the DataFrame columns, info, description.**

In [None]:
all_sites.head()

In [None]:
all_sites.info()

In [None]:
all_sites.describe()

In [None]:
# FILM is a text column, so restrict the correlation to numeric columns
all_sites.corr(numeric_only=True)

### Rotten Tomatoes

Rotten Tomatoes has critic reviews (ratings published by official critics) and user reviews.

**Scatterplot exploring the relationship between RT Critic reviews and RT User reviews:**

In [None]:
plt.figure(figsize=(8,3))
sns.scatterplot(data=all_sites,x='RottenTomatoes',y='RottenTomatoes_User')
plt.xlim(0,100)
plt.ylim(0,100)

**Creating a new column based off the difference between critics ratings and users ratings for Rotten Tomatoes:** 

In [None]:
all_sites['DIFF_RATING'] = all_sites['RottenTomatoes']-all_sites['RottenTomatoes_User']


**Calculating the mean absolute difference between RT critic scores and RT user scores:**

In [None]:
all_sites['DIFF_RATING'].abs().mean()
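
The absolute value matters here because signed differences cancel out: a film critics overrate by 20 points and one they underrate by 20 points average to zero, which hides the disagreement. A minimal sketch with invented scores:

```python
import pandas as pd

# Hypothetical critic and user scores on the 0-100 scale
toy = pd.DataFrame({
    "RottenTomatoes": [90, 20, 50],
    "RottenTomatoes_User": [70, 40, 50],
})

diff = toy["RottenTomatoes"] - toy["RottenTomatoes_User"]  # 20, -20, 0

signed_mean = diff.mean()          # positive and negative gaps cancel
mean_abs_diff = diff.abs().mean()  # average size of the gap, ignoring direction

print(signed_mean, mean_abs_diff)
```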


**Plotting the distribution of the differences between RT Critics Score and RT User Score. Because the difference is signed, the distribution includes negative values (films that users scored higher than critics).**

In [None]:
plt.figure(figsize=(12,4))
sns.histplot(data = all_sites,x='DIFF_RATING',kde=True,bins=25)
plt.title("RT Critics Score minus RT User Score");

**Creating a distribution showing the *absolute value* difference between Critics and Users on Rotten Tomatoes.**

In [None]:
plt.figure(figsize=(12,4))
all_sites['abs_diff'] = all_sites['DIFF_RATING'].abs()
sns.histplot(data = all_sites,x='abs_diff',kde=True,bins=25)
plt.title("Abs difference between RT Critics Score and RT User Score");

**The top 5 movies users rated higher than critics on average:**

In [None]:
print("Users Love but Critics Hate")
all_sites[['FILM','DIFF_RATING']].nsmallest(5,'DIFF_RATING')

**The top 5 movies critics scored higher than users on average:**

In [None]:
print("Critics Love but Users Hate")
all_sites[['FILM','DIFF_RATING']].nlargest(5,'DIFF_RATING')

### Metacritic

Metacritic also shows an average user rating versus their official displayed rating.

**Scatterplot of the Metacritic Rating versus the Metacritic User rating:**

In [None]:
plt.figure(figsize=(12,4))
sns.scatterplot(data=all_sites,x='Metacritic',y='Metacritic_User')

## Fandango Scores vs. All Sites

**Combining the Fandango table with the All Sites table. Not every movie in the Fandango table is in the All Sites table, since some Fandango movies have very few or no reviews.**

In [None]:
df = pd.merge(fandango,all_sites, how='inner', on="FILM")

In [None]:
df.info()
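
An inner join silently drops films that appear in only one table. To see which titles are lost, one option is a left join with `indicator=True`, which adds a `_merge` column recording where each row came from. The sketch below uses small invented tables; `validate="one_to_one"` is an extra safeguard, added here as an assumption, that raises an error if a title is duplicated in either table.

```python
import pandas as pd

# Hypothetical stand-ins for the Fandango and all-sites tables
left = pd.DataFrame({
    "FILM": ["Film A (2015)", "Film B (2015)", "Film C (2014)"],
    "STARS": [4.5, 4.0, 3.5],
})
right = pd.DataFrame({
    "FILM": ["Film A (2015)", "Film C (2014)"],
    "IMDB": [7.8, 6.1],
})

# Left join keeps every Fandango row; _merge shows whether a match was found
check = pd.merge(left, right, how="left", on="FILM",
                 indicator=True, validate="one_to_one")
unmatched = check.loc[check["_merge"] == "left_only", "FILM"].tolist()

# The inner join used in the analysis keeps only films present in both tables
inner = pd.merge(left, right, how="inner", on="FILM")

print(unmatched)
print(len(inner))
```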

RT, Metacritic, and IMDb don't use a 0-5 star scale like Fandango does, so we need to *normalize* their values so that they all fall between 0 and 5 stars while the relationship between reviews stays the same.

**Creating new normalized columns for all ratings so they match up within the 0-5 star range shown on Fandango.**

In [None]:
df['RT_Norm'] = np.round(df['RottenTomatoes']/20,1)
df['RTU_Norm'] =  np.round(df['RottenTomatoes_User']/20,1)

In [None]:
df['Meta_Norm'] =  np.round(df['Metacritic']/20,1)
df['Meta_U_Norm'] =  np.round(df['Metacritic_User']/2,1)

In [None]:
df['IMDB_Norm'] = np.round(df['IMDB']/2,1)
df.head()
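
The divisors above follow from each site's native scale: dividing a 0-100 score by 20, or a 0-10 score by 2, is the same as multiplying by 5 and dividing by the scale maximum. The sketch below writes that rule once as a helper. `to_five_star` and `SCALE_MAX` are hypothetical names introduced for illustration, and the scores are invented.

```python
import pandas as pd

# Maximum possible score on each site's native scale
SCALE_MAX = {
    "RottenTomatoes": 100,
    "RottenTomatoes_User": 100,
    "Metacritic": 100,
    "Metacritic_User": 10,
    "IMDB": 10,
}


def to_five_star(scores: pd.Series, scale_max: float) -> pd.Series:
    """Linearly rescale scores from [0, scale_max] to [0, 5], rounded to 1 decimal."""
    return (scores * 5 / scale_max).round(1)


# Hypothetical raw scores for two films
toy = pd.DataFrame({
    "RottenTomatoes": [80, 30],
    "RottenTomatoes_User": [90, 40],
    "Metacritic": [66, 50],
    "Metacritic_User": [7.4, 5.0],
    "IMDB": [8.6, 6.0],
})

normalized = pd.DataFrame(
    {col: to_five_star(toy[col], top) for col, top in SCALE_MAX.items()}
)
print(normalized)
```

Keeping the scale maxima in one mapping makes it harder to apply the wrong divisor to a column.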

**Creating a norm_scores DataFrame that only contains the normalized ratings.**

In [None]:
norm_scores = df[['STARS','RATING','RT_Norm','RTU_Norm','Meta_Norm','Meta_U_Norm','IMDB_Norm']]


In [None]:
norm_scores.head()

**Now we can create a plot comparing the distributions of normalized ratings across all sites and see whether Fandango's ratings show a discrepancy.**

In [None]:
plt.figure(figsize=(15,6),dpi=150)
kdes = sns.kdeplot(data=norm_scores,clip=[0,5],fill=True,palette='Set1')
sns.move_legend(kdes, "upper left")

**We can clearly see that Fandango has an uneven distribution. We can also see that RT critics have the most uniform distribution.**

### How are the worst movies rated across all platforms?

**Creating a clustermap visualization of all normalized scores. Note the differences in ratings: highly rated movies should cluster together, apart from poorly rated movies.**

In [None]:
sns.clustermap(norm_scores,cmap='magma',col_cluster=False);
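
The clustermap gives the overall picture. To look at the worst films specifically, one option is to select the lowest-rated titles according to one site and read the other sites' scores for the same rows. A minimal sketch on invented data follows; using `RT_Norm` as the ranking column and a cutoff of 2 are assumptions, not choices made in the analysis.

```python
import pandas as pd

# Hypothetical normalized scores (0-5 stars) for four films
toy = pd.DataFrame({
    "FILM": ["Film A", "Film B", "Film C", "Film D"],
    "STARS": [4.5, 4.0, 3.5, 4.5],
    "RT_Norm": [4.6, 0.6, 0.4, 3.9],
    "IMDB_Norm": [4.1, 2.7, 2.3, 3.6],
})

# The two films with the lowest Rotten Tomatoes critic score
worst = toy.nsmallest(2, "RT_Norm").set_index("FILM")

# How far Fandango's displayed stars sit above the critic score for those films
worst["STARS_minus_RT"] = worst["STARS"] - worst["RT_Norm"]
print(worst)
```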

**From the clustermap we can see that Fandango rates movies much higher than the other sites. The correlations between the normalized ratings show how closely the sites agree with one another:**

In [None]:
norm_scores.corr()
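
Correlation measures whether sites rank films similarly, not whether one site scores higher overall, so comparing column means is a useful complement when checking for inflated ratings. A minimal sketch on invented normalized scores:

```python
import pandas as pd

# Hypothetical normalized scores (0-5 stars)
toy = pd.DataFrame({
    "STARS": [4.5, 4.0, 5.0, 3.5],
    "RT_Norm": [4.0, 1.0, 4.5, 0.5],
    "IMDB_Norm": [3.8, 3.0, 4.2, 2.6],
})

# Average score per site, highest first
means = toy.mean().sort_values(ascending=False)

# Correlation of the displayed stars with each other site
stars_corr = toy.corr()["STARS"].drop("STARS")

print(means)
print(stars_corr)
```

On the real data, the same two calls on `norm_scores` would show both the level and the agreement of each site's ratings.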

**The correlations of RATING and STARS with the other ratings are significantly lower.**