Back in 2015, FiveThirtyEight published an [article](https://fivethirtyeight.com/features/fandango-movies-ratings/) entitled "Be Suspicious Of Online Movie Ratings, Especially Fandango’s". The article concluded that Fandango tends to overrate its movies. In other words, the number of stars that the website shows for a certain movie is often higher than the true rating for the same movie.

Our goal in this notebook is to validate these results through a detailed EDA of the given data set.

**To that end, the analysis consists of two main parts:**

1. Exploring Fandango Displayed Scores versus True User Ratings
2. Comparison of Fandango Ratings to Other Sites

The data used in this notebook is publicly available on [GitHub](https://github.com/fivethirtyeight/data).


## **The Data**

### all_sites_scores.csv
Contains every film that has a Rotten Tomatoes rating, an RT User rating, a Metacritic score, a Metacritic User score, an IMDb score, and at least 30 fan reviews on Fandango. The data from Fandango was pulled on Aug. 24, 2015.

Column | Definition
--- | -----------
FILM | The film in question
RottenTomatoes | The Rotten Tomatoes Tomatometer score for the film
RottenTomatoes_User | The Rotten Tomatoes user score for the film
Metacritic | The Metacritic critic score for the film
Metacritic_User | The Metacritic user score for the film
IMDB | The IMDb user score for the film
Metacritic_user_vote_count | The number of user votes the film had on Metacritic
IMDB_user_vote_count | The number of user votes the film had on IMDb

### fandango_scrape.csv
Contains every film 538 pulled from Fandango.

Column | Definition
--- | ---------
FILM | The movie
STARS | Number of stars presented on Fandango.com
RATING | The Fandango ratingValue for the film, as pulled from the HTML of each page. This is the actual average score the movie obtained.
VOTES | Number of people who had reviewed the film at the time the data was pulled

### Import the required libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # basic visualization 
import seaborn as sns # advanced visualization 

### 1. Exploring Fandango Displayed Scores versus True User Ratings

In [None]:
# reading the data
fandango = pd.read_csv("../input/fandango-rating-discrepancy/fandango_scrape.csv")

In [None]:
# display the top 5 rows
fandango.head()

### Let's do some feature engineering: separate the year into a new column

In [None]:
fandango["YEAR"] = fandango["FILM"].apply(lambda x: x.split("(")[-1].replace(")", "").strip())
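
The split-based extraction above relies on the year being the last parenthesized part of the title. As a hedged alternative, the sketch below pulls a four-digit year in trailing parentheses with a regular expression. The film titles are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical titles; the second one has extra parentheses inside the name
films = pd.DataFrame({"FILM": ["Example Film (2015)", "Another (Re)Make (2014)"]})

# Capture four digits inside the final pair of parentheses
films["YEAR"] = films["FILM"].str.extract(r"\((\d{4})\)\s*$", expand=False)

print(films["YEAR"].tolist())
```

The extracted values are strings. If numeric years are needed, they can be converted with `astype(int)` once it is confirmed that no title failed to match.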

In [None]:
fandango.head()

In [None]:
# a quick look at the data structure
fandango.info()

**Main info:**
1. The data has 504 rows
2. There are no null values
3. The variables have the right types

We are ready to start the analysis.

In [None]:
# basic statistics
fandango.describe()

**Findings:**
1. The mean of the true rating (RATING) is lower than the mean of the Fandango displayed scores (STARS)
2. The average number of votes is 1147.9 per movie
3. The median of the true rating (RATING) is lower than the median of the displayed scores (STARS), suggesting that the displayed scores are shifted toward higher values.

In [None]:
# number of movies per year
fandango["YEAR"].value_counts()

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.countplot(data = fandango, x = "YEAR");

In [None]:
# The relationship between the popularity of a film and its rating
plt.figure(figsize = (8,4), dpi = 100)
sns.scatterplot(data = fandango, x = "RATING", y = "VOTES");

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.scatterplot(data = fandango, x = "STARS", y = "VOTES", alpha = 0.5);

**Notice that many movies are rated 2 or above even though their number of votes is close to zero. We need to investigate this further, as follows.**

In [None]:
fandango[(fandango["VOTES"] < 2) & (fandango["RATING"] > 1)]

In [None]:
len(fandango[(fandango["VOTES"] < 2) & (fandango["RATING"] > 1)])

In [None]:
fandango["VOTES"].min()

In [None]:
fandango["VOTES"].max()

**As you can see, there are about 32 movies that were rated by only 1 viewer. In fact, VOTES ranges from zero all the way up to 34,846, which explains its very high standard deviation. Let's look at the distribution of VOTES.**

In [None]:
# movies with zero votes
fandango[fandango["VOTES"] == 0]

In [None]:
# number of movies with zero votes
len(fandango[fandango["VOTES"] == 0])

In [None]:
# drop the non-reviewed films (zero votes); copy() avoids a SettingWithCopyWarning when columns are added later
fandango = fandango[fandango["VOTES"] > 0].copy()

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.kdeplot(fandango["VOTES"], fill = True)  # fill replaces the deprecated shade argument

#### The distribution makes it clear that the majority of movies are rated by between 1 and 5,000 voters; very few movies are rated by more than 5,000 voters.

#### Another note: the previous KDE plot gives the impression that the VOTES variable has negative values, which is incorrect. We will fix the limits of the x-axis in the following graph to clear up any misconceptions.

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.kdeplot(fandango["VOTES"], fill = True)  # fill replaces the deprecated shade argument
plt.xlim((0,40000));
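
The apparent negative values in a KDE plot come from the smoothing itself: each observation is replaced by a small bell curve, and the curves near zero spill over to the left. The sketch below illustrates this with a hand-written Gaussian kernel density estimate. The vote counts and the bandwidth are arbitrary values chosen for illustration, and `gaussian_kde_at` is a hypothetical helper, not the estimator seaborn uses internally.

```python
import numpy as np

def gaussian_kde_at(x, samples, bandwidth):
    """Evaluate a Gaussian kernel density estimate at a single point x."""
    z = (x - samples) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean()

# Hypothetical vote counts, all non-negative
votes = np.array([1, 3, 10, 250, 1200, 5000], dtype=float)

# The estimated density is still positive to the left of zero,
# even though no observation is negative
density_below_zero = gaussian_kde_at(-100.0, votes, bandwidth=500.0)
print(density_below_zero > 0)
```

Besides limiting the axis with `plt.xlim`, `sns.kdeplot` also accepts `cut` and `clip` arguments that restrict where the curve is drawn or evaluated.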

#### Now let's calculate the correlation between columns

In [None]:
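# numeric_only skips the text columns (FILM, YEAR), which recent pandas versions would otherwise reject
fandango.corr(numeric_only = True)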

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.heatmap(fandango.corr(numeric_only = True), cmap = "viridis", linewidths = 1, annot = True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Let's now plot the distribution of the true RATING vs. STARS
plt.figure(figsize = (8,4), dpi = 100)
sns.kdeplot(data = fandango, x = "RATING", label = "True Rating", fill = True)
sns.kdeplot(data = fandango, x = "STARS", label = "Stars Displayed", fill = True)
plt.legend(loc = 2)

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.kdeplot(data = fandango, x = "RATING", label = "True Rating", fill = True, cumulative = True)
sns.kdeplot(data = fandango, x = "STARS", label = "Stars Displayed", fill = True, cumulative = True)
plt.legend(loc = 2);

### As you can see, for any movie with a rating below approximately 4, more stars are displayed than the true rating warrants.

### Let's quantify it and visualize the result

In [None]:
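# round to 2 decimals so floating-point noise does not split equal differences into separate values
fandango["STARS_DIFF"] = (fandango["STARS"] - fandango["RATING"]).round(2)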
fandango.head()
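
Some gap between STARS and RATING is expected simply because stars are shown in half-star steps. As a rough sketch of how ordinary rounding could be separated from inflation, the example below compares the displayed stars with the rating rounded to the nearest half star. The values are made up for illustration, and the `NEAREST_HALF` and `EXCESS` columns are hypothetical names that do not exist in the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings and displayed stars, not taken from the data set
toy = pd.DataFrame({
    "RATING": [4.5, 4.1, 3.6, 4.8],
    "STARS": [4.5, 4.5, 4.0, 5.0],
})

# Round each rating to the nearest half star (exact halves round up)
toy["NEAREST_HALF"] = np.floor(toy["RATING"] * 2 + 0.5) / 2

# Stars shown beyond what nearest-half rounding would give
toy["EXCESS"] = toy["STARS"] - toy["NEAREST_HALF"]

print(toy)
```

A positive `EXCESS` marks a film whose displayed stars are higher than nearest-half rounding alone would produce.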

In [None]:
fandango["STARS_DIFF"].value_counts(normalize = True) * 100

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.countplot(data = fandango, x = round(fandango["STARS_DIFF"], 2), palette = "viridis")
plt.xticks(rotation=90);

In [None]:
# We can see from the plot that one movie was displayed with a full 1-star difference from its true rating!
fandango[fandango["STARS_DIFF"] == 1]

### 2. Comparison of Fandango Ratings to Other Sites

In [None]:
# reading the data
all_sites = pd.read_csv("../input/fandango-rating-discrepancy/all_sites_scores.csv")

In [None]:
all_sites.head()

In [None]:
all_sites.info()

In [None]:
round(all_sites.describe(), 1)

## **Let's take a closer look at the sites one by one**

### **First: Rotten Tomatoes**

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.scatterplot(data = all_sites, x = "RottenTomatoes", y = "RottenTomatoes_User");

**From the scatter plot, we can quickly see that for some movies the user rating is higher than the Rotten Tomatoes (critic) rating, and vice versa for other movies. Let's confirm this finding with both a KDE plot and a table calculation.**

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.kdeplot(data = all_sites, x = "RottenTomatoes", label = "Rotten Tomatoes (Critic) Rating", fill = True)
sns.kdeplot(data = all_sites, x = "RottenTomatoes_User", label = "User Rating", fill = True)
plt.legend(loc = 2)

**As you can see, the critics are fairly conservative in their ratings. A plausible explanation is that they apply more systematic criteria, which makes their ratings more balanced than user ratings. In other words, users seem to exaggerate their feelings, while critics do not.**

**For weak movies (those with a rating below 30), critics gave higher ratings than users. For good movies (30 < rating < 90), users seem to give higher ratings than critics.**

In [None]:
all_sites["Rotten_Diff"] = all_sites["RottenTomatoes_User"] - all_sites["RottenTomatoes"]

In [None]:
all_sites.head()

**Let's now compare the overall mean difference. Since the differences can be negative or positive, we first take the absolute value of each difference and then take the mean. This gives the average absolute difference between the critic rating and the user rating.**

In [None]:
# mean absolute difference between user and critic scores
all_sites["Rotten_Diff"].abs().mean()

In [None]:
# Net difference 
all_sites["Rotten_Diff"].mean()

**On average, the user rating is higher than the critic rating by 3.03 points.**
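
To see why the net mean and the mean absolute difference can tell different stories, consider a small made-up example. The values below are hypothetical user-minus-critic differences, not figures from the data set.

```python
import pandas as pd

# Hypothetical user-minus-critic differences
diff = pd.Series([-10, 10, 20, -20])

net_mean = diff.mean()        # positive and negative gaps cancel out
abs_mean = diff.abs().mean()  # size of the typical gap, ignoring direction

print(net_mean, abs_mean)
```

Here the net mean is zero even though every film shows a sizeable gap, which is why both numbers are worth reporting.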

In [None]:
# Plot the distribution of the differences between RT User Score and RT Critics Score
# displot is a figure-level function, so its size is set with height and aspect
sns.displot(data = all_sites, x = "Rotten_Diff", kde = True, height = 4, aspect = 2)
plt.title("RT User Score minus RT Critics Score")
plt.show()

In [None]:
# Distribution of the absolute difference between Critics and Users on Rotten Tomatoes.
# displot is a figure-level function, so its size is set with height and aspect
sns.displot(x = all_sites["Rotten_Diff"].abs(), kde = True, height = 4, aspect = 2)
plt.title("Abs Difference between RT Critics Score and RT User Score")
plt.xlim(0,80)
plt.show()

In [None]:
#Top 5 movies with the largest positive difference between Users and RT critics
print("Users Love but Critics Hate")
all_sites[["FILM", "Rotten_Diff"]].sort_values(by = ["Rotten_Diff"], ascending = False)[:5]

In [None]:
#Top 5 movies with the largest negative difference between Users and RT critics
print("Critics love, but Users Hate")
all_sites[["FILM", "Rotten_Diff"]].sort_values(by = ["Rotten_Diff"], ascending = True)[:5]

### **Second: Let's Compare Ratings Across Websites**

In [None]:
all_sites.columns

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.regplot(data = all_sites, x = 'RottenTomatoes', y = 'Metacritic');

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.regplot(data = all_sites, x = 'RottenTomatoes', y = 'IMDB');

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.regplot(data = all_sites, x = 'Metacritic', y = 'IMDB');

**As can be seen from the above figures, there is a positive relationship between the ratings of the three websites. However, the relationship between the IMDb rating on one hand and the Metacritic and Rotten Tomatoes critic ratings on the other is quite dispersed, meaning that they often disagree about the same movie. This is not the case when comparing the Metacritic and Rotten Tomatoes ratings.**

In [None]:
#The relationship between vote counts on MetaCritic versus vote counts on IMDB.
plt.figure(figsize = (8,4), dpi = 100)
sns.scatterplot(data = all_sites, x = 'Metacritic_user_vote_count', y = 'IMDB_user_vote_count', size = 'IMDB_user_vote_count', sizes = (20,200), alpha = 0.8);

**There are some outliers in the above graph: The movie with the highest vote count on IMDB only has about 500 Metacritic ratings. What is this movie?**

In [None]:
all_sites[all_sites["IMDB_user_vote_count"] == all_sites["IMDB_user_vote_count"].max()]

In [None]:
# The movie that has the highest Metacritic user vote count
all_sites[all_sites["Metacritic_user_vote_count"] == all_sites["Metacritic_user_vote_count"].max()]

## **Fandango Scores vs. All Sites**
**Let's begin to explore whether or not Fandango artificially displays higher ratings than warranted to boost ticket sales.**

In [None]:
# merging the two DataFrames
all_fan = pd.merge(all_sites, fandango, on  = "FILM", how="inner")
all_fan.head()
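
An inner merge silently drops films that appear in only one of the two tables. As a hedged sketch of how that could be checked, the example below runs an outer merge with `indicator=True` on two small made-up tables and counts where each row came from. The film names and scores are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing a FILM key
left = pd.DataFrame({"FILM": ["A (2015)", "B (2015)", "C (2014)"], "IMDB": [7.1, 6.3, 5.5]})
right = pd.DataFrame({"FILM": ["B (2015)", "C (2014)", "D (2015)"], "STARS": [4.0, 3.5, 4.5]})

# indicator=True adds a _merge column: both, left_only or right_only
check = pd.merge(left, right, on="FILM", how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)

# The inner merge keeps only the rows marked as both
inner = pd.merge(left, right, on="FILM", how="inner")
print(len(inner))
```

Rows marked `left_only` or `right_only` are the films an inner merge would discard.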

In [None]:
len(all_fan)

In [None]:
all_fan.info()

#### **Normalize columns to Fandango STARS and RATINGS 0-5**

Notice that RT, Metacritic, and IMDb don't use a score between 0 and 5 stars like Fandango does. To make a fair comparison, we need to rescale these values so that they all fall between 0 and 5 stars while the relationship between reviews stays the same.

In [None]:
all_fan.columns

In [None]:
# Each normalized score is written to a new column computed from the original one,
# so this cell can safely be re-run.

# Rotten Tomatoes scores are on a 0-100 scale
all_fan['RT_Norm'] = np.round(all_fan['RottenTomatoes']/20,1)
all_fan['RTU_Norm'] =  np.round(all_fan['RottenTomatoes_User']/20,1)

# Metacritic critic scores are on a 0-100 scale, user scores on a 0-10 scale
all_fan['Meta_Norm'] =  np.round(all_fan['Metacritic']/20,1)
all_fan['Meta_U_Norm'] =  np.round(all_fan['Metacritic_User']/2,1)

# IMDb scores are on a 0-10 scale
all_fan['IMDB_Norm'] = np.round(all_fan['IMDB']/2,1)
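
The divisors 20 and 2 come from mapping a 0-100 or 0-10 scale onto 0-5. As an illustrative sketch, the same idea can be written as one helper that takes the maximum of the source scale. `to_five_star` is a hypothetical function name, the scores are made up, and the helper assumes every source scale starts at zero.

```python
import pandas as pd

def to_five_star(scores, scale_max, decimals=1):
    """Rescale scores from a 0..scale_max scale to a 0..5 scale."""
    return (scores / scale_max * 5).round(decimals)

# Hypothetical scores on their native scales
toy = pd.DataFrame({"RottenTomatoes": [80, 30], "IMDB": [6.0, 7.4]})

toy["RT_Norm"] = to_five_star(toy["RottenTomatoes"], scale_max=100)
toy["IMDB_Norm"] = to_five_star(toy["IMDB"], scale_max=10)

print(toy)
```

Because the transformation is a simple linear rescaling, the ordering of films and the relative distances between their scores are preserved, apart from the rounding.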

In [None]:
all_fan.head()

In [None]:
norm_scores = all_fan[['STARS','RATING','RT_Norm','RTU_Norm','Meta_Norm','Meta_U_Norm','IMDB_Norm']]
norm_scores.head()

In [None]:
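def move_legend(ax, new_loc, **kws):
    old_legend = ax.legend_
    # Matplotlib 3.7 renamed legendHandles to legend_handles; support both
    handles = getattr(old_legend, "legend_handles", None)
    if handles is None:
        handles = old_legend.legendHandles
    labels = [t.get_text() for t in old_legend.get_texts()]
    title = old_legend.get_title().get_text()
    ax.legend(handles, labels, loc=new_loc, title=title, **kws)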

In [None]:
fig, ax = plt.subplots(figsize=(8,4),dpi=100)
sns.kdeplot(data=norm_scores,clip=[0,5],fill=True,palette='Set1',ax=ax)
move_legend(ax, "upper left")

Clearly Fandango has an uneven distribution. We can also see that RT critics have the most uniform distribution. Let's directly compare these two.

In [None]:
fig, ax = plt.subplots(figsize=(8,4),dpi=100)
sns.kdeplot(data=norm_scores[['RT_Norm','STARS']],clip=[0,5],fill=True,palette='Set1',ax=ax)
move_legend(ax, "upper left")

In [None]:
# Histogram to compare all normalized scores 
plt.subplots(figsize=(8,4),dpi=100)
sns.histplot(norm_scores,bins=50);

### How are the worst movies rated across all platforms?

In [None]:
sns.clustermap(norm_scores,cmap='magma',col_cluster=False);

#### Clearly, Fandango is rating movies much higher than other sites, especially considering that it then displays a rounded-up version of the rating.

#### Let's now display the 10 lowest-rated movies based on the Rotten Tomatoes critic ratings. We will compare the normalized scores across all platforms for these movies.

In [None]:
# Attach the FILM column to the normalized scores.
# axis=1 joins the columns side by side on the shared index; the default axis=0
# would stack the rows instead and fill the missing cells with NaN.
norm_scores_c = pd.concat([norm_scores, all_fan["FILM"]], axis=1)
norm_scores_c.head()

In [None]:
norm_films = all_fan[['STARS','RATING','RT_Norm','RTU_Norm','Meta_Norm','Meta_U_Norm','IMDB_Norm','FILM']]
norm_films.head()

In [None]:
norm_films.sort_values(by = ["RT_Norm"])[:10]

# norm_films.nsmallest(10,'RT_Norm')

In [None]:
# Visualize the distribution of ratings across all sites for the 10 worst movies.
worst_10 = norm_films.sort_values(by = ["RT_Norm"])[:10]
fig, ax = plt.subplots(figsize=(8,4),dpi=100)
sns.kdeplot(data=worst_10,clip=[0,5],fill=True,palette='Set1',ax=ax)
move_legend(ax, "upper right")

## Fandango is showing around 3-4 star ratings for films that are clearly bad! 

> #### Main Source: 2021 Python for Machine Learning & Data Science Masterclass, taught by Jose Portilla on Udemy
https://www.udemy.com/course/python-for-machine-learning-data-science-masterclass/ 