# Movie Ratings Project

## About the project

This project is about a company that publishes inflated movie ratings to increase viewers' interest in watching the movies. We will call the cheating company `Fake company`, and the other companies offering movie ratings `Company1`, `Company2` and `Company3`.

The goal of the project is to show that the Fake company manipulates its ratings by slightly increasing them, so that its customers are more interested in buying tickets for the movies.

## About the data

The data source consists of two CSV files:

- `fake-company.csv` - contains the Fake company's movie ratings
- `other-companies.csv` - contains the movie ratings of the other companies

---

`fake-company.csv`

Column | Definition
--- | ---------
MOVIE | The name of the movie and the release year
STARS | Number of stars presented on the website (rounded Rating value)
RATING | The rating of the movie, which the Fake company is providing to their customers. On a scale 1-5
RATING_COUNT | The number of users who rated the movie

`other-companies.csv`

Column | Definition
--- | -----------
MOVIE | The name of the movie and the release year
Company1_RATING | The movie rating provided by Company1 movie critics. On a scale 1-100
Company1_USER_RATING | The movie rating provided by Company1 users. On a scale 1-100
Company2_RATING | The movie rating provided by Company2 movie critics. On a scale 1-100
Company2_USER_RATING | The movie rating provided by Company2 users. On a scale 1-10
Company2_USER_RATING_COUNT | The number of users from Company2 who rated the movie
Company3_RATING | The movie rating provided by Company3 movie critics. On a scale 1-10
Company3_USER_RATING_COUNT | The number of users from Company3 who rated the movie

---

_Additional Notes:_

- _The companies provide ratings on different scales, sometimes even within one company. For example 1-5, 1-10, 1-100_
- _Some companies have both RATING and USER_RATING. This is because the company publishes its own rating based on movie critics, plus a user rating based on the users who rate the movie on the company's website_

## Solution

In [None]:
# import the libraries we are going to need for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Part 1: Exploring Fake company Displayed Scores versus Actual User Ratings

**Preview the Fake company data**

If we inspect the data, one thing we can easily notice is that RATING is rounded up in the STARS column. For example, 3.6 would be rounded to 4, not to 3.5.

In [None]:
fake_company = pd.read_csv("fake-company.csv")
fake_company.head()
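
The round-up described above can be illustrated on a few made-up ratings. This is only a sketch: the values below are not taken from `fake-company.csv`, and it assumes that stars are shown in half-star steps, which the data itself would need to confirm. It contrasts conventional rounding to the nearest half star with always rounding up.

```python
import numpy as np
import pandas as pd

# Made-up ratings for illustration only; not taken from fake-company.csv
ratings = pd.Series([3.6, 4.1, 4.5, 2.7])

# Conventional rounding to the nearest half star (assumed half-star steps)
nearest_half = (ratings * 2).round() / 2

# Always rounding up to the next half star
ceil_half = np.ceil(ratings * 2) / 2

comparison = pd.DataFrame({
    "RATING": ratings,
    "nearest_half": nearest_half,
    "ceil_half": ceil_half,
})
print(comparison)
```

With these toy values, 3.6 becomes 3.5 under conventional rounding but 4.0 when rounded up, which matches the pattern noted in the preview.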

**Let's explore the relationship between popularity of a movie and its rating. The plot is showing the relationship between rating and ratings count.**

In [None]:
plt.figure(figsize=(10, 4), dpi=150)
sns.scatterplot(data=fake_company, x='RATING', y='RATING_COUNT');

**Calculation of the correlation between the columns**

Here we can notice a small discrepancy between STARS and RATING: they are not perfectly correlated.

In [None]:
fake_company.corr(numeric_only=True)

**Here we will extract the movie year from the title and set the year as a separate column. And then we display the number of movies for each year.**

We don't really need the year column to prove our theory about the cheating company. This is just an exercise in working with our data.

In [None]:
fake_company['YEAR'] = fake_company['MOVIE'].apply(lambda title:title.split('(')[-1].replace(')', ''))
fake_company['YEAR'].value_counts()
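
As an alternative to splitting on `(`, the year could also be pulled out with a regular expression. The sketch below assumes every title ends with a four-digit year in parentheses, for example `Name (2015)`; the titles are hypothetical and only serve to show the pattern.

```python
import pandas as pd

# Hypothetical titles in the assumed "Name (YYYY)" format
titles = pd.DataFrame({
    "MOVIE": ["Movie A (2015)", "Movie B (Part 2) (2014)", "Movie C (2015)"]
})

# Capture a four-digit year inside the final pair of parentheses
titles["YEAR"] = titles["MOVIE"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Count how many movies fall in each year
year_counts = titles["YEAR"].value_counts()
print(year_counts)
```

Titles that do not match the pattern get `NaN` instead of a wrong value, which makes malformed rows easier to spot.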

**Visualize the count of movies per year with a plot**

In [None]:
sns.countplot(data=fake_company, x='YEAR')

**Display the 10 movies with the highest number of ratings count**

In [None]:
fake_company.nlargest(10, 'RATING_COUNT')

**Display how many movies have zero votes**

Displaying this data is also not needed to prove that the Fake company is cheating. It is another exercise that gives us a better understanding of the data we have.

In [None]:
no_votes = fake_company['RATING_COUNT'] == 0 
no_votes.sum()

**Create DataFrame of only reviewed movies by removing any films that have zero votes.**

In [None]:
fan_reviewed = fake_company[fake_company['RATING_COUNT'] > 0]

**The Fake company is displaying the rating as stars which are rounded up. Let's visualize this difference in distributions.**

We plot the distribution of the displayed ratings (STARS) against the actual ratings from votes (RATING). The KDEs are clipped to 0-5, because these are the only possible values.

In [None]:
plt.figure(figsize=(10, 4), dpi=150)
sns.kdeplot(data=fan_reviewed, x='RATING', clip=[0, 5], fill=True, label='Actual Rating')
sns.kdeplot(data=fan_reviewed, x='STARS', clip=[0, 5], fill=True, label='Stars Displayed')

plt.legend(loc=(1.05, 0.5))

**Let's now actually quantify this discrepancy. We will create a new column that holds the difference between the stars displayed to the customers and the actual rating.**

In [None]:
fan_reviewed = fan_reviewed.copy()
fan_reviewed.loc[:, "STARS_DIFF"] = fan_reviewed['STARS'] - fan_reviewed['RATING']
fan_reviewed.loc[:, "STARS_DIFF"] = fan_reviewed['STARS_DIFF'].round(2)

fan_reviewed

**Display the number of times a certain difference occurs**

For example, how many times the STARS were 0.1 points higher than the RATING, how many times 0.2 points higher, and so on.

We can see that for around 100 of the movies the STARS and the RATING are the same, but for 1 of the movies the displayed stars are 1 higher than the rating.

We can also notice that many movies display more STARS than their RATING, because many movies fall in the columns with a value of 0.1 or higher.

In [None]:
plt.figure(figsize=(12, 4), dpi=150)
sns.countplot(data=fan_reviewed, x='STARS_DIFF', palette='magma')
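
The same counts can be read as numbers instead of bars. The sketch below uses made-up STARS and RATING values (not rows from the dataset) and tabulates how often each difference occurs with `value_counts()`.

```python
import pandas as pd

# Made-up displayed stars and underlying ratings, for illustration only
toy = pd.DataFrame({
    "STARS": [4.0, 4.5, 3.0, 5.0, 4.5],
    "RATING": [3.6, 4.5, 2.7, 4.8, 4.5],
})

# Same calculation as in the notebook: displayed stars minus actual rating
toy["STARS_DIFF"] = (toy["STARS"] - toy["RATING"]).round(2)

# How many movies share each difference, ordered from smallest to largest
diff_counts = toy["STARS_DIFF"].value_counts().sort_index()
print(diff_counts)
```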

**We can see from the plot that one movie was displaying a full 1 star more than its actual rating! Let's display which movie it was**

In [None]:
fan_reviewed[fan_reviewed['STARS_DIFF'] == 1]

### Part 2: Comparison of Fake company Ratings to Other Companies

**Read other companies data and preview the data frame**

In [None]:
other_companies = pd.read_csv("other-companies.csv")
other_companies.head()

### Company1

**Let's first take a look at Company1. It has two sets of reviews, their critics reviews (ratings published by official critics) and user reviews.**

In [None]:
plt.figure(figsize=(10, 4), dpi=150)
sns.scatterplot(data=other_companies, x='Company1_RATING', y='Company1_USER_RATING')
plt.xlim(0,100)
plt.ylim(0,100)

**Let's quantify this difference by comparing the critics ratings and the user ratings for Company1.**

We will create a new column based off the difference between critics ratings and users ratings for Company1.

A larger positive value means that the critics liked the movie better than the users did. For example, if critics gave a movie a score of 50 and users gave it 40, we get 50 - 40 = 10. Likewise, a larger negative value means that the users liked the movie better than the critics did.

In [None]:
other_companies['Company1_Diff']  = other_companies['Company1_RATING'] - other_companies['Company1_USER_RATING']

**Let's now compute the overall mean difference. Since the differences can be negative or positive, we first take the absolute value of every difference and then take the mean. This gives the average absolute difference between the critics rating and the user rating.**

In [None]:
other_companies['Company1_Diff'].apply(abs).mean()
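
To see why the absolute value matters, here is a small sketch on made-up scores (not values from `other-companies.csv`). It compares the plain mean of the differences, where positive and negative values partly cancel, with the mean of their absolute values.

```python
import pandas as pd

# Made-up critic and user scores on a 0-100 scale, for illustration only
toy = pd.DataFrame({
    "critics": [50, 80, 30],
    "users": [40, 90, 60],
})

# Signed differences: 10, -10, -30
diff = toy["critics"] - toy["users"]

mean_diff = diff.mean()            # signed differences partly cancel out
mean_abs_diff = diff.abs().mean()  # only the size of each gap counts

print(mean_diff, mean_abs_diff)
```

The signed mean hides the first two rows entirely, while the absolute mean reports the typical size of the disagreement regardless of direction.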

**Plot the distribution of the differences between Company1 critics score and Company1 user score. There should be negative values in this distribution plot.**

In [None]:
plt.figure(figsize=(10, 4), dpi=200)
sns.histplot(data=other_companies, x='Company1_Diff', kde=True, bins=25)
plt.title("Company1 Critics score minus Company1 User score");

**Create a distribution showing the absolute value difference between Critics and Users on Company1.**

In [None]:
plt.figure(figsize=(10, 4), dpi=200)
sns.histplot(x=other_companies['Company1_Diff'].apply(abs), bins=25, kde=True)
plt.title("Absolute difference between Company1 Critics score and Company1 User score");

**Let's find out which movies are causing the largest differences. First, show the top 5 movies with the largest negative difference between Company1 users and critics. Since we calculated the difference as Critics Rating - Users Rating, then large negative values imply the users rated the movie much higher on average than the critics did.**

**Display the top 5 movies users rated higher than critics on average**

In [None]:
print("Users love, but Critics hate")
other_companies.nsmallest(5, 'Company1_Diff')[['MOVIE', 'Company1_Diff']]

**Now show the top 5 movies that critics scored higher than users on average.**

In [None]:
print("Critics love, but Users hate")
other_companies.nlargest(5, 'Company1_Diff')[['MOVIE', 'Company1_Diff']]

### Company2

**Now let's take a quick look at the ratings from Company2. Company2 also shows an average user rating versus their critics displayed rating.**

**Display a scatterplot of the Company2 Critics rating versus User rating.**

In [None]:
plt.figure(figsize=(10, 4), dpi=150)
sns.scatterplot(data=other_companies, x='Company2_RATING', y='Company2_USER_RATING')
plt.xlim(0,100)
plt.ylim(0,10)

### Company3

**Finally let's explore Company3. Notice that both Company2 and Company3 report back rating counts. Let's analyze the most popular movies.**

Let's display a scatterplot of the relationship between rating counts on Company2 and rating counts on Company3.

In [None]:
plt.figure(figsize=(10, 4), dpi=150)
sns.scatterplot(data=other_companies, x='Company2_USER_RATING_COUNT', y='Company3_USER_RATING_COUNT')

**Notice there are two outliers here. The first outlier, with the highest rating count on Company3, has only about 500 Company2 ratings. The second outlier has a high count for both Company2 and Company3.**

**Display the first outlier - the movie with highest Company3 user rating count**

In [None]:
other_companies.nlargest(1, 'Company3_USER_RATING_COUNT')

**Display the second outlier - the movie with highest Company2 user rating count**

In [None]:
other_companies.nlargest(1,'Company2_USER_RATING_COUNT')

### Fake company scores vs. Other companies

**Finally let's begin to explore whether or not Fake company artificially displays higher ratings than warranted to boost movie ticket sales.**

**We will combine the Fake company table with the other companies table. Not every movie in the Fake company table is in the other companies table, since some Fake company movies have very few or no reviews. We only want to compare movies that are in both DataFrames, so we do an inner merge of the two DataFrames on the MOVIE column.**

In [None]:
df = pd.merge(fake_company, other_companies, on='MOVIE', how='inner')
df.head()
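
One way to check which movies an inner merge leaves out is an outer merge with `indicator=True`. The sketch below uses two tiny hypothetical tables (`fake_toy` and `others_toy`, with made-up titles and scores) in place of the real DataFrames.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; titles and scores are made up
fake_toy = pd.DataFrame({
    "MOVIE": ["A (2015)", "B (2015)", "C (2014)"],
    "STARS": [4.0, 3.5, 4.5],
})
others_toy = pd.DataFrame({
    "MOVIE": ["A (2015)", "C (2014)", "D (2015)"],
    "Company1_RATING": [74, 12, 55],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
audit = pd.merge(fake_toy, others_toy, on="MOVIE", how="outer", indicator=True)
print(audit[["MOVIE", "_merge"]])

# The rows marked 'both' are exactly the ones that how='inner' keeps
inner = pd.merge(fake_toy, others_toy, on="MOVIE", how="inner")
print(len(inner))
```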

In [None]:
df.info()

### Normalize columns to Fake company STARS and RATINGS 0-5

Notice that Company1, Company2 and Company3 don't use a score between 0-5 stars like Fake company does. In order to do a fair comparison, we need to normalize these values so they all fall between 0-5 stars and the relationship between reviews stays the same.

Simple way to convert ratings to have everything on a scale 0-5:

100 / 20 = 5

10 / 2 = 5

In [None]:
# Safe to re-run: results go into new *_Norm columns, so the source columns are never divided twice
df['Company1_Norm'] = np.round(df['Company1_RATING'] / 20, 1)
df['Company1User_Norm'] =  np.round(df['Company1_USER_RATING'] / 20, 1)

df['Company2_Norm'] =  np.round(df['Company2_RATING'] / 20, 1)
df['Company2User_Norm'] =  np.round(df['Company2_USER_RATING'] / 2, 1)

df['Company3_Norm'] = np.round(df['Company3_RATING'] / 2, 1)
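
The same conversion can be written once as a mapping from column to divisor. This is only a sketch: `SCALE_DIVISORS` and `normalize_to_five` are hypothetical names, the `_Norm5` suffix is chosen just to avoid clashing with the notebook's columns, and the scores are made up. The divisors follow the scales listed in the data description.

```python
import pandas as pd

# Divisors that map each source scale onto 0-5 (100 / 20 = 5, 10 / 2 = 5)
SCALE_DIVISORS = {
    "Company1_RATING": 20,
    "Company1_USER_RATING": 20,
    "Company2_RATING": 20,
    "Company2_USER_RATING": 2,
    "Company3_RATING": 2,
}

def normalize_to_five(frame, divisors=SCALE_DIVISORS):
    """Return a new DataFrame with one *_Norm5 column per rated column."""
    out = pd.DataFrame(index=frame.index)
    for column, divisor in divisors.items():
        out[f"{column}_Norm5"] = (frame[column] / divisor).round(1)
    return out

# Made-up scores for a single movie, for illustration only
toy = pd.DataFrame({
    "Company1_RATING": [74],
    "Company1_USER_RATING": [86],
    "Company2_RATING": [66],
    "Company2_USER_RATING": [7.2],
    "Company3_RATING": [7.8],
})
normalized = normalize_to_five(toy)
print(normalized)
```

Because the helper returns a new DataFrame and never modifies its input, calling it repeatedly gives the same result.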

In [None]:
df.head()

**Now let's create a norm_scores DataFrame that contains only the normalized ratings, together with STARS and RATING from the original Fake company table.**

In [None]:
norm_scores = df[['STARS', 'RATING', 'Company1_Norm', 'Company1User_Norm', 'Company2_Norm', 'Company2User_Norm', 'Company3_Norm']]
norm_scores.head()

### Comparing Distribution of Scores Across Companies

Now the moment of truth! Does Fake company display abnormally high ratings? We already know it pushes the displayed STARS higher than the actual RATING, but are the ratings themselves higher than average compared to the other companies?

Let's create a plot comparing the distributions of normalized ratings across all companies.

In [None]:
def move_legend(ax, new_loc, **kws):
    old_legend = ax.legend_
    handles = old_legend.legend_handles
    labels = [t.get_text() for t in old_legend.get_texts()]
    title = old_legend.get_title().get_text()
    ax.legend(handles, labels, loc=new_loc, title=title, **kws)

In [None]:
fig, ax = plt.subplots(figsize=(15, 6), dpi=150)
sns.kdeplot(data=norm_scores, clip=[0,5], fill=True, palette='Set1', ax=ax)
move_legend(ax, "upper left")

Clearly Fake Company has an uneven distribution. We can also see that Company1 critics have the most uniform distribution. Let's directly compare these two.

Let's create a KDE plot that compares the distribution of Company1 critic ratings against the STARS displayed by Fake company.

The plot below is similar to the one above, but shows only the 2 columns (Fake company STARS and Company1 critics ratings) where the difference is largest.

In [None]:
fig, ax = plt.subplots(figsize=(15, 6), dpi=150)
sns.kdeplot(data=norm_scores[['Company1_Norm', 'STARS']], clip=[0, 5], fill=True, palette='Set1', ax=ax)
move_legend(ax, "upper left")

Let's also create a histplot comparing all normalized scores.

In [None]:
plt.subplots(figsize=(15, 6), dpi=150)
sns.histplot(norm_scores, bins=50)

**How are the worst movies rated across all platforms?**

Let's create a clustermap visualization of all normalized scores. Note the differences in ratings, highly rated movies should be clustered together versus poorly rated movies.

In [None]:
sns.clustermap(norm_scores, cmap='magma', col_cluster=False)

Clearly Fake company is rating movies much higher than other companies, especially considering that it is then displaying a rounded up version of the rating. Let's examine the top 10 worst movies.

In [None]:
norm_films = df[['STARS', 'RATING', 'Company1_Norm', 'Company1User_Norm', 'Company2_Norm', 'Company2User_Norm', 'Company3_Norm', 'MOVIE']]

In [None]:
norm_films.nsmallest(10, 'Company1_Norm')

Let's visualize the distribution of ratings across all companies for the top 10 worst movies.

We can see that even for the 10 movies rated worst by Company1 critics, the Fake company is still displaying around 3.5-4 stars.

In [None]:
plt.figure(figsize=(15, 6), dpi=150)
worst_films = norm_films.nsmallest(10, 'Company1_Norm').drop('MOVIE', axis=1)
sns.kdeplot(data=worst_films, clip=[0,5], fill=True, palette='Set1')
plt.title("Ratings for Company1 critic's 10 Worst Reviewed Movies");

From the 10 worst movies, select the one at index 26. We pick this movie because it has the highest number of stars (4.5) that the Fake company displays to its customers.

In [None]:
norm_films.iloc[26]

In [None]:
# Sum for the Norms of the other companies (including critics + users where available)
norms_sum = 0.4 + 2.3 + 1.3 + 2.3 + 3
avg_norms_sum = norms_sum / 5
print(avg_norms_sum)
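
The same average can be computed from a row instead of typing the numbers by hand. The sketch below rebuilds a row from the five scores used in the cell above and the 4.5 stars mentioned earlier. Which score belongs to which column is an assumption made purely for illustration.

```python
import pandas as pd

# The five normalized scores from the cell above plus the displayed stars.
# The assignment of each score to a column is assumed, not read from the data.
row = pd.Series({
    "Company1_Norm": 0.4,
    "Company1User_Norm": 2.3,
    "Company2_Norm": 1.3,
    "Company2User_Norm": 2.3,
    "Company3_Norm": 3.0,
    "STARS": 4.5,
})

# Average over the other companies' columns only
other_cols = [name for name in row.index if name.endswith("_Norm")]
avg_other = round(row[other_cols].mean(), 2)

# Gap between the displayed stars and that average
gap = round(row["STARS"] - avg_other, 2)
print(avg_other, gap)
```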

Final thoughts: Fake company is showing around 3-4 star ratings for movies that are clearly bad! Notice the biggest offender, `Movie27`: Fake company displays 4.5 stars on its site for a movie with an average rating of 1.86 across the other companies!