# IMDb Ratings EDA 🎬

Ever wondered who the highest rated actors are? Or whether newer movies are more highly rated than older ones? Well, if so (or not), you've come to the right place. In this notebook, I explore this comprehensive IMDb dataset and perform some exploratory data analysis to answer those burning questions.

# Essential Imports

In [None]:
import numpy as np
import pandas as pd
import os
import sys
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as py
import seaborn as sns
import math
import itertools
from scipy.stats import pearsonr
import gc
py.init_notebook_mode(connected=True)

The variable below, ```LOOK_AT```, controls how many entries the "top N" charts display. If you fork this notebook and would like to show more or fewer entries per graph, the easiest way to do so is by changing the value of ```LOOK_AT``` below.



In [None]:
LOOK_AT = 10

In [None]:
movies = pd.read_csv("../input/imdb-extensive-dataset/IMDb movies.csv")
ratings = pd.read_csv("../input/imdb-extensive-dataset/IMDb ratings.csv")

# Data Preprocessing

Drop all columns with more than 8% null values.

In [None]:
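# Boolean mask of the columns to keep: null fraction of at most 8%
drop_cols = movies.isnull().sum()/len(movies) <= 0.08
new_movies = movies.loc[:, drop_cols]

# Same filter for ratings, normalized by the length of ratings itself
drop_cols2 = ratings.isnull().sum()/len(ratings) <= 0.08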
new_ratings = ratings.loc[:, drop_cols2]
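
The null-fraction filter above can also be wrapped in a small helper so that each DataFrame is always normalized by its own length. This is only a sketch on toy data: the function name `drop_sparse_columns` and the `toy` frame are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" is 50% null, the others have no nulls.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": ["w", "x", "y", "z"],
})

def drop_sparse_columns(frame, max_null_frac=0.08):
    """Return a copy of `frame` without columns whose null fraction exceeds the threshold."""
    # isnull().mean() gives the fraction of nulls per column,
    # so the frame is always divided by its own row count.
    null_frac = frame.isnull().mean()
    return frame.loc[:, null_frac <= max_null_frac].copy()

kept = drop_sparse_columns(toy)
print(kept.columns.tolist())
```

Using `isnull().mean()` instead of dividing by an explicit `len(...)` avoids accidentally dividing by the length of a different DataFrame.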

Join the two DataFrames into a master DataFrame named ```df```.

In [None]:
df = new_movies.set_index('imdb_title_id').join(new_ratings.set_index('imdb_title_id'))
df

Now we need to make sure the type of each column is correct (especially the numeric columns, for when we use pandas' ```groupby``` method later).

In [None]:
for column in df.columns:
    try:
        df[column] = pd.to_numeric(df[column])
    except (ValueError, TypeError):  # column holds non-numeric values
        df[column] = df[column].astype("string")
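
When a column that looks numeric ends up as a string, it can help to see which values blocked the conversion. Below is a small sketch using `pd.to_numeric` with `errors="coerce"` on a toy `year` column. The toy values are assumptions for illustration, although `"TV Movie 2019"` is the stray value this notebook drops from the year charts later on.

```python
import pandas as pd

# Toy column (hypothetical values) mixing numeric strings with one stray label.
years = pd.Series(["1994", "2008", "TV Movie 2019", "2015"], name="year")

# errors="coerce" turns unparseable entries into NaN instead of raising.
as_number = pd.to_numeric(years, errors="coerce")

# Entries that were present originally but could not be parsed as numbers.
bad_values = years[as_number.isna() & years.notna()]
print(bad_values.tolist())
```

Inspecting `bad_values` before deciding how to treat a column makes it clear whether a single malformed entry is forcing the whole column to a string type.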

In [None]:
unique_cols = df.columns
unique_cols

# Mean Vote or Weighted Average Vote

In [None]:
fig = px.scatter(df, x="weighted_average_vote", y="mean_vote", trendline="ols")
fig.update_layout(title={'text': f"Weighted Average Vote vs Mean Vote, Corr: {round(pearsonr(df['weighted_average_vote'], df['mean_vote'])[0], 3)} ", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}})
fig.show()

It is quite evident that there are quite a lot of movies that are rated highly in terms of mean vote, but very low in terms of weighted average vote. Per <a href="https://help.imdb.com/article/imdb/track-movies-tv/weighted-average-ratings/GWT2DSBYVT2F25SK#">IMDb's website</a>, this is because not every vote carries the same weight in their algorithm. For the remainder of this analysis, we use the weighted average vote since it's more robust.
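
To see which titles drive that pattern, one option is to rank movies by the gap between the two scores. This is a minimal sketch on toy data: the `vote_gap` column and the `tt_*` ids are hypothetical, and only the two rating column names come from the dataset.

```python
import pandas as pd

# Toy data (made-up ratings) indexed by hypothetical title ids.
toy = pd.DataFrame(
    {
        "weighted_average_vote": [7.9, 2.1, 6.4, 1.8],
        "mean_vote": [8.0, 8.7, 6.6, 9.1],
    },
    index=["tt_a", "tt_b", "tt_c", "tt_d"],
)

# Positive gap: the plain mean is higher than the weighted average.
toy["vote_gap"] = toy["mean_vote"] - toy["weighted_average_vote"]

# The titles where the two scores disagree the most.
largest_gaps = toy.sort_values("vote_gap", ascending=False).head(2)
print(largest_gaps)
```

On the real data, the same idea would be applied to `df` instead of `toy`.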

# Years With Best Reception

In [None]:
year_df = df[["year", "weighted_average_vote"]].groupby('year').describe().droplevel(0, axis=1).reset_index()
year_df = year_df.drop(len(year_df)-1)
fig = go.Figure()
fig.add_trace(go.Scatter(x=year_df["year"], y=year_df["mean"], error_y=dict(type='data', array=2*year_df['std'])))
fig.update_layout(title={'text': "Mean Weighted Average Vote per Year (Error Bars: ±2 Standard Deviations)", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}})
fig.show()
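
The error bars in the chart above are drawn at two standard deviations, which describes the spread of individual ratings within a year rather than the uncertainty of the yearly mean. A confidence interval for the mean would instead be based on the standard error. The sketch below shows that calculation on toy data, assuming a normal approximation (the 1.96 multiplier). For years with only a handful of movies, a t-distribution would be more appropriate.

```python
import numpy as np
import pandas as pd

# Toy data (made-up ratings) for two years.
toy = pd.DataFrame({
    "year": [2000] * 4 + [2001] * 4,
    "weighted_average_vote": [5.0, 6.0, 7.0, 6.0, 4.0, 8.0, 6.0, 6.0],
})

stats = toy.groupby("year")["weighted_average_vote"].agg(["mean", "std", "count"])

# Standard error of the mean: sample std divided by sqrt(n).
stats["sem"] = stats["std"] / np.sqrt(stats["count"])

# Half-width of an approximate 95% confidence interval for the yearly mean.
stats["ci95_half_width"] = 1.96 * stats["sem"]
print(stats)
```

Passing `stats["ci95_half_width"]` as the `array` of `error_y` would give error bars that shrink as the number of movies per year grows.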

# Years with Most Movies + Most Votes Cast

In [None]:
fig = px.line(df.groupby('year').size().drop("TV Movie 2019"))
fig.update_layout(title={'text': f"Total Number of Movies Per Year", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}}, showlegend=False)
fig.show()

In [None]:
# Sum only total_votes; summing every column also tries to add up the string columns
fig = px.line(df.groupby('year')[["total_votes"]].sum().drop("TV Movie 2019"), y="total_votes")
fig.update_layout(title={'text': "Total Number of Votes Cast For Movies From Each Year", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}})
fig.show()

There is a sudden surge of movies starting from the early 2000s, and the total number of votes cast reflects that. The sudden dip at the end of the graph (around 2015 to 2020) is probably due to the fact that IMDb users haven't had enough time to rate movies that were released more recently.

# Duration vs Rating

Are longer movies perhaps more highly rated?

In [None]:
fig = px.scatter(df, x="duration", y="weighted_average_vote")
fig.update_layout(title={'text': f"Duration vs Average Rating", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}})
fig.show()

There doesn't seem to be any obvious correlation in the scatter plot.
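
A visual impression like this can be backed up with a correlation coefficient. The sketch below computes Pearson and Spearman correlations with `scipy.stats` on synthetic data, dropping missing values first since both functions expect complete pairs. The random `toy` frame is purely illustrative and says nothing about the actual dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Synthetic data: durations and ratings drawn independently of each other.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "duration": rng.integers(60, 180, size=200).astype(float),
    "weighted_average_vote": rng.uniform(1, 10, size=200),
})
toy.loc[0, "duration"] = np.nan  # simulate a missing duration

# Keep only rows where both values are present.
clean = toy[["duration", "weighted_average_vote"]].dropna()

r, r_p = pearsonr(clean["duration"], clean["weighted_average_vote"])
rho, rho_p = spearmanr(clean["duration"], clean["weighted_average_vote"])
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Spearman's coefficient is included because it also picks up monotonic relationships that are not linear.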

# Votes By Language

In [None]:
include = df.groupby('language').size().sort_values(ascending=False) >= 500
# numeric_only=True skips the string columns, which cannot be averaged
lang_df = df.groupby('language').mean(numeric_only=True).loc[include].sort_values("weighted_average_vote", ascending=False)
fig = px.bar(lang_df, y="weighted_average_vote")
fig.update_layout(title={'text': "Average Weighted Rating for Each Language (At Least 500 Entries)", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}})
fig.show()

Interestingly, among the languages with at least 500 entries, movies in English are the worst rated, behind "more likeable" languages like Japanese and Tamil. This may be due to the sheer number of movies in English, but it is still an interesting feature of the data worth noting.
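
Since group size may play a role here, it can be useful to look at the number of movies next to each language's average. Here is a small sketch on toy data, where the ratings and the `min_entries` threshold are made up for illustration:

```python
import pandas as pd

# Toy data (made-up ratings): English has the most entries.
toy = pd.DataFrame({
    "language": ["English"] * 3 + ["Japanese"] * 2 + ["Tamil"],
    "weighted_average_vote": [5.0, 6.0, 7.0, 7.0, 8.0, 9.0],
})

# Mean rating and number of movies per language in one pass.
summary = toy.groupby("language")["weighted_average_vote"].agg(["mean", "count"])

# Keep only languages with enough entries, best rated first.
min_entries = 2  # hypothetical threshold for the toy data
filtered = summary[summary["count"] >= min_entries].sort_values("mean", ascending=False)
print(filtered)
```

Putting `count` in the hover data of the bar chart would make it easy to judge how much weight each bar deserves.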

In [None]:
fig = go.Figure()
fig.add_trace(go.Bar(x=lang_df.index, y=lang_df["us_voters_rating"], name="US Voters Rating"))
fig.add_trace(go.Bar(x=lang_df.index, y=lang_df["non_us_voters_rating"], name="Non-US Voters Rating"))
fig.update_layout(title={'text': f"Comparison of US vs Non-US Voters Rating", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}})
fig.show()

Some of these languages are more favored by US voters compared to non-US voters, but for the most part, they are roughly similar. The only quite large difference is the average rating of English, as US voters seem to rate English much more highly than non-US voters.

# Most Highly Rated Actors

In [None]:
tmp_act_df = df.copy() 
tmp_act_df['actors'] = df['actors'].fillna("None")
tmp_act_df['actors'] = tmp_act_df['actors'].str.split(', ')

In [None]:
# Series.iteritems() was removed in pandas 2.0; items() is the equivalent
flat = [[x, df.loc[i, "weighted_average_vote"]] for i, y in tmp_act_df['actors'].items() for x in y]
rating_df = pd.DataFrame(flat, columns=["Actor", "Rating"])
rating_df
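
The same long-format table of (actor, rating) pairs can also be built with pandas' `explode`, which avoids the Python-level loop. This is only a sketch on toy data: the actor names and ids are hypothetical, and `rating_sketch` is an illustrative name rather than a replacement for `rating_df` above.

```python
import pandas as pd

# Toy data: comma-separated actor strings, with one missing value.
toy = pd.DataFrame(
    {
        "actors": ["A One, B Two", None, "B Two"],
        "weighted_average_vote": [7.0, 5.0, 9.0],
    },
    index=["tt1", "tt2", "tt3"],
)

rating_sketch = (
    toy.assign(Actor=toy["actors"].fillna("None").str.split(", "))
       .explode("Actor")  # one row per (movie, actor) pair
       .rename(columns={"weighted_average_vote": "Rating"})
       .loc[:, ["Actor", "Rating"]]
       .reset_index(drop=True)
)
print(rating_sketch)
```

Each movie's rating is repeated once per listed actor, which is the shape the later `groupby('Actor')` step expects.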

In [None]:
actor_df = pd.DataFrame(rating_df.groupby('Actor').size(), columns=["Movies"])
# Select the Rating column explicitly so a Series (not a DataFrame) is assigned
actor_df['Rating'] = rating_df.groupby('Actor')['Rating'].mean()
actor_df

In [None]:
fig = px.bar(actor_df.sort_values("Movies", ascending=False).iloc[:LOOK_AT], y="Movies", hover_data=["Rating"])
fig.update_layout(title={'text': f"Top {LOOK_AT} Actors with the Most Movies", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}})
fig.show()

Do you know any of these actors?

In [None]:
num_movies = 10
actor_df_split = actor_df.loc[actor_df["Movies"] >= num_movies].sort_values("Rating", ascending=False)
fig = px.bar(actor_df_split.iloc[:LOOK_AT], y="Rating", hover_data=["Movies"])
fig.update_layout(title={'text': f"Top {LOOK_AT} Highest Rated Actors With At Least {num_movies} Movies", 'x': 0.5,
                             'xanchor': 'center', 'font': {'size': 20}})
fig.show()

What about these actors?

# Correlation Between Votes

In [None]:
total_votes = df.loc[:, df.columns.str.contains("votes_")]
corr = total_votes.corr()
plt.figure(figsize=(12, 9))
sns.heatmap(corr, annot=True)
plt.title("Pearson Correlation Matrix for Votes", fontsize=20)
plt.show()

And that's it! If you liked this notebook, please <span style="color: green"> upvote </span> it! Thanks for reading :)