 # IMDB Dataset Case Study Analysis - Exploratory Analysis and ML Model 

<b>Introduction: </b>
A commercially successful movie not only entertains its audience but also earns film companies substantial profit. Many factors, such as good directors and experienced actors, contribute to making a good movie. However, famous directors and actors may reliably bring in the expected box-office income, yet they cannot guarantee a highly rated IMDB score.

<b>Data Description:</b>
The dataset (movie-review-data.csv) contains 28 variables for 5043 movies, spanning across 100 years in 66 countries. There are 2399 unique director names, and thousands of actors/actresses. “imdb_score” is the response variable while the other 27 variables are possible predictors.

<b>Problem Statement:</b>
Build a model to predict what kind of movies are more successful.
Take the IMDB score as the response variable and predict it by analyzing the remaining variables in the movie data.

#### Table of contents

1. [Exploratory Data Analysis](#id1)

    1.1 [Likes of Movie and Director on Facebook gives a good leverage](#id1.1) 
    
    1.2 [Cast and Actor Popularity](#id1.2) 
    
    1.3 [Does Title year and duration of Movie impact scores?](#id1.3) 
    
    1.4 [High Budget and Gross influences Scores](#id1.4) 
    
    1.5 [Other features impacting Scores](#id1.5) 


2. [Data cleaning and preprocessing](#id2)

    2.1 [Handling NAs](#id2.1) 
    
    2.2 [Feature Engineering](#id2.2) 


3. [Baseline Model](#id3)


4. [Model building and Metrics](#id4)

    4.1 [Decision Tree](#id4.1) 
    
    4.2 [Random Forest Model](#id4.2) 
    
    4.3 [Gradient Boosting Model](#id4.3) 
    
    4.4 [Cat Boost Model](#id4.4) 




5. [Conclusion](#id5)

    
    
    
    

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from astropy.visualization import hist
import matplotlib.pyplot as plt
import math
import warnings
warnings.filterwarnings('ignore')

In [2]:
import sys
!{sys.executable} -m pip install -U scipy

Collecting scipy
  Downloading scipy-1.6.1-cp37-cp37m-win_amd64.whl (32.6 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.5.4
    Uninstalling scipy-1.5.4:
      Successfully uninstalled scipy-1.5.4
Successfully installed scipy-1.6.1


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
phik 0.10.0 requires numba>=0.38.1, which is not installed.
imagehash 4.2.0 requires PyWavelets, which is not installed.
spherecluster 0.1.7 requires nose, which is not installed.
shap 0.28.3 requires scikit-image, which is not installed.
sentence-transformers 0.3.7.2 requires nltk, which is not installed.
pyldavis 2.1.2 requires numexpr, which is not installed.
lime 0.1.1.32 requires scikit-image>=0.12, which is not installed.
ktrain 0.21.2 requires bokeh, which is not installed.
keras 2.4.3 requires h5py, which is not installed.
flair 0.6.1 requires lxml, which is not installed.
exploripy 1.0.3 requires statsmodels, which is not installed.
allennlp 0.8.3 requires flask>=1.0.2, which is not installed.
allennlp 0.8.3 requires gevent>=1.3.6, which is not installed.
allennlp 0.8.3 requires h5py, which is not install

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,plot_confusion_matrix

ImportError: cannot import name 'issparse' from 'scipy.sparse' (unknown location)

Read the dataset

In [None]:
df=pd.read_csv('movie_review_data.csv')
df.shape

In [None]:
df['movie_imdb_link'].value_counts()

In [None]:
dup=df['movie_imdb_link'].value_counts()

In [None]:
duplicate_titles=list(dup[dup>1].index)

In [None]:
df[df['movie_imdb_link'].isin(duplicate_titles)].shape

In [None]:
df.loc[df['movie_imdb_link'].isin(duplicate_titles),['movie_title','movie_imdb_link']].head(5).values

In [None]:
df.loc[df['movie_imdb_link']=='http://www.imdb.com/title/tt0413300/?ref_=fn_tt_tt_1',]
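
The cells above only inspect the repeated links. A minimal sketch of how such repeats could be dropped, shown on a hypothetical toy frame. The toy rows and the policy of keeping the first occurrence are assumptions for illustration, not something this notebook applies.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie table
toy = pd.DataFrame({
    "movie_title": ["A", "A", "B", "C"],
    "movie_imdb_link": ["link_a", "link_a", "link_b", "link_c"],
    "imdb_score": [7.1, 7.1, 6.0, 5.5],
})

# Flag every row whose link occurs more than once
is_repeat = toy.duplicated(subset="movie_imdb_link", keep=False)
print(toy[is_repeat])

# Keep the first row for each link (an assumed policy)
deduped = toy.drop_duplicates(subset="movie_imdb_link", keep="first").reset_index(drop=True)
print(deduped.shape)
```

Deduplicating on the link rather than the title avoids merging different movies that happen to share a name.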

In [None]:
from time import time

In [None]:
x = np.random.random((1000, 1000))
y = np.random.random((1000, 1000))

In [None]:
t_start=time()
z=np.matmul(x,y)
t_end=time()

print( "time diff {}".format(t_end-t_start))

In [None]:
x.shape[1]

In [None]:
np.zeros([2,3])

In [None]:
c=np.zeros([a.shape[0],b.shape[1]])  # np.zeros takes the shape as one sequence: (rows of a, columns of b)

In [None]:
def matrix_mul(a,b):
    # Naive triple-loop product; the result has shape (rows of a, columns of b)
    a,b=np.asarray(a),np.asarray(b)
    if a.shape[1]!=b.shape[0]:
        print("dim dont match")
        return None
    c=np.zeros([a.shape[0],b.shape[1]])
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            val=0
            for k in range(a.shape[1]):
                val=val+a[i][k]*b[k][j]
            c[i,j]=val
    return c
                
                    

In [None]:
a=[[1,2],
  [2,3]]

In [None]:
b=[[2,3],
  [2,1]]

In [None]:
a=np.ones([2,2])
a

In [None]:
b=a*3

In [None]:
matrix_mul(a,b)

In [None]:
x = np.random.random((100, 100))
y = np.random.random((100, 100))

In [None]:
t_start=time()
w=matrix_mul(x,y)
t_end=time()

print( "time diff {}".format(t_end-t_start))
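
A quick way to check a hand-written product such as the one timed above is to compare it with NumPy's own result. The sketch below is self-contained; `naive_matmul` is a hypothetical helper written for this illustration, and the array sizes are arbitrary.

```python
import numpy as np

def naive_matmul(a, b):
    """Triple-loop product of two 2-D arrays (hypothetical helper for illustration)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if a.shape[1] != b.shape[0]:
        raise ValueError("inner dimensions do not match")
    out = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                out[i, j] += a[i, k] * b[k, j]
    return out

rng = np.random.default_rng(0)
p = rng.random((4, 3))
q = rng.random((3, 5))

# Compare against NumPy's built-in matrix product
print(np.allclose(naive_matmul(p, q), p @ q))
```

Using non-square inputs makes index mix-ups visible, since a wrong loop bound then raises an error or yields the wrong shape.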

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
print(df['imdb_score'].describe())
sns.set_theme()
sns.distplot(df['imdb_score'],bins=50,hist_kws={'alpha': 0.4})

In [None]:
ar=hist(df['imdb_score'],bins='freedman',)
plt.clf()  # discard the drawn histogram; only the bin edges in ar[1] are needed

In [None]:
df['imdb_score_bin']=pd.cut(df['imdb_score'],bins=len(ar[1])-1)

In [None]:
df['imdb_score_bin']=df['imdb_score_bin'].astype(str)

In [None]:
df.isna().sum()

In [None]:
numerical_cols=['num_critic_for_reviews','duration','director_facebook_likes', 'actor_3_facebook_likes', 
       'actor_1_facebook_likes', 'gross','num_voted_users', 'cast_total_facebook_likes','budget', 
    'actor_2_facebook_likes','movie_facebook_likes']                  

<a id="id1"></a>

### Exploratory Data Analysis

In [None]:
labels=list(df['content_rating'].value_counts().index)  # categories ordered by frequency
df_samp=df[['content_rating','imdb_score']].groupby(['content_rating']).agg("mean")
df_samp.reindex(labels)

#### Handling NAs for a few Numerical features 

In [None]:
[col for col in df.columns if col not in cat_columns_all]  # columns not treated as categorical

In [None]:
num_columns=list(set(df.columns)-set(cat_columns_all))
num_columns.remove('imdb_score_bin')
num_columns.remove('imdb_score')
num_columns

In [None]:
num_columns=['director_facebook_likes',
 'movie_facebook_likes',
 'title_year',
 'cast_total_facebook_likes',
 'num_voted_users',
 'num_critic_for_reviews',
 'actor_3_facebook_likes',
 'actor_1_facebook_likes',
 'num_user_for_reviews',
 'actor_2_facebook_likes',
 'duration']

In [None]:
df.isna().sum()

In [None]:
num_mean=['director_facebook_likes','actor_3_facebook_likes','actor_1_facebook_likes',
          'actor_2_facebook_likes','duration']

for col in num_mean:
    df.loc[df[col].isna(),col]=df[col].mean()

In [None]:
num_median=['title_year']  # 'duration' is already filled with its mean in the cell above
for col in num_median:
    df.loc[df[col].isna(),col]=df[col].median()

In [None]:
num_zero=['num_critic_for_reviews','num_user_for_reviews']
for col in num_zero:
    df.loc[df[col].isna(),col]=0
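
The three fill strategies above (mean, median, zero) can also be expressed in a single `fillna` call with a dictionary. A minimal sketch on a hypothetical toy frame, with column names borrowed from the dataset and made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "actor_1_facebook_likes": [100.0, np.nan, 300.0],
    "title_year": [1999.0, 2005.0, np.nan],
    "num_critic_for_reviews": [10.0, np.nan, 30.0],
})

# One fill value per column: mean, median, or a constant zero
fill_values = {
    "actor_1_facebook_likes": toy["actor_1_facebook_likes"].mean(),
    "title_year": toy["title_year"].median(),
    "num_critic_for_reviews": 0,
}
filled = toy.fillna(fill_values)
print(filled)
```

Keeping the strategy per column in one dictionary makes it harder to fill the same column twice by accident.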

#### Helper functions for EDA

In [None]:
"""
Function to plot the distribution of the independent variables
"""
def plot_dist(dataf=df,x_feat='content_rating',labels=None,fig=None,ax1=None):
    if fig is None:
        ax1 = sns.set_style(style=None, rc=None )
        fig, ax1 = plt.subplots(figsize=(12,6))
    if labels is None:
        labels=list(dataf[x_feat].value_counts().index)
    df_samp=dataf[[x_feat,'imdb_score']].groupby([x_feat]).agg("mean").reindex(labels)  # same order as the bars
    plot=sns.countplot(data = dataf, x=x_feat, alpha=0.5, order=labels,ax=ax1,palette="summer")
    plot.set_xticklabels(labels, fontsize=9, rotation=30, ha= 'right')
    ax1.tick_params(axis='y')
    plt.ylabel("# Movies")
    plt.xlabel(x_feat)
    ax2 = ax1.twinx()
    sns.lineplot(data = df_samp['imdb_score'].values, marker='o', sort=False, ax=ax2)
    total = len(dataf)
    for p in plot.patches:
        percentage = f'{100 * p.get_height() / total:.1f}%\n'
        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        plot.annotate(percentage, (x, y), ha='center', va='center',fontsize=7.5)
    ax2.axhline(df_samp['imdb_score'].mean())  # reference line on the IMDB-score axis
    plt.ylabel("IMDB rating")
    plt.title("Distribution of {}".format(x_feat)) 
    sns.despine()

In [None]:
#Function to bin numerical variables before the distribution plot
def bin_and_plot(num_columns):
    bin_columns=[]
    for col in num_columns:
        colname=col+"_bin"
        ar=hist(df[col],bins='freedman',)
        if len(ar[1])<50:
            num_bins= len(ar[1])-1
        else:
            num_bins=50
        df[colname]=pd.cut(df[col],bins=num_bins)
        bin_columns.append(colname)
    plt.clf()

    for colname in bin_columns:  
        labels=list(map(str,df[colname].unique().sort_values()))
        df[colname]=df[colname].astype(str)
        plot_dist(df,x_feat=colname,labels=labels,fig=None,ax1=None)

<a id="id1.1"></a>

#### 1. Popularity of Movie and Director

1. The popularity of the director and of the movie has a positive effect on IMDB ratings. Barring the first bin, all bins show higher-than-average scores (>6.5). This is especially evident for movie popularity.

2. The director's name is not available for about 100 movies. Judging by their ratings, most of them are TV series, which would have multiple directors. Creator would be a more equivalent role in this case.

In [None]:
dir_columns=['director_facebook_likes', 'movie_facebook_likes']
bin_and_plot(dir_columns)

<a id="id1.2"></a>

#### 2. Popularity of Casts
The total likes of the cast did not show a clear trend. The likes of actor_2 had a slight upward trend, barring one group. We could say that having a strong/popular second lead might slightly increase the chance of a good movie.

In [None]:
cast_fb_columns=[ 'cast_total_facebook_likes',
 'actor_1_facebook_likes',
 'actor_2_facebook_likes', 'actor_3_facebook_likes']
bin_and_plot(cast_fb_columns)

<a id="id1.3"></a>

#### 3. Title year and Duration

1. The older movies in the dataset have comparatively better scores, although there are fewer of them.
2. Most movies with scores below 5 have a duration of around 1.5 hours. Scores seem to climb for movies up to 2 hours 40 minutes long. Beyond that they do not follow the trend.

In [None]:
sns.scatterplot(data=df[~df['budget'].isna()],y='imdb_score',x='title_year')
plt.title('Title year vs Imdb score')

In [None]:
sns.scatterplot(data=df[~df['budget'].isna()],y='imdb_score',x='duration',hue='language')
plt.title('Movie duration vs Imdb score')

In [None]:
td_columns=['title_year','duration']
bin_and_plot(td_columns)

<a id="id1.4"></a>

#### 4. Budget and Profit

1. The budget field seems to have outliers. Interestingly, none of the top 10 highest-budget movies is from Hollywood. This could mean the currencies were not normalized and the budgets are stated in local currency, which may affect our models unless budgets are normalized per country. For the purpose of this exercise, we could skip the non-US movies for this reason.

2. Movies with high budget and high gross both seem to have higher scores, although there are instances of good scores for movies with a smaller budget or gross.

3. We could derive the profit a movie made by subtracting the budget from the gross. Most movies with high profits also have high scores. Sometimes movies that did not make much profit still had good ratings.

4. Since movies made from the 1970s onward are included, we should account for inflation.

In [None]:
sns.boxplot(df['budget'])

In [None]:
df[['budget','country','movie_title']].sort_values(by='budget',ascending=False).head(10)

In [None]:
df[['gross','country','movie_title']].sort_values(by='gross',ascending=False).head(10)

In [None]:
df_hwd=df[df['country']=='USA'].copy()  # copy so the profit column added below does not write into a view of df
df_hwd.shape

In [None]:
sns.scatterplot(data=df_hwd[~df_hwd['budget'].isna()],x='imdb_score',y='budget')
plt.title('Movie budget vs Imdb score')

In [None]:
sns.scatterplot(data=df_hwd[~df_hwd['gross'].isna()],x='imdb_score',y='gross')
plt.title('Movie Gross vs Imdb score')

In [None]:
df_hwd['profit']=df_hwd['gross']-df_hwd['budget']
sns.scatterplot(data=df_hwd[~df_hwd['profit'].isna()],x='imdb_score',y='profit')
plt.title('Movie profit vs Imdb score')

<a id="id1.5"></a>

#### 5. Analysis of other Movie features with response variable

Null values are first assigned 'NA'.

This can show whether missing values are correlated with the IMDB scores, which would otherwise be missed, since the plots would typically ignore blank values.

1. Black-and-white movies seem to be rated higher on average, although they make up only around 4.1% of the volume.


2. Almost 93% of movies are in English. Some languages, such as Russian, Bosnian and Chinese, have very low average scores.

3. The TV-MA (mature TV audiences) content rating scores well above the rest, although fewer than 1% of movies carry it.

4. 6.5% of movies have an aspect ratio of NA, and these have comparatively lower ratings. Some of them seem to be games based on movies or lesser-known series.

In [None]:
cat_columns_all=['director_name','actor_2_name','actor_1_name',
                 'actor_3_name','plot_keywords','country',
                 'movie_imdb_link','movie_title','genres','color',
                 'facenumber_in_poster','language','content_rating',
                 'aspect_ratio']

for col in cat_columns_all:
    df.loc[df[col].isna(),col]='NA'

In [None]:
#Considering only a few of the categorical columns for distribution analysis. Others do not make much sense for this plot
cat_columns=['color','language','content_rating','aspect_ratio']
df['aspect_ratio']=df['aspect_ratio'].astype(str)

fig, ax = plt.subplots(round(len(cat_columns) / 2), 2, figsize = (20, 20))
for i, ax in enumerate(fig.axes):
    plot_dist(df,x_feat=cat_columns[i],fig=fig,ax1=ax)

<a id="id2"></a>

### 2. Data Cleaning and pre-processing

#### 2.1 Handling NAs

1. Assign 'Color' to movies whose color is NA, since all of them were made after 1990.
2. Assign gross as budget, and vice versa, for rows that have only one of them. Where both are blank, take the median.
3. Adjust gross and budget for inflation. The rate is taken as 2.0%, based on estimates from https://smartasset.com/investing/inflation-calculator
4. Handling of NAs for the other numeric fields is covered in the EDA section above.
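
Item 2 can be sketched on a hypothetical toy frame. The values are made up; the logic mirrors the described rule of borrowing the other column first and falling back to the median only when both are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering all four missing-value patterns
toy = pd.DataFrame({
    "budget": [10.0, np.nan, 30.0, np.nan],
    "gross": [np.nan, 25.0, 40.0, np.nan],
})

# Borrow the other column where only one of the two is present
toy["budget"] = toy["budget"].fillna(toy["gross"])
toy["gross"] = toy["gross"].fillna(toy["budget"])

# Rows where both were missing fall back to each column's median
toy = toy.fillna(toy.median())
print(toy)
```

Note that borrowing gross as budget implies a profit of zero for those rows, which is worth keeping in mind when a profit feature is derived later.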



In [None]:
df_copy=df.copy()

In [None]:
df.loc[(df['color']=='NA'),'color']='Color'

In [None]:
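df.loc[(df['aspect_ratio']=='NA'),'aspect_ratio']=0
# aspect_ratio was cast to str for plotting earlier, so parse it back to numbers before the cast
df['aspect_ratio']=pd.to_numeric(df['aspect_ratio']).astype('Float32')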

In [None]:
df['budget']=np.where((df['budget'].isna() & (~df['gross'].isna())),df['gross'],df['budget'] )
df['gross']=np.where((df['gross'].isna() & (~df['budget'].isna())),df['budget'],df['gross'] )

In [None]:
num_median=['budget','gross']
for col in num_median:
    df.loc[df[col].isna(),col]=df[col].median()

In [None]:
df['adj_year']=2016-df['title_year']

<a id="id2.2"></a>

#### 2.2 Feature Engineering

1. Profit of the movie from gross and budget
2. Past IMDB scores of directors and actors. The score of the current movie is not included, since that would introduce target leakage.

In [None]:
def inflation_corrected_amount(principal, time, rate=2.0):
    # Compound growth: principal * (1 + rate/100) ** time, with rate in percent per year
    return principal * pow(1 + rate / 100, time)

In [None]:
df['gross_adj']=df.apply(lambda x: inflation_corrected_amount(x['gross'],x['adj_year']),axis=1)
df['budget_adj']=df.apply(lambda x: inflation_corrected_amount(x['budget'],x['adj_year']),axis=1)
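
The adjustment is plain compound growth, amount × (1 + rate/100)^years. A minimal vectorized sketch on a hypothetical toy frame, assuming the same flat 2.0% rate and 2016 reference year used in this notebook:

```python
import pandas as pd

# Hypothetical toy data: the same gross for a 2016 and a 2006 release
toy = pd.DataFrame({"gross": [100.0, 100.0], "title_year": [2016, 2006]})

rate = 2.0  # assumed flat annual inflation rate, in percent
years = 2016 - toy["title_year"]  # years between release and the reference year

# Column-wise compound growth, without a row-by-row apply
toy["gross_adj"] = toy["gross"] * (1 + rate / 100) ** years
print(toy)
```

Operating on whole columns gives the same numbers as the row-wise `apply` and is usually faster on a few thousand rows.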

In [None]:
df['profit_adj']=df['gross_adj']-df['budget_adj']

In [None]:
def get_past_score(name,year,field='director_name'):
    val=df.loc[(df[field]==name) & (df['title_year']<year),'imdb_score'].mean()
    if math.isnan(val):
        return 0
    else:
        return val
    

In [None]:
df['director_past_imdb_score']= df.apply(lambda x: get_past_score(x['director_name'],x['title_year']),axis=1)

In [None]:
df['actor2_past_imdb_score']= df.apply(lambda x: get_past_score(x['actor_2_name'],
                                                                x['title_year'],'actor_2_name'),axis=1)

In [None]:
df['actor1_past_imdb_score']= df.apply(lambda x: get_past_score(x['actor_1_name'],
                                                                x['title_year'],'actor_1_name'),axis=1)

In [None]:
df['actor1_as_actor2_past_imdb_score']= df.apply(lambda x: get_past_score(x['actor_1_name'],
                                                                x['title_year'],'actor_2_name'),axis=1)

In [None]:
df['actor2_as_actor1_past_imdb_score']= df.apply(lambda x: get_past_score(x['actor_2_name'],
                                                                x['title_year'],'actor_1_name'),axis=1)
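
To see why these features avoid target leakage, the same "strictly earlier titles only" logic can be run on a hypothetical toy frame. `past_mean_score` is an illustrative helper written for this sketch, and the names and scores are made up.

```python
import pandas as pd

# Hypothetical toy data: director X has three titles, director Y has one
toy = pd.DataFrame({
    "director_name": ["X", "X", "X", "Y"],
    "title_year": [2000, 2005, 2010, 2005],
    "imdb_score": [6.0, 8.0, 7.0, 5.0],
})

def past_mean_score(frame, name, year, field="director_name"):
    """Mean score of strictly earlier titles for `name`; 0 when there are none."""
    earlier = frame.loc[(frame[field] == name) & (frame["title_year"] < year), "imdb_score"]
    return earlier.mean() if len(earlier) else 0.0

toy["director_past_imdb_score"] = [
    past_mean_score(toy, name, year)
    for name, year in zip(toy["director_name"], toy["title_year"])
]
print(toy)
```

Each row only sees scores from earlier years, so a movie's own score never feeds its own feature. A first-time director gets 0, which the model cannot distinguish from a genuinely poor track record; a separate "has history" flag would be one way to address that.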

In [None]:
def corr_plot(df):
    # Compute a correlation matrix and convert to long-form
    corr_mat = df.select_dtypes(include='number').corr().stack().reset_index(name="correlation")

    # Draw each cell as a scatter point with varying size and color
    g = sns.relplot(
        data=corr_mat,
        x="level_0", y="level_1", hue="correlation", size="correlation",
        palette="vlag", hue_norm=(-1, 1), edgecolor=".7",
        height=10, sizes=(50, 350), size_norm=(-.2, .6),
    )

    # Tweak the figure to finalize
    g.set(xlabel="", ylabel="", aspect="equal")
    g.despine(left=True, bottom=True)
    g.ax.margins(.02)
    for label in g.ax.get_xticklabels():
        label.set_rotation(90)
    for artist in g.legend.legendHandles:
        artist.set_edgecolor(".7")

corr_plot(df)

<a id="id3"></a>

### 3. Baseline model

For the baseline model, movies that score above average are treated as good movies. Since our average score is 6.44, we consider scores of 7 and above as good movies and the rest as not so good.
Based on this, the baseline prediction assigns every movie to the majority class.
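
A majority-class baseline of this kind can also be expressed with scikit-learn's `DummyClassifier`. A minimal sketch on made-up scores; the values are assumptions chosen only to show the mechanics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up scores standing in for imdb_score
scores = np.array([5.1, 6.2, 6.8, 7.0, 7.9, 4.3, 6.5, 8.1, 5.9, 6.0])
y = (scores >= 7.0).astype(int)  # 1 = good movie
X = np.zeros((len(y), 1))        # features are ignored by this strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, baseline.predict(X))
print(acc)
```

The accuracy of such a baseline equals the share of the majority class, which is the figure any real model has to beat.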

In [None]:
df['good_score']=0
df.loc[df['imdb_score']>=7.0,'good_score']=1

In [None]:
df['good_score'].value_counts()

In [None]:
df['baseline_predicted']=0

In [None]:
accuracy_score(df['good_score'],df['baseline_predicted'])

In [None]:
confusion_matrix(df['good_score'],df['baseline_predicted'])

<a id="id4"></a>

### 4. Model building

In [None]:
from sklearn.ensemble import RandomForestClassifier

from sklearn import tree
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score,GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold,train_test_split
from numpy import mean,std

In [None]:
def get_model_metrics(m,x_test,y_true,y_pred):
    plot_confusion_matrix(m,x_test,y_true,cmap=plt.cm.Blues)
    plt.show()
    acc=accuracy_score(y_pred=y_pred,y_true=y_true)
    print('Accuracy score: {}'.format(acc))
   

In [None]:
base_df=df[['num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 
       'actor_1_facebook_likes', 
        'num_voted_users', 'cast_total_facebook_likes', 'num_user_for_reviews', 
        'actor_2_facebook_likes',
      'aspect_ratio', 'movie_facebook_likes', 
       'gross_adj', 'budget_adj', 'profit_adj', 'actor2_past_imdb_score','country',
       'actor1_past_imdb_score', 'actor1_as_actor2_past_imdb_score',
       'director_past_imdb_score', 'actor2_as_actor1_past_imdb_score', 'good_score']]

In [None]:
base_df.shape

In [None]:
X = base_df.drop(['good_score','country'], axis = 1)
y = base_df['good_score']

In [None]:
# Splitting train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 24)

<a id="id4.1"></a>

#### 4.1. Decision Tree

This is the simplest form of tree model. It learns rules that split the data so that the resulting leaf nodes are as pure as possible (by default, scikit-learn's decision tree minimises Gini impurity).

In [None]:
#train classifier
clf = tree.DecisionTreeClassifier() 
clf=clf.fit(X_train, y_train) 
clf_prediction = clf.predict(X_test) 
get_model_metrics(clf,X_test,y_test, clf_prediction)

In [None]:
tree.plot_tree(clf) 

<a id="id4.2"></a>

#### 4.2. Random Forest
Random forest is a bagging, tree-based model that usually gives good results because it aggregates a collection of trees. The numbers are much better than our baseline, and the model makes predictions for both classes.

In [None]:
rf=RandomForestClassifier(n_estimators=200, n_jobs=-1)
rf.fit(X_train, y_train)
rf_y_pred = rf.predict(X_test)
get_model_metrics(rf,X_test,y_test, rf_y_pred)
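
A single train/test split gives one accuracy figure. Repeated stratified cross-validation gives a mean and a spread instead. A minimal sketch on synthetic data from `make_classification`; the data, fold counts and forest size are assumptions for illustration and do not reproduce the notebook's results.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary problem standing in for good_score
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.65, 0.35], random_state=0
)

# 5 folds repeated twice gives 10 accuracy estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
model = RandomForestClassifier(n_estimators=50, random_state=0)
cv_scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)

print("Accuracy: {:.3f} ({:.3f})".format(mean(cv_scores), std(cv_scores)))
```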

<a id="id4.3"></a>

#### 4.3. Gradient Boost Method
The xgboost model might have been expected to perform better, but its results are still comparable to the random forest model.

In [None]:
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)
xgb_y_pred = xgb_model.predict(X_test)
get_model_metrics(xgb_model,X_test,y_test, xgb_y_pred)

<a id="id4.4"></a>

#### 4.4. Catboost model
The CatBoost model has gained popularity recently for its strong results and is known as one of the best boosting algorithms.
The results from catboost are also comparable to our xgboost and random forest models.



In [None]:
cat_model = CatBoostClassifier(verbose=0, n_estimators=90)
cat_model.fit(X_train, y_train)
cat_y_pred = cat_model.predict(X_test)
get_model_metrics(cat_model,X_test,y_test, cat_y_pred)

<a id="id5"></a>

### 5. Conclusion

The features that go into making a successful, highly rated IMDB movie were explored. A baseline model was created, and a subsequent set of machine learning models was built that outperformed our baseline by a significant margin.

To improve the scores, some additional features that were not included, such as title year and content rating, could be tried to check whether they help. The budget field would need to be normalized to a uniform currency. Model fine-tuning and grid search are also needed to improve the accuracy.
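
A grid search of the kind mentioned here could look like the sketch below. It runs on synthetic data, and the parameter grid is an arbitrary assumption, so the selected values say nothing about the movie dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary problem standing in for the good_score target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Small, assumed grid: every combination is scored with 3-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="accuracy", cv=3
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

In practice the search would be fitted on the training split only, keeping the test split for the final accuracy estimate.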

Nevertheless our current model is able to predict a good movie with 80%+ accuracy.