# IMDB Rating Prediction Model

## 1 - Packages

Let's first import all the packages that I'll be using during this project.

* numpy is the main package for scientific computing with Python.
* pandas is a library for data manipulation and analysis.
* LabelEncoder encodes categorical (object-dtype) columns as integers.
* linear_model contains the regression models.
* train_test_split splits the data into training and test sets.

In [142]:
import numpy as np, pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model

## 2 - Dataset

* Here I am using data on IMDB's top 250 movies, which I scraped from [here](https://imdb.com/chart/top?ref_=nv_mv_250)

In [121]:
data = pd.read_csv('../input/imdb.csv')
print('Shape of dataset: {}'.format(data.shape))

Shape of dataset: (250, 28)


## 3 - Data Pre-processing

The data we have isn't suitable for training yet, so we have to pre-process it first to get good results.

### First let's see what this data is made of.

In [122]:
data.head()

Unnamed: 0.1,Unnamed: 0,Name of the movie,Link,Year released,IMDB rating,Number of reviewers,Censor board rating,Length of the movie,Genre 1,Genre 2,Genre 3,Genre 4,Release date,story summary,Director Name,Writer 1,Writer 2,Writer 3,Star 1,Star 2,Star 3,Star 4,Star 5,Plot Keywords list,Budget,Gross USA,Cumulative worlwide Gross,Production company
0,0,The Shawshank Redemption,https://www.imdb.com/title/tt0111161/?pf_rd_m=...,1994,9.3,2040670,A,2h22min,Drama,,,,14 October 1994 (USA),Two imprisoned men bond over a number of years...,Frank Darabont,Stephen King,Frank Darabont,,Tim Robbins,Morgan Freeman,Bob Gunton,,,wrongful imprisonment | escape from prison | b...,25000000.0,28341469.0,58500000.0,Castle Rock Entertainment
1,1,The Godfather,https://www.imdb.com/title/tt0068646/?pf_rd_m=...,1972,9.2,1399614,A,2h55min,Crime,Drama,,,24 March 1972 (USA),The aging patriarch of an organized crime dyna...,Francis Ford Coppola,Mario Puzo,Francis Ford Coppola,,Marlon Brando,Al Pacino,James Caan,,,mafia | crime family | patriarch | organized c...,6000000.0,13496641.0,245066411.0,Paramount Pictures
2,2,The Godfather: Part II,https://www.imdb.com/title/tt0071562/?pf_rd_m=...,1974,9.0,970207,R,3h22min,Crime,Drama,,,20 December 1974 (USA),The early life and career of Vito Corleone in ...,Francis Ford Coppola,Francis Ford Coppola,Mario Puzo,,Al Pacino,Robert De Niro,Robert Duvall,,,revenge | corrupt politician | bloody body of ...,13000000.0,57300000.0,,Paramount Pictures
3,3,The Dark Knight,https://www.imdb.com/title/tt0468569/?pf_rd_m=...,2008,9.0,2008344,UA,2h32min,Action,Crime,Drama,,18 July 2008 (India),When the menace known as the Joker emerges fro...,Christopher Nolan,Jonathan Nolan,Christopher Nolan,,Christian Bale,Heath Ledger,Aaron Eckhart,,,dc comics | moral dilemma | psychopath | clown...,185000000.0,53485844.0,100455844.0,Warner Bros.
4,4,12 Angry Men,https://www.imdb.com/title/tt0050083/?pf_rd_m=...,1957,8.9,574484,Not Rated,1h36min,Drama,,,,10 April 1957 (USA),A jury holdout attempts to prevent a miscarria...,Sidney Lumet,Reginald Rose,Reginald Rose,,Henry Fonda,Lee J. Cobb,Martin Balsam,,,jury | dialogue driven | courtroom | trial | p...,350000.0,4360000.0,,Orion-Nova Productions


### I believe the length of a movie can be a deciding factor for its overall rating, so here I convert this column from strings like '2h22min' to minutes so that I can use it properly in my model.

In [123]:
for i in range(data.shape[0]):
    length = data['Length of the movie'][i]  # e.g. '2h22min'
    h = 0
    m = 0
    if 'h' in length:
        # The hours are the single digit before 'h'
        h = int(length[0])
        if 'min' in length:
            # Two-digit minutes give a 7-character string, one-digit minutes a shorter one
            if len(length) == 7:
                m = int(length[2:4])
            else:
                m = int(length[2:3])
    else:
        # No hours part, so the string starts with the minutes
        m = int(length[:2])
    # Write through .at to avoid chained assignment on a copy of the column
    data.at[i, 'Length of the movie'] = float(h * 60 + m)
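
The same conversion can also be sketched with a regular expression instead of fixed string positions. This is only an illustration: `duration_to_minutes` is a hypothetical helper that is not part of the original notebook, and it assumes every value looks like `'2h22min'`, `'3h'` or `'45min'`.

```python
import re

# Optional hours part ('2h') followed by an optional minutes part ('22min')
_DURATION = re.compile(r'^(?:(\d+)h)?\s*(?:(\d+)min)?$')

def duration_to_minutes(text):
    """Convert a string such as '2h22min' to a number of minutes (hypothetical helper)."""
    match = _DURATION.match(text.strip())
    if match is None or not any(match.groups()):
        # Anything that doesn't look like a duration is reported instead of guessed
        raise ValueError('Unrecognised duration: {!r}'.format(text))
    hours, minutes = match.groups()
    return float(int(hours or 0) * 60 + int(minutes or 0))

print([duration_to_minutes(s) for s in ['2h22min', '3h', '45min']])
# [142.0, 180.0, 45.0]
```

Under the same assumption, it could be applied to the whole column with `data['Length of the movie'].map(duration_to_minutes)`. Unlike the loop, it raises an error on a malformed value rather than reading whatever digits happen to be in the expected positions.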


### Our dataset isn't all numbers, so let's count the unique values in each non-numeric column to check whether those columns can even affect the rating.

In [124]:
data.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

Name of the movie      250
Link                   250
Censor board rating     11
Length of the movie    106
Genre 1                 12
Genre 2                 20
Genre 3                 17
Release date           246
story summary          250
Director Name          152
Writer 1               198
Writer 2               171
Star 1                 183
Star 2                 230
Star 3                 240
Plot Keywords list     249
Production company     154
dtype: int64

### From the above output we can see that features like release date, story summary, star 2, star 3 and plot keywords are almost unique for each movie. A value that appears only once tells the model nothing about other movies, so keeping these features will not help and we are going to get rid of them now.

In [125]:
data = data.drop(['Unnamed: 0', 'Name of the movie', 'Link', 'Genre 4', 'Release date', 'story summary', 'Writer 3', 'Star 2', 'Star 3', 'Star 4', 'Star 5', 'Plot Keywords list'], axis=1)
data.columns.values

array(['Year released', 'IMDB rating', 'Number of reviewers',
       'Censor board rating', 'Length of the movie', 'Genre 1', 'Genre 2',
       'Genre 3', 'Director Name', 'Writer 1', 'Writer 2', 'Star 1',
       'Budget', 'Gross USA', 'Cumulative worlwide Gross',
       'Production company'], dtype=object)

### We can't use text values like director name directly for our predictions, so we are going to encode them: label encoding for columns with at most two unique values and one-hot encoding (pd.get_dummies) for the rest.

In [126]:
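le = LabelEncoder()
le_count = 0  # number of columns that were label encoded
# Set the movie length aside while encoding and add it back afterwards
lom = data['Length of the movie']
data = data.drop(['Length of the movie'], axis=1)
for col in data:
    if data[col].dtype == 'object':
        # Label encode only the columns with at most two unique values
        if len(list(data[col].unique())) <= 2:
            le.fit(data[col])
            data[col] = le.transform(data[col])
            le_count += 1

# One-hot encode every remaining object column
data = pd.get_dummies(data)
data['Length of the movie'] = lom
print('Shape of dataset: {}'.format(data.shape))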

Shape of dataset: (250, 925)
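
To see why the column count grows so quickly, here is a minimal sketch of what `pd.get_dummies` does to a text column. The three-row `toy` frame is made up for illustration and only borrows a few values from the table above.

```python
import pandas as pd

# Toy frame (illustrative only): one numeric column and one text column
toy = pd.DataFrame({
    'Year released': [1994, 1972, 1974],
    'Director Name': ['Frank Darabont', 'Francis Ford Coppola', 'Francis Ford Coppola'],
})

# Each distinct director becomes its own indicator column; numeric columns are untouched
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
# ['Year released', 'Director Name_Francis Ford Coppola', 'Director Name_Frank Darabont']
```

Every distinct value adds one column, so the 152 directors, 154 production companies and so on quickly add up to the 925 columns reported above. Depending on the pandas version, the indicator columns may be shown as True/False rather than 1/0.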


### Whoa, that's a lot of features... Let's see how our dataset looks now.

In [127]:
data.head()

Unnamed: 0,Year released,IMDB rating,Number of reviewers,Budget,Gross USA,Cumulative worlwide Gross,Censor board rating_2h,Censor board rating_A,Censor board rating_G,Censor board rating_GP,Censor board rating_Not Rated,Censor board rating_PG,Censor board rating_PG-13,Censor board rating_Passed,Censor board rating_R,Censor board rating_U,Censor board rating_UA,Genre 1_Action,Genre 1_Adventure,Genre 1_Animation,Genre 1_Biography,Genre 1_Comedy,Genre 1_Crime,Genre 1_Drama,Genre 1_Film-Noir,Genre 1_Horror,Genre 1_Mystery,Genre 1_Sci-Fi,Genre 1_Western,Genre 2_Action,Genre 2_Adventure,Genre 2_Biography,Genre 2_Comedy,Genre 2_Crime,Genre 2_Drama,Genre 2_Family,Genre 2_Fantasy,Genre 2_Film-Noir,Genre 2_History,Genre 2_Horror,...,Production company_Red Granite Pictures,Production company_Regency Enterprises,Production company_Road Movies Filmproduktion,Production company_Roxlom Films Inc.,Production company_Selznick International Pictures,Production company_Shamley Productions,Production company_Shinchosha Company,Production company_Shôchiku Eiga,Production company_Société générale des films,Production company_Sony Pictures Entertainment (SPE),Production company_Stage 6 Films,Production company_Strong Heart/Demme Production,Production company_Summit Entertainment,Production company_Svensk Filmindustri (SF),Production company_The Directors Company,Production company_The Ladd Company,Production company_The Mirisch Company,Production company_The Mirisch Corporation,Production company_The Samuel Goldwyn Company,Production company_The Weinstein Company,Production company_Toho Company,Production company_Tokuma Japan Communications,Production company_Tokuma Shoten,Production company_Tornasol Films,Production company_Touchstone Pictures,Production company_Twentieth Century Fox,Production company_United Artists,Production company_Universal International Pictures (UI),Production company_Universal Pictures,Production company_Universum Film (UFA),Production company_Vinod Chopra Productions,Production company_Walt Disney Pictures,Production company_Warner Bros.,Production company_Warner Bros. Pictures,Production company_Warner Independent Pictures (WIP),Production company_Wiedemann & Berg Filmproduktion,Production company_Zanuck/Brown Productions,Production company_Zoetrope Studios,Production company_micro_scope,Length of the movie
0,1994,9.3,2040670,25000000.0,28341469.0,58500000.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,142
1,1972,9.2,1399614,6000000.0,13496641.0,245066411.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,175
2,1974,9.0,970207,13000000.0,57300000.0,,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,202
3,2008,9.0,2008344,185000000.0,53485844.0,100455844.0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,152
4,1957,8.9,574484,350000.0,4360000.0,,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,96


### From the above representation we can see that there are some missing values in our dataset, so let's see how much of our data is actually missing.

In [128]:
mis_val = data.isnull().sum()
mis_val_percent = 100 * data.isnull().sum() / len(data)
mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
mis_val_table_ren_columns = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
mis_val_table_ren_columns = mis_val_table_ren_columns[
mis_val_table_ren_columns.iloc[:,1] != 0].sort_values('% of Total Values', ascending=False).round(1)
print ('Out of {} columns {} have missing values'.format(data.shape[1], mis_val_table_ren_columns.shape[0]))
mis_val_table_ren_columns.head()

Out of 925 columns 3 have missing values


Unnamed: 0,Missing Values,% of Total Values
Cumulative worlwide Gross,97,38.8
Gross USA,30,12.0
Budget,28,11.2


### Thank god only three columns are missing data.

* 'Cumulative worlwide Gross' has 97 missing values, which is almost 40% of the entire dataset. That is too much to fill in with the column mean, so I'm going to get rid of this troublesome column.

* 'Gross USA' and 'Budget' are each missing ~12% of their values, so replacing those with the mean of the column should not affect the final accuracy all that much.

In [129]:
data = data.drop(['Cumulative worlwide Gross'], axis=1)
data.fillna(data.mean(), inplace=True)
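
As a small worked example of the mean replacement, the sketch below fills the gaps in a made-up three-row frame. The `toy` values are illustrative, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative only) with one missing value per column
toy = pd.DataFrame({
    'Budget': [25000000.0, np.nan, 13000000.0],
    'Gross USA': [28341469.0, 13496641.0, np.nan],
})

# toy.mean() skips NaN by default, so each gap gets the mean of the known values
filled = toy.fillna(toy.mean())
print(filled)
```

The missing budget becomes (25000000 + 13000000) / 2 = 19000000 and the missing gross becomes (28341469 + 13496641) / 2 = 20919055. Every filled row gets the same value, which is why this works best when only a small share of a column is missing.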

### Correlations

It's always worth checking how strongly each feature is correlated with the target variable.

In [132]:
correlations = data.corr()['IMDB rating'].sort_values()

print('Most Positive Correlations:\n', correlations.tail(20))
print('\nMost Negative Correlations:\n', correlations.head(20))

Most Positive Correlations:
 Writer 2_Christopher Nolan            0.178396
Writer 1_Jonathan Nolan               0.178396
Writer 2_Mario Puzo                   0.192621
Writer 1_Francis Ford Coppola         0.192621
Director Name_Christopher Nolan       0.208279
Length of the movie                   0.212761
Writer 2_Francis Ford Coppola         0.214028
Director Name_Frank Darabont          0.233671
Writer 2_Frank Darabont               0.233671
Production company_New Line Cinema    0.234180
Director Name_Peter Jackson           0.238554
Writer 1_J.R.R. Tolkien               0.238554
Writer 2_Fran Walsh                   0.238554
Star 1_Elijah Wood                    0.238554
Writer 1_Mario Puzo                   0.248068
Censor board rating_A                 0.258102
Star 1_Tim Robbins                    0.275791
Director Name_Francis Ford Coppola    0.286766
Number of reviewers                   0.574138
IMDB rating                           1.000000
Name: IMDB rating, dtype: float
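
For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default. A minimal sketch of that formula, using only the five rows shown by `data.head()` (so the number will not match the full-dataset value above), could look like this; `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float((x_centered * y_centered).sum()
                 / np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum()))

# First five rows of 'Number of reviewers' and 'IMDB rating' from data.head()
reviewers = [2040670, 1399614, 970207, 2008344, 574484]
ratings = [9.3, 9.2, 9.0, 9.0, 8.9]

r = pearson_r(reviewers, ratings)
print(round(r, 3))
```

A value near +1 or -1 means a strong linear relationship, and a value near 0 means the feature says little about the rating on its own.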

### Now that our dataset is ready to use, we can split it into train and test datasets.

In [136]:
Y = np.array(data['IMDB rating'])
X = np.array(data.drop(['IMDB rating'], axis=1))
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("Y_train shape: {}".format(Y_train.shape))
print("Y_test shape: {}".format(Y_test.shape))

X_train shape: (200, 923)
X_test shape: (50, 923)
Y_train shape: (200,)
Y_test shape: (50,)


## 4 - Model testing

### Scikit-learn ships with a bunch of ready-to-use linear regression models, so for this assignment I'm going to try these three to check which one works best for this dataset.

### 1 - [Linear Regression Model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [151]:
reg = linear_model.LinearRegression()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
MAPE = np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100
print('MAPE for Linear Regression model is : {}'.format(MAPE))

MAPE for Linear Regression model is : 1.6661177167215406
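
The metric used here is the mean absolute percentage error (MAPE): the average of |actual - predicted| / |actual|, times 100. The inline expression can be wrapped in a small function; `mape` is a hypothetical helper name, and the two ratings below are made up to keep the arithmetic easy to follow.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (same formula as the cell above)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Errors of 0.4 on a rating of 8.0 (5%) and 0.9 on a rating of 9.0 (10%)
print(mape([8.0, 9.0], [8.4, 8.1]))  # roughly 7.5
```

Because every rating in the top 250 lies in a narrow band well above zero, the division is safe here. On a target that can be zero or close to it, MAPE would blow up.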


### 2 - [Ridge Model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)

In [152]:
reg = linear_model.Ridge()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
MAPE = np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100
print('MAPE for Ridge Regression model is : {}'.format(MAPE))

MAPE for Ridge Regression model is : 2.2427469580514248




### 3 - [Lasso Model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)

In [153]:
reg = linear_model.Lasso()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
MAPE = np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100
print('MAPE for Lasso Regression model is : {}'.format(MAPE))

MAPE for Lasso Regression model is : 1.8281196368609514
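
The three cells above repeat the same fit, predict and score steps, so they can also be sketched as one loop. Since the CSV isn't available here, this sketch runs on synthetic stand-in data from `make_regression` (an assumption, rescaled to look like ratings around 8.5), so its numbers are not comparable with the ones above.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the movie data (assumption): 250 rows, 20 numeric features
X, y = make_regression(n_samples=250, n_features=20, noise=5.0, random_state=42)
# Rescale the target so it sits around 8.5 like an IMDB rating and never reaches zero
y = 8.5 + 0.3 * (y - y.mean()) / y.std()

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': linear_model.LinearRegression(),
    'Ridge Regression': linear_model.Ridge(),
    'Lasso Regression': linear_model.Lasso(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # Same MAPE formula as in the cells above
    results[name] = float(np.mean(np.abs((Y_test - Y_pred) / Y_test)) * 100)
    print('MAPE for {} model is : {}'.format(name, results[name]))
```

Keeping the scores in a dictionary makes it easy to pick the best model afterwards, for example with `min(results, key=results.get)`.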


## 5 - Conclusion

From the above MAPE values we can conclude that the Linear Regression model gives the best results on this test split (MAPE of about 1.67%), ahead of Lasso (about 1.83%) and Ridge (about 2.24%).