# IMDb Movie Ratings Classification

## Goal

*To develop and train a classification model to help predict the average user rating for a given movie. This model aims to assist those in the movie industry by attempting to predict how well their movies might do. In addition, it may help directors focus on attributes that will reliably produce the highest score possible. A successful classification model will be able to accurately predict average user ratings of movies on IMDb given these sorts of attributes.*
<hr>

## Dataset Description

>**Context:**
>*IMDb is the most popular movie website and it combines movie plot description, Metascore ratings, critic and user ratings and reviews, release dates, and many more aspects. The website is well known for storing almost every movie that has ever been released (the oldest is from 1874 - "Passage de Venus") or just planned to be released (the newest movie is from 2027 - "Avatar 5"). IMDb stores information related to more than 6 million titles (of which almost 500,000 are feature films), and it has been owned by Amazon since 1998.*

>**Content:**
> *The movies dataset includes 85,855 movies with attributes such as movie description, average rating, number of votes, genre, etc. The ratings dataset includes 85,855 rating details from a demographic perspective. The names dataset includes 297,705 cast members with personal attributes such as birth details, death details, height, spouses, children, etc. The title principals dataset includes 835,513 cast members' roles in movies with attributes such as IMDb title id, IMDb name id, order of importance in the movie, role, and characters played.*

<hr>

## Loading the data
<hr>

In [496]:
import numpy as np
import pandas as pd

data = pd.read_csv("IMDb movies.csv", low_memory=False)
data.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1.0,2.0
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7.0,7.0
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5.0,2.0
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25.0,3.0
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31.0,14.0


# Data Pre-processing

Ideally, we want a data set with no missing values so that the classification model has more consistent performance. Since the dataset is quite large (80,000+ records), we can discard some of the records which contain missing values for important attributes, such as "USA Gross Income" and "Metascore".
<hr>

In [497]:
# These attributes are very important to have, so we ensure that there are no missing values for them.
data.dropna(subset=['usa_gross_income', 'worlwide_gross_income', 'budget',
                    'metascore', 'avg_vote', 'reviews_from_users', 'reviews_from_critics'],
            inplace=True
            )

# We also want to ensure that there are no duplicate movies in the dataset.
data.drop_duplicates(['original_title'], inplace=True)

# We create a new dataframe consisting of the attributes that we'll be using for the learning process.
dataframe = pd.DataFrame(data, columns=['imdb_title_id', 'original_title', 'year', 'genre', 'duration', 'country', 'language', 'director', 'writer', 'actors', 'avg_vote', 'votes', 'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore', 'reviews_from_users', 'reviews_from_critics'])
dataframe.head()

Unnamed: 0,imdb_title_id,original_title,year,genre,duration,country,language,director,writer,actors,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
506,tt0017136,Metropolis,1927,"Drama, Sci-Fi",153,Germany,German,Fritz Lang,"Thea von Harbou, Thea von Harbou","Alfred Abel, Gustav Fröhlich, Rudolf Klein-Rog...",8.3,156076,DEM 6000000,$ 1236166,$ 1349711,98.0,495.0,208.0
1048,tt0021749,City Lights,1931,"Comedy, Drama, Romance",87,USA,English,Charles Chaplin,Charles Chaplin,"Virginia Cherrill, Florence Lee, Harry Myers, ...",8.5,162668,$ 1500000,$ 19181,$ 46008,99.0,295.0,122.0
2454,tt0027977,Modern Times,1936,"Comedy, Drama, Family",87,USA,English,Charles Chaplin,Charles Chaplin,"Charles Chaplin, Paulette Goddard, Henry Bergm...",8.5,211250,$ 1500000,$ 163577,$ 457688,96.0,307.0,115.0
2795,tt0029453,Pépé le Moko,1937,"Crime, Drama, Romance",94,France,"French, Arabic",Julien Duvivier,"Henri La Barthe, Henri La Barthe","Jean Gabin, Gabriel Gabrio, Saturnin Fabre, Fe...",7.7,6180,$ 60000,$ 155895,$ 155895,98.0,46.0,55.0
2827,tt0029583,Snow White and the Seven Dwarfs,1937,"Animation, Family, Fantasy",83,USA,English,"William Cottrell, David Hand","Jacob Grimm, Wilhelm Grimm","Roy Atwell, Stuart Buchanan, Adriana Caselotti...",7.6,177157,$ 1499000,$ 184925486,$ 184925486,95.0,260.0,173.0


## Converting Columns from <code>str</code> to <code>int</code> in the DataFrame

Many of the columns in the dataframe are of type <code>str</code> and aren't suitable for our analysis (e.g. year, metascore). We can convert the columns from type <code>str</code> to <code>int</code> or <code>float</code> for better analysis of the data.
<hr>

In [498]:
dataframe['year'] = dataframe['year'].astype(int)
dataframe['votes'] = dataframe['votes'].astype(int)
dataframe['metascore'] = dataframe['metascore'].astype(int)

# The original CSV had missing values in reviews from users and reviews
# from critics, so pandas loaded both columns as floats. The rows with
# missing values were dropped above; we keep the float type explicitly.
dataframe['reviews_from_users'] = dataframe['reviews_from_users'].astype(float)
dataframe['reviews_from_critics'] = dataframe['reviews_from_critics'].astype(float)

# The values for 'avg_vote' have a decimal part (e.g. 8.3), so we
# convert the column to float to keep it.
dataframe['avg_vote'] = dataframe['avg_vote'].astype(float)

## Filtering Movies from 1970 and Beyond

Movies produced and released prior to the 1970s had a fairly different audience and also generally lower box-office revenue. Filtering the dataset to contain only movies from 1970 onward will allow us to look at and classify modern movies.
<hr>

In [499]:
modern_df = dataframe.loc[dataframe['year'] >= 1970].copy()  # Copy so that later in-place edits act on an independent frame.
modern_df.head()

Unnamed: 0,imdb_title_id,original_title,year,genre,duration,country,language,director,writer,actors,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
4334,tt0035423,Kate & Leopold,2001,"Comedy, Fantasy, Romance",118,USA,"English, French",James Mangold,"Steven Rogers, James Mangold","Meg Ryan, Hugh Jackman, Liev Schreiber, Brecki...",6.4,77852,$ 48000000,$ 47121859,$ 76019048,44,341.0,115.0
14260,tt0065134,Two Mules for Sister Sara,1970,"Adventure, Romance, War",116,"USA, Mexico","English, Spanish, French, Latin",Don Siegel,"Budd Boetticher, Albert Maltz","Shirley MacLaine, Clint Eastwood, Manolo Fábre...",7.0,23223,$ 2500000,$ 5050000,$ 5050000,62,79.0,39.0
14323,tt0065377,Airport,1970,"Action, Drama, Thriller",137,USA,"English, Italian","George Seaton, Henry Hathaway","Arthur Hailey, George Seaton","Burt Lancaster, Dean Martin, Jean Seberg, Jacq...",6.6,17068,$ 10000000,$ 100489151,$ 100489151,42,188.0,82.0
14340,tt0065421,The AristoCats,1970,"Animation, Adventure, Comedy",78,USA,English,Wolfgang Reitherman,"Larry Clemmons, Vance Gerry","Phil Harris, Eva Gabor, Sterling Holloway, Sca...",7.1,91085,$ 4000000,$ 35452658,$ 35459543,66,141.0,113.0
14357,tt0065462,Beneath the Planet of the Apes,1970,"Action, Adventure, Sci-Fi",95,USA,English,Ted Post,"Paul Dehn, Mort Abrahams","James Franciscus, Kim Hunter, Maurice Evans, L...",6.1,41159,$ 3000000,$ 18999718,$ 18999718,46,182.0,73.0


## Converting Currency Strings to Numerical Values

There are a few attributes in the data that contain currency values, but because
of the currency codes they're of `str` data type instead of a numeric type. As strings,
these values can't be compared or used as numerical features.

Our goal is to convert every currency value to USD. We will be using the [CurrencyConverter](https://pypi.org/project/CurrencyConverter/)
Python library by Alex Prengere to convert the various currencies to USD. Note that
the code below calls `convert` without a date, so amounts are converted at the library's
most recent available exchange rate and are not adjusted for inflation.

We will also be using Python's `re` regular expression module to parse the currency
codes and the numbers from the string data.
<hr>

In [500]:
from currency_converter import CurrencyConverter
import re

# Grab only the currency attributes from the data.
budget = modern_df['budget']
usa_gross_income = modern_df['usa_gross_income']
worldwide_gross_income = modern_df['worlwide_gross_income']

'''Use regular expressions to split the string into the currency code
and the value.'''
def split_currency_string(string):
    """Function takes a currency string and parses the currency code
    and currency value and stores it in a tuple.
    """

    match = re.match(r"(\$|[a-z]+) (\d+)", string, re.I)

    if match:
        items = match.groups()
        return items

c = CurrencyConverter()

'''Some of the currency codes in the data set don't match up with the converter.
We define a dictionary that contains those mis-labeled currency codes and
map them to the correct currency code for the converter.
'''
currency_codes = {
'$': 'USD',
'RUR': 'RUB',
}

'''There are a few currencies in the data set that are obsolete. These are currencies that were
used prior to the introduction of the Euro. This hash table maps each of the obsolete European
currencies to their conversion rate to the Euro.
'''
obsolete_euro_currencies = {
    'DEM': 0.511364,
    'FRF': 0.152449,
    'ITL': 0.000516457,
    'BEF': 0.0247894,
    'ESP': 0.00601012,
    'ATS': 0.0726728,
    'FIM': 0.168188
}

'''There are currencies in the data set that are not included in the currency converter. To
bypass this problem, we map each of these currencies to their conversion rate to the Euro.
'''
other_currencies = {
    'CLP': 0.0012,
    'NGN': 0.0022,
    'ARS': 0.0090,
    'DOP': 0.015
}

def convert_currency_to_USD(currency_code: str, currency_value: str):
    """The function takes in two arguments, a currency code and a currency value,
    and converts them to USD.
    """

    if currency_code in currency_codes:
        new_currency = (currency_codes[currency_code], currency_value)
        converted_value = c.convert(int(new_currency[1]), new_currency[0], 'USD')
    elif currency_code in obsolete_euro_currencies:
        obsolete_currency = int(currency_value)                             # Grab the value of the obsolete currency.
        euro = obsolete_currency * obsolete_euro_currencies[currency_code]  # Convert the deprecated currency value to the EUR.
        converted_value = c.convert(euro, 'EUR', 'USD')                     # Convert EUR to USD.
    elif currency_code in other_currencies:
        other_currency = int(currency_value)
        euro = other_currency * other_currencies[currency_code]
        converted_value = c.convert(euro, 'EUR', 'USD')
    else:
        converted_value = c.convert(int(currency_value), currency_code, 'USD')  # Convert all currencies to USD.

    return converted_value

'''Since we are attempting to write back to the original data frame, we need to
ensure that we disable the chained assignment protection in the Pandas options.
'''
pd.options.mode.chained_assignment = None  # default='warn'

def currency_str_to_float(df, column_name):
    """Takes in a data frame and a column name. Converts each row into USD and
    modifies the data frame in place.
    """

    def to_usd(currency_string):
        code, value = split_currency_string(currency_string)    # Split the string into a code and value.
        return convert_currency_to_USD(code, value)             # Convert the currency value to USD.

    # Assign the whole column in one step; a chained write such as
    # df[column_name].iloc[i] = ... may land on a temporary copy.
    df[column_name] = df[column_name].map(to_usd)

currency_str_to_float(modern_df, 'budget')                      # Convert Budget currencies to USD.
currency_str_to_float(modern_df, 'usa_gross_income')            # Convert USA Gross Income to USD.
currency_str_to_float(modern_df, 'worlwide_gross_income')       # Convert Worldwide Gross Income to USD.
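
The parsing step and the fixed-rate step for pre-euro currencies can be checked in isolation, without the converter library. The following is a minimal sketch using only the standard library. The names `parse_currency`, `legacy_to_eur`, and `LEGACY_TO_EUR` are hypothetical and not part of the notebook. The regular expression and the two rates are copied from the cell above.

```python
import re

# Fixed legacy-currency-to-EUR rates, copied from the notebook's table above.
LEGACY_TO_EUR = {"DEM": 0.511364, "FRF": 0.152449}


def parse_currency(text):
    """Split a string such as 'DEM 6000000' into ('DEM', 6000000)."""
    match = re.match(r"(\$|[a-z]+) (\d+)", text, re.I)
    if match is None:
        # Fail loudly instead of returning None, so bad rows are easy to spot.
        raise ValueError(f"unrecognised currency string: {text!r}")
    code, amount = match.groups()
    return code, int(amount)


def legacy_to_eur(code, amount):
    """Convert an amount in a pre-euro currency to EUR at its fixed rate."""
    return amount * LEGACY_TO_EUR[code]


print(parse_currency("$ 2250"))
print(parse_currency("DEM 6000000"))
print(legacy_to_eur(*parse_currency("DEM 6000000")))
```

Raising on an unparseable string is a design choice of this sketch; the notebook's `split_currency_string` returns `None` in that case.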

## Sorting Data by Average User Rating
<hr>

In [501]:
sorted_df = modern_df.sort_values('avg_vote', ascending=False)
sorted_df.head()

Unnamed: 0,imdb_title_id,original_title,year,genre,duration,country,language,director,writer,actors,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
28453,tt0111161,The Shawshank Redemption,1994,Drama,142,USA,English,Frank Darabont,"Stephen King, Frank Darabont","Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",9.3,2278845,25000000.0,28699976.0,28815245.0,80,8232.0,164.0
15528,tt0068646,The Godfather,1972,"Crime, Drama",175,USA,"English, Italian, Latin",Francis Ford Coppola,"Mario Puzo, Francis Ford Coppola","Marlon Brando, Al Pacino, James Caan, Richard ...",9.2,1572674,6000000.0,134966411.0,246120974.0,100,3977.0,253.0
48078,tt0468569,The Dark Knight,2008,"Action, Crime, Drama",152,"USA, UK","English, Mandarin",Christopher Nolan,"Jonathan Nolan, Christopher Nolan","Christian Bale, Heath Ledger, Aaron Eckhart, M...",9.0,2241615,185000000.0,535234033.0,1005455211.0,84,6938.0,423.0
16556,tt0071562,The Godfather: Part II,1974,"Crime, Drama",202,USA,"English, Italian, Spanish, Latin, Sicilian",Francis Ford Coppola,"Francis Ford Coppola, Mario Puzo","Al Pacino, Robert Duvall, Diane Keaton, Robert...",9.0,1098714,13000000.0,47834595.0,48035783.0,90,1030.0,178.0
27629,tt0108052,Schindler's List,1993,"Biography, Drama, History",195,USA,"English, Hebrew, German, Polish, Latin",Steven Spielberg,"Thomas Keneally, Steven Zaillian","Liam Neeson, Ben Kingsley, Ralph Fiennes, Caro...",8.9,1183248,22000000.0,96898818.0,322287794.0,94,1845.0,232.0


# The Multi-layer Perceptron Classifier

>*Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function <strong>f: R^m → R^o</strong>
by training on a dataset, where <strong>m</strong> is the number of dimensions for input and <strong>o</strong> is the number of dimensions for output. Given a set of features <strong>X = x1, x2, ..., xm</strong>
and a target <strong>y</strong>, it can learn a non-linear function approximator for either classification or regression. It is different from logistic regression, in that between the input and the output layer, there can be one or more non-linear layers, called hidden layers.* (scikit-learn.org)

The Multi-layer Perceptron classifier is a feed-forward neural network that can
learn non-linear decision boundaries. We'll be using it as our classification model
for predicting ratings of movies based on the training data set. Let's set up our
data for the classifier by encoding and normalizing the data set.

## Encoding String Data into Numerical Values

Prior to normalizing and training our neural network model, we need to
encode our string attributes into numerical data.
<hr />

In [502]:
from sklearn.preprocessing import LabelEncoder

# Class to use the LabelEncoder across multiple columns.
class MultiColumnLabelEncoder:
    def __init__(self, columns = None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        output = X.copy()

        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():  # DataFrame.iteritems() was removed in pandas 2.0.
                output[colname] = LabelEncoder().fit_transform(col)

        return output

    def fit_transform(self, X, y = None):
        return self.fit(X, y).transform(X)

# Encode the categorical attributes in the data set.
encoded_df = MultiColumnLabelEncoder(columns=['imdb_title_id', 'original_title', 'genre', 'country',
                                 'language', 'director', 'writer', 'actors']).fit_transform(sorted_df)

imdb_title_id: <class 'numpy.int32'>
original_title: <class 'numpy.int32'>
year: <class 'numpy.int32'>
genre: <class 'numpy.int32'>
duration: <class 'numpy.int64'>
country: <class 'numpy.int32'>
language: <class 'numpy.int32'>
director: <class 'numpy.int32'>
writer: <class 'numpy.int32'>
actors: <class 'numpy.int32'>
avg_vote: <class 'numpy.float64'>
votes: <class 'numpy.int32'>
budget: <class 'float'>
usa_gross_income: <class 'float'>
worlwide_gross_income: <class 'float'>
metascore: <class 'numpy.int32'>
reviews_from_users: <class 'numpy.float64'>
reviews_from_critics: <class 'numpy.float64'>
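
To make the encoding step concrete, here is a plain-Python sketch of what label encoding does to one column. It mirrors the behaviour of scikit-learn's `LabelEncoder`, which numbers the sorted distinct labels from 0, but it is an illustration and not the library's implementation. The function name `label_encode` and the sample values are hypothetical.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


genres = ["Drama", "Comedy", "Drama", "Crime, Drama"]
codes, classes = label_encode(genres)

print(classes)  # ['Comedy', 'Crime, Drama', 'Drama']
print(codes)    # [2, 0, 2, 1]
```

Two properties are worth keeping in mind: a multi-genre string such as `"Crime, Drama"` is treated as one category of its own, and the integer codes imply an ordering between categories that has no real meaning.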


## Separate Training and Test Set

We will now split our data into a training set and a test set. As mentioned before, we will be focusing
on learning a model to predict ratings of movies. We'll split the data into `X` and `Y`,
where `X` will be the data set with every attribute besides the average vote and `Y`
will be the data set with <i>only</i> the average vote.
<hr />

In [504]:
from sklearn.model_selection import train_test_split

X = encoded_df.drop('avg_vote', axis=1)
Y = encoded_df['avg_vote']

# The classifier expects discrete class labels, so we truncate each average vote to its integer part (e.g. 7.9 becomes class 7).
Y = Y.astype('int')

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print("X Training set size:", len(X_train))
print("Y Training set size:", len(Y_train))
print("X Test set size:", len(X_test))
print("Y Test set size:", len(Y_test))

X Training set size: 5154
Y Training set size: 5154
X Test set size: 1289
Y Test set size: 1289
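
The integer cast on `Y` turns the continuous rating into a small set of class labels by truncation, not by rounding. A short sketch with made-up ratings shows the difference; the sample values are hypothetical.

```python
ratings = [5.9, 6.1, 7.0, 8.3, 9.3]

# int() drops the fractional part, which is what astype('int') does to floats.
truncated = [int(r) for r in ratings]

# Rounding to the nearest integer would assign 5.9 to class 6 instead.
rounded = [round(r) for r in ratings]

print(truncated)  # [5, 6, 7, 8, 9]
print(rounded)    # [6, 6, 7, 8, 9]
```

With truncation, class 7 covers every rating from 7.0 up to, but not including, 8.0.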


In [505]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training set only, so that no test-set statistics leak into training.
scaler.fit(X_train)

# And then we'll apply the transformations to the data.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
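
The scaling step can be sketched for a single feature with the standard library. The mean and standard deviation come from the training values only and are then reused for the test values. The numbers below are made up, and the population standard deviation is used because `StandardScaler` divides by it.

```python
from statistics import fmean, pstdev

train = [90.0, 100.0, 110.0, 120.0]
test = [100.0, 130.0]

# "Fit": statistics are computed from the training values only.
mean = fmean(train)
std = pstdev(train)  # population standard deviation (ddof=0)

# "Transform": the same mean and std are applied to both sets.
train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]

print(mean, std)
print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1. The test values generally do not, which is expected.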

## Using a Hyper-parameter Optimization Tool

scikit-learn provides tools, such as `GridSearchCV`, for tuning the
hyperparameters of our classification model.
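
As one possible way to do this, the sketch below runs a small cross-validated grid search over two `MLPClassifier` hyperparameters. It is an illustration under stated assumptions and not the search used for the result reported in this notebook: the data is synthetic (generated with `make_classification`), and the grid values, `cv=3`, and the names `X_demo`, `y_demo`, `pipeline`, `param_grid`, and `search` are all chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movie features; this is NOT the IMDb data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=500, random_state=0),
)

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "mlpclassifier__hidden_layer_sizes": [(50,), (100,)],
    "mlpclassifier__alpha": [1e-4, 1e-2],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `X_demo` and `y_demo` would be replaced by the unscaled training split, and `search.best_estimator_` could then be scored on the test split.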

## Training the Model

In [509]:
from sklearn.neural_network import MLPClassifier

# Use a single hidden layer of 200 units and allow up to 1000 iterations for convergence.
mlp = MLPClassifier(hidden_layer_sizes=[200], max_iter=1000)
mlp.fit(X_train, Y_train)
mlp.score(X_test, Y_test)


0.5880527540729248
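
For a classifier, `score` returns the mean accuracy, that is, the fraction of test movies whose predicted integer rating class equals the true one. The sketch below computes the same quantity by hand on made-up labels; the function name `accuracy` and the sample lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


y_true = [6, 7, 7, 5, 8]
y_pred = [6, 7, 6, 5, 7]

print(accuracy(y_true, y_pred))  # 0.6
```

Read this way, the score above means that roughly 59% of the test movies were assigned the correct integer rating class.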