# Model Pipelines

In [7]:
import pandas as pd
import numpy as np

df = pd.read_csv("data/imdb-top-rated-movies-user-rated.csv")
df.head()

Unnamed: 0,Rank,Title,IMDb Rating,Votes,Poster URL,Video URL,Meta Score,Tags,Director,Description,Writers,Stars,Summary,Worldwide Gross
0,1,Once Upon a Time... in Hollywood,7.6,927K,https://www.imdb.com/title/tt7131622/mediaview...,https://imdb-video.media-imdb.com/vi1385741849...,84.0,"""Period Drama, Showbiz Drama, Comedy, Drama""",Quentin Tarantino,"""As Hollywood's Golden Age is winding down dur...",Quentin Tarantino,"""Leonardo DiCaprio, Brad Pitt, Margot Robbie""","""Reviewers say 'Once Upon a Time in Hollywood'...",-
1,2,Mission: Impossible - Dead Reckoning Part One,7.6,311K,https://www.imdb.com/title/tt9603212/mediaview...,https://imdb-video.media-imdb.com/vi3500918553...,81.0,"""Action Epic, Adventure Epic, Spy, Action, Adv...",Christopher McQuarrie,Ethan Hunt and his IMF team must track down a ...,"""Bruce Geller, Christopher McQuarrie, Erik Jen...","""Tom Cruise, Hayley Atwell, Ving Rhames""","""Reviewers say 'Mission: Impossible - Dead Rec...",-
2,3,John Wick: Chapter 4,7.6,392K,https://www.imdb.com/title/tt10366206/mediavie...,https://imdb-video.media-imdb.com/vi289916185/...,78.0,"""Action Epic, Gun Fu, One,Person Army Action, ...",Chad Stahelski,"""John Wick uncovers a path to defeating The Hi...","""Shay Hatten, Michael Finch, Derek Kolstad""","""Keanu Reeves, Laurence Fishburne, George Geor...","""Reviewers say 'John Wick: Chapter 4' is laude...",-
3,4,Watchmen,7.6,603K,https://www.imdb.com/title/tt0409459/mediaview...,https://imdb-video.media-imdb.com/vi240565017/...,56.0,"""Dystopian Sci,Fi, Superhero, Action, Drama, M...",Zack Snyder,"""In a version of 1985 where superheroes exist-...","""Dave Gibbons, David Hayter, Alex Tse""","""Jackie Earle Haley, Patrick Wilson, Carla Gug...","""Reviewers say 'Watchmen' is acclaimed for its...",-
4,5,The Fifth Element,7.6,533K,https://www.imdb.com/title/tt0119116/mediaview...,https://imdb-video.media-imdb.com/vi854720793/...,52.0,"""Sci,Fi Epic, Space Sci,Fi, Action, Adventure,...",Luc Besson,"""In the colorful future- a cab driver unwittin...","""Luc Besson, Robert Mark Kamen""","""Bruce Willis, Milla Jovovich, Gary Oldman""",-,-


First, we load the data from the CSV file.

In [8]:
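def map_votes(votes):
    """Convert a vote string such as '927K' or '2M' to an integer."""
    if votes[-1] == 'K':
        # round() avoids off-by-one truncation caused by floating-point error
        return int(round(float(votes[:-1]) * 1_000))
    elif votes[-1] == 'M':
        return int(round(float(votes[:-1]) * 1_000_000))
    else:
        return int(votes)

df["Votes"] = df["Votes"].map(map_votes)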

df.sample()

Unnamed: 0,Rank,Title,IMDb Rating,Votes,Poster URL,Video URL,Meta Score,Tags,Director,Description,Writers,Stars,Summary,Worldwide Gross
816,817,Vertigo,8.2,450000,https://www.imdb.com/title/tt0052357/mediaview...,https://imdb-video.media-imdb.com/vi216072473/...,100.0,"Conspiracy Thriller, Dark Romance, Psychologic...",Alfred Hitchcock,A former San Francisco police detective juggle...,"Alec Coppel, Samuel A. Taylor, Pierre Boileau","James Stewart, Kim Novak, Barbara Bel Geddes",Reviewers say 'Vertigo' is acclaimed for its n...,


In [9]:
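# Split comma-separated strings into lists.
# Note: astype(str) turns a missing value into the list ['nan'].
df["Stars"] = df["Stars"].astype(str).map(lambda x: x.split(", "))
df["Writers"] = df["Writers"].astype(str).map(lambda x: x.split(", "))
df["Tags"] = df["Tags"].astype(str).map(lambda x: x.split(", "))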

df.sample()

Unnamed: 0,Rank,Title,IMDb Rating,Votes,Poster URL,Video URL,Meta Score,Tags,Director,Description,Writers,Stars,Summary,Worldwide Gross
313,314,Whisper of the Heart,7.8,80000,https://www.imdb.com/title/tt0113824/mediaview...,https://imdb-video.media-imdb.com/vi825015321/...,75.0,"[Anime, Coming,of,Age, Hand,Drawn Animation, I...",Yoshifumi Kondô,A love story between a girl who loves reading ...,"[Aoi Hiiragi, Hayao Miyazaki]","[Yoko Honna, Issei Takahashi, Takashi Tachibana]",,$4-589-697


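Then we clean the data step by step: the vote counts are converted to integers, and the Stars, Writers, and Tags strings are split into lists.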
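To make the cleaning steps concrete, the sketch below applies the same two transformations to a small hand-made frame. The rows and names in `toy` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd


def map_votes(votes):
    """Convert a vote string such as '927K' or '2M' to an integer."""
    if votes[-1] == 'K':
        return int(round(float(votes[:-1]) * 1_000))
    elif votes[-1] == 'M':
        return int(round(float(votes[:-1]) * 1_000_000))
    return int(votes)


# Invented sample rows, for illustration only.
toy = pd.DataFrame({
    "Votes": ["927K", "2M", "850"],
    "Stars": ["A One, B Two", "C Three", "D Four, E Five, F Six"],
})

# Same cleaning as in the cells above: suffix parsing, then list splitting.
toy["Votes"] = toy["Votes"].map(map_votes)
toy["Stars"] = toy["Stars"].astype(str).map(lambda x: x.split(", "))

print(toy)
```

After these steps, every Votes entry is an integer and every Stars entry is a Python list, which is the shape the later encoders expect.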

In [10]:
from src.transformers.custom_transformer import ExplodedLooEcoder

Here we import the custom encoder that encodes the exploded list columns. It lives in the *src* folder.
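The source of the custom encoder is not shown in this notebook. As a rough idea of what encoding an exploded list column can look like, here is a simplified stand-in. The class `ExplodedMeanEncoderSketch` is hypothetical and is not the encoder from *src*: it uses a plain per-item target mean and leaves out the leave-one-out correction.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ExplodedMeanEncoderSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in, NOT the encoder from src.

    Encodes a list-valued column as the average of per-item target means.
    """

    def __init__(self, col):
        self.col = col

    def fit(self, X, y):
        # One row per (movie, list item), each carrying that movie's target.
        exploded = pd.DataFrame({
            "item": X[self.col].reset_index(drop=True),
            "target": pd.Series(y).reset_index(drop=True),
        }).explode("item")
        self.global_mean_ = float(pd.Series(y).mean())
        self.item_means_ = exploded.groupby("item")["target"].mean()
        return self

    def transform(self, X):
        def encode(items):
            # Unseen items fall back to the global target mean.
            values = [self.item_means_.get(i, self.global_mean_) for i in items]
            return sum(values) / len(values) if values else self.global_mean_

        return X[self.col].map(encode).to_numpy(dtype=float).reshape(-1, 1)


# Invented sample data, for illustration only.
toy = pd.DataFrame({"Tags": [["Drama", "Comedy"], ["Drama"], ["Action"]]})
y = pd.Series([8.0, 6.0, 7.0])

encoder = ExplodedMeanEncoderSketch("Tags")
encoded = encoder.fit(toy, y).transform(toy)
print(encoded)
```

Unlike a leave-one-out encoder, this simplified version includes each row's own target in the item means during training, so it is more prone to target leakage.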

In [11]:
from sklearn.model_selection import train_test_split
X = df[['Votes','Meta Score','Director','Tags','Writers']]
y = df['IMDb Rating']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=17)

Next, we split the data into training and test sets.

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import category_encoders as ce
from xgboost import XGBRegressor

meta_imputer = SimpleImputer(strategy='mean')
director_transformer = ce.LeaveOneOutEncoder(cols='Director')
tags_transformer = ExplodedLooEcoder('Tags')
writers_transformer = ExplodedLooEcoder('Writers')

preprocessor = ColumnTransformer(
    transformers=[
        ('meta_score',meta_imputer,['Meta Score']),
        ('director',director_transformer,['Director']),
        ('tags',tags_transformer,['Tags']),
        ('writers',writers_transformer,['Writers'])
    ],
    remainder='passthrough'
)

Now we build the preprocessor for our model using ColumnTransformer. Columns that are not listed in a transformer, here Votes, are passed through unchanged.
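One detail worth knowing is how ColumnTransformer orders its output: transformed columns come first, in the order the transformers are listed, and passed-through columns are appended at the end. The toy frame below is invented to illustrate this with a single imputer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Invented sample rows, for illustration only.
toy = pd.DataFrame({
    "Votes": [1000, 2000, 3000],
    "Meta Score": [80.0, np.nan, 60.0],
})

ct = ColumnTransformer(
    transformers=[
        ("meta_score", SimpleImputer(strategy="mean"), ["Meta Score"]),
    ],
    remainder="passthrough",
)

out = ct.fit_transform(toy)

# Column 0 is the imputed Meta Score, column 1 is the untouched Votes,
# even though Votes comes first in the input frame.
print(out)
```

The missing Meta Score is replaced by the mean of the other two values, and Votes moves to the last column.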

In [13]:
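xgb = XGBRegressor(
    objective='reg:squarederror',  # Squared-error regression objective
    n_estimators=300,              # Number of trees to build
    learning_rate=0.01,            # How much each tree corrects the previous ones
    subsample=0.7,                 # Fraction of rows sampled for each tree
    max_depth=5,                   # Maximum depth of each tree
    colsample_bytree=0.8,          # Fraction of columns sampled for each tree
    random_state=42                # Seed for reproducible results
)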

Here we define the XGBRegressor that will make predictions on the preprocessed data.

In [14]:
from sklearn.pipeline import Pipeline
imdb_pipeline = Pipeline([
    ('preprocessor',preprocessor),
    ('xgboost',xgb)
])

Finally, we create the pipeline object, which preprocesses the data and then predicts the IMDb ratings.
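As a minimal sketch of what a Pipeline does, the example below compares a two-step pipeline with the same steps run by hand. It uses a SimpleImputer and a LinearRegression on invented data instead of the preprocessor and XGBRegressor above, purely to keep it small.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Invented training data with one missing value.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 5.0, 8.0])

# Pipeline: fit() fits each step in order, predict() replays them.
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])
toy_pipeline.fit(X, y)

# The same two steps, done by hand.
imputer = SimpleImputer(strategy="mean")
model = LinearRegression()
model.fit(imputer.fit_transform(X), y)

# New data is transformed with statistics learned from the training data.
X_new = np.array([[3.0], [np.nan]])
pipeline_pred = toy_pipeline.predict(X_new)
manual_pred = model.predict(imputer.transform(X_new))

print(np.allclose(pipeline_pred, manual_pred))
```

Both routes give the same predictions. The pipeline simply bundles the steps so that the preprocessing fitted on the training data is reused at prediction time.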

In [15]:
pipe = imdb_pipeline.fit(X_train,y_train)
y_pred = pd.Series(pipe.predict(X_test))

Now we fit the pipeline on the training data. We also make predictions on the test data to check that the pipeline works end to end.

In [16]:
from sklearn.metrics import r2_score
r2_score(y_true=y_test,y_pred=y_pred)

0.24631863672859855

As the output shows, the R² score is about 0.25. This is similar to the earlier result, so the pipeline works as expected.
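For reference, R² compares the model's squared error with the error of always predicting the mean rating, so a score near 0.25 means the model removes roughly a quarter of that baseline error. The sketch below computes R² by hand on invented numbers and checks it against r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented ratings and predictions, for illustration only.
y_true = np.array([7.6, 8.2, 7.8, 8.0])
y_hat = np.array([7.7, 8.0, 7.9, 7.9])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

# Always predicting the mean gives a score of (approximately) zero.
baseline_r2 = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(manual_r2, r2_score(y_true, y_hat), baseline_r2)
```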

In [17]:
import joblib
joblib.dump(pipe,'pipeline.joblib')

['pipeline.joblib']

In the end, we save the trained pipeline object to a joblib file so that we can predict an IMDb rating whenever we like.
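Loading the file back is the mirror image of saving it. The sketch below round-trips a small stand-in pipeline through joblib. The toy pipeline, its data, and the file name `toy_pipeline.joblib` are made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented data and a small stand-in pipeline.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.joblib")
    joblib.dump(toy_pipe, path)    # save the fitted pipeline
    restored = joblib.load(path)   # load it back

# The restored pipeline predicts exactly like the original.
same = np.allclose(toy_pipe.predict(X), restored.predict(X))
print(same)
```

For the real pipeline, the module that defines the custom encoder must be importable wherever the file is loaded, because the saved file stores a reference to the class and not its code.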