# Look at the big picture 🏞

**Given a set of information about different movies, our task is to create an algorithm that predicts each movie's worldwide box office revenue.**

To make things a bit more fun, let's pretend that we were employed by a movie producer to find out whether his next movie is going to be a commercial success 💸💸💸

**The main goals of our model:**

1. It needs to be accurate. We need the producer to trust our predictions.

2. It needs to make predictions fast. The producer wants to try many different parameters, such as genre and actors, to see what gives him the most money.

**Frame the problem**

The problem can be framed as a supervised learning task, specifically a univariate regression: we use a labeled training set, take multiple features from it, and predict a single value, the revenue of the movie.

*The performance of our model will be measured using root mean square error (RMSE)*

![](https://miro.medium.com/max/966/1*lqDsPkfXPGen32Uem1PTNg.png)
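As a minimal sketch of the metric, RMSE is the square root of the mean squared difference between predictions and labels. The helper name `rmse` and the numbers below are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical revenues (in millions) and predictions, for illustration only
example_true = [10.0, 20.0, 30.0]
example_pred = [12.0, 18.0, 33.0]
example_rmse = rmse(example_true, example_pred)
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.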

# Get the data

> Load the data we will use for training and testing our model

In [None]:
import numpy as np
import pandas as pd

# Load training and test data using pandas
train_data = pd.read_csv('/kaggle/input/tmdb-box-office-prediction/train.csv')
test_data = pd.read_csv('/kaggle/input/tmdb-box-office-prediction/test.csv')

> It seems we do not have much data to go around, so we will need to engineer some more features

In [None]:
train_data.shape, test_data.shape

> Let's take a quick look at the data 🧐

In [None]:
train_data.dtypes

In [None]:
train_data.head()

In [None]:
train_data.info()

In [None]:
train_data.describe()

# Discover and Visualize Data

> To get some insight from the data, we need to take a look at it

In [None]:
%matplotlib inline 

import matplotlib.pyplot as plt

train_data.drop('id', axis=1).hist(bins=50, figsize=(20, 15))
plt.show()

In [None]:
train_data.drop('id', axis=1).plot(kind='scatter', x='budget', y='popularity', alpha=0.4,
                                  s=train_data['runtime'], label='runtime', figsize=(10, 7),
                                  c='revenue', cmap=plt.get_cmap('jet'), colorbar=True)

> Popularity and budget seem to be important in relation to the revenue

In [None]:
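# Restrict to numeric columns; recent pandas versions raise on non-numeric data in corr()
corr_matrix = train_data.select_dtypes(include='number').corr()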
corr_matrix['revenue'].sort_values(ascending=False)

# Feature Engineering

> The numerical data available in the dataset is very limited, but feature engineering lets us create new features for the ML model to train on
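As a rough sketch of the idea, a text cell that holds a stringified list of dicts can be parsed with `literal_eval` and then turned into numeric features such as counts. The sample strings below are hypothetical and only mimic the shape of such a column.

```python
from ast import literal_eval
from collections import Counter

# Hypothetical raw cell values, shaped like a stringified list of dicts
raw_genres = [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "[{'id': 18, 'name': 'Drama'}]",
    None,  # missing value
]

# Parse each string into Python objects; treat missing values as empty lists
parsed = [literal_eval(value) if value is not None else [] for value in raw_genres]

# Derive simple numeric features from the parsed structure
genre_counts = [len(items) for items in parsed]
name_frequency = Counter(item['name'] for items in parsed for item in items)
most_common_name = name_frequency.most_common(1)[0][0]
```

The per-row counts and the most frequent names are the kind of values the transformers in this section turn into columns.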

In [None]:
from ast import literal_eval
from sklearn.base import BaseEstimator, TransformerMixin

class TextToDictTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, features):
        self.features = features
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        for column in self.features:
            X[column] = X[column].apply(lambda x: {} if pd.isna(x) else literal_eval(x))
        return X

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class BooleanTransformer(BaseEstimator, TransformerMixin):
    '''
    Transform binary values to boolean
    '''
    def __init__(self, features):
        self.features = features
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        try:
            for column in self.features:
                X[column] = X[column].apply(lambda x: 1 if x != {} and pd.isna(x) == False else 0)
        except Exception as ex:
            print("Boolean transformer error:", ex)
        return X

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class OneHotTransformer(BaseEstimator, TransformerMixin):
    '''
    One hot objects using the name attribute of the dict
    '''
    def __init__(self, features, top_values):
        self.features = features
        self.top_values = top_values
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        try:
            i = 0
            for feature in self.features:
                for name in self.top_values[i]:
                    X[f'{feature}_{name}'] = X[feature].apply(lambda x: 1 if name in str(x) else 0)
                i += 1
                    
            X = X.drop(self.features, axis=1)
        except Exception as ex:
            print("One hot transformer error:", ex)
        return X
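One thing to be aware of: the check `name in str(x)` is a substring test on the whole stringified value, so a short name can also match inside a longer one. A hedged alternative, assuming each cell holds a list of dicts with a `'name'` key, compares names exactly. The helper `has_name` and the sample values are hypothetical.

```python
import pandas as pd

def has_name(items, name):
    """Return 1 if any dict in items has exactly this name, else 0."""
    return int(any(item.get('name') == name for item in items))

# Hypothetical parsed column: one list of dicts per row
toy = pd.DataFrame({'genres': [
    [{'name': 'Science Fiction'}],
    [{'name': 'Fiction'}],
    [],
]})

# Substring test, as in the transformer above
substring_flags = toy['genres'].apply(lambda x: 1 if 'Fiction' in str(x) else 0).tolist()
# Exact comparison on the 'name' attribute
exact_flags = toy['genres'].apply(lambda x: has_name(x, 'Fiction')).tolist()
```

Whether the difference matters depends on how often names overlap in the real columns, which is worth checking before switching.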

In [None]:
from collections import Counter
from sklearn.base import BaseEstimator, TransformerMixin
    

class CastTransformer(BaseEstimator, TransformerMixin):
    '''
    Create columns for most common names and characters and also the size of the whole cast
    as well as the size divided to different genders
    '''
    def __init__(self, top_cast_names, top_cast_chars):
        self.top_cast_names = top_cast_names
        self.top_cast_chars = top_cast_chars
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        try:
            X['cast_len'] = X['cast'].apply(lambda x: len(x) if x != {} else 0)
            
            for name in self.top_cast_names:
                X[f'cast_name_{name}'] = X['cast'].apply(lambda x: 1 if name in str(x) else 0)
                
            for name in self.top_cast_chars:
                X[f'cast_char_{name}'] = X['cast'].apply(lambda x: 1 if name in str(x) else 0)
            
            X['cast_gender_undef'] = X['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
            X['cast_gender_male'] = X['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
            X['cast_gender_female'] = X['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
            
            X = X.drop('cast', axis=1)
        except Exception as ex:
            print("Cast transformer error:", ex)
        return X

In [None]:
from collections import Counter
from sklearn.base import BaseEstimator, TransformerMixin


class CrewTransformer(BaseEstimator, TransformerMixin):
    '''
    Create columns for most common names, jobs and departments and also the size of the whole crew
    as well as the size divided to different genders
    '''
    def __init__(self, top_crew_names, top_crew_jobs, top_crew_departments):
        self.top_crew_names = top_crew_names
        self.top_crew_jobs = top_crew_jobs
        self.top_crew_departments = top_crew_departments
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        try:
            X['crew_len'] = X['crew'].apply(lambda x: len(x) if x != {} else 0)
            
            for name in self.top_crew_names:
                X[f'crew_name_{name}'] = X['crew'].apply(lambda x: 1 if name in str(x) else 0)
                
            for name in self.top_crew_jobs:
                X[f'crew_job_{name}'] = X['crew'].apply(lambda x: 1 if name in str(x) else 0)
                
            for name in self.top_crew_departments:
                X[f'crew_department_{name}'] = X['crew'].apply(lambda x: 1 if name in str(x) else 0)
            
            X['crew_gender_undef'] = X['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
            X['crew_gender_male'] = X['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
            X['crew_gender_female'] = X['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
            
            X = X.drop('crew', axis=1)
        except Exception as ex:
            print("Crew transformer error:", ex)
        return X

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class DateTransformer(BaseEstimator, TransformerMixin):        
    '''
    Breaks date to 3 different values for year, month and day
    '''
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        try:            
            X['year'] = pd.Series(pd.DatetimeIndex(X['release_date']).year)
            X['month'] = pd.Series(pd.DatetimeIndex(X['release_date']).month)
            X['day'] = pd.Series(pd.DatetimeIndex(X['release_date']).day)
            X = X.drop('release_date', axis=1)
        except Exception as ex:
            print("Date transformer pipeline error:", ex)
        return X
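A hedged variant of the same idea uses the `.dt` accessor, which keeps the frame's own index, so the new columns stay aligned even if rows were filtered or reordered earlier. The sample dates and the ISO format are assumptions for illustration; the real column may need a different `format` argument.

```python
import pandas as pd

# Hypothetical frame with a non-default index and ISO-formatted dates
toy = pd.DataFrame(
    {'release_date': ['2015-02-20', '1999-12-31']},
    index=[10, 42],
)

# Parse once, then read the components through the .dt accessor
dates = pd.to_datetime(toy['release_date'], format='%Y-%m-%d')
toy['year'] = dates.dt.year
toy['month'] = dates.dt.month
toy['day'] = dates.dt.day
toy = toy.drop('release_date', axis=1)
```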

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class FixRevenueTransformer(BaseEstimator, TransformerMixin):
    '''
    Fix some values that are problematic
    '''
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        try:
            X.loc[X['id'] == 16,'revenue'] = 192864          # Skinning
            X.loc[X['id'] == 90,'budget'] = 30000000         # Sommersby          
            X.loc[X['id'] == 118,'budget'] = 60000000        # Wild Hogs
            X.loc[X['id'] == 149,'budget'] = 18000000        # Beethoven
            X.loc[X['id'] == 313,'revenue'] = 12000000       # The Cookout 
            X.loc[X['id'] == 451,'revenue'] = 12000000       # Chasing Liberty
            X.loc[X['id'] == 464,'budget'] = 20000000        # Parenthood
            X.loc[X['id'] == 470,'budget'] = 13000000        # The Karate Kid, Part II
            X.loc[X['id'] == 513,'budget'] = 930000          # From Prada to Nada
            X.loc[X['id'] == 797,'budget'] = 8000000         # Welcome to Dongmakgol
            X.loc[X['id'] == 819,'budget'] = 90000000        # Alvin and the Chipmunks: The Road Chip
            X.loc[X['id'] == 850,'budget'] = 90000000        # Modern Times
            X.loc[X['id'] == 1112,'budget'] = 7500000        # An Officer and a Gentleman
            X.loc[X['id'] == 1131,'budget'] = 4300000        # Smokey and the Bandit   
            X.loc[X['id'] == 1359,'budget'] = 10000000       # Stir Crazy 
            X.loc[X['id'] == 1542,'budget'] = 1              # All at Once
            X.loc[X['id'] == 1570,'budget'] = 15800000       # Crocodile Dundee II
            X.loc[X['id'] == 1571,'budget'] = 4000000        # Lady and the Tramp
            X.loc[X['id'] == 1714,'budget'] = 46000000       # The Recruit
            X.loc[X['id'] == 1721,'budget'] = 17500000       # Cocoon
            X.loc[X['id'] == 1865,'revenue'] = 25000000      # Scooby-Doo 2: Monsters Unleashed
            X.loc[X['id'] == 2268,'budget'] = 17500000       # Madea Goes to Jail budget
            X.loc[X['id'] == 2491,'revenue'] = 6800000       # Never Talk to Strangers
            X.loc[X['id'] == 2602,'budget'] = 31000000       # Mr. Holland's Opus
            X.loc[X['id'] == 2612,'budget'] = 15000000       # Field of Dreams
            X.loc[X['id'] == 2696,'budget'] = 10000000       # Nurse 3-D
            X.loc[X['id'] == 2801,'budget'] = 10000000       # Fracture
            X.loc[X['id'] == 3889,'budget'] = 15000000       # Colossal
            X.loc[X['id'] == 6733,'budget'] = 5000000        # The Big Sick
            X.loc[X['id'] == 3197,'budget'] = 8000000        # High-Rise
            X.loc[X['id'] == 6683,'budget'] = 50000000       # The Pink Panther 2
            X.loc[X['id'] == 5704,'budget'] = 4300000        # French Connection II
            X.loc[X['id'] == 6109,'budget'] = 281756         # Dogtooth
            X.loc[X['id'] == 7242,'budget'] = 10000000       # Addams Family Values
            X.loc[X['id'] == 7021,'budget'] = 17540562       #  Two Is a Family
            X.loc[X['id'] == 5591,'budget'] = 4000000        # The Orphanage
            X.loc[X['id'] == 4282,'budget'] = 20000000       # Big Top Pee-wee

            # Rows with a real budget but a revenue below 100 are treated as recorded in millions
            power_six = (X['budget'] > 1000) & (X['revenue'] < 100)
            X.loc[power_six, 'revenue'] = X.loc[power_six, 'revenue'] * 1000000
                
            return X
        
        except Exception as ex:
            print("Fix revenue transformer error:", ex)

In [None]:
from collections import Counter
from sklearn.base import BaseEstimator, TransformerMixin

class DropFeaturesTransformer(BaseEstimator, TransformerMixin):
    '''
    Drop unwanted features
    '''
    def __init__(self, features):
        self.features = features
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        try:
            return X.drop(self.features, axis=1)
        except Exception as ex:
            print("Drop features transformer error:", ex)

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin

class TrainTestTransformer(BaseEstimator, TransformerMixin):        
    def __init__(self, impute=False, normalize=False):
        self.impute = impute
        self.normalize = normalize
        
    def fit(self, X, y=None):
        
        if self.impute:
            X = X.fillna(X.median())
    
        self.X = X.drop('revenue', axis=1)    
        self.y = X['revenue']
        
        if self.normalize:
            self.X = MinMaxScaler().fit_transform(self.X)
        
        return self
    
    def transform(self, X):
        return train_test_split(self.X, self.y, test_size=0.10)

In [None]:
def top_values(X, column, attribute):
    '''
    Find the most common values for a column in the dataset
    '''
    try:
        values = X[column].apply(lambda x: [i[attribute] for i in x] if x != {} else []).values
        top_values = Counter([j for i in values for j in i]).most_common(30)
        top_values = [i[0] for i in top_values]
        return top_values
    except Exception as ex:
        print(ex)

# Prepare Data for Machine Learning algorithms

> Creating pipelines will allow us to handle data easily and make model testing and selection easier
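To illustrate how custom transformers chain together, here is a minimal sketch on a toy frame. The classes `AddRatio` and `DropColumns` and the toy columns are hypothetical and are not the transformers defined above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class AddRatio(BaseEstimator, TransformerMixin):
    """Hypothetical step: add a column with the ratio of two existing columns."""
    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's frame
        X['ratio'] = X[self.numerator] / X[self.denominator]
        return X

class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical step: drop the listed columns."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)

toy = pd.DataFrame({'id': [1, 2], 'budget': [10.0, 20.0], 'runtime': [100.0, 80.0]})

# Each step receives the output of the previous one
toy_pipeline = Pipeline([
    ('add_ratio', AddRatio('budget', 'runtime')),
    ('drop_columns', DropColumns(['id'])),
])
result = toy_pipeline.fit_transform(toy)
```

Copying inside `transform` is a design choice of this sketch: it leaves the input frame untouched, so the same pipeline can be re-run on it safely.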

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


text_to_dict = ['belongs_to_collection', 'genres', 'production_companies',
                'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

boolean_features = ['homepage', 'belongs_to_collection']

one_hot_objects = ['genres', 'production_countries', 'spoken_languages', 'production_companies']

drop_features = ['id',
                'original_language',
                'Keywords',
                'imdb_id',
                'status',
                'poster_path', 
                'original_title',
                'overview',
                'tagline',
                'title'
                ]


engineered_data = TextToDictTransformer(text_to_dict).transform(train_data.copy())

one_hot_top_values = [top_values(engineered_data, i, 'name')  for i in one_hot_objects]

feature_engineering_pipeline = Pipeline([
    ('boolean_transformer', BooleanTransformer(boolean_features)), 
    ('one_hot_transformer', OneHotTransformer(one_hot_objects, one_hot_top_values)),
    ('date_transformer', DateTransformer()),
    ('cast_transformer', CastTransformer(top_values(engineered_data, 'cast', 'name'),
                                         top_values(engineered_data, 'cast', 'character'))),
    ('crew_transformer', CrewTransformer(top_values(engineered_data, 'crew', 'name'),
                                         top_values(engineered_data, 'crew', 'job'),
                                         top_values(engineered_data, 'crew', 'department'))),
    ('fix_revenue_transformer', FixRevenueTransformer()),
    ('drop_features_transformers', DropFeaturesTransformer(drop_features)),
])

engineered_data = feature_engineering_pipeline.fit_transform(engineered_data)
X_train, X_valid, y_train, y_valid = TrainTestTransformer(impute=True, normalize=False).fit_transform(engineered_data)

In [None]:
print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_valid', X_valid.shape)
print('y_valid', y_valid.shape)

# Select and Train a Model

> Now that the pipeline is ready and the data is prepared, it is time to train our regression models and select the one that is best at predicting movie revenues
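The cells in this section fit on `np.log1p(y_train)` and score against `np.log1p(y_valid)`, so the reported RMSE lives on the log scale. A small sketch with made-up revenues shows the round trip through `np.expm1` and why the log scale treats relative errors similarly for small and large movies.

```python
import numpy as np

# Hypothetical revenues spanning several orders of magnitude
revenue = np.array([1_000.0, 1_000_000.0, 500_000_000.0])

log_revenue = np.log1p(revenue)   # log(1 + x), defined for zero revenue too
restored = np.expm1(log_revenue)  # inverse transform back to the original scale

# A prediction that is off by a factor of 2 costs about the same on the
# log scale whether the movie is small or large
small_error = abs(np.log1p(2 * revenue[0]) - log_revenue[0])
large_error = abs(np.log1p(2 * revenue[2]) - log_revenue[2])
```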

**Linear regression model**

In [None]:
sample_data = X_valid[:]
sample_labels = np.log1p(y_valid[:])


num_models = 5
lin_reg_models = []
tree_reg_models = []
forest_reg_models = []

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

for i in range(num_models):
    lin_reg = LinearRegression()
    lin_reg.fit(X_train, np.log1p(y_train))

    preds = lin_reg.predict(sample_data)
    lin_mse = mean_squared_error(sample_labels, preds)
    lin_rmse = np.sqrt(lin_mse)
    
    lin_reg_models.append((lin_reg, lin_rmse))

**Decision Tree model**

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

for i in range(num_models):
    tree_reg = DecisionTreeRegressor()
    tree_reg.fit(X_train, np.log1p(y_train))
    
    preds = tree_reg.predict(sample_data)
    tree_mse = mean_squared_error(sample_labels, preds)
    tree_rmse = np.sqrt(tree_mse)
    
    tree_reg_models.append((tree_reg, tree_rmse))

**Random Forest Regressor**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

for i in range(num_models):
    forest_reg = RandomForestRegressor(n_estimators=100)
    forest_reg.fit(X_train, np.log1p(y_train))

    preds = forest_reg.predict(sample_data)
    forest_mse = mean_squared_error(sample_labels, preds)
    forest_rmse = np.sqrt(forest_mse)
    
    forest_reg_models.append((forest_reg, forest_rmse))

In [None]:
res = pd.DataFrame({'Linear': [i[1] for i in lin_reg_models], 'Tree': [i[1] for i in tree_reg_models], 'Forest': [i[1] for i in forest_reg_models]})
res

# The best model is "Random Forest"

> The best model is produced using random forest. Now let's use grid search to try different hyperparameters so we can get even better results

In [None]:
import seaborn as sns

def plot_feature_importance(feature_columns, importance_values):
    feature_imp = [col for col in zip(feature_columns, importance_values)]
    feature_imp.sort(key=lambda x:x[1], reverse=True)

    imp = pd.DataFrame(feature_imp[0:20], columns=['feature', 'importance'])
    plt.figure(figsize=(10, 8))
    sns.barplot(y='feature', x='importance', data=imp)
    plt.title('20 Most Important Features', fontsize=16)
    plt.ylabel("Feature", fontsize=16)
    plt.xlabel("")
    plt.show()
    return imp

In [None]:
# Get the best model
best_forest_model, best_forest_model_rmse = sorted(forest_reg_models, key=lambda x: x[1])[0]
model_importances = plot_feature_importance(X_train.columns, best_forest_model.feature_importances_)

**Grid Search**

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'bootstrap': [False], 'n_estimators': [200, 250, 300], 'max_features': [60, 80, 100]},
    {'oob_score': [True, False], 'n_estimators': [150, 180, 200], 'max_features': [40, 50, 60]},
]

grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=10, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, np.log1p(y_train))
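Because the search is scored with `neg_mean_squared_error`, `best_score_` is a negated MSE. The sketch below, on synthetic data from `make_regression` with a deliberately tiny grid (both are assumptions chosen so it runs quickly), shows one way to convert it back to an RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the movie dataset
X_toy, y_toy = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

toy_grid = {'n_estimators': [10, 20], 'max_features': [2, 4]}
toy_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    toy_grid,
    cv=3,
    scoring='neg_mean_squared_error',
)
toy_search.fit(X_toy, y_toy)

# best_score_ is the mean negated MSE across folds, so flip the sign
# before taking the square root
cv_rmse = np.sqrt(-toy_search.best_score_)
best_params = toy_search.best_params_
```

This cross-validated figure is computed only from the training folds, so it can be compared with the validation RMSE as a sanity check.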

In [None]:
grid_search.best_estimator_

In [None]:
grid_search_model = grid_search.best_estimator_

preds = grid_search_model.predict(sample_data)
grid_mse = mean_squared_error(sample_labels, preds)
grid_rmse = np.sqrt(grid_mse)
grid_rmse

In [None]:
importances = plot_feature_importance(X_train.columns, grid_search_model.feature_importances_)

> Compare our best forest model with the results of grid search, and use the best model

In [None]:
print('Random Forest model selected' if best_forest_model_rmse < grid_rmse else 'Grid search model selected')
model = best_forest_model if best_forest_model_rmse < grid_rmse else grid_search_model

*Note: The RMSE below is lower because the model is re-trained on the whole training set, which now includes the validation samples it is scored on. The score is therefore optimistic and says nothing about generalization, but the extra training data will come in handy when making predictions on the test set*

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
X, y = pd.concat([X_train, X_valid]), pd.concat([y_train, y_valid])

model.fit(X, np.log1p(y))

preds = model.predict(sample_data)
model_mse = mean_squared_error(sample_labels, preds)
model_rmse = np.sqrt(model_mse)
model_rmse
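The effect described in the note can be reproduced on synthetic data: scoring a forest on rows it was fitted on gives a much lower error than scoring it on held-out rows. The data from `make_regression` and all variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the movie dataset
X_toy, y_toy = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=0)
X_fit, X_held, y_fit, y_held = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Scoring on rows the model was fitted on looks better than it should
seen_rmse = np.sqrt(mean_squared_error(y_fit, forest.predict(X_fit)))
# Rows held out from fitting give the more honest estimate
held_out_rmse = np.sqrt(mean_squared_error(y_held, forest.predict(X_held)))
```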

# Predictions

> Create the submission file for the competition

In [None]:
test_data = TextToDictTransformer(text_to_dict).transform(test_data.copy())
# Reuse the pipeline fitted on the training data instead of refitting it on the test set
engineered_test_data = feature_engineering_pipeline.transform(test_data.copy())

etg = engineered_test_data.drop('revenue', axis=1)
etg = etg.fillna(etg.median())

preds = model.predict(etg)

submission = test_data.copy()

submission['revenue'] = np.expm1(preds)
submission['revenue'] = submission['revenue'].astype(str)

submission[['id', 'title', 'budget', 'popularity', 'revenue']].head(50)
submission[['id', 'revenue']].to_csv('submission.csv', index=False)