<a href="https://www.kaggle.com/code/lucasaresin/nobody-knows-is-movie-success-unknowable?scriptVersionId=144680026" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Nobody knows
A principle in many creative industries is the 'nobody knows' principle, also known as demand uncertainty. It expresses the inability to predict a product's success (for movies, at the box office) because it is very difficult to know which combination of factors will result in a product that people want. At first glance, it seems intuitively correct: big movies with star power and a marketing war machine can fail miserably, while obscure independent projects can rise unexpectedly with viral intensity.

Can big data and data science help us confirm or disprove this idea? And how close can we get to predicting a movie's box office success?

# Feature selection
In reality, when we want to predict a movie's success, we want to do it before the movie releases, or ideally before it is green-lit. As such we have to confine ourselves to data that we would have prior to release. For example, user ratings might be a great predictor, but not something available to us before we decide if we want to finance a project.

# Import libraries and data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import ast
from tqdm import tqdm
from regex import regex
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor

sns.set()

In [None]:
movies_metadata = pd.read_csv('/kaggle/input/the-movies-dataset/movies_metadata.csv', low_memory = False)
credits = pd.read_csv('/kaggle/input/the-movies-dataset/credits.csv', low_memory = False)
keywords = pd.read_csv('/kaggle/input/the-movies-dataset/keywords.csv', low_memory = False)
# links = pd.read_csv('/kaggle/input/the-movies-dataset/links.csv', low_memory = False)
# links_small = pd.read_csv('/kaggle/input/the-movies-dataset/links_small.csv', low_memory = False)
# ratings = pd.read_csv('/kaggle/input/the-movies-dataset/ratings.csv', low_memory = False)
# ratings_small = pd.read_csv('/kaggle/input/the-movies-dataset/ratings_small.csv', low_memory = False)

In [None]:
movies_metadata.shape

# Features we would know before release

Let's get all columns first so we can decide what we would and wouldn't know beforehand:

In [None]:
movies_metadata.columns

| Column | Knowledge |
|-|-|
| 'adult' | |
| 'belongs_to_collection' | |
|'budget' | |
| 'genres' | |
| 'homepage' | drop |
| 'id' | drop |
| 'imdb_id' | drop |
| 'original_language' | |
| 'original_title' | drop |
| 'overview' | drop |
| 'popularity' | drop |
| 'poster_path' | drop |
| 'production_companies' | |
| 'production_countries' | |
| 'release_date' | year only |
| 'revenue' | target |
| 'runtime' | |
| 'spoken_languages' | |
| 'status' | |
| 'tagline' | drop |
| 'title' | drop |
| 'video' | drop |
| 'vote_average' | drop |
| 'vote_count' | drop |

In [None]:
pre_release_features = [
    'id',
    'adult',
    'belongs_to_collection',
    'budget',
    'genres',
    'original_language',
    'production_companies',
    'production_countries',
    'release_date',
    'runtime',
    'spoken_languages',
    'status',
    'revenue'
]

movies_df = movies_metadata[pre_release_features].copy()

In [None]:
movies_df.head()

# Remove null values
The remaining columns might contain null values, so let's check!

In [None]:
movies_df.info()

In [None]:
def check_for_nulls(dataframe):
    return dataframe.isna().sum().sort_values(ascending=False)

In [None]:
check_for_nulls(movies_df)

In [None]:
movies_df[movies_df['spoken_languages'].isna()]

In [None]:
movies_df[movies_df['production_companies'].isna()]

In [None]:
movies_df[movies_df['runtime'].isna()]

Let's see what we can do about the missing values. Looking at the overall analysis, with one exception, each feature with missing data is missing less than 0.5 % of total observations, so we can probably drop those rows entirely and save ourselves some time.

| Column | Knowledge |
|-|-|
| 'belongs_to_collection' | Here, the 'missing values' are simple np.nan values wherever the movie does not belong to a collection. To simplify, we turn this into a boolean category, where NaN -> 0, and not NaN -> 1 |
| 'original_language' | Drop |
| 'production_companies' | Drop |
| 'production_countries' | Drop |
| 'release_date' | Drop |
| 'revenue' | Drop |
| 'runtime' | Drop |
| 'spoken_languages' | Drop |
| 'status' | Drop |
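
The share of missing values per column can be checked directly rather than read off the raw counts. The following is a minimal sketch on a made-up toy frame; `toy_df` and `missing_share` are hypothetical names that are not part of the notebook, and the same function could be called on `movies_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movies_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "runtime": [90.0, np.nan, 120.0, 101.0],
    "status": ["Released", "Released", None, "Released"],
    "budget": [10, 20, 30, 40],
})

def missing_share(dataframe):
    """Return the percentage of missing values per column, largest first."""
    return (dataframe.isna().mean() * 100).sort_values(ascending=False)

shares = missing_share(toy_df)
print(shares.round(1))
```

Columns whose share stays well below one percent are reasonable candidates for simply dropping the affected rows.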

While inspecting the columns with missing values, we notice that some rows are malformed. For example, at index 19730 the 'adult' column contains free text instead of a flag. That cannot be right and we will have to deal with it.
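
Rows like this can be located systematically instead of by eye. The sketch below assumes the flag column was read as text (the strings `'True'` and `'False'`), which is an assumption about how the CSV was parsed. The toy values, including the stray text, are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for movies_df; the third row mimics a malformed record.
toy_df = pd.DataFrame({
    "adult": ["False", "True", "some stray description text", "False"],
    "budget": ["100", "200", None, "300"],
})

# Anything that is not one of the two expected flag values is suspicious.
valid_flags = {"True", "False"}
malformed = toy_df[~toy_df["adult"].astype(str).isin(valid_flags)]
print(malformed)
```

If every malformed row also has missing values in the columns we drop, removing those rows takes care of the problem as a side effect.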

In [None]:
nan_to_drop = [
    'original_language',
    'production_companies',
    'production_countries',
    'release_date',
    'revenue',
    'runtime',
    'spoken_languages',
    'status'
]

# Drop every row that has a missing value in any of the listed columns
movies_df = movies_df.dropna(subset=nan_to_drop).copy()
movies_df.info()

In [None]:
# 'collection' is 1 if the movie is part of a collection (a string entry), otherwise 0 (NaN)
movies_df['collection'] = movies_df['belongs_to_collection'].apply(lambda x: 1 if isinstance(x, str) else 0)
movies_df = movies_df.drop('belongs_to_collection',axis=1).copy()

In [None]:
check_for_nulls(movies_df)

# Cast columns to correct datatypes

In [None]:
type_dict = {
    'adult':'category',
    'budget':'int32',
    'genres':'string',
    'original_language':'category',
    'production_companies':'string',
    'production_countries':'string',
    'runtime':'int32',
    'spoken_languages':'string',
    'status':'category',
    'revenue':'int64',
    'collection':'int16'
}

In [None]:
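# Cast each column to its target data type and report where that fails

for i, term in enumerate(type_dict):
    try:
        movies_df[term] = movies_df[term].astype(type_dict[term])
    except (ValueError, TypeError) as err:
        # Print the failing column together with the reason
        print(i, term, err)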

Seems like removing the missing values also took care of the problematic data. Great!

In [None]:
movies_df.info()

# Deal with categorical data in strings
The columns **genres**, **production_companies** and **spoken_languages** are encoded in json-like lists. We want to turn those into actual lists so we may use them later.
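
As a minimal sketch of what this parsing does, the example below runs it on a single sample string. The string is an illustrative value in the same shape as the dataset's entries, and `extract_field` is a hypothetical helper name. `ast.literal_eval` is used because the entries use single quotes, which are not valid JSON.

```python
import ast

# Illustrative sample in the same shape as an entry of the 'genres' column.
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def extract_field(raw, key="name"):
    """Parse a stringified list of dicts and return the values stored under `key`."""
    return [entry[key] for entry in ast.literal_eval(raw)]

genre_names = extract_field(raw_genres)
empty_names = extract_field("[]")  # movies without any entry yield an empty list
print(genre_names)
```

For the language column the same helper would be called with `key="iso_639_1"`.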

In [None]:
literal_eval = lambda x: [x['name'] for x in ast.literal_eval(x)]
movies_df['genre_name'] = movies_df['genres'].apply(literal_eval)
movies_df['prod_comp_names'] = movies_df['production_companies'].apply(literal_eval)

literal_eval_lang = lambda x: [x['iso_639_1'] for x in ast.literal_eval(x)]
movies_df['lang'] = movies_df['spoken_languages'].apply(literal_eval_lang)

In [None]:
movies_df = movies_df.drop(['genres','production_companies','production_countries','spoken_languages','original_language'],axis=1).copy()

Deal with the release date column:

In [None]:
movies_df['release_date'] = pd.to_datetime(movies_df['release_date'])
movies_df['year'] = movies_df['release_date'].dt.year
movies_df['month'] = movies_df['release_date'].dt.month
movies_df = movies_df.drop('release_date',axis=1).copy()

In [None]:
# 'adult' may hold the strings 'True'/'False' rather than booleans, so compare as text
movies_df['adult'] = movies_df['adult'].astype(str).eq('True').astype(int)
movies_df.head()

# Preliminary analysis

In [None]:
sns.pairplot(movies_df)

In [None]:
movies_df[list(movies_df.describe().columns)].corr()

We find that budget and revenue are more closely related than perhaps expected, whereas none of the other numeric columns shows a significant correlation with revenue. This leads us to think that a very simple linear regression might already yield some results. Later it would be interesting to see if more complex models can capitalize on some of the categorical variables ...

# Dummies

In [None]:
movies_df2 = movies_df.copy()

In [None]:
dummy_columns = ['genre_name','prod_comp_names','lang']

In [None]:
# Collect the distinct values of each list column (avoid shadowing the built-in `list`)
unique_genres = list({name for names in movies_df2['genre_name'] for name in names})
unique_prod_comps = list({name for names in movies_df2['prod_comp_names'] for name in names})
unique_langs = list({name for names in movies_df2['lang'] for name in names})

In [None]:
movies_df2 = movies_df2.join(movies_df2['genre_name'].str.join('|').str.get_dummies().add_prefix('genre_'))
# movies_df2 = movies_df2.join(movies_df2['prod_comp_names'].str.join('|').str.get_dummies().add_prefix('prod'))
movies_df2 = movies_df2.join(movies_df2['lang'].str.join('|').str.get_dummies().add_prefix('lang_'))
#movies_df2 = movies_df2.join(movies_df2['cast_list'].str.join('|').str.get_dummies().add_prefix('cast_'))
movies_df2.head()

In [None]:
movies_df2 = movies_df2.drop(['id','status','genre_name','lang','prod_comp_names'],axis=1)

# First try: A linear regression

In [None]:
linreg_df = movies_df2[['budget','runtime','revenue','year','month']].copy()

X = linreg_df.drop(['revenue'],axis=1)
y = linreg_df['revenue']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.75,
                                                    random_state=12,
                                                    shuffle=True)

In [None]:
reg = LinearRegression()
reg.fit(X_train,y_train)
r2 = reg.score(X_test,y_test)
print(f"Score: {r2}")
print(f"{np.round(r2 * 100,2)} %")

Using a simple linear regression, we achieve an R² of just below 0.60, meaning the model explains a little under 60% of the variance in revenue on the test set.
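
The value returned by `reg.score` is the coefficient of determination R², not a classification accuracy. As a small worked sketch with made-up numbers, R² and the root mean squared error can be computed by hand like this:

```python
import numpy as np

# Made-up targets and predictions, purely for illustration.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Residual sum of squares: what the model gets wrong.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: what a model that always predicts the mean gets wrong.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(r2, rmse)
```

An R² of 0 corresponds to always predicting the mean revenue, and 1 to a perfect fit, so the score says how much better than that baseline the model is rather than how often it is "right".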

In [None]:
y_preds = reg.predict(X_test)
reg_mse = mean_squared_error(y_test,y_preds)
reg_rmse = np.sqrt(reg_mse)
reg_rmse

Or, in absolute terms, a root mean squared error of about 36 million. No executive is going to be very happy with those predictions.

In [None]:
movies_df['revenue'].plot.hist()

Interestingly, we can see that almost all movies have very low revenue. This suggests trying a classification later: instead of predicting revenue, perhaps we have a shot at predicting whether the movie will break even.

# Decision Tree Regressor

In [None]:
X = movies_df2.drop(['revenue'],axis=1)
y = movies_df2['revenue']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=91,
                                                    shuffle=True)

In [None]:
tree = DecisionTreeRegressor(
    criterion='squared_error',
    max_depth=4)

In [None]:
tree.fit(X_train,y_train)

In [None]:
tree.score(X_test,y_test)

In [None]:
movies_preds = tree.predict(X_test)
tree_mse = mean_squared_error(y_test,movies_preds)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

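We don't see any improvement: the root mean squared error is about 40 million. Note that this split uses a different `random_state` and more features than the linear regression above, so the two errors are not measured on the same test set.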

# Random Forest Regressor

In [None]:
forest_reg = RandomForestRegressor()

In [None]:
forest_reg.fit(X_train,y_train)
forest_reg.score(X_test,y_test)

In [None]:
y_test_forest = forest_reg.predict(X_test)
forest_mse = mean_squared_error(y_test,y_test_forest)
forest_rmse = np.sqrt(forest_mse)

In [None]:
forest_rmse

In [None]:
param_grid = [
    {'n_estimators': [100,500,1500], 'max_features':[4,8]},
]

In [None]:
grid_search = GridSearchCV(
    estimator = forest_reg,
    param_grid = param_grid,
    scoring = "neg_mean_squared_error",
    cv = 3,
    return_train_score = True,
    verbose=10
)

In [None]:
grid_search.fit(X_train,y_train)

In [None]:
best_params = grid_search.best_params_

In [None]:
np.sqrt(-grid_search.best_score_)

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

In [None]:
forest_cv = RandomForestRegressor(n_estimators=best_params['n_estimators'],max_features=best_params['max_features'])
forest_cv.fit(X_train,y_train)
forest_r2 = forest_cv.score(X_test,y_test)
print(f"Score for this random forest: {forest_r2}")

# Classifier

Given the difficulty predicting movie box office success accurately, perhaps the next best thing is to classify a movie by the ability to break even or not. In this case we will do a very simple calculation where we classify a movie as break even when the revenue outperforms the budget. In reality, this is more complicated due to the different ways in which movies are financed. In a real-life business case, this would have to be taken into account, considering equity and non-equity financing, but for this example we will go with the simple solution.

While it is not as satisfying as predicting an actual number with great accuracy, determining if we will make our money back is at least something executives might be interested in.
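
The simple break-even rule described above can be sketched on a few made-up rows as follows. The figures in `toy_df` are invented for illustration. The comparison is strict, so a movie whose revenue exactly equals its budget is labelled 0.

```python
import pandas as pd

# Made-up budgets and revenues, purely for illustration.
toy_df = pd.DataFrame({
    "budget":  [30_000_000, 5_000_000, 80_000_000],
    "revenue": [90_000_000, 5_000_000, 20_000_000],
})

toy_df["profit"] = toy_df["revenue"] - toy_df["budget"]
# 1 if revenue strictly exceeds budget, otherwise 0
toy_df["break_even"] = (toy_df["profit"] > 0).astype(int)

# Share of positive labels; useful context when reading a classifier's accuracy.
positive_share = toy_df["break_even"].mean()
print(toy_df)
```

Checking the share of positive labels is worthwhile because a classifier that always predicts the majority class already reaches that share (or its complement) as accuracy.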

In [None]:
movies_df3 = movies_df2.copy()

In [None]:
movies_df3['profit'] = np.subtract(movies_df3['revenue'],movies_df3['budget'])
movies_df3['break_even'] = movies_df3['profit'].apply(lambda x: 1 if x > 0 else 0)
movies_df3.head(15)

In [None]:
X = movies_df3.drop(['revenue','profit','break_even'],axis=1)
y = movies_df3['break_even']

X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75,random_state=90,shuffle=True)

treec = DecisionTreeClassifier()
treec.fit(X_train,y_train)
treec.score(X_test,y_test)

In [None]:
# Reuse the tree fitted above; refitting could produce a different tree than the one scored
feature_importances = treec.feature_importances_

In [None]:
feature_imp = pd.DataFrame()
feature_imp['feature_importances'] = feature_importances
feature_imp = feature_imp.set_index(X_train.columns)
feature_imp.sort_values(by='feature_importances',ascending=False).head(10)

# Conclusion
It seems the 'nobody knows' principle is alive and well. Even with machine learning techniques and big data available, predictions struggle to become meaningful enough to support clear business decisions. This seems to be supported by academic literature on the topic. For example, in the paper **A Machine Learning Approach to Predict Movie Box-Office Success** [1], using only pre-release features, the authors' neural network achieves only 68% accuracy.

## Results
My own modeling produced similar results:

|Model|Metric|Score|
|-|-|-|
|Linear Regression|R²|0.59|
|Decision Tree Regressor|R²|0.61|
|Random Forest Regressor|R²|0.65|
|Decision Tree Classifier|Accuracy|0.84|

Predicting the exact revenue of a movie based on pre-release features was possible with an R² of about 0.65 and an RMSE of ~40m.
Not unexpectedly, classifying the movies simply by break-even or not was more successful at 84% accuracy, but naturally much less helpful in determining movie success.

### Correlation vs Causation
Looking at the correlation between the features and revenue, we can quickly understand why our models underperform:
* The highest correlation is with **Budget**, but higher-budget movies tend to be blockbusters with big releases and big marketing spend, so this is to be expected. Perhaps it's surprising that the correlation is as high as it is.
* Next we have **Collection**, which makes sense because only movies successful enough to get a sequel end up in this category. It could be interesting to compare the average revenue of first installments and their respective sequels.
* After that comes **Year**, but since **Budget** is increasing over time and we see a high correlation there, this is to be expected as well.
* **Runtime** plays a role as well, but the correlation coefficients quickly fade into the single-digit percent range after that.

## Limitations
- When looking at break-even, I simply subtracted budget from revenue. In reality, this calculation is much more complicated and takes into account equity and non-equity funding, marketing spend, as well as the long tail with various VOD and disc releases. However, I still wanted to give it a try with the data available.
- More features are technically available, such as the cast and the production companies, which could plausibly improve the results. However, recreating these features as one-hot-encoded dummies exceeded the memory limits of this notebook. It would be very interesting to take into account the **star power** of any given actor by looking at the revenue of this actor's filmography in isolation. While this would introduce multicollinearity, it is a feature we would feasibly have available before release, so we should use it.
- I attempted to replicate the **star power** index from the paper cited below [1]. However, calculating the aggregated amounts for all movies would have taken 6 to 8 hours and was again beyond the notebook's capacity.
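
As a rough illustration of the star power idea, the sketch below computes, for each movie, the average revenue of its cast members' earlier films. This is not the index from the cited paper and not something the notebook ran: the `films` frame, its column names and all values are hypothetical. Only earlier films are used so that the feature would be known before release.

```python
import pandas as pd

# Hypothetical toy data: one row per movie with its cast as a list.
films = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "year": [2000, 2002, 2005, 2007],
    "revenue": [100.0, 300.0, 50.0, 200.0],
    "cast": [["A", "B"], ["A"], ["B", "C"], ["A", "C"]],
})

# One row per (movie, actor), ordered chronologically within each actor.
long = (films.explode("cast")
             .sort_values(["cast", "year"])
             .reset_index(drop=True))

# For each actor, the mean revenue of strictly earlier films (NaN for a debut).
long["prior_mean"] = (
    long.groupby("cast")["revenue"]
        .transform(lambda s: s.shift().expanding().mean())
)

# Average over the cast of each movie; actors without history are skipped.
star_power = long.groupby("movie_id")["prior_mean"].mean()
print(star_power)
```

Because the computation works on one long table with grouped operations instead of a wide one-hot matrix, it may also be lighter on memory, though that has not been tested on the full dataset here.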

Source:
1. A Machine Learning Approach to Predict Movie Box-Office Success. https://dspace.bracu.ac.bd/xmlui/bitstream/handle/10361/9015/13301028,13301019_CSE.pdf?sequence=1