## General information

In this kernel I'm working with data from the TMDB Box Office Prediction Challenge. The film industry is booming and revenues are growing, so we have a lot of data about films. Can we build models that accurately predict film revenues? Could these models be used to make changes to movies that increase their revenues even further? I'll try to answer these questions in my kernel!


## Content

* [1 Data loading and overview](#data_loading)
* [1.1 belongs_to_collection](#btoc)
* [1.2 genres](#genres)
* [1.3 Production companies](#production_companies)
* [1.4 Production countries](#production_countries)
* [1.5 Spoken languages](#lang)
* [1.6 Keywords](#keywords)
* [1.7 Cast](#cast)
* [1.8 Crew](#crew)
* [2 Data exploration](#de)
* [2.1 Target](#target)
* [2.2 Budget](#budget)
* [2.3 Homepage](#homepage)
* [2.4 Original language](#or_lang)
* [2.5 Original title](#or_title)
* [2.6 Overview](#overview)
* [2.7 Popularity](#popularity)
* [2.8 Release date](#release_date)
* [2.9 Runtime](#runtime)
* [2.10 Status](#status)
* [2.11 Tagline](#tagline)
* [2.12 Collections](#collections)
* [2.13 Genres](#genres_)
* [2.14 Production companies](#prod_comp)
* [2.15 Production countries](#prod_count)
* [2.16 Cast](#cast_viz)
* [2.17 Keywords](#key_viz)
* [2.18 Crew](#crew_viz)
* [3 Modelling and feature generation](#basic_model)
* [3.1 OOF features based on texts](#oof)
* [3.2 Additional feature generation](#add_feat)
* [3.3 Important features](#imp_feats)
* [3.4 External features](#ext_feats)
* [3.5 Blending](#blending)
* [3.6 Stacking](#stacking)

In [6]:
!pip install seaborn eli5


In [4]:
# Libraries

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
# import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
import datetime
# import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split, KFold
from wordcloud import WordCloud
from collections import Counter
#  from nltk.corpus import stopwords
#  from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
# stop = set(stopwords.words('english'))
import seaborn as sns

# DON'T DO THIS
from preprocess import *

import preprocess as pre

# from preprocess import (
#     preprocess_cast,
# )

In [7]:
import os
#  import plotly.offline as py
#  py.init_notebook_mode(connected=True)
#  import plotly.graph_objs as go
#  import plotly.tools as tls
#  import xgboost as xgb
#  import lightgbm as lgb
from sklearn import model_selection
from sklearn.metrics import accuracy_score
import json
import ast
import eli5
#  import shap
#  from catboost import CatBoostRegressor
from urllib.request import urlopen
from PIL import Image
from sklearn.preprocessing import LabelEncoder
import time
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn import linear_model

<a id="data_loading"></a>
## Data loading and overview

In [26]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# from this kernel: https://www.kaggle.com/gravix/gradient-in-a-box
dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
                'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

def text_to_dict(df):
    for column in dict_columns:
        df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x) )
    return df
        
train = text_to_dict(train)
test = text_to_dict(test)

train, test = belongs_to_collection_func(train, test)
train, test, list_of_genres, top_genres = preprocess_genders(train, test)

print('Number of production companies in films')
train, test, list_of_companies, top_companies = preprocess_companies(train, test)

train, test, list_of_countries, top_countries = preprocess_countries(train, test)

train, test, list_of_languages, top_languages = preprocess_languages(train, test)

train, test, list_of_keywords, top_keywords = preprocess_keywords(train, test)

train, test, list_of_cast_names, list_of_cast_names_url, list_of_cast_genders, list_of_cast_characters, top_cast_names, top_cast_characters = preprocess_cast(train, test)

train, test, list_of_crew_names_temp, list_of_crew_names, list_of_crew_names_url, list_of_crew_jobs, list_of_crew_genders, list_of_crew_departments, top_crew_names, top_crew_jobs = preprocess_crew(train, test)



Number of production companies in films


In [46]:
# train, test, data_dict = preprocess_crew(train, test)
# data_dict['list_of_crew_names_temp']

train, test = preprocess_homepage(train, test)

train['log_revenue'] = np.log1p(train['revenue'])
train['log_budget'] = np.log1p(train['budget'])
test['log_budget'] = np.log1p(test['budget'])

# EDA - Exploratory Data Analysis

In [None]:
test.loc[test['release_date'].isnull(), 'release_date'] = '01/01/98'

def fix_date(x):
    """
    Expands the two-digit year of an 'm/d/yy' date string to four digits:
    years up to 19 become 20xx, all other years become 19xx.
    """
    year = x.split('/')[2]
    if int(year) <= 19:
        return x[:-2] + '20' + year
    else:
        return x[:-2] + '19' + year

train['release_date'] = train['release_date'].apply(fix_date)
test['release_date'] = test['release_date'].apply(fix_date)
train['release_date'] = pd.to_datetime(train['release_date'])
test['release_date'] = pd.to_datetime(test['release_date'])
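To see what the date fix does, here is a small sketch on a few made-up date strings. The helper name `expand_two_digit_year` and its `cutoff` argument are my own illustration, not part of the notebook; the logic follows the same rule (years up to 19 become 20xx, the rest 19xx).

```python
import pandas as pd

def expand_two_digit_year(date_str, cutoff=19):
    """Hypothetical helper: expand 'm/d/yy' to 'm/d/yyyy' using a cutoff year."""
    month, day, year = date_str.split('/')
    century = '20' if int(year) <= cutoff else '19'
    return f'{month}/{day}/{century}{year}'

# Made-up sample dates in the same 'm/d/yy' layout as release_date
samples = ['2/20/15', '8/6/04', '3/9/87', '01/01/98']
fixed = [expand_two_digit_year(s) for s in samples]

# With four-digit years the parser no longer has to guess the century
parsed = pd.to_datetime(pd.Series(fixed), format='%m/%d/%Y')
print(fixed)
print(parsed.dt.year.tolist())
```

The cutoff of 19 only makes sense for a dataset that contains no films released after 2019, which is an assumption about this data.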

In [61]:
# creating features based on dates
def process_date(df):
    # the .dt accessor only works on datetime values, so convert first
    release_date = pd.to_datetime(df['release_date'])
    date_parts = ["year", "weekday", "month", 'weekofyear', 'day', 'quarter']
    for part in date_parts:
        part_col = 'release_date' + "_" + part
        if part == 'weekofyear':
            # .dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
            values = release_date.dt.isocalendar().week
        else:
            values = getattr(release_date.dt, part)
        df[part_col] = values.astype(int)
    
    return df

train = process_date(train)
test = process_date(test)

In [81]:
train = train.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status', 'log_revenue'], axis=1)
test = test.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status'], axis=1)

for col in train.columns:
    if train[col].nunique() == 1:
        print(col)
        train = train.drop([col], axis=1)
        test = test.drop([col], axis=1)

language_
cast_character_


`ast.literal_eval` turns a string such as `'[1, 2, 3]'` into the list `[1, 2, 3]`.
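A minimal sketch of that conversion on a made-up cell value. The `raw_genres` string below is only shaped like the stringified list-of-dicts columns; it is not copied from the dataset.

```python
import ast

# Hypothetical raw value, shaped like a stringified list of dicts in a CSV cell
raw_genres = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"

parsed = ast.literal_eval(raw_genres)      # str -> list of dicts
names = [g['name'] for g in parsed]        # pull out just the genre names

print(ast.literal_eval('[1, 2, 3]'))       # the string becomes a real list
print(names)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from a file.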

Drama, Comedy and Thriller are popular genres.

In [16]:
# Counter([i for j in list_of_genres for i in j]).most_common()

In [18]:
# Counter([i for j in list_of_companies for i in j]).most_common(30)

In [19]:
# print('Number of production countries in films')
# Counter([i for j in list_of_countries for i in j]).most_common(25)

Number of production countries in films


In [21]:
# Counter([i for j in list_of_languages for i in j]).most_common(15)

In [22]:
# print('Number of Keywords in films')
# list_of_keywords[:10]

Here we have some keywords describing films. Of course there can be a lot of them. Let's have a look at the most common ones.

In [23]:
# plt.figure(figsize = (16, 12))
# text = ' '.join(['_'.join(i.split(' ')) for j in list_of_keywords for i in j])
# wordcloud = WordCloud(max_font_size=None, background_color='black', collocations=False,
#                       width=1200, height=1000).generate(text)
# plt.imshow(wordcloud)
# plt.title('Top keywords')
# plt.axis("off")
# plt.show()

<a id="cast"></a>
### cast

In [24]:
# for i, e in enumerate(train['cast'][:1]):
#     print(i, e)

In [25]:
# print('Number of casted persons in films')
# list_of_cast_genders[:10]

The cast heavily affects the quality of a film. We have not only the name of each actor, but also the gender and the character name/type.

First, let's have a look at the most popular names.

In [27]:
# Counter([i for j in list_of_cast_names for i in j]).most_common(15)

In [28]:
# d = Counter([i for j in list_of_cast_names_url for i in j]).most_common(16)
# fig = plt.figure(figsize=(20, 12))
# for i, p in enumerate([j[0] for j in d]):
#     ax = fig.add_subplot(4, 4, i+1, xticks=[], yticks=[])
#     im = Image.open(urlopen(f"https://image.tmdb.org/t/p/w600_and_h900_bestv2{p[1]}"))
#     plt.imshow(im)
#     ax.set_title(f'{p[0]}')

In [None]:
# Counter([i for j in list_of_cast_genders for i in j]).most_common()

[(2, 27949), (0, 20329), (1, 13533)]

0 is unspecified, 1 is female, and 2 is male. (https://www.kaggle.com/c/tmdb-box-office-prediction/discussion/80983#475572)

In [29]:
# Counter([i for j in list_of_cast_characters for i in j]).most_common(15)

I think it is quite funny that the most popular male role is playing himself. :)

<a id="crew"></a>
### crew

In [30]:
# for i, e in enumerate(train['crew'][:1]):
#     print(i, e[:10])

In [31]:
# print('Number of crew members in films')
# list_of_crew_names[:10]

A great crew is very important in creating a film. We have not only the names of the crew members, but also their genders, jobs and departments.

First, let's have a look at the most popular names.

In [32]:
# Counter([i for j in list_of_crew_names for i in j]).most_common(15)

In [33]:
# d = Counter([i for j in list_of_crew_names_url for i in j]).most_common(16)
# fig = plt.figure(figsize=(20, 16))
# for i, p in enumerate([j[0] for j in d]):
#     ax = fig.add_subplot(4, 4, i+1, xticks=[], yticks=[])
#     if p[1]:
#         im = Image.open(urlopen(f"https://image.tmdb.org/t/p/w600_and_h900_bestv2{p[1]}"))
#     else:
#         im = Image.new('RGB', (5, 5))
#     plt.imshow(im)
#     ax.set_title(f'Name: {p[0]} \n Job: {p[2]}')

In [34]:
# Counter([i for j in list_of_crew_jobs for i in j]).most_common(15)

<a id="de"></a>
## Data exploration

In [35]:
# train.head()

<a id="target"></a>
### Target

In [37]:
# fig, ax = plt.subplots(figsize = (16, 6))
# plt.subplot(1, 2, 1)
# plt.hist(train['revenue']);
# plt.title('Distribution of revenue');
# plt.subplot(1, 2, 2)
# plt.hist(np.log1p(train['revenue']));
# plt.title('Distribution of log of revenue');

In [145]:
train['log_revenue']

0       16.326300
1       18.370959
2       16.387512
3       16.588099
4       15.182615
          ...    
2995    14.283442
2996    12.103990
2997    18.309266
2998    18.962792
2999    18.223292
Name: log_revenue, Length: 3000, dtype: float64

As we can see, the revenue distribution is highly skewed! It is better to work with `np.log1p` of revenue.
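To put a number on the skewness, here is a small sketch on synthetic data. The log-normal sample is an assumption standing in for revenue; it is not the TMDB data.

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed "revenue" values (an assumption, not TMDB data)
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=16, sigma=2, size=3000))

raw_skew = revenue.skew()               # sample skewness of the raw values
log_skew = np.log1p(revenue).skew()     # sample skewness after the log1p transform

print(f'skew of revenue: {raw_skew:.2f}')
print(f'skew of log1p(revenue): {log_skew:.2f}')
```

A skewness close to zero after the transform is what makes the log target easier to model with a squared-error loss.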

<a id="budget"></a>
### Budget

In [36]:
# fig, ax = plt.subplots(figsize = (16, 6))
# plt.subplot(1, 2, 1)
# plt.hist(train['budget']);
# plt.title('Distribution of budget');
# plt.subplot(1, 2, 2)
# plt.hist(np.log1p(train['budget']));
# plt.title('Distribution of log of budget');

In [39]:
# plt.figure(figsize=(16, 8))
# plt.subplot(1, 2, 1)
# plt.scatter(train['budget'], train['revenue'])
# plt.title('Revenue vs budget');
# plt.subplot(1, 2, 2)
# plt.scatter(np.log1p(train['budget']), train['log_revenue'])
# plt.title('Log Revenue vs log budget');

We can see that budget and revenue are somewhat correlated. The logarithm transformation makes the budget distribution more manageable.

<a id="homepage"></a>
### homepage

In [40]:
# train['homepage'].value_counts().head()

Most homepages are unique, so the raw feature may be useless.
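A unique URL carries little signal, but whether a film has a homepage at all might. Here is a sketch of such a binary flag; I assume `preprocess_homepage` builds something similar, since the plot below uses a `has_homepage` column. The URLs are placeholders.

```python
import pandas as pd

# Toy frame; the URLs are placeholders, not rows from the dataset
df = pd.DataFrame({'homepage': ['http://example.com/a', None, 'http://example.com/b', None]})

# 1 if a homepage URL is present, 0 if it is missing
df['has_homepage'] = df['homepage'].notna().astype(int)

print(df['has_homepage'].tolist())
```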

In [47]:
# sns.catplot(x='has_homepage', y='revenue', data=train);
# plt.title('Revenue for film with and without homepage');

<a id="or_lang"></a>
### original_language

In [48]:
# plt.figure(figsize=(16, 8))
# plt.subplot(1, 2, 1)
# sns.boxplot(x='original_language', y='revenue', data=train.loc[train['original_language'].isin(train['original_language'].value_counts().head(10).index)]);
# plt.title('Mean revenue per language');
# plt.subplot(1, 2, 2)
# sns.boxplot(x='original_language', y='log_revenue', data=train.loc[train['original_language'].isin(train['original_language'].value_counts().head(10).index)]);
# plt.title('Mean log revenue per language');

<a id="or_title"></a>
### original_title

It can be interesting to see which words are common in titles.

In [49]:
# plt.figure(figsize = (12, 12))
# text = ' '.join(train['original_title'].values)
# wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
# plt.imshow(wordcloud)
# plt.title('Top words in titles')
# plt.axis("off")
# plt.show()

<a id="overview"></a>
### overview

In [50]:
# plt.figure(figsize = (12, 12))
# text = ' '.join(train['overview'].fillna('').values)
# wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
# plt.imshow(wordcloud)
# plt.title('Top words in overview')
# plt.axis("off")
# plt.show()

Let's try to see which words have a high impact on the revenue. I'll build a simple model and use ELI5 for this.

In [52]:
# vectorizer = TfidfVectorizer(
#             sublinear_tf=True,
#             analyzer='word',
#             token_pattern=r'\w{1,}',
#             ngram_range=(1, 2),
#             min_df=5)

# overview_text = vectorizer.fit_transform(train['overview'].fillna(''))


# linreg = LinearRegression()
# linreg.fit(overview_text, train['log_revenue'])
# eli5.show_weights(linreg, vec=vectorizer, top=20, feature_filter=lambda x: x != '<BIAS>')

In [54]:
# print('Target value:', train['log_revenue'][1000])
# eli5.show_prediction(linreg, doc=train['overview'].values[1000], vec=vectorizer)

We can see that some words can be used to predict revenue, but we will need more than the overview text to build a good model.

<a id="popularity"></a>
### popularity

I'm not exactly sure what popularity represents. Maybe it is some kind of weighted rating, maybe something else. It seems to have a low correlation with the target.

In [55]:
# plt.figure(figsize=(16, 8))
# plt.subplot(1, 2, 1)
# plt.scatter(train['popularity'], train['revenue'])
# plt.title('Revenue vs popularity');
# plt.subplot(1, 2, 2)
# plt.scatter(train['popularity'], train['log_revenue'])
# plt.title('Log Revenue vs popularity');

<a id="release_date"></a>
### release_date

In [None]:
# d1 = train['release_date_year'].value_counts().sort_index()
# d2 = test['release_date_year'].value_counts().sort_index()
# data = [go.Scatter(x=d1.index, y=d1.values, name='train'), go.Scatter(x=d2.index, y=d2.values, name='test')]
# layout = go.Layout(dict(title = "Number of films per year",
#                   xaxis = dict(title = 'Year'),
#                   yaxis = dict(title = 'Count'),
#                   ),legend=dict(
#                 orientation="v"))
# py.iplot(dict(data=data, layout=layout))

In [63]:
# d1 = train['release_date_year'].value_counts().sort_index()
# d2 = train.groupby(['release_date_year'])['revenue'].sum()
# data = [go.Scatter(x=d1.index, y=d1.values, name='film count'), go.Scatter(x=d2.index, y=d2.values, name='total revenue', yaxis='y2')]
# layout = go.Layout(dict(title = "Number of films and total revenue per year",
#                   xaxis = dict(title = 'Year'),
#                   yaxis = dict(title = 'Count'),
#                   yaxis2=dict(title='Total revenue', overlaying='y', side='right')
#                   ),legend=dict(
#                 orientation="v"))
# py.iplot(dict(data=data, layout=layout))

In [64]:
# d1 = train['release_date_year'].value_counts().sort_index()
# d2 = train.groupby(['release_date_year'])['revenue'].mean()
# data = [go.Scatter(x=d1.index, y=d1.values, name='film count'), go.Scatter(x=d2.index, y=d2.values, name='mean revenue', yaxis='y2')]
# layout = go.Layout(dict(title = "Number of films and average revenue per year",
#                   xaxis = dict(title = 'Year'),
#                   yaxis = dict(title = 'Count'),
#                   yaxis2=dict(title='Average revenue', overlaying='y', side='right')
#                   ),legend=dict(
#                 orientation="v"))
# py.iplot(dict(data=data, layout=layout))

We can see that the number of films and the total revenue are growing, which is to be expected. But there were some years in the past with a high number of successful films, which brought high revenue.

In [62]:
# sns.catplot(x='release_date_weekday', y='revenue', data=train);
# plt.title('Revenue on different days of week of release');

Surprisingly, films released on Wednesdays and Thursdays tend to have higher revenue.
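When reading the weekday plot, keep in mind that pandas numbers weekdays from Monday = 0 to Sunday = 6. Here is a small sketch on made-up rows showing the mapping and a per-day mean; the dates and revenues are invented for illustration.

```python
import pandas as pd

# Made-up release dates and revenues, for illustration only
toy = pd.DataFrame({
    'release_date': pd.to_datetime(['2015-02-18', '2015-02-19', '2015-02-20', '2015-02-25']),
    'revenue': [100.0, 300.0, 50.0, 200.0],
})

toy['weekday'] = toy['release_date'].dt.weekday          # Monday=0 ... Sunday=6
toy['weekday_name'] = toy['release_date'].dt.day_name()  # human-readable label

mean_by_day = toy.groupby('weekday_name')['revenue'].mean()
print(toy[['weekday', 'weekday_name']])
print(mean_by_day)
```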

<a id="runtime"></a>
### runtime

The length of the film in minutes.

In [65]:
# plt.figure(figsize=(20, 6))
# plt.subplot(1, 3, 1)
# plt.hist(train['runtime'].fillna(0) / 60, bins=40);
# plt.title('Distribution of length of film in hours');
# plt.subplot(1, 3, 2)
# plt.scatter(train['runtime'].fillna(0), train['revenue'])
# plt.title('runtime vs revenue');
# plt.subplot(1, 3, 3)
# plt.scatter(train['runtime'].fillna(0), train['popularity'])
# plt.title('runtime vs popularity');

<a id="tagline"></a>
### tagline

In [66]:
# plt.figure(figsize = (12, 12))
# text = ' '.join(train['tagline'].fillna('').values)
# wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
# plt.imshow(wordcloud)
# plt.title('Top words in tagline')
# plt.axis("off")
# plt.show()

<a id="collections"></a>
### Collections

In [67]:
# sns.boxplot(x='has_collection', y='revenue', data=train);

Films that are part of a collection usually have higher revenues. I suppose such films have a bigger fan base thanks to the previous films.

<a id="genres_"></a>
### Genres

In [68]:
# sns.catplot(x='num_genres', y='revenue', data=train);
# plt.title('Revenue for different number of genres in the film');

In [69]:
# sns.violinplot(x='genre_Drama', y='revenue', data=train[:100]);

In [70]:
# f, axes = plt.subplots(3, 5, figsize=(24, 12))
# plt.suptitle('Violinplot of revenue vs genres')
# for i, e in enumerate([col for col in train.columns if 'genre_' in col]):
#     sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5]);

Some genres tend to have lower revenue, others higher.

<a id="prod_comp"></a>
### Production companies

In [71]:
# f, axes = plt.subplots(6, 5, figsize=(24, 32))
# plt.suptitle('Violinplot of revenue vs production company')
# for i, e in enumerate([col for col in train.columns if 'production_company' in col]):
#     sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5]);

There are only a couple of companies that have distinctly higher revenues than the others.

<a id="prod_count"></a>
### Production countries

In [72]:
# sns.catplot(x='num_countries', y='revenue', data=train);
# plt.title('Revenue for different number of countries producing the film');

In fact, I think the number of production countries hardly matters. Most films are produced by 1-2 countries, so films with 1-2 countries have the highest revenue.

In [73]:
# f, axes = plt.subplots(5, 5, figsize=(24, 32))
# plt.suptitle('Violinplot of revenue vs production country')
# for i, e in enumerate([col for col in train.columns if 'production_country' in col]):
#     sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5]);

There are only a couple of countries that have distinctly higher revenues than the others.

<a id="cast_viz"></a>
### Cast

In [74]:
# plt.figure(figsize=(16, 8))
# plt.subplot(1, 2, 1)
# plt.scatter(train['num_cast'], train['revenue'])
# plt.title('Number of cast members vs revenue');
# plt.subplot(1, 2, 2)
# plt.scatter(train['num_cast'], train['log_revenue'])
# plt.title('Log Revenue vs number of cast members');

In [75]:
# f, axes = plt.subplots(3, 5, figsize=(24, 18))
# plt.suptitle('Violinplot of revenue vs cast')
# for i, e in enumerate([col for col in train.columns if 'cast_name' in col]):
#     sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5]);

In [76]:
# f, axes = plt.subplots(3, 5, figsize=(24, 18))
# plt.suptitle('Violinplot of revenue vs cast')
# for i, e in enumerate([col for col in train.columns if 'cast_character_' in col]):
#     sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5]);

<a id="key_viz"></a>
### Keywords

In [77]:
# f, axes = plt.subplots(6, 5, figsize=(24, 32))
# plt.suptitle('Violinplot of revenue vs keyword')
# for i, e in enumerate([col for col in train.columns if 'keyword_' in col]):
#     sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5]);

<a id="crew_viz"></a>
### Crew

In [78]:
# plt.figure(figsize=(16, 8))
# plt.subplot(1, 2, 1)
# plt.scatter(train['num_crew'], train['revenue'])
# plt.title('Number of crew members vs revenue');
# plt.subplot(1, 2, 2)
# plt.scatter(train['num_crew'], train['log_revenue'])
# plt.title('Log Revenue vs number of crew members');

In [79]:
# f, axes = plt.subplots(3, 5, figsize=(24, 18))
# plt.suptitle('Violinplot of revenue vs crew_character')
# for i, e in enumerate([col for col in train.columns if 'crew_character_' in col]):
#     sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5]);

In [80]:
# f, axes = plt.subplots(3, 5, figsize=(24, 18))
# plt.suptitle('Violinplot of revenue vs jobs')
# for i, e in enumerate([col for col in train.columns if 'jobs_' in col]):
#     sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5]);

<a id="basic_model"></a>
## Modelling and feature generation

In [82]:
for col in ['original_language', 'collection_name', 'all_genres']:
    le = LabelEncoder()
    # fit on train and test together so labels that occur only in test can still be encoded
    le.fit(list(train[col].fillna('').astype(str)) + list(test[col].fillna('').astype(str)))
    train[col] = le.transform(train[col].fillna('').astype(str))
    test[col] = le.transform(test[col].fillna('').astype(str))

In [83]:
train_texts = train[['title', 'tagline', 'overview', 'original_title']]
test_texts = test[['title', 'tagline', 'overview', 'original_title']]

In [84]:
for col in ['title', 'tagline', 'overview', 'original_title']:
    # len_* is the number of characters, words_* is the number of space-separated words
    train['len_' + col] = train[col].fillna('').apply(lambda x: len(str(x)))
    train['words_' + col] = train[col].fillna('').apply(lambda x: len(str(x).split(' ')))
    train = train.drop(col, axis=1)
    test['len_' + col] = test[col].fillna('').apply(lambda x: len(str(x)))
    test['words_' + col] = test[col].fillna('').apply(lambda x: len(str(x).split(' ')))
    test = test.drop(col, axis=1)

In [85]:
# data fixes from https://www.kaggle.com/somang1418/happy-valentines-day-and-keep-kaggling-3
train.loc[train['id'] == 16,'revenue'] = 192864          # Skinning
train.loc[train['id'] == 90,'budget'] = 30000000         # Sommersby          
train.loc[train['id'] == 118,'budget'] = 60000000        # Wild Hogs
train.loc[train['id'] == 149,'budget'] = 18000000        # Beethoven
train.loc[train['id'] == 313,'revenue'] = 12000000       # The Cookout 
train.loc[train['id'] == 451,'revenue'] = 12000000       # Chasing Liberty
train.loc[train['id'] == 464,'budget'] = 20000000        # Parenthood
train.loc[train['id'] == 470,'budget'] = 13000000        # The Karate Kid, Part II
train.loc[train['id'] == 513,'budget'] = 930000          # From Prada to Nada
train.loc[train['id'] == 797,'budget'] = 8000000         # Welcome to Dongmakgol
train.loc[train['id'] == 819,'budget'] = 90000000        # Alvin and the Chipmunks: The Road Chip
train.loc[train['id'] == 850,'budget'] = 90000000        # Modern Times
train.loc[train['id'] == 1112,'budget'] = 7500000        # An Officer and a Gentleman
train.loc[train['id'] == 1131,'budget'] = 4300000        # Smokey and the Bandit   
train.loc[train['id'] == 1359,'budget'] = 10000000       # Stir Crazy 
train.loc[train['id'] == 1542,'budget'] = 1              # All at Once
train.loc[train['id'] == 1570,'budget'] = 15800000       # Crocodile Dundee II
train.loc[train['id'] == 1571,'budget'] = 4000000        # Lady and the Tramp
train.loc[train['id'] == 1714,'budget'] = 46000000       # The Recruit
train.loc[train['id'] == 1721,'budget'] = 17500000       # Cocoon
train.loc[train['id'] == 1865,'revenue'] = 25000000      # Scooby-Doo 2: Monsters Unleashed
train.loc[train['id'] == 2268,'budget'] = 17500000       # Madea Goes to Jail budget
train.loc[train['id'] == 2491,'revenue'] = 6800000       # Never Talk to Strangers
train.loc[train['id'] == 2602,'budget'] = 31000000       # Mr. Holland's Opus
train.loc[train['id'] == 2612,'budget'] = 15000000       # Field of Dreams
train.loc[train['id'] == 2696,'budget'] = 10000000       # Nurse 3-D
train.loc[train['id'] == 2801,'budget'] = 10000000       # Fracture
test.loc[test['id'] == 3889,'budget'] = 15000000       # Colossal
test.loc[test['id'] == 6733,'budget'] = 5000000        # The Big Sick
test.loc[test['id'] == 3197,'budget'] = 8000000        # High-Rise
test.loc[test['id'] == 6683,'budget'] = 50000000       # The Pink Panther 2
test.loc[test['id'] == 5704,'budget'] = 4300000        # French Connection II
test.loc[test['id'] == 6109,'budget'] = 281756         # Dogtooth
test.loc[test['id'] == 7242,'budget'] = 10000000       # Addams Family Values
test.loc[test['id'] == 7021,'budget'] = 17540562       #  Two Is a Family
test.loc[test['id'] == 5591,'budget'] = 4000000        # The Orphanage
test.loc[test['id'] == 4282,'budget'] = 20000000       # Big Top Pee-wee

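# films with a real budget but a revenue below 100 were presumably entered in millions
power_six = train.loc[(train['budget'] > 1000) & (train['revenue'] < 100), 'id']

for k in power_six:
    train.loc[train['id'] == k, 'revenue'] = train.loc[train['id'] == k, 'revenue'] * 1000000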

In [89]:
X = train.drop(['id', 'revenue'], axis=1)
y = np.log1p(train['revenue'])
# y = train['revenue']
X_test = test.drop(['id'], axis=1)
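Since the target is `np.log1p(revenue)`, the models predict on the log scale, and RMSE computed there is what is usually called RMSLE on the raw values. Here is a short sketch with made-up numbers showing the metric and how `np.expm1` maps predictions back to revenue.

```python
import numpy as np

# Made-up revenues and predictions, for illustration only
y_true = np.array([1_000_000.0, 50_000_000.0, 250_000.0])
y_pred = np.array([1_500_000.0, 40_000_000.0, 300_000.0])

# RMSE in log1p space, i.e. RMSLE on the raw values
rmse_log = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# expm1 is the inverse of log1p, so it restores the original scale
restored = np.expm1(np.log1p(y_true))

print(f'RMSLE: {rmse_log:.4f}')
print(restored)
```

Any prediction made on the log scale should go through `np.expm1` before it is interpreted as revenue.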

In [87]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, random_state=1)

In [92]:
# params = {'num_leaves': 30,
#          'min_data_in_leaf': 20,
#          'objective': 'regression',
#          'max_depth': 5,
#          'learning_rate': 0.01,
#          "boosting": "gbdt",
#          "feature_fraction": 0.9,
#          "bagging_freq": 1,
#          "bagging_fraction": 0.9,
#          "bagging_seed": 11,
#          "metric": 'rmse',
#          "lambda_l1": 0.2,
#          "verbosity": -1}

# model1 = lgb.LGBMRegressor(**params, n_estimators = 20000, nthread = 4, n_jobs = -1)
# model1.fit(X_train, y_train, 
#         eval_set=[(X_train, y_train), (X_valid, y_valid)], eval_metric='rmse',
#         verbose=1000, early_stopping_rounds=200)

In [93]:
# eli5.show_weights(model1, feature_filter=lambda x: x != '<BIAS>')

In [137]:
n_fold = 10
folds = KFold(n_splits=n_fold, shuffle=True, random_state=42)

In [133]:
import random
n = random.randint(1, 10)

a = np.random.rand(2, 3, n)  # 2 * 3 * n elements in total
print(n)
a.reshape(6, -1)  # -1 lets NumPy infer the last dimension: 6 * k = 6 * n -> k = n


3


array([[0.18146005, 0.9415344 , 0.50298877],
       [0.79408627, 0.48528662, 0.35034718],
       [0.8911608 , 0.90323986, 0.71356916],
       [0.1983269 , 0.00744653, 0.90131838],
       [0.56249088, 0.90128987, 0.16638828],
       [0.70645768, 0.99443143, 0.79233542]])
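The same `-1` trick is used later to flatten predictions. Here is a deterministic sketch of how NumPy infers the missing dimension; the array below is just `np.arange`, chosen so the shapes are easy to check.

```python
import numpy as np

a = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # 24 elements in a 2 x 3 x 4 array

b = a.reshape(6, -1)    # -1 is inferred as 24 / 6 = 4
c = a.reshape(-1)       # a single -1 flattens the array to one dimension

print(b.shape)
print(c.shape)
```

Only one dimension may be `-1`, and the total number of elements has to be divisible by the product of the others.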

In [139]:
def train_model(
    X, 
    X_test, y, params=None, folds=folds, model_type='lgb', plot_feature_importance=False, model=None):
    
    # out-of-fold predictions: one value per training row
    oof = np.zeros(X.shape[0])
    # test predictions, averaged over the folds
    prediction = np.zeros(X_test.shape[0])
    
    scores = []
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X)):
        print('Fold', fold_n, 'started at', time.ctime())
        X_train, X_valid = X.values[train_index], X.values[valid_index]
        # split() yields positions, so index y by position as well
        y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    
        if model_type == 'sklearn':
            # model is any object with .fit and .predict methods
            model.fit(X_train, y_train)
            y_pred_valid = model.predict(X_valid).reshape(-1,)
            y_pred = model.predict(X_test.values).reshape(-1,)
        else:
            # only the sklearn branch is implemented in this version
            raise ValueError(f'Unsupported model_type: {model_type}')
        
        # shape [1, 200] -> [200]; -1 means "everything else"
        oof[valid_index] = y_pred_valid.reshape(-1,)
        scores.append(mean_squared_error(y_valid, y_pred_valid) ** 0.5)
        
        prediction += y_pred
        
    prediction /= folds.get_n_splits()
    
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
    
    return oof, prediction

In [141]:
for fold_n, (train_index, valid_index) in enumerate(folds.split(X)):
    print('Fold', fold_n, 'started at', time.ctime())
    X_train, X_valid = X.values[train_index], X.values[valid_index]
    print(len(X_train), len(X_valid))
    break
    

Fold 0 started at Mon Oct 24 19:05:49 2022
2700 300


In [147]:
np.isnan(X_train).sum() # False -> 0 True -> 1

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [140]:
params = {'num_leaves': 30,
         'min_data_in_leaf': 10,
         'objective': 'regression',
         "lambda_l1": 0.2,
         "verbosity": -1}


# train_model returns two values: (oof, prediction)
oof_linear, prediction_linear = train_model(
    X, X_test, y, params=params, model_type='sklearn',
    model=LinearRegression(),
    plot_feature_importance=True
)



Fold 0 started at Mon Oct 24 19:04:39 2022


ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
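The `ValueError` above comes from `LinearRegression`, which does not accept missing values. One possible way around it (a sketch, not necessarily the best choice for this data) is to put an imputer in front of the model. The data below is synthetic, and median imputation is my assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic regression data with some missing cells (not the TMDB features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan    # knock out about 5% of the cells

# The imputer fills NaNs with the column median before the regression sees the data
model_demo = make_pipeline(SimpleImputer(strategy='median'), LinearRegression())
model_demo.fit(X_demo, y_demo)
pred_demo = model_demo.predict(X_demo)

print(np.isnan(X_demo).sum(), 'missing cells,', np.isfinite(pred_demo).all())
```

The pipeline exposes `.fit` and `.predict`, so it could be passed as `model=` wherever a plain sklearn estimator is expected. Inside cross-validation the imputer is then fitted on the training part of each fold only.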

In [97]:
# params = {'num_leaves': 30,
#          'min_data_in_leaf': 10,
#          'objective': 'regression',
#          'max_depth': 5,
#          'learning_rate': 0.01,
#          "boosting": "gbdt",
#          "feature_fraction": 0.9,
#          "bagging_freq": 1,
#          "bagging_fraction": 0.9,
#          "bagging_seed": 11,
#          "metric": 'rmse',
#          "lambda_l1": 0.2,
#          "verbosity": -1}
# oof_lgb, prediction_lgb, _ = train_model(X, X_test, y, params=params, model_type='lgb', plot_feature_importance=True)

<a id="oof"></a>
### OOF features based on texts

In [135]:
# for col in train_texts.columns:
#     vectorizer = TfidfVectorizer(
#                 sublinear_tf=True,
#                 analyzer='word',
#                 token_pattern=r'\w{1,}',
#                 ngram_range=(1, 2),
#                 min_df=10
#     )
#     vectorizer.fit(list(train_texts[col].fillna('')) + list(test_texts[col].fillna('')))
#     train_col_text = vectorizer.transform(train_texts[col].fillna(''))
#     test_col_text = vectorizer.transform(test_texts[col].fillna(''))
#     model = linear_model.RidgeCV(alphas=(0.01, 0.1, 1.0, 10.0, 100.0), scoring='neg_mean_squared_error', cv=folds)
#     oof_text, prediction_text = train_model(train_col_text, test_col_text, y, params=None, model_type='sklearn', model=model)
    
#     X[col + '_oof'] = oof_text
#     X_test[col + '_oof'] = prediction_text

<a id="add_feat"></a>
### Additional feature generation

In [None]:
X.head()

In [None]:
def new_features(df):
    df['budget_to_popularity'] = df['budget'] / df['popularity']
    df['budget_to_runtime'] = df['budget'] / df['runtime']
    
    # some features from https://www.kaggle.com/somang1418/happy-valentines-day-and-keep-kaggling-3
    df['_budget_year_ratio'] = df['budget'] / (df['release_date_year'] * df['release_date_year'])
    df['_releaseYear_popularity_ratio'] = df['release_date_year'] / df['popularity']
    df['_releaseYear_popularity_ratio2'] = df['popularity'] / df['release_date_year']
    
    df['runtime_to_mean_year'] = df['runtime'] / df.groupby("release_date_year")["runtime"].transform('mean')
    df['popularity_to_mean_year'] = df['popularity'] / df.groupby("release_date_year")["popularity"].transform('mean')
    df['budget_to_mean_year'] = df['budget'] / df.groupby("release_date_year")["budget"].transform('mean')
        
    return df

In [None]:
X = new_features(X)
X_test = new_features(X_test)
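
Note that `new_features` is applied to `X` and `X_test` separately, so the per-year means are computed on each frame on its own. One possible alternative (an assumption on my side, not what this kernel does) is to compute the means on the train data and map them onto both frames. `add_ratio_to_year_mean` is a hypothetical helper and the frames are toy data.

```python
import numpy as np
import pandas as pd


def add_ratio_to_year_mean(train_df, test_df, col, year_col="release_date_year"):
    """Divide `col` by its per-year mean, where the means come from the train frame only."""
    year_mean = train_df.groupby(year_col)[col].mean()
    global_mean = train_df[col].mean()
    new_col = f"{col}_to_mean_year"
    for frame in (train_df, test_df):
        # Years that never occur in train fall back to the global train mean.
        denominator = frame[year_col].map(year_mean).fillna(global_mean)
        frame[new_col] = frame[col] / denominator
    return train_df, test_df


train_demo = pd.DataFrame({"release_date_year": [2000, 2000, 2010],
                           "runtime": [90.0, 110.0, 120.0]})
test_demo = pd.DataFrame({"release_date_year": [2000, 2015],
                          "runtime": [100.0, 120.0]})
train_demo, test_demo = add_ratio_to_year_mean(train_demo, test_demo, "runtime")
print(test_demo)
```

This way the same year gets the same denominator in train and test, which keeps the feature on one scale for both.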

In [None]:
oof_lgb, prediction_lgb, _ = train_model(X, X_test, y, params=params, model_type='lgb', plot_feature_importance=True)

<a id="imp_feats"></a>
### Important features

Let's have a look at important features using ELI5 and SHAP!

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1)

params = {'num_leaves': 30,
         'min_data_in_leaf': 20,
         'objective': 'regression',
         'max_depth': 6,
         'learning_rate': 0.01,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.2,
         "verbosity": -1}
# nthread and n_jobs are aliases for the same setting, so only one of them is passed
model1 = lgb.LGBMRegressor(**params, n_estimators=20000, n_jobs=-1)
model1.fit(X_train, y_train, 
        eval_set=[(X_train, y_train), (X_valid, y_valid)], eval_metric='rmse',
        verbose=1000, early_stopping_rounds=200)

eli5.show_weights(model1, feature_filter=lambda x: x != '<BIAS>')

We can see that the features LightGBM itself ranks as important and the top features in ELI5 are mostly the same, so both methods agree on which features the model relies on most.
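
One simple way to put a number on how similar two importance rankings are is the overlap of their top-k features. This is only a sketch: `top_k_overlap` is a hypothetical helper and the importance values are made up, not taken from the model above.

```python
import pandas as pd


def top_k_overlap(importance_a: pd.Series, importance_b: pd.Series, k: int = 10) -> float:
    """Share of features that appear in the top-k of both importance rankings."""
    top_a = set(importance_a.nlargest(k).index)
    top_b = set(importance_b.nlargest(k).index)
    return len(top_a & top_b) / k


# Made-up importances for two methods, indexed by feature name.
native_importance = pd.Series(
    {"budget": 0.50, "popularity": 0.30, "runtime": 0.10, "release_date_year": 0.06, "num_cast": 0.04}
)
other_importance = pd.Series(
    {"budget": 0.40, "release_date_year": 0.25, "popularity": 0.20, "runtime": 0.10, "num_cast": 0.05}
)

overlap = top_k_overlap(native_importance, other_importance, k=3)
print(f"top-3 overlap: {overlap:.2f}")
```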

In [None]:
explainer = shap.TreeExplainer(model1, X_train)
shap_values = explainer.shap_values(X_train)

shap.summary_plot(shap_values, X_train)

SHAP provides more detailed information, even if it may be harder to interpret.

For example, a low budget has a negative impact on predicted revenue, while high budgets tend to push it higher.
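
To build some intuition for reading these values, here is a sketch on a toy linear model. For a linear model with independent features, the SHAP value of feature j for row i has the closed form w_j * (x_ij - mean_j). This is not what `TreeExplainer` computes for the LightGBM model above; the weights and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
log_budget = rng.uniform(10, 20, size=200)
runtime = rng.uniform(80, 180, size=200)
features = np.column_stack([log_budget, runtime])

# Hypothetical linear model: prediction = features @ weights + bias
weights = np.array([0.8, 0.01])
bias = 2.0


def linear_shap_values(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Closed-form SHAP values for a linear model, assuming independent features."""
    return w * (x - x.mean(axis=0))


shap_like = linear_shap_values(features, weights)
predictions = features @ weights + bias

# Additivity: base value plus the contributions of a row gives its prediction.
base_value = predictions.mean()
assert np.allclose(base_value + shap_like.sum(axis=1), predictions)

# Rows with a below-median budget get negative budget contributions on average.
low_budget = log_budget < np.median(log_budget)
print(shap_like[low_budget, 0].mean(), shap_like[~low_budget, 0].mean())
```

The sign of a SHAP value says whether the feature pushed this row's prediction below or above the average prediction, which is how the summary plot should be read.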

In [None]:
top_cols = X_train.columns[np.argsort(shap_values.std(0))[::-1]][:10]
for col in top_cols:
    shap.dependence_plot(col, shap_values, X_train)

Here we can see interactions between important features, and some of them are interesting. Take the relationship between release_date_year and log_budget: up to ~1990, low-budget films brought higher revenues, but after 2000 high budgets tended to be correlated with higher revenues. In general, the effect of budget diminished.
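
An interaction like this can also be checked without SHAP by correlating budget with the target inside separate release-year ranges. The sketch below assumes a frame with `release_date_year`, `log_budget` and a `log_revenue` column (the last name is hypothetical). The synthetic data is built to contain a sign flip, so it only demonstrates the check, not the finding.

```python
import numpy as np
import pandas as pd


def budget_target_corr_by_era(df, bins, year_col="release_date_year",
                              budget_col="log_budget", target_col="log_revenue"):
    """Pearson correlation between budget and target inside each release-year bin."""
    era = pd.cut(df[year_col], bins=bins)
    result = {}
    for interval, group in df.groupby(era, observed=True):
        result[str(interval)] = group[budget_col].corr(group[target_col])
    return pd.Series(result)


# Synthetic data: the budget slope is negative up to 1990 and positive afterwards.
rng = np.random.default_rng(0)
n = 600
years = rng.integers(1960, 2020, size=n)
log_budget = rng.uniform(10, 20, size=n)
slope = np.where(years <= 1990, -0.5, 1.0)
log_revenue = slope * log_budget + rng.normal(0, 0.5, size=n)
demo = pd.DataFrame({"release_date_year": years,
                     "log_budget": log_budget,
                     "log_revenue": log_revenue})

corr_by_era = budget_target_corr_by_era(demo, bins=[1959, 1990, 2000, 2020])
print(corr_by_era)
```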

Let's create new features as interactions between the most important features. Some of them make little sense, but they might still improve the model.

In [None]:
def top_cols_interaction(df):
    df['budget_to_year'] = df['budget'] / df['release_date_year']
    df['budget_to_mean_year_to_year'] = df['budget_to_mean_year'] / df['release_date_year']
    df['popularity_to_mean_year_to_log_budget'] = df['popularity_to_mean_year'] / df['log_budget']
    df['year_to_log_budget'] = df['release_date_year'] / df['log_budget']
    df['budget_to_runtime_to_year'] = df['budget_to_runtime'] / df['release_date_year']
    df['genders_1_cast_to_log_budget'] = df['genders_1_cast'] / df['log_budget']
    df['all_genres_to_popularity_to_mean_year'] = df['all_genres'] / df['popularity_to_mean_year']
    df['genders_2_crew_to_budget_to_mean_year'] = df['genders_2_crew'] / df['budget_to_mean_year']
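    # overview_oof exists only if the OOF text-feature cell above was run
    if 'overview_oof' in df.columns:
        df['overview_oof_to_genders_2_crew'] = df['overview_oof'] / df['genders_2_crew']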
    
    return df

In [None]:
X = top_cols_interaction(X)
X_test = top_cols_interaction(X_test)

In [None]:
X = X.replace([np.inf, -np.inf], 0).fillna(0)
X_test = X_test.replace([np.inf, -np.inf], 0).fillna(0)
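
The cell above is needed because many of the ratio features divide by columns that can be zero, which gives `inf` (x / 0) or `NaN` (0 / 0). The same cleanup can be done per feature at creation time. `safe_ratio` is a hypothetical helper shown on toy data.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series, fill_value: float = 0.0) -> pd.Series:
    """Element-wise ratio with inf and NaN results replaced by `fill_value`."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(fill_value)


demo = pd.DataFrame({"budget": [100.0, 50.0, 0.0],
                     "runtime": [100.0, 0.0, 0.0]})
# Row 0 is a normal ratio, row 1 would be inf, row 2 would be NaN.
demo["budget_to_runtime"] = safe_ratio(demo["budget"], demo["runtime"])
print(demo)
```

Keep in mind that filling with 0 makes "undefined" look like a very small ratio, so a different `fill_value` may suit some features better.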

<a id="ext_feats"></a>
### External features
I'm adding external features from this kernel: https://www.kaggle.com/kamalchhirang/eda-feature-engineering-lgb-xgb-cat by kamalchhirang. All credit for these features goes to him and his kernel.

In [None]:
trainAdditionalFeatures = pd.read_csv('../input/tmdb-competition-additional-features/TrainAdditionalFeatures.csv')
testAdditionalFeatures = pd.read_csv('../input/tmdb-competition-additional-features/TestAdditionalFeatures.csv')

train = pd.read_csv('../input/tmdb-box-office-prediction/train.csv')
test = pd.read_csv('../input/tmdb-box-office-prediction/test.csv')
X['imdb_id'] = train['imdb_id']
X_test['imdb_id'] = test['imdb_id']
del train, test

X = pd.merge(X, trainAdditionalFeatures, how='left', on=['imdb_id'])
X_test = pd.merge(X_test, testAdditionalFeatures, how='left', on=['imdb_id'])

X = X.drop(['imdb_id'], axis=1)
X_test = X_test.drop(['imdb_id'], axis=1)
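
A left merge adds rows whenever the right-hand table has duplicate keys, which would silently misalign `X` with `y`. Here is a sketch of a guarded merge. `merge_extra_features` is a hypothetical helper and the `rating` column is made up; I'm not asserting anything about the real contents of the additional-features files.

```python
import pandas as pd


def merge_extra_features(features: pd.DataFrame, extra: pd.DataFrame, key: str = "imdb_id") -> pd.DataFrame:
    """Left-merge extra features while guaranteeing the row count stays the same."""
    extra_unique = extra.drop_duplicates(subset=key, keep="first")
    # validate raises MergeError if the right-hand keys are not unique.
    merged = pd.merge(features, extra_unique, how="left", on=key, validate="many_to_one")
    if len(merged) != len(features):
        raise ValueError("merge changed the number of rows")
    return merged


features_demo = pd.DataFrame({"imdb_id": ["tt001", "tt002", "tt003"],
                              "budget": [10.0, 20.0, 30.0]})
# The extra table has a duplicated key and no entry for tt003.
extra_demo = pd.DataFrame({"imdb_id": ["tt001", "tt001", "tt002"],
                           "rating": [7.1, 7.1, 6.4]})

merged_demo = merge_extra_features(features_demo, extra_demo)
print(merged_demo)
```

Films without a match simply get NaN in the new columns, which LightGBM can handle natively.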

In [None]:
X.head()

In [None]:
params = {'num_leaves': 30,
         'min_data_in_leaf': 20,
         'objective': 'regression',
         'max_depth': 9,
         'learning_rate': 0.01,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.2,
         "verbosity": -1}
oof_lgb, prediction_lgb, _ = train_model(X, X_test, y, params=params, model_type='lgb', plot_feature_importance=True)

<a id="blending"></a>
### Blending

In [None]:
# 
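
The blends written in the submission cell at the end of this kernel average the model predictions first and apply `np.expm1` afterwards. Assuming the models predict `log1p(revenue)`, this amounts to a geometric mean of (1 + revenue) rather than an arithmetic mean of revenues. A small sketch with made-up numbers, where `blend_log_predictions` is a hypothetical helper:

```python
import numpy as np


def blend_log_predictions(*log_predictions, weights=None):
    """Average predictions on the log1p scale, then map back to the revenue scale."""
    stacked = np.vstack(log_predictions)
    blended_log = np.average(stacked, axis=0, weights=weights)
    return np.expm1(blended_log)


# Two made-up models, predicting log1p(revenue) for two films.
pred_a = np.log1p(np.array([1e6, 4e6]))
pred_b = np.log1p(np.array([4e6, 4e6]))

blended = blend_log_predictions(pred_a, pred_b)
arithmetic_mean = (np.expm1(pred_a) + np.expm1(pred_b)) / 2
print(blended, arithmetic_mean)
```

When the models disagree, the log-space blend ends up below the arithmetic mean of the revenues, so it is less dominated by a single very high prediction.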

<a id="stacking"></a>
### Stacking

In [None]:
train_stack = np.vstack([oof_lgb, oof_xgb, oof_cat, oof_lgb_1, oof_lgb_2]).transpose()
train_stack = pd.DataFrame(train_stack, columns=['lgb', 'xgb', 'cat', 'lgb_1', 'lgb_2'])
test_stack = np.vstack([prediction_lgb, prediction_xgb, prediction_cat, prediction_lgb_1, prediction_lgb_2]).transpose()
test_stack = pd.DataFrame(test_stack, columns=['lgb', 'xgb', 'cat', 'lgb_1', 'lgb_2'])

In [None]:
model = linear_model.RidgeCV(alphas=(0.01, 0.1, 1.0, 10.0, 100.0), scoring='neg_mean_squared_error', cv=folds)
oof_rcv_stack, prediction_rcv_stack = train_model(train_stack.values, test_stack.values, y, params=None, model_type='sklearn', model=model)

In [None]:
sub = pd.read_csv('../input/tmdb-box-office-prediction/sample_submission.csv')
sub['revenue'] = np.expm1(prediction_lgb)
sub.to_csv("lgb.csv", index=False)
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb) / 2)
sub.to_csv("blend.csv", index=False)
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat) / 3)
sub.to_csv("blend1.csv", index=False)
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat + prediction_lgb_1) / 4)
sub.to_csv("blend2.csv", index=False)
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat + prediction_lgb_1 + prediction_lgb_2) / 5)
sub.to_csv("blend3.csv", index=False)

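# The stacked models are trained on the same log-scale target, so their predictions need expm1 too
sub['revenue'] = np.expm1(prediction_lgb_stack)
sub.to_csv("stack_lgb.csv", index=False)
sub['revenue'] = np.expm1(prediction_rcv_stack)
sub.to_csv("stack_rcv.csv", index=False)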