## TMDB Box Office Prediction

The objective of this competition is to predict the worldwide box office revenue of the movie using the data points provided like cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.

### Acknowledgement

Most of the codes are inspired/reproduced from this awesome kernel - https://www.kaggle.com/artgor/eda-feature-engineering-and-model-interpretation

### Loading the required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
import ast
from tqdm import tqdm_notebook
from collections import Counter
warnings.filterwarnings('ignore')

In [2]:
kaggle=0
if kaggle==0:
    train=pd.read_csv("train.csv")
    test=pd.read_csv("test.csv")
    sub=pd.read_csv("sample_submission.csv")
else:
    train=pd.read_csv("../input/train.csv")
    test=pd.read_csv("../input/test.csv")
    sub=pd.read_csv("../input/sample_submission.csv")

### Data Cleaning 

In [5]:
print(f'Shape of train is {train.shape} and shape of test is {test.shape}')

Shape of train is (3000, 23) and shape of test is (4398, 22)


The test set is greater than the train set .

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
id                       3000 non-null int64
belongs_to_collection    604 non-null object
budget                   3000 non-null int64
genres                   2993 non-null object
homepage                 946 non-null object
imdb_id                  3000 non-null object
original_language        3000 non-null object
original_title           3000 non-null object
overview                 2992 non-null object
popularity               3000 non-null float64
poster_path              2999 non-null object
production_companies     2844 non-null object
production_countries     2945 non-null object
release_date             3000 non-null object
runtime                  2998 non-null float64
spoken_languages         2980 non-null object
status                   3000 non-null object
tagline                  2403 non-null object
title                    3000 non-null object
Keywords             

In [7]:
train.head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,...,8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435
2,3,,3300000,"[{'id': 18, 'name': 'Drama'}]",http://sonyclassics.com/whiplash/,tt2582802,en,Whiplash,"Under the direction of a ruthless instructor, ...",64.29999,...,10/10/14,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The road to greatness can take you to the edge.,Whiplash,"[{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n...","[{'cast_id': 5, 'character': 'Andrew Neimann',...","[{'credit_id': '54d5356ec3a3683ba0000039', 'de...",13092000
3,4,,1200000,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",http://kahaanithefilm.com/,tt1821480,hi,Kahaani,Vidya Bagchi (Vidya Balan) arrives in Kolkata ...,3.174936,...,3/9/12,122.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,Kahaani,"[{'id': 10092, 'name': 'mystery'}, {'id': 1054...","[{'cast_id': 1, 'character': 'Vidya Bagchi', '...","[{'credit_id': '52fe48779251416c9108d6eb', 'de...",16000000
4,5,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.14807,...,2/5/09,118.0,"[{'iso_639_1': 'ko', 'name': '한국어/조선말'}]",Released,,Marine Boy,,"[{'cast_id': 3, 'character': 'Chun-soo', 'cred...","[{'credit_id': '52fe464b9251416c75073b43', 'de...",3923970


From the overview of the data we find that the columns belongs_to_collection,genres,spoken_languages,Keywords,cast,crew,production_companies,production_countries are of JSON type . Lets convert the columns to pandas list.

In [3]:
# from this kernel: https://www.kaggle.com/gravix/gradient-in-a-box & https://www.kaggle.com/artgor/eda-feature-engineering-and-model-interpretation
dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
                'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

def text_to_dict(df):
    for column in dict_columns:
        df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x) )
    return df

train=text_to_dict(train)
test=text_to_dict(test)

## Feature Extraction

### Belongs to Collection

In [4]:
for i, e in enumerate(train['belongs_to_collection'][:5]):
    print(i, e)

0 [{'id': 313576, 'name': 'Hot Tub Time Machine Collection', 'poster_path': '/iEhb00TGPucF0b4joM1ieyY026U.jpg', 'backdrop_path': '/noeTVcgpBiD48fDjFVic1Vz7ope.jpg'}]
1 [{'id': 107674, 'name': 'The Princess Diaries Collection', 'poster_path': '/wt5AMbxPTS4Kfjx7Fgm149qPfZl.jpg', 'backdrop_path': '/zSEtYD77pKRJlUPx34BJgUG9v1c.jpg'}]
2 {}
3 {}
4 {}


In [4]:
train['collection_name']=train['belongs_to_collection'].apply(lambda x:x[0]['name'] if x!= {} else 0)

In [5]:
test['collection_name']=test['belongs_to_collection'].apply(lambda x:x[0]['name'] if x!= {} else 0)

In [6]:
train=train.drop('belongs_to_collection',axis=1)
test=test.drop('belongs_to_collection',axis=1)

In [97]:
print(train['collection_name'].head(),test['collection_name'].head())

0    Hot Tub Time Machine Collection
1    The Princess Diaries Collection
2                                  0
3                                  0
4                                  0
Name: collection_name, dtype: object 0    Pokémon Collection
1                     0
2                     0
3                     0
4                     0
Name: collection_name, dtype: object


### Genres

In [8]:
for i, e in enumerate(train['genres'][:5]):
    print(i, e)

0 [{'id': 35, 'name': 'Comedy'}]
1 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}, {'id': 10749, 'name': 'Romance'}]
2 [{'id': 18, 'name': 'Drama'}]
3 [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'name': 'Drama'}]
4 [{'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}]


Each film consist of different geners . Lets take out most occuring geners for each film as the common genere.

In [7]:
genre_list=list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)

In [10]:
genre_list

[['Comedy'],
 ['Comedy', 'Drama', 'Family', 'Romance'],
 ['Drama'],
 ['Thriller', 'Drama'],
 ['Action', 'Thriller'],
 ['Animation', 'Adventure', 'Family'],
 ['Horror', 'Thriller'],
 ['Documentary'],
 ['Action', 'Comedy', 'Music', 'Family', 'Adventure'],
 ['Comedy', 'Music'],
 ['Drama'],
 ['Comedy'],
 ['Drama'],
 ['Comedy', 'Crime'],
 ['Action', 'Thriller', 'Science Fiction', 'Mystery'],
 ['Action', 'Crime', 'Drama'],
 ['Horror', 'Thriller'],
 ['Drama', 'Romance'],
 ['Comedy', 'Romance'],
 ['Action', 'Thriller', 'Crime'],
 ['Adventure', 'Family', 'Science Fiction'],
 ['Horror', 'Thriller'],
 ['Thriller', 'Horror'],
 ['Thriller', 'Mystery', 'Foreign'],
 ['Horror', 'Comedy'],
 ['Comedy', 'Horror', 'Mystery', 'Thriller'],
 ['Crime', 'Drama', 'Mystery', 'Thriller'],
 ['Drama', 'Comedy', 'Romance'],
 ['Animation'],
 ['Action', 'Adventure', 'Crime', 'Thriller'],
 ['Drama', 'Comedy'],
 ['Mystery', 'Drama', 'Thriller'],
 ['Fantasy', 'Action', 'Adventure'],
 ['Horror'],
 ['Action', 'Comedy', 'Cr

Lets print out most occuring genres,

In [11]:
Counter([i for j in genre_list for i in j]).most_common()

[('Drama', 1531),
 ('Comedy', 1028),
 ('Thriller', 789),
 ('Action', 741),
 ('Romance', 571),
 ('Crime', 469),
 ('Adventure', 439),
 ('Horror', 301),
 ('Science Fiction', 290),
 ('Family', 260),
 ('Fantasy', 232),
 ('Mystery', 225),
 ('Animation', 141),
 ('History', 132),
 ('Music', 100),
 ('War', 100),
 ('Documentary', 87),
 ('Western', 43),
 ('Foreign', 31),
 ('TV Movie', 1)]

It is seen that Drama , comedy , thriller are most represented movies in the dataset.

In [8]:
train['num_genres'] = train['genres'].apply(lambda x: len(x) if x != {} else 0)
train['all_genres'] = train['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_genres = [m[0] for m in Counter([i for j in genre_list for i in j]).most_common(15)]
for g in top_genres:
    train['genre_' + g] = train['all_genres'].apply(lambda x: 1 if g in x else 0)
    
    
test['num_genres'] = test['genres'].apply(lambda x: len(x) if x != {} else 0)
test['all_genres'] = test['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_genres:
    test['genre_' + g] = test['all_genres'].apply(lambda x: 1 if g in x else 0)

train = train.drop(['genres'], axis=1)
test = test.drop(['genres'], axis=1)

In [98]:
train[['all_genres','genre_Comedy','genre_Drama']].head()

Unnamed: 0,all_genres,genre_Comedy,genre_Drama
0,Comedy,1,0
1,Comedy Drama Family Romance,1,1
2,Drama,0,1
3,Drama Thriller,0,1
4,Action Thriller,0,0


### Production Companies

In [15]:
for i,e in enumerate(train['production_companies'].head()):
    print(i,e)

0 [{'name': 'Paramount Pictures', 'id': 4}, {'name': 'United Artists', 'id': 60}, {'name': 'Metro-Goldwyn-Mayer (MGM)', 'id': 8411}]
1 [{'name': 'Walt Disney Pictures', 'id': 2}]
2 [{'name': 'Bold Films', 'id': 2266}, {'name': 'Blumhouse Productions', 'id': 3172}, {'name': 'Right of Way Films', 'id': 32157}]
3 {}
4 {}


Lets follow the same procedure as we did for genre.

In [9]:
prod_comp_list=list(train['production_companies'].apply(lambda x:[i['name'] for i in x] if x!={} else []).values)

In [17]:
prod_comp_list

[['Paramount Pictures', 'United Artists', 'Metro-Goldwyn-Mayer (MGM)'],
 ['Walt Disney Pictures'],
 ['Bold Films', 'Blumhouse Productions', 'Right of Way Films'],
 [],
 [],
 [],
 ['Ghost House Pictures', 'North Box Productions'],
 [],
 ['Walt Disney Pictures', 'Jim Henson Productions', 'Jim Henson Company, The'],
 ['Castle Rock Entertainment'],
 ['United Artists'],
 ['Twentieth Century Fox Film Corporation',
  'Amercent Films',
  'American Entertainment Partners L.P.',
  'Interscope Communications'],
 ['DreamWorks SKG', 'Jinks/Cohen Company'],
 ['Double Feature Films',
  'Jersey Films',
  'Nina Saxon Film Design',
  'Metro-Goldwyn-Mayer (MGM)'],
 ['DreamWorks SKG',
  'Cruise/Wagner Productions',
  'Amblin Entertainment',
  'Twentieth Century Fox Film Corporation',
  'Blue Tulip Productions',
  'Ronald Shusett/Gary Goldman',
  'Digital Image Associates'],
 ['Hypnopolis'],
 ['DreamWorks SKG', 'Craven-Maddalena Films', 'BenderSpink'],
 ['BBC Films',
  'Headline Pictures',
  'Magnolia Mae 

In [18]:
Counter([i for j in prod_comp_list for i in j]).most_common()

[('Warner Bros.', 202),
 ('Universal Pictures', 188),
 ('Paramount Pictures', 161),
 ('Twentieth Century Fox Film Corporation', 138),
 ('Columbia Pictures', 91),
 ('Metro-Goldwyn-Mayer (MGM)', 84),
 ('New Line Cinema', 75),
 ('Touchstone Pictures', 63),
 ('Walt Disney Pictures', 62),
 ('Columbia Pictures Corporation', 61),
 ('TriStar Pictures', 53),
 ('Relativity Media', 48),
 ('Canal+', 46),
 ('United Artists', 44),
 ('Miramax Films', 40),
 ('Village Roadshow Pictures', 36),
 ('Regency Enterprises', 31),
 ('BBC Films', 30),
 ('Dune Entertainment', 30),
 ('Working Title Films', 30),
 ('Fox Searchlight Pictures', 29),
 ('StudioCanal', 28),
 ('Lionsgate', 28),
 ('DreamWorks SKG', 27),
 ('Fox 2000 Pictures', 25),
 ('Summit Entertainment', 24),
 ('Hollywood Pictures', 24),
 ('Orion Pictures', 24),
 ('Amblin Entertainment', 23),
 ('Dimension Films', 23),
 ('Castle Rock Entertainment', 21),
 ('Epsilon Motion Pictures', 21),
 ('Morgan Creek Productions', 21),
 ('Original Film', 21),
 ('Focus 

Films produced by Warner Bros , Universal pictures , Paramount have been represented more in this dataset.

Lets create a column - the number of production companies backing a movie , list of production company for a movie and whether top production company has been backing a movie.We create columns of production companies if it is backed by one of the top 15 production companies .

In [10]:
train['prod_comp']=train['production_companies'].apply(lambda x:' '.join(sorted([i['name'] for i in x])) if x != {} else '')
train['num_prod']=train['production_companies'].apply(lambda x:len(x) if x!={} else 0)
top_prod=[m[0] for m in Counter([i for j in prod_comp_list for i in j]).most_common(15)]
for g in top_prod:
    train['prod_' + g] = train['prod_comp'].apply(lambda x: 1 if g in x else 0)
    
test['prod_comp']=test['production_companies'].apply(lambda x:' '.join(sorted([i['name'] for i in x])) if x != {} else '')
test['num_prod']=test['production_companies'].apply(lambda x:len(x) if x!={} else 0)
for g in top_prod:
    test['prod_' + g] = test['prod_comp'].apply(lambda x: 1 if g in x else 0)
    
train=train.drop('production_companies',axis=1)
test=test.drop('production_companies',axis=1)

In [20]:
train[['prod_comp','num_prod']].head()

Unnamed: 0,prod_comp,num_prod
0,Metro-Goldwyn-Mayer (MGM) Paramount Pictures U...,3
1,Walt Disney Pictures,1
2,Blumhouse Productions Bold Films Right of Way ...,3
3,,0
4,,0


### Production Countries

In [21]:
for i,e in enumerate(train['production_countries'].head()):
    print(i,e)

0 [{'iso_3166_1': 'US', 'name': 'United States of America'}]
1 [{'iso_3166_1': 'US', 'name': 'United States of America'}]
2 [{'iso_3166_1': 'US', 'name': 'United States of America'}]
3 [{'iso_3166_1': 'IN', 'name': 'India'}]
4 [{'iso_3166_1': 'KR', 'name': 'South Korea'}]


We take only the name from this column and create a feature.

In [11]:
train['production_countries'].apply(lambda x:len(x) if x!={} else 0).value_counts()

1    2222
2     525
3     116
4      57
0      55
5      21
6       3
8       1
Name: production_countries, dtype: int64

It is interesting to see that there are around 25 movies backed by more than 5 countries . Lets check which moves are they sorted by popularity.

In [12]:
train[train['production_countries'].apply(lambda x:len(x) if x!={} else 0) >=5][['original_title','popularity']].sort_values('popularity',ascending=False)

Unnamed: 0,original_title,popularity
60,Casino Royale,23.065078
777,The Great Wall,15.884744
1775,Filth,14.970485
710,Hysteria,14.331454
801,Love & Friendship,12.822085
1606,Louder Than Bombs,12.293202
1338,The Secret of Moonacre,12.01334
2185,1492: Conquest of Paradise,11.618792
2170,The Lobster,11.223033
2517,Astérix aux Jeux Olympiques,9.671944


Lets create a column for the production country , number of production countries and most common production countries similar to what we did for genere and production company.

In [12]:
prod_count_list=list(train['production_countries'].apply(lambda x:[i['name'] for i in x] if x!={} else []).values)

In [26]:
prod_count_list

[['United States of America'],
 ['United States of America'],
 ['United States of America'],
 ['India'],
 ['South Korea'],
 [],
 ['United States of America', 'Canada'],
 [],
 ['United States of America'],
 ['United States of America'],
 ['United States of America'],
 ['United States of America'],
 ['United States of America'],
 ['United States of America'],
 ['United States of America'],
 ['Serbia'],
 ['United States of America'],
 ['United Kingdom'],
 ['Austria', 'Germany', 'United Kingdom'],
 ['France'],
 ['United States of America'],
 ['United States of America'],
 ['United States of America'],
 ['United States of America'],
 ['New Zealand'],
 ['Japan', 'United States of America'],
 ['United States of America'],
 ['Canada', 'Ireland'],
 ['United States of America'],
 ['United States of America'],
 ['France'],
 ['Austria', 'France', 'Germany', 'Italy', 'United States of America'],
 ['United States of America'],
 ['United States of America', 'New Zealand', 'Canada', 'Israel', 'Japan']

In [27]:
Counter([i for j in prod_count_list for i in j]).most_common()

[('United States of America', 2282),
 ('United Kingdom', 380),
 ('France', 222),
 ('Germany', 167),
 ('Canada', 120),
 ('India', 81),
 ('Italy', 64),
 ('Japan', 61),
 ('Australia', 61),
 ('Russia', 58),
 ('Spain', 54),
 ('China', 42),
 ('Hong Kong', 42),
 ('Ireland', 23),
 ('Belgium', 23),
 ('South Korea', 22),
 ('Mexico', 19),
 ('Sweden', 18),
 ('New Zealand', 17),
 ('Netherlands', 15),
 ('Czech Republic', 14),
 ('Denmark', 13),
 ('Brazil', 12),
 ('Luxembourg', 10),
 ('South Africa', 10),
 ('Hungary', 9),
 ('United Arab Emirates', 9),
 ('Austria', 8),
 ('Switzerland', 8),
 ('Romania', 8),
 ('Greece', 7),
 ('Norway', 7),
 ('Argentina', 6),
 ('Chile', 6),
 ('Finland', 6),
 ('Israel', 5),
 ('Turkey', 5),
 ('Iran', 5),
 ('Poland', 5),
 ('Morocco', 3),
 ('Philippines', 3),
 ('Taiwan', 3),
 ('Bulgaria', 3),
 ('Bahamas', 3),
 ('Serbia', 2),
 ('Iceland', 2),
 ('Cambodia', 2),
 ('Malta', 2),
 ('Pakistan', 2),
 ('Qatar', 2),
 ('Tunisia', 2),
 ('Ukraine', 2),
 ('Singapore', 2),
 ('Indonesia', 2)

Most of the movies in the database are produced by USA,UK,France and Germany

In [13]:
train['prod_country']=train['production_countries'].apply(lambda x:' '.join(sorted([i['name'] for i in x])) if x != {} else '')
train['num_country']=train['production_countries'].apply(lambda x:len(x) if x!={} else 0)
top_countries=[m[0] for m in Counter([i for j in prod_comp_list for i in j]).most_common(15)]
for g in top_prod:
    train['count_' + g] = train['prod_country'].apply(lambda x: 1 if g in x else 0)
    
test['prod_country']=test['production_countries'].apply(lambda x:' '.join(sorted([i['name'] for i in x])) if x != {} else '')
test['num_country']=test['production_countries'].apply(lambda x:len(x) if x!={} else 0)
for g in top_prod:
    test['count_' + g] = test['prod_country'].apply(lambda x: 1 if g in x else 0)
    
train=train.drop('production_countries',axis=1)
test=test.drop('production_countries',axis=1)

### Spoken Languages

Lets check the spoken languages column.

In [41]:
for i,e in enumerate(train['spoken_languages'].head()):
    print(i,e)

0 [{'iso_639_1': 'en', 'name': 'English'}]
1 [{'iso_639_1': 'en', 'name': 'English'}]
2 [{'iso_639_1': 'en', 'name': 'English'}]
3 [{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'hi', 'name': 'हिन्दी'}]
4 [{'iso_639_1': 'ko', 'name': '한국어/조선말'}]


Lets check which language movies are most represented in the database.

In [14]:
list_of_lang=list(train['spoken_languages'].apply(lambda x:[i['name'] for i in x] if x!={} else []).values)

In [42]:
Counter(i for j in list_of_lang for i in j).most_common(10)

[('English', 2618),
 ('Français', 288),
 ('Español', 239),
 ('Deutsch', 169),
 ('Pусский', 152),
 ('Italiano', 124),
 ('日本語', 89),
 ('普通话', 68),
 ('हिन्दी', 56),
 ('', 47)]

English , France ,Spanish movies are most represented in the database .Lets create a column having the languages, number of languages and whether the movie is shot in top 5 most common language.

In [15]:
train['lang']=train['spoken_languages'].apply(lambda x:' '.join(sorted([i['name'] for i in x])) if x != {} else '')
train['num_lang']=train['spoken_languages'].apply(lambda x:len(x) if x!={} else 0)
top_langs=[m[0] for m in Counter([i for j in prod_comp_list for i in j]).most_common(5)]
for g in top_langs:
    train['lang_' + g] = train['lang'].apply(lambda x: 1 if g in x else 0)
    
test['lang']=test['spoken_languages'].apply(lambda x:' '.join(sorted([i['name'] for i in x])) if x != {} else '')
test['num_lang']=test['spoken_languages'].apply(lambda x:len(x) if x!={} else 0)
for g in top_langs:
    test['lang_' + g] = test['lang'].apply(lambda x: 1 if g in x else 0)
    
train=train.drop('spoken_languages',axis=1)
test=test.drop('spoken_languages',axis=1)

### Keywords

Lets now check the keywords.

In [39]:
train['Keywords'].head()

0    [{'id': 4379, 'name': 'time travel'}, {'id': 9...
1    [{'id': 2505, 'name': 'coronation'}, {'id': 42...
2    [{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n...
3    [{'id': 10092, 'name': 'mystery'}, {'id': 1054...
4                                                   {}
Name: Keywords, dtype: object

In [40]:
for i,e in enumerate(train['Keywords'].head()):
    print(i,e)

0 [{'id': 4379, 'name': 'time travel'}, {'id': 9663, 'name': 'sequel'}, {'id': 11830, 'name': 'hot tub'}, {'id': 179431, 'name': 'duringcreditsstinger'}]
1 [{'id': 2505, 'name': 'coronation'}, {'id': 4263, 'name': 'duty'}, {'id': 6038, 'name': 'marriage'}, {'id': 13072, 'name': 'falling in love'}]
2 [{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'name': 'obsession'}, {'id': 1640, 'name': 'conservatory'}, {'id': 2176, 'name': 'music teacher'}, {'id': 14512, 'name': 'new york city'}, {'id': 14819, 'name': 'violence'}, {'id': 33896, 'name': 'montage'}, {'id': 156823, 'name': 'drummer'}, {'id': 170418, 'name': 'public humiliation'}, {'id': 176095, 'name': 'jazz band'}, {'id': 206298, 'name': 'young adult'}, {'id': 207739, 'name': 'music school'}]
3 [{'id': 10092, 'name': 'mystery'}, {'id': 10540, 'name': 'bollywood'}, {'id': 11734, 'name': 'police corruption'}, {'id': 14536, 'name': 'crime'}, {'id': 14636, 'name': 'india'}, {'id': 208364, 'name': 'missing husband'}, {'id': 220935, 'name': 'ne

We find that a single movie has multiple keywords.Lets check the most common keywords in the database.

In [16]:
kwds=list(train['Keywords'].apply(lambda x:[i['name'] for i in x] if x!={} else []).values)

In [49]:
kwds

[['time travel', 'sequel', 'hot tub', 'duringcreditsstinger'],
 ['coronation', 'duty', 'marriage', 'falling in love'],
 ['jazz',
  'obsession',
  'conservatory',
  'music teacher',
  'new york city',
  'violence',
  'montage',
  'drummer',
  'public humiliation',
  'jazz band',
  'young adult',
  'music school'],
 ['mystery',
  'bollywood',
  'police corruption',
  'crime',
  'india',
  'missing husband',
  'nerve gas'],
 [],
 [],
 [],
 ['journalism',
  'translation',
  'television',
  'manipulation of the media',
  'iraq',
  'reporter',
  'woman director'],
 ['island', 'pirate gang', 'puppet', 'treasure hunt'],
 ['mockumentary', 'folk singer'],
 ['underdog',
  'philadelphia',
  'transporter',
  'italo-american',
  'fight',
  "love of one's life",
  'publicity',
  'boxer',
  'independence',
  'boxing match',
  'training',
  'lovers',
  'surprise',
  'world champion',
  'amateur',
  'victory'],
 ['nerd', 'vacation', 'farce', 'jock', 'frame up', 'defector'],
 ['male nudity',
  'female nu

In [45]:
Counter(i for j in kwds for i in j).most_common(10)

[('woman director', 175),
 ('independent film', 155),
 ('duringcreditsstinger', 134),
 ('murder', 123),
 ('based on novel', 111),
 ('violence', 87),
 ('sport', 82),
 ('biography', 77),
 ('aftercreditsstinger', 75),
 ('dystopia', 73)]

We find woman director , independent film ,duringcreditsstinger ,murder as most common keywords.

Similar to what we did for genere,production countries etc , here too we extract the number of keywords , most common keywords in separate columns.

In [17]:
train['keywords']=train['Keywords'].apply(lambda x:' '.join(sorted([i['name'] for i in x])) if x != {} else '')
train['num_kwds']=train['Keywords'].apply(lambda x:len(x) if x!={} else 0)
top_kwds=[m[0] for m in Counter([i for j in kwds for i in j]).most_common(5)]
for g in top_kwds:
    train['kwds_' + g] = train['keywords'].apply(lambda x: 1 if g in x else 0)
    
test['keywords']=test['Keywords'].apply(lambda x:' '.join(sorted([i['name'] for i in x])) if x != {} else '')
test['num_kwds']=test['Keywords'].apply(lambda x:len(x) if x!={} else 0)
for g in top_kwds:
    test['kwds_' + g] = test['keywords'].apply(lambda x: 1 if g in x else 0)
    
train=train.drop('Keywords',axis=1)
test=test.drop('Keywords',axis=1)

### Cast

Lets check the cast column .

In [52]:
train['cast'].head()

0    [{'cast_id': 4, 'character': 'Lou', 'credit_id...
1    [{'cast_id': 1, 'character': 'Mia Thermopolis'...
2    [{'cast_id': 5, 'character': 'Andrew Neimann',...
3    [{'cast_id': 1, 'character': 'Vidya Bagchi', '...
4    [{'cast_id': 3, 'character': 'Chun-soo', 'cred...
Name: cast, dtype: object

In [19]:
for i,e in enumerate(train['cast'].head()):
    print(i,e)

0 [{'cast_id': 4, 'character': 'Lou', 'credit_id': '52fe4ee7c3a36847f82afae7', 'gender': 2, 'id': 52997, 'name': 'Rob Corddry', 'order': 0, 'profile_path': '/k2zJL0V1nEZuFT08xUdOd3ucfXz.jpg'}, {'cast_id': 5, 'character': 'Nick', 'credit_id': '52fe4ee7c3a36847f82afaeb', 'gender': 2, 'id': 64342, 'name': 'Craig Robinson', 'order': 1, 'profile_path': '/tVaRMkJXOEVhYxtnnFuhqW0Rjzz.jpg'}, {'cast_id': 6, 'character': 'Jacob', 'credit_id': '52fe4ee7c3a36847f82afaef', 'gender': 2, 'id': 54729, 'name': 'Clark Duke', 'order': 2, 'profile_path': '/oNzK0umwm5Wn0wyEbOy6TVJCSBn.jpg'}, {'cast_id': 7, 'character': 'Adam Jr.', 'credit_id': '52fe4ee7c3a36847f82afaf3', 'gender': 2, 'id': 36801, 'name': 'Adam Scott', 'order': 3, 'profile_path': '/5gb65xz8bzd42yjMAl4zwo4cvKw.jpg'}, {'cast_id': 8, 'character': 'Hot Tub Repairman', 'credit_id': '52fe4ee7c3a36847f82afaf7', 'gender': 2, 'id': 54812, 'name': 'Chevy Chase', 'order': 4, 'profile_path': '/svjpyYtPwtjvRxX9IZnOmOkhDOt.jpg'}, {'cast_id': 9, 'characte

We have the information on cast id , character , credit id , gender , id , name ,order and  profile path .
We extract the most common characters , actors from this column since its an intuition that people will watch a movie if their favourite actor stars in the movie or if its a series , the character would already made an impact and peoplw will love to watch it . 

In [18]:
list_charac = list(train['cast'].apply(lambda x:[i['character'] for i in x] if x!={} else []).values)

In [22]:
Counter(i for j in list_charac for i in j).most_common(10)

[('', 818),
 ('Himself', 610),
 ('Herself', 155),
 ('Dancer', 144),
 ('Additional Voices (voice)', 100),
 ('Doctor', 77),
 ('Reporter', 70),
 ('Waitress', 69),
 ('Nurse', 65),
 ('Bartender', 55)]

From the list it is seen that there are more entries where character name is not specified.

In [19]:
list_actors =list(train['cast'].apply(lambda x:[i['name'] for i in x] if x!={} else []).values)

In [25]:
Counter(i for j in list_actors for i in j).most_common(15)

[('Samuel L. Jackson', 30),
 ('Robert De Niro', 30),
 ('Morgan Freeman', 27),
 ('J.K. Simmons', 25),
 ('Bruce Willis', 25),
 ('Liam Neeson', 25),
 ('Susan Sarandon', 25),
 ('Bruce McGill', 24),
 ('John Turturro', 24),
 ('Forest Whitaker', 23),
 ('Willem Dafoe', 23),
 ('Bill Murray', 22),
 ('Owen Wilson', 22),
 ('Nicolas Cage', 22),
 ('Sylvester Stallone', 21)]

From the list we understand that the actors include both lead as well as supporting actors . We also want to separate out the lead actor from the movie .This can be extracted from the dictionary where the cast id is equal to 1 . Lets do it .

In [20]:
lead_actors=list(train['cast'].apply(lambda x:[i['name'] for i in x if i['cast_id']==1 or i['cast_id']==2] if x!={} else []).values)

In [32]:
Counter(i for j in lead_actors for i in j).most_common(15)

[('Mel Gibson', 12),
 ('Denzel Washington', 11),
 ('Eddie Murphy', 10),
 ('John Travolta', 9),
 ('Jeff Bridges', 9),
 ('Nicolas Cage', 9),
 ('Robert De Niro', 9),
 ('Dennis Quaid', 9),
 ('Gene Hackman', 8),
 ('Sylvester Stallone', 8),
 ('Owen Wilson', 8),
 ('John Cusack', 8),
 ('Clint Eastwood', 8),
 ('Mark Wahlberg', 8),
 ('Diane Keaton', 7)]

We have 12 movies from Mel Gibson ,11 from Denzel Washington ,10 from Eddie Murphy . We create a separate column with lead actors.

In [21]:
train['lead_act_1']=train['cast'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['cast_id']==1])) if x != {} else '')
train['lead_act_2']=train['cast'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['cast_id']==2])) if x != {} else '')
# top_actors=[m[0] for m in Counter([i for j in list_actors for i in j]).most_common(10)]
# for g in top_actors:
#     train['act_' + g] = train['keywords'].apply(lambda x: 1 if g in x else 0)
test['lead_act_1']=test['cast'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['cast_id']==1])) if x != {} else '')
test['lead_act_2']=test['cast'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['cast_id']==2])) if x != {} else '')

train=train.drop('cast',axis=1)
test=test.drop('cast',axis=1)

### Crew 

Lets check the crew column.

In [35]:
for i,e in enumerate(train['crew'].head()):
    print(i,e)

0 [{'credit_id': '59ac067c92514107af02c8c8', 'department': 'Directing', 'gender': 0, 'id': 1449071, 'job': 'First Assistant Director', 'name': 'Kelly Cantley', 'profile_path': None}, {'credit_id': '52fe4ee7c3a36847f82afad7', 'department': 'Directing', 'gender': 2, 'id': 3227, 'job': 'Director', 'name': 'Steve Pink', 'profile_path': '/myHOgo8mQSCiCAZNGMRdHVr03jr.jpg'}, {'credit_id': '5524ed25c3a3687ded000d88', 'department': 'Writing', 'gender': 2, 'id': 347335, 'job': 'Writer', 'name': 'Josh Heald', 'profile_path': '/pwXJIenrDMrG7t3zNfLvr8w1RGU.jpg'}, {'credit_id': '5524ed2d925141720c001128', 'department': 'Writing', 'gender': 2, 'id': 347335, 'job': 'Characters', 'name': 'Josh Heald', 'profile_path': '/pwXJIenrDMrG7t3zNfLvr8w1RGU.jpg'}, {'credit_id': '5524ed3d92514166c1004a5d', 'department': 'Production', 'gender': 2, 'id': 57822, 'job': 'Producer', 'name': 'Andrew Panay', 'profile_path': None}, {'credit_id': '5524ed4bc3a3687df3000dd2', 'department': 'Production', 'gender': 0, 'id': 14

Lets check the various job types and departments represented in the crew.

In [22]:
list_jobs = list(train['crew'].apply(lambda x:[i['job'] for i in x] if x!={} else []).values)

Once we have extracted the list of jobs lets create a separate crew name column who are from  the jobs that are most represented.

In [39]:
Counter(i for j in list_jobs for i in j).most_common(10)

[('Producer', 6011),
 ('Executive Producer', 3459),
 ('Director', 3225),
 ('Screenplay', 2996),
 ('Editor', 2824),
 ('Casting', 2483),
 ('Director of Photography', 2288),
 ('Original Music Composer', 1947),
 ('Art Direction', 1821),
 ('Production Design', 1650)]

Thus we have the most important crew being represented the most in the list .We create separate columns to represent each of the jobs.

In [23]:
train['producer']=train['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Producer"])) if x!={} else '')
train['exec_producer']=train['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Executive Producer"])) if x!={} else '')
train['director']=train['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Director"])) if x!={} else '')
train['screenplay']=train['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Screenplay"])) if x!={} else '')
train['editor']=train['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Editor"])) if x!={} else '')
train['music']=train['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Original Music Composer"])) if x!={} else '')


test['producer']=test['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Producer"])) if x!={} else '')
test['exec_producer']=test['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Executive Producer"])) if x!={} else '')
test['director']=test['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Director"])) if x!={} else '')
test['screenplay']=test['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Screenplay"])) if x!={} else '')
test['editor']=test['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Editor"])) if x!={} else '')
test['music']=test['crew'].apply(lambda x:' '.join(sorted([i['name'] for i in x if i['job']=="Original Music Composer"])) if x!={} else '')

train=train.drop('crew',axis=1)
test=test.drop('crew',axis=1)

Now that we have extracted features from json type columns , lets check the data structure and proceed for building our model.

In [25]:
train.head()

Unnamed: 0,id,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,...,kwds_murder,kwds_based on novel,lead_act_1,lead_act_2,producer,exec_producer,director,screenplay,editor,music
0,1,14000000,,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg,2/20/15,...,0,0,,,Andrew Panay,Ben Ormand Matt Moore Rob Corddry,Steve Pink,,Jamie Gross,Christophe Beck
1,2,40000000,,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,/w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg,8/6/04,...,0,0,Anne Hathaway,Julie Andrews,Debra Martin Chase Mario Iscovich Whitney Houston,Ellen H. Schwartz,Garry Marshall,Shonda Rhimes,Bruce Green,John Debney
2,3,3300000,http://sonyclassics.com/whiplash/,tt2582802,en,Whiplash,"Under the direction of a ruthless instructor, ...",64.29999,/lIv1QinFqz4dlp5U4lQ6HaiskOZ.jpg,10/10/14,...,0,0,,,David Lancaster Helen Estabrook Jason Blum Mic...,Couper Samuelson Gary Michael Walters Jason Re...,Damien Chazelle,Damien Chazelle,Tom Cross,Justin Hurwitz
3,4,1200000,http://kahaanithefilm.com/,tt1821480,hi,Kahaani,Vidya Bagchi (Vidya Balan) arrives in Kolkata ...,3.174936,/aTXRaPrWSinhcmCrcfJK17urp3F.jpg,3/9/12,...,0,0,Vidya Balan,,Sujoy Ghosh,,Sujoy Ghosh,,,
4,5,0,,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.14807,/m22s7zvkVFDU9ir56PiiqIEWFdT.jpg,2/5/09,...,0,0,,,,,Jong-seok Yoon,,,


Lets try to remove unwanted columns.For simplicity purpose , lets remove the text columns overview ,tagline 

In [41]:
train.columns

Index(['id', 'budget', 'homepage', 'imdb_id', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'release_date', 'runtime', 'status', 'tagline', 'title', 'revenue',
       'collection_name', 'num_genres', 'all_genres', 'genre_Drama',
       'genre_Comedy', 'genre_Thriller', 'genre_Action', 'genre_Romance',
       'genre_Crime', 'genre_Adventure', 'genre_Horror',
       'genre_Science Fiction', 'genre_Family', 'genre_Fantasy',
       'genre_Mystery', 'genre_Animation', 'genre_History', 'genre_Music',
       'prod_comp', 'num_prod', 'prod_Warner Bros.', 'prod_Universal Pictures',
       'prod_Paramount Pictures',
       'prod_Twentieth Century Fox Film Corporation', 'prod_Columbia Pictures',
       'prod_Metro-Goldwyn-Mayer (MGM)', 'prod_New Line Cinema',
       'prod_Touchstone Pictures', 'prod_Walt Disney Pictures',
       'prod_Columbia Pictures Corporation', 'prod_TriStar Pictures',
       'prod_Relativity Media', 'prod_Canal+', 'prod_United

In [24]:
useful_cols=[f for f in train.drop('revenue',axis=1).columns if f not in ['title','id','overview','release_date','all_genres','keywords','homepage','imdb_id','poster_path','status','tagline','all_geners','prod_comp','prod_country','lang']]

Lets extract date features from release date.

In [31]:
train['release_date'].head()

0     2/20/15
1      8/6/04
2    10/10/14
3      3/9/12
4      2/5/09
Name: release_date, dtype: object

In [34]:
test['release_date'].isnull().sum()

1

There is only 1 null value in the test data . Lets check which movie is it and impute the value.

In [36]:
test[test['release_date'].isnull()==True]

Unnamed: 0,id,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,...,kwds_murder,kwds_based on novel,lead_act_1,lead_act_2,producer,exec_producer,director,screenplay,editor,music
828,3829,0,,tt0210130,en,"Jails, Hospitals & Hip-Hop","Jails, Hospitals &amp; Hip-Hop is a cinematic ...",0.009057,,,...,0,0,,,,,,,,


A quick check in [IMBD](https://www.imdb.com/title/tt0210130/) puts the release date for this movie at May 2000 . Lets take 1st may as the date and impute it.

In [25]:
test.loc[test['release_date'].isnull()==True,'release_date']='5/1/00'

In [44]:
# temp_train=train.copy()
# temp_test=test.copy()

In [26]:
def fix_date(x):
    """
    Fixes dates which are in 20xx
    """
    year = x.split('/')[2]
    if int(year) <= 19:
        return x[:-2] + '20' + year
    else:
        return x[:-2] + '19' + year

In [27]:
train['release_date'] = train['release_date'].apply(lambda x: fix_date(x))
test['release_date'] = test['release_date'].apply(lambda x: fix_date(x))

train['release_date']=pd.to_datetime(train['release_date'])
test['release_date']=pd.to_datetime(test['release_date'])

In [28]:
### Using the function from Andrews kernel ,

def process_date(df):
    date_parts = ["year", "weekday", "month", 'weekofyear', 'day', 'quarter']
    for part in date_parts:
        part_col = 'release_date' + "_" + part
        df[part_col] = getattr(df['release_date'].dt, part).astype(int)
    
    return df



In [29]:
train = process_date(train)
test = process_date(test)

In [93]:
train.head()

Unnamed: 0,id,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,release_date,...,director,screenplay,editor,music,release_date_year,release_date_weekday,release_date_month,release_date_weekofyear,release_date_day,release_date_quarter
0,1,14000000,,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg,2015-02-20,...,Steve Pink,,Jamie Gross,Christophe Beck,2015,4,2,8,20,1
1,2,40000000,,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,/w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg,2004-08-06,...,Garry Marshall,Shonda Rhimes,Bruce Green,John Debney,2004,4,8,32,6,3
2,3,3300000,http://sonyclassics.com/whiplash/,tt2582802,en,Whiplash,"Under the direction of a ruthless instructor, ...",64.29999,/lIv1QinFqz4dlp5U4lQ6HaiskOZ.jpg,2014-10-10,...,Damien Chazelle,Damien Chazelle,Tom Cross,Justin Hurwitz,2014,4,10,41,10,4
3,4,1200000,http://kahaanithefilm.com/,tt1821480,hi,Kahaani,Vidya Bagchi (Vidya Balan) arrives in Kolkata ...,3.174936,/aTXRaPrWSinhcmCrcfJK17urp3F.jpg,2012-03-09,...,Sujoy Ghosh,,,,2012,4,3,10,9,1
4,5,0,,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.14807,/m22s7zvkVFDU9ir56PiiqIEWFdT.jpg,2009-02-05,...,Jong-seok Yoon,,,,2009,3,2,6,5,1


Lets check the na values in the dataset.

In [30]:
train.isna().sum()

id                                                0
budget                                            0
homepage                                       2054
imdb_id                                           0
original_language                                 0
original_title                                    0
overview                                          8
popularity                                        0
poster_path                                       1
release_date                                      0
runtime                                           2
status                                            0
tagline                                         597
title                                             0
revenue                                           0
collection_name                                   0
num_genres                                        0
all_genres                                        0
genre_Drama                                       0
genre_Comedy

# Modelling

Before we train a model , we see that there are lot of categorical variables.Lets first encode them.

In [96]:
from category_encoders import * 

We try to do mean encoding keeping popularity as a target variable . Since the popularity of a movie depends on the cast,language,production company ,crew  etc of the movie , I intuitively take this column for mean encoding.

In [31]:
## Inspired from https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm

def target_encode(df,column,target='popularity'):
    mean_list =df.groupby(by=column)[target].mean()
    return df[column].map(mean_list)

In [32]:
temp_train=train.copy()
temp_test=test.copy()

In [114]:
# train=temp_train.copy()
# test=temp_test.copy()

In [115]:
test[['collection_name','lead_act_1','popularity']].head()

Unnamed: 0,collection_name,lead_act_1,popularity
0,Pokémon Collection,,3.851534
1,0,,3.559789
2,0,,8.085194
3,0,,8.596012
4,0,Dennis Hopper,3.21768


In [33]:
null_columns=test[useful_cols].isna().sum()
null_columns[null_columns>0]

runtime    4
dtype: int64

In [34]:
null_columns=train[useful_cols].isna().sum()
null_columns[null_columns>0]

runtime    2
dtype: int64

In [35]:
test['runtime']=test['runtime'].fillna(test['runtime'].mean())


In [36]:
train['len_' + 'title'] = train['title'].fillna('').apply(lambda x: len(str(x)))
train['words_' + 'title'] = train['title'].fillna('').apply(lambda x: len(str(x.split(' '))))
train=train.drop('title',axis=1)

test['len_' + 'title'] = test['title'].fillna('').apply(lambda x: len(str(x)))
test['words_' + 'title'] = test['title'].fillna('').apply(lambda x: len(str(x.split(' '))))
test=test.drop('title',axis=1)

In [37]:
train['runtime']=train['runtime'].fillna(train['runtime'].mean())

In [38]:
for cols in ['collection_name','producer', 'exec_producer', 'director','screenplay', 'editor', 'music','lead_act_1', 'lead_act_2']:
    print(f'Target Encoding {cols} \n')
    train[cols]=target_encode(df=train,column=cols)
    #test[cols]=target_encode(df=test,column=cols)
    print('Encoding Completed')

Target Encoding collection_name 

Encoding Completed
Target Encoding producer 

Encoding Completed
Target Encoding exec_producer 

Encoding Completed
Target Encoding director 

Encoding Completed
Target Encoding screenplay 

Encoding Completed
Target Encoding editor 

Encoding Completed
Target Encoding music 

Encoding Completed
Target Encoding lead_act_1 

Encoding Completed
Target Encoding lead_act_2 

Encoding Completed


In [39]:
for cols in ['collection_name','producer', 'exec_producer', 'director','screenplay', 'editor', 'music','lead_act_1', 'lead_act_2']:
    print(f'Target Encoding {cols} \n')
    #train[cols]=target_encode(df=train,column=cols)
    test[cols]=target_encode(df=test,column=cols)
    #print(test[cols].head())
    print('Encoding Completed')

Target Encoding collection_name 

Encoding Completed
Target Encoding producer 

Encoding Completed
Target Encoding exec_producer 

Encoding Completed
Target Encoding director 

Encoding Completed
Target Encoding screenplay 

Encoding Completed
Target Encoding editor 

Encoding Completed
Target Encoding music 

Encoding Completed
Target Encoding lead_act_1 

Encoding Completed
Target Encoding lead_act_2 

Encoding Completed


In [40]:
train[useful_cols].select_dtypes('object').columns

Index(['original_language', 'original_title'], dtype='object')

We do a label encoding for these columns,

In [41]:
cat_cols=['original_language', 'original_title']

In [42]:
%%time
indexer = {}
for col in cat_cols:
    # print(col)
    _, indexer[col] = pd.factorize(train[col].astype(str), sort=True)
    
for col in tqdm_notebook(cat_cols):
    print(f'Encoding {col}\n')
    train[col] = indexer[col].get_indexer(train[col].astype(str))
    test[col] = indexer[col].get_indexer(test[col].astype(str))
    

HBox(children=(IntProgress(value=0, max=2), HTML(value='')))

Encoding original_language

Encoding original_title


Wall time: 373 ms


In [43]:
X=train.drop(['id','revenue'],axis=1)
Y=np.log1p(train['revenue'])
X_test=test.drop('id',axis=1)

Now , lets train our model .

In [44]:
from sklearn.model_selection import train_test_split,KFold
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb


In [45]:
x_train,x_valid,y_train,y_valid=train_test_split(X,Y,test_size=0.2,random_state=1000)

In [131]:
print(f'Shape of x_train is {x_train.shape},Shape of x_valid is {x_valid.shape},shape of y_valid is {y_valid.shape},shape of y_train is {y_train.shape}')

Shape of x_train is (2400, 94),Shape of x_valid is (600, 94),shape of y_valid is (600,),shape of y_train is (2400,)


In [46]:
m = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, max_features=0.8, n_jobs=-1, oob_score=True)
m.fit(x_train[useful_cols], y_train)


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=0.8, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=10, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
           oob_score=True, random_state=None, verbose=0, warm_start=False)

In [47]:
print(f'Score :{m.score(x_train[useful_cols],y_train)} Squared Error:{mean_squared_error(y_valid,m.predict(x_valid[useful_cols]))}')

Score :0.6644618233365843 Squared Error:5.245653824310716


The error rate is very high.Lets tune the parameters.

In [48]:
n_fold = 5
folds = KFold(n_splits=n_fold, shuffle=True, random_state=100)

In [None]:
oof = np.zeros(len(x_train))
predictions = np.zeros(len(x_valid))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(x_train)):
    print("Fold {}".format(fold_))
    trn_x,trn_y = x_train.iloc[trn_idx][useful_cols],y_train.iloc[trn_idx]
    val_x,val_y = x_train.iloc[val_idx][useful_cols],y_train.iloc[val_idx]

    num_round = 15000
    clf = RandomForestRegressor(n_estimators=1000, min_samples_leaf=2, max_features=0.8, n_jobs=-1, oob_score=True)
    clf.fit(trn_x,trn_y)
    oof[val_idx] = clf.predict(val_x)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = useful_cols
    fold_importance_df["importance"] = clf.feature_importances_
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(x_valid[useful_cols]) / folds.n_splits
    print('Fold {} most important features are: '.format(n_fold + 1))
    for i in np.argsort(fold_importance_df["importance"])[-10:]:
            print('{}  -> {}'.format(fold_importance_df.iloc[i, 0], fold_importance_df.iloc[i, 1]))
    
        
    print('Fold %2d RMSE : %.6f' % (fold_ + 1, mean_squared_error(val_y, oof[val_idx])))

print("CV score: {:<8.5f}".format(mean_squared_error(y_train, oof)))

From the 5 fold feature importance , it is seen that original title , runtime , director , popularity,budget,editor,number of keywords ,editor are playing a key role in deciding the revenue of the movie . Lets use these features and lightgbm to train the final model.

In [53]:
final_features=['screenplay','budget','popularity','director','runtime','original_title','num_kwds','num_prod']

In [56]:
param={'metric':'rmse'}
clf = lgb.LGBMRegressor(learning_rate=0.05,n_estimators=5000,num_leaves=100,min_split_gain=0.5,max_depth=5,random_state=100,min_child_samples=5,**param)

In [59]:
oof = np.zeros(len(x_train))
predictions = np.zeros(len(X_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(x_train)):
    print("Fold {}".format(fold_))
    trn_x,trn_y = x_train.iloc[trn_idx][final_features],y_train.iloc[trn_idx]
    val_x,val_y = x_train.iloc[val_idx][final_features],y_train.iloc[val_idx]

    num_round = 15000
    
    clf.fit(trn_x,trn_y,eval_set=[(val_x,val_y)],eval_metric='rmse',early_stopping_rounds=50,verbose=False)
    oof[val_idx] = clf.predict(val_x)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = final_features
    fold_importance_df["importance"] = clf.feature_importances_
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(X_test[final_features]) / folds.n_splits
    print('Fold {} most important features are: '.format(n_fold + 1))
    for i in np.argsort(fold_importance_df["importance"])[-10:]:
            print('{}  -> {}'.format(fold_importance_df.iloc[i, 0], fold_importance_df.iloc[i, 1]))
    
        
    print('Fold %2d RMSE : %.6f' % (fold_ + 1, mean_squared_error(val_y, oof[val_idx])))

print("CV score: {:<8.5f}".format(mean_squared_error(y_train, oof)))

Fold 0
Fold 6 most important features are: 
screenplay  -> 97
num_kwds  -> 126
num_prod  -> 126
original_title  -> 155
director  -> 219
runtime  -> 234
popularity  -> 250
budget  -> 277
Fold  1 RMSE : 4.857602
Fold 1
Fold 6 most important features are: 
screenplay  -> 160
num_prod  -> 165
num_kwds  -> 198
director  -> 307
original_title  -> 323
runtime  -> 327
popularity  -> 328
budget  -> 403
Fold  2 RMSE : 4.914707
Fold 2
Fold 6 most important features are: 
num_prod  -> 120
num_kwds  -> 140
screenplay  -> 143
original_title  -> 191
runtime  -> 212
popularity  -> 237
director  -> 243
budget  -> 318
Fold  3 RMSE : 5.024892
Fold 3
Fold 6 most important features are: 
screenplay  -> 201
num_prod  -> 217
num_kwds  -> 265
director  -> 365
popularity  -> 395
budget  -> 427
runtime  -> 438
original_title  -> 451
Fold  4 RMSE : 4.941461
Fold 4
Fold 6 most important features are: 
num_prod  -> 125
screenplay  -> 155
num_kwds  -> 195
original_title  -> 208
popularity  -> 231
director  -> 276
r

# Submission 

In [None]:

sub['revenue']=np.expm1(predictions)
sub.to_csv('sample_submission_0.csv')