## Overview

We're going to make you an offer you can't refuse: a Kaggle competition!

In a world... where movies made an estimated $41.7 billion in 2018, the film industry is more popular than ever. But what movies make the most money at the box office? How much does a director matter? Or the budget? For some movies, it's "You had me at 'Hello.'" For others, the trailer falls short of expectations and you think "What we have here is a failure to communicate."

In this competition, you're presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. You can collect other publicly available data to use in your model predictions, but in the spirit of this competition, use only data that would have been available before a movie's release.

Join in, "make our day", and then "you've got to ask yourself one question: 'Do I feel lucky?'"

## Data

In this dataset, you are provided with 7398 movies and a variety of metadata obtained from The Movie Database (TMDB). Movies are labeled with id. Data points include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.

You are predicting the worldwide revenue for 4398 movies in the test file.

Note: many movies are remade over the years, so it may seem that multiple instances of a movie appear in the data. They are different, however, and should be considered separate movies. In addition, some movies share a title but are entirely unrelated.

E.g. The Karate Kid (id: 5266) was released in 1986, while a clearly (or maybe just subjectively) inferior remake (id: 1987) was released in 2010. Also, while the Frozen (id: 5295) released by Disney in 2013 may be the household name, don't forget about the less-popular Frozen (id: 139) released three years earlier about skiers who are stranded on a chairlift...
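
Because titles can repeat, `id` is the safer key for joins and lookups. The sketch below shows one way to list rows that share a title. It uses a toy frame rather than the real files: the two Frozen ids come from the note above, and the third row is hypothetical.

```python
import pandas as pd

# Toy stand-in for the movie metadata; only the two Frozen ids are from the text,
# the last row is a made-up placeholder.
movies = pd.DataFrame({
    "id": [5295, 139, 1],
    "title": ["Frozen", "Frozen", "Some Other Film"],
})

# keep=False flags every row whose title occurs more than once
shared_titles = movies[movies.duplicated("title", keep=False)]
print(shared_titles.sort_values(["title", "id"]))
```

Rows flagged this way are candidates for a manual check, not duplicates to drop.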

In [27]:
## initialization

%reset -f

import sys

import numpy as np, pandas as pd, json, ast
import sklearn
from sklearn.model_selection import train_test_split
import sklearn.metrics as skm
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer, TfidfVectorizer

# ignore warnings (only if you are the kind that would code when the world is burning)
import warnings
warnings.filterwarnings('ignore')

# some options
#MAX_EVALS=5
randomseed = 1 # the value for the random state used at various points in the pipeline
pd.options.display.max_rows = 2 # show at most 2 rows per DataFrame; raise this for the full output in cells rather than the truncated list
pd.options.display.max_columns = 200

# to display multiple outputs in a cell without using print/display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# display wd files
import os
print('folder files: ', os.listdir(), '\n')
print('envir variables: ')
%who

folder files:  ['.ipynb_checkpoints', 'sample_submission.csv', 'test.csv', 'test.json', 'TMDB.ipynb', 'tmdb.pkl', 'train.csv', 'train.json'] 

envir variables: 
CountVectorizer	 HashingVectorizer	 InteractiveShell	 TfidfVectorizer	 ast	 json	 np	 os	 pd	 


In [28]:
## functions

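def split_ohe(df, cols):
    """Split each space-separated column in `cols` into one-hot indicator columns."""
    df = df.reset_index(drop=True)
    for i in cols:
        temp = (df[i].str.split(' ', expand=True)
               .stack()
               .str.get_dummies()
               .groupby(level=0).sum())  # .sum(level=0) was removed in pandas 2.0
        # rows with a missing value drop out of `temp`; concat leaves NaN for them
        df = pd.concat([df.drop(i, axis=1), temp], axis=1)
    return df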
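
As a possible alternative to the split/stack chain in `split_ohe`, pandas can split and one-hot encode a delimited string column in a single call with `Series.str.get_dummies(sep=...)`. The sketch below runs on a toy frame (the values are illustrative, not taken from the dataset) and is not the notebook's own method.

```python
import pandas as pd

# Toy frame shaped like the cleaned data: space-separated tokens per cell
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "genres": ["Comedy", "Thriller Action", None],
})

# str.get_dummies splits on `sep` and one-hot encodes in one step;
# a missing cell becomes an all-zero row instead of NaN
dummies = toy["genres"].str.get_dummies(sep=" ")
encoded = pd.concat([toy.drop(columns="genres"), dummies], axis=1)
print(encoded)
```

One difference worth noting: missing values come out as rows of zeros here, so a later `fillna` is not needed for these columns.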

In [29]:
cols = ['id', 'budget', 'genres', 'original_language', 'popularity', 'runtime', 'production_companies', 'production_countries', 
        'spoken_languages', 'Keywords']
nlp_cols = ['belongs_to_collection', 'title', 'overview', 'Keywords', 'tagline']
custcols = ['homepage', 'release_date']

train = pd.read_csv('../tmdb/train.csv')
test = pd.read_csv('../tmdb/test.csv')
train.shape, test.shape

((3000, 23), (4398, 22))

In [30]:
trainnew = train[cols].copy()
testnew = test[cols].copy()
ytrain = train.revenue

# regex cleaning on the json columns. very crude, very sad. yet simplest current solution
def clean_json_cols(df):
    # strip braces, brackets, quotes, whitespace, and the id/iso/'name' keys
    df.replace(r"\{|id(.*?)\,|'name'\: |\[|\]|\}|'|\s+|id(.*?)\}\]|iso(.*?)\:", '', regex=True, inplace=True)
    df.replace(r'\,', ' ', regex=True, inplace=True)  # commas become token separators
    df.replace(r'poster(.*)', '', regex=True, inplace=True)
    df.replace(r'(^\s+|\s+$)', '', regex=True, inplace=True)
    # drop tokens of 3+ characters, leaving only the short ISO codes
    df['production_countries'] = df['production_countries'].replace(r'[A-Za-z0-9_]{3,}', '', regex=True)
    df['spoken_languages'] = df['spoken_languages'].replace(r'\b(\w{3,})\b', '', regex=True)

clean_json_cols(trainnew)
clean_json_cols(testnew)
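
A less crude option than regex cleaning may be to parse the cells. The sketch below assumes the raw columns hold Python-literal strings such as `"[{'id': 35, 'name': 'Comedy'}]"`; that format, the sample values, and the helper name `extract_names` are assumptions for illustration. It relies on `ast.literal_eval`, which the notebook already imports.

```python
import ast
import pandas as pd

def extract_names(cell):
    """Return the 'name' values from a stringified list of dicts, or [] if missing."""
    if pd.isna(cell):
        return []
    return [d["name"] for d in ast.literal_eval(cell)]

# Hypothetical cell values in the assumed raw format
raw = pd.Series([
    "[{'id': 35, 'name': 'Comedy'}]",
    "[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'name': 'Action'}]",
    None,
])
names = raw.apply(extract_names)
print(names.tolist())
```

Parsing keeps multi-word names intact, whereas the regex pass removes all whitespace inside them.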

In [31]:
trainnew

Unnamed: 0,id,budget,genres,original_language,popularity,runtime,production_companies,production_countries,spoken_languages,Keywords
0,1,14000000,Comedy,en,6.575393,93.0,ParamountPictures UnitedArtists Metro-Goldwyn-...,US,en,timetravel sequel hottub duringcreditsstinger
...,...,...,...,...,...,...,...,...,...,...
2999,3000,35000000,Thriller Action Mystery,en,10.512109,106.0,LionsGateFilms VertigoEntertainment GothamGrou...,US,en,cia airport hero fight ktimebomb training webc...


In [42]:
import pickle

# tmdb = open('./tmdb.pkl','wb')
# pickle.dump(trainnew, tmdb)
# pickle.dump(testnew, tmdb)
# tmdb.close()

# load backup
with open('./tmdb.pkl', 'rb') as tmdb:
    trainnew = pickle.load(tmdb)
    testnew = pickle.load(tmdb)

In [43]:
# splitting and ohe on the json cols
# trainnew = split_ohe(trainnew, ['genres', 'production_companies', 'production_countries', 'spoken_languages'])
# testnew = split_ohe(testnew, ['genres', 'production_companies', 'production_countries', 'spoken_languages'])
allcols = list(set(trainnew.columns) & set(testnew.columns))
trainnew = trainnew[allcols]
testnew = testnew[allcols]
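
Intersecting two Python sets gives the shared columns in arbitrary order, which can differ between runs. A minimal sketch of an order-preserving variant follows; the toy frames and their column names are placeholders, not the real data.

```python
import pandas as pd

# Toy frames with partly overlapping columns in different orders
train_toy = pd.DataFrame({"id": [1], "budget": [10], "only_in_train": [0], "runtime": [90]})
test_toy = pd.DataFrame({"runtime": [100], "id": [2], "budget": [20], "only_in_test": [0]})

# Keep the shared columns in the order they appear in the training frame
test_cols = set(test_toy.columns)
shared = [c for c in train_toy.columns if c in test_cols]
train_toy, test_toy = train_toy[shared], test_toy[shared]
print(shared)
```

A stable column order makes the wide one-hot output easier to compare across runs.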

In [44]:
trainnew.head(1)

Unnamed: 0,PorchlightFilms,DieterGeisslerFilmproduktion,VerdiProductions,...,Keywords,...,ChinaFilmCo-ProductionCorporation,ChockstonePictures
0,0.0,0.0,0.0,...,timetravel sequel hottub duringcreditsstinger,...,0.0,0.0


In [46]:
xxx = trainnew.fillna(-1)
xt, xtt, xy, xyy = train_test_split(xxx, ytrain, test_size=0.33, random_state=randomseed)

In [47]:
from sklearn.decomposition import PCA

pcamod = PCA()
pcamod.fit(xt)
Xt=pcamod.transform(xt)
Xtt=pcamod.transform(xtt)
Xt.shape, Xtt.shape

ValueError: could not convert string to float: 'en'
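
The error arises because PCA needs an all-numeric matrix, while the frame still contains string columns such as `original_language` and `Keywords`. One way around it, sketched below on toy data, is to keep only numeric columns before fitting. The values are illustrative, and dropping the string columns is a simplification rather than the notebook's intended fix.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Toy frame mixing numeric and string columns, like the notebook's trainnew
toy = pd.DataFrame({
    "budget": [14e6, 35e6, 1e6, 5e6, 2e7, 8e6],
    "runtime": [93.0, 106.0, None, 88.0, 120.0, 101.0],
    "original_language": ["en", "en", "fr", "hi", "en", "ja"],
})
y = pd.Series([1.2e7, 9.0e7, 3.0e5, 4.0e6, 6.0e7, 1.0e7])

# Keep numeric columns only, then fill gaps the same way the notebook does
numeric = toy.select_dtypes(include="number").fillna(-1)
xt, xtt, xy, xyy = train_test_split(numeric, y, test_size=0.33, random_state=1)

pcamod = PCA().fit(xt)
Xt, Xtt = pcamod.transform(xt), pcamod.transform(xtt)
print(Xt.shape, Xtt.shape)
```

The string columns could instead be encoded and kept, at the cost of a wider matrix.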

In [344]:
# ## nlp side of things

# train_text = trainnew[nlp_cols].copy()
# train_text['string_all'] = train_text.apply(lambda x: ' '.join(x.dropna()), axis=1)
# train_text = train_text[['string_all']]

# nlp_train, nlp_test, nlp_ytrain, nlp_ytest = train_test_split(train_text, ytrain, test_size=0.33, random_state=42)

# vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1,3), stop_words='english',
#                                strip_accents='unicode', analyzer='word', norm='l2')
# vectorizer.fit(nlp_train.string_all)  # fit on the text column; iterating a DataFrame yields only its column names
# nlp_train_new = pd.DataFrame(vectorizer.transform(nlp_train.string_all).todense())
# nlp_test_new = pd.DataFrame(vectorizer.transform(nlp_test.string_all).todense())

# mod=sklearn.linear_model.SGDRegressor(penalty='l2', eta0=0.1, max_iter=100)  # n_iter was removed from scikit-learn
# mod.fit(nlp_train_new, nlp_ytrain)
# mod.score(nlp_test_new, nlp_ytest)
# sklearn.metrics.mean_squared_log_error(y_pred=mod.predict(nlp_test_new), y_true=nlp_ytest)
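
The commented-out text model above can be sketched as a scikit-learn pipeline. This version swaps in `Ridge` for `SGDRegressor` and fits on `log1p(revenue)`, both of which are assumptions made for the illustration, and it scores with the root of the mean squared log error. The texts and revenue figures are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_log_error
from sklearn.pipeline import make_pipeline

# Made-up overview snippets and revenues, standing in for the joined text columns
texts = [
    "time travel comedy sequel in a hot tub",
    "cia agent fights to stop a time bomb at the airport",
    "quiet family drama set in a small town",
    "superhero sequel with a huge final battle",
    "comedy about a family road trip",
    "action thriller with a cia hero",
]
revenue = np.array([1.2e7, 9.0e7, 3.0e5, 4.0e8, 2.0e7, 6.0e7])

# Fit on the first four rows, evaluate on the last two
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(texts[:4], np.log1p(revenue[:4]))

# Undo the log transform and clip at zero so the log-error metric is defined
pred = np.clip(np.expm1(model.predict(texts[4:])), 0, None)
rmsle = np.sqrt(mean_squared_log_error(revenue[4:], pred))
print(rmsle)
```

Wrapping the vectorizer and the model together means the vocabulary is learned from the training rows only.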

In [None]:
## rest

train_else = trainnew[['runtime', 'popularity', 'budget', 'original_language']]  # a list of columns needs double brackets
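
Of these four columns, `original_language` is categorical and needs encoding before most models can use it. A minimal sketch with `pd.get_dummies` follows. The first two rows reuse values shown in the earlier output, the third row is hypothetical, and the choice of one-hot encoding is an assumption about where this cell was heading.

```python
import pandas as pd

# First two rows mirror the displayed training rows; the third is made up
toy = pd.DataFrame({
    "runtime": [93.0, 106.0, 88.0],
    "popularity": [6.575393, 10.512109, 3.2],
    "budget": [14000000, 35000000, 1000000],
    "original_language": ["en", "en", "fr"],
})

# One indicator column per language; the numeric columns pass through unchanged
features = pd.get_dummies(toy, columns=["original_language"], dtype=float)
print(features.columns.tolist())
```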