# TMDB Box Office Predictions 
In this project, we will use supervised machine learning models to predict the box office revenue of hundreds of films

# API Pull 
To retrieve our dataset, we will conduct an API pull from the TMDB website 
https://www.themoviedb.org/

In [1]:
#import dependencies 
import pandas as pd
import numpy as np
from config import api_key
import json
from collections import Counter
from pprint import pprint
import requests
import os 
import csv 

In [2]:
data = pd.read_csv('train.csv')

In [3]:
data2= pd.read_csv('test.csv')

In [4]:
up_to_2019= pd.concat([data, data2])

In [5]:
up_to_2019.head()

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651.0
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,...,8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435.0
2,3,,3300000,"[{'id': 18, 'name': 'Drama'}]",http://sonyclassics.com/whiplash/,tt2582802,en,Whiplash,"Under the direction of a ruthless instructor, ...",64.29999,...,10/10/14,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The road to greatness can take you to the edge.,Whiplash,"[{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n...","[{'cast_id': 5, 'character': 'Andrew Neimann',...","[{'credit_id': '54d5356ec3a3683ba0000039', 'de...",13092000.0
3,4,,1200000,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",http://kahaanithefilm.com/,tt1821480,hi,Kahaani,Vidya Bagchi (Vidya Balan) arrives in Kolkata ...,3.174936,...,3/9/12,122.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,Kahaani,"[{'id': 10092, 'name': 'mystery'}, {'id': 1054...","[{'cast_id': 1, 'character': 'Vidya Bagchi', '...","[{'credit_id': '52fe48779251416c9108d6eb', 'de...",16000000.0
4,5,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.14807,...,2/5/09,118.0,"[{'iso_639_1': 'ko', 'name': '한국어/조선말'}]",Released,,Marine Boy,,"[{'cast_id': 3, 'character': 'Chun-soo', 'cred...","[{'credit_id': '52fe464b9251416c75073b43', 'de...",3923970.0


Deciding features to know what we want to pull from TMDB API 

Features: film, budget, genres, original_language, popularity, production_companies, 
           release_date, runtime',revenue

Target: Revenue

In [6]:
#Kaggle Dataset Provides us with movie info up until 2019
#to keep project up to date, we will preform API pulls for 2020 and 2021 films 

In [7]:
#The tmdb only allows API pulls for 1 page at a time 
#page 1 url
#we will pull the 1st 10 pages of each year with descending revenue
response = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=1')

In [8]:
highest_revenue = response.json() # store parsed json response

# uncomment the next line to get a peek at the highest_revenue json structure
# highest_revenue

highest_revenue_films = highest_revenue['results']

In [9]:
# define column names for our new dataframe
columns = ['film', 'budget', 'genres', 'original_language', 'popularity', 'production_companies', 
           'release_date', 'runtime','revenue']

# create dataframe with film and revenue columns
df_2020 = pd.DataFrame(columns=columns)

In [10]:
# for each of the highest revenue films make an api call for that specific movie to return all of our column headers
for film in highest_revenue_films:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020.loc[len(df_2020)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe    

In [11]:
#page 2
response2 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=2')

In [12]:
two= response2.json()
page_two = two['results']

In [13]:
df_2020_2 = pd.DataFrame(columns=columns)

In [14]:
# for each of the highest revenue films make an api call for that specific movie to return all of our column headers
for film in page_two:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_2.loc[len(df_2020_2)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe    

In [15]:
#page 3
response3 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=3')
three= response3.json()
page_three = three['results']

In [16]:
df_2020_3 = pd.DataFrame(columns=columns)

In [17]:
for film in page_three:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_3.loc[len(df_2020_3)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 
    

In [18]:
#Page 4
response4 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=4')
four= response4.json()
page_four = four['results']
df_2020_4 = pd.DataFrame(columns=columns)

In [19]:
for film in page_four:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_4.loc[len(df_2020_4)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 
    

In [20]:
#page5
response5 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=5')
five= response5.json()
page_five = five['results']
df_2020_5 = pd.DataFrame(columns=columns)

In [21]:
for film in page_five:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_5.loc[len(df_2020_5)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 
    

In [22]:
#page6
response6 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=6')
six= response6.json()
page_six = six['results']
df_2020_6 = pd.DataFrame(columns=columns)

In [23]:
for film in page_six:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_6.loc[len(df_2020_6)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 
    

In [24]:
#page7
response7 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=7')
seven= response7.json()
page_seven = seven['results']
df_2020_7 = pd.DataFrame(columns=columns)

In [25]:
for film in page_seven:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_7.loc[len(df_2020_7)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 
    

In [26]:
#page8
response8 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=8')
eight= response8.json()
page_eight = eight['results']
df_2020_8 = pd.DataFrame(columns=columns)

In [27]:
for film in page_eight:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_8.loc[len(df_2020_8)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [28]:
#page9
response9 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=9')
nine= response9.json()
page_nine = nine['results']
df_2020_9 = pd.DataFrame(columns=columns)

In [29]:
for film in page_nine:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_9.loc[len(df_2020_9)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [30]:
#page10
response10 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2020&sort_by=revenue.desc'+'&page=10')
ten= response10.json()
page_ten = ten['results']
df_2020_10 = pd.DataFrame(columns=columns)

In [31]:
for film in page_ten:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2020_10.loc[len(df_2020_10)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [32]:
#Combine all 10 2020 movie dfs 
Total_2020_Movies= pd.concat([df_2020, df_2020_2, df_2020_3, df_2020_4, df_2020_5, df_2020_6,
          df_2020_7, df_2020_8, df_2020_9, df_2020_10], ignore_index=True, axis=0)

In [33]:
#drop production companies bc there are so many blanks
Total_2020_Movies = Total_2020_Movies.drop(columns = ['production_companies'])

Data for 2021

In [34]:
#page1
response1a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=1')
onea= response1a.json()
page_onea = onea['results']
df_2021 = pd.DataFrame(columns=columns)

In [35]:
for film in page_onea:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021.loc[len(df_2021)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [36]:
#page2
response2a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
twoa= response2a.json()
page_twoa = twoa['results']
df_2021_2 = pd.DataFrame(columns=columns)

In [37]:
for film in page_twoa:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_2.loc[len(df_2021_2)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [38]:
#page3
response3a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
threea= response3a.json()
page_threea = threea['results']
df_2021_3 = pd.DataFrame(columns=columns)

In [39]:
for film in page_threea:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_3.loc[len(df_2021_3)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [40]:
#page4
response4a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
foura= response4a.json()
page_foura = foura['results']
df_2021_4 = pd.DataFrame(columns=columns)

In [41]:
for film in page_foura:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_4.loc[len(df_2021_4)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [42]:
#page5
response5a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
fivea= response5a.json()
page_fivea = fivea['results']
df_2021_5 = pd.DataFrame(columns=columns)

In [43]:
for film in page_twoa:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_5.loc[len(df_2021_5)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [44]:
#page6
response6a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
sixa= response6a.json()
page_sixa = sixa['results']
df_2021_6 = pd.DataFrame(columns=columns)

In [45]:
for film in page_sixa:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_6.loc[len(df_2021_6)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [46]:
#page7
response7a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
sevena= response7a.json()
page_sevena = sevena['results']
df_2021_7 = pd.DataFrame(columns=columns)

In [47]:
for film in page_sevena:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_7.loc[len(df_2021_7)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [48]:
#page8
response8a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
eighta= response8a.json()
page_eighta = eighta['results']
df_2021_8 = pd.DataFrame(columns=columns)

In [49]:
for film in page_eighta:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_8.loc[len(df_2021_8)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [50]:
#page9
response9a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
ninea= response9a.json()
page_ninea = ninea['results']
df_2021_9 = pd.DataFrame(columns=columns)

In [51]:
for film in page_ninea:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_9.loc[len(df_2021_9)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [52]:
#page10
response10a = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&primary_release_year=2021&sort_by=revenue.desc'+'&page=2')
tena= response10a.json()
page_tena = tena['results']
df_2021_10 = pd.DataFrame(columns=columns)

In [53]:
for film in page_tena:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    #print(locale.currency(film_revenue['revenue'], grouping=True ))
    df_2021_10.loc[len(df_2021_10)]=[film['title'],film_revenue['budget'],film_revenue['genres'],
                     film_revenue['original_language'],film_revenue['popularity'],
                     film_revenue['production_companies'],film_revenue['release_date'],film_revenue['runtime'],
                     film_revenue['revenue']] # store title and revenue in our dataframe 

In [54]:
#Combine all 10 2021 movie dfs 
Total_2021_Movies = pd.concat([df_2021, df_2021_2, df_2021_3, df_2021_4, df_2021_5, df_2021_6,
          df_2021_7, df_2021_8, df_2021_9, df_2021_10], ignore_index=True, axis=0)

In [55]:
#drop production companies bc there are so many blanks
Total_2021_Movies = Total_2021_Movies.drop(columns = ['production_companies'])

In [56]:
#View your dataframes
Total_2020_Movies

Unnamed: 0,film,budget,genres,original_language,popularity,release_date,runtime,revenue
0,Demon Slayer -Kimetsu no Yaiba- The Movie: Mug...,15800000,"[{'id': 16, 'name': 'Animation'}, {'id': 28, '...",ja,1194.928,2020-10-16,117,503063688
1,The Eight Hundred,80000000,"[{'id': 10752, 'name': 'War'}, {'id': 36, 'nam...",zh,18.140,2020-08-14,147,460919368
2,Metallica: WorldWired Tour - Live in Mancheste...,0,"[{'id': 10402, 'name': 'Music'}]",en,1.064,2020-06-08,150,426900000
3,Bad Boys for Life,90000000,"[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...",en,128.139,2020-01-15,124,426505244
4,Tenet,205000000,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",en,124.400,2020-08-22,150,363129000
...,...,...,...,...,...,...,...,...
195,Force of Nature,0,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",en,21.495,2020-07-02,91,215668
196,Two of Us,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",fr,5.478,2020-02-12,99,208723
197,The Predators,0,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",it,8.845,2020-10-22,109,206589
198,Ghosts of War,0,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",en,53.888,2020-07-03,94,178592


In [57]:
Total_2021_Movies

Unnamed: 0,film,budget,genres,original_language,popularity,release_date,runtime,revenue
0,Spider-Man: No Way Home,200000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,7517.432,2021-12-15,148,1832000000
1,The Battle at Lake Changjin,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10752, 'n...",zh,15.302,2021-09-30,176,888577720
2,"Hi, Mom",59000000,"[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name...",zh,8.188,2021-02-12,128,822049668
3,No Time to Die,250000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '...",en,512.994,2021-09-29,163,774153007
4,F9,200000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",en,494.338,2021-05-19,143,721077945
...,...,...,...,...,...,...,...,...
195,Halloween Kills,20000000,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",en,336.130,2021-10-14,105,127000000
196,The King's Man,100000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,3507.400,2021-12-22,131,124005195
197,The Addams Family 2,0,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",en,376.331,2021-10-01,93,119815153
198,Wrath of Man,40000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",en,379.364,2021-04-22,119,103966489


In [58]:
Total_Movies=  pd.concat([Total_2020_Movies, Total_2021_Movies])

In [59]:
Total_Movies['release_date'] = pd.to_datetime(Total_Movies.release_date, format='%Y-%m-%d')

In [60]:
Total_Movies['release_date'] = Total_Movies["release_date"].dt.strftime("%m/%d/%y")

In [61]:
Total_Movies

Unnamed: 0,film,budget,genres,original_language,popularity,release_date,runtime,revenue
0,Demon Slayer -Kimetsu no Yaiba- The Movie: Mug...,15800000,"[{'id': 16, 'name': 'Animation'}, {'id': 28, '...",ja,1194.928,10/16/20,117,503063688
1,The Eight Hundred,80000000,"[{'id': 10752, 'name': 'War'}, {'id': 36, 'nam...",zh,18.140,08/14/20,147,460919368
2,Metallica: WorldWired Tour - Live in Mancheste...,0,"[{'id': 10402, 'name': 'Music'}]",en,1.064,06/08/20,150,426900000
3,Bad Boys for Life,90000000,"[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...",en,128.139,01/15/20,124,426505244
4,Tenet,205000000,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",en,124.400,08/22/20,150,363129000
...,...,...,...,...,...,...,...,...
195,Halloween Kills,20000000,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",en,336.130,10/14/21,105,127000000
196,The King's Man,100000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,3507.400,12/22/21,131,124005195
197,The Addams Family 2,0,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",en,376.331,10/01/21,93,119815153
198,Wrath of Man,40000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",en,379.364,04/22/21,119,103966489


Importing Kaggle Dataset 
https://www.kaggle.com/c/tmdb-box-office-prediction/data?select=train.csv

In [62]:
pd.set_option('display.max_columns', None)

In [63]:
up_to_2019

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,/tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg,"[{'name': 'Paramount Pictures', 'id': 4}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651.0
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,/w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg,"[{'name': 'Walt Disney Pictures', 'id': 2}]","[{'iso_3166_1': 'US', 'name': 'United States o...",8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435.0
2,3,,3300000,"[{'id': 18, 'name': 'Drama'}]",http://sonyclassics.com/whiplash/,tt2582802,en,Whiplash,"Under the direction of a ruthless instructor, ...",64.299990,/lIv1QinFqz4dlp5U4lQ6HaiskOZ.jpg,"[{'name': 'Bold Films', 'id': 2266}, {'name': ...","[{'iso_3166_1': 'US', 'name': 'United States o...",10/10/14,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The road to greatness can take you to the edge.,Whiplash,"[{'id': 1416, 'name': 'jazz'}, {'id': 1523, 'n...","[{'cast_id': 5, 'character': 'Andrew Neimann',...","[{'credit_id': '54d5356ec3a3683ba0000039', 'de...",13092000.0
3,4,,1200000,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",http://kahaanithefilm.com/,tt1821480,hi,Kahaani,Vidya Bagchi (Vidya Balan) arrives in Kolkata ...,3.174936,/aTXRaPrWSinhcmCrcfJK17urp3F.jpg,,"[{'iso_3166_1': 'IN', 'name': 'India'}]",3/9/12,122.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,,Kahaani,"[{'id': 10092, 'name': 'mystery'}, {'id': 1054...","[{'cast_id': 1, 'character': 'Vidya Bagchi', '...","[{'credit_id': '52fe48779251416c9108d6eb', 'de...",16000000.0
4,5,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.148070,/m22s7zvkVFDU9ir56PiiqIEWFdT.jpg,,"[{'iso_3166_1': 'KR', 'name': 'South Korea'}]",2/5/09,118.0,"[{'iso_639_1': 'ko', 'name': '한국어/조선말'}]",Released,,Marine Boy,,"[{'cast_id': 3, 'character': 'Chun-soo', 'cred...","[{'credit_id': '52fe464b9251416c75073b43', 'de...",3923970.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4393,7394,,42000000,"[{'id': 53, 'name': 'Thriller'}]",,tt0218922,en,Original Sin,A young man is plunged into a life of subterfu...,9.970359,/i8FEQy5IWAqOzXm4uDHy2r3Swym.jpg,"[{'name': 'Intermedia Films', 'id': 763}, {'na...","[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso...",8/3/01,118.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,This is not a love story - it's a story about ...,Original Sin,"[{'id': 515, 'name': 'women'}, {'id': 572, 'na...","[{'cast_id': 17, 'character': 'Julia Russell/B...","[{'credit_id': '52fe4330c3a36847f80412db', 'de...",
4394,7395,"[{'id': 146534, 'name': 'Without a Paddle Coll...",19000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,tt0364751,en,Without a Paddle,"Three friends, whose lives have been drifting ...",6.046516,/oZDbFtTnTwW5GSfyaGFGaYxDBgD.jpg,"[{'name': 'Paramount Pictures', 'id': 4}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",8/20/04,95.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"The call of the wild, the thrill of adventure....",Without a Paddle,"[{'id': 4959, 'name': 'death of a friend'}, {'...","[{'cast_id': 40, 'character': 'Dan Mott', 'cre...","[{'credit_id': '52fe43b29251416c7501a909', 'de...",
4395,7396,,16000000,"[{'id': 18, 'name': 'Drama'}]",,tt0084855,en,The Verdict,"Frank Galvin is a down-on-his luck lawyer, red...",9.596883,/hh9sIE1PT7Pjq3n2fzHNEHh8Ogq.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",12/8/82,129.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"The doctors want to settle, the Church wants t...",The Verdict,"[{'id': 1680, 'name': 'boston'}, {'id': 6148, ...","[{'cast_id': 1, 'character': 'Frank Galvin', '...","[{'credit_id': '52fe448bc3a368484e028c55', 'de...",
4396,7397,,2000000,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",,tt3235888,en,It Follows,"For 19-year-old Jay, fall should be about scho...",20.359336,/4MrwJZr0R9LbyOgZqwLNmtzzxbu.jpg,"[{'name': 'Northern Lights Films', 'id': 8714}...","[{'iso_3166_1': 'US', 'name': 'United States o...",2/4/15,100.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"It doesn't think, it doesn't feel, it doesn't ...",It Follows,"[{'id': 3713, 'name': 'chase'}, {'id': 6152, '...","[{'cast_id': 1, 'character': 'Jay Height', 'cr...","[{'credit_id': '537770b20e0a261431002299', 'de...",


In [64]:
up_to_2019 = up_to_2019.drop(columns=['belongs_to_collection', 'id', 'homepage','imdb_id','overview',
                                      'spoken_languages','status','tagline','original_title','Keywords','cast',
                                     'crew'])

In [65]:
up_to_2019= up_to_2019.drop(columns=['poster_path','production_countries'])

In [66]:
up_to_2019 = up_to_2019[['title','budget','genres','original_language','popularity',
                         'release_date','runtime','revenue']]


In [67]:
up_to_2019.rename(columns={'title': 'film'}, inplace=True)

In [68]:
up_to_2019

Unnamed: 0,film,budget,genres,original_language,popularity,release_date,runtime,revenue
0,Hot Tub Time Machine 2,14000000,"[{'id': 35, 'name': 'Comedy'}]",en,6.575393,2/20/15,93.0,12314651.0
1,The Princess Diaries 2: Royal Engagement,40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,8.248895,8/6/04,113.0,95149435.0
2,Whiplash,3300000,"[{'id': 18, 'name': 'Drama'}]",en,64.299990,10/10/14,105.0,13092000.0
3,Kahaani,1200000,"[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...",hi,3.174936,3/9/12,122.0,16000000.0
4,Marine Boy,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",ko,1.148070,2/5/09,118.0,3923970.0
...,...,...,...,...,...,...,...,...
4393,Original Sin,42000000,"[{'id': 53, 'name': 'Thriller'}]",en,9.970359,8/3/01,118.0,
4394,Without a Paddle,19000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",en,6.046516,8/20/04,95.0,
4395,The Verdict,16000000,"[{'id': 18, 'name': 'Drama'}]",en,9.596883,12/8/82,129.0,
4396,It Follows,2000000,"[{'id': 27, 'name': 'Horror'}, {'id': 53, 'nam...",en,20.359336,2/4/15,100.0,


In [69]:
all_movies= pd.concat([up_to_2019, Total_Movies], ignore_index=True, axis=0)

In [70]:
all_movies["genres"] = all_movies.genres.astype(str)

In [71]:
#split genres column by comma
def proc_json(string, key):
    try:
        data = eval(string)
        return ",".join([d[key] for d in data])
    except:
        return ''

In [72]:
#identifying genres 
all_movies.genres = all_movies.genres.apply(lambda x: proc_json(x, 'name'))

In [73]:
#turning genre category into numeric 
genres = []
for idx, val in all_movies.genres.iteritems():
    gen_list = val.split(',')
    for gen in gen_list:
        if gen == '':
            continue

        if gen not in genres:
            genres.append(gen)

In [74]:
all_movies

Unnamed: 0,film,budget,genres,original_language,popularity,release_date,runtime,revenue
0,Hot Tub Time Machine 2,14000000,Comedy,en,6.575393,2/20/15,93.0,12314651.0
1,The Princess Diaries 2: Royal Engagement,40000000,"Comedy,Drama,Family,Romance",en,8.248895,8/6/04,113.0,95149435.0
2,Whiplash,3300000,Drama,en,64.299990,10/10/14,105.0,13092000.0
3,Kahaani,1200000,"Thriller,Drama",hi,3.174936,3/9/12,122.0,16000000.0
4,Marine Boy,0,"Action,Thriller",ko,1.148070,2/5/09,118.0,3923970.0
...,...,...,...,...,...,...,...,...
7793,Halloween Kills,20000000,"Horror,Thriller",en,336.130000,10/14/21,105,127000000
7794,The King's Man,100000000,"Action,Adventure,Thriller,War",en,3507.400000,12/22/21,131,124005195
7795,The Addams Family 2,0,"Animation,Adventure,Comedy,Family",en,376.331000,10/01/21,93,119815153
7796,Wrath of Man,40000000,"Action,Crime,Thriller",en,379.364000,04/22/21,119,103966489


In [75]:
#adding index column so we can remove the film name for scaling before splitting into test and train so the indexes remain true
all_movies['index'] = all_movies.index

In [76]:
#filling empty revenue columns with average revenue 
rev_median = all_movies.loc[all_movies['revenue'] > 0, 'revenue'].median()
all_movies['revenue'] = all_movies['revenue'].fillna(rev_median)

In [77]:
all_movies.dtypes

film                  object
budget                object
genres                object
original_language     object
popularity           float64
release_date          object
runtime               object
revenue              float64
index                  int64
dtype: object

In [78]:
all_movies

Unnamed: 0,film,budget,genres,original_language,popularity,release_date,runtime,revenue,index
0,Hot Tub Time Machine 2,14000000,Comedy,en,6.575393,2/20/15,93.0,12314651.0,0
1,The Princess Diaries 2: Royal Engagement,40000000,"Comedy,Drama,Family,Romance",en,8.248895,8/6/04,113.0,95149435.0,1
2,Whiplash,3300000,Drama,en,64.299990,10/10/14,105.0,13092000.0,2
3,Kahaani,1200000,"Thriller,Drama",hi,3.174936,3/9/12,122.0,16000000.0,3
4,Marine Boy,0,"Action,Thriller",ko,1.148070,2/5/09,118.0,3923970.0,4
...,...,...,...,...,...,...,...,...,...
7793,Halloween Kills,20000000,"Horror,Thriller",en,336.130000,10/14/21,105,127000000.0,7793
7794,The King's Man,100000000,"Action,Adventure,Thriller,War",en,3507.400000,12/22/21,131,124005195.0,7794
7795,The Addams Family 2,0,"Animation,Adventure,Comedy,Family",en,376.331000,10/01/21,93,119815153.0,7795
7796,Wrath of Man,40000000,"Action,Crime,Thriller",en,379.364000,04/22/21,119,103966489.0,7796


In [79]:
all_movies = all_movies[['index', 'film', 'budget', 'genres', 'original_language', 'popularity', 'release_date',
                       'runtime', 'revenue']]

In [80]:
budget_median = all_movies.loc[all_movies['budget'] > 0, 'budget'].median()
all_movies["budget_processed"] = all_movies["budget"].mask(all_movies["budget"] == 0, budget_median)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  all_movies["budget_processed"] = all_movies["budget"].mask(all_movies["budget"] == 0, budget_median)


In [81]:
all_movies = all_movies[['index', 'film', 'budget', 'budget_processed','genres', 'original_language', 'popularity', 'release_date',
                       'runtime', 'revenue']]

In [82]:
all_movies= all_movies[['index', 'film', 'budget','budget_processed', 'genres',
                        'original_language', 'popularity', 'release_date',
                       'runtime', 'revenue']]

In [83]:
all_movies= all_movies.drop(columns=['budget'])

In [84]:
#fix release date column
all_movies.release_date.isnull().sum()

1

In [85]:
all_movies['release_date'] = all_movies['release_date'].fillna('7/14/07')

In [86]:
all_movies.release_date.isnull().sum()

0

In [87]:
import re
def yearfix(x): # run year fix, then date fix
    """Regular expression to pull out the two digit year from release_date and return it as an int"""
    r = re.match(r"(\d+/\d+/)(\d+)",x)[2]
    return int(r)
def datefix(x):
    """The dates only provide two digits for year.  This is meant to fix this issue.
    The youngest movie is from 2021, so we'll say any year digits less than 21 are from this century.
    Otherwise, we'll say they are from the 1900s."""
    if x<21:
        x = x+2000
        return x
    if x >=21:
        x = x+1900
        return x

In [88]:
all_movies

Unnamed: 0,index,film,budget_processed,genres,original_language,popularity,release_date,runtime,revenue
0,0,Hot Tub Time Machine 2,14000000,Comedy,en,6.575393,2/20/15,93.0,12314651.0
1,1,The Princess Diaries 2: Royal Engagement,40000000,"Comedy,Drama,Family,Romance",en,8.248895,8/6/04,113.0,95149435.0
2,2,Whiplash,3300000,Drama,en,64.299990,10/10/14,105.0,13092000.0
3,3,Kahaani,1200000,"Thriller,Drama",hi,3.174936,3/9/12,122.0,16000000.0
4,4,Marine Boy,18000000.0,"Action,Thriller",ko,1.148070,2/5/09,118.0,3923970.0
...,...,...,...,...,...,...,...,...,...
7793,7793,Halloween Kills,20000000,"Horror,Thriller",en,336.130000,10/14/21,105,127000000.0
7794,7794,The King's Man,100000000,"Action,Adventure,Thriller,War",en,3507.400000,12/22/21,131,124005195.0
7795,7795,The Addams Family 2,18000000.0,"Animation,Adventure,Comedy,Family",en,376.331000,10/01/21,93,119815153.0
7796,7796,Wrath of Man,40000000,"Action,Crime,Thriller",en,379.364000,04/22/21,119,103966489.0


In [89]:
all_movies.dtypes

index                  int64
film                  object
budget_processed      object
genres                object
original_language     object
popularity           float64
release_date          object
runtime               object
revenue              float64
dtype: object

creating a seperate dataset for tableau visualizations

In [90]:
my_movies= all_movies.copy()

In [91]:
my_movies['genre_1'] = my_movies['genres'].str.split(',').str[0]

In [92]:
my_movies = my_movies[['index', 'film', 'budget_processed', 'genres','genre_1','original_language', 'popularity', 'release_date',
                       'runtime', 'revenue']]

In [93]:
my_movies= my_movies.drop(columns=['genres'])

In [94]:
my_movies.dtypes

index                  int64
film                  object
budget_processed      object
genre_1               object
original_language     object
popularity           float64
release_date          object
runtime               object
revenue              float64
dtype: object

fixing the release date

In [156]:
my_movies.to_csv('my_tableau_movies.csv')

# Machine Learning Data Preprocessing

In [96]:
#import sklearn dependencies 
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

In [97]:
all_movies.shape

(7798, 9)

In [98]:
from numpy import mean
import time
from datetime import datetime
import calendar
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
import re
from scipy.stats import pearsonr
import math
from statistics import median
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

In [99]:
all_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7798 entries, 0 to 7797
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              7798 non-null   int64  
 1   film               7795 non-null   object 
 2   budget_processed   7798 non-null   object 
 3   genres             7798 non-null   object 
 4   original_language  7798 non-null   object 
 5   popularity         7798 non-null   float64
 6   release_date       7798 non-null   object 
 7   runtime            7792 non-null   object 
 8   revenue            7798 non-null   float64
dtypes: float64(2), int64(1), object(6)
memory usage: 548.4+ KB


In [100]:
#converting objects that are currently numeric to float 
#converting release date to datetime
all_movies["index"] = all_movies.index.astype(float)
all_movies["budget_processed"] = all_movies.budget_processed.astype(float)
all_movies["runtime"] = all_movies.runtime.astype(float)
all_movies["revenue"] = all_movies.revenue.astype(float)
all_movies["release_date"] = pd.to_datetime(all_movies["release_date"])

In [101]:
all_movies['release_date'] = all_movies[('release_date')].values.astype("float64")

In [102]:
all_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7798 entries, 0 to 7797
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              7798 non-null   float64
 1   film               7795 non-null   object 
 2   budget_processed   7798 non-null   float64
 3   genres             7798 non-null   object 
 4   original_language  7798 non-null   object 
 5   popularity         7798 non-null   float64
 6   release_date       7798 non-null   float64
 7   runtime            7792 non-null   float64
 8   revenue            7798 non-null   float64
dtypes: float64(6), object(3)
memory usage: 548.4+ KB


In [103]:
#creating numerical data for the genre column 
genre_column_names = []
for gen in genres:
    col_name = 'genre_' + gen.replace(' ', '_')
    all_movies[col_name] = all_movies.genres.str.contains(gen).astype('uint8')
    genre_column_names.append(col_name)

In [122]:
print ('The movie data has {} rows and {} columns'.format(all_movies.shape[0],all_movies.shape[1]))

The movie data has 7798 rows and 29 columns


In [104]:
all_movies

Unnamed: 0,index,film,budget_processed,genres,original_language,popularity,release_date,runtime,revenue,genre_Comedy,genre_Drama,genre_Family,genre_Romance,genre_Thriller,genre_Action,genre_Animation,genre_Adventure,genre_Horror,genre_Documentary,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie
0,0.0,Hot Tub Time Machine 2,14000000.0,Comedy,en,6.575393,1.424390e+18,93.0,12314651.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,The Princess Diaries 2: Royal Engagement,40000000.0,"Comedy,Drama,Family,Romance",en,8.248895,1.091750e+18,113.0,95149435.0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2.0,Whiplash,3300000.0,Drama,en,64.299990,1.412899e+18,105.0,13092000.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3.0,Kahaani,1200000.0,"Thriller,Drama",hi,3.174936,1.331251e+18,122.0,16000000.0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4.0,Marine Boy,18000000.0,"Action,Thriller",ko,1.148070,1.233792e+18,118.0,3923970.0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7793,7793.0,Halloween Kills,20000000.0,"Horror,Thriller",en,336.130000,1.634170e+18,105.0,127000000.0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
7794,7794.0,The King's Man,100000000.0,"Action,Adventure,Thriller,War",en,3507.400000,1.640131e+18,131.0,124005195.0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0
7795,7795.0,The Addams Family 2,18000000.0,"Animation,Adventure,Comedy,Family",en,376.331000,1.633046e+18,93.0,119815153.0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0
7796,7796.0,Wrath of Man,40000000.0,"Action,Crime,Thriller",en,379.364000,1.619050e+18,119.0,103966489.0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [127]:
all_movies = pd.get_dummies(all_movies, columns = ["original_language"])

Next we will need to perform a test train split. 
After data is split drop the revenue column from the test data because that will be our target variable in our models

In [128]:
train=all_movies.sample(frac=0.75,random_state=42) #random state is a seed value
test=all_movies.drop(train.index)

In [129]:
train

Unnamed: 0,index,film,budget_processed,genres,popularity,release_date,runtime,revenue,genre_Comedy,genre_Drama,genre_Family,genre_Romance,genre_Thriller,genre_Action,genre_Animation,genre_Adventure,genre_Horror,genre_Documentary,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie,original_language_af,original_language_ar,original_language_bm,original_language_bn,original_language_ca,original_language_cn,original_language_cs,original_language_da,original_language_de,original_language_el,original_language_en,original_language_es,original_language_fa,original_language_fi,original_language_fr,original_language_he,original_language_hi,original_language_hu,original_language_id,original_language_is,original_language_it,original_language_ja,original_language_ka,original_language_kn,original_language_ko,original_language_lt,original_language_ml,original_language_mr,original_language_nb,original_language_nl,original_language_no,original_language_pl,original_language_pt,original_language_ro,original_language_ru,original_language_sr,original_language_sv,original_language_ta,original_language_te,original_language_th,original_language_tl,original_language_tr,original_language_uk,original_language_ur,original_language_vi,original_language_xx,original_language_zh
2926,2926.0,Balls of Fury,18000000.0,"Comedy,Crime",5.325766,1.188346e+18,90.0,41098065.0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5502,5502.0,The Pebble and the Penguin,28000000.0,"Animation,Adventure,Family",2.304986,7.975584e+17,74.0,18000000.0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2809,2809.0,The Swan Princess,35000000.0,Animation,8.910462,7.851168e+17,89.0,9771658.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1025,1025.0,A Case of You,18000000.0,"Comedy,Romance",5.819403,1.383696e+18,91.0,4187.0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
101,101.0,A Beautiful Mind,60000000.0,"Drama,Romance",11.936460,1.008029e+18,135.0,313542341.0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5089,5089.0,Ondine,12000000.0,"Drama,Romance",4.757269,1.252886e+18,111.0,18000000.0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1247,1247.0,[REC],1500000.0,"Horror,Mystery",8.504511,1.176163e+18,78.0,30448000.0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5856,5856.0,Thor,150000000.0,"Adventure,Fantasy,Action",29.158489,1.303344e+18,115.0,18000000.0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4681,4681.0,"Oh, God!",18000000.0,"Fantasy,Comedy",2.660367,2.450304e+17,98.0,18000000.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [130]:
test

Unnamed: 0,index,film,budget_processed,genres,popularity,release_date,runtime,revenue,genre_Comedy,genre_Drama,genre_Family,genre_Romance,genre_Thriller,genre_Action,genre_Animation,genre_Adventure,genre_Horror,genre_Documentary,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie,original_language_af,original_language_ar,original_language_bm,original_language_bn,original_language_ca,original_language_cn,original_language_cs,original_language_da,original_language_de,original_language_el,original_language_en,original_language_es,original_language_fa,original_language_fi,original_language_fr,original_language_he,original_language_hi,original_language_hu,original_language_id,original_language_is,original_language_it,original_language_ja,original_language_ka,original_language_kn,original_language_ko,original_language_lt,original_language_ml,original_language_mr,original_language_nb,original_language_nl,original_language_no,original_language_pl,original_language_pt,original_language_ro,original_language_ru,original_language_sr,original_language_sv,original_language_ta,original_language_te,original_language_th,original_language_tl,original_language_tr,original_language_uk,original_language_ur,original_language_vi,original_language_xx,original_language_zh
2,2.0,Whiplash,3300000.0,Drama,64.299990,1.412899e+18,105.0,13092000.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3.0,Kahaani,1200000.0,"Thriller,Drama",3.174936,1.331251e+18,122.0,16000000.0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4.0,Marine Boy,18000000.0,"Action,Thriller",1.148070,1.233792e+18,118.0,3923970.0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,5.0,Pinocchio and the Emperor of the Night,8000000.0,"Animation,Adventure,Family",0.743274,5.552064e+17,83.0,3261638.0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,9.0,A Mighty Wind,6000000.0,"Comedy,Music",4.672036,1.050451e+18,91.0,18750246.0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7779,7779.0,The Conjuring: The Devil Made Me Do It,39000000.0,"Horror,Mystery,Thriller",392.601000,1.621901e+18,111.0,201000000.0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7780,7780.0,Ghostbusters: Afterlife,75000000.0,"Fantasy,Comedy,Adventure",1556.443000,1.636589e+18,124.0,191000000.0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7782,7782.0,Cliff Walkers,18000000.0,"Drama,History,Thriller,Crime",7.772000,1.619741e+18,120.0,172919448.0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
7786,7786.0,Peter Rabbit 2: The Runaway,45000000.0,"Family,Comedy,Adventure,Animation,Fantasy",172.378000,1.616630e+18,93.0,153000000.0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [131]:
#removing film from dataframes for scaling purposes
train = train.drop(columns=['film'])
test= test.drop(columns=['film'])

In [132]:
#dropping genres column since the data is now numerical
train = train.drop(columns=['genres'])
test = test.drop(columns=['genres'])

In [133]:
y_train = train["revenue"]
X_train = train.drop(columns = ["revenue"])
X_train.head()

Unnamed: 0,index,budget_processed,popularity,release_date,runtime,genre_Comedy,genre_Drama,genre_Family,genre_Romance,genre_Thriller,genre_Action,genre_Animation,genre_Adventure,genre_Horror,genre_Documentary,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie,original_language_af,original_language_ar,original_language_bm,original_language_bn,original_language_ca,original_language_cn,original_language_cs,original_language_da,original_language_de,original_language_el,original_language_en,original_language_es,original_language_fa,original_language_fi,original_language_fr,original_language_he,original_language_hi,original_language_hu,original_language_id,original_language_is,original_language_it,original_language_ja,original_language_ka,original_language_kn,original_language_ko,original_language_lt,original_language_ml,original_language_mr,original_language_nb,original_language_nl,original_language_no,original_language_pl,original_language_pt,original_language_ro,original_language_ru,original_language_sr,original_language_sv,original_language_ta,original_language_te,original_language_th,original_language_tl,original_language_tr,original_language_uk,original_language_ur,original_language_vi,original_language_xx,original_language_zh
2926,2926.0,18000000.0,5.325766,1.188346e+18,90.0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5502,5502.0,28000000.0,2.304986,7.975584e+17,74.0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2809,2809.0,35000000.0,8.910462,7.851168e+17,89.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1025,1025.0,18000000.0,5.819403,1.383696e+18,91.0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
101,101.0,60000000.0,11.93646,1.008029e+18,135.0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [134]:
#drop Revenue column from TEST df because that will be our target variable
y_test = test["revenue"]
X_test = test.drop(columns = ["revenue"])
X_test.head()

Unnamed: 0,index,budget_processed,popularity,release_date,runtime,genre_Comedy,genre_Drama,genre_Family,genre_Romance,genre_Thriller,genre_Action,genre_Animation,genre_Adventure,genre_Horror,genre_Documentary,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie,original_language_af,original_language_ar,original_language_bm,original_language_bn,original_language_ca,original_language_cn,original_language_cs,original_language_da,original_language_de,original_language_el,original_language_en,original_language_es,original_language_fa,original_language_fi,original_language_fr,original_language_he,original_language_hi,original_language_hu,original_language_id,original_language_is,original_language_it,original_language_ja,original_language_ka,original_language_kn,original_language_ko,original_language_lt,original_language_ml,original_language_mr,original_language_nb,original_language_nl,original_language_no,original_language_pl,original_language_pt,original_language_ro,original_language_ru,original_language_sr,original_language_sv,original_language_ta,original_language_te,original_language_th,original_language_tl,original_language_tr,original_language_uk,original_language_ur,original_language_vi,original_language_xx,original_language_zh
2,2.0,3300000.0,64.29999,1.412899e+18,105.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,3.0,1200000.0,3.174936,1.331251e+18,122.0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4.0,18000000.0,1.14807,1.233792e+18,118.0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,5.0,8000000.0,0.743274,5.552064e+17,83.0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,9.0,6000000.0,4.672036,1.050451e+18,91.0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [135]:
#take a look at the shape of our data
print ('The train data has {} rows and {} columns'.format(train.shape[0],train.shape[1]))
print ('---------------------------------------------')
print ('The test data has {} rows and {} columns'.format(test.shape[0],test.shape[1]))

The train data has 5848 rows and 73 columns
---------------------------------------------
The test data has 1950 rows and 73 columns


In [136]:
print("Training set missing values:\n", train.isna().sum())
print("\nTest set missing values:\n", test.isna().sum())

Training set missing values:
 index                   0
budget_processed        0
popularity              0
release_date            0
runtime                 5
                       ..
original_language_uk    0
original_language_ur    0
original_language_vi    0
original_language_xx    0
original_language_zh    0
Length: 73, dtype: int64

Test set missing values:
 index                   0
budget_processed        0
popularity              0
release_date            0
runtime                 1
                       ..
original_language_uk    0
original_language_ur    0
original_language_vi    0
original_language_xx    0
original_language_zh    0
Length: 73, dtype: int64


In [137]:
train.dtypes

index                   float64
budget_processed        float64
popularity              float64
release_date            float64
runtime                 float64
                         ...   
original_language_uk      uint8
original_language_ur      uint8
original_language_vi      uint8
original_language_xx      uint8
original_language_zh      uint8
Length: 73, dtype: object

In [114]:
from sklearn import *

In [138]:
train

Unnamed: 0,index,budget_processed,popularity,release_date,runtime,revenue,genre_Comedy,genre_Drama,genre_Family,genre_Romance,genre_Thriller,genre_Action,genre_Animation,genre_Adventure,genre_Horror,genre_Documentary,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie,original_language_af,original_language_ar,original_language_bm,original_language_bn,original_language_ca,original_language_cn,original_language_cs,original_language_da,original_language_de,original_language_el,original_language_en,original_language_es,original_language_fa,original_language_fi,original_language_fr,original_language_he,original_language_hi,original_language_hu,original_language_id,original_language_is,original_language_it,original_language_ja,original_language_ka,original_language_kn,original_language_ko,original_language_lt,original_language_ml,original_language_mr,original_language_nb,original_language_nl,original_language_no,original_language_pl,original_language_pt,original_language_ro,original_language_ru,original_language_sr,original_language_sv,original_language_ta,original_language_te,original_language_th,original_language_tl,original_language_tr,original_language_uk,original_language_ur,original_language_vi,original_language_xx,original_language_zh
2926,2926.0,18000000.0,5.325766,1.188346e+18,90.0,41098065.0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5502,5502.0,28000000.0,2.304986,7.975584e+17,74.0,18000000.0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2809,2809.0,35000000.0,8.910462,7.851168e+17,89.0,9771658.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1025,1025.0,18000000.0,5.819403,1.383696e+18,91.0,4187.0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
101,101.0,60000000.0,11.936460,1.008029e+18,135.0,313542341.0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5089,5089.0,12000000.0,4.757269,1.252886e+18,111.0,18000000.0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1247,1247.0,1500000.0,8.504511,1.176163e+18,78.0,30448000.0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5856,5856.0,150000000.0,29.158489,1.303344e+18,115.0,18000000.0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4681,4681.0,18000000.0,2.660367,2.450304e+17,98.0,18000000.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [139]:
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Modeling

In [140]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [142]:
# add missing dummy variables to testing set (in training only)
for column in X_train.columns:
    if column not in X_test.columns:
        X_test[column] = 0

In [150]:
X_train.isnull().values.any()

True

In [149]:
X_train

Unnamed: 0,index,budget_processed,popularity,release_date,runtime,genre_Comedy,genre_Drama,genre_Family,genre_Romance,genre_Thriller,genre_Action,genre_Animation,genre_Adventure,genre_Horror,genre_Documentary,genre_Music,genre_Crime,genre_Science_Fiction,genre_Mystery,genre_Foreign,genre_Fantasy,genre_War,genre_Western,genre_History,genre_TV_Movie,original_language_af,original_language_ar,original_language_bm,original_language_bn,original_language_ca,original_language_cn,original_language_cs,original_language_da,original_language_de,original_language_el,original_language_en,original_language_es,original_language_fa,original_language_fi,original_language_fr,original_language_he,original_language_hi,original_language_hu,original_language_id,original_language_is,original_language_it,original_language_ja,original_language_ka,original_language_kn,original_language_ko,original_language_lt,original_language_ml,original_language_mr,original_language_nb,original_language_nl,original_language_no,original_language_pl,original_language_pt,original_language_ro,original_language_ru,original_language_sr,original_language_sv,original_language_ta,original_language_te,original_language_th,original_language_tl,original_language_tr,original_language_uk,original_language_ur,original_language_vi,original_language_xx,original_language_zh
2926,2926.0,18000000.0,5.325766,1.188346e+18,90.0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5502,5502.0,28000000.0,2.304986,7.975584e+17,74.0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2809,2809.0,35000000.0,8.910462,7.851168e+17,89.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1025,1025.0,18000000.0,5.819403,1.383696e+18,91.0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
101,101.0,60000000.0,11.936460,1.008029e+18,135.0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5089,5089.0,12000000.0,4.757269,1.252886e+18,111.0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1247,1247.0,1500000.0,8.504511,1.176163e+18,78.0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5856,5856.0,150000000.0,29.158489,1.303344e+18,115.0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4681,4681.0,18000000.0,2.660367,2.450304e+17,98.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [147]:
y_train

2926     41098065.0
5502     18000000.0
2809      9771658.0
1025         4187.0
101     313542341.0
           ...     
5089     18000000.0
1247     30448000.0
5856     18000000.0
4681     18000000.0
3526     18000000.0
Name: revenue, Length: 5848, dtype: float64

In [143]:
reg = LogisticRegression().fit(X_train, y_train)
reg.score(X_test, y_test)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').