# Creating an AI producer, or: How I Learned to Stop Worrying and Love the Box-Office Bomb

If you love movies, you have probably seen a trailer for a new blockbuster and asked yourself, "Why would someone spend 100 million dollars to make such a stupid movie?" For me, the last time this happened was when I heard about the "Baywatch" movie. I mean, nobody wanted to watch a "Baywatch" movie; no director could possibly have dreamed of making a "Baywatch" movie. So what happened? There was probably a producer who had been sitting on that script for years, and Zac Efron had bills to pay. Maybe Dwayne "The Rock" Johnson saw it and thought, "Hey, I'd like a new house and I'm free for two months between shooting "The Fast and the Furious 8" and "Jumanji 2"! I should be in this movie!" Then a studio executive talked with the producer and thought something like, "It's an IP known by the public, there are two names we can put on the poster and we have a reason to put girls in bikinis in the trailers... That could do 80 million at the box office!" Then the movie got greenlit... and will probably fail.

For every stupid movie made, there's always some weird commercial logic that was used to justify the investment, and this logic is often based on hype, rumour and other subjective considerations. What if we used hard data to predict the success of a movie? Would we be more successful than a producer, or are moviegoers so unpredictable that only a good judge of the zeitgeist can predict whether a movie will be successful?

In this project, I will use a data set of about 5,000 movies, scraped from IMDB by the Kaggle user at https://www.kaggle.com/deepmatrix, to try to predict whether a movie will make money at the box office by looking at its cast, its director and other key characteristics. The project is divided into three sections: the first one, this article, explains my process for getting and cleaning the data; in the second one, I explore the data by looking at the distribution of the variables and the relations between them; in the third one, I test some statistical models with the objective of predicting the revenue of a movie from the information available before its release.

You can find the dataset here: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.  

# Loading data

First of all, I load the libraries that I will use in this notebook.

In [35]:
#library(ggplot2)
library(dplyr)
library(data.table)
#library(plotly)
#library(formattable)
library(bit64)

------------------------------------------------------------------------------
data.table + dplyr code now lives in dtplyr.
Please library(dtplyr)!
------------------------------------------------------------------------------

Attaching package: 'dplyr'

The following objects are masked from 'package:data.table':

    between, last

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



After loading the data into the data table "movies", I look at the first six rows of the table to get a sense of the data, and at the list of variables.

In [438]:
movies <- fread("movie_metadata2.csv",stringsAsFactors = F)

In [439]:
head(movies)

color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Color,James Cameron,723.0,178.0,0,855.0,Joel David Moore,1000,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936,7.9,1.78,33000
Color,Gore Verbinski,302.0,169.0,563,1000.0,Orlando Bloom,40000,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000,7.1,2.35,0
Color,Sam Mendes,602.0,148.0,0,161.0,Rory Kinnear,11000,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393,6.8,2.35,85000
Color,Christopher Nolan,813.0,164.0,22000,23000.0,Christian Bale,27000,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000,8.5,2.35,164000
,Doug Walker,,,131,,Rob Walker,131,,Documentary,...,,,,,,,12,7.1,,0
Color,Andrew Stanton,462.0,132.0,475,530.0,Samantha Morton,640,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632,6.6,2.35,24000


In [440]:
str(movies)

Classes 'data.table' and 'data.frame':	5043 obs. of  28 variables:
 $ color                    : chr  "Color" "Color" "Color" "Color" ...
 $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
 $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
 $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
 $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
 $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
 $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
 $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
 $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
 $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|

# Cleaning data

When I first looked at the variables in the summary above, two things caught my eye. First, there are a lot of missing values, so I'll have to either scrape them from IMDB, estimate them or delete those observations completely. Second, the formatting of the strings in the column movie_title is weird...

In [441]:
head(movies$movie_title)

There must have been an encoding error while the data was scraped from IMDB. I'll have to delete the character "Â" at the end of each string.

In [442]:
#movies$movie_title<-as.character(movies$movie_title)
movies$movie_title<-sub('Â', "", movies$movie_title)

In [443]:
head(movies$movie_title)
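Removing the stray character can still leave whitespace behind (one of the title lookups later in this notebook is written with a trailing space). Here is a minimal sketch of stripping both, on made-up titles rather than the real column, and assuming the leftover is an ordinary space:

```r
# Hypothetical titles mimicking the encoding problem (not read from the data set)
titles <- c("Avatar\u00c2 ", "Spectre\u00c2 ", "Tango")

# Drop every stray "Â" (U+00C2), then trim leading and trailing whitespace
clean_titles <- trimws(gsub("\u00c2", "", titles, fixed = TRUE))
print(clean_titles)
```

If the leftover turns out to be a non-breaking space instead, trimws() with its default settings would not remove it, so it is worth checking one title by hand.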

Now, I take care of the missing values. First, let's look at the number of missing values in the first variable, "color".

In [444]:
print(paste0("Empty string: ",sum(movies$color=="")))
# is.na() counts real NA values; comparing with the string "NA" would not
print(paste0("NA: ",sum(is.na(movies$color))))

[1] "Empty string: 18"
[1] "NA: 0"


There are 18 observations where the value of the variable "color" is missing. Since the majority of movies were in black and white before 1939, the year "The Wizard of Oz" was released, my first thought was to look for a separation in the data. If the distribution of the variable "title_year", which represents the year the movie was released, is quite different for black-and-white movies than for color movies, I will be able to tell whether a movie is in color or in black and white just by looking at "title_year".

In [445]:
print(paste0("Mean, Black and White: ",mean(movies[which(color=="Black and White"),title_year], na.rm=TRUE)))
print(paste0("Standard deviation, Black and White: ",sd(movies[which(color=="Black and White"),title_year],na.rm=TRUE)))
print(paste0("Mean, Color: ",mean(movies[which(color=="Color"),title_year], na.rm=TRUE)))
print(paste0("Standard deviation, Color: ",sd(movies[which(color=="Color"),title_year],na.rm=TRUE)))

[1] "Mean, Black and White: 1984.07843137255"
[1] "Standard deviation, Black and White: 26.2500992349088"
[1] "Mean, Color: 2003.26709677419"
[1] "Standard deviation, Color: 10.7673361222041"


In [446]:
print("Summary Black and White")
summary(movies[which(color=="Black and White"),title_year])
print("Summary color")
summary(movies[which(color=="Color"),title_year])

[1] "Summary Black and White"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1916    1962    1998    1984    2004    2015       3 

[1] "Summary color"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1937    1999    2006    2003    2011    2016      99 

In [447]:
movies[which(color==""),title_year]

We can see that the two means are much closer than I thought and that the standard deviation of the black-and-white movies is quite large. The shape of those two distributions tells me that a movie made after 2000 is very likely in color and that those made before 1993 are more likely to be in black and white. Looking at the release year of the movies with a missing "color" value, I count four movies with a missing "title_year" and one movie made in 1990, right in between the two distributions. Let's look at the names of those movies.
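The four mean and standard-deviation calls above can also be written as one grouped summary. Here is a small sketch with base R's tapply(), on made-up rows (the toy values are not from the data set):

```r
# Toy rows standing in for the movie table
toy <- data.frame(
  color = c("Color", "Color", "Black and White", "Black and White", "Color"),
  title_year = c(2005, 2010, 1950, 1998, NA),
  stringsAsFactors = FALSE
)

# One value per color group; na.rm = TRUE skips missing years
year_mean <- tapply(toy$title_year, toy$color, mean, na.rm = TRUE)
year_sd <- tapply(toy$title_year, toy$color, sd, na.rm = TRUE)
print(rbind(mean = year_mean, sd = year_sd))
```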

In [448]:
movies[which(color==""&(title_year==1990|is.na(title_year))),movie_title]

Since there are only five problematic movies, I decided to do some research and find the missing values. After two minutes of searching on IMDB, I learned that all those movies are in color, so I can assign that value to every movie with a missing "color".

In [449]:
movies[which(color==""),"color"]="Color"

In [450]:
sum(movies$color=="")

Let's look at the number of missing values for each variable.

In [451]:
sapply(movies, function(y) sum(is.na(y)))
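The count above only sees NA, while a character column such as "color" stores a missing value as an empty string, as the earlier check showed. Here is a sketch that counts both, on a toy table; count_missing is a hypothetical helper, not part of the original notebook:

```r
# Toy data frame standing in for the movie table (values are made up)
toy <- data.frame(
  color = c("Color", "", "Color"),
  duration = c(120, NA, 95),
  stringsAsFactors = FALSE
)

# Count NA values and, for character columns, empty strings as well
count_missing <- function(col) {
  n <- sum(is.na(col))
  if (is.character(col)) {
    n <- n + sum(col == "", na.rm = TRUE)
  }
  n
}
missing_counts <- sapply(toy, count_missing)
print(missing_counts)
```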

Of those variables, only "duration" has few enough missing values that they are easy to obtain, so I will fetch them manually by searching IMDB.

In [452]:
movies[is.na(movies$duration),movie_title]

In [453]:
movies[movie_title=="Star Wars: Episode VII - The Force Awakens",4]<-136
movies[movie_title=="Harry Potter and the Deathly Hallows: Part II",4]<-130
movies[movie_title=="Harry Potter and the Deathly Hallows: Part I",4]<-146
movies[movie_title=="Black Water Transit",4]<-100
movies[movie_title=="Should've Been Romeo",4]<-90
movies[movie_title=="Barfi",4]<-151
movies[movie_title=="Hum To Mohabbat Karega",4]<-87
movies[movie_title=="N-Secure",4]<-115
movies[movie_title=="Dil Jo Bhi Kahey...",4]<-144
movies[movie_title=="Wolf Creek",4]<-95
movies[movie_title=="Karachi se Lahore",4]<-143
movies[movie_title=="Destiny",4]<-105
movies[movie_title=="Romantic Schemer",4]<-85
movies[movie_title=="The Naked Ape",4]<-110
movies[movie_title=="War & Peace",4]<- 379
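Addressing the column by its position (4) works here, but it breaks silently if the column order ever changes. Here is a sketch of the same idea with a named lookup and the column addressed by name, on a toy table; the two durations are the ones entered by hand above, and the object names are assumptions:

```r
# Toy stand-in for the movie table
toy <- data.frame(
  movie_title = c("Barfi", "Wolf Creek", "Avatar"),
  duration = c(NA, NA, 178),
  stringsAsFactors = FALSE
)

# Hand-collected durations, keyed by title
fixes <- c("Barfi" = 151, "Wolf Creek" = 95)

# Fill only rows whose duration is missing and whose title is in the lookup
hit <- is.na(toy$duration) & toy$movie_title %in% names(fixes)
toy$duration[hit] <- unname(fixes[toy$movie_title[hit]])
print(toy)
```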

Now if we look back at the number of missing values for every variable in the data set, we see that the variable "duration" has none.

In [454]:
sapply(movies, function(y) sum(is.na(y)))

Missing values in the variables "num_critic_for_reviews", "num_voted_users", "movie_imdb_link", "num_user_for_reviews", "imdb_score" and "movie_facebook_likes" are not really an issue, since I'm interested in predicting the box-office result of a movie before its release and those variables are collected after the theatrical release of their respective movie. Also, some variables like "director_facebook_likes", "facenumber_in_poster", "title_year" and "actor_2_facebook_likes" have a relatively low number of missing values, and omitting those observations won't significantly affect the quality of the model. With that in mind, I'll focus my attention on the variables "gross" and "budget".

First, I need to find out why the scraping code returned some empty fields. I looked at the IMDB pages of some movies with missing data to try to find a pattern.

# Scraping data

In [455]:
head(movies[is.na(movies$gross),movie_title])

In [456]:
movies[which(movies$movie_title=='Star Wars: Episode VII - The Force Awakens'),movie_imdb_link]

In [457]:
movies[which(movie_title=='Miami Vice'),movie_imdb_link]

In [458]:
movies[which(movies$movie_title=='Harry Potter and the Deathly Hallows: Part II '), movie_imdb_link]

Looking at those three pages, we see the problem: IMDB lists not only theatrical releases, but also web series, TV shows and direct-to-DVD releases, and since some entries share the same title, some errors have crept into the data set. For example, instead of returning data on the movie "Miami Vice", the script returned data on an episode of the TV series of the same name, where the fields "gross" and "budget" are empty. Consequently, a missing value in those columns seems to be an indicator of a bad observation.

I took a sample of 20 movies (about 3% of the observations with the value "gross" missing) to see if that hypothesis is correct.

In [459]:
print(paste0("Number of rows: ",nrow(subset(movies,is.na(movies$gross)&is.na(movies$budget)))))
head(subset(movies,is.na(movies$gross)&is.na(movies$budget),
            select=c("movie_title","director_name","actor_1_name","actor_2_name",
                     "actor_3_name","title_year","movie_imdb_link")), n=20)

[1] "Number of rows: 224"


movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,title_year,movie_imdb_link
Star Wars: Episode VII - The Force Awakens,Doug Walker,Doug Walker,Rob Walker,,,http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1
The Lovers,Roland JoffÃ©,Tamsin Egerton,Alice Englert,Bipasha Basu,2015.0,http://www.imdb.com/title/tt1321869/?ref_=fn_tt_tt_1
Godzilla Resurgence,Hideaki Anno,Mark Chinnery,Shin'ya Tsukamoto,Atsuko Maeda,2016.0,http://www.imdb.com/title/tt4262980/?ref_=fn_tt_tt_1
Harry Potter and the Deathly Hallows: Part II,Matt Birch,Rupert Grint,Dave Legeno,Ralph Ineson,2011.0,http://www.imdb.com/title/tt1680310/?ref_=fn_tt_tt_1
Godzilla Resurgence,Hideaki Anno,Mark Chinnery,Shin'ya Tsukamoto,Atsuko Maeda,2016.0,http://www.imdb.com/title/tt4262980/?ref_=fn_tt_tt_1
Harry Potter and the Deathly Hallows: Part I,Matt Birch,Rupert Grint,Toby Jones,Alfred Enoch,2010.0,http://www.imdb.com/title/tt1571403/?ref_=fn_tt_tt_1
The A-Team,,George Peppard,Dirk Benedict,Dwight Schultz,,http://www.imdb.com/title/tt0084967/?ref_=fn_tt_tt_1
"""10,000 B.C. """,Christopher Barnard,Mathew Buck,,,,http://www.imdb.com/title/tt1869849/?ref_=fn_tt_tt_1
Ben-Hur,Timur Bekmambetov,Morgan Freeman,Ayelet Zurer,Moises Arias,2016.0,http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1
Hannibal,,Caroline Dhavernas,Scott Thompson,Hettienne Park,,http://www.imdb.com/title/tt2243973/?ref_=fn_tt_tt_1


In this sample of 20 observations, 13 come from the wrong IMDB page, six are movies whose IMDB page doesn't show that information, and one is a duplicate. From that sample, I notice that data mistakenly taken from the IMDB page of a TV show has no value for the variable "director_name", since generally more than one director works on a TV show, nor for the variable "title_year", since shows often run for more than one season. That could indicate that a missing value for those two variables, together with a missing "gross" and/or "budget", is a strong indicator of a bad observation.

Let's look at some observations where only the "gross" value is missing to see if that hypothesis is correct.

In [460]:
print(paste0("Number of rows: ",nrow(subset(movies,is.na(movies$gross)&!is.na(movies$budget)))))
head(subset(movies,is.na(movies$gross)&!is.na(movies$budget),
            select=c("movie_title","director_name","actor_1_name","actor_2_name",
                     "actor_3_name","title_year","movie_imdb_link")), n=20)

[1] "Number of rows: 660"


movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,title_year,movie_imdb_link
Miami Vice,,Don Johnson,Philip Michael Thomas,John Diehl,,http://www.imdb.com/title/tt0086759/?ref_=fn_tt_tt_1
Asterix at the Olympic Games,FrÃ©dÃ©ric Forestier,Alain Delon,Santiago Segura,Vanessa Hessler,2008.0,http://www.imdb.com/title/tt0463872/?ref_=fn_tt_tt_1
Creepshow,George A. Romero,Ted Danson,Hal Holbrook,Adrienne Barbeau,1982.0,http://www.imdb.com/title/tt0083767/?ref_=fn_tt_tt_1
Top Cat Begins,AndrÃ©s Couturier,Sariann Monaco,David Hoffman,Ben Diskin,2015.0,http://www.imdb.com/title/tt4057916/?ref_=fn_tt_tt_1
Red Dawn,John Milius,Lea Thompson,Jennifer Grey,William Smith,1984.0,http://www.imdb.com/title/tt0087985/?ref_=fn_tt_tt_1
Xi you ji zhi: Sun Wukong san da Baigu Jing,Pou-Soi Cheang,Li Gong,Aaron Kwok,Eddie Peng,2016.0,http://www.imdb.com/title/tt4591310/?ref_=fn_tt_tt_1
"""Sabrina, the Teenage Witch """,,Nate Richert,Soleil Moon Frye,Caroline Rhea,,http://www.imdb.com/title/tt0115341/?ref_=fn_tt_tt_1
Stargate SG-1,,Christopher Judge,Don S. Davis,Gary Jones,,http://www.imdb.com/title/tt0118480/?ref_=fn_tt_tt_1
Lolita,Stanley Kubrick,James Mason,Shelley Winters,Lois Maxwell,1962.0,http://www.imdb.com/title/tt0056193/?ref_=fn_tt_tt_1
Eye See You,Jim Gillespie,Sylvester Stallone,Tom Berenger,Charles S. Dutton,2002.0,http://www.imdb.com/title/tt0160184/?ref_=fn_tt_tt_1


Three observations out of 20 are from a TV show instead of the movie of the same name, and they all have missing values in the columns "director_name" and "title_year".

Now I look at the observations where only the "budget" value is missing to see if I should keep them in the data set.

In [461]:
print(paste0("Number of rows: ",nrow(subset(movies,!is.na(movies$gross)&is.na(movies$budget)))))
head(subset(movies,!is.na(movies$gross)&is.na(movies$budget),
            select=c("movie_title","director_name","actor_1_name","actor_2_name",
                     "actor_3_name","title_year","movie_imdb_link")), n=20)

[1] "Number of rows: 268"


movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,title_year,movie_imdb_link
The Good Dinosaur,Peter Sohn,A.J. Buckley,Jack McGraw,Peter Sohn,2015,http://www.imdb.com/title/tt1979388/?ref_=fn_tt_tt_1
Charlotte's Web,Gary Winick,Steve Buscemi,Julia Roberts,Oprah Winfrey,2006,http://www.imdb.com/title/tt0413895/?ref_=fn_tt_tt_1
DÃ©jÃ Vu,Henry Jaglom,Vanessa Redgrave,Stephen Dillane,Michael Brandon,1997,http://www.imdb.com/title/tt0119033/?ref_=fn_tt_tt_1
The Edge,Lee Tamahori,Anthony Hopkins,Harold Perrineau,Bart the Bear,1997,http://www.imdb.com/title/tt0119051/?ref_=fn_tt_tt_1
Carriers,David Pastor,Christopher Meloni,Kiernan Shipka,Lou Taylor Pucci,2009,http://www.imdb.com/title/tt0806203/?ref_=fn_tt_tt_1
The Finest Hours,Craig Gillespie,Michael Raymond-James,Abraham Benrubi,Graham McTavish,2016,http://www.imdb.com/title/tt2025690/?ref_=fn_tt_tt_1
Dinner for Schmucks,Jay Roach,Steve Carell,Stephanie Szostak,Bruce Greenwood,2010,http://www.imdb.com/title/tt0427152/?ref_=fn_tt_tt_1
Wild Hogs,Walt Becker,Jill Hennessy,Tichina Arnold,Drew Sidora,2007,http://www.imdb.com/title/tt0486946/?ref_=fn_tt_tt_1
State of Play,Kevin Macdonald,Robin Wright,Harry Lennix,Michael Weston,2009,http://www.imdb.com/title/tt0473705/?ref_=fn_tt_tt_1
Trouble with the Curve,Robert Lorenz,Clint Eastwood,Ed Lauter,Bob Gunton,2012,http://www.imdb.com/title/tt2083383/?ref_=fn_tt_tt_1


All the observations in that sample match their IMDB page, so I will keep those observations in the data set. In fact, this table shows me that a missing budget is not a good indicator of a bad observation.

The last three tables reinforce my opinion that observations with missing "gross", "director_name" and "title_year" values won't contribute positively to the model. I will delete them from the data set, but in the future, when I have more time, I would like to scrape them myself.

One more thing I noticed while looking at the IMDB pages of those movies is that the variables "actor_1_name", "actor_2_name" and "actor_3_name" are not necessarily good indicators of the starring actors of a movie. For example, the movie "The Edge" has Anthony Hopkins, Alec Baldwin, Elle Macpherson and Harold Perrineau as top-billed actors, according to Wikipedia, while the data set lists Anthony Hopkins, Harold Perrineau and "Bart the Bear" instead. I'm sure "Bart the Bear" did a good acting job in that movie, but I doubt that many people bought a ticket to see him instead of Elle Macpherson or Alec Baldwin. I'll have to remember this while creating the predictive model.

All that being said, I have three things to do before moving on to the exploration of the data: delete the duplicates in the table, delete the observations without a value for the variables "gross", "director_name" and "title_year", and scrape the missing "budget" and "gross" values.

In [462]:
print(paste0("Number of rows before: ",nrow(movies)))
movies<-unique(movies)
print(paste0("Number of rows after: ",nrow(movies)))

[1] "Number of rows before: 5043"
[1] "Number of rows after: 4998"
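Note that unique() only drops rows that are identical in every column. If the same movie was scraped twice with, say, a different like count, both rows would survive. Here is a sketch of deduplicating on a chosen key instead, on a toy table; the like counts are made up and the choice of title plus year as the key is an assumption:

```r
# Toy table: the first two rows describe the same movie but differ in one column
toy <- data.frame(
  movie_title = c("Ben-Hur", "Ben-Hur", "Tango"),
  title_year = c(2016, 2016, 1998),
  movie_facebook_likes = c(100, 120, 5),
  stringsAsFactors = FALSE
)

# unique() keeps all three rows because no two rows match on every column
nrow(unique(toy))

# Keep the first row for each title and year pair
dedup <- toy[!duplicated(toy[, c("movie_title", "title_year")]), ]
print(dedup)
```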


In [463]:
print(paste0("Number of rows before: ",nrow(movies)))
movies<-subset(movies,!is.na(gross)|!is.na(title_year)|director_name!="")
print(paste0("Number of rows after: ",nrow(movies)))

[1] "Number of rows before: 4998"
[1] "Number of rows after: 4898"
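The filter above keeps a row as soon as one of the three fields is present, which is the same as dropping only the rows where all three are missing. Here is a small sketch checking that equivalence on three rows taken from the samples shown earlier:

```r
# Rows as they appear in the samples above ("" marks a missing director)
toy <- data.frame(
  movie_title = c("Miami Vice", "Lolita", "Hannibal"),
  director_name = c("", "Stanley Kubrick", ""),
  title_year = c(NA, 1962, NA),
  gross = c(NA_real_, NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

# "Drop when all three are missing" ...
all_missing <- is.na(toy$gross) & is.na(toy$title_year) & toy$director_name == ""
# ... is the negation of "keep when at least one is present"
any_present <- !is.na(toy$gross) | !is.na(toy$title_year) | toy$director_name != ""

kept <- toy[!all_missing, ]
print(kept$movie_title)
```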


I want to find the missing values of the variables "budget" and "gross" by using a Python script to scrape them from the web. Since Kaggle user chuansun76 (https://www.kaggle.com/deepmatrix) was kind enough to share his source code with the IMDB data set, I'll modify his code to scrape the data. You can find the details of that scraping process here: https://github.com/GTouzin/Portfolio/tree/master/Python (full notebook coming soon).

I saved the results in a CSV file, which I load below.

In [464]:
budget <- read.csv ("scrap_gross_budget_copie.csv",stringsAsFactors = F)
budget<-data.table(budget)
head(budget)

title_year,budget,movie_title,gross
2009,425000000,Avatar,760507625
2015,306000000,Star Wars Ep. VII: The Force Awakens,936662225
2007,300000000,Pirates of the Caribbean: At Worlds End,309420425
2015,300000000,Spectre,200074175
2012,275000000,The Dark Knight Rises,448139099
2013,275000000,The Lone Ranger,89302115


I wrote a custom function to fill the missing "budget" and "gross" values with the new values.

In [466]:
fill_budget<-function(title_ref,year_ref,budget_ref,title_to_fill,year_to_fill,budget_to_fill){
    # Normalise a key: lower case, spaces removed
    norm<-function(x) gsub(" ", "", tolower(x), fixed = TRUE)
    title_ref<-norm(title_ref)
    year_ref<-norm(year_ref)
    value<-budget_to_fill
    for (i in seq_along(title_to_fill))
    {
        # Values that are already known are never overwritten
        if(!is.na(value[i]))
        {
            next
        }
        # Reference rows with the same normalised title and year
        index<-which(title_ref %in% norm(title_to_fill[i])
                     & year_ref %in% norm(year_to_fill[i]))
        # Fill only when exactly one reference row matches
        if(length(index)==1)
        {
            value[i]<-budget_ref[index]
        }
    }
    return(value)
}

In [467]:
temp<-fill_budget(budget$movie_title,budget$title_year,budget$budget,movies$movie_title,movies$title_year,movies$budget)
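The loop scans the whole reference table once per movie. Here is a vectorised sketch of the same matching rule (same normalised title and year, fill only when exactly one reference row matches), using base R's match(). The toy tables and the names make_key, ref and to_fill are assumptions; the Avatar and Spectre figures are the ones printed above, and the duplicated "Tango" rows are made up to show an ambiguous match:

```r
# Toy reference table with one deliberately ambiguous title
ref <- data.frame(
  movie_title = c("Avatar", "Spectre", "Tango", "Tango"),
  title_year = c(2009, 2015, 1998, 1998),
  budget = c(425000000, 300000000, 1, 2),
  stringsAsFactors = FALSE
)
# Toy table to fill; NA marks a missing budget
to_fill <- data.frame(
  movie_title = c("avatar", "Spectre", "Tango"),
  title_year = c(2009, 2015, 1998),
  budget = c(NA, 245000000, NA),
  stringsAsFactors = FALSE
)

# Same key as the loop: lower-case title without spaces, plus the year
make_key <- function(title, year) {
  paste(gsub(" ", "", tolower(title), fixed = TRUE), year)
}
ref_key <- make_key(ref$movie_title, ref$title_year)

# Discard reference keys that appear more than once (ambiguous matches)
is_unique <- !(duplicated(ref_key) | duplicated(ref_key, fromLast = TRUE))
pos <- match(make_key(to_fill$movie_title, to_fill$title_year), ref_key[is_unique])

# Fill only the missing budgets that found exactly one reference row
fillable <- is.na(to_fill$budget) & !is.na(pos)
filled <- to_fill$budget
filled[fillable] <- ref$budget[is_unique][pos[fillable]]
print(filled)
```

In this toy run the missing Avatar budget is filled, the existing Spectre budget is left alone, and the ambiguous Tango stays missing.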

I count the number of missing observations that were filled by my function.

In [468]:
sum(is.na(movies[, budget]))-sum(is.na(temp))

I filled a third of the missing data: not bad! Looking at the names of the movies that still have a missing value, I realized that some movie names in the reference data set are slightly different from the names in the data set with the new budget values. Generally, the difference is small; often a missing quote is the cause. When I have time in the future, I would like to write a script to fix that problem.
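One possible shape for that fix is to compare titles after removing everything except letters and digits. Here is a sketch; normalise_title is a hypothetical helper, and the first IMDB-side spelling is assumed for illustration:

```r
# Hypothetical helper: keep only lower-case ASCII letters and digits
# (accented letters would be dropped too, which may or may not be wanted)
normalise_title <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# A missing apostrophe no longer prevents the match
quote_case <- normalise_title("Pirates of the Caribbean: At World's End") ==
  normalise_title("Pirates of the Caribbean: At Worlds End")

# A reworded title still does not match
reworded_case <- normalise_title("Star Wars: Episode VII - The Force Awakens") ==
  normalise_title("Star Wars Ep. VII: The Force Awakens")

print(c(quote_case = quote_case, reworded_case = reworded_case))
```

Punctuation differences go away, but abbreviations such as "Ep." would still need a fuzzier comparison.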

Let's make sure my function didn't create any error in the data by comparing the value of the original vector with the vector I created.

In [469]:
sum(is.na(movies[, budget]))
length(temp)-sum(movies[, budget]==temp,na.rm =TRUE)

Since the only differences between the two vectors are on the rows where there's a missing value in the first vector, my function didn't change any original value. Reassured by that fact, I copy the new vector, the one with fewer missing values, into the original data set.
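The same check can be written so that it tests the claim directly: every value that was known before must be unchanged afterwards. Here is a sketch on two made-up vectors:

```r
# Made-up vectors: "filled" has one more known value than "original"
original <- c(100, NA, 300, NA)
filled <- c(100, 250, 300, NA)

# Positions that already had a value before filling
known <- !is.na(original)

# TRUE only if no known value was modified
unchanged <- all(original[known] == filled[known])

# Number of missing values that were actually filled
newly_filled <- sum(is.na(original) & !is.na(filled))
print(c(unchanged = unchanged, newly_filled = newly_filled))
```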

In [470]:
movies$budget<-temp

Now, to fill the missing "gross" values, I'll apply the same logic as before.

In [471]:
temp<-fill_budget(budget$movie_title,budget$title_year,budget$gross,movies$movie_title,movies$title_year,movies$gross)

In [473]:
sum(is.na(movies[, gross]))-sum(is.na(temp))
sum(is.na(movies[, gross]))
length(temp)-sum(movies[, gross]==temp,na.rm =TRUE)

In [474]:
movies$gross<-temp

Here are the numbers of missing values in the data set for each variable. Since most of the variables are categorical with too many levels for two observations to be similar, I'm afraid that imputing the missing data would generate too much noise. So, I will capitalise on the fact that my data set is large and simply not use those observations while creating my model. However, I won't delete the observations with missing values, since the information in them will be useful for estimating the distribution of each variable.

In [492]:
sapply(movies, function(y) sum(is.na(y)))
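Keeping the incomplete rows in the table while leaving them out of the model can be done with a row mask rather than a deletion. Here is a sketch with complete.cases() on a toy table; the choice of "budget" and "gross" as the model variables is only an illustration:

```r
# Toy table with some missing values (made up)
toy <- data.frame(
  budget = c(10, NA, 30),
  gross = c(5, 8, NA),
  duration = c(90, 100, 110)
)

# Hypothetical set of variables the model would need
model_vars <- c("budget", "gross")

# Rows usable for the model; the full table is left untouched
model_rows <- complete.cases(toy[, model_vars])
model_data <- toy[model_rows, ]
print(model_data)
```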

In [489]:
write.table(movies,file ="movies_clean.csv",row.names=FALSE,sep=";")

# Converting the currency

Here I'm cheating a little bit: I only realised there was a problem with the data well into the exploration phase, which will be my next article. But even though I saw this problem long after this particular point, I believe that I should correct it while I'm cleaning the data. So without further ado, let's look at the distribution of the budget variable.

In [495]:
# plot_ly() and embed_notebook() come from the plotly package
library(plotly)
budget_hist <- plot_ly(x=~movies$budget,type="histogram")
embed_notebook(budget_hist)

When the outliers of your graph are so far away from the mode of the distribution that your histogram looks like a density plot, something's wrong! Let's look at the table to get a better sense of what is happening.

In [490]:
temp<-data.table(movie_title=movies$movie_title,budget=movies$budget,gross=movies$gross,
                 country=movies$country,title_year=movies$title_year)
temp<-temp[order(-budget)]
head(temp, n=10)

movie_title,budget,gross,country,title_year
The Host,12215500000,2201412,South Korea,2006
Lady Vengeance,4200000000,211667,South Korea,2005
Fateless,2500000000,195888,Hungary,2005
Princess Mononoke,2400000000,2298191,Japan,1997
Steamboy,2127519898,410388,Japan,2004
Akira,1100000000,439162,Japan,1988
Godzilla 2000,1000000000,10037390,Japan,1999
Kabhi Alvida Naa Kehna,700000000,3275443,India,2006
Tango,700000000,1687311,Spain,1998
Kites,600000000,1602466,India,2010


We see that foreign movies skew the distribution of that variable, especially the Asian films. That is because those "budget" and "gross" values are listed in a foreign currency, and since most currencies trade at more than one unit per US dollar, those values look much larger than they are and stretch the right tail of the two distributions.

I looked at the movies from South Korea to see how I could deal with this problem.

In [496]:
temp[temp$country=="South Korea"]

movie_title,budget,gross,country,title_year
The Host,12215500000.0,2201412.0,South Korea,2006
Lady Vengeance,4200000000.0,211667.0,South Korea,2005
Inchon,48000000.0,,South Korea,1981
Snowpiercer,39200000.0,4563029.0,South Korea,2013
Dragon Wars: D-War,35000000.0,10956379.0,South Korea,2007
The Last Godfather,13400000.0,163591.0,South Korea,2010
Tae Guk Gi: The Brotherhood of War,12800000.0,1110186.0,South Korea,2004
Operation Chromite,12620000.0,31662.0,South Korea,2016
Jungle Shuffle,10000000.0,,South Korea,2014
"""The Good, the Bad, the Weird """,10000000.0,128486.0,South Korea,2008


The budget for "Oldboy" is in US dollars. This observation tells me that I can't just convert the budgets of all foreign movies without taking into consideration the currency in which each one is valued. Maybe I can convert the budgets of foreign movies that are high enough to indicate that they are not in US dollars, let's say above $50,000,000? Of course, it's not an optimal approach, because some movies that cost under 50 million dollars will still be in a foreign currency, but this approach would diminish the effect of those outliers on the distribution and be the most time-effective.
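As a sketch, the threshold rule amounts to a logical flag on country and budget. The rows below are taken from the tables above; the column name suspect_currency is an assumption, and the rule only flags rows rather than converting anything:

```r
# Three rows from the tables above
toy <- data.frame(
  movie_title = c("The Host", "Snowpiercer", "Kites"),
  country = c("South Korea", "South Korea", "India"),
  budget = c(12215500000, 39200000, 600000000),
  stringsAsFactors = FALSE
)

# Threshold proposed in the text
threshold <- 50000000

# Flag non-US movies whose budget is at or above the threshold
toy$suspect_currency <- toy$country != "USA" & toy$budget >= threshold
print(toy)
```

Snowpiercer stays unflagged, but the flag cannot tell whether any given budget is really in a foreign currency, which is exactly the weakness of the approach.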

In the next table, the foreign movies with a budget of 50 million dollars or more are displayed in descending order.

In [498]:
head(temp[temp$country!="USA"&temp$budget>=50000000], n=20)

movie_title,budget,gross,country,title_year
The Host,12215500000,2201412,South Korea,2006
Lady Vengeance,4200000000,211667,South Korea,2005
Fateless,2500000000,195888,Hungary,2005
Princess Mononoke,2400000000,2298191,Japan,1997
Steamboy,2127519898,410388,Japan,2004
Akira,1100000000,439162,Japan,1988
Godzilla 2000,1000000000,10037390,Japan,1999
Kabhi Alvida Naa Kehna,700000000,3275443,India,2006
Tango,700000000,1687311,Spain,1998
Kites,600000000,1602466,India,2010


We see that quite a few American movies are credited as foreign movies on IMDB (probably for tax reasons), but their budgets are in US dollars. See, for example, King Kong, X-Men: The Last Stand and Harry Potter and the Half-Blood Prince.

So, using the strategy that I described above, not only will some budgets still be in a foreign currency, but I will also replace the budgets of some American movies with incorrect values. I'll have to scrape some more data...

# Getting the currency

In [201]:
currency <- fread("find_estimated.csv",stringsAsFactors = F)

In [202]:
head(currency)

title_year,currency,estimated,movie_title
2006,b'$',1,"b""""Pirates of the Caribbean: Dead Man's Chest"""""
2013,b'$',1,b'The Lone Ranger'
2013,b'$',1,b'Man of Steel'
2008,b'$',1,b'The Chronicles of Narnia: Prince Caspian'
2012,b'$',1,b'The Avengers'
2011,b'$',1,b'Pirates of the Caribbean: On Stranger Tides'


So there's gonna be a bit of cleaning to do.

In [203]:
unique(currency$currency)

I'll drop the first two characters and the last one (the b'...' wrapper) from the variables currency and movie_title.

In [204]:
currency$currency<-as.character(currency$currency)
currency$currency<-substr(currency$currency, 3, nchar(currency$currency)-1)
unique(currency$currency)

In [205]:
currency$movie_title<-as.character(currency$movie_title)
currency$movie_title<-substr(currency$movie_title, 3, nchar(currency$movie_title)-1)
sample(currency$movie_title, size=5)
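Cutting at fixed positions leaves a stray quote on titles that were wrapped in doubled quotes, as the "Pirates of the Caribbean" row shows. Here is a sketch of a pattern-based alternative; strip_bytes_repr is a hypothetical helper, and the raw strings are shaped like the values printed above:

```r
# Raw strings shaped like the scraped values shown above
raw <- c("b'$'", "b'The Lone Ranger'",
         "b\"\"Pirates of the Caribbean: Dead Man's Chest\"\"")

# Remove the leading b with its opening quotes, and the closing quotes;
# quotes in the middle of a title are left alone
strip_bytes_repr <- function(x) gsub("^b['\"]+|['\"]+$", "", x)

cleaned <- strip_bytes_repr(raw)
print(cleaned)
```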

Then, I'll replace the characters '$', '€' and '£' with a country code.

In [206]:
currency$currency[currency$currency=='$']<-"USA"
currency$currency[currency$currency==currency$currency[currency$movie_title=="Micmacs"]]<-"FRF"
currency$currency[currency$currency=='£']<-"GBR"
unique(currency$currency)

There are rows with no currency label; let's look at some of them.

In [207]:
head(currency$movie_title[currency$currency==''])

We see that the majority of those movies are American, which makes sense since the US dollar is the default currency, but some of them are just missing values. We also see that there are some duplicates in the table, for example:

In [208]:
subset(currency,movie_title=='Godzilla Resurgence')
str(currency)

title_year,currency,estimated,movie_title
2016,,,Godzilla Resurgence
2016,,,Godzilla Resurgence


Classes 'data.table' and 'data.frame':	5029 obs. of  4 variables:
 $ title_year : int  2006 2013 2013 2008 2012 2011 2012 2014 2012 2010 ...
 $ currency   : chr  "USA" "USA" "USA" "USA" ...
 $ estimated  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ movie_title: chr  "\"Pirates of the Caribbean: Dead Man's Chest\"" "The Lone Ranger" "Man of Steel" "The Chronicles of Narnia: Prince Caspian" ...
 - attr(*, ".internal.selfref")=<externalptr> 


Let's get rid of them

In [209]:
#currency<-currency[!duplicated(currency),]
setkey(currency,NULL)
currency<-unique(currency)
str(currency)

Classes 'data.table' and 'data.frame':	4907 obs. of  4 variables:
 $ title_year : int  2006 2013 2013 2008 2012 2011 2012 2014 2012 2010 ...
 $ currency   : chr  "USA" "USA" "USA" "USA" ...
 $ estimated  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ movie_title: chr  "\"Pirates of the Caribbean: Dead Man's Chest\"" "The Lone Ranger" "Man of Steel" "The Chronicles of Narnia: Prince Caspian" ...
 - attr(*, ".internal.selfref")=<externalptr> 


In [162]:
head(currency$movie_title[currency$currency==''])
length(currency$movie_title[currency$currency==''])

Now I'll use the data from the IMDB data.table to assign the USA label to the American films with no currency data.

In [210]:
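# x=currency table
# y=IMDB movies table
change_US_currency<-function(x,y){
    values<-x$currency
    for (i in seq_along(x$currency)){
        for (j in seq_along(y$country)){
            # an empty label becomes "USA" when IMDB lists a movie with the same title as American;
            # isTRUE() treats a comparison with a missing value as "no match"
            if(isTRUE(x$currency[i]=='' && y$movie_title[j]==x$movie_title[i] && y$country[j]=="USA"))
            {
                values[i]<-"USA"
            }
        }
    } 
    return(values)
}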
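
The double loop above compares every row of one table with every row of the other, which gets slow with thousands of movies. Here is a sketch of a vectorised variant that should give the same labels. The name fill_us_currency and the toy tables are hypothetical; I only assume the columns movie_title, currency and country used above.

```r
# Hypothetical vectorised variant of change_US_currency()
fill_us_currency <- function(x, y) {
    # titles that the IMDB table lists as produced in the USA
    us_titles <- y$movie_title[which(y$country == "USA")]
    values <- x$currency
    # which() drops NA comparisons so the assignment cannot fail on them
    to_fill <- which(values == "" & x$movie_title %in% us_titles)
    values[to_fill] <- "USA"
    values
}

# Toy tables (illustrative values only)
toy_currency <- data.frame(movie_title = c("A", "B", "C"),
                           currency    = c("", "GBR", ""),
                           stringsAsFactors = FALSE)
toy_movies <- data.frame(movie_title = c("A", "B", "C"),
                         country     = c("USA", "UK", "France"),
                         stringsAsFactors = FALSE)
fill_us_currency(toy_currency, toy_movies)
```

Like the loop, this matches on the title only, so two different movies with the same title could still be mislabelled.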

In [211]:
temp<-change_US_currency(currency,movies)

In [212]:
length(temp[temp=="USA"])
length(temp[temp==""])
length(temp)-(length(temp[temp=="USA"])+length(temp[temp==""]))
length(temp)

In [213]:
length(currency$currency[currency$currency=="USA"])
length(currency$currency[currency$currency==""])
length(currency$currency)-(length(currency$currency[currency$currency=="USA"])+length(currency$currency[currency$currency==""]))
length(currency$currency)

In [214]:
currency$currency<-temp

In [216]:
n<-c("title_year","currency","movie_title")
currency<-currency[,n,with=FALSE]

In [217]:
# write.csv() always writes the header row; passing col.names only triggers a warning
write.csv(currency,"currency.csv",row.names=FALSE)

In [218]:
currency<-fread("currency.csv", stringsAsFactors=FALSE)

# Merging the data

I found a data set of historical exchange rates from 1950 to 2015 on the OECD website: https://data.oecd.org/conversion/exchange-rates.htm. Source: OECD (2017), Exchange rates (indicator). doi: 10.1787/037ed317-en (accessed on 13 January 2017).

In [219]:
abreviation <- fread("Abr.csv",stringsAsFactors = F)
#abreviation$Country<-substr(abreviation$Country, 1, nchar(abreviation$Country)-1)
str(abreviation)
setdiff(unique(currency$currency),abreviation$CODE)

Classes 'data.table' and 'data.frame':	234 obs. of  2 variables:
 $ CODE   : chr  "ABW" "AFG" "AFRI" "AGO" ...
 $ Country: chr  "Aruba" "Afghanistan" "Africa" "Angola" ...
 - attr(*, ".internal.selfref")=<externalptr> 


That data set uses the ISO 3166 country abbreviations as an index for the table, while the IMDB website uses the ISO 4217 currency codes to characterise the budget. As a consequence, I'll have to map each currency abbreviation to the corresponding country abbreviation to be able to use those data.

In [220]:
currency$currency[currency$currency=='FRF']<-"FRA"
currency$currency[currency$currency=='RUR']<-"USSR"
currency$currency[currency$currency=='CNY']<-"CHN"
currency$currency[currency$currency=='AUD']<-"AUS"
currency$currency[currency$currency=='HKD']<-"HKG"
currency$currency[currency$currency=='CAD']<-"CAN"
currency$currency[currency$currency=='JPY']<-"JPN"
currency$currency[currency$currency=='NOK']<-"NOR"
currency$currency[currency$currency=='DEM']<-"DEU"
currency$currency[currency$currency=='THB']<-"THA"
currency$currency[currency$currency=='KRW']<-"KOR"
currency$currency[currency$currency=='HUF']<-"HUN"
currency$currency[currency$currency=='INR']<-"IND"
currency$currency[currency$currency=='DKK']<-"DNK"
currency$currency[currency$currency=='CZK']<-"CZE"
currency$currency[currency$currency=='NZD']<-"NZL"
currency$currency[currency$currency=='CHF']<-"CHE"
currency$currency[currency$currency=='BRL']<-"BRA"
currency$currency[currency$currency=='ZAR']<-"ZAF"
currency$currency[currency$currency=='SEK']<-"SWE"
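
The same mapping can also be written as a lookup table, which keeps all the pairs in one place. This is only a sketch: code_map holds a few of the pairs listed above and labels is a made-up input vector.

```r
# Named vector: names are the labels currently in the column, values are the country codes
code_map <- c(FRF = "FRA", CNY = "CHN", AUD = "AUS", JPY = "JPN", KRW = "KOR")

# Toy input (illustrative only)
labels <- c("USA", "JPY", "FRF", "", "KRW")

# replace only the labels that have an entry in the lookup table
hit <- labels %in% names(code_map)
labels[hit] <- unname(code_map[labels[hit]])
labels
```

Labels without an entry in the lookup table, such as "USA" or the empty string, are left untouched.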

Let's see if there are any currency codes that I left out.

In [221]:
setdiff(unique(currency$currency),abreviation$CODE)

In [234]:
resumer_movie<-data.table(movie_title=movies$movie_title,budget=movies$budget,gross=movies$gross,
                 country=movies$country,title_year=movies$title_year)
resumer_movie<-resumer_movie[order(-budget)]
head(resumer_movie, n=10)
str(resumer_movie)

movie_title,budget,gross,country,title_year
The Host,12215500000,2201412,South Korea,2006
Lady Vengeance,4200000000,211667,South Korea,2005
Fateless,2500000000,195888,Hungary,2005
Princess Mononoke,2400000000,2298191,Japan,1997
Steamboy,2127519898,410388,Japan,2004
Akira,1100000000,439162,Japan,1988
Godzilla 2000,1000000000,10037390,Japan,1999
Kabhi Alvida Naa Kehna,700000000,3275443,India,2006
Tango,700000000,1687311,Spain,1998
Red Cliff,553632000,626809,China,2008


Classes 'data.table' and 'data.frame':	4034 obs. of  5 variables:
 $ movie_title: chr  "The Host" "Lady Vengeance" "Fateless" "Princess Mononoke" ...
 $ budget     :Class 'integer64'  num [1:4034] 6.04e-314 2.08e-314 1.24e-314 1.19e-314 1.05e-314 ...
 $ gross      : int  2201412 211667 195888 2298191 410388 439162 10037390 3275443 1687311 626809 ...
 $ country    : chr  "South Korea" "South Korea" "Hungary" "Japan" ...
 $ title_year : int  2006 2005 2005 1997 2004 1988 1999 2006 1998 2008 ...
 - attr(*, ".internal.selfref")=<externalptr> 


In [282]:
merge_movie<-merge(resumer_movie,currency,by=c("movie_title","title_year"))
head(merge_movie)

movie_title,title_year,budget,gross,country,currency
10 Cloverfield Lane,2016,15000000,71897215,USA,USA
10 Days in a Madhouse,2015,12000000,14616,USA,USA
10 Things I Hate About You,1999,16000000,38176108,USA,USA
102 Dalmatians,2000,85000000,66941559,USA,USA
10th & Wolf,2006,8000000,53481,USA,USA
12 Rounds,2009,22000000,12232937,USA,USA
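
By default merge() keeps only the rows that have a match in both tables, and the str() output in the next cell shows 4034 rows before the merge and 3785 after. To see which movies are lost, a left join can help. Here is a sketch with made-up tables; only the column names are taken from the tables above.

```r
# Toy tables (illustrative values only)
toy_movies <- data.frame(movie_title = c("Akira", "Tango", "Primer"),
                         title_year  = c(1988, 1998, 2004),
                         stringsAsFactors = FALSE)
toy_currency <- data.frame(movie_title = c("Akira", "Tango"),
                           title_year  = c(1988, 1998),
                           currency    = c("JPN", "ESP"),
                           stringsAsFactors = FALSE)

# all.x = TRUE keeps every movie; unmatched ones get NA in the currency column
checked <- merge(toy_movies, toy_currency,
                 by = c("movie_title", "title_year"), all.x = TRUE)
unmatched <- checked[is.na(checked$currency), ]
unmatched$movie_title
```

Looking at the unmatched titles is a quick way to spot keys that differ only by a stray quote or a trailing space.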


In [238]:
str(resumer_movie)
str(merge_movie)

Classes 'data.table' and 'data.frame':	4034 obs. of  5 variables:
 $ movie_title: chr  "The Host" "Lady Vengeance" "Fateless" "Princess Mononoke" ...
 $ budget     :Class 'integer64'  num [1:4034] 6.04e-314 2.08e-314 1.24e-314 1.19e-314 1.05e-314 ...
 $ gross      : int  2201412 211667 195888 2298191 410388 439162 10037390 3275443 1687311 626809 ...
 $ country    : chr  "South Korea" "South Korea" "Hungary" "Japan" ...
 $ title_year : int  2006 2005 2005 1997 2004 1988 1999 2006 1998 2008 ...
 - attr(*, ".internal.selfref")=<externalptr> 
Classes 'data.table' and 'data.frame':	3785 obs. of  6 variables:
 $ movie_title: chr  "10 Cloverfield Lane" "10 Days in a Madhouse" "10 Things I Hate About You" "102 Dalmatians" ...
 $ title_year : int  2016 2015 1999 2000 2006 2009 2013 2010 2004 2016 ...
 $ budget     :Class 'integer64'  num [1:3785] 7.41e-317 5.93e-317 7.91e-317 4.20e-316 3.95e-317 ...
 $ gross      : int  71897215 14616 38176108 66941559 53481 12232937 56667870 18329466 56044241 

I make sure that all the labels in the variable currency are included in the data set from the OECD.

# Convert budget

In [231]:
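#x=names to look for
#y=codes that replace those names
#z=vector in which the names are replaced
change_string_value<-function(x,y,z){
    value<-z
    for (i in seq_along(x)){
        for (j in seq_along(z)){
            # case-insensitive comparison, so "france" and "France" both match
            if(tolower(x[i])==tolower(z[j]))
            {
                value[j]<-y[i]
            }
        }
    } 
    return(value)
}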

In [None]:
#temp$V4<-change_string_value(abreviation$Country,abreviation$CODE,temp$V4)

In [232]:
head(temp$V4)

NULL

In [233]:
ex_rate <- fread("ExRate50-15.csv",stringsAsFactors = F)

str(ex_rate)

Classes 'data.table' and 'data.frame':	2764 obs. of  8 variables:
 $ LOCATION  : chr  "AUS" "AUS" "AUS" "AUS" ...
 $ INDICATOR : chr  "EXCH" "EXCH" "EXCH" "EXCH" ...
 $ SUBJECT   : chr  "TOT" "TOT" "TOT" "TOT" ...
 $ MEASURE   : chr  "NATUSD" "NATUSD" "NATUSD" "NATUSD" ...
 $ FREQUENCY : chr  "A" "A" "A" "A" ...
 $ TIME      : int  1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
 $ Value     : num  0.893 0.893 0.893 0.893 0.893 ...
 $ Flag Codes: logi  NA NA NA NA NA NA ...
 - attr(*, ".internal.selfref")=<externalptr> 


In [286]:
setdiff(merge_movie$currency,ex_rate$LOCATION)

In [276]:
#a=country of the rate
#b=year of the exchange rate
#c=value of the exchange rate (national currency per US dollar)
#x=country of the movie
#y=year of the making of the movie
#z=budget of the movie

# Despite its name, this function is used to convert the budget
convert_gross<-function(a,b,c,x,y,z){
    # as.numeric() so that the result of the division is not truncated when z is an integer64
    value<-as.numeric(z)
    for (i in seq_along(y)){
        # rows of the rate table for the country and the year of the movie
        index<-which(a %in% x[i] & b %in% y[i])
        if(length(index)>0)
        {
            # national currency / (national currency per US dollar) = US dollars
            value[i]<-value[i]/c[index[1]]
        }
        # when no rate is found, the budget is left unchanged
    } 
    return(value)
}
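
The same conversion can be sketched without a loop, by joining the rate table to the movies on the country code and the year. Everything below is a toy example: the films, the rates and the column budget_us are made up, and only the layout (LOCATION, TIME, Value) follows the ex_rate table.

```r
# Toy rate table in the same long layout as ex_rate (values are illustrative only)
rates <- data.frame(LOCATION = c("KOR", "KOR", "JPN"),
                    TIME     = c(2005, 2006, 1997),
                    Value    = c(1000, 950, 120),
                    stringsAsFactors = FALSE)
films <- data.frame(movie_title = c("Film A", "Film B", "Film C"),
                    currency    = c("KOR", "JPN", "USA"),
                    title_year  = c(2006, 1997, 2010),
                    budget      = c(9.5e9, 2.4e9, 1.5e7),
                    stringsAsFactors = FALSE)

# Left join on (country code, year); films without a rate get NA in Value
joined <- merge(films, rates,
                by.x = c("currency", "title_year"),
                by.y = c("LOCATION", "TIME"),
                all.x = TRUE)

# Same rule as convert_gross(): when no rate is found, the budget is left as is
rate <- ifelse(is.na(joined$Value), 1, joined$Value)
joined$budget_us <- joined$budget / rate
joined[, c("movie_title", "budget", "budget_us")]
```

Leaving the budget unchanged when no rate is found is convenient for the American movies, but it also hides the foreign movies whose rate is missing, so it may be safer to keep the NA and inspect those rows.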

In [274]:
index<-which(ex_rate$LOCATION %in% merge_movie$currency[2] & ex_rate$TIME %in% merge_movie$title_year[2])
str(index)
length(index)
merge_movie$currency[1]==ex_rate$LOCATION[index[1]]

 int 1879


In [277]:
merge_movie$budget_us<-convert_gross(ex_rate$LOCATION,ex_rate$TIME,ex_rate$Value,
                                  merge_movie$currency,merge_movie$title_year,merge_movie$budget)

In [273]:
merge_movie$movie_title[1]

In [278]:
head(merge_movie)

movie_title,title_year,budget,gross,country,currency,Result,result
10 Cloverfield Lane,2016,15000000,71897215,USA,USA,15000000,15000000
10 Days in a Madhouse,2015,12000000,14616,USA,USA,12000000,12000000
10 Things I Hate About You,1999,16000000,38176108,USA,USA,16000000,16000000
102 Dalmatians,2000,85000000,66941559,USA,USA,85000000,85000000
10th & Wolf,2006,8000000,53481,USA,USA,8000000,8000000
12 Rounds,2009,22000000,12232937,USA,USA,22000000,22000000


In [279]:
head(merge_movie[merge_movie$currency!="USA"])

movie_title,title_year,budget,gross,country,currency,Result,result
3,2010,,59774,Germany,,,
8 Women,2002,8000000.0,3076425,France,FRA,15007015.0,7529043.0
A Dangerous Method,2011,15000000.0,5702083,UK,FRA,28138154.0,20852013.0
A Room for Romeo Brass,1999,,18434,UK,,,
Aimee & Jaguar,1999,15000000.0,927107,Germany,DEU,6993460.0,15980788.0
Ajami,2009,,621240,Germany,,,


For example, the movie "A Dangerous Method" was made in 2011 for 15,000,000 euros according to IMDB, so the budget in the table is right. That year, the exchange rate was 0.719355 euro per US dollar, so the budget in US dollars should be:

In [253]:
15000000/0.719355

That seems all right: it matches the value in the column result of the table above! Let's look at the Asian films that skew the distribution.

In [280]:
head(merge_movie[merge_movie$currency=="KOR"])

movie_title,title_year,budget,gross,country,currency,Result,result
Lady Vengeance,2005,4200000000,211667,South Korea,KOR,1680000000,4101095
The Host,2006,12215500000,2201412,South Korea,KOR,4886200000,12793905


For the movie "The Host", the same check gives:

In [281]:
12215500000/954.790516

In [None]:
movies<-merge(movies,merge_movie,by=c("movie_title","title_year"))
head(movies)

I'll have to delete that observation with the others. But first, I'll delete the variables that I won't use, to minimise the number of observations deleted. For example, the variable num_user_for_reviews represents the number of reviews written by users on IMDB for each movie, after its release (I hope). Since I'm interested in predicting the box office of a movie before its production, that variable is useless.

I'm deleting the variables num_critic_for_reviews, num_voted_users, movie_imdb_link, num_user_for_reviews, imdb_score and movie_facebook_likes.

In [181]:
col<-names(movies)%in% c("num_critic_for_reviews","num_voted_users","movie_imdb_link","num_user_for_reviews",
                        "imdb_score","movie_facebook_likes")
movies<-movies[,!col,with=FALSE]

Next, I delete the observations with missing data.

In [182]:
movies<-movies[complete.cases(movies),]
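
Since complete.cases() drops every row that has at least one NA, it can be useful to count the missing values per column first and see which variables cost the most rows. Here is a minimal sketch on a made-up table:

```r
# Toy table (illustrative values only)
toy <- data.frame(budget   = c(1e6, NA, 5e6, 2e6),
                  gross    = c(3e6, 2e6, NA, 1e6),
                  duration = c(90, 100, 110, 95))

# missing values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# number of rows that complete.cases() would keep
kept <- sum(complete.cases(toy))
kept
```

Note that an empty string in a character column is not an NA, so complete.cases() will not drop those rows.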

Now, let's look at the values of each variable.

In [183]:
summary(movies)

    color           director_name         duration     director_facebook_likes
 Length:4034        Length:4034        Min.   : 37.0   Min.   :    0.0        
 Class :character   Class :character   1st Qu.: 95.0   1st Qu.:   10.0        
 Mode  :character   Mode  :character   Median :106.0   Median :   59.0        
                                       Mean   :109.8   Mean   :  768.2        
                                       3rd Qu.:120.0   3rd Qu.:  226.0        
                                       Max.   :330.0   Max.   :23000.0        
                                                                              
 actor_3_facebook_likes actor_2_name       actor_1_facebook_likes
 Min.   :    0.0        Length:4034        Min.   :     0        
 1st Qu.:  178.2        Class :character   1st Qu.:   722        
 Median :  423.0        Mode  :character   Median :  1000        
 Mean   :  737.1                           Mean   :  7471        
 3rd Qu.:  681.0                      

In the future, I would like to check for redundancy in the categorical variables. The reason is that those data have been scraped from IMDB with a Python package called "scrapy" and not taken from the IMDB database itself. As a consequence, bad formatting in a page or a networking error can cause errors in the values. For example, there could be a value Steven Spi^Xberg in the column director_name that won't be associated with the factor Steven Spielberg.
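
One possible way to run that check, sketched here with base R only, is to compute the edit distance between every pair of names and list the pairs that are almost identical. The names and the threshold of 2 edits are assumptions made for the illustration.

```r
# Toy vector of director names, including the garbled one from the text
directors <- c("Steven Spielberg", "Steven Spi^Xberg", "Ridley Scott", "Tony Scott")

# Generalised Levenshtein distance between every pair of names
d <- adist(directors)
dimnames(d) <- list(directors, directors)

# Pairs of distinct names at most 2 edits apart (the threshold is arbitrary)
close <- which(d > 0 & d <= 2 & upper.tri(d), arr.ind = TRUE)
data.frame(name_a = directors[close[, "row"]],
           name_b = directors[close[, "col"]],
           edits  = d[close])
```

The distance matrix grows with the square of the number of names, so on the full column it is better to run this on the unique values only.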

For the time being, looking at the quantiles of each numerical variable, there are two things I would like to investigate further. First of all, the maximum value of the variable facenumber_in_poster, which counts the number of faces on the movie poster, is 43, which seems quite high. Since the faces on each poster were counted by a face recognition algorithm, I suspect that this is an aberrant value. If that's not the case, investigating these data is worth my time just to find out what kind of abomination of a poster can contain that many faces.

In [184]:
p1 <- plot_ly(x=~movies$facenumber_in_poster,type="histogram")
embed_notebook(p1)

In [185]:
# facenumber_in_poster is spelled out in full instead of relying on the partial matching of "$"
faaaaaaaaaaace<-data.table(movie_title=movies$movie_title,facenumber=movies$facenumber_in_poster)
faaaaaaaaaaace<-faaaaaaaaaaace[order(-facenumber)]
head(faaaaaaaaaaace, n=10)

movie_title,facenumber
500 Days of Summer,43
The Master,31
Battle of the Year,19
The Expendables 3,15
Cheaper by the Dozen,15
New Year's Eve,15
Boogie Nights,15
As It Is in Heaven,15
A Bridge Too Far,14
Love the Coopers,13


In [186]:
tail(faaaaaaaaaaace, n=10)

movie_title,facenumber
George Washington,0
The Last Waltz,0
The Legend of God's Gun,0
In the Company of Men,0
Slacker,0
The Circle,0
The Cure,0
Primer,0
El Mariachi,0
My Date with Drew,0


Mother of god: the poster of "500 Days of Summer" really has 43 faces on it! I also looked at the posters of the 10 movies with the most faces on their poster, and the face recognition algorithm was on the mark. I conclude that I can trust those numbers.

The second thing I find suspicious is that the two variables "director_facebook_likes" and "actor_3_facebook_likes" have the same maximum value. Let's look at the distribution of the Facebook likes variables.

In [187]:
p2 <- plot_ly(x=~movies$director_facebook_likes,type="histogram")
embed_notebook(p2)

In [188]:
p3 <- plot_ly(x=~movies$actor_3_facebook_likes,type="histogram")
embed_notebook(p3)

In [189]:
p4 <- plot_ly(x=~movies$actor_2_facebook_likes,type="histogram")
embed_notebook(p4)

In [190]:
p5 <- plot_ly(x=~movies$actor_1_facebook_likes,type="histogram")
embed_notebook(p5)

In [191]:
p9 <- plot_ly(x=~movies$cast_total_facebook_likes,type="histogram")
embed_notebook(p9)

# Country

In [192]:
unique(movies$country) 

'New Line','Official site','West Germany' 

In [193]:
movies$movie_title[movies$country=="New Line"]
movies$budget[movies$country=="New Line"]

integer64
[1] 90000000

In [194]:
movies$country[movies$movie_title=="Town & Country"]<-"USA"

In [195]:
movies$movie_title[movies$country=="Official site"]
movies$budget[movies$country=="Official site"]

integer64
[1] 15000000

In [196]:
movies$country[movies$country=="Official site"]<-"USA"

In [197]:
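# "Das Boot" is a German production, so it gets the "Germany" label used elsewhere in the table
movies$country[movies$movie_title=="Das Boot"]<-"Germany"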

In [32]:
p6 <- plot_ly(x=~movies$gross,type="histogram")
embed_notebook(p6)

# Facebook likes

In [None]:
p8 <- plot_ly(x=~movies$actor_1_facebook_likes,type="histogram")
embed_notebook(p8)

In [None]:
p59 <- plot_ly(x=~movies$actor_1_facebook_likes,type="histogram")
embed_notebook(p59)

In [None]:
index<-which(ex_rate$LOCATION %in% temp$V4[1])
head(index)

In [None]:
head(ex_rate$LOCATION)

In [None]:

# Average budget by year, with a loess smoother and an annotation on the highest average
temp <- movies %>% select(budget,title_year)
temp <- temp %>% group_by(title_year)%>% summarise(score=mean(as.numeric(budget)))
temp <- na.omit(temp)
peak <- temp %>% dplyr::filter(score == max(score))
p <- plot_ly(temp, x = ~title_year, y = ~score, name = "Avg Budget by Year",
             type = "scatter", mode = "lines")
p %>%
  add_trace(y = ~fitted(loess(score ~ as.numeric(title_year)))) %>%
  layout(title = "Year and Average Budget",
         showlegend = FALSE,
         annotations = list(x = peak$title_year, y = peak$score, text = "Peak", showarrow = T))

In [None]:
summary(movies$color)

In [None]:
head(summary(movies$director_name))

In [None]:
col<-names(movies)%in% c("num_critic_for_reviews")
movies<-movies[,!col,with=FALSE]

In [None]:
movies<-movies[!(movies$director_name==""),]

In [None]:
sort(unique(movies$aspect_ratio))

In [None]:
movies[aspect_ratio==16,movie_title]

When I have time, I'll use IMDbPY to find those missing names, but for now, let's delete those lines.

In [None]:
movies<-movies[!(movies$director_name==""),]

In [None]:
str(movies)

In [None]:
sapply(movies, function(y) sum(is.na(y)))

The number of missing cases has decreased drastically since the beginning of the cleaning, and it's time to start exploring the data! If you like, you can access this section of the project at the address:  