# Creating a AI producer or: How I Learn to Stop Worrying and Love box office Bomb

If you love the movie, you have probably seen a trailer for a new blockbuster in the past and ask yourself "Why would someone spend 100 million dollars to make such a stupid movie?". For me, the last time this happens is when I heard of the existence of the "Baywatch" movie. I mean nobody wanted to watch a "Baywatch" movie; no director could possibly have dreamed of making a "Baywatch" movie. So what happened? There's probably a producer who's been sitting on that script for years now and Zack Efron had bills to pay. Maybe Dwayne "the Rock" Johnson saw that and thought "Hey, I like to have a new house and I'm free for two months between shooting "The Fast and the Furious 8" and "Jumanji 2"! I should be in this movie!". Then a studio executive talked with the producer and taught something like "It's an IP known by the public, there's two names we can put on the poster and we have a reason to put girls in bikini in the trailers... That could do 80 million at the box-office!". Then the movie got green lit... and will probably fail.

For every stupid movie made, there's always a weird commercial logic, who's been used to justify the investment and this logic is often based on hype, rumour and other subjective considerations. What if we used hard data to predict the success of a movie? Would we be more successful that a producer or the movie goers are so unpredictable that only a good judge the zeitgeist can predict if a movie will be successful.

In this project, I will use a data set of 5000 movies, scrape from IMDB by https://www.kaggle.com/deepmatrix, to try to predict if a movie will make money at the box-office by looking at his cast, his director and others key characteristics. The project is divided in three sections: the first one, this article, explain my process for cleaning and getting the data; in the second one, I explore the data by looking at the distribution of the variables and the relations between them; in the third one, I test some statistical model with the objective to predict the revenue of a movie with the information available before the release of the movie.

You can find the dataset here: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset.  

# Loading data

First of all, I load the library that I will use in that notebook.

In [40]:
library(dplyr)
library(data.table)
library(bit64)

After I load the data in the data frame "movies", I look at the first five rows of the table  to get a sense of the data and at the list of the variables.

In [2]:
movies <- fread("movie_metadata.csv",na.strings="",sep=",",stringsAsFactors = F)

In [3]:
head(movies)

color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes;
Color,James Cameron,723.0,178.0,0,855.0,Joel David Moore,1000,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936,7.9,1.78,33000;
Color,Gore Verbinski,302.0,169.0,563,1000.0,Orlando Bloom,40000,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000,7.1,2.35,0;
Color,Sam Mendes,602.0,148.0,0,161.0,Rory Kinnear,11000,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393,6.8,2.35,85000;
Color,Christopher Nolan,813.0,164.0,22000,23000.0,Christian Bale,27000,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000,8.5,2.35,164000;
,Doug Walker,,,131,,Rob Walker,131,,Documentary,...,,,,,,,12,7.1,,0;
Color,Andrew Stanton,462.0,132.0,475,530.0,Samantha Morton,640,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632,6.6,2.35,24000;


In [4]:
str(movies)

Classes 'data.table' and 'data.frame':	5043 obs. of  28 variables:
 $ color                    : chr  "Color" "Color" "Color" "Color" ...
 $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
 $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
 $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
 $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
 $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
 $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
 $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
 $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
 $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|

# Formatting data

The end line character ";", used in the JSON file containing the data originally, is included in the string "movie_facebook_likes". Since that variable is a numeric variable, I have to delet that character and convert the variable to an integer.

In [5]:
head(movies$movie_facebook_likes)
movies$movie_facebook_likes<-substr(movies$movie_facebook_likes,1,nchar(movies$movie_facebook_likes)-1)
head(movies$movie_facebook_likes)

In [6]:
col_drop<-"movie_facebook_likes;"
movies<-movies[,(col_drop):=NULL]

In [7]:
movies$movie_facebook_likes<-as.integer(movies$movie_facebook_likes)
str(movies)

"NAs introduits lors de la conversion automatique"

Classes 'data.table' and 'data.frame':	5043 obs. of  28 variables:
 $ color                    : chr  "Color" "Color" "Color" "Color" ...
 $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
 $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
 $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
 $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
 $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
 $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
 $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
 $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
 $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|

In the table above, each movie is in only one row and each variable are in his own column except for the column "genres" which contain too much information. 

In [8]:
head(unique(movies$genres), n=10)

To respect the principles of "tidy data" and facilitate the readability of the table, I'll have to split this column in multiple ones, one for each genre of movies listed in the variable "genres".   

In [9]:
library(splitstackshape)
movies<-concat.split.expanded(movies, "genres", sep="|", type="character",fill = 0)

In [10]:
col<- c("genres", "genres_Action","genres_Adventure","genres_Animation","genres_Biography","genres_Comedy",
        "genres_Crime","genres_Documentary","genres_Drama","genres_Family","genres_Fantasy","genres_Film-Noir",
        "genres_Game-Show","genres_History","genres_Horror","genres_Music","genres_Musical", "genres_Mystery",
        "genres_News", "genres_Reality-TV","genres_Romance","genres_Sci-Fi","genres_Short","genres_Sport",
        "genres_Thriller","genres_War","genres_Western")
head(movies[,col, with=FALSE])

genres,genres_Action,genres_Adventure,genres_Animation,genres_Biography,genres_Comedy,genres_Crime,genres_Documentary,genres_Drama,genres_Family,...,genres_Mystery,genres_News,genres_Reality-TV,genres_Romance,genres_Sci-Fi,genres_Short,genres_Sport,genres_Thriller,genres_War,genres_Western
Action|Adventure|Fantasy|Sci-Fi,1,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
Action|Adventure|Fantasy,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Action|Adventure|Thriller,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
Action|Thriller,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
Documentary,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Action|Adventure|Sci-Fi,1,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


Now that the table follows the conditions of a tidy data set, let's look at the variables in more details.

In [11]:
str(movies)

Classes 'data.table' and 'data.frame':	5043 obs. of  54 variables:
 $ color                    : chr  "Color" "Color" "Color" "Color" ...
 $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
 $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
 $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
 $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
 $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
 $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
 $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
 $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
 $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|

For this data set, I'll use the variable "movie_imdb_link" as a the principal key to identify an observation, since two movies can have the same value in all of the others variable. Also, I see two problems with this data set. First, there's a lot of missing values, so I'll have either to scrape them from IMDB, estimate them or delete those observations completely. Second, the formatting of the string in the column movie_title is weird...

In [12]:
head(movies$movie_title)

While getting the data from IMDB, there must have been an encoding error. I'll have to delete the character "Â" at the end of each string.

In [13]:
movies$movie_title<-sub("Â", "", movies$movie_title)
movies$movie_title<-sub(" ", "", movies$movie_title)

In [14]:
head(movies$movie_title)

# Missing values
First, let's look at the number of missing values from the first variable "color". To do this, I have to replace some string with their correct value.

In [17]:
unique(movies$color)

In [18]:
table(movies$color)


                " " Black and White            "Color   Black and White 
                1                 2                66               207 
            Color 
             4749 

In [19]:
movies$color<-sub("\"Color", "Color", movies$color)
movies$color<-sub("\" Black and White", "Black and White", movies$color)
movies$color<-sub("\"", NA, movies$color)
unique(movies$color)

In [20]:
print(paste0("NA: ",sum(is.na(movies$color))))

[1] "NA: 19"


There's 19 observations where the value of the variable "color" is missing. Since before 1939, the year where the wizard of Oz was released, the majority of the movie was in black and white, my first thought was to look for a separation in the data. If the distributions of the variable "title_year", who represent the year where the movie was released, for the movies in black and white is quite different from the one for the movies in color, I will be able to tell if a movie is in color or in black and white just by looking at the variable "title_year".     

In [21]:
print(paste0("Mean, Black and White: ",mean(movies[which(color=="Black and White"),title_year], na.rm=TRUE)))
print(paste0("Standard deviation, Black and White: ",sd(movies[which(color=="Black and White"),title_year],na.rm=TRUE)))
print(paste0("Mean, Color: ",mean(movies[which(color=="Color"),title_year], na.rm=TRUE)))
print(paste0("Standard deviation, Color: ",sd(movies[which(color=="Color"),title_year],na.rm=TRUE)))

[1] "Mean, Black and White: 1984.02912621359"
[1] "Standard deviation, Black and White: 26.252416868317"
[1] "Mean, Color: 2003.24819686042"
[1] "Standard deviation, Color: 10.821288180493"


In [22]:
print("Summary Black and White")
summary(movies[which(color=="Black and White"),title_year])
print("Summary color")
summary(movies[which(color=="Color"),title_year])

[1] "Summary Black and White"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1916    1962    1998    1984    2004    2015       3 

[1] "Summary color"


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1937    1999    2006    2003    2011    2016     101 

In [23]:
movies[is.na(movies$color),title_year]

We can see that the two means are much closer than I thought and that the standard deviation of the movies in black and white is quite large. The shape of those two distributions tells me that there's a high probability that a movie made after 2000 is in color and that those made before 1993 are in black and white. By looking at the year of release of the movies with a missing "color" value I count four movies with a missing value for the variable "title_year" and one movie made in 1990 right in between the two distributions. Let's look at the name of those movies.

In [24]:
movies[which(is.na(movies$color)&(title_year==1990|is.na(title_year))),movie_title]

Since there's only five problematics movies, I decided to do some research and find the missing value. After 2 minutes of searching on IMDB, I learned that all those movies are in color, so I can assign that factor to all the movie with a missing value on that variable. 

In [25]:
movies[is.na(movies$color),"color"]="Color"

In [26]:
sum(is.na(movies$color))

Let's look at the number of missing values for the other variables.

In [27]:
sapply(movies, function(y) sum(length(which(is.na(y)))))

Of those variables, only the variable duration as few missing data that are easily obtainable, so I will manually fetch those data by searching them on IMDB.

In [28]:
movies[is.na(movies$duration),movie_title]

In [29]:
movies[movie_title=="Star Wars: Episode VII - The Force Awakens",4]<-136
movies[movie_title=="Harry Potter and the Deathly Hallows: Part II",4]<-130
movies[movie_title=="Harry Potter and the Deathly Hallows: Part I",4]<-146
movies[movie_title=="Black Water Transit",4]<-100
movies[movie_title=="Should've Been Romeo",4]<-90
movies[movie_title=="Barfi",4]<-151
movies[movie_title=="Hum To Mohabbat Karega",4]<-87
movies[movie_title=="N-Secure",4]<-115
movies[movie_title=="Dil Jo Bhi Kahey...",4]<-144
movies[movie_title=="Wolf Creek",4]<-95
movies[movie_title=="Karachi se Lahore",4]<-143
movies[movie_title=="Destiny",4]<-105
movies[movie_title=="Romantic Schemer",4]<-85
movies[movie_title=="The Naked Ape",4]<-110
movies[movie_title=="War & Peace",4]<- 379

Now if we look back at the number of missing value of every variable in the data set, we see that the variable "duration" have none.

In [30]:
sapply(movies, function(y) sum(length(which(is.na(y)))))

Having missing values in the variables "num_critic_for_reviews","num_voted_users","movie_imdb_link","num_user_for_reviews", "imdb_score" and "movie_facebook_likes" is not really an issue, since I'm interested in predicting the box-office result of a movie before his production and those variables are metrics collected after the theater release of their respective movie. Also, some variable like "director_facebook_likes", "facenumber_in_poster", "title_year" and "actor_2_facebook_likes" have a relatively low number of missing values and the omission of those observations won't affect significantly the quality of our model. With that in mind, I'll focus my attention on the variables "gross" and "budget".

First, I need to find why the script that scrape the data from IMDB returned those empty fields. I looked at the IMDB page of some movies with missing data to try to find a pattern.

In [31]:
head(movies[is.na(movies$gross),movie_title],)

In [32]:
movies[movie_title=='Star Wars: Episode VII - The Force Awakens',movie_imdb_link]

In [33]:
movies[movies$movie_title=='Miami Vice', movie_imdb_link]

By looking at those three pages, we see the problem: IMDB list not only theatrical movie release, but also web series, tv show, direct to dvd release and since some entries have the same title some error have been injected in the data set. For example, instead of having data on the movie "Miami Vice", the script return data on an episode of the tv serie of the same name, where the fields "gross" ans "budget" are empty. In consequence, a missing value in those columns seems to be an indicator of a bad observation. 

I took a sample of 20 movies (about 3% of the observations with the value "gross" missing) to see if that hypothesis is correct.

In [34]:
print(paste0("Number of rows: ",nrow(subset(movies,is.na(movies$gross)&is.na(movies$budget)))))
head(subset(movies,is.na(movies$gross)&is.na(movies$budget),
            select=c("movie_title","director_name","actor_1_name","actor_2_name",
                     "actor_3_name","title_year","movie_imdb_link")), n=20)

[1] "Number of rows: 224"


movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,title_year,movie_imdb_link
StarWars: Episode VII - The Force Awakens,Doug Walker,Doug Walker,Rob Walker,,,http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1
TheLovers,Roland JoffÃ©,Tamsin Egerton,Alice Englert,Bipasha Basu,2015.0,http://www.imdb.com/title/tt1321869/?ref_=fn_tt_tt_1
GodzillaResurgence,Hideaki Anno,Mark Chinnery,Shin'ya Tsukamoto,Atsuko Maeda,2016.0,http://www.imdb.com/title/tt4262980/?ref_=fn_tt_tt_1
HarryPotter and the Deathly Hallows: Part II,Matt Birch,Rupert Grint,Dave Legeno,Ralph Ineson,2011.0,http://www.imdb.com/title/tt1680310/?ref_=fn_tt_tt_1
GodzillaResurgence,Hideaki Anno,Mark Chinnery,Shin'ya Tsukamoto,Atsuko Maeda,2016.0,http://www.imdb.com/title/tt4262980/?ref_=fn_tt_tt_1
HarryPotter and the Deathly Hallows: Part I,Matt Birch,Rupert Grint,Toby Jones,Alfred Enoch,2010.0,http://www.imdb.com/title/tt1571403/?ref_=fn_tt_tt_1
TheA-Team,,George Peppard,Dirk Benedict,Dwight Schultz,,http://www.imdb.com/title/tt0084967/?ref_=fn_tt_tt_1
"""10,000B.C. """,Christopher Barnard,Mathew Buck,,,,http://www.imdb.com/title/tt1869849/?ref_=fn_tt_tt_1
Ben-Hur,Timur Bekmambetov,Morgan Freeman,Ayelet Zurer,Moises Arias,2016.0,http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1
Hannibal,,Caroline Dhavernas,Scott Thompson,Hettienne Park,,http://www.imdb.com/title/tt2243973/?ref_=fn_tt_tt_1


In these data, 13 observations, on a total of 20, come from the wrong IMDB page, and six are movies whose IMDB page don't show that information and one is a duplicate. From that sample, I notice that data who were mistakenly taken from the IMDB page of a TV show doesn't have a value for the variable "director_name", since generally more than one director work on a TV show, and for the variable "title_year", since they often run for more than a season. That could indicate that missing value for those two variables and the variable "gross" and/or "budget" is a strong indicator of an odd observation. 

Let's look at some examples of observations were only "gross" values is missing to see if that hypothesis is correct.

In [35]:
print(paste0("Number of rows: ",nrow(subset(movies,is.na(movies$gross)&!is.na(movies$budget)))))
head(subset(movies,is.na(movies$gross)&!is.na(movies$budget),
            select=c("movie_title","director_name","actor_1_name","actor_2_name",
                     "actor_3_name","title_year","movie_imdb_link")), n=20)

[1] "Number of rows: 660"


movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,title_year,movie_imdb_link
MiamiVice,,Don Johnson,Philip Michael Thomas,John Diehl,,http://www.imdb.com/title/tt0086759/?ref_=fn_tt_tt_1
Asterixat the Olympic Games,FrÃ©dÃ©ric Forestier,Alain Delon,Santiago Segura,Vanessa Hessler,2008.0,http://www.imdb.com/title/tt0463872/?ref_=fn_tt_tt_1
Creepshow,George A. Romero,Ted Danson,Hal Holbrook,Adrienne Barbeau,1982.0,http://www.imdb.com/title/tt0083767/?ref_=fn_tt_tt_1
TopCat Begins,AndrÃ©s Couturier,Sariann Monaco,David Hoffman,Ben Diskin,2015.0,http://www.imdb.com/title/tt4057916/?ref_=fn_tt_tt_1
RedDawn,John Milius,Lea Thompson,Jennifer Grey,William Smith,1984.0,http://www.imdb.com/title/tt0087985/?ref_=fn_tt_tt_1
Xiyou ji zhi: Sun Wukong san da Baigu Jing,Pou-Soi Cheang,Li Gong,Aaron Kwok,Eddie Peng,2016.0,http://www.imdb.com/title/tt4591310/?ref_=fn_tt_tt_1
"""Sabrina,the Teenage Witch """,,Nate Richert,Soleil Moon Frye,Caroline Rhea,,http://www.imdb.com/title/tt0115341/?ref_=fn_tt_tt_1
StargateSG-1,,Christopher Judge,Don S. Davis,Gary Jones,,http://www.imdb.com/title/tt0118480/?ref_=fn_tt_tt_1
Lolita,Stanley Kubrick,James Mason,Shelley Winters,Lois Maxwell,1962.0,http://www.imdb.com/title/tt0056193/?ref_=fn_tt_tt_1
EyeSee You,Jim Gillespie,Sylvester Stallone,Tom Berenger,Charles S. Dutton,2002.0,http://www.imdb.com/title/tt0160184/?ref_=fn_tt_tt_1


Three observations out of 20 are from a TV show instead of the movie of the same name and they all have missing value in the column "director_name" and "title_year".

Now I look at the observations were only the "budget" values is missing to see if I should keep them in the data set.

In [36]:
print(paste0("Number of rows: ",nrow(subset(movies,!is.na(movies$gross)&is.na(movies$budget)))))
head(subset(movies,!is.na(movies$gross)&is.na(movies$budget),
            select=c("movie_title","director_name","actor_1_name","actor_2_name",
                     "actor_3_name","title_year","movie_imdb_link")), n=20)

[1] "Number of rows: 268"


movie_title,director_name,actor_1_name,actor_2_name,actor_3_name,title_year,movie_imdb_link
TheGood Dinosaur,Peter Sohn,A.J. Buckley,Jack McGraw,Peter Sohn,2015,http://www.imdb.com/title/tt1979388/?ref_=fn_tt_tt_1
Charlotte'sWeb,Gary Winick,Steve Buscemi,Julia Roberts,Oprah Winfrey,2006,http://www.imdb.com/title/tt0413895/?ref_=fn_tt_tt_1
DÃ©jÃ Vu,Henry Jaglom,Vanessa Redgrave,Stephen Dillane,Michael Brandon,1997,http://www.imdb.com/title/tt0119033/?ref_=fn_tt_tt_1
TheEdge,Lee Tamahori,Anthony Hopkins,Harold Perrineau,Bart the Bear,1997,http://www.imdb.com/title/tt0119051/?ref_=fn_tt_tt_1
Carriers,David Pastor,Christopher Meloni,Kiernan Shipka,Lou Taylor Pucci,2009,http://www.imdb.com/title/tt0806203/?ref_=fn_tt_tt_1
TheFinest Hours,Craig Gillespie,Michael Raymond-James,Abraham Benrubi,Graham McTavish,2016,http://www.imdb.com/title/tt2025690/?ref_=fn_tt_tt_1
Dinnerfor Schmucks,Jay Roach,Steve Carell,Stephanie Szostak,Bruce Greenwood,2010,http://www.imdb.com/title/tt0427152/?ref_=fn_tt_tt_1
WildHogs,Walt Becker,Jill Hennessy,Tichina Arnold,Drew Sidora,2007,http://www.imdb.com/title/tt0486946/?ref_=fn_tt_tt_1
Stateof Play,Kevin Macdonald,Robin Wright,Harry Lennix,Michael Weston,2009,http://www.imdb.com/title/tt0473705/?ref_=fn_tt_tt_1
Troublewith the Curve,Robert Lorenz,Clint Eastwood,Ed Lauter,Bob Gunton,2012,http://www.imdb.com/title/tt2083383/?ref_=fn_tt_tt_1


All the observations from that sample have the same value on their IMDB page, so I will keep those observations in the data set. In fact, this table shows me that the absence of information on the budget of a movie is not a good indicator of an aberrant observation. 

The last three tables comfort my opinion that observations with missing "gross","director_name" and "title_year" value won't contribute positively to the model, so I will delete them from the data set.

One more thing I saw while looking at the IMDB page of those movies, is that the value for the variables "actor_1_name", "actor_2_name" and "actor_3_name" are not necessarily goods indicators for the name of the starring actor of a movie. For example, the movie "The Edge" has Anthony Hopkins, Alec Baldwin, Elle Macpherson and Harold Perrineau as top billing actor, according to Wikipedia, while the data set list Anthony Hopkins, Harold Perrineau and "Bart the Bear" instead. I'm sure "Bart the Bear" did a good acting job in that movie, but I doubt that many people have bought a ticket to see him instead of Elle Macpherson or Alec Baldwin. I'll have to remember this while creating the predictive model.

All that being said, I have two things to do before moving forward: delete the duplicate in the table and delete the observations without a value for the variables "gross","director_name" and "title_year". 

In [37]:
print(paste0("Number of rows before: ",nrow(movies)))
movies<-unique(movies)
print(paste0("Number of rows after: ",nrow(movies)))

[1] "Number of rows before: 5043"
[1] "Number of rows after: 4998"


In [38]:
print(paste0("Number of rows before: ",nrow(movies)))
movies<-subset(movies,!is.na(gross)|!is.na(title_year)|director_name!="")
print(paste0("Number of rows after: ",nrow(movies)))

[1] "Number of rows before: 4998"
[1] "Number of rows after: 4898"


In [39]:
write.csv(movies,file ="movies_clean_na.csv",row.names=FALSE)

# Conclusion

  In this article, I made sure that the data set follow the principle of tidy data, I formatted the data and I started replace or delete the observations with missings values. There's still a lot of missing information for the variable "gross" and "budget" and since they are two of the most important variable of my model, I will spend some time to fetch them before moving to the exploration of the data. 
  
  In the next article (https://github.com/GTouzin/Portfolio/blob/master/R/IMDB/IMDB_data_cleaning2.ipynb), I'll explain how I got that missing information and I'll finish cleaning the data set. 
  
  Thanks for reading!