In [1]:
import pandas as pd
import numpy as np

# Movie Revenue Predictions


Research questions:
1. To what extent can a film's revenue be predicted from its popularity on Facebook and on IMDB itself?
2. To what extent can the actors in a film be predicted from its writers and director?
3. To what extent can a film's genres be predicted from its plot keywords?
4. To what extent can the budget-to-profit ratio be predicted?

### Draft questions, to be removed.

Three draft research questions:
1. To what extent can a film's revenue be predicted from its popularity on Facebook and on IMDB itself? (min. requirements 4.?)
2. To what extent can a film's genres be predicted from its plot keywords? (min. requirements 5.?)
3. To what extent has the budget-to-revenue ratio changed since 1935? (min. requirements 4.)

External data (minimum requirements 1.)
4. To what extent can the actors in a film be predicted from its writers and director? (min. requirements 6.?)
5. To what extent can we predict when budget and revenue will reach their peak, based on the world population? (min. requirements 5.?)

## The Data Science process
For the first exploration we were asked to carry out the first four steps:
1. Data collection
2. Data processing (also known as data munging)
3. Data cleaning
4. Data exploration & analysis
5. Model building
6. Visualization
7. Communication


## 1. Data Collection
Data collection has partly been done for us: the dataset `movie.csv` was supplied. The assignment, however, asks us to
combine it with a third-party dataset. To find out which dataset is suitable for combining with `movie.csv`, we first
have to process, clean and explore the supplied dataset.

To check that `movie.csv` has been loaded correctly, the first five rows are shown:

In [2]:
df_movies = pd.read_csv('data/movie.csv')
df_movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


### External dataset
Our original dataset contains no data about the writers of these films. That data can, however, be reconstructed with
the help of other IMDB datasets, namely the following:

In [3]:
# These IMDb datasets were all originally named 'data.tsv'; they have been renamed as follows:
# title.crew.tsv.gz -> crew.tsv
df_crew = pd.read_csv("data/imdb/crew.tsv", sep="\t")
df_crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


In [4]:
# name.basics.tsv.gz -> names.tsv
df_names = pd.read_csv(r"data/imdb/names.tsv", sep="\t")
df_names.head()


Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0043044,tt0072308,tt0053137,tt0050419"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0117057,tt0037382,tt0038355,tt0071877"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0054452,tt0059956,tt0049189,tt0057345"
3,nm0000004,John Belushi,1949,1982,"actor,writer,soundtrack","tt0072562,tt0080455,tt0078723,tt0077975"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0069467,tt0083922,tt0050976,tt0050986"


In [5]:
# title.basics.tsv.gz -> titles.tsv
df_titles = pd.read_csv(r"data/imdb/titles.tsv", sep="\t")
df_titles.head()


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
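
The IMDb exports above use the literal string `\N` for missing values. As a minimal sketch, `pd.read_csv` can convert that marker to NaN at load time through its `na_values` parameter. The in-memory `sample_tsv` is an assumption for illustration: a three-row excerpt assembled from rows shown in this notebook, not the real file.

```python
import io

import pandas as pd

# Small excerpt in the layout of titles.tsv (rows taken from outputs in this notebook).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t1\n"
    "tt0000004\tshort\tUn bon bock\t1892\t\\N\n"
    "tt9684382\tmovie\tTaj Mahal\t\\N\t96\n"
)

# na_values turns the IMDb "\N" marker into NaN while parsing.
df_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Number of missing values per column.
print(df_sample.isna().sum())
```

Parsing the marker at load time would make a later `replace` step unnecessary, at the cost of numeric columns with gaps being read as floats.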


## 2. Data Processing
### 2a. Supplied dataset
This step, too, has largely been done for us. The data is stored cleanly in a `.csv` file and can be read directly
into a _Pandas_ DataFrame.

That leaves the following four steps:
1. Drop unwanted columns
2. Rename unclear column names
3. Reorder the columns
4. Adjust the data types

In [6]:
# Drop unwanted columns
df_movies.drop(["movie_imdb_link", "aspect_ratio"], axis=1, inplace=True)

# Rename unclear column names
df_movies.rename(columns={'color': 'Colour',
                          'director_name': 'Director',
                          'num_critic_for_reviews': 'Number of critics',
                          'duration': 'Duration',
                          'director_facebook_likes': 'Director FB likes',
                          'actor_3_facebook_likes': 'Actor 3 FB likes',
                          'actor_2_name': 'Actor 2 name',
                          'actor_1_facebook_likes': 'Actor 1 FB likes',
                          'gross': 'Gross',
                          'genres': 'Genres',
                          'actor_1_name': 'Actor 1 name',
                          'movie_title': 'Movie title',
                          'num_voted_users': 'Number of voted users',
                          'cast_total_facebook_likes': 'Total Cast FB likes',
                          'actor_3_name': 'Actor 3 name',
                          'facenumber_in_poster': 'Number of faces on poster',
                          'plot_keywords': 'Plot Keywords',
                          'num_user_for_reviews': 'Number of user reviews',
                          'language': 'Language',
                          'country': 'Country',
                          'content_rating': 'Age rating',
                          'budget': 'Budget',
                          'title_year': 'Release year',
                          'actor_2_facebook_likes': 'Actor 2 FB likes',
                          'imdb_score': 'IMDB Score',
                          'movie_facebook_likes': 'Movie FB likes'}, inplace=True)

# Reorder the columns
df_movies = df_movies[['Movie title',
                       'Release year',
                       'Director',
                       'Director FB likes',
                       'Movie FB likes',
                       'Gross',
                       'Budget',
                       'Duration',
                       'Language',
                       'Country',
                       'Colour',
                       'Genres',
                       'IMDB Score',
                       'Number of voted users',
                       'Number of critics',
                       'Number of user reviews',
                       'Age rating',
                       'Total Cast FB likes',
                       'Actor 1 name',
                       'Actor 2 name',
                       'Actor 3 name',
                       'Actor 1 FB likes',
                       'Actor 2 FB likes',
                       'Actor 3 FB likes',
                       'Plot Keywords',
                       'Number of faces on poster',
                       ]]

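# Adjust the data types: replace missing values with 0, then cast to int where
# possible (errors='ignore' leaves data that cannot be cast unchanged)
df_movies = df_movies.fillna(0.0).astype(int, errors='ignore')
# Parse the release year; a year of 0 (originally missing) cannot be parsed and becomes NaT
df_movies["Release year"] = pd.to_datetime(df_movies["Release year"], format='%Y', errors='coerce')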
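
Filling every column with 0 also writes 0 into text columns such as `Language` and `Country`, as row 4 of the processed DataFrame shows. A possible alternative, sketched here on a tiny made-up frame and not the approach used in this notebook, fills and casts only selected numeric columns and leaves missing text as NaN. The frame `df_demo` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up miniature of df_movies; column names follow the renamed columns.
df_demo = pd.DataFrame({
    "Movie title": ["Avatar", "Spectre", "Unknown film"],
    "Language": ["English", "English", np.nan],
    "Gross": [760505847.0, 200074175.0, np.nan],
    "Release year": [2009.0, 2015.0, np.nan],
})

# Fill only the numeric column (assumption: 0 is an acceptable placeholder
# there); text columns are not touched and keep their NaN.
df_demo = df_demo.fillna({"Gross": 0})

# Cast per column: plain int64 where no gaps remain, nullable Int64 where
# missing values should stay missing.
df_demo = df_demo.astype({"Gross": "int64", "Release year": "Int64"})

print(df_demo.dtypes)
```

Whether 0 is a sensible stand-in for an unknown gross or budget is a separate question; keeping those as missing values as well would avoid skewing averages later on.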

After step 2, Data Processing, the DataFrame looks like this:

In [7]:
df_movies.head()


Unnamed: 0,Movie title,Release year,Director,Director FB likes,Movie FB likes,Gross,Budget,Duration,Language,Country,...,Age rating,Total Cast FB likes,Actor 1 name,Actor 2 name,Actor 3 name,Actor 1 FB likes,Actor 2 FB likes,Actor 3 FB likes,Plot Keywords,Number of faces on poster
0,Avatar,2009-01-01,James Cameron,0,33000,760505847,237000000,178,English,USA,...,PG-13,4834,CCH Pounder,Joel David Moore,Wes Studi,1000,936,855,avatar|future|marine|native|paraplegic,0
1,Pirates of the Caribbean: At World's End,2007-01-01,Gore Verbinski,563,0,309404152,300000000,169,English,USA,...,PG-13,48350,Johnny Depp,Orlando Bloom,Jack Davenport,40000,5000,1000,goddess|marriage ceremony|marriage proposal|pi...,0
2,Spectre,2015-01-01,Sam Mendes,0,85000,200074175,245000000,148,English,UK,...,PG-13,11700,Christoph Waltz,Rory Kinnear,Stephanie Sigman,11000,393,161,bomb|espionage|sequel|spy|terrorist,1
3,The Dark Knight Rises,2012-01-01,Christopher Nolan,22000,164000,448130642,250000000,164,English,USA,...,PG-13,106759,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,27000,23000,23000,deception|imprisonment|lawlessness|police offi...,0
4,Star Wars: Episode VII - The Force Awakens ...,NaT,Doug Walker,131,0,0,0,0,0,0,...,0,143,Doug Walker,Rob Walker,0,131,12,0,0,0


### 2b. Extra datasets
From the extra datasets, a DataFrame has to be built that contains the following columns:  
- Movie title
- Director
- Writers
- Actor 1 name
- Actor 2 name
- Actor 3 name

This is done in a few steps:
1. Removing the columns that are not used
2. Merging `df_crew` and `df_titles` on `tconst`
3. Removing all rows that are not films
4. Adding the writers' names from `df_names`  
5. Adding the 'Actor # name' columns from `df_movies`
6. Adding the directors' names


In [8]:
# Removing unused columns
df_titles.drop(["isAdult", "endYear", "genres"], axis=1, inplace=True)

In [9]:
# Joining df_crew and df_titles
df_writers = pd.merge(df_crew, df_titles, on="tconst")

In [10]:
# Dropping all non-movies
df_writers.drop(df_writers[df_writers["titleType"] != "movie"].index, inplace=True)

In [11]:
# Adding names of writers to `df_writers`


In [12]:
# Adding "Actor # name" 1 through 3 to `df_writers`


Here's a sample of the current version of `df_writers`:

In [13]:
df_writers.sample(5)

Unnamed: 0,tconst,directors,writers,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes
6041532,tt9684382,nm10437548,\N,movie,Taj Mahal,Taj Mahal,\N,96
4199569,tt5722116,nm0959774,\N,movie,Rammstein: Paris,Rammstein: Paris,2016,98
2073636,tt1679620,nm3961281,nm3961281,movie,Blood Ties,Magkakapatid,2010,87
1014403,tt1007950,nm0455767,"nm0996527,nm1563091",movie,Diminished Capacity,Diminished Capacity,2008,92
3678338,tt4553438,"nm7219105,nm7219104,nm7191113,nm7219103",\N,movie,Crossovers,Crossovers,2012,\N


## 3. Data Cleaning
### 3a. Supplied dataset
1. Remove NaN values
2. Remove duplicate movie titles
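
Removing duplicate titles could be done with `drop_duplicates`. The sketch below runs on a made-up frame (`df_demo` and its film names are placeholders) and first counts the repeated titles, so the effect of the step is visible before any rows are removed.

```python
import pandas as pd

# Made-up rows; the column names follow the renamed df_movies columns.
df_demo = pd.DataFrame({
    "Movie title": ["Film A", "Film B", "Film A", "Film C"],
    "Release year": [2001, 2005, 2001, 2010],
})

# Count rows whose title already appeared in an earlier row.
n_duplicates = df_demo.duplicated(subset="Movie title").sum()

# Keep the first occurrence of each title.
df_unique = df_demo.drop_duplicates(subset="Movie title", keep="first")

print(n_duplicates, len(df_unique))
```

Distinct films can share a title, so adding `Release year` to `subset` may be the safer choice; that is a judgment call for this dataset.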

In [14]:
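# After the earlier fillna(0.0), only rows with an unparseable release year (NaT) still hold a missing value
df_movies.dropna(inplace=True)

# TODO: remove duplicate titles (not implemented yet)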

After step 3, Data Cleaning, the DataFrame looks like this:

In [15]:
df_movies.head()

Unnamed: 0,Movie title,Release year,Director,Director FB likes,Movie FB likes,Gross,Budget,Duration,Language,Country,...,Age rating,Total Cast FB likes,Actor 1 name,Actor 2 name,Actor 3 name,Actor 1 FB likes,Actor 2 FB likes,Actor 3 FB likes,Plot Keywords,Number of faces on poster
0,Avatar,2009-01-01,James Cameron,0,33000,760505847,237000000,178,English,USA,...,PG-13,4834,CCH Pounder,Joel David Moore,Wes Studi,1000,936,855,avatar|future|marine|native|paraplegic,0
1,Pirates of the Caribbean: At World's End,2007-01-01,Gore Verbinski,563,0,309404152,300000000,169,English,USA,...,PG-13,48350,Johnny Depp,Orlando Bloom,Jack Davenport,40000,5000,1000,goddess|marriage ceremony|marriage proposal|pi...,0
2,Spectre,2015-01-01,Sam Mendes,0,85000,200074175,245000000,148,English,UK,...,PG-13,11700,Christoph Waltz,Rory Kinnear,Stephanie Sigman,11000,393,161,bomb|espionage|sequel|spy|terrorist,1
3,The Dark Knight Rises,2012-01-01,Christopher Nolan,22000,164000,448130642,250000000,164,English,USA,...,PG-13,106759,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,27000,23000,23000,deception|imprisonment|lawlessness|police offi...,0
5,John Carter,2012-01-01,Andrew Stanton,475,24000,73058679,263700000,132,English,USA,...,PG-13,1873,Daryl Sabara,Samantha Morton,Polly Walker,640,632,530,alien|american civil war|male nipple|mars|prin...,1


### 3b. Extra datasets
1. Convert "\\N" values to NaN
2. Remove NaN values
3. Remove duplicate movie titles


In [16]:
# Replacing "\N" with NaN
df_writers.replace("\\N", np.nan, inplace=True)
# Dropping all rows with NaN-types
df_writers.dropna(inplace=True)

After step 3, Data Cleaning, `df_writers` looks like this:

In [17]:
df_writers.sample(5)

Unnamed: 0,tconst,directors,writers,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes
2392242,tt2012110,nm0082154,nm0082154,movie,Being Venice,Being Venice,2012,89
105763,tt0108099,nm0202578,nm0944058,movie,Shadow Force,Shadow Force,1992,80
277867,tt0289760,nm0620638,nm0883570,movie,Pistolero a sueldo,Pistolero a sueldo,1989,97
22582,tt0022953,nm0397227,"nm0340719,nm0726588",movie,The Golden West,The Golden West,1932,74
2598077,tt2234564,nm0594186,"nm2757832,nm0594186",movie,Thomas and Friends: Thomas's Snowy Surprise,Thomas and Friends: Thomas's Snowy Surprise,2003,50


## 4. Data Exploration & Analysis

In [18]:
df_movies["Director"].value_counts()

Steven Spielberg    26
Woody Allen         22
Clint Eastwood      20
Martin Scorsese     20
Ridley Scott        17
                    ..
Daniel Schechter     1
Anthony Powell       1
William Wyler        1
Matt Johnson         1
Masayuki Ochiai      1
Name: Director, Length: 2395, dtype: int64

In [19]:
df_movies["Actor 1 name"].value_counts()

Robert De Niro     49
Johnny Depp        41
Nicolas Cage       33
J.K. Simmons       31
Bruce Willis       30
                   ..
Jennifer Hale       1
Todd Stashwick      1
Phil Davis          1
Bebe Neuwirth       1
Mary Kate Wiles     1
Name: Actor 1 name, Length: 2044, dtype: int64

In [20]:
df_movies["Genres"].sample(10)

3770                                         Documentary
2948                                         Crime|Drama
3410                              Drama|Romance|Thriller
2773                                 Drama|Music|Romance
2555                               Drama|Horror|Thriller
90              Adventure|Animation|Comedy|Family|Sci-Fi
2122                              Drama|Mystery|Thriller
162     Action|Adventure|Animation|Comedy|Family|Fantasy
456                        Comedy|Family|Fantasy|Romance
32                               Action|Adventure|Sci-Fi
Name: Genres, dtype: object
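
`Genres` stores several genres in one pipe-separated string, so counting or predicting individual genres (research question 3) needs them split first. A minimal sketch, using three genre strings copied from the sample above, turns them into one 0/1 column per genre with `Series.str.get_dummies`:

```python
import pandas as pd

# Three genre strings copied from the sample output above.
genres = pd.Series(
    ["Crime|Drama", "Drama|Romance|Thriller", "Action|Adventure|Sci-Fi"],
    name="Genres",
)

# One 0/1 indicator column per distinct genre.
genre_flags = genres.str.get_dummies(sep="|")

# Number of films per genre in this tiny sample.
print(genre_flags.sum().sort_values(ascending=False))
```

The same call could be applied to `Plot Keywords`, which uses the same separator, although the number of distinct keywords is likely to be much larger.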

In [21]:
# df_movies.describe(include = "all")

In [22]:
# df_movies.count()

In [23]:
# df_movies.dtypes

In [24]:
# df_movies.set_index("Movie title", inplace=True)
# df_movies.sort_index(inplace=True)