## Imports and version information

In [1]:
import sys
import numpy as np
import pandas as pd
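# Take the first whitespace-separated token so two-digit minor versions (e.g. 3.10.x) are not truncated
print("Python version {}".format(sys.version.split()[0]))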
print("Numpy version: {}".format(np.version.version))
print("Pandas version: {}".format(pd.__version__))

Python version 3.6.0
Numpy version: 1.12.1
Pandas version: 0.19.2


# IMDB dataset

The IMDB dataset, obtained from https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset, contains details for over 5,000 movies. These details include attributes such as actor names, director name, and genre, along with each movie's title and IMDB rating.

<hr>
The first five rows of the data, with their attributes (features), are shown below, followed by the names of all the columns in the CSV.

In [2]:
imdb = pd.read_csv("../data/movie_metadata.csv")
imdb.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [3]:
for i, col in enumerate(imdb.columns):
    print("{0:0>2}. {1}".format(i + 1, col))

01. color
02. director_name
03. num_critic_for_reviews
04. duration
05. director_facebook_likes
06. actor_3_facebook_likes
07. actor_2_name
08. actor_1_facebook_likes
09. gross
10. genres
11. actor_1_name
12. movie_title
13. num_voted_users
14. cast_total_facebook_likes
15. actor_3_name
16. facenumber_in_poster
17. plot_keywords
18. movie_imdb_link
19. num_user_for_reviews
20. language
21. country
22. content_rating
23. budget
24. title_year
25. actor_2_facebook_likes
26. imdb_score
27. aspect_ratio
28. movie_facebook_likes


There are a total of 28 features in the data set.
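The same count can be read from the frame's `shape` attribute instead of counting the printed list. The snippet below is a minimal sketch that uses a small, hypothetical stand-in frame (`sample`, not part of the original notebook) so it runs without the CSV; with the real data, `imdb.shape` would be used in its place.

```python
import pandas as pd

# Hypothetical stand-in for the IMDB frame; with the real data, use `imdb` instead.
sample = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski"],
    "duration": [178.0, 169.0],
    "imdb_score": [7.9, 7.1],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape
print("rows: {}, columns: {}".format(n_rows, n_cols))
```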

In [4]:
print("\n    {:30} {}".format("Column name", "Total null values"))
for i, col in enumerate(imdb.columns):
    print("{0:0>2}. {1:30} {2}".format(i + 1, col, imdb[col].isnull().sum()))


    Column name                    Total null values
01. color                          19
02. director_name                  104
03. num_critic_for_reviews         50
04. duration                       15
05. director_facebook_likes        104
06. actor_3_facebook_likes         23
07. actor_2_name                   13
08. actor_1_facebook_likes         7
09. gross                          884
10. genres                         0
11. actor_1_name                   7
12. movie_title                    0
13. num_voted_users                0
14. cast_total_facebook_likes      0
15. actor_3_name                   23
16. facenumber_in_poster           13
17. plot_keywords                  153
18. movie_imdb_link                0
19. num_user_for_reviews           21
20. language                       12
21. country                        5
22. content_rating                 303
23. budget                         492
24. title_year                     108
25. actor_2_facebook_likes         

As seen above, the data is incomplete: it contains <b>NaN</b> (null) values and inconsistencies that have to be removed or changed before we can work with it. The next notebook demonstrates the feature engineering steps taken to remove the nulls and inconsistencies.
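To judge which columns are most affected, the per-column null counts can also be sorted and expressed as a share of all rows. The following is a minimal sketch, not the notebook's own method: `null_summary` is a hypothetical helper, and `sample` is a small made-up frame standing in for `imdb` so the example runs without the CSV.

```python
import numpy as np
import pandas as pd


def null_summary(df):
    """Return per-column null counts and percentages, most incomplete first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)


# Made-up stand-in data; with the real data, call null_summary(imdb).
sample = pd.DataFrame({
    "color": ["Color", "Color", None, "Color"],
    "duration": [178.0, np.nan, np.nan, 164.0],
    "imdb_score": [7.9, 7.1, 7.1, 8.5],
})

summary = null_summary(sample)
print(summary)
```

Sorting by count puts the columns that need the most attention at the top, and the percentage column helps when deciding whether a column is sparse enough to drop rather than fill.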