# Preliminaries

In [1]:
import pandas as pd
%matplotlib inline

First we import the file `imdb5000.csv`.

> **Note:** The movie titles end with the Unicode character `\xa0` (a non-breaking space), which we would like to remove. To do that, read the file with the argument `encoding='utf-8-sig'` and then replace the character with an empty string (explained [here][1]).

[1]: https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string "reindex"

In [2]:
imdb = pd.read_csv('imdb5000.csv', 
                   index_col='movie_title', 
                   encoding='utf-8-sig')
imdb.set_index(imdb.index.str.replace('\xa0', ''), 
               inplace=True)
print("The shape of the raw data is:", imdb.shape)
imdb.head()

The shape of the raw data is: (5043, 27)


movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [3]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5043 entries, Avatar to My Date with Drew
Data columns (total 27 columns):
color                        5024 non-null object
director_name                4939 non-null object
num_critic_for_reviews       4993 non-null float64
duration                     5028 non-null float64
director_facebook_likes      4939 non-null float64
actor_3_facebook_likes       5020 non-null float64
actor_2_name                 5030 non-null object
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
actor_1_name                 5036 non-null object
num_voted_users              5043 non-null int64
cast_total_facebook_likes    5043 non-null int64
actor_3_name                 5020 non-null object
facenumber_in_poster         5030 non-null float64
plot_keywords                4890 non-null object
movie_imdb_link              5043 non-null object
num_user_for_reviews     

# Data inspection

Familiarize yourself with the data by answering questions such as: What do the columns mean? What are their types? What values do they take, and how are those values distributed?

> **Note:** This part has no strict flow; it simply encourages you to take a thorough look at the data before moving on.

# Data cleaning

Based on our preliminary inspection, we decide to drop movies with more than 4 `NaN` values.
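
One possible way to express this rule is `DataFrame.dropna` with its `thresh` argument. The sketch below runs on a small hypothetical frame (`toy`, with made-up column names) rather than on `imdb`, so the effect is easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `imdb`: 6 columns, rows with 0, 4 and 5 NaNs.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [2.0, np.nan, np.nan],
        "c": [3.0, np.nan, np.nan],
        "d": [4.0, np.nan, np.nan],
        "e": [5.0, 5.0, np.nan],
        "f": [6.0, 6.0, 6.0],
    },
    index=["complete", "four_missing", "five_missing"],
)

max_missing = 4
# `thresh` is the minimum number of non-missing values a row needs to be kept,
# so "at most 4 NaNs" becomes "at least n_columns - 4 non-NaN values".
cleaned = toy.dropna(thresh=toy.shape[1] - max_missing)
print(cleaned.index.tolist())
```

Only the row with five missing values is dropped; the row with exactly four survives, which matches "more than 4".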

We decide that some columns are not important for our analysis, and we remove every column whose name contains 'facebook' or 'num'.

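> **Note:** The `r` prefix marks a raw string literal, in which backslashes are not treated as escape characters; this makes it the conventional way to write regular-expression patterns in Python. More about it can be found [here](https://docs.python.org/3.7/library/re.html).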
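
A minimal sketch of the column filter, assuming a regular expression with alternation is an acceptable way to match both words. It uses an empty hypothetical frame (`toy`) that borrows a few column names from the `info()` listing above.

```python
import pandas as pd

# Hypothetical frame with no rows; only the column names matter here.
toy = pd.DataFrame(
    columns=[
        "director_name",
        "director_facebook_likes",
        "num_voted_users",
        "facenumber_in_poster",
        "gross",
    ]
)

# Boolean mask: True for every column whose name contains 'facebook' or 'num'.
unwanted = toy.columns.str.contains(r"facebook|num")
trimmed = toy.loc[:, ~unwanted]
print(trimmed.columns.tolist())
```

Note that `facenumber_in_poster` also contains the substring `num`, so a plain substring match removes it as well. Whether that is intended is a judgment call the exercise leaves open.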

Remove duplicates from the data. Be careful about what you consider a duplicate.
Idea: assume that a director can have only one movie in a single year.
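
The difference between exact duplicates and duplicates under the director-and-year idea can be sketched as follows. The frame and its values are hypothetical; the two first rows imitate the same movie scraped twice with a slightly different vote count.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "director_name": ["Director A", "Director A", "Director B", "Director B"],
        "title_year": [2009.0, 2009.0, 2015.0, 2012.0],
        "num_voted_users": [100, 105, 50, 70],
    },
    index=pd.Index(["Movie X", "Movie X", "Movie Y", "Movie Z"], name="movie_title"),
)

# Comparing whole rows finds nothing, because the vote counts differ.
n_exact = int(toy.duplicated().sum())

# Comparing only director and year flags the second "Movie X" row.
is_repeat = toy.duplicated(subset=["director_name", "title_year"], keep="first")
deduped = toy[~is_repeat]
print(n_exact, len(deduped))
```

This is a heuristic, not a guarantee: a director who really released two movies in one year would lose one of them, and rows with a missing director and the same year are treated as duplicates of each other.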

# Warm-up questions

How many languages are represented?

What is the oldest movie?
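
The two questions above map fairly directly onto `Series.nunique` and `Series.idxmin`. The sketch assumes the titles are the index, as in the loading cell, and uses hypothetical rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "language": ["English", "French", "English", np.nan],
        "title_year": [2001.0, 1936.0, np.nan, 1975.0],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D"], name="movie_title"),
)

# nunique() ignores missing values by default.
n_languages = toy["language"].nunique()

# idxmin() returns the index label (here the title) of the smallest year,
# skipping NaN; on ties it returns the first occurrence only.
oldest_title = toy["title_year"].idxmin()
oldest_year = toy["title_year"].min()
print(n_languages, oldest_title, oldest_year)
```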

Who is the director with the highest average IMDb score?
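
A group-wise mean is one natural way to approach this. The sketch below uses hypothetical directors and scores.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "director_name": ["Director A", "Director A", "Director B", "Director C"],
        "imdb_score": [8.0, 6.0, 7.5, 9.0],
    }
)

# Average score per director, highest first.
mean_score = (
    toy.groupby("director_name")["imdb_score"].mean().sort_values(ascending=False)
)
best_director = mean_score.idxmax()
print(best_director, mean_score[best_director])
```

In the toy data the winner has a single movie, which hints at why directors with only one movie in the data can dominate such a ranking.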

How many unique actors are represented? Consider all three columns.
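
To consider all three columns at once, they can be reshaped into one long column first. This sketch uses `DataFrame.melt` on hypothetical names; the helper column name `actor` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

actor_cols = ["actor_1_name", "actor_2_name", "actor_3_name"]
toy = pd.DataFrame(
    {
        "actor_1_name": ["Actor A", "Actor B", "Actor A"],
        "actor_2_name": ["Actor B", "Actor C", np.nan],
        "actor_3_name": ["Actor D", "Actor A", "Actor C"],
    }
)

# melt() stacks the three columns into a single 'actor' column.
all_actors = toy[actor_cols].melt(value_name="actor")["actor"]

# nunique() skips the missing entry.
n_unique_actors = all_actors.nunique()
print(n_unique_actors)
```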

To avoid analyses based on esoteric movies, we decide to drop every movie whose director has no other movie in the data. In other words, we remove all directors who appear only once in the data. Then we repeat the previous question.
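
One way to sketch this filter is to map each row to the number of movies its director has and keep rows where that number exceeds one. The data below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "director_name": [
            "Director A", "Director A", "Director B", "Director C", "Director C",
        ],
        "imdb_score": [8.0, 6.0, 7.5, 9.0, 5.0],
    }
)

# Number of movies per director, then looked up for every row.
counts = toy["director_name"].value_counts()
movies_by_same_director = toy["director_name"].map(counts)

# Rows with a missing director map to NaN, and NaN > 1 is False,
# so such rows would be dropped as well.
repeat_directors = toy[movies_by_same_director > 1]
print(repeat_directors["director_name"].unique().tolist())
```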

Finally, we wish to add a column called `profits`, computed as $gross - budget$, and answer:
* What is the most profitable movie and what is the biggest failure?
* Who is the most profitable director?
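
The profit questions could be sketched as below on hypothetical numbers. "Most profitable director" is read here as the highest total profit, which is an assumption; the mean would be an equally defensible reading.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "director_name": ["Director A", "Director A", "Director B", "Director B"],
        "gross": [300.0, 50.0, 120.0, np.nan],
        "budget": [100.0, 80.0, 200.0, 10.0],
    },
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D"], name="movie_title"),
)

# A missing gross or budget yields a NaN profit for that movie.
toy["profits"] = toy["gross"] - toy["budget"]

most_profitable = toy["profits"].idxmax()
biggest_failure = toy["profits"].idxmin()

# Assumption: rank directors by total profit; sum() skips NaN profits.
profit_by_director = toy.groupby("director_name")["profits"].sum()
top_director = profit_by_director.idxmax()
print(most_profitable, biggest_failure, top_director)
```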

# Visualizations

What was the median budget spent on a production every year? Why is the median more informative than the average?
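
A small hypothetical example illustrates both the computation and the reason the median is preferred: a single very expensive production pulls the mean far away from the typical budget.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title_year": [2000, 2000, 2000, 2001, 2001],
        "budget": [1.0, 2.0, 300.0, 4.0, 6.0],
    }
)

# Median and mean budget per year, side by side.
per_year = toy.groupby("title_year")["budget"].agg(["median", "mean"])
print(per_year)
```

For the year 2000 the median is 2.0 while the mean is 101.0, driven entirely by one outlier. In the notebook, `per_year["median"].plot()` would draw the yearly series.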

How many movies were released every year? Repeat the last question, but this time separate the graph into two plots - one for color movies and one for B/W movies.
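
The counts behind both plots can be sketched with `groupby(...).size()`. The rows are hypothetical, and the label used for black-and-white movies is an assumption; the actual values in the `color` column should be checked first.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title_year": [1950, 1950, 1951, 1951, 1951],
        # "Black and White" is an assumed label, not verified against the data.
        "color": ["Black and White", "Color", "Color", "Color", "Black and White"],
    }
)

# Number of movies per year.
movies_per_year = toy.groupby("title_year").size()

# Same count split by color: one column per color value, 0 where absent.
per_year_by_color = (
    toy.groupby(["title_year", "color"]).size().unstack(fill_value=0)
)
print(per_year_by_color)
```

In the notebook, `movies_per_year.plot()` gives the single graph, and `per_year_by_color.plot(subplots=True)` draws one panel per column.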

How many movies are there of each genre?

> **Note:** A movie can belong to multiple genres, so the total genre count is higher than the number of movies.
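
Since the `genres` column stores several genres separated by `|`, one possible approach is to split each string and give every genre its own row before counting. The genre strings below are taken from the sample output above; the titles are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {"genres": ["Action|Adventure|Fantasy", "Action|Thriller", "Documentary"]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="movie_title"),
)

# Split on the literal '|' and expand each list element into its own row.
genre_counts = toy["genres"].str.split("|").explode().value_counts()
print(genre_counts)
```

The counts add up to 6 for 3 movies, which is the effect the note describes. `genre_counts.plot(kind="bar")` would visualize them in the notebook.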

# The `directors` dataset

We wish to create a dataset about the directors with the following columns:
* Median duration
* Movie count
* Movies per year
* Average profit
* Main language
* Main genre
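
The list above can be assembled from per-director aggregations, as in the following sketch on hypothetical data. Two readings are assumptions: "movies per year" is taken as the movie count divided by the span of years in which the director appears (inclusive), and "main" means most frequent. The helper `most_common` is introduced only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "director_name": ["Director A"] * 3 + ["Director B"] * 2,
        "duration": [100.0, 120.0, 140.0, 90.0, 95.0],
        "title_year": [2000, 2002, 2004, 2010, 2010],
        "profits": [10.0, -4.0, 6.0, 1.0, 3.0],
        "language": ["English", "English", "French", "French", "French"],
        "genres": ["Action|Drama", "Drama", "Drama|Comedy", "Comedy", "Comedy|Action"],
    }
)


def most_common(values):
    """Most frequent non-missing value; raises if the group has none."""
    return values.value_counts().idxmax()


grouped = toy.groupby("director_name")
directors = pd.DataFrame(
    {
        "median_duration": grouped["duration"].median(),
        "movie_count": grouped.size(),
        "average_profit": grouped["profits"].mean(),
        "main_language": grouped["language"].agg(most_common),
    }
)

# Assumption: active span = last year - first year + 1.
active_years = grouped["title_year"].max() - grouped["title_year"].min() + 1
directors["movies_per_year"] = directors["movie_count"] / active_years

# One row per (movie, genre) so that each genre is counted separately.
genre_rows = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")
directors["main_genre"] = genre_rows.groupby("director_name")["genre"].agg(most_common)
print(directors)
```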

# More questions

Which actor participated in the highest number of movies? Consider all three columns.
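
A minimal sketch, again on hypothetical names: stack the three actor columns into one and count how often each name appears. It assumes an actor is listed at most once per movie.

```python
import numpy as np
import pandas as pd

actor_cols = ["actor_1_name", "actor_2_name", "actor_3_name"]
toy = pd.DataFrame(
    {
        "actor_1_name": ["Actor A", "Actor B", "Actor A"],
        "actor_2_name": ["Actor B", "Actor C", np.nan],
        "actor_3_name": ["Actor D", "Actor A", "Actor C"],
    }
)

# value_counts() ignores missing names and sorts by frequency, highest first.
appearances = toy[actor_cols].melt(value_name="actor")["actor"].value_counts()
top_actor = appearances.idxmax()
print(top_actor, appearances[top_actor])
```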

Which pairs of director and actor like to work together? Sort the pairs by their number of co-occurrences.
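
One way to count such pairs is to reshape the data to one row per (director, actor) appearance and then group by both. The sketch uses hypothetical names and treats the three actor columns equally.

```python
import pandas as pd

actor_cols = ["actor_1_name", "actor_2_name", "actor_3_name"]
toy = pd.DataFrame(
    {
        "director_name": ["Director A", "Director A", "Director B"],
        "actor_1_name": ["Actor A", "Actor A", "Actor B"],
        "actor_2_name": ["Actor B", "Actor C", "Actor C"],
        "actor_3_name": ["Actor C", "Actor D", None],
    }
)

# One row per (director, actor) appearance; rows without an actor are dropped.
pairs = toy.melt(
    id_vars="director_name", value_vars=actor_cols, value_name="actor"
).dropna(subset=["actor"])

# Number of shared movies per pair, most frequent pairs first.
pair_counts = (
    pairs.groupby(["director_name", "actor"]).size().sort_values(ascending=False)
)
print(pair_counts.head())
```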