In [129]:
import imdb
import pandas as pd
import csv

In [130]:
fetcher = imdb.IMDb() # How we fetch the data

In [45]:
top = fetcher.get_top250_movies() # Getting the 250 movie objects that are the highest ranked on the website

In [112]:
# Pull the ID, title, and year out of each of the 250 movie objects
top_IDs = [movie.getID() for movie in top]
top_titles = [movie['title'] for movie in top]
top_years = [movie['year'] for movie in top]

### Why is this next chunk (defining get_top_billing) necessary?
- The IMDb movie objects stored in the *top* variable are **not** the same as the movie objects retrieved by calling the *get_movie()* function. The information in each movie object from the *get_top250_movies()* function is severely limited.
- Information in a movie object is accessed by key, e.g.
        movie_obj['cast']
        
  which should return a long list of Person objects, each containing a person's name and other information
- But the keys available on objects fetched with the *get_top250_movies()* function are limited to the following:
        ['rating',
         'title',
         'year',
         'votes',
         'top 250 rank',
         'kind',
         'canonical title',
         'long imdb title',
         'long imdb canonical title',
         'smart canonical title',
         'smart long imdb canonical title']
         
- While the keys that are available to movies fetched by the *get_movie()* function are much more extensive:
        ['cast',
         'genres',
         'runtimes',
         'countries',
         'country codes',
         'language codes',
         'color info',
         'aspect ratio',
         'sound mix',
         'box office',
         'certificates',
         'original air date',
         'rating',
         'votes',
         'cover url',
         'plot outline',
         'languages',
         'title',
         'year',
         'kind',
         'directors',
         'writers']
           ... etc
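
The gap between the two key lists can also be checked in code rather than by eye. The following is a minimal sketch: `missing_keys` is a hypothetical helper, and the two dictionaries are made-up stand-ins for a summary movie object and a fully fetched one, so the block runs without contacting IMDb.

```python
def missing_keys(summary_movie, full_movie):
    """Return the keys present in full_movie but absent from summary_movie."""
    return sorted(set(full_movie.keys()) - set(summary_movie.keys()))


# Stand-in dictionaries (hypothetical data, not fetched from IMDb).
summary_movie = {"title": "Example Film", "year": 2000, "rating": 8.5}
full_movie = {
    "title": "Example Film",
    "year": 2000,
    "rating": 8.5,
    "cast": ["Actor A", "Actor B"],
    "directors": ["Director A"],
}

print(missing_keys(summary_movie, full_movie))
```

With real data, the same helper should work on any pair of objects that expose `keys()`, for example an element of *top* and the result of *get_movie()* for the same ID.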

I only want to work with a few items in this list, but director and cast are important to include. Looking up each of the top 250 films and fetching it as a full Movie object by hand would take a considerable amount of time, so I am defining a function that pulls only the five top-billed cast members of each film.

In [155]:
# Function: get_top_billing(movies, size=5)
# Needed to grab the Person objects using the ID numbers of the 250 movies,
# because the 'cast' key is not available on objects from get_top250_movies()
def get_top_billing(movies, size=5):
    """Get the `size` highest-billed cast members (5 by default).

    Intended for the movie objects returned by get_top250_movies(), passed
    either one at a time or as a list; each movie is re-fetched by ID.
    """
    if isinstance(movies, list):
        return [fetcher.get_movie(movie.getID())['cast'][:size] for movie in movies]
    return fetcher.get_movie(movies.getID())['cast'][:size]
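
A variation on the same idea, sketched under a few assumptions: the fetch step is passed in as a callable so the logic can be tried offline, and a movie with no `'cast'` entry yields an empty list instead of raising an error. The names `top_billed_names`, `FAKE_DB`, and `fake_fetch` are hypothetical, and the sketch assumes the movie object supports dict-style `.get()`.

```python
def top_billed_names(movie_ids, fetch_movie, size=5):
    """Return up to `size` cast names for each movie ID.

    fetch_movie is any callable that maps an ID to a dict-like movie.
    """
    names = []
    for movie_id in movie_ids:
        movie = fetch_movie(movie_id)
        cast = movie.get("cast", [])  # tolerate movies with no cast listed
        names.append([person["name"] for person in cast[:size]])
    return names


# Hypothetical stub standing in for a real fetch, so the sketch runs offline.
FAKE_DB = {
    "001": {"cast": [{"name": "A"}, {"name": "B"}, {"name": "C"}]},
    "002": {},
}


def fake_fetch(movie_id):
    return FAKE_DB[movie_id]


result = top_billed_names(["001", "002"], fake_fetch, size=2)
print(result)
```

In the notebook, `fetcher.get_movie` would take the place of `fake_fetch`, and the IDs would come from `getID()` on each top-250 object.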


## Warning: The following line will take several minutes to load

In [157]:
## NOTE: This line of code takes a considerable amount of time to run
top_cast = get_top_billing(top) ## Already ran this line of code
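
Because the lookup is slow, it can help to save its result to disk and reload it on later runs. The following is only a sketch of that idea: `load_or_compute` is a hypothetical helper, the stand-in `slow_lookup` returns made-up names, and the data is assumed to be JSON-serializable (plain names rather than Person objects).

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Return cached JSON data from path, or compute it and save it first."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = compute()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


calls = []


def slow_lookup():
    # Stand-in for the slow step; records how many times it actually ran.
    calls.append(1)
    return [["Actor A", "Actor B"], ["Actor C"]]


cache_path = os.path.join(tempfile.mkdtemp(), "top_cast_names.json")
first = load_or_compute(cache_path, slow_lookup)
second = load_or_compute(cache_path, slow_lookup)  # served from the cache file
print(first == second, len(calls))
```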

Now I take *top_cast*, which is a list of lists, and extract the name from each Person object. The names are stored in a new list of lists, which will then be transferred to a data frame using pandas.

In [163]:
# Keep only the name of each Person object, one inner list per movie
top_cast_name = [[person['name'] for person in cast] for cast in top_cast]
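
If a film lists fewer cast members than expected, the rows end up with different lengths. One possible way to keep the table rectangular is to pad short rows, sketched below with a hypothetical helper `names_table` and made-up stand-in data in place of Person objects.

```python
def names_table(casts, width=5, fill=None):
    """Return one row of exactly `width` names per cast, padding with `fill`."""
    rows = []
    for cast in casts:
        names = [person["name"] for person in cast[:width]]
        names += [fill] * (width - len(names))  # pad short rows
        rows.append(names)
    return rows


# Stand-in data: dictionaries in place of Person objects.
casts = [
    [{"name": "A"}, {"name": "B"}, {"name": "C"}],
    [{"name": "D"}],
]
table = names_table(casts, width=3)
print(table)
```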

In [123]:
# Temporary dataframe to easily concatenate the two dataframes later
cast_df = pd.DataFrame(top_cast_name)

In [124]:
# Adding proper titles to the data previously retrieved via loops at the top of this document
top250 = pd.DataFrame({"IDs" : top_IDs, "Titles" : top_titles, "Year": top_years})

In [125]:
# Join the ID/title/year columns with the cast columns side by side
top250 = pd.concat([top250, cast_df], axis=1)

In [126]:
top250.columns = ["ID", "Title", "Year", "Star1", "Star2", "Star3", "Star4", "Star5"]

In [127]:
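with open("top250.csv", "w", newline = "") as f:
    writer = csv.writer(f)
    # Iterating over a DataFrame yields only its column names,
    # so write the header and the data rows explicitly
    writer.writerow(top250.columns)
    writer.writerows(top250.values.tolist())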
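
`csv.writer.writerows` expects an iterable of rows, where each row is itself a sequence of fields. A minimal sketch of that shape, using made-up rows and an in-memory buffer instead of a file:

```python
import csv
import io

# Hypothetical header and rows, shaped like the first columns of the table.
header = ["ID", "Title", "Year"]
rows = [
    ["0000001", "Example Film", 2000],
    ["0000002", "Another Film", 2001],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)   # one row: the column names
writer.writerows(rows)    # many rows: one list per movie

lines = buffer.getvalue().splitlines()
print(lines)
```

When the data already lives in a pandas data frame, `DataFrame.to_csv` is an alternative that writes the header and rows in one call.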

### What If I Wanted Every Key Included in the get_movie() Function?
- The full list of data includes a lot of information that may contribute to a movie's high ranking in ways that are somewhat subconscious, such as the composer or the casting director
- But surely that is a huge amount of information to store, so let's do some EDA on a couple of samples

Because it is one of my favorite films, we will retrieve the ID of Iron Man and see the data types and contents of each key.

In [138]:
ironman_ID = fetcher.search_movie("Iron Man")[0].getID()
print(ironman_ID)

0371746


In [141]:
ironman_obj = fetcher.get_movie(ironman_ID)

In [143]:
len(ironman_obj.keys())

64

There are 64 pieces of information that we have for this particular film. They are quite varied in their contents.

In [145]:
print(ironman_obj.keys())

['cast', 'genres', 'runtimes', 'countries', 'country codes', 'language codes', 'color info', 'aspect ratio', 'sound mix', 'box office', 'certificates', 'original air date', 'rating', 'votes', 'cover url', 'plot outline', 'languages', 'title', 'year', 'kind', 'directors', 'writers', 'producers', 'composers', 'cinematographers', 'editors', 'editorial department', 'casting directors', 'production designers', 'art directors', 'set decorators', 'costume designers', 'make up department', 'production managers', 'assistant directors', 'art department', 'sound department', 'special effects', 'visual effects', 'stunts', 'camera department', 'animation department', 'casting department', 'costume departmen', 'location management', 'music department', 'transportation department', 'miscellaneous', 'thanks', 'akas', 'writer', 'director', 'production companies', 'distributors', 'special effects companies', 'other companies', 'plot', 'synopsis', 'canonical title', 'long imdb title', 'long imdb canonica

In [154]:
print("Number of Artists in Art Department: ", len(ironman_obj['art department']))
print(ironman_obj['art department'][:10])

Number of Artists in Art Department:  218
[<Person id:0007572[http] name:_Ruben Abarca_>, <Person id:3170375[http] name:_Joe Acosta_>, <Person id:3011924[http] name:_Eddie Acuña_>, <Person id:0036762[http] name:_Laurie Arnow-Epstein_>, <Person id:3170631[http] name:_Tony W. Austin_>, <Person id:1129581[http] name:_Ernie Avila_>, <Person id:3171344[http] name:_Chuck M. Beaver_>, <Person id:0080543[http] name:_Mark Bialuski_>, <Person id:3170921[http] name:_Brandon Birrer_>, <Person id:0083766[http] name:_Kelly Birrer_>]


It is quite clear that there is entirely too much data to draw any conclusions by hand.
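
One way to get an overview without reading every key by hand is to tabulate the type and size of each value. The sketch below uses a hypothetical helper `summarize_keys` and a made-up dictionary in place of a real movie object; it assumes only that the object supports `keys()` and key lookup.

```python
def summarize_keys(movie):
    """Return (key, type name, size) per key; size is None for unsized values."""
    summary = []
    for key in movie.keys():
        value = movie[key]
        size = len(value) if hasattr(value, "__len__") else None
        summary.append((key, type(value).__name__, size))
    return summary


# Stand-in for a fetched movie object (hypothetical data).
sample = {
    "title": "Example Film",
    "year": 2000,
    "cast": [{"name": "A"}, {"name": "B"}],
}

for row in summarize_keys(sample):
    print(row)
```

Sorting the result by size would surface the largest entries, such as the art department list above.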