# Creating a Dataset Using MovieLens and IMDBpy

Author: James Smith

> IMDbPY is a Python package for retrieving and managing the data of the IMDb movie database about movies and people.

It was used to obatin plot overviews and a list of genres for a select number of movies.

It is currently not possible to get a list of all IMDb movie ID's from th API, thus making it not practical to easily curate a dataset.

To obtain a list of IMDb ID's, datasets available on {MovieLens}[https://grouplens.org/datasets/movielens/] were used and appended to each other to obtain the largest number of ID's to begin with
- MovieLens 20M Dataset
    - Last Updated 10/2016
- MovieLens Latest Datasets
    - Last Updated 09/2018

# Retrieving Data

## Setup and Configuration

In [1]:
import pandas as pd

Get ID's from local dataset

In [2]:
movie_data_loc_1 = r"C:\Users\User\Documents\ITB Year 2\Text Analytics and Web Content Mining\Assignments\Assignment 2\Data\ml-20m\ml-20m\links.csv"
movie_data_loc_2 = r"C:\Users\User\Documents\ITB Year 2\Text Analytics and Web Content Mining\Assignments\Assignment 2\Data\ml-latest-small\ml-latest-small\links.csv"

def get_data(loc_1, loc_2):
    """
    IMDb ID's sometimes contain a '0' before the ID, thus 
    we must ensure we load it as a string and not a number.
    
    """
    data_1 = pd.read_csv(loc_1, usecols = ['imdbId'], dtype = {'imdbId': str})
    data_2 = pd.read_csv(loc_2, usecols = ['imdbId'], dtype = {'imdbId': str})
    
    data_union = pd.concat([data_1, data_2],
                          ignore_index = True).drop_duplicates()
    
    print("There are "+str(len(data_union.imdbId.unique()))+" unique Movie ID's")
    return data_union

movie_ids = get_data(movie_data_loc_1, movie_data_loc_2)

There are 28251 unique Movie ID's


## Understanding the IMDb API

In [56]:
import imdb
movieDB = imdb.IMDb()

**Example Movie**

In [4]:
id_example = movie_ids['imdbId'][0]
id_example

'0114709'

In [5]:
movie_example = movieDB.get_movie(id_example)
movie_example

<Movie id:0114709[http] title:_Toy Story (1995)_>

In [6]:
print("Title: "+movie_example['title'])
print("Plot Outline: "+movie_example['plot outline'])
print(movie_example['genres'])

Title: Toy Story
Plot Outline: A little boy named Andy loves to be in his room, playing with his toys, especially his doll named "Woody". But, what do the toys do when Andy is not with them, they come to life. Woody believes that his life (as a toy) is good. However, he must worry about Andy's family moving, and what Woody does not know is about Andy's birthday party. Woody does not realize that Andy's mother gave him an action figure known as Buzz Lightyear, who does not believe that he is a toy, and quickly becomes Andy's new favorite toy. Woody, who is now consumed with jealousy, tries to get rid of Buzz. Then, both Woody and Buzz are now lost. They must find a way to get back to Andy before he moves without them, but they will have to pass through a ruthless toy killer, Sid Phillips.
['Animation', 'Adventure', 'Comedy', 'Family', 'Fantasy']


## Attaining Attributes

In [60]:
from IPython.display import clear_output
import timeit
import numpy as np
# https://towardsdatascience.com/the-simplest-cleanest-method-for-tracking-a-for-loops-progress-and-expected-run-time-in-python-972675392b3

import imdb
movieDB = imdb.IMDb()

In [69]:
def attain_attributes(df):
    
    start = timeit.default_timer()
    for i in df.index:
        
        try:
            clear_output(wait = True)

            movie_object = movieDB.get_movie(df['imdbId'][i])

            title = movie_object['title']
            df.at[i, 'title'] = title

            try:
                plot_outline = movie_object['plot outline']
                df.at[i, 'plot_outline'] = plot_outline
            except: 
                df.at[i, 'plot_outline'] = ''

            try:
                genres = movie_object['genres']
                df.at[i, 'genres'] = ', '.join(genres)
            except:
                df.at[i, 'genres'] = ''

            stop = timeit.default_timer()

            if (i/len(df) * 100) < 5:
                expected_time = "Calculating..."
            else:
                stop = timeit.default_timer()
                expected_time = np.round( ((stop - start) / (i/len(df))) / 60, 2)

            print("Current progress:", np.round(i/len(df) * 100, 2), "%")
            print("Current run time:", np.round((stop - start)/60, 2), "minutes")
            print("Expected run time:", expected_time,"minutes")
        
        except:
            print("There was an error when obtaining data for the following ID:",df['imdbId'][i])

    return df

Below cell gathers the data - **commented out once it's loaded**

In [70]:
#df = movie_ids.copy()
#loc = r"C:\Users\User\Documents\ITB Year 2\Text Analytics and Web Content Mining\Assignments\Assignment 2\Data\overview_dataset.csv"
#data = attain_attributes(df)
#data.to_csv(loc)

Current progress: 131.04 %
Current run time: 1120.79 minutes
Expected run time: 855.33 minutes
