# Falcon meets IMDB: Dataset exploration, model customizations, and evaluation
The Internet Movie Database (IMDB) is a classic storehouse of rich NLP data used in many studies. While I would prefer to simply use what's already on the Hugging Face Hub [here](https://huggingface.co/datasets/imdb), sadly this isn't an option: that dataset contains movie reviews labeled for binary sentiment classification, not descriptions of the films themselves.

Instead, we'll go straight to the source. IMDB provides samples of its datasets for free [here](https://developer.imdb.com/non-commercial-datasets/). These files include the title, genres, and ratings of each film, but as far as I can tell not a plot description, which is what we're ultimately interested in.

For the descriptions themselves, the obvious option is Wikipedia articles. However, these aren't already organized by film title, so we'll use the IMDB titles to look them up.

In this notebook, we want to see how well ***an open-source language model can generate new movie concepts.*** Specifically, we want to explore prompt engineering and fine-tuning with a state-of-the-art model backbone. At the time of writing, that is TII's Falcon 40B.

An interesting challenge to solve in this notebook is that ***there's not already a great way of defining a "good" movie description.*** This means we'll need to develop some new evaluation metric or method to take a basic natural language movie description, say with at least 5 sentences, and create some numeric signal for how good this is. 
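
Before any real quality metric exists, the "at least 5 sentences" floor mentioned above can at least be checked mechanically. The following is a minimal sketch, not the evaluation metric itself: the helper names `count_sentences` and `meets_length_floor` are hypothetical, and the regex-based sentence split is a rough assumption that will miscount abbreviations such as "Dr." or "U.S.".

```python
import re

def count_sentences(text):
    # Split after '.', '!' or '?' when followed by whitespace, then drop empty pieces.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return len([p for p in parts if p])

def meets_length_floor(description, min_sentences=5):
    # A gate only: it says nothing about how good the description is.
    return count_sentences(description) >= min_sentences
```

A check like this would only filter out generations that are too short to score; the numeric quality signal still has to come from somewhere else.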

If you're following along with me, I'm using a SageMaker Studio notebook, specifically an `ml.m5.2xlarge`. I start with the Python 3 Data Science kernel.

### Step 0. Define and install package requirements.

In [None]:
%%writefile requirements.txt
torch
transformers
datasets

In [None]:
!pip install -r requirements.txt

### Step 1. Download some of the `IMDB` non-commercial datasets and load into pandas.
Specifically we'll download the `ratings` and `titles` datasets, then join these. After this, we will need to query Wikipedia for the articles, including the summary and full plot.

In [None]:
!mkdir imdb

In [17]:
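import subprocess

def download_imdb_set(file_name, local_dir):
    # fetch the gzipped TSV into local_dir, then decompress it in place
    url = f'https://datasets.imdbws.com/{file_name}'
    subprocess.run(['wget', url, '--directory-prefix', local_dir], check=True)
    # --force overwrites an existing .tsv so the cell can be re-run
    subprocess.run(['gunzip', '--force', f'{local_dir}/{file_name}'], check=True)

download_imdb_set(file_name='title.ratings.tsv.gz', local_dir='imdb')
download_imdb_set(file_name='title.basics.tsv.gz', local_dir='imdb')
# title.akas.tsv is read by format_imdb to find the US titles
download_imdb_set(file_name='title.akas.tsv.gz', local_dir='imdb')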
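
If `wget` or `gunzip` isn't available in your kernel image, the same download-then-decompress step can be done with the standard library. This is a sketch under that assumption; `gunzip_file` and `download_imdb_set_py` are hypothetical names, not part of the notebook above, and the only URL used is the same `datasets.imdbws.com` host.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

IMDB_BASE_URL = 'https://datasets.imdbws.com'  # same host as the wget call

def gunzip_file(gz_path):
    """Decompress foo.tsv.gz to foo.tsv next to it and return the new path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix('')  # drops the trailing '.gz'
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return out_path

def download_imdb_set_py(file_name, local_dir):
    """Hypothetical stand-in for download_imdb_set that needs no shell tools."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    gz_path = local_dir / file_name
    tsv_path = gz_path.with_suffix('')
    if not tsv_path.exists():  # skip the download on a re-run
        urllib.request.urlretrieve(f'{IMDB_BASE_URL}/{file_name}', str(gz_path))
        gunzip_file(gz_path)
        gz_path.unlink()  # mirror gunzip, which removes the .gz afterwards
    return tsv_path
```

Calling `download_imdb_set_py('title.ratings.tsv.gz', 'imdb')` should leave `imdb/title.ratings.tsv` on disk, just like the shell version.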

In [16]:
import pandas as pd

def format_imdb(table_name):
    # only keep titles released under the US region code
    if 'title.akas.tsv' in table_name:
        titles_df = pd.read_table('imdb/title.akas.tsv')
        us_titles = titles_df.loc[titles_df["region"]=='US']
        # return a new frame rather than mutating a slice in place
        return us_titles.set_index('titleId')

    elif 'title.ratings.tsv' in table_name:
        ratings_df = pd.read_table('imdb/title.ratings.tsv')
        return ratings_df.set_index('tconst')

    elif 'title.basics.tsv' in table_name:
        title_basics = pd.read_table('imdb/title.basics.tsv')
        # filter out adult films and only grab movies
        title_basics = title_basics[(title_basics.isAdult==0) & (title_basics.titleType == 'movie')]
        return title_basics.set_index('tconst')

    # fail loudly on an unrecognized name instead of silently returning None
    raise ValueError(f'unknown IMDB table: {table_name}')
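
The IMDB files mark missing values with a literal `\N`, which pandas reads as an ordinary string by default (you can see it in the outputs further down). Here is a sketch of a reader that turns those into real `NaN`s. `read_imdb_tsv` is a hypothetical helper, the rows in `sample` are made up for illustration, and disabling quote handling is my assumption about how to treat stray double quotes in titles.

```python
import csv
import io
import pandas as pd

def read_imdb_tsv(path_or_buffer):
    # '\N' becomes NaN; QUOTE_NONE keeps any double quote in a title as a literal character
    return pd.read_csv(
        path_or_buffer,
        sep='\t',
        na_values='\\N',
        quoting=csv.QUOTE_NONE,
        low_memory=False,
    )

# toy rows (not real IMDB records) in the same tab-separated layout
sample = io.StringIO(
    'tconst\ttitleType\tprimaryTitle\tisAdult\tstartYear\n'
    'tt9999991\tmovie\tToy Title\t0\t1999\n'
    'tt9999992\tshort\t"Quoted Toy Title\t0\t\\N\n'
)
sample_df = read_imdb_tsv(sample)
```

Swapping this in for `pd.read_table` would change how the `\N` cells display in the outputs below, so I've left the original cells as they are.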

In [41]:
title_basics = format_imdb('title.basics.tsv')
us_titles = format_imdb('title.akas.tsv')
ratings_df = format_imdb('title.ratings.tsv')

In [62]:
title_basics.shape

(637227, 8)

In [33]:
# grab only full-length feature US movies
us_movies = us_titles[us_titles.index.isin(title_basics.index)]
# join with the ratings 
df = us_movies.join(ratings_df)
df.shape
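
It helps to see what `join` does here on a tiny example. `DataFrame.join` is a left join on the index by default, so every US title row is kept and titles without a rating get `NaN`. The frames below are toy data with made-up IDs, purely to illustrate the behavior.

```python
import pandas as pd

# one film with two alternate US titles, plus one film with no rating
titles = pd.DataFrame(
    {'title': ['Film A', 'Film A (alt title)', 'Film B']},
    index=pd.Index(['tt001', 'tt001', 'tt002'], name='titleId'),
)
ratings = pd.DataFrame(
    {'averageRating': [7.2], 'numVotes': [13]},
    index=pd.Index(['tt001'], name='tconst'),
)

joined = titles.join(ratings)                     # left join: keeps all 3 title rows
rated = joined.dropna(subset=['averageRating'])   # drop titles with no rating
inner = titles.join(ratings, how='inner')         # same rows as `rated`, in one step
```

Two things to notice: the rating is repeated for each alternate title of the same film, and `numVotes` turns into a float column because of the `NaN`s, which is why it shows up as `206.0` rather than `206` in the output.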

In [47]:
df.head()

| titleId | ordering | title | region | language | types | attributes | isOriginalTitle | averageRating | numVotes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tt0000009 | 5 | Miss Jerry | US | \N | imdbDisplay | \N | 0 | 5.3 | 206.0 |
| tt0000147 | 1 | The Corbett-Fitzsimmons Fight | US | \N | imdbDisplay | \N | 0 | 5.3 | 475.0 |
| tt0000574 | 7 | The Story of the Kelly Gang | US | \N | imdbDisplay | \N | 0 | 6.0 | 832.0 |
| tt0000591 | 3 | The Prodigal Son | US | \N | \N | \N | 0 | 4.4 | 20.0 |
| tt0000630 | 4 | Hamlet | US | \N | \N | \N | 0 | 2.8 | 26.0 |


Wow, surprisingly, of the 1.4M US movies in this dataset sample, only 326K have ratings. Let's pick the best ones.

In [60]:
top_movies = df[df.averageRating >= 7.0]
top_movies.shape

(50676, 9)

In [61]:
top_movies.head()

| titleId | ordering | title | region | language | types | attributes | isOriginalTitle | averageRating | numVotes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tt0002130 | 15 | Dante's Inferno | US | \N | \N | \N | 0 | 7.0 | 3167.0 |
| tt0002305 | 2 | Life of Villa | US | \N | imdbDisplay | \N | 0 | 7.6 | 28.0 |
| tt0002637 | 3 | Arizona | US | \N | imdbDisplay | \N | 0 | 7.2 | 18.0 |
| tt0003456 | 1 | Trial by Fire | US | \N | \N | 16mm release title | 0 | 7.2 | 13.0 |
| tt0003456 | 2 | Through Fire and Air | US | \N | imdbDisplay | \N | 0 | 7.2 | 13.0 |


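Wow, we started at 637,227 feature-length movies in the IMDB dataset, and now have just about 50,000 rows for US feature-length films with a rating of 7.0 or higher. Clearly there's a data issue here! For one thing, `tt0003456` shows up twice in the output above, because each alternate US title gets its own row.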
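
One likely part of the issue is visible in the output: the same title ID can appear on several rows, and some "top" films have only a handful of votes. Below is a sketch of one way to clean that up. `one_row_per_film` is a hypothetical helper, the toy frame reuses three rows from the output above, and the 1,000-vote floor is an arbitrary assumption, not a recommendation.

```python
import pandas as pd

def one_row_per_film(df, min_votes=0):
    # keep the first row seen for each title ID, then apply an optional vote floor
    deduped = df[~df.index.duplicated(keep='first')]
    return deduped[deduped['numVotes'] >= min_votes]

# three rows copied from the top_movies output
toy = pd.DataFrame(
    {
        'title': ['Trial by Fire', 'Through Fire and Air', "Dante's Inferno"],
        'averageRating': [7.2, 7.2, 7.0],
        'numVotes': [13.0, 13.0, 3167.0],
    },
    index=pd.Index(['tt0003456', 'tt0003456', 'tt0002130'], name='titleId'),
)

unique_films = one_row_per_film(toy)                   # one row per title ID
popular_films = one_row_per_film(toy, min_votes=1000)  # assumed vote floor
```

Running `one_row_per_film(top_movies).shape` on the real frame would show how many distinct films are actually behind those 50,676 rows.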