# Falcon meets IMDB: Dataset exploration, model customizations, and evaluation
The Internet Movie Database (IMDB) is a classic source of rich NLP data used in many studies. I would prefer to simply use what's already on the Hugging Face Hub [here](https://huggingface.co/datasets/imdb), but sadly that isn't an option: that dataset consists of reviews labeled for binary sentiment classification, not film descriptions.

Instead, we'll go straight to the source. The IMDB group provides samples of their datasets for free [here](https://developer.imdb.com/non-commercial-datasets/). This includes the title and description of the film. We'll be interested in the description here. 

Another option for movie descriptions is, of course, Wikipedia articles. However, those aren't already organized by film title, so we'll consider that a plan B.

In this notebook, we want to see how well ***an open-source language model can generate new movie concepts.*** Specifically, we want to explore prompt engineering and fine-tuning with a state-of-the-art model backbone. At the time of writing, that is TII's Falcon 40B.

An interesting challenge to solve in this notebook is that ***there's not already a great way of defining a "good" movie description.*** This means we'll need to develop a new evaluation metric or method that takes a basic natural-language movie description, say of at least five sentences, and produces a numeric signal for how good it is.

If you're following along with me, I'm using a SageMaker Studio notebook, specifically an `ml.m5.2xlarge`. I start with the Python 3 Data Science kernel.

### Step 0. Define and install package requirements.

In [2]:
%%writefile requirements.txt
torch
transformers
datasets

Writing requirements.txt


In [3]:
!pip install -r requirements.txt

Collecting torch (from -r requirements.txt (line 1))
  Downloading torch-1.13.1-cp37-cp37m-manylinux1_x86_64.whl (887.5 MB)
Collecting transformers (from -r requirements.txt (line 2))
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting datasets (from -r requirements.txt (line 3))
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch->-r requirements.txt (line 1))
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)

### Step 1. Download some of the `IMDB` non-commercial datasets and load into pandas.

In [None]:
!mkdir imdb

In [27]:
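import os

def download_imdb_set(file_name, local_dir):
    # Fetch the compressed TSV from the IMDB dataset host into local_dir
    msg1 = f'wget https://datasets.imdbws.com/{file_name} --directory-prefix {local_dir}/'
    os.system(msg1)
    # Decompress in place; gunzip replaces the .gz file with the plain .tsv
    msg2 = f'gunzip {local_dir}/{file_name}'
    os.system(msg2)

# title.akas.tsv is read in the next cell, so fetch it alongside the ratings
download_imdb_set(file_name='title.akas.tsv.gz', local_dir='imdb')
download_imdb_set(file_name='title.ratings.tsv.gz', local_dir='imdb')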
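
If `wget` or `gunzip` aren't available on your kernel image, the same download-and-decompress step can be sketched with only the Python standard library. This is an illustrative alternative, not what the notebook ran: the function name `fetch_and_gunzip` and the `base_url` parameter are my own assumptions, and only the host URL comes from the cell above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# Same host as the wget call in the cell above
IMDB_BASE_URL = "https://datasets.imdbws.com"

def fetch_and_gunzip(file_name, local_dir, base_url=IMDB_BASE_URL):
    """Download a .gz file into local_dir, decompress it, and return the new path.

    Hypothetical helper: the name and the base_url parameter are assumptions.
    """
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)

    gz_path = local_dir / file_name
    out_path = gz_path.with_suffix("")  # title.ratings.tsv.gz -> title.ratings.tsv

    # Download the compressed file
    urllib.request.urlretrieve(f"{base_url}/{file_name}", gz_path)

    # Stream-decompress so the whole file never has to sit in memory
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Mirror gunzip, which removes the .gz once it has been expanded
    gz_path.unlink()
    return out_path
```

Calling `fetch_and_gunzip('title.ratings.tsv.gz', 'imdb')` should leave `imdb/title.ratings.tsv` on disk, as the shell version does. It also creates the directory if it's missing, so the separate `mkdir` cell becomes optional.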

In [17]:
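import pandas as pd
# low_memory=False parses the file in one pass, which avoids mixed-dtype warnings on this large TSV
titles_df = pd.read_table('imdb/title.akas.tsv', low_memory=False)
# Keep only the rows whose region code is 'US'
us_titles = titles_df.loc[titles_df["region"] == 'US']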
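
The IMDB files mark missing fields with a literal `\N`, which pandas would otherwise read as an ordinary string. Here is a small sketch of the same region filter with that marker mapped to `NaN`. The rows are made-up toy data rather than real IMDB records, and `sample_tsv`, `toy_df`, and `toy_us` are illustrative names, not variables used elsewhere in this notebook.

```python
import io
import pandas as pd

# Toy rows in the shape of title.akas.tsv (not real IMDB records)
sample_tsv = (
    "titleId\tordering\ttitle\tregion\tlanguage\n"
    "tt9999991\t1\tExample Film A\tUS\t\\N\n"
    "tt9999991\t2\tBeispielfilm A\tDE\t\\N\n"
    "tt9999992\t1\tExample Film B\tUS\t\\N\n"
)

toy_df = pd.read_table(
    io.StringIO(sample_tsv),
    na_values="\\N",   # treat IMDB's missing-value marker as NaN
    dtype=str,         # read every column as text so IDs keep their form
)

# Same filter as in the cell above: keep only US-region rows
toy_us = toy_df.loc[toy_df["region"] == "US"]
print(toy_us[["titleId", "title", "region"]])
```

On the real file, passing `na_values="\\N"` to the `pd.read_table` call above should have the same effect, though I haven't measured how it changes load time or memory on the full dataset.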


In [28]:
ratings_df = pd.read_table('imdb/title.ratings.tsv')

In [32]:
ratings_df.head()

| | tconst | averageRating | numVotes |
|---|---|---|---|
| 0 | tt0000001 | 5.7 | 1982 |
| 1 | tt0000002 | 5.8 | 265 |
| 2 | tt0000003 | 6.5 | 1839 |
| 3 | tt0000004 | 5.5 | 178 |
| 4 | tt0000005 | 6.2 | 2624 |
