This example runs the FrESCO library on the imdb benchmark data. The FrESCO repository includes a text-only version of the imdb dataset and a script for processing the data prior to training. The `data_setup.py` script may be used to process data into the format expected by the FrESCO model training codebase. If you have not already done so, go to the data directory and unzip the dataset with `$ tar -xf imdb.tar.gz`, then start formatting the data.

In [1]:
import gensim
import json
import math
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from tqdm import tqdm

tqdm.pandas()

We'll need a helper function to tokenize the movie reviews.

In [2]:
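def word2int(d, model):
    # look up each lowercased word; words missing from the vocab come back as None
    ints = [model.wv.key_to_index.get(w.lower()) for w in d.split(' ')]
    # the unknown token takes the first index past the end of the vocab
    unk = len(model.wv.key_to_index)
    return [x if x is not None else unk for x in ints]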
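
To see what the helper returns, here is a minimal sketch that swaps the trained gensim model for a tiny stand-in object. The `SimpleNamespace` stand-in, `toy_model`, and its three-word vocabulary are assumptions for illustration only; in this notebook `model` is the `Word2Vec` model trained further down. The helper is repeated so the block runs on its own.

```python
from types import SimpleNamespace

def word2int(d, model):
    ints = [model.wv.key_to_index.get(w.lower()) for w in d.split(' ')]
    unk = len(model.wv.key_to_index)
    return [x if x is not None else unk for x in ints]

# Hypothetical stand-in exposing only the attribute the helper reads.
toy_model = SimpleNamespace(
    wv=SimpleNamespace(key_to_index={"the": 0, "movie": 1, "great": 2})
)

ids = word2int("The movie was great", toy_model)
print(ids)  # "was" is not in the toy vocab, so it maps to len(vocab) == 3
```

Every out-of-vocabulary word shares the same index, which is why a single extra embedding row is enough later on.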

Now we'll set the random number seed, the dimension of the word embeddings, and the sizes of the train, val, and test splits. Then we'll read the raw text into a pandas dataframe.

In [3]:
seed = 42

train_split = 0.75
val_split = 0.15
test_split = 0.10

embed_dim = 300

df = pd.read_csv('../data/imdb/IMDB Dataset.csv')

Next we'll do some initial cleaning of the data: lowercasing, stripping out HTML tags and non-alphanumeric characters, and dropping words shorter than 2 characters. Lastly, the remaining words in each review are split into individual strings.

In [4]:
data = [d.lower() for d in df['review']]
data = [gensim.parsing.preprocessing.strip_tags(d) for d in data]
data = [gensim.parsing.preprocessing.strip_non_alphanum(d) for d in data]
data = [gensim.parsing.preprocessing.strip_short(d, minsize=2) for d in data]
data = [d.split(' ') for d in data]
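
As a rough illustration of what these steps do to a single review, the sketch below approximates the gensim filters with the standard library `re` module. The `clean_review` helper and the sample text are hypothetical, and the regular expressions only approximate the gensim behaviour; they are not its implementation.

```python
import re

def clean_review(text, minsize=2):
    """Approximate the cleaning chain above for one review (illustrative only)."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", "", text)   # drop HTML tags such as <br />
    text = re.sub(r"\W", " ", text)       # non-word characters become spaces
    # keep only words with at least `minsize` characters
    return [w for w in text.split() if len(w) >= minsize]

tokens = clean_review("A great film!<br /><br />I loved it.")
print(tokens)
```

Single-character words such as "a" and "i" are dropped, and punctuation never reaches the token list.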

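Now we can split the data into the train/val/test proportions we specified above and start putting them into dataframes for subsequent processing. Note that the split below is applied to the original `review` column of `df`, not to the cleaned `data` list.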

In [5]:
x_train, x_tmp, y_train, y_tmp = train_test_split(df['review'], df['sentiment'], test_size=1-train_split,
                                                  random_state=seed)

x_val, x_test, y_val, y_test = train_test_split(x_tmp, y_tmp, test_size=test_split/(test_split + val_split),
                                                random_state=seed)

train_df = pd.DataFrame(x_train, columns=['review', 'X', 'sentiment', 'split'])
val_df = pd.DataFrame(x_val, columns=['review', 'X', 'sentiment', 'split'])
test_df = pd.DataFrame(x_test, columns=['review', 'X', 'sentiment', 'split'])

train_df['split'] = 'train'
val_df['split'] = 'val'
test_df['split'] = 'test'

train_df['sentiment'] = y_train
val_df['sentiment'] = y_val
test_df['sentiment'] = y_test

train_data = pd.concat([train_df, val_df])
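
The two-stage split can be checked with a little arithmetic. The sketch below assumes 50,000 reviews in total, which matches the row counts reported in the progress bars later in this notebook. It uses plain rounding and does not reproduce scikit-learn's internal rounding rules.

```python
import math

train_split, val_split, test_split = 0.75, 0.15, 0.10
assert math.isclose(train_split + val_split + test_split, 1.0)

n_total = 50_000  # assumed dataset size, see the lead-in

# Stage 1: hold out everything that is not training data.
holdout_frac = 1 - train_split                                # 0.25
n_holdout = round(n_total * holdout_frac)
n_train = n_total - n_holdout

# Stage 2: divide the held-out part between val and test.
test_frac_of_holdout = test_split / (test_split + val_split)  # 0.4
n_test = round(n_holdout * test_frac_of_holdout)
n_val = n_holdout - n_test

print(n_train, n_val, n_test)
```

The second `test_size` is a fraction of the held-out 25%, not of the full dataset, which is why it is 0.4 and not 0.10.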

The next step in the process is to build the vocab from the train and val sets, then train the word embeddings on that vocab. 

In [6]:
print("Creating vocab and word embeddings")
model = gensim.models.word2vec.Word2Vec(vector_size=embed_dim, min_count=2, epochs=25, workers=4)
model.build_vocab(train_data['review'].str.split())
model.train(train_data['review'].str.split(), total_examples=model.corpus_count, epochs=model.epochs)

Creating vocab and word embeddings


(198994029, 260081200)

Next we need to tokenize all of the reviews and save them in the `X` field of the dataframes.

In [7]:
train_df['X'] = train_df['review'].progress_apply(lambda d: word2int(d, model))
val_df['X'] = val_df['review'].progress_apply(lambda d: word2int(d, model))
test_df['X'] = test_df['review'].progress_apply(lambda d: word2int(d, model))

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37500/37500 [00:01<00:00, 23348.21it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7500/7500 [00:00<00:00, 23383.88it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5000/5000 [00:00<00:00, 23552.05it/s]


Since the global vocab is built from the train and val sets only, some words in the test set will not appear in it; the same holds for train and val words that occur fewer than `min_count` times. To handle this, we map those words to the unknown token, `<unk>`, and append a random embedding for it.

In [8]:
word_vecs = [model.wv.vectors[index] for index in model.wv.key_to_index.values()]
rng = np.random.default_rng(seed)
unk_embed = rng.normal(size=(1, embed_dim), scale=0.1)
w2v = np.append(word_vecs, unk_embed, axis=0)

id2word = {v: k for k, v in model.wv.key_to_index.items()}
id2word[len(model.wv.key_to_index)] = "<unk>"
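
The sketch below repeats the same steps on a toy vocabulary to show how the extra row lines up with the `<unk>` index. The three-word `key_to_index` dict, the zero-filled `vectors` array, and the small `embed_dim` are stand-ins for the trained model, chosen only for illustration.

```python
import numpy as np

seed, embed_dim = 42, 4                              # small dimension for illustration
key_to_index = {"the": 0, "movie": 1, "great": 2}    # hypothetical vocab
vectors = np.zeros((len(key_to_index), embed_dim))   # stand-in for model.wv.vectors

rng = np.random.default_rng(seed)
unk_embed = rng.normal(size=(1, embed_dim), scale=0.1)

# Append the unknown-token embedding as the last row of the matrix.
w2v = np.append(vectors, unk_embed, axis=0)

id2word = {v: k for k, v in key_to_index.items()}
id2word[len(key_to_index)] = "<unk>"

print(w2v.shape, id2word[w2v.shape[0] - 1])
```

The last row of `w2v` and the last key of `id2word` share the index that `word2int` returns for unseen words, so the three pieces stay consistent.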

The last step prior to training a model is to save the needed files in a convenient location.

In [9]:
df_out = pd.concat([train_df, val_df, test_df])
df_out.to_csv("../data/imdb/data_fold0.csv", index=False)

# sort the labels so the id-to-label mapping is the same on every run
labels = sorted(set(df['sentiment']))
id2label = {'sentiment': {i: l for i, l in enumerate(labels)}}
with open('../data/imdb/id2labels_fold0.json', 'w') as f:
    json.dump(id2label, f)

with open('../data/imdb/id2word.json', 'w') as f:
    json.dump(id2word, f)
np.save('../data/imdb/word_embeds_fold0.npy', w2v)
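
One detail to keep in mind when reading these files back: JSON object keys are always strings, so the integer keys of `id2word` and `id2label` are written as `"0"`, `"1"`, and so on. The sketch below uses a small hypothetical mapping to show the round trip; whether the FrESCO loader converts the keys itself is not shown in this notebook.

```python
import json

id2word = {0: "the", 1: "movie", 2: "<unk>"}   # hypothetical mapping

serialized = json.dumps(id2word)
loaded = json.loads(serialized)
print(list(loaded))                            # keys come back as strings

# Convert the keys back to integers if integer lookups are needed.
restored = {int(k): v for k, v in loaded.items()}
```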

Now we're ready to train a model. Let's get the needed imports.

In [10]:
import fresco
import argparse

The FrESCO library is typically run from the command line with arguments specifying the model type and model args, so we'll have to set them up manually for this notebook.

In [11]:
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
_ = parser.add_argument("--model", "-m", type=str, default='ie',
                        help="""which type of model to create. Must be either
                                IE (information extraction) or clc (case-level context).""")
_ = parser.add_argument('--model_path', '-mp', type=str, default='',
                        help="""this is the location of the model
                                that will be used to make predictions""")
_ = parser.add_argument('--data_path', '-dp', type=str, default='',
                        help="""where the data will load from. The default is
                                the path saved in the model""")
_ = parser.add_argument('--model_args', '-args', type=str, default='',
                        help="""file specifying the model or clc args; default is in
                                the fresco directory""")

We are going to train a classification model on the imdb dataset with a single task, `sentiment`, so we'll specify an `information extraction` model and point to the imdb model args file.

In [12]:
args = parser.parse_args(args=['-m', 'ie', '-args', '../configs/imdb_args.yml'])
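
Passing an explicit list to `parse_args` matters in a notebook: with no argument, `argparse` reads `sys.argv`, which here holds the Jupyter kernel's own arguments and not ours. The reduced parser below is a sketch with only two of the options defined above, to show what ends up in the resulting namespace.

```python
import argparse

# Reduced, illustrative parser with two of the notebook's options.
parser = argparse.ArgumentParser()
parser.add_argument("--model", "-m", type=str, default="ie")
parser.add_argument("--model_args", "-args", type=str, default="")

# An explicit list is parsed in place of sys.argv.
args = parser.parse_args(args=["-m", "ie", "-args", "../configs/imdb_args.yml"])
print(vars(args))
```

The attribute names come from the long option names, so the values are available as `args.model` and `args.model_args`.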

With these arguments specified, we just need a few more imports before we're ready to train our model.

In [13]:
from fresco import run_ie

from fresco.validate import exceptions

In [15]:
run_ie.run_ie(args)

Validating kwargs in model_args.yml file
Loading data and creating DataLoaders
Loading data from ../data/imdb/
Num workers: 4, reproducible: True
Training on 37500 validate on 7500

Defining a model
Creating model trainer
Training a mthisan model with 2 cuda device


epoch: 1

training time 40.18
Training loss: 0.467834
        task:      micro        macro
   sentiment:     0.7522,     0.7522

epoch 1 validation

epoch 1 val loss: 0.32461786, best val loss: inf
patience counter is at 0 of 5
        task:      micro        macro
   sentiment:     0.8588,     0.8580

epoch: 2

training time 40.90
Training loss: 0.374819
        task:      micro        macro
   sentiment:     0.8425,     0.8425

epoch 2 validation

epoch 2 val loss: 0.29770253, best val loss: 0.32461786
patience counter is at 0 of 5
        task:      micro        macro
   sentiment:     0.8725,     0.8725

epoch: 3

training time 40.94
Training loss: 0.365319
        task:      micro        macro
   sentiment:     0.857