This example runs the FrESCO library on the unlabeled imdb data included in the FrESCO repository. The notebook walks through the steps in the provided script `use_model.py` for making predictions from a trained model. The data has already been preprocessed and is ready for inference. If you haven't already done so, go to the data directory and unpack the dataset with the command `$ tar -xf inference.tar.gz`. The script requires a model trained on the imdb dataset, so work through our imdb example notebook first if you don't have a trained model ready to go. Lastly, move the saved model to the `notebooks/savedmodels/` directory to make running this example easier.


The `inference/` directory contains only the unlabeled data we want to make predictions on; it does not include the vocabulary and word-embedding files the model also requires. We must copy those over from the data directory used to train the model. Before copying the required files, let's check out the format of the unlabeled data.

In [1]:
import pandas as pd
unlabeled = pd.read_csv("../data/inference/data_fold0.csv")
unlabeled

Unnamed: 0,review,X,split
0,"Well, I have to admit that this movie brought ...","[541, 136, 22, 4, 1056, 8, 9, 19, 748, 44, 264...",test
1,The film is not for everyone. Some might think...,"[0, 21, 5, 24, 15, 3808, 44, 214, 96, 0, 137, ...",test
2,This movie is soo bad that I've wasted way to ...,"[9, 19, 5, 26416, 97, 8, 3366, 1107, 111, 4, 7...",test
3,I bought a tape of this film based on the reco...,"[136, 1172, 1, 2543, 3, 9, 21, 454, 18, 0, 811...",test
4,Krajobraz po bitwie like many films of Wajda i...,"[159048, 159048, 159048, 34, 100, 133, 3, 1590...",test
...,...,...,...
9995,I grew up in New York City and every afternoon...,"[136, 2107, 58, 6, 226, 40859, 1086, 2, 174, 4...",test
9996,Very good political thriller regarding the aft...,"[45, 51, 999, 1176, 2806, 0, 10380, 3, 14087, ...",test
9997,J Carol Nash and Ralph Morgan star in a movie ...,"[126661, 94533, 159048, 2, 159048, 135044, 568...",test
9998,Although it really isn't such a terribly movie...,"[433, 10, 56, 204, 138, 1, 1960, 19, 2489, 121...",test


Besides the index column (`Unnamed: 0`), we've got three columns: the raw text in `review`, the tokenized reviews in `X`, and `split`, which is `test` for every row. We've tokenized the reviews with the same vocab mappings as in the trained model, so the token ids match up between the two datasets. Now we can copy over the other required files.
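Because a CSV file stores every cell as text, the `X` column most likely arrives as strings that merely look like lists of token ids. The following is a minimal sketch of converting them back, using only the standard library and two made-up rows; the reviews and token ids below are illustrative and are not taken from the dataset.

```python
import ast
import csv
import io

# Two made-up rows in the same layout as data_fold0.csv (illustrative values only).
sample_csv = '''review,X,split
"A fine film.","[12, 7, 301]",test
"Not for me.","[44, 9, 5, 88]",test
'''


def parse_token_ids(cell):
    """Turn the text '[1, 2, 3]' into the list [1, 2, 3]."""
    ids = ast.literal_eval(cell)
    if not isinstance(ids, list) or not all(isinstance(i, int) for i in ids):
        raise ValueError(f"expected a list of ints, got {cell!r}")
    return ids


rows = list(csv.DictReader(io.StringIO(sample_csv)))
token_ids = [parse_token_ids(row["X"]) for row in rows]
lengths = [len(ids) for ids in token_ids]
print(lengths)
```

With the DataFrame loaded above, the same idea would be `unlabeled["X"].apply(ast.literal_eval)`, assuming every cell holds a well-formed list.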

In [2]:
cp ../data/imdb/*.json ../data/inference/

In [3]:
cp ../data/imdb/word_embeds_fold0.npy ../data/inference/
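If shell commands are not available in your environment, the two copies can be done from Python instead. This is a sketch, not part of FrESCO; the helper name `copy_support_files` is made up for this example, and the default patterns simply mirror the two `cp` commands above.

```python
import glob
import os
import shutil


def copy_support_files(src_dir, dst_dir, patterns=("*.json", "word_embeds_fold0.npy")):
    """Copy files matching `patterns` from src_dir into dst_dir; return the copied names."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for pattern in patterns:
        # sorted() keeps the order deterministic across platforms
        for path in sorted(glob.glob(os.path.join(src_dir, pattern))):
            shutil.copy(path, dst_dir)
            copied.append(os.path.basename(path))
    return copied


# Example call, assuming the directory layout used in this notebook:
# copy_support_files("../data/imdb", "../data/inference")
```

The returned list makes it easy to confirm that both the JSON files and the embedding file were actually found.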

We'll walk through how to make predictions from a trained model, notebook style. This is essentially a notebook version of the script `use_model.py`, which is included for use at the command line. First we'll need some imports.

In [4]:
import argparse
import json
import os
import random

import torch

import numpy as np

from fresco.validate import exceptions
from fresco.validate import validate_params
from fresco.data_loaders import data_utils
from fresco.abstention import abstention
from fresco.models import mthisan, mtcnn
from fresco.predict import predictions

Now we'll set up the required function definitions.

In [5]:
def load_model_dict(model_path, valid_params, data_path=""):
    """Load pretrained model from disk.

    Args:
        model_path: str, from command line args, points to saved model
        valid_params: ValidateParams class, with model_args dict
        data_path: str or None, using data from the trained model, or different one

    We check if the supplied path is valid and if the packages match needed
    to run the pretrained model.
    """
    # fail early, before trying to read the file, if the path is wrong
    if not os.path.exists(model_path):
        raise exceptions.ParamError(f'the model at {model_path} does not exist.')

    print(f"Loading trained model from {model_path}")
    model_dict = torch.load(model_path, map_location=torch.device('cpu'))

    # data_path may be None or empty; only read the metadata when it is given
    if data_path:
        with open(os.path.join(data_path, 'metadata.json'), 'r', encoding='utf-8') as f:
            data_args = json.load(f)

    mismatches = []
    # check to see if the stored package matches the expected one
    if len(mismatches) > 0:
        with open('metadata_package.json', 'w', encoding='utf-8') as f_out:
            json.dump(model_dict['metadata_package'], f_out, indent=2)
            raise exceptions.ParamError(f'the package(s) {", ".join(mismatches)} does not match ' +
                                        f'the generated data in {data_path}.' +
                                         '\nThe needed recreation info is in metadata_package.json')

    return model_dict


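def load_model(model_dict, device, dw):
    """Build the architecture named in the saved model args and load its weights."""

    model_args = model_dict['metadata_package']['mod_args']

    if model_args['model_type'] == 'mthisan':
        model = mthisan.MTHiSAN(dw.inference_data['word_embedding'],
                                dw.num_classes,
                                **model_args['MTHiSAN_kwargs'])

    elif model_args['model_type'] == 'mtcnn':
        model = mtcnn.MTCNN(dw.inference_data['word_embedding'],
                            dw.num_classes,
                            **model_args['MTCNN_kwargs'])

    else:
        # without this branch an unknown type would fail later with an unbound `model`
        raise exceptions.ParamError(f"Unknown model_type: {model_args['model_type']}")
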
    model.to(device)
    if torch.cuda.device_count() > 1:
        model = torch.nn.DataParallel(model)

    model_dict = {k: v for k,v in model_dict.items() if k!='metadata_package'}
    # model_dict = {k.replace('module.',''): v for k,v in model_dict.items()}
    model.load_state_dict(model_dict)

    print('model loaded')

    return model
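The commented-out line in `load_model` points at a common loading problem: a model wrapped in `torch.nn.DataParallel` stores its weights under keys prefixed with `module.`, so a checkpoint saved from a wrapped model will not load into an unwrapped one. Below is a small sketch of removing that prefix. The helper name and the layer names are hypothetical, and plain integers stand in for tensors so the example runs without PyTorch.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Stand-in values; a real state dict maps these names to tensors.
saved = OrderedDict([("module.embedding.weight", 0), ("module.fc.bias", 1)])
cleaned = strip_module_prefix(saved)
print(list(cleaned))
```

Unlike `k.replace('module.', '')`, this only touches a prefix at the start of the key, so a layer that happens to contain `module.` elsewhere in its name is left alone.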

The FrESCO library is typically run from the command line with arguments specifying the model type and model args, so we'll have to set them up manually for this notebook. If you want to run the script from the command line, the command is: `$ python use_model.py -mp /path/to/model -dp /path/to/data/`

In [6]:
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    _ = parser.add_argument('--model_path', '-mp', type=str, default='',
                            help="""this is the location of the model
                                that will be used to make predictions""")
    _ = parser.add_argument('--data_path', '-dp', type=str, default='',
                            help="""where the data will load from. The default is
                                    the path saved in the model""")
    _ = parser.add_argument('--model_args', '-args', type=str, default='',
                            help="""path to specify the model_args; default is in
                                    the configs/ directory""")

We are going to make predictions on an unlabeled dataset, so we'll specify the saved information extraction model and point to the unlabeled dataset directory.

In [7]:
args = parser.parse_args(args=['-mp', 'savedmodels/model/model.h5', '-dp', '../data/inference'])
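Passing the argument list explicitly matters in a notebook: with no `args`, `parse_args` reads `sys.argv`, which inside a Jupyter kernel holds the kernel's own launch arguments rather than ours. The standalone sketch below shows the pattern with a cut-down parser; `demo_parser` and `demo_args` are names chosen here so they do not overwrite the notebook's `parser` and `args`.

```python
import argparse

# A cut-down parser with just the two options used in this notebook.
demo_parser = argparse.ArgumentParser()
demo_parser.add_argument("--model_path", "-mp", type=str, default="")
demo_parser.add_argument("--data_path", "-dp", type=str, default="")

# Because the list is passed explicitly, sys.argv is never consulted.
demo_args = demo_parser.parse_args(
    args=["-mp", "savedmodels/model/model.h5", "-dp", "../data/inference"]
)
print(demo_args.model_path, demo_args.data_path)
```

Any option left out of the list falls back to its default, which is the empty string for both options here.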

Next, we need to verify that the `model_args` are sane and load the saved model from disk.

In [8]:
    # 1. validate model/data args
    print("Validating kwargs in model_args.yml file")
    data_source = 'pre-generated'
    # use the model args file from training
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_dict = load_model_dict(args.model_path, device)
    mod_args = model_dict['metadata_package']['mod_args']

    print("Validating kwargs from pretrained model ")
    model_args = validate_params.ValidateParams(args,
                                                data_source=data_source,
                                                model_args=mod_args)

    model_args.check_data_train_args(from_pretrained=True)

Validating kwargs in model_args.yml file
Loading trained model from savedmodels/model/model.h5
Validating kwargs from pretrained model 
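The real validation is done by FrESCO's `ValidateParams` checks, but it can help to see which entries of `mod_args` the remaining cells rely on. The sketch below is not a FrESCO function: `missing_keys` is a made-up helper, the key list is simply collected from the lookups in this notebook, and the example values are arbitrary.

```python
# Nested keys that the cells in this notebook look up in mod_args.
REQUIRED_KEYS = [
    ("model_type",),
    ("abstain_kwargs", "abstain_flag"),
    ("data_kwargs", "reproducible"),
    ("data_kwargs", "random_seed"),
    ("train_kwargs", "batch_per_gpu"),
]


def missing_keys(mod_args, required=REQUIRED_KEYS):
    """Return the dotted paths from `required` that are absent from mod_args."""
    missing = []
    for path in required:
        node = mod_args
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Hypothetical model args with the abstention section left out on purpose.
example_args = {
    "model_type": "mtcnn",
    "data_kwargs": {"reproducible": True, "random_seed": 42},
    "train_kwargs": {"batch_per_gpu": 128},
}
print(missing_keys(example_args))
```

Running `missing_keys(mod_args)` on the dictionary loaded above should return an empty list if the saved model carries everything the later cells need.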


Next we'll check the arguments for the chosen model architecture, validate the abstention settings if the saved model was trained with abstention enabled, and check that the required data files exist.


In [9]:
    if model_args.model_args['model_type'] == 'mthisan':
        model_args.hisan_arg_check()
    elif model_args.model_args['model_type'] == 'mtcnn':
        model_args.mtcnn_arg_check()

    if model_args.model_args['abstain_kwargs']['abstain_flag']:
        model_args.check_abstain_args()

    model_args.check_data_files()

Great, we can set the random number seeds next.

In [10]:
    if model_args.model_args['data_kwargs']['reproducible']:
        seed = model_args.model_args['data_kwargs']['random_seed']
        torch.manual_seed(seed)
        np.random.seed(seed)
        random.seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    else:
        seed = None
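To see why seeding matters, the sketch below wraps the same calls in a helper and shows that re-seeding reproduces the same random draws. The helper name `seed_everything` and the seed value 1234 are arbitrary choices for this example; NumPy and PyTorch are seeded only if they are installed, so the snippet also runs with the standard library alone.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


seed_everything(1234)
first = [random.random() for _ in range(3)]
seed_everything(1234)
second = [random.random() for _ in range(3)]
print(first == second)  # True: the same seed gives the same sequence
```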

Let's go ahead and load the data. We'll also create the inference loader, which feeds batches of our dataset to the model on the GPU (or the CPU if no GPU is available).

In [11]:
    dw = data_utils.DataHandler(data_source, model_args.model_args)
    dw.load_folds(fold=0)

    data_loader = dw.inference_loader(reproducible=model_args.model_args['data_kwargs']['reproducible'],
                                      seed=seed,
                                      batch_size=model_args.model_args['train_kwargs']['batch_per_gpu'])

Loading data from ../data/inference
Num workers: 4, reproducible: True


Now we're ready to load a model and create the class to predict on our unlabeled data.

In [12]:
    model = load_model(model_dict, device, dw)

    # Make predictions from pretrained model
    evaluator = predictions.ScoreModel(model_args.model_args, data_loader, model, device)

model loaded


And we can make the predictions...

In [13]:
    evaluator.predict(dw.dict_maps['id2label'])


Predicting test set
Saving predictions to csv


The predictions on the unlabeled data are saved in the `predictions` folder, with the name specified in the `model_args` file and a time stamp to prevent name clashes and overwriting data.
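Since each run writes a new time-stamped file, a quick way to pick up the most recent one is to sort by modification time. This is a sketch under two assumptions: that the `predictions` folder sits next to this notebook, and that the output files end in `.csv`. The helper `latest_file` is made up for this example.

```python
import glob
import os


def latest_file(folder, pattern="*.csv"):
    """Return the most recently modified file in `folder`, or None if there is none."""
    paths = glob.glob(os.path.join(folder, pattern))
    if not paths:
        return None
    return max(paths, key=os.path.getmtime)


newest = latest_file("predictions")
print(newest if newest else "no prediction files found")
```

The resulting path can then be passed to `pd.read_csv` to inspect the predictions.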