In [None]:
# default_exp tutorial

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Tutorial

In this tutorial, we'll start with the most basic example to get you up and running with the model as quickly and easily as possible. Then we'll dive into some more complicated examples, which should give you some insight into what happens behind the scenes.

## Minimal Example

In this example, we'll create toy problems and use them to train, evaluate and predict. But first, we need to define what a problem means here. Essentially, a problem should have **a name (string), a problem type (string), and a preprocessing function (callable)**. The following problem types are pre-defined:

- `cls`: classification
- `seq_tag`: sequence labeling
- `multi_cls`: multi-label classification
- `mask_lm`: masked language model
- `pretrain`: masked lm + next sentence prediction

Normally, you would want to use this library to do multi-task learning. There are two chaining operators that can be used to chain problems.

- `&`. If two problems have the same inputs, they can be chained using `&`. Problems chained by `&` will be trained at the same time.
- `|`. If two problems don't have the same inputs, they need to be chained using `|`. Problems chained by `|` will be sampled to train at every instance.
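To make the chaining syntax concrete, here is a minimal sketch of how a chained problem string could be read. `split_problem_string` and the problem name `other_cls` are hypothetical and only serve as illustration; the sketch also assumes that `&` binds tighter than `|`. It is not the library's own parser.

```python
from typing import List


def split_problem_string(problem: str) -> List[List[str]]:
    """Split a chained problem string into groups.

    Hypothetical helper for illustration only; not part of the library.
    Each inner list holds problems joined by `&` (trained at the same time);
    the outer list holds the groups separated by `|` (sampled).
    """
    return [group.split('&') for group in problem.split('|')]


# Two problems that share the same inputs.
print(split_problem_string('toy_cls&toy_seq_tag'))
# → [['toy_cls', 'toy_seq_tag']]

# `other_cls` is a made-up problem with different inputs.
print(split_problem_string('toy_cls&toy_seq_tag|other_cls'))
# → [['toy_cls', 'toy_seq_tag'], ['other_cls']]
```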

If your problem does not fall into one of the pre-defined problem types, you can implement your own and register it with params. We will cover this topic later. Let's start with a simple example of adding a classification problem and a sequence labeling problem.

In [None]:
# define the toy problems' names and problem types
problem_type_dict = {'toy_cls': 'cls', 'toy_seq_tag': 'seq_tag'}

Then we need to do some coding. We need to implement a preprocessing function for each problem. The preprocessing function is a callable that

- has the same name as the problem
- has a fixed input signature
- returns (or yields) inputs and targets
- is decorated by `bert_multitask_learning.preproc_decorator.preprocessing_fn`

In [None]:
# define a simple preprocessing function
import bert_multitask_learning
from bert_multitask_learning.preproc_decorator import preprocessing_fn
from bert_multitask_learning.params import BaseParams
@preprocessing_fn
def toy_cls(params: BaseParams, mode: str):
    "Simple example to demonstrate single-modal tuple of list return"
    if mode == bert_multitask_learning.TRAIN:
        toy_input = ['this is a test' for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    else:
        toy_input = ['this is a test' for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    return toy_input, toy_target

@preprocessing_fn
def toy_seq_tag(params: BaseParams, mode: str):
    "Simple example to demonstrate single-modal tuple of list return for sequence labeling"
    if mode == bert_multitask_learning.TRAIN:
        toy_input = ['this is a test'.split(' ') for _ in range(10)]
        toy_target = [['a', 'b', 'c', 'd'] for _ in range(10)]
    else:
        toy_input = ['this is a test'.split(' ') for _ in range(10)]
        toy_target = [['a', 'b', 'c', 'd'] for _ in range(10)]
    return toy_input, toy_target

processing_fn_dict = {'toy_cls': toy_cls, 'toy_seq_tag': toy_seq_tag}

Now we're good to go! Since these two toy problems share the same input, we can chain them with `&`.

In [None]:
# collapse_output
from bert_multitask_learning import train_bert_multitask, eval_bert_multitask, predict_bert_multitask
problem = 'toy_cls&toy_seq_tag'
# train
model = train_bert_multitask(
    problem=problem,
    num_epochs=1,
    problem_type_dict=problem_type_dict,
    processing_fn_dict=processing_fn_dict,
    continue_training=True
)


Adding new problem toy_cls, problem type: cls
Adding new problem toy_seq_tag, problem type: seq_tag
INFO:tensorflow:sampling weights: 
INFO:tensorflow:toy_cls_toy_seq_tag: 1.0
INFO:tensorflow:sampling weights: 
INFO:tensorflow:toy_cls_toy_seq_tag: 1.0
INFO:tensorflow:sampling weights: 
INFO:tensorflow:toy_cls_toy_seq_tag: 1.0
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Some layers from the model checkpoint at bert-base-chinese were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClass

For eval, we need to provide either `model_dir` or `model` to the function.

In [None]:
# hide_output
# eval
eval_dict = eval_bert_multitask(problem=problem,
                    problem_type_dict=problem_type_dict, processing_fn_dict=processing_fn_dict,
                    model_dir=model.params.ckpt_dir)

esolved object in checkpoint: (root).optimizer's state 'v' for (root).body.bert.bert_model.bert.encoder.layer.1.attention.self_attention.query.kernel
The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.


In [None]:
print(eval_dict)

{'loss': 2.7317519187927246, 'mean_acc': 0.38333335518836975, 'toy_cls_acc': 0.6000000238418579, 'toy_seq_tag_acc': 0.1666666716337204}
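As a quick sanity check on these numbers, the sketch below recomputes the average of the per-problem accuracies. In this run it matches `mean_acc` up to floating-point precision, which suggests (but does not prove) that `mean_acc` is the unweighted mean over problems. The variable name `eval_result` is only used here to avoid overwriting `eval_dict`.

```python
# Values copied from the eval output above.
eval_result = {
    'loss': 2.7317519187927246,
    'mean_acc': 0.38333335518836975,
    'toy_cls_acc': 0.6000000238418579,
    'toy_seq_tag_acc': 0.1666666716337204,
}

# Collect the per-problem accuracies: every `*_acc` key except `mean_acc`.
per_problem = {
    key: value
    for key, value in eval_result.items()
    if key.endswith('_acc') and key != 'mean_acc'
}
average = sum(per_problem.values()) / len(per_problem)

# The two values agree to well within 1e-6.
print(abs(average - eval_result['mean_acc']) < 1e-6)
```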


In [None]:
# hide_output
# predict
fake_inputs = ['this is a test'.split(' ') for _ in range(10)]
pred, model = predict_bert_multitask(
    problem=problem,
    inputs=fake_inputs, model_dir=model.params.ckpt_dir,
    problem_type_dict=problem_type_dict,
    processing_fn_dict=processing_fn_dict, return_model=True)

Adding new problem toy_cls, problem type: cls
Adding new problem toy_seq_tag, problem type: seq_tag
INFO:tensorflow:Checkpoint dir: models/toy_cls_toy_seq_tag_ckpt
INFO:tensorflow:['this', 'is', 'a', 'test']
INFO:tensorflow:input_ids: [101, 8554, 8310, 143, 10060, 102]
INFO:tensorflow:input_mask: [1, 1, 1, 1, 1, 1]
INFO:tensorflow:segment_ids: [0, 0, 0, 0, 0, 0]

`pred` is a dictionary with the problem name as key and the probability distribution array as value.

In [None]:
for problem_name, prob_array in pred.items():
    print(f'{problem_name} - {prob_array.shape}')

toy_cls - (10, 2)
toy_seq_tag - (10, 6, 5)
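To turn probability arrays like these into class ids, one option is an argmax over the last axis. The sketch below uses random stand-in arrays with the same shapes as above so that it runs without a trained model, and it assumes the last axis indexes the classes. Mapping the ids back to label strings depends on the library's label encoder and is not shown here.

```python
import numpy as np

# Stand-in arrays with the same shapes as the `pred` values printed above;
# random numbers are used so the snippet runs without a trained model.
rng = np.random.default_rng(0)
fake_pred = {
    'toy_cls': rng.random((10, 2)),
    'toy_seq_tag': rng.random((10, 6, 5)),
}

# Assumption: the last axis indexes the classes, so argmax over it gives
# one class id per example (cls) or per token position (seq_tag).
pred_ids = {name: probs.argmax(axis=-1) for name, probs in fake_pred.items()}

for name, ids in pred_ids.items():
    print(f'{name} - {ids.shape}')
# → toy_cls - (10,)
# → toy_seq_tag - (10, 6)
```

The second axis of `toy_seq_tag` has length 6, the same length as the `input_ids` in the prediction log, which include the special tokens around the four words.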


## Use Different Models

By default, we use BERT as the base model. But thanks to transformers, it's easy to switch to any other SOTA transformer model: set a few configuration values and pass the params to the train function as an argument.

In [None]:
# hide_output
# change model to distilbert-base-uncased
from bert_multitask_learning.params import BaseParams
params = BaseParams()
# specify model and its loading module
params.transformer_model_name = 'distilbert-base-uncased'
params.transformer_model_loading = 'TFDistilBertModel'
# specify tokenizer and its loading module
params.transformer_tokenizer_name = 'distilbert-base-uncased'
params.transformer_tokenizer_loading = 'DistilBertTokenizer'
# specify config and its loading module
params.transformer_config_name = 'distilbert-base-uncased'
params.transformer_config_loading = 'DistilBertConfig'

# train model
model = train_bert_multitask(
    problem=problem,
    num_epochs=1,
    problem_type_dict=problem_type_dict,
    processing_fn_dict=processing_fn_dict,
    continue_training=True,
    params=params # pass params
)


NING:tensorflow:Unresolved object in checkpoint: (root).optimizer's state 'm' for (root).body.bert.bert_model.distilbert.transformer.layer.5.ffn.lin1.kernel
The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.

## Write More Flexible Preprocessing Function

The preprocessing function should return two elements, inputs and targets, except for the `pretrain` problem type. You don't need to manually tokenize your inputs or encode the targets; this is done automatically.

The features and targets can be returned in one of the following formats:
- Tuple of list. This is the way to go if your data fits into memory.
- Generator of tuple. You should use a generator if your data does not fit into memory.

The features can be single-modal or multi-modal. We will elaborate on each form of input in this section.

Please note that if a preprocessing function returns a generator of tuples, the corresponding problem cannot be chained using `&`.
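To illustrate why a generator helps when the data does not fit into memory, here is a minimal sketch that streams (text, label) pairs from a tab-separated file one line at a time. The helper `stream_text_label_pairs` and the file layout are assumptions made for this example; they are not part of the library. Inside a preprocessing function, you could yield from a loop like this one.

```python
import os
import tempfile
from typing import Iterator, Tuple


def stream_text_label_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield one (text, label) tuple per line of a tab-separated file.

    Hypothetical helper: the file layout is an assumption for illustration.
    Only one line is held in memory at a time.
    """
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue  # skip blank lines
            text, label = line.split('\t')
            yield text, label


# Write a tiny file so the sketch is runnable end to end.
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy.tsv')
with open(tmp_path, 'w', encoding='utf8') as f:
    f.write('this is a test\ta\nthis is another test\tb\n')

pairs = list(stream_text_label_pairs(tmp_path))
print(pairs)
# → [('this is a test', 'a'), ('this is another test', 'b')]
```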

### Single Modal Input

#### Tuple of list

Normal text, label pair

In [None]:
@preprocessing_fn
def toy_cls(params: BaseParams, mode: str):
    "Simple example to demonstrate single-modal tuple of list return"
    if mode == bert_multitask_learning.TRAIN:
        toy_input = ['this is a test' for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    else:
        toy_input = ['this is a test' for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    return toy_input, toy_target

A, B tokens

In [None]:
@preprocessing_fn
def toy_cls(params: BaseParams, mode: str):
    "Simple example to demonstrate single-modal A, B token tuple of list return"
    if mode == bert_multitask_learning.TRAIN:
        toy_input = [{'a': 'this is a test', 'b': 'this is a test'} for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    else:
        toy_input = [{'a': 'this is a test', 'b': 'this is a test'} for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    return toy_input, toy_target

#### Generator of tuple

Normal text, label pair

In [None]:
@preprocessing_fn
def toy_cls(params: BaseParams, mode: str):
    "Simple example to demonstrate single-modal generator of tuple return"
    if mode == bert_multitask_learning.TRAIN:
        toy_input = ['this is a test' for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    else:
        toy_input = ['this is a test' for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    for single_input, single_target in zip(toy_input, toy_target):
        yield single_input, single_target

A, B tokens: this works the same way as the tuple of list case, so the example is skipped.

### Multi-modal Input

Inputs from the other modality should be passed as an array.

#### Tuple of list

In [None]:
# hide
import numpy as np

In [None]:
@preprocessing_fn
def toy_cls(params: BaseParams, mode: str):
    "Simple example to demonstrate multi-modal tuple of list return"
    if mode == bert_multitask_learning.TRAIN:
        toy_input = [{'text': 'this is a test', 'image': np.random.uniform(size=(16))} for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    else:
        toy_input = [{'text': 'this is a test', 'image': np.random.uniform(size=(16))} for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    
    return toy_input, toy_target

#### Generator of tuple

In [None]:
@preprocessing_fn
def toy_cls(params: BaseParams, mode: str):
    "Simple example to demonstrate multi-modal generator of tuple return"
    if mode == bert_multitask_learning.TRAIN:
        toy_input = [{'text': 'this is a test', 'image': np.random.uniform(size=(16))} for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    else:
        toy_input = [{'text': 'this is a test', 'image': np.random.uniform(size=(16))} for _ in range(10)]
        toy_target = ['a' if i <=5 else 'b' for i in range(10)]
    
    for single_input, single_target in zip(toy_input, toy_target):
        yield single_input, single_target

Note: A, B tokens are not yet supported for multi-modal input!

### What Happened?

The inputs returned by the preprocessing function are tokenized with the transformers tokenizer, which is configurable as shown before, and the labels are encoded (or tokenized, if the target is text) as scalars or numpy arrays. The encoded inputs and targets are then serialized and written as TFRecord. Please note that the TFRecord will NOT be overwritten even if you run the code again, so if you want to change the data in the TFRecord, you need to manually remove its directory. The default directory is `./tmp/{problem_name}`.
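Removing the TFRecord directory by hand can be scripted. The sketch below is a hypothetical helper, not part of the library; it only deletes `{tmp_file_dir}/{problem_name}` if that directory exists. It is demonstrated on a throwaway directory. For real data you would pass your own temporary directory (for example `params.tmp_file_dir`) and the folder name of your problem, such as `toy_cls_toy_seq_tag` for the two problems chained with `&` in this tutorial.

```python
import os
import shutil
import tempfile


def remove_tfrecord_dir(tmp_file_dir: str, problem_name: str) -> bool:
    """Delete `{tmp_file_dir}/{problem_name}` if it exists.

    Hypothetical helper, not part of the library. Returns True when a
    directory was removed and False when there was nothing to remove.
    """
    target = os.path.join(tmp_file_dir, problem_name)
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    return False


# Demonstrate on a throwaway directory instead of a real `./tmp` folder.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, 'toy_cls_toy_seq_tag'))

removed = remove_tfrecord_dir(demo_root, 'toy_cls_toy_seq_tag')
removed_again = remove_tfrecord_dir(demo_root, 'toy_cls_toy_seq_tag')
print(removed, removed_again)
# → True False
```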

After the TFRecord is created, you can check the feature info by heading to the corresponding directory and taking a look at the JSON file within.

First, we make sure the TFRecord is created.

In [None]:
# train_eval_input_fn creates and reads the TFRecord, and returns a dataset
from bert_multitask_learning.input_fn import train_eval_input_fn

dataset = train_eval_input_fn(params)

INFO:tensorflow:sampling weights: 
INFO:tensorflow:toy_cls_toy_seq_tag: 1.0


Now we can take a look at the JSON file.

In [None]:
import json
import os

# problems chained by & share one TFRecord folder
json_path = os.path.join(params.tmp_file_dir, 'toy_cls_toy_seq_tag', 'train_feature_desc.json')
# use a context manager so the file handle is closed after reading
with open(json_path, 'r', encoding='utf8') as f:
    print(json.dumps(json.load(f), indent=4))

{
    "input_ids": "int64",
    "input_ids_shape_value": [
        null
    ],
    "input_ids_shape": "int64",
    "input_mask": "int64",
    "input_mask_shape_value": [
        null
    ],
    "input_mask_shape": "int64",
    "segment_ids": "int64",
    "segment_ids_shape_value": [
        null
    ],
    "segment_ids_shape": "int64",
    "toy_cls_label_ids": "int64",
    "toy_cls_label_ids_shape": "int64",
    "toy_cls_label_ids_shape_value": [],
    "toy_seq_tag_label_ids": "int64",
    "toy_seq_tag_label_ids_shape_value": [
        null
    ],
    "toy_seq_tag_label_ids_shape": "int64"
}
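The flat key layout can be easier to read when grouped per feature. The sketch below works on a trimmed copy of the JSON above. `summarize_features` is a hypothetical helper, and the reading of the `_shape` / `_shape_value` suffixes is inferred from the file itself rather than taken from the library: a `null` entry looks like a variable-length dimension, and an empty list looks like a scalar.

```python
import json

# A trimmed copy of the JSON shown above.
feature_desc = json.loads('''
{
    "input_ids": "int64",
    "input_ids_shape_value": [null],
    "input_ids_shape": "int64",
    "toy_cls_label_ids": "int64",
    "toy_cls_label_ids_shape": "int64",
    "toy_cls_label_ids_shape_value": []
}
''')


def summarize_features(desc: dict) -> dict:
    """Map each feature name to its dtype and recorded shape value.

    Hypothetical helper; the suffix convention is inferred from the file.
    """
    summary = {}
    for key, value in desc.items():
        if key.endswith('_shape') or key.endswith('_shape_value'):
            continue  # bookkeeping entries, looked up via the base name
        summary[key] = {'dtype': value, 'shape': desc.get(f'{key}_shape_value')}
    return summary


for name, info in summarize_features(feature_desc).items():
    print(name, info)
# → input_ids {'dtype': 'int64', 'shape': [None]}
# → toy_cls_label_ids {'dtype': 'int64', 'shape': []}
```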
