# Multi-task Supervision Tutorial

In this tutorial we demonstrate how to use the multi-task versions of the label model and end model. We do this with a simple synthetic dataset, focusing primarily on input/output interfaces of these models. In a future tutorial, we will demonstrate the multi-task workflow on a real-world problem with additional scale and complexity, and illustrate the benefits that come from jointly modeling the weak supervision.

For multi-task problems, we execute our pipeline in five steps; for more detail see our latest working [technical draft](https://ajratner.github.io/assets/papers/mts-draft.pdf):
1. **Load Data:** As in the `Basics` tutorial, we only have access to unlabeled data points `X`, and noisy labels---which are now in the form of `t` matrices, one for each different _task_.
2. **Define Task Graph:** The `TaskGraph` defines the structure of logical relationships between tasks.
3. **Train Label Model:** The purpose of the `LabelModel` is again to estimate the unknown accuracies of the labeling functions, _without access to `Y`_, and then use this to denoise and combine them into a set of _probabilistic multi-task training labels_.
4. **Train End Model:** We can then use these training labels to supervise a multi-task learning (MTL) model, which optionally inherits network structure from the `TaskGraph`.
5. **Evaluate:** We evaluate this model on a held-out test set.

## Step 1: Load Data

We first load our data.

The data types for the multi-task setting mirror those of the single-task setting, but with an extra dimension for the number of tasks (t), and with the single-task cardinality (k) replaced by multiple task-specific cardinalities (K_t):

* X: a t-length list of \[n\]-dim iterables of end model inputs OR a single \[n\]-dim iterable of inputs if all tasks operate on the same input
* Y: a t-length list of \[n\]-dim numpy.ndarray of target labels (Y[i] $\in$ {1,...,K_t})
* L: a t-length list of \[n,m\] scipy.sparse matrices of noisy labels (L[i,j] $\in$ {0,...,K_t}, with label 0 reserved for abstentions)

And optionally (for use with some debugging/analysis tools):
* D: a t-length list of \[n\]-dim iterables of human-readable examples (e.g. sentences) OR a single \[n\]-dim iterable of examples if all tasks operate on the same data
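
To make these shapes concrete, here is a minimal sketch that builds random stand-ins for X, Y, and L. The sizes `n` and `m`, the feature width, and the random values are illustrative assumptions, not the contents of the tutorial dataset.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n, m = 6, 4                # hypothetical: 6 data points, 4 labeling functions per task
cardinalities = [2, 3, 3]  # K_t for each of the t = 3 tasks
t = len(cardinalities)

# X: a single [n]-dim iterable shared by all tasks (here, n feature vectors)
X = rng.normal(size=(n, 5))

# Y: t-length list of [n] arrays with labels in {1, ..., K_t}
Y = [rng.integers(1, K + 1, size=n) for K in cardinalities]

# L: t-length list of [n, m] sparse matrices with entries in {0, ..., K_t};
# 0 means the labeling function abstained on that data point
L = [
    sparse.csr_matrix(rng.integers(0, K + 1, size=(n, m)))
    for K in cardinalities
]

for task, (Y_t, L_t, K) in enumerate(zip(Y, L, cardinalities)):
    print(f"task {task}: Y shape {Y_t.shape}, L shape {L_t.shape}, K_t = {K}")
```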

We load data that has been pre-split into train/dev/test splits in 80/10/10 proportions.

In [4]:
import pickle
with open("data/multitask_tutorial.pkl", 'rb') as f:
    Xs, Ys, Ls, Ds = pickle.load(f)

## Step 2: Define Task Graph

The primary role of the task graph is to define a set of feasible target label vectors.
For example, consider the following set of classification tasks, wherein we assign text entities to one of the given labels:

T0: Y0 ∈ {PERSON, ORG}  
T1: Y1 ∈ {DOCTOR, OTHER PERSON, NOT APPLICABLE}  
T2: Y2 ∈ {HOSPITAL, OTHER ORG, NOT APPLICABLE}  

Observe that the tasks are related by logical implication relationships: if Y0 = PERSON,
then Y2 = NOT APPLICABLE, since Y2 classifies ORGs. Thus, in this task structure, [PERSON, DOCTOR, NOT APPLICABLE] is a feasible label vector, whereas [PERSON, DOCTOR, HOSPITAL] is not.
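
The feasible set can be enumerated directly. The sketch below uses plain Python rather than the library. It assumes 1-indexed labels in the order listed above, and it assumes the symmetric rule that an ORG entity gets NOT APPLICABLE for Y1 while the applicable child task never does. The helper `is_feasible` is hypothetical.

```python
from itertools import product

# Label encodings follow the task definitions above (1-indexed).
PERSON, ORG = 1, 2
NOT_APPLICABLE = 3  # the last label of both T1 and T2

def is_feasible(y0, y1, y2):
    """Assumed rule: exactly the child task matching Y0 is applicable."""
    if y0 == PERSON:
        return y1 != NOT_APPLICABLE and y2 == NOT_APPLICABLE
    return y1 == NOT_APPLICABLE and y2 != NOT_APPLICABLE

# All 2 * 3 * 3 = 18 candidate label vectors, filtered by the rule
candidates = product([1, 2], [1, 2, 3], [1, 2, 3])
feasible = [y for y in candidates if is_feasible(*y)]
print(feasible)
# [(1, 1, 3), (1, 2, 3), (2, 3, 1), (2, 3, 2)]
```

Under these assumptions, only 4 of the 18 possible label vectors are feasible.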

To reflect this feasible label set, we define our task graph for this problem with a TaskHierarchy, a subclass of TaskGraph which assumes that label K_t for each non-root node is the "NOT APPLICABLE" class.

In [5]:
from metal.multitask import TaskHierarchy
task_graph = TaskHierarchy(cardinalities=[2,3,3], edges=[(0,1), (0,2)])

## Step 3: Train Label Model

We now pass our TaskGraph into the multi-task label model to instantiate a model with the appropriate structure.

In [6]:
from metal.multitask import MTLabelModel
label_model = MTLabelModel(task_graph=task_graph)

We then train the model, computing the overlap matrix $O$ and estimating accuracies $\mu$...

In [7]:
label_model.train_model(Ls[0], n_epochs=200, print_every=20, seed=123)

Computing O...
Estimating \mu...
[E:0]	Train Loss: 2.785
[E:20]	Train Loss: 0.451
[E:40]	Train Loss: 0.053
[E:60]	Train Loss: 0.027
[E:80]	Train Loss: 0.026
[E:100]	Train Loss: 0.025
[E:120]	Train Loss: 0.025
[E:140]	Train Loss: 0.025
[E:160]	Train Loss: 0.025
[E:180]	Train Loss: 0.025
[E:199]	Train Loss: 0.025
Finished Training


In [8]:
Ls[2]

[<100x10 sparse matrix of type '<class 'numpy.float64'>'
 	with 846 stored elements in Compressed Sparse Row format>,
 <100x10 sparse matrix of type '<class 'numpy.float64'>'
 	with 846 stored elements in Compressed Sparse Row format>,
 <100x10 sparse matrix of type '<class 'numpy.float64'>'
 	with 846 stored elements in Compressed Sparse Row format>]

As with the single-task case, we can score this trained model to evaluate it directly, or use it to make predictions for our training set that will then be used to train a multi-task end model.

In [9]:
label_model.score((Ls[1], Ys[1]))

Accuracy: 0.900


0.9

In [10]:
# Y_train_ps stands for "Y[labels]_train[split]_p[redicted]s[oft]"
Y_train_ps = label_model.predict_proba(Ls[0])
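
As a rough illustration of what probabilistic multi-task labels look like, the sketch below assumes they form a t-length list of \[n, K_t\] arrays whose rows sum to 1. This is an assumption about the output format, and the values are made up. Hard labels can then be recovered by taking the most probable class per task.

```python
import numpy as np

# Hypothetical soft labels for n = 2 data points and cardinalities [2, 3, 3]
Y_ps = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),
    np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]),
    np.array([[0.1, 0.1, 0.8], [0.6, 0.3, 0.1]]),
]

# Labels are 1-indexed, so add 1 to the index of the most probable column
Y_hard = [Y_t.argmax(axis=1) + 1 for Y_t in Y_ps]

for task, Y_t in enumerate(Y_hard):
    print(f"task {task}: {Y_t.tolist()}")
```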

## Step 4: Train End Model

As with the single-task end model, the multi-task end model consists of three components: input layers, middle layers, and task head layers. Again, each layer consists of a torch.nn.Module followed by various optional additional operators (e.g., a ReLU nonlinearity, batch normalization, and/or dropout).

**Input layers**: The input module is an IdentityModule by default. If your tasks accept inputs of different types (e.g., one task over images and another over text), you may pass in a t-length list of input modules.

**Middle layers**: The middle modules are nn.Linear by default and are shared by all tasks.

**Head layers**: The t task head modules are nn.Linear modules by default. You may instead pass in a custom module to be used by all tasks or a t-length list of modules. These task heads are unique to each task, sharing no parameters with other tasks. Their output is fed to a set of softmax operators whose output dimensions are equal to the cardinalities for each task.
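
The shared-trunk, separate-heads structure can be sketched as a plain NumPy forward pass. This is not the `MTEndModel` implementation. The layer sizes and cardinalities match the example that follows, while the random weights and the helpers `softmax` and `forward` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(123)
layer_dims = [1000, 100, 10]  # input width, then two middle layers
cardinalities = [2, 3, 3]     # one softmax width per task head

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Middle layers: (weights, bias) pairs shared by all tasks
middle = [
    (rng.normal(scale=0.01, size=(d_in, d_out)), np.zeros(d_out))
    for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:])
]

# Task heads: one (weights, bias) pair per task, no parameters shared
heads = [
    (rng.normal(scale=0.01, size=(layer_dims[-1], K)), np.zeros(K))
    for K in cardinalities
]

def forward(X):
    h = X  # identity input layer
    for W, b in middle:
        h = np.maximum(h @ W + b, 0.0)  # linear layer followed by ReLU
    return [softmax(h @ W + b) for W, b in heads]

X = rng.normal(size=(4, layer_dims[0]))
probs = forward(X)
print([p.shape for p in probs])
```

Each task receives its own \[n, K_t\] matrix of class probabilities computed from the same shared representation.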

Here we construct a simple graph with a single (identity) input module, two intermediate layers, and linear task heads attached to the top layer.

In [11]:
from metal.multitask import MTEndModel
import torch
use_cuda = torch.cuda.is_available()
end_model = MTEndModel([1000,100,10], task_graph=task_graph, seed=123)


Network architecture:

--Input Layer--
IdentityModule()

--Middle Layers--
(layer1):
Sequential(
  (0): Linear(in_features=1000, out_features=100, bias=True)
  (1): ReLU()
)

(layer2):
Sequential(
  (0): Linear(in_features=100, out_features=10, bias=True)
  (1): ReLU()
)
(head0)
Linear(in_features=10, out_features=2, bias=True)
(head1)
Linear(in_features=10, out_features=3, bias=True)
(head2)
Linear(in_features=10, out_features=3, bias=True)




In [12]:
end_model.train_model((Xs[0], Y_train_ps), valid_data=(Xs[1], Ys[1]), n_epochs=5, seed=123)

100%|██████████| 25/25 [00:00<00:00, 161.90it/s]


Saving model at iteration 0 with best score 0.833
[E:0]	Train Loss: 2.260	Dev score: 0.833


100%|██████████| 25/25 [00:00<00:00, 230.25it/s]


Saving model at iteration 1 with best score 0.930
[E:1]	Train Loss: 1.334	Dev score: 0.930


100%|██████████| 25/25 [00:00<00:00, 259.97it/s]


Saving model at iteration 2 with best score 0.937
[E:2]	Train Loss: 1.054	Dev score: 0.937


100%|██████████| 25/25 [00:00<00:00, 256.49it/s]


[E:3]	Train Loss: 0.911	Dev score: 0.917


100%|██████████| 25/25 [00:00<00:00, 258.90it/s]


[E:4]	Train Loss: 0.864	Dev score: 0.903
Restoring best model from iteration 2 with score 0.937
Finished Training
Accuracy: 0.937


## Step 5: Evaluate

When it comes to scoring our multi-task models, the mean task accuracy is reported by default.

In [13]:
print("Label Model:")
score = label_model.score((Ls[2], Ys[2]))

print()

print("End Model:")
score = end_model.score((Xs[2], Ys[2]))

Label Model:
Accuracy: 0.850

End Model:
Accuracy: 0.927


We can also, however, pass `reduce=None` to get back a list of task-specific accuracies.

In [14]:
scores = end_model.score((Xs[2], Ys[2]), reduce=None)

Accuracy (t=0): 0.930
Accuracy (t=1): 0.920
Accuracy (t=2): 0.930
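
As a quick sanity check, averaging the three per-task accuracies printed above reproduces the mean accuracy reported for the end model. The sketch uses the rounded values from the output, so it is only an approximate check.

```python
# Per-task accuracies as printed above (rounded to three decimals)
task_accuracies = [0.930, 0.920, 0.930]

mean_accuracy = sum(task_accuracies) / len(task_accuracies)
print(f"Mean task accuracy: {mean_accuracy:.3f}")
# Mean task accuracy: 0.927
```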


And to get the predictions for all three tasks, we can call predict():

In [15]:
Y_p = end_model.predict(Xs[2])
Y_p

[array([2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2,
        2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1,
        1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 2,
        1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2,
        2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1]),
 array([3, 2, 1, 3, 2, 1, 2, 2, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3,
        3, 3, 3, 3, 2, 3, 1, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 3, 3, 1, 1, 1,
        2, 1, 1, 2, 1, 3, 3, 2, 2, 2, 3, 2, 3, 1, 3, 3, 3, 1, 2, 3, 3, 3,
        1, 3, 2, 2, 3, 3, 1, 3, 2, 3, 3, 3, 1, 3, 1, 2, 1, 3, 3, 3, 3, 3,
        3, 3, 2, 1, 1, 1, 1, 3, 3, 3, 2, 2]),
 array([1, 3, 3, 1, 3, 3, 3, 3, 3, 2, 1, 1, 2, 1, 2, 2, 1, 1, 3, 3, 3, 1,
        2, 1, 2, 1, 3, 2, 3, 3, 3, 2, 3, 1, 1, 3, 3, 1, 3, 1, 1, 3, 3, 3,
        3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 2, 3, 2, 3, 1, 1, 1, 3, 3, 1, 1, 1,
        3, 1, 3, 3, 1, 2, 3, 2, 3, 2, 2, 1, 3, 1, 3, 3, 3, 1, 2, 1, 2, 1,
        2, 3, 3, 3, 