# Multi-Output Models 

A multi-output model has more than one output, or **head**. For instance, a model might take the text of a user's social media profile and predict several attributes at once, such as age, gender, location, and income level.

# Multi-Output Models with `keras`

In [18]:
from tensorflow.keras.models import Model
from tensorflow.keras import layers
from tensorflow.keras import Input

## Defining Problem Constants
- `vocabulary_size` is the number of unique words which can occur in the text of any given social media post.
- `num_income_groups` is the number of income groups we wish to classify the user into based on their social media post.

In [19]:
vocabulary_size = 50000
num_income_groups = 10

## Creating an Input Tensor

`posts_input` is our `Input` tensor. It holds variable-length text sequences, each encoded as a sequence of integers, where every integer corresponds to one of the 50,000 words in the vocabulary.

We also give this `Input` a name, `posts`, so that when we fit the model later on, we can use a dictionary to specify which `numpy` array feeds `posts_input`.
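The text does not say how the posts are turned into integers, so the following is only a minimal sketch of one plausible encoding, using the standard library alone. The helper names `build_word_index` and `encode_and_pad`, the sample posts, and the choice to reserve index 0 for padding are all assumptions made for illustration.

```python
from collections import Counter

def build_word_index(posts, vocabulary_size):
    """Assign ids 1..vocabulary_size-1 to the most frequent words (0 is kept for padding)."""
    counts = Counter(word for post in posts for word in post.lower().split())
    most_common = counts.most_common(vocabulary_size - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

def encode_and_pad(posts, word_index):
    """Encode each post as a list of word ids, then right-pad with 0 to a common length."""
    encoded = [[word_index[w] for w in post.lower().split() if w in word_index]
               for post in posts]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [0] * (max_len - len(seq)) for seq in encoded]

# Hypothetical sample data, not taken from any real dataset
sample_posts = ["Just moved to Berlin", "Coffee first", "Berlin coffee is great"]

word_index = build_word_index(sample_posts, vocabulary_size=50000)
padded = encode_and_pad(sample_posts, word_index)

for row in padded:
    print(row)  # every row has the same length; short posts end in zeros
```

A nested list like `padded` could then be converted with `numpy.array(padded, dtype='int32')` before being passed to the model under the `posts` key.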

Next, an `Embedding` layer maps each integer token (one of 50,000 possible word indices) to a 256-dimensional vector. The layer takes the `posts_input` tensor as its argument and returns an `embedded_posts` tensor.

In [20]:
posts_input = Input(shape=(None, ),   # Post can have variable length
                   dtype='int32', 
                   name='posts')      # Naming this input for passing numpy tensor via dict later on

In [21]:
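# Embedding(input_dim, output_dim): 50,000 possible tokens, each mapped to a 256-dimensional vector
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)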

## Transforming the `Input` through `Conv1D` and `MaxPooling`

In [22]:
# First Conv/Pooling Pair
x = layers.Conv1D(filters=128, kernel_size=5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(pool_size=5)(x)

In [23]:
# Second Conv/Pooling Set
x = layers.Conv1D(filters=256, kernel_size=5, activation='relu')(x)
x = layers.Conv1D(filters=256, kernel_size=5, activation='relu')(x)
x = layers.MaxPooling1D(pool_size=5)(x)

In [24]:
# Third Conv/Pooling Set
x = layers.Conv1D(filters=256, kernel_size=5, activation='relu')(x)
x = layers.Conv1D(filters=256, kernel_size=5, activation='relu')(x)

# Collapse the sequence axis by keeping each filter's maximum; no arguments are needed
x = layers.GlobalMaxPooling1D()(x)
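Because none of these layers pads its input, each one shortens the sequence. The sketch below traces the sequence length through the stack above. It assumes the Keras defaults used in these cells (`padding='valid'` and `strides=1` for `Conv1D`, and a stride equal to `pool_size` for `MaxPooling1D`); the helper names are hypothetical.

```python
def conv1d_length(n, kernel_size):
    """Output length of Conv1D with padding='valid' and strides=1."""
    return n - kernel_size + 1

def maxpool1d_length(n, pool_size):
    """Output length of MaxPooling1D with padding='valid' and strides=pool_size."""
    return n // pool_size

def trace_lengths(n):
    """Sequence length after each conv/pool layer, starting from n input tokens."""
    steps = [
        (conv1d_length, 5), (maxpool1d_length, 5),                      # first pair
        (conv1d_length, 5), (conv1d_length, 5), (maxpool1d_length, 5),  # second set
        (conv1d_length, 5), (conv1d_length, 5),                         # third set
    ]
    lengths = [n]
    for layer_length, size in steps:
        lengths.append(layer_length(lengths[-1], size))
    return lengths

print(trace_lengths(500))  # [500, 496, 99, 95, 91, 18, 14, 10]
```

Under these assumptions, a 500-token post leaves 10 time steps for `GlobalMaxPooling1D`, and 269 tokens is the shortest input that still leaves at least one. Shorter posts would need to be padded before being fed to the model.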

## Output Layers

`x` now refers to the output tensor returned by the `GlobalMaxPooling1D` layer. Instead of passing it to a single layer that predicts one attribute, we pass it to a separate layer for each of the three attributes (`age`, `income`, `gender`).

This means our model has three heads or three outputs. 

**Every output layer has a name.**

### Age
`age_prediction` produces a continuous-valued output, so it is just one unit with the default (`linear`) activation function. 

In [25]:
age_prediction = layers.Dense(1, name='age')(x)

### Income 

We aren't predicting the exact continuous value of income. Rather, we're predicting which of the 10 predefined income classes the user belongs to. For this reason, we use a `softmax` activation in this layer, with one unit per income group.

In [26]:
income_prediction = layers.Dense(num_income_groups, 
                                activation='softmax', 
                                name='income')(x)

### Gender

For simplicity, this example treats gender as a binary variable. A single `sigmoid` unit is therefore a natural choice, because it predicts the probability that a given input belongs to the arbitrarily defined positive class representing one of the two genders.

In [27]:
gender_prediction = layers.Dense(1, 
                                 activation='sigmoid', 
                                 name='gender')(x)

## Building a Model

As before, we build the model with `Model`, but this time the outputs argument is a list of the tensors returned by each of our heads.

In [28]:
model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])

## Compilation

We have multiple output values now, which means we must specify a different loss function for each of them. 
- For instance, **age prediction** is a regression task, so mean squared or absolute error will be a good loss function for this output. 
- However, **gender prediction** is a binary classification task, so it requires `binary_crossentropy` as a loss function.
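- The loss function for **income prediction** is categorical crossentropy, because there are 10 different classes to choose from. Note that `categorical_crossentropy` expects one-hot encoded targets; if the income labels are plain integers, `sparse_categorical_crossentropy` is the matching choice.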
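To make the three loss functions concrete, here is a rough per-sample sketch in plain Python. It ignores the numerical clipping and batch averaging that Keras applies, the targets and predictions are made-up numbers, and the helper names are hypothetical.

```python
import math

def squared_error(y_true, y_pred):
    """Per-sample squared error (MSE averages this over a batch)."""
    return (y_true - y_pred) ** 2

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

def binary_crossentropy(y_true, p):
    """Negative log-likelihood of a 0/1 target under predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def one_hot(class_index, num_classes):
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]

num_income_groups = 10

# Made-up example: true age 30, predicted 32
age_loss = squared_error(30.0, 32.0)

# Made-up example: true income class 3, predicted with probability 0.3
income_target = one_hot(3, num_income_groups)
income_probs = [0.3 if i == 3 else 0.7 / 9 for i in range(num_income_groups)]
income_loss = categorical_crossentropy(income_target, income_probs)

# Made-up example: positive class, predicted with probability 0.9
gender_loss = binary_crossentropy(1.0, 0.9)

print(round(age_loss, 3), round(income_loss, 3), round(gender_loss, 3))  # 4.0 1.204 0.105
```

Even in this toy case the three values sit on quite different scales.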

Because we have named the heads of our network, we can use these names to specify their respective loss functions in the compilation step.

However, this is not a requirement. We can still compile the model by passing a list of loss functions, as long as they appear in the same order as the output tensors we listed when building the `Model`. Still, it is better to be explicit than implicit.

### Without Specifying Heads

In [30]:
model.compile(optimizer='rmsprop', 
             loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'])

### Specifying Heads

In [31]:
model.compile(optimizer='rmsprop', 
             loss={
                 'age': 'mse', 
                 'income': 'categorical_crossentropy', 
                 'gender': 'binary_crossentropy'
             })

## Loss Weighting

The losses of a multi-output model rarely sit on the same scale. Keras combines them into a single total loss by summing them, so the gradient descent updates tend to favour the head with the largest loss value, sometimes at the expense of the other heads.

This is common when the losses occupy different scales. For example, typical MSE values for age can be an order of magnitude larger than typical crossentropy values.

To remedy this, we can assign weights to each loss in an attempt to 'equalize' the loss values produced for each head.

In the following cell, we have 'scaled up' the loss values for `binary_crossentropy`, and scaled down the loss values for `mse`. `categorical_crossentropy` loss values remain unchanged since they are expected to be in roughly the same range as the other two losses after scaling.
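The effect of the weights can be sketched in plain Python: the quantity being minimised is the weighted sum of the per-head losses. The loss values below are hypothetical and chosen only to show the scales involved, and `weighted_total` is an illustrative helper rather than a Keras function.

```python
def weighted_total(losses, weights):
    """Weighted sum of per-head losses, keyed by head name."""
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical per-head loss values at some point during training
losses = {'age': 4.0, 'income': 1.2, 'gender': 0.1}
loss_weights = {'age': 0.25, 'income': 1.0, 'gender': 10.0}

# Without weights, the age head accounts for most of the total
unweighted = sum(losses.values())
age_share = losses['age'] / unweighted

# With weights, each head contributes a comparable amount
contributions = {name: loss_weights[name] * value for name, value in losses.items()}
weighted = weighted_total(losses, loss_weights)

print(f"unweighted total: {unweighted:.2f} (age share: {age_share:.0%})")
print(f"weighted total:   {weighted:.2f}")
print(contributions)
```

With these made-up numbers the age head makes up roughly three quarters of the unweighted total, while after weighting the three heads contribute about 1.0, 1.2, and 1.0 respectively.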

### Without Specifying Heads

In [None]:
# Without specifying head names
model.compile(optimizer='rmsprop', 
             loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'], 
             loss_weights=[0.25, 1., 10.])

### Specifying Heads

In [32]:
model.compile(optimizer='rmsprop', 
             loss={
                 'age': 'mse', 
                 'income': 'categorical_crossentropy',
                 'gender': 'binary_crossentropy', 
             }, 
             loss_weights={
                 'age': 0.25, 
                 'income': 1, 
                 'gender': 10.,
             })

## Training 

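Again, we can use the head names to specify the training labels/targets when fitting. In the cells below, `posts`, `age_targets`, `income_targets`, and `gender_targets` are assumed to be `numpy` arrays prepared beforehand.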

### Without Specifying Heads

In [None]:
model.fit(posts, [age_targets, income_targets, gender_targets], 
         epochs=10, batch_size=64)

### Specifying Heads

In [None]:
model.fit(posts, {
    'age': age_targets, 
    'income': income_targets, 
    'gender': gender_targets
}, epochs=10, batch_size=64)