# Challenge - Multimodal Transformers Big-Gan

![](https://images.unsplash.com/photo-1512572525676-f9b59951929e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=952&q=80)

## Introduction

⚠️ **For more context on this challenge, be sure to read the README**

In this challenge you will create a mapping model that connects the output of a Transformer model to the input of a BigGAN model.

There are two main parts in this problem:
1. Preprocessing: in this phase we will use a Transformer to obtain the inputs for our mapping model and the BigGAN to obtain the outputs of our mapping model.
2. Building the mapping model: this is the phase where we define the architecture that maps the hidden states a Transformer outputs for a sentence to an embedding that is used as input to the BigGAN. For example, if we use a Transformer that outputs hidden states with a dimension of 768, given that the BigGAN embeddings have a dimension of 128, our mapping model should take vectors of dimension 768 and output vectors of dimension 128.
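
To make the shapes in this example concrete, here is a minimal sketch of a single linear layer with random, untrained weights. The names `weights`, `bias` and `linear_mapping` are hypothetical and only illustrate the 768 to 128 dimension change; they are not the architecture you are expected to build.

```python
import numpy as np

HIDDEN_DIM = 768     # Transformer hidden-state size used in the example above
EMBEDDING_DIM = 128  # BigGAN class-embedding size

rng = np.random.default_rng(0)

# Hypothetical, untrained parameters of one linear layer.
weights = rng.normal(scale=0.01, size=(HIDDEN_DIM, EMBEDDING_DIM))
bias = np.zeros(EMBEDDING_DIM)


def linear_mapping(hidden_vectors):
    """Map a batch of vectors of shape (batch, 768) to shape (batch, 128)."""
    return hidden_vectors @ weights + bias


batch = rng.normal(size=(4, HIDDEN_DIM))  # 4 fake sentence vectors
mapped = linear_mapping(batch)
print(mapped.shape)  # (4, 128)
```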

Let's install all dependencies needed for this challenge:

In [None]:
!pip install pytorch-pretrained-biggan transformers Pillow

In [None]:
import tensorflow as tf

## 1. Preprocessing

### Creating a dataset

First we have to create the dataset that we will use to build the mapping function. The main idea is to create a list of sentences for each ImageNet class in the BigGAN. Our final goal will be to map these sentences to each class using a Transformer and the mapping function.

In order to create the list of sentences for each class, we will use some base patterns. Please define the base patterns in the cell below, and don't forget to include the string `<WORD>` in each pattern. This string will later be replaced by the name of the corresponding ImageNet class.

An example of pattern is: `"I saw a <WORD>"`

In [1]:
PATTERNS = [
    #Enter your code here
]
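
As an illustration of what the `<WORD>` placeholder is for, the sketch below fills two made-up patterns with two made-up class names. `fill_patterns`, `example_patterns` and `example_class_names` are hypothetical; the actual substitution is done by `generate_raw_dataset` in `helpers.py`, whose implementation is not shown here and may differ.

```python
# Hypothetical patterns and class names, for illustration only.
example_patterns = ["I saw a <WORD>", "I want a <WORD>"]
example_class_names = ["goldfish", "castle"]


def fill_patterns(patterns, word):
    """Return one sentence per pattern, with <WORD> replaced by `word`."""
    return [pattern.replace("<WORD>", word) for pattern in patterns]


for name in example_class_names:
    print(fill_patterns(example_patterns, name))
```

Every pattern you add yields one more sentence per class, so the number of patterns directly controls the size of the dataset.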

From the `helpers.py` script, import the function `generate_raw_dataset` and use it to build the raw dataset: a dictionary that maps each ImageNet class to a list of sentences following the patterns provided.

Do not forget to pass the patterns and a tokenizer from the Transformers library to the function.

In [None]:
# TODO : Imports

tokenizer = # Enter your code here

raw_data = # Enter your code here

You can take a look at what this generated data looks like:

In [None]:
raw_data

Before converting the sentences to tokens ids, let's organize the `raw_data` in such a way that we will have a `raw_inputs` list (list with all generated sentences) and a `labels` list (list of the corresponding ImageNet class for each sentence).

Please note that both lists must have the same length.

In [None]:
# TODO : Create `raw_inputs` and `labels`
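
One possible way to flatten such a dictionary is sketched below on toy data. It assumes that `raw_data` maps an integer class index to a list of sentences; the `toy_` names and the two class indices are illustrative only, so check the real structure of `raw_data` before adapting this.

```python
# Toy stand-in for `raw_data`: class index -> list of sentences (assumed layout).
toy_raw_data = {
    1: ["I saw a goldfish", "I want a goldfish"],
    483: ["I saw a castle", "I want a castle"],
}

toy_raw_inputs = []
toy_labels = []
for class_index, sentences in toy_raw_data.items():
    for sentence in sentences:
        toy_raw_inputs.append(sentence)
        toy_labels.append(class_index)  # one label per sentence

print(len(toy_raw_inputs), len(toy_labels))  # both lists have the same length
```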

In [6]:
print("Length of raw_inputs list:", len(raw_inputs))
print("Length of labels list:", len(labels))

Length of raw_inputs list: 2392
Length of labels list: 2392


Now, let's convert the input sentences to sequences of token ids and the attention mask (remember, we are padding the sentences) that will be used as input to the transformer model.

In order to do that, you will need to use the method `tokenizer.encode`. Do not forget to add the special tokens, to set a max length and to pad the sequences to this max length (so that all inputs have the same sequence size). Keep in mind that `encode` returns only the token ids; `tokenizer.encode_plus` accepts the same arguments and also returns the attention mask.

Please note that these variables (`input_ids` and `attention_mask`) should each be a 2-dim `tf.Tensor` (use the parameter `return_tensors` of the `encode` method).

> *Hint: In order to obtain one 2-dim `tf.Tensor` you can use the function `tf.concat`*

In [7]:
MAX_LENGTH = # Enter code here

input_ids = # Enter code here

attention_mask  = # Enter code here
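
To show what padding and the attention mask mean, here is a small sketch in plain Python on made-up token ids. `pad_and_mask` is a hypothetical helper, not part of the Transformers library; in the exercise the tokenizer does this work for you.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad `token_ids` and build the matching attention mask.

    The mask holds 1 for real tokens and 0 for padding positions.
    Note: this naive truncation may cut off a final special token;
    real tokenizers take care of that.
    """
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Made-up sequences of different lengths (ids are illustrative only).
toy_sequences = [[101, 1045, 2387, 102], [101, 1045, 2052, 2066, 1037, 102]]
padded = [pad_and_mask(seq, max_length=6) for seq in toy_sequences]
toy_input_ids = [ids for ids, _ in padded]
toy_attention_mask = [mask for _, mask in padded]

print(toy_input_ids)
print(toy_attention_mask)
```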

In [8]:
input_ids

<tf.Tensor: shape=(2392, 10), dtype=int32, numpy=
array([[ 101, 1045, 2387, ...,    0,    0,    0],
       [ 101, 1045, 2359, ...,    0,    0,    0],
       [ 101, 1045, 2052, ..., 3869,  102,    0],
       ...,
       [ 101, 1045, 2359, ...,    0,    0,    0],
       [ 101, 1045, 2052, ..., 8153,  102,    0],
       [ 101, 1045, 2293, ...,    0,    0,    0]], dtype=int32)>

### Using a Transformer to generate inputs to the mapping model

Now, using a transformer model that outputs hidden states for each sentence (e.g. BertModel, DistilBertModel, etc.), you will generate a new tensor that will be used as input to the mapping model that we will build afterwards.

In [9]:
# Import the model class here

In [None]:
transformer_model = # Enter your code here

hidden_states, _ = # Enter your code here

inputs = # Enter your code here
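
The hidden states have one vector per token, while the mapping model needs one vector per sentence. The notebook does not prescribe how to reduce them, so the sketch below shows two common options on random stand-in data: taking the first token's vector, or averaging over the non-padded tokens. All `toy_` names are hypothetical and the choice between the two is yours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for Transformer output: (batch, sequence_length, hidden_size).
toy_hidden_states = rng.normal(size=(2, 6, 768))
toy_attention_mask = np.array([[1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1, 1]])

# Option A: keep only the first token's vector for each sentence.
first_token_vectors = toy_hidden_states[:, 0, :]

# Option B: average the vectors of the real (non-padded) tokens.
mask = toy_attention_mask[:, :, None].astype(toy_hidden_states.dtype)
mean_vectors = (toy_hidden_states * mask).sum(axis=1) / mask.sum(axis=1)

print(first_token_vectors.shape, mean_vectors.shape)  # (2, 768) (2, 768)
```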

### Generating the targets from the ImageNet labels

Here we will obtain the targets we need to train the mapping function: the dense representation of each class as a 128-dimensional vector.

In [None]:
# Here we load the BigGAN network, which is only available in PyTorch

import torch
from pytorch_pretrained_biggan import BigGAN

big_gan = BigGAN.from_pretrained("biggan-deep-128")

**Generating outputs from the embeddings of the Big-GAN:**

In [None]:
one_hot_labels = torch.zeros(len(labels), 1000)
one_hot_labels[torch.arange(len(labels)), labels] = 1

with torch.no_grad():
    outputs = big_gan.embeddings(one_hot_labels).numpy()

**We can see that the dimension of each output is 128:**

In [14]:
outputs.shape

(2392, 128)
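
Feeding one-hot vectors through an embedding layer amounts to a table lookup. The sketch below checks this on a random stand-in matrix, under the assumption that `big_gan.embeddings` behaves like a bias-free linear layer; `toy_embedding_matrix` and the other `toy_` names are hypothetical.

```python
import numpy as np

NUM_CLASSES, EMBEDDING_DIM = 1000, 128
rng = np.random.default_rng(0)

# Random stand-in for the class-embedding table (one row per class).
toy_embedding_matrix = rng.normal(size=(NUM_CLASSES, EMBEDDING_DIM))
toy_labels = np.array([1, 1, 483])

# Same one-hot construction as in the cell above, in NumPy.
toy_one_hot = np.zeros((len(toy_labels), NUM_CLASSES))
toy_one_hot[np.arange(len(toy_labels)), toy_labels] = 1

via_matmul = toy_one_hot @ toy_embedding_matrix  # one-hot times table
via_lookup = toy_embedding_matrix[toy_labels]    # direct row selection

print(np.allclose(via_matmul, via_lookup))  # True
```

A consequence worth keeping in mind: all sentences generated for the same class share exactly the same target vector.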

### Building the TensorFlow datasets for training

As usual, before training a model in TensorFlow, we need to split the dataset into training and validation sets.

You can use the library and the function of your preference to do it. Do not forget to include the outputs.
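
If you prefer to see the idea without any library, here is a minimal sketch of a random split that keeps inputs and outputs aligned. `train_validation_split` is a hypothetical helper written for this illustration, and the 20% validation fraction is an arbitrary assumption.

```python
import random


def train_validation_split(inputs, outputs, validation_fraction=0.2, seed=0):
    """Shuffle indices once and apply the same split to inputs and outputs."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    order = list(range(len(inputs)))
    random.Random(seed).shuffle(order)
    n_val = int(len(order) * validation_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (
        [inputs[i] for i in train_idx],
        [inputs[i] for i in val_idx],
        [outputs[i] for i in train_idx],
        [outputs[i] for i in val_idx],
    )


# Toy data: input i is [2*i, 2*i + 1] and its target is i.
toy_inputs = [[2 * i, 2 * i + 1] for i in range(10)]
toy_outputs = list(range(10))
x_train, x_val, y_train, y_val = train_validation_split(toy_inputs, toy_outputs)

print(len(x_train), len(x_val))  # 8 2
```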

In [15]:
# Enter your code here

Now, using the method `tf.data.Dataset.from_tensor_slices`, build the train and validation sets:

In [None]:
train_dataset = # Enter your code here
validation_dataset = # Enter your code here

Now shuffle and batch the datasets:

In [17]:
BATCH_SIZE = # Enter your code here
SHUFFLE_BUFFER_SIZE = # Enter your code here

train_dataset = # Enter your code here
validation_dataset = # Enter your code here
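
Conceptually, shuffling and batching reorder the examples and then cut them into fixed-size groups. The plain-Python sketch below illustrates that idea; `shuffled_batches` is a hypothetical helper. It shuffles the whole list at once, whereas `tf.data` shuffles through a buffer, so the two only behave alike when the buffer is at least as large as the dataset.

```python
import random


def shuffled_batches(examples, batch_size, seed=0):
    """Yield the examples in random order, grouped into batches."""
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]


toy_examples = list(range(10))
batches = list(shuffled_batches(toy_examples, batch_size=4))

print([len(batch) for batch in batches])  # [4, 4, 2]
```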

## 2. Building the mapping model

### Defining the model

Now let's build the mapping function from the Transformer hidden states to the BigGAN inputs.

Feel free to choose the architecture you want for this mapping model.

>  🔦 *Hint: start with the simplest possible architecture; you can try more complex architectures later if you want*

In [18]:
model = # Enter your code here

### Training the model

As usual in TensorFlow 2, you should define the optimizer and the loss and compile the model before training:

In [None]:
# Enter your code here
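
The notebook leaves the choice of loss open. Since the targets are real-valued vectors, one natural candidate is the mean squared error; the sketch below computes it by hand on toy numbers. This is an assumption about a reasonable choice, not a requirement of the challenge, and `mean_squared_error` is a hypothetical helper.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences over all vector components."""
    squared_errors = [
        (t - p) ** 2
        for target_row, prediction_row in zip(targets, predictions)
        for t, p in zip(target_row, prediction_row)
    ]
    return sum(squared_errors) / len(squared_errors)


toy_targets = [[0.0, 1.0], [2.0, 3.0]]
toy_predictions = [[0.0, 2.0], [2.0, 1.0]]

print(mean_squared_error(toy_targets, toy_predictions))  # 1.25
```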

You can define an early_stopping callback to avoid overfitting:

In [None]:
early_stopping = # Enter your code here
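
The logic behind early stopping can be sketched in a few lines of plain Python: stop once the validation loss has not improved for a given number of epochs. `early_stopping_epoch` is a hypothetical, simplified helper and the loss values are made up; a real callback offers more options.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None  # the loss kept improving often enough


toy_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
print(early_stopping_epoch(toy_losses, patience=3))  # 5
```

Here the best loss (0.7) is reached at epoch 2, and the three following epochs bring no improvement, so training stops at epoch 5.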

Now let's train the model!

In [None]:
# Enter your code here

## 3. Visualizing results

Now let's visualize the results of our mapping function!

In [None]:
from pytorch_pretrained_biggan import BigGAN
from visualization import text_to_image

gan_model = BigGAN.from_pretrained('biggan-deep-128')

In [None]:
text = "she loves her dog"
text_to_image(text, mapping_model=model, lm_model=transformer_model, lm_tokenizer=tokenizer, gan_model=gan_model)

In [None]:
text = "he loves his dog"
text_to_image(text, mapping_model=model, lm_model=transformer_model, lm_tokenizer=tokenizer, gan_model=gan_model)

In [None]:
text = "she loves her cat"
text_to_image(text, mapping_model=model, lm_model=transformer_model, lm_tokenizer=tokenizer, gan_model=gan_model)

In [None]:
text = "There was a boat"
text_to_image(text, mapping_model=model, lm_model=transformer_model, lm_tokenizer=tokenizer, gan_model=gan_model)

In [None]:
text = "I saw a castle"
text_to_image(text, mapping_model=model, lm_model=transformer_model, lm_tokenizer=tokenizer, gan_model=gan_model)

If you got weird images, don't worry, it's normal. This is a pretty difficult challenge and there is no optimal solution. The main idea is to understand how the experiment works and to play with it.

Now that you have a first solution, why don't you try other alternatives?

You can try to obtain other results by using different patterns, different architectures for the mapping model or different training parameters.