![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)
***
# *Practicum AI:* Transformers - Alammar Transformer

This exercise is adapted from Alammar (2020), <a href="https://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>.

#### Video Resources
- [Jay Alammar (Transformers)](https://www.youtube.com/watch?v=-QH8fRhqFHM&t=2s) - A complete description of this exercise begins at the 15:00 mark.

In [1]:
# If you are running on Google Colab or outside of HiPerGator
# uncomment the following line to install the needed packages
# HiPerGator users should not need to do this!
#
# !pip install transformers datasets

### Setup and Tokenization

Load the pretrained tokenizer and model and assign them to the `tokenizer` and `model` variables. `distilgpt2` is a smaller, distilled version of the GPT-2 model.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("distilgpt2") 
model = AutoModelForCausalLM.from_pretrained("distilgpt2", output_hidden_states = True)


Assign the text to be tokenized to a string, then pass the resulting token IDs to the model's `generate` function. With `do_sample = False`, the model picks the most likely token at each step, and it correctly returns 'Redemption' as the next word in the sequence.

```python
text = "The Shawshank"

# Tokenize the input string
input = tokenizer.encode(text, return_tensors="pt")

# Run the model
output = model.generate(input, max_length = 5, do_sample = False)

# Print the output
print('\n',tokenizer.decode(output[0]))
```

In [1]:
# Code it!

```python
# Print the token IDs (of the input and output)
output
```

In [2]:
# Code it!

### From words to vectors and back

```python
# Print the input token ids
text = "The Shawshank"
input = tokenizer(text, return_tensors="pt")['input_ids']
input
```

In [3]:
# Code it!

```python
tokenizer.convert_ids_to_tokens(input[0])
```

In [4]:
# Code it!
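
The round trip from text to token IDs and back can be sketched without the real tokenizer. The block below is a toy illustration only: the vocabulary, its ID values, and the `encode` and `decode` helpers are hypothetical, and the greedy longest-match rule stands in for the byte-level BPE algorithm that GPT-2 tokenizers actually use.

```python
# Toy vocabulary: token string -> integer ID.
# These pieces and IDs are made up for illustration; they are not GPT-2's.
vocab = {"The": 0, " Shaw": 1, "sh": 2, "ank": 3, " Redemption": 4}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Split text into known pieces, always taking the longest match."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token matches text at position {pos}")
    return ids

def decode(ids):
    """Map IDs back to their token strings and join them."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("The Shawshank")
print(ids)                              # the token IDs
print([id_to_token[i] for i in ids])    # the token strings behind each ID
print(decode(ids))                      # the original text, recovered
```

The point of the sketch is that a model never sees characters: it sees a list of integers, and each integer can be mapped back to a piece of text.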

### Breathe meaning into numbers (Embedding)

This model has a vocabulary of 50,257 tokens, each with an embedding of 768 numbers.

In [6]:
# This is the embedding matrix of the model
model.transformer.wte # Dimensions are: (Number of tokens in vocabulary, dimension of model)

Embedding(50257, 768)
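
Conceptually, an embedding layer is a lookup table: a token ID selects one row of the matrix. The following is a minimal, library-free sketch of that idea, assuming a tiny random matrix in place of the real weights. `VOCAB_SIZE`, `EMBED_DIM`, and `embed` are hypothetical names chosen for this example.

```python
import random

VOCAB_SIZE = 10   # toy value; the model above has 50,257 tokens
EMBED_DIM = 4     # toy value; the model above uses 768 numbers per token

random.seed(0)
# One row per token ID, each row holding EMBED_DIM floats
embedding_matrix = [
    [random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the embedding row for each token ID."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([3, 7, 3])
print(len(vectors), len(vectors[0]))   # number of tokens, numbers per token
print(vectors[0] == vectors[2])        # the same ID always gives the same vector

# Number of entries in the full-size matrix: vocabulary size times embedding size
full_size_entries = 50257 * 768
print(full_size_entries)
```

At full size the table holds roughly 38.6 million numbers, which is why the embedding matrix accounts for a large share of a small model's parameters.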

In [7]:
import tensorflow as tf

```python
# View all of the embeddings.
model.transformer.wte.weight

# View the embedding vector for token #464 ('The')
model.transformer.wte.weight[464]

# View the size of the embedding vector for token #464 
len(model.transformer.wte.weight[464])
```

In [None]:
# Code it!

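Let's now test the model's generative abilities. Because `do_sample = True` here, the model samples each next token from its predicted probability distribution, so the output varies from run to run.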

```python

text = "The chicken didn't cross the road because it was"

# Tokenize the input string
input = tokenizer.encode(text, return_tensors="pt")

# Run the model
output = model.generate(input, max_length = 25, do_sample = True)

# Print the output
print('\n',tokenizer.decode(output[0]))
```

In [12]:
# Code it!

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



 The chicken didn't cross the road because it was like, "Oh wow. That's the best
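
The two `generate` calls in this exercise differ in `do_sample`: `False` for "The Shawshank" and `True` for the chicken prompt. The sketch below illustrates that difference on a single decoding step, under the assumption of made-up scores for four candidate next tokens. The candidate words, the logit values, and the helper names are all hypothetical, and the real `generate` function applies further processing that is not modeled here.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(logits):
    """Like do_sample=False: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits, rng):
    """Like do_sample=True: draw a token in proportion to its probability."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical candidates and scores for the token after "...because it was"
candidates = [" tired", " like", " afraid", " raining"]
logits = [2.0, 1.5, 1.0, 0.2]

# Greedy picking gives the same answer every time
greedy_picks = {candidates[pick_greedy(logits)] for _ in range(5)}

# Sampling spreads its picks across the candidates
rng = random.Random(42)
sampled_picks = [candidates[pick_sampled(logits, rng)] for _ in range(200)]

print(greedy_picks)
print({c: sampled_picks.count(c) for c in candidates})
```

Greedy picking is repeatable, which suits a prompt with one obvious continuation. Sampling trades that repeatability for variety, which is why rerunning the chicken prompt produces a different sentence each time.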
