<a href="https://colab.research.google.com/github/PaolaMaribel18/hands-on-2023A/blob/master/notebooks/15_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 15 Transformers

Transformers are a deep learning architecture that has revolutionized natural language processing (NLP), delivering strong performance on tasks such as translation, summarization, and question answering.

Key Characteristics:

* _Attention Mechanisms:_ At the heart of Transformers is the self-attention mechanism, which assigns each input token a weight with respect to every other token, allowing the model to focus on the most relevant parts of the input. This mechanism can capture relationships between tokens regardless of how far apart they are.

* _Parallel Processing:_ Unlike recurrent neural networks (RNNs), which process data one token at a time, Transformers process all input tokens in parallel, which leads to significant speed-ups during training.

* _Scalability:_ Transformers scale well: they can be trained with very large numbers of parameters (often billions), as in OpenAI's GPT-3, which has 175 billion parameters.

* _Positional Encoding:_ Since Transformers don't process data sequentially, they don't have a built-in notion of the order or position of tokens. To address this, positional encodings are added to the embeddings at the input layer, providing the model with positional context.
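As a rough illustration of the positional-encoding idea above, here is a minimal NumPy sketch of the sinusoidal scheme from the original Transformer paper. The function name `sinusoidal_positional_encoding` and the toy sizes are assumptions made for this example, and specific implementations may lay out the sine and cosine columns differently.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) array of sinusoidal encodings.

    Assumes d_model is even. Even columns hold sines, odd columns cosines.
    """
    positions = np.arange(num_positions)[:, np.newaxis]     # (num_positions, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]     # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

# Toy example: 5 tokens with random 8-dimensional embeddings (illustrative values).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 8))
model_inputs = token_embeddings + sinusoidal_positional_encoding(5, 8)
print(model_inputs.shape)  # (5, 8)
```

Because the encoding is simply added to the embeddings, the shape of the input is unchanged; only the values now depend on each token's position.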

### Transformer Architecture
The original Transformer model has an encoder-decoder structure. Both the encoder and the decoder are composed of a stack of identical layers.

1. Encoder: The encoder receives the input data (like a sentence) and turns it into a sequence of context-aware representations (the 'context' or 'memory') that the decoder can then use. It is built from the following parts:


* _Input Embedding:_ The input data is first converted into vectors using embedding layers. Positional encoding is then added to these embeddings to give the model information about the position of words in a sequence.

* _Encoder Stack:_ Multiple (often 6 or more) identical layers are stacked. Each layer has two main components:
    (1) Multi-Head Attention Mechanism: This allows the encoder to focus on different parts of the input sentence when producing the context. It runs the self-attention mechanism described above several times in parallel, once per head.
    (2) Feed-Forward Neural Network: Each attention output is then passed through a feed-forward neural network (the same one for each position).
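The two encoder-layer components described above can be sketched in NumPy as follows. This is a simplified illustration, not the implementation of any particular library: the weights are random, the sizes are toy values, a ReLU is assumed for the feed-forward activation, and the residual connections and layer normalization used in real encoder layers are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    heads = weights @ v                                  # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

def feed_forward(x, w1, b1, w2, b2):
    # The same two linear maps are applied to every position independently.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff, num_heads = 4, 8, 32, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

attn_out, attn_weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
layer_out = feed_forward(attn_out, w1, b1, w2, b2)
print(attn_weights.shape, layer_out.shape)  # (2, 4, 4) (4, 8)
```

Each head produces its own `seq_len x seq_len` table of attention weights, which is what lets different heads attend to different parts of the sentence.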

2. Decoder: The decoder generates the output data (like the translation of the input sentence) from the context provided by the encoder. It also consists of a stack of identical layers. In addition to the two components of an encoder layer, each decoder layer has a third component: multi-head attention over the encoder's output.

* _Output Embedding:_ Like the input embedding but for the output data.

* _Decoder Stack:_ Also composed of multiple identical layers. Each layer has three main components:
    (1) Masked Multi-Head Attention Mechanism: This ensures that the prediction for a particular word doesn’t depend on future words in the sequence. It's "masked" to prevent the model from "cheating" by looking ahead.
    (2) Multi-Head Attention Mechanism (over encoder’s output): This helps the decoder focus on relevant parts of the input sentence, much like in the encoder but attending to the encoder's output.
    (3) Feed-Forward Neural Network: Just like in the encoder.
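The masking in component (1) can be illustrated with a small NumPy sketch. The helper names `causal_mask` and `masked_attention_weights` are hypothetical, and the all-zero scores are a toy input chosen so that the effect of the mask is easy to see.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) mask: True where attention is allowed."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores, mask):
    # Disallowed (future) positions get -inf, so they receive zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # toy input: every position scores equally
weights = masked_attention_weights(scores, causal_mask(4))
print(np.round(weights, 2))
```

With equal raw scores, row `i` of `weights` is uniform over positions `0..i` and exactly zero for every later position, so the token at position `i` cannot see the tokens that follow it.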

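The output of the final decoder layer is projected onto the vocabulary (a linear layer followed by a softmax), and the output sequence is produced from these probabilities one token at a time.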
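One common way to turn the decoder's final hidden states into output tokens is a linear projection to vocabulary size followed by a softmax, then picking the most probable token (greedy decoding). The sketch below assumes a hypothetical five-word vocabulary and random weights, so the printed tokens are meaningless; it only shows the shape of the computation.

```python
import numpy as np

vocab = ["<pad>", "el", "zorro", "perro", "</s>"]  # hypothetical toy vocabulary
rng = np.random.default_rng(0)
d_model = 8

# Stand-in for the output of the last decoder layer: 3 positions, d_model features.
decoder_states = rng.normal(size=(3, d_model))
output_projection = rng.normal(size=(d_model, len(vocab)))  # random, untrained

logits = decoder_states @ output_projection          # (3, vocab_size)
logits -= logits.max(axis=-1, keepdims=True)         # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

next_token_ids = probs.argmax(axis=-1)               # greedy choice per position
print([vocab[i] for i in next_token_ids])
```

In a trained model the projection weights are learned, and at inference time the chosen token is fed back into the decoder to predict the next one.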

In [1]:
!pip install transformers



In [2]:
!pip install sentencepiece



In [3]:
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-es'
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)




In [4]:
encoder = model.get_encoder()
print("Encoder:", encoder)
decoder = model.get_decoder()
print("Decoder:", decoder)

Encoder: MarianEncoder(
  (embed_tokens): Embedding(65001, 512, padding_idx=65000)
  (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
  (layers): ModuleList(
    (0-5): 6 x MarianEncoderLayer(
      (self_attn): MarianAttention(
        (k_proj): Linear(in_features=512, out_features=512, bias=True)
        (v_proj): Linear(in_features=512, out_features=512, bias=True)
        (q_proj): Linear(in_features=512, out_features=512, bias=True)
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (activation_fn): SiLUActivation()
      (fc1): Linear(in_features=512, out_features=2048, bias=True)
      (fc2): Linear(in_features=2048, out_features=512, bias=True)
      (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
  )
)
Decoder: MarianDecoder(
  (embed_tokens): Embedding(65001, 512, padding_idx=65000)
  (embed_positions): MarianS
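The shapes printed for the encoder are enough to work out how many trainable parameters one `MarianEncoderLayer` holds. The arithmetic below is a sketch based only on the `Linear` and `LayerNorm` sizes shown in the output above; the helper names are made up for this example, and the embeddings and the decoder (whose printout is cut off) are not counted.

```python
def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights + bias

def layer_norm_params(size):
    return 2 * size  # scale and shift (elementwise_affine=True)

d_model, d_ff = 512, 2048

attention = 4 * linear_params(d_model, d_model)  # k_proj, v_proj, q_proj, out_proj
feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)  # fc1, fc2
layer_norms = 2 * layer_norm_params(d_model)  # self_attn_layer_norm, final_layer_norm

per_layer = attention + feed_forward + layer_norms
encoder_layers_total = 6 * per_layer  # the printout shows 6 identical layers

print(per_layer)             # 3152384
print(encoder_layers_total)  # 18914304
```

So the six encoder layers account for roughly 19 million parameters, split about one third in attention and two thirds in the feed-forward sublayers.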

In [5]:
def translate_to_spanish(phrase):
    """Translate an English phrase to Spanish using the pre-trained Transformer model."""
    # Tokenize the phrase into PyTorch tensors (input IDs plus attention mask);
    # the model loaded above is the PyTorch MarianMTModel, so "pt" is required here
    inputs = tokenizer(phrase, return_tensors="pt")
    # Generate the translation token IDs from the encoded phrase
    translation_ids = model.generate(**inputs)
    # Convert token IDs back to a string, dropping special tokens such as padding
    translation = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
    return translation

In [6]:
print(translate_to_spanish("the quick brown fox jumps over the lazy dog"))

el rápido zorro marrón salta sobre el perro perezoso
