In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Transformers are a deep learning architecture that has reshaped natural language processing (NLP), delivering strong results on tasks such as translation, summarization, and question answering.

Key Characteristics:

Attention Mechanisms: At the heart of the Transformer is self-attention. For each token, the model computes a weight for every token in the input and uses those weights to build a context-aware representation of that token. Because every token attends to every other token directly, the mechanism can capture relationships regardless of position or distance.

Parallel Processing: Unlike recurrent neural networks (RNNs) that process data sequentially, Transformers process all tokens in the input data in parallel, which leads to significant speed-ups during training.

Scalability: Transformers scale well. Their performance tends to keep improving as parameters and training data grow, which has led to models with billions of parameters, such as GPT-3 from OpenAI (175 billion).

Positional Encoding: Because Transformers process all tokens at once rather than one after another, they have no built-in notion of token order. To address this, positional encodings are added to the embeddings at the input layer, giving the model positional context.
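
The self-attention mechanism described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the implementation used by any particular library: the function names, the random input, and the weight matrices `w_q`, `w_k`, `w_v` are all made up for the example. Note that the whole sequence is handled with matrix products and no loop over tokens, which is what makes parallel processing possible.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # queries, keys, values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

# Hypothetical toy input: 4 tokens, each an 8-dimensional embedding
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)    # one output vector per input token
print(weights.shape)   # one attention weight per pair of tokens
```

Row `i` of `weights` shows how strongly token `i` attends to every token in the sequence, including distant ones.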

Transformer Architecture
The Transformer model consists of an Encoder-Decoder structure. Both the encoder and the decoder are composed of a stack of identical layers.

Encoder: The encoder receives the input data (like a sentence) and turns it into a sequence of contextual representations, one per input token, often called the 'context' or 'memory', that the decoder can then use. Each of its layers has two main components: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
Input Embedding: The input data is first converted into vectors using embedding layers. Positional encoding is then added to these embeddings to give the model information about the position of words in a sequence.
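
As an illustration of how positional information can be added to embeddings, here is a sketch of the sinusoidal encoding from the original Transformer paper. This is one possible scheme (some models learn their position embeddings instead), and the function name, the toy sizes, and the random `embeddings` are assumptions for the example. An even `d_model` is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even: even columns hold sines, odd columns cosines.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical token embeddings for a 5-token sentence
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))

pe = sinusoidal_positional_encoding(seq_len=5, d_model=8)
encoder_input = embeddings + pe    # position information is simply added
print(encoder_input.shape)
```

Each position gets a distinct pattern of values, so two identical words at different positions end up with different input vectors.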

Encoder Stack: Multiple identical layers are stacked (6 in the original Transformer, often more in later models). Each layer has two main components:
(1) Multi-Head Attention Mechanism: This allows the encoder to focus on different parts of the input sentence when producing the context. It runs the self-attention mechanism described under Key Characteristics several times in parallel, each "head" with its own learned projections, and combines the results.
(2) Feed-Forward Neural Network: The attention output at each position is then passed through a feed-forward neural network (the same network is applied to every position independently).
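
The two components of an encoder layer can be sketched as follows. This is a simplified illustration with made-up weights and sizes: it leaves out the residual connections and layer normalization that real Transformer layers also use, and it assumes a ReLU activation as in the original paper (other models use different activations).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Run self-attention in several heads in parallel, then merge the heads."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                  # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: the same weights for every position."""
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

# Hypothetical sizes: 4 tokens, model width 8, 2 heads, hidden width 32
rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_ff = 4, 8, 2, 32
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

attended = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
ffn_out = feed_forward(attended, w1, b1, w2, b2)
print(attended.shape, ffn_out.shape)
```

Because the feed-forward network treats each position separately, applying it to a single row of `attended` gives the same result as the corresponding row of `ffn_out`.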

Decoder: The decoder generates the output data (like the translation of the input sentence) from the context provided by the encoder. It also consists of a stack of identical layers. In addition to the two components found in an encoder layer, each decoder layer has a third component: multi-head attention over the encoder's output.
Output Embedding: The output tokens generated so far are embedded and combined with positional encodings, just as on the input side.

Decoder Stack: Also composed of multiple identical layers. Each layer has three main components:

(1) Masked Multi-Head Attention Mechanism: This ensures that the prediction for a particular word doesn’t depend on future words in the sequence. It's "masked" to prevent the model from "cheating" by looking ahead.
(2) Multi-Head Attention Mechanism (over encoder’s output): This helps the decoder focus on relevant parts of the input sentence, much like in the encoder but attending to the encoder's output.
(3) Feed-Forward Neural Network: Just like in the encoder.
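
The masking in the first decoder component can be illustrated with a small NumPy sketch. The function names and the random scores below are assumptions for the example. The idea is to set the attention scores for future positions to negative infinity before the softmax, so their weights become exactly zero.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean matrix that is True wherever position j lies after position i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    """Softmax over each row, giving zero weight to masked (future) positions."""
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical raw attention scores for a 4-token output sequence
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

mask = causal_mask(4)
weights = masked_softmax(scores, mask)
print(mask.astype(int))
print(weights.round(2))
```

Everything above the diagonal of `weights` is zero: the first token can only attend to itself, the second to the first two, and so on.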

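The output of the final decoder layer is passed through a linear layer and a softmax, giving a probability distribution over the vocabulary for the next token. The output sequence is generated one token at a time, with each new token fed back into the decoder, until an end-of-sequence token is produced.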
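
Producing the output sequence comes down to repeatedly turning the decoder's final hidden state into a choice of next token. Below is a minimal sketch of one such step, assuming a linear projection to the vocabulary followed by a softmax and a greedy pick. The sizes, weights, and function name are made up for illustration, and real systems often use beam search rather than a plain argmax.

```python
import numpy as np

def next_token_probabilities(decoder_state, w_vocab, b_vocab):
    """Project a decoder hidden state to vocabulary logits and apply softmax."""
    logits = decoder_state @ w_vocab + b_vocab
    logits = logits - logits.max()      # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Hypothetical sizes: hidden width 8, vocabulary of 5 tokens
rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5
decoder_state = rng.normal(size=d_model)
w_vocab = rng.normal(size=(d_model, vocab_size))
b_vocab = np.zeros(vocab_size)

probs = next_token_probabilities(decoder_state, w_vocab, b_vocab)
next_token_id = int(np.argmax(probs))   # greedy choice of the next token
print(probs.round(3), next_token_id)
```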

In [None]:
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-es'
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

In [None]:
encoder = model.get_encoder()
print("Encoder:", encoder)
decoder = model.get_decoder()
print("Decoder:", decoder)

In [None]:
def translate_to_spanish(phrase):
    """Translate an English phrase to Spanish using the pre-trained Transformer model."""
    # Tokenize the phrase into PyTorch tensors (input IDs plus attention mask)
    inputs = tokenizer(phrase, return_tensors="pt")
    # Generate the translation token IDs from the encoded phrase
    translation_ids = model.generate(**inputs)
    # Convert token IDs back to a string
    translation = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
    return translation

In [None]:
print(translate_to_spanish("the quick brown fox jumps over the lazy dog"))