# Machine Translation

## Approach 1: Using Transformers Pipeline API


### Install the `transformers` library if you haven't already:


In [6]:
!pip install transformers



### Import the necessary modules.

The transformers library provides pipeline APIs, a high-level way to use pre-trained models for specific tasks. A pipeline handles every step of inference: tokenization, model loading, prediction, and decoding. Because these details are hidden, a pre-trained model can be applied to tasks such as text classification, named entity recognition, and machine translation in a few lines of code.

In [7]:
from transformers import pipeline

### Define the task and the model:

In [56]:
task = "translation_en_to_fr"
model = "t5-base"

### Load the pre-trained machine translation pipeline:

In [60]:
translator = pipeline(task = task, model = model)

### Provide an input sentence or a list of sentences:

In [41]:
sentences = ["Large Language Models are amazing and highly versatile", "This presentation was very boring"]

### Perform translation and print the results:

In [61]:
translation = translator(sentences)
for t in translation:
  print(t['translation_text'])

Les grands modèles linguistiques sont étonnants et très polyvalents
Cette présentation était très ennuyeuse.


## Approach 2: Using AutoModelForSeq2SeqLM with a Pre-trained Model

### Import the necessary modules

The transformers library includes the class **AutoTokenizer**, which selects and loads the suitable tokenizer for a given pre-trained model automatically. Tokenizers are critical in NLP tasks since they convert raw text into tokens, which can be words, subwords, or characters, depending on the tokenizer's approach. They handle text normalization, tokenization, and special token addition, and prepare the input data for the model by converting text into a format that it can process.
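To make the tokenizer's steps concrete, here is a deliberately simplified sketch of normalization, tokenization, ID lookup, and special token addition. The vocabulary, its IDs, and the `toy_encode` helper are hypothetical and exist only for illustration; real tokenizers loaded through AutoTokenizer typically split text into subwords and use the vocabulary shipped with the model.

```python
# Toy vocabulary with made-up IDs; not taken from any real model.
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2, "this": 3, "presentation": 4,
         "was": 5, "very": 6, "boring": 7}

def toy_encode(text):
    """Lowercase, split on whitespace, map tokens to IDs, append end-of-sequence."""
    tokens = text.lower().split()                      # normalization + tokenization
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]  # unknown words map to <unk>
    return ids + [vocab["</s>"]]                       # special token addition

toy_ids = toy_encode("This presentation was very boring")
print(toy_ids)

# A word missing from the toy vocabulary falls back to the <unk> ID.
print(toy_encode("This talk was boring"))
```

Whitespace splitting is used here only because it is easy to read; it is not how the models in this notebook tokenize text.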

The transformers library also includes the class **AutoModelForSeq2SeqLM**, which automatically selects and loads the appropriate pre-trained model for sequence-to-sequence tasks, such as text summarization or machine translation, abstracting the details of model loading and providing a standardized interface for using pre-trained models for sequence generation tasks.

A **sequence-to-sequence** task is a type of NLP task where the input and output are both sequences of arbitrary lengths. In this scenario, the model takes a variable-length input sequence and generates an output sequence of variable length.

In [11]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

### Load a pre-trained model and its tokenizer:

Helsinki-NLP/opus-mt-en-fr is a machine translation model from the Language Technology Research Group at the University of Helsinki (Helsinki-NLP). It is trained specifically to translate text from English (en) to French (fr). It belongs to the OPUS-MT project, which trains translation models for many language pairs on parallel data from OPUS, the Open Parallel Corpus.

In [62]:
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)



### Tokenize the input text (we will use the same sentences as before):

In [42]:
# Calling the tokenizer directly is the recommended way to encode a batch.
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

### Generate the translation:

In [43]:
translation_ids = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=128)

The tokenizer maps each token to an integer index in the model's vocabulary. These indices are called **input IDs**. They are returned as a tensor, which is what the model actually receives for inference.

**Attention masks** are binary tensors that indicate which tokens the model should attend to and which it should ignore. They let the model distinguish real tokens from padding tokens, so that it only attends to the relevant parts of the input. An attention mask has the same shape as the input IDs tensor and contains 1s for real tokens and 0s for padding tokens.
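The relationship between padding and the attention mask can be sketched in plain Python. The token IDs and the padding ID below are hypothetical values chosen for illustration, and `pad_batch` is an illustrative helper rather than a library function; with a real tokenizer, `padding=True` produces both tensors for you.

```python
# Hypothetical token ID sequences of different lengths (not real vocabulary IDs).
example_sequences = [[101, 7, 42, 9, 2], [101, 15, 2]]
PAD_ID = 0  # assumed padding ID; a real tokenizer exposes its own as tokenizer.pad_token_id

def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded_ids, mask

padded_ids, mask = pad_batch(example_sequences, PAD_ID)
print(padded_ids)
print(mask)
```

Both results have the same shape, and each 0 in the mask lines up with a padding ID in the padded batch.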

### Decode the translation tokens:

In [44]:
translations = tokenizer.batch_decode(translation_ids, skip_special_tokens=True)

for sentence, translation in zip(sentences, translations):
    print("Input:", sentence)
    print("Translation:", translation)
    print()

Input: Large Language Models are amazing and highly versatile
Translation: Les modèles de grande langue sont étonnants et très polyvalents

Input: This presentation was very boring
Translation: Cette présentation était très ennuyeuse





The pipeline API is more convenient for quick translation tasks, while loading the tokenizer and model explicitly offers more control over the translation process, for example over padding and generation settings such as max_length.

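Note: To translate between other languages, replace the task name "translation_en_to_fr" (Approach 1) and the model name "Helsinki-NLP/opus-mt-en-fr" (Approach 2) with values that match the language pair you need, and make sure the chosen model supports that pair.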

Check the full list here: https://huggingface.co/models?pipeline_tag=translation
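Both identifiers used in this notebook follow a simple naming pattern built from language codes, so switching language pairs can be sketched as below. The two helper functions are hypothetical conveniences, not part of transformers, and they only build strings: a checkpoint is not published for every language pair, so confirm that the resulting model name exists in the list linked above before loading it.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a Helsinki-NLP OPUS-MT checkpoint name from two language codes."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"

def translation_task_name(source_lang, target_lang):
    """Build a pipeline task name in the translation_xx_to_yy form."""
    return f"translation_{source_lang}_to_{target_lang}"

# Example: English to German instead of English to French.
print(opus_mt_model_name("en", "de"))
print(translation_task_name("en", "de"))
```

The resulting strings can be passed to `AutoTokenizer.from_pretrained` / `AutoModelForSeq2SeqLM.from_pretrained` and to `pipeline(task=...)` respectively, in the same way as in the cells above.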


# Sentiment Analysis

The distilbert-base-uncased-finetuned-sst-2-english model is based on the DistilBERT architecture, which is a smaller and faster version of the BERT (Bidirectional Encoder Representations from Transformers) model. It has been pre-trained on a large corpus of uncased English text and then fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset. The SST-2 dataset consists of movie reviews labeled with their corresponding sentiment (positive or negative). By fine-tuning on this dataset, the model learns to classify the sentiment of a given English text as either positive or negative.

In [55]:
from transformers import AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

encoded_inputs = tokenizer(sentences, truncation=True, padding=True, return_tensors="pt")

input_ids = encoded_inputs["input_ids"]
attention_mask = encoded_inputs["attention_mask"]

outputs = model(input_ids=input_ids, attention_mask=attention_mask)

predicted_labels = outputs.logits.argmax(dim=1).tolist()

# For this model, label index 1 means positive and index 0 means negative.
sentiment_labels = ["positive" if label == 1 else "negative" for label in predicted_labels]

for sentence, sentiment_label in zip(sentences, sentiment_labels):
    print("Sentence:", sentence)
    print("Sentiment:", sentiment_label)
    print()

Sentence: Large Language Models are amazing and highly versatile
Sentiment: positive

Sentence: This presentation was very boring
Sentiment: negative



The model's outputs contain logits: raw, unnormalized scores for each class. They are not probabilities, although applying a softmax would convert them into probabilities. In this sentiment analysis model there are two classes, negative (index 0) and positive (index 1). `argmax(dim=1)` returns, for each sentence, the index of the class with the highest score, which is the predicted sentiment label.
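The difference between logits, probabilities, and the argmax can be illustrated without loading a model. The logits below are hypothetical numbers, not outputs of the model above, and the label order is assumed to be index 0 = negative, index 1 = positive.

```python
import math

# Assumed label order: index 0 = negative, index 1 = positive.
LABELS = ["negative", "positive"]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two sentences, one row per sentence.
example_logits = [[-2.0, 2.5], [3.0, -1.5]]

example_predictions = []
for row in example_logits:
    probs = softmax(row)
    best = max(range(len(row)), key=lambda i: row[i])  # plays the role of argmax(dim=1)
    example_predictions.append(LABELS[best])
    print(LABELS[best], [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits gives the same label as taking the argmax of the probabilities, which is why the notebook can skip the softmax when only the label is needed.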