# Translation and Summarization 🎯 

## Introduction
This notebook demonstrates how to use the Hugging Face Transformers library for translation and summarization. We will set up the environment, load pre-trained pipelines, run them on example texts, and inspect the output.

### Install and update the necessary libraries

In [19]:
%pip install --upgrade transformers torch

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.
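
The `tokenizers` warning above suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of doing that from Python follows. It assumes the cell runs before any tokenizer is used, and the value `"false"` is our choice for a quiet notebook, not something the original run used.

```python
import os

# Set the variable named in the warning before any tokenizer is used.
# "false" is an assumed choice; setdefault keeps a value already set in the shell.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Using `setdefault` rather than plain assignment leaves an existing setting untouched, which is convenient when the notebook runs in an environment someone else has configured.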


In [20]:
# Show only error-level log messages from transformers (hides non-critical warnings)
from transformers.utils import logging
logging.set_verbosity_error()

In [21]:
# Import the required libraries
from transformers import pipeline
import torch
import textwrap

### Library Descriptions
Here is a brief description of the required libraries:

- The transformers library by Hugging Face provides state-of-the-art pre-trained models for natural language processing (NLP) tasks such as text classification, translation, summarization, and more. It offers easy-to-use APIs to leverage models like BERT, GPT, T5, and others.

- PyTorch is an open-source deep learning framework developed by Facebook's AI Research lab. It provides flexibility and speed for building, training, and deploying deep learning models. PyTorch is widely used for research and production due to its dynamic computational graph and strong support for GPU acceleration.

- The textwrap library in Python provides utilities for formatting and wrapping plain text. It helps in breaking long strings into lines of a specified width, making the text more readable. This is particularly useful for generating neatly formatted text outputs, such as word-wrapping paragraphs in console applications or formatting text for display in a fixed-width area.
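
As a quick illustration of the `textwrap` behaviour described above, the sketch below wraps a short sample sentence to a narrow width. The sample string and the width of 30 are arbitrary choices for the demonstration.

```python
import textwrap

sample = "My puppy is adorable. Your kitten is cute. Her panda is friendly."

# fill() breaks the text at whitespace so that no line exceeds `width` characters.
wrapped = textwrap.fill(sample, width=30)
print(wrapped)
```

The notebook uses the same call with `width=50` to keep model output readable.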

### Build the `translation` pipeline using 🤗 Transformers Library

Build the translation pipeline using the Transformers library. The model is pre-trained, so the pipeline is only loaded here, not trained.

### Load Translation Model

Model info: NLLB (No Language Left Behind), [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M).

In [22]:
# Load the translation pipeline using the NLLB model
translator = pipeline(task="translation", 
                      model="facebook/nllb-200-distilled-600M", 
                      torch_dtype=torch.bfloat16)


### Translation Examples
Perform translation on example texts.

 - Example 1: Translation from English to French

In [23]:
# Define the input text
text = """\
My puppy is adorable, \
Your kitten is cute.
Her panda is friendly.
His llama is thoughtful. \
We all have nice pets!"""

In [24]:
print(text)

My puppy is adorable, Your kitten is cute.
Her panda is friendly.
His llama is thoughtful. We all have nice pets!


In [25]:
# Translate the input text from English to French
text_translated = translator(text,
                             src_lang="eng_Latn",
                             tgt_lang="fra_Latn")

# Uncomment the following line to print the text_translated
#text_translated

In [26]:
# Extract and print the translation text
translation_text = text_translated[0]['translation_text']
wrapped_text = textwrap.fill(translation_text, width=50)
print(f"The translation of the given text is:\n{wrapped_text}")

The translation of the given text is:
Mon chiot est adorable, ton chaton est mignon, son
panda est ami, sa lamme est attentive, nous avons
tous de beaux animaux de compagnie.
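
The pipeline returns a list with one dictionary per input, and the text sits under a task-specific key (`translation_text` here). Since the extract-and-wrap step recurs throughout the notebook, it could be factored into a small helper. The name `format_result` is hypothetical, and the example feeds it a hand-written stand-in with the same shape as `text_translated`, so it runs without the model.

```python
import textwrap


def format_result(result, key, width=50):
    """Return the text stored under `key` in the first result, wrapped to `width`."""
    return textwrap.fill(result[0][key], width=width)


# Stand-in with the same shape as the pipeline output (text copied from the run above).
example = [{"translation_text": "Mon chiot est adorable, ton chaton est mignon, "
                                "son panda est ami, sa lamme est attentive, nous "
                                "avons tous de beaux animaux de compagnie."}]
print(format_result(example, "translation_text"))
```

The same helper would work for summaries by passing `"summary_text"` as the key.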


To translate between other languages, look up their codes in [Languages in FLORES-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200).
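
One way to avoid retyping raw codes is a small lookup table. The sketch below is an illustration only: `FLORES_CODES` and `translate` are hypothetical names, and the table lists just a few entries. The first three codes appear in this notebook, while the German and Spanish codes come from the FLORES-200 list linked above.

```python
# A few FLORES-200 codes; extend this table from the linked list as needed.
FLORES_CODES = {
    "English": "eng_Latn",
    "French": "fra_Latn",
    "Dutch": "nld_Latn",
    "German": "deu_Latn",
    "Spanish": "spa_Latn",
}


def translate(translator, text, source, target):
    """Translate `text` using language names instead of raw FLORES-200 codes."""
    try:
        src, tgt = FLORES_CODES[source], FLORES_CODES[target]
    except KeyError as err:
        raise ValueError(f"No code registered for language {err}") from err
    # `translator` is any callable with the pipeline's calling convention.
    result = translator(text, src_lang=src, tgt_lang=tgt)
    return result[0]["translation_text"]
```

With the pipeline loaded earlier, a call would look like `translate(translator, text, "English", "French")`.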



- Example 2: Translation from English to Dutch

In [27]:
# Define the input text
text1 = """
What are you doing today?
"""
print(text1)

# Translate the input text from English to Dutch
text_translated1 = translator(text1, src_lang="eng_Latn", tgt_lang="nld_Latn")  # Dutch

# Uncomment the following line to print the text_translated
#text_translated1

# Extract and print the translation text
translation_text = text_translated1[0]['translation_text']
wrapped_text = textwrap.fill(translation_text, width=50)
print(f"The translation of the given text is:\n{wrapped_text}")


What are you doing today?

The translation of the given text is:
Wat doe je vandaag?


### Free up some memory before continuing

The translation model is no longer needed, so delete it and run the garbage collector before loading the summarization model.

In [28]:
import gc

# Delete the translator model and free up memory
del translator
gc.collect()

85
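
The number printed above is the return value of `gc.collect()`, which reports how many unreachable objects the collector found. If the pipeline had been placed on a GPU (an assumption, since this notebook does not specify a device), PyTorch's cached GPU memory could be released as well. A hedged sketch, with `release_memory` as a hypothetical helper name:

```python
import gc


def release_memory():
    """Run the garbage collector and, if PyTorch sees a GPU, clear its cache."""
    collected = gc.collect()
    try:
        import torch

        if torch.cuda.is_available():
            # Returns cached, unused GPU memory to the driver.
            torch.cuda.empty_cache()
    except ImportError:
        pass  # PyTorch is not installed, so there is no GPU cache to clear
    return collected


print(release_memory())
```

The helper cannot delete the caller's variables, so `del translator` still has to run in the notebook cell first.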

### Build the `summarization` pipeline using 🤗 Transformers Library

Build the summarization pipeline using the Transformers library. The model is pre-trained, so the pipeline is only loaded here, not trained.

Model info: [bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn)

In [29]:
# Load the summarization pipeline using the BART model
summarizer = pipeline(task="summarization",
                      model="facebook/bart-large-cnn",
                      torch_dtype=torch.bfloat16)

### Summarization Examples
Perform summarization on example texts.

- Example 1: Summarizing a passage about Paris

In [30]:
# Define the input text
text = """Paris is the capital and most populous city of France, with
          an estimated population of 2,175,601 residents as of 2018,
          in an area of more than 105 square kilometres (41 square
          miles). The City of Paris is the centre and seat of
          government of the region and province of Île-de-France, or
          Paris Region, which has an estimated population of
          12,174,880, or about 18 percent of the population of France
          as of 2017."""
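
Because the string above is triple-quoted, the newlines and the indentation of the continuation lines are part of the text passed to the model. The notebook does not show whether this affects the summary, so the following normalization is optional. It is a sketch on a shortened copy of the passage.

```python
raw = """Paris is the capital and most populous city of France, with
          an estimated population of 2,175,601 residents as of 2018."""

# split() with no arguments breaks on any run of whitespace, so joining
# the pieces with single spaces removes the newlines and the indentation.
clean = " ".join(raw.split())
print(clean)
```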

In [31]:
# Summarize the input text
summary = summarizer(text,
                     min_length=10,
                     max_length=100)

# Uncomment the following line to print the summary
#summary
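
`min_length` and `max_length` are measured in tokens, not characters or words. For short inputs, a fixed `max_length=100` can exceed the length of the input itself. The sketch below derives bounds from a rough size estimate. The helper name `length_bounds` and its ratios are arbitrary assumptions for illustration, and the whitespace word count is only a stand-in for the tokenizer's real token count.

```python
def length_bounds(text, min_ratio=0.1, max_ratio=0.6, floor=10):
    """Suggest (min_length, max_length) from a rough size estimate of `text`."""
    approx_tokens = len(text.split())  # crude proxy for the token count
    max_length = max(floor + 1, round(approx_tokens * max_ratio))
    # Keep min_length at least 1 and strictly below max_length.
    min_length = max(1, min(max_length - 1, round(approx_tokens * min_ratio)))
    return min_length, max_length


print(length_bounds("word " * 100))
```

The result could then be passed on as `summarizer(text, min_length=lo, max_length=hi)`.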

In [32]:
# Extract and print the summary text
summary_text = summary[0]['summary_text']
wrapped_text = textwrap.fill(summary_text, width=50)
print(f"Summary:\n{wrapped_text}")

Summary:
Paris is the capital and most populous city of
France, with an estimated population of 2,175,601
residents as of 2018. The City of Paris is the
centre and seat of the government of the region
and province of Île-de-France.


- Example 2: Summarizing a passage about Amsterdam

In [33]:
# Define the input text
text2 = """Amsterdam, the capital of the Netherlands, is known for its picturesque canals,
        historic architecture, and vibrant cultural scene. Founded in the 12th century as 
        a small fishing village, it grew into a major global trading hub during the Dutch Golden 
        Age. Today, Amsterdam is a cosmopolitan city, famous for its rich artistic heritage, 
        with museums like the Van Gogh Museum and Rijksmuseum. The city is also known for its 
        liberal attitudes, such as tolerance of cannabis use and legal sex work, as well 
        as its cycling culture, eco-friendly initiatives, and diverse, international population."""

In [34]:
# Summarize the input text
summary2 = summarizer(text2,
                      min_length=10,
                      max_length=100)

# Uncomment the following line to print the summary2
#summary2

# Extract and print the summary text
summary_text = summary2[0]['summary_text']
wrapped_text = textwrap.fill(summary_text, width=50)
print(f"Summary:\n{wrapped_text}")

Summary:
Amsterdam, the capital of the Netherlands, is
known for its picturesque canals, historic
architecture, and vibrant cultural scene. Founded
in the 12th century as a small fishing village, it
grew into a major global trading hub. Today,
Amsterdam is a cosmopolitan city, famous for its
rich artistic heritage, with museums like the Van
Gogh Museum and Rijksmuseum.


### Conclusion
- The NLLB (No Language Left Behind) model offers significant benefits for translation tasks but comes with notable challenges.
  - Advantages: It supports many low-resource languages, its translations generally preserve the meaning of the source text, and it is openly available, which makes it accessible and customizable. A single model covers many language pairs, and it helps serve underrepresented language communities.
  - Disadvantages: The model is computationally intensive and complex to fine-tune. It may carry biases from its training data, and translation quality can be weaker for the least-represented languages. Latency can be a problem in real-time use, limited documentation can slow development, and evaluating quality across all supported languages is difficult because benchmarks are scarce for many of them.
  - Best suited for: projects that need translation for low-resource or minority languages (e.g., applications in research, social good, and cultural preservation).
  - Less suited for: scenarios that require low latency and high scalability without robust infrastructure, and domains that demand hyper-specialized translation quality (e.g., medical or legal texts).

- The BART (Bidirectional and Auto-Regressive Transformers) model is highly effective for abstractive summarization and other NLP tasks.
  - Advantages: It delivers high-quality, flexible, and customizable results with strong summarization performance, and it is supported by an active community.
  - Disadvantages: It is computationally demanding, susceptible to hallucination, and can struggle with domain-specific content and poorly structured inputs. Controlling output length and handling latency in real-time scenarios may require extra effort or resources.
  - Best suited for: abstractive summarization where a human-like, semantically rich summary is required; moderately well-structured input such as news articles, reports, and research papers; and applications with access to sufficient computational resources.
  - Less suited for: tasks that demand strict factual accuracy or summaries with no deviation from the original text, and scenarios where the input is unstructured or noisy (e.g., raw user comments or fragmented notes).
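
Since the conclusion flags hallucination and factual accuracy as risks, one inexpensive sanity check is to confirm that every number in a summary also appears in the source. The sketch below is a heuristic, not a faithfulness metric. The helper name `unsupported_numbers` is hypothetical, and the example strings are shortened, hand-written variants of the Paris passage.

```python
import re


def unsupported_numbers(source, summary):
    """Return numbers that appear in `summary` but not in `source`."""
    # Matches digit runs that may contain commas or dots, e.g. 2,175,601 or 2018.
    pattern = r"\d[\d,.]*\d|\d"
    source_numbers = set(re.findall(pattern, source))
    return [n for n in re.findall(pattern, summary) if n not in source_numbers]


source = "Paris has an estimated population of 2,175,601 residents as of 2018."
good = "Paris had 2,175,601 residents as of 2018."
bad = "Paris had 2,715,601 residents as of 2018."

print(unsupported_numbers(source, good))  # → []
print(unsupported_numbers(source, bad))   # → ['2,715,601']
```

A check like this catches altered figures but says nothing about wrong names or relationships, so it complements manual review rather than replacing it.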

### Summary
This notebook demonstrated how to use the Hugging Face Transformers library for translation and summarization. We set up the environment, loaded pre-trained pipelines, ran them on example texts, and inspected the output.

### Next Steps
- Try these models with your own texts!
- Experiment with different models and parameters.
- Improve the performance by fine-tuning the models on specific datasets.
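
When comparing models or parameters, it helps to measure latency alongside output quality. The sketch below is a minimal timing helper. The name `time_call` is hypothetical, and the example times a built-in function as a stand-in so it runs without loading a model.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call `fn` several times and return (last_result, best_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload; in this notebook `fn` would be the `summarizer` pipeline.
result, seconds = time_call(sorted, [3, 1, 2])
print(result, f"{seconds:.6f}s")
```

Taking the best of several runs reduces the effect of one-off costs such as warm-up on the first call.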

### References