### 📝 **Multilingual Text Translation & Summarization System**

#### **📚 Overview**

This project is a Multilingual Text Translation & Summarization System that allows users to:
- Translate text between multiple languages 🌐
- Summarize text documents efficiently 📝
- Upload files for automatic summarization
- Utilize a simple Streamlit-based UI for user interaction 🔍

It uses pretrained Hugging Face models (M2M-100 for translation and BART for summarization), so no model training is required.

**1. Set Up the Development Environment**

In [1]:
# %pip install transformers sentencepiece streamlit torch pdfplumber

In [2]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer, BartForConditionalGeneration, BartTokenizer, MBartForConditionalGeneration, MBart50TokenizerFast

### For Translation
**Use M2M-100 (Facebook's Many-to-Many Multilingual Translation model) from Hugging Face.**

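- It translates directly between any pair of its 100 supported languages.
- The 418M-parameter version is the smallest checkpoint (about a 1.9 GB download) and can run on a CPU.
- The 1.2B and 12B versions need more memory. The 1.2B version should fit on a Colab GPU; the 12B version will likely need more memory than a free Colab GPU provides.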
- 👉 Hugging Face Model: `facebook/m2m100_418M`
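
M2M-100 identifies languages by short codes such as `en`, `fr`, and `es`. A UI usually shows full language names instead, so a small lookup step can help. The sketch below is one possible way to do this; `LANGUAGE_CODES` and `resolve_lang` are hypothetical names, and the table covers only a handful of the supported languages.

```python
# Hypothetical lookup table: a small subset of the language codes M2M-100 accepts.
LANGUAGE_CODES = {
    "english": "en",
    "french": "fr",
    "spanish": "es",
    "german": "de",
    "hindi": "hi",
    "chinese": "zh",
}


def resolve_lang(name_or_code):
    """Return a language code for a display name or an already-listed code."""
    key = name_or_code.strip().lower()
    if key in LANGUAGE_CODES:           # full name, e.g. "French"
        return LANGUAGE_CODES[key]
    if key in LANGUAGE_CODES.values():  # already a code, e.g. "fr"
        return key
    raise ValueError(f"Unsupported language: {name_or_code!r}")
```

Rejecting unknown names early gives a clearer error than letting the tokenizer fail on an unrecognized code.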

In [3]:
# Load Model & Tokenizer
translation_model_name = "facebook/m2m100_418M"
translation_tokenizer = M2M100Tokenizer.from_pretrained(translation_model_name)
translation_model = M2M100ForConditionalGeneration.from_pretrained(translation_model_name)

### Translate text

In [4]:
def translate_text(text, src_lang, tgt_lang):
    """Translate text from src_lang to tgt_lang (M2M-100 codes such as "en" or "fr")."""
    translation_tokenizer.src_lang = src_lang  # Set source language
    encoded_text = translation_tokenizer(text, return_tensors="pt")

    # Force the first generated token to be the target-language tag
    generated_tokens = translation_model.generate(**encoded_text, forced_bos_token_id=translation_tokenizer.get_lang_id(tgt_lang))

    return translation_tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
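
`translate_text` sends the whole input to the model in one call, which may not work well for long passages because the model's input length is bounded. One possible workaround, sketched below, is to split the text into sentence groups and translate each group separately. `split_sentences`, `translate_long_text`, and the `max_chars=400` limit are hypothetical choices, and the regex splitter is deliberately naive. The translation function is passed in as `translate_fn`, so `translate_text` can be supplied directly.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., !, ? or the CJK full stop when followed by whitespace."""
    parts = re.split(r"(?<=[.!?。])\s+", text.strip())
    return [p for p in parts if p]


def translate_long_text(text, src_lang, tgt_lang, translate_fn, max_chars=400):
    """Group sentences into chunks of roughly max_chars and translate each chunk."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        # Start a new chunk when adding this sentence would exceed the limit.
        # A single sentence longer than max_chars still becomes its own chunk.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return " ".join(translate_fn(chunk, src_lang, tgt_lang) for chunk in chunks)
```

With the model loaded, a call might look like `translate_long_text(long_text, "en", "fr", translate_text)`. Translating chunks independently loses context across chunk boundaries, so this trades some coherence for the ability to handle longer inputs.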

### Example Translation

In [5]:
# Example Usage
print(translate_text("Hello, how are you?", "en", "fr"))
print(translate_text("Hola, ¿cómo estás?", "es", "en"))

Bonjour, comment vous êtes-vous ?
Hello, how are you?


### For Summarization
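**Use a BART-, Pegasus-, or T5-based model.**

- `facebook/bart-large-cnn` (fine-tuned on CNN/DailyMail news articles; a good fit for news and document summarization)
- `google/pegasus-xsum` (fine-tuned on XSum; produces very short, roughly one-sentence summaries)
- `t5-small` or `t5-base` (lightweight, multi-task models that summarize when the input is prefixed with `summarize:`)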
- 👉 Example: Using BART

In [9]:
summarizemodel_name = "facebook/bart-large-cnn"
summarize_tokenizer = BartTokenizer.from_pretrained(summarizemodel_name)
summarize_model = BartForConditionalGeneration.from_pretrained(summarizemodel_name)

### Summarizer

In [10]:
def summarize_text(text, src_lang="en"):
    # facebook/bart-large-cnn is English-only and BartTokenizer does not use a
    # source-language setting, so src_lang is kept only for interface compatibility.
    encoded_text = summarize_tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

    # Pass input_ids and attention_mask together; beam search with a length penalty
    summary_ids = summarize_model.generate(**encoded_text, max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

    return summarize_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
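
Because `summarize_text` truncates its input at 1024 tokens, anything beyond that point in a long document is ignored. A common workaround, sketched here under assumptions, is to summarize the document chunk by chunk and join the partial summaries. `chunk_words`, `summarize_long_text`, and the `max_words=600` default are hypothetical; word count is only a rough stand-in for the token limit. The summarizer is passed in as `summarize_fn`, so `summarize_text` can be supplied directly.

```python
def chunk_words(text, max_words=600):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(text, summarize_fn, max_words=600):
    """Summarize each chunk independently and join the partial summaries."""
    partial = [summarize_fn(chunk) for chunk in chunk_words(text, max_words)]
    return " ".join(partial)
```

With the model loaded, a call might look like `summarize_long_text(document, summarize_text)`. If the joined result is still too long, it could be passed through `summarize_text` once more.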

### Example Summarization

In [11]:
text = """A neural network is a type of machine learning model within artificial intelligence (AI) that mimics the structure of the human brain by using interconnected nodes, called neurons, arranged in layers to process data and learn patterns, similar to how the brain does, allowing computers to perform complex tasks like image recognition and natural language processing. """
print(summarize_text(text))

A neural network is a type of machine learning model within artificial intelligence (AI) It mimics the structure of the human brain by using interconnected nodes, called neurons, arranged in layers to process data and learn patterns, similar to how the brain does.
