### This notebook shows a full pipeline for text language identification and translation using two Facebook models: fastText and No Language Left Behind (NLLB).

First, we take an input text in any language and detect its language code using fastText.

Then we feed the text and the predicted language label to NLLB, which translates the text from the detected language into any target language that NLLB supports.

# Language Identification

In [1]:
# download the pretrained language identification model file
!wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin

--2024-05-27 21:17:31--  https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 2600:9000:21c7:a400:13:6e38:acc0:93a1, 2600:9000:21c7:3600:13:6e38:acc0:93a1, 2600:9000:21c7:1800:13:6e38:acc0:93a1, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:21c7:a400:13:6e38:acc0:93a1|:443... failed: Connection timed out.
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:21c7:3600:13:6e38:acc0:93a1|:443... failed: Connection timed out.
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:21c7:1800:13:6e38:acc0:93a1|:443... failed: Connection timed out.
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:21c7:c800:13:6e38:acc0:93a1|:443... failed: Connection timed out.
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:21c7:a000:13:6e38:acc0:93a1|:443... 

In [None]:
!pip install fasttext

In [2]:
import fasttext

# path of the pretrained language identification model downloaded above
# (in Colab the file lands in the working directory, /content/lid218e.bin)
pretrained_lang_model = "lid218e.bin"
model = fasttext.load_model(pretrained_lang_model)
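fastText raises `ValueError: ... has wrong file format!` when the path points at something that is not a fastText binary, for example an incomplete download or the weights of a different model. A minimal sketch of a check to run before loading is shown here. The helper name `looks_like_fasttext_model` is hypothetical, and the magic number and version are assumptions taken from the fastText C++ source, not a documented API.

```python
import os
import struct

# Assumption: fastText binaries start with this 32-bit magic number followed by
# a 32-bit format version (values as found in the fastText C++ source).
FASTTEXT_MAGIC = 793712314
FASTTEXT_MAX_VERSION = 12


def looks_like_fasttext_model(path):
    """Return True if the file header resembles a fastText .bin model."""
    # A missing or near-empty file cannot be a valid model.
    if not os.path.isfile(path) or os.path.getsize(path) < 8:
        return False
    with open(path, "rb") as f:
        # Two little-endian signed 32-bit integers: magic, then version.
        magic, version = struct.unpack("<ii", f.read(8))
    return magic == FASTTEXT_MAGIC and version <= FASTTEXT_MAX_VERSION
```

Calling it on the downloaded `lid218e.bin` before `fasttext.load_model` gives an early, readable signal that the download did not finish or that the path points at the wrong file.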

Now let's enter a test text in the source language. Here we will translate from Arabic to Spanish.

In [None]:
text = "صباح الخير، الجو جميل اليوم والسماء صافية."

In [None]:
predictions = model.predict(text, k=1)
print(predictions)

(('__label__arb_Arab',), array([0.99960977]))


In [None]:
input_lang = predictions[0][0].replace('__label__', '')
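The label returned by fastText has the form `__label__<language>_<Script>`, paired with a confidence score. The sketch below strips the prefix and also rejects low-confidence predictions, which can matter for very short or mixed-language inputs. The helper name `parse_lid_prediction` and the `min_confidence` default of 0.5 are hypothetical choices, not values from the models' documentation.

```python
def parse_lid_prediction(predictions, min_confidence=0.5):
    """Turn a fastText (labels, scores) pair into an NLLB-style code.

    Returns the top language code (e.g. 'arb_Arab'), or None when there is
    no prediction or the top score is below min_confidence.
    """
    labels, scores = predictions
    if len(labels) == 0:
        return None
    label, score = labels[0], float(scores[0])
    if score < min_confidence:
        return None
    prefix = "__label__"
    # Strip only a leading prefix, leaving the rest of the label untouched.
    return label[len(prefix):] if label.startswith(prefix) else label
```

With the prediction printed above, `parse_lid_prediction(predictions)` would give the same code as the `replace` call, while returning `None` for uncertain inputs so the caller can decide what to do.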

# Text Translation

In [None]:
!pip install -U pip transformers

In [None]:
!pip install sentencepiece

In [None]:
checkpoint = 'facebook/nllb-200-distilled-600M'
# checkpoint = 'facebook/nllb-200-1.3B'
# checkpoint = 'facebook/nllb-200-3.3B'
# checkpoint = 'facebook/nllb-200-distilled-1.3B'

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [None]:
target_lang = 'spa_Latn'
translation_pipeline = pipeline('translation',
                                model=model,
                                tokenizer=tokenizer,
                                src_lang=input_lang,
                                tgt_lang=target_lang,
                                max_length=400)
output = translation_pipeline(text)
print(output[0]['translation_text'])

Buenos días, el clima es hermoso y el cielo está limpio.
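The two stages can be combined into a single function. The following is a sketch under stated assumptions: `detect_and_translate` is a hypothetical helper, it receives the fastText model and a Hugging Face `translation` pipeline as arguments (this notebook reuses the name `model` for both, so they are passed in explicitly), and it relies on the translation pipeline accepting `src_lang` and `tgt_lang` at call time.

```python
def detect_and_translate(text, lid_model, translator, target_lang="spa_Latn"):
    """Detect the language of `text` with fastText, then translate it.

    lid_model  : object with a fastText-style predict(text, k) method
    translator : callable like a transformers 'translation' pipeline
    """
    # fastText predicts one line at a time, so newlines are replaced first.
    clean = text.replace("\n", " ")
    labels, _scores = lid_model.predict(clean, k=1)
    src_lang = labels[0].replace("__label__", "")
    # Nothing to do if the text is already in the target language.
    if src_lang == target_lang:
        return text
    result = translator(clean, src_lang=src_lang, tgt_lang=target_lang)
    return result[0]["translation_text"]
```

With the objects from this notebook, the call would look like `detect_and_translate(text, lid_model, pipeline('translation', model=model, tokenizer=tokenizer, max_length=400))`, where `lid_model` is a hypothetical name for the loaded fastText model kept separate from the NLLB `model`.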
