# DL Translate

*A deep learning-based translation library built on Huggingface `transformers` and Facebook's `mBART-Large`*

💻 [GitHub Repository](https://github.com/xhlulu/dl-translate)\
📚 [Documentation](https://git.io/dlt-docs) / [readthedocs](https://dl-translate.readthedocs.io)\
🐍 [PyPi project](https://pypi.org/project/dl-translate/)

## Quickstart

Install the library with pip:

In [1]:
!pip install -q dl-translate


To translate some text:

In [2]:
import dl_translate as dlt

mt = dlt.TranslationModel()




In [3]:
text_pl = "Poszedłem do domu po spotkaniu."
mt.translate(text_pl, source=dlt.lang.POLISH, target=dlt.lang.ENGLISH)



'I went home after the meeting.'

Above, `dlt.lang` provides a variable for each available language, so your editor can auto-complete them. Alternatively, you can pass the language name (e.g. "German") or the language code (e.g. "en" for English):

In [4]:
text_de = "Ich habe keine Ahnung wo ist sie."
mt.translate(text_de, source="German", target="en")



'I don’t know where she is.'

If you want to verify whether a language is available, you can check it:

In [5]:
print(mt.available_languages())  # All languages that you can use
print(mt.available_codes())  # Code corresponding to each language accepted
print(mt.get_lang_code_map())  # Dictionary of lang -> code

('Afrikaans', 'Amharic', 'Arabic', 'Asturian', 'Azerbaijani', 'Bashkir', 'Belarusian', 'Bulgarian', 'Bengali', 'Breton', 'Bosnian', 'Catalan', 'Valencian', 'Cebuano', 'Czech', 'Welsh', 'Danish', 'German', 'Greek', 'English', 'Spanish', 'Estonian', 'Persian', 'Fulah', 'Finnish', 'French', 'Western Frisian', 'Irish', 'Gaelic', 'Scottish Gaelic', 'Galician', 'Gujarati', 'Hausa', 'Hebrew', 'Hindi', 'Croatian', 'Haitian', 'Haitian Creole', 'Hungarian', 'Armenian', 'Indonesian', 'Igbo', 'Iloko', 'Icelandic', 'Italian', 'Japanese', 'Javanese', 'Georgian', 'Kazakh', 'Khmer', 'Central Khmer', 'Kannada', 'Korean', 'Luxembourgish', 'Letzeburgesch', 'Ganda', 'Lingala', 'Lao', 'Lithuanian', 'Latvian', 'Malagasy', 'Macedonian', 'Malayalam', 'Mongolian', 'Marathi', 'Malay', 'Burmese', 'Nepali', 'Dutch', 'Flemish', 'Norwegian', 'Northern Sotho', 'Occitan', 'Oriya', 'Panjabi', 'Punjabi', 'Polish', 'Pushto', 'Pashto', 'Portuguese', 'Romanian', 'Moldavian', 'Moldovan', 'Russian', 'Sindhi', 'Sinhala', 'Si
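A small guard can make this check explicit before calling `mt.translate`. The sketch below is illustrative only: `is_supported` is a hypothetical helper that is not part of `dl_translate`, the case-insensitive comparison is a choice made for this example, and the `languages` tuple is a hand-copied sample standing in for the result of `mt.available_languages()`.

```python
def is_supported(language, available):
    """Return True if `language` matches one of the available names, ignoring case."""
    wanted = language.strip().lower()
    return any(wanted == name.lower() for name in available)


# Hand-copied sample; in practice, pass mt.available_languages() instead.
languages = ("Afrikaans", "German", "English", "Polish")

print(is_supported("polish", languages))
print(is_supported("Klingon", languages))
```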

## Usage

### Selecting a device

When you load the model, you can specify the device:
```python
mt = dlt.TranslationModel(device="auto")
```

The default is `device="auto"`, which uses a GPU if one is available. You can also explicitly set `device="cpu"` or `device="gpu"`, or pass any other string accepted by [`torch.device()`](https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.device). __In general, it is recommended to use a GPU if you want a reasonable processing time.__

Let's check what we originally loaded:

In [6]:
mt.device

device(type='cuda')
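The selection rules described above can be sketched in plain Python. This is not the library's source code: `resolve_device` and its `cuda_available` argument are hypothetical names used for illustration, and in real code the flag would typically come from `torch.cuda.is_available()`.

```python
def resolve_device(device, cuda_available):
    """Map a user-facing device string to a torch-style device name."""
    if device == "auto":
        # Prefer the GPU when one is present, otherwise fall back to the CPU.
        return "cuda" if cuda_available else "cpu"
    if device == "gpu":
        return "cuda"
    # Anything else (e.g. "cpu" or "cuda:1") is passed through unchanged.
    return device


print(resolve_device("auto", cuda_available=True))
print(resolve_device("auto", cuda_available=False))
```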

### Loading from a path

By default, `dlt.TranslationModel` will download the model from the [huggingface repo](https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt) and cache it. However, you are free to load from a path:
```python
mt = dlt.TranslationModel("/path/to/your/model/directory/", model_family="mbart50")
```
Make sure that your tokenizer is also stored in the same directory if you use this approach.


### Using a different model

You can also choose another model that has [a similar format](https://huggingface.co/models?filter=mbart-50). In those cases, it's preferable to specify the model family:
```python
mt = dlt.TranslationModel("facebook/mbart-large-50-one-to-many-mmt")
mt = dlt.TranslationModel("facebook/m2m100_1.2B", model_family="m2m100")
```
Note that the available languages will change if you do this, so you will not be able to leverage `dlt.lang` or `dlt.utils`.


### Breaking down into sentences

Translating extremely long texts in a single call is not recommended, because they take longer to process. Instead, you can break them down into sentences with the help of `nltk`. First install the library with `pip install nltk`, then run:

In [7]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [15]:
text = "Adam went to his favorite cafe. There, he met his friend Dr. Joe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # pass nltk's own language name here, not dlt.lang.ENGLISH
" ".join(mt.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.POLISH))

'Adam poszedł do ulubionej kawiarni. Tam spotkał się ze swoim przyjacielem dr Joe.'
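The split, translate, and rejoin steps can be wrapped in one function. The sketch below is a minimal illustration: `translate_long_text` is a hypothetical helper that is not part of `dl_translate`, and the two `fake_*` functions are stand-ins so the example runs without downloading a model.

```python
def translate_long_text(text, split_fn, translate_fn):
    """Split `text` into sentences, translate them as a list, and rejoin them."""
    sentences = [s for s in split_fn(text) if s.strip()]
    if not sentences:
        return ""
    return " ".join(translate_fn(sentences))


# Stand-ins for a sentence splitter and a translation model.
def fake_split(text):
    return text.split("|")


def fake_translate(sentences):
    return [s.upper() for s in sentences]


print(translate_long_text("first part|second part", fake_split, fake_translate))
```

With the objects from this notebook, `split_fn` could be `lambda t: nltk.tokenize.sent_tokenize(t, "english")` and `translate_fn` could be `lambda s: mt.translate(s, source=dlt.lang.ENGLISH, target=dlt.lang.POLISH)`.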

### Setting a `batch_size` and verbosity when calling `dlt.TranslationModel.translate`

You can set a batch size (i.e. the number of elements processed at once) for `mt.translate`, and choose whether to display a progress bar:

In [18]:
mt.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.POLISH, batch_size=1, verbose=True)

  0%|          | 0/2 [00:00<?, ?it/s]

['Adam poszedł do ulubionej kawiarni.',
 'Tam spotkał się ze swoim przyjacielem dr Joe.']

If you set `batch_size=None`, it will process the entire `text` at once rather than splitting it into "chunks". We recommend lowering `batch_size` if you do not have a lot of RAM or VRAM and run into CUDA out-of-memory errors. Set a higher value if you are using a high-end GPU and the VRAM is not fully utilized.
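To make the idea of "chunks" concrete, here is a minimal sketch of batching a list of sentences. It shows the general technique only and is not the library's implementation; `iter_batches` is a hypothetical helper.

```python
def iter_batches(items, batch_size=None):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size is None:
        # No batching: everything is processed in one go.
        yield list(items)
        return
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer or None")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


example = ["a", "b", "c"]
print(list(iter_batches(example, batch_size=2)))
print(list(iter_batches(example)))
```

A smaller `batch_size` means more, smaller slices, which lowers peak memory use at the cost of more iterations.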


### `dlt.utils` module

An alternative to `mt.available_languages()` is the `dlt.utils` module. You can use it to find out which languages and codes are available:


In [10]:
print(dlt.utils.available_languages('mbart50'))  # All languages that you can use
print(dlt.utils.available_codes('mbart50'))  # Code corresponding to each language accepted
print(dlt.utils.get_lang_code_map('mbart50'))  # Dictionary of lang -> code

('Arabic', 'Czech', 'German', 'English', 'Spanish', 'Estonian', 'Finnish', 'French', 'Gujarati', 'Hindi', 'Italian', 'Japanese', 'Kazakh', 'Korean', 'Lithuanian', 'Latvian', 'Burmese', 'Nepali', 'Dutch', 'Romanian', 'Russian', 'Sinhala', 'Turkish', 'Vietnamese', 'Chinese', 'Afrikaans', 'Azerbaijani', 'Bengali', 'Persian', 'Hebrew', 'Croatian', 'Indonesian', 'Georgian', 'Khmer', 'Macedonian', 'Malayalam', 'Mongolian', 'Marathi', 'Polish', 'Pashto', 'Portuguese', 'Swedish', 'Swahili', 'Tamil', 'Telugu', 'Thai', 'Tagalog', 'Ukrainian', 'Urdu', 'Xhosa', 'Galician', 'Slovene')
('ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN', 'af_ZA', 'az_AZ', 'bn_IN', 'fa_IR', 'he_IL', 'hr_HR', 'id_ID', 'ka_GE', 'km_KH', 'mk_MK', 'ml_IN', 'mn_MN', 'mr_IN', 'pl_PL', 'ps_AF', 'pt_XX', 'sv_SE', 'sw_KE', 'ta_IN', 'te_IN', 'th_TH', 'tl_XX
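Since the mapping goes from language name to code, a reverse lookup needs an inverted dictionary. The sketch below uses a hand-copied excerpt rather than the real return value of `dlt.utils.get_lang_code_map('mbart50')`; the three pairs assume that the two tuples printed above are listed in the same order.

```python
# Hand-copied excerpt of the language -> code mapping (assumed pairs).
lang_to_code = {"German": "de_DE", "English": "en_XX", "Polish": "pl_PL"}

# Invert the mapping to look up a language name from its code.
code_to_lang = {code: lang for lang, code in lang_to_code.items()}

print(code_to_lang["pl_PL"])
```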

## Advanced

You can directly access the underlying `transformers` model and tokenizer:

In [19]:
bart = mt.get_transformers_model()
tokenizer = mt.get_tokenizer()

print(tokenizer)
print(bart)

M2M100Tokenizer(name_or_path='saved_model', vocab_size=128104, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'additional_special_tokens': ['__af__', '__am__', '__ar__', '__ast__', '__az__', '__ba__', '__be__', '__bg__', '__bn__', '__br__', '__bs__', '__ca__', '__ceb__', '__cs__', '__cy__', '__da__', '__de__', '__el__', '__en__', '__es__', '__et__', '__fa__', '__ff__', '__fi__', '__fr__', '__fy__', '__ga__', '__gd__', '__gl__', '__gu__', '__ha__', '__he__', '__hi__', '__hr__', '__ht__', '__hu__', '__hy__', '__id__', '__ig__', '__ilo__', '__is__', '__it__', '__ja__', '__jv__', '__ka__', '__kk__', '__km__', '__kn__', '__ko__', '__lb__', '__lg__', '__ln__', '__lo__', '__lt__', '__lv__', '__mg__', '__mk__', '__ml__', '__mn__', '__mr__', '__ms__', '__my__', '__ne__', '__nl__', '__no__', '__ns__', '__oc__', '__or__', '__pa__', '__pl__

See the [huggingface docs](https://huggingface.co/transformers/master/model_doc/mbart.html) for more information.


### `bart_model.generate()` keyword arguments

When running `mt.translate`, you can also give a `generation_options` dictionary that is passed as keyword arguments to the underlying `bart_model.generate()` method:

In [20]:
mt.translate(
    sents,
    source=dlt.lang.ENGLISH,
    target=dlt.lang.POLISH,
    generation_options=dict(num_beams=5, max_length=128)
)

['Adam poszedł do ulubionej kawiarni.',
 'Tam spotkał się ze swoim przyjacielem dr Joe.']
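Passing a dictionary "as keyword arguments" relies on Python's `**` unpacking. The sketch below demonstrates the mechanism with a stand-in function; `fake_generate` and its default values are hypothetical and do not reflect the defaults of the `transformers` `generate()` method.

```python
def fake_generate(input_ids, num_beams=1, max_length=20):
    """Stand-in for a generate() method; returns the settings it received."""
    return {
        "n_inputs": len(input_ids),
        "num_beams": num_beams,
        "max_length": max_length,
    }


generation_options = dict(num_beams=5, max_length=128)

# Each key in the dictionary becomes a keyword argument of the call.
result = fake_generate([101, 102], **generation_options)
print(result)
```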