Neural Machine Translation

Hello!

This repo contains the code accompanying The Lord of The Words talks given at various events. It tackles the Transformer architecture from the translation perspective.

The Challenge

Train a model that translates text from one language into another, taking into account LLM fundamentals: the Transformer architecture and feature engineering from Natural Language Processing.

Why is this suitable and interesting for DVC and the VS Code DVC extension?

DVC allows us to version the 9 language datasets used for training.
DVC pipelines allow us to train the Transformer architecture for each language without duplicating code, while versioning datasets, feature engineering parameters and architecture variations per language.
The tables and plots of the VS Code DVC extension allow us to benchmark how well the same (or the best) feature engineering and the same (or the best) architecture perform across languages, and to visualize learning and attention heads.
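As an illustration of training per language without duplicating code, the sketch below builds one `dvc exp run` command per language pair by overriding a single parameter. The pair names and the parameter key `train.language_pair` are hypothetical (the real keys live in this repo's `params.yaml`); only the stage name `train_transformer` comes from the instructions in this README.

```python
import shlex

# Hypothetical language pairs and parameter key, for illustration only.
LANGUAGE_PAIRS = ["pt_to_en", "es_to_en", "fr_to_en"]
PARAM_NAME = "train.language_pair"
STAGE = "train_transformer"  # stage name used elsewhere in this README


def build_experiment_commands(pairs, param_name=PARAM_NAME, stage=STAGE):
    """Return one `dvc exp run` command line per language pair."""
    commands = []
    for pair in pairs:
        args = [
            "dvc", "exp", "run",
            "--name", f"nmt-{pair.replace('_', '-')}",  # experiment name
            "--set-param", f"{param_name}={pair}",      # override one parameter
            stage,
        ]
        commands.append(shlex.join(args))
    return commands


for command in build_experiment_commands(LANGUAGE_PAIRS):
    print(command)
```

Each printed line can be run in a terminal. The resulting experiments then appear side by side in the experiments table of the VS Code DVC extension.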

What is Neural Machine Translation?

The main goal of Neural Machine Translation (NMT) is to transform a sequence in one language into a sequence in another. It is an approach to machine translation within NLP that uses artificial neural networks to predict the likelihood of a sequence of words. These networks are often trained end to end and can generalize well to very long word sequences. Formally, it can be defined as a neural network that models the conditional probability $p(y|x)$ of translating a source sentence $x_1, \dots, x_n$ into a target sentence $y_1, \dots, y_m$.
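This conditional probability is commonly factorized one target token at a time. The formulation below is the standard autoregressive one, stated here as an assumption rather than taken from this repo's code; $m$ denotes the number of target tokens and $y_{<t}$ the target tokens before position $t$.

$$
p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x),
\qquad
\log p(y \mid x) = \sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x)
$$

Under this assumption, training amounts to minimizing the negative log-likelihood $-\log p(y \mid x)$ over the sentence pairs in the dataset.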

Why Transformers for Neural Machine Translation?

The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and because its training over a sequence can be parallelized. However, deploying Transformers is challenging because different scenarios require models of different complexity and scale.
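One reason training parallelizes well is that attention for all target positions can be computed in a single matrix product, with a look-ahead mask hiding future tokens. The following is a minimal NumPy sketch of scaled dot-product attention under that assumption. The function names are illustrative and not taken from this repository, whose TensorFlow implementation may differ.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v, mask=None):
    """Return (output, weights) for softmax(q k^T / sqrt(d_k)) v.

    `mask` holds 1.0 where attention is NOT allowed and 0.0 elsewhere.
    """
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask * -1e9  # push masked scores towards -inf
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def look_ahead_mask(size):
    """Upper-triangular mask: position i may not attend to positions j > i."""
    return np.triu(np.ones((size, size)), k=1)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 target positions, model width 8
output, weights = scaled_dot_product_attention(x, x, x, look_ahead_mask(4))
print(weights.round(2))  # lower-triangular: no position sees the future
```

All four positions are processed at once, yet each row of `weights` only covers the current and earlier positions, which is what allows teacher-forced training without a sequential loop.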

Current state of the project

The project is structured as follows. Tokenizer language has created 9 datasets of 9 tokenized languages, following the word embeddings tutorial. It is kept separate from the Neural Machine Translation project for faster integration. These datasets are integrated with DVC. In `src/features` you can find the feature engineering steps:

load_dataset.py: loads the data.
tokenizer_transformer.py: tokenizes the dataset and creates batches.
positional_encoding.py: creates the embeddings.
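As the file name suggests, the embeddings carry positional information. The sketch below shows the sinusoidal positional encoding from the original Transformer paper in plain NumPy. It is an assumption that `positional_encoding.py` follows this scheme; the actual module may differ, for example by concatenating the sine and cosine halves instead of interleaving them.

```python
import numpy as np


def positional_encoding(length, depth):
    """Sinusoidal encoding of shape (length, depth); depth must be even."""
    positions = np.arange(length)[:, np.newaxis]       # (length, 1)
    pair_index = np.arange(depth // 2)[np.newaxis, :]  # (1, depth / 2)
    angle_rates = 1.0 / np.power(10000.0, 2 * pair_index / depth)
    angles = positions * angle_rates                   # (length, depth / 2)

    encoding = np.zeros((length, depth))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding


pe = positional_encoding(length=50, depth=16)
print(pe.shape)
```

The encoding is typically added to the token embeddings so that the model can tell positions apart, since attention by itself is order-agnostic.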

In `src/models` you can find the modules for training and inference:

train_transformer.py: trains the Transformer, declaring the arguments for the encoder and decoder modules.

In `src/visualization` you can find the visualizations for the VS Code extension:

metrics.py: defines loss_function and accuracy_function.
visualize.py: defines the attention heads that will be plotted in VS Code.
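For sequence models, loss and accuracy are usually computed only over real tokens, ignoring padding. The NumPy sketch below illustrates that idea using the same function names as `metrics.py`. The bodies are an assumption (the repo's TensorFlow versions may differ), as is the choice of token id 0 for padding.

```python
import numpy as np

PAD_ID = 0  # assumption: token id 0 marks padding


def loss_function(labels, logits):
    """Mean cross-entropy over non-padding positions.

    labels: (batch, seq) integer ids; logits: (batch, seq, vocab).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., np.newaxis], axis=-1)
    token_loss = -picked[..., 0]
    mask = labels != PAD_ID
    return (token_loss * mask).sum() / mask.sum()


def accuracy_function(labels, logits):
    """Fraction of non-padding positions where the argmax matches the label."""
    predictions = logits.argmax(axis=-1)
    mask = labels != PAD_ID
    return ((predictions == labels) & mask).sum() / mask.sum()


labels = np.array([[3, 1, 0, 0]])  # two real tokens, two padding tokens
logits = np.zeros((1, 4, 4))
logits[0, 0, 3] = 10.0             # correct prediction at position 0
logits[0, 1, 2] = 10.0             # wrong prediction at position 1
print(loss_function(labels, logits), accuracy_function(labels, logits))
```

Without the mask, the padded positions would be scored like real tokens and skew both metrics, which matters when comparing languages with different sentence lengths.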

Instructions for reproducing the project

Setup instructions for Mac M1

1. Open a terminal and clone the repository

git clone https://github.com/SoyGema/Neural-Machine-Translation

2. Activate the virtual environment

source .venv/bin/activate

3. Download the TF 2.9 wheel from here

4. Install the requirements

pip3 install -r requirements.txt

5. Download the data from the data registry. This step should place the data in the right folder.

6. Run the training stage

dvc exp run train_transformer

Notes for Mac devices and TensorFlow

Currently testing with TF 2.10. To run on a Mac:

conda install -c apple tensorflow-deps --force-reinstall
conda install numpy --force-reinstall

conda install -c apple tensorflow-deps=2.10.0
python -m pip install tensorflow-macos==2.10.0

Then install TensorFlow from the wheel.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   ├── tokenizer_transformer.py
│   │   └── positional_encoding.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── decoder.py 
│   │   ├── encoder.py       
│   │   ├── predict_transformer.py    
│   │   └── train_transformer.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       ├── metrics.py  
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience

@inproceedings{Ye2018WordEmbeddings,
  author    = {Ye, Qi and Devendra, Sachan and Matthieu, Felix and Sarguna, Padmanabhan and Graham, Neubig},
  title     = {When and Why are pre-trained word embeddings useful for Neural Machine Translation},
  booktitle = {HLT-NAACL},
  year      = {2018},
}
