
Neural Machine Translation

Hello!

This repo contains the code accompanying The Lord of The Words talks given at various events. It tackles the Transformer architecture from the translation perspective.

The Challenge

Train a model that translates text from one language into another, taking into account LLM fundamentals: the Transformer architecture and feature engineering from Natural Language Processing.

Why is this suitable/interesting for DVC and the VSCode DVC extension?

  • DVC allows us to version the 9 different language datasets used for training.

  • DVC Pipelines allow us to train the Transformer architecture for each language, avoiding code duplication and controlling versioning per language for datasets, feature engineering parameters, and architecture variations.

  • The VSCode DVC extension table and plots allow us to benchmark how well the same/best feature engineering and the same/best architecture perform across languages, and to visualize learning curves and attention heads.

What is Neural Machine Translation?

Neural Machine Translation's main goal is to transform a sequence in one language into a sequence in another language. It is an approach to machine translation within NLP that uses artificial neural networks to predict the likelihood of a sequence of words; it is typically trained end-to-end and can generalize well to very long word sequences. Formally, it can be defined as a neural network that models the conditional probability $p(y \mid x)$ of translating a source sentence $x_1 \ldots x_n$ into a target sentence $y_1 \ldots y_m$.
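
Concretely, the model factorizes this conditional probability token by token, so training reduces to next-token prediction on the target side. This is the standard chain-rule decomposition, shown here for reference (not specific to this repo):

$$ p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_1 \ldots y_{t-1}, x) $$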

Why Transformers for Neural Machine Translation?

The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and its parallel training of sequence generation. However, deploying Transformers is challenging because different scenarios require models of different complexities and scales.

Current state of the project

The project structure is divided as follows. The tokenizer has created 9 datasets of 9 tokenized languages, following the word embeddings tutorial. This is kept separate from the Neural Machine Translation project for faster integration. These datasets are integrated with DVC. In `src/features` you can find the feature engineering steps, which cover the feature engineering transformations:

  • load_dataset.py loads the data
  • tokenizer_transformer.py tokenizes the dataset and builds batches
  • positional_encoding.py builds the embeddings (see the positional-encoding sketch below)
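
As a reference for the positional encoding step, here is a minimal sketch of the standard sinusoidal encoding from "Attention Is All You Need" (the same formulation used in the TensorFlow Transformer tutorial). It is an assumption about what positional_encoding.py computes, with illustrative names, not the repo's exact API.

import numpy as np

def positional_encoding(length: int, depth: int) -> np.ndarray:
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i/depth))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/depth))
    """
    positions = np.arange(length)[:, np.newaxis]              # (length, 1)
    dims = np.arange(depth)[np.newaxis, :]                    # (1, depth)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(depth))
    angles = positions * angle_rates                          # (length, depth)
    angles[:, 0::2] = np.sin(angles[:, 0::2])                 # even indices -> sin
    angles[:, 1::2] = np.cos(angles[:, 1::2])                 # odd indices  -> cos
    return angles.astype(np.float32)

# Example: encodings for a 128-token sequence with model dimension 512
print(positional_encoding(128, 512).shape)  # (128, 512)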

In `src/models` you can find the modules for training and for inference:

  • train_transformer.py trains the Transformer, declaring the arguments for the encoder and decoder modules (see the hyperparameter sketch below)
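
For orientation, these are the kinds of encoder/decoder arguments a Transformer training entry point typically declares. The values below are the small-model defaults from the TensorFlow Transformer tutorial, used purely as an illustrative assumption; the actual values are tracked as DVC parameters, not hard-coded.

# Hypothetical hyperparameters for the encoder/decoder stacks (illustrative only)
transformer_params = {
    "num_layers": 4,      # number of encoder and decoder blocks
    "d_model": 128,       # embedding / model width
    "num_heads": 8,       # attention heads per block
    "dff": 512,           # feed-forward inner dimension
    "dropout_rate": 0.1,  # dropout applied to embeddings and sub-layer outputs
}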

In `src/visualization` you can find the visualizations for the VS Code extension:

  • metrics.py defines the loss_function and accuracy_function (a sketch follows this list).
  • visualize.py defines the attention head plots that will be displayed in Visual Studio Code.
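
As a reference for metrics.py, this is the padding-masked loss and accuracy pattern from the TensorFlow Transformer tutorial. It is a sketch under the assumption that padding tokens have id 0; the actual implementation in this repo may differ.

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(real, pred):
    """Cross-entropy averaged only over non-padding target tokens (id 0 = padding)."""
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss.dtype)
    loss *= mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

def accuracy_function(real, pred):
    """Token-level accuracy, ignoring padding positions."""
    pred_ids = tf.cast(tf.argmax(pred, axis=2), real.dtype)
    accuracies = tf.equal(real, pred_ids)
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    accuracies = tf.cast(tf.math.logical_and(mask, accuracies), tf.float32)
    return tf.reduce_sum(accuracies) / tf.reduce_sum(tf.cast(mask, tf.float32))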

Instructions for reproducing the project (setup for Mac M1)

1. Open a terminal and clone the repository

git clone https://github.com/SoyGema/Neural-Machine-Translation

2. Activate the virtual environment

source .venv/bin/activate

3. Download the TF 2.9 wheel from here

4. Install requirements.txt

pip3 install -r requirements.txt

5. Download the data from the data registry. This step should place the data in the right folder.

6. Run the training experiment (a hypothetical dvc.yaml sketch follows these steps)

dvc exp run train_transformer
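
For context, `dvc exp run train_transformer` reproduces a stage defined in `dvc.yaml`. Below is a hypothetical sketch of what such a stage could look like; the paths, parameter names, and outputs are illustrative assumptions, not the repository's actual pipeline definition.

stages:
  train_transformer:
    cmd: python src/models/train_transformer.py
    deps:
      - data/processed                      # tokenized dataset pulled from the data registry (assumed path)
      - src/models/train_transformer.py
    params:
      - num_layers                          # assumed parameter names in params.yaml
      - d_model
      - num_heads
    outs:
      - models/transformer                  # trained weights (assumed path)
    metrics:
      - metrics.json:
          cache: false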

Notes for Mac devices and TensorFlow

Currently testing under the TF 2.10 scenario. For running on Mac:

conda install -c apple tensorflow-deps --force-reinstall
conda install numpy --force-reinstall

conda install -c apple tensorflow-deps=2.10.0
python -m pip install tensorflow-macos==2.10.0

Then install from the wheel.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── tokenizer_transformer.py
│   │   └── positional_encoding.py 
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── decoder.py 
│   │   ├── encoder.py       
│   │   ├── predict_transformer.py    
│   │   └── train_transformer.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       ├── metrics.py  
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience

@inproceedings{Ye2018WordEmbeddings,
  author    = {Ye, Qi and Devendra, Sachan and Matthieu, Felix and Sarguna, Padmanabhan and Graham, Neubig},
  title     = {When and Why are pre-trained word embeddings useful for Neural Machine Translation},
  booktitle = {HLT-NAACL},
  year      = {2018},
}
