This repo contains the code accompanying The Lord of The Words talks, given at various events. It tackles the Transformer architecture from the translation perspective.
It trains a model that translates text from one language into another, building on LLM fundamentals: the Transformer architecture and feature engineering from Natural Language Processing.
Why is this suitable and interesting for DVC and the VS Code DVC extension?
- DVC lets us version the 9 language datasets used for training.
- DVC pipelines let us train the Transformer architecture for each language without duplicating code, while versioning datasets, feature engineering parameters and architecture variations per language.
- The VS Code DVC extension's table and plots let us benchmark how well the same (or best) feature engineering and architecture perform across languages, and visualize learning and attention heads.
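As a rough sketch of the per-language idea, the snippet below builds one `dvc exp run` command per language instead of duplicating training code. The language codes and the parameter name `dataset.language` are assumptions made for illustration and are not taken from this repo's `params.yaml`; `--queue` and `--set-param` are standard `dvc exp run` options, and `train_transformer` is the stage name used later in this README.

```python
import shlex

# Hypothetical language codes and parameter name; adjust to the repo's params.yaml.
LANGUAGES = ["pt", "es", "fr"]
PARAM_NAME = "dataset.language"


def build_exp_command(language, stage="train_transformer", queue=True):
    """Return the `dvc exp run` argument list for one language."""
    cmd = ["dvc", "exp", "run"]
    if queue:
        cmd.append("--queue")  # schedule the experiment instead of running it now
    cmd += ["--set-param", f"{PARAM_NAME}={language}", stage]
    return cmd


commands = [build_exp_command(lang) for lang in LANGUAGES]
for cmd in commands:
    # Print the commands; pass `cmd` to subprocess.run to actually execute them.
    print(shlex.join(cmd))
```

The commands are only printed here, so the sketch can be inspected without DVC installed. Queued experiments still need to be started separately.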
The main goal of Neural Machine Translation is to transform a sequence in one language into a sequence in another. It is an approach to machine translation within NLP that uses artificial neural networks to predict the likelihood of a sequence of words. Such models are often trained end to end and can generalize well to very long word sequences. Formally, it can be defined as a neural network that models the conditional probability $p(y|x)$ of translating a source sentence $x$ into a target sentence $y$.
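As a worked form of this definition, assume the standard autoregressive formulation (a textbook convention, not something stated in this repo), with source sentence $x$ and target sentence $y = (y_1, \dots, y_T)$. The conditional probability then factorizes token by token, and training typically minimizes the negative log-likelihood:

$$
p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x),
\qquad
\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x)
$$

Here $y_{<t}$ denotes the target tokens before position $t$.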
The Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and its ability to train sequence generation in parallel. However, deploying Transformers is challenging because different scenarios require models of different complexities and scales.
The project is structured as follows. The tokenizer language project has created 9 datasets, one per tokenized language, following the word embeddings tutorial. It is kept separate from the Neural Machine Translation project for faster integration, and the datasets are integrated with DVC. In `src/features` you can find the feature engineering steps:
- `load_dataset.py`: loads the data
- `tokenizer_transformer.py`: tokenizes the dataset and makes batches
- `positional_encoding.py`: makes the embeddings
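To illustrate what a positional encoding step computes, here is a minimal sketch of the sinusoidal encoding from the original Transformer paper, written in plain Python for readability. It is an assumption that `positional_encoding.py` follows this formulation; the repo's TensorFlow implementation may arrange the sine and cosine terms differently, and the function name and arguments below are illustrative.

```python
import math


def positional_encoding(length, depth):
    """Return `length` rows of `depth` sinusoidal position-encoding values."""
    table = []
    for pos in range(length):
        row = []
        for dim in range(depth):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / depth).
            angle = pos / (10000.0 ** ((2 * (dim // 2)) / depth))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(length=50, depth=8)
print(len(pe), len(pe[0]))  # number of positions, encoding depth
```

Each row is added to the token embedding at that position, which gives the model access to word order without recurrence.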
In `src/models` you can find the modules for training and inference:
- `train_transformer.py`: trains the Transformer, declaring the arguments for the encoder and decoder modules
In `src/visualization` you can find the visualizations for the VS Code extension:
- `metrics.py`: defines `loss_function` and `accuracy_function`.
- `visualize.py`: defines the attention-head plots shown in VS Code.
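As a hedged illustration of what such loss and accuracy functions usually do in sequence models, the sketch below averages cross-entropy and accuracy over non-padding positions only. The names `masked_loss` and `masked_accuracy`, the padding id `0`, and the plain-Python inputs are assumptions for this example; the repo's `loss_function` and `accuracy_function` operate on TensorFlow tensors and may differ.

```python
import math

PAD_ID = 0  # assumption: token id 0 marks padding


def masked_loss(targets, probs):
    """Mean cross-entropy over non-padding positions.

    targets: list of token ids; probs: one probability distribution per position.
    """
    losses = [-math.log(p[t]) for t, p in zip(targets, probs) if t != PAD_ID]
    return sum(losses) / len(losses) if losses else 0.0


def masked_accuracy(targets, probs):
    """Fraction of non-padding positions where the argmax equals the target."""
    hits = [
        max(range(len(p)), key=p.__getitem__) == t
        for t, p in zip(targets, probs)
        if t != PAD_ID
    ]
    return sum(hits) / len(hits) if hits else 0.0


targets = [2, 1, 0]  # the last position is padding and is ignored
probs = [
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.9, 0.05, 0.05],
]
print(masked_loss(targets, probs), masked_accuracy(targets, probs))
```

Masking matters because batches are padded to a common length, and counting padding positions would make both metrics depend on sentence length rather than on translation quality.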
Instructions for reproducing the project (setup for Mac M1)
1. Open a terminal and clone the repository
git clone https://github.com/SoyGema/Neural-Machine-Translation
2. Activate the virtual environment
source .venv/bin/activate
3. Download the TF 2.9 wheel from here
4. Install the dependencies from requirements.txt
pip3 install -r requirements.txt
5. Download the data from the data registry. This step should place the data in the right folder.
6. Run `dvc exp run train_transformer`
Currently testing under TF 2.10. To run on a Mac:
conda install -c apple tensorflow-deps --force-reinstall
conda install numpy --force-reinstall
conda install -c apple tensorflow-deps=2.10.0
python -m pip install tensorflow-macos==2.10.0
Then install from wheel
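After installing, one way to confirm which versions ended up in the environment is a small check like the sketch below. It uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is made up for this example, and the package names are the ones used in the commands above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("tensorflow-macos", "numpy"):
    # Prints None for any package that is not installed in this environment.
    print(name, installed_version(name))
```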
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ ├── tokenizer_transformer.py
│ │ └── positional_encoding.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── decoder.py
│ │ ├── encoder.py
│ │ ├── predict_transformer.py
│ │ └── train_transformer.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ ├── metrics.py
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io
Project based on the cookiecutter data science project template. #cookiecutterdatascience
@inproceedings{Ye2018WordEmbeddings,
  author = {Ye, Qi and Devendra, Sachan and Matthieu, Felix and Sarguna, Padmanabhan and Graham, Neubig},
  title = {When and Why are pre-trained word embeddings useful for Neural Machine Translation},
  booktitle = {HLT-NAACL},
  year = {2018},
}