NL-FM-Toolkit stands for Natural Language - Foundation Model Toolkit. The repository contains code for pre-training and fine-tuning of language models.

The repository was used to obtain all the results reported in the paper *Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages* (Tejas Dhamecha, Rudra Murthy, Samarth Bharadwaj, Karthik Sankaranarayanan, Pushpak Bhattacharyya; EMNLP 2021).
The Indo-Aryan language models can be found here and here.
The repository contains code for:

- Training of language models from
  - scratch
  - existing pre-trained language models
- Fine-tuning of pre-trained language models for
  - Sequence-labelling tasks like POS, NER, etc.
  - Text classification tasks like Sentiment Analysis, etc.
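A central preprocessing step when fine-tuning for sequence-labelling tasks (POS, NER) is aligning word-level labels with subword tokens. The sketch below illustrates the usual convention in plain Python; the toy subword splitter is hypothetical and stands in for a real pretrained tokenizer.

```python
# Sketch: aligning word-level POS/NER labels with subword tokens.
# The tokenizer here is a hypothetical stand-in; real code would use a
# pretrained subword tokenizer instead.

def toy_subword_tokenize(word):
    """Hypothetical subword splitter: chunks of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def align_labels(words, labels, ignore_index=-100):
    """Give each word's label to its first subword piece and mask the
    rest, so the loss is computed only once per word (the usual
    convention in token-classification fine-tuning)."""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = toy_subword_tokenize(word)
        tokens.extend(pieces)
        aligned.extend([label] + [ignore_index] * (len(pieces) - 1))
    return tokens, aligned

tokens, aligned = align_labels(["playing", "chess"], ["VERB", "NOUN"])
print(tokens)   # ['pla', 'yin', 'g', 'che', 'ss']
print(aligned)  # ['VERB', -100, -100, 'NOUN', -100]
```

The `-100` sentinel mirrors the default ignore index of common cross-entropy loss implementations, so masked positions drop out of the loss.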
The following models are supported for training language models:

- Encoder-only models (BERT-like models)
  - Masked Language Modeling
  - Whole-word Masked Language Modeling
- Auto-regressive models (GPT-like models)
  - Causal Language Modeling
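The two masked objectives above differ only in what gets masked: plain MLM masks word pieces independently, while whole-word masking masks all pieces of a word together. A minimal sketch in plain Python over BERT-style word pieces (continuation pieces prefixed with `##`); real training operates on token ids and also applies the 80/10/10 replacement rule, which is omitted here.

```python
import random

MASK, PROB = "[MASK]", 0.15

def mask_tokens(tokens, rng):
    """Plain MLM: each word piece is masked independently."""
    return [MASK if rng.random() < PROB else t for t in tokens]

def whole_word_mask(tokens, rng):
    """Whole-word masking: all pieces of one word are masked together."""
    out, i = [], 0
    while i < len(tokens):
        # group a word head with its "##" continuation pieces
        j = i + 1
        while j < len(tokens) and tokens[j].startswith("##"):
            j += 1
        if rng.random() < PROB:
            out.extend([MASK] * (j - i))
        else:
            out.extend(tokens[i:j])
        i = j
    return out

tokens = ["play", "##ing", "chess", "is", "fun"]
print(mask_tokens(tokens, random.Random(0)))
print(whole_word_mask(tokens, random.Random(0)))
```

With plain MLM, `##ing` can be masked while `play` survives; whole-word masking never splits a word that way.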
The code uses Aim to keep track of the various hyper-parameter runs and to select the best hyper-parameters.
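Aim handles the per-run bookkeeping; conceptually, selecting the best hyper-parameters then reduces to an argmin over the logged runs. A toy sketch of that selection loop, with a made-up grid and scoring function standing in for real training runs:

```python
from itertools import product

def train_and_eval(lr, batch_size):
    """Stand-in for a real training run: returns a fake validation
    loss that happens to be minimized at lr=3e-5, batch_size=32."""
    return abs(lr - 3e-5) * 1e4 + abs(batch_size - 32) / 64

# Hypothetical hyper-parameter grid for illustration only.
grid = {"lr": [1e-5, 3e-5, 5e-5], "batch_size": [16, 32]}
runs = [dict(zip(grid, values)) for values in product(*grid.values())]

# Pick the configuration with the lowest validation loss.
best = min(runs, key=lambda cfg: train_and_eval(**cfg))
print(best)  # {'lr': 3e-05, 'batch_size': 32}
```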
The toolkit requires Python 3.9.0. Create and activate a conda environment:

```shell
conda create -n NLPWorkSpace python==3.9.0
conda activate NLPWorkSpace
```

Clone the repo and install the dependencies:

```shell
git clone https://github.com/IBM/NL-FM-Toolkit.git
cd NL-FM-Toolkit
pip install -r requirements.txt
```

| Task | README | Tutorials |
|---|---|---|
| Tokenizer Training | README file present in `src/tokenizer` for more details | Tokenizer Training Tutorial |
| Language Model | README file present in `src/lm` for more details | Masked Language Model Tutorial, Causal Language Model Tutorial |
| Token Classification (Sequence-Labelling) Tasks | README file present in `src/tokenclassifier` for more details | Sequence Labeling Trainer |
| Sequence-Classification Tasks | README file present in `src/sequenceclassifier` for more details | Sequence Classification Trainer |
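Tokenizer training (as in `src/tokenizer`) learns a subword vocabulary from raw text. The classic byte-pair encoding (BPE) procedure repeatedly merges the most frequent adjacent symbol pair; the sketch below illustrates the idea on a toy corpus and is not the toolkit's actual implementation.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: learns `num_merges` merge rules from a list of
    words, each word starting as a sequence of single characters."""
    vocab = Counter(tuple(w) for w in words)  # symbol tuple -> count
    merges = []
    for _ in range(num_merges):
        # count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the merge to every word in the vocabulary
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges

print(train_bpe(["low", "lower", "lowest"], 2))  # [('l', 'o'), ('lo', 'w')]
```

Real tokenizer trainers add special tokens, frequency thresholds, and byte-level fallbacks on top of this core loop.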
The repo is organized as follows.

| folder | description |
|---|---|
| `src` | The core code is present in this folder. |
| `src/tokenizer` | Code to train a tokenizer from scratch. |
| `src/lm` | Code to train a language model. |
| `src/tokenclassifier` | Code to train a token classifier model. |
| `src/sequenceclassifier` | Code to train a sequence classifier model. |
| `src/utils` | Miscellaneous helper scripts. |
| `demo` | Contains the data used by the tutorial code and the folder to save the trained demo models. |
| `examples` | Sample scripts to run the models. |
| `docs` | All related documentation, etc. |
TODO / planned features:

- Token Classification
  - Loading Data from Huggingface Dataset
- Sequence-to-Sequence Pre-training
  - Encoder-Decoder Models (mBART, mT5 like models)
    - Denoising objective
    - Whole-word Denoising objective
- Question-Answering
- Machine Translation
If you find this toolkit useful in your work, you can cite the paper as below:
```bibtex
@inproceedings{dhamecha-etal-2021-role,
    title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
    author = "Dhamecha, Tejas and
      Murthy, Rudra and
      Bharadwaj, Samarth and
      Sankaranarayanan, Karthik and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.675",
    doi = "10.18653/v1/2021.emnlp-main.675",
    pages = "8584--8595",
}
```

