
🔬 NL-FM-Toolkit

NL-FM-Toolkit stands for Natural Language - Foundation Model Toolkit. The repository contains code for pre-training and fine-tuning of language models.

The repository was used to obtain all the results reported in the paper "Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages" (Tejas Dhamecha, Rudra Murthy, Samarth Bharadwaj, Karthik Sankaranarayanan, Pushpak Bhattacharyya; EMNLP 2021).

The Indo-Aryan language models can be found here and here.


The repository contains code for:

  • Training language models
    • from scratch
    • from existing pre-trained language models
  • Fine-tuning pre-trained language models for
    • sequence-labelling tasks such as POS tagging and NER
    • text-classification tasks such as sentiment analysis
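Sequence-labelling corpora are commonly distributed in CoNLL-style format: one token and one tag per line, with blank lines separating sentences. The exact input format this toolkit expects is documented in the `src/tokenclassifier` README; the parser below is only an illustrative sketch of that common two-column layout, not the toolkit's own loader.

```python
def read_conll(lines):
    """Parse CoNLL-style 'token<TAB>tag' lines into sentences.

    Blank lines separate sentences. Assumes the common two-column
    layout; check src/tokenclassifier for the toolkit's exact format.
    """
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:                      # sentence boundary
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:                            # flush the final sentence
        sentences.append((tokens, tags))
    return sentences

sample = ["John\tB-PER", "lives\tO", "in\tO", "Delhi\tB-LOC", "", "Hi\tO"]
print(read_conll(sample))
```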

The following model types and training objectives are supported:

  • Encoder-only models (BERT-like)
    • Masked Language Modeling
    • Whole-word Masked Language Modeling
  • Auto-regressive models (GPT-like)
    • Causal Language Modeling
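The two objectives differ only in how training targets are derived from a token-id sequence: masked LM hides a random fraction of tokens and predicts just those (whole-word masking additionally masks all sub-word pieces of a chosen word together), while causal LM predicts each token from its left context. A toy stdlib-only sketch of the target construction (the mask id 103 and ignore value -100 follow common BERT/PyTorch conventions but are illustrative, not this toolkit's implementation):

```python
import random

MASK_ID = 103   # e.g. BERT's [MASK]; purely illustrative
IGNORE = -100   # label value that losses typically skip

def mlm_example(token_ids, mask_prob=0.15, seed=0):
    """Masked LM: hide a fraction of tokens; predict only those."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            inputs[i] = MASK_ID
            labels[i] = tok
    return inputs, labels

def clm_example(token_ids):
    """Causal LM: predict each token from everything before it."""
    return token_ids[:-1], token_ids[1:]

ids = [5, 6, 7, 8, 9]
print(clm_example(ids))                 # ([5, 6, 7, 8], [6, 7, 8, 9])
print(mlm_example(ids, mask_prob=0.5))
```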

What's New

The code uses Aim to track hyper-parameter runs and select the best configuration.
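With each run's dev metric logged by a tracker such as Aim, model selection reduces to comparing runs across the search grid. The sketch below shows only that selection step with the standard library; the grid values and `dev_score` stand-in are hypothetical, and the real metric would come from training runs recorded by Aim rather than a function call.

```python
import itertools

# Hypothetical search space; the toolkit's actual hyper-parameters
# are set via its training scripts.
grid = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [16, 32],
}

def dev_score(cfg):
    # Stand-in for a real train-and-evaluate run whose dev metric
    # a tracker like Aim would record. Higher is better.
    return -abs(cfg["learning_rate"] - 3e-5) - 0.001 * (cfg["batch_size"] == 16)

# Materialize every configuration in the grid and pick the best one.
runs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
best = max(runs, key=dev_score)
print(best)   # {'learning_rate': 3e-05, 'batch_size': 32}
```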

📚 Documentation

⏬ Installation

Create and activate a conda environment with Python 3.9.0:

conda create -n NLPWorkSpace python==3.9.0
conda activate NLPWorkSpace

Clone the repository:

git clone https://github.com/IBM/NL-FM-Toolkit.git
cd NL-FM-Toolkit

Install the dependencies:

pip install -r requirements.txt
| Task | README | Tutorials |
| --- | --- | --- |
| Tokenizer training | See the README in src/tokenizer for details | Tokenizer Training Tutorial |
| Language model training | See the README in src/lm for details | Masked Language Model Tutorial, Causal Language Model Tutorial |
| Token classification (sequence labelling) | See the README in src/tokenclassifier for details | Sequence Labeling Trainer |
| Sequence classification | See the README in src/sequenceclassifier for details | Sequence Classification Trainer |

📁 Folder structure

The repo is organized as follows.

| folder | description |
| --- | --- |
| src | The core code. |
| src/tokenizer | Code to train a tokenizer from scratch. |
| src/lm | Code to train a language model. |
| src/tokenclassifier | Code to train a token-classifier model. |
| src/sequenceclassifier | Code to train a sequence-classifier model. |
| src/utils | Miscellaneous helper scripts. |
| demo | Data used by the tutorials and the folder where trained demo models are saved. |
| examples | Sample scripts to run the models. |
| docs | All related documentation. |

To-Do Tasks

  • Token Classification
    • Loading data from a Hugging Face dataset
  • Sequence-to-Sequence pre-training
    • Encoder-decoder models (mBART- and mT5-like)
    • Denoising objective
    • Whole-word denoising objective
  • Question Answering
  • Machine Translation

Citation

If you use this toolkit in your work, please cite the paper:

@inproceedings{dhamecha-etal-2021-role,
    title = "Role of {L}anguage {R}elatedness in {M}ultilingual {F}ine-tuning of {L}anguage {M}odels: {A} {C}ase {S}tudy in {I}ndo-{A}ryan {L}anguages",
    author = "Dhamecha, Tejas  and
      Murthy, Rudra  and
      Bharadwaj, Samarth  and
      Sankaranarayanan, Karthik  and
      Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.675",
    doi = "10.18653/v1/2021.emnlp-main.675",
    pages = "8584--8595",
}
