Pretraining Pipeline for a Smaller BERT
This repo provides a pipeline for pretraining a small BERT model with a masked language modeling (MLM) objective.
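For context, masked language modeling hides a fraction of the input tokens and trains the model to predict them. The sketch below shows the standard BERT-style masking rule; the repo's actual masking code and special-token ids may differ, and `MASK_ID` here is an assumption:

```python
import random

MASK_ID = 4          # assumption: id of the [MASK] token in this vocab
VOCAB_SIZE = 4096    # matches vocab_size in tiny_bert_config.py

def mask_tokens(token_ids, mask_prob=0.15):
    """Standard BERT-style MLM masking: ~15% of positions become prediction
    targets; of those, 80% are replaced by [MASK], 10% by a random token,
    and 10% are left unchanged. Returns (masked inputs, labels), where labels
    are -100 at positions that are not predicted (ignored by the loss)."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels.append(tok)
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)
            # else: keep the original token as input
        else:
            labels.append(-100)
    return inputs, labels
```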
Download Anaconda and create a conda environment (Python 3.10.x), then install the dependencies from the `requirements.txt` file:
```bash
pip install -r requirements.txt --no-cache-dir
```
- Change the config file `tiny_bert_config.py` to fit your needs:
```python
class TinyBertConfig:
    # your tokenizer path
    tokenizer_path = "./tokenization/vocab/vocab_bert.4096"
    # vocabulary size
    vocab_size = 4096
    # max sequence length
    max_len = 128
    # embedding dimension
    d_model = 256
    # number of attention heads
    n_head = 4
    # number of encoder layers
    n_encoder_layer = 4
    # training batch size
    batch_size = 128
    # number of training epochs
    epoch = 20
    # device to train on ("cuda" or "cpu")
    device = "cuda"
    # path to the pretraining data
    train_data_path = "./pretraining_data/tiny_bert_pretraining_data_v3"
    # where the trained checkpoint is saved
    saving_dir = "./tiny_bert_artifact/tiny_bert_ner-v3.pt"
```
- Run the pretraining script from the command line:

```bash
python pretrain_tiny_bert.py
```
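Once training finishes, the checkpoint is written to `saving_dir` from the config above. Below is a minimal sketch for inspecting it, assuming the script saves the model's `state_dict`; the exact contents of the `.pt` file depend on how `pretrain_tiny_bert.py` saves it:

```python
import torch

# Path taken from TinyBertConfig.saving_dir above.
checkpoint = torch.load("./tiny_bert_artifact/tiny_bert_ner-v3.pt", map_location="cpu")

# If the file holds a plain state_dict, list the parameter tensors it contains.
for name, tensor in checkpoint.items():
    print(name, tuple(tensor.shape))
```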
To train the tokenizer, use the following code snippet:
```python
from tokenization.text_processor import Tokenizer

TOKENIZER_PATH = 'where/to/save/my/tokenizer/artifact'
TOKENIZER_MAX_SEQ = 128        # max sequence length your model can handle
TOKENIZER_VOCAB_SIZE = 8192    # vocabulary size you want
DATA_PATH = "tokenizer_training_data.txt"

sp_tokenizer = Tokenizer(
    sentencepiece_path=TOKENIZER_PATH,
    max_len=TOKENIZER_MAX_SEQ,
    vocab_size=TOKENIZER_VOCAB_SIZE,
)
sp_tokenizer.train(DATA_PATH)
```
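After training, the artifact can be loaded back for encoding. The wrapper above appears to be built on SentencePiece, so the sketch below uses the `sentencepiece` library directly; the output filename (the `.model` suffix here) is an assumption and depends on how `Tokenizer` names its artifacts:

```python
import sentencepiece as spm

# Assumed filename: SentencePiece trainers typically write <prefix>.model / <prefix>.vocab.
sp = spm.SentencePieceProcessor(model_file="where/to/save/my/tokenizer/artifact.model")

text = "pretraining a tiny BERT with masked language modeling"
print(sp.encode(text, out_type=str))   # subword pieces
print(sp.encode(text, out_type=int))   # token ids
```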