Author: Nguyen Duc Tri (Alan Nguyen)
Github: https://github.com/Alan-404
Linkedin: https://www.linkedin.com/in/%C4%91%E1%BB%A9c-tr%C3%AD-nguy%E1%BB%85n-269845210/
Reference: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (2018). Improving Language Understanding by Generative Pre-Training.
- Clone this repo:
  git clone https://github.com/Alan-404/GPT-model.git
- CD into the project:
  cd GPT-model
- (Optional) Create a Conda environment:
  conda create --name {YOUR_PROJECT_NAME}
- (Optional) Activate the Conda environment:
  conda activate {YOUR_PROJECT_NAME}
- Install packages:
  pip install -r requirements.txt
- If you have a question and its corresponding answer, the data sample looks like:
  {question} <sep> {answer}
  (<sep>: separator token)
  Example:
  - Question: What is your name?
  - Corresponding answer: I am chatbot Lily
  - Data sample: What is your name? <sep> I am chatbot Lily
- Store all your data samples in a txt file.

(*): Marks a required argument.
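As a sketch, the sample file described above can be built from question/answer pairs with a few lines of Python (the file name `samples.txt` and the pairs are just examples):

```python
# Build a data-sample txt file: one "{question} <sep> {answer}" line per sample.
pairs = [
    ("What is your name?", "I am chatbot Lily"),
    ("How are you?", "I am fine, thank you"),
]

with open("samples.txt", "w", encoding="utf-8") as f:
    for question, answer in pairs:
        # Each line is one data sample in the format shown above.
        f.write(f"{question} <sep> {answer}\n")
```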
- Train Tokenizer:
  python data.py --data_path {DATA_SAMPLE_PATH} --tokenizer_path {TOKENIZER_PATH} --iterations {ITERATIONS} --sigma {SIGMA}
  - (*) DATA_SAMPLE_PATH: the txt file storing all your data samples.
  - (*) TOKENIZER_PATH: the path where the tokenizer is stored after training. If the path is None, the tokenizer learns from scratch; otherwise it continues training from the previous session.
  - (*) ITERATIONS: the maximum number of loops used for training the tokenizer.
  - SIGMA: SIGMA = Num(tokens_whitespace) / Num(tokens_trained); default is 2.
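One reading of the SIGMA definition above: for a given corpus, the number of trained tokens equals the whitespace token count divided by SIGMA. A sketch under that assumption (the helper names are hypothetical, not functions from data.py):

```python
def whitespace_token_count(samples):
    # Count whitespace-separated tokens across all data samples.
    return sum(len(line.split()) for line in samples)

def trained_token_count(samples, sigma=2):
    # SIGMA = Num(tokens_whitespace) / Num(tokens_trained),
    # so Num(tokens_trained) = Num(tokens_whitespace) / SIGMA.
    return whitespace_token_count(samples) // sigma

samples = ["What is your name? <sep> I am chatbot Lily"]
print(whitespace_token_count(samples))   # 9 whitespace tokens
print(trained_token_count(samples))      # 9 // 2 = 4 with the default SIGMA
```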
- Preprocessing Data - Digitize Text Data:
  python process.py --data_path {DATA_SAMPLE_PATH} --tokenizer_path {TOKENIZER_PATH} --max_length {MAX_LENGTH} --clean_path {CLEAN_PATH}
  - MAX_LENGTH: the number of context tokens you want to set; default is None. If None, the max length is set to the length of the longest data sample.
  - (*) CLEAN_PATH: the path where the digitized data is saved after the preprocessing stage.
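The MAX_LENGTH behaviour described above (falling back to the longest sample when None) can be sketched as follows; this is a minimal illustration, not the actual process.py code:

```python
def digitize(token_id_seqs, max_length=None, pad_id=0):
    # Pad or truncate token-id sequences to a fixed length.
    # If max_length is None, use the length of the longest sequence.
    if max_length is None:
        max_length = max(len(seq) for seq in token_id_seqs)
    out = []
    for seq in token_id_seqs:
        seq = seq[:max_length]                                 # truncate long samples
        out.append(seq + [pad_id] * (max_length - len(seq)))   # pad short ones
    return out

batch = digitize([[5, 6, 7], [8, 9]])  # max_length falls back to 3
```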
- Training Model:
python train.py --data_path {CLEAN_DATA_PATH} --tokenizer {TOKENIZER_PATH} --checkpoint {CHECKPOINT_PATH} --n {N} --d_model {D_MODEL} --heads {HEADS} --d_ff {D_FF} --dropout_rate {DROPOUT_RATE} --eps {EPS} --activation {ACTIVATION} --epochs {EPOCHS} --batch_size {BATCH_SIZE} --mini_batch {MINI_BATCH} --learning_rate {LEARNING_RATE} --device {DEVICE} ...
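The train.py flags listed above map naturally onto an argparse parser. This sketch mirrors only the listed flags; the defaults shown are illustrative assumptions, not the repo's actual values:

```python
import argparse

def build_parser():
    # Mirror the train.py flags listed above (defaults are illustrative).
    p = argparse.ArgumentParser(description="Train the GPT model")
    p.add_argument("--data_path", required=True)            # digitized data (CLEAN_PATH)
    p.add_argument("--tokenizer", required=True)            # trained tokenizer path
    p.add_argument("--checkpoint", default=None)            # resume/save checkpoint
    p.add_argument("--n", type=int, default=12)             # number of decoder layers
    p.add_argument("--d_model", type=int, default=768)      # embedding dimension
    p.add_argument("--heads", type=int, default=12)         # attention heads
    p.add_argument("--d_ff", type=int, default=3072)        # feed-forward dimension
    p.add_argument("--dropout_rate", type=float, default=0.1)
    p.add_argument("--eps", type=float, default=1e-6)       # layer-norm epsilon
    p.add_argument("--activation", default="gelu")
    p.add_argument("--epochs", type=int, default=1)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("--mini_batch", type=int, default=1)     # gradient-accumulation chunks
    p.add_argument("--learning_rate", type=float, default=3e-4)
    p.add_argument("--device", default="cpu")
    return p

args = build_parser().parse_args(["--data_path", "clean.bin", "--tokenizer", "tok.model"])
```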
- Interface of Chatbot:


