
Generative Pre-Trained Transformer (GPT) Model

Author: Nguyen Duc Tri (Alan Nguyen)
Github: https://github.com/Alan-404
Linkedin: https://www.linkedin.com/in/%C4%91%E1%BB%A9c-tr%C3%AD-nguy%E1%BB%85n-269845210/

Reference: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (2018). Improving Language Understanding by Generative Pre-Training.

Architecture

Credit: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (2018). Improving Language Understanding by Generative Pre-Training.

Setup Environment

  1. Clone this repo: git clone https://github.com/Alan-404/GPT-model.git
  2. CD into project: cd GPT-model
  3. (Optional) Create Conda Environment: conda create --name {YOUR_PROJECT_NAME}
  4. (Optional) Activate the Conda environment: conda activate {YOUR_PROJECT_NAME}
  5. Install packages: pip install -r requirements.txt
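
For convenience, here is the whole setup as one shell session. The environment name gpt-model and the Python version are illustrative assumptions, not requirements of this repo:

```bash
# Clone the repository and enter it.
git clone https://github.com/Alan-404/GPT-model.git
cd GPT-model

# Optional: isolate dependencies in a Conda environment.
# The name "gpt-model" and python=3.10 are example choices.
conda create --name gpt-model python=3.10
conda activate gpt-model

# Install the Python dependencies listed by the repo.
pip install -r requirements.txt
```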

Dataset Setup

  1. If you have a question and its corresponding answer, the data sample looks like: {question} <sep> {answer}, where <sep> is the separator token.
    Example:
  • Question: What is your name?
  • Corresponding Answer: I am chatbot Lily
  • Data Sample: What is your name? <sep> I am chatbot Lily
  2. Store all your data samples in a txt file.
    Example:
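A minimal illustration of such a file, assuming one data sample per line (the questions and answers below are made up):

```
What is your name? <sep> I am chatbot Lily
How old are you? <sep> I was created last year
Where do you live? <sep> I live in the cloud
```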

Training Model Step by Step

(*): Required argument. Example invocations for all three steps are sketched after this list.

  1. Train Tokenizer: python data.py --data_path {DATA_SAMPLE_PATH} --tokenizer_path {TOKENIZER_PATH} --iterations {ITERATIONS} --sigma {SIGMA}
  • (*)DATA_SAMPLE_PATH: Your txt file storing all data samples.
  • (*)TOKENIZER_PATH: Path where your tokenizer is stored after training (if the path is None, the tokenizer learns from scratch; otherwise it continues training from the previous session).
  • (*)ITERATIONS: Maximum number of iterations used for training the tokenizer.
  • SIGMA: SIGMA = Num(whitespace_tokens) / Num(trained_tokens); default is 2.
  2. Preprocessing Data - Digitize Text Data: python process.py --data_path {DATA_SAMPLE_PATH} --tokenizer_path {TOKENIZER_PATH} --max_length {MAX_LENGTH} --clean_path {CLEAN_PATH}
  • MAX_LENGTH: The maximum context length you want to set; default is None. If None, the model sets the max length to the length of the longest data sample.
  • (*)CLEAN_PATH: The path where the digitized data is saved after the preprocessing stage.
  3. Training Model: python train.py --data_path {CLEAN_DATA_PATH} --tokenizer {TOKENIZER_PATH} --checkpoint {CHECKPOINT_PATH} --n {N} --d_model {D_MODEL} --heads {HEADS} --d_ff {D_FF} --dropout_rate {DROPOUT_RATE} --eps {EPS} --activation {ACTIVATION} --epochs {EPOCHS} --batch_size {BATCH_SIZE} --mini_batch {MINI_BATCH} --learning_rate {LEARNING_RATE} --device {DEVICE} ...
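
As a sketch, the three steps might be run like this. All paths and hyperparameter values below are illustrative assumptions, not values prescribed by this repo (the model sizes echo the GPT paper's configuration), and train.py accepts further flags elided above:

```bash
# 1. Train the tokenizer on the raw samples (paths are hypothetical).
python data.py --data_path ./data/samples.txt --tokenizer_path ./tokenizer \
    --iterations 1000 --sigma 2

# 2. Digitize the text data with the trained tokenizer.
python process.py --data_path ./data/samples.txt --tokenizer_path ./tokenizer \
    --max_length 128 --clean_path ./data/clean_data

# 3. Train the model on the digitized data (example hyperparameters).
python train.py --data_path ./data/clean_data --tokenizer ./tokenizer \
    --checkpoint ./checkpoints/gpt \
    --n 12 --d_model 768 --heads 12 --d_ff 3072 --dropout_rate 0.1 \
    --epochs 10 --batch_size 32 --mini_batch 4 --learning_rate 0.0001 --device cuda
```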

Other

  1. Interface of Chatbot (screenshot)
