Author: Nguyen Duc Tri (Alan Nguyen)
Github: https://github.com/Alan-404
Linkedin: https://www.linkedin.com/in/%C4%91%E1%BB%A9c-tr%C3%AD-nguy%E1%BB%85n-269845210/
Reference: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (2018). Improving Language Understanding by Generative Pre-Training.
- Clone this repo:
  git clone https://github.com/Alan-404/GPT-model.git
- CD into the project:
  cd GPT-model
- (Optional) Create a Conda environment:
  conda create --name {YOUR_PROJECT_NAME}
- (Optional) Activate the Conda environment:
  conda activate {YOUR_PROJECT_NAME}
- Install packages:
  pip install -r requirements.txt
- If you have a question and its corresponding answer, the data sample looks like:
  {question} <sep> {answer}
  (<sep>: separator token)
  Example:
  - Question: What is your name?
  - Corresponding answer: I am chatbot Lily
  - Data sample: What is your name? <sep> I am chatbot Lily
- Store all your data samples in a txt file.

(*): Marks a required argument.
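As a sketch, the sample file described above can be built from question/answer pairs with a few lines of Python (the file name `samples.txt` and the pairs are just examples):

```python
# Build a data-sample txt file: one "{question} <sep> {answer}" line per sample.
pairs = [
    ("What is your name?", "I am chatbot Lily"),
    ("How are you?", "I am fine, thank you"),
]

with open("samples.txt", "w", encoding="utf-8") as f:
    for question, answer in pairs:
        # Each line is one data sample in the format shown above.
        f.write(f"{question} <sep> {answer}\n")
```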
- Train Tokenizer:
  python data.py --data_path {DATA_SAMPLE_PATH} --tokenizer_path {TOKENIZER_PATH} --iterations {ITERATIONS} --sigma {SIGMA}
  - (*) DATA_SAMPLE_PATH: the txt file storing all your data samples.
  - (*) TOKENIZER_PATH: the path where the tokenizer is stored after training. If the path is None, the tokenizer learns from scratch; otherwise it continues training from the previous session.
  - (*) ITERATIONS: the maximum number of loops used for training the tokenizer.
  - SIGMA: SIGMA = Num(tokens_whitespace) / Num(tokens_trained); default is 2.
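One reading of the SIGMA definition above: for a given corpus, the number of trained tokens equals the whitespace token count divided by SIGMA. A sketch under that assumption (the helper names are hypothetical, not functions from data.py):

```python
def whitespace_token_count(samples):
    # Count whitespace-separated tokens across all data samples.
    return sum(len(line.split()) for line in samples)

def trained_token_count(samples, sigma=2):
    # SIGMA = Num(tokens_whitespace) / Num(tokens_trained),
    # so Num(tokens_trained) = Num(tokens_whitespace) / SIGMA.
    return whitespace_token_count(samples) // sigma

samples = ["What is your name? <sep> I am chatbot Lily"]
print(whitespace_token_count(samples))   # 9 whitespace tokens
print(trained_token_count(samples))      # 9 // 2 = 4 with the default SIGMA
```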
- Preprocessing Data - Digitize Text Data:
  python process.py --data_path {DATA_SAMPLE_PATH} --tokenizer_path {TOKENIZER_PATH} --max_length {MAX_LENGTH} --clean_path {CLEAN_PATH}
  - MAX_LENGTH: the number of context tokens you want to set; default is None. If None, the max length is set to the length of the longest data sample.
  - (*) CLEAN_PATH: the path where the digitized data is saved after the preprocessing stage.
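The MAX_LENGTH behaviour described above (falling back to the longest sample when None) can be sketched as follows; this is a minimal illustration, not the actual process.py code:

```python
def digitize(token_id_seqs, max_length=None, pad_id=0):
    # Pad or truncate token-id sequences to a fixed length.
    # If max_length is None, use the length of the longest sequence.
    if max_length is None:
        max_length = max(len(seq) for seq in token_id_seqs)
    out = []
    for seq in token_id_seqs:
        seq = seq[:max_length]                                 # truncate long samples
        out.append(seq + [pad_id] * (max_length - len(seq)))   # pad short ones
    return out

batch = digitize([[5, 6, 7], [8, 9]])  # max_length falls back to 3
```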
- Training Model:
python train.py --data_path {CLEAN_DATA_PATH} --tokenizer {TOKENIZER_PATH} --checkpoint {CHECKPOINT_PATH} --n {N} --d_model {D_MODEL} --heads {HEADS} --d_ff {D_FF} --dropout_rate {DROPOUT_RATE} --eps {EPS} --activation {ACTIVATION} --epochs {EPOCHS} --batch_size {BATCH_SIZE} --mini_batch {MINI_BATCH} --learning_rate {LEARNING_RATE} --device {DEVICE} ...
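The train.py flags listed above map naturally onto an argparse parser. This sketch mirrors only the listed flags; the defaults shown are illustrative assumptions, not the repo's actual values:

```python
import argparse

def build_parser():
    # Mirror the train.py flags listed above (defaults are illustrative).
    p = argparse.ArgumentParser(description="Train the GPT model")
    p.add_argument("--data_path", required=True)            # digitized data (CLEAN_PATH)
    p.add_argument("--tokenizer", required=True)            # trained tokenizer path
    p.add_argument("--checkpoint", default=None)            # resume/save checkpoint
    p.add_argument("--n", type=int, default=12)             # number of decoder layers
    p.add_argument("--d_model", type=int, default=768)      # embedding dimension
    p.add_argument("--heads", type=int, default=12)         # attention heads
    p.add_argument("--d_ff", type=int, default=3072)        # feed-forward dimension
    p.add_argument("--dropout_rate", type=float, default=0.1)
    p.add_argument("--eps", type=float, default=1e-6)       # layer-norm epsilon
    p.add_argument("--activation", default="gelu")
    p.add_argument("--epochs", type=int, default=1)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("--mini_batch", type=int, default=1)     # gradient-accumulation chunks
    p.add_argument("--learning_rate", type=float, default=3e-4)
    p.add_argument("--device", default="cpu")
    return p

args = build_parser().parse_args(["--data_path", "clean.bin", "--tokenizer", "tok.model"])
```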
- Interface of Chatbot:


