[dattri.benchmark] Add nanoGPT and retrain function here #60
I will take a look first
@SeanZh30 probably better removing unnecessary files? e.g., the assets folder and the .ipynb files.
dattri/benchmark/shakespeare.py
@@ -0,0 +1,41 @@
import os
Please follow #64 and add this to benchmark/datasets/shakespare
please also add a script for tinystories in benchmark/datasets/tinystories
Both of these fixed!
…Change the arg file. Delete unnecessary files
Now we support a new entry point:
dattri_retrain_nanogpt --save_path ./experiment
    --dataset 'shakespeare_char'/'tinystories'
    --data_file TinyStoriesV2-GPT4-train.txt # optional, only valid for tinystories
    --partition 0,5,5 # same as `dattri_retrain`
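For readers who want to see how flags like these could be parsed, here is a minimal argparse sketch. It is not the actual dattri entry point: `build_parser` and `parse_partition` are hypothetical names, and which flags are required and what their defaults are is an assumption. The only thing taken from the comment above is the flag names, the two dataset choices, and the comma-separated form of `--partition`; the meaning of the three partition numbers is left to `dattri_retrain` and not interpreted here.

```python
import argparse


def parse_partition(text):
    """Parse a value like '0,5,5' into a tuple of three ints (meaning not assumed)."""
    parts = tuple(int(p) for p in text.split(","))
    if len(parts) != 3:
        raise argparse.ArgumentTypeError("expected three comma-separated integers")
    return parts


def build_parser():
    """Hypothetical parser mirroring the flags named in the comment above."""
    parser = argparse.ArgumentParser(prog="dattri_retrain_nanogpt")
    parser.add_argument("--save_path", required=True)  # assumption: required
    parser.add_argument(
        "--dataset",
        choices=["shakespeare_char", "tinystories"],
        required=True,  # assumption: required
    )
    # Optional; per the comment it is only valid for tinystories.
    parser.add_argument("--data_file", default=None)
    parser.add_argument("--partition", type=parse_partition, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Passing `--partition` through a `type=` callable keeps malformed values (for example `0,5`) from reaching the retraining code, since argparse reports the error at parse time.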
@jiaqima @xingjian-zhang Finally we made nanogpt retraining code ready. Some bullet points:
@SeanZh30 please also have a look. I made some changes to the API after our offline discussion.
I think everything looks fine but we may need to change the readme file in nanogpt to show the new entry.
Merge this PR to keep rolling
Description
The main purpose of this PR is to provide the code related to nanoGPT.
1. Motivation and Context
A benchmark/models/nanoGPT folder was created under the benchmark folder. This PR mainly provides the code for retraining nanoGPT on the shakespeare_char dataset. For how to run it, see benchmark/models/nanoGPT/readme.md. In addition, nanoGPT originally trains on batches sampled at random from the dataset. This PR modifies the original code, in particular some functions in train.py (such as get_batch), so that training covers all of the data, which may benefit the implementation of data attribution functions.
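The difference between sampled batches and full-data batches can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: it uses plain Python lists instead of tensors, and `get_batch_random` and `iter_batches_sequential` are hypothetical names. The first function draws random window starts, in the style of nanoGPT's original get_batch; the second walks the data in non-overlapping windows so each full window is visited exactly once per pass.

```python
import random


def get_batch_random(data, block_size, batch_size, rng):
    """Sampling-style batching: random window starts, so some tokens may never be drawn."""
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[s:s + block_size] for s in starts]
    # Targets are the inputs shifted one token to the right.
    y = [data[s + 1:s + 1 + block_size] for s in starts]
    return x, y


def iter_batches_sequential(data, block_size, batch_size):
    """Full-coverage batching: visit every full non-overlapping window once, in order.

    Trailing tokens that do not fill a whole window (plus its shifted target)
    are dropped in this sketch.
    """
    starts = list(range(0, len(data) - block_size, block_size))
    for i in range(0, len(starts), batch_size):
        chunk = starts[i:i + batch_size]
        x = [data[s:s + block_size] for s in chunk]
        y = [data[s + 1:s + 1 + block_size] for s in chunk]
        yield x, y


if __name__ == "__main__":
    tokens = list(range(17))  # toy token ids
    for xb, yb in iter_batches_sequential(tokens, block_size=4, batch_size=3):
        print(xb, yb)
```

With a deterministic pass over the data, every training window has a fixed, known position, which makes it straightforward to say which examples a retrained model did or did not see.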
2. Summary of the change
3. What tests have been added/updated for the change?