
[dattri.benchmark] Add nanoGPT and retrain function here #60

Merged · 11 commits · May 10, 2024
Conversation

@SeanZh30 (Collaborator) commented May 5, 2024

Description

The main purpose of this PR is to provide the code related to nanoGPT.

1. Motivation and Context

Under the benchmark folder, benchmark/models/nanoGPT was created. This PR mainly provides the code for retraining nanoGPT on the shakespeare_char dataset. For instructions on running it, please read benchmark/models/nanoGPT/readme.md. In addition, nanoGPT originally trained on randomly sampled batches; this PR modifies the original code, especially some functions in train.py (such as get_batch), so that it trains on all of the data, which benefits the implementation of data attribution functions.
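To illustrate the dataloader change described above, here is a minimal NumPy sketch contrasting nanoGPT's original random-offset batching with a sequential non-overlapping variant. The function names and signatures are hypothetical, not the PR's actual train.py code:

```python
import numpy as np

def get_batch_random(data, block_size, batch_size, rng):
    # Original nanoGPT style: sample random offsets, so windows may overlap.
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size] for i in ix])
    y = np.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x, y

def get_batch_sequential(data, block_size, batch_size, step):
    # Non-overlapping variant: walk the corpus in fixed block_size strides,
    # so sample k always covers tokens [k*block_size, (k+1)*block_size).
    start = step * batch_size * block_size
    xs, ys = [], []
    for b in range(batch_size):
        i = start + b * block_size
        xs.append(data[i:i + block_size])
        ys.append(data[i + 1:i + 1 + block_size])
    return np.stack(xs), np.stack(ys)
```

With the sequential variant, every training sample corresponds to a unique, fixed span of the corpus, which is what makes per-sample attribution well defined.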

2. Summary of the change

  1. Add nanoGPT model-related code in dattri/benchmark/models/nanogpt/*
  2. Add retrain function based on shakespeare_char dataset in dattri/benchmark/shakespare.py

3. What tests have been added/updated for the change?

  • N/A: No test will be added.

@TheaperDeng TheaperDeng self-requested a review May 5, 2024 19:38
@TheaperDeng (Collaborator)
I will take a look first

@jiaqima (Contributor) commented May 6, 2024

@SeanZh30 probably better removing unnecessary files? e.g., the assets folder and the .ipynb files.

@SeanZh30 SeanZh30 changed the title add nanoGPT and retrain function here [dattri.benchmark] add nanoGPT and retrain function here May 6, 2024
@SeanZh30 SeanZh30 changed the title [dattri.benchmark] add nanoGPT and retrain function here [dattri.benchmark] Add nanoGPT and retrain function here May 6, 2024
@SeanZh30 SeanZh30 closed this May 6, 2024
@SeanZh30 SeanZh30 deleted the nanogpt branch May 6, 2024 18:50
@SeanZh30 SeanZh30 restored the nanogpt branch May 6, 2024 18:51
@TheaperDeng TheaperDeng reopened this May 6, 2024
@@ -0,0 +1,41 @@
import os
Collaborator

Please follow #64 add this to benchmark/datasets/shakespare

Collaborator


please also add a script for tinystories in benchmark/datasets/tinystories

Collaborator Author


Both of these fixed!

@SeanZh30 SeanZh30 requested a review from TheaperDeng May 9, 2024 22:12
@TheaperDeng (Collaborator)

Now we support a new entry point:

dattri_retrain_nanogpt --save_path ./experiment
                       --dataset 'shakespeare_char'/'tinystories'
                       --data_file TinyStoriesV2-GPT4-train.txt # optional, only valid for tinystories
                       --partition 0,5,5 # same as `dattri_retrain`
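As a rough sketch of how such a console entry point could parse these flags, here is a hypothetical argparse setup. This is not dattri's actual implementation; in particular, the interpretation of `--partition` as three comma-separated integers is an assumption:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the flags shown above; the real semantics
    # live in dattri's retrain scripts.
    p = argparse.ArgumentParser(prog="dattri_retrain_nanogpt")
    p.add_argument("--save_path", required=True,
                   help="directory to save retrained checkpoints")
    p.add_argument("--dataset", required=True,
                   choices=["shakespeare_char", "tinystories"])
    p.add_argument("--data_file", default=None,
                   help="optional, only valid for tinystories")
    p.add_argument("--partition", default=None,
                   help="e.g. '0,5,5', same format as dattri_retrain")
    args = p.parse_args(argv)
    if args.partition is not None:
        # Assumed format: three comma-separated integers.
        args.partition = tuple(int(v) for v in args.partition.split(","))
    return args
```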

@TheaperDeng (Collaborator) commented May 10, 2024

@jiaqima @xingjian-zhang Finally we made the nanogpt retraining code ready.

Some bullet points:

  • The model architecture is not changed.
  • Only LDS-mode retraining is supported for nanoGPT.
  • We changed the original dataloader so that each sample does not overlap with the others (this makes more sense for data attribution).
  • Added a new entry point: [dattri.benchmark] Add nanoGPT and retrain function here #60 (comment)
  • Validated the generation quality.
  • Both the sample indices and the model checkpoints are saved; each index can be mapped back to the actual text sample.
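Because the samples are non-overlapping and fixed-size, mapping a saved index back to its text could look like the following sketch. The helper name and the character-level `decode` are illustrative assumptions, not dattri's actual API:

```python
import numpy as np

def index_to_text(data, index, block_size, decode):
    # With non-overlapping fixed-size samples, a saved sample index maps
    # directly to one contiguous token span in the corpus.
    start = index * block_size
    return decode(data[start:start + block_size])
```

For a character-level dataset like shakespeare_char, `decode` would simply turn token ids back into characters.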

@SeanZh30 please also have a look. I made some changes to the API after our offline discussion.

@SeanZh30 (Collaborator, Author)


I think everything looks fine, but we may need to update the readme file in nanogpt to document the new entry point.

@TheaperDeng TheaperDeng merged commit a3efd9b into TRAIS-Lab:main May 10, 2024
4 checks passed
@TheaperDeng (Collaborator)

Merge this PR to keep rolling
