
[dattri.benchmark] Add nanoGPT and retrain function here #60

Merged · 11 commits · May 10, 2024
Conversation

@SeanZh30 (Collaborator) commented May 5, 2024

Description

The main purpose of this PR is to provide the code related to nanoGPT.

1. Motivation and Context

Under the benchmark folder, benchmark/models/nanoGPT was created. This PR mainly provides the code for retraining nanoGPT on the shakespeare_char dataset. For instructions on running it, please read benchmark/models/nanoGPT/readme.md. In addition, nanoGPT originally trained on randomly sampled batches; this PR modifies the original code, especially some functions in train.py (such as get_batch), so that it trains on all of the data, which benefits the implementation of data attribution functions.
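To illustrate the dataloader change described above, here is a minimal NumPy sketch contrasting nanoGPT's original random-offset batching with a sequential non-overlapping variant. The function names and signatures are hypothetical, not the PR's actual train.py code:

```python
import numpy as np

def get_batch_random(data, block_size, batch_size, rng):
    # Original nanoGPT style: sample random offsets, so windows may overlap.
    ix = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i:i + block_size] for i in ix])
    y = np.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x, y

def get_batch_sequential(data, block_size, batch_size, step):
    # Non-overlapping variant: walk the corpus in fixed block_size strides,
    # so sample k always covers tokens [k*block_size, (k+1)*block_size).
    start = step * batch_size * block_size
    xs, ys = [], []
    for b in range(batch_size):
        i = start + b * block_size
        xs.append(data[i:i + block_size])
        ys.append(data[i + 1:i + 1 + block_size])
    return np.stack(xs), np.stack(ys)
```

With the sequential variant, every training sample corresponds to a unique, fixed span of the corpus, which is what makes per-sample attribution well defined.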

2. Summary of the change

  1. Add nanoGPT model-related code in dattri/benchmark/models/nanogpt/*
  2. Add retrain function based on shakespeare_char dataset in dattri/benchmark/shakespare.py

3. What tests have been added/updated for the change?

  • N/A: No test will be added.

@TheaperDeng TheaperDeng self-requested a review May 5, 2024 19:38
@TheaperDeng (Collaborator)
I will take a look first

@jiaqima (Contributor) commented May 6, 2024

@SeanZh30 probably better removing unnecessary files? e.g., the assets folder and the .ipynb files.

@SeanZh30 SeanZh30 changed the title add nanoGPT and retrain function here [dattri.benchmark] add nanoGPT and retrain function here May 6, 2024
@SeanZh30 SeanZh30 changed the title [dattri.benchmark] add nanoGPT and retrain function here [dattri.benchmark] Add nanoGPT and retrain function here May 6, 2024
@SeanZh30 SeanZh30 closed this May 6, 2024
@SeanZh30 SeanZh30 deleted the nanogpt branch May 6, 2024 18:50
@SeanZh30 SeanZh30 restored the nanogpt branch May 6, 2024 18:51
@TheaperDeng TheaperDeng reopened this May 6, 2024
@@ -0,0 +1,41 @@
import os
Collaborator

Please follow #64 add this to benchmark/datasets/shakespare

Collaborator


please also add a script for tinystories in benchmark/datasets/tinystories

Collaborator Author


Both of these fixed!

@SeanZh30 SeanZh30 requested a review from TheaperDeng May 9, 2024 22:12
@TheaperDeng (Collaborator)

Now we support a new entry point:

dattri_retrain_nanogpt --save_path ./experiment
                       --dataset 'shakespeare_char'/'tinystories'
                       --data_file TinyStoriesV2-GPT4-train.txt # optional, only valid for tinystories
                       --partition 0,5,5 # same as `dattri_retrain`
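As a rough sketch of how such a console entry point could parse these flags, here is a hypothetical argparse setup. This is not dattri's actual implementation; in particular, the interpretation of `--partition` as three comma-separated integers is an assumption:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the flags shown above; the real semantics
    # live in dattri's retrain scripts.
    p = argparse.ArgumentParser(prog="dattri_retrain_nanogpt")
    p.add_argument("--save_path", required=True,
                   help="directory to save retrained checkpoints")
    p.add_argument("--dataset", required=True,
                   choices=["shakespeare_char", "tinystories"])
    p.add_argument("--data_file", default=None,
                   help="optional, only valid for tinystories")
    p.add_argument("--partition", default=None,
                   help="e.g. '0,5,5', same format as dattri_retrain")
    args = p.parse_args(argv)
    if args.partition is not None:
        # Assumed format: three comma-separated integers.
        args.partition = tuple(int(v) for v in args.partition.split(","))
    return args
```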

@TheaperDeng (Collaborator) commented May 10, 2024

@jiaqima @xingjian-zhang Finally we made the nanogpt retraining code ready.

Some bullet points:

  • The model architecture is not changed.
  • Only LDS-mode retraining is supported for nanoGPT.
  • We changed the original dataloader so that each sample does not overlap with the others (this makes more sense for data attribution).
  • Added a new entry point: [dattri.benchmark] Add nanoGPT and retrain function here #60 (comment)
  • Validated the generation quality.
  • Both the sample indices and the model checkpoints are saved; each index can be mapped back to the actual text sample.
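Because the samples are non-overlapping and fixed-size, mapping a saved index back to its text could look like the following sketch. The helper name and the character-level `decode` are illustrative assumptions, not dattri's actual API:

```python
import numpy as np

def index_to_text(data, index, block_size, decode):
    # With non-overlapping fixed-size samples, a saved sample index maps
    # directly to one contiguous token span in the corpus.
    start = index * block_size
    return decode(data[start:start + block_size])
```

For a character-level dataset like shakespeare_char, `decode` would simply turn token ids back into characters.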

@SeanZh30 please also have a look. I made some changes to the API after our offline discussion.

@SeanZh30 (Collaborator, Author)


I think everything looks fine, but we may need to update the readme file in nanogpt to document the new entry point.

@TheaperDeng TheaperDeng merged commit a3efd9b into TRAIS-Lab:main May 10, 2024
4 checks passed
@TheaperDeng (Collaborator)

Merge this PR to keep rolling
