NeMo Integrate #125

cat-state · 2022-12-05T15:56:21Z

Currently implemented:

Training for ILQL
Generation using ILQL (i.e., validation)
Inference script for ILQL

Future issues:

More reusable NeMo abstraction
bf16 appears unstable compared to fp16
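
The PR does not diagnose why bf16 looks less stable, so the following is only an illustration of one known difference between the two formats, not an explanation of the behaviour seen here: bf16 keeps fewer mantissa bits than fp16 and so rounds small updates more coarsely, in exchange for a much wider exponent range. The helper names are hypothetical and use only the standard library.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round x to bfloat16: keep the top 16 bits of the float32 encoding.

    Uses round-to-nearest-even on the 16 discarded bits. NaN and inf are not
    special-cased in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A small step away from 1.0 survives in fp16 but is rounded away in bf16.
print(round_to_fp16(1.001))  # 1.0009765625
print(round_to_bf16(1.001))  # 1.0

# A value above the fp16 maximum (65504) is still representable in bf16.
print(round_to_bf16(70000.0))  # 70144.0
```

Whether this precision gap is what affects ILQL training here would need to be checked separately.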

trlx/trainer/nemo_ilql_trainer.py

jon-tow · 2023-01-31T21:03:52Z

trlx/trainer/nemo/gpt.py

+        return torch.utils.data.DataLoader(
+            dataset,
+            batch_sampler=batch_sampler,
+            # For some reason this causes a crash when using >0 workers


If _reconfigure for generate is uncommented, as it currently is, num_workers=self.cfg.data.num_workers seems to work fine. Do you think we should change it back from 0?
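
A minimal sketch of what reading the worker count from config could look like, assuming a `cfg.data.num_workers` field as named in the comment above. `DataConfig` and `dataloader_kwargs` are hypothetical names for illustration, not trlX or NeMo code; only `batch_sampler` and `num_workers` are real `torch.utils.data.DataLoader` parameters.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DataConfig:
    # Hypothetical stand-in for the `cfg.data` section; 0 keeps loading in
    # the main process, which is the workaround currently in the diff.
    num_workers: int = 0


def dataloader_kwargs(data_cfg: DataConfig, batch_sampler: Any) -> Dict[str, Any]:
    """Build keyword arguments to pass to torch.utils.data.DataLoader."""
    if data_cfg.num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    return {
        "batch_sampler": batch_sampler,
        "num_workers": data_cfg.num_workers,
    }


# Example: four workers taken from config instead of a hard-coded 0.
kwargs = dataloader_kwargs(DataConfig(num_workers=4), batch_sampler=[[0, 1], [2, 3]])
print(kwargs["num_workers"])  # 4
```

The result would be used as `torch.utils.data.DataLoader(dataset, **kwargs)`, so switching back to 0 workers stays a config change rather than a code change.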

trlx/trainer/nemo/gpt.py

jon-tow · 2023-02-02T16:48:50Z

I think this PR is in a really good state right now. Any updates on the remaining issues you've listed?

Verify checkpointing
ILQL loss spikes near the start of training (this doesn't seem to be the case now?)

cat-state · 2023-02-02T17:53:42Z

> I think this PR is in a really good state right now. Any updates on the remaining issues you've listed?
>
> * [ ] Verify checkpointing
> * [ ] ILQL loss spikes near the start of training (this doesn't seem to be the case now?)

Yeah, I think the last one isn't the case anymore. I'll add a script showing how to load a checkpoint and run inference with it.

cat-state · 2023-02-03T03:42:38Z

I added the inference script and made checkpointing save under a reloadable name (this doesn't work for metrics with / in their name).
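
To illustrate why a / in a metric name is a problem for checkpoint names, here is a small sketch. It assumes the metric name is embedded in the checkpoint filename; `safe_metric_name` and `checkpoint_filename` are hypothetical helpers, not the code in this PR.

```python
from pathlib import Path


def safe_metric_name(metric: str, replacement: str = "_") -> str:
    """Replace path separators so the metric name can be part of a filename."""
    return metric.replace("/", replacement)


def checkpoint_filename(metric: str, value: float, step: int) -> str:
    """Build a single-component checkpoint filename (hypothetical format)."""
    return f"step={step}-{safe_metric_name(metric)}={value:.4f}.ckpt"


# With a raw slash, the name is split into a directory and a file.
raw = Path("step=100-val/loss=0.1234.ckpt")
print(raw.name)  # loss=0.1234.ckpt

# After sanitising, the whole name stays one path component.
print(checkpoint_filename("val/loss", 0.1234, 100))  # step=100-val_loss=0.1234.ckpt
```

Renaming the monitored metric so it has no slash, as the later commit message describes, avoids the problem without any sanitising step.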

jon-tow

Leaving some very minor nits

trlx/data/configs.py

trlx/trainer/nemo/README.md

trlx/trainer/nemo/gpt.py

jon-tow

Amazing work 🥳 🥳 🥳 Thanks @cat-state !!

cat-state and others added 27 commits December 1, 2022 00:07

nemo ilql heads

cd9ab1c

nemo

102a74c

model loads

5cc6fb7

contiguous error

f451099

it trains

8f74179

dbg

d5dfd6b

dbg

cf7d873

it works somewhat

689f558

runs but hangs

314a0da

OOM debug

d1a64fb

debug off-by-one in split idxes

65413c1

ILQL on only one rank

1391a9a

Merge remote-tracking branch 'origin' into nemo-integrate

afe4d71

nemo readd

7f8a849

ilql generate

d050cbf

ilql test inference

5edaf8f

generate hang

8f01942

dbg

7cd68f3

add resume from ckpt

c01ba71

feat(examples): add hh

aedb7d7

style(*_hh): satisfy isort

0701714

feat(hh): add more stop_sequences

b86867a

ilql optimize index select

d3cea9a

ilql gpt

4aef72f

Merge remote-tracking branch 'origin' into nemo-integrate

396319a

eval

ff3d697

reduce dupe fns

f9c6b84

cat-state marked this pull request as ready for review January 16, 2023 18:16

cat-state added 2 commits January 16, 2023 18:56

add config loading

a3a69b9

unused imports

7bc3dab

cat-state added 2 commits January 31, 2023 00:31

try custom generate

376a1d5

jon's fix for dataloader crash

8029266

jon-tow reviewed Jan 31, 2023

trlx/trainer/nemo/gpt.py (outdated, resolved)

cat-state added 6 commits February 1, 2023 01:46

fix bug with mutating nemo padding

cbe17a2

log all metrics

ab2483c

move metrics and fixes

00e060d

qa

fc80e91

fmt

3850a7e

typo

c75a370

cat-state mentioned this pull request Feb 2, 2023

NeMo-Megatron Integration #96

Closed

5 tasks

cat-state added 6 commits February 3, 2023 02:07

change save metric to not have slash

92f09a9

add inference script

885dbf6

find checkpoints

6830e1a

inference script

4c5057e

update readme

66a7f73

Merge branch 'main' of github.com:CarperAI/trlx into nemo-integrate

6a55597

cat-state added 3 commits February 3, 2023 03:43

fmt

aa62e47

fmt

42c8d7f

remove check

1025086

jon-tow reviewed Feb 3, 2023

trlx/data/configs.py (outdated, resolved)

trlx/trainer/nemo/README.md (outdated, resolved)

trlx/trainer/nemo/gpt.py (outdated, resolved)

cat-state added 3 commits February 3, 2023 21:10

update readme

c96e434

fix nits

b40091a

update trainer comment

d58278b

cat-state requested a review from jon-tow February 3, 2023 21:19

jon-tow approved these changes Feb 3, 2023

jon-tow merged commit b70bc92 into main Feb 3, 2023
