Efficient Sampling #32
Comments
Per Stack Overflow, you apparently cannot have multiple assignees on a private repo. No idea why. Mentioning @Mistobaan @lucidrains should ensure you both get notifications.
Reopening this because there are still some problems with fast sampling:
I'm happy to look into this if you're busy, @lucidrains. I suspect 1) is just an off-by-one error somewhere in the attention mask, maybe. 2) might be a little trickier, but I'd be happy to look into it, since TPUs might be the one area where I can claim to have more experience, haha. I think it may be a shape error relating to the mtf splitting.
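To illustrate the kind of off-by-one mask bug suspected in 1), here is a minimal NumPy sketch (hypothetical function names, not the repo's code) contrasting a correct causal mask with one that accidentally excludes the diagonal, so each token cannot attend to itself:

```python
import numpy as np

def causal_mask(seq_len):
    # Position i may attend to positions 0..i (inclusive):
    # lower triangle including the diagonal.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def buggy_causal_mask(seq_len):
    # Off-by-one variant (k=-1 excludes the diagonal): a token is
    # hidden from itself, and position 0 sees nothing at all.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool), k=-1)

mask = causal_mask(4)
bad = buggy_causal_mask(4)
```

Bugs like this often only surface as subtly degraded samples rather than a crash, which would fit the symptoms described here.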
After a bit more digging, it seems the sample quality is only worse with MoE models. I'm going to keep investigating why.
Ok, for now we have set fast sampling as the default, but revert to slow sampling for MoE models. Error no. 2 wasn't in fact an error with larger pods or models; it was an error with fast sampling when recompute_grad was set to True. That is fixed in 8415e56, so what's left now is to 1) figure out what's wrong with MoE and 2) try out step 4 in #43.
Closing this issue to open a more relevant one w/r/t MoE.
Currently our sampling is incredibly inefficient: it doesn't cache the past key/value tensors, and instead recomputes them for every generated token.
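For reference, a minimal NumPy sketch (hypothetical names, not this codebase) of what caching buys: incremental decoding appends one k/v row per new token and produces the same outputs as recomputing the whole prefix at every step:

```python
import numpy as np

def attention(q, k, v):
    # Single-head scaled dot-product attention; q is [1, dim],
    # k and v are [t, dim] (everything decoded so far).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
dim, steps = 4, 5
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))
xs = rng.normal(size=(steps, dim))  # stand-in for embedded tokens

# Cached decoding: project k/v once per new token and append.
k_cache = np.empty((0, dim))
v_cache = np.empty((0, dim))
cached_out = []
for x in xs:
    k_cache = np.vstack([k_cache, (x @ wk)[None, :]])
    v_cache = np.vstack([v_cache, (x @ wv)[None, :]])
    cached_out.append(attention((x @ wq)[None, :], k_cache, v_cache)[0])
cached_out = np.stack(cached_out)

# Naive decoding: recompute k/v for the whole prefix at every step
# (what the sampling loop currently does).
naive_out = np.stack([
    attention((xs[t] @ wq)[None, :], xs[:t + 1] @ wk, xs[:t + 1] @ wv)[0]
    for t in range(steps)
])
# cached_out and naive_out agree; the cache just avoids the O(t)
# re-projection work at every step.
```

The per-step projection cost drops from O(t) to O(1), which is where the "incredibly inefficient" constant factor comes from.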
We should look again at how sampling is done in mesh / T5 (maybe ask Colin?) and see whether we can cache k/v values to make our sampling code more efficient.
The basic infrastructure for this is already in place, but commented out (the k/v values would be stored in this Context object: https://github.com/EleutherAI/GPTNeo/blob/master/sample.py#L148, https://github.com/EleutherAI/GPTNeo/blob/master/models/gpt2/gpt2.py#L138). The problem we ran into last time is that the mesh code seems to feed in the inputs one token at a time, but the token is missing a batch dimension, and we're not exactly sure where the batch dimension gets added back on, and therefore how to replicate this. (see: https://github.com/tensorflow/mesh/blob/a3c05f705641dfe144f70b7b5230db4933ce8ca9/mesh_tensorflow/transformer/transformer.py#L1137)
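As a loose sketch of one way the per-step state could carry an explicit batch axis (purely hypothetical shapes and names, not the actual Context in sample.py or the mesh internals): the per-step input can be just `[batch]` token ids, with the batch dimension reappearing at the embedding lookup each step:

```python
import numpy as np

class DecodeState:
    # Hypothetical stand-in for a Context-like object: holds per-layer
    # (k, v) pairs, each shaped [batch, t, dim], across decode steps.
    def __init__(self, n_layers):
        self.kv = [None] * n_layers

    def record(self, layer, k_t, v_t):
        if self.kv[layer] is None:
            self.kv[layer] = (k_t, v_t)
        else:
            k, v = self.kv[layer]
            # grow along the time axis; batch stays on axis 0 throughout
            self.kv[layer] = (np.concatenate([k, k_t], axis=1),
                              np.concatenate([v, v_t], axis=1))

def decode_step(token_ids, state, embed, wk, wv, layer=0):
    # token_ids: [batch] -- one token id per sequence, i.e. the shape
    # mesh appears to feed in; batch is re-attached by the lookup below.
    x = embed[token_ids]             # [batch, dim]
    k_t = (x @ wk)[:, None, :]       # [batch, 1, dim]
    v_t = (x @ wv)[:, None, :]
    state.record(layer, k_t, v_t)
    return state.kv[layer]

vocab, dim = 10, 4
rng = np.random.default_rng(0)
embed = rng.normal(size=(vocab, dim))
wk, wv = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
state = DecodeState(n_layers=1)
decode_step(np.array([1, 7]), state, embed, wk, wv)
k, v = decode_step(np.array([3, 2]), state, embed, wk, wv)
# after two steps, k and v are [batch=2, t=2, dim=4]
```

If mesh does something equivalent, the "missing" batch dimension may simply be implicit in the token-id tensor and only materialize after the embedding, which would explain why it's hard to spot in the layer code.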
This is (the last?) issue I'd like to get solved before the code is released publicly, so I will try to look into this soon, but any help would be appreciated.