
Sequence mode prototype #89

Merged
merged 30 commits into from
Jun 12, 2023

Conversation

LoganDark
Contributor

This is a prototype of sequence mode.

Load model ... 1.318s
Serial mode to process 30 tokens ... 2.116s
Sequence mode to process 30 tokens ... 0.509s
Logits total diff = 0.00000
Logits identical = TRUE

This is only for testing. It runs into precision and capacity limits at large lengths. The goal is to support sequences of up to 25k tokens.

It is also likely that the dedicated single-token functions should be brought back. Again, this is only a prototype.

@LoganDark
Contributor Author

LoganDark commented Jun 4, 2023

This PR currently depends on ggerganov/ggml#229 due to relying on a single computation graph. This dependency will most likely go away.

@LoganDark
Contributor Author

LoganDark commented Jun 4, 2023

No implementation advice is needed yet; there is still some low-hanging fruit left for me to take care of.

This is just to get the news out that this is being worked on.

@saharNooby
Collaborator

Just to be sure -- at line 910, struct ggml_tensor * x = ggml_get_rows(ctx, model.emb, token_index); we will get a huge matrix of size sequence_length × n_embed?

@LoganDark
Contributor Author

LoganDark commented Jun 4, 2023

Just to be sure -- at line 910, struct ggml_tensor * x = ggml_get_rows(ctx, model.emb, token_index); we will get a huge matrix of size sequence_length × n_embed?

Yes, width n_embed and height sequence_len.

@saharNooby
Collaborator

saharNooby commented Jun 4, 2023

Huge matmul should be fast! (some say, I myself don't know)

@LoganDark
Contributor Author

LoganDark commented Jun 4, 2023

Huge matmul should be fast! (some say, I myself don't know)

OP includes a benchmark that shows 4x speedup on processing 30 tokens :)

@saharNooby saharNooby linked an issue Jun 4, 2023 that may be closed by this pull request
@LoganDark
Contributor Author

Load model ... 1.397s
Serial mode to process 30 tokens ... 6.971s
Sequence mode to process 30 tokens ... 0.696s
Logits total diff = 0.00000
Logits identical = TRUE

Also, tests are passing? A pleasant surprise (I guess I no longer use the faulty ggml function).


@LoganDark
Contributor Author

The macOS failure is probably a false positive; it works fine on my Mac, but I need someone on Apple Silicon to test.

@LoganDark
Contributor Author

LoganDark commented Jun 7, 2023

I think this pull request is ready for a review (not merge yet)

There are still probably some touch-ups I need to do, but the code is almost clean enough to be production-ready, I think.

It does not support anywhere near 25k tokens (in fact, going too far above 64 will probably crash upstream builds of ggml), but it can be improved in the future.

@saharNooby
Collaborator

Great PR! Aside from some nits, my concerns are:

  • the existence of sequence.c
  • the macOS build does not pass; I'll try to enable a sanitizer to debug it

@saharNooby
Collaborator

I'll test sanitizer builds on a new temporary branch, logan-dark-sequence-mode, since I have no permission to push workflow changes directly into this PR.

@saharNooby
Collaborator

Lol, just adding -DRWKV_SANITIZE_ADDRESS=ON to the build command line fixes the issue, and gives literally zero useful info.

@LoganDark
Contributor Author

Lol, just adding -DRWKV_SANITIZE_ADDRESS=ON to the build command line fixes the issue, and gives literally zero useful info.

I guess we may as well leave it on, since it seems to work fine on my Mac. Real Macs probably have AVX2 and FMA.

@saharNooby
Collaborator

I guess may as well leave it on

That's an option, but I have another idea I will try tomorrow: I will literally insert a macro at each line that prints the line number, and see at what line execution stops. I've done that before; it's a crude way, but it works.

Then, once the line is known, I can try to rewrite that part.

@saharNooby
Collaborator

@LoganDark I've tried to debug it, but I have no ideas and/or willingness to go further. Looks like a compiler bug, IDK...

Please change the macOS build command in build.yml to cmake -DRWKV_AVX2=OFF -DRWKV_FMA=OFF -DRWKV_SANITIZE_ADDRESS=ON .., also adding a comment above it: Sanitizer is enabled to fix issues discovered when testing #89. It needs to be disabled as soon as possible (that is, once master can be built on the macOS GitHub runner again). I'll approve the workflow.

@saharNooby saharNooby closed this Jun 10, 2023
@saharNooby saharNooby reopened this Jun 10, 2023
@saharNooby
Collaborator

Oops, closed accidentally.

@LoganDark You've said the PR is ready for review, but not merge; is it still not ready to be merged?

Commit: Sanitizer is enabled to fix issues discovered when testing RWKV#89. It needs to be disabled as soon as possible (that is, once master can be built on the macOS GitHub runner again).
@LoganDark
Contributor Author

is it still not ready to be merged?

I'll perform another pass over it personally and make sure. My main concern is that long sequences build computation graphs so large that they cannot fit inside the ggml_cgraph struct; you'll hit ggml asserts at runtime that crash the program for exceeding the node limit. I don't know if there's any way to fix this as long as ggml doesn't support iteration in cgraphs.

@LoganDark
Contributor Author

Yeah, 8523841 is what I missed, lmao!!

@LoganDark
Contributor Author

LoganDark commented Jun 10, 2023

I think I'd be okay merging this since further improvements can be made in the future.

I am experimenting with making performance graphs that compare sequence mode to serial mode. Here is one in log scale:

[image: performance graph of sequence mode vs. serial mode, log scale]

That drop around 32 tokens is so not an outlier:

[image: performance graph showing the drop around 32 tokens]

I don't know what in the world cuBLAS is doing to achieve this, but I'm interested and also scared. lol

[Review thread on extras/sequence.c (outdated, resolved)]
@saharNooby
Collaborator

I have one last comment (if nothing more changes in this PR) about the sequence.c file, and then this will be ready to merge.

BTW,

you'll hit ggml asserts at runtime that crash the program

The macOS build was failing because of random ggml asserts. Maybe something overflowed here too.

@LoganDark
Contributor Author

MacOS build was failing because of random ggml asserts. Maybe something overflowed here too.

Those are unrelated.

@LoganDark
Contributor Author

For the record, it doesn't make me very happy to discard the files I used while developing this branch. Discarding them deprives future developers of the tests they would need to do the same work I've done. But there isn't a framework for those kinds of tests (except maybe the tests directory), since they can't use the tiny 660K models; they need to be run against real, large models.

@saharNooby saharNooby merged commit c41ed98 into RWKV:master Jun 12, 2023
12 checks passed
@LoganDark LoganDark deleted the sequence-mode branch June 12, 2023 12:56
Linked issue that may be closed by this pull request: Blas-like Prompt Parallelization? (sequence processing mode)