
Sequence mode prototype #89

Merged
merged 30 commits into from
Jun 12, 2023

Conversation

LoganDark
Contributor

This is a prototype of sequence mode.

Load model ... 1.318s
Serial mode to process 30 tokens ... 2.116s
Sequence mode to process 30 tokens ... 0.509s
Logits total diff = 0.00000
Logits identical = TRUE

This is only for testing. It runs into precision and capacity limits at large lengths. The goal is to support sequences of up to 25k tokens.

It is also likely that the dedicated single-token functions should be brought back. Again, this is only a prototype.

@LoganDark
Contributor Author

LoganDark commented Jun 4, 2023

This PR currently depends on ggerganov/ggml#229 due to relying on a single computation graph. This dependency will most likely go away.

@LoganDark
Contributor Author

LoganDark commented Jun 4, 2023

No implementation advice is needed yet; there is still some low-hanging fruit left for me to take care of.

This is just to get the news out that this is being worked on.

@saharNooby
Collaborator

Just to be sure -- at line 910, struct ggml_tensor * x = ggml_get_rows(ctx, model.emb, token_index); we will get a huge matrix of size sequence_length × n_embed?

@LoganDark
Contributor Author

LoganDark commented Jun 4, 2023

Just to be sure -- at line 910, struct ggml_tensor * x = ggml_get_rows(ctx, model.emb, token_index); we will get a huge matrix of size sequence_length × n_embed?

Yes, width n_embed and height sequence_len.

@saharNooby
Collaborator

saharNooby commented Jun 4, 2023

Huge matmul should be fast! (some say, I myself don't know)

@LoganDark
Contributor Author

LoganDark commented Jun 4, 2023

Huge matmul should be fast! (some say, I myself don't know)

OP includes a benchmark that shows 4x speedup on processing 30 tokens :)

@saharNooby saharNooby linked an issue Jun 4, 2023 that may be closed by this pull request
@LoganDark
Contributor Author

Load model ... 1.397s
Serial mode to process 30 tokens ... 6.971s
Sequence mode to process 30 tokens ... 0.696s
Logits total diff = 0.00000
Logits identical = TRUE

Also, tests are passing? A pleasant surprise (I guess I no longer use the faulty ggml function).


@LoganDark
Contributor Author

The macOS failure is probably a false positive; it works fine on my Mac, but I need someone on Apple Silicon to test.

@LoganDark
Contributor Author

LoganDark commented Jun 7, 2023

I think this pull request is ready for a review (not merge yet)

There are still probably some touch-ups I need to do, but the code is almost clean enough to be production-ready, I think.

It does not support anywhere near 25k tokens (in fact, going too far above 64 will probably crash upstream builds of ggml), but it can be improved in the future.

@saharNooby
Collaborator

Great PR! Aside from some nits, my concerns are:

  • the existence of sequence.c
  • the macOS build does not pass; I'll try to enable a sanitizer to debug it

@saharNooby
Collaborator

I'll test sanitizer builds on a new temporary branch, logan-dark-sequence-mode, since I have no permission to push workflow changes directly into this PR.

@saharNooby
Collaborator

Lol, just adding -DRWKV_SANITIZE_ADDRESS=ON to the build command line fixes the issue, and gives literally zero useful info.

@LoganDark
Contributor Author

Lol, just adding -DRWKV_SANITIZE_ADDRESS=ON to the build command line fixes the issue, and gives literally zero useful info.

I guess we may as well leave it on, since it seems to work fine on my Mac. Real Macs probably have AVX2 and FMA.

@saharNooby
Collaborator

I guess may as well leave it on

That's an option, but I have another idea I will try tomorrow: I will literally insert a macro at each line that prints the line number, and see at what line execution stops. I've done that before; it's a crude way, but it works.

Then, once the line is known, I can try to rewrite that part.

@saharNooby
Collaborator

@LoganDark I've tried to debug it, but I have no ideas and/or willingness to go further. Looks like a compiler bug, IDK...

Please change the macOS build command in build.yml to cmake -DRWKV_AVX2=OFF -DRWKV_FMA=OFF -DRWKV_SANITIZE_ADDRESS=ON .., also adding a comment above it: Sanitizer is enabled to fix issues discovered when testing #89. It needs to be disabled as soon as possible (that is, once master can be built on the macOS GitHub runner again). I'll approve the workflow.

@saharNooby saharNooby closed this Jun 10, 2023
@saharNooby saharNooby reopened this Jun 10, 2023
@saharNooby
Collaborator

Oops, closed accidentally.

@LoganDark You've said the PR is ready for review, but not merge; is it still not ready to be merged?

Commit: Sanitizer is enabled to fix issues discovered when testing RWKV#89. It needs to be disabled as soon as possible (that is, once master can be built on the macOS GitHub runner again).
@LoganDark
Contributor Author

is it still not ready to be merged?

I'll perform another pass over it personally and make sure. My main concern is that long sequences build computation graphs so large that they cannot fit inside the ggml_cgraph struct; you'll hit ggml asserts at runtime that crash the program for exceeding the node limit. I don't know if there's any way to fix this as long as ggml doesn't support iteration in cgraphs.

@LoganDark
Contributor Author

Yeah, 8523841 is what I missed, lmao!!

@LoganDark
Contributor Author

LoganDark commented Jun 10, 2023

I think I'd be okay merging this since further improvements can be made in the future.

I am experimenting with making performance graphs that compare sequence mode to serial mode. Here is one in log scale:

[image: performance graph of sequence mode vs. serial mode, log scale]

That drop around 32 tokens is so not an outlier:

[image: performance graph showing the drop around 32 tokens]

I don't know what in the world cuBLAS is doing to achieve this, but I'm interested and also scared. lol

[Review thread on extras/sequence.c (outdated, resolved)]
@saharNooby
Collaborator

I have one last comment (if nothing more changes in this PR) about the sequence.c file, and then this will be ready to merge.

BTW,

you'll hit ggml asserts at runtime that crash the program

The macOS build was failing because of random ggml asserts. Maybe something overflowed here too.

@LoganDark
Contributor Author

MacOS build was failing because of random ggml asserts. Maybe something overflowed here too.

Those are unrelated.

@LoganDark
Contributor Author

For the record, it doesn't make me very happy to discard the files I used while developing this branch. Discarding them deprives future developers of the tests they would need to do the same work I've done. But there isn't a framework for those kinds of tests (except maybe the tests directory), since they can't use the tiny 660K models; they need to be run against real, large models.

@saharNooby saharNooby merged commit c41ed98 into RWKV:master Jun 12, 2023
12 checks passed
@LoganDark LoganDark deleted the sequence-mode branch June 12, 2023 12:56
Linked issue that may be closed by this pull request: Blas-like Prompt Parallelization? (sequence processing mode)