Pressure Testing: Open LLMs 💢

This is a derivative work of Needle In A Haystack - Pressure Testing LLMs, a project where @gkamradt explored the in-context retrieval abilities of GPT-4 and Claude 2. I was impressed by the insights gained from that test, and as an open-source enthusiast I felt compelled to extend the experiment to the broader open-source LLM market. This project therefore examines the in-context retrieval capabilities of popular open-source models. My primary aim is to evaluate how well these widely used models perform simple retrieval within their context window. I welcome suggestions for additional models to include, particularly those with larger context windows that can run on 24GB VRAM + 64GB RAM.

Note: In response to @gkamradt's work, Anthropic ran their own pressure tests, covered in this blog post. They were able to massively improve in-context retrieval performance by priming the model response with "Here is the most relevant sentence in the text:". All my tests using this retrieval priming technique are suffixed with rp.

The Test 📝

  1. Place a random fact or statement (the 'needle') in the middle of a long context window (the 'haystack')
  2. Ask the model to retrieve this statement using the following prompt format
You are provided with a text of some essays, amidst these essays is a sentence
that contains the answer to the user's question. I will now provide the text
(delimited with XML tags) followed by the user question.

[TEXT]
{content}
[/TEXT]


User: {prompt}

Assistant: {retrieval primer}
  3. Iterate over various document depths (where the needle is placed) and context lengths to measure performance (a minimal sketch of this loop is shown below)
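
For concreteness, here is a minimal sketch of how such a loop could look. The needle and question strings, the chars-per-token estimate, and the `model_fn` callable are illustrative assumptions, not the repository's actual code.

```python
# Illustrative sketch of the needle-in-a-haystack loop described above.
# The needle/question strings, the 4-chars-per-token estimate and the
# `model_fn` callable are assumptions, not this repo's actual code.

NEEDLE = "The special magic number mentioned in the essays is 42."    # hypothetical needle
QUESTION = "What is the special magic number mentioned in the essays?"
RETRIEVAL_PRIMER = "Here is the most relevant sentence in the text:"  # Anthropic-style priming


def insert_needle(haystack: str, needle: str, depth_pct: float) -> str:
    """Place the needle `depth_pct` percent of the way into the haystack."""
    cut = int(len(haystack) * depth_pct / 100)
    return haystack[:cut] + " " + needle + " " + haystack[cut:]


def build_prompt(content: str, question: str, primer: str = "") -> str:
    """Assemble the prompt in the format shown above."""
    return (
        "You are provided with a text of some essays, amidst these essays is a "
        "sentence that contains the answer to the user's question. I will now "
        "provide the text (delimited with XML tags) followed by the user question.\n\n"
        f"[TEXT]\n{content}\n[/TEXT]\n\n"
        f"User: {question}\n\n"
        f"Assistant: {primer}"
    )


def run_grid(haystack: str, context_lengths, depths, model_fn):
    """Iterate over (context length, needle depth) pairs and collect answers."""
    results = []
    for ctx in context_lengths:
        trimmed = haystack[: ctx * 4]            # rough 4-chars-per-token estimate
        for depth in depths:
            content = insert_needle(trimmed, NEEDLE, depth)
            answer = model_fn(build_prompt(content, QUESTION, RETRIEVAL_PRIMER))
            results.append({"context": ctx, "depth": depth, "answer": answer})
    return results
```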

Roadmap 🛣️

An ongoing list of models to pressure test.

1. Mistral 7B Instruct v0.2

Results 📊

Each test consists of a retrieval at a certain depth percentage for a given context length. The results are combined into a pivot table illustrating the quality of the model's response, as judged by GPT-4. The scoring system is defined as:

Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference.
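
As an illustration, scoring with GPT-4 as the judge could look something like the sketch below; the judging prompt wording is an assumption, only the 1/3/5/7/10 scale comes from the rubric above.

```python
# Hedged sketch of GPT-4-as-judge scoring using the rubric above.
# The judging prompt wording is an assumption, not the project's actual prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the answer against the reference sentence:\n"
    "1 - completely unrelated to the reference\n"
    "3 - minor relevance but does not align with the reference\n"
    "5 - moderate relevance but contains inaccuracies\n"
    "7 - aligns with the reference but has minor omissions\n"
    "10 - completely accurate and aligns perfectly with the reference\n"
    "Reply with the score only."
)


def judge(answer: str, reference: str) -> int:
    """Ask GPT-4 to grade one model answer against the reference needle."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Reference: {reference}\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```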

I have slightly adjusted @gkamradt's visualization code to work for this project. The code can be found here. The raw results are found in results/.
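
For reference, a minimal sketch of how the raw scores could be pivoted into such a heatmap is shown below; the repo's adapted visualization code (linked above) may differ, and the example records are dummy values.

```python
# Minimal sketch of pivoting raw scores into a depth-vs-context heatmap.
# The records here are dummy values; in practice they come from results/.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

records = [
    {"context": 2000, "depth": 0.0, "score": 10},
    {"context": 2000, "depth": 50.0, "score": 7},
    {"context": 8000, "depth": 50.0, "score": 3},
]

pivot = pd.DataFrame(records).pivot_table(index="depth", columns="context", values="score")

sns.heatmap(pivot, vmin=1, vmax=10, cmap="RdYlGn", annot=True)
plt.xlabel("Context length (tokens)")
plt.ylabel("Needle depth (%)")
plt.title("Retrieval score by context length and depth")
plt.tight_layout()
plt.show()
```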

Qwen-1.5-4B @ 7k [RP]

Qwen doesn't have any attention optimizations (SWA, MQA, GQA), so scaling the context is very expensive in terms of VRAM :(

Qwen-1.5-7B @ 7k [RP]

I wish I could test how well this does at higher contexts; all Qwen 1.5 models support contexts of up to 32k in practice.

Mistral-7B-Instruct-v0.2 @ 16k

This model is trained on 8k context but features a theoretical context window of up to 128k, made possible through sliding window attention.

Mistral-7B-Instruct-v0.2 @ 16k [RP]

Using the retrieval priming technique from Anthropic, results improve tremendously. The model is capable of handling contexts exceeding 8k, but its performance is volatile; it tends to either succeed flawlessly or fail completely.

OpenChat 7B 3.5-1210 @ 8k

OpenChat 7B 3.5-1210 @ 8k [RP]

Starling LM 7B Alpha @ 8k

Starling is finetuned from OpenChat 3.5 and is one of the best 7B models on Chatbot Arena.

Starling LM 7B Alpha @ 8k [RP]

Toppy 7B @ 16k

Implementation

Just a quick note on the implementation: @gkamradt has refactored and cleaned his code significantly since I originally started working on this, and I don't plan to sync with his more polished version. This code works fine, but it's hacky.