# Submit LLM 34B Model in 5 hours!
This notebook demonstrates how to submit a 34B LLM in only 5 hours! Amazing! The key tricks are:
* use vLLM (for speed)
* use AWQ 4bit quantization (to avoid GPU VRAM OOM)
* limit input size to 1024 tokens (for speed)
* limit output size to 1 token (for speed)
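
To see why 4-bit quantization matters here, the rough arithmetic below compares the weight memory of a 34B-parameter model at 16 bits and at 4 bits per weight. It is only a back-of-the-envelope sketch: the figure of 16 GB per GPU is an assumption (the notebook uses two GPUs but does not state their memory), and the estimate ignores quantization scales, activations, and the KV cache.

```python
PARAMS = 34e9   # 34B parameters
GB = 1e9        # decimal gigabytes, for a rough estimate

def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in GB."""
    return n_params * bits_per_weight / 8 / GB

fp16_gb = weight_memory_gb(PARAMS, 16)   # half precision
awq4_gb = weight_memory_gb(PARAMS, 4)    # 4-bit quantized weights
total_vram_gb = 2 * 16                   # ASSUMPTION: two GPUs with 16 GB each

print(f"fp16 weights : {fp16_gb:.0f} GB")
print(f"4-bit weights: {awq4_gb:.0f} GB")
print(f"assumed VRAM : {total_vram_gb} GB")
```

Under these assumptions the half-precision weights alone would not fit, while the 4-bit weights leave room for activations and the KV cache.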

# Pip Install vLLM
The package vLLM is an incredibly fast LLM inference library! The version of vLLM preinstalled in Kaggle notebooks produces errors, so we need to reinstall it. The code below was taken from the notebook [here][1].

[1]: https://www.kaggle.com/code/lewtun/numina-1st-place-solution

In [1]:
import os, math, numpy as np
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"

In [2]:
%%time
!pip uninstall -y torch
!pip install -U --no-index --find-links=/kaggle/input/vllm-whl vllm
!pip install -U /kaggle/input/vllm-t4-fix/grpcio-1.62.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!pip install -U /kaggle/input/vllm-t4-fix/ray-2.11.0-cp310-cp310-manylinux2014_x86_64.whl

Found existing installation: torch 2.1.2
Uninstalling torch-2.1.2:
  Successfully uninstalled torch-2.1.2
Looking in links: /kaggle/input/vllm-whl
Processing /kaggle/input/vllm-whl/vllm-0.4.0.post1-cp310-cp310-manylinux1_x86_64.whl
Processing /kaggle/input/vllm-whl/cmake-3.29.0.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (from vllm)
Processing /kaggle/input/vllm-whl/torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl (from vllm)
Processing /kaggle/input/vllm-whl/xformers-0.0.23.post1-cp310-cp310-manylinux2014_x86_64.whl (from vllm)
Processing /kaggle/input/vllm-whl/pynvml-11.5.0-py3-none-any.whl (from vllm)
Processing /kaggle/input/vllm-whl/triton-2.1.0-0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (from vllm)
Processing /kaggle/input/vllm-whl/outlines-0.0.34-py3-none-any.whl (from vllm)
Processing /kaggle/input/vllm-whl/tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (from vllm)
Processing /kaggle/input/vllm-whl/interegular-

# Load 34B Quantized Model with vLLM!
We load an AWQ-quantized copy of the 34B Bagel LLM (the original model is [here][1]). This is a strong model.

[1]: https://huggingface.co/jondurbin/bagel-34b-v0.2

In [3]:
import vllm

llm = vllm.LLM(
    "/kaggle/input/bagel-v3-343",
    quantization="awq",
    tensor_parallel_size=2, 
    gpu_memory_utilization=0.95, 
    trust_remote_code=True,
    dtype="half", 
    enforce_eager=True,
    max_model_len=1024,
    #distributed_executor_backend="ray",
)
tokenizer = llm.get_tokenizer()

2024-07-18 10:16:41,358	INFO worker.py:1749 -- Started a local Ray instance.


INFO 07-18 10:16:43 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/kaggle/input/bagel-v3-343', tokenizer='/kaggle/input/bagel-v3-343', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 07-18 10:16:50 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 07-18 10:16:50 selector.py:25] Using XFormers backend.
(RayWorkerVllm pid=355) INFO 07-18 10:16:51 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerVllm pid=355) INFO 07-18 10:16:51 selector.py:25] Using XFormers backend.
INFO 07-18 10:16:52 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=355) INFO 07-18 10:16:52 pynccl_utils.py:45] vLLM is

# Load Test Data
During **commit**, the visible test.csv has only 3 rows, so we load the first 128 rows of train instead to compute a CV score. During **submit**, we load the real test data.

In [4]:
import pandas as pd
VALIDATE = 128

test = pd.read_csv("/kaggle/input/lmsys-chatbot-arena/test.csv") 
if len(test)==3:
    test = pd.read_csv("/kaggle/input/lmsys-chatbot-arena/train.csv")
    test = test.iloc[:VALIDATE]
print( test.shape )
test.head(1)

(128, 9)


Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
0,30192,gpt-4-1106-preview,gpt-4-0613,"[""Is it morally right to try to have a certain...","[""The question of whether it is morally right ...","[""As an AI, I don't have personal beliefs or o...",1,0,0


# Engineer Prompt
If we want to submit a zero-shot LLM, we need to experiment with different system prompts to improve the CV score. If we finetune the model, the system prompt is less important because the model learns from the targets what to do regardless of which system prompt we use.

We use a logits processor to force the model to output the 3 tokens we are interested in.

In [5]:
from typing import Any, Dict, List
from transformers import LogitsProcessor
import torch

choices = ["A","B","tie"]

KEEP = []
for x in choices:
    c = tokenizer.encode(x,add_special_tokens=False)[0]
    KEEP.append(c)
print(f"Force predictions to be tokens {KEEP} which are {choices}.")

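class DigitLogitsProcessor(LogitsProcessor):
    # Adds a large bias to the logits of the allowed tokens so that
    # the single generated token is always one of "A", "B", or "tie".
    def __init__(self, tokenizer):
        self.allowed_ids = KEEP  # token ids computed above (tokenizer is unused)
        
    def __call__(self, input_ids: List[int], scores: torch.Tensor) -> torch.Tensor:
        scores[self.allowed_ids] += 100
        return scores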

Force predictions to be tokens [59603, 59616, 45228] which are ['A', 'B', 'tie'].
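
The effect of adding a large constant to the allowed logits can be checked on a toy example. The sketch below uses a made-up 5-token vocabulary and made-up logits (both hypothetical, not taken from the model): after the +100 bias, almost all softmax mass sits on the allowed tokens, and their relative probabilities match a softmax restricted to those tokens.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical logits for a toy 5-token vocabulary
logits = np.array([2.0, 5.0, 1.0, 4.0, 3.0])
allowed = [0, 2, 4]  # hypothetical ids standing in for "A", "B", "tie"

# Same bias as the logits processor above
biased = logits.copy()
biased[allowed] += 100

p = softmax(biased)
allowed_mass = p[allowed].sum()

# Softmax computed over the allowed tokens only, without any bias
p_restricted = softmax(logits[allowed])

print("mass on allowed tokens:", allowed_mass)
print("biased, renormalized  :", p[allowed] / allowed_mass)
print("restricted softmax    :", p_restricted)
```

Because the bias is the same for every allowed token, it cancels out when comparing them, so the ranking among "A", "B", and "tie" still comes from the model.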


In [6]:
sys_prompt = """Please read the following prompt and two responses. Determine which response is better.
If the responses are relatively the same, respond with 'tie'. Otherwise respond with 'A' or 'B' to indicate which is better."""

In [7]:
SS = "#"*25 + "\n"

In [8]:
all_prompts = []
for index,row in test.iterrows():
    
    # Each field is a list stored as a string; map JSON null to an empty string.
    a = " ".join(eval(row.prompt, {"null": ""}))
    b = " ".join(eval(row.response_a, {"null": ""}))
    c = " ".join(eval(row.response_b, {"null": ""}))
    
    prompt = f"{SS}PROMPT: "+a+f"\n\n{SS}RESPONSE A: "+b+f"\n\n{SS}RESPONSE B: "+c+"\n\n"
    
    formatted_sample = sys_prompt + "\n\n" + prompt
    
    all_prompts.append( formatted_sample )
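
The cell above does not shorten long conversations itself; the 1024-token limit comes from `max_model_len` when the model is loaded. One way to enforce a token budget explicitly would be to truncate each part before building the prompt. The helper below is a hypothetical sketch (the name `truncate_middle` and the keep-head-and-tail strategy are assumptions, not part of the notebook) that works on a plain list of token ids.

```python
def truncate_middle(token_ids, budget):
    """Return at most `budget` tokens, keeping the start and the end of the list."""
    if len(token_ids) <= budget:
        return list(token_ids)
    head = budget // 2
    tail = budget - head
    # Slice from an explicit index so that tail == 0 yields an empty tail
    return list(token_ids[:head]) + list(token_ids[len(token_ids) - tail:])

# Toy example with integers standing in for token ids
short = truncate_middle([1, 2, 3], 5)
cut = truncate_middle(list(range(10)), 4)
print(short, cut)
```

With a real tokenizer, one could encode each of the three parts with `tokenizer.encode(text, add_special_tokens=False)`, truncate the ids, and turn them back into text with `tokenizer.decode(...)` before formatting the prompt. The per-part budgets would need tuning so that the full prompt stays under 1024 tokens.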

# Infer Test
We infer test using fast vLLM. We ask vLLM to return the log probabilities of the top 5 candidate tokens for the first generated token. We also limit generation to 1 token to increase inference speed.

Based on the time it takes to infer 128 train samples, we can estimate how long inferring 25,000 test samples will take.

In [9]:
%%time

from time import time
start = time()

logits_processors = [DigitLogitsProcessor(tokenizer)]
responses = llm.generate(
    all_prompts,
    vllm.SamplingParams(
        n=1,  # Number of output sequences to return for each prompt.
        top_p=0.9,  # Float that controls the cumulative probability of the top tokens to consider.
        temperature=0,  # 0 means greedy decoding (no sampling randomness)
        seed=777, # Seed for reproducibility
        skip_special_tokens=True,  # Whether to skip special tokens in the output.
        max_tokens=1,  # Maximum number of tokens to generate per output sequence.
        logits_processors=logits_processors,
        logprobs = 5
    ),
    use_tqdm = True
)

end = time()
elapsed = (end-start)/60. #minutes
print(f"Inference of {VALIDATE} samples took {elapsed} minutes!")

Processed prompts: 100%|██████████| 128/128 [01:38<00:00,  1.31it/s]

Inference of 128 samples took 1.645822032292684 minutes!
CPU times: user 1min 38s, sys: 0 ns, total: 1min 38s
Wall time: 1min 38s





In [10]:
submit = 25_000 / 128 * elapsed / 60
print(f"Submit will take {submit} hours")

Submit will take 5.357493594702747 hours


# Extract Inference Probabilities
We now extract the probabilities of "A", "B", "tie" from the vLLM predictions.

In [11]:
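results = []
errors = 0

for i,response in enumerate(responses):
    try:
        # Dict mapping token id -> Logprob for the single generated token
        x = response.outputs[0].logprobs[0]
        probs = []
        for k in KEEP:
            if k in x:
                # Convert the log probability back to a probability
                probs.append( math.exp(x[k].logprob) )
            else:
                # Token was not among the returned top logprobs
                probs.append( 0 )
                print(f"bad logits {i}")
        probs = np.array( probs )
        probs /= probs.sum()  # renormalize over the three choices
        results.append( probs )
    except Exception:
        # No usable output for this prompt: fall back to a uniform prediction
        results.append( np.array([1/3., 1/3., 1/3.]) )
        errors += 1
        
print(f"There were {errors} inference errors out of {len(responses)} inferences")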
results = np.vstack(results)

There were 33 inference errors out of 128 inferences
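
The conversion from log probabilities to a normalized three-way prediction can be illustrated with made-up numbers. In the sketch below, plain floats stand in for vLLM's logprob objects, and the helper name `normalize_choice_probs` and the fourth token id are hypothetical; only the three kept token ids come from the output above.

```python
import math
import numpy as np

def normalize_choice_probs(logprob_by_token, keep_ids):
    """Exponentiate the logprobs of the kept tokens and rescale them to sum to 1."""
    probs = np.array([
        math.exp(logprob_by_token[k]) if k in logprob_by_token else 0.0
        for k in keep_ids
    ])
    total = probs.sum()
    if total == 0:
        # None of the kept tokens was returned: use a uniform prediction
        return np.full(len(keep_ids), 1.0 / len(keep_ids))
    return probs / total

# Made-up top-token logprobs; 11 is an arbitrary extra token id
example = {
    59603: math.log(0.60),  # "A"
    59616: math.log(0.30),  # "B"
    45228: math.log(0.06),  # "tie"
    11: math.log(0.04),
}
probs = normalize_choice_probs(example, [59603, 59616, 45228])
print(probs)  # approximately [0.625, 0.3125, 0.0625]
```

The mass on the unrelated token is discarded, and the remaining 0.96 is rescaled to 1.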


# Create Submission CSV

In [12]:
sub = pd.read_csv("/kaggle/input/lmsys-chatbot-arena/sample_submission.csv")

if len(test)!=VALIDATE:
    sub[["winner_model_a","winner_model_b","winner_tie"]] = results
    
sub.to_csv("submission.csv",index=False)
sub.head()

Unnamed: 0,id,winner_model_a,winner_model_b,winner_tie
0,136060,0.333333,0.333333,0.333333
1,211333,0.333333,0.333333,0.333333
2,1233961,0.333333,0.333333,0.333333


# Compute CV Score

In [13]:
if len(test)==VALIDATE:
    true = test[['winner_model_a','winner_model_b','winner_tie']].values
    print(true.shape)

(128, 3)


In [14]:
if len(test)==VALIDATE:
    from sklearn.metrics import log_loss
    print(f"CV logloss is {log_loss(true,results)}" )

CV logloss is 0.9413079497822053
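
For context, it helps to know the score of a prediction that always outputs 1/3 for each class. The sketch below computes multi-class log loss by hand with NumPy on a few made-up labels (the helper `multiclass_log_loss` and the toy labels are illustrative assumptions, not the notebook's data); the uniform baseline equals ln 3, about 1.0986, whatever the labels are.

```python
import math
import numpy as np

def multiclass_log_loss(y_true_onehot, y_prob, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    y_true_onehot = np.asarray(y_true_onehot, dtype=float)
    return float(-(y_true_onehot * np.log(y_prob)).sum(axis=1).mean())

# Made-up one-hot labels in the order (winner_model_a, winner_model_b, winner_tie)
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
uniform = np.full(y_true.shape, 1 / 3)

baseline = multiclass_log_loss(y_true, uniform)
print(f"uniform baseline log loss: {baseline:.4f}")
print(f"ln(3)                    : {math.log(3):.4f}")
```

A CV score below this baseline, such as the 0.941 reported above, means the predictions carry some signal beyond guessing uniformly.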
