# Core Reranker
vLLM-Hook is an extensible framework that provides selective access to model internals during inference.
As a demonstration, this notebook shows how vLLM-Hook enables *Core Reranker* for document relevance scoring.

**Paper**: [Contrastive Retrieval Heads Improve Attention-Based Re-Ranking](https://arxiv.org/abs/2510.02219).<br />
**Authors**: Linh Tran, Yulong Li, Radu Florian, Wei Sun <br />
**"TL;DR"**: Core Reranker is an attention-based reranker that leverages attention weights from selected transformer heads to produce document relevance scores.


### Installation
If running this from a new environment, please use the cell below to install `vllm_hook_plugins`. Update the path/command to match your environment.<br />
The following block is not necessary if running this notebook from an environment where the package has already been installed.

In [None]:
from pathlib import Path
import sys

# vllm_hooks/notebooks/
NOTEBOOK_DIR = Path.cwd()
REPO_ROOT = NOTEBOOK_DIR.parent

PKG_DIR = REPO_ROOT/"vllm_hook_plugins"
REQ_FILE = REPO_ROOT/"requirement.txt"

print("Notebook dir:", NOTEBOOK_DIR)
print("Repo root   :", REPO_ROOT)
print("Package dir :", PKG_DIR)
print("Req file    :", REQ_FILE)

%pip install -e "{PKG_DIR}"

if REQ_FILE.exists():
    %pip install -r "{REQ_FILE}"
else:
    print("⚠️ Requirements file not found at", REQ_FILE)


### Importing the Hook-Enabled LLM
The plugin provides its own LLM wrapper, `HookLLM`, which behaves like `vllm.LLM` (`from vllm import LLM`) but adds support for hooks and instrumentation.
We import it here:

In [1]:
from vllm_hook_plugins import HookLLM



### Environment & multiprocessing setup

In [2]:
import os
import multiprocessing as mp
import torch
from typing import List
mp.set_start_method("spawn", force=True)
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

### Helper function that gives the prompt ranges
Core Reranker needs to locate the candidate passages and the user query within the prompt, so the helper function below builds the prompt and returns the token ranges of each passage and of the query.<br />
See the [Core Reranker paper](https://arxiv.org/pdf/2510.02219) for more details.

In [3]:
def apply_chat_template_and_get_ranges(tokenizer, model_name: str, query: str, documents: List[str]):
    # setup prompts
    off_set = 0
    if 'granite' in model_name.lower():
        prompt_prefix = '<|start_of_role|>user<|end_of_role|>'
        prompt_suffix = '<|end_of_text|><|start_of_role|>assistant<|end_of_role|>'
    elif 'llama' in model_name.lower():
        prompt_prefix = '<|start_header_id|>user<|end_header_id|>'
        prompt_suffix = '<|eot_id|><|start_header_id|>assistant<|end_header_id|>'
    elif 'mistral' in model_name.lower():
        prompt_prefix = '[INST]'
        prompt_suffix = '[/INST]'
        off_set = 1
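    elif 'phi' in model_name.lower():
        prompt_prefix = '<|im_start|>user<|im_sep|>'
        prompt_suffix = '<|im_end|><|im_start|>assistant<|im_sep|>'
    else:
        # Fail early instead of hitting a NameError on prompt_prefix below
        raise ValueError(f"Unsupported model for prompt template: {model_name}")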
    retrieval_instruction = ' Here are some paragraphs:\n\n'
    retrieval_instruction_late = 'Please find information that are relevant to the following query in the paragraphs above.\n\nQuery: '
    
    doc_span = []
    query_start_idx = None
    query_end_idx = None

    llm_prompt = prompt_prefix + retrieval_instruction

    for i, doc in enumerate(documents):

        llm_prompt += f'[document {i+1}]'
        start_len = len(tokenizer(llm_prompt).input_ids)

        llm_prompt += ' ' + " ".join(doc)
        end_len = len(tokenizer(llm_prompt).input_ids) - off_set

        doc_span.append((start_len, end_len))
        llm_prompt += '\n\n'

    start_len = len(tokenizer(llm_prompt).input_ids)

    llm_prompt += retrieval_instruction_late
    after_retrieval_instruction_late = len(tokenizer(llm_prompt).input_ids) - off_set

    llm_prompt += f'{query.strip()}'
    end_len = len(tokenizer(llm_prompt).input_ids) - off_set
    llm_prompt += prompt_suffix

    query_start_idx = start_len
    query_end_idx = end_len

    return llm_prompt, (doc_span, query_start_idx, after_retrieval_instruction_late, query_end_idx)
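
The helper above finds each span by tokenizing the growing prompt before and after a piece of text is appended. The following is a minimal sketch of that idea only, using a hypothetical whitespace tokenizer (`toy_tokenize`) and a hypothetical function name (`incremental_spans`) instead of the model tokenizer. Real subword tokenizers may merge tokens across the boundary, which is one reason the notebook helper applies a model-specific `off_set`.

```python
from typing import Callable, List, Tuple


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer: split on whitespace.
    return text.split()


def incremental_spans(
    prefix: str,
    pieces: List[str],
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> Tuple[str, List[Tuple[int, int]]]:
    """Append each piece to the prompt and record its [start, end) token span."""
    prompt = prefix
    spans = []
    for i, piece in enumerate(pieces):
        prompt += f" [document {i + 1}]"
        start = len(tokenize(prompt))  # token count before the piece body
        prompt += " " + piece
        end = len(tokenize(prompt))    # token count after the piece body
        spans.append((start, end))
    return prompt, spans


prompt, spans = incremental_spans(
    "Here are some paragraphs:",
    ["the telephone came first", "light bulbs glow"],
)
tokens = toy_tokenize(prompt)
for start, end in spans:
    print((start, end), tokens[start:end])
```

As in the notebook helper, the `[document i]` label is counted before `start` is taken, so each span covers only the passage body.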

### Initialize `HookLLM`
Before we create the LLM instance, we need to specify the model and data type:

In [4]:
cache_dir = os.path.expanduser('~/.cache')  # Specify cache dir; expand '~' explicitly
model = 'mistralai/Mistral-7B-Instruct-v0.3' 
    
dtype_map = {
    'mistralai/Mistral-7B-Instruct-v0.3': torch.float16,
}

We also need to provide a config file that specifies the important heads we want to track. <br />
For Core Reranker, this config file can be obtained from [head_detection.py](https://github.com/linhhtran/CoRe-Reranking/blob/main/experiments/head_detection.py). 

In [5]:
import json
from pathlib import Path

json_path = Path("../model_configs/core_reranker/Mistral-7B-Instruct-v0.3.json")  # adjust path

with open(json_path, "r") as f:
    config = json.load(f)

# print(config)

The `probe_hook_qk` worker and the `core_reranker` analyzer define what happens during and after model inference:
- `workers/probe_hookqk_worker.py` specifies that `q` (query) and `k` (key) are saved during forward passes
- `analyzers/core_reranker_analyzer.py` calculates the passage relevance scores and the final ranking of passages
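
To make the role of the saved `q` and `k` concrete, here is a minimal pure-Python sketch of how attention mass on each passage could be recovered from them for a single head. It is not the plugin's implementation: the function names are hypothetical, the vectors are toy values, and the real analyzer works on tensors from the selected heads and may aggregate them differently.

```python
import math
from typing import List, Tuple


def attention_row(q: List[float], keys: List[List[float]]) -> List[float]:
    """Softmax of scaled dot products between one query vector and all keys."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def span_attention(
    queries: List[List[float]],
    keys: List[List[float]],
    doc_spans: List[Tuple[int, int]],
) -> List[float]:
    """Sum, over query tokens, the attention mass that falls on each span."""
    scores = [0.0] * len(doc_spans)
    for q in queries:
        weights = attention_row(q, keys)
        for d, (start, end) in enumerate(doc_spans):
            scores[d] += sum(weights[start:end])
    return scores


# Toy context: two passages of two tokens each, with 2-dimensional keys.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
doc_spans = [(0, 2), (2, 4)]
queries = [[2.0, 0.0]]  # one query token aligned with the first passage

scores = span_attention(queries, keys, doc_spans)
print(scores)
```

The sketch assumes all keys passed in are visible to the query tokens, which holds here because the query follows the passages in the prompt.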

Now, we initialize the LLM:

In [6]:
llm = HookLLM(
    model=model,
    worker_name="probe_hook_qk",
    analyzer_name="core_reranker",
    config_file=json_path,
    download_dir=cache_dir,
    gpu_memory_utilization=0.7,
    trust_remote_code=True,
    dtype=dtype_map[model],
    enable_prefix_caching=True,
    enable_hook=True
)

INFO 12-05 17:50:46 [utils.py:253] non-default args: {'trust_remote_code': True, 'download_dir': '/dccstor/pyrite/irene/', 'dtype': torch.float16, 'seed': None, 'enable_prefix_caching': True, 'gpu_memory_utilization': 0.7, 'disable_log_stats': True, 'enforce_eager': True, 'worker_cls': 'vllm_hook_plugins.workers.probe_hookqk_worker.ProbeHookQKWorker', 'model': 'mistralai/Mistral-7B-Instruct-v0.3'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 12-05 17:50:47 [model.py:637] Resolved architecture: MistralForCausalLM
INFO 12-05 17:50:47 [model.py:1750] Using max model len 32768


2025-12-05 17:50:52,496	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 12-05 17:50:52 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 12-05 17:50:52 [vllm.py:707] Cudagraph is disabled under eager mode

(EngineCore_DP0 pid=856148) INFO 12-05 17:52:54 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/dccstor/pyrite/irene/', load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [01:02<00:00, 62.55s/it]

(EngineCore_DP0 pid=856148) INFO 12-05 17:54:06 [default_loader.py:308] Loading weights took 63.32 seconds
(EngineCore_DP0 pid=856148) INFO 12-05 17:54:07 [gpu_model_runner.py:3549] Model loading took 13.5084 GiB memory and 69.303830 seconds
(EngineCore_DP0 pid=856148) Installed 5 hooks on layers: ['model.layers.9.self_attn.attn', 'model.layers.12.self_attn.attn', 'model.layers.15.self_attn.attn', 'model.layers.16.self_attn.attn', 'model.layers.18.self_attn.attn']
(EngineCore_DP0 pid=856148) Hooks installed successfully
(EngineCore_DP0 pid=856148) INFO 12-05 17:54:12 [gpu_worker.py:359] Available KV cache memory: 41.08 GiB
(EngineCore_DP0 pid=856148) INFO 12-05 17:54:12 [kv_cache_utils.py:1286] GPU KV cache size: 336,512 tokens
(EngineCore_DP0 pid=856148) INFO 12-05 17:54:12 [kv_cache_utils.py:1291] Maximum concurrency for 32,768 tokens per request: 10.27x

(EngineCore_DP0 pid=856148) INFO 12-05 17:54:15 [vllm.py:707] Cudagraph is disabled under eager mode
INFO 12-05 17:54:16 [llm.py:343] Supported tasks: ['generate']


### Test case
In the following, we show a test case with seven candidate passages and a user query.

In [7]:
case = {
        "query": "Which came first, the invention of the telephone or the light bulb?",
        "documents": [
            [
            "Alexander Graham Bell is credited with inventing the first practical telephone.",
            " He was awarded the U.S. patent for the invention of the telephone on March 7, 1876.",
            " The first successful demonstration of the telephone took place shortly thereafter, when Bell famously called his assistant, saying, 'Mr. Watson, come here, I want to see you.'",
            " Bell’s invention revolutionized communication by allowing people to talk to each other over long distances."
            ],
            [
            "Thomas Edison is widely known for inventing the first commercially practical incandescent light bulb.",
            " Although he did not invent the concept of the light bulb itself, Edison developed a version that was safe, affordable, and long-lasting.",
            " His patent for the electric light bulb was filed in 1879, three years after Bell’s telephone patent.",
            " Edison's innovation led to widespread use of electric lighting and helped usher in the modern electrical age."
            ],
            [
            "Before Edison, several inventors worked on early versions of the light bulb.",
            " Sir Humphry Davy created the first electric arc lamp in the early 1800s, and later inventors like Joseph Swan in Britain improved upon the design.",
            " However, these early bulbs were inefficient or burned out quickly, and it was Edison who perfected the design for everyday use."
            ],
            [
            "The telephone was invented before the practical light bulb.",
            " Bell’s patent for the telephone was issued in 1876, while Edison’s patent for the light bulb was filed in 1879.",
            " Thus, the telephone came first."
            ],
            [
            "Both the telephone and the light bulb are considered groundbreaking inventions of the late 19th century.",
            " The telephone transformed communication, while the light bulb transformed how people lived and worked at night.",
            " Together, they symbolize the rapid technological progress of that era."
            ],
            [
            "Edison and Bell were contemporaries and pioneers of the Second Industrial Revolution.",
            " Their inventions marked major milestones in human history, driving the growth of telecommunications and electrical infrastructure."
            ],
            [
            "In summary, the telephone was invented in 1876 and the light bulb in 1879.",
            " Therefore, the invention of the telephone came first."
            ]
        ]
    }

Next, we apply the chat template and obtain the input ranges using the helper function defined above.<br />
Because Core Reranker relies on the attention aggregated from the user query to each passage, it needs a reference attention baseline for each passage. The authors replace the user query with `'N/A'` and use the resulting aggregated attention as the normalizing factor for each passage.
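
The effect of such a baseline can be sketched with toy numbers. The text above does not state how the analyzer combines the two runs, so this sketch assumes the `'N/A'` baseline is removed by subtraction; the function names and score values are hypothetical.

```python
from typing import List


def calibrate(query_scores: List[float], na_scores: List[float]) -> List[float]:
    """Remove the content-free baseline from each passage score.

    ASSUMPTION: subtraction is used here for illustration; the actual
    analyzer may normalize differently.
    """
    if len(query_scores) != len(na_scores):
        raise ValueError("Both runs must score the same passages")
    return [s - b for s, b in zip(query_scores, na_scores)]


def rank(scores: List[float]) -> List[int]:
    """Return passage indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


raw = [0.50, 0.30, 0.20]       # toy aggregated attention with the real query
baseline = [0.45, 0.10, 0.05]  # toy aggregated attention with 'N/A'

calibrated = calibrate(raw, baseline)
print("raw ranking       :", rank(raw))
print("calibrated ranking:", rank(calibrated))
```

In this toy case the first passage attracts attention regardless of the query, so it drops to the bottom once the baseline is removed.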

In [8]:
query = case["query"]
documents = case["documents"]
        
# Apply chat template and get ranges
query_text, query_spec = apply_chat_template_and_get_ranges(llm.tokenizer, model, query, documents)
na_text, na_spec = apply_chat_template_and_get_ranges(llm.tokenizer, model, 'N/A', documents)

Finally, we perform the model inference:

In [9]:
llm.generate(query_text, temperature=0.1, max_tokens=1)
llm.generate(na_text, cleanup=False, temperature=0.1, max_tokens=1)

Logged run ID.
Created hook flag.


Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 618.08it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.61s/it, est. speed input: 363.81 toks/s, output: 0.62 toks/s]


Hooks deactivated.


Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 568.64it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 64.62it/s, est. speed input: 38989.81 toks/s, output: 66.60 toks/s]


Logged run ID.
Created hook flag.


Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 693.73it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 28.66it/s, est. speed input: 16795.39 toks/s, output: 29.29 toks/s]


Hooks deactivated.


Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 584.90it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 71.72it/s, est. speed input: 42564.04 toks/s, output: 74.21 toks/s]


[RequestOutput(request_id=3, prompt="[INST] Here are some paragraphs:\n\n[document 1] Alexander Graham Bell is credited with inventing the first practical telephone.  He was awarded the U.S. patent for the invention of the telephone on March 7, 1876.  The first successful demonstration of the telephone took place shortly thereafter, when Bell famously called his assistant, saying, 'Mr. Watson, come here, I want to see you.'  Bell’s invention revolutionized communication by allowing people to talk to each other over long distances.\n\n[document 2] Thomas Edison is widely known for inventing the first commercially practical incandescent light bulb.  Although he did not invent the concept of the light bulb itself, Edison developed a version that was safe, affordable, and long-lasting.  His patent for the electric light bulb was filed in 1879, three years after Bell’s telephone patent.  Edison's innovation led to widespread use of electric lighting and helped usher in the modern electrical

During model inference in the previous step, vLLM-Hook automatically saved the selected queries and keys. We can now call the analyzer directly to get the passage relevance scores and the final ranking of passages:

In [10]:

stats = llm.analyze(analyzer_spec={'query_spec': query_spec, 'na_spec': na_spec})

Finally, we print the results:

In [11]:
print(f"Sorted document IDs and scores by CoRe-Reranking: {stats['ranking']}: {stats['scores']}")

Sorted document IDs and scores by CoRe-Reranking: [[6, 3, 1, 0, 4, 2, 5]]: [[4.04296875, 3.427734375, 2.419921875, 1.6767578125, 1.62890625, 1.01953125, 0.79736328125]]
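
To read this output more easily, the IDs can be paired with their scores. The sketch below hard-codes the values printed above and assumes, based on that output, that the outer list holds one entry per query, that the IDs are zero-based positions in `documents`, and that the scores are listed in ranking order. `top_k` is a hypothetical helper, not part of the plugin.

```python
from typing import List, Tuple

# Values copied from the printed output above.
ranking = [[6, 3, 1, 0, 4, 2, 5]]
scores = [[4.04296875, 3.427734375, 2.419921875, 1.6767578125,
           1.62890625, 1.01953125, 0.79736328125]]


def top_k(ranking: List[List[int]], scores: List[List[float]],
          k: int, query_idx: int = 0) -> List[Tuple[int, float]]:
    """Pair the k best document IDs with their scores for one query."""
    return list(zip(ranking[query_idx][:k], scores[query_idx][:k]))


best = top_k(ranking, scores, k=3)
for doc_id, score in best:
    # In the notebook, " ".join(documents[doc_id]) would give the passage text.
    print(f"document index {doc_id}: score {score:.3f}")
```

Under these assumptions, the top three entries are indices 6, 3, and 1, which correspond to the summary passage, the passage that directly compares the two patents, and the Edison passage.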
