## Inference

### Step 0: Set up the environment

First, use VSCode to connect to the login node.

Then open this notebook and set the kernel (ipykernel) to the web URL given in README.md.

If the setup succeeded, the next cell outputs `True`.

In [1]:
import torch
torch.cuda.is_available()

True
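
If the cell prints `False` instead, it can help to see what PyTorch detects before going further. The following is a minimal diagnostic sketch, not part of the original notebook. The helper name `describe_cuda` is hypothetical, and the only assumption is that `torch` may or may not be importable in the selected kernel.

```python
import os


def describe_cuda():
    """Return a short, human-readable summary of the visible CUDA devices."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this kernel"

    if not torch.cuda.is_available():
        # CUDA_VISIBLE_DEVICES is a common reason for a GPU being hidden.
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
        return f"no CUDA device visible (CUDA_VISIBLE_DEVICES={visible})"

    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return f"{len(names)} CUDA device(s): " + ", ".join(names)


print(describe_cuda())
```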

### Step 0.5: Process Data

In [3]:
import json

data_file = "ControlBench_rubric.json"

with open(data_file, 'r') as f:
    data = json.load(f)

questions = [e['Question']['Text'] for e in data]

# One list of reasoning-step texts per question; empty if the entry has none.
solutions = [
    [step['Text'] for step in e['Solution'].get('ReasoningSteps', [])]
    for e in data
]

In [4]:
print('Question 0: ', questions[0])
print('Solution 0: ', solutions[0])

Question 0:  Determine the transfer function of a linear time invariant (LTI) system given the following information: The system has relative degree 3. It has 3 poles, of which 2 are at -2 and -4. The impulse response resembles a step response for a stable linear system with a steady state value of 0.25.
Solution 0:  ['The system has a relative degree 3 with 3 poles, hence it has no finite zeros.', 'With 3 poles, the transfer function takes the general form G(s) = K / [A(s)(s + 2)(s + 4)].', 'Since the impulse response resembles a step response with a steady state value, we conclude the system must contain a pole at zero. Therefore, the transfer function is of the form G(s) = K / [s(s + 2)(s + 4)].', 'Using the final value theorem to determine K, we evaluate lim(s -> 0) sG(s) and obtain K / 8 = 0.25.', 'Solving the equation, we find K = 2.', 'Thus, the transfer function of the system is G(s) = 2 / [s(s + 2)(s + 4)].']
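
Some entries may have no `ReasoningSteps` field, in which case the solution list for that question is empty. The sketch below shows the same extraction on two hypothetical in-memory records (the names `sample_json`, `sample_questions`, and `sample_solutions` and the record contents are made up for illustration and are not real ControlBench data).

```python
import json

# Hypothetical records mirroring the fields read from the data file.
sample_json = """
[
  {"Question": {"Text": "Q1"},
   "Solution": {"ReasoningSteps": [{"Text": "step a"}, {"Text": "step b"}]}},
  {"Question": {"Text": "Q2"},
   "Solution": {}}
]
"""

records = json.loads(sample_json)

sample_questions = [r["Question"]["Text"] for r in records]
# dict.get returns [] when "ReasoningSteps" is absent,
# so entries without reasoning steps yield an empty list.
sample_solutions = [
    [step["Text"] for step in r["Solution"].get("ReasoningSteps", [])]
    for r in records
]

print(sample_questions)   # ['Q1', 'Q2']
print(sample_solutions)   # [['step a', 'step b'], []]
```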


### Step 1: Load models into vLLM engine

In [4]:
from vllm import LLM, SamplingParams

model_path = "/u/ziqiw9/LLMs/Llama-3.1-8B-Instruct"

llm = LLM(model=model_path, trust_remote_code=False, tensor_parallel_size=1)

  from .autonotebook import tqdm as notebook_tqdm
2024-11-02 16:21:28,133	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 11-02 16:21:34 config.py:1021] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 11-02 16:21:34 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/u/ziqiw9/LLMs/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='/u/ziqiw9/LLMs/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_n

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:07<00:21,  7.25s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:15<00:16,  8.04s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:17<00:05,  5.30s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:26<00:00,  6.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:26<00:00,  6.51s/it]



INFO 11-02 16:22:03 model_runner.py:1067] Loading model weights took 14.9888 GB
INFO 11-02 16:22:03 gpu_executor.py:122] # GPU blocks: 28161, # CPU blocks: 2048
INFO 11-02 16:22:03 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 3.44x
INFO 11-02 16:22:05 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-02 16:22:05 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-02 16:22:16 model_runner.py:1523] Graph capturing finished in 11 secs.


In [15]:
stop_token_ids = [128009]  # <|eot_id|>, the end-of-turn token for Llama 3
n_per_q = 10
sampling_params = SamplingParams(n=n_per_q, temperature=0.8, top_p=1, seed=123, max_tokens=2048, stop_token_ids=stop_token_ids)

Load the tokenizer so that prompts can be formatted with the model's chat template.

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)

### Step 1.5: Try your own examples

In [7]:
prompts = [
    "Who are you?",
    "1+1=?"
]

prompts = [tokenizer.apply_chat_template(
    [{"role": "user", "content": e}],
    add_generation_prompt=True,
    tokenize=False,
) for e in prompts]

# print(prompts)

outputs = llm.generate(prompts, sampling_params)

Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.53it/s, est. speed input: 261.63 toks/s, output: 101.38 toks/s]


In [8]:
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, \n Generated text: {generated_text!r} \n ---------------- \n")

Prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', 
 Generated text: 'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."' 
 ---------------- 

Prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n1+1=?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', 
 Generated text: '1 + 1 = 2' 
 ---------------- 
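
The prompts printed above show the text that `apply_chat_template` produced for this model: a system header with two date fields, the user turn, and an open assistant header. As a rough illustration only, the hand-written function below rebuilds that string from the pieces visible in the output. The name `format_llama3_prompt` is hypothetical and the default date strings are copied from the output above. In practice the tokenizer's own template should be used.

```python
def format_llama3_prompt(user_message,
                         cutoff="December 2023",
                         today="26 Jul 2024"):
    """Hand-written approximation of the prompt layout seen in the output above."""
    system_block = (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"Cutting Knowledge Date: {cutoff}\n"
        f"Today Date: {today}\n\n"
        "<|eot_id|>"
    )
    user_block = (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
    )
    # The trailing assistant header plays the role of add_generation_prompt=True.
    assistant_header = "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return "<|begin_of_text|>" + system_block + user_block + assistant_header


print(repr(format_llama3_prompt("1+1=?")))
```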



### Step 2: Get output for ControlBench

In [11]:
questions_processed = [tokenizer.apply_chat_template(
    [{"role": "user", "content": e}],
    add_generation_prompt=True,
    tokenize=False,
) for e in questions]

# print(questions_processed)

answers = llm.generate(questions_processed, sampling_params)

Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:44<00:00,  3.46s/it, est. speed input: 29.76 toks/s, output: 2685.34 toks/s]


In [13]:
# Print every sampled output for the first question only
for answer in answers:
    prompt = answer.prompt
    for e in answer.outputs:
        print(f"Prompt: {prompt!r}, \n Generated text: {e.text!r} \n ---------------- \n")
    break

Prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nDetermine the transfer function of a linear time invariant (LTI) system given the following information: The system has relative degree 3. It has 3 poles, of which 2 are at -2 and -4. The impulse response resembles a step response for a stable linear system with a steady state value of 0.25.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', 
 Generated text: "Given the information, we need to find the transfer function of the linear time invariant (LTI) system.\n\nFrom the given information:\n\n1. The system has a relative degree of 3.\n2. It has 3 poles, of which 2 are at -2 and -4, and one is unknown.\n3. The impulse response resembles a step response for a stable linear system with a steady-state value of 0.25.\n\nTo find the transfer function, we can start by assuming the structure 

In [14]:
answers_processed = [[e.text for e in answer.outputs] for answer in answers]

### Step 3: Estimate P(z|y)
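
This step asks the model to judge, for every sampled answer and every ground-truth reasoning step, whether the answer contains that step. All (answer, step) pairs for a question are flattened into one batch, and the flat index is later split back into an answer index and a step index with integer division and modulo. A minimal sketch of that bookkeeping, using hypothetical toy values rather than real model output:

```python
# Hypothetical toy inputs: 2 sampled answers, 3 ground-truth steps.
gen_answers = ["answer A", "answer B"]
gt_steps = ["step 1", "step 2", "step 3"]

# Flatten answer-major: all steps for answer A, then all steps for answer B.
pairs = [(a, s) for a in gen_answers for s in gt_steps]

recovered = []
for i in range(len(pairs)):
    # divmod(i, n) is the same as (i // n, i % n).
    answer_id, step_id = divmod(i, len(gt_steps))
    recovered.append((gen_answers[answer_id], gt_steps[step_id]))

print(recovered == pairs)  # True: the index mapping round-trips
```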

In [26]:
prompt_template = """
You are given a question, a model-generated answer, and a reasoning step from the ground-truth answer.
You are required to analyze and tell whether the model-generated answer contains the given reasoning step.
End your answer with [[Yes]] or [[No]].

Question: {}


Model-generated answer: {}

Reasoning Step: {}

"""

stop_token_ids = [128009]  # <|eot_id|>, the end-of-turn token for Llama 3
eval_params = SamplingParams(n=1, temperature=0, top_p=1, seed=123, max_tokens=2048, stop_token_ids=stop_token_ids)

dump_data_full = []

for question, gt_steps, gen_answers in zip(questions, solutions, answers_processed):
    dump_data = [{
        'question': question,
        'answer': gen_answer,
        'contamination': []
    } for gen_answer in gen_answers]
    prompt_set = []

    len_gt_steps = len(gt_steps)

    for gen_answer in gen_answers:
        for gt_step in gt_steps:
            prompt_set.append(prompt_template.format(question, gen_answer, gt_step))
    
    if len(prompt_set) == 0:
        continue

    prompt_set_processed = [tokenizer.apply_chat_template(
        [{"role": "user", "content": e}],
        add_generation_prompt=True,
        tokenize=False,
    ) for e in prompt_set]

    eval_results = llm.generate(prompt_set_processed, eval_params)
    eval_results = [result.outputs[0].text.strip() for result in eval_results]

    for i, eval_result in enumerate(eval_results):
        # endswith() is safe on an empty string, unlike indexing with [-1].
        if eval_result.endswith('.'):
            eval_result = eval_result[:-1]

        contain = None
        if '[[Yes]]' in eval_result:
            contain = True
        elif '[[No]]' in eval_result:
            contain = False

        # Prompts were appended answer by answer, so split the flat index back
        # into (which sampled answer, which ground-truth step).
        answer_id = i // len_gt_steps
        step_id = i % len_gt_steps

        dump_data[answer_id]['contamination'].append({
            'gt_step': gt_steps[step_id],
            'is_contain': contain,
            'analysis': eval_result
        })
    
    dump_data_full.extend(dump_data)

print(len(dump_data_full))

import json
with open('/u/ziqiw9/LLM4Eng/dump.json', 'w') as f:
    json.dump(dump_data_full, f, indent=4)


Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:09<00:00,  6.04it/s, est. speed input: 7018.22 toks/s, output: 257.08 toks/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:06<00:00,  6.44it/s, est. speed input: 7126.26 toks/s, output: 288.10 toks/s]
Processed prompts:  70%|████████████████████████████████████████████████████████████████▍                           | 35/50 [10:02<04:18, 17.21s/it, est. speed input: 8419.07 toks/s, output: 55.11 toks/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:07<00:00,  8.38it/s, est. speed input: 9703.25 toks/s, output: 161.33 toks/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.42it/s, est. speed input: 5191.22 toks/s

120
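
The dump stores one `is_contain` flag per (answer, step) pair, so one way to turn it into an estimate is to average the flags over the sampled answers of each question. The sketch below does that on hypothetical records shaped like `dump_data_full`. Treating the fraction of answers that contain a step as the estimate of P(z|y) is an assumption made here, and the helper names `parse_verdict` and `step_rates` are hypothetical. `parse_verdict` reads the last `[[Yes]]`/`[[No]]` marker rather than the first, which is one possible way to handle analyses that mention both.

```python
import re
from collections import defaultdict

VERDICT = re.compile(r"\[\[(Yes|No)\]\]")


def parse_verdict(text):
    """Return True/False from the last [[Yes]]/[[No]] marker, or None if absent."""
    matches = VERDICT.findall(text)
    if not matches:
        return None
    return matches[-1] == "Yes"


def step_rates(records):
    """Map (question, gt_step) to the fraction of judged answers containing the step."""
    counts = defaultdict(lambda: [0, 0])  # [contained, judged]
    for record in records:
        for item in record["contamination"]:
            if item["is_contain"] is None:
                continue  # skip verdicts that could not be parsed
            key = (record["question"], item["gt_step"])
            counts[key][0] += int(item["is_contain"])
            counts[key][1] += 1
    return {key: hit / total for key, (hit, total) in counts.items()}


# Hypothetical records in the same shape as dump_data_full.
toy_dump = [
    {"question": "Q", "answer": "a1", "contamination": [
        {"gt_step": "s1", "is_contain": parse_verdict("It does. [[Yes]]"), "analysis": ""},
        {"gt_step": "s2", "is_contain": parse_verdict("Not [[Yes]], so [[No]]"), "analysis": ""},
    ]},
    {"question": "Q", "answer": "a2", "contamination": [
        {"gt_step": "s1", "is_contain": parse_verdict("[[No]]"), "analysis": ""},
        {"gt_step": "s2", "is_contain": parse_verdict("unclear"), "analysis": ""},
    ]},
]

print(step_rates(toy_dump))  # {('Q', 's1'): 0.5, ('Q', 's2'): 0.0}
```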



