# Data Prep

In [1]:
!pip install -r requirements.txt

Collecting datasets (from -r requirements.txt (line 1))
  Using cached datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting flow_judge (from -r requirements.txt (line 2))
  Using cached flow_judge-0.1.2-py3-none-any.whl.metadata (27 kB)
Collecting filelock (from datasets->-r requirements.txt (line 1))
  Using cached filelock-3.16.1-py3-none-any.whl.metadata (2.9 kB)
Collecting numpy>=1.17 (from datasets->-r requirements.txt (line 1))
  Using cached numpy-2.1.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting pyarrow>=15.0.0 (from datasets->-r requirements.txt (line 1))
  Using cached pyarrow-18.1.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->-r requirements.txt (line 1))
  Using cached dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting pandas (from datasets->-r requirements.txt (line 1))
  Using cached pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metada

In [2]:
from datasets import load_dataset

dataset_name = "mlabonne/FineTome-100k"
dataset = load_dataset(dataset_name, split="all").shuffle(seed=42)
messages = dataset.map(
    lambda row: {"role": "user", "content": next((item["value"] for item in row["conversations"] if item["from"] == "human"), None)},
    remove_columns=dataset.column_names
)

messages = list(messages)[:100]

for message in messages:
    print(message['content'])

orpo_path = "orpo.json"
llama_path = "llama.json"

Give three types of computer graphics.
Write Python code to solve the task:
Let S be the concatenation of 10^{10} copies of the string 110. (For reference, the concatenation of 3 copies of 110 is 110110110.)
We have a string T of length N.
Find the number of times T occurs in S as a contiguous substring.

-----Constraints-----
 - 1 \leq N \leq 2 \times 10^5
 - T is a string of length N consisting of 0 and 1.

-----Input-----
Input is given from Standard Input in the following format:
N
T

-----Output-----
Print the number of times T occurs in S as a contiguous substring.

-----Sample Input-----
4
1011

-----Sample Output-----
9999999999

S is so long, so let us instead count the number of times 1011 occurs in the concatenation of 3 copies of 110, that is, 110110110. We can see it occurs twice:
 - 1 1011 0110
 - 1101 1011 0
Can you provide information on the most effective methods for teaching language to young children?
How do you solve and graph the compound inequality 2t + 1 > 13 or 
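
The `map` call above keeps only the first human turn of each conversation. A minimal sketch of the same extraction as a standalone function, run on toy rows, is shown below. The helper names `first_human_turn` and `to_user_message` and the toy rows are illustrative assumptions, not part of the dataset or the `datasets` API; the sketch also drops rows that have no human turn, which the original cell does not do.

```python
def first_human_turn(conversations):
    """Return the text of the first turn whose "from" field is "human", or None."""
    for turn in conversations:
        if turn.get("from") == "human":
            return turn.get("value")
    return None


def to_user_message(row):
    """Map one dataset row to a chat-style user message (hypothetical helper)."""
    return {"role": "user", "content": first_human_turn(row["conversations"])}


# Toy rows in the same {"from": ..., "value": ...} layout as the dataset rows.
rows = [
    {"conversations": [{"from": "system", "value": "Be brief."},
                       {"from": "human", "value": "Give three types of computer graphics."},
                       {"from": "gpt", "value": "Raster, vector, and 3D."}]},
    {"conversations": [{"from": "gpt", "value": "No human turn here."}]},
]

messages_demo = [to_user_message(r) for r in rows]
# Drop rows with no human turn so later prompt formatting never sees None.
messages_demo = [m for m in messages_demo if m["content"] is not None]
print(messages_demo)
```

Filtering out `None` contents up front is one way to keep later prompt formatting from receiving an empty prompt.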

# Model Prep

In [10]:

from unsloth import FastLanguageModel
import json, time

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "EITD/orpo_model",  # fine-tuned model under evaluation
    max_seq_length = 2048,
    dtype = None,  # None lets Unsloth auto-detect the dtype
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

responses = []
for message in messages:
    start = time.time()        
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                message['content'],  # instruction
                message['content'],  # input
                "",  # output - leave this blank for generation!
            )
        ],
        return_tensors = "pt",
    ).to("cuda")
    
    outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True, temperature = 1.5, min_p = 0.1)
    
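    # Decode only the newly generated tokens; splitting the full text on "Response"
    # would truncate any answer that itself contains that word.
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)[0].strip()
    end = time.time()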
    responses.append({'time': end - start, 'response': response})

with open(orpo_path, "w", encoding="utf-8") as f:
    json.dump(responses, f, indent=4)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: NotImplemented is not recognized, so we'll default to None
==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.47.0.
   \\   /|    GPU: NVIDIA H100 80GB HBM3 MIG 1g.20gb. Max memory: 19.625 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.0+cu121. CUDA: 9.0. CUDA Toolkit: 12.1. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.12.2 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.
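
When only decoded text is available (for example, when re-processing saved generations), the answer can still be separated from the Alpaca prompt at the string level. A minimal sketch, assuming the full `### Response:` marker appears exactly once in the prompt portion; the helper name `extract_response` and the sample string are illustrative. Partitioning on the full marker from the left avoids a pitfall of splitting on the bare word "Response" and keeping the last piece, which would cut off any answer that contains that word.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the first occurrence of `marker`, stripped."""
    _, found, tail = decoded.partition(marker)
    # If the marker is absent, fall back to the whole string instead of "".
    return (tail if found else decoded).strip()


# Illustrative decoded output: prompt followed by an answer containing "Response".
decoded = (
    "Below is an instruction...\n\n"
    "### Instruction:\nExplain HTTP.\n\n"
    "### Input:\n\n\n"
    "### Response:\nHTTP has a Request and a Response phase."
)

print(extract_response(decoded))
# For comparison, the bare-word split keeps only the tail after the last "Response".
print(decoded.split("Response")[-1].strip())
```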


In [11]:

from unsloth import FastLanguageModel
import json, time

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct",  # baseline model for comparison
    max_seq_length = 2048,
    dtype = None,  # None lets Unsloth auto-detect the dtype
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model)

responses = []
for message in messages:
    start = time.time()        
    inputs = tokenizer.apply_chat_template(
        [message],
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")
    
    outputs = model.generate(input_ids = inputs, max_new_tokens = 128, use_cache = True, temperature = 1.5, min_p = 0.1)
    
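    # Decode only the newly generated tokens; splitting the full text on "assistant"
    # would truncate any answer that itself contains that word.
    prompt_len = inputs.shape[1]
    response = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)[0].strip()
    end = time.time()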
    responses.append({'time': end - start, 'response': response})

with open(llama_path, "w", encoding="utf-8") as f:
    json.dump(responses, f, indent=4)

Unsloth: NotImplemented is not recognized, so we'll default to None
==((====))==  Unsloth 2024.12.2: Fast Llama patching. Transformers:4.47.0.
   \\   /|    GPU: NVIDIA H100 80GB HBM3 MIG 1g.20gb. Max memory: 19.625 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.4.0+cu121. CUDA: 9.0. CUDA Toolkit: 12.1. Triton: 3.0.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
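
Both generation loops record per-sample latency with `time.time()`. A minimal sketch of the same measurement wrapped in a reusable context manager is given below; it uses `time.perf_counter()`, a monotonic clock intended for measuring intervals. The name `timed` and the placeholder workload are assumptions for illustration and stand in for the generate-and-decode step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(record: dict, key: str = "time"):
    """Store the elapsed seconds of the with-block in record[key]."""
    start = time.perf_counter()
    try:
        yield record
    finally:
        record[key] = time.perf_counter() - start


entry = {}
with timed(entry):
    # Placeholder workload standing in for model.generate + decoding.
    response = "stand-in response"
    time.sleep(0.01)
entry["response"] = response
print(entry)
```

The resulting `entry` has the same `{'time': ..., 'response': ...}` shape as the records written to the JSON files.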


---

# LLM Judge

---

In [3]:
from flow_judge import EvalInput, FlowJudge, Vllm
from flow_judge.metrics import RESPONSE_FAITHFULNESS_5POINT
import json

# Initialize the judge model (served through vLLM)
judge_model = Vllm()

# Initialize the judge with the 5-point response-faithfulness metric
faithfulness_judge = FlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=judge_model
)

# Create a list of inputs and outputs
inputs_batch = [
    [
        {"query": message["content"]},
        {"context": ""},
    ]
    for message in messages
]

def judge_file(file_path):
    """Score every response stored in file_path and write the scores back in place."""
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    outputs_batch = [{"response": item["response"]} for item in data]

    # Create a list of EvalInput; strict=True fails loudly if the lengths differ
    eval_inputs_batch = [
        EvalInput(inputs=inputs, output=output)
        for inputs, output in zip(inputs_batch, outputs_batch, strict=True)
    ]

    # Run the batch evaluation
    results = faithfulness_judge.batch_evaluate(eval_inputs_batch, save_results=False)

    for i, result in enumerate(results):
        data[i]['score'] = result.score

    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

judge_file(orpo_path)
judge_file(llama_path)




A new version of the following files was downloaded from https://huggingface.co/flowaicom/Flow-Judge-v0.1-AWQ:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


INFO 12-06 01:53:30 awq_marlin.py:90] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-06 01:53:30 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='flowaicom/Flow-Judge-v0.1-AWQ', speculative_config=None, tokenizer='flowaicom/Flow-Judge-v0.1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), 


INFO 12-06 01:53:33 model_runner.py:1014] Starting to load model flowaicom/Flow-Judge-v0.1-AWQ...
INFO 12-06 01:53:33 weight_utils.py:242] Using model weights format ['*.safetensors']
INFO 12-06 01:53:33 weight_utils.py:287] No model.safetensors.index.json found in remote.




INFO 12-06 01:53:35 model_runner.py:1025] Loading model weights took 2.1717 GB
INFO 12-06 01:53:37 gpu_executor.py:122] # GPU blocks: 2456, # CPU blocks: 682





Processed prompts: 100%|██████████| 100/100 [00:49<00:00,  2.01it/s, est. speed input: 2076.53 toks/s, output: 489.04 toks/s]
Processed prompts: 100%|██████████| 100/100 [00:49<00:00,  2.01it/s, est. speed input: 2087.92 toks/s, output: 508.46 toks/s]


In [7]:
def compute_metrics(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    time_sum = 0
    score_sum = 0
    for item in data:
        time_sum += item['time']
        score_sum += item['score']

    print("Avg Inference Time:", time_sum / len(data))
    print("Avg Score:", score_sum / len(data))

print("Orpo metrics:")
compute_metrics(orpo_path)
print("Llama metrics:")
compute_metrics(llama_path)

Orpo metrics:
Avg Inference Time: 1.4090176939964294
Avg Score: 3.24
Llama metrics:
Avg Inference Time: 1.1835386228561402
Avg Score: 3.57
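
The averages above hide how the 1 to 5 scores are distributed and would fail on a record without a usable score. A minimal sketch of a slightly more defensive summary follows. It assumes, without confirmation from the source, that a record's `score` could be `None` (for example if a judge output were unparseable); the function name `summarize` and the demo records are made up for illustration and are not results from this notebook.

```python
import statistics
from collections import Counter


def summarize(records):
    """Summarize a list of {"time": float, "score": int or None} records."""
    # Assumption: a missing or None score means the sample was not scored.
    scored = [r["score"] for r in records if r.get("score") is not None]
    times = [r["time"] for r in records]
    return {
        "n": len(records),
        "n_scored": len(scored),
        "avg_time": statistics.fmean(times),
        "avg_score": statistics.fmean(scored) if scored else None,
        "score_counts": dict(sorted(Counter(scored).items())),
    }


# Made-up demo records, not measurements from the runs above.
demo = [
    {"time": 1.0, "score": 4},
    {"time": 2.0, "score": 2},
    {"time": 3.0, "score": None},
]
print(summarize(demo))
```

Applied to the two result files, this would be called as `summarize(json.load(f))` on each file's contents.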
