Чернышова Д.К.

# Lab overview

In this lab, we will learn the basics of working with large language models: we will analyze prompt design techniques and how they affect model output, and learn how to train models on our own data at the application level. This notebook implements five different prompting techniques for boolean question answering using LangChain and the Qwen2.5-7B-Instruct model, followed by LoRA fine-tuning to improve the model's behavior.

The places where you are required to write code or conduct experiments are marked by tasks and by "TODO" in the code.
This is a technical lab, and you may need to consult the documentation of vLLM, LangChain and Unsloth. You can also get help writing code from models such as DeepSeek (but recheck the code).

**Warning:** Do not start the lab at the last moment! It may take more time than you think, so reserve 5-6 hours for a correct implementation.  
**Note:** If you encounter problems with the lab or resources and get stuck at some point, do not hesitate to write to the course group well before the deadline.

In the prompting techniques part, you may want to use the CPU instead of the GPU if you work on the free version of Colab (though it will be very time-consuming).

In [None]:
# Installing necessary libraries
# Run the lines below to install the libraries; pip warnings or dependency errors can be ignored

%pip install --quiet transformers==4.51.2 vllm==0.8.3 accelerate==1.5.2 bitsandbytes==0.45.5 datasets==3.5.0
%pip install --quiet unsloth==2025.3.19
%pip install --quiet msgspec

# Note that after installation you need to restart the kernel


In [None]:
import os
GPU_DEVICE = os.environ.get("CUDA_VISIBLE_DEVICES", "0")  # Default to GPU 0 if not specified
os.environ["CUDA_VISIBLE_DEVICES"] = GPU_DEVICE
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

In [None]:
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, field_validator
from typing import List
import torch
import random
import numpy as np
from vllm import LLM, SamplingParams
from langchain.schema.runnable import Runnable
from langchain.schema import StrOutputParser

INFO 04-22 08:39:37 [__init__.py:239] Automatically detected platform cuda.


In [None]:
# Set seed for reproducibility
const_seed = 42
random.seed(const_seed)
np.random.seed(const_seed)
torch.manual_seed(const_seed)

<torch._C.Generator at 0x7e830e8c5c10>

Now we will load a quantized model that should fit into the memory of a T4 card. The vLLM framework is used for the sake of speed. Pay attention to the sampling parameters: if needed, you can change them at any time by passing new sampling parameters to the model.
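To build some intuition for what `temperature` and `top_p` do, here is a minimal pure-Python sketch. It is not how vLLM implements sampling (that works on tensors inside the engine); the helper names and the toy logits are made up for illustration only.

```python
import math

def apply_temperature(logits, temperature):
    """Return softmax probabilities of the logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
sharp = apply_temperature(toy_logits, 0.2)  # low temperature: near-greedy
flat = apply_temperature(toy_logits, 1.0)   # temperature 1: plain softmax
nucleus = top_p_filter(sharp, 0.95)

print("top-token probability at T=0.2:", round(sharp[0], 3))
print("top-token probability at T=1.0:", round(flat[0], 3))
print("tokens kept by top_p=0.95 at T=0.2:", sorted(nucleus))
```

A low temperature concentrates almost all probability on the best token, so with the lab's settings the nucleus filter rarely has anything left to cut; raising the temperature is what makes `top_p` matter.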

In [None]:
# Initialize vLLM model
model_name = 'Qwen/Qwen2.5-7B-Instruct-AWQ'
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    dtype="float16",
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    quantization="awq",
    max_num_seqs=256,
    disable_custom_all_reduce=True,
)

# Create sampling parameters
sampling_params = SamplingParams(
    temperature=0.2,
    top_p=0.95,
    max_tokens=400,
    stop=None
)


INFO 04-22 08:40:13 [config.py:600] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 04-22 08:40:15 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3) with config: model='Qwen/Qwen2.5-7B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=awq, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=


INFO 04-22 08:40:21 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-22 08:40:21 [cuda.py:289] Using XFormers backend.
INFO 04-22 08:40:22 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-22 08:40:22 [model_runner.py:1110] Starting to load model Qwen/Qwen2.5-7B-Instruct-AWQ...
INFO 04-22 08:40:23 [weight_utils.py:265] Using model weights format ['*.safetensors']



INFO 04-22 08:40:59 [weight_utils.py:281] Time spent downloading weights for Qwen/Qwen2.5-7B-Instruct-AWQ: 36.368823 seconds




INFO 04-22 08:41:21 [loader.py:447] Loading weights took 21.36 seconds
INFO 04-22 08:41:22 [model_runner.py:1146] Model loading took 5.2036 GiB and 58.866435 seconds
INFO 04-22 08:41:26 [worker.py:267] Memory profiling takes 3.91 seconds
INFO 04-22 08:41:26 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 04-22 08:41:26 [worker.py:267] model weights take 5.20GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is 6.60GiB.
INFO 04-22 08:41:27 [executor_base.py:112] # cuda blocks: 7725, # CPU blocks: 4681
INFO 04-22 08:41:27 [executor_base.py:117] Maximum concurrency for 4096 tokens per request: 30.18x
INFO 04-22 08:41:31 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:55<00:00,  1.60s/it]

INFO 04-22 08:42:27 [model_runner.py:1598] Graph capturing finished in 56 secs, took 0.50 GiB
INFO 04-22 08:42:27 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 65.56 seconds
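The memory figures reported in the log can be cross-checked with a few lines of arithmetic. The numbers below are copied from the log; the block size of 16 tokens is an assumption (it is not printed in the log, but it is consistent with the reported concurrency).

```python
# Values copied from the vLLM startup log
total_gpu_gib = 14.74
gpu_memory_utilization = 0.90
weights_gib = 5.20
non_torch_gib = 0.05
activation_peak_gib = 1.42
num_gpu_blocks = 7725
max_model_len = 4096

# Assumption: 16 tokens per KV-cache block (not shown in the log)
block_size_tokens = 16

# Memory vLLM is allowed to use, and what remains for the KV cache
budget_gib = total_gpu_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib

# How many full-length (4096-token) requests fit in the KV cache at once
max_concurrency = num_gpu_blocks * block_size_tokens / max_model_len

print(f"budget: {budget_gib:.2f} GiB")
print(f"KV cache: {kv_cache_gib:.2f} GiB")
print(f"max concurrency: {max_concurrency:.2f}x")
```

If the model fails to load on a smaller card, these are the terms to look at: lowering `max_model_len` shrinks the KV cache needed per request, while `gpu_memory_utilization` sets the overall budget.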





Since we will use LangChain to implement prompting, a wrapper is provided that processes batches of input prompts and returns the generated answers. LangChain basically helps to combine all the operations into a unified chain.
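The idea of a "unified chain" can be illustrated without LangChain at all. The sketch below is a toy: `Step` is a hypothetical stand-in for a runnable, and the lambda playing the model is a placeholder, not a real LLM call. It only shows how the `|` operator can wire prompt formatting, generation and parsing into one object.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a single function."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))

# Three toy stages: format the prompt, "generate", parse the raw text
toy_prompt = Step(lambda d: f"text: {d['text']}\nquestion: {d['question']}\nanswer: ")
toy_llm = Step(lambda prompt_text: " False ")  # placeholder for the real model call
toy_parser = Step(lambda raw: raw.strip().lower() == "true")

toy_chain = toy_prompt | toy_llm | toy_parser
toy_result = toy_chain.invoke({
    "text": "The Earth is the third planet from the Sun.",
    "question": "is earth the closest planet to the sun",
})
print(toy_result)
```

Each stage only needs to agree with its neighbours on the type it passes along, which is why the wrapper has to accept whatever the prompt stage emits.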

In [None]:
# Create a LangChain LLM wrapper for vLLM
class VLLMWrapper(Runnable):
    def __init__(self, llm, sampling_params):
        self.llm = llm
        self.sampling_params = sampling_params
        self.output_parser = StrOutputParser()

    def invoke(self, input, config=None):
        # Handle single input case
        if isinstance(input, dict):
            prompt = self._format_prompt(input)
            outputs = self.llm.generate([prompt], self.sampling_params)
            return self.output_parser.invoke(outputs[0].outputs[0].text)
        # Handle batch input case
        elif isinstance(input, list):
            return self.batch(input)
        else:
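            # Handle PromptValue (emitted by PromptTemplate) or plain string input;
            # str() on a PromptValue would yield "text='...'" instead of the prompt itself
            prompt = input.to_string() if hasattr(input, "to_string") else str(input)
            outputs = self.llm.generate([prompt], self.sampling_params)
            return self.output_parser.invoke(outputs[0].outputs[0].text)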

    def batch(self, *args, **kwargs):
        """Process a batch of inputs efficiently using vLLM's batch processing.

        Args:
            *args: Positional arguments, where the first argument is the list of inputs
            **kwargs: Additional arguments passed by LangChain, including:
                - return_exceptions: If True, return exceptions instead of raising them
        """
        if not args:
            raise ValueError("No inputs provided")

        inputs = args[0]
        return_exceptions = kwargs.get('return_exceptions', False)

        prompts = []
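        for item in inputs:
            if isinstance(item, dict):
                prompt = self._format_prompt(item)
            elif hasattr(item, "to_string"):
                # PromptTemplate emits a PromptValue; str() would wrap it as "text='...'"
                prompt = item.to_string()
            else:
                prompt = str(item)
            prompts.append(prompt)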

        try:
            # Use vLLM's batch processing
            outputs = self.llm.generate(prompts, self.sampling_params)
            # Prints the raw model outputs; comment this line out to silence them
            print(outputs)
            return [self.output_parser.invoke(output.outputs[0].text) for output in outputs]
        except Exception as e:
            if return_exceptions:
                return [e] * len(inputs)
            raise e

    def _format_prompt(self, input_dict):
        if isinstance(input_dict, dict):
            if "examples" in input_dict:
                # Format few-shot prompt
                examples_text = ""
                for i, ex in enumerate(input_dict["examples"]):
                    examples_text += f"Example {i+1}:\ntext: {ex['passage']}\nquestion: {ex['question']}\nanswer: {'true' if ex['answer'] else 'false'}\n\n"
                return f"{examples_text}\nNow answer this question:\ntext: {input_dict['text']}\nquestion: {input_dict['question']}\nanswer: "
            else:
                # Format regular prompt
                return f"text: {input_dict['text']}\nquestion: {input_dict['question']}\nanswer: "
        return str(input_dict)

    async def ainvoke(self, input, config=None):
        return self.invoke(input, config)

# Initialize the wrapper
llm_wrapper = VLLMWrapper(llm, sampling_params)

# Prompting techniques

We discussed several approaches in the lectures. Let's briefly recap the ones that will be used in this lab.

## Naive Prompting

***Description:*** The simplest approach that directly asks the model to answer a question with "true" or "false" based on the provided text.   
***Key Characteristics:*** 1) Minimal instructions; 2) Direct question-answer format; 3) No additional context or guidance; 4) Focuses on the model's baseline performance   
***Implementation focus:*** 1) Create a simple prompt template that clearly presents the text and question; 2) Ensure the model understands it should respond with only "true" or "false"; 3) Implement basic error handling for unexpected responses

## Few-Shot Prompting

***Description:*** Provides examples of similar questions and their correct answers to guide the model's responses.   
***Key Characteristics:*** 1) Includes multiple examples in the prompt; 2) Demonstrates the expected format and reasoning; 3) Helps the model understand the task through pattern recognition; 4) Can improve accuracy by showing the model what good answers look like    
***Implementation Focus:*** 1) Design a prompt template that incorporates multiple examples; 2) Format examples consistently to show the relationship between text, question, and answer; 3) Ensure the examples are relevant to the target question type; 4) Include a confidence score in the response to indicate certainty    
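As a rough illustration of consistent example formatting, the sketch below renders solved examples ahead of the target question. The function name is hypothetical, and it assumes each example is a dict with `passage`, `question` and `answer` keys.

```python
def build_few_shot_prompt(examples, text, question):
    """Render solved examples followed by the target item as one prompt string."""
    parts = []
    for i, ex in enumerate(examples, start=1):
        answer = "true" if ex["answer"] else "false"
        parts.append(
            f"Example {i}:\ntext: {ex['passage']}\nquestion: {ex['question']}\nanswer: {answer}\n"
        )
    # The target item uses the same layout, with the answer left open
    parts.append(f"Now answer this question:\ntext: {text}\nquestion: {question}\nanswer: ")
    return "\n".join(parts)

demo_examples = [
    {
        "passage": "The Sun is the star at the center of the Solar System.",
        "question": "is the sun a star",
        "answer": True,
    },
]
demo_prompt = build_few_shot_prompt(
    demo_examples,
    "The Earth is the third planet from the Sun.",
    "is earth the closest planet to the sun",
)
print(demo_prompt)
```

Keeping the examples and the target in the same `text` / `question` / `answer` layout is what lets the model pick up the pattern.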

## Chain-of-Thought Prompting

***Description:*** Encourages the model to break down its reasoning process step by step before arriving at a final answer.    
***Key Characteristics:*** 1) Explicitly requests step-by-step reasoning; 2) Structures the thinking process into distinct phases; 3) Makes the model's logic transparent and traceable; 4) Often improves accuracy by forcing more thorough analysis    
***Implementation Focus:*** 1) Create a prompt template that guides the model through a structured reasoning process; 2) For example, include sections for question analysis, text analysis, logical reasoning, and conclusion, or use a simpler approach that just asks the model to think step by step; 3) Ensure the model provides both reasoning steps and a final boolean answer; 4) Capture both the reasoning process and the conclusion in the response schema    
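One possible shape for such a structured prompt is sketched below. The section names come from the list above; the helper names and the `Final answer:` marker are assumptions made for this example, not part of the lab's required format.

```python
import re

COT_SECTIONS = ["Question analysis", "Text analysis", "Logical reasoning", "Conclusion"]

def build_cot_prompt(text, question):
    """Ask the model to walk through numbered reasoning sections before answering."""
    steps = "\n".join(f"{i}. {name}" for i, name in enumerate(COT_SECTIONS, start=1))
    return (
        "Work through the following steps before answering.\n"
        f"{steps}\n\n"
        f"text: {text}\nquestion: {question}\n"
        'Finish with a line of the form "Final answer: true" or "Final answer: false".'
    )

def extract_final_answer(response):
    """Return the boolean after the last 'Final answer:' marker, or None if absent."""
    matches = re.findall(r"final answer:\s*(true|false)", response, flags=re.IGNORECASE)
    if not matches:
        return None
    return matches[-1].lower() == "true"

cot_prompt = build_cot_prompt(
    "The Earth is the third planet from the Sun.",
    "is earth the closest planet to the sun",
)
print(cot_prompt)
print(extract_final_answer("1. ... 4. Mercury is closer. Final answer: False"))
```

Taking the last marker rather than the first guards against the model mentioning a tentative answer midway through its reasoning.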

## Role Prompting

***Description:*** Assigns a specific expert role to the model to leverage domain-specific knowledge and analytical approaches.   
***Key Characteristics:*** 1) Positions the model as an expert (e.g., fact-checker, analyst); 2) Provides role-specific instructions and expectations; 3) Encourages domain-specific analytical approaches; 4) Can improve performance by focusing the model's capabilities   
***Implementation Focus:*** 1) Define a clear expert role with specific expertise and responsibilities; 2) Include role-specific instructions in the prompt template; 3) Request evidence-based analysis from the text; 4) Capture both the expert analysis and supporting evidence in the response


# How are we going to work with the model?

Each prompt will contain a part describing the expected output format (so-called structured output). Note that in all prompting variants we specify the exact response format as a Pydantic schema, which is one of the common approaches to shaping LLM outputs. You are required to use structured output with a Pydantic schema in this lab.

You make an LLM call and, if the output fails validation, regenerate the answer. If you want, you can use a regular expression to parse the output, but you should still perform the schema validation.
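A minimal sketch of the regular-expression route is shown below. It uses only the standard library, so `validate_answer` is a stand-in for the schema validation step (in the lab that role is played by the Pydantic model); the function names and the sample output string are made up for illustration.

```python
import json
import re

def extract_json_object(raw):
    """Parse the span from the first '{' to the last '}' in raw, or raise ValueError."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError as err:
        raise ValueError(f"invalid JSON: {err}") from err

def validate_answer(payload):
    """Stand-in for schema validation: require a boolean 'answer' field."""
    if not isinstance(payload.get("answer"), bool):
        raise ValueError("'answer' must be a boolean")
    return payload

# Models often wrap the JSON in extra chatter; the regex strips it away
raw_output = 'Sure! Here is the result: {"answer": false} Hope this helps.'
parsed = validate_answer(extract_json_object(raw_output))
print(parsed)
```

Extraction and validation are kept separate on purpose: the regex only locates the JSON, and the validation step decides whether the content is acceptable or a regeneration is needed.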

**Data**  
In our case we will work with the following data: a text and a question about the text that expects a True or False answer. The task of the LLM is to generate the answer in JSON format according to the provided schema.


![image.png](attachment:a0324e4c-0d85-4e8c-b40f-98c7c287166e.png)

In [None]:
# Example text and question
example_texts = [
    "As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.",
    "The Earth is the third planet from the Sun and the only astronomical object known to harbor life."
]
example_questions = [
    "is elder scrolls online the same as skyrim",
    "is earth the closest planet to the sun"
]

In [None]:
# Base Pydantic model for all responses
class BaseAnswerResponse(BaseModel):
    answer: bool = Field(..., description="Answer to the question: True or False")

    @field_validator('answer')
    @classmethod
    def validate_answer(cls, v):
        if not isinstance(v, bool):
            raise ValueError("Answer must be a boolean value (True or False)")
        return v

**Task 0**: (Note: come back to this part after implementing the schemas.) Implement initialization of a default schema instance for when the retries fail. Setting the number of retries to 1 is also fine. Do not spend too much time trying to make the model generate answers that are valid according to the schema.
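The retry-then-default pattern can be sketched independently of LangChain. Everything here is illustrative: `generate_with_retries` and `require_bool_answer` are hypothetical helpers, plain dicts stand in for schema instances, and catching `ValueError` is an assumption of this toy example.

```python
def generate_with_retries(generate, validate, default, max_retries=1):
    """Call generate() up to max_retries times; return default if no attempt validates."""
    for attempt in range(max_retries):
        try:
            return validate(generate())
        except ValueError as err:
            print(f"Attempt {attempt + 1}/{max_retries} failed: {err}")
    return default

def require_bool_answer(payload):
    """Stand-in validator: the 'answer' field must be a real boolean."""
    if not isinstance(payload.get("answer"), bool):
        raise ValueError("'answer' must be a boolean")
    return payload

# First reply is invalid, second one is fine: the retry recovers
replies = iter([{"answer": "maybe"}, {"answer": True}])
recovered = generate_with_retries(
    lambda: next(replies), require_bool_answer, default={"answer": False}, max_retries=2
)

# Every reply is invalid: the default is returned instead
fallback = generate_with_retries(
    lambda: {"answer": "maybe"}, require_bool_answer, default={"answer": False}, max_retries=1
)
print(recovered, fallback)
```

For schemas with extra required fields, the default has to supply a value for each of them, otherwise constructing the default itself fails validation.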

In [None]:
# Function to run a chain and handle potential errors
# Note: you can modify code if you need
def run_chain(chain, text, question, examples=None, max_retries=3, batch_size=1):
    """Run a chain and handle potential errors.

    Args:
        chain: The LangChain chain to run
        text: Input text or list of texts
        question: Input question or list of questions
        examples: Optional examples for few-shot prompting
        max_retries: Maximum number of retry attempts
        batch_size: Number of items to process in a batch (default: 1 for single item)
    """
    # Add OutputType extraction logic
    if hasattr(chain, "OutputType"):
        OutputType = chain.OutputType
    elif hasattr(chain, "config") and "OutputType" in chain.config:
        OutputType = chain.config["OutputType"]
    else:
        OutputType = None

    # Convert single inputs to lists for batch processing
    if not isinstance(text, list):
        text = [text]
    if not isinstance(question, list):
        question = [question]

    # Ensure text and question lists have the same length
    if len(text) != len(question):
        raise ValueError("Text and question lists must have the same length")

    # Initialize results list with None for all items
    results = [None] * len(text)

    # Process in batches
    for i in range(0, len(text), batch_size):
        batch_text = text[i:i + batch_size]
        batch_question = question[i:i + batch_size]
        batch_indices = list(range(i, min(i + batch_size, len(text))))

        # Track which items need to be processed
        items_to_process = list(zip(batch_indices, batch_text, batch_question))

        for attempt in range(max_retries):
            if not items_to_process:
                break

            try:
                # Prepare batch inputs for remaining items
                batch_inputs = [{"text": t, "question": q, "examples": examples if examples else None}
                              for _, t, q in items_to_process]

                # Process batch
                batch_results = chain.batch(batch_inputs)

                # Update results for successful items
                for (idx, _, _), result in zip(items_to_process, batch_results):
                    if not isinstance(result, Exception):
                        results[idx] = result

                # Remove successfully processed items
                items_to_process = [(idx, t, q) for (idx, t, q), result in zip(items_to_process, batch_results)
                                  if isinstance(result, Exception)]

            except Exception as e:
                print(f"Attempt {attempt+1}/{max_retries} failed: {e}")
                if attempt == max_retries - 1:
                    # On the last attempt, return default responses for remaining items
                    for idx, _, _ in items_to_process:
                        # Create appropriate default response based on chain type
                        # TODO: implement pydantic schemes for other types of prompting, so first implement the pipelines below, then come back here
                        if OutputType is not None:
                            try:
                                results[idx] = OutputType(answer=False)
                            except Exception:
                                results[idx] = None
                        else:
                            results[idx] = None



    # Return single result if input was single item
    if len(text) == 1:
        return results[0]
    return results

## 1. Naive Prompting

The basic way of working with the model: pass a simple query, and use the output parser both to set the format instructions in the prompt template and to check the result. An example is provided for you; feel free to modify it if you want.

In [None]:
# Pydantic model for naive prompting
class NaiveAnswerResponse(BaseAnswerResponse):
    pass

def create_naive_chain():
    """Create a LangChain chain for naive prompting."""
    # Define the prompt template
    template = """You are given a text and question. Answer only "True" or "False".
text: {text}
question: {question}

Follow the output format:
{format_instructions}
"""

    # Create output parser
    parser = PydanticOutputParser(pydantic_object=NaiveAnswerResponse)

    prompt = PromptTemplate(
        input_variables=["text", "question"],
        template=template,
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # Create the chain using the new approach
    chain = prompt | llm_wrapper | parser

    return chain

naive_chain = create_naive_chain()

# Run chains with batch processing
print("Naive Prompting (Batch):")
results = run_chain(naive_chain, example_texts, example_questions, batch_size=2, max_retries=2)
for i, result in enumerate(results):
    print(f"Result {i+1}:", result)

Naive Prompting (Batch):


Processed prompts: 100%|██████████| 2/2 [00:01<00:00,  1.44it/s, est. speed input: 366.27 toks/s, output: 18.71 toks/s]


[RequestOutput(request_id=0, prompt='text=\'You are given a text and question. Answer only "True" or "False".\\ntext: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.\\nquestion: is elder scrolls online the same as skyrim  \\n\\nFollow the output format:\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\\n\\nHere is the output schema:\\n```\\n{"properties": {"answer": {"description": "Answer 

Processed prompts: 100%|██████████| 2/2 [00:01<00:00,  1.34it/s, est. speed input: 341.50 toks/s, output: 25.50 toks/s]

[RequestOutput(request_id=2, prompt='text=\'You are given a text and question. Answer only "True" or "False".\\ntext: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.\\nquestion: is elder scrolls online the same as skyrim  \\n\\nFollow the output format:\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\\n\\nHere is the output schema:\\n```\\n{"properties": {"answer": {"description": "Answer 




In [None]:
# 2. Few-Shot Prompting

# Pydantic model for few-shot prompting
class FewShotAnswerResponse(BaseAnswerResponse):
    # TODO: Implement schema
    confidence: float = Field(..., ge=0.0, le=1.0, description="Confidence level of the answer between 0 and 1")

    @field_validator('confidence')
    @classmethod
    def validate_confidence(cls, v):
        if not isinstance(v, float):
            raise ValueError("Confidence must be a float value between 0 and 1")
        if not 0.0 <= v <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
        return v


def create_few_shot_chain(examples):
    """Create a LangChain chain for few-shot prompting."""
    # TODO: create few-shot chain
    # Ⅰ.Create an output parser
    parser = PydanticOutputParser(pydantic_object=FewShotAnswerResponse)

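    # Ⅱ.Construct Prompt template, rendering the examples as text so they actually reach the model
    examples_text = "".join(
        f"Example {i + 1}:\ntext: {ex['passage']}\nquestion: {ex['question']}\n"
        f"answer: {'true' if ex['answer'] else 'false'}\n\n"
        for i, ex in enumerate(examples)
    )
    template = """You are given a text and question. Answer only "True" or "False".
{examples_text}Now answer this question:
text: {text}
question: {question}

Follow the output format:
{format_instructions}
"""
    # Ⅲ.Create PromptTemplate (the examples are bound as a partial variable, so do not list them in input_variables)
    prompt = PromptTemplate(
        input_variables=["text", "question"],
        template=template,
        partial_variables={
            "format_instructions": parser.get_format_instructions(),
            "examples_text": examples_text,
        },
    )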

    # Ⅳ.Build chain: prompt → llm_wrapper → parser
    chain = prompt | llm_wrapper | parser

    # Ⅴ.Add OutputType information for fallback judgment
    chain = chain.with_config({"OutputType": FewShotAnswerResponse})

    return chain

# Example few-shot examples
few_shot_examples = [
    {
        "passage": "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.",
        "question": "is earth the closest planet to the sun",
        "answer": False
    },
    {
        "passage": "The Sun is the star at the center of the Solar System. It is a nearly perfect sphere of hot plasma.",
        "question": "is the sun a star",
        "answer": True
    }
]
few_shot_chain = create_few_shot_chain(few_shot_examples)

print("\nFew-Shot Prompting (Batch):")
results = run_chain(few_shot_chain, example_texts, example_questions, examples=few_shot_examples, batch_size=2)
for i, result in enumerate(results):
    print(f"Result {i+1}:", result)


Few-Shot Prompting (Batch):


Processed prompts: 100%|██████████| 2/2 [00:18<00:00,  9.04s/it, est. speed input: 33.90 toks/s, output: 23.72 toks/s]

[RequestOutput(request_id=4, prompt='text=\'You are given a text and question. Answer only "True" or "False".\\ntext: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.\\nquestion: is elder scrolls online the same as skyrim  \\n\\nFollow the output format:\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\\n\\nHere is the output schema:\\n```\\n{"properties": {"answer": {"description": "Answer 




In [None]:
# 3. Chain-of-Thought Prompting

# Pydantic model for chain-of-thought prompting
class ChainOfThoughtAnswerResponse(BaseAnswerResponse):
    # TODO: Implement schema
    reasoning: str = Field(..., description="The reasoning or explanation that leads to the final answer")

    @field_validator('reasoning')
    @classmethod
    def validate_reasoning(cls, v):
        if not isinstance(v, str):
            raise ValueError("Reasoning must be a string")
        return v

def create_chain_of_thought_chain():
    """Create a LangChain chain for chain-of-thought prompting."""
    # TODO: create CoT chain

    parser = PydanticOutputParser(pydantic_object=ChainOfThoughtAnswerResponse)

    template = """You are a helpful assistant. Given a text and a question, reason step by step and then answer "True" or "False".

text: {text}
question: {question}

Think through the question and explain your reasoning first. Then provide the final answer.

Output format:
{format_instructions}
"""

    prompt = PromptTemplate(
        input_variables=["text", "question"],
        template=template,
        partial_variables={"format_instructions": parser.get_format_instructions()}
    )

    chain = prompt | llm_wrapper | parser

    chain = chain.with_config({"OutputType": ChainOfThoughtAnswerResponse})

    return chain


chain_of_thought_chain = create_chain_of_thought_chain()

print("\nChain-of-Thought Prompting (Batch):")
results = run_chain(chain_of_thought_chain, example_texts, example_questions, batch_size=2)
for i, result in enumerate(results):
    print(f"Result {i+1}:", result)


Chain-of-Thought Prompting (Batch):


Processed prompts: 100%|██████████| 2/2 [00:10<00:00,  5.35s/it, est. speed input: 58.99 toks/s, output: 31.41 toks/s]


[RequestOutput(request_id=6, prompt='text=\'You are a helpful assistant. Given a text and a question, reason step by step and then answer "True" or "False".\\n\\ntext: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.\\nquestion: is elder scrolls online the same as skyrim\\n\\nThink through the question and explain your reasoning first. Then provide the final answer.\\n\\nOutput format:\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"fo

Processed prompts: 100%|██████████| 2/2 [00:06<00:00,  3.45s/it, est. speed input: 91.37 toks/s, output: 32.15 toks/s]

[RequestOutput(request_id=8, prompt='text=\'You are a helpful assistant. Given a text and a question, reason step by step and then answer "True" or "False".\\n\\ntext: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.\\nquestion: is elder scrolls online the same as skyrim\\n\\nThink through the question and explain your reasoning first. Then provide the final answer.\\n\\nOutput format:\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"fo




In [None]:
# 4. Role Prompting

# Pydantic model for role prompting
class RoleAnswerResponse(BaseAnswerResponse):
    # TODO: Implement schema
    evidence: str = Field(..., description="Expert's evidence or justification for the answer")

    @field_validator('evidence')
    @classmethod
    def validate_evidence(cls, v):
        if not isinstance(v, str):
            raise ValueError("Evidence must be a string")
        return v

def create_role_chain():
    """Create a LangChain chain for role prompting."""
    # TODO: create role chain

    parser = PydanticOutputParser(pydantic_object=RoleAnswerResponse)

    template = """You are a fact-checking expert. Your job is to evaluate whether a statement is true or false based on the given passage.

text: {text}
question: {question}

As a professional, provide a boolean answer and a brief justification based only on the content of the passage.

Follow the output format:
{format_instructions}
"""

    prompt = PromptTemplate(
        input_variables=["text", "question"],
        template=template,
        partial_variables={"format_instructions": parser.get_format_instructions()}
    )

    chain = prompt | llm_wrapper | parser

    chain = chain.with_config({"OutputType": RoleAnswerResponse})

    return chain


role_chain = create_role_chain()

print("\nRole Prompting (Batch):")
results = run_chain(role_chain, example_texts, example_questions, batch_size=2)
for i, result in enumerate(results):
    print(f"Result {i+1}:", result)


Role Prompting (Batch):


Processed prompts: 100%|██████████| 2/2 [00:09<00:00,  4.86s/it, est. speed input: 65.79 toks/s, output: 31.09 toks/s]


[RequestOutput(request_id=10, prompt='text=\'You are a fact-checking expert. Your job is to evaluate whether a statement is true or false based on the given passage.\\n    \\ntext: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.\\nquestion: is elder scrolls online the same as skyrim\\n\\nAs a professional, provide a boolean answer and a brief justification based only on the content of the passage.\\n\\nFollow the output format:\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance o

Processed prompts: 100%|██████████| 2/2 [00:10<00:00,  5.23s/it, est. speed input: 61.07 toks/s, output: 28.96 toks/s]


[RequestOutput(request_id=12, prompt='text=\'You are a fact-checking expert. Your job is to evaluate whether a statement is true or false based on the given passage.\\n    \\ntext: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.\\nquestion: is elder scrolls online the same as skyrim\\n\\nAs a professional, provide a boolean answer and a brief justification based only on the content of the passage.\\n\\nFollow the output format:\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance o

Processed prompts: 100%|██████████| 2/2 [00:04<00:00,  2.23s/it, est. speed input: 143.22 toks/s, output: 39.89 toks/s]

[RequestOutput(request_id=14, prompt='text=\'You are a fact-checking expert. Your job is to evaluate whether a statement is true or false based on the given passage.\\n    \\ntext: As with other games in The Elder Scrolls series, the game is set on the continent of Tamriel. The events of the game occur a millennium before those of The Elder Scrolls V: Skyrim and around 800 years before The Elder Scrolls III: Morrowind and The Elder Scrolls IV: Oblivion.\\nquestion: is elder scrolls online the same as skyrim\\n\\nAs a professional, provide a boolean answer and a brief justification based only on the content of the passage.\\n\\nFollow the output format:\\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\\n\\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance o




In [None]:
# Note: you can also add a chain with another prompting technique or combine the existing ones

# BoolQ Dataset

We will use the BoolQ dataset (https://huggingface.co/datasets/google/boolq) for our experiments. Look carefully at the data: there are train and validation splits, and each example has question / answer / passage columns.
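
Before sampling, it can help to check how the `answer` labels are distributed. The sketch below counts them on a few hand-written rows that only mimic the question / answer / passage layout (the rows are illustrative, not taken from BoolQ, and the helper name `label_counts` is hypothetical).

```python
from collections import Counter

# Toy rows mimicking the BoolQ columns (illustrative, not real dataset entries)
toy_rows = [
    {"question": "is the sky blue", "passage": "The sky appears blue in daylight.", "answer": True},
    {"question": "is ice hot", "passage": "Ice is frozen water.", "answer": False},
    {"question": "do birds fly", "passage": "Most birds can fly.", "answer": True},
]

def label_counts(rows):
    """Count how many examples carry each boolean answer."""
    return Counter(bool(row["answer"]) for row in rows)

counts = label_counts(toy_rows)
print(counts[True], counts[False])
```

The same helper should work on `df["train"]` or `df["validation"]` once the dataset is loaded, since iterating over a split yields dictionaries with these keys.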

In [None]:
from datasets import load_dataset

# Load dataset
df = load_dataset("google/boolq")

# Create a balanced test dataset
def sample_balanced_dataset(df_sample, test_size=20):
    """
    Create a balanced dataset with equal number of true and false examples.

    Args:
        df_sample: Dataset to sample from
        test_size: Total number of examples to sample

    Returns:
        List of examples with passage, question, and answer
    """
    # Split examples into true and false
    true_examples = []
    false_examples = []

    for i, example in enumerate(df_sample):
        if example['answer']:
            true_examples.append((i, example))
        else:
            false_examples.append((i, example))

    # Determine number of examples for each class
    examples_per_class = test_size // 2

    # Select random examples from each class
    import random
    random.seed(42)
    selected_true = random.sample(true_examples, min(examples_per_class, len(true_examples)))
    selected_false = random.sample(false_examples, min(examples_per_class, len(false_examples)))

    # Combine selected examples
    selected_examples = selected_true + selected_false
    random.shuffle(selected_examples)

    # Create test dataset
    return [
        {
            "passage": example['passage'],
            "question": example['question'],
            "answer": example['answer']
        }
        for _, example in selected_examples
    ]

# Create test dataset
test_dataset = sample_balanced_dataset(df["validation"], test_size=100)

README.md:   0%|          | 0.00/6.57k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/3.69M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.26M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9427 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3270 [00:00<?, ? examples/s]

# Comparing prompting on BoolQ

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from typing import List, Dict, Any

def compute_metrics(preds, labels):
    """
    Compute binary classification metrics from boolean predictions and labels.
    """
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "f1": f1_score(labels, preds)
    }

def compare_prompting_techniques(dataset: List[Dict[str, Any]], batch_size: int = 4,
                                few_shot_examples: List[Dict[str, Any]] = None) -> Dict[str, Dict[str, float]]:
    """
    Compares the effectiveness of various prompting techniques using LangChain.

    Args:
        dataset: List of examples with 'passage', 'question', and 'answer' fields
        batch_size: Batch size for processing
        few_shot_examples: Examples to use for few-shot prompting

    Returns:
        dict: Comparison results
    """
    # Extract true answers
    true_answers = [example['answer'] for example in dataset]

    # Create chains
    chains = {
        "Naive Prompting": create_naive_chain(),
        "Few-Shot Prompting": create_few_shot_chain(few_shot_examples or []),
        "Chain-of-Thought Prompting": create_chain_of_thought_chain(),
        "Role Prompting": create_role_chain(),
        # "Formatted Prompting": create_formatted_chain()
    }

    results = {}

    for name, chain in chains.items():
        print(f"\nEvaluating: {name}")

        # Run LLM predictions
        predictions = run_chain(
            chain,
            text=[ex["passage"] for ex in dataset],
            question=[ex["question"] for ex in dataset],
            examples=few_shot_examples if "Few-Shot" in name else None,
            batch_size=batch_size,
            max_retries=2
        )

        # Extract only the answer field; responses that failed to parse are counted as False
        preds = [res.answer if isinstance(res, BaseAnswerResponse) else False for res in predictions]

        # Compute metrics
        metrics = compute_metrics(preds, true_answers)

        results[name] = metrics

    return results

metrics_result = compare_prompting_techniques(test_dataset, batch_size=4, few_shot_examples=few_shot_examples)
for technique, metric in metrics_result.items():
    print(f"\n{technique}")
    for k, v in metric.items():
        print(f"{k}: {v:.4f}")
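
Because the evaluation loop above maps every response that is not a `BaseAnswerResponse` to `False`, parsing failures lower recall rather than showing up as a separate number. The pure-Python sketch below reproduces the four metrics by hand on a toy example and counts unparsed responses separately. Using `None` as a stand-in for a failed parse and the helper name `binary_metrics` are assumptions made for illustration.

```python
def binary_metrics(preds, labels):
    """Accuracy, precision, recall and F1 with True as the positive class."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    correct = sum(p == l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "precision": precision,
            "recall": recall, "f1": f1}

labels = [True, True, False, False]
# Raw chain outputs; None stands for a response that failed to parse (assumption)
raw = [True, None, False, True]
# Same fallback as in the evaluation loop: unparsed responses count as False
preds = [r if r is not None else False for r in raw]

metrics = binary_metrics(preds, labels)
failed = sum(r is None for r in raw)
print(metrics, "unparsed:", failed)
```

Reporting the number of unparsed responses next to the metrics makes it easier to tell a technique that answers wrongly from one that mostly fails to follow the output format.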

# Fine-tuning

Note: restart the kernel before starting this fine-tuning section. Do not forget to save all the results you obtained when comparing prompting techniques on the BoolQ dataset.  

Here we will fine-tune the model with Unsloth on 1000 training examples from the BoolQ dataset.
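
The LoRA configuration used in this section sets `r=16` and `lora_alpha=32`. As a rough illustration of what these numbers mean, the sketch below counts the parameters LoRA adds to a single weight matrix and computes the standard `lora_alpha / r` scaling of the low-rank update. The 4096 x 4096 matrix size is a placeholder chosen for the arithmetic, not the actual projection size of the model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors: A with shape (r, d_in)
    and B with shape (d_out, r).
    """
    return r * d_in + d_out * r

# Illustrative square projection; 4096 is a placeholder, not the model's real size
d_in = d_out = 4096
r, lora_alpha = 16, 32

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"full matrix: {full:,} params")
print(f"LoRA adds:   {added:,} params ({added / full:.2%} of the matrix)")
print(f"update scaling alpha / r = {scaling}")
```

Raising `r` increases the number of trainable parameters linearly, which is one of the parameters you may want to vary in your experiments.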

In [None]:
import os
import torch
import random
import numpy as np
import unsloth
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import wandb
from tqdm import tqdm

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-22 19:23:34 [__init__.py:239] Automatically detected platform cuda.


In [None]:
# Set random seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

In [None]:
# Format the prompt for training
def format_prompt(example):
    """Format a single example into a prompt."""
    # Do not forget that we need to train the model to produce structured output with the field "answer"
    prompt = f"""You are a helpful assistant. Given a passage and a question, answer only "True" or "False" as a JSON object.

Passage: {example['passage']}
Question: {example['question']}

Respond with a JSON object containing the field "answer".
"""
    response = f'{{"answer": {str(example["answer"]).lower()}}}'
    return {"prompt": prompt, "response": response}
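
Since the model is trained to emit this target string, it is worth confirming that the string is valid JSON with a lowercase boolean. The sketch below rebuilds the response with the same expression as `format_prompt` and parses it with the standard `json` module; the helper name `format_response` is hypothetical.

```python
import json

def format_response(answer):
    """Build the target string the same way format_prompt does."""
    return f'{{"answer": {str(answer).lower()}}}'

for value in (True, False):
    target = format_response(value)
    parsed = json.loads(target)  # raises ValueError if the string is not valid JSON
    assert parsed == {"answer": value}
    print(target)
```

Without the `.lower()` call, Python would produce `True` / `False`, which `json.loads` rejects.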

In [None]:
# Set seeds
set_seed(42)

# Initialize wandb (optional)
wandb.init(project="qwen-boolq-finetuning", name="qwen-qlora-boolq")

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mnekov[0m ([33mnekov-itmo-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
def sample_balanced_dataset(df_sample, test_size=100):
    true_examples = []
    false_examples = []
    for example in df_sample:
        if example['answer']:
            true_examples.append(example)
        else:
            false_examples.append(example)

    examples_per_class = test_size // 2
    import random
    random.seed(42)
    selected_true = random.sample(true_examples, min(examples_per_class, len(true_examples)))
    selected_false = random.sample(false_examples, min(examples_per_class, len(false_examples)))
    selected_examples = selected_true + selected_false
    random.shuffle(selected_examples)
    return selected_examples

In [None]:
df = load_dataset("google/boolq")
train_dataset = sample_balanced_dataset(df["train"], test_size=1000)
formatted_train_data = [format_prompt(example) for example in train_dataset]

In [None]:
local_model_path = "/home/neko/ArchNN/qwen25-4bit"

In [None]:
# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=local_model_path,
        max_seq_length=512,
        max_memory={
            0: "7GiB",
            "cpu": "60GiB"},
        dtype=torch.bfloat16,
        device_map={"": 0},
    )

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.3.
   \\   /|    NVIDIA GeForce RTX 3060 Ti. Num GPUs = 1. Max memory: 8.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


RuntimeError: CUDA driver error: out of memory

In [None]:
# Configure LoRA. Note: you may need to change the parameters
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

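# Tokenize dataset
print("Tokenizing dataset...")
def tokenize_function(examples):
    # The tokenizer expects strings, so join each prompt with its target response
    texts = [ex["prompt"] + ex["response"] + tokenizer.eos_token for ex in examples]
    return tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=512,  # match max_seq_length used when loading the model
        return_tensors="pt"
    )

tokenized_train_data = tokenize_function(formatted_train_data)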