# Homework 4: Prompting Language Models

In this assignment, we prompt pre-trained large language models (LLMs) to judge the linguistic acceptability of Russian sentences from the RuCoLA dataset.

**Tasks:**
1. Load the RuCoLA data (in_domain_dev.csv)
2. Use the Qwen2.5-1.5B-Instruct model with various prompts
3. Experiment with prompts in Russian and English
4. Add a system prompt and repeat the experiments
5. Compare with the base model (Qwen2.5-1.5B, without instruction tuning)
6. Draw conclusions

## 1. Installation and Library Imports

In [4]:
# Install required libraries
!pip install transformers torch pandas scikit-learn tqdm accelerate -q

In [5]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from tqdm import tqdm
import re
import warnings
warnings.filterwarnings('ignore')

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Using device: cuda
GPU: NVIDIA GeForce RTX 4060 Ti
Memory: 17.18 GB


## 2. Loading RuCoLA Data

In [6]:
# Load data
data_url = "https://raw.githubusercontent.com/RussianNLP/RuCoLA/main/data/in_domain_dev.csv"
df = pd.read_csv(data_url)

print(f"Dataset size: {len(df)}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['acceptable'].value_counts())
print(f"\nData examples:")
df.head(10)

Dataset size: 983

Columns: ['id', 'sentence', 'acceptable', 'error_type', 'detailed_source']

Label distribution:
acceptable
1    733
0    250
Name: count, dtype: int64

Data examples:


|   | id | sentence | acceptable | error_type | detailed_source |
|---|----|----------|------------|------------|-----------------|
| 0 | 0 | Иван вчера не позвонил. | 1 | 0 | Paducheva2013 |
| 1 | 1 | У многих туристов, кто посещают Кемер весной, ... | 0 | Syntax | USE8 |
| 2 | 2 | Лесные запахи набегали волнами; в них смешалос... | 1 | 0 | USE5 |
| 3 | 3 | Вчера президент имел неофициальную беседу с ан... | 1 | 0 | Seliverstova |
| 4 | 4 | Коллега так и не признал вину за катастрофу пе... | 1 | 0 | Testelets |
| 5 | 5 | Я говорил с ним только ради Вас. | 1 | 0 | Seliverstova |
| 6 | 6 | Этот игрок был куплен «Реалом», чтобы он играл... | 1 | 0 | Testelets |
| 7 | 7 | Ивану удалось попасть на концерт Макаревича. | 1 | 0 | Paducheva2013 |
| 8 | 8 | Ты посылал ей приглашение на свадьбу? | 1 | 0 | Paducheva2010 |
| 9 | 9 | После счастливого конца Тюлин предложил зайти ... | 1 | 0 | Testelets |


In [7]:
# Examples of acceptable and unacceptable sentences
print("Examples of ACCEPTABLE sentences (acceptable=1):")
for i, row in df[df['acceptable'] == 1].head(3).iterrows():
    print(f"  - {row['sentence']}")

print("\nExamples of UNACCEPTABLE sentences (acceptable=0):")
for i, row in df[df['acceptable'] == 0].head(3).iterrows():
    print(f"  - {row['sentence']}")

Examples of ACCEPTABLE sentences (acceptable=1):
  - Иван вчера не позвонил.
  - Лесные запахи набегали волнами; в них смешалось дыхание можжевельника, вереска, брусники.
  - Вчера президент имел неофициальную беседу с английским послом.

Examples of UNACCEPTABLE sentences (acceptable=0):
  - У многих туристов, кто посещают Кемер весной, есть шанс застать снег на вершине горы Тахталы и даже сочетать пляжный отдых с горнолыжным.
  - Вчера в два часа магазин закрыт.
  - А ты ехай прямо к директору театров, князю Гагарину.
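
Because the classes are imbalanced (733 acceptable vs. 250 unacceptable), it may help to keep a trivial baseline in mind when reading the accuracies later on. The sketch below, which is not part of the original notebook, computes the accuracy and macro-F1 of a hypothetical classifier that always answers 1. The helper name `constant_baseline` is an assumption introduced here for illustration.

```python
def constant_baseline(n_pos: int, n_neg: int) -> dict:
    """Metrics of a classifier that always predicts the positive class (1)."""
    total = n_pos + n_neg
    accuracy = n_pos / total
    # Class 1: precision = n_pos / total, recall = 1.0
    precision_pos = n_pos / total
    f1_pos = 2 * precision_pos * 1.0 / (precision_pos + 1.0)
    # Class 0 is never predicted, so its F1 is taken to be 0
    f1_neg = 0.0
    return {"accuracy": accuracy, "f1_macro": (f1_pos + f1_neg) / 2}


# Label counts taken from the value_counts() output above
baseline = constant_baseline(n_pos=733, n_neg=250)
print(f"Always-1 accuracy:  {baseline['accuracy']:.4f}")
print(f"Always-1 macro-F1: {baseline['f1_macro']:.4f}")
```

On the full dev set, always answering 1 gives an accuracy of about 0.746, so a prompt has to beat that figure (or the corresponding figure on the evaluated subset) before its accuracy says much about the model.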


## 3. Loading Qwen2.5-1.5B-Instruct Model

In [8]:
def load_model(model_name, is_instruct=True):
    """Load model and tokenizer.

    `is_instruct` is kept for readability at the call site; loading is identical for both variants.
    """
    print(f"Loading model: {model_name}")
    
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        device_map="auto" if device == "cuda" else None,
        trust_remote_code=True
    )
    
    if device == "cpu":
        model = model.to(device)
    
    model.eval()
    print(f"Model loaded on {device}")
    return model, tokenizer

# Load Instruct model
model_instruct, tokenizer_instruct = load_model("Qwen/Qwen2.5-1.5B-Instruct")

Loading model: Qwen/Qwen2.5-1.5B-Instruct


`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded on cuda


## 4. Post-processing Function for Model Responses

In [9]:
def extract_label(response, verbose=False):
    """
    Extract label from model response.
    
    Strategy:
    1. Look for explicit numeric answers (0 or 1)
    2. Look for keywords indicating acceptability
    3. Look for keywords indicating unacceptability
    4. Return -1 by default (could not determine)
    """
    response_lower = response.lower().strip()
    
    # Patterns for numeric answers
    # Look at the beginning of the response
    if response_lower.startswith('1') or response_lower.startswith('«1»'):
        return 1
    if response_lower.startswith('0') or response_lower.startswith('«0»'):
        return 0
    
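    # Look for patterns like "answer: 1" or "label: 0".
    # Caveat: the positive stems below also match inside their negations
    # ("неприемлем", "unacceptable", "incorrect"), and patterns_1 is tested first,
    # so ambiguous responses lean toward 1.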
    patterns_1 = [
        r'\b(?:ответ|answer|label|метка|результат|result)[:\s]*1\b',
        r'\b1\b.*(?:приемлем|acceptable|correct|верн|правильн)',
        r'(?:приемлем|acceptable|correct|верн|правильн).*\b1\b',
    ]
    patterns_0 = [
        r'\b(?:ответ|answer|label|метка|результат|result)[:\s]*0\b',
        r'\b0\b.*(?:неприемлем|unacceptable|incorrect|невер|неправильн)',
        r'(?:неприемлем|unacceptable|incorrect|невер|неправильн).*\b0\b',
    ]
    
    for pattern in patterns_1:
        if re.search(pattern, response_lower):
            return 1
    for pattern in patterns_0:
        if re.search(pattern, response_lower):
            return 0
    
    # Keywords for acceptability (check at the beginning of response)
    acceptable_keywords = [
        'приемлемо', 'приемлем', 'acceptable', 'correct', 'верно', 'верн',
        'правильно', 'правильн', 'грамматически верно', 'grammatically correct',
        'да,', 'да.', 'yes', 'корректно', 'корректн', 'допустим', 'natural',
        'естественно', 'нормально', 'хорошо сформулировано'
    ]
    
    # Keywords for unacceptability
    unacceptable_keywords = [
        'неприемлемо', 'неприемлем', 'unacceptable', 'incorrect', 'неверно', 
        'неправильно', 'ошибка', 'error', 'wrong', 'некорректно', 'недопустим',
        'нет,', 'нет.', 'no,', 'no.', 'unnatural', 'неестественно', 'странно',
        'не является приемлемым', 'is not acceptable', 'грамматическая ошибка'
    ]
    
    # Check first 100 characters to determine the answer
    first_part = response_lower[:100]
    
    # Check unacceptability first: the negative keywords contain the positive ones
    # as substrings ("unacceptable" contains "acceptable", "неприемлемо" contains "приемлемо")
    for keyword in unacceptable_keywords:
        if keyword in first_part:
            return 0
    
    for keyword in acceptable_keywords:
        if keyword in first_part:
            return 1
    
    # Check the entire response
    for keyword in unacceptable_keywords:
        if keyword in response_lower:
            return 0
    
    for keyword in acceptable_keywords:
        if keyword in response_lower:
            return 1
    
    if verbose:
        print(f"Could not determine label for response: {response[:100]}...")
    
    return -1  # Could not determine


# Test the post-processing function
test_responses = [
    "1",
    "0",
    "Answer: 1",
    "The sentence is acceptable.",
    "This sentence is unacceptable.",
    "Yes, this is a grammatically correct sentence.",
    "No, there is an error in the sentence.",
    "Unacceptable. The word order is violated."
]

print("Testing extract_label function:")
for resp in test_responses:
    label = extract_label(resp)
    print(f"  '{resp[:50]}...' -> {label}")

Testing extract_label function:
  '1...' -> 1
  '0...' -> 0
  'Answer: 1...' -> 1
  'The sentence is acceptable....' -> 1
  'This sentence is unacceptable....' -> 0
  'Yes, this is a grammatically correct sentence....' -> 1
  'No, there is an error in the sentence....' -> 0
  'Unacceptable. The word order is violated....' -> 0
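
The keyword matching above relies on substring checks, so a response such as "There are no errors in the sentence" would hit the `error` keyword and be labelled 0. A stricter alternative, sketched below under the assumption that the model states its verdict as a standalone digit, is to read only isolated `0`/`1` characters. The function name `extract_digit_label` and its `which` parameter are hypothetical and not part of the notebook.

```python
import re

# A standalone 0 or 1: not adjacent to another digit (so "10" or "2021" is ignored)
_DIGIT_LABEL = re.compile(r"(?<!\d)[01](?!\d)")


def extract_digit_label(response: str, which: str = "first") -> int:
    """Return the first or last standalone 0/1 in the response, or -1 if none."""
    matches = _DIGIT_LABEL.findall(response)
    if not matches:
        return -1
    return int(matches[0] if which == "first" else matches[-1])


# Direct answers: the first digit is the verdict
print(extract_digit_label("1"))
# Explanation-first answers: the verdict comes last
print(extract_digit_label("There are no errors. Final answer: 1", which="last"))
# No digit at all: undefined
print(extract_digit_label("The sentence is acceptable."))
```

Taking the last match suits prompts that ask for an explanation first; for completion-style outputs that keep generating further examples, the first match is the safer choice.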


## 5. Generation and Evaluation Functions

In [10]:
def generate_response(model, tokenizer, prompt, system_prompt=None, max_new_tokens=50):
    """
    Generate model response.
    Uses chat template for Instruct models.
    """
    messages = []
    
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    
    messages.append({"role": "user", "content": prompt})
    
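    # Use the chat template if the tokenizer defines one.
    # (apply_chat_template exists on every tokenizer, so check chat_template itself.)
    if getattr(tokenizer, 'chat_template', None):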
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
    else:
        # Fallback for models without chat template
        if system_prompt:
            text = f"{system_prompt}\n\nUser: {prompt}\nAssistant:"
        else:
            text = f"User: {prompt}\nAssistant:"
    
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Deterministic generation
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only new tokens
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()


def generate_response_base(model, tokenizer, prompt, max_new_tokens=50):
    """
    Generate response for base (non-Instruct) model.
    Uses completion-style prompting.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()


def evaluate_model(model, tokenizer, df, prompt_template, system_prompt=None, 
                   is_instruct=True, max_samples=None, verbose=True):
    """
    Evaluate model on dataset.
    
    Args:
        model: the model
        tokenizer: the tokenizer
        df: DataFrame with data
        prompt_template: prompt template (must contain {sentence})
        system_prompt: system prompt (optional)
        is_instruct: whether the model is an Instruct version
        max_samples: maximum number of samples (None = all)
        verbose: print progress
    
    Returns:
        dict with metrics and predictions
    """
    if max_samples:
        data = df.head(max_samples)
    else:
        data = df
    
    predictions = []
    responses = []
    true_labels = data['acceptable'].tolist()
    
    iterator = tqdm(data.iterrows(), total=len(data), disable=not verbose)
    
    for idx, row in iterator:
        sentence = row['sentence']
        prompt = prompt_template.format(sentence=sentence)
        
        if is_instruct:
            response = generate_response(model, tokenizer, prompt, system_prompt)
        else:
            response = generate_response_base(model, tokenizer, prompt)
        
        label = extract_label(response)
        
        predictions.append(label)
        responses.append(response)
    
    # Handle undefined labels:
    # map -1 to class 1 (acceptable), the majority class in RuCoLA, so metrics can be computed.
    # Note: this credits an unparseable response as correct whenever the true label is 1.
    predictions_clean = [p if p != -1 else 1 for p in predictions]
    
    # Calculate metrics
    accuracy = accuracy_score(true_labels, predictions_clean)
    f1 = f1_score(true_labels, predictions_clean, average='macro')
    
    # Count undefined responses
    undefined_count = predictions.count(-1)
    
    results = {
        'accuracy': accuracy,
        'f1_macro': f1,
        'undefined_count': undefined_count,
        'undefined_ratio': undefined_count / len(predictions),
        'predictions': predictions,
        'predictions_clean': predictions_clean,
        'responses': responses,
        'true_labels': true_labels
    }
    
    if verbose:
        print(f"\nResults:")
        print(f"  Accuracy: {accuracy:.4f}")
        print(f"  F1 (macro): {f1:.4f}")
        print(f"  Undefined responses: {undefined_count} ({undefined_count/len(predictions)*100:.1f}%)")
    
    return results
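
`evaluate_model` maps every unparseable response to class 1 before scoring. As a complementary view, the sketch below scores undefined responses as errors and reports how many responses were parseable at all. It is an illustrative helper, not part of the original pipeline; the name `strict_accuracy` is an assumption.

```python
def strict_accuracy(true_labels, predictions, undefined=-1):
    """Accuracy where undefined predictions count as wrong, plus parse coverage."""
    if len(true_labels) != len(predictions):
        raise ValueError("true_labels and predictions must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one example is required")
    # An undefined prediction (-1) never equals a true label of 0 or 1
    correct = sum(1 for t, p in zip(true_labels, predictions) if p == t)
    parsed = sum(1 for p in predictions if p != undefined)
    return {
        "strict_accuracy": correct / n,
        "coverage": parsed / n,  # share of responses with a parsed label
    }


# Toy example: two of the five responses could not be parsed
toy_true = [1, 1, 0, 1, 0]
toy_pred = [1, -1, 0, -1, 1]
print(strict_accuracy(toy_true, toy_pred))
```

With the result dictionaries returned above, this would be called as `strict_accuracy(res['true_labels'], res['predictions'])`, using the raw predictions rather than `predictions_clean`.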

## 6. Experiment 1: Prompts without System Prompt

We test several prompt formulations in Russian and English.

In [11]:
# Define prompts for experiments

prompts_ru = {
    "simple_ru": """Определи, является ли следующее предложение на русском языке приемлемым (грамматически корректным и естественным).

Предложение: {sentence}

Ответь только 1 (приемлемо) или 0 (неприемлемо).""",

    "detailed_ru": """Задача: определить лингвистическую приемлемость предложения на русском языке.

Приемлемое предложение - это предложение, которое носитель русского языка воспринял бы как естественное и грамматически корректное.

Предложение для анализа: {sentence}

Является ли это предложение приемлемым? Ответь 1 (да) или 0 (нет).""",

    "cot_ru": """Проанализируй следующее предложение на русском языке на предмет грамматической корректности и естественности.

Предложение: {sentence}

Сначала кратко объясни, есть ли в предложении ошибки, затем дай окончательный ответ: 1 (приемлемо) или 0 (неприемлемо)."""
}

prompts_en = {
    "simple_en": """Determine if the following Russian sentence is acceptable (grammatically correct and natural).

Sentence: {sentence}

Answer only 1 (acceptable) or 0 (unacceptable).""",

    "detailed_en": """Task: Determine the linguistic acceptability of a Russian sentence.

An acceptable sentence is one that a native Russian speaker would perceive as natural and grammatically correct.

Sentence to analyze: {sentence}

Is this sentence acceptable? Answer 1 (yes) or 0 (no).""",

    "cot_en": """Analyze the following Russian sentence for grammatical correctness and naturalness.

Sentence: {sentence}

First briefly explain if there are any errors, then give your final answer: 1 (acceptable) or 0 (unacceptable)."""
}

all_prompts_no_system = {**prompts_ru, **prompts_en}

print("Defined prompts:", len(all_prompts_no_system))
for name in all_prompts_no_system:
    print(f"  - {name}")

Defined prompts: 6
  - simple_ru
  - detailed_ru
  - cot_ru
  - simple_en
  - detailed_en
  - cot_en


In [12]:
# Run experiments without System Prompt
# Using a subset of data for quick testing (can be increased)

MAX_SAMPLES = 200  # Set to None to use the entire dataset

results_no_system = {}

print("="*60)
print("EXPERIMENT 1: Prompts without System Prompt")
print("="*60)

for prompt_name, prompt_template in all_prompts_no_system.items():
    print(f"\n--- Testing: {prompt_name} ---")
    
    results = evaluate_model(
        model_instruct, 
        tokenizer_instruct, 
        df, 
        prompt_template,
        system_prompt=None,
        is_instruct=True,
        max_samples=MAX_SAMPLES
    )
    
    results_no_system[prompt_name] = results

EXPERIMENT 1: Prompts without System Prompt

--- Testing: simple_ru ---


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
100%|██████████| 200/200 [03:55<00:00,  1.18s/it]



Results:
  Accuracy: 0.7350
  F1 (macro): 0.5007
  Undefined responses: 0 (0.0%)

--- Testing: detailed_ru ---


100%|██████████| 200/200 [11:33<00:00,  3.47s/it]



Results:
  Accuracy: 0.7100
  F1 (macro): 0.5289
  Undefined responses: 0 (0.0%)

--- Testing: cot_ru ---


100%|██████████| 200/200 [59:48<00:00, 17.94s/it] 



Results:
  Accuracy: 0.7050
  F1 (macro): 0.4442
  Undefined responses: 37 (18.5%)

--- Testing: simple_en ---


100%|██████████| 200/200 [00:18<00:00, 10.68it/s]



Results:
  Accuracy: 0.6550
  F1 (macro): 0.5014
  Undefined responses: 0 (0.0%)

--- Testing: detailed_en ---


100%|██████████| 200/200 [00:33<00:00,  5.94it/s]



Results:
  Accuracy: 0.7000
  F1 (macro): 0.5026
  Undefined responses: 0 (0.0%)

--- Testing: cot_en ---


100%|██████████| 200/200 [07:01<00:00,  2.11s/it]


Results:
  Accuracy: 0.6600
  F1 (macro): 0.4241
  Undefined responses: 25 (12.5%)





In [13]:
# Summary of Experiment 1 results
print("\n" + "="*60)
print("EXPERIMENT 1 SUMMARY: Without System Prompt")
print("="*60)

summary_no_system = []
for name, res in results_no_system.items():
    summary_no_system.append({
        'Prompt': name,
        'Accuracy': f"{res['accuracy']:.4f}",
        'F1 (macro)': f"{res['f1_macro']:.4f}",
        'Undefined (%)': f"{res['undefined_ratio']*100:.1f}%"
    })

summary_df_1 = pd.DataFrame(summary_no_system)
print(summary_df_1.to_string(index=False))


EXPERIMENT 1 SUMMARY: Without System Prompt
     Prompt Accuracy F1 (macro) Undefined (%)
  simple_ru   0.7350     0.5007          0.0%
detailed_ru   0.7100     0.5289          0.0%
     cot_ru   0.7050     0.4442         18.5%
  simple_en   0.6550     0.5014          0.0%
detailed_en   0.7000     0.5026          0.0%
     cot_en   0.6600     0.4241         12.5%
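
With only 200 evaluated sentences, each accuracy in the table carries noticeable sampling noise. The following sketch uses the normal approximation to a binomial proportion to put a rough 95% interval around a reported accuracy. The function name and the choice of approximation are assumptions made here; a Wilson or bootstrap interval would be reasonable alternatives.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for an accuracy measured on n items."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


# Accuracies copied from the Experiment 1 summary (n = MAX_SAMPLES = 200)
for name, acc in [("simple_ru", 0.7350), ("simple_en", 0.6550)]:
    low, high = accuracy_interval(acc, n=200)
    print(f"{name}: {acc:.3f} (95% CI roughly {low:.3f} to {high:.3f})")
```

Each interval has a half-width of roughly 0.06, so differences of a few points between prompts should be read with caution.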


In [14]:
# Examples of model responses
best_prompt = max(results_no_system, key=lambda x: results_no_system[x]['accuracy'])
print(f"\nModel response examples (prompt: {best_prompt}):")
print("-" * 60)

res = results_no_system[best_prompt]
for i in range(min(5, len(res['responses']))):
    sentence = df.iloc[i]['sentence']
    true_label = res['true_labels'][i]
    pred_label = res['predictions'][i]
    response = res['responses'][i]
    
    status = "✓" if pred_label == true_label else "✗"
    print(f"\n{status} Sentence: {sentence[:60]}...")
    print(f"   True label: {true_label}, Prediction: {pred_label}")
    print(f"   Model response: {response[:100]}...")


Model response examples (prompt: simple_ru):
------------------------------------------------------------

✓ Sentence: Иван вчера не позвонил....
   True label: 1, Prediction: 1
   Model response: 1...

✗ Sentence: У многих туристов, кто посещают Кемер весной, есть шанс заст...
   True label: 0, Prediction: 1
   Model response: 1...

✓ Sentence: Лесные запахи набегали волнами; в них смешалось дыхание можж...
   True label: 1, Prediction: 1
   Model response: 1...

✓ Sentence: Вчера президент имел неофициальную беседу с английским посло...
   True label: 1, Prediction: 1
   Model response: 1...

✓ Sentence: Коллега так и не признал вину за катастрофу перед коллективо...
   True label: 1, Prediction: 1
   Model response: 1...


## 7. Experiment 2: Prompts with System Prompt

We add several system prompts and measure their effect on accuracy and macro-F1.

In [15]:
# Define system prompts

system_prompts = {
    "expert_ru": """Ты эксперт-лингвист, специализирующийся на русском языке. 
Твоя задача - определять, является ли предложение лингвистически приемлемым.
Отвечай кратко и точно.""",

    "expert_en": """You are an expert linguist specializing in Russian language.
Your task is to determine if a sentence is linguistically acceptable.
Answer briefly and precisely.""",

    "strict_ru": """Ты строгий грамматический анализатор русского языка.
Оценивай предложения по критериям: грамматическая корректность, естественность для носителя языка.
Давай только числовой ответ: 1 или 0.""",

    "native_speaker_ru": """Представь, что ты носитель русского языка с высшим филологическим образованием.
Определи, звучит ли предложение естественно и правильно."""
}

print("Defined system prompts:", len(system_prompts))
for name in system_prompts:
    print(f"  - {name}")

Defined system prompts: 4
  - expert_ru
  - expert_en
  - strict_ru
  - native_speaker_ru


In [16]:
# Run experiments with System Prompt
# The user prompt is fixed to simple_ru, the most accurate prompt in Experiment 1,
# so that only the system prompt varies between runs

test_prompt = all_prompts_no_system["simple_ru"]

results_with_system = {}

print("="*60)
print("EXPERIMENT 2: Prompts with System Prompt")
print("="*60)

for sys_name, sys_prompt in system_prompts.items():
    print(f"\n--- Testing System Prompt: {sys_name} ---")
    
    results = evaluate_model(
        model_instruct, 
        tokenizer_instruct, 
        df, 
        test_prompt,
        system_prompt=sys_prompt,
        is_instruct=True,
        max_samples=MAX_SAMPLES
    )
    
    results_with_system[sys_name] = results

EXPERIMENT 2: Prompts with System Prompt

--- Testing System Prompt: expert_ru ---


100%|██████████| 200/200 [00:18<00:00, 11.00it/s]



Results:
  Accuracy: 0.6850
  F1 (macro): 0.5294
  Undefined responses: 0 (0.0%)

--- Testing System Prompt: expert_en ---


100%|██████████| 200/200 [00:18<00:00, 10.96it/s]



Results:
  Accuracy: 0.7100
  F1 (macro): 0.5192
  Undefined responses: 0 (0.0%)

--- Testing System Prompt: strict_ru ---


100%|██████████| 200/200 [00:18<00:00, 11.10it/s]



Results:
  Accuracy: 0.7200
  F1 (macro): 0.4919
  Undefined responses: 0 (0.0%)

--- Testing System Prompt: native_speaker_ru ---


100%|██████████| 200/200 [00:22<00:00,  8.81it/s]


Results:
  Accuracy: 0.7000
  F1 (macro): 0.5026
  Undefined responses: 0 (0.0%)





In [17]:
# Summary of Experiment 2 results
print("\n" + "="*60)
print("EXPERIMENT 2 SUMMARY: With System Prompt")
print("="*60)

summary_with_system = []
for name, res in results_with_system.items():
    summary_with_system.append({
        'System Prompt': name,
        'Accuracy': f"{res['accuracy']:.4f}",
        'F1 (macro)': f"{res['f1_macro']:.4f}",
        'Undefined (%)': f"{res['undefined_ratio']*100:.1f}%"
    })

summary_df_2 = pd.DataFrame(summary_with_system)
print(summary_df_2.to_string(index=False))


EXPERIMENT 2 SUMMARY: With System Prompt
    System Prompt Accuracy F1 (macro) Undefined (%)
        expert_ru   0.6850     0.5294          0.0%
        expert_en   0.7100     0.5192          0.0%
        strict_ru   0.7200     0.4919          0.0%
native_speaker_ru   0.7000     0.5026          0.0%
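
All system prompts are scored on the same 200 sentences, so two configurations can be compared pairwise on the items where exactly one of them is correct. One common choice for this is an exact McNemar test, sketched below with the standard library only. The helper name and the toy inputs are hypothetical; in the notebook the inputs would be `true_labels` and the `predictions_clean` lists of two result dictionaries.

```python
from math import comb


def mcnemar_exact(true_labels, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (items only A got right, items only B got right, p-value).
    """
    only_a = sum(1 for t, a, b in zip(true_labels, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(true_labels, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the two systems never disagree in correctness
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy data: system A is right on all ten items, system B on five of them
toy_true = [1] * 10
toy_a = [1] * 10
toy_b = [0] * 5 + [1] * 5
print(mcnemar_exact(toy_true, toy_a, toy_b))
```

A large p-value would suggest that the accuracy gap between two system prompts is within what chance disagreement on 200 sentences can produce.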


## 8. Experiment 3: Base Model (Qwen2.5-1.5B without Instruct)

We compare against the base model, which was not fine-tuned to follow instructions.

In [18]:
# Load base model
model_base, tokenizer_base = load_model("Qwen/Qwen2.5-1.5B", is_instruct=False)

Loading model: Qwen/Qwen2.5-1.5B
Model loaded on cuda


In [19]:
# Prompts for base model (completion-style)
# Base models usually respond better to few-shot examples or completion-style prompts

prompts_base = {
    "completion_ru": """Задача: определить приемлемость русского предложения (1 - приемлемо, 0 - неприемлемо).

Пример 1:
Предложение: "Мама мыла раму."
Ответ: 1

Пример 2:
Предложение: "Он вчера будет читать книгу."
Ответ: 0

Предложение: "{sentence}"
Ответ:""",

    "completion_en": """Task: determine if a Russian sentence is acceptable (1 - acceptable, 0 - unacceptable).

Example 1:
Sentence: "Мама мыла раму."
Answer: 1

Example 2:
Sentence: "Он вчера будет читать книгу."
Answer: 0

Sentence: "{sentence}"
Answer:""",

    "fewshot_ru": """Определи, приемлемо ли предложение на русском (1=да, 0=нет):

"Кошка сидит на окне." -> 1
"Вчера я буду гулять." -> 0
"Красивый цветок расцвел." -> 1
"Книга читает мальчик себя." -> 0

"{sentence}" ->"""
}

print("Prompts for base model:", len(prompts_base))

Prompts for base model: 3
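
The completion-style templates above are written by hand. When the number or choice of demonstrations needs to vary, the same layout can be assembled from a list of labelled examples. This is a minimal sketch; `build_completion_prompt` is a hypothetical helper, and the layout simply mirrors `completion_ru`.

```python
def build_completion_prompt(
    examples,
    sentence,
    header="Задача: определить приемлемость русского предложения "
           "(1 - приемлемо, 0 - неприемлемо).",
):
    """Assemble a completion-style few-shot prompt that ends right before the label."""
    parts = [header, ""]
    for i, (text, label) in enumerate(examples, start=1):
        parts += [f"Пример {i}:", f'Предложение: "{text}"', f"Ответ: {label}", ""]
    # The target sentence is left unlabeled so the model completes the answer
    parts += [f'Предложение: "{sentence}"', "Ответ:"]
    return "\n".join(parts)


demos = [("Мама мыла раму.", 1), ("Он вчера будет читать книгу.", 0)]
prompt = build_completion_prompt(demos, "Иван вчера не позвонил.")
print(prompt)
```

Building the string directly, rather than through `str.format`, also means the demonstrations can be changed without editing a template.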


In [20]:
# Run experiments with base model

results_base = {}

print("="*60)
print("EXPERIMENT 3: Base Model Qwen2.5-1.5B")
print("="*60)

for prompt_name, prompt_template in prompts_base.items():
    print(f"\n--- Testing: {prompt_name} ---")
    
    results = evaluate_model(
        model_base, 
        tokenizer_base, 
        df, 
        prompt_template,
        system_prompt=None,
        is_instruct=False,
        max_samples=MAX_SAMPLES
    )
    
    results_base[prompt_name] = results

EXPERIMENT 3: Base Model Qwen2.5-1.5B

--- Testing: completion_ru ---


100%|██████████| 200/200 [06:54<00:00,  2.07s/it]



Results:
  Accuracy: 0.7500
  F1 (macro): 0.4648
  Undefined responses: 0 (0.0%)

--- Testing: completion_en ---


100%|██████████| 200/200 [06:57<00:00,  2.09s/it]



Results:
  Accuracy: 0.6050
  F1 (macro): 0.4698
  Undefined responses: 0 (0.0%)

--- Testing: fewshot_ru ---


100%|██████████| 200/200 [07:04<00:00,  2.12s/it]


Results:
  Accuracy: 0.5100
  F1 (macro): 0.4370
  Undefined responses: 0 (0.0%)





In [21]:
# Summary of Experiment 3 results
print("\n" + "="*60)
print("EXPERIMENT 3 SUMMARY: Base Model")
print("="*60)

summary_base = []
for name, res in results_base.items():
    summary_base.append({
        'Prompt': name,
        'Accuracy': f"{res['accuracy']:.4f}",
        'F1 (macro)': f"{res['f1_macro']:.4f}",
        'Undefined (%)': f"{res['undefined_ratio']*100:.1f}%"
    })

summary_df_3 = pd.DataFrame(summary_base)
print(summary_df_3.to_string(index=False))


EXPERIMENT 3 SUMMARY: Base Model
       Prompt Accuracy F1 (macro) Undefined (%)
completion_ru   0.7500     0.4648          0.0%
completion_en   0.6050     0.4698          0.0%
   fewshot_ru   0.5100     0.4370          0.0%


In [22]:
# Examples of base model responses
print("\nBase model response examples:")
print("-" * 60)

best_base_prompt = max(results_base, key=lambda x: results_base[x]['accuracy'])
res = results_base[best_base_prompt]

for i in range(min(5, len(res['responses']))):
    sentence = df.iloc[i]['sentence']
    true_label = res['true_labels'][i]
    pred_label = res['predictions'][i]
    response = res['responses'][i]
    
    status = "✓" if pred_label == true_label else "✗"
    print(f"\n{status} Sentence: {sentence[:60]}...")
    print(f"   True label: {true_label}, Prediction: {pred_label}")
    print(f"   Model response: {response[:100]}")


Base model response examples:
------------------------------------------------------------

✓ Sentence: Иван вчера не позвонил....
   True label: 1, Prediction: 1
   Model response: 1

Пример 4:
Предложение: "Мама мыла раму."
Ответ: 1

Пример 5:
Предложение: "Он вчера будет читать 

✗ Sentence: У многих туристов, кто посещают Кемер весной, есть шанс заст...
   True label: 0, Prediction: 1
   Model response: 1

Пример 4:
Предложение: "Все, кто приезжает в Кемер, должны посетить горнолыжный курорт."
Ответ: 1

✓ Sentence: Лесные запахи набегали волнами; в них смешалось дыхание можж...
   True label: 1, Prediction: 1
   Model response: 1

Пример 4:
Предложение: "Все мысли о том, что мы не можем сделать ничего, исчезли."
Ответ: 1

Прим

✓ Sentence: Вчера президент имел неофициальную беседу с английским посло...
   True label: 1, Prediction: 1
   Model response: 1

Пример 4:
Предложение: "Вчера президент имел неофициальную беседу с английским послом."
Ответ: 1


✓ Sentence: Коллега так и не
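
As the examples show, the base model answers with a digit and then keeps inventing further examples until `max_new_tokens` is exhausted. In these examples the response starts with the digit, so `extract_label` still returns the intended label, but the stored responses are noisy. A simple post-processing step, sketched below as an assumption rather than something the notebook does, keeps only the first non-empty line of the completion.

```python
def first_line(completion: str) -> str:
    """Return the first non-empty line of a completion, stripped of whitespace."""
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""


# A completion shaped like the base-model outputs shown above
raw = '1\n\nПример 4:\nПредложение: "Мама мыла раму."\nОтвет: 1'
print(repr(first_line(raw)))
```

Lowering `max_new_tokens` for the base model would have a similar effect and would likely shorten the roughly 2 s per sentence seen in the progress bars.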

## 9. Comparison of All Experiments and Conclusions

In [23]:
# Overall summary table
print("="*80)
print("OVERALL SUMMARY OF ALL EXPERIMENTS")
print("="*80)

all_results = []

# Experiment 1: without system prompt
for name, res in results_no_system.items():
    all_results.append({
        'Experiment': 'Instruct (no sys)',
        'Prompt/Config': name,
        'Accuracy': res['accuracy'],
        'F1 (macro)': res['f1_macro'],
        'Undefined (%)': res['undefined_ratio']*100
    })

# Experiment 2: with system prompt
for name, res in results_with_system.items():
    all_results.append({
        'Experiment': 'Instruct (with sys)',
        'Prompt/Config': name,
        'Accuracy': res['accuracy'],
        'F1 (macro)': res['f1_macro'],
        'Undefined (%)': res['undefined_ratio']*100
    })

# Experiment 3: base model
for name, res in results_base.items():
    all_results.append({
        'Experiment': 'Base model',
        'Prompt/Config': name,
        'Accuracy': res['accuracy'],
        'F1 (macro)': res['f1_macro'],
        'Undefined (%)': res['undefined_ratio']*100
    })

comparison_df = pd.DataFrame(all_results)
comparison_df = comparison_df.sort_values('Accuracy', ascending=False)

print(comparison_df.to_string(index=False))

OVERALL SUMMARY OF ALL EXPERIMENTS
         Experiment     Prompt/Config  Accuracy  F1 (macro)  Undefined (%)
         Base model     completion_ru     0.750    0.464783            0.0
  Instruct (no sys)         simple_ru     0.735    0.500730            0.0
Instruct (with sys)         strict_ru     0.720    0.491925            0.0
  Instruct (no sys)       detailed_ru     0.710    0.528915            0.0
Instruct (with sys)         expert_en     0.710    0.519151            0.0
  Instruct (no sys)            cot_ru     0.705    0.444209           18.5
  Instruct (no sys)       detailed_en     0.700    0.502570            0.0
Instruct (with sys) native_speaker_ru     0.700    0.502570            0.0
Instruct (with sys)         expert_ru     0.685    0.529412            0.0
  Instruct (no sys)            cot_en     0.660    0.424119           12.5
  Instruct (no sys)         simple_en     0.655    0.501427            0.0
         Base model     completion_en     0.605    0.469781      

In [24]:
# Best results by category
print("\n" + "="*60)
print("BEST RESULTS BY CATEGORY")
print("="*60)

# Best result for Instruct without system prompt
best_no_sys = max(results_no_system.items(), key=lambda x: x[1]['accuracy'])
print(f"\n1. Instruct without System Prompt:")
print(f"   Best prompt: {best_no_sys[0]}")
print(f"   Accuracy: {best_no_sys[1]['accuracy']:.4f}")
print(f"   F1 (macro): {best_no_sys[1]['f1_macro']:.4f}")

# Best result for Instruct with system prompt
best_with_sys = max(results_with_system.items(), key=lambda x: x[1]['accuracy'])
print(f"\n2. Instruct with System Prompt:")
print(f"   Best System Prompt: {best_with_sys[0]}")
print(f"   Accuracy: {best_with_sys[1]['accuracy']:.4f}")
print(f"   F1 (macro): {best_with_sys[1]['f1_macro']:.4f}")

# Best result for base model
best_base = max(results_base.items(), key=lambda x: x[1]['accuracy'])
print(f"\n3. Base Model:")
print(f"   Best prompt: {best_base[0]}")
print(f"   Accuracy: {best_base[1]['accuracy']:.4f}")
print(f"   F1 (macro): {best_base[1]['f1_macro']:.4f}")


BEST RESULTS BY CATEGORY

1. Instruct without System Prompt:
   Best prompt: simple_ru
   Accuracy: 0.7350
   F1 (macro): 0.5007

2. Instruct with System Prompt:
   Best System Prompt: strict_ru
   Accuracy: 0.7200
   F1 (macro): 0.4919

3. Base Model:
   Best prompt: completion_ru
   Accuracy: 0.7500
   F1 (macro): 0.4648
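
The ranking above is by accuracy. Because accuracy rewards predicting the majority class, it can be informative to rank the same configurations by macro-F1 as well. The sketch below does this with values printed in the summary tables; the dictionary is a hand-copied subset for illustration, and `rank_by` is a hypothetical helper.

```python
# (accuracy, macro-F1) copied from the summary tables above
scores = {
    "Instruct (no sys) / simple_ru": (0.7350, 0.5007),
    "Instruct (no sys) / detailed_ru": (0.7100, 0.5289),
    "Instruct (with sys) / strict_ru": (0.7200, 0.4919),
    "Instruct (with sys) / expert_ru": (0.6850, 0.5294),
    "Base / completion_ru": (0.7500, 0.4648),
    "Base / completion_en": (0.6050, 0.4698),
}


def rank_by(metric_index: int):
    """Return configuration names sorted from best to worst on one metric."""
    return sorted(scores, key=lambda name: scores[name][metric_index], reverse=True)


print("Best by accuracy:", rank_by(0)[0])
print("Best by macro-F1:", rank_by(1)[0])
```

On these numbers the two rankings disagree: `completion_ru` leads on accuracy, while `expert_ru` leads on macro-F1.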


In [25]:
# Error analysis for the best configuration
print("\n" + "="*60)
print("ERROR ANALYSIS FOR BEST CONFIGURATION")
print("="*60)

# Find the best configuration
all_configs = [
    ('Instruct no sys', best_no_sys[0], best_no_sys[1]),
    ('Instruct with sys', best_with_sys[0], best_with_sys[1]),
    ('Base', best_base[0], best_base[1])
]

best_overall = max(all_configs, key=lambda x: x[2]['accuracy'])
print(f"\nBest configuration: {best_overall[0]} - {best_overall[1]}")

res = best_overall[2]
true_labels = res['true_labels']
predictions = res['predictions_clean']

print(f"\nConfusion Matrix:")
cm = confusion_matrix(true_labels, predictions)
print(f"                 Predicted")
print(f"                 0       1")
print(f"Actual 0      {cm[0][0]:5d}   {cm[0][1]:5d}")
print(f"       1      {cm[1][0]:5d}   {cm[1][1]:5d}")

print(f"\nDetailed Report:")
print(classification_report(true_labels, predictions, target_names=['Unacceptable', 'Acceptable']))


ERROR ANALYSIS FOR BEST CONFIGURATION

Best configuration: Base - completion_ru

Confusion Matrix:
                 Predicted
                 0       1
Actual 0          2      46
       1          4     148

Detailed Report:
              precision    recall  f1-score   support

Unacceptable       0.33      0.04      0.07        48
  Acceptable       0.76      0.97      0.86       152

    accuracy                           0.75       200
   macro avg       0.55      0.51      0.46       200
weighted avg       0.66      0.75      0.67       200
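
The confusion matrix shows that the best configuration predicts class 1 for almost every sentence. The Matthews correlation coefficient (MCC), a metric commonly reported for CoLA-style acceptability benchmarks, makes this visible in a single number. The sketch below computes it directly from the four cells printed above. The function is a plain implementation of the textbook formula and is not part of the original notebook; scikit-learn's `matthews_corrcoef` computes the same quantity from label lists.

```python
import math


def mcc_from_counts(tn: int, fp: int, fn: int, tp: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Cells from the confusion matrix above (rows = actual, columns = predicted)
mcc = mcc_from_counts(tn=2, fp=46, fn=4, tp=148)
# Accuracy of always answering 1 on the same 200 sentences (152 are acceptable)
majority_accuracy = 152 / 200

print(f"MCC: {mcc:.4f}")
print(f"Always-1 accuracy on this subset: {majority_accuracy:.2f}")
```

The MCC comes out close to zero (about 0.04), and always answering 1 on these 200 sentences would score 0.76 accuracy, slightly above the 0.75 reported here.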



## 10. Conclusions

In [26]:
# Automatic generation of conclusions based on results
print("="*80)
print("CONCLUSIONS")
print("="*80)

# Compare prompt languages
ru_prompts = {k: v for k, v in results_no_system.items() if k.endswith('_ru')}
en_prompts = {k: v for k, v in results_no_system.items() if k.endswith('_en')}

avg_ru = np.mean([v['accuracy'] for v in ru_prompts.values()]) if ru_prompts else 0
avg_en = np.mean([v['accuracy'] for v in en_prompts.values()]) if en_prompts else 0

print(f"\n1. PROMPT LANGUAGE IMPACT:")
print(f"   Average accuracy (Russian): {avg_ru:.4f}")
print(f"   Average accuracy (English): {avg_en:.4f}")
if avg_ru > avg_en:
    print(f"   -> Russian prompts show better results for the RuCoLA task")
elif avg_en > avg_ru:
    print(f"   -> English prompts show better results")
else:
    print(f"   -> Prompt language does not have a significant impact")

# Compare with/without system prompt
avg_no_sys = np.mean([v['accuracy'] for v in results_no_system.values()])
avg_with_sys = np.mean([v['accuracy'] for v in results_with_system.values()])

print(f"\n2. SYSTEM PROMPT IMPACT:")
print(f"   Average accuracy without System Prompt: {avg_no_sys:.4f}")
print(f"   Average accuracy with System Prompt: {avg_with_sys:.4f}")
# The difference printed below is in percentage points of average accuracy, not a relative change
if avg_with_sys > avg_no_sys:
    print(f"   -> System Prompt improves quality by {(avg_with_sys-avg_no_sys)*100:.1f}%")
else:
    print(f"   -> System Prompt does not improve or degrades quality")

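# Compare Instruct vs Base
# Note: avg_instruct is the higher of the two *average* Instruct accuracies
# (with or without System Prompt), not the best single-prompt result
avg_instruct = max(avg_no_sys, avg_with_sys)
avg_base = np.mean([v['accuracy'] for v in results_base.values()])

print(f"\n3. INSTRUCT vs BASE MODEL:")
print(f"   Best Instruct accuracy: {avg_instruct:.4f}")
print(f"   Average Base accuracy: {avg_base:.4f}")
# Plain comparison of averages; no statistical significance test is performed here
if avg_instruct > avg_base: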
    print(f"   -> Instruct model significantly outperforms the base model")
    print(f"   -> Instruction tuning is critical for prompting tasks")
else:
    print(f"   -> Base model shows comparable results")

# Overall conclusion about model size
best_acc = max(
    max([v['accuracy'] for v in results_no_system.values()]),
    max([v['accuracy'] for v in results_with_system.values()]),
    max([v['accuracy'] for v in results_base.values()])
)

print(f"\n4. MODEL SIZE ADEQUACY:")
print(f"   Best achieved result: {best_acc:.4f}")

if best_acc >= 0.8:
    print(f"   -> 1.5B model shows good results (>{0.8:.0%})")
    print(f"   -> Model size is sufficient for basic task solution")
elif best_acc >= 0.7:
    print(f"   -> 1.5B model shows moderate results (70-80%)")
    print(f"   -> Recommended to try a larger model (7B) for better quality")
else:
    print(f"   -> 1.5B model shows low results (<70%)")
    print(f"   -> Model size is insufficient for quality task solution")
    print(f"   -> Recommended to use 7B model or larger")

print(f"\n5. RECOMMENDATIONS:")
print(f"   - For RuCoLA task, best configuration: {best_overall[0]} with prompt {best_overall[1]}")
print(f"   - With limited resources: use simple prompts with clear instructions")
print(f"   - With GPU available: consider Qwen2.5-7B-Instruct for better quality")
print(f"   - Chain-of-Thought prompts may improve interpretability, but not always accuracy")

CONCLUSIONS

1. PROMPT LANGUAGE IMPACT:
   Average accuracy (Russian): 0.7167
   Average accuracy (English): 0.6717
   -> Russian prompts show better results for the RuCoLA task

2. SYSTEM PROMPT IMPACT:
   Average accuracy without System Prompt: 0.6942
   Average accuracy with System Prompt: 0.7038
   -> System Prompt improves quality by 1.0%

3. INSTRUCT vs BASE MODEL:
   Best Instruct accuracy: 0.7038
   Average Base accuracy: 0.6217
   -> Instruct model significantly outperforms the base model
   -> Instruction tuning is critical for prompting tasks

4. MODEL SIZE ADEQUACY:
   Best achieved result: 0.7500
   -> 1.5B model shows moderate results (70-80%)
   -> Recommended to try a larger model (7B) for better quality

5. RECOMMENDATIONS:
   - For RuCoLA task, best configuration: Base with prompt completion_ru
   - With limited resources: use simple prompts with clear instructions
   - With GPU available: consider Qwen2.5-7B-Instruct for better quality
   - Chain-of-Thought prompts may improve interpretability, but not always accuracy
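
The error analysis report shows a support of 200 sentences. Assuming every configuration was evaluated on a sample of that size, small accuracy differences may fall within sampling noise. The sketch below computes a normal-approximation 95% interval for a single accuracy value. `accuracy_interval` is a hypothetical helper, not part of the notebook, and the approximation ignores the fact that all configurations share the same sentences.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    # Standard error of a binomial proportion
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

# Best accuracy from the summary above, with the sample size from the report's support column
low, high = accuracy_interval(0.75, 200)
print(f"95% interval for accuracy 0.75 on 200 samples: [{low:.3f}, {high:.3f}]")
```

Under this approximation the interval is roughly 0.69 to 0.81, which covers the best results of all three categories (0.7350, 0.7200, 0.7500). A paired test on per-sentence predictions, such as McNemar's test, would be a more appropriate way to compare two configurations.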

## 11. Saving Results

In [27]:
# Save summary results table
comparison_df.to_csv('prompting_results.csv', index=False)
print("Results saved to prompting_results.csv")

# Save detailed results of the best model
best_res = best_overall[2]
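# Assumes the evaluated subset is the first N rows of df; if the evaluation
# used a random sample, take the sentences from that sample instead
detailed_results = pd.DataFrame({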
    'sentence': df['sentence'].head(len(best_res['predictions'])),
    'true_label': best_res['true_labels'],
    'predicted_label': best_res['predictions'],
    'model_response': best_res['responses']
})
detailed_results.to_csv('best_model_predictions.csv', index=False)
print("Detailed predictions saved to best_model_predictions.csv")

Results saved to prompting_results.csv
Detailed predictions saved to best_model_predictions.csv


In [28]:
# Clean up GPU memory
if device == "cuda":
    del model_instruct, model_base
    torch.cuda.empty_cache()
    print("GPU memory cleared")

GPU memory cleared
