# Self‑Improving LLM Project

This notebook implements Parts 2 and 3 of the project plan for the **Self‑Improving LLM** final project.  Specifically, it covers:

- **Dataset Acquisition & Sampling:** download the StrategyQA dataset, sample ~2 000 training examples as recommended, and save them to disk for subsequent processing.
- **Prompt Engineering & Teacher Generation:** generate a baseline *student draft* for each question, compose prompts according to the plan (question, student draft, and a teacher instruction), call GPT‑4 (or run in dry‑run mode), and build two parallel corpora for baseline and CoT training.

The plan specifies a data‑generation loop where each question is paired with a student draft and a teacher chain‑of‑thought, resulting in two training tracks.  The baseline model is trained on `(Q → answer)` pairs, while the CoT model is trained on `(Q + teacher CoT → answer)` pairs.

> **Note:** Running the full pipeline (especially calling GPT‑4) requires an OpenAI API key and may incur costs.  A dry‑run mode is provided for testing the notebook without external API calls.


In [1]:
!pip install -q datasets transformers openai bitsandbytes accelerate python-dotenv "huggingface_hub[hf_xet]"




In [4]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file if it exists
load_dotenv()

# Dataset parameters
DATASET_NAME = os.getenv('DATASET_NAME', 'voidful/StrategyQA')
TRAIN_SAMPLES = int(os.getenv('TRAIN_SAMPLES', '100'))
RANDOM_SEED = int(os.getenv('RANDOM_SEED', '42'))

# Model parameters
MODEL_NAME = os.getenv('MODEL_NAME', 'microsoft/phi-2')
MAX_NEW_TOKENS = int(os.getenv('MAX_NEW_TOKENS', '35'))
BATCH_SIZE = int(os.getenv('BATCH_SIZE', '8'))
USE_4BIT = os.getenv('USE_4BIT', 'True').lower() in ('true', '1', 't')
MAX_SEQ_LENGTH = int(os.getenv('MAX_SEQ_LENGTH', '512'))
HUGGINGFACE_TOKEN = os.getenv('HUGGINGFACE_TOKEN', '')

# Generation parameters
DO_SAMPLE = os.getenv('DO_SAMPLE', 'False').lower() in ('true', '1', 't')
TEMPERATURE = float(os.getenv('TEMPERATURE', '0.7'))

# GPT-4 parameters
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', '')
GPT4_MODEL = os.getenv('GPT4_MODEL', 'gpt-4')
GPT4_MAX_TOKENS = int(os.getenv('GPT4_MAX_TOKENS', '150'))
GPT4_TEMPERATURE = float(os.getenv('GPT4_TEMPERATURE', '0.3'))
DRY_RUN = os.getenv('DRY_RUN', 'True').lower() in ('true', '1', 't')

# File paths
DATA_DIR = os.getenv('DATA_DIR', 'data')
RAW_DIR = os.path.join(DATA_DIR, 'raw')
SAMPLE_TRAIN_PATH = os.path.join(DATA_DIR, 'sample_train.jsonl')
STUDENT_DRAFTS_PATH = os.path.join(DATA_DIR, 'student_drafts.jsonl')
TEACHER_OUTPUTS_PATH = os.path.join(DATA_DIR, 'teacher_outputs.jsonl')
BASELINE_PATH = os.path.join(DATA_DIR, 'train_baseline.jsonl')
COT_PATH = os.path.join(DATA_DIR, 'train_cot.jsonl')

# Print configuration
print("=== Configuration ===")
print(f"Dataset: {DATASET_NAME}")
print(f"Model: {MODEL_NAME}")
print(f"Batch size: {BATCH_SIZE}")
print(f"4-bit quantization: {USE_4BIT}")
print(f"GPT-4 dry run: {DRY_RUN}")
print("====================")


=== Configuration ===
Dataset: voidful/StrategyQA
Model: microsoft/phi-2
Batch size: 8
4-bit quantization: True
GPT-4 dry run: False
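
The configuration cell repeats the same `os.getenv(...).lower() in ('true', '1', 't')` pattern for every boolean flag. A minimal sketch of helpers that centralize this parsing is shown below; `env_bool` and `env_int` are hypothetical names that do not appear in the notebook, and the accepted truthy strings simply mirror the ones used above.

```python
import os

# Same truthy spellings the configuration cell accepts.
_TRUTHY = {"true", "1", "t"}


def env_bool(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def env_int(name: str, default: int) -> int:
    """Read an integer setting, falling back to the default when unset or empty."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)
```

With helpers like these, a line such as `DRY_RUN = env_bool('DRY_RUN', True)` would behave the same as the original expression while keeping the parsing rule in one place.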


In [None]:
# from huggingface_hub import login, notebook_login

# def smart_hf_login():
#     """Use HF_TOKEN env/secret if present, else fall back to interactive login."""
#     if HUGGINGFACE_TOKEN:         # works for Colab secrets, CI, docker, …
#         login(HUGGINGFACE_TOKEN)
#     elif 'google.colab' in sys.modules:   # inside a Colab kernel but no secret set
#         notebook_login()
#     else:                                 # local Jupyter; will prompt only once
#         login()

# smart_hf_login()


In [5]:
from datasets import load_dataset
import os
import json
import sys
import subprocess

# Check if we need to download the dataset
# Fix: Use paths relative to the parent directory (project root)
parent_dir = os.path.dirname(os.getcwd())
raw_dir = os.path.join(parent_dir, DATA_DIR, 'raw')
train_path = os.path.join(raw_dir, 'strategyqa_train.jsonl')
val_path = os.path.join(raw_dir, 'strategyqa_validation.jsonl')  # Changed from dev to validation to match download script
test_path = os.path.join(raw_dir, 'strategyqa_test.jsonl')

print(f"Looking for files in:")
print(f"- Train: {train_path}")
print(f"- Val: {val_path}")
print(f"- Test: {test_path}")

# Create raw directory if it doesn't exist
os.makedirs(raw_dir, exist_ok=True)
print(f"Created directory: {raw_dir}")

# Check if files exist
files_exist = all(os.path.exists(p) for p in [train_path, val_path])
print(f"Files exist: {files_exist}")

if not files_exist:
    print("Dataset files not found. Running download script...")
    script_path = os.path.join(parent_dir, 'scripts', 'download_strategyqa.py')
    print(f"Running: {sys.executable} {script_path} --output-dir {raw_dir}")
    result = subprocess.run(
        [sys.executable, script_path, '--output-dir', raw_dir],
        check=True,
        capture_output=True,
        text=True
    )
    print("Download script output:")
    print(result.stdout)
    if result.stderr:
        print("Errors:")
        print(result.stderr)
    
    # Verify files were created
    print("\nChecking if files were created:")
    for path in [train_path, val_path, test_path]:
        exists = os.path.exists(path)
        print(f"- {path}: {'✓' if exists else '✗'}")
        if exists:
            size = os.path.getsize(path)
            print(f"  Size: {size:,} bytes")

# Load the dataset from local JSONL files
print("Loading dataset from local files...")
data_files = {
    'train': train_path,
    'validation': val_path,
}
if os.path.exists(test_path):
    data_files['test'] = test_path

dataset = load_dataset('json', data_files=data_files)
train = dataset['train']
validation = dataset['validation']

def sample_train_set(train_dataset, n_samples=TRAIN_SAMPLES, seed=RANDOM_SEED):
    '''Return a random sample of the training set.'''
    shuffled = train_dataset.shuffle(seed=seed)
    return shuffled.select(range(min(n_samples, len(shuffled))))

# Sample examples from the training set
print(f"Sampling {TRAIN_SAMPLES} examples with seed {RANDOM_SEED}")
target_train = sample_train_set(train)

# Create output directories (also fix these paths)
data_dir_full = os.path.join(parent_dir, DATA_DIR)
os.makedirs(data_dir_full, exist_ok=True)
os.makedirs(raw_dir, exist_ok=True)

# Save the full dev/test sets and the sampled train set
sample_train_path = os.path.join(parent_dir, SAMPLE_TRAIN_PATH)

def save_jsonl(dataset_split, path):
    with open(path, 'w', encoding='utf-8') as f:
        for item in dataset_split:
            f.write(json.dumps(item) + '\n')

# Save splits
save_jsonl(train, train_path)
save_jsonl(validation, val_path)
if 'test' in dataset:
    save_jsonl(dataset['test'], test_path)
save_jsonl(target_train, sample_train_path)

print(f"Full training set saved to {train_path}")
print(f"Validation set saved to {val_path}")
print(f"Sampled train set (≈{TRAIN_SAMPLES} entries) saved to {sample_train_path}")



Looking for files in:
- Train: c:\Users\noham\Desktop\Self-Improving-LLM\data\raw\strategyqa_train.jsonl
- Val: c:\Users\noham\Desktop\Self-Improving-LLM\data\raw\strategyqa_validation.jsonl
- Test: c:\Users\noham\Desktop\Self-Improving-LLM\data\raw\strategyqa_test.jsonl
Created directory: c:\Users\noham\Desktop\Self-Improving-LLM\data\raw
Files exist: True
Loading dataset from local files...


Generating train split: 1038 examples [00:00, 61042.70 examples/s]
Generating validation split: 565 examples [00:00, 51368.47 examples/s]
Generating test split: 687 examples [00:00, 57247.32 examples/s]


Sampling 200 examples with seed 42
Full training set saved to c:\Users\noham\Desktop\Self-Improving-LLM\data\raw\strategyqa_train.jsonl
Validation set saved to c:\Users\noham\Desktop\Self-Improving-LLM\data\raw\strategyqa_validation.jsonl
Sampled train set (≈200 entries) saved to c:\Users\noham\Desktop\Self-Improving-LLM\data\sample_train.jsonl
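
The sampling step above is a seeded shuffle followed by taking the first `n` rows, with the result written as JSON Lines. The sketch below reproduces that idea with the standard library only, so it can be tested without downloading anything. The helper names and the toy records are made up for illustration, and the rows it picks will not match those chosen by `Dataset.shuffle`, which uses a different random generator.

```python
import json
import random
import tempfile
from pathlib import Path


def sample_records(records, n_samples, seed):
    """Return up to n_samples records, chosen reproducibly for a given seed."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:min(n_samples, len(shuffled))]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy records standing in for StrategyQA rows (made-up content).
toy = [{"question": f"Toy question {i}?", "answer": i % 2 == 0} for i in range(10)]
picked = sample_records(toy, n_samples=4, seed=42)

# Round trip through a temporary file to check that nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_train.jsonl"
    write_jsonl(picked, path)
    restored = read_jsonl(path)
```

Fixing the seed is what makes the sampled file reproducible across runs; asking for more rows than exist simply returns the whole shuffled set.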


In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from torch.utils.data import DataLoader
from datasets import load_dataset
import torch
import json
import os
from tqdm import tqdm

def setup_dataset(input_path: str, tokenizer, batch_size: int = BATCH_SIZE):
    """Load and prepare dataset for GPU processing."""
    # Load the dataset
    dataset = load_dataset('json', data_files=input_path, split='train')
    
    # Keep the original questions for reference
    original_questions = dataset['question']
    
    # Tokenization function
    def tokenize_function(examples):
        return tokenizer(
            examples['question'],
            truncation=True,
            padding='max_length',
            max_length=MAX_SEQ_LENGTH,
            return_tensors=None  # Return as list, not tensors
        )
    
    # Apply tokenization
    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    
    # Create a custom dataset that includes both tokenized data and original questions
    class QADataset(torch.utils.data.Dataset):
        def __init__(self, tokenized_data, original_questions):
            self.tokenized_data = tokenized_data
            self.original_questions = original_questions
            
        def __len__(self):
            return len(self.tokenized_data)
            
        def __getitem__(self, idx):
            item = {
                'input_ids': torch.tensor(self.tokenized_data[idx]['input_ids']),
                'attention_mask': torch.tensor(self.tokenized_data[idx]['attention_mask']),
                'question': self.original_questions[idx]
            }
            return item
    
    # Create custom dataset
    custom_dataset = QADataset(tokenized_dataset, original_questions)
    
    # Create DataLoader
    loader = DataLoader(
        custom_dataset, 
        batch_size=batch_size, 
        shuffle=False  # Keep order for output matching
    )
    
    return loader

# GPU setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Model setup
print(f"Loading model: {MODEL_NAME}")

# Use 4-bit quantization if enabled and on GPU
if device.type == 'cuda' and USE_4BIT:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
    print("Loading model in 4-bit quantization...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto"
    )
else:
    print("Loading model in standard precision...")
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Ensure the tokenizer has a padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.eval()

if device.type == 'cuda':
    print(f"GPU Memory after model load: {torch.cuda.memory_allocated()/1e9:.2f} GB")

# Load and prepare dataset (fix path)
parent_dir = os.path.dirname(os.getcwd())
sample_train_path_full = os.path.join(parent_dir, SAMPLE_TRAIN_PATH)
print(f"Loading dataset from {sample_train_path_full} with batch size {BATCH_SIZE}")
train_loader = setup_dataset(sample_train_path_full, tokenizer, batch_size=BATCH_SIZE)

Using device: cuda
Loading model: microsoft/Phi-3.5-mini-instruct
Loading model in 4-bit quantization...


Loading checkpoint shards: 100%|██████████| 2/2 [00:13<00:00,  6.56s/it]


GPU Memory after model load: 2.26 GB
Loading dataset from c:\Users\noham\Desktop\Self-Improving-LLM\data\sample_train.jsonl with batch size 8


Generating train split: 200 examples [00:00, 17865.21 examples/s]
Map: 100%|██████████| 200/200 [00:00<00:00, 1177.04 examples/s]


## Generate Student Drafts

In this section we use the model loaded above (`MODEL_NAME`, which defaults to `microsoft/phi-2`; the recorded run used `microsoft/Phi-3.5-mini-instruct`) to generate a short *student draft* for each question in the sampled training set.  A draft consists of a yes/no answer followed by one or two clarifying questions, as specified in the data‑generation loop.  Adjust the model name based on your available hardware and licences.  Because the prompts are built with `tokenizer.apply_chat_template`, the chosen model should be one whose tokenizer defines a chat template.

> **Tip:** On Colab, you can enable a GPU via *Runtime → Change runtime type → GPU*.  On a CUDA device with `USE_4BIT` enabled, the model is loaded with 4‑bit quantization to reduce memory usage; on CPU, set `MODEL_NAME` to a smaller model to keep the run practical.


In [13]:
def build_messages(q: str):
    return [
        {"role": "system",
         "content": "You are a reasoning assistant that provides initial analysis for complex yes/no questions.\n\n"
                   "RESPONSE FORMAT - EXACTLY two lines:\n"
                   "Line 1: Answer: <Yes/No> - <brief reasoning>\n"
                   "Line 2: Questions: <focused question>? <key uncertainty>?\n\n"
                   "REASONING APPROACH:\n"
                   "- Consider key facts and logical connections\n"
                   "- When uncertain, lean toward the more likely answer with caveats\n"
                   "- Identify the most critical missing information\n"
                   "- Focus on specific, actionable clarifying questions\n\n"
                   "QUALITY CRITERIA:\n"
                   "- Your brief reasoning should capture the main logical path\n"
                   "- Questions should target genuine uncertainties, not obvious facts\n"
                   "- Be specific rather than generic in your inquiries\n"
                   "- Consider context, timing, and domain-specific knowledge"},
        
        # Improved few-shot examples
        {"role": "user", "content": "Is the sky blue?"},
        {"role": "assistant", "content": "Answer: Yes - electromagnetic scattering favors blue wavelengths\n"
                                        "Questions: Under what atmospheric conditions? At what time of day?"},
        
        {"role": "user", "content": "Can koalas digest meat?"},
        {"role": "assistant", "content": "Answer: No - specialized herbivore digestive system\n"
                                        "Questions: What about small amounts accidentally? In emergency situations?"},
        
        # Your actual question
        {"role": "user", "content": q},
    ]

def generate_batch_drafts(batch):
    """Generate drafts for a batch of questions."""
    # Set padding side to left for generation (decoder-only models need this)
    tokenizer.padding_side = 'left'
    # Create prompts for each question
    prompts = [
        tokenizer.apply_chat_template(build_messages(q),
                                    tokenize=False,
                                    add_generation_prompt=True)
        for q in batch["question"]
    ]
    inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(device)

    # Track memory usage before generation
    if device.type == 'cuda':
        print(f"GPU Memory before generation: {torch.cuda.memory_allocated()/1e9:.2f} GB", end='\r')
    
    # Generate responses
    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=DO_SAMPLE,
            temperature=TEMPERATURE if DO_SAMPLE else 1.0,
            pad_token_id=tokenizer.pad_token_id,
            return_dict_in_generate=True
        )

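    # With left padding every row shares the same padded prompt width, so the
    # newly generated tokens start at that column for all rows.
    prompt_width = inputs["input_ids"].shape[1]
    decoded = [tokenizer.decode(seq[prompt_width:], skip_special_tokens=True).strip()
               for seq in outputs.sequences]
    # Keep text from the first "Answer:" onward; fall back to the full text if absent.
    drafts = []
    for ans in decoded:
        start = ans.find("Answer:")
        drafts.append(ans[start:] if start != -1 else ans)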
    return drafts
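
With left padding, the generated tokens of every row start at the padded prompt width, not at that row's count of non-pad tokens. The toy illustration below uses plain Python lists instead of tensors to show the difference; the token ids and the `left_pad` helper are made up for the example.

```python
PAD = 0  # made-up pad token id


def left_pad(rows, pad_id=PAD):
    """Pad every row on the left so all rows share the longest row's length."""
    width = max(len(r) for r in rows)
    return [[pad_id] * (width - len(r)) + r for r in rows]


prompts = [[11, 12, 13, 14], [21, 22]]      # two prompts of different lengths
padded = left_pad(prompts)                  # second row gains two pad tokens
new_tokens = [[91, 92], [93, 94]]           # pretend generations
sequences = [p + n for p, n in zip(padded, new_tokens)]

# Slicing at the padded width (the analogue of input_ids.shape[1])
# recovers exactly the new tokens for every row.
prompt_width = len(padded[0])
by_width = [seq[prompt_width:] for seq in sequences]

# Slicing at each row's non-pad count leaks prompt tokens for the shorter row.
non_pad_counts = [sum(t != PAD for t in p) for p in padded]
by_count = [seq[c:] for seq, c in zip(sequences, non_pad_counts)]
```

Here `by_width` contains only the generated tokens, while `by_count` keeps `21, 22` from the second prompt in front of its generation.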

In [None]:
# Import tqdm for progress bars
from tqdm import tqdm

# Create output directory if it doesn't exist (fix path)
parent_dir = os.path.dirname(os.getcwd())
student_drafts_path_full = os.path.join(parent_dir, STUDENT_DRAFTS_PATH)
os.makedirs(os.path.dirname(student_drafts_path_full), exist_ok=True)

# Process batches and write outputs
print(f"Writing student drafts to {student_drafts_path_full}")
with open(student_drafts_path_full, 'w', encoding='utf-8') as out_f:
    for batch in tqdm(train_loader, desc='Generating drafts'):
        # Get original questions directly from the batch
        questions = batch['question']
        
        # Generate drafts for the batch; generate_batch_drafts re-tokenizes the
        # chat-formatted prompts, so the pre-tokenized tensors are not needed here
        drafts = generate_batch_drafts({'question': questions})
        
        # Write results
        for q, draft in zip(questions, drafts):
            out_rec = {
                'question': q,
                'student_draft': draft
            }
            out_f.write(json.dumps(out_rec, ensure_ascii=False) + '\n')
        
        # Print memory usage periodically
        if device.type == 'cuda':
            print(f"Current GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB", end='\r')

print(f"Student drafts written to {student_drafts_path_full}")
print("Using optimized student prompts for better reasoning and question quality!")

In [15]:
import json
import re

def clean_student_draft(draft_text):
    """
    Clean student draft to keep only Answer and Questions lines.
    Removes extra reasoning, paragraphs, and explanatory text.
    """
    lines = draft_text.strip().split('\n')
    cleaned_lines = []
    
    for line in lines:
        line = line.strip()
        # Keep lines that start with "Answer:" or "Questions:"
        if line.startswith('Answer:') or line.startswith('Questions:'):
            cleaned_lines.append(line)
        # Stop processing if we hit any reasoning sections
        elif any(keyword in line.upper() for keyword in ['REASONING APPROACH:', 'QUALITY CRITERIA:', 'TO CLARIFY', 'IN THIS CASE']):
            break
    
    # If we have both Answer and Questions, return them
    if len(cleaned_lines) >= 2:
        return '\n'.join(cleaned_lines[:2])  # Keep only first Answer and Questions lines
    elif len(cleaned_lines) == 1 and cleaned_lines[0].startswith('Answer:'):
        # If only Answer line, add a generic Questions line
        return cleaned_lines[0] + '\nQuestions: What additional context is needed?'
    else:
        # Fallback: extract from the full text
        return extract_answer_and_questions(draft_text)

def extract_answer_and_questions(text):
    """
    Extract Answer and Questions from messy text using regex patterns.
    """
    # Look for Answer pattern
    answer_match = re.search(r'Answer:\s*([^\n]+)', text, re.IGNORECASE)
    answer = answer_match.group(0) if answer_match else "Answer: Unable to determine"
    
    # Look for Questions pattern
    questions_match = re.search(r'Questions?:\s*([^\n]+(?:\?[^\n]*)*)', text, re.IGNORECASE | re.MULTILINE)
    if questions_match:
        questions = f"Questions: {questions_match.group(1)}"
    else:
        # Look for question marks in the text as a fallback
        question_lines = [line.strip() for line in text.split('\n') if '?' in line and not line.startswith('Answer:')]
        if question_lines:
            questions = f"Questions: {question_lines[0]}"
        else:
            questions = "Questions: What additional context is needed?"
    
    return f"{answer}\n{questions}"

# Load the existing student drafts
parent_dir = os.path.dirname(os.getcwd())
student_drafts_path_full = os.path.join(parent_dir, STUDENT_DRAFTS_PATH)
cleaned_drafts_path = os.path.join(parent_dir, DATA_DIR, 'student_drafts_cleaned.jsonl')

print(f"Loading student drafts from {student_drafts_path_full}")
with open(student_drafts_path_full, 'r', encoding='utf-8') as f:
    drafts = [json.loads(line) for line in f]

print(f"Cleaning {len(drafts)} student drafts...")
cleaned_count = 0
stats = {
    'total': len(drafts),
    'cleaned': 0,
    'already_clean': 0,
    'extracted': 0
}

# Clean each draft
cleaned_drafts = []
for draft in drafts:
    original_text = draft['student_draft']
    
    # Check if already clean (only has Answer and Questions lines)
    lines = [line.strip() for line in original_text.split('\n') if line.strip()]
    if (len(lines) == 2 and 
        lines[0].startswith('Answer:') and 
        lines[1].startswith('Questions:')):
        # Already clean
        cleaned_drafts.append(draft)
        stats['already_clean'] += 1
    else:
        # Needs cleaning
        cleaned_text = clean_student_draft(original_text)
        cleaned_draft = {
            'question': draft['question'],
            'student_draft': cleaned_text
        }
        cleaned_drafts.append(cleaned_draft)
        
        # Count the type of cleaning performed
        if len(original_text) > len(cleaned_text) * 2:  # Significant reduction
            stats['cleaned'] += 1
        else:
            stats['extracted'] += 1

# Save cleaned drafts
print(f"Saving cleaned drafts to {cleaned_drafts_path}")
with open(cleaned_drafts_path, 'w', encoding='utf-8') as f:
    for draft in cleaned_drafts:
        f.write(json.dumps(draft, ensure_ascii=False) + '\n')

# Also update the original file
print(f"Updating original file {student_drafts_path_full}")
with open(student_drafts_path_full, 'w', encoding='utf-8') as f:
    for draft in cleaned_drafts:
        f.write(json.dumps(draft, ensure_ascii=False) + '\n')

print(f"\n=== CLEANING STATISTICS ===")
print(f"Total drafts processed: {stats['total']}")
print(f"Already clean: {stats['already_clean']}")
print(f"Required major cleaning: {stats['cleaned']}")
print(f"Required extraction: {stats['extracted']}")
print(f"Cleaning success rate: {((stats['total'] - (stats['cleaned'] + stats['extracted']))/stats['total']*100):.1f}%")

# Show some examples
print(f"\n=== SAMPLE CLEANED DRAFTS ===")
for i, draft in enumerate(cleaned_drafts[:3]):
    print(f"Example {i+1}:")
    print(f"Question: {draft['question'][:60]}...")
    print(f"Draft: {draft['student_draft']}")
    print()

print("✅ Student drafts cleaned successfully!")
print("All drafts now contain only Answer and Questions lines without extra reasoning.")

Loading student drafts from c:\Users\noham\Desktop\Self-Improving-LLM\data\student_drafts.jsonl
Cleaning 200 student drafts...
Saving cleaned drafts to c:\Users\noham\Desktop\Self-Improving-LLM\data\student_drafts_cleaned.jsonl
Updating original file c:\Users\noham\Desktop\Self-Improving-LLM\data\student_drafts.jsonl

=== CLEANING STATISTICS ===
Total drafts processed: 200
Already clean: 183
Required major cleaning: 17
Required extraction: 0
Cleaning success rate: 91.5%

=== SAMPLE CLEANED DRAFTS ===
Example 1:
Question: Did Disney get most of Rudyard Kipling's The Jungle Book pro...
Draft: Answer: Uncertain - distribution rights and profits depend on agreements
Questions: What was the nature of Disney's agreement with Kipling's estate? How are profits typically divided in such cases?

Example 2:
Question: Did Robert Downey Jr. possess same caliber gun as Resident E...
Draft: Answer: No - Robert Downey Jr. is an actor, not a gun manufacturer or user
Questions: Is there a movie or conte

## Generate Teacher Responses

We now call GPT‑4 to obtain chain‑of‑thought (CoT) reasoning and final yes/no answers for each question/draft pair.  The prompt format follows the plan:

```
Q: <original yes/no question>
Student draft: <answer + clarifying questions>
Teacher: Please think step-by-step and provide your thought process and final Yes/No answer.
```

To run the actual API calls, you must provide a valid OpenAI API key.  If `DRY_RUN` is true (the default) or no API key is set, dummy responses are generated for testing purposes.
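
As a minimal sketch of the prompt format and the dry‑run switch described above, the helpers below compose the three‑part prompt and return a placeholder unless a model callable is supplied. `build_teacher_prompt`, `get_teacher_response`, the `call_model` parameter, and the placeholder text are all hypothetical and only illustrate the control flow; they are not the notebook's implementation.

```python
# Made-up placeholder returned when no real model call is made.
DRY_RUN_PLACEHOLDER = "Placeholder reasoning for dry-run mode.\nFinal answer: No"


def build_teacher_prompt(question: str, student_draft: str) -> str:
    """Compose the question, the student draft, and the teacher instruction."""
    return (
        f"Q: {question}\n"
        f"Student draft: {student_draft}\n"
        "Teacher: Please think step-by-step and provide your thought process "
        "and final Yes/No answer."
    )


def get_teacher_response(question, student_draft, dry_run=True, call_model=None):
    """Return a placeholder in dry-run mode, else delegate to call_model(prompt)."""
    prompt = build_teacher_prompt(question, student_draft)
    if dry_run or call_model is None:
        return DRY_RUN_PLACEHOLDER
    return call_model(prompt)
```

Passing the model call in as a function keeps the prompt logic testable without network access or an API key.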


In [16]:
import os
import json
from openai import OpenAI
import re
from tqdm import tqdm

# Load student drafts (fix path)
parent_dir = os.path.dirname(os.getcwd())
student_drafts_path_full = os.path.join(parent_dir, STUDENT_DRAFTS_PATH)
print(f"Loading student drafts from {student_drafts_path_full}")
with open(student_drafts_path_full, 'r', encoding='utf-8') as f:
    drafts = [json.loads(line) for line in f]

# Get API key from environment
if not OPENAI_API_KEY and not DRY_RUN:
    print("Warning: OPENAI_API_KEY not set. Set DRY_RUN=True or provide an API key.")

def extract_yes_no(text: str) -> str:
    """Extract a yes/no answer from the teacher's structured response.
    
    Looks for the Final Assessment section with **YES** or **NO** in bold.
    Falls back to searching for yes/no in the text if structured format not found.
    """
    # First, try to find the structured final assessment
    final_assessment_match = re.search(r'## Final Assessment.*?\*\*(YES|NO)\*\*', text, re.IGNORECASE | re.DOTALL)
    if final_assessment_match:
        return final_assessment_match.group(1).capitalize()
    
    # Fallback to original method for backwards compatibility
    match = re.search(r"\b(yes|no)\b", text, re.IGNORECASE)
    if match:
        return match.group(1).capitalize()
    
    return "No"  # Default fallback instead of returning full text

def call_gpt4(q: str, draft: str) -> str:
    client = OpenAI(api_key=OPENAI_API_KEY)
    
    system_prompt = """You are an expert teacher helping a student AI model learn to reason through complex yes/no questions. Your role is to provide educational reasoning that demonstrates good thinking patterns.

CRITICAL: You must follow this exact response format:

## Teaching Analysis
[Provide 2-3 sentences acknowledging the student's approach and identifying key reasoning steps needed]

## Step-by-Step Reasoning
[Provide clear, educational reasoning in numbered steps that the student can learn from and apply to similar questions]

## Final Assessment
Based on this analysis, the answer is: **[YES/NO]**

REQUIREMENTS:
- Always use the exact format above with the specified headers
- Keep your reasoning educational and transferable
- Address the student's specific concerns when they raise valid points
- The Final Assessment section must contain exactly "**YES**" or "**NO**" in bold
- Be concise but thorough in your explanations"""
    
    user_prompt = f"""QUESTION: {q}

STUDENT DRAFT: {draft}

Provide your teaching analysis following the required format."""
    
    response = client.chat.completions.create(
        model=GPT4_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        max_tokens=GPT4_MAX_TOKENS,
        temperature=GPT4_TEMPERATURE
    )
    return response.choices[0].message.content.strip()

# Create output directory if it doesn't exist (fix path)
teacher_outputs_path_full = os.path.join(parent_dir, TEACHER_OUTPUTS_PATH)
os.makedirs(os.path.dirname(teacher_outputs_path_full), exist_ok=True)

print(f"Generating teacher responses with improved prompts - model: {GPT4_MODEL} (dry_run: {DRY_RUN})")
with open(teacher_outputs_path_full, 'w', encoding='utf-8') as out_f:
    for rec in tqdm(drafts, desc='Generating teacher responses'):
        q = rec['question']
        draft = rec['student_draft']
        
        if DRY_RUN or not OPENAI_API_KEY:
            response_text = '''## Teaching Analysis
Good approach with clarifying questions. Let me analyze this systematically.

## Step-by-Step Reasoning
1. This is a placeholder reasoning step for dry run mode
2. Set DRY_RUN to False for real GPT-4 calls
3. The structured format ensures easy parsing

## Final Assessment
Based on this analysis, the answer is: **NO**'''
        else:
            response_text = call_gpt4(q, draft)
            
        answer = extract_yes_no(response_text)
        out_record = {
            'question': q,
            'student_draft': draft,
            'teacher_thought': response_text,
            'teacher_answer': answer
        }
        out_f.write(json.dumps(out_record, ensure_ascii=False) + '\n')

print(f"Teacher outputs written to {teacher_outputs_path_full}")
print("Improved prompts now provide structured responses for reliable parsing!")

Loading student drafts from c:\Users\noham\Desktop\Self-Improving-LLM\data\student_drafts.jsonl
Generating teacher responses with improved prompts - model: gpt-4.1-nano-2025-04-14 (dry_run: False)


Generating teacher responses: 100%|██████████| 200/200 [09:26<00:00,  2.83s/it]

Teacher outputs written to c:\Users\noham\Desktop\Self-Improving-LLM\data\teacher_outputs.jsonl
Improved prompts now provide structured responses for reliable parsing!
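
The fallback path in `extract_yes_no` accepts the first standalone "yes" or "no" anywhere in the response and otherwise defaults to "No", so a response that never reaches its Final Assessment section (for example, one cut off by the token limit) can still yield a label. A stricter variant is sketched below under the assumption that dropping unparseable responses is acceptable; `extract_final_answer` is a hypothetical name, and it returns `None` rather than guessing.

```python
import re
from typing import Optional

# Match the bolded verdict only inside the "## Final Assessment" section.
_FINAL_ASSESSMENT = re.compile(
    r"##\s*Final Assessment.*?\*\*\s*(YES|NO)\s*\*\*",
    re.IGNORECASE | re.DOTALL,
)


def extract_final_answer(text: str) -> Optional[str]:
    """Return 'Yes' or 'No' from the structured verdict, or None if it is missing."""
    match = _FINAL_ASSESSMENT.search(text)
    if match is None:
        return None
    return match.group(1).capitalize()


complete = (
    "## Step-by-Step Reasoning\n1. There is no direct evidence.\n\n"
    "## Final Assessment\nBased on this analysis, the answer is: **YES**"
)
truncated = "## Step-by-Step Reasoning\n1. There is no direct evidence, so"
```

Records for which this returns `None` could then be counted and skipped during validation instead of entering the corpus with a default label.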





## Build Training Corpora

Finally, we build two parallel training corpora:

1. **Baseline (Track A)** – pairs of `(question → answer)` for training a basic model.
2. **CoT (Track B)** – pairs of `(question + teacher chain-of-thought → answer)` for CoT distillation.

These files will be used in later steps for model fine‑tuning.
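
To make the two tracks concrete, the sketch below turns a few toy teacher records into baseline and CoT examples, keeping only records whose teacher answer agrees with the ground truth. The records and the `build_tracks` helper are made up for illustration; the field names follow the teacher output file written earlier.

```python
def build_tracks(teacher_records, ground_truth):
    """Split validated teacher records into baseline and CoT training examples."""
    baseline, cot = [], []
    for rec in teacher_records:
        truth = ground_truth.get(rec["question"])
        if truth is None:
            continue  # no label available for this question
        expected = "Yes" if truth else "No"
        if rec["teacher_answer"].strip().capitalize() != expected:
            continue  # teacher disagreed with the label
        # Track A: the question alone maps to the answer.
        baseline.append({"prompt": rec["question"], "answer": expected})
        # Track B: question, student draft, and teacher reasoning map to the answer.
        cot_prompt = (
            f"Question: {rec['question']}\n"
            f"Student draft: {rec['student_draft']}\n"
            f"Teacher reasoning: {rec['teacher_thought']}"
        )
        cot.append({"prompt": cot_prompt, "answer": expected})
    return baseline, cot


# Toy data (made up): one correct, one incorrect, one without a label.
toy_truth = {"Is water wet?": True, "Can fish fly a plane?": False}
toy_teacher = [
    {"question": "Is water wet?", "student_draft": "Answer: Yes - liquid",
     "teacher_thought": "Water adheres to surfaces.", "teacher_answer": "Yes"},
    {"question": "Can fish fly a plane?", "student_draft": "Answer: No - no hands",
     "teacher_thought": "Some reasoning.", "teacher_answer": "Yes"},
    {"question": "Unlabelled question?", "student_draft": "Answer: No - unsure",
     "teacher_thought": "Some reasoning.", "teacher_answer": "No"},
]
baseline, cot = build_tracks(toy_teacher, toy_truth)
```

Both lists stay aligned index by index, so the two models are trained on the same questions and differ only in the context they see.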


In [20]:
import json
import os

# Load teacher outputs (fix path)
parent_dir = os.path.dirname(os.getcwd())
teacher_outputs_path_full = os.path.join(parent_dir, TEACHER_OUTPUTS_PATH)
print(f"Loading teacher outputs from {teacher_outputs_path_full}")
with open(teacher_outputs_path_full, 'r', encoding='utf-8') as f:
    teacher_data = [json.loads(line) for line in f]

# Load original training data with ground truth answers
sample_train_path_full = os.path.join(parent_dir, SAMPLE_TRAIN_PATH)
print(f"Loading ground truth data from {sample_train_path_full}")
with open(sample_train_path_full, 'r', encoding='utf-8') as f:
    ground_truth_data = [json.loads(line) for line in f]

# Create a mapping from question to ground truth answer
ground_truth_map = {item['question']: item['answer'] for item in ground_truth_data}

def is_valid_answer(answer: str) -> bool:
    """Check if the answer is a valid Yes/No response."""
    return answer.strip().lower() in ['yes', 'no']

def convert_to_yes_no(boolean_answer: bool) -> str:
    """Convert boolean ground truth to Yes/No string."""
    return "Yes" if boolean_answer else "No"

# Create baseline and CoT records with validation
baseline_records = []
cot_records = []
validation_stats = {
    'total_processed': 0,
    'invalid_teacher_answers': 0,
    'teacher_wrong_answers': 0,
    'valid_records': 0,
    'ground_truth_missing': 0
}

print("Processing and validating teacher responses...")
for rec in teacher_data:
    q = rec['question']
    draft = rec['student_draft']
    thought = rec['teacher_thought']
    teacher_answer = rec['teacher_answer']
    
    validation_stats['total_processed'] += 1
    
    # Check if we have ground truth for this question
    if q not in ground_truth_map:
        validation_stats['ground_truth_missing'] += 1
        print(f"Warning: No ground truth found for question: {q[:50]}...")
        continue
    
    ground_truth_answer = convert_to_yes_no(ground_truth_map[q])
    
    # Validate teacher answer format
    if not is_valid_answer(teacher_answer):
        validation_stats['invalid_teacher_answers'] += 1
        print(f"Skipping invalid teacher answer: '{teacher_answer}' for question: {q[:50]}...")
        continue
    
    # Check if teacher answer matches ground truth
    if teacher_answer.strip().capitalize() != ground_truth_answer:
        validation_stats['teacher_wrong_answers'] += 1
        # print(f"Skipping incorrect teacher answer: Teacher='{teacher_answer}', Truth='{ground_truth_answer}' for question: {q[:50]}...")
        continue
    
    # If we reach here, the data point is valid
    validation_stats['valid_records'] += 1
    
    # Track A: Baseline (question → answer)
    baseline_records.append({'prompt': q, 'answer': teacher_answer})
    
    # Track B: CoT with student draft context (self-improvement format)
    cot_prompt = f"Question: {q}\nStudent draft: {draft}\nTeacher reasoning: {thought}"
    cot_records.append({'prompt': cot_prompt, 'answer': teacher_answer})

# Create output directories if they don't exist (fix paths)
baseline_path_full = os.path.join(parent_dir, BASELINE_PATH)
cot_path_full = os.path.join(parent_dir, COT_PATH)
os.makedirs(os.path.dirname(baseline_path_full), exist_ok=True)
os.makedirs(os.path.dirname(cot_path_full), exist_ok=True)

# Write output files
print(f"Writing validated baseline corpus to {baseline_path_full}")
with open(baseline_path_full, 'w', encoding='utf-8') as f:
    for r in baseline_records:
        f.write(json.dumps(r) + '\n')

print(f"Writing validated CoT corpus to {cot_path_full}")
with open(cot_path_full, 'w', encoding='utf-8') as f:
    for r in cot_records:
        f.write(json.dumps(r) + '\n')

print(f"Baseline corpus saved to {baseline_path_full}")
print(f"CoT corpus saved to {cot_path_full}")

# Print validation statistics
print(f"\n=== VALIDATION STATISTICS ===")
print(f"Total processed: {validation_stats['total_processed']}")
print(f"Ground truth missing: {validation_stats['ground_truth_missing']}")
print(f"Invalid teacher answers (not Yes/No): {validation_stats['invalid_teacher_answers']}")
print(f"Wrong teacher answers (vs ground truth): {validation_stats['teacher_wrong_answers']}")
print(f"Valid records kept: {validation_stats['valid_records']}")
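# Guard against an empty teacher file to avoid a ZeroDivisionError
total_processed = validation_stats['total_processed']
quality_rate = validation_stats['valid_records'] / total_processed * 100 if total_processed else 0.0
print(f"Data quality rate: {quality_rate:.1f}%")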

print(f"\nFinal training corpora:")
print(f"- Baseline: {len(baseline_records)} validated examples")
print(f"- CoT: {len(cot_records)} validated examples")
print(f"\nAll training data is now validated against ground truth and contains only clean Yes/No answers!")

Loading teacher outputs from c:\Users\noham\Desktop\Self-Improving-LLM\data\teacher_outputs.jsonl
Loading ground truth data from c:\Users\noham\Desktop\Self-Improving-LLM\data\sample_train.jsonl
Processing and validating teacher responses...
Writing validated baseline corpus to c:\Users\noham\Desktop\Self-Improving-LLM\data\train_baseline.jsonl
Writing validated CoT corpus to c:\Users\noham\Desktop\Self-Improving-LLM\data\train_cot.jsonl
Baseline corpus saved to c:\Users\noham\Desktop\Self-Improving-LLM\data\train_baseline.jsonl
CoT corpus saved to c:\Users\noham\Desktop\Self-Improving-LLM\data\train_cot.jsonl

=== VALIDATION STATISTICS ===
Total processed: 200
Ground truth missing: 0
Invalid teacher answers (not Yes/No): 0
Wrong teacher answers (vs ground truth): 63
Valid records kept: 137
Data quality rate: 68.5%

Final training corpora:
- Baseline: 137 validated examples
- CoT: 137 validated examples

All training data is now validated against ground truth and contains only clean Yes/No answers!
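
The filtering rules applied in the cell above can be sketched as a small pure function, which makes them easy to check without reading or writing files. This is a minimal sketch, not the notebook's code: `filter_teacher_records` and the stand-in `is_yes_no` are hypothetical names (the notebook's own `is_valid_answer` is defined elsewhere and may be stricter), and the toy records are invented for illustration.

```python
from collections import Counter

def is_yes_no(answer: str) -> bool:
    """Stand-in for the notebook's is_valid_answer (assumption: accepts yes/no in any case)."""
    return answer.strip().lower() in ("yes", "no")

def filter_teacher_records(records, ground_truth):
    """Keep only records whose teacher answer is well-formed and matches ground truth."""
    kept, stats = [], Counter()
    for rec in records:
        stats["total_processed"] += 1
        question = rec["question"]
        if question not in ground_truth:
            stats["ground_truth_missing"] += 1
            continue
        if not is_yes_no(rec["teacher_answer"]):
            stats["invalid_teacher_answers"] += 1
            continue
        expected = "Yes" if ground_truth[question] else "No"
        if rec["teacher_answer"].strip().capitalize() != expected:
            stats["teacher_wrong_answers"] += 1
            continue
        stats["valid_records"] += 1
        # Store the canonical spelling rather than the raw teacher string
        kept.append({**rec, "teacher_answer": expected})
    return kept, stats

# Toy data, invented for illustration only
records = [
    {"question": "Q1", "teacher_answer": "yes"},    # correct after normalization
    {"question": "Q2", "teacher_answer": "Maybe"},  # not a Yes/No answer
    {"question": "Q3", "teacher_answer": "No"},     # well-formed but wrong
]
kept, stats = filter_teacher_records(records, {"Q1": True, "Q2": False, "Q3": True})
print(len(kept), dict(stats))
```

Returning the counters alongside the kept records keeps the statistics report separate from the filtering itself.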

## Phase A: Baseline Training

Phase A fine-tunes the student model on direct question-answer pairs, with no CoT reasoning in the prompt. This establishes the baseline performance against which the later self-improvement phases are measured.
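
As a rough illustration of what a direct question-answer pair looks like, the sketch below formats one baseline record as `Question: ...\nAnswer: ...` (the template used by the tokenization cell further down) and shows that the evaluation prompt is the same text with the answer left off. The helper names and the sample question are hypothetical.

```python
def format_training_text(prompt: str, answer: str) -> str:
    """Full text the model is trained on: question followed by the gold answer."""
    return f"Question: {prompt}\nAnswer: {answer}"

def format_eval_prompt(question: str) -> str:
    """Prompt used at evaluation time: identical prefix, answer omitted."""
    return f"Question: {question}\nAnswer:"

record = {"prompt": "Is the sky green?", "answer": "No"}  # invented example
train_text = format_training_text(record["prompt"], record["answer"])
eval_prompt = format_eval_prompt(record["prompt"])

# The evaluation prompt is a strict prefix of the training text,
# so the model only has to continue with the answer token(s).
print(train_text.startswith(eval_prompt))
print(repr(train_text[len(eval_prompt):]))
```

Keeping the two templates aligned matters because a mismatch between training and evaluation prompts can depress accuracy for reasons unrelated to the model.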

In [6]:
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,      
    TrainingArguments, Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import torch
import os
import gc

# Phase A Configuration
PHASE_A_CONFIG = {
    'model_name': MODEL_NAME,
    'train_file': os.path.join(parent_dir, BASELINE_PATH),
    'output_dir': os.path.join(parent_dir, 'models', 'baseline_phaseA'),
    'max_length': MAX_SEQ_LENGTH,
    'num_epochs': 3,
    'batch_size': 4,
    'gradient_accumulation_steps': 8,
    'learning_rate': 2e-4,
    'use_4bit': USE_4BIT
}

print("=== Phase A: Baseline Training ===")   
print(f"Model: {PHASE_A_CONFIG['model_name']}")
print(f"Training file: {PHASE_A_CONFIG['train_file']}")
print(f"Output directory: {PHASE_A_CONFIG['output_dir']}")
print(f"Max sequence length: {PHASE_A_CONFIG['max_length']}")
print(f"Training for {PHASE_A_CONFIG['num_epochs']} epochs")

# Clear previous model from memory
if 'model' in locals():
    del model
if 'tokenizer' in locals():
    del tokenizer
gc.collect()
torch.cuda.empty_cache()

# Load fresh model for training
print(f"Loading model for training: {PHASE_A_CONFIG['model_name']}")

if PHASE_A_CONFIG['use_4bit']:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,       
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16  
    )
    model = AutoModelForCausalLM.from_pretrained(
        PHASE_A_CONFIG['model_name'],
        quantization_config=bnb_config,       
        device_map="auto",
        torch_dtype=torch.float16
    )
    # Prepare model for k-bit training first  
    model = prepare_model_for_kbit_training(model)
    print("Model prepared for 4-bit training")
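else:
    # Select the device explicitly; `device` is not defined earlier in this cell
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = AutoModelForCausalLM.from_pretrained(
        PHASE_A_CONFIG['model_name'],
        torch_dtype=torch.float16
    ).to(device)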

tokenizer = AutoTokenizer.from_pretrained(PHASE_A_CONFIG['model_name'])
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token 

# LoRA Configuration - Compatible with 4-bit quantization
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
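    # Llama-style projection names. On microsoft/phi-2 only q_proj, k_proj and v_proj
    # exist, so the adapter attaches to those three (see the trainable-params log).
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],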
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    inference_mode=False,
)

try:
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        
    print("✅ LoRA adapter successfully attached")
except Exception as e:
    print(f"❌ Error applying LoRA: {e}")     
    print("Troubleshooting: Using alternative LoRA configuration...")
    
    # Alternative LoRA config for compatibility
    lora_config = LoraConfig(
        r=8,  # Smaller rank
        lora_alpha=16,
        target_modules="all-linear",  # Auto-detect linear layers
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        
    print("✅ Alternative LoRA configuration applied successfully")

print(f"GPU Memory after model setup: {torch.cuda.memory_allocated()/1e9:.2f} GB")  

=== Phase A: Baseline Training ===
Model: microsoft/phi-2
Training file: c:\Users\noham\Desktop\Self-Improving-LLM\data\train_baseline.jsonl
Output directory: c:\Users\noham\Desktop\Self-Improving-LLM\models\baseline_phaseA
Max sequence length: 2048
Training for 3 epochs
Loading model for training: microsoft/phi-2


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Fetching 2 files: 100%|██████████| 2/2 [01:58<00:00, 59.18s/it] 
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.24s/it]


Model prepared for 4-bit training
trainable params: 7,864,320 || all params: 2,787,548,160 || trainable%: 0.2821
✅ LoRA adapter successfully attached
GPU Memory after model setup: 2.38 GB
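
The trainable-parameter figure in the log can be sanity-checked by hand. A LoRA adapter of rank `r` on a linear layer mapping `d_in` to `d_out` features adds `r * (d_in + d_out)` weights. The sketch below assumes Phi-2's published dimensions (hidden size 2560, 32 decoder layers) and assumes that only `q_proj`, `k_proj` and `v_proj` matched `target_modules`. Under those assumptions the count reproduces the logged 7,864,320, which suggests the remaining names in the list matched no layer in this model.

```python
def lora_param_count(rank: int, in_features: int, out_features: int) -> int:
    """Weights added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

RANK = 16            # r in the primary LoraConfig
HIDDEN_SIZE = 2560   # assumption: Phi-2 hidden size
NUM_LAYERS = 32      # assumption: Phi-2 decoder layer count
MATCHED_MODULES = ["q_proj", "k_proj", "v_proj"]  # assumption: the only names that matched

per_layer = sum(lora_param_count(RANK, HIDDEN_SIZE, HIDDEN_SIZE) for _ in MATCHED_MODULES)
trainable = per_layer * NUM_LAYERS

logged_all_params = 2_787_548_160  # "all params" from the log above
base_params = logged_all_params - trainable

print(f"trainable: {trainable:,}")
print(f"base model: {base_params:,}")
print(f"trainable%: {100 * trainable / logged_all_params:.4f}")
```

If the MLP layers should also be adapted, their names would need to be taken from the loaded model (for example by listing `model.named_modules()`) rather than from another architecture.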


In [8]:
# Load and prepare training data
print("Loading baseline training data...")
train_dataset = load_dataset('json', data_files=PHASE_A_CONFIG['train_file'], split='train')

def tokenize_function(examples):
    # Format: "Question: {prompt}\nAnswer: {answer}"
    texts = [f"Question: {prompt}\nAnswer: {answer}" for prompt, answer in zip(examples['prompt'], examples['answer'])]
    
    tokenized = tokenizer(
        texts,
        truncation=True,
        padding='max_length',  # Fix: Use consistent padding
        max_length=PHASE_A_CONFIG['max_length'],
        return_tensors=None
    )
    
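    # For causal LM, labels mirror input_ids. DataCollatorForLanguageModeling(mlm=False)
    # rebuilds them per batch and sets pad-token positions to -100 so they are ignored in the loss.
    tokenized['labels'] = tokenized['input_ids'].copy()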
    return tokenized

# Tokenize the dataset
tokenized_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=train_dataset.column_names)

# Data collator for causal language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

print(f"Training dataset size: {len(tokenized_dataset)}")

# Adjust batch size for small dataset (these values override the batch_size and
# gradient_accumulation_steps entries in PHASE_A_CONFIG)
effective_batch_size = min(32, len(tokenized_dataset))
batch_size = 2
gradient_accumulation_steps = max(1, effective_batch_size // batch_size)

print(f"Adjusted batch size: {batch_size}, grad accumulation: {gradient_accumulation_steps}")

# Training arguments
training_args = TrainingArguments(
    output_dir=PHASE_A_CONFIG['output_dir'],
    overwrite_output_dir=True,
    num_train_epochs=PHASE_A_CONFIG['num_epochs'],
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=PHASE_A_CONFIG['learning_rate'],
    warmup_steps=5,
    logging_steps=5,
    save_steps=50,
    save_total_limit=2,
    prediction_loss_only=True,
    remove_unused_columns=False,
    dataloader_pin_memory=False,
    fp16=True,
    report_to="none",  # the string "none" disables logging integrations explicitly
    dataloader_num_workers=0,
)

# Create and run trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

print("Starting Phase A training...")
trainer.train()
trainer.save_model()

print("✅ Phase A training completed!")
print(f"Model saved to: {PHASE_A_CONFIG['output_dir']}")
print(f"Final GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

Loading baseline training data...


Map: 100%|██████████| 137/137 [00:00<00:00, 1108.54 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Training dataset size: 137
Adjusted batch size: 2, grad accumulation: 16
Starting Phase A training...




| Step | Training Loss |
|------|---------------|
| 5    | 3.0129        |
| 10   | 2.7491        |
| 15   | 2.6159        |


✅ Phase A training completed!
Model saved to: c:\Users\noham\Desktop\Self-Improving-LLM\models\baseline_phaseA
Final GPU Memory: 2.46 GB
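
The three logged steps are consistent with the batch settings printed above. A back-of-the-envelope check, assuming the trailing partial accumulation group in each epoch still triggers an optimizer update (this detail differs between `transformers` versions, so treat it as a sketch rather than a description of the Trainer internals):

```python
import math

num_examples = 137     # "Training dataset size" from the log
per_device_batch = 2   # "Adjusted batch size"
grad_accum = 16        # "grad accumulation"
epochs = 3
logging_steps = 5

batches_per_epoch = math.ceil(num_examples / per_device_batch)
# Assumption: a final, partially filled accumulation group counts as one update
updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
total_updates = updates_per_epoch * epochs

logged_steps = [s for s in range(1, total_updates + 1) if s % logging_steps == 0]

print(f"batches/epoch={batches_per_epoch}, updates/epoch={updates_per_epoch}, total={total_updates}")
print(f"logged at steps: {logged_steps}")
```

With only about 15 optimizer updates in total, the 5 warmup steps cover roughly a third of training, which is worth keeping in mind when reading the loss curve.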


In [9]:
import json
import os
import re

import torch
from datasets import load_dataset
from tqdm import tqdm

def evaluate_model_on_test(model, tokenizer, test_file, device):
    """Evaluate the model on test set and return accuracy."""
    
    # Load test data
    test_dataset = load_dataset('json', data_files=test_file, split='train')
    
    model.eval()
    correct = 0
    total = 0
    
    print(f"Evaluating on {len(test_dataset)} test examples...")
    
    for example in tqdm(test_dataset, desc="Evaluating"):
        question = example['question']
        ground_truth = "Yes" if example['answer'] else "No"
        
        # Format prompt same as training
        prompt = f"Question: {question}\nAnswer:"
        
        # Tokenize
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
        
        # Generate
        with torch.no_grad():
            outputs = model.generate(
                inputs['input_ids'],
                attention_mask=inputs['attention_mask'],
                max_new_tokens=5,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id
            )
        
        # Extract generated answer
        generated_text = tokenizer.decode(outputs[0][len(inputs['input_ids'][0]):], skip_special_tokens=True)
        
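        # Parse Yes/No answer; generations with neither word stay None and count as incorrect,
        # so unparseable output is not silently scored as "No"
        predicted_answer = None
        if re.search(r'\byes\b', generated_text.lower()):
            predicted_answer = "Yes"
        elif re.search(r'\bno\b', generated_text.lower()):
            predicted_answer = "No"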
        
        # Check correctness
        if predicted_answer == ground_truth:
            correct += 1
        total += 1
        
        # Debug first few examples
        if total <= 3:
            print(f"Q: {question[:50]}...")
            print(f"Generated: '{generated_text.strip()}'")
            print(f"Predicted: {predicted_answer}, Ground truth: {ground_truth}")
            print("---")
    
    accuracy = correct / total * 100
    print(f"\n=== Phase A Baseline Results ===")
    print(f"Test Accuracy: {accuracy:.1f}% ({correct}/{total})")
    print(f"Target was ~60%, {'✅ SUCCESS' if accuracy >= 55 else '⚠️ BELOW TARGET'}")
    
    return accuracy

# Run evaluation on test set
parent_dir = os.path.dirname(os.getcwd())
test_file = os.path.join(parent_dir, 'data', 'raw', 'strategyqa_test.jsonl')

# Resolve the device from the loaded model (works for both the 4-bit and fp16 paths)
device = next(model.parameters()).device

baseline_accuracy = evaluate_model_on_test(model, tokenizer, test_file, device)

# Save results for comparison with later phases
results = {
    'phase_a_baseline_accuracy': baseline_accuracy,
    'model_path': PHASE_A_CONFIG['output_dir'],
    'dataset_size': len(tokenized_dataset),
    'training_epochs': PHASE_A_CONFIG['num_epochs']
}

results_file = os.path.join(parent_dir, 'results_phase_a.json')
with open(results_file, 'w') as f:
    json.dump(results, f, indent=2)

print(f"Results saved to: {results_file}")

## Phase A Evaluation

We now evaluate the Phase A model on the test set to establish the performance baseline. The training plan expects roughly 60% accuracy at this stage.

**What this evaluation does:**
- Loads the StrategyQA test set (687 examples)
- Generates Yes/No answers using the trained baseline model
- Compares predictions against ground truth labels
- Calculates accuracy and saves results for comparison with future phases

This baseline accuracy is the reference point for measuring Phase B (CoT distillation) and Phase C (DPO alignment), which should achieve improvements of +7-10 and +10 percentage points respectively.
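
The scoring step described in the list above can be sketched in isolation. This is a minimal sketch rather than the notebook's exact code: `parse_yes_no` and `accuracy` are hypothetical helpers, the first `yes` or `no` word in the generation decides the prediction, generations containing neither word are scored as incorrect (a design choice), and the sample generations are invented.

```python
import re
from typing import Optional, Sequence

def parse_yes_no(text: str) -> Optional[str]:
    """Return "Yes" or "No" for the first whole-word match, or None if neither appears."""
    match = re.search(r"\b(yes|no)\b", text.lower())
    if match is None:
        return None
    return match.group(1).capitalize()

def accuracy(generations: Sequence[str], labels: Sequence[bool]) -> float:
    """Percentage of generations whose parsed answer equals the boolean label."""
    if not labels:
        return 0.0
    correct = sum(
        parse_yes_no(generated) == ("Yes" if label else "No")
        for generated, label in zip(generations, labels)
    )
    return 100.0 * correct / len(labels)

# Invented generations for illustration
generations = [" Yes, because it floats.", "No.", "I am not sure"]
labels = [True, False, False]

print(f"Accuracy: {accuracy(generations, labels):.1f}%")
```

Scoring unparseable output as wrong avoids inflating accuracy on questions whose label happens to match a default guess.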