# AskQE Pipeline - Qwen2.5-3B-Instruct Baseline

This notebook runs the complete AskQE pipeline using the **Qwen/Qwen2.5-3B-Instruct** model.
All results are saved in the `results Qwen3B baseline/` folder.

**Note:** When run in Colab, models are cached on Google Drive so that later runs skip the download.

## 0. Mount Google Drive & Configure Model Cache

This section mounts Google Drive and configures the model cache directory to avoid re-downloading models on each run.

In [None]:
# Mount Google Drive (only works in Colab)
import os
import sys

# Check if running in Colab
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Set model cache directory on Google Drive
    DRIVE_CACHE_DIR = '/content/drive/MyDrive/AskQE_Models_Cache'
    os.makedirs(DRIVE_CACHE_DIR, exist_ok=True)
    
    # Configure Hugging Face caches (set before transformers is imported so the libraries pick them up)
    os.environ['HF_HOME'] = DRIVE_CACHE_DIR
    os.environ['TRANSFORMERS_CACHE'] = os.path.join(DRIVE_CACHE_DIR, 'transformers')
    os.environ['SENTENCE_TRANSFORMERS_HOME'] = os.path.join(DRIVE_CACHE_DIR, 'sentence_transformers')
    
    print(f'Model cache directory: {DRIVE_CACHE_DIR}')
else:
    print('Not running in Colab - using default cache directories')

## Setup - Install Dependencies

In [None]:
import subprocess
import sys
import os

# Install dependencies
subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'transformers', 'torch', 'accelerate', 'nltk', 'sentence-transformers', 'sacrebleu', 'textstat'], check=True)

# Get project root
if IN_COLAB:
    # Clone repo if not present
    if not os.path.exists('/content/askqe'):
        subprocess.run(['git', 'clone', 'https://github.com/Simone280802/AskQE_DNLP_2025-2026.git', '/content/askqe'], check=True)
    PROJECT_ROOT = '/content/askqe'
else:
    PROJECT_ROOT = os.getcwd()

RESULTS_DIR = os.path.join(PROJECT_ROOT, 'results Qwen3B baseline')
os.makedirs(RESULTS_DIR, exist_ok=True)

print(f'Project root: {PROJECT_ROOT}')
print(f'Results directory: {RESULTS_DIR}')
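
The later cells call `os.chdir` before launching each script, so a cell run out of order can start in the wrong directory. As a minimal sketch of one alternative, the hypothetical helper below (`run_script` is not part of the repository) passes the working directory to `subprocess.run` through its `cwd` argument and leaves the notebook's own directory unchanged.

```python
import subprocess
import sys

def run_script(script_dir, script, *args):
    """Run `script` with the current interpreter, using script_dir as its working directory.

    Hypothetical helper for illustration; the notebook itself uses os.chdir instead.
    """
    cmd = [sys.executable, '-u', script, *args]
    # cwd only affects the child process, so the notebook's working directory is untouched
    return subprocess.run(cmd, cwd=script_dir, check=True)
```

Assuming the layout used below, a call might look like `run_script(os.path.join(PROJECT_ROOT, 'QG', 'code'), 'qwen-3b.py', '--output_path', output_path, '--prompt', 'vanilla')`.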

## Pre-download Models

Download all models needed for the pipeline. These will be saved to Google Drive and reused in future runs.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import torch

print('=== Downloading/Loading Models ===')
print('This may take a while on first run, but will be cached on Drive for future use.\n')

# List of models to download
MODELS = {
    'qwen': 'Qwen/Qwen2.5-3B-Instruct',
    'sbert': 'sentence-transformers/all-MiniLM-L6-v2',
    'qa_model': 'potsawee/longformer-large-4096-answerable-squad2'  # For answerability evaluation
}

# Download Qwen model
print(f"[1/3] Loading {MODELS['qwen']}...")
tokenizer = AutoTokenizer.from_pretrained(MODELS['qwen'])
model = AutoModelForCausalLM.from_pretrained(
    MODELS['qwen'],
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
print(f"      ✓ Qwen model loaded")

# Free memory - model will be reloaded when needed
del model, tokenizer
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Download SBERT model
print(f"[2/3] Loading {MODELS['sbert']}...")
sbert_model = SentenceTransformer(MODELS['sbert'])
print(f"      ✓ SBERT model loaded")
del sbert_model

# Cache the QA tokenizer for answerability (only the tokenizer is fetched here, not the model weights)
print(f"[3/3] Loading tokenizer for {MODELS['qa_model']}...")
try:
    qa_tokenizer = AutoTokenizer.from_pretrained(MODELS['qa_model'])
    print("      ✓ QA tokenizer loaded")
    del qa_tokenizer
except Exception as e:
    print(f"      ⚠ Could not load QA tokenizer: {e}")

print('\n=== All models cached! ===')
if IN_COLAB:
    print(f'Models saved to: {DRIVE_CACHE_DIR}')
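
To confirm that the cache was actually populated, one option is to measure how much data sits under the cache directory. The sketch below is an illustration only: `dir_size_bytes` is a hypothetical helper, and it falls back to the default Hugging Face location when `HF_HOME` is unset.

```python
import os

def dir_size_bytes(path):
    """Return the total size in bytes of regular files under path (0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that linked snapshot files are not counted twice
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

# Falls back to ~/.cache/huggingface when HF_HOME is not set
cache_dir = os.environ.get('HF_HOME', os.path.expanduser('~/.cache/huggingface'))
print(f'{cache_dir}: {dir_size_bytes(cache_dir) / 1e9:.2f} GB')
```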

---

## 1. Question Generation (QG)

Generate questions for each variant: vanilla, atomic, and semantic.

In [None]:
# Run QG for vanilla prompt
os.chdir(os.path.join(PROJECT_ROOT, 'QG', 'code'))
output_path = os.path.join(RESULTS_DIR, 'QG', 'vanilla_qwen-3b.jsonl')
os.makedirs(os.path.dirname(output_path), exist_ok=True)
subprocess.run([sys.executable, '-u', 'qwen-3b.py', '--output_path', output_path, '--prompt', 'vanilla'], check=True)

In [None]:
# Run QG for atomic prompt
output_path = os.path.join(RESULTS_DIR, 'QG', 'atomic_qwen-3b.jsonl')
subprocess.run([sys.executable, '-u', 'qwen-3b.py', '--output_path', output_path, '--prompt', 'atomic'], check=True)

In [None]:
# Run QG for semantic prompt
output_path = os.path.join(RESULTS_DIR, 'QG', 'semantic_qwen-3b.jsonl')
subprocess.run([sys.executable, '-u', 'qwen-3b.py', '--output_path', output_path, '--prompt', 'semantic'], check=True)
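
Each QG run writes a `.jsonl` file, so a quick sanity check is to count the records and list the fields they contain. The sketch below assumes only that every non-blank line is a JSON value; `summarize_jsonl` is a hypothetical helper, and no particular field names are assumed.

```python
import json

def summarize_jsonl(path):
    """Return (record_count, sorted field names) for a JSON Lines file."""
    count = 0
    keys = set()
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            if isinstance(record, dict):
                keys.update(record)  # collect field names seen across records
    return count, sorted(keys)

# Example (uses RESULTS_DIR from the setup cell):
# summarize_jsonl(os.path.join(RESULTS_DIR, 'QG', 'vanilla_qwen-3b.jsonl'))
```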

## 2. Question Answering (QA)

Answer questions based on source sentences and backtranslated MT.

In [None]:
# Run QA based on source sentences
os.chdir(os.path.join(PROJECT_ROOT, 'QA', 'code'))
qg_input = os.path.join(RESULTS_DIR, 'QG', 'vanilla_qwen-3b.jsonl')
output_path = os.path.join(RESULTS_DIR, 'QA', 'source', 'qa_source.jsonl')
os.makedirs(os.path.dirname(output_path), exist_ok=True)
subprocess.run([sys.executable, '-u', 'qwen-3b.py', '--output_path', output_path, '--sentence_type', 'source', '--qg_input_path', qg_input], check=True)

In [None]:
# Run QA based on backtranslated MT
qg_input = os.path.join(RESULTS_DIR, 'QG', 'vanilla_qwen-3b.jsonl')
output_path = os.path.join(RESULTS_DIR, 'QA', 'bt', 'qa_bt.jsonl')
os.makedirs(os.path.dirname(output_path), exist_ok=True)
subprocess.run([sys.executable, '-u', 'qwen-3b.py', '--output_path', output_path, '--sentence_type', 'bt', '--qg_input_path', qg_input], check=True)

## 3. BioMQM Pipeline

In [None]:
# Run BioMQM pipeline
os.chdir(os.path.join(PROJECT_ROOT, 'biomqm', 'askqe'))
output_path = os.path.join(RESULTS_DIR, 'biomqm', 'askqe_qg_qwen3b.jsonl')
os.makedirs(os.path.dirname(output_path), exist_ok=True)
subprocess.run([sys.executable, '-u', 'qwen-3b.py', '--output_path', output_path, '--prompt', 'atomic'], check=True)

## 4. Evaluation Metrics

### 4.1 SBERT (Sentence-BERT Cosine Similarity)

In [None]:
# Run SBERT evaluation
os.chdir(os.path.join(PROJECT_ROOT, 'evaluation', 'sbert'))
output_file = os.path.join(RESULTS_DIR, 'evaluation', 'sbert', 'qwen-3b.csv')
os.makedirs(os.path.dirname(output_file), exist_ok=True)
subprocess.run([sys.executable, 'sbert.py', '--model', 'qwen-3b', '--output_file', output_file], check=True)
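
For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their norms. The sketch below uses plain Python lists in place of real SBERT embeddings and is not taken from `sbert.py`; returning 0.0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)
```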

### 4.2 String Comparison (F1, EM, BLEU, chrF)

In [None]:
# Run String Comparison evaluation
os.chdir(os.path.join(PROJECT_ROOT, 'evaluation', 'string-comparison'))
subprocess.run([sys.executable, 'string_comparison.py'], check=True)
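
As an illustration of two of these metrics, the sketch below computes exact match and token-level F1 in the style commonly used for extractive QA. It assumes lowercasing and whitespace tokenization; `string_comparison.py` may normalize answers differently, so treat this as a sketch and not as the repository's implementation.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (an assumed normalization)."""
    return text.lower().split()

def exact_match(pred, ref):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(ref))

def token_f1(pred, ref):
    """Harmonic mean of token precision and recall between pred and ref."""
    p, r = normalize(pred), normalize(ref)
    if not p or not r:
        return float(p == r)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(r)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)
```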

## 5. Baseline Metrics (QE)

### 5.1 BT-Score (BERTScore on backtranslation)

In [None]:
# Run BT-Score
os.chdir(os.path.join(PROJECT_ROOT, 'evaluation', 'bt-score'))
subprocess.run([sys.executable, 'run_bt.py'], check=True)

### 5.2 xCOMET-QE

In [None]:
# Run xCOMET-QE
os.chdir(os.path.join(PROJECT_ROOT, 'evaluation', 'xcomet-qe'))
subprocess.run([sys.executable, 'xcomet.py'], check=True)

## 6. Desiderata Evaluation (Question Quality)

### 6.1 Empty Questions Count

In [None]:
# Run Empty Questions evaluation
os.chdir(os.path.join(PROJECT_ROOT, 'evaluation', 'desiderata'))
subprocess.run([sys.executable, 'i_avg_questions.py'], check=True)

### 6.2 Duplicate Questions

In [None]:
# Run Duplicate Questions evaluation (relies on the working directory set in 6.1)
subprocess.run([sys.executable, 'i_duplicate.py'], check=True)
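
One simple way to quantify duplication is the share of questions that repeat an earlier question after light normalization. The sketch below is an assumption about what such a measure could look like, not the logic of `i_duplicate.py`; `duplicate_rate` is a hypothetical helper.

```python
def duplicate_rate(questions):
    """Fraction of questions that duplicate another one, ignoring case and extra spaces."""
    normalized = [' '.join(q.lower().split()) for q in questions]
    if not normalized:
        return 0.0
    unique = len(set(normalized))
    return 1 - unique / len(normalized)
```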

### 6.3 Diversity

In [None]:
# Run Diversity evaluation (relies on the working directory set in 6.1)
subprocess.run([sys.executable, 'i_diversity.py'], check=True)

### 6.4 Answerability

In [None]:
# Run Answerability evaluation (relies on the working directory set in 6.1)
subprocess.run([sys.executable, 'q_answerability.py'], check=True)

### 6.5 Readability

In [None]:
# Run Readability evaluation (relies on the working directory set in 6.1)
subprocess.run([sys.executable, 'q_readability.py'], check=True)

---

## Pipeline Complete!

All results are saved in the `results Qwen3B baseline/` folder.