# LLM Benchmarking for SOCAR Hackathon RAG Chatbot

This notebook benchmarks candidate LLMs for the `/llm` endpoint to find the best performer.

## Evaluation Criteria (LLM Judge Metrics):
- **Accuracy**: Is the answer correct?
- **Relevance**: Are retrieved citations relevant?
- **Completeness**: Does it fully answer the question?
- **Citation Quality**: Proper sources with page numbers?
- **Response Time**: Speed of generation

## Available LLM Models:
1. **Llama-4-Maverick-17B-128E-Instruct-FP8** (Current choice, open-source)
2. **DeepSeek-R1** (Open-source reasoning model)
3. **GPT-4.1** (Strong general performance)
4. **GPT-5, GPT-5-mini**
5. **Claude Sonnet 4.5** (Best quality)
6. **Claude Opus 4.1**
7. **Phi-4-multimodal-instruct**
8. **gpt-oss-120b**

In [1]:
# Install required packages
# !pip install openai pinecone-client sentence-transformers python-dotenv pandas matplotlib seaborn jiwer

In [2]:
import os
import json
import time
from typing import Dict, List, Tuple
from dotenv import load_dotenv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from openai import AzureOpenAI
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer
from jiwer import wer, cer

# Load environment variables
load_dotenv()

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

print("✅ Libraries loaded successfully")

✅ Libraries loaded successfully


## 1. Load Test Questions and Expected Answers

In [3]:
# Load sample questions
with open('docs/sample_questions.json', 'r', encoding='utf-8') as f:
    questions = json.load(f)

# Load expected answers
with open('docs/sample_answers.json', 'r', encoding='utf-8') as f:
    expected_answers = json.load(f)

print(f"Loaded {len(questions)} test cases")
print("\nTest Questions:")
for i, (key, msgs) in enumerate(questions.items(), 1):
    user_msg = [m for m in msgs if m['role'] == 'user'][-1]
    print(f"{i}. {key}: {user_msg['content'][:100]}...")

Loaded 5 test cases

Test Questions:
1. Example1: Daha az quyu ilə daha çox hasilat əldə etmək üçün hansı əsas amillərin inteqrasiyası tələb olunur?...
2. Example2: Qərbi Abşeron yatağında suvurma tədbirləri hansı tarixdə və hansı layda tətbiq edilmişdir və bunun m...
3. Example3: Pirallahı strukturunda 1253 nömrəli quyudan götürülmüş nümunələrdə SiO2 və CaO oksidləri arasında ha...
4. Example4: Bakı arxipelaqı (BA) və Aşağı Kür çökəkliyi (AKÇ) üçün geotemperatur xəritələrinə əsasən neft və qaz...
5. Example5: Bu zonada hansı proseslər baş verir?...
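
The loading cell implies a particular shape for the two JSON files: each key in `sample_questions.json` maps to a list of chat messages, and each key in `sample_answers.json` maps to an object with an `Answer` field. A minimal sketch of that shape follows; the sample values and the `last_user_message` helper are illustrative assumptions, not contents of the real files.

```python
from typing import Dict, List

# Hypothetical miniature versions of the two files (all values are made up).
questions: Dict[str, List[dict]] = {
    "Example1": [
        {"role": "user", "content": "First question?"},
    ],
    "Example5": [
        {"role": "user", "content": "Earlier question?"},
        {"role": "assistant", "content": "Earlier answer."},
        {"role": "user", "content": "Follow-up question?"},
    ],
}
expected_answers = {"Example1": {"Answer": "Expected text."}}


def last_user_message(messages: List[dict]) -> str:
    """Return the content of the most recent user turn."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    if not user_turns:
        raise ValueError("conversation has no user message")
    return user_turns[-1]


for key, messages in questions.items():
    print(key, "->", last_user_message(messages))
```

Only the final user turn is used as the query, so a short follow-up question (Example5 appears to be one) would reach retrieval without the context of its earlier turns.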


## 2. Initialize Vector Database and Embedding Model

In [4]:
# Initialize Pinecone
pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY'))
index = pc.Index(os.getenv('PINECONE_INDEX_NAME', 'hackathon'))

# Initialize embedding model (same as used for ingestion)
embed_model = SentenceTransformer('BAAI/bge-large-en-v1.5')

print(f"✅ Vector DB connected: {index.describe_index_stats()}")
print(f"✅ Embedding model loaded: {embed_model}")

✅ Vector DB connected: {'_response_info': {'raw_headers': {'connection': 'keep-alive',
                                    'content-length': '188',
                                    'content-type': 'application/json',
                                    'date': 'Sun, 14 Dec 2025 03:21:33 GMT',
                                    'grpc-status': '0',
                                    'server': 'envoy',
                                    'x-envoy-upstream-service-time': '4',
                                    'x-pinecone-request-id': '3979707437017514155',
                                    'x-pinecone-request-latency-ms': '4'}},
 'dimension': 1024,
 'index_fullness': 0.0,
 'memoryFullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'__default__': {'vector_count': 1300}},
 'storageFullness': 0.0,
 'total_vector_count': 1300,
 'vector_type': 'dense'}
✅ Embedding model loaded: SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True, 'architecture': '

## 3. RAG Retrieval Function

In [5]:
def retrieve_documents(query: str, top_k: int = 3) -> List[Dict]:
    """
    Retrieve relevant documents from vector database.
    """
    # Generate query embedding
    query_embedding = embed_model.encode(query).tolist()
    
    # Search vector DB
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    
    # Extract documents
    documents = []
    for match in results['matches']:
        documents.append({
            'pdf_name': match['metadata'].get('pdf_name', 'unknown.pdf'),
            'page_number': match['metadata'].get('page_number', 0),
            'content': match['metadata'].get('text', ''),
            'score': match.get('score', 0.0)
        })
    
    return documents

# Test retrieval
# Query in Azerbaijani: "What is the impact radius of mud volcanoes?"
test_query = "Palçıq vulkanlarının təsir radiusu nə qədərdir?"
test_docs = retrieve_documents(test_query)
print(f"\n✅ Retrieved {len(test_docs)} documents for test query")
print(f"Top result: {test_docs[0]['pdf_name']}, page {test_docs[0]['page_number']} (score: {test_docs[0]['score']:.3f})")


✅ Retrieved 3 documents for test query
Top result: document_10.pdf, page 8 (score: 0.767)


## 4. LLM Client Functions

In [None]:
# Initialize Azure OpenAI
azure_client = AzureOpenAI(
    api_key=os.getenv('AZURE_OPENAI_API_KEY'),
    api_version=os.getenv('AZURE_OPENAI_API_VERSION', '2024-08-01-preview'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT')
)

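# Per-model configuration: Azure deployment name plus the metadata used in the rankings.
# Keys must match the names listed in MODELS_TO_TEST.
MODELS = {
    'Llama-4-Maverick-17B': {'deployment': 'Llama-4-Maverick-17B-128E-Instruct-FP8', 'open_source': True, 'architecture_score': 'High'},
    'DeepSeek-R1': {'deployment': 'DeepSeek-R1', 'open_source': True, 'architecture_score': 'High'},
    'GPT-4.1': {'deployment': 'gpt-4.1', 'open_source': False, 'architecture_score': 'Medium'},
    # architecture_score for the two models below is an assumption (set equal to GPT-4.1)
    'GPT-5-mini': {'deployment': 'gpt-5-mini', 'open_source': False, 'architecture_score': 'Medium'},
    'Claude-Sonnet-4.5': {'deployment': 'claude-sonnet-4-5', 'open_source': False, 'architecture_score': 'Medium'},
}

# Display name -> deployment name
LLM_MODELS = {name: cfg['deployment'] for name, cfg in MODELS.items()}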

def generate_answer(model_name: str, query: str, documents: List[Dict], 
                   temperature: float = 0.2, max_tokens: int = 1000) -> Tuple[str, float]:
    """
    Generate answer using specified LLM model.
    Returns: (answer, response_time)
    """
    # Build context from retrieved documents
    context_parts = []
    for i, doc in enumerate(documents, 1):
        context_parts.append(
            f"Document {i} (Source: {doc['pdf_name']}, Page {doc['page_number']}):\n{doc['content']}"
        )
    context = "\n\n".join(context_parts)
    
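    # Create prompt (in Azerbaijani). In English: "You are an expert assistant on SOCAR's
    # historical oil and gas documents. Context (related documents): ... Question: ...
    # Give a detailed answer and be sure to cite the document sources (PDF name and page
    # number). Your answer must be accurate, fact-based, and use the context information."
    prompt = f"""Siz SOCAR-ın tarixi neft və qaz sənədləri üzrə mütəxəssis köməkçisisiniz.

Kontekst (əlaqəli sənədlər):
{context}

Sual: {query}

Ətraflı cavab verin və mütləq sənəd mənbələrinə istinad edin (PDF adı və səhifə nömrəsi ilə).
Cavabınız dəqiq, faktlara əsaslanan və kontekst məlumatlarından istifadə edən olmalıdır."""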
    
    # Get model deployment
    deployment = MODELS[model_name]['deployment']
    
    try:
        start_time = time.time()
        
        # GPT-5 models use max_completion_tokens, others use max_tokens
        if deployment.startswith('gpt-5'):
            response = azure_client.chat.completions.create(
                model=deployment,
                messages=[
                    {"role": "user", "content": prompt}
                ],
                temperature=temperature,
                max_completion_tokens=max_tokens
            )
        else:
            response = azure_client.chat.completions.create(
                model=deployment,
                messages=[
                    {"role": "user", "content": prompt}
                ],
                temperature=temperature,
                max_tokens=max_tokens
            )
        
        response_time = time.time() - start_time
        answer = response.choices[0].message.content
        
        return answer, response_time
    
    except Exception as e:
        return f"ERROR: {str(e)}", 0.0

print(f"\n✅ Configured {len(LLM_MODELS)} LLM models for testing")
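
The two branches in `generate_answer` differ only in the name of the token-limit parameter. One way to remove the duplication is to build that keyword argument separately. The helper name `token_limit_kwargs` is hypothetical, and the `gpt-5` prefix rule simply mirrors the check in the cell above.

```python
from typing import Dict


def token_limit_kwargs(deployment: str, max_tokens: int) -> Dict[str, int]:
    """Pick the token-limit parameter name for a deployment.

    Mirrors the notebook's rule: deployments whose name starts with 'gpt-5'
    take max_completion_tokens; everything else takes max_tokens.
    """
    if deployment.startswith("gpt-5"):
        return {"max_completion_tokens": max_tokens}
    return {"max_tokens": max_tokens}


# The result can be unpacked into a single create() call, for example:
#   azure_client.chat.completions.create(
#       model=deployment, messages=messages, temperature=temperature,
#       **token_limit_kwargs(deployment, max_tokens))
print(token_limit_kwargs("gpt-5-mini", 1000))
print(token_limit_kwargs("gpt-4.1", 1000))
```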

## 5. Evaluation Metrics

In [7]:
def normalize_text(text: str) -> str:
    """Normalize text for comparison."""
    import re
    text = text.lower().strip()
    text = re.sub(r'\s+', ' ', text)
    return text

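def calculate_answer_similarity(reference: str, hypothesis: str) -> Dict[str, float]:
    """
    Compare a generated answer (hypothesis) with the expected answer (reference).
    CER and WER are error rates in percent (lower is better) and can exceed 100
    when the hypothesis is much longer than the reference. Similarity is
    100 - WER clamped at 0, so long answers score 0 regardless of content.
    """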
    ref_norm = normalize_text(reference)
    hyp_norm = normalize_text(hypothesis)
    
    # Character Error Rate
    cer_score = cer(ref_norm, hyp_norm) * 100
    
    # Word Error Rate  
    wer_score = wer(ref_norm, hyp_norm) * 100
    
    # Similarity scores (higher is better)
    similarity = max(0, 100 - wer_score)
    
    return {
        'CER': round(cer_score, 2),
        'WER': round(wer_score, 2),
        'Similarity': round(similarity, 2)
    }

def check_citations(answer: str, documents: List[Dict]) -> Dict[str, object]:
    """
    Check whether the answer cites the retrieved sources (PDF names and page numbers).
    """
    import re

    # Nothing retrieved means nothing to cite; this also avoids division by zero below
    if not documents:
        return {'Citation_Score': 0.0, 'Cited_PDFs': 0, 'Cited_Pages': 0,
                'Has_Source_Reference': False}

    # Check for PDF names (with or without the .pdf extension)
    pdf_names = [doc['pdf_name'] for doc in documents]
    cited_pdfs = sum(1 for pdf in pdf_names if pdf.replace('.pdf', '') in answer)

    # Check for page numbers as standalone numbers, so page 3 does not match "1253"
    page_numbers = [str(doc['page_number']) for doc in documents]
    cited_pages = sum(
        1 for page in page_numbers
        if re.search(rf'(?<!\d){re.escape(page)}(?!\d)', answer)
    )

    # Check for source keywords (Azerbaijani and English)
    source_keywords = ['mənbə', 'sənəd', 'səhifə', 'pdf', 'document', 'page', 'source']
    has_source_ref = any(kw in answer.lower() for kw in source_keywords)

    citation_score = (
        (cited_pdfs / len(pdf_names) * 40) +  # 40% for PDF citation
        (cited_pages / len(page_numbers) * 40) +  # 40% for page citation
        (20 if has_source_ref else 0)  # 20% for having source keywords
    )

    return {
        'Citation_Score': round(citation_score, 2),
        'Cited_PDFs': cited_pdfs,
        'Cited_Pages': cited_pages,
        'Has_Source_Reference': has_source_ref
    }

def evaluate_completeness(answer: str, min_length: int = 100) -> Dict[str, object]:
    """
    Score answer completeness with a length heuristic (it does not inspect content).
    """
    word_count = len(answer.split())
    char_count = len(answer)

    # Penalize very short or very long answers
    if char_count < min_length:
        completeness_score = (char_count / min_length) * 100
    elif char_count > 2000:
        completeness_score = 100 - ((char_count - 2000) / 2000 * 20)  # Penalty for verbosity
    else:
        completeness_score = 100

    return {
        'Completeness_Score': round(max(0, completeness_score), 2),
        'Word_Count': word_count,
        'Char_Count': char_count
    }

print("✅ Evaluation functions ready")

✅ Evaluation functions ready
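
Because WER counts every extra word in the hypothesis as an insertion error, an answer several times longer than the reference has a WER above 100%, and `Similarity` then clamps to 0. The sketch below reproduces that effect with a plain word-level edit distance (standing in for `jiwer.wer` so that it runs without dependencies) and contrasts it with a token-overlap F1. The F1 metric is one possible alternative, not something the notebook uses, and the function names and sample strings are illustrative.

```python
from collections import Counter
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev holds the previous row of the edit-distance table
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def token_f1(reference: str, hypothesis: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "horizontal drilling and mud management"
# Contains every reference word, plus elaboration and a citation.
hypothesis = ("horizontal drilling and mud management are the key factors "
              "as stated in document 11 page 3")

wer_pct = word_error_rate(reference, hypothesis) * 100
print(f"WER: {wer_pct:.0f}%")
print(f"Similarity (100 - WER, clamped): {max(0, 100 - wer_pct):.0f}%")
print(f"Token F1: {token_f1(reference, hypothesis) * 100:.1f}%")
```

Here the hypothesis contains the whole reference, yet the 11 extra words push WER to 220% and the clamped similarity to 0, while the overlap-based score stays above zero.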


## 6. Run Benchmark on All Models

In [8]:
# Select models to test (you can comment out models to skip)
MODELS_TO_TEST = [
    'Llama-4-Maverick-17B',
    'DeepSeek-R1',
    'GPT-4.1',
    'GPT-5-mini',
    'Claude-Sonnet-4.5',
    # 'Claude-Opus-4.1',  # Uncomment to test
    # 'Phi-4-multimodal',  # Uncomment to test
    # 'GPT-OSS-120B',  # Uncomment to test
]

print(f"Testing {len(MODELS_TO_TEST)} models on {len(questions)} questions...\n")
print("This may take several minutes...\n")

Testing 5 models on 5 questions...

This may take several minutes...



In [9]:
# Run benchmark
results = []

for model_name in MODELS_TO_TEST:
    print(f"\n{'='*80}")
    print(f"Testing: {model_name}")
    print(f"{'='*80}")
    
    model_results = []
    
    for example_key, messages in questions.items():
        # Get the last user message (the actual question)
        user_msg = [m for m in messages if m['role'] == 'user'][-1]
        query = user_msg['content']
        
        print(f"\n  Question {example_key}: {query[:80]}...")
        
        # Retrieve documents
        documents = retrieve_documents(query, top_k=3)
        
        # Generate answer
        answer, response_time = generate_answer(model_name, query, documents)
        
        if answer.startswith('ERROR'):
            print(f"  ❌ Failed: {answer}")
            continue
        
        print(f"  ✅ Response time: {response_time:.2f}s")
        
        # Get expected answer
        expected = expected_answers.get(example_key, {}).get('Answer', '')
        
        # Calculate metrics
        similarity_metrics = calculate_answer_similarity(expected, answer) if expected else {'CER': 0, 'WER': 0, 'Similarity': 0}
        citation_metrics = check_citations(answer, documents)
        completeness_metrics = evaluate_completeness(answer)
        
        # Store result
        result = {
            'Model': model_name,
            'Question': example_key,
            'Query': query[:100],
            'Answer': answer[:200] + '...',
            'Response_Time': round(response_time, 2),
            **similarity_metrics,
            **citation_metrics,
            **completeness_metrics,
            'Open_Source': MODELS[model_name]['open_source'],
            'Architecture_Score': MODELS[model_name]['architecture_score']
        }
        
        model_results.append(result)
        results.append(result)
    
    # Show summary for this model
    if model_results:
        avg_response_time = sum(r['Response_Time'] for r in model_results) / len(model_results)
        avg_similarity = sum(r['Similarity'] for r in model_results) / len(model_results)
        avg_citation = sum(r['Citation_Score'] for r in model_results) / len(model_results)
        avg_completeness = sum(r['Completeness_Score'] for r in model_results) / len(model_results)
        
        print(f"\n  📊 {model_name} Summary:")
        print(f"     Avg Response Time: {avg_response_time:.2f}s")
        print(f"     Avg Similarity: {avg_similarity:.1f}%")
        print(f"     Avg Citation Score: {avg_citation:.1f}%")
        print(f"     Avg Completeness: {avg_completeness:.1f}%")

print(f"\n{'='*80}")
print("✅ Benchmarking complete!")
print(f"{'='*80}")


Testing: Llama-4-Maverick-17B

  Question Example1: Daha az quyu ilə daha çox hasilat əldə etmək üçün hansı əsas amillərin inteqrasi...
  ✅ Response time: 4.39s

  Question Example2: Qərbi Abşeron yatağında suvurma tədbirləri hansı tarixdə və hansı layda tətbiq e...
  ✅ Response time: 3.74s

  Question Example3: Pirallahı strukturunda 1253 nömrəli quyudan götürülmüş nümunələrdə SiO2 və CaO o...
  ✅ Response time: 4.07s

  Question Example4: Bakı arxipelaqı (BA) və Aşağı Kür çökəkliyi (AKÇ) üçün geotemperatur xəritələrin...
  ✅ Response time: 4.20s

  Question Example5: Bu zonada hansı proseslər baş verir?...
  ✅ Response time: 3.50s

  📊 Llama-4-Maverick-17B Summary:
     Avg Response Time: 3.98s
     Avg Similarity: 0.0%
     Avg Citation Score: 84.0%
     Avg Completeness: 100.0%

Testing: DeepSeek-R1

  Question Example1: Daha az quyu ilə daha çox hasilat əldə etmək üçün hansı əsas amillərin inteqrasi...
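
The loop skips a failed call with `continue`, so a model whose every call fails contributes no rows to `results` and silently disappears from the later tables. A small sketch of how the success rate per model could be tallied is shown here; the `attempts` records and the `success_rates` helper are hypothetical and not part of the notebook.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def success_rates(attempts: List[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of calls per model whose answer is not an 'ERROR...' string."""
    total: Dict[str, int] = defaultdict(int)
    ok: Dict[str, int] = defaultdict(int)
    for model_name, answer in attempts:
        total[model_name] += 1
        if not answer.startswith("ERROR"):
            ok[model_name] += 1
    return {name: ok[name] / total[name] for name in total}


# Made-up (model_name, answer) records; failures use the notebook's "ERROR: ..." convention.
attempts = [
    ("model-a", "Some answer"),
    ("model-a", "ERROR: timeout"),
    ("model-b", "ERROR: bad request"),
    ("model-b", "ERROR: bad request"),
]
for name, rate in success_rates(attempts).items():
    print(f"{name}: {rate:.0%} of calls succeeded")
```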

## 7. Aggregate Results and Rankings

In [10]:
# Create DataFrame
df = pd.DataFrame(results)

# Calculate aggregate scores per model
model_summary = df.groupby('Model').agg({
    'Response_Time': 'mean',
    'Similarity': 'mean',
    'Citation_Score': 'mean',
    'Completeness_Score': 'mean',
    'CER': 'mean',
    'WER': 'mean',
    'Open_Source': 'first',
    'Architecture_Score': 'first'
}).round(2)

# Calculate overall quality score (weighted average)
model_summary['Quality_Score'] = (
    model_summary['Similarity'] * 0.35 +  # 35% answer accuracy
    model_summary['Citation_Score'] * 0.35 +  # 35% citation quality
    model_summary['Completeness_Score'] * 0.30  # 30% completeness
).round(2)

# Sort by Quality Score
model_summary = model_summary.sort_values('Quality_Score', ascending=False)

# Display summary table
print("\n" + "="*100)
print("📊 LLM BENCHMARKING RESULTS - MODEL SUMMARY")
print("="*100)
print(model_summary.to_string())
print("="*100)


📊 LLM BENCHMARKING RESULTS - MODEL SUMMARY
                      Response_Time  Similarity  Citation_Score  Completeness_Score     CER     WER  Open_Source Architecture_Score  Quality_Score
Model                                                                                                                                             
Llama-4-Maverick-17B           3.98         0.0            84.0              100.00  330.97  378.42         True               High          59.40
GPT-4.1                        5.95         0.0            84.0               93.54  755.19  780.64        False             Medium          57.46
DeepSeek-R1                   10.80         0.0            80.0               67.73  855.43  992.02         True               High          48.32
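
As a check on the weighting, the `Quality_Score` column can be recomputed by hand from the table above. For Llama-4-Maverick-17B: 0.35 × 0.0 + 0.35 × 84.0 + 0.30 × 100.0 = 59.40. The sketch below does the same for all three rows in plain Python; the `quality_score` helper is illustrative.

```python
def quality_score(similarity: float, citation: float, completeness: float) -> float:
    """Weighted average used in the summary: 35% / 35% / 30%."""
    return round(similarity * 0.35 + citation * 0.35 + completeness * 0.30, 2)


# (Similarity, Citation_Score, Completeness_Score) copied from the summary table.
rows = {
    "Llama-4-Maverick-17B": (0.0, 84.0, 100.00),
    "GPT-4.1": (0.0, 84.0, 93.54),
    "DeepSeek-R1": (0.0, 80.0, 67.73),
}
for model, metrics in rows.items():
    print(f"{model}: {quality_score(*metrics)}")
```

Since `Similarity` is 0.0 for every model, the ranking in this run is decided entirely by the citation and completeness terms.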


## 8. Visualizations

In [None]:
# Create comprehensive visualization
import os
from pathlib import Path

# Create output directory
output_dir = Path('output/llm_benchmark')
output_dir.mkdir(parents=True, exist_ok=True)

fig, axes = plt.subplots(2, 3, figsize=(18, 12))

models = model_summary.index.tolist()
colors = sns.color_palette('husl', len(models))

# 1. Overall Quality Score
ax1 = axes[0, 0]
bars1 = ax1.barh(models, model_summary['Quality_Score'], color=colors)
ax1.set_xlabel('Quality Score (Higher is Better)', fontsize=11)
ax1.set_title('Overall Quality Score\n(Similarity 35% + Citation 35% + Completeness 30%)', 
              fontsize=12, fontweight='bold')
ax1.set_xlim(0, 100)
for i, (model, score) in enumerate(zip(models, model_summary['Quality_Score'])):
    ax1.text(score + 1, i, f'{score:.1f}', va='center', fontsize=10, fontweight='bold')

# 2. Answer Similarity (Accuracy)
ax2 = axes[0, 1]
ax2.barh(models, model_summary['Similarity'], color=colors)
ax2.set_xlabel('Similarity to Expected Answer (%)', fontsize=11)
ax2.set_title('Answer Accuracy', fontsize=12, fontweight='bold')
ax2.set_xlim(0, 100)
for i, (model, score) in enumerate(zip(models, model_summary['Similarity'])):
    ax2.text(score + 1, i, f'{score:.1f}%', va='center', fontsize=9)

# 3. Citation Quality
ax3 = axes[0, 2]
ax3.barh(models, model_summary['Citation_Score'], color=colors)
ax3.set_xlabel('Citation Score (%)', fontsize=11)
ax3.set_title('Citation Quality\n(PDF names + Page numbers)', fontsize=12, fontweight='bold')
ax3.set_xlim(0, 100)
for i, (model, score) in enumerate(zip(models, model_summary['Citation_Score'])):
    ax3.text(score + 1, i, f'{score:.1f}%', va='center', fontsize=9)

# 4. Response Time
ax4 = axes[1, 0]
ax4.barh(models, model_summary['Response_Time'], color=colors)
ax4.set_xlabel('Response Time (seconds - Lower is Better)', fontsize=11)
ax4.set_title('Speed Performance', fontsize=12, fontweight='bold')
# Use `seconds` as the loop variable; `time` would shadow the imported time module
for i, (model, seconds) in enumerate(zip(models, model_summary['Response_Time'])):
    ax4.text(seconds + 0.1, i, f'{seconds:.2f}s', va='center', fontsize=9)

# 5. Completeness
ax5 = axes[1, 1]
ax5.barh(models, model_summary['Completeness_Score'], color=colors)
ax5.set_xlabel('Completeness Score (%)', fontsize=11)
ax5.set_title('Answer Completeness', fontsize=12, fontweight='bold')
ax5.set_xlim(0, 100)
for i, (model, score) in enumerate(zip(models, model_summary['Completeness_Score'])):
    ax5.text(score + 1, i, f'{score:.1f}%', va='center', fontsize=9)

# 6. Error Rates (CER vs WER)
ax6 = axes[1, 2]
x = range(len(models))
width = 0.35
ax6.bar([i - width/2 for i in x], model_summary['CER'], width, label='CER', alpha=0.8)
ax6.bar([i + width/2 for i in x], model_summary['WER'], width, label='WER', alpha=0.8)
ax6.set_ylabel('Error Rate (% - Lower is Better)', fontsize=11)
ax6.set_title('Error Rates', fontsize=12, fontweight='bold')
ax6.set_xticks(x)
ax6.set_xticklabels(models, rotation=45, ha='right')
ax6.legend()
ax6.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(output_dir / 'results.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\n✅ Visualization saved to '{output_dir}/results.png'")

## 9. Final Rankings and Recommendations

In [12]:
# Create rankings table
rankings = model_summary[[
    'Quality_Score', 'Similarity', 'Citation_Score', 'Completeness_Score', 
    'Response_Time', 'Open_Source', 'Architecture_Score'
]].copy()

rankings.insert(0, 'Rank', range(1, len(rankings) + 1))

print("\n" + "="*100)
print("🏆 FINAL RANKINGS")
print("="*100)
print(rankings.to_string())
print("="*100)

# Winner analysis
best_overall = rankings.index[0]
best_open_source = rankings[rankings['Open_Source'] == True].index[0] if any(rankings['Open_Source']) else None
fastest = model_summary['Response_Time'].idxmin()

print("\n" + "="*100)
print("💡 RECOMMENDATIONS FOR HACKATHON")
print("="*100)

print(f"\n🥇 Best Overall Quality: {best_overall}")
print(f"   Quality Score: {model_summary.loc[best_overall, 'Quality_Score']:.1f}%")
print(f"   Similarity: {model_summary.loc[best_overall, 'Similarity']:.1f}%")
print(f"   Citation Score: {model_summary.loc[best_overall, 'Citation_Score']:.1f}%")
print(f"   Response Time: {model_summary.loc[best_overall, 'Response_Time']:.2f}s")
print(f"   Open Source: {model_summary.loc[best_overall, 'Open_Source']}")
print(f"   Architecture Score: {model_summary.loc[best_overall, 'Architecture_Score']}")

if best_open_source:
    print(f"\n🔓 Best Open-Source Model: {best_open_source}")
    print(f"   Quality Score: {model_summary.loc[best_open_source, 'Quality_Score']:.1f}%")
    print(f"   Architecture Score: {model_summary.loc[best_open_source, 'Architecture_Score']} (Better for hackathon!)")
    print(f"   Response Time: {model_summary.loc[best_open_source, 'Response_Time']:.2f}s")

print(f"\n⚡ Fastest Model: {fastest}")
print(f"   Response Time: {model_summary.loc[fastest, 'Response_Time']:.2f}s")
print(f"   Quality Score: {model_summary.loc[fastest, 'Quality_Score']:.1f}%")

print("\n" + "="*100)
print("📝 FINAL RECOMMENDATION")
print("="*100)
print("\nScoring Breakdown:")
print("  - LLM Quality: 30% of total hackathon score")
print("  - Architecture: 20% of total hackathon score (open-source preferred!)")
print("\nBest Choice:")
if best_open_source and model_summary.loc[best_open_source, 'Quality_Score'] >= model_summary.loc[best_overall, 'Quality_Score'] * 0.9:
    print(f"  ✅ {best_open_source} - Best balance of quality and architecture score")
    print(f"     Only {model_summary.loc[best_overall, 'Quality_Score'] - model_summary.loc[best_open_source, 'Quality_Score']:.1f}% quality drop for higher architecture score!")
else:
    print(f"  ✅ {best_overall} - Highest quality, use if quality gap is significant")
    if best_open_source:
        print(f"  ⚠️  Consider {best_open_source} for higher architecture score (trade-off: {model_summary.loc[best_overall, 'Quality_Score'] - model_summary.loc[best_open_source, 'Quality_Score']:.1f}% quality)")

print("="*100)


🏆 FINAL RANKINGS
                      Rank  Quality_Score  Similarity  Citation_Score  Completeness_Score  Response_Time  Open_Source Architecture_Score
Model                                                                                                                                   
Llama-4-Maverick-17B     1          59.40         0.0            84.0              100.00           3.98         True               High
GPT-4.1                  2          57.46         0.0            84.0               93.54           5.95        False             Medium
DeepSeek-R1              3          48.32         0.0            80.0               67.73          10.80         True               High

💡 RECOMMENDATIONS FOR HACKATHON

🥇 Best Overall Quality: Llama-4-Maverick-17B
   Quality Score: 59.4%
   Similarity: 0.0%
   Citation Score: 84.0%
   Response Time: 3.98s
   Open Source: True
   Architecture Score: High

🔓 Best Open-Source Model: Llama-4-Maverick-17B
   Quality Score
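
The final choice follows a simple rule: prefer the best open-source model when its quality score is at least 90% of the overall best, otherwise take the overall best. A minimal standalone sketch of that rule is given here; the `pick_model` helper and its input format are assumptions made for illustration, and the scores are those from the rankings table.

```python
from typing import Dict, Optional, Tuple


def pick_model(scores: Dict[str, Tuple[float, bool]], tolerance: float = 0.9) -> Optional[str]:
    """Choose a model; scores maps model name -> (quality_score, is_open_source)."""
    if not scores:
        return None
    best_overall = max(scores, key=lambda name: scores[name][0])
    open_models = [name for name, (_, is_open) in scores.items() if is_open]
    if not open_models:
        return best_overall
    best_open = max(open_models, key=lambda name: scores[name][0])
    # Accept the open-source model if it reaches `tolerance` of the best score.
    if scores[best_open][0] >= scores[best_overall][0] * tolerance:
        return best_open
    return best_overall


scores = {
    "Llama-4-Maverick-17B": (59.40, True),
    "GPT-4.1": (57.46, False),
    "DeepSeek-R1": (48.32, True),
}
print(pick_model(scores))
```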

## 10. Export Results

In [None]:
# Save results
from pathlib import Path

output_dir = Path('output/llm_benchmark')
output_dir.mkdir(parents=True, exist_ok=True)

df.to_csv(output_dir / 'detailed_results.csv', index=False, encoding='utf-8')
model_summary.to_csv(output_dir / 'summary.csv', encoding='utf-8')
# Keep the index here: it holds the model names
rankings.to_csv(output_dir / 'rankings.csv', encoding='utf-8')

print("\n✅ Results exported to output/llm_benchmark/:")
print("   - detailed_results.csv (all questions and answers)")
print("   - summary.csv (model averages)")
print("   - rankings.csv (final rankings)")
print("   - results.png (visualizations)")

## 11. Sample Answer Comparison

In [14]:
# Show sample answers for first question
sample_question = 'Example1'
sample_results = df[df['Question'] == sample_question]

print("\n" + "="*100)
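print(f"📝 SAMPLE ANSWER COMPARISON - {sample_question}")
print("="*100)

# Use the last user message, matching how the benchmark loop builds the query
sample_query = [m for m in questions[sample_question] if m['role'] == 'user'][-1]['content']
print(f"\n❓ Question: {sample_query}")
print(f"\n✅ Expected Answer:\n{expected_answers[sample_question]['Answer']}")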
print("\n" + "-"*100)

for _, row in sample_results.iterrows():
    print(f"\n🤖 {row['Model']} (Quality: {model_summary.loc[row['Model'], 'Quality_Score']:.1f}%, Time: {row['Response_Time']:.2f}s):")
    print(f"{row['Answer']}")
    print("-"*100)

print("="*100)


📝 SAMPLE ANSWER COMPARISON - Example1

❓ Question: Daha az quyu ilə daha çox hasilat əldə etmək üçün hansı əsas amillərin inteqrasiyası tələb olunur?

✅ Expected Answer:
Daha az quyu ilə daha çox hasilat əldə etmək üçün düzgün seçilmiş texnoloji inteqrasiya (horizontal və çoxtərəfli qazma texnikaları) və qazma məhlullarının səmərəli idarə edilməsi tələb olunur. Bu yanaşma həm iqtisadi, həm də ekoloji baxımdan üstünlük yaradır.

----------------------------------------------------------------------------------------------------

🤖 Llama-4-Maverick-17B (Quality: 59.4%, Time: 4.39s):
Daha az quyu ilə daha çox hasilat əldə etmək üçün düzgün seçilmiş texnoloji inteqrasiya və qazma məhlullarının səmərəli idarəsi əsas amillərdir. Bu, Document 1 (document_11.pdf, Səhifə 3)-də qeyd olun...
----------------------------------------------------------------------------------------------------

🤖 DeepSeek-R1 (Q