# Part 3: Evaluation (The "Robust" Part)

This is the most important notebook. Without evaluation, you are flying blind. You might change a chunk size and *think* it's better, but you don't *know*.

## Theory: LLM-as-a-Judge
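It is hard to measure the quality of text automatically. N-gram overlap scores such as ROUGE (from summarization) and BLEU (from machine translation) are a poor fit for RAG, because a correct answer can be worded very differently from the reference.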
Instead, we use a strong LLM (like GPT-4) to grade the performance of our RAG system. This is the core of the **RAGAS** library.
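
To make the weakness of overlap metrics concrete, here is a minimal sketch using a crude unigram-overlap F1. It is a hypothetical stand-in, not real ROUGE or BLEU, and the example sentences are invented for illustration. A wrong answer that copies the reference wording should outscore a correct paraphrase.

```python
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Crude word-overlap F1; a stand-in for n-gram metrics, not real ROUGE/BLEU."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # words shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The rate limit is 100 requests per second."
wrong_answer = "The rate limit is 900 requests per second."  # wrong number
paraphrase = "You may send up to 100 calls each second."     # correct meaning

print(f"wrong answer: {unigram_f1(wrong_answer, reference):.2f}")
print(f"paraphrase:   {unigram_f1(paraphrase, reference):.2f}")
```

The overlap score rewards the factually wrong answer, which is exactly the failure an LLM judge is meant to catch.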

## Metrics
1. **Faithfulness**: Is the answer derived *only* from the retrieved context? (Hallucination check)
2. **Answer Relevance**: Does the answer actually address the user's question?
3. **Context Precision**: Is the "gold" chunk ranked high in the results?
4. **Context Recall**: Did we retrieve all the information needed to answer?
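
As a rough illustration of how two of these scores are built: as I understand the Ragas definitions, faithfulness is the fraction of claims in the answer that the context supports, and context recall is the fraction of claims in the reference answer that can be found in the retrieved context. In Ragas the per-claim verdicts come from the LLM judge; in this sketch they are hand-labeled, and the helper name is hypothetical.

```python
def supported_fraction(verdicts: list[bool]) -> float:
    """Share of statements judged as supported; 0.0 when there are none."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Faithfulness: one verdict per claim in the ANSWER (is it backed by the context?).
answer_claim_verdicts = [True, True, False]  # third claim is not in the context
faithfulness_score = supported_fraction(answer_claim_verdicts)

# Context recall: one verdict per claim in the REFERENCE answer
# (can it be found in the retrieved context?).
reference_claim_verdicts = [True, False]  # half of the needed facts were retrieved
context_recall_score = supported_fraction(reference_claim_verdicts)

print(faithfulness_score, context_recall_score)
```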

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()
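
The later cells read `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` from the environment, so it can help to fail early when one is missing. A minimal sketch, assuming those two names are the required ones; the helper `missing_env_vars` is hypothetical.

```python
import os


def missing_env_vars(required, env=os.environ):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


REQUIRED = ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"]
missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```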

## 1. Creating a Synthetic Test Set (The Machine as Examiner)

Imagine you are a teacher. You have a textbook (your internal documents) and you want to test whether your students (your RAG system) really understand the material.
*   **Problem:** Writing hundreds of good questions and correct answers by hand is tedious, slow, and expensive.
*   **Solution:** We let a smarter colleague (GPT-4) read the textbook for us and come up with the test questions automatically.

### How Does It Work? (Synthetic Data Generation)
The **Ragas** library does not write only simple questions. It tries to simulate a real, curious user by applying several strategies (Evolutions):

1.  **Simple (factual):** *"What is the API request limit?"* (The answer sits in a single place.)
2.  **Reasoning:** *"What must I do to qualify for an SLA credit?"* (The answer requires combining conditions and drawing a logical conclusion.)
3.  **Multi-Context (complex):** *"Compare the terms of the Basic and Enterprise plans."* (The answer is scattered across several parts of the document.)

**Result:** We get a so-called **Golden Dataset**: a set of questions (`question`) and correct answers (`ground_truth`) against which we will test our system.

In [7]:
import os
import pandas as pd
from langchain_community.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from ragas.testset import TestsetGenerator
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

# 1. Load the data
print("--- Loading documents ---")
loader = DirectoryLoader("../data", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader)
documents = loader.load()
print(f"✅ Loaded {len(documents)} documents.")

# 2. Azure configuration
azure_config = {
    "api_version": os.getenv("AZURE_OPENAI_API_VERSION", "2023-05-15"),
    "azure_endpoint": os.getenv("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.getenv("AZURE_OPENAI_API_KEY"),
}

# 3. Initialize the models
# Note: by default, Ragas 0.4 uses one main LLM for both generation and critique
main_llm = AzureChatOpenAI(
    azure_deployment="gpt-4o", 
    temperature=0,
    **azure_config
)

embeddings = AzureOpenAIEmbeddings(
    azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-ada-002"),
    **azure_config
)

# 4. Create the generator
# The correct argument names for Ragas 0.4.1 are 'llm' and 'embedding_model'
generator = TestsetGenerator.from_langchain(
    llm=main_llm,
    embedding_model=embeddings
)

# 5. Generate
print("\n--- Generating synthetic questions ---")
testset = generator.generate_with_langchain_docs(
    documents, 
    testset_size=5
)

# 6. Result
test_df = testset.to_pandas()
print(f"\n✅ Done. Generated {len(test_df)} pairs.")
display(test_df.head())

--- Loading documents ---
✅ Loaded 1 documents.

--- Generating synthetic questions ---


Skipping multi_hop_abstract_query_synthesizer due to unexpected error: No relationships match the provided condition. Cannot form clusters.


✅ Done. Generated 5 pairs.


Unnamed: 0,user_input,reference_contexts,reference,persona_name,query_style,query_length,synthesizer_name
0,Wht is us-east-1?,[NebulaDB API Documentation\n\nIntroduction\n\...,us-east-1 is a region code used to filter clus...,Interstellar Systems Engineer,MISSPELLED,SHORT,single_hop_specific_query_synthesizer
1,How do I retrive a list of all actve clusters ...,[NebulaDB API Documentation\n\nIntroduction\n\...,To retrieve a list of all active clusters in N...,Interstellar Systems Engineer,MISSPELLED,MEDIUM,single_hop_specific_query_synthesizer
2,How get list of clusters in mars-north-2?,[NebulaDB API Documentation\n\nIntroduction\n\...,To get a list of clusters in the mars-north-2 ...,Interstellar Systems Engineer,POOR_GRAMMAR,MEDIUM,single_hop_specific_query_synthesizer
3,Howw do I filter the list of active clusters b...,[NebulaDB API Documentation\n\nIntroduction\n\...,To filter the list of active clusters by the u...,Interstellar Systems Engineer,MISSPELLED,LONG,single_hop_specific_query_synthesizer
4,What information can be retrieved about cluste...,[NebulaDB API Documentation\n\nIntroduction\n\...,"Using the NebulaDB API, you can retrieve a lis...",Interstellar Systems Engineer,PERFECT_GRAMMAR,LONG,single_hop_specific_query_synthesizer
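
Before trusting a generated test set, it is worth checking how varied it is. The sketch below tallies the `query_style` and `synthesizer_name` values copied by hand from the five rows above, using only the standard library. With the real DataFrame, `test_df["synthesizer_name"].value_counts()` should give the same kind of summary.

```python
from collections import Counter

# (query_style, query_length, synthesizer_name) copied from the five rows above.
rows = [
    ("MISSPELLED", "SHORT", "single_hop_specific_query_synthesizer"),
    ("MISSPELLED", "MEDIUM", "single_hop_specific_query_synthesizer"),
    ("POOR_GRAMMAR", "MEDIUM", "single_hop_specific_query_synthesizer"),
    ("MISSPELLED", "LONG", "single_hop_specific_query_synthesizer"),
    ("PERFECT_GRAMMAR", "LONG", "single_hop_specific_query_synthesizer"),
]

style_counts = Counter(style for style, _, _ in rows)
synthesizer_counts = Counter(name for _, _, name in rows)

print(style_counts.most_common())
print(synthesizer_counts.most_common())

# A single synthesizer means every question is single-hop, so multi-context
# retrieval is not exercised by this test set.
only_single_hop = set(synthesizer_counts) == {"single_hop_specific_query_synthesizer"}
print("single-hop only:", only_single_hop)
```

Here every question comes from the single-hop synthesizer, which fits the log message about the multi-hop synthesizer being skipped when only one document was loaded.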


## 2. Evaluate Your Pipeline (A Report Card for the Robot)

Now that we have test questions from the "teacher" (the synthetic dataset), we need to let our RAG system answer them. Then we score its answers.

### What Exactly Are We Scoring? (Ragas Metrics)
A RAG system does not get a single grade; we score four separate disciplines:

1.  **Faithfulness (truthfulness):** Does the model make things up, or does it stick to the text?
    *   *Example:* The document says *"Price: 500 CZK"*. The model says *"Price: 600 CZK"*. -> Low faithfulness (the model is inventing facts, i.e. hallucinating).
2.  **Answer Relevance (on topic):** Did the model answer what was actually asked?
    *   *Question:* *"What is the weather like?"* -> *Answer:* *"Today is Tuesday."* -> Low relevance (true, but beside the point).
3.  **Context Precision (retrieval quality):** Were the "gold" documents ranked first?
    *   If the system found the right document but only at rank 10 (buried in noise), the score drops.
4.  **Context Recall (retrieval completeness):** Did we find everything that was needed?
    *   If answering the question requires 3 different paragraphs but the system found only 2, the score drops.
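
The effect of rank on context precision can be sketched in a few lines. As I understand the Ragas definition, the score averages precision@k over the ranks that hold a relevant chunk. The relevance flags below are hand-labeled (Ragas obtains them from the LLM judge), and the function name is hypothetical.

```python
def rank_weighted_precision(relevant_flags: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, listed best-ranked first."""
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0


gold_first = [True] + [False] * 9  # the gold chunk is ranked 1st
gold_last = [False] * 9 + [True]   # the gold chunk is ranked 10th

print(rank_weighted_precision(gold_first))
print(rank_weighted_precision(gold_last))
```

The same gold chunk is retrieved in both cases; only its position changes, and the score for the 10th-place case is a tenth of the first-place score.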

### The Evaluation Process
For every test case, we must pass four things to the scoring function (`evaluate`):
1.  `question`: What did we ask?
2.  `answer`: What did our system answer?
3.  `contexts`: Which documents did the system retrieve?
4.  `ground_truth`: What was the correct answer (according to the teacher/dataset)?
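
Collecting these four fields usually means running the RAG system once per golden question. The following is a minimal sketch of that loop, assuming a pipeline callable that returns an answer plus the retrieved texts. `stub_rag_pipeline` and `build_eval_columns` are hypothetical names, and the stub must be replaced with a call to your real system.

```python
def stub_rag_pipeline(question: str) -> tuple[str, list[str]]:
    """Hypothetical placeholder: replace with a call to your real RAG system."""
    contexts = [f"(retrieved text for: {question})"]
    answer = f"(generated answer for: {question})"
    return answer, contexts


def build_eval_columns(golden_set: list[dict], pipeline) -> dict[str, list]:
    """Run `pipeline` on every question and collect the four required columns."""
    columns = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for item in golden_set:
        answer, contexts = pipeline(item["question"])
        columns["question"].append(item["question"])
        columns["answer"].append(answer)
        columns["contexts"].append(list(contexts))  # one list of strings per question
        columns["ground_truth"].append(item["ground_truth"])
    return columns


golden_set = [
    {"question": "What is the API rate limit?", "ground_truth": "100 req/sec per token."},
]
eval_columns = build_eval_columns(golden_set, stub_rag_pipeline)
print(eval_columns)
```

The resulting dict is column-oriented (one list per field), which is the shape `Dataset.from_dict` accepts.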

In [10]:
from ragas import evaluate
from datasets import Dataset
# Note the spelling: answer_relevancy (ends in 'cy')
from ragas.metrics import (
    context_precision,
    faithfulness,
    answer_relevancy,
    context_recall,
)

print("--- Running evaluation ---")

# 1. PREPARE THE DATA (SIMULATED)
data_samples = {
    'question': [
        'What is the API rate limit?', 
        'How do I upgrade to Enterprise?'
    ],
    'answer': [
        'The rate limit is 100 requests per second.',    
        'You can upgrade by contacting sales team.'      
    ],
    'contexts': [
        ['API Rate limit is set to 100 req/sec per token limits.'],  
        ['Contact our sales department for Enterprise licensing and upgrades.'] 
    ],
    'ground_truth': [
        '100 req/sec per token.',    
        'Contact sales to upgrade.'  
    ]
}

# 2. Create the Dataset object
dataset = Dataset.from_dict(data_samples)

# 3. Run the evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
    llm=main_llm,  # the LLM created in the test-set cell also acts as the judge
    embeddings=embeddings
)

# 4. Show the results
print("\n📊 EVALUATION RESULTS:")
print(results)
df_results = results.to_pandas()
display(df_results.head())

--- Running evaluation ---



LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.



📊 EVALUATION RESULTS:
{'context_precision': 1.0000, 'faithfulness': 1.0000, 'answer_relevancy': 0.9282, 'context_recall': 1.0000}


Unnamed: 0,user_input,retrieved_contexts,response,reference,context_precision,faithfulness,answer_relevancy,context_recall
0,What is the API rate limit?,[API Rate limit is set to 100 req/sec per toke...,The rate limit is 100 requests per second.,100 req/sec per token.,1.0,1.0,0.941238,1.0
1,How do I upgrade to Enterprise?,[Contact our sales department for Enterprise l...,You can upgrade by contacting sales team.,Contact sales to upgrade.,1.0,1.0,0.915258,1.0
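
To read these numbers: the aggregate score for a metric appears to be the mean of its per-sample scores, and a simple threshold check can turn the report into a pass/fail gate. The sketch below uses values copied from the output above. The `THRESHOLD` value is a hypothetical quality bar, not something Ragas prescribes.

```python
from statistics import mean

# Per-sample answer_relevancy scores copied from the table above.
answer_relevancy_rows = [0.941238, 0.915258]
aggregate = mean(answer_relevancy_rows)
print(round(aggregate, 4))

# Aggregate scores copied from the printed results dict.
scores = {
    "context_precision": 1.0,
    "faithfulness": 1.0,
    "answer_relevancy": 0.9282,
    "context_recall": 1.0,
}
THRESHOLD = 0.95  # hypothetical quality bar; choose your own
below_bar = [name for name, score in scores.items() if score < THRESHOLD]
print(below_bar)
```

Keeping the per-sample rows around matters because an acceptable average can hide individual questions that score poorly.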
