# Part 1: Ingestion and Basic RAG

Welcome to the **Deepnote RAG Training** series. In this first notebook, we will build the foundation of a Retrieval-Augmented Generation (RAG) system.

## What is RAG?
Large Language Models (LLMs) like GPT-4 are frozen in time. They don't know about your private data, your company's contracts, or the latest financial report from yesterday. **RAG** solves this by:
1. **Retrieving** relevant documents based on your query.
2. **Augmenting** the prompt with these documents.
3. **Generating** an answer using the LLM + the context.

## 1. Setup
First, we load the environment variables (API Keys) and import necessary libraries.

In [2]:
%pip install -r ../requirements.txt
import os
import openai
from dotenv import load_dotenv

load_dotenv()

# If using Deepnote's integration, the key is already in env
# os.environ["OPENAI_API_KEY"] = "sk-..."

Note: you may need to restart the kernel to use updated packages.





True

## 2. Document Loading
**Concept:** Your data exists in many formats (PDF, Markdown, CSV, JSON). We need to standardise it into a `Document` object (text + metadata).

We will load the synthetic data we created in the `data/` folder.

### 📘 Notes on Data Loading (Data Ingestion)

This step **normalizes the data**. It does not matter whether the source file is `.txt`, `.csv` or `.md`: the goal is to convert everything into a uniform **Document** object that the RAG system knows how to work with.

#### 1. What is a `Document` object?
In LangChain, every loaded piece of data is represented by an object with two main parts:
* **`page_content`**: The text itself (the content).
* **`metadata`**: Information about where it came from (e.g. `{"source": "path/to/file.txt", "row": 5}`).
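
To make the two-part shape concrete, here is a minimal sketch. `SimpleDocument` is a hypothetical stand-in written only for illustration; it is not LangChain's `Document` class, it just mirrors the two fields described above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus provenance."""
    page_content: str
    metadata: dict = field(default_factory=dict)


doc = SimpleDocument(
    page_content="NebulaDB commits to a Monthly Uptime Percentage of at least 99.99%.",
    metadata={"source": "../data/legal/sla_contract.txt"},
)

print(doc.page_content)
print(doc.metadata["source"])
```

Keeping provenance next to the text is what later allows a retrieved chunk to be traced back to its source file.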

#### 2. How do the individual loaders work?
Loaders take care of converting raw data into text.
* **`TextLoader`**: The "dumb" loader. It takes the entire file content as it is and puts it into a single document.
* **`CSVLoader`**: The "smart" loader. It turns every table row into a **separate document**. It formats the data as `Column: Value` pairs so that the LLM keeps the context (it knows what each number means).
* **`UnstructuredMarkdownLoader`**: Tries to parse the structure. It often removes formatting characters (such as `**` for bold) so that only clean, contextual text remains.
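
The row-per-document idea can be sketched with the standard library alone. This is an illustration of the `Column: Value` formatting described above, not `CSVLoader`'s actual code; the function name and the sample rows are assumptions made up for the example.

```python
import csv
import io

# Hypothetical sample shaped like a small finance report (values are illustrative).
raw_csv = "Quarter,Revenue,Expenses\nQ3 2023,900000,700000\nQ4 2023,950000,750000\n"


def csv_rows_to_documents(text, source):
    """Turn each CSV row into one document-like dict with content and metadata."""
    docs = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(text))):
        # One "Column: Value" line per cell keeps the header next to each number.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": row_number}})
    return docs


csv_docs = csv_rows_to_documents(raw_csv, source="financial_report.csv")
print(csv_docs[1]["page_content"])
print(csv_docs[1]["metadata"])
```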

#### 3. What happens to the characters? (Encoding & Normalization)
* **Encoding**: The pipeline assumes standard **UTF-8**. If a file stores Czech characters in a different encoding (e.g. Windows-1250), the loader may fail or produce garbage. `TextLoader` accepts an `encoding` argument, so it is safest to state the encoding explicitly.
* **Error correction**: Loaders do **not** act as a spell-checker. A typo in the source text will also be in the loaded document.
* **Formatting**: Whitespace (spaces, line breaks) is often normalized by the loader so that the text is more compact; how much depends on the loader.
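
The encoding pitfall is easy to reproduce without any loader. The sketch below simulates a file saved as Windows-1250 and read as UTF-8; the sample sentence is an arbitrary Czech phrase chosen for the demonstration.

```python
czech_text = "Příliš žluťoučký kůň"

# Simulate a file that was saved as Windows-1250 instead of UTF-8.
cp1250_bytes = czech_text.encode("cp1250")

try:
    cp1250_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    # The bytes are not valid UTF-8, so a strict reader fails here.
    decoded_as_utf8 = False

# Decoding with the encoding the file was actually written in recovers the text.
recovered = cp1250_bytes.decode("cp1250")

print(decoded_as_utf8)
print(recovered == czech_text)
```

The same reasoning applies to loaders: tell them the file's real encoding rather than relying on a default.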

In [3]:
from langchain_community.document_loaders import TextLoader, JSONLoader, CSVLoader, UnstructuredMarkdownLoader

ts_path = "../data/legal/sla_contract.txt"
md_path = "../data/tech_docs/api_docs.md"
csv_path = "../data/finance/financial_report.csv"

# 1. Text Loader
text_loader = TextLoader(ts_path)
documents_text = text_loader.load()

# 2. Markdown Loader
md_loader = UnstructuredMarkdownLoader(md_path)
documents_md = md_loader.load()

# 3. CSV Loader
csv_loader = CSVLoader(csv_path)
documents_csv = csv_loader.load()

print(f"Loaded {len(documents_text)} text docs, {len(documents_md)} md docs, {len(documents_csv)} csv rows.")

Loaded 1 text docs, 1 md docs, 4 csv rows.


In [4]:
# 1. Show the content of the text document
print("--- TEXT DOCUMENT ---")
print(documents_text[0].page_content)
print("Metadata:", documents_text[0].metadata)

# 2. Show the content of the Markdown document
print("\n--- MARKDOWN DOCUMENT ---")
# Show only the first 500 characters in case the document is long
print(documents_md[0].page_content[:500])

# 3. Show the CSV data (each row is usually one document)
print("\n--- CSV ROW 1 ---")
print(documents_csv[0].page_content)
print("Metadata:", documents_csv[0].metadata)

--- TEXT DOCUMENT ---
SERVICE LEVEL AGREEMENT (SLA) for NebulaDB Enterprise

1. DEFINITIONS
"Uptime" refers to the availability of the Service during a billing month.
"Downtime" refers to a period of time exceeding 5 minutes where the Error Rate is greater than 5%.

2. SERVICE COMMITMENT
NebulaDB commits to a Monthly Uptime Percentage of at least 99.99% ("Service Commitment").

3. SERVICE CREDITS
If we do not meet the Service Commitment, you will be eligible to receive a Service Credit as follows:
- < 99.99% but >= 99.0%: 10% Credit
- < 99.0% but >= 95.0%: 25% Credit
- < 95.0%: 100% Credit

4. EXCLUSIONS
The Service Commitment does not apply to any unavailability, suspension, or termination of NebulaDB performance issues: (i) caused by factors outside of our reasonable control, including any force majeure event or Internet access or related problems beyond the demarcation point of NebulaDB.

5. CLAIMS
To receive a Service Credit, you must submit a claim by opening a case in the NebulaD

## 3. Chunking (Text Splitting)

**Concept:** LLMs have limited context windows, so we can't stuff a 100-page contract into one prompt. Retrieving smaller, focused chunks also tends to be more accurate than retrieving whole documents.

**Technique:**
- `CharacterTextSplitter`: Simple, splits by character count.
- `RecursiveCharacterTextSplitter`: Smarter, tries to keep paragraphs and sentences together.

# 🔧 Detailed explanation of the `RecursiveCharacterTextSplitter` parameters

## 1. Key parameters (they affect RAG quality)

### `chunk_size` (int)
*   **Default:** `4000`
*   **What it is:** The maximum target size of one piece of text (chunk).
*   **Effect:** If it is too large, the LLM is flooded with irrelevant ballast. If it is too small, you lose context.
*   **Recommendation:** For GPT-4 and RAG it is usually set to **500 to 1000**.

### `chunk_overlap` (int)
*   **Default:** `200`
*   **What it is:** The number of characters shared between the end of one chunk and the start of the next.
*   **Effect:** Prevents an important piece of information (e.g. a sentence) from being cut in half and losing its meaning.
*   **Recommendation:** Usually **10 to 20%** of `chunk_size`.
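
The interaction between `chunk_size` and `chunk_overlap` can be seen in a deliberately simplified sketch. It cuts at fixed character offsets, which the real splitter does not do (it prefers separators), and the function name is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
print(windows)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last two characters of one chunk reappear at the start of the next.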

### `separators` (list[str] | None)
*   **Default:** `None` (internally: `["\n\n", "\n", " ", ""]`)
*   **What it is:** A list of separators ordered by priority (from strongest to weakest).
*   **Effect:** Determines the splitting strategy. The splitter first tries to split on the 1st separator (paragraphs). If a chunk is still too large, it tries the 2nd separator (lines), then the 3rd (words), and so on.
*   **When to change it:** When you split code (use `RecursiveCharacterTextSplitter.from_language`) or domain-specific data.
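
The priority idea can be sketched as a small recursive function. This is a simplified illustration under stated assumptions: it ignores overlap and separator retention, and it is not the library's implementation. `recursive_split` and the sample text are made up for the example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the strongest separator, falling back to weaker ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry it with the weaker separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample_text = "para one\n\nline a\nline b is long\n\nend"
split_result = recursive_split(sample_text, chunk_size=16)
print(split_result)
```

Only the middle paragraph exceeds the limit, so only that paragraph is broken further, on the line separator.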

---

## 2. Separator behavior parameters

### `keep_separator` (bool | "start" | "end")
*   **Default:** `True`
*   **What it is:** Determines what happens to the text that was split on (e.g. `\n\n`).
    *   `True` (or `"start"`): The separator is kept at the **start** of the following piece.
    *   `"end"`: The separator is kept at the **end** of the preceding piece.
    *   `False`: The separator is removed entirely.
*   **Effect:** Keeping the separator helps the LLM understand the structure of the text (that a paragraph ended there). Note that with `strip_whitespace=True`, whitespace separators at the edges of a finished chunk are stripped anyway.
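
The difference between the placements is easiest to see on a tiny string. The helper below is a hypothetical sketch of the idea using `re.split` with a capturing group; it is not the library's code.

```python
import re


def split_keeping_separator(text, separator, keep="start"):
    """Split text and re-attach each separator to the next ("start") or previous ("end") piece."""
    # A capturing group makes re.split return the separators too:
    # [piece0, sep, piece1, sep, piece2, ...]
    parts = re.split(f"({re.escape(separator)})", text)
    pieces, separators = parts[0::2], parts[1::2]
    if keep == "end":
        joined = [p + s for p, s in zip(pieces, separators)] + pieces[len(separators):]
    else:
        joined = pieces[:1] + [s + p for s, p in zip(separators, pieces[1:])]
    return [piece for piece in joined if piece]


text = "A. B. C"
kept_at_start = split_keeping_separator(text, ". ", keep="start")
kept_at_end = split_keeping_separator(text, ". ", keep="end")
print(kept_at_start)
print(kept_at_end)
```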

### `is_separator_regex` (bool)
*   **Default:** `False`
*   **What it is:** If `True`, the items in `separators` are treated as regular expressions (regex) instead of plain text.
*   **Effect:** Enables more complex splitting rules (e.g. "split where a period is followed by a capital letter").
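
As an illustration of the "period followed by a capital letter" rule, here is one possible pattern tried out with the standard `re` module. The pattern and the sample sentence are assumptions for the example; a pattern like this could presumably be placed in `separators` together with `is_separator_regex=True`.

```python
import re

# Whitespace that follows a period and precedes an uppercase letter.
sentence_boundary = r"(?<=\.)\s+(?=[A-Z])"

text = "Uptime is 99.99% per month. Downtime exceeds 5 minutes. e.g. is not a boundary."
sentences = re.split(sentence_boundary, text)

for sentence in sentences:
    print(repr(sentence))
```

The decimal point in `99.99%` and the lowercase continuation after `e.g.` are left alone because neither is followed by whitespace plus a capital letter.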

### `strip_whitespace` (bool)
*   **Default:** `True`
*   **What it is:** If `True`, removes surplus spaces and other whitespace at the very start and end of each finished chunk.
*   **Effect:** Cleans the data so that chunks do not begin with stray line breaks or spaces.

---

## 3. Technical and mapping parameters

### `length_function` (Callable)
*   **Default:** `len` (Python's built-in function, which counts characters)
*   **What it is:** The yardstick the splitter uses to measure text length. It determines what the number in `chunk_size` actually means.
*   **How it works:**
    *   **Characters (default):** If you leave the parameter unchanged, the splitter uses `len()`. Setting `chunk_size=500` then means **at most 500 characters** (letters, spaces, punctuation). This is fast but less precise for LLMs (the model does not "see" characters, it sees tokens).
    *   **Tokens (advanced):** For RAG it is often better to measure directly in the units the model understands (tokens). Setting `chunk_size=500` then means **at most 500 tokens**. This lets you use the model's context window more efficiently.
*   **Example of a token-based setup:**
    To split by tokens (e.g. for OpenAI models), you can use the `tiktoken` library:

    ```python
    import tiktoken
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Build the tokenizer once; cl100k_base is the encoding for GPT-3.5/4
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # A length function that counts tokens instead of characters
    def tiktoken_len(text):
        return len(tokenizer.encode(text))

    # Pass this function to the splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,  # now means 500 TOKENS
        chunk_overlap=50,  # also measured in tokens
        length_function=tiktoken_len,
    )
    ```
### `add_start_index` (bool)
*   **Default:** `False`
*   **What it is:** If `True`, adds a `start_index` key to the chunk metadata holding the character position in the original document where the chunk begins.
*   **Effect:** Useful for highlighting the retrieved text inside the original document in the application UI.
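
What a start offset buys you can be shown with plain strings. The helper below is a hypothetical sketch that locates each chunk with `str.find`; it only illustrates the idea behind `start_index` and is not how the splitter is implemented.

```python
def chunks_with_start_index(text, chunks):
    """Attach the character offset of each chunk within the original text."""
    located, search_from = [], 0
    for chunk in chunks:
        start = text.find(chunk, search_from)
        located.append({"page_content": chunk, "metadata": {"start_index": start}})
        # Search onward from here so repeated text maps to later offsets.
        search_from = max(start, 0) + 1
    return located


source_text = "1. DEFINITIONS\n\n2. SERVICE COMMITMENT\n\n3. SERVICE CREDITS"
located = chunks_with_start_index(source_text, source_text.split("\n\n"))

for item in located:
    start = item["metadata"]["start_index"]
    # The offset lets a UI highlight exactly this span in the source document.
    print(start, repr(source_text[start:start + len(item["page_content"])]))
```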

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
all_docs = documents_text + documents_md + documents_csv
chunks = splitter.split_documents(all_docs)
print(f"Split {len(all_docs)} documents into {len(chunks)} chunks.")
if len(chunks) > 0:
    print("Example chunk:", chunks[0].page_content)

Split 6 documents into 9 chunks.
Example chunk: SERVICE LEVEL AGREEMENT (SLA) for NebulaDB Enterprise

1. DEFINITIONS
"Uptime" refers to the availability of the Service during a billing month.
"Downtime" refers to a period of time exceeding 5 minutes where the Error Rate is greater than 5%.

2. SERVICE COMMITMENT
NebulaDB commits to a Monthly Uptime Percentage of at least 99.99% ("Service Commitment").


In [6]:
# --- VARIABLE INSPECTION ---

# 1. Inspect `all_docs` (all documents before splitting)
# What it is: the list of all documents produced by the loaders. Each item is a 'Document' object.
print(f"🔍 Original documents (all_docs): {len(all_docs)} items")
try:
    # Show details of the first document (e.g. the SLA contract)
    if len(all_docs) > 0:
        print("   - Object type:", type(all_docs[0]))
        print(f"   - Preview of document 1 (characters: {len(all_docs[0].page_content)}):")
        print(f"     '{all_docs[0].page_content[:100]}...'") # First 100 characters only
        print(f"     Metadata: {all_docs[0].metadata}")
except Exception as e:
    print(f"   - Error while displaying all_docs: {e}")

print("\n" + "="*40 + "\n")

# 2. Inspect `chunks` (the split pieces)
# What it is: the list of documents AFTER the splitter has run.
# Why: the original documents may be too long for the LLM and would not fit into its context.
print(f"🔪 Split pieces (chunks): {len(chunks)} items")
try:
    if len(chunks) > 0:
        print("   - Object type:", type(chunks[0]), "(still a Document!)")

        # Comparison: how many chunks came from the first document?
        # Keep the chunks that share the first document's 'source'
        if len(all_docs) > 0:
            source_file = all_docs[0].metadata.get('source')
            matching_chunks = [c for c in chunks if c.metadata.get('source') == source_file]

            print(f"   - Original file '{source_file}' was split into {len(matching_chunks)} parts.")
            if len(matching_chunks) > 0:
                print("   - Preview of chunk 1:")
                print(f"     '{matching_chunks[0].page_content}'")
                print(f"     Chunk metadata: {matching_chunks[0].metadata}")
except Exception as e:
    print(f"   - Error while displaying chunks: {e}")

🔍 Original documents (all_docs): 6 items
   - Object type: <class 'langchain_core.documents.base.Document'>
   - Preview of document 1 (characters: 1027):
     'SERVICE LEVEL AGREEMENT (SLA) for NebulaDB Enterprise

1. DEFINITIONS
"Uptime" refers to the availab...'
     Metadata: {'source': '../data/legal/sla_contract.txt'}


🔪 Split pieces (chunks): 9 items
   - Object type: <class 'langchain_core.documents.base.Document'> (still a Document!)
   - Original file '../data/legal/sla_contract.txt' was split into 3 parts.
   - Preview of chunk 1:
     'SERVICE LEVEL AGREEMENT (SLA) for NebulaDB Enterprise

1. DEFINITIONS
"Uptime" refers to the availability of the Service during a billing month.
"Downtime" refers to a period of time exceeding 5 minutes where the Error Rate is greater than 5%.

2. SERVICE COMMITMENT
NebulaDB commits to a Monthly Uptime Percentage of at least 99.99% ("Service Commitment").'
     Chunk metadata: {'source': '../data/legal/sla_contract.txt'}


In [12]:
all_docs

[Document(metadata={'source': '../data/legal/sla_contract.txt'}, page_content='SERVICE LEVEL AGREEMENT (SLA) for NebulaDB Enterprise\n\n1. DEFINITIONS\n"Uptime" refers to the availability of the Service during a billing month.\n"Downtime" refers to a period of time exceeding 5 minutes where the Error Rate is greater than 5%.\n\n2. SERVICE COMMITMENT\nNebulaDB commits to a Monthly Uptime Percentage of at least 99.99% ("Service Commitment").\n\n3. SERVICE CREDITS\nIf we do not meet the Service Commitment, you will be eligible to receive a Service Credit as follows:\n- < 99.99% but >= 99.0%: 10% Credit\n- < 99.0% but >= 95.0%: 25% Credit\n- < 95.0%: 100% Credit\n\n4. EXCLUSIONS\nThe Service Commitment does not apply to any unavailability, suspension, or termination of NebulaDB performance issues: (i) caused by factors outside of our reasonable control, including any force majeure event or Internet access or related problems beyond the demarcation point of NebulaDB.\n\n5. CLAIMS\nTo receiv

## 4. Vector Stores & Embeddings

**Concept:** Computers don't understand text meaning, they understand numbers. **Embeddings** convert text into a vector (a list of numbers) where similar meanings are close together in space.

We store these vectors in a **Vector Store** (like ChromaDB) to perform semantic search.

# 🧠 What happens in the "Embeddings & Vector Store" step?

In this phase we turn **human language** into **machine language** so that we can search it efficiently.

## 1. Embedding Model (the translator)
*   **What it is:** A service (in our case `text-embedding-ada-002` on Azure) that takes any text and returns a long list of numbers (a vector).
*   **How it works:** The model captures the meaning of words. Words with a similar meaning (e.g. *king* and *emperor*) get vectors that are very close to each other.
*   **Result:** Each of our text chunks becomes a list of 1536 numbers.

## 2. Vector Store (the database of meanings)
*   **What it is:** A specialized database (here `ChromaDB`) that stores not only the text but, above all, the numeric vectors.
*   **Why we need it:** A classic database looks for an exact match (*does it contain the word "contract"?*). A vector database looks for **closeness in meaning** (*is this query similar to the content of the contract?*).
*   **Process:**
    1.  We take all the `chunks`.
    2.  We send them to Azure to be "translated" (embedded).
    3.  We get the vectors back.
    4.  We store them in `vectorstore` together with the original text and metadata.

In [7]:
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os, shutil

# IF YOU HAVE A .ENV FILE, THE VALUES ARE LOADED FROM IT.
# OTHERWISE FILL THEM IN DIRECTLY BETWEEN THE QUOTES:
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT", "https://YOUR_RESOURCE_NAME.openai.azure.com/")
api_key = os.getenv("AZURE_OPENAI_API_KEY", "YOUR_API_KEY") # Usually exposed in the environment as AZURE_OPENAI_API_KEY
deployment_name = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-ada-002") # Name of the model deployment (not of the model itself!)
api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2023-05-15")

# Initialize the Azure embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=azure_endpoint,
    api_key=api_key,
    azure_deployment=deployment_name,
    openai_api_version=api_version,
)


import stat  # needed for the read-only fix below

# Helper that clears the read-only flag when rmtree runs into it
def force_remove_readonly(func, path, exc_info):
    os.chmod(path, stat.S_IWRITE) # make the path writable
    func(path) # retry the delete operation

# 1. Remove the old database, if there is one
persist_directory = "../chroma_db"
if os.path.exists(persist_directory):
    # The onerror parameter calls our helper whenever a delete fails
    shutil.rmtree(persist_directory, onerror=force_remove_readonly)
    print(f"🗑️ Deleted old database: {persist_directory}")

# 2. Create a new database with persistence
# Create the new database and write it to disk right away.
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="rag_training_v1",
    persist_directory=persist_directory
)
print(f"✅ New VectorStore created and saved to: {persist_directory}")

print("VectorStore created with Azure OpenAI!")

🗑️ Deleted old database: ../chroma_db
✅ New VectorStore created and saved to: ../chroma_db
VectorStore created with Azure OpenAI!


# 🧠 How do Embeddings and the Vector Store work? (A plain-language explanation)

Think of the whole process as the **relationship between a Cookbook (the Vector Store) and Taste Buds (the Embedding Model)**. Here is the explanation step by step.

## 1. What exactly is an "Embedding"?
**An embedding is a translator from human language into the "numeric language" that a computer understands.**

Imagine you have a map of the world.
*   When I say **"Dog"**, the embedding model converts the word into exact coordinates: `[50.123, 14.456]`.
*   When I say **"Cat"**, the model says: `[50.125, 14.458]` (on the map it lies right next to the dog).
*   When I say **"Space rocket"**, the model says: `[-20.000, 80.555]` (a completely different continent).

The long array of numbers you see in the output (the `vector` variable in the code) is exactly this kind of "GPS coordinates of meaning" in a huge, 1536-dimensional space.
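
The map analogy can be checked numerically. The sketch below reuses the toy 2-D coordinates from the bullets above (real embeddings have 1536 dimensions) and measures straight-line distance with the standard library.

```python
import math

# Toy 2-D "coordinates of meaning" from the analogy; not real embeddings.
coordinates = {
    "dog": [50.123, 14.456],
    "cat": [50.125, 14.458],
    "space rocket": [-20.000, 80.555],
}

# math.dist returns the Euclidean distance between two points.
dog_to_cat = math.dist(coordinates["dog"], coordinates["cat"])
dog_to_rocket = math.dist(coordinates["dog"], coordinates["space rocket"])

print(f"dog -> cat:          {dog_to_cat:.4f}")
print(f"dog -> space rocket: {dog_to_rocket:.4f}")
```

A small distance stands for "similar meaning", which is the property that similarity search relies on.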

---

## 2. Does the embedding have to be "permanent"?
**Definitely YES. This is absolutely crucial.**

You must always use **the same model** (the same key to the map). Think of two different models as two different maps:
1.  **Model A (OpenAI):** A map of Prague.
2.  **Model B (HuggingFace):** A map of New York.

If you stored the documents with **Model A** (coordinates in Prague) and then queried with **Model B** (looking for coordinates in New York), the two would never meet. That is why the whole notebook uses a single `embeddings` variable.

---

## 3. How do the "test word" and the "Vector Store" connect?
This is the magic moment of RAG (Retrieval-Augmented Generation). It works in three steps:

### Step A: Storage (Ingestion), or "planting flags"
1.  We take a sentence from the contract: *"The contract is valid for 5 years."*
2.  We send it to the `embeddings` model.
3.  The model returns coordinates: `[0.1, 0.5, 0.9 ...]`.
4.  We **store these coordinates in the Vector Store** (ChromaDB) and attach a note with the original text.
    *   *Effect: we have a database full of flags planted on the map according to meaning.*

### Step B: Query, or "Where am I?"
1.  The user (or you, in a test) writes a query: *"How long is it valid?"* (Notice that we did not use the same words.)
2.  We send this query to **the very same `embeddings` model**.
3.  The model returns coordinates for the query: `[0.11, 0.51, 0.89 ...]`.

### Step C: Search, or "Who is my neighbor?"
1.  The vector database receives the coordinates of your query.
2.  It looks around that point for the **nearest planted flag**.
3.  It finds the flag *"The contract is valid for 5 years"* because its coordinates are mathematically the closest (it sits on the same "square of meaning").
4.  It returns the original text to you.

In [8]:
# --- INSPECTING EMBEDDINGS AND THE VECTOR STORE ---

# 1. Test embedding
# What it is: a vector that represents the meaning of the test word.
# Why: to verify that the model works and to see what "machine language" looks like.
test_word = "apple"
try:
    vector = embeddings.embed_query(test_word)
    print(f"📊 Test word: '{test_word}'")
    print(f"   - Vector length: {len(vector)} dimensions (the standard for ada-002 is 1536)")
    print(f"   - First 5 numbers of the vector: {vector[:5]}...")
    print(f"   - Data type: {type(vector)}")
except Exception as e:
    print(f"❌ Error while creating the embedding: {e}")

print("\n" + "="*40 + "\n")

# 2. Vector Store (ChromaDB)
# What it is: the database that holds the vectors of all our chunks.
# Why: so that we can quickly search them by similarity.
print("🗄️ VectorStore (ChromaDB):")
# The LangChain Chroma wrapper encapsulates a collection
try:
    if hasattr(vectorstore, "_collection"):
        print(f"   - Collection name: {vectorstore._collection.name}")
        print(f"   - Number of stored vectors: {vectorstore._collection.count()}")

    # Look for something similar to the word "contract"
    # This verifies that search works end to end
    results = vectorstore.similarity_search("contract", k=1)
    if results:
        print("\n🔎 Quick search test for the word 'contract':")
        print(f"   - Document found: '{results[0].page_content[:100]}...'")
        print(f"   - Source: {results[0].metadata}")
    else:
        print("\n⚠️ Search returned 0 results (unexpected if we have data).")

except Exception as e:
    print(f"❌ Error while working with the VectorStore: {e}")

📊 Test word: 'apple'
   - Vector length: 1536 dimensions (the standard for ada-002 is 1536)
   - First 5 numbers of the vector: [0.007782285567373037, -0.023080386221408844, -0.007522648200392723, -0.02772652730345726, -0.00455048494040966]...
   - Data type: <class 'list'>


🗄️ VectorStore (ChromaDB):
   - Collection name: rag_training_v1
   - Number of stored vectors: 9

🔎 Quick search test for the word 'contract':
   - Document found: 'Quarter: Q4 2023
Revenue: 950000
Expenses: 750000
Profit: 200000
Notes: Previous year comparison....'
   - Source: {'source': '../data/finance/financial_report.csv', 'row': 3}


## 5. Basic Retrieval (Semantic Search)

In the previous step we split the documents, converted them into numbers (vectors) and stored them in a **vector database (Chroma)**. Now we will see how to search this data for answers.

### ⚙️ How does it work under the hood?

The code `docs = vectorstore.similarity_search(question, k=3)` looks simple, but it triggers a process that ties together everything we have done so far.

For the database to find an answer, the following has to happen:

1.  **Transforming the question (Embeddings):**
    *   Your query (`question`) is just text. The database, however, only understands numbers.
    *   So the system (the `vectorstore` object) looks at its configured **Embeddings** (our "model/translator").
    *   It sends your query to the model (Azure OpenAI) and gets back the **question vector** (a list of numbers representing the meaning of the question).
    *   **Important:** We must use *exactly the same* model (Embeddings) that we used when storing the documents. Otherwise the numbers would not line up.

2.  **Mathematical comparison (Similarity Search):**
    *   ChromaDB takes the **question vector** and compares it with the **document vectors** it has stored (only 9 in this notebook; a nearest-neighbor index lets the same search scale to much larger collections).
    *   It looks for the *nearest neighbors*, that is, the documents whose meaning vector is mathematically closest to the question vector.

3.  **Selecting the context (Retrieval):**
    *   The database returns the `k` (e.g. 3) best text excerpts. We then hand these excerpts to the LLM as the basis for its answer.
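
The comparison and top-`k` selection can be sketched as a brute-force search. Everything here is an assumption for illustration: the 3-D vectors are invented, the second and third sentences are made up, and cosine similarity is just one common choice; Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=3):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy store: (text, vector) pairs a real embedding model would produce.
toy_store = [
    ("The contract is valid for 5 years.", [0.1, 0.5, 0.9]),
    ("Revenue grew in Q4.", [0.9, 0.1, 0.2]),
    ("Uptime commitment is 99.99%.", [0.2, 0.6, 0.7]),
]
query_vector = [0.11, 0.51, 0.89]  # stands in for the embedded question

best_two = top_k(query_vector, toy_store, k=2)
print(best_two)
```

Scanning every vector is fine for nine chunks; dedicated vector stores exist because this does not scale to millions.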

---

> 💡 **Architecture in a real application vs. the notebook**
>
> In this training notebook we do everything at once: we create the database and query it right away. In a real production application (a "chatbot") these processes are separate:
>
> 1.  **Ingestion pipeline (nightly job):** A script that periodically takes the documents, creates or updates the database and **saves it to disk**.
> 2.  **Application (chatbot):** Runs continuously. At startup it creates nothing; it only **connects** to the existing database on disk (it loads `persist_directory` and uses the same `embeddings` model) and starts searching straight away.

In [19]:
question = "What is the SLA for downtimes?"



# --- SIMULATION: loading an existing database (query only) ---
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os
# 1. Initialize the embeddings
# The database needs to know which "language" (model) to use to turn your query into numbers.
# It must be the SAME model that was used to store the data!
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-ada-002"),
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2023-05-15"),
)
# 2. Load the existing VectorStore from disk
# We do not call .from_documents() because we are not inserting anything.
# We only say: "Here is the folder with the data, use it."
persist_directory = "../chroma_db"  # Path to the folder from the previous steps
if os.path.exists(persist_directory):
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings,
        collection_name="rag_training_v1"
    )
    print(f"✅ Connected to the database. It contains {vectorstore._collection.count()} documents.")
else:
    print(f"❌ Error: Folder {persist_directory} does not exist. Run the ingestion part first (database creation).")


docs = vectorstore.similarity_search(question, k=3)

for i, doc in enumerate(docs):
    print(f"\n[Result {i+1}] Source: {doc.metadata.get('source')}")
    print(doc.page_content)

✅ Connected to the database. It contains 9 documents.

[Result 1] Source: ../data/legal/sla_contract.txt
SERVICE LEVEL AGREEMENT (SLA) for NebulaDB Enterprise

1. DEFINITIONS
"Uptime" refers to the availability of the Service during a billing month.
"Downtime" refers to a period of time exceeding 5 minutes where the Error Rate is greater than 5%.

2. SERVICE COMMITMENT
NebulaDB commits to a Monthly Uptime Percentage of at least 99.99% ("Service Commitment").

[Result 2] Source: ../data/legal/sla_contract.txt
4. EXCLUSIONS
The Service Commitment does not apply to any unavailability, suspension, or termination of NebulaDB performance issues: (i) caused by factors outside of our reasonable control, including any force majeure event or Internet access or related problems beyond the demarcation point of NebulaDB.

5. CLAIMS
To receive a Service Credit, you must submit a claim by opening a case in the NebulaDB Support Center within 30 days of the incident.

[Result 3] Source: ../dat

## 6. Generation (The 'G' in RAG)

Finally, we pass the retrieved chunks to the LLM to write the answer.

# 🛠️ Anatomy of a RAG chain: what happens under the hood?
This section explains the logic of a RAG (Retrieval-Augmented Generation) pipeline built with modern **LCEL (LangChain Expression Language)**. The steps follow one another like stations on an assembly line.
## 1. LLM definition (the brain)
```python
llm = AzureChatOpenAI(
    azure_deployment="gpt-4o",  # Your deployment in Azure
    openai_api_version="2023-05-15",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    temperature=0
)
```
*   **What happens:** We initialize the connection to the language model (GPT-4o) running in the Azure cloud.
*   **Why we do it:** We need a "brain" that, at the end of the process, generates a readable answer from the retrieved data.
*   **Key parameter:** `temperature=0`. For RAG systems we want zero so that the output is as deterministic as possible and stays close to the retrieved facts. This reduces hallucination but does not eliminate it.
## 2. Retriever (the searcher)
```python
retriever = vectorstore.as_retriever()
```

*   **What happens:** We wrap the database (VectorStore) in an active search component.
*   **Why we do it:** The vector store can already search via `similarity_search`, but a retriever exposes that search through the standard Runnable interface: we pass in a question (string) and it returns a list of relevant documents. This is what lets it be chained with `|`.
## 3. Prompt Template (the instruction template)
```python
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
```

*   **What happens:** We prepare a "mold" into which the data will be inserted.
*   **Logic:** The template has two slots (variables) that need to be filled:
    1.  `{context}`: The texts found in the documents (your knowledge base) go here.
    2.  `{question}`: The user's current query goes here.
## 4. Helper function (the document formatter)
```python
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])
```
*   **What happens:** This function takes the `Document` objects returned by the retriever and pulls the plain text out of them.
*   **Why we do it:** The retriever returns metadata, source paths and so on, but the LLM only needs the content. The function takes the content of all retrieved documents and joins it into one long text, separated by blank lines, that fits into the `{context}` variable.

## 5. Assembling the chain (The Pipeline)
This is the most important part. We define the data flow with the `|` (pipe) operator, which sends the output of one step as the input to the next.
```python
from langchain_core.runnables import RunnableParallel, RunnableSequence

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Illustration only: roughly what the pipe syntax above builds internally
rag_chain_explicit = RunnableSequence(
    RunnableParallel(  # the dictionary at the start
        context=(retriever | format_docs),
        question=RunnablePassthrough(),
    ),
    prompt,  # ChatPromptTemplate
    llm,     # AzureChatOpenAI
    StrOutputParser(),
)
```
### Step A: Parallel data preparation (the dictionary at the start)
`{"context": ..., "question": ...}`
This block runs first and prepares the data for the prompt. It does two things at once:
1.  **Getting the context:** `retriever | format_docs`
    *   Takes the input question -> sends it to `retriever` (finds documents) -> sends the result to `format_docs` (converts to text) -> stores it as `context`.
2.  **Passing the question along:** `"question": RunnablePassthrough()`
    *   `RunnablePassthrough()` is a "pass-through pipe". It takes the input question and forwards it unchanged under the `question` key.
### Step B: Building the prompt
`| prompt`
*   Takes the data from step A (the filled `context` and `question`) and inserts it into the template from section 3. The result is the final prompt text for the model.
### Step C: Generating the answer
`| llm`
*   The finished prompt is sent to GPT-4o. The model generates an answer (an object of type `AIMessage`).
### Step D: Clean output
`| StrOutputParser()`
*   The model returns a complex object. This parser extracts from it the plain string with the answer that the user sees.
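
How a `|` operator can wire steps together is sketched below with a tiny class. This is a hypothetical miniature written for illustration, not LangChain's implementation: `Step`, `parallel`, and the fake retriever are all invented names, and the "prompt" is a plain f-string.

```python
class Step:
    """Hypothetical miniature of a composable step."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


passthrough = Step(lambda value: value)
fake_retriever = Step(lambda question: ["chunk about downtime", "chunk about credits"])
format_chunks = Step(lambda chunks: "\n\n".join(chunks))
build_prompt = Step(lambda data: f"Context:\n{data['context']}\n\nQuestion: {data['question']}")

mini_chain = parallel(context=fake_retriever | format_chunks, question=passthrough) | build_prompt
result = mini_chain.invoke("What is the SLA for downtimes?")
print(result)
```

The same question reaches both branches of the parallel step, which is why the dictionary at the start of the chain can produce `context` and `question` from a single input.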
## 6. Execution
```python
response = rag_chain.invoke(question)
```

*   **What happens:** The `.invoke()` method is the start button.
*   It takes the `question` variable (your query) and feeds it into the start of the chain (section 5).
*   The whole process runs automatically and a text answer comes out at the end.

In [12]:
# A modern and reliable alternative to RetrievalQA
from langchain_openai import AzureChatOpenAI
import os
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# 1. LLM definition (we use the Azure variant)
# Reads the environment variables defined in the previous cells or in the .env file
llm = AzureChatOpenAI(
    azure_deployment="gpt-4o",  # Assumes the deployment name in Azure is "gpt-4o"
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2023-05-15"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    temperature=0
)

# 2. Retriever (the searcher)
retriever = vectorstore.as_retriever()

# 3. Prompt (instructions for the model)
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# 4. Helper that formats the documents into plain text
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

# 5. Assemble the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# 6. Execution
# .invoke() runs the whole process: finds documents -> inserts them into the prompt -> sends it to GPT-4o
print(f"Question: {question}")
print("-" * 30)
response = rag_chain.invoke(question)
print("Answer:", response)

Question: What is the SLA for downtimes?
------------------------------
Answer: The SLA for downtimes in the NebulaDB Enterprise Service Level Agreement specifies that "Downtime" refers to a period of time exceeding 5 minutes where the Error Rate is greater than 5%.


In [18]:
print(type(rag_chain))

<class 'langchain_core.runnables.base.RunnableSequence'>
