
# 🔎 Vertex AI Search — Title Finder (Notebook)

A minimal, convenient **notebook for finding missions by an approximate title or keyword** in Vertex AI Search (Discovery Engine).  
No Streamlit: just cells plus a simple UI built on `ipywidgets`.

**What it does:**
- Runs a free-text query (it does not have to match word for word) and fetches the top‑K results from Vertex AI Search.
- (Optional) local **fuzzy re‑ranking** against `display_id` / `mission_id` / `outcome` (RapidFuzz).
- **Results table** (`pandas.DataFrame`) plus a view of the **raw `struct_data` of the best hit**.
- Optional **export to CSV/JSONL**.

---



## 1) Dependencies and authentication

Run the cell below if the packages are not installed yet:

```bash
%pip install -q google-cloud-discoveryengine rapidfuzz pandas ipywidgets
```

> **Authentication**: set the credentials, e.g. in the shell before you launch JupyterLab (a kernel that is already running does not pick up variables exported later in a separate terminal):
```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```
Or in Python (applies to the current session only):
```python
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"
```
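
Before running a search, it can help to confirm that the variable points at a usable key file. The sketch below is only a local sanity check: `check_credentials_file` is a hypothetical helper (not part of this notebook or of any Google library), and it assumes a service-account key, which is a JSON object carrying a `type` field. It does not contact Google or validate the key itself.

```python
import json
import os
from pathlib import Path
from typing import Optional


def check_credentials_file(path: Optional[str] = None) -> str:
    """Return a short status string for the credentials file (hypothetical helper).

    Falls back to GOOGLE_APPLICATION_CREDENTIALS when no path is given.
    """
    path = path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"
    key_file = Path(path)
    if not key_file.is_file():
        return "missing"
    try:
        info = json.loads(key_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return "unreadable"
    # Service-account keys are JSON objects with a "type" field.
    return "ok" if isinstance(info, dict) and "type" in info else "unexpected"


print(check_credentials_file())
```

Any status other than `ok` suggests fixing the path before debugging the search call itself.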


In [None]:

# Uncomment if needed:
# %pip install -q google-cloud-discoveryengine rapidfuzz pandas ipywidgets


## 2) Imports and helper functions

In [None]:

import re, json, sys
from typing import Any, Dict, List, Optional
import pandas as pd

# RapidFuzz (fuzzy matching): optional but recommended
try:
    from rapidfuzz import fuzz
    _FUZZY_OK = True
except ImportError:
    _FUZZY_OK = False
    def _simple_ratio(a: str, b: str) -> float:
        a = (a or "").lower()
        b = (b or "").lower()
        if not a or not b:
            return 0.0
        if a in b or b in a:
            return 100.0 * min(len(a), len(b)) / max(len(a), len(b))
        return 0.0
    class fuzz:  # fallback
        @staticmethod
        def partial_ratio(a, b): return _simple_ratio(a, b)
        @staticmethod
        def token_set_ratio(a, b): return _simple_ratio(a, b)

# Google Discovery Engine client
try:
    from google.cloud import discoveryengine_v1 as de
    from google.api_core.client_options import ClientOptions
except ImportError:
    print("⚠️  The google-cloud-discoveryengine package is missing. Run: %pip install google-cloud-discoveryengine", file=sys.stderr)
    raise

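def infer_api_endpoint(serving_config: str) -> str:
    """Derive the API host from the location segment of serving_config."""
    m = re.search(r"/locations/([^/]+)/", serving_config)
    loc = m.group(1) if m else "global"
    # The global location uses the unprefixed host; regions such as "eu" or "us" are prefixed.
    if loc == "global":
        return "discoveryengine.googleapis.com"
    return f"{loc}-discoveryengine.googleapis.com"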

def search_vertex(
    serving_config: str,
    query: str,
    page_size: int = 30,
    api_endpoint: Optional[str] = None,
    filter_expression: Optional[str] = None,
) -> List[Dict[str, Any]]:
    endpoint = api_endpoint or infer_api_endpoint(serving_config)
    client = de.SearchServiceClient(client_options=ClientOptions(api_endpoint=endpoint))

    req = de.SearchRequest(serving_config=serving_config, query=query, page_size=page_size)
    # Enable automatic query expansion and spell correction where available
    try:
        req.query_expansion_spec = de.SearchRequest.QueryExpansionSpec(
            condition=de.SearchRequest.QueryExpansionSpec.Condition.AUTO
        )
    except Exception:
        pass
    try:
        req.spell_correction_spec = de.SearchRequest.SpellCorrectionSpec(
            mode=de.SearchRequest.SpellCorrectionSpec.Mode.AUTO
        )
    except Exception:
        pass
    if filter_expression:
        req.filter = filter_expression

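    rows: List[Dict[str, Any]] = []
    # client.search() returns a pager that keeps fetching pages, so stop after page_size hits.
    for i, r in enumerate(client.search(request=req)):
        if i >= page_size:
            break
        doc = r.document
        # to_dict() yields plain dicts and lists; dict(doc.struct_data) would leave nested
        # values as proto-plus wrappers that fail isinstance(..., dict) and json.dumps.
        struct = de.Document.to_dict(doc).get("struct_data") or {}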
        links = struct.get("links") if isinstance(struct.get("links"), dict) else {}
        tags = struct.get("tags", [])
        if isinstance(tags, dict):
            tags = list(tags.values())
        elif isinstance(tags, str):
            tags = [tags]

        rows.append({
            "doc_name": doc.name,
            "doc_id": getattr(doc, "id", None) or doc.name.split("/")[-1],
            "engine_score": getattr(r, "score", None),
            "mission_id": struct.get("mission_id"),
            "display_id": struct.get("display_id"),
            "mission_type": struct.get("mission_type"),
            "outcome": struct.get("outcome"),
            "tags": tags,
            "approved": struct.get("approved"),
            "final_score": struct.get("final_score"),
            "lang": struct.get("lang"),
            "nodes_count": struct.get("nodes_count"),
            "edges_count": struct.get("edges_count"),
            "plan_uri": links.get("plan_uri"),
            "transcript_uri": links.get("transcript_uri"),
            "metrics_uri": links.get("metrics_uri"),
            "txt_uri": links.get("txt_uri"),
            "_struct": struct,
        })
    return rows

def fuzzy_score(query: str, row: Dict[str, Any]) -> float:
    fields = [row.get("display_id") or "", row.get("mission_id") or "", row.get("outcome") or ""]
    scores = [
        fuzz.token_set_ratio(query, fields[0]),
        fuzz.partial_ratio(query, fields[0]),
        fuzz.token_set_ratio(query, fields[1]),
        fuzz.partial_ratio(query, fields[1]),
        fuzz.token_set_ratio(query, fields[2]),
        fuzz.partial_ratio(query, fields[2]),
    ]
    return max(float(s or 0.0) for s in scores)

def rerank_by_title(query: str, rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for r in rows:
        r["fuzzy_score"] = fuzzy_score(query, r)
        es = r.get("engine_score") or 0.0
        try:
            es = float(es)
        except Exception:
            es = 0.0
        r["combined_score"] = (0.8 * r["fuzzy_score"]) + (0.2 * es)
    rows.sort(key=lambda x: (x.get("combined_score", 0.0), x.get("engine_score") or 0.0), reverse=True)
    return rows

def as_dataframe(rows: List[Dict[str, Any]]) -> pd.DataFrame:
    cols = [
        "combined_score", "fuzzy_score", "engine_score",
        "display_id", "mission_id", "mission_type", "outcome",
        "tags", "approved", "final_score", "lang",
        "nodes_count", "edges_count",
        "plan_uri", "transcript_uri", "metrics_uri"
    ]
    df = pd.DataFrame([{k: r.get(k) for k in cols} for r in rows])
    return df
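
The re-ranking above blends a 0 to 100 fuzzy score with the engine score using fixed weights of 0.8 and 0.2. If the engine score is on a different scale (or missing), it contributes little to that blend. The sketch below shows one possible variant, not the notebook's method: it min-max normalizes engine scores to 0 to 100 before blending. `title_similarity` and `blend_scores` are hypothetical names, and `difflib` is used only as a standard-library stand-in for RapidFuzz so the example runs anywhere.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List, Optional


def title_similarity(query: str, title: Optional[str]) -> float:
    """Return a 0-100 similarity (difflib stand-in for a RapidFuzz ratio)."""
    if not query or not title:
        return 0.0
    return 100.0 * SequenceMatcher(None, query.lower(), title.lower()).ratio()


def blend_scores(
    query: str, rows: List[Dict[str, Any]], fuzzy_weight: float = 0.8
) -> List[Dict[str, Any]]:
    """Rank rows by a weighted mix of title similarity and normalized engine score."""
    engine = [float(r.get("engine_score") or 0.0) for r in rows]
    lo, hi = (min(engine), max(engine)) if engine else (0.0, 0.0)
    span = hi - lo
    ranked = []
    for row, es in zip(rows, engine):
        # Min-max normalization puts the engine score on the same 0-100 scale.
        es_norm = 100.0 * (es - lo) / span if span else 0.0
        fz = title_similarity(query, row.get("display_id"))
        combined = fuzzy_weight * fz + (1.0 - fuzzy_weight) * es_norm
        ranked.append({**row, "fuzzy_score": fz, "combined_score": combined})
    ranked.sort(key=lambda r: r["combined_score"], reverse=True)
    return ranked


demo_rows = [
    {"display_id": "Fraud scoring pipeline", "engine_score": 0.2},
    {"display_id": "Invoice OCR cleanup", "engine_score": 0.9},
]
# The query contains a typo ("pipline"); the close title should still rank first.
ranked = blend_scores("fraud scoring pipline", demo_rows)
print([r["display_id"] for r in ranked])
```

With a fuzzy weight of 0.8, a near-exact title match outranks a hit that the engine scored higher but whose title is unrelated, which is the behavior a title finder usually wants.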


## 3) Simple UI (ipywidgets)

In [None]:

import ipywidgets as W
from IPython.display import display, JSON, clear_output

serving_config_w = W.Text(
    description="serving_config",
    placeholder="projects/PROJ/locations/eu/collections/default_collection/dataStores/DS/servingConfigs/default_config",
    layout=W.Layout(width="100%")
)
api_endpoint_w = W.Text(
    description="api_endpoint",
    placeholder="(auto from location)",
    layout=W.Layout(width="50%")
)
query_w = W.Text(
    description="query",
    value="fraud scoring pipeline rollback handling",
    layout=W.Layout(width="100%")
)
topk_w = W.IntSlider(description="top_k", min=1, max=100, step=1, value=30, continuous_update=False)
filter_w = W.Text(description="filter", placeholder='approved = true AND lang = "pl"', layout=W.Layout(width="100%"))
fuzzy_w = W.Checkbox(description="Fuzzy re‑rank", value=True)
search_btn = W.Button(description="🔎 Search", button_style="primary")
export_csv_w = W.Text(description="CSV", placeholder="hits.csv")
export_json_w = W.Text(description="JSONL", placeholder="hits.jsonl")
export_btn = W.Button(description="💾 Export", button_style="")

hdr = W.HTML("<b>Query parameters</b>")
box = W.VBox([hdr, serving_config_w, api_endpoint_w, query_w, topk_w, filter_w, W.HBox([fuzzy_w, search_btn])])
display(box)

out = W.Output()
display(out)

last_rows = []  # results kept in memory for this session

def on_search(_):
    global last_rows
    out.clear_output()
    with out:
        if not serving_config_w.value.strip():
            print("❗ Provide a serving_config.")
            return
        print("Querying Vertex AI Search...")
        try:
            rows = search_vertex(
                serving_config=serving_config_w.value.strip(),
                query=query_w.value.strip(),
                page_size=int(topk_w.value),
                api_endpoint=api_endpoint_w.value.strip() or None,
                filter_expression=filter_w.value.strip() or None,
            )
        except Exception as e:
            print("❌ Search error:", e)
            return
        if not rows:
            print("ℹ️ No results.")
            return
        if fuzzy_w.value:
            rows = rerank_by_title(query_w.value.strip(), rows)
            print(f"Note: fuzzy re‑ranking applied (RapidFuzz={_FUZZY_OK}).")
        last_rows = rows
        df = as_dataframe(rows)
        display(df)

        # Show the raw struct_data of the best hit; default=str guards against
        # values that are not plain JSON types.
        print("\n== Raw struct_data (best hit) ==")
        print(json.dumps(rows[0]["_struct"], ensure_ascii=False, indent=2, default=str))

search_btn.on_click(on_search)

# Export
box2 = W.HBox([export_csv_w, export_json_w, export_btn])
display(W.HTML("<b>Export results (optional)</b>"))
display(box2)

def on_export(_):
    # Print inside the Output widget; text printed from a widget callback may
    # otherwise end up in the log console instead of the notebook.
    with out:
        if not last_rows:
            print("⚠️ Run a search first.")
            return
        csv_path = export_csv_w.value.strip()
        jsonl_path = export_json_w.value.strip()
        if csv_path:
            import csv
            cols = [
                "combined_score", "fuzzy_score", "engine_score",
                "display_id", "mission_id", "mission_type", "outcome",
                "tags", "approved", "final_score", "lang",
                "nodes_count", "edges_count",
                "plan_uri", "transcript_uri", "metrics_uri"
            ]
            with open(csv_path, "w", newline="", encoding="utf-8") as f:
                w = csv.DictWriter(f, fieldnames=cols)
                w.writeheader()
                for r in last_rows:
                    rec = {k: r.get(k) for k in cols}
                    # Nested values are stored as JSON strings inside the CSV cell.
                    for k, v in rec.items():
                        if isinstance(v, (list, dict)):
                            rec[k] = json.dumps(v, ensure_ascii=False, default=str)
                    w.writerow(rec)
            print(f"✅ CSV saved: {csv_path}")
        if jsonl_path:
            with open(jsonl_path, "w", encoding="utf-8") as f:
                for r in last_rows:
                    f.write(json.dumps(r, ensure_ascii=False, default=str) + "\n")
            print(f"✅ JSONL saved: {jsonl_path}")

export_btn.on_click(on_export)
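
The two export formats differ in how they carry list fields such as `tags`: JSONL keeps them as lists, while the CSV export stores them as JSON text inside a single cell. The sketch below illustrates reading both back using only the standard library and in-memory buffers. The sample row is made up for illustration and is not real search output.

```python
import csv
import io
import json

# Made-up sample row, shaped like the rows produced by the search helper.
rows = [{"display_id": "Fraud scoring pipeline", "tags": ["fraud", "rollback"], "approved": True}]

# JSONL: one JSON object per line; lists and booleans survive the round trip.
jsonl_text = "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in rows)
loaded_jsonl = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# CSV: nested values are written as JSON strings, mirroring the export above.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["display_id", "tags", "approved"])
writer.writeheader()
for r in rows:
    writer.writerow({
        k: json.dumps(v, ensure_ascii=False) if isinstance(v, (list, dict)) else v
        for k, v in r.items()
    })
buf.seek(0)
loaded_csv = list(csv.DictReader(buf))

# Every CSV cell comes back as text, so list cells need an explicit json.loads.
tags_from_csv = json.loads(loaded_csv[0]["tags"])
print(loaded_jsonl[0]["tags"], tags_from_csv, repr(loaded_csv[0]["approved"]))
```

JSONL is the safer choice when the results feed another script; CSV is convenient for spreadsheets but needs the extra decoding step shown above.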


## 4) Programmatic use (no widgets)

In [None]:

# Headless example:
# serving_config = "projects/PROJ/locations/eu/collections/default_collection/dataStores/DS/servingConfigs/default_config"
# rows = search_vertex(serving_config, query="your mission title", page_size=30)
# rows = rerank_by_title("your mission title", rows)
# pd.DataFrame(rows)[[
#     "combined_score","fuzzy_score","engine_score","display_id","mission_id","plan_uri"
# ]].head(10)
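
The headless example needs a full `serving_config` resource name. The sketch below assembles one from its parts, following the pattern shown in the placeholder (`projects/.../locations/.../collections/default_collection/dataStores/.../servingConfigs/default_config`), and derives the matching API host. `build_serving_config` and `api_host` are hypothetical helpers, and the defaults for the collection and serving config names are assumptions taken from that placeholder; adjust them if your setup differs.

```python
def build_serving_config(
    project: str,
    location: str,
    data_store: str,
    collection: str = "default_collection",   # assumed default, from the placeholder
    serving_config: str = "default_config",    # assumed default, from the placeholder
) -> str:
    """Assemble a data-store serving config resource name (hypothetical helper)."""
    return (
        f"projects/{project}/locations/{location}"
        f"/collections/{collection}/dataStores/{data_store}"
        f"/servingConfigs/{serving_config}"
    )


def api_host(location: str) -> str:
    """Return the API host: unprefixed for "global", location-prefixed otherwise."""
    if location == "global":
        return "discoveryengine.googleapis.com"
    return f"{location}-discoveryengine.googleapis.com"


# "my-project" and "my-data-store" are placeholder values.
name = build_serving_config("my-project", "eu", "my-data-store")
print(name)
print(api_host("eu"))
```

Building the name from parts avoids typos in the long path and keeps the location in one place, so the resource name and the endpoint cannot drift apart.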
