# 22 September update: everything redone from scratch

awesome. let’s kick this off clean with your env name **mllocalag** and quality-first Jina v3 + OCR. below is a tight, copy-paste runbook. (you can keep your current code; I only add two tiny helpers and 3–4 lines in ingest/embed.)

---

# 1) one-time cleanup (safe)

```powershell
# stop/remove qdrant + old volumes (ignore errors if not found)
docker rm -f qdrant 2>$null
docker volume ls | findstr qdrant
# if a qdrant volume appears, remove it:
# docker volume rm multilingual-ragbot_qdrant_storage

# clear python caches & old HF cache (we’ll use a short path next)
Get-ChildItem -Recurse -Force -Include __pycache__ | Remove-Item -Recurse -Force -ErrorAction SilentlyContinue
Get-ChildItem -Recurse -Force -Include *.pyc | Remove-Item -Force -ErrorAction SilentlyContinue
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface" -ErrorAction SilentlyContinue

# fresh metadata (we’ll regenerate on ingest)
Remove-Item -Recurse -Force .\data\metadata -ErrorAction SilentlyContinue
New-Item -ItemType Directory -Force -Path .\data\metadata | Out-Null
```

---

# 2) new environment (name: **mllocalag**)

```powershell
conda create -n mllocalag python=3.10 -y
conda activate mllocalag
```

**short HF cache (prevents the old Windows long-path crash)**

```powershell
[Environment]::SetEnvironmentVariable("HF_HOME","C:\hf","User")
[Environment]::SetEnvironmentVariable("HF_HUB_CACHE","C:\hf\hub","User")
[Environment]::SetEnvironmentVariable("TRANSFORMERS_CACHE","C:\hf\tf","User")
# close this terminal, open a new one, then:
conda activate mllocalag
```
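
To confirm the new terminal actually picked up the short cache paths, a small check like the one below can help. It is only a sketch: the `max_len` threshold of 20 characters is an arbitrary assumption, not a Windows or Hugging Face limit.

```python
import os

# Variables set in the PowerShell step above.
HF_VARS = ("HF_HOME", "HF_HUB_CACHE", "TRANSFORMERS_CACHE")

def hf_cache_report(env=os.environ, max_len=20):
    """Return {name: (value, ok)}; ok means the variable is set and its path is short.

    max_len is a hypothetical threshold chosen for illustration only.
    """
    report = {}
    for name in HF_VARS:
        value = env.get(name)
        ok = value is not None and len(value) <= max_len
        report[name] = (value, ok)
    return report

if __name__ == "__main__":
    for name, (value, ok) in hf_cache_report().items():
        print(f"{name:20} {value!s:20} {'ok' if ok else 'CHECK'}")
```

Any line marked `CHECK` means the variable is missing in this shell or points at a longer path than expected.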

**GPU (recommended): install PyTorch CUDA (choose ONE path)**

* pip (CUDA 12.1):

```powershell
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision torchaudio
```

* or conda:

```powershell
conda install -y pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
```

**base deps**

```powershell
pip install -U pip wheel
pip install -r requirements.txt
pip install -U "sentence-transformers>=3.0.1" "transformers>=4.44.0" "huggingface_hub>=0.24" accelerate safetensors
pip install ocrmypdf pdfplumber pytesseract pillow
```

> Make sure system Tesseract is installed with **deu** + **eng** traineddata. (OCRmyPDF will use it.)
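
A quick way to check the language packs before a long ingest is to parse `tesseract --list-langs`. The sketch below assumes the usual output shape (one header line, then one language code per line); the helper names are made up for this example.

```python
from __future__ import annotations

import shutil
import subprocess

def parse_langs(list_langs_output: str) -> set[str]:
    """Parse `tesseract --list-langs` output; the first non-empty line is a header."""
    lines = [ln.strip() for ln in list_langs_output.splitlines() if ln.strip()]
    return set(lines[1:])

def missing_langs(required: str, installed: set[str]) -> list[str]:
    """`required` uses Tesseract's plus syntax, e.g. "deu+eng"."""
    return [lang for lang in required.split("+") if lang not in installed]

def installed_langs() -> set[str]:
    """Ask the system Tesseract; returns an empty set if it is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return set()
    out = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some builds print the list on stderr, so fall back to that stream.
    return parse_langs(out.stdout or out.stderr)
```

Calling `missing_langs("deu+eng", installed_langs())` should return an empty list when both packs are installed.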

---

# 3) config (quality-first Jina v3)

Open **core/config.py** and confirm:

```python
CFG.embed_model        = "jinaai/jina-embeddings-v3"
CFG.embed_dim          = 1024                # full quality; we can crop to 512 later
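# NOTE: the Jina v3 model card selects retrieval behaviour via
# encode(..., task="retrieval.passage" / "retrieval.query"); the prefixes below are
# plain text prepended to the input, so verify that they actually help on your data
CFG.embed_doc_prefix   = "search_document: "
CFG.embed_query_prefix = "search_query: "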
CFG.qdrant_url         = "http://127.0.0.1:6333"
CFG.qdrant_collection  = "tender_docs_jina-v3_d1024_fresh"
CFG.extract_dir        = "data/extract"
CFG.ocr_cache_dir      = "data/ocr_cache"
CFG.ocr_langs          = "deu+eng"
CFG.ocr_quality        = "quality"          # set "speed" later if needed
CFG.ocr_jobs           = 0                   # auto = cpu_count()-1
```
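
The collection name above encodes the vector size (`d1024`), so it can drift out of sync if `embed_dim` changes later. One option is to derive the name from its parts. This is a sketch only: `collection_name` and its arguments are hypothetical and not part of the existing config.

```python
def collection_name(prefix: str, model_tag: str, dim: int, suffix: str = "") -> str:
    """Build a name like 'tender_docs_jina-v3_d1024_fresh' from its parts."""
    name = f"{prefix}_{model_tag}_d{dim}"
    return f"{name}_{suffix}" if suffix else name

# Reproduces the name used in this runbook.
print(collection_name("tender_docs", "jina-v3", 1024, "fresh"))
```

Switching to 512 dimensions would then give a separate collection name instead of reusing one that was created with 1024-dimensional vectors.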

---

# 4) add two helpers (copy-paste)

## 4.1 `core/text_cleaning.py`

```python
import re, unicodedata
URL   = re.compile(r'https?://\S+|www\.\S+', re.I)
EMAIL = re.compile(r'\b[\w\.-]+@[\w\.-]+\.\w+\b')
WS    = re.compile(r'[ \t\r\f\v]+')
CTRL  = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]')
NOISE = re.compile(r'[^\w\s\.\,\;\:\!\?\(\)\[\]\{\}\'\"\-\/_\#\%\&\+\*\=\<\>\@]')
NL3   = re.compile(r'\n{3,}')
SP2   = re.compile(r' {2,}')

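def _nfkc(t):
    # NFKC folds ligatures, full-width forms and NBSP; curly quotes and dashes
    # have no compatibility mapping, so they are replaced explicitly
    t = unicodedata.normalize("NFKC", t).replace("\u00A0"," ")
    return (t.replace("\u2018","'").replace("\u2019","'")
             .replace("\u201C",'"').replace("\u201D",'"')
             .replace("\u2013","-").replace("\u2014","-"))

# join a word split by a hyphen at a line break (the hyphen itself is kept)
def _fix_hyphens(t): return re.sub(r'-\s*\n\s*','-',t)

def clean_text(text: str) -> str:
    if not text: return ""
    text = _nfkc(text)
    text = CTRL.sub(" ", text)      # control chars (tab and newline survive)
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    text = _fix_hyphens(text)
    text = WS.sub(" ", text)        # collapse horizontal whitespace, keep \n
    text = NOISE.sub(" ", text)     # drop symbols outside the allow-list
    text = SP2.sub(" ", text).strip()
    text = NL3.sub("\n\n", text)    # at most one blank line in a row
    return text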
```
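
`_fix_hyphens` joins the two halves of a line-broken word but keeps the hyphen. If you would rather get `Vergabeverfahren` than `Vergabe-verfahren`, a variant can drop the hyphen when the continuation starts with a lowercase letter. This is a heuristic sketch, not a drop-in replacement: `dehyphenate` is a hypothetical name, and it will wrongly merge cases such as `Hoch-` followed by `und` on the next line.

```python
import re

# Hyphen at a line end followed by a lowercase continuation:
# treated as a soft line-break hyphen and removed.
_SOFT_BREAK = re.compile(r"(\w)-[ \t]*\n[ \t]*([a-zäöüß])")

def dehyphenate(text: str) -> str:
    """Merge words split across lines; leave hyphens before capitals untouched."""
    return _SOFT_BREAK.sub(r"\1\2", text)

print(dehyphenate("Vergabe-\nverfahren"))   # merged into one word
print(dehyphenate("E-Mail-\nAdresse"))      # left as-is (capital after the break)
```

The lowercase test is what keeps real compounds like `E-Mail-Adresse` intact, since German nouns after a genuine hyphen are capitalized.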

## 4.2 `core/ocr.py`

```python
from __future__ import annotations
import hashlib, json, subprocess, tempfile, shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Any
import pdfplumber
from PIL import Image
import pytesseract
import fitz  # PyMuPDF

from core.config import CFG

@dataclass
class OcrResult:
    pdf_path: Path
    metrics: Dict[str, Any]

def _jobs():
    import os
    n = os.cpu_count() or 4
    return CFG.ocr_jobs if CFG.ocr_jobs and CFG.ocr_jobs>0 else max(1, n-1)

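def _profile_args():
    if CFG.ocr_quality == "speed":
        # leave pages that already carry a text layer untouched
        return ["--skip-text","--optimize","1","--rotate-pages","--jobs",str(_jobs())]
    # quality: rasterize and OCR every page.
    # --redo-ocr cannot be combined with --deskew/--remove-background,
    # --tesseract-timeout 0 tells OCRmyPDF to skip OCR entirely, and
    # --tesseract-config expects a config *file*, so none of them are used here.
    # --optimize stays at 1 because higher levels rely on pngquant being installed.
    return ["--force-ocr","--deskew","--rotate-pages","--optimize","1",
            "--jobs",str(_jobs())]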

def _estimate_native_chars(pdf: Path) -> int:
    total = 0
    with pdfplumber.open(str(pdf)) as d:
        for p in d.pages[:5]:
            total += len((p.extract_text() or "").strip())
    return total

def _collect_metrics(pdf: Path, applied: bool, note: str="") -> Dict[str,Any]:
    pages=[]
    with pdfplumber.open(str(pdf)) as d:
        for i,p in enumerate(d.pages):
            pages.append({"page":i+1,"len":len((p.extract_text() or ""))})
    return {"ocr_applied":applied,"note":note,"pages":pages,"total_chars":sum(p["len"] for p in pages)}

def _tess_img_to_text(img_path: Path, langs: str, psm: str) -> str:
    img = Image.open(str(img_path))
    try:
        cfg = f"--oem 1 --psm {psm}"
        d = pytesseract.image_to_data(img, lang=langs, config=cfg, output_type=pytesseract.Output.DICT)
        words = d.get("text", [])
        return " ".join(w for w in words if w and w.strip())
    finally:
        img.close()

def _pytess_fallback_to_pdf(src: Path, langs: str) -> Path:
    doc = fitz.open(str(src))
    out = fitz.open()
    tmpdir = Path(tempfile.mkdtemp())
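    for i,page in enumerate(doc):
        pix = page.get_pixmap(dpi=300)
        img = tmpdir / f"p{i:04d}.png"
        pix.save(str(img))
        txt = _tess_img_to_text(img, langs, psm="6")       # psm 6: one uniform text block
        if len(txt.strip())<30:
            txt = _tess_img_to_text(img, langs, psm="11")  # psm 11: sparse text
        img.unlink(missing_ok=True)                        # don't leave page images behind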
        rect = fitz.Rect(0,0,pix.width,pix.height)
        new = out.new_page(width=rect.width, height=rect.height)
        new.insert_textbox(rect, txt)
    out_path = tmpdir / "fallback_ocr.pdf"
    out.save(str(out_path)); out.close(); doc.close()
    return out_path

def ocr_pdf_if_needed(src_pdf: Path) -> OcrResult:
    src_pdf = Path(src_pdf).resolve()
    Path(CFG.ocr_cache_dir).mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256((src_pdf.read_bytes()+str(CFG.ocr_quality+CFG.ocr_langs).encode())).hexdigest()[:16]
    out_pdf = Path(CFG.ocr_cache_dir) / f"{src_pdf.stem}.{key}.ocr.pdf"
    metrics_json = Path(CFG.ocr_cache_dir) / f"{src_pdf.stem}.{key}.metrics.json"

    if out_pdf.exists() and metrics_json.exists():
        return OcrResult(out_pdf, json.loads(metrics_json.read_text(encoding="utf-8")))

    if _estimate_native_chars(src_pdf) > 500:
        m = {"native_text_chars":True,"ocr_applied":False,"pages":[],"total_chars":0}
        metrics_json.write_text(json.dumps(m, ensure_ascii=False, indent=2), encoding="utf-8")
        return OcrResult(src_pdf, m)

    cmd = ["ocrmypdf","-l",CFG.ocr_langs,*_profile_args(),str(src_pdf),str(out_pdf)]
    try:
        subprocess.run(cmd, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        m = _collect_metrics(out_pdf, True)
        metrics_json.write_text(json.dumps(m, ensure_ascii=False, indent=2), encoding="utf-8")
        return OcrResult(out_pdf, m)
    except Exception:
        tmp_pdf = _pytess_fallback_to_pdf(src_pdf, CFG.ocr_langs)
        shutil.move(tmp_pdf, out_pdf)
        m = _collect_metrics(out_pdf, True, note="pytesseract_fallback")
        metrics_json.write_text(json.dumps(m, ensure_ascii=False, indent=2), encoding="utf-8")
        return OcrResult(out_pdf, m)
```
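
The cache key in `ocr_pdf_if_needed` is what makes a re-run skip files that were already OCR'd with the same settings. Pulled out on its own, the idea looks roughly like this. `ocr_cache_key` is a hypothetical helper name; feeding the parts to the hash incrementally gives the same digest as hashing their concatenation.

```python
import hashlib

def ocr_cache_key(pdf_bytes: bytes, quality: str, langs: str) -> str:
    """16-hex-char key over file content plus the OCR settings that affect output."""
    h = hashlib.sha256()
    h.update(pdf_bytes)
    h.update((quality + langs).encode())
    return h.hexdigest()[:16]

same = ocr_cache_key(b"%PDF-1.7", "quality", "deu+eng")
other = ocr_cache_key(b"%PDF-1.7", "speed", "deu+eng")
print(same != other)  # changing the profile yields a different cache entry
```

Because the key covers the file bytes, renaming a PDF without changing it still produces the same key, though the cached file name also includes the stem.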

---

# 5) make embeddings dimension-safe (tiny edit)

In **core/index.py** (where the embedder & Qdrant client are created), ensure:

```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from core.config import CFG

def _embed_dim(m):
    try:
        return m.get_sentence_embedding_dimension()
    except Exception:
        return len(m.encode("probe", normalize_embeddings=True))

embedder = SentenceTransformer(CFG.embed_model, trust_remote_code=True)
_raw = _embed_dim(embedder)
_DIM = min(CFG.embed_dim or _raw, _raw)

client = QdrantClient(url=CFG.qdrant_url)
names = {c.name for c in client.get_collections().collections}
if CFG.qdrant_collection not in names:
    client.create_collection(
        collection_name=CFG.qdrant_collection,
        vectors_config=VectorParams(size=_DIM, distance=Distance.COSINE),
    )

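import numpy as np

def _crop(v):
    # Matryoshka crop, then re-normalize: a truncated unit vector is no longer unit length
    v = np.asarray(v)[:_DIM]
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def encode_doc(text: str):
    return _crop(embedder.encode(CFG.embed_doc_prefix + text, normalize_embeddings=True))

def encode_query(text: str):
    return _crop(embedder.encode(CFG.embed_query_prefix + text, normalize_embeddings=True))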
```

Use `encode_doc/encode_query` wherever you embed.
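
Cropping a unit-length vector leaves it shorter than unit length, so a Matryoshka crop is usually followed by a re-normalization. A dependency-free sketch of that step, with `crop_and_normalize` as a hypothetical name:

```python
from __future__ import annotations

import math
from typing import Sequence

def crop_and_normalize(vec: Sequence[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to scale; avoid dividing by zero
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # already unit length
cropped = crop_and_normalize(full, 2)
print(sum(x * x for x in cropped))   # close to 1.0 again after rescaling
```

With cosine distance the ranking is unaffected by vector length, but keeping stored vectors normalized avoids surprises if the distance metric is ever changed.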

---

# 6) call OCR + cleaner in ingest (2 lines)

In **scripts/ingest.py** (where you parse PDFs):

```python
from core.ocr import ocr_pdf_if_needed
from core.text_cleaning import clean_text
# ...
ocr = ocr_pdf_if_needed(pdf_path)                 # <-- NEW
raw_text = extract_text_with_pymupdf(ocr.pdf_path)
text = clean_text(raw_text)                       # <-- NEW
# then your existing chunk -> embed(encode_doc) -> upsert(payload)
# include ocr_applied/total_chars in payload if you wish
```
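
If you do want the OCR metrics in the payload, a small builder keeps the field names in one place. This is a sketch under assumptions: only `source_path` appears elsewhere in this runbook, and the other keys (`chunk_index`, `ocr_total_chars`, `ocr_note`) are illustrative choices.

```python
from __future__ import annotations

from typing import Any

def build_payload(source_path: str, chunk_index: int, text: str,
                  metrics: dict[str, Any]) -> dict[str, Any]:
    """Combine chunk data with the OCR metrics dict returned by ocr_pdf_if_needed."""
    return {
        "source_path": source_path,
        "chunk_index": chunk_index,
        "text": text,
        "ocr_applied": bool(metrics.get("ocr_applied", False)),
        "ocr_total_chars": int(metrics.get("total_chars", 0)),
        "ocr_note": metrics.get("note", ""),
    }

payload = build_payload("data/extract/a.pdf", 0, "Frist: 30 Tage",
                        {"ocr_applied": True, "total_chars": 1234, "note": ""})
print(payload["ocr_applied"], payload["ocr_total_chars"])
```

Storing `ocr_applied` makes it possible to filter or down-weight scanned documents at query time.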

---

# 7) start qdrant and run the pipeline

```powershell
docker compose up -d
Invoke-RestMethod http://127.0.0.1:6333/healthz

# 1) ingest (OCR + parsing + cleaning)
python scripts/ingest.py

# 2) embed (creates 'tender_docs_jina-v3_d1024_fresh' with correct dim and upserts)
python scripts/embed.py
```

**verify collection populated**

```powershell
$COLL = "tender_docs_jina-v3_d1024_fresh"
Invoke-RestMethod -Method Post `
  -Uri "http://127.0.0.1:6333/collections/$COLL/points/count" `
  -ContentType "application/json" -Body '{ "exact": true }'
```
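
The same count check can be done from Python with only the standard library, which is handy inside a test script. A sketch, assuming the response carries the number under `result.count` as Qdrant's REST API returns it; the function names are made up here.

```python
import json
import urllib.request

def build_count_request(base_url: str, collection: str,
                        exact: bool = True) -> urllib.request.Request:
    """Prepare the POST /collections/<name>/points/count request."""
    url = f"{base_url.rstrip('/')}/collections/{collection}/points/count"
    body = json.dumps({"exact": exact}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST",
                                  headers={"Content-Type": "application/json"})

def count_points(base_url: str, collection: str) -> int:
    """Send the request (needs a running Qdrant) and return the point count."""
    req = build_count_request(base_url, collection)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["result"]["count"]
```

With Qdrant running, `count_points("http://127.0.0.1:6333", "tender_docs_jina-v3_d1024_fresh")` should return a number greater than zero after the embed step.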

**quick retrieval smoke test**

```powershell
# PowerShell has no bash-style heredoc; pipe a here-string to python instead
@'
from core.qa import retrieve_candidates
from core.config import CFG
hits = retrieve_candidates("Vergabe Frist Anforderungen", CFG)[:3]
for h in hits:
    print(round(h.score,4), (h.payload or {}).get("source_path",""), (h.text or "")[:100])
'@ | python -
```

**launch UI**

```powershell
streamlit run ui/app_streamlit.py
```

---

# switches you can try later (no refactor)

* **Speed**: set `CFG.embed_dim=512` (Matryoshka crop) and `CFG.ocr_quality="speed"`.
* **Precision**: enable reranker (e.g., bge-reranker-v2-m3) in your `core/qa` after dense retrieval.
* **GPU OCR**: when you want faster scan OCR, we can add PaddleOCR GPU as a toggle while keeping OCRmyPDF as first choice.
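
For the reranker switch, the shape of the change is small: score each dense hit against the query and re-sort. The sketch below takes the scorer as a parameter so it runs without any model. `rerank` and `token_overlap` are hypothetical names, and the toy overlap scorer is only a stand-in for a real cross-encoder such as bge-reranker-v2-m3.

```python
from __future__ import annotations

from typing import Callable, Sequence

def rerank(query: str, candidates: Sequence[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 3) -> list[tuple[float, str]]:
    """Score every candidate against the query and return the best top_k."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def token_overlap(query: str, text: str) -> float:
    """Toy scorer: number of shared lowercase tokens (placeholder for a model)."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))

hits = ["Die Frist endet am Montag", "Wetterbericht", "Angebot und Frist"]
for score, text in rerank("Frist Angebot", hits, token_overlap, top_k=2):
    print(score, text)
```

Passing the scorer in keeps `core/qa` unchanged when the placeholder is swapped for a real model call.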

that’s it—env **mllocalag** + fresh ingest/parse/embed/retrieve/UI, built for quality first. run the steps and you’ll have a new, clean collection with better text and robust imports.


# update the requirements file (requirements.txt)

```text
# ---- Vector DB ----
qdrant-client>=1.9,<2.0

# ---- Embeddings / HF stack (needed for Jina v3 remote code) ----
sentence-transformers>=3.0.1
transformers>=4.44,<5
huggingface_hub>=0.24
accelerate>=0.33
safetensors>=0.4.3
FlagEmbedding>=1.2.10     # keep if you'll use a reranker later (ok to leave)

# ---- Data / PDF / utils ----
numpy>=1.24,<3
pandas>=2.2
openpyxl>=3.1
pymupdf>=1.24
pdfplumber>=0.11
tqdm>=4.66
langdetect>=1.0.9
python-magic-bin==0.4.14 ; sys_platform == "win32"

# ---- OCR (quality-first path) ----
ocrmypdf>=16.0
pytesseract>=0.3.10
Pillow>=10.0

# ---- LLM / UI ----
ollama>=0.3.0
streamlit>=1.36.0
```


# update ingest and io

Why these versions:

* Works with your Pydantic config (all paths are `Path`).
* `PDFLoader` uses OCR fallback (pytesseract, 300 DPI) and cleans text for better recall.
* Ingest handles ZIPs + loose files and logs to `data/logs/ingest_YYYY-MM.csv`.
* Excel metadata is optional and cleaned recursively from `data/raw/`.
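
The monthly log naming (`ingest_YYYY-MM.csv`) can be sketched as below. This is not the actual ingest code: the helper names and the row columns are assumptions for illustration.

```python
import csv
from datetime import date
from pathlib import Path

def ingest_log_path(log_dir, day: date) -> Path:
    """One CSV per calendar month, e.g. data/logs/ingest_2024-09.csv."""
    return Path(log_dir) / f"ingest_{day:%Y-%m}.csv"

def append_log_row(path: Path, row: dict) -> None:
    """Append one row; write the header only when the file is new."""
    path.parent.mkdir(parents=True, exist_ok=True)
    new_file = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if new_file:
            writer.writeheader()
        writer.writerow(row)

print(ingest_log_path("data/logs", date(2024, 9, 22)).as_posix())
```

A call such as `append_log_row(ingest_log_path("data/logs", date.today()), {"file": "a.pdf", "status": "ok"})` would then add one line per processed file.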

You’re good to run:


# delete old collection

```python
def delete_qdrant_collection():
    """Delete the Qdrant collection named in CFG.qdrant_collection, if it exists."""
    try:
        # imported here so a missing package is reported instead of crashing at import time
        from qdrant_client import QdrantClient
        from core.config import CFG

        print("\n🗑️ Connecting to Qdrant...")
        client = QdrantClient(url=CFG.qdrant_url)

        collection_names = [c.name for c in client.get_collections().collections]

        if CFG.qdrant_collection in collection_names:
            print(f"🗑️ Deleting collection: {CFG.qdrant_collection}")
            client.delete_collection(CFG.qdrant_collection)
            print("✅ Qdrant collection deleted successfully")
            return True
        else:
            print(f"ℹ️ No collection named '{CFG.qdrant_collection}' found")
            return False

    except ImportError:
        print("⚠️ qdrant_client not available - install with: pip install qdrant-client")
        return False
    except Exception as e:
        print(f"❌ Could not delete Qdrant collection: {e}")
        return False

def main():
    print("🚀 Starting Nuclear Cleanup")
    print("=" * 50)

    # This script only removes the Qdrant collection; files and data directories are not touched.
    qdrant_deleted = delete_qdrant_collection()

    # Summary
    print("\n" + "=" * 50)
    print("🎉 NUCLEAR CLEANUP COMPLETE!")
    print("=" * 50)
    print(f"{'✅' if qdrant_deleted else '⚠️'} Qdrant collection {'deleted' if qdrant_deleted else 'not found/accessible'}")
    print("\n🚀 Ready for unified document processing!")

if __name__ == "__main__":
    main()
```
