# CareMind: an MVP CDSS

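## 0 Goals and Scope (MVP Boundaries)

Coverage: 1–2 priority disease scenarios (e.g., hypertension and type 2 diabetes).

Knowledge sources: 3–5 National Health Commission guidelines, 1–2 nursing consensus statements, and 10 commonly used drugs (aspirin, metformin, amlodipine, etc.).

Capability: natural-language question → (vector retrieval over guidelines/consensus + SQL query against the drug table) → an answer with citations and compliance notices.

UI: a simple single-page Streamlit app (question input, supporting excerpts with their sources, structured drug-table information, and a disclaimer).

The build uses the existing local environment (Win11 + WSL + RTX 4070 SUPER + VS Code/Jupyter) and centers on hybrid retrieval that combines **RAG (a vector store)** with a **structured drug database (SQLite)**.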
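The hybrid flow described above can be sketched in a few lines. This is only an illustration, not the project's `rag/retriever.py`: the `drugs` table schema, the `search_guidelines` stub that stands in for the Chroma query, the demo excerpts, and the prompt wording are all assumptions.

```python
import sqlite3
from typing import Dict, List


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for db/drugs.sqlite (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE drugs (name TEXT PRIMARY KEY, indication TEXT, notes TEXT)")
    conn.executemany(
        "INSERT INTO drugs VALUES (?, ?, ?)",
        [
            ("metformin", "type 2 diabetes", "placeholder note"),
            ("amlodipine", "hypertension", "placeholder note"),
        ],
    )
    return conn


def lookup_drugs(conn: sqlite3.Connection, question: str) -> List[Dict[str, str]]:
    """Structured side: return rows for every drug named in the question."""
    rows = conn.execute("SELECT name, indication, notes FROM drugs ORDER BY name").fetchall()
    q = question.lower()
    return [{"name": n, "indication": i, "notes": s} for n, i, s in rows if n.lower() in q]


def search_guidelines(question: str, k: int = 2) -> List[Dict[str, str]]:
    """Vector side (stub): a real version would query the Chroma collection."""
    corpus = [
        {"source": "demo_guideline.pdf", "section": "1.1", "text": "demo excerpt A"},
        {"source": "demo_consensus.pdf", "section": "2.3", "text": "demo excerpt B"},
    ]
    return corpus[:k]


def build_prompt(question: str, passages: List[Dict[str, str]], drugs: List[Dict[str, str]]) -> str:
    """Combine both evidence types into one prompt with numbered citations."""
    cited = "\n".join(
        f"[{i}] ({p['source']} §{p['section']}) {p['text']}" for i, p in enumerate(passages, 1)
    )
    table = "\n".join(f"- {d['name']}: {d['indication']}; {d['notes']}" for d in drugs)
    return (
        "Answer using only the evidence below and cite it as [n].\n"
        f"Evidence:\n{cited}\n"
        f"Drug table:\n{table or '- (no matching drug rows)'}\n"
        f"Question: {question}\n"
        "End with a notice that this is decision support, not medical advice."
    )


conn = make_demo_db()
question = "Can metformin be combined with amlodipine?"
prompt = build_prompt(question, search_guidelines(question), lookup_drugs(conn, question))
print(prompt)
```

Keeping the two lookups as separate functions means the vector store and the drug table can be swapped or tested independently before the LLM is involved.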

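## 1. Environment Setup (run inside WSL)

### 1.1 CUDA/Drivers

The NVIDIA driver and CUDA/cuDNN are already installed on the Windows side. Inside WSL2, verify that the GPU is usable with `nvidia-smi` and `torch.cuda.is_available()`.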

(base) myunix@40VFO2U:~$ mkdir caremind

(base) myunix@40VFO2U:~$ cd caremind

(base) myunix@40VFO2U:~/caremind$

(base) myunix@40VFO2U:~/caremind$ nvidia-smi
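Both checks can also be run from Python in one go. The helper below is a sketch (the name `check_gpu` and the report keys are made up here); it does not fail if PyTorch is not installed yet, since the Conda environment is only created in a later step.

```python
import shutil
import subprocess
from typing import Dict, Optional


def check_gpu() -> Dict[str, Optional[object]]:
    """Report whether nvidia-smi and PyTorch can see a GPU; never raises."""
    report: Dict[str, Optional[object]] = {
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "gpu_name": None,
        "torch_cuda": None,  # None means PyTorch could not be imported
    }
    if report["nvidia_smi_found"]:
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=False,
            )
            report["gpu_name"] = out.stdout.strip() or None
        except (OSError, subprocess.SubprocessError):
            pass  # the driver tool exists but could not be executed
    try:
        import torch
        report["torch_cuda"] = bool(torch.cuda.is_available())
    except Exception:
        pass  # PyTorch missing or broken; leave the value as None
    return report


print(check_gpu())
```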

### 1.2 Local Inference Service (Ollama + Qwen2 recommended)

My GPU is a 4070 SUPER, which is well suited to local inference with 7B–14B models.

(base) myunix@40VFO2U:~/caremind$ curl -fsSL https://ollama.com/install.sh | sh

>>> Cleaning up old version at /usr/local/lib/ollama

[sudo] password for myunix:

>>> Installing ollama to /usr/local

>>> Downloading Linux amd64 bundle

######################################################################## 100.0%

>>> Creating ollama user...

>>> Adding ollama user to render group...

>>> Adding ollama user to video group...

>>> Adding current user to ollama group...

>>> Creating ollama systemd service...

>>> Enabling and starting ollama service...

Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.

>>> Nvidia GPU detected.

>>> The Ollama API is now available at 127.0.0.1:11434.

>>> Install complete. Run "ollama" from the command line.
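Once the service is up, it can be called over HTTP from Python. The sketch below builds a non-streaming request for Ollama's `/api/generate` endpoint using only the standard library. The model tag is the one from the `.env` example later in this document, and the sketch assumes it has already been pulled (for example with `ollama pull qwen2:7b-instruct`); the helper names are made up for illustration.

```python
import json
import urllib.request


def build_generate_request(prompt: str,
                           model: str = "qwen2:7b-instruct",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


req = build_generate_request("Reply with the single word: ready")
print(req.full_url)
# With the service running and the model pulled:
# print(generate("Reply with the single word: ready"))
```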

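## 2 Project Structure and Configuration

caremind/
  ├─ .env                           # environment variables
  ├─ data/
  │   ├─ guidelines/                # NHC guidelines / nursing consensus (PDF/HTML/Word)
  │   └─ drugs.xlsx                 # Excel sheet of the 10 drugs (initial version)
  ├─ db/
  │   └─ drugs.sqlite               # SQLite database
  ├─ embeddings/
  │   └─ bge-large-zh/              # optional: cached model weights
  ├─ ingest/
  │   ├─ parse_docs.py              # document extraction and chunking
  │   ├─ build_vectors.py           # embedding and writing to Chroma
  │   └─ load_drugs.py              # Excel → SQLite
  ├─ rag/
  │   ├─ retriever.py               # hybrid retrieval (Chroma + SQLite)
  │   ├─ prompt.py                  # prompt templates (compliance requirements and citation format)
  │   └─ pipeline.py                # end-to-end RAG pipeline
  ├─ app.py                         # Streamlit front end
  └─ README.md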

(base) myunix@40VFO2U:~/caremind$ mkdir data

(base) myunix@40VFO2U:~/caremind$ cd data

(base) myunix@40VFO2U:~/caremind/data$ mkdir guidelines

(base) myunix@40VFO2U:~/caremind/data$ cd ..

(base) myunix@40VFO2U:~/caremind$ mkdir db

(base) myunix@40VFO2U:~/caremind$ mkdir embeddings

(base) myunix@40VFO2U:~/caremind$ cd embeddings

(base) myunix@40VFO2U:~/caremind/embeddings$ mkdir bge-large-zh

(base) myunix@40VFO2U:~/caremind/embeddings$ cd ..

(base) myunix@40VFO2U:~/caremind$ mkdir ingest

(base) myunix@40VFO2U:~/caremind$ mkdir rag

(base) myunix@40VFO2U:~/caremind$ ls

data  db  embeddings  ingest  rag

(base) myunix@40VFO2U:~/caremind$
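The same directory layout can be created in one step with `pathlib`, which avoids typos in individual `mkdir` calls. This is an optional sketch; the demo runs against a temporary directory, and in practice the root would be `~/caremind`.

```python
import tempfile
from pathlib import Path
from typing import List

# Sub-directories taken from the project tree above.
SUBDIRS = ["data/guidelines", "db", "embeddings/bge-large-zh", "ingest", "rag"]


def scaffold(root: Path) -> List[Path]:
    """Create the project sub-directories; existing ones are left untouched."""
    created = []
    for rel in SUBDIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demo in a throwaway directory; use Path.home() / "caremind" for the real project.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "caremind"
    names = sorted(p.relative_to(root).as_posix() for p in scaffold(root))
    all_exist = all((root / n).is_dir() for n in names)
    print(names)
```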

#### .env example

##### # LLM/Ollama
OLLAMA_BASE_URL=http://localhost:11434
LLM_MODEL=qwen2:7b-instruct

##### # Embedding model (Chinese-first)
EMBEDDING_MODEL=BAAI/bge-large-zh-v1.5

##### # Chroma vector store persistence path
CHROMA_PERSIST_DIR=./chroma_store

##### # SQLite
SQLITE_PATH=./db/drugs.sqlite
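One possible way to read these keys in Python is a small frozen dataclass. The `Settings` class and `load_settings` helper below are illustrative names, not part of the project; the defaults simply repeat the values from the example above, and `python-dotenv` is used only if it is installed.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    ollama_base_url: str
    llm_model: str
    embedding_model: str
    chroma_persist_dir: str
    sqlite_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the .env keys, falling back to the values from the example."""
    return Settings(
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        llm_model=env.get("LLM_MODEL", "qwen2:7b-instruct"),
        embedding_model=env.get("EMBEDDING_MODEL", "BAAI/bge-large-zh-v1.5"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./chroma_store"),
        sqlite_path=env.get("SQLITE_PATH", "./db/drugs.sqlite"),
    )


try:
    from dotenv import load_dotenv  # python-dotenv copies .env entries into os.environ
    load_dotenv()
except ImportError:
    pass  # without python-dotenv, only real environment variables are read

settings = load_settings()
print(settings.llm_model)
```

Passing the mapping explicitly (as in `load_settings({"SQLITE_PATH": "/tmp/test.sqlite"})`) makes it easy to test code against a different configuration without touching the real environment.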

### 1.4 Conda Environment and Dependencies

(base) myunix@40VFO2U:~/caremind$ conda create -n caremind python=3.10.18 -y

(base) myunix@40VFO2U:~/caremind$ conda activate caremind

(caremind) myunix@40VFO2U:~/caremind$ pip install --upgrade pip

#### Core: RAG + tokenization/embedding + local LLM (via Ollama) + UI + PDF extraction + Chinese text processing
(caremind) myunix@40VFO2U:~/caremind$ pip install langchain langchain-community chromadb pydantic-settings

(caremind) myunix@40VFO2U:~/caremind$ pip install sentence-transformers # for the Chinese bge models

(caremind) myunix@40VFO2U:~/caremind$ pip install transformers accelerate torch --extra-index-url https://download.pytorch.org/whl/cu121

(caremind) myunix@40VFO2U:~/caremind$ pip install pdfplumber pypdf pdf2image pytesseract pillow beautifulsoup4 lxml

(caremind) myunix@40VFO2U:~/caremind$ pip install jieba nltk

(caremind) myunix@40VFO2U:~/caremind$ pip install streamlit

(caremind) myunix@40VFO2U:~/caremind$ pip install python-dotenv

(caremind) myunix@40VFO2U:~/caremind$ pip install sqlalchemy aiosqlite

Newly added:

(caremind) myunix@40VFO2U:~/caremind$ pip install -U "chromadb>=0.5"

pip install requests pandas openpyxl tenacity pdfminer.six pypdf2 chardet

### 1.5 Open VS Code from WSL

(caremind) myunix@40VFO2U:~/caremind$ code .

Create or open the file `caremind.ipynb`.

Press Ctrl+Shift+P and select the Python environment: caremind (Python 3.10.18).

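## 3 Data Preparation and Ingestion

### 3.1 Document Extraction and Chunking (guidelines / nursing consensus)

Strategy: chunk by heading hierarchy plus semantic boundaries, 500–1000 characters per chunk, preserving metadata (source, year/version, evidence level, applicable population).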
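The 500–1000 character target can be approximated with a post-processing pass over heading-based sections: merge short neighbours, then split anything too long at Chinese sentence punctuation. The sketch below is one possible approach, not the parser used later in this notebook; the function names are made up, and merging across headings drops per-section titles, which a real version would need to carry along.

```python
import re
from typing import List

# Zero-width split point right after Chinese sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[。！？；])")


def split_long(text: str, max_chars: int) -> List[str]:
    """Split text into pieces of at most max_chars, preferring sentence ends."""
    pieces, current = [], ""
    for sent in SENTENCE_END.split(text):
        while len(sent) > max_chars:  # a single over-long sentence: hard cut
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sent[:max_chars])
            sent = sent[max_chars:]
        if current and len(current) + len(sent) > max_chars:
            pieces.append(current)
            current = ""
        current += sent
    if current:
        pieces.append(current)
    return pieces


def normalize_chunks(sections: List[str], min_chars: int = 500, max_chars: int = 1000) -> List[str]:
    """Merge short neighbours and split long sections toward [min_chars, max_chars]."""
    out, pending = [], ""
    for sec in sections:
        pending += sec
        if len(pending) < min_chars:
            continue  # keep accumulating short sections
        out.extend(split_long(pending, max_chars))
        pending = ""
    if pending:  # the leftover tail may be shorter than min_chars
        out.extend(split_long(pending, max_chars))
    return out


sections = ["甲" * 200 + "。", "乙" * 400 + "。", ("丙" * 299 + "。") * 5]
chunks = normalize_chunks(sections)
print([len(c) for c in chunks])
```

The upper bound is enforced strictly, while the lower bound is best-effort: the last piece of a split section can still be shorter than `min_chars`.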

Working dir assumption (you can change paths if needed): /home/myunix/caremind

In [14]:
import os
os.getcwd()

'/home/myunix/caremind'

#### Pipeline
1. Extract year from filename (Chinese guideline naming patterns).
2. Extract text (pdfplumber → OCR fallback with pdf2image + pytesseract).
3. Heuristically split into titled chunks.
4. Write JSONL with metadata (content, meta).
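Step 2 of the list above (text layer first, OCR only as a fallback) could look roughly like the sketch below. The helper names, the 50-character threshold, and the `chi_sim` language choice are assumptions. `ocr_pdf` relies on `pdf2image.convert_from_path` and `pytesseract.image_to_string`, which need the poppler and tesseract system packages, so the demo at the bottom uses stand-in extractors instead.

```python
from pathlib import Path
from typing import Callable, Tuple


def extract_with_fallback(pdf_path: Path,
                          extract_text: Callable[[Path], str],
                          ocr_text: Callable[[Path], str],
                          min_chars: int = 50) -> Tuple[str, str]:
    """Return (text, method); OCR runs only when the text layer is too thin."""
    text = extract_text(pdf_path)
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    return ocr_text(pdf_path), "ocr"


def ocr_pdf(pdf_path: Path, lang: str = "chi_sim", dpi: int = 300) -> str:
    """OCR every page; imports are lazy so this module loads without OCR deps."""
    from pdf2image import convert_from_path  # requires poppler on the system
    import pytesseract                       # requires tesseract + language data
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)


# Demo with stand-in extractors (no PDF or OCR engine needed):
text, method = extract_with_fallback(
    Path("scan.pdf"),
    extract_text=lambda p: "",              # simulates a scanned PDF with no text layer
    ocr_text=lambda p: "recognized text",   # simulates the OCR result
)
print(method)
```

In the real pipeline, `extract_text` would be the pdfplumber-based extractor and `ocr_text` would be `ocr_pdf`.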

#### 🧱 Section breakdown

| Section | Purpose |
| --- | --- |
| 1. Metadata Extraction Utils | Self-contained functions that extract `year`, `title`, `authors`, and `type` from the filename; easily testable and reusable. |
| 2. Text Extraction | `pdfplumber` wrapper, with an optional OCR fallback (not present in the cell below). |
| 3. Chunking Logic | Heuristic chunking tuned for Chinese medical text; outputs structured chunks with full metadata. |
| 4. Main Pipeline | Orchestrates file discovery → metadata extraction → text extraction → chunking → JSONL output. Includes debug prints. |
| 5. Run | Standard Python entry point. |

In [None]:
# %% [markdown]
# # 📄 Medical Guideline PDF Parser v1.1
# Enhanced to extract rich metadata from Chinese clinical guidelines and interpretation articles.
# Distinguishes between official guidelines and expert interpretations.
# Outputs: `guidelines.parsed.jsonl` with full bibliographic & structural metadata.

# %%
from pathlib import Path
import pdfplumber 
import re
import json
from typing import List, Dict, Any

# %%
# =============================
# 🧩 1. METADATA EXTRACTION UTILS (from filename)
# =============================

def extract_year_from_filename(filename: str) -> str:
    """Extracts 4-digit year from Chinese medical guideline filenames."""
    patterns = [
        r'[（\(]([12]\d{3})[）\)]',
        r'[（\(]([12]\d{3})[年\s]*(?:修订版|版|年版|年)?[）\)]?',
        r'([12]\d{3})[年\s]*(?:修订版|版|年版|年)?(?=[\s_）\)。\.\-]|$)',
        r'[（\(]?([12]\d{3})[）\)]?\s*\.pdf$',
    ]
    for pattern in patterns:
        match = re.search(pattern, filename)
        if match:
            return match.group(1)
    return "unknown"

def extract_doc_title(filename: str) -> str:
    """Extract clean document title from filename."""
    base = re.sub(r"_[^_]*\.pdf$", "", filename)
    base = re.sub(r"\.pdf$", "", base)
    base = re.sub(r"[（\(][^）\)]*[）\)]", "", base)
    base = re.sub(r"\s*—+\s*", " ", base)
    base = re.sub(r"\s+", " ", base).strip()
    return base or "未命名文档"

def extract_authors_from_filename(filename: str) -> List[str]:
    """Extract author names from filename (after last underscore)."""
    match = re.search(r"_([^_\(]+?)(?:\([12]\d{3}\))?\.pdf$", filename)
    if match:
        author_str = match.group(1).strip()
        authors = re.split(r"[,，、]", author_str)
        return [a.strip() for a in authors if a.strip()]
    return []

def extract_doc_type_from_filename(filename: str) -> str:
    """Classify document type from filename."""
    if "指南" in filename and "解读" not in filename:
        return "guideline"
    elif "解读" in filename or "浅析" in filename or "解析" in filename:
        return "guideline_interpretation"
    elif "共识" in filename:
        return "consensus"
    elif "证据总结" in filename:
        return "evidence_summary"
    else:
        return "other"

# %%
# =============================
# 📚 2. METADATA EXTRACTION UTILS (from document text — PREFERRED)
# =============================

def extract_metadata_from_text(text: str) -> Dict[str, Any]:
    """Extract rich metadata from first page of document text."""
    meta = {
        "authors": [],
        "corresponding_author": "",
        "affiliations": [],
        "journal_name": "",
        "volume": "",
        "issue": "",
        "pages": "",
        "doi": "",
        "keywords": [],
        "publish_date": "",
        "original_guideline_title": "",
        "doc_type": "other"  # Will be overridden if detected
    }

    # Extract author (first non-empty line after title, before postal code or affiliation)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, line in enumerate(lines[:10]):  # Look in first 10 lines
        if re.match(r"^\d{6}", line):  # Postal code → previous line is likely author
            if i > 0:
                author_line = lines[i-1]
                authors = re.split(r"[,，、]", author_line)
                meta["authors"] = [a.strip() for a in authors if len(a.strip()) >= 2]
            break
        if "通信作者" in line:
            author_match = re.search(r"通信作者[:：]\s*([^\s，,、]+)", line)
            if author_match:
                meta["corresponding_author"] = author_match.group(1).strip()
                if not meta["authors"]:
                    meta["authors"] = [meta["corresponding_author"]]

    # Extract affiliation (lines with postal code or university)
    for line in lines[:15]:
        if re.search(r"\d{6}|大学|医院|中心", line) and len(line) > 10:
            meta["affiliations"].append(line)

    # Extract DOI
    doi_match = re.search(r"DOI\s*[:：]?\s*([0-9\.\s\/a-z-]+)", text, re.IGNORECASE)
    if doi_match:
        meta["doi"] = re.sub(r"\s+", "", doi_match.group(1)).strip()

    # Extract journal, volume, issue, pages from footer pattern
    # e.g., "·396· 中国心血管杂志 2024年 10月第 29卷第 5期"
    # The alternation is grouped so that both "…杂志" and "…学报" capture the full journal name
    journal_match = re.search(r"·\d+·\s*([^\s]+?(?:杂志|学报))\s*(\d{4})年\s*\d+月第\s*(\d+)卷第\s*(\d+)期", text)
    if journal_match:
        meta["journal_name"] = journal_match.group(1).strip()
        meta["publish_date"] = journal_match.group(2).strip()  # e.g., "2024"
        meta["volume"] = journal_match.group(3).strip()
        meta["issue"] = journal_match.group(4).strip()

    # Extract pages from header/footer (e.g., "·396·")
    page_match = re.search(r"·(\d+)·", text.splitlines()[0] if text.splitlines() else "")
    if page_match:
        start_page = page_match.group(1)
        # Try to find end page (often not available, so leave as single page)
        meta["pages"] = start_page

    # Extract keywords
    kw_match = re.search(r"【关键词】\s*([^\n【】]+)", text)
    if kw_match:
        kw_str = kw_match.group(1).strip()
        meta["keywords"] = [k.strip() for k in re.split(r"[,，;；、]", kw_str) if k.strip()]

    # Detect if this is an interpretation of a guideline
    guideline_ref_match = re.search(r"《([^》]+?指南[^》]*)》", text[:500])
    if guideline_ref_match:
        meta["original_guideline_title"] = guideline_ref_match.group(1).strip()
        if "解读" in text[:200] or "浅析" in text[:200]:
            meta["doc_type"] = "guideline_interpretation"

    # If no authors found but filename has them, fallback
    # (Handled in main function)

    return meta

# %%
# =============================
# 📄 3. TEXT EXTRACTION
# =============================

def extract_text_from_pdf(pdf_path: Path) -> str:
    """Extract text from PDF using pdfplumber."""
    try:
        with pdfplumber.open(str(pdf_path)) as pdf: 
            pages = [p.extract_text() or "" for p in pdf.pages] 
        return "\n".join(pages)
    except Exception as e:
        print(f"⚠️  PDF extraction error: {e}")
        return ""

# %%
# =============================
# 🧱 4. CHUNKING LOGIC
# =============================

def chunk_by_rules(
    text: str,
    source_filename: str,
    year: str,
    doc_title: str,
    authors: List[str],
    doc_type: str,
    original_guideline_title: str = "",
    journal_name: str = "",
    volume: str = "",
    issue: str = "",
    pages: str = "",
    doi: str = "",
    keywords: List[str] = None,  # None instead of a shared mutable [] default
    publish_date: str = ""
) -> List[Dict[str, Any]]:
    """Split text into chunks by section titles, with rich metadata."""
    keywords = list(keywords) if keywords else []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []

    chunks, buf = [], []
    current_title = "未命名章节"

    TITLE_KEYWORDS = [
        "章", "节", "篇", "部分", "概述", "背景", "目的", "方法", "结果", "结论",
        "推荐", "建议", "管理", "治疗", "诊断", "评估", "定义", "目标",
        "一、", "二、", "三、", "四、", "五、", "六、", "七、", "八、", "九、", "十、",
        "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9.", "10.",
        "第一", "第二", "第三", "第四", "第五",
        "【", "】", "（", "）", "(", ")", "：", ":"
    ]

    def make_chunk(section_title: str, body: List[str]) -> Dict[str, Any]:
        # Build one chunk record. \w keeps CJK characters in the slug; the previous
        # ASCII-only pattern reduced Chinese titles to runs of underscores, so
        # different documents could end up with identical chunk IDs.
        slug = re.sub(r"[^\w]", "_", doc_title)
        return {
            "content": "\n".join(body),
            "meta": {
                "source_filename": source_filename,
                "doc_title": doc_title,
                "section_title": section_title,
                "authors": authors,
                "year": year,
                "doc_type": doc_type,
                "original_guideline_title": original_guideline_title,
                "journal_name": journal_name,
                "volume": volume,
                "issue": issue,
                "pages": pages,
                "doi": doi,
                "keywords": keywords,
                "publish_date": publish_date,
                "chunk_id": f"{slug}_{year}_{len(chunks):03d}",
                "extraction_method": "pdfplumber + rule-based + metadata extraction",
            },
        }

    for ln in lines:
        is_title = False

        if any(kw in ln for kw in TITLE_KEYWORDS) and 3 <= len(ln) <= 100:
            is_title = True
        elif ln.endswith("：") or ln.endswith(":") or \
             (len(ln) <= 50 and (ln.startswith("【") and ln.endswith("】"))):
            is_title = True
        elif 3 <= len(ln) <= 25 and not ln.endswith("。") and not ln.endswith("."):
            is_title = True
        elif re.match(r"^[0-9]+[\.、]\s*\S{3,}", ln):
            is_title = True

        if is_title:
            # A new title closes the previous section, if it has any body text
            if buf:
                chunks.append(make_chunk(current_title, buf))
                buf = []
            current_title = ln
        else:
            buf.append(ln)

    # Flush final buffer
    if buf:
        chunks.append(make_chunk(current_title, buf))

    return chunks

# %%
# =============================
# 🚀 5. MAIN PROCESSING PIPELINE
# =============================

def main():
    in_dir = Path("data/guidelines")
    out_path = Path("data/guidelines.parsed.jsonl")

    print(f"📂 Input directory: {in_dir.absolute()}")
    pdf_files = list(in_dir.glob("*.pdf"))
    print(f"📄 Found {len(pdf_files)} PDF files.")

    if not pdf_files:
        print("❌ No PDFs found. Check directory path.")
        return

    with out_path.open("w", encoding="utf-8") as f:
        for pdf in pdf_files:
            print(f"\n--- 📄 Processing: {pdf.name} ---")

            # Step 1: Extract preliminary metadata from filename
            year = extract_year_from_filename(pdf.name)
            doc_title = extract_doc_title(pdf.name)
            authors_filename = extract_authors_from_filename(pdf.name)
            doc_type = extract_doc_type_from_filename(pdf.name)

            print(f"  📅 Year (from filename): {year}")
            print(f"  🏷️ Title (from filename): {doc_title}")
            print(f"  👥 Authors (from filename): {authors_filename}")
            print(f"  📑 Type (from filename): {doc_type}")

            # Step 2: Extract text
            text = extract_text_from_pdf(pdf)
            char_count = len(text.strip())
            print(f"  🔤 Extracted {char_count} characters.")

            if char_count == 0:
                print("  ⚠️  WARNING: No text extracted. File may be scanned.")
                continue

            # Step 3: Extract rich metadata from text (overrides filename where possible)
            text_meta = extract_metadata_from_text(text)

            # Merge: Prefer text-extracted metadata, fallback to filename
            authors = text_meta["authors"] if text_meta["authors"] else authors_filename
            doc_type_final = text_meta["doc_type"] if text_meta["doc_type"] != "other" else doc_type
            original_guideline_title = text_meta["original_guideline_title"]
            journal_name = text_meta["journal_name"]
            volume = text_meta["volume"]
            issue = text_meta["issue"]
            pages = text_meta["pages"]
            doi = text_meta["doi"]
            keywords = text_meta["keywords"]
            publish_date = text_meta["publish_date"]

            print(f"  ✍️  Authors (final): {authors}")
            print(f"  📚 Journal: {journal_name} {volume}({issue}), {pages}, {publish_date}")
            print(f"  🔗 DOI: {doi}")
            print(f"  🏷️ Keywords: {keywords}")
            print(f"  🧭 Original Guideline: {original_guideline_title}")
            print(f"  📑 Doc Type (final): {doc_type_final}")

            # Step 4: Chunk text with full metadata
            chunks = chunk_by_rules(
                text=text,
                source_filename=pdf.name,
                year=year,
                doc_title=doc_title,
                authors=authors,
                doc_type=doc_type_final,
                original_guideline_title=original_guideline_title,
                journal_name=journal_name,
                volume=volume,
                issue=issue,
                pages=pages,
                doi=doi,
                keywords=keywords,
                publish_date=publish_date
            )
            print(f"  🧩 Generated {len(chunks)} chunks.")

            if not chunks:
                print("  ❌ No chunks generated. Check chunking logic.")

            # Step 5: Write to JSONL
            for chunk in chunks:
                f.write(json.dumps(chunk, ensure_ascii=False) + "\n")

    print(f"\n✅ Output written to: {out_path.absolute()}")
    print(f"💾 File size: {out_path.stat().st_size} bytes")

# %%
# =============================
# ▶️ 6. RUN
# =============================

if __name__ == "__main__":
    main()

📂 Input directory: /home/myunix/caremind/data/guidelines
📄 Found 16 PDF files.

--- 📄 Processing: 妊娠期糖尿病患者产前血糖管理的证据总结_秦煜(2023).pdf ---
  📅 Year (from filename): 2023
  🏷️ Title (from filename): 妊娠期糖尿病患者产前血糖管理的证据总结
  👥 Authors (from filename): ['秦煜']
  📑 Type (from filename): evidence_summary
  🔤 Extracted 11758 characters.
  ✍️  Authors (final): ['秦煜']
  📚 Journal:  (), , 
  🔗 DOI: 
  🏷️ Keywords: []
  🧭 Original Guideline: 
  📑 Doc Type (final): evidence_summary
  🧩 Generated 34 chunks.

--- 📄 Processing: 冠心病合并2 型糖尿病患者的血糖管理专家共识(2024)_中国医疗保健国际交流促进会心血管病学分会.pdf ---
  📅 Year (from filename): 2024
  🏷️ Title (from filename): 冠心病合并2 型糖尿病患者的血糖管理专家共识
  👥 Authors (from filename): ['中国医疗保健国际交流促进会心血管病学分会']
  📑 Type (from filename): consensus
  🔤 Extracted 29204 characters.
  ✍️  Authors (final): ['中国医疗保健国际交流促进会心血管病学分会']
  📚 Journal:  (), , 
  🔗 DOI: 10.3969/j.issn.1000-3614.2024.04.003ChaoXing
  🏷️ Keywords: []
  🧭 Original Guideline: 
  📑 Doc Type (final): consensus
  🧩 Generated 103 chunks.

[... output truncated ...]



  🔤 Extracted 22961 characters.
  ✍️  Authors (final): ['陈欢']
  📚 Journal:  (), 3984, 
  🔗 DOI: 10.12114/j.issn.1007-9572.2022.0362
  🏷️ Keywords: ['糖尿病足', '糖尿病', '甲病', '指（趾）甲', '循证医学', '证据总结']
  🧭 Original Guideline: 
  📑 Doc Type (final): evidence_summary
  🧩 Generated 65 chunks.

--- 📄 Processing: 2型糖尿病患者运动方案的最佳证据总结(2019).pdf ---
  📅 Year (from filename): 2019
  🏷️ Title (from filename): 2型糖尿病患者运动方案的最佳证据总结
  👥 Authors (from filename): []
  📑 Type (from filename): evidence_summary
  🔤 Extracted 15099 characters.
  ✍️  Authors (final): []
  📚 Journal:  (), , 
  🔗 DOI: 10
  🏷️ Keywords: ['糖尿病', '2型', '运动治疗', '生活方式', '循证护理学']
  🧭 Original Guideline: 
  📑 Doc Type (final): evidence_summary
  🧩 Generated 47 chunks.

--- 📄 Processing: 《妊娠期糖尿病临床护理实践指南》推荐意见专家共识_周英凤(2020).pdf ---
  📅 Year (from filename): 2020
  🏷️ Title (from filename): 《妊娠期糖尿病临床护理实践指南》推荐意见专家共识
  👥 Authors (from filename): ['周英凤']
  📑 Type (from filename): guideline
  🔤 Extracted 16024 characters.
  ✍️  Authors (final): ['周英
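
The log above shows two weak spots in the DOI extraction: one DOI picked up a trailing `ChaoXing` fragment, and another was cut down to a bare `10`. A stricter extractor might look like the sketch below. The `j.issn` pattern is an assumption modelled on the two complete DOIs in the log, not a general DOI grammar, and the helper name is made up.

```python
import re

# A registrant prefix such as "10.3969/" is mandatory, so a bare "10" is rejected.
DOI_CORE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")
# Form seen in the log: 10.NNNN/j.issn.NNNN-NNNN followed by numeric parts only.
ISSN_DOI = re.compile(r"10\.\d{4,9}/j\.issn\.\d{4}-\d{3}[\dXx](?:\.\d+)+")


def extract_doi(text: str) -> str:
    """Return the first plausible DOI after a 'DOI' label, or '' if none."""
    label = re.search(r"DOI\s*[:：]?\s*", text, re.IGNORECASE)
    if not label:
        return ""
    tail = text[label.end():label.end() + 120]
    lines = tail.splitlines()
    # Only the first line is used, with the spaces PDF extraction inserts removed.
    compact = re.sub(r"\s+", "", lines[0]) if lines else ""
    match = ISSN_DOI.match(compact) or DOI_CORE.match(compact)
    return match.group(0).rstrip(".,;") if match else ""


samples = [
    "DOI: 10.3969/j.issn.1000-3614.2024.04.003ChaoXing",
    "DOI：10.12114/j.issn.1007-9572.2022.0362",
    "DOI: 10. 3969 / j. issn. 1000-3614. 2024. 04. 003\nChaoXing",
    "DOI: 10",
]
results = [extract_doi(s) for s in samples]
print(results)
```

DOIs that do not follow the `j.issn` form fall back to the generic pattern, which can still absorb text fused onto the same line.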

### Vectorization and Writing to Chroma

#### 📑 Embedding Chinese Guideline Chunks into ChromaDB

This notebook cell runs a **robust embedding pipeline** designed to convert parsed
Chinese medical guideline chunks into vector representations and store them in a
**persistent ChromaDB collection** for downstream **RAG (Retrieval-Augmented Generation)**
applications.

##### 🔍 What this program does
- **Stream input**: Reads a JSONL file (`guidelines.parsed.jsonl`), where each line
  contains a `content` field (text chunk) and a `meta` field (metadata such as source,
  year, section, authors, etc.).
- **Embed text**: Uses a **Chinese-capable SentenceTransformer** model
  (default: `BAAI/bge-small-zh`) to convert text into dense embeddings.
- **Store vectors**: Saves embeddings, documents, and sanitized metadata into a
  **ChromaDB collection** on disk for fast similarity search and retrieval.

##### ⚙️ Key features and safeguards
- **OOM resilience (RTX 4070 friendly)**:
  - Dynamic batch-size backoff (halves batch size if VRAM runs out).
  - FP16 autocast and lowered max sequence length to reduce memory use.
  - Optional CPU fallback if GPU memory is completely exhausted.
- **Metadata safety**:
  - Converts lists (e.g., `["秦煜"]`) and dicts into scalar or JSON-safe forms.
  - Prevents `ValueError: Expected metadata value to be a str, int, float, bool, or None`.
- **Duplicate ID protection**:
  - Generates **stable, low-collision IDs** from source + chunk_id + content hash.
  - Removes duplicate IDs **inside each batch**.
  - Uses **upsert** (or add+update fallback) so re-runs are **idempotent**.
- **Progress and monitoring**:
  - Optional progress bars with `tqdm`.
  - Periodic VRAM usage reports (on CUDA).
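
The in-batch duplicate removal mentioned above can be as simple as a dictionary keyed by ID. The sketch below keeps the last record for each ID; that "last one wins" rule and the `Record` tuple layout are assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple

Record = Tuple[str, str, Dict[str, Any]]  # (id, document, metadata)


def dedupe_batch(records: List[Record]) -> List[Record]:
    """Keep one record per ID (the last one wins), in first-seen ID order."""
    latest: Dict[str, Record] = {}
    for rec in records:
        # Re-assigning an existing key keeps its original position in the dict.
        latest[rec[0]] = rec
    return list(latest.values())


batch = [
    ("a", "first version", {"year": 2023}),
    ("b", "other chunk", {"year": 2024}),
    ("a", "second version", {"year": 2023}),
]
unique = dedupe_batch(batch)
print([(rid, doc) for rid, doc, _ in unique])
```

The de-duplicated IDs, documents, and metadata can then be unpacked into parallel lists for a single Chroma `collection.upsert(...)` call, so a batch never contains the same ID twice.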

##### 📥 Input format (one JSON object per line)
```json
{"content": "某段中文文本…", "meta": {
  "source": "中国高血压防治指南(2024年修订版).pdf",
  "year": 2024,
  "section": "2.1 定义",
  "chunk_id": 12,
  "authors": ["秦煜","张三"]
}}
```


In [1]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
===============================================================================
Embed Chinese guideline chunks → ChromaDB (OOM-safe, idempotent, metadata-robust)
===============================================================================

What this script does (high-level)
----------------------------------
1) Streams a JSONL file that contains one chunk per line:
     {
       "content": "分章文本……",
       "meta": {
         "source": "中国高血压防治指南(2024年修订版).pdf",
         "year": 2024,
         "section": "3.2 降压目标",
         "chunk_id": 57,
         "authors": ["秦煜", "张三"]    # lists allowed; we sanitize below
       }
     }

2) Uses a Chinese-capable SentenceTransformer (defaults to BAAI/bge-small-zh)
   to embed "content" with **GPU if available**.

3) Writes embeddings, documents, and **sanitized metadata** into a persistent
   **Chroma** collection on disk, in an **idempotent** way:
   - Uses **upsert** when available (Chromadb>=0.5) to overwrite duplicates.
   - Strengthens IDs to minimize collisions.
   - De-duplicates IDs **inside each batch**.
   - Final safety net: per-item upsert/repair if a batch raises an error.

Inputs (files & environment variables)
-------------------------------------
• JSONL file (default `data/guidelines.parsed.jsonl`)
  - Set via env var `CAREMIND_DATA`.

• Environment variables (optional, with sensible defaults):
  CHROMA_PERSIST_DIR  : Directory for Chroma persistence (default './chroma_store')
  CHROMA_COLLECTION   : Collection name (default 'guideline_chunks')
  CAREMIND_DATA       : Input JSONL path (default 'data/guidelines.parsed.jsonl')
  EMBEDDING_MODEL     : Model id (default 'BAAI/bge-small-zh')
  EMBED_BATCH_SIZE    : Starting batch size (default '16')
  EMBED_FP16          : '1' to allow fp16 autocast on CUDA (default '1')
  EMBED_PROGRESS      : '1' to show progress bars (default '1')
  EMBED_MAX_LEN       : Max sequence length for encoder (default '384')
  OOM_CPU_FALLBACK    : '1' to allow CPU fallback when BS=1 still OOM (default '1')

Outputs
-------
• A persistent Chroma database on disk containing:
    - ids: stable, low-collision per-chunk IDs
    - embeddings: float vectors
    - documents: original text chunks
    - metadatas: scalar/JSON-safe metadata
  Location and collection are controlled by CHROMA_PERSIST_DIR / CHROMA_COLLECTION.

Run
---
$ python embed_to_chroma.py
(or set envs first, e.g., EMBED_MAX_LEN=256 EMBED_BATCH_SIZE=8 for tighter VRAM)
"""

import os
import json
import hashlib
from pathlib import Path
from typing import Iterable, List, Dict, Any

# NOTE: We intentionally do NOT set PYTORCH_CUDA_ALLOC_CONF here because some
# PyTorch builds require a specific format and may crash. All OOM tactics below
# (batch backoff, fp16, truncation, cache clears, CPU fallback) work without it.

from dotenv import load_dotenv
load_dotenv()

import torch
from sentence_transformers import SentenceTransformer
from chromadb import PersistentClient
from chromadb import errors as chroma_errors  # for DuplicateIDError handling
from tqdm import tqdm

# ---------------------------
# Config (safe defaults; override via env vars)
# ---------------------------
PERSIST_DIR      = os.getenv("CHROMA_PERSIST_DIR", "./chroma_store")
COLLECTION_NAME  = os.getenv("CHROMA_COLLECTION", "guideline_chunks")
DATA_PATH        = os.getenv("CAREMIND_DATA", "data/guidelines.parsed.jsonl")

# Chinese-capable model; bge-* are strong + efficient. Start small on 12GB VRAM.
EMBEDDING_MODEL  = os.getenv("EMBEDDING_MODEL", "BAAI/bge-small-zh")

# Start conservatively; dynamic backoff will reduce further on OOM.
START_BATCH_SIZE = int(os.getenv("EMBED_BATCH_SIZE", "16"))

# fp16 autocast reduces memory on CUDA and is safe for sentence-transformers inference
USE_FP16         = os.getenv("EMBED_FP16", "1") == "1"

# Lower seq length cuts memory on long Chinese paragraphs; 384 or even 256 works well
MAX_SEQ_LEN      = int(os.getenv("EMBED_MAX_LEN", "384"))

# nice progress bars
SHOW_PROGRESS    = os.getenv("EMBED_PROGRESS", "1") == "1"

# If bs=1 still OOMs (rare), push that batch to CPU to finish and keep going
CPU_FALLBACK     = os.getenv("OOM_CPU_FALLBACK", "1") == "1"

# Ensure output dir exists early
Path(PERSIST_DIR).mkdir(parents=True, exist_ok=True)

# ---------------------------
# Device and model load
# ---------------------------
device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.is_available():
    # Not a memory tactic, but improves throughput on Ada/RTX40
    torch.backends.cuda.matmul.allow_tf32 = True

embed_model = SentenceTransformer(EMBEDDING_MODEL, device=device)
# Reduce max sequence length to save memory; long texts will be truncated
embed_model.max_seq_length = MAX_SEQ_LEN

# ---------------------------
# ID generation (stable + low collision)
# ---------------------------
def stable_id(meta: Dict[str, Any], content: str) -> str:
    """
    Stable, low-collision ID:
    • Prefer source|chunk_id|sha12(content) when source+chunk_id exist.
    • Fallback binds to source-hash + content-hash so identical text in different
      files remains separate.
    This keeps re-runs idempotent and minimizes accidental collisions.
    """
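    # Accept "source" as well as the parser's "source_filename" key, so the
    # preferred source|chunk_id|hash form is used for the parsed JSONL too.
    src = str(meta.get("source") or meta.get("source_filename") or "").strip()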
    cid = meta.get("chunk_id", None)
    ch  = hashlib.sha1(content.encode("utf-8")).hexdigest()[:12]
    if src and cid is not None:
        return f"{src}|{cid}|{ch}"
    sh  = hashlib.sha1(src.encode("utf-8")).hexdigest()[:8]
    return f"g_{sh}_{hashlib.sha1(content.encode('utf-8')).hexdigest()[:16]}"

# ---------------------------
# JSONL streaming (robust to bad lines)
# ---------------------------
def jsonl_iter(path: Path) -> Iterable[Dict[str, Any]]:
    with path.open("r", encoding="utf-8") as f:
        for ln, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                print(f"[WARN] Skipping bad JSON at line {ln}: {e}")
                continue

# ---------------------------
# Metadata sanitizer (prevents list/dict errors in Chroma)
# ---------------------------
def sanitize_meta(meta: Dict[str, Any]) -> Dict[str, Any]:
    """
    Chroma requires scalar metadata values: str, int, float, bool, or None.
    This function converts:
        • List/Tuple/Set of scalars → "a, b, c"
        • Complex lists (mixed/nested) → JSON string
        • Dicts → JSON string (preserve Chinese with ensure_ascii=False)
        • Other types → str(v)
    Example fix:
        ["秦煜"]  → "秦煜"
    """
    clean: Dict[str, Any] = {}
    for k, v in meta.items():
        if isinstance(v, (str, int, float, bool)) or v is None:
            clean[k] = v
        elif isinstance(v, (list, tuple, set)):
            seq = list(v)
            if all(isinstance(x, (str, int, float, bool)) or x is None for x in seq):
                clean[k] = ", ".join("" if x is None else str(x) for x in seq)
            else:
                clean[k] = json.dumps(seq, ensure_ascii=False)
        elif isinstance(v, dict):
            clean[k] = json.dumps(v, ensure_ascii=False, sort_keys=True)
        else:
            clean[k] = str(v)
    return clean

# ---------------------------
# Optional VRAM stats (debug)
# ---------------------------
def cuda_mem_summary(prefix: str = ""):
    if not torch.cuda.is_available():
        return
    try:
        free, total = torch.cuda.mem_get_info()  # bytes
        used = total - free
        gb = 1024**3
        print(f"[VRAM] {prefix} used={used/gb:.2f} GB / total={total/gb:.2f} GB (free={free/gb:.2f} GB)")
    except Exception:
        pass

def clear_cuda_cache():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# ---------------------------
# OOM-resilient encoder with dynamic batch backoff + optional CPU fallback
# ---------------------------
@torch.no_grad()
def encode_with_backoff(texts: List[str], start_bs: int, use_fp16: bool, model: SentenceTransformer, cpu_fallback: bool):
    """
    Try encoding with the given batch size. On CUDA OOM:
      - Halve batch size and retry.
      - If batch size is 1 and still OOM, optionally move to CPU for that batch.
    Returns a List[List[float]].
    """
    bs = max(1, start_bs)
    current_device = model.device  # torch.device('cuda') or ('cpu')

    while True:
        try:
            if use_fp16 and current_device.type == "cuda":
                # AMP saves memory; torch.autocast replaces the deprecated torch.cuda.amp.autocast
                with torch.autocast(device_type="cuda", dtype=torch.float16):
                    vecs = model.encode(
                        texts,
                        batch_size=bs,
                        normalize_embeddings=True,
                        convert_to_numpy=True,
                        show_progress_bar=False
                    )
            else:
                vecs = model.encode(
                    texts,
                    batch_size=bs,
                    normalize_embeddings=True,
                    convert_to_numpy=True,
                    show_progress_bar=False
                )
            return vecs.tolist()

        except RuntimeError as e:
            msg = str(e)
            if "CUDA out of memory" in msg or "CUBLAS" in msg:
                print(f"[OOM] CUDA OOM at batch_size={bs} (len={len(texts)}). Backing off…")
                clear_cuda_cache()
                if bs > 1:
                    bs = max(1, bs // 2)
                    continue
                # bs==1 and still OOM → optional CPU fallback
                if cpu_fallback and current_device.type == "cuda":
                    print("[OOM] Switching this batch to CPU to complete.")
                    model.to("cpu")
                    try:
                        vecs = model.encode(
                            texts,
                            batch_size=1,
                            normalize_embeddings=True,
                            convert_to_numpy=True,
                            show_progress_bar=False
                        )
                    finally:
                        # Always return the model to its original device, even if the CPU pass fails
                        model.to(current_device)
                    return vecs.tolist()
                else:
                    raise
            else:
                # Not an OOM error; surface it
                raise

# ---------------------------
# Main pipeline
# ---------------------------
def main():
    # Resolve & validate input
    data_path = Path(DATA_PATH).expanduser().resolve()
    if not data_path.exists():
        raise SystemExit(f"❌ JSONL not found: {data_path}")

    # Prepare Chroma
    client = PersistentClient(path=PERSIST_DIR)
    collection = client.get_or_create_collection(COLLECTION_NAME)

    # Count lines for progress bar (optional; ok to skip on huge files)
    try:
        # Use a context manager so the file handle is closed after counting
        with data_path.open("r", encoding="utf-8") as fh:
            num_lines = sum(1 for _ in fh)
    except Exception:
        num_lines = None

    iterator = jsonl_iter(data_path)
    if SHOW_PROGRESS:
        iterator = tqdm(iterator, total=num_lines, desc="Reading JSONL")

    BUFFER: List[Dict[str, Any]] = []
    total = 0
    current_bs = START_BATCH_SIZE

    print(f"📦 Persist dir: {PERSIST_DIR}")
    print(f"🗃️  Collection: {COLLECTION_NAME}")
    print(f"🧠 Model: {EMBEDDING_MODEL} | Device: {device} | fp16: {USE_FP16} | max_seq_len: {MAX_SEQ_LEN}")
    print(f"📑 Input: {data_path}")
    print(f"⚙️  Start batch size: {START_BATCH_SIZE} | CPU fallback: {CPU_FALLBACK}")
    if torch.cuda.is_available():
        cuda_mem_summary("start")

    # Inner helper: flush a batch to Chroma robustly
    def flush(batch: List[Dict[str, Any]]):
        nonlocal total, current_bs
        if not batch:
            return

        # Extract and sanitize
        docs  = [b["content"] for b in batch]
        metas = [sanitize_meta(b["meta"]) for b in batch]
        ids   = [stable_id(b["meta"], b["content"]) for b in batch]

        # Embed with OOM backoff
        try:
            vecs = encode_with_backoff(
                docs,
                start_bs=current_bs,
                use_fp16=USE_FP16,
                model=embed_model,
                cpu_fallback=CPU_FALLBACK
            )
        except RuntimeError as e:
            raise RuntimeError(f"Embedding failed for a batch of size {len(docs)}: {e}") from e
        finally:
            clear_cuda_cache()

        # Intra-batch de-duplication (keeps only first occurrence of each ID)
        seen = set()
        keep = []
        for i, _id in enumerate(ids):
            if _id in seen:
                continue
            seen.add(_id)
            keep.append(i)
        if len(keep) != len(ids):
            print(f"[dedupe] removed {len(ids) - len(keep)} duplicate id(s) in the same batch")

        ids   = [ids[i] for i in keep]
        docs  = [docs[i] for i in keep]
        vecs  = [vecs[i] for i in keep]
        metas = [metas[i] for i in keep]

        # Write to Chroma, preferring upsert for idempotency
        try:
            if hasattr(collection, "upsert"):
                collection.upsert(ids=ids, embeddings=vecs, documents=docs, metadatas=metas)
            else:
                # Older Chroma: emulate upsert with add+update split
                existing = set(collection.get(ids=ids).get("ids", []) or [])
                new_idx = [i for i, _id in enumerate(ids) if _id not in existing]
                upd_idx = [i for i, _id in enumerate(ids) if _id in existing]
                if new_idx:
                    collection.add(
                        ids=[ids[i] for i in new_idx],
                        embeddings=[vecs[i] for i in new_idx],
                        documents=[docs[i] for i in new_idx],
                        metadatas=[metas[i] for i in new_idx],
                    )
                if upd_idx:
                    collection.update(
                        ids=[ids[i] for i in upd_idx],
                        embeddings=[vecs[i] for i in upd_idx],
                        documents=[docs[i] for i in upd_idx],
                        metadatas=[metas[i] for i in upd_idx],
                    )

        except (chroma_errors.DuplicateIDError, ValueError) as e:
            # Final safety net: heal per item (rare edge cases)
            print(f"[warn] batch write hit {type(e).__name__}: {e}\n[repair] trying per-item upsert/repair")
            for i in range(len(ids)):
                try:
                    if hasattr(collection, "upsert"):
                        collection.upsert(
                            ids=[ids[i]],
                            embeddings=[vecs[i]],
                            documents=[docs[i]],
                            metadatas=[metas[i]],
                        )
                    else:
                        if collection.get(ids=[ids[i]]).get("ids"):
                            collection.update(
                                ids=[ids[i]],
                                embeddings=[vecs[i]],
                                documents=[docs[i]],
                                metadatas=[metas[i]],
                            )
                        else:
                            collection.add(
                                ids=[ids[i]],
                                embeddings=[vecs[i]],
                                documents=[docs[i]],
                                metadatas=[metas[i]],
                            )
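                except Exception as inner:
                    # As a last resort, stringify all non-scalar meta fields and retry once
                    print(f"[repair] id={ids[i]} failed with {type(inner).__name__}: {inner}; retrying with stringified metadata")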
                    fallback_meta = {
                        k: (v if isinstance(v, (str, int, float, bool)) or v is None else json.dumps(v, ensure_ascii=False))
                        for k, v in metas[i].items()
                    }
                    if hasattr(collection, "upsert"):
                        collection.upsert(
                            ids=[ids[i]],
                            embeddings=[vecs[i]],
                            documents=[docs[i]],
                            metadatas=[fallback_meta],
                        )
                    else:
                        if collection.get(ids=[ids[i]]).get("ids"):
                            collection.update(
                                ids=[ids[i]],
                                embeddings=[vecs[i]],
                                documents=[docs[i]],
                                metadatas=[fallback_meta],
                            )
                        else:
                            collection.add(
                                ids=[ids[i]],
                                embeddings=[vecs[i]],
                                documents=[docs[i]],
                                metadatas=[fallback_meta],
                            )

        total += len(ids)
        if SHOW_PROGRESS:
            tqdm.write(f"✅ Flushed {len(ids)} items (total={total})")
        if torch.cuda.is_available() and total % max(64, START_BATCH_SIZE) == 0:
            cuda_mem_summary(f"after {total}")

    # Stream → buffer → flush
    for obj in iterator:
        # Minimal schema validation
        if not isinstance(obj, dict) or "content" not in obj or "meta" not in obj:
            continue
        BUFFER.append(obj)
        if len(BUFFER) >= current_bs:
            flush(BUFFER)
            BUFFER.clear()

    # Flush remainder
    if BUFFER:
        flush(BUFFER)
        BUFFER.clear()

    print(f"\n🎉 Done. Inserted {total} chunks into '{COLLECTION_NAME}' at '{PERSIST_DIR}'.")
    print(f"Model: {EMBEDDING_MODEL} | Device: {device} | Start batch: {START_BATCH_SIZE} | max_seq_len: {MAX_SEQ_LEN}")
    if torch.cuda.is_available():
        cuda_mem_summary("end")

# ---------------------------
# Entry point
# ---------------------------
if __name__ == "__main__":
    main()
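
The batch-halving idea behind `encode_with_backoff` can be tried without a GPU. The sketch below is a simplified illustration, not the script's implementation: it slices the input manually instead of passing `batch_size` to `model.encode`, and both `run_with_backoff` and `fake_encode` are hypothetical names introduced here. `fake_encode` simulates an out-of-memory failure whenever it receives more than 4 texts at once.

```python
from typing import Callable, List, Sequence


def run_with_backoff(
    items: Sequence[str],
    encode: Callable[[Sequence[str]], List[List[float]]],
    start_bs: int,
) -> List[List[float]]:
    """Encode items chunk by chunk; halve the chunk size on a simulated OOM and retry."""
    bs = max(1, start_bs)
    out: List[List[float]] = []
    i = 0
    while i < len(items):
        chunk = items[i:i + bs]
        try:
            out.extend(encode(chunk))
            i += len(chunk)  # advance only after the chunk succeeded
        except RuntimeError as e:
            # Surface non-OOM errors, and give up if a single item still does not fit
            if "out of memory" not in str(e) or bs == 1:
                raise
            bs = max(1, bs // 2)
    return out


def fake_encode(chunk: Sequence[str]) -> List[List[float]]:
    # Hypothetical stand-in for model.encode: fails for more than 4 texts at once
    if len(chunk) > 4:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [[float(len(t))] for t in chunk]


vectors = run_with_backoff([f"text {n}" for n in range(10)], fake_encode, start_bs=16)
print(len(vectors))
```

With `start_bs=16` the sketch fails at chunk sizes 10 and 8, then settles on 4 and finishes the remaining items at that size, which mirrors how the script keeps the reduced batch size for the rest of a call.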

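In `flush`, duplicate IDs are dropped only after the whole batch has been embedded, so the vectors computed for duplicates are thrown away. One possible variation, sketched here under the assumption that IDs depend only on `meta` and `content`, is to compute the indices to keep first and embed just those documents. `first_occurrence_indices` is a hypothetical helper, not part of the script above.

```python
from typing import Hashable, List, Sequence


def first_occurrence_indices(ids: Sequence[Hashable]) -> List[int]:
    """Return the positions of the first occurrence of each ID, in original order."""
    seen = set()
    keep: List[int] = []
    for i, _id in enumerate(ids):
        if _id not in seen:
            seen.add(_id)
            keep.append(i)
    return keep


ids = ["a", "b", "a", "c", "b"]
docs = ["doc-a", "doc-b", "doc-a (repeat)", "doc-c", "doc-b (repeat)"]

keep = first_occurrence_indices(ids)
unique_ids = [ids[i] for i in keep]
unique_docs = [docs[i] for i in keep]  # only these would be passed to the encoder
print(keep)  # [0, 1, 3]
```

Filtering before encoding keeps the write path unchanged while skipping the GPU work for repeated chunks.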


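The run output that follows interleaves tqdm progress bars with `Flushed` and `[dedupe]` messages. To total how many chunks were written and how many duplicates were dropped, a small parser along these lines may help. It is a sketch: `summarize_log` is a hypothetical helper, and the regular expressions assume the message formats printed by the script.

```python
import re
from typing import Dict, Iterable

FLUSH_RE = re.compile(r"Flushed (\d+) items \(total=(\d+)\)")
DEDUPE_RE = re.compile(r"\[dedupe\] removed (\d+) duplicate id\(s\)")


def summarize_log(lines: Iterable[str]) -> Dict[str, int]:
    """Tally flushed items, removed duplicates, and the last reported running total."""
    flushed = removed = last_total = 0
    for line in lines:
        m = FLUSH_RE.search(line)
        if m:
            flushed += int(m.group(1))
            last_total = int(m.group(2))
            continue
        m = DEDUPE_RE.search(line)
        if m:
            removed += int(m.group(1))
    return {"flushed": flushed, "removed": removed, "last_total": last_total}


# Sample lines taken from the beginning of the run output
sample = [
    "✅ Flushed 16 items (total=16)",
    "✅ Flushed 16 items (total=32)",
    "[dedupe] removed 1 duplicate id(s) in the same batch)",
    "✅ Flushed 15 items (total=47)",
]
summary = summarize_log(sample)
print(summary)
```

If `flushed` and `last_total` disagree on a complete log, some lines were lost or the log was truncated.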

📦 Persist dir: ./chroma_store
🗃️  Collection: guideline_chunks
🧠 Model: BAAI/bge-large-zh-v1.5 | Device: cuda | fp16: True | max_seq_len: 384
📑 Input: /home/myunix/caremind/data/guidelines.parsed.jsonl
⚙️  Start batch size: 16 | CPU fallback: True
[VRAM] start used=2.42 GB / total=11.99 GB (free=9.57 GB)


Reading JSONL:   1%|          | 32/5908 [00:01<02:42, 36.27it/s]

✅ Flushed 16 items (total=16)
✅ Flushed 16 items (total=32)


Reading JSONL:   1%|          | 64/5908 [00:01<01:33, 62.76it/s]

[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=47)
[dedupe] removed 4 duplicate id(s) in the same batch)
✅ Flushed 12 items (total=59)
[dedupe] removed 3 duplicate id(s) in the same batch)


Reading JSONL:   2%|▏         | 96/5908 [00:01<01:04, 89.96it/s]

✅ Flushed 13 items (total=72)
[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=85)


Reading JSONL:   2%|▏         | 112/5908 [00:01<00:57, 100.91it/s]

✅ Flushed 16 items (total=101)


Reading JSONL:   2%|▏         | 144/5908 [00:02<01:01, 93.71it/s] 

✅ Flushed 16 items (total=117)
✅ Flushed 16 items (total=133)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:   3%|▎         | 176/5908 [00:02<00:52, 109.58it/s]

✅ Flushed 14 items (total=147)
✅ Flushed 16 items (total=163)


Reading JSONL:   4%|▎         | 208/5908 [00:02<00:49, 115.30it/s]

✅ Flushed 16 items (total=179)
✅ Flushed 16 items (total=195)


Reading JSONL:   4%|▍         | 240/5908 [00:02<00:51, 110.59it/s]

[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=209)
✅ Flushed 16 items (total=225)


Reading JSONL:   5%|▍         | 272/5908 [00:03<00:50, 112.09it/s]

[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=240)
✅ Flushed 16 items (total=256)
[VRAM] after 256 used=2.49 GB / total=11.99 GB (free=9.50 GB)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:   5%|▌         | 304/5908 [00:03<00:47, 119.23it/s]

✅ Flushed 15 items (total=271)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=286)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:   6%|▌         | 336/5908 [00:03<00:45, 122.58it/s]

✅ Flushed 14 items (total=300)
[dedupe] removed 7 duplicate id(s) in the same batch)
✅ Flushed 9 items (total=309)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:   6%|▌         | 368/5908 [00:03<00:47, 117.11it/s]

✅ Flushed 15 items (total=324)
✅ Flushed 16 items (total=340)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:   7%|▋         | 400/5908 [00:04<00:46, 117.77it/s]

✅ Flushed 15 items (total=355)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=370)


Reading JSONL:   7%|▋         | 432/5908 [00:04<00:46, 116.78it/s]

✅ Flushed 16 items (total=386)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=401)


Reading JSONL:   8%|▊         | 464/5908 [00:04<00:47, 114.10it/s]

✅ Flushed 16 items (total=417)
✅ Flushed 16 items (total=433)


Reading JSONL:   8%|▊         | 496/5908 [00:05<00:51, 105.00it/s]

✅ Flushed 16 items (total=449)
✅ Flushed 16 items (total=465)


Reading JSONL:   9%|▉         | 528/5908 [00:05<00:56, 95.67it/s] 

✅ Flushed 16 items (total=481)
✅ Flushed 16 items (total=497)


Reading JSONL:   9%|▉         | 560/5908 [00:05<00:52, 101.12it/s]

[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=512)
[VRAM] after 512 used=2.49 GB / total=11.99 GB (free=9.50 GB)
✅ Flushed 16 items (total=528)


Reading JSONL:  10%|▉         | 576/5908 [00:05<00:52, 101.40it/s]

[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=543)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  10%|█         | 608/5908 [00:06<00:59, 89.38it/s] 

✅ Flushed 15 items (total=558)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=573)


Reading JSONL:  11%|█         | 640/5908 [00:06<01:02, 83.96it/s]

[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=587)
[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=600)


Reading JSONL:  11%|█         | 656/5908 [00:06<00:59, 87.61it/s]

[dedupe] removed 4 duplicate id(s) in the same batch)
✅ Flushed 12 items (total=612)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  12%|█▏        | 688/5908 [00:07<01:02, 83.00it/s]

✅ Flushed 14 items (total=626)
✅ Flushed 16 items (total=642)


Reading JSONL:  12%|█▏        | 720/5908 [00:07<01:04, 80.69it/s]

[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=656)
✅ Flushed 16 items (total=672)


Reading JSONL:  13%|█▎        | 752/5908 [00:08<01:03, 81.33it/s]

✅ Flushed 16 items (total=688)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=703)


Reading JSONL:  13%|█▎        | 784/5908 [00:08<01:03, 80.24it/s]

✅ Flushed 16 items (total=719)
[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=732)


Reading JSONL:  14%|█▎        | 800/5908 [00:08<01:03, 81.08it/s]

✅ Flushed 16 items (total=748)
[dedupe] removed 5 duplicate id(s) in the same batch)


Reading JSONL:  14%|█▍        | 832/5908 [00:09<01:02, 80.70it/s]

✅ Flushed 11 items (total=759)
✅ Flushed 16 items (total=775)


Reading JSONL:  15%|█▍        | 864/5908 [00:09<01:04, 78.25it/s]

✅ Flushed 16 items (total=791)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=806)


Reading JSONL:  15%|█▍        | 880/5908 [00:09<01:05, 77.24it/s]

✅ Flushed 16 items (total=822)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  15%|█▌        | 896/5908 [00:10<01:07, 73.97it/s]

✅ Flushed 14 items (total=836)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  15%|█▌        | 912/5908 [00:10<01:10, 70.84it/s]

✅ Flushed 15 items (total=851)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  16%|█▌        | 944/5908 [00:10<01:08, 72.86it/s]

✅ Flushed 15 items (total=866)
✅ Flushed 16 items (total=882)


Reading JSONL:  17%|█▋        | 976/5908 [00:11<01:06, 73.64it/s]

[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=895)
✅ Flushed 16 items (total=911)


Reading JSONL:  17%|█▋        | 992/5908 [00:11<01:04, 76.56it/s]

✅ Flushed 16 items (total=927)


Reading JSONL:  17%|█▋        | 1008/5908 [00:11<01:03, 77.05it/s]

✅ Flushed 16 items (total=943)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  17%|█▋        | 1024/5908 [00:11<01:04, 75.63it/s]

✅ Flushed 15 items (total=958)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  18%|█▊        | 1056/5908 [00:12<01:03, 76.35it/s]

✅ Flushed 14 items (total=972)
[dedupe] removed 4 duplicate id(s) in the same batch)
✅ Flushed 12 items (total=984)


Reading JSONL:  18%|█▊        | 1088/5908 [00:12<00:52, 92.58it/s]

✅ Flushed 16 items (total=1000)
✅ Flushed 16 items (total=1016)


Reading JSONL:  19%|█▉        | 1120/5908 [00:12<00:47, 101.17it/s]

✅ Flushed 16 items (total=1032)
[dedupe] removed 4 duplicate id(s) in the same batch)
✅ Flushed 12 items (total=1044)


Reading JSONL:  19%|█▉        | 1152/5908 [00:13<00:43, 108.55it/s]

✅ Flushed 16 items (total=1060)
✅ Flushed 16 items (total=1076)


Reading JSONL:  20%|██        | 1184/5908 [00:13<00:48, 98.04it/s] 

✅ Flushed 16 items (total=1092)
[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=1106)


Reading JSONL:  21%|██        | 1216/5908 [00:13<00:50, 93.78it/s]

[dedupe] removed 4 duplicate id(s) in the same batch)
✅ Flushed 12 items (total=1118)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=1133)


Reading JSONL:  21%|██        | 1248/5908 [00:14<00:46, 100.03it/s]

[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=1147)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=1162)


Reading JSONL:  22%|██▏       | 1280/5908 [00:14<00:45, 101.94it/s]

[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=1177)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=1192)


Reading JSONL:  22%|██▏       | 1312/5908 [00:14<00:43, 105.21it/s]

[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=1206)
[dedupe] removed 4 duplicate id(s) in the same batch)
✅ Flushed 12 items (total=1218)


Reading JSONL:  23%|██▎       | 1344/5908 [00:15<00:43, 104.46it/s]

[dedupe] removed 6 duplicate id(s) in the same batch)
✅ Flushed 10 items (total=1228)
✅ Flushed 16 items (total=1244)


Reading JSONL:  23%|██▎       | 1376/5908 [00:15<00:43, 103.90it/s]

[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=1258)
✅ Flushed 16 items (total=1274)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  24%|██▍       | 1408/5908 [00:15<00:45, 99.19it/s] 

✅ Flushed 14 items (total=1288)
✅ Flushed 16 items (total=1304)


Reading JSONL:  24%|██▍       | 1440/5908 [00:16<00:42, 104.13it/s]

[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=1317)
[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=1331)


Reading JSONL:  25%|██▍       | 1472/5908 [00:16<00:44, 98.78it/s] 

[dedupe] removed 5 duplicate id(s) in the same batch)
✅ Flushed 11 items (total=1342)
[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=1355)


Reading JSONL:  25%|██▌       | 1488/5908 [00:16<00:41, 105.55it/s]

✅ Flushed 16 items (total=1371)
[dedupe] removed 3 duplicate id(s) in the same batch)


Reading JSONL:  26%|██▌       | 1520/5908 [00:16<00:43, 101.43it/s]

✅ Flushed 13 items (total=1384)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=1399)


Reading JSONL:  26%|██▋       | 1552/5908 [00:17<00:45, 95.78it/s] 

✅ Flushed 16 items (total=1415)
[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=1428)


Reading JSONL:  27%|██▋       | 1584/5908 [00:17<00:48, 89.16it/s]

[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=1441)
✅ Flushed 16 items (total=1457)


Reading JSONL:  27%|██▋       | 1616/5908 [00:17<00:48, 88.84it/s]

✅ Flushed 16 items (total=1473)
✅ Flushed 16 items (total=1489)


Reading JSONL:  28%|██▊       | 1648/5908 [00:18<00:45, 93.72it/s]

[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=1504)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=1519)


Reading JSONL:  28%|██▊       | 1664/5908 [00:18<00:42, 99.46it/s]

[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=1533)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  29%|██▊       | 1696/5908 [00:18<00:44, 94.36it/s]

✅ Flushed 15 items (total=1548)
[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=1562)


Reading JSONL:  29%|██▉       | 1712/5908 [00:18<00:44, 93.72it/s]

[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=1577)


Reading JSONL:  30%|██▉       | 1744/5908 [00:19<00:46, 90.51it/s]

✅ Flushed 16 items (total=1593)
✅ Flushed 16 items (total=1609)


Reading JSONL:  30%|███       | 1776/5908 [00:19<00:42, 97.83it/s]

✅ Flushed 16 items (total=1625)
✅ Flushed 16 items (total=1641)


Reading JSONL:  31%|███       | 1808/5908 [00:19<00:41, 99.92it/s]

✅ Flushed 16 items (total=1657)
✅ Flushed 16 items (total=1673)


Reading JSONL:  31%|███       | 1840/5908 [00:20<00:41, 96.86it/s]

✅ Flushed 16 items (total=1689)
✅ Flushed 16 items (total=1705)


Reading JSONL:  32%|███▏      | 1872/5908 [00:20<00:44, 90.94it/s]

✅ Flushed 16 items (total=1721)
✅ Flushed 16 items (total=1737)


Reading JSONL:  32%|███▏      | 1888/5908 [00:20<00:44, 90.54it/s]

✅ Flushed 16 items (total=1753)


Reading JSONL:  32%|███▏      | 1904/5908 [00:21<00:49, 81.64it/s]

✅ Flushed 16 items (total=1769)


Reading JSONL:  32%|███▏      | 1920/5908 [00:21<00:49, 80.56it/s]

✅ Flushed 16 items (total=1785)


Reading JSONL:  32%|███▏      | 1920/5908 [00:20<00:49, 80.56it/s]

✅ Flushed 16 items (total=1801)
✅ Flushed 16 items (total=1817)



✅ Flushed 16 items (total=1833)


Reading JSONL:  34%|███▍      | 2000/5908 [00:21<00:23, 163.30it/s]

✅ Flushed 16 items (total=1849)
✅ Flushed 16 items (total=1865)


Reading JSONL:  34%|███▍      | 2032/5908 [00:21<00:32, 120.98it/s]

✅ Flushed 16 items (total=1881)
✅ Flushed 16 items (total=1897)


Reading JSONL:  35%|███▍      | 2048/5908 [00:22<00:37, 103.14it/s]

✅ Flushed 16 items (total=1913)


Reading JSONL:  35%|███▌      | 2080/5908 [00:22<00:43, 88.80it/s] 

✅ Flushed 16 items (total=1929)
✅ Flushed 16 items (total=1945)


Reading JSONL:  35%|███▌      | 2096/5908 [00:22<00:43, 86.74it/s]

✅ Flushed 16 items (total=1961)


Reading JSONL:  36%|███▌      | 2128/5908 [00:23<00:44, 85.66it/s]

✅ Flushed 16 items (total=1977)
✅ Flushed 16 items (total=1993)


Reading JSONL:  37%|███▋      | 2160/5908 [00:23<00:34, 107.49it/s]

✅ Flushed 16 items (total=2009)
✅ Flushed 16 items (total=2025)


Reading JSONL:  37%|███▋      | 2192/5908 [00:23<00:30, 120.20it/s]

✅ Flushed 16 items (total=2041)
✅ Flushed 16 items (total=2057)


Reading JSONL:  38%|███▊      | 2224/5908 [00:23<00:30, 119.95it/s]

✅ Flushed 16 items (total=2073)
✅ Flushed 16 items (total=2089)


Reading JSONL:  38%|███▊      | 2256/5908 [00:24<00:35, 103.83it/s]

✅ Flushed 16 items (total=2105)
✅ Flushed 16 items (total=2121)


Reading JSONL:  39%|███▊      | 2288/5908 [00:24<00:30, 120.63it/s]

✅ Flushed 16 items (total=2137)
✅ Flushed 16 items (total=2153)


Reading JSONL:  39%|███▉      | 2320/5908 [00:24<00:28, 126.41it/s]

✅ Flushed 16 items (total=2169)
✅ Flushed 16 items (total=2185)


Reading JSONL:  40%|███▉      | 2352/5908 [00:24<00:28, 124.49it/s]

✅ Flushed 16 items (total=2201)
✅ Flushed 16 items (total=2217)


Reading JSONL:  40%|████      | 2384/5908 [00:25<00:26, 131.70it/s]

✅ Flushed 16 items (total=2233)
[dedupe] removed 2 duplicate id(s) in the same batch)
✅ Flushed 14 items (total=2247)


Reading JSONL:  41%|████      | 2416/5908 [00:25<00:27, 127.46it/s]

✅ Flushed 16 items (total=2263)
✅ Flushed 16 items (total=2279)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  41%|████▏     | 2448/5908 [00:25<00:27, 127.37it/s]

✅ Flushed 14 items (total=2293)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=2308)


Reading JSONL:  42%|████▏     | 2480/5908 [00:25<00:26, 128.50it/s]

✅ Flushed 16 items (total=2324)
✅ Flushed 16 items (total=2340)


Reading JSONL:  43%|████▎     | 2512/5908 [00:26<00:27, 125.57it/s]

✅ Flushed 16 items (total=2356)
✅ Flushed 16 items (total=2372)


Reading JSONL:  43%|████▎     | 2544/5908 [00:26<00:27, 123.56it/s]

✅ Flushed 16 items (total=2388)
✅ Flushed 16 items (total=2404)


Reading JSONL:  44%|████▎     | 2576/5908 [00:26<00:27, 121.67it/s]

✅ Flushed 16 items (total=2420)
✅ Flushed 16 items (total=2436)


Reading JSONL:  44%|████▍     | 2608/5908 [00:27<00:27, 119.51it/s]

✅ Flushed 16 items (total=2452)
✅ Flushed 16 items (total=2468)


Reading JSONL:  45%|████▍     | 2640/5908 [00:27<00:27, 117.07it/s]

✅ Flushed 16 items (total=2484)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=2499)


Reading JSONL:  45%|████▌     | 2672/5908 [00:27<00:29, 110.66it/s]

✅ Flushed 16 items (total=2515)
✅ Flushed 16 items (total=2531)


Reading JSONL:  45%|████▌     | 2688/5908 [00:27<00:30, 105.45it/s]

✅ Flushed 16 items (total=2547)


Reading JSONL:  46%|████▌     | 2720/5908 [00:28<00:35, 91.07it/s] 

✅ Flushed 16 items (total=2563)
✅ Flushed 16 items (total=2579)


Reading JSONL:  46%|████▋     | 2736/5908 [00:28<00:35, 88.48it/s]

✅ Flushed 16 items (total=2595)


Reading JSONL:  47%|████▋     | 2752/5908 [00:28<00:36, 85.41it/s]

✅ Flushed 16 items (total=2611)


Reading JSONL:  47%|████▋     | 2784/5908 [00:28<00:37, 82.28it/s]

✅ Flushed 16 items (total=2627)
✅ Flushed 16 items (total=2643)


Reading JSONL:  48%|████▊     | 2816/5908 [00:29<00:37, 82.72it/s]

✅ Flushed 16 items (total=2659)
✅ Flushed 16 items (total=2675)


Reading JSONL:  48%|████▊     | 2832/5908 [00:29<00:37, 82.01it/s]

✅ Flushed 16 items (total=2691)


Reading JSONL:  48%|████▊     | 2848/5908 [00:29<00:37, 80.83it/s]

✅ Flushed 16 items (total=2707)


Reading JSONL:  48%|████▊     | 2864/5908 [00:30<00:46, 64.86it/s]

✅ Flushed 16 items (total=2723)


Reading JSONL:  49%|████▊     | 2880/5908 [00:30<00:49, 61.74it/s]

✅ Flushed 16 items (total=2739)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  49%|████▉     | 2896/5908 [00:30<00:52, 57.13it/s]

✅ Flushed 15 items (total=2754)


Reading JSONL:  49%|████▉     | 2912/5908 [00:31<00:54, 54.65it/s]

✅ Flushed 16 items (total=2770)


Reading JSONL:  50%|████▉     | 2928/5908 [00:31<00:53, 55.93it/s]

✅ Flushed 16 items (total=2786)


Reading JSONL:  50%|████▉     | 2944/5908 [00:31<00:52, 56.60it/s]

✅ Flushed 16 items (total=2802)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  50%|█████     | 2960/5908 [00:31<00:49, 58.97it/s]

✅ Flushed 14 items (total=2816)
[VRAM] after 2816 used=2.49 GB / total=11.99 GB (free=9.50 GB)


Reading JSONL:  50%|█████     | 2976/5908 [00:32<00:49, 59.58it/s]

✅ Flushed 16 items (total=2832)
[dedupe] removed 3 duplicate id(s) in the same batch)


Reading JSONL:  51%|█████     | 2992/5908 [00:32<00:46, 62.36it/s]

✅ Flushed 13 items (total=2845)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  51%|█████     | 3008/5908 [00:32<00:46, 62.52it/s]

✅ Flushed 15 items (total=2860)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  51%|█████     | 3024/5908 [00:32<00:47, 61.33it/s]

✅ Flushed 15 items (total=2875)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  51%|█████▏    | 3040/5908 [00:33<00:47, 59.80it/s]

✅ Flushed 14 items (total=2889)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  52%|█████▏    | 3056/5908 [00:33<00:50, 57.03it/s]

✅ Flushed 14 items (total=2903)
[dedupe] removed 6 duplicate id(s) in the same batch)


Reading JSONL:  52%|█████▏    | 3072/5908 [00:33<00:49, 57.18it/s]

✅ Flushed 10 items (total=2913)
[dedupe] removed 1 duplicate id(s) in the same batch)


Reading JSONL:  52%|█████▏    | 3088/5908 [00:34<00:49, 56.66it/s]

✅ Flushed 15 items (total=2928)


Reading JSONL:  53%|█████▎    | 3104/5908 [00:34<01:01, 45.71it/s]

✅ Flushed 16 items (total=2944)
[VRAM] after 2944 used=2.49 GB / total=11.99 GB (free=9.50 GB)


Reading JSONL:  53%|█████▎    | 3120/5908 [00:34<00:58, 48.05it/s]

✅ Flushed 16 items (total=2960)


Reading JSONL:  53%|█████▎    | 3136/5908 [00:35<00:56, 49.47it/s]

✅ Flushed 16 items (total=2976)


Reading JSONL:  53%|█████▎    | 3152/5908 [00:35<01:02, 44.41it/s]

✅ Flushed 16 items (total=2992)


Reading JSONL:  54%|█████▎    | 3168/5908 [00:35<00:57, 47.82it/s]

✅ Flushed 16 items (total=3008)
[VRAM] after 3008 used=2.49 GB / total=11.99 GB (free=9.50 GB)


Reading JSONL:  54%|█████▍    | 3184/5908 [00:36<00:51, 52.82it/s]

✅ Flushed 16 items (total=3024)


Reading JSONL:  54%|█████▍    | 3200/5908 [00:36<00:46, 58.41it/s]

✅ Flushed 16 items (total=3040)


Reading JSONL:  55%|█████▍    | 3232/5908 [00:36<00:40, 66.48it/s]

✅ Flushed 16 items (total=3056)
✅ Flushed 16 items (total=3072)
[VRAM] after 3072 used=2.49 GB / total=11.99 GB (free=9.50 GB)


Reading JSONL:  55%|█████▍    | 3248/5908 [00:36<00:40, 65.83it/s]

✅ Flushed 16 items (total=3088)


Reading JSONL:  55%|█████▌    | 3264/5908 [00:37<00:38, 67.87it/s]

✅ Flushed 16 items (total=3104)
[dedupe] removed 2 duplicate id(s) in the same batch)


Reading JSONL:  56%|█████▌    | 3296/5908 [00:37<00:35, 72.96it/s]

✅ Flushed 14 items (total=3118)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=3133)


Reading JSONL:  56%|█████▌    | 3312/5908 [00:37<00:36, 70.38it/s]

[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 15 items (total=3148)


Reading JSONL:  57%|█████▋    | 3344/5908 [00:38<00:34, 73.96it/s]

✅ Flushed 16 items (total=3164)
✅ Flushed 16 items (total=3180)


Reading JSONL:  57%|█████▋    | 3360/5908 [00:38<00:34, 72.94it/s]

[dedupe] removed 3 duplicate id(s) in the same batch)
✅ Flushed 13 items (total=3193)


Reading JSONL:  57%|█████▋    | 3376/5908 [00:38<00:33, 74.62it/s]

✅ Flushed 16 items (total=3209)


Reading JSONL:  57%|█████▋    | 3392/5908 [00:38<00:33, 74.56it/s]

✅ Flushed 16 items (total=3225)


[... intermediate progress output omitted ...]

Reading JSONL: 100%|██████████| 5908/5908 [01:06<00:00, 89.02it/s]


✅ Flushed 16 items (total=5543)
✅ Flushed 16 items (total=5559)
[dedupe] removed 1 duplicate id(s) in the same batch)
✅ Flushed 3 items (total=5562)

🎉 Done. Inserted 5562 chunks into 'guideline_chunks' at './chroma_store'.
Model: BAAI/bge-large-zh-v1.5 | Device: cuda | Start batch: 16 | max_seq_len: 384
[VRAM] end used=2.49 GB / total=11.99 GB (free=9.50 GB)


The script above is written for teaching as well as use, and incorporates the following fixes:

* OOM-safe on the RTX 4070: dynamic batch-size backoff, FP16 autocast, a shorter maximum sequence length, cache clearing, and an optional CPU fallback.

* Robust to duplicate IDs: upsert (with an add+update fallback), stronger IDs, intra-batch de-duplication, and a final per-item repair pass as a safety net.

* Metadata-safe: sanitize_meta(...) converts lists and dicts to scalars or JSON strings, which prevents errors such as ValueError … got ['秦煜'].

* Detailed comments that explain the inputs, the outputs, and the reason for each tactic.
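
The metadata and duplicate-ID tactics listed above can be sketched in a few lines. This is a minimal illustration, not the script's actual code: the function bodies, the rule that unwraps single-element lists, and the name dedupe_batch are assumptions made for the example.

```python
import json
from typing import Any, Dict, List, Tuple

SCALAR_TYPES = (str, int, float, bool)


def sanitize_meta(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Coerce metadata values to scalars (hypothetical re-implementation)."""
    clean: Dict[str, Any] = {}
    for key, value in meta.items():
        if value is None:
            continue  # assumption: drop missing values instead of storing None
        if isinstance(value, SCALAR_TYPES):
            clean[key] = value
        elif (isinstance(value, (list, tuple)) and len(value) == 1
              and isinstance(value[0], SCALAR_TYPES)):
            clean[key] = value[0]  # e.g. ['秦煜'] becomes '秦煜'
        else:
            # Anything else (longer lists, dicts) is stored as a JSON string
            clean[key] = json.dumps(value, ensure_ascii=False)
    return clean


def dedupe_batch(ids: List[str], docs: List[str]) -> Tuple[List[str], List[str], int]:
    """Keep the first occurrence of each id within one batch."""
    seen = set()
    kept_ids: List[str] = []
    kept_docs: List[str] = []
    for chunk_id, doc in zip(ids, docs):
        if chunk_id in seen:
            continue
        seen.add(chunk_id)
        kept_ids.append(chunk_id)
        kept_docs.append(doc)
    return kept_ids, kept_docs, len(ids) - len(kept_ids)


meta = sanitize_meta({"author": ["秦煜"], "year": 2020, "tags": ["a", "b"], "page": None})
ids, docs, removed = dedupe_batch(["c1", "c2", "c1"], ["text 1", "text 2", "text 1 again"])
print(meta)
print(ids, removed)
```

De-duplicating before the write explains why some batches in the log flush fewer than 16 items: the removed duplicates never reach the vector store.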

### 3.3 Drug data (Excel → SQLite)

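A single-file Python script that runs locally. It collects drug information from DrugBank (optional, requires an API key) and the NMPA data-search site, then writes caremind/data/drugs.xlsx. It also works without the DrugBank API (NMPA only) and can parse local NMPA label PDF/HTML files offline as a fallback. In this version the online NMPA search is a placeholder that returns no URLs, so the offline labels are the path that actually produces data.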
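
The label parsing that the script relies on can be tried in isolation. The sketch below mirrors the script's _slice_section helper: it takes the text after a section heading and stops at the next 【…】 heading. The sample label text is made up for illustration and is not taken from any real package insert.

```python
import re
from typing import Optional

# Any bracketed heading such as 【禁忌】 marks the start of the next section
NEXT_SECTION_RE = re.compile(r"【[^】]{1,20}】")


def slice_section(text: str, start_pat: str) -> Optional[str]:
    """Return the text between start_pat and the next 【…】 heading, or None."""
    m = re.search(start_pat, text)
    if not m:
        return None
    tail = text[m.end():]
    nxt = NEXT_SECTION_RE.search(tail)
    body = tail[:nxt.start()] if nxt else tail
    return re.sub(r"\s+", " ", body).strip() or None


# Hypothetical label text, for illustration only
SAMPLE = (
    "【药品名称】示例药 "
    "【适应症】用于演示的适应症文本。 "
    "【禁忌】对本品过敏者禁用。 "
    "【药物相互作用】尚不明确。"
)

indications = slice_section(SAMPLE, r"【适应症】[：:\s]*")
contraindications = slice_section(SAMPLE, r"【禁忌】[：:\s]*")
pregnancy = slice_section(SAMPLE, r"【孕妇及哺乳期用药】[：:\s]*")

print(indications)
print(contraindications)
print(pregnancy)  # this heading is absent from the sample
```

A missing heading yields None, which is why the script later fills empty fields with 未标注.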

In [None]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
caremind_drugs.py

Two ways to use this module
---------------------------
1) Import from a notebook:
   from caremind_drugs import run_from_notebook
   run_from_notebook(
       names=["氨氯地平","二甲双胍","阿托伐他汀"],
       out_file="~/caremind/data/drugs.xlsx",
       nmpa_offline_dir="~/caremind/data/nmpa_labels",
       use_nmpa_online=False,
       use_drugbank=False
   )

   Or use the lower-level API:
   from caremind_drugs import build_records, save_to_excel
   recs = build_records([...], use_nmpa_online=False, nmpa_offline_dir="...", use_drugbank=False)
   save_to_excel(recs, "~/caremind/data/drugs.xlsx")

2) Command line:
   python caremind_drugs.py \
     --in ~/caremind/data/drug_list.txt \
     --out ~/caremind/data/drugs.xlsx \
     --nmpa-offline-dir ~/caremind/data/nmpa_labels \
     --no-nmpa-online \
     --no-drugbank

Environment
-----------
- Python 3.10+
- pip install: requests beautifulsoup4 lxml pandas openpyxl tenacity pdfminer.six pypdf2 chardet python-dotenv
"""

from __future__ import annotations # defers evaluation of type hints: annotations are stored as strings and resolved only when needed
import os
import re
import time
import argparse # a powerful and flexible module used for parsing command-line arguments
from dataclasses import dataclass # a way to simplify the creation of classes that are primarily used to store data
from typing import Optional, Dict, List # used for type hinting to specify expected data types. Optional means a value can be of a type or None.

import requests # a popular library for making HTTP requests in Python
from bs4 import BeautifulSoup # a library for parsing HTML and XML documents
import pandas as pd # a powerful data manipulation and analysis library
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type # a library for retrying operations that can fail
from pdfminer.high_level import extract_text as pdf_extract_text # a library for extracting text from PDF files
from PyPDF2 import PdfReader # a library for reading and manipulating PDF files
import chardet # a library for detecting the encoding of text data
from dotenv import load_dotenv # a library for loading environment variables from a .env file

# ---------------------------
# Basic configuration
# ---------------------------
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; CareMindDrugBot/1.0; +https://example.org)"
}
NEXT_SECTION_RE = re.compile(r"【[^】]{1,20}】")
# Bracketed headings accept both 症 and 证 and are tried first, so that a heading
# such as 【禁忌症】 is consumed whole instead of leaving a stray "症】" in the body.
SECTION_PATTERNS = [
    ("适应症", r"(?:【适应[症证]】|适应症状|适\s*应\s*[症证])[：:\s]*"),
    ("禁忌症", r"(?:【禁忌[症证]?】|禁忌[症证]|禁\s*忌)[：:\s]*"),
    ("药物相互作用", r"(?:【药物相互作用】|药物-药物相互作用|相互作用)[：:\s]*"),
    ("妊娠分级", r"(?:【孕妇及哺乳期用药】|孕妇及哺乳期用药|孕期用药|妊娠用药)[：:\s]*"),
]

@dataclass
class DrugRecord:
    name: str
    indications: Optional[str] = None
    contraindications: Optional[str] = None
    interactions: Optional[str] = None
    pregnancy_category: Optional[str] = None
    source: Optional[str] = None


# ---------------------------
# Utility functions
# ---------------------------
def _ensure_utf8(text_bytes) -> str:
    """Decode bytes with the encoding detected by chardet; non-bytes input goes through str()."""
    if not isinstance(text_bytes, (bytes, bytearray)):
        return str(text_bytes)
    det = chardet.detect(text_bytes)
    enc = det.get("encoding") or "utf-8"
    try:
        return text_bytes.decode(enc, errors="ignore")
    except Exception:
        # Unknown codec name reported by chardet: fall back to UTF-8
        return text_bytes.decode("utf-8", errors="ignore")


def _slice_section(text: str, start_pat: str) -> Optional[str]:
    """Return the text between the first match of start_pat and the next 【…】 heading, or None."""
    m = re.search(start_pat, text, flags=re.IGNORECASE)
    if not m:
        return None
    start = m.end()
    tail = text[start:]
    nxt = NEXT_SECTION_RE.search(tail)
    body = tail[:nxt.start()] if nxt else tail
    return re.sub(r"\s+", " ", body).strip() or None


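def _parse_cn_label_text(full_text: str) -> Dict[str, Optional[str]]:
    """Extract the four label sections from the text of a Chinese package insert.

    The pregnancy section is free text, so it is reduced to a coarse
    keyword-based flag (禁用 / 慎用 / 未标注) rather than a formal category.
    """
    out = {"适应症": None, "禁忌症": None, "药物相互作用": None, "妊娠分级": None}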
    text = re.sub(r"\s+", " ", full_text)
    for key, pat in SECTION_PATTERNS:
        if key != "妊娠分级":
            out[key] = _slice_section(text, pat) or out[key]
    preg_txt = _slice_section(text, SECTION_PATTERNS[-1][1])
    if preg_txt:
        if re.search(r"(禁用|绝对禁用|禁止使用)", preg_txt):
            out["妊娠分级"] = "禁用（未标注分级）"
        elif re.search(r"(慎用|权衡利弊|风险.*收益)", preg_txt):
            out["妊娠分级"] = "慎用（未标注分级）"
        else:
            out["妊娠分级"] = "未标注"
    else:
        out["妊娠分级"] = "未标注"
    return out


# ---------------------------
# NMPA online (placeholder: must be adapted to the site's actual interface)
# ---------------------------
class NMPAClient:
    def __init__(self, rate_sec: float = 1.2):
        self.sess = requests.Session()
        self.rate_sec = rate_sec
        self.base = "https://www.nmpa.gov.cn/datasearch/"

    @retry(reraise=True, stop=stop_after_attempt(3),
           wait=wait_exponential(multiplier=1, min=1, max=8),
           retry=retry_if_exception_type((requests.RequestException,)))
    def _get(self, url: str, params: Optional[dict] = None) -> requests.Response:
        resp = self.sess.get(url, headers=HEADERS, params=params, timeout=15)
        if resp.status_code != 200:
            raise requests.RequestException(f"HTTP {resp.status_code}")
        return resp

    def search_label_urls(self, drug_name: str) -> List[str]:
        # TODO: capture the site's real search endpoint and replace this stub; until then no URLs are returned
        return []

    def fetch_label_text(self, url: str) -> Optional[str]:
        try:
            r = self._get(url)
            time.sleep(self.rate_sec)
            html = _ensure_utf8(r.content)
            soup = BeautifulSoup(html, "lxml")
            container = soup.find("div", class_="article") or soup.find(id="article") or soup
            text = container.get_text(separator="\n")
            return re.sub(r"\n+", "\n", text).strip()
        except Exception:
            return None


# ---------------------------
# DrugBank API (optional)
# ---------------------------
class DrugBankClient:
    def __init__(self):
        load_dotenv()
        self.api_key = os.getenv("DRUGBANK_API_KEY")
        self.base = os.getenv("DRUGBANK_BASE", "https://api.drugbank.com/v1")
        self.sess = requests.Session()
        if self.api_key:
            self.sess.headers.update({
                "Authorization": f"Bearer {self.api_key}",
                "Accept": "application/json",
                "User-Agent": HEADERS["User-Agent"],
            })

    def available(self) -> bool:
        return bool(self.api_key)

    @retry(reraise=True, stop=stop_after_attempt(3),
           wait=wait_exponential(multiplier=1, min=1, max=8),
           retry=retry_if_exception_type((requests.RequestException,)))
    def _get(self, path: str, params: Optional[dict] = None) -> dict:
        url = self.base.rstrip("/") + "/" + path.lstrip("/")
        resp = self.sess.get(url, params=params, timeout=20)
        resp.raise_for_status()
        return resp.json()

    def query_by_name(self, name: str) -> Optional[dict]:
        try:
            j = self._get("drugs", params={"name": name})
            first = (j[0] if isinstance(j, list) and j else j) or None
            if not first:
                return None
            did = first.get("id") or first.get("drugbank_id")
            return self._get(f"drugs/{did}") if did else first
        except Exception:
            return None

    @staticmethod
    def pick_fields(detail: dict) -> Dict[str, Optional[str]]:
        def g(d, *ks, default=None):
            cur = d
            for k in ks:
                if cur is None:
                    return default
                cur = cur.get(k)
            return cur if cur is not None else default

        indications = g(detail, "indication") or g(detail, "indications")
        contraindications = g(detail, "contraindications")
        interactions = None
        intr = g(detail, "drug_interactions")
        if isinstance(intr, list) and intr:
            parts = []
            for it in intr[:30]:
                desc = it.get("description") or it.get("text")
                partner = it.get("name") or it.get("drug") or ""
                if desc:
                    parts.append(f"{partner}: {desc}")
            if parts:
                interactions = "；".join(parts)
        preg = g(detail, "pregnancy_category") or g(detail, "fda_pregnancy_category")
        return {"适应症": indications, "禁忌症": contraindications,
                "药物相互作用": interactions, "妊娠分级": preg}


# ---------------------------
# Offline label parsing
# ---------------------------
def _read_pdf_text(path: str) -> str:
    # Try pdfminer first; fall back to PyPDF2 if it raises or returns no text.
    try:
        txt = pdf_extract_text(path) or ""
    except Exception:
        txt = ""
    if txt.strip():
        return txt
    try:
        reader = PdfReader(path)
        return "\n".join((p.extract_text() or "") for p in reader.pages)
    except Exception:
        return ""


def _read_html_text(path: str) -> str:
    try:
        with open(path, "rb") as f:
            raw = f.read()
        soup = BeautifulSoup(_ensure_utf8(raw), "lxml")
        return re.sub(r"\n+", "\n", soup.get_text(separator="\n")).strip()
    except Exception:
        return ""


def _scan_offline_label(drug_name: str, folder: Optional[str]) -> Optional[str]:
    if not folder or not os.path.isdir(os.path.expanduser(folder)):
        return None
    folder = os.path.expanduser(folder)
    cands = [os.path.join(folder, fn) for fn in os.listdir(folder)
             if re.search(re.escape(drug_name), fn, flags=re.IGNORECASE)]
    cands = sorted(cands, key=lambda p: (0 if p.lower().endswith((".html", ".htm")) else 1, p))
    for p in cands:
        if p.lower().endswith((".html", ".htm")):
            txt = _read_html_text(p)
        elif p.lower().endswith(".pdf"):
            txt = _read_pdf_text(p)
        else:
            continue
        if txt and len(txt) > 50:
            return txt
    return None


# ---------------------------
# Core record building and output
# ---------------------------
def build_records(
    names: List[str],
    use_nmpa_online: bool = True,
    nmpa_offline_dir: Optional[str] = None,
    use_drugbank: bool = True
) -> List[DrugRecord]:

    nmpa = NMPAClient()
    db = DrugBankClient()
    if use_drugbank and not db.available():
        use_drugbank = False

    out: List[DrugRecord] = []
    for name in names:
        rec = DrugRecord(name=name)

        # NMPA online (currently a placeholder; an empty result is simply skipped)
        if use_nmpa_online:
            urls = nmpa.search_label_urls(name)
            text = None
            for u in urls:
                text = nmpa.fetch_label_text(u)
                if text:
                    break
            if text:
                parsed = _parse_cn_label_text(text)
                rec.indications = parsed["适应症"]
                rec.contraindications = parsed["禁忌症"]
                rec.interactions = parsed["药物相互作用"]
                rec.pregnancy_category = parsed["妊娠分级"]
                rec.source = "NMPA说明书（在线）"

        # DrugBank (optional)
        if use_drugbank and not all([rec.indications, rec.contraindications, rec.interactions, rec.pregnancy_category]):
            detail = db.query_by_name(name)
            if detail:
                picked = DrugBankClient.pick_fields(detail)
                rec.indications = rec.indications or picked["适应症"]
                rec.contraindications = rec.contraindications or picked["禁忌症"]
                rec.interactions = rec.interactions or picked["药物相互作用"]
                rec.pregnancy_category = rec.pregnancy_category or picked["妊娠分级"]
                rec.source = (rec.source + " + DrugBank") if rec.source else "DrugBank"

        # Offline fallback, used only when no field has been filled so far
        if not any([rec.indications, rec.contraindications, rec.interactions, rec.pregnancy_category]):
            text = _scan_offline_label(name, nmpa_offline_dir)
            if text:
                parsed = _parse_cn_label_text(text)
                rec.indications = parsed["适应症"]
                rec.contraindications = parsed["禁忌症"]
                rec.interactions = parsed["药物相互作用"]
                rec.pregnancy_category = parsed["妊娠分级"]
                rec.source = "NMPA说明书（离线）"

        # Fill remaining gaps with the placeholder 未标注 (not stated)
        rec.indications = rec.indications or "未标注"
        rec.contraindications = rec.contraindications or "未标注"
        rec.interactions = rec.interactions or "未标注"
        rec.pregnancy_category = rec.pregnancy_category or "未标注"
        rec.source = rec.source or "未获取（请补充源）"

        out.append(rec)
    return out


def save_to_excel(records: List[DrugRecord], out_path: str):
    out_path = os.path.expanduser(out_path)
    out_dir = os.path.dirname(out_path)
    if out_dir:  # os.makedirs("") raises when out_path has no directory part
        os.makedirs(out_dir, exist_ok=True)
    # Note: these Chinese headers differ from the English column names used by
    # the SQLite schema further below; rename them before loading into SQLite.
    rows = [{
        "药品名称": r.name,
        "适应症": r.indications,
        "禁忌症": r.contraindications,
        "药物相互作用": r.interactions,
        "妊娠分级": r.pregnancy_category,
        "来源": r.source,
    } for r in records]
    df = pd.DataFrame(rows, columns=["药品名称","适应症","禁忌症","药物相互作用","妊娠分级","来源"])
    df.to_excel(out_path, index=False)
    print(f"✅ Wrote {len(df)} rows → {out_path}")


# ---------------------------
# Notebook-friendly entry point
# ---------------------------
def run_from_notebook(
    names: Optional[List[str]] = None,
    in_file: Optional[str] = None,
    out_file: str = "~/caremind/data/drugs.xlsx",
    nmpa_offline_dir: Optional[str] = None,
    use_nmpa_online: bool = False,
    use_drugbank: bool = False
):
    """
    - Provide either names or in_file; if both are given, names takes precedence.
    - Online lookup and DrugBank are off by default so the offline flow can be verified first.
    """
    if not names:
        if not in_file:
            raise ValueError("Provide a names list or an in_file path")
        in_file = os.path.expanduser(in_file)
        if not os.path.exists(in_file):
            raise FileNotFoundError(f"Drug list not found: {in_file}")
        with open(in_file, "r", encoding="utf-8") as f:
            names = [ln.strip() for ln in f if ln.strip()]

    recs = build_records(
        names=names,
        use_nmpa_online=use_nmpa_online,
        nmpa_offline_dir=os.path.expanduser(nmpa_offline_dir) if nmpa_offline_dir else None,
        use_drugbank=use_drugbank
    )
    save_to_excel(recs, out_file)
    return recs


# ---------------------------
# CLI entry point
# ---------------------------
def _load_names(path: str) -> List[str]:
    with open(os.path.expanduser(path), "r", encoding="utf-8") as f:
        return [ln.strip() for ln in f if ln.strip()]

def main():
    ap = argparse.ArgumentParser(description="Build drugs.xlsx from NMPA/DrugBank/local labels")
    ap.add_argument("--in", dest="in_file", required=True, help="Drug list txt file (one drug name per line)")
    ap.add_argument("--out", dest="out_file", required=True, help="Output Excel path")
    ap.add_argument("--nmpa-offline-dir", dest="nmpa_offline_dir", default=None, help="Directory of offline NMPA labels (optional)")
    ap.add_argument("--no-nmpa-online", action="store_true", help="Disable NMPA online lookup")
    ap.add_argument("--no-drugbank", action="store_true", help="Disable the DrugBank API")
    args = ap.parse_args()

    names = _load_names(args.in_file)
    recs = build_records(
        names=names,
        use_nmpa_online=not args.no_nmpa_online,
        nmpa_offline_dir=args.nmpa_offline_dir,
        use_drugbank=not args.no_drugbank
    )
    save_to_excel(recs, args.out_file)

if __name__ == "__main__":
    main()


usage: ipykernel_launcher.py [-h] --in IN_FILE --out OUT_FILE
                             [--nmpa-offline-dir NMPA_OFFLINE_DIR]
                             [--no-nmpa-online] [--no-drugbank]
ipykernel_launcher.py: error: the following arguments are required: --in, --out


SystemExit: 2

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
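
The SystemExit above comes from running the script as a notebook cell: inside a notebook `__name__` is `"__main__"`, so main() runs and argparse parses the ipykernel launcher's own arguments, which contain neither --in nor --out. One way around this, sketched here as an assumption rather than as the script's actual design, is to let the parser accept an explicit argument list; parse_cli and build_parser are hypothetical names.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Same options as the script's main(), without help texts for brevity."""
    ap = argparse.ArgumentParser(description="Build drugs.xlsx from NMPA/DrugBank/local labels")
    ap.add_argument("--in", dest="in_file", required=True)
    ap.add_argument("--out", dest="out_file", required=True)
    ap.add_argument("--nmpa-offline-dir", dest="nmpa_offline_dir", default=None)
    ap.add_argument("--no-nmpa-online", action="store_true")
    ap.add_argument("--no-drugbank", action="store_true")
    return ap


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:], which is right on the command line;
    # a notebook passes an explicit list so the kernel's own arguments are ignored.
    return build_parser().parse_args(argv)


# In a notebook cell: pass the arguments explicitly
args = parse_cli(["--in", "drug_list.txt", "--out", "drugs.xlsx", "--no-drugbank"])
print(args.in_file, args.out_file, args.no_nmpa_online, args.no_drugbank)
```

The alternative that needs no change to the script is the one used in the next cell: save the code as caremind_drugs.py and import run_from_notebook instead of executing the file's contents in a cell.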


In [3]:
# Assumes caremind_drugs.py sits in ~/caremind/ next to the notebook
import sys, os
sys.path.append(os.path.expanduser("~/caremind"))

from caremind_drugs import run_from_notebook

recs = run_from_notebook(
    names=["氨氯地平","二甲双胍","阿托伐他汀"],        # or use in_file="~/caremind/data/drug_list.txt"
    out_file="~/caremind/data/drugs.xlsx",
    nmpa_offline_dir="~/caremind/data/nmpa_labels",
    use_nmpa_online=False,   # set to True once the NMPA online lookup is implemented
    use_drugbank=False       # set to True once a DrugBank API key is configured
)
len(recs)


ImportError: cannot import name 'run_from_notebook' from 'caremind_drugs' (/home/myunix/caremind/caremind_drugs.py)
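
The ImportError shows that Python found caremind_drugs.py but the module it loaded does not define run_from_notebook. Two likely causes, neither confirmed by the output, are that the file on disk is an older version of the script, or that the kernel still holds an earlier import of it. The helper below is a small sketch for the second case; import_fresh is a hypothetical name, and colorsys stands in for caremind_drugs so that the example runs anywhere.

```python
import importlib
import sys
from types import ModuleType


def import_fresh(module_name: str) -> ModuleType:
    """Import a module, re-executing its source if the kernel already cached it."""
    if module_name in sys.modules:
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)


# Stand-in demonstration with a standard-library module
mod = import_fresh("colorsys")
has_function = hasattr(mod, "rgb_to_hsv")
print(has_function)

# In the notebook, the equivalent check would be:
#   mod = import_fresh("caremind_drugs")
#   hasattr(mod, "run_from_notebook")
```

If the check still returns False after a reload, the file on disk does not contain the function and needs to be saved again with the full script.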

Alternatively, compile data/drugs.xlsx by hand (columns: name, indications, contraindications, interactions, dosage, pregnancy_category, source).

In [4]:
import pandas as pd
import sqlite3, os
from dotenv import load_dotenv
load_dotenv()

db_path = os.getenv("SQLITE_PATH", "./db/drugs.sqlite")
os.makedirs(os.path.dirname(db_path), exist_ok=True)

schema = """
CREATE TABLE IF NOT EXISTS drugs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT UNIQUE,
  indications TEXT,
  contraindications TEXT,
  interactions TEXT,
  dosage TEXT,
  pregnancy_category TEXT,
  source TEXT
);
"""

def main():
    df = pd.read_excel("data/drugs.xlsx")
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.executescript(schema)
    print(df.columns)
    print(df.head())
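    # Caution: if_exists="replace" drops the table created from `schema` and lets pandas
    # build a new one from df, so the id and UNIQUE constraints are lost; with an empty
    # df the generated CREATE TABLE has no columns, which SQLite rejects.
    df.to_sql("drugs", con, if_exists="replace", index=False)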
    con.commit()
    con.close()
    print("Drugs loaded into SQLite.")

if __name__ == "__main__":
    main()


RangeIndex(start=0, stop=0, step=1)
Empty DataFrame
Columns: []
Index: []


OperationalError: near ")": syntax error
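
The printed output shows that data/drugs.xlsx was read as an empty DataFrame with no columns, so the table pandas tried to create had no columns and SQLite raised the syntax error. A more defensive loader could validate the rows, map the Chinese headers written by save_to_excel onto the schema's column names, and insert into the predefined table instead of replacing it. The sketch below uses only sqlite3 and plain dictionaries; load_rows and HEADER_MAP are hypothetical names, and the header mapping is an assumption based on the two column lists in this section.

```python
import sqlite3
from typing import Any, Dict, Iterable

COLUMNS = ["name", "indications", "contraindications", "interactions",
           "dosage", "pregnancy_category", "source"]

# Assumed mapping from the Excel headers to the schema columns
HEADER_MAP = {"药品名称": "name", "适应症": "indications", "禁忌症": "contraindications",
              "药物相互作用": "interactions", "妊娠分级": "pregnancy_category", "来源": "source"}

SCHEMA = """
CREATE TABLE IF NOT EXISTS drugs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT UNIQUE,
  indications TEXT,
  contraindications TEXT,
  interactions TEXT,
  dosage TEXT,
  pregnancy_category TEXT,
  source TEXT
);
"""


def load_rows(con: sqlite3.Connection, rows: Iterable[Dict[str, Any]]) -> int:
    """Insert or replace drug rows in the predefined table; returns the row count."""
    rows = list(rows)
    if not rows:
        raise ValueError("no drug rows to load; check that drugs.xlsx is not empty")
    con.executescript(SCHEMA)
    payload = []
    for row in rows:
        norm = {HEADER_MAP.get(key, key): value for key, value in row.items()}
        if not norm.get("name"):
            raise ValueError(f"row without a drug name: {row!r}")
        payload.append([norm.get(col) for col in COLUMNS])  # missing columns become NULL
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT OR REPLACE INTO drugs ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    con.executemany(sql, payload)
    con.commit()
    return len(payload)


con = sqlite3.connect(":memory:")
inserted = load_rows(con, [
    {"药品名称": "氨氯地平", "适应症": "未标注", "来源": "未获取（请补充源）"},
    {"name": "二甲双胍"},
])
count = con.execute("SELECT COUNT(*) FROM drugs").fetchone()[0]
print(inserted, count)
```

With pandas, the rows could come from `pd.read_excel("data/drugs.xlsx").to_dict("records")`. Because name is UNIQUE, INSERT OR REPLACE keeps reruns from failing on rows that already exist.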