# 🌿 GreenRetrieval: AI Plant Disease Diagnosis

**Complete, self-contained Google Colab notebook** for diagnosing plant diseases using:
- **EPPO Global Database** for verified plant pathogen data  
- **Groq LLM** for natural language generation


---

## 📋 What You Need

### 1. API Keys (both free):
- **EPPO API Key**: Get from https://data.eppo.int  
- **Groq API Key**: Get from https://console.groq.com/keys

### 2. SQLite Database:
- Download **`eppocodes_all.sqlite`** from https://www.eppo.int/download  
- Upload it to Colab (see Step 2 below)
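
Before uploading, it can help to confirm that the downloaded file really is the expected database. The sketch below is an optional check and not part of the notebook: `missing_tables` is a hypothetical helper, and the table names `t_codes` and `t_names` are taken from the queries in the pipeline cell further down.

```python
import sqlite3
from pathlib import Path

# The two tables the pipeline's candidate query joins (assumed from the notebook's SQL)
REQUIRED_TABLES = {"t_codes", "t_names"}

def missing_tables(db_path: Path) -> set:
    """Return the required tables that are absent from the SQLite file."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    present = {name for (name,) in rows}
    return REQUIRED_TABLES - present
```

An empty set means both tables are present. Because `sqlite3.connect` creates an empty file when the path does not exist, check `Path.exists()` before calling the helper.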

---

## 🚀 Quick Start

1. **Install Dependencies** → Run the Step 1 cell
2. **Load Database** → Upload `eppocodes_all.sqlite` to `/content/` with the folder icon (📁) in the left sidebar, or link it from Google Drive (Step 2)
3. **Set API Keys** → Add your keys as Colab Secrets (Step 3)
4. **Load the Pipeline** → Run the Step 4 cell once
5. **Run Diagnoses!** → Execute the final cell (Step 5)

Let's go! 👇

## Step 1: Install Dependencies

In [1]:
# Install required packages
!pip install -q groq requests tqdm

print("✅ Dependencies installed!")

✅ Dependencies installed!


## Step 2: Load SQLite Database

**Choose ONE of the two methods below:**

### 🔹 Method 1: Upload Directly (Temporary - lost when runtime restarts)
### 🔹 Method 2: Connect to Google Drive (Persistent - survives runtime restarts)

Run the cell below and follow the prompts.

In [2]:
# Load SQLite Database - Choose your method
from pathlib import Path
import os

print("🗄️  DATABASE SETUP")
print("=" * 60)
print("Choose how to load the SQLite database:\n")
print("1️⃣  Upload directly (temporary - lost on runtime restart)")
print("2️⃣  Load from Google Drive (persistent)\n")

choice = input("Enter your choice (1 or 2): ").strip()

DB_PATH = Path("/content/eppocodes_all.sqlite")

if choice == "1":
    # Method 1: Direct Upload
    print("\n📤 Upload eppocodes_all.sqlite using the file picker...")
    from google.colab import files
    uploaded = files.upload()

    if 'eppocodes_all.sqlite' in uploaded:
        # Move to expected location
        import shutil
        if Path('eppocodes_all.sqlite').exists():
            shutil.move('eppocodes_all.sqlite', str(DB_PATH))
        print(f"✅ Database uploaded to {DB_PATH}")
    else:
        print("❌ Error: Please upload a file named 'eppocodes_all.sqlite'")

elif choice == "2":
    # Method 2: Google Drive
    print("\n📂 Mounting Google Drive...")
    from google.colab import drive
    drive.mount('/content/drive')

    print("\n📍 Enter the path to your database in Google Drive")
    print("   Example: /content/drive/MyDrive/datasets/eppocodes_all.sqlite")
    print("   Or just: MyDrive/datasets/eppocodes_all.sqlite\n")

    gdrive_path = input("Path: ").strip()

    # Handle both absolute and relative paths
    if not gdrive_path.startswith('/content/drive/'):
        gdrive_path = f"/content/drive/{gdrive_path}"

    source_path = Path(gdrive_path)

    if source_path.exists():
        # Create symlink for consistent access. is_symlink() also catches a
        # dangling link left over from an earlier session, which exists() misses.
        if DB_PATH.exists() or DB_PATH.is_symlink():
            DB_PATH.unlink()
        DB_PATH.symlink_to(source_path)
        print("✅ Database linked from Google Drive")
    else:
        print(f"❌ Error: Database not found at {source_path}")
        print("   Make sure you've uploaded eppocodes_all.sqlite to Google Drive")
        print("   and the path is correct.")

else:
    print("❌ Invalid choice. Please run the cell again and enter 1 or 2.")

# Verify database
print("\n" + "=" * 60)
if DB_PATH.exists():
    size_mb = DB_PATH.stat().st_size / (1024 * 1024)
    print("✅ Database ready!")
    print(f"   Location: {DB_PATH}")
    print(f"   Size: {size_mb:.1f} MB")

    # Update environment variable
    os.environ["EPPO_SQLITE_PATH"] = str(DB_PATH)
else:
    print("❌ Database not found. Please try again.")

🗄️  DATABASE SETUP
Choose how to load the SQLite database:

1️⃣  Upload directly (temporary - lost on runtime restart)
2️⃣  Load from Google Drive (persistent)

Enter your choice (1 or 2): 2

📂 Mounting Google Drive...
Mounted at /content/drive

📍 Enter the path to your database in Google Drive
   Example: /content/drive/MyDrive/datasets/eppocodes_all.sqlite
   Or just: MyDrive/datasets/eppocodes_all.sqlite

Path: /content/drive/MyDrive/greenretrieval/eppocodes_all.sqlite
✅ Database linked from Google Drive

✅ Database ready!
   Location: /content/eppocodes_all.sqlite
   Size: 47.3 MB
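
With the database in place, the candidate lookup used later in the pipeline is a `LIKE`-based join between `t_codes` and `t_names`. The following is a minimal sketch on an in-memory database: the rows, codes, and names are made-up toy data, and `find_candidates` is a hypothetical stand-in for the notebook's `query_candidates`, reusing only the column names that appear in its SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_codes (codeid INTEGER, eppocode TEXT, dtcode TEXT, status TEXT);
    CREATE TABLE t_names (codeid INTEGER, fullname TEXT, status TEXT);
    -- Toy rows for illustration only; these are not real EPPO records
    INSERT INTO t_codes VALUES (1, 'AAAAAA', 'GAF', 'A'), (2, 'BBBBBB', 'SFT', 'A');
    INSERT INTO t_names VALUES (1, 'example leaf rust', 'A'),
                               (2, 'trusty example', 'A'),
                               (2, 'retired rust name', 'I');
""")

def find_candidates(conn, tokens):
    """Return active (eppocode, dtcode, fullname) rows whose name contains any token."""
    if not tokens:
        return []  # an empty token list would produce invalid SQL
    where = " OR ".join("n.fullname LIKE ?" for _ in tokens)
    sql = f"""
        SELECT DISTINCT c.eppocode, c.dtcode, n.fullname
        FROM t_codes c
        JOIN t_names n ON c.codeid = n.codeid
        WHERE c.status = 'A' AND n.status = 'A' AND ({where})
        ORDER BY c.eppocode
    """
    return conn.execute(sql, [f"%{t}%" for t in tokens]).fetchall()

rows = find_candidates(conn, ["rust", "leaf"])
print(rows)
# [('AAAAAA', 'GAF', 'example leaf rust'), ('BBBBBB', 'SFT', 'trusty example')]
```

The second row matches only because `LIKE '%rust%'` also hits the substring inside "trusty". That is why the notebook rescores the SQL results on whole tokens before ranking them.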


## Step 3: Set Your API Keys (Using Colab Secrets) 🔐

**For security, use Google Colab's Secrets feature:**

1. Click the **🔑 key icon** in the left sidebar (Secrets)
2. Add two secrets:
   - Name: `EPPO_API_KEY` → Value: Your EPPO API key
   - Name: `GROQ_API_KEY` → Value: Your Groq API key
3. **Enable notebook access** by toggling the switch for each secret
4. Run the cell below to load them

**Alternative:** If you prefer, you can manually set them by uncommenting the manual method in the cell below.

In [3]:
# Configuration: Load API keys from Colab Secrets
import os

# Method 1: Using Google Colab Secrets (RECOMMENDED - Secure)
try:
    from google.colab import userdata
    os.environ["EPPO_API_KEY"] = userdata.get('EPPO_API_KEY')
    os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
    print("✅ Loaded API keys from Colab Secrets!")
except Exception as e:
    print("⚠️  Could not load from Colab Secrets. Set the keys manually below if needed.")
    print(f"   Error: {e}")

    # Method 2: Manual (NOT RECOMMENDED - Keys visible in code)
    # Uncomment and paste your keys here if Secrets don't work:
    # os.environ["EPPO_API_KEY"] = ""  # 👈 Paste your EPPO key here
    # os.environ["GROQ_API_KEY"] = ""  # 👈 Paste your Groq key here

# Paths
os.environ["EPPO_SQLITE_PATH"] = "/content/eppocodes_all.sqlite"
os.environ["EPPO_CACHE_DIR"] = "/content/.eppo_cache"

# Verify keys are set
eppo_key = os.environ.get("EPPO_API_KEY", "")
groq_key = os.environ.get("GROQ_API_KEY", "")

if eppo_key and groq_key:
    print("✅ API keys configured!")
    print(f"   EPPO: {'*' * (len(eppo_key) - 4)}{eppo_key[-4:]}")
    print(f"   Groq: {'*' * (len(groq_key) - 4)}{groq_key[-4:]}")
else:
    print("⚠️  Warning: One or both API keys are missing")
    if not eppo_key:
        print("  ❌ EPPO_API_KEY not set")
    if not groq_key:
        print("  ❌ GROQ_API_KEY not set")
    print("\n📌 To fix this:")
    print("   1. Click the 🔑 key icon in the left sidebar")
    print("   2. Add secrets: EPPO_API_KEY and GROQ_API_KEY")
    print("   3. Enable 'Notebook access' for both")
    print("   4. Re-run this cell")

✅ Loaded API keys from Colab Secrets!
✅ API keys configured!
   EPPO: ****************************5b35
   Groq: ****************************************************tO6E
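
The masking in the cell above prints `'*' * (len(key) - 4)` followed by the last four characters, so a key of four characters or fewer would be printed in full. A slightly more defensive variant could look like the sketch below; `mask_key` is a hypothetical helper and not part of the notebook.

```python
def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of a secret."""
    if len(key) <= visible:
        return "*" * len(key)  # too short to reveal any part safely
    return "*" * (len(key) - visible) + key[-visible:]

print(mask_key("abcdefgh"))  # ****efgh
print(mask_key("abc"))       # ***
```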


## Step 4: Load All Code (One Cell!)

This cell contains the entire GreenRetrieval pipeline. Just run it once.

In [4]:
# ============================================================================
# GreenRetrieval: Complete Pipeline (Self-Contained)
# ============================================================================

import os
import re
import json
import time
import sqlite3
from pathlib import Path
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Set, Tuple

# ---------------------------------------------------------------------------
# Global Statistics Tracking
# ---------------------------------------------------------------------------

class PipelineStats:
    def __init__(self):
        self.cache_hits = 0
        self.cache_misses = 0
        self.api_calls = 0
        self.groq_calls = 0
        self.diagnoses = []

    def reset(self):
        self.cache_hits = 0
        self.cache_misses = 0
        self.api_calls = 0
        self.groq_calls = 0
        self.diagnoses = []

    def add_diagnosis(self, result, label):
        self.diagnoses.append({"label": label, "result": result})

    def summary(self):
        total = len(self.diagnoses)
        if total == 0:
            return "No diagnoses performed."

        verified = sum(1 for d in self.diagnoses if not d["result"].refused)
        refused = total - verified
        # Average only over results that carry a score, so refusals without
        # a confidence value do not drag the mean toward zero
        scores = [d["result"].confidence for d in self.diagnoses
                  if d["result"].confidence is not None]
        avg_conf = sum(scores) / len(scores) if scores else 0.0

        lines = [
            "\n" + "=" * 80,
            "📊 PIPELINE SUMMARY STATISTICS",
            "=" * 80,
            f"✅ Verified: {verified}/{total} ({verified/total*100:.1f}%)",
            f"🚫 Refused: {refused}/{total} ({refused/total*100:.1f}%)",
            f"🎯 Average Confidence: {avg_conf:.2%}",
            "\n💾 EPPO API Cache:",
            f"   Hits: {self.cache_hits} (reused from disk)",
            f"   Misses: {self.cache_misses} (fetched from API)",
            f"   Total API Calls: {self.api_calls}",
            f"\n🤖 Groq LLM Calls: {self.groq_calls}",
            "=" * 80,
        ]
        return "\n".join(lines)

STATS = PipelineStats()

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

PROJECT_ROOT = Path("/content")
SQLITE_PATH = Path(os.environ.get("EPPO_SQLITE_PATH", "/content/eppocodes_all.sqlite"))
EPPO_CACHE_DIR = Path(os.environ.get("EPPO_CACHE_DIR", "/content/.eppo_cache"))

EPPO_API_KEY = os.environ.get("EPPO_API_KEY", "")
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")

EPPO_BASE_URL = "https://api.eppo.int/gd/v2"
CONFIDENCE_THRESHOLD = 0.3  # Lowered from 0.45 for better recall
GROQ_MODEL = "openai/gpt-oss-120b"  # 120B model for best accuracy + built-in search
GROQ_MAX_TOKENS = 1024
GROQ_TEMPERATURE = 0.3  # Low temp = more factual, less creative
EPPO_RATE_LIMIT_DELAY = 0.2

# ---------------------------------------------------------------------------
# 1. Normalization
# ---------------------------------------------------------------------------

GENERIC_TERMS = frozenset({
    "of", "the", "and", "on", "in", "plant", "plants", "crop", "crops",
})
LOCATION_TERMS = frozenset({
    "leaf", "leaves", "stem", "stems", "fruit", "fruits", "root", "roots",
    "seed", "seeds", "flower", "flowers", "bark", "shoot", "branch",
})
# Symptom synonyms for semantic matching (currently not referenced by the scoring code below)
SYMPTOM_SYNONYMS = {
    "blight": {"blight", "spot", "lesion", "necrosis"},
    "rust": {"rust", "uredinia", "pustule"},
    "mosaic": {"mosaic", "mottle", "pattern", "variegation"},
    "rot": {"rot", "decay", "decomposition"},
    "wilt": {"wilt", "wilting", "droop", "collapse"},
    "curl": {"curl", "curling", "distortion", "deformation"},
}
MIN_TOKEN_LEN = 2

@dataclass
class NormalizedLabel:
    original: str
    tokens: List[str]
    host_candidates: List[str]
    symptom_candidates: List[str]
    location_terms: List[str]  # NEW: preserve location keywords

def _tokenize(text: str) -> List[str]:
    text = (text or "").strip().lower()
    tokens = re.split(r"[^\w]+", text)
    return [t for t in tokens if len(t) >= MIN_TOKEN_LEN]

def normalize_cv_label(label: str) -> NormalizedLabel:
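    if not label or not isinstance(label, str):
        # Empty, None, or non-string input: return an empty label instead of crashing
        return NormalizedLabel(original=label if isinstance(label, str) else "",
                                tokens=[], host_candidates=[],
                                symptom_candidates=[], location_terms=[])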

    original = label.strip()
    tokens = _tokenize(original)

    # Extract location terms BEFORE filtering
    location_terms = [t for t in tokens if t in LOCATION_TERMS]

    # Filter out generic terms but keep location terms
    meaningful = [t for t in tokens if t not in GENERIC_TERMS]

    if not meaningful:
        meaningful = tokens

    host_candidates = [meaningful[0]] if meaningful else []
    symptom_candidates = meaningful[1:] if len(meaningful) > 1 else meaningful

    return NormalizedLabel(
        original=original,
        tokens=meaningful,
        host_candidates=host_candidates,
        symptom_candidates=symptom_candidates,
        location_terms=location_terms,
    )

# ---------------------------------------------------------------------------
# 2. Retrieval
# ---------------------------------------------------------------------------

PREFERRED_DTCODE = "GAF"
SECONDARY_DTCODES = frozenset({"SFT"})

@dataclass
class Candidate:
    eppocode: str
    dtcode: str
    fullname: str
    score: float
    token_overlap: int
    host_match: bool

def _tokenize_name(name: str) -> set:
    tokens = re.split(r"[^\w]+", (name or "").lower())
    return {t for t in tokens if len(t) >= 2}

def _score_candidate(eppocode: str, dtcode: str, fullname: str, norm: NormalizedLabel) -> Tuple[float, int, bool]:
    name_tokens = _tokenize_name(fullname)
    query_tokens = set(norm.tokens)
    overlap = len(query_tokens & name_tokens)
    host_match = bool(norm.host_candidates and (set(norm.host_candidates) & name_tokens))

    # NEW: Check if location terms match (e.g., "leaf rust" should match names with "leaf")
    location_match = 0
    if norm.location_terms:
        location_tokens = set(norm.location_terms)
        location_match = len(location_tokens & name_tokens) / len(location_tokens)

    query_len = max(len(query_tokens), 1)
    overlap_ratio = overlap / query_len
    host_bonus = 0.2 if host_match else 0.0
    location_bonus = 0.3 * location_match  # NEW: Strong bonus for matching location terms
    dtcode_bonus = 0.15 if dtcode == PREFERRED_DTCODE else (0.05 if dtcode in SECONDARY_DTCODES else 0.0)

    score = overlap_ratio + host_bonus + location_bonus + dtcode_bonus
    return (min(score, 1.5), overlap, host_match)  # Allow scores > 1.0 for better differentiation

def query_candidates(sqlite_path: Path, norm: NormalizedLabel, max_candidates: int = 50) -> List[Candidate]:
    if not norm.tokens or not sqlite_path.exists():
        return []

    conn = sqlite3.connect(str(sqlite_path))
    conn.row_factory = sqlite3.Row
    try:
        placeholders = " OR ".join(["n.fullname LIKE ?" for _ in norm.tokens])
        params = [f"%{t}%" for t in norm.tokens]

        sql = f"""
            SELECT DISTINCT c.eppocode, c.dtcode, n.fullname
            FROM t_codes c
            JOIN t_names n ON c.codeid = n.codeid
            WHERE c.status = 'A' AND n.status = 'A'
              AND ({placeholders})
        """
        cur = conn.execute(sql, params)
        rows = list(cur.fetchall())
    finally:
        conn.close()

    query_tokens_set = set(norm.tokens)
    by_code: dict[Tuple[str, str], str] = {}
    for row in rows:
        key = (row["eppocode"], row["dtcode"])
        name = row["fullname"] or ""
        name_tokens = _tokenize_name(name)
        overlap = len(query_tokens_set & name_tokens)
        prev_name = by_code.get(key, "")
        prev_overlap = len(query_tokens_set & _tokenize_name(prev_name)) if prev_name else -1
        if key not in by_code or overlap > prev_overlap or (overlap == prev_overlap and len(name) > len(prev_name)):
            by_code[key] = name

    candidates: List[Candidate] = []
    for (eppocode, dtcode), fullname in by_code.items():
        score, token_overlap, host_match = _score_candidate(eppocode, dtcode, fullname, norm)
        candidates.append(
            Candidate(
                eppocode=eppocode,
                dtcode=dtcode,
                fullname=fullname,
                score=score,
                token_overlap=token_overlap,
                host_match=host_match,
            )
        )

    candidates.sort(key=lambda c: c.score, reverse=True)
    return candidates[:max_candidates]

def select_best(candidates: List[Candidate], threshold: float) -> Optional[Candidate]:
    if not candidates:
        return None
    best = candidates[0]
    return best if best.score >= threshold else None

# ---------------------------------------------------------------------------
# 3. EPPO API Retrieval
# ---------------------------------------------------------------------------

def _load_cached(cache_dir: Path, eppocode: str, endpoint_suffix: str) -> Optional[Dict[str, Any]]:
    cache_file = cache_dir / "taxons" / eppocode / f"{endpoint_suffix}.json"
    if not cache_file.exists():
        return None
    try:
        with open(cache_file, "r", encoding="utf-8") as f:
            return json.load(f)
    except Exception:
        return None

def _save_cached(cache_dir: Path, eppocode: str, endpoint_suffix: str, data: Any):
    cache_file = cache_dir / "taxons" / eppocode / f"{endpoint_suffix}.json"
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    try:
        with open(cache_file, "w", encoding="utf-8") as f:
            json.dump(data, f)
    except Exception:
        pass

def _get_eppo(eppocode: str, endpoint_suffix: str, api_key: str, base_url: str,
              cache_dir: Optional[Path], use_cache: bool, max_retries: int = 3) -> Optional[Dict[str, Any]]:
    if use_cache and cache_dir:
        cached = _load_cached(cache_dir, eppocode, endpoint_suffix)
        if cached is not None:
            STATS.cache_hits += 1
            return cached

    url = f"{base_url.rstrip('/')}/taxons/taxon/{eppocode}/{endpoint_suffix}"
    headers = {"X-Api-Key": api_key} if api_key else {}

    import requests  # third-party HTTP client installed in Step 1

    STATS.cache_misses += 1  # one miss per lookup, however many retries follow
    for attempt in range(max_retries):
        try:
            STATS.api_calls += 1  # every HTTP attempt counts as an API call
            time.sleep(EPPO_RATE_LIMIT_DELAY)
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            data = resp.json()
            if use_cache and cache_dir and data is not None:
                _save_cached(cache_dir, eppocode, endpoint_suffix, data)
            return data
        except Exception:
            if attempt < max_retries - 1:
                time.sleep(0.5 * (2 ** attempt))  # exponential backoff: 0.5 s, then 1 s
                continue
            return None
    return None

def fetch_eppo_facts(eppocode: str, cache_dir: Optional[Path] = None, use_cache: bool = True) -> Dict[str, Any]:
    cache_dir = cache_dir or EPPO_CACHE_DIR
    overview = _get_eppo(eppocode, "overview", EPPO_API_KEY, EPPO_BASE_URL, cache_dir, use_cache)
    names = _get_eppo(eppocode, "names", EPPO_API_KEY, EPPO_BASE_URL, cache_dir, use_cache)
    hosts = _get_eppo(eppocode, "hosts", EPPO_API_KEY, EPPO_BASE_URL, cache_dir, use_cache)

    return {
        "overview": overview,
        "names": names if isinstance(names, list) else [],
        "hosts": hosts if isinstance(hosts, list) else [],
    }

# ---------------------------------------------------------------------------
# 4. Validation
# ---------------------------------------------------------------------------

def _tokenize_text(text: str) -> Set[str]:
    tokens = re.split(r"[^\w]+", (text or "").lower())
    return {t for t in tokens if len(t) >= 2}

def _texts_from_facts(facts: Dict[str, Any]) -> List[str]:
    texts: List[str] = []
    overview = facts.get("overview") or {}
    if isinstance(overview, dict):
        prefname = overview.get("prefname")
        if prefname:
            texts.append(prefname)
    for name_entry in facts.get("names") or []:
        if isinstance(name_entry, dict) and name_entry.get("fullname"):
            texts.append(name_entry["fullname"])
    for host_entry in facts.get("hosts") or []:
        if isinstance(host_entry, dict) and host_entry.get("prefname"):
            texts.append(host_entry["prefname"])
    return texts

def validate_eppo_against_label(facts: Dict[str, Any], norm: NormalizedLabel, min_token_overlap: int = 1) -> bool:
    if not facts or not norm.tokens:
        return False

    overview = facts.get("overview")
    if not overview or not isinstance(overview, dict):
        return False

    texts = _texts_from_facts(facts)
    if not texts:
        return False

    label_tokens = set(norm.tokens)
    combined = " ".join(texts).lower()
    combined_tokens = _tokenize_text(combined)
    overlap = len(label_tokens & combined_tokens)
    return overlap >= min_token_overlap

# ---------------------------------------------------------------------------
# 5. Generation
# ---------------------------------------------------------------------------

SYSTEM_PROMPT = """You are an expert plant pathologist and agricultural advisor. Your expertise includes disease diagnosis, treatment protocols, and integrated pest management.

Your communication style:
- Clear, concise, and action-oriented
- Use simple language accessible to farmers and gardeners
- Provide specific, practical advice (not vague generalities)
- Include dosages, timing, and application methods when relevant
- Acknowledge limitations or uncertainties honestly

Your response structure:
1. Confirmation: State clearly if prediction matches EPPO data (Yes/No + reasoning)
2. Disease Overview: Explain cause, symptoms, and impact in 2-3 sentences
3. Treatment: Provide 3-5 concrete actions with implementation details
4. Prevention: List 3-5 preventive measures in priority order

Avoid:
- Generic advice like "maintain good hygiene" without specifics
- Overly technical jargon without explanation
- Unverified information not supported by EPPO data
- Recommending products without active ingredients"""

def _format_facts_for_prompt(facts: Dict[str, Any]) -> str:
    parts = []
    overview = facts.get("overview") or {}
    if isinstance(overview, dict):
        prefname = overview.get("prefname")
        eppocode = overview.get("eppocode")
        if prefname:
            parts.append(f"Disease/Pest: {prefname}")
        if eppocode:
            code = eppocode.get("eppocode") if isinstance(eppocode, dict) else eppocode
            if code:
                parts.append(f"EPPO Code: {code}")

    # Add common names for better understanding
    common_names = []
    for name_entry in facts.get("names") or []:
        if isinstance(name_entry, dict) and name_entry.get("fullname"):
            common_names.append(name_entry['fullname'])
    if common_names:
        parts.append(f"Also known as: {', '.join(common_names[:5])}")

    # Add affected plants
    hosts = []
    for host_entry in facts.get("hosts") or []:
        if isinstance(host_entry, dict) and host_entry.get("prefname"):
            host_name = host_entry['prefname']
            classification = host_entry.get("class_label", "")
            if classification:
                hosts.append(f"{host_name} ({classification})")
            else:
                hosts.append(host_name)
    if hosts:
        parts.append(f"Commonly affects: {', '.join(hosts[:10])}")

    return "\n".join(parts) if parts else ""

def generate_from_facts(cv_label: str, facts: Dict[str, Any]) -> str:
    formatted = _format_facts_for_prompt(facts)
    if not formatted.strip():
        return "I cannot provide a diagnosis: no EPPO-backed facts are available for this label."

    if not GROQ_API_KEY:
        return "I cannot generate a response: Groq API key is not set."

    try:
        from groq import Groq
        client = Groq(api_key=GROQ_API_KEY)
        user_content = f'''Vision Model Prediction: "{cv_label}"

=== EPPO DATABASE INFORMATION ===
{formatted}

=== YOUR TASK ===
Analyze the prediction against EPPO data and provide a structured response:

**1. CONFIRMATION**
   - Does the prediction match the EPPO disease? (YES/NO)
   - Explain your reasoning (2-3 sentences)
   - If NO, specify what the prediction likely refers to

**2. DISEASE OVERVIEW**
   - Causative agent (pathogen type and scientific name)
   - Primary symptoms (visible signs on plant)
   - Economic/agricultural impact

**3. TREATMENT OPTIONS** (in order of effectiveness)
   - Option 1: [Method] - [Active ingredient/approach] - [Application timing]
   - Option 2: [Method] - [Active ingredient/approach] - [Application timing]
   - Option 3: [Method] - [Active ingredient/approach] - [Application timing]

**4. PREVENTION STRATEGIES** (proactive measures)
   - Priority 1: [Most critical preventive action]
   - Priority 2: [Second most important]
   - Priority 3: [Additional preventive measure]

Keep each section concise (3-5 bullet points max). Focus on what farmers can DO, not just what to know.'''

        STATS.groq_calls += 1
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_content},
            ],
            model=GROQ_MODEL,
            max_completion_tokens=GROQ_MAX_TOKENS,
            temperature=GROQ_TEMPERATURE,
        )
        content = response.choices[0].message.content if response.choices else None
        return (content or "").strip() or "I could not generate a response from the provided facts."
    except Exception as e:
        return f"I cannot generate a response: {str(e)}"

# ---------------------------------------------------------------------------
# 6. Pipeline
# ---------------------------------------------------------------------------

REFUSAL_NO_CANDIDATES = "I cannot verify this diagnosis: no matching EPPO record was found for this label."
REFUSAL_LOW_CONFIDENCE = "I cannot verify this diagnosis: the match to EPPO data is too uncertain."
REFUSAL_EPPO_FAILED = "I cannot verify this diagnosis: EPPO data could not be retrieved."
REFUSAL_VALIDATION_FAILED = "I cannot verify this diagnosis: the retrieved EPPO data does not support this label."

@dataclass
class DiagnosisResult:
    refused: bool
    message: str
    eppocode: Optional[str] = None
    confidence: Optional[float] = None
    debug_info: Optional[str] = None

def diagnose(cv_label: str,
             sqlite_path: Optional[Path] = None,
             cache_dir: Optional[Path] = None,
             confidence_threshold: float = CONFIDENCE_THRESHOLD,
             debug: bool = False) -> DiagnosisResult:
    sqlite_path = sqlite_path or SQLITE_PATH
    cache_dir = cache_dir or EPPO_CACHE_DIR

    norm = normalize_cv_label(cv_label)
    if not norm.tokens:
        return DiagnosisResult(refused=True, message=REFUSAL_NO_CANDIDATES)

    candidates = query_candidates(sqlite_path, norm, max_candidates=30)

    # Debug info
    debug_info = None
    if debug and candidates:
        top_5 = candidates[:5]
        debug_lines = [f"\n🔍 Top candidates for '{cv_label}':"]
        for i, c in enumerate(top_5, 1):
            debug_lines.append(f"  {i}. {c.fullname} ({c.eppocode}) - Score: {c.score:.3f}")
        debug_info = "\n".join(debug_lines)

    best = select_best(candidates, confidence_threshold)
    if best is None:
        msg = REFUSAL_LOW_CONFIDENCE
        if candidates:
            msg += f" (Top match: {candidates[0].fullname} with score {candidates[0].score:.3f}, threshold: {confidence_threshold})"
        return DiagnosisResult(
            refused=True,
            message=msg,
            confidence=candidates[0].score if candidates else None,
            debug_info=debug_info,
        )

    facts = fetch_eppo_facts(best.eppocode, cache_dir=cache_dir, use_cache=True)
    if not facts.get("overview"):
        return DiagnosisResult(
            refused=True,
            message=REFUSAL_EPPO_FAILED,
            eppocode=best.eppocode,
            debug_info=debug_info,
        )

    if not validate_eppo_against_label(facts, norm, min_token_overlap=1):
        return DiagnosisResult(
            refused=True,
            message=REFUSAL_VALIDATION_FAILED,
            eppocode=best.eppocode,
            debug_info=debug_info,
        )

    answer = generate_from_facts(cv_label, facts)
    return DiagnosisResult(
        refused=False,
        message=answer,
        eppocode=best.eppocode,
        confidence=best.score,
        debug_info=debug_info,
    )

print("✅ GreenRetrieval pipeline loaded successfully!")
print(f"📊 Configuration: Confidence threshold = {CONFIDENCE_THRESHOLD}, Model = {GROQ_MODEL}")

✅ GreenRetrieval pipeline loaded successfully!
📊 Configuration: Confidence threshold = 0.3, Model = openai/gpt-oss-120b


## Step 5: Run Plant Disease Diagnoses! 🎉

Now let's diagnose some plant diseases. The pipeline will:
1. Normalize the disease label
2. Search the SQLite database for matching EPPO codes
3. Retrieve verified data from EPPO API
4. Validate and generate a factual response

**Run the cell below to see it in action!**

In [5]:
# Run diagnoses for multiple plant diseases
labels = [
    "Rice leaf blast",
    # "Wheat leaf rust",
    # "Potato leaf late blight",
]

# Check database first
import sqlite3
import os
from pathlib import Path
from tqdm.auto import tqdm

# Use the path from environment variable (set in Step 2)
DB_PATH = Path(os.environ.get("EPPO_SQLITE_PATH", "/content/eppocodes_all.sqlite"))

if DB_PATH.exists():
    print(f"✅ Database found at: {DB_PATH}")
    print(f"   Size: {DB_PATH.stat().st_size / (1024*1024):.1f} MB")

    # Test query
    conn = sqlite3.connect(str(DB_PATH))
    cursor = conn.execute("SELECT COUNT(*) FROM t_codes WHERE status = 'A'")
    count = cursor.fetchone()[0]
    print(f"   Active EPPO codes: {count:,}")
    conn.close()
else:
    print(f"❌ Database NOT found at: {DB_PATH}")
    print("   Please run Step 2 to load the database first!")

print("\n" + "=" * 80)
print("🌿 Starting plant disease diagnosis...")
print("=" * 80)

# Reset statistics for this run
STATS.reset()

# Process with progress bar
for label in tqdm(labels, desc="🔬 Diagnosing", unit="disease", ncols=80):
    result = diagnose(label)
    STATS.add_diagnosis(result, label)
    status = "🚫 REFUSED" if result.refused else "✅ VERIFIED"

    print(f"\n{status}: {label}")
    print("-" * 80)
    print(result.message)

    if result.eppocode:
        print(f"\n📋 EPPO Code: {result.eppocode}")
    if result.confidence is not None:
        print(f"🎯 Confidence: {result.confidence:.2%}")

# Display summary statistics
print("\n✨ Diagnosis complete!")
print(STATS.summary())
print("=" * 80)


✅ Database found at: /content/eppocodes_all.sqlite
   Size: 47.3 MB
   Active EPPO codes: 121,370

🌿 Starting plant disease diagnosis...

✅ VERIFIED: Rice leaf blast
--------------------------------------------------------------------------------
**1. CONFIRMATION**  
- **NO** - The EPPO entry you were given describes *Alternaria padwickii* (leaf-spot/stack-burn of rice). "Rice leaf blast" is caused by *Magnaporthe oryzae* (formerly *Pyricularia oryzae*), a different fungus.  
- The prediction therefore refers to a different disease; the correct match for the EPPO data is **Alternaria leaf-spot (stack-burn)**, not leaf-blast.

---

**2. DISEASE OVERVIEW**  
- **Causative agent:** Fungus *Alternaria padwickii* (Ascomycota).  
- **Primary symptoms:** Small, water-soaked lesions that enlarge into brown-black circular spots (5-10 mm) with concentric rings; lesions may coalesce, giving a "stack-burn" appearance on leaves, sheaths, and panicles.  
- **Impact:** Reduces photosynthetic area, lowers grain filling, and can cause 10-30 % yield loss in heavily infected fields; severe epidemics ma

---

## 📚 How It Works

**GreenRetrieval** is a retrieval-augmented generation (RAG) pipeline that ensures accurate, verified plant disease information:

### Pipeline Steps:
1. **Normalize**: Tokenize the disease label, drop generic terms (e.g. "of", "plant", "crop"), and keep host, symptom, and location terms (e.g. "leaf", "stem")
2. **Retrieve**: Query the local SQLite database for candidate EPPO codes whose names contain any label token
3. **Rank**: Score candidates by token overlap, host match, location match, and datatype preference
4. **Refuse**: If no candidate reaches the confidence threshold, refuse (no guessing!)
5. **Fetch**: Retrieve verified data from the EPPO API (cached on disk, with a short delay between requests)
6. **Validate**: Check that the retrieved EPPO names and hosts share at least one token with the label
7. **Generate**: Use the Groq LLM to write the response from the retrieved EPPO facts
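
The normalize and rank steps can be illustrated with a condensed, self-contained sketch. It mirrors the bonuses used in the pipeline cell (0.2 for a host match, up to 0.3 for location terms, 0.15 for datatype `GAF`, 0.05 for `SFT`, capped at 1.5), but `score` is a simplified hypothetical function rather than the notebook's `_score_candidate`, and the candidate names are invented strings, not EPPO records.

```python
import re

GENERIC_TERMS = {"of", "the", "and", "on", "in", "plant", "plants", "crop", "crops"}
LOCATION_TERMS = {"leaf", "leaves", "stem", "stems", "fruit", "fruits", "root", "roots"}

def tokenize(text):
    """Lowercase, split on non-word characters, drop tokens shorter than 2."""
    return [t for t in re.split(r"[^\w]+", text.lower()) if len(t) >= 2]

def score(label, fullname, dtcode):
    """Score one candidate name against a label (simplified sketch)."""
    tokens = [t for t in tokenize(label) if t not in GENERIC_TERMS]
    name_tokens = set(tokenize(fullname))
    host = tokens[0] if tokens else None  # first meaningful token is treated as host
    locations = {t for t in tokens if t in LOCATION_TERMS}

    overlap_ratio = len(set(tokens) & name_tokens) / max(len(set(tokens)), 1)
    host_bonus = 0.2 if host in name_tokens else 0.0
    location_bonus = (
        0.3 * len(locations & name_tokens) / len(locations) if locations else 0.0
    )
    dtcode_bonus = 0.15 if dtcode == "GAF" else (0.05 if dtcode == "SFT" else 0.0)
    return min(overlap_ratio + host_bonus + location_bonus + dtcode_bonus, 1.5)

# Full match: 1.0 + 0.2 + 0.3 + 0.15 = 1.65, capped at 1.5
print(f"{score('Wheat leaf rust', 'leaf rust of wheat', 'GAF'):.3f}")  # 1.500
# One shared token out of three: 1/3 + 0.05
print(f"{score('Wheat leaf rust', 'rust fungus', 'SFT'):.3f}")         # 0.383
```

Note that the second candidate shares a single token with the label and still clears a threshold of 0.3, which is worth keeping in mind when tuning `CONFIDENCE_THRESHOLD`.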

### Key Features:
- ✅ **Grounded**: The LLM prompt is built only from retrieved EPPO database facts
- ✅ **Refusal-Aware**: Prefers refusing over guessing
- ✅ **Offline-First**: SQLite lookup before API calls
- ✅ **Cached**: EPPO API responses cached to `/content/.eppo_cache`
- ✅ **Transparent**: Shows EPPO codes and confidence scores

---

## 🔧 Configuration

All configuration is in the main code cell. Key parameters:

```python
CONFIDENCE_THRESHOLD = 0.3  # Minimum score to accept a match
GROQ_MODEL = "openai/gpt-oss-120b"  # LLM model
GROQ_TEMPERATURE = 0.3  # Lower = more factual
```
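
To see what the threshold controls, the sketch below applies the same accept-or-refuse rule to two invented candidates. `Candidate` here is a trimmed-down hypothetical dataclass and the scores are made up for illustration; only the rule itself (accept the top score if it is at least the threshold) comes from the pipeline cell.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    fullname: str
    score: float

def select_best(candidates: List[Candidate], threshold: float) -> Optional[Candidate]:
    """Return the top-scoring candidate, or None if it falls below the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    return best if best.score >= threshold else None

candidates = [Candidate("weak match", 0.38), Candidate("weaker match", 0.20)]

print(select_best(candidates, threshold=0.3))   # Candidate(fullname='weak match', score=0.38)
print(select_best(candidates, threshold=0.45))  # None
```

A lower threshold improves recall but lets weaker matches through to the API and LLM stages; a higher one produces more refusals.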

---

## 📖 Learn More

- **EPPO Global Database**: https://gd.eppo.int
- **Groq Console**: https://console.groq.com
- **Project Repository**: [Your GitHub URL]

---

**Made with 🌱 by the GreenRetrieval team**