
In [1]:
!pip -q install firebase-admin

import firebase_admin
from firebase_admin import credentials, firestore

cred = credentials.Certificate("hw02-cloud-inverted-index-firebase-adminsdk-fbsvc-437db7abaa.json")
if not firebase_admin._apps:
    firebase_admin.initialize_app(cred)

db = firestore.client()
print("Firestore connected:", db.project)


Firestore connected: hw02-cloud-inverted-index


In [2]:
# CELL 1: Minimal package installation (only if missing)
import importlib.util, sys, subprocess

def ensure(pkg, import_name=None):
    name = import_name or pkg
    if importlib.util.find_spec(name) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

# Usually already installed in Colab, but keep safe:
ensure("pandas", "pandas")

# Required for your homework plan:
ensure("nltk", "nltk")
ensure("sentence-transformers", "sentence_transformers")
ensure("faiss-cpu", "faiss")
ensure("pymupdf", "fitz")


print(" Dependencies ready")


 Dependencies ready


In [3]:
# CELL 2: Imports + NLTK resources (run once per runtime)

import re
from collections import defaultdict
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sentence_transformers import SentenceTransformer
import faiss

# NLTK downloads (required for stopwords/tokenizer/lemmatizer)
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("punkt_tab")

print(" Imports ready + NLTK resources downloaded")




 Imports ready + NLTK resources downloaded


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [5]:
# CELL 3: Store Classes (Vector Store + Inverted Index)
# =====================================================
"""
 CELL 3: STORE CLASSES
- SimpleVectorStore: stores embeddings + documents + metadatas + ids (like Tirgul 7)
- InvertedIndexStore: stores required index schema term -> DocIDs (homework requirement)
"""

import numpy as np
from collections import defaultdict

# ---------- Vector Store (similar to Tirgul 7) ----------
class SimpleVectorStore:
    """Simple in-memory vector store (fallback)"""

    def __init__(self):
        self.documents = []
        self.embeddings = []   # list of numpy arrays
        self.metadatas = []
        self.ids = []
        print(" SimpleVectorStore initialized")

    def add(self, embeddings, documents, metadatas, ids):
        # Ensure numpy arrays
        embeddings = [np.asarray(e, dtype=np.float32) for e in embeddings]
        self.embeddings.extend(embeddings)
        self.documents.extend(documents)
        self.metadatas.extend(metadatas)
        self.ids.extend(ids)
        print(f" Added {len(documents)} documents to simple vector store")

    def query(self, query_embeddings, n_results=5):
        if not self.embeddings:
            return {'ids': [[]], 'documents': [[]], 'metadatas': [[]], 'distances': [[]]}

        q = np.asarray(query_embeddings[0], dtype=np.float32)

        E = np.vstack(self.embeddings)  # shape: (N, d)

        # cosine similarity without sklearn
        q_norm = np.linalg.norm(q) + 1e-12
        E_norm = np.linalg.norm(E, axis=1) + 1e-12
        sims = (E @ q) / (E_norm * q_norm)

        top_idx = np.argsort(sims)[::-1][:n_results]

        return {
            'ids': [[self.ids[i] for i in top_idx]],
            'documents': [[self.documents[i] for i in top_idx]],
            'metadatas': [[self.metadatas[i] for i in top_idx]],
            'distances': [[float(1 - sims[i]) for i in top_idx]]  # distance-like
        }

    def count(self):
        return len(self.documents)


# ---------- Inverted Index (required by homework) ----------
class InvertedIndexStore:
    """Required structure: term -> DocIDs"""

    def __init__(self):
        self.term_to_docids = defaultdict(set)
        print(" InvertedIndexStore initialized")

    def add_occurrence(self, term: str, doc_id: str):
        self.term_to_docids[term].add(doc_id)

    def get_docids(self, term: str):
        return sorted(self.term_to_docids.get(term, set()))

    def count_terms(self) -> int:
        return len(self.term_to_docids)

    def to_required_format(self):
        # [{"term": ..., "DocIDs": [...]}, ...]
        return [{"term": t, "DocIDs": sorted(list(docids))}
                for t, docids in sorted(self.term_to_docids.items())]


print(" Store classes defined!")
print(" Next: Cell 4 (core logic: preprocess + build index + embeddings)")


 Store classes defined!
 Next: Cell 4 (core logic: preprocess + build index + embeddings)
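
As a rough illustration of what `SimpleVectorStore.query` computes, the sketch below ranks three toy 2-D vectors against a query by cosine similarity. The helper name `cosine_top_k` and the toy vectors are made up for this example and are not part of the notebook.

```python
import numpy as np

def cosine_top_k(query, matrix, k=2):
    """Return (row indices, similarities) of the k rows most similar to query."""
    q = np.asarray(query, dtype=np.float32)
    E = np.asarray(matrix, dtype=np.float32)
    # The small epsilon guards against division by zero for all-zero vectors.
    q_norm = np.linalg.norm(q) + 1e-12
    E_norm = np.linalg.norm(E, axis=1) + 1e-12
    sims = (E @ q) / (E_norm * q_norm)
    # argsort is ascending, so reverse it and keep the first k entries.
    order = np.argsort(sims)[::-1][:k]
    return order.tolist(), sims[order].tolist()

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical embeddings
top_idx, top_sims = cosine_top_k([1.0, 0.1], toy_vectors, k=2)
print(top_idx, top_sims)
```

The store reports `1 - similarity` as a distance, so smaller values there mean closer matches.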


In [6]:
# CELL 4: Core setup (custom stopwords + stemming + embedding model + FAISS)

# --- Custom stopwords (you define them) ---
# We remove these words because they are very frequent function words (articles, prepositions, pronouns).
# They usually do not add topic meaning, but they increase index size and add noise to retrieval.
CUSTOM_STOPWORDS = {
    "the","a","an","and","or","but",
    "to","of","in","on","at","for","from","by","with","as",
    "is","are","was","were","be","been","being",
    "this","that","these","those",
    "it","its","they","them","their","we","our","you","your",
    "i","me","my","he","him","his","she","her",
    "not","no","do","does","did","doing"
}

stemmer = PorterStemmer()

def preprocess_text(text: str):
    """
    Returns list of terms for indexing:
    - lowercase
    - tokenize
    - keep alphabetic tokens only
    - remove custom stopwords
    - apply stemming
    """
    text = text.lower()
    tokens = word_tokenize(text)
    terms = []
    for tok in tokens:
        if tok.isalpha() and tok not in CUSTOM_STOPWORDS:
            terms.append(stemmer.stem(tok))
    return terms

# --- Embedding model (for semantic retrieval) ---
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# --- FAISS index (stores embeddings for doc-level retrieval) ---
faiss_index = None
vector_dim = None

# Parallel stores (FAISS row -> doc data)
vector_doc_ids = []   # doc_id
vector_texts = []     # full doc text

print(" Core setup ready (custom stopwords + stemming + embeddings + FAISS)")



 Core setup ready (custom stopwords + stemming + embeddings + FAISS)


In [7]:
# quick test
test = "This is a simple TEST of watering plants and plant diseases."
print("🧪 preprocess test:", preprocess_text(test)[:20])

🧪 preprocess test: ['simpl', 'test', 'water', 'plant', 'plant', 'diseas']
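
One side effect of the `tok.isalpha()` filter is that any token containing a hyphen (for example `virus-like`) fails the check and is dropped rather than split. The sketch below shows that behaviour in plain Python, together with one possible workaround. The whitespace tokenizer and the helper name `alpha_tokens` are stand-ins for illustration (the notebook itself uses NLTK's `word_tokenize`), and stemming is left out.

```python
import re

def alpha_tokens(text, split_hyphens=False):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    text = text.lower()
    if split_hyphens:
        # Turn "virus-like" into "virus like" before filtering.
        text = text.replace("-", " ")
    # Strip leading/trailing punctuation from each whitespace-separated token.
    tokens = [re.sub(r"^\W+|\W+$", "", t) for t in text.split()]
    return [t for t in tokens if t.isalpha()]

sample = "viroids, virus-like organisms"
dropped = alpha_tokens(sample)                    # hyphenated token is lost
kept = alpha_tokens(sample, split_hyphens=True)   # both halves survive
print(dropped)
print(kept)
```

Whether splitting is desirable depends on the corpus; for compound terms it roughly doubles their postings.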


In [8]:
wiki_links = [
    "https://en.wikipedia.org/wiki/Plant_disease",
    "https://en.wikipedia.org/wiki/Plant_pathology",
    "https://en.wikipedia.org/wiki/Fungus",
    "https://en.wikipedia.org/wiki/Bacterial_wilt",
    "https://en.wikipedia.org/wiki/Powdery_mildew"
]

print("Wikipedia links used:")
for i, link in enumerate(wiki_links, 1):
    print(f"{i}. {link}")


Wikipedia links used:
1. https://en.wikipedia.org/wiki/Plant_disease
2. https://en.wikipedia.org/wiki/Plant_pathology
3. https://en.wikipedia.org/wiki/Fungus
4. https://en.wikipedia.org/wiki/Bacterial_wilt
5. https://en.wikipedia.org/wiki/Powdery_mildew


In [9]:
import requests
import re

WIKI_API = "https://en.wikipedia.org/w/api.php"

# Wikipedia may reject requests that do not send a descriptive User-Agent
HEADERS = {
    "User-Agent": "HW02-Cloud-RAG/1.0 (student project; contact: student@example.com)"
}

def title_from_wiki_url(url: str) -> str:
    if "/wiki/" not in url:
        raise ValueError(f"Unsupported Wikipedia URL: {url}")
    title = url.split("/wiki/", 1)[1]
    title = title.split("#", 1)[0]      # remove anchors
    title = title.replace("_", " ")
    return title

def fetch_page_extract_by_title(title: str):
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts|info",
        "titles": title,
        "inprop": "url",
        "explaintext": True,
        "redirects": 1,   # follow redirects
        "origin": "*"     # helps in some environments
    }
    r = requests.get(WIKI_API, params=params, headers=HEADERS, timeout=30)
    r.raise_for_status()

    pages = r.json()["query"]["pages"]
    page = next(iter(pages.values()))

    # Handle missing page
    if "missing" in page:
        return {"pageid": None, "title": title, "url": "", "text": ""}

    return {
        "pageid": page.get("pageid"),
        "title": page.get("title", title),
        "url": page.get("fullurl", ""),
        "text": page.get("extract", "")
    }

def slugify(s: str) -> str:
    s = s.strip().lower()
    s = re.sub(r"[^a-z0-9]+", "-", s)
    return s.strip("-")

def load_docs_from_wiki_links(wiki_links):
    docs = {}
    docs_meta = {}

    for url in wiki_links:
        title = title_from_wiki_url(url)
        data = fetch_page_extract_by_title(title)

        text = (data.get("text") or "").strip()
        if not text:
            print(f"Empty/blocked page: {title} | {url}")
            continue

        doc_id = f"wiki_{slugify(data['title'])}"
        docs[doc_id] = text
        docs_meta[doc_id] = {
            "title": data["title"],
            "url": data.get("url") or url,
            "source": "wikipedia",
            "pageid": data.get("pageid"),
        }

        print(f"Loaded: {data['title']} -> {doc_id} | chars={len(text)}")

    return docs, docs_meta

docs, docs_meta = load_docs_from_wiki_links(wiki_links)
print("Docs loaded:", len(docs))


Loaded: Plant disease -> wiki_plant-disease | chars=9654
Loaded: Plant pathology -> wiki_plant-pathology | chars=5228
Loaded: Fungus -> wiki_fungus | chars=65562
Loaded: Bacterial wilt -> wiki_bacterial-wilt | chars=3688
Loaded: Erysiphaceae -> wiki_erysiphaceae | chars=14230
Docs loaded: 5
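
`title_from_wiki_url` works for the five links above, but a title with percent-encoded characters (for example `%27` for an apostrophe) would be sent to the API still encoded. A possible variant, assuming standard URL encoding, decodes the title with `urllib.parse.unquote`. The function name `title_from_wiki_url_decoded` and the second example URL are illustrative only.

```python
from urllib.parse import unquote, urlparse

def title_from_wiki_url_decoded(url: str) -> str:
    """Extract a readable page title from a /wiki/ URL."""
    path = urlparse(url).path  # excludes any ?query and #fragment
    if not path.startswith("/wiki/"):
        raise ValueError(f"Unsupported Wikipedia URL: {url}")
    title = unquote(path[len("/wiki/"):])  # "%27" -> "'"
    return title.replace("_", " ")

plain = title_from_wiki_url_decoded("https://en.wikipedia.org/wiki/Plant_disease#History")
encoded = title_from_wiki_url_decoded("https://en.wikipedia.org/wiki/Witch%27s_broom")
print(plain)
print(encoded)
```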


In [10]:
# CELL 7: Build the required index (term -> DocIDs) + build FAISS embeddings store (doc-level)

# 1) Build inverted index (term -> DocIDs)
inv_index = InvertedIndexStore()

for doc_id, text in docs.items():
    terms = preprocess_text(text)   # uses custom stopwords + stemming
    for t in set(terms):            # presence only (not frequency)
        inv_index.add_occurrence(t, doc_id)

print(f" Inverted index built. Unique terms: {inv_index.count_terms()}")

# 2) Build embeddings + FAISS (one vector per doc)
doc_ids = list(docs.keys())
texts = [docs[d] for d in doc_ids]

emb = embed_model.encode(texts, convert_to_numpy=True, normalize_embeddings=True).astype("float32")

vector_dim = emb.shape[1]
faiss_index = faiss.IndexFlatIP(vector_dim)  # cosine similarity via normalized embeddings
faiss_index.add(emb)

# parallel arrays for retrieval results
vector_doc_ids = doc_ids
vector_texts = texts

print(f" FAISS built. Vectors: {faiss_index.ntotal} | dim={vector_dim}")


 InvertedIndexStore initialized
 Inverted index built. Unique terms: 2580
 FAISS built. Vectors: 5 | dim=384
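
The index above stores one vector per article. According to its model card, `all-MiniLM-L6-v2` truncates input beyond 256 word pieces by default, so the embedding of a long article (the Fungus page has 65,562 characters) mostly reflects its opening. A common alternative is to embed overlapping chunks and map each chunk back to its `doc_id`. The sketch below covers only the chunking step. The name `chunk_words` and the sizes 200 and 40 are arbitrary choices for illustration, not values from the notebook.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Dummy document of 450 distinct words (hypothetical data).
demo_docs = {"doc_a": " ".join(f"w{i}" for i in range(450))}

# One row per chunk: (doc_id, chunk number, chunk text).
chunk_rows = [
    (doc_id, i, chunk)
    for doc_id, text in demo_docs.items()
    for i, chunk in enumerate(chunk_words(text))
]
print([(d, i, len(c.split())) for d, i, c in chunk_rows])
```

With chunk-level vectors, the parallel list that maps FAISS rows to documents would hold one `doc_id` per chunk instead of one per article.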


In [11]:
from google.cloud.firestore_v1 import ArrayUnion

def upload_inverted_index(inv_index, collection_name="inverted_index", batch_size=400):
    """
    Uploads term -> DocIDs into Firestore.
    Creates documents: inverted_index/{term}
    Fields: term, doc_ids, df
    """
    col = db.collection(collection_name)
    records = inv_index.to_required_format()  # [{"term": ..., "DocIDs": [...]}, ...]

    batch = db.batch()
    ops = 0

    for r in records:
        term = r["term"]
        doc_ids = r["DocIDs"]

        # Use the term as the document ID. Firestore limits IDs to 1,500 bytes and
        # forbids "/", which alphabetic stems never contain; the slice is a rough guard.
        doc_id = term[:1500]

        ref = col.document(doc_id)
        batch.set(ref, {
            "term": term,
            "doc_ids": doc_ids,
            "df": len(doc_ids),
        })
        ops += 1

        if ops >= batch_size:
            batch.commit()
            batch = db.batch()
            ops = 0

    if ops > 0:
        batch.commit()

    print(f"Uploaded {len(records)} terms to Firestore collection '{collection_name}'")

upload_inverted_index(inv_index)


Uploaded 2580 terms to Firestore collection 'inverted_index'
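
The commit-every-`batch_size` pattern used in `upload_inverted_index` can also be written as a small chunking helper. This is a sketch in plain Python with no Firestore calls. `in_batches` is a hypothetical helper name, and the record list is dummy data sized to match the 2,580 terms above.

```python
def in_batches(items, batch_size=400):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Dummy records standing in for inv_index.to_required_format().
demo_records = [{"term": f"t{i}", "DocIDs": []} for i in range(2580)]

batch_sizes = [len(batch) for batch in in_batches(demo_records)]
print(len(batch_sizes), batch_sizes[-1])  # 7 180
```

In the upload function, each yielded slice would correspond to one `db.batch()` followed by one `commit()`.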


In [12]:
# Read 1 term back
sample = next(iter(inv_index.term_to_docids.keys()))
doc = db.collection("inverted_index").document(sample[:1500]).get()
print("Exists:", doc.exists)
print(doc.to_dict())


Exists: True
{'df': 2, 'doc_ids': ['wiki_plant-disease', 'wiki_plant-pathology'], 'term': 'breakag'}


In [13]:
t = "diseas"  # try a stem that should appear
doc = db.collection("inverted_index").document(t[:1500]).get()
print(doc.exists)
print(doc.to_dict() if doc.exists else "not found")


True
{'df': 5, 'doc_ids': ['wiki_bacterial-wilt', 'wiki_erysiphaceae', 'wiki_fungus', 'wiki_plant-disease', 'wiki_plant-pathology'], 'term': 'diseas'}


In [14]:
def upload_wiki_meta(docs_meta, collection_name="documents", batch_size=400):
    col = db.collection(collection_name)

    batch = db.batch()
    ops = 0

    for doc_id, meta in docs_meta.items():
        ref = col.document(doc_id)
        batch.set(ref, {
            "doc_id": doc_id,
            "title": meta.get("title", ""),
            "url": meta.get("url", ""),
            "source": meta.get("source", "wikipedia"),
            "pageid": meta.get("pageid", None),
        }, merge=True)

        ops += 1
        if ops >= batch_size:
            batch.commit()
            batch = db.batch()
            ops = 0

    if ops > 0:
        batch.commit()

    print(f"Uploaded {len(docs_meta)} wiki docs to '{collection_name}'")

upload_wiki_meta(docs_meta)


Uploaded 5 wiki docs to 'documents'


In [15]:
doc = db.collection("documents").document("s41598-025-98454-6").get()
print(doc.exists)
print(doc.to_dict())

False
None


In [16]:
# CELL 8: Export + quick preview of the required index format (term + DocIDs)

records = inv_index.to_required_format()

print(f" Index records created: {len(records)} terms")
print("Preview (first 10):")
for row in records[:10]:
    print(row)


 Index records created: 2580 terms
Preview (first 10):
{'term': 'abil', 'DocIDs': ['wiki_erysiphaceae', 'wiki_fungus', 'wiki_plant-pathology']}
{'term': 'abiot', 'DocIDs': ['wiki_plant-disease', 'wiki_plant-pathology']}
{'term': 'abl', 'DocIDs': ['wiki_erysiphaceae', 'wiki_fungus', 'wiki_plant-disease']}
{'term': 'about', 'DocIDs': ['wiki_bacterial-wilt', 'wiki_fungus', 'wiki_plant-disease']}
{'term': 'abov', 'DocIDs': ['wiki_erysiphaceae', 'wiki_fungus']}
{'term': 'absenc', 'DocIDs': ['wiki_erysiphaceae']}
{'term': 'absent', 'DocIDs': ['wiki_erysiphaceae']}
{'term': 'absorb', 'DocIDs': ['wiki_erysiphaceae', 'wiki_fungus']}
{'term': 'absorpt', 'DocIDs': ['wiki_fungus']}
{'term': 'abund', 'DocIDs': ['wiki_fungus']}
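
Because each term maps to a set of DocIDs, multi-term Boolean queries reduce to set operations. The sketch below runs an AND query over three of the records previewed above. `boolean_and` is a hypothetical helper, and the query terms are assumed to be stemmed already (in the notebook that is the job of `preprocess_text`).

```python
# Three records copied from the preview, in the required schema.
demo_records = [
    {"term": "abil", "DocIDs": ["wiki_erysiphaceae", "wiki_fungus", "wiki_plant-pathology"]},
    {"term": "abl", "DocIDs": ["wiki_erysiphaceae", "wiki_fungus", "wiki_plant-disease"]},
    {"term": "absorb", "DocIDs": ["wiki_erysiphaceae", "wiki_fungus"]},
]
postings = {r["term"]: set(r["DocIDs"]) for r in demo_records}

def boolean_and(terms, postings):
    """Return the sorted DocIDs that contain every term in terms."""
    if not terms:
        return []
    result = None
    for term in terms:
        doc_ids = postings.get(term, set())  # unknown term -> empty posting list
        result = set(doc_ids) if result is None else result & doc_ids
    return sorted(result)

both = boolean_and(["abil", "abl"], postings)
none = boolean_and(["abil", "zzz"], postings)
print(both)
print(none)
```

An OR query would use set union (`|`) in place of the intersection.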


In [17]:
# CELL 8: Sanity checks + index preview (NO PlantDiseaseIndexRAG)

import pandas as pd

print(" Sanity checks:")

# 1) Documents
print("Docs loaded:", len(docs))
assert len(docs) > 0, "No documents loaded!"

# 2) Inverted index
num_terms = inv_index.count_terms()
print("Unique terms in index:", num_terms)
assert num_terms > 0, "Index is empty!"

# 3) FAISS
print("FAISS vectors:", faiss_index.ntotal)
assert faiss_index.ntotal == len(docs), "FAISS vectors != docs count"

# 4) Export index in REQUIRED schema
records = inv_index.to_required_format()
df_index = pd.DataFrame(records)

print("\n Index preview (first 5 rows):")
display(df_index.head(5))

print("\n CELL 8 completed successfully")


 Sanity checks:
Docs loaded: 5
Unique terms in index: 2580
FAISS vectors: 5

 Index preview (first 5 rows):


    term                                             DocIDs
0   abil  [wiki_erysiphaceae, wiki_fungus, wiki_plant-pa...
1  abiot         [wiki_plant-disease, wiki_plant-pathology]
2    abl  [wiki_erysiphaceae, wiki_fungus, wiki_plant-di...
3  about  [wiki_bacterial-wilt, wiki_fungus, wiki_plant-...
4   abov                   [wiki_erysiphaceae, wiki_fungus]



 CELL 8 completed successfully


In [18]:
# CELL 9: Embedding-based retrieval (FAISS) for a user query (no OpenAI yet)

def retrieve_top_docs(query: str, top_k: int = 5):
    if faiss_index is None or faiss_index.ntotal == 0:
        return "FAISS index is empty. Build vectors first."

    q_emb = embed_model.encode([query], convert_to_numpy=True, normalize_embeddings=True).astype("float32")
    distances, indices = faiss_index.search(q_emb, top_k)

    lines = []
    lines.append(f"Query: {query}")
    lines.append("=" * 60)

    for rank, idx in enumerate(indices[0], start=1):
        if idx == -1:
            continue
        doc_id = vector_doc_ids[idx]
        title = docs_meta.get(doc_id, {}).get("title", "")
        text = vector_texts[idx]
        snippet = re.sub(r"\s+", " ", text)[:350]
        score = float(distances[0][rank - 1])

        lines.append(f"{rank}) {doc_id} | {title} | similarity: {score:.4f}")
        lines.append(f"Snippet: {snippet}...")
        lines.append("-" * 60)

    return "\n".join(lines)

print(" Retrieval function ready")


 Retrieval function ready


In [19]:
# CELL 10: RAG-style output (retrieval + "enriched" answer without OpenAI)
# We will: retrieve top docs, then produce a simple enriched response by extracting key sentences.

def split_sentences(text: str):
    # simple sentence split (good enough for baseline)
    parts = re.split(r'(?<=[.!?])\s+', re.sub(r"\s+", " ", text).strip())
    return [s for s in parts if len(s) > 30]

def rag_answer_without_llm(query: str, top_k: int = 3, max_sentences_per_doc: int = 2):
    if faiss_index is None or faiss_index.ntotal == 0:
        return "FAISS index is empty. Build vectors first."

    q_emb = embed_model.encode([query], convert_to_numpy=True, normalize_embeddings=True).astype("float32")
    distances, indices = faiss_index.search(q_emb, top_k)

    lines = []
    lines.append(f"Query: {query}")
    lines.append("=" * 60)

    # Retrieval section
    lines.append("Top retrieved documents:")
    retrieved = []
    for rank, idx in enumerate(indices[0], start=1):
        if idx == -1:
            continue
        doc_id = vector_doc_ids[idx]
        title = docs_meta.get(doc_id, {}).get("title", "")
        score = float(distances[0][rank - 1])
        retrieved.append((doc_id, title, score))
        lines.append(f"{rank}) {doc_id} | {title} | similarity: {score:.4f}")
    lines.append("=" * 60)

    # Enriched response (extractive, no LLM)
    lines.append("Enriched response (extractive, no LLM):")
    q_terms = set(preprocess_text(query))

    for doc_id, title, score in retrieved:
        text = docs[doc_id]
        sents = split_sentences(text)

        # score sentences by overlap with query terms (stems)
        scored = []
        for s in sents:
            s_terms = set(preprocess_text(s))
            overlap = len(q_terms & s_terms)
            if overlap > 0:
                scored.append((overlap, s))

        scored.sort(key=lambda x: x[0], reverse=True)
        best = [s for _, s in scored[:max_sentences_per_doc]]

        lines.append(f"- Source: {doc_id} | {title}")
        if best:
            for b in best:
                lines.append(f"  • {b}")
        else:
            lines.append("  • (No strong matching sentences found)")
        lines.append("-" * 60)

    return "\n".join(lines)

print(" RAG-style (no OpenAI) function ready")


 RAG-style (no OpenAI) function ready
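
To make the sentence-scoring step concrete, the sketch below applies the same split regex and overlap count to a short made-up text. It compares lowercase words directly instead of stems so that it runs without NLTK, and it does not drop short sentences the way `split_sentences` does. `split_simple`, `top_sentences`, and the sample text are illustrative only.

```python
import re

def split_simple(text):
    """Split after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", re.sub(r"\s+", " ", text).strip())

def top_sentences(query, text, n=1):
    """Rank sentences by how many query words they share (no stemming)."""
    q_terms = set(re.findall(r"[a-z]+", query.lower()))
    scored = []
    for sentence in split_simple(text):
        s_terms = set(re.findall(r"[a-z]+", sentence.lower()))
        overlap = len(q_terms & s_terms)
        if overlap > 0:
            scored.append((overlap, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:n]]

sample_text = (
    "Fungi cause many plant diseases. "
    "Bacteria are single-celled. "
    "Rust is a fungal disease of wheat."
)
best = top_sentences("plant diseases", sample_text, n=2)
print(best)
```

Only the first sentence matches here: without stemming, `disease` in the third sentence does not equal `diseases` in the query, which is the gap that `preprocess_text` closes in the notebook's version.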


In [20]:
# CELL 11: Quick demo (edit the query text)

print(retrieve_top_docs("how to detect plant diseases using sensors and ai", top_k=3))
print()
print(rag_answer_without_llm("how to detect plant diseases using sensors and ai", top_k=3))


Query: how to detect plant diseases using sensors and ai
1) wiki_plant-disease | Plant disease | similarity: 0.4522
Snippet: Plant diseases are diseases in plants caused by pathogens (infectious organisms) and environmental conditions (physiological factors). Organisms that cause infectious disease include fungi, oomycetes, bacteria, viruses, viroids, virus-like organisms, phytoplasmas, protozoa, nematodes and parasitic plants. Not included are ectoparasites like insects...
------------------------------------------------------------
2) wiki_plant-pathology | Plant pathology | similarity: 0.4479
Snippet: Plant pathology or phytopathology is the scientific study of plant diseases caused by pathogens (infectious organisms) and environmental conditions (physiological factors). Plant pathology involves the study of pathogen identification, disease etiology, disease cycles, economic impact, plant disease epidemiology, plant disease resistance, how plant...
----------------------------------

In [21]:
# CELL 12: Evaluation / sanity checks (index + FAISS + stopwords + stemming)

def evaluate_system():
    lines = []
    lines.append("=== EVALUATION (Sanity Checks) ===")

    # Docs
    num_docs = len(docs) if isinstance(docs, dict) else 0
    lines.append(f"Docs loaded: {num_docs}")
    if num_docs == 0:
        lines.append(" No documents loaded. Check wiki_links and the output of load_docs_from_wiki_links().")
        return "\n".join(lines)

    # Index
    num_terms = inv_index.count_terms() if 'inv_index' in globals() else 0
    lines.append(f"Unique terms in inverted index: {num_terms}")
    if num_terms == 0:
        lines.append(" Index is empty. Check preprocess_text() and the fetched Wikipedia text.")
        return "\n".join(lines)

    # FAISS
    faiss_total = faiss_index.ntotal if faiss_index is not None else 0
    lines.append(f"FAISS vectors: {faiss_total}")
    if faiss_total != num_docs:
        lines.append(f" FAISS vectors ({faiss_total}) != docs ({num_docs}). Check embedding build step.")

    # Stopwords + stemming check on a tiny sample
    sample_doc_id = next(iter(docs.keys()))
    sample_text = docs[sample_doc_id][:800]
    terms = preprocess_text(sample_text)

    lines.append(f"Sample doc_id: {sample_doc_id}")
    lines.append(f"Sample extracted chars (first 80): {repr(docs[sample_doc_id][:80])}")
    lines.append(f"Preprocess produced {len(terms)} terms from first 800 chars.")
    lines.append(f"First 25 terms (stems): {terms[:25]}")

    # Check a few stopwords are removed
    test_sentence = "This is a simple test of the system and the index."
    test_terms = preprocess_text(test_sentence)
    lines.append(f"Stopword test input: {test_sentence}")
    lines.append(f"Stopword test output terms: {test_terms}")
    if any(w in test_terms for w in ["the", "is", "and", "this"]):
        lines.append(" Some stopwords may still be appearing. Check CUSTOM_STOPWORDS and token filtering.")
    else:
        lines.append(" Stopwords appear to be removed (basic check).")

    # Quick retrieval check
    q = "plant disease detection"
    preview = retrieve_top_docs(q, top_k=2)
    lines.append("Retrieval check (top 2):")
    lines.append(preview)

    return "\n".join(lines)

print(evaluate_system())


=== EVALUATION (Sanity Checks) ===
Docs loaded: 5
Unique terms in inverted index: 2580
FAISS vectors: 5
Sample doc_id: wiki_plant-disease
Sample extracted chars (first 80): 'Plant diseases are diseases in plants caused by pathogens (infectious organisms)'
Preprocess produced 79 terms from first 800 chars.
First 25 terms (stems): ['plant', 'diseas', 'diseas', 'plant', 'caus', 'pathogen', 'infecti', 'organ', 'environment', 'condit', 'physiolog', 'factor', 'organ', 'caus', 'infecti', 'diseas', 'includ', 'fungi', 'oomycet', 'bacteria', 'virus', 'viroid', 'organ', 'phytoplasma', 'protozoa']
Stopword test input: This is a simple test of the system and the index.
Stopword test output terms: ['simpl', 'test', 'system', 'index']
 Stopwords appear to be removed (basic check).
Retrieval check (top 2):
Query: plant disease detection
1) wiki_plant-disease | Plant disease | similarity: 0.5957
Snippet: Plant diseases are diseases in plants caused by pathogens (infectious organisms) and environmental 

In [22]:
print(rag_answer_without_llm("plant diesases", top_k=3))


Query: plant diesases
Top retrieved documents:
1) wiki_plant-pathology | Plant pathology | similarity: 0.4377
2) wiki_plant-disease | Plant disease | similarity: 0.4175
3) wiki_bacterial-wilt | Bacterial wilt | similarity: 0.3561
Enriched response (extractive, no LLM):
- Source: wiki_plant-pathology | Plant pathology
  • Plant pathology or phytopathology is the scientific study of plant diseases caused by pathogens (infectious organisms) and environmental conditions (physiological factors).
  • Plant pathology involves the study of pathogen identification, disease etiology, disease cycles, economic impact, plant disease epidemiology, plant disease resistance, how plant diseases affect humans and animals, pathosystem genetics, and management of plant diseases.
------------------------------------------------------------
- Source: wiki_plant-disease | Plant disease
  • Plant diseases are diseases in plants caused by pathogens (infectious organisms) and environmental conditions (phy

In [23]:
print(retrieve_top_docs("plant diesases", top_k=3))


Query: plant diesases
1) wiki_plant-pathology | Plant pathology | similarity: 0.4377
Snippet: Plant pathology or phytopathology is the scientific study of plant diseases caused by pathogens (infectious organisms) and environmental conditions (physiological factors). Plant pathology involves the study of pathogen identification, disease etiology, disease cycles, economic impact, plant disease epidemiology, plant disease resistance, how plant...
------------------------------------------------------------
2) wiki_plant-disease | Plant disease | similarity: 0.4175
Snippet: Plant diseases are diseases in plants caused by pathogens (infectious organisms) and environmental conditions (physiological factors). Organisms that cause infectious disease include fungi, oomycetes, bacteria, viruses, viroids, virus-like organisms, phytoplasmas, protozoa, nematodes and parasitic plants. Not included are ectoparasites like insects...
------------------------------------------------------------
3) wiki_