# RAG Setup
Now that the structured data is in `JSONL` format, the next step is to embed it and index the resulting vectors so the data is RAG-ready.
   
Before any LLM "answers" in a RAG pipeline, the data must pass through these stages:
> Data → Embedding → Vectorization → Indexing → Retrieval → LLM QA

## Load Data

In [1]:
import json

# Path to your JSONL file
jsonl_path = "/workspaces/Avva-2.0/experiments/symptom_remedy_data.jsonl"

# Load records
records = []
with open(jsonl_path, "r") as f:
    for line in f:
        try:
            item = json.loads(line.strip())
            # Optional: filter out invalid entries
            if item.get("symptom") and item.get("remedy"):
                records.append(item)
        except json.JSONDecodeError as e:
            print(f"Skipping invalid line: {e}")


In [2]:
print(f"Loaded {len(records)} valid records.")
print(records[0])

Loaded 88 valid records.
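
To see how many lines were dropped and why, the loader above can be wrapped in a function that counts skips by reason. This is a minimal sketch: `load_records`, its `required` parameter, and the skip categories are hypothetical names, and the demo writes a throwaway file rather than touching the real dataset.

```python
import json
import os
import tempfile


def load_records(path, required=("symptom", "remedy")):
    """Load JSONL records, counting skipped lines by reason (hypothetical helper)."""
    records = []
    skipped = {"blank": 0, "invalid_json": 0, "missing_field": 0}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                skipped["blank"] += 1
                continue
            try:
                item = json.loads(line)
            except json.JSONDecodeError:
                skipped["invalid_json"] += 1
                continue
            # Keep only dict records where every required field is non-empty
            if isinstance(item, dict) and all(item.get(key) for key in required):
                records.append(item)
            else:
                skipped["missing_field"] += 1
    return records, skipped


# Demo on a temporary file with one valid line and three bad ones
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"symptom": "Cough", "remedy": "Honey"}\n')
    tmp.write("\n")
    tmp.write('{"symptom": "Cough"}\n')
    tmp.write("not json\n")

demo_records, demo_skipped = load_records(tmp.name)
os.remove(tmp.name)
print(len(demo_records), demo_skipped)
```

Calling `load_records(jsonl_path)` would return the same `records` list as In [1], plus the skip counts.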


## RAG Framework 

| Framework               | Best For                           | Strengths                                                    | Tradeoffs                                |
| ----------------------- | ---------------------------------- | ------------------------------------------------------------ | ---------------------------------------- |
| **LangChain**           | General RAG + agents               | ✅ Huge ecosystem, tools, agents, integrations                | Heavy; complex tracing; less transparent |
| **LlamaIndex**          | Structured, hybrid, or tabular RAG | ✅ Better document structure control, tree index, tabular RAG | Agents are basic, fewer agent tools      |
| **Haystack**            | Traditional NLP, QA, open source   | ✅ Flexible pipeline control, solid evals                     | Less LLM-native; older ecosystem         |
| **SmolAgents / CrewAI** | Dev-first agent systems            | ✅ Tiny, hackable, very fast for multi-agent apps             | Lacks fine-grained RAG plumbing          |
| **RAGatouille**         | Plug-and-play vector RAG           | ✅ Super minimal RAG-only wrapper                             | No agents, no eval tools                 |
| **Dust.tt**             | UI-first prompt+tool orchestration | ✅ Great UX, graphs, eval tracking                            | SaaS-based, not self-hosted              |
   
Module by module, the recommendations are:
   
| Module                   | Best Choice                                                                                              |
| ------------------------ | -------------------------------------------------------------------------------------------------------- |
| 🔍 Embedding + Retrieval | **LlamaIndex** or **LangChain** (start with LangChain if agents are central)                             |
| 🧠 Agents + Tool Use     | **CrewAI** or **SmolAgents** (lightweight, composable)                                                   |
| 📊 Evaluation            | Use **TruLens**, **RAGAS**, or **Promptfoo** (can be plugged into either)                                |
| 🔧 Optional RAG Plug     | You can even use **LlamaIndex for vector management**, and connect to **LangChain Agents** for reasoning |

### Step 1: Convert JSON → `Document` objects
| Format                      | Pros                                                 | Cons                                                    |
| --------------------------- | ---------------------------------------------------- | ------------------------------------------------------- |
| `langchain.schema.Document` | ✅ Standardized for LangChain<br>✅ Metadata support   | Slightly verbose                                        |
| `llama_index.TextNode`      | ✅ Structured field storage<br>✅ Fine-grained control | Works best if you plan to use **LlamaIndex end-to-end** |
| Raw dicts (custom)          | ✅ Minimal, DIY logic                                 | ❌ Not supported in RAG toolchains                       |

✅ Use LangChain's `Document`, since the rest of this notebook builds on LangChain.


In [3]:
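# Same class that `langchain.schema.Document` re-exports; langchain_core is its current home.
from langchain_core.documents import Document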

documents = []
for item in records:
    if item.get("symptom") and item.get("remedy"):
        page_content = (
            f"Symptom: {item['symptom']}\n"
            f"Remedy: {item['remedy']}\n"
            f"Description: {item.get('description', '')}\n"
            f"Warnings: {item.get('warnings', '')}"
        )
        metadata = {"source_url": item.get("source_url", "unknown")}
        documents.append(Document(page_content=page_content, metadata=metadata))


In [4]:
documents[:5]
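
The template in In [3] writes the `Description:` and `Warnings:` labels even when those fields are empty, which puts label-only lines into the text that gets embedded. One option, sketched here with a hypothetical `build_page_content` helper, is to emit only the fields that are actually present. The field names are the ones used in In [3]; the example record is made up.

```python
# (label, key) pairs in the order they should appear in page_content
FIELDS = [
    ("Symptom", "symptom"),
    ("Remedy", "remedy"),
    ("Description", "description"),
    ("Warnings", "warnings"),
]


def build_page_content(item):
    """Join the non-empty fields of a record into 'Label: value' lines."""
    lines = [f"{label}: {item[key]}" for label, key in FIELDS if item.get(key)]
    return "\n".join(lines)


# Made-up record: empty warnings and a missing description are both omitted
example = build_page_content({"symptom": "Cough", "remedy": "Honey", "warnings": ""})
print(example)
```

In the loop in In [3], `page_content = build_page_content(item)` would replace the four-line f-string.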



### Step 2: Choose Embedding Model

| Model                                                   | Pros                                                     | Cons                                                                  |
| ------------------------------------------------------- | -------------------------------------------------------- | --------------------------------------------------------------------- |
| `OpenAIEmbeddings` (`text-embedding-3-small`)           | ✅ Very accurate<br>✅ Works out of the box with LangChain | ❌ API calls = paid<br>❌ Slow for large batches                        |
| `HuggingFaceEmbeddings` (e.g. MiniLM, bge-small)        | ✅ Free<br>✅ Fast<br>✅ Good quality                       | ❌ Needs local compute<br>❌ Slightly less performant on subtle queries |
| `InstructorEmbeddings`                                  | ✅ Good with instructions/context<br>✅ Free               | ❌ Heavier; needs sentence-transformers setup                          |
| `Mistral-based` embeddings (via Hugging Face or Ollama) | ✅ Emerging, fast, local                                  | ❌ Less mature; limited support in LangChain now                       |



In [5]:
! pip install sentence-transformers langchain-community chromadb langchain-huggingface --quiet

For general-purpose semantic search, a good starting model is:
`"all-MiniLM-L6-v2"`   

Other great options:   

`"BAAI/bge-small-en"`: very strong performance.   
`"intfloat/multilingual-e5-small"`: if you want multilingual support.   
`"thenlper/gte-small"`: open and tuned for RAG-like tasks.   
   
We'll use `"all-MiniLM-L6-v2"` here:    


In [6]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)
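
A quick sanity check after loading a model is to embed a query and a few texts and rank the texts by cosine similarity. The sketch below works with any object that exposes LangChain's `embed_query` and `embed_documents` methods. `FakeEmbeddings` is a hypothetical stand-in so the snippet runs without downloading a model, and `rank_by_similarity` is an illustrative helper, not part of LangChain.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(model, query, texts):
    """Return (score, text) pairs sorted from most to least similar to the query."""
    query_vec = model.embed_query(query)
    text_vecs = model.embed_documents(texts)
    scored = [(cosine(query_vec, vec), text) for vec, text in zip(text_vecs, texts)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


class FakeEmbeddings:
    """Hypothetical stand-in: 2-d vectors counting the words 'cough' and 'tea'."""

    def embed_query(self, text):
        words = text.lower().split()
        return [float(words.count("cough")), float(words.count("tea"))]

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]


ranked = rank_by_similarity(FakeEmbeddings(), "cough", ["ginger tea", "honey for cough"])
print(ranked)
```

With the real model, passing `embedding_model` from In [6] in place of `FakeEmbeddings()` should rank cough-related texts above unrelated ones if the embeddings are behaving sensibly.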



### Step 3: Choose Vector Store

| Store                       | Pros                                                            | Cons                                         |
| --------------------------- | --------------------------------------------------------------- | -------------------------------------------- |
| **Chroma**                  | ✅ Lightweight<br>✅ Local file persistence<br>✅ LangChain native | Slightly slower for huge datasets            |
| **FAISS**                   | ✅ Fast, local, simple                                           | ❌ No metadata filtering natively             |
| **Weaviate / Qdrant**       | ✅ Scalable<br>✅ Metadata filtering<br>✅ REST API                | Requires running a server or cloud account   |
| **Pinecone**                | ✅ Hosted, powerful<br>✅ Metadata filtering                      | ❌ Paid API<br>❌ Cold starts in free tier     |
| **LlamaIndex vector index** | ✅ Tight integration with LlamaIndex pipelines                   | ❌ Limited if not using LlamaIndex end-to-end |

✅ Start with Chroma: a local, fast, simple vector DB with metadata support.

In [7]:
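# `langchain.vectorstores` is deprecated; the integration lives in langchain-community (installed in In [5]).
from langchain_community.vectorstores import Chroma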

vector_db = Chroma.from_documents(
    documents=documents,
    embedding=embedding_model,
    persist_directory="avva_rag_vector_db"
)
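
Re-running In [7] against the same `persist_directory` can add the same documents again, because each run generates fresh IDs. One way to guard against that is to derive a deterministic ID from each document's text and pass the list through the `ids` argument of `Chroma.from_documents`. The sketch assumes that identical `page_content` means a duplicate. `content_id` and `dedupe_documents` are hypothetical helpers, and `SimpleNamespace` stands in for `Document` so the snippet runs on its own.

```python
import hashlib
from types import SimpleNamespace


def content_id(text):
    """Deterministic short ID: first 16 hex chars of the SHA-256 of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def dedupe_documents(docs):
    """Drop documents with identical page_content; return (ids, docs) in first-seen order."""
    unique = {}
    for doc in docs:
        unique.setdefault(content_id(doc.page_content), doc)
    return list(unique.keys()), list(unique.values())


# Stand-ins for LangChain Document objects (only page_content is needed here)
sample_docs = [
    SimpleNamespace(page_content="Symptom: Cough\nRemedy: Honey"),
    SimpleNamespace(page_content="Symptom: Fever\nRemedy: Rest"),
    SimpleNamespace(page_content="Symptom: Cough\nRemedy: Honey"),
]
ids, unique_docs = dedupe_documents(sample_docs)
print(len(ids), len(unique_docs))
```

The results would then be passed as `Chroma.from_documents(documents=unique_docs, embedding=embedding_model, ids=ids, persist_directory="avva_rag_vector_db")`, so that a repeated run reuses the same IDs instead of appending new entries.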


## Test Basic Retrieval

In [8]:
retriever = vector_db.as_retriever()
results = retriever.invoke("home remedies for cough")

for i, doc in enumerate(results):
    print(f"\n🔹 Result {i+1}")
    print(doc.page_content)
    print(doc.metadata)



🔹 Result 1
Symptom: Cough
Remedy: Apple cider vinegar, cayenne & cinnamon
Description: A mixture of apple cider vinegar, cayenne, cinnamon, and honey can help soothe a cough and sore throat.
{'source_url': 'https://www.prevention.com/health/a20477585/35-all-time-favorite-natural-remedies/'}

🔹 Result 2
Symptom: cough and sore throat
Remedy: Tea, Honey, Echinacea, Elderberry syrup, Pelargonium
Description: Throat-coating properties to reduce irritation, soothe sore throats, suppress coughs, reduce cold symptoms, and have antiviral properties
{'source_url': 'https://www.allinahealth.org/healthysetgo/heal/natural-remedies-for-everyday-illnesses'}

🔹 Result 3
Symptom: Cough
Remedy: Honey
Description: Honey may work just as well for a cough as over-the-counter medicines, especially helpful for children who aren't old enough to take those.
{'source_url': 'https://www.webmd.com/balance/ss/slideshow-home-remedies'}

🔹 Result 4
Symptom: Honey and tea
Remedy: adding honey to ginge
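
The last stage of the pipeline, LLM QA, needs the retrieved documents turned into a prompt. Below is a minimal sketch that assumes a plain-text prompt layout. `build_prompt` and the instruction wording are hypothetical, and `SimpleNamespace` stands in for LangChain `Document` objects so the snippet runs on its own.

```python
from types import SimpleNamespace


def build_prompt(question, docs):
    """Number each retrieved document, attach its source, and append the question."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source_url", "unknown")
        blocks.append(f"[{i}] {doc.page_content}\nSource: {source}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Stand-in for one retrieved Document (content shortened for the example)
retrieved = [
    SimpleNamespace(
        page_content="Symptom: Cough\nRemedy: Honey",
        metadata={"source_url": "https://www.webmd.com/balance/ss/slideshow-home-remedies"},
    )
]
prompt = build_prompt("home remedies for cough", retrieved)
print(prompt)
```

With the real retriever, `build_prompt(query, retriever.invoke(query))` would produce the text to send to whichever LLM is chosen for the QA step.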

### General Knowledge

#### 🧠 Embedding Models: Comparison Table

| Model / Tool                          | Type    | Strengths                                      | Weaknesses                                    | Hosted / Local | LangChain Support | 💸 Cost                    |
|---------------------------------------|---------|------------------------------------------------|-----------------------------------------------|----------------|-------------------|----------------------------|
| **OpenAI (`text-embedding-3-small`)** | Dense   | High quality, optimized for LLM RAG            | Paid, API limits                              | Hosted         | ✅ Yes            | 💸 $0.00002 / 1K tokens    |
| **HuggingFace Transformers**          | Dense   | Free, fast, open-source                        | Slightly lower accuracy than OpenAI           | Local / HF Hub | ✅ Yes            | 🆓 Free (local use)        |
| **Cohere Embed v3**                   | Dense   | Fast, high quality, multilingual               | API-based, requires key                       | Hosted         | ✅ Yes            | 💸 $0.10 / 1K calls        |
| **Jina AI Embeddings**                | Dense   | Open-source, designed for long docs            | Less mainstream                               | Both           | ✅ Yes            | 🆓 Free (OSS), 💸 if hosted |
| **Google's BERT / USE**               | Dense   | Strong sentence-level understanding            | Not search-optimized                          | Local          | 🚫 No             | 🆓 Free                    |
| **BM25 / TF-IDF**                     | Sparse  | Great keyword matching                         | Poor semantic understanding                   | Local          | ✅ Yes            | 🆓 Free                    |
| **Hybrid (Dense + Sparse)**           | Hybrid  | Combines semantic + keyword strengths          | Complex to manage                             | Local          | ✅ Yes            | 🆓 Free                    |

###### ✅ Best by Use Case:

| Use Case                              | Recommended Embedding     |
|--------------------------------------|----------------------------|
| RAG with semantic accuracy           | OpenAI or Cohere           |
| Privacy/local deployment             | HuggingFace, BM25          |
| Long documents (web scraping, RAG)   | Jina AI, OpenAI            |
| Keyword-driven search                | BM25 or Hybrid             |
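
The Hybrid option combines a dense ranking with a sparse one. One common way to merge the two ranked lists is reciprocal rank fusion (RRF), where each document scores the sum of 1 / (k + rank) over the lists it appears in. This is a minimal sketch: the document IDs are made up, `reciprocal_rank_fusion` is an illustrative helper, and k = 60 is a conventional default rather than something the table prescribes.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken alphabetically by ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up rankings: one from a dense retriever, one from a sparse (BM25-style) retriever
dense_ranking = ["honey", "tea", "vinegar"]
sparse_ranking = ["tea", "ginger", "honey"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are still kept, which is the behavior the Hybrid row describes.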

---

#### 🧊 Vector Database Options: Comparison Table

| Vector DB         | Hosted / Local | Dedup Support | Metadata Filtering | Scale Ready | Notes                          | 💸 Cost                              |
|-------------------|----------------|---------------|--------------------|-------------|--------------------------------|--------------------------------------|
| **Chroma**        | Local          | ❌ No         | ✅ Yes             | ⚠️ Limited  | Great for prototyping          | 🆓 Free                              |
| **FAISS**         | Local          | ❌ No         | ❌ No              | ✅ Yes      | No metadata, high perf         | 🆓 Free                              |
| **Pinecone**      | Hosted         | ✅ Yes (ID)   | ✅ Yes             | ✅ Yes      | Ideal for production RAG       | 💸 Free tier + $0.096/1M vec updates |
| **Qdrant**        | Both           | ✅ Yes (ID)   | ✅ Yes             | ✅ Yes      | Open source + managed options  | 🆓 OSS, 💸 $5+/mo (cloud)            |
| **Weaviate**      | Both           | ✅ Yes        | ✅ Yes + hybrid    | ✅ Yes      | REST API + hybrid search       | 🆓 OSS, 💸 cloud available           |
| **Milvus**        | Both           | ✅ Manual     | ✅ Yes             | ✅ Yes      | Best for massive vector data   | 🆓 Free                              |
| **ElasticSearch** | Both           | ⚠️ Partial    | ✅ Yes             | ✅ Yes      | Keyword + dense hybrid search  | 💸 Cloud: starts ~$16/mo             |

###### ✅ Best by Use Case:

| Need                                 | Recommended Store |
|-------------------------------------|-------------------|
| Local quick testing                 | Chroma, FAISS     |
| Deduplicated, production-ready RAG | Pinecone, Qdrant  |
| Open source, scalable               | Qdrant, Milvus    |
| Hybrid search (semantic + keyword) | Weaviate, ElasticSearch |
| Fine-grained filtering              | Qdrant, Weaviate  |

---

#### 🔄 Recommended Pairings

| Embedding        | Best Paired Vector DB     |
|------------------|---------------------------|
| OpenAI           | Pinecone, Qdrant, Weaviate|
| HuggingFace      | Chroma, FAISS, Qdrant     |
| Cohere           | Pinecone, Qdrant          |
| BM25 / Hybrid    | Weaviate, ElasticSearch   |
| Jina AI          | Qdrant, Chroma            |

