# RAG Prototype Notebook

This notebook walks through creating a retrieval-augmented generation (RAG) prototype using your `all_district_data.json` file.

**What it does:**
- Inspects the collected JSON data
- Converts each record into a text document with metadata
- Saves the documents (`docs.json`, `docs.jsonl`) and a `train.jsonl` of templated QA pairs
- (Optional) Builds embeddings with `sentence-transformers` and stores them in a FAISS index
- Runs a simple retrieval test against the index

**How to run:**
- If you haven't produced `all_district_data.json` yet, run your collector script first (e.g. `python dbcollection_final.py`).
- Install the required packages before running the optional cells (e.g. `pip install sentence-transformers faiss-cpu`).

---

*Notebook generated automatically by ChatGPT — edit cells as needed.*

In [74]:
# 1) Inspect dataset
import os, json
fn = 'all_district_data.json'
if not os.path.exists(fn):
    print(f"File not found: {fn}. Make sure to run your collector script (e.g. python dbcollection_final.py) to produce it.")
else:
    with open(fn,'r',encoding='utf-8') as f:
        data = json.load(f)
    print('Loaded', len(data), 'records')
    if len(data) > 0:
        print('\nTop-level keys in first record:')
        print(list(data[0].keys()))
        import textwrap
        print('\nExample record (pretty):')
        print(textwrap.shorten(json.dumps(data[0], ensure_ascii=False, indent=2), width=2000))

    # Save a small sample for quick inspection (kept inside the else branch,
    # because `data` is undefined when the input file is missing)
    with open('sample_records.json','w',encoding='utf-8') as f:
        json.dump(data[:10], f, ensure_ascii=False, indent=2)
    print('Saved sample to sample_records.json')

Loaded 743 records

Top-level keys in first record:
['locationName', 'area', 'loss', 'category', 'reportSummary', 'gwProjectedUtilAllocationDynamicAquifer', 'additionalRecharge', 'staticGWResource', 'totalGWAvailability', 'aquiferBusinessData', 'coastalBusinessData', 'waterDepletedZonesBusinessData', 'inOutFlow', 'baseFlow', 'streamRecharge', 'additionalbaseflow', 'envFlows', 'subject', 'action', 'modifiable', 'isUrban', 'gwSpecificYield', 'geology', 'evaporation', 'qualityTagging', 'approvalLevel', 'verificStatus', 'timeStamp', 'locationUUID', 'rainfall', 'wtfonly', 'computationSummary', 'rechargeData', 'draftData', 'currentAvailabilityForAllPurposes', 'availabilityForFutureUse', 'gwallocation', 'stageOfExtraction', 'gwlevelData', 'gwanalysisSeason', 'gwtrendSlope', 'waterTableRiseFall', 'gwtrendAttention', 'waterTableCategory', 'message']

Example record (pretty):
{ "locationName": "SHEOPUR", "area": { "non_recharge_worthy": { "commandArea": 0.0, "nonCommandArea": 0.0, "poorQualityAr

In [63]:
# 2) Convert records to text documents and generate QA pairs
import re, json

def record_to_text(rec, keys_order=None):
    lines = []
    if keys_order is None:
        keys = list(rec.keys())
    else:
        keys = [k for k in keys_order if k in rec] + [k for k in rec if k not in (keys_order or [])]
    for k in keys:
        v = rec.get(k)
        if v is None:
            continue
        if isinstance(v, (list, tuple)):
            try:
                val = ', '.join(map(str, v))
            except Exception:
                val = str(v)
        elif isinstance(v, dict):
            val = json.dumps(v, ensure_ascii=False)
        else:
            val = str(v)
        val = re.sub(r'\s+', ' ', val).strip()
        lines.append(f"{k}: {val}")
    return '\n'.join(lines)

fn = 'all_district_data.json'
with open(fn,'r',encoding='utf-8') as f:
    data = json.load(f)

docs = []
for i, rec in enumerate(data):
    text = record_to_text(rec)
    meta = { 'source_id': i }
    # Use the location name as the title. The other top-level keys hold nested
    # data (e.g. 'area' is a dict), so they are not suitable fallbacks.
    if rec.get('locationName'):
        meta['title'] = rec['locationName']
    docs.append({'id': str(i), 'text': text, 'meta': meta})

with open('docs.json','w',encoding='utf-8') as f:
    json.dump(docs, f, ensure_ascii=False, indent=2)

with open('docs.jsonl','w',encoding='utf-8') as f:
    for d in docs:
        f.write(json.dumps(d, ensure_ascii=False) + '\n')

print('Saved', len(docs), 'documents to docs.json and docs.jsonl')

# Generate QA pairs
train_pairs = []
for d in docs:
    title = d['meta'].get('title', f"record {d['id']}")
    q = f"What information do you have about {title}?"
    a = d['text']
    train_pairs.append({'instruction': q, 'output': a})

with open('train.jsonl','w',encoding='utf-8') as f:
    for p in train_pairs:
        f.write(json.dumps(p, ensure_ascii=False) + '\n')

print('Saved', len(train_pairs), 'QA pairs to train.jsonl')

Saved 743 documents to docs.json and docs.jsonl
Saved 743 QA pairs to train.jsonl
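
`record_to_text` above serializes nested dictionaries such as `area` as a single JSON string. As an alternative, the sketch below flattens nested values into dotted `path: value` lines, which may be easier to read and to embed. The helper names `flatten` and `record_to_flat_text` are hypothetical, and the example record is abbreviated for illustration (the `loss: None` entry is made up to show how empty values are skipped).

```python
def flatten(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs from nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(child, path)
    elif isinstance(value, (list, tuple)):
        for pos, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{pos}]")
    elif value is not None:
        # Leaf value: emit it together with the path that leads to it.
        yield prefix, value


def record_to_flat_text(rec):
    """Render one record as 'path: value' lines, skipping None leaves."""
    return "\n".join(f"{path}: {val}" for path, val in flatten(rec))


# Abbreviated, partly hypothetical record for illustration only.
example = {
    "locationName": "SHEOPUR",
    "area": {"non_recharge_worthy": {"commandArea": 0.0}},
    "loss": None,
}
flat_text = record_to_flat_text(example)
print(flat_text)
```

Whether flattened text retrieves better than the JSON dump is something to check on your own data; this only changes how each document is rendered.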


In [67]:
# 3) OPTIONAL: Build embeddings + FAISS index
# Install first: pip install -U sentence-transformers faiss-cpu

from sentence_transformers import SentenceTransformer
import numpy as np, faiss, json

with open('docs.json','r',encoding='utf-8') as f:
    docs = json.load(f)

texts = [d['text'] for d in docs]
model = SentenceTransformer('all-MiniLM-L6-v2')

embs = model.encode(texts, show_progress_bar=False, convert_to_numpy=True)
faiss.normalize_L2(embs)
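dim = embs.shape[1]  # embedding dimension; named `dim` to avoid clashing with the loop variable `d` used below
index = faiss.IndexFlatIP(dim)  # inner product on L2-normalized vectors equals cosine similarity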
index.add(embs)

faiss.write_index(index, 'districts.index')
with open('docs_meta.json','w',encoding='utf-8') as f:
    json.dump([d['meta'] for d in docs], f, ensure_ascii=False, indent=2)

print('Saved FAISS index and metadata')

Saved FAISS index and metadata


In [69]:
# 4) Retrieval test
from sentence_transformers import SentenceTransformer
import faiss, json, numpy as np

index = faiss.read_index('districts.index')
with open('docs.json','r',encoding='utf-8') as f:
    docs = json.load(f)
model = SentenceTransformer('all-MiniLM-L6-v2')

queries = [
    "What is the total command area of SHEOPUR?",
    "Which category is assigned to Sheopur?",
]

for q in queries:
    q_emb = model.encode([q], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, k=5)
    print('\nQuery:', q)
    for rank, idx in enumerate(I[0]):
        doc = docs[idx]
        print(f' Rank {rank+1}: doc_id={doc["id"]}, title={doc["meta"].get("title")}')
        print('  snippet:', doc['text'][:300].replace('\n',' ') + '...')


Query: What is the total command area of SHEOPUR?
 Rank 1: doc_id=376, title=HISAR
  snippet: locationName: HISAR area: {"non_recharge_worthy": {"commandArea": 0.0, "nonCommandArea": 0.0, "poorQualityArea": 0.0, "hillyArea": 0.0, "forestArea": 0.0, "totalArea": 0.0, "pavedArea": 0.0, "unpavedArea": 0.0}, "total": {"commandArea": 406838.29692163103, "nonCommandArea": 0.0, "poorQualityArea": 6...
 Rank 2: doc_id=176, title=Faridkot
  snippet: locationName: Faridkot area: {"non_recharge_worthy": {"commandArea": 0.0, "nonCommandArea": 0.0, "poorQualityArea": 0.0, "hillyArea": 0.0, "forestArea": 0.0, "totalArea": 0.0, "pavedArea": 0.0, "unpavedArea": 0.0}, "total": {"commandArea": 92275.49, "nonCommandArea": 0.0, "poorQualityArea": 55322.50...
 Rank 3: doc_id=128, title=SHEOHAR
  snippet: locationName: SHEOHAR area: {"non_recharge_worthy": {"commandArea": 0.0, "nonCommandArea": 0.0, "poorQualityArea": 0.0, "hillyArea": 0.0, "forestArea": 0.0, "totalArea": 0.0, "pavedArea": 0.0, "unpavedAre
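
In the output shown above, the top three hits for the SHEOPUR query are HISAR, Faridkot, and SHEOHAR rather than SHEOPUR itself. One possible mitigation, sketched here under assumptions, is to retrieve a larger candidate set and then boost candidates whose title appears literally in the query. The function `rerank_by_title`, its `boost` parameter, and the similarity scores below are hypothetical and not part of the notebook.

```python
def rerank_by_title(query, hits, boost=1.0):
    """Re-order (score, doc) hits, adding `boost` when the doc title occurs in the query."""
    query_lower = query.lower()
    rescored = []
    for score, doc in hits:
        title = str(doc["meta"].get("title") or "").lower()
        bonus = boost if title and title in query_lower else 0.0
        rescored.append((score + bonus, doc))
    # sorted() is stable, so hits with equal scores keep their original order.
    return sorted(rescored, key=lambda pair: pair[0], reverse=True)


# Hypothetical candidates: the scores are made up for illustration.
hits = [
    (0.62, {"id": "376", "meta": {"title": "HISAR"}}),
    (0.60, {"id": "128", "meta": {"title": "SHEOHAR"}}),
    (0.55, {"id": "0", "meta": {"title": "SHEOPUR"}}),
]
reranked = rerank_by_title("What is the total command area of SHEOPUR?", hits)
top_titles = [doc["meta"]["title"] for _, doc in reranked]
print(top_titles)
```

This only helps when the matching document is already among the candidates, so it would need a `k` larger than 5 in `index.search`, and plain substring matching can misfire on titles that are contained in other words.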

## 5) Example RAG Prompt Template
```
You are a helpful assistant. Use the following sources to answer the question. If not found, say "I don't know".

SOURCES:
[1] <passage 1>
[2] <passage 2>

QUESTION: <user question>
```
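
A minimal sketch of how the template above could be filled from retrieved passages. The names `PROMPT_TEMPLATE`, `build_prompt`, and `max_chars` are hypothetical, and the two passages are shortened stand-ins for the `text` field of retrieved documents.

```python
PROMPT_TEMPLATE = (
    "You are a helpful assistant. Use the following sources to answer the question. "
    'If not found, say "I don\'t know".\n\n'
    "SOURCES:\n{sources}\n\n"
    "QUESTION: {question}"
)


def build_prompt(question, passages, max_chars=300):
    """Number the passages as [1], [2], ... and insert them into the template."""
    numbered = [
        f"[{n}] {passage[:max_chars]}"  # truncate long passages to keep the prompt short
        for n, passage in enumerate(passages, start=1)
    ]
    return PROMPT_TEMPLATE.format(sources="\n".join(numbered), question=question)


# Shortened stand-ins for retrieved document texts.
passages = ["locationName: SHEOPUR", "locationName: SHEOHAR"]
prompt = build_prompt("Which category is assigned to Sheopur?", passages)
print(prompt)
```

Because `str.format` does not re-parse the substituted values, passages that contain JSON braces can be inserted without escaping.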