# LangGraph-style RAG Q&A Agent (Gemini-only, clean & clear)

This notebook demonstrates a small Retrieval-Augmented Generation (RAG) pipeline with a 4-node workflow: `plan` -> `retrieve` -> `answer` -> `reflect`.

It uses ChromaDB (local), Hugging Face sentence-transformers for embeddings, and Google Gemini (via `google.generativeai`) as the LLM. Each step logs or prints its output so you can follow the pipeline.

Note: set the `GEMINI_API_KEY` environment variable, or point `GOOGLE_APPLICATION_CREDENTIALS` at a service-account JSON file, before running the configuration cell below. That cell reads both variables once, and `answer_node` relies on the values it captured.
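
To make the four-node workflow described above concrete before the real cells, here is a minimal sketch of how such nodes can be chained over a shared state dictionary. The `run_pipeline` helper and the four stub functions are hypothetical stand-ins written for illustration; they are not the notebook's nodes and they do not call Chroma or Gemini.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_pipeline(state: State, nodes: List[Node]) -> State:
    """Run nodes in order; each node returns updates that are merged into the state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


# Hypothetical stubs that only mimic the shape of plan -> retrieve -> answer -> reflect.
def plan(state: State) -> State:
    return {'query': state['question'].strip()}


def retrieve(state: State) -> State:
    return {'retrieved': [{'source': 'stub.txt', 'content': 'stub context'}]}


def answer(state: State) -> State:
    return {'answer': f"Stub answer to {state['query']!r} using {len(state['retrieved'])} document(s)"}


def reflect(state: State) -> State:
    return {'passed': bool(state['answer'])}


final = run_pipeline({'question': '  What is RAG?  '}, [plan, retrieve, answer, reflect])
print(final['query'])
print(final['passed'])
```

Keeping every intermediate value in one dictionary makes it easy to inspect what each step produced, which is the same goal the notebook's logging serves.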

In [None]:
import os
import glob
import json
import logging
from typing import List, Dict, Optional

print('Working directory:', os.getcwd())

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger('gemini_rag_notebook')

DATA_DIR = os.environ.get('RAG_DATA_DIR', os.path.join(os.getcwd(), 'data'))
CHROMA_DIR = os.environ.get('RAG_CHROMA_DIR', os.path.join(os.getcwd(), 'chroma_db'))
EMBEDDING_MODEL = os.environ.get('RAG_EMBEDDING_MODEL', 'sentence-transformers/all-MiniLM-L6-v2')
GEMINI_API_KEY = os.environ.get('GEMINI_API_KEY', '')
USE_GEMINI = bool(GEMINI_API_KEY) or bool(os.environ.get('GOOGLE_APPLICATION_CREDENTIALS'))

print('DATA_DIR =', DATA_DIR)
print('CHROMA_DIR =', CHROMA_DIR)
print('EMBEDDING_MODEL =', EMBEDDING_MODEL)
print('USE_GEMINI =', USE_GEMINI)

Working directory: c:\Users\jella\swarm\notebooks
DATA_DIR = c:\Users\jella\swarm\notebooks\data
CHROMA_DIR = c:\Users\jella\swarm\notebooks\chroma_db
EMBEDDING_MODEL = sentence-transformers/all-MiniLM-L6-v2
USE_GEMINI = True


In [30]:
# Create sample data files if they don't exist so the demo can run quickly.
os.makedirs(DATA_DIR, exist_ok=True)
sample1 = os.path.join(DATA_DIR, 'renewable_energy_overview.txt')
sample2 = os.path.join(DATA_DIR, 'solar_wind_benefits.txt')
if not os.path.exists(sample1):
    with open(sample1, 'w', encoding='utf-8') as f:
        f.write('Renewable energy, including solar and wind, provides low-carbon electricity, reduces greenhouse gas emissions, and diversifies energy supply. Renewables can lower energy costs in the long term and create local jobs.')
if not os.path.exists(sample2):
    with open(sample2, 'w', encoding='utf-8') as f:
        f.write('Solar power generates electricity from sunlight using photovoltaic cells or concentrated solar power; wind power harnesses wind with turbines. Both technologies reduce reliance on fossil fuels and improve air quality.')

print('Sample data files ready:')
print(glob.glob(os.path.join(DATA_DIR, '*.txt')))

Sample data files ready:
['c:\\Users\\jella\\swarm\\notebooks\\data\\renewable_energy_overview.txt', 'c:\\Users\\jella\\swarm\\notebooks\\data\\solar_wind_benefits.txt']


## Ingest documents into ChromaDB
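The next cell reads `.txt`, `.md`, and `.csv` files recursively under `DATA_DIR` (default `./data`), splits them into chunks of up to 800 characters with a 100-character overlap, creates embeddings with the Hugging Face model `all-MiniLM-L6-v2`, and stores them in a local Chroma collection named `gemini_rag_docs` under `CHROMA_DIR` (default `./chroma_db`).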
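
The splitter in the next cell is configured with `chunk_size=800` and `chunk_overlap=100`. As a rough illustration of what those two numbers mean, the sketch below cuts text with a plain sliding window. This is a simplification: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its chunk boundaries will generally differ.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError('need chunk_size > 0 and 0 <= chunk_overlap < chunk_size')
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = ''.join(str(i % 10) for i in range(2000))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])  # [800, 800, 600]
```

Each sample file written earlier is well under 800 characters, so each one fits in a single chunk.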

In [37]:
def ingest_documents(data_dir: str = DATA_DIR, chroma_dir: str = CHROMA_DIR, embedding_model: str = EMBEDDING_MODEL):
    from typing import List, Dict
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import Chroma
    from langchain.embeddings import HuggingFaceEmbeddings

    logger.info('Ingesting documents from %s', data_dir)
    patterns = [os.path.join(data_dir, '**', '*.txt'), os.path.join(data_dir, '**', '*.md'), os.path.join(data_dir, '**', '*.csv')]
    files: List[str] = []
    for p in patterns:
        files.extend(glob.glob(p, recursive=True))

    if not files:
        logger.warning('No documents found in %s. Add files or use Kaggle helper.', data_dir)
        return None

    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    texts: List[str] = []
    metadatas: List[Dict[str, str]] = []

    for path in sorted(files):
        try:
            with open(path, 'r', encoding='utf-8') as f:
                raw = f.read()
        except Exception as e:
            logger.warning('Failed to read %s: %s', path, e)
            continue
        chunks = splitter.split_text(raw)
        texts.extend(chunks)
        metadatas.extend([{'source': os.path.basename(path)} for _ in chunks])

    if not texts:
        logger.warning('No text after splitting; aborting ingestion.')
        return None

    logger.info('Creating embeddings using %s', embedding_model)
    hf_embed = HuggingFaceEmbeddings(model_name=embedding_model)

    vectordb = Chroma.from_texts(
        texts=texts,
        embedding=hf_embed,
        metadatas=metadatas,
        collection_name='gemini_rag_docs',
        persist_directory=chroma_dir,
    )
    vectordb.persist()
    logger.info('Indexed %d chunks into Chroma.', len(texts))
    return vectordb

vectordb = ingest_documents()
print('Ingestion returned:', 'OK' if vectordb else 'No vectordb')

2025-11-05 16:55:25,595 INFO Ingesting documents from c:\Users\jella\swarm\notebooks\data
2025-11-05 16:55:25,601 INFO Creating embeddings using sentence-transformers/all-MiniLM-L6-v2
2025-11-05 16:55:25,603 INFO Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
2025-11-05 16:55:25,842 INFO Use pytorch device: cpu
2025-11-05 16:55:25,852 ERROR Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
2025-11-05 16:55:25,857 ERROR Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
2025-11-05 16:55:25,929 INFO Indexed 2 chunks into Chroma.


Ingestion returned: OK
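
Re-running the ingestion cell calls `Chroma.from_texts` again on the same persisted collection. Without explicit IDs, each run is likely to add fresh copies of every chunk. One possible mitigation, sketched here under assumptions, is to derive a deterministic ID for each chunk so that repeated runs target the same records. `stable_chunk_ids` is a hypothetical helper; passing its result through the `ids` argument of `from_texts` is my understanding of the LangChain wrapper and should be checked against the installed version.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def stable_chunk_ids(texts: List[str], metadatas: List[Dict[str, str]]) -> List[str]:
    """Derive a repeatable ID per chunk from its source, its position in that source, and its text."""
    position: Dict[str, int] = defaultdict(int)
    ids: List[str] = []
    for text, meta in zip(texts, metadatas):
        source = meta.get('source', 'unknown')
        index = position[source]  # keeps IDs unique even if a source repeats a chunk
        position[source] += 1
        payload = f'{source}\n{index}\n{text}'.encode('utf-8')
        ids.append(hashlib.sha256(payload).hexdigest()[:32])
    return ids


demo_texts = ['first chunk', 'second chunk', 'first chunk']
demo_meta = [{'source': 'a.txt'}, {'source': 'a.txt'}, {'source': 'a.txt'}]
demo_ids = stable_chunk_ids(demo_texts, demo_meta)
print(len(set(demo_ids)), 'unique IDs for', len(demo_ids), 'chunks')

# Assumed usage inside ingest_documents (not verified here):
# Chroma.from_texts(texts=texts, embedding=hf_embed, metadatas=metadatas,
#                   ids=stable_chunk_ids(texts, metadatas), ...)
```

Because the IDs depend only on the file name, chunk position, and chunk text, unchanged files map to the same IDs on every run.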


In [38]:
## Kaggle dataset helper (optional)
from typing import Optional
def download_kaggle_dataset(dataset: str, target_dir: Optional[str] = None, unzip: bool = True) -> bool:
    try:
        from kaggle.api.kaggle_api_extended import KaggleApi
    except Exception:
        print("Kaggle api not available. Install 'kaggle' and configure credentials if you need this.")
        return False
    api = KaggleApi()
    try:
        api.authenticate()
    except Exception as e:
        print('Kaggle auth failed:', e)
        return False
    td = target_dir or DATA_DIR
    os.makedirs(td, exist_ok=True)
    print(f'Downloading {dataset} to {td} ...')
    api.dataset_download_files(dataset, path=td, unzip=unzip, quiet=False)
    print('Done.')
    return True

In [39]:
# Agent workflow nodes: plan, retrieve, answer, reflect
from typing import List, Dict
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

def plan_node(question: str) -> str:
    logger.info('[PLAN] Question received: %s', question)
    query = question.strip()
    logger.info('[PLAN] Retrieval query: %s', query)
    print('[PLAN] Query:', query)
    return query

def retrieve_node(query: str, chroma_dir: str = CHROMA_DIR, embedding_model: str = EMBEDDING_MODEL, k: int = 4) -> List[Dict[str, str]]:
    logger.info('[RETRIEVE] Querying Chroma for: %s', query)
    hf_embed = HuggingFaceEmbeddings(model_name=embedding_model)
    vectordb = Chroma(persist_directory=chroma_dir, embedding_function=hf_embed, collection_name='gemini_rag_docs')
    docs = vectordb.similarity_search(query, k=k)
    results = [{'source': getattr(d, 'metadata', {}).get('source', 'unknown'), 'content': d.page_content} for d in docs]
    logger.info('[RETRIEVE] Retrieved %d documents', len(results))
    print('[RETRIEVE] Sources:', sorted({r['source'] for r in results}))
    return results

def answer_node(question: str, retrieved: List[Dict[str, str]]) -> str:
    try:
        import google.generativeai as genai
    except ImportError:
        return "google.generativeai not installed; run `pip install google-generativeai`."
    if not USE_GEMINI:
        return 'Gemini not configured; set GEMINI_API_KEY or GOOGLE_APPLICATION_CREDENTIALS.'
    if GEMINI_API_KEY:
        genai.configure(api_key=GEMINI_API_KEY)

    context_blocks = []
    for i, doc in enumerate(retrieved, start=1):
        src = doc.get('source', 'unknown')
        content = doc.get('content', '').strip()
        context_blocks.append(f'[Source {i}: {src}]\n{content}')
    context_text = '\n\n---\n\n'.join(context_blocks) if context_blocks else ''

    system_instruction = (
        'You are a helpful, factual assistant. Answer the user question using only the provided context. '
        'If the answer is not present in the context, say you cannot answer based on the documents.'
    )
    user_prompt = f'Question: {question}\n\nContext:\n{context_text}\n\nProvide a concise answer and cite source numbers (e.g., [Source 1]).'
    full_prompt = system_instruction + '\n\n' + user_prompt

    preferred_order = [
        'gemini-1.5-flash-latest',
        'gemini-1.5-pro-latest',
        'gemini-pro',
        'gemini-1.0-pro',
    ]
    try:
        available = [
            m.name for m in genai.list_models()
            if 'generateContent' in (getattr(m, 'supported_generation_methods', []) or [])
        ]
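    except Exception as e:
        # Listing can fail (network, auth); fall back to the default model below.
        logger.warning('[ANSWER] Could not list models: %s', e)
        available = []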
    def short(n: str) -> str:
        return n.split('/')[-1]
    available_sorted = sorted(available, key=lambda n: (preferred_order.index(short(n)) if short(n) in preferred_order else 999, n))

    last_error = None
    for model_name in available_sorted or ['models/gemini-pro']:
        logger.info('[ANSWER] Trying model %s', model_name)
        try:
            model = genai.GenerativeModel(model_name)
            response = model.generate_content(full_prompt)
            answer_text = getattr(response, 'text', None) or str(response)
            print('[ANSWER]')
            print(answer_text)
            return answer_text
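        except Exception as e:
            # Remember the failure and fall through to the next candidate model.
            logger.warning('[ANSWER] Model %s failed: %s', model_name, e)
            last_error = e
            continue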
    return f"LLM call failed; tried {available_sorted or ['models/gemini-pro']}: {last_error}"


def reflect_node(question: str, answer: str, retrieved: List[Dict[str, str]]):
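    # Keyword heuristic: strip punctuation first, then keep up to 12 terms longer than 3 characters.
    cleaned = (w.lower().strip(',.?') for w in question.split())
    question_terms = [w for w in cleaned if len(w) > 3][:12]
    answer_lower = (answer or '').lower()
    coverage_in_answer = sum(1 for t in question_terms if t in answer_lower)
    coverage_in_retrieved = sum(1 for t in question_terms if any(t in doc.get('content', '').lower() for doc in retrieved))
    result = {
        'question_terms': question_terms,
        'coverage_in_answer': coverage_in_answer,
        'coverage_in_retrieved': coverage_in_retrieved,
        # Lenient check: passes if any term appears in the answer or in the retrieved text.
        'pass': (coverage_in_answer > 0 or coverage_in_retrieved > 0),
    }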
    logger.info('[REFLECT] %s', result)
    print('[REFLECT]', json.dumps(result, indent=2))
    return result
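
The model selection inside `answer_node` sorts the available models by their position in `preferred_order` and falls back to alphabetical order for names that are not listed. The sketch below isolates that ordering so it can be tried without an API key. `rank_models` is a hypothetical helper, and the model names are illustrative input strings, not a statement about which models any account can access.

```python
from typing import List, Sequence, Tuple


def rank_models(available: Sequence[str], preferred: Sequence[str]) -> List[str]:
    """Order model names by preference; unlisted names come last, sorted alphabetically."""
    def sort_key(name: str) -> Tuple[int, str]:
        short = name.split('/')[-1]  # 'models/gemini-pro' -> 'gemini-pro'
        rank = preferred.index(short) if short in preferred else len(preferred)
        return (rank, name)

    return sorted(available, key=sort_key)


preferred = ['gemini-1.5-flash-latest', 'gemini-1.5-pro-latest']
available = ['models/gemini-2.5-pro', 'models/gemini-2.0-flash', 'models/gemini-1.5-pro-latest']
print(rank_models(available, preferred))
# ['models/gemini-1.5-pro-latest', 'models/gemini-2.0-flash', 'models/gemini-2.5-pro']
```

When none of the preferred names is available, the first candidate is simply the alphabetically first model that supports `generateContent`, so the preference list needs updating as model names change.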

In [40]:
# One-shot run through the pipeline
question = 'What are the benefits of renewable energy?'
print('\n--- RUNNING AGENT ---')
q = question.strip()
query = plan_node(q)
retrieved = retrieve_node(query)
answer = answer_node(q, retrieved)
reflect = reflect_node(q, answer, retrieved)

print('\n--- SUMMARY ---')
print('Question:', q)
print('\nAnswer:\n', answer)

2025-11-05 16:55:35,518 INFO [PLAN] Question received: What are the benefits of renewable energy?
2025-11-05 16:55:35,519 INFO [PLAN] Retrieval query: What are the benefits of renewable energy?
2025-11-05 16:55:35,520 INFO [RETRIEVE] Querying Chroma for: What are the benefits of renewable energy?
2025-11-05 16:55:35,522 INFO Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2



--- RUNNING AGENT ---
[PLAN] Query: What are the benefits of renewable energy?


2025-11-05 16:55:35,761 INFO Use pytorch device: cpu
2025-11-05 16:55:35,770 ERROR Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
2025-11-05 16:55:35,776 ERROR Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
2025-11-05 16:55:35,816 INFO [RETRIEVE] Retrieved 4 documents


[RETRIEVE] Sources: ['renewable_energy_overview.txt']


2025-11-05 16:55:36,440 INFO [ANSWER] Trying model models/gemini-2.0-flash
2025-11-05 16:55:38,063 INFO [REFLECT] {'question_terms': ['what', 'benefits', 'renewable', 'energy'], 'coverage_in_answer': 2, 'coverage_in_retrieved': 2, 'pass': True}


[ANSWER]
Renewable energy provides low-carbon electricity, reduces greenhouse gas emissions, diversifies energy supply, can lower energy costs in the long term, and creates local jobs [Source 1, Source 2, Source 3, Source 4].

[REFLECT] {
  "question_terms": [
    "what",
    "benefits",
    "renewable",
    "energy"
  ],
  "coverage_in_answer": 2,
  "coverage_in_retrieved": 2,
  "pass": true
}

--- SUMMARY ---
Question: What are the benefits of renewable energy?

Answer:
 Renewable energy provides low-carbon electricity, reduces greenhouse gas emissions, diversifies energy supply, can lower energy costs in the long term, and creates local jobs [Source 1, Source 2, Source 3, Source 4].
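
The run above retrieved 4 documents from a single source even though ingestion logged only 2 chunks, and the answer cites four sources for the same facts. That pattern suggests the collection holds repeated copies of one chunk, for example from running the ingestion cell more than once. Assuming that is the cause, a small post-processing step can drop exact duplicates before the context is built. `dedupe_retrieved` is a hypothetical helper that operates on the list of dictionaries returned by `retrieve_node`.

```python
from typing import Dict, List, Set, Tuple


def dedupe_retrieved(retrieved: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first occurrence of each (source, content) pair, preserving order."""
    seen: Set[Tuple[str, str]] = set()
    unique: List[Dict[str, str]] = []
    for doc in retrieved:
        key = (doc.get('source', 'unknown'), doc.get('content', '').strip())
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Illustrative input shaped like retrieve_node's return value.
duplicated = [{'source': 'renewable_energy_overview.txt', 'content': 'Renewable energy ...'}] * 4
print(len(dedupe_retrieved(duplicated)))  # 1
```

Calling `dedupe_retrieved(retrieved)` between `retrieve_node` and `answer_node` would keep the prompt shorter and the source numbering meaningful. It treats the symptom only; clearing or rebuilding the collection addresses the duplicates at the source.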

