# RAG Research Papers with LlamaIndex + Ollama + Chroma

This notebook demonstrates a complete RAG (Retrieval-Augmented Generation) system for research papers with:

- **Local Models**: Ollama for both LLM and embeddings (no API costs)
- **Persistent Storage**: Chroma vector database for embeddings
- **Query Understanding**: Auto-routing between retrieval and summarization
- **Deduplication**: SHA256-based duplicate detection
- **Interactive UI**: Streamlit app for easy interaction

## Key Features

1. **Vector Search**: Semantic search over PDF content
2. **Document Summarization**: Tree-based summarization of selected PDFs
3. **Duplicate Prevention**: Only indexes new/unique documents
4. **Query Intent Detection**: Automatically routes summary vs retrieval queries
5. **Persistent Index**: No re-embedding on restart


In [None]:
# Environment setup and imports
import os
import json
import hashlib
from typing import Dict, Any, List
from dotenv import load_dotenv

load_dotenv()

# Optional: set proxy variables if needed
# os.environ["HTTP_PROXY"] = ""
# os.environ["HTTPS_PROXY"] = ""

print("✅ Environment configured")


True

In [None]:
# LlamaIndex + Ollama configuration
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader, SummaryIndex
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.schema import NodeWithScore
import chromadb

# Configuration constants
PERSIST_DIR = "./chroma_store"
COLLECTION_NAME = "papers"
DATA_DIR = "Data"
MANIFEST_PATH = os.path.join(PERSIST_DIR, "manifest.json")

# Configure local Ollama models
Settings.llm = Ollama(model="llama3.1", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

print("✅ LlamaIndex + Ollama configured")
print(f"📁 Data directory: {DATA_DIR}")
print(f"🗄️  Chroma store: {PERSIST_DIR}")
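
The overview lists SHA256-based duplicate detection, and `MANIFEST_PATH` is defined above, but the cells shown here never use it. A minimal sketch of how such a manifest could work follows. The helper names (`file_sha256`, `load_manifest`, `save_manifest`, `find_new_files`) and the `{sha256: path}` manifest layout are assumptions for illustration, not the notebook's actual implementation.

```python
import hashlib
import json
import os
from typing import Dict, List


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_manifest(manifest_path: str) -> Dict[str, str]:
    """Load the {sha256: file path} manifest, or return an empty one if missing."""
    if not os.path.exists(manifest_path):
        return {}
    with open(manifest_path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_manifest(manifest_path: str, manifest: Dict[str, str]) -> None:
    """Write the manifest, creating its parent directory if needed."""
    os.makedirs(os.path.dirname(manifest_path) or ".", exist_ok=True)
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)


def find_new_files(paths: List[str], manifest: Dict[str, str]) -> List[str]:
    """Return paths whose content hash is not yet in the manifest.

    Byte-identical files inside `paths` are also collapsed to the first one.
    The manifest dict is updated in place (hypothetical design choice).
    """
    new_paths = []
    for path in paths:
        digest = file_sha256(path)
        if digest in manifest:
            continue  # same bytes already seen, skip re-indexing
        manifest[digest] = path
        new_paths.append(path)
    return new_paths
```

One way to use it: hash every PDF in `DATA_DIR`, pass only the non-empty result of `find_new_files` to `SimpleDirectoryReader(input_files=...)`, and call `save_manifest(MANIFEST_PATH, manifest)` only after indexing succeeds, so a failed run does not mark files as indexed.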


In [4]:
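# Load PDFs and build an in-memory index.
# Only the first loaded Document is indexed (documents[0:1]) to keep the demo fast;
# pass the full `documents` list to index everything in DATA_DIR.
documents = SimpleDirectoryReader(DATA_DIR).load_data()
index = VectorStoreIndex.from_documents(documents[0:1], show_progress=True)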


Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 493.45it/s]
2025-09-24 12:16:51,888 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
Generating embeddings: 100%|██████████| 1/1 [00:04<00:00,  4.10s/it]


In [9]:
# Query example: retrieve the 3 most similar chunks and answer from them
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Summarize the key concepts across these papers.")
print(response)


2025-09-24 12:20:26,559 - INFO - HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"
2025-09-24 12:20:29,238 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-09-24 12:21:17,166 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Sure, here's a summary of the key concepts from the context:

- Self-preference bias is a significant bias in large language models (LLMs) that overestimates the quality of their own outputs.
- Self-preference bias is measured using perplexity, which is the measure of how different a text is from other texts.
- GPT-4 exhibits a significant degree of self-preference bias based on perplexity analysis.
- The fundamental causes of self-preference bias remain unclear, but perplexity may play a role in this phenomenon.
- The paper introduces a novel quantitative metric to measure self-preference bias, which can help to mitigate this bias.
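
The query above asks for a summary but is answered by top-k vector retrieval, which is the situation the "Query Intent Detection" feature is meant to handle. The notebook does not show how that routing is done, so the following is only a sketch using a keyword heuristic. The cue-word pattern and the names `detect_intent` and `route_query` are assumptions.

```python
import re
from typing import Callable, Dict

# Hypothetical cue words that suggest the user wants a summary.
SUMMARY_PATTERN = re.compile(
    r"\b(summar\w*|overview|key (concepts|points|takeaways)|main (ideas|points))\b",
    re.IGNORECASE,
)


def detect_intent(query: str) -> str:
    """Classify a query as 'summary' or 'retrieval' with a keyword heuristic."""
    return "summary" if SUMMARY_PATTERN.search(query) else "retrieval"


def route_query(query: str, engines: Dict[str, Callable[[str], object]]) -> object:
    """Send the query to the callable registered for its detected intent."""
    return engines[detect_intent(query)](query)


if __name__ == "__main__":
    for q in (
        "Summarize the key concepts across these papers.",
        "What is attention, and why is it useful in LLMs?",
    ):
        print(f"{detect_intent(q):9s} <- {q}")
```

In this notebook, `engines` could map `"retrieval"` to `query_engine.query` and `"summary"` to the `query` method of an engine built with `SummaryIndex.from_documents(documents).as_query_engine(response_mode="tree_summarize")`. A keyword heuristic is cheap but brittle, and an LLM-based classifier is a possible alternative.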


In [10]:
# Optional: chat engine that retrieves context for each message
chat_engine = index.as_chat_engine(chat_mode="context", similarity_top_k=3)
chat_response = chat_engine.chat("What is attention, and why is it useful in LLMs?")
print(chat_response)


2025-09-24 12:21:45,988 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-09-24 12:22:13,928 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


The context does not provide information about attention, so I cannot answer this question from the provided context.
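
When an answer like the one above comes back empty-handed, it helps to look at which chunks were actually retrieved. LlamaIndex responses expose them as `source_nodes`. The helper below is a sketch: `format_sources` is a hypothetical name, and it assumes each item has `.score` and a `.node` with `.metadata` and `.get_content()`, as `NodeWithScore` does. The `file_name` metadata key is what `SimpleDirectoryReader` normally sets, but treat it as an assumption. A stand-in object is used so the block runs without Ollama.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def format_sources(source_nodes: Iterable[Any], max_chars: int = 80) -> List[str]:
    """Render one line per retrieved node: rank, score, file name, text preview."""
    lines = []
    for rank, item in enumerate(source_nodes, start=1):
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        file_name = item.node.metadata.get("file_name", "unknown")
        # Collapse whitespace so the preview stays on one line.
        preview = " ".join(item.node.get_content().split())[:max_chars]
        lines.append(f"[{rank}] score={score} file={file_name} :: {preview}")
    return lines


# Stand-in for a retrieved node (illustrative data, not real notebook output).
fake = SimpleNamespace(
    score=0.8123,
    node=SimpleNamespace(
        metadata={"file_name": "paper.pdf"},
        get_content=lambda: "Self-preference bias\n in LLM evaluators",
    ),
)
print("\n".join(format_sources([fake])))
```

With the objects from the cells above, this would be called as `format_sources(chat_response.source_nodes)` or `format_sources(response.source_nodes)`.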


### Persist embeddings to Chroma (disk)


In [3]:
# Build vector store in Chroma and persist to disk
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Load PDFs
documents = SimpleDirectoryReader(DATA_DIR).load_data()

# Open (or create) the on-disk Chroma collection, using the constants from the configuration cell
chroma_client = chromadb.PersistentClient(path=PERSIST_DIR)
chroma_collection = chroma_client.get_or_create_collection(COLLECTION_NAME)

# Wrap the collection so LlamaIndex writes embeddings into it
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Rebuild the index on top of the Chroma vector store
index = VectorStoreIndex.from_documents(documents[0:1], storage_context=storage_context, show_progress=True)


2025-09-24 12:55:05,020 - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 305.95it/s]
2025-09-24 12:55:09,973 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
Generating embeddings: 100%|██████████| 1/1 [00:04<00:00,  4.41s/it]


### Reload from Chroma and query (no re-embedding)


In [6]:
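# Re-create the index from the existing Chroma collection.
# Documents are not re-embedded; only the query text is embedded at query time.
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Reconnect to the persisted collection (in a new session, run the configuration cell first)
chroma_client = chromadb.PersistentClient(path=PERSIST_DIR)
chroma_collection = chroma_client.get_or_create_collection(COLLECTION_NAME)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

reload_index = VectorStoreIndex.from_vector_store(vector_store)
reload_query_engine = reload_index.as_query_engine(similarity_top_k=3)
print(reload_query_engine.query("give me the key concepts of the paper."))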


2025-09-24 13:06:58,796 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-09-24 13:08:27,792 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


The key concepts of the paper are:

* **Large language models (LLMs)** are a new type of artificial intelligence that can generate human-quality text.
* **Reasoning and planning tasks** are a subset of natural language processing (NLP) tasks that involve using LLMs to solve problems that require abstract reasoning or planning skills.
* **Prompt engineering** is a field of NLP that focuses on designing natural language queries that guide LLM responses effectively.
* **Safety and controllability** is a major concern when using LLMs, as they can generate harmful or inappropriate content.
* **Multi-modality** is a field of NLP that involves training LLMs on diverse data to create models with richer understanding.
