
# Comprehensive Revision Notebook
This notebook revises the key concepts from every stage of the course:
- Loading and Chunking
- Tokenization
- Embeddings and Vector Representations
- Vector Databases (ChromaDB)
- Indexing (VectorStoreIndex, Query Engines, Retrievers)
- Response Synthesizers (Refine, Compact, Tree Summarize, Accumulate, Compact Accumulate)

We will demonstrate the integration and application of all these components in a unified workflow.


In [None]:

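# Install required libraries (the Ollama embedding and LLM integrations ship as separate packages)
%pip install chromadb llama-index llama-index-vector-stores-chroma llama-index-embeddings-ollama llama-index-llms-ollama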

# Import necessary libraries
import chromadb
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.embeddings.ollama import OllamaEmbedding


In [None]:

# **Stage 1: Loading and Chunking**

# Load a sample document (adjust file path as necessary)
documents = SimpleDirectoryReader(input_files=['../data_uber/uber_2021.pdf']).load_data(show_progress=True)

# Display the first document
print("First Document Content:")
print(documents[0].text[:500])  # Show first 500 characters
# For a PDF, the reader typically returns one Document per page
print("Number of documents loaded:", len(documents))

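Stage 1 loads the PDF, but the chunking itself is not shown in the cell above; in this workflow it happens later, when the index is built from the documents. To make the idea concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The function name `chunk_text`, the word-based splitting, and the sizes are illustrative assumptions; this is not the splitter LlamaIndex uses, which works on tokens and sentence boundaries.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; neighbours share `overlap` words.

    Hypothetical helper for illustration only (not a LlamaIndex API).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Toy input: 120 placeholder words w0 ... w119
sample_text = " ".join(f"w{i}" for i in range(120))
toy_chunks = chunk_text(sample_text, chunk_size=50, overlap=10)
print(len(toy_chunks))               # 3
print(toy_chunks[1].split()[:3])     # ['w40', 'w41', 'w42']
```

The overlap means the last ten words of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole by at least one chunk.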

In [None]:

# **Stage 2: Embeddings and Vector Representations**

# Initialize the embedding model
ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0}
)

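# Generate one embedding per loaded document (one request to Ollama each).
# These vectors are for inspection only; the index in Stage 3 computes its own embeddings.
embeddings = [ollama_embedding.get_text_embedding(doc.text) for doc in documents]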

# Display embedding dimensionality
print(f"Embedding Dimensionality: {len(embeddings[0])}")


In [None]:

# **Stage 3: Indexing with ChromaDB**

# Initialize ChromaDB Persistent Client
db = chromadb.PersistentClient(path="./chroma_db")

# Create or retrieve a collection in ChromaDB
chroma_collection = db.get_or_create_collection("revision_collection")

# Set up Chroma as the vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

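# Build a VectorStoreIndex: from_documents chunks the documents and embeds each chunk with embed_model
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, embed_model=ollama_embedding)

# No explicit save step is needed: PersistentClient already writes the collection to ./chroma_db.
# A later session can reload it with
# VectorStoreIndex.from_vector_store(vector_store, embed_model=ollama_embedding)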


In [None]:

# **Stage 4: Querying with Query Engines and Retrievers**

# Configure a retriever for similarity-based querying
retriever = index.as_retriever(similarity_top_k=3)
query = "Summarize the main points of the document."
retrieved_nodes = retriever.retrieve(query)

# Display retrieved nodes
for i, node in enumerate(retrieved_nodes, start=1):
    print(f"Node {i} Content:")
    print(node.get_content())
    print("-" * 50)

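The retriever above returns the `similarity_top_k=3` nodes whose embeddings score highest against the query embedding. The sketch below shows that ranking step in isolation, assuming cosine similarity as the score. The three-dimensional vectors, the node labels, and the helper names `cosine_similarity` and `top_k` are made up for illustration; real embeddings from `nomic-embed-text` have many more dimensions, and the actual search runs inside ChromaDB.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=3):
    """Return (node_id, score) pairs for the k most similar vectors, best first."""
    scored = (
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical node embeddings (toy values, not produced by a real model)
toy_nodes = {
    "revenue": [0.9, 0.1, 0.0],
    "risk": [0.1, 0.9, 0.0],
    "drivers": [0.7, 0.3, 0.1],
    "legal": [0.0, 0.2, 0.9],
}
toy_query_vec = [1.0, 0.0, 0.0]

results = top_k(toy_query_vec, toy_nodes, k=3)
for node_id, score in results:
    print(f"{node_id}: {score:.3f}")
```

With these toy values the ranking is `revenue`, `drivers`, `risk`; `legal` is orthogonal to the query vector and is dropped.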

In [None]:
# Configure the LLM used by the query engines and response synthesizers in Stage 5
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(model="llama3.2:latest", base_url="http://localhost:11434", temperature=0.1)

In [None]:

# **Stage 5: Response Synthesizers**

# Refine mode: answer from the first retrieved chunk, then refine that answer with each further chunk
refine_synthesizer = get_response_synthesizer(response_mode="refine")
refine_response = index.as_query_engine(response_synthesizer=refine_synthesizer).query(query)
print("Refine Mode Response:")
print("=====================")
print(refine_response)

# Compact mode: pack as many chunks as fit into each prompt first, then refine over the packed prompts
compact_synthesizer = get_response_synthesizer(response_mode="compact")
compact_response = index.as_query_engine(response_synthesizer=compact_synthesizer).query(query)
print("Compact Mode Response:")
print("======================")
print(compact_response)

# Tree summarize mode: answer over groups of chunks, then recursively combine those answers into one
tree_summarize_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
tree_response = index.as_query_engine(response_synthesizer=tree_summarize_synthesizer).query(query)
print("Tree Summarize Response:")
print("========================")
print(tree_response)

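The three modes above differ mainly in how the retrieved chunks are fed to the LLM, and therefore in how many LLM calls they make. The following is a simplified sketch of that control flow, not LlamaIndex's implementation: `CountingLLM` is a hypothetical stand-in that only counts calls, prompts are plain strings, and the context limit is a character budget (`max_chars`) rather than a token window.

```python
class CountingLLM:
    """Hypothetical stand-in for an LLM: counts calls and returns a short tag."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"<answer after call {self.calls}>"


def refine(llm, query, chunks):
    """One call per chunk; each call after the first refines the previous answer."""
    answer = None
    for chunk in chunks:
        if answer is None:
            prompt = f"Context: {chunk}\nQuestion: {query}"
        else:
            prompt = f"Existing answer: {answer}\nNew context: {chunk}\nRefine the answer to: {query}"
        answer = llm(prompt)
    return answer


def pack(chunks, max_chars):
    """Greedily join chunks so that each packed chunk stays within max_chars."""
    packed, current = [], ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def compact(llm, query, chunks, max_chars=40):
    """Pack chunks into fewer, larger prompts, then refine over the packed chunks."""
    return refine(llm, query, pack(chunks, max_chars))


def tree_summarize(llm, query, chunks, fan_in=2):
    """Answer over groups of fan_in chunks, then repeat on the answers until one remains."""
    if not chunks or fan_in < 2:
        raise ValueError("need at least one chunk and fan_in >= 2")
    level = list(chunks)
    while True:
        level = [
            llm(f"Context: {' | '.join(level[i:i + fan_in])}\nQuestion: {query}")
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


# Four toy chunks of 15 characters each
demo_chunks = ["a" * 15, "b" * 15, "c" * 15, "d" * 15]
toy_query = "Summarize the main points."

calls = {}
for name, mode in [("refine", refine), ("compact", compact), ("tree_summarize", tree_summarize)]:
    llm = CountingLLM()
    mode(llm, toy_query, demo_chunks)
    calls[name] = llm.calls
print(calls)  # {'refine': 4, 'compact': 2, 'tree_summarize': 3}
```

On four chunks, refine makes four sequential calls, compact makes two because two chunks fit in each prompt, and tree summarize makes three (two at the leaf level, one at the root). The leaf-level calls in tree summarize do not depend on each other, whereas each refine call must wait for the previous answer.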


# **Conclusion**

This notebook ties together the main stages covered in the course: loading a PDF, embedding its text, indexing it in ChromaDB, retrieving the most similar chunks, and synthesizing answers with the refine, compact, and tree summarize modes. Together these form an end-to-end pipeline for querying textual data.

You can extend this workflow further by experimenting with different embedding models, vector databases, or custom query logic. Happy learning!
