
# Comprehensive Revision Notebook
This notebook revises the key concepts from every stage of the course:
- Loading and Chunking
- Tokenization
- Embeddings and Vector Representations
- Vector Databases (ChromaDB)
- Indexing (VectorStoreIndex, Query Engines, Retrievers)
- Response Synthesizers (Refine, Compact, Tree Summarize, Accumulate, Compact Accumulate)

The cells below combine these components into one workflow: load a PDF, embed its text, index it in ChromaDB, retrieve relevant nodes, and synthesize responses with different response modes.


In [2]:

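# Install required libraries (the Ollama embedding and LLM integrations ship as separate packages)
%pip install chromadb llama-index llama-index-vector-stores-chroma llama-index-embeddings-ollama llama-index-llms-ollama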

# Import necessary libraries
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex, get_response_synthesizer
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding


Note: you may need to restart the kernel to use updated packages.


In [3]:

# **Stage 1: Loading and Chunking**

# Load a sample document (adjust file path as necessary)
documents = SimpleDirectoryReader(input_files=['../data_uber/uber_2021.pdf']).load_data(show_progress=True)

# Display the first document
print("First Document Content:")
print(documents[0].text[:500])  # Show first 500 characters
print("Length of the documents:", len(documents))


Loading files: 100%|██████████| 1/1 [00:08<00:00,  8.88s/file]

First Document Content:
UNITED STATES
SECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549 ____________________________________________ FORM 10-K____________________________________________ (Mark One)☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934For the fiscal year ended  December 31, 2021OR☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934For the transition period from_____ to _____            Commission File Number: 001-38902_________
Length of the documents: 307
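
Stage 1 loads the PDF but does not chunk it explicitly. The 307 documents reported above suggest the reader split the file roughly by page, and finer chunking is left to the indexing step. As a rough illustration of what chunking with overlap means, here is a minimal sketch in plain Python. The function `chunk_words` and its `chunk_size` and `overlap` parameters are hypothetical names chosen for this example, and splitting on whitespace is a simplification; it is not how LlamaIndex's node parsers work internally.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some context in the next chunk.
    (Hypothetical helper for illustration only.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for piece in chunk_words(sample, chunk_size=4, overlap=1):
    print(piece)
```

Larger overlaps preserve more context across boundaries at the cost of storing and embedding more duplicated text.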





In [4]:

# **Stage 2: Embeddings and Vector Representations**

# Initialize the embedding model
ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0}
)

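# Embed each loaded document directly so we can inspect the vectors.
# This is for illustration only: Stage 3 computes its own embeddings when it builds the index.
embeddings = [ollama_embedding.get_text_embedding(doc.text) for doc in documents]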

# Display embedding dimensionality
print(f"Embedding Dimensionality: {len(embeddings[0])}")


Embedding Dimensionality: 768
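
Each document is now a 768-dimensional vector, and texts with similar meaning should map to vectors that point in similar directions. One common way to measure that is cosine similarity. The sketch below is a plain-Python illustration using tiny hand-written vectors rather than real embeddings; the function name `cosine_similarity` is chosen for this example and is not taken from the notebook's libraries.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors a and b.

    1.0 means same direction, 0.0 means orthogonal, -1.0 means opposite.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors standing in for real 768-dimensional embeddings.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With the notebook's variables, the same function could be applied to any pair from `embeddings`, for example `cosine_similarity(embeddings[0], embeddings[1])`.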


In [5]:

# **Stage 3: Indexing with ChromaDB**

# Initialize ChromaDB Persistent Client
db = chromadb.PersistentClient(path="./chroma_db")

# Create or retrieve a collection in ChromaDB
chroma_collection = db.get_or_create_collection("revision_collection")

# Set up Chroma as the vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

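# Build a VectorStoreIndex: from_documents splits the documents into nodes and
# embeds them with ollama_embedding (the `embeddings` list from Stage 2 is not reused)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=ollama_embedding,
)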

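# Persistence: the PersistentClient already writes the collection to ./chroma_db,
# so no separate JSON file is needed. To reuse the stored vectors later, rebuild the index with
# VectorStoreIndex.from_vector_store(vector_store, embed_model=ollama_embedding)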


In [6]:

# **Stage 4: Querying with Query Engines and Retrievers**

# Configure a retriever for similarity-based querying
retriever = index.as_retriever(similarity_top_k=3)
query = "Summarize the main points of the document."
retrieved_nodes = retriever.retrieve(query)

# Display retrieved nodes
for i, node in enumerate(retrieved_nodes, start=1):
    print(f"Node {i} Content:")
    print(node.get_content())
    print("-" * 50)


Node 1 Content:
SECTION
 4.7.    Successors  and Assigns. This Agreement shall be binding upon and inure to the benefit of the parties hereto andtheir respective successors and assigns.SECTION 4.8.    GOVERNING LAW . THIS AGREEMENT AND THE OTHER LOAN DOCUMENTS AND ANY CLAIM,CONTROVERSY OR DISPUTE UNDER, ARISING OUT OF OR RELATING TO THIS AGREEMENT OR THE OTHER LOANDOCUMENTS  AND  THE  TRANSACTIONS  CONTEMPLATED  HEREBY  AND  THEREBY,  WHETHER  BASED  INCONTRACT (AT LAW OR IN EQUITY), TORT OR ANY OTHER THEORY, SHALL BE GOVERNED BY, AND CONSTRUEDIN  ACCORDANCE  WITH,  THE  LAW  OF  THE  STATE  OF  NEW  YORK  WITHOUT  REGARD  TO  CONFLICTS  OF  LAWRULES THAT WOULD RESULT IN THE APPLICATION OF A DIFFERENT GOVERNING LAW.[REMAINDER OF PAGE INTENTIONALLY LEFT BLANK] 4
--------------------------------------------------
Node 2 Content:
services) in connection with the Loans, the Letters of Credit, the Commitments or this Agreement.
(c)  The  Administrative  Agent  and  the  Arrangers  hereby  i
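
The retriever above returns the `similarity_top_k=3` nodes whose embeddings score highest against the query embedding. The idea can be sketched as a brute-force search in plain Python. This is an illustration under stated assumptions, not the notebook's actual retrieval path: the `top_k` helper and the toy `store` dictionary are hypothetical, the vectors are assumed to be unit length so that the dot product equals cosine similarity, and a real vector database such as Chroma uses an approximate nearest-neighbour index instead of scanning every vector.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vec: Sequence[float],
    store: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (node_id, score) pairs with the highest similarity to query_vec."""
    scored = ((dot(query_vec, vec), node_id) for node_id, vec in store.items())
    best = heapq.nlargest(k, scored)  # sorted from highest to lowest score
    return [(node_id, score) for score, node_id in best]


# Toy store: node ids mapped to unit-length 2-dimensional "embeddings".
store = {
    "node_a": [1.0, 0.0],
    "node_b": [0.6, 0.8],
    "node_c": [0.0, 1.0],
}
print(top_k([1.0, 0.0], store, k=2))
```

Raising `k` gives the synthesizer more context to work with, but also more text that may be only loosely related to the query.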

In [7]:
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# Use a local Ollama model as the default LLM for all query engines
Settings.llm = Ollama(model="llama3.2:latest", base_url="http://localhost:11434", temperature=0.1)

In [8]:

# **Stage 5: Response Synthesizers**

# Refine Mode
refine_synthesizer = get_response_synthesizer(response_mode="refine")
refine_response = index.as_query_engine(response_synthesizer=refine_synthesizer).query(query)
print("Refine Mode Response:")
print("=====================")
print(refine_response)

# Compact Mode
compact_synthesizer = get_response_synthesizer(response_mode="compact")
compact_response = index.as_query_engine(response_synthesizer=compact_synthesizer).query(query)
print("Compact Mode Response:")
print("======================")
print(compact_response)

# Tree Summarize Mode
tree_summarize_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
tree_response = index.as_query_engine(response_synthesizer=tree_summarize_synthesizer).query(query)
print("Tree Summarize Response:")
print("========================")
print(tree_response)


Refine Mode Response:
The main points of this document appear to be:

1. The agreement is binding on the parties and their respective successors and assigns.
2. The governing law for disputes arising out of or related to the transactions contemplated by the agreement is the state of New York, with its laws being applied without regard to conflicts of law rules.
Compact Mode Response:
The document appears to outline the terms and conditions of an agreement between parties. Key aspects include:

 Binding obligations for successors and assigns.

 Governing law and jurisdiction, with disputes to be resolved in accordance with New York state laws.

 The remainder of the page is intentionally left blank, suggesting that additional details or clauses may be included elsewhere in the document.
Tree Summarize Response:
The main points of this document appear to be:

1. The agreement is binding on the parties and their respective successors and assigns.
2. The governing law for disputes arising 
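
The response modes differ mainly in how they group the retrieved chunks into LLM calls. The sketch below illustrates the general strategies in plain Python, including the accumulate mode listed in the introduction but not run above. It is a simplified model, not LlamaIndex's implementation: `CountingLLM` is a hypothetical stand-in that only counts calls, the prompts are placeholders, `pack` limits prompts by character count rather than tokens, and the `fanout` parameter is an assumption made for this example.

```python
from typing import Callable, List

LLM = Callable[[str], str]


class CountingLLM:
    """Hypothetical stand-in for a real LLM: counts calls and returns a tag."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer-{self.calls}"


def pack(chunks: List[str], max_chars: int) -> List[str]:
    """Greedily join chunks so each packed string stays within max_chars."""
    packed: List[str] = []
    current = ""
    for chunk in chunks:
        candidate = f"{current}\n{chunk}" if current else chunk
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = chunk
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


def refine(llm: LLM, query: str, chunks: List[str]) -> str:
    """One call per chunk; each call revises the previous answer."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    answer = llm(f"Question: {query}\nContext: {chunks[0]}")
    for chunk in chunks[1:]:
        answer = llm(f"Question: {query}\nExisting answer: {answer}\nNew context: {chunk}")
    return answer


def compact(llm: LLM, query: str, chunks: List[str], max_chars: int) -> str:
    """Pack chunks into as few prompts as fit, then refine over the packed prompts."""
    return refine(llm, query, pack(chunks, max_chars))


def accumulate(llm: LLM, query: str, chunks: List[str]) -> str:
    """Answer the query against each chunk separately and join the answers."""
    answers = [llm(f"Question: {query}\nContext: {chunk}") for chunk in chunks]
    return "\n---\n".join(answers)


def tree_summarize(llm: LLM, query: str, chunks: List[str], fanout: int = 2) -> str:
    """Summarize groups of chunks, then summarize the summaries, until one remains."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    level = list(chunks)
    while True:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [llm(f"Question: {query}\nContext: " + "\n".join(g)) for g in groups]
        if len(level) == 1:
            return level[0]


chunks = ["aaaa", "bbbb", "cccc", "dddd"]
for name, run in [
    ("refine", lambda m: refine(m, "q", chunks)),
    ("compact", lambda m: compact(m, "q", chunks, max_chars=9)),
    ("accumulate", lambda m: accumulate(m, "q", chunks)),
    ("tree_summarize", lambda m: tree_summarize(m, "q", chunks, fanout=2)),
]:
    model = CountingLLM()
    run(model)
    print(name, "LLM calls:", model.calls)
```

In this toy setup, compact needs fewer calls than refine because it packs several chunks into each prompt, which is the usual reason to prefer it when the retrieved chunks are small.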


# **Conclusion**

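This notebook combines the main stages covered in the course (loading, embedding, indexing, querying, and response synthesis) into a single pipeline for querying textual data. Note that each response above is built only from the nodes the retriever returned, which here were loan-agreement clauses, so a broad query such as "Summarize the main points of the document" summarizes those nodes rather than the whole filing.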

You can extend this workflow further by experimenting with different embedding models, vector databases, or custom query logic. Happy learning!
