### Optional: run the following cell to see only your answers, not the HTTP logs

In [2]:
import logging

# Suppress detailed HTTP logs
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("llama_index").setLevel(logging.WARNING)
logging.getLogger("llama_index.llms.groq").setLevel(logging.WARNING)

# Optional: only show errors globally
logging.basicConfig(level=logging.ERROR)
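A quick way to check that a level change like the one above does what you expect is to log through a throwaway logger and see what gets through. The sketch below uses only the standard library; the logger name `demo.http_client` and both messages are made up for illustration and are not real httpx output.

```python
import io
import logging


def emitted_messages(logger_name: str, level: int) -> list[str]:
    """Set the logger to `level`, log one INFO and one WARNING record,
    and return the messages that were actually emitted."""
    logger = logging.getLogger(logger_name)
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)  # capture output in memory
    old_level, old_propagate = logger.level, logger.propagate
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # keep the demo records away from root handlers
    try:
        logger.info("request sent")
        logger.warning("retrying request")
    finally:
        # Restore the logger so the demo leaves no side effects behind.
        logger.removeHandler(handler)
        logger.setLevel(old_level)
        logger.propagate = old_propagate
    return buffer.getvalue().splitlines()


# With the level at WARNING, the INFO record is dropped.
print(emitted_messages("demo.http_client", logging.WARNING))
```

Calling the same helper with `logging.INFO` returns both messages, which is the noisy behaviour the cell above is meant to avoid.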

## Main code starts here

In [3]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [4]:
# Raises TypeError if GROQ_API_KEY is missing from .env, because os.getenv() then returns None
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY")
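If the key is missing, the assignment above fails with a fairly cryptic error. A small helper can make that failure easier to read. This is only a sketch: `require_env` is a hypothetical name, not part of python-dotenv or LlamaIndex, and `DEMO_API_KEY` is a stand-in variable so the example runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Stand-in variable for demonstration only; use GROQ_API_KEY in the notebook.
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook this would be called as `require_env("GROQ_API_KEY")` right after `load_dotenv()`.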

### **Load the PDF**

In [5]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("Data").load_data()

In [6]:
documents

[Document(id_='b8d55291-9231-4605-bd90-fabafa62a3b6', embedding=None, metadata={'page_label': '1', 'file_name': 'ReAct.pdf', 'file_path': '/Users/betopia/LlamaIndex/Data/ReAct.pdf', 'file_type': 'application/pdf', 'file_size': 633805, 'creation_date': '2025-11-03', 'last_modified_date': '2025-11-03'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Published as a conference paper at ICLR 2023\nREAC T: S YNERGIZING REASONING AND ACTING IN\nLANGUAGE MODELS\nShunyu Yao∗*,1, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2\n1Department of Computer Science, Princeton University\n2Google Research, Brain team\n1{sh

### **Configure the default global settings**

In [7]:
from llama_index.llms.groq import Groq
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter

Settings.llm = Groq(model="openai/gpt-oss-120b")
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
Settings.context_window = 4096
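To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window chunker. It is not how `SentenceSplitter` works internally (the real splitter counts tokens and tries to respect sentence boundaries); this stand-in counts whitespace-separated words, and `chunk_words` is a hypothetical helper written for this illustration.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` words, where consecutive
    windows share `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each chunk repeats the last word of the previous one, which is the role the overlap plays: context at a chunk boundary is not lost entirely.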



### **Create a semantic vector index**

In [8]:
# a vector store index only needs an embed model
index = VectorStoreIndex.from_documents(
    documents, embed_model=Settings.embed_model, show_progress=True
)

Parsing nodes: 100%|██████████| 64/64 [00:00<00:00, 318.58it/s]
Generating embeddings: 100%|██████████| 163/163 [00:02<00:00, 70.49it/s] 


### **Turn the semantic index into a question-answering engine**

In [9]:
query_engine = index.as_query_engine()

In [10]:
result = query_engine.query("What is transformer?")
print(result)

The Transformer is a neural architecture for sequence‑to‑sequence tasks that replaces traditional recurrent or convolutional components with a purely attention‑based mechanism. It follows an encoder‑decoder design: the encoder converts an input token sequence into a continuous representation, and the decoder generates the output sequence one token at a time, conditioning on previously produced tokens. All representations are computed through self‑attention, where each position attends to every other position in the same sequence, and this attention is performed in multiple parallel heads to preserve resolution. By using self‑attention exclusively, the model can capture dependencies between distant positions with a fixed number of operations, eliminating the need for sequence‑aligned recurrence.


### **Create a custom query engine that retrieves the 4 most similar chunks from the index, then drops any below a similarity cutoff before passing the rest to the LLM**

In [11]:
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

retriever = VectorIndexRetriever(index=index, similarity_top_k=4)
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.3)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
)
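The two stages above (take the top k by similarity, then drop anything under the cutoff) can be sketched in plain Python. This is an illustration of the idea, not LlamaIndex's implementation: the 2-D vectors are toy stand-ins for real embeddings, and `cosine` and `retrieve` are hypothetical helpers.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunk_vecs, top_k, cutoff):
    """Stage 1: keep the top_k most similar chunks.
    Stage 2: drop chunks whose similarity is below the cutoff."""
    scored = sorted(
        ((cosine(query_vec, vec), name) for name, vec in chunk_vecs.items()),
        reverse=True,
    )
    top = scored[:top_k]
    return [(name, score) for score, name in top if score >= cutoff]


# Toy embeddings, made up for illustration.
chunks = {
    "a": [1.0, 0.0],   # same direction as the query
    "b": [1.0, 1.0],   # fairly similar
    "c": [0.0, 1.0],   # orthogonal, similarity 0
    "d": [-1.0, 0.0],  # opposite direction
    "e": [1.0, 3.0],   # weakly similar, just above the cutoff
}
hits = retrieve([1.0, 0.0], chunks, top_k=4, cutoff=0.3)
print([name for name, _ in hits])  # ['a', 'b', 'e']
```

Chunk `c` makes the top 4 but is removed by the cutoff, so only three chunks would reach the LLM. Raising the cutoff trades recall for cleaner context.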

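### Keep the following token budget in mind for the next part
**used_tokens = len(system_prompt) + len(all_chunks) + len(question)  
available_context = context_window - used_tokens - num_output**

Here `len(...)` means length in tokens, not characters, and `num_output` is the number of tokens reserved for the model's answer.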
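The budget above can be sketched as a small calculation. Everything here is an assumption for illustration: `estimate_tokens` uses a rough rule of about 4 characters per token rather than a real tokenizer, and the value 256 for `num_output` is just an example figure.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token.
    A real tokenizer will give different counts."""
    return len(text) // 4


def available_context(context_window, num_output, system_prompt, chunks, question):
    """Tokens left over after the prompt parts and the reserved output."""
    used_tokens = (
        estimate_tokens(system_prompt)
        + sum(estimate_tokens(chunk) for chunk in chunks)
        + estimate_tokens(question)
    )
    return context_window - used_tokens - num_output


# Four retrieved chunks of roughly 512 tokens each (2048 characters),
# a 100-token system prompt and a 10-token question.
remaining = available_context(
    context_window=4096,
    num_output=256,  # example value, assumed for this sketch
    system_prompt="s" * 400,
    chunks=["x" * 2048] * 4,
    question="q" * 40,
)
print(remaining)  # 1682
```

With `chunk_size=512` and `similarity_top_k=4`, the retrieved chunks alone take about half of a 4096-token window, which is why the two settings need to be chosen together.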

In [16]:
response=query_engine.query("What is the YOLO?")

In [17]:
print(response)

YOLO (You Only Look Once) is a unified, real‑time object‑detection framework that treats detection as a single regression problem. A single neural network processes the whole image, divides it into a grid, and directly predicts bounding boxes, confidence scores, and class probabilities for all objects in the scene. This global, end‑to‑end approach enables fast inference (under 25 ms latency) with high average precision, while reducing background errors and improving generalization across domains.


### **Print the LlamaIndex response in a human-readable format**

In [18]:
from llama_index.core.response.pprint_utils import pprint_response
pprint_response(response, show_source=True)
print(response)

Final Response: YOLO (You Only Look Once) is a unified, real‑time
object‑detection framework that treats detection as a single
regression problem. A single neural network processes the whole image,
divides it into a grid, and directly predicts bounding boxes,
confidence scores, and class probabilities for all objects in the
scene. This global, end‑to‑end approach enables fast inference (under
25 ms latency) with high average precision, while reducing background
errors and improving generalization across domains.
______________________________________________________________________
Source Node 1/2
Node ID: bd63ddcd-3581-43d2-b28c-ab98caf2ce09
Similarity: 0.569550370275926
Text: This means we can process streaming video in real-time with less
than 25 milliseconds of latency. Furthermore, YOLO achieves more than
twice the mean average precision of other real-time systems. For a
demo of our system running in real-time on a webcam please see our
project webpage: http://pjreddie.com/yolo/. 
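The same source information can also be read programmatically from `response.source_nodes`, where each entry carries a `score` and a `node`. The helper below is a sketch: `summarize_sources` is a hypothetical name, and the demo runs against stand-in objects built with `SimpleNamespace` (with made-up IDs, scores and text) so that it works without an index or an API key.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars: int = 60):
    """Return (node_id, score, text preview) for each source node."""
    rows = []
    for source in response.source_nodes:
        # score can be None when a retriever does not report similarity
        score = None if source.score is None else round(source.score, 3)
        preview = source.node.get_content().replace("\n", " ")[:max_chars]
        rows.append((source.node.node_id, score, preview))
    return rows


# Stand-in response object for demonstration; in the notebook, pass the
# real `response` returned by query_engine.query(...).
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.5696,
            node=SimpleNamespace(node_id="node-1", get_content=lambda: "first\nchunk"),
        ),
        SimpleNamespace(
            score=None,
            node=SimpleNamespace(node_id="node-2", get_content=lambda: "second chunk"),
        ),
    ]
)
for row in summarize_sources(fake_response):
    print(row)
```

This is handy for checking which chunks survived the similarity cutoff and how close their scores were to it.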

### **Create or load a persistent vector index on disk so it can be reused across sessions without rebuilding, then query it with an LLM**

In [15]:
import os.path
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("Data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

# either way we can now query the index
query_engine = index.as_query_engine()
response = query_engine.query("What are transformers?")
print(response)

Transformers are neural models for sequence‑to‑sequence tasks that replace traditional recurrent or convolutional processing with a mechanism that relates all positions in a sequence to each other through self‑attention. They consist of an encoder that converts an input sequence into a continuous representation and a decoder that generates the output sequence one token at a time, using the encoder’s representations and previously generated tokens. By relying solely on self‑attention (often implemented with multiple attention heads), Transformers achieve constant‑time dependency modeling across the entire sequence without the need for sequence‑aligned recurrence.
