In [1]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.prompts.prompts import SimpleInputPrompt

In [2]:
documents = SimpleDirectoryReader("data").load_data()
print(len(documents))
print(documents[0])

64
Doc ID: b55057b6-05ba-4421-a8d9-33bc080d23ba
Text: Published as a conference paper at ICLR 2023 REAC T: S
YNERGIZING REASONING AND ACTING IN LANGUAGE MODELS Shunyu Yao∗*,1,
Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1,
Yuan Cao2 1Department of Computer Science, Princeton University
2Google Research, Brain team 1{shunyuy,karthikn}@princeton.edu
2{jeffreyzhao,dianyu,dunan,...


In [3]:
system_prompt=""" 
You are a knowledgeable and smart Q&A assistant.

Your goals:
- Use the provided context or retrieved documents to give accurate and relevant answers.
- If the context doesn’t contain the exact answer, say you don’t have enough information instead of guessing.
- Always summarize clearly and concisely.
- When possible, provide structured and easy-to-read responses (use bullet points or short paragraphs).
- Do not include unrelated or speculative content.
- If the user’s question is ambiguous, politely ask for clarification.
- Maintain a professional yet approachable tone.

Example style:
User: "What is LlamaIndex?"
Assistant: "LlamaIndex is a data framework that helps you connect external data (like files, databases, or APIs) to large language models. It simplifies loading, indexing, and querying data."

You always follow these principles for every response.
"""

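# StableLM-style chat markers (not Llama 2's [INST] format); the wrapper must keep
# a {query_str} slot, which is filled with each incoming question
query_wrapper_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")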
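A query wrapper is an ordinary `{name}`-style format string: `{query_str}` is filled with each incoming question. The plain-Python sketch below (no LlamaIndex calls; `demo_system_prompt` is a made-up stand-in) shows why an f-string is the wrong tool for building one: Python substitutes its values immediately, so no `{query_str}` slot survives and `str.format()` drops the question without complaint.

```python
# Made-up stand-in for the long system prompt defined above.
demo_system_prompt = "You are a knowledgeable and smart Q&A assistant."

# A plain string keeps {query_str} as a slot that is filled per query.
wrapper = "<|USER|>{query_str}<|ASSISTANT|>"
print(wrapper.format(query_str="What is YOLO?"))
# <|USER|>What is YOLO?<|ASSISTANT|>

# An f-string is evaluated immediately: the system prompt is baked in, no slot
# remains, and str.format() ignores the unused query_str argument.
baked = f"<|USER|>{demo_system_prompt}<|ASSISTANT|>"
print(baked.format(query_str="What is YOLO?") == baked)  # True
```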

In [4]:
import os
from typing import Optional

from dotenv import load_dotenv

load_dotenv()  # load variables from a local .env file into the process environment
HF_TOKEN: Optional[str] = os.getenv("HUGGINGFACE_API_KEY")
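`os.getenv` returns `None` when the key is missing, which only surfaces later as an authentication error from the API client. A minimal fail-fast sketch, assuming you prefer an early, explicit error; `require_env` is a hypothetical helper, not part of `python-dotenv`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in your shell")
    return value


# Hypothetical usage with the key name this notebook reads:
# HF_TOKEN = require_env("HUGGINGFACE_API_KEY")
```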

In [5]:
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

remotely_run = HuggingFaceInferenceAPI(
    model_name="deepseek-ai/DeepSeek-R1-0528",
    token=HF_TOKEN,
    provider="together",  # send requests to the Together AI inference provider
)


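DeepSeek-R1 is a reasoning model: it writes its chain of thought inside a `<think>...</think>` block before the final answer, as the query output further down shows. If only the answer should be displayed, a small regex post-processing step is one option. This is a sketch; `strip_reasoning` is a hypothetical helper, not a LlamaIndex or Hugging Face API.

```python
import re

# Matches one complete <think>...</think> block (DOTALL lets '.' span newlines)
# plus any whitespace that follows it.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Drop complete <think>...</think> blocks and return the remaining answer text."""
    return _THINK_RE.sub("", text).strip()


# Made-up sample in the shape DeepSeek-R1 returns.
raw = "<think>\nThe user asks about YOLO...\n</think>\n\nYOLO is a real-time object detector."
print(strip_reasoning(raw))  # YOLO is a real-time object detector.
```

Applied to a query result it would be `strip_reasoning(result.response)`. An unterminated block (for example, output cut off mid-reasoning by the token limit) is left in place rather than guessed at.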

In [6]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

2025-11-04 12:36:17,964 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


### **Now we configure the global Settings for the entire system**

In [None]:
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings

Settings.llm = HuggingFaceInferenceAPI(
    model_name="deepseek-ai/DeepSeek-R1-0528",
    token=HF_TOKEN,
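    provider="together",  # send requests to the Together AI inference provider
)
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)  # sizes are in tokens
Settings.num_output = 1000  # tokens reserved for the answer when LlamaIndex packs the prompt
Settings.context_window = 4096  # context size LlamaIndex assumes when packing the prompt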

2025-11-04 12:36:21,463 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
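`chunk_size=512` and `chunk_overlap=20` are token counts: each node holds at most 512 tokens and neighbouring nodes share 20, so a new chunk starts every 492 tokens. The sketch below is a simplified fixed-window splitter over an already-tokenized list (`window_chunks` and the toy tokens are hypothetical); the real `SentenceSplitter` also prefers to break at sentence boundaries, so its chunks are often shorter than 512. The last lines do the budget arithmetic for the other two settings: reserving `num_output=1000` tokens out of a 4096-token window leaves roughly 3096 for the prompt, room for about six full chunks before template and question overhead.

```python
def window_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of at most chunk_size; neighbours share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # 512 - 20 = 492 tokens between chunk starts
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = [f"t{i}" for i in range(1200)]  # toy stand-in for one tokenized page
chunks = window_chunks(tokens, chunk_size=512, chunk_overlap=20)
print([len(c) for c in chunks])  # [512, 512, 216]
print(chunks[0][-20:] == chunks[1][:20])  # True: 20 shared tokens

# Prompt budget implied by the Settings values above.
context_window, num_output = 4096, 1000
prompt_budget = context_window - num_output
print(prompt_budget, prompt_budget // 512)  # 3096 6
```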


In [8]:
# building the index only needs the embed model; the LLM is used later, at query time
index = VectorStoreIndex.from_documents(
    documents, embed_model=embed_model, show_progress=True
)

Parsing nodes: 100%|██████████| 64/64 [00:00<00:00, 408.47it/s]
Generating embeddings: 100%|██████████| 163/163 [00:00<00:00, 171.44it/s]
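At query time the index embeds the question with the same model and returns the stored chunks whose vectors are closest to it; for the default in-memory vector store the score is cosine similarity, and `as_query_engine()` retrieves the top 2 unless `similarity_top_k` is set. The stdlib sketch below illustrates that ranking with made-up 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions); `cosine` and `top_k` are hypothetical helpers, not LlamaIndex functions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(chunks, key=lambda cid: cosine(query, chunks[cid]), reverse=True)
    return ranked[:k]


# Toy embeddings (hypothetical values, 3-d for readability).
chunk_vecs = {
    "yolo_intro": [0.9, 0.1, 0.0],
    "react_prompting": [0.1, 0.9, 0.2],
    "yolo_grid": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
print(top_k(query_vec, chunk_vecs))  # ['yolo_intro', 'yolo_grid']
```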


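In [9]:
from llama_index.core import PromptTemplate

# as_query_engine() has no `prompt` option; custom instructions belong in
# text_qa_template, which must keep the {context_str} and {query_str} slots
qa_template = PromptTemplate(
    system_prompt + "\nContext:\n{context_str}\n\nQuestion: {query_str}\nAnswer: "
)
query_engine = index.as_query_engine(text_qa_template=qa_template)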
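LlamaIndex prompt templates use Python's `{name}` format syntax, and the text-QA step fills `{context_str}` with the retrieved chunks and `{query_str}` with the question, so a template missing either slot loses that input. A quick stdlib check, sketched with a hypothetical `missing_vars` helper:

```python
import string


def missing_vars(template: str, required: set[str]) -> set[str]:
    """Return the required {placeholders} that a str.format-style template lacks."""
    present = {field for _, field, _, _ in string.Formatter().parse(template) if field}
    return required - present


REQUIRED = {"context_str", "query_str"}  # slots the text-QA step fills

ok = "Context:\n{context_str}\n\nQuestion: {query_str}\nAnswer: "
broken = "<|USER|>You are a Q&A assistant.<|ASSISTANT|>"
print(missing_vars(ok, REQUIRED))  # set()
print(sorted(missing_vars(broken, REQUIRED)))  # ['context_str', 'query_str']
```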

In [10]:
result = query_engine.query("What is YOLO?")
print(result.response)

<think>
Hmm, the user is asking about YOLO based on a research paper context. Let me analyze the provided text carefully.

The context describes YOLO as a real-time object detection system that processes streaming video with under 25ms latency. It mentions three key advantages: superior speed/accuracy ratio compared to other real-time systems, global image reasoning that reduces background errors, and strong generalization to new domains like artwork.

The technical details reveal it's a unified neural network that divides images into grids, with each grid cell predicting bounding boxes and confidence scores. The text specifically contrasts YOLO's approach with sliding window and region proposal methods like R-CNN, noting its end-to-end trainable architecture.

I should highlight its core innovation - unifying detection components into a single network - while mentioning its tradeoffs: slightly lower accuracy than state-of-the-art systems, particularly with small object localization. T