## RAG System with HuggingFace Embeddings and a Groq LLM

LOAD documents -> INDEX them as embeddings -> QUERY the index through an LLM
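
To make the three stages concrete before the real libraries come in, here is a minimal, dependency-free sketch of the same flow. Everything in it is an illustrative assumption: `embed` is a bag-of-words stand-in for a real embedding model, `toy_docs` replaces the files in `./data`, and `retrieve` only returns the best-matching text instead of passing it to an LLM.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# LOAD: hard-coded toy strings instead of files read from disk.
toy_docs = [
    "attention lets a model weigh relevant tokens in a sequence",
    "yolo predicts bounding boxes with a single network",
]

# INDEX: store one vector per document.
toy_index = [(doc, embed(doc)) for doc in toy_docs]

# QUERY: embed the question and return the most similar documents.
def retrieve(question, index, top_k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

print(retrieve("what is attention", toy_index))
```

In the notebook itself, a dense sentence-transformer model takes the place of `embed`, and the retrieved text is handed to the LLM as context rather than returned directly.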

In [None]:
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from langchain_community.embeddings import HuggingFaceEmbeddings
from llama_index.core.prompts.prompts import SimpleInputPrompt
from llama_index.llms.groq import Groq
from llama_index.core.response.pprint_utils import pprint_response


In [2]:
documents = SimpleDirectoryReader("./data").load_data()

type(documents), len(documents)

(list, 89)

In [3]:
system_prompt = """
You are a helpful Q&A assistant. Your goal is to answer questions as accurately
as possible based on the instructions and context provided.
"""

query_wrapper_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")
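
The wrapper template contains a single `{query_str}` placeholder. As a rough illustration of the substitution it describes, plain `str.format` is used below; `wrap_query` is a hypothetical helper, and the way `SimpleInputPrompt` applies the template internally may differ in detail.

```python
# Same template string as in the cell above.
wrapper_template = "<|USER|>{query_str}<|ASSISTANT|>"

def wrap_query(query_str):
    """Fill the user's question into the wrapper template."""
    return wrapper_template.format(query_str=query_str)

print(wrap_query("What is YOLO?"))
# prints: <|USER|>What is YOLO?<|ASSISTANT|>
```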

In [12]:
from llama_index.embeddings.langchain import LangchainEmbedding

embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)
# Groq serves the model through a hosted API, so local-inference options
# (tokenizer_name, device_map, generate_kwargs, model_kwargs) do not apply.
llm = Groq(
    model="openai/gpt-oss-120b",
    api_key=os.environ["GROQ_API_KEY"],
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    context_window=4096,  # token budget for prompt plus retrieved context
    max_tokens=256,       # cap on generated tokens; raise it if answers are cut off
    temperature=0.7,
)

2025-10-24 09:49:25,034 - INFO - Use pytorch device_name: cpu
2025-10-24 09:49:25,036 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
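
The cell above reads the key with `os.environ["GROQ_API_KEY"]`, which raises a bare `KeyError` when the variable is unset. One possible way to fail with a clearer message is sketched below; `require_env` is a hypothetical helper, not part of LlamaIndex or the Groq client.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a readable message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before creating the LLM client.")
    return value

# Possible usage in the cell above (left commented so this sketch runs without a key):
# api_key = require_env("GROQ_API_KEY")
```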


In [13]:
from llama_index.core import Settings

Settings.embed_model = embed_model
Settings.llm = llm
Settings.chunk_size = 1024

In [14]:
# embed_model, llm, and chunk_size are taken from Settings (configured in the previous cell).
index = VectorStoreIndex.from_documents(documents, show_progress=True)

Parsing nodes: 100%|██████████| 89/89 [00:00<00:00, 379.54it/s]
Generating embeddings: 100%|██████████| 147/147 [01:36<00:00,  1.52it/s]
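
The progress bars show 89 documents turning into 147 nodes: documents longer than the chunk size are split into several overlapping chunks, and each chunk is embedded separately. The sketch below illustrates that idea only. It counts whitespace-separated words instead of tokens, ignores sentence boundaries, and assumes an overlap of 200, so its counts will not match LlamaIndex's splitter exactly; `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(words, chunk_size, overlap):
    """Slide a fixed-size window over a word list, stepping by chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

short_page = ["w"] * 800    # fits in a single chunk
long_page = ["w"] * 1500    # needs a second, overlapping chunk

print(len(split_into_chunks(short_page, 1024, 200)))  # 1
print(len(split_into_chunks(long_page, 1024, 200)))   # 2
```

A smaller chunk size yields more, narrower nodes (finer retrieval, more embedding calls), while a larger one yields fewer nodes that each carry more context.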


In [25]:
query_engine = index.as_query_engine()

response = query_engine.query("What is Attention mechanism? Start the answer with 'The Attention mechanism is...'")
pprint_response(response, wrap_width=140)
print()

response = query_engine.query("What is YOLO?")
pprint_response(response, wrap_width=140)

2025-10-24 10:15:08,431 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


Final Response: The Attention mechanism is a component of neural network models that computes weighted relationships between elements of an
input sequence, allowing the model to focus on the most relevant tokens when processing each word. By assigning attention scores, it
captures dependencies that may span long distances in the text and reflects structural cues such as grammatical relationships, enabling
different attention heads to specialize in distinct linguistic tasks.



2025-10-24 10:15:09,122 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


Final Response: YOLO is a unified object‑detection system that uses a single convolutional neural network to directly predict multiple
bounding boxes and their class probabilities from whole images, enabling real‑time detection with high speed and accuracy.
