# RAG-based Assistant for Quickly Querying AWS Documentation

In this notebook we set up the pipeline and answer the four questions proposed in the challenge.

In [1]:
# imports
from sentence_transformers.cross_encoder import CrossEncoder
from vector_store import VectorStore
from dense_embedder import DenseEmbedder
from sparse_embedder import BM25Retriever
from generator import LLMGenerator
from config import MAX_TOKENS



**IMPORTANT:** Run the build_index.py file to chunk and index the data if you did not clone the data files from the repo, or if you want to update the index.

You can do this by running the following in a terminal window: python3 src/build_index.py

In [2]:
# aux function to format the prompt
def format_prompt(query: str, contexts: list[dict]) -> str:
    """Constructs the prompt with retrieved contexts and user query."""
    context_str = "\n\n".join(
        f"Chunk {i+1}:\n{chunk['text'].strip()}"
        for i, chunk in enumerate(contexts)
    )

    return f"""You are a helpful assistant answering questions about AWS SageMaker documentation. Base your answers on the provided context.
If the context does not contain enough information, you can answer with "I don't know".
Lastly, if you are sure about the answer, you can answer even if the information is not in the context.
There is always just one question, so do not answer multiple questions at once.
Give concise and accurate answers.

Context:
{context_str}

Question: {query}
Answer:"""
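
To make the layout of the context block concrete, here is a minimal sketch of how the retrieved chunks are joined before they are placed in the prompt. The two chunk dictionaries are made-up placeholders, not real retrieval results; only the `text` key is assumed, as in `format_prompt` above.

```python
# Hypothetical chunks standing in for retrieved results; only "text" is used.
example_chunks = [
    {"text": "  Amazon SageMaker is a managed ML service.\n"},
    {"text": "Endpoints serve real-time predictions.  "},
]

# Same joining logic as in format_prompt: number each chunk from 1,
# strip surrounding whitespace, and separate chunks with a blank line.
context_str = "\n\n".join(
    f"Chunk {i+1}:\n{chunk['text'].strip()}"
    for i, chunk in enumerate(example_chunks)
)

print(context_str)
```

Numbering the chunks gives the model a stable way to tell the passages apart, and stripping whitespace keeps the prompt compact.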

In [3]:
# Initialize components
dense_embedder = DenseEmbedder()
sparse_embedder = BM25Retriever()
generator = LLMGenerator()
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 

✅ LLM Loaded Successfully


In [5]:
# --- Step 1: Get user query ---
### Test any query you want here
query = "What is SageMaker?"

# --- Step 2: Retrieve from dense store ---
query_vec = dense_embedder.encode([query])
store = VectorStore(dim=query_vec.shape[1])
store.load("../data")
dense_results = store.search(query_vec, top_k=5)
dense_chunks = [meta for _, _, meta in dense_results]

# --- Step 3: Retrieve from sparse BM25 ---
sparse_embedder.load("../data")
sparse_results = sparse_embedder.search(query, top_k=5)
sparse_chunks = [meta for _, meta in sparse_results]

# --- Step 4: Merge and deduplicate ---
all_candidates = {doc["id"]: doc for doc in dense_chunks + sparse_chunks}
candidate_chunks = list(all_candidates.values())

# --- Step 5: Rerank ---
rerank_inputs = [(query, chunk["text"]) for chunk in candidate_chunks]
scores = reranker.predict(rerank_inputs)

for chunk, score in zip(candidate_chunks, scores):
    chunk["score"] = float(score)

top_reranked = sorted(candidate_chunks, key=lambda x: x["score"], reverse=True)[:5]
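
Steps 4 and 5 are repeated unchanged for every query in this notebook, so they could be factored into one helper. The following is a minimal sketch under stated assumptions: `merge_and_rerank` and `overlap_score` are hypothetical names that do not exist in the repo, and the word-overlap scorer is only a stand-in so the example runs without loading a model.

```python
from typing import Callable, Sequence


def merge_and_rerank(
    query: str,
    dense_chunks: list[dict],
    sparse_chunks: list[dict],
    score_fn: Callable[[Sequence[tuple[str, str]]], Sequence[float]],
    top_k: int = 5,
) -> list[dict]:
    """Merge two candidate lists, drop duplicate ids, and keep the top_k by score."""
    # Keyed by "id": a chunk returned by both retrievers is kept only once.
    unique = {doc["id"]: doc for doc in dense_chunks + sparse_chunks}
    # Work on copies so the retrievers' metadata dicts are not mutated.
    candidates = [dict(doc) for doc in unique.values()]
    if not candidates:
        return []

    scores = score_fn([(query, doc["text"]) for doc in candidates])
    for doc, score in zip(candidates, scores):
        doc["score"] = float(score)

    return sorted(candidates, key=lambda d: d["score"], reverse=True)[:top_k]


def overlap_score(pairs):
    """Stand-in scorer (hypothetical): number of query words found in the text."""
    return [
        sum(word in set(text.lower().split()) for word in q.lower().split())
        for q, text in pairs
    ]


dense = [{"id": 1, "text": "SageMaker endpoints"}, {"id": 2, "text": "IAM roles"}]
sparse = [{"id": 2, "text": "IAM roles"}, {"id": 3, "text": "What is SageMaker"}]
ranked = merge_and_rerank("what is sagemaker", dense, sparse, overlap_score, top_k=2)
print([doc["id"] for doc in ranked])  # [3, 1]
```

In the notebook itself, passing `reranker.predict` as `score_fn` should reproduce the behavior of the cell above, assuming it keeps returning one score per (query, text) pair.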

In [7]:
# --- Step 6: Format and generate response ---
prompt = format_prompt(query, top_reranked)
print("Token count:", generator.count_tokens(prompt))

print("\nGenerating answer...\n")
answer = generator.llmgenerate(prompt=prompt, max_tokens=MAX_TOKENS)

print("Answer:\n")
print(answer)
print('\n')
print(f'Source for the answer: {top_reranked[0]["source"]}')
print('\n')
print(f"Other possible relevant sources for further reading: {', '.join(chunk['source'] for chunk in top_reranked[1:])}")

Token count: 1268

Generating answer...

Answer:

Amazon SageMaker is a fully managed service that enables developers and data scientists to build, train, and deploy machine learning models. It provides integrated Jupyter authoring notebook instances for easy access to data sources and eliminates the need to manage servers.


Source for the answer: examples-sagemaker.md


Other possible relevant sources for further reading: integrating-sagemaker.md, sagemaker-projects-whatis.md, kubernetes-sagemaker-jobs.md, sagemaker-projects.md


In [8]:
# --- Step 1: Get user query ---
### Test any query you want here
query = "What are all AWS regions where SageMaker is available?"

# --- Step 2: Retrieve from dense store ---
query_vec = dense_embedder.encode([query])
store = VectorStore(dim=query_vec.shape[1])
store.load("../data")
dense_results = store.search(query_vec, top_k=5)
dense_chunks = [meta for _, _, meta in dense_results]

# --- Step 3: Retrieve from sparse BM25 ---
sparse_embedder.load("../data")
sparse_results = sparse_embedder.search(query, top_k=5)
sparse_chunks = [meta for _, meta in sparse_results]

# --- Step 4: Merge and deduplicate ---
all_candidates = {doc["id"]: doc for doc in dense_chunks + sparse_chunks}
candidate_chunks = list(all_candidates.values())

# --- Step 5: Rerank ---
rerank_inputs = [(query, chunk["text"]) for chunk in candidate_chunks]
scores = reranker.predict(rerank_inputs)

for chunk, score in zip(candidate_chunks, scores):
    chunk["score"] = float(score)

top_reranked = sorted(candidate_chunks, key=lambda x: x["score"], reverse=True)[:5]

In [9]:
# --- Step 6: Format and generate response ---
prompt = format_prompt(query, top_reranked)
print("Token count:", generator.count_tokens(prompt))

print("\nGenerating answer...\n")
answer = generator.llmgenerate(prompt=prompt, max_tokens=MAX_TOKENS)

print("Answer:\n")
print(answer)
print('\n')
print(f'Source for the answer: {top_reranked[0]["source"]}')
print('\n')
print(f"Other possible relevant sources for further reading: {', '.join(chunk['source'] for chunk in top_reranked[1:])}")

Token count: 1791

Generating answer...

Answer:

SageMaker is available in all supported AWS regions except Asia Pacific (Jakarta), Africa (Cape Town), Middle East (UAE), Asia Pacific (Hyderabad), Asia Pacific (Osaka), Asia Pacific (Melbourne), Europe (Milan), AWS GovCloud (US-East), Europe (Spain), China (Beijing), China (Ningxia), and Europe (Zurich) Region.


Source for the answer: sagemaker-notebook-no-direct-internet-access.md


Other possible relevant sources for further reading: sagemaker-notebook-instance-inside-vpc.md, sagemaker-compliance.md, aws-properties-sagemaker-model-containerdefinition.md, sagemaker-projects-whatis.md


In [10]:
# --- Step 1: Get user query ---
### Test any query you want here
query = "How to check if an endpoint is KMS encrypted?"

# --- Step 2: Retrieve from dense store ---
query_vec = dense_embedder.encode([query])
store = VectorStore(dim=query_vec.shape[1])
store.load("../data")
dense_results = store.search(query_vec, top_k=5)
dense_chunks = [meta for _, _, meta in dense_results]

# --- Step 3: Retrieve from sparse BM25 ---
sparse_embedder.load("../data")
sparse_results = sparse_embedder.search(query, top_k=5)
sparse_chunks = [meta for _, meta in sparse_results]

# --- Step 4: Merge and deduplicate ---
all_candidates = {doc["id"]: doc for doc in dense_chunks + sparse_chunks}
candidate_chunks = list(all_candidates.values())

# --- Step 5: Rerank ---
rerank_inputs = [(query, chunk["text"]) for chunk in candidate_chunks]
scores = reranker.predict(rerank_inputs)

for chunk, score in zip(candidate_chunks, scores):
    chunk["score"] = float(score)

top_reranked = sorted(candidate_chunks, key=lambda x: x["score"], reverse=True)[:5]

In [11]:
# --- Step 6: Format and generate response ---
prompt = format_prompt(query, top_reranked)
print("Token count:", generator.count_tokens(prompt))

print("\nGenerating answer...\n")
answer = generator.llmgenerate(prompt=prompt, max_tokens=MAX_TOKENS)

print("Answer:\n")
print(answer)
print('\n')
print(f'Source for the answer: {top_reranked[0]["source"]}')
print('\n')
print(f"Other possible relevant sources for further reading: {', '.join(chunk['source'] for chunk in top_reranked[1:])}")

Token count: 2011

Generating answer...

Answer:

You can check the compliance of an Amazon SageMaker endpoint configuration regarding KMS encryption using AWS Config rules such as 'sagemaker-endpoint-configuration-kms-key-configured'. If the rule returns NON_COMPLIANT, then the KMS key is not configured for the endpoint configuration.


Source for the answer: sagemaker-roles.md


Other possible relevant sources for further reading: sagemaker-endpoint-configuration-kms-key-configured.md, aws-properties-sagemaker-featuregroup-onlinestoreconfig.md, aws-properties-sagemaker-modelpackage-transformresources.md, kubernetes-sagemaker-components-tutorials.md


In [12]:
# --- Step 1: Get user query ---
### Test any query you want here
query = "What are SageMaker Geospatial capabilities?"

# --- Step 2: Retrieve from dense store ---
query_vec = dense_embedder.encode([query])
store = VectorStore(dim=query_vec.shape[1])
store.load("../data")
dense_results = store.search(query_vec, top_k=5)
dense_chunks = [meta for _, _, meta in dense_results]

# --- Step 3: Retrieve from sparse BM25 ---
sparse_embedder.load("../data")
sparse_results = sparse_embedder.search(query, top_k=5)
sparse_chunks = [meta for _, meta in sparse_results]

# --- Step 4: Merge and deduplicate ---
all_candidates = {doc["id"]: doc for doc in dense_chunks + sparse_chunks}
candidate_chunks = list(all_candidates.values())

# --- Step 5: Rerank ---
rerank_inputs = [(query, chunk["text"]) for chunk in candidate_chunks]
scores = reranker.predict(rerank_inputs)

for chunk, score in zip(candidate_chunks, scores):
    chunk["score"] = float(score)

top_reranked = sorted(candidate_chunks, key=lambda x: x["score"], reverse=True)[:5]

In [13]:
# --- Step 6: Format and generate response ---
prompt = format_prompt(query, top_reranked)
print("Token count:", generator.count_tokens(prompt))

print("\nGenerating answer...\n")
answer = generator.llmgenerate(prompt=prompt, max_tokens=MAX_TOKENS)

print("Answer:\n")
print(answer)
print('\n')
print(f'Source for the answer: {top_reranked[0]["source"]}')
print('\n')
print(f"Other possible relevant sources for further reading: {', '.join(chunk['source'] for chunk in top_reranked[1:])}")

Token count: 1355

Generating answer...

Answer:

SageMaker Geospatial capabilities are features of Amazon SageMaker that perform geospatial operations on your behalf using the AWS hardware managed by SageMaker. They can only perform operations that the user permits and require an execution role with the appropriate permissions to access AWS resources.


Source for the answer: sagemaker-geospatial-roles.md


Other possible relevant sources for further reading: sagemaker-geospatial-roles.md, integrating-sagemaker.md, examples-sagemaker.md, sagemaker-projects-whatis.md
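
In the output above, sagemaker-geospatial-roles.md appears both as the answer source and under further reading, because several top-ranked chunks can come from the same file. A minimal sketch of one way to list each file only once follows; `split_sources` is a hypothetical helper that is not part of the repo, and the file names are taken from the output above.

```python
def split_sources(ranked_chunks: list[dict]) -> tuple[str, list[str]]:
    """Return the top chunk's source and the other distinct sources, in rank order."""
    if not ranked_chunks:
        raise ValueError("ranked_chunks must not be empty")
    primary = ranked_chunks[0]["source"]
    others: list[str] = []
    for chunk in ranked_chunks[1:]:
        source = chunk["source"]
        # Skip the primary source and anything already listed.
        if source != primary and source not in others:
            others.append(source)
    return primary, others


# Sources in the order reported for the geospatial query.
example_ranked = [
    {"source": "sagemaker-geospatial-roles.md"},
    {"source": "sagemaker-geospatial-roles.md"},
    {"source": "integrating-sagemaker.md"},
    {"source": "examples-sagemaker.md"},
    {"source": "sagemaker-projects-whatis.md"},
]
primary, others = split_sources(example_ranked)
print("Source for the answer:", primary)
print("Other possible relevant sources:", ", ".join(others))
```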


The answers are quite good and closely tied to the documentation. With better models the whole pipeline would improve, from chunking to answer generation. Still, I would say this POC accomplished its objectives: it generates good-enough answers in a reasonable time, even running on my local machine (MacBook Air M4, 16 GB RAM, 256 GB SSD). More specifically, each query took between 10 and 40 seconds to produce an answer.
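
The timings above were observed informally. To measure them per query, a small wrapper around `time.perf_counter` would be enough. This is a sketch, not something the notebook currently does; `timed` is a hypothetical helper name, and `sum` stands in for the real call so the example runs on its own.

```python
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; any callable can be timed the same way.
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, took {seconds:.3f} s")
```

In the notebook this could wrap the generation step, for example `answer, seconds = timed(generator.llmgenerate, prompt=prompt, max_tokens=MAX_TOKENS)`, assuming the call signature used in the cells above.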