# AWS Documentation RAG-QA

## Notebook Purpose
Data Analysis for Chunking and Indexing
 
## Tasks
- Data Chunking
- Indexing

## Notable TODOs:
- Evaluation

---

# Setup

In [2]:
import nest_asyncio
nest_asyncio.apply()

from pathlib import Path
from IPython import get_ipython

ipython = get_ipython()
if Path.cwd().name == "notebooks":
	%cd ..

## Extensions
if 'autoreload' not in ipython.extension_manager.loaded:
	%load_ext autoreload
else:
	%reload_ext autoreload

if 'dotenv' not in ipython.extension_manager.loaded:
	%load_ext dotenv
else:
	%reload_ext dotenv

# if 'cudf.pandas' not in get_ipython().extension_manager.loaded:
# 	%load_ext cudf.pandas

%autoreload 2
%dotenv

/home/leobit/Development/Projects/aws-doc-ragqa/research


In [3]:
## Config / Utils
import config as cfg
from IPython.core.magic import register_cell_magic
from utils import save_obj, load_obj, run_api

@register_cell_magic
def pybash(line, cell):
	'''Run a bash cell after substituting Python globals into its {placeholders}.'''
	ipython.run_cell_magic('bash', '--no-raise-error', cell.format(**globals()))
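
The `pybash` magic above passes the cell body through `str.format` with the notebook's globals before handing it to bash. The following is a minimal sketch of only that substitution step, runnable outside IPython. `render_cell` is a hypothetical helper name, and the `cfg` object here is a stand-in built with `SimpleNamespace`, not the project's real `config` module.

```python
from types import SimpleNamespace


def render_cell(cell: str, namespace: dict) -> str:
    """Substitute {name} and {name.attr} placeholders, as pybash does before calling bash."""
    return cell.format(**namespace)


# Hypothetical stand-in for the notebook's `cfg` object (assumption, for illustration).
cfg = SimpleNamespace(path=SimpleNamespace(scripts="scripts"))

rendered = render_cell("bash {cfg.path.scripts}/notebook_info.sh", {"cfg": cfg})
print(rendered)

# Literal braces in a bash cell (e.g. ${HOME}) must be doubled, otherwise
# str.format tries to resolve them as placeholders and raises KeyError.
print(render_cell("echo ${{HOME}}", {}))
```

One consequence of this design is that any bash cell using `${VAR}` or brace expansion needs its braces doubled when run through `%%pybash`.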

### System Information

In [3]:
%%pybash
bash {cfg.path.scripts}/notebook_info.sh

## GLOBAL INFO
Conda Python Version: 3.12.9.final.0
Conda Base Path: /opt/miniconda3
Conda Base Version: 25.1.1

## ENVIRONMENT INFO
Active Environment: aws-doc-ragqa
Environment Python Version: Python 3.11.13
Environment Python Path: /opt/miniconda3/envs/aws-doc-ragqa/bin/python
Environment IPython Version: 9.1.0
Environment IPykernel Version: 6.29.5

## GPU INFO:
CUDA Device Initialized 

/home/leobit/Development/Projects/aws-doc-ragqa/research/scripts/notebook_info.sh: line 21: numba: command not found




GPU Info: Failed to initialize NVML: N/A
Failed to properly shut down NVML: N/A



---

# Questions

- What is SageMaker?
- What are all AWS regions where SageMaker is available?
- How to check if an endpoint is KMS encrypted?
- What are SageMaker Geospatial capabilities?

#### Chat with Amazon Q Developer

In [6]:
%%pybash
q chat

Welcome to Amazon Q!

💡 Get notified whenever Q CLI finishes responding. Just run q settings
    chat.enableNotifications true

/help all commands
ctrl + j new lines
ctrl + s fuzzy search

🤖 You are chatting with claude-4-sonnet

(To exit the CLI, press Ctrl+C or Ctrl+D again or type /quit)

```bash
> What is SageMaker?

> Amazon SageMaker is a fully managed machine learning service provided by AWS that enables developers and data scientists to build, train, and deploy machine learning models at scale.

Here are the key components and capabilities of SageMaker:

Core Features:
• **SageMaker Studio** - Integrated development environment for ML workflows
• **SageMaker Notebooks** - Managed Jupyter notebook instances for data exploration and model development
• **Built-in Algorithms** - Pre-built ML algorithms optimized for performance and scale
• **Custom Training** - Support for custom algorithms using popular frameworks like TensorFlow, PyTorch, and scikit-learn

Training and Tuning:
• **Automatic Model Tuning** - Hyperparameter optimization to find the best model configuration
• **Distributed Training** - Scale training across multiple instances
• **Spot Training** - Use EC2 Spot instances to reduce training costs

Deployment Options:
• **Real-time Endpoints** - Low-latency inference for real-time predictions
• **Batch Transform** - Process large datasets for batch predictions
• **Multi-Model Endpoints** - Host multiple models on a single endpoint
• **Serverless Inference** - Pay-per-use inference without managing infrastructure

Additional Services:
• **SageMaker Pipelines** - MLOps workflows for automating ML processes
• **SageMaker Feature Store** - Centralized repository for ML features
• **SageMaker Model Registry** - Version control and governance for ML models
• **SageMaker Clarify** - Detect bias and explain model predictions
• **SageMaker Data Wrangler** - Visual data preparation tool

SageMaker handles the underlying infrastructure, allowing you to focus on the ML aspects rather than managing servers, scaling, or deployment complexities. It integrates well with other AWS services and supports the entire ML lifecycle from data preparation to model monitoring in production.
```

---

```bash
> What are all AWS regions where SageMaker is available?

> Based on my testing, SageMaker is available in most AWS regions. Here are the regions where Amazon SageMaker is currently available:

US Regions:
• us-east-1 (N. Virginia)
• us-east-2 (Ohio)
• us-west-1 (N. California)
• us-west-2 (Oregon)

Europe Regions:
• eu-central-1 (Frankfurt)
• eu-west-1 (Ireland)
• eu-west-2 (London)
• eu-west-3 (Paris)
• eu-north-1 (Stockholm)

Asia Pacific Regions:
• ap-northeast-1 (Tokyo)
• ap-northeast-2 (Seoul)
• ap-northeast-3 (Osaka)
• ap-southeast-1 (Singapore)
• ap-southeast-2 (Sydney)
• ap-south-1 (Mumbai)

Other Regions:
• ca-central-1 (Canada Central)
• sa-east-1 (São Paulo)

Important Notes:
• SageMaker is available in most major AWS regions, but not all features may be available in every region
• Some newer SageMaker features or instance types might be rolled out to different regions at different times
• For the most current and detailed region availability, including specific feature availability, check the AWS Regional Services page or the SageMaker documentation
• Some specialized regions (like AWS GovCloud or China regions) may have different availability
```

---

```bash
> How to check if an endpoint is KMS Encrypted by SAGEMAKER_ENDPOINT_CONFIGURATION_KMS_KEY_CONFIGURED?

> You want to understand what the SAGEMAKER_ENDPOINT_CONFIGURATION_KMS_KEY_CONFIGURED compliance rule specifically checks for.

This AWS Config rule evaluates whether Amazon SageMaker endpoint configurations have KMS encryption configured. Here's what it specifically checks:

## What the Rule Validates

The rule checks if a SageMaker endpoint configuration has the KmsKeyId parameter specified, which enables encryption at rest for:
• Model artifacts
• Data captured for monitoring
• Any other data stored by the endpoint

## Rule Details

• **Rule Name**: SAGEMAKER_ENDPOINT_CONFIGURATION_KMS_KEY_CONFIGURED
• **Rule Type**: AWS Managed Config Rule
• **Resource Type**: AWS::SageMaker::EndpointConfig
• **Trigger**: Configuration changes and periodic evaluations

## Compliance Criteria

An endpoint configuration is COMPLIANT when:
• The KmsKeyId field is present and contains a valid KMS key ARN or alias
• The KMS key exists and is accessible

An endpoint configuration is NON_COMPLIANT when:
• The KmsKeyId field is missing, null, or empty
• The specified KMS key doesn't exist or isn't accessible

## What Gets Checked

The rule examines the endpoint configuration JSON structure for:
json
{
	"EndpointConfigName": "example-config",
	"ProductionVariants": [...],
	"KmsKeyId": "arn:aws:kms:region:account:key/key-id"  // This field
}


## Key Points

• This rule only checks the configuration - not whether endpoints using that configuration are actually deployed
• It validates the presence of KMS encryption settings, not the actual encryption of data
• The rule runs when endpoint configurations are created, modified, or during periodic evaluations
• It helps ensure compliance with data encryption requirements and security policies
```

---

```bash
> What are SageMaker Geospatial capabilities?

> Amazon SageMaker Geospatial capabilities provide a comprehensive set of tools for processing, analyzing, and visualizing geospatial data using machine learning. Here are the key capabilities:

## Core Geospatial Features

Earth Observation Jobs
• Process satellite imagery and geospatial datasets at scale
• Built-in algorithms for common geospatial tasks like land cover classification, crop monitoring, and change detection
• Support for popular satellite data sources (Sentinel, Landsat, etc.)

Geospatial Data Processing
• Raster and vector data processing capabilities
• Temporal analysis for time-series geospatial data
• Coordinate system transformations and projections
• Data fusion from multiple geospatial sources

## Pre-built Algorithms

Land Cover and Land Use
• Automatic classification of terrain types
• Urban development monitoring
• Deforestation tracking

Agriculture and Crop Monitoring
• Crop health assessment using vegetation indices
• Yield prediction models
• Irrigation and water stress analysis

Environmental Monitoring
• Flood detection and mapping
• Wildfire monitoring and risk assessment
• Air quality analysis

## Integration Capabilities

Data Sources
• Direct integration with AWS Open Data (satellite imagery)
• Support for custom geospatial datasets
• Integration with Amazon S3 for data storage

Visualization
• Built-in map visualization tools
• Integration with popular GIS tools
• Export capabilities for further analysis

ML Integration
• Seamless integration with SageMaker's ML capabilities
• Custom model training on geospatial data
• AutoML for geospatial use cases

## APIs and SDKs

• Python SDK for geospatial operations
• REST APIs for programmatic access
• Jupyter notebook integration for interactive analysis

SageMaker Geospatial essentially democratizes access to satellite data analysis and geospatial ML, making it easier for organizations to derive insights from Earth observation data without requiring deep expertise in geospatial processing.
```

#### Setting Gold Answers

In [5]:

eval_gold_qa = [
	{
		"id": "0",
		"query": "What is SageMaker?",
		"ground_truth": "Amazon SageMaker is a fully managed machine learning service provided by AWS that enables developers and data scientists to build, train, and deploy machine learning models at scale.",
	},
	{
		"id": "1",
		"query": "What are all AWS regions where SageMaker is available?",
		"ground_truth": "This question cannot be answered from the given source.",
	},
	{
		"id": "2",
		"query": "How to check if an endpoint is KMS encrypted?",
		"ground_truth": "Using the AWS Managed Config rule SAGEMAKER_ENDPOINT_CONFIGURATION_KMS_KEY_CONFIGURED to check if KMS is configured for Amazon SageMaker endpoint configuration.",
	},
	{
		"id": "3",
		"query": "What are SageMaker Geospatial capabilities?",
		"ground_truth": "This question cannot be answered from the given source as the file only informs about Geospatial Capabilities roles, and not the concept or features.",
	},
]
save_obj(eval_gold_qa, cfg.path.data.processed / "evaluation_gold_qa_dataset.pkl")
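
Before the gold set is used for scoring, it can help to check that every record has the same shape. The sketch below validates the `id` / `query` / `ground_truth` keys used above and round-trips a sample through `pickle`. The assumption that `save_obj` / `load_obj` are pickle-based is inferred only from the `.pkl` extension; `validate_gold_qa` is a hypothetical helper, and the `"..."` ground truths are placeholders.

```python
import pickle
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"id", "query", "ground_truth"}


def validate_gold_qa(records: list[dict]) -> None:
    """Raise ValueError if a record lacks a required key or an id is repeated."""
    seen = set()
    for rec in records:
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {rec.get('id')!r} is missing {sorted(missing)}")
        if rec["id"] in seen:
            raise ValueError(f"duplicate id {rec['id']!r}")
        seen.add(rec["id"])


# Placeholder records with the same keys as eval_gold_qa.
sample = [
    {"id": "0", "query": "What is SageMaker?", "ground_truth": "..."},
    {"id": "1", "query": "What are SageMaker Geospatial capabilities?", "ground_truth": "..."},
]
validate_gold_qa(sample)

# Round-trip through pickle (assumed to be what save_obj/load_obj do).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "gold.pkl"
    path.write_bytes(pickle.dumps(sample))
    restored = pickle.loads(path.read_bytes())

print(restored == sample)
```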

# Questions Evaluation 1

In [3]:
from aws_doc_ragqa.rag_pipeline import RagPipeline
rag_pipe = RagPipeline()

INFO:gql.transport.requests:>>> {"query": "query DefaultEntity {\n  viewer {\n    username\n    defaultEntity {\n      name\n    }\n  }\n}"}
INFO:gql.transport.requests:<<< {"data":{"viewer":{"username":"leoaieng","defaultEntity":{"name":"leoaieng-lb"}}}}
weave: Logged in as Weights & Biases user: leoaieng.
weave: View Weave data at https://wandb.ai/leoaieng-lb/aws-doc-ragqa-demo/weave


In [16]:
%%pybash
docker start {cfg.app.docker.container}

9ec1410b7a13


In [4]:
## Prompt Template
rag_pipe = RagPipeline()
prompt_template = rag_pipe.get_template()
prompt = prompt_template.format(
    context_str="== CONTEXT ==",
    query_str="== QUERY =="
)
print(prompt)

You are given with relevant information about AWS Documentation. Answer the user query only based on the information provided. Don't make up stuff.

User Query: == QUERY ==
Context: == CONTEXT ==
Answer:


In [9]:
eval_query = "What is SageMaker?"
llm_judge = rag_pipe.set_llm_judge()

In [10]:
response = rag_pipe.predict(eval_query)
print(response)

INFO:ragpipeline:Preparing to get index...
INFO:ragpipeline:The Model provider is already set.
INFO:ragpipeline:Connecting to Qdrant at http://localhost:6333, collection: sagemaker_docs_v1
INFO:httpx:HTTP Request: GET http://localhost:6333 "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:6333/collections/sagemaker_docs_v1/exists "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:6333/collections/sagemaker_docs_v1 "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:6333/collections/sagemaker_docs_v1/points/search "HTTP/1.1 200 OK"


{'response': "Based on the provided information, Amazon SageMaker is a fully managed service that enables developers and data scientists to build, train, and deploy machine learning (ML) models. It offers capabilities for every stage of the machine learning lifecycle, from model building and experimentation to deployment and management.\n\nThe context doesn't provide a comprehensive definition of SageMaker, but it does highlight some key features and use cases, such as:\n\n1. It's a fully managed service for machine learning.\n2. It provides tools for building, training, and deploying ML models.\n3. It offers project management capabilities to help teams collaborate more efficiently and maintain code consistency.\n4. SageMaker Projects can help organize all entities of the ML lifecycle under one project.\n5. It provides pre-built templates and allows for custom templates to set up ML infrastructure.\n6. It can integrate with various tools and services for CI/CD, including both AWS-nati

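In [88]:
# MetadataMode is assumed to come from LlamaIndex; it was not imported earlier in the notebook.
from llama_index.core.schema import MetadataMode

nodes = rag_pipe.retrieve_nodes(eval_query)
for node in nodes:
	# Print each retrieved chunk together with all of its metadata fields.
	print(node.get_content(metadata_mode=MetadataMode.ALL))
	# Three separator rules followed by a blank line between nodes.
	print(("-" * 100 + "\n") * 3)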

INFO:ragpipeline:Preparing to get index...
INFO:ragpipeline:The Model provider is already set.
INFO:ragpipeline:Connecting to Qdrant at http://localhost:6333, collection: sagemaker_docs_v1
INFO:httpx:HTTP Request: GET http://localhost:6333 "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://localhost:6333/collections/sagemaker_docs_v1 "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:6333/collections/sagemaker_docs_v1/points/search "HTTP/1.1 200 OK"


file_name: examples-sagemaker.md
category: tutorial
header_path: /

# Working with Amazon SageMaker<a name="examples-sagemaker"></a>

 Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning \(ML\) models\. See the following resources for complete code examples with instructions\.

 [Link to Github](https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/example_code/sagemaker) 

 [Link to AWS Code Sample Catalog](http://docs.aws.amazon.com/code-samples/latest/catalog/code-catalog-javav2-example_code-sagemaker.html)
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------

file_name: examples-sagemaker.md
category: tutorial
header_path:

# Evaluation

Correctness:  
1 (worst) - 5 (best)

Context Relevancy:  
0 (worst) - 1 (best)

Faithfulness:  
0 (worst) - 1 (best)

LLM JUDGE:
Claude 3.5 Sonnet
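
Because correctness is scored on 1 to 5 while the other two metrics use 0 to 1, comparing them side by side is easier after rescaling. The sketch below is one possible way to do that; `METRIC_RANGES` and `normalize` are hypothetical names, and the example scores are made up for illustration, not results from this notebook.

```python
# Score ranges as listed above: (worst, best).
METRIC_RANGES = {
    "correctness": (1.0, 5.0),
    "context_relevancy": (0.0, 1.0),
    "faithfulness": (0.0, 1.0),
}


def normalize(metric: str, score: float) -> float:
    """Map a raw score onto 0..1 using the metric's declared range."""
    low, high = METRIC_RANGES[metric]
    if not low <= score <= high:
        raise ValueError(f"{metric} score {score} outside [{low}, {high}]")
    return (score - low) / (high - low)


# Made-up scores for illustration only.
example = {"correctness": 4.0, "context_relevancy": 0.5, "faithfulness": 1.0}
print({name: normalize(name, value) for name, value in example.items()})
```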

In [21]:
import pprint
pp = pprint.PrettyPrinter(indent=2)

### AWS
Model: Claude 3 Haiku

In [22]:
# Configuration
pp.pprint(cfg.app.aws.config)

{ 'context_size': 200000,
  'embedding_model_id': 'amazon.titan-embed-text-v2:0',
  'llm_model_id': 'us.anthropic.claude-3-haiku-20240307-v1:0',
  'region': 'us-east-2',
  'similarity_top_k': 5,
  'temperature': 0.4}


In [14]:
%%pybash
uv run python aws_doc_ragqa/run_eval.py

INFO:ragpipeline:Preparing to get index...
INFO:ragpipeline:The current model is: None
INFO:ragpipeline:Setting AWS Model...
INFO:ragpipeline:Setting AWS LLM: us.anthropic.claude-3

In [26]:
# Configuration
pp.pprint(cfg.app.aws.config)

{ 'context_size': 200000,
  'embedding_model_id': 'amazon.titan-embed-text-v2:0',
  'llm_model_id': 'us.anthropic.claude-3-haiku-20240307-v1:0',
  'region': 'us-east-2',
  'similarity_top_k': 3,
  'temperature': 0.1}
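
The two AWS configurations printed above differ in only a couple of fields. A small dictionary diff, sketched below with a hypothetical `diff_config` helper, makes the change between runs explicit; the values are copied from the two printouts.

```python
def diff_config(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = old.keys() | new.keys()
    return {k: (old.get(k), new.get(k)) for k in sorted(keys) if old.get(k) != new.get(k)}


# First AWS run (values copied from the first printout).
run_a = {
    "context_size": 200000,
    "embedding_model_id": "amazon.titan-embed-text-v2:0",
    "llm_model_id": "us.anthropic.claude-3-haiku-20240307-v1:0",
    "region": "us-east-2",
    "similarity_top_k": 5,
    "temperature": 0.4,
}
# Second AWS run: only retrieval depth and temperature are changed.
run_b = {**run_a, "similarity_top_k": 3, "temperature": 0.1}

print(diff_config(run_a, run_b))
```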


In [27]:
%%pybash
uv run python aws_doc_ragqa/run_eval.py

INFO:ragpipeline:Preparing to get index...
INFO:ragpipeline:The current model is: None
INFO:ragpipeline:Setting AWS Model...
INFO:ragpipeline:Setting AWS LLM: us.anthropic.claude-3

### GEMINI
Model: Gemini 1.5 Flash

In [29]:
# Configuration
pp.pprint(cfg.app.gemini.config)

{ 'embedding_model_id': 'models/embedding-001',
  'llm_model_id': 'models/gemini-1.5-flash',
  'similarity_top_k': 3,
  'temperature': 0.1}


In [30]:
%%pybash
uv run python aws_doc_ragqa/run_eval.py --gemini

INFO:ragpipeline:Preparing to get index...
INFO:ragpipeline:The current model is: None
INFO:ragpipeline:Setting Gemini Model...
INFO:ragpipeline:Setting Gemini LLM: models/gemini-1

In [32]:
# Configuration
pp.pprint(cfg.app.gemini.config)

{ 'embedding_model_id': 'models/embedding-001',
  'llm_model_id': 'models/gemini-1.5-flash',
  'similarity_top_k': 5,
  'temperature': 0.2}


In [33]:
%%pybash
uv run python aws_doc_ragqa/run_eval.py --gemini

INFO:ragpipeline:Preparing to get index...
INFO:ragpipeline:The current model is: None
INFO:ragpipeline:Setting Gemini Model...
INFO:ragpipeline:Setting Gemini LLM: models/gemini-1

Model: Gemini 2.5 Pro

In [35]:
# Configuration
pp.pprint(cfg.app.gemini.config)

{ 'embedding_model_id': 'models/embedding-001',
  'llm_model_id': 'models/gemini-2.5-pro',
  'similarity_top_k': 3,
  'temperature': 0.2}


In [37]:
%%pybash
uv run python aws_doc_ragqa/run_eval.py --gemini

INFO:ragpipeline:Preparing to get index...
INFO:ragpipeline:The current model is: None
INFO:ragpipeline:Setting Gemini Model...
INFO:ragpipeline:Setting Gemini LLM: models/gemini-2