# Improve RAG Accuracy with Finetuned Embedding Models on Amazon SageMaker

<a id='install-dependencies'></a>
## Install Dependencies

Before we can fine-tune the embedding model and deploy it on Amazon SageMaker, the environment needs the Python libraries used throughout this notebook. Let's go through them one by one:

**1. Amazon SageMaker Python SDK:**

`pip install sagemaker --upgrade`

What it is: The Amazon SageMaker Python SDK lets us interact with Amazon SageMaker services from Python. It covers managing and deploying machine learning models, including training and endpoint deployment, on SageMaker's infrastructure.

Why we need it: In this notebook we use the SageMaker SDK to:

- Resolve the IAM execution role that SageMaker uses on our behalf.
- Wrap the packaged model in a `HuggingFaceModel` object.
- Deploy the fine-tuned embedding model as a SageMaker endpoint so we can use it for inference (generating embeddings for our RAG system).

**2. Hugging Face transformers and datasets libraries:**

`pip install transformers datasets`

What they are:

- `transformers`: This library from Hugging Face is the foundation for working with pre-trained transformer models, including the models behind Sentence Transformers. It provides classes and functions to load, use, and fine-tune these models.
- `datasets`: Also from Hugging Face, `datasets` downloads and manages datasets for machine learning tasks and provides efficient data loading and processing.

Why we need them:

- `transformers`: The pre-trained `sentence-transformers/all-MiniLM-L6-v2` model is transformer-based, so `transformers` is used during fine-tuning. The inference script also uses its `AutoTokenizer` and `AutoModel` classes directly.
- `datasets`: This notebook loads the Bedrock FAQ data directly from a JSON file, so `datasets` is not called explicitly. It becomes useful if you adapt the notebook to a dataset from the Hugging Face Hub or need to process data in a specific format.

**3. Sentence Transformers Library:**

`pip install sentence-transformers`

What it is: `sentence-transformers` is a Python library built for creating and working with sentence and text embeddings. It simplifies using pre-trained models (like `all-MiniLM-L6-v2`) and fine-tuning them for specific tasks, especially semantic similarity and information retrieval, which are central to improving RAG accuracy.

Why we need it: This is the core library for this notebook. We need `sentence-transformers` to:

- Load the `sentence-transformers/all-MiniLM-L6-v2` model.
- Use the `MultipleNegativesRankingLoss` function for fine-tuning.
- Generate sentence embeddings for our text data.

**4. (Optional but Recommended) Accelerate and Bitsandbytes (for faster training, especially on GPUs):**

`pip install accelerate bitsandbytes`

What they are:

- `accelerate`: Hugging Face `accelerate` simplifies distributed training and mixed-precision training. It can speed up fine-tuning, especially on GPUs.
- `bitsandbytes`: This library reduces memory usage during training through techniques such as 8-bit Adam optimization. It is most helpful when fine-tuning large models or when GPU memory is limited.

Why they are recommended:

- Speed up training: Fine-tuning deep learning models can be time-consuming. `accelerate` and `bitsandbytes` can reduce training time, especially on a SageMaker Notebook instance with a GPU or a SageMaker Training Job with GPUs.
- Memory efficiency: They can help you fine-tune larger models or handle larger datasets within the available GPU memory.

Note that the install cell below pins specific versions of each library and does not install `bitsandbytes`; add it yourself if you want to experiment with 8-bit optimizers.

After running the install cell, check that every installation completed without errors. You are then ready to load the data and fine-tune the model in the next section.


In [None]:
!pip install pathos==0.3.2
!pip install datasets==2.19.2
!pip install transformers==4.40.2
!pip install transformers[torch]==4.40.2
!pip install sentence_transformers==3.1.1
!pip install accelerate==1.0.0
!pip install sagemaker==2.224.1
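
To confirm that the environment matches the pinned versions, a small check along these lines can help. This is a sketch, not part of the original notebook: the `PINNED` dictionary simply copies the versions from the install cell, and `check_versions` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib import metadata

# Versions copied from the install cell above.
PINNED = {
    "pathos": "0.3.2",
    "datasets": "2.19.2",
    "transformers": "4.40.2",
    "sentence-transformers": "3.1.1",
    "accelerate": "1.0.0",
    "sagemaker": "2.224.1",
}

def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Print one line per package whose installed version differs from the pin.
for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, every pinned package is installed at the expected version.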

<a id='load-data-and-train-the-model'></a>
## Load Data and Train the Model

The following code snippet demonstrates how to load a training dataset from a JSON file, prepare the data for training, and then fine-tune the pre-trained model. After fine-tuning, the updated model is saved.
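
The training cell reads the keys `sentence1` and `sentence2` from every record, so `training.json` is assumed to be a JSON array of objects with those two string fields. The sketch below illustrates that shape with made-up placeholder records (they are not taken from the Bedrock FAQ dataset) and a hypothetical `validate_records` helper. It writes to a temporary `sample_training.json` so the real file is not overwritten.

```python
import json
import os
import tempfile

# Hypothetical placeholder records that only illustrate the expected shape.
sample_records = [
    {"sentence1": "What is the example service?",
     "sentence2": "The example service is a placeholder used for illustration."},
    {"sentence1": "How do I get started?",
     "sentence2": "Start by reading the placeholder documentation."},
]

def validate_records(records):
    """Return the indices of records missing a non-empty sentence1 or sentence2."""
    bad = []
    for i, rec in enumerate(records):
        ok = all(
            isinstance(rec.get(key), str) and rec[key].strip()
            for key in ("sentence1", "sentence2")
        )
        if not ok:
            bad.append(i)
    return bad

# Round-trip the sample through a temporary JSON file.
path = os.path.join(tempfile.mkdtemp(), "sample_training.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_records, f, indent=2)
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(validate_records(loaded))  # indices of malformed records, if any
```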

The `EPOCHS` variable determines the number of times the model will iterate over the entire training dataset during the fine-tuning process. A higher number of epochs typically leads to better convergence and potentially improved performance, but may also increase the risk of overfitting if not properly regularized.

In this example, the training set is small, with only 100 records, so we use a high value for `EPOCHS`. In real-world scenarios the training set is typically much larger, and `EPOCHS` should then be a one- or two-digit number to avoid overfitting the model to the training data.
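
To see why a high epoch count is affordable here, it helps to count optimizer steps. The sketch below assumes the batch size of 8 used in the training cell, that the last partial batch is kept (the `DataLoader` default), and no gradient accumulation. The 100,000-record figure is a made-up comparison point.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# This notebook: 100 records, batch size 8, 100 epochs.
small = total_training_steps(100, 8, 100)      # 13 batches per epoch * 100 epochs
# Hypothetical larger dataset: 100,000 records, batch size 8, 3 epochs.
large = total_training_steps(100_000, 8, 3)    # 12,500 batches per epoch * 3 epochs

print(small, large)
```

Even at 100 epochs, the small dataset takes far fewer steps than a few epochs over the larger hypothetical dataset.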


In [None]:
from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
from torch.utils.data import DataLoader
from sentence_transformers.evaluation import InformationRetrievalEvaluator
import json

def load_data(path):
    """Load the dataset from a JSON file."""
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    return data

dataset = load_data("training.json")


# Load the pre-trained model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Convert the dataset to the required format
train_examples = [InputExample(texts=[data["sentence1"], data["sentence2"]]) for data in dataset]

# Create a DataLoader object
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# Define the loss function
train_loss = losses.MultipleNegativesRankingLoss(model)

EPOCHS=100

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=EPOCHS,
    show_progress_bar=True,
)

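# Save the fine-tuned model to a local folder (a relative path, not /opt/ml/model).
# safe_serialization=False writes PyTorch .bin weights instead of safetensors.
model.save("opt/ml/model/", safe_serialization=False)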
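
For intuition about what `MultipleNegativesRankingLoss` optimizes, the idea can be sketched in plain Python: for each `(sentence1, sentence2)` pair in a batch, every other `sentence2` in the same batch is treated as a negative, and a cross-entropy loss pushes the matching pair to score highest. This is an illustrative sketch rather than the library's implementation. The toy 2-D vectors stand in for real embeddings, and the scale of 20 with cosine similarity is assumed here to mirror the library defaults.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct 'class' for anchor i is positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch: each anchor matches its own positive.
aligned = in_batch_negatives_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
# Same anchors, but the positives are swapped, so every pair is mismatched.
swapped = in_batch_negatives_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])

print(aligned < swapped)
```

The loss is close to zero when each anchor is most similar to its own positive and grows when another item in the batch scores higher. This is why larger batches give the model more negatives to separate.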

<a id='create-inference-script'></a>
## Create inference.py File

To deploy and serve the fine-tuned embedding model for inference, we create an `inference.py` Python script that serves as the entry point. This script implements two functions, `model_fn` and `predict_fn`, which the SageMaker Hugging Face inference container calls to load the model and to handle each request.

The `model_fn` function loads the fine-tuned embedding model and its tokenizer. The `predict_fn` function takes the input sentences, tokenizes them with the loaded tokenizer, and computes token embeddings with the fine-tuned model. To obtain a single vector per sentence, it mean-pools the token embeddings (ignoring padding positions through the attention mask) and then L2-normalizes the result. Finally, `predict_fn` returns the normalized embedding of the first input sentence as a list under the `vectors` key, which can be further processed or stored as required.
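
The pooling and normalization steps can be illustrated without PyTorch. The sketch below uses made-up 2-D token vectors and hypothetical helper names (`mean_pool`, `l2_normalize`). It mirrors the logic of the script's `mean_pooling` function and `F.normalize` call for a single sentence, but it is not the code that runs on the endpoint.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vec[d]
    count = max(sum(attention_mask), 1e-9)  # guard against an all-zero mask
    return [s / count for s in summed]

def l2_normalize(vec, eps=1e-12):
    """Scale the vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

# Two real tokens followed by one padding token that must not affect the mean.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # average of the first two vectors only
unit = l2_normalize(pooled)        # same direction, length 1

print(pooled, unit)
```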


In [None]:
%%writefile opt/ml/model/inference.py

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
import os

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def model_fn(model_dir, context=None):
    # Load the fine-tuned model and tokenizer from the "model" subfolder of the
    # unpacked model.tar.gz (not from the Hugging Face Hub)
    tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/model")
    model = AutoModel.from_pretrained(f"{model_dir}/model")
    return model, tokenizer

def predict_fn(data, model_and_tokenizer, context=None):
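    # Unpack the (model, tokenizer) tuple returned by model_fn
    model, tokenizer = model_and_tokenizer

    # Read the sentence(s) from the "inputs" field, falling back to the raw payload
    sentences = data.pop("inputs", data)

    # Tokenize sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')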

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    
    # Return a dictionary, which is JSON serializable.
    # Only the embedding of the first input sentence is returned.
    return {"vectors": sentence_embeddings[0].tolist()}


<a id='upload-the-model'></a>
## Upload the Model

After creating the `inference.py` script, we package it together with the fine-tuned embedding model into a single `model.tar.gz` file. This compressed file can then be uploaded to an Amazon S3 bucket, making it accessible for deployment as a SageMaker endpoint.

In [None]:
import boto3
import tarfile
import os

model_dir = "opt/ml/model"
model_tar_path = "model.tar.gz"

with tarfile.open(model_tar_path, "w:gz") as tar:
    tar.add(model_dir, arcname=os.path.basename(model_dir))
    
# Resolve the region and account ID to build the bucket name
session = boto3.Session()
region_name = session.region_name
account_id = session.client("sts").get_caller_identity()["Account"]

# Assumes the default SageMaker bucket for this account and region already exists
bucket_name = f"sagemaker-{region_name}-{account_id}"
s3_key = "model_trained_embedding/model.tar.gz"
model_path = f"s3://{bucket_name}/{s3_key}"

# Upload the archive to S3
s3 = session.client("s3")
with open(model_tar_path, "rb") as f:
    s3.upload_fileobj(f, bucket_name, s3_key)
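
Because the archive is built with `arcname=os.path.basename(model_dir)`, everything lands under a top-level `model/` folder inside `model.tar.gz`. That appears to be why `inference.py` loads from `f"{model_dir}/model"`. The sketch below reproduces the packaging step on a throwaway directory with dummy files to show the resulting layout. The helper `top_level_entries` and the dummy file contents are illustrative assumptions.

```python
import os
import tarfile
import tempfile

def top_level_entries(tar_path):
    """Return the sorted first path components found inside the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted({name.split("/")[0] for name in tar.getnames()})

# Build a stand-in for the local "opt/ml/model" folder with dummy files.
work = tempfile.mkdtemp()
fake_model_dir = os.path.join(work, "opt", "ml", "model")
os.makedirs(fake_model_dir)
for filename in ("config.json", "inference.py"):
    with open(os.path.join(fake_model_dir, filename), "w") as f:
        f.write("{}")

# Package it the same way as the cell above.
demo_tar = os.path.join(work, "model.tar.gz")
with tarfile.open(demo_tar, "w:gz") as tar:
    tar.add(fake_model_dir, arcname=os.path.basename(fake_model_dir))

print(top_level_entries(demo_tar))
```

Running the same helper on the real `model.tar.gz` before uploading is a quick way to catch a layout that does not match what `model_fn` expects.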
          

<a id='deploy-model-on-sagemaker'></a>
## Deploy Model on SageMaker

Finally, we deploy the fine-tuned model to a SageMaker endpoint using the `HuggingFaceModel` class from the SageMaker Python SDK.

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel
import sagemaker


# Create the Hugging Face model object
huggingface_model = HuggingFaceModel(
    model_data=model_path,                    # S3 path to the uploaded model.tar.gz
    role=sagemaker.get_execution_role(),      # IAM role with permissions to create an endpoint
    transformers_version="4.26",              # Transformers version of the inference container
    pytorch_version="1.13",                   # PyTorch version of the inference container
    py_version="py39",                        # Python version of the inference container
    entry_point="opt/ml/model/inference.py",  # local path to the inference script written earlier
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)



<a id='invoke-the-model'></a>
## Invoke the Model

You can invoke the endpoint with the `predict` method of the `predictor` object returned by `deploy`.

In [None]:
# example request: you always need to define "inputs"
data = {
   "inputs": "Are Agents fully managed?"
}

# request
predictor.predict(data)

<a id='compare-predictions'></a>
## Compare Predictions

To illustrate the impact of fine-tuning, we can compare the cosine similarity scores between two semantically related sentences using both the original pre-trained model and the fine-tuned model. A higher cosine similarity score indicates that the two sentences are more semantically similar, as their embeddings are closer in the vector space.

In [None]:
from sentence_transformers import SentenceTransformer, util

pretrained_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


sentences = [
    "What are Agents, and how can they be used?",
    "Agents for Amazon Bedrock are fully managed capabilities that automatically break down tasks, create an orchestration plan, securely connect to company data through APIs, and generate accurate responses for complex tasks like automating inventory management or processing insurance claims.",
]

# Compute an embedding for each sentence
embedding_x = pretrained_model.encode(sentences[0], convert_to_tensor=True)
embedding_y = pretrained_model.encode(sentences[1], convert_to_tensor=True)

util.pytorch_cos_sim(embedding_x, embedding_y)

In [None]:
from sentence_transformers import SentenceTransformer, util

data1 = {
   "inputs": 
    "What are Agents, and how can they be used?"
}

data2 = {
   "inputs": 
    "Agents for Amazon Bedrock are fully managed capabilities that automatically break down tasks, create an orchestration plan, securely connect to company data through APIs, and generate accurate responses for complex tasks like automating inventory management or processing insurance claims."
}



el1 = predictor.predict(data1)
el2 = predictor.predict(data2)

util.pytorch_cos_sim(el1["vectors"], el2["vectors"])
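
Since `predict_fn` L2-normalizes its output, the cosine similarity of two vectors returned by the endpoint reduces to a plain dot product. The sketch below checks this on toy vectors. The values are made up and are not model output, and the helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
a, b = l2_normalize(raw_a), l2_normalize(raw_b)

# For unit-length vectors the dot product and the cosine similarity agree.
print(abs(dot(a, b) - cosine_similarity(raw_a, raw_b)) < 1e-12)
```

This matters when storing the endpoint's vectors in a vector database: an index configured for inner product should rank results the same way as one configured for cosine similarity, provided every stored vector comes from this normalized endpoint.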
