# Workshop: Text2Text Generation with SageMaker

Welcome to this workshop on Text2Text Generation with SageMaker. In this workshop, we will be using a pre-trained model deployed on a SageMaker endpoint to perform text-to-text generation tasks.
The workshop is divided into several sections:

1. **Setting up the environment:** In this section, we will import necessary libraries and define some helper functions.
2. **Querying the endpoint:** We will define some example input texts and use them to query the SageMaker endpoint.
3. **Advanced features:** We will explore some advanced features of the model, such as controlling the length of the generated text and the number of output sequences returned.
4. **Using the langchain library:** We will use the langchain library to create a question answering chain and perform similarity searches on a set of documents.
5. **Cleaning up:** Finally, we will shut down the SageMaker endpoint to avoid incurring unnecessary costs.

Let's get started!

## Setting up the environment

In this section, we will import the necessary libraries and define some helper functions that we will use throughout the workshop.

We will be using the `json` and `boto3` libraries. The `json` library provides functions for working with JSON data, and the `boto3` library allows us to interact with AWS services, including SageMaker.

Let's start by importing these libraries.

In [2]:
import json
import boto3

Next, we will define some example input texts. These are the texts that we will use to query the SageMaker endpoint. Each text states a task (for example, a translation or a recipe request), and the model returns the result of that task as generated text.

In [3]:
text1 = "Translate to German:  My name is Arthur"
text2 = "A step by step recipe to make bolognese pasta:"

Now, let's define the endpoint that you have created. We will use this endpoint to query the model and get the generated text. We will also define some formatting variables for better output visualization.

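The `endpoint_name` variable should be set to the name of the text generation endpoint that you have created, and `embedding_endpoint_name` to the name of the embedding endpoint, which we will use later in the langchain section. The `newline`, `bold`, and `unbold` variables are used to format the output text for better readability.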

In [4]:
newline, bold, unbold = '\n', '\033[1m', '\033[0m'
endpoint_name = 'jumpstart-dft-hf-text2text-flan-t5-xl'
embedding_endpoint_name = 'jumpstart-dft-hf-textembedding-gpt-j-6b-fp16'

Next, we will define a function to query the endpoint. This function will take the encoded text as input and return the response from the endpoint.

The `query_endpoint` function uses the `boto3` library to create a SageMaker runtime client. It then uses this client to invoke the SageMaker endpoint with the encoded text as input. The function returns the response from the endpoint.

In [5]:
def query_endpoint(encoded_text):
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/x-text', Body=encoded_text)
    return response

We will also define a function to parse the response from the endpoint. This function will extract the generated text from the response.

In [6]:
def parse_response(query_response):
    model_predictions = json.loads(query_response['Body'].read())
    generated_text = model_predictions['generated_text']
    return generated_text
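To make the response shape concrete, here is a minimal offline sketch of the parsing step. It assumes the endpoint returns a JSON body with a `generated_text` key, as `parse_response` expects. The `fake_response` dictionary and its contents are hypothetical stand-ins for a real `invoke_endpoint` result, with `io.BytesIO` playing the role of the streaming response body.

```python
import io
import json

def parse_response(query_response):
    # Same logic as the workshop helper: read the body stream and pick one field.
    model_predictions = json.loads(query_response["Body"].read())
    return model_predictions["generated_text"]

# Hypothetical stand-in for the dictionary returned by invoke_endpoint.
fake_body = json.dumps({"generated_text": "Ich bin Arthur."}).encode("utf-8")
fake_response = {"Body": io.BytesIO(fake_body)}

parsed_text = parse_response(fake_response)
print(parsed_text)
```

Note that the body is a stream, so it can only be read once. If you need the raw JSON again, keep the parsed dictionary around rather than calling `read()` a second time.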

Now, let's use these functions to query the endpoint with our example texts and print the generated text.

In [7]:
for text in [text1, text2]:
    query_response = query_endpoint(text.encode('utf-8'))
    generated_text = parse_response(query_response)
    print(f"Inference:{newline}"
          f"input text: {text}{newline}"
          f"generated text: {bold}{generated_text}{unbold}{newline}")

Inference:
input text: Translate to German:  My name is Arthur
generated text: Ich bin Arthur.

Inference:
input text: A step by step recipe to make bolognese pasta:
generated text: In a large saucepan, combine the ground beef, onion, garlic, tomato paste, tomato



## Advanced Features

The model we are using supports many advanced parameters that can be used to control the text generation process. These parameters include:

- **max_length:** This parameter controls the maximum length of the generated text. The model will generate text until the output length (which includes the input context length) reaches `max_length`.
- **num_return_sequences:** This parameter controls the number of output sequences returned by the model.
- **num_beams:** This parameter controls the number of beams used in beam search during text generation. With a value of 1 and sampling disabled, this reduces to greedy search.
- **no_repeat_ngram_size:** This parameter ensures that no sequence of `no_repeat_ngram_size` consecutive words (an n-gram) appears more than once in the output sequence.
- **temperature:** This parameter controls the randomness of the output. A higher temperature makes low-probability words more likely to be chosen, while a lower temperature concentrates the output on high-probability words.
- **early_stopping:** If set to True, text generation is finished when all beam hypotheses reach the end of sentence token.
- **do_sample:** If set to True, the model samples the next word according to its predicted probability instead of always picking the most likely word.
- **top_k:** In each step of text generation, the model will sample from only the `top_k` most likely words.
- **top_p:** In each step of text generation, the model will sample from the smallest possible set of words with cumulative probability `top_p`.
- **seed:** This parameter can be used to fix the randomized state for reproducibility.
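The effect of `temperature`, `top_k`, and `top_p` can be illustrated on a toy distribution. The sketch below is a simplified, pure-Python illustration and not the model's actual implementation: the function names and the example probabilities are hypothetical, and real generation libraries differ in details such as the order of filtering and renormalization.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)
    return {i: probs[i] / total for i in ranked}

toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
kept_by_k = filter_top_k_top_p(toy_probs, top_k=2)    # keeps tokens 0 and 1
kept_by_p = filter_top_k_top_p(toy_probs, top_p=0.9)  # keeps tokens 0, 1 and 2
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temperature=2.0)
```

With `do_sample` enabled, the next word would then be drawn at random from the filtered distribution rather than always being the single most likely word.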

We can specify any subset of these parameters when invoking the endpoint. In the next cell, we build an example payload that combines the input text with several of these arguments.

In [8]:
payload = {"text_inputs":"Tell me the steps to make a pizza", "max_length":50, "num_return_sequences":3, "top_k":50, "top_p":0.95, "do_sample":True}

We will now define a function to query the endpoint with a JSON payload. This function will take the encoded JSON as input and return the response from the endpoint.

The `query_endpoint_with_json_payload` function is similar to the `query_endpoint` function we defined earlier. The difference is that this function takes a JSON payload as input instead of a text. This allows us to pass the advanced parameters to the endpoint.

In [9]:
def query_endpoint_with_json_payload(encoded_json):
    client = boto3.client('runtime.sagemaker')
    response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=encoded_json)
    return response
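As a small convenience, the payload can be assembled and encoded by a helper before it is passed to `query_endpoint_with_json_payload`. The sketch below is one possible approach: `build_payload` and `SUPPORTED_PARAMETERS` are hypothetical names introduced for illustration, and the allowed parameter names are simply the ones listed earlier in this section.

```python
import json

# Parameter names taken from the list earlier in this section.
SUPPORTED_PARAMETERS = {
    "max_length", "num_return_sequences", "num_beams", "no_repeat_ngram_size",
    "temperature", "early_stopping", "do_sample", "top_k", "top_p", "seed",
}

def build_payload(text, **generation_parameters):
    """Return the UTF-8 encoded JSON body for a text plus optional parameters."""
    unknown = set(generation_parameters) - SUPPORTED_PARAMETERS
    if unknown:
        # Catch typos locally instead of sending them to the endpoint.
        raise ValueError(f"Unsupported parameters: {sorted(unknown)}")
    return json.dumps({"text_inputs": text, **generation_parameters}).encode("utf-8")

encoded = build_payload("Tell me the steps to make a pizza", max_length=50, do_sample=True)
decoded = json.loads(encoded)  # decode again only to inspect what would be sent
print(decoded)
```

Rejecting unknown keys early is a design choice of this sketch; it makes a misspelled parameter name fail immediately on the client side.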

We will also define a function to parse the response from the endpoint when multiple texts are returned. This function will extract the generated texts from the response.

The `parse_response_multiple_texts` function is similar to the `parse_response` function we defined earlier. The difference is that this function extracts the 'generated_texts' field from the JSON instead of the 'generated_text' field. This is because when we request multiple texts from the endpoint, the response contains a 'generated_texts' field with a list of generated texts.

In [10]:
def parse_response_multiple_texts(query_response):
    model_predictions = json.loads(query_response['Body'].read())
    generated_text = model_predictions['generated_texts']
    return generated_text

Now, let's use these functions to query the endpoint with our JSON payload and print the generated texts.

In [11]:
query_response = query_endpoint_with_json_payload(json.dumps(payload).encode('utf-8'))
generated_texts = parse_response_multiple_texts(query_response)
print(generated_texts)

['Place the pizza dough on a floured surface. Spread the pizza sauce on the dough. Spread the cheese on the pizza dough. Place the pizza on a baking sheet. Bake for 15 minutes at 450 degrees.', 'To make a pizza, you will first need to gather your ingredients. You will need a pizza pan, a large baking sheet, and a large pizza stone. You will also need to gather a large amount of toppings', 'Spread pizza sauce on dough. Top with cheese. Bake at 450 degrees for 20 minutes. Remove from oven and cut into slices.']


Before we proceed to the next steps, let's ensure that we have the necessary libraries installed. We will need the `langchain` library for the following steps. If it's not already installed, we can install it using pip.

The `langchain` library is a Python library that provides utilities for working with large language models. It includes utilities for creating prompts, querying endpoints, parsing responses, and more. We will use this library in the following steps to interact with our SageMaker endpoint.

In [12]:
!pip install --upgrade langchain --quiet

Now, let's import the necessary modules from the `langchain` library.

- `PromptTemplate`: This class is used to create a template for the prompts that we will pass to the language model. 
- `SagemakerEndpoint`: This class is used to interact with the SageMaker endpoint.
- `LLMContentHandler`: This class is used to handle the content that we send to and receive from the language model.
- `load_qa_chain`: This function is used to load a question-answering chain. A chain is a sequence of transformations applied to the input to generate an answer.
- `Document`: This class is used to create documents that the language model can use to find the answer to a question.
- `EmbeddingsContentHandler`: This class is used to handle the content that we send to and receive from the embedding model.
- `SagemakerEndpointEmbeddings`: This class is used to interact with the SageMaker embeddings endpoint.

In [13]:
from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.embeddings import SagemakerEndpointEmbeddings
import json
from typing import Dict, List

We will now create a content handler for the language model to transform input to a format that the SageMaker endpoint expects and output to a form that the language model class expects. We will also define some parameters for the model.

The `ContentHandler` class is a subclass of the `LLMContentHandler` class. It defines two methods:

- `transform_input`: This method takes a prompt and a dictionary of model parameters as input, and returns the input in a format that the SageMaker endpoint expects. In this case, it converts the input to a JSON string and encodes it to bytes.
- `transform_output`: This method takes the output from the SageMaker endpoint and returns it in a form that the language model class expects. In this case, it decodes the output from bytes to a string, parses the JSON, and returns the 'generated_texts' field.

The `parameters` dictionary defines the parameters that we will use when querying the language model. These parameters control the behavior of the language model, such as the maximum length of the generated text, the number of sequences to return, and the sampling strategy.

In [14]:
parameters = {
    "max_length": 200,
    "num_return_sequences": 1,
    "top_k": 250,
    "top_p": 0.95,
    "do_sample": False,
    "temperature": 1,
}

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json['generated_texts'][0]

llm_content_handler = ContentHandler()
sm_llm=SagemakerEndpoint(
            endpoint_name=endpoint_name,
            region_name="eu-west-1",
            model_kwargs=parameters,
            content_handler=llm_content_handler,
        )
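To see what the content handler sends and receives without calling the endpoint, the round trip can be exercised offline. In this sketch, `StandaloneContentHandler` is a hypothetical stand-in that copies the two methods but does not inherit from `LLMContentHandler`, and the endpoint reply is a made-up example wrapped in `io.BytesIO` to mimic the response stream.

```python
import io
import json
from typing import Dict

class StandaloneContentHandler:
    """Stand-in with the same two methods as the workshop's ContentHandler,
    but without the langchain base class, so it runs offline."""
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        # Merge the prompt and the generation parameters into one JSON body.
        return json.dumps({"text_inputs": prompt, **model_kwargs}).encode("utf-8")

    def transform_output(self, output) -> str:
        # `output` is a readable stream; return the first generated text.
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_texts"][0]

handler = StandaloneContentHandler()
request_body = handler.transform_input("What is SageMaker?", {"max_length": 200})

# Hypothetical endpoint reply, wrapped in a stream like the real response body.
fake_reply = io.BytesIO(
    json.dumps({"generated_texts": ["A managed ML service."]}).encode("utf-8")
)
answer = handler.transform_output(fake_reply)
print(request_body, answer)
```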

Next, we will define a prompt template and load a question answering chain. 

The prompt template is used to format the input to the language model. It accepts a set of parameters from the user that can be used to generate a prompt for a language model. 

The question answering chain is a sequence of transformations applied to the input to generate an answer.

The `PromptTemplate` class takes a template string and a list of input variables as arguments. The template string is a string that contains placeholders for the input variables. The placeholders are enclosed in curly braces `{}` and correspond to the names of the input variables. When we use the prompt template, we will replace the placeholders with the actual values of the input variables.

The `load_qa_chain` function loads a question answering chain. A chain is a sequence of transformations applied to the input to generate an answer. Chains allow us to combine multiple components together to create a single, coherent application. For example, we can create a chain that takes user input, formats it with a PromptTemplate, and then passes the formatted prompt to an LLM. In this case, the chain includes the language model and the prompt template.

In [15]:
prompt=PromptTemplate(
            template="Use the following pieces of context to answer the question at the end.\n{context}\nQuestion: {question}\nAnswer:",
            input_variables=["context", "question"]
        )
chain = load_qa_chain(
        llm=sm_llm,
        prompt=prompt,
    )
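The placeholder substitution can be previewed with plain Python string formatting. This is only an approximation of what `PromptTemplate` does, under the assumption that its default formatting behaves like `str.format` for simple `{name}` placeholders; the context and question strings are hypothetical examples.

```python
template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\nQuestion: {question}\nAnswer:"
)

# Fill both placeholders, as the chain would do before calling the model.
filled_prompt = template.format(
    context="Managed Spot Training uses Amazon EC2 Spot instances.",  # hypothetical context
    question="Which instances does Managed Spot Training use?",
)
print(filled_prompt)
```

The prompt deliberately ends with `Answer:` so that the model's generated text continues from that point.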

Now, let's test our question answering chain with a sample question and some context. The context is a list of documents that the model can use to find the answer to the question. Note that the single document used below is only a placeholder and does not contain the answer, so the model has nothing to ground its response in.

The `chain` object is callable: it takes a dictionary as input and returns the output of the chain. The input dictionary must contain the 'input_documents' and 'question' keys. The 'input_documents' key corresponds to a list of documents that the model can use to find the answer to the question. The 'question' key corresponds to the question that we want to answer.

In [16]:
query = "Which instances can I use with Managed Spot Training in SageMaker?"

input_documents = [Document(page_content="How can I help you?")]

chain({"input_documents": input_documents, "question": query}, return_only_outputs=True)

{'output_text': '(ii).'}

Next, we will create a content handler for embeddings. It transforms the input to a format that the SageMaker endpoint expects, and the output to a form that the embeddings class expects.

The `SagemakerEndpointEmbeddingsJumpStart` class is a subclass of the `SagemakerEndpointEmbeddings` class. It defines the `embed_documents` method, which computes document embeddings using a SageMaker Inference Endpoint. The method takes a list of texts and a chunk size as input, and returns a list of embeddings.

The `ContentHandler` class is a subclass of the `EmbeddingsContentHandler` class. It defines two methods:

- `transform_input`: This method takes a prompt and a dictionary of model parameters as input, and returns the input in a format that the SageMaker endpoint expects. In this case, it converts the input to a JSON string and encodes it to bytes.
- `transform_output`: This method takes the output from the SageMaker endpoint and returns it in a form that the embeddings class expects. In this case, it decodes the output from bytes to a string, parses the JSON, and returns the 'embedding' field.

In [17]:
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as request. If None, will use the
                chunk size specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size

        for i in range(0, len(texts), _chunk_size):
            response = self._embedding_func(texts[i : i + _chunk_size])
            results.extend(response)
        return results

class ContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        # The endpoint returns one embedding vector per input text.
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json["embedding"]
        return embeddings

embeddings_content_handler=ContentHandler()
embeddings = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=embedding_endpoint_name,
    region_name="eu-west-1",
    content_handler=embeddings_content_handler,
)
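The batching logic in `embed_documents` can be tried out without an endpoint. The sketch below is a simplified stand-alone version: `embed_in_chunks` and `fake_embed_batch` are hypothetical names, and the fake function returns a one-dimensional "embedding" (the text length) purely so the batching can be observed. The `or 1` guard for an empty input list is an addition that is not in the class above.

```python
def embed_in_chunks(texts, embed_batch, chunk_size=5):
    """Send texts to `embed_batch` in groups of at most `chunk_size`."""
    results = []
    step = min(chunk_size, len(texts)) or 1  # avoid a zero step for empty input
    for start in range(0, len(texts), step):
        results.extend(embed_batch(texts[start:start + step]))
    return results

batch_sizes = []

def fake_embed_batch(batch):
    # Hypothetical embedding function: records the batch size and
    # returns one tiny vector per text.
    batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

texts = [f"doc {i}" for i in range(12)]
vectors = embed_in_chunks(texts, fake_embed_batch, chunk_size=5)
print(batch_sizes)  # the 12 texts are sent as batches of 5, 5 and 2
```

Grouping texts this way reduces the number of requests sent to the endpoint while keeping each request body small.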

Now, we will download the data we will use for our documents from an S3 bucket and process it for our use.

In [18]:
original_data = "s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/"
!mkdir -p rag_data
!aws s3 cp --recursive $original_data rag_data

download: s3://jumpstart-cache-prod-us-east-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv to rag_data/Amazon_SageMaker_FAQs.csv


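We will now load the data into a pandas DataFrame. Each row of the CSV file holds a question and its answer; we drop the question column and keep only the answers, which will serve as our documents.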

In [19]:
import glob
import os
import pandas as pd
all_files = glob.glob(os.path.join("rag_data/", "*.csv"))

df_knowledge = pd.concat(
    (pd.read_csv(f, header=None, names=["Question", "Answer"]) for f in all_files),
    axis=0,
    ignore_index=True,
)
df_knowledge.drop(["Question"], axis=1, inplace=True)
df_knowledge.head(5)

| | Answer |
|---|---|
| 0 | Amazon SageMaker is a fully managed service to... |
| 1 | For a list of the supported Amazon SageMaker A... |
| 2 | Amazon SageMaker is designed for high availabi... |
| 3 | Amazon SageMaker stores code in ML storage vol... |
| 4 | Amazon SageMaker ensures that ML model artifac... |


We will now save the processed data to a CSV file.

In [20]:
df_knowledge.to_csv("rag_data/processed_data.csv", header=False, index=False)

Next, we will load the processed data using a CSVLoader. The CSVLoader is a utility that helps us load data from a CSV file.

In [21]:
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path="rag_data/processed_data.csv")
documents = loader.load()
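To get a feel for what the loader produces, the following sketch mimics the row-to-document conversion with the standard library. It rests on the assumption that `CSVLoader` reads the file with `csv.DictReader` and formats each row as `column: value` lines with `source` and `row` metadata; `rows_to_documents` is a hypothetical helper, and documents are represented as plain dictionaries.

```python
import csv
import io

def rows_to_documents(csv_text, source="example.csv"):
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents

# A single-column file written without a header line, like processed_data.csv.
headerless_csv = "First answer\nSecond answer\nThird answer\n"
docs_sketch = rows_to_documents(headerless_csv)
for doc in docs_sketch:
    print(doc)
```

Under this assumption, the first line of a headerless file is treated as the column name, so every document's content is prefixed with the first answer. This appears consistent with the retrieved documents shown later in the notebook, whose `page_content` begins with the first FAQ answer followed by a colon.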

We will now install the `faiss-cpu` library, which provides efficient similarity search and clustering of dense vectors.

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI that allows for efficient similarity search and clustering of dense vectors. So, given a set of vectors (in this case, a vector representation of a document, i.e. an embedding), we can index them using FAISS. Then, using another vector (the query vector), we search for the most similar vectors within the index.
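The idea of searching an index with a query vector can be sketched with an exhaustive search in pure Python. This is only a conceptual illustration: FAISS uses optimized index structures instead of comparing against every vector one by one, and the use of Euclidean (L2) distance, the function names, and the two-dimensional toy vectors are all assumptions made for the example.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_search(index, query_vector, k=3):
    """Return the texts of the k vectors closest to the query vector.

    `index` is a list of (text, vector) pairs.
    """
    ranked = sorted(index, key=lambda item: l2_distance(item[1], query_vector))
    return [text for text, _ in ranked[:k]]

# Hypothetical two-dimensional embeddings for three documents.
toy_index = [
    ("spot training", [0.9, 0.1]),
    ("notebooks", [0.1, 0.9]),
    ("pricing", [0.7, 0.3]),
]
top_matches = similarity_search(toy_index, [1.0, 0.0], k=2)
print(top_matches)  # → ['spot training', 'pricing']
```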

In [22]:
!pip install faiss-cpu --quiet


We will now create an index of our documents using the VectorstoreIndexCreator. This index will allow us to perform efficient similarity searches on our documents.

The VectorstoreIndexCreator is a utility that helps us create an index of our documents. It uses the embeddings of the documents to create the index. The embeddings are dense vectors that represent the documents. The index allows us to perform efficient similarity searches on the documents.

In [23]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma, AtlasDB, FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embeddings,
    text_splitter=CharacterTextSplitter(chunk_size=300, chunk_overlap=0),
)
index = index_creator.from_loaders([loader])

Let's test our index by querying it with a sample question.

The `index.query` function combines retrieval and generation. It takes a question and a language model as input, performs a similarity search on the index to retrieve the most relevant documents, and then passes them to the LLM, which generates a coherent and contextually relevant answer based on the retrieved documents. The return value is the generated answer, not the documents themselves.

In [24]:
index.query(question=query, llm=sm_llm)

'Amazon EC2 Spot instances'

We will now replicate the `index.query` functionality step by step to illustrate what happens.

First, we will create a document search object using the FAISS vector store and our documents. This will allow us to perform similarity searches on our documents. Using this object, we retrieve the 3 documents most similar to our query.

The `FAISS.from_documents` function is used to create a FAISS vector store from our documents. The embeddings of the documents are used to create the vector store. The vector store allows us to perform efficient similarity searches on the documents.

The `docsearch.similarity_search` function is used to perform a similarity search on the documents. It takes a query and a number of results to return as input, and returns the most similar documents in the vector store. The query is converted into an embedding and this embedding is then compared with the embeddings of the documents in the vector store.

In [25]:
docsearch = FAISS.from_documents(documents, embeddings)
docs = docsearch.similarity_search(query, k=3)
docs

[Document(page_content='Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Once a Managed Spot Training job is completed, you can see the savings in the AWS Management Console and also calculate the cost savings as the percentage difference between the duration for which the training job ran and the duration for which you were billed. Regardless of how many times your Managed Spot Training jobs are interrupted, you are charged only once for the duration for which the data was downloaded.', metadata={'source': 'rag_data/processed_data.csv', 'row': 88}),
 Document(page_content='Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.: Managed Spot Training uses Amazon EC2 Spot instances for training, and these insta

Finally, we will use our question-answering chain to answer our query using the documents we found.

The `chain` object applies our question-answering chain to the query and the retrieved documents. The retrieval work (embedding the query and running the similarity search) has already happened in the previous cell. Here, the chain inserts the contents of the retrieved documents into the prompt template as context and asks the language model to generate an answer.

In [26]:
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

{'output_text': 'Spot instances'}
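What the chain does with the retrieved documents can be approximated in a few lines. The sketch below assumes the chain simply joins the document contents with a blank line and fills the prompt template (the "stuff" strategy). `SimpleDocument`, `stuff_prompt`, `answer_with_context`, and `fake_llm` are hypothetical stand-ins, and the fake model only checks for a keyword instead of generating text.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Minimal stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

PROMPT = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\nQuestion: {question}\nAnswer:"
)

def stuff_prompt(docs, question, separator="\n\n"):
    # Concatenate all document contents into a single context string.
    context = separator.join(doc.page_content for doc in docs)
    return PROMPT.format(context=context, question=question)

def answer_with_context(docs, question, llm):
    # `llm` is any callable that maps a prompt string to generated text.
    return {"output_text": llm(stuff_prompt(docs, question))}

def fake_llm(prompt):
    # Hypothetical model: answers only if the context mentions Spot instances.
    return "Spot instances" if "Spot instances" in prompt else "unknown"

retrieved = [
    SimpleDocument("Managed Spot Training uses Amazon EC2 Spot instances."),
    SimpleDocument("SageMaker is a fully managed service."),
]
result = answer_with_context(retrieved, "Which instances can I use?", fake_llm)
print(result)
```

This also shows why the quality of the retrieved documents matters: the model only sees what ends up in the context part of the prompt.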

We have looked at two flows so far:

a.

![alt text](flow.png)

b.

![alt text](RAGflow.png)

## Cleanup

After you have finished with this notebook, you should clean up your AWS resources to avoid any unwanted charges. This includes deleting the SageMaker endpoints used in this workshop: the text generation endpoint and the embedding endpoint.
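One possible way to perform the cleanup with `boto3` is sketched below. It assumes the endpoints were created as standard real-time endpoints whose production variants reference a model by name. `delete_endpoint_resources` is a hypothetical helper, and the client is passed in as an argument so that nothing is deleted until you call it yourself.

```python
def delete_endpoint_resources(sagemaker_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and models.

    `sagemaker_client` is expected to be a boto3 SageMaker client,
    for example boto3.client("sagemaker").
    """
    # Look up the config and model names before deleting anything.
    endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sagemaker_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sagemaker_client.delete_model(ModelName=model_name)
    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}

# Usage against a real account (not executed here):
#   import boto3
#   sm = boto3.client("sagemaker")
#   for name in (endpoint_name, embedding_endpoint_name):
#       delete_endpoint_resources(sm, name)
```

If the endpoints were deployed through the SageMaker console or JumpStart, they can also be deleted from there. Either way, verify afterwards that no endpoints remain in service.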