# Exercise - Build Your Own Conversational Search with GenAI

---

In this lab, you will build your own Generative AI application with conversational search powered by Large Language Models (LLMs). You will use the LangChain framework to implement a Retrieval Augmented Generation (RAG) solution with OpenSearch as the vector database.

You will be provided with skeleton code blocks that you can complete for your specific use case and requirements. Feel free to refer to the previous modules in this workshop to fill in the `**TODO**` sections of the code blocks below and build your own custom conversational search application.

We will also explore ways to improve RAG systems through the base prompt, the chunking approach, and query transformations.

For more information about LangChain RAG, refer to: https://python.langchain.com/docs/use_cases/question_answering/

---

The lab includes the following steps:
1. [Step 1: Initialize](#Step-1:-Initialize)
2. [Step 2: Select SageMaker or Bedrock used for embedding and content generation](#Step-2:-Select-SageMaker-or-Bedrock-used-for-embedding-and-content-generation)
3. `TODO:` [Step 3: Load documents into OpenSearch's Vector DB](#Step-3:-Load-documents-into-OpenSearch's-vector-database)
4. `TODO:`[Step 4: Retrieval Augmented Generation](#TODO-Step-4:-Retrieval-Augmented-Generation)
5. `TODO:`[Step 5: Conversational search by memorizing the history](#TODO-Step-5:-Conversational-search-by-memorizing-the-history)


To be completed in this lab:
- [&#x2611;] embedding and store into OpenSearch
- [&#x2611;] OpenSearch ANN engine and number of neighbors of graph
- [&#x2611;] Choose among different approaches for using retrieved documents as context: `stuff`, `refine`, `map_reduce`, and `map_rerank`
- [&#x2612;] RAG Improvement patterns
    - Base Prompt
    - Chunking Approach
    - Query Transformations
- [&#x2612;] Conversation search
    - Memory store
    - Prompt engineering

## Step 1: Initialize

Install the required libraries, such as the OpenSearch client library and LangChain.

In [None]:
%pip install --upgrade sagemaker==2.186.0 --quiet
%pip install opensearch-py==2.3.1 --quiet
%pip install wikipedia unstructured transformers==4.33.2 --quiet
%pip install langchain==0.0.293 --quiet
%pip install --upgrade boto3 --quiet

Initialize SageMaker, Boto3

In [None]:
import sagemaker, boto3, json
from sagemaker.session import Session

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

### Get Cloud Formation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store this in "outputs" to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

In [None]:
import json
region = aws_region

cfn = boto3.client('cloudformation')
kms = boto3.client('secretsmanager')

def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "semantic-search"

outputs = get_cfn_outputs(cloudformation_stack_name)
aos_host = outputs['OpenSearchDomainEndpoint']
aos_credentials = json.loads(kms.get_secret_value(SecretId=outputs['OpenSearchSecret'])['SecretString'])

outputs

#### **Note**: To verify the deployed endpoints for the embedding and content generation models, refer to Step 2 in Module 2

### Get endpoint for embedding

---
This is a SageMaker endpoint hosting the GPT-J 6B-parameter model, which converts text into vectors.


In [None]:
embedding_endpoint_name=outputs['EmbeddingEndpointName']
print(embedding_endpoint_name)

### Get endpoint for content generation

We use the Falcon large language model to generate text. Our Falcon model has 7 billion parameters.
It is the smallest Falcon model available and provides a good balance between accuracy and the hardware cost of running the model.


In [None]:
llm_endpoint_name=outputs['LLMEndpointName']
print(llm_endpoint_name)

### Setup Amazon Bedrock

Set `os.environ["BEDROCK_ENDPOINT_URL"]` according to your Bedrock access type (production or developer access). Refer to the documentation for the correct endpoint URL for your region: [Amazon Bedrock Endpoints](https://docs.aws.amazon.com/bedrock/latest/userguide/endpointsTable.html)


Amazon Bedrock users need to **request access** to models before they are available for use. To request access to additional models for text, chat, or image generation:

1. Navigate to the [Amazon Bedrock Console](https://console.aws.amazon.com/bedrock).
2. Select the **Model access** link in the left-side navigation panel of the Amazon Bedrock console.
3. Select the check box next to each model you want access to, then choose **Save Changes**.

If Bedrock is available in your account, set `is_bedrock_available` to True.

In [None]:
is_bedrock_available=False

In [None]:
import json
import os
import sys
import boto3
from botocore.config import Config

bedrock_region="us-east-1"

#boto3_bedrock = boto3.client(service_name="bedrock-runtime", endpoint_url=f"https://bedrock-runtime.{bedrock_region}.amazonaws.com")
boto3_bedrock = boto3.client(service_name="bedrock-runtime", config=Config(region_name=bedrock_region))

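Here we use the Claude Instant text generation model (`anthropic.claude-instant-v1`) on Amazon Bedrock. Ask the same question as in the previous modules and check whether the answer contains any hallucination. The cell below expects a `question` string to be defined already.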

In [None]:
from langchain.chains import ConversationChain
from langchain.llms.bedrock import Bedrock
from langchain.memory import ConversationBufferMemory

claude_llm_hallucination = Bedrock(model_id="anthropic.claude-instant-v1", client=boto3_bedrock)
claude_llm_hallucination.model_kwargs = {'temperature': 0.9, "max_tokens_to_sample": 1024}

if is_bedrock_available:
    claude_result = claude_llm_hallucination(question)
    print(claude_result)
else:
    print("Bedrock is unavailable")

## Step 2: Select SageMaker or Bedrock used for embedding and content generation

Select one of the LLMs to use in the lab.

---


In [None]:
from ipywidgets import Dropdown

llm_selection = [
    "SageMaker",
    "Bedrock",
]

llm_dropdown = Dropdown(
    options=llm_selection,
    value="SageMaker",
    description="Select an LLM",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(llm_dropdown)

#### Note: If Bedrock is unavailable, SageMaker is used as the fallback.

---

In [None]:
llm_category = llm_dropdown.value

if not is_bedrock_available:
    llm_category = "SageMaker"

In [None]:
print("You selected {0} as LLM".format(llm_category))

#### Content Handler util for defining LLM with SagemakerEndpoint

In [None]:
from uuid import uuid4
from typing import Dict
from langchain.memory import ConversationBufferMemory
from langchain.memory import DynamoDBChatMessageHistory
from langchain.memory import ConversationBufferWindowMemory
from langchain import PromptTemplate, SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains import RetrievalQA


class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"
    
    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"inputs": prompt, "parameters": model_kwargs})
        #print("Prompt Input:\n" + input_str)
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        #print("LLM generated text:\n" + response_json[0]["generated_text"])
        return response_json[0]["generated_text"]
    

content_handler = ContentHandler()
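
To make the request and response shapes concrete, the sketch below reproduces the two transformations of `ContentHandler` using only the standard library. The sample prompt, the parameters, and the fake response body are made up for illustration; the actual payload shape depends on the model container behind your endpoint.

```python
import io
import json


def build_request(prompt: str, model_kwargs: dict) -> bytes:
    """Serialize a prompt the same way ContentHandler.transform_input does."""
    return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")


def parse_response(body) -> str:
    """Read a file-like response body and extract the generated text."""
    response_json = json.loads(body.read().decode("utf-8"))
    return response_json[0]["generated_text"]


# Hypothetical values, used only to show the shapes involved.
request = build_request("What is OpenSearch?", {"max_new_tokens": 64})
fake_body = io.BytesIO(
    json.dumps([{"generated_text": "A search engine."}]).encode("utf-8")
)
answer = parse_response(fake_body)

print(request.decode("utf-8"))
print(answer)
```

The response body is read with `.read()` because the handler receives a stream-like object rather than raw bytes, which is why `io.BytesIO` is used as the stand-in here.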

## `TODO:` Step 3: Load documents into OpenSearch's vector database

LangChain provides various document loaders to load data from a source as `Document` objects. A `Document` is a piece of text and its associated metadata. For example, there are document loaders for loading a simple .txt file, for loading the text contents of any web page, or even for loading unstructured Word documents.

---

The following diagram shows the data flow for loading documents and storing their vectors in OpenSearch.

![retriever](../image/module3/document_loader)


Document loaders expose a "load" method for loading data as documents from a configured source. For example, `UnstructuredURLLoader` can load the OpenSearch best practices web page.

You can use any of the document loaders provided by LangChain, depending on the type of your source documents. Refer to [LangChain Document Loaders](https://python.langchain.com/docs/integrations/document_loaders) for details that will help you complete the next code section.

In [None]:
### TODO: Use LangChain document loaders to load source data into the vector store
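
Before loaded documents are embedded, they are usually split into smaller chunks, which is the "chunking approach" listed in the lab goals. The sketch below is a standard-library illustration of fixed-size chunking with overlap, not the LangChain implementation; the function name `chunk_text` and its defaults are assumptions made for this example. In LangChain, a text splitter such as `RecursiveCharacterTextSplitter` typically plays this role.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, overlap=3)
print(chunks)
# prints: ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Larger chunks keep more context per vector, while smaller chunks make retrieval more targeted; the overlap reduces the chance that a sentence relevant to the query is cut in half at a chunk boundary.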


Create an OpenSearch cluster connection.
Next, we'll use the Python client to set up a connection to the Amazon OpenSearch Service domain.

In [None]:
from opensearchpy import OpenSearch, RequestsHttpConnection

# Use the credentials retrieved from AWS Secrets Manager in Step 1
auth = (aos_credentials['username'], aos_credentials['password'])
aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

### LangChain embedding endpoint

To build a simplified QA application with LangChain, we need to wrap our SageMaker endpoints for the embedding model and the LLM in `langchain.embeddings.SagemakerEndpointEmbeddings` and `langchain.llms.sagemaker_endpoint.SagemakerEndpoint`. This requires overriding methods of the `SagemakerEndpointEmbeddings` class to make it compatible with the SageMaker embedding model.

---

The embedding model is GPT-J, and its endpoint name is stored in `embedding_endpoint_name`.

In [None]:
from typing import Any, Dict, Iterable, List, Optional, Tuple, Callable
import json
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
from langchain.schema import Document

class BulkSagemakerEndpointEmbeddings(SagemakerEndpointEmbeddings):
        def embed_documents(
            self, texts: List[str], chunk_size: int = 5
        ) -> List[List[float]]:
            results = []
            _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size

            for i in range(0, len(texts), _chunk_size):
                response = self._embedding_func(texts[i:i + _chunk_size])
                results.extend(response)
            return results
        
class EmbeddingContentHandler(EmbeddingsContentHandler):
        content_type = "application/json"
        accepts = "application/json"

        def transform_input(self, prompt: str, model_kwargs={}) -> bytes:

            input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
            return input_str.encode('utf-8') 

        def transform_output(self, output: bytes) -> str:

            response_json = json.loads(output.read().decode("utf-8"))
            embeddings = response_json["embedding"]
            if len(embeddings) == 1:
                return [embeddings[0]]
            return embeddings

print(embedding_endpoint_name)
sagemaker_embeddings = BulkSagemakerEndpointEmbeddings( 
            endpoint_name=embedding_endpoint_name,
            region_name=aws_region, 
            content_handler=EmbeddingContentHandler())


### Bedrock embedding

In [None]:
from langchain.embeddings import BedrockEmbeddings

bedrock_embeddings = BedrockEmbeddings(model_id='amazon.titan-embed-text-v1',client=boto3_bedrock)

### OpenSearch vector store

### Provide embedding service based on selection between SageMaker and Bedrock

In [None]:
match llm_category:
    case "SageMaker":
        embeddings = sagemaker_embeddings
    case "Bedrock":
        embeddings = bedrock_embeddings


### TODO: Ingest the documents into the OpenSearch vector store using [OpenSearchVectorSearch](https://python.langchain.com/docs/integrations/vectorstores/opensearch) provided by LangChain

Use `OpenSearchVectorSearch` in LangChain to ingest vectors into OpenSearch. You can pass additional parameters to create the k-NN index with specific properties, including:

- `engine`: “nmslib”, “faiss”, “lucene”; `default`: “nmslib”

- `space_type`: “l2”, “l1”, “cosinesimil”, “linf”, “innerproduct”; `default`: “l2”

- `ef_search`: Size of the dynamic list used during k-NN searches. Higher values lead to more accurate but slower searches; `default`: 512

- `ef_construction`: Size of the dynamic list used during k-NN graph creation. Higher values lead to more accurate graph but slower indexing speed; `default`: 512

- `m`: Number of bidirectional links created for each new element. Large impact on memory consumption. Between 2 and 100; `default`: 16


**Note**: When you use LangChain's `OpenSearchVectorSearch` to store embeddings in an OpenSearch k-NN index, you can specify parameters to choose among different Approximate Nearest Neighbor (ANN) algorithms. For more information, refer to the OpenSearch k-NN documentation: https://opensearch.org/docs/latest/search-plugins/knn/knn-index/
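
To show where these parameters end up, the sketch below builds the kind of index body an HNSW-based k-NN index uses. It is an illustration under assumptions, not the exact body LangChain generates: the helper name `knn_index_body`, the field name `vector_field`, and the dimension of 4096 (chosen here to match a GPT-J 6B embedding) are all assumptions, and how `ef_search` is honored can vary by engine and OpenSearch version.

```python
from typing import Any, Dict


def knn_index_body(
    dimension: int,
    engine: str = "nmslib",
    space_type: str = "l2",
    ef_search: int = 512,
    ef_construction: int = 512,
    m: int = 16,
    vector_field: str = "vector_field",  # hypothetical field name
) -> Dict[str, Any]:
    """Build an index body for an HNSW k-NN index with the given parameters."""
    return {
        "settings": {
            "index": {"knn": True, "knn.algo_param.ef_search": ef_search}
        },
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": space_type,
                        "engine": engine,
                        "parameters": {
                            "ef_construction": ef_construction,
                            "m": m,
                        },
                    },
                }
            }
        },
    }


# Assumed dimension; use the output size of your own embedding model.
body = knn_index_body(dimension=4096, m=32)
print(body["mappings"]["properties"]["vector_field"]["method"])
```

Raising `m` or `ef_construction` generally trades memory and indexing time for recall, so it is worth changing one parameter at a time and comparing the retrieved documents.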

In [None]:
from langchain.vectorstores import OpenSearchVectorSearch

os_domain_ep = 'https://'+aos_host

embedding_index_name = #<name of the OS index>

# TODO: Code to ingest data into OS index


In [None]:
# check whether the OS index is created successfully
aos_client.indices.get(index=embedding_index_name)

## `TODO` Step 4: Retrieval Augmented Generation

---

To mitigate LLM hallucination, we can provide context to the LLM and let it generate the answer from that context. The following diagram shows the RAG data flow:

![rag](../image/module3/workflow)

---


Define SageMaker LLM endpoint

---

In [None]:
sagemaker_params = {
        "max_new_tokens": 1024,
        "num_return_sequences": 1,
        "top_k": 200,
        "top_p": 0.9,
        "do_sample": False,
        "return_full_text": False,
        "temperature": 0.0001
        }

sagemaker_llm=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs=sagemaker_params,
        content_handler=content_handler,
)

Define Bedrock content generation LLM

---

In [None]:
bedrock_params = {
    'temperature': 0.00001,
    "max_tokens_to_sample": 1024
    }

bedrock_claude_llm = Bedrock(model_id="anthropic.claude-instant-v1", client=boto3_bedrock)
bedrock_claude_llm.model_kwargs = bedrock_params

### Provide content generation service based on selection between SageMaker and Bedrock

In [None]:
match llm_category:
    case "SageMaker":
        llm = sagemaker_llm
    case "Bedrock":
        llm = bedrock_claude_llm


Define `RetrievalQA` Chain with SageMaker or Bedrock LLM

---

### TODO: Use the OpenSearch vector store as a retriever to get documents similar to the query.

You can also specify a similarity score threshold so that only highly relevant documents are returned. Use "k" to limit how many documents are returned. See [VectorStoreRetriever](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.html#langchain.vectorstores.opensearch_vector_search.OpenSearchVectorSearch.as_retriever) for reference.
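
The effect of "k" and a score threshold can be sketched without any vector store. The example below ranks a few toy vectors by cosine similarity, drops everything under the threshold, and keeps the top `k`. The function names, the toy documents, and the threshold value are assumptions for illustration; the scores a real vector store returns depend on its configured space type.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    k: int = 2,
    score_threshold: float = 0.5,
) -> List[Tuple[float, str]]:
    """Return up to k (score, text) pairs whose score meets the threshold."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Toy two-dimensional "embeddings", made up for this example.
toy_docs = [
    ("shard sizing", [1.0, 0.0]),
    ("index lifecycle", [0.8, 0.6]),
    ("unrelated", [0.0, 1.0]),
    ("opposite", [-1.0, 0.0]),
]
hits = retrieve([1.0, 0.0], toy_docs, k=2, score_threshold=0.5)
print([text for _, text in hits])
```

A higher threshold can return fewer than `k` documents, which is usually preferable to padding the prompt with weakly related context.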

This category of chains is used for interacting with indexes. The purpose of these chains is to combine your own data (stored in the indexes) with LLMs. The best example of this is question answering over your own documents. The available chain types are:

1. `stuff`: The stuff documents chain ("stuff" as in "to stuff" or "to fill") is the most straightforward of the document chains. It takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM. This chain is well suited for applications where documents are small and only a few are passed in for most calls.

2. `refine`: The refine documents chain constructs a response by looping over the input documents and iteratively updating its answer. For each document, it passes all non-document inputs, the current document, and the latest intermediate answer to an LLM chain to get a new answer. Since the refine chain only passes a single document to the LLM at a time, it is well suited for tasks that require analyzing more documents than can fit in the model's context.

3. `map_reduce`: The map reduce documents chain first applies an LLM chain to each document individually (the Map step), treating the chain output as a new document. It then passes all the new documents to a separate combine documents chain to get a single output (the Reduce step).

4. `map_rerank`: The map re-rank documents chain runs an initial prompt on each document that not only tries to complete a task but also gives a score for how certain it is in its answer. The highest-scoring response is returned.

The choice of `chain_type` can help improve your RAG system. Select the `chain_type` for your use case and pass it to `RetrievalQA.from_chain_type()`.

Learn more about [chain_types](https://python.langchain.com/docs/modules/chains/document/)
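
The difference between `stuff` and `map_reduce` is mainly how many LLM calls are made and what each prompt contains. The sketch below imitates both with plain functions and a fake LLM that only records its prompts. It is a conceptual illustration, not LangChain's implementation; the prompt wording and all names are assumptions.

```python
from typing import Callable, List

LLM = Callable[[str], str]


def stuff_chain(llm: LLM, docs: List[str], question: str) -> str:
    """One LLM call with every document placed in a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


def map_reduce_chain(llm: LLM, docs: List[str], question: str) -> str:
    """One LLM call per document (map), then one call over the partial answers (reduce)."""
    partials = [
        llm(f"Context:\n{doc}\n\nQuestion: {question}\nAnswer:") for doc in docs
    ]
    return stuff_chain(llm, partials, question)


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a canned reply."""
    calls.append(prompt)
    return f"reply-{len(calls)}"


docs = ["doc A", "doc B", "doc C"]

stuff_answer = stuff_chain(fake_llm, docs, "q?")
stuff_calls = len(calls)

calls.clear()
map_reduce_answer = map_reduce_chain(fake_llm, docs, "q?")
map_reduce_calls = len(calls)

print(stuff_calls, map_reduce_calls)
# prints: 1 4
```

With three documents, `stuff` costs one call with a long prompt, while `map_reduce` costs four calls with shorter prompts, which matters when the combined documents would not fit in the model's context.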


In [None]:
# TODO 
# 1. Define the SageMaker and Bedrock retriever
# 2. Define the QA chain using the retriever and experiment with chain_type


qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type=#<> #stuff, refine, map_reduce, and map_rerank
)

Use RAG to answer the same question as before. Compare the content generated with RAG against the LLM output without context.

---

In [None]:
print("Question is:" + question)

#langchain.debug=False
result = qa({"query": question})

print("result:" + result["result"])
  

## `TODO` Step 5: Conversational search by memorizing the history 

### LangChain Memory with Amazon DynamoDB as data store

In the above example, you can ask the system any question. However, the questions are unrelated to each other. In a typical search system, you may want to implement conversational search. An essential component of a conversation is being able to refer to information introduced earlier in the conversation. LangChain provides many utilities for adding memory to a system. These utilities can be used by themselves or incorporated into a chain. In this lab, we use [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) as the data store for the message history.

---
The data flow of conversational search with memory is as follows:

![rag](../image/module8/rag-with-memory.png)

First, delete the DynamoDB table used as "memory" for the message history, if it already exists.

In [None]:
import boto3
import time

dynamo = boto3.client('dynamodb')

history_table_name = 'conversation-history-memory'

try:
    dynamo.delete_table(TableName=history_table_name)
    response = dynamo.describe_table(TableName=history_table_name)
    print(response)
    while response["Table"]["TableStatus"] == 'DELETING':
        time.sleep(1)
        print('.', end='')
        response = dynamo.describe_table(TableName=history_table_name)
except dynamo.exceptions.ResourceNotFoundException:
    pass

Create the DynamoDB table to store the message history.

In [None]:
dynamo.create_table(
    TableName=history_table_name,
    AttributeDefinitions=[
        {
            'AttributeName': 'SessionId',
            'AttributeType': 'S',
        }
    ],
    KeySchema=[
        {
            'AttributeName': 'SessionId',
            'KeyType': 'HASH',
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5,
    }
)

response = dynamo.describe_table(TableName=history_table_name) 
print(response)
while response["Table"]["TableStatus"] == 'CREATING':
    time.sleep(1)
    print('.', end='')
    response = dynamo.describe_table(TableName=history_table_name) 
    
print("\ndynamo DB Table, '"+response['Table']['TableName']+"' is created")

---
Here we create a new session and use DynamoDB as the backend to store the conversation history.

In [None]:
from langchain.schema import messages_to_dict  # used below to serialize copied messages

ddb_table_name = "conversation-history-memory"
session_id = str(uuid4())
chat_memory = DynamoDBChatMessageHistory(
        table_name=ddb_table_name,
        session_id=session_id
    )

messages = chat_memory.messages

# Maintains immutable sessions
# If previous session was present, create
# a new session and copy messages, and 
# generate a new session_id 
if messages:
    session_id = str(uuid4())
    chat_memory = DynamoDBChatMessageHistory(
        table_name=ddb_table_name,
        session_id=session_id
    )
    # This is a workaround at the moment. Ideally, this should
    # be added to the DynamoDBChatMessageHistory class
    try:
        messages = messages_to_dict(messages)
        chat_memory.table.put_item(
            Item={"SessionId": session_id, "History": messages}
        )
    except Exception as e:
        print(e)



### TODO: Define memory store to store conversation history

In [None]:
# TODO
# Define memory store
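
As a mental model for this TODO, a conversation memory stores past exchanges and renders them as the `chat_history` text that the condense prompt expects. The sketch below is a standard-library stand-in that keeps only the last `k` exchanges; the class name `WindowMemory` and the `Human:`/`AI:` prefixes are assumptions for illustration. In the lab itself, the already imported `ConversationBufferMemory` or `ConversationBufferWindowMemory`, backed by the `chat_memory` object created above, would fill this role.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowMemory:
    """Keep only the last `k` question/answer exchanges."""

    def __init__(self, k: int = 3) -> None:
        # A bounded deque silently drops the oldest exchange when full.
        self.turns: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_chat_history(self) -> str:
        """Render the stored exchanges as plain text for a prompt."""
        lines: List[str] = []
        for question, answer in self.turns:
            lines.append(f"Human: {question}")
            lines.append(f"AI: {answer}")
        return "\n".join(lines)


toy_memory = WindowMemory(k=2)
toy_memory.save("q1", "a1")
toy_memory.save("q2", "a2")
toy_memory.save("q3", "a3")
history_text = toy_memory.as_chat_history()
print(history_text)
```

Limiting the window keeps the condense prompt short, at the cost of forgetting details from earlier in a long conversation.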



---

Create a `ConversationalRetrievalChain`. It combines the chat history and the question into a standalone question, looks up relevant documents from the retriever, and finally passes those documents and the question to a question-answering chain to return a response.

---

In [None]:
from langchain.chains import ConversationalRetrievalChain

params = {
        "max_length": 2048,
        "max_new_tokens": 1024,
        "num_return_sequences": 1,
        "top_k": 200,
        "top_p": 0.9,
        "do_sample": False,
        "return_full_text": False,
        "temperature": 0.0001
        }

sagemaker_llm=SagemakerEndpoint(
        endpoint_name=llm_endpoint_name,
        region_name=aws_region,
        model_kwargs=params,
        content_handler=content_handler,
)

condense_template = """system: generate one standalone question.
Given the following conversation between <chat-history> and </chat-history> 
and follow up question between <follow-up-question> and </follow-up-question>, 
rephrase the follow up question to be a standalone question in its original language. 
The standalone question will only contains one sentence and it must end with '?'

<chat-history>
{chat_history}
</chat-history>

<follow-up-question>
{question}
</follow-up-question>

standalone question:
"""

CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(condense_template)



### TODO: Define prompt used by LLM to generate answers with context information and original question

In [None]:
#TODO
#Define prompt





In [None]:
qa_with_memory = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    condense_question_prompt=CONDENSE_QUESTION_PROMPT,
    combine_docs_chain_kwargs={"prompt": prompt_template2},
    verbose=True)

---

For the first question, there is no history, so this is just the standard RAG process.

---

In [None]:
result = qa_with_memory(question)


In [None]:
#print("result:" + str(result))
print("\nAnswer:\n" + str(result["answer"]))

### Second question
Ask one more question. `ConversationalRetrievalChain` uses the first question, the first answer, and the second question as the prompt for the LLM to generate a new standalone question. The prompt looks like the following:

```python

_template = """system: generate one standalone question.
Given the following conversation between <chat-history> and </chat-history> 
and follow up question between <follow-up-question> and </follow-up-question>, 
rephrase the follow up question to be a standalone question in its original language. 
The standalone question will only contains one sentence and it must end with '?'

<chat-history>
{chat_history}
</chat-history>

<follow-up-question>
{question}
</follow-up-question>

standalone question:
"""
```

After getting the new question from the LLM, the chain searches the OpenSearch vector store for relevant documents, then combines the new question and those documents into a prompt for the RAG step. The prompt looks like the following:

```python
prompt_template = """Answer the question as truthfully as possible by using the provided information in >>CONTEXT<<. If the answer is not contained within the >>CONTEXT<<, respond with "I can't answer that".

>>CONTEXT<<:
{context}

>>QUESTION<<:
{question}

>>Answer<<:
"""
```

In summary, `ConversationalRetrievalChain` calls the LLM twice:
1. It uses the previous questions, the previous answers, and the latest question as the prompt to generate a new standalone question.
2. It uses the new question from the first step to query relevant documents, then combines those documents and the new question as the prompt for the LLM to generate the answer.
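
The two-call flow can be sketched with stub functions in place of the LLM and the retriever. This is a simplified illustration of the control flow only, not the LangChain source; the shortened prompt templates, the stub replies, and the sample questions are assumptions made for the example.

```python
from typing import Callable, List

CONDENSE = (
    "Chat history:\n{chat_history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)
ANSWER = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def conversational_answer(
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    chat_history: str,
    question: str,
) -> str:
    # Call 1: rewrite the follow-up as a standalone question (skipped without history).
    if chat_history:
        standalone = llm(CONDENSE.format(chat_history=chat_history, question=question))
    else:
        standalone = question
    docs = retrieve(standalone)
    # Call 2: answer the standalone question using the retrieved documents.
    return llm(ANSWER.format(context="\n\n".join(docs), question=standalone))


prompts: List[str] = []


def stub_llm(prompt: str) -> str:
    """Fake model: records the prompt and returns a canned reply."""
    prompts.append(prompt)
    if prompt.endswith("Standalone question:"):
        return "How should I size shards if my data grows quickly?"
    return "stub answer"


def stub_retrieve(query: str) -> List[str]:
    """Fake retriever: returns one passage that echoes the query."""
    return [f"passage about: {query}"]


first = conversational_answer(stub_llm, stub_retrieve, "", "How do I size shards?")
calls_first = len(prompts)

history = "Human: How do I size shards?\nAI: stub answer"
second = conversational_answer(
    stub_llm, stub_retrieve, history, "if my data growth is very fast"
)
calls_second = len(prompts) - calls_first

print(calls_first, calls_second)
# prints: 1 2
```

Note that retrieval runs on the rewritten question, so a poorly condensed question leads to poorly matched documents; the verbose output is the place to check what the first call produced.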

You can also see verbose messages like the following:

---

### First call to LLM:

![generate new question](../image/module8/conversation-new-question.png)

---

### Second call to LLM:

![generate final answer](../image/module8/conversation-final-answer.png)

---


In [None]:
second_following_question = 'if my data growth is very fast'
second_result = qa_with_memory(second_following_question)


In [None]:
print("second answer:" + str(second_result["answer"]))