# LangChain for Retrieval Augmentation

## Create and Index
We can set up our index to store our data. We begin by initializing our connection to Pinecone.


In [1]:
import os
from pinecone import Pinecone
import ast

In [2]:
# Configurations
pinecone_api = "xxx"
index_name = "sagemaker-agent"
pinecone_region = "us-east-1"
pinecone_host = "xxx"
pine_cloud = "aws"
pinecone_metric = "cosine"

openai_api = "sk-xxx"


We can set up our index to store our Data.

We begin by initializing our connection to Pinecone.

In [3]:
import time
from pinecone import ServerlessSpec

# configure client
pc = Pinecone(api_key=pinecone_api)
spec = ServerlessSpec(cloud=pine_cloud, region=pinecone_region)

In [4]:
# delete index if it exists
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

# we create a new index
pc.create_index(
        index_name,
        dimension=1536,  # dimensionality of text-embedding-ada-002
        metric=pinecone_metric,
        spec=spec
    )

# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

In [20]:
index = pc.Index(index_name)
# wait a moment for connection
time.sleep(1)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Let's load our pre-embedded data. We will format the dataframe to be ready for upserting into Pinecone.


In [6]:
# Now we load the the pinecone dataset we prepared from the previous notebook
file_path = "CuratedData\sagemaker_documentation_embeddings.csv"

import pandas as pd
df = pd.read_csv(file_path)
df = df.drop(columns=['document_name', 'title'])

print(df.shape)
df.head(2)

(337, 3)


Unnamed: 0,id,values,metadata
0,1-1,"[-0.008473106659948826, 0.016663100570440292, ...","{'chunk': 1, 'source': 'amazon-sagemaker-toolk..."
1,2-1,"[-0.01526849064975977, 0.029873132705688477, 0...","{'chunk': 1, 'source': 'asff-resourcedetails-a..."


In [7]:
# Ensure 'values' column is properly formatted
df['values'] = df['values'].apply(lambda x: [float(i) for i in x.strip('[]').split(',')])

# Ensure 'metadata' column is properly formatted as a dictionary
df['metadata'] = df['metadata'].apply(lambda x: ast.literal_eval(x))
df.head(2)

Unnamed: 0,id,values,metadata
0,1-1,"[-0.008473106659948826, 0.016663100570440292, ...","{'chunk': 1, 'source': 'amazon-sagemaker-toolk..."
1,2-1,"[-0.01526849064975977, 0.029873132705688477, 0...","{'chunk': 1, 'source': 'asff-resourcedetails-a..."


Let's create a function to upsert our data in batches. This is useful when we have a large dataset.


In [21]:
# Function to batch upload data to Pinecone
def upsert_data_to_pinecone(index, df, batch_size=50):
    time.sleep(10)
    records = df.to_dict('records')
    total_records = len(records)
    for i in range(0, total_records, batch_size):
        batch = records[i:i + batch_size]
        vectors = [
            {
                "id": str(record['id']),
                "values": record['values'],
                "metadata": record['metadata']
            }
            for record in batch
        ]
        index.upsert(vectors=vectors)
        print(f"Upserted batch {i//batch_size + 1}/{(total_records // batch_size) + 1}")

# Upsert the data to Pinecone
upsert_data_to_pinecone(index, df)

Upserted batch 1/7
Upserted batch 2/7
Upserted batch 3/7
Upserted batch 4/7
Upserted batch 5/7
Upserted batch 6/7
Upserted batch 7/7


In [22]:
# Verify the data has been upserted
time.sleep(1) 
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 337}},
 'total_vector_count': 337}


Now let's explore different options to solve the problem of retrieval augmentation.


### 1st - RetrievalQAWithSourcesChain
The `RetrievalQAWithSourcesChain` in LangChain is a specialized chain designed to handle retrieval-based question answering tasks while providing source attribution.


In [23]:
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=openai_api
)

Initialize a vector store:

In [24]:
from langchain.vectorstores import Pinecone

text_field = "text" # Here is the content of the document in our metadata

# switch back to normal index for langchain
index = pc.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)



In [65]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

# completion llm
llm = ChatOpenAI(
    openai_api_key=openai_api,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(k=3)
)



In [99]:
# Test the function
question = "What is SageMaker?"
qa_with_sources(question)

{'question': 'What is SageMaker?',
 'answer': 'Amazon SageMaker is a fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning models. It integrates with AWS Marketplace, allowing developers to charge other SageMaker users for the use of their algorithms and model packages. SageMaker provides an integrated Jupyter authoring notebook instance for easy access to data sources and allows for the association of Git repositories with notebook instances. Before creating algorithm and model package resources, they must be developed and packaged in Docker containers. SageMaker is a powerful tool for machine learning development and deployment.\n',
 'sources': 'examples-sagemaker.md, sagemaker-marketplace.md, integrating-sagemaker.md, sagemaker-marketplace-develop.md'}

In [52]:
# Test the function
question = "What are all AWS regions where SageMaker is available?"
qa_with_sources(question)

{'question': 'What are all AWS regions where SageMaker is available?',
 'answer': 'The AWS regions where SageMaker is available are:\n\n- US East (Ohio)\n- US East (N. Virginia)\n- US West (N. California)\n- US West (Oregon)\n- Africa (Cape Town)\n- Asia Pacific (Hong Kong)\n- Asia Pacific (Mumbai)\n- Asia Pacific (Osaka)\n- Asia Pacific (Seoul)\n- Asia Pacific (Singapore)\n- Asia Pacific (Sydney)\n- Asia Pacific (Jakarta)\n- Asia Pacific (Tokyo)\n- Canada (Central)\n- China (Beijing)\n- China (Ningxia)\n- Europe (Frankfurt)\n- Europe (Ireland)\n- Europe (London)\n- Europe (Paris)\n- Europe (Stockholm)\n- Europe (Milan)\n- Middle East (Bahrain)\n- South America (São Paulo)\n- AWS GovCloud (US-West)\n\n',
 'sources': 'sagemaker-algo-docker-registry-paths.md'}

In [48]:
# Test the function
question = "How to check if an endpoint is KMS encrypted?"
qa_with_sources(question)

{'question': 'How to check if an endpoint is KMS encrypted?',
 'answer': 'To check if an endpoint is KMS encrypted, you need to verify whether the AWS Key Management Service (KMS) key is configured for an Amazon SageMaker endpoint configuration. The endpoint configuration is considered NON_COMPLIANT if "KmsKeyId" is not specified for the Amazon SageMaker endpoint configuration.\n\n',
 'sources': 'sagemaker-endpoint-configuration-kms-key-configured.md'}

In [63]:
# Test the function
question = "What are SageMaker Geospatial capabilities?"
qa_with_sources(question)

{'question': 'What are SageMaker Geospatial capabilities?',
 'answer': 'Amazon SageMaker geospatial capabilities allow users to perform operations on AWS hardware managed by SageMaker, with the ability to create and use execution roles for specific permissions. These capabilities include actions like passing roles between services and attaching trust policies to IAM roles. Specific permissions are required for different API calls, such as StartEarthObservationJob and StartVectorEnrichmentJob. Users can also utilize AWS managed policies like AmazonSageMakerFullAccess for broader permissions. SageMaker also integrates with AWS Marketplace for selling algorithms and model packages. \n',
 'sources': 'sagemaker-geospatial-roles.md, integrating-sagemaker.md, sagemaker-marketplace.md'}

### 2nd - Retrieval Augmented Generation using ChatCompletions


In [81]:
from openai import OpenAI

client = OpenAI(api_key=openai_api)
embedding_model = "text-embedding-ada-002"

def get_embedding(text, model=embedding_model):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding


In [92]:
import openai

def rag_chatcompletions(query):
    # Get the embedding for the query
    query_embedding = get_embedding(query)

    # Retrieve from Pinecone
    res = index.query(vector=query_embedding, top_k=5, include_metadata=True)

    # Get list of retrieved text
    contexts = [item['metadata']['text'] for item in res['matches']]

    # Create the augmented query
    augmented_query = "\n\n---\n\n".join(contexts) + "\n\n-----\n\n" + query
    # System message to 'prime' the model
    primer = """You are Q&A bot. A highly intelligent system that answers
    user questions based on the information provided by the user above
    each question. If the information can not be found in the information
    provided by the user you truthfully say "I don't know".
    Remember to share the source of the content.
    """

    # Create the chat completion
    response = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {'role': 'system', 'content': primer},
                {'role': 'user', 'content': augmented_query},
            ],
            temperature=0
        )
    display(Markdown(response.choices[0].message.content))

In [93]:
# Test the function
question = "What is SageMaker?"
rag_chatcompletions(question)

SageMaker is a fully managed machine learning service provided by Amazon. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, eliminating the need to manage servers. SageMaker also integrates with AWS Marketplace, allowing developers to charge other SageMaker users for the use of their algorithms and model packages.

In [94]:
# Test the function
question = "What are all AWS regions where SageMaker is available?"
rag_chatcompletions(question)

Based on the information provided, SageMaker is available in the following AWS regions:

1. US East (Ohio)
2. US East (N. Virginia)
3. US West (N. California)
4. US West (Oregon)
5. Africa (Cape Town)
6. Asia Pacific (Hong Kong)
7. Asia Pacific (Mumbai)
8. Asia Pacific (Osaka)
9. Asia Pacific (Seoul)
10. Asia Pacific (Singapore)
11. Asia Pacific (Sydney)
12. Asia Pacific (Jakarta)
13. Asia Pacific (Tokyo)
14. Canada (Central)
15. China (Beijing)
16. China (Ningxia)
17. Europe (Frankfurt)
18. Europe (Ireland)
19. Europe (London)
20. Europe (Paris)
21. Europe (Stockholm)
22. Europe (Milan)
23. Middle East (Bahrain)
24. South America (São Paulo)
25. AWS GovCloud (US-West)

Please note that this list is based on the information provided and may not include all regions where SageMaker is available. For the most up-to-date list, please refer to the official AWS documentation.

In [95]:
# Test the function
question = "How to check if an endpoint is KMS encrypted?"
rag_chatcompletions(question)

To check if an Amazon SageMaker endpoint is KMS encrypted, you can use the AWS Config managed rule with the identifier SAGEMAKER_ENDPOINT_CONFIGURATION_KMS_KEY_CONFIGURED. This rule checks whether an AWS Key Management Service (KMS) key is configured for an Amazon SageMaker endpoint configuration. The rule is NON_COMPLIANT if "KmsKeyId" is not specified for the Amazon SageMaker endpoint configuration. This rule is triggered periodically and applies to all supported AWS regions except a few specified ones.

In [96]:
# Test the function
question = "What are SageMaker Geospatial capabilities?"
rag_chatcompletions(question)

Amazon SageMaker Geospatial capabilities are part of the managed service provided by Amazon SageMaker. These capabilities allow users to perform operations on the AWS hardware that is managed by SageMaker. The operations can only be performed if the user grants permissions through an IAM role, also known as an execution role. This role grants the service permission to access the user's AWS resources. The geospatial capabilities are particularly useful for tasks that involve geographical data or mapping services.

## Conclusions
1. **Effectiveness of Retrieval Augmentation:**
   The proof of concept (POC) demonstrated that integrating LangChain with Pinecone for retrieval augmentation can significantly reduce the time developers spend searching through documentation. This system effectively retrieves relevant information, addressing the primary challenge faced by Company X.

2. **Scalability and Ease of Application:**
   The first approach, utilizing Pinecone for vector indexing, proved to be more scalable and easier to implement. Its integration with LangChain allows for seamless expansion as additional documentation and data sources are incorporated. This approach is particularly beneficial for handling large volumes of data, ensuring the system remains efficient and responsive.

3. **Compliance with Company Requirements:**
   This approach better complies with the challenge provided by Company X. It ensures that the documentation system assists developers with unfamiliar parts of the documentation. Additionally, as mentioned in the "nice to have," it provides the source document in the output, helping users verify and explore further.
   
   **Geographical restrictions:** Using Pinecone within the US region.