### Advanced Retrieval Strategy

In [None]:
%pip install langchain_cohere -q
%pip install spacy -q
%pip install psycopg2 -q
%pip install python-dotenv -q
# Any errors printed by the installs above can be ignored

Standard imports for the libraries we will be using in this notebook. Try to keep your imports in the first cell so that this code can more easily be converted into a Python program later.

In [1]:
import boto3
import pandas as pd
import json
import time
import os
import numpy as np
import pyarrow
import traceback
from langchain.embeddings import BedrockEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import BedrockChat
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv

load_dotenv()
# Create the AWS client for the Bedrock runtime with boto3
aws_client = boto3.client(service_name="bedrock-runtime")

#### Let's define functions that use various embedding models to generate vector embeddings

Amazon Titan

In [2]:
# Let's generate a dense vector using Amazon Titan with LangChain
def generate_titan_vector_embedding(text):
    # Create an Amazon Titan Text Embeddings client
    embeddings_client = BedrockEmbeddings(region_name="us-west-2")

    # Invoke the model and return the embedding as a NumPy array
    embedding = embeddings_client.embed_query(text)
    return np.array(embedding)



This is the mathematical formula to calculate the cosine similarity between two vectors: the dot product of the vectors divided by the product of their norms.

In [3]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity
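
To get a feel for the range of values this function produces, here is a minimal sketch on toy vectors. The three-dimensional vectors are made up for illustration (real embeddings are much longer), and the helper is named `cosine_sim` only to avoid shadowing the notebook's own `cosine_similarity`.

```python
import numpy as np

def cosine_sim(vec1, vec2):
    """Dot product of the vectors divided by the product of their Euclidean norms."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Toy vectors, invented for this example
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the length
c = np.array([3.0, 0.0, -1.0])   # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

same_direction = cosine_sim(a, b)   # close to 1: length does not matter, only direction
orthogonal = cosine_sim(a, c)       # close to 0: no overlap in direction
opposite = cosine_sim(a, -a)        # close to -1: exactly opposite direction
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, a long abstract and a short query can still score highly as long as their embeddings point the same way.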



In [4]:
def print_top_values(list_stuff: list, num_items: int) -> None:
    # Print at most the first num_items entries of the list
    for item in list_stuff[:max(num_items, 0)]:
        print(item)

In [5]:
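# The original cleaning steps for the abstract text are kept here for reference
#df = pd.read_csv('data/latest_research_articles.csv')
#df['abstract'] = df['abstract'].apply(clean_value)

#df
# Load the DataFrame that already contains the embedded abstracts
dft = pd.read_pickle('data/embedded_df.pkl')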

### Advanced Retrieval Techniques
#### HyDE
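Semantic matching works best when the query carries rich semantic context, but a short question often has little in common with a full article. HyDE (Hypothetical Document Embeddings) addresses this: what if we generated a document from the query that better matches our stored documents?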
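
The idea can be sketched end to end without calling any model. Everything in this block is a stand-in invented for illustration: `toy_embed` is a bag-of-words counter over a tiny made-up vocabulary (not a real embedding model), `toy_generate` returns a canned paragraph in place of an LLM, and the two documents are fabricated examples.

```python
import numpy as np

# Tiny made-up vocabulary for a bag-of-words "embedding" (illustration only)
VOCAB = ["rib", "fracture", "pediatric", "femur", "imaging", "broken", "children"]

def toy_embed(text):
    """Count how often each vocabulary term appears in the text."""
    words = text.lower().split()
    return np.array([float(words.count(term)) for term in VOCAB])

def toy_generate(query):
    """Stand-in for the LLM: returns a canned 'hypothetical document'."""
    return "pediatric rib fracture imaging"

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def hyde_scores(query, generate_fn, embed_fn, doc_vectors):
    """Score documents against the embedded hypothetical document, not the raw query."""
    hypothetical_doc = generate_fn(query)     # step 1: write a fake answer
    hyde_vector = embed_fn(hypothetical_doc)  # step 2: embed the fake answer
    return [cosine(doc_vector, hyde_vector) for doc_vector in doc_vectors]

toy_docs = ["pediatric rib fracture detection imaging", "femur imaging study"]
toy_doc_vectors = [toy_embed(d) for d in toy_docs]
toy_query = "broken rib children"

raw_scores = [cosine(v, toy_embed(toy_query)) for v in toy_doc_vectors]
with_hyde = hyde_scores(toy_query, toy_generate, toy_embed, toy_doc_vectors)
print("raw query:", raw_scores)
print("with HyDE:", with_hyde)
```

In this toy setup the raw query shares only the word "rib" with the relevant document, while the hypothetical document uses the same vocabulary as the stored text, so its score against the relevant document is higher. Real embedding models capture meaning rather than exact words, so the size of the effect will differ.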

In [6]:
### Retrieval from embedded sources
#Now that we have a dataframe with embedded content of interest, we can use semantic similarity to retrieve the right data to feed to an LLM

# Given the following query let's generate context that more closely matches the embedded data
query = "What is the latest research for broken ribs in children"

#### Calling the LLM with Python
Before we embed the query, let's transform it into a fake article. This article will likely have a larger semantic overlap with the stored abstracts than the original, much shorter question. Using Bedrock, we will now call Anthropic Claude 3 Sonnet to generate a fictitious article.


In [7]:
# Generate HyDE context

def generate_hyde_response(query_phrase):
    model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
    # Each model takes named parameters, which will likely differ depending on the provider
    model_kwargs =  { 
        "max_tokens": 400, # The maximum number of output tokens the model may generate
        "temperature": 1,  # Controls the randomness and creativity of the generated text
        "top_k": 250,      # The number of highest-probability next-token choices the model should consider
        "top_p": 0.9,      # Samples only from the most likely next-token choices whose cumulative probability reaches this threshold
        "stop_sequences": ["\n\nHuman"],
    }
    # LangChain tooling
    model = BedrockChat(
        client=aws_client,
        model_id=model_id,
        model_kwargs=model_kwargs,
    )
    
    # {query} is a template placeholder that LangChain fills in when the chain is invoked
    human_prompt = (
        "Given the following question \n {query} can you please generate a paragraph of text "
        "that answers the question. Be sure to use scientific medical terminology. "
        "Please just include the paragraph in your response."
    )
    # Use the messages format, which is required for all Claude 3 calls
    messages = [
        ("system", "You are a helpful assistant"),
        ("human", human_prompt),
    ]
    try:
        prompt = ChatPromptTemplate.from_messages(messages)
        # LangChain at work
        chain = prompt | model | StrOutputParser()


        # Send the message content to Claude using Bedrock and get the response
        start_time = time.time()  # Start timing
        # Call Bedrock
        response = chain.invoke({"query": query_phrase})
        end_time = time.time()  # End timing
        print("Claude call took :", end_time - start_time)  # Calculate execution time

        return response
    except Exception as e:
        # Report the exception type, message, and the line in this function where it surfaced
        line_number = e.__traceback__.tb_lineno
        print(f"Error: {type(e).__name__}: {e} on line {line_number}")

In [8]:
print(generate_hyde_response(query))

Claude call took : 5.902079343795776
Pediatric rib fractures, although relatively uncommon, can occur due to various traumatic events or underlying medical conditions. Recent research has focused on improving diagnostic modalities and treatment approaches to minimize complications and optimize recovery. Multiplanar imaging techniques, such as computed tomography (CT) and magnetic resonance imaging (MRI), have enhanced the detection and characterization of rib fractures, particularly in cases of non-accidental trauma. Additionally, studies have explored the role of minimally invasive surgical interventions, such as video-assisted thoracoscopic surgery (VATS), for the management of complex rib fractures with associated complications like hemothorax or pulmonary contusion. Furthermore, ongoing research aims to establish evidence-based pain management protocols, incorporating multimodal analgesia strategies to alleviate discomfort and facilitate respiratory function during the healing proc

#### Titan Embeddings - Same Query, No HyDE
Let's review the cosine similarities we get when the query is embedded as is to find articles that are semantically similar.

In [9]:
# Let's search our records for the best semantic matches
# First we embed our search term into a vector so that we can compare it mathematically
query_vector = generate_titan_vector_embedding(query)

# A list of (article index, cosine similarity score) tuples
# We will sort it to find the closest matches
results = []
# Iterate over each row in the DataFrame
for index, row in dft.iterrows():
    # Extract the precomputed embedding of the abstract
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))

# Sort so that the highest match is the first record
results.sort(key=lambda x: x[1], reverse=True)

# Print the five best matches
print("Here are a few articles that may match your interest:")
for index, score in results[:5]:
    # iterrows() yields index labels, so look the row up with .loc
    article_title = dft.loc[index]['title']
    print(f"Article: '{article_title}' with a cosine match of: {score}")

Here are a few articles that may match your interest:
Article: 'High sensitivity methods for automated rib fracture detection in pediatric radiographs' with a cosine match of: 0.4675950053473365
Article: 'Magnetic resonance imaging based finite element modelling of the proximal femur: a short-term in vivo precision study' with a cosine match of: 0.23057253235831168
Article: 'On the crashworthiness analysis of bio-inspired DNA tubes' with a cosine match of: 0.21674978249516885
Article: 'Reproduction of forearm rotation dynamic using intensity-based biplane 2D–3D registration matching method' with a cosine match of: 0.1990790331354665
Article: 'Propagation of extended fractures by local nucleation and rapid transverse expansion of crack-front distortion' with a cosine match of: 0.19056102062916994


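We can see a clear gap between our first result and our next best result: the top match scores about 0.47, roughly double the 0.23 of the runner-up, and only the first title is actually about pediatric rib fractures.

Now let's compare our cosine scores when we use HyDE.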

In [None]:
# Let's search our records for the best semantic matches
# This time we embed the generated HyDE article instead of the raw search term
query_vector = generate_titan_vector_embedding(generate_hyde_response(query))

# A list of (article index, cosine similarity score) tuples
# We will sort it to find the closest matches
results = []

# Iterate over each row in the DataFrame
for index, row in dft.iterrows():
    # Extract the precomputed embedding of the abstract
    article_embedding = row['embedded_abstract']
    # Store the result as an (index, score) tuple
    results.append((index, cosine_similarity(article_embedding, query_vector)))

# Sort so that the highest match is the first record
results.sort(key=lambda x: x[1], reverse=True)

# Print the five best matches
print("Here are a few articles that may match your interest:")
for index, score in results[:5]:
    # Use the index label from the original DataFrame to extract values of interest
    article_title = dft.loc[index]['title']
    print(f"Article: '{article_title}' with a cosine match of: {score}")
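
Since the ranking loop is the same with and without HyDE, it could be factored into one helper. The following is a minimal sketch, not part of the original notebook: the function name `top_k_matches` and the toy DataFrame are hypothetical, and it assumes every row stores an embedding of the same length in the `embedded_abstract` column alongside a `title` column, as `dft` appears to.

```python
import numpy as np
import pandas as pd

def top_k_matches(df, query_vector, k=5, embedding_col="embedded_abstract"):
    """Return the k rows most similar to query_vector, with their cosine scores."""
    # Stack the per-row embeddings into one (n_rows, dim) matrix
    matrix = np.vstack(df[embedding_col].tolist())
    # Cosine similarity of every row against the query in one vectorized step
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vector)
    scores = matrix @ query_vector / norms
    ranked = df.assign(score=scores).sort_values("score", ascending=False)
    return ranked.head(k)[["title", "score"]]

# Toy data, invented for this example
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "embedded_abstract": [
        np.array([1.0, 0.0]),
        np.array([0.0, 1.0]),
        np.array([1.0, 1.0]),
    ],
})
top = top_k_matches(toy_df, np.array([1.0, 0.0]), k=2)
print(top)
```

With a helper like this, the two experiments differ only in the vector passed in: the embedding of the raw query versus the embedding of the generated HyDE paragraph.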