# Using NeMo Retriever Embedding Microservice with LCEL

In LLM and RAG workflows, [embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) transform document text into vectors that capture semantic meaning. This enables efficient search for contextually relevant documents based on a user's query. These documents are then provided to the LLM, enhancing its ability to generate accurate responses. 
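To make the idea of "vectors that capture semantic meaning" concrete, here is a minimal sketch of comparing embeddings with cosine similarity. The three-dimensional vectors and the document names are hypothetical, hand-picked values for illustration; real embedding models return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model).
query = [0.9, 0.1, 0.0]
documents = {
    "gpu_memory": [0.8, 0.2, 0.1],
    "cooking_recipe": [0.0, 0.1, 0.9],
}

# Score every document against the query and pick the closest one.
scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
best_match = max(scores, key=scores.get)
print(best_match, scores)
```

The document whose vector points in nearly the same direction as the query vector gets the highest score, which is the basic mechanism behind semantic search.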

This notebook first shows how to generate embeddings from a query. We then apply the same approach to embed a document, store the embeddings in a vector store, and use that store in an LCEL chain to help the LLM answer a question about the NVIDIA H200 from the first notebook.

### Generate Embeddings with NeMo Retriever Embedding Microservice

In [1]:
# from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
#
# # NeMo Retriever Embedding Microservice
# embedding_model = NVIDIAEmbeddings(base_url="http://localhost:8080/v1")

In [2]:
import os
from dotenv import load_dotenv
load_dotenv('../.env')

True

In [3]:
# Examples of other embedding models 

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

# NVIDIA AI Foundation Models
embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA")

# HuggingFace Embeddings
# embedding_model = HuggingFaceEmbeddings(
#     model_name="intfloat/e5-large-v2",
#     model_kwargs={"device": "cuda"},
#     encode_kwargs={"normalize_embeddings": False},
# )

In [4]:
# Create vector embeddings of the query

embedding_model.embed_query("How much memory does the NVIDIA H200 have?")[:10] # see the first 10 elements of the vector embeddings

[-0.036102294921875,
 -0.04345703125,
 0.031890869140625,
 -0.03680419921875,
 0.045867919921875,
 0.006679534912109375,
 -0.0019245147705078125,
 -0.047119140625,
 -0.016448974609375,
 -0.0261077880859375]

### Load PDF (NVIDIA H200 Datasheet)

Next, we'll load a PDF of the [NVIDIA H200 Datasheet](https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446). This is the knowledge base from which relevant information is retrieved to help the LLM answer our question.

LangChain provides a variety of [document loaders](https://python.langchain.com/docs/integrations/document_loaders) that load many types of documents (HTML, PDF, code) from many different sources and locations (private S3 buckets, public websites).

In this example, we use a LangChain [`PyPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html) to load a datasheet about the NVIDIA H200 Tensor Core GPU. 

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://nvdam.widen.net/content/udc6mzrk7a/original/hpc-datasheet-sc23-h200-datasheet-3002446.pdf")

document = loader.load()
document[0]

Document(page_content='NVIDIA H200 Tensor Core GPU\u2002|\u2002Datasheet\u2002|\u2002 1NVIDIA H200 Tensor Core GPU\nSupercharging AI and HPC workloads.\nHigher Performance With Larger, Faster Memory\nThe NVIDIA H200 Tensor Core GPU supercharges generative AI and high-\nperformance computing (HPC) workloads with game-changing performance  \nand memory capabilities. \nBased on the NVIDIA Hopper™ architecture , the NVIDIA H200 is the first GPU to \noffer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)—\nthat’s nearly double the capacity of the NVIDIA H100 Tensor Core GPU  with \n1.4X more memory bandwidth. The H200’s larger and faster memory accelerates \ngenerative AI and large language models, while advancing scientific computing for \nHPC workloads with better energy efficiency and lower total cost of ownership. \nUnlock Insights With High-Performance LLM Inference\nIn the ever-evolving landscape of AI, businesses rely on large language models to \naddress a diver

<br>

Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, for example, a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/). 

LangChain provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. In this example, we use a [``RecursiveCharacterTextSplitter``](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html). The ``RecursiveCharacterTextSplitter`` divides a large text into smaller chunks based on a specified chunk size. It works recursively through an ordered list of separators (e.g., "\n\n", "\n", " ", ""). It first tries to split the text on the first separator in the list. If a resulting chunk is still larger than the desired chunk size, it moves to the next separator and splits that chunk again. This continues until all chunks fit within the specified maximum chunk size.
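The recursive strategy described above can be sketched in plain Python. This is a simplified illustration and not LangChain's actual implementation: the function name `recursive_split` is hypothetical, the sketch ignores chunk overlap, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators):
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left to try: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily merge neighboring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph here.\n\nSecond one is a bit longer than the first.\n\nThird."
chunks = recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", " ", ""])
for chunk in chunks:
    print(repr(chunk))
```

In this sample, the short paragraphs survive intact, while the long middle paragraph is broken on spaces because no paragraph or line separator can make it fit.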

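Text splitting involves a trade-off: chunks should be small enough to be retrieved precisely, yet semantically related text should ideally stay together. The `chunk_overlap` setting used below helps by repeating some text across adjacent chunks, so content near a chunk boundary is less likely to lose its surrounding context.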
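As a rough illustration of how overlap keeps neighboring text together, the following sketch cuts a string into fixed-size windows that share characters with their neighbors. The helper `sliding_chunks` is hypothetical and purely character-based; it is not how LangChain computes overlap internally.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop once a new window would contain only already-covered text.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

windows = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for window in windows:
    print(window)
```

Each window starts with the last four characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.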

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " ", ""],
)

document_chunks = text_splitter.split_documents(document)
print("Number of chunks from the document:", len(document_chunks))

Number of chunks from the document: 16


In [7]:
# Get example of a document chunk

example_chunk = document_chunks[1].page_content
example_chunk

'offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)—\nthat’s nearly double the capacity of the NVIDIA H100 Tensor Core GPU  with \n1.4X more memory bandwidth. The H200’s larger and faster memory accelerates \ngenerative AI and large language models, while advancing scientific computing for \nHPC workloads with better energy efficiency and lower total cost of ownership. \nUnlock Insights With High-Performance LLM Inference'

In [8]:
# Create vector embeddings of the example chunk

embedding_model.embed_query(example_chunk)[:10] # see the first 10 elements of the vector embeddings

[-0.0238189697265625,
 -0.004711151123046875,
 0.027740478515625,
 -0.069580078125,
 0.040924072265625,
 0.007671356201171875,
 0.042816162109375,
 -0.0144195556640625,
 0.02264404296875,
 -0.0203094482421875]

### Store Document Embeddings in the Vector Store

Once the document embeddings are generated, they are stored in a vector store so that at query time we can:

<ol>
    <li>Embed the user query and</li>
    <li>Retrieve the embedding vectors that are most similar to the embedded query.</li>
</ol>

A vector store takes care of storing the embedded data and performing a vector search. LangChain supports a [wide selection of vector stores](https://python.langchain.com/docs/integrations/vectorstores/); we'll be using FAISS for this example.
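Conceptually, the two query-time steps above amount to storing (vector, text) pairs and ranking them by similarity to the query vector. The sketch below is a hypothetical brute-force store with hand-written two-dimensional vectors; FAISS uses far more efficient index structures, and the class name `ToyVectorStore` is an assumption made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class ToyVectorStore:
    """Hypothetical in-memory store with brute-force similarity search."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((vector, text))

    def search(self, query_vector, k=1):
        # Score every stored vector, then return the k best texts.
        scored = [(cosine(query_vector, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

store = ToyVectorStore()
store.add([1.0, 0.0], "The H200 offers 141 GB of HBM3e memory.")
store.add([0.0, 1.0], "Unrelated text about something else.")

# Toy query vector, assumed to come from embedding the user's question.
top_hits = store.search([0.9, 0.1], k=1)
print(top_hits)
```

A real vector store does the same job at scale, typically with approximate nearest-neighbor indexes rather than scoring every stored vector.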

In [9]:
from langchain_community.vectorstores import FAISS

# Create FAISS vector store from our embedding service 
vector_store = FAISS.from_documents(document_chunks, embedding=embedding_model)

<br>

Next, we integrate the vector store with the LLM. A [LangChain Expression Language (LCEL)](https://python.langchain.com/docs/modules/chains/) chain combines these components. We define a prompt with two placeholders (`context` and `question`) and pipe it to the LLM connector, as shown below, to answer the original question from the first notebook (`How much memory does the NVIDIA H200 have?`) using content retrieved from the NVIDIA H200 datasheet.
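For intuition about what piping components together means, here is a toy sketch of pipe-style composition in plain Python. It is not LangChain's implementation: `Step`, `fan_out`, and the stand-in retriever and LLM are all hypothetical, and the fake LLM simply upper-cases its prompt.

```python
class Step:
    """Hypothetical wrapper that lets functions be chained with `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Run this step first, then feed its output to the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))

def fan_out(**steps):
    """Send the same input to several steps and collect a dict of results."""
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})

def build_prompt(inputs):
    context = "\n".join(inputs["context"])
    return f"Context: {context}\nQuestion: {inputs['question']}"

# Stand-ins for the real retriever and LLM (assumptions for illustration).
retrieve = Step(lambda question: ["The H200 offers 141 GB of HBM3e memory."])
passthrough = Step(lambda question: question)
fake_llm = Step(lambda prompt: prompt.upper())

toy_chain = fan_out(context=retrieve, question=passthrough) | Step(build_prompt) | fake_llm
result = toy_chain.invoke("How much memory does the NVIDIA H200 have?")
print(result)
```

The question enters once, is routed both to the retriever and straight through to the prompt, and the filled prompt then flows into the model, which mirrors the shape of the chain built in the next cells.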

In [10]:
import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Initialize LLM from NVIDIA AI Foundation Endpoints
# os.environ["NVIDIA_API_KEY"] = "nvapi-***" 
llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")

# Initialize with NVIDIA NIM for LLMs
# llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")



In [11]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", 
        # Adjacent string literals are concatenated, so each needs a trailing space
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Do not hallucinate. Say you don't know if you don't have this information."
        # "Answer the question using only the context"
        "\n\nQuestion: {question}\n\nContext: {context}"
    ),
    ("user", "{question}")
])

chain = (
    {
        "context": vector_store.as_retriever(),
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

In [12]:
print(chain.invoke("How much memory does the NVIDIA H200 have?"))

The NVIDIA H200 has 141 gigabytes (GB) of HBM3e memory.


In [13]:
print(chain.invoke("Is the NVIDIA H200 PCIe or SXM based?"))

The NVIDIA H200 is based on both PCIe and SXM interfaces, as mentioned in the document it can come in multiple form factors, which includes PCIe and SXM.
