# Prototyping a LangChain Application with Production-Minded Changes

For our first breakout room, we'll explore how to set up a LangChain LCEL chain so that it takes advantage of the production-ready features LCEL offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're pinning very specific versions today to try to mitigate potential environment issues!

> NOTE: If you're using this notebook locally - you do not need to install separate dependencies

In [1]:
#!pip install -qU langchain_openai==0.2.0 langchain_community==0.3.0 langchain==0.3.0 pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith==0.1.121 langchain_huggingface==0.2.0

We'll need an HF Token:

In [2]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

And the LangSmith set-up:

In [3]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

Let's verify our project so we can leverage it in LangSmith later.

In [4]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - 9a8ce23d


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

> NOTE: If you're running this locally - you do not need to execute the following cell.

In [5]:
#from google.colab import files
#uploaded = files.upload()

In [6]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
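
To build some intuition for what `chunk_size` and `chunk_overlap` mean, here is a minimal sketch of a plain sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role the overlap plays: context that straddles a chunk boundary is still present in at least one chunk.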

We'll chunk our uploaded PDF file.

In [8]:
from langchain_community.document_loaders import PyMuPDFLoader

Loader = PyMuPDFLoader
loader = Loader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache Backed Embeddings

The process of embedding is typically a very time-consuming one - for every single vector in our VDB, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our texts and their embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc.), then wait for processing and proceed
3. Store the text that was converted alongside its vector representation in a cache of some kind.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
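
Before looking at the LangChain version, the check-then-compute flow can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `ToyCachedEmbedder` is a hypothetical class, the cache is a dictionary rather than a file store, and `fake_embed` stands in for a real embedding endpoint.

```python
import hashlib


class ToyCachedEmbedder:
    """Illustrative cache wrapper around any text -> vector function."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}      # cache: hashed text -> vector
        self.api_calls = 0   # how many times we really called the embedder

    def _key(self, text: str) -> str:
        # Hash the text so keys have a fixed, storage-friendly size
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self.store:            # cache hit: skip the expensive call
            return self.store[key]
        self.api_calls += 1              # cache miss: call the embedder
        vector = self.embed_fn(text)
        self.store[key] = vector         # remember the result for next time
        return vector


def fake_embed(text: str) -> list[float]:
    # Hypothetical stand-in for a remote embedding endpoint
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


embedder = ToyCachedEmbedder(fake_embed)
v1 = embedder.embed("hello world")    # miss: calls fake_embed
v2 = embedder.embed("hello world")    # hit: served from the cache
v3 = embedder.embed("something new")  # miss: calls fake_embed
print(embedder.api_calls)
```

Only the two distinct texts reach the embedding function; the repeated text is answered from the dictionary.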

Let's see how this is implemented in the code.

In [9]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

YOUR_EMBED_MODEL_URL = "https://mn79rbpkbd0gr649.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Create a safe namespace by hashing the model URL
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

# Typical QDrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)

vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

#### 😎 ANSWER #1:

The main limitations of the cache-backed embeddings approach:

1. Disk usage is unbounded because there is no automatic cleanup, so the cache can grow to gigabytes.
2. There is no invalidation process, so the cache becomes stale if the data changes.
3. It is extremely sensitive to the exact input text (e.g. "AI model." and "AI model" produce two different cache entries).
4. The file system is not optimised for scalability. Performance will degrade with millions of files.

This approach is more useful for:
- development and prototyping (rapid iterations at no additional cost)
- static content (documentation, references, FAQs, ...)
- a light API budget (you get a high ROI on repetitive content)

This approach is less useful for:
- dynamic production environments (news, unique user messages, ...)
- distributed architectures (the cache is not shared between instances)

In summary, if the content is stable and reusable, local caching is ideal; otherwise, it is a solution to avoid.
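
The sensitivity to exact input text is easy to see in isolation. The sketch below hashes two near-identical strings the way an exact-match cache might, and then shows one possible mitigation: normalising the text before hashing. Both `cache_key` and `normalized_cache_key` are hypothetical helpers, and the normalisation step is an assumption for illustration, not something the cell above does.

```python
import hashlib


def cache_key(text: str) -> str:
    """Exact-match key: any character difference yields a different key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def normalized_cache_key(text: str) -> str:
    """Hypothetical mitigation: lowercase, collapse whitespace, drop trailing punctuation."""
    normalized = " ".join(text.lower().split()).rstrip(".!?")
    return cache_key(normalized)


raw_a = cache_key("AI model.")
raw_b = cache_key("AI model")
norm_a = normalized_cache_key("AI model.")
norm_b = normalized_cache_key("AI  model")

print(raw_a == raw_b)    # the exact-match keys differ
print(norm_a == norm_b)  # the normalised keys collide on purpose
```

The trade-off is that two slightly different texts now share one stored vector, so normalisation trades a little embedding fidelity for a higher hit rate.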


##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [10]:
## 😎 ACTIVITY #1 :
import time

question = "What is DeepSeek R1?"
print(f"Test question: {question}")
print(f"LangSmith project: {os.environ.get('LANGCHAIN_PROJECT', 'Not defined')}")

# First call: time the retrieval for the question
print("FIRST RUN:")
start_time = time.time()
docs1 = retriever.invoke(question)
first_time = time.time() - start_time
print(f"Duration: {first_time:.3f} seconds")


# Second call: same question again, so we can compare the two timings
print("\nSECOND RUN:")
start_time = time.time()
docs2 = retriever.invoke(question)
second_time = time.time() - start_time
print(f"Duration: {second_time:.3f} seconds")

print("\nRESULTS:")
print(f"   First run:   {first_time:.3f}s")
print(f"   Second run:  {second_time:.3f}s")

speedup = first_time / second_time if second_time > 0 else float('inf')
improvement = ((first_time - second_time) / first_time) * 100 if first_time > 0 else 0
print(f"Speedup: {speedup:.1f}x faster")
print(f"Improvement: {improvement:.1f}%")


Test question: What is DeepSeek R1?
LangSmith project: AIM Session 16 - 9a8ce23d
FIRST RUN:
Duration: 0.535 seconds

SECOND RUN:
Duration: 0.164 seconds

RESULTS:
   First run:   0.535s
   Second run:  0.164s
Speedup: 3.3x faster
Improvement: 69.4%


### FIRST RUN
![image](./screenshots/Activity01_01.png)

### SECOND RUN
![image](./screenshots/Activity01_02.png)

### Augmentation

We'll create the classic RAG prompt and build our `ChatPromptTemplate` as per usual.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

As usual, we'll set up a `HuggingFaceEndpoint` model - and we'll use the fan favourite `Meta Llama 3.1 8B Instruct` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
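
As a rough mental model, a prompt cache can be pictured as a lookup table keyed on the exact prompt plus the generation settings. The sketch below is an assumption-laden toy, not LangChain's implementation: `ToyLLMCache`, `cached_generate`, and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real model call.

```python
import json


class ToyLLMCache:
    """Illustrative in-memory cache keyed on (prompt, generation parameters)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt: str, params: dict) -> tuple:
        # Serialise params with sorted keys so equal settings give equal keys
        return (prompt, json.dumps(params, sort_keys=True))

    def lookup(self, prompt: str, params: dict):
        return self._store.get(self._key(prompt, params))

    def update(self, prompt: str, params: dict, response: str) -> None:
        self._store[self._key(prompt, params)] = response


calls = {"count": 0}


def fake_llm(prompt: str, params: dict) -> str:
    # Hypothetical stand-in for a slow, paid model call
    calls["count"] += 1
    return f"answer to: {prompt}"


def cached_generate(cache: ToyLLMCache, prompt: str, params: dict) -> str:
    cached = cache.lookup(prompt, params)
    if cached is not None:       # seen this exact prompt and settings before
        return cached
    response = fake_llm(prompt, params)
    cache.update(prompt, params, response)
    return response


cache = ToyLLMCache()
params = {"temperature": 0.01, "max_new_tokens": 512}
r1 = cached_generate(cache, "What is DeepSeek R1?", params)  # miss
r2 = cached_generate(cache, "What is DeepSeek R1?", params)  # hit
r3 = cached_generate(cache, "What is DeepSeek R1?", {**params, "temperature": 0.7})  # miss
print(calls["count"])
```

Changing either the prompt or any generation parameter produces a new key, so the model is called again.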

In [12]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

YOUR_LLM_ENDPOINT_URL = "https://btnzipwkbdemvx6e.us-east-1.aws.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=f"{YOUR_LLM_ENDPOINT_URL}",
    task="text-generation",
    #max_new_tokens=128,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

Setting up the cache can be done as follows:

In [13]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

#### 😎 ANSWER #2:
The main limitations of the LLM cache with this approach:
1. Extreme sensitivity to any input change (because, under the hood, it uses a key based on the prompt, including the retrieved context, and all the parameters of the model)
2. It is memory-only storage that is lost on restart
3. There are no cache management features (eviction, analytics, ...)

This approach is more useful for:
- development and testing environments
- fixed datasets and stable prompts
- cost control during prototyping


This approach is less useful for:
- production with dynamic content
- multi-user applications
- applications requiring response variability
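
To make the eviction point concrete, here is a minimal sketch of a size-bounded cache that drops the least recently used entry when it is full. `BoundedLRUCache` is a hypothetical class written for illustration only; before building something like this, it is worth checking whether the cache class you are using already offers a size limit.

```python
from collections import OrderedDict


class BoundedLRUCache:
    """Illustrative cache that keeps at most max_entries items (LRU eviction)."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._store)


lru = BoundedLRUCache(max_entries=2)
lru.put("prompt a", "response a")
lru.put("prompt b", "response b")
lru.get("prompt a")                # "prompt a" is now the most recently used
lru.put("prompt c", "response c")  # cache is full, so "prompt b" is evicted
print(len(lru))
```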

##### 🏗️ Activity #2:

Create a simple experiment that tests the cache-backed generator.

In [14]:
## 😎 ACTIVITY #2 :

question = "What is DeepSeek R1?"

context_docs = retriever.invoke(question)
context = "\n".join([doc.page_content for doc in context_docs])

prompt = chat_prompt.format(question=question, context=context)

print(f"Test question: {question}")
print(f"LangSmith project: {os.environ.get('LANGCHAIN_PROJECT', 'Not defined')}")

print(f"\nFIRST RUN:")
start_time = time.time()
response1 = hf_llm.invoke(prompt)
first_time = time.time() - start_time
print(f"Duration: {first_time:.3f} seconds")
print(f"Response length: {len(str(response1))} characters")

print(f"\nSECOND RUN:")
start_time = time.time()
response2 = hf_llm.invoke(prompt)
second_time = time.time() - start_time
print(f"Duration: {second_time:.3f} seconds")
print(f"Response length: {len(str(response2))} characters")

responses_identical = response1 == response2
print(f"Responses are identical: {responses_identical}")


print(f"\nRESULTS:")
print(f"   First run:   {first_time:.3f}s")
print(f"   Second run:  {second_time:.3f}s")

speedup = first_time / second_time if second_time > 0 else float('inf')
improvement = ((first_time - second_time) / first_time) * 100 if first_time > 0 else 0
print(f"   Speedup: {speedup:.1f}x faster")
print(f"   Improvement: {improvement:.1f}%")

Test question: What is DeepSeek R1?
LangSmith project: AIM Session 16 - 9a8ce23d

FIRST RUN:
Duration: 31.472 seconds
Response length: 2188 characters

SECOND RUN:
Duration: 0.001 seconds
Response length: 2188 characters
Responses are identical: True

RESULTS:
   First run:   31.472s
   Second run:  0.001s
   Speedup: 57542.2x faster
   Improvement: 100.0%


### FIRST RUN
![image](./screenshots/Activity02_01.png)

### SECOND RUN
![image](./screenshots/Activity02_02.png)

## Task 3: RAG LCEL Chain

We'll also set up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
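
To see why parallel branches matter, here is a small standard-library sketch of the idea: two independent branches receive the same input and run concurrently, and their results are collected into one dictionary. This is only an analogy for what the dictionary step in the chain does, not LCEL's actual implementation; `run_parallel`, `slow_retrieve`, and `pass_question` are hypothetical names, and the `sleep` calls simulate I/O latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches: dict, inputs: dict) -> dict:
    """Run every branch on the same inputs concurrently and gather the results."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, inputs) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


def slow_retrieve(inputs: dict) -> list[str]:
    time.sleep(0.3)  # simulated retriever latency
    return [f"chunk about {inputs['question']}"]


def pass_question(inputs: dict) -> str:
    time.sleep(0.3)  # simulated latency so the overlap is visible
    return inputs["question"]


start = time.perf_counter()
result = run_parallel(
    {"context": slow_retrieve, "question": pass_question},
    {"question": "What is DeepSeek R1?"},
)
elapsed = time.perf_counter() - start

print(result)
print(f"elapsed: {elapsed:.2f}s")
```

Run one after the other, the two branches would take roughly the sum of their delays; run concurrently, the total is close to the slower of the two.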

In [15]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {
            "context": itemgetter("question") | retriever, 
            "question": itemgetter("question")
         }
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt 
        | hf_llm
    )

Let's test it out!

In [16]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

"Human: Here are 50 things about this document:\n\n1. The document is a PDF.\n2. The document has 22 pages.\n3. The document was created on January 23, 2025.\n4. The document was modified on January 23, 2025.\n5. The document's title is empty.\n6. The document's author is unknown.\n7. The document's subject is unknown.\n8. The document's keywords are unknown.\n9. The document was created using LaTeX with hyperref.\n10. The document was produced using pdfTeX-1.40.26.\n11. The document's creation date is January 23, 2025.\n12. The document's modification date is January 23, 2025.\n13. The document's trapped status is unknown.\n14. The document's metadata source is'source_16'.\n15. The document's file path is './DeepSeek_R1.pdf'.\n16. The document's page number is 4.\n17. The document's total pages is 22.\n18. The document's format is 'PDF 1.5'.\n19. The document's creator is LaTeX with hyperref.\n20. The document's producer is pdfTeX-1.40.26.\n21. The document's creation date is January 

In [17]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 50 things about this document!"})

"Human: Here are 50 things about this document:\n\n1. The document is a PDF.\n2. The document has 22 pages.\n3. The document was created on January 23, 2025.\n4. The document was modified on January 23, 2025.\n5. The document's title is empty.\n6. The document's author is unknown.\n7. The document's subject is unknown.\n8. The document's keywords are unknown.\n9. The document was created using LaTeX with hyperref.\n10. The document was produced using pdfTeX-1.40.26.\n11. The document's creation date is January 23, 2025.\n12. The document's modification date is January 23, 2025.\n13. The document's trapped status is unknown.\n14. The document's metadata source is'source_16'.\n15. The document's file path is './DeepSeek_R1.pdf'.\n16. The document's page number is 4.\n17. The document's total pages is 22.\n18. The document's format is 'PDF 1.5'.\n19. The document's creator is LaTeX with hyperref.\n20. The document's producer is pdfTeX-1.40.26.\n21. The document's creation date is January 

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

### FIRST RUN
![image](./screenshots/Activity03_01.png)

### SECOND RUN
![image](./screenshots/Activity03_02.png)