<a href="https://colab.research.google.com/github/Aneeta-Xavier/assignment-16/blob/main/Prototyping_LangChain_Application_with_Production_Minded_Changes_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prototyping LangChain Application with Production Minded Changes

In our first breakout room, we'll explore how to set up a LangChain LCEL chain so that it takes advantage of the production-ready features LangChain offers out of the box.

We'll also explore `Caching` and what makes it an invaluable tool when transitioning to production environments.


## Task 1: Dependencies and Set-Up

Let's get everything we need - we're pinning several package versions today to reduce the chance of environment issues!

> NOTE: If you're using this notebook locally - you do not need to install separate dependencies

In [None]:
!pip install -qU langchain_openai langchain_community langchain pymupdf==1.24.10 qdrant-client==1.11.2 langchain_qdrant==0.1.4 langsmith langchain_huggingface==0.2.0

We'll need an HF Token:

In [None]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HF Token Key:")

HF Token Key:··········


And the LangSmith set-up:

In [None]:
import uuid

os.environ["LANGCHAIN_PROJECT"] = f"AIM Session 16 - {uuid.uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

LangChain API Key:··········


Let's verify our project so we can leverage it in LangSmith later.

In [None]:
print(os.environ["LANGCHAIN_PROJECT"])

AIM Session 16 - 7a055916


## Task 2: Setting up RAG With Production in Mind

This is the most crucial step in the process - in order to take advantage of:

- Asynchronous requests
- Parallel Execution in Chains
- And more...

You must...use LCEL. These benefits are provided out of the box and largely optimized behind the scenes.
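To make the "asynchronous requests" benefit concrete, here is a minimal sketch that uses only `asyncio` from the standard library. `fake_chain` is a hypothetical stand-in for an async chain call (LCEL runnables expose async counterparts such as `ainvoke`); it is not part of this notebook's chain, and the 0.2 second sleep simply simulates network latency.

```python
import asyncio
import time


async def fake_chain(question: str) -> str:
    """Hypothetical stand-in for an async chain call (network-bound work)."""
    await asyncio.sleep(0.2)  # simulate waiting on a remote endpoint
    return f"answer to: {question}"


async def main() -> tuple[list[str], float]:
    questions = ["q1", "q2", "q3"]
    start = time.perf_counter()
    # All three requests wait on the "network" at the same time.
    answers = await asyncio.gather(*(fake_chain(q) for q in questions))
    return list(answers), time.perf_counter() - start


answers, elapsed = asyncio.run(main())
print(answers, f"{elapsed:.2f}s")
```

Run sequentially, the three calls would take roughly three times the single-call latency; run concurrently, they take roughly one. Inside a notebook, where an event loop is already running, you would typically write `await main()` instead of `asyncio.run(main())`.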

### Building our RAG Components: Retriever

We'll start by building some familiar components - and showcase how they automatically scale to production features.

Please upload a PDF file to use in this example!

> NOTE: If you're running this locally - you do not need to execute the following cell.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving DeepSeek_R1.pdf to DeepSeek_R1 (2).pdf


In [None]:
file_path = "./DeepSeek_R1.pdf"
file_path

'./DeepSeek_R1.pdf'

We'll define our chunking strategy.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

We'll chunk our uploaded PDF file.

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path)
documents = loader.load()
docs = text_splitter.split_documents(documents)
for i, doc in enumerate(docs):
    doc.metadata["source"] = f"source_{i}"

#### Qdrant Vector Database - Cache Backed Embeddings

The process of embedding is typically very time-consuming. For every single vector in our vector database, as well as for every query, we must:

1. Send the text to an API endpoint (self-hosted, OpenAI, etc)
2. Wait for processing
3. Receive response

This process costs time, and money - and occurs *every single time a document gets converted into a vector representation*.

Instead, what if we:

1. Set up a cache that can hold our text and its embeddings (similar to, or in some cases literally, a vector database)
2. Check the cache to see if we've already converted this text before.
  - If we have: Return the cached vector representation
  - Else: Send the text to an API endpoint (self-hosted, OpenAI, etc), wait for processing, and proceed
3. Store the text that was converted alongside its vector representation in the cache.
4. Return the vector representation

Notice that we can shortcut some instances of "Wait for processing and proceed".
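Before looking at the LangChain implementation, the flow above can be sketched in plain Python. This is an illustration only: `ToyCachedEmbedder` and `fake_embed` are hypothetical names, the "endpoint" is a local function, and the cache is a dictionary rather than a file store.

```python
import hashlib


class ToyCachedEmbedder:
    """Illustrative stand-in for a cache-backed embedder (not the LangChain class)."""

    def __init__(self, embed_fn, namespace: str):
        self.embed_fn = embed_fn    # the slow, paid call we want to avoid repeating
        self.namespace = namespace  # keeps vectors from different models apart
        self.store: dict[str, list[float]] = {}
        self.misses = 0

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed(self, text: str) -> list[float]:
        key = self._key(text)
        if key in self.store:         # seen before: skip the endpoint entirely
            return self.store[key]
        self.misses += 1
        vector = self.embed_fn(text)  # not seen: pay for the call once
        self.store[key] = vector
        return vector


def fake_embed(text: str) -> list[float]:
    """Hypothetical embedding function standing in for a remote endpoint."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


embedder = ToyCachedEmbedder(fake_embed, namespace="toy-model-v1")
first = embedder.embed("hello world")
second = embedder.embed("hello world")   # served from the cache
third = embedder.embed("hello world ")   # trailing space: a different key
print(embedder.misses)
```

Only two of the three calls reach `fake_embed`: the repeated text is served from the dictionary, while the text with a trailing space is treated as new.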

Let's see how this is implemented in the code.

In [None]:
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from langchain.storage import LocalFileStore
from langchain_qdrant import QdrantVectorStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
import hashlib

YOUR_EMBED_MODEL_URL = "https://v4knclu6pg2w6c6h.us-east-1.aws.endpoints.huggingface.cloud"

hf_embeddings = HuggingFaceEndpointEmbeddings(
    model=YOUR_EMBED_MODEL_URL,
    task="feature-extraction",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

collection_name = f"pdf_to_parse_{uuid.uuid4()}"
client = QdrantClient(":memory:")
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Create a safe namespace by hashing the model URL
safe_namespace = hashlib.md5(hf_embeddings.model.encode()).hexdigest()

store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

# Typical Qdrant Vector Store Set-up
vectorstore = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
    embedding=cached_embedder)

vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 1})

##### ❓ Question #1:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

> NOTE: There is no single correct answer here!

⚠️ Limitations / least useful when:

- Low cache hit rate: If your application mostly processes new, unique inputs (e.g., personalized user uploads or real-time data streams), caching provides little benefit.
- Subtle input changes break caching: The cache matches on the exact text, so even an added space or a changed punctuation mark results in a cache miss, even if the meaning is the same.
- Storage growth: On-disk or in-memory caches grow over time. Without cache eviction or cleanup, storage usage can balloon and affect performance.
- Not portable or distributed: Local cache files don't transfer across machines. In serverless or multi-node environments, the cache must be centrally managed to be effective.
- Stale embeddings: If the underlying model is updated but the cache isn't invalidated, you may unknowingly use outdated vector representations.

✅ Most useful when:

- The same text is embedded repeatedly (e.g., during development, debugging, or when reprocessing known documents).
- You're working in a local, persistent environment where cache files can be reliably stored and reused.
- You want to reduce API calls to paid services or avoid hitting rate limits.
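Two of the limitations above (exact-match keys and stale embeddings) come down to how the cache key is built. The sketch below is one possible approach, not what LangChain does internally: `cache_key` is a hypothetical helper, and the model identifiers are made up for illustration.

```python
import hashlib


def cache_key(text: str, model_id: str, normalize: bool = False) -> str:
    """Build a cache key from the model identifier and a hash of the text."""
    if normalize:
        # Collapse runs of whitespace and strip the ends before hashing.
        text = " ".join(text.split())
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_id}:{digest}"


question = "What is DeepSeek-R1?"
noisy_question = "What is  DeepSeek-R1? "  # extra spaces, same meaning

# Exact-match keys differ for the two strings.
exact_same = cache_key(question, "toy-model-v1") == cache_key(noisy_question, "toy-model-v1")

# Normalized keys collapse the whitespace difference.
normalized_same = (
    cache_key(question, "toy-model-v1", normalize=True)
    == cache_key(noisy_question, "toy-model-v1", normalize=True)
)

# A different model identifier never reuses vectors from the old model.
cross_model_same = cache_key(question, "toy-model-v1") == cache_key(question, "toy-model-v2")

print(exact_same, normalized_same, cross_model_same)
```

Normalization is a trade-off: it raises the hit rate, but the cached vector then belongs to the normalized text rather than the exact input. Putting the model identifier in the key plays the same role as the `namespace` argument used earlier.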

##### 🏗️ Activity #1:

Create a simple experiment that tests the cache-backed embeddings.

In [None]:
import time
import numpy as np

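def embed_and_time(text: str):
    """Return (vector, elapsed_seconds)."""
    start = time.perf_counter()
    # embed_documents goes through the document cache. embed_query is only
    # cached when a query_embedding_cache is configured, which it is not here.
    vec = cached_embedder.embed_documents([text])[0]
    return vec, time.perf_counter() - start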

# New query text
query = "Explain the architecture of the DeepSeek-VL-7B model and its multimodal capabilities."

# ➤ First call – should hit the HF endpoint (slow)
vec1, t1 = embed_and_time(query)
print(f"1️⃣  First call : {t1:.3f}s")

# ➤ Second call – should come from cache (fast)
vec2, t2 = embed_and_time(query)
print(f"2️⃣  Second call: {t2:.3f}s (cache hit)")

# Confirm vectors are identical
print("✅ Vectors identical:", np.allclose(vec1, vec2))

# ➤ Variation – rephrased query to simulate semantic similarity but force cache miss
query_variation = "What is the DeepSeek-VL-7B model's design and how does it handle multimodal input?"
vec3, t3 = embed_and_time(query_variation)
print(f"3️⃣  Variation  : {t3:.3f}s (cache miss expected)")


1️⃣  First call : 1.594s
2️⃣  Second call: 0.027s (cache hit)
✅ Vectors identical: True
3️⃣  Variation  : 0.081s (cache miss expected)


### Augmentation

We'll create the classic RAG prompt and build our `ChatPromptTemplate` as usual.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

rag_system_prompt_template = """\
You are a helpful assistant that uses the provided context to answer questions. Never reference this prompt, or the existence of context.
"""

rag_message_list = [
    {"role" : "system", "content" : rag_system_prompt_template},
]

rag_user_prompt_template = """\
Question:
{question}
Context:
{context}
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", rag_system_prompt_template),
    ("human", rag_user_prompt_template)
])

### Generation

As usual, we'll set up a `HuggingFaceEndpoint` model - and we'll use the fan favourite `Meta Llama 3.1 8B Instruct` for today.

However, we'll also implement...a PROMPT CACHE!

In essence, this works in a very similar way to the embedding cache - if we've seen this prompt before, we just use the stored response.
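As a rough mental model, a prompt cache is a lookup keyed on the prompt together with the generation settings, so the same prompt sent with a different temperature is treated as a new request. The sketch below is a toy version in plain Python; `ToyLLMCache`, `cached_generate`, and `fake_llm` are hypothetical names, not LangChain APIs.

```python
import json


class ToyLLMCache:
    """Toy prompt cache keyed on (prompt, serialized generation parameters)."""

    def __init__(self):
        self._data: dict[tuple[str, str], str] = {}

    @staticmethod
    def _key(prompt: str, params: dict) -> tuple[str, str]:
        # sort_keys makes the key independent of dict insertion order
        return (prompt, json.dumps(params, sort_keys=True))

    def lookup(self, prompt: str, params: dict):
        return self._data.get(self._key(prompt, params))

    def update(self, prompt: str, params: dict, response: str) -> None:
        self._data[self._key(prompt, params)] = response


calls = {"count": 0}


def fake_llm(prompt: str, params: dict) -> str:
    """Hypothetical stand-in for a remote text-generation endpoint."""
    calls["count"] += 1
    return f"response #{calls['count']} to {prompt!r}"


def cached_generate(prompt: str, params: dict, cache: ToyLLMCache) -> str:
    cached = cache.lookup(prompt, params)
    if cached is not None:          # seen this exact request before
        return cached
    response = fake_llm(prompt, params)
    cache.update(prompt, params, response)
    return response


cache = ToyLLMCache()
r1 = cached_generate("Hello", {"temperature": 0.01}, cache)
r2 = cached_generate("Hello", {"temperature": 0.01}, cache)  # cache hit
r3 = cached_generate("Hello", {"temperature": 0.7}, cache)   # different params: miss
print(calls["count"])
```

Note that a cache hit returns the stored response verbatim, which is why cached answers never vary, even for a model that would otherwise sample different text.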

In [None]:
from langchain_core.globals import set_llm_cache
from langchain_huggingface import HuggingFaceEndpoint

YOUR_LLM_ENDPOINT_URL = "https://kr9llbk2m0n4qtql.us-east4.gcp.endpoints.huggingface.cloud"

hf_llm = HuggingFaceEndpoint(
    endpoint_url=YOUR_LLM_ENDPOINT_URL,
    task="text-generation",
    max_new_tokens=128,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
)

Setting up the cache can be done as follows:

In [None]:
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

##### ❓ Question #2:

What are some limitations you can see with this approach? When is this most/least useful? Discuss with your group!

🚧 Limitations of prompt caching:

- Process-local and ephemeral: In-memory caches are tied to the current process or container. They disappear if the environment restarts and can't be shared across instances.
- String-exact matching only: Even slight changes in phrasing, punctuation, or parameters (like temperature) result in cache misses.
- Risk of stale responses: Cached outputs preserve any hallucinations or outdated information unless manually invalidated.
- Low utility for dynamic input: In scenarios like chat history or streaming context, where prompts constantly change, the cache hit rate becomes negligible.

✅ Best suited for:

- Repeated evaluation runs, small-scale demos, or unit tests where the same prompt is used frequently.

❌ Least suited for:

- Large-scale deployments with autoscaling or multi-replica setups.
- Conversational or real-time systems where inputs vary between users and sessions.

##### 🏗️ Activity #2:

Create a simple experiment that tests the cache-backed generator.

In [None]:
# ── Minimal cache + direct HF call (drop in one notebook cell) ───────────────────────────────────
import os, time, json, requests

TEXT_GEN_URL = "https://kr9llbk2m0n4qtql.us-east4.gcp.endpoints.huggingface.cloud"
HF_TOKEN     = os.getenv("HF_TOKEN")

HEADERS = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type":  "application/json",
}

_prompt_cache: dict[str, str] = {}  # simple in-memory cache

def _call_hf(prompt: str,
             max_new_tokens: int = 128,
             temperature: float = 0.01) -> str:
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature":    temperature,
        },
    }
    resp = requests.post(TEXT_GEN_URL, headers=HEADERS,
                         json=payload, timeout=60)
    resp.raise_for_status()
    data = resp.json()
    # Normalize the common return shapes
    if isinstance(data, str):
        return data
    if isinstance(data, dict) and "generated_text" in data:
        return data["generated_text"]
    if isinstance(data, list):
        first = data[0]
        return first["generated_text"] if isinstance(first, dict) else first
    raise ValueError(f"Unexpected HF response shape:\n{json.dumps(data)[:300]}…")

def timed_call(prompt: str):
    """Return (response_text, elapsed_seconds). Uses _prompt_cache."""
    start = time.perf_counter()
    if prompt not in _prompt_cache:  # cache miss: real endpoint call
        _prompt_cache[prompt] = _call_hf(prompt)
    # Measure the elapsed time on both paths instead of hard-coding 0.0 for hits.
    return _prompt_cache[prompt], time.perf_counter() - start

# Example test run
prompt = "Summarise the LangChain framework in one concise sentence."
r1, t1 = timed_call(prompt)
print("1⃣", f"{t1:.3f}s")
r2, t2 = timed_call(prompt)
print("2⃣", f"{t2:.3f}s (cache hit)", r1 == r2)
r3, t3 = timed_call(prompt + " ")
print("3⃣", f"{t3:.3f}s (cache miss)")

1⃣ 10.475s
2⃣ 0.000s (cache hit) True
3⃣ 10.154s (cache miss)


## Task 3: RAG LCEL Chain

We'll also set-up our typical RAG chain using LCEL.

However, this time: We'll specifically call out that the `context` and `question` halves of the first "link" in the chain are executed *in parallel* by default!

Thanks, LCEL!
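To see why this matters, here is a small standard-library sketch of the idea. When LCEL meets a dictionary of branches, it runs the branches concurrently; the code below imitates that with a thread pool. `fake_retrieve`, `pass_question`, and `run_parallel` are hypothetical helpers, and the sleeps stand in for I/O latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_retrieve(inputs: dict) -> str:
    """Hypothetical retriever branch: simulates a slow vector search."""
    time.sleep(0.3)
    return f"context for {inputs['question']!r}"


def pass_question(inputs: dict) -> str:
    """Hypothetical pass-through branch, slowed down to make the effect visible."""
    time.sleep(0.3)
    return inputs["question"]


def run_parallel(branches: dict, inputs: dict) -> dict:
    """Run every branch on the same input at once and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, inputs) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


start = time.perf_counter()
result = run_parallel(
    {"context": fake_retrieve, "question": pass_question},
    {"question": "What is LCEL?"},
)
elapsed = time.perf_counter() - start
print(result, f"{elapsed:.2f}s")
```

Both branches take about 0.3 seconds, yet the whole step finishes in roughly that time rather than double it. In the real chain the `question` branch is nearly instant, so the gain shows up mainly when several slow branches (for example, multiple retrievers) sit side by side.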

In [None]:
from operator import itemgetter
from langchain_core.runnables.passthrough import RunnablePassthrough

retrieval_augmented_qa_chain = (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | chat_prompt | hf_llm
    )

Let's test it out!

In [None]:
retrieval_augmented_qa_chain.invoke({"question" : "Write 5 things about this document!"})

''

##### 🏗️ Activity #3:

Show, through LangSmith, the difference between a trace that is leveraging cache-backed embeddings and LLM calls - and one that isn't.

Post screenshots in the notebook!

In [None]:
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
set_llm_cache(InMemoryCache())        # prompt-level cache
# Reuse the names defined in the retriever cell above
# (hf_embeddings, store, safe_namespace).
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    hf_embeddings, store, namespace=safe_namespace, batch_size=32
)

import uuid, os
os.environ["LANGCHAIN_PROJECT"] = f"A3-cached-{uuid.uuid4().hex[:8]}"


In [None]:
retrieval_augmented_qa_chain.invoke(
    {"question": "Give me 50 facts about the document"}
)

'System: 1. The document was created by pdfTeX-1.40.26.\n2. The document was created by LaTeX with hyperref.\n3. The document was created on 2025-01-23T07:53:55+00:00.\n4. The document was created from source_16.\n5. The document was created from./DeepSeek_R1.pdf.\n6. The document has 22 total pages.\n7. The document is in PDF 1.5 format.\n8. The document has no title.\n9. The document has no author.\n10. The document has no subject.\n'

In [None]:
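set_llm_cache(None)                      # turn off prompt cache
# Raw endpoint, no cache wrapper. Note: the chain's retriever still uses the
# vector store built earlier, so this variable alone does not change retrieval.
no_cache_embedder = hf_embeddings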
os.environ["LANGCHAIN_PROJECT"] = f"A3-nocache-{uuid.uuid4().hex[:8]}"
retrieval_augmented_qa_chain.invoke(
    {"question": "Give me 50 facts about the document"}
)

"System: 1. The document was created on January 23rd, 2025 at 7:53:55 AM UTC.\n2. The document was created by pdfTeX version 1.40.26.\n3. The document was created by LaTeX with hyperref.\n4. The document has a total of 22 pages.\n5. The document is in PDF format.\n6. The document's title is empty.\n7. The document's author is empty.\n8. The document's subject is empty.\n9. The document's keywords are empty.\n10. The document's modification date is January 23rd, "

See the screenshots in the file.


📥 Cached: System: The document is a PDF file with 22 pages. It was created by pdfTeX-1.40.26 and LaTeX with hyperref. The source of the document is source_57. The total number of pages in the document is 22. The format of the document is PDF 1.5. The title of the document is. The author of the document is. The subject of the document is. The keywords of the document are. The modification date of the document is 2025-01-23T07:53:55+00:00. The trapped information in the document is. The modification
📤 No Cache: System: The document is a PDF file with 22 pages. It was created by pdfTeX-1.40.26 and LaTeX with hyperref. The source of the document is source_57. The total number of pages in the document is 22. The format of the document is PDF 1.5. The title of the document is. The author of the document is. The subject of the document is. The keywords of the document are. The modification date of the document is 2025-01-23T07:53:55+00:00. The trapped information in the document is. The modif