# 📚 How to Build a Simple Retriever LLM App with LangChain

This section demonstrates how to build a **basic Retriever LLM application** that works over a text-based data source.

### 🔍 Key Features:
- The app indexes a small set of **text documents** in a vector store.
- It can **answer questions about those documents** by retrieving the most relevant ones and passing them to an LLM as context.


In [2]:
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


from dotenv import load_dotenv, find_dotenv

In [3]:
_ = load_dotenv(find_dotenv())
groq_api_key = os.environ['GROQ_API_KEY']
hf_token_api = os.environ['HF_TOKEN']
pinecone_api_key = os.environ['PINECONE_API_KEY']

## 📄 Documents in LangChain

In LangChain, a **Document** object is used to store both the **text content** and its associated **metadata**.

### 🧬 Document Attributes:
Each `Document` has the following two attributes:

- **`page_content`**: Contains the main text of the document.
- **`metadata`**: Stores additional information about the document (e.g. source, page number, etc.).


In [9]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="John F. Kennedy served as the 35th president of the United States from 1961 until his assassination in 1963.",
        metadata={"source": "us-presidents-doc"},
    ),
    Document(
        page_content="Robert F. Kennedy was a key political figure and served as the U.S. Attorney General during his brother's presidency.",
        metadata={"source": "us-politics-doc"},
    ),
    Document(
        page_content="The Kennedy family is known for their significant influence in American politics and public service.",
        metadata={"source": "kennedy-family-doc"},
    ),
    Document(
        page_content="Abraham Lincoln led the United States through the Civil War and issued the Emancipation Proclamation in 1863.",
        metadata={"source": "civil-war-doc"},
    ),
    Document(
        page_content="Franklin D. Roosevelt served four terms as U.S. president and implemented the New Deal to combat the Great Depression.",
        metadata={"source": "great-depression-doc"},
    ),
    Document(
        page_content="Martin Luther King Jr. was a prominent leader of the civil rights movement and delivered the famous 'I Have a Dream' speech.",
        metadata={"source": "civil-rights-doc"},
    ),
    Document(
        page_content="The Watergate scandal led to the resignation of President Richard Nixon in 1974.",
        metadata={"source": "us-scandals-doc"},
    ),
    Document(
        page_content="The moon landing in 1969 was a major milestone in the space race, with Neil Armstrong becoming the first man on the moon.",
        metadata={"source": "space-exploration-doc"},
    ),
    Document(
        page_content="Barack Obama made history by becoming the first African American president of the United States in 2008.",
        metadata={"source": "modern-presidents-doc"},
    ),
    Document(
        page_content="The U.S. Constitution, written in 1787, outlines the framework of the federal government and guarantees fundamental rights.",
        metadata={"source": "us-constitution-doc"},
    ),
]


# 🧠 Vector Stores vs. Retrievers

Let’s clarify the difference between **vector stores** and **retrievers** in a simple and practical way.

---

## 📦 Vector Stores

Think of a vector store as a **specialized storage system** where information is stored in a very specific format (vectors).

### 🔹 What They Do:
- **Storing Vectors**: A vector store saves information as vectors — numerical representations of text — making it easier for machines to compare and understand.
- **Purpose**: Its main goal is to **efficiently store and retrieve** vectorized data. When comparing pieces of information, the vector store performs fast similarity searches.
- **Usage**: Essential for systems that need to search large datasets by semantic similarity (e.g., grouping similar documents, matching user queries to content).
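
To make the storage-and-search idea concrete, here is a minimal sketch of an in-memory vector store in plain Python. `TinyVectorStore`, its methods, and the two-dimensional vectors are hypothetical names and values made up for illustration; real stores such as FAISS use optimized indexes and much larger vectors.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: keeps (vector, payload) pairs and ranks them by distance."""

    def __init__(self):
        self._items = []  # list of (vector, payload) tuples

    def add(self, vector, payload):
        self._items.append((list(vector), payload))

    def search(self, query_vector, k=2):
        # Euclidean (L2) distance: a smaller value means a closer match.
        scored = [
            (math.dist(query_vector, vector), payload)
            for vector, payload in self._items
        ]
        scored.sort(key=lambda pair: pair[0])
        return scored[:k]


store = TinyVectorStore()
store.add([1.0, 0.0], "doc about presidents")
store.add([0.9, 0.1], "doc about elections")
store.add([0.0, 1.0], "doc about space travel")

results = store.search([1.0, 0.05], k=2)
for distance, payload in results:
    print(f"{distance:.3f}  {payload}")
```

The store only knows about vectors and distances. Turning text into vectors is the job of a separate embedding model.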

---

## 🔎 Retrievers

In contrast to vector stores, **retrievers focus on actively finding relevant information** based on a query.

### 🔹 What They Do:
- **Retrieving Information**: A retriever takes a query (e.g., a user question or search term) and searches a database to find the most relevant pieces of information.
- **Purpose**: Designed to sift through large datasets and return the most appropriate results to answer a specific query.
- **Usage**: Commonly used in search engines, question-answering systems, and any scenario where quick and accurate retrieval of specific information is required.

---

## ⚖️ Key Differences: Vector Stores vs. Retrievers

| Aspect         | Vector Store                                                | Retriever                                                      |
|----------------|-------------------------------------------------------------|----------------------------------------------------------------|
| **Functionality** | Stores numerical representations (vectors) for similarity comparison. | Searches through data to retrieve relevant results based on a query. |
| **Output**        | Returns the stored documents closest to a query, optionally with similarity scores. | Returns a list of relevant documents or data entries.          |
| **Role in Systems** | Acts as the backend to store data used for comparison.     | Uses stored data to find and fetch relevant content.           |


In [6]:
from langchain_text_splitters import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq


In [7]:
llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0.1,
)

model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_embeddings = HuggingFaceEmbeddings(model_name=model_name)


In [10]:
vector_db = FAISS.from_documents(documents=documents, embedding=model_embeddings)

## 🔍 `similarity_search()` in LangChain

Imagine you have a big box full of different toys, and you're looking for toys similar to your favorite toy car.  
You might first pull out all the toy cars, then narrow it down to the ones that match the **color** or **size** of your favorite.

In computing, **similarity search** works the same way — it searches through large volumes of data (text, images, etc.) to find items **most similar** to a given query.

---

### 🧠 How Similarity Search Works in LangChain

When using `similarity_search()` in LangChain, the process usually involves three main steps:

1. **Representation**  
   Convert words or sentences into numerical vectors (called *embeddings*).

2. **Comparison**  
   Once in numerical form, these embeddings are compared to see how close (or far) they are — similar to calculating distances between points.

3. **Retrieval**  
   The vector store then ranks the items by similarity to your query and returns the most relevant results.
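
The three steps above can be sketched in plain Python. This is only a toy under stated assumptions: the word-count `embed` function stands in for a real embedding model, and `toy_similarity_search` is a hypothetical helper, not LangChain's `similarity_search()`.

```python
import math
from collections import Counter


def embed(text):
    """Step 1 (representation): a toy bag-of-words 'embedding' made of word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Step 2 (comparison): cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query, texts, k=1):
    """Step 3 (retrieval): rank texts by similarity to the query and keep the top k."""
    query_vec = embed(query)
    ranked = sorted(
        texts,
        key=lambda text: cosine_similarity(query_vec, embed(text)),
        reverse=True,
    )
    return ranked[:k]


texts = [
    "the watergate scandal led to the resignation of president nixon",
    "the moon landing was a milestone in the space race",
    "the constitution outlines the framework of the federal government",
]

top = toy_similarity_search("space race moon landing", texts, k=1)
print(top)
```

Word counts only match exact words, which is why real systems use learned embeddings that place related phrases close together even when they share no vocabulary.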

---

✅ The `similarity_search()` function returns documents that are most similar to a **string query** based on this process.


In [11]:
vector_db.similarity_search("Jhon")

[Document(metadata={'source': 'civil-rights-doc'}, page_content="Martin Luther King Jr. was a prominent leader of the civil rights movement and delivered the famous 'I Have a Dream' speech."),
 Document(metadata={'source': 'us-politics-doc'}, page_content="Robert F. Kennedy was a key political figure and served as the U.S. Attorney General during his brother's presidency."),
 Document(metadata={'source': 'us-presidents-doc'}, page_content='John F. Kennedy served as the 35th president of the United States from 1961 until his assassination in 1963.'),
 Document(metadata={'source': 'great-depression-doc'}, page_content='Franklin D. Roosevelt served four terms as U.S. president and implemented the New Deal to combat the Great Depression.')]

In [12]:
vector_db.similarity_search_with_score("Jhon")

[(Document(metadata={'source': 'civil-rights-doc'}, page_content="Martin Luther King Jr. was a prominent leader of the civil rights movement and delivered the famous 'I Have a Dream' speech."),
  1.8768809),
 (Document(metadata={'source': 'us-politics-doc'}, page_content="Robert F. Kennedy was a key political figure and served as the U.S. Attorney General during his brother's presidency."),
  1.9259932),
 (Document(metadata={'source': 'us-presidents-doc'}, page_content='John F. Kennedy served as the 35th president of the United States from 1961 until his assassination in 1963.'),
  2.0167084),
 (Document(metadata={'source': 'great-depression-doc'}, page_content='Franklin D. Roosevelt served four terms as U.S. president and implemented the New Deal to combat the Great Depression.'),
  2.0196872)]
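
A note on reading these scores: to our knowledge, LangChain's FAISS wrapper with its default settings returns an L2 distance, so a **lower** score means a closer match (worth verifying against your installed version). The misspelled query "Jhon" is also why the John F. Kennedy document only ranks third. The sketch below copies the scores from the output above and keeps only results under a cutoff; `MAX_DISTANCE` is a hypothetical value picked purely for illustration.

```python
# (source, score) pairs copied from the similarity_search_with_score output above.
scored_results = [
    ("civil-rights-doc", 1.8768809),
    ("us-politics-doc", 1.9259932),
    ("us-presidents-doc", 2.0167084),
    ("great-depression-doc", 2.0196872),
]

# Hypothetical cutoff chosen only for illustration; tune it on your own data.
MAX_DISTANCE = 2.0

# Treat the score as a distance: keep results at or below the cutoff.
close_enough = [
    (source, score) for source, score in scored_results if score <= MAX_DISTANCE
]

for source, score in close_enough:
    print(f"{score:.4f}  {source}")
```

With real results, each item is a `(Document, score)` tuple, so the same filter would read the source from `doc.metadata["source"]`.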

## 🔁 Retrievers

We can create a retriever manually, but that is **not the most common approach**.

Once we've selected the retrieval method (e.g., similarity search), we can **wrap it in a retriever using `RunnableLambda`** for integration into a LangChain pipeline.

---

### 🧩 Example

The code below demonstrates how to **build a custom retriever** using the `similarity_search` method:



In [13]:
from langchain_core.runnables import RunnableLambda

# Wrap the vector store's search method as a runnable and fix k=2 results per query.
retriever = RunnableLambda(vector_db.similarity_search).bind(k=2)

# batch() runs the retriever once per query and returns one result list per input.
retriever.batch(["Jhon", "Robert"])

[[Document(metadata={'source': 'civil-rights-doc'}, page_content="Martin Luther King Jr. was a prominent leader of the civil rights movement and delivered the famous 'I Have a Dream' speech."),
  Document(metadata={'source': 'us-politics-doc'}, page_content="Robert F. Kennedy was a key political figure and served as the U.S. Attorney General during his brother's presidency.")],
 [Document(metadata={'source': 'us-politics-doc'}, page_content="Robert F. Kennedy was a key political figure and served as the U.S. Attorney General during his brother's presidency."),
  Document(metadata={'source': 'kennedy-family-doc'}, page_content='The Kennedy family is known for their significant influence in American politics and public service.')]]
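
If `.bind(k=2)` and `.batch([...])` feel opaque, the plain-Python analogy below may help. It is a rough sketch, not how LangChain implements them: `toy_search` and `CORPUS` are hypothetical stand-ins, and LangChain's `batch` may also run inputs concurrently.

```python
from functools import partial

CORPUS = [
    "John F. Kennedy",
    "Robert F. Kennedy",
    "Robert Frost",
    "John Adams",
    "John Glenn",
]


def toy_search(query, k=4):
    """Hypothetical stand-in for a search method: first k names containing the query."""
    matches = [name for name in CORPUS if query.lower() in name.lower()]
    return matches[:k]


# Roughly what .bind(k=2) does: fix a keyword argument ahead of time.
search_top2 = partial(toy_search, k=2)

# Roughly what .batch([...]) does: run the same callable once per input.
batch_results = [search_top2(query) for query in ["John", "Robert"]]
print(batch_results)
```

As in the notebook output above, the result is a list of lists: one inner list of matches per input query.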

### ✅ Common Practice: Using `.as_retriever()`

In most cases, we use the `.as_retriever()` method to easily **convert a vector store into a retriever**.

This is the recommended and simplest way to use your vector store (e.g., FAISS, Chroma, etc.) in a LangChain pipeline.

#### 🔧 Example:
```python
retriever = vectorstore.as_retriever()
```

In [14]:
retriever = vector_db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(["Jhon", "Robert"])

[[Document(metadata={'source': 'civil-rights-doc'}, page_content="Martin Luther King Jr. was a prominent leader of the civil rights movement and delivered the famous 'I Have a Dream' speech.")],
 [Document(metadata={'source': 'us-politics-doc'}, page_content="Robert F. Kennedy was a key political figure and served as the U.S. Attorney General during his brother's presidency.")]]

## ⚙️ Retrievers Are Runnables

In LangChain, **VectorStore objects are not runnables**, meaning they **cannot be directly integrated** into LangChain Expression Language (LCEL) chains.

However, **LangChain Retrievers are runnables**, which means they can **seamlessly plug into LCEL pipelines** and be composed with other components.
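
To see what "being a runnable" buys us, here is a toy sketch of `|` composition in plain Python. `ToyRunnable` and `fan_out` are hypothetical names invented for this illustration and are not LangChain classes. In real LCEL, a plain dict placed in a chain is coerced into a parallel step that plays the role `fan_out` plays here.

```python
class ToyRunnable:
    """Hypothetical miniature of a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that feeds a's output into b.
        return ToyRunnable(lambda value: other.invoke(self.invoke(value)))


def fan_out(steps):
    """Run every step on the same input and collect the results in a dict."""
    return ToyRunnable(
        lambda value: {key: step.invoke(value) for key, step in steps.items()}
    )


toy_retriever = ToyRunnable(lambda question: f"[context for: {question}]")
passthrough = ToyRunnable(lambda value: value)
toy_prompt = ToyRunnable(
    lambda d: f"Question: {d['question']}\nContext: {d['context']}"
)

toy_chain = fan_out({"context": toy_retriever, "question": passthrough}) | toy_prompt
result = toy_chain.invoke("Tell me about Watergate")
print(result)
```

Every step exposes the same `invoke` interface, so steps can be chained freely. A vector store has no such interface, which is why it needs to be wrapped as a retriever first.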

---

### 💡 Example:
> The next cell uses a retriever inside an LCEL chain.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

messages = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

retriever = vector_db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

prompt = ChatPromptTemplate.from_messages([("human", messages)])

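# The dict sends the input to both branches: the retriever fills {context},
# and RunnablePassthrough forwards the raw question into {question}.
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)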

In [18]:
response = chain.invoke("Tell me about Watergate")

In [19]:
print(response.content)

The Watergate scandal led to the resignation of President Richard Nixon in 1974.
