# Components needed to build a RAG System

- https://python.langchain.com/v0.2/docs/tutorials/retrievers/


In [1]:
# %pip install langchain-chroma langchain  langchain-openai
from dotenv import load_dotenv, find_dotenv
import os
import warnings
from IPython.display import display, Markdown  # to see better the output text

warnings.filterwarnings('ignore')
_ = load_dotenv(find_dotenv())  # read local .env file

llm_model = "gpt-3.5-turbo"

In [27]:
import getpass
import os

# If you want best-in-class automated tracing of your model calls,
# you can also set your LangSmith API key
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_TRACING"] = "true"

## Documents
https://python.langchain.com/v0.2/api_reference/core/documents/langchain_core.documents.base.Document.html

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata.
* `id`: an optional identifier for the document.
* `page_content`: a string representing the content.
* `metadata`: a dict containing arbitrary metadata.
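
To make the three fields concrete, here is a minimal sketch in plain Python. `SimpleDocument` is a hypothetical stand-in that mirrors the fields listed above, and `filter_by_metadata` is an illustrative helper; neither is part of the LangChain API.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the three Document fields."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None


def filter_by_metadata(docs, **conditions):
    """Keep documents whose metadata matches every key/value pair given."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in conditions.items())
    ]


docs = [
    SimpleDocument("Dogs are great companions.", {"user": "manuel"}, id="1"),
    SimpleDocument("Parrots mimic human speech.", {"user": "alejandro"}, id="4"),
]
manuel_docs = filter_by_metadata(docs, user="manuel")
print([d.id for d in manuel_docs])
```

The metadata dictionary is what makes per-user or per-source filtering possible once the documents are indexed.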

In [23]:
from langchain_core.documents import Document

documents = [
    Document(id='1',
             page_content="Dogs are great companions, known for their loyalty and friendliness.",
             metadata={"source": "mammal-pets-doc", "user": "manuel"},
             ),
    Document(id='2',
             page_content="Cats are independent pets that often enjoy their own space.",
             metadata={"source": "mammal-pets-doc", "user": "manuel"},
             ),
    Document(id='3',
             page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
             metadata={"source": "fish-pets-doc", "user": "manuel"},
             ),
    Document(id='4',
             page_content="Parrots are intelligent birds capable of mimicking human speech.",
             metadata={"source": "bird-pets-doc", "user": "alejandro"},
             ),
    Document(id='5',
             page_content="Rabbits are social animals that need plenty of space to hop around.",
             metadata={"source": "mammal-pets-doc", "user": "alejandro"},
             ),
]

## Vector Store
LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third-party; others can run in-memory for lightweight workloads. Here we will demonstrate usage of LangChain VectorStores using `Chroma`, which includes an in-memory implementation.

In [24]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# even if each document has its own id it is necessary to pass the list of ids 
vectorstore = Chroma.from_documents(
    documents,
    ids=[doc.id for doc in documents],  # type: ignore
    embedding=OpenAIEmbeddings(),
)

In [25]:
out = vectorstore.similarity_search(
    query="land animals", k=3, filter={"user": "manuel"})
out

[Document(metadata={'source': 'mammal-pets-doc', 'user': 'manuel'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
 Document(metadata={'source': 'mammal-pets-doc', 'user': 'manuel'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': 'fish-pets-doc', 'user': 'manuel'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.')]

> ⚠️⚠️ **IMPORTANT:** Vector stores in LangChain do not return the stored `Document` objects themselves; they return copies, which is why the results above do not include the `id` of the original document.


In [6]:
# Note that providers implement different scores; Chroma here
# returns a distance metric that should vary inversely with similarity.

vectorstore.similarity_search_with_score(query="birds",) 

[(Document(metadata={'source': 'bird-pets-doc', 'user': 'alejandro'}, page_content='Parrots are intelligent birds capable of mimicking human speech.'),
  0.34512820839881897),
 (Document(metadata={'source': 'mammal-pets-doc', 'user': 'alejandro'}, page_content='Rabbits are social animals that need plenty of space to hop around.'),
  0.45866116881370544),
 (Document(metadata={'source': 'mammal-pets-doc', 'user': 'manuel'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
  0.4592510163784027),
 (Document(metadata={'source': 'fish-pets-doc', 'user': 'manuel'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.'),
  0.47476130723953247)]

### Async query:

In [7]:
await vectorstore.asimilarity_search(query="birds",)

[Document(metadata={'source': 'bird-pets-doc', 'user': 'alejandro'}, page_content='Parrots are intelligent birds capable of mimicking human speech.'),
 Document(metadata={'source': 'mammal-pets-doc', 'user': 'alejandro'}, page_content='Rabbits are social animals that need plenty of space to hop around.'),
 Document(metadata={'source': 'mammal-pets-doc', 'user': 'manuel'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
 Document(metadata={'source': 'fish-pets-doc', 'user': 'manuel'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.')]

### To see all the documents

In [9]:
vectorstore._collection.get()

{'ids': ['1', '2', '3', '4', '5'],
 'embeddings': None,
 'metadatas': [{'source': 'mammal-pets-doc', 'user': 'manuel'},
  {'source': 'mammal-pets-doc', 'user': 'manuel'},
  {'source': 'fish-pets-doc', 'user': 'manuel'},
  {'source': 'bird-pets-doc', 'user': 'alejandro'},
  {'source': 'mammal-pets-doc', 'user': 'alejandro'}],
 'documents': ['Dogs are great companions, known for their loyalty and friendliness.',
  'Cats are independent pets that often enjoy their own space.',
  'Goldfish are popular pets for beginners, requiring relatively simple care.',
  'Parrots are intelligent birds capable of mimicking human speech.',
  'Rabbits are social animals that need plenty of space to hop around.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

As the `similarity_search_with_score` output above shows, scores closer to zero indicate greater similarity.
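
The relationship between a distance score and similarity can be illustrated with toy vectors. This is a sketch under assumptions: the 2-D "embeddings" are made-up values, and a plain Euclidean distance is assumed, whereas the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: 0 means the vectors are identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine similarity: 1 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-D "embeddings" (made-up values, not real model output).
query = [1.0, 0.0]
close_doc = [0.9, 0.1]
far_doc = [0.0, 1.0]

for name, vec in [("close_doc", close_doc), ("far_doc", far_doc)]:
    print(name,
          round(l2_distance(query, vec), 3),
          round(cosine_similarity(query, vec), 3))
```

The document with the smaller distance is also the one with the higher cosine similarity, which is the inverse relationship described above.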

### A closer look at `Chroma` initialization


https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/

### Basic initialization

In [4]:
from langchain_openai import OpenAIEmbeddings
# from langchain_huggingface import HuggingFaceEmbeddings
# %pip install -qU langchain-huggingface

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Using HuggingFace instead
# (another model option: "BAAI/bge-large-en-v1.5")
# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [7]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally; remove if not necessary
)

### Initialization from client

In [18]:
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

persistent_client = chromadb.PersistentClient(path="./chroma_langchain_db")
emb_funct = OpenAIEmbeddingFunction(
    model_name="text-embedding-3-large", api_key=os.getenv("OPENAI_API_KEY"))

collection = persistent_client.get_or_create_collection(
    "collection_name", embedding_function=emb_funct)
collection.add(ids=["1", "2", "3"], documents=["a", "b", "c"])

vector_store_from_client = Chroma(
    client=persistent_client,
    collection_name="collection_name",
    embedding_function=embeddings,
)

## Retrievers
LangChain VectorStore objects do not subclass `Runnable`, and so cannot immediately be integrated into **LangChain Expression Language (LCEL)** chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations) and are designed to be incorporated in LCEL chains.

VectorStoreRetriever supports search types of `similarity` (default), `mmr` (maximal marginal relevance, which balances relevance to the query against diversity among the returned documents), and `similarity_score_threshold`.
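
The idea behind `mmr` can be sketched in plain Python. This is a simplified illustration with made-up 2-D vectors, not LangChain's implementation; the function name `mmr` and the toy data are assumptions, and `lambda_mult` is used here as the trade-off weight between relevance and diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query_vec, candidates[i])
            # Penalize candidates that resemble something already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.05],   # 0: very relevant
    [1.0, 0.06],   # 1: near-duplicate of 0
    [0.6, -0.8],   # 2: less relevant, but different from 0
]
print(mmr(query, candidates, k=2))                   # balances relevance and diversity
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more diverse candidate; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.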

In [28]:
retriever = vectorstore.as_retriever(
    search_type="mmr", search_kwargs={"k": 2, "fetch_k": 5}
)
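# A `filter` keyword passed directly to `invoke` had no effect in this run
# (the output below contains documents from both users). To filter by metadata,
# add it to `search_kwargs`, e.g.
# search_kwargs={"k": 2, "fetch_k": 5, "filter": {"user": "manuel"}}.
retriever.invoke("Intelligent animals")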

[Document(metadata={'source': 'bird-pets-doc', 'user': 'alejandro'}, page_content='Parrots are intelligent birds capable of mimicking human speech.'),
 Document(metadata={'source': 'mammal-pets-doc', 'user': 'manuel'}, page_content='Dogs are great companions, known for their loyalty and friendliness.')]

In [44]:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

In [46]:
rag_chain.invoke("who are the Intelligent animals?")    

AIMessage(content='The intelligent animals mentioned in the context are parrots.', response_metadata={'token_usage': {'completion_tokens': 11, 'prompt_tokens': 95, 'total_tokens': 106}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_507c9469a1', 'finish_reason': 'stop', 'logprobs': None}, id='run-6106b562-fea0-419d-b660-6614dacd7c25-0', usage_metadata={'input_tokens': 95, 'output_tokens': 11, 'total_tokens': 106})
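
In the chain above, `{context}` receives the retriever's raw list of `Document` objects. A common variation is to join only the `page_content` of each document before filling the prompt. The sketch below illustrates that formatting step in plain Python; `FakeDoc` and `format_docs` are hypothetical names, and the template is a simplified copy of the one used above.

```python
from collections import namedtuple

# Hypothetical stand-in for a retrieved LangChain Document.
FakeDoc = namedtuple("FakeDoc", ["page_content", "metadata"])


def format_docs(docs):
    """Join the text of each retrieved document, dropping the metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


template = (
    "Answer this question using the provided context only.\n\n"
    "{question}\n\n"
    "Context:\n{context}\n"
)

retrieved = [
    FakeDoc("Parrots are intelligent birds capable of mimicking human speech.",
            {"source": "bird-pets-doc"}),
    FakeDoc("Dogs are great companions, known for their loyalty and friendliness.",
            {"source": "mammal-pets-doc"}),
]

prompt_text = template.format(
    question="who are the Intelligent animals?",
    context=format_docs(retrieved),
)
print(prompt_text)
```

In an LCEL chain, a helper like this would typically be piped after the retriever (for example `retriever | format_docs`) so that the model sees clean text rather than the repr of `Document` objects.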

## Text Splitter/Chunking 
* https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb

* https://python.langchain.com/v0.2/docs/how_to/#text-splitters

In [7]:
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

### **RecursiveCharacterTextSplitter**
It's the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

* How the text is split: by list of characters.
* How the chunk size is measured: by number of characters.

To obtain the string content directly, use `.split_text`

To create LangChain Document objects (e.g., for use in downstream tasks), use `.create_documents`

> It can be difficult to find the best `chunk_size, chunk_overlap` settings to capture the actual relationships in the document.

Parameters:
* `chunk_size`: The maximum size of a chunk, where size is determined by the length_function.
* `chunk_overlap`: Target overlap between chunks. Overlapping chunks helps to mitigate loss of information when context is divided between chunks.
* `length_function`: Function determining the chunk size.
* `is_separator_regex`: Whether the separator list (defaulting to ["\n\n", "\n", " ", ""]) should be interpreted as regex.
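
The recursive strategy can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the library's code: it ignores `chunk_overlap`, drops the separators it splits on, and always measures size with `len`. The name `recursive_split` is an assumption.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting: no overlap, separators are dropped."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator or none left): cut fixed-size slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging while it still fits
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
chunks = recursive_split(sample, chunk_size=10)
print(chunks)
```

The first paragraph fits within `chunk_size` and is kept whole; the second does not, so it is split again on spaces. This is the "keep paragraphs, then sentences, then words together" behavior described above.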

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=350,
    chunk_overlap=200,
    separators=["\n\n", "\n", "."],
)

text_splitter.split_text(text)

['One of the most important things I didn\'t understand about the world when I was a child is the degree to which the returns for performance are superlinear.\n\nTeachers and coaches implicitly told us the returns were linear. \n\n"You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true.',
 'Teachers and coaches implicitly told us the returns were linear. \n\n"You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true.\n\nIf your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.',
 "It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented",
 ". But superlinear returns for performance 

### **Split text based on semantic similarity**

https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/


At a high level, this splits the text into sentences, groups them into groups of 3 sentences, and then merges the groups that are similar in the embedding space. If embeddings are sufficiently far apart, the chunks are split.

In [10]:
from langchain_experimental.text_splitter import SemanticChunker
# %pip install --quiet langchain_experimental
from langchain_openai.embeddings import OpenAIEmbeddings

semantic_splitter = SemanticChunker(OpenAIEmbeddings())

In [13]:
out = semantic_splitter.split_text(text)
for chunk in out:
    print(chunk)


One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear. Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers.
You get no customers, and you go out of business. It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]

In [25]:
semantic_out = semantic_splitter.create_documents([text])
semantic_out

[Document(page_content='\nOne of the most important things I didn\'t understand about the world when I was a child is the degree to which the returns for performance are superlinear. Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers.'),
 Document(page_content="You get no customers, and you go out of business. It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]\n")]

This chunker works by determining when to "break" apart sentences. This is done by looking for differences in embeddings between any two sentences. When that difference is past **some threshold**, then they are split.

There are a few ways to determine what that threshold is, which are controlled by the `breakpoint_threshold_type` kwarg:

- `percentile`: the default method. All differences between consecutive sentences are calculated, and any difference greater than the `breakpoint_threshold_amount` percentile becomes a split point. The default is `breakpoint_threshold_amount=95`, so only the largest differences produce a split. If you want more chunks, lower the percentile cutoff.

- `standard_deviation`: in this method, any difference greater than `breakpoint_threshold_amount` standard deviations is split. In this type the `breakpoint_threshold_amount=3` by default. 

- `interquartile`: in this method, the interquartile distance is used to split chunks. In this type the `breakpoint_threshold_amount=1.5` by default.

- `gradient`: in this method, the gradient of the distance is used to split chunks, along with the percentile method. This method is useful when chunks are highly correlated with each other or specific to a domain, e.g. legal or medical. The idea is to apply anomaly detection to the gradient array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.
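
The `percentile` mode can be pictured with a small sketch. This is a simplified illustration rather than the library's code: the distances are made-up numbers standing in for embedding differences between consecutive sentences, and `percentile` and `find_breakpoints` are hypothetical helper names.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def find_breakpoints(distances, threshold_pct):
    """Indices i where the gap between sentence i and i+1 exceeds the cutoff."""
    cutoff = percentile(distances, threshold_pct)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Made-up distances between consecutive sentences (not real embedding output).
distances = [0.10, 0.12, 0.45, 0.11, 0.30, 0.09]

print(find_breakpoints(distances, 95))  # high cutoff: few split points
print(find_breakpoints(distances, 60))  # lower cutoff: more split points
```

Lowering the percentile lowers the cutoff, so more gaps qualify as breakpoints and the text ends up in more chunks.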

In [23]:
ss = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type='percentile',
    breakpoint_threshold_amount=60,
)
ss.split_text(text)  # a lower percentile yields more chunks

['\nOne of the most important things I didn\'t understand about the world when I was a child is the degree to which the returns for performance are superlinear. Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true.',
 "If your product is only half as good as your competitor's, you don't get half as many customers.",
 "You get no customers, and you go out of business. It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity.",
 'In all of these, the rich get richer.',
 '[1]\n']

> A good strategy is to split the text into large chunks with **SemanticChunker** and then split those into smaller chunks with **RecursiveCharacterTextSplitter**, which ensures that no chunk exceeds the maximum size.

In [26]:
text_splitter.split_documents(semantic_out)


[Document(page_content='One of the most important things I didn\'t understand about the world when I was a child is the degree to which the returns for performance are superlinear. Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true'),
 Document(page_content='. Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers.'),
 Document(page_content="You get no customers, and you go out of business. It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've 