In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

# Intro to RAG

 * Load large data assets and then ask the LLM questions about them

# 1. Splitters


In RAG, we first need to divide large documents into smaller chunks of data. Splitters are also called document transformers.

In [31]:
# Simple character-based splitting: CharacterTextSplitter (default separator "\n\n")

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader('../data/example.txt')
loaded_data = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.create_documents([loaded_data[0].page_content])
len(texts)

Created a chunk of size 1179, which is longer than the specified 1000


4
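The warning above appears because `CharacterTextSplitter` only cuts at the separator, so a paragraph longer than `chunk_size` stays whole. The following is a simplified pure-Python sketch of that behaviour, not LangChain's implementation: the function name `split_by_separator` is hypothetical, and overlap handling is left out to keep it short.

```python
def split_by_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    Simplified illustration: a single piece longer than chunk_size is kept
    whole, which is the situation the warning reports. Overlap is ignored.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


# Three "paragraphs" of 30, 120 and 40 characters, with a limit of 100.
sample = "a" * 30 + "\n\n" + "b" * 120 + "\n\n" + "c" * 40
chunks = split_by_separator(sample, chunk_size=100)
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)
```

The middle chunk is longer than the limit because there is no separator inside it to cut at.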

In [33]:
texts[0]

Document(page_content='Las islas Galápagos1\u200b (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1\u200b o archipiélago de Galápagos)2\u200b constituyen un archipiélago del océano Pacífico ubicado a 972 km de la costa de Ecuador.3\u200b Fueron descubiertas en 1535 por la tripulación del barco del fray Tomás de Berlanga. Está conformado por trece islas grandes con una superficie mayor a 10 km², nueve islas medianas con una superficie de 1 km² a 10 km² y otros 107 islotes de tamaño pequeño, además de promontorios rocosos de pocos metros cuadrados, distribuidos alrededor de la línea ecuatorial, que conjuntamente con el Archipiélago Malayo, son los únicos archipiélagos del planeta que tienen territorio tanto en el hemisferio norte como en el hemisferio sur.')

## splitting with metadata

In [39]:
metadatas = [{"chunk": 0}, {"chunk": 1}]
documents = text_splitter.create_documents(
    [loaded_data[0].page_content, loaded_data[0].page_content],
    metadatas=metadatas
)
documents[0].metadata
documents[0].page_content

Created a chunk of size 1179, which is longer than the specified 1000
Created a chunk of size 1179, which is longer than the specified 1000


'Las islas Galápagos1\u200b (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1\u200b o archipiélago de Galápagos)2\u200b constituyen un archipiélago del océano Pacífico ubicado a 972 km de la costa de Ecuador.3\u200b Fueron descubiertas en 1535 por la tripulación del barco del fray Tomás de Berlanga. Está conformado por trece islas grandes con una superficie mayor a 10 km², nueve islas medianas con una superficie de 1 km² a 10 km² y otros 107 islotes de tamaño pequeño, además de promontorios rocosos de pocos metros cuadrados, distribuidos alrededor de la línea ecuatorial, que conjuntamente con el Archipiélago Malayo, son los únicos archipiélagos del planeta que tienen territorio tanto en el hemisferio norte como en el hemisferio sur.'

In [40]:
# Recursive character splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=26,
    chunk_overlap=4
)
recursive_documents = recursive_splitter.create_documents([loaded_data[0].page_content])
len(recursive_documents)


145
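`RecursiveCharacterTextSplitter` tries a list of separators in order (by default `"\n\n"`, `"\n"`, `" "`, `""`) and falls back to the next one whenever a piece is still too long. The sketch below illustrates that fallback idea in plain Python. It is an assumption-laden simplification: `recursive_split` is a hypothetical name, and `chunk_overlap` is ignored.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part is still too long: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


pieces = recursive_split("one two three four five six seven", chunk_size=10)
print(pieces)
```

Because the final separator is the empty string, every chunk respects the limit, which is why this splitter does not emit the "longer than the specified" warning seen earlier.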

# 2. Embeddings

Convert chunks of text into numeric vectors

In [45]:
from langchain_openai import OpenAIEmbeddings
# keep in mind the higher computational cost
embeddings_model = OpenAIEmbeddings()

In [44]:
chunks_of_text = [x.page_content for x in documents]
chunks_of_text

['Las islas Galápagos1\u200b (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1\u200b o archipiélago de Galápagos)2\u200b constituyen un archipiélago del océano Pacífico ubicado a 972 km de la costa de Ecuador.3\u200b Fueron descubiertas en 1535 por la tripulación del barco del fray Tomás de Berlanga. Está conformado por trece islas grandes con una superficie mayor a 10 km², nueve islas medianas con una superficie de 1 km² a 10 km² y otros 107 islotes de tamaño pequeño, además de promontorios rocosos de pocos metros cuadrados, distribuidos alrededor de la línea ecuatorial, que conjuntamente con el Archipiélago Malayo, son los únicos archipiélagos del planeta que tienen territorio tanto en el hemisferio norte como en el hemisferio sur.',
 'Las islas Galápagos son la segunda reserva marina más grande del planeta4\u200b fueron declaradas Patrimonio de la Humanidad en 1978 por la Unesco. El archipiélago tiene como mayor fuente de ingresos el turismo y reci

In [46]:
embeddings = embeddings_model.embed_documents(chunks_of_text)

In [49]:
len(documents)

8

In [48]:
len(embeddings)

8

Make an embedding from a user query

In [52]:
embedded_query = embeddings_model.embed_query("Cuantas islas forman las Galápagos?")
len(embedded_query)


1536
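Once the query and the chunks are vectors of the same length, they can be compared numerically. A common measure is cosine similarity. The sketch below uses made-up 3-dimensional vectors instead of the real 1536-dimensional embeddings; the vector values and the names `doc_a`, `doc_b`, `doc_c` are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for an embedded query and three embedded chunks.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
best_match = max(scores, key=scores.get)
print(best_match, scores)
```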

# 3. Vector Stores

* Store embeddings in a very fast searchable database

In [54]:
#!pip install langchain-chroma

In [73]:
# import the vector database
from langchain_chroma import Chroma

vector_db = Chroma.from_documents(documents, OpenAIEmbeddings())

In [74]:
question = "Cuantas islas conforman las islas Galápagos?"
response = vector_db.similarity_search(question)
print(response[0].page_content)

Las islas Galápagos1​ (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1​ o archipiélago de Galápagos)2​ constituyen un archipiélago del océano Pacífico ubicado a 972 km de la costa de Ecuador.3​ Fueron descubiertas en 1535 por la tripulación del barco del fray Tomás de Berlanga. Está conformado por trece islas grandes con una superficie mayor a 10 km², nueve islas medianas con una superficie de 1 km² a 10 km² y otros 107 islotes de tamaño pequeño, además de promontorios rocosos de pocos metros cuadrados, distribuidos alrededor de la línea ecuatorial, que conjuntamente con el Archipiélago Malayo, son los únicos archipiélagos del planeta que tienen territorio tanto en el hemisferio norte como en el hemisferio sur.


In [75]:
response

[Document(metadata={'chunk': 0}, page_content='Las islas Galápagos1\u200b (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1\u200b o archipiélago de Galápagos)2\u200b constituyen un archipiélago del océano Pacífico ubicado a 972 km de la costa de Ecuador.3\u200b Fueron descubiertas en 1535 por la tripulación del barco del fray Tomás de Berlanga. Está conformado por trece islas grandes con una superficie mayor a 10 km², nueve islas medianas con una superficie de 1 km² a 10 km² y otros 107 islotes de tamaño pequeño, además de promontorios rocosos de pocos metros cuadrados, distribuidos alrededor de la línea ecuatorial, que conjuntamente con el Archipiélago Malayo, son los únicos archipiélagos del planeta que tienen territorio tanto en el hemisferio norte como en el hemisferio sur.'),
 Document(metadata={'chunk': 1}, page_content='Las islas Galápagos1\u200b (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1\u200b o archip
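Conceptually, a vector store keeps each text next to its vector and returns the texts whose vectors lie closest to the query vector. The following is a brute-force sketch of that idea, not how Chroma works internally: `ToyVectorIndex` is a hypothetical class, the 2-dimensional vectors are invented, and Euclidean distance is an assumption chosen for illustration.

```python
import math


class ToyVectorIndex:
    """Hypothetical in-memory store: brute-force nearest-neighbour search."""

    def __init__(self):
        self._rows = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._rows.append((text, list(vector)))

    def search(self, query_vector, k=4):
        # Rank every stored row by Euclidean distance to the query vector.
        ranked = sorted(self._rows, key=lambda row: math.dist(row[1], query_vector))
        return [text for text, _ in ranked[:k]]


index = ToyVectorIndex()
index.add("island counts", [0.9, 0.1])
index.add("tourism income", [0.1, 0.9])
index.add("marine reserve", [0.5, 0.5])

nearest = index.search([1.0, 0.0], k=2)
print(nearest)
```

Real vector stores replace the full scan with approximate nearest-neighbour indexes so that search stays fast on large collections.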

# 4. Retrievers

* Find the chunks whose embeddings best match your question

In [63]:
#!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Downloading faiss_cpu-1.8.0.post1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0.post1


In [76]:
from langchain_community.vectorstores import FAISS

vector_db = FAISS.from_documents(documents, OpenAIEmbeddings())

In [77]:
retriever = vector_db.as_retriever()

## Differences: `.similarity_search` vs. `.as_retriever()`

`.similarity_search` and `.as_retriever()` are both methods of a LangChain vector store, but they serve slightly different purposes:

### `.similarity_search`:
- **Purpose**: This method is used to find documents that are similar to a given query. It searches the entire document set to find those that have the highest similarity score to the query.
- **Usage**: You would typically use `.similarity_search` when you want to directly retrieve a list of documents that are most similar to your input query.
- **Output**: It returns a list of `Document` objects ranked by their similarity to the input query.

### `.as_retriever()`:
- **Purpose**: This method converts a vector store into a retriever object that can be used in a larger pipeline, such as a question-answering system.
- **Usage**: `.as_retriever()` wraps the vector store so that it can be used as part of a more complex chain or pipeline. It's a way to make the store compatible with components that expect a retriever object.
- **Output**: It returns a retriever object that can be used with other components like a language model, which will call it to retrieve relevant documents during processing.

### Summary:
- Use `.similarity_search` when you need to directly find and rank similar documents.
- Use `.as_retriever()` when you want to integrate the vector store into a larger system that expects a retriever object.
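The relationship between the two can be pictured as a thin wrapper: the retriever holds a reference to the store plus some fixed search settings, and forwards each query to the store's search method. The sketch below is a toy model of that pattern, not LangChain's classes. `ToyVectorStore` and `ToyRetriever` are hypothetical, and the store ranks texts by shared words rather than by embeddings so that it runs without an API key.

```python
class ToyVectorStore:
    """Hypothetical stand-in for a vector store (word overlap instead of embeddings)."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def as_retriever(self, search_kwargs=None):
        return ToyRetriever(self, search_kwargs or {})


class ToyRetriever:
    """Wraps a store so that callers only need a uniform invoke(query) method."""

    def __init__(self, store, search_kwargs):
        self.store = store
        self.search_kwargs = search_kwargs

    def invoke(self, query):
        return self.store.similarity_search(query, **self.search_kwargs)


store = ToyVectorStore(
    ["thirteen large islands", "tourism is the main income", "a marine reserve"]
)
direct = store.similarity_search("how many islands", k=1)
via_retriever = store.as_retriever(search_kwargs={"k": 1}).invoke("how many islands")
print(direct, via_retriever)
```

Both calls return the same result. The difference is that the retriever exposes a single `invoke` entry point, so other pipeline components do not need to know which store sits behind it.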


## Simple use without LCEL

In [78]:
response = retriever.invoke(question)
len(response)

4

In [81]:
response

[Document(metadata={'chunk': 0}, page_content='Las islas Galápagos1\u200b (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1\u200b o archipiélago de Galápagos)2\u200b constituyen un archipiélago del océano Pacífico ubicado a 972 km de la costa de Ecuador.3\u200b Fueron descubiertas en 1535 por la tripulación del barco del fray Tomás de Berlanga. Está conformado por trece islas grandes con una superficie mayor a 10 km², nueve islas medianas con una superficie de 1 km² a 10 km² y otros 107 islotes de tamaño pequeño, además de promontorios rocosos de pocos metros cuadrados, distribuidos alrededor de la línea ecuatorial, que conjuntamente con el Archipiélago Malayo, son los únicos archipiélagos del planeta que tienen territorio tanto en el hemisferio norte como en el hemisferio sur.'),
 Document(metadata={'chunk': 1}, page_content='Las islas Galápagos1\u200b (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1\u200b o archip

In [79]:
response[0]

Document(metadata={'chunk': 0}, page_content='Las islas Galápagos1\u200b (también islas de las Galápagos y oficialmente conocidas como archipiélago de Colón1\u200b o archipiélago de Galápagos)2\u200b constituyen un archipiélago del océano Pacífico ubicado a 972 km de la costa de Ecuador.3\u200b Fueron descubiertas en 1535 por la tripulación del barco del fray Tomás de Berlanga. Está conformado por trece islas grandes con una superficie mayor a 10 km², nueve islas medianas con una superficie de 1 km² a 10 km² y otros 107 islotes de tamaño pequeño, además de promontorios rocosos de pocos metros cuadrados, distribuidos alrededor de la línea ecuatorial, que conjuntamente con el Archipiélago Malayo, son los únicos archipiélagos del planeta que tienen territorio tanto en el hemisferio norte como en el hemisferio sur.')

# Top K

* Decide how many chunks will be retrieved to build the answer to your question

In [83]:
retriever = vector_db.as_retriever(search_kwargs={"k":1})
response = retriever.invoke(question)
len(response)

1

# Indexing

* An advanced way to manage and search through many documents in a vector store.

## Indexing
LangChain Indexing is an **advanced technique** designed to efficiently integrate and synchronize documents from various sources into a vector store. This is particularly useful for tasks like semantic search, where the aim is to find documents with similar meanings rather than those that match on specific keywords.

#### Core Features and Their Benefits
1. **Avoiding Duplication:** By preventing the same content from being written multiple times into the vector store, the system conserves storage space and reduces redundancy.
2. **Change Detection:** The API is designed to detect if a document has changed since the last index. If there are no changes, it avoids re-writing the document. This minimizes unnecessary write operations and saves computational resources.
3. **Efficient Handling of Embeddings:** Embeddings for unchanged content are not recomputed, thus saving processing time and further enhancing system efficiency.

#### Technical Mechanics: Record Management
The `RecordManager` is a pivotal component of the LangChain indexing system. It meticulously records each document's write activity into the vector store. Here's how it works:
- **Document Hash:** Every document is hashed. This hash includes both the content and metadata, providing a unique fingerprint for each document.
- **Write Time and Source ID:** Alongside the hash, the time the document was written and a source identifier are also stored. The source ID helps trace the document back to its origin, ensuring traceability and accountability.

These details are crucial for ensuring that only necessary data handling operations are carried out, thereby enhancing efficiency and reducing the workload on the system.
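To make the hashing idea concrete, here is a minimal sketch of a fingerprint computed over both content and metadata. It is an illustration under stated assumptions, not the `RecordManager`'s actual hashing scheme: the function name `document_fingerprint`, the use of SHA-256, and the JSON serialization are all choices made for this example.

```python
import hashlib
import json


def document_fingerprint(page_content, metadata):
    """Hash content and metadata together into one hex digest (illustrative scheme)."""
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata},
        sort_keys=True,        # key order must not affect the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


original = document_fingerprint("Las islas Galápagos", {"chunk": 0})
unchanged = document_fingerprint("Las islas Galápagos", {"chunk": 0})
new_metadata = document_fingerprint("Las islas Galápagos", {"chunk": 1})
new_content = document_fingerprint("Las islas Galápagos.", {"chunk": 0})

print(original == unchanged)     # identical input gives an identical fingerprint
print(original == new_metadata)  # a metadata change alters the fingerprint
print(original == new_content)   # a content change alters the fingerprint
```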

#### Operational Efficiency and Cost Savings
By integrating these features, LangChain indexing not only streamlines the management of document indices but also leads to significant cost savings. This is achieved by:
- Reducing the frequency and volume of data written to and read from the vector store.
- Decreasing the computational demand required for re-indexing and re-computing embeddings.
- Keeping duplicate and stale entries out of the vector store, which helps keep search results relevant for applications that need rapid and accurate retrieval.

#### Conclusion
The LangChain indexing API is a sophisticated tool that leverages modern database and hashing technologies to manage and search through large volumes of digital documents efficiently. It is especially valuable in environments where accuracy, efficiency, and speed of data retrieval are crucial, such as in academic research, business intelligence, and various fields of software development. This technology not only supports effective data management but also promotes cost-effectiveness by optimizing resource utilization.

* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/indexing/).

The Indexing API from LangChain is an advanced feature best suited to experienced developers, mainly because of its complexity and the background needed to implement and manage it effectively. Here is why, in simpler terms:

1. **Complex Integration**: The API is designed to handle documents from various sources and integrate them into a vector store for semantic searches. This requires understanding both the sources of the documents and the mechanics of vector stores, which deal with high-dimensional data representations.

2. **Efficiency Management**: It involves sophisticated features like avoiding duplication of content, detecting changes in documents, and efficiently managing embeddings (data representations). These processes require a deep understanding of data structures, hashing, and optimization techniques to ensure the system is efficient and does not waste resources.

3. **Technical Operations**:
    - **Record Management**: The `RecordManager` component is crucial in tracking each document’s activity in the vector store, using detailed information such as document hashes, write times, and source IDs. This level of detail in record management necessitates familiarity with database operations, data integrity, and possibly cryptographic hashing.
    - **Operational Efficiency and Cost Savings**: Implementing the indexing system effectively can lead to significant operational efficiencies and cost savings. However, this requires precise setup and tuning to reduce unnecessary computational demands and storage usage. Developers need to understand how to balance these factors to optimize performance and cost.

4. **Advanced Use Cases**: The API supports complex scenarios such as rapid and accurate data retrieval needed in fields like academic research, business intelligence, and software development. Each of these applications might require specialized knowledge to fully leverage the potential of the indexing API.

5. **Risk of Misimplementation**: Incorrect implementation can lead to inefficient data handling, increased operational costs, and slower retrieval times, which is why a high level of expertise is necessary to avoid potential pitfalls.

In conclusion, the LangChain Indexing API is an advanced tool that involves detailed and complex processes to manage large volumes of data efficiently. Its use is recommended for developers who are very experienced because it requires a strong understanding of database systems, data efficiency, and system integration. Proper utilization can greatly enhance the performance and cost-effectiveness of systems that rely on fast and accurate data retrieval.

## A Simple Example
The LangChain Indexing API is a sophisticated tool that helps integrate and manage large sets of documents efficiently. To make it clearer, let's consider a simple example that illustrates how it could be used:

#### Example Scenario: Managing Research Papers in a University Database

**Context**: Imagine you are developing a system for a university's library to manage and search through thousands of research papers. The goal is to allow students and faculty to quickly find papers relevant to their interests based on content similarity, not just by keywords.

#### Step-by-Step Use of LangChain Indexing API:

1. **Gathering Documents**:
   - Collect digital copies of all research papers to be included in the system.
   - These might come from various departments or sources within the university.

2. **Integration into Vector Store**:
   - Each research paper is converted into a "vector" using text embedding techniques. A vector is a numerical representation that captures the essence of the paper's content.
   - These vectors are stored in a vector store, a specialized database for managing such data.

3. **Avoiding Duplication**:
   - As new papers are added, the LangChain Indexing API checks whether the same paper already exists in the vector store.
   - It uses a hash (a unique identifier generated from the paper’s content and metadata) to prevent the same paper from being stored multiple times, saving space and reducing clutter.

4. **Change Detection**:
   - If a paper in the database is updated or revised, the API detects changes using the hash comparison.
   - It updates the vector representation only if changes are detected, saving on unnecessary computational resources.

5. **Search and Retrieval**:
   - When a student queries the system looking for papers on a specific topic, like "quantum computing applications," the vector store retrieves the most relevant papers.
   - It does this by comparing the query's vector with those in the vector store and returning papers whose vectors are most similar in content, not just those that contain specific keywords.

6. **Operational Efficiency**:
   - The system is optimized to handle large volumes of data efficiently, ensuring quick responses even when multiple users are accessing it simultaneously.
   - This efficiency is crucial during exam periods or when new research is published and interest peaks.
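Steps 3 and 4 of the scenario can be sketched as a small loop that compares each paper's hash with the one recorded on the previous run. This is a toy model of the idea, not the LangChain Indexing API: `toy_index`, the `records` dictionary, and the paper fields (`source`, `content`, `metadata`) are hypothetical names chosen for the example, and no embeddings are computed.

```python
import hashlib
import json


def fingerprint(content, metadata):
    """Hash content and metadata together (illustrative scheme)."""
    payload = json.dumps({"content": content, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def toy_index(papers, records):
    """Decide per paper whether to add, update or skip it.

    records maps a source id to the last indexed hash and is updated in place.
    A real system would embed and write the paper wherever this loop counts
    an "added" or "updated" entry.
    """
    stats = {"added": 0, "updated": 0, "skipped": 0}
    for paper in papers:
        digest = fingerprint(paper["content"], paper["metadata"])
        previous = records.get(paper["source"])
        if previous == digest:
            stats["skipped"] += 1  # unchanged: no re-embedding, no write
            continue
        stats["added" if previous is None else "updated"] += 1
        records[paper["source"]] = digest
    return stats


records = {}
papers = [
    {"source": "paper-1.pdf", "content": "Quantum computing basics", "metadata": {"year": 2023}},
    {"source": "paper-2.pdf", "content": "Graph algorithms survey", "metadata": {"year": 2022}},
]
first_run = toy_index(papers, records)

# Revise one paper, leave the other untouched, and index again.
papers[0]["content"] = "Quantum computing basics (revised)"
second_run = toy_index(papers, records)
print(first_run, second_run)
```

On the second run only the revised paper triggers work, which is where the savings in embedding calls and writes come from.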

#### Conclusion

By using the LangChain Indexing API, the university library can manage its research papers more effectively, making them easily accessible based on content relevance. This leads to better research outcomes for students and faculty and maximizes the use of the library’s resources.

This example demonstrates how the Indexing API not only simplifies the management of documents but also enhances the retrieval process, making it more aligned with the users' actual needs.