**LLMs and User Data:** Many LLM applications require incorporating user-specific data that wasn't part of the model's original training data.

**RAG Solution:** Retrieval Augmented Generation (RAG) addresses this by retrieving relevant external data and feeding it to the LLM during the generation step.

**LangChain's Support:** LangChain, a framework for working with LLMs, provides building blocks for RAG applications, including components for data retrieval.

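**Data Retrieval Complexity:** Retrieval sounds straightforward, but it involves several steps, each with its own subtleties: loading documents, splitting them into chunks, embedding the chunks, storing the vectors, and querying them.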

### Document Loaders

We use document loaders to load data from a source as a document. A document is a piece of text along with its metadata. There are document loaders for .txt files, CSV files, and even transcripts of YouTube videos.

In [None]:
!pip install langchain


In [None]:
!pip install langchain_openai


In [None]:
!pip install langchain_experimental



In [None]:
!pip install langchain_community



In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader


# loader = CSVLoader(file_path='.csv')
# data = loader.load()

| Document Loader (Import Statement) | Function |
|---|---|
| `TextLoader` (`from langchain_community.document_loaders import TextLoader`) | Loads the text content of a file into a single Document. |
| `CSVLoader` (`from langchain_community.document_loaders.csv_loader import CSVLoader`) | Loads data from a CSV file into Document objects, with each row becoming a Document. |
| `JSONLoader` (`from langchain_community.document_loaders import JSONLoader`) | Loads JSON data from a file, using a `jq` expression to select the content of each Document. It can also read JSON Lines files, where each line is a JSON object. |
| `UnstructuredHTMLLoader` (`from langchain_community.document_loaders import UnstructuredHTMLLoader`) | Loads the text content of a local HTML file into a Document. To fetch a page from a URL, use `WebBaseLoader` instead. |
| `UnstructuredMarkdownLoader` (`from langchain_community.document_loaders import UnstructuredMarkdownLoader`) | Loads the text content of a Markdown file, converting it into a Document. |
| `PyPDFLoader` (`from langchain_community.document_loaders import PyPDFLoader`) | Extracts text from a PDF file, creating one Document per page (requires the `pypdf` package). |
| `DirectoryLoader` (`from langchain_community.document_loaders import DirectoryLoader`) | Loads all files in a directory that match a glob pattern as Documents. The loader class applied to each file is set with `loader_cls`. |
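
To make the idea of a document concrete, here is a minimal sketch in plain Python of what a CSV loader does conceptually. The `Document` dataclass and the `load_csv_rows` helper are hypothetical stand-ins written for this illustration, not LangChain's own classes, and the "column: value" text layout is an assumption about a reasonable row format.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a document: a piece of text plus its metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(csv_text: str, source: str) -> list[Document]:
    """Turn each CSV row into one Document (illustrative helper, not a LangChain API)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_number, row in enumerate(reader):
        # Render the row as "column: value" lines so the text is self-describing.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        docs.append(Document(content, {"source": source, "row": row_number}))
    return docs


sample = "name,role\nAda,engineer\nGrace,admiral\n"
docs = load_csv_rows(sample, source="people.csv")
print(len(docs))             # number of rows loaded
print(docs[0].page_content)  # text of the first row
print(docs[1].metadata)      # where the second row came from
```

Keeping the source and row number in the metadata is what later lets a RAG application cite where a retrieved chunk came from.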


### Text-Splitters

After loading documents, we usually want to transform them to suit our application. For example, a long document has to be split into smaller chunks so that each one fits in the model's context window.


| Name                   | Splits On                        | Adds Metadata | Description                                                                                                                                                                                    |
|------------------------|----------------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Recursive              | A list of user defined characters |               | Splits text in a way that tries to keep related pieces together by recursively examining the text. Recommended for initial text splitting.                                                   |
| HTML                   | HTML specific characters         | ✅            | Splits text based on HTML-specific characters. Adds metadata about the HTML source of each chunk.                                                                                             |
| Markdown               | Markdown specific characters     | ✅            | Splits text based on Markdown-specific characters. Adds metadata about the Markdown source of each chunk.                                                                                     |
| Code                   | Code (Python, JS) specific characters |               | Splits text based on characters specific to coding languages. Supports 15 different programming languages.                                                                                     |
| Token                  | Tokens                           |               | Splits text based on tokens, with various methods available for tokenization.                                                                                                                   |
| Character              | A user defined character        |               | Splits text based on a single user-defined character. One of the simplest splitting methods.                                                                                                    |
| [Experimental] Semantic Chunker | Sentences                |               | Splits text into sentences first, then combines adjacent sentences if they are semantically similar enough. (Experimental feature, inspired by Greg Kamradt)                                    |


#### Character Splitting - Simple static character chunks of data

Character splitting divides text into chunks of a fixed number of characters, regardless of content or structure. It is the simplest splitting method. LangChain's `CharacterTextSplitter` splits the text on a single user-defined separator and merges the pieces into chunks of roughly `chunk_size` characters. With an empty separator, as in the example below, the text is simply cut every `chunk_size` characters.

Here's a breakdown of character splitting:

**Basic Function:**

Takes a string of text as input.

Cuts the string into consecutive chunks of up to `chunk_size` characters, with `chunk_overlap` characters shared between neighboring chunks.

Returns the chunks as a list of strings or Document objects.

**Advantages:**

Simple and fast, with predictable chunk sizes.

Easy to reason about, since the only parameters are the chunk size, the overlap, and the separator.

**Disadvantages:**

Ignores the structure of the text, so chunks can begin or end in the middle of a word or sentence. Structure-aware methods such as recursive or token-based splitting are usually preferred.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

In [None]:
text = "Hello, world! I am learning langchain."

In [None]:
text_splitter = CharacterTextSplitter(chunk_size = 15, chunk_overlap=0, separator='', strip_whitespace=False)

In [None]:
text_splitter.create_documents([text])

[Document(page_content='Hello, world! I'),
 Document(page_content=' am learning la'),
 Document(page_content='ngchain.')]
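
The same fixed-size chunking can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not the library's implementation. `split_fixed` is a hypothetical helper, and its handling of overlap (stepping forward by `chunk_size - chunk_overlap`) is a simplifying assumption.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into consecutive chunks of up to chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each chunk starts `step` characters after the previous one,
    # so neighboring chunks share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "Hello, world! I am learning langchain."
chunks = split_fixed(text, chunk_size=15)
print(chunks)  # ['Hello, world! I', ' am learning la', 'ngchain.']
```

Note how the word "langchain" is cut in half: the splitter has no notion of words or sentences.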

#### Recursive Character Text Splitting

Recursive character text splitting is a more structure-aware way to segment text than basic character splitting, and it is the method recommended for initial text splitting. Here's a breakdown of the concept:

**Core Idea:**

It splits text into smaller chunks using an ordered list of separators, trying the coarsest separator first so that paragraphs, then lines, then words stay together for as long as possible.
Unlike basic character splitting, this approach is recursive: any piece that is still too large is split again using the next separator in the list.

By default, `RecursiveCharacterTextSplitter` tries these separators in order:

"\n\n" - Double new line, or most commonly paragraph breaks

"\n" - New lines

" " - Spaces

"" - Characters


**How it Works:**

**Initial Split:** The algorithm starts by splitting the text on the first separator in the list (by default, the paragraph break "\n\n").

**Recursion in Action:**
For each resulting piece:
If the piece is no larger than the specified limit (the chunk size), it is kept, and adjacent small pieces are merged back together up to that limit.
If the piece exceeds the limit, the function calls itself on that piece, using the next separator in the list.

**Stopping Criterion:** The recursion continues until every chunk fits within the chunk size or no finer separator remains.
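
The steps above can be sketched in plain Python. This is a simplified illustration of the recursive idea, assuming no chunk overlap. `recursive_split` is a hypothetical helper, and LangChain's actual implementation differs in details such as overlap handling and separator selection.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator, recursing on pieces that are still too large."""
    if len(text) <= chunk_size:
        return [text] if text else []
    separator, remaining = separators[0], separators[1:]
    # The empty separator means "fall back to individual characters".
    pieces = list(text) if separator == "" else text.split(separator)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        elif remaining:
            # Still too large: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
        else:
            chunks.append(piece)  # nothing finer to split on
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

In this example the short paragraphs survive intact, and only the long middle paragraph is broken further, at word boundaries rather than mid-word.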

**Benefits:**

**Preserves Meaning:** By considering a set of characters (often reflecting sentence boundaries or word separators), it attempts to maintain semantic coherence within the split units compared to a purely character-by-character approach.

**Flexibility:** The predefined character set can be customized depending on your specific task. For example, you might prioritize splitting at sentence boundaries while allowing some word breaks for longer sentences.

**Useful for Generative AI:** In LangChain, this approach can be used to prepare text prompts for generative models. By considering semantic units, it might lead to more coherent and well-structured outputs compared to simple character splitting.

**Implementation:**

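In LangChain, this technique is provided by `RecursiveCharacterTextSplitter`, whose `separators`, `chunk_size`, and `chunk_overlap` parameters control the behavior. Implementations in other frameworks may differ in detail, but the core logic of recursive calls over an ordered list of separators remains the same.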

**Trade-offs:**

**Complexity:** Compared to basic character splitting, recursive splitting can be computationally more expensive due to the recursive calls. However, for most practical text processing tasks, this overhead is often minimal.

**Fine-Tuning:** Tuning the character set and the chunk size threshold can be crucial to ensure a balance between preserving meaning and achieving the desired granularity for your specific task.

In essence, Recursive Character Text Splitting offers a more strategic way to divide text into smaller units, considering both character boundaries and semantic significance. This approach can be valuable in applications like preparing prompts for generative AI models or performing text analysis that requires maintaining some level of structural coherence.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
text = """In essence, Recursive Character Text Splitting offers a more strategic way to divide text into smaller units, considering both character boundaries and semantic significance. This approach can be valuable in applications like preparing prompts for generative AI models or performing text analysis that requires maintaining some level of structural coherence."""

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 200, chunk_overlap=0)

In [None]:
text_splitter.create_documents([text])

[Document(page_content='In essence, Recursive Character Text Splitting offers a more strategic way to divide text into smaller units, considering both character boundaries and semantic significance. This approach can be'),
 Document(page_content='valuable in applications like preparing prompts for generative AI models or performing text analysis that requires maintaining some level of structural coherence.')]

#### Semantic Chunking

Semantic Chunking in LangChain is a technique that breaks down text into segments based on their semantic similarity, going beyond simple character or sentence boundaries. This approach aims to provide more contextually relevant units of text for use with generative AI models like large language models (LLMs).

**Importance of Context:**

Many generative AI tasks benefit from understanding the broader context of the input text. Breaking down text into semantically related chunks can help provide the LLM with a better grasp of the overall meaning and flow of information.

For example, if you're feeding a story prompt to an LLM, chunking the text based on thematic shifts or plot points can lead to more coherent and cohesive outputs compared to splitting it into individual sentences.

**LangChain's Approach:**

LangChain utilizes techniques like sentence embedding to determine semantic similarity between different parts of the text. Sentence embedding involves converting sentences into numerical vectors that capture their meaning.
The chunking algorithm then analyzes these vectors, identifying breaks where the semantic similarity drops significantly, suggesting a shift in topic or idea.

**Working Mechanism:**

Here's a simplified view of how Semantic Chunking might work in LangChain:

**Sentence Embedding:** The input text is divided into sentences. Each sentence is then converted into an embedding vector using a pre-trained sentence embedding model.

**Similarity Measurement:** The algorithm calculates the cosine similarity between consecutive sentence embedding vectors. Cosine similarity reflects how similar two vectors are in terms of their direction in the embedding space.

**Identifying Breaks:** If the cosine similarity between two sentences falls below a predefined threshold, it signifies a potential shift in meaning. This point is considered a candidate for a chunk boundary.

**Refined Chunking:** Additional factors, like sentence length or thematic coherence, might be integrated to further refine the chunk boundaries.

**Output:** The original text is segmented into chunks, where each chunk represents a portion of the text with relatively consistent semantic meaning.
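
The mechanism above can be sketched without any external services. In this illustration, `toy_embed` is a hypothetical stand-in for a real sentence embedding model (it just counts a few topic words), and the fixed `similarity_threshold` of 0.8 is an arbitrary assumption. Only the overall flow mirrors the steps described: embed each sentence, compare neighbors, and break where similarity drops.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_embed(sentence):
    """Hypothetical embedding: counts of a few topic words (not a real model)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    topics = [("cat", "cats", "kitten"), ("stock", "stocks", "market")]
    # The small constant keeps every vector non-zero.
    return [0.1 + sum(words.count(w) for w in group) for group in topics]


def semantic_chunks(text, similarity_threshold=0.8):
    """Group consecutive sentences, starting a new chunk when similarity drops."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if not sentences:
        return []
    embeddings = [toy_embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < similarity_threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


story = (
    "Cats sleep most of the day. A kitten chases every cat toy. "
    "The stock market fell today. Stocks may recover tomorrow."
)
for chunk in semantic_chunks(story):
    print(chunk)
```

With a real embedding model the vectors would have hundreds or thousands of dimensions, but the comparison between neighboring sentences works the same way.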

**Benefits:**

**Enhanced LLM Performance:** By providing the LLM with contextually relevant chunks, semantic chunking can potentially lead to improved quality and coherence in the generated outputs.

**Flexibility:** The threshold for similarity and other parameters can be adjusted based on the specific task and desired level of granularity.

**Trade-offs:**

**Complexity:** Compared to simpler splitting methods, semantic chunking involves additional processing steps like embedding generation and similarity calculations.

**Tuning Requirements:** The effectiveness of chunking relies on finding the right balance between similarity threshold and granularity of chunks for your specific application.

Overall, Semantic Chunking in LangChain equips you with a powerful tool to segment text based on meaning, empowering your generative AI projects with a deeper understanding of the content and potentially leading to more meaningful and contextually rich outputs.

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "YOUR API KEY"

In [None]:
text_splitter = SemanticChunker(OpenAIEmbeddings())

In [None]:
docs = text_splitter.create_documents([text])
print(docs[0].page_content)

In essence, Recursive Character Text Splitting offers a more strategic way to divide text into smaller units, considering both character boundaries and semantic significance. This approach can be valuable in applications like preparing prompts for generative AI models or performing text analysis that requires maintaining some level of structural coherence.


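**Breakpoints: Percentile, Standard Deviation, Interquartile**

This chunker decides where to split by measuring the difference between the embeddings of consecutive sentences. When the difference exceeds a threshold, it starts a new chunk. The `breakpoint_threshold_type` parameter selects how that threshold is computed: `"percentile"` splits at differences above a given percentile of all differences, `"standard_deviation"` splits at differences more than a given number of standard deviations above the mean, and `"interquartile"` derives the threshold from the interquartile range of the differences.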
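
As a rough illustration of the three threshold strategies, the sketch below computes a cutoff from a list of sentence-to-sentence distances using only the standard library. The function names are hypothetical, and the amounts used (95th percentile, 3 standard deviations, 1.5 times the interquartile range) are assumptions chosen for illustration, so check the library's documentation for its actual defaults.

```python
import statistics


def breakpoint_threshold(distances, threshold_type="percentile"):
    """Turn a list of distances (at least two values) into a split cutoff."""
    if threshold_type == "percentile":
        # 95th percentile of the distances (assumed amount).
        return statistics.quantiles(distances, n=100, method="inclusive")[94]
    if threshold_type == "standard_deviation":
        # Mean plus 3 population standard deviations (assumed amount).
        return statistics.mean(distances) + 3 * statistics.pstdev(distances)
    if threshold_type == "interquartile":
        # Mean plus 1.5 times the interquartile range (assumed amount).
        q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
        return statistics.mean(distances) + 1.5 * (q3 - q1)
    raise ValueError(f"unknown threshold type: {threshold_type}")


def find_breakpoints(distances, threshold_type="percentile"):
    """Indices of the gaps between sentences where a new chunk should start."""
    threshold = breakpoint_threshold(distances, threshold_type)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between six consecutive sentences.
distances = [0.1, 0.2, 0.1, 0.9, 0.15]
for kind in ("percentile", "standard_deviation", "interquartile"):
    print(kind, find_breakpoints(distances, kind))
```

On such a short list the standard-deviation rule finds no breakpoint at all, because a single outlier inflates the deviation. This shows why the choice of strategy matters for short or uneven texts.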


In [None]:
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)

In [None]:
docs = text_splitter.create_documents([text])
print(docs[0].page_content)

In essence, Recursive Character Text Splitting offers a more strategic way to divide text into smaller units, considering both character boundaries and semantic significance. This approach can be valuable in applications like preparing prompts for generative AI models or performing text analysis that requires maintaining some level of structural coherence.


### Text embedding models


The Embeddings class serves as a unified interface for interacting with various text embedding models offered by different providers such as OpenAI, Cohere, and Hugging Face. Its purpose is to standardize the way these models are accessed and utilized.

When using the Embeddings class, text inputs are transformed into vector representations, commonly known as embeddings. These embeddings encode the semantic meaning of the text in a numerical vector format. This approach allows us to conceptualize and analyze text in a mathematical vector space. For instance, it enables tasks like semantic search, where we can efficiently find text passages that are most similar to a given query by comparing their embeddings in the vector space.

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [None]:
embeddings = embeddings_model.embed_documents(
    [
        "Hello World",
        "How are you?",
        "What's your name?",
        "I am your friendly chatbot"
    ]
)
len(embeddings), len(embeddings[0])

(4, 1536)

Use `embed_query` to embed a single piece of text, such as a search query, so that it can be compared with the embedded documents.



In [None]:
embedded_query = embeddings_model.embed_query("What was the name in conversation?")
embedded_query[:5]

[-0.006876170481929101,
 -0.000243627548141554,
 0.03187260333173799,
 -0.002298123384815286,
 0.004427486000598364]

### Vector Stores

**Taming the Unstructured Beast: Embeddings and Vector Search**

Unstructured data like text documents, images, or audio recordings can be a treasure trove of information, but it can be challenging to search and analyze efficiently. Traditional methods often rely on keyword matching, which can be limiting. Here's where a powerful technique called embedding comes into play.

**Imagine data as points in a giant map:**

Embedding essentially transforms your unstructured data (text, image, etc.) into a numerical representation called a "vector." Think of these vectors as points on a giant map, where each point's location reflects how similar it is to other data points. Words with similar meanings, or images with similar visual features, would be positioned closer together in this map, the "vector space."

**Storing and Searching Made Easy:**

Here's the magic: we can store these embedding vectors efficiently using specialized databases called vector stores. These stores are designed to handle the unique needs of vector data, allowing for fast retrieval based on similarity.

**Finding the Closest Neighbors:**

Now comes the exciting part: searching. When you have a new piece of unstructured data (like a query text), you can also embed it into a vector. Then, you can use the vector store to find existing data points (embedded documents, images, etc.) that are closest to your query vector in the "map." These closest neighbors are likely the most relevant results for your search!
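
To show what "finding the closest neighbors" means in practice, here is a toy in-memory vector store. `ToyVectorStore` is a hypothetical class written for this illustration, and the two-dimensional vectors are made up by hand rather than produced by an embedding model. Real vector stores typically use approximate nearest-neighbor indexes instead of the linear scan shown here.

```python
import math


class ToyVectorStore:
    """In-memory illustration of a vector store (not a LangChain class)."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((vector, text))

    def similarity_search_by_vector(self, query_vector, k=2):
        """Return the k stored texts whose vectors are closest to the query."""
        scored = [(self._cosine(query_vector, v), t) for v, t in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))


store = ToyVectorStore()
# Hand-made vectors: first axis ~ "cats", second axis ~ "finance".
store.add([0.9, 0.1], "a guide to cats")
store.add([0.8, 0.2], "feline care tips")
store.add([0.1, 0.9], "quarterly stock report")

results = store.similarity_search_by_vector([1.0, 0.0], k=2)
print(results)  # ['a guide to cats', 'feline care tips']
```

Notice that "feline care tips" is retrieved even though it shares no keyword with a cat-related query. That is the advantage of comparing vectors rather than matching words.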



In [None]:
!pip install chromadb


In [None]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the page (WebBaseLoader needs the beautifulsoup4 package)
loader = WebBaseLoader("https://docs.smith.langchain.com/user_guide")
docs = loader.load()

# Split the page into chunks, then embed and index them in Chroma
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
vector = Chroma.from_documents(documents, OpenAIEmbeddings())


In [None]:
query = "What is the langchain"
docs = vector.similarity_search(query)
print(docs[0].page_content)



Skip to main content🦜️🛠️ LangSmith DocsLangChain Python DocsLangChain JS/TS DocsLangSmith API DocsSearchGo to AppLangSmithUser GuideSetupPricing (Coming Soon)Self-HostingTracingEvaluationMonitoringPrompt HubProxyUser GuideOn this pageLangSmith User GuideLangSmith is a platform for LLM application development, monitoring, and testing. In this guide, we’ll highlight the breadth of workflows LangSmith supports and how they fit into each stage of the application development lifecycle. We hope this will inform users how to best utilize this powerful platform or give them something to consider if they’re just starting their journey.PrototypingPrototyping LLM applications often involves quick experimentation between prompts, model types, retrieval strategy and other parameters.
The ability to rapidly understand how the model is performing — and debug where it is failing — is incredibly important for this phase.DebuggingWhen developing new LLM applications, we suggest h

In [None]:
embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = vector.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)



Skip to main content🦜️🛠️ LangSmith DocsLangChain Python DocsLangChain JS/TS DocsLangSmith API DocsSearchGo to AppLangSmithUser GuideSetupPricing (Coming Soon)Self-HostingTracingEvaluationMonitoringPrompt HubProxyUser GuideOn this pageLangSmith User GuideLangSmith is a platform for LLM application development, monitoring, and testing. In this guide, we’ll highlight the breadth of workflows LangSmith supports and how they fit into each stage of the application development lifecycle. We hope this will inform users how to best utilize this powerful platform or give them something to consider if they’re just starting their journey.PrototypingPrototyping LLM applications often involves quick experimentation between prompts, model types, retrieval strategy and other parameters.
The ability to rapidly understand how the model is performing — and debug where it is failing — is incredibly important for this phase.DebuggingWhen developing new LLM applications, we suggest h

### Retrievers

| Name                      | Index Type                      | Uses an LLM | When to Use                                                                                                                                              | Description                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|---------------------------|---------------------------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Vectorstore               | Vectorstore                     | No          | If you are just getting started and looking for something quick and easy.                                                                                | This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text.                                                                                                                                                                                                                                                                                                                |
| ParentDocument            | Vectorstore + Document Store    | No          | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together.                     | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks).                                                                                                                                                                                                                            |
| Multi Vector              | Vectorstore + Document Store    | Sometimes   | If you are able to extract information from documents that you think is more relevant to index than the text itself.                                      | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions.                                                                                                                                                                                                                                                                 |
| Self Query                | Vectorstore                     | Yes         | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text.                       | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself).                                                                                                                                                                                            |
| Contextual Compression    | Any                             | Sometimes   | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM.                                     | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM.                                                                                                                                                                                                                                                              |
| Time-Weighted Vectorstore | Vectorstore                     | No          | If you have timestamps associated with your documents, and you want to retrieve the most recent ones                                                     | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents)                                                                                                                                                                                                                                                                                       |
| Multi-Query Retriever     | Any                             | Yes         | If users are asking questions that are complex and require multiple pieces of distinct information to respond                                            | This uses an LLM to generate multiple queries from the original one. This is useful when the original query needs pieces of information about multiple topics to be properly answered. By generating multiple queries, we can then fetch documents for each of them.                                                                                                                                                                       |
| Ensemble                  | Any                             | No          | If you have multiple retrieval methods and want to try combining them.                                                                                   | This fetches documents from multiple retrievers and then combines them.                                                                                                                                                                                                                                                                                                                                                                              |
| Long-Context Reorder      | Any                             | No          | If you are working with a long-context model and noticing that it's not paying attention to information in the middle of retrieved documents.             | This fetches documents from an underlying retriever, and then reorders them so that the most similar are near the beginning and end. This is useful because it's been shown that for longer context models they sometimes don't pay attention to information in the middle of the context window.                                                                                                                                 |
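
The reordering idea behind the Long-Context Reorder row is easy to sketch. The helper below is a hypothetical illustration, not the library's code: it takes documents sorted from most to least relevant and places the strongest ones at the two ends, leaving the weakest in the middle.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant items at the start and end, least relevant in the middle."""
    front, back = [], []
    for position, doc in enumerate(docs_by_relevance):
        # Alternate: 1st, 3rd, 5th... go to the front; 2nd, 4th... to the back.
        (front if position % 2 == 0 else back).append(doc)
    # Reverse the back half so relevance rises again toward the end.
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_long_context(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The two best documents (`d1` and `d2`) end up first and last, which are the positions a long-context model is least likely to overlook.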


#### EnsembleRetriever (Hybrid search)

**What it Does:** The EnsembleRetriever takes a list of retrieval algorithms (retrievers) as input. These algorithms can be of various types, each with its own strengths and weaknesses.

**Combining Results:** The EnsembleRetriever calls the get_relevant_documents() method on each retriever in the list. Each call returns the documents that the given retriever considers relevant to the search query, ordered by that retriever's own ranking.

**Reranking with Reciprocal Rank Fusion (RRF):** The EnsembleRetriever doesn't simply combine all the retrieved documents. Instead, it uses an algorithm called Reciprocal Rank Fusion (RRF) to re-rank the results. RRF assigns a score to each document based on its position in the individual retrieval lists. Documents that appear at the top (most relevant) in multiple retrieval methods receive higher overall scores and are likely to be ranked higher in the final results.
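To make the scoring concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion in plain Python. It is not LangChain's own code: the function name, the document ids, and the two input rankings are made up for illustration, and the constant 60 is only the value commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, weights=None, c=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives weight / (rank + c) from every list it appears in,
    where rank starts at 1 for the top result. `c` is an assumed constant.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from a sparse and a dense retriever
bm25_ranking = ["doc_a", "doc_c"]
dense_ranking = ["doc_b", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Here doc_a is returned by both retrievers, so its two contributions add up and it moves to the top of the fused list, ahead of documents that only one retriever found.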

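**Sparse Retriever (e.g., BM25):** Matches the query's keywords against the documents and scores each document by how often those keywords occur, how rare they are across the collection, and how long the document is. This is efficient, but it can miss documents that express the same meaning with different keywords.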

**Dense Retriever (e.g., Embedding Similarity):** Represents documents and queries as vectors in a high-dimensional space. Documents with similar semantic meaning will have closer vectors, leading to retrieval based on meaning rather than just keywords.
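The difference between the two retriever types can be illustrated with a toy example. This is only a sketch: the keyword counter is a crude stand-in for BM25, and the two-dimensional vectors are hand-picked values standing in for real embeddings, not the output of any model.

```python
import math

docs = {
    "d1": "cheap car for sale",
    "d2": "affordable automobile available",
    "d3": "fresh apple pie recipe",
}


def keyword_score(query, text):
    """Count occurrences of the query words in the text (crude stand-in for BM25)."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


# Hand-made 2-d "embeddings" (assumed values): axis 0 ~ vehicles, axis 1 ~ food
toy_vectors = {
    "d1": (0.9, 0.1),
    "d2": (0.8, 0.2),
    "d3": (0.1, 0.9),
}
query_vector = (1.0, 0.0)  # assumed embedding of the query "car"


def cosine(a, b):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


sparse_scores = {doc_id: keyword_score("car", text) for doc_id, text in docs.items()}
dense_scores = {doc_id: cosine(query_vector, vec) for doc_id, vec in toy_vectors.items()}

print(sparse_scores)
print(dense_scores)
```

The keyword scorer gives d2 a score of zero because the word "car" never appears in it, while the vector comparison still places d2 close to the query. Combining both kinds of retriever is what the ensemble approach is meant to exploit.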

In [None]:
!pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0


In [None]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings  # needed for OpenAIEmbeddings() below

In [None]:
doc_list_1 = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# initialize the bm25 retriever and faiss retriever
bm25_retriever = BM25Retriever.from_texts(
    doc_list_1, metadatas=[{"source": 1}] * len(doc_list_1)
)
bm25_retriever.k = 2

doc_list_2 = [
    "You like apples",
    "You like oranges",
]

embedding = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_texts(
    doc_list_2, embedding, metadatas=[{"source": 2}] * len(doc_list_2)
)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

metadatas: This is an optional argument that attaches a metadata dictionary to each document. Here, [{"source": 1}] * len(doc_list_1) creates a list with one {"source": 1} dictionary per entry in doc_list_1, so every BM25 document is tagged with source 1 (and, in the same way, every FAISS document with source 2). This makes it easy to see in the results which retriever a document came from. You can use different metadata fields and values if needed.

k: Setting bm25_retriever.k = 2 and search_kwargs={"k": 2} tells each retriever to return its top 2 documents for a query: the 2 best keyword matches for BM25, and the 2 nearest vectors for FAISS. You can adjust this value (e.g., k=5) to retrieve a different number of top results.

In [None]:
docs = ensemble_retriever.invoke("apples")
docs

[Document(page_content='I like apples', metadata={'source': 1}),
 Document(page_content='You like apples', metadata={'source': 2}),
 Document(page_content='You like oranges', metadata={'source': 2}),
 Document(page_content='Apples and oranges are fruits', metadata={'source': 1})]

#### Indexing

The indexing API enables you to import and maintain documents from various sources within a vector store efficiently. Its primary functions include preventing duplicate content from being written to the vector store, avoiding rewriting content that remains unchanged, and preventing the recomputation of embeddings for content that has not been altered.

RecordManager keeps track of document writes into the vector store.
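The idea behind this bookkeeping can be sketched in plain Python. The class and function below are hypothetical stand-ins, not LangChain's SQLRecordManager or index: they only show how storing a content hash per document lets a second run skip unchanged content.

```python
import hashlib


class ToyRecordManager:
    """Hypothetical stand-in that remembers a content hash per document key."""

    def __init__(self):
        self.seen = {}  # key -> content hash

    def needs_write(self, key, text):
        """Return True if the document is new or its content has changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.seen.get(key) == digest:
            return False
        self.seen[key] = digest
        return True


def toy_index(docs, manager, store):
    """Write only new or changed documents into `store` (a plain dict)."""
    stats = {"written": 0, "skipped": 0}
    for key, text in docs:
        if manager.needs_write(key, text):
            store[key] = text  # a real vector store would compute embeddings here
            stats["written"] += 1
        else:
            stats["skipped"] += 1
    return stats


manager, store = ToyRecordManager(), {}
first = toy_index([("a.txt", "hello"), ("b.txt", "world")], manager, store)
second = toy_index([("a.txt", "hello"), ("b.txt", "world!")], manager, store)
print(first, second)
```

On the second run only b.txt has changed, so it is the only document written again; a.txt is skipped and would not be re-embedded.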

In [None]:
!pip install langchain_elasticsearch

Collecting langchain_elasticsearch
  Downloading langchain_elasticsearch-0.1.0-py3-none-any.whl (17 kB)
Collecting elasticsearch<9.0.0,>=8.12.0 (from langchain_elasticsearch)
  Downloading elasticsearch-8.12.1-py3-none-any.whl (432 kB)
Collecting elastic-transport<9,>=8 (from elasticsearch<9.0.0,>=8.12.0->langchain_elasticsearch)
  Downloading elastic_transport-8.12.0-py3-none-any.whl (59 kB)
Installing collected packages: elastic-transport, elasticsearch, langchain_elasticsearch
Successfully installed elastic-transport-8.12.0 elasticsearch-8.12.1 langchain_elasticsearch-0.1.0


In [None]:
from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai import OpenAIEmbeddings

Use the index function imported above from langchain.indexes. This function handles the indexing process.

index(
    docs,
    record_manager,
    vectorstore,
    cleanup=None,
    source_id_key="source",
)


It takes the following arguments:

docs: The list of Document objects (or a document loader) you want to index.

record_manager: An object that keeps track of which documents have already been written and avoids redundant processing. The SQLRecordManager imported above is one implementation.

vectorstore: The vector store where the documents and their embeddings are stored (e.g., an ElasticsearchStore or a FAISS vector store). The embedding model (e.g., OpenAIEmbeddings) is configured on the vector store itself; it is not passed to index.

cleanup (optional): Controls how outdated documents are removed from the vector store: None (nothing is deleted automatically), "incremental", or "full".

source_id_key (optional): The metadata key that identifies each document's original source ("source" in the call above).

In the context of information retrieval, indexing refers to the process of creating a data structure that allows for fast and efficient retrieval of information from a large collection of documents or data points. It's like building an elaborate filing system for your digital information.

**Function of Indexing:**

An index acts as a map or pointer to the actual data. It doesn't store the entire content itself but rather references where the data resides. This allows search queries to quickly locate relevant information without having to scan through the entire dataset every time.

**Types of Indexing:**

Keyword Indexing: This is the most basic type of indexing. It involves creating a list of keywords or phrases present in the documents and associating them with the documents they appear in. This allows for efficient retrieval of documents based on exact keyword matches.

Attribute Indexing: This approach focuses on specific attributes or metadata associated with the data, such as author name, date, or category. This enables filtering and searching based on these attributes.

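Full-Text Indexing: This goes beyond a curated keyword list and indexes every word within a document. This allows for more comprehensive search capabilities, such as phrase queries or ranking by term frequency. Matching synonyms or related terms requires additional processing on top of the index, for example stemming or synonym expansion.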
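The keyword and attribute indexes described above can be sketched with ordinary dictionaries. The documents reuse the example sentences from earlier in this notebook; the author names are invented for illustration.

```python
from collections import defaultdict

# Toy collection; the "author" values are hypothetical
documents = {
    1: {"text": "I like apples", "author": "ana"},
    2: {"text": "I like oranges", "author": "ben"},
    3: {"text": "Apples and oranges are fruits", "author": "ana"},
}

keyword_index = defaultdict(set)    # word -> ids of documents containing it
attribute_index = defaultdict(set)  # (field, value) -> ids of matching documents

for doc_id, doc in documents.items():
    for word in doc["text"].lower().split():
        keyword_index[word].add(doc_id)
    attribute_index[("author", doc["author"])].add(doc_id)

# Keyword lookup, then an attribute filter, without scanning any document text
orange_docs = keyword_index["oranges"]
ana_docs = attribute_index[("author", "ana")]
result = sorted(orange_docs & ana_docs)
print(result)
```

Both lookups are dictionary accesses followed by a set intersection, which is why an index can answer a query without reading every document.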

**Benefits of Indexing:**

Faster Search Speeds: Indexing significantly reduces the time required to locate relevant information, especially for large datasets.

Improved Search Accuracy: A well-built index, for example a full-text index combined with stemming or synonym expansion, lets searches match related terms and capture the user's intent better.

Efficient Filtering: Attribute indexing allows for filtering data based on specific criteria, further streamlining the retrieval process.

**Examples of Indexing:**

Search Engines: Search engines like Google or Bing crawl the web and build large indexes of the pages they find, then use those indexes to retrieve relevant web pages for user queries.

Database Management Systems: Databases use indexes to efficiently locate specific records based on search criteria like names, dates, or other indexed fields.

Library Catalogs: Libraries traditionally used card catalogs as a form of indexing to locate books based on author, title, or subject.
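The database case is easy to try with Python's built-in sqlite3 module. The sketch below is illustrative only: the table, the index name, and the rows are made up, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books (title, author) VALUES (?, ?)",
    [("Book A", "Author X"), ("Book B", "Author Y"), ("Book C", "Author X")],
)

# Index the author column so lookups by author do not scan the whole table
conn.execute("CREATE INDEX idx_books_author ON books (author)")

# Ask SQLite how it plans to run the query; the last column describes each step
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM books WHERE author = ?", ("Author X",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)

titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM books WHERE author = ? ORDER BY title", ("Author X",)
    )
]
print(plan_text)
print(titles)
```

The plan text names idx_books_author, which shows that the lookup goes through the index instead of reading every row.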

**Important Considerations:**

The effectiveness of indexing depends on the chosen indexing strategy and the quality of the index itself. Maintaining an up-to-date index is crucial, especially as your data collection grows and changes. Choosing the right type of index for your data and search needs is essential for optimal performance.