# Build a semantic search engine with LangChain


## Overview

This tutorial will familiarize you with LangChain's [document loader](https://docs.langchain.com/oss/python/langchain/retrieval#document-loaders), [embedding](https://docs.langchain.com/oss/python/langchain/retrieval#embedding-models), and [vector store](https://docs.langchain.com/oss/python/langchain/retrieval#vector-store) abstractions. These abstractions are designed to support retrieval of data, from (vector) databases and other sources, for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation, or [RAG](https://docs.langchain.com/oss/python/langchain/retrieval).

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query. The resulting retriever can also serve as the retrieval step of a RAG application.



### Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

* [Documents and document loaders](https://docs.langchain.com/oss/python/integrations/document_loaders);
* [Text splitters](https://docs.langchain.com/oss/python/integrations/splitters);
* [Embeddings](https://docs.langchain.com/oss/python/integrations/text_embedding);
* [Vector stores](https://docs.langchain.com/oss/python/integrations/vectorstores) and [retrievers](https://docs.langchain.com/oss/python/integrations/retrievers).



## Setup


### Installation

This tutorial requires the `langchain-community` and `pypdf` packages. Using [uv](https://docs.astral.sh/uv/):


```bash
uv add langchain-community pypdf
```


For more details, see our [Installation guide](https://docs.langchain.com/oss/python/langchain/install).


## 1. Documents and document loaders

LangChain implements a [Document](https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document) abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

* `page_content`: a string representing the content;
* `metadata`: a dict containing arbitrary metadata;
* `id`: (optional) a string identifier for the document.

The `metadata` attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual [`Document`](https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document) object often represents a chunk of a larger document.

We can generate sample documents when desired:

```python
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
```


However, the LangChain ecosystem implements [document loaders](https://docs.langchain.com/oss/python/langchain/retrieval#document-loaders) that [integrate with hundreds of common sources](https://docs.langchain.com/oss/python/integrations/document_loaders/). This makes it easy to incorporate data from these sources into your AI application.



### Loading documents

Let's load a PDF into a sequence of [`Document`](https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document) objects. First, fetch a PDF from arXiv (e.g. the "Attention Is All You Need" paper) using [`curl`](https://tldr.inbrowser.app/pages/common/curl):

```bash
curl -L -o paper.pdf "https://arxiv.org/pdf/1706.03762.pdf"
```

Then load it with the [PyPDFLoader](https://docs.langchain.com/oss/python/integrations/document_loaders/#pdfs):

```python
from langchain_community.document_loaders import PyPDFLoader

file_path = "paper.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))
```

15


`PyPDFLoader` loads one [`Document`](https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document) object per PDF page. For each, we can easily access:

* The string content of the page;
* Metadata containing the file name and page number.

```python
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)
```

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need


{'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}
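
The metadata above is a plain dictionary, so it can be used directly to point a reader back to the source of a passage. The following is a minimal sketch using only the standard library; `format_citation` is a hypothetical helper (not part of LangChain), and it assumes, as the output above suggests, that `page` is a zero-based index while `page_label` is the label printed on the page.

```python
def format_citation(metadata: dict) -> str:
    """Build a short 'file, p. N' string from loader-style metadata."""
    source = metadata.get("source", "unknown source")
    label = metadata.get("page_label")
    if label is None and "page" in metadata:
        # Assumption: "page" is zero-based, so add 1 for a human-readable number.
        label = str(metadata["page"] + 1)
    return f"{source}, p. {label}" if label is not None else source


# A trimmed copy of the metadata printed above.
meta = {"source": "paper.pdf", "total_pages": 15, "page": 0, "page_label": "1"}
print(format_citation(meta))
```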


### Splitting

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve [`Document`](https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document) objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.

We can use [text splitters](https://docs.langchain.com/oss/python/langchain/retrieval#text_splitters) for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the `RecursiveCharacterTextSplitter`, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set `add_start_index=True` so that the character index where each split `Document` starts within the initial `Document` is preserved as the metadata attribute `start_index`.


```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))
```

52
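
To make the roles of chunk size, overlap, and start index concrete, here is a deliberately simplified sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to cut at separators such as new lines, so its chunks are usually shorter than the limit. The hypothetical `split_with_overlap` function below ignores separators and only slides a fixed-size window over the text.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Slide a fixed-size window over text; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start : start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk begins with the last three characters of the previous one, which is the same idea as the 200-character overlap used above.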


## 2. Embeddings

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can [embed](https://docs.langchain.com/oss/python/langchain/retrieval#embedding_models) it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.
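
As a reminder of what cosine similarity measures, the sketch below computes it by hand for three tiny, made-up vectors. Real embedding vectors have hundreds of dimensions, and vector stores use optimized numeric libraries rather than pure Python, so treat this only as an illustration of the formula.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" for illustration.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # parallel to the query, different length
orthogonal = [0.0, 1.0, 0.0]  # shares no component with the query

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))  # 0: unrelated direction
```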

LangChain supports embeddings from [dozens of providers](https://docs.langchain.com/oss/python/integrations/text_embedding/). These models specify how text should be converted into a numeric vector. Here we use Hugging Face's `all-mpnet-base-v2` sentence-transformers model:

```bash
uv add langchain-huggingface sentence-transformers
```

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
```



```python
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
```

Generated vectors of length 768

[0.00369695364497602, 0.017323777079582214, -0.01369203720241785, 0.00036987385828979313, -0.05261596664786339, -0.0010653170756995678, 0.002715452341362834, -0.02163930982351303, -0.06304091215133667, -0.0036578048020601273]


Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.


## 3. Vector stores

LangChain [VectorStore](https://reference.langchain.com/python/langchain_core/vectorstores/?h=#langchain_core.vectorstores.base.VectorStore) objects contain methods for adding text and [`Document`](https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document) objects to the store, and querying them using various similarity metrics. They are often initialized with [embedding](https://docs.langchain.com/oss/python/langchain/retrieval#embedding_models) models, which determine how text data is translated to numeric vectors.

LangChain includes a suite of [integrations](https://docs.langchain.com/oss/python/integrations/vectorstores) with different vector store technologies. For this tutorial we use the **in-memory** vector store, which is lightweight and requires no extra infrastructure:

```python
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
```

Having instantiated our vector store, we can now index the documents.

```python
ids = vector_store.add_documents(documents=all_splits)
```

Note that most vector store implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information. See the documentation for a specific [integration](https://docs.langchain.com/oss/python/integrations/vectorstores) for more detail.
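
Conceptually, a vector store keeps each text next to its embedding and ranks the stored entries against an embedded query. The toy class below sketches that idea with the standard library only. Everything in it is hypothetical and for illustration: `ToyVectorStore`, `toy_embed`, and the five-word vocabulary are not LangChain APIs, and word counting is only a stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical vocabulary; a real embedding model learns dense vectors instead.
VOCAB = ["dog", "cat", "loyal", "independent", "pet"]


def toy_embed(text: str) -> list[float]:
    """Count occurrences of each vocabulary word (stand-in for an embedding model)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [float(counts[word]) for word in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Stores (vector, text) pairs and ranks them by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.rows: list[tuple[list[float], str]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        query_vector = self.embed(query)
        ranked = sorted(
            self.rows, key=lambda row: cosine(query_vector, row[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["A dog is a loyal pet.", "A cat is an independent pet."])
print(store.similarity_search("Which pet is loyal?", k=1))
```

This sketch scans every stored vector on each query, which is fine for a handful of entries. Production vector stores typically add indexing structures so that search stays fast as the collection grows.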

Once we've instantiated a [`VectorStore`](https://reference.langchain.com/python/langchain_core/vectorstores/?h=#langchain_core.vectorstores.base.VectorStore) that contains documents, we can query it. [VectorStore](https://reference.langchain.com/python/langchain_core/vectorstores/?h=#langchain_core.vectorstores.base.VectorStore) includes methods for querying:

* Synchronously and asynchronously;
* By string query and by vector;
* With and without returning similarity scores;
* By similarity and [maximum marginal relevance](https://reference.langchain.com/python/langchain_core/vectorstores/?h=#langchain_core.vectorstores.base.VectorStore.max_marginal_relevance_search) (to balance similarity to the query against diversity in the retrieved results).

The methods will generally include a list of [Document](https://reference.langchain.com/python/langchain_core/documents/#langchain_core.documents.base.Document) objects in their outputs.
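
Maximum marginal relevance is easiest to understand as a greedy loop: repeatedly pick the candidate that is relevant to the query but not too similar to what has already been picked. The sketch below illustrates that loop on made-up two-dimensional vectors. It is an assumption-laden illustration, not LangChain's implementation; the `mmr` function is hypothetical, and its `lambda_mult` parameter is named after the corresponding argument of `max_marginal_relevance_search` (1.0 means pure relevance, lower values favor diversity).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(
    query_vec: list[float],
    doc_vecs: list[list[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> list[int]:
    """Return indices of up to k documents, trading relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.1],  # most relevant
    [1.0, 0.12],  # near-duplicate of the first document
    [1.0, -0.5],  # less relevant, but points in a different direction
]

print(mmr(query, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, docs, k=2, lambda_mult=0.5))  # relevance balanced with diversity
```

With `lambda_mult=1.0` the near-duplicate is returned as the second result; with `lambda_mult=0.5` it is skipped in favor of the more distinct third document.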

**Usage**

Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key terms used in the document.

Return documents based on similarity to a string query:

```python
results = vector_store.similarity_search(
    "What is the main architecture proposed in the paper?"
)

print(results[0])
```

page_content='itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two
sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This
masking, combined with fact that the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions less than i.
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output,' metadata={'producer': 'pdfTeX

Async query:

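```python
# Top-level `await` works in a notebook; in a script, call this from a
# coroutine that is run with `asyncio.run()`.
results = await vector_store.asimilarity_search("What is self-attention?")

print(results[0])
```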


page_content='3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output,
where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
3' metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 1611}
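
Outside a notebook there is no running event loop, so an async query has to be called from a coroutine. The sketch below shows one common pattern: wrap the calls in `main()` and issue several queries concurrently with `asyncio.gather`. To keep it runnable on its own, `fake_asimilarity_search` is a hypothetical stand-in for the vector store's async method; in the tutorial's setting you would await `vector_store.asimilarity_search(query)` in its place.

```python
import asyncio


async def fake_asimilarity_search(query: str) -> list[str]:
    """Hypothetical stand-in for an async vector store query."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return [f"result for: {query}"]


async def main() -> list[list[str]]:
    queries = ["What is self-attention?", "What is multi-head attention?"]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_asimilarity_search(q) for q in queries))


results = asyncio.run(main())
print(results)
```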


Return scores:

```python
# Note that providers implement different scores. For InMemoryVectorStore
# the score is a cosine similarity, so higher values mean closer matches.

results = vector_store.similarity_search_with_score("How does the Transformer encoder work?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
```

Score: 0.668251152922073

page_content='Figure 1: The Transformer - model architecture.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully
connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,
respectively.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512.' metadata={'producer': 'pdfTeX-1.40

Return documents based on similarity to an embedded query:

```python
embedding = embeddings.embed_query("What are the advantages of the Transformer over RNNs?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])
```

page_content='Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base
model. All metrics are on the English-to-German translation development set, newstest2013. Listed
perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to
per-word perplexities.
N d model dff h d k dv Pdrop ϵls
train PPL BLEU params
steps (dev) (dev) ×106
base 6 512 2048 8 64 64 0.1 0.1 100K 4.92 25.8 65
(A)
1 512 512 5.29 24.9
4 128 128 5.00 25.5
16 32 32 4.91 25.8
32 16 16 5.01 25.4
(B) 16 5.16 25.1 58
32 5.01 25.4 60
(C)
2 6.11 23.7 36
4 5.19 25.3 50
8 4.88 25.5 80
256 32 32 5.75 24.5 28
1024 128 128 4.66 26.0 168
1024 5.12 25.4 53
4096 4.75 26.2 90
(D)
0.0 5.77 24.6
0.2 4.95 25.5
0.0 4.67 25.3
0.2 5.47 25.7
(E) positional embedding instead of sinusoids 4.92 25.7
big 6 1024 4096 16 0.3 300K 4.33 26.4 213
development set, newstest2013. We used beam search as described in the previous section, but no' metadata={'producer': 'pdf

Learn more:

* [API Reference](https://reference.langchain.com/python/langchain_core/vectorstores/?h=#langchain_core.vectorstores.base.VectorStore)
* [Integration-specific docs](https://docs.langchain.com/oss/python/integrations/vectorstores)


## 4. Retrievers

LangChain [`VectorStore`](https://reference.langchain.com/python/langchain_core/vectorstores/?h=#langchain_core.vectorstores.base.VectorStore) objects do not subclass [Runnable](https://reference.langchain.com/python/langchain_core/runnables/#langchain_core.runnables.Runnable). LangChain [Retrievers](https://reference.langchain.com/python/langchain_core/retrievers/#langchain_core.retrievers.BaseRetriever) are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous `invoke` and `batch` operations). Although we can construct retrievers from vector stores, retrievers can also interface with other sources of data (such as external APIs).

We can create a simple version of this ourselves, without subclassing `BaseRetriever`. Once we choose which method to use for retrieving documents, we can easily wrap it in a runnable. Below we will build one around the `similarity_search` method:

```python
from langchain_core.documents import Document
from langchain_core.runnables import chain


# @chain turns the function into a Runnable with invoke, batch, etc.
@chain
def retriever(query: str) -> list[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "What is the main architecture proposed in the paper?",
        "What is self-attention?",
    ],
)
```

[[Document(id='3a9a01f8-683c-471c-9dda-ba075d260afd', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 770}, page_content='itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension dmodel = 512.\nDecoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two\nsub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head\nattention over the output of the encoder stack. Similar to the encoder, we employ residual connections\naround each of the sub-layers, 
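
What the runnable interface adds here is a uniform way to call the same function on one input or on many. As a rough mental model only, the hypothetical `ToyRunnable` below wraps a function with `invoke` and a `batch` that maps `invoke` over the inputs in a thread pool. It is a standard-library sketch, not LangChain's `Runnable`, and the keyword-matching `keyword_search` function is a made-up stand-in for a real retrieval method.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class ToyRunnable:
    """Wraps a single-input function with invoke and batch methods."""

    def __init__(self, func: Callable[[str], list[str]]):
        self.func = func

    def invoke(self, query: str) -> list[str]:
        return self.func(query)

    def batch(self, queries: list[str]) -> list[list[str]]:
        # pool.map preserves input order, so results line up with queries.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(self.invoke, queries))


corpus = [
    "self-attention relates positions of a sequence",
    "the encoder has six layers",
]


def keyword_search(query: str) -> list[str]:
    """Made-up retrieval method: return texts containing any query word."""
    words = query.lower().split()
    return [text for text in corpus if any(word in text for word in words)]


toy_retriever = ToyRunnable(keyword_search)
print(toy_retriever.invoke("encoder"))
print(toy_retriever.batch(["encoder", "self-attention"]))
```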

Vector stores implement an `as_retriever` method that will generate a retriever, specifically a [`VectorStoreRetriever`](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStoreRetriever.html). These retrievers include specific `search_type` and `search_kwargs` attributes that identify what methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:

```python
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "What is the main architecture proposed in the paper?",
        "What is self-attention?",
    ],
)
```

[[Document(id='3a9a01f8-683c-471c-9dda-ba075d260afd', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'paper.pdf', 'total_pages': 15, 'page': 2, 'page_label': '3', 'start_index': 770}, page_content='itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimension dmodel = 512.\nDecoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two\nsub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head\nattention over the output of the encoder stack. Similar to the encoder, we employ residual connections\naround each of the sub-layers, 

`VectorStoreRetriever` supports search types of `"similarity"` (default), `"mmr"` (maximum marginal relevance, described above), and `"similarity_score_threshold"`. We can use the last of these to filter the documents returned by the retriever by similarity score.
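
The idea behind a score threshold can be sketched in a few lines of plain Python: keep only the results whose score clears a cutoff, however many that turns out to be. The `filter_by_score` helper and the scores below are made up for illustration, and the sketch assumes that a higher score means a closer match. In LangChain, the cutoff is supplied to the retriever through `search_kwargs` as `score_threshold`.

```python
def filter_by_score(
    scored_docs: list[tuple[str, float]], score_threshold: float
) -> list[str]:
    """Keep documents scoring at least score_threshold, best match first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept]


# Made-up (text, similarity) pairs.
scored = [("encoder stack", 0.67), ("training data", 0.31), ("attention", 0.58)]
print(filter_by_score(scored, score_threshold=0.5))
```

Unlike a fixed `k`, a threshold can return no documents at all when nothing is similar enough, which is often the desired behavior for off-topic queries.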

Retrievers can easily be incorporated into more complex applications, such as [retrieval-augmented generation (RAG)](https://docs.langchain.com/oss/python/langchain/retrieval) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the [RAG tutorial](https://docs.langchain.com/oss/python/langchain/rag).
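
To give a sense of what "combining a question with retrieved context" means in practice, here is a minimal sketch of the prompt-assembly step alone, with no model call. The `build_rag_prompt` function and the prompt wording are assumptions made for illustration; real applications usually use prompt templates and tune the instructions for their model.

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# In the tutorial's setting, the chunks would be the page_content of retrieved documents.
prompt = build_rag_prompt(
    "What is self-attention?",
    [
        "An attention function can be described as mapping a query and a set of key-value pairs to an output.",
        "The output is computed as a weighted sum of the values.",
    ],
)
print(prompt)
```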



## Next steps

You've now seen how to build a semantic search engine over a PDF document.

For more on document loaders:

* [Overview](https://docs.langchain.com/oss/python/langchain/retrieval#document_loaders)
* [Available integrations](https://docs.langchain.com/oss/python/integrations/document_loaders/)

For more on embeddings:

* [Overview](https://docs.langchain.com/oss/python/langchain/retrieval#embedding_models/)
* [Available integrations](https://docs.langchain.com/oss/python/integrations/text_embedding/)

For more on vector stores:

* [Overview](https://docs.langchain.com/oss/python/langchain/retrieval#vectorstores/)
* [Available integrations](https://docs.langchain.com/oss/python/integrations/vectorstores/)

For more on RAG, see:

* [Build a Retrieval Augmented Generation (RAG) App](https://docs.langchain.com/oss/python/langchain/rag/)
