# Build a semantic search engine

In this notebook we build a semantic search engine over a PDF document. We load the PDF, split it into chunks, embed each chunk, index the embeddings in a vector store, and then retrieve the passages that are most similar to an input query.

In [1]:
!pip install langchain-community langchain-text-splitters langchain-huggingface langchain-chroma pypdf

Collecting pypdf
  Downloading pypdf-5.3.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.3.0-py3-none-any.whl (300 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.3.0




In [2]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

## Loading documents

In [4]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./gen_ai.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

18


## Splitting

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

126
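
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification, not the library's algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and newlines, whereas the hypothetical `split_with_overlap` helper below cuts at fixed character offsets. The recorded start offset plays the role of the `start_index` metadata that `add_start_index=True` requests.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs from a fixed-size sliding window.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
# 0 abcdefghij
# 6 ghijklmnop
# 12 mnopqrstuv
# 18 stuvwxyz
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk.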

## Embeddings

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-MiniLM-L6-v2")




In [8]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 384

[-0.3988904654979706, 0.21661382913589478, -0.38683268427848816, 0.10741971433162689, 0.039315272122621536, 0.02057838812470436, -0.32118719816207886, 0.1605253517627716, 0.1680615097284317, -0.13747482001781464]
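
Embedding vectors are useful because texts with similar meaning tend to map to nearby vectors. One common way to compare two vectors is cosine similarity. The sketch below uses only the standard library and toy three-dimensional vectors; the `cosine_similarity` helper and the vector names are illustrative assumptions, not part of LangChain.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
v_query = [1.0, 0.0, 1.0]
v_close = [2.0, 0.0, 2.0]  # same direction as v_query, different length
v_orth = [0.0, 1.0, 0.0]   # perpendicular to v_query

print(cosine_similarity(v_query, v_close))  # approximately 1.0
print(cosine_similarity(v_query, v_orth))   # 0.0
```

The same function could be applied to the notebook's own vectors, for example `cosine_similarity(vector_1, vector_2)`.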


## Vector stores

In [9]:
from langchain_chroma import Chroma

vector_store = Chroma(embedding_function=embeddings)

In [10]:
ids = vector_store.add_documents(documents=all_splits)
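
Conceptually, a vector store keeps one embedding per document and answers a query by ranking the stored vectors by their distance to the query vector. The brute-force sketch below illustrates that idea only; `TinyVectorStore`, `toy_embed`, and the keyword vocabulary are hypothetical, and Chroma itself uses an approximate nearest-neighbor index rather than a linear scan.

```python
VOCAB = ["bias", "data", "model", "market"]


def toy_embed(text: str) -> list[int]:
    """Hypothetical embedding: count how often each vocabulary word occurs."""
    lowered = text.lower()
    return [lowered.count(word) for word in VOCAB]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinyVectorStore:
    """Illustrative in-memory store that scans every row on each query."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (id, vector, text)

    def add_texts(self, texts):
        ids = []
        for text in texts:
            doc_id = str(len(self.rows))
            self.rows.append((doc_id, self.embed(text), text))
            ids.append(doc_id)
        return ids

    def similarity_search_with_score(self, query, k=4):
        q = self.embed(query)
        scored = [(text, squared_l2(q, vec)) for _, vec, text in self.rows]
        scored.sort(key=lambda pair: pair[1])  # smallest distance first
        return scored[:k]


store = TinyVectorStore(toy_embed)
ids = store.add_texts([
    "Bias in training data affects model output.",
    "Electronic markets rely on data.",
    "The model generates images.",
])
results = store.similarity_search_with_score("data bias")
for text, score in results:
    print(score, text)
```

The first result is the sentence that shares both query keywords, because its vector lies closest to the query vector.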

## Usage


In [12]:
results = vector_store.similarity_search(
    "What are the Challenges for generative AI‑based systems?"
)

print(results[0])

page_content='and businesses should seek to understand and embrace the 
potential of generative AI (Eloundou et al., 2023; Willcocks, 
2020).
Challenges for generative AI‑based systems
While generative AI holds transformative potential for 
individuals, organizations, and society due to its vast pos-
sible application space, the technology also inherits vari-
ous challenges that parallel those of traditional ML and 
DL systems. The domain of electronic markets is a prime 
example that moved into the center of transformation 
due to its latest focus on data-driven efforts (Selz, 2020). 
Outlining and emphasizing these challenges relevant for 
research and practice helps to raise awareness of the con-
straints as well as supports future efforts in developing, 
implementing, and improving GAI-based systems.
Bias
Because of GAI’s data-driven nature, data quality plays an 
essential role in how GAI-based systems perform and, thus, 
how feasible their adoption for real-world scenarios in bus

### Async query

In [13]:
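# This query is off-topic for a paper on generative AI. Similarity search
# still returns the nearest chunk, whether or not it answers the question.
results = await vector_store.asimilarity_search("When was Nike incorporated?")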

print(results[0])

page_content='Electronic Markets           (2023) 33:63 
1 3   63  Page 2 of 17
While GAI research and development is continuing to 
invest toward better, faster, and more capable models (e.g., 
Microsoft, 2023), studies on the fundamental principles, 
applications, and socio-economic impact remain largely 
unexplored in the academic discourse (Strobel et al., 2024; 
Susarla et al., 2023; Wessel et al., 2023). GAI provides inno-
vation opportunities for various domains (e.g., networked 
businesses and digital platforms) but also comes with chal-
lenges (e.g., transparency, biases, and misuse) that need to 
be addressed for successful implementations (Houde et al., 
2020; Schramowski et al., 2022; van Slyke et al., 2023). 
However, an examination of the key concepts is yet to be 
conducted, leaving a clear image and understanding of gen-
erative AI undefined. To overcome that shortcoming, this 
article provides an introduction to the fundamentals of gen-' metadata={'page': 2, 'page_labe

### Return scores

In [15]:
# Score semantics differ between vector store providers. The score returned
# here is a distance, so a lower value means a closer match.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 58.756858825683594

page_content='Electronic Markets           (2023) 33:63 
1 3   63  Page 2 of 17
While GAI research and development is continuing to 
invest toward better, faster, and more capable models (e.g., 
Microsoft, 2023), studies on the fundamental principles, 
applications, and socio-economic impact remain largely 
unexplored in the academic discourse (Strobel et al., 2024; 
Susarla et al., 2023; Wessel et al., 2023). GAI provides inno-
vation opportunities for various domains (e.g., networked 
businesses and digital platforms) but also comes with chal-
lenges (e.g., transparency, biases, and misuse) that need to 
be addressed for successful implementations (Houde et al., 
2020; Schramowski et al., 2022; van Slyke et al., 2023). 
However, an examination of the key concepts is yet to be 
conducted, leaving a clear image and understanding of gen-
erative AI undefined. To overcome that shortcoming, this 
article provides an introduction to the fundamentals of gen-' meta
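
The sketch below shows why a distance score varies inversely with similarity. It assumes the score is a squared Euclidean (L2) distance, which to my understanding is Chroma's default but is worth verifying for your configuration. For unit-length vectors, squared L2 distance equals `2 - 2 * cosine_similarity`, so it ranges from 0 (identical direction) to 4 (opposite direction). The helper names are illustrative.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

distance = squared_l2(a, b)
cosine = dot(a, b)  # for unit vectors the dot product is the cosine similarity

# Both printed values agree up to floating-point rounding.
print(distance, 2 - 2 * cosine)
```

Under that assumption, the score of 58.76 above exceeds 4, which suggests the embeddings in this notebook are not unit-normalized. Such scores are best compared against each other for the same query rather than read as absolute values.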

### Return documents based on similarity to an embedded query

In [16]:
embedding = embeddings.embed_query("What are the Challenges for generative AI‑based systems?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='and businesses should seek to understand and embrace the 
potential of generative AI (Eloundou et al., 2023; Willcocks, 
2020).
Challenges for generative AI‑based systems
While generative AI holds transformative potential for 
individuals, organizations, and society due to its vast pos-
sible application space, the technology also inherits vari-
ous challenges that parallel those of traditional ML and 
DL systems. The domain of electronic markets is a prime 
example that moved into the center of transformation 
due to its latest focus on data-driven efforts (Selz, 2020). 
Outlining and emphasizing these challenges relevant for 
research and practice helps to raise awareness of the con-
straints as well as supports future efforts in developing, 
implementing, and improving GAI-based systems.
Bias
Because of GAI’s data-driven nature, data quality plays an 
essential role in how GAI-based systems perform and, thus, 
how feasible their adoption for real-world scenarios in bus