Setup

In [14]:
%load_ext dotenv
%dotenv
import os
import sys
from pathlib import Path

sys.path.append(str(Path().resolve().parent))


The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


Loading the PDF content into Document objects that LangChain can handle

In [15]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../data/raw/Textbook.pdf"
loader = PyPDFLoader(file_path, mode="single")
docs = loader.load()
print(type(docs[0]))

<class 'langchain_core.documents.base.Document'>


Splitting text into smaller chunks

In [16]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=400,
)
texts = text_splitter.split_documents(docs)
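
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification and not what RecursiveCharacterTextSplitter does internally: the real splitter prefers to break at separators such as paragraph and line breaks, so its chunks are usually shorter than chunk_size. The function name sliding_window_chunks is hypothetical and used only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With the settings used in this notebook (2000 and 400), consecutive chunks would share up to 400 characters, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.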



In [17]:
print(f"Total chunks created: {len(texts)}")
if len(texts) > 0:
    print("\nSample Metadata:")
    print(texts[0].metadata)
    print("\nSample chunk content:")
    print(texts[0].page_content)

Total chunks created: 1426

Sample Metadata:
{'producer': 'Acrobat Distiller 8.3.1 (Windows)', 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-03-19T21:01:34+05:30', 'author': '', 'keywords': '', 'moddate': '2015-05-03T15:32:27+03:00', 'subject': '', 'title': '', 'source': '../data/raw/Textbook.pdf', 'total_pages': 746}

Sample chunk content:
Data Mining: The Textbook
Charu C. Aggarwal
Data Mining
The Textbook
Charu C. Aggarwal
IBM T.J. Watson Research Center
Yorktown Heights
New York
USA
A solution manual for this book is available on Springer.com.
ISBN 978-3-319-14141-1 ISBN 978-3-319-14142-8 (eBook)
DOI 10.1007/978-3-319-14142-8
Library of Congress Control Number: 2015930833
Springer Cham Heidelberg New York Dordrecht London
c⃝ Springer International Publishing Switzerland 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, speciﬁcally the rights of translation, reprinting, reuse 

Loading an embedding model based on the environment setup

In [None]:
from src.util.env_check import get_embed_model

embedding_model = get_embed_model()
print(embedding_model.embed_documents(["Hello world"]))


[[-0.0021383099, 0.0075008804, -0.011650108, -0.07023448, 0.0045393645, -0.01298006, -0.013741996, -0.0117532825, -0.10962647, -0.017243136, -0.0052222074, -0.031411108, 0.0401904, -0.011463264, -0.04773422, 0.09640879, -0.0013066429, 0.09508943, 0.09561688, -0.052990716, -0.0025705767, 0.03481676, -0.021451501, 0.13576753, -0.039540734, -0.03868507, -0.063845925, 0.126914, 0.019091083, -0.021086011, 0.006074588, 0.04749404, -0.021219999, -0.016755642, -0.035438623, -0.0154389525, 0.027086308, -0.010963458, -0.029936722, 0.04537243, 0.01965291, -0.007755763, 0.05530257, -0.01722083, 0.028688328, 0.013500553, 0.017399846, -0.016382286, 0.022164488, 0.011022739, -0.03313258, -0.00720774, -0.002563534, -0.005926767, 0.01760186, -0.042084303, 0.018804865, -0.027303668, 0.027514797, -0.012525264, -0.06433825, 0.04809149, -0.069110096, -0.012301332, 0.014198058, 0.04731715, 0.0057678707, -0.035050858, -0.08356759, -0.022858167, -0.011641543, -0.04040388, -0.060886834, 0.011948142, -0.0151253

We need a vectorstore to save the embeddings

In [None]:
collection_name = "simple_chunking"

In [20]:
from src.util.vectorstore import get_vectorstore
vector_store = get_vectorstore(embedding_model, collection_name=collection_name)

The embedding model generates an embedding for each document before it is saved to the vectorstore

In [21]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(texts))]

document_ids = vector_store.add_documents(documents=texts, ids=uuids)
    
print(f"Saved {len(document_ids)} documents to the vectorstore")

Saved 1426 documents to the vectorstore


Similarity search creates an embedding of the question and then returns documents from the vectorstore with the most similar embeddings
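
The ranking step can be sketched in plain Python. This is an illustration under assumptions, not the vectorstore's actual implementation: it assumes cosine similarity as the metric (the metric really used depends on how the collection behind get_vectorstore is configured), it scans every stored vector instead of using an index, and the three-dimensional vectors and the names cosine_similarity and top_k are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k texts whose vectors are most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vectorstore": (text, embedding) pairs with made-up 3-d embeddings.
stored = [
    ("acknowledgments chunk", [0.9, 0.1, 0.0]),
    ("clustering chunk", [0.0, 1.0, 0.2]),
    ("preface chunk", [0.7, 0.3, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedded question

best = top_k(query_vector, stored, k=2)
print(best)  # ['acknowledgments chunk', 'preface chunk']
```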

In [22]:
vector_store = get_vectorstore(embedding_model, collection_name=collection_name)
retrieved_docs = vector_store.similarity_search("Who does the author acknowledge?")
print("\n".join(doc.page_content for doc in retrieved_docs))


vectors. On the other hand, vectors with one component for each data point are usually
n-dimensional column vectors. An example is then-dimensional column vectory of class
variables ofn data points.
Acknowledgments
I would like to thank my wife and daughter for their love and support during the writing of
this book. The writing of a book requires signiﬁcant time, which is taken away from family
members. This book is the result of their patience with me during this time.
IwouldalsoliketothankmymanagerNaguiHalimforprovidingthetremendoussupport
necessary for the writing of this book. His professional support has been instrumental for
my many book eﬀorts in the past and present.
During the writing of this book, I received feedback from many colleagues. In partic-
ular, I received feedback from Kanishka Bhaduri, Alain Biem, Graham Cormode, Hongbo
Deng, Amit Dhurandhar, Bart Goethals, Alexander Hinneburg, Ramakrishnan Kannan,
George Karypis, Dominique LaSalle, Abdullah Mueen, Guojun Qi, Pie

As the output shows, the vectorstore returned the text chunks most similar to the user's question, which hopefully contain the answer.
We pass these chunks to an LLM as additional context alongside the question, augmenting its knowledge.
This is the idea behind Retrieval-Augmented Generation (RAG).
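
A rough sketch of how retrieved chunks and the question might be combined into a single prompt follows. The function name build_rag_prompt and the prompt wording are assumptions made for illustration; the notebook does not show the prompt its application actually uses. In the notebook, the chunks list would come from [doc.page_content for doc in retrieved_docs].

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved text chunks and the user question into one prompt."""
    # Number the chunks so the model (and a reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Short excerpts standing in for the retrieved page_content strings.
chunks = [
    "I would like to thank my wife and daughter for their love and support.",
    "During the writing of this book, I received feedback from many colleagues.",
]
prompt = build_rag_prompt("Who does the author acknowledge?", chunks)
print(prompt)
```

The resulting string would then be sent to whichever LLM the application uses.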

In [23]:
retrieved_docs

[Document(metadata={'producer': 'Acrobat Distiller 8.3.1 (Windows)', 'creator': 'LaTeX with hyperref package', 'creationdate': '2015-03-19T21:01:34+05:30', 'author': '', 'keywords': '', 'moddate': '2015-05-03T15:32:27+03:00', 'subject': '', 'title': '', 'source': '../data/raw/Textbook.pdf', 'total_pages': 746, '_id': '3aed89b1-3072-4cf6-9d80-708617db545a', '_collection_name': 'simple_chunking_chapters'}, page_content='vectors. On the other hand, vectors with one component for each data point are usually\nn-dimensional column vectors. An example is then-dimensional column vectory of class\nvariables ofn data points.\n\x0cAcknowledgments\nI would like to thank my wife and daughter for their love and support during the writing of\nthis book. The writing of a book requires signiﬁcant time, which is taken away from family\nmembers. This book is the result of their patience with me during this time.\nIwouldalsoliketothankmymanagerNaguiHalimforprovidingthetremendoussupport\nnecessary for the 

We must close the client connection before running another script or the Streamlit app, because this process otherwise keeps the Qdrant database locked

In [24]:
vector_store.client.close()
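
If a cell fails before the close call is reached, the lock stays in place. One way to guard against that, sketched here with a stand-in object, is contextlib.closing from the standard library, which calls close() when the block exits even if an exception was raised. FakeClient is a hypothetical placeholder; in the notebook the equivalent would be wrapping vector_store.client, assuming it exposes close() as used above.

```python
from contextlib import closing


class FakeClient:
    """Hypothetical stand-in for a database client that has a close() method."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


client = FakeClient()
try:
    # closing() guarantees client.close() runs when the block exits.
    with closing(client):
        raise RuntimeError("simulated failure during a query")
except RuntimeError:
    pass

print(client.closed)  # True
```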