##### Vector stores are used to store the embeddings of text
##### There are many options, ranging from open-source to commercial
##### similarity_search is typically used to find the k most similar items
##### - FAISS
##### - Chroma
##### - Pinecone
##### - Astra (Cassandra)
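
To make the idea behind `similarity_search` concrete, here is a minimal pure-Python sketch of a top-k lookup over stored vectors. It is not how FAISS or Chroma are implemented: the function names `cosine_similarity` and `top_k`, the 3-dimensional vectors, and the chunk labels are all hypothetical, and cosine similarity is only one common metric (a given store may default to another, such as Euclidean distance).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=3):
    """Return the k (text, score) pairs whose vectors are most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real embeddings have hundreds or thousands of dimensions
stored = [
    ("chunk about YOLO", [0.9, 0.1, 0.0]),
    ("chunk about R-CNN", [0.6, 0.4, 0.1]),
    ("chunk about references", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, stored, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A real vector store performs the same ranking, but usually with an index structure that avoids comparing the query against every stored vector.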

In [12]:
from configparser import ConfigParser
import os

# Read the OpenAI API key from a local config file and expose it as an environment variable
config = ConfigParser()
config.read("./config.ini")
os.environ["OPENAI_API_KEY"] = config["KEY"]["openai_key"]

In [2]:
from langchain.document_loaders import PyPDFLoader

In [3]:
url = "https://arxiv.org/pdf/1506.02640"

In [4]:
pdf_loader = PyPDFLoader(url)
docs = pdf_loader.load()

In [5]:
len(docs)

10

In [6]:
docs

[Document(metadata={'source': 'https://arxiv.org/pdf/1506.02640', 'page': 0}, page_content='You Only Look Once:\nUniﬁed, Real-Time Object Detection\nJoseph Redmon∗, Santosh Divvala∗†, Ross Girshick¶, Ali Farhadi∗†\nUniversity of Washington∗, Allen Institute for AI†, Facebook AI Research¶\nhttp://pjreddie.com/yolo/\nAbstract\nWe present YOLO, a new approach to object detection.\nPrior work on object detection repurposes classiﬁers to per-\nform detection. Instead, we frame object detection as a re-\ngression problem to spatially separated bounding boxes and\nassociated class probabilities. A single neural network pre-\ndicts bounding boxes and class probabilities directly from\nfull images in one evaluation. Since the whole detection\npipeline is a single network, it can be optimized end-to-end\ndirectly on detection performance.\nOur uniﬁed architecture is extremely fast. Our base\nYOLO model processes images in real-time at 45 frames\nper second. A smaller version of the network, Fast

In [7]:
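from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,            # maximum chunk length, measured by length_function
    chunk_overlap=50,          # amount of text shared between neighbouring chunks
    length_function=len,       # measure length in characters
    is_separator_regex=False,  # treat separators as literal strings, not regexes
)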

In [8]:
chunks = text_splitter.split_documents(docs)

In [9]:
chunks

[Document(metadata={'source': 'https://arxiv.org/pdf/1506.02640', 'page': 0}, page_content='You Only Look Once:\nUniﬁed, Real-Time Object Detection\nJoseph Redmon∗, Santosh Divvala∗†, Ross Girshick¶, Ali Farhadi∗†\nUniversity of Washington∗, Allen Institute for AI†, Facebook AI Research¶\nhttp://pjreddie.com/yolo/\nAbstract\nWe present YOLO, a new approach to object detection.'),
 Document(metadata={'source': 'https://arxiv.org/pdf/1506.02640', 'page': 0}, page_content='Prior work on object detection repurposes classiﬁers to per-\nform detection. Instead, we frame object detection as a re-\ngression problem to spatially separated bounding boxes and\nassociated class probabilities. A single neural network pre-\ndicts bounding boxes and class probabilities directly from'),
 Document(metadata={'source': 'https://arxiv.org/pdf/1506.02640', 'page': 0}, page_content='full images in one evaluation. Since the whole detection\npipeline is a single network, it can be optimized end-to-end\ndirect

In [10]:
len(chunks)

164
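
The `chunk_size=300` and `chunk_overlap=50` settings above can be illustrated with a simplified fixed-window splitter. This is only a sketch of the overlap idea, not the algorithm `RecursiveCharacterTextSplitter` uses (which splits on separators such as newlines first, so its chunks are often shorter than `chunk_size`). The helper name `chunk_with_overlap` and the sample text are hypothetical.

```python
def chunk_with_overlap(text, chunk_size=300, chunk_overlap=50):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Hypothetical 700-character sample text
sample = "".join(str(i % 10) for i in range(700))
pieces = chunk_with_overlap(sample, chunk_size=300, chunk_overlap=50)

print(len(pieces))
print([len(p) for p in pieces])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval.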

In [11]:
from langchain.embeddings import OpenAIEmbeddings

In [13]:
embeddings = OpenAIEmbeddings(api_key=os.environ["OPENAI_API_KEY"])


## Using FAISS

In [14]:
from langchain.vectorstores import FAISS

In [15]:
vector_db = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings,
)

In [16]:
query = "what is YOLO"

In [17]:
search_k = vector_db.similarity_search(query, k=3)

In [18]:
search_k

[Document(metadata={'source': 'https://arxiv.org/pdf/1506.02640', 'page': 1}, page_content='work, YOLO outperforms top detection methods like DPM\nand R-CNN by a wide margin. Since YOLO is highly gen-\neralizable it is less likely to break down when applied to\nnew domains or unexpected inputs.\nYOLO still lags behind state-of-the-art detection systems'),
 Document(metadata={'source': 'https://arxiv.org/pdf/1506.02640', 'page': 0}, page_content='You Only Look Once:\nUniﬁed, Real-Time Object Detection\nJoseph Redmon∗, Santosh Divvala∗†, Ross Girshick¶, Ali Farhadi∗†\nUniversity of Washington∗, Allen Institute for AI†, Facebook AI Research¶\nhttp://pjreddie.com/yolo/\nAbstract\nWe present YOLO, a new approach to object detection.'),
 Document(metadata={'source': 'https://arxiv.org/pdf/1506.02640', 'page': 0}, page_content='http://pjreddie.com/yolo/ .\nSecond, YOLO reasons globally about the image when\n1arXiv:1506.02640v5  [cs.CV]  9 May 2016')]

## Using Chroma

In [19]:
from langchain.vectorstores import Chroma

In [20]:
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
)

In [23]:
search_k = vector_db.similarity_search(query, k=3)

In [24]:
search_k

[Document(metadata={'page': 1, 'source': 'https://arxiv.org/pdf/1506.02640'}, page_content='work, YOLO outperforms top detection methods like DPM\nand R-CNN by a wide margin. Since YOLO is highly gen-\neralizable it is less likely to break down when applied to\nnew domains or unexpected inputs.\nYOLO still lags behind state-of-the-art detection systems'),
 Document(metadata={'page': 0, 'source': 'https://arxiv.org/pdf/1506.02640'}, page_content='You Only Look Once:\nUniﬁed, Real-Time Object Detection\nJoseph Redmon∗, Santosh Divvala∗†, Ross Girshick¶, Ali Farhadi∗†\nUniversity of Washington∗, Allen Institute for AI†, Facebook AI Research¶\nhttp://pjreddie.com/yolo/\nAbstract\nWe present YOLO, a new approach to object detection.'),
 Document(metadata={'page': 0, 'source': 'https://arxiv.org/pdf/1506.02640'}, page_content='http://pjreddie.com/yolo/ .\nSecond, YOLO reasons globally about the image when\n1arXiv:1506.02640v5  [cs.CV]  9 May 2016')]
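
The FAISS and Chroma outputs above contain the same three chunks; only the key order inside `metadata` differs. A small helper along the following lines could check that programmatically. It is a sketch: `StubDocument` is a hypothetical stand-in that exposes only the `page_content` and `metadata` attributes visible in the outputs, `same_results` is a hypothetical name, and the sample text is abbreviated from the first result.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in with the two attributes seen in the outputs above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def same_results(results_a, results_b):
    """True when both lists hold the same chunks in the same order.

    Dict equality ignores key order, so a different metadata layout does not matter.
    """
    if len(results_a) != len(results_b):
        return False
    return all(
        a.page_content == b.page_content and a.metadata == b.metadata
        for a, b in zip(results_a, results_b)
    )


src = "https://arxiv.org/pdf/1506.02640"
text = "work, YOLO outperforms top detection methods"  # abbreviated chunk text
faiss_like = [StubDocument(text, {"source": src, "page": 1})]
chroma_like = [StubDocument(text, {"page": 1, "source": src})]

print(same_results(faiss_like, chroma_like))
```

Because the helper only reads `page_content` and `metadata`, it should also accept the lists returned by `similarity_search`, assuming the returned objects expose those attributes as the outputs suggest.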