# VectorStores

One of the most common ways to store and search unstructured data is to embed it and store the resulting vectors; then, at query time, you embed the query and retrieve the vectors that are 'most similar' to it. A VectorStore handles both storing the vectors and running the vector search for you.
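
To make the embed, store, and retrieve cycle concrete, here is a minimal sketch in plain Python. It is not how Chroma or FAISS work internally: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), and `ToyVectorStore` is an illustrative class that ranks entries by cosine similarity.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Illustrative store: keeps (text, vector) pairs and ranks them per query."""

    def __init__(self):
        self._entries = []

    def add_texts(self, texts):
        # Embed each text once, at storage time
        for text in texts:
            self._entries.append((text, toy_embed(text)))

    def similarity_search(self, query, k=2):
        # Embed the query, then return the k most similar stored texts
        query_vector = toy_embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "hugging face hosts datasets and models",
    "faiss is a library for similarity search",
    "chroma persists vectors on disk",
])
top = store.similarity_search("where are datasets and models hosted", k=1)
print(top)
```

A real vector store replaces the word counts with dense vectors from a model such as `OpenAIEmbeddings` and usually replaces the linear scan with an index, but the interface has the same shape as the `similarity_search` calls used later in this notebook.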

## Chroma VectorStore

In [4]:
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

### Document Loading

In [5]:
caminho = 'arquivos/TSP_CMC_54360.pdf'
loader = PyPDFLoader(caminho)
paginas = loader.load()

In [6]:
len(paginas)

20

### Text Splitting

In [7]:
recur_split = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

documents = recur_split.split_documents(paginas)
len(documents)

139
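
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding window over characters. This sketch is an assumption-laden simplification: `RecursiveCharacterTextSplitter` first tries to split on the listed separators and only then merges pieces up to the size limit, so its chunk boundaries will differ. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap=50` plays above: context that straddles a boundary still appears whole in at least one chunk.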

### Creating the VectorStore

In [8]:
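# Chroma needs a newer SQLite than some environments ship with, so swap in
# the pysqlite3 package under the name `sqlite3` before Chroma is imported.
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')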

In [9]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [10]:
from langchain_chroma import Chroma

diretorio = 'arquivos/chroma_vectorstore'

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings_model,
    persist_directory=diretorio
)

In [8]:
print(vectorstore._collection.count())

266
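
The count above does not match the 139 chunks produced by the splitter, and the notebook does not say why. One plausible cause (an assumption, not something the output confirms) is that the creation cell ran more than once against the same `persist_directory`, so some chunks were stored again under fresh IDs. A common way to guard against that is to derive a deterministic ID from each chunk's text. The helper `stable_chunk_ids` below is hypothetical.

```python
import hashlib


def stable_chunk_ids(texts):
    """Hypothetical helper: derive a deterministic ID from each chunk's text."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


chunk_texts = ["first chunk", "second chunk", "first chunk"]
ids = stable_chunk_ids(chunk_texts)

# Keep one entry per ID so identical chunks are stored only once
unique = dict(zip(ids, chunk_texts))
print(len(chunk_texts), "chunks,", len(unique), "unique IDs")

# Assumption: the unique IDs and their documents could then be passed through
# the `ids=` argument of Chroma.from_documents, so re-running the cell would
# overwrite existing entries instead of adding new ones.
```

With the real data, the texts would come from `doc.page_content` for each item in `documents`.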


### Loading the vectorstore from disk

In [11]:
diretorio = 'arquivos/chroma_vectorstore'

vectorstore = Chroma(
    embedding_function=embeddings_model,
    persist_directory=diretorio
)

### Retrieval

In [12]:
pergunta = 'O que é o Hugging Face?'

docs = vectorstore.similarity_search(pergunta, k=5)
len(docs)

5

In [13]:
for doc in docs:
    print(doc.page_content)
    print(f'====== {doc.metadata}\n\n')

cussion information for LLM preparation, drawing upon the lavita/ChatDoctor-HealthCareMagic-
100k dataset on Hugging Face as a source of perspective point. Here, we dig into the thinking behind
apparently basic strategies and their effect on the adequacy of the last chatbot model.
Our underlying step includes fragmenting the crude dataset into individual discussion strings.
This is usually accomplished by recognizing newline characters or other delimiters that differ in one


cussion information for LLM preparation, drawing upon the lavita/ChatDoctor-HealthCareMagic-
100k dataset on Hugging Face as a source of perspective point. Here, we dig into the thinking behind
apparently basic strategies and their effect on the adequacy of the last chatbot model.
Our underlying step includes fragmenting the crude dataset into individual discussion strings.
This is usually accomplished by recognizing newline characters or other delimiters that differ in one


4 Experiment
Our examination investiga

## FAISS VectorStore

https://python.langchain.com/docs/integrations/vectorstores/

In [14]:
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [15]:

loader = PyPDFLoader(caminho)
paginas = loader.load()

recur_split = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)

documents = recur_split.split_documents(paginas)

In [16]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [17]:
from langchain_community.vectorstores.faiss import FAISS

vectorstore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings_model
)

In [18]:
pergunta = 'O que é o Hugging Face?'

docs = vectorstore.similarity_search(pergunta, k=5)
len(docs)

5

In [19]:
for doc in docs:
    print(doc.page_content)
    print(f'====== {doc.metadata}\n\n')

cussion information for LLM preparation, drawing upon the lavita/ChatDoctor-HealthCareMagic-
100k dataset on Hugging Face as a source of perspective point. Here, we dig into the thinking behind
apparently basic strategies and their effect on the adequacy of the last chatbot model.
Our underlying step includes fragmenting the crude dataset into individual discussion strings.
This is usually accomplished by recognizing newline characters or other delimiters that differ in one


4 Experiment
Our examination investigates the capability of consolidating Boundary Efficient Fine-Tuning
(PEFT) strategies and quantization to make an asset-efficient and exact medical care chatbot inside the
limits of a free Google Colab environment. We further explore the utilization of Retrieval-Augmented
Generation (RAG) with LangChain to upgrade the chatbot’s capacity to address client inquiries by
leveraging an outside knowledge base.


the test set. Implemented on the National Economics University’s officia

### Saving the FAISS database

In [20]:
vectorstore.save_local('arquivos/faiss_bd')

### Loading the FAISS database

In [21]:
from langchain_community.vectorstores.faiss import FAISS

vectorstore = FAISS.load_local(
    'arquivos/faiss_bd',
    embeddings=embeddings_model,
    allow_dangerous_deserialization=True
)
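
The `allow_dangerous_deserialization=True` flag is required because, as far as I know, the saved FAISS store includes a pickle file next to the index, and unpickling can run arbitrary code. The sketch below is independent of LangChain and only illustrates why pickle files from untrusted sources are risky: the `Payload` class is hypothetical, and the expression it evaluates is deliberately harmless.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form calls a function when loaded."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')"
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs the embedded call; no Payload instance comes back
result = pickle.loads(blob)
print(result)
```

An attacker could substitute any callable and arguments for the harmless expression, so the flag should only be enabled for files you created yourself, such as the `arquivos/faiss_bd` directory saved above.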