In [9]:
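## Data ingestion

from langchain_community.document_loaders import TextLoader

# Path is relative to the notebook's working directory.
# encoding="utf-8" assumes the file was saved as UTF-8; without it the
# platform default is used, which can garble non-ASCII characters.
txt_path = "speech.txt"
loader = TextLoader(txt_path, encoding="utf-8")
text_documents = loader.load()
text_documents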

[Document(page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not 
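
A text file saved as UTF-8 can come out garbled when it is decoded with a legacy Windows code page, which is one reason to pass an explicit encoding to a loader. The sketch below reproduces the effect with the standard library only; the sample string is made up for illustration, and cp1252 is an assumed example of a wrong codec.

```python
# Illustrative only: decode the same UTF-8 bytes with a wrong and a right codec.
original = "fighting for.\n\n\u2026\n\nIt will be"  # \u2026 is the ellipsis character
raw_bytes = original.encode("utf-8")

garbled = raw_bytes.decode("cp1252")   # wrong codec: each UTF-8 byte becomes its own character
restored = raw_bytes.decode("utf-8")   # right codec: the ellipsis survives

print(repr(garbled))
print(repr(restored))
```

The ellipsis occupies three bytes in UTF-8, so the wrong codec turns it into three unrelated characters.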

In [10]:
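import os
from dotenv import load_dotenv

# Read variables from a local .env file into the process environment.
load_dotenv()

# os.environ values must be strings, so assigning a missing key (None) raises TypeError.
openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the environment")
os.environ["OPENAI_API_KEY"] = openai_api_key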

In [5]:
# Web-based loader
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load only the title, header, and body of the post; chunking and indexing happen in later cells.
loader = WebBaseLoader(
    web_path="https://lilianweng.github.io/posts/2023-06-23-agent/",
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            # These are the CSS classes used on the HTML page.
            class_=("post-title", "post-content", "post-header")
        )
    ),
)

In [6]:
text_documents = loader.load()
text_documents

[Document(page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final re

In [8]:
# PDF loader

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("fine_tuning_bert.pdf")

pdf_text = loader.load()

pdf_text

[Document(page_content='Fine-Tuning BERT Models to Extract Named Entities\nfrom Archival Finding Aids\nLuis Filipe Cunha1,∗,José Carlos Ramalho1\n1Department of Informatics, University of Minho, Portugal\nAbstract\nIn recent works, several NER models were developed to extract named entities from Portuguese Archival\nFinding Aids. In this paper, we are complementing the work already done by presenting a different NER\nmodel with a new architecture, Bidirectional Encoding Representation from Transformers (BERT). In\norder to do so, we used a BERT model that was pre-trained in Portuguese vocabulary and fine-tuned it\nto our concrete classification problem, NER. In the end, we compared the results obtained with previous\narchitectures. In addition to this model we also developed an annotation tool that uses ML models to\nspeed up the corpora annotation process.\nKeywords\nNamed Entity Recognition, BERT, Web, Corpora Annotation\n1. Introduction\nIn recent works, mechanisms were created to e

In [16]:
# Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of at most 1000 characters; neighbouring chunks may share up to 200 characters.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

final_documents = text_splitter.split_documents(pdf_text)

final_documents[:5]

[Document(page_content='Fine-Tuning BERT Models to Extract Named Entities\nfrom Archival Finding Aids\nLuis Filipe Cunha1,∗,José Carlos Ramalho1\n1Department of Informatics, University of Minho, Portugal\nAbstract\nIn recent works, several NER models were developed to extract named entities from Portuguese Archival\nFinding Aids. In this paper, we are complementing the work already done by presenting a different NER\nmodel with a new architecture, Bidirectional Encoding Representation from Transformers (BERT). In\norder to do so, we used a BERT model that was pre-trained in Portuguese vocabulary and fine-tuned it\nto our concrete classification problem, NER. In the end, we compared the results obtained with previous\narchitectures. In addition to this model we also developed an annotation tool that uses ML models to\nspeed up the corpora annotation process.\nKeywords\nNamed Entity Recognition, BERT, Web, Corpora Annotation\n1. Introduction', metadata={'source': 'fine_tuning_bert.pdf', 
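
To see what chunk_size and chunk_overlap mean, here is a simplified stand-in written in plain Python. It is not how RecursiveCharacterTextSplitter works internally: that splitter prefers to break on separators such as paragraph breaks, newlines, and spaces, whereas this sketch cuts fixed-size windows. The function name sliding_chunks and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The last 4 characters of each window reappear at the start of the next one, so a sentence that straddles a boundary is still fully contained in at least one chunk when the overlap is large enough.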

In [23]:
# Embeddings and vector store

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Requires a local Ollama server with the llama2 model available.
embeddings = OllamaEmbeddings(model="llama2")
# Set up the vector store, indexing only the first 20 chunks.
db = Chroma.from_documents(final_documents[:20], embeddings)


In [24]:
db

<langchain_community.vectorstores.chroma.Chroma at 0x22cb2e2f010>

In [28]:
## Query the vector database
query = "Who are the authors of Fine-Tuning BERT Models to extract Named Entities from Archival Finding Aids research paper"


In [29]:
res = db.similarity_search(query)
res

[Document(page_content='•Implementation of a smart annotator that uses ML models to assist the experimenter with\nthe corpora annotating process.\nNER@DI can be used by various types of users, for example, historians wishing to extract\nrelevant entities from archival documents or even other developers or researchers with the\nintent of reusing the annotated datasets in other contexts, or using NER@DI as a service in\ntheir own applications.\n4. Models\nInordertoidentifyandextractentitiesfromnaturaltext,NER@DIusesseveralMLarchitectures,\nsuchasMaximumEntropy,ConvolutionNeuralNetworks(CNN),BidirectionalLongShort-Term\nMemory with a Conditional Random Field decoder (Bi-LSTM-CRF) and was recently updated\nwith a new model, Bidirectional Encoder Representations from Transformers (BERT).\nIn NER@DI previous versions, we used NER approaches that consisted of training models\nfrom scratch. Now we present a BERT model, which consists of using pre-trained models', metadata={'page': 3, 'source':

In [30]:
res[0].page_content

'•Implementation of a smart annotator that uses ML models to assist the experimenter with\nthe corpora annotating process.\nNER@DI can be used by various types of users, for example, historians wishing to extract\nrelevant entities from archival documents or even other developers or researchers with the\nintent of reusing the annotated datasets in other contexts, or using NER@DI as a service in\ntheir own applications.\n4. Models\nInordertoidentifyandextractentitiesfromnaturaltext,NER@DIusesseveralMLarchitectures,\nsuchasMaximumEntropy,ConvolutionNeuralNetworks(CNN),BidirectionalLongShort-Term\nMemory with a Conditional Random Field decoder (Bi-LSTM-CRF) and was recently updated\nwith a new model, Bidirectional Encoder Representations from Transformers (BERT).\nIn NER@DI previous versions, we used NER approaches that consisted of training models\nfrom scratch. Now we present a BERT model, which consists of using pre-trained models'
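
Conceptually, similarity_search embeds the query and returns the stored chunks whose vectors lie closest to it (four results unless k is passed). The sketch below is a brute-force illustration of that idea using cosine similarity on made-up three-dimensional vectors. It is an assumption-laden toy, not Chroma's implementation: the real store uses an approximate index, and its distance metric depends on how the collection is configured. All names here (cosine_similarity, top_k, toy_vectors) are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Made-up embeddings standing in for real chunk vectors.
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}

ranked = top_k([0.8, 0.0, 0.1], toy_vectors, k=2)
print(ranked)
```

Because ranking depends entirely on the embedding vectors, the quality of the results is bounded by how well the embedding model places related text close together.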

In [35]:
query = "Bidirectional Encoder Representations from Transformers"
res = db.similarity_search(query)
res


[Document(page_content='such as NER. In this case, word-based tokenizers are usually used, i.e., defining a fixed size 𝑁\nfor the vocabulary and then associating an id for the 𝑁most frequent words of that vocabulary.\nThis method has shown good results in several contexts, however, it has several limitations.\nDue to the fact that the number of words is limited, ML models have difficulties dealing with\nout of vocabulary words or even words that are rarely used. One solution for this problem is\nto increase the number of vocabulary words ( 𝑁), however, this would lead to other problems\nsuch as making the computational model heavier and increasing the number of rare words. On\nthe other hand, as each distinct word has a different id, similar words have entirely different\nmeanings, which causes information about the words’ relationship to be lost during this phase,\ndecreasing the performance of the models.\nThus, in order to solve these limitations and increase the meaning of the nume