# Libraries

* pypdf 
* langchain 
* Chroma

In [1]:
!curl -o paper.pdf https://arxiv.org/pdf/2311.06517.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1628k  100 1628k    0     0  98575      0  0:00:16  0:00:16 --:--:--   98k


# All necessary imports

In [1]:
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.callbacks.manager import CallbackManager
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain
from langchain.vectorstores import Chroma 
from langchain.chains import ChatVectorDBChain
from langchain.llms import Ollama
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OllamaEmbeddings

# LLM Model

Zephyr is a series of language models trained to act as helpful assistants. I chose Zephyr because it has only 7B parameters. <br>
Source: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

In [2]:
llm = Ollama(
    model="zephyr", callback_manager=CallbackManager([StreamingStdOutCallbackHandler()])
)

# PDF Loader

LangChain has several document loaders: tools that load different types of files, such as CSV or, as here, PDF.

In [3]:
loader = PyPDFLoader("paper.pdf")
pages = loader.load_and_split()
len(pages)

26
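To give an idea of what the "split" part of `load_and_split` is for, here is a minimal sketch of cutting a long text into overlapping chunks. This is a simplified illustration and not LangChain's actual splitter (as far as I know the default is a recursive character splitter with its own rules); the function name `split_with_overlap` and the parameters `chunk_size` and `overlap` are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Cut text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Placeholder text (30 characters), not taken from the paper.
sample = "abcdefghij" * 3
chunks = split_with_overlap(sample, chunk_size=12, overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means a sentence that falls on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.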

# Embeddings

I chose Zephyr as my LLM, so I use the same model to create the embeddings. <br>
Embeddings refer to a technique in machine learning and natural language processing where words, phrases, or entities are represented as vectors of real numbers in a multi-dimensional space. The key idea behind embeddings is to capture semantic relationships between words or entities, such that similar words or entities are mapped to nearby points in the vector space.
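The "nearby points" idea can be sketched with cosine similarity, a common way to compare two embedding vectors. The vectors below are made up for illustration (three dimensions instead of the hundreds or thousands a real model produces), and the labels are hypothetical; nothing here comes from the Zephyr model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example, not real embeddings.
toy_vectors = {
    "database": [0.9, 0.1, 0.0],
    "table": [0.8, 0.2, 0.1],
    "soccer": [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["database"], toy_vectors["table"])
sim_unrelated = cosine_similarity(toy_vectors["database"], toy_vectors["soccer"])
print(f"database vs table:  {sim_related:.3f}")
print(f"database vs soccer: {sim_unrelated:.3f}")
```

A vector store does essentially this comparison between the embedded question and every stored chunk, then returns the closest chunks.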

In [4]:
embeddings = OllamaEmbeddings(model="zephyr")

# Persist the db in the directory

With `Chroma.from_documents` we can create the vector store, the database that holds the embeddings. <br>
After that the database is saved on disk; to use it again, load it from disk: `Chroma(persist_directory='database_path', embedding_function=embeddings)`

In [5]:
vector_db = Chroma.from_documents(pages, embedding=embeddings, persist_directory='.')
vector_db.persist()
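Reloading the persisted store in a later session could look like the sketch below. The helper name `load_vector_db` is hypothetical; it assumes the same `langchain` version as the imports in this notebook and that the same embedding model is used for loading as for building the store, since otherwise the query vectors would not match the stored ones.

```python
def load_vector_db(persist_directory: str = ".", model: str = "zephyr"):
    """Reopen a Chroma store that was persisted earlier (hypothetical helper)."""
    # Imports live inside the function so that defining it does not
    # require the libraries or a running Ollama server.
    from langchain.embeddings import OllamaEmbeddings
    from langchain.vectorstores import Chroma

    embeddings = OllamaEmbeddings(model=model)
    # embedding_function is needed so that new queries can be embedded.
    return Chroma(persist_directory=persist_directory, embedding_function=embeddings)
```

Calling `vector_db = load_vector_db(".")` would then replace the `Chroma.from_documents` step, so the PDF does not have to be embedded again.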

# Ask questions to PDF/VectorDB

In [6]:
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vector_db.as_retriever())
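When no chain type is given, `from_chain_type` uses the "stuff" strategy: the retrieved chunks are pasted into a single prompt together with the question. The sketch below shows the idea in plain Python. The prompt wording and the function name `build_stuffed_prompt` are my own assumptions, not LangChain's actual template, and the chunk texts are placeholders.

```python
def build_stuffed_prompt(question: str, retrieved_texts: list[str]) -> str:
    """Paste all retrieved chunks into one prompt, followed by the question."""
    context = "\n\n".join(retrieved_texts)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for what the retriever would return.
retrieved = ["<text of chunk one>", "<text of chunk two>"]
prompt = build_stuffed_prompt("what is about the article?", retrieved)
print(prompt)
```

This is why the number and size of retrieved chunks matter: everything has to fit in the model's context window at once.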

In [7]:
question = "what is about the article?"
result = qa_chain({"query": question})

The article discusses the challenges faced by existing Bayesian methods for data cleaning and repair, which include a high dependence on accurate domain knowledge from experts, inefficiency of inference using Bayesian networks due to variable elimination following the topological order of nodes, and the presence of errors in the constructed Bayesian network resulting from noisy data. The article provides examples and references to related studies in the field of database management systems and machine learning.

In [8]:
result

{'query': 'what is about the article?',
 'result': 'The article discusses the challenges faced by existing Bayesian methods for data cleaning and repair, which include a high dependence on accurate domain knowledge from experts, inefficiency of inference using Bayesian networks due to variable elimination following the topological order of nodes, and the presence of errors in the constructed Bayesian network resulting from noisy data. The article provides examples and references to related studies in the field of database management systems and machine learning.'}

# Ask with context

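This works the same way as talking to a chatbot backed by an LLM, which keeps track of the conversation so far. The chain here does not store earlier turns by itself, so the previous question and answer are collected in a `context` list and passed along with the next question. Below is a simple example.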

In [30]:
question = "Make a summary about the BClean using simple words"
result = qa_chain({"query": question})

BClean is a new approach for cleaning large-scale databases with errors. It uses Bayesian networks to represent the probability distribution of each tuple in a table, and a compensatory scoring model that considers both the accuracy and completeness of corrections. BClean's algorithm includes three stages: error detection, error correction, and error evaluation. The detection stage identifies inconsistent values using domain knowledge and probabilistic reasoning, while the correction stage uses a BN to infer the most probable value for each missing or incorrect cell based on other attributes of the same tuple. BClean outperforms other cleaning techniques in terms of precision, recall, and F1-score, especially when there is adequate relational information and a low error rate. However, due to memory limitations, BClean's evaluation was limited to a subset of the Soccer dataset, where its recall still exceeded that of other methods.

In [31]:
context = []
context.append({"question": result['query'], "answer": result["result"]})

In [32]:
#context = f"Question: {result['query']} | Answer: {result['result']}"
context

[{'question': 'Make a summary about the BClean using simple words',
  'answer': "BClean is a new approach for cleaning large-scale databases with errors. It uses Bayesian networks to represent the probability distribution of each tuple in a table, and a compensatory scoring model that considers both the accuracy and completeness of corrections. BClean's algorithm includes three stages: error detection, error correction, and error evaluation. The detection stage identifies inconsistent values using domain knowledge and probabilistic reasoning, while the correction stage uses a BN to infer the most probable value for each missing or incorrect cell based on other attributes of the same tuple. BClean outperforms other cleaning techniques in terms of precision, recall, and F1-score, especially when there is adequate relational information and a low error rate. However, due to memory limitations, BClean's evaluation was limited to a subset of the Soccer dataset, where its recall still exceed
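One way to make sure earlier turns actually reach the model is to fold them into the question text itself. As far as I can tell, `RetrievalQA` only reads the `query` key, so an extra `context` key may never reach the prompt; treat that as an assumption to verify. The helper names `format_history` and `question_with_history` are hypothetical, and the history below is a placeholder.

```python
def format_history(history):
    """Turn a list of {'question', 'answer'} dicts into one line per turn."""
    return "\n".join(
        f"Question: {turn['question']} | Answer: {turn['answer']}" for turn in history
    )


def question_with_history(history, question):
    """Prefix the new question with the earlier turns, if there are any."""
    past = format_history(history)
    if not past:
        return question
    return f"Previous conversation:\n{past}\n\nNew question: {question}"


# Placeholder history with the same shape as the `context` list above.
history = [{"question": "first question", "answer": "first answer"}]
combined = question_with_history(history, "Make a summary in portuguese")
print(combined)
```

The combined string could then be sent as `qa_chain({"query": combined})`. One trade-off: the whole string is also used for retrieval, so a long history can pull in chunks related to old questions rather than the new one.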

In [33]:
question = "Make a summary in portuguese"
result = qa_chain({"context": context , "query": question})

Resumo: O artigo apresenta um método novo para corrigir valores errados em bases de dados relacionais, conhecido como BClean. A proposta de BClean é combinar o modelo Bayesiano (BN) com a otimização da complexidade de tempo e espaço. O experimento comparativo foi realizado usando seis conjuntos de dados diferentes e três medidas diferentes de desempenho: precisão, recall e F1-score. Os resultados mostraram que BClean foi mais robusto a erros de domínio especializado, erros de distribuição e valores errôneos aleatórios. Em particular, BClean obteve resultados comparáveis aos melhores métodos em quatro dos seis conjuntos de dados usados, exceto no Conjunto de Vôos, onde PClean teve melhor desempenho. Os métodos baseados na distribuição obtiveram resultados inferiores às outras metodologias devido à dificuldade de modelar acuradamente as distribuições usando conhecimentos especializados. Quanto à robustez contra diferentes tipos e ratios de erros, BClean demonstrou maior robustez contra d

KeyboardInterrupt: 

In [None]:
context.append({"question": result['query'], "answer": result["result"]})

In [None]:
question
result = qa_chain({"context": context , "query": question})

In [18]:
qa_chain2 = RetrievalQAWithSourcesChain.from_chain_type(llm, retriever=vector_db.as_retriever(), return_source_documents=True)
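With `return_source_documents=True` the result should also carry the retrieved chunks under `source_documents`, and `PyPDFLoader` stores the file name and a zero-based page number in each chunk's `metadata`. A small sketch for listing which pages an answer drew on follows; the helper name `list_source_pages` is hypothetical, and the documents are stand-ins built with `SimpleNamespace` so the example runs without the chain.

```python
from types import SimpleNamespace


def list_source_pages(source_documents):
    """Return the sorted, de-duplicated (source, page) pairs of the given documents."""
    pairs = set()
    for doc in source_documents:
        meta = doc.metadata
        pairs.add((meta.get("source", "unknown"), meta.get("page", -1)))
    return sorted(pairs)


# Stand-in documents; real ones would come from result["source_documents"].
fake_docs = [
    SimpleNamespace(metadata={"source": "paper.pdf", "page": 3}),
    SimpleNamespace(metadata={"source": "paper.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "paper.pdf", "page": 3}),
]
print(list_source_pages(fake_docs))
```

After running the chain, `list_source_pages(result["source_documents"])` would give a quick way to check an answer against the PDF.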

In [19]:
question = "Make a summary about the BClean using simple words, show as a solution"
result = qa_chain2({"question": question})

The challenges faced by existing Bayesian methods for data cleaning are as follows:

1. High dependence on accurate domain knowledge from typical office (TTO) and necessates user interaction and domain knowledge editing, (e.g., PPL [35]). This requirement makes it difficult to apply these methods in real-world scenarios where such detailed knowledge is not readily available or easily understood by the average data scientist/engineer.
2. Inefficiency of inference using Bayesian networks, as variable elimination following the topological order of nodes incurs significant computational cost.
3. Presence of errors in the constructed Bayesian network due to noisy data that may propagate into subsequent inferences resulting in inferior inference performance.
For example, Table 1 shows a Customer table with 6 tuples and 9 attributes where most existing methods can accurately identify and correct errors. However, these methods require accurate domain knowledge from experts for PPL, which neces

KeyboardInterrupt: 

In [15]:
result

{'question': 'Make a summary about the BClean using simple words, show as a solution',
 'answer': 'The article discusses the challenges of existing data cleaning methods, particularly those based on probabilistic graphical models (PGMs), and proposes a novel approach called Bayesian Learning for Dataset Error (BBLE) that addresses three key challenges faced by most existing data cleaning methods: high dependence on accurate domain knowledge from experts, inefficiency of inferencing using Bayesian networks, and the presence of errors in the constructed Bayesian network originating from noisy data. The BBLE method combines three novel techniques: uncertain Bayesian networks (UCBNs), approximate Bayesian learning (APBL), and probabilistic uncertainty propagation (PUEPP). These techniques enable efficient Bayesian network construction and fast inferencing while maintaining high accuracy during both network construction and network inferencing processes. The proposed BBLE method applies UCB

In [16]:
context = f"Question: {result['question']} | Answer: {result['answer']}"
result = qa_chain2({"context": context , "question": "Make a longer summary"})

The study by Ma et al. (2023) proposed a novel approach, called BClean, for correcting errors in relational databases using Bayesian networks. The proposed method integrates structure learning and parameter estimation from observed data to construct a Bayesian network that accurately models the probabilistic relationships among attributes, and then infers the most probable values for each cell based on Bayesian inference. The proposed approach aims to address three key challenges of existing Bayesian methods: (1) high dependence on accurate domain knowledge, (2) inefficiency of inference using Bayesian networks, and (3) presence of errors in constructed Bayesian networks originating from noisy data.
The study conducted empirical evaluations on four real-world datasets: Hospital, Soccer, Flight, and Customer, and compared the proposed approach with several state-of-the-art methods, such as BClean_FD, PCClean, HoloClean, and CNLP. The results showed that BClean achieved competitive clean

In [17]:
result

{'context': 'Question: Make a summary about the BClean using simple words, show as a solution | Answer: The article discusses the challenges of existing data cleaning methods, particularly those based on probabilistic graphical models (PGMs), and proposes a novel approach called Bayesian Learning for Dataset Error (BBLE) that addresses three key challenges faced by most existing data cleaning methods: high dependence on accurate domain knowledge from experts, inefficiency of inferencing using Bayesian networks, and the presence of errors in the constructed Bayesian network originating from noisy data. The BBLE method combines three novel techniques: uncertain Bayesian networks (UCBNs), approximate Bayesian learning (APBL), and probabilistic uncertainty propagation (PUEPP). These techniques enable efficient Bayesian network construction and fast inferencing while maintaining high accuracy during both network construction and network inferencing processes. The proposed BBLE method applie