<a href="https://colab.research.google.com/github/RMoulla/PBD_Dexia/blob/main/TP_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: Retrieval-Augmented Generation (RAG)

In this lab, we build a retrieval-augmented generation (RAG) pipeline. It works as a kind of search engine: you query a database in natural language, the most relevant passages are retrieved, and a language model uses them to return information contained in a document.
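The retrieve-then-generate flow can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not the pipeline used in the rest of this notebook: passages are scored by word overlap rather than embeddings, and the generation step is reduced to building the prompt that would be sent to a language model. The names `tokenize`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, top_k=2):
    """Return the top_k passages sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_passages):
    """Assemble the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

passages = [
    "BART is a denoising autoencoder for pretraining sequence-to-sequence models.",
    "The weather in Paris is mild in spring.",
    "BART is particularly effective when fine tuned for text generation.",
]
question = "What is BART effective for?"
top = retrieve(question, passages, top_k=2)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline, the overlap score would be replaced by embedding similarity and the prompt would be passed to a language model.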

In [None]:
!pip install llmsherpa



In [None]:
!pip install h5py
!pip install typing-extensions
!pip install wheel
!pip install llama_index



In [None]:
!pip install sentence_transformers



In [None]:
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

In [None]:
from llama_index.readers.schema.base import Document
from llama_index import VectorStoreIndex
from IPython.display import display, HTML
import openai

# Replace the placeholder with your own OpenAI API key.
openai.api_key = 'your_openai_api_key_here'
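Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One common alternative is to read it from an environment variable. The sketch below assumes the key is stored under `OPENAI_API_KEY`; the helper `load_api_key` and the `DEMO_API_KEY` variable are hypothetical and used only for illustration.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise if missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key

# Demonstration with a dummy value under a hypothetical variable name.
os.environ["DEMO_API_KEY"] = "sk-dummy"
demo_key = load_api_key("DEMO_API_KEY")
print(demo_key[:3] + "...")
```

With the real variable set, the assignment in the notebook would become `openai.api_key = load_api_key()`.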

In [None]:
index = VectorStoreIndex([])
for chunk in doc.chunks():
    index.insert(Document(text=chunk.to_context_text(), extra_info={}))



In [None]:
doc.chunks()[1].to_text()

'We present BART, a denoising autoencoder for pretraining sequence-to-sequence models.\nBART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.\nIt uses a standard Tranformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes.\nWe evaluate a number of noising approaches, ﬁnding the best performance by both randomly shufﬂing the order of the original sentences and using a novel in-ﬁlling scheme, where spans of text are replaced with a single mask token.\nBART is particularly effective when ﬁne tuned for text generation but also works well for comprehension tasks.\nIt matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new stateof-the-art results on a range of abstractive dialogue, questio

In [None]:
HTML(doc.tables()[5].to_html())

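| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT | 84.1/90.9 | 79.0/81.8 | 86.6/- | 93.2 | 91.3 | 92.3 | 90.0 | 70.4 | 88.0 | 60.6 |
| UniLM | -/- | 80.5/83.4 | 87.0/85.9 | 94.5 | - | 92.7 | - | 70.9 | - | 61.1 |
| XLNet | 89.0/94.5 | 86.1/88.8 | 89.8/- | 95.6 | 91.8 | 93.9 | 91.8 | 83.8 | 89.2 | 63.6 |
| RoBERTa | 88.9/94.6 | 86.5/89.4 | 90.2/90.2 | 96.4 | 92.2 | 94.7 | 92.4 | 86.6 | 90.9 | 68.0 |
| BART | 88.8/94.6 | 86.1/89.2 | 89.9/90.1 | 96.6 | 92.5 | 94.9 | 91.2 | 87.0 | 90.4 | 62.8 |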


In [None]:
from sentence_transformers import SentenceTransformer, util
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")


In [None]:
# Collect the plain-text version of every table in the document.
text_list = [table.to_text() for table in doc.tables()]

In [None]:
# Embed each table's text with the sentence-transformers model.
embed_list = [sentence_model.encode(text) for text in text_list]

## Running queries

In [None]:
# Embed the query with the same model used for the tables.
embed_quest = sentence_model.encode('masked language model squad 1.1')
score_list = []
for embed in embed_list:
  cos_score = util.pytorch_cos_sim(embed, embed_quest).item()  # 1x1 tensor -> float
  score_list.append(cos_score)
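The score used above is the cosine similarity between two embedding vectors: their dot product divided by the product of their norms, which gives 1 for vectors pointing in the same direction and 0 for orthogonal ones. For reference, here is a minimal pure-Python sketch of that formula. The function `cosine_similarity` is illustrative and is not part of `sentence_transformers`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, not magnitude, a long table and a short query can still be compared on equal footing.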

In [None]:
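# Rank table indices from most to least similar to the query.
sorted_list = sorted(range(len(score_list)), key=lambda k: score_list[k], reverse=True)
for table_idx in sorted_list[:10]:
  # A bare HTML(...) renders nothing inside a loop; display() is required.
  display(HTML(doc.tables()[table_idx].to_html()))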
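Sorting every index and slicing the first ten works well for a handful of tables. When only the top few results are needed, `heapq.nlargest` from the standard library returns the same ordering without sorting the whole list. A minimal sketch with made-up scores (the helper `top_k_indices` is hypothetical):

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

scores = [0.12, 0.87, 0.45, 0.91, 0.30]  # made-up similarity scores
best = top_k_indices(scores, 3)
print(best)
```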

In [None]:
# Index only the 10 most relevant tables, then query them with the LLM.
index = VectorStoreIndex([])
for table_idx in sorted_list[:10]:
    chunk = doc.tables()[table_idx].to_html()
    index.insert(Document(text=chunk, extra_info={}))

query_engine = index.as_query_engine()
response = query_engine.query("What is the performance of the masked language model on the SQuAD 1.1 task?")
print(response)

The performance of the masked language model in the SQuAD 1.1 task is 90.0.


## Querying text

In [None]:
# Same procedure as for the tables, applied to the text chunks.
text_list = [chunk.to_text() for chunk in doc.chunks()]
embed_list = [sentence_model.encode(text) for text in text_list]

In [None]:
embed_quest = sentence_model.encode('perform similar bart')
score_list = []
for embed in embed_list:
  cos_score = util.pytorch_cos_sim(embed, embed_quest).item()  # 1x1 tensor -> float
  score_list.append(cos_score)
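The table and text sections repeat the same three steps: embed, score, sort. One way to factor them out is a helper that takes the encoder as a parameter. The sketch below is an assumption about how this could be organized, not code from the original notebook: `rank_by_similarity`, `cosine`, and `toy_encode` are hypothetical names, and the toy encoder only counts a few vocabulary words so that the example runs without downloading a model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_similarity(query, texts, encode, top_k=10):
    """Return (index, score) pairs for the top_k texts most similar to query."""
    query_vec = encode(query)
    scored = [(i, cosine(encode(t), query_vec)) for i, t in enumerate(texts)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

VOCAB = ["bart", "roberta", "squad", "weather"]

def toy_encode(text):
    """Toy stand-in encoder: count occurrences of each vocabulary word."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

texts = [
    "weather report for today",
    "bart matches roberta on squad",
    "roberta results on squad",
]
ranking = rank_by_similarity("bart squad", texts, toy_encode, top_k=2)
print(ranking)
```

In the notebook, `sentence_model.encode` could presumably be passed in place of `toy_encode`, with `doc.tables()` or `doc.chunks()` supplying the texts.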

In [None]:
# Rank chunk indices from most to least similar to the query.
sorted_list = sorted(range(len(score_list)), key=lambda k: score_list[k], reverse=True)
# Index only the 10 most relevant text chunks, then query them with the LLM.
index = VectorStoreIndex([])
for chunk_idx in sorted_list[:10]:
    chunk = doc.chunks()[chunk_idx].to_html()
    index.insert(Document(text=chunk, extra_info={}))

query_engine = index.as_query_engine()
response = query_engine.query("What are the models that perform similarly to BART?")
print(response)

RoBERTa is a model that performs similarly to BART.
