
## Retrieval Augmented Generation (RAG)

RAG is a technique for augmenting the knowledge of LLMs with additional data, often private or real-time.

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to be able to store documents, only to return (retrieve) them. Vector stores can serve as the backbone of a retriever, but other types of retrievers exist as well.
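To make the "query in, documents out" contract concrete, here is a minimal sketch of a retriever that involves no vector store at all. The names `Retriever` and `KeywordRetriever` are hypothetical and are not LangChain classes; word overlap is used only as a toy ranking rule.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Anything that maps a free-text query to a list of documents."""

    def retrieve(self, query: str) -> List[str]:
        ...


class KeywordRetriever:
    """Hypothetical retriever that needs no vector store: it ranks
    documents by how many query words they contain."""

    def __init__(self, documents: List[str], k: int = 2) -> None:
        self.documents = documents
        self.k = k

    def retrieve(self, query: str) -> List[str]:
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(doc.lower().split())), doc)
            for doc in self.documents
        ]
        # Keep only documents sharing at least one word, best first.
        matches = [(score, doc) for score, doc in scored if score > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in matches[: self.k]]


corpus = [
    "agents use a language model as a reasoning engine",
    "memory stores information between calls",
    "chatbots talk to users in natural language",
]
keyword_retriever = KeywordRetriever(corpus, k=1)
print(keyword_retriever.retrieve("what is a reasoning engine"))
```

Any object with the same `retrieve` method would satisfy the interface, whether it is backed by a vector store, a search engine, or a plain list.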

## 0. Loading the API key

In [None]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

## 1. Basic retrieval strategies

### 1.1. Simple similarity retrieval

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

In [None]:
with open("langchain.txt", "r", encoding="utf-8") as f:
    source_text = f.read()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_text(source_text)

In [None]:
embeddings = OpenAIEmbeddings()
db = Chroma.from_texts(texts, embeddings)
retriever = db.as_retriever()

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
retrieved_docs = retriever.invoke(
"Qu'est-ce qu'un agent"
)
retrieved_docs[0].page_content

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


"• Agents : Les agents permettent d'utiliser des modèles de langage comme moteur de raisonnement en prenant des décisions et en effectuant des actions en fonction des instructions et des données d'entrée. Ils peuvent être utilisés pour créer des assistants virtuels ou des chatbots interactifs."

In [None]:
retrieved_docs

[Document(page_content="• Agents : Les agents permettent d'utiliser des modèles de langage comme moteur de raisonnement en prenant des décisions et en effectuant des actions en fonction des instructions et des données d'entrée. Ils peuvent être utilisés pour créer des assistants virtuels ou des chatbots interactifs."),
 Document(page_content="• Agents : Les agents permettent d'utiliser des modèles de langage comme moteur de raisonnement en prenant des décisions et en effectuant des actions en fonction des instructions et des données d'entrée. Ils peuvent être utilisés pour créer des assistants virtuels ou des chatbots interactifs."),
 Document(page_content="• Agents : Les agents permettent d'utiliser des modèles de langage comme moteur de raisonnement en prenant des décisions et en effectuant des actions en fonction des instructions et des données d'entrée. Ils peuvent être utilisés pour créer des assistants virtuels ou des chatbots interactifs."),
 Document(page_content="• Assistants 

### 1.2. Maximal Marginal Relevance retrieval (MMR)

The MMR criterion penalizes redundant information: it favors documents that are relevant to the query while being dissimilar to the documents already selected.
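The selection rule can be sketched in a few lines. This is an illustrative greedy implementation on toy 2-D vectors, not the code LangChain or Chroma runs internally; `mmr_select` and `lambda_mult=0.5` are assumptions made for the example. Each candidate is scored as `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is its highest similarity to an already selected document.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already chosen (0.0 at the start).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_query = [1.0, 0.0]
# Documents 0 and 1 are exact duplicates; document 2 is slightly less relevant.
toy_vectors = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.2]]
print(mmr_select(toy_query, toy_vectors, k=2))
```

A plain similarity ranking would return the two duplicates (indices 0 and 1); here the second pick is index 2, because the duplicate is penalized for being identical to the first pick.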

In [None]:
retriever = db.as_retriever(search_type="mmr")

In [None]:
docs = retriever.get_relevant_documents(
    "Qu'est-ce qu'un agent ?")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:langchain_community.utils.math:Unable to import simsimd, defaulting to NumPy implementation. If you want to use simsimd please install with `pip install simsimd`.


In [None]:
len(docs)

4

In [None]:
docs

[Document(page_content="• Agents : Les agents permettent d'utiliser des modèles de langage comme moteur de raisonnement en prenant des décisions et en effectuant des actions en fonction des instructions et des données d'entrée. Ils peuvent être utilisés pour créer des assistants virtuels ou des chatbots interactifs."),
 Document(page_content="• Assistants personnels : Les applications d'assistants personnels utilisent LangChain pour prendre des actions, se souvenir des interactions et avoir des connaissances sur les données de l'utilisateur. Ils peuvent répondre aux questions, effectuer des tâches spécifiques et fournir des recommandations personnalisées."),
 Document(page_content="• Mémoire : La mémoire permet de stocker des informations entre les appels d'une chaîne ou d'un agent. Elle peut être utilisée pour conserver des états, des variables ou des résultats intermédiaires lors de l'exécution d'une séquence d'appels.\n• Indexes : Les indexes permettent de combiner des modèles de la

### 1.3. Similarity score threshold

In [None]:
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold" : 0.6})

In [None]:
docs = retriever.get_relevant_documents(
    "Qu'est-ce qu'un agent")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
for doc in docs:
    print(doc.page_content)

• Agents : Les agents permettent d'utiliser des modèles de langage comme moteur de raisonnement en prenant des décisions et en effectuant des actions en fonction des instructions et des données d'entrée. Ils peuvent être utilisés pour créer des assistants virtuels ou des chatbots interactifs.
• Agents : Les agents permettent d'utiliser des modèles de langage comme moteur de raisonnement en prenant des décisions et en effectuant des actions en fonction des instructions et des données d'entrée. Ils peuvent être utilisés pour créer des assistants virtuels ou des chatbots interactifs.
• Agents : Les agents permettent d'utiliser des modèles de langage comme moteur de raisonnement en prenant des décisions et en effectuant des actions en fonction des instructions et des données d'entrée. Ils peuvent être utilisés pour créer des assistants virtuels ou des chatbots interactifs.
• Assistants personnels : Les applications d'assistants personnels utilisent LangChain pour prendre des actions, se so

### 1.4. Specifying top k

In [None]:
retriever = db.as_retriever(search_kwargs={"k" : 2})

In [None]:
docs = retriever.get_relevant_documents(
    "Quels sont les cas d'utilisation de LangChain ?")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
len(docs)

2

In [None]:
for doc in docs:
    print(doc.page_content)

• Réponse aux questions : LangChain peut être utilisé pour répondre aux questions en utilisant des documents spécifiques. Il peut extraire les informations pertinentes des documents et générer des réponses précises.
• Chatbots : Les chatbots utilisent LangChain pour interagir avec les utilisateurs en langage naturel. Ils peuvent comprendre les questions, fournir des réponses pertinentes et mener des conversations fluides.
• Réponse aux questions : LangChain peut être utilisé pour répondre aux questions en utilisant des documents spécifiques. Il peut extraire les informations pertinentes des documents et générer des réponses précises.
• Chatbots : Les chatbots utilisent LangChain pour interagir avec les utilisateurs en langage naturel. Ils peuvent comprendre les questions, fournir des réponses pertinentes et mener des conversations fluides.


## 2. Advanced retrieval strategies

Basic strategies have limitations:

• Scalability: as the amount of indexed data grows, retrieval and generation can become computationally expensive and slow, and the system may struggle to handle large volumes of data efficiently.

• Lack of specificity: although embeddings capture the semantics of the content, there is an inherent challenge. As documents grow in size and complexity, representing their multifaceted nature with a single embedding vector can lead to a loss of specificity.

• Simple distance-based retrieval can be imprecise when queries are slightly reworded or when the embeddings do not capture the semantics of the data well.

Several advanced RAG strategies have been developed to overcome these limitations.

### 2.1. Addressing specificity: working with metadata

To address the lack of specificity, many vector stores support operations on `metadata`.
`metadata` provides context for each embedding.
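The effect of a metadata filter can be illustrated without any vector store: restrict the candidates to those whose metadata matches, then rank what remains. In this sketch `filtered_search`, `word_overlap` and the file names `a.pdf` / `b.pdf` are hypothetical, and word overlap merely stands in for embedding similarity.

```python
from typing import Dict, List


def word_overlap(query: str, text: str) -> int:
    """Toy relevance score standing in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def filtered_search(
    records: List[Dict], query: str, where: Dict, k: int = 3
) -> List[Dict]:
    """Keep records whose metadata matches every key in `where`, then rank."""
    allowed = [
        r
        for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    allowed.sort(key=lambda r: word_overlap(query, r["text"]), reverse=True)
    return allowed[:k]


toy_records = [
    {"text": "core principles and key actions", "metadata": {"source": "a.pdf", "page": 7}},
    {"text": "core principles of trade", "metadata": {"source": "b.pdf", "page": 5}},
    {"text": "table of contents", "metadata": {"source": "a.pdf", "page": 1}},
]
hits = filtered_search(toy_records, "core principles", where={"source": "a.pdf"})
print([h["metadata"] for h in hits])
```

The chunk from `b.pdf` is a good textual match but is excluded before ranking, which is the behavior the metadata filter is meant to provide.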

In [None]:
### PyPDFDirectoryLoader

from langchain.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("document_loaders/pdfs/")

docs = loader.load()

In [None]:
# Transform documents

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    add_start_index=True,
)

texts = text_splitter.split_documents(docs)

In [None]:
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings)
retriever = db.as_retriever()

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
question = "what are the core principles and key commitments of the chinese academy of science"

In [None]:
docs = retriever.get_relevant_documents(question)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
for d in docs:
    print(d.page_content)

5. Core Principles and Key Actions 
The experts identified several principles with a broad range of key actions in their responses. The suggestions from experts advised to ensure inclusivity, quality, adoption, development and innovation in the development and utility of digital public goods for SDG indicators. Principle overlaps between different experts broadly suggested to ensure that the information should be publicly accessible to promote awareness, knowledge, and research and to encourage innovative solutions to global and regional 
8 Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management 
and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
5. Core Principles and Key Actions 
The experts identified several principles with a broad range of key actions in their responses. The suggestions from experts advised to ensure inclusivity, quality, adoption, development and innovation in the development a

As we can see, the simple retriever is not specific. It pulled information from the International Chamber of Commerce document as well, not only from the Chinese Academy of Sciences one. To be specific, we can work with metadata.

In [None]:
docs = db.similarity_search(
    question,
    k=3,
    filter={"source":"document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf"}
)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
for d in docs:
    print(d.metadata)

{'page': 7, 'source': 'document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf', 'start_index': 2792}
{'page': 7, 'source': 'document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf', 'start_index': 2792}
{'page': 8, 'source': 'document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf', 'start_index': 0}


As we can see, page 5 no longer appears in the results because it does not come from the specified source.

In [None]:
for d in docs:
    print(d.page_content)

5. Core Principles and Key Actions 
The experts identified several principles with a broad range of key actions in their responses. The suggestions from experts advised to ensure inclusivity, quality, adoption, development and innovation in the development and utility of digital public goods for SDG indicators. Principle overlaps between different experts broadly suggested to ensure that the information should be publicly accessible to promote awareness, knowledge, and research and to encourage innovative solutions to global and regional 
8 Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management 
and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
7challenges. Secondly, improvement of global scientific and technical capabilities through an open science 
approach was extensively suggested by different experts in several of the responses received. Table 1 provides a summarized list of Core Principles wi

### 2.2. Addressing specificity: Self-Query Retriever

In general, we want to infer the metadata filter from the query itself.

To do so, we use `SelfQueryRetriever`, which uses an LLM to extract:

1. The `query` string to use in the vector search
2. A metadata filter to pass along with it

Many vector databases support metadata filters.
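The two extracted pieces can be pictured as a small structured object. The sketch below replaces the LLM with a hand-written rule, purely to show the shape of the output; `StructuredQuery`, `toy_self_query` and the `KNOWN_SOURCES` mapping are hypothetical names, not part of LangChain.

```python
import re
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StructuredQuery:
    query: str  # text sent to the vector search
    filter: Dict[str, str] = field(default_factory=dict)  # metadata constraints


# Hypothetical mapping from a phrase in the question to a `source` value.
KNOWN_SOURCES = {
    "chinese academy of science": "220929_Chinese_Academy_of_Sciences.pdf",
}


def toy_self_query(question: str) -> StructuredQuery:
    """Rule-based stand-in for the LLM: extract a source filter if one is named."""
    text = question.lower()
    for phrase, source in KNOWN_SOURCES.items():
        if phrase in text:
            # Drop the phrase (and a leading "of the" / "from the") from the query.
            pattern = r"\s*(?:of |from |in )?(?:the )?" + re.escape(phrase)
            remaining = re.sub(pattern, "", text).strip()
            return StructuredQuery(query=remaining, filter={"source": source})
    return StructuredQuery(query=text)


structured = toy_self_query(
    "what are the core principles and key commitments of the chinese academy of science"
)
print(structured.query)
print(structured.filter)
```

A real self-query retriever delegates this decomposition to the LLM, guided by the attribute descriptions it is given, and then runs the vector search with the resulting filter.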

In [None]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [None]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The submission the chunk is from, should be one of `document_loaders\\pdfs\\220928_Action_Coalition_on_Innovation_and_Technology_for_Gender_Equality.pdf`, `document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf`, `document_loaders\\pdfs\\220929_Int_Federation_of_Library_Associations_and_Institutions.pdf` or `document_loaders\\pdfs\\221010_Global_Partners_Digital_input_to_GDC.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the submission",
        type="integer",
    ),
]

In [None]:
!pip install lark  # Python parsing library






In [None]:
document_content_description = "Submissions"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    db,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [None]:
docs = retriever.get_relevant_documents("what are the core principles and key commitments of the chinese academy of science")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:langchain.retrievers.self_query.base:Generated Query: query='core principles and key commitments' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf') limit=None
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
docs

[Document(page_content='5. Core Principles and Key Actions \nThe experts identified several principles with a broad range of key actions in their responses. The suggestions from experts advised to ensure inclusivity, quality, adoption, development and innovation in the development and utility of digital public goods for SDG indicators. Principle overlaps between different experts broadly suggested to ensure that the information should be publicly accessible to promote awareness, knowledge, and research and to encourage innovative solutions to global and regional \n8 Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management \nand stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18', metadata={'page': 7, 'source': 'document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf', 'start_index': 2792}),
 Document(page_content='5. Core Principles and Key Actions \nThe experts identified several principles with a

In [None]:
for d in docs:
    print(d.metadata)

{'page': 7, 'source': 'document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf', 'start_index': 2792}
{'page': 7, 'source': 'document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf', 'start_index': 2792}
{'page': 8, 'source': 'document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf', 'start_index': 0}
{'page': 8, 'source': 'document_loaders\\pdfs\\220929_Chinese_Academy_of_Sciences.pdf', 'start_index': 0}


### 2.3. Multi-query retriever

Multi-query retrieval uses an LLM to generate several queries from a single user query.

This approach overcomes some limitations of similarity-based search by providing a richer set of results: each generated query is run against the retriever, and the unique union of the retrieved documents is returned.
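The control flow is simple enough to sketch with stand-ins. Here `fake_rewriter` plays the role of the LLM and `FAKE_INDEX` the role of the vector store; both are hypothetical and exist only to show how the results of several queries are merged.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Run every generated query and return the de-duplicated union of results."""
    merged_docs: List[str] = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in merged_docs:
                merged_docs.append(doc)
    return merged_docs


def fake_rewriter(question: str) -> List[str]:
    """Fixed rewrites instead of LLM-generated ones."""
    return [question, question + " (rephrased)", question + " (overview)"]


# Lookup table instead of a vector store: query -> retrieved documents.
FAKE_INDEX: Dict[str, List[str]] = {
    "q": ["doc A", "doc B"],
    "q (rephrased)": ["doc B", "doc C"],
    "q (overview)": ["doc A", "doc D"],
}

merged = multi_query_retrieve("q", fake_rewriter, lambda query: FAKE_INDEX.get(query, []))
print(merged)
```

Each rewrite surfaces documents the others miss, and duplicates are dropped, which is why the number of unique documents can exceed the retriever's usual `k`.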

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

question = "What are the core principles and key actions of participants ?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=db.as_retriever(), llm=llm
)

In [None]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [None]:
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:langchain.retrievers.multi_query:Generated queries: ['1. What are the fundamental principles and essential steps followed by participants?', '2. Can you explain the main principles and significant actions taken by participants?', '3. Could you provide an overview of the core principles and key activities carried out by participants?']
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


5

In [None]:
unique_docs[3].page_content

'96.1. Organizational and Co-ordinational Challenges\nA community supported open-source development approach is quite an attractive proposition as it will \nprovide the necessary bottom-up approach suggested in Key Actions 2.1 (Core Principle 2) and help to encourage highly talented data scientists, programmers, and digital technology experts to voluntarily participate in devising novels solutions in-line with Core Principle 3. However, it will be a highly challenging undertaking to organize and streamline these efforts. In the absence of a designated team, open-source development environment will lead to competing methods and products which will require efficient quality assurance and control mechanisms (QA/QC) (Core Principle 5). Similarly, a community-based development approach will impact efficiency in the development process, particularly if voluntary members are involved alongside research and academic institutions, who may not be able to deliver on desired targets and deadlines.

### 2.4. Contextual compression

Another way to improve the quality of the retrieved documents is compression.
The information most relevant to a query may be buried in a document that contains a lot of irrelevant text.
Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is designed to address this problem.
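As a rough illustration of the idea, the sketch below keeps only the sentences that share a content word with the query and drops documents left empty. This keyword rule is a deliberate simplification standing in for an LLM-based extractor; `compress`, `compress_documents` and the stop-word list are assumptions made for the example.

```python
import re
from typing import List

# Small, hypothetical stop-word list so that words like "the" do not count as matches.
STOP_WORDS = {"the", "a", "an", "of", "and", "are", "what", "is", "to", "in"}


def compress(document: str, query: str) -> str:
    """Keep only the sentences that share a content word with the query."""
    query_words = set(re.findall(r"\w+", query.lower())) - STOP_WORDS
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if query_words & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)


def compress_documents(documents: List[str], query: str) -> List[str]:
    """Compress each document and drop the ones left empty."""
    compressed = (compress(doc, query) for doc in documents)
    return [text for text in compressed if text]


long_doc = (
    "The report was drafted in 2022. "
    "Experts identified several core principles. "
    "Lunch was served at noon."
)
print(compress_documents([long_doc, "Nothing relevant here."], "What are the core principles?"))
```

Only the middle sentence of the first document survives, and the second document is dropped entirely, so the downstream LLM call receives far less text.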

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [None]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [None]:
# LLM-based compressor that extracts only the passages relevant to the query
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [None]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=db.as_retriever()
)

In [None]:
question = "What are the core principles and key actions of participants ?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


Document 1:

5. Core Principles and Key Actions 
The experts identified several principles with a broad range of key actions in their responses.
----------------------------------------------------------------------------------------------------
Document 2:

5. Core Principles and Key Actions 
The experts identified several principles with a broad range of key actions in their responses.
----------------------------------------------------------------------------------------------------
Document 3:

5) Published voluntary codes of conduct can help in this respect, providing a reference point and allowing for accountability. 6) There needs to be a balance between free speech and action. 7) The interests of researchers, today and tomorrow, should be remembered.
----------------------------------------------------------------------------------------------------
Document 4:

5) Published voluntary codes of conduct can help in this respect, providing a reference point and allowing for accou

### 2.5. Multi-vector retrievers

It is often beneficial to store several embeddings for a single document. Multi-vector retrievers make it possible to search when a document has several embeddings.

#### 2.5.1. ParentDocument Retriever

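The idea is to split a document into smaller sub-documents and embed those. At query time, the search runs on the small chunks, but the retriever returns the larger parent document each matching chunk came from.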
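The child-to-parent bookkeeping can be sketched as follows. The helpers `split`, `build_index` and `retrieve_parents` are hypothetical, the splitter is a naive fixed-size one, and substring matching stands in for embedding search.

```python
from typing import List, Tuple


def split(text: str, size: int) -> List[str]:
    """Naive fixed-size character splitter (no overlap)."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def build_index(parents: List[str], child_size: int) -> List[Tuple[str, int]]:
    """Index small child chunks, each remembering the position of its parent."""
    return [
        (child, parent_id)
        for parent_id, parent in enumerate(parents)
        for child in split(parent, child_size)
    ]


def retrieve_parents(
    query: str, index: List[Tuple[str, int]], parents: List[str], k: int = 1
) -> List[str]:
    """Match on children, but return the de-duplicated parents."""
    parent_ids: List[int] = []
    for child, parent_id in index:
        if query.lower() in child.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[i] for i in parent_ids[:k]]


toy_parents = [
    "Agents decide which action to take. They call tools and observe results.",
    "Memory keeps state between calls. It can store intermediate results.",
]
toy_index = build_index(toy_parents, child_size=40)
print(retrieve_parents("tools", toy_index, toy_parents))
```

The match happens on a 40-character chunk, yet the caller receives the full parent text, which gives precise matching together with enough surrounding context.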

In [None]:
from langchain.retrievers import ParentDocumentRetriever

In [None]:
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

In [None]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()

In [None]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [None]:
retriever.add_documents(docs)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
len(list(store.yield_keys()))

4

In [None]:
sub_docs = vectorstore.similarity_search("Chinese Academy")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
print(sub_docs[0].page_content)

voluntarily participate in devising novels solutions in-line with Core Principle 3. However, it will be a highly challenging undertaking to organize and streamline these efforts. In the absence of a designated team, open-source development environment will lead to competing methods and products which will require efficient quality assurance and control mechanisms (QA/QC) (Core Principle 5).


In [None]:
retrieved_docs = retriever.get_relevant_documents("Chinese Academy")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
len(retrieved_docs[0].page_content)

997

In [None]:
print(retrieved_docs[0].page_content)

7challenges. Secondly, improvement of global scientific and technical capabilities through an open science 
approach was extensively suggested by different experts in several of the responses received. Table 1 provides a summarized list of Core Principles with associated Key Actions to facilitate and guide the development process of proposed digital public goods for SDG indicators. 
Table 1: Core Principles and Key Actions suggested by experts for developing digital public goods for 
SDG Indicators
Core Principles Key Actions
1Universality of science1.1 Promote open science, open data, and open knowledge
1.2Encourage scientific partnerships, collaborations, and cooperation and 
multi-disciplinary stakeholder engagements
2The digital public goods should be scalable 2.1Development process of digital public goods for SDG indicators should incorporate multi-stakeholder engagement and a mechanism for a bottom to top approach


#### 2.5.2. Summary Retriever

Create a summary for each document and embed it along with (or instead of) the document.

In [None]:
from langchain.retrievers.multi_vector import MultiVectorRetriever

In [None]:
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.storage import InMemoryByteStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

In [None]:
import uuid

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser

In [None]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

In [None]:
summaries = chain.batch(docs, {"max_concurrency": 5})

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [None]:
# The vectorstore used to index the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

In [None]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
sub_docs = vectorstore.similarity_search("Chinese Academy of Science core principles and key actions")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
sub_docs[3].page_content

'The document discusses the challenges and suggested improvements for developing digital public goods for SDG indicators. One of the main challenges identified is the need to improve global scientific and technical capabilities through an open science approach. The document also provides a list of core principles and key actions suggested by experts for developing these digital public goods, including promoting open science, open data, and open knowledge, encouraging scientific partnerships and collaborations, and incorporating multi-stakeholder engagement in the development process.'

#### 2.5.3. Hypothetical Questions

Generate suitable hypothetical questions for each document and embed them along with (or instead of) the document.
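What ties the questions back to their document is a shared identifier stored in each question's metadata. The sketch below shows that many-to-one link with hand-written questions instead of LLM-generated ones; `originals`, `question_index` and `lookup` are hypothetical names, and word overlap stands in for embedding search.

```python
import uuid
from typing import Dict, List

# Original documents, stored under generated ids (the "docstore").
originals = ["Text about open science principles.", "Text about training programs."]
original_ids = [str(uuid.uuid4()) for _ in originals]
docstore: Dict[str, str] = dict(zip(original_ids, originals))

# Hypothetical questions, written by hand here instead of generated by an LLM.
toy_questions = [
    ["What are the core principles of open science?", "Why share data openly?"],
    ["What does the training program cover?"],
]

# Each indexed entry is a question plus the id of the document it points to.
question_index = [
    {"text": q, "doc_id": original_ids[i]}
    for i, question_list in enumerate(toy_questions)
    for q in question_list
]


def lookup(query: str) -> List[str]:
    """Match against the questions, then return the linked originals once each."""
    words = set(query.lower().split())
    doc_ids: List[str] = []
    for entry in question_index:
        overlap = words & set(entry["text"].lower().split())
        if overlap and entry["doc_id"] not in doc_ids:
            doc_ids.append(entry["doc_id"])
    return [docstore[doc_id] for doc_id in doc_ids]


print(lookup("training program"))
```

Several questions may point to the same document, so the identifiers are de-duplicated before the originals are fetched from the docstore.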

In [None]:
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

In [None]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

In [None]:
chain.invoke(docs[0])

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


['What core principles and key actions did the experts identify for the development and utility of digital public goods for SDG indicators?',
 'How can publicly accessible information promote awareness, knowledge, and research according to the experts?',
 'In what ways does the FAIR Guiding Principles support scientific data management and stewardship?']

In [None]:
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


In [None]:
# The vectorstore used to index the hypothetical questions
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )

In [None]:
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
hypothetical_questions

[['What are the core principles suggested by experts for the development and utility of digital public goods for SDG indicators?',
  'How does the public accessibility of information promote awareness, knowledge, and research in the context of digital public goods for SDG indicators?',
  'What are the key actions suggested by experts to encourage innovative solutions to global and regional challenges?'],
 ['What are the suggested core principles and key actions for developing digital public goods for SDG indicators?',
  'How can the universality of science be promoted and encouraged in the development of digital public goods for SDG indicators?',
  'What considerations should be made to ensure the scalability of digital public goods for SDG indicators?'],
 ['What kind of support can digital public goods approved at Lv 2 receive?',
  'What is the purpose of the community developed user-oriented training program?',
  'How does the training program support Core Principle 6?'],
 ['What cou

In [None]:
sub_docs = vectorstore.similarity_search("Chinese Academy")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
sub_docs

[Document(page_content='How can the universality of science be promoted and encouraged in the development of digital public goods for SDG indicators?', metadata={'doc_id': '326c9112-ad0e-4639-8ce0-a0904773ca43'}),
 Document(page_content='What are the key actions suggested by experts to encourage innovative solutions to global and regional challenges?', metadata={'doc_id': '4f21f737-3b11-4cf4-9812-3b399b046186'}),
 Document(page_content='What are the core principles suggested by experts for the development and utility of digital public goods for SDG indicators?', metadata={'doc_id': '4f21f737-3b11-4cf4-9812-3b399b046186'}),
 Document(page_content='What impact can a community-based development approach have on efficiency, especially when involving voluntary members and research or academic institutions?', metadata={'doc_id': '7f585b8a-7c64-40ab-ac8f-e639ffc21e66'})]
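
In the output above, two of the four matching questions share the same `doc_id`. The sketch below illustrates how such hits could be mapped back to parent chunks, assuming the retriever keeps each id once, in hit order, and then looks the ids up in the store. Plain dicts stand in for `Document` and the byte store, and `resolve_parents` is a hypothetical helper, not a LangChain function.

```python
# Hypothetical stand-ins: plain dicts replace Document and the byte store
# so the sketch runs without LangChain installed.
def resolve_parents(sub_docs, docstore, id_key="doc_id"):
    """Map child hits to their parent documents, keeping first-hit order."""
    ordered_ids = []
    for sub_doc in sub_docs:
        parent_id = sub_doc["metadata"].get(id_key)
        # Several questions can point at the same parent; keep each id once.
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Ids that are missing from the store are skipped.
    return [docstore[i] for i in ordered_ids if i in docstore]


sub_docs = [
    {"page_content": "question A1", "metadata": {"doc_id": "a"}},
    {"page_content": "question B1", "metadata": {"doc_id": "b"}},
    {"page_content": "question B2", "metadata": {"doc_id": "b"}},
    {"page_content": "question C1", "metadata": {"doc_id": "c"}},
]
docstore = {"a": "parent chunk A", "b": "parent chunk B", "c": "parent chunk C"}

parents = resolve_parents(sub_docs, docstore)
print(parents)
```

Four question hits collapse to three parent chunks here, which is why the retriever can return fewer documents than the vector store search does.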

In [None]:
retrieved_docs = retriever.invoke("Chinese Academy")

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [None]:
print(retrieved_docs[0].page_content)

7challenges. Secondly, improvement of global scientific and technical capabilities through an open science 
approach was extensively suggested by different experts in several of the responses received. Table 1 provides a summarized list of Core Principles with associated Key Actions to facilitate and guide the development process of proposed digital public goods for SDG indicators. 
Table 1: Core Principles and Key Actions suggested by experts for developing digital public goods for 
SDG Indicators
Core Principles Key Actions
1Universality of science1.1 Promote open science, open data, and open knowledge
1.2Encourage scientific partnerships, collaborations, and cooperation and 
multi-disciplinary stakeholder engagements
2The digital public goods should be scalable 2.1Development process of digital public goods for SDG indicators should incorporate multi-stakeholder engagement and a mechanism for a bottom to top approach


## 3. Question-Answering Chains

### 3.1 RetrievalQA Chain

In [None]:
from langchain.chains import RetrievalQA

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever()
)

In [None]:
question = "What are the core principles and key actions of the ICC BASIS Input to the Global Digital Compact ?"

In [None]:
result = qa_chain({"query": question})

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


In [None]:
result["result"]

' The core principles of the ICC BASIS Input to the Global Digital Compact are that policy and regulatory mechanisms should promote the value of the entire communications and digital services ecosystem, and that policies should be non-discriminatory, technology-neutral, and supportive of innovative digital literacy, trust, and online environments free from harassment, discrimination and violence. The key actions are to commit to preserving and strengthening the multistakeholder model, and to ensure meaningful participation of stakeholders from the global South and other typically under-represented groups in global public policymaking pertaining to the Internet. Thanks for asking!'

In [None]:
# Prompt

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
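
The two placeholders in the template are filled in at query time. As a rough sketch of that step, assuming the default "stuff" behavior of joining the retrieved chunks with blank lines into `{context}`, the following uses plain `str.format` rather than LangChain. `build_stuff_prompt` is a hypothetical helper and the template is shortened for illustration.

```python
# Shortened template with the same placeholders as QA_CHAIN_PROMPT.
template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def build_stuff_prompt(template, chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Hypothetical chunks standing in for retriever output.
chunks = ["first retrieved chunk", "second retrieved chunk"]
prompt = build_stuff_prompt(template, chunks, "What is GDC")
print(prompt)
```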

In [None]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [None]:
question = "What are the core principles and key actions of the ICC BASIS Input to the Global Digital Compact ?"

In [None]:
result = qa_chain({"query": question})

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


In [None]:
result["result"]

' The core principles of the ICC BASIS Input to the Global Digital Compact are that policy and regulatory mechanisms should promote the value of the entire communications and digital services ecosystem, and that policies should be non-discriminatory, technology-neutral, and supportive of innovative digital literacy, trust, and online environments free from harassment, discrimination and violence. The key actions are to commit to preserving and strengthening the multistakeholder model, and to ensure meaningful participation of stakeholders from the global South and other typically under-represented groups in global public policymaking pertaining to the Internet. Thanks for asking!'

In [None]:
result["source_documents"][0]

Document(page_content='August 2022   | ICC BASIS input to GDC consultation  | 1 \n \n  \n \n \n              \n \n \nICC BASIS Input to the Global Digital Compact  \n1. Connect all people to the internet, including all schools  \na) Core Principles  \n \nDelivering universal meaningful connectivity requires effective action on all three layers of the ICT \necosystem: accessible and affordable infrastructure and devices; appropriate applications and \nservices built upon the infrastructure; and user ability to use a device and understand the features \nof these applications and services. As stated in the ICC White Paper on Delivering Universal \nMeaningful Connectivity  this would require policymaking grounded on two basic principles:  \n1. Policy and regulatory mechanisms should promote the value of the entire communications \nand digital services ecosystem.  \n2. Policies should be non -d iscriminatory, technology -neutral, and supportive of innovative', metadata={'page': 0, 'source':

In [None]:
### Context optimization

Several RetrievalQA chain types can be used to optimize how the retrieved context is passed to the LLM:
- refine
- map_reduce
- map_rerank
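
A rough sketch of how the three strategies differ, under the usual descriptions: refine revises one answer chunk by chunk, map_reduce answers from each chunk separately and then merges the partial answers, and map_rerank scores one answer per chunk and keeps the best. `fake_llm` and the scoring function are hypothetical stand-ins, and the prompts are placeholders rather than the ones LangChain uses.

```python
def fake_llm(prompt):
    """Hypothetical stand-in for an LLM call: records the prompt, returns a tag."""
    fake_llm.calls.append(prompt)
    return f"answer#{len(fake_llm.calls)}"


fake_llm.calls = []


def refine(chunks, question, llm):
    # The first chunk gives an initial answer; each later chunk revises it in turn.
    answer = llm(f"Context: {chunks[0]}\nQuestion: {question}")
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {chunk}\nRefine for: {question}"
        )
    return answer


def map_reduce(chunks, question, llm):
    # Map: answer from every chunk independently. Reduce: merge the partial answers.
    partials = [llm(f"Context: {c}\nQuestion: {question}") for c in chunks]
    return llm("Partial answers:\n" + "\n".join(partials) + f"\nQuestion: {question}")


def map_rerank(chunks, question, score_fn):
    # Each chunk yields an (answer, score) pair; the highest-scoring answer wins.
    scored = [score_fn(c, question) for c in chunks]
    return max(scored, key=lambda pair: pair[1])[0]


chunks = ["c1", "c2", "c3", "c4"]

refined = refine(chunks, "q", fake_llm)
refine_calls = len(fake_llm.calls)

fake_llm.calls.clear()
merged = map_reduce(chunks, "q", fake_llm)
map_reduce_calls = len(fake_llm.calls)

scores = {"c1": 0.2, "c2": 0.9, "c3": 0.5, "c4": 0.1}  # made-up confidence scores
best = map_rerank(chunks, "q", lambda c, q: (f"answer from {c}", scores[c]))
print(refine_calls, map_reduce_calls, best)
```

In this sketch refine issues one sequential call per chunk, while the map steps are independent of each other and could run in parallel.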

In [None]:
# Refine

In [None]:
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(),
    chain_type="refine"
)
result = qa_chain_refine({"query": question})
result["result"]

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


'\n\nThe core principles of the ICC BASIS Input to the Global Digital Compact are: \n1. Policy and regulatory mechanisms should promote the value of the entire communications and digital services ecosystem. \n2. Policies should be non-discriminatory, technology-neutral, and supportive of innovative. \n3. Commit to digital literacy, trust, and online environments free from harassment, discrimination and violence.\n4. Commit to preserve and strengthen the multistakeholder model, particularly by ensuring that UN policymaking processes are more diverse, equitable, and inclusive, and that existing fora tasked with Internet governance challenges, such as the Internet Governance Forum (IGF), are further strengthened with appropriate human resources and funding. Meaningful participation of interested and informed stakeholders is essential to ensure that outcomes are both effective and accepted. It is particularly important to ensure the meaningful participation of stakeholders from the global 

In [None]:
# Map Reduce

In [None]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
result["result"]

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


" The core principles of the ICC BASIS Input to the Global Digital Compact are that policy and regulatory mechanisms should promote the value of the entire communications and digital services ecosystem, and that policies should be non-discriminatory, technology-neutral, and supportive of innovative. The key actions include digital literacy, trust, and online environments free from harassment, discrimination and violence, preserving and strengthening the multistakeholder model, meaningful participation of stakeholders from the global South and other typically under-represented groups in global public policymaking, private investments and public funding mechanisms informed by accurate information and reliable data, and reiterating member states' shared commitment to bridging both the coverage and usage gaps."

In [None]:
# Map Rerank
qa_chain_mrr = RetrievalQA.from_chain_type(
    llm,
    retriever=db.as_retriever(),
    chain_type="map_rerank"
)
result = qa_chain_mrr({"query": question})
result["result"]

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


" The core principles and key actions of the ICC BASIS Input to the Global Digital Compact include the need for private investments and public funding mechanisms to be informed by accurate information and reliable data, including coverage and usage data, satellite images, census data, and other relevant information. The Global Digital Compact should reiterate all member states' shared commitment to bridging both the coverage and usage gaps and bringing meaningful connectivity to all populations everywhere, and recognize the efforts of all stakeholders and encourage flexible approaches."

### 3.2 Conversational RetrievalQA Chain

A limitation of the RetrievalQA chain is that it does not support follow-up questions, so it is poorly suited to conversation. ConversationalRetrievalChain adds memory: it keeps the chat history and uses it to rephrase each follow-up as a standalone question before retrieval.
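
A minimal sketch of that conversational flow, assuming the follow-up is condensed into a standalone question only when there is prior history, and that the condensed question is what gets embedded for retrieval. All functions here (`condense_question`, `converse`, and the `stub_*` helpers) are hypothetical stand-ins, not LangChain APIs.

```python
def condense_question(chat_history, follow_up, llm):
    """Rewrite a follow-up into a standalone question using prior turns."""
    if not chat_history:
        # Nothing to resolve against: the question is used as-is.
        return follow_up
    transcript = "\n".join(f"{role}: {text}" for role, text in chat_history)
    return llm(
        f"Chat history:\n{transcript}\nFollow-up: {follow_up}\nStandalone question:"
    )


def converse(question, chat_history, llm, retrieve, answer):
    standalone = condense_question(chat_history, question, llm)
    docs = retrieve(standalone)        # retrieval sees the standalone question
    reply = answer(standalone, docs)
    chat_history.append(("Human", question))  # memory stores the original wording
    chat_history.append(("AI", reply))
    return reply


condense_calls = []
seen_queries = []


def stub_llm(prompt):
    condense_calls.append(prompt)
    return "Who are some of the stakeholders of the GDC?"


def stub_retrieve(query):
    seen_queries.append(query)
    return ["chunk"]


def stub_answer(query, docs):
    return f"answer to: {query}"


history = []
converse("What is GDC", history, stub_llm, stub_retrieve, stub_answer)
converse("Who are some of the stakeholders ?", history, stub_llm, stub_retrieve, stub_answer)
print(seen_queries)
```

Only the second turn triggers the extra rephrasing call, since the first turn has no history to draw on.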

In [None]:
# Memory

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [None]:
from langchain.chains import ConversationalRetrievalChain
retriever = db.as_retriever()
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    memory=memory
)

In [None]:
question = "What is GDC"
result = qa({"question": question})
result

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


{'question': 'What is GDC',
 'chat_history': [HumanMessage(content='What is GDC'),
  AIMessage(content=' The Global Digital Compact (GDC) is an initiative to promote universal meaningful connectivity.')],
 'answer': ' The Global Digital Compact (GDC) is an initiative to promote universal meaningful connectivity.'}

In [None]:
question = "Who are some of the stakeholders ?"
result = qa({"question": question})
result

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/completions "HTTP/1.1 200 OK"


{'question': 'Who are some of the stakeholders ?',
 'chat_history': [HumanMessage(content='What is GDC'),
  AIMessage(content=' The Global Digital Compact (GDC) is an initiative to promote universal meaningful connectivity.'),
  HumanMessage(content='Who are some of the stakeholders ?'),
  AIMessage(content=' The stakeholders of the GDC initiative include private sector organizations, multistakeholder organizations, and intergovernmental organizations.')],
 'answer': ' The stakeholders of the GDC initiative include private sector organizations, multistakeholder organizations, and intergovernmental organizations.'}