    - Ollama: see https://ollama.com/download
  - For Linux users:
    
    For this tutorial, run the following commands:
    - Install Ollama: `curl -fsSL https://ollama.com/install.sh | sh`
    - Pull the required models:
        - `ollama pull mayflowergmbh/occiglot-7b-fr-en-instruct` (French LLM)
        - `ollama pull sammcj/sfr-embedding-mistral:Q4_K_M` (a decent embedding model for this use case)
    
    -------------------------------------------------------------------------------
  
    Additional information about Ollama
    - To remove Ollama: https://github.com/ollama/ollama/blob/main/docs/linux.md
    - To stop the Ollama server: `systemctl stop ollama`
    - To start the server again: `systemctl start ollama`
    
    -------------------------------------------------------------------------------
    Because of issues with model-weight updates in the latest Ollama versions, you may want to install an older version.
    To install **v0.1.31** on Linux:
      - `curl -fsSL https://ollama.com/install.sh | sed 's#https://ollama.com/download/ollama-linux-${ARCH}${VER_PARAM}#https://github.com/ollama/ollama/releases/download/v0.1.31/ollama-linux-amd64#' | sh`
            
- Next steps
    - Cite the source of each chatbot response (e.g., source: 'Le Challenger' via 'Malijet', followed by useful links to learn more)
    - Improve the model's responses (accuracy and precision)


In [3]:
from pathlib import Path
import os
import uuid
import pandas as pd
from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders import CSVLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq
# from langchain_community.vectorstores import DocArrayInMemorySearch
# from langchain_openai.embeddings import OpenAIEmbeddings

In [79]:
%env OPENAI_API_KEY=sk-xxx
%env GROQ_API_KEY=gsk_xxx

env: OPENAI_API_KEY=sk-xxx
env: GROQ_API_KEY=gsk_xxx


In [4]:
chat = ChatGroq(
    temperature=0,
    model="llama3-70b-8192",
    # api_key="" # Optional if not set as an environment variable
)

system = "You are a helpful assistant."
human = "{text}"
prompt = ChatPromptTemplate.from_messages([("system", system), ("human", human)])

chain = prompt | chat
chain.invoke({"text": "Explain the importance of low latency for LLMs."})

AIMessage(content="Low latency is crucial for Large Language Models (LLMs) because it directly impacts the user experience, model performance, and overall efficiency of language-based applications. Here are some reasons why low latency is essential for LLMs:\n\n1. **Real-time Interaction**: LLMs are often used in applications that require real-time interaction, such as chatbots, virtual assistants, and language translation systems. Low latency ensures that the model responds quickly to user input, providing a seamless and engaging experience.\n2. **Conversational Flow**: In conversational AI, latency can disrupt the natural flow of conversation. High latency can lead to awkward pauses, making the interaction feel unnatural and frustrating. Low latency helps maintain a smooth conversation, allowing users to engage more naturally with the model.\n3. **Model Performance**: LLMs rely on complex algorithms and massive datasets to generate responses. High latency can lead to increased comput

In [34]:
# ARTICLE_SOURCE_FILE_PATH = Path().resolve().parent /"data" / "malijet" / "source.csv"
CHROMA_DB_PERSIST_PATH = Path().resolve().parent / "data" / "new_chroma_db_1024"
MODEL_NAME = "mayflowergmbh/occiglot-7b-fr-en-instruct"
# MODEL_NAME = "falcon2:11b-q4_0"

In [4]:
# number of logical CPU cores
os.cpu_count()

16

In [5]:
system_role = "Tu es un expert sur les actualités du Mali et tu parles uniquement français (spécialisé en langue française)."
llm = Ollama(
    model=MODEL_NAME, 
    system=system_role, 
    # num_thread=os.cpu_count()-6
)
llm

Ollama(model='mayflowergmbh/occiglot-7b-fr-en-instruct', system='Tu es un expert sur les actualités du Mali et tu parles uniquement français (spécialisé en langue française).')

## Testing a simple LLM conversation

In [12]:
llm.invoke("Cite moi les noms des présidents du Mali.")

'Les noms des présidents du Mali sont Amadou Toumani Touré, Dioncounda Traoré, Ibrahim Boubacar Keïta et Assimi Goita.\n'

In [13]:
llm.invoke("Quelle est la plus grande crise que la Mali a connue ?")

"La crise la plus grave que le Mali ait connue est sans doute la guerre civile de 2012. Cette conflit a opposé les forces gouvernementales aux rebelles touaregs et à l'Armée malienne pour la libération de l'Azawad (AMAL). La guerre civile a fait des milliers de morts et a provoqué une crise humanitaire majeure dans le pays.\n"

In [14]:
llm.invoke("Who is the most popular scientist in the world?")

'Je ne suis pas capable de répondre à cette question. Je peux vous donner des informations sur les scientifiques célèbres, mais je ne peux pas dire qui est le plus populaire dans le monde entier.'

## 2. Build RAG with CSV file

In [1]:
from langchain_core.documents import Document

In [21]:
# loader = CSVLoader(
#     file_path=ARTICLE_SOURCE_FILE_PATH,
#     csv_args={
#         "delimiter": "\t"
#     }
# )

loader = DirectoryLoader("../data/articles", glob="**/*.csv", loader_cls=CSVLoader, 
                         loader_kwargs={
                             "csv_args" : {"delimiter": "\t",
                                           },
                             "metadata_columns": ["title", "source_paper", "date", "link"]
                         })
# load the documents
data = loader.load()
data[:3] # first three documents

[Document(metadata={'source': '../data/articles/malijet/2024/08/21.csv', 'row': 0, 'title': 'Coopération Mali-Niger : Une délégation ministérielle séjourne à Niamey', 'source_paper': 'Malijet ', 'date': '2024-08-21', 'link': 'https://malijet.com/a_la_une_du_mali/294034-cooperation-mali-niger--une-delegation-ministerielle-sejourne-a-.html'}, page_content="content: Une délégation importante du gouvernement malien, dirigée par le ministre de la Justice et des Droits de l’Homme, garde des Sceaux, Mamadou Kassogué, s'est rendue à Niamey, au Niger, pour étudier les méthodes nigériennes dans la lutte contre l'extrémisme violent, notamment la réinsertion des ex-combattants. Cette initiative vise à trouver une solution durable et efficace pour résoudre un problème capital auquel le Mali est confronté. La délégation malienne a eu des entretiens avec le ministre d'État, ministre de l'Intérieur, de la Sécurité publique et de l'Administration territoriale du Niger, le gén

In [18]:
data[0]

Document(metadata={'source': '../data/articles/malijet/2024/08/21.csv', 'row': 0, 'title': 'title', 'source_paper': 'source_paper', 'date': 'date', 'link': 'link'}, page_content='content: content')

In [23]:
pd.Series([i.metadata['source'] for i in data]).value_counts()
# number of documents loaded (CSVLoader creates one document per CSV row)
len(data), #pd.read_csv(ARTICLE_SOURCE_FILE_PATH, sep="\t").shape[0]

(56,)

## Let's see if the splitter is necessary

### Without splitter

In [24]:
def get_length_info(list_of_documents, splitters=None):
    """Display length statistics for the documents and return the list of word lengths."""
    
    if splitters is None:
        splitters = [' ', '.']  # [word separator, sentence separator]
    
    ## document length statistics (number of characters per document)
    print('-'*10, "For character length", '-'*10)
    display(pd.Series([len(document.page_content) for document in list_of_documents]).describe())
    
    ## word length statistics (number of characters per word)
    print('-'*10, "Length of words (nb characters) in the corpus", '-'*10)
    _res = list()
    for document in list_of_documents:
        _res += pd.Series(document.page_content.split(splitters[0])).apply(len).tolist()
    display(pd.Series(_res).describe())
    
    ## document length statistics (number of words per document)
    print('-'*10, "How many words in each doc", '-'*10)
    display(pd.Series([len(doc.page_content.split(splitters[0])) for doc in list_of_documents]).describe())
    
    ## sentence count statistics (number of sentences per document)
    print('-'*10, "For how many sentences in in each doc", '-'*10)
    display(pd.Series([len(doc.page_content.split(splitters[1])) for doc in list_of_documents]).describe())
    
    ## sentence length statistics (number of characters per sentence)
    print('-'*10, "For sentences length in each doc", '-'*10)
    display(pd.Series([len(sentence) for document in list_of_documents for sentence in document.page_content.split(splitters[1])]).describe())
    
    return _res

In [25]:
res = get_length_info(data)

---------- For character length ----------


count       56.000000
mean      2721.000000
std       2558.947049
min        278.000000
25%       1347.000000
50%       1992.000000
75%       3256.000000
max      16557.000000
dtype: float64

---------- Length of words (nb characters) in the corpus ----------


count    23505.000000
mean         5.485088
std          3.574031
min          0.000000
25%          2.000000
50%          5.000000
75%          8.000000
max         26.000000
dtype: float64

---------- How many words in each doc ----------


count      56.000000
mean      419.732143
std       399.585398
min        41.000000
25%       200.000000
50%       300.000000
75%       491.250000
max      2519.000000
dtype: float64

---------- For how many sentences in in each doc ----------


count    56.000000
mean     16.267857
std      13.421138
min       3.000000
25%       8.000000
50%      13.000000
75%      18.250000
max      80.000000
dtype: float64

---------- For sentences length in each doc ----------


count     911.000000
mean      166.323820
std       122.778007
min         0.000000
25%        92.000000
50%       154.000000
75%       227.000000
max      1418.000000
dtype: float64

### With splitter

In [26]:
pd.Series(res).max(), pd.Series(res).quantile(.99)

(26, 15.0)

In [27]:
quantile = 387 # first version
maxi_character_per_sentence = pd.Series([len(sentence) for document in data for sentence in document.page_content.split('.')]).max().astype(int)
maxi_character_per_words = 20 # over 16 to make sure to get all-content overlap
quantile, maxi_character_per_sentence, maxi_character_per_words

(387, 1418, 20)

How do we determine how many characters form a meaningful chunk (i.e., understandable, self-sufficient information)?
We assume that each sentence can be a meaningful chunk. The idea is to look at the distribution of sentence lengths in the corpus to make a quick decision.
- We could choose 400 (close to the q95 quantile) as the chunk size for more flexibility. Here, we stay with 387 as the chunk size.
- We also set 20 as the upper bound on word length (above the 99th-percentile word length of 15 characters) and use it as the chunk overlap.
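
The reasoning above can be sketched in plain Python. This is a minimal illustration, not the exact computation used in this notebook: the helper names `sentence_lengths` and `suggest_chunk_size` and the toy corpus are assumptions, and unlike the cells above it strips whitespace and drops empty sentences before measuring.

```python
import statistics

def sentence_lengths(texts, sep="."):
    """Character length of every non-empty sentence found in `texts`."""
    return [len(s.strip()) for t in texts for s in t.split(sep) if s.strip()]

def suggest_chunk_size(texts, percentile=95, sep="."):
    """Use a percentile of the sentence-length distribution as the chunk size."""
    lengths = sentence_lengths(texts, sep)
    # n=100 yields the 99 percentile cut points; index p-1 is the p-th percentile
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return round(cuts[percentile - 1])

# Hypothetical toy corpus, only to show the call
corpus = [
    "Une phrase courte. Une deuxième phrase un peu plus longue que la première.",
    "Encore une phrase. Et une dernière pour la route.",
]
print(suggest_chunk_size(corpus, percentile=95))
```

On the real corpus, `texts` would be `[doc.page_content for doc in data]`.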

In [28]:
## test the document splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=quantile, # selected quantile
    chunk_overlap=maxi_character_per_words,
    separators=["\n\n", "\n", ". ", " ", ""], # specify that sentence split (by dot) is more important than space & other
    keep_separator=False # drop the separators from the document after split
)
documents = text_splitter.split_documents(documents=data)
len(documents)

554
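
To build intuition for why sentence boundaries are listed before spaces in `separators`, here is a deliberately simplified sketch. It is not LangChain's algorithm: `greedy_sentence_chunks` is a hypothetical helper that only packs whole sentences into chunks, ignores overlap, and lets a sentence longer than `chunk_size` become its own oversized chunk.

```python
def greedy_sentence_chunks(text, chunk_size, sep=". "):
    """Pack whole sentences into chunks of at most `chunk_size` characters."""
    chunks, current = [], ""
    for sentence in text.split(sep):
        candidate = sentence if not current else current + sep + sentence
        if len(candidate) <= chunk_size or not current:
            # still fits, or the chunk is empty and must take the sentence anyway
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# Hypothetical example text
example = "aa. bb. cc. dd"
print(greedy_sentence_chunks(example, chunk_size=6))
```

The real splitter goes further: when a piece is still too long, it falls back to the next separator in the list (here a space, then single characters).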

In [29]:
documents[8:20]

[Document(metadata={'source': '../data/articles/malijet/2024/08/21.csv', 'row': 1, 'title': 'Digitalisation de l’Administration: La DGCCC en première ligne pour concrétiser la vision du Président de la transition', 'source_paper': 'Le Nouveau Réveil', 'date': '2024-08-21', 'link': 'https://malijet.com/a_la_une_du_mali/294008-digitalisation-de-l’administration-la-dgccc-en-premiere-ligne-po.html'}, page_content="Zedion Dembélé, incarne cette volonté de transformation en s'engageant pleinement dans cette révolution numérique. L’objectif est clair : rendre l'administration plus transparente, plus efficace, tout en assurant une régulation optimale du commerce, une protection accrue des consommateurs, et une concurrence équitable sur le marché national"),
 Document(metadata={'source': '../data/articles/malijet/2024/08/21.csv', 'row': 1, 'title': 'Digitalisation de l’Administration: La DGCCC en première ligne pour concrétiser la vision du Président de la transition', 'source_pap

In [30]:

res = get_length_info(documents)

---------- For character length ----------


count    554.000000
mean     274.631769
std       95.651822
min        7.000000
25%      218.000000
50%      294.500000
75%      354.000000
max      387.000000
dtype: float64

---------- Length of words (nb characters) in the corpus ----------


count    23530.000000
mean         5.489588
std          3.545854
min          0.000000
25%          2.000000
50%          5.000000
75%          8.000000
max         26.000000
dtype: float64

---------- How many words in each doc ----------


count    554.000000
mean      42.472924
std       15.320452
min        1.000000
25%       33.000000
50%       46.000000
75%       54.000000
max       74.000000
dtype: float64

---------- For how many sentences in in each doc ----------


count    554.000000
mean       1.738267
std        1.014407
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max        8.000000
dtype: float64

---------- For sentences length in each doc ----------


count    963.000000
mean     157.566978
std      102.649447
min        0.000000
25%       86.000000
50%      151.000000
75%      224.000000
max      387.000000
dtype: float64

We go from a maximum of 2519 words per document to a maximum of 74 words per chunk. That may matter for the final RAG framework.

In [31]:
[doc for doc in documents[:5]]

[Document(metadata={'source': '../data/articles/malijet/2024/08/21.csv', 'row': 0, 'title': 'Coopération Mali-Niger : Une délégation ministérielle séjourne à Niamey', 'source_paper': 'Malijet ', 'date': '2024-08-21', 'link': 'https://malijet.com/a_la_une_du_mali/294034-cooperation-mali-niger--une-delegation-ministerielle-sejourne-a-.html'}, page_content="content: Une délégation importante du gouvernement malien, dirigée par le ministre de la Justice et des Droits de l’Homme, garde des Sceaux, Mamadou Kassogué, s'est rendue à Niamey, au Niger, pour étudier les méthodes nigériennes dans la lutte contre l'extrémisme violent, notamment la réinsertion des ex-combattants"),
 Document(metadata={'source': '../data/articles/malijet/2024/08/21.csv', 'row': 0, 'title': 'Coopération Mali-Niger : Une délégation ministérielle séjourne à Niamey', 'source_paper': 'Malijet ', 'date': '2024-08-21', 'link': 'https://malijet.com/a_la_une_du_mali/294034-cooperation-mali-niger--une-del

In [19]:
[doc.page_content for doc in documents[37:40]]

## Embedding and Vector store

In [32]:
# embeddings_llm = OllamaEmbeddings(model=MODEL_NAME) #mistral or Occiglot
# embeddings_llm = OllamaEmbeddings(model="mayflowergmbh/occiglot-7b-fr-en-instruct")
# embeddings_llm = OllamaEmbeddings(model="snowflake-arctic-embed")
# embeddings_llm = OpenAIEmbeddings()

# embeddings_llm = OllamaEmbeddings(model="bge-large:335m-en-v1.5-fp16")

embeddings_llm = OllamaEmbeddings(model="bge-m3:567m-fp16")
# Set few params if needed
embeddings_llm.show_progress = True
embeddings_llm.num_thread = os.cpu_count() - 8 # 8 in my case
# embeddings_llm.top_k = 10
# embeddings_llm.top_p = .5

embeddings_llm

OllamaEmbeddings(base_url='http://localhost:11434', model='bge-m3:567m-fp16', embed_instruction='passage: ', query_instruction='query: ', mirostat=None, mirostat_eta=None, mirostat_tau=None, num_ctx=None, num_gpu=None, num_thread=8, repeat_last_n=None, repeat_penalty=None, temperature=None, stop=None, tfs_z=None, top_k=None, top_p=None, show_progress=True, headers=None, model_kwargs=None)

In [23]:
# Example
documents[3], len(documents[3].page_content)

(Document(metadata={'source': '/home/bouba/Workspace/kounafoni-app/data/malijet/source.csv', 'row': 0}, page_content='Dans son document, le fameux ‘’Panel des démocrates’’ expose des propositions farfelues en demandant que le pays soit dirigé par un chef d’État honorifique désigné chaque année par exercice tournant parmi les sénateurs...Autre hic, alors que certains démocrates exigent le retour immédiat à l’ordre constitutionnel, eux ils proposent encore une autre transition dite civile de'),
 386)

In [24]:
len(embeddings_llm.embed_query(documents[3].page_content))


OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.72s/it]


1024

In [25]:
# test to embed a query
embeddings_llm.embed_query(documents[3].page_content)[:7], len(embeddings_llm.embed_query(documents[3].page_content))


OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  1.20it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  1.47it/s]


([-0.6823922991752625,
  0.23682698607444763,
  -0.46551015973091125,
  -0.07770388573408127,
  -0.7255619764328003,
  -1.188683032989502,
  1.1188299655914307],
 1024)

## Selection of Vector Store

Thanks to a recent Google article, we can pick any open-source vector store, place it on Google's NFS Filestore, and access it from Cloud Run by mounting the Filestore volume (see section 3): https://cloud.google.com/blog/products/serverless/introducing-cloud-run-volume-mounts?hl=en

In [36]:
# Database creation
# With a 4096-dimension embedding this step took a very long time;
# we now use a 1024-dimension embedding to test Cloud Run
db = Chroma.from_documents(
    documents=documents[:2],
    embedding=embeddings_llm,
    persist_directory=CHROMA_DB_PERSIST_PATH.as_posix(),
)
# db

OllamaEmbeddings: 100%|██████████| 2/2 [00:03<00:00,  1.90s/it]


In [38]:
db = Chroma(
    persist_directory=CHROMA_DB_PERSIST_PATH.as_posix(),
    embedding_function=embeddings_llm,
)

persisted_ids = db.get()["ids"]

new_documents_to_embed_df = pd.DataFrame({
    "single_id": [str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)) for doc in documents[:30]],
    "document": documents[:30]
})

# keep only distinct documents (i.e., chunks)
new_documents_to_embed_df.drop_duplicates(subset="single_id", inplace=True)

# keep only documents that are not already embedded
new_documents_to_embed_df = new_documents_to_embed_df[
    ~new_documents_to_embed_df.single_id.isin(persisted_ids)
]

if new_documents_to_embed_df.empty:
    print("No documents to embed")
else:
    print("Embedding documents...")
    display(new_documents_to_embed_df.head())
    # db already holds its embedding function and persist directory
    db.add_documents(
        documents=new_documents_to_embed_df.document.tolist(),
        ids=new_documents_to_embed_df.single_id.tolist(),
    )


Embedding documents...


Unnamed: 0,single_id,document
0,f8871cf3-1f21-5d67-98a4-f01b87484621,page_content='content: Une délégation import...
1,c61aaa1c-3432-5f90-8ed2-97775a5d4c95,page_content='Cette initiative vise à trouver...
2,5261cd46-bb81-5322-a6a4-9e9327995441,page_content='Les discussions ont porté sur d...
3,61cccc77-0a85-51ff-bcc9-370d9946eb2e,page_content='Le voyage d'études de la délé...
4,4373ecaa-3dde-5cd9-a08b-e5985c5d70b8,page_content='Kandana/Malijet.com' metadata={'...


OllamaEmbeddings: 100%|██████████| 30/30 [00:19<00:00,  1.55it/s]
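
The incremental update above relies on `uuid.uuid5` being deterministic: the same chunk text always maps to the same ID, so chunks that are already stored can be skipped. A minimal standalone sketch of that idea, where `chunk_id` and `select_new_chunks` are hypothetical helper names and plain strings stand in for documents:

```python
import uuid

def chunk_id(text):
    """Deterministic ID derived from the chunk text (same text, same ID)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, text))

def select_new_chunks(chunks, persisted_ids):
    """Return (id, text) pairs for chunks that are neither stored nor repeated."""
    seen = set(persisted_ids)
    new = []
    for text in chunks:
        cid = chunk_id(text)
        if cid not in seen:
            seen.add(cid)  # also removes duplicates inside the batch
            new.append((cid, text))
    return new

# Hypothetical data: "b" is already stored, "a" appears twice in the batch
already_stored = {chunk_id("b")}
print(select_new_chunks(["a", "b", "a"], already_stored))
```

One caveat with this scheme: chunks added without explicit `ids` get store-generated IDs, so they will not be recognized as duplicates later.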


In [39]:
db_check = Chroma(
    persist_directory=CHROMA_DB_PERSIST_PATH.as_posix(),
    embedding_function=embeddings_llm,
)
db_check.get()

{'ids': ['08214fe1-38c6-56d6-8c4f-245ba844bee1',
  '11f1d385-f429-59a2-afd5-71223f4b74df',
  '2490711e-e345-5e92-a883-bc2dec9a8d4d',
  '2b36dcc4-e01f-53a9-89af-816bb8b058ab',
  '3fd9514c-a0b9-5ff0-83a9-6cd058a965bb',
  '431801fc-54ae-54c7-8b89-5629ed02202d',
  '4373ecaa-3dde-5cd9-a08b-e5985c5d70b8',
  '4c51426c-5f39-542e-9b7d-e8eb1b14a4f5',
  '5261cd46-bb81-5322-a6a4-9e9327995441',
  '6064e09e-f508-5762-b2ff-8c0dfcb25586',
  '61cccc77-0a85-51ff-bcc9-370d9946eb2e',
  '657515c2-5c93-5538-b8d4-afbeddebf110',
  '6ccd4345-af7b-5a6f-9cd1-920084aa5048',
  '8544498f-95b7-5a99-b4e7-da20bea4ba16',
  '886f93e9-21a4-5d31-93d9-922478057c6f',
  '8bd5dee4-85db-5188-8aec-a0289b9fa9ce',
  '8f92abb1-124d-5b64-9d88-adfb4c0df86a',
  '908b29d1-15fb-4381-a1c9-1e8621833630',
  '93b0b662-0c54-516a-8b0d-d6eac3ec97b9',
  '9b2dbda5-fb4c-59c4-a881-0118bfbaf2b4',
  'a7af70d3-fbbe-5e3b-83e3-a9d3efe766ee',
  'acdf1f22-60fa-54de-8d1b-b0b56dcc89f3',
  'ad36a773-8e9e-5825-b1ba-f74bc2a53016',
  'af0d3583-d70a-54f2-8d56-

In [42]:
# db_check.get(include=['embeddings'])

In [62]:
# db_check.delete_collection()

In [23]:
# In memory vector store test!

# Took more than 36 min for only 46 documents; I interrupted it from the keyboard!
# This was with occiglot, mistral or sfr-embedding as the embedding model (first retrieval attempt)
# db = DocArrayInMemorySearch.from_documents(documents, embedding=embeddings_llm)

In [26]:
# for snowflake
# db2 = DocArrayInMemorySearch.from_documents(documents, embedding=embeddings_llm) 

In [27]:
# for openai
# db4 = DocArrayInMemorySearch.from_documents(documents, embedding=embeddings_llm)

In [28]:
# load db from disk
# db3 = Chroma(persist_directory=CHROMA_DB_PERSIST_PATH.as_posix(), embedding_function=embeddings_llm)
# db3

## Retriever for RAG

In [40]:
retriever = db.as_retriever(search_kwargs={"k": 7}) # Occiglot or sfr-embed
# retriever = db2.as_retriever(search_kwargs={"k": 10}) # snowflake
# retriever = db4.as_retriever(search_kwargs={"k": 10}) #openAI
# retriever = db3.as_retriever(search_kwargs={"k": 5})
retriever

VectorStoreRetriever(tags=['Chroma', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x71887f2e9fc0>, search_kwargs={'k': 7})
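
What `k` controls can be illustrated with a small pure-Python top-k search. This is a sketch under assumptions: `cosine_similarity` and `top_k` are hypothetical helpers, the vectors are toy values, and cosine similarity is used for illustration only; the vector store's actual distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, indexed, k=7):
    """Return the ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in indexed]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy 2-dimensional "embeddings"
index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(top_k([1.0, 0.0], index, k=2))
```

A larger `k` gives the LLM more context but also more unrelated chunks, which matters when many chunks are very short.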

In [41]:
# Test retriever
query = "AES" 
retriever.invoke(query), len(retriever.invoke(query))

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.57it/s]
OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.75it/s]


([Document(metadata={'date': '2024-08-21', 'link': 'https://malijet.com/a_la_une_du_mali/294008-digitalisation-de-l’administration-la-dgccc-en-premiere-ligne-po.html', 'row': 1, 'source': '../data/articles/malijet/2024/08/21.csv', 'source_paper': 'Le Nouveau Réveil', 'title': 'Digitalisation de l’Administration: La DGCCC en première ligne pour concrétiser la vision du Président de la transition'}, page_content='Adama Coulibaly'),
  Document(metadata={'date': '2024-08-21', 'link': 'https://malijet.com/a_la_une_du_mali/294034-cooperation-mali-niger--une-delegation-ministerielle-sejourne-a-.html', 'row': 0, 'source': '../data/articles/malijet/2024/08/21.csv', 'source_paper': 'Malijet ', 'title': 'Coopération Mali-Niger : Une délégation ministérielle séjourne à Niamey'}, page_content='Kandana/Malijet.com'),
  Document(metadata={'date': '2024-08-21', 'link': 'https://malijet.com/a_la_une_du_mali/294006-classement-2024-des-puissances-militaires-africaines-le-mali-sup.html', 'row': 2

In [75]:
# Prompt template
template = """
Réponds à la question uniquement grâce au contexte suivant et uniquement en langue française. Il faudra clairement détailler ta réponse. A la fin de ta réponse, mets en bas la source de média 'source_paper' qui t'as permis d'avoir ces réponses ainsi que le lien associé (en lien hyperlink markdown sous le format [Doc title](Doc link)) pour permettre à l'utilisateur de cliquer sur le lien et aller vérifier l'information. S'il y a plusieurs link et plusieurs source_paper, cite les deux majoritaires !
Ne commence pas ta réponse par : "selon les informations ou contexte fournis" ou quelque chose de similaire, réponds directement à la question.
Si tu n'as pas de réponse explicite dans le contexte, réponds "Je n'ai pas assez d'informations pour répondre correctement à votre question.".

Contexte : {context}

Question : {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [76]:
def format_docs(docs):
    docs_formatted = list()
    for d in docs:
        doc_presentation = f"Doc title : <<{d.metadata['title']}>>\n"
        doc_presentation +=  f"Doc date : <<{d.metadata['date']}>>\n"
        doc_presentation +=  f"Doc source_paper : <<{d.metadata['source_paper']}>>\n"
        doc_presentation +=  f"Doc link : <<{d.metadata['link']}>>\n"
        doc_presentation +=  f"Doc page_content : <<{d.page_content}>>\n"
        docs_formatted.append(doc_presentation)
    print(docs_formatted)
    return f"\n\n{'-'*50}\n".join(docs_formatted)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    # | llm
    | ChatGroq(temperature=0, model="llama3-70b-8192")
    | StrOutputParser()
)
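
`format_docs` indexes the metadata directly, so a chunk without `title`, `date`, `source_paper` or `link` (for example one loaded without `metadata_columns`) would raise a `KeyError`. A defensive variant could look like the sketch below. `SimpleDoc` is a hypothetical stand-in for a LangChain `Document`, and `format_docs_safe` and the `"unknown"` placeholder are assumptions, not part of the original pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDoc:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)

FIELDS = ("title", "date", "source_paper", "link")

def format_docs_safe(docs, missing="unknown"):
    """Same layout as format_docs, but tolerant of missing metadata keys."""
    blocks = []
    for d in docs:
        lines = [f"Doc {name} : <<{d.metadata.get(name, missing)}>>" for name in FIELDS]
        lines.append(f"Doc page_content : <<{d.page_content}>>")
        blocks.append("\n".join(lines) + "\n")
    return f"\n\n{'-'*50}\n".join(blocks)

# Hypothetical document with partial metadata
print(format_docs_safe([SimpleDoc("texte", {"title": "T"})]))
```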


In [77]:
questions = [
    # "Qui est Oumar Diarra ?",
    # "Quels sont les actions de l'EUTM ?",
    # "Cite-moi les recommandations retenues lors du dialogue inter malien",
    # "Quand finit la mission de l'union européenne ?",
    # "Actualités sur la DIRPA",
    # "Parle-moi de l'agriculture au Mali",
    # "Que font les FAMA actuellement ?"
    
    # "Parle moi du nouveau vérificateur général",
    # "Résume moi en quelques points les dernières actualités maintenant",
    # "Où en est la relation Mali et Russie ?",
    # "Qui est Bassaro Haïdara ?",
    # "Qu'est ce que Assimi a fait récemment ?",
    # "Le dialogue inter malien est il terminé ?",
    # "Qui sont les membres de l'AES ?",
    # "Comment a été la journée du 1er Mai au Mali ?",
    # "Donne moi la date la plus récente des informations dont tu disposes",
    # "Qu'en est il de la crise sécuritaire au Mali ?",
    # "La Belgique a t elle récemment collaborée avec le Mali ?" # question bonus (must return I dont know),
    "parle moi des puissances militaires",
    # "Y a t il des tensions entre le Mali et le Danemark ?",
    # "Quelles les implications de l'Algérie à la guerre au nord du Mali ?",
    # "Qu'a fait le président Assimi le 15 Aout ?",
    # "Fais moi un bilan des inondations du Mali."
]
len(questions)

1

In [78]:
for q in questions:
    print(q)
    print(chain.invoke(q))
    print('-'*50, "\n")
# for open source models
for q in questions:
    print(q)
    print(chain.invoke(q))
    print('-'*50, "\n")

parle moi des puissances militaires


OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.53it/s]


['Doc title : <<Coopération Mali-Niger : Une délégation ministérielle séjourne à Niamey>>\nDoc date : <<2024-08-21>>\nDoc source_paper : <<Malijet >>\nDoc link : <<https://malijet.com/a_la_une_du_mali/294034-cooperation-mali-niger--une-delegation-ministerielle-sejourne-a-.html>>\nDoc page_content : <<Les discussions ont porté sur divers sujets, incluant les mécanismes de désarmement, de démobilisation et de réintégration (DDR), ainsi que le rôle crucial des autorités traditionnelles dans ce processus>>\n', "Doc title : <<Classement 2024 des puissances militaires africaines: Le Mali supplante ses alliés de l'AES et se positionne derrière la Côte d'Ivoire>>\nDoc date : <<2024-08-21>>\nDoc source_paper : <<Le Nouveau Réveil>>\nDoc link : <<https://malijet.com/a_la_une_du_mali/294006-classement-2024-des-puissances-militaires-africaines-le-mali-sup.html>>\nDoc page_content : <<Il a fondé son classement sur une nomenclature de 60 critères pour déterminer le score ‘'Power 

OllamaEmbeddings: 100%|██████████| 1/1 [00:00<00:00,  2.97it/s]


['Doc title : <<Coopération Mali-Niger : Une délégation ministérielle séjourne à Niamey>>\nDoc date : <<2024-08-21>>\nDoc source_paper : <<Malijet >>\nDoc link : <<https://malijet.com/a_la_une_du_mali/294034-cooperation-mali-niger--une-delegation-ministerielle-sejourne-a-.html>>\nDoc page_content : <<Les discussions ont porté sur divers sujets, incluant les mécanismes de désarmement, de démobilisation et de réintégration (DDR), ainsi que le rôle crucial des autorités traditionnelles dans ce processus>>\n', "Doc title : <<Classement 2024 des puissances militaires africaines: Le Mali supplante ses alliés de l'AES et se positionne derrière la Côte d'Ivoire>>\nDoc date : <<2024-08-21>>\nDoc source_paper : <<Le Nouveau Réveil>>\nDoc link : <<https://malijet.com/a_la_une_du_mali/294006-classement-2024-des-puissances-militaires-africaines-le-mali-sup.html>>\nDoc page_content : <<Il a fondé son classement sur une nomenclature de 60 critères pour déterminer le score ‘'Power 

## Summary

- 8 correct retrievals out of 11 questions [V, V, V, X, V, V, X, V, X, V, V] (V → correct, X → wrong)
- 4 correct answers out of 11 questions [X, V, V, X, V, X, X, X, X, X, V] (V → correct, X → wrong)
- Correct answer to the bonus question!
- Candidate questions to add:
    - Quand est ce que Abdoulaye Diop rencontrera ses homologues ? (When will Abdoulaye Diop meet his counterparts?)
    - Quoi de prévu le 28 juin à Bamako ? (What is planned for June 28 in Bamako?)
    - Quelle célébrité le président a rencontré ? (Which celebrity did the president meet?)
    - À quand la transition civile au Mali ? (When will the civilian transition happen in Mali?)
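
The two mark lists above can be turned into rates with a few lines of Python. The lists are copied from the summary; the helper name `hit_rate` is an assumption.

```python
# V = correct, X = wrong, one mark per question, same order in both lists
retrieval = "V V V X V V X V X V V".split()
answers = "X V V X V X X X X X V".split()

def hit_rate(marks):
    """Share of correct marks in a list of 'V'/'X' strings."""
    return marks.count("V") / len(marks)

# Questions where both the retrieval and the final answer were correct
both = sum(r == a == "V" for r, a in zip(retrieval, answers))

print(f"retrieval accuracy: {hit_rate(retrieval):.3f}")
print(f"answer accuracy:    {hit_rate(answers):.3f}")
print(f"correct answers given correct retrieval: {both}/{retrieval.count('V')}")
```

Comparing the two lists position by position shows whether wrong answers come from the retriever or from the generation step.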

# Results with OpenAIEmbeddings: far better

In [40]:
# Result from OpenAI Embeddings
for q in questions:
    print(q)
    print(chain.invoke(q))
    print('-'*50, "\n")