# Introduction to Natural Language Processing (NLP)

## Table of Contents
- [I. Key Concepts in NLP](#i-key-concepts-in-nlp)
  - [I.1. Tokenization](#i1-tokenization)
  - [I.2. Vectorization](#i2-vectorization)
- [II. Advanced Representations](#ii-advanced-representations)
  - [II.1. Sentence Embeddings](#ii1-sentence-embeddings---contextualized-embeddings)
  - [II.2. Contextualized Embeddings](#ii2-contextualized-embeddings)
- [III. Transformer](#iii-transformer)
  - [III.1. Architecture](#iii1-architecture)
  - [III.2. Key Concepts of a Transformer](#iii2-key-concepts-of-a-transformer)
- [IV. ChromaDB](#iv-chromadb)
  - [IV.1. Key Features](#iv1-key-features)
  - [IV.2. Use Cases](#iv2-use-cases)
  - [IV.3. Data Structure](#iv3-data-structure)
  - [IV.4. Setup and Usage](#iv4-setup-and-usage)
- [V. Mini-Project: Building a Chatbot](#v-mini-project)
  - [V.1. Project Objective: Use Retrievers with LLM](#v1-project-objective--use-retrievers-with-llm)
  - [V.2. Implementation](#v2-implementation)
  - [V.3. Evaluation](#v3-evaluation)
- [VI. Conclusion and Future Directions](#vi-conclusion-and-future-directions)


## I. Key Concepts in NLP

### I.1. Tokenization

**Tokenization** is the process of breaking down a sequence of text (such as a sentence, paragraph, or document) into smaller units called **tokens**. These tokens can be individual words, subwords, characters, or even entire sentences, depending on the application and the tokenization method chosen.

Tokenization is a fundamental step in **Natural Language Processing (NLP)** and usually occurs before other processing steps, such as vectorization or syntactic analysis. The goal is to segment the text in a way that makes it easier to manipulate and analyze using algorithms.

### Why Tokenize Text?
Machine learning and NLP algorithms cannot directly process raw text. Tokenization allows the transformation of this text into a sequence of tokens (linguistic units), which can then be processed, analyzed, or transformed into vectors for further modeling or computation.

### Types of Tokenization

#### 1. **Word Tokenization**
This is one of the most common methods, where the text is split into individual words. The segmentation is usually done using spaces and punctuation as delimiters.
- **Example**: The text "The cat is on the mat." is tokenized into `["The", "cat", "is", "on", "the", "mat", "."]`.

#### 2. **Subword Tokenization**
Modern NLP models split rare or unknown words into smaller, more frequent pieces, so that any word can be represented with a fixed-size vocabulary.
- **Example**: The word "unforgettable" could be split into `["un", "forget", "table"]`. The pieces are chosen from frequency statistics, so they do not always coincide with morphemes or word roots.
- Methods like **BPE (Byte-Pair Encoding)** or **WordPiece** are commonly used for this approach: GPT models use a byte-level BPE, while BERT uses WordPiece.

#### 3. **Character Tokenization**
The text is segmented into individual characters. This method is sometimes used for deep learning models that work directly on character sequences, or for languages such as Chinese that are written without spaces between words.
- **Example**: The word "cat" is tokenized into `["c", "a", "t"]`.

#### 4. **Sentence Tokenization**
The text is split into sentences rather than words or subwords. This approach is useful for tasks where the context of the entire sentence is important.
- **Example**: The text "It’s sunny. The cat is playing outside." is segmented into `["It’s sunny.", "The cat is playing outside."]`.
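The four tokenization types above can be sketched in a few lines of Python. This is a minimal illustration that assumes simple regular-expression rules and a tiny, hypothetical subword vocabulary; production tokenizers (BPE, WordPiece) learn their vocabulary from data and handle many more edge cases.

```python
import re

def word_tokenize(text):
    # A word is a run of letters/digits (apostrophes allowed inside);
    # any other non-space character becomes its own token.
    return re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", text)

def subword_tokenize(word, vocab):
    # Greedy longest-match-first split, a simplified WordPiece-style rule.
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:          # no known piece starts here
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

def char_tokenize(text):
    return list(text)

def sentence_tokenize(text):
    # Naive rule: split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

toy_vocab = {"un", "forget", "table"}  # hypothetical vocabulary for the demo

print(word_tokenize("The cat is on the mat."))
print(subword_tokenize("unforgettable", toy_vocab))
print(char_tokenize("cat"))
print(sentence_tokenize("It's sunny. The cat is playing outside."))
```

The naive sentence rule breaks on abbreviations such as "Dr. Smith", which is one reason dedicated sentence splitters exist.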

### How Does Tokenization Work?
Tokenization relies on segmentation rules that depend on the language and the application:
- **Delimiters**: Spaces, punctuation marks, or other special characters are often used to delimit tokens.
- **Linguistic Exceptions**: Some languages or linguistic situations require more precise segmentation. For example, in French, tokenization must handle apostrophes, such as in "l'arbre" to obtain `["l'", "arbre"]`.

In more sophisticated models like Transformers, the resulting tokens are often mapped to **identifiers** (IDs) in a vocabulary, making it easier for machine learning algorithms to process them.
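The mapping from tokens to identifiers can be illustrated with a small sketch. The helper names `build_vocab` and `encode`, and the special tokens `[PAD]` and `[UNK]`, are assumptions chosen for this example rather than the API of a specific library.

```python
def build_vocab(tokenized_texts, specials=("[PAD]", "[UNK]")):
    # Special tokens get the first IDs; every new token gets the next free ID.
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in tokenized_texts:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in tokens]

corpus = [["the", "cat", "is", "on", "the", "mat", "."]]
vocab = build_vocab(corpus)
ids = encode(["the", "dog", "is", "on", "the", "mat", "."], vocab)
print(vocab)
print(ids)  # "dog" was never seen, so it maps to the [UNK] ID
```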

### Use Cases
- **Text Analysis**: Transforming text into tokens allows for word frequency analysis, understanding semantic context, or performing tasks like text classification.
- **Preprocessing for NLP Models**: Before training NLP models, such as recurrent neural networks or Transformers, the text must be tokenized and converted into numerical vectors.
- **Machine Translation**: Tokenization is used to segment the source and target text into coherent units to facilitate translation.

In summary, tokenization is a crucial step in text processing for NLP that involves dividing a text sequence into smaller units for easier manipulation and analysis. It is a prerequisite for any language processing task, whether it’s text classification, text generation, or other NLP applications.

In [1]:
from IPython.display import HTML

# Image URL
image_url = "https://miro.medium.com/v2/resize:fit:720/format:webp/0*cgpKoFocSYm6bLHw.png"

# Display the image as HTML
HTML(f'<img src="{image_url}" width="500">')

### I.2. Vectorization

Vectorization is the process of transforming data (such as text, images, or other types of information) into numerical vectors (or "embeddings"). These vectors allow machines to analyze and process data in a mathematical and geometric manner, which is essential for tasks in machine learning, natural language processing (NLP), or computer vision.

### Why Vectorize?
Most machine learning and deep learning algorithms handle numerical data. However, in many applications, raw data is often non-numerical (text, images, audio). Vectorization makes it possible to represent this data in a form that algorithms can understand and process. Once the data is transformed into numerical vectors, it can be used for tasks such as classification, similarity search, image recognition, and more.

**1. Text Vectorization**

Text is a sequence of words, and different techniques exist to transform it into vectors:

* **Bag of Words (BoW)**: Each document is represented by a vector where each dimension corresponds to a word from the total vocabulary. The value of each dimension is the number of occurrences of that word in the document. For example, for a vocabulary of 3 words ["cat", "dog", "bird"], the sentence "the cat is a cat" would be vectorized as [2, 0, 0].
* **TF-IDF (Term Frequency-Inverse Document Frequency)**: This is an improvement on BoW that gives more weight to important words in a document while reducing the importance of words that are frequent across all documents (such as "the", "and").
* **Word Embeddings**: Models such as Word2Vec, GloVe, or FastText learn to represent words in a vector space so that semantically related words have similar vectors. For example, the words "king" and "queen" would have vectors that are close to each other.
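Bag of Words and TF-IDF are simple enough to sketch in pure Python. The version below uses the plain textbook formula (term frequency times `log(N / df)`) as an assumption; libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

def bag_of_words(tokens, vocabulary):
    # One dimension per vocabulary word, holding its count in the document.
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

def tf_idf(docs, vocabulary):
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
    vectors = []
    for d in docs:
        counts = Counter(d)
        vec = []
        for w in vocabulary:
            tf = counts[w] / len(d) if d else 0.0
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

bow = bag_of_words("the cat is a cat".split(), ["cat", "dog", "bird"])
print(bow)

docs = [["the", "cat"], ["the", "dog"]]
weights = tf_idf(docs, ["the", "cat", "dog"])
print(weights)  # "the" appears in every document, so its weight is 0
```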


## II. Advanced Representations

### II.1. Sentence Embeddings -> Contextualized Embeddings

#### What are Sentence Embeddings?

**Sentence embeddings** are dense vector representations that capture the semantic meaning of an entire sentence, rather than focusing on individual words. These embeddings transform whole sentences into fixed-size vectors, usually in a high-dimensional space, where the position of each vector reflects the semantic similarity between sentences.

### Why are Sentence Embeddings Better than Word Embeddings?

While **word embeddings** like Word2Vec, GloVe, and FastText represent individual words in a continuous vector space, they have limitations when used to understand the meaning of whole sentences. Here's why sentence embeddings can offer significant improvements:

#### 1. **Contextual Understanding**
- **Word embeddings** assign a single, fixed vector to each word, so they cannot reflect how a word's meaning changes with the surrounding sentence. For example, the words "bank" (as a financial institution) and "bank" (as the side of a river) would have the same embedding regardless of context.
- **Sentence embeddings**, on the other hand, take into account the meaning of words within their context. They can distinguish between different meanings based on how words interact in the sentence, providing a more accurate semantic representation.

#### 2. **Handling Polysemy (Multiple Meanings)**
- **Word embeddings** struggle with words that have multiple meanings (polysemy), because they assign a single vector to each word.
- **Sentence embeddings** avoid this problem because they encode entire sentences, allowing polysemous words to be disambiguated by the context in which they appear. For example, in "I deposited money at the bank" vs. "We sat by the river bank," sentence embeddings will differentiate between the financial and geographic meanings of "bank."

#### 3. **Capturing Sentence-Level Information**
- **Word embeddings** do not capture sentence structure or word order. A bag-of-words model, for example, treats a sentence as a collection of words without considering how word order contributes to meaning.
- **Sentence embeddings** account for the order and structure of the sentence, making them better at capturing meaning that depends on how words are arranged, like in complex sentences or when negation is involved ("The movie was not good" vs. "The movie was good").

#### 4. **Suitable for Higher-Level Tasks**
- **Word embeddings** are useful for word-level tasks like named entity recognition or part-of-speech tagging, but they struggle with sentence-level tasks like text classification, summarization, or question answering.
- **Sentence embeddings** are designed for higher-level tasks, where understanding the entire sentence (or even paragraph) is crucial. Tasks like semantic similarity, sentiment analysis, or document retrieval benefit greatly from using sentence embeddings since they better capture the overall meaning.

**Example**

- Sentence 1: "The bank of the river was peaceful."
- Sentence 2: "I deposited money at the bank yesterday."

With word embeddings like Word2Vec, the word "bank" would have a similar representation in both sentences, as the context is not considered.

- Sentence 1: "The cat sat on the mat."
- Sentence 2: "The dog lay on the carpet."

Although these sentences use different words, models like BERT or Sentence-BERT produce similar sentence embeddings because the general meaning (an animal performing an action on a surface) is close.

The word "apple" in "I ate an apple" and "Apple released a new iPhone" would have a similar representation because traditional word embeddings (like Word2Vec or GloVe) do not modify a word's representation based on its context.

In the same example, sentence embeddings would differ because they consider the overall context:
  - "I ate an apple" would be closer to a sentence like "I had a banana for breakfast."
  - "Apple released a new iPhone" would be closer to a sentence like "Samsung unveiled a new smartphone."

In short:
- Word Embeddings: Ideal for word-level tasks, such as finding synonyms or word analogies (king - man + woman = queen).
- Sentence Embeddings: Better suited for tasks requiring global understanding, such as document retrieval or sentence similarity analysis.


### How are Sentence Embeddings Created?

Sentence embeddings are often created using more advanced techniques compared to word embeddings:

#### 1. **Average Word Embeddings**
- A simple method is to average the embeddings of all the words in a sentence. While this gives a rough sense of the sentence's meaning, it fails to capture word order and nuance.
  
#### 2. **Pre-trained Language Models (Transformers)**
- More sophisticated approaches use pre-trained transformer models like **BERT**, **GPT**, or **RoBERTa**. These models generate contextualized embeddings for each token and can aggregate them into a sentence embedding (for example by averaging the token vectors, or by using the vector of a special token such as BERT's `[CLS]`), taking into account context and relationships between words.

#### 3. **Models Specializing in Sentence Embeddings**
- Some models are specifically designed to generate high-quality sentence embeddings. For instance:
  - **Sentence-BERT (SBERT)** fine-tunes BERT to produce embeddings for entire sentences, optimized for tasks like semantic similarity and sentence classification.
  - **Universal Sentence Encoder (USE)** from Google directly produces fixed-size sentence embeddings and is widely used for a variety of tasks such as semantic search, clustering, and question answering.
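The first method above, averaging word embeddings, can be sketched with made-up two-dimensional vectors. The `toy_vectors` table is purely hypothetical; real word embeddings have hundreds of dimensions. The sketch also shows the weakness mentioned above: two sentences with the same words in a different order get exactly the same vector.

```python
def mean_pool(tokens, word_vectors):
    # Average the vectors of all known tokens, dimension by dimension.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        raise ValueError("none of the tokens has a known vector")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Hypothetical toy embeddings, chosen only for the demonstration.
toy_vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

sent_a = mean_pool(["dog", "bites", "man"], toy_vectors)
sent_b = mean_pool(["man", "bites", "dog"], toy_vectors)

print(sent_a)
print(sent_b)
print(sent_a == sent_b)  # word order is lost by averaging
```

Transformer-based sentence encoders avoid this because each token vector already depends on its position and neighbours before any pooling takes place.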

### Why Use Sentence Embeddings?

Here are some typical use cases where sentence embeddings outperform word embeddings:

#### 1. **Semantic Similarity**
Sentence embeddings are used to determine how semantically similar two sentences are. This is crucial for tasks like:
- **Paraphrase detection**: Checking if two sentences convey the same meaning.
- **Document retrieval**: Finding relevant documents based on a user’s query.
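Semantic similarity between two embeddings is commonly measured with cosine similarity, the cosine of the angle between the vectors. A minimal sketch follows, using made-up two-dimensional vectors in place of real sentence embeddings; `rank_by_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    # Return (document index, score) pairs, most similar first.
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

ranking = rank_by_similarity(query_vec, doc_vecs)
for index, score in ranking:
    print(index, round(score, 3))
```

Cosine similarity ignores vector length, so `[2.0, 0.0]` is a perfect match for `[1.0, 0.0]` even though the two vectors differ in magnitude.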

#### 2. **Text Classification**
In tasks like sentiment analysis or topic classification, sentence embeddings can represent the overall sentiment or topic of the entire sentence or document, rather than relying on individual words.

#### 3. **Summarization**
For automatic summarization, sentence embeddings help in understanding the main idea of each sentence and in selecting the most important sentences for the summary.

#### 4. **Question Answering**
Sentence embeddings are essential in question-answering systems where both the question and potential answers need to be compared for semantic similarity.

### Conclusion

**Sentence embeddings** are a powerful improvement over **word embeddings** because they encode entire sentences, allowing for a richer understanding of meaning, context, and sentence structure. While word embeddings are useful for word-level tasks, sentence embeddings shine in applications that require a deeper semantic understanding of text. Models like Sentence-BERT and the Universal Sentence Encoder have made it possible to leverage sentence embeddings in practical NLP applications like search, summarization, and sentiment analysis, offering much more nuanced and effective representations of text.

## III. Transformer

A **transformer** is a deep learning model architecture that revolutionized the field of Natural Language Processing (NLP) due to its ability to handle large amounts of data and capture long-range dependencies in sequences. It was introduced in the paper *"Attention Is All You Need"* by Vaswani et al. in 2017.

### III.1. Architecture

[Transformer](https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Transformer%2C_full_architecture.png/220px-Transformer%2C_full_architecture.png)


In [44]:
from IPython.display import HTML

# Image URL
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Transformer%2C_full_architecture.png/220px-Transformer%2C_full_architecture.png"

# Display the image as HTML
HTML(f'<img src="{image_url}" width="500">')




### III.2. Key Concepts of a Transformer

1. **Self-Attention Mechanism**: The core innovation of the transformer is its self-attention mechanism, which allows the model to weigh the importance of each word in a sequence relative to others. This enables the transformer to understand the context of a word based on its relationship with other words in the sentence, regardless of their position.

2. **Parallel Processing**: Unlike previous models like RNNs (Recurrent Neural Networks) that process sequences word by word, transformers can process the entire input sequence in parallel. This drastically improves training speed and efficiency.

3. **Positional Encoding**: Since transformers don’t inherently process sequences in order, they use positional encodings to keep track of word positions within the sequence, ensuring that the model understands the order of words.

4. **Layers and Attention Heads**: Transformers are made up of multiple layers of self-attention mechanisms and feedforward neural networks. They also use multiple "attention heads" to focus on different aspects of the sentence simultaneously.
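The self-attention mechanism described above can be sketched as scaled dot-product attention. To keep the example short, it assumes a single head and feeds the same toy vectors in as queries, keys, and values; a real transformer first applies learned projection matrices to obtain them.

```python
import math

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score the query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "token" vectors (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token attends to every token in the sequence. Because every row is computed independently of the others, all positions can be processed in parallel.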

### Applications:
Transformers have become the foundation for many state-of-the-art NLP models, like **BERT**, **GPT**, and **T5**, and are used for tasks such as machine translation, text generation, summarization, and more.

In summary, a transformer is a model architecture known for its attention mechanism, which excels at handling sequences of data efficiently, making it the backbone of modern NLP models.


## IV. ChromaDB

**ChromaDB** is an open-source vector database designed to efficiently store, index, and search high-dimensional vectors, often used in machine learning, natural language processing (NLP), and artificial intelligence applications.

Vectors, or "embeddings," are numerical representations that capture the semantic characteristics of data like text, images, or documents. ChromaDB is optimized for fast and accurate searches over these vectors, making it useful for tasks such as semantic search, content recommendation, or retrieval-augmented generation in chatbot systems.

[ChromaDB](https://www.trychroma.com/)

In [45]:
from IPython.display import HTML

# Image URL
image_url = "https://www.trychroma.com/_next/static/media/computer.fcd1bd54.svg"

# Display the image as HTML
HTML(f'<img src="{image_url}" width="500">')


### IV.1. Key Features
1. **Vector Storage and Indexing**: ChromaDB allows for the storage of large volumes of vectors and efficient access to them. Each vector is associated with a unique ID and can have associated metadata.
2. **Similarity Search**: The database is designed to quickly find the vectors most similar to a given query vector. Supported distance measures include cosine distance, squared Euclidean (L2) distance, and inner product.
3. **Metadata and Filtering**: You can add metadata to the vectors to perform filtered searches. This means you can search vectors based on their semantic characteristics while applying specific filters based on metadata.

### IV.2. Use Cases
- **Semantic Search**: Find similar documents, images, or text snippets to a specific query, for example, searching for relevant passages in a knowledge base.
- **Chatbots and Response Systems**: Retrieve relevant contextual information to feed into a natural language model’s responses.
- **Content Recommendation**: Recommend similar items to a user by comparing the vectors of their previous interactions with the available content.

### IV.3. Data Structure
To use ChromaDB, you need to organize the data as follows:
1. **Vectors (Embeddings)**: A list of vectors (a list or array of floats) representing the items to store.
   ```python
   vectors = [
       [0.1, 0.23, 0.35, ...],  # Vector 1
       [0.4, 0.57, 0.68, ...],  # Vector 2
       ...
   ]
   ```
2. **IDs**: Each vector must have a unique identifier to allow retrieval or updating.
   ```python
   ids = [
       "item_1",  # ID for vector 1
       "item_2",  # ID for vector 2
       ...
   ]
   ```
3. **Metadata (Optional)**: Additional information that can be associated with each vector, enabling filtered or contextual searches.
   ```python
   metadata = [
       {"category": "news", "date": "2023-01-01"},  # Metadata for vector 1
       {"category": "sports", "date": "2023-01-02"},  # Metadata for vector 2
       ...
   ]
   ```
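To make the role of the three parallel lists concrete, here is a toy in-memory store. It is an illustrative stand-in, not ChromaDB's implementation or API: the class name `ToyVectorStore` is hypothetical, the search is brute force, and the metadata filter only supports exact matches.

```python
class ToyVectorStore:
    def __init__(self):
        self._items = {}  # id -> (vector, metadata)

    def add(self, ids, embeddings, metadatas=None):
        metadatas = metadatas or [{} for _ in ids]
        if not (len(ids) == len(embeddings) == len(metadatas)):
            raise ValueError("ids, embeddings and metadatas must have the same length")
        for id_, vec, meta in zip(ids, embeddings, metadatas):
            if id_ in self._items:
                raise ValueError(f"duplicate id: {id_}")
            self._items[id_] = (list(vec), dict(meta))

    def query(self, query_embedding, n_results=2, where=None):
        where = where or {}
        candidates = []
        for id_, (vec, meta) in self._items.items():
            # Keep only items whose metadata matches every filter entry.
            if all(meta.get(k) == v for k, v in where.items()):
                dist = sum((a - b) ** 2 for a, b in zip(query_embedding, vec))
                candidates.append((dist, id_))
        candidates.sort()  # smallest squared L2 distance first
        return [id_ for _, id_ in candidates[:n_results]]

store = ToyVectorStore()
store.add(
    ids=["item_1", "item_2", "item_3"],
    embeddings=[[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]],
    metadatas=[{"category": "news"}, {"category": "sports"}, {"category": "sports"}],
)

nearest = store.query([0.1, 0.2], n_results=2)
nearest_sports = store.query([0.1, 0.2], n_results=2, where={"category": "sports"})
print(nearest)
print(nearest_sports)
```

A real vector database replaces the brute-force loop with an approximate nearest-neighbour index so that searches stay fast on large collections.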

### IV.4. Setup and Usage
Here’s a typical approach to using ChromaDB:
1. **Embedding Generation**: Generate the vectors from your data using a machine learning model, such as BERT for text.
2. **Data Insertion**: Insert the vectors, IDs, and metadata into the ChromaDB database.
3. **Search Queries**: Perform searches in the database to retrieve the vectors most similar to a given query vector.
4. **Using the Results**: Use the results for the task at hand, such as providing a response to a query or recommending relevant content.

ChromaDB can be integrated into machine learning or NLP applications that require fast and efficient searches over large volumes of vector data.

## V. Mini-Project

### V.1. Project Objective : Use Retrievers with LLM

#### Choice of Data Corpus
For this case, we will use the **CORD-19** corpus (COVID-19 Open Research Dataset), which contains research articles on COVID-19, SARS-CoV-2, and related coronaviruses. This corpus is accessible through the **Hugging Face Datasets** library.

#### Step 1: Preprocessing and Vectorizing the Data
We will use the same embedding model (**sentence-transformers/all-MiniLM-L6-v2**) to transform the scientific text into vectors. The most common way to measure similarity between text vectors is **cosine similarity**.

#### Installing the Libraries
If you have not installed them already:


In [None]:
!pip install chromadb transformers datasets sentence-transformers tensorflow tf-keras 



#### Code to Create the Vector Store
Here’s how to proceed to create a vector store from the CORD-19 corpus:

In [8]:
import chromadb
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load the embedding model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:

# Load a subset of the CORD-19 corpus
#dataset = load_dataset("cord19", "metadata", download_mode="force_redownload", split="train[:1000]")
#dataset.save_to_disk("Covid_19_dataset")
#dataset.to_csv('/content/covid_19_dataset.csv', index=True)


# Get and print the current working directory
#import os
#current_path = os.getcwd()
#print("current path :", current_path)

In [9]:
# Choose how to open the data

#from datasets import load_from_disk
#dataset = load_from_disk("Covid_19_dataset")
#dataset

# Load the CSV file
import pandas as pd
from datasets import Dataset

csv_path = "covid_19_dataset.csv"  # Replace with the path to your CSV file
dataframeP = pd.read_csv(csv_path)
# Convert the DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(dataframeP)

# Display the Dataset
dataset


Dataset({
    features: ['Unnamed: 0', 'cord_uid', 'sha', 'source_x', 'title', 'doi', 'abstract', 'publish_time', 'authors', 'journal', 'url'],
    num_rows: 1000
})

In [14]:
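# Run this only if you need to delete an existing collection.
# It requires chroma_client, which is created in the next cell, so run that cell first.
chroma_client.delete_collection(name="scientific_corpus")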

In [15]:
texts = dataset['abstract']
titles = dataset['title']  # Columns in the dataset: ['cord_uid', 'sha', 'source_x', 'title', 'doi', 'abstract', 'publish_time', 'authors', 'journal', 'url']

# Handle missing values (if any)
texts = [text if text is not None else "" for text in texts]
titles = [title if title is not None else "Untitled" for title in titles]


# Create an embedding for each text
embeddings = model.encode(texts)

# Create the vector store with ChromaDB
chroma_client = chromadb.Client()

collection = chroma_client.create_collection(name="scientific_corpus", metadata={"hnsw:space": "cosine"})

# Add the data to ChromaDB
collection.add(
    ids=[f"doc_{i}" for i in range(len(texts))],  # Unique ID for each document
    embeddings=embeddings.tolist(),  # Embedding vectors
    metadatas=[{"title": title} for title in titles],  # Store the title as metadata
    documents=texts  # Original text
)

print("Vector store created with ChromaDB for the scientific corpus!")


Vector store created with ChromaDB for the scientific corpus!


#### Step 2: Search for Similar Texts Using a New Sentence

We will now create a query, vectorize it, and search for similar scientific texts in our vector store.

In [None]:
# Query sentence for the similarity search
query = "Is there a vaccine for covid?"
#query = " Can you explain what is a highly pathogenic H5N1"
query_embedding = model.encode([query])

# Search for similar documents in ChromaDB
results = collection.query(
    query_embeddings=query_embedding.tolist(),
    n_results=5  # Number of similar results to return
)

# Display the results. Each field holds one inner list per query,
# so index [0] selects the hits for our single query.
for doc, distance, meta in zip(results['documents'][0], results['distances'][0], results['metadatas'][0]):
    print(f"Title: {meta['title']}\nText: {doc}\nCosine distance (lower is more similar): {distance}\n---")



In [53]:
meta['title']

'La Crosse virus infectivity, pathogenesis, and immunogenicity in mice and monkeys'

#### Step 3: Cosine Similarity Calculation

ChromaDB already provides a distance measure, but if you want to explicitly calculate the cosine similarity between the query and the vectors in the store, you can do so manually using scikit-learn.


In [54]:
# Compute the cosine similarity between the query embedding and the stored embeddings
cosine_similarities = cosine_similarity([query_embedding[0]], embeddings)[0]

# Find the indices of the 5 most similar documents
top_indices = np.argsort(cosine_similarities)[::-1][:5]

# Display the results
for idx in top_indices:
    print(f"Title: {titles[idx]}\nText: {texts[idx]}\nCosine similarity: {cosine_similarities[idx]}\n---")


Title: La Crosse virus infectivity, pathogenesis, and immunogenicity in mice and monkeys
Text: BACKGROUND: La Crosse virus (LACV), family Bunyaviridae, was first identified as a human pathogen in 1960 after its isolation from a 4 year-old girl with fatal encephalitis in La Crosse, Wisconsin. LACV is a major cause of pediatric encephalitis in North America and infects up to 300,000 persons each year of which 70–130 result in severe disease of the central nervous system (CNS). As an initial step in the establishment of useful animal models to support vaccine development, we examined LACV infectivity, pathogenesis, and immunogenicity in both weanling mice and rhesus monkeys. RESULTS: Following intraperitoneal inoculation of mice, LACV replicated in various organs before reaching the CNS where it replicates to high titer causing death from neurological disease. The peripheral site where LACV replicates to highest titer is the nasal turbinates, and, presumably, LACV can enter the CNS via
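The distance returned by ChromaDB in Step 2 and the similarity computed in Step 3 rank documents the same way, but in opposite directions. The sketch below assumes the collection uses the `cosine` space and that cosine distance is defined as 1 minus cosine similarity; the vectors are made-up toy values, not real embeddings.

```python
import math

def cosine_similarity_pair(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cosine_distance(u, v):
    # Assumed definition: distance = 1 - similarity.
    return 1.0 - cosine_similarity_pair(u, v)

query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [2.0, 0.0], "doc_b": [1.0, 1.0], "doc_c": [0.0, 3.0]}

# Smallest distance first ...
by_distance = sorted(doc_vecs, key=lambda k: cosine_distance(query_vec, doc_vecs[k]))
# ... gives the same order as largest similarity first.
by_similarity = sorted(
    doc_vecs, key=lambda k: cosine_similarity_pair(query_vec, doc_vecs[k]), reverse=True
)

print(by_distance)
print(by_similarity)
```

When reading results, a small distance therefore means a close match, whereas a small similarity means a poor one.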

#### Step 4: Potential Uses
- **Enhanced Semantic Search**: Search for research papers similar to a given question.
- **Literature Exploration**: Find articles related to a specific scientific topic.
- **Scientific Text Analysis**: Compare the similarity between new research and existing documents to discover connections or trends.

By using ChromaDB with sentence embeddings, you can easily search and compare scientific texts based on their semantic content.

### V.2. Implementation


To create a **RAG (Retrieval-Augmented Generation)** pipeline with **ChromaDB** using a large language model (**LLM**) loaded via **Ollama**, here’s a detailed approach on how to proceed.

### Step 1: Setting up the Environment
Make sure you have installed the necessary libraries:
- **Ollama** to load and run the LLM.
- **ChromaDB** to create the vector store.
- **Transformers and Sentence-Transformers** to generate embeddings.

### Step 2: Loading the LLM with Ollama
Ollama is a tool for downloading and running LLMs locally. Let’s assume you will run a small open model from the Ollama library (e.g., **gemma2:2b**, which is used in the code below).

To install Ollama on Linux with its install script:

In [36]:
!curl -fsSL https://ollama.com/install.sh | sh



### Use `%load_ext colabxterm` and `%xterm` only on Colab; otherwise, use your own terminal

In [12]:
%load_ext colabxterm

In [14]:
%xterm
# launch in the xterm terminal
# ollama serve

Launching Xterm...

🚀  Listen to 10001
{"success": true, "reason": null}



In [17]:
%xterm
# ollama run gemma2:2b

Launching Xterm...

<IPython.core.display.Javascript object>

#### ollama list

In [75]:
%xterm
#ollama list

Launching Xterm...

<IPython.core.display.Javascript object>

You can choose your model here:
[Ollama](https://ollama.com/library)

Make sure you have a functional model, such as **'gemma2:2b'**, to use in the following steps.

### Step 3: Creating the Vector Store with ChromaDB
For this step, let's use a scientific corpus (like CORD-19 or another relevant corpus), which we will vectorize and store in ChromaDB.

Run the following cell only if the `scientific_corpus` vector store has not been created yet:

In [20]:
from chromadb import Client
from sentence_transformers import SentenceTransformer
import pandas as pd

# Initialize an in-memory ChromaDB client with the default settings.
# For on-disk persistence, recent ChromaDB versions provide
# chromadb.PersistentClient(path="chroma_db") instead.
chroma_client = Client()

# Load and preprocess the corpus
corpus = pd.read_csv("covid_19_dataset.csv")
texts = corpus["abstract"].fillna("").tolist()  # Assume abstracts are in the 'abstract' column
metadata = corpus.apply(lambda row: {"title": row["title"], "id": row.name}, axis=1).tolist()

# Load a pre-trained model for vectorization
model = SentenceTransformer("all-MiniLM-L6-v2")

# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=True)

# Create the collection (or reuse it if it already exists), using cosine distance
collection_name = "scientific_corpus"
collection = chroma_client.get_or_create_collection(collection_name, metadata={"hnsw:space": "cosine"})

# Add documents to the collection
collection.add(
    documents=texts,
    metadatas=metadata,
    ids=[str(i) for i in range(len(texts))],
    embeddings=embeddings.tolist()
)

print(f"Corpus added to ChromaDB in the collection: {collection_name}")


Corpus added to ChromaDB in the collection: scientific_corpus


In [18]:
texts

['OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract infections, and 2 (5%) with bronchiolitis. Cough (82.5%), fever (75%), and malaise (58.8%) were

In [16]:
titles

['Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia',
 'Nitric oxide: a pro-inflammatory mediator in lung disease?',
 'Surfactant protein-D and pulmonary host defense',
 'Role of endothelin-1 in lung disease',
 'Gene expression in epithelial cells in response to pneumovirus infection',
 'Sequence requirements for RNA strand transfer during nidovirus discontinuous subgenomic RNA synthesis',
 'Debate: Transfusing to normal haemoglobin levels will not improve outcome',
 'The 21st International Symposium on Intensive Care and Emergency Medicine, Brussels, Belgium, 20-23 March 2001',
 'Heme oxygenase-1 and carbon monoxide in pulmonary medicine',
 'Technical Description of RODS: A Real-time Public Health Surveillance System',
 'Conservation of polyamine regulation by translational frameshifting from yeast to mammals',
 'Heterogeneous nuclear ribonucleoprotein A1 regulates RNA synthesis of a cytoplasmic virus',
 "A

#### Step 3: Integrating ChromaDB Search into a RAG Pipeline

Now that we have our vector store, we will create a function to search for relevant documents in ChromaDB and formulate a query for the LLM model via Ollama.

In [17]:
def retrieve_from_chromadb(query, top_k=5):
    # Generate the query embedding (assumes the embedding model `model` is already defined)
    query_embedding = model.encode([query])

    # Run the search in ChromaDB
    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=top_k
    )

    # Check the structure of the results.
    # Note: collection.query returns one inner list per query embedding,
    # so both values below are lists of lists.
    documents = results.get('documents', [])
    metadatas = results.get('metadatas', [])

    if isinstance(documents, list) and isinstance(metadatas, list):
        # Add a default title when the metadata does not provide one
        for i, meta in enumerate(metadatas):
            if not isinstance(meta, dict):
                metadatas[i] = {}  # If meta is not a dict, start from an empty dict
            if 'title' not in metadatas[i]:
                # Add a default title when none is provided
                metadatas[i]['title'] = f"Document {i+1}"

        return documents, metadatas
    else:
        raise ValueError("The returned results are not lists.")
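`collection.query` returns its fields as one inner list per query embedding, so a single query yields `documents = [[doc1, doc2, ...]]` rather than a flat list. The sketch below shows one way to unpack that shape. The helper name `unpack_single_query` and the sample dictionary are illustrative assumptions, not part of ChromaDB.

```python
def unpack_single_query(results, default_title="Document"):
    """Flatten a ChromaDB-style query result that holds exactly one query.

    `results` is assumed to look like {'documents': [[...]], 'metadatas': [[...]]}.
    Returns (documents, metadatas) as flat lists of the same length.
    """
    documents = (results.get("documents") or [[]])[0] or []
    raw_metas = (results.get("metadatas") or [[]])[0] or []

    metadatas = []
    for i in range(len(documents)):
        meta = raw_metas[i] if i < len(raw_metas) else None
        # Copy valid metadata, otherwise start from an empty dict
        meta = dict(meta) if isinstance(meta, dict) else {}
        meta.setdefault("title", f"{default_title} {i + 1}")
        metadatas.append(meta)
    return documents, metadatas


# Hand-written stand-in for a query result (not real corpus data)
sample = {
    "documents": [["first abstract", "second abstract"]],
    "metadatas": [[{"title": "Paper A"}, None]],
}
docs, metas = unpack_single_query(sample)
print(docs)
print(metas)
```

With this shape, each document is paired with its own metadata entry, which is what the context-building step expects.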


#### Function to Formulate a Query to the LLM via Ollama
We will query the LLM through the Ollama command-line interface, called with `subprocess` (make sure Ollama is properly set up to run the model).

In [18]:
import subprocess

def query_llm_with_ollama(prompt, model_name="gemma2:2b"):
    # Use the Ollama CLI to query the LLM
    result = subprocess.run(
        ["ollama", "run", model_name],
        input=prompt.encode('utf-8'),
        stdout=subprocess.PIPE
    )
    return result.stdout.decode('utf-8')
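Command-line tools such as `ollama run` can draw a progress spinner using terminal control sequences, and these can leak into captured or displayed output. Below is a minimal sketch of a cleanup step, assuming the raw text still contains the ESC characters. The helper name `clean_cli_output` is an assumption, not part of Ollama.

```python
import re

# CSI sequences: ESC, "[", optional parameter characters, then one final letter
ANSI_CSI = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")
# Braille pattern characters (U+2800 to U+28FF), commonly used as spinner frames
SPINNER = re.compile(r"[\u2800-\u28ff]")


def clean_cli_output(text):
    """Remove ANSI control sequences and spinner glyphs from CLI output."""
    text = ANSI_CSI.sub("", text)
    text = SPINNER.sub("", text)
    return text.strip()


raw = "\x1b[?25l\u2819 \x1b[?25h\x1b[2K\x1b[1GHello from the model"
print(clean_cli_output(raw))
```

Such a function could be applied to the decoded output before it is returned or printed.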

#### Step 4: Complete RAG Pipeline
We will create a RAG function to combine the search for relevant contexts in ChromaDB with querying the LLM to generate an enhanced response.

In [19]:
def rag_pipeline(query):
    # 1. Search ChromaDB for relevant documents
    try:
        retrieved_docs, metadatas = retrieve_from_chromadb(query)

        # Debugging the structure of retrieved metadatas
        print(f"Metadatas structure: {metadatas}")

        # 2. Check that the metadata entries are dictionaries containing the 'title' key
        if isinstance(metadatas, list) and all(isinstance(meta, dict) and 'title' in meta for meta in metadatas):
            # Build the context for the LLM prompt
            context = "\n\n".join([f"{meta['title']}: {doc}" for doc, meta in zip(retrieved_docs, metadatas)])
        else:
            raise ValueError("The metadata is not correctly formatted or does not contain the 'title' key.")

        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

        # 3. Query the LLM via Ollama
        response = query_llm_with_ollama(prompt)

        return response
    except ValueError as ve:
        print(f"Error: {ve}")
    except Exception as e:
        print(f"An error occurred: {e}")

#### Step 5: Testing the Pipeline

You can test the pipeline by formulating a relevant question for your scientific corpus.

In [21]:
query = "What impact will COVID-19 have on healthcare systems worldwide ?"
response = rag_pipeline(query)

print("LLM response:")
print(response)

Metadatas structure: [{'title': 'Document 1'}]

LLM response:
The provided text focuses on lessons learned from the 2009 H1N1 pandemic and the potential challenges of a similar outbreak with COVID-19, but it doesn't offer information about the exact impacts of COVID-19 on global healthcare systems.  

Here are some potential points that will likely be relevant to understand how COVID-19 is impacting healthcare worldwide:

* **Strain on Healthcare Systems:** 
    * Like the H1N1 pandemic, COVID-19 significantly strained healthcare systems globally, leading to overwhelmed hospitals, shortages of medical supplies and personal protective equipment (PPE), and increased mortality rates.
* **Adaptation and Innovation:** The pandemic prompted rapid adaptation in healthcare practices like telemedicine adoption, remote patient monitoring, and increased focus on infection control protocols. 
* **Resource Allocation:**  The pandemic highlighted the need for robust resource allocation strategies, including equitable access to medical resource


#### Step 6: Optimizations and Uses

1. **Improving Search**: You can adjust the number of retrieved results (`top_k`) or use a different embedding model to refine the relevance of retrieved documents.
2. **Prompt Engineering**: Adjust the prompt used for the LLM to get more precise or contextually rich answers.
3. **Integration into Applications**: Once the RAG pipeline is ready, it can be integrated into chatbot systems, semantic search engines, or research assistance systems to provide enhanced answers based on contextual knowledge.

By using ChromaDB for context search and Ollama for the LLM, you have a RAG pipeline for enriching responses based on the content of a scientific corpus.

# LLM Parameters

To adjust sampling parameters such as **temperature**, **top_k**, and **top_p** when querying the LLM via **Ollama**, the values must reach Ollama as generation options; how this is done depends on how Ollama exposes them for the loaded model. Most LLMs offer these parameters to control the creativity and diversity of their responses. Note that **top_k** here is a sampling parameter, distinct from the `top_k` argument of the retriever, which sets the number of documents returned.

Here’s how you can adapt the previous code to include these parameters.

### Step 1: Updating the `query_llm_with_ollama` Function
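The cells below first reload the ChromaDB collection, then redefine `query_llm_with_ollama` so that it captures Ollama's error output and returns `None` on failure. The `ollama run` command is called without sampling flags here, so **temperature**, **top_k**, and **top_p** are introduced in `rag_pipeline` in the next step.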

In [91]:
import chromadb
import subprocess
from sentence_transformers import SentenceTransformer

# Initialize ChromaDB client
chroma_client = chromadb.Client()

# Get the collection
collection = chroma_client.get_collection(name="scientific_corpus")

In [62]:
def query_llm_with_ollama(prompt, model_name="gemma2:2b"):
    """Queries an Ollama model with the given prompt."""
    result = subprocess.run(
        ["ollama", "run", model_name],
        input=prompt.encode('utf-8'),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE  # Capture errors
    )
    if result.returncode != 0:
        print(f"Error querying Ollama: {result.stderr.decode('utf-8')}")
        return None
    return result.stdout.decode('utf-8').strip()

### Step 2: Updating the `rag_pipeline` Function

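Update the `rag_pipeline` function so that it accepts these parameters. In this version the values are only written into the prompt as an instruction; they are not applied by Ollama as sampling settings.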

In [22]:
def rag_pipeline(query, temperature=0.7, top_k=50, top_p=0.9):
    """Executes the RAG pipeline."""
    try:
        retrieved_docs, metadatas = retrieve_from_chromadb(query)
        context = "\n\n".join([f"{meta['title']}: {doc}" for doc, meta in zip(retrieved_docs, metadatas)])
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        # Mention the requested values in the prompt text (this does not change Ollama's actual sampling settings)
        prompt = prompt + f"\n### Instruction:\nPlease answer the question based on the context provided.\nUse a temperature of {temperature}, top_k of {top_k}, and top_p of {top_p}.\n"  
        response = query_llm_with_ollama(prompt)
        return response
    except Exception as e:
        print(f"Error in RAG pipeline: {e}")
        return None
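Writing the values into the prompt does not change how tokens are sampled. One way to apply them for real is the `options` field of Ollama's HTTP endpoint `/api/generate`. The sketch below assumes an Ollama server listening on the default local address `http://localhost:11434`. The helper names `build_generate_payload` and `query_llm_with_options` are illustrative and not taken from the original notebook.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_generate_payload(prompt, model_name="gemma2:2b",
                           temperature=0.7, top_k=50, top_p=0.9):
    """Build the JSON body for a non-streaming generation request."""
    return {
        "model": model_name,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON reply instead of a stream
        "options": {"temperature": temperature, "top_k": top_k, "top_p": top_p},
    }


def query_llm_with_options(prompt, timeout=120, **kwargs):
    """Send the prompt to the Ollama server and return the generated text."""
    payload = build_generate_payload(prompt, **kwargs)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),  # a request body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]


# Inspect the request body without contacting the server
payload = build_generate_payload("Hello", temperature=0.2, top_k=5, top_p=0.1)
print(json.dumps(payload, indent=2))
```

Calling `query_llm_with_options(prompt, temperature=0.2, top_k=5, top_p=0.1)` in place of `query_llm_with_ollama(prompt)` would then pass the settings to the model, provided the server is running.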

In [24]:

def retrieve_from_chromadb(query, top_k=5):
    """Retrieves relevant documents from ChromaDB."""
    # Note: the model is reloaded on every call; loading it once outside the function is faster
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    # query() returns one inner list per query, so take the first (and only) one
    documents = results['documents'][0]
    metadatas = (results.get('metadatas') or [[]])[0] or []
    # Fall back to default titles when metadata is missing or has no 'title'
    if not (len(metadatas) == len(documents)
            and all(isinstance(meta, dict) and 'title' in meta for meta in metadatas)):
        metadatas = [{'title': f"Document {i+1}"} for i in range(len(documents))]
    return documents, metadatas



### Step 3: Testing the Pipeline with Custom Parameters

You can now test the pipeline by adjusting **`temperature`**, **`top_k`**, and **`top_p`** according to your needs:

In [25]:
# Example usage
user_query = "What impact will COVID-19 have on healthcare systems worldwide ?"
response = rag_pipeline(user_query, temperature=0.2, top_k=5, top_p=0.1)

if response:
    print("Ollama Response:\n", response)


Ollama Response:
 While the provided documents offer insights into pandemic preparedness and challenges faced with influenza outbreaks, they don't directly address the potential impact of COVID-19 on healthcare systems worldwide.  

To formulate a comprehensive answer about COVID-19's impact on global healthcare systems, we need to consider: 

**Key Factors:**

* **Global Spread & Strain on Resources:** The pandemic has overwhelmed healthcare resources globally, forcing nations to adapt and prioritize critical care while grappling with shortages of ventilators, testing kits, and medical staff.
* **Social Impact:** Lockdowns, travel restrictions, and economic instability have created a significant burden on individuals and societies. Healthcare systems are dealing with not just the disease itself but also its social and psychological consequences. 
* **Long-Term Effects & Public Health:** The pandemic's long-term health effects on populations remain largely unknown. Concerns about incre


### Adjusting the Parameters

- **Temperature**: Increase it (towards 1) for more varied, creative responses; decrease it (towards 0) for more deterministic and conservative ones.
- **Top_k**: The number of highest-probability tokens considered at each step. Lower it to limit token diversity, which can make the response more coherent.
- **Top_p**: The cumulative probability mass of the token "nucleus". A value close to 1 keeps almost all possible tokens, while a lower value keeps only the most probable ones.

Together with ChromaDB for contextual search in a RAG pipeline, these parameters let you tune the LLM's generation behavior to your application's requirements, provided they reach the model as actual generation options rather than as prompt text.
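To make the effect of the three parameters concrete, here is a toy illustration on a four-token vocabulary. It is a sketch of what temperature, top-k, and top-p filtering usually mean, not Ollama's implementation. The function name `next_token_distribution` and the example logits are assumptions, and the temperature must be strictly positive.

```python
import numpy as np


def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return next-token probabilities after temperature, top-k and top-p filtering."""
    logits = np.asarray(logits, dtype=float)

    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(-probs)  # token indices, most probable first
    keep = np.ones(len(probs), dtype=bool)

    if top_k is not None:
        keep[order[top_k:]] = False  # drop everything after the k best tokens

    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose cumulative probability reaches top_p
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()  # renormalize over the surviving tokens


logits = [2.0, 1.0, 0.5, 0.1]
print("baseline       :", next_token_distribution(logits).round(3))
print("temperature=0.2:", next_token_distribution(logits, temperature=0.2).round(3))
print("top_k=2        :", next_token_distribution(logits, top_k=2).round(3))
print("top_p=0.5      :", next_token_distribution(logits, top_p=0.5).round(3))
```

Lowering the temperature concentrates probability on the best token, while `top_k` and `top_p` remove low-probability tokens before sampling.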

# Weighting ... the Key to Success!

A **weighted sum of embeddings** combines several vectors of a document into one, giving each vector its own weight. In the context of **ChromaDB**, this can be useful to give more importance to certain parts of the text or to adjust the contribution of each embedding in the similarity calculation.

Here’s how you can proceed to calculate a distance using a weighted sum of embeddings.

### Step 1: Preparing Embeddings and Weights
Suppose you have a list of embeddings (vectors) for each text, and a corresponding list of weights that determine the importance of each embedding. The weights could be based on factors such as the relevance of the context or the position of the words in the text.

In [29]:
import numpy as np

# Example embeddings of a document (each row is one embedding)
embeddings = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9]
]

# Weight associated with each embedding
weights = [0.2, 0.5, 0.3]

### Step 2: Calculating the Weighted Sum of Embeddings

The weighted sum can be calculated by multiplying each embedding by its weight, then summing the results.

In [30]:
def weighted_sum_embeddings(embeddings, weights):
    # Convert the lists to numpy arrays
    embeddings = np.array(embeddings)
    weights = np.array(weights)

    # Normalize the weights so that they sum to 1
    weights = weights / weights.sum()

    # Compute the weighted sum (each row is scaled by its weight, then rows are added)
    weighted_embedding = np.sum(embeddings * weights[:, None], axis=0)
    return weighted_embedding

Let's apply the weighted sum function to the embeddings:

In [31]:
weighted_embedding = weighted_sum_embeddings(embeddings, weights)
print("Weighted Embedding:", weighted_embedding)

Weighted Embedding: [0.43 0.53 0.63]
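Written out, with raw weights $w_i$, normalized weights $\tilde{w}_i$, and embeddings $\mathbf{e}_i$, the function computes:

$$
\tilde{w}_i = \frac{w_i}{\sum_j w_j}, \qquad \mathbf{e}_{\text{weighted}} = \sum_i \tilde{w}_i \, \mathbf{e}_i
$$

In the example the weights already sum to 1, so normalization leaves them unchanged. The three components are:

$$
\begin{aligned}
0.2 \cdot 0.1 + 0.5 \cdot 0.4 + 0.3 \cdot 0.7 &= 0.43 \\
0.2 \cdot 0.2 + 0.5 \cdot 0.5 + 0.3 \cdot 0.8 &= 0.53 \\
0.2 \cdot 0.3 + 0.5 \cdot 0.6 + 0.3 \cdot 0.9 &= 0.63
\end{aligned}
$$

which matches the printed result.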


### Step 3: Calculating the Distance
Once you have the weighted sum of your embeddings, you can use a distance or similarity measure to compare this vector with another, such as a query.

#### Calculating Cosine Similarity
Cosine similarity is a commonly used measure to compare text vectors (the cosine distance is simply 1 minus the similarity):

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

# Query embedding
query_embedding = [0.5, 0.5, 0.5]

# Cosine similarity between the weighted sum and the query embedding
cosine_sim = cosine_similarity([weighted_embedding], [query_embedding])[0][0]
print("Cosine Similarity:", cosine_sim)

Cosine Similarity: 0.9883405131724471


#### Other Distance Measures
You can also use other measures, such as **Euclidean distance**, to compare embeddings:

In [33]:
from scipy.spatial.distance import euclidean

# Euclidean distance between the weighted sum and the query embedding
euclidean_dist = euclidean(weighted_embedding, query_embedding)
print("Euclidean Distance:", euclidean_dist)

Euclidean Distance: 0.15066519173319362
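The two measures are closely related: for vectors scaled to unit length, the squared Euclidean distance equals twice the cosine distance. The sketch below checks this on the example vectors using plain NumPy. The helper name `cosine_similarity_1d` is an assumption introduced for illustration.

```python
import numpy as np


def cosine_similarity_1d(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


weighted_embedding = [0.43, 0.53, 0.63]
query_embedding = [0.5, 0.5, 0.5]

similarity = cosine_similarity_1d(weighted_embedding, query_embedding)
cosine_distance = 1.0 - similarity

# Scale both vectors to unit length, then measure the squared Euclidean distance
a = np.array(weighted_embedding) / np.linalg.norm(weighted_embedding)
b = np.array(query_embedding) / np.linalg.norm(query_embedding)
squared_euclidean = float(np.sum((a - b) ** 2))

print("Cosine distance:", cosine_distance)
print("Squared Euclidean distance on unit vectors:", squared_euclidean)
print("Twice the cosine distance:", 2 * cosine_distance)
```

On normalized embeddings the two measures therefore rank documents in the same order; on unnormalized embeddings they can disagree, because Euclidean distance also depends on vector length.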


### Step 4: Application in the Context of ChromaDB

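When searching for documents in ChromaDB, you can score each stored document with a weighted combination of its title and its text: embed both, compare each with the query, and combine the two similarities using the chosen weights. The documents with the highest combined score are the most similar to the query.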

Here’s how this can be integrated into a contextual search function:

In [121]:

def retrieve_with_weighted_title_text(query, collection, title_weight=0.3, text_weight=0.7, top_k=5):
    """
    Retrieves documents from the ChromaDB collection based on weighted similarity of titles and texts.

    Args:
        query (str): The search query.
        collection (chromadb.Collection): The ChromaDB collection to search in.
        title_weight (float, optional): Weight for title similarity (default is 0.3).
        text_weight (float, optional): Weight for text similarity (default is 0.7).
        top_k (int, optional): Number of top results to return (default is 5).

    Returns:
        list: A list of tuples (document_id, similarity_score, text, index) for the top_k results.
    """

    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = model.encode([query])[0]
    all_documents = collection.get()

    results = []
    for i in range(len(all_documents['documents'])):
        metadata = all_documents['metadatas'][i]
        if metadata is not None:
            title = metadata.get('title', '')
        else:
            title = ""
        text = all_documents['documents'][i]

        title_embedding = model.encode([title])[0]
        text_embedding = model.encode([text])[0]

        # Calculate weighted similarity
        title_similarity = cosine_similarity([query_embedding], [title_embedding])[0][0]
        text_similarity = cosine_similarity([query_embedding], [text_embedding])[0][0]
        weighted_similarity = title_weight * title_similarity + text_weight * text_similarity

        results.append((all_documents['ids'][i], weighted_similarity, text, i))
        #results.append((i, weighted_similarity))

    results.sort(key=lambda x: x[1], reverse=True)  # Sort by similarity score
    return results[:top_k]  # Return top_k results

In [123]:
# Assuming 'collection' is your ChromaDB collection
query = "Can you describe the impact of the covid ?"
top_results = retrieve_with_weighted_title_text(query, collection, title_weight=0.3, text_weight=0.7)

for doc_id, similarity_score, text, i in top_results:
    print(f"Document ID: {doc_id}, Similarity Score: {similarity_score}")
    print("Response:\n", text)

Document ID: id2, Similarity Score: 0.1143983006477356
Response:
 This is document 2
Document ID: id1, Similarity Score: 0.1054505705833435
Response:
 This is document 1
Document ID: id3, Similarity Score: 0.10415945947170258
Response:
 This is document 3
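Encoding every title and text one by one inside the loop is slow on a large collection. Since `SentenceTransformer.encode` accepts a list of strings, the embeddings can be computed in batches and the scoring done with matrix operations. The sketch below shows only the scoring part on toy 2-D vectors. The function names, the toy vectors, and the use of plain NumPy instead of scikit-learn are assumptions for illustration.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row to unit length (rows with zero norm are left at zero)."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)


def rank_by_weighted_similarity(query_vec, title_vecs, text_vecs,
                                title_weight=0.3, text_weight=0.7, top_k=5):
    """Rank documents by a weighted sum of title and text cosine similarities."""
    q = normalize_rows([query_vec])[0]
    title_sims = normalize_rows(title_vecs) @ q  # one cosine similarity per document
    text_sims = normalize_rows(text_vecs) @ q
    scores = title_weight * title_sims + text_weight * text_sims
    order = np.argsort(-scores)[:top_k]  # best scores first
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D vectors standing in for real embeddings
query = [1.0, 0.0]
titles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

ranking = rank_by_weighted_similarity(query, titles, texts, top_k=2)
print(ranking)
```

In the real pipeline, `title_vecs` and `text_vecs` would come from two batch calls such as `model.encode(list_of_titles)` and `model.encode(list_of_texts)`, and the returned indices would be mapped back to the collection's IDs.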


### Step 5: Running the Weighted Search
The weighted search needs the persisted `scientific_corpus` collection. The cells below reopen it with a `PersistentClient`, inspect its contents, and then run the search and the LLM query:

In [None]:
import chromadb

# Persistent storage directory
PERSIST_DIRECTORY = "RAG_OLLAMA"

# Use PersistentClient so that the data persists across sessions
chroma_client = chromadb.PersistentClient(path=PERSIST_DIRECTORY)

# List the existing collections
print("Available collections:", chroma_client.list_collections())

# Load the existing collection
collection_name = "scientific_corpus"
try:
    collection = chroma_client.get_collection(name=collection_name)
    print(f"Collection '{collection_name}' loaded successfully!")
except Exception as e:
    print(f"Error: could not load collection '{collection_name}' - {e}")


Available collections: [Collection(name=scientific_corpus)]
Collection 'scientific_corpus' loaded successfully!


In [25]:
all_docs = collection.get()
print("📝 Retrieved documents:", all_docs)



In [10]:
# Load a specific collection
collection_name = "scientific_corpus"  # Replace with the name of your collection
collection = chroma_client.get_collection(collection_name)

# Display the collection-level metadata
print(f" Metadata of collection '{collection_name}':")
print(collection.metadata)

# Fetch the stored items to see the structure
sample_data = collection.get()
print("\n Example of the data structure:")
print(sample_data.keys())  # Available keys (e.g. 'ids', 'documents', 'embeddings', 'metadatas')

# Check a sample of the stored data (the key for identifiers is 'ids', plural)
if 'ids' in sample_data and sample_data['ids']:
    print("\n Example of a stored ID:", sample_data['ids'][0])
if 'documents' in sample_data and sample_data['documents']:
    print("\n Example document:", sample_data['documents'][0])
if 'embeddings' in sample_data and sample_data['embeddings']:
    print("\n Example embedding (length):", len(sample_data['embeddings'][0]))
if 'metadatas' in sample_data and sample_data['metadatas']:
    print("\n Example metadata:", sample_data['metadatas'][0])


 Metadata of collection 'scientific_corpus':
None

 Example of the data structure:
dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])

 Example document: OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Tw

In [None]:
import chromadb
import subprocess
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# ChromaDB persistence settings
PERSIST_DIRECTORY = "RAG_OLLAMA"  # Path where ChromaDB stores its data

# Initialize the ChromaDB client with persistence
chroma_client = chromadb.PersistentClient(path=PERSIST_DIRECTORY)

# Reload the existing collection (without recreating it)
try:
    collection = chroma_client.get_collection(name="scientific_corpus")
except Exception as e:
    print(f"Error while loading the collection: {e}")
    collection = None

if collection is None:
    raise ValueError("The collection 'scientific_corpus' does not exist. Check your storage or create it first.")

# Compute the weighted sum of embeddings
def weighted_sum_embeddings(embeddings, weights):
    embeddings = np.array(embeddings)
    weights = np.array(weights)
    weights = weights / weights.sum()
    weighted_embedding = np.sum(embeddings * weights[:, None], axis=0)
    return weighted_embedding

# Rank all documents by cosine similarity between the query and the
# weighted title/text embedding (most similar first)
def calculate_weighted_distance(query, collection, title_weight=0.7, text_weight=0.3):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = model.encode([query])[0]
    all_documents = collection.get()

    if not all_documents or not all_documents['documents'] or not all_documents['metadatas'] or not all_documents['ids']:
        print("The collection is empty or does not contain the required data.")
        return []

    results = []
    for i in range(len(all_documents['documents'])):
        metadata = all_documents['metadatas'][i] or {}
        title = metadata.get('title', '')
        text = all_documents['documents'][i]
        
        title_embedding = model.encode([title])[0]
        text_embedding = model.encode([text])[0]
        weighted_embedding = weighted_sum_embeddings([title_embedding, text_embedding], [title_weight, text_weight])
        # Despite the function name, this is a similarity: higher means closer
        similarity = cosine_similarity([query_embedding], [weighted_embedding])[0][0]
        results.append((all_documents['ids'][i], similarity))

    results.sort(key=lambda x: x[1], reverse=True)
    return results

def query_ollama_with_weighted_embedding_and_parameters(query,
                                                        model_name="gemma2:2b",
                                                        title_weight=0.5,
                                                        text_weight=0.5,
                                                        temperature=0.7,
                                                        top_k=10,
                                                        top_p=0.1):
    # Note: temperature, top_k and top_p are accepted but not passed to `ollama run` below

    results = calculate_weighted_distance(query, collection, title_weight, text_weight)
    print("results : ", results)

    if results:
        best_match_id = results[0][0]
        print(f"🔍 Best matching ID: {best_match_id}")

        # Fetch the document stored under this ID (`where` filters on metadata, so use `ids`)
        fetched_data = collection.get(ids=[best_match_id])
        best_match_document = fetched_data['documents'][0] if fetched_data['documents'] else ""

        # Print it to check the structure
        print("🔍 Data retrieved for this ID:", best_match_document)
        prompt = f"Context:\n{best_match_document}\n\nQuestion: {query}\n\nAnswer:"


    else:
        print("⚠️ No relevant result found.")
        best_match_document = None
        prompt = f"⚠️ No context available.\n\nQuestion: {query}\nAnswer:"

    # Build the command without the invalid parameters
    command = ["ollama", "run", model_name]

    # Run the model with the prompt as input
    result = subprocess.run(command, input=prompt.encode('utf-8'), stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    if result.returncode != 0:
        print(f"Error while querying Ollama: {result.stderr.decode('utf-8')}")
        return None

    return result.stdout.decode('utf-8').strip()

# Example usage
if __name__ == "__main__":
    user_query = "Can you describe the impact of the covid ?"
    response = query_ollama_with_weighted_embedding_and_parameters(user_query)
    
    if response:
        print("Ollama response:\n", response)

results :  [('439', 0.47782186930882903), ('253', 0.46415572215737544), ('626', 0.45563060150759854), ('685', 0.444258781037017), ('948', 0.441807892293513), ('213', 0.4404301538005053), ('791', 0.43706882637483213), ('48', 0.4366887954696191), ('362', 0.4364371505746746), ('919', 0.4287178721636105), ('432', 0.4274468133609167), ('680', 0.4185063969512079), ('125', 0.415119223428856), ('81', 0.4150889480829245), ('508', 0.4144397272929265), ('897', 0.41315939678107444), ('218', 0.4124353019758276), ('287', 0.4121308806887559), ('20', 0.4107113452720892), ('985', 0.4094133552479491), ('268', 0.405050812008223), ('756', 0.4039665689172962), ('219', 0.4034160770742994), ('494', 0.3972119258318355), ('627', 0.39530148368990614), ('509', 0.3944219488234698), ('318', 0.3893573270512606), ('194', 0.3882079671496974), ('467', 0.38786864594425374), ('82', 0.3877188768247351), ('129', 0.3876857861137708), ('670', 0.38536976279752033), ('207', 0.37949381436235374), ('459', 0.3786194523788448), (

#### For the final result, we can combine both: the retriever's results are passed to the LLM called through Ollama
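One way to combine the two is to turn the top-ranked `(id, score)` pairs into a context block and place it before the question. The sketch below assumes a ranked list shaped like the output of `calculate_weighted_distance` and document texts available in a dictionary keyed by ID. The function name `build_prompt_from_ranking`, the `top_n` parameter, and the sample data are illustrative placeholders.

```python
def build_prompt_from_ranking(query, ranked, documents_by_id, top_n=3):
    """Build an LLM prompt from the top_n best-ranked documents."""
    context_blocks = []
    for doc_id, score in ranked[:top_n]:
        text = documents_by_id.get(doc_id)
        if text is None:
            continue  # skip IDs that have no stored text
        context_blocks.append(f"[{doc_id}] (score {score:.3f})\n{text}")
    context = "\n\n".join(context_blocks) if context_blocks else "No context available."
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Placeholder ranking and texts (not real corpus data)
ranked = [("b", 0.48), ("a", 0.46), ("c", 0.21)]
documents_by_id = {
    "a": "Text of document a.",
    "b": "Text of document b.",
    "c": "Text of document c.",
}

prompt = build_prompt_from_ranking("What is discussed?", ranked, documents_by_id, top_n=2)
print(prompt)
```

The resulting string can then be sent to the model with `query_llm_with_ollama(prompt)`. Using several documents instead of only the best match gives the LLM more context, at the cost of a longer prompt.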

### Conclusion
The weighted sum of embeddings allows you to adjust the contribution of each portion of a text based on its relevance or relative importance. This approach can be very useful for semantic search tasks where certain words or phrases should carry more weight than others.

By appropriately adjusting the weights, you can optimize search or recommendation results to be more relevant to the user’s query.


[To go further... LangChain](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/)

# Bonus

```bash
pip install streamlit
```
One possible RAG solution is in the file `streamlit_RAG.py`.
You can test it with this command:
```bash
streamlit run streamlit_RAG.py
```