## <span style="color: green;">**Vector Databases**</span> using <span style="color: red;">**ChromaDB**</span> with Embeddings for Long Texts

### **ChromaDB vs FAISS**

| **Feature**                 | **FAISS**                    | **ChromaDB**                  |
|------------------------------|------------------------------|-------------------------------|
| **Storage**                 | Requires separate metadata   | Integrated metadata           |
| **Flexibility**             | Adds easy, updates limited   | Dynamic, supports updates     |
| **Ease of Use**             | More complex setup           | Simpler API                   |
| **Performance**             | Highly optimized             | Good for medium datasets      |
| **Offline Support**         | Yes                          | Yes                           |

**We therefore chose ChromaDB, mainly for its integrated metadata storage, support for updates, and simpler API.**


### **Embedding models**


| **Model**               | **Dimensionality** | **Quality**          | **Offline** | **Use Cases**                                | **Speed**      |
|--------------------------|--------------------|----------------------|-------------|---------------------------------------------|----------------|
| **text-embedding-ada-002** | 1536               | Best-in-class        | No          | Complex queries, large-scale semantic tasks | Moderate (API) |
| **All-MiniLM-L6-v2**     | 384                | Good                 | Yes         | Lightweight tasks, semantic search          | Fast           |
| **Instructor-XL**        | 768                | Very Good            | Yes         | Knowledge bases, task-specific embeddings   | Moderate       |
| **MPNet**                | 768                | Very Good            | Yes         | Context-aware embeddings, multilingual      | Moderate       |
| <span style="color: green;">**GTR-T5 (Large)**</span>       | 768                | Excellent            | Yes         | Cross-domain, large-scale retrieval         | Slower         |
| **Sentence-BERT**        | 768                | Very Good            | Yes         | Sentence similarity, classification         | Moderate       |

**We chose GTR-T5 (Large): it is the highest-rated model in this comparison that is free and runs offline.**

## **Implementation 😎**

### Install the required libraries

In [1]:
!pip install transformers -U
!pip install chromadb
!pip install sentence-transformers

Collecting transformers
  Downloading transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Using cached tokenizers-0.21.0-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Downloading transformers-4.47.1-py3-none-any.whl (10.1 MB)
Using cached tokenizers-0.21.0-cp39-abi3-win_amd64.whl (2.4 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.20.3
    Uninstalling tokenizers-0.20.3:
      Successful

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chromadb 0.5.23 requires tokenizers<=0.20.3,>=0.13.2, but you have tokenizers 0.21.0 which is incompatible.




ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
transformers 4.47.1 requires tokenizers<0.22,>=0.21, but you have tokenizers 0.20.3 which is incompatible.



Collecting tokenizers<=0.20.3,>=0.13.2 (from chromadb)
  Using cached tokenizers-0.20.3-cp311-none-win_amd64.whl.metadata (6.9 kB)
Using cached tokenizers-0.20.3-cp311-none-win_amd64.whl (2.4 MB)
Installing collected packages: tokenizers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.21.0
    Uninstalling tokenizers-0.21.0:
      Successfully uninstalled tokenizers-0.21.0
Successfully installed tokenizers-0.20.3
Collecting tokenizers<0.22,>=0.21 (from transformers<5.0.0,>=4.41.0->sentence-transformers)
  Using cached tokenizers-0.21.0-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Using cached tokenizers-0.21.0-cp39-abi3-win_amd64.whl (2.4 MB)
Installing collected packages: tokenizers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.20.3
    Uninstalling tokenizers-0.20.3:
      Successfully uninstalled tokenizers-0.20.3
Successfully installed tokenizers-0.21.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chromadb 0.5.23 requires tokenizers<=0.20.3,>=0.13.2, but you have tokenizers 0.21.0 which is incompatible.
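
The errors above come from two requirements on `tokenizers` that do not overlap, which is why pip keeps swapping 0.20.3 and 0.21.0. As a minimal sketch, the check below encodes the two ranges exactly as they are quoted in the log and tests both versions against them. The helper names (`parse_version`, `ok_for_chromadb`, `ok_for_transformers`) are made up for this illustration, and the parser only handles plain numeric versions.

```python
def parse_version(version: str) -> tuple:
    """Turn a plain numeric version such as '0.20.3' into (0, 20, 3)."""
    return tuple(int(part) for part in version.split("."))


def ok_for_chromadb(version: str) -> bool:
    # From the log: chromadb 0.5.23 requires tokenizers<=0.20.3,>=0.13.2
    return (0, 13, 2) <= parse_version(version) <= (0, 20, 3)


def ok_for_transformers(version: str) -> bool:
    # From the log: transformers 4.47.1 requires tokenizers<0.22,>=0.21
    return (0, 21) <= parse_version(version) < (0, 22)


# The two tokenizers versions that pip keeps swapping in the log
for candidate in ("0.20.3", "0.21.0"):
    print(candidate, ok_for_chromadb(candidate), ok_for_transformers(candidate))
# 0.20.3 True False
# 0.21.0 False True
```

Since no version passes both checks, one of the two packages has to move. A possible way out, not tested here, is to install all three packages in a single `pip install` command so that the resolver can pick a mutually compatible set, or to pin an older `transformers` release explicitly.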


### Initialize ChromaDB client

In [1]:
import chromadb

db_path = "./vektor_DB"  # store the database in a local folder so it persists between runs
client = chromadb.PersistentClient(path=db_path)

### Load the gtr-t5-large model

In [2]:
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

# Load the SentenceTransformer model used to compute the embeddings.
# After some research, this was one of the best free SentenceTransformer models.
embedding_model = SentenceTransformer("sentence-transformers/gtr-t5-large")



### Create a Collection

In [3]:
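# get_or_create_collection reuses the persisted collection when the notebook is re-run;
# create_collection would raise an error if "meinungen" already exists.
collection = client.get_or_create_collection("meinungen")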

### Load our CSV Data

In [4]:
import pandas as pd
csv_file = "Daten.csv"
data = pd.read_csv(csv_file)

### Split the long texts into chunks

In [5]:
# The texts are very long, which was our second problem, so we split them into smaller chunks.
def chunk_text(text, max_tokens=512):
    # Note: despite its name, max_tokens counts whitespace-separated words,
    # not model tokens, so a chunk can still exceed the model's token limit.
    words = text.split()
    chunks = []
    # Take max_tokens words at a time until none are left
    for start in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks
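
Because `chunk_text` splits on whitespace, a chunk of 512 words will usually be longer than 512 model tokens, and the part beyond the model's input limit may be cut off silently when it is encoded. As a rough sketch of an alternative, the function below packs words greedily until a token budget is reached. The name `chunk_by_token_budget` and the `count_tokens` callable are assumptions introduced for this example. With the real model, `count_tokens` could be something like `lambda s: len(tokenizer.encode(s))` for a tokenizer loaded once via `AutoTokenizer.from_pretrained(...)`.

```python
from typing import Callable, List


def chunk_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack whole words into chunks of at most max_tokens tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        # Close the current chunk when adding this word would exceed the budget
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Stand-in counter for the demo: one "token" per whitespace-separated word
demo_chunks = chunk_by_token_budget("a b c d e", lambda s: len(s.split()), max_tokens=2)
print(demo_chunks)  # ['a b', 'c d', 'e']
```

This version keeps the original text intact, since nothing is decoded from token IDs, at the cost of re-counting the growing chunk for every word. A single word that exceeds the budget on its own still ends up as an oversized chunk.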

### Add records to ChromaDB

In [None]:
def add_to_collection(party, theme, long_text):
    # Skip rows whose text is not a string: a NaN value near the end of the data
    # caused an error, and the whole import had to be restarted from the beginning.
    if not isinstance(long_text, str):
        print(f"Skipping invalid data: party={party}, theme={theme}, long_text={long_text}")
        return

    chunks = chunk_text(long_text)  # split the long text into chunks
    for chunk_index, chunk in enumerate(chunks):
        embedding = embedding_model.encode(chunk)

        collection.add(
            documents=[chunk],
            embeddings=[embedding],
            metadatas=[{"party": party, "theme": theme, "chunk_index": chunk_index}],
            # One ID per chunk; it repeats if the same party/theme pair occurs in more than one row
            ids=[f"{party}_{theme}_chunk_{chunk_index}"],
        )

# Iterate over all rows and add them to ChromaDB
for index, row in data.iterrows():
    party = row['Partei']
    theme = row['Thema']
    long_text = row['Meinung']
    add_to_collection(party, theme, long_text)
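
The "existing embedding ID" messages in the output of this cell suggest that some IDs collide, either because the same party and theme occur in more than one row of the CSV or because the cell was re-run against the persisted database. A minimal sketch of one way to avoid collisions within a run is to include the row index in the ID. The helper `make_chunk_id` and the two sample rows are hypothetical and only serve this illustration.

```python
def make_chunk_id(row_index: int, party: str, theme: str, chunk_index: int) -> str:
    """Build an ID that stays unique even when party and theme repeat across rows."""
    return f"row{row_index}_{party}_{theme}_chunk_{chunk_index}"


# Two hypothetical rows that share the same party and theme
rows = [(0, "AFD", "Quote"), (1, "AFD", "Quote")]

old_ids = [f"{party}_{theme}_chunk_0" for _, party, theme in rows]
new_ids = [make_chunk_id(i, party, theme, 0) for i, party, theme in rows]

print(len(set(old_ids)), len(set(new_ids)))  # 1 2
```

In the loop above, the `index` value from `data.iterrows()` could serve as `row_index`. Re-running the cell would still hit IDs that are already stored; for that case ChromaDB's `collection.upsert`, which takes the same arguments as `collection.add`, may be the better fit.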

Add of existing embedding ID: AFD_Volksabstimmungen nach Schweizer Modell_chunk_0
Insert of existing embedding ID: AFD_Volksabstimmungen nach Schweizer Modell_chunk_0
Add of existing embedding ID: AFD_Volksabstimmungen nach Schweizer Modell_chunk_0
Insert of existing embedding ID: AFD_Volksabstimmungen nach Schweizer Modell_chunk_0
Add of existing embedding ID: AFD_quote_chunk_0
Insert of existing embedding ID: AFD_quote_chunk_0
Add of existing embedding ID: AFD_Außen- und Verteidigungspolitik_chunk_0
Insert of existing embedding ID: AFD_Außen- und Verteidigungspolitik_chunk_0
Add of existing embedding ID: AFD_Quote_chunk_0
Insert of existing embedding ID: AFD_Quote_chunk_0
Add of existing embedding ID: AFD_Quote_chunk_0
Insert of existing embedding ID: AFD_Quote_chunk_0
Add of existing embedding ID: AFD_quote_chunk_0
Insert of existing embedding ID: AFD_quote_chunk_0
Add of existing embedding ID: AFD_quote_chunk_0
Insert of existing embedding ID: AFD_quote_chunk_0
Add of existing embe