# BASIC RAG - Indexing


### Ingestion

We start by loading the documents:

In [1]:
from llama_index import SimpleDirectoryReader, Document

# Recursively load every file found under the "data" directory
documents = SimpleDirectoryReader("data", recursive=True).load_data()

In [2]:
import re

for doc in documents:
    # Normalize typographic apostrophes to plain ASCII ones
    doc.text = doc.text.replace("’", "'")
    # doc.text = re.sub(r'[a-zA-Z]\n[a-z]', '', doc.text)
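
The commented-out substitution above, if enabled as written, would replace the whole match with an empty string, deleting the letter on each side of the newline along with the newline itself. A minimal sketch of a variant that keeps both letters by using capture groups follows. The helper name `normalize_text` and the choice of joining the two lines with a single space are assumptions for illustration, not part of the original notebook.

```python
import re


def normalize_text(text: str) -> str:
    """Illustrative cleanup helper (hypothetical name, not from the notebook)."""
    # Replace typographic apostrophes with plain ASCII ones, as in the cell above
    text = text.replace("’", "'")
    # Join a line break that falls between a letter and a lowercase letter,
    # keeping both letters (assumption: such breaks are hard wraps mid-sentence)
    return re.sub(r"([a-zA-Z])\n([a-z])", r"\1 \2", text)


sample = "the doc’s text was\nwrapped.\nNext sentence"
print(normalize_text(sample))
```

Applied in the loop, this would read `doc.text = normalize_text(doc.text)`. Line breaks that follow punctuation are left untouched, so sentence and paragraph boundaries survive.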

### ChromaDB persistent client

Next, we index them in a persistent ChromaDB client, specifying the embedding model.  
Here we use the HuggingFace model sentence-transformers/all-MiniLM-L6-v2.

In [2]:
import chromadb
from chromadb.utils import embedding_functions

# Embedding function attached to the Chroma collection
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
# Record which embedding function was used for this index
with open("log/embedding_function.txt", "w") as file:
    file.write("embedding_functions.SentenceTransformerEmbeddingFunction(model_name='all-MiniLM-L6-v2')")
chroma_client = chromadb.PersistentClient("./dbs/documentation/chroma_index")
try:
    chroma_collection = chroma_client.create_collection("Basic_rag", embedding_function=sentence_transformer_ef)
except Exception:
    # Creation failed, most likely because the collection already exists:
    # drop it and start again from an empty collection
    chroma_client.delete_collection(name="Basic_rag")
    chroma_collection = chroma_client.create_collection("Basic_rag", embedding_function=sentence_transformer_ef)

### Embedding 

- LangChain Integrations

Link: https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface.html

We load the model:

In [3]:
# Model

from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"))

Next, we choose the chunking parameters: the chunk size and the overlap.

In [4]:
# Text splitter 

from llama_index.text_splitter import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=200, chunk_overlap=10)
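
To make the two parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `SentenceSplitter` works internally (that splitter also tries to keep sentences whole and counts tokens with a tokenizer). The function `chunk_tokens` and the whitespace tokenization are assumptions for illustration only.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Whitespace split stands in for a real tokenizer (assumption)
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With the notebook's settings (`chunk_size=200`, `chunk_overlap=10`), each window in this simplified scheme would advance by 190 tokens, so consecutive chunks share 10 tokens of context.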

Then we can run the embedding:

In [5]:
# Embedding & Storage

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext


print("Original size collection : ", chroma_collection.count())
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(embed_model=embed_model, text_splitter=text_splitter)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context, show_progress=True
)
print("Post-embedding size collection : ", chroma_collection.count())

Original size collection :  0


Parsing nodes:   0%|          | 0/139 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/979 [00:00<?, ?it/s]

Post-embedding size collection :  979
