<a href="https://colab.research.google.com/github/Mohammadhsiavash/DeepL-Training/blob/main/NLP/Simple_RAG_System_using_ChromaDB_and_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install required packages if not already installed
!pip install chromadb sentence-transformers transformers torch

In [2]:
import chromadb
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')

1. Set Up ChromaDB and the Embedding Model

In [3]:
# Initialize ChromaDB client and collection
chroma_client = chromadb.Client()

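# Get the collection if it already exists, otherwise create it
# (plain create_collection can fail when this cell is re-run)
collection = chroma_client.get_or_create_collection(name="knowledge_base")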

# Initialize embedding model (small model for demonstration)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

2. Add Documents to Knowledge Base

In [4]:
# Sample documents - in a real system, these would be your knowledge base
documents = [
    "The capital of France is Paris.",
    "The Eiffel Tower is located in Paris.",
    "France is known for its wine and cheese.",
    "The official language of France is French.",
    "France won the FIFA World Cup in 2018.",
    "The Louvre Museum is in Paris, France.",
    "France is a member of the European Union.",
    "The French Revolution took place in the late 18th century.",
    "France shares borders with Germany, Belgium, and Spain.",
    "The currency used in France is the Euro."
]

# Generate embeddings for the documents
embeddings = embedding_model.encode(documents)

# Add documents to ChromaDB collection
ids = [f"id_{i}" for i in range(len(documents))]

collection.add(
    documents=documents,
    embeddings=[embedding.tolist() for embedding in embeddings],
    ids=ids
)
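
Once the embeddings are stored, retrieval amounts to a nearest-neighbor search over them. The following is a minimal, dependency-free sketch of that idea, not ChromaDB's actual implementation (which uses an approximate index). It assumes squared Euclidean distance as the metric, which is believed to be Chroma's default; the names `squared_l2`, `top_k`, and `toy_vectors` are hypothetical and exist only for illustration.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: Sequence[float],
          vectors: List[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


# Toy 2-D "embeddings" (assumption for illustration only);
# real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = top_k([1.0, 0.0], toy_vectors, k=2)
print(nearest)
```

The exact match (index 0) ranks first and the nearby vector (index 2) second, which mirrors how `n_results` limits the documents returned by a collection query.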

3. Initialize the Generator Model

In [5]:
# Initialize a text generation pipeline (using a small model for demonstration)
generator = pipeline(
    "text-generation",
    model="gpt2",  # Using a small model for demo purposes
    device="cpu"   # Change to "cuda" if you have a GPU
)

Device set to use cpu


4. Implement the RAG System

In [9]:
def rag_query(query: str, top_k: int = 3):
    # Step 1: Embed the query
    query_embedding = embedding_model.encode(query)

    # Step 2: Retrieve relevant documents
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=top_k
    )

    retrieved_docs = results['documents'][0]

    # Step 3: Format the prompt with context
    context = "\n".join(retrieved_docs)
    prompt = f"""Based on the following information:
{context}

Question: {query}
Answer the question concisely:"""

    # Step 4: Generate the answer (sampled output, capped at 50 new tokens)
    answer = generator(
        prompt,
        max_new_tokens=50,  # Limits the length of the generated text
        num_return_sequences=1,
        do_sample=True,
        temperature=0.7,
        truncation=True,  # Explicitly enable truncation
        pad_token_id=generator.tokenizer.eos_token_id  # Properly set pad token
    )

    # Extract just the generated part (after the prompt)
    full_output = answer[0]['generated_text']
    generated_part = full_output[len(prompt):].strip()

    # The model often keeps generating past the answer, so keep only the first sentence
    final_answer = (generated_part.split('.')[0] + '.') if '.' in generated_part else generated_part

    return {
        "answer": final_answer,
        "retrieved_documents": retrieved_docs
    }
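
The prompt assembly and first-sentence extraction inside `rag_query` can be pulled out into small helpers that are testable without loading any model. The sketch below is one possible way to do that. `build_prompt` and `first_sentence` are hypothetical names, not part of the original notebook. The regex treats `.`, `!`, or `?` as a sentence end only when followed by whitespace or the end of the string, so decimals such as `3.14` are not split. Abbreviations such as `U.S.` would still be cut short.

```python
import re
from typing import List


def build_prompt(query: str, docs: List[str]) -> str:
    """Assemble the same context-plus-question prompt that rag_query uses."""
    context = "\n".join(docs)
    return (
        "Based on the following information:\n"
        f"{context}\n\n"
        f"Question: {query}\n"
        "Answer the question concisely:"
    )


# Sentence end: '.', '!' or '?' followed by whitespace or end of string.
_SENTENCE_END = re.compile(r"[.!?](?=\s|$)")


def first_sentence(text: str) -> str:
    """Return the first sentence of text, or all of it if no boundary is found."""
    text = text.strip()
    match = _SENTENCE_END.search(text)
    return text[: match.end()] if match else text


prompt = build_prompt("What is the capital of France?",
                      ["The capital of France is Paris."])
print(prompt)
print(first_sentence("Paris. It is also the largest city."))
```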

5. Test the RAG System

In [10]:
# Example query
query = "What is the capital of France?"
result = rag_query(query)

print("Answer:")
print(result["answer"])
print("\nRetrieved documents:")
for doc in result["retrieved_documents"]:
    print(f"- {doc}")

Answer:
The capital of France is Paris.

Retrieved documents:
- The capital of France is Paris.
- The official language of France is French.
- France is a member of the European Union.


In [11]:
# Another example
query = "Tell me about France's relationship with wine."
result = rag_query(query)

print("Answer:")
print(result["answer"])
print("\nRetrieved documents:")
for doc in result["retrieved_documents"]:
    print(f"- {doc}")

Answer:
In the 19th century, France began a process of making wine in a series of factories which produced wines with a variety of characteristics.

Retrieved documents:
- France is known for its wine and cheese.
- France is a member of the European Union.
- The official language of France is French.
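
In the second example, two of the three retrieved documents have nothing to do with wine, and the generated answer is not supported by the knowledge base. One possible mitigation is to drop weakly related documents before building the prompt. The sketch below assumes the query result also carries a `distances` entry shaped like `documents` (believed to be included by ChromaDB by default). `filter_by_distance`, the `max_distance` threshold, and the distance values in `mock_results` are all hypothetical and made up for illustration; a real threshold would need tuning against actual query results.

```python
from typing import Any, Dict, List


def filter_by_distance(results: Dict[str, Any], max_distance: float) -> List[str]:
    """Keep only documents whose distance to the query is within max_distance."""
    docs = results["documents"][0]        # results are nested per query
    distances = results["distances"][0]   # same order as docs
    return [doc for doc, dist in zip(docs, distances) if dist <= max_distance]


# Hand-written stand-in for a query result; the distances are invented,
# not measured from the notebook's collection.
mock_results = {
    "documents": [[
        "France is known for its wine and cheese.",
        "France is a member of the European Union.",
    ]],
    "distances": [[0.8, 1.6]],
}
kept = filter_by_distance(mock_results, max_distance=1.0)
print(kept)
```

With a filter like this, `rag_query` could pass only the kept documents into the prompt, or return a fallback message when nothing passes the threshold.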
