## AI-Powered Article Search with ChromaDB + Gemini

Build a lightweight Retrieval-Augmented Generation (RAG) system using:

- **Chunking** – Split long articles into smaller, searchable pieces  
- **LlamaIndex** – For intelligent text chunking and preprocessing  
- **ChromaDB** – Store and search vector embeddings efficiently  
- **Gemini (Google)** – Generate answers using retrieved context

---

### Workflow Overview

1. **Data Chunking**  
   - Use LlamaIndex to split articles into overlapping chunks  

2. **Embedding + Indexing**  
   - Generate embeddings using SentenceTransformers  
   - Store in a persistent Chroma vector DB  

3. **Semantic Search**  
   - Query Chroma to retrieve relevant chunks for any question  

4. **Answer Generation**  
   - Feed results into Gemini 1.5 Flash for context-aware answers  

---

> Fast, simple, and effective — ideal for building smart search over large text datasets.


### 1. Data Preparation

#### 1.1) Get the data & install dependencies

In [None]:
!git clone https://github.com/AshishJangra27/datasets/
!pip install llama-index
!pip install chromadb
!pip install sentence-transformers google-generativeai

#### 1.2) Import Required Libraries

In [3]:
import os
import json
import time
import pandas as pd

from tqdm.auto import tqdm

#### 1.3) Load and Combine the Datasets

- List all CSV files from dataset folder  
- Read and combine them into a single DataFrame  
- Save the merged data as `data.csv`

In [4]:
data_dir = '/content/datasets/GFG Articles/data'
csvs = [name for name in os.listdir(data_dir) if name.endswith('.csv')]

# Read every CSV first, then concatenate once instead of re-copying df on each iteration
frames = [pd.read_csv(os.path.join(data_dir, name)) for name in tqdm(csvs)]
df = pd.concat(frames)

df.to_csv('data.csv', index=False)

  0%|          | 0/30 [00:00<?, ?it/s]

In [42]:
df

| Unnamed: 0 | id | title | author_name | author_id | tags | no_of_imgs | file_path | link | img_links | article |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 816123.0 | Computerized Accounting System – Meaning, Feat... | akademixs2477 | akademixs2477 | Picked,Accountancy,Class 12,Commerce | 0.0 | articles/816123.txt | https://www.geeksforgeeks.org/computerized-acc... |  | A companys accounting system is the core of it... |
| 1 | 818743.0 | Journal Entry for Discount Allowed and Received | sukantkumar | sukantkumar | Journal Entries,Accountancy,Class 11,Commerce | 11.0 | articles/818743.txt | https://www.geeksforgeeks.org/journal-entry-fo... | https://media.geeksforgeeks.org/wp-content/upl... | A discount is a concession in the selling pric... |
| 2 | 803597.0 | Journal Entry for Interest on Capital | sukantkumar | sukantkumar | tanushreegupta2000 | 3.0 | articles/803597.txt | https://www.geeksforgeeks.org/journal-entry-fo... | https://media.geeksforgeeks.org/wp-content/upl... | The proprietor can charge interest on the amou... |
| 3 | 819505.0 | Journal Entry for Depreciation | sukantkumar | sukantkumar | Journal Entries,Accountancy,Class 11,Commerce | 3.0 | articles/819505.txt | https://www.geeksforgeeks.org/journal-entry-fo... | https://media.geeksforgeeks.org/wp-content/upl... | Depreciation is the decrease in the value of a... |
| 4 | 812251.0 | Partnership Deed and Provisions of the Indian ... | bhawna0217 | bhawna0217 | Accountancy,Class 12,Commerce | 0.0 | articles/812251.txt | https://www.geeksforgeeks.org/partnership-deed... |  | Partners are two or more people who agree to c... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1695 | 725662.0 | Spring @Service Annotation with Example | AmiyaRanjanRout | AmiyaRanjanRout | simranarora5sos,surindertarika1234,kk773572498 | 3.0 | articles/725662.txt | https://www.geeksforgeeks.org/spring-service-a... | https://media.geeksforgeeks.org/wp-content/cdn... | Spring is one of the most popular Java EE fram... |
| 1696 | 727816.0 | Spring Framework Annotations | AmiyaRanjanRout | AmiyaRanjanRout | Java-Spring,Java | 1.0 | articles/727816.txt | https://www.geeksforgeeks.org/spring-framework... | https://media.geeksforgeeks.org/wp-content/upl... | Spring framework is one of the most popular Ja... |
| 1697 | 721304.0 | Java Applet Class | khurpaderushi143 | khurpaderushi143 | ankit_kumar_,rkbhola5 | 3.0 | articles/721304.txt | https://www.geeksforgeeks.org/java-applet-class/ | https://media.geeksforgeeks.org/wp-content/upl... | Java Applet is a special type of small Java pr... |
| 1698 | 727813.0 | String Arrays in Java | satyajit1910 | satyajit1910 | akshaytripathi19410 | 0.0 | articles/727813.txt | https://www.geeksforgeeks.org/string-arrays-in... |  | In programming an array is a collection of the... |


#### 1.4) Data Chunking & Formatting

**Chunking Text from Articles using `SentenceSplitter`**

This code takes a DataFrame `df` containing articles and splits each article into smaller chunks using `SentenceSplitter` from `llama_index`.

**Steps:**
1. Initialize `SentenceSplitter` with:
   - `chunk_size=1000` (maximum chunk length, measured in tokens)
   - `chunk_overlap=200` (tokens shared between consecutive chunks)
2. Iterate over each row of the DataFrame:
   - Extract the article text.
   - Split the text into chunks if it is a valid string.
   - For each chunk, store relevant metadata (e.g., title, author info, tags).
3. Store all chunks and their metadata in a new DataFrame `chunked_df`.

This is useful for preparing text data for tasks like retrieval, indexing, or language model input.
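To make the role of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows over a list of words. It is a simplification under stated assumptions: `split_with_overlap` is a hypothetical helper, it counts words rather than tokens, and it ignores sentence boundaries, which `SentenceSplitter` tries to respect.

```python
def split_with_overlap(words, chunk_size, chunk_overlap):
    """Split a list of words into windows of `chunk_size` that share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
demo_chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=2)
for chunk in demo_chunks:
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk.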


In [7]:
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=1000, chunk_overlap=200)

chunked_data = []
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    article = row['article']
    if isinstance(article, str):
        chunks = text_splitter.split_text(article)
        for chunk in chunks:
            chunked_data.append({
                'original_id': row['id'],
                'title': row['title'],
                'chunk': chunk,
                'author_name': row['author_name'],
                'author_id': row['author_id'],
                'tags': row['tags'],
                'no_of_imgs': row['no_of_imgs'],
                'file_path': row['file_path'],
                'link': row['link'],
                'img_links': row['img_links']
            })

chunked_df = pd.DataFrame(chunked_data)

100%|██████████| 49328/49328 [05:34<00:00, 147.45it/s]


### 2. Generate Embeddings

#### 2.1) Prepare Data for Embedding with all-MiniLM-L6-v2

**Preparing Data for Chroma Vector Store**

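This code converts the chunked articles (`chunked_df`) into records of IDs, documents, and metadata suitable for a vector database like **Chroma**. No embeddings are computed at this point; Chroma generates them with the collection's embedding function when the records are added.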

---

**🔧 Steps:**

1. Initialize an empty list `chroma_data`.
2. Iterate over each row in `chunked_df`.
3. For each chunk:
   - Generate a unique `id` by combining the original article ID with the chunk's row index in `chunked_df`.
   - Store the chunk text under the `documents` key.
   - Add all related metadata (title, author, tags, etc.) under the `metadatas` key.
4. Use `json.dumps()` to view the structure of one sample record.

---

**📦 Example Output Format:**

```json
{
  "ids": "816123.0_0",
  "documents": "A company's accounting system is the core of its financial management...",
  "metadatas": {
    "original_id": 816123.0,
    "title": "Computerized Accounting System – Meaning, Features, Advantages and Disadvantages",
    "author_name": "akademixs2477",
    "author_id": "akademixs2477",
    "tags": "Picked,Accountancy,Class 12,Commerce",
    "no_of_imgs": 0.0,
    "file_path": "articles/816123.txt",
    "link": "https://www.geeksforgeeks.org/computerized-accounting-system-meaning-features-advantages-and-disadvantages/",
    "img_links": null
  }
}
```

This structure is compatible with many vector database libraries like **Chroma** for efficient document search and retrieval.
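Missing values read by pandas arrive as floating-point `NaN` (for example `img_links` on articles without images). Vector stores generally expect metadata values to be plain strings, numbers, or booleans, so it can help to normalize missing values before insertion. The sketch below is one way to do that; `clean_metadata` and its `placeholder` parameter are hypothetical names, not part of the notebook or of Chroma.

```python
import math


def clean_metadata(metadata, placeholder=""):
    """Return a copy of `metadata` with None/NaN values replaced by `placeholder`."""
    cleaned = {}
    for key, value in metadata.items():
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        cleaned[key] = placeholder if is_missing else value
    return cleaned


sample = {
    "title": "Journal Entry for Depreciation",
    "no_of_imgs": 3.0,
    "img_links": float("nan"),  # how a missing CSV cell looks after pd.read_csv
}
print(clean_metadata(sample))
```

If adopted, the helper would wrap the `metadatas` dictionary built for each record; whether an empty string or another placeholder is appropriate depends on how the metadata is filtered later.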

In [None]:
chroma_data = []

for index, row in chunked_df.iterrows():
    chroma_data.append({
        'ids': f"{row['original_id']}_{index}", # Create a unique ID for each chunk
        'documents': row['chunk'],
        'metadatas': {
            'original_id': row['original_id'],
            'title': row['title'],
            'author_name': row['author_name'],
            'author_id': row['author_id'],
            'tags': row['tags'],
            'no_of_imgs': row['no_of_imgs'],
            'file_path': row['file_path'],
            'link': row['link'],
            'img_links': row['img_links']
        }
    })

# Inspect one record (json was already imported in section 1.2)
print(json.dumps(chroma_data[0], indent=2))

### 3. Push Embeddings on ChromaDB

#### 3.1) Setup Chroma Persistent Vector Store with Embeddings

**Setting Up a Chroma Persistent Vector Store with Embeddings**

This code initializes a persistent Chroma database and sets up a new vector collection using GPU-enabled sentence embeddings.

---

**📌 Steps:**

1. **Import Required Libraries:**  
   Import ChromaDB, SentenceTransformer for embeddings, and other utilities.

2. **Initialize Chroma Client:**
   - Create a persistent Chroma client with a storage path `"chroma_db"`.

3. **Delete Old Collection (If Exists):**
   - Attempt to delete a previously existing collection named `"my_collection"` to avoid duplication.

4. **Initialize Embedding Function:**
   - Use the `all-MiniLM-L6-v2` model from `SentenceTransformer`.
   - Set the device to `"cuda"` to utilize GPU for faster embedding.

5. **Create New Collection:**
   - Create or retrieve a new collection named `"my_collection"` using the embedding function.

---

**✅ Result:**
- A clean, GPU-enabled vector collection is ready in the Chroma database for inserting documents.


In [14]:
import chromadb
import torch
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from tqdm import tqdm

# Persistent client: the database is stored on disk under ./chroma_db
client = chromadb.PersistentClient(path="chroma_db")

# Drop any collection left over from a previous run to avoid duplicate chunks
old_collection_name = "my_collection"
try:
    client.delete_collection(name=old_collection_name)
    print(f"Deleted collection '{old_collection_name}'.")
except Exception:
    print(f"No existing collection named '{old_collection_name}'.")

# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
ef = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2", device=device)

new_collection_name = "my_collection"
collection = client.get_or_create_collection(name=new_collection_name, embedding_function=ef)
print(f"Created new collection '{new_collection_name}' (embedding device: {device}).")

No existing collection named 'my_collection'.


#### 3.2) Push Vectors on ChromaDB

**Adding Documents to Chroma Collection in Batches**

This section handles uploading all the document chunks (`chroma_data`) into the Chroma vector store collection in batches for efficient processing.

---

**📌 Steps:**

1. **Set Batch Size:**
   - Use `batch_size = 10` to control memory usage and insertion speed.

2. **Loop Through Chunks:**
   - For every batch:
     - Extract `ids`, `documents`, and `metadatas`.
     - Add them to the Chroma collection using `collection.add()`.

3. **Verify Upload:**
   - Print total number of chunks added and the current count in the Chroma collection using `collection.count()`.

---

**📦 Example Format of a Single Item in `chroma_data`:**

```json
{
  "ids": "816123.0_0",
  "documents": "A company's accounting system is the core of its financial management as it processes all transactions within the organization...",
  "metadatas": {
    "original_id": 816123.0,
    "title": "Computerized Accounting System – Meaning, Features, Advantages and Disadvantages",
    "author_name": "akademixs2477",
    "author_id": "akademixs2477",
    "tags": "Picked,Accountancy,Class 12,Commerce",
    "no_of_imgs": 0.0,
    "file_path": "articles/816123.txt",
    "link": "https://www.geeksforgeeks.org/computerized-accounting-system-meaning-features-advantages-and-disadvantages/",
    "img_links": null
  }
}
```


**✅ Result:**
- All document chunks are added efficiently to Chroma with progress tracking via `tqdm`.
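The batching pattern described above can be isolated into a small generator. This is a sketch only; `iter_batches` is a hypothetical helper and the integer list stands in for `chroma_data`.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


records = list(range(27))  # stand-in for chroma_data
batches = list(iter_batches(records, batch_size=10))
print([len(batch) for batch in batches])  # [10, 10, 7]

# Number of collection.add() calls needed for 100027 records at batch_size=10
print(math.ceil(100027 / 10))  # 10003
```

The last batch is simply shorter than the others, so no special handling is needed for record counts that are not a multiple of the batch size. A larger batch size means fewer calls, at the cost of more memory per embedding pass.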

In [16]:
batch_size = 10
for i in tqdm(range(0, len(chroma_data), batch_size)):
    batch = chroma_data[i : i + batch_size]
    collection.add(
        ids=[item['ids'] for item in batch],
        documents=[item['documents'] for item in batch],
        metadatas=[item['metadatas'] for item in batch]
    )

# Confirm final count
print(f"Added {len(chroma_data)} documents to '{new_collection_name}'.")
print(f"Final document count: {collection.count()}")

100%|██████████| 10003/10003 [33:20<00:00,  5.00it/s]

Added 100027 documents to 'my_collection'.
Final document count: 100027





### 4. Search for Articles with Similar Embeddings

#### 4.1) Search for Articles with Similar Embeddings

**Querying Chroma Collection and Displaying Top Matches**

This section performs a semantic search in the Chroma vector store using an embedded query and displays the top 5 most relevant document chunks.

---

**📌 Steps:**

1. **Run a Query:**
   - Use `collection.query()` with a natural language query like `"What is GAN?"`.
   - `Chroma` automatically embeds the query and retrieves the top 5 matching documents (`n_results=5`).

2. **Display Results:**
   - Check if results contain both `'documents'` and `'metadatas'`.
   - Loop through each result:
     - Print the title, author, and link from metadata.
     - Display the first 500 characters of the matching chunk for preview.

---

```text
Top 5 Articles:

--- Result 1 ---
Title: Basics of Generative Adversarial Networks (GANs)
Author: sachingera
Link: https://www.geeksforgeeks.org/basics-of-generative-adversarial-networks-gans/
Chunk:
GANs is an approach for generative modeling using deep learning methods such as CNN Convolutional Neural Network Generative modeling is an unsupervised learning approach...

--- Result 2 ---
Title: Generative Adversarial Network (GAN)
Author: Rahul_Roy
Link: https://www.geeksforgeeks.org/generative-adversarial-network-gan/
Chunk:
A Generative Adversarial Network GAN is a deep learning architecture that consists of two neural networks competing against each other in a zerosum game framework...

--- Result 3 ---
Title: Generative Adversarial Network (GAN)
Author: Rahul_Roy
Link: https://www.geeksforgeeks.org/generative-adversarial-network-gan/
Chunk:
Here the Generator and the Discriminator are simple multilayer perceptrons In vanilla GAN the algorithm is really simple it tries to optimize the mathematical equation...

--- Result 4 ---
Title: Generative Adversarial Networks (GANs) | An Introduction
Author: RahulDas
Link: https://www.geeksforgeeks.org/generative-adversarial-networks-gans-an-introduction/
Chunk:
Generative Adversarial Networks GANs was first introduced by Ian Goodfellow in 2014 GANs are a powerful class of neural networks that are used for unsupervised learning...

--- Result 5 ---
Title: Super Resolution GAN (SRGAN)
Author: pawangfg
Link: https://www.geeksforgeeks.org/super-resolution-gan-srgan/
Chunk:
SRGAN was proposed by researchers at Twitter The motive of this architecture is to recover finer textures from the image when we upscale it so that its quality...

```


This output shows the most relevant articles retrieved by the Chroma vector store in response to the query `"What is GAN?"`, along with titles, authors, links, and a content preview.
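Conceptually, the query step compares the query's embedding against every stored chunk embedding and keeps the closest ones. The sketch below illustrates that idea with cosine similarity over hand-written three-dimensional vectors. The vectors, document names, and helper functions are all hypothetical; real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and Chroma uses its own index and a configurable distance metric rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


toy_vectors = {
    "gan_basics": [0.9, 0.1, 0.0],
    "srgan": [0.7, 0.3, 0.1],
    "journal_entries": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]
print(top_k(toy_query, toy_vectors, k=2))  # ['gan_basics', 'srgan']
```

Because ranking is done per chunk rather than per article, the same article can appear more than once in the results, as Results 2 and 3 do in the output above.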

In [None]:
results = collection.query(
    query_texts=["What is GAN?"], # Chroma will embed this for you
    n_results=5
)


if results and results.get('documents') and results.get('metadatas'):
    print("Top 5 Articles:")
    for i in range(len(results['documents'][0])):
        document = results['documents'][0][i]
        metadata = results['metadatas'][0][i]

        print(f"\n--- Result {i+1} ---")
        print(f"Title: {metadata.get('title', 'N/A')}")
        print(f"Author: {metadata.get('author_name', 'N/A')}")
        print(f"Link: {metadata.get('link', 'N/A')}")
        print(f"Chunk:\n{document[:500]}...") # Displaying the first 500 characters of the chunk
        # You can access other metadata fields similarly, e.g., metadata.get('tags')
else:
    print("No results found or results structure is unexpected.")

#### 4.2) Collect Similar Chunks into a List of Dictionaries

**Retrieving Top Matching Articles from Chroma**

This code queries the Chroma vector store for a specific topic and collects the top matching article chunks in a structured format.

---

**📌 Steps:**

1. **Define Query and Result Count:**
   - Set `query_text` (e.g., `"What is GAN?"`) and number of results (`n_results = 3`).

2. **Query the Chroma Collection:**
   - Use `collection.query()` to search for semantically similar chunks.

3. **Extract Results:**
   - Loop through the returned results and extract:
     - `title` from metadata
     - `chunk` (retrieved document content)

4. **Store Results:**
   - Append each result into a list called `retrieved_articles`.

5. **Check Output:**
   - Print a fallback message if no relevant articles were found.

---

**📦 Example Output Format (`retrieved_articles`):**

```python
[
  {
    'title': 'Basics of Generative Adversarial Networks (GANs)',
    'chunk': 'GANs is an approach for generative modeling using deep learning methods...'
  },
  {
    'title': 'Generative Adversarial Network (GAN)',
    'chunk': 'A Generative Adversarial Network GAN is a deep learning architecture...'
  },
  {
    'title': 'Generative Adversarial Network (GAN)',
    'chunk': 'Here the Generator and the Discriminator are simple multilayer perceptrons...'
  }
]
```
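Query results are nested: each key holds one inner list per query text, which is why the code indexes `[0]` before looping. The sketch below flattens that structure and optionally drops weak matches by distance. It assumes the result also carries a `distances` key, and the `to_articles` helper, its `max_distance` parameter, and the distance values in `mock_results` are hypothetical illustrations rather than real query output.

```python
def to_articles(results, max_distance=None):
    """Flatten the first query's results into a list of {'title', 'chunk', 'distance'} dicts."""
    documents = (results.get("documents") or [[]])[0]
    metadatas = (results.get("metadatas") or [[]])[0]
    distances = (results.get("distances") or [[None] * len(documents)])[0]

    articles = []
    for document, metadata, distance in zip(documents, metadatas, distances):
        # Skip matches that are farther away than the caller allows
        if max_distance is not None and distance is not None and distance > max_distance:
            continue
        articles.append({
            "title": (metadata or {}).get("title", "N/A"),
            "chunk": document,
            "distance": distance,
        })
    return articles


mock_results = {
    "documents": [["GANs is an approach for generative modeling...", "SRGAN was proposed by researchers at Twitter..."]],
    "metadatas": [[{"title": "Basics of Generative Adversarial Networks (GANs)"}, {"title": "Super Resolution GAN (SRGAN)"}]],
    "distances": [[0.42, 1.10]],  # made-up values for illustration
}
print(to_articles(mock_results, max_distance=0.5))
```

A sensible threshold depends on the embedding model and the distance metric of the collection, so it is best chosen by inspecting distances for a few known-good and known-bad queries.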

In [45]:
query_text = "What is GAN?" # Replace with your query
n_results = 3 # Number of top articles to retrieve

results = collection.query(
    query_texts=[query_text],
    n_results=n_results
)

retrieved_articles = []
if results and results.get('documents') and results.get('metadatas'):
    for i in range(len(results['documents'][0])):
        document = results['documents'][0][i]
        metadata = results['metadatas'][0][i]
        retrieved_articles.append({
            'title': metadata.get('title', 'N/A'),
            'chunk': document
        })

# Check if any articles were retrieved
if not retrieved_articles:
    print("No relevant articles found in ChromaDB.")

#### 4.3) Defining Prompt for Gemini

**Generating a Prompt for LLM-Based Question Answering**

This section creates a final prompt that can be passed to a language model (LLM) for answering a user query using retrieved document chunks as context.

---

**📌 Steps:**

1. **Check for Retrieved Articles:**
   - Ensure `retrieved_articles` contains at least one result.

2. **Construct Context Block:**
   - Join article titles and chunks using newline formatting:
     ```
     You are a helpful AI assistant. Use the following articles to answer the question.
     If the answer is not available in the provided articles, say "I cannot answer this question based on the provided information."
     
     Articles:
      Title: Basics of Generative Adversarial Networks (GANs)
      Content: GANs is an approach for generative modeling using deep learning methods such as CNN Convolutional Neural Network Generative modeling is an unsupervised learning approach that involves automatically discovering
      
      Title: Generative Adversarial Network (GAN)
      Content: A Generative Adversarial Network GAN is a deep learning architecture that consists of two neural networks competing against each other in a zerosum
      
      Title: Generative Adversarial Network (GAN)
      Content: Here the Generator and the Discriminator are simple multilayer perceptrons In vanilla GAN the algorithm is really simple it tries to optimize the mathematical equation using stochastic gradient descentConditional GAN CGAN CGAN can be described as a deep learning method in which some conditional parameters are put into place In CGAN an additional parameter y is added to the Generator for generating the corresponding data Labels are also put into the

    Question: What is GAN?

    Answer:
     ```

3. **Create the Final Prompt:**
   - Embed the context and the `query_text` into a structured instruction:
     - Instruct the assistant to only use the provided content.
     - Include fallback behavior if the answer is not found.

4. **Fallback Condition:**
   - If no articles are retrieved, set `prompt = None` and print a message.
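The steps above can also be packaged as a function, which makes the prompt easy to inspect and reuse. The sketch below is one possible shape, not the notebook's implementation: `build_prompt` and its `max_context_chars` budget are hypothetical additions, and the demo chunks are shortened placeholders rather than real retrieved text.

```python
def build_prompt(question, articles, max_context_chars=6000):
    """Build the RAG prompt, or return None when there is nothing to ground the answer on."""
    if not articles:
        return None

    blocks = []
    used = 0
    for article in articles:
        block = f"Title: {article['title']}\nContent: {article['chunk']}"
        # Always keep the first article, then stop once the character budget is exceeded
        if blocks and used + len(block) > max_context_chars:
            break
        blocks.append(block)
        used += len(block)

    context = "\n\n".join(blocks)
    return (
        "You are a helpful AI assistant. Use the following articles to answer the question.\n"
        "If the answer is not available in the provided articles, say "
        '"I cannot answer this question based on the provided information."\n\n'
        f"Articles:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:\n"
    )


demo_articles = [
    {"title": "Generative Adversarial Network (GAN)", "chunk": "A GAN consists of two neural networks."},
    {"title": "Super Resolution GAN (SRGAN)", "chunk": "SRGAN upscales images."},
]
demo_prompt = build_prompt("What is GAN?", demo_articles)
print(demo_prompt)
```

Building the string from separate pieces avoids the leading spaces that an indented triple-quoted f-string would carry into the prompt.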



In [48]:
if retrieved_articles:
    context = "\n\n".join([f"Title: {article['title']}\nContent: {article['chunk']}" for article in retrieved_articles])

    prompt = f"""You are a helpful AI assistant. Use the following articles to answer the question.
    If the answer is not available in the provided articles, say "I cannot answer this question based on the provided information."

    Articles:
    {context}

    Question: {query_text}

    Answer:
    """
    print("Prompt created with context.")
else:
    prompt = None
    print("No articles retrieved, skipping prompt creation.")

Prompt created with context.


#### 4.4) Defining Gemini Model

**Gemini 1.5 Flash Setup (Google Generative AI)**

This block sets up the Gemini model using the Google Generative AI SDK.

In [33]:
import google.generativeai as genai
from google.colab import userdata

try:
    GOOGLE_API_KEY = userdata.get("GOOGLE_API_KEY")  # assumes a Colab secret with this name
    genai.configure(api_key=GOOGLE_API_KEY)
    print("Google Generative AI configured.")
    gemini_model = genai.GenerativeModel('gemini-1.5-flash-latest') # Using flash model
    print("Gemini 1.5 Flash model initialized.")

except Exception as e:
    print(f"Error setting up Google Generative AI: {e}")
    gemini_model = None
    print("Could not initialize Gemini model. Please check your API key and network connection.")

Google Generative AI configured.
Gemini 1.5 Flash model initialized.


#### 4.5) RAG based Answer

**Generate Answer using Gemini (RAG)**

This code uses the Gemini model to generate an answer based on the previously constructed prompt.


In [34]:
if prompt and gemini_model:
    try:
        response = gemini_model.generate_content(prompt)
        rag_answer = response.text
        print("\n--- RAG Based Answer ---")
        print(rag_answer)
    except Exception as e:
        print(f"Error generating content from Gemini model: {e}")
        rag_answer = "Could not generate answer."
else:
    rag_answer = "Could not generate answer due to missing prompt or model."
    print(rag_answer)


--- RAG Based Answer ---
A Generative Adversarial Network (GAN) is a deep learning architecture consisting of two neural networks competing against each other in a zero-sum game framework.  The goal is to generate new synthetic data that resembles a known data distribution.  It uses a Generator, which creates fake data samples, and a Discriminator, which tries to distinguish between real and fake data.  The two networks are trained iteratively, with the Generator aiming to fool the Discriminator and the Discriminator aiming to correctly identify the fake data.  This adversarial training process allows the Generator to learn the underlying distribution of the real data and generate increasingly realistic samples.



#### 4.6) End-to-End RAG Workflow using Chroma + Gemini

This process retrieves relevant article chunks from a Chroma vector store and uses the Gemini model to generate an answer based on that content.

---

**Steps Involved**

1. **Query Chroma:**
   - A user question is sent to Chroma to find the most relevant document chunks.

2. **Retrieve Articles:**
   - Extract matching chunks and their titles from Chroma results.

3. **Build Prompt:**
   - A prompt is created that includes:
     - The retrieved articles as context.
     - The original user question.

4. **Generate Answer with Gemini:**
   - The prompt is passed to the Gemini 1.5 Flash model.
   - The model is instructed to answer using only the provided context.

5. **Fallback Handling:**
   - If no relevant documents are found or the model is not set up, a fallback message is shown.

---

**✅ Outcome:**
- A grounded, context-aware answer is returned using the retrieved documents and the Gemini model.
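The whole workflow can be expressed as a single function that receives the retrieval and generation steps as arguments. The sketch below is a hedged illustration: `answer_question`, `stub_retrieve`, and `stub_generate` are hypothetical names, and the stubs replace Chroma and Gemini so the example runs without network access or a populated collection.

```python
def answer_question(question, retrieve, generate):
    """Run retrieval, prompt construction, and generation for one question.

    `retrieve(question)` must return a list of {'title', 'chunk'} dicts and
    `generate(prompt)` must return the answer text; both are supplied by the caller.
    """
    articles = retrieve(question)
    if not articles:
        return "No relevant articles found."

    context = "\n\n".join(
        f"Title: {article['title']}\nContent: {article['chunk']}" for article in articles
    )
    prompt = (
        "You are a helpful AI assistant. Use the following articles to answer the question.\n"
        "If the answer is not available in the provided articles, say "
        '"I cannot answer this question based on the provided information."\n\n'
        f"Articles:\n{context}\n\nQuestion: {question}\n\nAnswer:\n"
    )
    try:
        return generate(prompt)
    except Exception as exc:  # e.g. network or API errors
        return f"Could not generate answer: {exc}"


# Stubs standing in for Chroma and Gemini so the sketch runs offline
def stub_retrieve(question):
    return [{"title": "Stub article", "chunk": "Stub content about " + question}]


def stub_generate(prompt):
    return f"stub answer based on {prompt.count('Title:')} article(s)"


print(answer_question("What is Image Processing in OpenCV?", stub_retrieve, stub_generate))
print(answer_question("Anything", lambda q: [], stub_generate))
```

In the notebook, `retrieve` would wrap `collection.query(...)` and `generate` would wrap `gemini_model.generate_content(prompt).text`, so the same function serves any number of questions without repeating the cell.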


In [51]:
query_text = "What is Image Processing in OpenCV?"
n_results = 10

results = collection.query( query_texts=[query_text], n_results=n_results)

retrieved_articles = []
if results and results.get('documents') and results.get('metadatas'):
    for i in range(len(results['documents'][0])):
        document = results['documents'][0][i]
        metadata = results['metadatas'][0][i]
        retrieved_articles.append({
            'title': metadata.get('title', 'N/A'),
            'chunk': document
        })



if retrieved_articles:
    context = "\n\n".join([f"Title: {article['title']}\nContent: {article['chunk']}" for article in retrieved_articles])

    prompt = f"""You are a helpful AI assistant. Use the following articles to answer the question.
    If the answer is not available in the provided articles, say "I cannot answer this question based on the provided information."

    Articles:
    {context}

    Question: {query_text}

    Answer:
    """
    print("Prompt created with context.")
else:
    prompt = None
    print("No articles retrieved, skipping prompt creation.")

if prompt and gemini_model:
    try:
        response = gemini_model.generate_content(prompt)
        rag_answer = response.text
        print("\n--- RAG Based Answer ---")
        print(rag_answer)
    except Exception as e:
        print(f"Error generating content from Gemini model: {e}")
        rag_answer = "Could not generate answer."
else:
    rag_answer = "Could not generate answer due to missing prompt or model."
    print(rag_answer)



Prompt created with context.

--- RAG Based Answer ---
Based on the provided text, image processing in OpenCV involves modifying and analyzing digital images using computer algorithms.  OpenCV offers a wide range of capabilities and methods for this, including image resizing, rotation, translation, shearing, normalization, edge detection, blurring, and morphological image processing.  It's used in various applications such as computer vision, medical imaging, and security.  OpenCV is optimized for real-time applications and supports multiple programming languages.



### 5. Saving Embeddings

This step creates a downloadable ZIP archive of the Chroma vector database stored in a local directory.

---

**What it does:**

1. **Locate ChromaDB Directory:**
   - Uses `./chroma_db` as the path to the persistent Chroma database.

2. **Create ZIP Archive:**
   - Compresses the entire directory into a `.zip` file using `shutil.make_archive()`.

3. **Save Output:**
   - The archive is saved with the name `chroma_db_archive.zip` in the current working directory.

4. **Error Handling:**
   - Handles cases where the source directory is missing or any other exception occurs.

---

**✅ Outcome:**
- A zipped version of your Chroma database is ready for download via the Colab file browser.
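The reverse operation, restoring the archive in a later session, can be sketched with `shutil.unpack_archive`. The example below runs a round trip on a throwaway directory so it does not depend on a real database; `restore_archive` is a hypothetical helper and `placeholder.bin` stands in for Chroma's actual files.

```python
import os
import shutil
import tempfile


def restore_archive(zip_path, target_dir):
    """Unpack `zip_path` into `target_dir` and return the sorted list of restored entries."""
    os.makedirs(target_dir, exist_ok=True)
    shutil.unpack_archive(zip_path, target_dir, "zip")
    return sorted(os.listdir(target_dir))


# Round trip: archive a dummy directory, then restore it somewhere else
with tempfile.TemporaryDirectory() as workdir:
    source_dir = os.path.join(workdir, "chroma_db")
    os.makedirs(source_dir)
    with open(os.path.join(source_dir, "placeholder.bin"), "wb") as handle:
        handle.write(b"demo")

    archive_path = shutil.make_archive(
        os.path.join(workdir, "chroma_db_archive"), "zip", source_dir
    )
    restored_entries = restore_archive(archive_path, os.path.join(workdir, "restored"))
    print(restored_entries)  # ['placeholder.bin']
```

After unpacking a real archive, the database should be reopenable with `chromadb.PersistentClient(path=...)` pointed at the restored directory and `client.get_collection(...)` called with the same embedding function used at creation time, so that queries are embedded consistently with the stored vectors.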


In [35]:
import shutil
import os

# Define the path to your ChromaDB persistent directory
chroma_db_path = "./chroma_db"
# Define the name for the zip file
zip_filename = "chroma_db_archive"
# Define the output path for the zip file
output_path = "./" + zip_filename

# Create a zip archive of the chroma_db directory
try:
    shutil.make_archive(output_path, 'zip', chroma_db_path)
    print(f"Successfully created zip archive: {output_path}.zip")
    print("You can now download this zip file from the Colab file browser.")
except FileNotFoundError:
    print(f"Error: The directory '{chroma_db_path}' was not found.")
except Exception as e:
    print(f"An error occurred while creating the archive: {e}")

Successfully created zip archive: ./chroma_db_archive.zip
You can now download this zip file from the Colab file browser.
