<h1><center>Vector Database Operations with Chroma</center></h1>
<h2><center>Vector Database Explained</center></h2>
<h3><center>Build AI Apps - Beginner Level</center></h3>

## Before you start

To complete the project, you need an OpenAI developer account and an API key stored as an environment variable. Both steps are outlined below.

### Create a developer account with OpenAI

1. Go to the [API signup page](https://platform.openai.com/signup). 

2. Create your account (you'll need to provide your email address and your phone number).

3. Go to the [API keys page](https://platform.openai.com/account/api-keys). 

4. Create a new secret key.


5. **Copy the key and store it somewhere safe.** (If you lose it, delete the key and create a new one.)

### Add a payment method

OpenAI sometimes provides free credits for the API, but this can vary based on geography. You may need to add debit/credit card details. 

**Using the `text-embedding-3-small` model in this project should cost less than 1 US cent (but you are charged again each time you rerun a task).** For more information on pricing, see [OpenAI's pricing page](https://openai.com/pricing).

1. Go to the [Payment Methods page](https://platform.openai.com/account/billing/payment-methods).

2. Click Add payment method.

3. Fill in your card details.

### Install the OpenAI library

In [57]:
# !pip install openai

### Load the OpenAI API key

In [58]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

OPEN_API_KEY = os.getenv('OPENAI_API_KEY')  # None if the variable is not set
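`os.getenv` returns `None` when the variable is missing, and a missing key may then surface only as a less obvious error later on. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of the notebook or of any library):

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper used for illustration only.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell")
    return value


# Usage with a stand-in mapping so the example runs without a real key
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

In the notebook, the call would simply be `require_env("OPENAI_API_KEY")` after `load_dotenv` has run.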

### Create Client

In [59]:
from openai import OpenAI

# Define models
model = "gpt-4o-mini"
emb_model = "text-embedding-3-small"

# Define the OpenAI client (commented out; the name `client` is used for the Chroma client below)
# client = OpenAI(api_key=OPEN_API_KEY)

## Getting Started with a Vector Database: Chroma DB Example with the Netflix Dataset

In [60]:
# !pip install chromadb

In [61]:
# import libraries
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [62]:
# Create a persistent client and set the directory where the database files are saved
client = chromadb.PersistentClient(path="vectordb_files")

## Collections - Sample

In [63]:
# test_collection = client.create_collection(
#     name="test_titles",
#     embedding_function=OpenAIEmbeddingFunction(
#         model_name="text-embedding-3-small",
#         api_key=OPEN_API_KEY
#     ),
#     get_or_create=True
# )

In [64]:
test_collection = client.get_or_create_collection(
    name="test_titles",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small",
        api_key=OPEN_API_KEY
    )
)

In [65]:
client.list_collections()

[Collection(name=test_titles1),
 Collection(name=netflix_titles),
 Collection(name=test_titles)]

In [66]:
test_collection.add(
    ids=["my_doc"],
    documents=["This is the source text"]
)

# Add multiple documents
test_collection.add(
    ids=["my-doc-1", "my-doc-2"],
    documents=["This is document 1", "This is document 2"]
)

Insert of existing embedding ID: my_doc
Add of existing embedding ID: my_doc
Insert of existing embedding ID: my-doc-1
Insert of existing embedding ID: my-doc-2
Add of existing embedding ID: my-doc-1
Add of existing embedding ID: my-doc-2
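The messages above appear because the cell was rerun against a persistent collection that already held these IDs; the count of 3 in the next cell shows the duplicates were not stored a second time. One way to avoid the noise, sketched here in plain Python with a hypothetical `only_new` helper, is to drop already-stored IDs before calling `add`. In the notebook the existing IDs could come from `test_collection.get()["ids"]`; the list below is a stand-in.

```python
def only_new(ids, documents, existing_ids):
    """Return the (ids, documents) pairs whose ID is not already stored."""
    seen = set(existing_ids)
    new_ids, new_docs = [], []
    for doc_id, doc in zip(ids, documents):
        if doc_id not in seen:
            new_ids.append(doc_id)
            new_docs.append(doc)
            seen.add(doc_id)  # also guards against duplicates within the batch
    return new_ids, new_docs


existing = ["my_doc", "my-doc-1"]  # stand-in for IDs already in the collection
new_ids, new_docs = only_new(
    ["my-doc-1", "my-doc-2"],
    ["This is document 1", "This is document 2"],
    existing,
)
print(new_ids, new_docs)
```

If overwriting existing entries is the intent, Chroma's `upsert` method is an alternative to `add`.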


In [67]:
test_collection.count()

3

In [68]:
test_collection.get(ids=["my-doc-1"])

{'ids': ['my-doc-1'],
 'embeddings': None,
 'metadatas': [None],
 'documents': ['This is document 1'],
 'uris': None,
 'data': None}

In [69]:
# test_collection.get(ids=["my-doc-1"], include=["embeddings", "documents", "metadatas"])


In [70]:
# len(test_collection.get(ids=["my-doc-1"], include=["embeddings", "documents", "metadatas"])['embeddings'][0])

## Vector Operations With Netflix Titles Dataset

In [71]:
import csv  # To read the CSV file

# Create empty lists to store our dataset
ids = []
documents = []
metadatas = []

# Open the Netflix dataset file as UTF-8 so dashes and apostrophes decode correctly
with open('netflix_titles.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)  # Read each row as a dictionary keyed by column name

    # Loop over each row in the CSV
    for i, row in enumerate(reader):
        # Save the unique show_id
        ids.append(row['show_id'])

        # Save metadata like type and release year
        metadatas.append({
            "type": row['type'],
            "release_year": int(row['release_year'])  # Cast to integer for consistency
        })

        # Combine title, type, description, and categories into one text block
        text = f"Title: {row['title']} {row['type']} \nDescription: {row['description']} \nCategories: {row['listed_in']}"
        documents.append(text)
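To see what the loop produces without the full file, the same row-to-record logic can be exercised on an in-memory sample. This is a sketch: `build_records` and its `limit` parameter are hypothetical names, and the two CSV rows are made up for illustration; only the column names come from the code above.

```python
import csv
import io
from itertools import islice


def build_records(csvfile, limit=None):
    """Turn CSV rows into parallel ids, documents, and metadatas lists."""
    ids, documents, metadatas = [], [], []
    reader = csv.DictReader(csvfile)
    for row in islice(reader, limit):  # limit=None reads every row
        ids.append(row["show_id"])
        metadatas.append({"type": row["type"], "release_year": int(row["release_year"])})
        documents.append(
            f"Title: {row['title']} {row['type']} \n"
            f"Description: {row['description']} \n"
            f"Categories: {row['listed_in']}"
        )
    return ids, documents, metadatas


# Made-up sample rows with the same column names as the dataset
sample = io.StringIO(
    "show_id,type,title,release_year,description,listed_in\n"
    's1,Movie,Example Film,2020,"A short, made-up description.",Comedies\n'
    "s2,TV Show,Example Series,2021,Another made-up description.,TV Dramas\n"
)
sample_ids, sample_docs, sample_meta = build_records(sample, limit=1)
print(sample_ids, sample_meta)
print(sample_docs[0])
```

Passing a `limit` is one way to embed only part of a large file while experimenting, which also keeps API costs down.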

In [72]:
print(ids[:2])
print(documents[:2])
print(metadatas[:2])

['s1', 's2']
['Title: Dick Johnson Is Dead Movie \nDescription: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable. \nCategories: Documentaries', 'Title: Blood & Water TV Show \nDescription: After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth. \nCategories: International TV Shows, TV Dramas, TV Mysteries']
[{'type': 'Movie', 'release_year': 2020}, {'type': 'TV Show', 'release_year': 2021}]


In [73]:
# for doc in documents[1:3]:
#     print(doc)
#     print('-' * 50)  # Just a separator for readability

## Create Collection for Netflix Titles

In [74]:
netflix_collection = client.get_or_create_collection(
    name="netflix_titles",
    embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small", api_key=OPEN_API_KEY)
)
# List the collections
print(client.list_collections())

[Collection(name=test_titles1), Collection(name=netflix_titles), Collection(name=test_titles)]


## Add Documents

In [75]:
# Add the documents, metadata, and IDs to the collection
netflix_collection.add(ids=ids, metadatas=metadatas, documents=documents)

Add of existing embedding ID: s1
Add of existing embedding ID: s2
Add of existing embedding ID: s3
Add of existing embedding ID: s4
Add of existing embedding ID: s5
Add of existing embedding ID: s6
Add of existing embedding ID: s7
Add of existing embedding ID: s8
Add of existing embedding ID: s9
Add of existing embedding ID: s10
Add of existing embedding ID: s11
Add of existing embedding ID: s12
Add of existing embedding ID: s13
Add of existing embedding ID: s14
Add of existing embedding ID: s15
Add of existing embedding ID: s16
Add of existing embedding ID: s17
Add of existing embedding ID: s18
Add of existing embedding ID: s19
Add of existing embedding ID: s20
Add of existing embedding ID: s21
Add of existing embedding ID: s22
Add of existing embedding ID: s23
Add of existing embedding ID: s24
Add of existing embedding ID: s25
Add of existing embedding ID: s26
Add of existing embedding ID: s27
Add of existing embedding ID: s28
Add of existing embedding ID: s29
Add of existing embeddi

In [76]:
print(f"No. of documents: {netflix_collection.count()}")


No. of documents: 100
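For 100 documents a single `add` call is enough, but with a larger dataset it may be safer to send the data in chunks, since both the embedding API and the database can limit how much one request carries. A minimal sketch, assuming a hypothetical `chunks` helper and an arbitrary batch size of 40:

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


demo_ids = [f"s{n}" for n in range(1, 101)]  # stand-in for the 100 show IDs
batch_sizes = [len(batch) for batch in chunks(demo_ids, 40)]
print(batch_sizes)  # [40, 40, 20]
```

In the notebook, the same slice bounds would be applied to `ids`, `documents`, and `metadatas` together, with one `netflix_collection.add(...)` call per chunk.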


In [77]:
netflix_collection.get(ids=["s1"])

{'ids': ['s1'],
 'embeddings': None,
 'metadatas': [{'release_year': 2020, 'type': 'Movie'}],
 'documents': ['Title: Dick Johnson Is Dead Movie \nDescription: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable. \nCategories: Documentaries'],
 'uris': None,
 'data': None}

## Semantic search application

In [78]:
query_strings=["comedy tv show"]

In [79]:
result = netflix_collection.query(query_texts=query_strings, n_results=2)
print(result)

{'ids': [['s100', 's16']], 'distances': [[1.0077470541000366, 1.0844204425811768]], 'metadatas': [[{'release_year': 2021, 'type': 'TV Show'}, {'release_year': 2021, 'type': 'TV Show'}]], 'embeddings': None, 'documents': [['Title: On the Verge TV Show \nDescription: Four women — a chef, a single mom, an heiress and a job seeker — dig into love and work, with a generous side of midlife crises, in pre-pandemic LA. \nCategories: TV Comedies, TV Dramas', 'Title: Dear White People TV Show \nDescription: Students of color navigate the daily slights and slippery politics of life at an Ivy League college that\'s not nearly as "post-racial" as it thinks. \nCategories: TV Comedies, TV Dramas']], 'uris': None, 'data': None}
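The `distances` values are easier to read with the metric in mind. Assuming the collection uses Chroma's default space (squared Euclidean distance, `l2`) and that the embeddings have unit length, as OpenAI documents for its embedding models, a distance `d` relates to cosine similarity by `cos = 1 - d / 2`. The sketch below checks that identity on toy vectors; the function names are illustrative.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def cosine_from_squared_l2(distance):
    """Convert squared L2 distance to cosine similarity (unit vectors only)."""
    return 1.0 - distance / 2.0


# Two toy unit vectors 60 degrees apart
u = [1.0, 0.0]
v = [0.5, math.sqrt(3) / 2]
d = squared_l2(u, v)
print(round(d, 6), round(cosine_similarity(u, v), 6), round(cosine_from_squared_l2(d), 6))
```

Under those assumptions, the top hit's distance of about 1.008 corresponds to a cosine similarity of roughly 0.50. Smaller distances mean closer matches.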


In [85]:
for doc, meta, distance in zip(
    result['documents'][0],
    result['metadatas'][0],
    result['distances'][0]
):
    print(f"Document: {doc}")
    print(f"Metadata: {meta}")
    print(f"Distance: {distance}")
    print('-' * 50)

Document: Title: Grown Ups Movie 
Description: Mourning the loss of their beloved junior high basketball coach, five middle-aged pals reunite at a lake house and rediscover the joys of being a kid. 
Categories: Comedies
Metadata: {'release_year': 2010, 'type': 'Movie'}
Distance: 1.3652856349945068
--------------------------------------------------
Document: Title: Kid Cosmic TV Show 
Description: A boy's superhero dreams come true when he finds five powerful cosmic stones. But saving the day is harder than he imagined — and he can't do it alone. 
Categories: Kids' TV, TV Comedies, TV Sci-Fi & Fantasy
Metadata: {'release_year': 2021, 'type': 'TV Show'}
Distance: 1.4220644235610962
--------------------------------------------------
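Each field in the result is a list of lists: the outer list has one entry per query string and the inner list one entry per hit, which is why the loop indexes `[0]`. Below is a sketch of a hypothetical `flatten_results` helper that handles any number of queries, run on a hand-built dictionary shaped like a query result (the values are shortened stand-ins):

```python
def flatten_results(query_texts, result):
    """Convert a Chroma-style query result into a flat list of hit records."""
    hits = []
    for query, ids, docs, metas, dists in zip(
        query_texts,
        result["ids"],
        result["documents"],
        result["metadatas"],
        result["distances"],
    ):
        ranked = enumerate(zip(ids, docs, metas, dists), start=1)
        for rank, (doc_id, doc, meta, dist) in ranked:
            hits.append(
                {"query": query, "rank": rank, "id": doc_id,
                 "document": doc, "metadata": meta, "distance": dist}
            )
    return hits


# Hand-built stand-in with two queries and shortened values
sample_result = {
    "ids": [["s80", "s10"], ["s24"]],
    "documents": [["doc A", "doc B"], ["doc C"]],
    "metadatas": [[{"type": "Movie"}, {"type": "Movie"}], [{"type": "Movie"}]],
    "distances": [[1.45, 1.50], [1.44]],
}
sample_hits = flatten_results(["comedy", "kids"], sample_result)
for hit in sample_hits:
    print(hit["query"], hit["rank"], hit["id"], hit["distance"])
```

A flat list like this is convenient when several query strings are sent at once, because each hit keeps a reference to the query that produced it.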


In [98]:
query_strings2=["comedy","kids"]

In [99]:
result2 = netflix_collection.query(
    query_texts=query_strings2,
    n_results=5,
    where={
        "$and": [
            {"type": {"$eq": "Movie"}},
            {"release_year": {"$gt": 2020}},
        ]
    },
)
print(result2)

{'ids': [['s80', 's10', 's78', 's31', 's14'], ['s24', 's7', 's65', 's72', 's78']], 'distances': [[1.451798677444458, 1.5047898292541504, 1.5348659753799438, 1.543830156326294, 1.5622550249099731], [1.435831069946289, 1.5152090787887573, 1.515388011932373, 1.5307660102844238, 1.550815463066101]], 'metadatas': [[{'release_year': 2021, 'type': 'Movie'}, {'release_year': 2021, 'type': 'Movie'}, {'release_year': 2021, 'type': 'Movie'}, {'release_year': 2021, 'type': 'Movie'}, {'release_year': 2021, 'type': 'Movie'}], [{'release_year': 2021, 'type': 'Movie'}, {'release_year': 2021, 'type': 'Movie'}, {'release_year': 2021, 'type': 'Movie'}, {'release_year': 2021, 'type': 'Movie'}, {'release_year': 2021, 'type': 'Movie'}]], 'embeddings': None, 'documents': [['Title: Tughlaq Durbar (Telugu) Movie \nDescription: A budding politician has devious plans to rise in the ranks — until an unexpected new presence begins to interfere with his every crooked move. \nCategories: Comedies, Dramas, Internat
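The `where` clause keeps only items whose metadata satisfies every condition under `$and`: here, `type` equal to `"Movie"` and `release_year` greater than 2020. The plain-Python sketch below mimics that logic for just the operators used in this query. It illustrates the semantics and is not Chroma's implementation; `matches` is a hypothetical name.

```python
import operator

# Only the comparison operators that appear in the query above
COMPARATORS = {"$eq": operator.eq, "$gt": operator.gt}


def matches(metadata, where):
    """Return True if `metadata` satisfies a small subset of where-filter syntax."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    for field, condition in where.items():
        for op_name, expected in condition.items():
            if field not in metadata:
                return False
            if not COMPARATORS[op_name](metadata[field], expected):
                return False
    return True


where = {"$and": [{"type": {"$eq": "Movie"}}, {"release_year": {"$gt": 2020}}]}
print(matches({"type": "Movie", "release_year": 2021}, where))    # True
print(matches({"type": "Movie", "release_year": 2020}, where))    # False
print(matches({"type": "TV Show", "release_year": 2021}, where))  # False
```

Because `$gt` is strict, titles released in 2020 are excluded, which matches the results above where every hit has `release_year` 2021.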

In [100]:
for doc, meta, distance in zip(
    result2['documents'][0],
    result2['metadatas'][0],
    result2['distances'][0]
):
    print(f"Document: {doc}")
    print(f"Metadata: {meta}")
    print(f"Distance: {distance}")
    print('-' * 50)

Document: Title: Tughlaq Durbar (Telugu) Movie 
Description: A budding politician has devious plans to rise in the ranks — until an unexpected new presence begins to interfere with his every crooked move. 
Categories: Comedies, Dramas, International Movies
Metadata: {'release_year': 2021, 'type': 'Movie'}
Distance: 1.451798677444458
--------------------------------------------------
Document: Title: The Starling Movie 
Description: A woman adjusting to life after a loss contends with a feisty bird that's taken over her garden — and a husband who's struggling to find a way forward. 
Categories: Comedies, Dramas
Metadata: {'release_year': 2021, 'type': 'Movie'}
Distance: 1.5047898292541504
--------------------------------------------------
Document: Title: Little Singham - Black Shadow Movie 
Description: Kid cop Little Singham loses all his superpowers while trying to stop the demon Kaal’s new evil plans! Can his inner strength help him defeat the enemy? 
Categories: Children & Fa