# ChromaDB Guide & Population with Kaggle Data
This notebook creates the locally persisted ChromaDB (in /inference/chromadb) if it doesn't yet exist, creates the movie collection, and adds the catalogue data from the Kaggle dataset.

Note: we use Chroma's default distance function (squared L2 norm) and default embedding function (a sentence embedding model). If either is changed, the ChromaDB must be cleared and this notebook rerun so that the stored embeddings reflect the change.

In [31]:
import chromadb
import pandas as pd
import json
import time

In [32]:
# Connect to our locally-persisted chromadb
client = chromadb.PersistentClient(path="../inference/chromadb/")

In [33]:
# Chroma uses a sentence embedding model by default; it can be changed here.
# By default it also uses squared L2 norm as the distance function. To change it, see: https://docs.trychroma.com/usage-guide#changing-the-distance-function
movies = client.get_or_create_collection(name="movies")

In [34]:
# Smoke test: add a single entry with only an embedding and an id
movies.add(
    embeddings=[1 for _ in range(384)],
    ids=["test_id"]
)

In [36]:
movies.query(
    query_embeddings=[1 for _ in range(384)],
    n_results=1
)

{'ids': [['test_id']],
 'distances': [[0.0]],
 'metadatas': [[None]],
 'embeddings': None,
 'documents': [[None]]}
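
The `0.0` distance is what squared L2 predicts when the query vector equals the stored one. As a rough illustration in plain Python (not Chroma's internal implementation), squared L2 distance sums the squared per-dimension differences. The helper name `squared_l2` and the shifted query vector are made up for this sketch.

```python
def squared_l2(a, b):
    """Squared Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))

stored = [1.0] * 384          # same shape as the test embedding above
query_same = [1.0] * 384      # identical to the stored vector
query_shifted = [1.5] * 384   # hypothetical query, every dimension off by 0.5

print(squared_l2(stored, query_same))     # 0.0
print(squared_l2(stored, query_shifted))  # 96.0, i.e. 384 * 0.5**2
```

Because the distance is squared, it grows with the number of dimensions as well as with the per-dimension gap, so raw values are only comparable between embeddings of the same size.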

In [159]:
# Returns the first 10 items in the collection
movies.peek()
# Returns the number of items in the collection
movies.count()

17618

## Add Kaggle data into ChromaDB

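Each row is currently passed in as a single string: the row's dictionary converted with `str()`, structure included (so it is a Python dict representation rather than strict JSON). The documentation says `add` accepts documents, but passing dictionaries directly does not seem to work.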
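
To make the difference between the two string forms concrete, here is a small sketch using an abbreviated record based on the Toy Story row. It only compares `str()` with `json.dumps()`; it is not a recommendation to switch, since changing the document text would also change the embeddings and require repopulating the collection.

```python
import json

# Abbreviated record for illustration only (three of the dataset's fields)
row = {"title": "Toy Story", "adult": False, "budget": 30000000}

as_repr = str(row)         # what this notebook does: Python dict repr
as_json = json.dumps(row)  # alternative: strict JSON text

print(as_repr)
print(as_json)

# Only the JSON form can be parsed back with json.loads
try:
    json.loads(as_repr)
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False
print(repr_is_json)
```

The `str()` form uses single quotes and Python literals such as `False`, which is why a JSON parser rejects it.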

Time to embed and save: 13:31 min (811 s) for the 17618 documents, or about 0.046 seconds per embedding. Very quick!

Queries average roughly 0.045 seconds each.
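
The per-embedding figure can be rechecked from the two numbers reported above (13:31 elapsed, 17618 documents). The variable names are just for this sketch.

```python
minutes, seconds = 13, 31
n_documents = 17618

total_seconds = minutes * 60 + seconds           # 811
seconds_per_document = total_seconds / n_documents

print(total_seconds)
print(round(seconds_per_document, 4))            # 0.046
```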

In [153]:
# Load the comp585_movies.csv dataset
df = pd.read_csv("./kaggle_dataset/comp585_movies.csv")

In [154]:
df.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,movie_id
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,toy+story+1995
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,jumanji+1995


In [155]:
# Delete any duplicate rows based on movie_id
df = df.drop_duplicates(subset=['movie_id'])
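
For readers less familiar with `drop_duplicates`, its default behaviour (`keep='first'`) keeps the first row seen for each `movie_id` and drops later ones. The plain-Python sketch below mimics that idea on made-up rows; `dedupe_by_key` is a hypothetical helper, not part of pandas.

```python
def dedupe_by_key(rows, key):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Made-up rows: the third one repeats the first movie_id
rows = [
    {"movie_id": "toy+story+1995", "title": "Toy Story"},
    {"movie_id": "jumanji+1995", "title": "Jumanji"},
    {"movie_id": "toy+story+1995", "title": "Toy Story (duplicate row)"},
]

print([r["title"] for r in dedupe_by_key(rows, "movie_id")])
```

Deduplicating matters here because `movie_id` is used as the document id in the collection, and ids are expected to be unique.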

In [156]:
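# Convert each row to a dict, then to a string.
# Note: str() gives the Python dict repr (single quotes, nan), not strict JSON.
jsons = df.to_dict(orient='records')
jsons = [str(j) for j in jsons]
# Use movie_id as the document id
ids = df['movie_id'].tolist()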

In [158]:
"""
Documents: each entire row, serialized as a string (see the previous cell).
Ids: the movie_id field.

List of fields: adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,movie_id
The API gives us the same fields, so the course likely uses the Kaggle dataset as its source.

Chroma then automatically computes embeddings from the documents and stores them in the database.
"""
movies.add(
  documents = jsons,
  ids = ids
)

Insert of existing embedding ID: toy+story+1995
Insert of existing embedding ID: jumanji+1995
Insert of existing embedding ID: grumpier+old+men+1995
Insert of existing embedding ID: waiting+to+exhale+1995
Insert of existing embedding ID: father+of+the+bride+part+ii+1995
Insert of existing embedding ID: heat+1995
Insert of existing embedding ID: sabrina+1995
Insert of existing embedding ID: tom+and+huck+1995
Insert of existing embedding ID: sudden+death+1995
Insert of existing embedding ID: goldeneye+1995
Insert of existing embedding ID: the+american+president+1995
Insert of existing embedding ID: balto+1995
Insert of existing embedding ID: nixon+1995
Insert of existing embedding ID: cutthroat+island+1995
Insert of existing embedding ID: casino+1995
Insert of existing embedding ID: sense+and+sensibility+1995
Insert of existing embedding ID: four+rooms+1995
Insert of existing embedding ID: money+train+1995
Insert of existing embedding ID: get+shorty+1995
Insert of existing embedding ID: 
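
The repeated "Insert of existing embedding ID" messages indicate that these ids were already present in the collection when the cell ran, for example after a rerun. One way to avoid them, sketched in plain Python, is to filter the batch before calling `add`. This assumes a set of already-stored ids is available; how `existing_ids` would be fetched from the collection is deliberately left out, and `filter_new` is a hypothetical helper name.

```python
def filter_new(batch_ids, batch_documents, existing_ids):
    """Return (new_ids, new_documents) for ids not found in existing_ids."""
    new_ids, new_documents = [], []
    for doc_id, document in zip(batch_ids, batch_documents):
        if doc_id not in existing_ids:
            new_ids.append(doc_id)
            new_documents.append(document)
    return new_ids, new_documents

# Made-up example data; existing_ids is assumed to be known already
existing_ids = {"toy+story+1995", "jumanji+1995"}
batch_ids = ["toy+story+1995", "jumanji+1995", "heat+1995"]
batch_documents = ["doc for toy story", "doc for jumanji", "doc for heat"]

new_ids, new_documents = filter_new(batch_ids, batch_documents, existing_ids)
print(new_ids)
```

Only the filtered lists would then be passed to `movies.add`, so documents that are already stored are not embedded a second time.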

In [143]:
movies.count()

17618

In [257]:
# Test query: use an existing document as the query text and time the lookup of its 10 nearest neighbours
i = 4   # Movie 0 is Toy Story, 1 is Jumanji, 4 is Father of the Bride Part II
print(jsons[i])
start = time.time()
movies.query(
    query_texts=[jsons[i]],
    n_results=10,
)
print("Time taken: ", time.time() - start)
# Averages roughly 0.045 seconds per query

{'adult': False, 'belongs_to_collection': "{'id': 96871, 'name': 'Father of the Bride Collection', 'poster_path': '/nts4iOmNnq7GNicycMJ9pSAn204.jpg', 'backdrop_path': '/7qwE57OVZmMJChBpLEbJEmzUydk.jpg'}", 'budget': 0, 'genres': "[{'id': 35, 'name': 'Comedy'}]", 'homepage': nan, 'id': 11862, 'imdb_id': 'tt0113041', 'original_language': 'en', 'original_title': 'Father of the Bride Part II', 'overview': "Just when George Banks has recovered from his daughter's wedding, he receives the news that she's pregnant ... and that George's wife, Nina, is expecting too. He was planning on selling their home, but that's a plan that -- like George -- will have to change with the arrival of both a grandchild and a kid of his own.", 'popularity': 8.387519, 'poster_path': '/e64sOI48hQXyru7naBFyssKFxVd.jpg', 'production_companies': "[{'name': 'Sandollar Productions', 'id': 5842}, {'name': 'Touchstone Pictures', 'id': 9195}]", 'production_countries': "[{'iso_3166_1': 'US', 'name': 'United States of Americ