# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is in the `project/starter/games` folder. Each file becomes one document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
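
Since each file becomes one document, it can help to check a record's shape before indexing it. The sketch below assumes the six fields shown in the example are all required; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, not part of the starter code.

```python
import json

# Assumption: every game file carries the six fields from the example record.
REQUIRED_FIELDS = (
    "Name", "Platform", "Genre", "Publisher", "Description", "YearOfRelease",
)


def missing_fields(game: dict) -> list:
    """Return the required fields that are absent from a game record."""
    return [field for field in REQUIRED_FIELDS if field not in game]


# The example record from above, parsed the same way a file would be.
sample = json.loads("""
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
""")

print(missing_fields(sample))           # complete record
print(missing_fields({"Name": "Foo"}))  # incomplete record
```

A record that reports missing fields could be skipped or logged before it reaches the collection.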


### Setup

In [1]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import os
import json
import chromadb
import openai  # not called directly; the OpenAI embedding function needs this package installed
from chromadb.utils import embedding_functions
from dotenv import load_dotenv, find_dotenv

In [3]:
# TODO: Create a .env file with the following variables
# OPENAI_API_KEY="YOUR_KEY"
# CHROMA_OPENAI_API_KEY="YOUR_KEY"
# TAVILY_API_KEY="YOUR_KEY"

In [4]:
# TODO: Load environment variables
path = find_dotenv(usecwd=True)
load_dotenv(dotenv_path=path, override=True)

True
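
The `True` above indicates that `load_dotenv` set at least one variable, not that every key is present. A minimal sketch of an explicit check, assuming the three variable names from the TODO cell; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names:

```python
import os
from typing import Mapping, Sequence

# Assumption: these are the keys the project expects in the .env file.
REQUIRED_VARS = ("OPENAI_API_KEY", "CHROMA_OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_env_vars(
    names: Sequence[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> list:
    """Return the variable names that are unset or blank in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing `env` explicitly keeps the helper easy to try with a plain dictionary instead of the real environment.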

### VectorDB Instance

In [5]:
# TODO: Instantiate your ChromaDB Client
# Choose any path you want
chroma_client = chromadb.PersistentClient(path="chromadb")

### Collection

In [6]:
# TODO: Pick one embedding function
# If you pick something other than OpenAI,
# use the same function when you load the collection later
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))

In [7]:
# TODO: Create a collection
# Choose any name you want
# get_or_create_collection reuses the collection if it already exists
collection = chroma_client.get_or_create_collection(
    name="udaplay",
    embedding_function=embedding_fn
)

### Add documents

In [8]:
# Relative path: assumes the notebook runs from "project/starter",
# so the data lives in "project/starter/games"
data_dir = "games"

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(".json"):
        continue

    file_path = os.path.join(data_dir, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # You can change what text you want to index
    content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"

    # Use file name (like 001) as ID
    doc_id = os.path.splitext(file_name)[0]

    try:
        collection.upsert(
            ids=[doc_id],
            documents=[content],
            metadatas=[game]
        )
        print(f"Added document ID: {doc_id}")
    except Exception as e:
        print(f"Error adding document ID: {doc_id}")
        print(e)


Added document ID: 001
Added document ID: 002
Added document ID: 003
Added document ID: 004
Added document ID: 005
Added document ID: 006
Added document ID: 007
Added document ID: 008
Added document ID: 009
Added document ID: 010
Added document ID: 011
Added document ID: 012
Added document ID: 013
Added document ID: 014
Added document ID: 015


In [11]:
query = "When was Marvel's Spider-Man 2 for PlayStation 5 released?"

results = collection.query(
    query_texts=[query],
    n_results=3,
)

print("VECTOR SEARCH RESULTS:\n")
for i, (doc, meta, dist) in enumerate(zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
)):
    # Similarity-style score; equals cosine similarity only for a cosine-distance collection
    score = 1 - dist
    score_pct = round(score * 100, 2)

    print(f"Result #{i+1}")
    print(f"Distance: {dist}")
    print(f"Score: {score:.4f} ({score_pct}%)")
    print(f"Game: {meta.get('Name')}")
    print(f"Platform: {meta.get('Platform')}")
    print(f"Year: {meta.get('YearOfRelease')}\n")


VECTOR SEARCH RESULTS:

Result #1
Distance: 0.11023908853530884
Score: 0.8898 (88.98%)
Game: Marvel's Spider-Man 2
Platform: PlayStation 5
Year: 2023

Result #2
Distance: 0.1486375331878662
Score: 0.8514 (85.14%)
Game: Marvel's Spider-Man
Platform: PlayStation 4
Year: 2018

Result #3
Distance: 0.2188379168510437
Score: 0.7812 (78.12%)
Game: Gran Turismo 5
Platform: PlayStation 3
Year: 2010
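
How to read `score = 1 - dist` depends on the collection's distance space. As a rough sketch, assuming unit-length embeddings: cosine and inner-product distances are `1 - similarity`, while squared L2 (Chroma's default space when none is configured) equals `2 * (1 - similarity)`. Which space this collection actually uses depends on how it was created, and `similarity_from_distance` is a hypothetical helper, not a Chroma function.

```python
def similarity_from_distance(dist: float, space: str = "cosine") -> float:
    """Convert a distance to a cosine similarity, assuming unit-length embeddings."""
    if space in ("cosine", "ip"):
        # Distance is 1 - similarity, so invert it directly.
        return 1.0 - dist
    if space == "l2":
        # Squared L2 between unit vectors is 2 * (1 - cosine similarity).
        return 1.0 - dist / 2.0
    raise ValueError(f"Unsupported space: {space!r}")


top_distance = 0.11023908853530884  # first result above
print(round(similarity_from_distance(top_distance, "cosine"), 4))  # 0.8898
print(round(similarity_from_distance(top_distance, "l2"), 4))      # 0.9449
```

Either way the ranking is unchanged, since a smaller distance always means a closer match; only the percentage shown differs.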

