# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is in the `project/starter/games` folder. Each file becomes one document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```


### Setup

In [1]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

In [None]:
# TODO: Create a .env file with the following variables
# OPENAI_API_KEY="YOUR_KEY"
# CHROMA_OPENAI_API_KEY="YOUR_KEY"
# TAVILY_API_KEY="YOUR_KEY"

In [3]:
# TODO: Load environment variables
load_dotenv()

True
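
`load_dotenv()` returning `True` only indicates that at least one variable was set from the `.env` file, not that every key is present. A minimal sketch of a completeness check follows. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, not part of the starter. The function takes the environment as a mapping so it can be tried without real keys.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the .env template above.
REQUIRED_VARS = ("OPENAI_API_KEY", "CHROMA_OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_env_vars(names: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the names that are absent or blank in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


# Run this after load_dotenv() so values from .env are visible in os.environ.
missing = missing_env_vars(REQUIRED_VARS, os.environ)
print("Missing variables:", missing or "none")
```

Failing early here is usually easier to debug than an authentication error raised later by the embedding function.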

### VectorDB Instance

In [4]:
# TODO: Instantiate your ChromaDB Client
# Choose any path you want
chroma_client = chromadb.PersistentClient(path="chromadb")

### Collection

In [5]:
# TODO: Pick one embedding function
# If you pick something other than OpenAI,
# make sure you use the same function when loading the collection
# Called with no arguments, the API key is expected to come from the environment
embedding_fn = embedding_functions.OpenAIEmbeddingFunction()

In [None]:
# TODO: Create a collection
# Choose any name you want
collection = chroma_client.get_or_create_collection(
    name="udaplay",
    embedding_function=embedding_fn
)

### Add documents

In [None]:
# The data lives in "project/starter/games"; this relative path assumes
# the notebook is run from "project/starter"
data_dir = "games"

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(".json"):
        continue

    file_path = os.path.join(data_dir, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # You can change which text gets indexed
    content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"

    # Use the file name without its extension (e.g. 001) as the ID
    doc_id = os.path.splitext(file_name)[0]

    collection.add(
        ids=[doc_id],
        documents=[content],
        metadatas=[game]
    )
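
The loop above calls `collection.add` once per file. As an alternative sketch, the records can be collected into parallel lists first and added in a single call, which lets Chroma embed the whole batch at once. `load_games` is a hypothetical helper, not part of the starter. It builds the same document text and IDs as the loop above.

```python
import json
import os
from typing import Dict, List, Tuple


def load_games(data_dir: str) -> Tuple[List[str], List[str], List[Dict]]:
    """Read every .json file in data_dir into parallel ids/documents/metadatas lists."""
    ids, documents, metadatas = [], [], []
    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".json"):
            continue  # skip anything that is not a game record
        with open(os.path.join(data_dir, file_name), "r", encoding="utf-8") as f:
            game = json.load(f)
        ids.append(os.path.splitext(file_name)[0])  # "001.json" -> "001"
        documents.append(
            f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"
        )
        metadatas.append(game)
    return ids, documents, metadatas


# Usage with the collection created earlier:
# ids, documents, metadatas = load_games("games")
# collection.add(ids=ids, documents=documents, metadatas=metadatas)
```

Keeping the file parsing separate from the Chroma call also makes it possible to inspect the generated text before anything is embedded.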


In [8]:
print("Count:", collection.count())


Count: 15


In [9]:
doc_id = "001"  # example
print(collection.get(ids=[doc_id], include=["documents", "metadatas"]))


{'ids': ['001'], 'embeddings': None, 'documents': ['[PlayStation 1] Gran Turismo (1997) - A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.'], 'uris': None, 'included': ['documents', 'metadatas'], 'data': None, 'metadatas': [{'Description': 'A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.', 'Name': 'Gran Turismo', 'Publisher': 'Sony Computer Entertainment', 'YearOfRelease': 1997, 'Platform': 'PlayStation 1', 'Genre': 'Racing'}]}


In [10]:
results = collection.query(
    query_texts=["strategy game with diplomacy"],
    n_results=5,
    where={"Platform": "PC"},
    include=["documents", "metadatas", "distances"]
)
print(results)


{'ids': [[]], 'embeddings': None, 'documents': [[]], 'uris': None, 'included': ['documents', 'metadatas', 'distances'], 'data': None, 'metadatas': [[]], 'distances': [[]]}
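
The empty result above comes from the `where` filter, which matches metadata values exactly: no document in this collection has `Platform` equal to `"PC"`. One way to find the values that do exist is to count them across the stored metadata. The sketch below uses a hypothetical helper, `count_values`, and a small sample built only from platforms that appear in this notebook's outputs.

```python
from collections import Counter
from typing import Dict, Iterable


def count_values(metadatas: Iterable[Dict], key: str) -> Counter:
    """Count how often each value of `key` appears across metadata dicts."""
    return Counter(meta.get(key) for meta in metadatas)


# With the collection from this notebook:
# all_meta = collection.get(include=["metadatas"])["metadatas"]
# print(count_values(all_meta, "Platform"))

# Stand-alone illustration with records seen in this notebook's outputs.
sample = [
    {"Name": "Gran Turismo", "Platform": "PlayStation 1"},
    {"Name": "Super Mario 64", "Platform": "Nintendo 64"},
    {"Name": "Mario Kart 8 Deluxe", "Platform": "Nintendo Switch"},
]
print(count_values(sample, "Platform"))
```

Once the stored values are known, the filter can name one of them directly, or match several at once with Chroma's `$in` operator, for example `where={"Platform": {"$in": ["Nintendo 64", "Nintendo Switch"]}}`.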


In [11]:
results = collection.query(query_texts=["open world RPG"], n_results=3)
print(results["documents"][0])

["[PlayStation 4] Marvel's Spider-Man (2018) - An open-world superhero game that lets players swing through New York City as Spider-Man, battling iconic villains.", '[Xbox One] Minecraft (2014) - A sandbox game that allows players to build and explore infinite worlds, fostering creativity and adventure.', "[PlayStation 2] Grand Theft Auto: San Andreas (2004) - An expansive open-world game set in the fictional state of San Andreas, following the story of Carl 'CJ' Johnson."]


In [12]:
print("Documents in collection:", collection.count())

q = "first 3D platformer Mario game"
res = collection.query(
    query_texts=[q],
    n_results=3,
    include=["metadatas", "distances"]
)

print("QUERY:", q)
for meta, dist in zip(res["metadatas"][0], res["distances"][0]):
    print(f"- {meta.get('Name')} ({meta.get('YearOfRelease')}) [{meta.get('Platform')}]  distance={dist:.4f}")


Documents in collection: 15
QUERY: first 3D platformer Mario game
- Super Mario 64 (1996) [Nintendo 64]  distance=0.1032
- Super Mario World (1990) [Super Nintendo Entertainment System (SNES)]  distance=0.1278
- Mario Kart 8 Deluxe (2017) [Nintendo Switch]  distance=0.1880
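
Smaller distances mean closer matches, so the best answer to the query is listed first. If weaker matches should be dropped before the results are used downstream, a distance cutoff is one option. The sketch below is only an illustration: `hits_within` is a hypothetical helper, and the 0.15 cutoff is arbitrary, since a sensible value depends on the embedding model and the collection's distance metric.

```python
from typing import Dict, List, Tuple


def hits_within(res: Dict, max_distance: float) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs of the first query with distance <= max_distance."""
    metas = res["metadatas"][0]
    dists = res["distances"][0]
    return [(m.get("Name"), d) for m, d in zip(metas, dists) if d <= max_distance]


# Result shaped like the query output above (one query, three hits).
example_res = {
    "metadatas": [[
        {"Name": "Super Mario 64"},
        {"Name": "Super Mario World"},
        {"Name": "Mario Kart 8 Deluxe"},
    ]],
    "distances": [[0.1032, 0.1278, 0.1880]],
}
print(hits_within(example_res, max_distance=0.15))
# [('Super Mario 64', 0.1032), ('Super Mario World', 0.1278)]
```

The same function works on the `res` returned by `collection.query`, provided `"metadatas"` and `"distances"` are both listed in `include`.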
