# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is in the `project/starter/games` folder. Each file becomes one document in the collection you create.
Example record:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
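
Since every file is expected to follow the shape of the record above, it can help to check the fields before indexing. The following is a minimal sketch, not part of the starter code: `load_games` and `REQUIRED_FIELDS` are hypothetical names, and the field list is assumed from the single example shown here.

```python
import json
from pathlib import Path

# Field names taken from the example record above (assumed to apply to every file).
REQUIRED_FIELDS = ("Name", "Platform", "Genre", "Publisher", "Description", "YearOfRelease")


def load_games(data_dir):
    """Return {doc_id: record} for every .json file in data_dir.

    Raises ValueError if a record lacks one of REQUIRED_FIELDS.
    """
    games = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open("r", encoding="utf-8") as f:
            record = json.load(f)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"{path.name} is missing fields: {missing}")
        # The file name without its extension serves as the document ID.
        games[path.stem] = record
    return games
```

Failing early on a malformed file is likely cheaper than discovering a `KeyError` halfway through an indexing run.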


### Setup

In [1]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

In [4]:
# Load environment variables (OPENAI_API_KEY, CHROMA_OPENAI_API_KEY for embeddings)
load_dotenv()

True

### VectorDB Instance

In [5]:
# Persistent ChromaDB client so the vector DB survives restarts
chroma_client = chromadb.PersistentClient(path="chromadb")

### Collection

In [6]:
# OpenAI embedding function (Vocareum endpoint)
# The API key is read from the OPENAI_API_KEY environment variable loaded from .env
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key_env_var="OPENAI_API_KEY",
    api_base="https://openai.vocareum.com/v1",
    model_name="text-embedding-3-small",
)

In [7]:
# Create (or get) collection with the embedding function
collection = chroma_client.get_or_create_collection(
    name="udaplay",
    embedding_function=embedding_fn,
)

### Add documents

In [8]:
# Assumes the notebook runs from project/starter, so "games" resolves to project/starter/games
data_dir = "games"

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(".json"):
        continue

    file_path = os.path.join(data_dir, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # You can change what text you want to index
    content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"

    # Use the file name without its extension (e.g. 001) as the ID
    doc_id = os.path.splitext(file_name)[0]

    collection.add(
        ids=[doc_id],
        documents=[content],
        metadatas=[game]
    )
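
Because the client is persistent, re-running the cell above passes every file to `collection.add` again, which probably means another round of embedding requests. One way to avoid that is to skip IDs that are already stored. The sketch below uses a hypothetical helper, `add_missing`, and assumes that `collection.get(ids=...)` returns a dict whose `"ids"` entry lists only the IDs it found.

```python
def add_missing(collection, ids, documents, metadatas):
    """Add only the records whose IDs are not yet stored; return the IDs added."""
    if not ids:
        return []
    # Assumption: get(ids=...) returns a dict whose "ids" entry holds the matches found.
    existing = set(collection.get(ids=list(ids))["ids"])
    new_rows = [
        (doc_id, doc, meta)
        for doc_id, doc, meta in zip(ids, documents, metadatas)
        if doc_id not in existing
    ]
    if not new_rows:
        return []
    # Split the kept rows back into the parallel lists that add() expects.
    new_ids, new_docs, new_metas = (list(column) for column in zip(*new_rows))
    collection.add(ids=new_ids, documents=new_docs, metadatas=new_metas)
    return new_ids
```

To use it, collect `doc_id`, `content`, and `game` into three lists inside the loop and call `add_missing(collection, ids, documents, metadatas)` once afterwards. Sending one batch also reduces the number of separate `add` calls.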

In [None]:
# Demonstrate that the vector database can be queried for semantic search
demo_query = "Mario platformer Nintendo"
results = collection.query(query_texts=[demo_query], n_results=3, include=["metadatas", "documents"])
print(f"Semantic search query: '{demo_query}'")
print(f"Top {len(results['metadatas'][0])} results:")
for i, (meta, doc) in enumerate(zip(results["metadatas"][0], results["documents"][0]), 1):
    print(f"  {i}. {meta.get('Name')} ({meta.get('Platform')}, {meta.get('YearOfRelease')})")
    print(f"     {meta.get('Description', doc)[:80]}...")

Semantic search query: 'Mario platformer Nintendo'
Top 3 results:
  1. Super Mario World (Super Nintendo Entertainment System (SNES), 1990)
     A classic platformer where Mario embarks on a quest to save Princess Toadstool a...
  2. Super Mario 64 (Nintendo 64, 1996)
     A groundbreaking 3D platformer that set new standards for the genre, featuring M...
  3. Super Smash Bros. Melee (GameCube, 2001)
     A crossover fighting game featuring characters from various Nintendo franchises ...
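
The query result is nested: each key holds one inner list per query text, which is why the loop above indexes `[0]`. A small helper can flatten one query's hits into plain dicts for later use. This is a sketch under assumptions: `flatten_results` is a hypothetical name, and it assumes that keys left out of `include` (such as `"distances"` in the cell above) are either absent or `None`.

```python
def flatten_results(results, query_index=0):
    """Turn one query's nested result lists into a list of flat dicts."""
    ids = results["ids"][query_index]
    # Keys not requested via include= are assumed to be missing or None.
    metadatas = results.get("metadatas") or []
    distances = results.get("distances") or []
    rows = []
    for rank, doc_id in enumerate(ids):
        row = {"rank": rank + 1, "id": doc_id}
        if metadatas:
            row["name"] = metadatas[query_index][rank].get("Name")
        if distances:
            row["distance"] = distances[query_index][rank]
        rows.append(row)
    return rows
```

Adding `"distances"` to the `include` list of the query would let each row carry its distance, which can be useful for judging whether a result is close enough to trust.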
