# Udaplay Project


## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is in the `project/starter/games` folder. Each file will become a document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
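
Since every file is expected to follow this shape, a quick check before indexing can catch incomplete records. The following is a minimal sketch: the helper name `missing_fields` is hypothetical, and treating all six keys as required is an assumption for illustration, not something the starter code enforces.

```python
# Keys copied from the example record; treating all of them as required is an assumption.
REQUIRED_FIELDS = ("Name", "Platform", "Genre", "Publisher", "Description", "YearOfRelease")


def missing_fields(game: dict) -> list:
    """Return the required keys that are absent from a game record, in a stable order."""
    return [field for field in REQUIRED_FIELDS if field not in game]


# Shortened stand-in for one of the JSON files.
example = {
    "Name": "Gran Turismo",
    "Platform": "PlayStation 1",
    "Genre": "Racing",
    "Publisher": "Sony Computer Entertainment",
    "Description": "A realistic racing simulator.",
    "YearOfRelease": 1997,
}
print(missing_fields(example))  # prints []
```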


### Setup

In [1]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

In [2]:
load_dotenv()

True

In [3]:
# To run Part 1 of the project, create a .env file that defines CHROMA_OPENAI_API_KEY
CHROMA_OPENAI_API_KEY = os.getenv("CHROMA_OPENAI_API_KEY")
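
`os.getenv` returns `None` when the variable is missing, so a forgotten `.env` entry may only show up later as a harder-to-trace error. A minimal sketch of an early check, assuming a hypothetical helper named `require_env`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string.
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

After `load_dotenv()`, it could be used as `CHROMA_OPENAI_API_KEY = require_env("CHROMA_OPENAI_API_KEY")`.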

Set up the ChromaDB path and collection name. These should match the values used in the `Udaplay_02.ipynb` file.

In [4]:
CHROMADB_PATH = "chromadb"
CHROMADB_COLLECTION_NAME = "udaplay"

### Creating a VectorDB Instance ...

In [5]:
chroma_client = chromadb.PersistentClient(path=CHROMADB_PATH)

### ... and a Collection within this instance

In [6]:
# Make sure you use the same embedding function for indexing documents and for querying them

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
# EMBEDDING_MODEL_NAME = "text-embedding-3-small"

embeddings_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=CHROMA_OPENAI_API_KEY,
    model_name=EMBEDDING_MODEL_NAME
)

In [7]:
# Start with a fresh collection: delete the existing one, then recreate it.
# delete_collection raises an error if the collection does not exist yet,
# so comment that line out on the very first run.
chroma_client.delete_collection(name=CHROMADB_COLLECTION_NAME)
collection = chroma_client.create_collection(
    name=CHROMADB_COLLECTION_NAME,
    embedding_function=embeddings_fn
)
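
Because `delete_collection` fails when the collection is missing, the reset can be wrapped so that it works on both the first and later runs. This is a sketch only: `reset_collection` is a hypothetical helper, and it catches exceptions broadly on the assumption that the exact exception type depends on the installed Chroma version.

```python
def reset_collection(client, name, embedding_function=None):
    """Delete the collection if it exists, then create it again from scratch."""
    try:
        client.delete_collection(name=name)
    except Exception:
        # Assumption: a missing collection is the only failure expected here.
        # The exact exception type depends on the installed Chroma version.
        pass
    return client.create_collection(name=name, embedding_function=embedding_function)
```

With the names defined above, the call would be `collection = reset_collection(chroma_client, CHROMADB_COLLECTION_NAME, embeddings_fn)`.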

### Add documents

In [8]:
# Make sure a "games" directory exists relative to the notebook's working directory
data_dir = "games"

for file_name in sorted(os.listdir(data_dir)):
    if not file_name.endswith(".json"):
        continue

    file_path = os.path.join(data_dir, file_name)
    with open(file_path, "r", encoding="utf-8") as f:
        game = json.load(f)

    # You can change what text you want to index
    content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}, {game['Publisher']}) - {game['Description']}"

    # Use the file name without its extension (like 001) as the ID
    doc_id = os.path.splitext(file_name)[0]

    collection.add(
        ids=[doc_id],
        documents=[content],
        metadatas=[game]
    )
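
The loop above calls `collection.add` once per file. Since `collection.add` accepts lists, the records can also be collected first and added in a single call. A sketch, assuming a hypothetical helper `build_records` that reuses the same text template:

```python
import json
import os


def build_records(data_dir: str):
    """Read every .json file in data_dir and return parallel lists for collection.add."""
    ids, documents, metadatas = [], [], []
    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".json"):
            continue
        with open(os.path.join(data_dir, file_name), "r", encoding="utf-8") as f:
            game = json.load(f)
        # File name without extension (like 001) becomes the document ID.
        ids.append(os.path.splitext(file_name)[0])
        documents.append(
            f"[{game['Platform']}] {game['Name']} "
            f"({game['YearOfRelease']}, {game['Publisher']}) - {game['Description']}"
        )
        metadatas.append(game)
    return ids, documents, metadatas
```

The result would then be passed on as `ids, documents, metadatas = build_records(data_dir)` followed by `collection.add(ids=ids, documents=documents, metadatas=metadatas)`.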

### Show document retrieval

This query shows that the database can be queried. The collection has no useful data for this particular question, so the five nearest documents are returned even though none of them answers it.

In [9]:
query = "When was Zelda: Breath of the Wild released?"
results = collection.query(
    query_texts=[query],
    n_results=5
)

print(results["documents"][0])

["[Nintendo 64] Super Mario 64 (1996, Nintendo) - A groundbreaking 3D platformer that set new standards for the genre, featuring Mario's quest to rescue Princess Peach.", '[Nintendo Switch] Mario Kart 8 Deluxe (2017, Nintendo) - An enhanced version of Mario Kart 8, featuring new characters, tracks, and improved gameplay mechanics.', '[Super Nintendo Entertainment System (SNES)] Super Mario World (1990, Nintendo) - A classic platformer where Mario embarks on a quest to save Princess Toadstool and Dinosaur Land from Bowser.', "[Wii] Wii Sports (2006, Nintendo) - A collection of sports games that utilize the Wii's motion controls, bundled with the console to showcase its capabilities.", '[Game Boy Color] Pokémon Gold and Silver (1999, Nintendo) - Second-generation Pokémon games introducing new regions, Pokémon, and gameplay mechanics.']
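
Because a vector query returns the nearest neighbours whether or not they are relevant, it helps to read the distances alongside the documents. A sketch, assuming the default result layout of `collection.query` (parallel lists under `ids`, `distances`, and `documents`, with one inner list per query text). The helper name `top_hits` is hypothetical.

```python
def top_hits(results: dict, query_index: int = 0):
    """Pair each returned document with its ID and distance for one query."""
    return list(
        zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["documents"][query_index],
        )
    )


# Hand-written stand-in for a query result; IDs, distances, and texts are made up.
fake_results = {
    "ids": [["a", "b"]],
    "distances": [[0.41, 0.47]],
    "documents": [["first document", "second document"]],
}
for doc_id, distance, text in top_hits(fake_results):
    print(f"{doc_id}  {distance:.2f}  {text}")
```

With the real `results` from the cell above, `top_hits(results)` yields the same kind of triples. Smaller distances mean closer matches, so uniformly large distances are a hint that the collection has nothing relevant to the query.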
