# [STARTER] Udaplay Project

## Part 01 - Offline RAG

In this part of the project, you'll build your VectorDB using Chroma.

The data is in the `project/starter/games` folder. Each file becomes one document in the collection you'll create.
Example:
```json
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
```
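
As a rough sketch of how one such file could become a document, the snippet below turns a game record into a single text string, using the same layout as the ingestion cell later in this notebook. The helper name `game_to_text` and the required-field check are illustrative assumptions, not part of the starter code.

```python
import json

# Fields every game file is expected to carry (taken from the example above)
REQUIRED_FIELDS = ("Name", "Platform", "Genre", "Publisher", "Description", "YearOfRelease")


def game_to_text(game: dict) -> str:
    """Format one game record as the text that will be embedded."""
    missing = [field for field in REQUIRED_FIELDS if field not in game]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"


raw = """
{
  "Name": "Gran Turismo",
  "Platform": "PlayStation 1",
  "Genre": "Racing",
  "Publisher": "Sony Computer Entertainment",
  "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
  "YearOfRelease": 1997
}
"""

example = json.loads(raw)
print(game_to_text(example))
```

Checking for missing fields up front gives a clearer error than a bare `KeyError` raised in the middle of a format string.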


### Setup

In [12]:
# Only needed for Udacity workspace

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
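
The cell above swaps in `pysqlite3` when it is installed. A likely reason (an assumption here, since the notebook does not say) is that Chroma needs a newer SQLite than some workspaces ship. The sketch below checks the SQLite version Python is linked against. The `3.35.0` minimum is the value commonly cited for Chroma and should be verified against the documentation for your version; `sqlite_is_recent_enough` is a hypothetical helper.

```python
import sqlite3

# Assumption: minimum SQLite version required by Chroma; verify for your release
MIN_SQLITE = (3, 35, 0)


def sqlite_is_recent_enough(version=sqlite3.sqlite_version_info, minimum=MIN_SQLITE) -> bool:
    """Return True when the SQLite library version meets the minimum."""
    return tuple(version) >= tuple(minimum)


print(f"SQLite library version: {sqlite3.sqlite_version}")
if sqlite_is_recent_enough():
    print("SQLite looks recent enough.")
else:
    print("SQLite may be too old; consider installing pysqlite3-binary.")
```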

In [13]:
import os
import json
import chromadb
from chromadb.utils import embedding_functions
from dotenv import load_dotenv

In [14]:
# TODO: Create a .env file with the following variables
# OPENAI_API_KEY="YOUR_KEY"
# CHROMA_OPENAI_API_KEY="YOUR_KEY"
# TAVILY_API_KEY="YOUR_KEY"

In [15]:
# TODO: Load environment variables
from dotenv import load_dotenv
import os 

load_dotenv('.env')
assert os.getenv('OPENAI_API_KEY') is not None, "OPENAI_API_KEY is not set"
assert os.getenv('TAVILY_API_KEY') is not None, "TAVILY_API_KEY is not set"
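
If you would rather see every missing variable at once instead of stopping at the first failed assertion, a small check like the one below can help. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, and the variable list is copied from the `.env` TODO above.

```python
import os

# Variable names listed in the .env TODO above
REQUIRED_VARS = ("OPENAI_API_KEY", "CHROMA_OPENAI_API_KEY", "TAVILY_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Passing `env` explicitly makes the helper easy to test without touching the real environment.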

### VectorDB Instance

In [16]:
# TODO: Instantiate your ChromaDB Client
# Choose any path you want
chroma_client = chromadb.PersistentClient(path="chromadb")

In [17]:
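# Start from a clean slate: drop the collection if it already exists
try:
    chroma_client.delete_collection(name="udaplay")
except Exception:
    # The collection does not exist yet, so there is nothing to delete
    pass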

### Collection

In [18]:
# TODO: Pick one embedding function
# If you pick something other than OpenAI,
# make sure you use the same function when loading the collection
embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv('OPENAI_API_KEY'),
    api_base="https://openai.vocareum.com/v1"
)

In [19]:
# TODO: Create a collection
# Choose any name you want
collection = chroma_client.create_collection(
   name="udaplay",
   embedding_function=embedding_fn
)

### Add documents

In [20]:
# Assumes the notebook runs from "project/starter", so the data is in "games"
data_dir = "games"
print(f'Looking in {data_dir} for files')
files_to_load = sorted(os.listdir(data_dir))
print(f'found {len(files_to_load)} files')

processed = 0
for file_name in files_to_load:
    # Skip anything that is not a JSON game file
    if not file_name.endswith(".json"):
        continue

    file_number = processed + 1
    try:
        print(f'processing file #{file_number} :: {file_name} ...starting')

        file_path = os.path.join(data_dir, file_name)
        with open(file_path, "r", encoding="utf-8") as f:
            game = json.load(f)

        # You can change what text you want to index
        content = f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"

        # Use file name (like 001) as ID
        doc_id = os.path.splitext(file_name)[0]

        collection.add(
            ids=[doc_id],
            documents=[content],
            metadatas=[game]
        )
        processed += 1
        print(f'processing file #{file_number} :: {file_name} ...complete!')
    except Exception as exc:
        # Report the failure and continue with the next file
        print(f'processing file #{file_number} :: {file_name} ...!FAILED! ({exc})')

print(f'{processed}/{len(files_to_load)} files processed successfully.')
    

Looking in games for files
found 15 files
processing file #1 :: 001.json ...starting
processing file #1 :: 001.json ...complete!
processing file #2 :: 002.json ...starting
processing file #2 :: 002.json ...complete!
processing file #3 :: 003.json ...starting
processing file #3 :: 003.json ...complete!
processing file #4 :: 004.json ...starting
processing file #4 :: 004.json ...complete!
processing file #5 :: 005.json ...starting
processing file #5 :: 005.json ...complete!
processing file #6 :: 006.json ...starting
processing file #6 :: 006.json ...complete!
processing file #7 :: 007.json ...starting
processing file #7 :: 007.json ...complete!
processing file #8 :: 008.json ...starting
processing file #8 :: 008.json ...complete!
processing file #9 :: 009.json ...starting
processing file #9 :: 009.json ...complete!
processing file #10 :: 010.json ...starting
processing file #10 :: 010.json ...complete!
processing file #11 :: 011.json ...starting
processing file #11 :: 011.json ...complet
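
The loop above calls `collection.add` once per file. An alternative is to collect all ids, documents, and metadata first and add them in one call. The sketch below shows only the collection step, so it runs without Chroma. `build_batch` is a hypothetical helper, and the conversion of non-scalar metadata values to strings rests on the assumption that Chroma metadata values should be plain strings, numbers, or booleans.

```python
import json
import os
import tempfile


def build_batch(data_dir: str):
    """Read every .json file in data_dir and return (ids, documents, metadatas)."""
    ids, documents, metadatas = [], [], []
    for file_name in sorted(os.listdir(data_dir)):
        if not file_name.endswith(".json"):
            continue
        with open(os.path.join(data_dir, file_name), "r", encoding="utf-8") as f:
            game = json.load(f)
        ids.append(os.path.splitext(file_name)[0])
        documents.append(
            f"[{game['Platform']}] {game['Name']} ({game['YearOfRelease']}) - {game['Description']}"
        )
        # Assumption: keep scalar values as-is and serialize anything else to a string
        metadatas.append({
            key: value if isinstance(value, (str, int, float, bool)) else json.dumps(value)
            for key, value in game.items()
        })
    return ids, documents, metadatas


# Small demo on a temporary directory holding one game file and one unrelated file
sample = {
    "Name": "Gran Turismo",
    "Platform": "PlayStation 1",
    "Genre": "Racing",
    "Publisher": "Sony Computer Entertainment",
    "Description": "A realistic racing simulator featuring a wide array of cars and tracks, setting a new standard for the genre.",
    "YearOfRelease": 1997,
}
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "001.json"), "w", encoding="utf-8") as f:
        json.dump(sample, f)
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("not a game file")
    ids, documents, metadatas = build_batch(tmp)

print(ids)
print(documents[0])
```

With the notebook's objects, the result would be passed on as `collection.add(ids=ids, documents=documents, metadatas=metadatas)`. The trade-off is that a single bad file then fails the whole batch unless you handle errors inside `build_batch`.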

### Implement semantic search functionality

In [21]:
def semantic_search_func(query, collection, n_results=1):
    """
    Query the collection for the n_results closest matches to a query and print them.
    params:
    query: The query text to match against (str)
    collection: The Chroma collection to search (obj)
    n_results: The number of results to return (int)
    returns: The raw query result from the collection (dict); matches are also printed.
    """
    result = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=['documents', 'distances', 'metadatas'])

    # Print results
    try:
        print(f"______________ {query} _______________")
        print(f"{len(result['documents'][0])} matches found.\n")
        matches = zip(result['documents'][0], result['distances'][0], result['metadatas'][0])
        for i, (d, dist, meta) in enumerate(matches):
            print(f'#{i+1}')
            print(f"Name: {meta['Name']}")
            print(f"Platform: {meta['Platform']}")
            print(f"Genre: {meta['Genre']}")
            print(f"Publisher: {meta['Publisher']}")
            print(f"Description: {meta['Description']}")
            print(f"Year of Release: {meta['YearOfRelease']}")
            # 1 - distance is a rough indicator; its meaning depends on the distance metric
            print(f"Similarity Score: {round(1 - dist,2)}")
            print("\n")
    except Exception as exc:
        print(f"\n !!Error!! {exc}\n")

    # Return the results
    return result
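
The `Similarity Score` printed by this function is `1 - dist`, and how to read that number depends on the collection's distance metric. The sketch below assumes squared Euclidean (L2) distance, which is understood to be Chroma's default when no other metric is configured, and unit-length vectors, which OpenAI embeddings are documented to be. Under those assumptions, cosine similarity equals `1 - dist / 2`, so `1 - dist` understates it. The vectors are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Illustrative vectors, not real embeddings
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

d = squared_l2(a, b)
print(f"squared L2 distance: {d:.4f}")
print(f"cosine similarity:   {cosine_similarity(a, b):.4f}")
print(f"1 - d / 2:           {1 - d / 2:.4f}")  # equals the cosine similarity for unit vectors
print(f"1 - d:               {1 - d:.4f}")      # what the notebook prints as the score
```

If the collection is instead configured for cosine distance, `1 - dist` is the cosine similarity directly. Either way the ranking of results is unchanged; only the reported score differs.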



In [22]:
queries = ["when was the last halo?", "mario racing games", "rockstar game", "sports"]
[semantic_search_func(query, collection, n_results=1) for query in queries]

______________ when was the last halo? _______________
1 matches found.

#1
Name: Halo Infinite
Platform: Xbox Series X|S
Genre: First-person shooter
Publisher: Xbox Game Studios
Description: The latest installment in the Halo franchise, featuring Master Chief's return in a new open-world setting.
Year of Release: 2021
Similarity Score: 0.81


______________ mario racing games _______________
1 matches found.

#1
Name: Mario Kart 8 Deluxe
Platform: Nintendo Switch
Genre: Racing
Publisher: Nintendo
Description: An enhanced version of Mario Kart 8, featuring new characters, tracks, and improved gameplay mechanics.
Year of Release: 2017
Similarity Score: 0.86


______________ rockstar game _______________
1 matches found.

#1
Name: Grand Theft Auto: San Andreas
Platform: PlayStation 2
Genre: Action-adventure
Publisher: Rockstar Games
Description: An expansive open-world game set in the fictional state of San Andreas, following the story of Carl 'CJ' Johnson.
Year of Release: 2004
Similari

[{'ids': [['015']],
  'embeddings': None,
  'documents': [["[Xbox Series X|S] Halo Infinite (2021) - The latest installment in the Halo franchise, featuring Master Chief's return in a new open-world setting."]],
  'uris': None,
  'included': ['documents', 'distances', 'metadatas'],
  'data': None,
  'metadatas': [[{'Platform': 'Xbox Series X|S',
     'Description': "The latest installment in the Halo franchise, featuring Master Chief's return in a new open-world setting.",
     'Publisher': 'Xbox Game Studios',
     'YearOfRelease': 2021,
     'Genre': 'First-person shooter',
     'Name': 'Halo Infinite'}]],
  'distances': [[0.18839609622955322]]},
 {'ids': [['012']],
  'embeddings': None,
  'documents': [['[Nintendo Switch] Mario Kart 8 Deluxe (2017) - An enhanced version of Mario Kart 8, featuring new characters, tracks, and improved gameplay mechanics.']],
  'uris': None,
  'included': ['documents', 'distances', 'metadatas'],
  'data': None,
  'metadatas': [[{'Platform': 'Nintendo S
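
The raw result above holds parallel lists, with one inner list per query text. For downstream use it can be easier to work with one dictionary per match. The sketch below is one way to flatten it; `flatten_query_result` is a hypothetical helper, and the sample data is copied from the Halo result shown above.

```python
def flatten_query_result(result, query_index=0):
    """Turn the parallel lists of a query result into one dict per match."""
    rows = []
    for doc_id, document, metadata, distance in zip(
        result["ids"][query_index],
        result["documents"][query_index],
        result["metadatas"][query_index],
        result["distances"][query_index],
    ):
        # Metadata keys are capitalized here, so they do not clash with the lowercase keys
        rows.append({"id": doc_id, "document": document, "distance": distance, **metadata})
    return rows


# Copied from the first result in the output above
sample_result = {
    "ids": [["015"]],
    "documents": [["[Xbox Series X|S] Halo Infinite (2021) - The latest installment in the Halo franchise, featuring Master Chief's return in a new open-world setting."]],
    "metadatas": [[{
        "Platform": "Xbox Series X|S",
        "Description": "The latest installment in the Halo franchise, featuring Master Chief's return in a new open-world setting.",
        "Publisher": "Xbox Game Studios",
        "YearOfRelease": 2021,
        "Genre": "First-person shooter",
        "Name": "Halo Infinite",
    }]],
    "distances": [[0.18839609622955322]],
}

rows = flatten_query_result(sample_result)
for row in rows:
    print(row["id"], row["Name"], row["YearOfRelease"], row["distance"])
```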