# Setup
If `!pip install -r requirements.txt` fails, install the external libraries directly:
```
pip install numpy sentence-transformers chromadb polars more-itertools openai python-dotenv
```

# RAG demo

## 1. Define similarity metric

In [1]:
import numpy as np

def compute_cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity metric between 2 vectors"""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
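
As a quick check of the formula, the sketch below reimplements it with only the standard library and evaluates it on hand-picked vectors. The helper name `cosine_similarity_plain` is illustrative and not part of the notebook. Parallel vectors should score close to 1, orthogonal ones 0, and opposite ones close to -1.

```python
import math

def cosine_similarity_plain(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked vectors covering the three reference cases.
parallel = cosine_similarity_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_plain([1.0, 1.0], [-1.0, -1.0])
print(parallel, orthogonal, opposite)
```

The score depends only on direction, not on vector length, which is why scaling `[1, 2, 3]` by 2 leaves the similarity unchanged. Neither this sketch nor the NumPy version guards against an all-zero vector.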

## 2. Text embedding
Here we use the `sentence-transformers` library and its `SentenceTransformer` class. Characteristics of a `SentenceTransformer` model:
1. It computes a fixed-size vector representation (embedding) for a given text or image.
2. Computing embeddings is usually efficient, and comparing embeddings is very fast. In our case, we will simply configure the similarity function through the `chromadb` API.
3. It is applicable to a wide range of tasks. In our case, we need semantic textual similarity.

### Pretrained models
Various pre-trained Sentence Transformers models are provided via the Sentence Transformers Hugging Face organization. The pre-trained models can be found [here](https://huggingface.co/models?library=sentence-transformers&author=sentence-transformers). Refer to each model's card to learn more about it.

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The movie is not as good as everybody expected.",
    "The movie has a stale storyline.",
    "Human is expected to conquer Mars around the year 2040.",
    "Human had landed on the moon in year 1969."
]

text_embeddings = model.encode(texts)

# Map each text (key) to its corresponding embedding (value)
text_embeddings_dict = dict(zip(texts, text_embeddings))



In [3]:
text_embeddings_dict

{'The movie is not as good as everybody expected.': array([-1.17516341e-02, -3.44875120e-02, -8.63215071e-04,  2.00314391e-02,
        -5.69748804e-02,  7.41276611e-03, -4.05711271e-02, -1.36493593e-02,
         3.22466232e-02,  1.14074117e-02,  5.37382364e-02,  4.20888364e-02,
        -4.80891997e-03, -1.65639427e-02, -9.73862857e-02,  1.41841322e-02,
         5.54666854e-02, -9.06853974e-02, -4.48351502e-02,  8.30482505e-03,
        -1.27478698e-02,  1.63926892e-02,  1.18417881e-01,  5.09514101e-02,
        -7.96555206e-02, -2.97664441e-02, -5.20522986e-03, -5.68447635e-02,
        -1.11407034e-01, -5.34533663e-03,  9.74927843e-03,  3.20234746e-02,
         3.78442439e-03, -4.02020253e-02, -4.64739799e-02,  2.17817966e-02,
        -1.68288462e-02, -4.28137071e-02, -4.97106016e-02, -1.69626493e-02,
        -1.19402750e-04,  1.04057491e-02,  3.76179479e-02, -5.99904219e-03,
         3.15318853e-02, -5.97875305e-02, -9.19919810e-04, -4.99583185e-02,
         5.62546402e-02, -8.42506960e

### 2.1 Perform some sanity checking

In [3]:
movie_txt_1 = texts[0]
movie_txt_2 = texts[1]

compute_cosine_similarity(text_embeddings_dict[movie_txt_1],
                         text_embeddings_dict[movie_txt_2])

0.5038894

In [4]:
compute_cosine_similarity(text_embeddings_dict[texts[2]],
                         text_embeddings_dict[texts[3]])

0.29489478

In [5]:
compute_cosine_similarity(text_embeddings_dict[texts[0]],
                         text_embeddings_dict[texts[-1]])

0.029668428

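The results above suggest that the embeddings capture the semantic meaning of the input sentences reasonably well: the two movie sentences score highest (about 0.50), the two space sentences score lower (about 0.29), and the unrelated movie and moon-landing pair scores close to zero (about 0.03).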

## 3. Set up vector database
To set up a local vector database, we use the `chromadb` library.

Chroma can use any Sentence Transformers model to create embeddings. You can pass an optional `model_name` argument to choose which Sentence Transformers model to use. By default, Chroma uses `all-MiniLM-L6-v2`.

In [6]:
import chromadb
from chromadb.utils import embedding_functions

CHROMA_DATA_PATH = "chroma_embedding_demo/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_docs"

# configure Chroma to save and load the database from your local machine
client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL   
)

collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"}
)
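
Conceptually, a query against such a collection is a nearest-neighbour search: the query text is embedded and the stored vectors are ranked by their distance to it. The sketch below does this by brute force on made-up 3-dimensional vectors. The names `cosine_distance`, `top_k`, and `toy_store` are illustrative; Chroma itself uses an approximate HNSW index rather than a full scan.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def top_k(query_vec, stored, k=1):
    """Return the k (id, distance) pairs closest to query_vec."""
    scored = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Made-up 3-dimensional "embeddings", for illustration only.
toy_store = {
    "id0": [1.0, 0.0, 0.0],
    "id1": [0.0, 1.0, 0.0],
    "id2": [0.7, 0.7, 0.0],
}
nearest = top_k([0.9, 0.1, 0.0], toy_store, k=2)
print(nearest)
```

The query vector points almost along the first axis, so `id0` ranks first and the diagonal vector `id2` second.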



### 3.1 Add some data to the vector database and run a sanity check

In [7]:
documents = [
    "The latest iPhone model comes with impressive features and a powerful camera.",
    "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
    "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
    "The American Revolution had a profound impact on the birth of the United States as a nation.",
    "Regular exercise and a balanced diet are essential for maintaining good physical health.",
    "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
    "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
    "Startup companies often face challenges in securing funding and scaling their operations.",
    "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"
]

topics = [
    "technology",
    "travel",
    "science",
    "food",
    "history",
    "fitness",
    "art",
    "climate change",
    "business",
    "music",
]
# Chroma will store your text and handle embedding and indexing.
collection.add(
    documents=documents,
    ids=[f"id{i}" for i in range(len(documents))],
    metadatas=[{"topics": topic} for topic in topics]
)

In [7]:
collection.peek()

{'ids': ['id0', 'id1', 'id2', 'id3', 'id4', 'id5', 'id6', 'id7', 'id8', 'id9'],
 'embeddings': [[-0.0301264226436615,
   -0.008223401382565498,
   0.07016211003065109,
   -0.06197300925850868,
   -0.0015618414618074894,
   -0.02813219651579857,
   -0.0119687020778656,
   0.08269321918487549,
   0.035055793821811676,
   0.06343533843755722,
   0.08856028318405151,
   -0.006440437398850918,
   0.00521568488329649,
   0.06499352306127548,
   0.052934806793928146,
   -0.04098890349268913,
   0.12041020393371582,
   -0.06898967921733856,
   -0.06871923059225082,
   -0.03942180424928665,
   -0.031998444348573685,
   0.01355114858597517,
   0.0799776017665863,
   0.004149056971073151,
   0.05734124779701233,
   0.04662569239735603,
   -0.05715063586831093,
   -0.06682708114385605,
   0.05258721485733986,
   -0.05084732547402382,
   -0.019250402227044106,
   -0.0008829222642816603,
   0.023810774087905884,
   0.06795576959848404,
   -0.05496685206890106,
   -0.10000955313444138,
   0.079046979

In [8]:
# Let's run a query
query = "Find me some delicious food"
# Query the collection with a list of query texts; Chroma returns the n most similar results.
query_results = collection.query(
    query_texts=[query],
    n_results=1
)
query_results.keys()

dict_keys(['ids', 'distances', 'metadatas', 'embeddings', 'documents', 'uris', 'data'])

In [9]:
query_results = collection.query(
    query_texts=[query],
    include=["distances", "documents"],
    n_results=1
)

(query_results["distances"][0], query_results["documents"][0])

([0.7526361480284711],
 ['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.'])
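
The value under `distances` is a cosine distance, not a similarity: with `"hnsw:space": "cosine"`, Chroma reports 1 minus the cosine similarity, so smaller values mean closer matches. A small sketch of the conversion, using the distance printed above (the helper name is illustrative):

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """Invert a cosine distance defined as 1 - similarity."""
    return 1.0 - distance

# Distance reported for the pizza document in the output above.
similarity = cosine_distance_to_similarity(0.7526361480284711)
print(round(similarity, 4))
```

A similarity of roughly 0.25 is modest, which is plausible given that the query mentions food in general and the document describes pizza specifically.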

You can pass multiple queries to `collection.query` at once. Chroma, being a vector database, also supports CRUD operations.
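
With several query texts, each key of the result holds one inner list per query, which is why the single-query cells above index with `[0]`. The sketch below flattens that nested layout into one row per query and hit. The `sample_results` dictionary is a hand-written stand-in with invented values, not real Chroma output, and `flatten_query_results` is a hypothetical helper.

```python
def flatten_query_results(results: dict) -> list[dict]:
    """Turn nested query results into one row per (query, hit)."""
    rows = []
    for query_idx, doc_ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(doc_ids):
            rows.append({
                "query_idx": query_idx,
                "rank": rank,
                "id": doc_id,
                "document": results["documents"][query_idx][rank],
                "distance": results["distances"][query_idx][rank],
            })
    return rows

# Stand-in for collection.query(query_texts=[q0, q1], n_results=2);
# the ids, texts, and distances are invented for illustration.
sample_results = {
    "ids": [["id3", "id5"], ["id9", "id6"]],
    "documents": [["pizza doc", "fitness doc"], ["symphony doc", "painting doc"]],
    "distances": [[0.75, 0.91], [0.60, 0.88]],
}
rows = flatten_query_results(sample_results)
print(rows[0]["id"], rows[2]["id"])
```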

## 4. End-to-end RAG example

The main goal of this example is to demonstrate the RAG workflow with a simple, practical case.

We will start with this [Kaggle data](https://www.kaggle.com/datasets/ankkur13/edmundsconsumer-car-ratings-and-reviews), which contains consumers' reviews and star ratings of cars by manufacturer, model, and type. It consists of multiple CSV files with the following headers:
1. *Review_Date*
2. *Author_Name*
3. *Vehicle_Title*
4. *Review_Title*
5. *Review*
6. *Rating*

In [10]:
import pathlib
import polars as pl

def prepare_car_reviews_data(data_path: pathlib.Path, vehicle_years: list[int] = [2018]):
    """Prepare the car reviews dataset for ChromaDB.

    Parameters:
    data_path: path (or glob pattern) of the CSV files
    vehicle_years: vehicle years to keep, parsed from Vehicle_Title

    Returns:
    Dictionary with 'ids', 'documents' and 'metadatas' as keys."""

    # Define the schema to ensure proper data types are enforced
    dtypes = {
        "": pl.Int64,
        "Review_Date": pl.Utf8,
        "Author_Name": pl.Utf8,
        "Vehicle_Title": pl.Utf8,
        "Review_Title": pl.Utf8,
        "Review": pl.Utf8,
        "Rating": pl.Float64,
    }

    # Scan the car reviews dataset(s)
    car_reviews = pl.scan_csv(data_path, dtypes=dtypes)

    # Extract the vehicle title and year as new columns
    # Filter on selected years
    car_review_db_data = (
        car_reviews.with_columns(
            [
                (
                    pl.col("Vehicle_Title").str.split(
                        by=" ").list.get(0).cast(pl.Int64)
                ).alias("Vehicle_Year"),
                (pl.col("Vehicle_Title").str.split(by=" ").list.get(1)).alias(
                    "Vehicle_Model"
                ),
            ]
        )
        .filter(pl.col("Vehicle_Year").is_in(vehicle_years))
        .select(["Review_Title", "Review", "Rating", "Vehicle_Year", "Vehicle_Model"])
        .sort(["Vehicle_Model", "Rating"])
        .collect()
    )

    # Create ids, documents, and metadatas data in the format chromadb expects
    ids = [f"review{i}" for i in range(car_review_db_data.shape[0])]
    documents = car_review_db_data["Review"].to_list()
    metadatas = car_review_db_data.drop("Review").to_dicts()

    return {"ids": ids, "documents": documents, "metadatas": metadatas}
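
To make the title parsing concrete, the sketch below mirrors the two `str.split(by=" ").list.get(...)` expressions in plain Python. The title string is hypothetical and assumes a `<year> <make> <model> ...` layout; `split_vehicle_title` is an illustrative helper, not part of the notebook.

```python
def split_vehicle_title(title: str) -> tuple[int, str]:
    """Token 0 is cast to an integer year; token 1 is kept as a string."""
    tokens = title.split(" ")
    return int(tokens[0]), tokens[1]

# Hypothetical title in the "<year> <make> <model> ..." layout.
year, second_token = split_vehicle_title("2018 Acura RDX SUV")
print(year, second_token)
```

Under that assumed layout, the column the notebook calls `Vehicle_Model` would hold the second token, which is the make (here `Acura`) rather than the model name.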

For this demo, we keep only the reviews whose vehicle year is **2018**.

In [11]:
DATA_PATH = "textual_data/car_reviews_data/*"
chroma_car_reviews_dict = prepare_car_reviews_data(DATA_PATH, vehicle_years=[2018])

In [12]:
len(chroma_car_reviews_dict["ids"])

3346

In [13]:
# Do some checking
chroma_car_reviews_dict.keys()

dict_keys(['ids', 'documents', 'metadatas'])

In [14]:
print(chroma_car_reviews_dict["ids"][:2])
print(chroma_car_reviews_dict["documents"][:2])
print(chroma_car_reviews_dict["metadatas"][:2])

['review0', 'review1']
[' This car drives decently, and maybe average for a "luxury" vehicle. Advertised at getting 19 mpg in the city and will give you 13 mpg on a good day. What\'s worse is the way Acura has handled it. They are rude on the phone and even take 5 business days to return your call. No other luxury car company would treat customers like this. Highly recommend to avoid Acura.', " This car is not worth the money. It is very hard to get in and out. I used to own the Accura TL's and they were a great car.  This car is nothing like that.  The ride is very hard. There is no give to the seat, tires, or suspension.  It is painful to go over bumps and my back hurt constantly. I traded it in after owning for 4 months. The heat blasted out hot air on your foot so you had to turn the heat off. The radio is poor for what they claim is an upscale car. Only one speaker in the center. Save your money and buy another car."]


The result is a **dictionary** holding the 2018 car reviews. Next, we set up the Chroma database.

In [14]:
!python -m pip install more-itertools



We need to specify the Chroma path, the collection name, the embedding function, the data dictionary we built, and the distance function.

In [16]:
from more_itertools import batched
import pathlib

def build_chroma_collection(
    chroma_path: pathlib.Path,
    collection_name: str,
    embedding_func_name: str,
    ids: list[str],
    documents: list[str],
    metadatas: list[dict],
    distance_func_name: str = "cosine",
):
    """Create a ChromaDB collection"""
    # initiate chroma instance
    chroma_client = chromadb.PersistentClient(chroma_path)

    embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=embedding_func_name
    )
    # create the chroma collection
    collection = chroma_client.create_collection(
        name=collection_name,
        embedding_function=embedding_func,
        metadata={"hnsw:space": distance_func_name},
    )

    document_indices = list(range(len(documents)))
    # Add the data to the Chroma collection in batches of at most 166 documents
    for batch in batched(document_indices, 166):
        start_idx = batch[0]
        # Slices exclude the end index, so go one past the last index in the batch
        end_idx = batch[-1] + 1

        collection.add(
            ids=ids[start_idx:end_idx],
            documents=documents[start_idx:end_idx],
            metadatas=metadatas[start_idx:end_idx],
        )
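
Python slices are half-open (`seq[start:end]` excludes `end`), so the end bound of each batch has to be one past its last index for every record to be added. A minimal, dependency-free sketch of the same batching idea, using a hypothetical `iter_batches` helper and a small batch size for readability:

```python
def iter_batches(n_items: int, batch_size: int):
    """Yield (start, end) half-open index pairs covering range(n_items)."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

ids = [f"review{i}" for i in range(10)]
batches = [ids[start:end] for start, end in iter_batches(len(ids), batch_size=4)]
print([len(b) for b in batches])  # batch sizes: 4, 4, 2
```

Concatenating the batches gives back the original list, which is a cheap check that no record was dropped at a batch boundary.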

In [45]:
CHROMA_PATH = "car_review_embeddings"
# COLLECTION_NAME ("demo_docs") is reused from Section 3; this collection is stored separately under CHROMA_PATH
EMBEDDING_FUNC = "all-MiniLM-L6-v2"

build_chroma_collection(
    CHROMA_PATH,
    COLLECTION_NAME,
    EMBEDDING_FUNC,
    ids=chroma_car_reviews_dict["ids"],
    documents=chroma_car_reviews_dict["documents"],
    metadatas=chroma_car_reviews_dict["metadatas"]
)

Let's run a sanity check on the Chroma collection we just built.

In [18]:
CHROMA_PATH = "car_review_embeddings"
# DB_NAME = "car_review"
EMBEDDING_FUNC = "all-MiniLM-L6-v2"
client = chromadb.PersistentClient(CHROMA_PATH)
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBEDDING_FUNC
)
# Existing collections can be retrieved by name with .get_collection, or use .get_or_create_collection to get a collection 
# if it exists, or create it if it doesn't.
collection = client.get_collection(name=COLLECTION_NAME, embedding_function=embedding_func)

great_reviews = collection.query(
    query_texts=["Find me some positive reviews about car fuel effificiency"],
    n_results=5,
    include=["documents", "distances", "metadatas"]
)

great_reviews["documents"][0]

[' Fuel economy is superb for a car with this size and comfort at 40 mpg.',
 ' Fuel efficient',
 ' Overall this vehicle has been only fair. The fuel economy  sucks to say the least.',
 " First - a disclaimer; this is my first new vehicle in 11 years.  The technological leap was huge, so by comparison to my Honda Element, the 4Runner is amazing.  I use this for hauling, off-road, camping and outdoors activities.  I'm just starting out with it, but the fuel economy has been great - much more than I expected.  Highway travel has been more than 24mpg!  Bopping around town takes me down to 21.5.  I've been carefully recording my fuel use because frankly this far exceeds the window sticker fuel economy of 17/21 with 18mpg average.   Cargo space is great... very comfortable... plenty of power and looks great.  So far, so good!",
 ' Good fuel efficiency,  For be a pick/up look great']

In [48]:
!python -m pip install openai python-dotenv



In [54]:
from dotenv import load_dotenv

load_dotenv()

True
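
`load_dotenv()` copies `KEY=value` pairs from a local `.env` file into `os.environ`. The sketch below shows one way to read a key back with a clear error when it is missing. The helper `get_openai_api_key` and the variable name `DEMO_OPENAI_KEY` are assumptions for illustration; the notebook itself reads the key from `config.json` in the next cell.

```python
import os

def get_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key

# Demonstration only: a placeholder value under a throwaway variable name.
os.environ["DEMO_OPENAI_KEY"] = "sk-placeholder"
key = get_openai_api_key("DEMO_OPENAI_KEY")
print(key)
```

Keeping the key in the environment rather than in a tracked file reduces the chance of committing it by accident.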

### Test OpenAI API

In [19]:
from openai import OpenAI
import json

with open("config.json", mode="r") as f:
    config_data = json.load(f)

client = OpenAI(api_key=config_data.get("openai_secret_key"))

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a friendly chatbot"},
        {"role": "user", "content": "Tell me a joke about soccer"}
    ],
    temperature=0.0,
    n=1
)

In [20]:
print(response.choices[0].message.content)

Why did the soccer player bring string to the game? 

So he could tie the score!


### Adding context to LLM

In [21]:
# We start with a question
question = "What are the key factors that contribute to customer satisfaction?"

reviews = collection.query(
    query_texts=[question],
    n_results=5,
    include=["documents"],
    where={"Rating": {"$gte": 3}}
)

good_reviews = "\n".join(reviews["documents"][0])

context = f"""
You are an expert market analyst who works in a large car dealership. \
Answer questions based on the following reviews: \
'''{good_reviews}'''
"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": context},
        {"role": "user", "content": question}
    ],
    temperature=0.0,
    n=1
)

print(response.choices[0].message.content)

Based on the reviews provided, key factors that contribute to customer satisfaction include reliability, fuel economy, handling, resale value, performance, engine quality, comfort, luxury, price, service, and the overall buying experience. Customers value a reliable vehicle that offers good fuel economy, handles well, retains its value over time, performs well, has a high-quality engine, is comfortable and luxurious, and comes at a reasonable price. Additionally, customers appreciate good customer service, including prompt responses to issues such as engine failures, providing loaner vehicles when needed, and a positive selling approach that focuses on building a relationship rather than just making a transaction.
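
The prompt assembly above can be factored into a small function so that the same retrieval-then-generate step is reusable for other questions. This is a sketch under the assumption that retrieved reviews always go into the system message; `build_rag_messages` is a hypothetical helper, and the two reviews passed to it are invented placeholders.

```python
def build_rag_messages(question: str, documents: list[str]) -> list[dict]:
    """Assemble chat messages: retrieved reviews go in the system prompt."""
    joined_reviews = "\n".join(doc.strip() for doc in documents)
    system_prompt = (
        "You are an expert market analyst who works in a large car dealership. "
        "Answer questions based on the following reviews: "
        f"'''{joined_reviews}'''"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

# Invented placeholder reviews, standing in for reviews["documents"][0].
messages = build_rag_messages(
    "What are the key factors that contribute to customer satisfaction?",
    [" Great mileage.", " Comfortable seats."],
)
print(messages[0]["content"])
```

The returned list has the shape expected by the `messages` argument of `client.chat.completions.create`, as used in the cells above.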


# Reference:
1. Real Python article: https://realpython.com/chromadb-vector-database/#practical-example-add-context-for-a-large-language-model-llm