# RAG-on-GKE Application

This is a Python notebook for generating the vector embeddings used by the RAG on GKE application. For full details, see the [GitHub documentation](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/rag/README.md).


## Set Up Kaggle Credentials

First we will set up your Kaggle credentials and use the Kaggle CLI to download the Netflix shows dataset to the GCS bucket. Navigate to https://www.kaggle.com/settings/account and generate an API token, then replace the placeholder values below with your own username and token. See https://www.kaggle.com/docs/api#authentication for how to create a token.

In [None]:
import os
os.environ['KAGGLE_USERNAME'] = "<username>"
os.environ['KAGGLE_KEY'] = "<token>"

# Download the zip file to local storage, then extract the desired contents directly to the
# GCS bucket mounted through the GKE GCS CSI driver. The commands below write to "/data" in the Jupyter pod.
!kaggle datasets download -d shivamb/netflix-shows -p ~/data --force
!mkdir -p /data/netflix-shows
!unzip -o ~/data/netflix-shows.zip -d /data/netflix-shows
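
Before running the download, it can help to confirm that the placeholder values were actually replaced. The check below is a minimal sketch: `missing_kaggle_credentials` is a hypothetical helper (it is not part of the Kaggle CLI), and it only assumes that unreplaced placeholders look like `<username>` and `<token>`, as in the cell above.

```python
import os


def missing_kaggle_credentials(env=os.environ):
    """Return the Kaggle variables that are unset, empty, or still placeholders."""
    required = ("KAGGLE_USERNAME", "KAGGLE_KEY")
    missing = []
    for name in required:
        value = env.get(name, "")
        # Treat "<...>" values as placeholders that were never replaced.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing
```

Calling `missing_kaggle_credentials()` with no arguments inspects the current environment; an empty list means both variables look usable.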

In [None]:
!pip install langchain-google-cloud-sql-pg

## Creating the Database Connection

Let's now set up a connection to your Cloud SQL database:

In [None]:
import os
import uuid

import ray
import torch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

from langchain_google_cloud_sql_pg import PostgresEngine, PostgresVectorStore
from google.cloud.sql.connector import IPTypes

# initialize parameters
INSTANCE_CONNECTION_NAME = os.environ.get("CLOUDSQL_INSTANCE_CONNECTION_NAME")
print(f"Your instance connection name is: {INSTANCE_CONNECTION_NAME}")
cloud_variables = INSTANCE_CONNECTION_NAME.split(":")

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", cloud_variables[0])
GCP_CLOUD_SQL_REGION = os.environ.get("CLOUDSQL_INSTANCE_REGION", cloud_variables[1])
GCP_CLOUD_SQL_INSTANCE = os.environ.get("CLOUDSQL_INSTANCE", cloud_variables[2])

DB_NAME = os.environ.get("INSTANCE_CONNECTION_NAME", "pgvector-database")
VECTOR_EMBEDDINGS_TABLE_NAME = os.environ.get("EMBEDDINGS_TABLE_NAME", "netflix_reviews_db")
CHAT_HISTORY_TABLE_NAME = os.environ.get("CHAT_HISTORY_TABLE_NAME", "message_store")

VECTOR_DIMENSION = int(os.environ.get("VECTOR_DIMENSION", 384))  # environment values are strings, so cast to int

try:
    # Read the database credentials from the mounted secret volume.
    with open("/etc/secret-volume/username", "r") as db_username_file:
        DB_USER = db_username_file.read()

    with open("/etc/secret-volume/password", "r") as db_password_file:
        DB_PASS = db_password_file.read()
except OSError:
    # Fall back to environment variables when the secret files cannot be read.
    DB_USER = os.environ.get("DB_USERNAME", "postgres")
    DB_PASS = os.environ.get("DB_PASS", "postgres")

engine = PostgresEngine.from_instance(
        project_id=GCP_PROJECT_ID,
        region=GCP_CLOUD_SQL_REGION,
        instance=GCP_CLOUD_SQL_INSTANCE,
        database=DB_NAME,
        user=DB_USER,
        password=DB_PASS,
        ip_type=IPTypes.PRIVATE,
)

try:
    engine.init_vectorstore_table(
        VECTOR_EMBEDDINGS_TABLE_NAME,
        vector_size=VECTOR_DIMENSION,
        overwrite_existing=True,
    )
except Exception as err:
    print(f"Error: {err}")
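
The cell above assumes that `CLOUDSQL_INSTANCE_CONNECTION_NAME` is set and has the form `project:region:instance`; if the variable is missing, calling `.split(":")` on `None` raises an `AttributeError`. The following is a minimal sketch of a stricter parse. `parse_instance_connection_name` is a hypothetical helper, and it assumes the project ID itself contains no colon.

```python
def parse_instance_connection_name(connection_name):
    """Split 'project:region:instance' into its three parts, validating the shape."""
    if not connection_name:
        raise ValueError("instance connection name is not set")
    parts = connection_name.split(":")
    # Require exactly three non-empty parts.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected 'project:region:instance', got {connection_name!r}"
        )
    project_id, region, instance = parts
    return project_id, region, instance


# Placeholder value, for illustration only.
print(parse_instance_connection_name("my-project:us-central1:my-instance"))
```

Failing early with a descriptive message makes a misconfigured pod easier to diagnose than an error raised later, when the engine is created.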

Next we'll set up some parameters for the dataset processing steps:

In [None]:
SENTENCE_TRANSFORMER_MODEL = 'intfloat/multilingual-e5-small' # Transformer to use for converting text chunks to vector embeddings

# The dataset was already downloaded to the GCS bucket by the cell above. Ray workers will find it on the mounted path.
SHARED_DATASET_BASE_PATH="/data/netflix-shows/"
REVIEWS_FILE_NAME="netflix_titles.csv"

BATCH_SIZE = 100
CHUNK_SIZE = 1000 # text chunk sizes which will be converted to vector embeddings
CHUNK_OVERLAP = 10
TABLE_NAME = 'netflix_reviews_db'  # CloudSQL table name
DIMENSION = 384  # Embeddings size
ACTOR_POOL_SIZE = 1 # number of actors for the distributed map_batches function
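
To build some intuition for `CHUNK_SIZE` and `CHUNK_OVERLAP`, here is a simplified sketch that cuts text into fixed-size windows, where each window repeats the last few characters of the previous one. This is a stand-in for illustration only: `fixed_window_chunks` is a hypothetical function, and `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its real chunk boundaries will differ.

```python
def fixed_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(fixed_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With the values above (`CHUNK_SIZE = 1000`, `CHUNK_OVERLAP = 10`), consecutive windows in this simplified scheme would share 10 characters.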

## Generating Document Splits

We are ready to begin. Let's first define a class that splits the dataset text into chunks:



In [None]:
class Splitter:
    def __init__(self):
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP, length_function=len
        )

    def __call__(self, text_batch):
        text = text_batch["item"]
        chunks = []
        for data in text:
            splits = self.splitter.split_text(data)
            chunks.extend(splits)

        return {'results': chunks}

Next we will connect to the Ray cluster that executes the remote tasks:

In [None]:
import ray

ray.init(
    address="ray://ray-cluster-kuberay-head-svc:10001",
    runtime_env={
        "pip": [               
            "langchain==0.1.10",
            "transformers==4.38.1",
            "sentence-transformers==2.5.1",
            "pyarrow",
            "datasets==2.18.0",
            "torch==2.0.1",
            "huggingface_hub==0.21.3",
            "langchain-google-cloud-sql-pg"
        ]
    }
)

Split the dataset into text chunks using the `Splitter` class above:

In [None]:
# Process the dataset first, wrap the csv file contents into a Ray dataset
ray_ds = ray.data.read_csv(SHARED_DATASET_BASE_PATH + REVIEWS_FILE_NAME)
print(ray_ds.schema)

# Distributed flat map to extract the raw text fields.
ds_batch = ray_ds.flat_map(lambda row: [{
    'item': "This is a " + str(row["type"]) + " in " + str(row["country"]) + " called " + str(row["title"]) + 
    " added at " + str(row["date_added"]) + " whose director is " + str(row["director"]) + 
    " and with cast: " + str(row["cast"]) + " released at " + str(row["release_year"]) + 
    ". Its rating is: " + str(row['rating']) + ". Its duration is " + str(row["duration"]) + 
    ". Its description is " + str(row['description']) + "."
}])
print(ds_batch.schema)

# Distributed map batches to create chunks out of each row.
ds_splitted = ds_batch.map_batches(
    Splitter,
    compute=ray.data.ActorPoolStrategy(size=ACTOR_POOL_SIZE),
    batch_size=BATCH_SIZE,  # Large batch size to maximize GPU utilization.
    num_gpus=1,  # 1 GPU for each actor.
    # num_cpus=1,
)
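
The `flat_map` lambda above turns each CSV row into one descriptive sentence. As a sketch, the same formatting can be written as a named function, which is easier to try on a single row outside Ray. `row_to_text` is a hypothetical helper, and the sample row contains placeholder values rather than real dataset entries.

```python
def row_to_text(row):
    """Build the same descriptive sentence as the flat_map lambda."""
    return (
        f"This is a {row['type']} in {row['country']} called {row['title']}"
        f" added at {row['date_added']} whose director is {row['director']}"
        f" and with cast: {row['cast']} released at {row['release_year']}"
        f". Its rating is: {row['rating']}. Its duration is {row['duration']}"
        f". Its description is {row['description']}."
    )


# Placeholder row, for illustration only.
sample_row = {
    "type": "Movie",
    "country": "Exampleland",
    "title": "Example Title",
    "date_added": "January 1, 2020",
    "director": "A. Director",
    "cast": "Actor One, Actor Two",
    "release_year": 2019,
    "rating": "PG",
    "duration": "90 min",
    "description": "A placeholder description",
}
print(row_to_text(sample_row))
```

Note that a missing field is formatted like any other value, so a `None` entry ends up as the literal text `None` in the sentence.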

Retrieve the result data from Ray remote workers:

In [None]:
@ray.remote
def ray_data_task(ds_splitted):
    results = []
    for row in ds_splitted.iter_rows():
        data_text = row["results"]
        data_id = str(uuid.uuid4()) 

        results.append((data_id, data_text))
        
    return results
    
results = ray.get(ray_data_task.remote(ds_splitted))
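
To see the shape of `results` without a Ray cluster, the pairing step can be sketched locally. Here a plain list of dictionaries stands in for `ds_splitted.iter_rows()`, and `pair_chunks_with_ids` is a hypothetical name used only for this illustration.

```python
import uuid


def pair_chunks_with_ids(rows):
    """Mirror ray_data_task locally: one (uuid string, chunk text) tuple per row."""
    return [(str(uuid.uuid4()), row["results"]) for row in rows]


# Stand-in rows, for illustration only.
rows = [{"results": "first chunk"}, {"results": "second chunk"}]
pairs = pair_chunks_with_ids(rows)
for chunk_id, chunk_text in pairs:
    print(chunk_id, chunk_text)
```

Each row carries a single chunk, so `results` is a flat list of `(id, text)` tuples. The IDs are random, which means a rerun produces different IDs for the same text.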

## Writing Results Back to Cloud SQL

Now that we have our text chunks, we can generate their vector embeddings and write the results to the Cloud SQL for PostgreSQL database:

In [None]:
print("torch cuda version", torch.version.cuda)
device="cpu"
if torch.cuda.is_available():
    print("device cuda found")
    device="cuda"
    
embeddings_service = HuggingFaceEmbeddings(model_name=SENTENCE_TRANSFORMER_MODEL, model_kwargs=dict(device=device))
vector_store = PostgresVectorStore.create_sync(
    engine=engine,
    embedding_service=embeddings_service,
    table_name=VECTOR_EMBEDDINGS_TABLE_NAME,
)

for doc_id, text in results:
    # add_texts expects a list of texts and, through the `ids` keyword, a matching list of ids.
    vector_store.add_texts([text], ids=[doc_id])
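
The loop above issues one call per chunk. If that turns out to be slow, one option is to group the `(id, text)` pairs so that each call carries many chunks. The helper below is a sketch of the grouping step only; `batched_pairs` is a hypothetical name, and the commented usage assumes that `add_texts` accepts a list of texts together with an `ids` keyword argument.

```python
def batched_pairs(pairs, batch_size):
    """Yield (ids, texts) lists covering `pairs` in groups of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        ids = [doc_id for doc_id, _ in batch]
        texts = [text for _, text in batch]
        yield ids, texts


# Assumed usage with the objects defined in the cells above:
# for ids, texts in batched_pairs(results, 100):
#     vector_store.add_texts(texts, ids=ids)
```

Larger groups mean fewer round trips to the database, at the cost of more memory per embedding call.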

Finally, let's verify that our embeddings were stored in the database correctly:

In [None]:
query = "List the cast of squid game"
query_vector = embeddings_service.embed_query(query)
docs = vector_store.similarity_search_by_vector(query_vector, k=4)

for i, document in enumerate(docs):
    print(f"Result #{i+1}")
    print(document.page_content)
    print("-" * 100)
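
Conceptually, a similarity search embeds the query and returns the stored texts whose vectors are closest to it. The sketch below shows that idea with cosine similarity over toy two-dimensional vectors. It runs entirely in memory, the function names are hypothetical, and the vector store's configured distance strategy and indexing may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k):
    """Return the texts of the k (text, vector) pairs most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors, for illustration only; real embeddings here have 384 dimensions.
toy_docs = [
    ("doc about casts", [1.0, 0.0]),
    ("doc about ratings", [0.0, 1.0]),
    ("mixed doc", [1.0, 1.0]),
]
print(top_k([1.0, 0.1], toy_docs, k=2))
# ['doc about casts', 'mixed doc']
```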