# RAG-on-GKE Application

This is a Python notebook that generates the vector embeddings, based on the [Kubernetes docs](https://github.com/dohsimpson/kubernetes-doc-pdf/), used by the RAG on GKE application.   
For full information, please check out the [GitHub documentation](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/rag/README.md).



- Clone the Kubernetes docs repo



In [None]:
!mkdir /data/kubernetes-docs -p
!git clone https://github.com/dohsimpson/kubernetes-doc-pdf /data/kubernetes-docs


- Install the required packages

In [None]:
!pip install pgvector
!pip install langchain langchain-community sentence_transformers pypdf
!pip install "cloud-sql-python-connector[pg8000]" langchain-google-cloud-sql-pg

- Import required functions and libraries

In [None]:
# Import base libraries
import os
import uuid
import glob

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

from langchain_google_cloud_sql_pg import PostgresEngine, PostgresVectorStore
from google.cloud.sql.connector import IPTypes


## Creating the Database Connection

Let's now set up a connection to your Cloud SQL database:

In [None]:
# initialize parameters
INSTANCE_CONNECTION_NAME = os.environ.get("CLOUDSQL_INSTANCE_CONNECTION_NAME", "")
print(f"Your instance connection name is: {INSTANCE_CONNECTION_NAME}")
cloud_variables = INSTANCE_CONNECTION_NAME.split(":")

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", cloud_variables[0])
GCP_CLOUD_SQL_REGION = os.environ.get("CLOUDSQL_INSTANCE_REGION", cloud_variables[1])
GCP_CLOUD_SQL_INSTANCE = os.environ.get("CLOUDSQL_INSTANCE", cloud_variables[2])

DB_NAME = os.environ.get("INSTANCE_CONNECTION_NAME", "pgvector-database")
VECTOR_EMBEDDINGS_TABLE_NAME = os.environ.get("EMBEDDINGS_TABLE_NAME", "rag_vector_embeddings")

# Read the database credentials from the mounted secret volume
with open("/etc/secret-volume/username", "r") as db_username_file:
    DB_USER = db_username_file.read()

with open("/etc/secret-volume/password", "r") as db_password_file:
    DB_PASS = db_password_file.read()

# Create Cloud SQL Postgres Engine
pg_engine = PostgresEngine.from_instance(
    project_id=GCP_PROJECT_ID,
    instance=GCP_CLOUD_SQL_INSTANCE,
    region=GCP_CLOUD_SQL_REGION,
    database=DB_NAME,
    user=DB_USER,
    password=DB_PASS,
    ip_type=IPTypes.PRIVATE
)
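
The parameter cell above indexes `cloud_variables[1]` and `cloud_variables[2]` even when `CLOUDSQL_INSTANCE_CONNECTION_NAME` is unset, which raises an `IndexError` because an empty string splits into a single element. A minimal sketch of a more defensive parser follows. It assumes the connection name has the form `project:region:instance`; the helper name `parse_instance_connection_name` is hypothetical and not part of any library.

```python
from typing import Tuple


def parse_instance_connection_name(name: str) -> Tuple[str, str, str]:
    """Split 'project:region:instance' into (project, region, instance).

    Hypothetical helper: assumes exactly three non-empty, colon-separated parts.
    """
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"Expected 'project:region:instance', got {name!r}"
        )
    return parts[0], parts[1], parts[2]
```

Calling it before the `os.environ.get(...)` lines would turn a missing or malformed variable into a readable error message instead of an `IndexError`.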

Next we'll set up some parameters for the dataset processing steps:

In [None]:
SENTENCE_TRANSFORMER_MODEL = "intfloat/multilingual-e5-small"  # Transformer to use for converting text chunks to vector embeddings

# The dataset has been pre-downloaded to the GCS bucket by the clone cell at the top of this notebook. Ray workers will find the dataset readily mounted.
SHARED_DATASET_BASE_PATH = "/data/kubernetes-docs/"

BATCH_SIZE = 100
CHUNK_SIZE = 1000  # text chunk sizes which will be converted to vector embeddings
CHUNK_OVERLAP = 10
VECTOR_DIMENSION = 384  # Embeddings size

## Initialize Vector Store Table

We are ready to begin. Let's first create the table that will store the vector embeddings:

In [None]:
pg_engine.init_vectorstore_table(
    VECTOR_EMBEDDINGS_TABLE_NAME,
    vector_size=VECTOR_DIMENSION,
    overwrite_existing=True,  # Drops and recreates the table if it already exists.
)

## Initialize Vector Store

In [None]:
embeddings_service = HuggingFaceEmbeddings(model_name=SENTENCE_TRANSFORMER_MODEL)
vector_store = PostgresVectorStore.create_sync(
    engine=pg_engine,
    embedding_service=embeddings_service,
    table_name=VECTOR_EMBEDDINGS_TABLE_NAME,
)

## Ingest PDF docs into Cloud SQL DB

### Load and Split the Kubernetes docs

In [None]:
documents_file_path = glob.glob(f"{SHARED_DATASET_BASE_PATH}/PDFs/*.pdf")

documents = []
for file_path in documents_file_path:
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    documents.extend(pages)
    print(f"Processed: {file_path}")

In [None]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
)

splits = splitter.split_documents(documents)
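
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a simplified sketch of fixed-size chunking with overlap. This is not how `RecursiveCharacterTextSplitter` works internally (it also tries to break on separators such as paragraphs and sentences); the function `chunk_text` is a hypothetical illustration of the sliding-window idea only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical, simplified stand-in for a real text splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the notebook's settings (`CHUNK_SIZE = 1000`, `CHUNK_OVERLAP = 10`), each window in this sketch would advance by 990 characters, so neighbouring chunks repeat 10 characters of context.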

### Add the splits to the vector store

In [None]:
ids = [str(uuid.uuid4()) for _ in splits]
vector_store.add_documents(splits, ids)
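
The cell above sends every split in a single call, and the `BATCH_SIZE` parameter defined earlier is never used. One possible way to use it is to insert the documents in batches, sketched below. The `batched` helper is hypothetical, and the commented usage assumes `add_documents` accepts an `ids` keyword argument, matching the positional argument used above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical usage with the notebook's objects (not executed here):
# for doc_batch, id_batch in zip(batched(splits, BATCH_SIZE), batched(ids, BATCH_SIZE)):
#     vector_store.add_documents(doc_batch, ids=id_batch)
```

Smaller batches keep each database round trip bounded and make it easier to see how far ingestion got if a call fails partway through.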

## Trying the Vector Store

In [None]:
query = "What's kubernetes?"
query_vector = embeddings_service.embed_query(query)
docs = vector_store.similarity_search_by_vector(query_vector, k=4)

for i, document in enumerate(docs):
  print(f"Result #{i+1}")
  print(document.page_content)
  print("-" * 100)
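
For intuition about what the search above returns, here is a minimal pure-Python sketch of a top-k lookup. It assumes cosine similarity as the metric; the distance strategy and index actually used by the vector store are not shown in this notebook, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]
```

In this picture, `query_vector` plays the role of `query`, the stored chunk embeddings play the role of `vectors`, and `k=4` selects the four closest chunks.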