# RAG-on-GKE Application

This Python notebook generates the vector embeddings used by the RAG on GKE application, based on the [Kubernetes docs](https://github.com/dohsimpson/kubernetes-doc-pdf/).  
For full information, please check out the GitHub documentation [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/rag/README.md).



- Clone the Kubernetes docs repo



In [1]:
!mkdir /data/kubernetes-docs -p
!git clone https://github.com/dohsimpson/kubernetes-doc-pdf /data/kubernetes-docs


fatal: destination path '/data/kubernetes-docs' already exists and is not an empty directory.
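
The `fatal` message above appears when the cell is rerun after the repository has already been cloned. A minimal sketch of a rerun-safe alternative follows, assuming `git` is on the path. The helper names `needs_clone` and `clone_if_needed` are hypothetical and not part of the original notebook, and the check only looks at whether the directory has any content, not whether that content is the expected repository.

```python
import os
import subprocess

REPO_URL = "https://github.com/dohsimpson/kubernetes-doc-pdf"
TARGET_DIR = "/data/kubernetes-docs"


def needs_clone(path: str) -> bool:
    """Return True when `path` is missing or is an empty directory."""
    return not os.path.isdir(path) or not os.listdir(path)


def clone_if_needed(repo_url: str, path: str) -> bool:
    """Clone `repo_url` into `path` unless `path` already has content.

    Returns True if a clone was performed and False if it was skipped.
    """
    if not needs_clone(path):
        return False
    os.makedirs(path, exist_ok=True)
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(["git", "clone", repo_url, path], check=True)
    return True
```

Calling `clone_if_needed(REPO_URL, TARGET_DIR)` in place of the two shell commands would skip the clone on later runs instead of printing an error.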


- Install the required packages

In [2]:
!pip install pgvector
!pip install langchain langchain-community sentence_transformers unstructured[pdf]
!pip install "cloud-sql-python-connector[pg8000]" langchain-google-cloud-sql-pg



- Import required functions and libraries

In [3]:
# Import base libraries
import os
import uuid

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

from langchain_google_cloud_sql_pg import PostgresEngine, PostgresVectorStore


## Creating the Database Connection

Let's now set up a connection to your Cloud SQL database:

In [5]:
# initialize parameters
INSTANCE_CONNECTION_NAME = os.environ.get("CLOUDSQL_INSTANCE_CONNECTION_NAME", "")
print(f"Your instance connection name is: {INSTANCE_CONNECTION_NAME}")
cloud_variables = INSTANCE_CONNECTION_NAME.split(":")

GCP_PROJECT_ID = os.environ.get("GCP_PROJECT_ID", cloud_variables[0])
GCP_CLOUD_SQL_REGION = os.environ.get("CLOUDSQL_INSTANCE_REGION", cloud_variables[1])
GCP_CLOUD_SQL_INSTANCE = os.environ.get("CLOUDSQL_INSTANCE", cloud_variables[2])

DB_NAME = os.environ.get("INSTANCE_CONNECTION_NAME", "pgvector-database")
VECTOR_EMBEDDINGS_TABLE_NAME = os.environ.get("EMBEDDINGS_TABLE_NAME", "rag_vector_embeddings")

# Read the database credentials from the mounted secret volume
with open("/etc/secret-volume/username", "r") as db_username_file:
    DB_USER = db_username_file.read()

with open("/etc/secret-volume/password", "r") as db_password_file:
    DB_PASS = db_password_file.read()

# Create Cloud SQL Postgres Engine
pg_engine = PostgresEngine.from_instance(
    project_id=GCP_PROJECT_ID,
    instance=GCP_CLOUD_SQL_INSTANCE,
    region=GCP_CLOUD_SQL_REGION,
    database=DB_NAME,
    user=DB_USER,
    password=DB_PASS,
)
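
The cell above splits the connection name on `:` and indexes the result directly. Because the second argument to `os.environ.get` is evaluated eagerly, an unset `CLOUDSQL_INSTANCE_CONNECTION_NAME` leads to an `IndexError` on `cloud_variables[1]` even when the other environment variables are set. The sketch below shows one way to validate the value first, assuming the three-part `<project>:<region>:<instance>` form that the cell relies on. `InstanceConnection` and `parse_instance_connection_name` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class InstanceConnection(NamedTuple):
    project_id: str
    region: str
    instance: str


def parse_instance_connection_name(name: str) -> InstanceConnection:
    """Split '<project>:<region>:<instance>' into its three parts.

    Raises ValueError with a readable message when the value is empty
    or does not have exactly three non-empty parts.
    """
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"Expected '<project>:<region>:<instance>', got {name!r}"
        )
    return InstanceConnection(*parts)
```

The parsed fields could then serve as the fallbacks for `GCP_PROJECT_ID`, `GCP_CLOUD_SQL_REGION`, and `GCP_CLOUD_SQL_INSTANCE`.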

Next, we'll set up some parameters for the dataset processing steps:

In [6]:
SENTENCE_TRANSFORMER_MODEL = "intfloat/multilingual-e5-small"  # Transformer to use for converting text chunks to vector embeddings

# The dataset has been pre-downloaded to the GCS bucket by the clone cell above. Ray workers will find the dataset readily mounted.
SHARED_DATASET_BASE_PATH = "/data/kubernetes-docs/"

BATCH_SIZE = 100
CHUNK_SIZE = 1000  # text chunk sizes which will be converted to vector embeddings
CHUNK_OVERLAP = 10
VECTOR_DIMENSION = 384  # Embeddings size

## Initialize Vector Store Table

We are ready to begin. Let's first create the table that will store the vector embeddings:

In [7]:
pg_engine.init_vectorstore_table(
    VECTOR_EMBEDDINGS_TABLE_NAME,
    vector_size=VECTOR_DIMENSION,
    overwrite_existing=True,  # Enabling this will recreate the table if it already exists.
)

## Initialize Vector Store

In [8]:
embeddings_service = HuggingFaceEmbeddings(model_name=SENTENCE_TRANSFORMER_MODEL)
vector_store = PostgresVectorStore.create_sync(
    engine=pg_engine,
    embedding_service=embeddings_service,
    table_name=VECTOR_EMBEDDINGS_TABLE_NAME,
)
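
The table was created with `vector_size=VECTOR_DIMENSION`, so the embedding model must produce vectors of that same length. The sketch below is one way to check this before ingesting any documents. `check_embedding_dimension` is a hypothetical helper, not part of the original notebook, and it only assumes a callable that maps a string to a sequence of floats, such as `embeddings_service.embed_query`.

```python
from typing import Callable, Sequence


def check_embedding_dimension(
    embed_query: Callable[[str], Sequence[float]],
    expected_dim: int,
    probe_text: str = "dimension probe",
) -> int:
    """Embed a short probe string and compare its length to `expected_dim`.

    Returns the actual dimension, or raises ValueError on a mismatch.
    """
    actual_dim = len(embed_query(probe_text))
    if actual_dim != expected_dim:
        raise ValueError(
            f"Embedding dimension is {actual_dim}, "
            f"but the table expects {expected_dim}"
        )
    return actual_dim
```

In this notebook the call would be `check_embedding_dimension(embeddings_service.embed_query, VECTOR_DIMENSION)`.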



## Ingest PDF docs into Cloud SQL DB

### Load and Split the Kubernetes docs

In [9]:
loader = DirectoryLoader(f"{SHARED_DATASET_BASE_PATH}/PDFs", glob="*.pdf", show_progress=True)
documents = loader.load()

100%|██████████| 6/6 [11:36<00:00, 116.07s/it]


In [10]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
)

splits = splitter.split_documents(documents)
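
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraphs, lines, and spaces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical and exists only to illustrate the two parameters.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window always adds new characters.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxy", chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxy']
```

With the notebook's settings (1000 and 10), each chunk would repeat only the last 10 characters of the previous one, which is a small amount of shared context.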

### Add the splits to the vector store

In [11]:
ids = [str(uuid.uuid4()) for _ in splits]
vector_store.add_documents(splits, ids)

['3db34d89-aca6-4152-a2f5-09e26d932652',
 '49e44105-7700-4f22-a82b-859bfdb52a6e',
 'f4413c9d-0a33-412e-a840-9eb0ef320bd1',
 'e390d513-74db-4f86-9383-083220440296',
 '7bb7e45c-2f1e-403e-974d-ce3dd037e2d2',
 '57f1f180-3ecb-4218-a17f-b8bfcbcbdc6c',
 '8837f666-2568-40a8-b140-ed0632aa9086',
 '4ef75735-dc5e-48e9-8f04-95d7c0a06e31',
 'c4ffa497-042a-452c-b3c0-3b1080893ea4',
 '0fb1cc93-0753-4f34-96c4-0435dc0d06f8',
 'ddc41036-f30a-422b-98d4-48290a2e65c3',
 '6a79fb07-7bdc-4b16-9fef-e3548b374395',
 '5bca673d-d3c7-4f50-ae26-c10cd59ca3b4',
 '1ff27c15-7e38-4f3d-b0bc-1cf873b35068',
 'aefd3ab2-118e-4143-9e4f-2c6a7fe70ded',
 '4505f156-8813-463b-9283-f2e7835f1d82',
 '94a88882-bd06-4f3d-bbde-2eef5522273d',
 '7f199fbb-df58-45c6-a94f-4311bf2162b0',
 '28fbeed7-22dd-407c-8c51-67ddbed64297',
 '5a72d95d-fe3b-429f-b5df-5bab3d539e8f',
 '8faccd3d-29bd-48c4-871f-161864e32bfb',
 '4fc9ca3c-55ca-452d-bfee-4f7bbea33a65',
 '038e2bb2-1548-4d3f-96cf-c2e28eb37271',
 '87741718-2d75-47fc-962a-39b8864f7313',
 '32ce4d84-2824-
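
The cell above passes every split to `add_documents` in a single call, and `BATCH_SIZE` is defined earlier but never used. The sketch below shows one way the insert could be batched, assuming only that the store exposes `add_documents(documents, ids)` as called above. `batched` and `add_in_batches` are hypothetical helpers, and whether the store already batches internally is not stated in this notebook.

```python
import uuid
from typing import Any, Iterator, List, Sequence


def batched(items: Sequence[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def add_in_batches(store: Any, docs: Sequence[Any], batch_size: int) -> List[str]:
    """Add `docs` to `store` one batch at a time and return all generated IDs."""
    all_ids: List[str] = []
    for batch in batched(docs, batch_size):
        batch_ids = [str(uuid.uuid4()) for _ in batch]
        store.add_documents(batch, batch_ids)
        all_ids.extend(batch_ids)
    return all_ids
```

In this notebook the call would be `add_in_batches(vector_store, splits, BATCH_SIZE)`. Smaller batches bound the size of each database write and make it easier to see how far an ingestion got if it fails partway.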

## Trying the Vector Store

In [15]:
query = "Hello, what's kubernetes"
query_vector = embeddings_service.embed_query(query)
docs = vector_store.similarity_search_by_vector(query_vector, k=4)

for i, document in enumerate(docs):
  print(f"Result #{i+1}")
  print(document.page_content)
  print("-" * 100)

Result #1
Overview

Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

This page is an overview of Kubernetes.

Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

The name Kubernetes originates from Greek, meaning helmsman or pilot. K8s as an abbreviation results from counting the eight letters between the "K" and the "s". Google open- sourced the Kubernetes project in 2014. Kubernetes combines over 15 years of Google's experience running production workloads at scale with best-of-breed ideas and practices from the community.
----------
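
Conceptually, `similarity_search_by_vector` ranks the stored embeddings by their closeness to the query vector and returns the top `k`. The sketch below illustrates that idea in plain Python using cosine similarity. It is an assumption for illustration only: the actual ranking runs inside the database, and the distance measure the store uses is not stated in this notebook. `cosine_similarity` and `top_k` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between `a` and `b` (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], candidates: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the labels of the `k` candidates most similar to `query`."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy_vectors, k=2))  # ['a', 'c']
```

A linear scan like this one is fine for a toy example, but a real table with many rows relies on the database to perform the search.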