<div style="display: flex; align-items: left;">
    <a href="https://sites.google.com/corp/google.com/genai-solutions/home?authuser=0">
        <img src="https://storage.googleapis.com/miscfilespublic/Linkedin%20Banner%20%E2%80%93%202.png" style="margin-right">
    </a>
</div>

In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# **Open Data QnA: Vector Store Setup**

---

This notebook walks through the Vector Store Setup needed for running the Open Data QnA application. 

Currently supported Source DBs are: 
- PostgreSQL on Google Cloud SQL 
- BigQuery

Furthermore, the following vector stores are supported 
- pgvector on PostgreSQL 
- BigQuery vector


The notebook covers the following steps: 
> 1. Configuration: Initial GCP project, IAM permissions, environment, and database setup, including logging on BigQuery for analytics

> 2. Creation of Table, Column, and Known Good Query embeddings in the Vector Store for Retrieval-Augmented Generation (RAG)




### 📒 Using this interactive notebook

If you have not used this IDE with Jupyter notebooks before, it will prompt you to install the Python and Jupyter extensions. Go ahead and install them.

Click the **run** icons ▶️  of each cell within this notebook.

> 💡 Alternatively, you can run the currently selected cell with `Ctrl + Enter` (or `⌘ + Enter` on a Mac).

> ⚠️ **To avoid errors**, wait for each cell to finish, in order, before clicking the next “run” icon.

This sample must be connected to a **Google Cloud project**; nothing else is required.

You can use an existing project. Alternatively, you can create a new Cloud project [with cloud credits for free.](https://cloud.google.com/free/docs/gcp-free-tier)

## 🚧 **0. Pre-requisites**

Make sure that Google Cloud CLI is installed before moving to the next cell! You can refer to the link below for guidance

Installation Guide: https://cloud.google.com/sdk/docs/install
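
As a quick way to confirm the prerequisite, the following minimal sketch checks whether a `gcloud` executable is on the `PATH` and, if so, prints the first line of its version output. The helper names `find_gcloud` and `gcloud_version` are illustrative and not part of the Open Data QnA repo.

```python
import shutil
import subprocess


def find_gcloud():
    """Return the path to the gcloud executable, or None if it is not on PATH."""
    return shutil.which("gcloud")


def gcloud_version(gcloud_path):
    """Return the first line of `gcloud --version`, or None if the call fails."""
    try:
        result = subprocess.run(
            [gcloud_path, "--version"],
            capture_output=True,
            text=True,
            check=True,
            timeout=60,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


gcloud_path = find_gcloud()
if gcloud_path is None:
    print("gcloud was not found on PATH; see the installation guide above.")
else:
    print(f"gcloud found at {gcloud_path}: {gcloud_version(gcloud_path)}")
```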

##  **1. Setup GCP Project and Environment** 

### 🔗 **1A. Connect Your Google Cloud Project**
Time to connect your Google Cloud Project to this notebook. 

In [None]:
#@markdown Please fill in the value below with your GCP project ID and then run the cell.
PROJECT_ID = input("Please enter the GCP Project ID")

# Quick input validations.
assert PROJECT_ID, "⚠️ Please provide your Google Cloud Project ID"

# Configure gcloud.
!gcloud config set project {PROJECT_ID}
print(f'Project has been set to {PROJECT_ID}')
# !gcloud auth application-default set-quota-project {PROJECT_ID}

import os
module_path = os.path.abspath(os.path.join('..'))

import configparser
config = configparser.ConfigParser()
config.read(module_path+'/config.ini')

config['GCP']['PROJECT_ID'] = PROJECT_ID

with open(module_path+'/config.ini', 'w') as configfile:    # save
    config.write(configfile)



### 🔐 **1B. Authenticate to Google Cloud**
Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.
You can do this within Google Colab or using the Application Default Credentials in the Google Cloud CLI.
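
If authentication fails, it can help to check whether a credentials file is present at all. The sketch below looks in the two places Application Default Credentials are commonly read from: the file named by the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, and the file written by `gcloud auth application-default login`. The helper name `find_adc_file` is hypothetical, and the sketch assumes the default gcloud configuration directory has not been relocated.

```python
import os
from pathlib import Path


def find_adc_file(environ=None, home=None):
    """Return the path of a credentials file if one exists, else None.

    Assumption: gcloud uses its default config directory
    (~/.config/gcloud on Linux/macOS, %APPDATA%\\gcloud on Windows).
    """
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    # 1. An explicitly configured key file takes precedence.
    explicit = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        explicit_path = Path(explicit)
        return explicit_path if explicit_path.is_file() else None

    # 2. Otherwise look for the file created by
    #    `gcloud auth application-default login`.
    if os.name == "nt":
        base = Path(environ.get("APPDATA", str(home))) / "gcloud"
    else:
        base = home / ".config" / "gcloud"
    candidate = base / "application_default_credentials.json"
    return candidate if candidate.is_file() else None


adc_path = find_adc_file()
print("Credentials file:", adc_path if adc_path else "not found")
```

This check does not apply inside Google Colab or on Google Cloud compute, where credentials come from the runtime rather than from a local file.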


In [None]:
"""Colab Auth""" 
# from google.colab import auth
# auth.authenticate_user()


"""Google CLI Auth"""
# !gcloud auth application-default login


import google.auth
credentials, project_id = google.auth.default()
# credentials = google.auth.credentials.with_scopes_if_required(credentials)
# authed_http = google.auth.transport.requests.AuthorizedSession(credentials)

### ⚙️ **1C. Enable Required API Services in the GCP Project**

In [None]:
#Enable all the required APIs for the Open Data QnA solution

!gcloud services enable \
  cloudapis.googleapis.com \
  cloudbuild.googleapis.com \
  cloudresourcemanager.googleapis.com \
  compute.googleapis.com \
  container.googleapis.com \
  containerregistry.googleapis.com \
  iam.googleapis.com \
  run.googleapis.com \
  sqladmin.googleapis.com \
  aiplatform.googleapis.com \
  artifactregistry.googleapis.com \
  bigquery.googleapis.com \
  firebase.googleapis.com \
  monitoring.googleapis.com \
  serviceusage.googleapis.com \
  storage.googleapis.com \
  orgpolicy.googleapis.com

### 💻 **1D. Install Code Dependencies**
Install the dependencies by running the poetry commands below.


In [None]:
# Install poetry
! pip uninstall poetry -y
! pip install poetry --quiet

# Run the poetry commands below to set up the environment
!poetry lock # resolves dependencies (also auto-creates the poetry venv if it does not exist)
!poetry install # installs dependencies
!poetry env info # displays the env just created and the path to it

### 📌 **Important Step: Activate your virtual environment**

Check that the Jupyter kernel is set to the environment just created by poetry.

In order to set the kernel to the poetry env, do one of the following:
1. Run 'poetry shell' directly in the terminal (or)
2. Manually select the python interpreter in your IDE


Don't forget to switch your notebook kernel to the newly generated .venv environment after running the poetry command.

If you cannot find it, manually select the Python interpreter path shown when you run `poetry shell` (e.g. /home/admin_/Talk2Data/.venv/bin/python).
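
Before flipping the verification flag in the next cell, a check like the one below may help. It is a minimal sketch that prints the interpreter the kernel is using and reports whether that interpreter belongs to a virtual environment, by comparing `sys.prefix` with `sys.base_prefix`. The helper name `in_virtual_env` is illustrative. The check only shows that the kernel runs in some virtual environment, so the printed path still needs to be compared with the one reported by `poetry env info`.

```python
import sys


def in_virtual_env(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base prefix.

    Inside a venv or virtualenv, sys.prefix points at the environment while
    sys.base_prefix points at the Python installation it was created from.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


print("Kernel interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtual_env())
```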

In [None]:
# Jupyter Kernel Env verification
verified_jupyter_kernel = False
assert verified_jupyter_kernel, "⚠️ Please ensure that the Jupyter Kernel is set to the env created by poetry and change verified_jupyter_kernel to True before proceeding"

### 📈 **1E. Set Up your Data Source and Vector Store**

This section assumes that a data source is already set up in your GCP project. If a data source has not been set up, use the notebooks below to copy a public dataset from BigQuery to Cloud SQL or BigQuery in your GCP project.


Enabled Data Sources:
* PostgreSQL on Google Cloud SQL (Copy Sample Data: [0_CopyDataToCloudSqlPG.ipynb](0_CopyDataToCloudSqlPG.ipynb))
* BigQuery (Copy Sample Data: [0_CopyDataToBigQuery.ipynb](0_CopyDataToBigQuery.ipynb))

Enabled Vector Stores:
* pgvector on PostgreSQL 
* BigQuery vector


####  **Choose Data Source and Vector Store**

In [None]:
# Data source details
DATA_SOURCE = '' # Options: 'bigquery' or 'cloudsql-pg' (i.e., a PostgreSQL database on Google Cloud SQL)

# Please specify what you would like to use as vector store for embeddings
VECTOR_STORE = '' # Options: 'bigquery-vector' (i.e., BigQuery vector) or 'cloudsql-pgvector' (i.e., pgvector on PostgreSQL)

# If you have chosen 'cloudsql-pg' as DATA_SOURCE; provide information below
PG_REGION = "" #@param {type:"string"}
PG_INSTANCE = ""
PG_DATABASE = ""
PG_USER = ""
PG_PASSWORD = ""
PG_SCHEMA = '' # Name of the dataset that contains all the tables


# If you have chosen 'bigquery' as DATA_SOURCE; provide information below
BQ_DATASET_REGION = ''
BQ_DATASET_NAME = ''

# Input verification - Source
assert DATA_SOURCE in {'bigquery', 'cloudsql-pg'}, "⚠️ Invalid DATA_SOURCE. Must be 'bigquery' or 'cloudsql-pg'"

# Input verification - Vector Store
assert VECTOR_STORE in {'bigquery-vector', 'cloudsql-pgvector'}, "⚠️ Invalid VECTOR_STORE. Must be 'bigquery-vector' or 'cloudsql-pgvector'"


if DATA_SOURCE == 'bigquery':
    assert BQ_DATASET_REGION, "⚠️ Please provide the Data Set Region"
    assert BQ_DATASET_NAME, "⚠️ Please provide the name of the dataset on Bigquery"
elif DATA_SOURCE == 'cloudsql-pg':
    assert PG_REGION, "⚠️ Please provide Region of the Cloud SQL Instance"
    assert PG_INSTANCE, "⚠️ Please provide the name of the Cloud SQL Instance"
    assert PG_DATABASE, "⚠️ Please provide the name of the PostgreSQL Database on the Cloud SQL Instance"
    assert PG_USER, "⚠️ Please provide a username for the Cloud SQL Instance"
    assert PG_PASSWORD, "⚠️ Please provide the Password for the PG_USER"

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
import configparser
config = configparser.ConfigParser()
config.read(module_path+'/config.ini')

PROJECT_ID = config['GCP']['PROJECT_ID']

#### **Authenticate and Set Quota Project for .venv**

In [None]:
import google.auth
credentials, project_id = google.auth.default()

import os
os.environ['GOOGLE_CLOUD_QUOTA_PROJECT']=PROJECT_ID
os.environ['GOOGLE_CLOUD_PROJECT']=PROJECT_ID

####  **Database Setup for Vector Store**

Create PostgreSQL Instance on CloudSQL if 'cloudsql-pgvector' is chosen as vector store

Note that a PostgreSQL instance on Cloud SQL already exists if 'cloudsql-pg' is the data source. A new PostgreSQL instance is created only if 'cloudsql-pgvector' is the vector store and a different data source is chosen.
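
The rule above can be restated as a small predicate. This is only an illustration of the decision logic; the function name `needs_new_pg_instance` is hypothetical and does not exist in the repo.

```python
def needs_new_pg_instance(data_source, vector_store):
    """Return True when a PostgreSQL instance must be created for the vector store.

    An instance is needed only when pgvector is the vector store and the
    data source does not already live on a Cloud SQL PostgreSQL instance.
    """
    return vector_store == "cloudsql-pgvector" and data_source != "cloudsql-pg"


# Enumerate the four supported combinations.
for source in ("bigquery", "cloudsql-pg"):
    for store in ("bigquery-vector", "cloudsql-pgvector"):
        print(f"{source:12} + {store:18} -> create instance: "
              f"{needs_new_pg_instance(source, store)}")
```

Of the four combinations, only a BigQuery data source paired with a pgvector vector store triggers the creation of a new instance.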

In [None]:
#@markdown Feel free to update PostgreSQL or BigQuery parameters.
# If not updated, we will proceed with default values!

# Create a PostgreSQL instance if the data source is not already a PostgreSQL instance
if VECTOR_STORE == 'cloudsql-pgvector' and DATA_SOURCE != 'cloudsql-pg':
  # Parameters for PostgreSQL Instance
  PG_REGION = "us-central1" #@param {type:"string"}
  PG_INSTANCE = "pg15-opendataqna"
  PG_DATABASE = "opendataqna-db"
  PG_USER = "pguser"
  PG_PASSWORD = "pg123"
  PG_SCHEMA = 'pg-vector-store' 


  # check if Cloud SQL instance exists in the provided region
  database_version = !gcloud sql instances describe {PG_INSTANCE} --format="value(databaseVersion)"
  if database_version and database_version[0].startswith("POSTGRES"):  # guard against empty output
    print("Found existing Postgres Cloud SQL Instance!")
  else:
    print("Creating new Cloud SQL instance...")
    !gcloud sql instances create {PG_INSTANCE} --database-version=POSTGRES_15 \
      --region={PG_REGION} --cpu=1 --memory=4GB --root-password={PG_PASSWORD} \
      --database-flags=cloudsql.iam_authentication=On

  # Create a database on the instance and a user with password
  database_exists = !gcloud sql databases list --instance={PG_INSTANCE} | grep -z 'NAME: {PG_DATABASE}'
  if database_exists:
      print("Found existing Postgres Cloud SQL database!")
  else:
      print("Creating new Cloud SQL database...")
      !gcloud sql databases create  {PG_DATABASE} --instance={PG_INSTANCE}
  !gcloud sql users create {PG_USER} \
  --instance={PG_INSTANCE} \
  --password={PG_PASSWORD}


# Create a new data set on Bigquery to use as Vector store; the same will be used for logging as well
if VECTOR_STORE == 'bigquery-vector':
  BQ_OPENDATAQNA_DATASET_NAME = "opendataqna" #@param {type:"string"} - name of the dataset in Vector Store
  BQ_DATASET_REGION = "europe-west9" #@param {type:"string"}

  from google.cloud import bigquery
  import google.api_core 
  client=bigquery.Client(project=PROJECT_ID)
  dataset_ref = f"{PROJECT_ID}.{BQ_OPENDATAQNA_DATASET_NAME}"


  # Create the dataset if it does not exist already
  try:
      client.get_dataset(dataset_ref)
      print("Destination Dataset exists")
  except google.api_core.exceptions.NotFound:
      print("Cannot find the dataset hence creating.......")
      dataset=bigquery.Dataset(dataset_ref)
      dataset.location=BQ_DATASET_REGION
      client.create_dataset(dataset)
      print(str(dataset_ref)+" is created")

### 📝 **1F. Set up Logging on BigQuery** 

If BigQuery is the vector store, the same dataset is used for logging. If not, a new dataset is created for logging.

In [None]:
#@markdown Feel free to update the BigQuery dataset parameters.
# If not updated, we will proceed with default values!

# Set up database for logging on BigQuery if one has not been already set up for Vector Store
if VECTOR_STORE != 'bigquery-vector':
    BQ_OPENDATAQNA_DATASET_NAME = "opendataqna" #@param {type:"string"} - name of the dataset in Vector Store
    BQ_DATASET_REGION = "us-central1" #@param {type:"string"}

    from google.cloud import bigquery
    import google.api_core 

    client=bigquery.Client(project=PROJECT_ID)
    dataset_ref = f"{PROJECT_ID}.{BQ_OPENDATAQNA_DATASET_NAME}"


    # Create the dataset if it does not exist already
    try:
        client.get_dataset(dataset_ref)
        print("Destination Dataset for logging exists...")
    except google.api_core.exceptions.NotFound:
        print("Cannot find the dataset hence creating.......")
        dataset=bigquery.Dataset(dataset_ref)
        dataset.location=BQ_DATASET_REGION
        client.create_dataset(dataset)
        print(str(dataset_ref)+" is created")

### 💾 **1G. Save Configuration to File** 
Save the configuration set in this notebook to `config.ini`. The parameters in this file are used by subsequent notebooks and by various modules in the repo.
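
To illustrate how values travel between notebooks through `config.ini`, here is a minimal sketch of a write-then-read round trip with `configparser`. It writes to a temporary file rather than the real `config.ini`; the helper names `save_settings` and `load_setting` and the project ID `my-demo-project` are made up for the example.

```python
import configparser
import os
import tempfile


def save_settings(path, settings):
    """Merge `settings` ({section: {key: value}}) into the INI file at `path`."""
    config = configparser.ConfigParser()
    config.read(path)  # a missing file is silently ignored
    for section, values in settings.items():
        if not config.has_section(section):
            config.add_section(section)
        for key, value in values.items():
            config[section][key] = str(value)
    with open(path, "w") as configfile:
        config.write(configfile)


def load_setting(path, section, key):
    """Read a single value back, as a later notebook or module would."""
    config = configparser.ConfigParser()
    config.read(path)
    return config[section][key]


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.ini")
    save_settings(demo_path, {
        "GCP": {"PROJECT_ID": "my-demo-project"},  # hypothetical project ID
        "CONFIG": {"DATA_SOURCE": "bigquery", "VECTOR_STORE": "bigquery-vector"},
    })
    demo_project = load_setting(demo_path, "GCP", "PROJECT_ID")
    print(demo_project)
```

Two `configparser` behaviors are worth keeping in mind: option names are lower-cased when written but looked up case-insensitively, and with the default interpolation a value containing `%` (for example, a password) must be escaped as `%%` or the parser created with `interpolation=None`.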

In [None]:
import os
module_path = os.path.abspath(os.path.join('..'))

import configparser
config = configparser.ConfigParser()
config.read(module_path+'/config.ini')

config['GCP']['PROJECT_ID'] = PROJECT_ID
config['CONFIG']['DATA_SOURCE'] = DATA_SOURCE
config['CONFIG']['VECTOR_STORE'] = VECTOR_STORE
config['BIGQUERY']['BQ_OPENDATAQNA_DATASET_NAME'] = BQ_OPENDATAQNA_DATASET_NAME

# Save the parameters based on Data Source and Vector Store Choices
if DATA_SOURCE == 'cloudsql-pg' or VECTOR_STORE == 'cloudsql-pgvector':
    config['PGCLOUDSQL']['PG_INSTANCE'] = PG_INSTANCE
    config['PGCLOUDSQL']['PG_DATABASE'] = PG_DATABASE
    config['PGCLOUDSQL']['PG_USER'] = PG_USER
    config['PGCLOUDSQL']['PG_PASSWORD'] = PG_PASSWORD
    config['PGCLOUDSQL']['PG_REGION'] = PG_REGION
    config['PGCLOUDSQL']['PG_SCHEMA'] = PG_SCHEMA

if DATA_SOURCE == 'bigquery':  # compare, do not assign: ':=' would overwrite DATA_SOURCE
    config['BIGQUERY']['BQ_DATASET_REGION'] = BQ_DATASET_REGION
    config['BIGQUERY']['BQ_DATASET_NAME'] = BQ_DATASET_NAME

with open(module_path+'/config.ini', 'w') as configfile:    # save
    config.write(configfile)

print('All configuration parameters saved to file!')

##  **2. Create Embeddings in Vector Store for RAG** 

###  **2A. Create Table and Column Embeddings**

In this step, table and column metadata is retrieved from the data source and embeddings are generated for both.

In [None]:
# Create Table and Column Embeddings
from embeddings.retrieve_embeddings import retrieve_embeddings

if DATA_SOURCE =='bigquery':
    table_schema_embeddings, col_schema_embeddings = retrieve_embeddings(DATA_SOURCE, SCHEMA=BQ_DATASET_NAME)
else: 
    table_schema_embeddings, col_schema_embeddings = retrieve_embeddings(DATA_SOURCE, SCHEMA=PG_SCHEMA)

print("Table and Column embeddings are created")


### 💾 **2B. Save the Table and Column Embeddings in the Vector Store**
The table and column embeddings created in the previous step are saved to the chosen Vector Store.

In [None]:
from embeddings.store_embeddings import store_schema_embeddings

if VECTOR_STORE=='bigquery-vector':
    await(store_schema_embeddings(table_details_embeddings=table_schema_embeddings, 
                                  tablecolumn_details_embeddings=col_schema_embeddings, 
                                  project_id=PROJECT_ID,
                                  instance_name=None,
                                  database_name=None,
                                  schema=BQ_OPENDATAQNA_DATASET_NAME,
                                  database_user=None,
                                  database_password=None,
                                  region=BQ_DATASET_REGION,
                                  VECTOR_STORE = VECTOR_STORE
                                  ))

elif VECTOR_STORE=='cloudsql-pgvector':
    await(store_schema_embeddings(table_details_embeddings=table_schema_embeddings, 
                                tablecolumn_details_embeddings=col_schema_embeddings, 
                                project_id=PROJECT_ID,
                                instance_name=PG_INSTANCE,
                                database_name=PG_DATABASE,
                                schema=None,
                                database_user=PG_USER,
                                database_password=PG_PASSWORD,
                                region=PG_REGION,
                                VECTOR_STORE = VECTOR_STORE
                                ))

print("Table and Column embeddings are saved to vector store")

### 🗄️ **2C. Load Known Good SQL into Vector Store**
Known Good Queries are used to build a query cache of few-shot examples. Creating a query cache is highly recommended for best outcomes!

The following cell loads the Natural Language Question and Known Good SQL pairs into the Vector Store. These pairs are loaded from the `known_good_sql.csv` file inside the scripts folder. If you have your own Question-SQL examples, curate them in a .csv file before running the cell below.

If no Known Good Queries are available at this time, you can use [3_LoadKnownGoodSQL.ipynb](3_LoadKnownGoodSQL.ipynb) to load them later. An empty table for KGQ embeddings is created in either case.
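
When curating your own examples, the file needs the three columns that the loading cell selects: `prompt`, `sql`, and `database_name`. The sketch below parses a small in-memory CSV with the standard library, checks for those columns, and drops incomplete rows, which approximates the `dropna()` step in the loading cell. The sample rows, the `demo_dataset` name, and the helper `load_kgq_rows` are invented for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("prompt", "sql", "database_name")


def load_kgq_rows(csv_text):
    """Parse CSV text and return only the complete prompt/sql/database_name rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"Missing required column(s): {', '.join(missing)}")

    rows = []
    for row in reader:
        cleaned = {col: (row.get(col) or "").strip() for col in REQUIRED_COLUMNS}
        if all(cleaned.values()):  # skip rows with any empty field
            rows.append(cleaned)
    return rows


# SQL that contains commas must be wrapped in double quotes.
sample_csv = '''prompt,sql,database_name
How many orders were placed?,"SELECT COUNT(*) AS order_count FROM orders",demo_dataset
Incomplete row without SQL,,demo_dataset
'''

kgq_rows = load_kgq_rows(sample_csv)
for row in kgq_rows:
    print(row["database_name"], "|", row["prompt"], "|", row["sql"])
```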



In [None]:
# If you have Known Good Queries, load them to known_good_sql.csv file; 
# These will be used as few shot examples for query generation. 
# This step is highly recommended for best outcomes!
EXAMPLES = 'yes'# Options 'yes' or 'no'

from embeddings.kgq_embeddings import setup_kgq_table
# Delete any old tables and create a new table for KGQ embeddings
if VECTOR_STORE=='bigquery-vector':
    await(setup_kgq_table(project_id=PROJECT_ID,
                          instance_name=None,
                          database_name=None,
                          schema=BQ_OPENDATAQNA_DATASET_NAME,
                          database_user=None,
                          database_password=None,
                          region=BQ_DATASET_REGION,
                          VECTOR_STORE = VECTOR_STORE
                          ))

elif VECTOR_STORE=='cloudsql-pgvector':
    await(setup_kgq_table(project_id=PROJECT_ID,
                          instance_name=PG_INSTANCE,
                          database_name=PG_DATABASE,
                          schema=None,
                          database_user=PG_USER,
                          database_password=PG_PASSWORD,
                          region=PG_REGION,
                          VECTOR_STORE = VECTOR_STORE
                          ))


if EXAMPLES == 'yes':
    print("Examples are provided, creating KGQ embeddings and saving them to Vector store.....")

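    import os
    import pandas as pd

    # Search the current directory tree for known_good_sql.csv, then move up
    # one level at a time until the file is found or the home directory is reached.
    current_dir = os.getcwd()
    root_dir = os.path.expanduser('~')  # Stop climbing at the user's home directory
    file_path = None

    while file_path is None and current_dir != root_dir:
        for dirpath, dirnames, filenames in os.walk(current_dir):
            if 'known_good_sql.csv' in filenames:
                file_path = os.path.join(dirpath, 'known_good_sql.csv')
                break  # Stop walking once the file is found

        parent_dir = os.path.dirname(current_dir)
        if parent_dir == current_dir:
            break  # Reached the filesystem root without passing through home
        current_dir = parent_dir

    assert file_path, "⚠️ Could not find known_good_sql.csv; please add it to the scripts folder"
    print("Known Good SQL Found at Path :: "+file_path)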

    # Load the file
    df_kgq = pd.read_csv(file_path)
    df_kgq = df_kgq.loc[:, ["prompt", "sql", "database_name"]]
    df_kgq = df_kgq.dropna()

    from embeddings.kgq_embeddings import store_kgq_embeddings
    # Add KGQ to the vector store
    if VECTOR_STORE=='bigquery-vector':
        await(store_kgq_embeddings(df_kgq,
                                   project_id=PROJECT_ID,
                                    instance_name=None,
                                    database_name=BQ_OPENDATAQNA_DATASET_NAME,
                                    schema=BQ_DATASET_NAME,
                                    database_user=None,
                                    database_password=None,
                                    region=BQ_DATASET_REGION,
                                    VECTOR_STORE = VECTOR_STORE
                                    ))

    elif VECTOR_STORE=='cloudsql-pgvector':
        await(store_kgq_embeddings(df_kgq,
                                   project_id=PROJECT_ID,
                                    instance_name=PG_INSTANCE,
                                    database_name=PG_DATABASE,
                                    schema=None,
                                    database_user=PG_USER,
                                    database_password=PG_PASSWORD,
                                    region=PG_REGION,
                                    VECTOR_STORE = VECTOR_STORE
                                    ))
    print('Done!!')

else:
    print("⚠️ WARNING: No Known Good Queries are provided to create a query cache of few-shot examples!")
    print("Creating a query cache is highly recommended for best outcomes")
    print("If no Known Good Queries for the dataset are available at this time, you can use 3_LoadKnownGoodSQL.ipynb to load them later!!")


### 🥁 If all the above steps are executed successfully, the following should be set up:

* GCP project and all the required IAM permissions

* Environment to run the solution

* Data source and Vector store for the solution