<div style="display: flex; align-items: left;">
    <a href="https://sites.google.com/corp/google.com/genai-solutions/home?authuser=0">
        <img src="https://storage.googleapis.com/miscfilespublic/Linkedin%20Banner%20%E2%80%93%202.png" style="margin-right">
    </a>
</div>

In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# **Talk2Data: Initial Environment Setup**

---

This notebook walks through the initial environment setup needed to run the Talk2Data solution. 
Because the solution is built on modular components, you can skip to the sections that are relevant to your use case. 

Currently supported Source DBs are: 
- PostgreSQL on Google Cloud SQL 
- BigQuery

The following vector stores are supported: 
- pgvector on PostgreSQL 
- BigQuery vector


The notebook covers the following steps: 
> 1. Setting up PostgreSQL instance on Google Cloud SQL (for using both or either of PostgreSQL & pgvector)

> 2. Migrating BigQuery public data to the PostgreSQL instance (for using PostgreSQL as the Source DB)

> 3. Setting up BigQuery environment (for using both or either of BigQuery Source DB & Vector Store)

> 4. Populating the vector store with the 'known good' question-SQL pairs (BigQuery vector DB & pgvector)



### 📒 Using this interactive notebook

If you have not used this IDE with Jupyter notebooks before, it will prompt you to install the Python and Jupyter extensions. Please go ahead and install them.

Click the **run** icons ▶️  of each section within this notebook.

> 💡 Alternatively, you can run the currently selected cell with `Ctrl + Enter` (or `⌘ + Enter` on a Mac).

> ⚠️ **To avoid any errors**, wait for each section to finish, in order, before clicking the next “run” icon.

This sample must be connected to a **Google Cloud project**; nothing else is required.

You can use an existing project. Alternatively, you can create a new Cloud project [with cloud credits for free.](https://cloud.google.com/free/docs/gcp-free-tier)

## 🚧 **0. Getting Started**


### 💻 **Install Code Dependencies**
If you haven't already, install the dependencies by running `poetry install`.

##### Don't forget to switch your notebook kernel to the newly generated .venv environment after running the poetry command. 

If you cannot find it, manually select the Python interpreter path shown when you run `poetry shell` (e.g. ~/.cache/talk2data-Fajjajah-py3.9/bin/python)

In [3]:
# Ignore this if you are already running inside the poetry shell
import sys
import os
sys.path.append(os.path.abspath(os.path.join('..')))

# Install poetry first, then run the commands below
!poetry lock
!poetry install

Installing dependencies from lock file

No dependencies to install or update

Installing the current project: talktodata (0.1.0)


### 🔐 Authenticate to Google Cloud 
Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.

You can do this within Google Colab or using the Application Default Credentials in the Google Cloud CLI.

In [1]:
"""Colab Auth""" 
# from google.colab import auth
# auth.authenticate_user()


"""Google CLI Auth"""
# !gcloud auth application-default login


import google.auth
credentials, project_id = google.auth.default()
# credentials = google.auth.credentials.with_scopes_if_required(credentials)
# authed_http = google.auth.transport.requests.AuthorizedSession(credentials)

### 🔗 Connect Your Google Cloud Project
Time to connect your Google Cloud Project to this notebook. 

In [2]:
#@markdown Please fill in the value below with your GCP project ID and then run the cell.
# PROJECT_ID = "talk2data-genai-sa" #@param {type:"string"}
PROJECT_ID = input("Please enter the Project ID. This project will be used to create the source data. ")

# Quick input validations.
assert PROJECT_ID, "⚠️ Please provide your Google Cloud Project ID"

# Configure gcloud.
!gcloud config set project {PROJECT_ID}
PROJECT_ID
# !gcloud auth application-default set-quota-project {PROJECT_ID}

Updated property [core/project].


'talk2data-genai-sa'

### Enable Required APIs

In [3]:
!gcloud services enable sqladmin.googleapis.com # Enable Cloud SQL Admin API
!gcloud services enable aiplatform.googleapis.com # Enable Vertex AI API

# **PostgreSQL Source Setup**

## ☁️ **1. Setting up Cloud SQL PostgreSQL** 
A **Postgres** Cloud SQL instance is required for the following stages of this notebook.

To connect and access our Postgres Cloud SQL database instance(s) we will leverage the [Cloud SQL Python Connector](https://github.com/GoogleCloudPlatform/cloud-sql-python-connector).

The Cloud SQL Python Connector is a library that can be used alongside a database driver to let users connect to a Cloud SQL database without having to manually allowlist IP addresses or manage SSL certificates. 
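
As a rough sketch of the connection pattern used later in this notebook: the connector takes an instance connection name of the form `project:region:instance` plus the name of the driver (`asyncpg` here). The helper names `instance_connection_name` and `fetch_server_version` are illustrative assumptions, not part of the library, and the connector import is done lazily so the first helper works on its own.

```python
import asyncio


def instance_connection_name(project_id: str, region: str, instance: str) -> str:
    """Build the '<project>:<region>:<instance>' string the connector expects."""
    parts = (project_id, region, instance)
    if not all(parts):
        raise ValueError("project_id, region and instance must all be non-empty")
    return ":".join(parts)


async def fetch_server_version(project_id, region, instance, user, password, database):
    """Connect through the Cloud SQL Python Connector and return the server version."""
    # Imported lazily so instance_connection_name() works without the package installed.
    from google.cloud.sql.connector import Connector

    loop = asyncio.get_running_loop()
    async with Connector(loop=loop) as connector:
        conn = await connector.connect_async(
            instance_connection_name(project_id, region, instance),
            "asyncpg",
            user=user,
            password=password,
            db=database,
        )
        try:
            return await conn.fetchval("SELECT version()")
        finally:
            await conn.close()


print(instance_connection_name("my-project", "us-central1", "my-instance"))
```

In a notebook cell, `await fetch_server_version(...)` could be called directly once the instance, user, and database exist; `my-project` and `my-instance` are placeholder values.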

💽 **Create a Postgres Instance**

Running the below cell will verify the existence of a Cloud SQL instance or create a new one if one does not exist.

> ⏳ - Creating a Cloud SQL instance may take a few minutes.

In [4]:
#@markdown Please fill in both the Google Cloud region and the name of your Cloud SQL instance. Once filled in, run the cell.

# Please fill in these values.
PG_REGION = "us-central1" #@param {type:"string"}
PG_INSTANCE = "domingo"
PG_PASSWORD = "vector123"

# Quick input validations.
assert PG_REGION, "⚠️ Please provide a Google Cloud region"
assert PG_INSTANCE, "⚠️ Please provide the name of your instance"

# Check whether the Cloud SQL instance already exists in the current project
database_version = !gcloud sql instances describe {PG_INSTANCE} --format="value(databaseVersion)"
# Guard against empty output before indexing into it
if database_version and database_version[0].startswith("POSTGRES"):
  print("Found existing Postgres Cloud SQL Instance!")
else:
  print("Creating new Cloud SQL instance...")
  !gcloud sql instances create {PG_INSTANCE} --database-version=POSTGRES_15 \
    --region={PG_REGION} --cpu=1 --memory=4GB --root-password={PG_PASSWORD} \
    --database-flags=cloudsql.iam_authentication=On

Found existing Postgres Cloud SQL Instance!


## ➡️ **2. Migrate a public BigQuery database to our PostgreSQL instance**
Let's migrate a public BigQuery dataset over to our newly created PostgreSQL instance. 

### A) Set up a Google Cloud Storage Bucket 
This bucket will be used to store the export of our BigQuery public dataset.
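
Because bucket names share a single global namespace, it can help to sanity-check the name locally before calling the API. The sketch below covers only a simplified subset of the Cloud Storage naming rules (3 to 63 characters, lowercase letters, digits, dashes and underscores, no `goog` prefix, no `google` substring); names containing dots are deliberately rejected here as a simplifying assumption, and `is_plausible_bucket_name` is a hypothetical helper, not part of any Google library.

```python
import re

# Simplified subset of the Cloud Storage bucket naming rules.
# Assumption: dotted (domain-style) bucket names are out of scope.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if `name` passes a basic local check of the naming rules."""
    if not _BUCKET_RE.match(name):
        return False
    # Names may not start with "goog" or contain "google".
    return not name.startswith("goog") and "google" not in name


print(is_plausible_bucket_name("talk2data-genai-sa-talk2data"))  # True
print(is_plausible_bucket_name("My_Project:bucket"))             # False
```

Passing this check does not guarantee the name is available; only the create call can confirm that.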

In [5]:
#@markdown Please fill in the name of your Cloud Storage bucket, or keep the default. Once filled in, run the cell.

# Please fill in these values.
BUCKET_NAME = str(PROJECT_ID+'-talk2data') #@param {type:"string"}
print("Bucket Created : "+ BUCKET_NAME)
# Quick input validations.
assert BUCKET_NAME, "⚠️ Please provide a unique name for your bucket"

Bucket Created : talk2data-genai-sa-talk2data


In [6]:
from google.cloud import storage
from google.api_core.exceptions import Conflict, Forbidden

storage_client = storage.Client(project=PROJECT_ID)

try:
    bucket = storage_client.bucket(BUCKET_NAME)

    if bucket.exists():
        print("This bucket already exists.")

    else:
        bucket = storage_client.create_bucket(BUCKET_NAME)
        print(f"Bucket {bucket.name} created")

except (Conflict, Forbidden):
    # Bucket names are global, so the name may be taken by a bucket you cannot access.
    print("⚠️ This bucket already exists in another project. Make sure to give your bucket a unique name.")

Bucket talk2data-genai-sa-talk2data created


### B) Export BigQuery Dataset to the Bucket


In [7]:
#@markdown Please fill in which BigQuery dataset to export. You can leave the default values. Once filled in, run the cell.

# Please fill in these values.
BQ_PROJECT = "bigquery-public-data"
BQ_DATABASE = "google_dei"
BQ_TABLE = "dar_intersectional_hiring"


# Quick input validations.
assert BQ_PROJECT, "⚠️ Please specify the BigQuery Project"
assert BQ_DATABASE, "⚠️ Please specify the BigQuery Database"
assert BQ_TABLE, "⚠️ Please specify the BigQuery Table"

In [8]:
BUCKET_FILENAME = "export*.csv" #@param {type:"string"}


from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)

destination_uri = "gs://{}/{}".format(BUCKET_NAME, BUCKET_FILENAME)
dataset_ref = bigquery.DatasetReference(BQ_PROJECT, BQ_DATABASE)
table_ref = dataset_ref.table(BQ_TABLE)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location="US",
)  # API request
extract_job.result()  # Waits for job to complete.

print(
    f"Exported {BQ_PROJECT}:{BQ_DATABASE}.{BQ_TABLE} to {destination_uri}"
)

Exported bigquery-public-data:google_dei.dar_intersectional_hiring to gs://talk2data-genai-sa-talk2data/export*.csv


### C) Retrieve Data Types and Formats 
To migrate our exported .csv files to PostgreSQL, we need to fetch the Data Types and Format from our table in the .csv export. 
This needs to be done as we're setting up the PostgreSQL table and columns first (and need to provide the columns in the setup).
We will load the .csv content into the table afterwards. 

In [9]:
from google.cloud import storage
import pandas as pd

storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.get_bucket(BUCKET_NAME)

# Only the first export file is needed to infer the column names and dtypes.
first_blob = next(iter(bucket.list_blobs()), None)
assert first_blob is not None, "⚠️ No export files found in the bucket"

URI = f"gs://{BUCKET_NAME}/{first_blob.name}"
df = pd.read_csv(URI)  # reading gs:// paths with pandas requires the gcsfs package

field_names = df.columns
field_types = df.dtypes


In [None]:
field_names

In [None]:
field_types

### D) Build the SQL Query for Table Creation 
Every database is different. To accommodate different table structures, depending on which BigQuery dataset is loaded, we will build the SQL query for creating the required PostgreSQL table dynamically. 

OPTIONAL: 
If you want to set your own values for the schema, database, and user, modify the cell below. 
You can also keep the default values.  
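
One possible way to build such a statement is sketched below: each pandas dtype name is mapped to a PostgreSQL type per column, and identifiers are double-quoted. This is an illustration under stated assumptions, not the notebook's own implementation: `PANDAS_TO_PG`, `quote_ident`, and `build_create_table` are hypothetical names, and `int64` is mapped to `BIGINT` here (the cells below use `INTEGER`) because a 64-bit value can exceed the range of a 4-byte `INTEGER`.

```python
# Assumed mapping from pandas dtype names to PostgreSQL column types.
PANDAS_TO_PG = {
    "object": "TEXT",
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
}


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_create_table(schema, table, field_names, field_types):
    """Return a CREATE TABLE statement for the given columns and dtype names."""
    cols = ", ".join(
        # Unknown dtypes fall back to TEXT in this sketch.
        f"{quote_ident(str(name))} {PANDAS_TO_PG.get(str(dtype), 'TEXT')}"
        for name, dtype in zip(field_names, field_types)
    )
    return f"CREATE TABLE {quote_ident(schema)}.{quote_ident(table)} ({cols})"


print(build_create_table(
    "dei",
    "dar_intersectional_hiring",
    ["workforce", "report_year", "race_asian"],
    ["object", "int64", "float64"],
))
```

Mapping types column by column avoids running a string replacement over the whole statement, which could also rewrite a column name that happens to contain a dtype name such as `object`.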

In [10]:
#@markdown Please specify the PGSchema or leave it as default (public)
PG_SCHEMA = 'dei'   #default: 'public'
PG_DATABASE = 'vectordb'    #default: 'postgres'
PG_USER = 'vector_user'    #default: 'postgres'

!gcloud sql databases create  {PG_DATABASE} --instance={PG_INSTANCE}

!gcloud sql users create {PG_USER} \
--instance={PG_INSTANCE} \
--password={PG_PASSWORD}

ERROR: (gcloud.sql.databases.create) HTTPError 400: Invalid request: failed to create database vectordb. Detail: pq: database "vectordb" already exists.
Creating Cloud SQL user...done.                                                
Created user [vector_user].


In [11]:
def get_sql(PG_SCHEMA, BQ_TABLE, field_names, field_types): 
    """Build a CREATE TABLE statement from column names and pandas dtypes."""
    # zip avoids positional indexing into the dtypes Series (field_types[i]),
    # which pandas deprecates for label-indexed Series.
    cols = ", ".join(
        f"{name} {dtype}" for name, dtype in zip(field_names, field_types)
    )

    return f"CREATE TABLE {PG_SCHEMA}.{BQ_TABLE}({cols})"

sql = get_sql(PG_SCHEMA, BQ_TABLE, field_names, field_types)
sql


'CREATE TABLE dei.dar_intersectional_hiring(workforce object, report_year int64, gender_us object, race_asian float64, race_black float64, race_hispanic_latinx float64, race_native_american float64, race_white float64)'

### E) Create the PostgreSQL Table

In [20]:
import asyncio 

async def create_pg_table(PROJECT_ID,
                          PG_REGION,
                          PG_INSTANCE,
                          PG_PASSWORD,
                          BQ_TABLE, 
                          PG_DATABASE, 
                          SQL,
                          PG_USER): 
    """Create PG Table from BQ Schema"""
    import asyncio
    import asyncpg
    from google.cloud.sql.connector import Connector

    # Replace the Data Types to work with PostgreSQL supported ones 
    SQL = SQL.replace("object", "TEXT").replace("int64", "INTEGER").replace("float64", "DOUBLE PRECISION")

    loop = asyncio.get_running_loop()
    async with Connector(loop=loop) as connector:
        # Create connection to Cloud SQL database
        conn: asyncpg.Connection = await connector.connect_async(
            f"{PROJECT_ID}:{PG_REGION}:{PG_INSTANCE}",  # Cloud SQL instance connection name
            "asyncpg",
            user=f"{PG_USER}",
            db=f"{PG_DATABASE}",
            password=f"{PG_PASSWORD}"
        )

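        # Recreate the schema from scratch; CASCADE also drops any tables inside it.
        await conn.execute(f"DROP SCHEMA IF EXISTS {PG_SCHEMA} CASCADE")        

        await conn.execute(f"CREATE SCHEMA {PG_SCHEMA}")        

        # Schema-qualify the table so the drop targets the table created below.
        await conn.execute(f"DROP TABLE IF EXISTS {PG_SCHEMA}.{BQ_TABLE} CASCADE")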
        
        # Create the table.
        await conn.execute(SQL)

        await conn.close()


# # Create PG Table 
await(create_pg_table(PROJECT_ID, PG_REGION, PG_INSTANCE, PG_PASSWORD, BQ_TABLE, PG_DATABASE, sql,PG_USER))

### F) Import Data to PostgreSQL Table
The cell below iterates through each export file in our Google Cloud Storage bucket and loads it into the PostgreSQL table. 
This may take a while, depending on the size of the BigQuery public dataset. You can optionally adjust the `LIMIT` variable to cap how many export files are loaded. 
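
A small sketch of one way to make such a limit optional, using `itertools.islice` so that `None` means "load everything". The helper `limit_items` and the file names are hypothetical; in the cell below the iterable would be the result of `bucket.list_blobs()`.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def limit_items(items: Iterable[T], limit: Optional[int] = None) -> Iterator[T]:
    """Yield at most `limit` items; yield everything when `limit` is None."""
    return islice(items, limit)


# Hypothetical export file names, for illustration only.
export_files = [
    "export000000000000.csv",
    "export000000000001.csv",
    "export000000000002.csv",
    "export000000000003.csv",
]

print(list(limit_items(export_files, 3)))   # first three files only
print(len(list(limit_items(export_files)))) # 4: no limit applied
```

Compared with checking `idx < LIMIT` inside the loop, this stops iterating as soon as the limit is reached instead of walking the remaining blobs.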

In [21]:
async def import_to_pg(PROJECT_ID,
                          PG_REGION,
                          PG_INSTANCE,
                          PG_USER,
                          PG_PASSWORD,
                          PG_DATABASE,
                          BQ_TABLE, 
                          BUCKET_NAME,
                          field_types): 
    from google.cloud import storage
    import pandas as pd 
    import asyncio
    import asyncpg
    from google.cloud.sql.connector import Connector

    storage_client = storage.Client(project=PROJECT_ID)

    bucket = storage_client.get_bucket(BUCKET_NAME)
    blobs = bucket.list_blobs()

    LIMIT = 3 

    PG_SCHEMA = 'dei'

    loop = asyncio.get_running_loop()
    async with Connector(loop=loop) as connector:
        # Create connection to Cloud SQL database
        conn: asyncpg.Connection = await connector.connect_async(
            f"{PROJECT_ID}:{PG_REGION}:{PG_INSTANCE}",  # Cloud SQL instance connection name
            "asyncpg",
            user=f"{PG_USER}",
            password=f"{PG_PASSWORD}",
            db=f"{PG_DATABASE}",
        )

        for idx,blob in enumerate(blobs):
            if idx < LIMIT:     # Comment this out if you don't want to use a limit 
                URI = "gs://{}".format(blob.id).split('.csv', 1)[0]+'.csv'
                df = pd.read_csv(URI)

                df = df.dropna()

                df.info()   

                # Copy the dataframe rows into the table.
                tuples = list(df.itertuples(index=False))

                await conn.copy_records_to_table(
                    BQ_TABLE, records=tuples, columns=list(df), schema_name=PG_SCHEMA, timeout=3600
                )
        await conn.close()

    # # Load Data into PG Table 
await(import_to_pg(PROJECT_ID, PG_REGION, PG_INSTANCE, PG_USER, PG_PASSWORD, PG_DATABASE, BQ_TABLE, BUCKET_NAME, field_types))

<class 'pandas.core.frame.DataFrame'>
Index: 102 entries, 3 to 107
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   workforce             102 non-null    object 
 1   report_year           102 non-null    int64  
 2   gender_us             102 non-null    object 
 3   race_asian            102 non-null    float64
 4   race_black            102 non-null    float64
 5   race_hispanic_latinx  102 non-null    float64
 6   race_native_american  102 non-null    float64
 7   race_white            102 non-null    float64
dtypes: float64(5), int64(1), object(2)
memory usage: 7.2+ KB


## 🗄️ **4. Load Known Good SQL into Vector Store**
The following cell loads the Known Good Question-SQL pairs into our Vector Store. 
It reads them from the file `known_good_sql.csv`. 
If you have your own Question-SQL examples, add them to the .csv file before running the cell below. 

Note: For pgvector, the pairs will be loaded in the 'public' schema. 
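
The loading cell expects the columns `prompt`, `sql`, and `database_name` and drops incomplete rows. The sketch below shows that shape with the standard library only; the sample rows are invented for illustration and `load_pairs` is a hypothetical helper, not part of the repository.

```python
import csv
import io

REQUIRED_COLUMNS = ("prompt", "sql", "database_name")


def load_pairs(csv_text: str):
    """Parse question-SQL pairs, keeping only rows with all required fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return [
        tuple(row[c].strip() for c in REQUIRED_COLUMNS)
        for row in reader
        if all((row[c] or "").strip() for c in REQUIRED_COLUMNS)
    ]


# Hypothetical sample content; the second data row is incomplete and is skipped.
sample = (
    "prompt,sql,database_name\n"
    '"How many rows are there?","SELECT COUNT(*) FROM dei.dar_intersectional_hiring",vectordb\n'
    '"Incomplete example",,vectordb\n'
)

for pair in load_pairs(sample):
    print(pair)
```

Checking the header up front gives a clearer error than a missing-column failure later in the load.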


In [22]:
#TODO: Use PGConnector 
async def cache_known_sql(PROJECT_ID,
                           PG_REGION,
                           PG_INSTANCE,
                           PG_USER,
                           PG_PASSWORD,
                           PG_DATABASE,
                           VECTOR_STORE):
    from utilities import root_dir 
    import pandas as pd
    import os
    import asyncio
    import asyncpg
    from google.cloud.sql.connector import Connector

    df = pd.read_csv(root_dir+"/scripts/known_good_sql.csv")
    df = df.loc[:, ["prompt", "sql", "database_name"]]
    df = df.dropna()

    if VECTOR_STORE == "cloudsql-pgvector": 
        loop = asyncio.get_running_loop()
        async with Connector(loop=loop) as connector:
            # Create connection to Cloud SQL database
            conn: asyncpg.Connection = await connector.connect_async(
                f"{PROJECT_ID}:{PG_REGION}:{PG_INSTANCE}",  # Cloud SQL instance connection name
                "asyncpg",
                user=f"{PG_USER}",
                password=f"{PG_PASSWORD}",
                db=f"{PG_DATABASE}",
            )


            await conn.execute("DROP TABLE IF EXISTS query_example_embeddings CASCADE")

            # Create the `query_example_embeddings` table.
            await conn.execute(
                """CREATE TABLE query_example_embeddings(
                                    prompt TEXT,
                                    sql TEXT,
                                    database_name TEXT)"""
            )

            # Copy the dataframe to the 'query_example_embeddings' table.
            tuples = list(df.itertuples(index=False))
            await conn.copy_records_to_table(
                "query_example_embeddings", records=tuples, columns=list(df), timeout=3600
            )
            await conn.close()

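    elif VECTOR_STORE == "bigquery": 
        # TODO: loading into the BigQuery vector store is not implemented yet.
        print("Loading known good SQL into the BigQuery vector store is not implemented yet.")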

    else: raise ValueError("Not a valid parameter for a vector store.")




await(cache_known_sql(PROJECT_ID,
                           PG_REGION,
                           PG_INSTANCE,
                           PG_USER,
                           PG_PASSWORD,
                           PG_DATABASE,
                           VECTOR_STORE="cloudsql-pgvector"))

### BigQuery Dataset Setup for Processing (Logging)

In [23]:
# @markdown Please fill in both the Google Cloud region and the name of your Dataset. Once filled in, run the cell.

# Please fill in these values.
BQ_TALK2DATA_DATASET_NAME = "talk2data" #@param {type:"string"}
BQ_DATASET_REGION = "us-central1" #@param {type:"string"}

from google.cloud import bigquery
import google.api_core 

client=bigquery.Client(project=PROJECT_ID)

dataset_ref = f"{PROJECT_ID}.{BQ_TALK2DATA_DATASET_NAME}"

try:
    client.get_dataset(dataset_ref)
    print("Destination Dataset exists")
except google.api_core.exceptions.NotFound:
    print("Cannot find the dataset hence creating.......")
    dataset=bigquery.Dataset(dataset_ref)
    dataset.location=BQ_DATASET_REGION
    client.create_dataset(dataset)

print(str(dataset_ref)+" is created")

Destination Dataset exists
talk2data-genai-sa.talk2data is created


### Enter your vector store

In [18]:
VECTOR_STORE=input("Enter which vector store you would like to use \"cloudsql-pgvector\"  or \"bigquery-vector\"")

## 💾 Save your Configuration 
We will save the configurations set in this notebook to avoid redundant parameter settings in the following notebooks. 
The information will be stored in the file `config.ini` in the root directory of this repository. 
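
Assigning to `config['SECTION']['KEY']` raises a `KeyError` if the section is not already present in the file. A minimal sketch of a defensive variant follows; `set_option` is a hypothetical helper, the values are placeholders, and an in-memory buffer stands in for the real `config.ini`.

```python
import configparser
import io


def set_option(config: configparser.ConfigParser, section: str, key: str, value) -> None:
    """Set an option, creating the section first if it does not exist yet."""
    if not config.has_section(section):
        config.add_section(section)
    config[section][key] = str(value)


demo_config = configparser.ConfigParser()
set_option(demo_config, "GCP", "PROJECT_ID", "my-project")  # placeholder value
set_option(demo_config, "CONFIG", "VECTOR_STORE", "cloudsql-pgvector")

buffer = io.StringIO()  # stands in for open('config.ini', 'w')
demo_config.write(buffer)
print(buffer.getvalue())
```

Note that `ConfigParser` lowercases option names by default, so the file shows `project_id`, while lookups such as `demo_config["GCP"]["PROJECT_ID"]` still work because they are case-insensitive.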

In [19]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))

import configparser
config = configparser.ConfigParser()
config.read(module_path+'/config.ini')

config['GCP']['PROJECT_ID'] = PROJECT_ID
config['CONFIG']['DATA_SOURCE'] = 'cloudsql-pg'
config['CONFIG']['VECTOR_STORE'] = VECTOR_STORE
config['PGCLOUDSQL']['PG_SCHEMA'] = PG_SCHEMA
config['PGCLOUDSQL']['PG_DATABASE'] = PG_DATABASE
config['PGCLOUDSQL']['PG_USER'] = PG_USER
config['PGCLOUDSQL']['PG_REGION'] = PG_REGION
config['PGCLOUDSQL']['PG_INSTANCE'] = PG_INSTANCE
config['PGCLOUDSQL']['PG_PASSWORD'] = PG_PASSWORD

config['BIGQUERY']['BQ_DATASET_REGION'] = BQ_DATASET_REGION
config['BIGQUERY']['BQ_TALK2DATA_DATASET_NAME'] = BQ_TALK2DATA_DATASET_NAME


with open(module_path+'/config.ini', 'w') as configfile:    # save
    config.write(configfile)

# **BigQuery Source Setup**

## ☁️ **1. Source from BigQuery Public Dataset**

Choose the table from the public dataset that you want to ask questions against. The cells below copy that table into a dataset in your own project.
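
As a compact sketch of the copy step: `Client.copy_table` also accepts fully qualified table IDs as plain strings, so the source and destination can be expressed as `project.dataset.table`. The helpers `table_id` and `copy_public_table` are hypothetical names, `my-project` is a placeholder, and the destination dataset is assumed to exist already.

```python
def table_id(project: str, dataset: str, table: str) -> str:
    """Return a fully qualified BigQuery table ID: 'project.dataset.table'."""
    return f"{project}.{dataset}.{table}"


def copy_public_table(client, source_id: str, destination_id: str):
    """Copy a table, overwriting the destination, and wait for completion."""
    # Imported lazily so table_id() can be used without the client library.
    from google.cloud import bigquery

    job_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")
    job = client.copy_table(source_id, destination_id, job_config=job_config)
    return job.result()  # blocks until the copy job finishes


src = table_id("bigquery-public-data", "fda_food", "food_enforcement")
dst = table_id("my-project", "fda_food", "food_enforcement")  # placeholder project
print(src, "->", dst)
```

With a configured client, `copy_public_table(bigquery.Client(project=PROJECT_ID), src, dst)` would run the copy.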



In [4]:
# Please fill in these values.
BQ_SRC_PROJECT = "bigquery-public-data"
BQ_SRC_DATASET = "fda_food"
BQ_SRC_TABLE = "food_enforcement"
BQ_SRC_REGION= "us"

BQ_DST_PROJECT = PROJECT_ID
BQ_DST_DATASET = "fda_food"
BQ_DST_TABLE = "food_enforcement"

In [5]:
def createBQDataset(bq_project_id, dataset_name,dataset_region):
    from google.cloud import bigquery
    import google.api_core 

    client=bigquery.Client(project=PROJECT_ID)

    dataset_ref = f"{bq_project_id}.{dataset_name}"

    try:
        client.get_dataset(dataset_ref)
        print("Destination Dataset exists")
    except google.api_core.exceptions.NotFound:
        print("Cannot find the dataset hence creating.......")
        dataset=bigquery.Dataset(dataset_ref)
        dataset.location=dataset_region
        client.create_dataset(dataset)
        
    return dataset_ref

def createBQTable(bq_project_id,dataset_name, table_name):
    from google.cloud import bigquery
    import google.api_core 

    client=bigquery.Client(project=PROJECT_ID)

    # Client.dataset() is deprecated; build the table reference directly.
    table_ref = bigquery.DatasetReference(bq_project_id, dataset_name).table(table_name)

    try:
        client.get_table(table_ref)
        print("Destination Table exists")

    except google.api_core.exceptions.NotFound:
        print("Cannot find the table hence creating.......")
        table = bigquery.Table(table_ref)
        client.create_table(table)

    return table_ref



In [6]:
# Create the destination table and copy the table data
from google.cloud import bigquery

client=bigquery.Client(project=PROJECT_ID)

dst_dataset_ref=createBQDataset(BQ_DST_PROJECT,BQ_DST_DATASET,BQ_SRC_REGION)

dst_table_ref=createBQTable(BQ_DST_PROJECT,BQ_DST_DATASET,BQ_DST_TABLE)

src_table_ref = bigquery.DatasetReference(BQ_SRC_PROJECT, BQ_SRC_DATASET).table(BQ_SRC_TABLE)

job_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")

copy_job = client.copy_table(src_table_ref, dst_table_ref, job_config=job_config)



# Wait for the job to complete and check for errors
copy_job.result()  


Destination Dataset exists
Destination Table exists


CopyJob<project=talk2data-genai-sa, location=US, id=b6f2b6de-066e-46db-a72a-d0bd4316e9b2>

##  **2. BigQuery Dataset Setup for Processing (Logs and Vector Embeddings for the BigQuery Vector Store)**

In the step above we set up our source data. Now we need to set up BigQuery for processing and operational work, such as storing embeddings and logs.



In [7]:
# @markdown Please fill in both the Google Cloud region and the name of your Dataset. Once filled in, run the cell.

# Please fill in these values.
BQ_TALK2DATA_DATASET_NAME = "talk2data" #@param {type:"string"}
BQ_DATASET_REGION = "us-central1" #@param {type:"string"}

from google.cloud import bigquery
import google.api_core 

client=bigquery.Client(project=PROJECT_ID)

dataset_ref=createBQDataset(PROJECT_ID,BQ_TALK2DATA_DATASET_NAME,BQ_DATASET_REGION)

print(str(dataset_ref)+" is created")

Destination Dataset exists
talk2data-genai-sa.talk2data is created


##  **Create a PostgreSQL Instance on Cloud SQL (if you are storing embeddings in the pgvector store)**

In [8]:
#@markdown Please fill in both the Google Cloud region and the name of your Cloud SQL instance. Once filled in, run the cell.

# Please fill in these values.
PG_REGION = "us-central1" #@param {type:"string"}
PG_INSTANCE = "domingo"
PG_PASSWORD = "vector123"


# Quick input validations.
assert PG_REGION, "⚠️ Please provide a Google Cloud region"
assert PG_INSTANCE, "⚠️ Please provide the name of your instance"

# Check whether the Cloud SQL instance already exists in the current project
database_version = !gcloud sql instances describe {PG_INSTANCE} --format="value(databaseVersion)"
# Guard against empty output before indexing into it
if database_version and database_version[0].startswith("POSTGRES"):
  print("Found existing Postgres Cloud SQL Instance!")
else:
  print("Creating new Cloud SQL instance...")
  !gcloud sql instances create {PG_INSTANCE} --database-version=POSTGRES_15 \
    --region={PG_REGION} --cpu=1 --memory=4GB --root-password={PG_PASSWORD} \
    --database-flags=cloudsql.iam_authentication=On

Found existing Postgres Cloud SQL Instance!


In [9]:
# Customize the values below if needed (the PostgreSQL defaults are noted in the trailing comments)

PG_SCHEMA = 'talk2data'   #default: 'public'
PG_DATABASE = 'talk2data'    #default: 'postgres'
PG_USER = 'vector_user'    #default: 'postgres'


!gcloud sql databases create  {PG_DATABASE} --instance={PG_INSTANCE}

!gcloud sql users create {PG_USER} \
--instance={PG_INSTANCE} \
--password={PG_PASSWORD}


ERROR: (gcloud.sql.databases.create) HTTPError 400: Invalid request: failed to create database talk2data. Detail: pq: database "talk2data" already exists.
Creating Cloud SQL user...done.                                                
Created user [vector_user].


### Enter your vector store

In [10]:
VECTOR_STORE=input("Enter which vector store you would like to use \"cloudsql-pgvector\"  or \"bigquery-vector\"")

## 💾 Save your Configuration 
We will save the configurations set in this notebook to avoid redundant parameter settings in the following notebooks. 
The information will be stored in the file `config.ini` in the root directory of this repository. 
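
For context, a later notebook could read these settings back roughly as sketched below. The INI content is a hypothetical sample shaped like the sections this notebook writes, `read_settings` is an illustrative helper, and the fallback values are assumptions rather than defaults defined by the repository.

```python
import configparser

# Hypothetical config.ini content, for illustration only.
SAMPLE_INI = """
[GCP]
project_id = my-project

[CONFIG]
data_source = bigquery
vector_store = bigquery-vector

[BIGQUERY]
bq_dataset_region = us-central1
"""


def read_settings(ini_text: str) -> dict:
    """Return a few settings a downstream notebook might need, with fallbacks."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        "project_id": parser.get("GCP", "PROJECT_ID"),
        "data_source": parser.get("CONFIG", "DATA_SOURCE"),
        "vector_store": parser.get("CONFIG", "VECTOR_STORE", fallback="cloudsql-pgvector"),
        # The PGCLOUDSQL section is absent in the sample, so the fallback is used.
        "pg_schema": parser.get("PGCLOUDSQL", "PG_SCHEMA", fallback="public"),
    }


print(read_settings(SAMPLE_INI))
```

Using `fallback=` keeps the reader working when the pgvector settings were never written, for example in a BigQuery-only setup.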

In [13]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))

import configparser
config = configparser.ConfigParser()
config.read(module_path+'/config.ini')

config['GCP']['PROJECT_ID'] = PROJECT_ID
config['CONFIG']['DATA_SOURCE'] = 'bigquery'
config['BIGQUERY']['BQ_DATASET_REGION'] = BQ_DATASET_REGION
config['BIGQUERY']['BQ_DATASET_NAME'] = BQ_DST_DATASET
config['BIGQUERY']['BQ_TALK2DATA_DATASET_NAME'] = BQ_TALK2DATA_DATASET_NAME

# Ignore the settings below if you do not intend to use the pgvector store
config['PGCLOUDSQL']['PG_SCHEMA'] = PG_SCHEMA
config['PGCLOUDSQL']['PG_DATABASE'] = PG_DATABASE
config['PGCLOUDSQL']['PG_USER'] = PG_USER
config['PGCLOUDSQL']['PG_REGION'] = PG_REGION
config['PGCLOUDSQL']['PG_INSTANCE'] = PG_INSTANCE
config['PGCLOUDSQL']['PG_PASSWORD'] = PG_PASSWORD


with open(module_path+'/config.ini', 'w') as configfile:    # save
    config.write(configfile)