# UNDER CONSTRUCTION: Couchbase Tutorial: Vector Search With Google AI

## In this tutorial, we'll cover:
* Installing required packages
* Importing necessary libraries
* Configuring Couchbase connection and Google Gemini client
* Creating a function to generate embeddings
* Inserting sample documents with embeddings
* Creating a vector search index (note: this step is typically done through the Couchbase UI or REST API)
* Performing a vector search
* Displaying search results
* Cleaning up (optional)

To use this tutorial:
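* Set the `GEMINI_API_KEY` environment variable to your Google Gemini API key.
* Set `COUCHBASE_USERNAME`, `COUCHBASE_PASSWORD`, `COUCHBASE_BUCKET_NAME`, and `COUCHBASE_URL` for your Couchbase cluster.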
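
The cells in this tutorial read the Gemini API key and the Couchbase connection settings from environment variables. A minimal sketch for failing early when one of them is missing; the helper name `missing_env_vars` is an assumption made for illustration and is not part of the tutorial code:

```python
import os

# Variable names taken from the cells in this tutorial
REQUIRED_VARS = [
    "GEMINI_API_KEY",
    "COUCHBASE_USERNAME",
    "COUCHBASE_PASSWORD",
    "COUCHBASE_BUCKET_NAME",
    "COUCHBASE_URL",
]

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Running this before the setup cells gives a clearer message than the `KeyError` that `os.environ[...]` would otherwise raise.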

## Install Packages

In [1]:
%pip install --upgrade --quiet couchbase numpy ipywidgets google-generativeai

Note: you may need to restart the kernel to use updated packages.


## Set Environment Variables

In [12]:
import env
env.load()
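
The `env` module is a local helper that is not shown in this notebook. A minimal sketch of what such a loader could look like, assuming it reads `KEY=VALUE` lines from a `.env` file; the file name, the function signature, and the precedence rule are all assumptions, and the real module may behave differently:

```python
import os

def load(path=".env", environ=None):
    """Copy KEY=VALUE lines from a file into the environment.

    Hypothetical stand-in for the tutorial's `env.load()`.
    """
    environ = os.environ if environ is None else environ
    if not os.path.exists(path):
        return environ
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Values already set in the environment take precedence
            environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
    return environ
```

Letting existing values win means a key exported in the shell is not silently overwritten by the file.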

In [14]:
"""
Install the Google AI Python SDK

$ pip install google-generativeai
"""

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Create the model
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-flash",
  generation_config=generation_config,
  # safety_settings = Adjust safety settings
  # See https://ai.google.dev/gemini-api/docs/safety-settings
)

chat_session = model.start_chat(
  history=[
    {
      "role": "user",
      "parts": [
        "what is the opposite of hot\n",
      ],
    },
    {
      "role": "model",
      "parts": [
        "The opposite of hot depends on the context:\n\n**Temperature:**\n\n* **Cold** is the direct opposite of hot in terms of temperature.\n\n**Other contexts:**\n\n* **Cold** can also be used as an opposite of hot in a figurative sense, like when describing food (hot vs. cold pizza).\n* **Cool** can be used as a synonym for \"not hot\" in some contexts, particularly in relation to style or fashion.\n* **Mild** or **tepid** can describe something that is not hot or cold, but somewhere in between.\n* **Boring** or **uninteresting** can be used as the opposite of \"hot\" when referring to something exciting or attractive.\n\nThe best opposite for \"hot\" depends on the specific situation and meaning. \n",
      ],
    },
  ]
)

response = chat_session.send_message("INSERT_INPUT_HERE")

print(response.text)

DefaultCredentialsError: 
  No API_KEY or ADC found. Please either:
    - Set the `GOOGLE_API_KEY` environment variable.
    - Manually pass the key with `genai.configure(api_key=my_api_key)`.
    - Or set up Application Default Credentials, see https://ai.google.dev/gemini-api/docs/oauth for more information.

## Module Import And Client Setup

In [13]:
import os
import uuid
import json
from datetime import timedelta
import couchbase.subdocument as SD
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import (ClusterOptions, ClusterTimeoutOptions, QueryOptions, MutateInOptions)
from couchbase.exceptions import ScopeAlreadyExistsException, CollectionAlreadyExistsException
from couchbase.management.collections import CreateCollectionSettings
from couchbase.search import SearchQuery, QueryStringQuery
from couchbase.vector_search import VectorQuery, VectorSearch
import google.generativeai as genai

print("test")

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel('gemini-1.5-flash')
print("test")

response = model.generate_content("The opposite of hot is")
print(response)

COUCHBASE_USERNAME = os.environ['COUCHBASE_USERNAME']
COUCHBASE_PASSWORD = os.environ['COUCHBASE_PASSWORD']
COUCHBASE_BUCKET_NAME = os.environ['COUCHBASE_BUCKET_NAME']
COUCHBASE_URL = os.environ['COUCHBASE_URL']

auth = PasswordAuthenticator(
    COUCHBASE_USERNAME,
    COUCHBASE_PASSWORD
)
cluster = Cluster(COUCHBASE_URL, ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))
bucket = cluster.bucket(COUCHBASE_BUCKET_NAME)

scope_name = "cillers_play"
collection_name_products = "products"

create_index_query = f"""
CREATE SEARCH INDEX IF NOT EXISTS product_vector_index
ON `{bucket.name}`.`{scope_name}`.`{collection_name_products}`(
    embedding AS {{"type": "vector", "dimensions": 768}}
) USING FTS WITH {{"similarity": "cosine"}}
"""

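try:
    # cluster.query() is lazy: the statement is only sent once the result is
    # consumed, so call execute() to surface any server-side error here.
    cluster.query(create_index_query).execute()
    print("Vector search index created successfully (or already exists).")
except Exception as e:
    # If the server rejects this statement, create the index through the
    # Couchbase UI or REST API instead, as noted in the outline.
    print(f"Error creating index: {e}")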




test
test


DefaultCredentialsError: 
  No API_KEY or ADC found. Please either:
    - Set the `GOOGLE_API_KEY` environment variable.
    - Manually pass the key with `genai.configure(api_key=my_api_key)`.
    - Or set up Application Default Credentials, see https://ai.google.dev/gemini-api/docs/oauth for more information.

## Prepare The Data Structure

In [None]:
import data_structure_couchbase

data_structure_spec = {
    scope_name: [
        collection_name_products
    ]
}

data_structure_couchbase.create(bucket, data_structure_spec)
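
`data_structure_couchbase` is a local helper that is not shown in this notebook. A rough sketch of what its `create` function might do, assuming it creates any scope or collection in the spec that does not exist yet. It relies on the bucket's collection manager and assumes an SDK version whose `create_collection` accepts a scope name and a collection name; the real helper may differ:

```python
def create(bucket, spec):
    """Create every scope and collection named in `spec` that does not exist yet.

    `spec` maps a scope name to a list of collection names. Hypothetical
    stand-in for `data_structure_couchbase.create`.
    """
    manager = bucket.collections()
    # Index what already exists: scope name -> set of collection names
    existing = {
        scope.name: {c.name for c in scope.collections}
        for scope in manager.get_all_scopes()
    }
    for scope_name, collection_names in spec.items():
        if scope_name not in existing:
            manager.create_scope(scope_name)
            existing[scope_name] = set()
        for collection_name in collection_names:
            if collection_name not in existing[scope_name]:
                # Assumes create_collection(scope_name, collection_name)
                manager.create_collection(scope_name, collection_name)
                existing[scope_name].add(collection_name)
```

Checking what exists first keeps the cell safe to re-run without catching "already exists" exceptions.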

## Load Documents To Search For

In [9]:
insert_products_query = f"""
INSERT INTO `{bucket.name}`.`{scope_name}`.`{collection_name_products}` (KEY, VALUE)
VALUES 
    ("product_" || UUID(), {{
        "name": "Smartphone X", 
        "type": "Electronics", 
        "price": 699.99, 
        "details": {{
            "color": "Midnight Blue",
            "storage": "128GB",
            "screen_size": "6.1 inches"
        }},
        "tags": ["smartphone", "5G", "high-resolution camera", "water-resistant"],
        "description": "A powerful smartphone with 5G capability and a high-resolution camera."
    }}),
    ("product_" || UUID(), {{
        "name": "Laptop Pro", 
        "type": "Electronics", 
        "price": 1299.99, 
        "details": {{
            "color": "Silver",
            "processor": "Intel i7",
            "ram": "16GB"
        }},
        "tags": ["laptop", "high-performance", "lightweight", "long battery life"],
        "description": "A powerful and lightweight laptop with long battery life, perfect for professionals on the go."
    }}),
    ("product_" || UUID(), {{
        "name": "Smart TV 4K", 
        "type": "Electronics", 
        "price": 799.99, 
        "details": {{
            "color": "Black",
            "screen_size": "55 inches",
            "resolution": "4K"
        }},
        "tags": ["TV", "4K", "smart", "HDR", "voice control"],
        "description": "A smart 4K TV with HDR and voice control for an immersive viewing experience."
    }}),
    ("product_" || UUID(), {{
        "name": "Wireless Earbuds", 
        "type": "Electronics", 
        "price": 159.99, 
        "details": {{
            "color": "White",
            "battery_life": "24 hours",
            "connectivity": "Bluetooth 5.0"
        }},
        "tags": ["earbuds", "wireless", "noise-cancelling", "water-resistant"],
        "description": "Lightweight and water-resistant wireless earbuds with excellent noise-cancelling capabilities."
    }}),
    ("product_" || UUID(), {{
        "name": "Coffee Maker Deluxe", 
        "type": "Appliances", 
        "price": 129.99, 
        "details": {{
            "color": "Stainless Steel",
            "capacity": "12 cups",
            "features": "Programmable"
        }},
        "tags": ["coffee maker", "programmable", "thermal carafe"],
        "description": "A programmable coffee maker with a thermal carafe to keep your coffee hot for hours."
    }})
RETURNING *
"""
insert_products_result = cluster.query(insert_products_query)
print("Inserted products:")
for row in insert_products_result:
    print(json.dumps(row, indent=2))

Inserted products:
{
  "products": {
    "description": "A powerful smartphone with 5G capability and a high-resolution camera.",
    "details": {
      "color": "Midnight Blue",
      "screen_size": "6.1 inches",
      "storage": "128GB"
    },
    "name": "Smartphone X",
    "price": 699.99,
    "tags": [
      "smartphone",
      "5G",
      "high-resolution camera",
      "water-resistant"
    ],
    "type": "Electronics"
  }
}
{
  "products": {
    "description": "A powerful and lightweight laptop with long battery life, perfect for professionals on the go.",
    "details": {
      "color": "Silver",
      "processor": "Intel i7",
      "ram": "16GB"
    },
    "name": "Laptop Pro",
    "price": 1299.99,
    "tags": [
      "laptop",
      "high-performance",
      "lightweight",
      "long battery life"
    ],
    "type": "Electronics"
  }
}
{
  "products": {
    "description": "A smart 4K TV with HDR and voice control for an immersive viewing experience.",
    "details": {
    

## Add Embeddings To The Documents

In [8]:
collection = bucket.scope(scope_name).collection(collection_name_products)

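def add_embedding_to_product(doc_id, product):
    # Embed the description together with the tags
    text = product.get('description', '') + " " + " ".join(product.get('tags', []))
    # Assumption: Gemini's text-embedding-004 model, whose 768-dimensional
    # output matches the "dimensions": 768 setting of the search index
    result = genai.embed_content(model="models/text-embedding-004", content=text)
    product['embedding'] = result['embedding']
    collection.upsert(doc_id, product)

# `SELECT META().id, *` returns each document nested under the collection name
products_query = f"SELECT META().id, * FROM `{bucket.name}`.`{scope_name}`.`{collection_name_products}`"
for row in cluster.query(products_query):
    add_embedding_to_product(row['id'], row[collection_name_products])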

print("Embeddings added to products.")

Embeddings added to products.
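
The cell above embeds each product's description together with its tags, and the index definition expects 768-dimensional vectors. A small sketch that separates the text assembly from the embedding call and checks the vector length before a document is stored; the helper names `build_embedding_text` and `check_embedding` are assumptions made for illustration:

```python
EXPECTED_DIMENSIONS = 768  # matches "dimensions": 768 in the index definition

def build_embedding_text(product):
    """Join the description and tags into the text that gets embedded."""
    description = product.get("description", "")
    tags = product.get("tags", [])
    return (description + " " + " ".join(tags)).strip()

def check_embedding(vector, expected=EXPECTED_DIMENSIONS):
    """Raise ValueError if the vector length does not match the index."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
    return vector

product = {"description": "A smart 4K TV.", "tags": ["TV", "4K"]}
print(build_embedding_text(product))  # A smart 4K TV. TV 4K
```

A length check like this catches a mismatch between the embedding model and the index before any document is written.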


## Create a vector search index

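The setup cell above attempts to create the index `product_vector_index`, which indexes the `embedding` field as a 768-dimensional vector with cosine similarity. The index can also be created through the Couchbase UI or REST API, and it must exist before a vector search is run.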


## Define Vector Search Function

In [10]:
from couchbase.search import SearchRequest

def vector_search(query_text, top_k=3):
    # Generate the query embedding.
    # Assumption: Gemini's text-embedding-004 model (768 dimensions); use the
    # same model that produced the document embeddings.
    query_embedding = genai.embed_content(
        model="models/text-embedding-004",
        content=query_text,
    )["embedding"]

    # Create a vector query against the "embedding" field
    vector_query = VectorQuery("embedding", query_embedding, num_candidates=top_k)

    # Wrap the vector query in a search request
    search_request = SearchRequest.create(VectorSearch.from_vector_query(vector_query))

    # Execute the request on the cluster, passing the index name separately
    return cluster.search("product_vector_index", search_request)

## Perform A Vector Search

In [11]:
search_results = vector_search("dark mobile")

print("Search Results:")
for row in search_results.rows():
    doc = collection.get(row.id).content_as[dict]
    print(f"ID: {row.id}")
    print(f"Name: {doc['name']}")
    print(f"Description: {doc['description']}")
    print(f"Score: {row.score}")
    print("---")
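
The index definition in this tutorial requests cosine similarity. For a handful of documents, the same kind of ranking can be computed exhaustively in plain Python, which is a useful sanity check on what the index returns. This is a sketch for illustration only: the function names are assumptions, and a search index may use approximate nearest-neighbor search rather than this exact scan.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, chosen here for simplicity
    return dot / (norm_a * norm_b)

def brute_force_top_k(query_vector, documents, top_k=3):
    """Return (doc_id, score) pairs for the top_k most similar documents.

    `documents` maps a document ID to its embedding vector.
    """
    scored = (
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in documents.items()
    )
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])

# Toy 2-dimensional vectors, for illustration only
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(brute_force_top_k([1.0, 0.1], docs, top_k=2))
```

With the toy vectors above, document `a` ranks first and `c` second, because the query points almost entirely along the first axis.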

In [4]:
"""
Install the Google AI Python SDK

$ pip install google-generativeai
"""

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Create the model
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-flash",
  generation_config=generation_config,
  # safety_settings = Adjust safety settings
  # See https://ai.google.dev/gemini-api/docs/safety-settings
)

chat_session = model.start_chat(
  history=[
    {
      "role": "user",
      "parts": [
        "get embedding for a string",
      ],
    },
    {
      "role": "model",
      "parts": [
        "```python\nfrom sentence_transformers import SentenceTransformer, util\n\n# Load a pre-trained sentence transformer model\nmodel = SentenceTransformer('paraphrase-distilroberta-base-v1')\n\n# Define your string\ntext = \"This is a sample sentence.\"\n\n# Get the embedding\nembedding = model.encode(text)\n\n# Print the embedding (a list of floats)\nprint(embedding)\n```\n\n**Explanation:**\n\n1. **Import necessary libraries:**\n   - `sentence_transformers` for loading pre-trained models and generating embeddings.\n   - `util` for potential use of similarity functions later.\n\n2. **Load a pre-trained model:**\n   - `SentenceTransformer('paraphrase-distilroberta-base-v1')` loads a model specifically designed for sentence embedding. You can choose other models based on your task and desired performance.\n\n3. **Define your input string:**\n   - `text = \"This is a sample sentence.\"` sets the string you want to embed.\n\n4. **Generate the embedding:**\n   - `embedding = model.encode(text)` uses the loaded model to convert the string into a numerical representation (embedding).\n\n5. **Print the embedding:**\n   - `print(embedding)` displays the embedding as a list of floats.\n\n**Important Notes:**\n\n- **Choose the right model:** The model you select (`paraphrase-distilroberta-base-v1` in this case) impacts the embedding's quality and suitability for your specific application.\n- **Embedding size:** The resulting embedding will have a specific length (e.g., 768 dimensions for the chosen model), depending on the model architecture.\n- **Similarity calculations:** You can use the generated embeddings for tasks like:\n    - **Similarity search:** Find similar text by comparing embeddings using cosine similarity or other metrics.\n    - **Clustering:** Group similar text by their embeddings.\n    - **Classification:** Train a classifier using embeddings as input features.\n\nThis code provides a basic example. You can modify it by:\n\n- Using different pre-trained models from the `sentence_transformers` library.\n- Processing multiple strings simultaneously.\n- Performing further analysis or computations using the obtained embeddings. \n",
      ],
    },
  ]
)

response = chat_session.send_message("INSERT_INPUT_HERE")

print(response.text)

Please provide me with the input you want me to use. I need to know what you want to do with the text embedding. 

For example, you could tell me:

* **"I want to get the embedding for the string 'The quick brown fox jumps over the lazy dog'."**  
* **"Give me the embedding for the sentence 'This is a sample sentence'."**
* **"I want to find the embedding for a list of strings, including 'Hello, world!' and 'How are you?'"**

Once you provide me with the input, I can generate the embedding for you and explain how it works. 

