# Couchbase Tutorial: Vector Search

## In this tutorial, we'll cover:
* Installing required packages
* Importing necessary libraries
* Configuring Couchbase connection and Google Gemini client
* Creating a function to generate embeddings
* Inserting sample documents with embeddings
* Creating a vector search index (note: this step is typically done through the Couchbase UI or REST API)
* Performing a vector search
* Displaying search results
* Cleaning up (optional)

To use this tutorial:
* Replace 'YOUR_GEMINI_API_KEY' with your actual Google Gemini API key.

## Install Required Packages

In [4]:
%pip install couchbase google-generativeai numpy
%pip install ipywidgets --upgrade
!jupyter nbextension enable --py widgetsnbextension --sys-prefix

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
usage: jupyter [-h] [--version] [--config-dir] [--data-dir] [--runtime-dir]
               [--paths] [--json] [--debug]
               [subcommand]

Jupyter: Interactive Computing

positional arguments:
  subcommand     the subcommand to launch

options:
  -h, --help     show this help message and exit
  --version      show the versions of core jupyter packages and exit
  --config-dir   show Jupyter config dir
  --data-dir     show Jupyter data dir
  --runtime-dir  show Jupyter runtime dir
  --paths        show all Jupyter paths. Add --json for machine-readable
                 format.
  --json         output paths as machine-readable json
  --debug        output debug information about paths

Available subcommands: dejavu events execute kernel kernelspec lab
labextension labhub migrate nbclassic nbconvert notebook run server
troubleshoot trust

Jupyter co

## Set The Environment Variables

This is typically done in the polytope.yml file. Here we initiate the environment variables to simulate that they were already set by Polytope. 

In [1]:
%env GEMINI_API_KEY=AIzaSyCDlC_Lf3cJRzXKEua972-TUBxb0KAvz50
%env COUCHBASE_USERNAME=admin
%env COUCHBASE_PASSWORD=password
%env COUCHBASE_BUCKET_NAME=main
%env COUCHBASE_URL=couchbase://couchbase

env: GEMINI_API_KEY=AIzaSyCDlC_Lf3cJRzXKEua972-TUBxb0KAvz50
env: COUCHBASE_USERNAME=admin
env: COUCHBASE_PASSWORD=password
env: COUCHBASE_BUCKET_NAME=main
env: COUCHBASE_URL=couchbase://couchbase


## Module Import And Client Setup

In [2]:
import os
import uuid
from datetime import timedelta
import couchbase.subdocument as SD
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster, ClusterOptions
from couchbase.options import (ClusterOptions, ClusterTimeoutOptions, QueryOptions, MutateInOptions)
from couchbase.search import SearchQuery, QueryStringQuery
from couchbase.vector_search import VectorQuery, VectorSearch
import google.generativeai as genai
import numpy as np

COUCHBASE_USERNAME = os.environ['COUCHBASE_USERNAME']
COUCHBASE_PASSWORD = os.environ['COUCHBASE_PASSWORD']
COUCHBASE_BUCKET_NAME = os.environ['COUCHBASE_BUCKET_NAME']
COUCHBASE_URL = os.environ['COUCHBASE_URL']
# NOTE: For TLS/SSL connection use 'couchbases://' instead
GEMINI_API_KEY = os.environ['GEMINI_API_KEY']

auth = PasswordAuthenticator(
    COUCHBASE_USERNAME,
    COUCHBASE_PASSWORD
)
cluster = Cluster(COUCHBASE_URL, ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))
bucket = cluster.bucket(COUCHBASE_BUCKET_NAME)

genai.configure(api_key=GEMINI_API_KEY)
model = genai.GenerativeModel('gemini-pro')

## Prepare The Example Scope And Collections

In [3]:
collection_manager = bucket.collections()

def create_scope(scope_name):
    try:
        collection_manager.create_scope(scope_name)
        print(f"Scope '{scope_name}' created successfully.")
    except ScopeAlreadyExistsException:
        print(f"Scope '{scope_name}' already exists.")
    except Exception as e:
        print(f"An error occurred while creating scope: {e}")

def create_collections(scope_name, collection_names):
    for collection_name in collection_names:
        try:
            collection_manager.create_collection(CollectionSpec(collection_name, scope_name=scope_name))
            print(f"Collection '{collection_name}' created successfully in scope '{scope_name}'.")
        except CollectionAlreadyExistsException:
            print(f"Collection '{collection_name}' already exists in scope '{scope_name}'.")
        except Exception as e:
            print(f"An error occurred while creating collection '{collection_name}': {e}")

def create_data_structure(spec):
    for scope_name, collection_names in spec.items():
        create_scope(scope_name)
        create_collections(scope_name, collection_names)        

scope_name = "cillers_play"
collection_name_products = "products"

data_structure_spec = { 
    scope_name: [
        collection_name_products
    ]
}

create_scope_and_collections(data_structure_spec)

{'cillers_play': ['products']}


## Load Documents To Search For

In [None]:
insert_products_query = f"""
INSERT INTO `{bucket.name}`.`{scope_name}`.`{collection_name_products}` (KEY, VALUE)
VALUES 
    ("product_" || UUID(), {{
        "name": "Smartphone X", 
        "type": "Electronics", 
        "price": 699.99, 
        "details": {{
            "color": "Midnight Blue",
            "storage": "128GB",
            "screen_size": "6.1 inches"
        }},
        "tags": ["smartphone", "5G", "high-resolution camera", "water-resistant"],
        "description": "A powerful smartphone with 5G capability and a high-resolution camera."
    }}),
    ("product_" || UUID(), {{
        "name": "Laptop Pro", 
        "type": "Electronics", 
        "price": 1299.99, 
        "details": {{
            "color": "Silver",
            "processor": "Intel i7",
            "ram": "16GB"
        }},
        "tags": ["laptop", "high-performance", "lightweight", "long battery life"],
        "description": "A powerful and lightweight laptop with long battery life, perfect for professionals on the go."
    }}),
    ("product_" || UUID(), {{
        "name": "Smart TV 4K", 
        "type": "Electronics", 
        "price": 799.99, 
        "details": {{
            "color": "Black",
            "screen_size": "55 inches",
            "resolution": "4K"
        }},
        "tags": ["TV", "4K", "smart", "HDR", "voice control"],
        "description": "A smart 4K TV with HDR and voice control for an immersive viewing experience."
    }}),
    ("product_" || UUID(), {{
        "name": "Wireless Earbuds", 
        "type": "Electronics", 
        "price": 159.99, 
        "details": {{
            "color": "White",
            "battery_life": "24 hours",
            "connectivity": "Bluetooth 5.0"
        }},
        "tags": ["earbuds", "wireless", "noise-cancelling", "water-resistant"],
        "description": "Lightweight and water-resistant wireless earbuds with excellent noise-cancelling capabilities."
    }}),
    ("product_" || UUID(), {{
        "name": "Coffee Maker Deluxe", 
        "type": "Appliances", 
        "price": 129.99, 
        "details": {{
            "color": "Stainless Steel",
            "capacity": "12 cups",
            "features": "Programmable"
        }},
        "tags": ["coffee maker", "programmable", "thermal carafe"],
        "description": "A programmable coffee maker with a thermal carafe to keep your coffee hot for hours."
    }})
RETURNING *
"""
insert_products_result = cluster.query(insert_products_query)
print("Inserted products:")
for row in insert_products_result:
    print(json.dumps(row, indent=2))

## Function To Generate Embeddings

In [None]:
def generate_embedding(text):
    response = model.generate_content(text, stream=False)
    embedding = response.candidates[0].content.parts[0].text
    return np.array(eval(embedding))

## Add Embeddings To The Documents

In [None]:
def add_embedding_to_product(product):
    # Combine description and tags for embedding
    combined_text = product['description'] + " " + " ".join(product['tags'])
    embedding = generate_embedding(combined_text)
    product['embedding'] = embedding.tolist()
    collection.upsert(product['id'], product)

products_query = f"SELECT META().id, * FROM `{bucket.name}`.`{scope_name}`.`{collection_name_products}`"
for product in cluster.query(products_query):
    add_embedding_to_product(product)

print("Embeddings added to products.")

## Create a vector search index

In [None]:
create_index_query = f"""
CREATE SEARCH INDEX IF NOT EXISTS product_vector_index 
ON `{bucket.name}`.`{scope_name}`.`{collection_name_products}`(
    embedding AS {{"type": "vector", "dimensions": 768}}
) USING FTS WITH {{"similarity": "cosine"}}
"""

try:
    cluster.query(create_index_query)
    print("Vector search index created successfully (or already exists).")
except Exception as e:
    print(f"Error creating index: {e}")

## Define Vector Search Function

In [None]:
def vector_search(query_text, top_k=3):
    query_embedding = generate_embedding(query_text)
    
    vector_query = VectorQuery("embedding", query_embedding.tolist(), top_k)
    search_query = SearchQuery(vector_query, index_name="vector_index")
    
    results = cluster.search_query("vector_index", search_query)
    
    return results

## Perform A Vector Search

In [None]:
search_results = vector_search("What is the meaning of life?")

print("Search Results:")
for result in search_results:
    doc = collection.get(result.id).content_as[dict]
    print(f"ID: {result.id}")
    print(f"Text: {doc['text']}")
    print(f"Score: {result.score}")
    print("---")


In [4]:
"""
Install the Google AI Python SDK

$ pip install google-generativeai
"""

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Create the model
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-flash",
  generation_config=generation_config,
  # safety_settings = Adjust safety settings
  # See https://ai.google.dev/gemini-api/docs/safety-settings
)

chat_session = model.start_chat(
  history=[
    {
      "role": "user",
      "parts": [
        "get embedding for a string",
      ],
    },
    {
      "role": "model",
      "parts": [
        "```python\nfrom sentence_transformers import SentenceTransformer, util\n\n# Load a pre-trained sentence transformer model\nmodel = SentenceTransformer('paraphrase-distilroberta-base-v1')\n\n# Define your string\ntext = \"This is a sample sentence.\"\n\n# Get the embedding\nembedding = model.encode(text)\n\n# Print the embedding (a list of floats)\nprint(embedding)\n```\n\n**Explanation:**\n\n1. **Import necessary libraries:**\n   - `sentence_transformers` for loading pre-trained models and generating embeddings.\n   - `util` for potential use of similarity functions later.\n\n2. **Load a pre-trained model:**\n   - `SentenceTransformer('paraphrase-distilroberta-base-v1')` loads a model specifically designed for sentence embedding. You can choose other models based on your task and desired performance.\n\n3. **Define your input string:**\n   - `text = \"This is a sample sentence.\"` sets the string you want to embed.\n\n4. **Generate the embedding:**\n   - `embedding = model.encode(text)` uses the loaded model to convert the string into a numerical representation (embedding).\n\n5. **Print the embedding:**\n   - `print(embedding)` displays the embedding as a list of floats.\n\n**Important Notes:**\n\n- **Choose the right model:** The model you select (`paraphrase-distilroberta-base-v1` in this case) impacts the embedding's quality and suitability for your specific application.\n- **Embedding size:** The resulting embedding will have a specific length (e.g., 768 dimensions for the chosen model), depending on the model architecture.\n- **Similarity calculations:** You can use the generated embeddings for tasks like:\n    - **Similarity search:** Find similar text by comparing embeddings using cosine similarity or other metrics.\n    - **Clustering:** Group similar text by their embeddings.\n    - **Classification:** Train a classifier using embeddings as input features.\n\nThis code provides a basic example. You can modify it by:\n\n- Using different pre-trained models from the `sentence_transformers` library.\n- Processing multiple strings simultaneously.\n- Performing further analysis or computations using the obtained embeddings. \n",
      ],
    },
  ]
)

response = chat_session.send_message("INSERT_INPUT_HERE")

print(response.text)

Please provide me with the input you want me to use. I need to know what you want to do with the text embedding. 

For example, you could tell me:

* **"I want to get the embedding for the string 'The quick brown fox jumps over the lazy dog'."**  
* **"Give me the embedding for the sentence 'This is a sample sentence'."**
* **"I want to find the embedding for a list of strings, including 'Hello, world!' and 'How are you?'"**

Once you provide me with the input, I can generate the embedding for you and explain how it works. 

