<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/drive/1RA6fwwFHGNm3MMvPhbFSW6zqTpK_x5Ga?usp=sharing"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Overview

This example demonstrates how to use the Gemini API to create embeddings so that you can perform document search. You will use the Python client library to build embeddings that let you compare search strings, or questions, to document contents.

In this tutorial, you'll use embeddings to search a small set of documents, each a short film synopsis, and ask questions about them.


## Setup

In [None]:
!pip install -U -q google-generativeai

In [6]:
import textwrap
import numpy as np
import pandas as pd

import google.generativeai as genai
import google.ai.generativelanguage as glm

# Used to securely store your API key
from google.colab import userdata

from IPython.display import Markdown

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see the [Authentication](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb) quickstart for an example.

In [7]:
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

## Building an embeddings database

Here are three sample texts to use to build the embeddings database. You will use the Gemini API to create an embedding for each document.

In [10]:
DOCUMENT1 = {
    "title": "Baahubali: The Beginning (2015)",
    "content": "Directed by S.S. Rajamouli, this epic action film is set in the ancient kingdom of Mahishmati. It tells the story of two brothers, Amarendra Baahubali and Bhallaladeva, who compete for the throne. Amarendra is the rightful heir and beloved by the people, but Bhallaladeva plots against him with his father. The story unfolds with grand battle scenes, intricate politics, and a dramatic revelation about Amarendra's son, who seeks to avenge his father's betrayal and reclaim the throne."}
DOCUMENT2 = {
    "title": "Ala Vaikunthapurramuloo (2020)",
    "content": "This action-drama directed by Trivikram Srinivas centers around Bantu, a middle-class man who learns that he was swapped at birth with the son of a wealthy businessman. After discovering his real parentage, Bantu enters the world of affluence and decides to improve the dysfunctional family's relationships while facing his usurper, who is intent on undermining him. The film blends comedy, action, and family drama, featuring standout music and dance sequences."}
DOCUMENT3 = {
    "title": "Arjun Reddy (2017)",
    "content": "Directed by Sandeep Reddy Vanga, this intense romantic drama follows Arjun Reddy, a brilliant but self-destructive surgeon who spirals into self-destruction after his girlfriend is forced to marry another man. The film delves deeply into themes of love, anger, and redemption as Arjun struggles with his uncontrollable temper and the consequences of his actions on his personal and professional life. It's noted for its raw portrayal of emotion and its departure from traditional romantic storytelling."}

documents = [DOCUMENT1, DOCUMENT2, DOCUMENT3]

Organize the list of dictionaries into a dataframe for better visualization.

In [11]:
df = pd.DataFrame(documents)
df.columns = ['Title', 'Text']
df

Index | Title | Text
--- | --- | ---
0 | Baahubali: The Beginning (2015) | Directed by S.S. Rajamouli, this epic action f...
1 | Ala Vaikunthapurramuloo (2020) | This action-drama directed by Trivikram Sriniv...
2 | Arjun Reddy (2017) | Directed by Sandeep Reddy Vanga, this intense ...


Get the embeddings for each of these bodies of text. Add this information to the dataframe.

In [14]:
model = 'models/embedding-001'

In [15]:
# Get the embeddings of each text and add to an embeddings column in the dataframe
def embed_fn(title, text):
  return genai.embed_content(model=model,
                             content=text,
                             task_type="retrieval_document",
                             title=title)["embedding"]

df['Embeddings'] = df.apply(lambda row: embed_fn(row['Title'], row['Text']), axis=1)
df

Index | Title | Text | Embeddings
--- | --- | --- | ---
0 | Baahubali: The Beginning (2015) | Directed by S.S. Rajamouli, this epic action f... | [0.04053588, -0.015484576, -0.026146438, -0.01...
1 | Ala Vaikunthapurramuloo (2020) | This action-drama directed by Trivikram Sriniv... | [0.03625944, 0.008953982, -0.056172673, 0.0759...
2 | Arjun Reddy (2017) | Directed by Sandeep Reddy Vanga, this intense ... | [0.036841977, -0.012119846, 0.0043612146, 0.04...


## Document search with Q&A

Now that the embeddings are generated, let's create a Q&A system to search these documents. You will ask a question about one of the films, create an embedding of the question, and compare it against the collection of embeddings in the dataframe.

The embedding of the question is a vector (a list of float values), which is compared against the vector of each document using the dot product. The vectors returned from the API are already normalized, so the dot product represents the similarity in direction between two vectors.

The values of the dot product can range between -1 and 1, inclusive. If the dot product between two vectors is 1, then the vectors are in the same direction. If the dot product value is 0, then these vectors are orthogonal, or unrelated, to each other. Lastly, if the dot product is -1, then the vectors point in the opposite direction and are not similar to each other.
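As a quick check of the three cases described above, the following sketch uses small hand-made vectors rather than real embeddings. The `normalize` helper is a hypothetical name introduced only for this illustration; it is not part of the Gemini API.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (illustrative helper, not part of the Gemini API)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = normalize([3.0, 4.0])
same = normalize([6.0, 8.0])          # points in the same direction as a
orthogonal = normalize([-4.0, 3.0])   # perpendicular to a
opposite = normalize([-3.0, -4.0])    # points in the opposite direction

# For unit-length vectors the dot product equals the cosine of the angle between them.
print(np.dot(a, same))        # approximately 1
print(np.dot(a, orthogonal))  # approximately 0
print(np.dot(a, opposite))    # approximately -1
```

Real embeddings rarely reach these extremes; in practice, what matters is which document scores highest relative to the others.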

Note that with the new embeddings model (`embedding-001`), you specify the task type as `RETRIEVAL_QUERY` when embedding a user query and `RETRIEVAL_DOCUMENT` when embedding document text.

Task Type | Description
---       | ---
RETRIEVAL_QUERY | Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting.

In [24]:
query = "What is the name of the kingdom in Bahubali?"
model = 'models/embedding-001'

request = genai.embed_content(model=model,
                              content=query,
                              task_type="retrieval_query")

Use the `find_best_passage` function to calculate the dot products, and then sort the dataframe from the largest to smallest dot product value to retrieve the relevant passage out of the database.

In [25]:
def find_best_passage(query, dataframe):
  """
  Compute the similarity between the query and each document in the dataframe
  using the dot product, and return the text of the highest-scoring document.
  """
  query_embedding = genai.embed_content(model=model,
                                        content=query,
                                        task_type="retrieval_query")
  dot_products = np.dot(np.stack(dataframe['Embeddings']), query_embedding["embedding"])
  idx = np.argmax(dot_products)
  return dataframe.iloc[idx]['Text'] # Return text from index with max value
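The function above returns only the single best match. When you want to inspect how every document scored, a ranking variant can help. The sketch below is a hypothetical helper (`rank_passages` is not part of the original notebook) and uses toy two-dimensional vectors in place of real embeddings so that it runs without an API key.

```python
import numpy as np
import pandas as pd

def rank_passages(query_embedding, dataframe, top_k=3):
    """Return the top_k rows sorted by dot-product score (hypothetical helper)."""
    # Turn the column of lists into a 2-D array with one row per document.
    matrix = np.array(dataframe['Embeddings'].tolist())
    scores = matrix @ np.asarray(query_embedding)
    ranked = dataframe.assign(Score=scores)
    return ranked.sort_values('Score', ascending=False).head(top_k)

# Toy data: made-up unit vectors standing in for API embeddings.
toy_df = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Text': ['first passage', 'second passage', 'third passage'],
    'Embeddings': [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]],
})
toy_query = [0.0, 1.0]

ranked = rank_passages(toy_query, toy_df, top_k=2)
print(ranked[['Title', 'Score']])
```

With real data, you would pass the `"embedding"` value returned for the query together with the dataframe built earlier.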

View the most relevant document from the database:

In [26]:
passage = find_best_passage(query, df)
passage

"Directed by S.S. Rajamouli, this epic action film is set in the ancient kingdom of Mahishmati. It tells the story of two brothers, Amarendra Baahubali and Bhallaladeva, who compete for the throne. Amarendra is the rightful heir and beloved by the people, but Bhallaladeva plots against him with his father. The story unfolds with grand battle scenes, intricate politics, and a dramatic revelation about Amarendra's son, who seeks to avenge his father's betrayal and reclaim the throne."

## Question Answering Application

Next, use the text generation API to build a simple question answering system on top of the retrieved passage. You can substitute your own data for the sample documents above. The dot product is still the similarity metric used for retrieval.

In [27]:
def make_prompt(query, relevant_passage):
  escaped = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
  prompt = textwrap.dedent("""You are a helpful and informative bot that answers questions using text from the reference passage included below. \
  Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. \
  However, you are talking to a non-technical audience, so be sure to break down complicated concepts and \
  strike a friendly and conversational tone. \
  If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: '{query}'
  PASSAGE: '{relevant_passage}'

    ANSWER:
  """).format(query=query, relevant_passage=escaped)

  return prompt

In [28]:
prompt = make_prompt(query, passage)
print(prompt)

You are a helpful and informative bot that answers questions using text from the reference passage included below.   Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.   However, you are talking to a non-technical audience, so be sure to break down complicated concepts and   strike a friendly and conversational tone.   If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: 'What is the name of the kingdom in Bahubali?'
  PASSAGE: 'Directed by S.S. Rajamouli, this epic action film is set in the ancient kingdom of Mahishmati. It tells the story of two brothers, Amarendra Baahubali and Bhallaladeva, who compete for the throne. Amarendra is the rightful heir and beloved by the people, but Bhallaladeva plots against him with his father. The story unfolds with grand battle scenes, intricate politics, and a dramatic revelation about Amarendras son, who seeks to avenge his fathers betrayal and reclaim the throne.'



Choose one of the Gemini content generation models in order to find the answer to your query.

In [29]:
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.0-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro-vision-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-001
models/gemini-1.5-flash-latest
models/gemini-1.5-pro
models/gemini-1.5-pro-001
models/gemini-1.5-pro-latest
models/gemini-pro
models/gemini-pro-vision


In [30]:
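# Use a separate name so the embedding model name stored in `model`
# (used by find_best_passage) is not overwritten.
generation_model = genai.GenerativeModel('models/gemini-pro')
answer = generation_model.generate_content(prompt)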

In [31]:
Markdown(answer.text)

The kingdom in Bahubali is called Mahishmati.
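
The retrieval, prompt-building, and generation steps above can be chained into a single helper. The following is a minimal sketch under stated assumptions: `answer_question` and its three callable parameters are hypothetical names introduced here, and the stand-in functions replace the real API calls so that the example runs offline.

```python
import numpy as np
import pandas as pd

def answer_question(query, dataframe, embed_query, build_prompt, generate):
    """Retrieve the best passage for `query`, build a prompt, and generate an answer.

    embed_query, build_prompt, and generate are caller-supplied callables
    (assumptions made for this sketch, not part of the Gemini client library).
    """
    query_vec = np.asarray(embed_query(query))
    doc_matrix = np.array(dataframe['Embeddings'].tolist())
    best_idx = int(np.argmax(doc_matrix @ query_vec))
    passage = dataframe.iloc[best_idx]['Text']
    return generate(build_prompt(query, passage))

# Stand-ins so the sketch runs without an API key.
def fake_embed(query):
    return [0.0, 1.0]

def fake_prompt(query, passage):
    return f"QUESTION: {query} PASSAGE: {passage}"

def fake_generate(prompt):
    return f"(generated from) {prompt}"

toy_df = pd.DataFrame({
    'Text': ['about topic one', 'about topic two'],
    'Embeddings': [[1.0, 0.0], [0.0, 1.0]],
})

result = answer_question("Which topic?", toy_df, fake_embed, fake_prompt, fake_generate)
print(result)
```

In the notebook, the stand-ins might be replaced by a wrapper around `genai.embed_content(...)["embedding"]`, the `make_prompt` function, and a wrapper that calls `generate_content` on the generative model and returns its `.text`.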