<a href="https://colab.research.google.com/github/Ali-mohammadi-design/LLM_Engineering_and_Machine_Learning/blob/main/Gemini_API_Embedding_RAG_Retrieval_Augmented_Generation_search_by_similarity_comparing_Vectors_Augmented_prompt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install google-generativeai



In [2]:
import google.generativeai as genai

In [8]:
import os
GenAI_API_Key=os.environ['GenAI_API_Key']

In [9]:
genai.configure(api_key=GenAI_API_Key)

Let's embed some text.

In [11]:
sport_news={'title':'sport section', 'text': 'Tehran would organize a new festival for cycling'}

In [12]:
political_news={'title':'politics', 'text': 'Iran president was killed in an aciddent last week'}

Note: as the model list below shows, Gemini has separate models for embedding, such as embedding-001 and text-embedding-004.

In [13]:
for model in genai.list_models():
  print(model.name)


models/chat-bison-001
models/text-bison-001
models/embedding-gecko-001
models/gemini-1.0-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-latest
models/gemini-1.0-pro-vision-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-001
models/gemini-1.5-flash-latest
models/gemini-1.5-pro
models/gemini-1.5-pro-001
models/gemini-1.5-pro-latest
models/gemini-pro
models/gemini-pro-vision
models/embedding-001
models/text-embedding-004
models/aqa


Let's embed:

In [15]:
sport_embedding_vector=genai.embed_content(model='models/text-embedding-004',content=sport_news['text'])

In [16]:
type(sport_embedding_vector)

dict

In [22]:
sport_embedding_vector.keys()

dict_keys(['embedding'])

The dictionary has a single key, embedding.

In [26]:
#sport_embedding_vector['embedding']

Note: We can make a function for the embedding:

In [29]:
def embed(query):
  # Return only the embedding vector (a list of floats) for the given text.
  return genai.embed_content(model='models/text-embedding-004',content=query)['embedding']

Note: we want a DataFrame that holds all of the embedded texts. We then search by similarity to find the text most similar to the query, and use that text in the prompt to be processed. So, let's build the DataFrame.

In [30]:
import pandas as pd

In [31]:
df=pd.DataFrame()

In [32]:
documents=[sport_news,political_news]

In [34]:
df=pd.DataFrame(documents)

In [35]:
df

Unnamed: 0,title,text
0,sport section,Tehran would organize a new festival for cycling
1,politics,Iran president was killed in an aciddent last ...


Note: You can rename the columns if the default names are not suitable.

In [36]:
df.columns=['Tilte','Text']

In [37]:
df

Unnamed: 0,Tilte,Text
0,sport section,Tehran would organize a new festival for cycling
1,politics,Iran president was killed in an aciddent last ...


Note: we want to add a column called embedding that holds the vector for each text.

In [42]:
df['embedding']=df['Text'].apply(embed)

In [43]:
df

Unnamed: 0,Tilte,Text,embedding
0,sport section,Tehran would organize a new festival for cycling,"[-0.030091116, 0.0129978, 0.04402315, -0.05142..."
1,politics,Iran president was killed in an aciddent last ...,"[0.03517581, -0.003256268, -0.035170935, -0.04..."


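For the similarity search, NumPy provides the np.dot() function, which computes the dot product of two vectors and returns a single number. A larger value means the vectors are more similar, provided they have comparable lengths (for unit-length vectors, the dot product equals the cosine similarity).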

In [44]:
import numpy as np

In [45]:
def similarity(query,vector):
  query_embeded_vector=embed(query)
  return np.dot(query_embeded_vector,vector)
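
The dot product grows with vector length as well as with how closely two vectors point in the same direction. The sketch below uses made-up 2-D vectors (not real embeddings) and a hypothetical helper, cosine_similarity, to show how the two scores can rank candidates differently when lengths differ. Whether this matters for a given embedding model depends on whether that model returns unit-length vectors, which is not checked here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b after scaling both to unit length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration only.
query = [1.0, 0.0]
long_off_axis = [5.0, 5.0]    # long, but 45 degrees away from the query
short_aligned = [0.9, 0.1]    # short, but almost parallel to the query

# Raw dot product: the long vector scores higher because of its length.
dot_long = float(np.dot(query, long_off_axis))
dot_short = float(np.dot(query, short_aligned))

# Cosine similarity: only direction counts, so the aligned vector wins.
cos_long = cosine_similarity(query, long_off_axis)
cos_short = cosine_similarity(query, short_aligned)

print(dot_long > dot_short)   # True
print(cos_short > cos_long)   # True
```

If the stored vectors all have the same length, both scores produce the same ranking, and the plain np.dot used in this notebook is sufficient.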

To apply the similarity function to the DataFrame, we use a lambda. Let's explain lambda functions:

In Python, a lambda function is a small, anonymous function that is defined using the lambda keyword. Unlike regular functions defined with def, lambda functions are typically used for short, simple operations and are created on-the-fly.

Here's the basic syntax of a lambda function:

lambda arguments: expression

lambda is the keyword used to define the function.
arguments are the input parameters, similar to those in a regular function.
expression is a single expression that is evaluated and returned.

Example 1: Simple Addition
Let's start with a basic example where we create a lambda function to add two numbers.

add = lambda x, y: x + y
print(add(3, 5))

In this example, lambda x, y: x + y creates an anonymous function that takes two arguments, x and y, and returns their sum. The function is then assigned to the variable add.

Note: with a DataFrame, we can call apply() on a specific column with a lambda, so that each value in that column is passed as input to a function and the output is returned.
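
As a small illustration of the note above, the sketch below applies lambdas to one column of a toy DataFrame. The frame toy and the helper starts_with are hypothetical names made up for this example. The second call shows the same pattern the notebook relies on: the lambda fixes one argument and lets apply() supply the other.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({'Text': ['cycling festival', 'election news', 'football']})

# Series.apply calls the lambda once for each value in the column.
toy['length'] = toy['Text'].apply(lambda text: len(text))

def starts_with(text, prefix):
    """Two-argument helper; apply() can only supply the first argument."""
    return text.startswith(prefix)

# The lambda fixes prefix='c' and forwards each column value as text.
toy['starts_with_c'] = toy['Text'].apply(lambda text: starts_with(text, 'c'))

print(toy)
```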

In [47]:
text='teharan'
df['similarity']=df['embedding'].apply(lambda vector:similarity(text,vector))

In [48]:
df

Unnamed: 0,Tilte,Text,embedding,similarity
0,sport section,Tehran would organize a new festival for cycling,"[-0.030091116, 0.0129978, 0.04402315, -0.05142...",0.276927
1,politics,Iran president was killed in an aciddent last ...,"[0.03517581, -0.003256268, -0.035170935, -0.04...",0.259705


In [49]:
df.sort_values('similarity',ascending=False)

Unnamed: 0,Tilte,Text,embedding,similarity
0,sport section,Tehran would organize a new festival for cycling,"[-0.030091116, 0.0129978, 0.04402315, -0.05142...",0.276927
1,politics,Iran president was killed in an aciddent last ...,"[0.03517581, -0.003256268, -0.035170935, -0.04...",0.259705


Then we select the first row with the iloc indexer.

In [53]:
df.sort_values('similarity',ascending=False)[['Tilte','Text']].iloc[0]

Tilte                                       sport section
Text     Tehran would organize a new festival for cycling
Name: 0, dtype: object
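
If more than one document should be passed to the prompt later, the same sort-and-select idea extends to the top k rows. The sketch below uses a toy frame with made-up scores (scores, top_one, and top_two are hypothetical names) and compares sort_values with iloc against pandas' DataFrame.nlargest.

```python
import pandas as pd

# Toy similarity scores, invented for illustration only.
scores = pd.DataFrame({
    'Text': ['doc a', 'doc b', 'doc c'],
    'similarity': [0.26, 0.31, 0.12],
})

# Sort in descending order and keep the first row (the pattern used above).
top_one = scores.sort_values('similarity', ascending=False).iloc[0]

# nlargest returns the k highest-scoring rows, ordered from best to worst.
top_two = scores.nlargest(2, 'similarity')

print(top_one['Text'])
print(top_two['Text'].tolist())
```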

We can gather all of them in a function:

In [58]:
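def most_similar_doc(text):
  # Score every stored document against the query text.
  df['similarity']=df['embedding'].apply(lambda vector:similarity(text,vector))
  # Return the text of the highest-scoring document.
  return df.sort_values('similarity',ascending=False)[['Tilte','Text']].iloc[0]['Text']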



In [59]:
text='what does happen in Iran last week?'

In [60]:
most_similar_doc(text)

'Iran president was killed in an aciddent last week'
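
One thing to note is that similarity() calls embed(query) every time it runs, so most_similar_doc embeds the same query once per row. A possible variation, sketched below, embeds the query once and scores all documents with a single matrix-vector product. To keep the example runnable without an API key, it uses fake_embed, a hypothetical word-count stand-in for embed(); the names VOCAB, docs, and most_similar_doc_once are likewise assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Tiny vocabulary for the stand-in embedder (illustration only).
VOCAB = ['tehran', 'cycling', 'festival', 'iran', 'president', 'week']

def fake_embed(text):
    """Hypothetical stand-in for embed(): counts vocabulary words in the text."""
    words = text.lower().replace('?', '').split()
    return [float(words.count(term)) for term in VOCAB]

docs = pd.DataFrame({'Text': [
    'Tehran would organize a new festival for cycling',
    'Iran president was killed in an accident last week',
]})
docs['embedding'] = docs['Text'].apply(fake_embed)

def most_similar_doc_once(query, frame, embed_fn):
    # Embed the query a single time instead of once per row.
    query_vector = np.array(embed_fn(query))
    # Stack the stored vectors into an (n_docs, dim) matrix.
    matrix = np.vstack(frame['embedding'].to_list())
    # One matrix-vector product yields every dot-product score at once.
    scores = matrix @ query_vector
    return frame['Text'].iloc[int(np.argmax(scores))]

best = most_similar_doc_once('what happened in Iran last week?', docs, fake_embed)
print(best)
```

In the notebook, passing the real embed function as embed_fn would reduce the number of embedding requests to one per query.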

Now we can use this information in an augmented prompt.

In [61]:
model= genai.GenerativeModel('gemini-pro')

In [98]:
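def augmented_response(query):
  # Retrieve the most similar document and give it to the model as context.
  text=most_similar_doc(query)
  prompt= f"Regarding the following text, please answer this prompt.\n text:{text}\n\n prompt:{query}"
  response=model.generate_content(prompt)
  # print() returns None, so print the answer without returning it.
  print(response.text)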


In [99]:
prompt='what was happened in Tehran related to the sport?'

In [100]:
augmented_response(prompt)

Tehran would organize a new festival for cycling
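
The prompt-building step can also be separated from the model call, which makes it easy to inspect the augmented prompt or to include several retrieved passages. The sketch below is one possible layout, not the notebook's own: build_augmented_prompt and its wording are assumptions, and the function only formats a string without calling the API.

```python
def build_augmented_prompt(query, passages):
    """Join retrieved passages into one context block ahead of the question."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Example with a single retrieved passage.
example_prompt = build_augmented_prompt(
    'what happened in Tehran related to sport?',
    ['Tehran would organize a new festival for cycling'],
)
print(example_prompt)
```

The resulting string could then be passed to model.generate_content in place of the inline f-string.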
