## Step 1: Original Documents

Let's create some strings to act as text documents. Later we will read in other files, like PDFs.

In [1]:
sports_news_text = {'title':'Sports Section',
                    'text':"The San Francisco 49ers are heading to the super bowl in a football showdown!"}

In [2]:
finance_news_text = {'title':"Finance Section",
                     'text':"Meta stock has reached all time highs and has become a major part of the S&P500."}

## Step 2: Load Embedding Model

In [3]:
import google.generativeai as genai
api_key = ''  # Paste your own API key here
genai.configure(api_key=api_key)

## Step 3: Create Vector Embeddings

In [4]:
sports_embedding_vector = genai.embed_content(model='models/embedding-001',content=sports_news_text['text'],
                             task_type='retrieval_document')

In [5]:
len(sports_embedding_vector['embedding'])

768

In [6]:
finance_embedding_vector = genai.embed_content(model='models/embedding-001',content=finance_news_text['text'],
                             task_type='retrieval_document')

In [7]:
len(finance_embedding_vector['embedding'])

768

In [8]:
finance_embedding_vector.keys()

dict_keys(['embedding'])

In [9]:
def embed_text(text):
    return genai.embed_content(model='models/embedding-001',content=text,
                             task_type='retrieval_document')['embedding']

## Step 4: Store Embeddings

For larger applications, you should use a vector database, like ChromaDB. For now we'll create our own simple stand-in: a pandas DataFrame that stores each document next to its embedding, so the RAG step can search it later.

In [10]:
import pandas as pd

In [11]:
df = pd.DataFrame()

In [12]:
documents = [finance_news_text,sports_news_text]
df = pd.DataFrame(documents)
df.columns = ['Title', 'Text']
df

| | Title | Text |
|---|---|---|
| 0 | Finance Section | Meta stock has reached all time highs and has ... |
| 1 | Sports Section | The San Francisco 49ers are heading to the sup... |


In [13]:
df['Embeddings'] = df['Text'].apply(embed_text)

In [14]:
df

| | Title | Text | Embeddings |
|---|---|---|---|
| 0 | Finance Section | Meta stock has reached all time highs and has ... | [0.055816136, -0.0018495269, -0.024931083, 0.0... |
| 1 | Sports Section | The San Francisco 49ers are heading to the sup... | [0.013654178, -0.010523142, -0.053972486, -0.0... |


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Title       2 non-null      object
 1   Text        2 non-null      object
 2   Embeddings  2 non-null      object
dtypes: object(3)
memory usage: 180.0+ bytes


## Step 5: Similarity Search

With the embeddings stored, we can build a simple Question and Answer (Q&A) system that searches these documents. We pose a query (here, a question about the stock market) and convert it into an embedding, a numerical vector of floating-point values. That query vector is then compared with each document embedding stored in the DataFrame.

The comparison uses the dot product, which measures how closely two vectors point in the same direction. The vectors we receive from the API are pre-normalized to unit length, so the dot product can be used directly as a cosine similarity score.

Because the vectors are normalized, the dot product ranges from -1 to 1. A value of 1 means the vectors point in the same direction, which indicates the highest similarity. A value of -1 means they point in opposite directions. A value of 0 means the vectors are orthogonal (perpendicular) and share no directional relationship. Keeping these reference points in mind helps when interpreting the similarity scores between the query and the document embeddings.
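
To make those three reference values concrete, here is a minimal sketch using toy 2-D vectors in place of real 768-dimensional embeddings. The helper names `normalize` and `cosine_similarity` are illustrative and not part of the notebook.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Cosine similarity for vectors of any length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (made-up values) standing in for real embeddings
a = normalize([3.0, 4.0])
same = normalize([6.0, 8.0])            # same direction as a
opposite = -a                           # opposite direction to a
perpendicular = normalize([-4.0, 3.0])  # at a right angle to a

print(np.dot(a, same))           # approximately 1
print(np.dot(a, opposite))       # approximately -1
print(np.dot(a, perpendicular))  # approximately 0

# Without normalization the raw dot product is not limited to [-1, 1]
raw = float(np.dot([3.0, 4.0], [6.0, 8.0]))
print(raw)
```

If you ever work with embeddings that are not unit length, dividing by the norms (as `cosine_similarity` does) restores the -1 to 1 range.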

In [16]:
import numpy as np

def query_similarity_score(query,vector):
    '''
    INPUTS:
        query: str: The user prompt
        vector: array: The existing vector embedding from a document
    OUTPUT:
        score: float - Cosine similarity score
    '''
    query_embedding = embed_text(query)
    return np.dot(query_embedding,vector)

In [17]:
query = "Any interesting news about the stock market today?"

In [18]:
df['Similarity'] = df['Embeddings'].apply(lambda vector: query_similarity_score(query,vector))

In [19]:
df

| | Title | Text | Embeddings | Similarity |
|---|---|---|---|---|
| 0 | Finance Section | Meta stock has reached all time highs and has ... | [0.055816136, -0.0018495269, -0.024931083, 0.0... | 0.790957 |
| 1 | Sports Section | The San Francisco 49ers are heading to the sup... | [0.013654178, -0.010523142, -0.053972486, -0.0... | 0.704175 |
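
Note that `query_similarity_score` embeds the query again for every row, which means one API call per document. A possible alternative, sketched here with made-up 3-D vectors instead of real API embeddings, is to embed the query once and score all rows with a single matrix-vector product. The function name `score_documents` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def score_documents(query_embedding, frame):
    """Return one dot-product score per row of frame['Embeddings']."""
    matrix = np.array(frame['Embeddings'].tolist())   # shape: (n_docs, dim)
    return matrix @ np.asarray(query_embedding)       # shape: (n_docs,)

# Toy unit vectors (made-up values) standing in for real embeddings
toy_df = pd.DataFrame({
    'Title': ['Finance Section', 'Sports Section'],
    'Embeddings': [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
})

# In the notebook this would be a single call: embed_text(query)
toy_query_embedding = [0.8, 0.6, 0.0]

toy_df['Similarity'] = score_documents(toy_query_embedding, toy_df)
print(toy_df[['Title', 'Similarity']])
```

With only two documents the difference is negligible, but the number of embedding calls per query stays at one no matter how many rows the DataFrame holds.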


In [20]:
df.sort_values('Similarity',ascending=False)[['Title','Text']].iloc[0]

Title                                      Finance Section
Text     Meta stock has reached all time highs and has ...
Name: 0, dtype: object

In [21]:
def most_similar_document(query):
    # Score every stored embedding against the query
    df['Similarity'] = df['Embeddings'].apply(lambda vector: query_similarity_score(query,vector))
    # Sort once and take the highest-scoring row
    best = df.sort_values('Similarity',ascending=False).iloc[0]
    return best['Title'],best['Text']
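
Returning only the single best match works for two short documents. With a larger collection you may want several candidates. The sketch below assumes a DataFrame that already has a `Similarity` column and uses pandas' `nlargest`; the name `top_k_documents` and the scores are made up for illustration.

```python
import pandas as pd

def top_k_documents(scored_df, k=2):
    """Return the k highest-scoring rows, best match first."""
    return scored_df.nlargest(k, 'Similarity')[['Title', 'Text', 'Similarity']]

# Made-up titles, texts, and scores for demonstration only
scored = pd.DataFrame({
    'Title': ['Doc A', 'Doc B', 'Doc C'],
    'Text': ['text a', 'text b', 'text c'],
    'Similarity': [0.70, 0.79, 0.65],
})

top = top_k_documents(scored, k=2)
print(top)
```

The texts of the returned rows could then be joined together and passed to the generation model as a combined context.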

## Step 6: Inject Text as Context using RAG

We simply grab the most relevant text and pass it to the text generation model as context to help it answer the query.

In [22]:
def RAG(query):
    title,text = most_similar_document(query)
    model = genai.GenerativeModel('gemini-pro')
    prompt = f'Answer this query:\n{query}.\nOnly use this context to answer:\n{text}'
    response = model.generate_content(prompt)
    return f'{response.text}\n\nSource Document:{title}'

In [23]:
# Careful, it can still add its own context, which could be out of date!
print(RAG("Any interesting news about the stock market today?"))

This context does not mention any news about the stock market today, so I cannot answer this query.

Source Document:Finance Section


In [24]:
print(RAG("Anything interesting happening in the world of sports?"))

The San Francisco 49ers are heading to the Super Bowl, suggesting that they are participating in a significant sporting event.

Source Document:Sports Section
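
The prompt in `RAG` appends a period directly after the query, so a question ending in "?" is sent as "?.", and the context follows with no explicit fallback instruction. One possible variation, shown as a plain string-building sketch with no API call, separates the two parts and tells the model what to do when the context is insufficient. The function name `build_rag_prompt` and the wording are assumptions, and no prompt wording can guarantee the model ignores its own knowledge.

```python
def build_rag_prompt(query, context):
    """Assemble a prompt that keeps the question and retrieved context clearly separated."""
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, say that you cannot answer.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question:\n{query.strip()}"
    )

example_prompt = build_rag_prompt(
    "Any interesting news about the stock market today?",
    "Meta stock has reached all time highs and has become a major part of the S&P500.",
)
print(example_prompt)
```

The returned string could be passed to `model.generate_content` in place of the f-string used in `RAG`.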


## OPTIONAL: Expand to more real world documents
There are lots of libraries that let you extract text from real-world documents, for example, PDFs! You could make a bot that helps answer questions about your company's own documents. Do a Google search for Python libraries that can extract text from your document types.

In [25]:
!pip install PyPDF2

Defaulting to user installation because normal site-packages is not writeable
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [1]:
import os
import pandas as pd
from PyPDF2 import PdfReader

# Initialize an empty DataFrame with columns 'Title' and 'Text'
df = pd.DataFrame(columns=['Title', 'Text'])

# Loop through each file in the current directory
for file_name in os.listdir('.'):
    if file_name.endswith('.pdf'):
        try:
            # Open the PDF file
            with open(file_name, 'rb') as file:
                # Initialize a PDF file reader
                pdf_reader = PdfReader(file)
                # Initialize text variable to store the content of the PDF
                text = ''
                # Iterate through each page in the PDF
                for page in pdf_reader.pages:
                    # Extract text from the page (fall back to '' if nothing is returned)
                    text += page.extract_text() or ''
                # Replace newlines with spaces once all pages are read
                text = text.replace('\n',' ')
                # Create a new DataFrame with the file's title and text
                new_row = pd.DataFrame({'Title': [file_name], 'Text': [text]})
                # Concatenate the new DataFrame row to the existing DataFrame
                df = pd.concat([df, new_row], ignore_index=True)
        except Exception as e:
            print(f"Error processing file {file_name}: {e}")

In [2]:
df

| | Title | Text |
|---|---|---|
| 0 | Wonka Chocolate Facility Rules.pdf | Wonka Milk Chocolate Factory: Facility Safety ... |
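
Each PDF ends up as one long string in a single row, which may be too long to embed or too broad to retrieve precisely. A common approach is to split the text into smaller overlapping pieces and embed each piece as its own row. The sketch below is one simple way to do that by word count. The function name `chunk_text` and the default sizes are assumptions, not values recommended by any library.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of up to chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError('overlap must be smaller than chunk_size')
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny demonstration with ten placeholder words
sample = ' '.join(f'w{i}' for i in range(10))
sample_chunks = chunk_text(sample, chunk_size=4, overlap=1)
print(sample_chunks)
```

Each chunk could then be stored as its own row with the same `Title`, so that the similarity search returns the most relevant passage instead of the whole file.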
