# Building a Simple RAG

A Retrieval-Augmented Generation (RAG) system is a type of generative AI that combines a pre-trained large language model with the ability to reference a defined knowledge base during generation. In a typical RAG system, when a new input prompt is received, the system first retrieves relevant documents from the supplied data sources and then uses the retrieved information to inform the generated output. This lets the model draw on a wider range of information than an out-of-the-box LLM has available.

In this notebook, we are going to use a financial headline dataset outside the training scope of ChatGPT to have it answer questions about the largest financial news stories of late 2022 and early 2023.

Before getting started, create a virtual environment and run `pip install -r /path/to/requirements.txt`.

First we need to set our OpenAI API key and choose our model.

In [1]:
import os
from langchain.chat_models import ChatOpenAI

os.environ["OPENAI_API_KEY"] = 'Your OpenAI API Key'

chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    model='gpt-3.5-turbo'
)

LangChain models a chat as a sequence of interlinked messages exchanged between a human and an LLM. This sequence can be used for various prompting techniques.

We use the langchain library to begin our messaging sequence.

In [2]:
from langchain.schema import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Tell me about the financial market in late 2022 and early 2023. I am not interested in predictions or events in the past.")
]

In [3]:
response = chat(messages)
print(response.content)

I apologize, but as an AI language model, I don't have access to real-time data or the ability to predict future events. My responses are based on information available up until September 2021. It's important to consult financial experts or trusted sources for the most up-to-date and accurate information on the financial market in late 2022 and early 2023.


As you can see from the response, ChatGPT has no information about any events that occurred after September 2021. The dataset we are going to use contains a collection of financial news headlines and subheadlines from September 2022 to April 2023. Using this dataset as an external knowledge base, we will be able to augment the model's knowledge about recent events.

We are going to use this dataset: [Data](https://huggingface.co/datasets/PaulAdversarial/all_news_finance_sm_1h2023/viewer/default/train?q=Skittles)

In [4]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset(
    'PaulAdversarial/all_news_finance_sm_1h2023',
    split="train"
)

In [5]:
# this makes it easier to view the dataset and iterate over it.
pd.set_option('display.max_colwidth', None)
data = dataset.to_pandas() 
data.head()

| Unnamed: 0 | _id | main_domain | title | description | created_at |
| --- | --- | --- | --- | --- | --- |
| 0 | 6453d70d358e80adbfc4cb2b | cnbc.com | Dow drops 400 points, turns negative for the year as bank fears grow: Live updates | Regional banks led the broader market lower as contagion fears resurfaced. | 2023-05-04T16:01:46.448Z |
| 1 | 6453cf909a78e3af538abe44 | cointelegraph.com | Bitcoin drops with stocks as analyst warns of banking crisis ‘endgame’ | Bitcoin dips as the U.S. banking crisis engulfs more lenders, BTC price falling in line with stocks. | 2023-05-04T15:25:28.809Z |
| 2 | 6453cb87ccab8508100df076 | co.uk | Bitcoin Price Analysis: 29370 Tested After Surge - 5 May... | Bitcoin (BTC/USD) sought to add to recent gains early in the Asian session as the pair extended recent gains to the 29383.50 area, representing a test of an upside p... | 2023-05-04T15:12:00.971Z |
| 3 | 6453afd269f3c1643cf0a4f6 | bitcoinist.com | Bitcoin Is 75% To Halving, Here's How Past Cycles Compare | The current Bitcoin cycle is now 75% on the way to the next halving. Here's what previous cycles looked like at similar stages in their timeline. | 2023-05-04T13:10:51.220Z |
| 4 | 645399d92471d73ea0976d27 | seekingalpha.com | Iron Mountain FFO of $0.71 beats by $0.03, revenue of $1.31B misses by $10M (NYSE:IRM) | Iron Mountain press release (IRM): Q1 FFO of $0.71 beats by $0.03.Revenue of $1.31B (+4.8% Y/Y) misses by $10M.2023 Outlook: Total revenue of $5.50B-$5.60B vs | 2023-05-04T11:41:12.498Z |


Each entry in the dataset has the following fields:

- `_id`: Unique identifier for each entry
- `main_domain`: The domain of the news source
- `title`: Title of the news article
- `description`: Description or summary of the news article
- `created_at`: Date and time when the news article was created or published

We are only interested in `_id`, `title`, and `description`, so we'll drop `main_domain`. We'll reformat the date and append it to the title so the model can use it as a reference, then drop `created_at`.

In [6]:
data['title'] = data['title'] + data['created_at'].apply(lambda x: pd.to_datetime(x).strftime(' Published on %Y-%m-%d'))
data = data.drop(columns=['main_domain', 'created_at'])
data.head()

| Unnamed: 0 | _id | title | description |
| --- | --- | --- | --- |
| 0 | 6453d70d358e80adbfc4cb2b | Dow drops 400 points, turns negative for the year as bank fears grow: Live updates Published on 2023-05-04 | Regional banks led the broader market lower as contagion fears resurfaced. |
| 1 | 6453cf909a78e3af538abe44 | Bitcoin drops with stocks as analyst warns of banking crisis ‘endgame’ Published on 2023-05-04 | Bitcoin dips as the U.S. banking crisis engulfs more lenders, BTC price falling in line with stocks. |
| 2 | 6453cb87ccab8508100df076 | Bitcoin Price Analysis: 29370 Tested After Surge - 5 May... Published on 2023-05-04 | Bitcoin (BTC/USD) sought to add to recent gains early in the Asian session as the pair extended recent gains to the 29383.50 area, representing a test of an upside p... |
| 3 | 6453afd269f3c1643cf0a4f6 | Bitcoin Is 75% To Halving, Here's How Past Cycles Compare Published on 2023-05-04 | The current Bitcoin cycle is now 75% on the way to the next halving. Here's what previous cycles looked like at similar stages in their timeline. |
| 4 | 645399d92471d73ea0976d27 | Iron Mountain FFO of $0.71 beats by $0.03, revenue of $1.31B misses by $10M (NYSE:IRM) Published on 2023-05-04 | Iron Mountain press release (IRM): Q1 FFO of $0.71 beats by $0.03.Revenue of $1.31B (+4.8% Y/Y) misses by $10M.2023 Outlook: Total revenue of $5.50B-$5.60B vs |
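To see what the date suffix does to a single record, the same transformation can be reproduced with only the standard library. This is a minimal sketch: `with_publish_date` is a hypothetical helper name, and it assumes every timestamp follows the `YYYY-MM-DDTHH:MM:SS.fffZ` pattern seen in the rows above.

```python
from datetime import datetime


def with_publish_date(title: str, created_at: str) -> str:
    """Append ' Published on YYYY-MM-DD' to a title (hypothetical helper)."""
    # Assumes timestamps shaped like '2023-05-04T16:01:46.448Z'
    published = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    return title + published.strftime(" Published on %Y-%m-%d")


print(with_publish_date(
    "Bitcoin Is 75% To Halving, Here's How Past Cycles Compare",
    "2023-05-04T13:10:51.220Z",
))
# Bitcoin Is 75% To Halving, Here's How Past Cycles Compare Published on 2023-05-04
```

Keeping the date inside the title text means it travels with whatever field is later handed to the model as context.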


### Vectorization

Vectorization is the process of transforming textual data into numerical representations (vectors), which lets a system capture semantic relationships in the text and compare pieces of text by meaning. The details are beyond the scope of this notebook. [This blog provides a more in-depth explanation of vector databases.](https://archive.is/O4BQC)

Pinecone is a vector database designed for vector similarity search. Its indexing process organizes the vectors so that those most similar to a given query vector can be retrieved quickly.
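As a rough illustration of what "find the vectors most similar to a query vector" means, here is a brute-force sketch in plain Python. It is not how Pinecone works internally (a real index avoids scanning every vector); the three-dimensional `toy_vectors` and the `top_k` helper are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [vec_id for vec_id, _ in ranked[:k]]


# Made-up 3-dimensional vectors; real embeddings have far more dimensions
toy_vectors = {
    "jobs": [0.9, 0.1, 0.0],
    "bitcoin": [0.0, 0.2, 0.9],
    "hiring": [0.8, 0.3, 0.1],
}

print(top_k([1.0, 0.2, 0.0], toy_vectors, k=2))
# ['jobs', 'hiring']
```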

To get an API key, go to the [site](https://www.pinecone.io/) and create an account.

In [7]:
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or 'Your Pinecone API Key',
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'
)

Here we create our Pinecone index, assign it a dimension, and assign it a distance metric. 

We set the index dimension to 1536 to match the OpenAI embedding model `text-embedding-ada-002`, which produces 1536-dimensional vectors.

For the distance metric you can use 'euclidean', 'cosine', or 'dotproduct'. Cosine, the default, compares the direction of two vectors and disregards their magnitudes, which works well for high-dimensional data such as text embeddings.
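The difference between the three metrics is easiest to see on a pair of vectors that point in the same direction but differ in length. The sketch below uses plain Python and made-up vectors, with no Pinecone calls involved.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


v = [1.0, 2.0, 3.0]
scaled = [10 * x for x in v]  # same direction, ten times the magnitude

print(cosine(v, scaled))     # approximately 1.0: direction is identical
print(euclidean(v, scaled))  # large: the vectors are far apart in space
print(dot(v, scaled))        # 140.0: grows with the magnitude
```

Cosine treats the two vectors as essentially identical, while the Euclidean distance and the dot product both change with the scaling.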

In [8]:
import time

index_name = '2023-finance-news'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

index = pinecone.Index(index_name)

In [9]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

Embeddings are vectors of numbers that represent the meaning and context of the text processed by a model. They are produced by the model's learned weights, and texts with related meanings end up with similar vectors, which helps retrieve relevant text for a query.

We are using `text-embedding-ada-002`, which is one of OpenAI's embedding models. For more information, visit their [docs](https://openai.com/blog/new-and-improved-embedding-model).

In [10]:
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

Next we embed the data in batches using the `embed_model`, then upsert the resulting embeddings and metadata into our Pinecone index.


- `batch_size` sets the batch size for processing data. Each batch contains up to 100 records.

- The `for` loop iterates over start indices from 0 up to the length of `data`, with a step size of `batch_size`.

- `i_end` calculates the end index of the current batch, ensuring that it doesn't exceed the total length of `data`.

- `batch` retrieves the batch of data from `data` using the calculated start and end indices.

- `metadata` creates a list of dictionaries containing metadata extracted from the batch, using the column names from the dataset.

- `ids` collects the `_id` of each record in the batch, which serves as its unique ID.

- `texts` combines the `title` and `description` fields from each record in the batch to form a single text object.

- `embeds` uses the `embed_model` to embed the text documents into vectors.

- `index.upsert` upserts the vectors, along with their respective IDs and metadata, into the Pinecone index.

- The `try`/`except` block handles a '400' error, which can occur when the metadata size exceeds the limit of Pinecone's free-tier index; in that case execution stops. Any other error is printed and the loop moves on to the next batch.

In [11]:
from tqdm.auto import tqdm  # for progress bar

batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    # slice out the current batch without running past the end of the data
    i_end = min(len(data), i + batch_size)
    batch = data.iloc[i:i_end]
    # metadata stored alongside each vector
    metadata = [
        {'title': x['title'],
         'description': x['description']} for _, x in batch.iterrows()
    ]
    # use the dataset's own _id values as the vector IDs
    ids = batch['_id'].tolist()
    texts = [x['title'] + ' ' + x['description'] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    try:
        index.upsert(vectors=zip(ids, embeds, metadata))
    except Exception as e:  # Catch generic Python exceptions
        if '400' in str(e):  # Check if error message contains '400'
            print(f"Error '{e}'. Metadata size may have exceeded the limit, stopping execution.")
            break
        else:
            print(f"Encountered an unexpected error while trying to upsert vectors: {e}")
            continue

  0%|          | 0/51 [00:00<?, ?it/s]
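The batch boundaries produced by the `range` and `min` pattern in the loop can be checked in isolation. In this small sketch, `batch_bounds` is a hypothetical helper and the record count of 250 is made up for illustration; it is not the size of the dataset.

```python
def batch_bounds(n_records: int, batch_size: int):
    """Yield (start, end) index pairs that cover n_records in order."""
    for start in range(0, n_records, batch_size):
        # the last batch is cut short instead of running past the end
        end = min(n_records, start + batch_size)
        yield start, end


# 250 is an illustrative record count, not the real dataset size
print(list(batch_bounds(250, 100)))
# [(0, 100), (100, 200), (200, 250)]
```

The final pair shows why `min` is needed: without it the last slice would ask for indices beyond the end of the data.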

Next we initialize the vector store object, specifying the metadata field whose text is returned by a similarity search.

In [12]:
from langchain.vectorstores import Pinecone

text_field = "title"  # the metadata field that contains our text

# initialize the vector store object
vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)




Let's define a helper that grabs the five text objects in the vector store most related to a query and places them in the prompt.

In [13]:
def augment_prompt(query: str):
    # get top 5 results from knowledge base
    results = vectorstore.similarity_search(query, k=5)
    
    # get the text from the results
    source_knowledge = "\n".join([x.page_content for x in results])
    
    # feed into an augmented prompt
    augmented_prompt = f"""Using the contexts below, answer the query.

    Contexts:
    {source_knowledge}

    Query: {query}"""
    
    return augmented_prompt

Here we can see the similarity search grabbing the five most relevant pieces of information related to the query.

In [14]:
query = "What was the job market like in the United States in 2023?"
print(augment_prompt(query))

Using the contexts below, answer the query.

    Contexts:
    A surprising burst of US hiring in January: 517,000 jobs Published on 2023-02-03
Mass Layoffs or Hiring Boom? What’s Actually Happening in the Jobs Market Published on 2023-02-09
Job growth totals 236,000 in March, near expectations as hiring pace slows Published on 2023-04-07
As Americans Work From Home, Europeans and Asians Head Back to the Office Published on 2023-02-28
Bank Failures. High Inflation. Rising Rates. Is the Resilient Jobs Market About to Crack? Published on 2023-04-06

    Query: What was the job market like in the United States in 2023?
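The prompt assembly can also be tried without calling Pinecone or OpenAI by passing the contexts in as plain strings. The sketch below is one possible variant, not the notebook's own function: `build_prompt` is a hypothetical helper, the query is illustrative, and `textwrap.dedent` is used so the indentation of the source code does not leak into the prompt (in the output above, the `Contexts:` and `Query:` lines carry four leading spaces).

```python
from textwrap import dedent
from typing import List


def build_prompt(query: str, contexts: List[str]) -> str:
    """Assemble an augmented prompt from already-retrieved context strings."""
    source_knowledge = "\n".join(contexts)
    # dedent the template first, then substitute, so context text is untouched
    template = dedent("""\
        Using the contexts below, answer the query.

        Contexts:
        {contexts}

        Query: {query}""")
    return template.format(contexts=source_knowledge, query=query)


print(build_prompt(
    "How strong was US hiring in early 2023?",  # illustrative query
    [
        "A surprising burst of US hiring in January: 517,000 jobs Published on 2023-02-03",
        "Job growth totals 236,000 in March, near expectations as hiring pace slows Published on 2023-04-07",
    ],
))
```

Separating retrieval from prompt construction like this makes the template easy to inspect and adjust before any API calls are made.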


When we ask GPT about the 2023 job market without the data augmentation, we get the result we expect: it cannot provide an answer.

In [15]:
prompt = HumanMessage(
    content="What was the job market like in the United States in 2023?"
)

response = chat(messages + [prompt])
print(response.content)

I apologize, but as an AI language model, I don't have real-time data or the ability to browse the internet. Therefore, I cannot provide you with specific information on the financial market in late 2022 and early 2023, or the job market in the United States in 2023. It's always best to refer to reliable news sources or consult with financial experts for the most up-to-date and accurate information on these topics.


But asking the same question using our augmented prompt now generates a meaningful response!

In [16]:
prompt = HumanMessage(
    content=augment_prompt(
        "What was the job market like in the United States in 2023?"
    )
)

response = chat(messages + [prompt])
print(response.content)

In late 2022 and early 2023, the job market in the United States experienced a mix of positive and concerning trends. In January 2023, there was a surprising burst of hiring, with 517,000 jobs added. This indicated a period of strong job growth. However, in February 2023, there were reports of Americans working from home while Europeans and Asians were returning to the office, suggesting a potential difference in employment situations across regions.

In March 2023, job growth totaled 236,000, which was near expectations but also indicated a slowdown in the hiring pace compared to the previous months. This could suggest a possible stabilization or cooling down of the job market.

However, there were also concerns in the financial market during this period. The possibility of bank failures, high inflation, and rising rates raised questions about the resilience of the jobs market and its potential to withstand these challenges. It is important to note that the financial market and the jo