# Building a chatbot - LLM + RAG

This notebook demonstrates a simple chatbot for news-based stock investing. The user supplies a list of ticker symbols of interest, and we use `yfinance` to download recent news items relating to those symbols. We then use an LLM with retrieval-augmented generation (RAG) to query this dataset and judge, for example, whether a given stock appears promising to buy. Recent context is critical here: the LLM's training data does not include the latest news, so without it the model cannot intelligently gauge near-term stock performance.

## Data Wrangling

In the cells below, we load the needed imports and work towards creating a `pandas` dataframe with a column named `"text"`. This column holds all of the text that our RAG-based solution can draw on as context. We also keep a column named `"ticker"`, which records the symbol each text pertains to. As we will see, the ticker symbol is not used for classical RAG, but we will use it for a "manual RAG" baseline in which we directly feed, as context, the text for the symbol the user asks about. This forms an interesting baseline against which to compare semantic search!

In [1]:
api_key = "YOUR API KEY"  # replace with your actual key

import numpy as np
import yfinance as yf
import pandas as pd
import openai
import time
from sklearn.metrics.pairwise import cosine_similarity

from openai import OpenAI
client = OpenAI(
    base_url = "https://openai.vocareum.com/v1",
    api_key = api_key
)




We define a helper function that maps each news item returned by `yfinance` to a date and a short text snippet.

In [2]:
#Helper function which maps news items returned by yfinance into a date and snippet, discarding other fields
def extract_stock_snippets(data):
    # Imported once here (rather than on every loop iteration) for stripping HTML tags
    from bs4 import BeautifulSoup

    results = []
    for item in data:
        # Tolerate items whose "content" field is missing or None
        content = item.get("content") or {}
        pub_date = content.get("pubDate", "N/A")

        # Prioritize 'summary', then fall back to 'description' or 'title'
        summary = content.get("summary") or content.get("description") or content.get("title") or "No content"

        # Strip any HTML tags from the chosen text
        if "<" in summary:
            summary = BeautifulSoup(summary, "html.parser").get_text()

        results.append({
            "date": pub_date,
            "snippet": summary.strip()
        })
    return results
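
To make the expected input shape concrete, the sketch below runs the same fallback chain on a few hand-written items. The dictionaries are hypothetical stand-ins for what `yfinance` returns (the real payload carries more fields, and its layout may differ between library versions), and `pick_text` is an illustrative name rather than part of the notebook.

```python
# Hypothetical news items shaped like the ones the helper above expects
sample_news = [
    {"content": {"pubDate": "2025-01-01T12:00:00Z",
                 "summary": "",
                 "description": "Shares rose after earnings.",
                 "title": "Earnings beat"}},
    {"content": {"title": "Only a title here"}},
    {},
]

def pick_text(item):
    """Mirror the helper's fallback: summary, then description, then title."""
    content = item.get("content", {})
    # `or` skips empty strings and None as well as missing keys
    return (content.get("summary") or content.get("description")
            or content.get("title") or "No content")

texts = [pick_text(item) for item in sample_news]
print(texts)
```

Note that an empty `summary` is treated the same as a missing one, so the first item falls through to its `description`.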

Next, we define a list of stock symbols we are interested in, and obtain recent news snippets for each one.

In [3]:

tickers = [
    'AAPL',  # Apple
    'TSLA',  # Tesla
    'GOOGL', # Alphabet 
    'MSFT',  # Microsoft
    'AMZN',  # Amazon
    'META',  # Meta 
    'NVDA',  # NVIDIA
    'NFLX',  # Netflix
    'INTC',  # Intel
    'AMD',   # Advanced Micro Devices
    'BRK-B', # Berkshire Hathaway (Yahoo Finance writes share classes with a hyphen)
    'JPM',   # JPMorgan Chase
    'BAC',   # Bank of America
    'WMT',   # Walmart
    'TGT',   # Target
    'KO',    # Coca-Cola
    'PEP',   # PepsiCo
    'CVX',   # Chevron
    'XOM',   # ExxonMobil
    'UNH',   # UnitedHealth Group
    'PFE',   # Pfizer
    'MRK',   # Merck
    'DIS',   # Disney
    'BA',    # Boeing
    'GM'     # General Motors
]

snippets = {}

for ticker in tickers:
    stock = yf.Ticker(ticker)
    news = stock.news[:5]  # First few news items
    snippets[ticker] = extract_stock_snippets(news)

To wrap up our data creation step, we create one row of text per symbol, where we consolidate multiple news snippets per symbol into a single row. Then we save the dataframe to csv. This is our final "context" data source that our RAG system will use.

In [4]:
# Flatten the dict into a list of rows
rows = []

for ticker, articles in snippets.items():
    # Combine each snippet with its date
    combined = [
        f"[{item['date']}] {item['snippet']}" for item in articles
    ]
    
    # Join into a single string (one cell per stock)
    combined_text = "\n".join(combined)
    
    # Add to row list
    rows.append({
        "ticker": ticker,
        "text": combined_text
    })

# Create DataFrame
df = pd.DataFrame(rows)

# Save
df.to_csv("stock_snippets_summary.csv", index=False)
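
Some symbols may come back with no news at all (for example, if the data provider spells the symbol differently), which leaves an empty `text` cell that would later be embedded as if it were real context. A minimal sketch of a sanity check, assuming the same `snippets` layout as above; `find_empty_tickers` and the sample data are hypothetical:

```python
def find_empty_tickers(snippets):
    """Return tickers whose article list is empty or contains only blank snippets."""
    empty = []
    for ticker, articles in snippets.items():
        has_text = any(a.get("snippet", "").strip() for a in articles)
        if not has_text:
            empty.append(ticker)
    return empty

# Hypothetical data in the same shape as the `snippets` dict built above
example = {
    "AAPL": [{"date": "2025-01-01", "snippet": "Apple unveils a new product."}],
    "XYZ": [],
    "ABC": [{"date": "N/A", "snippet": "   "}],
}
print(find_empty_tickers(example))
```

Running a check like this before saving the CSV makes it easy to spot symbols that need a different spelling or should be dropped.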


## Custom Query Completion

In the following code blocks, we define helper functions and work towards completing a user-provided prompt by using RAG to augment the LLM's response.

We begin by implementing a function that embeds each row of our context CSV and saves the result to disk.

In [5]:



def create_embeddings_from_doc():
    # Load CSV
    df = pd.read_csv("stock_snippets_summary.csv")  
    
    # Configure model
    EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
    batch_size = 100
    df['text'] = df['text'].astype(str)  # ensure all inputs are strings
    
    # Function to get embeddings in batches
    def get_embeddings(text_list):
        response = client.embeddings.create(
            input=text_list,
            model=EMBEDDING_MODEL_NAME
        )
        return [item.embedding for item in response.data]
    
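    # Generate embeddings, retrying each batch a few times before giving up
    embeddings = []
    max_retries = 3
    for i in range(0, len(df), batch_size):
        batch = df['text'].iloc[i:i + batch_size].tolist()
        for attempt in range(max_retries):
            try:
                embeddings.extend(get_embeddings(batch))
                break
            except Exception as e:
                print(f"Error on batch {i} (attempt {attempt + 1}): {e}")
                time.sleep(5)  # simple fixed backoff before retrying
        else:
            # Fail loudly so the embeddings list never falls out of step with df
            raise RuntimeError(f"Could not embed batch starting at row {i}")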
    
    # Add embeddings to the DataFrame
    df["embedding"] = embeddings
    
    # Save the DataFrame
    df.to_pickle("embedded_stocks.pkl")  
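
The batch loop above mentions backoff and retry. One way to factor that concern out is a small generic helper with exponential backoff, sketched here under the assumption that any exception is worth retrying. `call_with_backoff` and `flaky` are hypothetical names; the demo uses a stand-in function instead of a real API call and records the delays rather than sleeping.

```python
import time

def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)

# Demo with a stand-in function that fails twice and then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record delays instead of sleeping
print(result, waits)
```

Inside the batch loop, the embedding call could then be wrapped as `call_with_backoff(lambda: get_embeddings(batch))`.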


Next, we create a function to embed the user-provided query.

In [6]:
def embed_query(prompt: str, model="text-embedding-ada-002"):
    """
    Embed the user prompt using OpenAI client v1.
    """
    response = client.embeddings.create(
        input=[prompt],
        model=model
    )
    return response.data[0].embedding

Finally, we create a function to find context results that are similar to the user query.

In [7]:
def search_similar(prompt, df, top_n=5):
    """
    Embed the prompt and find top N most similar rows in df based on cosine similarity.
    """
    # Embed the prompt
    query_embedding = embed_query(prompt)

    # Convert embeddings from df into a matrix
    embedding_matrix = np.array(df['embedding'].tolist())

    # Compute cosine similarity
    similarities = cosine_similarity([query_embedding], embedding_matrix)[0]

    # Get top N most similar indices
    top_indices = np.argsort(similarities)[-top_n:][::-1]

    # Return matching rows with similarity scores
    results = df.iloc[top_indices].copy()
    results["similarity"] = similarities[top_indices]
    return results
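
To see what the ranking step does without calling the embedding API, here is a small sketch on toy 2-D vectors. It computes cosine similarity directly with NumPy instead of `sklearn`, and `top_n_by_cosine` is a hypothetical name; the slicing logic is the same as in `search_similar`.

```python
import numpy as np

def top_n_by_cosine(query_vec, matrix, top_n=2):
    """Return (indices, scores) of the rows of `matrix` most similar to `query_vec`."""
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    scores = matrix @ query_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    )
    # argsort is ascending, so take the last top_n entries and reverse them
    order = np.argsort(scores)[-top_n:][::-1]
    return order, scores[order]

# Toy "embeddings": row 0 points along the query, row 2 is orthogonal to it
docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
idx, sims = top_n_by_cosine([2.0, 0.0], docs, top_n=2)
print(idx, sims)
```

Row 0 ranks first with a similarity of 1, followed by row 1 at roughly 0.71; the orthogonal row 2 scores 0 and is left out. The query's length (2 rather than 1) does not matter, since cosine similarity depends only on direction.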

In [8]:
create_embeddings_from_doc()

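With the embeddings saved, we define two more helpers: `build_prompt` assembles the retrieved context and the user question into a single prompt, and `query_openai_with_context` sends that prompt to the chat model (or sends the bare question when no context is supplied).

In [9]: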
def build_prompt(user_prompt, context_results, max_context_chars=3000):
    """
    Builds a prompt with context from the RAG results and user query.
    """
    context_texts = []

    # Accumulate snippets until max_context_chars is reached
    char_count = 0
    for idx, row in context_results.iterrows():
        snippet = row.get("text") or ""
        if snippet and (char_count + len(snippet)) < max_context_chars:
            context_texts.append(snippet)
            char_count += len(snippet)
        else:
            break

    context_block = "\n\n".join(context_texts)

    final_prompt = (
        f"You are an intelligent financial assistant. You will be given recent news snippets related to specific stocks. "
        f"Use this context to answer the user's question as accurately as possible.\n\n"
        f"Context:\n{context_block}\n\n"
        f"User question: {user_prompt}\n\n"
        f"Answer:"
    )
    return final_prompt


def query_openai_with_context(user_prompt, context_df=None):


    if context_df is not None and not context_df.empty:
        final_prompt = build_prompt(user_prompt, context_df)
    else:
        final_prompt = user_prompt
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  
        messages=[
            {"role": "system", "content": "You are an intelligent financial assistant."},
            {"role": "user", "content": final_prompt}
        ],
        temperature=0.3,
        max_tokens=500,
        n=1,
        stop=None
    )

    return response.choices[0].message.content.strip()
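
The character budget in `build_prompt` has one behavior worth knowing: it stops at the first snippet that does not fit, even if a later, shorter one would. The sketch below isolates that loop on hypothetical strings (`select_context` is an illustrative name, not part of the notebook).

```python
def select_context(snippets, max_chars):
    """Keep snippets in order until adding the next one would reach max_chars."""
    chosen, used = [], 0
    for snippet in snippets:
        if snippet and used + len(snippet) < max_chars:
            chosen.append(snippet)
            used += len(snippet)
        else:
            break  # same early exit as build_prompt: later snippets are not considered
    return chosen

# Hypothetical snippets, most relevant first
ranked = ["a" * 40, "b" * 70, "c" * 10]
kept = select_context(ranked, max_chars=100)
print(len(kept))

# A single oversized snippet leaves the context empty
print(len(select_context(["x" * 200], max_chars=100)))
```

Here only the first snippet is kept, and the 10-character one is never reached. In the notebook each row bundles several news items, so a long top-ranked row can leave little or no context in the prompt; raising `max_context_chars` or truncating rows instead of skipping them are possible adjustments.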




## Custom Performance Demonstration

In the cells below, we compare the outputs of several prompting techniques, asking how attractive it is to buy the stock of three different companies. For each company, our prompts include:

i) a basic LLM-based prompt, with no RAG (i.e., no recent context)

ii) a "manual RAG" based prompt, wherein we explicitly feed in the text pertaining to the stock symbol of interest. This step bypasses the vector embedding and cosine-similarity steps, using knowledge of the symbol to directly provide context.

iii) a "classic RAG" based prompt, using vector embeddings and cosine-similarity to find the most relevant context for the provided query. 





We begin by asking about Tesla. Notice that the vanilla LLM declines to answer the question, whereas the other two responses are qualitatively similar.

### Question 1

In [10]:
loaded_embeddings = pd.read_pickle("embedded_stocks.pkl")  
query="Is Tesla worth buying right now?"
response = query_openai_with_context(query, None)
print("LLM-based answer, no RAG:\n")
print(response)

LLM-based answer, no RAG:

I cannot provide personalized investment advice. However, it's important to conduct thorough research and consider various factors such as Tesla's financial performance, industry trends, and your own investment goals and risk tolerance before making any investment decisions. It may also be helpful to consult with a financial advisor for guidance tailored to your individual situation.


In [11]:
query="Is Tesla worth buying right now?"
symbol='TSLA'
df = pd.read_csv("stock_snippets_summary.csv")
top_results = df[df['ticker'] == symbol]['text'].to_frame() #manually acquire context
response = query_openai_with_context(query, top_results)
print("LLM-based answer, manual RAG:\n")
print(response)

LLM-based answer, manual RAG:

As an intelligent financial assistant, I cannot provide personalized investment advice. However, based on recent news snippets, it appears that Tesla (TSLA) stock has been climbing due to positive developments such as the $16.5 billion chip deal with Samsung for AI chips. This deal could potentially have a positive impact on Tesla's future technology and product offerings. It is important for investors to conduct their own research, consider their investment goals, risk tolerance, and consult with a financial advisor before making any investment decisions regarding Tesla or any other stock.


In [12]:
query="Is Tesla worth buying right now?"
loaded_embeddings = pd.read_pickle("embedded_stocks.pkl")  
top_results = search_similar(query, loaded_embeddings, top_n=5) #semantic search based context
response = query_openai_with_context(query, top_results)
print("LLM-based answer, with RAG:\n")
print(response)

LLM-based answer, with RAG:

As of the recent news snippets provided, Tesla (TSLA) stock has been on the rise following the announcement of a significant $16.5 billion chip deal with Samsung for AI chips. This deal is seen as a positive development for Tesla's future technology advancements, including applications in autonomous driving and AI data centers. Additionally, Elon Musk's statements regarding the potential growth opportunities from this deal have generated optimism around Tesla's prospects.

However, it's important to note that stock prices can be volatile and subject to various market factors. Considering the recent positive news surrounding Tesla and its strategic partnerships, some investors may view Tesla as a potential buy opportunity. As always, it's advisable to conduct thorough research, consider your investment goals and risk tolerance, and consult with a financial advisor before making any investment decisions.


### Question 2

Next, we ask a slightly differently worded question about Nvidia. This wording circumvents the vanilla LLM's tendency to avoid the question.

In [13]:
loaded_embeddings = pd.read_pickle("embedded_stocks.pkl")

query="Is Nvidia showing growth potential based on recent news?"
response = query_openai_with_context(query, None)
print("LLM-based answer, no RAG:\n")
print(response)

LLM-based answer, no RAG:

Yes, Nvidia has been showing strong growth potential based on recent news. The company has been benefiting from the increasing demand for its graphics processing units (GPUs) in various industries such as gaming, data centers, and artificial intelligence. Nvidia's recent acquisitions and partnerships have also positioned the company well for future growth opportunities. Additionally, the company's focus on innovation and development of new technologies has been well-received by investors and analysts, further supporting its growth potential.


In [14]:
query="Is Nvidia showing growth potential based on recent news?"
symbol='NVDA'
df = pd.read_csv("stock_snippets_summary.csv")
top_results = df[df['ticker'] == symbol]['text'].to_frame()
response = query_openai_with_context(query, top_results)
print("LLM-based answer, manual RAG:\n")
print(response)

LLM-based answer, manual RAG:

Based on recent news, there are mixed signals regarding NVIDIA's growth potential. While NVIDIA has been one of the top stocks Wall Street is buzzing about, with a strong performance leading to profit-taking recommendations from experts like Josh Brown, there may be some caution regarding its future growth trajectory. It is advisable to closely monitor further developments and expert opinions to assess NVIDIA's growth potential accurately.


In [15]:
query="Is Nvidia showing growth potential based on recent news?"
top_results = search_similar(query, loaded_embeddings, top_n=5)
response = query_openai_with_context(query, top_results)
print("LLM-based answer, with RAG:\n")
print(response)

LLM-based answer, with RAG:

Based on recent news snippets, there are mixed signals regarding NVIDIA's growth potential. While NVIDIA is one of the top stocks Wall Street is buzzing about, with strong performance mentioned, there are also reports of Josh Brown, CEO of Ritholtz Wealth Management, suggesting his followers to sell NVIDIA shares to take profits. This indicates some caution in the market regarding NVIDIA's future growth potential. It would be advisable to monitor further developments and analyst opinions to assess NVIDIA's growth trajectory accurately.


### Question 3

Finally, we ask a similarly worded question about Boeing.

In [16]:
loaded_embeddings = pd.read_pickle("embedded_stocks.pkl")

query="Is Boeing likely to do well based on recent news?"
response = query_openai_with_context(query, None)
print("LLM-based answer, no RAG:\n")
print(response)

LLM-based answer, no RAG:

I don't have real-time data or the ability to predict future stock performance. I recommend conducting thorough research on Boeing, including analyzing their financial statements, market trends, and news updates to make an informed decision about their potential performance. It may also be helpful to consult with a financial advisor for personalized advice.


In [17]:
query="Is Boeing likely to do well based on recent news?"
symbol='BA'
df = pd.read_csv("stock_snippets_summary.csv")
top_results = df[df['ticker'] == symbol]['text'].to_frame()
response = query_openai_with_context(query, top_results)
print("LLM-based answer, manual RAG:\n")
print(response)

LLM-based answer, manual RAG:

Based on the recent news snippets provided, Boeing is expected to report its second-quarter earnings soon, and investors are looking to CEO Kelly Ortberg to continue the company's turnaround efforts. Additionally, the fact that the Dow Jones manufacturer, which includes Boeing, was exempted from tariffs in an EU trade deal could be seen as a positive development. However, it is important to note that the stock market is at record highs, and there is anticipation for a busy week of corporate earnings reports, which could impact Boeing's performance. Overall, while there are some positive indicators for Boeing, it is essential to consider the broader market conditions and the company's specific financial results when evaluating its potential performance.


In [18]:
query="Is Boeing likely to do well based on recent news?"
top_results = search_similar(query, loaded_embeddings, top_n=5)
response = query_openai_with_context(query, top_results)
print("LLM-based answer, with RAG:\n")
print(response)

LLM-based answer, with RAG:

Based on the recent news snippets provided, Boeing is set to report its second-quarter earnings, and investors are looking to CEO Kelly Ortberg to continue the turnaround efforts at the company. Additionally, Boeing was exempted from tariffs in an EU trade deal. However, union workers are preparing for a possible strike. The stock market is at record highs, but there are growing headwinds in the market.

Considering these factors, Boeing's performance may be influenced by a combination of its earnings report, CEO leadership, tariff exemptions, and potential labor issues. It is essential to monitor the earnings release and any subsequent developments to assess Boeing's performance accurately.


In general, manual and classical RAG produce similar output. When the vector database is massive and there is a way to "index" the relevant context directly (such as the "ticker" key here, which lets us bypass semantic search entirely), manual RAG can return results more quickly. Such a solution may be appropriate whenever queries naturally fall into "categories" and it is easy to store context per category a priori.
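
One way to combine the two approaches is a small router that uses the manual path when the query names a known company and falls back to semantic search otherwise. This is a minimal sketch: the lookup table is hypothetical and deliberately tiny, and the word matching is naive (it would miss "Nvidia's", for instance).

```python
# Hypothetical lookup table; a real one would cover every ticker in the dataset
NAME_TO_TICKER = {"tesla": "TSLA", "nvidia": "NVDA", "boeing": "BA"}

def route_query(query, name_to_ticker=NAME_TO_TICKER):
    """Return ('manual', ticker) if a known company is named, else ('semantic', None)."""
    words = query.lower().replace("?", " ").replace(",", " ").split()
    for word in words:
        if word in name_to_ticker:
            return "manual", name_to_ticker[word]
    return "semantic", None

print(route_query("Is Tesla worth buying right now?"))
print(route_query("Which chip makers look promising?"))
```

The first query routes to the manual path for `TSLA`; the second names no known company and falls through to semantic search, which in this notebook would mean calling `search_similar`.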