# Notebook 3: Approaches for Tabular Data

In this notebook we will show a few ways to use LLM pipelines with tabular data. I like to think of three options for working with tabular data with LLMs:

1. **Static Approach:** Store data in a knowledge base and try to retrieve the relevant data to answer the question directly
2. **Templated Code / Function Calling Approach:** Have pre-defined code that can run to satisfy certain requests. Use the LLM to extract the inputs from the question and to phrase the answer from the code's outputs.
3. **AI Agent Approach:** The LLM creates code to run in order to produce analysis required to answer the question.
 
Complexity and reliance on the LLM increase from 1., where no code is executed, to 2., where only pre-defined code is executed, to 3., where the LLM produces the code to execute. The separation is not perfectly clean (an agent can also be restricted to calling only pre-defined code), but it is a useful framework. Additionally, for 1., the stored data could consist of extensively pre-processed "views" of the data aimed at satisfying the majority of user queries.
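
To make option 2 concrete, here is a minimal sketch of the dispatch step in the templated-code pattern. It assumes the LLM has already turned the user's question into a small JSON request (a function name plus arguments). The registry, the function names `latest_close` and `mean_close`, and the request format are all hypothetical; the three rows are closes rounded from the TSLA data used later in this notebook.

```python
import json

# Three rows standing in for the stock table (closes rounded from the TSLA data).
ROWS = [
    {"Date": "2024-01-31", "Close": 187.29},
    {"Date": "2024-02-01", "Close": 188.86},
    {"Date": "2024-02-02", "Close": 187.91},
]

def latest_close(rows):
    """Return the close of the most recent row (ISO dates sort lexicographically)."""
    return max(rows, key=lambda r: r["Date"])["Close"]

def mean_close(rows, start, end):
    """Return the mean close for start <= Date <= end."""
    closes = [r["Close"] for r in rows if start <= r["Date"] <= end]
    if not closes:
        raise ValueError("no rows in the requested date range")
    return sum(closes) / len(closes)

# Only functions listed here can ever be executed (hypothetical registry).
REGISTRY = {"latest_close": latest_close, "mean_close": mean_close}

def run_request(request_json, rows):
    """Dispatch an LLM-produced request to pre-defined code."""
    request = json.loads(request_json)
    name = request["name"]
    if name not in REGISTRY:
        raise KeyError(f"unknown function: {name}")
    return REGISTRY[name](rows, **request.get("arguments", {}))

print(run_request('{"name": "latest_close", "arguments": {}}', ROWS))
```

The point of the design is that the LLM never writes code here: it only chooses among vetted functions and supplies arguments, which is what separates option 2 from option 3.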

In this notebook, we will focus on 1. and 3., but demonstrate 2. to some extent through the use of agents. We will work with Tesla (TSLA) stock price data.
* First, we will use the common approach of using LangChain's [CSV Loader](https://python.langchain.com/docs/integrations/document_loaders/csv)
* Next, we will explore LangChain's [TimeWeightedVectorStoreRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/time_weighted_vectorstore) and learn why this may not be what we want
* Then, we will explore agent-based approaches from [LangChain](https://github.com/langchain-ai/langchain/tree/master) and [Pandas-AI](https://github.com/Sinaptik-AI/pandas-ai/tree/main)

## Import libraries and load the stock data

In [3]:
import pandas as pd
import os
import csv
import json
import time
import re
import numpy as np
from datetime import datetime

from sklearn.metrics.pairwise import cosine_similarity
import faiss

#We will use LangChain to create a vector store for retrieval
from langchain.vectorstores.faiss import FAISS
from langchain.docstore import InMemoryDocstore
from langchain_core.vectorstores import VectorStoreRetriever
from langchain.document_loaders import UnstructuredPDFLoader, csv_loader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.utils import mock_now


EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"  # alternative: "BAAI/bge-base-en-v1.5"

In [7]:
#load stock data. Load dataframe and load directly as docs
tsla_stock = pd.read_csv("../data/TSLA.csv")
spy_etf = pd.read_csv("../data/SPY.csv") #for later

Document(page_content='Date: 2010-06-29\nOpen: 1.266667\nHigh: 1.666667\nLow: 1.169333\nClose: 1.592667\nAdj Close: 1.592667\nVolume: 281494500', metadata={'source': '../data/TSLA.csv', 'row': 0})

## Static Approaches 

Here we use a knowledge base of TSLA's stock data

#### Using LangChain's CSVLoader
This just loads each record as its own document, formatted as `column: value` lines, which we then index in our vector database for retrieval

In [5]:
loader  = csv_loader.CSVLoader(file_path="../data/TSLA.csv")
stock_data_docs = loader.load()
stock_data_docs[0]

#for the retrieval process
embedding_function = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        cache_folder="../models/sentencetransformers"
    )

In [4]:
#first let's add more context to help with Retrieval and add date to metadata (for later use)
for ii in range(0, len(stock_data_docs)):
    stock_data_docs[ii].page_content = "Daily stock market data for Tesla (TSLA):\n" + stock_data_docs[ii].page_content
    date = re.findall(r'Date: (\d{4}-\d{2}-\d{2})', stock_data_docs[ii].page_content)
    if len(date) > 0:
        stock_data_docs[ii].metadata['last_accessed_at'] = datetime.strptime(date[0], '%Y-%m-%d')
    else:
        stock_data_docs[ii].metadata['last_accessed_at'] = None
stock_data_docs[0]

Document(page_content='Daily stock market data for Tesla (TSLA):\nDate: 2010-06-29\nOpen: 1.266667\nHigh: 1.666667\nLow: 1.169333\nClose: 1.592667\nAdj Close: 1.592667\nVolume: 281494500', metadata={'source': '../data/TSLA.csv', 'row': 0, 'last_accessed_at': datetime.datetime(2010, 6, 29, 0, 0)})

In [5]:
db_data = FAISS.from_documents(stock_data_docs, embedding_function)
db_data.save_local("../data/faiss_stock")

Our stock data only runs through 2024-02-02, so we will treat that as today's date

In [6]:
top_k=16
retriever_stock = VectorStoreRetriever(vectorstore=db_data, search_kwargs={"k": top_k})

def generate_response(prompt, retriever):
    #today's date - let's pretend it is 2024-02-02
    today = "2024-02-02"
    #replace "current" or "today" with today's date
    prompt = re.sub(r'current|today', today, prompt, flags=re.IGNORECASE)
    print("Prompt: ", prompt)
    # Get the top k most similar documents
    results = retriever.get_relevant_documents(prompt)
    return results

In [7]:
question ="What is TSLA's current close price?"
generate_response(question, retriever_stock)

Prompt:  What is TSLA's 2024-02-02 close price?


[Document(page_content='Daily stock market data for Tesla (TSLA):\nDate: 2021-08-03\nOpen: 239.666672\nHigh: 240.883331\nLow: 233.669998\nClose: 236.580002\nAdj Close: 236.580002\nVolume: 64860900', metadata={'source': '../data/TSLA.csv', 'row': 2793, 'last_accessed_at': datetime.datetime(2021, 8, 3, 0, 0)}),
 Document(page_content='Daily stock market data for Tesla (TSLA):\nDate: 2021-04-20\nOpen: 239.139999\nHigh: 245.750000\nLow: 236.896667\nClose: 239.663330\nAdj Close: 239.663330\nVolume: 106827000', metadata={'source': '../data/TSLA.csv', 'row': 2720, 'last_accessed_at': datetime.datetime(2021, 4, 20, 0, 0)}),
 Document(page_content='Daily stock market data for Tesla (TSLA):\nDate: 2022-08-18\nOpen: 306.000000\nHigh: 306.500000\nLow: 301.853333\nClose: 302.869995\nAdj Close: 302.869995\nVolume: 47500500', metadata={'source': '../data/TSLA.csv', 'row': 3056, 'last_accessed_at': datetime.datetime(2022, 8, 18, 0, 0)}),
 Document(page_content='Daily stock market data for Tesla (TSLA)

Notice that even when we replace "current" with today's date, our retrieval process is not strong enough to only pick recent dates. 
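
A likely reason (my reading, not something verified here) is that every document shares the same template and differs only in its digits, which embeddings do not separate well. When the question names a date, an exact-match lookup is more dependable than similarity search. Below is a minimal sketch with plain dictionaries standing in for the documents; `lookup_by_date` is a hypothetical helper, not a LangChain API, and the closes are rounded from the TSLA data.

```python
import re

# Plain dicts standing in for the loaded documents.
DOCS = [
    {"date": "2021-08-03", "text": "Date: 2021-08-03\nClose: 236.58"},
    {"date": "2024-02-01", "text": "Date: 2024-02-01\nClose: 188.86"},
    {"date": "2024-02-02", "text": "Date: 2024-02-02\nClose: 187.91"},
]

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def lookup_by_date(prompt, docs):
    """Return the documents whose date appears verbatim in the prompt."""
    wanted = set(DATE_PATTERN.findall(prompt))
    return [d for d in docs if d["date"] in wanted]

hits = lookup_by_date("What is TSLA's 2024-02-02 close price?", DOCS)
print([d["date"] for d in hits])
```

This only helps for questions that pin down a date; anything more analytical still needs one of the other approaches.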

Let's try using TimeWeightedVectorStoreRetriever to bias towards recent dates...

#### TimeWeightedVectorStoreRetriever

In [8]:
# Solution 1
embedding_size = len(embedding_function.embed_documents([question])[0])
index = faiss.IndexFlatL2(embedding_size)  # 384 for all-MiniLM-L6-v2
vectorstore = FAISS(embedding_function, index, InMemoryDocstore({}), {})
tw_retriever_stock = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=0.005, k=top_k)
# Notice that 'last_accessed_at' in each document's metadata is the stock's trading date
tw_retriever_stock.add_documents(stock_data_docs)
with mock_now(datetime(2024, 2, 2, 23, 0)):
    rel_docs = generate_response(question, tw_retriever_stock)
rel_docs

Prompt:  What is TSLA's 2024-02-02 close price?


[Document(page_content='Daily stock market data for Tesla (TSLA):\nDate: 2024-02-02\nOpen: 185.039993\nHigh: 188.690002\nLow: 182.000000\nClose: 187.910004\nAdj Close: 187.910004\nVolume: 110505100', metadata={'source': '../data/TSLA.csv', 'row': 3422, 'last_accessed_at': MockDateTime(2024, 2, 2, 23, 0), 'created_at': datetime.datetime(2024, 3, 27, 10, 23, 4, 663286), 'buffer_idx': 3422}),
 Document(page_content='Daily stock market data for Tesla (TSLA):\nDate: 2024-02-01\nOpen: 188.500000\nHigh: 189.880005\nLow: 184.279999\nClose: 188.860001\nAdj Close: 188.860001\nVolume: 91843300', metadata={'source': '../data/TSLA.csv', 'row': 3421, 'last_accessed_at': MockDateTime(2024, 2, 2, 23, 0), 'created_at': datetime.datetime(2024, 3, 27, 10, 23, 4, 663286), 'buffer_idx': 3421}),
 Document(page_content='Daily stock market data for Tesla (TSLA):\nDate: 2024-01-31\nOpen: 187.000000\nHigh: 193.970001\nLow: 185.850006\nClose: 187.289993\nAdj Close: 187.289993\nVolume: 103221400', metadata={'sour

This technically worked, but it is not what we want. The `last_accessed_at` value was updated on retrieval, so it no longer holds the stock's date, and ranking by recency will hurt performance when asking about earlier dates. Let's try agents.
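
For intuition, LangChain's documentation describes the score this retriever ranks by as `semantic_similarity + (1.0 - decay_rate) ** hours_passed`, where `hours_passed` is counted from the document's `last_accessed_at`. The sketch below reproduces only the recency term under that assumption; `recency_score` is a hypothetical helper, not the library's code.

```python
from datetime import datetime

def recency_score(last_accessed_at, now, decay_rate=0.005):
    """Recency term (1 - decay_rate) ** hours_passed, per the documented formula."""
    hours_passed = (now - last_accessed_at).total_seconds() / 3600
    return (1.0 - decay_rate) ** hours_passed

now = datetime(2024, 2, 2, 23, 0)
for day in (datetime(2024, 2, 2), datetime(2024, 2, 1), datetime(2021, 8, 3)):
    # Older trading dates get a smaller recency term.
    print(day.date(), recency_score(day, now))
```

With `decay_rate=0.005` the term shrinks quickly with age, so the most recent rows win regardless of which date the question asks about. And once a retrieved document's `last_accessed_at` is reset to the query time, its recency term goes back to 1.0, independent of the trading date.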

## Agent-based Approaches

#### LangChain CSV / Pandas Agents

Here we will show simple use of the LangChain [csv](https://python.langchain.com/docs/integrations/toolkits/csv) and [Pandas](https://python.langchain.com/docs/integrations/toolkits/pandas) agents. The csv agent uses the Pandas agent under the hood, so we will just use the Pandas agent directly.

In [6]:
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI

from dotenv import load_dotenv
# Load environment variables
load_dotenv()

True

In [11]:
tsla_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2010-06-29,1.266667,1.666667,1.169333,1.592667,1.592667,281494500
1,2010-06-30,1.719333,2.028,1.553333,1.588667,1.588667,257806500
2,2010-07-01,1.666667,1.728,1.351333,1.464,1.464,123282000
3,2010-07-02,1.533333,1.54,1.247333,1.28,1.28,77097000
4,2010-07-06,1.333333,1.333333,1.055333,1.074,1.074,103003500


In [18]:
def try_agent(agent, prompt, verbose=True):
    response = None
    try:
        response = agent.invoke(prompt, verbose=verbose)
        print(response)
    except Exception as e:
        print(e)
    return response

#### First we'll try the ZERO_SHOT_REACT_DESCRIPTION agent

**Warning:** this does not work well

**Also, it works much better with LangChain's [ChatOpenAI](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/chat_models/openai.py#L351) than with LangChain's [OpenAI](https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/openai.py#L731), the class used in the zero-shot example**

In [238]:
llm = ChatOpenAI(temperature=0, model="gpt-4-1106-preview")
# llm = OpenAI(temperature=0, model="gpt-4-1106-preview")
agent = create_pandas_dataframe_agent(llm, tsla_stock, agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)

In [239]:
response = try_agent(agent, "What was the most recent close price for TSLA?", verbose=True)



> Entering new AgentExecutor chain...
Thought: To find the most recent close price for TSLA, I need to look at the last row of the dataframe and get the value from the 'Close' column.

Action: python_repl_ast
Action Input: df.tail(1)['Close'].iloc[0]
187.910004
I now know the final answer.

Final Answer: The most recent close price for TSLA was $187.91.

> Finished chain.
{'input': 'What was the most recent close price for TSLA?', 'output': 'The most recent close price for TSLA was $187.91.'}


In [240]:
response = try_agent(agent, "Does the change in open or close prices have a larger standard deviation?", verbose=True)



> Entering new AgentExecutor chain...
Thought: To answer this question, I need to calculate the standard deviation of the changes in open prices and the changes in close prices. The change can be calculated by subtracting the previous day's price from the current day's price. I will use the `diff()` method to calculate the changes and then the `std()` method to calculate the standard deviation for both 'Open' and 'Close' columns.

Action: Calculate the standard deviation of the changes in open prices.
Action Input: df['Open'].diff().std()
Calculate the standard deviation of the changes in open prices. is not a valid tool, try one of [python_repl_ast].
I need to use the python_repl_ast tool to execute the Python command.

Action: python_repl_ast
Action Input: df['Open'].diff().std()
4.945367157862929
The standard deviation of the changes in open prices is approximately 4.945.

Action: Calculate the standard deviation

That sort of worked when I ran it, though it sometimes errors. Sometimes we can also see it enter an extra step.

We can look at [LangChain's default code](https://github.com/langchain-ai/langchain/blob/master/libs/experimental/langchain_experimental/agents/agent_toolkits/pandas/base.py) and [prompts](https://github.com/langchain-ai/langchain/blob/master/libs/experimental/langchain_experimental/agents/agent_toolkits/pandas/prompt.py) to see how we can improve it and work around bugs.

The issue is partly related to a bug in LangChain. Let's use a workaround and define our own prompt (where we will run into another, very minor bug).

- **Bug 1:** The stop token in the ReAct agent is `\nObservation`, but it needs to be `\nFinal Answer` for the Pandas agent
- **Bug 2:** `create_pandas_dataframe_agent` requires `include_df_in_prompt` to be `None` rather than `False`.


Building our own prompts for the agent:

In [241]:
PREFIX = """
You are a large language model being used in an agent workflow involving questions, thoughts, use of tools, and actions using those tools. One of those tools allows you to work with a pandas dataframe in Python. The name of the dataframe is `df`.
You should use the tools below to answer the question posed of you:\n"""

In [242]:
tsla_stock.describe(include='all')
#add first 2 rows and last 2 rows to describe
df_describe = tsla_stock.describe(include='all')
df_describe.loc['1st_row'] = tsla_stock.iloc[0]
df_describe.loc['2nd_row'] = tsla_stock.iloc[1]
df_describe.loc['2nd_last_row'] = tsla_stock.iloc[-2]
df_describe.loc['last_row'] = tsla_stock.iloc[-1]
df_describe

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
count,3423,3423.0,3423.0,3423.0,3423.0,3423.0,3423.0
unique,3423,,,,,,
top,2010-06-29,,,,,,
freq,1,,,,,,
mean,,71.463546,73.027288,69.772792,71.436731,71.436731,96904990.0
std,,101.915934,104.171288,99.425029,101.828184,101.828184,79806460.0
min,,1.076,1.108667,0.998667,1.053333,1.053333,1777500.0
25%,,10.870667,11.166334,10.688667,10.945,10.945,46230000.0
50%,,17.025333,17.266666,16.719334,16.999332,16.999332,81603000.0
75%,,122.324997,126.674999,119.958332,123.389999,123.389999,123274800.0


In [243]:
SUFFIX_WITH_DF_DESCRIBE = """
You must STOP once you found the final answer. This is the result of running:

df_describe = df.describe(include='all')
df_describe.loc['1st_row'] = df.iloc[0]
df_describe.loc['2nd_row'] = df.iloc[1]
df_describe.loc['2nd_last_row'] = df.iloc[-2]
df_describe.loc['last_row'] = df.iloc[-1]
print(df_describe.to_markdown())
del df_describe:
{df_describe}

Please note that python_repl_ast cannot use df_decribe, but can use df.

Begin!
Question: {{input}}
{{agent_scratchpad}}"""

In [244]:
df_describe_str = str(df_describe.to_markdown())#json.dumps((df_describe.to_json())).replace("{", "{{").replace("}", "}}")
partial_format = SUFFIX_WITH_DF_DESCRIBE.format(df_describe=df_describe_str)
SUFFIX_FINAL = partial_format.replace("{{input}}", "{input}").replace("{{agent_scratchpad}}", "{agent_scratchpad}")
print(SUFFIX_FINAL)


You must STOP once you found the final answer. This is the result of running:

df_describe = df.describe(include='all')
df_describe.loc['1st_row'] = df.iloc[0]
df_describe.loc['2nd_row'] = df.iloc[1]
df_describe.loc['2nd_last_row'] = df.iloc[-2]
df_describe.loc['last_row'] = df.iloc[-1]
print(df_describe.to_markdown())
del df_describe:
|              | Date       |       Open |       High |         Low |      Close |   Adj Close |         Volume |
|:-------------|:-----------|-----------:|-----------:|------------:|-----------:|------------:|---------------:|
| count        | 3423       | 3423       | 3423       | 3423        | 3423       |  3423       | 3423           |
| unique       | 3423       |  nan       |  nan       |  nan        |  nan       |   nan       |  nan           |
| top          | 2010-06-29 |  nan       |  nan       |  nan        |  nan       |   nan       |  nan           |
| freq         | 1          |  nan       |  nan       |  nan        |  nan       |   nan   
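
The doubled braces in the suffix rely on standard `str.format` escaping: `{{` and `}}` collapse to single braces on the first pass, so `{input}` and `{agent_scratchpad}` survive as placeholders for a later pass. A minimal sketch of that two-pass idea, with a hypothetical `TEMPLATE` and a made-up one-row table:

```python
# Template with one slot to fill now ({table}) and two to keep for later.
TEMPLATE = "Stats:\n{table}\n\nQuestion: {{input}}\n{{agent_scratchpad}}"

# First pass: fills {table}; each doubled brace collapses to a single brace.
partial = TEMPLATE.format(table="| count | 3 |")
print(partial)

# Second pass (what a prompt template would do later) fills the remaining slots.
final = partial.format(input="What was the last close?", agent_scratchpad="")
print(final)
```

If this reading is right, the `.replace` calls after `.format` in the cell above have nothing left to change. One caveat: braces inside the substituted value are not escaped by the first pass, so a table containing `{` or `}` would need them doubled before the second pass (as the commented-out JSON variant does).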

Now, let's use a workaround for the LangChain bug:

In [245]:
llm = ChatOpenAI(temperature=0, model="gpt-4-1106-preview")

#Bug 1 workaround: the stop token in the React Agent is \nObservation, but needs to be \nFinal Answer for Pandas agent
llm_with_stop = llm.bind(stop=["\nFinal Answer"])

In [246]:
#Bug 2 workaround:
agent = create_pandas_dataframe_agent(llm_with_stop, tsla_stock, include_df_in_prompt = None, #handle_parsing_errors=True,
                                      agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION, prefix=PREFIX, suffix=SUFFIX_FINAL, verbose=True)

In [247]:
response = try_agent(agent, "What was the most recent close price for TSLA?", verbose=True)



> Entering new AgentExecutor chain...
Thought: The most recent close price for TSLA can be found in the 'Close' column of the last row of the dataframe `df`.

Action: python_repl_ast
Action Input: df['Close'].iloc[-1]
187.910004
I now know the final answer.
Final Answer: The most recent close price for TSLA was $187.91.

> Finished chain.
{'input': 'What was the most recent close price for TSLA?', 'output': 'The most recent close price for TSLA was $187.91.'}


In [248]:
response = try_agent(agent, "Does the change in open or close prices have a larger standard deviation?", verbose=True)



> Entering new AgentExecutor chain...
Thought: To answer this question, I need to compare the standard deviation of the 'Open' prices with the standard deviation of the 'Close' prices.

Action: python_repl_ast
Action Input: df['Open'].std(), df['Close'].std()
(101.91593442111841, 101.82818449060503)
I now know the final answer
Final Answer: The 'Open' prices have a slightly larger standard deviation than the 'Close' prices.

> Finished chain.
{'input': 'Does the change in open or close prices have a larger standard deviation?', 'output': "The 'Open' prices have a slightly larger standard deviation than the 'Close' prices."}


That failed to take the difference, and I had to change the PREFIX and SUFFIX significantly to get it to work even some of the time. We could take another look at the LangChain code, but let's move on.
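
To spell out what "taking the difference" means, the sketch below contrasts the standard deviation of price levels with that of day-over-day changes. It uses the standard library on made-up prices; the pandas expressions in the comments are the ones the agents produced above, and both use the sample (n-1) standard deviation.

```python
from statistics import stdev

# Toy open prices (hypothetical values, not TSLA data).
opens = [10.0, 12.0, 11.0, 15.0, 14.0]

# Standard deviation of the price levels: what df['Open'].std() computes.
level_std = stdev(opens)

# Standard deviation of day-over-day changes: what df['Open'].diff().std() computes.
changes = [b - a for a, b in zip(opens, opens[1:])]
change_std = stdev(changes)

print(level_std, change_std)
```

The two numbers answer different questions, so an agent that reports the first when asked about "the change in prices" has not answered the question, even if its code ran without errors.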

#### Let's now try the OPENAI_FUNCTIONS agent in LangChain

**Note:** This works very badly with "gpt-4-1106-preview", but a bit better with "gpt-4" and "gpt-3.5-turbo-0613"

In [258]:
llm = ChatOpenAI(temperature=0, model="gpt-4")
agent = create_pandas_dataframe_agent(llm, tsla_stock, verbose=True, agent_type=AgentType.OPENAI_FUNCTIONS)

In [259]:
response = try_agent(agent, "What was the most recent close price for TSLA?", verbose=True)



> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': "df['Close'].iloc[-1]"}`


187.910004
The most recent close price for TSLA was $187.91.

> Finished chain.
{'input': 'What was the most recent close price for TSLA?', 'output': 'The most recent close price for TSLA was $187.91.'}


In [260]:
response = try_agent(agent, "Does the change in open or close prices have a larger standard deviation?", verbose=True)



> Entering new AgentExecutor chain...


Invoking: `python_repl_ast` with `{'query': "df['Open'].diff().std(), df['Close'].diff().std()"}`


(4.945367157862929, 4.678933840976249)
The standard deviation of the change in open prices is larger than the standard deviation of the change in close prices.

> Finished chain.
{'input': 'Does the change in open or close prices have a larger standard deviation?', 'output': 'The standard deviation of the change in open prices is larger than the standard deviation of the change in close prices.'}


Let's set LangChain agents aside for now, as they are a bit of a pain ...

### Let's try Agents with [Pandas AI](https://docs.pandas-ai.com/en/latest/)

In [2]:
import pandas as pd
from pandasai import SmartDatalake
from pandasai import SmartDataframe
from pandasai.llm import OpenAI
from pandasai import Agent
from pandasai.skills import skill

In [8]:
llm = OpenAI(model="gpt-4-1106-preview", verbose=True)

Let's first just use the simple agent with SmartDataframe

In [15]:
agent = SmartDataframe(tsla_stock, description ="TSLA Stock Market Data", config={"llm": llm})
agent.chat("Does the change in open or close prices have a larger standard deviation?")

'Close prices have a larger standard deviation.'

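That actually did not work at all. The answer contradicts the values computed earlier (about 4.945 for changes in open vs. 4.679 for changes in close), and looking at the log below shows it just made up the answer.

Let's show something a bit more interesting using the Agent framework from Pandas-AI, which allows us to directly inform our LLM of "skills" we have added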

In [34]:
# The function's docstring gives the model more context on how to use this skill
@skill
def get_capm_params(stock_data: pd.DataFrame, benchmark_data: pd.DataFrame):
    """
    This function estimates an Intercept(alpha) and a beta by regressing the stock returns on the benchmark returns.
    This is the Capital Asset Pricing Model (CAPM) model.
    Args:
        stock_data: A Pandas DataFrame with stock data
        benchmark_data: A Pandas DataFrame with the benchmark's data
    Returns:
        alpha: Intercept
        beta: Beta
    """
    from sklearn.linear_model import LinearRegression
    import pandas as pd
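    #work on copies so the caller's DataFrames are not modified in place
    stock_data = stock_data.copy()
    benchmark_data = benchmark_data.copy()
    #calculate stock returns
    stock_data['Return'] = stock_data['Close'].pct_change()
    stock_data.dropna(inplace=True)
    #calculate benchmark returns
    benchmark_data['Return'] = benchmark_data['Close'].pct_change()
    benchmark_data.dropna(inplace=True)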
    #perform inner join to get common dates
    stock_data = stock_data.merge(benchmark_data, on='Date', how='inner', suffixes=('_stock', '_benchmark'))
    X = stock_data[['Return_benchmark']]
    y = stock_data['Return_stock']
    model = LinearRegression().fit(X, y)
    alpha = model.intercept_
    beta = model.coef_[0]
    return alpha, beta

agent = Agent([tsla_stock, spy_etf], config={"llm": llm}, memory_size=10, description="A list of 2 DataFrames: TSLA stock data and SPY ETF data respectively.")

agent.add_skills(get_capm_params)

In [44]:
get_capm_params(tsla_stock, spy_etf)

(0.0013747603371657858, 1.4142763404820184)

In [36]:
# Chat with the agent
response = agent.chat("What is the Beta of TSLA? You can use the SPY ETF as the benchmark and use the CAPM model.")
print(response)

1.4098669332856566


We got roughly the right answer, but the beta is slightly different (about 1.4099 vs. 1.4143) - let's check the logs ...

This appears to be just estimation sensitivity, so all good!
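
For reference, the slope that `LinearRegression` returns in a one-variable fit is the covariance of the two return series divided by the variance of the benchmark returns, and the intercept follows from the means. The sketch below shows that calculation with the standard library on made-up returns; `capm_alpha_beta` is a hypothetical helper. Like the skill above, it regresses raw returns, whereas the textbook CAPM regression uses returns in excess of a risk-free rate.

```python
def capm_alpha_beta(stock_returns, benchmark_returns):
    """One-variable OLS: beta = cov(stock, benchmark) / var(benchmark)."""
    n = len(stock_returns)
    if n != len(benchmark_returns) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_s = sum(stock_returns) / n
    mean_b = sum(benchmark_returns) / n
    cov = sum((s - mean_s) * (b - mean_b)
              for s, b in zip(stock_returns, benchmark_returns))
    var = sum((b - mean_b) ** 2 for b in benchmark_returns)
    if var == 0:
        raise ValueError("benchmark returns have zero variance")
    beta = cov / var
    alpha = mean_s - beta * mean_b  # intercept of the fitted line
    return alpha, beta

# Made-up returns where the stock is exactly 0.001 + 1.5 * benchmark.
benchmark = [0.01, -0.02, 0.03, 0.00]
stock = [0.001 + 1.5 * b for b in benchmark]
alpha, beta = capm_alpha_beta(stock, benchmark)
print(alpha, beta)
```

Because the estimate depends on exactly which rows enter the sums, a different date range or return definition in the agent's generated code could plausibly move beta by a small amount like the one seen above.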

In [37]:
#print pandasai.log file to see the logs
# with open("pandasai.log", "r") as f:
#     print(f.read())

This seems promising! So far we have only shown fairly simple agents, and they can satisfy requirements as long as the users can read the code and understand all the intermediate steps.

Based on trying a few other prompts and tasks, as well as diagnosing all of the above to make it work, I believe a more robust approach is needed to make this work for an end-user who might not read code or understand intermediate steps. This may be possible in LangChain or with Pandas-AI, but there is also the option of building from scratch or using an existing full-scale agent framework such as [AutoGen](https://github.com/microsoft/autogen), [AutoGPT](https://github.com/Significant-Gravitas/AutoGPT), [OpenAgents](https://github.com/xlang-ai/OpenAgents) and [crewAI](https://github.com/joaomdmoura/crewAI)