# Q&A against Tabular Data from a CSV file  (experimental)

To build a Smart Search Engine or virtual assistant that can answer any question about your corporate documents, the engine must understand tabular data, that is, sources made of tables with rows and columns of numbers. 
This is a different problem than simply looking for the most similar results. The usual pipeline of indexing, embedding, running a cosine-similarity semantic search, retrieving the top results and summarizing an answer doesn't really apply here.
We are now dealing with tables in which every row and column is related to the others, so answering a question requires all of the data, not just the top results.
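
To make the difference concrete, here is a minimal sketch in plain Python. The rows, the `word_overlap` similarity and the `top_k` helper are all invented for illustration and are much cruder than a real embedding search, but they show why an aggregate computed over only the retrieved rows can be wrong.

```python
def word_overlap(query, text):
    """Crude similarity: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def top_k(rows, query, k):
    """Return the k rows whose 'item' text best matches the query."""
    ranked = sorted(rows, key=lambda r: word_overlap(query, r["item"]), reverse=True)
    return ranked[:k]


# Toy table; the values are made up for this example.
rows = [
    {"state": "NSW", "item": "dog house medium", "units": 10},
    {"state": "VIC", "item": "dog house medium", "units": 20},
    {"state": "QLD", "item": "dog house large", "units": 30},
    {"state": "WA", "item": "dog house large", "units": 40},
]

query = "total units of dog house"

# Retrieval-style answer: only the top 2 matching rows are seen.
retrieved = top_k(rows, query, k=2)
partial_total = sum(r["units"] for r in retrieved)

# Tabular answer: every matching row contributes to the total.
full_total = sum(r["units"] for r in rows if "dog house" in r["item"])

print("from top-k rows:", partial_total)
print("from all rows:  ", full_total)
```

All four rows match the query equally well, so any top-2 cut drops half of the data and the total comes out too low.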

In this notebook, the goal is to show how to deal with this kind of use case. To continue with our Covid-19 theme, we will be using an open dataset called ["Covid Tracking Project"](https://covidtracking.com/data/download). The COVID Tracking Project dataset is a CSV file that provides the latest numbers on tests, confirmed cases, hospitalizations, and patient outcomes from every US state and territory (they stopped tracking on March 7, 2021).

Imagine that many documents in a data lake are tabular data, or that your use case is to ask questions in natural language to an LLM, and the model needs to get its context from a CSV file or even a SQL database in order to answer. A GPT Smart Search Engine must understand how to deal with these sources, understand the data and answer accordingly.

In [3]:
import os
import pandas as pd
from langchain.llms import AzureOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.agents import create_pandas_dataframe_agent
from langchain.agents import create_csv_agent

from common.prompts import CSV_PROMPT_PREFIX, CSV_PROMPT_SUFFIX

from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv("credentials2.env")

def printmd(string):
    display(Markdown(string))

In [4]:
pd.set_option('display.max_columns', None)

In [5]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_BASE"] = os.environ["AZURE_OPENAI_ENDPOINT"]
os.environ["OPENAI_API_KEY"] = os.environ["AZURE_OPENAI_API_KEY"]
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]
os.environ["OPENAI_API_TYPE"] = "azure"
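
The assignments above raise a `KeyError` if one of the `AZURE_OPENAI_*` variables is missing from the credentials file. A minimal sketch of a check that could run first is shown below; `missing_env_vars` is a hypothetical helper written for this example and is not part of LangChain or the OpenAI SDK.

```python
import os

# Names taken from the cell above.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Passing `env` explicitly makes the helper easy to try without touching the real environment.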

## Download the dataset and load it into Pandas Dataframe

In [6]:
#os.makedirs("data",exist_ok=True)

In [7]:
#!wget https://covidtracking.com/data/download/all-states-history.csv -P ./data/

In [8]:
file_url = "./data/updatedbunnsfullstockreport.csv"
df = pd.read_csv(file_url).fillna(value = 0)
print("Rows and Columns:",df.shape)
df.head()

Rows and Columns: (41951, 179)


Unnamed: 0,Id,CountryCode,Fineline,Team_Member,Supersession_Next_Item,Supersession_Previous_Item,DeptName,Sub_Dept,Item_Description,Item_Band,Planning_Status,StatusDC,State,Ethical_Sourcing_Approved,DC,Transfer_Pack,Units_per_Pallet,Total_Min,Total_Max,Number_of_stores_ranged_for_the_item,Shipping_Terms,Supplier_Name,Stock_Unit_Cost_Zone1_(ExTax)_(based_on_last_cost_product_landed_at),Metro_Retail,Margin,Qty_20ft,Qty_40ft,Qty_40hq,MOQ,SOH_Units_In_Store,SOH_Units_In_Transit,SOH_Units_in_DC_(inc._Rec.),SOH_units_DC_Avail,Store_Orders_=_Pending_Ords_+_On_Pick,Network,3PL_SOH,Sell_Through,Pallet_Count_in_Store,Pallet_Count_In_Transit,Pallet_Count_in_DC,Pallet_Count_in_Network,WOH_future_forecast_in_Store,WOH_future_forecast__in_DC,WOH_future_forecast__in_Network,WOH_future_forecast_Store_Orders,WOH_Orders_Fullfilled_in_Store,WOH_based_on_Store_Orders_Fullfilled_DC,WOH_8_Weeks_COGNOS_Sales_in_Store,WOH_8_Weeks_COGNOS_Sales_in_DC,WOH_8_Weeks_COGNOS_Sales_in_Network,8_Weeks_COGNOS_Sales,Sales_Units_MTD,Sales_Units_MTD_%,Sales_Units_YTD,Forecast_May_2023,Forecast_Jun_2023,Forecast_Jul_2023,Forecast_Aug_2023,Forecast_Sep_2023,Forecast_Oct_2023,Forecast_Nov_2023,Forecast_Dec_2023,Forecast_Jan_2023,Forecast_Feb_2023,Forecast_Mar_2023,Forecast_Apr_2023,12_Month_Forecast,Orders_Past_Due,Orders_Due_into_DC_4/06/2023,Orders_Due_into_DC_11/06/2023,Orders_Due_into_DC_18/06/2023,Orders_Due_into_DC_25/06/2023,Orders_Due_into_DC_2/07/2023,Orders_Due_into_DC_9/07/2023,Orders_Due_into_DC_16/07/2023,Orders_Due_into_DC_23/07/2023,Orders_Due_into_DC__30/07/2023,Orders_Due_into_DC_6/08/2023,Orders_Due_into_DC_13/08/2023,Orders_Due_into_DC_20/08/2023,Orders_Due_into_DC_27/08/2023,Orders_Due_into_DC_3/09/2023,Orders_Due_into_DC__10/09/2023,Orders_Due_into_DC__17/09/2023,Orders_Due_into_DC__24/09/2023,Orders_Due_into_DC__1/10/2023,Orders_Due_into_DC_8/10/2023,Orders_Due_into_DC_15/10/2023,Orders_Due_into_DC_22/10/2023,Orders_Due_into_DC_29/10/2023,Orders_Due_into_DC_5/11/2023,Orders_Due_into_D
C_12/11/2023,Orders_Due_into_DC_19/11/2023,Orders_Due_into_DC_26/11/2023,Orders_Due_into_DC_3/12/2023,Orders_Due_into_DC_10/12/2023,Orders_Due_into_DC_17/12/2023,Orders_Due_into_DC_24/12/2023,Orders_Due_into_DC_31/12/2023,Orders_Due_into_DC_7/01/2024,Orders_Due_into_DC_14/01/2024,Orders_Due_into_DC_21/01/2024,Orders_Due_into_DC_28/01/2024,Orders_Due_into_DC_4/02/2024,Orders_Due_into_DC_11/02/2024,Orders_Due_into_DC_18/02/2024,Orders_Due_into_DC_25/02/2024,Orders_Due_into_DC_3/03/2024,Orders_Due_into_DC_10/03/2024,Orders_Due_into_DC_17/03/2024,Orders_Due_into_DC_24/03/2024,Orders_Due_into_DC_31/03/2024,Orders_Due_into_DC_7/04/2024,Orders_Due_into_DC_14/04/2024,Orders_Due_into_DC_21/04/2024,Orders_Due_into_DC_28/04/2024,Orders_Due_into_DC_5/05/2024,Orders_Due_into_DC_12/05/2024,Orders_Due_into_DC_19/05/2024,Orders_Due_into_DC_26/05/2024,Later,Orders_Due_into_DC_TotalOrders,$_in_DC,$_in_Stores,$_In_Transit,SOH_Value_Network,DC_Weeks_On_Hand_04_Jun,DC_Weeks_On_Hand_11_Jun,DC_Weeks_On_Hand_18_Jun,DC_Weeks_On_Hand_25_Jun,DC_Weeks_On_Hand_02_Jul,_DC_Weeks_On_Hand_09_Jul,DC_Weeks_On_Hand_16_Jul,DC_Weeks_On_Hand_23_Jul,DC_Weeks_On_Hand_30_Jul,DC_Weeks_On_Hand_06_Aug,DC_Weeks_On_Hand_13_Aug,DC_Weeks_On_Hand_20_Aug,DC_Weeks_On_Hand_27_Aug,DC_Weeks_On_Hand_03_Sep,DC_Weeks_On_Hand_10_Sep,DC_Weeks_On_Hand_17_Sep,DC_Weeks_On_Hand_24_Sep,DC_Weeks_On_Hand_01_Oct,DC_Weeks_On_Hand_08_Oct,DC_Weeks_On_Hand_15_Oct,DC_Weeks_On_Hand_22_Oct,DC_Weeks_On_Hand_29_Oct,DC_Weeks_On_Hand_05_Nov,DC_Weeks_On_Hand_12_Nov,DC_Weeks_On_Hand_19_Nov,DC_Weeks_On_Hand_26_Nov,DC_Weeks_On_Hand_03_Dec,DC_Weeks_On_Hand_10_Dec,DC_Weeks_On_Hand_17_Dec,DC_Weeks_On_Hand_24_Dec,DC_Weeks_On_Hand_31_Dec,DC_Weeks_On_Hand_07_Jan,DC_Weeks_On_Hand_14_Jan,DC_Weeks_On_Hand_21_Jan,DC_Weeks_On_Hand_28_Jan,DC_Weeks_On_Hand_04_Feb,DC_Weeks_On_Hand_11_Feb,DC_Weeks_On_Hand_18_Feb,DC_Weeks_On_Hand_25_Feb,DC_Weeks_On_Hand_04_Mar,DC_Weeks_On_Hand_11_Mar,DC_Weeks_On_Hand_18_Mar,DC_Weeks_On_Hand_25_Mar,DC_Weeks_On_Hand_01_Apr,DC_Weeks
_On_Hand_08_Apr,DC_Weeks_On_Hand_15_Apr,DC_Weeks_On_Hand_22_Apr,DC_Weeks_On_Hand_29_Apr,DC_Weeks_On_Hand_06_May,DC_Weeks_On_Hand_13_May,DC_Weeks_On_Hand_20_May,DC_Weeks_On_Hand_27_May,Next_Order_Due_Date
0,0,1280NZ,1280,Team Member 1,0.0,0.0,300 TOOL ACCESSORIES,400 DRILLING,NESTED DRILL/DRVR++SETS 6360177-180,Band E,0,Live,NZ,0,9130,1,160,47,90,43.0,0,0,239.5,479.0,0.5,0.0,0.0,0.0,0.0,9991,0.0,0.0,0,0.0,9991.0,0.0,0.0,0,0,0,0,999.0,999.0,1998.0,999.0,1998.0,0.0,0.0,0.0,0.0,2421,1210.5,0.0,13315.5,1109.625,1109.625,1109.625,1109.625,1109.625,1109.625,1109.625,1109.625,1109.625,1109.625,1109.625,1109.625,13315.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,0
1,1,1380NZ,1380,Team Member 2,0.0,0.0,300 CLEANING AND ACCESSORIES,400 CLEANING ACCESSORIES,NESTED CLOTHS++4460819-820 & 823,Band E,0,Promotion,NZ,0,9130,40,160,81,202,43.0,0,0,105.565,211.13,0.5,0.0,0.0,0.0,0.0,2711,0.0,0.0,0,0.0,2711.0,0.0,0.0,0,0,0,0,999.0,999.0,1998.0,999.0,1998.0,0.0,0.0,0.0,0.0,1577,788.5,0.0,8673.5,722.791667,722.791667,722.791667,722.791667,722.791667,722.791667,722.791667,722.791667,722.791667,722.791667,722.791667,722.791667,8673.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,0
2,2,1400NZ,1400,Team Member 3,0.0,0.0,300 TOOL ACCESSORIES,400 DRILLING,NESTED BIT SET/DRILL++222506/511/514/517,Band E,0,Promotion,NZ,0,9130,6,375,47,100,43.0,0,0,54.5,109.0,0.5,0.0,0.0,0.0,0.0,4709,0.0,0.0,0,0.0,4709.0,0.0,0.0,0,0,0,0,999.0,999.0,1998.0,999.0,1998.0,0.0,0.0,0.0,0.0,5676,2838.0,0.0,31218.0,2601.5,2601.5,2601.5,2601.5,2601.5,2601.5,2601.5,2601.5,2601.5,2601.5,2601.5,2601.5,31218.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,0
3,3,1430NZ,1430,Team Member 4,0.0,0.0,300 FURNITURE,400 FURNITURE ACCESSORIES,NESTED CUSHION ASSTD++3191892/894/895,Band E,0,Promotion,NZ,0,9130,6,182,45,96,41.0,0,0,113.0,226.0,0.5,0.0,0.0,0.0,0.0,5607,0.0,0.0,0,0.0,5607.0,0.0,0.0,0,0,0,0,999.0,999.0,1998.0,999.0,1998.0,0.0,0.0,0.0,0.0,9674,4837.0,0.0,53207.0,4433.916667,4433.916667,4433.916667,4433.916667,4433.916667,4433.916667,4433.916667,4433.916667,4433.916667,4433.916667,4433.916667,4433.916667,53207.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,0
4,4,1440NZ,1440,Team Member 5,0.0,0.0,300 FURNITURE,400 FURNITURE ACCESSORIES,NESTED CUSHION THROW++3192085/086,Band E,0,Promotion,NZ,0,9130,8,150,47,102,41.0,0,0,113.0,226.0,0.5,0.0,0.0,0.0,0.0,9952,0.0,0.0,0,0.0,9952.0,0.0,0.0,0,0,0,0,999.0,999.0,1998.0,999.0,1998.0,0.0,0.0,0.0,0.0,8787,4393.5,0.0,48328.5,4027.375,4027.375,4027.375,4027.375,4027.375,4027.375,4027.375,4027.375,4027.375,4027.375,4027.375,4027.375,48328.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,999.0,0


In [9]:
df.columns

Index(['Id', 'CountryCode', 'Fineline', 'Team_Member',
       'Supersession_Next_Item', 'Supersession_Previous_Item', 'DeptName',
       'Sub_Dept', 'Item_Description', 'Item_Band',
       ...
       'DC_Weeks_On_Hand_01_Apr', 'DC_Weeks_On_Hand_08_Apr',
       'DC_Weeks_On_Hand_15_Apr', 'DC_Weeks_On_Hand_22_Apr',
       'DC_Weeks_On_Hand_29_Apr', 'DC_Weeks_On_Hand_06_May',
       'DC_Weeks_On_Hand_13_May', 'DC_Weeks_On_Hand_20_May',
       'DC_Weeks_On_Hand_27_May', 'Next_Order_Due_Date'],
      dtype='object', length=179)
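
Some of the 179 column names carry stray underscores, for example `Orders_Due_into_DC__30/07/2023`, `_DC_Weeks_On_Hand_09_Jul` and `WOH_future_forecast__in_DC`. Names like these are easy for an LLM to mistype when it writes pandas code. One possible cleanup is sketched below, assuming it is acceptable to rename columns; `clean_column_name` is a hypothetical helper, not something the notebook itself uses.

```python
import re


def clean_column_name(name):
    """Collapse repeated underscores and trim underscores/spaces at both ends."""
    collapsed = re.sub(r"_+", "_", name.strip())
    return collapsed.strip("_")


# Column names copied from the header shown above.
raw_names = [
    "Orders_Due_into_DC__30/07/2023",
    "_DC_Weeks_On_Hand_09_Jul",
    "WOH_future_forecast__in_DC",
    "Supplier_Name",
]

cleaned_names = [clean_column_name(name) for name in raw_names]
for before, after in zip(raw_names, cleaned_names):
    print(f"{before!r} -> {after!r}")
```

On the dataframe this would be applied with `df.columns = [clean_column_name(c) for c in df.columns]`. It is worth checking afterwards that no two columns ended up with the same name.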

## Load our LLM and create our MRKL Agent

The implementation of Agents is inspired by two papers: the [MRKL Systems](https://arxiv.org/abs/2205.00445) paper (pronounced ‘miracle’ 😉) and the [ReAct](https://arxiv.org/abs/2210.03629) paper.

Agents are a way to leverage the ability of LLMs to understand and act on prompts. In essence, an Agent is an LLM that has been given a very clever initial prompt. The prompt tells the LLM to break down the process of answering a complex query into a sequence of steps that are resolved one at a time.

Agents become really cool when we combine them with ‘experts’, introduced in the MRKL paper. A simple example: an Agent might not have the inherent capability to reliably perform mathematical calculations by itself. However, we can introduce an expert, in this case a calculator, which is an expert at mathematical calculations. Now, when we need to perform a calculation, the Agent can call in the expert rather than trying to predict the result itself. This is actually the concept behind [ChatGPT Plugins](https://openai.com/blog/chatgpt-plugins).
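
As a rough illustration of the calculator expert, the sketch below registers one tool in a dictionary and dispatches to it by name. The names `calculator`, `EXPERTS` and `call_expert` are made up for this example; LangChain's own tool classes are organised differently.

```python
import ast
import operator

# Arithmetic operators the calculator is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def calculator(expression):
    """Evaluate a basic arithmetic expression without calling eval()."""

    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


# Registry of experts the agent may call by name.
EXPERTS = {"calculator": calculator}


def call_expert(name, expert_input):
    """Look up an expert by name and run it on the given input."""
    return EXPERTS[name](expert_input)


print(call_expert("calculator", "2 * (3 + 4)"))
```

The agent only has to produce the expert's name and an input string; the arithmetic itself is done by ordinary code rather than predicted by the model.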

In our case, in order to solve the problem "How do I ask questions to a tabular CSV file", we need this ReAct/MRKL approach, in which we instruct the LLM to use an 'expert/tool' in order to read, load, understand and interact with a tabular CSV file.
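
The loop behind this approach can be sketched in a few lines. Everything here is a simplified assumption: `scripted_llm` is a stand-in that returns canned replies instead of calling a model, `row_count` is an invented tool, and the parsing is far less robust than what a library such as LangChain does.

```python
import re

TABLE = [["a", 1], ["b", 2], ["c", 3]]

# One invented tool: it ignores its input and reports the table length.
TOOLS = {"row_count": lambda _tool_input: str(len(TABLE))}


def scripted_llm(prompt):
    """Stand-in for an LLM: picks a canned reply based on the prompt so far."""
    if "Observation:" not in prompt:
        return "Thought: I should count the rows.\nAction: row_count\nAction Input: table"
    return "Thought: I now know the final answer\nFinal Answer: The table has 3 rows."


def run_agent(question, llm, tools, max_steps=5):
    """Alternate between LLM replies and tool calls until a final answer appears."""
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(f"Question: {question}\n{scratchpad}")
        final = re.search(r"Final Answer:\s*(.*)", reply, re.DOTALL)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(.*?)\nAction Input:\s*(.*)", reply, re.DOTALL)
        if action is None:
            raise ValueError("could not parse LLM reply")
        tool_name = action.group(1).strip()
        tool_input = action.group(2).strip()
        observation = tools[tool_name](tool_input)
        # Feed the reply and the tool result back in on the next turn.
        scratchpad += f"{reply}\nObservation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")


answer = run_agent("How many rows does the table have?", scripted_llm, TOOLS)
print(answer)
```

The `ValueError` branch is where a real run fails when the model does not follow the expected format.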

OpenAI opened the world to a whole new concept, and libraries are being created at a rapid pace. We will be using [LangChain](https://docs.langchain.com/docs/) as our library to solve this problem; however, there are others that we also recommend: [HayStack](https://haystack.deepset.ai/) and [Semantic Kernel](https://learn.microsoft.com/en-us/semantic-kernel/whatissk).

In [10]:
import os
import openai
openai.api_type = "azure"
openai.api_base = "https://openai-demo-poc-3232.openai.azure.com/"
openai.api_version = "2023-03-15-preview"
openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.ChatCompletion.create(
  engine="chatgpt",
  messages = [{"role":"system","content":"You are an AI assistant that helps people find information."},{"role":"user","content":"hello"},{"role":"assistant","content":"Hello! How can I help you today?"}],
  temperature=0.7,
  max_tokens=800,
  top_p=0.95,
  frequency_penalty=0,
  presence_penalty=0,
  stop=None)

In [11]:
# A sample API call for chat completions looks as follows:
# Messages must be an array of message objects, where each object has a role (either "system", "user", or "assistant") and content (the content of the message).
# For more info: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/reference#chat-completions

try:
    response = openai.ChatCompletion.create(
                  engine="gpt4",
                  messages=[
                        {"role": "system", "content": "You are a helpful assistant."},
                        {"role": "user", "content": "Who won the world series in 2020?"}
                    ]
                )

    # print the response
    print(response['choices'][0]['message']['content'])
    
except openai.error.APIError as e:
    # Handle API error here, e.g. retry or log
    print(f"OpenAI API returned an API Error: {e}")

except openai.error.AuthenticationError as e:
    # Handle Authentication error here, e.g. invalid API key
    print(f"OpenAI API returned an Authentication Error: {e}")

except openai.error.APIConnectionError as e:
    # Handle connection error here
    print(f"Failed to connect to OpenAI API: {e}")

except openai.error.InvalidRequestError as e:
    # Handle invalid request error, e.g. a malformed or missing parameter
    print(f"Invalid Request Error: {e}")

except openai.error.RateLimitError as e:
    # Handle rate limit error
    print(f"OpenAI API request exceeded rate limit: {e}")

except openai.error.ServiceUnavailableError as e:
    # Handle Service Unavailable error
    print(f"Service Unavailable: {e}")

except openai.error.Timeout as e:
    # Handle request timeout
    print(f"Request timed out: {e}")
    
except Exception as e:
    # Handles all other exceptions
    print(f"An exception has occurred: {e}")

The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in six games.


In [26]:
#MODEL = "chatgpt" # options: gpt-35-turbo, gpt-4, gpt-4-32k
MODEL = "gpt4"
chatllm = AzureChatOpenAI(deployment_name=MODEL, temperature=0, 
                      max_tokens=350, 
                      streaming=True)

In [88]:
#QUESTION = "how many SOH In Stores for NSW state for ANTISLIP TAPE SYNECO++48MM 20M BLK 1369, who is the supplier, and what is the next order due Date"
#QUESTION = "how many SOH In Stores for all states for ANTISLIP TAPE, and who are the suppliers, and when are the next order due Dates"
QUESTION = "who are the suppliers for DOG HOUSE in NSW and how much SOH In Stores is there"

In [89]:
from langchain.prompts import PromptTemplate
# Now we create a simple prompt template
prompt = PromptTemplate(
    input_variables=["question", "language"],
    template='Answer the following question: "{question}". Give your response in {language}',
)

print(prompt.format(question=QUESTION, language="French"))

Answer the following question: "who are the suppliers for DOG HOUSE in NSW and how much SOH In Stores is there". Give your response in French


In [90]:
from langchain.chains import LLMChain
# And finally we create our first generic chain
chain_chat = LLMChain(llm=chatllm, prompt=prompt)
chain_chat({"question": QUESTION, "language": "French"})

{'question': 'who are the suppliers for DOG HOUSE in NSW and how much SOH In Stores is there',
 'language': 'French',
 'text': 'Les fournisseurs de DOG HOUSE en Nouvelle-Galles du Sud sont [nom des fournisseurs]. Le stock en magasin (SOH) est de [quantité] unités.'}

In [91]:
# First we load our LLM: GPT-4 (you are welcome to try GPT-3.5-Turbo. You will see that GPT-3.5 
# does not have the cognitive capabilities to solve a complex question and will make mistakes)
#llm = AzureChatOpenAI(deployment_name="gpt4", temperature=0, max_tokens=500)

Now we need our agent and our expert/tool.  
LangChain provides out-of-the-box agents that we can use to solve our problem of Q&A over a tabular CSV file. For more information about the **CSV Agent**, click [HERE](https://python.langchain.com/en/latest/modules/agents/toolkits/examples/csv.html)

In [92]:
agent_executor = create_pandas_dataframe_agent(llm=chatllm,df=df,verbose=True)

In [93]:
agent_executor.agent.allowed_tools

['python_repl_ast']

In [94]:
printmd(agent_executor.agent.llm_chain.prompt.template)


You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
You should use the tools below to answer the question posed of you:

python_repl_ast: A Python shell. Use this to execute python commands. Input should be a valid python command. When using this tool, sometimes output is abbreviated - make sure it does not look abbreviated before using it in your answer.

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [python_repl_ast]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question


This is the result of `print(df.head())`:
{df_head}

Begin!
Question: {input}
{agent_scratchpad}
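
The three placeholders in the template are filled in on every turn. The sketch below shows the idea with plain string formatting. `TEMPLATE` is a shortened copy of the prompt above, and `build_prompt` is a hypothetical helper; how LangChain assembles the scratchpad internally is assumed here from the printed format, not taken from its source.

```python
# Shortened copy of the prompt above, keeping its three placeholders.
TEMPLATE = (
    "You are working with a pandas dataframe in Python. The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_head}\n"
    "\n"
    "Begin!\n"
    "Question: {input}\n"
    "{agent_scratchpad}"
)


def build_prompt(df_head, question, steps):
    """Fill the template; `steps` holds (llm_text, observation) pairs so far."""
    scratchpad = "".join(f"{text}\nObservation: {obs}\nThought:" for text, obs in steps)
    return TEMPLATE.format(df_head=df_head, input=question, agent_scratchpad=scratchpad)


head_text = "  State  Units\n0   NSW    808"

# First turn: nothing has happened yet, so the scratchpad is empty.
first_prompt = build_prompt(head_text, "How many rows are there?", [])

# Second turn: the previous action and its observation are appended.
steps = [("Thought: count the rows\nAction: python_repl_ast\nAction Input: len(df)", "1")]
second_prompt = build_prompt(head_text, "How many rows are there?", steps)

print(second_prompt)
```

Because `{df_head}` only contains the first rows, the model never sees the full table in the prompt; it has to reach the rest through the `python_repl_ast` tool.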

## Enjoy the response and the power of GPT-4 + ReAct/MRKL approach

In [96]:
# We retry up to N times because:
# 1) GPT-4 is still in preview and the API is heavily throttled, and
# 2) the LLM does not always answer in the exact format the agent expects, so the output cannot be parsed

for i in range(5):
    try:
        response = agent_executor.run(CSV_PROMPT_PREFIX + QUESTION + CSV_PROMPT_SUFFIX) 
        break
    except Exception:
        # Catch Exception instead of using a bare except so that Ctrl+C still stops the loop
        response = "Error too many failed retries"
        continue
        
print(response)



> Entering new AgentExecutor chain...
Thought: First, I need to set the pandas display options to show all the columns.
Action: python_repl_ast
Action Input: pd.set_option('display.max_columns', None)
Observation: 
Thought: Now that I have set the display options, I need to get the column names of the dataframe.
Action: python_repl_ast
Action Input: df.columns
Observation: Index(['Id', 'CountryCode', 'Fineline', 'Team_Member',
       'Supersession_Next_Item', 'Supersession_Previous_Item', 'DeptName',
       'Sub_Dept', 'Item_Description', 'Item_Band',
       ...
       'DC_Weeks_On_Hand_01_Apr', 'DC_Weeks_On_Hand_08_Apr',
       'DC_Weeks_On_Hand_15_Apr', 'DC_Weeks_On_Hand_22_Apr',
       'DC_Weeks_On_Hand_29_Apr', 'DC_Weeks_On_Hand_06_May',
       'DC_Weeks_On_Hand_13_May', 'DC_Weeks_On_Hand_20_May',
       'DC_Weeks_On_Hand_27_May', 'Next_Order_Due_Date'],
      dtype='object', length=179)
Thought:

> En

## Evaluation
Let's see if the answer is correct

In [97]:
#check the dataframe for the answer
df.loc[df['Item_Description'].str.contains('DOG HOUSE'),['DeptName','Sub_Dept','State','Item_Description','Supplier_Name','SOH_Units_In_Store']]

Unnamed: 0,DeptName,Sub_Dept,State,Item_Description,Supplier_Name,SOH_Units_In_Store
96,300 OUTDOOR PLAY KIDS AND PETS,400 PET CARE,WA,DOG HOUSE++75X59X66CM MEDIUM LPDH02^,Supplier 123,1059
97,300 OUTDOOR PLAY KIDS AND PETS,400 PET CARE,VIC,DOG HOUSE++75X59X66CM MEDIUM LPDH02^,Supplier 123,8235
98,300 OUTDOOR PLAY KIDS AND PETS,400 PET CARE,NSW,DOG HOUSE++75X59X66CM MEDIUM LPDH02^,Supplier 123,808
99,300 OUTDOOR PLAY KIDS AND PETS,400 PET CARE,QLD,DOG HOUSE++75X59X66CM MEDIUM LPDH02^,Supplier 123,6486
104,300 OUTDOOR PLAY KIDS AND PETS,400 PET CARE,WA,DOG HOUSE++87X92X86.6CM LARGE LPDH03^,Supplier 123,1048
105,300 OUTDOOR PLAY KIDS AND PETS,400 PET CARE,VIC,DOG HOUSE++87X92X86.6CM LARGE LPDH03^,Supplier 123,7267
106,300 OUTDOOR PLAY KIDS AND PETS,400 PET CARE,NSW,DOG HOUSE++87X92X86.6CM LARGE LPDH03^,Supplier 123,3628
107,300 OUTDOOR PLAY KIDS AND PETS,400 PET CARE,QLD,DOG HOUSE++87X92X86.6CM LARGE LPDH03^,Supplier 123,853
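
The table above lists every state, while the question only asks about NSW. As a small worked check, the sketch below re-reads four of the displayed columns with the standard `csv` module and narrows them to NSW. It assumes the question wants the combined stock across both DOG HOUSE items; the variable names are chosen for this example only.

```python
import csv
import io

# Four of the columns from the output above, copied as CSV text.
DISPLAYED = """State,Item_Description,Supplier_Name,SOH_Units_In_Store
WA,DOG HOUSE++75X59X66CM MEDIUM LPDH02^,Supplier 123,1059
VIC,DOG HOUSE++75X59X66CM MEDIUM LPDH02^,Supplier 123,8235
NSW,DOG HOUSE++75X59X66CM MEDIUM LPDH02^,Supplier 123,808
QLD,DOG HOUSE++75X59X66CM MEDIUM LPDH02^,Supplier 123,6486
WA,DOG HOUSE++87X92X86.6CM LARGE LPDH03^,Supplier 123,1048
VIC,DOG HOUSE++87X92X86.6CM LARGE LPDH03^,Supplier 123,7267
NSW,DOG HOUSE++87X92X86.6CM LARGE LPDH03^,Supplier 123,3628
QLD,DOG HOUSE++87X92X86.6CM LARGE LPDH03^,Supplier 123,853
"""

rows = list(csv.DictReader(io.StringIO(DISPLAYED)))

# Keep only the NSW rows, then collect suppliers and add up the stock.
nsw_rows = [row for row in rows if row["State"] == "NSW"]
suppliers = sorted({row["Supplier_Name"] for row in nsw_rows})
soh_total = sum(int(row["SOH_Units_In_Store"]) for row in nsw_rows)

print("Suppliers in NSW:", suppliers)
print("SOH units in NSW stores:", soh_total)  # 808 + 3628 = 4436
```

On the dataframe itself the same narrowing can be done by adding `& (df['State'] == 'NSW')` to the row filter.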


In [None]:
# #df['date'] = pd.to_datetime(df['date'])
# july_2020 = df[(df['date'] >= '2020-07-01') & (df['date'] <= '2020-07-31')]
# texas_hospitalized_july_2020 = july_2020[july_2020['state'] == 'TX']['hospitalizedIncrease'].sum()
# nationwide_hospitalized_july_2020 = july_2020['hospitalizedIncrease'].sum()

In [None]:
#print( "TX:",texas_hospitalized_july_2020,"Nationwide:",nationwide_hospitalized_july_2020)

It is Correct!

**Note**: You will also notice that if you run the above cell multiple times, you will not always get the same result. Sometimes it will even fail and error out. Why? 
1) This is still a very new field, and LLMs and libraries still have a lot of room to grow
2) For complex questions that require multiple steps to solve, even humans make mistakes
3) If the column names are unclear or ambiguous, or the data is not clean, the model will make mistakes, just as humans would.
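
Since a run can fail for any of these reasons, one common mitigation is to retry with a growing pause between attempts. The sketch below is one way to write that, not the notebook's own method; `run_with_retries` and `flaky_task` are hypothetical names, and the delays are arbitrary.

```python
import time


def run_with_retries(task, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds, doubling the pause after each failure."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as error:
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Demo task that fails twice before succeeding.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("unparseable output")
    return "ok"


# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
result = run_with_retries(flaky_task, sleep=delays.append)
print(result, delays)
```

With the agent from this notebook, the task would be something like `lambda: agent_executor.run(CSV_PROMPT_PREFIX + QUESTION + CSV_PROMPT_SUFFIX)`.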

# Summary

So, we just solved our problem of how to ask questions in natural language about tabular data stored in a CSV file.
With this approach you can see that it is NOT necessary to dump database data into a CSV file and index it in a search engine; you don't even need to use the above approach and deal with a CSV dump at all. With the Agents framework, the best engineering decision is to interact directly with the data source API, without replicating the data in order to ask questions of it. Remember, GPT-4 can do SQL very well. 
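
To give a feel for querying the source directly, here is a minimal sketch using Python's built-in `sqlite3` module. The `stock` table, its column names and the rows are assumptions made for this example, and `generated_sql` is written by hand to stand in for the query an LLM might produce.

```python
import sqlite3

# In-memory database standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (state TEXT, item TEXT, supplier TEXT, soh_units INTEGER)"
)
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?, ?)",
    [
        ("NSW", "DOG HOUSE MEDIUM", "Supplier 123", 808),
        ("NSW", "DOG HOUSE LARGE", "Supplier 123", 3628),
        ("VIC", "DOG HOUSE MEDIUM", "Supplier 123", 8235),
    ],
)

# Hand-written stand-in for the SQL a model could generate from the question
# "who are the suppliers for DOG HOUSE in NSW and how much SOH is there".
generated_sql = (
    "SELECT supplier, SUM(soh_units) FROM stock "
    "WHERE state = 'NSW' AND item LIKE 'DOG HOUSE%' "
    "GROUP BY supplier"
)

result = conn.execute(generated_sql).fetchall()
print(result)
conn.close()
```

The database does the filtering and aggregation, so no CSV export or search index is involved.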

Just think about this: if GPT-4 can do the above, imagine what GPT-5/6/7/8 will be able to do.

**Note**: We don't recommend using a pandas agent to answer questions from tabular data. It is not fast and it makes too many parsing mistakes. We recommend using SQL (see next notebook).

# Reference

- https://haystack.deepset.ai/blog/introducing-haystack-agents
- https://python.langchain.com/en/latest/modules/agents/agents.html
- https://tsmatz.wordpress.com/2023/03/07/react-with-openai-gpt-and-langchain/
- https://medium.com/@meghanheintz/intro-to-langchain-llm-templates-and-agents-8793f30f1837

# NEXT
We can see that GPT-4 is powerful and can translate a natural language question into the right Python steps to query CSV data loaded into a pandas dataframe. 
That's pretty amazing. However, the question remains: **Do I then need to dump all the data from my original sources (databases, ERP systems, CRM systems) for it to be searchable by a Smart Search Engine?**

The next notebook answers this question by implementing a Question->SQL process that gets the information directly from a SQL database, eliminating the need to dump it out.