# WORK IN PROGRESS

# Notebook 3: Approaches for Tabular Data

In this notebook we will show a few ways to use LLM pipelines with tabular data.

We will focus on Tesla (TSLA) stock price data.
* TBD
* TBD

## Import libraries and load the 10k and stock data

In [1]:
import subprocess
import tiktoken
import pandas as pd
import os
import csv
import json
import time
import re
import transformers
import torch
import numpy as np
from datetime import datetime

from sklearn.metrics.pairwise import cosine_similarity

#We will use langchain to create a vector store over the stock data for retrieval
import faiss
from langchain.vectorstores.faiss import FAISS
from langchain.docstore import InMemoryDocstore
from langchain_core.vectorstores import VectorStoreRetriever
from langchain.document_loaders import UnstructuredPDFLoader, csv_loader
# from langchain.embeddings.sentence_transformer import HuggingFaceEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.utils import mock_now


EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"  # alternative: "BAAI/bge-base-en-v1.5"

## Load .csv data

In [16]:
#load stock data twice: as a DataFrame, and directly as docs (one Document per CSV row)
tsla_stock = pd.read_csv("../data/TSLA.csv")
loader  = csv_loader.CSVLoader(file_path="../data/TSLA.csv")
stock_data_docs = loader.load()
stock_data_docs[0]

Document(page_content='Date: 2010-06-29\nOpen: 1.266667\nHigh: 1.666667\nLow: 1.169333\nClose: 1.592667\nAdj Close: 1.592667\nVolume: 281494500', metadata={'source': '../data/TSLA.csv', 'row': 0})

In [17]:
embedding_function = HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        cache_folder="../models/sentencetransformers"
    )

## Build Static Knowledge Base of Stock Data

In [18]:
#first let's add more context to help with retrieval and add the date to metadata (for later use)
for doc in stock_data_docs:
    doc.page_content = "Daily stock market data for Tesla (TSLA):\n" + doc.page_content
    # pull the row's date out of the text, e.g. "Date: 2010-06-29"
    date = re.search(r'Date: (\d{4}-\d{2}-\d{2})', doc.page_content)
    if date:
        doc.metadata['last_accessed_at'] = datetime.strptime(date.group(1), '%Y-%m-%d')
    else:
        doc.metadata['last_accessed_at'] = None
stock_data_docs[0]

Document(page_content='Daily stock market data for Tesla (TSLA):\nDate: 2010-06-29\nOpen: 1.266667\nHigh: 1.666667\nLow: 1.169333\nClose: 1.592667\nAdj Close: 1.592667\nVolume: 281494500', metadata={'source': '../data/TSLA.csv', 'row': 0, 'last_accessed_at': datetime.datetime(2010, 6, 29, 0, 0)})
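
Re-running the cell above prepends the context line a second time, which is probably why later outputs in this notebook show the prefix twice. Here is a minimal sketch of a version that is safe to re-run. It works on plain strings and dicts rather than LangChain `Document` objects, and the helper name `add_context` is hypothetical.

```python
import re
from datetime import datetime

# Same context line as the cell above.
CONTEXT_PREFIX = "Daily stock market data for Tesla (TSLA):\n"

def add_context(page_content, metadata):
    """Prepend the context line at most once and copy the row's date into metadata."""
    if not page_content.startswith(CONTEXT_PREFIX):
        page_content = CONTEXT_PREFIX + page_content
    match = re.search(r"Date: (\d{4}-\d{2}-\d{2})", page_content)
    metadata = dict(metadata)  # work on a copy so the caller's dict is untouched
    metadata["last_accessed_at"] = (
        datetime.strptime(match.group(1), "%Y-%m-%d") if match else None
    )
    return page_content, metadata

row = "Date: 2010-06-29\nOpen: 1.266667\nClose: 1.592667"
once, meta = add_context(row, {"row": 0})
twice, _ = add_context(once, meta)  # second call leaves the text unchanged
```

The `startswith` guard is the only change in behavior: the regex and the metadata key are the same as in the original loop.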

In [8]:
db_data = FAISS.from_documents(stock_data_docs, embedding_function)
db_data.save_local("../data/faiss_stock")

## Build Basic LLM Pipelines for Structured Data

Note: Our stock data only runs through 2024-02-02.

In [10]:
top_k=16
retriever_stock = VectorStoreRetriever(vectorstore=db_data, search_kwargs={"k": top_k})

def generate_response(prompt, retriever):
    #today's date - let's pretend it is 2024-02-02
    today = "2024-02-02"
    #replace "current" or "today" with today's date
    prompt = re.sub(r'current|today', today, prompt, flags=re.IGNORECASE)
    print("Prompt: ", prompt)
    # Get the top k most similar documents
    results = retriever.get_relevant_documents(prompt)
    return results
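
One caveat with the substitution in `generate_response`: the pattern `current|today` also matches inside longer words such as "currently". A small sketch of a stricter variant using word boundaries follows. The helper name `resolve_today` is hypothetical, and the date is the same pretend "today" used above.

```python
import re

TODAY = "2024-02-02"  # the pretend "today" used in generate_response

def resolve_today(prompt, today=TODAY):
    """Replace the standalone words 'current' or 'today' with a date string."""
    return re.sub(r"\b(?:current|today)\b", today, prompt, flags=re.IGNORECASE)

# The unanchored pattern rewrites part of the word "currently".
naive = re.sub(r"current|today", TODAY, "Is TSLA currently above 200?", flags=re.IGNORECASE)

# The word-boundary pattern leaves "currently" alone but still handles "current".
bounded = resolve_today("Is TSLA currently above 200?")
resolved = resolve_today("What is TSLA's current close price?")
```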

In [11]:
question ="What is TSLA's current close price?"
generate_response(question, retriever_stock)

Prompt:  What is TSLA's 2024-02-02 close price?


[Document(page_content='Daily stock market data for Tesla (TSLA):\nDaily stock market data for Tesla (TSLA):\nDate: 2021-08-03\nOpen: 239.666672\nHigh: 240.883331\nLow: 233.669998\nClose: 236.580002\nAdj Close: 236.580002\nVolume: 64860900', metadata={'source': '../data/TSLA.csv', 'row': 2793, 'last_accessed_at': datetime.datetime(2021, 8, 3, 0, 0)}),
 Document(page_content='Daily stock market data for Tesla (TSLA):\nDaily stock market data for Tesla (TSLA):\nDate: 2022-08-18\nOpen: 306.000000\nHigh: 306.500000\nLow: 301.853333\nClose: 302.869995\nAdj Close: 302.869995\nVolume: 47500500', metadata={'source': '../data/TSLA.csv', 'row': 3056, 'last_accessed_at': datetime.datetime(2022, 8, 18, 0, 0)}),
 Document(page_content='Daily stock market data for Tesla (TSLA):\nDaily stock market data for Tesla (TSLA):\nDate: 2021-01-06\nOpen: 252.830002\nHigh: 258.000000\nLow: 249.699997\nClose: 251.993332\nAdj Close: 251.993332\nVolume: 134100000', metadata={'source': '../data/TSLA.csv', 'row': 2

### TimeWeightedVectorStoreRetriever

Notice that even when we replace "current" with today's date, retrieval is not precise enough to return only recent dates: the top results above are from 2021 and 2022. Let's try using TimeWeightedVectorStoreRetriever to bias the results towards recent dates.
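
To get a feel for how strong that bias is, here is a rough sketch of the recency term. To our understanding, the retriever adds a term of the form `(1 - decay_rate) ** hours_passed` to the vector similarity, where `hours_passed` is measured from `last_accessed_at`. This is an assumption about the library's internals, reproduced in plain Python, not its actual code. The function name `recency_score` is hypothetical.

```python
from datetime import datetime

def recency_score(last_accessed_at, now, decay_rate):
    """Exponential decay in the number of hours since the document was last accessed."""
    hours_passed = (now - last_accessed_at).total_seconds() / 3600
    return (1.0 - decay_rate) ** hours_passed

now = datetime(2024, 2, 2, 23, 0)  # the mocked "now" used in this notebook
decay_rate = 0.005

same_day = recency_score(datetime(2024, 2, 2), now, decay_rate)   # 23 hours old
one_week = recency_score(datetime(2024, 1, 26), now, decay_rate)  # 191 hours old
one_year = recency_score(datetime(2023, 2, 2), now, decay_rate)   # about 8,800 hours old
```

With a decay rate of 0.005 per hour, a row from the same day keeps most of its recency score, a week-old row keeps well under half, and a year-old row contributes essentially nothing.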

In [12]:
# Solution 1
embedding_size = len(embedding_function.embed_documents([question])[0])
index = faiss.IndexFlatL2(embedding_size)  # use the measured size rather than hard-coding 384
vectorstore = FAISS(embedding_function, index, InMemoryDocstore({}), {})
tw_retriever_stock = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore,decay_rate=0.005, k=top_k)
# add_documents keeps an existing 'last_accessed_at', so each document starts with its stock date as its last access time
tw_retriever_stock.add_documents(stock_data_docs)
with mock_now(datetime(2024, 2, 2, 23, 0)):
    rel_docs = generate_response(question, tw_retriever_stock)
rel_docs

Prompt:  What is TSLA's 2024-02-02 close price?


[Document(page_content='Daily stock market data for Tesla (TSLA):\nDaily stock market data for Tesla (TSLA):\nDate: 2024-02-02\nOpen: 185.039993\nHigh: 188.690002\nLow: 182.000000\nClose: 187.910004\nAdj Close: 187.910004\nVolume: 110505100', metadata={'source': '../data/TSLA.csv', 'row': 3422, 'last_accessed_at': MockDateTime(2024, 2, 2, 23, 0), 'created_at': datetime.datetime(2024, 3, 25, 20, 52, 52, 21390), 'buffer_idx': 3422}),
 Document(page_content='Daily stock market data for Tesla (TSLA):\nDaily stock market data for Tesla (TSLA):\nDate: 2024-02-01\nOpen: 188.500000\nHigh: 189.880005\nLow: 184.279999\nClose: 188.860001\nAdj Close: 188.860001\nVolume: 91843300', metadata={'source': '../data/TSLA.csv', 'row': 3421, 'last_accessed_at': MockDateTime(2024, 2, 2, 23, 0), 'created_at': datetime.datetime(2024, 3, 25, 20, 52, 52, 21390), 'buffer_idx': 3421}),
 Document(page_content='Daily stock market data for Tesla (TSLA):\nDaily stock market data for Tesla (TSLA):\nDate: 2024-01-31\nO

This technically worked, but it is not what we want. The retriever overwrote 'last_accessed_at' on the returned documents (note the MockDateTime values in the output above), so they no longer carry the date of the stock data, and the recency bias will hurt performance when asking about previous dates. Let's try agents.

# Create CSV Agent

### To be implemented at a later date

# Route Questions using Similarity Functions