# TSR Guru

### Made by: Jose Miguel Garzón Vargas
Universidad EAFIT

TSR Guru uses Retrieval-Augmented Generation (RAG) to help readers understand scientific articles in the Table Structure Recognition (TSR) domain. Each query is converted into an embedding and matched against the document embeddings stored in Chroma DB. The most similar documents are then passed as context, together with the query, to the AzureChatOpenAI API. Grounding the responses in the relevant literature is intended to make them more accurate and to simplify access to TSR content.
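The retrieve-then-augment flow described above can be illustrated with a minimal, self-contained sketch. The three-dimensional vectors, the sample texts, and the helper names (`cosine_similarity`, `retrieve`, `build_prompt`) are hypothetical and exist only for illustration. Cosine similarity is assumed here as the ranking measure; the notebook itself delegates embedding and search to OpenAI and Chroma DB, which may rank differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=2):
    """Return the texts of the k stored items most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Place the retrieved chunks in front of the question."""
    context = "\n".join(context_chunks)
    return f"<context>\n{context}\n</context>\n\nQuestion: {question}"

# Toy store of (text, embedding) pairs; the vectors are made up.
toy_store = [
    ("TEDS compares tables as HTML trees.", [0.9, 0.1, 0.0]),
    ("GriTS compares tables as grids.", [0.7, 0.3, 0.1]),
    ("Unrelated text about cooking.", [0.0, 0.1, 0.9]),
]
toy_query_vec = [1.0, 0.0, 0.0]  # stands in for an embedded query

top_chunks = retrieve(toy_query_vec, toy_store, k=2)
toy_prompt = build_prompt("How is a TSR model evaluated?", top_chunks)
print(toy_prompt)
```

Only the two chunks closest to the query end up in the prompt, which is the part of the pipeline that keeps unrelated text away from the language model.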

### Load secrets

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_API_BASE = os.getenv('OPENAI_API_BASE')
OPENAI_API_VERSION = os.getenv('OPENAI_API_VERSION')
OPENAI_API_TYPE = os.getenv('OPENAI_API_TYPE')

### Load files

In [2]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

FOLDER_PATH = r".\TSR-docs"

# Create the text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
documents = []
# Load every PDF file found in the `TSR-docs` folder
for file_name in os.listdir(FOLDER_PATH):
    # Join the folder path with the file name to get the full path
    file_path = os.path.join(FOLDER_PATH, file_name)
    # Check if the current path is a file (not a folder)
    if os.path.isfile(file_path):
        loader = PyPDFLoader(file_path)
        raw_documents = loader.load()


        # Split the documents
        documents.extend(splitter.split_documents(raw_documents))
print("All documents loaded!")


All documents loaded!
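To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences; the function `split_with_overlap` is a hypothetical helper that only shows how consecutive chunks share characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=1000` and `chunk_overlap=200`, as in the cell above, the same idea applies at a larger scale: the overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.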


### Embedding the documents

#### Cost estimation

In [3]:
# Import tiktoken
import tiktoken

# Create an encoder
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")

# Count tokens in each document
doc_tokens = [len(encoder.encode(doc.page_content)) for doc in documents]

# Calculate the sum of all token counts
total_tokens = sum(doc_tokens)

# Calculate a cost estimate
cost = (total_tokens/1000) * 0.0004
print(f"Total tokens: {total_tokens} - cost: ${cost:.2f}")

Total tokens: 384359 - cost: $0.15
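The arithmetic behind the printed estimate can be reproduced without calling any API. The sketch below takes the token count from the output above and the price of 0.0004 USD per 1,000 tokens used in the cell; that price is the value assumed by the notebook and may not match current pricing. The function name `estimate_embedding_cost` is hypothetical.

```python
def estimate_embedding_cost(total_tokens, price_per_1k_tokens):
    """Cost in USD for embedding total_tokens at a given price per 1,000 tokens."""
    return (total_tokens / 1000) * price_per_1k_tokens

# Token count reported by the cell above; price as assumed in the notebook.
example_cost = estimate_embedding_cost(384359, 0.0004)
print(f"Estimated cost: ${example_cost:.2f}")
```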


### Embeddings generation

The embeddings are stored in ChromaDB. This step can be skipped if the embeddings have already been saved.
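One possible way to decide whether generation can be skipped is to check whether the persist directory already exists and contains files. This is only a sketch under that assumption: the helper `embeddings_already_saved` is hypothetical, it does not inspect the contents of the directory, and the notebook itself makes this decision manually.

```python
import os
import tempfile

def embeddings_already_saved(persist_directory):
    """Return True if the directory exists and holds at least one entry."""
    return os.path.isdir(persist_directory) and len(os.listdir(persist_directory)) > 0

# Demonstration in a temporary folder so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "TSR-embeddings")
    os.makedirs(demo_dir)
    before = embeddings_already_saved(demo_dir)  # directory exists but is empty

    with open(os.path.join(demo_dir, "placeholder.bin"), "w") as handle:
        handle.write("x")
    after = embeddings_already_saved(demo_dir)  # directory now has a file

    missing = embeddings_already_saved(os.path.join(tmp, "does-not-exist"))

print(before, after, missing)
```

In the notebook, a check like this could guard the generation cell so that the embedding API is only called when nothing has been saved yet.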

In [4]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

# Create the embedding function
embedding_function = OpenAIEmbeddings(deployment="text-embedding-ada-002",chunk_size = 1)

In [5]:
# Create a database from the documents and embedding function
# Note: `documents[:1]` embeds only the first chunk; pass `documents` to embed the full corpus
db = Chroma.from_documents(documents=documents[:1], embedding=embedding_function, persist_directory="TSR-embeddings")



In [6]:
# Persist the data to disk
# db.persist()  # Skipped: the data is already on disk

### Embeddings loading
Since the embeddings are already created, we'll just load them from the files:

In [7]:
# Import chroma
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

# Create the embedding function
embedding = OpenAIEmbeddings(deployment="text-embedding-ada-002",chunk_size = 1)
db = Chroma(persist_directory="TSR-embeddings", embedding_function=embedding)

### Create a QA chain using RAG

In [8]:
# Example questions; each assignment overwrites the previous one, so only the last is used
question = "What is TSR?"
question = "Main challenges of TSR"
question = "What are some approaches to solve the TSR problem?"
question = "Which datasets are useful and why?"
question = "Where should I start if I want to compare TSR models?"
question = "How can I evaluate a TSR model performance?"
question = "Explain the details of the GriTS  metric"
question = "Explain the details of the TEDS  metric"

In [9]:
# Import the prompt, chain, and Azure chat model classes
from langchain.prompts import PromptTemplate
from langchain.chains.llm import LLMChain
from langchain.chat_models import AzureChatOpenAI

# Query the database and store the results as `context_docs`
context_docs = db.similarity_search(question)

# Create a prompt with 2 variables: `context` and `question`
prompt = PromptTemplate(
    template="""Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

<context>
{context}
</context>

Question: {question}
Helpful Answer, formatted in markdown:""",
    input_variables=["context", "question"]
)

# Create an LLM with AzureChatOpenAI
llm = AzureChatOpenAI(
    openai_api_base=OPENAI_API_BASE,
    openai_api_version=OPENAI_API_VERSION,
    deployment_name="gpt-35-turbo",
    openai_api_key=OPENAI_API_KEY,
    openai_api_type=OPENAI_API_TYPE,
)

In [10]:
# Create the chain
qa_chain = LLMChain(llm=llm, prompt=prompt)

# Call the chain
result = qa_chain({
    "question": question,
    "context": "\n".join([doc.page_content for doc in context_docs])
})

# Print the result
print(f"User:\n\t-{question}\n")
print(f"TSR Guru:\n{result['text']}")

User:
	-Explain the details of the TEDS  metric

TSR Guru:
The TEDS (Tree-Edit-Distance-Based Similarity) metric is a similarity measure used to evaluate the quality of table recognition methods. It was introduced in a research paper [37] and is calculated based on the tree structure of HTML tags representing the prediction and ground-truth tables.

The TEDS metric is calculated using the following formula:

TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|)

Where Ta and Tb represent tables in tree structure HTML format, and EditDist represents the tree-edit distance between the two tables. |T| represents the number of nodes in the table.

The TEDS metric is used to measure the similarity between predicted and ground-truth tables. A higher TEDS value indicates a higher similarity between the two tables. It is a commonly used evaluation metric in the field of table recognition.
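The formula quoted in the answer above can be checked with a short calculation. The sketch below only evaluates the final expression: computing the tree-edit distance itself requires a dedicated algorithm and is not attempted here. The function name `teds` and the example numbers are hypothetical.

```python
def teds(edit_distance, num_nodes_a, num_nodes_b):
    """TEDS = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|), given a precomputed distance."""
    largest = max(num_nodes_a, num_nodes_b)
    if largest == 0:
        raise ValueError("at least one tree must contain a node")
    return 1 - edit_distance / largest

# Identical trees: no edits are needed, so the similarity is 1.
score_identical = teds(edit_distance=0, num_nodes_a=20, num_nodes_b=20)

# Made-up case: 5 edits between trees with 20 and 18 nodes.
score_partial = teds(edit_distance=5, num_nodes_a=20, num_nodes_b=18)

print(score_identical, score_partial)
```

Dividing by the larger node count normalizes the distance, so scores from tables of different sizes can be compared on the same scale.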
