# Improve semantic search with table summarization

In this notebook, we will show how to summarize extracted tables to provide a better similarity score for semantic search. 

We will use:
- the extracted HTML table
- the pipe-separated (`|`) table, which uses fewer tokens

We will compare the question similarity score for both tables.

In [1]:
import sys
sys.path.append('../')

In [2]:
from operator import itemgetter
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

from config.llm import GPT_4
from config.embedings import AZURE_ADA_002_EMBEDDINGS
from utils import html_table_to_pipe_table, cosine_similarity



In [3]:
html_table = """
    <table>
        <thead>
            <tr>
                <th>Provider</th>
                <th>Model</th>
                <th>input price per 1k Token</th>
                <th>output price per 1K Token</th>
                <th>input price per 1M Token</th>
                <th>output price per 1M Token</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>(Azure) OpenAI</td>
                <td>GPT-4 (8K)</td>
                <td>$0.03000</td>
                <td>$0.06000</td>
                <td>$30.00</td>
                <td>$60.00</td>
            </tr>
            <tr>
                <td></td>
                <td>GPT-4 Turbo</td>
                <td>$0.01000</td>
                <td>$0.03000</td>
                <td>$10.00</td>
                <td>$30.00</td>
            </tr>
            <tr>
                <td></td>
                <td>GPT-3.5-turbo</td>
                <td>$0.00050</td>
                <td>$0.00150</td>
                <td>$0.50</td>
                <td>$1.50</td>
            </tr>
            <tr>
                <td>Google Vertex AI<br>1 token ~= 4 chars</td>
                <td>Gemini Pro</td>
                <td>$0.00100</td>
                <td>$0.00200</td>
                <td>$1.00</td>
                <td>$2.00</td>
            </tr>
            <tr>
                <td></td>
                <td>PaLM 2</td>
                <td>$0.00200</td>
                <td>$0.00200</td>
                <td>$2.00</td>
                <td>$2.00</td>
            </tr>
        </tbody>
    </table>
"""

In [4]:

pipe_table = html_table_to_pipe_table(html_table)
print(pipe_table)

Provider | Model | input price per 1k Token | output price per 1K Token | input price per 1M Token | output price per 1M Token
------------------------------------------------------------------------------------------------------------------------------
(Azure) OpenAI | GPT-4 (8K) | $0.03000 | $0.06000 | $30.00 | $60.00
nan | GPT-4 Turbo | $0.01000 | $0.03000 | $10.00 | $30.00
nan | GPT-3.5-turbo | $0.00050 | $0.00150 | $0.50 | $1.50
Google Vertex AI 1 token ~= 4 chars | Gemini Pro | $0.00100 | $0.00200 | $1.00 | $2.00
nan | PaLM 2 | $0.00200 | $0.00200 | $2.00 | $2.00
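
The `html_table_to_pipe_table` helper is imported from `utils` and its source is not shown in this notebook. As an illustration only, here is a minimal, dependency-free sketch of how such a conversion could work. The name `html_table_to_pipe_table_sketch` and the `_TableParser` class are hypothetical, and writing `nan` for empty cells is an assumption made to mirror the output above; the real helper may be implemented differently.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # list of rows, each row is a list of cell strings
        self._cell = None  # text fragments of the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []
        elif tag == "br" and self._cell is not None:
            self._cell.append(" ")  # treat a line break as a space

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:       # tolerate a cell that appears before any <tr>
                self.rows.append([])
            # Collapse runs of whitespace inside the cell.
            text = " ".join("".join(self._cell).split())
            # Assumption: empty cells are written as 'nan', as in the output above.
            self.rows[-1].append(text or "nan")
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_pipe_table_sketch(html: str) -> str:
    """Flatten an HTML table into ' | ' separated lines with a rule under the header."""
    parser = _TableParser()
    parser.feed(html)
    lines = [" | ".join(row) for row in parser.rows]
    if lines:
        lines.insert(1, "-" * len(lines[0]))  # dashed rule under the first row
    return "\n".join(lines)
```

Calling `html_table_to_pipe_table_sketch(html_table)` on the table defined earlier should give a layout similar to the one printed above, although the exact width of the dashed rule may differ.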



In [5]:
html_embeddings = AZURE_ADA_002_EMBEDDINGS.embed_documents([html_table])[0]
html_embeddings[:5]

[0.00203144090557718,
 0.007626032678073521,
 0.0007786905269745078,
 -0.01533421359754763,
 -0.005866705610244608]

In [6]:
pipe_embeddings = AZURE_ADA_002_EMBEDDINGS.embed_documents([pipe_table])[0]
pipe_embeddings[:5]

[0.01585491179451621,
 -0.014221546164635206,
 0.005311958929257258,
 -0.0089764706313801,
 -0.0020293865710770877]

In [7]:
en_question = "Can you sort the Azure and GCP LLMs by their costs per thousand tokens?"
en_question_embeddings = AZURE_ADA_002_EMBEDDINGS.embed_documents([en_question])[0]

fr_question = "Peux-tu trier les LLM des cloud Azure et GCP selon leur coûts par milliers de token ?"
fr_question_embeddings = AZURE_ADA_002_EMBEDDINGS.embed_documents([fr_question])[0]


Let's study the similarity between the question (English and French) and the table content (HTML and pipe).
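
The `cosine_similarity` function used below also comes from `utils` and is not shown. A minimal sketch, assuming it computes the standard cosine of the angle between two embedding vectors (the name `cosine_similarity_sketch` is hypothetical):

```python
import math
from typing import Sequence


def cosine_similarity_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The score ranges from -1 to 1, and a higher value means the two texts are closer in embedding space. That is why a gain of a few hundredths between two representations of the same table is worth comparing.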

In [8]:
# English question
print("English question")
print("Cosine similarity between html table and question:",
      cosine_similarity(html_embeddings, en_question_embeddings))
print("Cosine similarity between pipe table and question:",
      cosine_similarity(pipe_embeddings, en_question_embeddings))

# French question
print("\nFrench question:")
print("Cosine similarity between html table and question:",
      cosine_similarity(html_embeddings, fr_question_embeddings))
print("Cosine similarity between pipe table and question:",
      cosine_similarity(pipe_embeddings, fr_question_embeddings))

English question
Cosine similarity between html table and question: 0.8224233060059337
Cosine similarity between pipe table and question: 0.8110228519118848

French question:
Cosine similarity between html table and question: 0.7645364328228879
Cosine similarity between pipe table and question: 0.7492020302722854


Two conclusions:
- Be careful with languages: against the same English table, the French question scores lower than the English one.
- The HTML table has a slightly better similarity score than the pipe table.

## Table summarization 

Let's now try to summarize the HTML and pipe tables to improve the similarity score between the question and the table content.

In [9]:
def summarize_table(llm, prompt_text, extracted_table: str) -> str:
    prompt = ChatPromptTemplate.from_template(prompt_text)

    # Summary chain
    summarize_chain = (
        {
            "extracted_table": itemgetter("extracted_table")
        }
        |
        prompt
        |
        llm
        |
        StrOutputParser()
    )
    return summarize_chain.invoke(input={"extracted_table": extracted_table})


In [10]:
question = "Summarize the table with key information."

prompt_summarization = """From the following HTML or | separated table:
    ----------
    {extracted_table}
    ----------
    Summarize the table with key information.
    """




In [11]:
pipe_table_summary = summarize_table(
    llm=GPT_4,
    prompt_text=prompt_summarization,
    extracted_table=pipe_table,
)

html_table_summary = summarize_table(
    llm=GPT_4,
    prompt_text=prompt_summarization,
    extracted_table=html_table,
)

In [12]:
print(pipe_table_summary)

The table provides information about different AI models from various providers along with their input and output prices per 1k and 1M tokens. 

1. The Azure OpenAI provides the GPT-4 (8K) model with an input price of $0.03000 per 1k tokens and $30.00 per 1M tokens. The output prices are $0.06000 per 1k tokens and $60.00 per 1M tokens.

2. The GPT-4 Turbo model has an input price of $0.01000 per 1k tokens and $10.00 per 1M tokens. The output prices are $0.03000 per 1k tokens and $30.00 per 1M tokens.

3. The GPT-3.5-turbo model has the lowest input price of $0.00050 per 1k tokens and $0.50 per 1M tokens. The output prices are $0.00150 per 1k tokens and $1.50 per 1M tokens.

4. Google Vertex AI provides the Gemini Pro model where 1 token is approximately equal to 4 characters. The input price is $0.00100 per 1k tokens and $1.00 per 1M tokens. The output prices are $0.00200 per 1k tokens and $2.00 per 1M tokens.

5. The PaLM 2 model has an input price of $0.00200 per 1k tokens and $2.00 

In [13]:
print(html_table_summary)

The table provides information about the pricing of different AI models from providers such as Azure OpenAI and Google Vertex AI. 

1. Azure OpenAI offers three models: GPT-4 (8K), GPT-4 Turbo, and GPT-3.5-turbo. The input price per 1k tokens for these models are $0.03000, $0.01000, and $0.00050 respectively. The output prices per 1k tokens are $0.06000, $0.03000, and $0.00150 respectively. When considering 1M tokens, the input prices are $30.00, $10.00, and $0.50, and the output prices are $60.00, $30.00, and $1.50.

2. Google Vertex AI, where 1 token is approximately equal to 4 characters, offers two models: Gemini Pro and PaLM 2. The input price per 1k tokens for these models are $0.00100 and $0.00200 respectively, and both have an output price per 1k tokens of $0.00200. For 1M tokens, the input prices are $1.00 and $2.00, and the output prices are $2.00 and $2.00.


In [14]:
summarize_pipe_table_embeddings = AZURE_ADA_002_EMBEDDINGS.embed_documents(
    [pipe_table_summary])[0]

summarized_html_embeddings = AZURE_ADA_002_EMBEDDINGS.embed_documents(
    [html_table_summary])[0]

In [15]:
# English question
print("English question")
print("Cosine similarity between summarized html table and question:",
      cosine_similarity(summarized_html_embeddings, en_question_embeddings))
print("Cosine similarity between summarized pipe table and question:",
      cosine_similarity(summarize_pipe_table_embeddings, en_question_embeddings))

# French question
print("\nFrench question:")
print("Cosine similarity between summarized html table and question:",
      cosine_similarity(summarized_html_embeddings, fr_question_embeddings))
print("Cosine similarity between summarized pipe table and question:",
      cosine_similarity(summarize_pipe_table_embeddings, fr_question_embeddings))

English question
Cosine similarity between summarized html table and question: 0.8293981621712595
Cosine similarity between summarized pipe table and question: 0.8235538381337154

French question:
Cosine similarity between summarized html table and question: 0.7762218680702694
Cosine similarity between summarized pipe table and question: 0.7657509307006923


Summarization gives a slightly better cosine similarity score for both table formats and both languages. Let's check whether this holds for another question.

In [16]:
en_question = "How much does it cost to process 1M tokens with GPT-4 Turbo?"
en_question_embeddings = AZURE_ADA_002_EMBEDDINGS.embed_documents([en_question])[0]

fr_question = "Combien coûte le traitement de 1M de tokens avec GPT-4 Turbo ?"
fr_question_embeddings = AZURE_ADA_002_EMBEDDINGS.embed_documents([fr_question])[0]

In [17]:
print("Without summarization \n")
# English question
print("English question")
print("Cosine similarity between html table and question:",
      cosine_similarity(html_embeddings, en_question_embeddings))
print("Cosine similarity between pipe table and question:",
      cosine_similarity(pipe_embeddings, en_question_embeddings))

# French question
print("\nFrench question:")
print("Cosine similarity between html table and question:",
      cosine_similarity(html_embeddings, fr_question_embeddings))
print("Cosine similarity between pipe table and question:",
      cosine_similarity(pipe_embeddings, fr_question_embeddings))


print("\nWith summarization \n")
# English question
print("English question")
print("Cosine similarity between summarized html table and question:",
      cosine_similarity(summarized_html_embeddings, en_question_embeddings))
print("Cosine similarity between summarized pipe table and question:",
      cosine_similarity(summarize_pipe_table_embeddings, en_question_embeddings))

# French question
print("\nFrench question:")
print("Cosine similarity between summarized html table and question:",
      cosine_similarity(summarized_html_embeddings, fr_question_embeddings))
print("Cosine similarity between summarized pipe table and question:",
      cosine_similarity(summarize_pipe_table_embeddings, fr_question_embeddings))

Without summarization 

English question
Cosine similarity between html table and question: 0.8307020510808852
Cosine similarity between pipe table and question: 0.8398822824210638

French question:
Cosine similarity between html table and question: 0.7975859989604802
Cosine similarity between pipe table and question: 0.8040695385068106

With summarization 

English question
Cosine similarity between summarized html table and question: 0.8496684726885921
Cosine similarity between summarized pipe table and question: 0.8477521058415244

French question:
Cosine similarity between summarized html table and question: 0.8211399793020073
Cosine similarity between summarized pipe table and question: 0.8162567056870702


# Conclusion

- The size of the improvement depends on the question asked, but in both experiments the summaries scored higher than the raw extracted tables.
- A summary is lossy by definition: it can help the retriever find the right chunks but may lose some information. For this reason, we advise developers to keep both the summary and the extracted table. Consider linking the summary to the original table and passing only the original table to the context, to avoid duplicated information.
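
The linking idea in the last point can be sketched as a small index that searches over summary embeddings but returns the original tables. This is an illustration under stated assumptions, not the notebook's implementation: the `SummaryIndex` class, its method names, and the `table_id` keys are all hypothetical, and `embed` is any function that turns a string into a vector.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Embedder = Callable[[str], Sequence[float]]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SummaryIndex:
    """Search on summaries, answer with the original tables (hypothetical helper)."""

    def __init__(self, embed: Embedder):
        self.embed = embed
        self._vectors: List[Tuple[str, Sequence[float]]] = []  # (table_id, summary embedding)
        self._originals: Dict[str, str] = {}                    # table_id -> original table

    def add(self, table_id: str, summary: str, original_table: str) -> None:
        # Only the summary is embedded; the original is stored untouched.
        self._vectors.append((table_id, self.embed(summary)))
        self._originals[table_id] = original_table

    def retrieve(self, question: str, k: int = 1) -> List[str]:
        # Rank summaries by similarity to the question, then map back to originals.
        query = self.embed(question)
        ranked = sorted(
            self._vectors,
            key=lambda item: _cosine(query, item[1]),
            reverse=True,
        )
        return [self._originals[table_id] for table_id, _ in ranked[:k]]
```

With the objects from this notebook, `embed` could be `lambda text: AZURE_ADA_002_EMBEDDINGS.embed_documents([text])[0]`, and each table would be registered with `add("pricing", html_table_summary, html_table)`. The context then receives the full table while the summary is used only for matching.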
