# Markdown Table Chunking

In this notebook, we experiment with several approaches to chunking Markdown tables.


### Python Imports


In [15]:
%load_ext autoreload
%autoreload 2


import sys
sys.path.append('../code')  # forward slashes work on Windows, macOS, and Linux


import os
from dotenv import load_dotenv
load_dotenv()

from IPython.display import display, JSON, Markdown, HTML
import copy
import json  # used below to parse the LLM's JSON output
from PIL import Image
from doc_utils import *


def show_img(img_path, width = None):
    if width is not None:
        display(HTML(f'<img src="{img_path}" width={width}>'))
    else:
        display(Image.open(img_path))




### Make sure we have the OpenAI Models information

We will need the GPT-4-Turbo and GPT-4-Vision models for this notebook.

When you run the cell below, the values should reflect the OpenAI resource you have configured in the `.env` file.

In [4]:
model_info = {
        'AZURE_OPENAI_RESOURCE': os.environ.get('AZURE_OPENAI_RESOURCE'),
        'AZURE_OPENAI_KEY': os.environ.get('AZURE_OPENAI_KEY'),
        'AZURE_OPENAI_MODEL_VISION': os.environ.get('AZURE_OPENAI_MODEL_VISION'),
        'AZURE_OPENAI_MODEL': os.environ.get('AZURE_OPENAI_MODEL'),
}
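
If one of these variables is missing from `.env`, `os.environ.get` returns `None` without complaint, and the problem only shows up later, inside the LLM call. A minimal sketch of an early check, assuming the same four variable names as above; `find_missing_settings` is a hypothetical helper and not part of `doc_utils`:

```python
import os

REQUIRED_SETTINGS = [
    'AZURE_OPENAI_RESOURCE',
    'AZURE_OPENAI_KEY',
    'AZURE_OPENAI_MODEL_VISION',
    'AZURE_OPENAI_MODEL',
]

def find_missing_settings(settings, required=REQUIRED_SETTINGS):
    """Return the required keys whose value is missing or empty."""
    return [k for k in required if not settings.get(k)]

# Hypothetical usage with the same lookups as model_info above
settings = {k: os.environ.get(k) for k in REQUIRED_SETTINGS}
missing = find_missing_settings(settings)
if missing:
    # Only the names are printed, never the values
    print("Missing settings in .env:", ", ".join(missing))
```
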

### Read in the Excel Sheet

In [5]:
file_path = r"sample_data/11_work_orders.xlsx"
dataframes = read_excel_to_dataframes(file_path)

In [6]:
# Print the sheet names
for sheet_name in dataframes:
    print(sheet_name)

Instructions
WOs
AdminLists
MyLinks


In [18]:
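def nb_table_df_cleanup_df(df):
    dfc = copy.deepcopy(df)
    # Drop rows and columns that are entirely empty
    dfc = dfc.dropna(axis=0, how='all')
    dfc = dfc.dropna(axis=1, how='all')
    # Newlines and pipes inside a cell would break the Markdown table layout,
    # so replace them with placeholder markers
    dfc = dfc.replace(r'\n','   //    ', regex=True)
    dfc = dfc.replace(r'\|','   ///    ', regex=True)
    return dfc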


df = nb_table_df_cleanup_df(dataframes['Instructions'])
md_table = df.to_markdown()
Markdown(md_table)

|    | 2                                             |
|---:|:----------------------------------------------|
|  4 | Downloaded From                               |
|  5 | Sample Data for Excel                         |
|  7 | Related tutorials                             |
|  8 | Named Excel Tables                            |
|  9 | Data Entry Tips                               |
| 10 | More Excel Sample Files                       |
| 12 | Notes                                         |
| 13 | Fake work order data to use for Excel testing |
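
The two `replace` calls in `nb_table_df_cleanup_df` matter because a raw newline or pipe inside a cell would end the row or start a new column once the DataFrame is rendered as Markdown. A small sketch of the same two substitutions applied to a single string, using only the standard-library `re` module; `sanitize_cell` is a hypothetical name used for illustration:

```python
import re

def sanitize_cell(text):
    """Apply the same two substitutions as the cleanup function to one cell value."""
    text = re.sub(r'\n', '   //    ', text)   # a newline would end the table row early
    text = re.sub(r'\|', '   ///    ', text)  # a pipe would be read as a column separator
    return text

cell = "Line one\nLine two | extra"
print(sanitize_cell(cell))
```
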

In [16]:
markdown_extract_header_and_summarize_prompt = """
You are a Data Engineer responsible for reforming and preserving the quality of Markdown tables. A table will be passed to you in the form of a Markdown string. You are designed to output JSON.

Your task is to extract the column names of the header of the table from the Markdown string in the form of a comma-separated list. If the column names do exist, please return them verbatim word-for-word with no change, except fixing format or alignment issues (extra spaces and new lines can be removed). 

If the table does not have a header, then please check the data rows and generate column names for the header that fit the data types of the columns and the nature of the data. 

**VERY IMPORTANT**: If the table has an unnamed index column, typically the leftmost column, you **MUST** generate a column name for it.

Finally, please generate a brief semantic summary of the table in English. This is not about the technical characteristics of the table. The summary should summarize the business purpose and contents of the table. The summary should be concise and to the point, one or two short paragraphs.

The Markdown table: 
## START OF MARKDOWN TABLE
{table}
## END OF MARKDOWN TABLE

JSON OUTPUT:
You **MUST** generate the below JSON dictionary as your output. 

{{
    "columns": "list of comma-separated column names. If the table has a header, please return the column names as they are. If the table does not have a header, then generate column names that fit the data types and nature of the data. Do **NOT** forget any unnamed index columns.",
    "columns_inferred": "true/false. Set to true in the case the table does not have a header, and you generated column names based on the data rows.",
    "total_number_of_columns": "total number of columns in the table",
    "summary_of_the_table": "a brief semantic summary of the table in English. This is not about the technical characteristics of the table. The summary should summarize the business purpose and contents of the table. The summary should be concise and to the point, one or two short paragraphs."
}}

"""


prompt = markdown_extract_header_and_summarize_prompt.format(table=md_table)
output = ask_LLM_with_JSON(prompt, model_info = model_info)
display(JSON(json.loads(output)))

<IPython.core.display.JSON object>
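
The prompt asks for every field as a string, so the response needs some light parsing before it can be used: the column list is comma-separated text, and `columns_inferred` arrives as the text `true` or `false`. (The doubled braces `{{` and `}}` in the prompt are how `str.format` emits literal braces.) A minimal sketch of that parsing follows. The response in it is a made-up example shaped like the requested JSON, not actual model output, and `parse_header_response` is a hypothetical helper:

```python
import json

def parse_header_response(raw):
    """Turn the JSON string requested by the prompt into typed Python values."""
    data = json.loads(raw)
    # Split the comma-separated column list and drop surrounding whitespace
    columns = [c.strip() for c in data["columns"].split(",") if c.strip()]
    return {
        "columns": columns,
        "columns_inferred": str(data["columns_inferred"]).strip().lower() == "true",
        "total_number_of_columns": int(data["total_number_of_columns"]),
        "summary": data["summary_of_the_table"],
    }

# Made-up response for illustration only
sample = json.dumps({
    "columns": "Index, Topic",
    "columns_inferred": "true",
    "total_number_of_columns": "2",
    "summary_of_the_table": "A short list of topics.",
})
parsed = parse_header_response(sample)
print(parsed["columns"], parsed["columns_inferred"])
```

Splitting on commas assumes that no column name itself contains a comma; a table with such names would need a different delimiter or a JSON list in the prompt.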

In [21]:
df_wos = nb_table_df_cleanup_df(dataframes['WOs'])
md_table_wos_head = df_wos.head().to_markdown()
md_table_wos = df_wos.to_markdown()
Markdown(md_table_wos_head)

|    | 0      | 1        | 2        | 3       | 4    | 5                   | 6                   | 7     | 8      | 9        | 10     | 11        | 12      | 13   | 14      | 15      | 16     | 17       | 18                 | 19                 | 20     | 21      |
|---:|:-------|:---------|:---------|:--------|:-----|:--------------------|:--------------------|:------|:-------|:---------|:-------|:----------|:--------|:-----|:--------|:--------|:-------|:---------|:-------------------|:-------------------|:-------|:--------|
|  0 | WO     | District | LeadTech | Service | Rush | ReqDate             | WorkDate            | Techs | WtyLbr | WtyParts | LbrHrs | PartsCost | Payment | Wait | LbrRate | LbrCost | LbrFee | PartsFee | TotalCost          | TotalFee           | ReqDay | WorkDay |
|  1 | A00100 | North    | Khan     | Assess  | nan  | 2020-09-01 00:00:00 | 2020-09-15 00:00:00 | 2     | nan    | nan      | 0.5    | 360       | Account | 14   | 140     | 70      | 70     | 360      | 430                | 430                | Tue    | Tue     |
|  2 | A00101 | South    | Lopez    | Replace | nan  | 2020-09-01 00:00:00 | 2020-09-04 00:00:00 | 1     | nan    | nan      | 0.5    | 90.0416   | Account | 3    | 80      | 40      | 40     | 90.0416  | 130.04160000000002 | 130.04160000000002 | Tue    | Fri     |
|  3 | A00102 | Central  | Cartier  | Deliver | nan  | 2020-09-01 00:00:00 | 2020-09-17 00:00:00 | 1     | nan    | nan      | 0.25   | 120       | P.O.    | 16   | 80      | 20      | 20     | 120      | 140                | 140                | Tue    | Thu     |
|  4 | A00103 | South    | Lopez    | Deliver | nan  | 2020-09-01 00:00:00 | 2020-09-17 00:00:00 | 1     | nan    | nan      | 0.25   | 16.25     | Account | 16   | 80      | 20      | 20     | 16.25    | 36.25              | 36.25              | Tue    | Thu     |

In [22]:
print(f"Table Token Count: {get_token_count(md_table_wos)}")

prompt = markdown_extract_header_and_summarize_prompt.format(table=md_table_wos)
output = ask_LLM_with_JSON(prompt, model_info = model_info)
display(JSON(json.loads(output)))

Table Token Count: 113325


<IPython.core.display.JSON object>

In [34]:
print(f"Table Token Count: {get_token_count(md_table_wos)}")

# Send only the first 100 lines of the table, joined back into a single string
prompt = markdown_extract_header_and_summarize_prompt.format(table='\n'.join(md_table_wos.split('\n')[:100]))
output = ask_LLM_with_JSON(prompt, model_info = model_info)
display(JSON(json.loads(output)))

Table Token Count: 113325


<IPython.core.display.JSON object>
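
Taking a fixed 100 lines works for this sheet, but the number of tokens per row depends on how wide the table is. A sketch of an alternative that keeps leading lines until a token budget is reached. The default `count_tokens` here is a crude whitespace-based stand-in so the example runs on its own; in this notebook, `get_token_count` could be passed instead. `take_lines_within_budget` is a hypothetical name:

```python
def take_lines_within_budget(md_table, max_tokens, count_tokens=lambda s: len(s.split())):
    """Return the leading lines of md_table whose combined token count fits max_tokens."""
    kept = []
    total = 0
    for line in md_table.split('\n'):
        # Summing per-line counts approximates the count of the joined text
        cost = count_tokens(line)
        # Always keep the first line (the header), even if it alone exceeds the budget
        if kept and total + cost > max_tokens:
            break
        kept.append(line)
        total += cost
    return '\n'.join(kept)

table = "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |"
print(take_lines_within_budget(table, max_tokens=12))
```
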

In [44]:
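def chunk_markdown_table_with_overlap(md_table, cols = None, n_tokens = 512, overlap = 128):

    mds = md_table.split('\n')

    # Use the supplied column names for the header row, or fall back to the table's own first line
    if cols is not None:
        header = '|   ' + '   |   '.join(cols) + '   |\n'
    else:
        header = mds[0] + '\n'

    chunks = []
    chunk = header

    # mds[1] is the Markdown separator row, so the first chunk picks it up in this loop
    for i, r in enumerate(mds[1:]):
        chunk += r + '\n'

        ## Check if the chunk is over n_tokens
        if get_token_count(chunk) > n_tokens:
            ## Add overlap: keep appending rows until the chunk exceeds n_tokens + overlap.
            ## mds[i + 1] is the row just appended (r), so it is repeated as the first overlap row.
            for ovr in mds[i + 1:]:
                chunk += ovr + '\n'
                if get_token_count(chunk) > n_tokens + overlap:
                    break

            chunks.append(chunk)

            ## Start the next chunk with the header and the separator row
            chunk = header + mds[1] + '\n'

    ## Rows accumulated after the last appended chunk are not returned
    return chunks, header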


def chunk_markdown_table(df, model_info):

    df_clean = nb_table_df_cleanup_df(df)
    md_table = df_clean.to_markdown()

    # Send only the first 100 lines of the table, joined back into a single string
    prompt = markdown_extract_header_and_summarize_prompt.format(table='\n'.join(md_table.split('\n')[:100]))
    output = ask_LLM_with_JSON(prompt, model_info = model_info)
    outd = json.loads(output)
    cols = outd['columns'].split(',')
    summary = outd['summary_of_the_table']

    chunks, header = chunk_markdown_table_with_overlap(md_table, cols, n_tokens = 512, overlap = 128)
    print("Chunks:", len(chunks))
    return chunks, header, summary

chunks, header, summary = chunk_markdown_table(dataframes['WOs'], model_info)
print(summary)


Chunks: 319
The table appears to be a log of work orders (WO) for a service company. Each row represents a job with details such as the district where the job took place, the lead technician (LeadTech), the type of service performed, whether it was a rush job, request and work dates, the number of technicians (Techs) involved, warranty labor and parts information, labor hours (LbrHrs), parts cost, payment method, waiting time (Wait), labor rate (LbrRate), labor cost (LbrCost), labor fee (LbrFee), parts fee (PartsFee), total cost (TotalCost), total fee (TotalFee), and the days of the week when the request was made and the work was done (ReqDay and WorkDay). This table is useful for tracking job costs, technician assignments, and scheduling efficiency.


In [42]:
Markdown(chunks[0])

|   Index   |    WO   |    District   |    LeadTech   |    Service   |    Rush   |    ReqDate   |    WorkDate   |    Techs   |    WtyLbr   |    WtyParts   |    LbrHrs   |    PartsCost   |    Payment   |    Wait   |    LbrRate   |    LbrCost   |    LbrFee   |    PartsFee   |    TotalCost   |    TotalFee   |    ReqDay   |    WorkDay   |
|-----:|:-------|:----------|:---------|:--------|:-----|:--------------------|:--------------------|:------|:-------|:---------|:-------|:----------|:---------|:-----|:--------|:--------|:-------|:----------|:-------------------|:-------------------|:-------|:--------|
|    0 | WO     | District  | LeadTech | Service | Rush | ReqDate             | WorkDate            | Techs | WtyLbr | WtyParts | LbrHrs | PartsCost | Payment  | Wait | LbrRate | LbrCost | LbrFee | PartsFee  | TotalCost          | TotalFee           | ReqDay | WorkDay |
|    1 | A00100 | North     | Khan     | Assess  | nan  | 2020-09-01 00:00:00 | 2020-09-15 00:00:00 | 2     | nan    | nan      | 0.5    | 360       | Account  | 14   | 140     | 70      | 70     | 360       | 430                | 430                | Tue    | Tue     |
|    2 | A00101 | South     | Lopez    | Replace | nan  | 2020-09-01 00:00:00 | 2020-09-04 00:00:00 | 1     | nan    | nan      | 0.5    | 90.0416   | Account  | 3    | 80      | 40      | 40     | 90.0416   | 130.04160000000002 | 130.04160000000002 | Tue    | Fri     |
|    3 | A00102 | Central   | Cartier  | Deliver | nan  | 2020-09-01 00:00:00 | 2020-09-17 00:00:00 | 1     | nan    | nan      | 0.25   | 120       | P.O.     | 16   | 80      | 20      | 20     | 120       | 140                | 140                | Tue    | Thu     |
|    3 | A00102 | Central   | Cartier  | Deliver | nan  | 2020-09-01 00:00:00 | 2020-09-17 00:00:00 | 1     | nan    | nan      | 0.25   | 120       | P.O.     | 16   | 80      | 20      | 20     | 120       | 140                | 140                | Tue    | Thu     |


In [43]:
Markdown(chunks[300])

|   Index   |    WO   |    District   |    LeadTech   |    Service   |    Rush   |    ReqDate   |    WorkDate   |    Techs   |    WtyLbr   |    WtyParts   |    LbrHrs   |    PartsCost   |    Payment   |    Wait   |    LbrRate   |    LbrCost   |    LbrFee   |    PartsFee   |    TotalCost   |    TotalFee   |    ReqDay   |    WorkDay   |
|-----:|:-------|:----------|:---------|:--------|:-----|:--------------------|:--------------------|:------|:-------|:---------|:-------|:----------|:---------|:-----|:--------|:--------|:-------|:----------|:-------------------|:-------------------|:-------|:--------|
|  927 | A01026 | West      | Khan     | Deliver | nan  | 2021-07-02 00:00:00 | nan                 | 1     | nan    | nan      | nan    | 74.7804   | Account  | nan  | 80      | 0       | 0      | 74.7804   | 74.7804            | 74.7804            | Fri    | Sat     |
|  928 | A01027 | Central   | Cartier  | Install | nan  | 2021-07-02 00:00:00 | nan                 | 2     | nan    | nan      | nan    | 445.1606  | C.O.D.   | nan  | 140     | 0       | 0      | 445.1606  | 445.1606           | 445.1606           | Fri    | Sat     |
|  929 | A01028 | Central   | Khan     | Assess  | nan  | 2021-07-05 00:00:00 | 2021-07-20 00:00:00 | 2     | nan    | nan      | 0.5    | 85.32     | Account  | 15   | 140     | 70      | 70     | 85.32     | 155.32             | 155.32             | Mon    | Tue     |
|  930 | A01029 | West      | Khan     | Assess  | nan  | 2021-07-05 00:00:00 | nan                 | 2     | nan    | nan      | nan    | 180.33    | Account  | nan  | 140     | 0       | 0      | 180.33    | 180.33             | 180.33             | Mon    | Sat     |
|  930 | A01029 | West      | Khan     | Assess  | nan  | 2021-07-05 00:00:00 | nan                 | 2     | nan    | nan      | nan    | 180.33    | Account  | nan  | 140     | 0       | 0      | 180.33    | 180.33             | 180.33             | Mon    | Sat     |
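
Both chunks shown above end with a repeated row (index 3 in the first, index 930 in the second). This follows from the overlap loop in `chunk_markdown_table_with_overlap`, which starts at `mds[i + 1]`, the row that was just appended. The function also returns without emitting any rows collected after the last full chunk. A minimal sketch of a variant that addresses both points, under stated assumptions: the table has a header row and a separator row, `chunk_rows_with_overlap` is a hypothetical name, and `count_tokens` defaults to a whitespace-based stand-in so the example runs without `doc_utils`:

```python
def chunk_rows_with_overlap(md_table, n_tokens=512, overlap=128,
                            count_tokens=lambda s: len(s.split())):
    """Split a Markdown table into chunks that each repeat the header and separator rows."""
    lines = [ln for ln in md_table.split('\n') if ln.strip()]
    prefix = lines[0] + '\n' + lines[1] + '\n'   # header row + separator row
    rows = lines[2:]

    chunks = []
    chunk, pending = prefix, 0                   # pending = rows not yet emitted
    for i, row in enumerate(rows):
        chunk += row + '\n'
        pending += 1
        if count_tokens(chunk) > n_tokens:
            # Overlap: look ahead from the row AFTER the current one
            for nxt in rows[i + 1:]:
                chunk += nxt + '\n'
                if count_tokens(chunk) > n_tokens + overlap:
                    break
            chunks.append(chunk)
            chunk, pending = prefix, 0
    if pending:                                  # keep the trailing partial chunk
        chunks.append(chunk)
    return chunks

table = '\n'.join(["| id | v |", "|---|---|"] + [f"| {i} | x |" for i in range(6)])
chunks = chunk_rows_with_overlap(table, n_tokens=15, overlap=5)
for c in chunks:
    print(c)
```

As in the original function, the overlap looks forward: the rows appended as overlap also open the next chunk, so neighbouring chunks share rows while no row appears twice inside a single chunk.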
