<a href="https://colab.research.google.com/github/RisingVoicesBk/MemGPT-1/blob/main/Unstructured_V2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Parse the PDF with Unstructured. The core `unstructured` library processes PDFs, HTML, etc., but for PDFs
# with tables it relies on a second library, `unstructured_inference`.
# `unstructured_inference` uses the YOLOX ML model to parse the PDF. PDF libraries like PDFMiner, Camelot, etc. are not used.
# However, YOLOX struggles with complex tables. Specifically, we get many unnecessary "UncategorizedText" elements,
# almost as if the table had been blown apart.
# This code REPLACES the text of YOLOX's Table elements with tables from Tabula (in DataFrame format) and also drops
# the spurious UncategorizedText elements.
# It can therefore be considered a hybrid approach that combines an ML model with conventional coding.
# "Chunking by title" comes out of the box with unstructured-core, but converting objects from
# unstructured_inference to the unstructured-core model is problematic, hence the custom chunk-by-title approach used here.
# Help on this has been requested from the Unstructured community.
# TODO: merge tables that span multiple pages.
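
One possible starting point for the multi-page TODO is sketched below. It is not part of the notebook's pipeline: `PageTable` and `merge_tables_across_pages` are hypothetical names, plain lists stand in for Tabula's DataFrames so the sketch runs without extra dependencies, and it assumes that a table continues onto the next page whenever the column names are identical. Continuation pages that repeat no header row would need a different signal.

```python
from dataclasses import dataclass, field


@dataclass
class PageTable:
    """Hypothetical stand-in for one extracted table: page number, header names, data rows."""
    page: int
    columns: list
    rows: list = field(default_factory=list)


def merge_tables_across_pages(tables):
    """Append a table to the previous one when it sits on the following page
    and has identical column names (the assumed sign of a continuation)."""
    merged = []
    last_page = None  # page of the most recently processed table
    for table in sorted(tables, key=lambda t: t.page):
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and table.page == last_page + 1
            and table.columns == previous.columns
        ):
            previous.rows.extend(table.rows)
        else:
            # Copy so the input tables are left untouched.
            merged.append(PageTable(table.page, list(table.columns), list(table.rows)))
        last_page = table.page
    return merged


# Made-up data: pages 1 and 2 hold one table split in two, page 4 holds a separate one.
tables = [
    PageTable(1, ["Service", "Metric"], [["A", "x"]]),
    PageTable(2, ["Service", "Metric"], [["B", "y"]]),
    PageTable(4, ["Service", "Metric"], [["C", "z"]]),
]
merged = merge_tables_across_pages(tables)
print([(t.page, len(t.rows)) for t in merged])
```

With DataFrames, the `rows.extend` step would presumably become a `pd.concat` of the two frames.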

In [None]:
#SETUP
!pip install "unstructured[pdf]"
!apt-get install -y poppler-utils
!pip install chromadb
!pip install tabula-py
#!pip install "camelot-py[cv]" -q
!pip install 'PyPDF2<3.0'
#!apt-get install ghostscript
!pip install unstructured-inference
#!pip install paddlepaddle-gpu
!pip install "unstructured.PaddleOCR"
import os

os.environ['TABLE_OCR'] = 'paddle'
!apt-get install -y tesseract-ocr

Collecting unstructured[pdf]
  Downloading unstructured-0.11.2-py3-none-any.whl (1.7 MB)
Collecting filetype (from unstructured[pdf])
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting python-magic (from unstructured[pdf])
  Downloading python_magic-0.4.27-py2.py3-none-any.whl (13 kB)
Collecting emoji (from unstructured[pdf])
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
Collecting dataclasses-json (from unstructured[pdf])
  Downloading dataclasses_json-0.6.3-py3-none-any.whl (28 kB)
Collecting python-iso639 (from unstructured[pdf])
  Downloading python_iso639-2023.6.15-py3-none-any.whl (275 kB)

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 15 not upgraded.
Need to get 186 kB of archives.
After this operation, 696 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.3 [186 kB]
Fetched 186 kB in 0s (827 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 120882 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.3_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.3) ...
Setting up poppler-utils (22.02.0-2ubuntu0.3) ...
Processing triggers for man-db (2.10.2-1) ...
Collecting chromadb
  Downloading chromadb-0.4.18-py3-none-any.whl (502 kB)

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 15 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 1s (4,161 kB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 120912 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-
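
Because the setup mixes pip packages with system packages, a quick check that the required executables are on the PATH can save a failed run later. The sketch below is an addition, not part of the original notebook. The tool names are assumptions: `pdftoppm` ships with poppler-utils, `tesseract` with tesseract-ocr, and tabula-py needs a Java runtime. `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Executables the setup cell is expected to provide (assumed names, see above).
REQUIRED_TOOLS = ("pdftoppm", "tesseract", "java")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("All system tools found.")
```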

In [None]:
import unstructured
from unstructured.partition.pdf import partition_pdf
import tabula
import pandas as pd
import PyPDF2
from unstructured_inference.models.base import get_model
from unstructured_inference.inference.layout import DocumentLayout

#Config
pdf_path = "/content/texas-sla.pdf"
MAX_CHARACTERS_PER_CHUNK = 2500
MODEL_NAME = "yolox"

def get_number_of_pages(pdf_path):
    # PdfReader and len(reader.pages) replace the deprecated PdfFileReader / numPages API.
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        return len(reader.pages)

#Use Tabula to infer Tables as Dataframes, Indexed By Page.
def get_all_tables_by_page(pdf_path):
    # Get the total number of pages in the PDF
    total_pages = get_number_of_pages(pdf_path)

    # Dictionary to store tables with their page numbers
    tables_as_dataframes_by_page = {}

    for page in range(1, total_pages + 1):
        dataframes = tabula.read_pdf(pdf_path, pages=page)
        # If tables are found on the page, store them in the dictionary
        if dataframes:
            tables_as_dataframes_by_page[page] = dataframes
    return tables_as_dataframes_by_page

model = get_model(MODEL_NAME)
tables_as_dataframes_by_page = get_all_tables_by_page(pdf_path)
#Invoke Inference
layout = DocumentLayout.from_file(pdf_path, detection_model=model)


yolox_l0.05.onnx:   0%|          | 0.00/217M [00:00<?, ?B/s]

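
To see how many spurious "UncategorizedText" elements the model produced, it can help to count element types before chunking. The sketch below is an addition under stated assumptions: `count_element_types` is a hypothetical helper that relies only on the `.pages`, `.elements`, and `.type` attributes used elsewhere in this notebook, and a stub object replaces the real layout so the block runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def count_element_types(layout):
    """Count elements per type across all pages of a layout-like object."""
    return Counter(el.type for page in layout.pages for el in page.elements)


def _page(*types):
    # Build a stub page whose elements only carry a `type` attribute.
    return SimpleNamespace(elements=[SimpleNamespace(type=t) for t in types])


# Stub data for illustration; in the notebook, pass the real `layout` instead.
stub_layout = SimpleNamespace(pages=[
    _page("Page-header", "Section-header", "Table", "UncategorizedText"),
    _page("Text", "UncategorizedText", "UncategorizedText"),
])
type_counts = count_element_types(stub_layout)
for element_type, count in type_counts.most_common():
    print(f"{element_type}: {count}")
```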

In [None]:
#Helper function to merge consecutive small chunks until they approach min_chunk_length characters
def merge_small_chunks_v3(chunks, min_chunk_length):
    # Helper function to get the text length of an element (text may be None)
    def get_length(element):
        return len(element.text or "")

    merged_chunks = []
    temp_chunk = []
    temp_length = 0

    for chunk in chunks:
        chunk_length = sum(get_length(el) for el in chunk)
        if temp_length + chunk_length < min_chunk_length:
            temp_chunk.extend(chunk)
            temp_length += chunk_length
        else:
            if temp_chunk:
                merged_chunks.append(temp_chunk)
                temp_chunk = []
                temp_length = 0
            merged_chunks.append(chunk)

    # Append any remaining temp_chunk
    if temp_chunk:
        merged_chunks.append(temp_chunk)

    return merged_chunks

#Iterate through Elements of the PDF and Chunk/Group by Section-Header
def chunk_by_section_headers(layout):
    element_to_page_map = {}
    chunks = []
    current_chunk = []
    #Unstructured uses a 0 based index, unlike Tabula.
    for i, page in enumerate(layout.pages):
        #Tabula tables for this page; empty if Tabula found none
        page_tables = tables_as_dataframes_by_page.get(i + 1, [])
        index_of_table_within_page = 0
        for el in page.elements:
            #Ignore Images and the spurious UncategorizedText elements
            if el.type in ["Image", "UncategorizedText"]:
                continue

            #Consider Page-header on only the first page
            if el.type == "Page-header" and i != 0:
                continue

            if el.type == 'Section-header':
                #A new section starts: close the current chunk, skipping it if empty
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = [el]
            elif el.type == "Table":
                #Replace the model's table text with Tabula's DataFrame, matched by order on the page.
                #Keep the model's text if Tabula found fewer tables than the model did.
                if index_of_table_within_page < len(page_tables):
                    el.text = page_tables[index_of_table_within_page].to_string(index=False)
                index_of_table_within_page += 1
                current_chunk.append(el)
            else:
                current_chunk.append(el)

            #Page Number Metadata is needed to show the PDF Page in UI.
            #The map is keyed by text, so elements with identical text keep only the last page seen.
            element_to_page_map[el.text] = i+1
    if current_chunk:
        chunks.append(current_chunk)

    merged_chunks_list = merge_small_chunks_v3(chunks, MAX_CHARACTERS_PER_CHUNK)
    return merged_chunks_list, element_to_page_map

def print_chunk(chunks_param):
    #Print the text of every element in a chunk, followed by a separator line
    for el in chunks_param:
        print(el.text)
    print("------------------------------------------------------------------")

#Get Chunks by Section-Headers and print them; they can later be stored in a VectorDB/ElasticSearch
chunks_list, element_to_page_map = chunk_by_section_headers(layout)
for new_chunk in chunks_list:
    print_chunk(new_chunk)


DIR CONTRACT NO. DIR-TEX-AN-NG-CTSA-010 ATTACHMENT D-1 TO EXHIBIT D SERVICE LEVEL AGREEMENTS 
EXHIBIT D – Service Level Agreements  
1.  Service Level Agreement Matrix  
                   Unnamed: 0 Service Level Agreement Metrics Unnamed: 1 Unnamed: 2
             Category/Service    Mean Time Packet Delivery or        NaN        NaN
                          NaN                    Availability     Jitter    Latency
                          NaN                  To Repair Loss        NaN        NaN
            Internet Services                             NaN        NaN        NaN
           Internet Dedicated                      4 hrs to 8        NaN        NaN
                          NaN                             hrs        NaN        NaN
              (North American      99.90% depending ≥  99.50%     ≤ 1 ms    ≤ 45 ms
             IP Network Only)                       on access        NaN        NaN
                SOHO Services                             NaN        NaN  