# Using Unstructured to Extract Information from Scanned PDFs
- https://unstructured.io/
- https://unstructured-io.github.io/unstructured/index.html
- https://docs.unstructured.io/api-reference/api-services/python-sdk


In [102]:
# %%capture
# %pip install "unstructured[all-docs]"

In [103]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [104]:
from IPython.display import JSON

import json

from unstructured.partition.html import partition_html
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements, elements_to_json

In [105]:
# %pip show unstructured

In [106]:
# import unstructured.partition

# help(unstructured.partition)


In [107]:
filename = "CaseStudies.pdf"

In [108]:
# from unstructured.partition.pdf import partition_pdf

# # Specify the path to your PDF file
# filename = "gpt4all.pdf"

# # Call the partition_pdf function
# # Returns a List[Element] present in the pages of the parsed pdf document
# elements = partition_pdf(filename)

# # Now, elements is a list of all elements present in the pages of the parsed pdf document

In [109]:
# elements

In [110]:
# len(elements)

In [111]:
# element_dict = [el.to_dict() for el in elements]
# output = json.dumps(element_dict, indent=2)
# print(output)

In [112]:
# unique_types = set()

# for item in element_dict:
#     unique_types.add(item['type'])

# print(unique_types)

In [113]:
# from unstructured.partition.pdf import partition_pdf

# # Specify the path to your PDF file
# filename = "data/scanned_gpt4all.pdf"

# # Call the partition_pdf function
# # Returns a List[Element] present in the pages of the parsed pdf document
# elements = partition_pdf(filename)

# # Now, elements is a list of all elements present in the pages of the parsed pdf document

In [114]:
# elements

In [115]:
# len(elements)

In [116]:
# element_dict = [el.to_dict() for el in elements]
# output = json.dumps(element_dict, indent=2)
# print(output)

### Okay, scanned PDF extraction works.

##### No `Table` elements appear, so table information is not extracted as we expected. Let's use a different strategy.

### Table extraction from PDF
- Suppose your PDF contains tables and you want to preserve their structure.
- To do that, set the [strategy](https://unstructured-io.github.io/unstructured/best_practices/strategies.html) parameter to `hi_res`. This uses a combination of computer vision and Optical Character Recognition (OCR) to extract the tables while maintaining their structure. It returns both the text and the HTML of each table, which is very useful for rendering the tables or passing them to an LLM.

> Note: For even better table extraction, Unstructured offers an API that improves upon the existing open-source models.

> Depending on your machine, you may run into different module or library issues. These links might help:
- https://stackoverflow.com/questions/59690698/modulenotfounderror-no-module-named-lzma-when-building-python-using-pyenv-on
- https://unstructured-io.github.io/unstructured/installation/full_installation.html

In [117]:
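# Directory holding the Poppler binaries on this machine (PDF partitioning relies on Poppler to render pages).
# Note: this variable is not passed to partition_pdf below, so the directory also has to be on the system PATH.
poppler_path = r"C:/Users/Hemant.Singhsidar/Downloads/Release-24.08.0-0/poppler-24.08.0/Library/bin"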
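
Since `poppler_path` is only defined and never handed to `partition_pdf`, one way to make the binaries discoverable is to put that directory on `PATH` and then confirm the external tools can be found. The following is a minimal sketch, not part of the original notebook: the helper names `prepend_to_path` and `find_missing_tools` are hypothetical, `/opt/poppler/bin` is a placeholder directory, and the tool names `pdftoppm` (Poppler) and `tesseract` (OCR) are assumed to be the executables your installation needs.

```python
import os
import shutil

def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH (if not already listed) and return the new PATH."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]

def find_missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the executables from `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Demonstration on a throwaway mapping so the real environment is left untouched.
# "/opt/poppler/bin" is a placeholder, not a path from the notebook.
demo_env = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
print(prepend_to_path("/opt/poppler/bin", env=demo_env))

# An empty list means both tools were found on the real PATH.
print(find_missing_tools())
```

In the notebook itself you would call `prepend_to_path(poppler_path)` without the `env` argument before running `partition_pdf`.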

In [118]:
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename=filename,
    infer_table_structure=True,  # store each table's HTML in metadata.text_as_html
    strategy="hi_res",           # layout detection plus OCR
)

In [119]:
len(elements)

126

In [120]:

element_dict = [el.to_dict() for el in elements]
output = json.dumps(element_dict, indent=2)
# print(output)

# Collect the distinct element types found in the document
unique_types = {item["type"] for item in element_dict}

print(unique_types)

{'NarrativeText', 'Table', 'ListItem', 'UncategorizedText', 'Title'}
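
The set above shows which element types are present, but not how many of each. A minimal sketch of a per-type count follows, assuming the same list-of-dicts shape as `element_dict`. The `count_element_types` helper and the `sample_elements` data are illustrative stand-ins, not output from the notebook.

```python
from collections import Counter

def count_element_types(element_dicts):
    """Count how often each value of the 'type' key occurs."""
    return Counter(item["type"] for item in element_dicts)

# Hypothetical stand-in for element_dict; real data comes from el.to_dict().
sample_elements = [
    {"type": "Title"},
    {"type": "Table"},
    {"type": "Table"},
    {"type": "NarrativeText"},
]

type_counts = count_element_types(sample_elements)
for element_type, count in type_counts.most_common():
    print(f"{element_type}: {count}")
```

In the notebook, `count_element_types(element_dict)` would give the same breakdown for the parsed PDF.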


In [121]:
tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)
print(tables[0].metadata.text_as_html)

Sr.No Particulars Amount in Lacs 1 Land 3.00 2 Construction of premises and electricity 8.00
<table><thead><tr><th>Sr.No |</th><th>Particulars</th><th>Amount in Lacs</th></tr></thead><tbody><tr><td>1</td><td>Land</td><td>3.00</td></tr><tr><td>2</td><td>Construction of premises and electricity</td><td>8.00</td></tr></tbody></table>


In [122]:
tables

[<unstructured.documents.elements.Table at 0x163ac9c3a90>,
 <unstructured.documents.elements.Table at 0x1633abf80d0>,
 <unstructured.documents.elements.Table at 0x163acd4bdc0>,
 <unstructured.documents.elements.Table at 0x16336718190>,
 <unstructured.documents.elements.Table at 0x16336d8c3d0>,
 <unstructured.documents.elements.Table at 0x16336dffa90>,
 <unstructured.documents.elements.Table at 0x163ac86c130>,
 <unstructured.documents.elements.Table at 0x163ac86ca90>,
 <unstructured.documents.elements.Table at 0x1633a2d77c0>,
 <unstructured.documents.elements.Table at 0x16367c71760>,
 <unstructured.documents.elements.Table at 0x16339134f70>,
 <unstructured.documents.elements.Table at 0x1633a1b11f0>,
 <unstructured.documents.elements.Table at 0x16336778400>,
 <unstructured.documents.elements.Table at 0x1633a145280>,
 <unstructured.documents.elements.Table at 0x1633a145250>]
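
The default repr above only shows memory addresses. A short sketch of a more readable overview follows, assuming each table exposes `.text` and `.metadata.page_number` as unstructured elements do. The `describe_tables` helper is hypothetical, and the `fake_tables` objects (including the page number) are stand-ins so the example runs without parsing a PDF.

```python
from types import SimpleNamespace

def describe_tables(tables, preview_chars=40):
    """Return one summary line per table: index, page number (if any), text preview."""
    lines = []
    for index, table in enumerate(tables):
        # page_number may be absent on some elements, so fall back to None.
        page = getattr(table.metadata, "page_number", None)
        preview = table.text[:preview_chars]
        lines.append(f"[{index}] page={page} {preview}")
    return lines

# Stand-in objects with the attributes used above; the page number is made up.
fake_tables = [
    SimpleNamespace(
        text="Sr.No Particulars Amount in Lacs 1 Land 3.00",
        metadata=SimpleNamespace(page_number=1),
    ),
    SimpleNamespace(
        text="Mortgage expenses : 10,000",
        metadata=SimpleNamespace(),
    ),
]

for line in describe_tables(fake_tables):
    print(line)
```

Calling `describe_tables(tables)` in the notebook makes it easier to pick the index of the table you want, such as `tables[5]` below.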

In [123]:
len(tables)

15

In [152]:
tables[5].text

'Mortgage expenses : 10,000 Processing fees of the bank: 5,000 Consultant’s charges : 5,000'

In [125]:
tables[0].metadata

<unstructured.documents.elements.ElementMetadata at 0x163ac9c3a00>

### Now comes the most interesting part: using the extracted data efficiently

- It's helpful to have an HTML representation of the table so that you can pass the information to an LLM while maintaining the table structure.
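
As a rough illustration of that idea, the sketch below wraps a table's HTML in a plain-text prompt. The `build_table_prompt` helper, its default instruction, and `example_html` are all hypothetical; any real prompt wording would depend on the model and task.

```python
def build_table_prompt(table_html, instruction="Summarize the following table."):
    """Combine an instruction with the table's HTML so the row/column structure reaches the model."""
    return f"{instruction}\n\nTable (HTML):\n{table_html}\n"

# Small made-up table used only to show the prompt layout.
example_html = "<table><tr><td>Land</td><td>3.00</td></tr></table>"
prompt = build_table_prompt(example_html)
print(prompt)
```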

In [153]:
table_html = tables[5].metadata.text_as_html

In [154]:
table_html

'<table><tbody><tr><td>Mortgage</td><td>expenses</td><td>: 10,000</td></tr><tr><td colspan="3">Processing fees of the bank: 5,000</td></tr><tr><td colspan="3">Consultant’s charges : 5,000</td></tr><tr><td colspan="3">Stamp Duty : 5,000</td></tr><tr><td>Miscellaneous</td><td>expenses :</td><td>5,000</td></tr><tr><td>Total</td><td></td><td>30,000</td></tr></tbody></table>'
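
To inspect the rows behind a string like this without extra dependencies, here is a minimal sketch built on the standard library's `html.parser`. It is an alternative to the lxml approach and makes simplifying assumptions: no nested tables, and `colspan` is ignored so a spanning cell appears once. The `TableRowParser` class and `table_html_to_rows` function are hypothetical names; the sample string is the HTML of `tables[0]` shown earlier.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def table_html_to_rows(table_html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(table_html)
    parser.close()
    return parser.rows

sample_html = (
    "<table><thead><tr><th>Sr.No |</th><th>Particulars</th><th>Amount in Lacs</th></tr></thead>"
    "<tbody><tr><td>1</td><td>Land</td><td>3.00</td></tr>"
    "<tr><td>2</td><td>Construction of premises and electricity</td><td>8.00</td></tr></tbody></table>"
)
for row in table_html_to_rows(sample_html):
    print(row)
```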

In [128]:
# # view what the HTML in the metadata field looks like

# from io import StringIO 
# from lxml import etree

# parser = etree.XMLParser(remove_blank_text=True)
# file_obj = StringIO(table_html)
# tree = etree.parse(file_obj, parser)
# print(etree.tostring(tree, pretty_print=True).decode())

In [129]:
# # let's display this table

# from IPython.core.display import HTML
# HTML(table_html)

#### Now, let's plug in LangChain to summarize these tables using `Llama3` via `Ollama`
#### [Ollama Playlist](https://www.youtube.com/playlist?list=PLz-qytj7eIWX-bpcRtvkixvo9fuejVr8y)

In [130]:
# %%capture
# %pip install langchain-ollama langchain_core langchain_community

In [131]:
# from langchain_ollama import ChatOllama
# from langchain_core.documents import Document
# from langchain.chains.summarize import load_summarize_chain

In [132]:
# ChatOllama??

First, start the Ollama server. By default it listens at  
http://localhost:11434
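
Before invoking the model, it can help to confirm that the server is reachable. The sketch below uses only the standard library and assumes that a running Ollama server answers a plain GET on its root URL with HTTP 200. The `ollama_is_running` helper is a hypothetical name, not part of LangChain or Ollama.

```python
import urllib.error
import urllib.request

def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if a GET on base_url succeeds with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx response, or malformed URL.
        return False

if ollama_is_running():
    print("Ollama server is reachable.")
else:
    print("Ollama server is not reachable; start it before running the chain.")
```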

In [133]:
# llm = ChatOllama(model="llama3.1:8b")
# chain = load_summarize_chain(llm, chain_type="stuff")
# output = chain.invoke([Document(page_content=table_html)])

In [134]:
# output

In [135]:
# print(output['output_text'])

#### Convert to pandas df

In [136]:
# %pip install pandas

In [155]:
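from io import StringIO

import pandas as pd

# Convert the HTML table to a list of pandas DataFrames (one per <table>).
# Wrapping the string in StringIO avoids the deprecation of passing literal HTML to read_html (pandas >= 2.1).
dfs = pd.read_html(StringIO(table_html))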

In [156]:
dfs

[                                    0                                   1  \
 0                            Mortgage                            expenses   
 1  Processing fees of the bank: 5,000  Processing fees of the bank: 5,000   
 2        Consultant’s charges : 5,000        Consultant’s charges : 5,000   
 3                  Stamp Duty : 5,000                  Stamp Duty : 5,000   
 4                       Miscellaneous                          expenses :   
 5                               Total                                 NaN   
 
                                     2  
 0                            : 10,000  
 1  Processing fees of the bank: 5,000  
 2        Consultant’s charges : 5,000  
 3                  Stamp Duty : 5,000  
 4                                5000  
 5                               30000  ]

In [157]:

# Assuming there's only one table, get the DataFrame
df = dfs[0]

# Now you have the DataFrame
print(df)


                                    0                                   1  \
0                            Mortgage                            expenses   
1  Processing fees of the bank: 5,000  Processing fees of the bank: 5,000   
2        Consultant’s charges : 5,000        Consultant’s charges : 5,000   
3                  Stamp Duty : 5,000                  Stamp Duty : 5,000   
4                       Miscellaneous                          expenses :   
5                               Total                                 NaN   

                                    2  
0                            : 10,000  
1  Processing fees of the bank: 5,000  
2        Consultant’s charges : 5,000  
3                  Stamp Duty : 5,000  
4                                5000  
5                               30000  


In [158]:
df.shape

(6, 3)

In [159]:
df.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Mortgage | expenses | : 10,000 |
| 1 | Processing fees of the bank: 5,000 | Processing fees of the bank: 5,000 | Processing fees of the bank: 5,000 |
| 2 | Consultant’s charges : 5,000 | Consultant’s charges : 5,000 | Consultant’s charges : 5,000 |
| 3 | Stamp Duty : 5,000 | Stamp Duty : 5,000 | Stamp Duty : 5,000 |
| 4 | Miscellaneous | expenses : | 5000 |
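
The DataFrame is not tidy yet: cells that had `colspan="3"` in the HTML are repeated across all three columns, and labels and amounts are fused or split unevenly. The sketch below shows one possible way to normalize such rows into `(label, amount)` pairs using plain Python. It assumes rows shaped like `df.values.tolist()`, that repeated cell text within a row comes from `colspan`, and that every row ends in an amount. The `parse_expense_row` helper and `ROW_PATTERN` are hypothetical names; `raw_rows` mirrors the values printed above.

```python
import re

# Label, optional colon, then an amount made of digits and thousands separators.
ROW_PATTERN = re.compile(r"^(?P<label>.+?)\s*:?\s*(?P<amount>\d[\d,]*)$")

def parse_expense_row(cells):
    """Turn one raw table row into (label, amount), or None if no amount is found."""
    present = []
    for cell in cells:
        # Skip missing values: None, or NaN (the only value not equal to itself).
        if cell is None or cell != cell:
            continue
        text = str(cell).strip()
        if text:
            present.append(text)
    # Cells repeated by colspan carry identical text; keep each distinct text once, in order.
    distinct = list(dict.fromkeys(present))
    match = ROW_PATTERN.match(" ".join(distinct))
    if match is None:
        return None
    return match["label"], int(match["amount"].replace(",", ""))

# Rows copied from the DataFrame output above.
raw_rows = [
    ["Mortgage", "expenses", ": 10,000"],
    ["Processing fees of the bank: 5,000"] * 3,
    ["Consultant’s charges : 5,000"] * 3,
    ["Stamp Duty : 5,000"] * 3,
    ["Miscellaneous", "expenses :", "5000"],
    ["Total", float("nan"), "30000"],
]

parsed = [parse_expense_row(row) for row in raw_rows]
for label, amount in parsed:
    print(f"{label}: {amount}")
```

With the rows in this form, the line items can be checked against the `Total` row, or loaded back into a two-column DataFrame.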
