
# Doc Fusion - Document Extraction Library
#### This notebook demonstrates how to extract data from unstructured documents (PDF and DOCX) in just a few steps.


## Example PDF Document used in this notebook

![pdf-file](../docs/src/images/illustration-1.png)

## Install the required package

In [None]:
# Uncomment the following line if the package is not installed.
# %pip install docfusion==1.0.0

## Import the library

In [1]:
from main import DocFusion

## Extract Documents

#### Configure the DocFusion library to extract documents

Enter the following answer when `DocFusion.configure()` prompts for configuration:

```
What is the expected output? (docs/chunks):  documents
```

In [2]:
# Enter "documents" at the prompt to extract whole documents.
DocFusion.configure()

What is the expected output? (docs/chunks):  documents


Config data successfully written to config/data_config.json


'Configuration saved to config.json'
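The output above reports that the answers were written to `config/data_config.json`. To check what was saved, the file can be read back with the standard `json` module. This is a minimal sketch: the helper name `load_config` is hypothetical, and because the keys inside the file are not documented here, it simply prints whatever it finds.

```python
import json
from pathlib import Path


def load_config(path="config/data_config.json"):
    """Return the parsed JSON config, or None if the file does not exist."""
    config_path = Path(path)
    if not config_path.exists():
        return None
    with config_path.open(encoding="utf-8") as handle:
        return json.load(handle)


config = load_config()
if config is None:
    print("No config file found; run DocFusion.configure() first.")
else:
    # Key names depend on the DocFusion version, so dump them rather than assume.
    print(json.dumps(config, indent=2))
```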

#### Extract Document
Extract unstructured data from a PDF document.

In [3]:
# Provide the path to the document to be sourced. Use the DocFusion source method to extract document pages.
docs = DocFusion.source(
    input_data="Source this: data/insurance.pdf"
)

In [4]:
# Output the number of pages extracted
print(f"{len(docs)} pages sourced.")

12 pages sourced.


In [5]:
# Iterate over the extracted pages and display metadata and content
for page in docs:
    print("Page Metadata:\n---")
    print(page.metadata, "\n")
    print("Page Content:\n---")
    print(page.page_content, "\n")

Page Metadata:
---
{'source': 'data/insurance.pdf', 'page': 1, 'index': 1} 

Page Content:
---
Insurance Policies
Insurance policies provide financial security against unforeseen events. Below is a breakdown
of different types of insurance policies, their coverage, key benefits, and premium calculation
basis.
1. Types of Insurance Policies
The table below summarizes the most common types of insurance policies:
 

Page Metadata:
---
{'source': 'data/insurance.pdf', 'page': 1, 'index': 2} 

Page Content:
---
Each insurance type provides unique benefits, and the premium amount depends on multiple
factors, such as personal information, property value, and risk assessment.
2. Eligibility Criteria
Insurance policies are available to individuals, families, and businesses based on their needs.
Common eligibility criteria include:
●
Stable income for paying premiums
●
Accurate information regarding health, age, and lifestyle
●
Medical screening for life insurance, depending on age and coverage 
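Each item printed above carries a `metadata` dict with `source`, `page`, and `index` keys, and several items can share the same `page`. A small helper can group items by page for inspection. This is a sketch under assumptions: `StubDocument` is a hypothetical stand-in that only mimics the `page_content` and `metadata` attributes used above, and `group_by_page` is not part of DocFusion.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in for the objects returned by DocFusion.source."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_page(documents):
    """Map each (source, page) pair to the list of documents that share it."""
    grouped = defaultdict(list)
    for doc in documents:
        key = (doc.metadata.get("source"), doc.metadata.get("page"))
        grouped[key].append(doc)
    return dict(grouped)


# Stub data mirroring the metadata printed above; replace with the real `docs`.
sample = [
    StubDocument("Insurance Policies ...",
                 {"source": "data/insurance.pdf", "page": 1, "index": 1}),
    StubDocument("Each insurance type ...",
                 {"source": "data/insurance.pdf", "page": 1, "index": 2}),
]

for (source, page), items in group_by_page(sample).items():
    print(f"{source} page {page}: {len(items)} item(s)")
```

Passing the real `docs` list instead of `sample` should work as long as each item exposes `metadata` as a dict.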

#### The document-level output for the PDF is shown below

<img src="../docs/src/images/illustration-2.png" width="60%"/>

## Extract Chunks

### i) Configure the DocFusion library to chunk the text and retrieve tables in Markdown format

Enter the following answers when `DocFusion.configure()` prompts for configuration:

```
What is the expected output? (docs/chunks):  chunks
How would you like to chunk the tables? (Row wise/ Full table):  full table
What is the table output format you are expecting? (md/json):  md
What is the chunk size for text chunking? (input number, ex: 512):  400
What is the chunk text overlap? (input number, ex: 20):  20
```
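The chunk size and overlap answers control how running text is split: consecutive chunks share a few units of text at their boundary so that context is not lost at the cut. The sketch below illustrates the idea with a character-based sliding window. It is not DocFusion's splitter; whether the library counts characters or tokens is an assumption here, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=400, overlap=20):
    """Split text into windows of chunk_size characters overlapping by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 1000 characters with size 400 and overlap 20 give windows starting at 0, 380, 760.
lengths = [len(chunk) for chunk in chunk_text("x" * 1000, chunk_size=400, overlap=20)]
print(lengths)  # [400, 400, 240]
```

A larger overlap repeats more text between neighbouring chunks, which increases the total number of chunks for the same input.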

In [6]:
# Enter "chunks" at the first prompt to extract chunks.
DocFusion.configure()

What is the expected output? (docs/chunks):  chunks
How would you like to chunk the tables? (Row wise/ Full table):  full table
What is the table output format you are expecting? (md/json):  md
What is the chunk size for text chunking? (input number, ex: 512):  400
What is the chunk text overlap? (input number, ex: 20):  20


Config data successfully written to config/data_config.json


'Configuration saved to config.json'

In [7]:
# Provide the path to the document to be sourced. With the "chunks" configuration, the source method returns chunks.
docs = DocFusion.source(
    input_data="Source this: data/insurance.pdf"
)

In [8]:
# Output the number of chunks extracted (the message still says "pages")
print(f"{len(docs)} pages sourced.")

7 pages sourced.


In [9]:
# Iterate over the extracted chunks and display metadata and content
for page in docs:
    print("Page Metadata:\n---")
    print(page.metadata, "\n")
    print("Page Content:\n---")
    print(page.page_content, "\n")

Page Metadata:
---
{'source': 'data/insurance.pdf', 'page': 1, 'index': 1} 

Page Content:
---
Insurance Policies
Insurance policies provide financial security against unforeseen events. Below is a breakdown
of different types of insurance policies, their coverage, key benefits, and premium calculation
basis.
1. Types of Insurance Policies
The table below summarizes the most common types of insurance policies: 

Page Metadata:
---
{'source': 'data/insurance.pdf', 'page': 1, 'index': 2} 

Page Content:
---
Each insurance type provides unique benefits, and the premium amount depends on multiple
factors, such as personal information, property value, and risk assessment.
2. Eligibility Criteria
Insurance policies are available to individuals, families, and businesses based on their needs.
Common eligibility criteria include:
●
Stable income for paying premiums
● 

Page Metadata:
---
{'source': 'data/insurance.pdf', 'page': 1, 'index': 2} 

Page Content:
---
●
Accurate information regarding

#### The full-table chunk output for the PDF, in Markdown format, is shown below

<img src="../docs/src/images/illustration-3.png" width="40%"/>
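For readers unfamiliar with the `md` table format, the sketch below shows how a header and rows map to a pipe-delimited Markdown table kept as a single string, which is the general shape of a full-table chunk. The helper `table_to_markdown` and the placeholder cell values are hypothetical and do not reproduce DocFusion's actual output.

```python
def table_to_markdown(header, rows):
    """Render a header and rows as one pipe-delimited Markdown table string."""
    def format_row(cells):
        return "| " + " | ".join(str(cell) for cell in cells) + " |"

    lines = [format_row(header), format_row(["---"] * len(header))]
    lines.extend(format_row(row) for row in rows)
    return "\n".join(lines)


# Placeholder values only; a real table would come from the sourced document.
header = ["column_a", "column_b"]
rows = [["a1", "b1"], ["a2", "b2"]]
print(table_to_markdown(header, rows))
```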

### ii) Configure the DocFusion library to chunk tables row-wise in JSON format

Enter the following answers when `DocFusion.configure()` prompts for configuration:

```
What is the expected output? (docs/chunks):  chunks
How would you like to chunk the tables? (Row wise/ Full table):  row wise
How many rows in a table to consider for row wise table chunking?:  2
What is the table output format you are expecting? (md/json):  json
What is the chunk size for text chunking?:  512
What is the chunk text overlap?:  15
```
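With row-wise chunking, the "how many rows" answer sets how many table rows go into each chunk, and the `json` format serializes each group of rows. The sketch below illustrates that grouping with 2 rows per chunk. It is an assumption about the general technique, not DocFusion's implementation; `chunk_rows` and the placeholder values are hypothetical.

```python
import json


def chunk_rows(header, rows, rows_per_chunk=2):
    """Group table rows into chunks and serialize each chunk as a JSON string."""
    if rows_per_chunk <= 0:
        raise ValueError("rows_per_chunk must be positive")
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        # Pair every cell with its column name so each chunk is self-describing.
        records = [dict(zip(header, row)) for row in group]
        chunks.append(json.dumps(records))
    return chunks


# Placeholder values only; three rows with 2 rows per chunk yield two chunks.
header = ["column_a", "column_b"]
rows = [["a1", "b1"], ["a2", "b2"], ["a3", "b3"]]
for chunk in chunk_rows(header, rows, rows_per_chunk=2):
    print(chunk)
```

Repeating the column names in every chunk keeps each chunk interpretable on its own, at the cost of some redundancy.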

In [10]:
# Enter "chunks" at the first prompt to extract chunks.
DocFusion.configure()

What is the expected output? (docs/chunks):  chunks
How would you like to chunk the tables? (Row wise/ Full table):  row wise
How many rows in a table to consider for row wise table chunking?:  2
What is the table output format you are expecting? (md/json):  json
What is the chunk size for text chunking?:  512
What is the chunk text overlap?:  15


Config data successfully written to config/data_config.json


'Configuration saved to config.json'

In [11]:
# Provide the path to the document to be sourced. With the "chunks" configuration, the source method returns chunks.
docs = DocFusion.source(
    input_data="Source this: data/insurance.pdf"
)

In [12]:
# Output the number of chunks extracted (the message still says "pages")
print(f"{len(docs)} pages sourced.")

9 pages sourced.


In [13]:
# Iterate over the extracted chunks and display metadata and content
for page in docs:
    print("Page Metadata:\n---")
    print(page.metadata, "\n")
    print("Page Content:\n---")
    print(page.page_content, "\n")

Page Metadata:
---
{'source': 'data/insurance.pdf', 'page': 1, 'index': 1} 

Page Content:
---
Insurance Policies
Insurance policies provide financial security against unforeseen events. Below is a breakdown
of different types of insurance policies, their coverage, key benefits, and premium calculation
basis.
1. Types of Insurance Policies
The table below summarizes the most common types of insurance policies: 

Page Metadata:
---
{'source': 'data/insurance.pdf', 'page': 1, 'index': 2} 

Page Content:
---
Each insurance type provides unique benefits, and the premium amount depends on multiple
factors, such as personal information, property value, and risk assessment.
2. Eligibility Criteria
Insurance policies are available to individuals, families, and businesses based on their needs.
Common eligibility criteria include:
●
Stable income for paying premiums
●
Accurate information regarding health, age, and lifestyle
●
Medical screening for life insurance, depending on age and coverage a

#### The row-wise table chunk output for the PDF, in JSON format, is shown below

<img src="../docs/src/images/illustration-4.png" width="60%"/>