## Document Image Analysis Techniques

- Document layout detection
- Vision transformers

### Preprocessing Using Rule-Based Parsers

Some document types can be preprocessed with rule-based techniques. Formats such as HTML, Markdown, and Word documents have predefined structures and rules, which we can leverage to build a rule-based parser that preprocesses them.
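As a rough illustration of the idea, the sketch below classifies Markdown lines with two regular-expression rules. The `Element` class, the `RULES` table, and `parse_markdown` are hypothetical names made up for this example, and the rules cover only headings and list items, so this is not how any particular library parses Markdown.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    category: str  # e.g. "Title", "ListItem", "NarrativeText"
    text: str


# Each rule pairs a regular expression with the category it implies.
# These two rules are illustrative, not a complete Markdown grammar.
RULES = [
    (re.compile(r"^#{1,6}\s+(.*)$"), "Title"),
    (re.compile(r"^\s*(?:[-*+]|\d+\.)\s+(.*)$"), "ListItem"),
]


def parse_markdown(source: str) -> list[Element]:
    """Classify each non-empty line using the first rule that matches."""
    elements = []
    for line in source.splitlines():
        if not line.strip():
            continue
        for pattern, category in RULES:
            match = pattern.match(line)
            if match:
                elements.append(Element(category, match.group(1).strip()))
                break
        else:
            # No structural rule matched, so treat the line as body text.
            elements.append(Element("NarrativeText", line.strip()))
    return elements


sample = "# Storm report\n\nStorms are coming.\n\n- wind\n- rain\n"
for element in parse_markdown(sample):
    print(f"{element.category.upper()}: {element.text}")
```

The same pattern applies to HTML, where tag names play the role that the leading `#` or `-` characters play here.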

#### Visual Information

Other documents cannot be handled by rule-based preprocessors, because their layout can only be understood visually. Examples include PDFs and images.

For Visual Information techniques, we have:

1. **Document Image Analysis (DIA)**

DIA allows us to extract formatting information and text from the raw image of a document. Under this technique we have:

#### a. Document layout detection (DLD)

> This uses an object detection model to detect and label the elements of a document image and to draw a bounding box around each one. The text inside each bounding box is then extracted.

Steps in document layout detection include:

1. **Vision Detection**

This involves identifying document elements and drawing bounding boxes around them using a computer vision model, e.g. YOLOX or Detectron2.

Reading sources:

[OCR-free Document Understanding Transformer](https://arxiv.org/pdf/2111.15664)

[YOLOX: Exceeding YOLO Series in 2021](https://arxiv.org/pdf/2107.08430)

2. **Text Extraction**

Extract the text from the detected bounding boxes. This is done using Optical Character Recognition (OCR). For some documents, such as PDFs, the text can be extracted directly without OCR. This is called **direct text extraction**.
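One way to picture how the two steps fit together is sketched below. Detected layout boxes and positioned words are joined by checking which box contains the center of each word. The classes `Box`, `Word`, and `LayoutElement`, the function `assign_words`, and all coordinates are hypothetical and hand-written for illustration; a real pipeline would get the boxes from a detection model and the words from OCR or direct text extraction.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass
class Word:
    text: str
    box: Box

    @property
    def center(self) -> tuple[float, float]:
        return ((self.box.x0 + self.box.x1) / 2, (self.box.y0 + self.box.y1) / 2)


@dataclass
class LayoutElement:
    category: str
    box: Box
    words: list[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Naive reading order: top-to-bottom, then left-to-right.
        ordered = sorted(self.words, key=lambda w: (w.box.y0, w.box.x0))
        return " ".join(w.text for w in ordered)


def assign_words(elements: list[LayoutElement], words: list[Word]) -> None:
    """Attach each word to the first element whose box contains its center."""
    for word in words:
        cx, cy = word.center
        for element in elements:
            if element.box.contains(cx, cy):
                element.words.append(word)
                break


# Hand-written stand-ins for detector output and for extracted words.
elements = [
    LayoutElement("Title", Box(50, 40, 550, 80)),
    LayoutElement("NarrativeText", Box(50, 100, 550, 300)),
]
words = [
    Word("storms", Box(130, 110, 190, 125)),
    Word("Atmospheric", Box(60, 50, 180, 70)),
    Word("Soaking", Box(60, 110, 125, 125)),
    Word("rivers", Box(185, 50, 250, 70)),
]
assign_words(elements, words)
for element in elements:
    print(f"{element.category.upper()}: {element.text}")
```

Sorting on the exact top coordinate only works here because the toy words on each line share the same `y0`; real extractors would need to group words into lines with some tolerance.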

#### b. Vision transformers (ViT)


> These models take a document image as input and produce a text representation of a structured output (such as JSON).


A Vision Transformer (ViT) is a type of neural network model used for processing and understanding images. Here's a simple explanation:

### Transformers

Transformers are a type of model originally developed for natural language processing (NLP) tasks, like translating languages or summarizing text. They are good at handling sequential data by focusing on the relationships between elements in the sequence using a mechanism called attention.
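The attention mechanism can be sketched in a few lines of plain Python. This is a simplified version of scaled dot-product attention: it leaves out the learned query, key, and value projections and the multiple heads that real transformers use, and the helper names `softmax`, `dot`, and `attention` are chosen for this example only.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Scaled dot-product attention over a sequence of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # How strongly this element relates to every element in the sequence.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted average of the value vectors.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Three toy token vectors; self-attention uses them as queries, keys and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(tokens, tokens, tokens):
    print([round(x, 3) for x in row])
```

Each output row is a blend of all input vectors, weighted by how similar they are to the query, which is what "focusing on the relationships between elements" amounts to in practice.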

### Applying Transformers to Images

To apply transformers to images, we need to make a few adjustments, since images are not naturally sequential like text. Here's how it works:

1. **Splitting the Image into Patches:**
   - Instead of looking at the whole image at once, the Vision Transformer divides the image into smaller, fixed-size patches (like splitting an image into a grid of smaller squares).
   - For example, a 256x256 image split into 16x16-pixel patches yields a 16x16 grid, i.e. 256 patches.

2. **Flattening the Patches:**
   - Each patch is flattened into a single vector (a long row of numbers) representing the pixel values in that patch.

3. **Embedding the Patches:**
   - These vectors are then mapped by a learned linear projection into embeddings, the fixed-size vectors the transformer operates on. This step is similar to how words are turned into word embeddings in NLP tasks.

4. **Adding Positional Information:**
   - Since the position of each patch in the image matters, we add positional encodings to the embeddings to give the model information about where each patch is located in the original image.

5. **Processing with the Transformer:**
   - The patches, now embedded and encoded with positional information, are fed into the transformer model.
   - The transformer uses its attention mechanism to understand the relationships between patches and to process the image. It essentially looks at how different parts of the image relate to each other.

6. **Classifying or Understanding the Image:**
   - After processing the patches, the model can perform tasks like classifying what is in the image (e.g., identifying a cat, dog, or car) or other image-related tasks.
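Steps 1 to 4 above can be sketched in plain Python for the 256x256 example. Several simplifications are assumed here: the image is grayscale rather than RGB, the projection and position vectors are random stand-ins for weights a real model would learn, the embedding size of 8 is a toy value, and the function names are made up for this example.

```python
import random


def split_into_patches(image, patch_size):
    """Cut a 2D grayscale image (list of rows) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into one vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights; a trained model would learn these values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed_patches(patches, projection, positions):
    """Linearly project each flattened patch, then add its position vector."""
    embedded = []
    for index, patch in enumerate(patches):
        vector = [
            sum(p * w for p, w in zip(patch, weights)) + positions[index][dim]
            for dim, weights in enumerate(projection)
        ]
        embedded.append(vector)
    return embedded


rng = random.Random(0)
image_size, patch_size, embed_dim = 256, 16, 8
image = [[rng.random() for _ in range(image_size)] for _ in range(image_size)]

patches = split_into_patches(image, patch_size)
# One row of weights per output dimension of the embedding.
projection = random_matrix(embed_dim, patch_size * patch_size, rng)
positions = random_matrix(len(patches), embed_dim, rng)
tokens = embed_patches(patches, projection, positions)

print(len(patches), len(patches[0]))  # number of patches, values per patch
print(len(tokens), len(tokens[0]))    # number of tokens, embedding size
```

The resulting `tokens` list plays the same role as a sequence of word embeddings in NLP: it is what the transformer layers in step 5 would consume.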

### Advantages
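- **Flexibility:** Because an image is treated as a sequence of patches, Vision Transformers can handle different image sizes and resolutions, although the positional encodings usually have to be adapted (for example, interpolated) when the resolution changes.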
- **Scalability:** They can be scaled up easily, often resulting in better performance with more data and computational power.
- **Performance:** They have shown excellent performance on various image recognition tasks, sometimes outperforming traditional convolutional neural networks (CNNs).

### Summary
In simple terms, a Vision Transformer takes an image, splits it into small pieces, and then processes these pieces using a transformer model to understand and analyze the image. It's a way of bringing powerful techniques from language processing to the world of computer vision.


### Reading Sources:

[DONUT Architecture](https://medium.com/@renix_informatics/how-to-use-donut-the-document-understanding-transformer-for-document-classification-parsing-and-fde0c7efa3f3)


#### When To Use What?


![DLD And ViT](./images/DLD_and_ViT.png)



In [10]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
import pdfminer

pdfminer.__version__

'20191125'

In [12]:
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.partition.html import partition_html

import os

from unstructured.staging.base import dict_to_elements

In [13]:
s = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

#### Process HTML

In [14]:
filename = "./example_datasets/el_nino.html"
html_elements = partition_html(filename=filename)

INFO: Reading document from string ...
INFO: Reading document ...


In [15]:
for element in html_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

TITLE: CNN
UNCATEGORIZEDTEXT: 1/30/2024
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its first mark on winter
TITLE: By Mary Gilbert, CNN Meteorologist
UNCATEGORIZEDTEXT: Updated: 
        3:49 PM EST, Tue January 30, 2024
TITLE: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the first clear sign of the influence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the flood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Pacific that influences weather around the globe – causes changes in the jet stream that can point storms directly at California. Storms can also tap into an extra-potent supply of moisture from t

#### Process the Document with Document Layout Detection

In [16]:
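filename = "./example_datasets/el_nino.pdf"

# Read the PDF as bytes and wrap it for upload to the Unstructured API.
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

# "hi_res" selects the layout-detection strategy; YOLOX is the detection model.
req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="yolox",
)

try:
    resp = s.general.partition(req)
    # Convert the JSON response into unstructured Element objects.
    dld_elements = dict_to_elements(resp.elements)
except SDKError as e:
    print(e)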

In [17]:
for element in dld_elements[:10]:
    print(f"{element.category.upper()}: {element.text}")

HEADER: 1/30/24, 5:11 PM
HEADER: CNN 1/30/2024
HEADER: Pineapple express: California to get drenched by back-to-back storms fueling a serious ﬂood threat | CNN
TITLE: A potent pair of atmospheric rivers will drench California as El Niño makes its ﬁrst mark on winter
NARRATIVETEXT: By Mary Gilbert, CNN Meteorologist
NARRATIVETEXT: Updated: 3:49 PM EST, Tue January 30, 2024
NARRATIVETEXT: Source: CNN
NARRATIVETEXT: A potent pair of atmospheric river-fueled storms are about to unleash a windy and incredibly wet week in California in what is the ﬁrst clear sign of the inﬂuence El Niño was expected to have on the state this winter.
NARRATIVETEXT: The soaking storms will raise the ﬂood threat across much of California into next week, but it appears the wet pattern is likely to continue well into February as a more typical El Niño pattern kicks into gear.
NARRATIVETEXT: El Niño – a natural phenomenon in the tropical Paciﬁc that inﬂuences weather around the globe – causes changes in the jet st

In [18]:
import collections

In [19]:
len(html_elements)

32

In [20]:
html_categories = [el.category for el in html_elements]
collections.Counter(html_categories).most_common()

[('NarrativeText', 23), ('Title', 6), ('UncategorizedText', 3)]

In [21]:
len(dld_elements)

39

In [22]:
dld_categories = [el.category for el in dld_elements]
collections.Counter(dld_categories).most_common()

[('NarrativeText', 28), ('Header', 6), ('Title', 4), ('Footer', 1)]
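
To make the difference between the two parses easier to see, the category counts can be placed side by side. This is a minimal sketch with the counts copied from the two outputs above; `compare_counts` is a hypothetical helper written for this example, not part of `unstructured`.

```python
import collections

# Counts copied from the two outputs above.
html_counts = collections.Counter(
    {"NarrativeText": 23, "Title": 6, "UncategorizedText": 3}
)
dld_counts = collections.Counter(
    {"NarrativeText": 28, "Header": 6, "Title": 4, "Footer": 1}
)


def compare_counts(left, right):
    """Return (category, left count, right count) rows for every category seen."""
    categories = sorted(set(left) | set(right))
    # A Counter returns 0 for categories it has never seen.
    return [(name, left[name], right[name]) for name in categories]


print(f"{'category':<20}{'html':>6}{'dld':>6}")
for name, html_n, dld_n in compare_counts(html_counts, dld_counts):
    print(f"{name:<20}{html_n:>6}{dld_n:>6}")
```

In this run the rule-based HTML parse produced no `Header` or `Footer` elements, while document layout detection on the PDF found both and reported no `UncategorizedText`.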