# Financial documents
In this notebook, we show how to read the FUNSD dataset [1]. The dataset contains annotated forms and serves as a benchmark for machine learning applied to smart documents. The FinTorch repository provides code to load the dataset and prepare it for use by a machine learning model. Here we discuss the FinTorch version of the dataset in more detail.

## FUNSD dataset
The dataset consists of 199 annotated forms. The complete statistics of the dataset are:

| Statistic                | Value   |
|--------------------------|---------|
| Fully Annotated Forms    | 199     |
| Total Words              | 31,485  |
| Semantic Entities        | 9,707   |
| Relations                | 5,304   |

An example of such an annotated form is the following:
![Annotated forms](https://guillaumejaume.github.io/FUNSD/img/two_forms.png)

The dataset is split into the following folders:

* **training_data**: contains the training examples
* **test_data**: contains the test examples

Each of these two folders contains:

* **annotations**: a directory with JSON files
* **images**: a directory with PNG files
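The folder layout above can be traversed with a few lines of standard-library Python. The following is a minimal sketch, not FinTorch's own loader: the helper name `pair_annotations_with_images` is hypothetical, and it assumes that each annotation file and its image share the same file stem (for example `form.json` and `form.png`).

```python
from pathlib import Path
from typing import List, Tuple


def pair_annotations_with_images(split_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (annotation, image) path pairs that share a file stem.

    Assumption: `split_dir` contains `annotations/` and `images/`
    sub-directories, as described in the folder structure above.
    """
    annotation_dir = split_dir / "annotations"
    image_dir = split_dir / "images"
    pairs = []
    for annotation_path in sorted(annotation_dir.glob("*.json")):
        image_path = image_dir / f"{annotation_path.stem}.png"
        if image_path.exists():  # skip annotations without a matching image
            pairs.append((annotation_path, image_path))
    return pairs
```

Calling the helper once on the training folder and once on the test folder gives two lists of paired files that can then be read with any JSON and image library.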

Next, we discuss each in more detail.

## Annotation
The forms are annotated, and the annotations are stored in JSON format.

Here is an example snippet provided by Jaume et al. [1]:

![](https://guillaumejaume.github.io/FUNSD/img/semantic_entity_example.png)

and the corresponding `json` annotation:


```json
    {
        "form": [
        {
            "id": 0,
            "text": "Registration No.",
            "box": [94,169,191,186],
            "linking": [
                [0,1]
            ],
            "label": "question",
            "words": [
                {
                    "text": "Registration",
                    "box": [94,169,168,186]
                },
                {
                    "text": "No.",
                    "box": [170,169,191,183]
                }
            ]
        },
        {
            "id": 1,
            "text": "533",
            "box": [209,169,236,182],
            "label": "answer",
            "words": [
                {
                    "box": [209,169,236,182],
                    "text": "533"
                }
            ],
            "linking": [
                [0,1]
            ]
        }
    ]
    }

```

The JSON file contains a single "form" key, which holds a list with all annotated entities of a document.
Each entity has an *id*, a *box* with the coordinates of two opposite corners of its bounding box, given as [x_left, y_top, x_right, y_bottom], the *text* contained in the box, a *label* with the type of annotation (question or answer in this example), and *words*, which lists each individual word together with its own bounding box. The *linking* field lists the links between entities as pairs of *id* values. In this example, the link [0, 1] indicates that id 0 (the question) has a directed link to id 1 (the answer).
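To make the structure concrete, the sketch below parses an abbreviated copy of the example annotation with the standard `json` module and resolves each link into a question/answer pair. The helper names `question_answer_pairs` and `box_size` are illustrative and not part of FinTorch, and the box layout [x_left, y_top, x_right, y_bottom] is an assumption about the coordinate order.

```python
import json

# The example annotation from above, abbreviated to the fields used here.
annotation_json = """
{
  "form": [
    {"id": 0, "text": "Registration No.", "box": [94, 169, 191, 186],
     "label": "question", "linking": [[0, 1]]},
    {"id": 1, "text": "533", "box": [209, 169, 236, 182],
     "label": "answer", "linking": [[0, 1]]}
  ]
}
"""


def question_answer_pairs(annotation: dict) -> list:
    """Resolve every [from, to] link into a (question text, answer text) pair."""
    entities = {entity["id"]: entity for entity in annotation["form"]}
    # The same link appears on both of its endpoints, so a set removes duplicates.
    links = {
        tuple(link)
        for entity in annotation["form"]
        for link in entity["linking"]
    }
    pairs = []
    for source_id, target_id in sorted(links):
        source, target = entities[source_id], entities[target_id]
        if source["label"] == "question" and target["label"] == "answer":
            pairs.append((source["text"], target["text"]))
    return pairs


def box_size(box: list) -> tuple:
    """Width and height of a box, assuming [x_left, y_top, x_right, y_bottom]."""
    x_left, y_top, x_right, y_bottom = box
    return x_right - x_left, y_bottom - y_top


annotation = json.loads(annotation_json)
print(question_answer_pairs(annotation))  # [('Registration No.', '533')]
print(box_size(annotation["form"][0]["box"]))  # (97, 17)
```

Because each link is stored on both the question and the answer entity, collecting the links in a set before resolving them avoids reporting the same pair twice.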

## Loading the dataset
The following Python snippet loads the `InvoiceDataset`, which can be passed to a PyTorch `DataLoader` or used in a model training pipeline. Once the dataset is loaded, you can iterate through batches, transform them, and feed them into your PyTorch models.

```python
import logging
from pathlib import Path

from fintorch.datasets.invoice import InvoiceDataset

# Show the INFO messages emitted while the dataset is downloaded and processed.
logging.basicConfig(level=logging.INFO)

# Directory used for the dataset files.
data_path = Path("~/.fintorch_data/invoice-data/").expanduser()
invoice_data = InvoiceDataset(data_path, force_reload=False)

print(f"Length of the dataset: {len(invoice_data)}\nFirst 2 records:")

# Print the first two records.
for i in range(2):
    print(invoice_data[i])
```

Running the cell produces the following output:

```
INFO:root:Loading invoice dataset
INFO:root:Downloading the FUNSD dataset
100%|██████████| 16.8M/16.8M [00:00<00:00, 46.3MiB/s]
INFO:root:Download and extraction complete
INFO:root:Processing: apply transformation to FUNSD dataset
INFO:root:Processing training data
Processing files: 100%|██████████| 149/149 [00:03<00:00, 49.11it/s]
INFO:root:Processing test data
Processing files: 100%|██████████| 50/50 [00:01<00:00, 46.38it/s]

Length of the dataset: 199
First 2 records:
{'image': tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]],

        [[1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         ...,
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.],
         [1., 1., 1.,  ..., 1., 1., 1.]]]), 'meta': [{'box': [340, 110, 405, 128], 'text': 'LORILLARD', 'label': 'other', 'words': [{'box': [340, 110, 405, 128], 'text': 'LORILLARD'}], 'linking':
```
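Each record printed above is a dictionary with an `image` entry and a `meta` entry, where `meta` is a list whose length depends on the form. When batching such records, a custom collate function is one way to keep these variable-length lists intact. The following is a minimal sketch under that assumption: `collate_records` is a hypothetical helper, not part of FinTorch, and the toy records use nested lists as stand-ins for image tensors so that the example runs without PyTorch.

```python
from typing import Any, Dict, List


def collate_records(batch: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group a list of records into one dict of lists, without stacking."""
    return {
        "image": [record["image"] for record in batch],
        "meta": [record["meta"] for record in batch],
    }


# Toy records that mimic the shape of the printed records.
# Assumption: nested lists stand in for the image tensors.
toy_batch = [
    {
        "image": [[1.0, 1.0]],
        "meta": [{"box": [340, 110, 405, 128], "text": "LORILLARD", "label": "other"}],
    },
    {"image": [[1.0], [1.0]], "meta": []},
]

collated = collate_records(toy_batch)
print(len(collated["image"]), [len(meta) for meta in collated["meta"]])  # 2 [1, 0]
```

With PyTorch installed, a function like this could be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument. Whether the default collation would also work depends on whether all images and `meta` lists in a batch have the same size.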

## Conclusion
In this tutorial, we explored how to access and interpret the FUNSD dataset: we reviewed its folder structure and annotations and showed how it can be loaded using FinTorch utilities. With this foundation, you can integrate the data into a PyTorch pipeline for batching, preprocessing, and model training. The annotation details (bounding boxes, text, labels, and linking) illustrate the richness of this dataset for machine learning tasks involving document understanding, which makes it a useful resource for smart document analysis in finance and beyond.


## References
- [1] Jaume, G., Ekenel, H. K., & Thiran, J.-P. (2019). FUNSD: A dataset for form understanding in noisy scanned documents. In Proceedings of the International Conference on Document Analysis and Recognition - Open Service Track (ICDAR-OST).
- [2] arXiv version of [1]: https://arxiv.org/pdf/1905.13538