<a href="https://colab.research.google.com/github/Strojove-uceni/2024-final-pr-team/blob/main/TabuVision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install modules

In [1]:
!apt-get update
!apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-ces
!wget https://github.com/tesseract-ocr/tessdata/raw/main/osd.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata/
!pip install pytesseract opencv-python pillow numpy scikit-image
!pip install pdf2image
!apt-get install -y poppler-utils
!pip install ultralytics


## Download files from cloud

In [2]:
!pip install gdown

# Shared link to the Drive folder
shared_folder_url = "https://drive.google.com/drive/folders/1iv-WoXXZHMABZWwk4y3YCojuTmzGbYkh?usp=drive_link"

# Folder ID (taken from the sharing link)
folder_id = "1iv-WoXXZHMABZWwk4y3YCojuTmzGbYkh"

# Download the folder using gdown
!gdown --folder "$folder_id" -O TabuVision

# List the downloaded files
!ls TabuVision


Retrieving folder contents
Retrieving folder 15Oe_ySawTbKeeeICPJ5AUpPGs4o1kL9b backend
Retrieving folder 1wFlm8nC8R8nTh7VPlS8gF8YEeQ7iijhU app
Retrieving folder 12YTST-NftBu3S2NtC1MZEIL6AILrjivX __pycache__
Processing file 1AmoDYsuwoqbw9L40Kk-fBV6IeoTaLjQt Table.cpython-312.pyc
Processing file 1kVeVdsMQ9JcIbT37vCQqa61B4OgpXJOX Table.py
Retrieving folder 1xHYsUvN658FJ2AK2T_1_pnYYMx3ANzkc TableDetection_utils
Retrieving folder 1JUBDwgMcFtz1PdBXrLCQVQC-Gbxs0nbl __pycache__
Processing file 1hS0QBueeVlYYlNU1djt4fScV_ZhlfnYn SkewDetection.cpython-312.pyc
Processing file 152axCQia1_PSSv13FZ0LY2L-iQ9S7TQd RotateDetection.py
Processing file 1pnMAbS8d-ee5kT5xIvazfydITxmI_m_V SkewDetection.py
Retrieving folder 1F7bPCwCGQyVSlT2fF33tWjviSnouWmcq utils
Retrieving folder 1lqvLQmCVvto9Lz5GoU72gqA3ZWG4fXw- __pycache__
Processing file 1--pGRn0vdllDJ80_TFLSnbWd6Ab_2AAE TableExtractor.cpython-312.pyc
Processing file 1M2MDSCX6Qh4tYi6UXPLn3H7Ws-fhXdKE utils.cpython-312.pyc
Processing file 1ZSNeCfgh39bdNUixJ

In [11]:
import json
import sys
import os
sys.path.append('TabuVision')

# Load config file
output_file = 'TabuVision/config/config.json'

if os.path.exists(output_file):
    with open(output_file, 'r') as json_file:
        data = json.load(json_file)
else:
    raise FileNotFoundError(f"Configuration file '{output_file}' is missing. Please ensure it exists in the expected location.")

# Updating values
data['tesseract_exec_location'] = '/usr/bin/tesseract'
data['tessdata_location'] = '/usr/share/tesseract-ocr/4.00/tessdata/'

# Save the changes back to the JSON file
with open(output_file, 'w') as json_file:
    json.dump(data, json_file, indent=4)
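
The load, update, and save steps above can be wrapped in a small helper. This is only a sketch: `update_config` is a hypothetical name that is not part of TabuVision, and it assumes the config file is a flat JSON object as used in the cell above.

```python
import json
from pathlib import Path


def update_config(config_path, **updates):
    """Load a JSON config, apply key/value updates, and write it back.

    Hypothetical helper (not part of TabuVision); returns the updated dict.
    """
    path = Path(config_path)
    if not path.exists():
        raise FileNotFoundError(f"Configuration file '{path}' is missing.")

    # Read the current settings
    data = json.loads(path.read_text(encoding='utf-8'))

    # Overwrite or add the given keys
    data.update(updates)

    # Save the changes back to the same file
    path.write_text(json.dumps(data, indent=4), encoding='utf-8')
    return data
```

In this notebook it would be called as `update_config('TabuVision/config/config.json', tesseract_exec_location='/usr/bin/tesseract', tessdata_location='/usr/share/tesseract-ocr/4.00/tessdata/')`.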


# TabuVision
## TabuVision demo
The following block is a copy of TabuVision.py.

In [12]:
from backend.utils.TableExtractor import TableExtractorCluster, extract_cells
from backend.TableDetection import TableDetection
from backend.StructureDetection import StructureDetection
from backend.ContentDetection import ContentDetection
from PIL import Image
from pathlib import Path
from backend.utils.utils import PDFFormatToPIL, clean_dir_files
import os


class TabuVision:
    def __init__(self, format: str, debug: bool = False):
        """
        TabuVision class handles the table transformation pipeline. It primarily uses classes from the backend folder.
        :param format: output format of the extracted tables.
        :param debug: boolean flag whether to print logs, show log images and other information.
        """

        # Initialize models or other attributes as needed
        self.debug = debug
        self.table_name = None
        self.TableDetectionUnit = TableDetection(debug=debug)
        self.StructureDetectionUnit = StructureDetection(debug=debug)
        self.ContentDetectionUnit = ContentDetection(debug=debug)

        # Initialize table extractor
        self.TableExtractorClusterUnit = TableExtractorCluster(debug=debug)

        # Set attributes
        self.allowed_suffix_image = ['.jpeg', '.jpg', '.png']
        self.cache_dir = 'cache'
        self.output_dir = 'output'
        self.format = format

        # Set allowed file formats
        # In case of adding new formats, you only need to specify the file suffix and
        # provide a function that takes a file_path as input and returns a list of PIL.Image objects.
        PDFToImage = PDFFormatToPIL(debug=debug)
        self.allowed_suffix_others = {'.pdf': PDFToImage}

        # Clean cache and output dirs
        self.setup_dirs()

    def __call__(self, filepath: str, table_name: str):
        """
        Shortcut for run(); starts table extraction.

        :param filepath: filepath of the file to be processed.
        :param table_name: base name used for the extracted tables.
        :return: list of extracted tables in the given format.
        """
        return self.run(filepath, table_name)

    def setup_dirs(self):
        """
        Setup cache and output directories.
        """

        # Create or clean cache dir
        if not os.path.exists(self.cache_dir):
            os.makedirs(self.cache_dir)
        else:
            clean_dir_files(self.cache_dir)

        # Create or clean output dir
        if not os.path.exists(self.output_dir):
            os.makedirs(self.output_dir)
        else:
            clean_dir_files(self.output_dir)

    def run(self, filepath: str, table_name: str = 'table'):
        """
        Run file extraction and pass it to table extraction pipeline.
        :param table_name: name of the table to be processed (optional).
        :param filepath: filepath of the file to be processed.
        :return: list of extracted tables in given format.
        """

        self.table_name = table_name

        # Extract page images from file
        images = self.extract(filepath)

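        if images is None or len(images) == 0:
            print(f'No pages could be extracted from file {filepath}!')
            return None

        # Pass each page image to the extraction pipeline
        output_list = []
        for image in images:
            page_tables = self.to_pipeline(image)
            # to_pipeline returns None when no table is detected on a page;
            # store an empty list instead so callers can iterate over every page
            output_list.append(page_tables or [])

        return output_list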

    def to_pipeline(self, page_img: Image.Image):
        """
        Complete pipeline of processing image of the page and extracting tables.

        :param page_img: Input image of the page.
        :return: list of extracted tables in given format.
        """

        # Process an image
        #

        # Step 1: Detect the tables
        table_images = self.TableDetectionUnit.to_pipeline(page_img)

        if len(table_images) == 0:
            print('No tables detected!')
            return None

        # Analyse structure of each table
        table_idx = 1
        processed_table_list = []

        for table_img in table_images:

            # Step 2: Detect table structure and return predicted objects (class, bbox)
            predicted_objects = self.StructureDetectionUnit.to_pipeline(table_img)

            # Step 3: Retrieve table from predicted objects
            table_object = self.TableExtractorClusterUnit(predicted_objects, f'{self.table_name}_{table_idx}', image_size=table_img.size)

            # Print detected table structure
            if self.debug:
                table_object.plot_table(image=table_img)

            # Step 4: Extract cell content
            # Detects the content of each cell using OCR.
            # Parameter 'fill_on_error' indicates whether the cell image should be retrieved when OCR detection fails.
            table_object = extract_cells(
                table_img,
                table_object,
                mode='ocr',
                fill_on_error=True,
                ContentDetectionUnit=self.ContentDetectionUnit,
                cache_dir=self.cache_dir,
                log_progress=True
            )

            # Step 5: Build the table in the given format from the general table object.
            if self.format == 'html':
                table_html = table_object.to_html(file_name=f'{self.output_dir}/{table_object.filename}_.html', cache_dir=self.cache_dir)
                processed_table_list.append(table_html)

            table_idx += 1

        return processed_table_list

    def extract(self, file_path: str):
        """
        Extract pages from a file as a list of PIL.Image objects. Valid inputs are either images or
        multi-page files, for example a PDF.

        :param file_path: path to the file to be extracted.
        :return: list of pages in PIL.Image format.
        """

        file_path = Path(file_path)
        file_suffix = file_path.suffix.lower()

        # Image file
        if file_suffix in self.allowed_suffix_image:
            if self.debug:
                print(f"Processing image file: {file_path}")

            image = Image.open(file_path)
            return [image]

        # Other file types
        elif file_suffix in self.allowed_suffix_others.keys():
            try:
                transformation_func = self.allowed_suffix_others[file_suffix]
                images = transformation_func(file_path)
                return images

            except Exception as e:
                print(f'An error occurred while extracting a file with suffix {file_suffix}: {e}')
                return None

        else:
            raise ValueError(f"Unsupported file type: {file_path.suffix}")

Creating new Ultralytics Settings v0.0.6 file ✅ 
View Ultralytics Settings with 'yolo settings' or at '/root/.config/Ultralytics/settings.json'
Update Settings with 'yolo settings key=value', i.e. 'yolo settings runs_dir=path/to/dir'. For help see https://docs.ultralytics.com/quickstart/#ultralytics-settings.


In [13]:
# Render an extracted HTML table with some basic styling
from IPython.display import HTML, display

def display_pretty_table(table_html):
    STYLE = """
            <style>
          body {
            font-family: Arial, sans-serif;
            background-color: #f9f9f9;
            margin: 20px;
          }

          table {
            width: 100%;
            border-collapse: collapse;
            margin: 20px 0;
            background-color: white;
            box-shadow: 0 2px 5px rgba(0, 0, 0, 0.1);
            border-radius: 8px;
            overflow: hidden;
          }

          th, td {
            padding: 12px 15px;
            text-align: left;
          }

          th {
            background-color: #f2f2f2;
            color: #333;
            font-weight: bold;
            text-transform: uppercase;
            font-size: 14px;
            border-bottom: 2px solid #e0e0e0;
          }

          tr {
            border-bottom: 1px solid #e0e0e0;
          }

          tr:nth-of-type(even) {
            background-color: #f9f9f9;
          }

          td {
            color: #555;
            font-size: 14px;
          }

          caption {
            margin-bottom: 10px;
            font-size: 18px;
            font-weight: bold;
            color: #333;
          }
        </style>
        """
    display(HTML(STYLE+' '+table_html))


## Let's initialize TabuVision
Just specify the output format for the tables.

In [14]:
TabuVisionApp = TabuVision(
        format='html'
)

# Adjusting paths as the script is executed outside the program’s root directory.
TabuVisionApp.cache_dir = 'TabuVision/cache'
TabuVisionApp.output_dir = 'TabuVision/output'
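
The class comment on `allowed_suffix_others` says that a new input format only needs a file suffix and a function that takes a file path and returns a list of PIL.Image objects. As a minimal sketch of that idea, the loader below reads a multi-page TIFF with Pillow. The name `tiff_to_pil` is hypothetical and not part of TabuVision, and the registration lines are shown as comments because they assume the `TabuVisionApp` instance created above.

```python
from pathlib import Path
from typing import List

from PIL import Image, ImageSequence


def tiff_to_pil(file_path) -> List[Image.Image]:
    """Hypothetical loader: return every frame of a (multi-page) TIFF as a PIL image."""
    with Image.open(Path(file_path)) as tiff:
        # copy() detaches each frame from the file handle, which is closed on exit
        return [frame.copy() for frame in ImageSequence.Iterator(tiff)]


# Assumed registration, relying on the allowed_suffix_others dict of the class:
# TabuVisionApp.allowed_suffix_others['.tiff'] = tiff_to_pil
# TabuVisionApp.allowed_suffix_others['.tif'] = tiff_to_pil
```

Because `extract()` lower-cases the suffix before the lookup, registering the lower-case suffixes should be enough under this assumption.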

## TabuVision can extract tables from an image ...

In [15]:
output_list = \
    TabuVisionApp(
        filepath='TabuVision/PDFs/test_img_3.png',
        table_name='tabuvision_demo'
    )

Analyzing cells: 100%|██████████| 196/196 [00:56<00:00,  3.47it/s]


HTML byl úspěšně uložen do souboru: TabuVision/output/tabuvision_demo_1_.html


Analyzing cells: 100%|██████████| 10/10 [00:02<00:00,  4.43it/s]

HTML byl úspěšně uložen do souboru: TabuVision/output/tabuvision_demo_2_.html





In [16]:
table_idx = 1
for page in output_list:
    for table_html in page:
        print(f'Extracted table #{table_idx}:')
        display_pretty_table(table_html)
        print('\n\n')
        table_idx += 1

Extracted table #1:


označ,2 AKTIVA,řád c,Běžné účetní období,Běžné účetní období,Běžné účetní období,Min.úč. období Netto 4
Unnamed: 0_level_1,Unnamed: 1_level_1,řád c,Brutto,Korekce,Netto,Min.úč. období Netto 4
a,b l,řád c,1,2,3,Min.úč. období Netto 4
,ICHIVA CELREM @. G24 63 408-455),“oor.,sada,el lag,Zase,Oh
B.,JARS O POR RÁN p nn TO EPO A APE ARDO POOR eee a ONE ME Vo P Toa OK V Pk SE OP eRe rhe Sigs sh gk [oe en,PEE eg Sat Bee yeaa dl dán al,oe ad o drnů ee,ea ane tie SA de de,ERR TO BP RE PRS sos ene,ote
B. L,Nebrmotný investiční masek OST 8,,Cee aad,,,
B. Lol,Zřizovací výdaje,005,100,-60,40,66
2,Nehmotné výsledky výzkumné a obdobné činnosti,006,0,0,0,0
3,Software,007,0,0,0,0
4,Ocenitelná práva E,008,0,0,0,0
5,Jiný nehmotný investiční majetek oe,009,0,0,0,0
6,Nedokončené nehmotné investice,010,0,0,0,0
7,Poskytnuté zálohy na nehmotný investiční majetek,011,0,0,0,0





Extracted table #2:


0,1,2,3,4
Cis.,IKF,Rok,Mésic,Ico
Ol,801095,1999,12,25088033
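
The result of a run is a list with one entry per page, and each entry is in turn a list of tables. The nested loop above can be sketched as a small generator that also tolerates a missing result or a page entry that is `None` or empty. `iter_tables` is a hypothetical helper, not part of TabuVision.

```python
def iter_tables(output_list):
    """Yield (table_number, page_number, table_html) for every extracted table.

    Hypothetical helper: output_list is the nested list returned by a run,
    one entry per page; a page entry may be None or empty.
    """
    table_number = 1
    for page_number, page in enumerate(output_list or [], start=1):
        for table_html in page or []:
            yield table_number, page_number, table_html
            table_number += 1
```

With it, the display loop could be written as `for n, page, html in iter_tables(output_list): display_pretty_table(html)`.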







## ... or process a whole PDF

In [19]:
output_list = \
    TabuVisionApp(
        filepath='TabuVision/PDFs/table_test.pdf',
        table_name='tabuvision_demo'
    )

Analyzing cells: 100%|██████████| 55/55 [00:14<00:00,  3.75it/s]


HTML byl úspěšně uložen do souboru: TabuVision/output/tabuvision_demo_1_.html


Analyzing cells: 100%|██████████| 30/30 [00:09<00:00,  3.19it/s]


HTML byl úspěšně uložen do souboru: TabuVision/output/tabuvision_demo_1_.html


Analyzing cells: 100%|██████████| 112/112 [00:31<00:00,  3.52it/s]

HTML byl úspěšně uložen do souboru: TabuVision/output/tabuvision_demo_1_.html
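
All three log lines above report the same output path, `tabuvision_demo_1_.html`: `table_idx` restarts at 1 on every call to `to_pipeline`, so the first table of each page gets the same name. One possible way to keep names unique is a counter shared across pages. The sketch below uses a hypothetical `TableNamer` class that is not part of TabuVision.

```python
from itertools import count


class TableNamer:
    """Hypothetical helper that hands out unique, sequential table names."""

    def __init__(self, base_name: str):
        self.base_name = base_name
        # The counter lives on the instance, so it keeps running across pages
        self._counter = count(1)

    def next_name(self) -> str:
        """Return the next unused name, e.g. 'tabuvision_demo_3'."""
        return f'{self.base_name}_{next(self._counter)}'
```

Under this assumption, one namer would be created per run and `next_name()` would replace the `f'{self.table_name}_{table_idx}'` expression in `to_pipeline`.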





In [20]:
table_idx = 1
for page in output_list:
    for table_html in page:
        print(f'Extracted table #{table_idx}:')
        display_pretty_table(table_html)
        print('\n\n')
        table_idx += 1

Extracted table #1:


0,1,2,3,4
A ZOE,běžné účetní období,běžné účetní období,běžné účetní období,minulé uéetni obdobi netto
JAKA RY,brutto,korekce,netto,minulé uéetni obdobi netto
,1,2,3,minulé uéetni obdobi netto
Dlouhodoby hmotny majetek,6284,4976,1308,971
Stálá aktiva,6284,4976,1308,971
AKTIVA CELKEM,11266,4976,6290,5177
Zasoby,1334,,1334,1232
Oběžná aktiva,4982,,4982,4206
Kratkodobé pohledavky,520,,520,958
Pohledávky,520,,520,958





Extracted table #2:


0,1,2
PASIVA,běžné účetní období,minulé účetní období
,1,2
Zakladni kapital,200,200
Vlastni kapital,3581,3185
PASIVA CELKEM,6290,5177
Výsledek hospodaření minulých let (+/-),1986,2057
Výsledek hospodaření běžného účetního období (+/-),1395,928
Kratkodobé zavazky,2709,1992
Zavazky,2709,1992
Cizi zdroje,2709,1992





Extracted table #3:


0,1,2,3
,Nazev polozky,běžné účetní období,minulé účetní období
,,1,2
I.,Tržby z prodeje výrobků a služeb,34,27
*,Provozní výsledek hospodaření (+/-),1751,1170
*,Cisty obrat za héetni obdobi = I. + I. + TI. + IV. + V. + VI. + VII.,43205,40094
dk,Výsledek hospodaření před zdaněním (+/-),1723,1146
dk,Výsledek hospodaření po zdanění (+/-),1395,928
TER AK,Výsledek hospodaření za účetní období (+/-),1395,928
II.,Tržby za prodej zboží,43171,40067
A1,Náklady vynaložené na prodané zboží,32951,30180





