In [None]:
# ===============================================================================================================#
# Copyright 2023 Infosys Ltd.                                                                                    #
# Use of this source code is governed by Apache License Version 2.0 that can be found in the LICENSE file or at  #
# http://www.apache.org/licenses/                                                                                #
# ===============================================================================================================#

## 1. Installation Guide

##### Install the libraries required for document preprocessing (Python > 3.8.3)

> `Libs:`

   > `pip install ./lib/infy_dpp_sdk-0.0.5-py3-none-any.whl`

   > `pip install ./lib/infy_dpp_core-0.0.1-py3-none-any.whl`

   > `pip install ./lib/infy_dpp_segmentation-0.0.1-py3-none-any.whl`
   
- ##### `infy_dpp_core` consists of RequestCreator, metadata_extractor, document_data_saver, request_closer, and document_data_updater.
- ##### `infy_dpp_segmentation` consists of segment_generator, segment_parser, chunk_generator, and chunk_saver.
- ##### `infy_dpp_sdk` is used as the DPP framework to create a uniform input/output structure.

To install all dependencies of segmentation, use the following command.

   > `pip install ./lib/infy_dpp_segmentation-0.0.1-py3-none-any.whl['all']`

To install only the dependencies of a specific component, such as segment-generator or segment-parser, use the corresponding command below.
   > `pip install ./lib/infy_dpp_segmentation-0.0.1-py3-none-any.whl['segment-generator']`
   
   > Download `detectron2-0.5+cpu-cp38-cp38-linux_x86_64.whl` from https://dl.fbaipublicfiles.com/detectron2/wheels/cpu/torch1.9/index.html and install it with `pip install ./lib/detectron2-0.5+cpu-cp38-cp38-linux_x86_64.whl`. (It is supported only on Linux.)
   
   > `pip install ./lib/infy_dpp_segmentation-0.0.1-py3-none-any.whl['segment-parser']`
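
After installation, a quick check like the one below can confirm that the three wheels are visible to the current Python environment. This is a minimal sketch: the distribution names are assumed from the wheel file names above, and `installed_versions` is a hypothetical helper, not part of any `infy_dpp` library.

```python
from importlib import metadata

# Distribution names assumed from the wheel file names in this guide.
PACKAGES = ["infy_dpp_sdk", "infy_dpp_core", "infy_dpp_segmentation"]


def installed_versions(names):
    """Return {name: version string, or None if the distribution is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` line usually means the wheel was installed into a different environment than the one running the notebook.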


## 2. Processor Configurations

##### Make sure to configure `pipeline_input_config.json`.
> `1.` `RequestCreator` - `read_path` is the relative path from which the processor fetches the input documents.

> `2.` `RequestCloser` - `output_root_path` is the relative path where the processor writes the output.

> `3.` `SegmentGenerator` -

   > `a.` Make sure the `textProviders` properties are configured correctly. They are used to extract text from the input documents.
   
   > `b.` To detect segments with detectron, download the model from the links below and set the model path and config file path accordingly. (This works only on Linux.)
   
   > github-link - https://github.com/ibm-aur-nlp/PubLayNet/tree/master/pre-trained-models
   
   > model-path - https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/pre-trained-models/Mask-RCNN/model_final.pkl
   
   > config-path - https://github.com/ibm-aur-nlp/PubLayNet/blob/master/pre-trained-models/Mask-RCNN/e2e_mask_rcnn_X-101-64x4d-FPN_1x.yaml
   
   > `c.` Enable the techniques that match the file type, text provider, and model provider.
   
> `4.` `Segment Parser` - Provide the layout and enable the pattern.

> `5.` `ChunkDataParser` - Provide the chunk type, such as `page` or `paragraph`. You can also limit the pages with the `page_num` property.

> `6.` `SaveChunkDataParser` - Provide the root path where the chunks and their metadata are saved in text file format.

## 2a. RequestCreator

In [None]:
{
    "read_path": "/input/",
    "batch_size": 20,
    "filter": {
        "include": [
            "jpg",
            "json"
        ],
        "exclude": [
            "_"
        ]
    },
    "work_root_path": "/work/",
    "queue": {
        "enabled": true,
        "queue_root_path": "/work/queue/"
    }
}

##### `read_path`: The relative path from which the processor fetches documents.
##### `batch_size`: The number of files taken for processing at a time.
##### `filter`: Which documents to include and which to exclude.
##### `work_root_path`: The path where all internal working files are stored.
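
To make the interaction of `filter` and `batch_size` concrete, here is a rough sketch of how a batch of input files could be selected. It is not the library's implementation. It assumes that `include` lists file extensions and that `exclude` lists file-name prefixes (so `"_"` skips names starting with an underscore); neither assumption is confirmed by this guide. `select_batch` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def select_batch(file_names, config):
    """Pick up to `batch_size` file names that pass the include/exclude filter.

    Assumptions (not confirmed by the library): `include` holds file
    extensions, `exclude` holds file-name prefixes.
    """
    include = {ext.lower() for ext in config["filter"]["include"]}
    exclude = tuple(config["filter"]["exclude"])
    selected = []
    for name in sorted(file_names):
        path = PurePosixPath(name)
        extension = path.suffix.lstrip(".").lower()
        if extension not in include:
            continue  # extension is not in the include list
        if exclude and path.name.startswith(exclude):
            continue  # file name starts with an excluded prefix
        selected.append(name)
    return selected[: config["batch_size"]]


config = {
    "batch_size": 2,
    "filter": {"include": ["jpg", "json"], "exclude": ["_"]},
}
files = ["b.jpg", "_skip.jpg", "a.json", "c.pdf", "d.JPG"]
print(select_batch(files, config))  # ['a.json', 'b.jpg']
```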

## 2b. RequestCloser

In [None]:
{
    "queue": {
        "enabled": true,
        "queue_root_path": "/work/queue/"
    },
    "work_root_path": "/work/",
    "output_root_path": "/output/"
}

##### `work_root_path`: This path should be the same as in `RequestCreator`.
##### `output_root_path`: The root path where the output is written; it must be defined.
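
Because the two processors must agree on `work_root_path`, a small pre-flight check can catch mismatches before a run. The sketch below works on plain dictionaries shaped like the two JSON blocks above; `check_request_paths` is a hypothetical helper and does not come from `infy_dpp_core`.

```python
def check_request_paths(creator_cfg, closer_cfg):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    creator_work = creator_cfg.get("work_root_path")
    closer_work = closer_cfg.get("work_root_path")
    if creator_work != closer_work:
        problems.append(
            f"work_root_path differs: RequestCreator={creator_work!r}, "
            f"RequestCloser={closer_work!r}"
        )
    if not closer_cfg.get("output_root_path"):
        problems.append("output_root_path is not defined for RequestCloser")
    return problems


creator = {"read_path": "/input/", "work_root_path": "/work/"}
closer = {"work_root_path": "/work/", "output_root_path": "/output/"}
print(check_request_paths(creator, closer))  # []
```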

## 2c. SegmentGenerator

In [None]:
{
    "textProviders": [
        {
            "provider_name": "tesseract_ocr_provider",
            "properties": {
                "tesseract_path": ""
            }
        },
        {
            "provider_name": "azure_read_ocr_provider",
            "properties": {
                "subscription_key": "",
                "url": ""
            }
        },
        {
            "provider_name": "pdf_box_text_provider",
            "properties": {}
        },
        {
            "provider_name": "json_provider",
            "properties": {
                "template1_file_path": "/data/config/templates/email_template.txt"
            }
        }
    ],
    "modelProviders": [
        {
            "provider_name": "detectron",
            "properties": {
                "model_path": "",
                "config_file_path": "",
                "model_threshold": 0.8
            }
        }
    ],
    "techniques": [
        {
            "enabled": false,
            "name": "technique1",
            "input_file_type": "image",
            "text_provider_name": "tesseract_ocr_provider",
            "model_provider_name": "detectron"
        },
        {
            "enabled": false,
            "name": "technique2",
            "input_file_type": "image",
            "text_provider_name": "azure_read_ocr_provider",
            "model_provider_name": "detectron"
        },
        {
            "enabled": false,
            "name": "technique3",
            "input_file_type": "pdf",
            "text_provider_name": "pdf_box_text_provider",
            "model_provider_name": null
        },
        {
            "enabled": false,
            "name": "technique4",
            "input_file_type": "pdf",
            "text_provider_name": "pdf_box_text_provider",
            "model_provider_name": "detectron"
        },
        {
            "enabled": false,
            "name": "technique5",
            "input_file_type": "json",
            "text_provider_name": "json_provider",
            "model_provider_name": null
        },
        {
            "enabled": true,
            "name": "technique6",
            "input_file_type": "image",
            "text_provider_name": "azure_read_ocr_provider",
            "model_provider_name": null
        }
    ]
}

##### `textProviders`: The supported text providers and their properties.
##### `modelProviders`: A list of model details, needed only if a model is used.
##### `techniques`: Enable the combinations of text provider and model provider to use.
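
Each enabled technique refers to providers by name, so a typo in `text_provider_name` or `model_provider_name` is easy to miss. The following sketch cross-checks those names against the `textProviders` and `modelProviders` lists. `validate_segment_generator` is a hypothetical helper written for this guide, and it treats a `null` model provider as "no model", as in the sample above.

```python
def validate_segment_generator(config):
    """Check that every enabled technique refers to providers that are defined."""
    text_names = {p["provider_name"] for p in config.get("textProviders", [])}
    model_names = {p["provider_name"] for p in config.get("modelProviders", [])}
    problems = []
    for technique in config.get("techniques", []):
        if not technique.get("enabled"):
            continue  # disabled techniques are not checked
        name = technique.get("name")
        text_provider = technique.get("text_provider_name")
        model_provider = technique.get("model_provider_name")
        if text_provider not in text_names:
            problems.append(f"{name}: unknown text provider {text_provider!r}")
        if model_provider is not None and model_provider not in model_names:
            problems.append(f"{name}: unknown model provider {model_provider!r}")
    return problems


config = {
    "textProviders": [{"provider_name": "azure_read_ocr_provider", "properties": {}}],
    "modelProviders": [{"provider_name": "detectron", "properties": {}}],
    "techniques": [
        {
            "enabled": True,
            "name": "technique6",
            "input_file_type": "image",
            "text_provider_name": "azure_read_ocr_provider",
            "model_provider_name": None,
        }
    ],
}
print(validate_segment_generator(config))  # []
```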

## 2d. SegmentDataParser

In [None]:
{
    "layout": {
        "single-column": {
            "enabled": true
        },
        "multi-column": {
            "enabled": false
        }
    },
    "pattern": {
        "sequence-order": {
            "enabled": true
        },
        "left-right": {
            "enabled": false
        },
        "zig-zag": {
            "enabled": false
        }
    }
}

##### `layout`: The column layout of the document.
##### `pattern`: The reading order used for the document, such as top to bottom, left to right, or zig-zag.
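
To see at a glance which layout and pattern are active, the enabled entries can be listed as sketched below. `enabled_options` is a hypothetical helper. The sample enables exactly one layout and one pattern; whether the library accepts more than one at a time is not stated in this guide.

```python
def enabled_options(config, section):
    """Return the option names under `section` whose "enabled" flag is true."""
    return [
        name
        for name, options in config.get(section, {}).items()
        if options.get("enabled")
    ]


parser_config = {
    "layout": {
        "single-column": {"enabled": True},
        "multi-column": {"enabled": False},
    },
    "pattern": {
        "sequence-order": {"enabled": True},
        "left-right": {"enabled": False},
        "zig-zag": {"enabled": False},
    },
}
print(enabled_options(parser_config, "layout"))   # ['single-column']
print(enabled_options(parser_config, "pattern"))  # ['sequence-order']
```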

## 2e. ChunkDataParser

In [None]:
{
    "chunking_method": "page",
    "merge_title_paragraph": false,
    "page_num": [
        "1:10"
    ],
    "exclude": [
        "table",
        "figure"
    ]
}

##### `chunking_method`: Whether chunks are created at the paragraph or page level.
##### `page_num`: The page numbers to extract.
##### `exclude`: The content types to exclude. Note: this currently works only with the `detectron` model provider.
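
The `page_num` value `"1:10"` reads like a page range. The sketch below shows one way such entries could be expanded, assuming that `"start:end"` means pages `start` through `end` inclusive and that a bare number means a single page. The library's actual parsing rules may differ, and `expand_page_ranges` is a hypothetical helper.

```python
def expand_page_ranges(page_num):
    """Expand entries such as "1:3" or "5" into a list of page numbers.

    Assumption: "start:end" is inclusive on both ends.
    """
    pages = []
    for item in page_num:
        text = str(item).strip()
        if ":" in text:
            start, end = (int(part) for part in text.split(":", 1))
        else:
            start = end = int(text)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {item!r}")
        pages.extend(range(start, end + 1))
    # Remove duplicates while keeping the first-seen order.
    return list(dict.fromkeys(pages))


print(expand_page_ranges(["1:3", "5"]))  # [1, 2, 3, 5]
```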

## 2f. SaveChunkDataParser

In [None]:
{
    "chunked_files_root_path": "/vectordb/chunked"
}

##### `chunked_files_root_path`: Where to save the chunked data and its respective metadata.