# Comprehensive Scanned Doc Pre-processing

Scanned documents often contain pages that are misaligned, low in contrast or brightness, or rotated upside down. These defects make downstream processing harder, whether by OCR, ICR, text extraction or image-based ML/AI modelling.
This solution uses statistical models that estimate the tilt angle from the orientation of the text and its position relative to the page boundaries, then correct the alignment of each page. It measures the contrast between background and text on each page and adjusts the pages where contrast is low. It also uses deep learning models, trained on a dataset of thousands of pages, to detect pages that are upside down. Together these corrections help OCR/ICR engines achieve higher accuracy and improve subsequent text extraction pipelines.
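
As a rough illustration of the contrast step described above, the sketch below flags a page as low contrast when the spread between its dark and bright percentiles is small, then stretches the intensities linearly. This is not the solution's actual model: the function names, the 5th/95th percentile choice and the threshold of 100 are assumptions made for the example, and a page is represented as a flat list of 0-255 grayscale values.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    index = round((pct / 100) * (len(sorted_values) - 1))
    return sorted_values[index]


def is_low_contrast(pixels, threshold=100):
    """Return True when the 5th-95th percentile spread is below threshold (assumed value)."""
    ordered = sorted(pixels)
    return percentile(ordered, 95) - percentile(ordered, 5) < threshold


def stretch_contrast(pixels):
    """Linearly map the 5th-95th percentile range onto 0-255, clipping outliers."""
    ordered = sorted(pixels)
    low, high = percentile(ordered, 5), percentile(ordered, 95)
    if high == low:  # flat page: nothing to stretch
        return list(pixels)
    scale = 255 / (high - low)
    return [min(255, max(0, round((p - low) * scale))) for p in pixels]


# Hypothetical washed-out page: grey text (110) on a light-grey background (150)
page = [150] * 80 + [110] * 20
enhanced = stretch_contrast(page) if is_low_contrast(page) else page
```

In practice the same idea would be applied to a 2-D image array rather than a flat list; the list form only keeps the example free of dependencies.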


### Prerequisites

The kernel comes pre-installed with the required packages. Otherwise, make sure your environment has at least the following Python packages:

- scikit-learn
- numpy
- pandas
- opencv
- keras
- tensorflow


### Contents

1. [Input Data](#Input-Format)
1. [Creating Model](#Creating-Model)
1. [Batch Transform](#Batch-Transform)
1. [Output](#Output)
1. [Invoking through Endpoint](#Invoking-through-Endpoint)

### Input Format


The solution accepts scanned documents as PDF files or images. The input documents must be zipped.




<b> Note: 
    Input files sent to SageMaker must be PDFs or images, packaged in a zip archive.<br>Ensure the Content-Type is 'application/zip'.
</b>

### Input instructions

• The input zip file can contain up to 2 images (supported types are listed below) or one scanned document in PDF format with at most 2 pages.

• Image size must be less than 2 MB and PDF size must be less than 4 MB.

• Supported image types: bmp, dib, jpeg, jpg, jpe, png, pbm, pgm, ppm, tiff, tif.
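
The limits above can be checked locally before uploading. The following is a minimal sketch, not part of the solution itself: `check_inputs` and `build_input_zip` are hypothetical helper names, the size limits are assumed to apply per file with 1 MB taken as 1024 × 1024 bytes, mixing a PDF with images is assumed to be disallowed, and a flat archive layout is assumed. The PDF page count is not checked, since that would need a PDF library.

```python
import os
import zipfile

IMAGE_EXTENSIONS = {".bmp", ".dib", ".jpeg", ".jpg", ".jpe", ".png",
                    ".pbm", ".pgm", ".ppm", ".tiff", ".tif"}
MAX_IMAGE_BYTES = 2 * 1024 * 1024  # assumption: limit applies to each image
MAX_PDF_BYTES = 4 * 1024 * 1024


def check_inputs(paths):
    """Return a list of problems; an empty list means the files look acceptable."""
    problems = []
    exts = [os.path.splitext(p)[1].lower() for p in paths]
    pdfs = [p for p, e in zip(paths, exts) if e == ".pdf"]
    images = [p for p, e in zip(paths, exts) if e in IMAGE_EXTENSIONS]
    for p, e in zip(paths, exts):
        if e != ".pdf" and e not in IMAGE_EXTENSIONS:
            problems.append(f"{p}: unsupported file type")
    if pdfs and images:  # assumption: one PDF or up to 2 images, not both
        problems.append("mix of PDF and images")
    if len(pdfs) > 1:
        problems.append("more than one PDF")
    if len(images) > 2:
        problems.append("more than 2 images")
    for p in images:
        if os.path.getsize(p) >= MAX_IMAGE_BYTES:
            problems.append(f"{p}: image is not smaller than 2 MB")
    for p in pdfs:
        if os.path.getsize(p) >= MAX_PDF_BYTES:
            problems.append(f"{p}: PDF is not smaller than 4 MB")
    return problems


def build_input_zip(paths, zip_path="input.zip"):
    """Validate the files, then write them to a flat zip archive."""
    problems = check_inputs(paths)
    if problems:
        raise ValueError("; ".join(problems))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for p in paths:
            archive.write(p, arcname=os.path.basename(p))
    return zip_path
```

For example, `build_input_zip(["input.png"])` would produce an `input.zip` ready to upload to S3, or raise a `ValueError` listing every violated limit.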




### Output interpretation

• The output contains the pages with corrected tilt, improved contrast/brightness and upright orientation, as images or PDF.

• The output is a zip file.


## Importing libraries for runtime

In [1]:
import pandas as pd
import boto3
import re

## Read the input file

In [3]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img1 = mpimg.imread('input.png')


fig, (ax1) = plt.subplots(nrows=1,ncols=1,figsize=(25,15))

ax1.imshow(img1)
ax1.set_title("Image 1")


plt.show()


## Creating Model


In [3]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'

model_package_arn = ''

from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)

In [4]:
bucket_name = ''
folder_name = ''
sample_zip = f's3://{bucket_name}/{folder_name}/input.zip'

## Batch Transform

Now that the model is ready, we can run a batch transform job to make predictions.

### Prediction Classes - Batch Transform Job

<b>The output zip file contains the pages with corrected tilt, improved contrast/brightness and upright orientation, as images or PDF.</b>

In [10]:
import json 
import uuid


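# One ml.m5.xlarge instance; max_payload is the maximum request size in MB
transformer = model.transformer(1, 'ml.m5.xlarge', max_payload=100)
transformer.transform(sample_zip, content_type='application/zip')
transformer.wait()  # block until the transform job finishes
# transformer.output_path
print("Batch Transform complete")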

INFO:sagemaker:Creating model with name: scanned-document-enhancement-v3-2023-05-07-16-33-48-015
INFO:sagemaker:Creating transform job with name: scanned-document-enhancement-v3-2023-05-07-16-33-48-683


.................... * Serving Flask app 'serve' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on all addresses.
 * Running on http://169.254.255.131:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [07/May/2023 16:37:13] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [07/May/2023 16:37:13] "GET /execution-parameters HTTP/1.1" 404 -
[(0.45835649967193604, 15398), (0.45835649967193604, 15707), (90.0, 42622), (90.0, 90960), (90.0, 298295)]
[90.0]
[90]
[(0.45835649967193604, 15398), (0.45835649967193604, 15707), (90.0, 42622), (90.0, 90960), (90.0, 298295)]
[90.0]
[90]
2023-05-07T16:37:13.566:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=100, BatchStrategy=MULTI_RECORD
ROTATE_90_COUNTERCLOCKWISE
[(0.45835649967193604, 15398), (0.45835649967193604, 15707), (90.0, 42622), (90.0, 90960), (90.0, 298295)]
[90.0]
[90]
ROTATE_90_COUNTERCLOCKWISE
[(0.45835649967193604, 15398), (0.45835649967193604, 157

## Output from Batch Transform

Note: Ensure that the boto3 package is installed on the local system.

In [12]:
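import boto3

# transformer.output_path has the form s3://<bucket>/<prefix>
bucket_name = transformer.output_path.split('/')[2]
bucketFolder = transformer.output_path.split('/', 3)[3]

s3_conn = boto3.client("s3")
# Batch Transform stores each result as <input file name>.out.
# The result is saved locally as input.zip, overwriting any local file of that name.
with open('input.zip', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder+'/input.zip.out', f)
    print("Output file loaded from bucket")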

Output file loaded from bucket
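
The bucket and key handling in the cell above can also be written with the standard URL parser, which copes with prefixes of any depth. This is a sketch under the assumption that the output path is a plain `s3://bucket/prefix` URI; `split_s3_uri` and `transform_output_key` are hypothetical helper names, not SageMaker APIs.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def transform_output_key(output_path, input_name="input.zip"):
    """Bucket and key of the result, using the '<input name>.out' naming."""
    bucket, prefix = split_s3_uri(output_path)
    key = f"{prefix}/{input_name}.out" if prefix else f"{input_name}.out"
    return bucket, key
```

The returned pair can be passed directly as the first two arguments of `download_fileobj`.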


## Output

• The processed output is a zip file.
  
• The output zip file contains the pages with corrected tilt, improved contrast/brightness and upright orientation, as images or PDF.

    

In [13]:
import zipfile
import os
with zipfile.ZipFile('input.zip', 'r') as zip_ref:
    zip_ref.extractall('output')
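
Because the processed pages sit in nested folders inside the archive, it can help to list them before opening one. The sketch below is an optional helper and not part of the solution; `list_output_pages` is a hypothetical name, and the filter simply reuses the supported input extensions plus `.pdf`.

```python
import zipfile

IMAGE_SUFFIXES = (".bmp", ".dib", ".jpeg", ".jpg", ".jpe", ".png",
                  ".pbm", ".pgm", ".ppm", ".tiff", ".tif")


def list_output_pages(zip_path):
    """Return the image and PDF members of the archive, skipping other entries."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith(IMAGE_SUFFIXES + (".pdf",))]
```

Calling `list_output_pages('input.zip')` on the downloaded result returns the member paths, which can then be joined with the extraction folder to locate each page.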

In [2]:
img1 = mpimg.imread('output/image_module/output/input.png')

fig, (ax1) = plt.subplots(nrows=1,ncols=1,figsize=(25,15))

ax1.imshow(img1)
ax1.set_title("Image 1")

plt.show()

## Invoking through Endpoint

In [9]:
import json 
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image
from PIL import Image as ImageEdit

role = get_execution_role()

sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [10]:
content_type='application/zip'
model_name='document-enhancement'
real_time_inference_instance_type='ml.m5.large'

In [23]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
model_package_arn = 'your-arn-number'

In [12]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [13]:
# Define a predictor wrapper function
# (RealTimePredictor is the SageMaker Python SDK v1 name; SDK v2 renamed it to Predictor)
def predict_wrapper(endpoint, session):
    return sage.RealTimePredictor(endpoint, session, content_type=content_type)

# Create a deployable model from the model package.
model = ModelPackage(role=role,
                    model_package_arn=model_package_arn,
                    sagemaker_session=sagemaker_session,
                    predictor_cls=predict_wrapper)

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


In [14]:
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

-------------!

### Invoking endpoint result through CLI command

In [15]:
file_name="input.zip"

In [16]:
!aws sagemaker-runtime invoke-endpoint --endpoint-name $model_name --body fileb://$file_name --content-type 'application/zip' --region us-east-2 output.zip

{
    "ContentType": "application/zip",
    "InvokedProductionVariant": "AllTraffic"
}
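
The same request can be made from Python with the boto3 `sagemaker-runtime` client instead of the CLI. This is a minimal sketch rather than part of the original notebook: `invoke_with_zip` is a hypothetical helper, and the client is passed in as a parameter so that the region and credentials stay under the caller's control.

```python
def invoke_with_zip(runtime_client, endpoint_name, zip_path, out_path="output.zip"):
    """Send a zip to the endpoint and save the returned zip to out_path."""
    with open(zip_path, "rb") as f:
        payload = f.read()
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/zip",
        Body=payload,
    )
    # response["Body"] is a streaming object; read() returns the raw bytes
    with open(out_path, "wb") as f:
        f.write(response["Body"].read())
    return response.get("ContentType")


# Usage inside the notebook (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")
# invoke_with_zip(runtime, model_name, "input.zip")
```

The function returns the response content type, which should match the `"application/zip"` value shown in the CLI output above.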


### Delete Endpoint

In [19]:
predictor.delete_endpoint()