# Financial Spreading Info Extraction

Financial Spreading Info Extraction detects finance-related tables, such as the balance sheet, cash flow statement, and income statement, and extracts their content as key-value pairs. The solution uses text matching, NLP, and machine learning to locate and identify financial tables and extract information from them. This helps banks and other financial institutions rapidly analyse the financial standing and associated risks of borrowers and clients.

The solution automates the extraction of financial spreading information from a company's digital financial statements. This reduces manual effort on a non-value-adding activity and improves the productivity of financial analysts, insurance brokers, and data entry operators.

This solution uses keyword matching, NLP, and machine learning to identify tables in digital documents. It can recognise assets and liabilities, income sources, and expense line items. The output contains a list of pages with their content, the financial periods, and key metrics mapped to their respective financial terms.

### Prerequisite

The kernel comes pre-installed with the required packages. Otherwise, ensure that at least the following Python packages are available in your environment:

    - numpy
    - pandas
    - sklearn

### Contents

1. [Importing libraries for runtime](#Importing-libraries-for-runtime)
1. [Model](#Model)
1. [Batch Transform](#Batch-Transform)
1. [Output](#Output)
1. [Interpretation](#Interpretation) 
1. [Endpoint](#Endpoint)

## Importing libraries for runtime

In [46]:
import pandas as pd
import boto3
import re


### Input Format

This solution identifies key information from company financial statements.

The financial report must be a valid .pdf file.

* The input must be a file named 'Input.zip'.
* The zip file must contain a folder named “Input”. The name is case-sensitive.
* The “Input” folder must contain the .pdf file.

Input.zip
	|--Input
		|--sample_financial_report.pdf


<b> Note: 
 Ensure the Content-Type is 'application/zip' and that the archive contains a folder named "Input".
</b>
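The archive layout described above can be produced with Python's standard `zipfile` module. The following is a minimal sketch, not part of the original notebook: the helper name `build_input_zip` is hypothetical, and it assumes that a single archive entry of the form `Input/<file>.pdf` is enough for the service to see the "Input" folder.

```python
import os
import zipfile


def build_input_zip(pdf_path, zip_path="Input.zip"):
    """Package one PDF as Input/<file name> inside a zip archive.

    Hypothetical helper: assumes an entry named 'Input/<file>.pdf'
    satisfies the required folder layout.
    """
    if not pdf_path.lower().endswith(".pdf"):
        raise ValueError("expected a .pdf file, got: " + pdf_path)
    # The folder name "Input" is case-sensitive, so it is spelled out here.
    arcname = "Input/" + os.path.basename(pdf_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(pdf_path, arcname=arcname)
    return zip_path
```

Calling `build_input_zip("sample_financial_report.pdf")` would write `Input.zip` to the current directory, ready to be uploaded to S3 or sent to an endpoint.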

## Model

### De-Serializing model

The serialized pickle file containing the trained model must be loaded before financial information can be extracted from the input document.

The model is de-serialized to a Python object.

<b> Note: 
    Ensure the trained model exists in the SageMaker container and is placed in the ../model directory.
</b>

In [47]:
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/mphasis-marketplace-financial-spreading-info-v2'

In [48]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [49]:
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)

## Batch Transform


In [50]:
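import json 
import uuid

# Create a transformer that runs on one ml.m5.4xlarge instance
transformer = model.transformer(1, 'ml.m5.4xlarge')
# Run the batch transform job on the zipped input stored in S3
transformer.transform('s3://mphasis-marketplace/Financial_spreading/Input-1/Input.zip', content_type='application/zip')
transformer.wait()
print("Batch Transform complete")
# Element 3 of the '/'-split output path is the first folder after the bucket name
bucketFolder = transformer.output_path.rsplit('/')[3]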

.....................
/opt/program/tika/tika-server-1.9.jar
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
 * Restarting with stat
/opt/program/tika/tika-server-1.9.jar
 * Debugger is active!
 * Debugger PIN: 267-829-433
169.254.255.130 - - [14/Jan/2021 20:05:14] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [14/Jan/2021 20:05:14] "GET /execution-parameters HTTP/1.1" 404 -
2
start function
3
--I--
2021-01-14 20:05:14,551 [Thread-4    ] [WARNI]  Failed to see startup log message; retrying...

In [51]:
#print(s3bucket,s3prefix)
s3_conn = boto3.client("s3")
bucket_name="sagemaker-us-east-2-786796469737"
with open('output.json', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder+'/Input.zip.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket
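The download cell above hardcodes the bucket name and picks the output folder by position in the split path. As a hedged alternative, both values can be derived from `transformer.output_path` itself. The sketch below assumes Python 3 and an output path of the form `s3://<bucket>/<prefix>`. The helper names `split_s3_uri` and `output_key` are hypothetical, and the `.out` suffix follows the `Input.zip.out` key used in the cell above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.strip("/")


def output_key(output_path, input_name="Input.zip"):
    """Return (bucket, key) of the batch transform result for one input file."""
    bucket, prefix = split_s3_uri(output_path)
    # The result object is named after the input file with ".out" appended.
    key = (prefix + "/" if prefix else "") + input_name + ".out"
    return bucket, key
```

With these helpers, `bucket, key = output_key(transformer.output_path)` yields the two arguments that `s3_conn.download_fileobj` expects, without a hardcoded bucket name.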


## Output

Now that the batch transform job has finished and its output file has been downloaded, we can load the JSON output and inspect the extracted financial tables.

In [52]:
import json
with open('output.json') as f:
    data = json.load(f)

print("Output: ")

print(data)

Output: 
{u'balance_sheet': {u'page.No:188': {u'Content': u'\nDigital for all\n\nAnnual Report 2014-15186\n\nConsolidated Financial Statements\n\nParticulars\nPage \nNos.\n\nIndependent Auditor\u2019s Report 187\n\nConsolidated Income Statement 188\n\nConsolidated Statement of Comprehensive Income 188\n\nConsolidated Statement of Financial Position 189\n\nConsolidated Statement of Changes in Equity 190\n\nConsolidated Statement of Cash Flows 191\n\nNotes to Consolidated Financial Statements\n\n1. Corporate Information 192\n\n2. Basis of Preparation 192\n\n3. Summary of Significant Accounting Policies 192\n\n4. Significant Accounting Judgements, Estimates \nand Assumptions\n\n204\n\n5. Standards Issued But Not yet Effective up to \nthe Date of Issuance of the Group\u2019s Financial \nStatements\n\n206\n\n6. Segment Reporting 208\n\n7. Business Combination / Disposal of Subsidiary / \nOther Acquisitions / Transaction with Non-\ncontrolling Interest\n\n212\n\n8. Operating Expenses 214\n\n

## Interpretation

The JSON format is as follows:

Balance sheet: {"page_no:XX": {content: ______, Years: ______, sub-fields: ______},
                "page_no:yy": {content: ______, Years: ______, sub-fields: ______}
                },
Cash flow: {"page_no:aa": {content: ______, Years: ______, sub-fields: ______},
                "page_no:bb": {content: ______, Years: ______, sub-fields: ______}
                },
Profit loss: {"page_no:cc": {content: ______, Years: ______, sub-fields: ______},
                "page_no:dd": {content: ______, Years: ______, sub-fields: ______}
                }
- Content: the whole page content
- Years: the financial years for which the figures are presented
- sub-fields: the important fields in each table
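The nested structure described above can be walked with two loops, one over statement types and one over pages. The sketch below is illustrative only: `list_pages` is a hypothetical helper, the key spellings `'page.No:188'` and `'Content'` are taken from the printed output earlier in this notebook, and the `sample` dictionary is a hand-written stand-in rather than real model output.

```python
import re


def list_pages(data):
    """Return sorted (statement, page_number, content_length) tuples."""
    rows = []
    for statement, pages in data.items():
        for page_key, fields in pages.items():
            # Page keys look like 'page.No:188'; take the trailing digits.
            match = re.search(r"(\d+)\s*$", page_key)
            page_number = int(match.group(1)) if match else None
            content = fields.get("Content", "")
            rows.append((statement, page_number, len(content)))
    # Sort by statement, then page; pages without a number sort first.
    return sorted(rows, key=lambda r: (r[0], r[1] if r[1] is not None else -1))


# Hypothetical stand-in for the real output (not actual results).
sample = {
    "balance_sheet": {
        "page.No:189": {"Content": "Consolidated Statement of Financial Position"},
        "page.No:188": {"Content": "Consolidated Income Statement"},
    }
}
for row in list_pages(sample):
    print(row)
```

The same function can be applied to the `data` dictionary loaded from `output.json` to get a quick overview of which pages were matched for each statement type.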

## Endpoint
Here is a sample real-time endpoint deployment for reference.

In [53]:
import json 
import uuid
import boto3
import sagemaker as sage
from sagemaker import ModelPackage
from sagemaker import get_execution_role


role = get_execution_role()

sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [54]:
content_type='application/zip'
model_name='financial-spreading-model'
real_time_inference_instance_type='ml.m5.xlarge'

In [55]:
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/mphasis-marketplace-financial-spreading-info-v2'

In [56]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [57]:
# Define a predictor factory; model.deploy() calls it with the endpoint name and session.
# Note: RealTimePredictor is the SageMaker Python SDK v1 name; SDK v2 renamed it to Predictor.
def predict_wrapper(endpoint, session):
    return sage.RealTimePredictor(endpoint, session, content_type=content_type)

# Create a deployable model from the model package.
model = ModelPackage(role=role,
                    model_package_arn=model_package_arn,
                    sagemaker_session=sagemaker_session,
                    predictor_cls=predict_wrapper)

In [58]:
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

-----------!

In [59]:
file_name="Input.zip"

In [60]:
!aws sagemaker-runtime invoke-endpoint --endpoint-name $model_name --body fileb://$file_name --content-type 'application/zip' --region us-east-2 output.json

{
    "InvokedProductionVariant": "AllTraffic", 
    "ContentType": "application/json"
}
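The same request can be made from Python instead of the AWS CLI. The sketch below wraps the boto3 `invoke_endpoint` call of the `sagemaker-runtime` client. The helper name `invoke_with_zip` is hypothetical, and it assumes the endpoint replies with a UTF-8 JSON body, as the `"ContentType": "application/json"` response above suggests.

```python
import json


def invoke_with_zip(runtime_client, endpoint_name, zip_path):
    """Send a zip archive to an endpoint and return the parsed JSON reply.

    runtime_client is expected to be boto3.client("sagemaker-runtime").
    """
    with open(zip_path, "rb") as f:
        payload = f.read()
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/zip",
        Body=payload,
    )
    # response["Body"] is a stream; read it fully before decoding.
    return json.loads(response["Body"].read().decode("utf-8"))
```

In this notebook the call would look like `invoke_with_zip(boto3.client("sagemaker-runtime", region_name="us-east-2"), model_name, file_name)`. Passing the client in as an argument keeps the function easy to exercise with a stub when no endpoint is running.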


In [61]:
import json
with open('output.json') as f:
    data = json.load(f)

print("Output: ")

print(data)

Output: 
{u'balance_sheet': {u'page.No:188': {u'Content': u'\nDigital for all\n\nAnnual Report 2014-15186\n\nConsolidated Financial Statements\n\nParticulars\nPage \nNos.\n\nIndependent Auditor\u2019s Report 187\n\nConsolidated Income Statement 188\n\nConsolidated Statement of Comprehensive Income 188\n\nConsolidated Statement of Financial Position 189\n\nConsolidated Statement of Changes in Equity 190\n\nConsolidated Statement of Cash Flows 191\n\nNotes to Consolidated Financial Statements\n\n1. Corporate Information 192\n\n2. Basis of Preparation 192\n\n3. Summary of Significant Accounting Policies 192\n\n4. Significant Accounting Judgements, Estimates \nand Assumptions\n\n204\n\n5. Standards Issued But Not yet Effective up to \nthe Date of Issuance of the Group\u2019s Financial \nStatements\n\n206\n\n6. Segment Reporting 208\n\n7. Business Combination / Disposal of Subsidiary / \nOther Acquisitions / Transaction with Non-\ncontrolling Interest\n\n212\n\n8. Operating Expenses 214\n\n

In [62]:
predictor.delete_endpoint()