# Mphasis DeepInsights Document Similarity

The Document Similarity solution computes pairwise similarity between documents, which helps identify whether two documents share similar wording and contextual information. A higher similarity value means the documents carry very similar contextual information and are worded similarly. This helps in removing duplicate documents from a set.
[Mphasis DeepInsights](https://www.mphasis.com/home/innovation/nextlabs/deepInsights.html) is a cloud-based cognitive computing platform that offers data extraction & predictive analytics capabilities.
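
The notebook does not state how the model package computes its score. Purely as an illustration of what a pairwise similarity between 0 and 1 can look like, here is a minimal sketch that assumes bag-of-words cosine similarity. This is an assumed technique, not necessarily the one the model uses, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the word-count vectors of two texts (0.0 to 1.0)."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words the two texts share
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Build a document-to-document matrix as a nested dict, rounded to 3 places."""
    names = list(docs)
    return {a: {b: round(cosine_similarity(docs[a], docs[b]), 3) for b in names}
            for a in names}
```

Under this assumption a document compared with itself scores 1, and two documents with no words in common score 0, which matches the range of the output matrix described later.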

### Prerequisite

The kernel comes pre-installed with the required packages. Otherwise, ensure that your environment has at least the following Python packages:

    - scikit-learn (imported as sklearn)
    - numpy
    - pandas
    - scipy
    - zipfile (part of the Python standard library)
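
A quick way to check whether these packages are importable in the current kernel is sketched below. The helper name `missing_packages` is hypothetical, and the list uses import names (`sklearn` rather than scikit-learn).

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


required = ["sklearn", "numpy", "pandas", "scipy", "zipfile"]
# Prints the names that are missing from this environment
print(missing_packages(required))
```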

### Contents

1. [Input Data](#Input-Data)
1. [Creating the Model](#Creating-the-Model)
1. [Batch Transform](#Batch-Transform)
1. [Output Data](#Output-Data)

## Input Data
The input is a zip of text files.

<b> Note: 
    The input file sent to SageMaker must be a .zip archive containing text files with 'utf-8' encoding. Ensure the Content-Type is 'application/zip'.
</b>
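
As a minimal sketch of preparing such an archive, the helper below writes a set of texts into a zip as UTF-8 files. The function name `build_input_zip` is hypothetical, and the top-level `Input` folder is an assumption based on the sample archive used in this notebook, which extracts to an `Input` directory.

```python
from zipfile import ZipFile, ZIP_DEFLATED


def build_input_zip(documents, zip_path="Input.zip", folder="Input"):
    """Write {file_name: text} pairs into a zip archive as UTF-8 text files."""
    with ZipFile(zip_path, "w", ZIP_DEFLATED) as zip_obj:
        for file_name, text in documents.items():
            # Encode explicitly so every file in the archive is UTF-8
            zip_obj.writestr(f"{folder}/{file_name}", text.encode("utf-8"))
    return zip_path
```

For example, `build_input_zip({"a.txt": "first document", "b.txt": "second document"})` would produce an `Input.zip` containing `Input/a.txt` and `Input/b.txt`.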

### Importing libraries for runtime

In [38]:
import pandas as pd
import boto3
import re
from zipfile import ZipFile
import os

In [50]:
# Extract the archive; the text files land in the 'Input' folder
with ZipFile('Input.zip', 'r') as zipObj:
    zipObj.extractall()
print('\033[1m'+'Input File:'+'\033[0m\n')
print('The Input zip file contains the following list of files:\n')
files_ip = os.listdir('Input')
for count, f in enumerate(files_ip, start=1):
    print(str(count)+') '+f)
print('\n'+'\033[1m'+'Sample input of one of the files \'Vodafone_Idea_AGR_dues.txt\' is as below:'+'\033[0m'+'\n')

# Read the sample file as UTF-8 text, the encoding the model expects
with open('Input/Vodafone_Idea_AGR_dues.txt', 'r', encoding='utf-8') as f:
    doc = f.read()
    print(doc)

Input File:

The Input zip file contains the following list of files:

1) Progressive Media - Company News.txt
2) Effects of air pollution on human health and practical measures for prevention in Iran.txt
3) Vodafone_Idea_AGR_dues.txt
4) BlackRock Investment Management.txt
5) Air Pollution and Climate and Health(WHO).txt
6) Air Pollution - The Carter Center.txt
7) TC_bill30%_increase.txt
8) Vodafone_Jio_competition.txt
9) Air Pollution - Unicef.txt
10) RelianceJio_biggest_telecom_firm.txt
11) Jio_Airtel_lose_30mn_customers.txt
12) Health impacts of air pollution - SCOR.COM.txt
13) Client Asset Risks.txt

Sample input of one of the files 'Vodafone_Idea_AGR_dues.txt' is as below:

The Supreme Court recently quashed the review petition filed by incumbent operators Vodafone Idea and Bharti Airtel on the AGR (adjusted gross revenues) dues. The apex court's October judgement has affected 15 telcos but only five of them are operational right now. Out of these five, nearly 60 p

## Creating the Model

We now create a Model resource in SageMaker using the Mphasis DeepInsights(TM) Document Similarity ModelPackage.

The Model can then be deployed to an Endpoint, with the zip of text files sent as input to the API exposed at that Endpoint for predictions. This notebook runs the predictions with a Batch Transform job instead.

In [40]:
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/mphasis-marketplace-doc-sim-v2'

In [41]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [42]:
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)

## Batch Transform

Now that the input file is in place, the model package uses it to infer the similarity index matrix.

### Prediction Classes - Batch Transform Job

The output of the prediction is a CSV file containing a document-to-document similarity matrix.


In [43]:
import json
import uuid

# Run a batch transform job on a single ml.m5.large instance
transformer = model.transformer(1, 'ml.m5.large')
transformer.transform('s3://mphasis-marketplace/doc-sim/Input.zip', content_type='application/zip')
transformer.wait()
print("Batch Transform complete")
# output_path looks like s3://<bucket>/<folder>; element 3 of the split is the folder
bucketFolder = transformer.output_path.rsplit('/')[3]

.............. * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 132-505-881
169.254.255.130 - - [28/Jan/2020 08:56:26] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [28/Jan/2020 08:56:26] "GET /execution-parameters HTTP/1.1" 404 -
######### zip extracted into Texts folder #########
######### Before reading the text files in zip folder #########
2020-01-28T08:56:26.209:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB

In [44]:
#print(s3bucket,s3prefix)
s3_conn = boto3.client("s3")
bucket_name="sagemaker-us-east-2-786796469737"
with open('FILE_NAME', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder+'/Input.zip.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket
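
The cells above hard-code the bucket name and take only the first path segment of `transformer.output_path`. A minimal sketch of splitting an `s3://` URI into its bucket and key prefix with the standard library follows. The helper name `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical output location, used only to illustrate the split
bucket, prefix = split_s3_uri("s3://example-bucket/example-job-folder")
output_key = prefix + "/Input.zip.out"
print(bucket, output_key)
```

Unlike indexing into a split on `/`, this keeps the whole prefix when the output path has more than one folder level.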


## Output Data

The processed output is a .csv file containing a matrix of similarity indices (between 0 and 1) for the documents provided. The values are interpreted as:

    - 0 being least similar
    - 1 being most similar
    

In [45]:
output_df  = pd.read_csv("FILE_NAME")
#output_df  = output_df.drop('Unnamed: 0',1)
#out_final = output_df[["Input","Sentiment"]]
print('\033[1m'+'Output:'+'\033[0m')
output_df.head(10)

Output:


Unnamed: 0.1,Unnamed: 0,Health impacts of air pollution - SCOR.COM.txt,RelianceJio_biggest_telecom_firm.txt,Client Asset Risks.txt,Air Pollution - The Carter Center.txt,Vodafone_Idea_AGR_dues.txt,BlackRock Investment Management.txt,TC_bill30%_increase.txt,Air Pollution and Climate and Health(WHO).txt,Progressive Media - Company News.txt,Jio_Airtel_lose_30mn_customers.txt,Effects of air pollution on human health and practical measures for prevention in Iran.txt,Air Pollution - Unicef.txt,Vodafone_Jio_competition.txt
0,Health impacts of air pollution - SCOR.COM.txt,1.0,0.03,0.04,0.486,0.064,0.038,0.009,0.508,0.045,0.015,0.453,0.542,0.031
1,RelianceJio_biggest_telecom_firm.txt,0.03,1.0,0.058,0.068,0.124,0.042,0.135,0.029,0.048,0.385,0.039,0.02,0.415
2,Client Asset Risks.txt,0.04,0.058,1.0,0.094,0.123,0.681,0.017,0.07,0.715,0.032,0.038,0.039,0.027
3,Air Pollution - The Carter Center.txt,0.486,0.068,0.094,1.0,0.131,0.105,0.026,0.486,0.092,0.063,0.263,0.382,0.052
4,Vodafone_Idea_AGR_dues.txt,0.064,0.124,0.123,0.131,1.0,0.133,0.219,0.064,0.123,0.131,0.064,0.029,0.157
5,BlackRock Investment Management.txt,0.038,0.042,0.681,0.105,0.133,1.0,0.021,0.053,0.715,0.057,0.028,0.021,0.025
6,TC_bill30%_increase.txt,0.009,0.135,0.017,0.026,0.219,0.021,1.0,0.026,0.017,0.167,0.011,0.004,0.213
7,Air Pollution and Climate and Health(WHO).txt,0.508,0.029,0.07,0.486,0.064,0.053,0.026,1.0,0.044,0.021,0.437,0.538,0.012
8,Progressive Media - Company News.txt,0.045,0.048,0.715,0.092,0.123,0.715,0.017,0.044,1.0,0.041,0.037,0.025,0.023
9,Jio_Airtel_lose_30mn_customers.txt,0.015,0.385,0.032,0.063,0.131,0.057,0.167,0.021,0.041,1.0,0.024,0.016,0.243
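
Since the matrix is meant to help spot near-duplicate documents, one way to use it is to list every pair whose score meets a chosen cut-off. The sketch below assumes the CSV layout displayed above (two leading columns holding a row index and the row's document name) and uses a three-document excerpt of those values. The function name `similar_pairs` and the 0.5 threshold are illustrative choices, not part of the model package.

```python
import csv
import io

# Excerpt of the matrix shown above, in the same column layout
SAMPLE_CSV = """Unnamed: 0.1,Unnamed: 0,Client Asset Risks.txt,BlackRock Investment Management.txt,Vodafone_Idea_AGR_dues.txt
0,Client Asset Risks.txt,1.0,0.681,0.123
1,BlackRock Investment Management.txt,0.681,1.0,0.133
2,Vodafone_Idea_AGR_dues.txt,0.123,0.133,1.0
"""


def similar_pairs(csv_text, threshold):
    """Return (doc_a, doc_b, score) for each distinct pair scoring >= threshold."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    column_names = rows[0][2:]  # skip the index and document-name columns
    pairs = []
    for row in rows[1:]:
        row_name = row[1]
        for col_name, value in zip(column_names, row[2:]):
            # The matrix is symmetric: keep each pair once and skip the diagonal
            if row_name < col_name and float(value) >= threshold:
                pairs.append((row_name, col_name, float(value)))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


for doc_a, doc_b, score in similar_pairs(SAMPLE_CSV, 0.5):
    print(f"{doc_a} <-> {doc_b}: {score}")
```

The cut-off that counts as a duplicate depends on the document set, so it is worth inspecting a few pairs by hand before removing anything.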


In [61]:
print('Interpretation of the matrix taking the example of two files: \n\n 1) \'Progressive Media - Company News.txt\' \n 2) \'Client Asset Risks.txt\' \n\n The similarity score computed is: '+'\033[1m'+'0.715'+'\033[0m'+', which means the documents are more similar than different\n')
print('File 1: Progressive Media - Company News.txt\n')
with open ('Input/Progressive Media - Company News.txt','rb') as f:
    file = f.read()
    print(file)
print('\n')
print('File 2: Client Asset Risks.txt\n')
with open ('Input/Client Asset Risks.txt','rb') as f:
    file = f.read()
    print(file)    

Interpretation of the matrix taking the example of two files: 

 1) 'Progressive Media - Company News.txt' 
 2) 'Client Asset Risks.txt' 

 The similarity score computed is: 0.715, which means the documents are more similar than different

File 1: Progressive Media - Company News.txt

HIGHLIGHT: The Financial Services Authority (FSA) has penalized BlackRock Investment Management
(UK) (BIM) £9,533,100, for failing to ensure proper protection of client money.
The UK financial regulator accused the wealth manager for not putting trust letters in place for certain money
market deposits, and for failing to take reasonable care to organize and control in relation to the identification
and protection of client money.
FSA has laid a rule for protecting clients' money to avert any loss in the event of a firm's insolvency, and the
firms must clearly identify and ring-fence the clients' money from the firm's own assets so that it can be
promptly returned.
During the course of probe, the U