# DeepInsights Text Comprehend

Text Comprehend is a Natural Language Understanding solution that helps users comprehend a passage of text. It is a state-of-the-art, context-aware factoid model that uses bi-directional attention for comprehension and deep contextualized embeddings for distributed word representation. The model's output is a variable-length sub-string of words taken from the context passage.


### Contents

1. [Preparing Input Data](#Preparing-Input-Data)
1. [Creating Model](#Creating-Model)
1. [Batch Transform](#Batch-Transform)
1. [Processing Output](#Processing-Output)

## Importing libraries for runtime

In [7]:
import pandas as pd

In [6]:
import boto3
import re

### Input Format

- The input must be a `.zip` file named `Input.zip` that contains two text files:
    1. `passage.txt`: the passage, which should be between 100 and 1024 words long.
    2. `question.txt`: the question, which should be at least 3 words long.
- Both text files must be UTF-8 encoded.
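The format rules above can be checked before uploading. The following is a minimal sketch, not part of the original notebook: the helper name `build_input_zip` is hypothetical, and counting words by splitting on whitespace is an assumption about how the model measures length.

```python
import zipfile
from pathlib import Path

# Limits taken from the input format description above.
MIN_PASSAGE_WORDS = 100
MAX_PASSAGE_WORDS = 1024
MIN_QUESTION_WORDS = 3


def build_input_zip(passage: str, question: str, out_dir: str = ".") -> Path:
    """Validate both texts and write them into Input.zip as UTF-8 files.

    Hypothetical helper: words are counted by whitespace splitting, which
    may differ from the tokenization the model uses internally.
    """
    n_passage = len(passage.split())
    n_question = len(question.split())
    if not MIN_PASSAGE_WORDS <= n_passage <= MAX_PASSAGE_WORDS:
        raise ValueError(
            f"passage has {n_passage} words, expected "
            f"{MIN_PASSAGE_WORDS} to {MAX_PASSAGE_WORDS}"
        )
    if n_question < MIN_QUESTION_WORDS:
        raise ValueError(
            f"question has {n_question} words, expected at least "
            f"{MIN_QUESTION_WORDS}"
        )

    zip_path = Path(out_dir) / "Input.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Encode explicitly so the archive members are UTF-8 on any platform.
        zf.writestr("passage.txt", passage.encode("utf-8"))
        zf.writestr("question.txt", question.encode("utf-8"))
    return zip_path
```

Calling `build_input_zip(passage_text, "Who popularized the term agile?")` would write `Input.zip` to the current directory, ready to be uploaded to S3 for the batch transform step.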

In [None]:
!unzip Input.zip

In [1]:
!head passage.txt

In [14]:
!head question.txt

Who popularized the term agile?

## Creating Model

In [1]:
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/mphasis-marketplace-text-comprehend-v1'

In [2]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [3]:
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)

## Batch Transform

In [4]:
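import json
import uuid


# Run a batch transform job on a single ml.m5.large instance.
transformer = model.transformer(1, 'ml.m5.large')
transformer.transform('s3://aws-marketplace-mphasis-assets/text-comprehend/Input.zip', content_type='application/zip')
transformer.wait()
print("Batch Transform complete")
# output_path has the form 's3://<bucket>/<folder>'; element 3 of the '/'-split is the folder.
bucketFolder = transformer.output_path.rsplit('/')[3]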


........................2020-03-04T05:52:11.245:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 375-904-931
169.254.255.130 - - [04/Mar/2020 05:52:11] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [04/Mar/2020 05:52:11] "GET /execution-parameters HTTP/1.1" 404 -
File Name                                             Modified             Size
passage.txt                                    2020-02-26 14:17:38         3905
question.txt                                   2020-02-26 14:18:38           31
Extracting all the files now...
Done!
Passage: 
AGILE SOFTWARE DEVELOPMENT

In [8]:
# Download the batch transform result from S3 into a local file.
s3_conn = boto3.client("s3")
bucket_name = "sagemaker-us-east-2-786796469737"
with open('text_comprehend_output.csv', 'wb') as f:
    # Batch transform names each result after its input file plus '.out'.
    s3_conn.download_fileobj(bucket_name, bucketFolder+'/Input.zip.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket
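The download cell hard-codes the bucket name and takes the folder from a fixed position in the split output path. As a hedged alternative, both can be derived from `transformer.output_path` itself. This is a sketch only: the helper names `split_s3_uri` and `output_key` are hypothetical, and it assumes the output path is an `s3://<bucket>/<prefix>` URI and that the result is named after the input file plus `.out`, as in the cell above.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def output_key(output_path: str, input_name: str = "Input.zip") -> Tuple[str, str]:
    """Return (bucket, key) of the batch transform result for input_name.

    Assumes the result object is '<prefix>/<input_name>.out'.
    """
    bucket, prefix = split_s3_uri(output_path)
    name = f"{input_name}.out"
    key = f"{prefix}/{name}" if prefix else name
    return bucket, key
```

With these helpers, `bucket_name, key = output_key(transformer.output_path)` could feed `s3_conn.download_fileobj(bucket_name, key, f)` directly, so the notebook keeps working when it runs in a different account or region.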


## Processing Output

The output is a `.txt` file named `answer.txt` that contains the answer to the question given in the input `question.txt`.


In [18]:
# The raw file has an unnamed first column and a column named "0"; give both readable names.
output_df = pd.read_csv("text_comprehend_output.csv")
output_df.rename(columns={"Unnamed: 0": "Type", "0": "Content"}, inplace=True)
output_df.head()

|   | Type | Content |
|---|------|---------|
| 0 | Question | Who popularized the term agile? |
| 1 | Answer | the Manifesto for Agile Software Development |
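To use the answer programmatically instead of reading it from the displayed frame, the downloaded file can be turned into a small dictionary. The sketch below uses only the standard library. The helper name `read_answer` is hypothetical, and the raw layout (a header row `,0` followed by `Question,...` and `Answer,...` rows) is an assumption inferred from the column names in the rename call, not a documented format.

```python
import csv
import io


def read_answer(csv_text: str) -> dict:
    """Map each row label (e.g. 'Question', 'Answer') to its content.

    Assumes a header row followed by two-column rows of label and text.
    """
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row
    return {row[0]: row[1] for row in rows if len(row) >= 2}


# Sample text mirroring the output shown above (layout is assumed).
sample = (
    ",0\n"
    "Question,Who popularized the term agile?\n"
    "Answer,the Manifesto for Agile Software Development\n"
)
result = read_answer(sample)
print(result["Answer"])  # prints: the Manifesto for Agile Software Development
```

For the real file, the text can be read with `open("text_comprehend_output.csv", encoding="utf-8").read()` and passed to the same function.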
