# Mphasis DeepInsights Key Phrase Extractor

Mphasis DeepInsights is a cloud-based cognitive computing platform that offers data extraction and predictive analytics capabilities. The Key Phrase Extractor collects the important key phrases from a given text. This module uses an end-to-end key-phrase extraction pipeline, combining text analysis and natural language processing techniques, to automate keyword extraction from text documents. The solution is an unsupervised, graph-based algorithm that constructs a word network in order to identify the most relevant keywords.
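
As a rough illustration of what an unsupervised, graph-based keyword ranker does, the sketch below builds a word co-occurrence network and scores its nodes with a PageRank-style iteration, in the spirit of TextRank. It is not the Mphasis implementation: the stopword list, the window size, and every function name here are assumptions made for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption; a real pipeline would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "on", "in", "to", "is", "it", "and", "by", "for", "from"}

def build_word_graph(text, window=2):
    """Link words that co-occur within `window` positions of each other."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph

def rank_words(graph, damping=0.85, iterations=50):
    """Score nodes with a PageRank-style power iteration on the undirected graph."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return sorted(scores, key=scores.get, reverse=True)

def top_keywords(text, k=5):
    """Return the k highest-ranked words in the co-occurrence network."""
    return rank_words(build_word_graph(text))[:k]
```

Words that co-occur with many different neighbours accumulate a higher score, which is why well-connected terms tend to surface first. Multi-word phrases would need an extra step that merges adjacent high-scoring words, which this sketch leaves out.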

### Prerequisite

The kernel comes pre-installed with the required packages. Otherwise, ensure that your environment has at least the following Python packages:

- numpy
- pandas
- nltk

### Contents

1. [Importing libraries for runtime](#Importing-libraries-for-runtime)
1. [Model](#Model)
1. [Batch Transform](#Batch-Transform)
1. [Output](#Output)
1. [Endpoint](#Endpoint)

## Importing libraries for runtime

In [1]:
import pandas as pd
import boto3
import re

### Input Format
The input file for SageMaker should be a .txt file with 'utf-8' encoding. Ensure the Content-Type is 'text/plain'.
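
A quick local check of these two requirements can save a failed job. The helper below is a minimal sketch; the function name `is_utf8_text_file` is hypothetical and not part of the model package.

```python
from pathlib import Path

def is_utf8_text_file(path):
    """Return True if `path` ends in .txt and its bytes decode as UTF-8."""
    path = Path(path)
    if path.suffix.lower() != ".txt":
        return False
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For example, `is_utf8_text_file("input.txt")` could be called before uploading the file to S3.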

In [2]:
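# Read the sample input as UTF-8; the context manager closes the file afterwards
with open("input.txt", "r", encoding="utf-8") as file1:
    print(file1.read())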

Uttar Pradesh Chief Minister Yogi Adityanath on Friday flagged off the Tejas Express, the country's first "private" train run by its subsidiary IRCTC,
on the Lucknow-New Delhi route. The commercial run of the train starts on Saturday.
The Tejas Express cuts the time travelled between the two cities to 6.15 hours from the 6.40 hours taken by the Swarn Shatabdi,
currently the fastest train on the route."It is the first corporate train of the country. I congratulate the first batch of passengers travelling
in it and hope such initiatives are taken to connect other cities as well," Adityanath said.
"I thank Prime Minister Narendra Modi and Railway Minister Piyush Goel for giving the first corporate train from the biggest state to Delhi.
This is a competitive era and and there is need for environment friendly public transport to be accepted in society," Adityanath said.
When mobile phones were first introduced, charges were astronomical but now every person has a mobile phone, he sai

## Model

### De-Serializing model

The serialized pickle file containing the trained model must be loaded before key phrases can be extracted from the input text.

The model is de-serialized to a Python object.

<b> Note: 
    Ensure the trained model exists in the SageMaker container and is placed in the ../model directory.
</b>

In [3]:
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/mphasis-marketplace-keyphrase-extractor'

In [4]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [5]:
model = ModelPackage(model_package_arn=model_package_arn,
                    role = role,
                    sagemaker_session = sagemaker_session)

## Batch Transform


In [6]:
# Run a batch transform job on a single ml.m5.large instance
transformer = model.transformer(1, 'ml.m5.large')
transformer.transform('s3://mphasis-marketplace/topic-identification/input.txt', content_type='text/plain')
transformer.wait()
print("Batch Transform complete")
# output_path is assumed to have the form s3://<bucket>/<folder>; keep the folder name
bucketFolder = transformer.output_path.split('/')[3]

 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 228-130-405
169.254.255.130 - - [07/Feb/2020 10:01:16] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [07/Feb/2020 10:01:16] "GET /execution-parameters HTTP/1.1" 404 -
2020-02-07T10:01:16.746:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
Before Processing ['narendra modi', 'minister narendra', 'pradesh chief minister yogi', 'tejas express', 'public', 'frie

In [7]:
# Download the transform output (<input file name>.out) from the output bucket
s3_conn = boto3.client("s3")
bucket_name="sagemaker-us-east-2-786796469737"
with open('output.csv', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder+'/input.txt.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket
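
The bucket name and folder above are hard-coded and index-based. As a hedged alternative, both could be derived from the transformer's output path. The helper names below (`split_s3_uri`, `transform_output_key`) are hypothetical, and the sketch assumes the output object is named after the input file with `.out` appended, as in the download call above.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

def transform_output_key(output_path, input_name="input.txt"):
    """Return (bucket, key) of the batch transform result for `input_name`."""
    bucket, prefix = split_s3_uri(output_path)
    name = f"{input_name}.out"
    key = f"{prefix.rstrip('/')}/{name}" if prefix else name
    return bucket, key
```

With these, `bucket, key = transform_output_key(transformer.output_path)` would feed straight into `s3_conn.download_fileobj(bucket, key, f)`.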


## Output

Now that the input text and the trained model are ready, the model can be used to extract the important topics and keywords from the text. The processed output is a .csv file containing all the keywords found in the input text.

In [8]:
output_df = pd.read_csv("output.csv")
# Drop the index column written by the model; the keyword form works across pandas versions
output_df = output_df.drop(columns='Unnamed: 0')
out_final = output_df[["Key Topics"]]
print("Output: ")
out_final.head(10)

Output: 


Unnamed: 0,Key Topics
0,pradesh chief minister yogi
1,minister narendra
2,environment friendly public transport
3,mobile phone
4,train start
5,subsidiary irctc
6,corporate train
7,first batch
8,tejas express
9,minister piyush


## Endpoint
Here is a sample real-time endpoint for reference.

In [9]:
import boto3
import sagemaker as sage
from sagemaker import ModelPackage
from sagemaker import get_execution_role

# IAM role and session used to create the model and endpoint
role = get_execution_role()

sagemaker_session = sage.Session()
bucket = sagemaker_session.default_bucket()

In [10]:
content_type='text/plain'
model_name='key-phrase-model'
real_time_inference_instance_type='ml.m5.xlarge'

In [11]:
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/mphasis-marketplace-keyphrase-extractor'

In [12]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [13]:
# Define a factory function that wraps the endpoint in a predictor.
# RealTimePredictor is the SageMaker Python SDK v1 name; SDK v2 renamed it to Predictor.
def predict_wrapper(endpoint, session):
    return sage.RealTimePredictor(endpoint, session, content_type=content_type)

# Create a deployable model from the model package.
model = ModelPackage(role=role,
                    model_package_arn=model_package_arn,
                    sagemaker_session=sagemaker_session,
                    predictor_cls=predict_wrapper)

In [14]:
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

-----------!

In [15]:
file_name="input.txt"

In [16]:
!aws sagemaker-runtime invoke-endpoint --endpoint-name $model_name --body fileb://$file_name --content-type 'text/plain' --region us-east-2 output.csv

{
    "InvokedProductionVariant": "AllTraffic", 
    "ContentType": "text/csv; charset=utf-8"
}
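
The same call can be made from Python through the boto3 `sagemaker-runtime` client instead of the AWS CLI. The sketch below is an assumption-laden equivalent: the function name `invoke_keyphrase_endpoint` is hypothetical, and the client is passed in so the function itself does not need AWS credentials to be defined.

```python
def invoke_keyphrase_endpoint(runtime_client, endpoint_name, text):
    """Send plain text to the endpoint and return the CSV response as a string."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/plain",
        Body=text.encode("utf-8"),
    )
    # The response body is a stream of bytes; read and decode it.
    return response["Body"].read().decode("utf-8")

# Usage (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")
# with open("input.txt", encoding="utf-8") as f:
#     print(invoke_keyphrase_endpoint(runtime, "key-phrase-model", f.read()))
```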


In [17]:
# Read the input text and send it to the deployed endpoint
with open('./input.txt', mode='r', encoding='utf-8') as f:
    data = f.read()
prediction = predictor.predict(data)
print(prediction)

,Key Topics
0,minister narendra
1,fast train
2,pradesh chief minister yogi
3,minister piyush
4,friendly public
5,train start
6,environment friendly public transport
7,mobile phone
8,corporate train
9,other cities
10,tejas express
11,first batch
12,subsidiary irctc
13,railway minister piyush goel
14,mobile phones
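
The response is CSV text with an unnamed index column followed by `Key Topics`. A minimal sketch for turning it into a plain Python list is shown below; the function name `parse_key_topics` is hypothetical, and the code accepts either bytes or str because the return type depends on how the predictor is configured.

```python
import csv
import io

def parse_key_topics(payload):
    """Return the 'Key Topics' column of the CSV response as a list of strings."""
    text = payload.decode("utf-8") if isinstance(payload, bytes) else payload
    reader = csv.DictReader(io.StringIO(text))
    return [row["Key Topics"] for row in reader if row.get("Key Topics")]
```

Calling `parse_key_topics(prediction)` on the response above would yield the phrases in the order the endpoint returned them.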



In [18]:
predictor.delete_endpoint()