## Company Profile Creation Model Package from AWS Marketplace 


An NLP and knowledge-graph-based approach to creating a company profile by aggregating data from multiple sources.

This solution builds a knowledge graph from entity-name pairs extracted from data collected from multiple sources, such as Wikipedia, the company's website, and CrunchBase. The result is a graph model of the company's profile derived from unstructured data.

This sample notebook shows you how to deploy Company Profile Creation.



#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy this ML model successfully, ensure that:
    1. Either your IAM role has the following three permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to Company Profile Creation. If so, skip step: [Subscribe to the model package](#1.-Subscribe-to-the-model-package)
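
The three Marketplace permissions listed above could be granted through an inline IAM policy. The following is a minimal sketch rather than an official policy: the variable name `marketplace_policy` and the `"Resource": "*"` scope are assumptions, so adjust them to your organization's standards.

```python
import json

# Hypothetical inline policy granting the three AWS Marketplace actions
# named in the prerequisites. "Resource": "*" is an assumption.
marketplace_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",
        }
    ],
}

# Print the JSON document that would be attached to the role
print(json.dumps(marketplace_policy, indent=2))
```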

#### Contents:
1. [Subscribe to the model package](#1.-Subscribe-to-the-model-package)
2. [Create an endpoint and perform real-time inference](#2.-Create-an-endpoint-and-perform-real-time-inference)
   1. [Create an endpoint](#A.-Create-an-endpoint)
   2. [Create input payload](#B.-Create-input-payload)
   3. [Perform real-time inference](#C.-Perform-real-time-inference)
   4. [Visualize output](#D.-Visualize-output)
   5. [Delete the endpoint](#E.-Delete-the-endpoint)
3. [Perform batch inference](#3.-Perform-batch-inference) 
4. [Clean-up](#4.-Clean-up)
    1. [Delete the model](#A.-Delete-the-model)
    2. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))
    

#### Usage instructions
You can run this notebook one cell at a time by pressing Shift+Enter.

### 1. Subscribe to the model package

To subscribe to the model package:
1. Open the **Company Profile Creation** model package listing page.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **"Accept Offer"** if you and your organization agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose a **region**, a **Product Arn** is displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and assign it to `model_package_arn` in the notebook.
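
Because the ARN is region-specific, it can help to check the copied value before deploying. The sketch below splits a model package ARN into its parts. The helper name `parse_model_package_arn` and the example ARN are hypothetical and are not part of the SageMaker SDK.

```python
def parse_model_package_arn(arn):
    """Split a SageMaker model package ARN into its named parts."""
    # ARN layout: arn:partition:service:region:account:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError("Not a SageMaker ARN: " + arn)
    resource_type, _, resource_name = parts[5].partition("/")
    if resource_type != "model-package" or not resource_name:
        raise ValueError("Not a model package ARN: " + arn)
    return {
        "partition": parts[1],
        "region": parts[3],
        "account": parts[4],
        "name": resource_name,
    }


# Placeholder ARN for illustration only; use the one shown on your listing page.
example_arn = "arn:aws:sagemaker:us-east-2:123456789012:model-package/example-package"
info = parse_model_package_arn(example_arn)
print(info["region"], info["name"])
```

Comparing `info["region"]` with the region of your SageMaker session is one way to catch an ARN copied for the wrong region.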

#### Helper code to crawl data from the website


In [15]:
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
import re
import os
import json

def save(site, filePath):
    """
    Reads the given URL, extracts its headings and paragraph text, and saves
    the resulting JSON object to filePath + /page.json.
    Input:  site = URL of the page
            filePath = path of the directory in which to save page.json
    Output: list of relative links (href values) found in the page's a-tags,
            or 3 if the page could not be fetched
    """

    #Code for creating pathway to page.json file if one does not exist
    path = filePath.split("/")
    curr = ""
    while (len(path)):
        curr += ("/" if curr != "" else curr) + path.pop(0)
        try:
            os.mkdir(curr)
        except:
            pass
    data = {}

    #Fetches the page; on failure the function returns 3 and nothing is written to disk
    try:

        site = str(urlopen(site).read())
        #--------entry point OF html----------------#
        
        if '<html' not in site:
            print("Not an html site")
            site = "<xYnC>failed</xYnC>"
        else:
            aTags = re.findall(r'<a.*?href=[\"|\'](/.*?)[\"|\']', site)
    except:
        print("Invalid URL or failed to access: " + site)
        site = "<xYnC>failed</xYnC>"
    if (site == "<xYnC>failed</xYnC>"):
        data["error"]="Invalid URL or failed to access"
        return 3
    #processing for the site data
    else:
        #Remove head
        site = re.sub(r'<head.*?>.*?</head>', '', site, 1)
        try:
            #tags that we are going to pick out
            tags = [r'h1', r'h2', r'h3', r'p']
            #creates the stack that we will be using
            queue = [(0, site, data)]
            #runs the code till we create the full dict structure
            while queue:
                #pops elem from stack and processes the components
                pog = queue.pop(-1)
                tagIndex = pog[0]
                site = pog[1]
                d = pog[2]
                #handles the p-tag case (captures li and a-tag text as well)
                if tagIndex == len(tags) - 1:
                    pUnf = [re.sub(r'(?:<.*?>|\\+(?:[A-Za-mo-rt-z0-9]+| |\'))', '', x).strip()
                                                  for x in re.findall(r'<(?:[pa]|li).*?>(.*?)<(?:/p>|/a>|/li>|(?:h1|h2|h3).*?>)', site)]
                    d["p"] = pUnf
                #Captures given headers and assigns content to it
                else:
                    #finding headers
                    titles = [re.sub(r'(?:<.*?>|\\+(?:[A-Za-mo-rt-z0-9]+| |\'))', '', x).strip()
                                          for x in re.findall(r'<' + tags[tagIndex] + r'.*?>(.*?)</(?:' +
                                                              tags[tagIndex] + r'|div)>', site)]
                    #if headers exist, we assign content and push to stack
                    if titles:
                        x = re.search(r'(.*?)<' + tags[tagIndex] + r'.*?>', site).group(0)
                        if x:
                            queue.append((tagIndex + 1, x, d))
                        contents = [x for x in re.findall(r'</' + tags[tagIndex] + r'>.*?<(?:' + tags[tagIndex] + r'|div).*?>', site)]
                        site = re.sub(r'</' + tags[tagIndex] + r'>.*?<(?:' + tags[tagIndex] + r'|div).*?>', '', site)
                        contents.append(site)
                        for x in range(len(titles)):
                            d[titles[x]] = {}
                            queue.append((tagIndex + 1, contents[x], d[titles[x]]))
                            queue.append((tagIndex + 1, site, d))
                    # if headers dont exist of this kind we move onto the next kind
                    else:
                        queue.append((tagIndex + 1, site, d))
            #saves the data to the fp (file path)
            with open(filePath + '/page.json', 'w') as fp:
                json.dump(data, fp)
        except Exception as e:
            data["error"] = "Did not finish"
            with open(filePath + '/page.json', 'w') as fp:
                json.dump(data, fp)
            print(e)
        return aTags


def crawl(landing, depth=3, baseDir="data", lander="", thorough=True):
    """
    Performs a breadth-first traversal of the given website and saves the
    processed pages in baseDir using the save() function
    Input:  landing = URL of the website (nothing after .com, .edu, etc.)
            depth = maximum depth of the traversal
            baseDir (optional) = directory in which crawl stores the folders
                                    it creates
            lander (optional) = limits the pages saved to URLs containing
                                    landing + lander
            thorough (optional) = if True, also follows links on pages that
                                    are not saved
    """

    #Standardized inputs
    while landing[-1] == "/":
        landing = landing[0:-1]
    while lander != "" and lander[-1] == "/":
        lander = lander[0:-1]
    #Setting directory base
    inDir = baseDir + "/"
    try:
        os.mkdir(inDir)
    except:
        pass
    site = landing + lander
    queue = [(0, site)]
    #keeping track of failed visits
    revisit = []
    #keeping track of visits to stop repeats
    visited = []
    #works till queue is not empty
    while (len(queue)):
        #takes first elem in queue and then standardizes
        curr = queue.pop(0)
        print(curr)
        if landing not in curr[1]:
            site = landing + curr[1]
        else:
            site = curr[1]
        #processing the site
        if site not in visited:
            print(site)
            visited.append(site)
            #processes site only if lander in site tag
            if lander in site:
                ind = site.rindex(lander) + len(lander)
                if (ind == len(site) or site[ind] == "/"):
                    tags = save(site, inDir + re.findall(r'https?:/+(.*)', site)[0])
                elif thorough:
                    try:
                        # Fetch the page and collect its relative links without saving it
                        tags = re.findall(r'<a.*?href=[\"|\'](/.*?)[\"|\']', str(urlopen(site).read()))
                    except:
                        print("Invalid URL or failed to access: " + site)
                        revisit.append(site)
                        tags = []
            # if thorough gets the a-tags and doesnt save if lander not in tag
            elif thorough:
                try:
                    # Fetch the page and collect its relative links without saving it
                    tags = re.findall(r'<a.*?href=[\"|\'](/.*?)[\"|\']', str(urlopen(site).read()))
                except:
                    print("Invalid URL or failed to access: " + site)
                    revisit.append(site)
                    tags = []
            else:
                tags = []
            # Adds a-tags that are found to the queue
            if curr[0] < depth and tags != 3:
                for tag in tags:
                    q = re.findall(landing + r'(.*)', tag)
                    if (not q or "." not in q[0] or ".html" in tag) and tag not in visited:
                        if "#" in tag:
                            tag = tag[:tag.index("#")]
                        if landing not in tag:
                            site = landing + tag
                        else:
                            site = tag
                        while site != "" and site[-1] == "/":
                            site = site[:-1]
                        if site not in visited:
                            queue.append((curr[0] + 1, site))
    #Revisits all failed sites to see if they work the second time around
    if len(revisit) > 0:
        print("\n\n\n\n\n\n\nRetrying.....\n")
        for tag in revisit:
            print(tag)
            try:
                site = str(urlopen(landing + tag).read())
                if '<html' not in site:
                    print("Not an HTML file")
                    site = "<xYnC>failed</xYnC>"
            except:
                print("Invalid URL or failed to access")
                site = "<xYnC>failed</xYnC>"
            save(site, inDir + tag)
    else:
        print("\n\n\n\n\n\nNothing to Retry")


In [18]:
crawl("https://jupiter.money/", baseDir="Input", thorough=False)

(0, 'https://jupiter.money')
https://jupiter.money
(1, 'https://jupiter.money/about')
https://jupiter.money/about
(1, 'https://jupiter.money/contact')
https://jupiter.money/contact
(1, 'https://jupiter.money/about')
(2, 'https://jupiter.money/contact')
(2, 'https://jupiter.money/contact')






Nothing to Retry
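
Each crawled page ends up as a `page.json` file under the base directory, in a folder named after the page's host and path. A minimal sketch for inspecting those files follows. The helper name `load_pages` and the demo directory contents are assumptions made for illustration.

```python
import json
import os
import tempfile


def load_pages(base_dir):
    """Return {relative_dir: parsed_json} for every page.json under base_dir."""
    pages = {}
    for root, _, files in os.walk(base_dir):
        if "page.json" in files:
            with open(os.path.join(root, "page.json")) as fp:
                pages[os.path.relpath(root, base_dir)] = json.load(fp)
    return pages


# Self-contained demo on a throwaway directory holding one made-up page.
demo_dir = tempfile.mkdtemp()
page_dir = os.path.join(demo_dir, "example.com", "about")
os.makedirs(page_dir)
with open(os.path.join(page_dir, "page.json"), "w") as fp:
    json.dump({"About us": {"p": ["Sample paragraph"]}}, fp)

pages = load_pages(demo_dir)
for rel_dir in sorted(pages):
    print(rel_dir, list(pages[rel_dir]))
```

In the notebook, `load_pages("Input")` would give a quick overview of what the crawl collected.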


In [20]:
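import os
import zipfile
# ZipFile.write() on a directory stores only the directory entry, so add each crawled file explicitly
with zipfile.ZipFile('input.zip', 'w') as zipObj:
    for root, _, files in os.walk('Input'):
        for name in files:
            zipObj.write(os.path.join(root, name))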

In [21]:
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/companyportfolio'

In [22]:
import base64
import json
import uuid
import boto3
import numpy as np
import sagemaker as sage
from IPython.display import Image
from PIL import Image as ImageEdit
from sagemaker import ModelPackage, get_execution_role

No handlers could be found for logger "sagemaker"


In [23]:
content_type='application/zip'
model_name='company-port-folio'
real_time_inference_instance_type='ml.m5.xlarge'


### 2. Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [24]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

#### A. Create an endpoint

In [25]:
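# Factory that wraps the endpoint in a predictor using the configured content type.
# Note: RealTimePredictor is the class name used by version 1 of the SageMaker Python SDK.
def predict_wrapper(endpoint, session):
    return sage.RealTimePredictor(endpoint, session, content_type=content_type)
# Create a deployable model from the model package.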
model = ModelPackage(role=role,
                    model_package_arn=model_package_arn,
                    sagemaker_session=sagemaker_session,
                    predictor_cls=predict_wrapper)

#Deploy the model
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

-------------!

Once the endpoint has been created, you can perform real-time inference.

#### B. Create input payload

In [26]:
file_name="input.zip"

The payload is the `input.zip` archive created earlier from the crawled pages.
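
To see what a payload archive contains before sending it, its entries can be listed with the standard `zipfile` module. The following is a sketch under stated assumptions: the helper name `list_payload` and the demo archive are hypothetical and stand in for the real `input.zip`.

```python
import os
import tempfile
import zipfile


def list_payload(zip_path):
    """Return (name, uncompressed_size_in_bytes) for each entry in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Build a throwaway archive with one made-up entry so the example is self-contained.
demo_zip = os.path.join(tempfile.mkdtemp(), "demo_input.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("Input/example.com/page.json", '{"p": []}')

entries = list_payload(demo_zip)
for name, size in entries:
    print(name, size)
```

In the notebook, calling `list_payload(file_name)` would show the files packed into the payload.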

#### C. Perform real-time inference

In [27]:
!aws sagemaker-runtime invoke-endpoint --endpoint-name $model_name --body fileb://$file_name --content-type 'application/zip' --region us-east-2 output.zip

{
    "InvokedProductionVariant": "AllTraffic", 
    "ContentType": "application/zip"
}


#### D. Visualize output

In [29]:
with zipfile.ZipFile('output.zip', 'r') as zipObj:
    zipObj.extractall()

In [31]:
with open('graphs/final_graph.gml') as f:
    data = f.read()
    

In [32]:
data

'graph [\n  node [\n    id 0\n    label "juno"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 1\n    label "doggo"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 2\n    label "kedar nimkar"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 3\n    label "vp"\n    data "_networkx_list_start"\n    data "ORG"\n  ]\n  node [\n    id 4\n    label "kedar"\n    data "_networkx_list_start"\n    data "ORG"\n  ]\n  node [\n    id 5\n    label "finance"\n    data "_networkx_list_start"\n    data "ORG"\n  ]\n  node [\n    id 6\n    label "piyush kabra"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 7\n    label "piyush"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 8\n    label "nihar gupta"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 9\n    label "banking"\n    data "_networkx_list_start"\n    data "ORG"\n  ]\n  node [\n    id 10\n
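
The raw GML string is hard to read. The sketch below pulls the node labels and entity types out of text shaped like the output above, using only the standard library. It assumes each node carries `id`, `label`, and `data` fields in that order, and the helper name `extract_nodes` is hypothetical. The `_networkx_list_start` marker suggests the file was written by NetworkX, so `networkx.read_gml` is likely the more robust option when that library is available.

```python
import re

# Assumption: every node block lists id, label, then zero or more data fields.
NODE_PATTERN = re.compile(
    r'node \[\s*id (\d+)\s*label "([^"]*)"((?:\s*data "[^"]*")*)\s*\]'
)
DATA_PATTERN = re.compile(r'data "([^"]*)"')


def extract_nodes(gml_text):
    """Return a list of (id, label, [entity types]) tuples."""
    nodes = []
    for node_id, label, data_block in NODE_PATTERN.findall(gml_text):
        # Skip the list marker and keep only the real entity types.
        types = [v for v in DATA_PATTERN.findall(data_block)
                 if v != "_networkx_list_start"]
        nodes.append((int(node_id), label, types))
    return nodes


# Abbreviated excerpt in the same shape as the notebook output.
sample = (
    'graph [\n  node [\n    id 0\n    label "juno"\n'
    '    data "_networkx_list_start"\n    data "PERSON"\n  ]\n'
    '  node [\n    id 3\n    label "vp"\n'
    '    data "_networkx_list_start"\n    data "ORG"\n  ]\n]\n'
)

nodes = extract_nodes(sample)
for node in nodes:
    print(node)
```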

#### E. Delete the endpoint

Now that you have successfully performed a real-time inference, you do not need the endpoint any more. You can terminate the endpoint to avoid being charged.

In [33]:
predictor.delete_endpoint()

### 3. Perform batch inference

In this section, you will perform batch inference using multiple input payloads together. If you are not familiar with batch transform, and want to learn more, see these links:
1. [How it works](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-batch-transform.html)
2. [How to run a batch transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html)

In [34]:
import json 
import uuid


# Run a batch transform job on the sample archive stored in S3
transformer = model.transformer(1, 'ml.m5.4xlarge')
transformer.transform('s3://mphasis-marketplace/company_portfolio/input/Input.zip', content_type='application/zip')
transformer.wait()
print("Batch Transform complete")
# output_path has the form s3://<bucket>/<folder>; keep the folder name for the download step
bucketFolder = transformer.output_path.rsplit('/')[3]

............................Initializing NLP Library...
Initialized NLP Library!
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
 * Restarting with stat
Initializing NLP Library...
Initialized NLP Library!
 * Debugger is active!
 * Debugger PIN: 266-560-827
169.254.255.130 - - [11/Mar/2021 15:39:33] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [11/Mar/2021 15:39:33] "GET /execution-parameters HTTP/1.1" 404 -
created folder
here
zip complete
finish
169.254.255.130 - - [11/Mar/2021 15:39:33] "POST /invocations HTTP/1.1" 200 -
2021-03-11T15:39:33.236:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD

Batch Transform complete


In [35]:
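s3_conn = boto3.client("s3")
bucket_name="sagemaker-us-east-2-786796469737"
# The batch output is assumed to be a zip archive, like the real-time response;
# save it as output.zip so that the next cell extracts the batch result
with open('output.zip', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder+'/Input.zip.out', f)
    print("Output file loaded from bucket")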

Output file loaded from bucket
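
The bucket name above is hard-coded, and the folder is taken from a fixed position in the output path. As an alternative, both could be derived from the S3 URI itself. The following Python 3 sketch shows the idea. The helper name `split_s3_uri` and the example URI are hypothetical, and in the notebook the input would be `transformer.output_path`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/prefix into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


# Made-up URI for illustration only.
bucket, prefix = split_s3_uri("s3://example-bucket/example-job-2021-03-11")

# Batch transform appends ".out" to the input file name, as in the cell above.
output_key = prefix + "/Input.zip.out"
print(bucket, output_key)
```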


In [38]:
with zipfile.ZipFile('output.zip', 'r') as zipObj:
    zipObj.extractall()

In [40]:
with open('graphs/final_graph.gml') as f:
    data = f.read()
data

'graph [\n  node [\n    id 0\n    label "juno"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 1\n    label "doggo"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 2\n    label "kedar nimkar"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 3\n    label "vp"\n    data "_networkx_list_start"\n    data "ORG"\n  ]\n  node [\n    id 4\n    label "kedar"\n    data "_networkx_list_start"\n    data "ORG"\n  ]\n  node [\n    id 5\n    label "finance"\n    data "_networkx_list_start"\n    data "ORG"\n  ]\n  node [\n    id 6\n    label "piyush kabra"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 7\n    label "piyush"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 8\n    label "nihar gupta"\n    data "_networkx_list_start"\n    data "PERSON"\n  ]\n  node [\n    id 9\n    label "banking"\n    data "_networkx_list_start"\n    data "ORG"\n  ]\n  node [\n    id 10\n

### 4. Clean-up

#### A. Delete the model

Delete the model once you have finished using it to free up resources. The endpoint was already deleted in step 2.E.


#### B. Unsubscribe to the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

