<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="4" color="black"><b>Use the Python function feature to scrape a webpage</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/pmservice/wml-sample-notebooks/master/images/python.png?raw=true" width="600" alt="Icon"> </th>
   </tr>
</table>

A *Python function* is a feature for saving and deploying Python code. You can implement Python functions in Python notebooks, or from an IDE through the REST API.

A Python function must contain a nested `score()` function. The `score()` function is called each time the deployed Python function is run.
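
The structure described above is a closure: the outer function runs once and returns `score()`, which then handles each request. The following is a minimal sketch of that pattern, not the function used later in this notebook. The names `make_scorer` and `greeting` and the uppercase transformation are hypothetical and only illustrate the shape of the code and of the payload.

```python
def make_scorer(params=None):
    # Code at this level runs once, when the outer function is called.
    params = params or {}
    greeting = params.get("greeting", "hello")

    def score(payload):
        # Code at this level runs for every scoring request.
        values = payload["input_data"][0]["values"]
        results = [[f"{greeting} {value}".upper()] for value in values]
        return {"predictions": [{"fields": ["result"], "values": results}]}

    return score


scorer = make_scorer({"greeting": "hi"})
response = scorer({"input_data": [{"fields": ["name"], "values": ["ada", "alan"]}]})
print(response)
```

The outer function receives its parameters once, while `score()` only sees the payload of each request.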

A Python function can be:
- Saved in a project as well as a deployment space.
- Deployed in a deployment space.
- Scored.


<div class="alert alert-block alert-info">For more details on the Python function, please refer to this <a href="https://www.ibm.com/support/knowledgecenter/SSQNUZ_3.0.0/wsj/analyze-data/ml-deploy-functions_local.html" target="_blank" rel="noopener no referrer">link</a>.</div>

This notebook demonstrates how to save, deploy, and score a Python function. Although the `score()` function is intended for scoring, it can also run custom logic such as text preprocessing.

The `score()` function of the Python function in this example notebook does the following tasks:
- Scrapes texts that are enclosed in `<p>` tags.
- Tokenizes scraped texts.
    
This notebook uses the <a href="http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection" target="_blank" rel="noopener no referrer">SMS spam data set</a> from the UCI Machine Learning Repository.

The original data set has both texts and labels in a single file. For this notebook, only the text was extracted and converted into an HTML file.

You can find the `html` version of the SMS messages <a href="https://github.com/pmservice/wml-sample-notebooks/tree/master/datasets" target="_blank" rel="noopener no referrer">here</a>.

Some familiarity with Python is helpful. This notebook uses `watson-machine-learning-client-V4` and is compatible with CP4D 3.0 and Python 3.6.

#### To get started on CP4D 3.0, find documentation on installation and set up <a href="https://www.ibm.com/support/knowledgecenter/SSQNUZ_3.0.0/cpd/overview/welcome.html" target="_blank" rel="noopener no referrer">here</a>.

## Table of Contents

This notebook contains the following parts:

1. [Define a Python function](#function)
2. [Setting up](#setup)
3. [Save the Python function](#save)
4. [Deploy the Python function (with deployment space only)](#deploy)
5. [Summary and next steps](#summary)

## 1. Define a Python function <a id="function"></a>

You can pass a `parameter dict` to the Python function in the cell below.

In [1]:
# You can add any information needed to run the Python function, e.g., wml credentials.
py_params = {

}

The code outside the `score()` function executes only once, so it is the place for one-time setup such as loading objects or installing libraries.

In this example, the `score()` function reads the URLs from the payload and passes each one to BeautifulSoup, which scrapes the text enclosed in `<p>` tags. The extracted text is then passed to `scikit-learn`'s `CountVectorizer` to tokenize it.

<div class="alert alert-block alert-info">If you are importing modules inside the Python function, you must install packages through the <tt>subprocess</tt> module. </div>

In [2]:
def py_funct(params=py_params):
    # Import subprocess before the try block so that the except clause can always reference it.
    import subprocess

    try:
        # Install required packages.
        subprocess.check_output('pip install --user lxml', stderr=subprocess.STDOUT, shell=True)
        subprocess.check_output('pip install --user bs4', stderr=subprocess.STDOUT, shell=True)
        # scikit-learn is the maintained PyPI name of the package that is imported as sklearn.
        subprocess.check_output('pip install --user scikit-learn', stderr=subprocess.STDOUT, shell=True)
    except subprocess.CalledProcessError as e:
        install_err = 'subprocess.CalledProcessError:\n\n' + 'cmd:\n' + e.cmd + '\n\noutput:\n' + e.output.decode()
        raise Exception('Installation failed:\n' + install_err)
    
    def score(payload):
        try:
            # Import required modules.
            from bs4 import BeautifulSoup
            from urllib.request import urlopen
            from sklearn.feature_extraction.text import CountVectorizer

            urls = payload['input_data'][0]['values']
            final_texts = []   # Collects the plain text of every <p> element.

            for url in urls:            
                html = urlopen(url)
                soup = BeautifulSoup(html, 'lxml')

                p_tags = soup.find_all('p')    # Text is enclosed in <p> tag.

                for p in p_tags:
                    str_p = str(p)
                    text = BeautifulSoup(str_p, 'lxml').get_text()
                    final_texts.append(text)

            vectorizer = CountVectorizer()
            vectorizer.fit_transform(final_texts)

            return {"predictions": [{"fields": ["tokens"], "values": [vectorizer.get_feature_names()]}]}
        except Exception as e:
            return {"predictions": [{"error" : repr(e)}]}
        
    return score

Prepare a sample payload.

In [3]:
sample_data = {
    'input_data': [{
        'fields': ['url'],
        'values': [
            'https://raw.githubusercontent.com/pmservice/wml-sample-notebooks/master/datasets/sms_spam_text.html'
        ]
    }]
}

Pass the list of URLs to the Python function.

In [4]:
pf = py_funct(py_params)
tokens = pf(sample_data)

The Python function returns a `dict` that holds the list of tokens in `values`, under the field name `tokens`.

In [5]:
# Token list
tokens['predictions'][0]['values'][0][:10]

['00',
 '000',
 '000pes',
 '008704050406',
 '0089',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02']
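
The list starts with numeric strings because `CountVectorizer` returns its vocabulary in sorted order, and digits sort before letters. The sketch below approximates the two steps of `score()` with the standard library only, so it can run without network access or extra packages. It is not the code that the deployed function runs: the class `ParagraphText`, the function `tokenize`, and the sample HTML are hypothetical, and the regular expression mirrors the default `CountVectorizer` behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0          # Greater than 0 while inside a <p> element.
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.paragraphs[-1] += data


def tokenize(texts):
    # Lowercase, keep tokens of two or more word characters, and return them sorted and unique.
    pattern = re.compile(r"\b\w\w+\b")
    vocabulary = set()
    for text in texts:
        vocabulary.update(pattern.findall(text.lower()))
    return sorted(vocabulary)


sample_html = "<html><body><h1>Inbox</h1><p>Call 0800 now</p><p>Free prize, call now!</p></body></html>"
parser = ParagraphText()
parser.feed(sample_html)
parser.close()

demo_tokens = tokenize(parser.paragraphs)
print(demo_tokens)
```

The heading text is ignored because it is not inside a `<p>` element, and the repeated words `call` and `now` appear only once in the result.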

## 2. Setting up <a id="setup"></a>

Import the `watson-machine-learning-client` module.
<div class="alert alert-block alert-info">
For more information about the <b>Watson Machine Learning Python client (V4)</b>, please refer to the <a href="https://wml-api-pyclient-dev-v4.mybluemix.net/" target="_blank" rel="noopener no referrer">Python client documentation</a>. If you're using the notebook within a project on your CP4D cluster, you do not need to install this package as it comes pre-installed with the notebooks. The installation code below is for demonstration but is non-executable at this stage.
</div>

In [7]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

**Authenticate the Python client on CP4D.**

<div class="alert alert-block alert-info">To find your authentication information (your credentials), follow the steps in the <a href="https://www.ibm.com/support/knowledgecenter/SSQNUZ_3.0.0/wsj/analyze-data/ml-authentication-local.html" target="_blank" rel="noopener no referrer">documentation</a>.</div>

**Action**: Enter your credentials in the following cell.

In [8]:
# Enter your credentials here.
import os

from project_lib.utils import environment

# Read the API URL and the user's access token from the notebook environment.
url = environment.get_common_api_url()
token = os.environ['USER_ACCESS_TOKEN']

wml_credentials = {
     "instance_id": "openshift",
     "token": token,
     "url": url,
     "version": "3.0.0"
}

Now, instantiate a `WatsonMachineLearningAPIClient` object.

In [9]:
client = WatsonMachineLearningAPIClient(wml_credentials)

<div class="alert alert-block alert-info">
You have a choice to either save the function in the <b>project</b> or the <b>deployment space</b>:<br><br>
    <li>If you're saving the function in your project, set the default project using the Python client.</li><br>
    <li>If you're saving the function in the deployment space, obtain the space UID of the deployment space you've created and use it to set the default space using the Python client. From there, you can deploy and score the function in your deployment space.</li></div>

### To set the default project, use the following code.

In [34]:
from project_lib import Project
project = Project.access()
project_id = project.get_metadata()["metadata"]["guid"]

client.set.default_project(project_id)

'SUCCESS'

### To set the default space, follow these steps.

<div class="alert alert-block alert-info">
You can create your own <a href="https://www.ibm.com/support/knowledgecenter/SSQNUZ_3.0.0/wsj/analyze-data/ml-spaces_local.html" target="_blank" rel="noopener no referrer">deployment space</a> by selecting <b>Analytics deployments</b> under <b>Analyze</b> from the Navigation Menu on the top left of this page.</div>

Alternatively, you can create a deployment space with the Python client. Once the space exists, use the following cells to look up its UID by name.

In [10]:
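# Obtain the UID of your space.
def guid_from_space_name(client, space_name):
    spaces = client.spaces.get_details()
    for item in spaces['resources']:
        if item['entity']['name'] == space_name:
            return item['metadata']['guid']
    # Fail with a clear message instead of a bare StopIteration when no space matches.
    raise ValueError("No deployment space named '{}' was found.".format(space_name))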

**Action:** Enter the name of your deployment space in the code below: `space_uid = guid_from_space_name(client, 'YOUR DEPLOYMENT SPACE')`.

In [11]:
# Enter the name of your deployment space here:
space_uid = guid_from_space_name(client, 'YOUR DEPLOYMENT SPACE')
print("Space UID = " + space_uid)

Space UID = 2a330a2c-5943-4dfa-aad9-dff16d954d56
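
The lookup depends on the shape of the dictionary returned by `client.spaces.get_details()`: a `resources` list whose items carry the name under `entity` and the UID under `metadata`. As a sketch, the same lookup can be tried against stub data without a cluster. The function name `find_space_guid`, the space names, and the GUIDs below are hypothetical, and a real response contains more keys for each space.

```python
def find_space_guid(space_details, space_name):
    # Return the GUID of the first space whose name matches, or None if there is none.
    for item in space_details["resources"]:
        if item["entity"]["name"] == space_name:
            return item["metadata"]["guid"]
    return None


# Hypothetical stub that only mimics the keys used by the lookup.
stub_details = {
    "resources": [
        {"entity": {"name": "sandbox"}, "metadata": {"guid": "00000000-0000-0000-0000-000000000001"}},
        {"entity": {"name": "scraping"}, "metadata": {"guid": "00000000-0000-0000-0000-000000000002"}},
    ]
}

print(find_space_guid(stub_details, "scraping"))
print(find_space_guid(stub_details, "missing"))
```

Returning `None` for an unknown name lets the caller decide whether a missing space is an error.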


You can set the default space using the cell below.

In [12]:
client.set.default_space(space_uid)

'SUCCESS'

## 3. Save the Python function <a id="save"></a>

Create the function metadata.

Watson Studio notebooks use a software specification named `default_py3.6` by default, but Python functions can also be saved using the `ai-function_0.1-py3.6` software specification. This example uses `default_py3.6`, and you'll use its UID in the function metadata.

In [14]:
# Function Metadata.
software_spec_uid = client.software_specifications.get_uid_by_name("default_py3.6")

meta_props = {
    client.repository.FunctionMetaNames.NAME: "Web scraping python function",
    client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: software_spec_uid
}

<div class="alert alert-block alert-info">To list the supported software specifications, run <tt>client.software_specifications.list()</tt>.<br>To find more information about the frameworks with their respective <b>Types</b> and <b>Software Specifications</b>, visit the <a href="https://www.ibm.com/support/knowledgecenter/SSQNUZ_3.0.0/wsj/wmls/wmls-deploy-python-types.html" target="_blank" rel="noopener no referrer">documentation</a>.</div>

You can extract the function UID from the saved function details.

In [15]:
# Create the function artifact.
function_artifact = client.repository.store_function(meta_props=meta_props, function=py_funct)
function_uid = client.repository.get_function_uid(function_artifact)
print("Function UID = " + function_uid)

Function UID = e895d3c8-38ec-48a1-b6d1-126176f487db


Get the saved function metadata using the function UID.

In [16]:
from pprint import pprint

# Details about the function.
function_details = client.repository.get_details(function_uid)
pprint(function_details)

{'entity': {'name': 'Web scraping python function',
            'software_spec': {'id': '0cdb0f1e-5376-4f4d-92dd-da3b69aa9bda'},
            'space': {'href': '/v4/spaces/2a330a2c-5943-4dfa-aad9-dff16d954d56',
                      'id': '2a330a2c-5943-4dfa-aad9-dff16d954d56'},
            'type': 'python'},
 'metadata': {'created_at': '2020-03-23T22:51:36.002Z',
              'guid': 'e895d3c8-38ec-48a1-b6d1-126176f487db',
              'href': '/v4/functions/e895d3c8-38ec-48a1-b6d1-126176f487db?space_id=2a330a2c-5943-4dfa-aad9-dff16d954d56',
              'id': 'e895d3c8-38ec-48a1-b6d1-126176f487db',
              'modified_at': '2020-03-23T22:51:43.002Z',
              'name': 'Web scraping python function',
              'owner': '1000330999',
              'space_id': '2a330a2c-5943-4dfa-aad9-dff16d954d56'}}


You can list all stored functions using the `list_functions` method.

In [17]:
# Display a list of all the functions.
client.repository.list_functions()

------------------------------------  ----------------------------  ------------------------  ------
GUID                                  NAME                          CREATED                   TYPE
e895d3c8-38ec-48a1-b6d1-126176f487db  Web scraping python function  2020-03-23T22:51:36.002Z  python
------------------------------------  ----------------------------  ------------------------  ------


<div class="alert alert-block alert-info">
From the list of stored functions, you can see that the function was successfully saved.<br><br>
If you've set the default project, this means you've saved the function in your project. You can see the saved function in your project UI by clicking on your project name in the breadcrumb at the top of the application.<br><br>
If you've set the default space, this means that you've saved the function in your deployment space. You can view your function by selecting <b>Analytics Deployments</b> under <b>Analyze</b> from the Navigation Menu and clicking on your deployment space name.</div>

If you're using a deployment space, proceed to Section 4: [Deploy the Python function (with deployment space only)](#deploy). If not, you may skip to the [summary](#summary).

## 4. Deploy the Python function (with deployment space only) <a id="deploy"></a>

Next, deploy the *Python function* to the deployment space by creating deployment metadata and using the function UID obtained in the previous section.

In [18]:
# Deployment metadata.
deploy_meta = {
    client.deployments.ConfigurationMetaNames.NAME: "Web scraping python function deployment",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

In [19]:
# Create the deployment.
deployment_details = client.deployments.create(function_uid, meta_props=deploy_meta)



#######################################################################################

Synchronous deployment creation for uid: 'e895d3c8-38ec-48a1-b6d1-126176f487db' started

#######################################################################################


initializing....
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='14aad500-ab6d-4caa-aa75-ca42c8b25ad9'
------------------------------------------------------------------------------------------------




You can check the deployment details by running the following cell.

In [20]:
deployment_details

{'metadata': {'parent': {'href': ''},
  'name': 'Web scraping python function deployment',
  'guid': '14aad500-ab6d-4caa-aa75-ca42c8b25ad9',
  'description': '',
  'id': '14aad500-ab6d-4caa-aa75-ca42c8b25ad9',
  'modified_at': '2020-03-23T22:52:00.580Z',
  'created_at': '2020-03-23T22:52:00.580Z',
  'href': '/v4/deployments/14aad500-ab6d-4caa-aa75-ca42c8b25ad9',
  'space_id': '2a330a2c-5943-4dfa-aad9-dff16d954d56'},
 'entity': {'name': 'Web scraping python function deployment',
  'custom': {},
  'online': {},
  'description': '',
  'space': {'id': '2a330a2c-5943-4dfa-aad9-dff16d954d56',
   'href': '/v4/spaces/2a330a2c-5943-4dfa-aad9-dff16d954d56'},
  'status': {'state': 'ready',
   'online_url': {'url': 'https://internal-nginx-svc:12443/v4/deployments/14aad500-ab6d-4caa-aa75-ca42c8b25ad9/predictions'}},
  'asset': {'id': 'e895d3c8-38ec-48a1-b6d1-126176f487db',
   'href': '/v4/functions/e895d3c8-38ec-48a1-b6d1-126176f487db?space_id=2a330a2c-5943-4dfa-aad9-dff16d954d56'},
  'space_id': '

Verify that the deployment was created by listing the deployments.

In [21]:
# List the deployments.
client.deployments.list()

------------------------------------  ---------------------------------------  -----  ------------------------  -------------
GUID                                  NAME                                     STATE  CREATED                   ARTIFACT_TYPE
14aad500-ab6d-4caa-aa75-ca42c8b25ad9  Web scraping python function deployment  ready  2020-03-23T22:52:00.580Z  function
------------------------------------  ---------------------------------------  -----  ------------------------  -------------


<div class="alert alert-block alert-info">
From the list of deployments, you can see that the function was successfully deployed in the deployment space.</div>

In [22]:
# Deployment UID.
deployment_uid = client.deployments.get_uid(deployment_details)
print('Deployment uid = {}'.format(deployment_uid))

Deployment uid = 14aad500-ab6d-4caa-aa75-ca42c8b25ad9
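
The deployment details include an `online_url` for the deployment. As a sketch of scoring over REST instead of through the Python client, the block below only builds the HTTP request and does not send it. It assumes that the endpoint accepts a JSON `POST` body in the same `input_data` format and a bearer token in the `Authorization` header. The token value is a placeholder, and the host name in `online_url` may resolve only from inside the cluster.

```python
import json
from urllib.request import Request

# URL copied from the deployment details shown earlier in this notebook.
online_url = ("https://internal-nginx-svc:12443/v4/deployments/"
              "14aad500-ab6d-4caa-aa75-ca42c8b25ad9/predictions")
access_token = "PLACEHOLDER_TOKEN"  # Assumption: replace with a valid access token.

rest_payload = {"input_data": [{"fields": ["url"],
                                "values": ["https://www.ibm.com/cloud/machine-learning"]}]}

# Build the request without sending it.
request = Request(
    online_url,
    data=json.dumps(rest_payload).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + access_token},
    method="POST",
)

print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out here because it needs a reachable cluster and a real token.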


### 4.1 Score data <a id="score"></a>

In this subsection, you will learn how to score the deployed *Python function* with a test data record.

The following is the record that will be used for scoring.

In [23]:
# Prepare scoring payload.
job_payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA: [{
        'fields': ['url'],
        'values': [
            'https://www.ibm.com/cloud/machine-learning'
        ]
    }]
}
pprint(job_payload)

{'input_data': [{'fields': ['url'],
                 'values': ['https://www.ibm.com/cloud/machine-learning']}]}


In [24]:
# Perform prediction and display the result.
job_details = client.deployments.score(deployment_uid, job_payload)
pprint(job_details['predictions'][0]['values'][0][:10])

['02',
 '2018',
 '2019',
 '459',
 '49',
 '574',
 '698',
 'accelerate',
 'accelerator',
 'access']
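
Because `score()` catches exceptions and returns them under an `error` key instead of raising, indexing `['values']` on the response fails with a `KeyError` when scraping goes wrong. A small helper, sketched here under the hypothetical name `extract_tokens`, makes that case explicit. The two sample responses are hand-written stand-ins that follow the two return shapes of `score()`.

```python
def extract_tokens(response, limit=None):
    # Return the token list from a scoring response, or raise if the function reported an error.
    prediction = response["predictions"][0]
    if "error" in prediction:
        raise RuntimeError("Scoring failed: " + prediction["error"])
    tokens = prediction["values"][0]
    return tokens if limit is None else tokens[:limit]


# Hand-written stand-ins for a successful and a failed scoring response.
ok_response = {"predictions": [{"fields": ["tokens"], "values": [["02", "2018", "access"]]}]}
failed_response = {"predictions": [{"error": "URLError('timed out')"}]}

print(extract_tokens(ok_response, limit=2))
try:
    extract_tokens(failed_response)
except RuntimeError as err:
    print(err)
```

With the deployed function, the call would be `extract_tokens(job_details, limit=10)`.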


## 5. Summary and next steps <a id="summary"></a>

You successfully completed this notebook! 
 
You learned how to define a *Python function*, and how to save, deploy, and score it.

In the next step, the `score()` function will, in addition to tokenizing, call a classification model trained on the `SMS spam` data set to perform scoring.

### Resources <a id="resources"></a>

To learn more about configurations used in this notebook or more sample notebooks, tutorials, documentation, how-tos, and blog posts, check out these links:

<div class="alert alert-block alert-success">

<h4>IBM documentation</h4>
<br>
 <li> <a href="https://wml-api-pyclient-dev-v4.mybluemix.net" target="_blank" rel="noopener no referrer">watson-machine-learning</a></li> 
 <li> <a href="https://www.ibm.com/support/knowledgecenter/SSQNUZ_3.0.0/cpd/overview/welcome.html" target="_blank" rel="noopener noreferrer">CP4D 3.0</a></li>
 
<h4> IBM Samples</h4>
<br>
 <li> <a href="https://github.com/IBMDataScience/sample-notebooks" target="_blank" rel="noopener noreferrer">Sample notebooks</a></li>
 
<h4> Others</h4>
<br>
 <li> <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank" rel="noopener noreferrer">BeautifulSoup documentation</a></li>
 <li> <a href="https://www.ibm.com/support/knowledgecenter/SSQNUZ_3.0.0/wsj/analyze-data/ml-deploy-functions_local.html" target="_blank" rel="noopener noreferrer">Deploying Python functions in Watson Machine Learning</a></li>
 <li> <a href="https://www.python.org" target="_blank" rel="noopener noreferrer">Official Python website</a></li>
 <li> <a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener noreferrer">scikit-learn: machine learning in Python</a></li>
 <li> <a href="https://www.datacamp.com/community/tutorials/web-scraping-using-python" target="_blank" rel="noopener noreferrer">Web scraping using Python</a></li>
 <li> <a href="https://tokenex.com/resource-center/what-is-tokenization/" target="_blank" rel="noopener noreferrer">What is tokenization</a></li></div>

### Citation

Almeida, T. A. and Hidalgo, J. M. G. (2012), [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection), Irvine, CA.

### Author

**Jihyoung Kim**, Ph.D., is a Data Scientist at IBM who strives to make data science easy for everyone through Watson Studio.

Copyright © 2019, 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
