<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Use the Python function feature to scrape a webpage</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/pmservice/wml-sample-notebooks/master/images/python.png?raw=true" width="700" alt="Icon"> </th>
   </tr>
</table>

A *Python function* is a Watson Machine Learning feature for saving and deploying Python code. You can create Python functions in a Python notebook or, from an IDE, through the REST API.

A Python function must contain an inner `score()` function. The `score()` function is called each time the deployed Python function is scored.
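The structure described above can be sketched as an outer function that returns an inner `score()` function. This is a minimal illustration only: the names `example_function` and `greeting` are hypothetical and are not part of Watson Machine Learning, and the payload layout mirrors the one used later in this notebook.

```python
def example_function(params=None):
    # Code at this level runs once, when the function is loaded.
    greeting = (params or {}).get('greeting', 'Hello')

    def score(payload):
        # Code at this level runs on every scoring request.
        names = payload['input_data'][0]['values']
        return {
            'predictions': [{
                'fields': ['message'],
                'values': [['{}, {}!'.format(greeting, name) for name in names]]
            }]
        }

    return score


# Local usage: build the score() closure, then call it with a payload.
score = example_function({'greeting': 'Hi'})
result = score({'input_data': [{'fields': ['name'], 'values': ['Ada']}]})
```

Because `score()` is a closure, anything prepared in the outer function (loaded objects, parameters) stays available to every scoring call.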

A Python function can be:
- Saved in the Watson Machine Learning (WML) repository.
- Deployed to the Watson Machine Learning (WML) repository.
- Scored.


<div class="alert-block alert-info"><br>&nbsp;&nbsp;&nbsp;&nbsp;For more details on the Python function, refer to the <a href="https://www.ibm.com/support/knowledgecenter/SS3PWM_1.0.0/wsj/analyze-data/ml-deploy-functions_local.html" target="_blank" rel="noopener noreferrer">documentation on deploying Python functions</a>.<br><br></div>

This notebook demonstrates how to save, deploy, and score a Python function. Although the `score()` function is intended for scoring, it can also implement other custom functionality, such as text preprocessing.

The `score()` function of the Python function in this example notebook does the following tasks:
- Scrapes texts that are enclosed in `<p>` tags.
- Tokenizes scraped texts.
    
The data that will be used in this notebook is the <a href="http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection" target="_blank" rel="noopener noreferrer">SMS spam data set</a> from the UCI Machine Learning Repository.

The original data set has both texts and labels in a single file. Only the text parts of the data set were extracted and converted into an `html` file.

You can find the `html` version of the SMS messages in the <a href="https://github.com/pmservice/wml-sample-notebooks/tree/master/datasets" target="_blank" rel="noopener noreferrer">sample notebooks data sets folder</a>.

Some familiarity with Python is helpful. This notebook is compatible with Watson Studio Desktop 1.1, Watson Machine Learning Server 2.0, and Python 3.6.


## Table of Contents

This notebook contains the following parts:

1. [Define a Python function](#function)
2. [Setting up](#setup)  
    2.1 [Connecting to Watson Studio Desktop](#wsd)   
    2.2 [Connecting to Watson Machine Learning Server](#wmls)   
3. [Save the Python function](#save)  
4. [Deploy the Python function (WML Server only)](#deploy)   
    4.1 [Score data](#score)  
    4.2 [Delete the deployment and Python function](#delete)
5. [Summary and next steps](#summary)

To get started on Watson Machine Learning (WML) Server, see the <a href="https://www.ibm.com/support/knowledgecenter/SS3PWM_1.0.0/wsj/wmls/wmls-install-over.html" target="_blank" rel="noopener noreferrer">installation and setup documentation</a>.

## 1. Define a Python function <a id="function"></a>

You can pass a dictionary of parameters (`py_params`) to the Python function in the cell below.

In [1]:
# You can add any information needed to run the Python function, e.g., wml credentials.
py_params = {

}

The code outside the `score()` function executes one time only and can load objects, install Python packages, etc. 

In this example, the `score()` function reads one or more URLs from the payload and passes each page to BeautifulSoup to scrape the text enclosed in `<p>` tags. The extracted text is then passed to `scikit-learn`'s `CountVectorizer` to tokenize it.
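The two steps (extract `<p>` text, then tokenize) can be tried offline before any packages are installed. The sketch below is a stand-in, not the notebook's method: it swaps BeautifulSoup for the standard library's `html.parser` and swaps `CountVectorizer` for a regular expression that follows its default behaviour (lowercase, tokens of two or more word characters, sorted vocabulary). The names `ParagraphText`, `tokenize`, and `sample_html` are hypothetical.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_paragraph = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested inline tags (for example <b>) is kept as well.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def tokenize(texts):
    # Lowercase, keep tokens of 2+ word characters, return a sorted vocabulary.
    vocabulary = set()
    for text in texts:
        vocabulary.update(re.findall(r'\b\w\w+\b', text.lower()))
    return sorted(vocabulary)


sample_html = ('<html><body><h1>Inbox</h1>'
               '<p>Free <b>entry</b> now</p><p>Call me now</p>'
               '</body></html>')

parser = ParagraphText()
parser.feed(sample_html)
tokens = tokenize(parser.paragraphs)
print(tokens)
```

The heading text is ignored because it sits outside the `<p>` tags, which is the same filtering the `score()` function applies with `soup.find_all('p')`.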

<div class="alert-block alert-info"><br>&nbsp;&nbsp;&nbsp;&nbsp;If you are importing modules inside the Python function, you must install packages through the <tt>subprocess</tt> module.<br><br>&nbsp;&nbsp;&nbsp;&nbsp;More details can be found <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-functions.html?audience=wdp#import" target="_blank" rel="noopener no referrer">here</a>, with further documentation for Python functions.<br><br></div>

In [2]:
def py_funct(params=py_params):
    try:
        # Import the subprocess module.
        import subprocess

        # Install required packages.
        subprocess.check_output('pip install --user lxml',
                                stderr=subprocess.STDOUT,
                                shell=True)
        subprocess.check_output('pip install --user bs4',
                                stderr=subprocess.STDOUT,
                                shell=True)
        subprocess.check_output('pip install --user scikit-learn',
                                stderr=subprocess.STDOUT,
                                shell=True)
    except subprocess.CalledProcessError as e:
        install_err = ('subprocess.CalledProcessError:\n\n' + 'cmd:\n' +
                       e.cmd + '\n\noutput:\n' + e.output.decode())
        raise Exception('Installation failed:\n' + install_err)

    def score(payload):
        try:
            # Import required modules.
            from bs4 import BeautifulSoup
            from urllib.request import urlopen
            from sklearn.feature_extraction.text import CountVectorizer

            urls = payload['input_data'][0]['values']
            # Clean text stripped from the <p> tags of each page.
            final_texts = []

            for url in urls:
                html = urlopen(url)
                soup = BeautifulSoup(html, 'lxml')

                p_tags = soup.find_all('p')  # Text is enclosed in <p> tags.

                for p in p_tags:
                    str_p = str(p)
                    text = BeautifulSoup(str_p, 'lxml').get_text()
                    final_texts.append(text)

            vectorizer = CountVectorizer()
            vectorizer.fit_transform(final_texts)

            return {
                'predictions': [{
                    'fields': ['tokens'],
                    'values': [vectorizer.get_feature_names()]
                }]
            }
        except Exception as e:
            return {'predictions': [{'error': repr(e)}]}

    return score
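The installation step at the top of `py_funct` can also be written without `shell=True`, by passing the command as a list and invoking pip through the running interpreter. This is an optional sketch under the same assumptions as the cell above (packages installed with `--user`); the helper names `pip_install_command` and `install` are hypothetical.

```python
import subprocess
import sys


def pip_install_command(package):
    # Use the current interpreter so the package lands in the right environment.
    return [sys.executable, '-m', 'pip', 'install', '--user', package]


def install(package):
    # Run pip and surface its output if the installation fails.
    try:
        output = subprocess.check_output(pip_install_command(package),
                                         stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('Installation of {} failed:\n{}'.format(
            package, e.output.decode()))
    return output.decode()


# Example (not executed here): install('lxml')
command = pip_install_command('lxml')
```

Passing a list avoids shell quoting issues, and `sys.executable -m pip` avoids picking up a `pip` that belongs to a different Python installation.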

Prepare a sample payload.

In [3]:
sample_data = {
    'input_data': [{
        'fields': ['url'],
        'values': [
            'https://raw.githubusercontent.com/pmservice/' +
            'wml-sample-notebooks/master/datasets/sms_spam_text.html'
        ]
    }]
}
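Because `score()` loops over every entry in `values`, the same payload shape can carry several URLs. A small helper such as the one below (the name `make_payload` is hypothetical, and the second URL is a placeholder) keeps the structure in one place.

```python
def make_payload(urls):
    # Wrap a list of URLs in the payload layout that score() expects.
    return {
        'input_data': [{
            'fields': ['url'],
            'values': list(urls)
        }]
    }


multi_url_payload = make_payload([
    'https://raw.githubusercontent.com/pmservice/'
    'wml-sample-notebooks/master/datasets/sms_spam_text.html',
    'https://example.com/another_page.html',  # placeholder URL
])
```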

Create the `score()` function by calling `py_funct`, then pass the sample payload with the list of URLs to it.

In [4]:
pf = py_funct(py_params)
tokens = pf(sample_data)

The `score()` function returns a `dict` whose `values` entry holds the list of tokens; the corresponding `fields` entry is named `tokens`.

In [5]:
# Token list
tokens['predictions'][0]['values'][0][:10]

['00',
 '000',
 '000pes',
 '008704050406',
 '0089',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02']
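Since `score()` catches exceptions and returns them under an `error` key instead of raising, it can help to check for that key before indexing into `values`. The helper below is a sketch based only on the two return shapes visible in `py_funct`; the name `extract_tokens` is hypothetical.

```python
def extract_tokens(result):
    # score() returns either {'fields': ..., 'values': ...} or {'error': ...}.
    prediction = result['predictions'][0]
    if 'error' in prediction:
        raise RuntimeError('Scoring failed: ' + prediction['error'])
    column = prediction['fields'].index('tokens')
    return prediction['values'][column]


ok_result = {'predictions': [{'fields': ['tokens'], 'values': [['00', '000']]}]}
failed_result = {'predictions': [{'error': "URLError('timed out')"}]}

first_tokens = extract_tokens(ok_result)
```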

## 2. Setting up <a id="setup"></a>

In this section, you will learn how to use the Python client to connect to both `Watson Studio Desktop (WSD)` and `Watson Machine Learning (WML) Server`. If you only intend to save the function on WSD, follow the steps in section [2.1 Connecting to Watson Studio Desktop](#wsd).

If you want to use the WML Server, refer to section [2.2 Connecting to Watson Machine Learning Server](#wmls). From there you will be able to save, deploy, and score the function in your deployment space on the WML Server.

First, install the `watson-machine-learning-client-V4` package and import the client class.

<div class="alert alert-block alert-warning">
To simply hide the output of pip install, use <tt>-q</tt> after <tt>!pip install</tt>.
</div>

In [6]:
!pip install -q --upgrade watson-machine-learning-client-V4

In [7]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

### 2.1 Connecting to Watson Studio Desktop <a id="wsd"></a>

To associate the Python client with Watson Studio Desktop, use the following credentials.

In [8]:
from project_lib.utils import environment
url = environment.get_common_api_url()

wml_credentials = {
    'instance_id': 'wsd_local',
    'url': url,
    'version': '1.1'
}

Now, instantiate a `WatsonMachineLearningAPIClient` object.

In [9]:
client = WatsonMachineLearningAPIClient(wml_credentials)

In [10]:
client.version

'1.0.112'

Setting the default project is mandatory when you use WSD. You can use the cell below.

In [11]:
from project_lib import Project

project = Project.access()
project_id = project.get_metadata()['metadata']['guid']

client.set.default_project(project_id)

'SUCCESS'

To proceed, you can go directly to section [3. Save the Python function](#save).

### 2.2 Connecting to Watson Machine Learning Server <a id="wmls"></a>

In this subsection, you will learn how to connect to the Watson Machine Learning (WML) Server, which is required to save, deploy, and score the *Python function* in the WML repository.

**Connect to the Watson Machine Learning Server using the Python client**<br><br>

<div class="alert-block alert-info"><br>
&nbsp;&nbsp;&nbsp;&nbsp;To install the Watson Machine Learning Server, follow <a href="https://www.ibm.com/support/knowledgecenter/SS3PWM_1.0.0/wsj/wmls/wmls-install-over.html" target="_blank" rel="noopener noreferrer">these documentation steps</a>.
<br><br>&nbsp;&nbsp;&nbsp;&nbsp;To connect to the WML Server and find your authentication information (your credentials), follow the steps provided in the <a href="https://www.ibm.com/support/knowledgecenter/SS3PWM_1.0.0/wsj/wmls/wmls-connect.html" target="_blank" rel="noopener noreferrer">documentation</a>.<br><br>
</div>

**Action**: Enter your WML Server credentials in the following cell.

In [12]:
# Enter your credentials here.
wml_credentials = {
    'url': '<URL>:31843',
    'username': '---',
    'password': '---',
    'instance_id': 'wml_local',
    'version': '2.0'
}
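If you prefer not to type credentials into a notebook cell, the same dictionary can be built from environment variables. This is a sketch under assumptions: the variable names `WML_URL`, `WML_USERNAME`, and `WML_PASSWORD` are hypothetical and are not defined by Watson Machine Learning, while `instance_id` and `version` repeat the values from the cell above.

```python
import os


def credentials_from_env(env=os.environ):
    # Fail early with a clear message if any variable is missing or empty.
    required = ('WML_URL', 'WML_USERNAME', 'WML_PASSWORD')
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {
        'url': env['WML_URL'],
        'username': env['WML_USERNAME'],
        'password': env['WML_PASSWORD'],
        'instance_id': 'wml_local',
        'version': '2.0'
    }


# Demonstration with a plain dict standing in for os.environ.
demo_credentials = credentials_from_env({
    'WML_URL': 'https://wml.example.com:31843',  # placeholder host
    'WML_USERNAME': 'user',
    'WML_PASSWORD': 'secret',
})
```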

In [13]:
# @hidden_cell

wml_credentials = {
    'url': 'https://wmlserver-dev-test.ml.test.cloud.ibm.com:31843',
    'username': 'admin',
    'password': 'password',
    'instance_id': 'wml_local',
    'version': '2.0'
}

Now, instantiate a WatsonMachineLearningAPIClient object.

In [14]:
client = WatsonMachineLearningAPIClient(wml_credentials)

In [15]:
client.version

'1.0.112'

Since you are using WML Server in this section, you can obtain the space UID by using the following cells.

<div class="alert-block alert-info"><br>
&nbsp;&nbsp;&nbsp;&nbsp;You can create your own <a href="https://www.ibm.com/support/knowledgecenter/SS3PWM_1.0.0/wsj/analyze-data/ml-spaces_local.html" target="_blank" rel="noopener noreferrer">deployment space</a> by selecting <b>Deployment Spaces</b> from the Navigation Menu on the top left of this page.<br><br></div>

If you already have a deployment space, you can look up its UID by name using the code in the following cell.

In [16]:
# Obtain the UID of your space from its name.
def guid_from_space_name(client, space_name):
    # List all deployment spaces and return the GUID of the first name match.
    spaces = client.spaces.get_details()
    return next(item for item in spaces['resources']
                if item['entity']['name'] == space_name)['metadata']['guid']
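The lookup above raises a bare `StopIteration` when no space has the given name. The variant below works on the dictionary itself and returns `None` instead, which makes the "not found" case easier to handle. It assumes only the structure that the cell above already relies on (`resources`, `entity.name`, `metadata.guid`); the name `find_space_guid` and the sample data are hypothetical.

```python
def find_space_guid(space_details, space_name):
    # Return the GUID of the first space with a matching name, or None.
    for item in space_details.get('resources', []):
        if item['entity']['name'] == space_name:
            return item['metadata']['guid']
    return None


# Hand-written sample data in the same shape.
sample_spaces = {
    'resources': [
        {'entity': {'name': 'dev'}, 'metadata': {'guid': 'guid-dev'}},
        {'entity': {'name': 'prod'}, 'metadata': {'guid': 'guid-prod'}},
    ]
}

found = find_space_guid(sample_spaces, 'prod')
not_found = find_space_guid(sample_spaces, 'staging')
```

With a live client, the first argument would be the result of `client.spaces.get_details()`.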

**Action:** Enter the name of your deployment space in the code below: `space_uid = guid_from_space_name(client, 'YOUR_DEPLOYMENT_SPACE')`.

In [17]:
# Enter the name of your deployment space here:
space_uid = guid_from_space_name(client, 'YOUR_DEPLOYMENT_SPACE')
print('Space UID = {}'.format(space_uid))

Space UID = 3ce5e796-f127-431a-9609-46456e10d274


Setting the default space is mandatory when you use WML Server. You can set this using the cell below.

In [18]:
client.set.default_space(space_uid)

'SUCCESS'

Now, you can save and deploy the function to the deployment space.

## 3. Save the Python function <a id="save"></a>

Create the function metadata.

In [19]:
# Function Metadata.
software_spec_uid = client.software_specifications.get_uid_by_name(
    'ai-function_0.1-py3.6')

meta_props = {
    client.repository.FunctionMetaNames.NAME: 'Web scraping python function',
    client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: software_spec_uid
}

Store the function in the repository, then extract the function UID from the saved function details.

In [20]:
# Create the function artifact.
function_artifact = client.repository.store_function(meta_props=meta_props,
                                                     function=py_funct)
function_uid = client.repository.get_function_uid(function_artifact)
print('Function UID = {}'.format(function_uid))

Function UID = 52c3da15-ecaa-45e1-9736-05ce3ed46579


Get the saved function metadata from WML using the function UID.

In [21]:
from json import dumps

# Details about the function.
function_details = client.repository.get_details(function_uid)
print(dumps(function_details, indent=4))

{
    "metadata": {
        "name": "Web scraping python function",
        "guid": "52c3da15-ecaa-45e1-9736-05ce3ed46579",
        "id": "52c3da15-ecaa-45e1-9736-05ce3ed46579",
        "modified_at": "2020-08-28T16:31:51.002Z",
        "created_at": "2020-08-28T16:31:49.002Z",
        "owner": "1000330999",
        "href": "/v4/functions/52c3da15-ecaa-45e1-9736-05ce3ed46579?space_id=3ce5e796-f127-431a-9609-46456e10d274",
        "space_id": "3ce5e796-f127-431a-9609-46456e10d274"
    },
    "entity": {
        "space": {
            "id": "3ce5e796-f127-431a-9609-46456e10d274",
            "href": "/v4/spaces/3ce5e796-f127-431a-9609-46456e10d274"
        },
        "name": "Web scraping python function",
        "type": "python",
        "software_spec": {
            "id": "0cdb0f1e-5376-4f4d-92dd-da3b69aa9bda"
        }
    }
}
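The details dictionary printed above can be reduced to the few fields that are usually needed later. The sketch below assumes only the keys visible in that output; the helper name `summarize_function` is hypothetical, and the sample dictionary is a shortened copy of the printed details.

```python
def summarize_function(function_details):
    # Pick out the identifying fields from the stored function details.
    metadata = function_details['metadata']
    entity = function_details['entity']
    return {
        'name': metadata['name'],
        'guid': metadata['guid'],
        'type': entity['type'],
        'software_spec_id': entity['software_spec']['id']
    }


sample_details = {
    'metadata': {
        'name': 'Web scraping python function',
        'guid': '52c3da15-ecaa-45e1-9736-05ce3ed46579'
    },
    'entity': {
        'type': 'python',
        'software_spec': {'id': '0cdb0f1e-5376-4f4d-92dd-da3b69aa9bda'}
    }
}

summary = summarize_function(sample_details)
```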


You can list all stored functions using the `list_functions` method.

In [22]:
# Display a list of all the functions.
client.repository.list_functions()

------------------------------------  ----------------------------  ------------------------  ------
GUID                                  NAME                          CREATED                   TYPE
52c3da15-ecaa-45e1-9736-05ce3ed46579  Web scraping python function  2020-08-28T16:31:49.002Z  python
------------------------------------  ----------------------------  ------------------------  ------


<div class="alert-block alert-info"><br>
&nbsp;&nbsp;&nbsp;&nbsp;From the list of stored functions, you can see that the function was successfully saved. <br>

&nbsp;&nbsp;&nbsp;&nbsp;With Watson Studio Desktop credentials, this means you have saved the function in your project.
<br>&nbsp;&nbsp;&nbsp;&nbsp;You can see the saved function in your project UI by clicking on your project name in the breadcrumb at the top of the application. <br>

&nbsp;&nbsp;&nbsp;&nbsp;With WML Server credentials, this means that you have saved the function in your deployment space.
<br>&nbsp;&nbsp;&nbsp;&nbsp;You can view your function by selecting <b>Deployment Spaces</b> from the Navigation Menu and clicking on your deployment space name.<br>
<br></div>

If you are using WML Server, proceed to section [4. Deploy the Python function (WML Server only)](#deploy). If you are using Watson Studio Desktop, you may skip to the [summary](#summary).

## 4. Deploy the Python function (WML Server only) <a id="deploy"></a>

Next, deploy the *Python function* to the deployment space by creating deployment metadata and using the function UID obtained in the previous section.

In [23]:
# Deployment metadata.
deploy_meta = {
    client.deployments.ConfigurationMetaNames.NAME:
    'Web scraping python function deployment',
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

In [24]:
# Create the deployment.
deployment_details = client.deployments.create(function_uid,
                                               meta_props=deploy_meta)



#######################################################################################

Synchronous deployment creation for uid: '52c3da15-ecaa-45e1-9736-05ce3ed46579' started

#######################################################################################


initializing...
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='6527b7bd-8427-48c7-9f59-ec1efd8c0de3'
------------------------------------------------------------------------------------------------




You can check the deployment details by running the following cell.

In [25]:
print(dumps(deployment_details, indent=4))

{
    "entity": {
        "asset": {
            "href": "/v4/functions/52c3da15-ecaa-45e1-9736-05ce3ed46579?space_id=3ce5e796-f127-431a-9609-46456e10d274",
            "id": "52c3da15-ecaa-45e1-9736-05ce3ed46579"
        },
        "custom": {},
        "description": "",
        "name": "Web scraping python function deployment",
        "online": {},
        "space": {
            "href": "/v4/spaces/3ce5e796-f127-431a-9609-46456e10d274",
            "id": "3ce5e796-f127-431a-9609-46456e10d274"
        },
        "space_id": "3ce5e796-f127-431a-9609-46456e10d274",
        "status": {
            "online_url": {
                "url": "https://wmlserver-dev-test.ml.test.cloud.ibm.com:31843/v4/deployments/6527b7bd-8427-48c7-9f59-ec1efd8c0de3/predictions"
            },
            "state": "ready"
        }
    },
    "metadata": {
        "created_at": "2020-08-28T16:31:53.148Z",
        "description": "",
        "guid": "6527b7bd-8427-48c7-9f59-ec1efd8c0de3",
        "href": "/v4/de

Check if the deployment was successfully created by listing deployments.

In [26]:
# List the deployments.
client.deployments.list()

------------------------------------  ---------------------------------------  -----  ------------------------  -------------
GUID                                  NAME                                     STATE  CREATED                   ARTIFACT_TYPE
6527b7bd-8427-48c7-9f59-ec1efd8c0de3  Web scraping python function deployment  ready  2020-08-28T16:31:53.148Z  function
------------------------------------  ---------------------------------------  -----  ------------------------  -------------


In [27]:
# Deployment UID.
deployment_uid = client.deployments.get_uid(deployment_details)
print('Deployment UID = {}'.format(deployment_uid))

Deployment UID = 6527b7bd-8427-48c7-9f59-ec1efd8c0de3
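The deployment details also carry the deployment state and the scoring endpoint. The sketch below reads them from the `entity.status` section shown in the printed details and refuses to return a URL while the deployment is not `ready`. The helper name `scoring_endpoint` is hypothetical and the sample URL is a placeholder.

```python
def scoring_endpoint(deployment_details):
    # Return the online scoring URL, but only for a ready deployment.
    status = deployment_details['entity']['status']
    if status.get('state') != 'ready':
        raise RuntimeError('Deployment is not ready: {}'.format(
            status.get('state')))
    return status['online_url']['url']


sample_deployment = {
    'entity': {
        'status': {
            'online_url': {
                'url': 'https://wml.example.com:31843/v4/deployments/abc/predictions'
            },
            'state': 'ready'
        }
    }
}

endpoint = scoring_endpoint(sample_deployment)
```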


### 4.1 Score data <a id="score"></a>

In this subsection, you will learn how to score a test data record against the deployed *Python function*.

The following cells list the stored functions and then prepare the record that will be used for scoring.

In [28]:
client.repository.list_functions()

------------------------------------  ----------------------------  ------------------------  ------
GUID                                  NAME                          CREATED                   TYPE
52c3da15-ecaa-45e1-9736-05ce3ed46579  Web scraping python function  2020-08-28T16:31:49.002Z  python
------------------------------------  ----------------------------  ------------------------  ------


In [29]:
# Prepare scoring payload.
job_payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA: [{
        'fields': ['url'],
        'values': [
            'https://www.ibm.com/cloud/machine-learning'
        ]
    }]
}
print(dumps(job_payload, indent=4))

{
    "input_data": [
        {
            "fields": [
                "url"
            ],
            "values": [
                "https://www.ibm.com/cloud/machine-learning"
            ]
        }
    ]
}


In [30]:
# Perform prediction and display the result.
job_details = client.deployments.score(deployment_uid, job_payload)
print(dumps(job_details['predictions'][0]['values'][0][:10], indent=4))

[
    "2018",
    "2019",
    "459",
    "574",
    "698",
    "accelerate",
    "accelerator",
    "access",
    "accuracy",
    "across"
]
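The first tokens in the result are purely numeric because the vocabulary is sorted. If only word-like tokens are of interest, they can be filtered on the client side after scoring. This is an optional post-processing sketch; the variable names are hypothetical and the input list is copied from the output above.

```python
returned_tokens = [
    '2018', '2019', '459', '574', '698',
    'accelerate', 'accelerator', 'access', 'accuracy', 'across'
]

# Keep only tokens that are not made up entirely of digits.
word_tokens = [token for token in returned_tokens if not token.isdigit()]
print(word_tokens)
```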


### 4.2 Delete the deployment and Python function<a id='delete'></a>

Use the following method to delete the deployment.

In [31]:
client.deployments.delete(deployment_uid)

'SUCCESS'

In [32]:
client.deployments.list()

----  ----  -----  -------  -------------
GUID  NAME  STATE  CREATED  ARTIFACT_TYPE
----  ----  -----  -------  -------------


You can delete the Python function as well by running the following cell.

In [33]:
client.repository.delete(function_uid)

'SUCCESS'

You can check that your Python function was deleted by generating a list of your saved Python functions.

In [34]:
client.repository.list_functions()

----  ----  -------  ----
GUID  NAME  CREATED  TYPE
----  ----  -------  ----
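The two delete calls can be wrapped in one helper so that the deployment is always removed before the function it depends on. The sketch below uses only the `client.deployments.delete` and `client.repository.delete` calls shown above; the helper name `delete_assets` is hypothetical.

```python
def delete_assets(client, deployment_uid, function_uid):
    # Delete the deployment first, then the function it was created from.
    results = {}
    results['deployment'] = client.deployments.delete(deployment_uid)
    results['function'] = client.repository.delete(function_uid)
    return results


# Example (requires a connected client):
# delete_assets(client, deployment_uid, function_uid)
```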


## 5. Summary and next steps <a id="summary"></a>

You successfully completed this notebook! 
 
You learned how to define a *Python function* and how to save it on both Watson Studio Desktop and Watson Machine Learning Server. You also learned how to deploy and score the *Python function* on WML Server.

As a next step, in addition to tokenizing, the `score()` function can call a classification model trained with the `SMS spam` data set to perform scoring.

### Resources <a id="resources"></a>

To learn more about configurations used in this notebook or more sample notebooks, tutorials, documentation, how-tos, and blog posts, check out these links:

<div class="alert alert-block alert-success">

<h4>IBM documentation</h4>
<br>
 <li> <a href="https://wml-api-pyclient-dev-v4.mybluemix.net" target="_blank" rel="noopener noreferrer">watson-machine-learning</a></li> 
 <li> <a href="https://www.ibm.com/support/knowledgecenter/SS3PWM_1.0.0/wsj/wmls/overview.html" target="_blank" rel="noopener noreferrer">Watson Machine Learning Server</a></li>
 
<h4> IBM Samples</h4>
<br>
 <li> <a href="https://github.com/IBMDataScience/sample-notebooks" target="_blank" rel="noopener noreferrer">Sample notebooks</a></li>
 
<h4> Others</h4>
<br>
 <li> <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank" rel="noopener noreferrer">BeautifulSoup documentation</a></li>
 <li> <a href="https://www.ibm.com/support/knowledgecenter/SS3PWM_1.0.0/wsj/analyze-data/ml-deploy-functions_local.html" target="_blank" rel="noopener noreferrer">Deploying Python functions in Watson Machine Learning</a></li>
 <li> <a href="https://www.python.org" target="_blank" rel="noopener noreferrer">Official Python website</a></li>
 <li> <a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener noreferrer">scikit-learn: Machine Learning in Python</a></li>
 <li> <a href="https://www.datacamp.com/community/tutorials/web-scraping-using-python" target="_blank" rel="noopener noreferrer">Web scraping using Python</a></li>
 <li> <a href="https://tokenex.com/resource-center/what-is-tokenization/" target="_blank" rel="noopener noreferrer">What is tokenization</a></li></div>

### Citation

Almeida, T. A. and Hidalgo, J. M. G. (2012), [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection), Irvine, CA.

### Author

**Jihyoung Kim**, Ph.D., is a Data Scientist at IBM who contributes to Watson Studio to democratize data science.

Copyright © 2019-2020 IBM. This notebook and its source code are released under the terms of the MIT License.
