<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="4" color="black"><b>Use the Python function feature to scrape a webpage</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/pmservice/wml-sample-notebooks/master/images/python.png?raw=true" width="600" alt="Icon"> </th>
   </tr>
</table>

A *Python function* is a Watson Machine Learning feature for saving and deploying Python code. You can implement Python functions in a Python notebook, or in an IDE through the REST API.

A Python function must contain an inner `score()` function. The `score()` function is called each time the deployed Python function is run.
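
As a minimal sketch of that shape (the names `my_function` and `params` and the `greeting` key are illustrative assumptions, not part of the WML API), the outer function does any one-time setup and returns the inner `score()` function as a closure:

```python
def my_function(params=None):
    # Code here runs once, when the outer function is called.
    # 'greeting' is a hypothetical parameter used only for illustration.
    greeting = (params or {}).get('greeting', 'hello')

    def score(payload):
        # Code here runs on every scoring request.
        return {'echo': ['{} {}'.format(greeting, v) for v in payload['values']]}

    return score


scorer = my_function({'greeting': 'hi'})
result = scorer({'values': ['a', 'b']})
```

Because `score()` closes over `greeting`, values prepared during setup remain available on every call without being recomputed.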

A Python function can be:
- Saved in the Watson Machine Learning (WML) repository.
- Deployed in the Watson Machine Learning (WML) repository.
- Scored.


<div class="alert-block alert-info"><br>For more details on Python functions, see the <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-deploy-functions.html?audience=wdp" target="_blank" rel="noopener noreferrer">documentation</a>.<br><br></div>

This notebook demonstrates how to save, deploy, and score a Python function. Although the `score()` function is named for scoring, it can run other custom logic, such as preprocessing text.

The `score()` function of the Python function in this example notebook does the following tasks:
- Scrapes texts that are enclosed in `<p>` tags.
- Tokenizes scraped texts.
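
To make these two tasks concrete, here is a small offline sketch that uses only the standard library. It is a stand-in for illustration, not the BeautifulSoup and `CountVectorizer` code used later in this notebook; the class and function names are hypothetical, and the regular expression only roughly approximates word-level tokenization (lowercased words of two or more characters).

```python
import re
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p>...</p> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # Greater than 0 while inside a <p> element.
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._depth += 1
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p' and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.paragraphs[-1] += data


def scrape_paragraphs(html_text):
    collector = ParagraphCollector()
    collector.feed(html_text)
    return collector.paragraphs


def tokenize(texts):
    # Lowercased words of two or more characters, deduplicated and sorted.
    found = set()
    for text in texts:
        found.update(re.findall(r'\b\w\w+\b', text.lower()))
    return sorted(found)


sample_html = '<html><body><p>Free entry now</p><div>ignored</div><p>Call me now</p></body></html>'
sample_paragraphs = scrape_paragraphs(sample_html)
sample_tokens = tokenize(sample_paragraphs)
```

Text outside `<p>` elements (the `<div>` here) is dropped, and repeated words such as `now` appear only once in the token list.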
    
The data used in this notebook is the <a href="http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection" target="_blank" rel="noopener noreferrer">SMS spam data set</a> from the UCI Machine Learning Repository.

The original data set has both texts and labels in a single file. Only the text parts of the data set were extracted and converted into an `html` file.
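
The conversion script is not part of this notebook. The following is a minimal sketch of how such a file could be produced, assuming each line of the original file holds a label and a message separated by a tab. The helper name `messages_to_html` and the two sample lines are made up for illustration.

```python
import html


def messages_to_html(lines):
    """Wrap the text part of each 'label<TAB>text' line in a <p> element."""
    paragraphs = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        # Split on the first tab only, so tabs inside the message are preserved.
        _label, _, text = line.partition('\t')
        paragraphs.append('<p>{}</p>'.format(html.escape(text)))
    return '<html>\n<body>\n{}\n</body>\n</html>'.format('\n'.join(paragraphs))


# Made-up sample lines, not taken from the data set.
sample_lines = ['ham\tSee you at 5 & bring snacks\n', 'spam\tWIN a prize <now>\n']
html_doc = messages_to_html(sample_lines)
```

Escaping the message text keeps characters such as `&` and `<` from being read as markup when the page is scraped later.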

You can find the `html` version of the SMS messages <a href="https://github.com/pmservice/wml-sample-notebooks/tree/master/datasets" target="_blank" rel="noopener noreferrer">here</a>.

Some familiarity with Python is helpful. This notebook uses `watson-machine-learning-client-V4` and is compatible with Watson Studio Local 2.0 and Python 3.5.

#### To get started on Watson Studio Local, find documentation on installation and setup <a href="https://www.ibm.com/support/knowledgecenter/SSHGWL_2.0.0/wsj/getting-started/get-started-wdp.html" target="_blank" rel="noopener noreferrer">here</a>.

## Table of Contents

This notebook contains the following parts:

1.	[Define a Python function](#function)
2.	[Save, deploy, and score the Python function](#deploy)<br>
    2.1  [Set up the environment](#setup) <br>
    2.2  [Save and deploy the Python function](#save) <br>
    2.3  [Score data](#score)<br>
3.	[Summary and next steps](#summary)

## 1. Define a Python function <a id="function"></a>

You can pass a dictionary of parameters to the Python function by filling in the `py_params` dictionary in the cell below.

In [1]:
# You can add any information needed to run the Python function, e.g., wml credentials.
py_params = {

}

The code outside the `score()` function executes only once and can load objects, install libraries, and so on.

In this example, the `score()` function takes the URLs in the payload and passes each one to BeautifulSoup, which scrapes the text enclosed in `<p>` tags. The extracted text is then passed to `scikit-learn`'s `CountVectorizer`, which tokenizes it.

<div class="alert-block alert-info"><br>If you import modules inside the Python function, you must install the required packages through the <tt>subprocess</tt> module. More information on Python functions can be found in the <a href="https://www.ibm.com/support/knowledgecenter/SSHGWL_2.0.0/wsj/analyze-data/ml-deploy-functions_local.html" target="_blank" rel="noopener noreferrer">documentation</a>.<br><br></div>

In [2]:
def py_funct(params=py_params):  
    try:
        # Import the subprocess module.
        import subprocess
        
        # Install required packages.
        subprocess.check_output('pip install --user lxml', stderr=subprocess.STDOUT, shell=True)
        subprocess.check_output('pip install --user bs4', stderr=subprocess.STDOUT, shell=True)
        subprocess.check_output('pip install --user scikit-learn', stderr=subprocess.STDOUT, shell=True)
    except subprocess.CalledProcessError as e:        
        install_err = 'subprocess.CalledProcessError:\n\n' + 'cmd:\n' + e.cmd + '\n\noutput:\n' + e.output.decode()
        raise Exception( 'Installation failed:\n' + install_err )
    
    def score(payload):
        try:
            # Import required modules.
            from bs4 import BeautifulSoup
            from urllib.request import urlopen
            from sklearn.feature_extraction.text import CountVectorizer

            urls = payload['values']
            final_texts = []   # Plain text extracted from the <p> elements.

            for url in urls:            
                html = urlopen(url)
                soup = BeautifulSoup(html, 'lxml')

                p_tags = soup.find_all('p')    # Text is enclosed in <p> tag.

                for p in p_tags:
                    # get_text() returns the element's text without nested tags.
                    final_texts.append(p.get_text())

            vectorizer = CountVectorizer()
            vectorizer.fit_transform(final_texts)

            return {'tokens': vectorizer.get_feature_names()}
        except Exception as e:
            return {'error': repr(e)}
        
    return score
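
The installation step in `py_funct` passes a shell string to `pip`. One alternative, sketched here under the assumption that the deployment runtime allows invoking `pip` through the running interpreter, is to pass an argument list to `sys.executable -m pip`, which avoids the shell and targets the same Python environment. The helper names `build_pip_command` and `install`, and the `runner` argument, are hypothetical.

```python
import subprocess
import sys


def build_pip_command(package):
    # Use the running interpreter's pip so the package lands in the same environment.
    return [sys.executable, '-m', 'pip', 'install', '--user', package]


def install(package, runner=subprocess.check_output):
    """Install one package; `runner` is injectable so the logic can be tried offline."""
    try:
        return runner(build_pip_command(package), stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        output = e.output.decode() if isinstance(e.output, bytes) else str(e.output)
        raise RuntimeError('Installation of {} failed:\n{}'.format(package, output))
```

Inside a Python function, these helpers would need to be defined within the outer function so that they are stored together with it.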

Prepare a sample payload.

In [3]:
sample_data = {
    'fields': ['url'],
    'values': [
        'https://raw.githubusercontent.com/pmservice/wml-sample-notebooks/master/datasets/sms_spam_text.html'
    ]
}

Call the Python function to obtain the `score()` function, then pass the sample payload with the list of URLs to it.

In [4]:
pf = py_funct(py_params)
tokens = pf(sample_data)

The returned `score()` function produces a `dict` whose `tokens` key holds the list of tokens.

In [5]:
# Token list
tokens['tokens'][:10]

['00',
 '000',
 '000pes',
 '008704050406',
 '0089',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02']
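
Because `score()` catches exceptions and returns them under an `error` key instead of raising, the `tokens` key is missing whenever scraping fails. A small helper can make that case explicit; the name `extract_tokens` is a hypothetical choice for this sketch.

```python
def extract_tokens(result):
    """Return the token list, or raise if score() reported an error."""
    if 'error' in result:
        raise RuntimeError('Scoring failed: {}'.format(result['error']))
    return result['tokens']


# Hand-written results shaped like the two dicts that score() can return.
ok_tokens = extract_tokens({'tokens': ['00', '000']})
```

In this notebook it could be used as `extract_tokens(pf(sample_data))[:10]`.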

## 2. Save, deploy, and score the Python function <a id="deploy"></a>

In this section, you will learn how to save, deploy, and score the Python function in the Watson Machine Learning (WML) repository.

### 2.1 Set up the environment <a id="setup"></a>

In this subsection, you will enter your credentials and create the Watson Machine Learning (WML) Python client.

**Authenticate the Python client on Watson Studio Local.**

<div class="alert-block alert-info"><br>To find your authentication information (your credentials), follow the steps provided in the <a href="https://www.ibm.com/support/knowledgecenter/SSHGWL_2.0.0/wsj/analyze-data/ml-notebook_local.html" target="_blank" rel="noopener noreferrer">documentation</a>.<br><br></div>

**Action**: Enter your credentials in the following cell.

In [6]:
wml_credentials = {
    'url': '---',
    'username': '---',
    'password': '---',
    'instance_id': 'icp'
}
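
If you prefer not to type credentials into a notebook cell, one option is to read them from environment variables. This is only a sketch: the variable names `WML_URL`, `WML_USERNAME`, and `WML_PASSWORD` and the helper `credentials_from_env` are assumptions, not names defined by Watson Studio.

```python
import os


def credentials_from_env(env=os.environ):
    # WML_URL, WML_USERNAME and WML_PASSWORD are hypothetical variable names.
    required = ['WML_URL', 'WML_USERNAME', 'WML_PASSWORD']
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {
        'url': env['WML_URL'],
        'username': env['WML_USERNAME'],
        'password': env['WML_PASSWORD'],
        'instance_id': 'icp',   # Same fixed value as in the cell above.
    }
```

With the variables set, `wml_credentials = credentials_from_env()` would produce a dictionary with the same keys as the cell above.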

#### Import the `watson-machine-learning-client` module.

In [8]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

Now, instantiate a WatsonMachineLearningAPIClient object.

In [9]:
client = WatsonMachineLearningAPIClient(wml_credentials)

### 2.2 Save and deploy the Python function <a id="save"></a>

In this subsection, you will learn how to save and deploy the *Python function*.
First, store the *Python function* with the metadata.

You can obtain the space UID (one of the MetaNames in the function metadata) by using the following cells.

<div class="alert-block alert-info"><br>You can create your own <a href="https://www.ibm.com/support/knowledgecenter/SSHGWL_2.0.0/wsj/analyze-data/ml-spaces_local.html" target="_blank" rel="noopener noreferrer">deployment space</a> by selecting <b>Deployment Spaces</b> from the navigation menu at the top left of this page.<br><br></div>

In [10]:
# Obtain the UID of your space.
def guid_from_space_name(client, space_name):
    instance_details = client.service_instance.get_details()
    spaces = client.spaces.get_details()
    # Return the GUID of the first space whose name matches.
    return next(item for item in spaces['resources'] if item['entity']['name'] == space_name)['metadata']['guid']
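
When no space has the given name, `next()` in the function above raises a bare `StopIteration`, which can be hard to interpret. A more defensive lookup might look like the sketch below. It works on a dictionary shaped like the one the function above reads (`resources`, `entity.name`, `metadata.guid`); the name `find_space_guid` and the stand-in data are hypothetical.

```python
def find_space_guid(spaces_details, space_name):
    """Return the GUID of the space called `space_name`, or raise a clear error."""
    matches = [
        item['metadata']['guid']
        for item in spaces_details.get('resources', [])
        if item['entity'].get('name') == space_name
    ]
    if not matches:
        raise ValueError('No deployment space named {!r}'.format(space_name))
    if len(matches) > 1:
        raise ValueError('Several deployment spaces are named {!r}'.format(space_name))
    return matches[0]


# Hand-written stand-in for the value returned by client.spaces.get_details().
fake_details = {
    'resources': [
        {'entity': {'name': 'dev'}, 'metadata': {'guid': 'aaaa'}},
        {'entity': {'name': 'prod'}, 'metadata': {'guid': 'bbbb'}},
    ]
}
guid = find_space_guid(fake_details, 'prod')
```

In the notebook, the call would be `find_space_guid(client.spaces.get_details(), 'YOUR DEPLOYMENT SPACE')`.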

**Action:** Enter the name of your deployment space in the code below: `space_uid = guid_from_space_name(client, 'YOUR DEPLOYMENT SPACE')`.

In [11]:
# Enter the name of your deployment space here:
space_uid = guid_from_space_name(client, 'YOUR DEPLOYMENT SPACE')
print("Space UID = " + space_uid)

Space UID = 1c428722-4621-4ed3-9f77-7082c338ac35


In [12]:
# Function Metadata.
meta_props = {
    client.repository.FunctionMetaNames.NAME: "Web scraping python function",
    client.repository.FunctionMetaNames.RUNTIME_UID: "ai-function_0.1-py3",
    client.repository.FunctionMetaNames.SPACE_UID: space_uid
}

You need the function UID to create the deployment. You can extract the function UID from the saved function details and use it in the next section to create the deployment.

In [13]:
# Create the function artifact.
function_artifact = client.repository.store_function(meta_props=meta_props, function=py_funct)
function_uid = client.repository.get_function_uid(function_artifact)
print("Function UID = " + function_uid)

Function UID = 39f4f9f4-3af3-4190-bf91-9cbd9043b881


Get the saved function metadata from the WML Repository.

In [14]:
from pprint import pprint

# Details about the function.
function_details = client.repository.get_details(function_uid)
pprint(function_details)

{'entity': {'name': 'Web scraping python function',
            'runtime': {'href': '/v4/runtimes/ai-function_0.1-py3'},
            'space': {'href': '/v4/spaces/1c428722-4621-4ed3-9f77-7082c338ac35'},
            'type': 'python'},
 'metadata': {'created_at': '2019-09-25T22:08:59.755Z',
              'guid': '39f4f9f4-3af3-4190-bf91-9cbd9043b881',
              'href': '/v4/functions/39f4f9f4-3af3-4190-bf91-9cbd9043b881?rev=4e57f1a6-48b0-4dfd-a088-94da1bbf9763',
              'id': '39f4f9f4-3af3-4190-bf91-9cbd9043b881',
              'modified_at': '2019-09-25T22:08:59.844Z',
              'rev': '4e57f1a6-48b0-4dfd-a088-94da1bbf9763'}}
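
The details dictionary is nested, so it can be convenient to flatten the fields you need. The sketch below assumes the structure printed above, in which the runtime and space identifiers are the last path segment of each `href`. The helper name `summarize_function` is hypothetical, and the sample dictionary is a trimmed copy of the output above.

```python
def summarize_function(details):
    entity, metadata = details['entity'], details['metadata']
    return {
        'name': entity['name'],
        'guid': metadata['guid'],
        # The last path segment of each href is the identifier.
        'runtime': entity['runtime']['href'].rsplit('/', 1)[-1],
        'space': entity['space']['href'].rsplit('/', 1)[-1],
    }


# Trimmed copy of the details printed above.
sample_details = {
    'entity': {
        'name': 'Web scraping python function',
        'runtime': {'href': '/v4/runtimes/ai-function_0.1-py3'},
        'space': {'href': '/v4/spaces/1c428722-4621-4ed3-9f77-7082c338ac35'},
    },
    'metadata': {'guid': '39f4f9f4-3af3-4190-bf91-9cbd9043b881'},
}
summary = summarize_function(sample_details)
```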


You can list all stored functions using the `list_functions` method.

In [15]:
# Display a list of all the functions.
client.repository.list_functions()

------------------------------------  ----------------------------  ------------------------  ------
GUID                                  NAME                          CREATED                   TYPE
39f4f9f4-3af3-4190-bf91-9cbd9043b881  Web scraping python function  2019-09-25T22:08:59.755Z  python
------------------------------------  ----------------------------  ------------------------  ------


Next, deploy the *Python function*.

In [16]:
# Deployment metadata.
deploy_meta = {
    client.deployments.ConfigurationMetaNames.NAME: "Web scraping python function deployment",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

In [17]:
# Create the deployment.
deployment_details = client.deployments.create(function_uid, meta_props=deploy_meta)



#######################################################################################

Synchronous deployment creation for uid: '39f4f9f4-3af3-4190-bf91-9cbd9043b881' started

#######################################################################################


initializing


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='a8a1ab40-9130-4d75-a530-a16d0f7d7ac4'
------------------------------------------------------------------------------------------------




You can check the deployment details by running the following cell.

In [18]:
deployment_details

{'entity': {'asset': {'href': '/v4/functions/39f4f9f4-3af3-4190-bf91-9cbd9043b881?rev=4e57f1a6-48b0-4dfd-a088-94da1bbf9763'},
  'auto_redeploy': False,
  'custom': {},
  'description': '',
  'name': 'Web scraping python function deployment',
  'online': {},
  'status': {'message': '',
   'online_url': {'url': 'https://jctesti22-lb-1.fyre.ibm.com:31843/v4/deployments/a8a1ab40-9130-4d75-a530-a16d0f7d7ac4/predictions'},
   'state': 'initializing'}},
 'metadata': {'created_at': '2019-09-25T22:09:02+0000',
  'guid': 'a8a1ab40-9130-4d75-a530-a16d0f7d7ac4',
  'href': '/v4/deployments/a8a1ab40-9130-4d75-a530-a16d0f7d7ac4',
  'modified_at': '',
  'parent': {'href': ''}}}

Verify that the deployment was created by listing the deployments.

In [19]:
# List the deployments.
client.deployments.list()

------------------------------------  ---------------------------------------  ------------  ------------------------  -------------
GUID                                  NAME                                     STATE         CREATED                   ARTIFACT_TYPE
a8a1ab40-9130-4d75-a530-a16d0f7d7ac4  Web scraping python function deployment  initializing  2019-09-25T22:09:02+0000  function
------------------------------------  ---------------------------------------  ------------  ------------------------  -------------


In [20]:
# Deployment UID.
deployment_uid = client.deployments.get_uid(deployment_details)
print('Deployment uid = {}'.format(deployment_uid))

Deployment uid = a8a1ab40-9130-4d75-a530-a16d0f7d7ac4
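
The deployment details and the listing above both show the state `initializing`. If you want to wait for that state to change before scoring, a polling loop is one option. The sketch below is written against a callable so that it can be tried without a WML connection. The helper name `wait_until_ready` and the state name `ready` in the stand-in data are assumptions, and the state path follows the details dictionary shown above.

```python
import time


def wait_until_ready(fetch_details, timeout=300, interval=5, sleep=time.sleep):
    """Poll `fetch_details()` until the deployment leaves the 'initializing' state."""
    waited = 0
    while True:
        state = fetch_details()['entity']['status']['state']
        if state != 'initializing':
            return state
        if waited >= timeout:
            raise TimeoutError('Deployment still initializing after {} s'.format(waited))
        sleep(interval)
        waited += interval


# Stand-in that reports 'initializing' twice before changing state.
fake_states = iter(['initializing', 'initializing', 'ready'])
final_state = wait_until_ready(
    lambda: {'entity': {'status': {'state': next(fake_states)}}},
    sleep=lambda seconds: None,   # Skip real waiting in this illustration.
)
```

In the notebook, `fetch_details` could be `lambda: client.deployments.get_details(deployment_uid)`, assuming that call returns a dictionary with the same structure as `deployment_details`.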


### 2.3 Score data <a id="score"></a>

In this subsection, you will learn how to score the deployed *Python function* with a test data record.

The following is the record that will be used for scoring.

In [21]:
# Prepare scoring payload.
job_payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA: [{
        'values': [
            'https://www.ibm.com/cloud/machine-learning'
        ]
    }]
}
pprint(job_payload)

{'scoring_input_data': [{'values': ['https://www.ibm.com/cloud/machine-learning']}]}
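
Since the deployed function opens whatever URLs it receives, it can help to validate them while building the payload. The following sketch builds a dictionary with the same shape as the one printed above; the helper name `build_scoring_payload` is hypothetical, and the default key is the `scoring_input_data` string shown in that output.

```python
from urllib.parse import urlparse


def build_scoring_payload(urls, input_key='scoring_input_data'):
    """Build a payload shaped like the one printed above from a list of URLs."""
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ('http', 'https') or not parts.netloc:
            raise ValueError('Not an absolute http(s) URL: {!r}'.format(url))
    return {input_key: [{'values': list(urls)}]}


example_payload = build_scoring_payload(['https://www.ibm.com/cloud/machine-learning'])
```

In the notebook, you could pass `input_key=client.deployments.ScoringMetaNames.INPUT_DATA` rather than relying on the literal string.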


In [22]:
# Score the deployment and display the first ten tokens.
scoring_response = client.deployments.score(deployment_uid, job_payload)
pprint(scoring_response['predictions'][0]['tokens'][:10])

['accelerate',
 'accelerating',
 'access',
 'accuracy',
 'across',
 'actively',
 'adapts',
 'advantage',
 'ai',
 'algorithms']


## 3. Summary and next steps <a id="summary"></a>

You successfully completed this notebook! 
 
You learned how to define a *Python function* and how to save, deploy, and score it in the Watson Machine Learning (WML) repository.

As a next step, the `score()` function can be extended so that, in addition to tokenizing, it calls a classification model trained on the `SMS spam` data set to perform scoring.

### Resources <a id="resources"></a>

To learn more about configurations used in this notebook or more sample notebooks, tutorials, documentation, how-tos, and blog posts, check out these links:

<div class="alert alert-block alert-success">

<h4>IBM documentation</h4>
<br>
 <li> <a href="https://wml-api-pyclient-dev-v4.mybluemix.net" target="_blank" rel="noopener noreferrer">watson-machine-learning</a></li> 
 <li> <a href="https://www.ibm.com/support/knowledgecenter/SSHGWL_2.0.0/local/welcome.html" target="_blank" rel="noopener noreferrer">Watson Studio</a></li>
 
<h4> IBM Samples</h4>
<br>
 <li> <a href="https://github.com/IBMDataScience/sample-notebooks" target="_blank" rel="noopener noreferrer">Sample notebooks</a></li>
 
<h4> Others</h4>
<br>
 <li> <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank" rel="noopener noreferrer">BeautifulSoup documentation</a></li>
 <li> <a href="https://www.ibm.com/support/knowledgecenter/SSHGWL_2.0.0/wsj/analyze-data/ml-deploy-functions_local.html" target="_blank" rel="noopener noreferrer">Deploying Python functions in Watson Machine Learning</a></li>
 <li> <a href="https://www.python.org" target="_blank" rel="noopener noreferrer">Official Python website</a></li>
 <li> <a href="https://scikit-learn.org/stable/" target="_blank" rel="noopener noreferrer">scikit-learn: machine learning in Python</a></li>
 <li> <a href="https://www.datacamp.com/community/tutorials/web-scraping-using-python" target="_blank" rel="noopener noreferrer">Web scraping using Python</a></li>
 <li> <a href="https://tokenex.com/resource-center/what-is-tokenization/" target="_blank" rel="noopener noreferrer">What is tokenization</a></li></div>

### Citation

Almeida, T. A. and Hidalgo, J. M. G. (2012), [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection), Irvine, CA.

### Author

**Jihyoung Kim**, Ph.D., is a Data Scientist at IBM who strives to make data science easy for everyone through Watson Studio.

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.
