<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="4" color="black"><b>Use the Python Function feature to scrape a webpage</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/pmservice/wml-sample-notebooks/master/images/python.png?raw=true" width="600" alt="Icon"> </th>
   </tr>
</table>

A Python Function is a Watson Machine Learning feature for saving and deploying Python code. You can create a Python Function from a Python notebook or, through the REST API, from an IDE.

A Python Function must contain a `score()` function. This `score()` function is called each time the deployed Python Function is run.
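The required shape can be sketched as an outer function that returns `score()`. This is a minimal illustration only: the names `make_function` and `greeting` are hypothetical and are not part of the WML API.

```python
def make_function(params=None):
    # Code at this level runs once, when the outer function is called.
    greeting = (params or {}).get('greeting', 'Hello')

    def score(payload):
        # Code at this level runs on every scoring request.
        return {'values': [greeting + ', ' + str(v) for v in payload['values']]}

    return score


score = make_function({'greeting': 'Hi'})
print(score({'values': ['WML']}))
```

The inner `score()` keeps access to `greeting` through the closure, so values prepared once can be reused on every call.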

A Python Function can be:
- Saved in the Watson Machine Learning (WML) repository.
- Deployed in the Watson Machine Learning (WML) repository.
- Scored.


**Note**: For more details on Python Functions, refer to the <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-deploy-functions.html?audience=wdp" target="_blank" rel="noopener noreferrer">documentation</a>.

This notebook demonstrates how to save, deploy, and score a Python Function. Although `score()` is primarily the scoring entry point, it can also carry other custom logic, such as text preprocessing.

The `score()` function presented in this notebook performs the following tasks:
- Scrapes the text enclosed in `<p>` tags.
- Tokenizes the scraped text.
    
The data used in this notebook is the <a href="http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection" target="_blank" rel="noopener noreferrer">SMS spam data set</a> from the UCI Machine Learning Repository.

The original data set stores both texts and labels in a single file. For this notebook, only the text parts were extracted and converted into an HTML file.

You can find the HTML version of the SMS messages <a href="https://github.com/pmservice/wml-sample-notebooks/tree/master/datasets" target="_blank" rel="noopener noreferrer">here</a>.

This notebook runs on Python.

## Contents

This notebook contains the following parts:

1.	[Define a Python Function](#function)
2.	[Save, deploy, and score the Python Function](#deploy)
3.	[Summary and next steps](#summary)

## 1. Define a Python Function <a id="function"></a>

You can pass a parameter `dict` to the Python Function, as in the following cell.

In [1]:
# You can add any information needed to run the Python function, e.g., wml credentials.
py_params = {

}

The code outside the `score()` function executes only once, so it is the place for one-time setup such as loading objects or installing libraries.

In this example, the `score()` function takes the URL(s) in the payload and passes them to BeautifulSoup, which scrapes the text enclosed in `<p>` tags. The extracted text is then passed to `scikit-learn`'s `CountVectorizer` for tokenization.
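To make the scraping step concrete without a network call or third-party packages, the sketch below extracts `<p>` text from an inline HTML string with the standard library's `html.parser`. It is a stand-in for BeautifulSoup's `find_all('p')` plus `get_text()`, not the method used in this notebook, and `ParagraphExtractor`, `extract_paragraphs`, and `sample_html` are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # > 0 while the parser is inside a <p> element
        self._chunks = []      # text pieces of the paragraph being read
        self.paragraphs = []   # finished paragraph texts

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append(''.join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> is kept as well.
        if self._depth > 0:
            self._chunks.append(data)


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample_html = ('<html><body><h1>Inbox</h1>'
               '<p>Free entry in a <b>weekly</b> draw</p><p>Ok lar</p>'
               '</body></html>')
print(extract_paragraphs(sample_html))
```

Text outside `<p>` elements, such as the `<h1>` heading, is ignored, which mirrors the intent of the scraping step.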

**Note**: If you import modules inside the Python Function, you have to install the required packages through the `subprocess` module. More details can be found <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-functions.html?audience=wdp#import" target="_blank" rel="noopener noreferrer">here</a> in the documentation for Python Functions.
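One possible variant of the installation step is sketched here. It assumes that running pip as `sys.executable -m pip` with an argument list is acceptable in your runtime, which avoids `shell=True` and targets the interpreter that is running the function. The helpers `pip_install_command` and `install` are hypothetical names.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command that targets the running interpreter."""
    return [sys.executable, '-m', 'pip', 'install', '--user', package]


def install(package):
    """Run pip and raise RuntimeError with pip's output on failure."""
    try:
        subprocess.check_output(pip_install_command(package), stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        message = 'Installation of {} failed:\n{}'.format(package, e.output.decode())
        raise RuntimeError(message) from e


# Only the command is printed here; call install('lxml') to actually run pip.
print(pip_install_command('lxml')[1:])
```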

In [2]:
def py_funct(params=py_params):  
    try:
        # Import the subprocess module.
        import subprocess
        
        # Install required packages.
        subprocess.check_output('pip install --user lxml', stderr=subprocess.STDOUT, shell=True)
        subprocess.check_output('pip install --user bs4', stderr=subprocess.STDOUT, shell=True)
        subprocess.check_output('pip install --user scikit-learn', stderr=subprocess.STDOUT, shell=True)
    except subprocess.CalledProcessError as e:        
        install_err = 'subprocess.CalledProcessError:\n\n' + 'cmd:\n' + e.cmd + '\n\noutput:\n' + e.output.decode()
        raise Exception( 'Installation failed:\n' + install_err )
    
    def score(payload):
        try:
            # Import required modules.
            from bs4 import BeautifulSoup
            from urllib.request import urlopen
            from sklearn.feature_extraction.text import CountVectorizer

            urls = payload['values']
            final_texts = []   # Plain text extracted from the <p> elements.

            for url in urls:            
                html = urlopen(url)
                soup = BeautifulSoup(html, 'lxml')

                p_tags = soup.find_all('p')    # Text is enclosed in <p> tag.

                for p in p_tags:
                    str_p = str(p)
                    text = BeautifulSoup(str_p, 'lxml').get_text()
                    final_texts.append(text)

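            vectorizer = CountVectorizer()
            vectorizer.fit_transform(final_texts)

            # get_feature_names() was removed in scikit-learn 1.2; prefer get_feature_names_out() when available.
            if hasattr(vectorizer, 'get_feature_names_out'):
                return {'tokens': vectorizer.get_feature_names_out().tolist()}
            return {'tokens': vectorizer.get_feature_names()}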
        except Exception as e:
            return {'error': repr(e)}
        
    return score

Prepare a sample payload, as in the following cell.

In [3]:
sample_data = {
    'fields': ['url'],
    'values': [
        'https://raw.githubusercontent.com/pmservice/wml-sample-notebooks/master/datasets/sms_spam_text.html'
    ]
}

Pass the payload, which holds the list of URLs, to the Python Function.

In [4]:
pf = py_funct(py_params)
tokens = pf(sample_data)

The function returned by `py_funct` produces a `dict` whose `tokens` key holds the list of tokens.

In [5]:
# Token list
tokens['tokens'][:10]

['00',
 '000',
 '000pes',
 '008704050406',
 '0089',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02']
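The shape of this token list follows from `CountVectorizer`'s defaults: text is lowercased, tokens are runs of two or more word characters, and the vocabulary is returned in sorted order, so digit strings come first. The sketch below approximates that behavior with the standard library's `re` module. It is not a replacement for `CountVectorizer`, and `vocabulary` is a hypothetical helper name.

```python
import re

# Approximates CountVectorizer's default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')


def vocabulary(texts):
    """Return the sorted set of lowercased tokens found in the given texts."""
    tokens = set()
    for text in texts:
        tokens.update(TOKEN_PATTERN.findall(text.lower()))
    return sorted(tokens)


print(vocabulary(['Call 0121 now', 'WIN 000 pounds, call now!']))
```

Single-character tokens are dropped and duplicates collapse into one vocabulary entry.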

## 2. Save, deploy, and score the Python Function <a id="deploy"></a>

In this section, you will learn how to save, deploy, and score the Python Function in the Watson Machine Learning (WML) repository.

- [2.1 Set up the environment](#setup)
- [2.2 Save and deploy the Python Function](#save)
- [2.3 Score data](#score)

### 2.1 Set up the environment <a id="setup"></a>

In this subsection, you will learn how to set up the Watson Machine Learning (WML) service that is required to save, deploy, and score the `Python Function` in the Watson Machine Learning (WML) repository.

#### Install the `watson-machine-learning-client` package from PyPI
**Note:** The `watson-machine-learning-client` documentation can be found <a href="http://wml-api-pyclient.mybluemix.net/" target="_blank" rel="noopener noreferrer">here</a>.

In [6]:
!rm -rf $PIP_BUILD/watson-machine-learning-client

In [None]:
!pip install --upgrade watson-machine-learning-client

Authenticate the Watson Machine Learning service on the IBM Cloud.

**Tip**: Authentication information (your credentials) can be found in the <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-get-wml-credentials.html" target="_blank" rel="noopener noreferrer">Service credentials</a> tab of the service instance that you created on the IBM Cloud. <br>If you cannot find the **instance_id** field in **Service credentials**, click **New credential (+)** to generate new authentication information.

**Action**: Enter your Watson Machine Learning service instance credentials in the following cell.

In [8]:
wml_credentials = {
    "apikey": "...",
    "username": "...",
    "password": "...",
    "instance_id": "...",
    "url": "https://ibm-watson-ml.mybluemix.net"
}
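If you prefer not to paste secrets into a notebook cell, one option is to read them from environment variables. The sketch below assumes the same five keys as the credentials cell. The variable names (`WML_APIKEY`, `WML_INSTANCE_ID`, and so on) and the helper `credentials_from_env` are hypothetical and are not defined by the WML client.

```python
import os


def credentials_from_env(environ=os.environ, prefix='WML_'):
    """Build a credentials dict from environment variables (hypothetical names)."""
    fields = ['apikey', 'instance_id', 'url', 'username', 'password']
    missing = [prefix + f.upper() for f in fields if prefix + f.upper() not in environ]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {f: environ[prefix + f.upper()] for f in fields}


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {
    'WML_APIKEY': 'k', 'WML_INSTANCE_ID': 'i', 'WML_URL': 'u',
    'WML_USERNAME': 'n', 'WML_PASSWORD': 'p',
}
print(sorted(credentials_from_env(fake_env)))
```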

#### Import the `watson-machine-learning-client` module and authenticate the service instance.

In [9]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

Now, let's instantiate a WatsonMachineLearningAPIClient object.

In [10]:
# Uses the wml_credentials dict that you filled in earlier.
client = WatsonMachineLearningAPIClient(wml_credentials)

### 2.2 Save and deploy the Python Function <a id="save"></a>

In this subsection, you will learn how to save and deploy the `Python Function`.

First, store the `Python Function` with its metadata.

In [11]:
meta_data = {client.repository.FunctionMetaNames.NAME: 'Web scraping python function'}
# If the py_params dict is not empty, pass function=py_funct(py_params) instead.
function_details = client.repository.store_function(meta_props=meta_data, function=py_funct)

No matching default runtime found. Creating one...SUCCESS

Successfully created runtime with uid: de847f71-5389-411e-b671-a5454e86a221


In [None]:
function_details

Second, deploy the `Python Function`.

In [13]:
function_id = function_details['metadata']['guid']
function_deployment_details = client.deployments.create(artifact_uid=function_id, name='Web scraping python function deployment')



#######################################################################################

Synchronous deployment creation for uid: '9a6ac643-68d5-41ac-93c6-13ee690bb698' started

#######################################################################################


INITIALIZING
DEPLOY_IN_PROGRESS...
DEPLOY_SUCCESS


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='2ff7f9af-163f-4283-b815-af729d05e79a'
------------------------------------------------------------------------------------------------




You can check the deployment details by running the following cell.

In [None]:
function_deployment_details

Check that the deployment was created successfully by listing the deployments.

In [None]:
client.deployments.list()

### 2.3 Score data <a id="score"></a>

In this subsection, you will learn how to score the deployed `Python Function` with a test data record.

First, get the scoring endpoint URL of the online deployment.

In [16]:
function_deployment_endpoint_url = client.deployments.get_scoring_url(function_deployment_details)

The following is the record that will be used for scoring.

In [17]:
payload = {
    'values': [ 
        'https://www.ibm.com/cloud/machine-learning' 
    ] 
}
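Because `score()` opens every entry in `values` as a URL, it can help to validate the list before sending it. The sketch below is one way to do that with `urllib.parse`. The helper `make_payload` is a hypothetical name, and the check only covers the URL shape, not whether the page is reachable.

```python
from urllib.parse import urlparse


def make_payload(urls):
    """Wrap URLs in the {'values': [...]} shape that score() reads."""
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ('http', 'https') or not parts.netloc:
            raise ValueError('Not an absolute http(s) URL: ' + repr(url))
    return {'values': list(urls)}


print(make_payload(['https://www.ibm.com/cloud/machine-learning']))
```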

In [18]:
client.deployments.score(function_deployment_endpoint_url, payload)['tokens'][:10]

['accelerate',
 'accelerating',
 'access',
 'accuracy',
 'across',
 'actively',
 'adapts',
 'advantage',
 'ai',
 'algorithms']
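The `score()` function defined in this notebook returns `{'error': ...}` instead of raising when something goes wrong, so indexing the result with `['tokens']` directly fails with a `KeyError` in that case. A small guard, sketched below with the hypothetical helper `tokens_or_raise`, surfaces the original error message instead.

```python
def tokens_or_raise(result):
    """Return the token list, or raise if score() reported an error."""
    if 'error' in result:
        raise RuntimeError('Scoring failed: ' + result['error'])
    return result['tokens']


print(tokens_or_raise({'tokens': ['accelerate', 'access']}))
```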

## 3. Summary and next steps <a id="summary"></a>

You successfully completed this notebook! 
 
You learned how to define a `Python Function`. Also, you learned how to save, deploy, and score the `Python Function` in the Watson Machine Learning (WML) repository. 

In the next step, the `score()` function will, in addition to tokenizing, call a classification model trained on the `SMS spam` data set to perform scoring.
 
Check out our <a href="https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html" target="_blank" rel="noopener noreferrer">Online Documentation</a> for more samples, tutorials, documentation, how-tos, and blog posts.

### Citation

Almeida, T. A. and Hidalgo, J. M. G. (2012), <a href="http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection" target="_blank" rel="noopener noreferrer">UCI Machine Learning Repository</a>, Irvine, CA.

### Author

**Jihyoung Kim**, Ph.D., is a Data Scientist at IBM who strives to make data science easy for everyone through Watson Studio.

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.
