# Spam-N-Bayes 
**An Introduction to the Watson Machine Learning Python Client** <br> <br>
This tutorial shows you how to build and deploy an SMS spam classifier with Watson Machine Learning and IBM Data Science Experience. <br> We'll use the new [Watson Machine Learning API Client for Python](http://wml-api-pyclient.mybluemix.net/), which is available on `PyPI`. 
______
This notebook runs on `Python 3.5` with any `Spark` version. 
This notebook can be used as a companion to another [tutorial on our blog](https://medium.com/@adammassachi/dsx-hybrid-mode-91b580450c5b).  <br>
[Get the data](https://apsportal.ibm.com/exchange/public/entry/view/e39fb7848165baca7fc0395025ba4e48). 


## Table of Contents
1. [Load data](#load)
2. [Build model](#build)
3. [Save and deploy](#save)
4. [Make API requests](#api)

_____

## 1. Load data <a id="load"></a>
First, install the Watson Machine Learning library via `pip` if you have not installed it yet. <br> We'll use this library to communicate with Watson Machine Learning. The Python client allows anyone with a Watson Machine Learning instance to programmatically save, load, and deploy models, among other tasks. 

In [3]:
!pip install watson-machine-learning-client

The data are SMS messages that have been labeled `spam` or `ham`.

In [46]:
# you'll need to get the data and add it to your DSX project or local environment 

from io import StringIO
import requests
import json
import pandas as pd

# @hidden_cell
# This function accesses a file in your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def get_object_storage_file_with_credentials_051057ada6f24724b4e68c3152e92f7e(container, filename):
    """This function returns a StringIO object containing
    the file content from Bluemix Object Storage."""

    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': 'member_32ee99828223dae28f6f4ef55bb69120ea9f3d52','domain': {'id': '04144f60bfe64c0d9fdfb632799cf206'},
            'password': 'A.2-e~Zx6t)9O~BT'}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == 'dallas':
                    url2 = ''.join([e2['url'], '/', container, '/', filename])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO(resp2.text)
df = pd.read_csv(get_object_storage_file_with_credentials_051057ada6f24724b4e68c3152e92f7e('hybridDemos', 'spam.csv'))

Our first step is to convert the string label into a numeric representation. <br> 
We can use the `pandas.Series` method `factorize()` for this. It returns a tuple of integer codes and unique values, so we keep element `[0]`, the codes.

In [99]:
df = df[df.columns[:2]]
df.columns = ['ham', 'text']
df['label'] = df.ham.factorize()[0]
df['text'] = df.text.apply(lambda x: x.lower())

In [100]:
df.head()

| | ham | text | label |
| --- | --- | --- | --- |
| 0 | ham | go until jurong point, crazy.. available only ... | 0 |
| 1 | ham | ok lar... joking wif u oni... | 0 |
| 2 | spam | free entry in 2 a wkly comp to win fa cup fina... | 1 |
| 3 | ham | u dun say so early hor... u c already then say... | 0 |
| 4 | ham | nah i don't think he goes to usf, he lives aro... | 0 |
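
To see what `factorize()` returns, here is a small sketch on a toy `Series`. The four labels are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# toy labels, invented for illustration
labels = pd.Series(["ham", "ham", "spam", "ham"])

# factorize() returns (codes, uniques); codes are assigned in order of first appearance
codes, uniques = labels.factorize()

print(codes.tolist())    # [0, 0, 1, 0]
print(uniques.tolist())  # ['ham', 'spam']
```

Because `ham` appears first in the data, it maps to `0` and `spam` maps to `1`, which matches the `label` column shown above.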


<a id="build"></a>
## 2. Build a model 
We're going to use `scikit-learn` to create a `Naive Bayes` model. <br>We’ll use the `HashingVectorizer`, which converts the text of each SMS into a matrix representation suitable for modeling.

In [101]:
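from sklearn.feature_extraction.text import HashingVectorizer
# note: `non_negative` was removed in scikit-learn 0.21; newer releases use `alternate_sign=False` instead
vectorizer = HashingVectorizer(n_features=5000, stop_words='english', non_negative=True)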
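
As a rough illustration of what the vectorizer produces, the sketch below transforms two made-up messages. It assumes a recent scikit-learn release, where `alternate_sign=False` stands in for the `non_negative=True` argument; the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# assumption: a scikit-learn release that accepts `alternate_sign=False`
demo_vectorizer = HashingVectorizer(n_features=5000, stop_words='english', alternate_sign=False)

# two made-up messages
demo_messages = ["free entry to win a prize", "see you at lunch tomorrow"]

# HashingVectorizer is stateless, so transform() works without a prior fit
demo_matrix = demo_vectorizer.transform(demo_messages)

print(demo_matrix.shape)       # (2, 5000): one row per message, one column per hash bucket
print(demo_matrix.min() >= 0)  # True: no negative values, which MultinomialNB requires
```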

We need to connect the output of the `vectorizer` to the input of the model. We’ll use `Multinomial Naive Bayes`, a Naive Bayes classifier that works well with the representation of our features: non-negative word-frequency values (normalized by the `HashingVectorizer` default settings, so not raw integer counts).
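
For reference, the decision rule behind Multinomial Naive Bayes can be written as below. This is the standard textbook formulation with additive smoothing, not something taken from this notebook. The symbols are introduced here: $x_i$ is the value of feature $i$ for the message being scored, $N_{c,i}$ is the total of feature $i$ over the training messages of class $c$, $N_c = \sum_i N_{c,i}$, $n$ is the number of features, and $\alpha$ is the smoothing parameter.

$$
\hat{y} = \arg\max_{c}\left[\log P(c) + \sum_{i=1}^{n} x_i \log \theta_{c,i}\right],
\qquad
\theta_{c,i} = \frac{N_{c,i} + \alpha}{N_c + \alpha n}
$$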


Next, we’ll use `train_test_split` in order to divide the data into `testing` and `training` sets so that we can evaluate the performance of the model.

In [102]:
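# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides the same function
from sklearn.model_selection import train_test_split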
x_train, x_test, y_train, y_test = train_test_split(df['text'], df['label'], random_state=0)
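
The sketch below shows what the split does by default, using made-up data. It imports from `sklearn.model_selection`, which is where newer scikit-learn releases keep this function; the variable names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# 100 made-up samples and labels
samples = list(range(100))
targets = [i % 2 for i in samples]

# with no test_size given, 25% of the rows are held out for testing
a_train, a_test, b_train, b_test = train_test_split(samples, targets, random_state=0)
print(len(a_train), len(a_test))  # 75 25

# the same random_state reproduces the same split
a_train2, a_test2, _, _ = train_test_split(samples, targets, random_state=0)
print(a_test == a_test2)  # True
```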

Next, we need to transform the text and fit the model.

In [103]:
# first transform the text data
transformed_x = vectorizer.fit_transform(x_train)

# import the modules and fit
from sklearn.naive_bayes import MultinomialNB
bn = MultinomialNB().fit(transformed_x, y_train)

We’ve got a fitted model in `bn`. Let’s evaluate its performance on the test data after creating the pipeline.

In [104]:
# make a pipe
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(vectorizer, bn)

The pipe will sequentially transform the data according to the transformers specified, terminating in what scikit-learn calls an estimator. <br>Then, we can call predict or score, and so on.

In [105]:
pipe.predict_proba(["URGENT! You have built a model - scroll down to see more"])

array([[ 0.80165557,  0.19834443]])
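
A pipeline can also be fitted in one call, instead of transforming the text and fitting the model separately. The sketch below shows this on a tiny made-up dataset. It assumes a recent scikit-learn release (hence `alternate_sign=False`), and all names are hypothetical.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny made-up training set: 1 = spam, 0 = ham
toy_texts = ["win a free prize now", "free cash offer", "see you at lunch", "call me later tonight"]
toy_labels = [1, 1, 0, 0]

toy_pipe = make_pipeline(
    HashingVectorizer(n_features=5000, stop_words='english', alternate_sign=False),
    MultinomialNB(),
)

# fit() transforms the text with the vectorizer, then fits the classifier on the result
toy_pipe.fit(toy_texts, toy_labels)

# make_pipeline names each step after its lowercased class name
print([name for name, _ in toy_pipe.steps])  # ['hashingvectorizer', 'multinomialnb']

# predict_proba returns one row per message and one column per class
print(toy_pipe.predict_proba(["free prize"]).shape)  # (1, 2)
```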

Let's score the pipeline on the test set.

In [106]:
pipe.score(x_test, y_test)

0.95405599425699927

`95% accuracy`, not bad. You can experiment with different numbers of features and vectorizers for your model. You can also create other features that are not captured by the vectorizer, such as the length of the message. 
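
One possible way to add a message-length feature is to append it as an extra column to the hashed matrix. The sketch below is an illustration only: the divisor of 160 is an assumed scale chosen to keep the length comparable to the normalized hashed values, and all names are hypothetical.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import HashingVectorizer

# assumption: a scikit-learn release that accepts `alternate_sign=False`
len_vectorizer = HashingVectorizer(n_features=5000, stop_words='english', alternate_sign=False)
sample_texts = ["free entry to win a prize", "ok"]

hashed = len_vectorizer.transform(sample_texts)

# hypothetical extra feature: message length divided by 160 (an assumed scale)
lengths = csr_matrix([[len(t) / 160.0] for t in sample_texts])

# stack the extra column to the right of the hashed features
combined = hstack([hashed, lengths]).tocsr()
print(combined.shape)  # (2, 5001)
```

A model trained on `combined` would need the same extra column at prediction time.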

<a id="save"></a>
## 3. Save and deploy
Use the client to save your model to the WML Repository. From there, you can load and deploy models as well. 

In [111]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

wml_credentials = {
  "url": "https://ibm-watson-ml.mybluemix.net",
 "access_key": "p4och04zC8JgoPyT4CZyrPArLm8tT01vAKuwRfBUuBhy0Ex14vUTrK3u85i94r5YHxGxQ3pIogjgEOjN0TGDTcL0h32gVzPkwMbmHXNpi+FQYUqQmv73SQJrb1WXWeZv",
  "username": "a90b3ea8-1bbb-48b8-b8e4-d71fb8902b5d",
  "password": "ebeafa46-9b6d-4beb-9070-f42d7feb1286",
  "instance_id": "0b917078-2507-4590-b152-9904dfdff9d9"
}

client = WatsonMachineLearningAPIClient(wml_credentials)
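
Since credentials written into a notebook are exposed when it is shared, one option is to read them from environment variables. This is a minimal sketch; the `WML_*` variable names and the helper function are hypothetical, not part of the client library.

```python
import os

def wml_credentials_from_env(env=os.environ):
    """Build the credentials dict from environment variables (hypothetical WML_* names)."""
    keys = ["url", "access_key", "username", "password", "instance_id"]
    # collect every key whose environment variable is not set
    missing = [k for k in keys if "WML_" + k.upper() not in env]
    if missing:
        raise KeyError("missing environment variables for: " + ", ".join(missing))
    return {k: env["WML_" + k.upper()] for k in keys}
```

With the variables set, `wml_credentials = wml_credentials_from_env()` would produce a dict with the same keys as the one above.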

In [112]:
# publish model 
client.repository.publish(pipe, name="Spam-N-Bayes-py3")

<watson_machine_learning_client.libs.repository.mlrepositoryclient.model_adapter.ScikitModelArtifact at 0x7fd605623da0>

Now that we've published the model to the repository, we can load it into a Python object using its `guid`. First, let's look at the repository details.

In [113]:
# get my repository details
deets = client.repository.get_details()

# go to the model resources, find the first model, go to the metadata, get the `guid`
model_id = deets['resources'][0]['metadata']['guid']

In [128]:
deets['resources'][0]['entity']['deployments']

{'count': 0,
 'url': 'https://ibm-watson-ml.mybluemix.net/v3/wml_instances/0b917078-2507-4590-b152-9904dfdff9d9/published_models/23de8978-33a1-4d23-a3e4-e010c8fdb656/deployments'}

In [114]:
mod = client.repository.load(model_id)
type(mod)

sklearn.pipeline.Pipeline

In [115]:
deets['resources'][0]['metadata']

{'created_at': '2017-10-11T16:48:40.576Z',
 'guid': '23de8978-33a1-4d23-a3e4-e010c8fdb656',
 'modified_at': '2017-10-11T16:48:40.837Z',
 'url': 'https://ibm-watson-ml.mybluemix.net/v3/wml_instances/0b917078-2507-4590-b152-9904dfdff9d9/published_models/23de8978-33a1-4d23-a3e4-e010c8fdb656'}
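
Indexing `resources[0]` assumes the first entry is the model you want. If the repository holds several models, something like the following could pick the newest one instead. It is a sketch that assumes the details dict has the shape shown in the outputs above; the mock data and helper name are hypothetical.

```python
# mock of the structure shown above, trimmed to the fields used here
mock_details = {
    "resources": [
        {
            "metadata": {"guid": "23de8978-33a1-4d23-a3e4-e010c8fdb656",
                         "created_at": "2017-10-11T16:48:40.576Z"},
            "entity": {"deployments": {"count": 0}},
        }
    ]
}

def newest_model_guid(details):
    """Return the guid of the most recently created model in a details dict."""
    resources = details["resources"]
    if not resources:
        raise ValueError("no published models found")
    # timestamps in this ISO-8601 format sort correctly as plain strings
    newest = max(resources, key=lambda r: r["metadata"]["created_at"])
    return newest["metadata"]["guid"]

print(newest_model_guid(mock_details))  # 23de8978-33a1-4d23-a3e4-e010c8fdb656
```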

In [117]:
# check that it's the same
mod.predict_proba(["URGENT! You have built a model - scroll down to see more"])

array([[ 0.80165557,  0.19834443]])

<a id="api"></a>
## 4. Test the API
Now that we've successfully saved and loaded the model, we can create an API endpoint and make requests. 

In [118]:
scoring_endpoint = client.deployments.create(model_id, name="SPAMO", description="Spam model deployed from notebook")

Let's create some JSON to send. We'll use `client.deployments.score(scoring_url, payload)`. [Read our docs](http://wml-api-pyclient.mybluemix.net/) for more details. 

In [122]:
# create a payload
my_sms = "Send me to the API por favor"
payload = {"values": [my_sms]}

# make a request
response = client.deployments.score(scoring_endpoint, payload)

Let's check out the response.

In [123]:
response

{'fields': ['prediction', 'probability'],
 'values': [[0, [0.8164689448418778, 0.1835310551581235]]]}
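
The response pairs a list of field names with a list of value rows. A small sketch of turning it into something easier to read follows; the response literal is copied from the output above, and the label mapping follows the factorized labels from section 1 (`0` for ham, `1` for spam).

```python
# the response shown above, copied into a literal
sample_response = {
    "fields": ["prediction", "probability"],
    "values": [[0, [0.8164689448418778, 0.1835310551581235]]],
}

# pair each field name with its value, one dict per scored message
rows = [dict(zip(sample_response["fields"], row)) for row in sample_response["values"]]

# 0 is ham and 1 is spam, following the factorized labels
label_names = {0: "ham", 1: "spam"}
first = rows[0]
print(label_names[first["prediction"]])     # ham
print(round(max(first["probability"]), 3))  # 0.816
```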

For more background, see the [companion blog post](https://medium.com/@adammassachi/dsx-hybrid-mode-91b580450c5b).

____________

### Author
Adam Massachi is a Data Scientist with the Data Science Experience and Watson Data Platform teams at IBM. Before IBM, he worked on political campaigns, building and managing large volunteer operations and organizing campaign finance initiatives. Say hello [@adammassach](https://twitter.com/adammassach?lang=en)!

Copyright © IBM Corp. 2017. This notebook and its source code are released under the terms of the MIT License.