## Ensure you have the watson-machine-learning-client version that you need

In [1]:
# !pip uninstall --yes watson-machine-learning-client-V4
# !pip install watson-machine-learning-client-V4==1.0.112
!pip freeze | grep watson-machine-learning-client


watson-machine-learning-client-V4==1.0.112


# Clustering data using scikit-learn

Clustering algorithms automatically group multidimensional data into clusters of similar items.

In this notebook, we'll use scikit-learn to predict clusters. 
scikit-learn provides implementations of many clustering algorithms.
We'll use **k-means** clustering to create clusters based on a shopping cart dataset.
Using that model, we can take any shopping cart and determine which cluster it fits best.

Once we've predicted a cluster, we'll use the most popular products in that cluster to
recommend additional purchases.
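
As a rough illustration of the idea before we work with the real data, here is a minimal sketch on a toy dataset. The four product columns and the six carts are made up for this example, and `random_state` and `n_init` are set only so the run is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy carts (hypothetical). Columns: [diapers, wipes, beer, chips]
toy_carts = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])

# n_init is passed explicitly because its default differs between
# scikit-learn versions; random_state makes the result repeatable.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_model.fit_predict(toy_carts)

# A new cart that contains only wipes.
new_cart = np.array([[0, 1, 0, 0]])
new_label = toy_model.predict(new_cart)[0]

print("training labels:", toy_labels)
print("cluster for new cart:", new_label)
```

The cluster numbers themselves are arbitrary. What matters is that the wipes-only cart is assigned to the same cluster as the carts that contain baby products.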


## Setup

### Install Python modules

> NOTE!  Some pip installs require a kernel restart.

The shell command `pip install` is used to install Python modules. Some installs require a kernel restart to complete.
To avoid confusing errors, run the following cell once and then use the **Kernel** menu to restart the kernel before proceeding.


In [2]:
# The Watson Studio Python kernel should already have the scikit-learn module we need.

!pip freeze | grep scikit-learn

# CPD 3.0.1 has scikit-learn==0.22.1

scikit-learn==0.22.1


## Imports

Import the Python modules that we need in the rest of the notebook.

In [3]:
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans


## Load the shopping cart data for training the model

Run the cell below to load the shopping cart training data from a CSV file into a pandas DataFrame.

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/IBM/product-recommendation-with-watson-ml/master/data/customers_orders1_opt.csv")

## Prepare the cart data

Keep only the columns that hold product category values. The `keep_columns` list of labels is reused later to label the cluster centers and the recommendations.


In [5]:
keep_columns = ['Baby Food','Diapers','Formula','Lotion','Baby wash','Wipes','Fresh Fruits','Fresh Vegetables','Beer','Wine','Club Soda','Sports Drink','Chips','Popcorn','Oatmeal','Medicines','Canned Foods','Cigarettes','Cheese','Cleaning Products','Condiments','Frozen Foods','Kitchen Items','Meat','Office Supplies','Personal Care','Pet Supplies','Sea Food','Spices']
df_carts = df[keep_columns]
df_carts.head()

Unnamed: 0,Baby Food,Diapers,Formula,Lotion,Baby wash,Wipes,Fresh Fruits,Fresh Vegetables,Beer,Wine,...,Cleaning Products,Condiments,Frozen Foods,Kitchen Items,Meat,Office Supplies,Personal Care,Pet Supplies,Sea Food,Spices
0,0,0,1,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Train a k-means model that will put the carts into 10 clusters and show the centers

In [6]:
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters)
predicted = kmeans.fit_predict(df_carts.values)
centers = kmeans.cluster_centers_
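
The notebook fixes the number of clusters at 10. One common way to sanity-check such a choice is to compare the model's `inertia_` (the sum of squared distances from samples to their closest center) for several values of `n_clusters`. The sketch below does this on randomly generated binary data, which is a stand-in and not the shopping cart dataset. The candidate values of k are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic binary "carts" (assumption: stand-in data, not the real dataset).
rng = np.random.default_rng(0)
synthetic_carts = rng.integers(0, 2, size=(200, 6)).astype(float)

inertias = {}
for k in (2, 4, 6, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(synthetic_carts)
    # inertia_ is the within-cluster sum of squared distances.
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print("k={}: inertia={:.1f}".format(k, value))
```

Inertia generally falls as k grows, so the value alone does not pick k. It is usually read by looking for the point where adding clusters stops helping much.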

In [11]:
# print(centers) but with nicer number formatting
print("CLUSTER CENTERS...")
print("Number of clusters: ", n_clusters)
print("Number of products: ", len(keep_columns))
print(keep_columns)
for center in centers:
    print('[ ', end='')
    for i in center:
        print("{:.2f}".format(abs(i)), end=', ')
    print(']')


CLUSTER CENTERS...
Number of clusters:  10
Number of products:  29
['Baby Food', 'Diapers', 'Formula', 'Lotion', 'Baby wash', 'Wipes', 'Fresh Fruits', 'Fresh Vegetables', 'Beer', 'Wine', 'Club Soda', 'Sports Drink', 'Chips', 'Popcorn', 'Oatmeal', 'Medicines', 'Canned Foods', 'Cigarettes', 'Cheese', 'Cleaning Products', 'Condiments', 'Frozen Foods', 'Kitchen Items', 'Meat', 'Office Supplies', 'Personal Care', 'Pet Supplies', 'Sea Food', 'Spices']
[ 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.99, 0.00, 0.97, 0.74, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ]
[ 0.09, 0.01, 0.11, 0.09, 0.09, 0.10, 0.09, 0.10, 0.00, 0.00, 0.38, 0.11, 0.00, 0.00, 0.09, 0.00, 0.09, 0.00, 0.00, 0.01, 0.16, 0.00, 0.10, 0.00, 0.09, 0.00, 0.00, 0.00, 0.00, ]
[ 0.02, 0.02, 0.02, 0.02, 0.02, 0.03, 0.02, 0.03, 0.00, 0.00, 0.37, 0.02, 0.00, 1.00, 0.04, 0.00, 0.00, 0.00, 0.00, 0.03, 0.03, 0.00, 0.03, 0.00, 0.02, 0.00, 0.00, 0.00, 0.00, ]
[ 0.11, 0.24, 0.
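
Rows of unlabeled numbers are hard to read against a 29-item product list. As one possible alternative, the sketch below puts cluster centers into a pandas DataFrame with the product names as column labels. The product names and center values here are a small hypothetical stand-in for `keep_columns` and `kmeans.cluster_centers_`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `keep_columns` and `kmeans.cluster_centers_`.
product_names = ["Club Soda", "Chips", "Popcorn", "Wine"]
example_centers = np.array([
    [0.99, 0.97, 0.74, 0.00],
    [0.38, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.27],
])

# abs() hides the sign of tiny negative floating-point values such as -3e-15.
centers_df = pd.DataFrame(example_centers, columns=product_names).abs().round(2)
centers_df.index.name = "cluster"
print(centers_df)

# For each cluster, list the two products with the highest center values.
top_products = {
    cluster: row.nlargest(2).index.tolist()
    for cluster, row in centers_df.iterrows()
}
print(top_products)
```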

In [15]:
# Test the model
# Provide a shopping cart and see which cluster the model predicts for it.
# test_cart2 uses 0.5 instead of 0 for items not in the cart, letting the model decide whether to lean closer to buy or not-buy.
test_cart1 = [1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
test_cart2 = [1,0.5,1,1,1,1,0.5,1,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5]
print(test_cart1)
print(test_cart2)
test_carts = [ test_cart1, test_cart2]

[1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0.5, 1, 1, 1, 1, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]


In [16]:

predicted_cluster = kmeans.predict(test_carts)
print(predicted_cluster)
print(centers[predicted_cluster])

[1 7]
[[ 8.83550489e-02  1.14006515e-02  1.06677524e-01  8.99837134e-02
   9.36482085e-02  9.60912052e-02  9.28338762e-02  9.89413681e-02
   2.08166817e-16  8.02136135e-15  3.77850163e-01  1.06677524e-01
   1.44328993e-15  1.72084569e-15  9.44625407e-02 -3.02535774e-15
   9.36482085e-02 -3.99680289e-15 -5.46784840e-15  1.26221498e-02
   1.62866450e-01  3.41393580e-15  9.64983713e-02 -2.11636264e-15
   9.36482085e-02 -6.35602682e-15  1.36696210e-15  2.55351296e-15
  -7.43849426e-15]
 [ 6.79347826e-02  1.18659420e-01  6.97463768e-02  5.79710145e-02
   5.88768116e-02  7.60869565e-02  7.42753623e-02  6.43115942e-02
   6.06884058e-02  2.69021739e-01  1.49880108e-15  7.06521739e-02
   4.21884749e-15  1.55431223e-15  4.80072464e-02  6.21376812e-01
   7.60869565e-02  6.28623188e-01  6.25000000e-01  1.18659420e-01
   1.43115942e-01  7.80797101e-01  6.25000000e-02  4.25724638e-02
   6.61231884e-02  6.34963768e-01  5.97826087e-02  4.99600361e-16
   4.97282609e-01]]
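
For a fitted k-means model, `predict` assigns each sample to its closest cluster center. The sketch below reproduces that rule with NumPy on a tiny made-up example, using Euclidean distance. The function name `nearest_center` and the example values are hypothetical.

```python
import numpy as np

def nearest_center(carts, centers):
    """Return, for each cart, the index of the closest center (Euclidean)."""
    carts = np.asarray(carts, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Broadcasting yields an array of shape (n_carts, n_centers, n_products).
    diffs = carts[:, np.newaxis, :] - centers[np.newaxis, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return distances.argmin(axis=1)

# Made-up centers and carts with three products each.
example_centers = [[0.9, 0.8, 0.0], [0.0, 0.1, 0.9]]
example_carts = [[1, 1, 0], [0, 0, 1], [0.5, 0.5, 0.5]]
print(nearest_center(example_carts, example_centers))
```

This is why a cart filled with 0.5 values can land in a different cluster than the same cart filled with zeros: the neutral values move it closer to centers with many mid-range entries.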


In [18]:
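# Print the centers of the predicted clusters.
# Note: the loop reuses the name `center`, so after the loop it holds the
# center for the last cart (test_cart2), not the first.
for center in centers[predicted_cluster]: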
    print('[ ', end='')
    for i in center:
        print("{:.2f}".format(abs(i)), end=', ')
    print(']')

[ 0.09, 0.01, 0.11, 0.09, 0.09, 0.10, 0.09, 0.10, 0.00, 0.00, 0.38, 0.11, 0.00, 0.00, 0.09, 0.00, 0.09, 0.00, 0.00, 0.01, 0.16, 0.00, 0.10, 0.00, 0.09, 0.00, 0.00, 0.00, 0.00, ]
[ 0.07, 0.12, 0.07, 0.06, 0.06, 0.08, 0.07, 0.06, 0.06, 0.27, 0.00, 0.07, 0.00, 0.00, 0.05, 0.62, 0.08, 0.63, 0.62, 0.12, 0.14, 0.78, 0.06, 0.04, 0.07, 0.63, 0.06, 0.00, 0.50, ]


In [20]:
# Use the selected cluster center to suggest additional products.
# The output below is for test_cart2 (index 1) and the center of its predicted cluster.
cart_index = 1
cart = test_carts[cart_index]
center = centers[predicted_cluster[cart_index]]
threshold = 0.5

for i, prod in enumerate(keep_columns):
    if cart[i] > threshold:
        print("{:.2f} already in cart:".format(center[i]), prod)

for i, prod in enumerate(keep_columns):
    if cart[i] <= threshold and center[i] > threshold:
        print("{:.2f} product to recommend: ".format(center[i]), prod)

for i, prod in enumerate(keep_columns):
    if cart[i] <= threshold and center[i] <= threshold:
        print("{:.2f} other product: ".format(center[i]), prod)


0.07 already in cart: Baby Food
0.07 already in cart: Formula
0.06 already in cart: Lotion
0.06 already in cart: Baby wash
0.08 already in cart: Wipes
0.06 already in cart: Fresh Vegetables
0.62 product to recommend:  Medicines
0.63 product to recommend:  Cigarettes
0.62 product to recommend:  Cheese
0.78 product to recommend:  Frozen Foods
0.63 product to recommend:  Personal Care
0.12 other product:  Diapers
0.07 other product:  Fresh Fruits
0.06 other product:  Beer
0.27 other product:  Wine
0.00 other product:  Club Soda
0.07 other product:  Sports Drink
0.00 other product:  Chips
0.00 other product:  Popcorn
0.05 other product:  Oatmeal
0.08 other product:  Canned Foods
0.12 other product:  Cleaning Products
0.14 other product:  Condiments
0.06 other product:  Kitchen Items
0.04 other product:  Meat
0.07 other product:  Office Supplies
0.06 other product:  Pet Supplies
0.00 other product:  Sea Food
0.50 other product:  Spices
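
The same thresholding logic can be wrapped in a small function that returns ranked recommendations instead of printing them. This is only a sketch: the function name `recommend` is hypothetical, and the example uses a handful of values copied from the output above.

```python
def recommend(cart, center, product_names, threshold=0.5):
    """Return (product, score) pairs not in the cart, highest score first."""
    candidates = [
        (name, score)
        for name, in_cart, score in zip(product_names, cart, center)
        # Skip items already in the cart and items the cluster rarely buys.
        if in_cart <= threshold and score > threshold
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# A few products and center values taken from the printed output.
names = ["Baby Food", "Medicines", "Frozen Foods", "Wine", "Spices"]
cart = [1, 0, 0, 0, 0]
center_values = [0.07, 0.62, 0.78, 0.27, 0.50]
print(recommend(cart, center_values, names))
```

Sorting by center value puts the products bought most often in that cluster first, which is a natural order for showing suggestions.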


In [21]:
# @hidden_cell
import os

token = os.environ['USER_ACCESS_TOKEN']

wml_credentials = {
"token": token,
"instance_id" : "openshift",
"url": "https://zen-cpd-zen.apps.xxxxxxxxpak8.ibmcodetest.us",  # Provide your CPD URL here
"version": "3.0.1"
}
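
If `USER_ACCESS_TOKEN` is not set, the cell above fails with a bare `KeyError`. A possible variation is to build the dictionary in a helper that fails with a clearer message. The helper name `build_wml_credentials` and the example URL are hypothetical. The dictionary keys are the ones used above.

```python
import os

def build_wml_credentials(url, version="3.0.1", env=None):
    """Build the credentials dict, failing early if the token is missing."""
    env = os.environ if env is None else env
    token = env.get("USER_ACCESS_TOKEN")
    if not token:
        raise RuntimeError(
            "USER_ACCESS_TOKEN is not set; run this notebook inside Watson Studio "
            "or export the variable first."
        )
    return {
        "token": token,
        "instance_id": "openshift",
        "url": url,
        "version": version,
    }

# Example with a fake environment and a placeholder URL (both assumptions).
example_credentials = build_wml_credentials(
    "https://cpd.example.com", env={"USER_ACCESS_TOKEN": "fake-token"}
)
print(sorted(example_credentials))
```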


In [22]:
# To store the trained model, first create a deployment space and set it as the default.

from watson_machine_learning_client import WatsonMachineLearningAPIClient
wml_client = WatsonMachineLearningAPIClient(wml_credentials)

In [23]:
# Set your deployment space name and model name

# MODEL_NAME = "Shopping Cart Affinity Model"
# DEPLOYMENT_SPACE_NAME = "shopping_ml_deployment_space"
MODEL_NAME = "testmodel0825"
DEPLOYMENT_SPACE_NAME = "testdepspace0825"

In [24]:

metadata = {
 wml_client.spaces.ConfigurationMetaNames.NAME: DEPLOYMENT_SPACE_NAME,
 wml_client.spaces.ConfigurationMetaNames.DESCRIPTION: 'Deployment space created from notebook for shopping cart model'
}
space_details = wml_client.spaces.store(meta_props=metadata)

space_uid = wml_client.spaces.get_uid(space_details)

In [25]:
wml_client.set.default_space(space_uid)

'SUCCESS'

In [26]:
print(space_uid)

f5dfdade-b422-4488-b459-490fa1adfbaa


In [27]:
wml_client.spaces.list()

------------------------------------  ----------------------------------------  ------------------------
GUID                                  NAME                                      CREATED
f5dfdade-b422-4488-b459-490fa1adfbaa  testdepspace0825                          2020-08-25T18:02:23.959Z
2f0d8a0e-124b-4904-bb31-1a2f38b9c64d  shopping_ml_deployment_space              2020-08-25T02:18:36.466Z
9bac80bb-ee30-4ac4-9c34-8de8732e117b  shopping_ml_deployment_space              2020-08-25T01:58:10.188Z
ff049a39-bb3c-4c05-97a1-591f57ed7879  shopping_ml_deployment_space              2020-08-25T01:50:47.588Z
a39d1834-576b-4861-8d94-b167a2011506  shopping_ml_deployment_space              2020-08-25T01:39:31.515Z
b68bc017-4e55-4d3b-a53a-9ef554b3e76d  shopping_ml_deployment_space              2020-08-25T01:37:07.951Z
eafb91e0-2eaf-4fc6-afae-407faa18fcb4  shopping_ml_deployment_space              2020-08-25T01:17:36.763Z
65b77eec-913a-4e5b-b1ce-348eb2545659  Deployment Space for Shopping ML P

In [28]:
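from sklearn.pipeline import Pipeline
import pickle

# Note: KMeans() here uses scikit-learn's default n_clusters=8, not the 10 clusters trained above.
pipeline_org = Pipeline( steps = [ ( "classifier", KMeans() ) ] )
# k-means is unsupervised, so fit() takes only the feature data (no target).
pipeline_org.fit( df_carts )
with open( "kmeans-prediction-model.pkl", 'wb' ) as model_file:
    pickle.dump( pipeline_org, model_file )

!mkdir -p model-dir
!cp kmeans-prediction-model.pkl model-dir
!tar -zcvf kmeans-prediction-model.tar.gz kmeans-prediction-model.pkl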

kmeans-prediction-model.pkl
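
The archive can also be built and inspected from Python with the standard-library `tarfile` module, which avoids depending on shell commands. The sketch below is an alternative to the `!tar` line, not what the notebook ran. It uses a placeholder file in a temporary directory rather than the real pickle, and the function name `make_model_archive` is hypothetical.

```python
import os
import tarfile
import tempfile

def make_model_archive(model_path, archive_path):
    """Write a gzip-compressed tar containing only the model file."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores just the file name, without the directory part.
        archive.add(model_path, arcname=os.path.basename(model_path))
    # Re-open the archive to list what was actually stored.
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()

# Demonstration with a placeholder file (not a real model).
with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "kmeans-prediction-model.pkl")
    with open(model_path, "wb") as handle:
        handle.write(b"placeholder bytes, not a real model")
    archive_path = os.path.join(workdir, "kmeans-prediction-model.tar.gz")
    archive_members = make_model_archive(model_path, archive_path)

print(archive_members)
```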


In [29]:
input_schema = [{
    'id': 'testid',
    'type': 'struct',
    'fields': [
        {
            'name': 'input_cart',
            'type': 'array',
            'nullable': False
        }
    ]
}]

model_def_meta_props = {
     wml_client.model_definitions.ConfigurationMetaNames.NAME: 'Shopping_Cart_Cluster_Model_definition',
     wml_client.model_definitions.ConfigurationMetaNames.VERSION: '1.0',
     wml_client.model_definitions.ConfigurationMetaNames.PLATFORM: {'name': 'python',  'versions': ['3.6']}
 }
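
The input schema declares a single non-nullable array field, `input_cart`. Before sending a cart to a model, it can help to check it against that expectation on the client side. The sketch below is one way to do so. The function name `validate_cart` is hypothetical, and the 0-to-1 value range is an assumption based on the test carts used earlier, not something the schema enforces.

```python
def validate_cart(cart, n_products):
    """Return the cart as a list of floats, or raise ValueError if malformed."""
    if cart is None:
        raise ValueError("input_cart must not be null")
    values = [float(v) for v in cart]
    if len(values) != n_products:
        raise ValueError(
            "expected {} values, got {}".format(n_products, len(values))
        )
    # Assumption: cart entries are 0, 1, or a neutral value in between.
    if any(v < 0.0 or v > 1.0 for v in values):
        raise ValueError("cart values are expected to lie between 0 and 1")
    return values

print(validate_cart([1, 0, 0.5], n_products=3))
```

In the notebook, `n_products` would be `len(keep_columns)`, so a cart with a missing or extra entry is caught before it reaches the model.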

In [30]:
model_def_details = wml_client.model_definitions.store(
     model_definition='kmeans-prediction-model.tar.gz',
     meta_props=model_def_meta_props
)

model_def_id = wml_client.model_definitions.get_uid(model_def_details)

In [31]:
print(model_def_id)

83d3ae3e-26b0-4ca9-9069-04fcc65875df


In [32]:
wml_client.software_specifications.list()

--------------------------  ------------------------------------  ----
NAME                        ASSET_ID                              TYPE
default_py3.6               0062b8c9-8b7d-44a0-a9b9-46c416adcbd9  base
scikit-learn_0.20-py3.6     09c5a1d0-9c1e-4473-a344-eb7b665ff687  base
ai-function_0.1-py3.6       0cdb0f1e-5376-4f4d-92dd-da3b69aa9bda  base
shiny-r3.6                  0e6e79df-875e-4f24-8ae9-62dcc2148306  base
pytorch_1.1-py3.6           10ac12d6-6b30-4ccd-8392-3e922c096a92  base
scikit-learn_0.22-py3.6     154010fa-5b3b-4ac1-82af-4d5ee5abbc85  base
default_r3.6                1b70aec3-ab34-4b87-8aa0-a4a3c8296a36  base
tensorflow_1.15-py3.6       2b73a275-7cbf-420b-a912-eae7f436e0bc  base
pytorch_1.2-py3.6           2c8ef57d-2687-4b7d-acce-01f94976dac1  base
spark-mllib_2.3             2e51f700-bca0-4b0d-88dc-5c6791338875  base
pytorch-onnx_1.1-py3.6-edt  32983cea-3f32-4400-8965-dde874a8d67e  base
spark-mllib_2.4             390d21f8-e58b-4fac-9c55-d7ceda621326  base
xgboos

In [33]:

model_props = {wml_client.repository.ModelMetaNames.NAME: MODEL_NAME,
               wml_client.repository.ModelMetaNames.INPUT_DATA_SCHEMA: input_schema,
               wml_client.repository.ModelMetaNames.RUNTIME_UID : "scikit-learn_0.22-py3.6",
               wml_client.repository.ModelMetaNames.TYPE : "scikit-learn_0.22"
              }

In [34]:
model_artifact = wml_client.repository.store_model(kmeans, pipeline=pipeline_org, meta_props=model_props)

In [35]:
model_uid = wml_client.repository.get_model_uid(model_artifact)
print("Model UID = " + model_uid)

Model UID = 73f17ece-9660-4061-bd10-e19513c91b66


In [36]:
import json
print(json.dumps(model_artifact, indent=3))

{
   "metadata": {
      "name": "testmodel0825",
      "guid": "73f17ece-9660-4061-bd10-e19513c91b66",
      "id": "73f17ece-9660-4061-bd10-e19513c91b66",
      "modified_at": "2020-08-25T18:05:06.002Z",
      "created_at": "2020-08-25T18:05:04.002Z",
      "owner": "1000331001",
      "href": "/v4/models/73f17ece-9660-4061-bd10-e19513c91b66?space_id=f5dfdade-b422-4488-b459-490fa1adfbaa",
      "space_id": "f5dfdade-b422-4488-b459-490fa1adfbaa"
   },
   "entity": {
      "name": "testmodel0825",
      "content_status": {
         "state": "persisted"
      },
      "space": {
         "id": "f5dfdade-b422-4488-b459-490fa1adfbaa",
         "href": "/v4/spaces/f5dfdade-b422-4488-b459-490fa1adfbaa"
      },
      "type": "scikit-learn_0.22",
      "runtime": {
         "id": "scikit-learn_0.22-py3.6",
         "href": "/v4/runtimes/scikit-learn_0.22-py3.6"
      },
      "schemas": {
         "input": [
            {
               "id": "testid",
               "type": "struct",
       

<p><font size=-1 color=gray>
&copy; Copyright 2019 IBM Corp. All Rights Reserved.
<p>
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing permissions and
limitations under the License.
</font></p>