# Clustering data using scikit-learn

Clustering algorithms let you automatically group multidimensional data into clusters.

In this notebook, we'll use scikit-learn to predict clusters. 
scikit-learn provides implementations of many clustering algorithms.
We'll use **k-means** clustering to create clusters based on a shopping cart dataset.
Using that model, we can take any shopping cart and determine which cluster it fits best.

Once we've predicted a cluster, we'll use the most popular products in that cluster to
recommend additional purchases.
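
To make "determine which cluster it fits best" concrete, here is a minimal plain-Python sketch of the assignment step: a cart is assigned to the cluster whose center is closest in Euclidean distance. The helper names `squared_distance` and `nearest_center` and the toy two-product centers are illustrative assumptions. They are not part of this notebook or of scikit-learn's implementation.

```python
def squared_distance(a, b):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(cart, centers):
    """Return the index of the center closest to the cart."""
    distances = [squared_distance(cart, c) for c in centers]
    return distances.index(min(distances))


# Toy centers over two products (made-up values for illustration only).
toy_centers = [[0.9, 0.1], [0.1, 0.8]]

print(nearest_center([1, 0], toy_centers))  # 0
print(nearest_center([0, 1], toy_centers))  # 1
```

Comparing squared distances gives the same winner as comparing true distances, so the square root can be skipped.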


## Setup

### Set your CPD URL in wml_credentials

In [2]:
# @hidden_cell
import os
import sys

# Read the access token from the notebook environment.
token = os.environ['USER_ACCESS_TOKEN']

wml_credentials = {
    "token": token,
    "instance_id": "openshift",
    "url": "https://zen-cpd-zen.apps.marksturpak8.ibmcodetest.us",  # Provide your CPD URL here
    "version": "3.0.1"
}


### Install Python modules

> **NOTE:** Some pip installs require a kernel restart.

The shell command `pip install` installs Python modules.
To avoid confusing errors, run the following cell once, then use the **Kernel** menu to restart the kernel before proceeding.

### Ensure you have the watson-machine-learning-client version that you need

In [3]:
!pip uninstall --yes watson-machine-learning-client-V4
!pip install watson-machine-learning-client-V4==1.0.112
!pip freeze | grep watson-machine-learning-client


Found existing installation: watson-machine-learning-client-V4 1.0.95
Uninstalling watson-machine-learning-client-V4-1.0.95:
  Successfully uninstalled watson-machine-learning-client-V4-1.0.95
Collecting watson-machine-learning-client-V4==1.0.112
  Downloading watson_machine_learning_client_V4-1.0.112-py3-none-any.whl (1.3 MB)
Collecting ibm-cos-sdk==2.6.0
  Downloading ibm-cos-sdk-2.6.0.tar.gz (53 kB)
Collecting ibm-cos-sdk-core==2.6.0
  Downloading ibm-cos-sdk-core-2.6.0.tar.gz (763 kB)
Collecting ibm-cos-sdk-s3transfer==2.6.0
  Downloading ibm-cos-sdk-s3transfer-2.6.0.tar.gz (221 kB)
Building wheels for collected packages: ibm-cos-sdk, ibm-cos-s

In [4]:
# The Watson Studio Python kernel should already have the scikit-learn module we need.
#
# Tested on CPD 3.0.1 with scikit-learn==0.22.1

!pip freeze | grep scikit-learn


scikit-learn==0.22.1
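
If you would rather check the version in code than read the `pip freeze` output by eye, one option is to parse the `name==version` line. The helper `parse_requirement` below is a hypothetical name, and the sketch assumes a purely numeric version such as `0.22.1`.

```python
def parse_requirement(line):
    """Split a 'name==version' line into (name, version tuple).

    Assumes every version component is an integer; pre-release
    suffixes such as 'rc1' are not handled in this sketch.
    """
    name, _, version = line.strip().partition("==")
    return name, tuple(int(part) for part in version.split("."))


name, version = parse_requirement("scikit-learn==0.22.1")

# Compare only major.minor, since this notebook was tested with 0.22.x.
is_tested_version = version[:2] == (0, 22)
print(name, version, is_tested_version)
```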


## Imports

Import the Python modules that we need in the rest of the notebook.

In [5]:
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans


## Load the shopping cart data for training the model

Run the cell below to load the shopping cart training data from a CSV file into a pandas DataFrame.

In [6]:
df = pd.read_csv("https://raw.githubusercontent.com/IBM/ibm-streams-with-ml-model/master/data/customers_orders1_opt.csv")

## Prepare the cart data

Keep only the columns with product category values. The `keep_columns` list of labels is reused later to label cluster centers and recommendations.


In [7]:
keep_columns = ['Baby Food','Diapers','Formula','Lotion','Baby wash','Wipes','Fresh Fruits','Fresh Vegetables','Beer','Wine','Club Soda','Sports Drink','Chips','Popcorn','Oatmeal','Medicines','Canned Foods','Cigarettes','Cheese','Cleaning Products','Condiments','Frozen Foods','Kitchen Items','Meat','Office Supplies','Personal Care','Pet Supplies','Sea Food','Spices']
df_carts = df[keep_columns]
df_carts.head()

Unnamed: 0,Baby Food,Diapers,Formula,Lotion,Baby wash,Wipes,Fresh Fruits,Fresh Vegetables,Beer,Wine,...,Cleaning Products,Condiments,Frozen Foods,Kitchen Items,Meat,Office Supplies,Personal Care,Pet Supplies,Sea Food,Spices
0,0,0,1,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
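
Each row above is a cart encoded as one 0/1 value per product column. As a rough sketch of that encoding, the hypothetical helper `cart_to_vector` below turns a list of product names into such a vector. The shortened column list is for illustration only; the notebook itself uses the full `keep_columns` list.

```python
def cart_to_vector(items, columns):
    """Return a 0/1 list with a 1 for every column whose product is in items."""
    wanted = set(items)
    unknown = wanted - set(columns)
    if unknown:
        raise ValueError("Unknown products: {}".format(sorted(unknown)))
    return [1 if col in wanted else 0 for col in columns]


# Shortened column list for illustration.
columns = ['Baby Food', 'Diapers', 'Formula', 'Lotion']

vector = cart_to_vector(['Formula', 'Lotion'], columns)
print(vector)  # [0, 0, 1, 1]
```

Raising on unknown product names is a design choice made here so that a typo does not silently produce an all-zero cart.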


## Train a k-means model that will put the carts into 10 clusters and show the centers

In [8]:
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters)
predicted = kmeans.fit_predict(df_carts.values)
centers = kmeans.cluster_centers_

In [9]:
# print(centers) but with nicer number formatting
print("CLUSTER CENTERS...")
print("Number of clusters: ", n_clusters)
print("Number of products: ", len(keep_columns))
print(keep_columns)
for center in centers:
    print('[ ', end='')
    for i in center:
        print("{:.2f}".format(abs(i)), end=', ')
    print(']')


CLUSTER CENTERS...
Number of clusters:  10
Number of products:  29
['Baby Food', 'Diapers', 'Formula', 'Lotion', 'Baby wash', 'Wipes', 'Fresh Fruits', 'Fresh Vegetables', 'Beer', 'Wine', 'Club Soda', 'Sports Drink', 'Chips', 'Popcorn', 'Oatmeal', 'Medicines', 'Canned Foods', 'Cigarettes', 'Cheese', 'Cleaning Products', 'Condiments', 'Frozen Foods', 'Kitchen Items', 'Meat', 'Office Supplies', 'Personal Care', 'Pet Supplies', 'Sea Food', 'Spices']
[ 0.11, 0.25, 0.12, 0.11, 0.11, 0.12, 0.13, 0.12, 0.14, 0.04, 0.05, 0.12, 0.00, 0.06, 0.13, 0.02, 0.10, 1.00, 0.03, 0.24, 0.22, 0.02, 0.15, 1.00, 0.14, 0.02, 0.13, 0.00, 1.00, ]
[ 0.00, 0.01, 0.01, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.01, 1.00, 0.80, 0.01, 0.00, 1.00, 0.00, 0.00, 0.01, 0.80, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, ]
[ 0.14, 0.26, 0.14, 0.13, 0.13, 0.13, 0.13, 0.15, 0.15, 0.00, 0.00, 0.14, 0.00, 0.00, 0.13, 0.00, 0.00, 0.00, 0.00, 0.26, 0.14, 0.00, 0.14, 0.00, 0.14, 0.00, 0.15, 0.00, 0.00, ]
[ 0.05, 0.08, 0.
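
One way to read these numbers: a k-means center is (up to convergence tolerance) the mean of the carts assigned to it, so with 0/1 data each coordinate is roughly the fraction of carts in that cluster that contain the product. The sketch below illustrates this with made-up carts; `cluster_center` is a hypothetical helper, not a scikit-learn function.

```python
def cluster_center(member_carts):
    """Column-wise mean of the carts that belong to one cluster."""
    n = len(member_carts)
    return [sum(column) / n for column in zip(*member_carts)]


# Four made-up carts over three products.
member_carts = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]

toy_center = cluster_center(member_carts)
print(["{:.2f}".format(v) for v in toy_center])  # ['1.00', '0.25', '0.25']
```

A value of 1.00 therefore means every cart in the cluster contains that product, which is why it makes a strong recommendation candidate.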

In [10]:
# Test the model
# Provide a shopping cart and see how the model predicts a cluster for it.
# Instead of zeros, try 0.5 to let the model decide whether to lean closer to buy or not-buy.
test_cart1 = [1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
test_cart2 = [1,0.5,1,1,1,1,0.5,1,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5]
print(test_cart1)
print(test_cart2)
test_carts = [ test_cart1, test_cart2]

[1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0.5, 1, 1, 1, 1, 0.5, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
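
To see why 0.5 acts as a neutral value, consider the distance k-means uses to pick a cluster. For a center coordinate of 0 or 1, a cart value of 0.5 is equally far from both, so that product does not pull the cart toward either cluster. The two-product centers below are made-up values for illustration, and `squared_distance` is a helper defined only for this sketch.

```python
def squared_distance(a, b):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two made-up centers that differ only in the second product.
center_without = [1.0, 0.0]
center_with = [1.0, 1.0]

definite_cart = [1, 0]     # second product explicitly absent
undecided_cart = [1, 0.5]  # second product left undecided

print(squared_distance(definite_cart, center_without),
      squared_distance(definite_cart, center_with))    # 0.0 1.0
print(squared_distance(undecided_cart, center_without),
      squared_distance(undecided_cart, center_with))   # 0.25 0.25
```

With real centers the coordinates are rarely exactly 0 or 1, so 0.5 is only approximately neutral in practice.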


In [11]:

predicted_cluster = kmeans.predict(test_carts)
print(predicted_cluster)
print(centers[predicted_cluster])

[2 9]
[[ 1.37171888e-01  2.59102456e-01  1.41828959e-01  1.29974598e-01
   1.32091448e-01  1.30397968e-01  1.34208298e-01  1.45639289e-01
   1.50296359e-01  9.65894031e-15  8.46740051e-04  1.37171888e-01
   2.72004641e-15  1.72084569e-15  1.32938188e-01 -3.02535774e-15
  -5.60662627e-15 -3.96904731e-15 -5.27355937e-15  2.63336156e-01
   1.36748518e-01  3.35842465e-15  1.36748518e-01 -2.52575738e-15
   1.35478408e-01 -6.18949336e-15  1.47332769e-01  1.85962357e-15
  -7.29971639e-15]
 [ 7.16332378e-02  1.23209169e-01  6.37535817e-02  6.44699140e-02
   6.23209169e-02  7.09169054e-02  7.37822350e-02  6.73352436e-02
   7.16332378e-02  4.19770774e-01 -3.66373598e-15  8.09455587e-02
   9.04831765e-15  1.66533454e-15  4.72779370e-02  6.27507163e-01
   7.02005731e-02  6.67621777e-01  6.31088825e-01  1.24641834e-01
   1.38968481e-01  4.29083095e-01  6.73352436e-02  7.23495702e-02
   6.01719198e-02  6.46848138e-01  5.44412607e-02 -1.47104551e-15
   6.63323782e-01]]


In [12]:
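# Print the centers of the predicted clusters with nicer number formatting.
# Note: after this loop, `center` is left bound to the center of the last
# predicted cluster (the one for test_cart2).
for center in centers[predicted_cluster]:
    print('[ ', end='')
    for i in center:
        print("{:.2f}".format(abs(i)), end=', ')
    print(']')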

[ 0.14, 0.26, 0.14, 0.13, 0.13, 0.13, 0.13, 0.15, 0.15, 0.00, 0.00, 0.14, 0.00, 0.00, 0.13, 0.00, 0.00, 0.00, 0.00, 0.26, 0.14, 0.00, 0.14, 0.00, 0.14, 0.00, 0.15, 0.00, 0.00, ]
[ 0.07, 0.12, 0.06, 0.06, 0.06, 0.07, 0.07, 0.07, 0.07, 0.42, 0.00, 0.08, 0.00, 0.00, 0.05, 0.63, 0.07, 0.67, 0.63, 0.12, 0.14, 0.43, 0.07, 0.07, 0.06, 0.65, 0.05, 0.00, 0.66, ]
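
The `abs()` call in the formatting loop is there because some center coordinates are tiny negative numbers (floating-point noise around zero), and Python keeps the sign when rounding them for display. A short illustration, using one of the raw values printed earlier:

```python
# Value copied from the raw cluster center output shown earlier.
tiny_negative = -3.66373598e-15

signed = "{:.2f}".format(tiny_negative)
unsigned = "{:.2f}".format(abs(tiny_negative))

print(signed)    # -0.00
print(unsigned)  # 0.00
```

Taking the absolute value is safe here only because real center coordinates for 0/1 data are never meaningfully negative.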


In [13]:
# Use the selected cluster center to suggest additional products.
# Pair a cart explicitly with the center of its own predicted cluster.
# Index 1 (test_cart2) is used here, which matches the output shown below.
cart_index = 1
cart = test_carts[cart_index]
center = centers[predicted_cluster[cart_index]]

threshold = 0.5
for i, prod in enumerate(keep_columns):
    if cart[i] > threshold:
        print("{:.2f} already in cart:".format(center[i]), prod)

for i, prod in enumerate(keep_columns):
    if cart[i] <= threshold and center[i] > 0.5:
        print("{:.2f} product to recommend: ".format(center[i]), prod)

for i, prod in enumerate(keep_columns):
    if cart[i] <= threshold and center[i] <= 0.5:
        print("{:.2f} other product: ".format(center[i]), prod)


0.07 already in cart: Baby Food
0.06 already in cart: Formula
0.06 already in cart: Lotion
0.06 already in cart: Baby wash
0.07 already in cart: Wipes
0.07 already in cart: Fresh Vegetables
0.63 product to recommend:  Medicines
0.67 product to recommend:  Cigarettes
0.63 product to recommend:  Cheese
0.65 product to recommend:  Personal Care
0.66 product to recommend:  Spices
0.12 other product:  Diapers
0.07 other product:  Fresh Fruits
0.07 other product:  Beer
0.42 other product:  Wine
-0.00 other product:  Club Soda
0.08 other product:  Sports Drink
0.00 other product:  Chips
0.00 other product:  Popcorn
0.05 other product:  Oatmeal
0.07 other product:  Canned Foods
0.12 other product:  Cleaning Products
0.14 other product:  Condiments
0.43 other product:  Frozen Foods
0.07 other product:  Kitchen Items
0.07 other product:  Meat
0.06 other product:  Office Supplies
0.05 other product:  Pet Supplies
-0.00 other product:  Sea Food
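
The same selection logic can be packaged as a function that returns the recommendations, strongest first. This is a sketch only: `recommend` is a hypothetical helper, and the five products and center values are a subset copied from the output above.

```python
def recommend(cart, center, columns, threshold=0.5):
    """Products not in the cart whose center value exceeds the threshold,
    ordered from highest to lowest center value."""
    candidates = [
        (weight, product)
        for product, in_cart, weight in zip(columns, cart, center)
        if in_cart <= threshold and weight > threshold
    ]
    return [product for weight, product in sorted(candidates, reverse=True)]


# Subset of products and center values taken from the output above.
columns = ['Baby Food', 'Wine', 'Medicines', 'Cigarettes', 'Cheese']
cart = [1, 0, 0, 0, 0]
center = [0.07, 0.42, 0.63, 0.67, 0.63]

suggestions = recommend(cart, center, columns)
print(suggestions)  # ['Cigarettes', 'Medicines', 'Cheese']
```

Ties in center value are broken by product name here, simply because tuples sort element by element.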


In [14]:
# To store the trained model, first create a deployment space and set it as the default.

from watson_machine_learning_client import WatsonMachineLearningAPIClient
wml_client = WatsonMachineLearningAPIClient(wml_credentials)

In [15]:
# Set your deployment space name and model name

MODEL_NAME = "Shopping Cart Cluster Model"
DEPLOYMENT_SPACE_NAME = "ibm_streams_with_ml_model_deployment_space"


In [16]:

metadata = {
 wml_client.spaces.ConfigurationMetaNames.NAME: DEPLOYMENT_SPACE_NAME,
 wml_client.spaces.ConfigurationMetaNames.DESCRIPTION: 'Deployment space created from notebook for shopping cart model'
}
space_details = wml_client.spaces.store(meta_props=metadata)

space_uid = wml_client.spaces.get_uid(space_details)

In [17]:
wml_client.set.default_space(space_uid)

'SUCCESS'

In [18]:
print(space_uid)

d7d5fa99-058a-437e-ba53-ea2ce813de39


In [19]:
wml_client.spaces.list()

------------------------------------  ------------------------------------------  ------------------------
GUID                                  NAME                                        CREATED
d7d5fa99-058a-437e-ba53-ea2ce813de39  ibm_streams_with_ml_model_deployment_space  2020-08-27T06:39:19.914Z
f5dfdade-b422-4488-b459-490fa1adfbaa  testdepspace0825                            2020-08-25T18:02:23.959Z
2f0d8a0e-124b-4904-bb31-1a2f38b9c64d  shopping_ml_deployment_space                2020-08-25T02:18:36.466Z
9bac80bb-ee30-4ac4-9c34-8de8732e117b  shopping_ml_deployment_space                2020-08-25T01:58:10.188Z
ff049a39-bb3c-4c05-97a1-591f57ed7879  shopping_ml_deployment_space                2020-08-25T01:50:47.588Z
a39d1834-576b-4861-8d94-b167a2011506  shopping_ml_deployment_space                2020-08-25T01:39:31.515Z
b68bc017-4e55-4d3b-a53a-9ef554b3e76d  shopping_ml_deployment_space                2020-08-25T01:37:07.951Z
eafb91e0-2eaf-4fc6-afae-407faa18fcb4  shopping_ml_depl
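
The listing contains several spaces with the same name, which suggests that each run of the creation cell adds a new space. One possible way to avoid that is to look for an existing space by name before creating one. The sketch below works on plain `(guid, name, created)` tuples copied from the listing; how to obtain those rows from the client is not shown and `find_space_guid` is a hypothetical helper.

```python
def find_space_guid(rows, name):
    """Return the GUID of the most recently created space called name, or None."""
    matches = [row for row in rows if row[1] == name]
    if not matches:
        return None
    # ISO-8601 timestamps in the same format sort chronologically as strings.
    return max(matches, key=lambda row: row[2])[0]


# Rows copied from the listing above: (GUID, NAME, CREATED).
rows = [
    ("d7d5fa99-058a-437e-ba53-ea2ce813de39",
     "ibm_streams_with_ml_model_deployment_space", "2020-08-27T06:39:19.914Z"),
    ("2f0d8a0e-124b-4904-bb31-1a2f38b9c64d",
     "shopping_ml_deployment_space", "2020-08-25T02:18:36.466Z"),
    ("9bac80bb-ee30-4ac4-9c34-8de8732e117b",
     "shopping_ml_deployment_space", "2020-08-25T01:58:10.188Z"),
]

existing = find_space_guid(rows, "shopping_ml_deployment_space")
print(existing)  # 2f0d8a0e-124b-4904-bb31-1a2f38b9c64d
```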

In [20]:
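from sklearn.pipeline import Pipeline
import pickle

# Wrap k-means in a pipeline. Use the same number of clusters as above;
# KMeans() on its own would fall back to the default of 8 clusters.
pipeline_org = Pipeline(steps=[("classifier", KMeans(n_clusters=n_clusters))])
# k-means is unsupervised, so fit() needs no target argument.
pipeline_org.fit(df_carts)
with open("kmeans-prediction-model.pkl", "wb") as f:
    pickle.dump(pipeline_org, f)

!mkdir -p model-dir
!cp kmeans-prediction-model.pkl model-dir
!tar -zcvf kmeans-prediction-model.tar.gz kmeans-prediction-model.pkl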

kmeans-prediction-model.pkl
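
If shell commands are not available in your environment, the pickle-and-archive step can also be done with the standard library. The following is a sketch under stated assumptions: a placeholder dictionary stands in for the fitted pipeline, and a temporary directory is used so the example leaves no files behind.

```python
import os
import pickle
import tarfile
import tempfile

# Placeholder standing in for the fitted pipeline (assumption for this sketch).
model = {"note": "placeholder for the fitted pipeline"}

with tempfile.TemporaryDirectory() as workdir:
    pkl_path = os.path.join(workdir, "kmeans-prediction-model.pkl")
    tar_path = os.path.join(workdir, "kmeans-prediction-model.tar.gz")

    # Serialize the model, closing the file when done.
    with open(pkl_path, "wb") as f:
        pickle.dump(model, f)

    # Create a gzip-compressed tar archive containing only the pickle file.
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(pkl_path, arcname=os.path.basename(pkl_path))

    # Read the archive back to confirm its contents.
    with tarfile.open(tar_path, "r:gz") as tar:
        archived_names = tar.getnames()

print(archived_names)  # ['kmeans-prediction-model.pkl']
```

Passing `arcname` keeps the temporary directory path out of the archive, matching what `tar` produces when run next to the file.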


In [21]:
input_schema = [{
    'id': 'testid',
    'type': 'struct',
    'fields': [
        {
            'name': 'input_cart',
            'type': 'array',
            'nullable': False
        }
    ]
}]

model_def_meta_props = {
    wml_client.model_definitions.ConfigurationMetaNames.NAME: 'Shopping_Cart_Cluster_Model_definition',
    wml_client.model_definitions.ConfigurationMetaNames.VERSION: '1.0',
    wml_client.model_definitions.ConfigurationMetaNames.PLATFORM: {'name': 'python', 'versions': ['3.6']}
}

In [22]:
model_def_details = wml_client.model_definitions.store(
    model_definition='kmeans-prediction-model.tar.gz',
    meta_props=model_def_meta_props
)

model_def_id = wml_client.model_definitions.get_uid(model_def_details)

In [23]:
print(model_def_id)

94c0026a-0bd7-4205-86c0-42818e9bf15c


In [24]:
wml_client.software_specifications.list()

--------------------------  ------------------------------------  ----
NAME                        ASSET_ID                              TYPE
default_py3.6               0062b8c9-8b7d-44a0-a9b9-46c416adcbd9  base
scikit-learn_0.20-py3.6     09c5a1d0-9c1e-4473-a344-eb7b665ff687  base
ai-function_0.1-py3.6       0cdb0f1e-5376-4f4d-92dd-da3b69aa9bda  base
shiny-r3.6                  0e6e79df-875e-4f24-8ae9-62dcc2148306  base
pytorch_1.1-py3.6           10ac12d6-6b30-4ccd-8392-3e922c096a92  base
scikit-learn_0.22-py3.6     154010fa-5b3b-4ac1-82af-4d5ee5abbc85  base
default_r3.6                1b70aec3-ab34-4b87-8aa0-a4a3c8296a36  base
tensorflow_1.15-py3.6       2b73a275-7cbf-420b-a912-eae7f436e0bc  base
pytorch_1.2-py3.6           2c8ef57d-2687-4b7d-acce-01f94976dac1  base
spark-mllib_2.3             2e51f700-bca0-4b0d-88dc-5c6791338875  base
pytorch-onnx_1.1-py3.6-edt  32983cea-3f32-4400-8965-dde874a8d67e  base
spark-mllib_2.4             390d21f8-e58b-4fac-9c55-d7ceda621326  base
xgboos

In [25]:

model_props = {
    wml_client.repository.ModelMetaNames.NAME: MODEL_NAME,
    wml_client.repository.ModelMetaNames.INPUT_DATA_SCHEMA: input_schema,
    wml_client.repository.ModelMetaNames.RUNTIME_UID: "scikit-learn_0.22-py3.6",
    wml_client.repository.ModelMetaNames.TYPE: "scikit-learn_0.22"
}

In [26]:
model_artifact = wml_client.repository.store_model(kmeans, pipeline=pipeline_org, meta_props=model_props)

In [27]:
model_uid = wml_client.repository.get_model_uid(model_artifact)
print("Model UID = " + model_uid)

Model UID = 20a0ea1c-0f41-4844-aa05-33c53263a399
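
The details returned when the model is stored are a nested dictionary (printed in the next cell), so individual fields can also be read directly. The sketch below uses a hand-copied excerpt of those details; the helper `is_persisted` is hypothetical, and reading `metadata.guid` is an assumption about where the UID lives, based on the printed structure.

```python
# Excerpt of the stored model details, copied from this notebook's output.
model_artifact_excerpt = {
    "metadata": {
        "name": "Shopping Cart Cluster Model",
        "guid": "20a0ea1c-0f41-4844-aa05-33c53263a399",
        "space_id": "d7d5fa99-058a-437e-ba53-ea2ce813de39",
    },
    "entity": {
        "content_status": {"state": "persisted"},
        "type": "scikit-learn_0.22",
    },
}


def is_persisted(artifact):
    """True if the details report that the model content was saved."""
    state = artifact.get("entity", {}).get("content_status", {}).get("state")
    return state == "persisted"


guid = model_artifact_excerpt["metadata"]["guid"]
print(guid, is_persisted(model_artifact_excerpt))
```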


In [28]:
import json
print(json.dumps(model_artifact, indent=3))

{
   "metadata": {
      "name": "Shopping Cart Cluster Model",
      "guid": "20a0ea1c-0f41-4844-aa05-33c53263a399",
      "id": "20a0ea1c-0f41-4844-aa05-33c53263a399",
      "modified_at": "2020-08-27T06:40:52.002Z",
      "created_at": "2020-08-27T06:40:50.002Z",
      "owner": "1000331001",
      "href": "/v4/models/20a0ea1c-0f41-4844-aa05-33c53263a399?space_id=d7d5fa99-058a-437e-ba53-ea2ce813de39",
      "space_id": "d7d5fa99-058a-437e-ba53-ea2ce813de39"
   },
   "entity": {
      "name": "Shopping Cart Cluster Model",
      "content_status": {
         "state": "persisted"
      },
      "space": {
         "id": "d7d5fa99-058a-437e-ba53-ea2ce813de39",
         "href": "/v4/spaces/d7d5fa99-058a-437e-ba53-ea2ce813de39"
      },
      "type": "scikit-learn_0.22",
      "runtime": {
         "id": "scikit-learn_0.22-py3.6",
         "href": "/v4/runtimes/scikit-learn_0.22-py3.6"
      },
      "schemas": {
         "input": [
            {
               "id": "testid",
            

<p><font size=-1 color=gray>
&copy; Copyright 2019 IBM Corp. All Rights Reserved.
<p>
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing permissions and
limitations under the License.
</font></p>