# Clustering data using scikit-learn

Clustering algorithms automatically group multidimensional data into clusters of similar items.

In this notebook, we'll use scikit-learn, which provides implementations of many clustering algorithms.
We'll use **k-means** clustering to group the carts in a shopping cart dataset into clusters.
Using that model, we can take any shopping cart and determine which cluster it fits best.

Once we've predicted a cluster, we'll use the most popular products in that cluster to
recommend additional purchases.
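
To make the idea concrete before we reach for scikit-learn, here is a minimal pure-Python sketch of the k-means loop: assign each cart to its nearest center, then move each center to the mean of its carts. The function names and the toy carts are made up for this illustration, and scikit-learn's `KMeans` uses a more careful initialization and stopping rule than this sketch does.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))


def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy k-means: random initial centers, then a fixed number of update rounds."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = [nearest_center(p, centers) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


# Six toy carts over five products: three "baby" carts and three "snack" carts.
toy_carts = [
    [1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1],
]
centers, labels = kmeans_sketch(toy_carts, k=2)
print(labels)
```

The first three carts end up sharing one label and the last three share the other. Which group gets label 0 depends on the random starting centers.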


## Setup

### Set your CPD URL in wml_credentials

In [None]:
# @hidden_cell
import os

token = os.environ['USER_ACCESS_TOKEN']

wml_credentials = {
    "token": token,
    "instance_id": "openshift",
    "url": "https://zen-cpd-zen.apps.marksturpak8.ibmcodetest.us",  # Provide your CPD URL here
    "version": "3.0.1"
}
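
If you would rather not hard-code the URL, one option is to read it from the environment the same way the token is read. This is only a sketch: `CPD_URL` and `CPD_VERSION` are hypothetical variable names chosen for this example, not names that Cloud Pak for Data sets for you.

```python
import os


def build_wml_credentials(env=os.environ):
    """Assemble the credentials dict from environment-style key/value pairs."""
    return {
        "token": env["USER_ACCESS_TOKEN"],
        "instance_id": "openshift",
        # CPD_URL and CPD_VERSION are hypothetical names used only in this sketch.
        "url": env["CPD_URL"],
        "version": env.get("CPD_VERSION", "3.0.1"),
    }


# Example with an explicit dict standing in for os.environ:
example = build_wml_credentials({
    "USER_ACCESS_TOKEN": "my-token",
    "CPD_URL": "https://cpd.example.com",
})
print(example["url"], example["version"])
```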


### Install Python modules

> NOTE!  Some pip installs require a kernel restart.

The shell command `pip install` installs Python modules.
To avoid confusing errors, run the following cell once and then use the **Kernel** menu to restart the kernel before proceeding.

### Ensure you have the watson-machine-learning-client version that you need

In [None]:
!pip uninstall --yes watson-machine-learning-client-V4
!pip install watson-machine-learning-client-V4==1.0.112
!pip freeze | grep watson-machine-learning-client


In [None]:
# The Watson Studio Python kernel should already have the scikit-learn module we need.
#
# Tested on CPD 3.0.1 with scikit-learn==0.22.1

!pip freeze | grep scikit-learn


## Imports

Import the Python modules that we need in the rest of the notebook.

In [None]:
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans


## Load the shopping cart data for training the model

Run the cell below to load the shopping cart training data from a CSV file into a pandas DataFrame.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/IBM/ibm-streams-with-ml-model/master/data/customers_orders1_opt.csv")

## Prepare the cart data

Keep only the columns that hold product category values. The `keep_columns` list of labels will also be handy later.


In [None]:
keep_columns = [
    'Baby Food', 'Diapers', 'Formula', 'Lotion', 'Baby wash', 'Wipes',
    'Fresh Fruits', 'Fresh Vegetables', 'Beer', 'Wine', 'Club Soda', 'Sports Drink',
    'Chips', 'Popcorn', 'Oatmeal', 'Medicines', 'Canned Foods', 'Cigarettes',
    'Cheese', 'Cleaning Products', 'Condiments', 'Frozen Foods', 'Kitchen Items', 'Meat',
    'Office Supplies', 'Personal Care', 'Pet Supplies', 'Sea Food', 'Spices',
]
df_carts = df[keep_columns]
df_carts.head()
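
One reason the list of labels is handy: it fixes the column order, so we can turn a set of product names into the numeric vector that the model expects. The helper below is a hypothetical illustration (it is not part of the original notebook), shown here with a shortened column list.

```python
def encode_cart(purchased, columns):
    """Return a 0/1 vector marking which of `columns` appear in `purchased`."""
    unknown = set(purchased) - set(columns)
    if unknown:
        raise ValueError("Unknown product categories: {}".format(sorted(unknown)))
    return [1 if name in purchased else 0 for name in columns]


# Shortened column list for illustration; in the notebook you would pass keep_columns.
example_columns = ['Baby Food', 'Diapers', 'Formula', 'Lotion']
print(encode_cart({'Baby Food', 'Formula'}, example_columns))  # [1, 0, 1, 0]
```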

## Train a k-means model that will put the carts into 10 clusters and show the centers

In [None]:
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters)
predicted = kmeans.fit_predict(df_carts.values)
centers = kmeans.cluster_centers_

In [None]:
# print(centers) but with nicer number formatting
print("CLUSTER CENTERS...")
print("Number of clusters: ", n_clusters)
print("Number of products: ", len(keep_columns))
print(keep_columns)
for center in centers:
    print('[ ', end='')
    for i in center:
        print("{:.2f}".format(abs(i)), end=', ')
    print(']')


In [None]:
# Test the model
# Provide a shopping cart and see which cluster the model predicts for it.
# test_cart1 uses 0 for products not in the cart.
# test_cart2 uses 0.5 instead, placing those products halfway between buy and not-buy.
test_cart1 = [1,0,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
test_cart2 = [1,0.5,1,1,1,1,0.5,1,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5]
print(test_cart1)
print(test_cart2)
test_carts = [test_cart1, test_cart2]

In [None]:

predicted_cluster = kmeans.predict(test_carts)
print(predicted_cluster)
print(centers[predicted_cluster])

In [None]:
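# Print the center of the cluster predicted for each test cart
for cart_center in centers[predicted_cluster]:
    print('[ ', end='')
    for i in cart_center:
        print("{:.2f}".format(abs(i)), end=', ')
    print(']')

# Keep the center for the first test cart; the recommendations below use test_carts[0]
center = centers[predicted_cluster][0]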

In [None]:
# Use the selected cluster center to suggest additional products

threshold = 0.5
for i, prod in enumerate(keep_columns):
    if test_carts[0][i] > threshold:
        print("{:.2f} already in cart:".format(center[i]), prod)

for i, prod in enumerate(keep_columns):
    if test_carts[0][i] <= threshold and center[i] > threshold:
        print("{:.2f} product to recommend: ".format(center[i]), prod)

for i, prod in enumerate(keep_columns):
    if test_carts[0][i] <= threshold and center[i] <= threshold:
        print("{:.2f} other product: ".format(center[i]), prod)
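
The same logic can be wrapped in a small function that returns the recommendations instead of printing them, sorted so the most popular products in the cluster come first. This is a sketch rather than part of the original notebook: the function name, its parameters, and the toy values are all made up for illustration.

```python
def recommend(cart, center, product_names, in_cart_threshold=0.5, min_popularity=0.5):
    """Products not yet in the cart whose cluster-center value exceeds min_popularity."""
    candidates = [
        (popularity, name)
        for name, in_cart, popularity in zip(product_names, cart, center)
        if in_cart <= in_cart_threshold and popularity > min_popularity
    ]
    # Highest center value first
    return [name for popularity, name in sorted(candidates, reverse=True)]


# Toy values for illustration only
names = ['Diapers', 'Wipes', 'Beer', 'Chips']
cart = [1, 0, 0, 0]
toy_center = [0.90, 0.80, 0.10, 0.60]
print(recommend(cart, toy_center, names))  # ['Wipes', 'Chips']
```

In the notebook, the equivalent call would be `recommend(test_carts[0], center, keep_columns)`.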


In [None]:
# To store the trained model, first create a deployment space and set it as the default.

from watson_machine_learning_client import WatsonMachineLearningAPIClient
wml_client = WatsonMachineLearningAPIClient(wml_credentials)

In [None]:
# Set your deployment space name and model name

MODEL_NAME = "Shopping Cart Cluster Model"
DEPLOYMENT_SPACE_NAME = "ibm_streams_with_ml_model_deployment_space"


In [None]:

metadata = {
    wml_client.spaces.ConfigurationMetaNames.NAME: DEPLOYMENT_SPACE_NAME,
    wml_client.spaces.ConfigurationMetaNames.DESCRIPTION: 'Deployment space created from notebook for shopping cart model'
}
space_details = wml_client.spaces.store(meta_props=metadata)

space_uid = wml_client.spaces.get_uid(space_details)

In [None]:
wml_client.set.default_space(space_uid)

In [None]:
print(space_uid)

In [None]:
wml_client.spaces.list()

In [None]:
from sklearn.pipeline import Pipeline
import pickle

# Wrap a k-means estimator in a pipeline, using the same number of clusters as above.
# k-means is unsupervised, so fit() takes only the feature data and no labels.
pipeline_org = Pipeline(steps=[("kmeans", KMeans(n_clusters=n_clusters))])
pipeline_org.fit(df_carts)
with open("kmeans-prediction-model.pkl", "wb") as f:
    pickle.dump(pipeline_org, f)

!mkdir -p model-dir
!cp kmeans-prediction-model.pkl model-dir
!tar -zcvf kmeans-prediction-model.tar.gz kmeans-prediction-model.pkl
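
If shell commands are not available in your kernel, the archive can also be built with Python's standard `tarfile` module. The helper below is a hypothetical sketch (its name is made up for this example). It writes a gzip-compressed tar containing the single model file and returns the member names so you can check the result.

```python
import os
import tarfile


def make_model_archive(model_path, archive_path):
    """Create a .tar.gz holding one file and return the archive's member names."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores the file at the top level of the archive
        archive.add(model_path, arcname=os.path.basename(model_path))
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


# Usage in this notebook would look like:
# make_model_archive("kmeans-prediction-model.pkl", "kmeans-prediction-model.tar.gz")
```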

In [None]:
input_schema = [{
    'id': 'testid',
    'type': 'struct',
    'fields': [
        {
            'name': 'input_cart',
            'type': 'array',
            'nullable': False
        }
    ]
}]

model_def_meta_props = {
    wml_client.model_definitions.ConfigurationMetaNames.NAME: 'Shopping_Cart_Cluster_Model_definition',
    wml_client.model_definitions.ConfigurationMetaNames.VERSION: '1.0',
    wml_client.model_definitions.ConfigurationMetaNames.PLATFORM: {'name': 'python', 'versions': ['3.6']}
}

In [None]:
model_def_details = wml_client.model_definitions.store(
     model_definition='kmeans-prediction-model.tar.gz',
     meta_props=model_def_meta_props
)

model_def_id = wml_client.model_definitions.get_uid(model_def_details)

In [None]:
print(model_def_id)

In [None]:
wml_client.software_specifications.list()

In [None]:

model_props = {
    wml_client.repository.ModelMetaNames.NAME: MODEL_NAME,
    wml_client.repository.ModelMetaNames.INPUT_DATA_SCHEMA: input_schema,
    wml_client.repository.ModelMetaNames.RUNTIME_UID: "scikit-learn_0.22-py3.6",
    wml_client.repository.ModelMetaNames.TYPE: "scikit-learn_0.22"
}

In [None]:
model_artifact = wml_client.repository.store_model(kmeans, pipeline=pipeline_org, meta_props=model_props)

In [None]:
model_uid = wml_client.repository.get_model_uid(model_artifact)
print("Model UID = " + model_uid)

In [None]:
import json
print(json.dumps(model_artifact, indent=3))

<p><font size=-1 color=gray>
&copy; Copyright 2019 IBM Corp. All Rights Reserved.
<p>
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing permissions and
limitations under the License.
</font></p>