# Return Propensity using IBM Cloud Pak for Data (ICP4D) and Watson Machine Learning

We'll use this notebook to create a machine learning model that predicts whether an order will be returned.

## 1.0 Import the data set

We need to import the data in the TON_PREV_NEW.csv file. 

In [1]:
import os, pandas as pd
# Add asset from file system
df = pd.read_csv('/project_data/data_asset/TON_PREV_NEW.csv')
df.head()

Unnamed: 0,BASKET_SIZE,EXTN_COMPOSITION,CARRIER_SERVICE_CODE_OL,CATEGORY,COUNTRY_OF_ORIGIN_OI,DAY_OF_MONTH,DAY_OF_WEEK,DAY_OF_YEAR,EXTN_BRAND,EXTN_DISCOUNT_ID,...,OTHER_CHARGES,OTHER_CHARGES_OL,REQ_DELIVERY_DATE,TOTAL_AMOUNT_USD,WEEKEND,ZIP_CODE,MTS_CTS,HOUR_OF_DAY,LOCKID,RETURN_FLAG
0,1,,STANDARD,Slip,CN,14,Saturday,287,XYZAA,,...,0.0,0.0,0,0.0,1,Zipcode_261,1,16,26,0
1,1,,STANDARD,Slip,CN,17,Tuesday,290,XYZAA,,...,0.0,0.0,0,0.0,0,Zipcode_165,2,16,36,0
2,1,"85% Polyamide, 15% Elastane",PREMIER_EVENING,Slip,CN,19,Thursday,292,XYZAA,,...,25.0,0.0,0,40.0,0,Zipcode_599,11,17,215,1
3,1,"54% Polyamide, 46% Polyester",STANDARD,Slip,CN,24,Tuesday,297,XYZAA,,...,0.0,0.0,0,0.0,0,Zipcode_261,1,15,25,0
4,2,"93% Cotton, 7% Elastane",STANDARD,Maniche Lunghe,PT,30,Monday,303,XYZAB,,...,13.0,0.0,0,251.192578,0,Zipcode_228,12,13,179,0


## 2.0 Clean the data

### 2.1 We will first fill all NA(s) and empty values with 0.

In [2]:
df=df.fillna(0)
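
Filling every column with 0 also writes the integer 0 into text columns such as EXTN_COMPOSITION, which leaves them holding a mix of strings and integers. One possible alternative, sketched here on a made-up toy frame, fills numeric columns with 0 and text columns with a placeholder label. The helper name `fill_by_dtype` and the `"MISSING"` placeholder are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the order data (values are made up).
toy = pd.DataFrame({
    "BASKET_SIZE": [1, 2, np.nan],
    "EXTN_COMPOSITION": ["93% Cotton, 7% Elastane", None, None],
})

def fill_by_dtype(frame, text_placeholder="MISSING"):
    """Fill numeric NaNs with 0 and non-numeric NaNs with a placeholder string."""
    filled = frame.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    other_cols = filled.columns.difference(numeric_cols)
    filled[numeric_cols] = filled[numeric_cols].fillna(0)
    filled[other_cols] = filled[other_cols].fillna(text_placeholder)
    return filled

toy_filled = fill_by_dtype(toy)
print(toy_filled)
```

Whether a separate placeholder is preferable to 0 depends on how the downstream encoder treats unseen labels, so treat this as an option rather than a required change.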

### 2.2 Next, we check whether any columns have dtype=object. These are converted to the pandas category dtype so that the pipeline can encode them before they are fed into the model.

In [3]:
df.dtypes

BASKET_SIZE                  int64
EXTN_COMPOSITION            object
CARRIER_SERVICE_CODE_OL     object
CATEGORY                    object
COUNTRY_OF_ORIGIN_OI        object
DAY_OF_MONTH                 int64
DAY_OF_WEEK                 object
DAY_OF_YEAR                  int64
EXTN_BRAND                  object
EXTN_DISCOUNT_ID            object
EXTN_IS_GIFT                object
EXTN_IS_PREORDER            object
EXTN_SHIP_TO_CITY           object
EXTN_SHIP_TO_COUNTRY        object
EXTN_SEASON                 object
LIST_PRICE                   int64
MONTH_OF_YEAR                int64
OTHER_CHARGES              float64
OTHER_CHARGES_OL           float64
REQ_DELIVERY_DATE            int64
TOTAL_AMOUNT_USD           float64
WEEKEND                      int64
ZIP_CODE                    object
MTS_CTS                      int64
HOUR_OF_DAY                  int64
LOCKID                       int64
RETURN_FLAG                  int64
dtype: object

In [4]:
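# Convert every object (string) column to the pandas category dtype
qual = list( df.loc[:,df.dtypes == 'object'].columns.values )
for col in qual:
    df[col] = df[col].astype('category')
# Every column that is not categorical is treated as quantitative
quant = list( df.loc[:,df.dtypes != 'category'].columns.values )
print(qual,quant)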

['EXTN_COMPOSITION', 'CARRIER_SERVICE_CODE_OL', 'CATEGORY', 'COUNTRY_OF_ORIGIN_OI', 'DAY_OF_WEEK', 'EXTN_BRAND', 'EXTN_DISCOUNT_ID', 'EXTN_IS_GIFT', 'EXTN_IS_PREORDER', 'EXTN_SHIP_TO_CITY', 'EXTN_SHIP_TO_COUNTRY', 'EXTN_SEASON', 'ZIP_CODE'] ['BASKET_SIZE', 'DAY_OF_MONTH', 'DAY_OF_YEAR', 'LIST_PRICE', 'MONTH_OF_YEAR', 'OTHER_CHARGES', 'OTHER_CHARGES_OL', 'REQ_DELIVERY_DATE', 'TOTAL_AMOUNT_USD', 'WEEKEND', 'MTS_CTS', 'HOUR_OF_DAY', 'LOCKID', 'RETURN_FLAG']


In [5]:
cats = list( df.loc[:,df.dtypes == 'category'].columns.values)
categories={}
for col in cats:
    categories[col]= dict(enumerate(df[col].cat.categories))
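
The `categories` dictionary built above maps each integer category code back to its original label. A small sketch on made-up values shows how `.cat.categories`, `.cat.codes`, and that lookup table relate to one another; the series `carrier` is illustrative only.

```python
import pandas as pd

# Made-up carrier codes; the real column has many more rows.
carrier = pd.Series(
    ["STANDARD", "PREMIER_EVENING", "STANDARD"], dtype="category"
)

# Same construction as in the notebook: code -> label.
code_to_label = dict(enumerate(carrier.cat.categories))

# Integer codes pandas assigned to each row (string categories are sorted).
codes = carrier.cat.codes.tolist()

# Decoding the codes recovers the original labels.
decoded = [code_to_label[c] for c in codes]
print(code_to_label, codes, decoded)
```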

In [6]:
df.dtypes

BASKET_SIZE                   int64
EXTN_COMPOSITION           category
CARRIER_SERVICE_CODE_OL    category
CATEGORY                   category
COUNTRY_OF_ORIGIN_OI       category
DAY_OF_MONTH                  int64
DAY_OF_WEEK                category
DAY_OF_YEAR                   int64
EXTN_BRAND                 category
EXTN_DISCOUNT_ID           category
EXTN_IS_GIFT               category
EXTN_IS_PREORDER           category
EXTN_SHIP_TO_CITY          category
EXTN_SHIP_TO_COUNTRY       category
EXTN_SEASON                category
LIST_PRICE                    int64
MONTH_OF_YEAR                 int64
OTHER_CHARGES               float64
OTHER_CHARGES_OL            float64
REQ_DELIVERY_DATE             int64
TOTAL_AMOUNT_USD            float64
WEEKEND                       int64
ZIP_CODE                   category
MTS_CTS                       int64
HOUR_OF_DAY                   int64
LOCKID                        int64
RETURN_FLAG                   int64
dtype: object

### 2.3 Next, we find out how many orders were returned and how many were not returned.

In [7]:
df["RETURN_FLAG"].value_counts()

0    128487
1     24287
Name: RETURN_FLAG, dtype: int64

### Here we can see that there are ~24K orders that have been returned and ~128K orders that have not been returned. 
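
To put those counts in proportion, the short calculation below derives the return rate and the accuracy that a trivial "never returned" prediction would reach. The counts are copied from the `value_counts()` output above; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
not_returned = 128487
returned = 24287
total = not_returned + returned

return_rate = returned / total            # share of orders that were returned
majority_baseline = not_returned / total  # accuracy of always predicting 0

print(f"Return rate: {return_rate:.1%}")
print(f"Always-predict-0 accuracy: {majority_baseline:.1%}")
```

Any accuracy figure reported for this data set is easier to judge against that baseline.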

### 2.4 Let's split our data into training and test sets.

In [8]:
from sklearn.model_selection import train_test_split
X=(df.drop(["RETURN_FLAG"], axis=1))
y=df['RETURN_FLAG']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=42)
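
Because returned orders are the minority class, a purely random split can leave the training and test sets with slightly different return rates. `train_test_split` accepts a `stratify` argument that preserves the class proportions in both sets. The sketch below uses synthetic labels rather than the order data, so the names `X_toy` and `y_toy` are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 900 rows, 1 in 6 labelled as returned.
X_toy = np.arange(900).reshape(-1, 1)
y_toy = np.array([1] * 150 + [0] * 750)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=1/3., random_state=42, stratify=y_toy
)

# Both splits keep (approximately) the same share of returned orders.
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to passing `stratify=y` to the existing call.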

## 3.0 Install Custom Modules for the Pipeline Transformations

### 3.1 Let us now install the custom transformation library that we had uploaded to the project - CustTrans-0.2.zip

In [9]:
!pip install --upgrade /project_data/data_asset/CustTrans-0.2.zip

Processing /project_data/data_asset/CustTrans-0.2.zip
Collecting sklearn (from CustTrans==0.1)
  Downloading https://files.pythonhosted.org/packages/1e/7a/dbb3be0ce9bd5c8b7e3d87328e79063f8b263b2b1bfa4774cb1147bfcd3f/sklearn-0.0.tar.gz
Collecting scikit-multilearn (from CustTrans==0.1)
  Downloading https://files.pythonhosted.org/packages/bb/1f/e6ff649c72a1cdf2c7a1d31eb21705110ce1c5d3e7e26b2cc300e1637272/scikit_multilearn-0.2.0-py3-none-any.whl (89kB)
Building wheels for collected packages: CustTrans, sklearn
  Building wheel for CustTrans (setup.py) ... done
  Created wheel for CustTrans: filename=CustTrans-0.1-cp36-none-any.whl size=1801 sha256=b48653717755b2699ca30e934ce4b18b4dfa58287fee29217a111e7a174a6ec8
  Stored in directory: /home/wsuser/.cache/pip/wheels/d8/13/54/c87b5cac3899188ef9b3013bce4976e8726028e908d07643c6
  Building wheel for sklearn (setup.py) ... done
  Created wheel for

### 3.2 Next, we install the sklearn-pandas library

In [10]:
!pip install sklearn-pandas

Collecting sklearn-pandas
  Downloading https://files.pythonhosted.org/packages/1f/48/4e1461d828baf41d609efaa720d20090ac6ec346b5daad3c88e243e2207e/sklearn_pandas-1.8.0-py2.py3-none-any.whl
Installing collected packages: sklearn-pandas
Successfully installed sklearn-pandas-1.8.0


## 4.0 Build the model

### 4.1 Now, let us create the custom pipeline transformer which essentially is our model.

In [11]:
from CustomTransformer.CustTrans import TypeSelector,StringIndexer,ConvToCategorical

In [12]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn_pandas import DataFrameMapper


transformer = Pipeline([
   ('features', FeatureUnion(n_jobs=1, transformer_list=[
       # Part 1
       ('boolean', Pipeline([
           ('selector', TypeSelector('bool')),
       ])),  # booleans close

       ('numericals', Pipeline([
           ('selector', TypeSelector(np.number)),
           ('scaler', StandardScaler()),
       ])),
       # Part 2
       ('categoricals', Pipeline([
           ('convertor', ConvToCategorical()),
           ('selector', TypeSelector('category')),
           ('labeler', StringIndexer()),
           ('encoder', OneHotEncoder(handle_unknown='ignore')),
       ]))
       # categoricals close
   ])),  # features close
   ('clf' , RandomForestClassifier(n_estimators=30,criterion="entropy")),
    
])
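
The source of the CustTrans library is not shown in this notebook, so the following is only a guess at what a selector such as `TypeSelector` might do: keep the DataFrame columns that match a given dtype so that each branch of the `FeatureUnion` sees only its own kind of column. The class name `DtypeSelector` and its body are hypothetical and are not the library's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DtypeSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a dtype-based column selector."""

    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        # Nothing to learn; selection depends only on the column dtypes.
        return self

    def transform(self, X):
        return X.select_dtypes(include=[self.dtype])

# Made-up rows using a few of the notebook's column names.
toy = pd.DataFrame({
    "BASKET_SIZE": [1, 2],
    "TOTAL_AMOUNT_USD": [0.0, 251.19],
    "CATEGORY": pd.Series(["Slip", "Bikini"], dtype="category"),
})

numeric_part = DtypeSelector(np.number).fit_transform(toy)
category_part = DtypeSelector("category").fit_transform(toy)
print(list(numeric_part.columns), list(category_part.columns))
```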

### 4.2 Let's now fit the pipeline on the training data. This step trains the model.

In [13]:
import timeit
start_time = timeit.default_timer()
transformer.fit(X_train, y_train)
print("Time for model training",timeit.default_timer() - start_time)



Time for model training 104.29557862994261


### 4.3 Once training is complete, we can evaluate the accuracy of the model using the hold-out test data.

In [14]:
scores= transformer.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, scores)
accuracy



0.8874226804123712
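
Since returned orders are the minority class, it can help to look at precision and recall for `RETURN_FLAG = 1` alongside accuracy. The helper below is a plain-Python sketch on made-up labels; the function name `precision_recall` is an assumption, and `sklearn.metrics` offers `precision_score` and `recall_score` for the same purpose.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 1 = returned, 0 = not returned.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 0, 1, 1, 0, 0, 0]

precision, recall = precision_recall(y_true_toy, y_pred_toy)
print(precision, recall)
```

In the notebook the same helper could be called as `precision_recall(y_test, scores)`.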

## 5.0 Save and deploy the model to WML

### 5.1 Create a WML API client.

In [15]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

### Add in the credentials as per your IBM Cloud Pak for Data cluster.

Replace the `*****` placeholders for username and password with your IBM Cloud Pak for Data username and password. The value for `url` should match the URL of your IBM Cloud Pak for Data cluster.

In [16]:
wml_credentials = {
                    "url": "https://zen-cpd-zen.apps.marksturpak4.ibmcodetest.us",
                    "username": "*****",
                    "password": "*****",
                    "instance_id": "wml_local",
                    "version" : "2.5.0"
 }
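
To keep the password out of the notebook, the credentials can be read from environment variables instead of being typed into the cell. This is a minimal sketch: the variable names `CP4D_URL`, `CP4D_USERNAME`, and `CP4D_PASSWORD` and the helper `load_wml_credentials` are assumptions, while `instance_id` and `version` repeat the values used above.

```python
import os

REQUIRED_VARS = ("CP4D_URL", "CP4D_USERNAME", "CP4D_PASSWORD")  # hypothetical names

def load_wml_credentials(env=None):
    """Build the WML credentials dict from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "url": env["CP4D_URL"],
        "username": env["CP4D_USERNAME"],
        "password": env["CP4D_PASSWORD"],
        "instance_id": "wml_local",
        "version": "2.5.0",
    }

# Demonstration with a plain dict standing in for os.environ.
demo_env = {
    "CP4D_URL": "https://<your-cluster-host>",
    "CP4D_USERNAME": "demo_user",
    "CP4D_PASSWORD": "demo_password",
}
demo_credentials = load_wml_credentials(demo_env)
print(sorted(demo_credentials))
```

With the variables set in the notebook environment, `wml_credentials = load_wml_credentials()` would replace the literal dictionary.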

In [17]:
client = WatsonMachineLearningAPIClient(wml_credentials)

### Use the following cell to perform any clean up of previously created models, deployments and spaces.

In [18]:
# see if any spaces already exist
# client.spaces.list()

# set the default space before moving ahead
# client.set.default_space('<GUID of the space>')

# see if any stored models exist
# client.repository.list_models()
# client.repository.delete('<GUID of model to delete>')

# see if any deployments exist
# client.deployments.list()
# client.deployments.delete('<GUID of deployment to delete>')

# once the deployments and models are deleted, the space can be deleted
# client.spaces.delete('<GUID of the space>')

### Create a deployment space and set it as the default space to be used for deployments. If you would rather use an existing space (that was previously created), skip the code in the cell below and directly use the next cell to set the default space.

In [19]:
# Use this code to create a new deployment space.
space_details = client.spaces.store(meta_props={client.spaces.ConfigurationMetaNames.NAME: "ReturnPropensity_Space"})
space_id = client.spaces.get_uid(space_details)
print(space_id)

f7cc0abc-ad9d-4d83-82fa-8dfbec7f2f60


In [20]:
# Set the default space. If you have a previously created space that you'd like to use,
# pass that space's id instead of `space_id`, e.g. client.set.default_space('<GUID of the space>')
client.set.default_space(space_id)
print(client.deployments.list())

----  ----  -----  -------  -------------
GUID  NAME  STATE  CREATED  ARTIFACT_TYPE
----  ----  -----  -------  -------------
None


### 5.2 Before we deploy the model, let's create a custom Python runtime with our custom transformer library installed.

In [21]:
lib_meta = {
        client.runtimes.LibraryMetaNames.NAME: "CustomTransformers_v0.1",
        client.runtimes.LibraryMetaNames.DESCRIPTION: "CustomTransformers_v0.1",
        client.runtimes.LibraryMetaNames.FILEPATH: "/project_data/data_asset/CustTrans-0.2.zip",
        client.runtimes.LibraryMetaNames.VERSION: "1.0",
        client.runtimes.LibraryMetaNames.PLATFORM: {"name": "python", "versions": ["3.6"]}
    }
custom_library_details = client.runtimes.store_library(lib_meta)
custom_library_uid = client.runtimes.get_library_uid(custom_library_details)
print("Custom Library UID: " + custom_library_uid)

Custom Library UID: 0bff1d2c-4089-4ee1-ba6c-c82fbbb95714


In [22]:
runtimes_meta = {
    client.runtimes.ConfigurationMetaNames.NAME: "CustomTransformers_v0.1", 
    client.runtimes.ConfigurationMetaNames.DESCRIPTION: "CustomTransformers_v0.1", 
    client.runtimes.ConfigurationMetaNames.PLATFORM: { "name": "python", "version": "3.6" }, 
    client.runtimes.ConfigurationMetaNames.LIBRARIES_UIDS: [custom_library_uid]
}

In [23]:
runtime_details = client.runtimes.store(runtimes_meta)
runtime_details

{'metadata': {'id': '6e06aecf-37d4-4d34-bdcf-9edb2e017170',
  'guid': '6e06aecf-37d4-4d34-bdcf-9edb2e017170',
  'href': '/v4/runtimes/6e06aecf-37d4-4d34-bdcf-9edb2e017170',
  'created_at': '2020-02-18T16:24:12.035Z'},
 'entity': {'services': ['Training', 'Scoring'],
  'name': 'CustomTransformers_v0.1',
  'description': 'CustomTransformers_v0.1',
  'custom_libraries': [{'href': '/v4/libraries/0bff1d2c-4089-4ee1-ba6c-c82fbbb95714'}],
  'space': {'href': '/v4/spaces/f7cc0abc-ad9d-4d83-82fa-8dfbec7f2f60'},
  'system_defined': False,
  'platform': {'name': 'python', 'version': '3.6'}}}

In [24]:
runtime_uid = client.runtimes.get_uid(runtime_details)
print("Runtime UID: " + runtime_uid)

Runtime UID: 6e06aecf-37d4-4d34-bdcf-9edb2e017170


### 5.3 Now, let us store our model.

In [25]:
model_props = {client.repository.ModelMetaNames.NAME: "ReturnRiskPandas_v0.1",
               client.repository.ModelMetaNames.RUNTIME_UID: runtime_uid,
               client.repository.ModelMetaNames.TYPE: "scikit-learn_0.20"
              }

In [26]:
published_model = client.repository.store_model(model=transformer, meta_props=model_props,training_data=X_train, training_target=y_train)
published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)

In [27]:
import json
print(json.dumps(model_details, indent=2))

{
  "metadata": {
    "guid": "1b54c596-0578-4264-afb9-fdcc4b5a6c27",
    "id": "1b54c596-0578-4264-afb9-fdcc4b5a6c27",
    "modified_at": "2020-02-18T16:25:11.002Z",
    "created_at": "2020-02-18T16:24:13.002Z",
    "owner": "1000330999",
    "href": "/v4/models/1b54c596-0578-4264-afb9-fdcc4b5a6c27?space_id=f7cc0abc-ad9d-4d83-82fa-8dfbec7f2f60"
  },
  "entity": {
    "name": "ReturnRiskPandas_v0.1",
    "training_data_references": [
      {
        "location": {
          "bucket": "not_applicable"
        },
        "type": "fs",
        "connection": {
          "access_key_id": "not_applicable",
          "secret_access_key": "not_applicable",
          "endpoint_url": "not_applicable"
        },
        "schema": {
          "id": "1",
          "type": "DataFrame",
          "fields": [
            {
              "name": "BASKET_SIZE",
              "type": "int64"
            },
            {
              "name": "EXTN_COMPOSITION",
              "type": "category"
           

### 5.4 Finally, let's deploy the model.

In [28]:
metaProps = {
    client.deployments.ConfigurationMetaNames.NAME: "ReturnRiskPandas_CustomTransformers_v0.2",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

In [29]:
created_deployment = client.deployments.create(published_model_uid, metaProps)



#######################################################################################

Synchronous deployment creation for uid: '1b54c596-0578-4264-afb9-fdcc4b5a6c27' started

#######################################################################################


initializing........
ready


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='f0675444-1ed4-4f54-935c-29e27cb816f7'
------------------------------------------------------------------------------------------------




## 6.0 Test the model

### 6.1 Obtain the deployment_id and deployment_href for the model.

The deployment_id is required to score the model using the client.deployments.score() method in the WML API client.
The deployment_href can be used to build the URL for scoring the model via a cURL command. The scoring_url can be generated as `"<URL for your IBM Cloud Pak for Data cluster>" + <deployment_href>`

In [30]:
deployment_href = client.deployments.get_href(created_deployment)
print(deployment_href)
deployment_id = client.deployments.get_uid(created_deployment)
print(deployment_id)

/v4/deployments/f0675444-1ed4-4f54-935c-29e27cb816f7
f0675444-1ed4-4f54-935c-29e27cb816f7
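
Following the description above, the scoring URL is the cluster URL joined with the deployment href. The helper below is a small sketch that avoids a doubled or missing slash at the join; the function name `build_scoring_url` and the placeholder host are assumptions, and the href is the one printed above.

```python
def build_scoring_url(cluster_url, deployment_href):
    """Join the cluster URL and the deployment href with exactly one slash."""
    return cluster_url.rstrip("/") + "/" + deployment_href.lstrip("/")

scoring_url = build_scoring_url(
    "https://<your-cluster-host>/",
    "/v4/deployments/f0675444-1ed4-4f54-935c-29e27cb816f7",
)
print(scoring_url)
```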


### 6.2 Score the model using a sample payload.

In [31]:
scoring_payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA: [{
        "fields": [
            "BASKET_SIZE", "EXTN_COMPOSITION", "CARRIER_SERVICE_CODE_OL", "CATEGORY",
            "COUNTRY_OF_ORIGIN_OI", "DAY_OF_MONTH", "DAY_OF_WEEK", "DAY_OF_YEAR",
            "EXTN_BRAND", "EXTN_DISCOUNT_ID", "EXTN_IS_GIFT", "EXTN_IS_PREORDER",
            "EXTN_SHIP_TO_CITY", "EXTN_SHIP_TO_COUNTRY", "EXTN_SEASON", "LIST_PRICE",
            "MONTH_OF_YEAR", "OTHER_CHARGES", "OTHER_CHARGES_OL", "REQ_DELIVERY_DATE",
            "TOTAL_AMOUNT_USD", "WEEKEND", "ZIP_CODE", "MTS_CTS", "HOUR_OF_DAY", "LOCKID"
        ],
        "values": [[
            3, '91% Nylon, 9% Elastercell', 'STANDARD', 'Bikini', 'US', 18, 'Saturday', 322,
            'XYZAI', 'None', 'N', 'N', 'Los Angeles', 'US', 'FW17', 75, 11, 0.0, 0.0, 0,
            165.35, 1, 'Zipcode_401', 24, 19, 277
        ]]
    }]
}
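
The payload pairs one `fields` list with one or more rows of `values` given in the same column order. When orders arrive as dictionaries, a small helper can build that layout and catch missing columns early. The helper name `records_to_input` and the shortened three-column example are assumptions for illustration.

```python
def records_to_input(records, fields):
    """Turn a list of dicts into the {"fields": ..., "values": ...} layout.

    Raises KeyError if a record lacks one of the expected fields.
    """
    values = [[record[name] for name in fields] for record in records]
    return {"fields": list(fields), "values": values}

# Shortened example with three of the 26 feature columns.
example_fields = ["BASKET_SIZE", "CATEGORY", "TOTAL_AMOUNT_USD"]
orders = [
    {"CATEGORY": "Bikini", "BASKET_SIZE": 3, "TOTAL_AMOUNT_USD": 165.35},
]
input_item = records_to_input(orders, example_fields)
print(input_item)
```

The result would then be wrapped as `{client.deployments.ScoringMetaNames.INPUT_DATA: [input_item]}` before calling `client.deployments.score()`.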

In [32]:
prediction = client.deployments.score(deployment_id, scoring_payload)

In [33]:
prediction

{'predictions': [{'fields': ['prediction', 'probability'],
   'values': [[0, [0.6666666666666666, 0.3333333333333333]]]}]}

### The first field - prediction - indicates the model's prediction of whether the items in the sample payload will be returned (value of 1) or not (value of 0), matching the RETURN_FLAG encoding in the training data. The second field - probability - has 2 numeric values. The first is the probability of a prediction value of 0 and the second is the probability of a prediction value of 1. Here the model predicts 0 (not returned) with a probability of about 0.67.
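
The response can be unpacked into one dictionary per scored row by zipping the `fields` with each entry in `values`. The sketch below reuses the response printed above; the helper name `unpack_predictions` is an assumption, and the probabilities are read in class order 0 then 1, where 1 marks a returned order as in the RETURN_FLAG column.

```python
# Response copied from the cell output above.
prediction = {'predictions': [{'fields': ['prediction', 'probability'],
   'values': [[0, [0.6666666666666666, 0.3333333333333333]]]}]}

def unpack_predictions(response):
    """Yield one dict per scored row, keyed by the response's field names."""
    for block in response["predictions"]:
        for row in block["values"]:
            yield dict(zip(block["fields"], row))

rows = list(unpack_predictions(prediction))
first = rows[0]
prob_not_returned, prob_returned = first["probability"]
print(first["prediction"], round(prob_returned, 3))
```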