# Introduction to Spark lab, part 3: machine learning

In this notebook you'll learn how to create a model for purchase recommendations using the alternating least squares algorithm of the Spark machine learning library. Machine learning is based on algorithms that can learn from data without relying on rules-based programming.  

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E"
-Tom M. Mitchell

This notebook uses PySpark, the Python API for Spark, so some knowledge of Python is recommended.

If you are new to Spark, see the first two parts of this lab: 
 - <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/95811fca38af4ccbea8acf8658bedcfc" target="_blank" rel="noopener noreferrer">Introduction to Spark lab, part 1: Basic Concepts</a>
 - <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/5ad1c820f57809ddec9a040e37b2bd55" target="_blank" rel="noopener noreferrer">Introduction to Spark lab, part 2: Querying data</a>

## Spark machine learning library
The Spark machine learning library makes practical machine learning scalable and easy. The library consists of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering (this notebook!), dimensionality reduction, lower-level optimization primitives, and higher-level pipeline APIs.

The library has two packages:
- spark.mllib contains the original API that handles data in RDDs. It's in maintenance mode, but fully supported.
- spark.ml contains a newer API for constructing ML pipelines. It handles data in DataFrames. It's being actively enhanced.

## Alternating least squares algorithm
The alternating least squares (ALS) algorithm provides collaborative filtering between customers and products to find products that the customers might like, based on their previous purchases or ratings.

The ALS algorithm creates a matrix of all customers versus all products. Most cells in the matrix are empty, which means the customer hasn't bought that product. The ALS algorithm then fills in a predicted preference score for each product that a customer hasn't bought yet, based on similarities between customer purchases and similarities between products. The algorithm uses the least squares computation to minimize the estimation errors, and alternates between two steps: fixing the customer factors and solving for the product factors, then fixing the product factors and solving for the customer factors.
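
To make the alternation concrete, here is a minimal plain-Python sketch of the idea with a single latent factor per customer and per product, so each least squares solve is one division. It is not Spark's implementation: the toy `purchases` data, the `ALPHA` confidence weight, and the `LAMBDA` regularization value are all assumptions chosen for illustration.

```python
# Toy purchase data: customer -> set of purchased products (hypothetical values).
purchases = {
    "a": {"p1", "p2"},
    "b": {"p1", "p2", "p3"},
    "c": {"p4"},
}
customers = sorted(purchases)
products = sorted({p for items in purchases.values() for p in items})

ALPHA = 1.0   # extra confidence given to an observed purchase (assumed value)
LAMBDA = 0.1  # regularization strength (assumed value)

def pref(c, p):
    """1.0 if customer c bought product p, otherwise 0.0."""
    return 1.0 if p in purchases[c] else 0.0

def conf(c, p):
    """Confidence weight: observed purchases count more than empty cells."""
    return 1.0 + ALPHA * pref(c, p)

def loss(x, y):
    """Confidence-weighted squared error over all cells, plus regularization."""
    err = sum(conf(c, p) * (pref(c, p) - x[c] * y[p]) ** 2
              for c in customers for p in products)
    reg = LAMBDA * (sum(v * v for v in x.values()) + sum(v * v for v in y.values()))
    return err + reg

# One latent factor per customer (x) and per product (y).
x = {c: 1.0 for c in customers}
y = {p: 0.5 for p in products}
history = [loss(x, y)]

for _ in range(10):
    # Fix the product factors and solve for each customer factor.
    for c in customers:
        num = sum(conf(c, p) * pref(c, p) * y[p] for p in products)
        den = LAMBDA + sum(conf(c, p) * y[p] ** 2 for p in products)
        x[c] = num / den
    # Fix the customer factors and solve for each product factor.
    for p in products:
        num = sum(conf(c, p) * pref(c, p) * x[c] for c in customers)
        den = LAMBDA + sum(conf(c, p) * x[c] ** 2 for c in customers)
        y[p] = num / den
    history.append(loss(x, y))

# Predicted preference of customer "a" for a product they haven't bought.
score = x["a"] * y["p3"]
```

Each half-step solves its own least squares problem exactly, so the values in `history` never increase. In this toy data, customer `a` ends up with a higher score for `p3` (bought by the similar customer `b`) than for `p4` (bought only by the unrelated customer `c`).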

You don't, however, need to understand how the ALS algorithm works to use it! Spark machine learning algorithms have default values that work well in most cases.

## Table of contents

1. [Get the data](#getdata)<br>
2. [Prepare and shape the data](#prepare)<br>
    2.1 [Format the data](#prepare1)<br>
    2.2 [Clean the data](#prepare2)<br>
    2.3 [Create a DataFrame](#prepare3)<br>
    2.4 [Remove unneeded columns](#prepare4)<br>
3. [Split the data into three sets](#split)<br>
4. [Build recommendation models](#model)<br>
5. [Test the models](#test)<br>
    5.1 [Clean the cross-validation data set](#test1)<br>
    5.2 [Run the models on the cross-validation data set](#test2)<br>
    5.3 [Calculate the accuracy for each model](#test3)<br>
    5.4 [Confirm the best model](#test4)<br>
6. [Implement the model](#implement)<br>
    6.1 [Create a DataFrame for the customer and all products](#implement1)<br>
    6.2 [Rate each product](#implement2)<br>
    6.3 [Find the top recommendations](#implement3)<br>
    6.4 [Compare purchased and recommended products](#implement4)<br>
7. [Summary and next steps](#summary)

<a id="getdata"></a>
## 1. Get the data 
The data set contains the transactions of an online retailer of gift items for the period from 1 December 2010 to 9 December 2011. Many of the customers are wholesalers.

You'll be using a slightly modified version of UCI's <a href="http://archive.ics.uci.edu/ml/datasets/Online+Retail" target="_blank" rel="noopener noreferrer">Online Retail Data Set</a>.  

Here's a glimpse of the data:

<img src='https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/FullFile.png' width="80%" height="80%"></img>

Download the CSV version of the data set, from which commas in the product descriptions are removed:

In [1]:
!rm 'OnlineRetail.csv.gz' -f
!wget https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/OnlineRetail.csv.gz

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190527201552-0001
KERNEL_ID = 99334292-0b37-459c-b178-6fc5e5b202e2
--2019-05-27 20:15:57--  https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/OnlineRetail.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7483128 (7.1M) [application/octet-stream]
Saving to: 'OnlineRetail.csv.gz'


2019-05-27 20:15:57 (73.8 MB/s) - 'OnlineRetail.csv.gz' saved [7483128/7483128]



Put the data into an RDD and print the first 5 rows:

In [2]:
loadRetailData = sc.textFile("OnlineRetail.csv.gz")

for row in loadRetailData.take(5):
    print (row)

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/10 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/10 8:26,3.39,17850,United Kingdom


Each row in the RDD is a string that corresponds to a line in the CSV file.

<a id="prepare"></a>
## 2. Prepare and shape the data

It's been said that preparing and shaping data is 80% of a data scientist's job. Having the right data in the right format is critical for getting accurate results.

To get the data ready, complete these tasks:

1. [Format the data](#prepare1)
1. [Clean the data](#prepare2)
1. [Create a DataFrame](#prepare3)
1. [Remove unneeded columns](#prepare4)

<a id="prepare1"></a>
### 2.1 Format the data
Remove the header from the RDD and split each row's string on commas:

In [3]:
header = loadRetailData.first()
loadRetailData = loadRetailData.filter(lambda line: line != header).\
                            map(lambda l: l.split(","))

for row in loadRetailData.take(5):
    print (row)

['536365', '85123A', 'WHITE HANGING HEART T-LIGHT HOLDER', '6', '12/1/10 8:26', '2.55', '17850', 'United Kingdom']
['536365', '71053', 'WHITE METAL LANTERN', '6', '12/1/10 8:26', '3.39', '17850', 'United Kingdom']
['536365', '84406B', 'CREAM CUPID HEARTS COAT HANGER', '8', '12/1/10 8:26', '2.75', '17850', 'United Kingdom']
['536365', '84029G', 'KNITTED UNION FLAG HOT WATER BOTTLE', '6', '12/1/10 8:26', '3.39', '17850', 'United Kingdom']
['536365', '84029E', 'RED WOOLLY HOTTIE WHITE HEART.', '6', '12/1/10 8:26', '3.39', '17850', 'United Kingdom']


<a id="prepare2"></a>
### 2.2 Clean the data
Remove the rows that have incomplete data. Keep only the rows that meet the following criteria:
 - The purchase quantity is greater than 0 
 - The customer ID is not blank
 - The stock code is not blank after you remove non-numeric characters

In [4]:
import re

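# Keep rows with a positive quantity, a stock code that contains digits, and a customer ID.
# The raw string r"\D" avoids Python's invalid-escape-sequence warning.
loadRetailData = loadRetailData.filter(lambda l: int(l[3]) > 0\
                                and len(re.sub(r"\D", "", l[1])) != 0 \
                                and len(l[6]) != 0)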

print (loadRetailData.take(2))

[['536365', '85123A', 'WHITE HANGING HEART T-LIGHT HOLDER', '6', '12/1/10 8:26', '2.55', '17850', 'United Kingdom'], ['536365', '71053', 'WHITE METAL LANTERN', '6', '12/1/10 8:26', '3.39', '17850', 'United Kingdom']]
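
If you want to see what the three checks do without starting Spark, the following plain-Python sketch applies the same conditions to a few hand-made rows. The `keep_row` helper is hypothetical, and every row except the first is invented for illustration.

```python
import re

def keep_row(fields):
    """Return True if a split CSV row passes the three cleaning checks."""
    stock_code, quantity, customer_id = fields[1], fields[3], fields[6]
    return (
        int(quantity) > 0                            # quantity is positive
        and len(re.sub(r"\D", "", stock_code)) != 0  # stock code contains digits
        and len(customer_id) != 0                    # customer ID is present
    )

# Rows in the same column order as the CSV file.
rows = [
    ["536365", "85123A", "WHITE HANGING HEART T-LIGHT HOLDER", "6", "12/1/10 8:26", "2.55", "17850", "United Kingdom"],
    ["000001", "POST", "SAMPLE POSTAGE ROW", "1", "12/1/10 8:45", "18.00", "12583", "France"],
    ["000002", "22139", "SAMPLE ITEM", "-2", "12/1/10 9:00", "4.25", "17850", "United Kingdom"],
    ["000003", "22139", "SAMPLE ITEM", "3", "12/1/10 9:00", "4.25", "", "United Kingdom"],
]

kept = [row for row in rows if keep_row(row)]
```

Only the first row survives: the second has a stock code with no digits, the third has a negative quantity, and the fourth has no customer ID.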


<a id="prepare3"></a>
### 2.3 Create a DataFrame

First, create an SQLContext and map each line to a row: 

In [5]:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

#Convert each line to a Row.
loadRetailData = loadRetailData.map(lambda l: Row(inv=int(l[0]),\
                                    stockCode=int(re.sub(r"\D", "", l[1])),\
                                    description=l[2],\
                                    quant=int(l[3]),\
                                    invDate=l[4],\
                                    price=float(l[5]),\
                                    custId=int(l[6]),\
                                    country=l[7]))

Create a DataFrame and show the inferred schema:

In [6]:
retailDf = sqlContext.createDataFrame(loadRetailData)
# printSchema() prints the schema itself and returns None, so don't wrap it in print().
retailDf.printSchema()

root
 |-- country: string (nullable = true)
 |-- custId: long (nullable = true)
 |-- description: string (nullable = true)
 |-- inv: long (nullable = true)
 |-- invDate: string (nullable = true)
 |-- price: double (nullable = true)
 |-- quant: long (nullable = true)
 |-- stockCode: long (nullable = true)


Register the DataFrame as a table so that you can run SQL queries on it and show the first two rows:

In [7]:
retailDf.registerTempTable("retailPurchases")
sqlContext.sql("SELECT * FROM retailPurchases limit 2").toPandas()

| | country | custId | description | inv | invDate | price | quant | stockCode |
|---|---|---|---|---|---|---|---|---|
| 0 | United Kingdom | 17850 | WHITE HANGING HEART T-LIGHT HOLDER | 536365 | 12/1/10 8:26 | 2.55 | 6 | 85123 |
| 1 | United Kingdom | 17850 | WHITE METAL LANTERN | 536365 | 12/1/10 8:26 | 3.39 | 6 | 71053 |


<a id="prepare4"></a>
### 2.4 Remove unneeded columns
The only columns you need are `custId`, `stockCode`, and a new column, `purch`, which has a value of 1 to indicate that the customer purchased the product:

In [8]:
query = """
SELECT
    custId, stockCode, 1 AS purch
FROM
    retailPurchases
GROUP BY
    custId, stockCode"""
retailDf = sqlContext.sql(query)
retailDf.registerTempTable("retailDf")

sqlContext.sql("select * from retailDf limit 10").toPandas()

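| | custId | stockCode | purch |
|---|---|---|---|
| 0 | 18074 | 22224 | 1 |
| 1 | 13705 | 21889 | 1 |
| 2 | 15862 | 22441 | 1 |
| 3 | 15862 | 21592 | 1 |
| 4 | 12838 | 22739 | 1 |
| 5 | 12838 | 22149 | 1 |
| 6 | 14078 | 22548 | 1 |
| 7 | 14078 | 22423 | 1 |
| 8 | 12433 | 21977 | 1 |
| 9 | 14696 | 84360 | 1 |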
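
The effect of the `group by` is that a customer who bought the same product on several invoices still gets a single row. Here's a small plain-Python sketch of that idea; the pairs are based on the sample rows shown earlier, and the repeated pair is made up for illustration.

```python
# (custId, stockCode) pairs as they might appear across several invoices.
purchases = [(17850, 85123), (17850, 71053), (17850, 85123)]

# A set keeps one copy of each pair, like grouping by custId and stockCode.
unique_pairs = sorted(set(purchases))

# Attach the constant purch value of 1 to each remaining pair.
rows = [(cust_id, stock_code, 1) for cust_id, stock_code in unique_pairs]
```

Because every row gets the same `purch` value, the model sees only whether a customer bought a product, not how often or how many.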


<a id="split"></a>
## 3. Split the data into three sets
You'll split the data into three sets: 
 - a testing data set (10% of the data)
 - a cross-validation data set (10% of the data)
 - a training data set (80% of the data)

Split the data randomly and create a DataFrame for each data set:

In [9]:
testDf, cvDf, trainDf = retailDf.randomSplit([.1,.1,.8],1)

print ("trainDf count: ", trainDf.count(), " example: ")
for row in trainDf.take(2): print (row)
print ()

print ("cvDf count: ", cvDf.count(), " example: ")
for row in cvDf.take(2): print (row)
print ()

print ("testDf count: ", testDf.count(), " example: ")
for row in testDf.take(2): print (row)
print ()

trainDf count:  208123  example: 
Row(custId=12359, stockCode=23345, purch=1)
Row(custId=12363, stockCode=20685, purch=1)

cvDf count:  25876  example: 
Row(custId=12349, stockCode=23545, purch=1)
Row(custId=12388, stockCode=22960, purch=1)

testDf count:  26113  example: 
Row(custId=12362, stockCode=22372, purch=1)
Row(custId=12391, stockCode=20985, purch=1)



<a id="model"></a>
## 4. Build recommendation models
Machine learning algorithms have standard parameters and hyperparameters. Standard parameters specify data and options. Hyperparameters control the performance of the algorithm.

The ALS algorithm has these hyperparameters:  

 - The `rank` hyperparameter represents the number of latent factors (features) that the model learns for each customer and each product. The default value of `rank` is 10.
 - The `maxIter` hyperparameter represents the number of iterations to run the least squares computation. The default value of `maxIter` is 10.

Use the training DataFrame to train three models with the ALS algorithm, each with different values for the `rank` and `maxIter` hyperparameters. Assign the `userCol`, `itemCol`, and `ratingCol` parameters to the appropriate data columns. Set the `implicitPrefs` parameter to `True` so that the algorithm treats the `purch` values as implicit feedback (evidence that a customer is interested in a product) rather than as explicit ratings.

In [10]:
from pyspark.ml.recommendation import ALS

als1 = ALS(rank=3, maxIter=15,userCol="custId",itemCol="stockCode",ratingCol="purch",implicitPrefs=True)
model1 = als1.fit(trainDf)

als2 = ALS(rank=15, maxIter=3,userCol="custId",itemCol="stockCode",ratingCol="purch",implicitPrefs=True)
model2 = als2.fit(trainDf)

als3 = ALS(rank=15, maxIter=15,userCol="custId",itemCol="stockCode",ratingCol="purch",implicitPrefs=True)
model3 = als3.fit(trainDf)

print ("The models are trained")

The models are trained


<a id="test"></a>
## 5. Test the models

First, test the three models on the cross-validation data set, and then confirm the best model on the testing data set.

You'll know the model is accurate when the prediction values for products that the customers have already bought are close to 1. 

<a id="test1"></a>
### 5.1 Clean the cross-validation data set

Remove any of the customers or products in the cross-validation data set that are not in the training data set:

In [11]:
# Collect the customer IDs and stock codes that appear in the training data set.
customers = set(trainDf.rdd.map(lambda line: line.custId).collect())
stock = set(trainDf.rdd.map(lambda line: line.stockCode).collect())

print (cvDf.count())
cvDf = cvDf.rdd.filter(lambda line: line.stockCode in stock and\
                                           line.custId in customers).toDF()
print (cvDf.count())

25876
25846
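
The reason for this step is that the model learns factors only for the customers and products it saw during training, so it has nothing to base a prediction on for anyone or anything else. (By default, Spark's `ALS` model returns `NaN` predictions for such rows, which would spoil an average.) Here's the same filter as a plain-Python sketch on a few toy rows; the rows are invented for illustration.

```python
# Toy (custId, stockCode, purch) rows.
train = [(12359, 23345, 1), (12363, 20685, 1)]
cv = [
    (12359, 20685, 1),  # customer and product both appear in training
    (12349, 23345, 1),  # customer 12349 never appears in training
    (12363, 22960, 1),  # product 22960 never appears in training
]

known_customers = {cust_id for cust_id, _, _ in train}
known_products = {stock_code for _, stock_code, _ in train}

# Keep only rows whose customer and product are both known to the model.
cv_clean = [
    row for row in cv
    if row[0] in known_customers and row[1] in known_products
]
```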


<a id="test2"></a>
### 5.2 Run the models on the cross-validation data set
Run each model on the cross-validation DataFrame by using the `transform` method and print the first two rows of each set of predictions:

In [12]:
predictions1 = model1.transform(cvDf)
predictions2 = model2.transform(cvDf)
predictions3 = model3.transform(cvDf)

print (predictions1.take(2))
print (predictions2.take(2))
print (predictions3.take(2))

[Row(custId=14606, stockCode=20735, purch=1, prediction=0.02294829487800598), Row(custId=16464, stockCode=20735, purch=1, prediction=0.00998256541788578)]
[Row(custId=14606, stockCode=20735, purch=1, prediction=0.0441482812166214), Row(custId=16464, stockCode=20735, purch=1, prediction=0.004716672468930483)]
[Row(custId=14606, stockCode=20735, purch=1, prediction=0.10467907041311264), Row(custId=16464, stockCode=20735, purch=1, prediction=0.0019032559357583523)]


<a id="test3"></a>
### 5.3 Calculate the accuracy for each model  

You'll use the mean squared error calculation to determine accuracy by comparing the prediction values for products to the actual purchase values. Remember that if a customer purchased a product, the value in the `purch` column is 1. The mean squared error calculation measures the average of the squares of the errors between what is estimated and the existing data. The lower the mean squared error value, the more accurate the model. 
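
To see the arithmetic on a small scale, this sketch computes the mean squared error by hand for just the first two model3 predictions printed in section 5.2. Two rows are far too few to judge a model, so the result is only an illustration of the calculation.

```python
# (purch, prediction) pairs copied from the first two model3 rows in section 5.2.
pairs = [
    (1, 0.10467907041311264),
    (1, 0.0019032559357583523),
]

# Square each difference between the actual value and the prediction.
squared_errors = [(purch - prediction) ** 2 for purch, prediction in pairs]

# The mean squared error is the average of the squared differences.
mse = sum(squared_errors) / len(squared_errors)
print("Mean squared error = %.4f" % mse)
```

Both predictions are close to 0 while the actual value is 1, so the error for these two rows is high.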

For all predictions, subtract the prediction from the actual purchase value (1), square the result, and calculate the mean of all of the squared differences:

In [13]:
meanSquaredError1 = predictions1.rdd.map(lambda line: (line.purch - line.prediction)**2).mean()
meanSquaredError2 = predictions2.rdd.map(lambda line: (line.purch - line.prediction)**2).mean()
meanSquaredError3 = predictions3.rdd.map(lambda line: (line.purch - line.prediction)**2).mean()
    
print ('Mean squared error = %.4f for our first model' % meanSquaredError1)
print ('Mean squared error = %.4f for our second model' % meanSquaredError2)
print ('Mean squared error = %.4f for our third model' % meanSquaredError3)

Mean squared error = 0.7393 for our first model
Mean squared error = 0.7011 for our second model
Mean squared error = 0.6683 for our third model


The third model (model3) has the lowest mean squared error value, so it's the most accurate.
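
If you try more than a handful of hyperparameter combinations, you can let Python pick the winner instead of reading the numbers yourself. A minimal sketch, using the three error values printed above (the `cv_mse` dictionary is just an illustrative way to hold them):

```python
# Cross-validation mean squared error for each model, copied from the output above.
cv_mse = {"model1": 0.7393, "model2": 0.7011, "model3": 0.6683}

# min() with key=cv_mse.get returns the name whose error is smallest.
best_name = min(cv_mse, key=cv_mse.get)
```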

Notice that of the three models, model3 has the highest values for the hyperparameters. At this point you might be tempted to run the model with even higher values for `rank` and `maxIter`. However, you might not get better results. Increasing the values of the hyperparameters increases the time for the model to run. Also, you don't want to overfit the model so that it exactly fits the original data. In that case, you wouldn't get any recommendations! For best results, keep the values of the hyperparameters close to the defaults.

<a id="test4"></a>
### 5.4 Confirm the best model 

Now run model3 on the testing data set to confirm that it's the best model. You want to make sure that the model is not overfitted to the cross-validation data. It's possible for a model to match one subset of the data well but not another. If the values of the mean squared error for the testing data set and the cross-validation data set are close, then you've confirmed that the model performs consistently on data it wasn't trained on.

Clean the testing data set, run model3 on the testing data set, and calculate the mean squared error:

In [14]:
filteredTestDf = testDf.rdd.filter(lambda line: line.stockCode in stock and\
                                              line.custId in customers).toDF()
predictions4 = model3.transform(filteredTestDf)
meanSquaredError4 = predictions4.rdd.map(lambda line: (line.purch - line.prediction)**2).mean()
    
print ('Mean squared error = %.4f for our best model' % meanSquaredError4)

Mean squared error = 0.6693 for our best model


That's pretty close to the cross-validation value of 0.6683, so the model performs consistently on data it wasn't trained on.

<a id="implement"></a>
## 6. Implement the model

Use the best model to predict which products a specific customer might be interested in purchasing.

<a id="implement1"></a>
### 6.1 Create a DataFrame for the customer and all products 

Create a DataFrame in which each row has the customer ID (15544) and a product ID:

In [15]:
from pyspark.sql.functions import lit

stock15544 = set(trainDf.filter(trainDf['custId'] == 15544).rdd.map(lambda line: line.stockCode).collect())

userItems = trainDf.select("stockCode").distinct().\
            withColumn('custId', lit(15544)).\
            rdd.filter(lambda line: line.stockCode not in stock15544).toDF()

for row in userItems.take(5):
    print (row.stockCode, row.custId)

21899 15544
22429 15544
22201 15544
22165 15544
21209 15544
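
The logic of that cell is a set difference: take every product in the training data and remove the ones the customer already bought. Here it is as a plain-Python sketch with made-up IDs (customer 100 and products 1 to 4 are hypothetical).

```python
# Toy training rows as (custId, stockCode) pairs.
train_pairs = [(100, 1), (100, 2), (200, 2), (200, 3), (300, 4)]
target = 100

already_bought = {stock for cust, stock in train_pairs if cust == target}
all_products = {stock for _, stock in train_pairs}

# Pair the target customer with every product they have not bought yet.
candidates = sorted((stock, target) for stock in all_products - already_bought)
```

Customer 100 already bought products 1 and 2, so only products 3 and 4 remain as candidates for recommendation.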


<a id="implement2"></a>
### 6.2 Rate each product

Run the `transform` function to create a prediction value for each product:

In [16]:
userItems = model3.transform(userItems)

for row in userItems.take(5):
    print (row.stockCode, row.custId, row.prediction)

20735 15544 0.003406377974897623
21220 15544 0.06270085275173187
21700 15544 0.05252227559685707
22097 15544 -0.025320032611489296
22223 15544 0.02550472691655159


<a id="implement3"></a>
### 6.3 Find the top recommendations

Print the top five product recommendations:

In [17]:
userItems.registerTempTable("predictions")
query = "select * from predictions order by prediction desc limit 5"

sqlContext.sql(query).toPandas()

| | stockCode | custId | prediction |
|---|---|---|---|
| 0 | 21242 | 15544 | 0.57396 |
| 1 | 22417 | 15544 | 0.531522 |
| 2 | 21987 | 15544 | 0.508154 |
| 3 | 22367 | 15544 | 0.497426 |
| 4 | 21122 | 15544 | 0.494572 |
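
The SQL query sorts by prediction and keeps the highest values. As a rough plain-Python equivalent, this sketch picks the top two of the five sample predictions printed in section 6.2 (not of the full prediction set, so the products differ from the table above).

```python
import heapq

# (stockCode, prediction) pairs copied from the five sample rows in section 6.2.
scored = [
    (20735, 0.003406377974897623),
    (21220, 0.06270085275173187),
    (21700, 0.05252227559685707),
    (22097, -0.025320032611489296),
    (22223, 0.02550472691655159),
]

# nlargest returns the items with the highest predictions, best first.
top_two = heapq.nlargest(2, scored, key=lambda pair: pair[1])
```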


<a id="implement4"></a>
### 6.4 Compare purchased and recommended products

Here's a view of the products that customer 15544 bought:

<img src='https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/user.png' width="80%" height="80%"></img>

This customer bought lots of children's gifts and some holiday items. 

Look at the descriptions of the recommended products to see if they are in the same categories.

<div class="alert alert-block alert-info">Note: The ALS algorithm uses some randomness, so the recommendations you get might be different from these.</div>

In [18]:
stockItems = sqlContext.sql("select distinct stockCode, description from retailPurchases")
stockItems.registerTempTable("stockItems")

query = """
select 
    predictions.*,
    stockItems.description
from
    predictions
inner join stockItems on
    predictions.stockCode = stockItems.stockCode
order by predictions.prediction desc
limit 10
"""
sqlContext.sql(query).toPandas()

| | stockCode | custId | prediction | description |
|---|---|---|---|---|
| 0 | 21242 | 15544 | 0.57396 | RED RETROSPOT PLATE |
| 1 | 22417 | 15544 | 0.531522 | PACK OF 60 SPACEBOY CAKE CASES |
| 2 | 21987 | 15544 | 0.508154 | PACK OF 6 SKULL PAPER CUPS |
| 3 | 22367 | 15544 | 0.497426 | CHILDRENS APRON SPACEBOY DESIGN |
| 4 | 21122 | 15544 | 0.494572 | SET/10 PINK POLKADOT PARTY CANDLES |
| 5 | 22326 | 15544 | 0.486443 | ROUND SNACK BOXES SET OF4 WOODLAND |
| 6 | 21975 | 15544 | 0.485125 | PACK OF 60 DINOSAUR CAKE CASES |
| 7 | 22554 | 15544 | 0.477713 | PLASTERS IN TIN WOODLAND ANIMALS |
| 8 | 21559 | 15544 | 0.474228 | STRAWBERRY LUNCH BOX WITH CUTLERY |
| 9 | 21124 | 15544 | 0.472176 | SET/10 BLUE POLKADOT PARTY CANDLES |


The recommended products look pretty similar to the purchased products, and, in some cases, are actually the same. Your model works!

<a id="summary"></a>
## 7. Summary and next steps
You created a predictive model that makes product recommendations for customers and verified that it works.

Dig deeper:
 - <a href="http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html" target="_blank" rel="noopener noreferrer">Collaborative Filtering</a>
 - <a href="http://spark.apache.org/docs/latest/ml-guide.html" target="_blank" rel="noopener noreferrer">Spark Machine Learning Library (MLlib) Guide</a>
 - <a href="http://spark.apache.org/docs/latest/api/python/index.html" target="_blank" rel="noopener noreferrer">Spark Python API Docs</a>


### Authors
**Carlo Appugliese** is a Spark and Hadoop evangelist at IBM.<br>
**Braden Callahan** is a Big Data Technical Specialist for IBM.<br>
**Ross Lewis** is a Big Data Technical Sales Specialist for IBM.<br>
**Mokhtar Kandil** is a World Wide Big Data Technical Specialist for IBM.


## Data citation
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Chen, D. (2012). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 

<hr>
Copyright &copy; IBM Corp. 2017-2019. This notebook and its source code are released under the terms of the MIT License.
