# Current BPR + LR (linear combination) for CG recommendations

In [1]:
import os
import pandas as pd
import numpy as np

office = pd.read_csv('/Users/Yiteng/venv/rfm-project/data/SM-office.csv', sep=',', engine='c')

For now, we take the office (OFM) data as the example. The commented-out code in the cells below shows how the original data were segmented into specific categories.

In [2]:
productfile1 = '/Users/Yiteng/venv/rfm-project/data/og_data/ProductMaster.csv'
productfile2 = '/Users/Yiteng/venv/rfm-project/data/og_data/ProductMaster_Tops.csv'
product1 = pd.read_csv(productfile1, sep=',', engine='c')
product2 = pd.read_csv(productfile2, sep=',', engine='c')

Data Segmentation according to the meeting with Vijay.

In [3]:
# office = sub_segment_df[sub_segment_df.BUID.isin([15,56])]
# non_office = sub_segment_df[~sub_segment_df.BUID.isin([15,56])]

# sub_segment_df = orders[orders.SKUCode.isin(product1.SKUID)]
# sub_segment_df.to_csv('ShoppingMall.csv', index=False)
# sub_segment_df2 = orders[orders.SKUCode.isin(product2.SKUID)]
# sub_segment_df2.to_csv('Tops_supermarket.csv', index=False)
# sub_segment_df3 = orders[~orders.SKUCode.isin(product1.SKUID) & ~orders.SKUCode.isin(product2.SKUID)]
# sub_segment_df3.to_csv('Problematic.csv', index=False)

# sub_segment_df.TicketNumber.drop_duplicates()

Taking "office" data for consideration in this notebook, the transaction data looks like:

In [11]:
data = office
data.head(4)

Unnamed: 0,TypeGroup,BUID,BranchID,TransactionDate,CustomerID,CardNo,TicketNumber,SKUCode,Spending,DeptCode,SubDeptCode,QTY,tDate2
0,1,56,1555,2016-12-09 15:28:00,D0A9EA46-B77A-4CA3-A0B6-A09C35B8CFCF,2201014718,1612032157,315019,795.99,555.0,552.0,4,2016-12-09
1,1,56,1555,2016-08-04 22:33:00,D0A9EA46-B77A-4CA3-A0B6-A09C35B8CFCF,2201014718,1608022955,159515,279.36,540.0,403.0,4,2016-08-04
2,1,15,548,2016-10-23 11:51:00,D4B95E99-58C3-4749-A5E2-AE36CC9447BC,8054949528,301297920,152657,35.0,530.0,313.0,1,2016-10-23
3,1,15,877,2016-09-26 14:23:00,E51E941C-76C8-423B-A3CA-CB13284756B6,8018323043,102000910,317279,60.0,540.0,403.0,1,2016-09-26


### Bayesian Personalized Ranking (BPR) Algorithm

Retrieve the user and item information from the data; the focus is on matching each user with the items that user purchased.

In [4]:
from theano_bpr import BPR

idx = pd.Categorical(list(data.CustomerID)).codes
itm = pd.Categorical(list(data.SKUCode)).codes

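# match each ID with its integer index (the "series number");
# list() keeps the pairs reusable and sliceable, since a bare zip is a one-shot iterator on Python 3
idx_match = list(zip(data.CustomerID, idx))
itm_match = list(zip(data.SKUCode, itm))

# one (user index, item index) pair per transaction row
train_set = list(zip(idx, itm))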

bpr = BPR(20, idx.max()+1, itm.max()+1)
bpr.train(train_set, epochs=20)

Generating 1542580 random training samples
Processed 1542000 ( 99.96% ) in 0.0257 seconds
Total training time 42.52 seconds; 2.756146e-05 per sample
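
The index encoding used above can be illustrated on a toy list. This is a minimal sketch with made-up customer IDs (the names `customer_ids`, `id_to_index` and `index_to_id` are illustrative only): `pd.Categorical` sorts the unique values, and each row's code is the position of its value in that sorted list.

```python
import pandas as pd

# Toy customer IDs; the notebook itself uses data.CustomerID.
customer_ids = ["C-b", "C-a", "C-b", "C-c"]

cat = pd.Categorical(customer_ids)
codes = cat.codes                    # one integer index per row
categories = list(cat.categories)    # sorted unique IDs; position == code

# Forward and reverse lookups between ID and index.
id_to_index = dict(zip(customer_ids, codes))
index_to_id = dict(enumerate(categories))

n_users = codes.max() + 1            # same quantity as idx.max()+1 above
```

Because the codes are contiguous from 0, `codes.max() + 1` equals the number of distinct IDs, which is why it can be passed as the user (or item) count.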


The result so far is a matrix holding the "willingness" score of each user to purchase each item.

Hence the shape of the matrix is (#users, #items).

In [5]:
res_bpr = bpr.predictions(range(idx.max()+1))
print res_bpr
print res_bpr.shape

[[ 2.7540555   3.56475544  2.66798973 ...,  6.3465724   3.92303658
   3.40365911]
 [ 1.34239006  2.08928657  2.19591284 ...,  3.75435495  2.13423014
   1.02919364]
 [-0.03819942  1.47356796  0.38009506 ...,  0.07140817  0.71647155
   0.38269472]
 ..., 
 [ 2.78549528  2.68791413  2.49471378 ...,  5.76980734  3.50338292
   2.89824772]
 [ 2.21514463  0.4706493   3.82976198 ...,  4.31011057  3.15388608
   2.15237689]
 [ 1.48023415  1.269207    3.20969653 ...,  4.39368868  3.47753906
   2.56968951]]
(3557, 10641)
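
For intuition about where a (#users, #items) score matrix comes from, here is a sketch under the assumption of the usual matrix-factorization form of BPR. The factor matrices `W`, `H` and the bias vector `B` are hypothetical stand-ins with random values; the internals of `theano_bpr` are not shown in this notebook.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_items, rank = 4, 6, 20      # rank 20 mirrors BPR(20, ...) above

W = rng.normal(size=(n_users, rank))   # hypothetical user factors
H = rng.normal(size=(n_items, rank))   # hypothetical item factors
B = rng.normal(size=n_items)           # hypothetical per-item biases

# Score of user u for item i is the dot product of their factors plus the item bias.
scores = W.dot(H.T) + B                # shape (n_users, n_items)
```

Only the ordering of the scores within a user's row matters for ranking, so the raw values are not probabilities.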


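As a simple performance check, the cell below trains on the first 90% of the (user, item) pairs and reports a rough AUC on the remaining 10%.

For the recommendation itself the AUC is surely higher (above 0.9), because that model is scored on the same historical data it was trained on and therefore overfits it.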

In [6]:
import math

bpr_show = BPR(20, idx.max()+1, itm.max()+1)
bpr_show.train(train_set[:int(math.floor(len(train_set) * 0.9))], epochs=20)
test_set = train_set[int(math.floor(len(train_set) * 0.9)):]
bpr_show.test(test_set)

Generating 1388320 random training samples
Processed 1388000 ( 99.98% ) in 0.0274 seconds
Total training time 38.93 seconds; 2.803902e-05 per sample
Current AUC mean (1700 samples): 0.77218


0.7727383754848931
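
As a reminder of what the AUC measures in this setting, here is a sketch of the standard per-user definition: the share of (purchased, not purchased) item pairs that the scores rank in the right order. The helper `user_auc` and the toy scores are illustrative; how `theano_bpr` samples and averages its AUC is not shown here.

```python
import numpy as np

def user_auc(scores, positives):
    """AUC for one user: share of (positive, negative) item pairs ranked correctly."""
    positives = set(positives)
    negatives = [j for j in range(len(scores)) if j not in positives]
    correct = sum(bool(scores[i] > scores[j]) for i in positives for j in negatives)
    return correct / float(len(positives) * len(negatives))

scores = np.array([0.9, 0.1, 0.5, 0.3])    # toy scores for four items
auc = user_auc(scores, positives=[0, 3])   # items 0 and 3 were purchased
```

In the toy case three of the four pairs are ordered correctly, so the value is 0.75; a random ranking would give about 0.5.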

### Logistic Regression (LR) Algorithm

Generate the training data keyed by CustomerID, which turns the task into a multi-class classification problem with one class per user.

In [12]:
idx_match_dic = dict(idx_match)
y_train = data.CustomerID.map(idx_match_dic)
# X_train = data[['BUID', 'SKUCode', 'Spending', 'SubDeptCode', 'QTY']]
X_train = data[[ 'SKUCode', 'Spending', 'QTY']]

X_train = np.array(X_train)
y_train = np.array(y_train)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=10.0, random_state=0)
lr.fit(X_train_std, y_train)


The test data are built correspondingly, with the transactions aggregated per SKUCode.

In [101]:
X_test = data[[ 'SKUCode', 'Spending', 'QTY']]
X_test = X_test.groupby('SKUCode').agg({'Spending': lambda x: x.sum(), 'QTY': lambda x: len(x)}) 
X_test = np.array(X_test.reset_index())
sc.fit(X_test)
X_test_std = sc.transform(X_test)

The prediction result, once transposed, has the same format as the BPR output.

In [104]:
res_lrt = lr.predict_proba(X_test_std) 
print res_lrt.T
print res_lrt.T.shape

[[  1.46183000e-04   1.49231599e-04   1.45071916e-04 ...,   5.51199589e-05
    5.56648051e-05   5.52155370e-05]
 [  3.95754792e-04   3.85086876e-04   4.05009858e-04 ...,   5.05141019e-04
    5.05269877e-04   5.03431775e-04]
 [  1.71538713e-04   1.68301664e-04   1.86933112e-04 ...,   5.08052944e-04
    5.11380805e-04   5.07344866e-04]
 ..., 
 [  1.45057808e-04   1.45659365e-04   1.26917942e-04 ...,   1.00871428e-05
    1.00721123e-05   1.00656586e-05]
 [  5.33756297e-05   5.37050172e-05   5.09415282e-05 ...,   1.38312255e-05
    1.38301861e-05   1.38176190e-05]
 [  1.18230955e-04   1.22538953e-04   1.31299386e-04 ...,   2.02681242e-04
    2.03918310e-04   2.03454524e-04]]
(3557, 10641)


In [117]:
res_lr = lr.predict_proba(X_train_std) 
print res_lr.T

[[  5.40383495e-05   1.22965506e-04   1.29365325e-04 ...,   7.22978552e-05
    1.28773601e-04   1.25401759e-04]
 [  6.29961435e-04   5.35834727e-04   4.66817296e-04 ...,   5.26999885e-04
    4.69864932e-04   4.78197942e-04]
 [  6.08631896e-04   3.04627129e-04   2.54776315e-04 ...,   4.61058498e-04
    2.52459998e-04   2.64618096e-04]
 ..., 
 [  1.08012235e-05   7.15349104e-05   8.10509931e-05 ...,   1.75609510e-05
    8.32038912e-05   7.63721347e-05]
 [  1.42178220e-05   3.90862786e-05   4.19175336e-05 ...,   1.90377203e-05
    4.23433133e-05   4.07062521e-05]
 [  1.67322033e-04   1.41914221e-04   1.53127248e-04 ...,   1.99447876e-04
    1.49160208e-04   1.53138187e-04]]


In [115]:
itm_match_dic = dict(itm_match)

### A linear combination of BPR and LR

Normalization is needed before the linear combination, because the BPR scores and the LR probabilities are on different scales.

In [107]:
from sklearn.preprocessing import normalize
normed_res_bpr = normalize(res_bpr, axis=0, norm='l1')
normed_res_lrt = normalize(res_lrt.T, axis=0, norm='l1')
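
What `normalize(..., axis=0, norm='l1')` does can be written out in plain NumPy. This is a small sketch on a toy matrix (the helper name `l1_normalize_columns` is illustrative): each column, i.e. each item, is divided by the sum of the absolute values in that column.

```python
import numpy as np

def l1_normalize_columns(m):
    """Divide each column by the sum of its absolute values (all-zero columns stay zero)."""
    norms = np.abs(m).sum(axis=0)
    norms[norms == 0] = 1.0
    return m / norms

m = np.array([[1.0, -2.0],
              [3.0,  2.0]])
normed = l1_normalize_columns(m)
```

Note that negative entries keep their sign, so after normalization the absolute values in a column sum to 1, not the values themselves.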

A simple linear model with equal weights is used for now; an optimized α could be learned later.

In [109]:
res = 0.5 * normed_res_bpr + 0.5 * normed_res_lrt
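
One possible way to pick α later is a grid search against held-out purchases. The sketch below is only an illustration under assumptions: the helpers `blend`, `hit_rate` and `best_alpha`, the toy matrices, and the choice of top-1 hit rate as the criterion are all hypothetical and not part of this notebook's pipeline.

```python
import numpy as np

def blend(bpr_scores, lr_scores, alpha):
    """Weighted combination; alpha = 0.5 reproduces the equal-weight model."""
    return alpha * bpr_scores + (1.0 - alpha) * lr_scores

def hit_rate(scores, held_out_items):
    """Share of users whose held-out item is their top-scored item."""
    return float(np.mean(scores.argmax(axis=1) == held_out_items))

def best_alpha(bpr_scores, lr_scores, held_out_items, grid=None):
    """Return the first grid value with the highest hit rate."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)
    return max(grid, key=lambda a: hit_rate(blend(bpr_scores, lr_scores, a), held_out_items))

# Toy case in which the first matrix ranks the held-out items correctly.
bpr_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
lr_toy = np.array([[0.0, 1.0], [1.0, 0.0]])
held_out = np.array([0, 1])
alpha = best_alpha(bpr_toy, lr_toy, held_out)
```

In this toy case the search settles on a weight above 0.5, since only the first matrix agrees with the held-out items. On real data the held-out set would have to be excluded from training for the comparison to be meaningful.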

The available memory cannot handle this element-wise product on dense matrices, so the computation is adapted to use sparse matrices.

In [119]:
from scipy.sparse import coo_matrix

purchased = coo_matrix((np.ones(len(itm)),(idx,itm)),shape=(max(idx)+1,max(itm)+1)).tocsr()
purchased.data[:] = 1.0  # repeat purchases are summed on conversion; reset to a 0/1 mask
# purchased = purchased.todense()
# willing_from_purchased = np.multiply(res , purchased)
# willing_from_unpurchased = res - willing_from_purchased

res_csr = coo_matrix(res).tocsr()
willing_from_purchased = purchased.multiply(res_csr)
willing_from_unpurchased = res_csr - willing_from_purchased

# def matrix_reflection(train_set, res):
#     res_mat = np.zeros([max(idx)+1,max(itm)+1])
#     for each in train_set:
#         res_mat[each[0]][each[1]] = res[each[0]][each[1]]
#     return res_mat

# willing_from_purchased = matrix_reflection(train_set, res)
# willing_from_unpurchased = res - willing_from_purchased
# from scipy.sparse import coo_matrix
# print coo_matrix(willing_from_purchased)
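
The split into purchased and unpurchased scores can be checked on a toy dense example. This is a sketch with made-up numbers; it uses a boolean mask instead of a sparse product, so a customer who bought the same item twice still contributes a single mask entry.

```python
import numpy as np

res_toy = np.array([[0.2, 0.5, 0.3],
                    [0.6, 0.1, 0.3]])
users = np.array([0, 0, 1])    # user 0 bought item 1 twice
items = np.array([1, 1, 2])

purchased_mask = np.zeros(res_toy.shape, dtype=bool)
purchased_mask[users, items] = True     # repeated pairs still give a single True

from_purchased = np.where(purchased_mask, res_toy, 0.0)
from_unpurchased = np.where(purchased_mask, 0.0, res_toy)
```

The two parts add back up to the original matrix, which is a quick sanity check for the sparse version as well.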

Simulate a memory dump by writing the matrices to CSV (no need to use "Pickle"); this is for coding convenience only.

In [120]:
pd.DataFrame(willing_from_purchased.todense()).to_csv("purchased.csv",header=False)
pd.DataFrame(willing_from_unpurchased.todense()).to_csv("unpurchased.csv",header=False)
pd.DataFrame(res).to_csv("result.csv",header=False)
# read csv to restore memory
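
Restoring a matrix from such a CSV could look like the sketch below. It writes a toy matrix to a temporary directory (the path and the toy values are illustrative) using the same `to_csv(..., header=False)` call shape as above, then reads it back.

```python
import os
import tempfile

import numpy as np
import pandas as pd

res_toy = np.array([[0.1, 0.9],
                    [0.7, 0.3]])
path = os.path.join(tempfile.mkdtemp(), "result.csv")

# As above: the row index is written as the first column and there is no header row.
pd.DataFrame(res_toy).to_csv(path, header=False)

# header=None keeps the first row as data; index_col=0 drops the saved row index.
restored = pd.read_csv(path, header=None, index_col=0).values
```

Forgetting `header=None` would silently consume the first customer's row as column names, so the restored matrix would be one row short.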

# Use Cases
## Side Data Preparation

In [169]:
cstmfile = '/Users/Yiteng/venv/rfm-project/data/og_data/Customer_Profile.csv'
customer = pd.read_csv(cstmfile, sep=',', engine='c')

def strip_brace(value):
    # drop the first and last character, i.e. the surrounding '{' and '}'
    return value[1:-1]

customer.CustomerID = customer.CustomerID.map(strip_brace)

## Use case 1 -- recommend a certain item to the users most interested in it

Generate reverse dictionaries that map each index back to its content (CustomerID or SKUCode).

In [159]:
idx_match_rev_dic = dict(zip(idx,data.CustomerID))
itm_match_rev_dic = dict(zip(itm,data.SKUCode))

Randomly select one item.

In [163]:
import random
rec_itm = random.randint(0,itm.max())
SKUID_rec_itm = itm_match_rev_dic[rec_itm]
print rec_itm, SKUID_rec_itm

2112 166868


Set a parameter τ here, indicating the share of customers (e.g., 10%) to select for the campaign.

In [197]:
tau = 0.1 
qualified_no = int(math.floor(res.shape[0]*tau))
print 'Recommend item to ' + str(qualified_no) + ' customers out of ' + str(res.shape[0]) + ' in total.'

Recommend item to 355 customers out of 3557 in total.


Retrieve the CustomerIDs to understand the details of this recommendation.

In [269]:
# product1[product1.SKUID.isin(data.SKUCode)]
# rank customers by their score for the chosen item, highest first;
# argsort returns distinct indices even when several scores are tied
rec_col = res[:,rec_itm]
rec_customer_indices = np.argsort(-rec_col)[:qualified_no]
rec_customer_ID = [idx_match_rev_dic[int(i)] for i in rec_customer_indices]
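
When selecting the top customers, tied scores need some care. The sketch below uses a toy score column (the helper name `top_k_indices` is illustrative) and compares looking indices up by value with sorting the indices directly.

```python
import numpy as np

def top_k_indices(column, k):
    """Indices of the k largest values, highest first; tied values keep distinct indices."""
    order = np.argsort(-column, kind="mergesort")   # stable sort
    return order[:k]

col = np.array([0.4, 0.9, 0.4, 0.1])    # customers 0 and 2 are tied

# Looking each value up again finds the first match only, so customer 0 appears twice.
by_value = [np.where(col == v)[0][0] for v in sorted(col, reverse=True)[:3]]

# Sorting the indices keeps the tied customers apart.
by_index = top_k_indices(col, 3)
```

With duplicated indices the later `isin` filter would return fewer distinct customers than requested, which is easy to miss.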

Hence, this subset of customers can be extracted from the overall customer data. Any follow-up analysis can then be run on this subset as needed.

In [275]:
sub_customer = customer[customer.CustomerID.isin(rec_customer_ID)]
print 'Among the total ' + str(sub_customer.shape[0]), 'recommendations, there are ' + str(len(sub_customer[sub_customer.Gender == 'F'])) + ' ladies and ' + str(len(sub_customer[sub_customer.Gender == 'M'])) + ' gentlemen, while ' + str(len(sub_customer[sub_customer.Gender.isnull()])) + ' ppl have not indicated their gender.'


Among the total 355 recommendations, there are 252 ladies and 61 gentlemen, while 42 ppl have not indicated their gender.


In [278]:
print 'The details of these customers are listed here:'
print ''
print sub_customer

The details of these customers are listed here:

                                CustomerID          DateofBirth  Nationality  \
77    EE7E8A57-0169-463D-837E-7875964703D7  1957-02-16 00:00:00         Thai   
100   3B08E562-918B-4484-9A85-1B27E1BB6FF4  1971-04-13 00:00:00         Thai   
133   7B734315-FD6C-4223-A3A4-8F83EC249087  1957-11-13 00:00:00         Thai   
149   DDA9F9BA-2054-46A3-987F-7CDECD618E9A  1981-01-03 00:00:00         Thai   
222   F595459A-1EA2-47D3-8AF6-FECF28D24257  1954-03-25 00:00:00         Thai   
236   AA123317-3326-494B-97CA-6B4377374BBF  1976-01-18 00:00:00         Thai   
273   84DBD14F-33A0-46F3-BB05-9EBA1155EE8E  1982-11-21 00:00:00         Thai   
300   C2E76D84-D2CF-464D-B437-44E3A1180DC1                  NaN      Unknown   
323   FF8D66CE-4072-4C0C-BF9C-5E559750697A  1971-07-09 00:00:00         Thai   
333   4210EC2D-0943-4E80-A762-BFB7D7046C2A  1973-10-22 00:00:00         Thai   
353   82439C18-E9C7-42E0-A396-AD8A5CBDDE0D  1978-07-19 00:00:00        

## Use case 2 -- recommend a certain item to a certain group of users

Following a discussion with May-E, this is based on the original plan of making recommendations to 64 (4^3) different user groups defined by RFM segmentation.

Below is the current segmentation, which uses the quartiles of each measure.

In [277]:
# Arguments (x = value, p = column name: recency, monetary_value or frequency, d = quartiles dict)
def RClass(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4

# Arguments (x = value, p = column name: recency, monetary_value or frequency, d = quartiles dict)
def FMClass(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1
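
A worked example may help to see how the two scoring directions behave. The sketch below restates the same thresholds in compact form (the names `r_class` and `fm_class`, the quartile cut points and the customer values are all made up for illustration): low recency scores 1, while high frequency and high monetary value score 1.

```python
def r_class(x, q):
    """Recency: smaller is better, so values up to the first quartile score 1."""
    if x <= q[0.25]:
        return 1
    elif x <= q[0.50]:
        return 2
    elif x <= q[0.75]:
        return 3
    return 4

def fm_class(x, q):
    """Frequency / monetary: larger is better, so the scale is reversed."""
    return 5 - r_class(x, q)

# Hypothetical quartile cut points for each measure.
recency_q = {0.25: 10, 0.50: 30, 0.75: 90}
frequency_q = {0.25: 2, 0.50: 5, 0.75: 12}
monetary_q = {0.25: 100, 0.50: 250, 0.75: 600}

# A customer seen 7 days ago, with 20 purchases and an average spend of 120.
rfm_code = "%d%d%d" % (r_class(7, recency_q),
                       fm_class(20, frequency_q),
                       fm_class(120, monetary_q))
```

This customer lands in class "113": very recent, very frequent, but in the second-lowest spending quartile.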

Making the RFM table according to the data read:

In [284]:
import datetime
NOW = datetime.date(2017, 1, 1)
orders = data
orders['tDate2'] = pd.to_datetime(orders['TransactionDate']).dt.date
rfmTable = orders.groupby('CustomerID').agg({'tDate2': lambda x: (NOW - x.max()).days, # Recency
                                           'TicketNumber': lambda x: len(x),      # Frequency
                                           'Spending': lambda x: x.mean()}) # Monetary Value

In [286]:
rfmTable['tDate2'] = rfmTable['tDate2'].astype(int)
rfmTable.rename(columns={'tDate2': 'recency', 
                          'TicketNumber': 'frequency', 
                          'Spending': 'monetary_value'}, inplace=True)

In [288]:
quantiles = rfmTable.quantile(q=[0.25, 0.50, 0.75])
quantiles = quantiles.to_dict()
rfmSegmentation = rfmTable
rfmSegmentation['R_Quartile'] = rfmSegmentation['recency'].apply(RClass, args=('recency',quantiles,))
rfmSegmentation['F_Quartile'] = rfmSegmentation['frequency'].apply(FMClass, args=('frequency',quantiles,))
rfmSegmentation['M_Quartile'] = rfmSegmentation['monetary_value'].apply(FMClass, args=('monetary_value',quantiles,))
rfmSegmentation['RFMClass'] = rfmSegmentation.R_Quartile.map(str) + rfmSegmentation.F_Quartile.map(str) + rfmSegmentation.M_Quartile.map(str)
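
The 64 possible `RFMClass` values can be enumerated directly, which is handy for checking that no segment is missed. A minimal sketch (the name `rfm_codes` is illustrative):

```python
from itertools import product

# Each of the R, F and M quartiles takes a value in 1..4, giving 4**3 = 64 codes.
rfm_codes = ["%d%d%d" % rfm for rfm in product(range(1, 5), repeat=3)]

# The same codes written as numbers, as built by i*100 + j*10 + k in a triple loop.
as_numbers = [str(i * 100 + j * 10 + k)
              for i in range(1, 5) for j in range(1, 5) for k in range(1, 5)]
```

Both constructions give the same list in the same order, from "111" to "444".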


The segments are ready, so an item can now be recommended to each segment based on the result matrix.

One may wonder why there are so many "recommendation failed" messages: it is because the product data contain a large number of missing values (i.e., NaN) for office items.
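
One plausible mechanism for these failures, assuming a missing product name is read as NaN (a float): concatenating a string with it raises a `TypeError`, which a broad `except` then reports as a failed recommendation. The sketch below shows that behaviour and a hypothetical guard (`describe` is illustrative, not part of the notebook).

```python
def describe(dept_name, sub_dept_name):
    """Build the recommendation line, or return None when a name is missing."""
    parts = [dept_name, sub_dept_name]
    if not all(isinstance(p, str) for p in parts):
        return None    # e.g. NaN, which is a float rather than a string
    return "Recommendation:    " + "    ".join(parts)

nan = float("nan")
try:
    "Recommendation:    " + nan    # what a missing name would do
    raised = False
except TypeError:
    raised = True
```

Checking the type before formatting separates "product found but unnamed" from genuine lookup errors, which a bare `except` would otherwise hide.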

In [313]:
import collections
idx_match_dic = dict(idx_match)

for i in range(1,5):
    for j in range(1,5):
        for k in range(1,5):
            # for each RFM class, select the matching rows from the original data based on CustomerID
            sub_segment = rfmSegmentation[rfmSegmentation.RFMClass == str(i*100+j*10+k)]
            sub_segment = sub_segment.reset_index()

            print 'category-R('+str(i)+')-F('+str(j)+')-M('+str(k)+') has '+ str(sub_segment.shape[0]) + ' customers.'

            sub_segment_df2 = customer[customer.CustomerID.isin(sub_segment.CustomerID)]
            print 'Among them, ' + str(len(sub_segment_df2[sub_segment_df2.Gender == 'F'])) + ' ladies and ' + str(len(sub_segment_df2[sub_segment_df2.Gender == 'M'])) + ' gentlemen, while ' + str(len(sub_segment_df2[sub_segment_df2.Gender.isnull()])) + ' ppl have not indicated their gender.'

#             sub_segment_df = orders[orders.CustomerID.isin(sub_segment.CustomerID)]
#             some = collections.Counter(sub_segment_df.SKUCode).most_common()
#             recommended_pos = int(math.floor(some.__len__()*.4))
#             recommended_id = some[recommended_pos][0]

            rec_customer_indices = map(idx_match_dic.get, sub_segment.CustomerID)
            sub_res = res[rec_customer_indices]
            rec_from = sub_res.mean(axis=0)
            rec_item_SKUID = itm_match_rev_dic[np.where(rec_from == rec_from.max())[0][0]]            

            result = product1[product1.SKUID == rec_item_SKUID] #recommended_id]
            try:
                if result.shape[0]:
                    print 'Recommendation:'+ '    ' + result.iloc[0].DeptName + '    ' + result.iloc[0].SubDeptName
                else:
                    result = product2[product2.SKUID == rec_item_SKUID]  # recommended_id is no longer defined
                    if result.shape[0]:
                        print 'Recommendation:'+ '    ' + result.iloc[0].PRODUCT_ENG_DESC
            except:
                print 'SKUID ' + str(result.iloc[0].SKUID) + ' is a unknown product -- recommendation failed'


category-R(1)-F(1)-M(1) has 118 customers.
Among them, 80 ladies and 27 gentlemen, while 11 ppl have not indicated their gender.
SKUID 310637 is a unknown product -- recommendation failed
category-R(1)-F(1)-M(2) has 184 customers.
Among them, 125 ladies and 39 gentlemen, while 20 ppl have not indicated their gender.
SKUID 310636 is a unknown product -- recommendation failed
category-R(1)-F(1)-M(3) has 125 customers.
Among them, 88 ladies and 16 gentlemen, while 21 ppl have not indicated their gender.
SKUID 171179 is a unknown product -- recommendation failed
category-R(1)-F(1)-M(4) has 23 customers.
Among them, 20 ladies and 3 gentlemen, while 0 ppl have not indicated their gender.
SKUID 328557 is a unknown product -- recommendation failed
category-R(1)-F(2)-M(1) has 55 customers.
Among them, 40 ladies and 13 gentlemen, while 2 ppl have not indicated their gender.
SKUID 310637 is a unknown product -- recommendation failed
category-R(1)-F(2)-M(2) has 49 customers.
Among them, 34 ladies 

SKUID 317917 is a unknown product -- recommendation failed
category-R(4)-F(1)-M(1) has 7 customers.
Among them, 4 ladies and 1 gentlemen, while 2 ppl have not indicated their gender.
SKUID 330244 is a unknown product -- recommendation failed
category-R(4)-F(1)-M(2) has 16 customers.
Among them, 12 ladies and 0 gentlemen, while 4 ppl have not indicated their gender.
SKUID 149981 is a unknown product -- recommendation failed
category-R(4)-F(1)-M(3) has 20 customers.
Among them, 13 ladies and 2 gentlemen, while 5 ppl have not indicated their gender.
SKUID 171179 is a unknown product -- recommendation failed
category-R(4)-F(1)-M(4) has 5 customers.
Among them, 5 ladies and 0 gentlemen, while 0 ppl have not indicated their gender.
SKUID 171179 is a unknown product -- recommendation failed
category-R(4)-F(2)-M(1) has 26 customers.
Among them, 20 ladies and 3 gentlemen, while 3 ppl have not indicated their gender.
SKUID 330244 is a unknown product -- recommendation failed
category-R(4)-F(2)-M