## PySpark recommendation with BPR implemented (implicit feedback)

As discussed with Vidyut, for now we need to build a pool of methods rather than a set of rules. (Notes are in a separate email.)

* Here we illustrate PySpark with ALS and BPR on our own data.

In [1]:
from pyspark import SparkConf, SparkContext 
from pyspark.mllib.recommendation import ALS, Rating
import numpy as np

This function parses one line of the data into a `Rating`, the record format of the ratings RDD.

In [3]:
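def get_rating(line):
    # Each line is tab-separated: user id, item id, rating; any further columns are ignored.
    arr = line.split('\t')
    user_id = int(arr[0])
    item_id = int(arr[1])
    user_rating = float(arr[2])
    return Rating(user_id, item_id, user_rating)

# Stop any Spark context left over from an earlier run; do nothing if there is none.
try:
    sc.stop()
except Exception:
    pass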
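To check the parsing logic without starting Spark, here is a minimal sketch. It assumes a `namedtuple` stand-in for `Rating` (the real class comes from `pyspark.mllib.recommendation`), and `parse_rating` is a hypothetical name used only for this illustration. The sample line has the same tab-separated, four-column shape as our data file.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating so the sketch runs without Spark.
Rating = namedtuple("Rating", ["user", "product", "rating"])

def parse_rating(line):
    """Parse 'user<TAB>item<TAB>rating[<TAB>extra...]' into a Rating."""
    fields = line.split("\t")
    return Rating(int(fields[0]), int(fields[1]), float(fields[2]))

# A line with four tab-separated columns; the fourth column is ignored.
sample = "999\t7071\t1\t736433"
parsed = parse_rating(sample)
print(parsed)
```

Only the first three columns are used, so extra trailing columns do not affect the result.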

### Load our data

Create the Spark context. Only one context can be active at a time, which is why any existing one is stopped first.

In [80]:
conf = SparkConf().setMaster('local').setAppName('RecoEng').set("spark.executor.memory", "1g")  
sc = SparkContext(conf=conf)

In [81]:
data = sc.textFile('/Users/ito/venv/pyspark-rec/CG-Tops/Tops_user-item_data')  

In [82]:
data.top(3)

[u'999\t7071\t1\t736433', u'999\t7070\t1\t736433', u'999\t6951\t1\t736485']

In [83]:
ratings = data.map(get_rating) 
ratings.top(3)

[Rating(user=9204, product=43518, rating=1.0),
 Rating(user=9204, product=43392, rating=2.0),
 Rating(user=9204, product=43378, rating=1.0)]

### ALS model

In [84]:
rank = 10  
iterations = 5    

In [85]:
%%time
ALSmodel = ALS.train(ratings, rank, iterations)

CPU times: user 8.55 ms, sys: 2.99 ms, total: 11.5 ms
Wall time: 20.3 s


PySpark's efficiency looks good so far: training takes about 20 s of wall time even though it runs on a single local worker (`setMaster('local')`) with no multiprocessing. The next step is to work on accuracy and on a more realistic setup.
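One possible way to put a number on accuracy is precision at k against held-out interactions. The sketch below is only an assumption about which metric could be used, not something already decided; `precision_at_k` is a hypothetical helper and the item ids are made up.

```python
def precision_at_k(recommended, held_out, k):
    """Fraction of the top-k recommended items that appear in the held-out set."""
    if k <= 0:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in held_out)
    return hits / float(k)

# Toy data, made up for illustration (not taken from our dataset).
recommended = [101, 205, 333, 478, 512]   # ranked recommendations for one user
held_out = {205, 512, 999}                # items the user interacted with later
score = precision_at_k(recommended, held_out, 5)
print(score)
```

Averaging this value over users would give one figure for comparing ALS and BPR on the same split.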

In [86]:
userid = 10
rec_items = ALSmodel.recommendProducts(userid, 5)  
print ('\n################################')    
print ('recommend items for userid %d:' % userid)
for i in rec_items:
    print(i) 
print ('################################\n' ) 


################################
recommend items for userid 10:
Rating(user=10, product=39247, rating=109.10205162239563)
Rating(user=10, product=36829, rating=86.21868390496543)
Rating(user=10, product=39898, rating=80.4563362042768)
Rating(user=10, product=10884, rating=56.624898387699034)
Rating(user=10, product=10939, rating=55.588593045927624)
################################



In [22]:
sc.stop()

### BPR model

In [23]:
from bpr_spark.bpr import bprMF

In [24]:
conf = SparkConf().setMaster("local").setAppName("BPR").set("spark.executor.memory", "8g")
sc = SparkContext(conf=conf)

In [25]:
data = sc.textFile("/Users/ito/venv/pyspark-rec/CG-Tops/Tops_user-item_data")
ratings = data.map(lambda line: line.split("\t")).map(lambda x: [int(v) for v in x[:2]])

* Only the user-item pairs are kept so far (the basic BPR, call it BPR-1).
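For intuition on why user-item pairs alone are enough, here is a minimal sketch of one BPR matrix-factorization update as it is commonly described: for a user u, an item i the user interacted with, and an item j the user did not, the step pushes the score of i above the score of j. This is not necessarily what `bpr_spark.bpr.bprMF` does internally; `bpr_update`, `lr` and `reg` are hypothetical names and values, and the sampling of j is left out.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bpr_update(p_u, q_i, q_j, lr=0.05, reg=0.01):
    """One in-place gradient step for the triple (u, i, j): u interacted with i, not j."""
    x_uij = dot(p_u, q_i) - dot(p_u, q_j)    # score difference, want it large
    g = 1.0 / (1.0 + math.exp(x_uij))        # equals 1 - sigmoid(x_uij)
    for f in range(len(p_u)):
        pu, qi, qj = p_u[f], q_i[f], q_j[f]  # old values, so the update is simultaneous
        p_u[f] += lr * (g * (qi - qj) - reg * pu)
        q_i[f] += lr * (g * pu - reg * qi)
        q_j[f] += lr * (-g * pu - reg * qj)
    return x_uij

# Toy factors of rank 4 with small random values (made up for illustration).
random.seed(0)
rank = 4
p_u = [random.gauss(0, 0.1) for _ in range(rank)]
q_i = [random.gauss(0, 0.1) for _ in range(rank)]
q_j = [random.gauss(0, 0.1) for _ in range(rank)]

before = dot(p_u, q_i) - dot(p_u, q_j)
for _ in range(200):
    bpr_update(p_u, q_i, q_j)
after = dot(p_u, q_i) - dot(p_u, q_j)
print(before, after)
```

The rating value never enters the update; only the fact that the pair (u, i) was observed matters, which matches the implicit-feedback setting.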

In [26]:
ratings.top(5)

[[9204, 43518], [9204, 43392], [9204, 43378], [9204, 43378], [9204, 42598]]

In [27]:
%%time
userMat, prodMat = bprMF(ratings, 10, 10) 

CPU times: user 458 ms, sys: 405 ms, total: 863 ms
Wall time: 17min 32s


A faster version is being built (MR stands for MapReduce); it is commented out for now.

In [28]:
# from bpr_spark.bprMR import bpr_MF_MR
# userMat2, prodMat2 = bpr_MF_MR(ratings, 10, 10)

In [44]:
import numpy as np
userid = 10
rec_items_bpr = np.inner(userMat[userid].T, prodMat)

In [48]:
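import heapq
top_howmany = 5
# Indices of the top-scoring items; picking by index avoids returning extra items when scores tie.
top_idx = heapq.nlargest(top_howmany, range(len(rec_items_bpr)), key=lambda i: rec_items_bpr[i])
# Kept as a list of lists, sorted by item index (not by score).
res = [sorted(top_idx)]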
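To get a ranking ordered by score, with the score attached as in the ALS output, something like the following sketch could be used. It assumes the factors are available as plain lists; `top_k_items` is a hypothetical helper and the numbers are made up. The returned indices are row positions in the item matrix, and whether they equal product ids depends on how the factor matrix was built, which is not checked here.

```python
import heapq

def top_k_items(user_vec, item_mat, k):
    """Return [(item_index, score)] for the k highest-scoring items, best first."""
    # Score of each item is the inner product of the user factors and the item factors.
    scores = [sum(u * v for u, v in zip(user_vec, row)) for row in item_mat]
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Toy factors (made up): one user vector, four items, rank 2.
user_vec = [1.0, 2.0]
item_mat = [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
top2 = top_k_items(user_vec, item_mat, 2)
print(top2)
```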

In [78]:
print ('\n################################')    
print ('recommend items for userid %d:' % userid)
for i in res[0]:
    print('user=%d,  product=%d' % (userid, i))
print ('################################\n' ) 


################################
recommend items for userid 10:
user=10,  product=92
user=10,  product=108
user=10,  product=1226
user=10,  product=6951
user=10,  product=11419
################################



In [None]:
recommend movies for userid 10:
Rating(user=10, product=10884, rating=59.42902969981236)
Rating(user=10, product=2560, rating=58.662982221055486)
Rating(user=10, product=5565, rating=31.719984045270465)
Rating(user=10, product=39898, rating=30.352139717436557)
Rating(user=10, product=11101, rating=24.679715171218653)

In [79]:
sc.stop()