## A simple framework to showcase the scalability and efficiency of pySpark-recommendation

Import the required packages and classes. ALS stands for Alternating Least Squares, a method commonly used in collaborative filtering.
#### Here we illustrate the PySpark framework on a widely used sample dataset: MovieLens ml-100k (a movie rating dataset)

In [1]:
from pyspark import SparkConf, SparkContext 
from pyspark.mllib.recommendation import ALS, Rating

This function builds a dictionary that maps each movie ID to its movie name.

In [2]:
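def movie_dict(file):
    # Map each movie ID to its name; fields in the file are separated by '|'.
    movies = {}  # avoid shadowing the built-in dict
    with open(file) as f:
        for line in f:
            arr = line.split('|')
            movie_id = int(arr[0])
            movie_name = str(arr[1])
            movies[movie_id] = movie_name
    return movies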

This function converts one line of the data file into a `Rating` object, the format used for the ratings RDD.

In [3]:
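def get_rating(line):
    # Each line is tab-separated; the first three fields are user ID, movie ID, rating.
    arr = line.split('\t')
    user_id = int(arr[0])
    movie_id = int(arr[1])
    user_rating = float(arr[2])
    return Rating(user_id, movie_id, user_rating)

# Stop any SparkContext left over from a previous run; ignore the error if none exists.
try:
    sc.stop()
except Exception:
    pass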

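Create the SparkContext. Only one SparkContext can be active at a time, which is why any existing context is stopped above before a new one is created.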

In [4]:
conf = SparkConf().setMaster('local').setAppName('RecoEng').set("spark.executor.memory", "1024m")
sc = SparkContext(conf=conf)

### Load Data

In [5]:
data = sc.textFile('/Users/ito/venv/pyspark-rec/ml-100k/u.data')  

In [6]:
data.top(3)

[u'99\t98\t5\t885679596', u'99\t978\t3\t885679382', u'99\t975\t3\t885679472']
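
Each line holds tab-separated fields, of which `get_rating` uses the first three (user ID, movie ID, rating). As a quick check that needs no Spark context, the sketch below parses the first line of the output above in plain Python. `ParsedRating` and `parse_line` are illustrative names introduced here, not part of PySpark; the namedtuple only stands in for `Rating`.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.mllib.recommendation.Rating, so no Spark is needed.
ParsedRating = namedtuple('ParsedRating', ['user', 'product', 'rating'])

def parse_line(line):
    """Parse one tab-separated line; fields after the third are ignored."""
    fields = line.strip().split('\t')
    return ParsedRating(int(fields[0]), int(fields[1]), float(fields[2]))

sample = '99\t98\t5\t885679596'  # first line of the output above
parsed = parse_line(sample)
print(parsed)  # ParsedRating(user=99, product=98, rating=5.0)
```

Note that `top(3)` returns the three largest elements. Because the lines are still strings at this point, they are compared lexicographically, which is why all three start with `99`.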

Convert the data to the Rating(user, product, rating) tuple format.

In [7]:
ratings = data.map(get_rating) 
ratings.top(3)

[Rating(user=943, product=1330, rating=3.0),
 Rating(user=943, product=1228, rating=3.0),
 Rating(user=943, product=1188, rating=3.0)]

### Constructing the model

In [8]:
import time

In [9]:
rank = 10  
iterations = 5  

In [10]:
starting = time.time()
model = ALS.train(ratings, rank, iterations)  
ending = time.time()
print('time elapsed: ' + str(ending - starting))

time elapsed: 2.7349319458


#### Data Information:
943 users
1682 items
100000 ratings
#### As one can see, training on 100K ratings takes under 3 seconds. Next we look at the performance on our own sample data, which contains 2 million transactions.

A simple test on a specific user.

In [11]:
userid = 10  
user_ratings = ratings.filter(lambda x: x[0] == userid)  

For example, recommend the top 5 movies:

In [12]:
rec_movies=model.recommendProducts(userid, 5)  
print ('\n################################\n')    
print ('recommend movies for userid %d:' % userid)
for i in rec_movies:
    print(i) 
print ('\n################################\n' ) 


################################

recommend movies for userid 10:
Rating(user=10, product=1643, rating=6.051609929829508)
Rating(user=10, product=1631, rating=5.770377710543525)
Rating(user=10, product=913, rating=5.7452749514953325)
Rating(user=10, product=1131, rating=5.67329557355634)
Rating(user=10, product=1463, rating=5.666075555982269)

################################
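
The recommendations above are product IDs only. A dictionary like the one returned by `movie_dict` can turn them into readable names. The following is a minimal sketch under stated assumptions: `load_movie_names` and `attach_titles` are hypothetical helpers, the temporary file is a two-line stand-in for a `|`-separated item file, and its titles are placeholders rather than real MovieLens data.

```python
import os
import tempfile

def load_movie_names(path):
    """Map movie ID -> name from a '|'-separated file (same logic as movie_dict)."""
    names = {}
    with open(path) as f:
        for line in f:
            fields = line.split('|')
            names[int(fields[0])] = fields[1]
    return names

def attach_titles(recommendations, names):
    """Pair each (user, product, rating) tuple with its name, if known."""
    return [(names.get(product, 'unknown id %d' % product), rating)
            for _user, product, rating in recommendations]

# Hypothetical stand-in for the item file; the titles are placeholders, not real data.
with tempfile.NamedTemporaryFile('w', suffix='.item', delete=False) as tmp:
    tmp.write('1643|Placeholder Title A|...\n1631|Placeholder Title B|...\n')
    item_path = tmp.name

names = load_movie_names(item_path)
os.remove(item_path)

# Plain tuples stand in for the Rating objects printed above (ratings rounded).
recs = [(10, 1643, 6.05), (10, 1631, 5.77), (10, 913, 5.75)]
for title, score in attach_titles(recs, names):
    print('%-22s %.2f' % (title, score))
```

Since `Rating` is a namedtuple, `attach_titles` should also accept the list returned by `model.recommendProducts` directly. IDs missing from the dictionary are reported as unknown rather than raising a `KeyError`.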



### Try our data

In [13]:
data = sc.textFile('/Users/ito/venv/pyspark-rec/CG-Tops/Tops_user-item_data')  

In [14]:
data.top(3)

[u'999\t7071\t1\t736433', u'999\t7070\t1\t736433', u'999\t6951\t1\t736485']

In [15]:
ratings = data.map(get_rating) 
ratings.top(3)

[Rating(user=9204, product=43518, rating=1.0),
 Rating(user=9204, product=43392, rating=2.0),
 Rating(user=9204, product=43378, rating=1.0)]

In [16]:
rank = 10  
iterations = 5  

starting = time.time()
model = ALS.train(ratings, rank, iterations)  
ending = time.time()
print('time elapsed: ' + str(ending - starting))

time elapsed: 20.2458558083


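PySpark scales well here: training on our larger dataset took about 20 seconds, even though the `local` master runs Spark on a single worker thread with no multi-core parallelism. With speed looking good, the next step is to assess accuracy and to test under more realistic conditions.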
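
One common way to measure accuracy is the root-mean-square error (RMSE) between predicted and held-out ratings. The sketch below is plain Python so it runs without Spark; `rmse` is a hypothetical helper and the toy values are made up purely to exercise it. In a Spark workflow, one option would be to hold out part of the data with `ratings.randomSplit`, train on the rest, and obtain predictions for the held-out (user, product) pairs with `model.predictAll` before comparing them this way.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over the (user, product) keys present in both dicts."""
    common = [key for key in actual if key in predicted]
    if not common:
        raise ValueError('no overlapping (user, product) pairs to compare')
    squared_error = sum((predicted[k] - actual[k]) ** 2 for k in common)
    return math.sqrt(squared_error / len(common))

# Hypothetical toy values keyed by (user, product); not taken from either dataset.
actual = {(1, 10): 4.0, (1, 11): 2.0, (2, 10): 5.0}
predicted = {(1, 10): 3.0, (1, 11): 2.0, (2, 10): 7.0}

# Squared errors are 1, 0 and 4, so the result is sqrt(5 / 3).
print(rmse(predicted, actual))
```

A lower RMSE on held-out data indicates better predictions, which gives a basis for comparing different `rank` and `iterations` settings.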

In [17]:
rec_items=model.recommendProducts(userid, 5)  
print ('\n################################\n')    
print ('recommend movies for userid %d:' % userid)
for i in rec_items:
    print(i) 
print ('\n################################\n' ) 


################################

recommend movies for userid 10:
Rating(user=10, product=10939, rating=92.88199816094956)
Rating(user=10, product=2560, rating=37.4570235657623)
Rating(user=10, product=36829, rating=34.941969025331055)
Rating(user=10, product=10884, rating=23.99853971587603)
Rating(user=10, product=24504, rating=12.740545296770325)

################################



In [18]:
sc.stop()