### File download
First, let's get the data. GroupLens Research has collected movie rating datasets and made them available on the [MovieLens website](https://grouplens.org/datasets/movielens/). There are two datasets:
- Small: 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. Last updated 10/2016. 
- Full: 24,000,000 ratings and 670,000 tag applications applied to 40,000 movies by 260,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags. Last updated 10/2016.

We start with the small dataset to tune the model and move on to the full one later, so the cell below downloads both.

In [2]:
smallset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
fullset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest.zip'

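import os

# Make sure the target directory exists before downloading into it.
if not os.path.exists('data'):
    os.makedirs('data')

smallset_path = os.path.join('data', 'ml-latest-small.zip')
fullset_path = os.path.join('data', 'ml-latest.zip')

import urllib
# urllib.urlretrieve is the Python 2 API; on Python 3 it lives in urllib.request.
smallset = urllib.urlretrieve(smallset_url, smallset_path)
fullset = urllib.urlretrieve(fullset_url, fullset_path)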

import zipfile
with zipfile.ZipFile(smallset_path, "r") as z:
    z.extractall('data')
with zipfile.ZipFile(fullset_path, "r") as z:
    z.extractall('data')
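
Because the full archive is large, it can help to skip files that are already on disk. The following is a minimal sketch, not part of the original notebook: `download_if_missing` is a hypothetical helper, and its `fetch` argument stands in for `urllib.urlretrieve` (or `urllib.request.urlretrieve` on Python 3).

```python
import os

def download_if_missing(url, path, fetch):
    """Download url to path unless the file already exists.

    fetch is any callable taking (url, path), e.g. urllib.urlretrieve.
    Returns True if a download happened, False if the file was reused.
    """
    directory = os.path.dirname(path)
    # Create the target directory on first use.
    if directory and not os.path.isdir(directory):
        os.makedirs(directory)
    if os.path.exists(path):
        return False
    fetch(url, path)
    return True
```

Passing the download function as an argument keeps the helper independent of the Python version and makes it easy to try out without network access.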

### Analysing datasets
Now we take a closer look at the dataset.

Each line in the ratings dataset (*ratings.csv*) is formatted as:

*userId,movieId,rating,timestamp*

Each line in the movies (*movies.csv*) dataset is formatted as:

*movieId,title,genres*

Where *genres* has the format:

*Genre1|Genre2|Genre3...*

The tags file (*tags.csv*) has the format:

*userId,movieId,tag,timestamp*

The *links.csv* file has the format:

*movieId,imdbId,tmdbId*

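These files are uniformly formatted, so most lines are easy to parse with split(). One caveat: titles that contain a comma are quoted in *movies.csv*, so a plain split(",") cuts them short (a few titles in the results further down start with a stray quote for this reason). To train a recommender, we need to parse movies and ratings into two RDDs: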
- For each line in movies dataset, we create a tuple of (movieId, title).
- For each line in ratings dataset, we create a tuple of (userId, movieId, rating).
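
The parsing step can be sketched in plain Python, without Spark. This version uses the standard `csv` module, which handles quoted titles that contain commas; the function names and the sample lines are illustrative assumptions, not taken from the notebook. On Python 2 the `csv` module expects byte strings, so unicode lines would need encoding first.

```python
import csv

def parse_movie_line(line):
    """Parse one movies.csv line into (movieId, title, [genres])."""
    # csv.reader understands quoted fields, so commas inside titles survive.
    movie_id, title, genres = next(csv.reader([line]))
    return int(movie_id), title, genres.split('|')

def parse_rating_line(line):
    """Parse one ratings.csv line into (userId, movieId, rating)."""
    user_id, movie_id, rating, _timestamp = line.split(',')
    return int(user_id), int(movie_id), float(rating)

# Illustrative input lines in the formats described above.
print(parse_movie_line('11,"American President, The (1995)",Comedy|Drama|Romance'))
print(parse_rating_line('1,31,2.5,1260759144'))
```

Either function could be passed to an RDD's `map` in place of the `split(",")` lambdas.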

In [3]:
# Load the small dataset
small_ratings_path = os.path.join('data', 'ml-latest-small', 'ratings.csv')
small_ratings_data = sc.textFile(small_ratings_path)
small_ratings_header = small_ratings_data.take(1)[0]

#Parse
small_ratings = small_ratings_data.filter(lambda line: line!=small_ratings_header).map(lambda line: line.split(","))\
.map(lambda tokens: (tokens[0], tokens[1], tokens[2])).cache()

small_ratings.take(3)

[(u'1', u'31', u'2.5'), (u'1', u'1029', u'3.0'), (u'1', u'1061', u'3.0')]

In [4]:
small_movies_path = os.path.join('data', 'ml-latest-small', 'movies.csv')
small_movies_data = sc.textFile(small_movies_path)
small_movies_header = small_movies_data.take(1)[0]

small_movies = small_movies_data.filter(lambda line: line!=small_movies_header).map(lambda line: line.split(","))\
.map(lambda tokens: (tokens[0], tokens[1])).cache()

small_movies.take(3)

[(u'1', u'Toy Story (1995)'),
 (u'2', u'Jumanji (1995)'),
 (u'3', u'Grumpier Old Men (1995)')]

[Collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors. 
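
To make the idea of latent factors concrete: in a matrix factorization model of this kind, the predicted rating is the dot product of a user's factor vector and a movie's factor vector. The sketch below uses made-up factor values for a rank-3 model; it illustrates the prediction step only, not how ALS learns the factors.

```python
def predict_rating(user_factors, item_factors):
    """Predicted rating = dot product of the two latent factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

# Made-up factors for a rank-3 model (illustrative values only).
user = [1.0, 0.5, 2.0]
movie = [2.0, 1.0, 0.5]
print(predict_rating(user, movie))  # 2.0 + 0.5 + 1.0 = 3.5
```

The `rank` parameter tuned in the next section is the length of these vectors.
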
### Select ALS parameters using the dataset
First, we need to split the dataset into train, validation and test data.

In [5]:
train_RDD, validation_RDD, test_RDD = small_ratings.randomSplit([6, 2, 2], seed = 0L)
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))
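
The weights `[6, 2, 2]` do not sum to 1; `randomSplit` normalizes them, so they mean roughly 60% / 20% / 20%. The plain-Python sketch below imitates that behaviour to show the idea. `random_split` is a hypothetical stand-in, not Spark's implementation, and the split sizes are only approximate because each item is assigned independently.

```python
import random

def random_split(items, weights, seed):
    """Assign each item to one split with probability proportional to weights."""
    total = float(sum(weights))
    # Cumulative upper bounds, roughly [0.6, 0.8, 1.0] for weights [6, 2, 2].
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split also catches any floating-point shortfall.
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(item)
                break
    return splits

train, validation, test = random_split(list(range(1000)), [6, 2, 2], seed=0)
print('%d %d %d' % (len(train), len(validation), len(test)))
```

Fixing the seed makes the split reproducible, which matters when comparing models trained with different parameters.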

In [6]:
from pyspark.mllib.recommendation import ALS
import math

seed = 5L
iterations = 10
regularization_parameter = 0.1
ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
tolerance = 0.02

min_error = float('inf')
best_rank = -1
best_iteration = -1
for rank in ranks:
    model = ALS.train(train_RDD, rank, seed = seed, iterations = iterations,
                     lambda_ = regularization_parameter)
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2])) # ((userId, movieId), predicted_rating)
    rates_and_preds = validation_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions) # ((userId, movieId), (actual_rating, predicted_rating))
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean()) # root mean squared error between actual and predicted ratings
    errors[err] = error
    err += 1
    print 'For rank %s the RMSE is %s' % (rank, error)
    if error < min_error:
        min_error = error
        best_rank = rank
        
print 'The best model was trained with rank %s' % best_rank

For rank 4 the RMSE is 0.952401795348
For rank 8 the RMSE is 0.959968407203
For rank 12 the RMSE is 0.953274504895
The best model was trained with rank 4
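
The error computed in the loop is the root mean squared error (RMSE) over the (userId, movieId) pairs that have both an actual and a predicted rating. Here is a minimal Spark-free sketch of the same calculation, with dictionaries standing in for the joined RDDs and made-up values:

```python
import math

def rmse(actual, predicted):
    """RMSE over the (user, movie) keys present in both dictionaries."""
    # Mirrors the RDD join: pairs without a prediction are dropped.
    shared = [key for key in actual if key in predicted]
    squared = [(actual[key] - predicted[key]) ** 2 for key in shared]
    return math.sqrt(sum(squared) / len(squared))

actual = {(1, 31): 2.5, (1, 1029): 3.0, (2, 10): 4.0}
predicted = {(1, 31): 3.5, (1, 1029): 3.0}  # no prediction for (2, 10)
print(rmse(actual, predicted))
```

As with the join in the notebook, pairs without a prediction simply do not count toward the error.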


The code above picks the best value of the rank hyperparameter on the validation set. Now we evaluate a model trained with that rank on the test set.

In [7]:
model = ALS.train(train_RDD, best_rank, seed=seed, iterations=iterations, lambda_=regularization_parameter)
predictions = model.predictAll(test_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
rates_and_preds = test_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())

print 'For testing data, the RMSE is %s' % (error)

For testing data, the RMSE is 0.943727800028


### Build model with full dataset
First let's load the dataset:

In [9]:
# Load the full dataset
full_ratings_path = os.path.join('data', 'ml-latest', 'ratings.csv')
full_ratings_data = sc.textFile(full_ratings_path)
full_ratings_header = full_ratings_data.take(1)[0]

#Parse
full_ratings = full_ratings_data.filter(lambda line: line!=full_ratings_header).map(lambda line: line.split(","))\
.map(lambda tokens: (int(tokens[0]), int(tokens[1]), float(tokens[2]))).cache()

print "There are %s entries in the full dataset." % (full_ratings.count())

There are 24404096 entries in the full dataset.


Then we train and test the model in the same way as for the small dataset:

In [10]:
train_RDD, test_RDD = full_ratings.randomSplit([7, 3], seed = 0L)
full_model = ALS.train(train_RDD, best_rank, seed=seed, iterations=iterations, lambda_ = regularization_parameter)

test_for_predict_RDD = test_RDD.map(lambda r: (r[0], r[1]))

predictions = full_model.predictAll(test_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
rates_and_preds = test_RDD.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())

print 'The RMSE on testing data is %s' % (error)

The RMSE on testing data is 0.831954381788


### Make recommendations
Making recommendations with collaborative filtering is not as simple as predicting new entries with a previously trained model, because the new user's preferences must be included so that they can be compared with those of existing users. Therefore, whenever new user ratings come in, we need to train the model again. This is expensive and raises scalability concerns, which is why we use Spark.

Let's load the movies data first:

In [14]:
# Load the movies file of the full dataset
full_movies_path = os.path.join('data', 'ml-latest', 'movies.csv')
full_movies_data = sc.textFile(full_movies_path)
full_movies_header = full_movies_data.take(1)[0]

#Parse
full_movies = full_movies_data.filter(lambda line: line!=full_movies_header).map(lambda line: line.split(","))\
.map(lambda tokens: (tokens[0], tokens[1], tokens[2])).cache()

full_movies_titles = full_movies.map(lambda r: (int(r[0]), r[1]))
full_movies_titles.take(3)

#print "There are %s movies in the complete dataset" % (full_movies_titles.count())


[(1, u'Toy Story (1995)'),
 (2, u'Jumanji (1995)'),
 (3, u'Grumpier Old Men (1995)')]

In [15]:
def get_counts_and_averages(id_rating_tuple):
    # id_rating_tuple is (movieId, iterable of ratings); returns (movieId, (count, average)).
    nrating = len(id_rating_tuple[1])
    return id_rating_tuple[0], (nrating, float(sum(x for x in id_rating_tuple[1])) / nrating)

movieId_ratings_RDD = (full_ratings.map(lambda x: (x[1], x[2])).groupByKey())
movieId_avg_ratings_RDD = movieId_ratings_RDD.map(get_counts_and_averages)
movie_rating_counts_RDD = movieId_avg_ratings_RDD.map(lambda x: (x[0], x[1][0]))
movie_rating_counts_RDD.take(3)

[(122880, 10), (147460, 1), (131080, 53)]
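
The per-movie count and average can also be computed by keeping a running (count, sum) per movie instead of first collecting every rating for a movie. The sketch below shows that idea in plain Python with made-up ratings. In Spark, the same pattern is usually expressed with `reduceByKey` or `aggregateByKey`, which avoid materializing all values of a key the way `groupByKey` does.

```python
def counts_and_averages(movie_ratings):
    """Return {movieId: (count, average)} from (movieId, rating) pairs."""
    totals = {}  # movieId -> (count, sum of ratings)
    for movie_id, rating in movie_ratings:
        count, total = totals.get(movie_id, (0, 0.0))
        totals[movie_id] = (count + 1, total + rating)
    return {m: (c, t / c) for m, (c, t) in totals.items()}

sample = [(1, 4.0), (1, 5.0), (2, 3.0)]  # made-up ratings
print(counts_and_averages(sample))
```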

### Add new user ratings

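Now let's make up some ratings for a new user and put them in a new RDD. We give the user ID 0 because it is not assigned in the existing dataset. Note that MovieLens ratings range from 0.5 to 5 stars, while the sample ratings below go up to 9, so the predictions for this user come out on that larger scale as well.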

In [16]:
new_user_Id = 0

# userId, movieId, rating
new_user_ratings = [
    (0, 1, 9), # Toy Story
    (0, 32, 7), # Twelve Monkeys
    (0, 296, 8), # Pulp Fiction
    (0, 162376, 8), # Stranger Things
    (0, 159858, 6), # The Conjuring 2
    (0, 152081, 9), # Zootopia
    (0, 71899, 9), # Mary and Max
    (0, 68793, 8), # Night at the Museum
    (0, 68137, 5), # Nana
    (0, 63436, 4), # Saw V
    (0, 60397, 9), # Mamma Mia!
    (0, 33794, 7), # Batman Begins
    (0, 122886, 6), # Star Wars: The Force Awakens
    (0, 1721, 9), # Titanic
    (0, 2011, 7), # Back to the Future Part II
]
new_user_ratings_RDD = sc.parallelize(new_user_ratings)
print "New user ratings: %s" % new_user_ratings_RDD.take(15)

New user ratings: [(0, 1, 9), (0, 32, 7), (0, 296, 8), (0, 162376, 8), (0, 159858, 6), (0, 152081, 9), (0, 71899, 9), (0, 68793, 8), (0, 68137, 5), (0, 63436, 4), (0, 60397, 9), (0, 33794, 7), (0, 122886, 6), (0, 1721, 9), (0, 2011, 7)]
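
MovieLens ratings run from 0.5 to 5.0 stars, whereas the sample ratings above look like a 10-point scale. If the new user's ratings should live on the same scale as the rest of the data, they could be halved before the union. The sketch below assumes that a 10-point scale was intended; `to_movielens_scale` is a hypothetical helper.

```python
def to_movielens_scale(rating_tuple):
    """Map a (userId, movieId, rating) tuple from a 10-point to a 5-star scale."""
    user_id, movie_id, rating = rating_tuple
    return user_id, movie_id, rating / 2.0

ten_point = [(0, 1, 9), (0, 32, 7), (0, 63436, 4)]
print([to_movielens_scale(r) for r in ten_point])
```

With rescaled input, the predicted ratings for the new user would fall in the same range as everyone else's, which makes them easier to interpret.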


Then we add the new user's ratings to the ratings dataset.

In [17]:
full_data_with_new_rating_RDD = full_ratings.union(new_user_ratings_RDD)

And re-train the model:

In [18]:
from time import time

t0 = time()
new_model = ALS.train(full_data_with_new_rating_RDD, best_rank, seed=seed, 
                      iterations=iterations, lambda_ = regularization_parameter)
tt = time() - t0
print "New model trained in %s seconds." % round(tt, 3)

New model trained in 157.223 seconds.


### Get recommendations

Now, it's time to get some recommendations! We will get an RDD with all the movies the new user hasn't rated yet. 

In [19]:
new_user_ratings_ids = set(r[1] for r in new_user_ratings)
# Only keep the movies the user hasn't rated yet.
# Movie IDs in full_movies are strings, so convert them before comparing with the integer IDs above.
new_user_unrated_movies_RDD = (full_movies.filter(lambda x: int(x[0]) not in new_user_ratings_ids)
                               .map(lambda x: (new_user_Id, int(x[0]))))

#Use the input RDD, new_user_unrated_movies_RDD, with new_model.predictAll() to predict new ratings for the movie
new_user_recommendations_RDD = new_model.predictAll(new_user_unrated_movies_RDD)
new_user_recommendations_RDD.take(3)

[Rating(user=0, product=116688, rating=2.7996465831532102),
 Rating(user=0, product=32196, rating=6.095657455529821),
 Rating(user=0, product=138744, rating=4.62930823409996)]
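
When filtering by ID, the types on both sides have to agree: IDs read from CSV text are strings, while the new user's ratings use integers. The small Spark-free sketch below shows how a mismatch silently disables the filter; the two movie rows are sample values.

```python
rated_ids = {1, 32, 296}                      # integer movie IDs
movies = [('1', 'Toy Story (1995)'),          # IDs as parsed from CSV text
          ('2', 'Jumanji (1995)')]

# Comparing strings with integers never matches, so nothing is filtered out.
unfiltered = [m for m in movies if m[0] not in rated_ids]
# Converting the ID first gives the intended result.
filtered = [m for m in movies if int(m[0]) not in rated_ids]

print('%d %d' % (len(unfiltered), len(filtered)))
```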

Now that we have all the predictions, we can pick the top 25 movies. To do that, we join the predictions with the movies RDD to get the titles, and with the rating counts so that we can require a minimum number of ratings. First, let's take a look at the joined results.

In [20]:
new_user_recommendations_rating_RDD = new_user_recommendations_RDD.map(lambda x: (x.product, x.rating))
new_user_recommendations_rating_title_and_count_RDD = new_user_recommendations_rating_RDD.join \
(full_movies_titles).join(movie_rating_counts_RDD)

new_user_recommendations_rating_title_and_count_RDD.take(3)

[(52224, ((2.2741304360240884, u'Turn of Faith (2002)'), 2)),
 (8194, ((6.377721071958565, u'Baby Doll (1956)'), 93)),
 (130730, ((6.641372912166924, u'Lemon Popsicle (1978)'), 3))]

In [22]:
new_user_recommendations_rating_title_and_count_RDD = \
    new_user_recommendations_rating_title_and_count_RDD.map(lambda r: (r[1][0][1], r[1][0][0], r[1][1]))
# Now each row is flattened to (title, predicted_rating, num_ratings)

Now we take the highest-rated recommendations for the new user, filtering out movies with fewer than 25 ratings.

In [23]:
top_movies = new_user_recommendations_rating_title_and_count_RDD. \
    filter(lambda r:r[2] >=25).takeOrdered(25, key = lambda x: -x[1])

print ('Top recommended movies (with at least 25 ratings): \n%s' % '\n'.join(map(str, top_movies)))

Top recommended movies (with at least 25 ratings): 
(u'Sense & Sensibility (2008)', 9.617256200119671, 31)
(u'Boys (2014)', 9.581908124785942, 41)
(u'North & South (2004)', 9.361230394901336, 318)
(u'"Very Potter Sequel', 9.329569886824654, 31)
(u'Piper (2016)', 9.30138577062134, 88)
(u'Bridegroom (2013)', 9.301033879213085, 26)
(u'Pride and Prejudice (1995)', 9.241373327810845, 2510)
(u'Long Way Round (2004)', 9.203669938324683, 29)
(u'Anne of Green Gables: The Sequel (a.k.a. Anne of Avonlea) (1987)', 9.190376397097724, 293)
(u'"Shawshank Redemption', 9.141456020001094, 84455)
(u'Anne of Green Gables (1985)', 9.127411772902109, 622)
(u"Schindler's List (1993)", 9.073768668021366, 63889)
(u'Wild China (2008)', 9.05449812547844, 91)
(u'"Very Potter Musical', 8.9969789880336, 75)
(u'Dylan Moran Live: What It Is (2009)', 8.985010211520505, 59)
(u"Into the Forest of Fireflies' Light (2011)", 8.944206628425253, 35)
(u'Planet Earth (2006)', 8.940593396410696, 193)
(u'Emma (2009)', 8.9027861
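
The filter-then-take-top-N step can be sketched without Spark using `heapq.nlargest`, which plays a role similar to `takeOrdered` with a negated key. The rows below are made up, and `top_rated` is a hypothetical helper.

```python
import heapq

def top_rated(rows, n, min_count):
    """rows are (title, predicted_rating, rating_count) tuples."""
    popular = (row for row in rows if row[2] >= min_count)
    # nlargest keeps only n rows in memory while scanning the input.
    return heapq.nlargest(n, popular, key=lambda row: row[1])

rows = [('A', 9.1, 30), ('B', 9.8, 3), ('C', 8.7, 120), ('D', 9.4, 25)]
print(top_rated(rows, 2, 25))
```

Row `B` has the highest predicted rating but only 3 ratings, so the minimum-count filter removes it. That is the purpose of the threshold: it stops obscure titles with very few ratings from dominating the list.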

### Get individual ratings

We can also use this to get the predicted rating for a particular movie for a given user. To do this, we just need to pass a single entry with the movie we want to predict to the predictAll method.

In [24]:
my_movie = sc.parallelize([(0, 300)]) # Quiz Show (1994)
# Pass the single (userId, movieId) pair, not the RDD of all unrated movies.
individual_movie_rating_RDD = new_model.predictAll(my_movie)
individual_movie_rating_RDD.take(1)

This returns a single Rating entry for user 0 and movie 300. A low predicted rating would suggest that this user may not like the movie.

### Save the model

We might want to persist this model for later use in our recommendations. Although a new model needs to be trained every time we have new user ratings, it is still worth saving the current one. We can also save some of the RDDs we have generated, especially those that take a long time to process.

In [32]:
from pyspark.mllib.recommendation import MatrixFactorizationModel

model_path = os.path.join('models', 'movie_lens_als')

#model.save(sc, model_path)
#new_model = MatrixFactorizationModel.load(sc, model_path)