Lab : Recommendations with MLlib
===================================

### Introduction

This lab explores a well-known dataset from the Czech dating website libimseti.cz. We'll just call it the "dating" dataset. :)

Normally we talk of users and items as different entities, but in dating websites we relate users to one another.

In our example, we ignore the gender and orientation of each user when making recommendations. The dating dataset does include a file that identifies the gender of each participant, but for simplicity we don't use it here. This isn't as bad as it sounds: most users likely rate profiles of only one gender, so their recommendations will tend to come from that same gender. Naturally, there are always exceptions.

The checked-in version is a tiny subset of the actual dataset: only the first 9999 users are included. Furthermore, ratings that fall outside the subset are ignored, so a good portion of users have no data.

In [None]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
sc = spark.sparkContext

In [None]:
# Import only the names this lab uses instead of a wildcard import
from pyspark.mllib.recommendation import ALS, Rating

### Step 1 : Inspect Data
* Sample Data : [/data/dating/sample.txt](/data/dating/sample.txt)
* Rating data file : [/data/dating/medium/ratings.dat](/data/dating/medium/ratings.dat)

(Browsers may not display the data properly; open the files in a text editor instead.)

### Step 2 : Create Rating Object for the Data

In [None]:
data = sc.textFile("../data/dating/medium/ratings.dat") 

For the dating website:
* Users = Users
* Products = Other users
* Rating = Rating given by one user to another user

In [None]:
splitted_data = data.map(lambda x: x.split(","))
# Rating represents a (user, product, rating) tuple.
# Cast the fields explicitly: IDs are integers, the rating is a float.
ratings = splitted_data.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
# ratings.collect()
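
To see what the two `map` calls do to a single line, here is a minimal plain-Python sketch that runs without Spark. It assumes each line has the form `user,product,rating`, as the `split(",")` call above implies. `ParsedRating`, `parse_rating`, and the sample rows are made up for illustration; they are not part of MLlib or of the dataset.

```python
from collections import namedtuple

# Hypothetical local stand-in for pyspark.mllib.recommendation.Rating
ParsedRating = namedtuple("ParsedRating", ["user", "product", "rating"])

def parse_rating(line):
    """Turn a 'user,product,rating' line into a typed record."""
    user, product, rating = line.strip().split(",")
    # IDs become integers, the rating becomes a float
    return ParsedRating(int(user), int(product), float(rating))

sample_lines = ["1,133,8", "1,720,6"]  # made-up rows in the assumed format
parsed = [parse_rating(line) for line in sample_lines]
print(parsed[0])
```

Casting at parse time keeps the types explicit, so later steps do not have to guess whether a field is still a string.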

In [None]:
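# rank = number of latent factors, iterations = number of ALS passes,
# lambda_ = regularization strength
model = ALS.train(ratings, rank=10, iterations=5, lambda_=0.01)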
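
As a rough intuition for `rank`: a matrix factorization model stores one vector of `rank` latent factors per user and one per product, and it predicts a rating as the dot product of the two. The sketch below is plain Python with made-up factor values and `rank = 3`. It only illustrates the idea and is not how Spark computes or stores the factors internally.

```python
# Made-up latent factors for one user and one product (rank = 3 here)
user_factors = [0.5, 1.0, 2.0]
product_factors = [2.0, 1.0, 1.5]

def predict_rating(u, p):
    """Predicted rating = dot product of the two factor vectors."""
    return sum(a * b for a, b in zip(u, p))

predicted = predict_rating(user_factors, product_factors)
print(predicted)
```

A larger `rank` gives the model more factors to fit the ratings with, at the cost of more computation and a higher risk of overfitting.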

### Step 3: Transform the Rating object to a tuple of User, Product

In [None]:
# Get rid of rating to test model's effectiveness
# TODO: TRANSFORM Rating -> Tuple of (user, product)
# (i.e., get rid of the rating)
userItems = ???
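
As a hint about the target shape only, here is a plain-Python sketch (no Spark) that drops the rating field from a few records. `RatingRow` and the sample rows are hypothetical stand-ins; your answer above should do the equivalent on the `ratings` RDD.

```python
from collections import namedtuple

# Hypothetical stand-in for the Rating type, for illustration only
RatingRow = namedtuple("RatingRow", ["user", "product", "rating"])

sample_ratings = [RatingRow(1, 133, 8.0), RatingRow(1, 720, 6.0)]  # made-up rows

# Keep the first two fields and drop the third
pairs = [(r.user, r.product) for r in sample_ratings]
print(pairs)
```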

### Step 4: Use the predictAll method to map the output to User, Product

In [None]:
# Do a test prediction
# TODO: call model.predictAll() on userItems, then map each predicted Rating
# to the shape ((user, product), rating)
predict = ???
recs = predict.map(???)

In [None]:
ratingsAndRecs = ratings.map(lambda x: ((int(x[0]), int(x[1])), int(x[2]))).join(recs)
mse = ratingsAndRecs.map(lambda x: (x[1][0] - x[1][1]) ** 2).mean()
print(mse)
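
The join-then-average logic above can be checked by hand on a tiny example. This is a plain-Python sketch with made-up values: dictionaries keyed by `(user, product)` stand in for the two RDDs, and a key intersection stands in for the inner join.

```python
# Made-up actual and predicted ratings, both keyed by (user, product)
actual = {(1, 133): 8.0, (1, 720): 6.0, (2, 133): 3.0}
predicted_by_key = {(1, 133): 7.0, (1, 720): 6.5, (3, 999): 5.0}

# Inner join: keep only keys present on both sides
joined = {
    key: (actual[key], predicted_by_key[key])
    for key in actual.keys() & predicted_by_key.keys()
}

# Mean squared error over the joined pairs
squared_errors = [(a - p) ** 2 for a, p in joined.values()]
mse = sum(squared_errors) / len(squared_errors)
print(mse)
```

Pairs that appear on only one side drop out of the join, so they do not contribute to the error.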

### Step 5 : Find recommendations for Users based on ratings

In [None]:
# recommendProductsForUsers returns recommendations for all users
# The number of recommendations wanted per user is passed as the argument
recsForEachUser = model.recommendProductsForUsers(4)
recsForEachUser.collect()

In [None]:
# recommendProducts returns recommendations for one particular user
# parameters: (user, numberOfRecommendationsNeeded)
recsForUser = model.recommendProducts(892, 4)
print(recsForUser)

# Beware: some user IDs are not present in the data (e.g. 3)

### Step 6: Running on some of your own data

Create a file called personalratings.txt and fill it with some test ratings as your preferences.
We have included a sample file, /data/dating/sample.txt, that you can use as a reference.
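
One way to produce such a file is sketched below. It assumes the same comma-separated `user,product,rating` layout that the ratings file is parsed with earlier in this lab; check sample.txt for the actual layout before relying on it. The user ID `0`, the product IDs, and the ratings are made up, and the file is written to a temporary directory rather than to the lab's data folder.

```python
import os
import tempfile

# Made-up preferences for a hypothetical new user ID 0: (user, product, rating)
my_ratings = [(0, 133, 9), (0, 720, 2), (0, 971, 7)]

# Written to a temporary directory here; in the lab, pick your own path
path = os.path.join(tempfile.mkdtemp(), "personalratings.txt")
with open(path, "w") as f:
    for user, product, rating in my_ratings:
        f.write(f"{user},{product},{rating}\n")

# Read the file back to confirm it parses the same way as the ratings file
with open(path) as f:
    round_trip = [tuple(int(v) for v in line.strip().split(",")) for line in f]
print(round_trip)
```

Because the layout matches, the file can then be loaded with `sc.textFile` and parsed exactly like the main ratings data.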
    

In [3]:
#### Sample Output:
# model.recommendProducts(4, 2)
# [Rating(user=4, product=7, rating=7.997956670993958), Rating(user=4, product=6, rating=5.9996782303607565), Rating(user=4, product=5, rating=4.757658554726618)]