# In this notebook, we'll take some first steps with MLlib, Spark's machine learning library


http://spark.apache.org/docs/latest/mllib-guide.html

## MLlib enables machine learning at scale
### It is divided into two packages:

* spark.mllib contains the original API built on top of RDDs.
* spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

### We will try to predict user ratings based on other user ratings (collaborative filtering!). More specifically, we will:
* Load user ratings data,
* Separate it into training and test sets,
* Train an Alternating Least Squares model,
* Make predictions on the test set,
* Compare the predictions to the truth,
* Produce a quantitative goodness-of-fit measure.

In [1]:
from pyspark import StorageLevel  # used by persist() when loading the ratings
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.recommendation import Rating

In [2]:
CORES_PER_NODE = 2
NUM_WORKERS = 4
REP_FACTOR = 4

#### Create a DataFrame from the csv while repartitioning and persisting

In [3]:
# Read in the ratings file (fromUserId, toUserId, rating).  These ratings are 0-9.
ratings_raw_DF = sqlContext.read.format("com.databricks.spark.csv") \
                           .options(header="false") \
                           .load("s3n://insight-spark-after-dark/ratings.csv.gz") \
                           .repartition(CORES_PER_NODE*NUM_WORKERS*REP_FACTOR)\
                           .persist(StorageLevel.MEMORY_AND_DISK_SER)
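The argument to `repartition()` is plain arithmetic on the three constants defined earlier. As a quick check, assuming `REP_FACTOR` is meant as the number of partitions per core (the notebook does not say so explicitly):

```python
# Constants copied from the notebook cell above.
CORES_PER_NODE = 2
NUM_WORKERS = 4
REP_FACTOR = 4  # assumed meaning: partitions per core

total_cores = CORES_PER_NODE * NUM_WORKERS
num_partitions = total_cores * REP_FACTOR

print("total cores: %d" % total_cores)
print("partitions:  %d" % num_partitions)
```

Having a few partitions per core helps keep every core busy when tasks finish at different times. Because gzip files cannot be split, the compressed file is likely read as a single partition first, which is one reason to repartition right after loading.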

To run SparkSQL queries against our DataFrame, we first register it as a temporary table. The DataFrame method registerTempTable() does this; we name the table 'ratings_raw_tbl'.

In [4]:
# Register the ratings_raw DataFrame as a temp table - this allows us to run the SparkSQL queries
ratings_raw_DF.registerTempTable("ratings_raw_tbl")

In [5]:
# Trigger the job and load the data by calling an action.
# To save time, uncomment the next line to work with a 5% sample instead.
# ratings_raw_DF = ratings_raw_DF.sample(False, .05, 20)
ratings_raw_DF.count()

17359346

In [6]:
#let's look at it! (fromUserId, toUserId, rating)
ratings_raw_DF.take(5)

[Row(C0=u'1', C1=u'3751', C2=u'7'),
 Row(C0=u'1', C1=u'19231', C2=u'5'),
 Row(C0=u'1', C1=u'36750', C2=u'2'),
 Row(C0=u'1', C1=u'51399', C2=u'7'),
 Row(C0=u'1', C1=u'70694', C2=u'8')]

In [7]:
# Cast the DataFrame to enforce a schema with (from_user_id, to_user_id, rating)
ratings_DF = sqlContext.sql("""
SELECT
    CAST(C0 as int) AS from_user_id,
    CAST(C1 as int) AS to_user_id,
    CAST(C2 as int) AS rating
FROM 
    ratings_raw_tbl
""")

In [8]:
# Create mllib.recommendation.Rating RDD from ratings DataFrame
ratings_RDD = ratings_DF.rdd.map(lambda r: Rating(r.from_user_id, r.to_user_id, r.rating))

ratings_RDD.take(5)

[Rating(user=1, product=3751, rating=7.0),
 Rating(user=1, product=19231, rating=5.0),
 Rating(user=1, product=36750, rating=2.0),
 Rating(user=1, product=51399, rating=7.0),
 Rating(user=1, product=70694, rating=8.0)]

### Task 1: Separate ratings data into training data (80%) and test data (20%)


In [9]:
# Task 1: Separate ratings data into training data (80%) and test data (20%)
split_ratings_RDD = ratings_RDD.randomSplit([0.8, 0.2])
train_ratings_RDD = split_ratings_RDD[0]
test_ratings_RDD = split_ratings_RDD[1]
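`randomSplit()` does not guarantee an exact 80/20 division, so the split sizes are only approximately proportional to the weights. The sketch below imitates the idea in plain Python by drawing one random number per record. It illustrates the concept and is not Spark's actual implementation; `random_split` is a hypothetical helper.

```python
import random

def random_split(records, weights, seed=None):
    """Assign each record to one bucket with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off
    buckets = [[] for _ in weights]
    for record in records:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                buckets[i].append(record)
                break
    return buckets

records = list(range(10000))
train, test = random_split(records, [0.8, 0.2], seed=42)
print("train: %d, test: %d" % (len(train), len(test)))
```

`randomSplit()` also accepts an optional `seed` argument, which is worth passing when the split should be reproducible between runs.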

## One algorithm to learn the features in a collaborative filtering model is Alternating Least Squares. We learn factor vectors for each from_user and each to_user. Read more here:
#### http://bugra.github.io/work/notes/2014-04-19/alternating-least-squares-method-for-collaborative-filtering/
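To make the alternating idea concrete, here is a minimal plain-Python sketch for the rank-1 case (a single factor per user). With one side held fixed, the regularized least-squares problem for each factor on the other side has a closed-form solution, so the algorithm simply alternates between the two sides. This illustrates the technique only; it is not MLlib's implementation, and the toy ratings and the names `als_rank1` and `squared_error` are made up for the example.

```python
def als_rank1(ratings, iterations=20, lam=0.01):
    """Rank-1 ALS on a list of (from_user, to_user, rating) triples."""
    from_ids = sorted(set(f for f, _, _ in ratings))
    to_ids = sorted(set(t for _, t, _ in ratings))
    u = dict((f, 1.0) for f in from_ids)  # one factor per from_user
    v = dict((t, 1.0) for t in to_ids)    # one factor per to_user

    for _ in range(iterations):
        # Hold v fixed; each u[f] minimizes sum (r - u*v)^2 + lam * u^2.
        for f in from_ids:
            num = sum(r * v[t] for ff, t, r in ratings if ff == f)
            den = lam + sum(v[t] ** 2 for ff, t, _ in ratings if ff == f)
            u[f] = num / den
        # Hold u fixed; solve for each v[t] in the same way.
        for t in to_ids:
            num = sum(r * u[f] for f, tt, r in ratings if tt == t)
            den = lam + sum(u[f] ** 2 for f, tt, _ in ratings if tt == t)
            v[t] = num / den
    return u, v

def squared_error(ratings, u, v):
    """Sum of squared differences between known and reconstructed ratings."""
    return sum((r - u[f] * v[t]) ** 2 for f, t, r in ratings)

# Toy data with exact rank-1 structure; the pair (3, 11) is held out.
ratings = [(1, 10, 2.0), (1, 11, 3.0),
           (2, 10, 4.0), (2, 11, 6.0),
           (3, 10, 6.0)]

u, v = als_rank1(ratings)
print("training squared error: %.4f" % squared_error(ratings, u, v))
print("prediction for held-out pair (3, 11): %.2f" % (u[3] * v[11]))
```

The toy ratings were built so that the held-out pair (3, 11) has a true value of 9, which gives a rough sense of how well the learned factors generalize.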

In [10]:
# Train the ALS model using the training data and various model hyperparameters
# Hyperparameters are:
# rank        number of features to use
# iterations  number of iterations of ALS (recommended: 10-20)
# lambda_     regularization factor (recommended: 0.01)
# blocks      level of parallelism to split computation into
# seed        random seed

model = ALS.train(train_ratings_RDD, 1, iterations=5, lambda_=0.01, blocks=10)

In [11]:
# Convert known test data to have only (from_user_id, to_user_id), 
# we can then feed this into our model to predict new ratings for 
# pairs of from_user_ids and to_user_ids
test_from_to_RDD = test_ratings_RDD.map(lambda r: (r[0], r[1]))

### Task 2: Test the model by predicting the ratings for the test dataset
#### Hint: the predictAll() method can be called with the RDD as input to return a new RDD


In [12]:
# Task 2: Test the model by predicting the ratings for the test dataset
test_predictions_RDD = model.predictAll(test_from_to_RDD)

test_predictions_RDD.take(5)

[Rating(user=129733, product=108150, rating=-8.476700963223664),
 Rating(user=54883, product=108150, rating=9.600600074043882),
 Rating(user=40749, product=108150, rating=10.414304912736725),
 Rating(user=90280, product=28730, rating=7.1463274083643),
 Rating(user=63293, product=28730, rating=7.942889531623109)]
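A matrix factorization model predicts a rating as the dot product of the two learned factor vectors, so nothing confines the result to the original rating scale; that is why values such as -8.48 and 10.41 appear above. A small sketch with made-up factor values (the numbers and the `predict` helper are hypothetical, not taken from the trained model):

```python
def predict(user_factors, product_factors):
    """Predicted rating = dot product of the two factor vectors."""
    return sum(a * b for a, b in zip(user_factors, product_factors))

# Made-up rank-1 factors, for illustration only.
print(predict([2.5], [3.0]))   # lands inside the rating scale
print(predict([2.5], [-3.4]))  # negative: below any valid rating
print(predict([3.5], [3.0]))   # above the top of the scale
```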

### Task 3: Prepare the predicted ratings and the known test ratings for comparison, keyed by (from, to)
#### Hint: We'd like to have an RDD with the form of ((from, to), rating), think map().


In [13]:
# Task 3: Prepare the predicted ratings and the known test ratings for comparison, keyed by (from, to)
test_predictions_RDD = test_predictions_RDD.map(lambda r: ((r[0], r[1]), r[2]))
test_actual_RDD = test_ratings_RDD.map(lambda r: ((r[0], r[1]), r[2]))

### Task 4: Join the known test ratings with the predicted ratings


In [14]:
# Task 4: Join the known test ratings with the predicted ratings
test_to_actual_ratings_RDD = test_actual_RDD.join(test_predictions_RDD)
test_to_actual_ratings_RDD.take(10)

[((10000, 117276), (5, 4.4307118385273725)),
 ((21987, 160209), (10, 6.940231459574136)),
 ((627, 96329), (4, 4.631136272745152)),
 ((124069, 80325), (10, 4.72382272067631)),
 ((20117, 103291), (6, -1.767513105468197)),
 ((95130, 30290), (4, 6.07095903305526)),
 ((37420, 32538), (5, 6.650561477611973)),
 ((16941, 99951), (8, 6.725620142730122)),
 ((16240, 159906), (4, 4.666624096116948)),
 ((31056, 58972), (6, 6.902654069384653))]
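The keying and joining steps can be pictured in plain Python with dictionaries. `join()` on pair RDDs is an inner join, so only keys present on both sides survive, and each result has the form `(key, (left_value, right_value))`. The sketch below uses three of the pairs from the output above plus one made-up unmatched key on each side. It assumes each key occurs at most once per side and mimics the semantics only; it is not how Spark executes a join.

```python
# (from, to, rating) triples; the first three pairs come from the output above.
actual = [(10000, 117276, 5), (627, 96329, 4), (31056, 58972, 6),
          (1, 2, 3)]        # made-up pair that has no prediction
predicted = [(10000, 117276, 4.4307118385273725),
             (627, 96329, 4.631136272745152),
             (31056, 58972, 6.902654069384653),
             (7, 8, 5.0)]   # made-up pair that has no known rating

# Task 3 equivalent: key each record by (from, to).
actual_by_key = dict(((f, t), r) for f, t, r in actual)
predicted_by_key = dict(((f, t), r) for f, t, r in predicted)

# Task 4 equivalent: inner join on the key -> (key, (actual, predicted)).
joined = [(key, (actual_by_key[key], predicted_by_key[key]))
          for key in sorted(actual_by_key) if key in predicted_by_key]

for row in joined:
    print(row)
```

The two unmatched keys drop out of the result, which is the behavior to keep in mind if the joined count is smaller than the test set.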

### Task 5: Evaluate the model using Mean Absolute Error (MAE) between the known test ratings and the predicted ratings


In [15]:
# Task 5: Evaluate the model using Mean Absolute Error (MAE) between the known test ratings and the predicted ratings
mean_absolute_rating_error = test_to_actual_ratings_RDD.map(lambda r: abs(r[1][0]-r[1][1]))\
                                                       .mean()

print(mean_absolute_rating_error)

2.53829025166
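Mean Absolute Error is the average of |actual - predicted| over all joined pairs. As a sanity check of the formula, the sketch below applies it to the ten (actual, predicted) pairs printed by `take(10)` earlier. It covers only those ten rows, so its result differs from the full test-set figure.

```python
# (actual, predicted) pairs copied from the take(10) output of the join.
pairs = [
    (5, 4.4307118385273725),
    (10, 6.940231459574136),
    (4, 4.631136272745152),
    (10, 4.72382272067631),
    (6, -1.767513105468197),
    (4, 6.07095903305526),
    (5, 6.650561477611973),
    (8, 6.725620142730122),
    (4, 4.666624096116948),
    (6, 6.902654069384653),
]

# Average absolute difference between known and predicted ratings.
mae = sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)
print("MAE over the ten sample pairs: %.4f" % mae)
```

On these ten rows the value comes out at roughly 2.39, with the single negative prediction contributing the largest individual error.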


# Next Steps

### Task 6: Instead of using only a training and a test set, construct a cross-validation set and use it to optimize over a vector of possible regularization parameters
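One way to approach this, sketched in plain Python so that it runs without a cluster: split the data three ways, fit one model per candidate regularization value on the training part, score each on the validation part, and keep the test part for a final, untouched estimate. The model here is a deliberately trivial stand-in (a shrunken global mean), the data is randomly generated, and every name is hypothetical. In the notebook, the fitting step would be a call to `ALS.train(..., lambda_=lam)` and scoring would reuse the MAE computation from Task 5.

```python
import random

def fit_shrunken_mean(train, lam):
    """Toy regularized model: the global mean, shrunk toward 0 by lam."""
    return sum(train) / (len(train) + lam)

def mae(model, ratings):
    """Mean absolute error of a constant prediction against known ratings."""
    return sum(abs(r - model) for r in ratings) / float(len(ratings))

def select_lambda(train, validation, candidates):
    """Return (best_lambda, scores) using validation MAE as the criterion."""
    scores = {}
    for lam in candidates:
        model = fit_shrunken_mean(train, lam)
        scores[lam] = mae(model, validation)
    best = min(scores, key=scores.get)
    return best, scores

rng = random.Random(0)
ratings = [rng.randint(1, 10) for _ in range(1000)]  # made-up ratings
rng.shuffle(ratings)
train, validation, test = ratings[:600], ratings[600:800], ratings[800:]

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, scores = select_lambda(train, validation, candidates)

# Only now touch the test set, once, with the chosen value.
final_model = fit_shrunken_mean(train, best_lam)
print("best lambda: %s" % best_lam)
print("test MAE:    %.3f" % mae(final_model, test))
```

With RDDs, the three-way split itself could come from passing three weights to `randomSplit()`, for example `[0.6, 0.2, 0.2]`.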


### Task 7: If you're considering applying machine learning at scale for tomorrow's project, explore MLlib and consider your options!
