# Introduction
This notebook provides a high-level technical introduction to combining Apache Spark with MongoDB, enabling developers and data engineers to bring sophisticated real-time analytics and machine learning to live, operational data.

The following illustrates how to use MongoDB and Spark with an example application that uses Spark's alternating least squares (ALS) implementation to generate a list of movie recommendations for a user.

The following tasks will be performed:
* Read data from MongoDB into Spark.
* Run the MongoDB Connector for Spark in Spark.
* Use the machine learning ALS library in Spark to generate a set of personalized movie recommendations for a given user.
* Write the recommendations back to MongoDB so they are accessible to applications.

## Importing the required libraries

In [5]:
# Import Libraries
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator 
from pyspark.ml.recommendation import ALS
from pyspark.sql import functions
from pyspark.sql.types import DoubleType

# Read data from MongoDB
The Spark Connector can be configured to read from MongoDB in a number of ways.
In this notebook we configure the SparkSession object directly through its options. The SparkSession reads from the "ratings" collection in the "recommendation" database.

In [3]:
# Open Spark session
spark = SparkSession.\
    builder.\
    appName("pyspark-notebook2").\
    master("spark://spark-master:7077").\
    config("spark.executor.memory", "1g").\
    config("spark.mongodb.input.uri","mongodb://mongo1:27017,mongo2:27018,mongo3:27019/recommendation.ratings?replicaSet=rs0").\
    config("spark.mongodb.output.uri","mongodb://mongo1:27017,mongo2:27018,mongo3:27019/recommendation.ratings?replicaSet=rs0").\
    config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.0").\
    getOrCreate()
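
The two connection strings above share one shape: a comma-separated host list, then `database.collection`, then the replica set name. As a minimal sketch, a small helper can assemble them so that the host list is written only once. The name `build_mongo_uri` is hypothetical and is not part of PySpark or the connector.

```python
def build_mongo_uri(hosts, database, collection, replica_set):
    """Assemble a MongoDB connection string for a replica set.

    Hypothetical helper for illustration; it does no validation or escaping.
    """
    host_list = ",".join(hosts)
    return f"mongodb://{host_list}/{database}.{collection}?replicaSet={replica_set}"


# Hosts, database, and replica set name as used in this notebook.
HOSTS = ["mongo1:27017", "mongo2:27018", "mongo3:27019"]

ratings_uri = build_mongo_uri(HOSTS, "recommendation", "ratings", "rs0")
print(ratings_uri)
```

The resulting string could then be passed to both `spark.mongodb.input.uri` and `spark.mongodb.output.uri` in the session builder.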

One of the most attractive features of MongoDB is support for flexible schemas, which enables you to store a variety of JSON models in the same collection. A consequence of flexible schemas is that there is no defined schema for a given collection as there would be in an RDBMS. Since DataFrames and Datasets require a schema, the Spark Connector will automatically infer the schema by randomly sampling documents from the database. It then assigns that inferred schema to the DataFrame.
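
To make the idea of sampling-based inference concrete, here is a rough pure-Python sketch. It is not the connector's actual algorithm, which runs inside MongoDB and Spark and also has to reconcile conflicting types; the function name `infer_schema` and the sample documents are made up for illustration.

```python
import random


def infer_schema(documents, sample_size=1000, seed=0):
    """Infer {field name: set of Python type names} from a random sample.

    Illustrative only: it records every type it sees per field and makes
    no attempt to pick a single compatible type.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(documents))
    schema = {}
    for doc in rng.sample(documents, k):
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Made-up documents shaped like the "ratings" collection.
docs = [
    {"userId": 1, "movieId": 10, "rating": 5.0, "timestamp": 1000000000},
    {"userId": 1, "movieId": 20, "rating": 4.5},  # no timestamp field
    {"userId": 2, "movieId": 30, "rating": 4},    # rating stored as an int
]
print(infer_schema(docs))
```

Because only a sample is inspected, a field that appears in few documents can be missed when the sample is small, which is one reason to declare the schema explicitly when the structure is known.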

In [4]:
# Loading data into Spark
ratings_df = spark.read.format("mongo").load()

The Spark Connector's ability to infer schema through document sampling is a nice convenience, but if you know your document structure you can assign the schema explicitly and avoid the need for sampling queries. The following cell caches the DataFrame and prints the schema that the connector inferred.

In [6]:
# Inspect the inferred schema of the dataset
ratings_df.cache() 
ratings_df.printSchema()

root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)
 |-- userId: integer (nullable = true)
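
If the document structure is known in advance, the schema printed above could instead be declared explicitly. The sketch below builds a DDL schema string, which `DataFrameReader.schema` accepts, from the field names and types in that output. The names `RATINGS_FIELDS`, `to_ddl`, and `load_ratings_with_schema` are hypothetical, and the sketch assumes the connector honors a user-supplied schema on read.

```python
# Field names and Spark SQL types copied from the printSchema() output above.
RATINGS_FIELDS = [
    ("_id", "STRUCT<oid: STRING>"),
    ("movieId", "INT"),
    ("rating", "DOUBLE"),
    ("timestamp", "INT"),
    ("userId", "INT"),
]


def to_ddl(fields):
    """Join (name, type) pairs into a DDL schema string.

    Names are backtick-quoted so that words such as `timestamp`
    are treated as column names rather than type keywords.
    """
    return ", ".join(f"`{name}` {dtype}" for name, dtype in fields)


ratings_ddl = to_ddl(RATINGS_FIELDS)
print(ratings_ddl)


def load_ratings_with_schema(spark):
    """Hypothetical helper: read the collection with an explicit schema.

    `spark` is the SparkSession created earlier in this notebook.
    """
    return spark.read.format("mongo").schema(ratings_ddl).load()
```

Calling `load_ratings_with_schema(spark)` in place of the plain `load()` would give a DataFrame with the same columns, without the sampling step.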



Let's now look at the data that makes up the dataset.

In [8]:
# Perform a descriptive analysis
ratings_df.describe().toPandas().transpose()

summary    count   mean                  stddev                min        max
movieId    100836  19435.2957177992      35530.9871987004      1          193609
rating     100836  3.501556983616962     1.0425292390606307    0.5        5.0
timestamp  100836  1.2059460873684695E9  2.1626103599513054E8  828124615  1537799250
userId     100836  326.12756356856676    182.6184914634992     1          610
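
For readers unfamiliar with these summary statistics, the sketch below computes the same five quantities for a single column in plain Python. The function name `describe_column` and the sample values are made up; the standard deviation is the sample standard deviation (dividing by n - 1), which is what Spark's `stddev` reports.

```python
import math


def describe_column(values):
    """Return count, mean, sample standard deviation, min, and max."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation: divide by n - 1, not n.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {
        "count": n,
        "mean": mean,
        "stddev": math.sqrt(variance),
        "min": min(values),
        "max": max(values),
    }


# Made-up ratings, not taken from the dataset.
sample_ratings = [0.5, 3.0, 3.5, 4.0, 5.0]
stats = describe_column(sample_ratings)
print(stats)
```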


# Using Machine Learning libraries
We use Spark's ALS library to train a model on our dataset and predict ratings for a user. The example is presented as a five-step process. How ALS generates predictions is beyond the scope of this notebook; see the Spark documentation for details.

## Creation of the Machine Learning Model
For training purposes, the complete dataset must be split into smaller subsets, typically training, validation, and test data. In this case, 80% of the data is used for training and the remaining 20% is held out to test the model.

In [9]:
# Splitting the dataset into 80/20 for training and test
splits = ratings_df.randomSplit([0.8,0.2]) 
train_df = splits[0]
test_df = splits[1]
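
A random split of this kind assigns each row to a subset independently, so the resulting sizes are only approximately 80/20, and the assignment changes between runs unless a seed is fixed (`randomSplit` accepts an optional `seed` argument for that purpose). The pure-Python sketch below illustrates the idea; the function name `random_split` is hypothetical and this is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability
    proportional to that split's weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for idx, bound in enumerate(bounds):
            if draw < bound:
                splits[idx].append(row)
                break
        else:
            splits[-1].append(row)  # guard against floating-point round-off
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split, and therefore the reported error metric, reproducible across runs.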

The recommendation model is built on the training data.

In [10]:
# Build the recommendation model using ALS on the training data
als = ALS().setMaxIter(5).setRegParam(0.01).setUserCol("userId").\
    setItemCol("movieId").setRatingCol("rating")
model = als.fit(train_df)
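
To give a feel for what "alternating least squares" means, here is a deliberately tiny sketch with a single latent factor per user and per item. It alternates between solving for the user factors with the item factors fixed and vice versa, each step being a one-dimensional regularized least-squares problem with a closed-form solution. This is an illustration under simplifying assumptions (rank 1, a plain L2 penalty, made-up data, a hypothetical function name `als_rank1`); Spark's ALS is distributed, uses a higher rank by default, and scales its regularization differently.

```python
def als_rank1(ratings, n_iters=20, reg=0.01):
    """Fit one factor per user and per item to (user, item, rating) triples."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(n_iters):
        # Fix item factors; solve each user's 1-D ridge regression.
        for u in users:
            rated = [(item_f[i], r) for uu, i, r in ratings if uu == u]
            user_f[u] = sum(y * r for y, r in rated) / (reg + sum(y * y for y, _ in rated))
        # Fix user factors; solve each item's 1-D ridge regression.
        for i in items:
            rated = [(user_f[u], r) for u, ii, r in ratings if ii == i]
            item_f[i] = sum(x * r for x, r in rated) / (reg + sum(x * x for x, _ in rated))
    return user_f, item_f


# Made-up ratings consistent with a rank-1 model; ("b", "q") is held out
# and its true value under that model would be 2.0.
train = [("a", "p", 2.0), ("a", "q", 1.0), ("b", "p", 4.0)]
user_f, item_f = als_rank1(train)
prediction = user_f["b"] * item_f["q"]
print(round(prediction, 2))
```

The predicted rating for an unseen (user, item) pair is the product of the two learned factors, which is the rank-1 analogue of the dot product used with higher ranks.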

To check the accuracy of the trained model, we compute the root-mean-square error (RMSE) on the test data.

In [12]:
# Setting cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
model.setColdStartStrategy("drop") 
predictions = model.transform(test_df)
evaluator = RegressionEvaluator().setMetricName("rmse").setLabelCol("rating").\
    setPredictionCol("prediction")
rmse = evaluator.evaluate(predictions) 
print("Root-mean-square error = ", rmse)

Root-mean-square error =  1.1018619243577332
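
The sketch below shows how RMSE is computed and why the cold start strategy matters: a single NaN prediction (for a user or movie the model never saw during training) turns the whole metric into NaN, whereas dropping those rows first gives a finite value. The function names and the sample pairs are made up for illustration.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    squared = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared) / len(squared))


def drop_cold_start(pairs):
    """Drop pairs whose prediction is NaN, mimicking the "drop" strategy."""
    return [(a, p) for a, p in pairs if not math.isnan(p)]


# Made-up (actual, predicted) pairs; the last one has no usable prediction.
pairs = [(4.0, 2.0), (2.0, 4.0), (3.0, float("nan"))]

print(math.isnan(rmse(pairs)))       # the NaN propagates into the metric
print(rmse(drop_cold_start(pairs)))  # finite once NaN rows are dropped
```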


Let's display the predictions.

In [13]:
# Display predictions
predictions.show()

+--------------------+-------+------+----------+------+----------+
|                 _id|movieId|rating| timestamp|userId|prediction|
+--------------------+-------+------+----------+------+----------+
|{60c0d3fc217b5825...|    471|   4.0| 843491793|   133| 2.2460139|
|{60c0d3fd217b5825...|    471|   2.0| 941558175|   597|  4.226793|
|{60c0d3fd217b5825...|    471|   3.0| 833530187|   436| 3.2405984|
|{60c0d3fd217b5825...|    471|   3.0| 874415126|   372| 3.2151067|
|{60c0d3fc217b5825...|    471|   4.0|1111624874|   218| 2.0807343|
|{60c0d3fd217b5825...|    471|   4.0|1479544381|   610| 3.2153044|
|{60c0d3fd217b5825...|    471|   4.0|1178980875|   448|  4.650613|
|{60c0d3fc217b5825...|    471|   4.0|1043175564|   312| 2.2405844|
|{60c0d3fd217b5825...|    471|   5.0| 965425364|   469| 2.0484393|
|{60c0d3fd217b5825...|    471|   5.0| 961514069|   414| 2.7908955|
|{60c0d3fd217b5825...|    471|   1.5|1117161794|   608| 1.1752882|
|{60c0d3fc217b5825...|    471|   4.5|1109409455|   260|  3.642

Finally, we write the model's predictions back to MongoDB so that applications can use them as recommendations.
To do so, we first clean up the generated DataFrame before writing it back to MongoDB.

In [14]:
# Drop the columns that are not needed in the documents written back to MongoDB
columns = ["_id", "timestamp"] 
docs = predictions.drop(*columns)

# Need to cast predictions from FloatType to Double as Float is not a BSON type in MongoDB
docs = docs.withColumn("prediction", docs.prediction.cast(DoubleType())) 
docs.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- userId: integer (nullable = true)
 |-- prediction: double (nullable = false)
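
The DataFrame above holds one predicted rating per (user, movie) pair from the test set. Turning such predictions into a fixed-size list per user amounts to grouping by user and keeping the highest-scoring movies; in PySpark, `ALSModel.recommendForAllUsers(numItems)` produces top-N lists directly from the model. The pure-Python sketch below shows the grouping logic on made-up data; the function name `top_n_per_user` is hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_per_user(predictions, n=10):
    """Map each user to the movie ids of their n highest predicted ratings.

    `predictions` is an iterable of (user_id, movie_id, predicted_rating).
    """
    by_user = defaultdict(list)
    for user_id, movie_id, score in predictions:
        by_user[user_id].append((score, movie_id))
    return {
        user_id: [movie_id for _, movie_id in heapq.nlargest(n, scored)]
        for user_id, scored in by_user.items()
    }


# Made-up predictions, not taken from the model output.
preds = [(1, 10, 4.5), (1, 20, 3.0), (1, 30, 4.9), (2, 10, 2.0)]
top = top_n_per_user(preds, n=2)
print(top)
```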



Let's now write the recommendations to MongoDB.

In [15]:
# Write recommendations to MongoDB
docs.write.format("mongo").mode("overwrite").option("database", "recommendation").option("collection", "recommendations").save()

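Let's check the result by reading the ten highest-rated entries for a specific user from MongoDB. This read relies on the session's "spark.mongodb.input.uri", which points at the "ratings" collection; adding .option("collection", "recommendations") would read the collection written above instead.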

In [16]:
# Read recommendations for a user
pipeline = "[{'$match': {'userId': 1}}, {'$project': {'_id': 0, 'timestamp': 0}}, {'$sort': {'rating': -1}}, {'$limit': 10}]"
aggPipelineDF = spark.read.format("mongo").option("pipeline", pipeline).option("partitioner", "MongoSinglePartitioner").load()
aggPipelineDF.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|    333|   5.0|     1|
|    231|   5.0|     1|
|    151|   5.0|     1|
|     50|   5.0|     1|
|    101|   5.0|     1|
|    362|   5.0|     1|
|    260|   5.0|     1|
|    457|   5.0|     1|
|    157|   5.0|     1|
|     47|   5.0|     1|
+-------+------+------+
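
The aggregation pipeline is passed to the connector as a JSON string. Rather than writing that string by hand, it can be built as a Python list and serialized with `json.dumps`, which makes the user id and limit easy to parameterize. The helper names below are hypothetical. The second function is a sketch of reading from the "recommendations" collection written earlier, assuming the connector's `collection` read option overrides the collection named in the input URI.

```python
import json


def top_rated_pipeline(user_id, limit=10):
    """Build the aggregation stages used above for one user."""
    return [
        {"$match": {"userId": user_id}},
        {"$project": {"_id": 0, "timestamp": 0}},
        {"$sort": {"rating": -1}},
        {"$limit": limit},
    ]


pipeline_json = json.dumps(top_rated_pipeline(1))
print(pipeline_json)


def read_recommendations(spark, user_id, limit=10):
    """Hypothetical helper: read one user's documents from "recommendations".

    `spark` is the SparkSession created earlier in this notebook.
    """
    stages = [
        {"$match": {"userId": user_id}},
        {"$sort": {"prediction": -1}},
        {"$limit": limit},
    ]
    return (
        spark.read.format("mongo")
        .option("collection", "recommendations")
        .option("pipeline", json.dumps(stages))
        .option("partitioner", "MongoSinglePartitioner")
        .load()
    )
```

Sorting by "prediction" rather than "rating" in the second helper orders the documents by the model's score instead of the user's original rating.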

