# Loading the MovieLens dataset into RDDs

- Collaborative filtering is a technique for recommender systems in which users' ratings of, and interactions with, various products are used to recommend new ones. With the advent of machine learning and parallelized data processing, recommender systems have become widely popular in recent years and are used in many areas, including movies, music, news, books, research articles, search queries, and social tags. In this 3-part exercise, your goal is to develop a simple movie recommendation system with PySpark MLlib, using a subset of the [MovieLens 100k dataset](https://grouplens.org/datasets/movielens/100k/).

- In the first part, you'll load the MovieLens data (`ratings.csv`) into an RDD. Each line of the RDD is formatted as `userId`,`movieId`,`rating`,`timestamp`; you'll map each line to a `Rating` object (`userID`, `productID`, `rating`), dropping the timestamp column. Finally, you'll split the RDD into training and test RDDs.

- Remember, you have a `SparkContext` `sc` available in your workspace. The `file_path` variable (the path to the `ratings.csv` file) and the `ALS` class are also already available in your workspace.

## Instructions

- Load the `ratings.csv` dataset into an RDD.
- Split each line of the RDD using `,` as the delimiter.
- For each line of the RDD, use the `Rating()` class to create a tuple of `userID`, `productID`, `rating`.
- Randomly split the data into training data and test data (0.8 and 0.2).
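
The per-line transformation in the second and third steps can be tried without a Spark cluster. The sketch below is plain Python: `parse_rating_line` is a hypothetical helper name, and the `Rating` namedtuple is a stand-in assumed to mirror the three fields (`user`, `product`, `rating`) of `pyspark.mllib.recommendation.Rating`. The two sample lines are taken from the `ratings.csv` output shown in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.mllib.recommendation.Rating,
# which carries the same three fields: user, product, rating.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def parse_rating_line(line):
    """Turn 'userId,movieId,rating,timestamp' into a Rating, dropping the timestamp."""
    fields = line.split(",")
    return Rating(int(fields[0]), int(fields[1]), float(fields[2]))


sample_lines = ["1,31,2.5,1260759144", "2,10,4.0,835355493"]
parsed = [parse_rating_line(line) for line in sample_lines]
print(parsed[0])  # Rating(user=1, product=31, rating=2.5)
```

In the Spark version, the same function body would be passed to `map()` so that it runs on every line of the RDD.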

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [3]:
file_path = 'file:///home/talentum/test-jupyter/P4/2_CollaborativeFiltering/Dataset/ratings.csv'

# Load the data into RDD
data = sc.textFile(file_path)

# collect() returns the entire RDD to the driver; on large datasets, prefer data.take(n)
data.collect()


['1,31,2.5,1260759144',
 '1,1029,3.0,1260759179',
 '1,1061,3.0,1260759182',
 '1,1129,2.0,1260759185',
 '1,1172,4.0,1260759205',
 '1,1263,2.0,1260759151',
 '1,1287,2.0,1260759187',
 '1,1293,2.0,1260759148',
 '1,1339,3.5,1260759125',
 '1,1343,2.0,1260759131',
 '1,1371,2.5,1260759135',
 '1,1405,1.0,1260759203',
 '1,1953,4.0,1260759191',
 '1,2105,4.0,1260759139',
 '1,2150,3.0,1260759194',
 '1,2193,2.0,1260759198',
 '1,2294,2.0,1260759108',
 '1,2455,2.5,1260759113',
 '1,2968,1.0,1260759200',
 '1,3671,3.0,1260759117',
 '2,10,4.0,835355493',
 '2,17,5.0,835355681',
 '2,39,5.0,835355604',
 '2,47,4.0,835355552',
 '2,50,4.0,835355586',
 '2,52,3.0,835356031',
 '2,62,3.0,835355749',
 '2,110,4.0,835355532',
 '2,144,3.0,835356016',
 '2,150,5.0,835355395',
 '2,153,4.0,835355441',
 '2,161,3.0,835355493',
 '2,165,3.0,835355441',
 '2,168,3.0,835355710',
 '2,185,3.0,835355511',
 '2,186,3.0,835355664',
 '2,208,3.0,835355511',
 '2,222,5.0,835355840',
 '2,223,1.0,835355749',
 '2,225,3.0,835355552',
 '2,235,3

In [4]:
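# Rating is not imported earlier in this notebook, so import it here
from pyspark.mllib.recommendation import Rating

# Split each line of the RDD on the comma delimiter
ratings = data.map(lambda l: l.split(','))

# Map each split line to Rating(user, product, rating), dropping the timestamp
ratings_final = ratings.map(lambda line: Rating(int(line[0]), int(line[1]), float(line[2])))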

# Split the data into training and test
training_data, test_data = ratings_final.randomSplit([0.8, 0.2])
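
A weighted random split like the one above assigns each record to a split at random, so the resulting sizes are close to 80% and 20% but not exact. The plain-Python sketch below illustrates that idea only; it is not Spark's implementation, and `random_split` is a hypothetical helper name.

```python
import random


def random_split(items, weights, seed):
    """Assign each item independently to one bucket, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. weights [0.8, 0.2] give bounds [0.8, 1.0]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    buckets = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket catches any draw not claimed by an earlier bound
            if draw < bound or i == len(bounds) - 1:
                buckets[i].append(item)
                break
    return buckets


train, test = random_split(list(range(1000)), [0.8, 0.2], seed=42)
print(len(train), len(test))  # roughly 800 and 200, not exactly
```

In Spark itself, `randomSplit` also accepts an optional `seed` argument, for example `ratings_final.randomSplit([0.8, 0.2], seed=42)`, which is worth setting when the training and test sets should be reproducible across runs.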