# Alejandro Osborne - Project 5

### Loading

In [18]:
import pandas as pd

# Load the ratings and the movie metadata from GitHub
ratings = pd.read_csv('https://raw.githubusercontent.com/AlejandroOsborne/DATA612/master/ratings.csv')
movies = pd.read_csv('https://raw.githubusercontent.com/AlejandroOsborne/DATA612/master/movies.csv')

In [19]:
## Import libraries
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder
from sklearn.metrics import mean_squared_error

### Creating the Spark instance 

In [20]:
sc = SparkContext('local', 'project5.1')
sql_sc = SQLContext(sc)

In [22]:
s_df = sql_sc.createDataFrame(ratings)

X_train, X_test = s_df.randomSplit([0.8, 0.2], seed=643)

#### Training the ALS model on the training split

In [24]:
## Training 
als = ALS(rank=10,
          maxIter=10,
          userCol='userId',
          itemCol='movieId',
          ratingCol='rating')
model = als.fit(X_train.select(['userId', 'movieId', 'rating']))
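
To make the idea behind `ALS` concrete, here is a minimal NumPy sketch of alternating least squares on a tiny ratings matrix. It is not Spark's implementation: the function names `als_sketch` and `observed_rmse`, the toy matrix, and the regularization value are all assumptions made for illustration. The sketch fixes the item factors and solves a small ridge regression per user, then fixes the user factors and does the same per item.

```python
import numpy as np

def als_sketch(R, mask, rank=2, reg=0.1, n_iter=10, seed=0):
    """Approximate R by U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iter):
        # Fix V and solve a regularized least-squares problem for each user.
        for u in range(n_users):
            obs = mask[u]
            Vo = V[obs]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, obs])
        # Fix U and solve the same kind of problem for each item.
        for i in range(n_items):
            obs = mask[:, i]
            Uo = U[obs]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[obs, i])
    return U, V

def observed_rmse(R, mask, U, V):
    """Root mean squared error over the observed entries only."""
    err = (R - U @ V.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

# Toy data (hypothetical): 0 marks a missing rating.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
U, V = als_sketch(R, mask)
print(observed_rmse(R, mask, U, V))
```

In this sketch, `rank` plays the same role as the `rank` argument passed to Spark's `ALS` above, and `n_iter` corresponds loosely to `maxIter`.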

In [25]:
## Predictions

preds = model.transform(X_test.select(['userId', 'movieId']))

In [26]:
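## Convert predictions to pandas
preds = preds.toPandas()

## Validate: join the predictions with the true ratings
val = pd.merge(ratings, preds, on=['userId', 'movieId'])
val = val.dropna()  # drop NaN predictions (users or movies not seen in training)
## mean_squared_error returns the MSE; its square root is the RMSE
mse = mean_squared_error(val.rating, val.prediction)
print(mse)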
sc.stop()

0.7983046666997006
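
Note that scikit-learn's `mean_squared_error` returns the mean squared error, so the value printed above is on the squared-rating scale. A minimal sketch of converting it to an RMSE follows; the helper names `rmse_from_mse` and `rmse` are hypothetical and are not part of the notebook.

```python
import math

def rmse_from_mse(mse):
    """RMSE is the square root of the mean squared error."""
    return math.sqrt(mse)

def rmse(actual, predicted):
    """Compute the RMSE directly from paired ratings and predictions."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

printed_value = 0.7983046666997006  # the value printed by the cell above
print(round(rmse_from_mse(printed_value), 3))
```

Taking the square root gives roughly 0.893, meaning the predictions are off by a little under one rating point on average.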


In [27]:
combined = pd.merge(val, movies, on=['movieId'])
combined = combined[['userId', 'movieId', 'rating', 'prediction', 'title','genres']]
combined.sample(20)

| | userId | movieId | rating | prediction | title | genres |
|---|---|---|---|---|---|---|
| 10857 | 585 | 55820 | 4.5 | 4.368899 | No Country for Old Men (2007) | Crime\|Drama |
| 18701 | 509 | 6367 | 3.0 | 3.109463 | Down with Love (2003) | Comedy\|Romance |
| 11411 | 599 | 6957 | 3.0 | 2.723639 | Bad Santa (2003) | Comedy\|Crime |
| 16830 | 266 | 3156 | 1.0 | 2.343047 | Bicentennial Man (1999) | Drama\|Romance\|Sci-Fi |
| 3159 | 173 | 597 | 3.0 | 3.227104 | Pretty Woman (1990) | Comedy\|Romance |
| 15580 | 599 | 5628 | 3.0 | 1.980348 | Wasabi (2001) | Action\|Comedy\|Crime\|Drama\|Thriller |
| 61 | 71 | 260 | 3.0 | 4.138785 | Star Wars: Episode IV - A New Hope (1977) | Action\|Adventure\|Sci-Fi |
| 10291 | 380 | 1193 | 4.0 | 4.114138 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 13397 | 426 | 79091 | 5.0 | 3.761264 | Despicable Me (2010) | Animation\|Children\|Comedy\|Crime |
| 2900 | 267 | 480 | 5.0 | 4.350845 | Jurassic Park (1993) | Action\|Adventure\|Sci-Fi\|Thriller |


### Conclusion

The Spark API handles ALS well. Its support for SVD is weaker, though the capability is there.

For any ALS implementation in Python, Spark is the safest choice: the Surprise package leaves much to the imagination and takes a very dated approach to modelling.

Distributed computing is another major advantage of Spark. Once the ratings matrix approaches the size of available memory (e.g., 16 GB), you will need Spark. Even large AWS instances (128 GB) might be able to fit it, but they are expensive and slower.
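
As a rough illustration of when a ratings matrix reaches that memory threshold, the sketch below estimates the size of a dense user-by-item matrix. The user and item counts are hypothetical, as is the assumption of 8 bytes per stored value (a 64-bit float); the helper name `dense_matrix_bytes` is also made up for this example.

```python
def dense_matrix_bytes(n_users, n_items, bytes_per_value=8):
    """Memory needed to hold every user-item cell as one fixed-size value."""
    return n_users * n_items * bytes_per_value

GIB = 1024 ** 3

# Hypothetical catalogue: 100,000 users and 20,000 movies.
n_users, n_items = 100_000, 20_000
size_gib = dense_matrix_bytes(n_users, n_items) / GIB
print(f"{size_gib:.1f} GiB")
```

With these assumed counts the dense matrix comes to about 14.9 GiB, already close to the 16 GB mentioned above, before accounting for any working memory the factorization itself needs.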

Spark may also be useful when recommendations are needed in real time, for example right after a movie is watched, because it is fast enough to process data at that rate.