Handout date: 2022-04-13
This is the final project for DS-GA 1004. In this project, we build and evaluate a collaborative-filter based recommender system.
In this project, we'll use the MovieLens datasets collected by
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872
The data is hosted in NYU's HPC environment under /scratch/work/courses/DSGA1004-2021/movielens.
Two versions of the dataset are provided: a small sample (ml-latest-small, 9000 movies and 600 users) and a larger sample (ml-latest, 58000 movies and 280000 users).
Each version of the data contains rating and tag interactions, and the larger sample includes "tag genome" data for each movie, which you may consider as additional features beyond the collaborative filter.
Each version of the data includes a README.txt file explaining the contents and structure of the data, which are stored in CSV files.
I strongly recommend thoroughly reading the dataset documentation before beginning, and making note of the documented differences between the smaller and larger datasets. Knowing these differences in advance will save you many headaches when it comes time to scale up.
In the first step, we partitioned the rating data into training, validation, and test samples.
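One common partitioning strategy (an assumption here, not mandated by the project) is to hold out a fraction of each user's ratings, so that every user is represented in training while their held-out interactions form the validation and test sets. A minimal pure-Python sketch of that logic (in practice this would be done in Spark):

```python
import random

def split_by_user(ratings, val_frac=0.2, test_frac=0.2, seed=42):
    """Partition (user, movie, rating) tuples so each user keeps some
    interactions in training and holds out the rest for validation/test."""
    rng = random.Random(seed)
    by_user = {}
    for row in ratings:
        by_user.setdefault(row[0], []).append(row)
    train, val, test = [], [], []
    for user, rows in by_user.items():
        rng.shuffle(rows)
        n_val = int(len(rows) * val_frac)
        n_test = int(len(rows) * test_frac)
        val.extend(rows[:n_val])
        test.extend(rows[n_val:n_val + n_test])
        train.extend(rows[n_val + n_test:])  # remainder stays in training
    return train, val, test

# Toy data: 5 users, 10 rated movies each
ratings = [(u, m, 4.0) for u in range(5) for m in range(10)]
train, val, test = split_by_user(ratings)
```

Splitting per user (rather than splitting rows globally) avoids cold-start users in validation, which ALS cannot score without retraining.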
Before implementing a sophisticated model, we began with a popularity baseline model.
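As an illustrative sketch (the damping constant and scoring rule are assumptions, not the project's required baseline), a popularity model can rank movies by a damped mean rating, so that movies with very few ratings do not dominate:

```python
from collections import defaultdict

def popularity_scores(ratings, damping=10.0):
    """Score each movie by a damped mean rating: sum(r) / (count + damping).
    The damping term pulls rarely-rated movies toward zero."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, movie, rating in ratings:
        totals[movie] += rating
        counts[movie] += 1
    return {m: totals[m] / (counts[m] + damping) for m in totals}

ratings = [(1, "a", 5.0), (2, "a", 4.0), (3, "b", 5.0)]
scores = popularity_scores(ratings)
top = max(scores, key=scores.get)  # movie "a": 9 / (2 + 10) = 0.75
```

Because the baseline ignores user identity, the same top-100 list is recommended to every user, which gives a useful floor for the ranking metrics.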
The recommendation model uses Spark's alternating least squares (ALS) method to learn latent factor representations for users and items.
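To make the factorization concrete, here is a toy dense ALS in NumPy. It is not how Spark implements the algorithm (Spark works on sparse, distributed data), but the alternating ridge-regression updates are the same idea; the rank, regularization, and iteration count below are illustrative:

```python
import numpy as np

def als(R, rank=2, reg=0.1, iters=20, seed=0):
    """Minimal dense ALS: alternately solve regularized least squares for
    user factors U and item factors V so that R is approximated by U @ V.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)
    for _ in range(iters):
        # Fix V, solve (V^T V + reg I) u = V^T r for every user at once
        U = np.linalg.solve(V.T @ V + eye, V.T @ R.T).T
        # Fix U, solve the symmetric problem for every item
        V = np.linalg.solve(U.T @ U + eye, U.T @ R).T
    return U, V

# Two taste groups: users 0-1 like items 0-1, user 2 likes item 2
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
U, V = als(R)
pred = U @ V.T  # reconstructed rating matrix
```

In the project itself, `pyspark.ml.recommendation.ALS` plays this role, and the learned user/item factor matrices can be extracted from the fitted model for downstream use.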
After making predictions from the popularity baseline and the latent factor model, we evaluate accuracy on the validation and test data. Evaluation is based on predictions of the top 100 items for each user, and we report the ranking metrics provided by Spark. Refer to the ranking metrics section of the Spark documentation for more details.
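As a sanity check on what these ranking metrics compute, precision-at-k and average precision can be written out directly. This is a sketch of the standard definitions; Spark's exact conventions are documented with its RankingMetrics class:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are in the relevant set."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def average_precision(recommended, relevant):
    """Sum of precision values at each rank where a relevant item appears,
    normalized by the number of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

recs = ["a", "b", "c", "d"]   # model's ranked list for one user
rel = {"a", "c"}              # held-out items the user actually interacted with
p2 = precision_at_k(recs, rel, 2)   # only "a" hits in the top 2
ap = average_precision(recs, rel)   # (1/1 + 2/3) / 2
```

Averaging `average_precision` over all users gives mean average precision (MAP), one of the metrics Spark reports.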
For this part, we implement the following extensions:
- Comparison to single-machine implementations: compare Spark's parallel ALS model to a single-machine implementation, e.g., lightfm or lenskit. Your comparison should measure both efficiency (model fitting time as a function of dataset size) and resulting accuracy.
- Fast search: use a spatial data structure (e.g., LSH or partition trees) to implement accelerated search at query time. For this, it is best to use an existing library such as annoy, nmslib, or scann, and you will need to export the model parameters from Spark to work in your chosen environment. For full credit, you should provide a thorough evaluation of the efficiency gains provided by your spatial data structure over a brute-force search method.
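To illustrate the idea behind the fast-search extension (libraries like annoy or nmslib are what you would actually use), here is a minimal random-hyperplane LSH over exported item factors. All names and parameters below are hypothetical, and the index only scans the query's own bucket instead of all items:

```python
import numpy as np

def make_lsh(vectors, n_planes=8, seed=0):
    """Index vectors by the sign pattern of their dot products with random
    hyperplanes; similar vectors tend to land in the same bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, vectors.shape[1]))

    def key(v):
        return tuple((v @ planes.T) > 0)

    buckets = {}
    for idx, v in enumerate(vectors):
        buckets.setdefault(key(v), []).append(idx)

    def search(q, k=5):
        # Scan only the query's bucket; fall back to all items if it's empty.
        cand = buckets.get(key(q), range(len(vectors)))
        scored = sorted(cand, key=lambda i: -float(vectors[i] @ q))
        return scored[:k]

    return search

# Stand-in for item factors exported from a fitted ALS model
item_factors = np.random.default_rng(1).normal(size=(1000, 16))
search = make_lsh(item_factors)
top = search(item_factors[0], k=3)
```

A brute-force baseline scores all 1000 items per query, while the LSH search scores only one bucket; measuring that candidate-set reduction against any loss in recall is the kind of efficiency/accuracy trade-off the evaluation should report.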