
DSGA1004 - BIG DATA

Recommender-System-on-MovieLens

Final project

Handout date: 2022-04-13

Overview

This is the final project for DS-GA 1004. In this project, we build and evaluate a collaborative-filter based recommender system.

The data set

In this project, we'll use the MovieLens datasets collected by

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

The data is hosted in NYU's HPC environment under /scratch/work/courses/DSGA1004-2021/movielens.

Two versions of the dataset are provided: a small sample (ml-latest-small, 9000 movies and 600 users) and a larger sample (ml-latest, 58000 movies and 280000 users). Each version of the data contains rating and tag interactions, and the larger sample includes "tag genome" data for each movie, which you may consider as additional features beyond the collaborative filter. Each version of the data includes a README.txt file which explains the contents and structure of the data which are stored in CSV files.

I strongly recommend thoroughly reading the dataset documentation before beginning, and making note of the documented differences between the smaller and larger datasets. Knowing these differences in advance will save you many headaches when it comes time to scale up.

Basic recommender system

  1. First, we partitioned the rating data into training, validation, and test samples.

  2. Before implementing a sophisticated model, we began with a popularity baseline model.

  3. The recommendation model uses Spark's alternating least squares (ALS) method to learn latent factor representations for users and items.

Evaluation

After making predictions with the popularity baseline and the latent factor model, we evaluate accuracy on the validation and test data. Evaluation is based on predictions of the top 100 items for each user, reported using the ranking metrics provided by Spark. Refer to the ranking metrics section of the Spark documentation for more details.

Extensions

For this part, we implement the following extensions:

  • Comparison to single-machine implementations: compare Spark's parallel ALS model to a single-machine implementation, e.g. lightfm or lenskit. The comparison measures both efficiency (model fitting time as a function of data set size) and resulting accuracy.
  • Fast search: use a spatial data structure (e.g., LSH or partition trees) to implement accelerated search at query time. For this, it is best to use an existing library such as annoy, nmslib, or scann, which requires exporting the model parameters from Spark into the chosen environment. We evaluate the efficiency gains provided by the spatial data structure over a brute-force search method.

About

final-project-group_101 created by GitHub Classroom
