Handout date: 2022-04-13
This is the final project for DS-GA 1004. In this project, we build and evaluate a collaborative-filter based recommender system.
In this project, we'll use the MovieLens datasets collected by
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872
The data is hosted in NYU's HPC environment under /scratch/work/courses/DSGA1004-2021/movielens.
Two versions of the dataset are provided: a small sample (ml-latest-small, 9000 movies and 600 users) and a larger sample (ml-latest, 58000 movies and 280000 users).
Each version of the data contains rating and tag interactions, and the larger sample includes "tag genome" data for each movie, which you may consider as additional features beyond the collaborative filter.
Each version of the data includes a README.txt file explaining the contents and structure of the data, which are stored in CSV files.
I strongly recommend thoroughly reading the dataset documentation before beginning, and making note of the documented differences between the smaller and larger datasets. Knowing these differences in advance will save you many headaches when it comes time to scale up.
In the first step, we partitioned the rating data into training, validation, and test samples.
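One common partitioning strategy (an assumption here, not mandated by the project) is to hold out a fraction of each user's ratings, so that every user is represented in training while their held-out interactions form the validation and test sets. A minimal pure-Python sketch of that logic (in practice this would be done in Spark):

```python
import random

def split_by_user(ratings, val_frac=0.2, test_frac=0.2, seed=42):
    """Partition (user, movie, rating) tuples so each user keeps some
    interactions in training and holds out the rest for validation/test."""
    rng = random.Random(seed)
    by_user = {}
    for row in ratings:
        by_user.setdefault(row[0], []).append(row)
    train, val, test = [], [], []
    for user, rows in by_user.items():
        rng.shuffle(rows)
        n_val = int(len(rows) * val_frac)
        n_test = int(len(rows) * test_frac)
        val.extend(rows[:n_val])
        test.extend(rows[n_val:n_val + n_test])
        train.extend(rows[n_val + n_test:])  # remainder stays in training
    return train, val, test

# Toy data: 5 users, 10 rated movies each
ratings = [(u, m, 4.0) for u in range(5) for m in range(10)]
train, val, test = split_by_user(ratings)
```

Splitting per user (rather than splitting rows globally) avoids cold-start users in validation, which ALS cannot score without retraining.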
Before implementing a sophisticated model, we began with a popularity baseline model.
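As an illustrative sketch (the damping constant and scoring rule are assumptions, not the project's required baseline), a popularity model can rank movies by a damped mean rating, so that movies with very few ratings do not dominate:

```python
from collections import defaultdict

def popularity_scores(ratings, damping=10.0):
    """Score each movie by a damped mean rating: sum(r) / (count + damping).
    The damping term pulls rarely-rated movies toward zero."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, movie, rating in ratings:
        totals[movie] += rating
        counts[movie] += 1
    return {m: totals[m] / (counts[m] + damping) for m in totals}

ratings = [(1, "a", 5.0), (2, "a", 4.0), (3, "b", 5.0)]
scores = popularity_scores(ratings)
top = max(scores, key=scores.get)  # movie "a": 9 / (2 + 10) = 0.75
```

Because the baseline ignores user identity, the same top-100 list is recommended to every user, which gives a useful floor for the ranking metrics.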
The recommendation model uses Spark's alternating least squares (ALS) method to learn latent factor representations for users and items.
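To make the factorization concrete, here is a toy dense ALS in NumPy. It is not how Spark implements the algorithm (Spark works on sparse, distributed data), but the alternating ridge-regression updates are the same idea; the rank, regularization, and iteration count below are illustrative:

```python
import numpy as np

def als(R, rank=2, reg=0.1, iters=20, seed=0):
    """Minimal dense ALS: alternately solve regularized least squares for
    user factors U and item factors V so that R is approximated by U @ V.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)
    for _ in range(iters):
        # Fix V, solve (V^T V + reg I) u = V^T r for every user at once
        U = np.linalg.solve(V.T @ V + eye, V.T @ R.T).T
        # Fix U, solve the symmetric problem for every item
        V = np.linalg.solve(U.T @ U + eye, U.T @ R).T
    return U, V

# Two taste groups: users 0-1 like items 0-1, user 2 likes item 2
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])
U, V = als(R)
pred = U @ V.T  # reconstructed rating matrix
```

In the project itself, `pyspark.ml.recommendation.ALS` plays this role, and the learned user/item factor matrices can be extracted from the fitted model for downstream use.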
After making predictions from the popularity baseline and the latent factor model, we evaluate accuracy on the validation and test data. Evaluation is based on predictions of the top 100 items for each user, and we report the ranking metrics provided by Spark. Refer to the ranking metrics section of the Spark documentation for more details.
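As a sanity check on what these ranking metrics compute, precision-at-k and average precision can be written out directly. This is a sketch of the standard definitions; Spark's exact conventions are documented with its RankingMetrics class:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are in the relevant set."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def average_precision(recommended, relevant):
    """Sum of precision values at each rank where a relevant item appears,
    normalized by the number of relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

recs = ["a", "b", "c", "d"]   # model's ranked list for one user
rel = {"a", "c"}              # held-out items the user actually interacted with
p2 = precision_at_k(recs, rel, 2)   # only "a" hits in the top 2
ap = average_precision(recs, rel)   # (1/1 + 2/3) / 2
```

Averaging `average_precision` over all users gives mean average precision (MAP), one of the metrics Spark reports.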
For this part, we implement the following extensions:
- Comparison to single-machine implementations: compare Spark's parallel ALS model to a single-machine implementation, e.g., lightfm or lenskit. Your comparison should measure both efficiency (model fitting time as a function of dataset size) and resulting accuracy.
- Fast search: use a spatial data structure (e.g., LSH or partition trees) to implement accelerated search at query time. For this, it is best to use an existing library such as annoy, nmslib, or scann, and you will need to export the model parameters from Spark to work in your chosen environment. For full credit, you should provide a thorough evaluation of the efficiency gains provided by your spatial data structure over a brute-force search method.
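To illustrate the idea behind the fast-search extension (libraries like annoy or nmslib are what you would actually use), here is a minimal random-hyperplane LSH over exported item factors. All names and parameters below are hypothetical, and the index only scans the query's own bucket instead of all items:

```python
import numpy as np

def make_lsh(vectors, n_planes=8, seed=0):
    """Index vectors by the sign pattern of their dot products with random
    hyperplanes; similar vectors tend to land in the same bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, vectors.shape[1]))

    def key(v):
        return tuple((v @ planes.T) > 0)

    buckets = {}
    for idx, v in enumerate(vectors):
        buckets.setdefault(key(v), []).append(idx)

    def search(q, k=5):
        # Scan only the query's bucket; fall back to all items if it's empty.
        cand = buckets.get(key(q), range(len(vectors)))
        scored = sorted(cand, key=lambda i: -float(vectors[i] @ q))
        return scored[:k]

    return search

# Stand-in for item factors exported from a fitted ALS model
item_factors = np.random.default_rng(1).normal(size=(1000, 16))
search = make_lsh(item_factors)
top = search(item_factors[0], k=3)
```

A brute-force baseline scores all 1000 items per query, while the LSH search scores only one bucket; measuring that candidate-set reduction against any loss in recall is the kind of efficiency/accuracy trade-off the evaluation should report.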