The goal of this project is to create a hyrbid recommendation system using the Yelp dataset to recommend businesses to users.
- The data is a subset of Yelp Dataset Challenge.
- We chose data from 2 cities i.e. Pittsburgh and Las Vegas which can be found here
Collaborative-Filtering (CF) Based Recommendation techniques have been quite successful in the past giving promising results in this domain. The main idea behind these set of algorithms is that the users with similar interests may like similar items. These algorithms unlike Content-Based approaches don’t require the content(features) of the items we want to recommend to users but instead only requires ratings various users have given to various items. Based on just this information, it is able to predict rating a user may give to some item it has not rated. Moreover, it helps in recommending new items to users which he may like.
- Scala 2.11
- Spark 2.3.1 (Mllib, Core, SQL)
- Java 1.8
To run the collaborative filtering models follow the steps below,
- save the
Collaborative_Models.jar
in the bin folder in spark. - open terminal and go to ~/spark/bin/
- run the following command,
spark-submit --class <class_name> Collaborative_Models.jar <train_file_path> <test_file_path> <city> <run_mode>
- <class_name> : ItemBased_CF / UserBased_CF / als
- <train_path> : entire path to the training data
- <test_path> : entire path to the testing data
- <run_mode> : 0 (will generate only test predictions) / 1 (will generate all the predictions test, train and all)
Content based recommendation systems try to find similarity between users and items by building their corresponding vectors. The most important aspect of such models is choosing features. We follow three different approaches to build user and item features. These features are transformed into vectors and then different similarity measures can be used to find compute the similarity between them. Unlike collaborative recommendation systems, in content-based models we used text in the review to form user and item vectors.
-
Numpy
-
Pandas
-
Scikit-learn
-
Install
findspark
using the following command
pip install findspark
- Install Gensim for doc2vec
pip install gensim
- Install nltk for pre-processing
pip install nltk
- In order to run the code for any content based model give the path for
train_file
andtest_file
underLoad Train and Test Data
- Change the parameters for
TfidfVectorizer
underTF-IDF Vector Creation
orDoc2Vec
underDoc2Vec Model Creation
. (optional) - Hypertune the parameters for regression under
Training Regressor on similarity values
if the data is changed. (optional) - Give the path and name of file to save test results under
Save predictions file
.
The final model combines all the predictions generated from the above-mentioned models by their weighted sum. The CF model uses the ratings, while the content-based model uses reviews or categories to capture the similarity between users and businesses to predict the unknown rating. The combination of all the models can grasp different aspects of the relationship between users and businesses to give a more accurate prediction.
- To generate the predictions give the
test_ground_truth
and test_predictions(test prediction files generated using each one of the content based and collaborative based models) as program arguments to theHybridModelRMSE.scala
in the following order (order is important as the weights of the models are hardcoded).
- city_test.txt
- _ItemBased_test_predictions.txt
- _UserBased_test_predictions.txt
- _ALS_test_predictions.txt
- _SGD_test_predictions.txt
- _TextBased_test_predictions.txt
- _ReviewBased_test_predictions.txt
- _CategoryBased_test_predictions.txt
This will generate a resultant RMSE of the combined models along with a text file Hybrid_Model_predictions.txt
which contains predictions of rating corresponding to each user business pair in test file.
- to generate recommendations corresponding to a user give 4 program arguments to
HybridModelRecommendation.scala
in the following order.
- Hybrid_Model_predictions.txt
- ReviewBased_Sorted_similarities.txt
- UserBased_Sorted_similarities.txt
- TextBased_Sorted_similarities.txt
This will generate a text file of all the recommendations for userset in the test file. First item of each row is the user_id and the following items in that row are the business ids of the recommendations.