
Finding restaurant tuples that appear in review data from Yelp.com


artisan1218/Finding-Frequent-Itemsets


Finding frequent itemsets

Implementation of the SON algorithm on top of the Apache Spark framework. It finds all combinations of restaurants (tuples) that appear frequently in review data from Yelp.com, within the required time.

Sample data

The sample data is a subset of the business.json and review.json data from the Yelp dataset (produced by Preprocess.py). The raw data is in CSV format, as shown below:

[image: preview of the raw CSV sample data]

1. SON algorithm 1/Market-Basket model

The SON algorithm is used to build the market-basket model. The model has two cases:

  1. Frequent businesses: combinations of businesses (singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. Baskets are keyed by user:
     user1: [business11, business12, business13, ...]
     user2: [business21, business22, business23, ...]
     user3: [business31, business32, business33, ...]
  2. Frequent users: combinations of users (singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. Baskets are keyed by business:
     business1: [user11, user12, user13, ...]
     business2: [user21, user22, user23, ...]
     business3: [user31, user32, user33, ...]
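
The two cases above differ only in which column is the basket key. A minimal sketch of that grouping, assuming each raw row is a `(user_id, business_id)` pair (the column names and the `build_baskets` helper are hypothetical, not taken from the repository):

```python
from collections import defaultdict


def build_baskets(rows, case):
    """Group (user_id, business_id) rows into baskets.

    case 1: user -> businesses reviewed by that user (frequent businesses)
    case 2: business -> users who reviewed that business (frequent users)
    """
    if case not in (1, 2):
        raise ValueError("case must be 1 or 2")
    baskets = defaultdict(set)  # a set drops duplicate reviews of the same pair
    for user_id, business_id in rows:
        if case == 1:
            baskets[user_id].add(business_id)
        else:
            baskets[business_id].add(user_id)
    # Sorted lists give a stable, printable basket representation.
    return {key: sorted(items) for key, items in baskets.items()}


# Hypothetical rows, used only for illustration.
rows = [("u1", "b1"), ("u1", "b2"), ("u2", "b1"), ("u1", "b1")]
user_baskets = build_baskets(rows, case=1)
business_baskets = build_baskets(rows, case=2)
```

Using a set per basket means a user who reviews the same business twice contributes that business only once, which is what itemset counting expects.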

The market-basket model is built using the SON algorithm together with the A-Priori algorithm, and is tested on a relatively small dataset.
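
To illustrate how SON and A-Priori fit together, here is a plain-Python sketch without Spark. It is not the repository's implementation: the function names `apriori` and `son`, the chunking scheme, and the sample baskets are all assumptions made for illustration. Pass 1 runs A-Priori on each chunk with a proportionally scaled threshold; pass 2 counts the resulting candidates over the full data.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def apriori(baskets, threshold):
    """Return every itemset (as a sorted tuple) contained in >= threshold baskets."""
    baskets = [set(b) for b in baskets]
    counts = Counter(item for b in baskets for item in b)
    current = {(item,) for item, n in counts.items() if n >= threshold}
    frequent = set(current)
    k = 2
    while current:
        # A k-itemset is a candidate only if all of its (k-1)-subsets are frequent.
        items = sorted({i for itemset in current for i in itemset})
        candidates = [c for c in combinations(items, k)
                      if all(sub in current for sub in combinations(c, k - 1))]
        counts = Counter(c for b in baskets for c in candidates if b.issuperset(c))
        current = {c for c, n in counts.items() if n >= threshold}
        frequent |= current
        k += 1
    return frequent


def son(baskets, support, num_chunks=2):
    """Two-pass SON: local A-Priori per chunk, then a global count of candidates."""
    if not baskets:
        return []
    n = len(baskets)
    size = ceil(n / num_chunks)
    chunks = [baskets[i:i + size] for i in range(0, n, size)]
    # Pass 1: scale the threshold by the chunk's share of the data.
    candidates = set()
    for chunk in chunks:
        local_threshold = max(1, ceil(support * len(chunk) / n))
        candidates |= apriori(chunk, local_threshold)
    # Pass 2: count each candidate over all baskets and keep the truly frequent ones.
    all_sets = [set(b) for b in baskets]
    totals = Counter(c for b in all_sets for c in candidates if b.issuperset(c))
    frequent = [c for c, n_c in totals.items() if n_c >= support]
    return sorted(frequent, key=lambda c: (len(c), c))


# Hypothetical baskets, used only for illustration.
sample = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"], ["d"]]
print(son(sample, support=3))
# → [('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Pass 1 may produce false positives but never false negatives: an itemset that reaches the global support must reach the scaled threshold in at least one chunk. In a Spark setting the two passes would typically map onto per-partition operations, with each partition playing the role of a chunk.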

2. SON algorithm 2/Market-Basket model on frequent businesses with large dataset

The result is similar to the previous market-basket model, except that this time we are working with a large dataset, so I improved the implementation to speed up the runtime. The SON algorithm and A-Priori are applied here as well. Preprocess.py builds the market-basket model from the large raw data extracted from Yelp.com in JSON format.

3. FP-growth algorithm

The FP-growth algorithm in Spark MLlib is used to do the same work as task 2 (obtaining the frequent itemsets), so that the runtime and results can be compared.
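
A hedged sketch of what such a comparison run could look like with the RDD-based `pyspark.mllib.fpm.FPGrowth` API. The helper names, the `sc` SparkContext argument, and the partition count are assumptions, not the repository's code. MLlib expects the minimum support as a fraction of the number of baskets, so a count threshold has to be converted first.

```python
def support_fraction(min_count, num_baskets):
    """Convert a support count threshold into the fraction FPGrowth expects."""
    if num_baskets <= 0:
        raise ValueError("num_baskets must be positive")
    return min_count / num_baskets


def fp_growth_itemsets(sc, baskets, min_count, num_partitions=2):
    """Mine frequent itemsets with MLlib FP-growth; returns (itemset, frequency) pairs.

    `sc` is an existing SparkContext (assumed to be created by the caller).
    """
    # Imported here so that the helper above can be used without Spark installed.
    from pyspark.mllib.fpm import FPGrowth

    # FP-growth requires the items within each transaction to be unique.
    transactions = sc.parallelize([sorted(set(b)) for b in baskets], num_partitions)
    model = FPGrowth.train(
        transactions,
        minSupport=support_fraction(min_count, len(baskets)),
        numPartitions=num_partitions,
    )
    return sorted((tuple(sorted(fi.items)), fi.freq)
                  for fi in model.freqItemsets().collect())
```

Sorting both the items inside each itemset and the final list makes the output directly comparable with a SON result that uses the same ordering.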