Finding frequent itemsets

An implementation of the SON algorithm on top of the Apache Spark framework. It finds all frequent combinations of restaurants (frequent itemsets) that appear in review data from Yelp.com, within the required time.

Sample data

The sample data is a subset of the business.json and review.json data from the Yelp dataset, produced by Preprocess.py. The raw data is in CSV format.

1. SON algorithm 1/Market-Basket model

The SON algorithm is applied to a market-basket model. Two cases are considered:

Frequent businesses: combinations of businesses (singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. Each basket is the list of businesses belonging to one user:

user1: [business11, business12, business13, ...]
user2: [business21, business22, business23, ...]
user3: [business31, business32, business33, ...]

Frequent users: combinations of users (singletons, pairs, triples, etc.) that qualify as frequent given a support threshold. Each basket is the list of users belonging to one business:

business1: [user11, user12, user13, ...]
business2: [user21, user22, user23, ...]
business3: [user31, user32, user33, ...]
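The two basket layouts above can be sketched in plain Python as follows. This is only an illustration, not the code from the notebooks: the function name `build_baskets`, the `case` parameter, and the assumption that the input is a list of `(user_id, business_id)` pairs are all hypothetical.

```python
from collections import defaultdict

def build_baskets(pairs, case):
    """Group (user_id, business_id) pairs into baskets.

    case=1: each user is a basket of businesses (frequent businesses).
    case=2: each business is a basket of users (frequent users).
    """
    baskets = defaultdict(set)
    for user_id, business_id in pairs:
        if case == 1:
            baskets[user_id].add(business_id)
        elif case == 2:
            baskets[business_id].add(user_id)
        else:
            raise ValueError("case must be 1 or 2")
    return dict(baskets)

# Hypothetical toy input, not taken from the Yelp data.
pairs = [("u1", "b1"), ("u1", "b2"), ("u2", "b1"), ("u2", "b2"), ("u3", "b2")]
by_user = build_baskets(pairs, case=1)
by_business = build_baskets(pairs, case=2)
```

Using sets for the baskets removes duplicate pairs, so a pair that occurs several times in the input counts only once per basket.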

The market-basket model is built using the SON algorithm together with the A-Priori algorithm, and is tested on a relatively small dataset.
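For readers unfamiliar with the combination, here is a minimal single-machine sketch of SON with A-Priori. It is not the Spark code from the notebooks: the names `apriori`, `son`, and `num_chunks` are assumptions made for illustration, and the chunks stand in for Spark partitions. Pass 1 runs A-Priori on each chunk with a proportionally scaled threshold to collect candidates; pass 2 counts the candidates over all baskets and keeps those that meet the full support.

```python
from itertools import combinations

def apriori(baskets, threshold):
    """Return itemsets (sorted tuples) contained in at least `threshold` baskets."""
    baskets = [frozenset(b) for b in baskets]
    counts = {}
    for basket in baskets:
        for item in basket:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {s for s, n in counts.items() if n >= threshold}
    result = set(frequent)
    k = 2
    while frequent:
        items = sorted({i for s in frequent for i in s})
        # Keep only k-sets whose every (k-1)-subset is already frequent.
        candidates = [c for c in combinations(items, k)
                      if all(sub in frequent for sub in combinations(c, k - 1))]
        counts = {c: sum(1 for b in baskets if b.issuperset(c)) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= threshold}
        result |= frequent
        k += 1
    return result

def son(baskets, support, num_chunks=2):
    """Two-pass SON: local A-Priori per chunk, then a global count."""
    baskets = [frozenset(b) for b in baskets]
    if not baskets:
        return set()
    size = -(-len(baskets) // num_chunks)  # integer ceiling division
    chunks = [baskets[i:i + size] for i in range(0, len(baskets), size)]
    # Pass 1: scale the support by the chunk's share of all baskets.
    candidates = set()
    for chunk in chunks:
        local = max(1, -(-support * len(chunk) // len(baskets)))
        candidates |= apriori(chunk, local)
    # Pass 2: count every candidate over the full dataset.
    return {c for c in candidates
            if sum(1 for b in baskets if b.issuperset(c)) >= support}

# Hypothetical toy baskets, not taken from the Yelp data.
baskets = [["b1", "b2", "b3"], ["b1", "b2"], ["b1", "b3"],
           ["b2", "b3"], ["b1", "b2", "b3"]]
frequent_itemsets = son(baskets, support=3)
```

Because an itemset that is frequent overall must be frequent in at least one chunk at the scaled threshold, pass 1 produces no false negatives, and pass 2 removes the false positives.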

2. SON algorithm 2/Market-Basket model on frequent businesses with large dataset

The result is similar to that of the previous market-basket model, except that this task works with a large dataset, so the implementation was improved to reduce the runtime. The SON algorithm with A-Priori is applied here as well. Preprocess.py builds the market-basket model from the large raw data extracted from Yelp.com in JSON format.

3. FP-growth algorithm

The FP-growth algorithm in Spark MLlib is used to do the same work as task 2 (obtaining the frequent itemsets), so that its runtime and results can be compared with those of the SON implementation.
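One way to compare the two result sets is sketched below. The helper `compare_itemsets` is hypothetical and not part of the repository. It assumes each method's output has already been collected into a Python iterable of itemsets, and it ignores the order of items within an itemset, since the two methods may not list items in the same order.

```python
def compare_itemsets(son_result, fp_result):
    """Compare two collections of itemsets, ignoring item order."""
    a = {frozenset(s) for s in son_result}
    b = {frozenset(s) for s in fp_result}
    return {"common": a & b, "only_son": a - b, "only_fp": b - a}

# Hypothetical outputs, not actual results from the notebooks.
son_out = [("b1",), ("b1", "b2")]
fp_out = [["b2", "b1"], ["b1"], ["b3"]]
diff = compare_itemsets(son_out, fp_out)
```

If both methods use the same support threshold on the same baskets, `only_son` and `only_fp` would be expected to be empty.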
