Frequent-Pattern-Mining-Spark

PCY Algorithm for Frequent Pattern Mining using PySpark

About Dataset

The dataset was downloaded from Kaggle. It contains 38,765 rows of purchase orders from grocery store customers. These orders can be analysed to generate association rules.

Details on Implementation

The PCY (Park-Chen-Yu) algorithm is an improvement on the Apriori algorithm. This implementation also adds the Multihash optimization. PCY finds frequent itemsets by making several passes over the dataset.

In the first pass, it counts the occurrences of each singleton (that is, how many times each individual item appears in the dataset). Additionally, it hashes every pair that appears in the dataset into 2 different hash tables, where each bucket holds a count of the pairs hashed to it.
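The first pass can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's Spark code: the names `first_pass`, `h1`, `h2`, and `NUM_BUCKETS`, the choice of CRC32 and Adler-32 as the two hash functions, and the sample baskets are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
import zlib

NUM_BUCKETS = 8  # hypothetical table size; a real run would use far more buckets


def h1(pair):
    """First hash function (assumed): CRC32 of the pair, reduced to a bucket index."""
    return zlib.crc32("|".join(pair).encode("utf-8")) % NUM_BUCKETS


def h2(pair):
    """Second hash function (assumed): Adler-32 of the same pair."""
    return zlib.adler32("|".join(pair).encode("utf-8")) % NUM_BUCKETS


def first_pass(baskets):
    """Count single items and hash every pair into two bucket-count tables."""
    item_counts = Counter()
    table1 = [0] * NUM_BUCKETS
    table2 = [0] * NUM_BUCKETS
    for basket in baskets:
        # Deduplicate and sort so that {i, j} and {j, i} are the same pair.
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            table1[h1(pair)] += 1
            table2[h2(pair)] += 1
    return item_counts, table1, table2


# Made-up baskets, for illustration only.
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "eggs"],
]
item_counts, table1, table2 = first_pass(baskets)
print(item_counts)
print(table1, table2)
```

Only the bucket counts are kept for pairs in this pass, not the pairs themselves, which is what keeps the first pass cheap in memory.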
After the first pass, we filter the singletons, keeping only the frequent items of the dataset. To decide whether an item is frequent, we define a threshold (the support). If an item appears in the dataset more times than the threshold, it is considered frequent. We also filter the hash tables by converting them to bitmaps: each bucket becomes a single bit that is set only if the bucket's count exceeds the threshold.

In the second pass, we count the candidate pairs. A pair {i, j} is counted only if the following conditions are met:
- Both i and j are frequent items.
- {i, j} hashes to a frequent bucket in both hash tables.
This reduces the number of pair counters we have to maintain in the second pass, which substantially reduces the memory used by the program.
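The filtering step and the second pass described above might look like the following plain-Python sketch. It repeats a compact first pass so that it runs on its own. The function names (`bucket`, `to_bitmap`, `pcy_multihash_pairs`), the two hash functions, the bucket count, and the sample baskets are assumptions for illustration, and the notebook's RDD-based code may be organised differently.

```python
from collections import Counter
from itertools import combinations
import zlib

HASH_FNS = (zlib.crc32, zlib.adler32)  # assumed pair of hash functions


def bucket(pair, hash_fn, num_buckets):
    """Map a sorted pair of item names to a bucket index."""
    return hash_fn("|".join(pair).encode("utf-8")) % num_buckets


def to_bitmap(table, support):
    """One flag per bucket: True when the bucket count exceeds the threshold."""
    return [count > support for count in table]


def pcy_multihash_pairs(baskets, support, num_buckets=8):
    # Pass 1: count items and fill one bucket-count table per hash function.
    item_counts = Counter()
    tables = [[0] * num_buckets for _ in HASH_FNS]
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            for table, fn in zip(tables, HASH_FNS):
                table[bucket(pair, fn, num_buckets)] += 1

    # Between passes: keep frequent items and reduce the tables to bitmaps.
    frequent_items = {i for i, c in item_counts.items() if c > support}
    bitmaps = [to_bitmap(t, support) for t in tables]

    # Pass 2: count a pair only if both items are frequent and the pair
    # lands in a frequent bucket in every bitmap.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if all(bm[bucket(pair, fn, num_buckets)]
                   for bm, fn in zip(bitmaps, HASH_FNS)):
                pair_counts[pair] += 1

    # Keep only the pairs whose true count exceeds the threshold.
    return {p: c for p, c in pair_counts.items() if c > support}


# Made-up baskets, for illustration only.
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "eggs"],
]
frequent_pairs = pcy_multihash_pairs(baskets, support=1)
print(frequent_pairs)
```

A bucket's count is always at least as large as the count of any pair hashed to it, so a truly frequent pair can never be rejected by the bitmaps. The bitmaps only discard pairs that cannot be frequent, and the final filter removes any infrequent pairs that slipped through because of hash collisions.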

Implementation Using Apache Spark

Spark DataFrames are used for preprocessing.
The PCY algorithm itself is implemented with Spark RDDs.

SinghHarshita/Frequent-Pattern-Mining-Spark