This is an implementation of the algorithms described in:
Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Machine Learning: Fundamental Concepts and Algorithms,
Chapter 8, 2nd Edition, Cambridge University Press, March 2020. ISBN: 978-1108473989.
Our system uses the Apriori algorithm to generate frequent itemsets from a given sparse binary table. Frequent itemsets can be used, for example, to determine which items are most likely to be purchased, to find items that tend to be selected together, or to uncover dependency chains within the dataset. From the list of frequent itemsets, the system also generates association rules that meet a minimum confidence threshold. These strong rules, together with the frequent itemsets, support more confident conclusions about the relationships between the categories of a given dataset.
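As a rough illustration of the level-wise search described above, here is a minimal sketch of Apriori on a binary pandas table. It is not the code from this repository: the function name `apriori_sketch`, the toy table, and the use of transaction-id sets to count support are assumptions made to keep the example short. The minimum support is treated as an absolute count.

```python
from itertools import combinations

import pandas as pd


def apriori_sketch(table: pd.DataFrame, min_sup: int) -> dict:
    """Return {frozenset(items): support count} for all frequent itemsets.

    Hypothetical helper for illustration; `table` is a 0/1 DataFrame whose
    columns are items and whose rows are transactions.
    """
    # For each item, collect the set of row labels (transactions) containing it.
    tidsets = {
        col: set(table.index[(table[col] == 1).to_numpy()])
        for col in table.columns
    }

    # Level 1: keep the single items that reach the minimum support.
    current = {
        frozenset([item]): tids
        for item, tids in tidsets.items()
        if len(tids) >= min_sup
    }
    frequent = {}
    k = 1
    while current:
        frequent.update({items: len(tids) for items, tids in current.items()})
        k += 1
        candidates = {}
        # Join pairs of frequent (k-1)-itemsets into candidate k-itemsets.
        for (a, tids_a), (b, tids_b) in combinations(current.items(), 2):
            union = a | b
            if len(union) != k or union in candidates:
                continue
            # Apriori pruning: every (k-1)-subset must itself be frequent.
            if any(frozenset(sub) not in current
                   for sub in combinations(union, k - 1)):
                continue
            tids = tids_a & tids_b  # transactions containing the whole union
            if len(tids) >= min_sup:
                candidates[union] = tids
        current = candidates
    return frequent


# Toy table (assumed data): 4 transactions over items A, B, C.
toy = pd.DataFrame({"A": [1, 1, 1, 0], "B": [1, 1, 0, 1], "C": [0, 1, 1, 1]})
result = apriori_sketch(toy, min_sup=2)
for items, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(items), count)
```

Counting support by intersecting transaction-id sets avoids rescanning the table for every candidate; an implementation that follows the textbook pseudocode more literally would instead count candidates while iterating over the transactions.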
The steps below explain how to use the system and describe the structure of the project:
- Clone or download our repository using:
  `git clone https://github.com/GiorgioMorales/itemMining.git`
- Configure the project using a Python 3.x environment. The only libraries needed are Pandas and NumPy.
- The main driver is located in the `ItemMining.py` script. It creates an instance of the `ItemMining` class called `table`. The constructor takes as arguments the path to the dataset and the parameters `minSup` (minimum support) and `minConf` (minimum confidence), which are used by the Apriori and AssociationRules algorithms, respectively. By default, we use a minimum support of 4 and a minimum confidence of 0.4.
- To run the system, type `python ItemMining.py` in a terminal. To set the parameters manually, type `python ItemMining.py -s minSup -c minConf` instead.
- The file `Dataset.py` contains a class of the same name used to read the transaction itemset dataset (in CSV format). In addition, the file `ItemsetRule.py` contains two helper classes: the `Itemset` class stores itemsets as nodes of a tree, while the `Rule` class stores all the elements that describe an association rule.
- The dataset consists of a sparse binary table read as a pandas DataFrame object with labeled columns and rows containing unique elements. We noticed that, in this dataset, each `department` is related to only one possible `ID`, so we chose to use the `Dept` column instead of the `ID` column so that the resulting rules are easier to interpret. The table is passed to the Apriori algorithm, which displays the frequent itemsets found in it.
- Using the sparse binary table, the system then derives a set of rules from the frequent itemsets and a minimum confidence parameter using the `associationRules` algorithm. For each frequent itemset containing at least two items, rules are generated stating that if all the elements of one subset of the itemset are present, then the elements of another subset are also present with a certain degree of confidence. Each strong rule is displayed along with its support, confidence, leverage, and lift metrics.
- Finally, we call the static method `rankRules` to print the `k` most relevant rules from the set of strong rules.
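The rule generation and ranking steps can be sketched as follows. This is a minimal illustration and not the repository's `associationRules` or `rankRules` code: the function names, the dictionary-based rule format, and the toy support counts are assumptions. In particular, the source does not state which criterion `rankRules` uses, so ranking by lift here is only one plausible choice. The metrics use the standard definitions: confidence = sup(XY) / sup(X), lift = confidence / rsup(Y), and leverage = rsup(XY) - rsup(X) * rsup(Y), where rsup is the support divided by the number of transactions.

```python
from itertools import combinations


def rules_sketch(frequent: dict, n_transactions: int, min_conf: float) -> list:
    """Build rules X -> Y with confidence >= min_conf (hypothetical helper).

    `frequent` maps frozenset itemsets to support counts and is assumed to
    contain every subset of each frequent itemset (downward closure).
    """
    rules = []
    for itemset, sup_xy in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                sup_x = frequent[antecedent]
                sup_y = frequent[consequent]
                conf = sup_xy / sup_x
                if conf < min_conf:
                    continue
                rsup_xy = sup_xy / n_transactions
                rsup_x = sup_x / n_transactions
                rsup_y = sup_y / n_transactions
                rules.append({
                    "antecedent": antecedent,
                    "consequent": consequent,
                    "support": sup_xy,
                    "confidence": conf,
                    "lift": conf / rsup_y,
                    "leverage": rsup_xy - rsup_x * rsup_y,
                })
    return rules


def rank_rules_sketch(rules: list, k: int, key: str = "lift") -> list:
    """Return the k rules with the highest value of `key` (assumed criterion)."""
    return sorted(rules, key=lambda r: r[key], reverse=True)[:k]


# Toy support counts (assumed data) over 4 transactions.
frequent = {
    frozenset({"A"}): 3, frozenset({"B"}): 3, frozenset({"C"}): 3,
    frozenset({"A", "B"}): 2, frozenset({"A", "C"}): 2, frozenset({"B", "C"}): 2,
}
rules = rules_sketch(frequent, n_transactions=4, min_conf=0.4)
top = rank_rules_sketch(rules, k=2)
for rule in top:
    print(sorted(rule["antecedent"]), "->", sorted(rule["consequent"]),
          round(rule["confidence"], 3), round(rule["lift"], 3))
```

Passing a different `key`, such as `"confidence"` or `"leverage"`, changes the ranking criterion without touching the rule generation step.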