# Introduction
* "A priori" is Latin for "from before." It gives its name to Apriori, the classic association rule mining algorithm.
* Apriori works iteratively. It first computes the support of every single item (the 1-itemsets) and removes those whose support is below the minimum support.
* It then computes the support of the 2-itemsets built from the surviving items, again removing those below the minimum support, and repeats with larger itemsets until no more frequent itemsets can be found.
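
The level-wise search described above can be sketched in base R. This is only an illustration under stated assumptions: the four baskets and the `min_support` value are hypothetical, and candidates are built from all surviving items rather than with the join-and-prune step a real Apriori implementation uses. `arules::apriori()` remains the implementation used in this notebook.

```r
# Level-wise frequent-itemset search (illustrative sketch, not arules itself)
baskets <- list(c("a", "b", "c"), c("a", "b"), c("a", "c"), c("b"))  # hypothetical data
min_support <- 0.5                                                    # hypothetical threshold

# Fraction of baskets that contain every item of `itemset`
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: keep the single items that meet the minimum support
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) support(s) >= min_support, as.list(items))
all_frequent <- frequent

k <- 2
while (length(frequent) > 0) {
  # Candidates of size k use only items that survived level k - 1
  survivors <- sort(unique(unlist(frequent)))
  if (length(survivors) < k) break
  candidates <- combn(survivors, k, simplify = FALSE)
  frequent <- Filter(function(s) support(s) >= min_support, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

result <- vapply(all_frequent, paste, character(1), collapse = ",")
print(result)
```

With these baskets the search stops at size 3, because the only candidate `{a, b, c}` appears in just one of the four baskets.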

In [2]:
# install.packages("arules")
library(arules)
library(dplyr)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Loading required package: Matrix


Attaching package: ‘arules’


The following objects are masked from ‘package:base’:

    abbreviate, write



Attaching package: ‘dplyr’


The following objects are masked from ‘package:arules’:

    intersect, recode, setdiff, setequal, union


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




**Transactions for Apriori can be converted from a list, matrix, or data.frame.**

For example, if we have three items, `c("a", "b", "c")`, and the first transaction contains `c("a", "b", "c")` while the second contains `c("a", "b")`, we can create a list as follows: `record_list <- list(c("a", "b", "c"), c("a", "b"))`. We then convert this list into transactions.

In [5]:
record_list <- list(c("a", "b", "c"), c("a", "b"))
record <- as(record_list, "transactions")
record
summary(record)

transactions in sparse format with
 2 transactions (rows) and
 3 items (columns)

transactions as itemMatrix in sparse format with
 2 rows (elements/itemsets/transactions) and
 3 columns (items) and a density of 0.8333333 

most frequent items:
      a       b       c (Other) 
      2       2       1       0 

element (itemset/transaction) length distribution:
sizes
2 3 
1 1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    2.25    2.50    2.50    2.75    3.00 

includes extended item information - examples:
  labels
1      a
2      b
3      c
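
To see where the item counts and the density of 0.8333333 in the summary come from, the same two transactions can be laid out as a binary incidence matrix. This is a base R sketch for illustration only; arules stores the data in its own sparse format, and the names `incidence`, `item_counts`, and `density` are mine.

```r
# Binary incidence matrix for the two example transactions (base R sketch)
record_list <- list(c("a", "b", "c"), c("a", "b"))
items <- sort(unique(unlist(record_list)))

# One row per transaction, one column per item; TRUE where the item is present
incidence <- t(vapply(record_list, function(tr) items %in% tr,
                      logical(length(items))))
colnames(incidence) <- items

item_counts <- colSums(incidence)  # number of transactions containing each item
density <- mean(incidence)         # share of TRUE cells: 5 of 6

print(incidence)
print(item_counts)
print(density)
```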

# Example 1

Reference: https://www.kirenz.com/post/2020-05-14-r-association-rule-mining/

In [6]:
# create a list of baskets
market_basket <-
  list(
  c("apple", "beer", "rice", "meat"),
  c("apple", "beer", "rice"),
  c("apple", "beer"),
  c("apple", "pear"),
  c("milk", "beer", "rice", "meat"),
  c("milk", "beer", "rice"),
  c("milk", "beer"),
  c("milk", "pear")
  )

In [7]:
# set transaction names (T1 to T8)
names(market_basket) <- paste("T", c(1:8), sep = "")

In [8]:
trans <- as(market_basket, "transactions")
dim(trans)
itemLabels(trans)
summary(trans)

transactions as itemMatrix in sparse format with
 8 rows (elements/itemsets/transactions) and
 6 columns (items) and a density of 0.4583333 

most frequent items:
   beer   apple    milk    rice    meat (Other) 
      6       4       4       4       2       2 

element (itemset/transaction) length distribution:
sizes
2 3 4 
4 2 2 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    2.00    2.50    2.75    3.25    4.00 

includes extended item information - examples:
  labels
1  apple
2   beer
3   meat

includes extended transaction information - examples:
  transactionID
1            T1
2            T2
3            T3

In [9]:
rules <- apriori(trans,
                 parameter = list(supp=0.3, conf=0.5,
                                  maxlen=10,
                                  target= "rules"))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5     0.3      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[6 item(s), 8 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [10 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


In [10]:
summary(rules)

set of 10 rules

rule length distribution (lhs + rhs):sizes
1 2 
4 6 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     1.0     2.0     1.6     2.0     2.0 

summary of quality measures:
    support        confidence        coverage           lift      
 Min.   :0.375   Min.   :0.5000   Min.   :0.5000   Min.   :1.000  
 1st Qu.:0.375   1st Qu.:0.5000   1st Qu.:0.5625   1st Qu.:1.000  
 Median :0.500   Median :0.5833   Median :0.7500   Median :1.000  
 Mean   :0.475   Mean   :0.6417   Mean   :0.7750   Mean   :1.067  
 3rd Qu.:0.500   3rd Qu.:0.7500   3rd Qu.:1.0000   3rd Qu.:1.000  
 Max.   :0.750   Max.   :1.0000   Max.   :1.0000   Max.   :1.333  
     count    
 Min.   :3.0  
 1st Qu.:3.0  
 Median :4.0  
 Mean   :3.8  
 3rd Qu.:4.0  
 Max.   :6.0  

mining info:
  data ntransactions support confidence
 trans             8     0.3        0.5
                                                                                           call
 apriori(data = trans, parameter = li

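A rule with an empty left-hand side, such as `{} => {apple}` in the rules inspected below, reports how often the item occurs overall. Apple appears in 4 of the 8 baskets, so its support is 0.5, regardless of what else is in the basket.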

In [14]:
inspect(rules)

     lhs        rhs     support confidence coverage lift     count
[1]  {}      => {apple} 0.500   0.5000000  1.00     1.000000 4    
[2]  {}      => {milk}  0.500   0.5000000  1.00     1.000000 4    
[3]  {}      => {rice}  0.500   0.5000000  1.00     1.000000 4    
[4]  {}      => {beer}  0.750   0.7500000  1.00     1.000000 6    
[5]  {apple} => {beer}  0.375   0.7500000  0.50     1.000000 3    
[6]  {beer}  => {apple} 0.375   0.5000000  0.75     1.000000 3    
[7]  {milk}  => {beer}  0.375   0.7500000  0.50     1.000000 3    
[8]  {beer}  => {milk}  0.375   0.5000000  0.75     1.000000 3    
[9]  {rice}  => {beer}  0.500   1.0000000  0.50     1.333333 4    
[10] {beer}  => {rice}  0.500   0.6666667  0.75     1.333333 4    
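
The quality measures in this table can be reproduced by hand, which makes their definitions concrete: support is the share of baskets containing all items of the rule, confidence is that support divided by the support of the left-hand side, coverage is the support of the left-hand side, and lift is confidence divided by the support of the right-hand side. The following base R sketch recomputes rules [9] and [10]; the helper names `support` and `rule_measures` are mine, not part of arules.

```r
market_basket <- list(
  c("apple", "beer", "rice", "meat"),
  c("apple", "beer", "rice"),
  c("apple", "beer"),
  c("apple", "pear"),
  c("milk", "beer", "rice", "meat"),
  c("milk", "beer", "rice"),
  c("milk", "beer"),
  c("milk", "pear")
)

# Fraction of baskets that contain all of `items`
support <- function(items) {
  mean(vapply(market_basket, function(b) all(items %in% b), logical(1)))
}

# Quality measures for the rule lhs => rhs
rule_measures <- function(lhs, rhs) {
  supp <- support(c(lhs, rhs))
  conf <- supp / support(lhs)
  c(support = supp, confidence = conf,
    coverage = support(lhs), lift = conf / support(rhs))
}

rice_beer <- rule_measures("rice", "beer")  # rule [9]
beer_rice <- rule_measures("beer", "rice")  # rule [10]
print(round(rbind(rice_beer, beer_rice), 4))
```

The two rules share the same support and lift but differ in confidence, because confidence depends on which side of the rule is used as the denominator.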


In [16]:
# Set minlen to avoid rules with only one item.
rules <- apriori(trans,
                 parameter = list(supp=0.3, conf=0.5,
                                  maxlen=10,
                                  minlen=2,
                                  target= "rules"))
inspect(rules)

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5     0.3      2
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[6 item(s), 8 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [6 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
    lhs        rhs     support confidence coverage lift     count
[1] {apple} => {beer}  0.375   0.7500000  0.50     1.000000 3    
[2] {beer}  => {apple} 0.375   0.5000000  0.75     1.000000 3    
[3] {milk}  => {beer}  0.375   0.7500000  0.50     1.000000 3    
[4] {beer}  => {milk}  0.375   0.50

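Analyze which items lead to beer: restrict the right-hand side of the rules to beer. Association rules describe co-occurrence within a basket, not purchase order.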

In [17]:
beer_rules_rhs <- apriori(trans,
                          parameter = list(supp=0.3, conf=0.5,
                                         maxlen=10,
                                         minlen=2),
                          appearance = list(default="lhs", rhs="beer"))
inspect(beer_rules_rhs)

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5     0.3      2
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2 

set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[6 item(s), 8 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [3 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
    lhs        rhs    support confidence coverage lift     count
[1] {apple} => {beer} 0.375   0.75       0.5      1.000000 3    
[2] {milk}  => {beer} 0.375   0.75       0.5      1.000000 3    
[3] {rice}  => {beer} 0.500   1.00       0.5      1.333333 4    


Analyze which items beer leads to: restrict the left-hand side of the rules to beer.

In [18]:
beer_rules_lhs <- apriori(trans,
                          parameter = list(supp=0.3, conf=0.5,
                                           maxlen=10,
                                           minlen=2),
                          appearance = list(lhs="beer", default="rhs"))
inspect(beer_rules_lhs)

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5     0.3      2
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2 

set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[6 item(s), 8 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [3 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
    lhs       rhs     support confidence coverage lift     count
[1] {beer} => {apple} 0.375   0.5000000  0.75     1.000000 3    
[2] {beer} => {milk}  0.375   0.5000000  0.75     1.000000 3    
[3] {beer} => {rice}  0.500   0.6666667  0.75     1.333333 4    


# Example 2

AdultUCI dataset

In [20]:
data(AdultUCI)
dim(AdultUCI)
AdultUCI[1:2,]

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<fct>` | `<int>` | `<ord>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<int>` | `<fct>` | `<ord>` |
| 1 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | small |
| 2 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | small |


**Data Preprocessing**

In [21]:
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL

AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15,25,45,65,100)),
  labels = c("Young", "Middle-aged", "Senior", "Old"))

AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
  c(0,25,40,60,168)),
  labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))

AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
  c(-Inf,0,median(AdultUCI[[ "capital-gain"]][AdultUCI[["capital-gain"]]>0]),
  Inf)), labels = c("None", "Low", "High"))

AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
  c(-Inf,0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]]>0]),
  Inf)), labels = c("None", "Low", "High"))
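
The preprocessing above turns numeric columns into ordered categories with `cut()`, because arules needs categorical items. A small sketch on hypothetical vectors (`ages` and `gains` are made up for illustration) shows how the breaks behave: intervals are closed on the right, so an age of exactly 25 still falls into "Young", and the median of the positive values separates "Low" from "High".

```r
# Hypothetical values, chosen so that every interval is represented
ages  <- c(17, 25, 26, 45, 60, 90)
gains <- c(0, 0, 100, 500, 900)

# Same breaks and labels as used for AdultUCI[["age"]]
age_group <- ordered(cut(ages, c(15, 25, 45, 65, 100)),
                     labels = c("Young", "Middle-aged", "Senior", "Old"))

# Same pattern as used for capital-gain: zero, then split at the median of positives
gain_group <- ordered(cut(gains, c(-Inf, 0, median(gains[gains > 0]), Inf)),
                      labels = c("None", "Low", "High"))

print(data.frame(ages, age_group))
print(data.frame(gains, gain_group))
```

Note that this pattern relies on every interval containing at least one value; if an interval were empty, the number of labels would no longer match the number of levels.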

In [23]:
# Convert AdultUCI into transactions
Adult <- as(AdultUCI, "transactions")
Adult
summary(Adult)

transactions in sparse format with
 48842 transactions (rows) and
 115 items (columns)

transactions as itemMatrix in sparse format with
 48842 rows (elements/itemsets/transactions) and
 115 columns (items) and a density of 0.1089939 

most frequent items:
           capital-loss=None            capital-gain=None 
                       46560                        44807 
native-country=United-States                   race=White 
                       43832                        41762 
           workclass=Private                      (Other) 
                       33906                       401333 

element (itemset/transaction) length distribution:
sizes
    9    10    11    12    13 
   19   971  2067 15623 30162 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   12.00   13.00   12.53   13.00   13.00 

includes extended item information - examples:
           labels variables      levels
1       age=Young       age       Young
2 age=Middle-aged       age Middle-aged
3      age=Senior       age      Senior

includes extended transaction information - examp

**Association rules**

In [24]:
rules <- apriori(Adult, parameter = list(support = 0.5, confidence = 0.9))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.9    0.1    1 none FALSE            TRUE       5     0.5      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 24421 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[115 item(s), 48842 transaction(s)] done [0.05s].
sorting and recoding items ... [9 item(s)] done [0.01s].
creating transaction tree ... done [0.02s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [52 rule(s)] done [0.00s].
creating S4 object  ... done [0.02s].


In [25]:
summary(rules)

set of 52 rules

rule length distribution (lhs + rhs):sizes
 1  2  3  4 
 2 13 24 13 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   2.923   3.250   4.000 

summary of quality measures:
    support         confidence        coverage           lift       
 Min.   :0.5084   Min.   :0.9031   Min.   :0.5406   Min.   :0.9844  
 1st Qu.:0.5415   1st Qu.:0.9155   1st Qu.:0.5875   1st Qu.:0.9937  
 Median :0.5974   Median :0.9229   Median :0.6293   Median :0.9997  
 Mean   :0.6436   Mean   :0.9308   Mean   :0.6915   Mean   :1.0036  
 3rd Qu.:0.7426   3rd Qu.:0.9494   3rd Qu.:0.7945   3rd Qu.:1.0057  
 Max.   :0.9533   Max.   :0.9583   Max.   :1.0000   Max.   :1.0586  
     count      
 Min.   :24832  
 1st Qu.:26447  
 Median :29178  
 Mean   :31433  
 3rd Qu.:36269  
 Max.   :46560  

mining info:
  data ntransactions support confidence
 Adult         48842     0.5        0.9
                                                                     call
 apriori(data =

In [26]:
# Use the inspect function to view rules
inspect(rules[1])

    lhs    rhs                 support   confidence coverage lift count
[1] {}  => {capital-gain=None} 0.9173867 0.9173867  1        1    44807


Use the `subset` function to extract rules of interest.
For example, we can extract all rules whose right-hand side (rhs) contains `"capital-gain=None"`.

In [28]:
rules.none <- subset(rules, subset = rhs %in% "capital-gain=None")
inspect(rules.none)

     lhs                               rhs                   support confidence  coverage      lift count
[1]  {}                             => {capital-gain=None} 0.9173867  0.9173867 1.0000000 1.0000000 44807
[2]  {hours-per-week=Full-time}     => {capital-gain=None} 0.5435895  0.9290688 0.5850907 1.0127342 26550
[3]  {sex=Male}                     => {capital-gain=None} 0.6050735  0.9051455 0.6684820 0.9866565 29553
[4]  {workclass=Private}            => {capital-gain=None} 0.6413742  0.9239073 0.6941976 1.0071078 31326
[5]  {race=White}                   => {capital-gain=None} 0.7817862  0.9143240 0.8550428 0.9966616 38184
[6]  {native-country=United-States} => {capital-gain=None} 0.8219565  0.9159062 0.8974243 0.9983862 40146
[7]  {capital-loss=None}            => {capital-gain=None} 0.8706646  0.9133376 0.9532779 0.9955863 42525
[8]  {capital-loss=None,                                                                                 
      hours-per-week=Full-time}     => {capita

# Homework

1. Use the methods above to inspect the rules, pick the most useful ones, and explain why.
2. Try adjusting the minimum support and confidence parameters to generate new rules and find useful rules.

In [30]:
inspect(sort(rules, by = "lift")[1:20]) # decreasing lift

     lhs                               rhs                              support confidence  coverage     lift count
[1]  {sex=Male,                                                                                                    
      native-country=United-States} => {race=White}                   0.5415421  0.9051090 0.5983170 1.058554 26450
[2]  {sex=Male,                                                                                                    
      capital-loss=None,                                                                                           
      native-country=United-States} => {race=White}                   0.5113632  0.9032585 0.5661316 1.056390 24976
[3]  {race=White}                   => {native-country=United-States} 0.7881127  0.9217231 0.8550428 1.027076 38493
[4]  {race=White,                                                                                                  
      capital-loss=None}            => {native-country=United-States} 0.

In [34]:
inspect(sort(rules, by = "lift"))  # all rules, still in decreasing lift (pass decreasing = FALSE for increasing order)

     lhs                               rhs                              support confidence  coverage      lift count
[1]  {sex=Male,                                                                                                     
      native-country=United-States} => {race=White}                   0.5415421  0.9051090 0.5983170 1.0585540 26450
[2]  {sex=Male,                                                                                                     
      capital-loss=None,                                                                                            
      native-country=United-States} => {race=White}                   0.5113632  0.9032585 0.5661316 1.0563898 24976
[3]  {race=White}                   => {native-country=United-States} 0.7881127  0.9217231 0.8550428 1.0270761 38493
[4]  {race=White,                                                                                                   
      capital-loss=None}            => {native-country=United-St

A lift close to 1 indicates little association between the two sides of a rule, lift > 1 indicates a positive association, and lift < 1 indicates a negative association. Lift is never negative, so the strength of the relationship is given by how far lift is from 1, not by its absolute value. That is why I sort by lift and look at both ends of the list.
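
One way to rank rules by strength in either direction is to sort by the distance of lift from 1. The sketch below does this in base R for six of the `capital-gain=None` rules, using the lift values printed earlier; the data frame `rules_df` and its `direction` and `strength` columns are my own construction, not an arules feature.

```r
# Lift values copied from the inspect(rules.none) output above
rules_df <- data.frame(
  lhs = c("hours-per-week=Full-time", "sex=Male", "workclass=Private",
          "race=White", "native-country=United-States", "capital-loss=None"),
  lift = c(1.0127342, 0.9866565, 1.0071078, 0.9966616, 0.9983862, 0.9955863)
)

# Direction of the association and its distance from independence (lift = 1)
rules_df$direction <- ifelse(rules_df$lift > 1, "positive", "negative")
rules_df$strength <- abs(rules_df$lift - 1)

ranked <- rules_df[order(rules_df$strength, decreasing = TRUE), ]
print(ranked, row.names = FALSE)
```

All six values lie within about 0.013 of 1, which is consistent with the observation that these high-support rules carry little information.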

Among the top 20 rules with lift > 1 (positive association) I find none that are important. Among the top 20 with lift < 1 (negative association), I find the following rules important:

- Rule 4: People with a low income who work for a private company are typically not married spouses.
- Rule 40: People with a lower income are usually not husbands. I read this as most families still relying on the male as the primary earner.
- Rule 45: People with a lower income are typically not married spouses. I read this as most people pursuing economic stability before marriage.

Open question: given a person's marital status, occupation, and education, which values make a high income more likely?