Consider the candidate rules and their frequencies in the table below. The candidate rules are of the form

$$
X \rightarrow C=c, \quad fr_X = fr(X), \quad fr_{XC} = fr(X, C=c), \quad fr_C = fr(C=c).
$$


| num | rule                             | $fr_X$ | $fr_{XC}$ | $fr_C$ | $\phi$   | $\delta$   | $\gamma$   | nMI   |
|-----|----------------------------------|--------|-----------|--------|----------|------------|------------|-------|
| 1   | smoking → CD                     | 250    | 100       | 300    |          | 0.025      | 1.33       | 11.07 |
| 2   | stress → CD                      | 250    | 75        | 300    |          | 0          | 1.00       | 0     |
| 3   | healthy diet → ¬CD               | 400    | 330       | 700    |          | 0.05       | 1.18       | 37.47 |
| 4   | regular doctor’s visits → ¬CD    | 2      | 2         | 700    | 1.0000   |            | 1.43       | 1.03  |
| 5   | sun avoidance → ¬CD              | 300    | 215       | 700    | 0.717    |            | 1.02       | 0.41  |
| 6   | female → CD                      | 500    | 360       | 700    | 0.702    |            | 1.03       | 1.37  |
| 7   | smoking & sun avoidance → CD     | 210    | 85        | 300    | 0.405    | 0.002      |            | 9.64  |
| 8   | no vaccine & no exercise → CD    | 350    | 100       | 300    | 0.286    | -0.005     |            | 0.38  |
| 9   | smoking & healthy diet → CD      | 210    | 80        | 300    | 0.381    | 0.017      |            | 5.80  |
| 10  | stress & smoking → ¬CD           | 90     | 60        | 300    | 0.667    | 0.0330     | 2.22       |       |
| 11  | female & stress → CD             | 130    | 60        | 300    | 0.462    | 0.0210     | 1.54       |       |
| 12  | female & healthy diet → ¬CD      | 140    | 106       | 700    | 0.757    | 0.0080     | 1.08       |       |


a) Show the equations for calculating the confidence (or precision) ϕ, leverage δ and lift γ values of the association rules.

b) Calculate the missing confidence, leverage, lift and n-normalized mutual information nMI values in the table.

$$
MI(X \rightarrow C=c) = \log_2 \left[ \dfrac{P(X, C)^{P(X, C)} P(X, \neg C)^{P(X, \neg C)} P(\neg X, C)^{P(\neg X, C)} P(\neg X, \neg C)^{P(\neg X, \neg C)}}{P(X)^{P(X)} P(\neg X)^{P(\neg X)} P(C) ^ {P(C)} P(\neg C) ^ {P(\neg C)}} \right]
$$


c) Which candidate rules would not be pruned out based on their leverage and lift values?

d) Which candidate rules would remain after further requiring nMI≥1.5?

e) What would be the next step in finding statistically significant association rules among those remaining after these steps?

a) Show the equations for calculating the confidence (or precision) ϕ, leverage δ and lift γ values of the association rules.

Below, N is the total number of rows and support(·) = fr(·)/N.

For the confidence (or precision) ϕ of an association rule X → C:

$$
\phi(X \rightarrow C) = \frac{\text{support}(X \cap C)}{\text{support}(X)} = \frac{fr_{XC}/N}{fr_X/N} = \frac{fr_{XC}}{fr_X}
$$

For the leverage δ of an association rule X → C:

$$
\delta(X \rightarrow C) = \text{support}(X \cap C) - \text{support}(X) \times \text{support}(C) = \frac{fr_{XC}}{N} - \left( \frac{fr_X}{N} \times \frac{fr_C}{N} \right)
$$

For the lift γ of an association rule X → C:

$$
\gamma(X \rightarrow C) = \dfrac{\text{support}(X \cap C)}{\text{support}(X) \times \text{support}(C)} = \frac{N \times fr_{XC}}{fr_X \times fr_C}
$$
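
As a quick worked check of the three formulas, take rule 1 (smoking → CD) with $fr_X = 250$, $fr_{XC} = 100$, $fr_C = 300$, and assume $N = 1000$ (the value derived in part b):

$$
\phi = \frac{100}{250} = 0.4, \qquad
\delta = \frac{100}{1000} - \frac{250}{1000} \times \frac{300}{1000} = 0.1 - 0.075 = 0.025, \qquad
\gamma = \frac{1000 \times 100}{250 \times 300} \approx 1.33
$$

The δ and γ results agree with the values already given for rule 1 in the table.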


b) Calculate the missing confidence, leverage, lift and n-normalized mutual information nMI values in the table.

$$
MI(X \rightarrow C=c) = \log_2 \left[ \dfrac{P(X, C)^{P(X, C)} P(X, \neg C)^{P(X, \neg C)} P(\neg X, C)^{P(\neg X, C)} P(\neg X, \neg C)^{P(\neg X, \neg C)}}{P(X)^{P(X)} P(\neg X)^{P(\neg X)} P(C) ^ {P(C)} P(\neg C) ^ {P(\neg C)}} \right]
$$

Here nMI is the mutual information scaled by the data size, nMI = N · MI. The signs of δ and γ indicate the direction of the dependence:

X and C are statistically independent if P(X, C) = P(X)P(C) (δ = 0, γ = 1) (“independence rule”).

X and C are positively associated (statistically dependent) if P(X, C) > P(X)P(C) (δ > 0, γ > 1) (rule X → C).

X and C are negatively associated (statistically dependent) if P(X, C) < P(X)P(C) (δ < 0, γ < 1). In that case X and ¬C are positively associated (rule X → ¬C).

```python
import numpy as np
```

The table does not state N, but it can be recovered from the lift formula using the known values of rule 1:

$N \dfrac{fr_{XC}}{fr_X \times fr_C} = N \dfrac{100}{250 \times 300} = 1.33$

=> $N \times 0.00133 = 1.33$ => $N = 1000$

This agrees with the class frequencies in the table: fr(CD) + fr(¬CD) = 300 + 700 = 1000.
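
The derivation can be sanity-checked numerically. The sketch below is illustrative only: the helper name `estimate_n_from_lift` is made up here, and it assumes the lift in the table was rounded to two decimals, which is why the raw estimate does not come out at exactly 1000.

```python
def estimate_n_from_lift(fr_x, fr_xc, fr_c, lift):
    """Solve lift = N * fr_xc / (fr_x * fr_c) for N."""
    return lift * fr_x * fr_c / fr_xc


# Rule 1 (smoking -> CD) with the rounded lift from the table
raw_estimate = estimate_n_from_lift(250, 100, 300, 1.33)
print(raw_estimate)  # close to, but not exactly, 1000 because 1.33 is rounded

# Check that N = 1000 reproduces the tabulated leverage and lift of rule 1
N = 1000
leverage = 100 / N - (250 / N) * (300 / N)
lift = N * 100 / (250 * 300)
print(round(leverage, 3), round(lift, 2))
```

The second print should reproduce the δ and γ entries of rule 1, which supports reading N as 1000.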

```python
N = 1000  # total number of rows, derived above


def confidence(frX, frXC, frC):
    # phi = fr_XC / fr_X (frC is unused; kept so all measures share one signature)
    return frXC / frX


def leverage(frX, frXC, frC):
    # delta = P(X, C) - P(X) * P(C)
    return (frXC / N) - ((frX / N) * (frC / N))


def lift(frX, frXC, frC):
    # gamma = P(X, C) / (P(X) * P(C))
    return N * (frXC / (frX * frC))


def nMI(frX, frXC, frC):
    # Marginal and joint probabilities of the 2x2 contingency table
    P_X = frX / N
    P_C = frC / N
    P_XC = frXC / N
    P_negX = 1 - P_X
    P_negC = 1 - P_C
    P_XnegC = P_X - P_XC
    P_negXC = P_C - P_XC
    P_negXnegC = 1 - P_X - P_C + P_XC

    # An empty cell contributes a factor of 1, since 0.0 ** 0.0 == 1.0 in Python
    numerator = P_XC ** P_XC * P_XnegC ** P_XnegC * P_negXC ** P_negXC * P_negXnegC ** P_negXnegC
    denominator = P_X ** P_X * P_negX ** P_negX * P_C ** P_C * P_negC ** P_negC

    MI = np.log2(numerator / denominator)
    return N * MI


# Uncomment one line to select a rule (listed in table order, rules 1-12)
# frX, frXC, frC = 250, 100, 300
# frX, frXC, frC = 250, 75, 300
# frX, frXC, frC = 400, 330, 700
# frX, frXC, frC = 2, 2, 700
# frX, frXC, frC = 300, 215, 700
# frX, frXC, frC = 500, 360, 700
# frX, frXC, frC = 210, 85, 300
# frX, frXC, frC = 350, 100, 300
# frX, frXC, frC = 210, 80, 300
# frX, frXC, frC = 90, 60, 300
# frX, frXC, frC = 130, 60, 300
frX, frXC, frC = 140, 106, 700

print("Confidence: ", confidence(frX, frXC, frC))
print("Leverage: ", leverage(frX, frXC, frC))
print("Lift: ", lift(frX, frXC, frC))
print("nMI: ", nMI(frX, frXC, frC))
```

Output for rule 12:

```
Confidence:  0.7571428571428571
Leverage:  0.007999999999999993
Lift:  1.0816326530612246
nMI:  1.8894165966433774
```



| num | rule                             | $fr_X$ | $fr_{XC}$ | $fr_C$ | $\phi$    | $\delta$   | $\gamma$   | nMI       |
|-----|----------------------------------|--------|-----------|--------|-----------|------------|------------|-----------|
| 1   | smoking → CD                     | 250    | 100       | 300    | **0.4**   | 0.025      | 1.33       | 11.07     |
| 2   | stress → CD                      | 250    | 75        | 300    | **0.3**   | 0          | 1.00       | 0         |
| 3   | healthy diet → ¬CD               | 400    | 330       | 700    | **0.825** | 0.05       | 1.18       | 37.47     |
| 4   | regular doctor’s visits → ¬CD    | 2      | 2         | 700    | 1.0000    | **0.0006** | 1.43       | 1.03      |
| 5   | sun avoidance → ¬CD              | 300    | 215       | 700    | 0.717     | **0.005**  | 1.02       | 0.41      |
| 6   | female → CD                      | 500    | 360       | 700    | 0.702     | **0.01**   | 1.03       | 1.37      |
| 7   | smoking & sun avoidance → CD     | 210    | 85        | 300    | 0.405     | 0.002      | **1.35**   | 9.64      |
| 8   | no vaccine & no exercise → CD    | 350    | 100       | 300    | 0.286     | -0.005     | **0.95**   | 0.38      |
| 9   | smoking & healthy diet → CD      | 210    | 80        | 300    | 0.381     | 0.017      | **1.27**   | 5.80      |
| 10  | stress & smoking → ¬CD           | 90     | 60        | 300    | 0.667     | 0.0330     | 2.22       | **41.21** |
| 11  | female & stress → CD             | 130    | 60        | 300    | 0.462     | 0.0210     | 1.54       | **12.56** |
| 12  | female & healthy diet → ¬CD      | 140    | 106       | 700    | 0.757     | 0.0080     | 1.08       | **1.89**  |
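
The table values can also be reproduced for all twelve rules in one pass. This is only a sketch: the names `RULES` and `measures` are introduced here for illustration, N = 1000 is taken from the derivation in part b, and nMI is computed in its logarithmic form, which is algebraically the same as the power-product formula above but skips empty cells explicitly.

```python
import math

N = 1000  # total number of rows, as derived in part b

# (rule number, fr_X, fr_XC, fr_C), copied from the table
RULES = [
    (1, 250, 100, 300), (2, 250, 75, 300), (3, 400, 330, 700),
    (4, 2, 2, 700), (5, 300, 215, 700), (6, 500, 360, 700),
    (7, 210, 85, 300), (8, 350, 100, 300), (9, 210, 80, 300),
    (10, 90, 60, 300), (11, 130, 60, 300), (12, 140, 106, 700),
]


def measures(fr_x, fr_xc, fr_c, n=N):
    """Return (confidence, leverage, lift, nMI) for one rule."""
    confidence = fr_xc / fr_x
    leverage = fr_xc / n - (fr_x / n) * (fr_c / n)
    lift = n * fr_xc / (fr_x * fr_c)
    # 2x2 contingency cells as (cell count, row total, column total)
    cells = [
        (fr_xc, fr_x, fr_c),
        (fr_x - fr_xc, fr_x, n - fr_c),
        (fr_c - fr_xc, n - fr_x, fr_c),
        (n - fr_x - fr_c + fr_xc, n - fr_x, n - fr_c),
    ]
    # n * MI = sum over cells of count * log2(count * n / (row * col));
    # empty cells contribute nothing (the 0 * log 0 = 0 convention)
    nmi = sum(c * math.log2(c * n / (r * k)) for c, r, k in cells if c > 0)
    return confidence, leverage, lift, nmi


for num, fr_x, fr_xc, fr_c in RULES:
    phi, delta, gamma, nmi = measures(fr_x, fr_xc, fr_c)
    print(f"{num:>2}  phi={phi:.3f}  delta={delta:.4f}  gamma={gamma:.2f}  nMI={nmi:.2f}")
```

Working from integer counts inside the logarithm keeps the independent case (rule 2) at exactly zero instead of a tiny floating-point residue.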

c) Which candidate rules would not be pruned out based on their leverage and lift values?

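A rule expresses positive dependence only if δ > 0 and γ > 1, so those are the rules we keep. Rules 2 (δ = 0, γ = 1) and 8 (δ < 0, γ < 1) fail this test and are pruned. Rules 4 and 5 can also be pruned because their leverage is very close to 0 (0.0006 and 0.005), which leaves rules 1, 3, 6, 7, 9, 10, 11 and 12. Rule 7 stays because recomputing its leverage from the frequencies gives δ = 0.085 − 0.21 × 0.3 = 0.022, not the 0.002 printed in the table.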
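
The sign test on leverage and lift can be sketched as follows. The names `RULES` and `positively_dependent` are hypothetical, and N = 1000 is taken from part b. The sketch covers only the strict conditions δ > 0 and γ > 1; pruning rules whose leverage is merely close to zero needs a threshold that the exercise does not specify.

```python
N = 1000  # total number of rows, as derived in part b

# rule number -> (fr_X, fr_XC, fr_C), copied from the table
RULES = {
    1: (250, 100, 300), 2: (250, 75, 300), 3: (400, 330, 700),
    4: (2, 2, 700), 5: (300, 215, 700), 6: (500, 360, 700),
    7: (210, 85, 300), 8: (350, 100, 300), 9: (210, 80, 300),
    10: (90, 60, 300), 11: (130, 60, 300), 12: (140, 106, 700),
}


def positively_dependent(fr_x, fr_xc, fr_c, n=N):
    """True when leverage > 0, which is equivalent to lift > 1.

    Both conditions reduce to n * fr_xc > fr_x * fr_c, so integer
    arithmetic avoids floating-point noise around the boundary.
    """
    return n * fr_xc > fr_x * fr_c


kept = [num for num, freqs in RULES.items() if positively_dependent(*freqs)]
print(kept)  # [1, 3, 4, 5, 6, 7, 9, 10, 11, 12]
```

Only rules 2 and 8 are rejected by the sign test alone.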

d) Which candidate rules would remain after further requiring nMI≥1.5?

Requiring nMI ≥ 1.5 additionally removes rule 6 (nMI = 1.37). Rules 2, 4, 5 and 8 also fall below the threshold (nMI = 0, 1.03, 0.41 and 0.38), so they would be excluded here even if they had not already been pruned in part c.
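
The nMI filter can be written as a short sketch. The dictionary `NMI` holds the values from the completed table, and `after_part_c` assumes that rules 2, 8, 4 and 5 were all pruned in part c; both names are introduced here only for illustration.

```python
# rule number -> nMI, copied from the completed table
NMI = {1: 11.07, 2: 0.0, 3: 37.47, 4: 1.03, 5: 0.41, 6: 1.37,
       7: 9.64, 8: 0.38, 9: 5.80, 10: 41.21, 11: 12.56, 12: 1.89}

# Rules still present after the leverage/lift pruning in part c (assumed)
after_part_c = [1, 3, 6, 7, 9, 10, 11, 12]

NMI_THRESHOLD = 1.5
remaining = [num for num in after_part_c if NMI[num] >= NMI_THRESHOLD]
print(remaining)  # [1, 3, 7, 9, 10, 11, 12]
```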

e) What would be the next step in finding statistically significant association rules among those remaining after these steps?

The remaining rules are 1, 3, 7, 9, 10, 11 and 12.

Rule evaluation and ranking: assess the statistical significance of each remaining rule. This could involve computing a test statistic such as chi-square on the 2×2 contingency table of X and C (or a further ranking measure such as conviction) to judge how likely it is that the observed association arose by chance.
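
As one possible illustration of such a test, the sketch below computes the Pearson chi-square statistic for rule 1 from its 2×2 contingency table. It is not necessarily the test the exercise intends: the function name `chi_square` and the 0.05 significance level are assumptions, no continuity correction is applied, and testing many rules at once would additionally call for a multiple-testing correction.

```python
def chi_square(fr_x, fr_xc, fr_c, n):
    """Pearson chi-square statistic of the 2x2 table for X and C (1 degree of freedom)."""
    a = fr_xc                      # X and C
    b = fr_x - fr_xc               # X and not C
    c = fr_c - fr_xc               # not X and C
    d = n - fr_x - fr_c + fr_xc    # not X and not C
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Chi-square critical value for 1 degree of freedom at alpha = 0.05
CRITICAL_0_05 = 3.841

stat = chi_square(250, 100, 300, n=1000)  # rule 1: smoking -> CD
print(round(stat, 2), stat > CRITICAL_0_05)  # 15.87 True
```

For the independent rule 2 the same function returns 0, which matches δ = 0 and nMI = 0 in the table.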

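Redundancy elimination: remove redundant rules. For example, if both A → B and A & C → B are present and have similar confidence, the simpler rule A → B is preferred, because the extra condition C adds no information about B. In this table, rules 7 and 9 are specializations of rule 1 (smoking → CD, ϕ = 0.4): rule 7 barely changes the confidence (0.405) and rule 9 lowers it (0.381).
Monotonicity does not settle this automatically. Only frequency is monotone (fr(A & C) ≤ fr(A)), while confidence and lift can rise or fall when a condition is added, so each specialized rule has to be compared with its more general parent explicitly.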
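
A minimal sketch of such a redundancy check compares rules 7 and 9 with their more general parent, rule 1. The threshold `MIN_IMPROVEMENT` is a hypothetical value chosen only for illustration, and comparing confidence is one of several possible criteria (a significance-based comparison would be another).

```python
from fractions import Fraction

# (name, fr_X, fr_XC) for rule 1 and its two specializations, copied from the table
PARENT = ("smoking -> CD", 250, 100)
SPECIALIZATIONS = [
    ("smoking & sun avoidance -> CD", 210, 85),
    ("smoking & healthy diet -> CD", 210, 80),
]

# Hypothetical threshold: a specialization must raise confidence by at least this much
MIN_IMPROVEMENT = Fraction(1, 100)


def confidence(fr_x, fr_xc):
    """Exact confidence fr_XC / fr_X as a fraction."""
    return Fraction(fr_xc, fr_x)


parent_conf = confidence(PARENT[1], PARENT[2])
redundant = [
    name
    for name, fr_x, fr_xc in SPECIALIZATIONS
    if confidence(fr_x, fr_xc) - parent_conf < MIN_IMPROVEMENT
]
print(redundant)
```

With this threshold both specializations are flagged: rule 7 improves confidence by only 1/210 and rule 9 reduces it.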