### 3.1 Alzheimer disease association rules: basic evaluation

Learning goal: Unconditional evaluation of statistical association rules.

Let us consider a database of n = 1000 patients (50% female, 50% male), 30% of whom have Alzheimer’s disease (AD). The database contains information on the patients and their lifestyle, such as smoking status, diet, use of natural products, stress and education levels. Table 1 lists some candidate rules related to AD. The same content is also available as a LaTeX table on the home page. You can use it as the input for your program and for presenting the results (just add new columns for the measures).

The required equations for mutual information are given in Appendix 1. (Note that we will use $n \cdot MI$ because it is easier to interpret.)

Table 1: Candidate rules $\mathbf{X} \rightarrow C = c$, $c \in \{0, 1\}$, related to $C =$ Alzheimer’s disease. $fr_{X} = fr(\mathbf{X})$, $fr_{XC} = fr(\mathbf{X}, C = c)$.

| num | rule | $fr_{X}$ | $fr_{XC}$ |
| --- | --- | --- | --- |
| 1 | smoking $\rightarrow$ AD | 300 | 125 |
| 2 | stress $\rightarrow$ AD | 500 | 150 |
| 3 | high education $\rightarrow$ $\neg$ AD | 500 | 400 |
| 4 | tea $\rightarrow$ $\neg$ AD | 342 | 240 |
| 5 | turmeric $\rightarrow$ $\neg$ AD | 2 | 2 |
| 6 | female $\rightarrow$ $\neg$ AD | 500 | 352 |
| 7 | female, stress $\rightarrow$ AD | 260 | 100 |
| 8 | berries, apples $\rightarrow$ AD | 120 | 32 |
| 9 | smoking, tea $\rightarrow$ AD | 240 | 100 |
| 10 | smoking, high education $\rightarrow$ AD | 80 | 32 |
| 11 | stress, smoking $\rightarrow$ AD | 200 | 100 |
| 12 | female, high education $\rightarrow$ $\neg$ AD | 251 | 203 |

How to interpret this table: 

$fr(X)$ is the absolute frequency of $X$ (the number of rows in which $X$ occurs).

$fr(XC)$ is the absolute frequency of $X$ and $C$ together (the number of rows in which both $X$ and $C = c$ occur).

$P(X) = fr(X)/n$ is the corresponding relative frequency.

For the first row, $fr_{X} = 300$ means that 300 patients smoke, where $X$ is the binary variable indicating whether a patient smokes. Likewise, $fr_{XC} = 125$ means that 125 patients both smoke and have Alzheimer’s disease (AD). The rule is smoking $\rightarrow$ AD, which means that smoking and AD are claimed to be positively associated.
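Because $n$ and the total number of AD patients are known, the two counts in a row determine the complete 2×2 contingency table of $X$ and $C$. A minimal sketch for rule 1, where the helper name `contingency_counts` is hypothetical and `fr_C = 300` follows from the 30% AD prevalence stated above:

```python
def contingency_counts(n, fr_X, fr_C, fr_XC):
    """Return the four cell counts of the 2x2 table for X and C."""
    return {
        "X, C": fr_XC,
        "X, not C": fr_X - fr_XC,
        "not X, C": fr_C - fr_XC,
        "not X, not C": n - fr_X - fr_C + fr_XC,
    }

# Rule 1: smoking -> AD, with fr_C = 0.3 * 1000 = 300 AD patients (assumed from the text)
cells = contingency_counts(n=1000, fr_X=300, fr_C=300, fr_XC=125)
for name, count in cells.items():
    print(f"{name:>13}: {count}")
```

Dividing each cell by $n$ gives the joint probabilities that the measures in this section are built from.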

Leverage $\delta$ and lift $\gamma$ measure the strength of dependence:

$\delta(A=a, B=b) = P(A=a, B=b) - P(A=a)P(B=b)$

$\gamma(A=a, B=b) = \dfrac{P(A=a, B=b)}{P(A=a)P(B=b)}$

$X$ and $C$ are statistically independent if $P(X, C) = P(X)P(C)$ ($\delta = 0$, $\gamma = 1$) (“independence rule”).

$X$ and $C$ are positively associated (statistically dependent) if $P(X, C) > P(X)P(C)$ ($\delta > 0$, $\gamma > 1$) (rule $X \rightarrow C$).

$X$ and $C$ are negatively associated (statistically dependent) if $P(X, C) < P(X)P(C)$ ($\delta < 0$, $\gamma < 1$). In that case $X$ and $\neg C$ are positively associated (rule $X \rightarrow \neg C$).
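The sign flip in the last case can be checked numerically. A small sketch using the counts of rule 8 from Table 1 (berries, apples $\rightarrow$ AD); the helper name `leverage_lift` is an assumption made for illustration:

```python
def leverage_lift(p_x, p_c, p_xc):
    """Leverage and lift of the rule X -> C from relative frequencies."""
    return p_xc - p_x * p_c, p_xc / (p_x * p_c)

n = 1000
p_x, p_ad, p_x_ad = 120 / n, 0.3, 32 / n

# Rule as stated: berries, apples -> AD
delta_ad, gamma_ad = leverage_lift(p_x, p_ad, p_x_ad)

# Complementary rule: berries, apples -> not AD
delta_neg, gamma_neg = leverage_lift(p_x, 1 - p_ad, p_x - p_x_ad)

print(f"X -> AD:     delta = {delta_ad:+.4f}, gamma = {gamma_ad:.4f}")
print(f"X -> not AD: delta = {delta_neg:+.4f}, gamma = {gamma_neg:.4f}")
```

The two leverages have equal magnitude and opposite sign, because $P(X, \neg C) - P(X)P(\neg C) = -\left(P(X, C) - P(X)P(C)\right)$.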

In [26]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

(a) Calculate leverage and lift values for all rules. Prune out rules that do
not express positive statistical dependence.

In [27]:
n = 1000 # number of patients
P_AD = 0.3 # probability of having AD
P_negAD = 0.7 # probability of not having AD

# Data from the table
rules_data = [
    ("smoking", "AD", 300, 125),
    ("stress", "AD", 500, 150),
    ("higheducation", "not AD", 500, 400),
    ("tea", "not AD", 342, 240),
    ("turmeric", "not AD", 2, 2),
    ("female", "not AD", 500, 352),
    ("female, stress", "AD", 260, 100),
    ("berries, apples", "AD", 120, 32),
    ("smoking, tea", "AD", 240, 100),
    ("smoking, higheducation", "AD", 80, 32),
    ("stress, smoking", "AD", 200, 100),
    ("female, higheducation", "not AD", 251, 203)
]

df = pd.DataFrame(columns=["X", "C", "delta", "gamma"])

df_not_positive_statistically_dependent = pd.DataFrame(columns=["X", "C", "delta", "gamma"])
df_positive_statistically_dependent = pd.DataFrame(columns=["X", "C", "delta", "gamma"])

for i, (X, C, fr_X, fr_XC) in enumerate(rules_data):
    # Relative frequencies P(X) and P(X, C)
    P_X = fr_X / n
    P_XC = fr_XC / n

    # The consequent is either AD or its negation
    P_C = P_AD if C == "AD" else P_negAD

    # Leverage and lift of the rule X -> C
    leverage = P_XC - P_X * P_C
    lift = P_XC / (P_X * P_C)

    df.loc[i] = [X, C, leverage, lift]

    # Index by rule number (i + 1) so that the rows match Table 1
    if leverage > 0 and lift > 1:
        df_positive_statistically_dependent.loc[i + 1] = [X, C, leverage, lift]
    else:
        df_not_positive_statistically_dependent.loc[i + 1] = [X, C, leverage, lift]

print("Rules that express positive statistical dependence\n")
print(df_positive_statistically_dependent)

print()

print("Rules that do not express positive statistical dependence\n")
print(df_not_positive_statistically_dependent)


Rules that express positive statistical dependence

                         X       C   delta     gamma
1                  smoking      AD  0.0350  1.388889
3            higheducation  not AD  0.0500  1.142857
4                      tea  not AD  0.0006  1.002506
5                 turmeric  not AD  0.0006  1.428571
6                   female  not AD  0.0020  1.005714
7           female, stress      AD  0.0220  1.282051
9             smoking, tea      AD  0.0280  1.388889
10  smoking, higheducation      AD  0.0080  1.333333
11         stress, smoking      AD  0.0400  1.666667
12   female, higheducation  not AD  0.0273  1.155378

Rules that do not express positive statistical dependence

                 X   C  delta     gamma
2           stress  AD  0.000  1.000000
8  berries, apples  AD -0.004  0.888889
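The same pruning step can also be written without an explicit loop. The following is only a sketch of an alternative formulation, not the cell used above; the column names `num`, `fr_X` and `fr_XC` are chosen here for illustration, and the data is copied from Table 1:

```python
import pandas as pd

n = 1000
p_c = {"AD": 0.3, "not AD": 0.7}  # P(C) for each consequent

# (rule number, X, C, fr_X, fr_XC), copied from Table 1
rules = pd.DataFrame(
    [
        (1, "smoking", "AD", 300, 125),
        (2, "stress", "AD", 500, 150),
        (3, "higheducation", "not AD", 500, 400),
        (4, "tea", "not AD", 342, 240),
        (5, "turmeric", "not AD", 2, 2),
        (6, "female", "not AD", 500, 352),
        (7, "female, stress", "AD", 260, 100),
        (8, "berries, apples", "AD", 120, 32),
        (9, "smoking, tea", "AD", 240, 100),
        (10, "smoking, higheducation", "AD", 80, 32),
        (11, "stress, smoking", "AD", 200, 100),
        (12, "female, higheducation", "not AD", 251, 203),
    ],
    columns=["num", "X", "C", "fr_X", "fr_XC"],
).set_index("num")

P_X = rules["fr_X"] / n
P_XC = rules["fr_XC"] / n
P_C = rules["C"].map(p_c)

# Column-wise leverage and lift for all rules at once
rules["delta"] = P_XC - P_X * P_C
rules["gamma"] = P_XC / (P_X * P_C)

kept = rules[rules["delta"] > 0]     # positive statistical dependence
pruned = rules[rules["delta"] <= 0]  # independent or negatively associated
print(pruned[["X", "C", "delta", "gamma"]])
```

Filtering on `delta > 0` alone is enough here, since $\delta > 0$ and $\gamma > 1$ describe the same condition.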


(b) Evaluate mutual information MI of remaining rules (report $n \cdot MI$ values) and prune out rules where $n \cdot MI < 1.5$ (i.e., $MI < 0.0015$).

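Mutual information formula (the code below uses base-2 logarithms):

$MI(X, C) = \sum_{x \in X} \sum_{c \in C} P(X=x, C=c) \log \dfrac{P(X=x, C=c)}{P(X=x)P(C=c)}$

When $X$ and $C$ are binary variables, the sum expands to four terms (writing $X$ for $X=1$ and $\neg X$ for $X=0$, and similarly for $C$):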

$MI(X, C) = P(X, C) \log \dfrac{P(X, C)}{P(X)P(C)} + P(X, \neg C) \log \dfrac{P(X, \neg C)}{P(X)P(\neg C)} + P(\neg X, C) \log \dfrac{P(\neg X, C)}{P(\neg X)P(C)} + P(\neg X, \neg C) \log \dfrac{P(\neg X, \neg C)}{P(\neg X)P(\neg C)}$

=> $MI = \log \dfrac{P(X, C)^{P(X, C)}}{(P(X)P(C)) ^ {P(X, C)}} + \log \dfrac{P(X, \neg C)^{P(X, \neg C)}}{(P(X)P(\neg C)) ^ {P(X, \neg C)}} + \log \dfrac{P(\neg X, C)^{P(\neg X, C)}}{(P(\neg X)P(C)) ^ {P(\neg X, C)}} + \log \dfrac{P(\neg X, \neg C)^{P(\neg X, \neg C)}}{(P(\neg X)P(\neg C)) ^ {P(\neg X, \neg C)}}$

=> $MI = \log \left[ \dfrac{P(X, C)^{P(X, C)} P(X, \neg C)^{P(X, \neg C)} P(\neg X, C)^{P(\neg X, C)} P(\neg X, \neg C)^{P(\neg X, \neg C)}}{(P(X)P(C)) ^ {P(X, C)} (P(X)P(\neg C)) ^ {P(X, \neg C)} (P(\neg X)P(C)) ^ {P(\neg X, C)} (P(\neg X)P(\neg C)) ^ {P(\neg X, \neg C)}} \right]$

=> $MI = \log \left[ \dfrac{P(X, C)^{P(X, C)} P(X, \neg C)^{P(X, \neg C)} P(\neg X, C)^{P(\neg X, C)} P(\neg X, \neg C)^{P(\neg X, \neg C)}}{P(X)^{P(X, C) + P(X, \neg C)} P(C)^{P(X, C) + P(\neg X, C)} P(\neg X)^{P(\neg X, C) + P(\neg X, \neg C)} P(\neg C)^{P(X, \neg C) + P(\neg X, \neg C)}} \right]$

=> $MI = \log \left[ \dfrac{P(X, C)^{P(X, C)} P(X, \neg C)^{P(X, \neg C)} P(\neg X, C)^{P(\neg X, C)} P(\neg X, \neg C)^{P(\neg X, \neg C)}}{P(X)^{P(X)} P(\neg X)^{P(\neg X)} P(C) ^ {P(C)} P(\neg C) ^ {P(\neg C)}} \right]$
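The derivation can be sanity-checked numerically by comparing the four-term sum with the final product form. A sketch assuming base-2 logarithms and the usual convention $0 \log 0 = 0$; the function names `cells`, `mi_sum` and `mi_product` are hypothetical:

```python
import math

def cells(p_x, p_c, p_xc):
    """Joint probabilities paired with the product of their marginals."""
    return [
        (p_xc, p_x * p_c),                              # X, C
        (p_x - p_xc, p_x * (1 - p_c)),                  # X, not C
        (p_c - p_xc, (1 - p_x) * p_c),                  # not X, C
        (1 - p_x - p_c + p_xc, (1 - p_x) * (1 - p_c)),  # not X, not C
    ]

def mi_sum(p_x, p_c, p_xc):
    """Four-term sum; empty cells are skipped (0 log 0 = 0)."""
    return sum(p * math.log2(p / q) for p, q in cells(p_x, p_c, p_xc) if p > 0)

def mi_product(p_x, p_c, p_xc):
    """Final product form; relies on 0.0 ** 0.0 == 1.0 in Python."""
    numerator = math.prod(p ** p for p, _ in cells(p_x, p_c, p_xc))
    denominator = math.prod(m ** m for m in (p_x, 1 - p_x, p_c, 1 - p_c))
    return math.log2(numerator / denominator)

n = 1000
# Rule 1 (smoking -> AD) and rule 5 (turmeric -> not AD, which has an empty cell)
examples = [("smoking", 0.3, 0.3, 0.125), ("turmeric", 0.002, 0.7, 0.002)]
for name, p_x, p_c, p_xc in examples:
    a, b = mi_sum(p_x, p_c, p_xc), mi_product(p_x, p_c, p_xc)
    print(f"{name}: n*MI (sum) = {n * a:.4f}, n*MI (product) = {n * b:.4f}")
```

The turmeric rule is included because $P(X, \neg C) = 0$ there: the sum form needs an explicit guard for the empty cell, while the product form handles it through $0^0 = 1$.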

In [28]:
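def mutual_information(P_X, P_C, P_XC):
    """Mutual information (in bits) of binary X and C, using the product form derived above."""
    # Remaining marginals and joint probabilities of the 2x2 table
    P_negX = 1 - P_X
    P_negC = 1 - P_C
    P_XnegC = P_X - P_XC
    P_negXC = P_C - P_XC
    P_negXnegC = 1 - P_X - P_C + P_XC

    # 0 ** 0 evaluates to 1 in Python, which matches the convention 0 log 0 = 0
    numerator = P_XC ** P_XC * P_XnegC ** P_XnegC * P_negXC ** P_negXC * P_negXnegC ** P_negXnegC
    denominator = P_X ** P_X * P_negX ** P_negX * P_C ** P_C * P_negC ** P_negC

    MI = np.log2(numerator / denominator)
    return MI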

remaining_rules = []  # Store rules that have positive statistical dependence

n = 1000 # number of patients
P_AD = 0.3 # probability of having AD
P_negAD = 0.7 # probability of not having AD

# Repeat the filter from part (a): keep rules with positive dependence
for X, C, fr_X, fr_XC in rules_data:
    P_X = fr_X / n
    P_XC = fr_XC / n
    P_C = P_AD if C == "AD" else P_negAD

    leverage = P_XC - P_X * P_C
    lift = P_XC / (P_X * P_C)

    if leverage > 0 and lift > 1:
        remaining_rules.append((X, C, fr_X, fr_XC))

df_nMI_larger_than_1_5 = pd.DataFrame(columns=["X", "C", "delta", "gamma", "n * MI"])
df_nMI_not_larger_than_1_5 = pd.DataFrame(columns=["X", "C", "delta", "gamma", "n * MI"])

for i, (X, C, fr_X, fr_XC) in enumerate(remaining_rules):
    P_X = fr_X / n
    P_XC = fr_XC / n
    P_C = P_AD if C == "AD" else P_negAD

    leverage = P_XC - P_X * P_C
    lift = P_XC / (P_X * P_C)
    MI = mutual_information(P_X, P_C, P_XC)

    # Check the condition n * MI > 1.5
    if n * MI > 1.5:
        df_nMI_larger_than_1_5.loc[i] = [X, C, leverage, lift, n * MI]
    else:
        df_nMI_not_larger_than_1_5.loc[i] = [X, C, leverage, lift, n * MI]

print("Rules that have n * MI > 1.5\n")
print(df_nMI_larger_than_1_5)

print()

print("Rules that have n * MI <= 1.5\n")
print(df_nMI_not_larger_than_1_5)


Rules that have n * MI > 1.5

                        X       C   delta     gamma     n * MI
0                 smoking      AD  0.0350  1.388889  19.435585
1           higheducation  not AD  0.0500  1.142857  34.851555
5          female, stress      AD  0.0220  1.282051   8.398762
6            smoking, tea      AD  0.0280  1.388889  14.201863
7  smoking, higheducation      AD  0.0080  1.333333   2.846664
8         stress, smoking      AD  0.0400  1.666667  32.268400
9   female, higheducation  not AD  0.0273  1.155378  14.461504

Rules that have n * MI <= 1.5

          X       C   delta     gamma    n * MI
2       tea  not AD  0.0006  1.002506  0.005498
3  turmeric  not AD  0.0006  1.428571  1.030385
4    female  not AD  0.0020  1.005714  0.054961
