### 3.2 Alzheimer disease association rules: further evaluation

Learning goal: Conditional evaluation and interpretation of statistical association rules. 

In this task, the same association rules (Table 1) are evaluated further.
You can now evaluate only those rules that remained significant after task 1.

In [6]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

(a) Evaluate overfitting among the remaining rules using the value-based interpretation and conditional mutual information $MI_{C}$: rule $\mathbf{X} \rightarrow C=c$ is pruned if there exists some $\mathbf{Y} \subsetneq \mathbf{X}$ such that either $P(C=c|\mathbf{Y}) \geq P(C=c|\mathbf{X})$ (no improvement) or the improvement is not sufficient, $n \cdot MI_{C} < 0.5$ (i.e., $MI_{C} < 0.0005$ for $n = 1000$).

The conditional probability $P(C=c|X)$ is defined as:

$$P(C = c | X) = \frac{P(C = c, X)}{P(X)}$$
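As a quick worked example, the conditional probability can be computed directly from the absolute frequencies listed for each rule, because the data size $n$ cancels in the ratio. The sketch below assumes each rule is described by Fr(X) and Fr(X, C=c), as in the rule tuples used in this notebook; the helper name `cond_prob` is hypothetical.

```python
from fractions import Fraction

def cond_prob(fr_X, fr_XC):
    """P(C=c | X) = P(C=c, X) / P(X) = Fr(X, C=c) / Fr(X)."""
    return Fraction(fr_XC, fr_X)

# smoking -> AD: Fr(smoking) = 300, Fr(smoking, AD) = 125
p_smoking = cond_prob(300, 125)
# stress, smoking -> AD: Fr(stress, smoking) = 200, Fr(stress, smoking, AD) = 100
p_stress_smoking = cond_prob(200, 100)

print("P(AD | smoking) =", p_smoking)
print("P(AD | stress, smoking) =", p_stress_smoking)
```

Exact fractions are used here only to keep ties between rules (equal precision) visible; ordinary floating-point division works as well.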

Conditional mutual information for evaluating rule $XY \rightarrow C=c$ given $X$
in the value-based interpretation is

In [11]:
# The remaining rules from Task 3.1

rules_data = [
    ("smoking", "AD", 300, 125),
    ("higheducation", "not AD", 500, 400),
    ("female, stress", "AD", 260, 100),
    ("smoking, tea", "AD", 240, 100),
    ("smoking, higheducation", "AD", 80, 32),
    ("stress, smoking", "AD", 200, 100),
    ("female, higheducation", "not AD", 251, 203)
]

# List of subsets of the rules

rules_subset_data = [
    {
        "Y": ("smoking", "AD", 300, 125),
        "X": ("smoking, tea", "AD", 240, 100),
    },
    {
        "Y": ("smoking", "AD", 300, 125),
        "X": ("smoking, higheducation", "AD", 80, 32),
    },
    {
        "Y": ("smoking", "AD", 300, 125),
        "X": ("stress, smoking", "AD", 200, 100),
    },
    {
        "Y": ("higheducation", "not AD", 500, 400),
        "X": ("female, higheducation", "not AD", 251, 203),
    }
]

In [None]:
n = 1000
P_AD = 0.3
P_negAD = 0.7

def value_based_interpretation(P_X, P_XC, P_Y, P_YC):
    # Returns True when Y -> C is at least as precise as X -> C (no improvement)
    P_CgivenY = P_YC / P_Y
    P_CgivenX = P_XC / P_X
    return P_CgivenY >= P_CgivenX
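
A usage sketch for the "no improvement" condition, applied to the rule pairs from Task 3.1. It is written against absolute frequencies rather than probabilities (an assumption made here for convenience, since $n$ cancels) and compares the two ratios by cross-multiplication so that exact ties are not affected by floating-point rounding. The name `no_improvement` is hypothetical.

```python
def no_improvement(fr_X, fr_XC, fr_Y, fr_YC):
    """True if P(C|Y) >= P(C|X), i.e. specializing Y to X does not raise precision.

    fr_YC / fr_Y >= fr_XC / fr_X is tested as fr_YC * fr_X >= fr_XC * fr_Y,
    which is equivalent because all frequencies are positive.
    """
    return fr_YC * fr_X >= fr_XC * fr_Y

# (antecedent of X, Fr(X), Fr(X, C), Fr(Y), Fr(Y, C)) for each pair with Y a proper subset of X
pairs = [
    ("smoking, tea", 240, 100, 300, 125),
    ("smoking, higheducation", 80, 32, 300, 125),
    ("stress, smoking", 200, 100, 300, 125),
    ("female, higheducation", 251, 203, 500, 400),
]

results = {name: no_improvement(fx, fxc, fy, fyc) for name, fx, fxc, fy, fyc in pairs}
for name, flag in results.items():
    print(name, "->", "no improvement" if flag else "improves on Y")
```

Rules flagged as "no improvement" are pruned immediately; the others still have to pass the $n \cdot MI_{C}$ check.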
    

In [8]:
n = 1000
Fr_C = {"AD": 300, "not AD": 700}  # Fr(C): absolute number of people per consequent

def conditional_MI(X, Y, C, n):
    # X, Y: rule tuples (antecedent, consequent, Fr(antecedent), Fr(antecedent, consequent))
    # C: absolute frequency of the consequent, n: data size
    P_X = X[2] / n
    P_Y = Y[2] / n
    P_C = C / n
    P_XC = X[3] / n
    P_YC = Y[3] / n

    # Calculate conditional mutual information
    # (as implemented here: the sum of the mutual information terms of X -> C and Y -> C)
    MI_C = P_XC * np.log2(P_XC / (P_X * P_C)) + (P_X - P_XC) * np.log2((P_X - P_XC) / (P_X * (1 - P_C))) + \
           P_YC * np.log2(P_YC / (P_Y * P_C)) + (P_Y - P_YC) * np.log2((P_Y - P_YC) / (P_Y * (1 - P_C)))

    return MI_C

# Iterate over each rule and check for overfitting
pruned_rules = []
for X in rules_data:
    for Y in rules_data:
        # Y must be a proper generalization of X with the same consequent
        if Y[1] == X[1] and set(Y[0].split(", ")) < set(X[0].split(", ")):
            if (Y[3] / Y[2] >= X[3] / X[2]) or (n * conditional_MI(X, Y, Fr_C[X[1]], n) < 0.5):
                pruned_rules.append(X)
                break

# Print the pruned rules
for rule in pruned_rules:
    print(rule)


('smoking, tea', 'AD', 240, 100)
('smoking, higheducation', 'AD', 80, 32)


(b) What are your conclusions based on the remaining association rules?
What would you recommend to do if one would like to avoid Alzheimer’s
disease?

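My advice: If you want to avoid Alzheimer's disease, you should avoid smoking. Among the remaining rules, smoking $\rightarrow$ AD has precision $125/300 \approx 0.42$ against a baseline of $P(AD) = 0.3$, and the combination of stress and smoking raises it to $100/200 = 0.5$.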

A recent review of 37 research studies found that compared to never smokers, current smokers were 30% more likely to develop dementia in general and 40% more likely to develop Alzheimer's disease. Analyses of earlier studies suggested the risk may be even higher than that. 

Ref: https://www.alzheimersresearchuk.org/blog/all-you-need-to-know-about-smoking-and-dementia/#:~:text=Smoking%20and%20dementia%20risk&text=A%20recent%20review%20of%2037,be%20even%20higher%20than%20that.
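
To relate the advice to the data, the sketch below computes precision and lift for the rules that were not pruned in part (a). It assumes $P(AD) = 0.3$ and $P(\text{not AD}) = 0.7$, as set in the code above; the variable names are illustrative.

```python
priors = {"AD": 0.3, "not AD": 0.7}  # assumed consequent priors, as in part (a)

# Rules that survived the pruning in part (a):
# (antecedent, consequent, Fr(antecedent), Fr(antecedent, consequent))
remaining = [
    ("smoking", "AD", 300, 125),
    ("higheducation", "not AD", 500, 400),
    ("female, stress", "AD", 260, 100),
    ("stress, smoking", "AD", 200, 100),
    ("female, higheducation", "not AD", 251, 203),
]

summary = []
for antecedent, consequent, fr_X, fr_XC in remaining:
    precision = fr_XC / fr_X
    lift = precision / priors[consequent]  # lift relative to the rule's own consequent
    summary.append((antecedent, consequent, precision, lift))
    print(f"{antecedent} -> {consequent}: precision={precision:.3f}, lift={lift:.3f}")
```

Under these assumptions every remaining AD rule involves smoking or stress and has lift above 1, while both "not AD" rules involve high education.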

(c) Give example rules (among all 12 rules) that demonstrate the following
things. Explain your choices briefly (why they demonstrate something).
One example suffices for each part.

(i) An association rule may have high precision and lift but still lack
validity (it is unlikely to hold in future data).

(ii) Statistical dependence is not a monotonic property, i.e., a rule
can express strong dependence even if more general rules express
independence or the opposite dependence (positive instead of negative
or negative instead of positive).

(iii) Overfitted rules can lead to wrong conclusions.

In [9]:
rules_data = [
    ("smoking", "AD", 300, 125),
    ("stress", "AD", 500, 150),
    ("higheducation", "not AD", 500, 400),
    ("tea", "AD", 342, 240),
    ("turmeric", "not AD", 2, 2),
    ("female", "not AD", 500, 352),
    ("female, stress", "AD", 260, 100),
    ("berries, apples", "AD", 120, 32),
    ("smoking, tea", "AD", 240, 100),
    ("smoking, higheducation", "AD", 80, 32),
    ("stress, smoking", "AD", 200, 100),
    ("female, higheducation", "not AD", 251, 203)
]

P_C = {"AD": 0.3, "not AD": 0.7}  # prior probability of each consequent

def precision(rule):
    return rule[3] / rule[2]

def lift(rule):
    # Lift is measured against the prior of the rule's own consequent
    return precision(rule) / P_C[rule[1]]

# (i) Find rules with high precision and lift but lack validity
# For simplicity, we'll consider rules with precision > 0.8 and lift > 1.4 as high
# (the lift of a "not AD" rule cannot exceed 1 / 0.7 ≈ 1.43)
high_precision_lift_rules = [rule for rule in rules_data if precision(rule) > 0.8 and lift(rule) > 1.4]

# (ii) Find rules where a specific rule expresses dependence that its generalization does not
# We'll look for rules where the more general rule has a lift close to 1 (independence) but the specific rule has a lift far from 1
non_monotonic_rules = []
for specific in rules_data:
    for general in rules_data:
        same_consequent = specific[1] == general[1]
        is_proper_subset = set(general[0].split(", ")) < set(specific[0].split(", "))
        if same_consequent and is_proper_subset:
            if abs(lift(general) - 1) < 0.1 and abs(lift(specific) - 1) > 0.5:
                non_monotonic_rules.append((specific, general))

# (iii) Find overfitted rules
# For simplicity, we'll consider rules derived from itemsets with frequency < 5 as potentially overfitted
overfitted_rules = [rule for rule in rules_data if rule[2] < 5]

# Print the rules
print("High precision and lift but lack validity:", high_precision_lift_rules)
print("Non-monotonic dependence:", non_monotonic_rules)
print("Potentially overfitted rules:", overfitted_rules)


High precision and lift but lack validity: [('turmeric', 'not AD', 2, 2)]
Non-monotonic dependence: [(('stress, smoking', 'AD', 200, 100), ('stress', 'AD', 500, 150))]
Potentially overfitted rules: [('turmeric', 'not AD', 2, 2)]
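
For part (i), one way to see why turmeric $\rightarrow$ not AD lacks validity despite its precision of $2/2 = 1.0$ is to ask how likely such a result is by chance alone. The sketch below assumes the two turmeric users are drawn at random, without replacement, from $n = 1000$ people of whom 700 do not have AD. This hypergeometric model is chosen here for illustration only; it is not necessarily the significance measure used in task 1.

```python
from math import comb

n = 1000         # data size
fr_not_AD = 700  # people without AD, from P(not AD) = 0.7
fr_X = 2         # Fr(turmeric)
fr_XC = 2        # Fr(turmeric, not AD)

precision = fr_XC / fr_X
lift = precision / (fr_not_AD / n)

# Probability that both of two randomly chosen people are free of AD
p_chance = comb(fr_not_AD, fr_X) / comb(n, fr_X)

print(f"precision={precision:.2f}, lift={lift:.2f}, chance probability={p_chance:.3f}")
```

Under this model the chance probability is about 0.49, so a rule that looks perfect on two observations would arise in roughly every second random draw of two people.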
