# Finding Association Rules

## Q1: Report 3 rules with at least 0.2 support and 0.9 confidence

In [127]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules


data_from_csv = pd.read_csv('mammographic_masses.csv')
# convert to "attribute=value"
data = data_from_csv.apply(lambda row: [f"{col}={val}" for col, val in row.items()], axis=1).tolist()
te = TransactionEncoder()
# one-hot encoding
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)
# computing frequent itemsets and association rules
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)

# visualizing association rules results
rules[["antecedents","consequents","support","confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
0,(Shape=4),(Density=3),0.37461,0.9
1,"(Margin=1, Density=3)",(BI-RADS=4),0.263267,0.900356
2,"(Margin=1, Severity=0)",(BI-RADS=4),0.299688,0.911392
3,"(Margin=1, BI-RADS=4)",(Severity=0),0.299688,0.911392
4,"(Shape=4, BI-RADS=5)",(Density=3),0.245578,0.904215
5,"(Shape=4, BI-RADS=5)",(Severity=1),0.246618,0.908046
6,"(Severity=1, Shape=4)",(Density=3),0.295525,0.901587
7,"(Margin=1, Severity=0, Density=3)",(BI-RADS=4),0.238293,0.927126
8,"(Margin=1, Density=3, BI-RADS=4)",(Severity=0),0.238293,0.905138
9,"(Shape=4, Density=3, BI-RADS=5)",(Severity=1),0.224766,0.915254


| Rule                                                                              | Support  | Confidence |
|-----------------------------------------------------------------------------------|----------|------------|
| Margin=1(circumscribed), BI-RADS=4 $\rightarrow$ Severity=0(benign)               | 0.299688 | 0.911392   |
| Shape=4(irregular), BI-RADS=5 $\rightarrow$ Severity=1(malignant)                 | 0.246618 | 0.908046   |
| Shape=4(irregular), Density=3(low), BI-RADS=5 $\rightarrow$ Severity=1(malignant) | 0.224766 | 0.915254   |
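
To make the two metrics in the table concrete, here is a minimal sketch that computes them by hand. The transactions are hypothetical toy data, not rows from `mammographic_masses.csv`, and the helper names `support` and `confidence` are illustrative. It assumes the usual definitions: the support of a rule is the fraction of transactions containing both antecedent and consequent, and confidence is that support divided by the support of the antecedent.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical toy transactions in the same "attribute=value" form.
toy = [
    ["Shape=4", "BI-RADS=5", "Severity=1"],
    ["Shape=4", "BI-RADS=5", "Severity=1"],
    ["Shape=4", "BI-RADS=5", "Severity=0"],
    ["Shape=1", "BI-RADS=4", "Severity=0"],
    ["Shape=1", "BI-RADS=4", "Severity=0"],
]

# Rule: Shape=4, BI-RADS=5 -> Severity=1
rule_support = support(toy, ["Shape=4", "BI-RADS=5", "Severity=1"])
rule_confidence = confidence(toy, ["Shape=4", "BI-RADS=5"], ["Severity=1"])
print("support:", rule_support)
print("confidence:", rule_confidence)
```

In this toy set the rule holds in 2 of 5 transactions, and in 2 of the 3 transactions that match the antecedent.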

## Q2: Predict benign or malignant

In [128]:
te = TransactionEncoder()
# one-hot encoding
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)
# computing frequent itemsets and association rules
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)
# filter rules about severity
severity_rules = rules[rules['consequents'].apply(lambda x: 'Severity=0' in x or 'Severity=1' in x)]
severity_rules[["antecedents","consequents","support","confidence"]]

Unnamed: 0,antecedents,consequents,support,confidence
4,"(Margin=1, BI-RADS=4)",(Severity=0),0.299688,0.911392
5,"(Shape=1, BI-RADS=4)",(Severity=0),0.173777,0.907609
6,"(Shape=2, BI-RADS=4)",(Severity=0),0.162331,0.901734
8,"(Shape=4, BI-RADS=5)",(Severity=1),0.246618,0.908046
12,"(Margin=1, Shape=2)",(Severity=0),0.136316,0.903448
14,"(Margin=1, Density=3, BI-RADS=4)",(Severity=0),0.238293,0.905138
18,"(Shape=1, Density=3, BI-RADS=4)",(Severity=0),0.145682,0.915033
21,"(Margin=1, Shape=1, BI-RADS=4)",(Severity=0),0.156087,0.909091
23,"(Margin=1, Shape=2, BI-RADS=4)",(Severity=0),0.12487,0.9375
24,"(Margin=4, Density=3, BI-RADS=5)",(Severity=1),0.121748,0.906977


| Rule                                                                                     | Support  | Confidence |
|------------------------------------------------------------------------------------------|----------|------------|
| Margin=1(circumscribed), Shape=2(oval) $\rightarrow$ Severity=0(benign)                  | 0.136316 | 0.903448   |
| Margin=1(circumscribed), Shape=1(round), Density=3(low) $\rightarrow$ Severity=0(benign) | 0.143600 | 0.901961   |

Without BI-RADS in the antecedent, only two rules predict a benign outcome and none predicts a malignant one. Both rules say that when the margin of the lesion is **circumscribed** and the shape is regular (round or oval), the instance is very likely benign, with a confidence of about 0.90.
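
The "without BI-RADS" selection is not shown in the cell above, so the following is only a sketch of how such a filter could look. It uses a hypothetical three-row stand-in for the rules table (plain dictionaries holding `frozenset` itemsets, the type `association_rules` uses for its `antecedents` and `consequents` columns); the helper names are illustrative.

```python
# Hypothetical stand-in for a few rows of the rules table.
toy_rules = [
    {"antecedents": frozenset({"Margin=1", "BI-RADS=4"}),
     "consequents": frozenset({"Severity=0"})},
    {"antecedents": frozenset({"Margin=1", "Shape=2"}),
     "consequents": frozenset({"Severity=0"})},
    {"antecedents": frozenset({"Shape=4", "BI-RADS=5"}),
     "consequents": frozenset({"Severity=1"})},
]


def uses_bi_rads(itemset):
    """True if any item in the itemset is a BI-RADS assessment."""
    return any(item.startswith("BI-RADS=") for item in itemset)


def predicts_severity(itemset):
    """True if the itemset contains a Severity label."""
    return any(item.startswith("Severity=") for item in itemset)


# Keep rules that predict severity and do not rely on BI-RADS.
without_bi_rads = [
    r for r in toy_rules
    if predicts_severity(r["consequents"]) and not uses_bi_rads(r["antecedents"])
]

for r in without_bi_rads:
    print(sorted(r["antecedents"]), "->", sorted(r["consequents"]))
```

On the real table, the same predicates could be passed to `rules['antecedents'].apply(...)` and `rules['consequents'].apply(...)`, in the same way the severity filter is written in the cell above.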

## Q3: Why is BI-RADS not always accurate?

| Rule                                                       | Support  | Confidence |
|------------------------------------------------------------|----------|------------|
| Shape=1(round), BI-RADS=4 $\rightarrow$ Severity=0(benign) | 0.173777 | 0.907609   |
| Shape=2(oval), BI-RADS=4 $\rightarrow$ Severity=0(benign)  | 0.162331 | 0.901734   |

These two rules show that even when the BI-RADS score is 4, which indicates a suspicion of malignancy, a round or oval mass is benign with more than **90%** confidence. A BI-RADS 4 assessment therefore does not always correspond to a malignant mass, which may lead to unnecessary breast biopsies.

## Q4: Age=35 $\Rightarrow$ Severity=0

In [129]:
# work on a copy so the Age conversion does not modify data_from_csv
df = data_from_csv.copy()
total = len(df)
# coerce: replace non-numeric values with NaN
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
age_cnt = len(df[df['Age'] == 35])
age_and_severity_cnt = len(df[(df['Age'] == 35) & (df['Severity'] == 0)])
support = age_and_severity_cnt / total
confidence = age_and_severity_cnt / age_cnt
print('support:', support)
print('confidence:', confidence)

support: 0.012486992715920915
confidence: 0.9230769230769231


* Support: 0.01249
* Confidence: 0.9231

I think we should ignore this rule. Although its confidence is high, its support is very low (about 1.2% of the records), so there is not enough data to back it up.
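
To see how few records stand behind these two numbers, the printed floats can be turned back into ratios of counts. This is a sketch only: the counts are inferred from the printed values rather than read from the data, and the denominator bound of 1000 is an assumption about the size of the file.

```python
from fractions import Fraction

# Values printed by the cell above.
support = 0.012486992715920915
confidence = 0.9230769230769231

# Closest fractions with a denominator of at most 1000
# (assumption: the file has no more than 1000 rows).
support_ratio = Fraction(support).limit_denominator(1000)
confidence_ratio = Fraction(confidence).limit_denominator(1000)

print("support    =", support_ratio)
print("confidence =", confidence_ratio)
```

The fractions are in lowest terms, so they give the smallest counts consistent with the output: 12 matching records out of 961, and 12 benign cases among 13 records with `Age == 35`. One additional malignant case among so few records would already move the confidence noticeably.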

## Q5: Age $\geq$ n

In [141]:
df = data_from_csv.copy(deep=True)
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
mode = df['Age'].mode()[0]
# use mode to fill missing values
df['Age'] = df['Age'].fillna(mode)
# set n to the mean age, truncated to an integer
n = int(df['Age'].mean())
df['Age≥n'] = df['Age'].apply(lambda x: 1 if x >= n else 0)
df = df.drop(columns=['Age'])
data = df.apply(lambda row: [f"{col}={val}" for col, val in row.items()], axis=1).tolist()
te_ary = te.fit(data).transform(data)
df = pd.DataFrame(te_ary, columns=te.columns_)
# computing frequent itemsets and association rules
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)
filtered_rules = rules[
    (rules['consequents'].apply(lambda x: 'Severity=1' in x)) & 
    (rules['antecedents'].apply(lambda x: 'Age≥n=1' in x))
]
filtered_rules[['antecedents', 'consequents', 'support', 'confidence']]

Unnamed: 0,antecedents,consequents,support,confidence
6,"(Age≥n=1, BI-RADS=5)",(Severity=1),0.246618,0.908046
32,"(Density=3, BI-RADS=5, Age≥n=1)",(Severity=1),0.224766,0.919149
34,"(Shape=4, BI-RADS=5, Age≥n=1)",(Severity=1),0.190427,0.928934
65,"(Shape=4, Density=3, BI-RADS=5, Age≥n=1)",(Severity=1),0.176899,0.939227


`n=55`

| Rule(n=55)                                             | Support  | Confidence |
|--------------------------------------------------------|----------|------------|
| Age≥n=1, BI-RADS=5 $\rightarrow$ Severity=1            | 0.246618 | 0.908046   |
| Density=3, BI-RADS=5, Age≥n=1 $\rightarrow$ Severity=1 | 0.224766 | 0.919149   |
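
The preprocessing behind the `Age≥n` item (fill missing ages with the mode, set `n` to the truncated mean, then binarize) can be sketched on a small list. The ages below are hypothetical, and the standard-library `statistics` module stands in for the pandas calls used in the cell above.

```python
from statistics import mean, mode

# Hypothetical ages; None marks a missing value
# (NaN after pd.to_numeric(..., errors='coerce') in the notebook).
ages = [35, 62, None, 48, 62, 71, None, 55]

# Fill missing values with the most frequent known age.
known = [a for a in ages if a is not None]
fill_value = mode(known)
filled = [fill_value if a is None else a for a in ages]

# Threshold n: the mean of the filled ages, truncated to an integer.
n = int(mean(filled))

# Binary item: 1 if Age >= n, else 0.
age_ge_n = [1 if a >= n else 0 for a in filled]

print("n =", n)
print(age_ge_n)
```

Because missing ages are filled before the mean is taken, the choice of fill value shifts `n` slightly; here the filled mean is 57.125, so `n` is 57. When several ages tie for most frequent, `statistics.mode` returns the first one encountered, which may differ from the value pandas picks with `mode()[0]`.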