# Frequent Itemset Mining: Apriori Alternatives

In this notebook, we apply the **Apriori**, **FP-Growth**, and **maximal frequent itemset** (FP-Max) methods to the congressional voting records dataset. You can learn more about the dataset here: https://archive.ics.uci.edu/ml/datasets/congressional+voting+records

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth, fpmax
import matplotlib.pyplot as plt
import os
import sys
%matplotlib inline

### T1: Data Loading

The data is located here: `/dsa/data/DSA-8410/association-mining/house-vote/house-votes-84.csv`


In [2]:
df = pd.read_csv("/dsa/data/DSA-8410/association-mining/house-vote/house-votes-84.csv")
df.head(5)

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


### T2: Show the number of transactions

In [3]:
print(f"Num of transactions = {df.shape[0]}")
print(f"Maximum num of items per transaction = {df.shape[1]}")

Num of transactions = 435
Maximum num of items per transaction = 17


In [4]:
set(df.values.flatten())

{'?', 'democrat', 'n', 'republican', 'y'}
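
The unknown votes are stored as the literal string `'?'`, so pandas does not treat them as missing values. A minimal sketch, assuming a small made-up CSV in place of the real file, of how the `na_values` argument of `pd.read_csv` could mark them as NaN at load time:

```python
import io
import pandas as pd

# Hypothetical three-row sample mimicking the layout of the voting data.
sample_csv = io.StringIO(
    "Class Name,handicapped-infants,immigration\n"
    "republican,n,y\n"
    "democrat,?,n\n"
    "democrat,y,?\n"
)

# na_values="?" tells pandas to read '?' as NaN instead of a string.
sample = pd.read_csv(sample_csv, na_values="?")

# Count the unknown votes per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Whether to drop unknown votes or keep `'?'` as an item of its own is a modeling choice; the cells in this notebook keep it.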

### T3: Transform the dataset to a binary incidence matrix for applying itemset mining methods

In [5]:
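from sklearn.preprocessing import MultiLabelBinarizer

# Collect each row (one member of Congress) as an array of values.
# dropna() is a no-op here: unknown votes are the string '?', not NaN.
trans_data = [row.dropna().values for _, row in df.iterrows()]

mlb = MultiLabelBinarizer()

# Rebuild a frame with positional column labels; column 0 is the party.
df = pd.DataFrame(trans_data)

# Prefix every vote with the member's party, e.g. 'republican_n'.
# The bill names are discarded, so only six distinct items remain.
for column in df.columns[1:]:
    df[column] = df[0] + "_" + df[column]

df = df.drop(0, axis=1)

trans_data = df.values.tolist()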

In [6]:
data = mlb.fit_transform(trans_data)
mlb.classes_

array(['democrat_?', 'democrat_n', 'democrat_y', 'republican_?',
       'republican_n', 'republican_y'], dtype=object)

In [7]:
trans_data_enc = pd.DataFrame(data, columns=mlb.classes_)
print(trans_data_enc)

     democrat_?  democrat_n  democrat_y  republican_?  republican_n  \
0             0           0           0             1             1   
1             0           0           0             1             1   
2             1           1           1             0             0   
3             1           1           1             0             0   
4             1           1           1             0             0   
..          ...         ...         ...           ...           ...   
430           0           0           0             0             1   
431           0           1           1             0             0   
432           0           0           0             1             1   
433           0           0           0             1             1   
434           0           0           0             1             1   

     republican_y  
0               1  
1               1  
2               0  
3               0  
4               0  
..            ...  
430    
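
Task T6 refers to items such as `Class Name_democrat`, which suggests items named after a column and its value. A minimal sketch of that kind of encoding, assuming a small made-up frame and using `pd.get_dummies` (this is an alternative to the party-prefixed items built in the cells above, not a reproduction of them):

```python
import pandas as pd

# Hypothetical sample with the same kind of columns as the voting data.
sample = pd.DataFrame({
    "Class Name": ["republican", "democrat", "democrat"],
    "physician-fee-freeze": ["y", "n", "n"],
    "immigration": ["y", "n", "y"],
})

# One column per (column name, value) pair, e.g. 'immigration_y'.
# astype(bool) yields a True/False incidence matrix.
incidence = pd.get_dummies(sample, prefix_sep="_").astype(bool)

print(incidence.columns.tolist())
```

With this layout each bill keeps its identity, so a rule can relate a specific vote to a party label.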

### T4: Identify Frequent Patterns with the FP-Growth Method. Use min_support = 0.3. Show the number of itemsets per itemset length.

In [8]:
frequent_itemsets = fpgrowth(trans_data_enc, min_support=0.3, use_colnames=True)

In [9]:
print(frequent_itemsets)

    support                              itemsets
0  0.383908                        (republican_y)
1  0.383908                        (republican_n)
2  0.613793                          (democrat_y)
3  0.611494                          (democrat_n)
4  0.328736                          (democrat_?)
5  0.383908          (republican_n, republican_y)
6  0.611494              (democrat_y, democrat_n)
7  0.328736              (democrat_y, democrat_?)
8  0.326437              (democrat_?, democrat_n)
9  0.326437  (democrat_y, democrat_?, democrat_n)


In [10]:
frequent_itemsets['itemset_length'] = frequent_itemsets['itemsets'].apply(len)
itemsets_per_length = frequent_itemsets.groupby('itemset_length').size()

In [11]:
print(itemsets_per_length)

itemset_length
1    5
2    4
3    1
dtype: int64
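
Support is the fraction of transactions that contain an itemset. The brute-force sketch below, on made-up transactions, enumerates every candidate itemset and counts the frequent ones per length. FP-Growth finds the same frequent itemsets without enumerating all candidates.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; each set is one transaction.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
min_support = 0.5
items = sorted(set().union(*transactions))

frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        # Support = fraction of transactions containing every item.
        hits = sum(1 for t in transactions if set(candidate) <= t)
        support = hits / len(transactions)
        if support >= min_support:
            frequent[frozenset(candidate)] = support

# Number of frequent itemsets per itemset length.
per_length = Counter(len(itemset) for itemset in frequent)
print(dict(per_length))
```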


### T5: Generate Association Rules from Frequent Itemsets with min 90% confidence.

* Show the total number of rules

In [12]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.9)
print(rules)

                antecedents               consequents  antecedent support  \
0            (republican_n)            (republican_y)            0.383908   
1            (republican_y)            (republican_n)            0.383908   
2              (democrat_y)              (democrat_n)            0.613793   
3              (democrat_n)              (democrat_y)            0.611494   
4              (democrat_?)              (democrat_y)            0.328736   
5              (democrat_?)              (democrat_n)            0.328736   
6  (democrat_y, democrat_?)              (democrat_n)            0.328736   
7  (democrat_?, democrat_n)              (democrat_y)            0.326437   
8              (democrat_?)  (democrat_y, democrat_n)            0.328736   

   consequent support   support  confidence      lift  leverage  conviction  
0            0.383908  0.383908    1.000000  2.604790  0.236523         inf  
1            0.383908  0.383908    1.000000  2.604790  0.236523         inf  

In [13]:
rules.sort_values(by=['conviction'], ascending=False).head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(republican_n),(republican_y),0.383908,0.383908,0.383908,1.0,2.60479,0.236523,inf
1,(republican_y),(republican_n),0.383908,0.383908,0.383908,1.0,2.60479,0.236523,inf
3,(democrat_n),(democrat_y),0.611494,0.613793,0.611494,1.0,1.629213,0.236163,inf
4,(democrat_?),(democrat_y),0.328736,0.613793,0.328736,1.0,1.629213,0.12696,inf
7,"(democrat_?, democrat_n)",(democrat_y),0.326437,0.613793,0.326437,1.0,1.629213,0.126072,inf
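
The metric columns follow directly from the three support values. The sketch below recomputes them for rule 0 using the standard definitions; `rule_metrics` is a hypothetical helper written for this illustration, not an mlxtend function. It also shows why conviction is `inf` whenever confidence equals 1.

```python
import math

def rule_metrics(support_a, support_c, support_ac):
    """Return (confidence, lift, leverage, conviction) for A -> C."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction divides by (1 - confidence); report inf at confidence 1,
    # matching the `inf` shown in the rules table.
    conviction = math.inf if confidence == 1 else (1 - support_c) / (1 - confidence)
    return confidence, lift, leverage, conviction

# Rule 0 above: (republican_n) -> (republican_y), supports as printed.
conf, lift, lev, conv = rule_metrics(0.383908, 0.383908, 0.383908)
print(conf, round(lift, 5), round(lev, 6), conv)
```

Because the inputs are the rounded values printed in the table, the recomputed metrics agree with the table only up to rounding.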


### T6: Identify the 5 highest-confidence rules whose `consequents` contain only `Class Name_democrat`. Similarly, find the 5 highest-confidence rules whose `consequents` contain only `Class Name_republican`.

* Iterate over these two subsets of rules and print only antecedents, consequents, and confidence.
* Based on these rules, characterize democrat and republican congress members

In [19]:
rules_democrat = rules[rules['consequents'].apply(lambda x: 'democrat_y' in x)]

print(rules["consequents"])

rules_democrat_top5 = rules_democrat.nlargest(5, 'confidence')[['antecedents', 'consequents', 'confidence']]
print(rules_democrat)

0              (republican_y)
1              (republican_n)
2                (democrat_n)
3                (democrat_y)
4                (democrat_y)
5                (democrat_n)
6                (democrat_n)
7                (democrat_y)
8    (democrat_y, democrat_n)
Name: consequents, dtype: object
                antecedents               consequents  antecedent support  \
3              (democrat_n)              (democrat_y)            0.611494   
4              (democrat_?)              (democrat_y)            0.328736   
7  (democrat_?, democrat_n)              (democrat_y)            0.326437   
8              (democrat_?)  (democrat_y, democrat_n)            0.328736   

   consequent support   support  confidence      lift  leverage  conviction  
3            0.613793  0.611494    1.000000  1.629213  0.236163         inf  
4            0.613793  0.328736    1.000000  1.629213  0.126960         inf  
7            0.613793  0.326437    1.000000  1.629213  0.126072         inf  

In [20]:
# Mirror the democrat filter above. `in` on a frozenset matches whole items,
# so the bare string 'republican' would never match 'republican_y'.
rules_republican = rules[rules['consequents'].apply(lambda x: 'republican_y' in x)]
rules_republican_top5 = rules_republican.nlargest(5, 'confidence')[['antecedents', 'consequents', 'confidence']]

In [21]:
print("Top 5 rules with consequents as democrat:")
print(rules_democrat_top5)

Top 5 rules with consequents as democrat:
                antecedents               consequents  confidence
3              (democrat_n)              (democrat_y)    1.000000
4              (democrat_?)              (democrat_y)    1.000000
7  (democrat_?, democrat_n)              (democrat_y)    1.000000
8              (democrat_?)  (democrat_y, democrat_n)    0.993007


In [22]:
print("\nTop 5 rules with consequents as republican:")
print(rules_republican_top5)


Top 5 rules with consequents as republican:
      antecedents     consequents  confidence
0  (republican_n)  (republican_y)         1.0
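
mlxtend stores `antecedents` and `consequents` as frozensets, so `'x' in consequents` tests for a whole item rather than a substring. The sketch below, with made-up consequents, contrasts exact membership, a substring match, and the stricter "consequent is only this item" check that T6 asks for.

```python
# Hypothetical consequents, shaped like the values in a rules table.
consequents = [
    frozenset({"Class Name_democrat"}),
    frozenset({"Class Name_democrat", "crime_n"}),
    frozenset({"Class Name_republican"}),
]
target = "Class Name_democrat"

# Exact membership: the full item string must be in the set.
contains_target = [target in c for c in consequents]

# A partial string is never a member, even if it is a substring of an item.
contains_partial = ["democrat" in c for c in consequents]

# A substring match needs an explicit loop over the items.
any_substring = [any("democrat" in item for item in c) for c in consequents]

# "Only" the target: the consequent is exactly that single item.
only_target = [c == frozenset({target}) for c in consequents]

print(contains_target, contains_partial, any_substring, only_target)
```

In a rules table, the same predicates can be passed to `rules['consequents'].apply(...)` to build the row mask.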


### T7. Show the number of maximal frequent itemsets for min support = 0.3 

In [23]:
max_patterns = fpmax(trans_data_enc, min_support=0.3, use_colnames=True)

In [24]:
# fpmax returns itemsets as frozensets, so take len() directly; an isinstance
# check against (set, list) rejects frozenset and would yield 0.
max_patterns['length'] = max_patterns['itemsets'].apply(len)

In [25]:
print(max_patterns)
# Sort by itemset length for a stable display order.
length_counts = max_patterns['length'].value_counts().sort_index()
print(length_counts)
print(f"Total number of maximal frequent patterns = {max_patterns.shape[0]}")

    support                              itemsets  length
0  0.326437  (democrat_y, democrat_?, democrat_n)       3
1  0.383908          (republican_n, republican_y)       2
2    1
3    1
Name: length, dtype: int64
Total number of maximal frequent patterns = 2
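
A frequent itemset is maximal when none of its proper supersets is frequent. As a cross-check on `fpmax`, this sketch filters the ten frequent itemsets printed in T4 by that definition. It uses a quadratic scan, which is fine at this size and is not how `fpmax` works internally.

```python
# Frequent itemsets as printed by fpgrowth in T4.
frequent = [
    frozenset({"republican_y"}),
    frozenset({"republican_n"}),
    frozenset({"democrat_y"}),
    frozenset({"democrat_n"}),
    frozenset({"democrat_?"}),
    frozenset({"republican_n", "republican_y"}),
    frozenset({"democrat_y", "democrat_n"}),
    frozenset({"democrat_y", "democrat_?"}),
    frozenset({"democrat_?", "democrat_n"}),
    frozenset({"democrat_y", "democrat_?", "democrat_n"}),
]

# Keep an itemset only if no other frequent itemset strictly contains it.
maximal = [s for s in frequent if not any(s < other for other in frequent)]

for itemset in maximal:
    print(sorted(itemset))
```

The filter keeps the same two itemsets that `fpmax` reports.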


# Save your notebook, then `File > Close and Halt`