# Part 1: Online Retail Market Basket Analysis

The dataset contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a
UK-based and registered non-store online retailer. The company mainly sells unique all-occasion
gifts.

In [1]:
# Loading the Online Retail Dataset
import pandas as pd
pd.set_option("max_colwidth", 150)
f = "https://github.com/cs6220/cs6220.spring2019/raw/master/data/Online%20Retail.xlsx"
df = pd.read_excel(f)
basket = (df[df["Country"] == "United Kingdom"]
.groupby(["InvoiceNo", "Description"])["Quantity"]
.sum().unstack().reset_index().fillna(0)
.set_index("InvoiceNo")) # transform transactions into baskets of items
basket_sets = basket.applymap(lambda x: 1 if x >=1 else 0) # convert counts to booleans



## 1.1  Frequent Itemset Generation

Use the Apriori algorithm to generate frequent itemsets with up to 2 items per set. Explore different
support thresholds until you obtain at least 5 results for both 1- and 2-itemsets.

In [2]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [3]:
basket_sets.head()

Description,20713,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,...,wrongly coded 20713,wrongly coded 23343,wrongly coded-23343,wrongly marked,wrongly marked 23343,wrongly marked carton 22804,wrongly marked. 23343 in box,wrongly sold (22719) barcode,wrongly sold as sets,wrongly sold sets
InvoiceNo
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# reference: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/ mlxtend

# Top 5 1-itemsets with the highest support
from mlxtend.frequent_patterns import apriori

frequent_itemsets_1 = apriori(basket_sets, min_support=0.06, use_colnames=True, max_len = 1)
frequent_itemsets_1.sort_values(by="support", ascending=False).head(5)




Unnamed: 0,support,itemsets
5,0.098276,(WHITE HANGING HEART T-LIGHT HOLDER)
1,0.087931,(JUMBO BAG RED RETROSPOT)
4,0.076452,(REGENCY CAKESTAND 3 TIER)
3,0.072323,(PARTY BUNTING)
2,0.063158,(LUNCH BAG RED RETROSPOT)


In [None]:
# Top 5 2-itemsets with the highest support
frequent_itemsets_2 = apriori(basket_sets, min_support=0.02, use_colnames=True, max_len = 2)
frequent_itemsets_only2 = frequent_itemsets_2[frequent_itemsets_2["itemsets"].apply(lambda x: len(x) == 2)]
frequent_itemsets_only2.sort_values(by="support", ascending=False).head(5)



Unnamed: 0,support,itemsets
211,0.035617,"(JUMBO BAG PINK POLKADOT, JUMBO BAG RED RETROSPOT)"
207,0.031806,"(GREEN REGENCY TEACUP AND SAUCER, ROSES REGENCY TEACUP AND SAUCER )"
218,0.03167,"(JUMBO STORAGE BAG SUKI, JUMBO BAG RED RETROSPOT)"
217,0.029809,"(JUMBO SHOPPER VINTAGE RED PAISLEY, JUMBO BAG RED RETROSPOT)"
223,0.027541,"(LUNCH BAG BLACK SKULL., LUNCH BAG RED RETROSPOT)"


The highest support among the 1-itemsets is 0.098276, for (WHITE HANGING HEART T-LIGHT HOLDER).

The highest support among the 2-itemsets is 0.035617, for (JUMBO BAG PINK POLKADOT, JUMBO BAG RED RETROSPOT).
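
The support values reported above are the fraction of invoices that contain every item in the itemset. As a minimal sketch of that definition, the snippet below counts itemsets by brute force on a few made-up baskets. The function name `itemset_supports` and the toy baskets are hypothetical, and the counting is exhaustive rather than the pruned search that Apriori performs.

```python
from collections import Counter
from itertools import combinations


def itemset_supports(transactions, max_len=2):
    """Return {frozenset(items): support} for every itemset of up to max_len items.

    Support is the share of transactions that contain all items of the itemset.
    This is exhaustive counting for illustration, not Apriori's pruned search.
    """
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {itemset: count / n for itemset, count in counts.items()}


# Hypothetical toy baskets, not taken from the Online Retail data
toy_baskets = [
    {"tea cup", "saucer"},
    {"tea cup", "saucer", "gift bag"},
    {"gift bag"},
    {"tea cup"},
]

supports = itemset_supports(toy_baskets)
for itemset, support in sorted(supports.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

Filtering this dictionary by a minimum support plays the same role as the `min_support` argument in the cells above.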

## 1.2 Association Rule Generation

In [6]:

# reference: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ mlxtend
from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets_2, metric="confidence", min_threshold=0.7)
rules.sort_values(by="confidence", ascending=False).head(10)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
2,(PINK REGENCY TEACUP AND SAUCER),(GREEN REGENCY TEACUP AND SAUCER),0.031897,0.042377,0.02618,0.820768,19.368019,1.0,0.024828,5.342926,0.979615,0.54434,0.812837,0.719271
5,(PINK REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.031897,0.043421,0.024773,0.776671,17.886978,1.0,0.023388,4.28328,0.975199,0.490126,0.766534,0.673602
3,(GREEN REGENCY TEACUP AND SAUCER),(ROSES REGENCY TEACUP AND SAUCER ),0.042377,0.043421,0.031806,0.750535,17.285056,1.0,0.029966,3.834527,0.983839,0.589076,0.739212,0.741516
4,(ROSES REGENCY TEACUP AND SAUCER ),(GREEN REGENCY TEACUP AND SAUCER),0.043421,0.042377,0.031806,0.732497,17.285056,1.0,0.029966,3.579862,0.984912,0.589076,0.72066,0.741516
1,(GARDENERS KNEELING PAD CUP OF TEA ),(GARDENERS KNEELING PAD KEEP CALM ),0.034029,0.040744,0.024546,0.721333,17.703994,1.0,0.02316,3.442306,0.976754,0.488708,0.709497,0.661892
0,(CHARLOTTE BAG PINK POLKADOT),(RED RETROSPOT CHARLOTTE BAG),0.030581,0.041062,0.021733,0.710682,17.307671,1.0,0.020478,3.314484,0.971945,0.435455,0.698294,0.619982


Antecedent: PINK REGENCY TEACUP AND SAUCER, Consequent: GREEN REGENCY TEACUP AND SAUCER

Support ≈ 0.0262 (antecedent support ≈ 0.0319), confidence ≈ 0.82, lift ≈ 19.37

Since tea drinking is a strong UK tradition, buyers may purchase sets of coordinated cups.
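
The confidence and lift of this rule can be re-derived from the three support columns in the table. The sketch below uses the standard definitions (confidence = support(A and B) / support(A), lift = confidence / support(B)). The helper name `rule_metrics` is hypothetical, and the inputs are the rounded values printed in the table, so the results match only to about four decimal places.

```python
def rule_metrics(support_ab, support_a, support_b):
    """Return (confidence, lift) for the rule A -> B from the three supports."""
    confidence = support_ab / support_a  # P(B | A)
    lift = confidence / support_b        # P(B | A) / P(B)
    return confidence, lift


# PINK REGENCY TEACUP AND SAUCER -> GREEN REGENCY TEACUP AND SAUCER,
# values copied from the rules table above (already rounded there)
confidence, lift = rule_metrics(
    support_ab=0.02618,   # "support" column
    support_a=0.031897,   # "antecedent support" column
    support_b=0.042377,   # "consequent support" column
)
print(f"confidence={confidence:.4f}, lift={lift:.2f}")
```

A lift this far above 1 means the two cups appear together far more often than their individual popularity would suggest.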

# Part 2: Association Rule Mining U.S. Census Data

The dataset is an extraction from the 1994 U.S. Census.

In [7]:
# Loading dataset
import numpy as np
import pandas as pd
path = "https://raw.githubusercontent.com/cs6220/cs6220.spring2019/master/data/adult/"
names = pd.read_table(path + "adult.names", delimiter=None, header=None)
parse_cols = lambda x: x.str.split(":", expand=True).iloc[:, 0]
columns = np.roll(parse_cols(names.iloc[92:108, 0]), shift=-1)
df_adult = pd.read_csv(path + "adult.data", delimiter=None, header=None, index_col=False)
df_adult.columns = columns
df_adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,">50K, <=50K."
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## 2.1 Association Rule Mining

In [None]:
# reference https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html scikit learn

# Transforming categorical features into one hot encoded features
from sklearn.preprocessing import OneHotEncoder
# select categorical features
cat_features = df_adult.select_dtypes(include=["object"]).columns
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df_adult[cat_features])
encoded = enc.transform(df_adult[cat_features]).toarray()
encoded_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out(cat_features), index=df_adult.index)
encoded_df.head()



Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,...,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia,">50K, <=50K._ <=50K",">50K, <=50K._ >50K"
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
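
To make the encoding step concrete, here is a small pure-Python sketch of what one-hot encoding does to categorical rows. It is not the scikit-learn implementation. The helper `one_hot` and the toy rows are hypothetical. The sketch also strips whitespace from the values, on the assumption that the space in column names such as `workclass_ Private` comes from leading whitespace in the raw values.

```python
def one_hot(rows, columns):
    """One-hot encode dict rows into boolean features named '<column>_<value>'."""
    # Strip whitespace so ' Private' and 'Private' map to the same feature name.
    cleaned = [{c: str(row[c]).strip() for c in columns} for row in rows]
    features = sorted({f"{c}_{row[c]}" for row in cleaned for c in columns})
    encoded = []
    for row in cleaned:
        active = {f"{c}_{row[c]}" for c in columns}
        encoded.append({feature: feature in active for feature in features})
    return features, encoded


# Hypothetical rows that mimic the leading space seen in the raw values
toy_rows = [
    {"workclass": " Private", "sex": " Male"},
    {"workclass": " State-gov", "sex": " Female"},
]

features, encoded = one_hot(toy_rows, ["workclass", "sex"])
print(features)
print(encoded[0])
```

Each encoded row then behaves like a transaction whose "items" are the active features, which is the shape frequent-itemset mining expects.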


In [25]:
# Generate frequent itemsets
frequent_itemsets = apriori(encoded_df, min_support=0.10, use_colnames=True, max_len = 3)
frequent_itemsets.sort_values(by="support", ascending=False).head(10)



Unnamed: 0,support,itemsets
20,0.895857,(native-country_ United-States)
17,0.854274,(race_ White)
92,0.786862,"(race_ White, native-country_ United-States)"
21,0.75919,"(>50K, <=50K._ <=50K)"
0,0.69703,(workclass_ Private)
100,0.675624,"(>50K, <=50K._ <=50K, native-country_ United-States)"
19,0.669205,(sex_ Male)
93,0.635699,"(>50K, <=50K._ <=50K, race_ White)"
34,0.618378,"(workclass_ Private, native-country_ United-States)"
97,0.598507,"(sex_ Male, native-country_ United-States)"


In [39]:
# Association Rule Generation
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
rules.sort_values(by="confidence", ascending=False).iloc[60:70]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
306,"(race_ White, sex_ Female)",(native-country_ United-States),0.26541,0.895857,0.24471,0.922009,1.029192,1.0,0.006941,1.335317,0.038612,0.266988,0.251114,0.597583
290,"(relationship_ Not-in-family, sex_ Female)",(native-country_ United-States),0.119007,0.895857,0.109702,0.921806,1.028966,1.0,0.003088,1.331862,0.031953,0.121196,0.249171,0.522131
168,"(education_ Some-college, >50K, <=50K._ <=50K)",(native-country_ United-States),0.181321,0.895857,0.167132,0.921748,1.028901,1.0,0.004695,1.330866,0.03431,0.183653,0.24861,0.554155
39,(occupation_ Sales),(native-country_ United-States),0.112097,0.895857,0.103314,0.921644,1.028785,1.0,0.002891,1.329098,0.031512,0.114204,0.24761,0.518484
52,(race_ White),(native-country_ United-States),0.854274,0.895857,0.786862,0.921089,1.028165,1.0,0.021555,1.319746,0.187977,0.816866,0.242278,0.899711
235,"(marital-status_ Never-married, race_ White)",(native-country_ United-States),0.268941,0.895857,0.247658,0.920863,1.027913,1.0,0.006725,1.315989,0.037145,0.270033,0.240115,0.598656
312,"(race_ White, sex_ Male)",(native-country_ United-States),0.588864,0.895857,0.542152,0.920674,1.027702,1.0,0.014614,1.312845,0.065562,0.575185,0.238296,0.762925
252,"(occupation_ Craft-repair, native-country_ United-States)",(race_ White),0.113172,0.854274,0.104143,0.920217,1.077193,1.0,0.007463,1.826538,0.080806,0.120633,0.452516,0.521063
157,"(>50K, <=50K._ <=50K, education_ HS-grad)",(native-country_ United-States),0.27106,0.895857,0.249347,0.919896,1.026833,1.0,0.006516,1.300093,0.035849,0.271747,0.230825,0.599115
137,"(education_ Bachelors, native-country_ United-States)",(race_ White),0.146371,0.854274,0.134517,0.91901,1.075779,1.0,0.009476,1.799307,0.08252,0.155308,0.44423,0.538236



1: In the U.S. during this time, about 95% of “Craft-repair” workers were male.
antecedents: (occupation_ Craft-repair, native-country_ United-States)
consequents: (sex_ Male)
antecedent support: 0.113172
consequent support: 0.669205
support: 0.107153
confidence: 0.946811
lift: 1.414829

2: In the U.S. during this time, “Craft-repair” workers were 92% White
antecedents: (occupation_ Craft-repair, native-country_ United-States)
consequents: (race_ White)
antecedent support: 0.113172
consequent support: 0.854274
support: 0.104143
confidence: 0.920217
lift: 1.077193

3: In the U.S. during this time, people with a Bachelor’s degree were 92% White
antecedents: (education_Bachelors, native-country_United-States)
consequents: (race_White)
antecedent support: 0.146371
consequent support: 0.854274
support: 0.134517
confidence: 0.919010
lift: 1.075779

4: In the U.S. during this time, people who were never married and female were 91% U.S.-born
antecedents: (marital-status_Never-married, sex_Female)
consequents: (native-country_United-States)
antecedent support: 0.146402
consequent support: 0.895857
support: 0.132920
confidence: 0.907909
lift: 1.013453

5: In the U.S. during this time, people working in Sales were 92% U.S.-born
antecedents: (occupation_Sales)
consequents: (native-country_United-States)
antecedent support: 0.112097
consequent support: 0.895857
support: 0.103314
confidence: 0.921644
lift: 1.028785

In [57]:
frequent_itemsets = apriori(encoded_df, min_support=0.10, use_colnames=True, max_len = 3)
frequent_itemsets.sort_values(by="support", ascending=False).head(10)

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules.sort_values(by="confidence", ascending=False).iloc[70:80]



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
35,"(workclass_ Private, >50K, <=50K._ >50K)",(marital-status_ Married-civ-spouse),0.152422,0.459937,0.129603,0.850292,1.848715,1.0,0.059499,3.607448,0.541642,0.268465,0.722796,0.566038
24,"(>50K, <=50K._ >50K)",(sex_ Male),0.24081,0.669205,0.204601,0.849637,1.26962,1.0,0.04345,2.199966,0.279722,0.290043,0.545447,0.577687
85,(marital-status_ Divorced),"(>50K, <=50K._ <=50K, native-country_ United-States)",0.136452,0.675624,0.114462,0.838848,1.241589,1.0,0.022272,2.012851,0.225327,0.164077,0.503192,0.504132
237,(relationship_ Own-child),"(>50K, <=50K._ <=50K, race_ White)",0.155646,0.635699,0.128866,0.82794,1.302409,1.0,0.029922,2.11729,0.274994,0.19452,0.527698,0.515328
153,(relationship_ Own-child),"(marital-status_ Never-married, native-country_ United-States)",0.155646,0.294186,0.127852,0.821429,2.792205,1.0,0.082063,3.952557,0.760179,0.397081,0.746999,0.628013
233,(relationship_ Not-in-family),"(>50K, <=50K._ <=50K, native-country_ United-States)",0.25506,0.675624,0.206843,0.810957,1.200308,1.0,0.034518,1.715886,0.224019,0.285757,0.417211,0.558554
112,(marital-status_ Married-civ-spouse),"(race_ White, sex_ Male)",0.459937,0.588864,0.369645,0.803686,1.364807,1.0,0.098804,2.094277,0.494934,0.544271,0.522508,0.715705
120,(marital-status_ Married-civ-spouse),"(sex_ Male, native-country_ United-States)",0.459937,0.598507,0.366911,0.797743,1.332888,1.0,0.091636,1.985062,0.462444,0.530577,0.496237,0.705393
88,(marital-status_ Married-civ-spouse),"(relationship_ Husband, race_ White)",0.459937,0.366696,0.36642,0.796675,2.172573,1.0,0.197763,3.114731,0.999358,0.796196,0.678945,0.89796
98,(marital-status_ Married-civ-spouse),"(relationship_ Husband, native-country_ United-States)",0.459937,0.36427,0.363994,0.7914,2.172562,1.0,0.196453,3.047596,0.999353,0.790924,0.671873,0.89532


1: People with income ≤50K were about 31.8% likely to be never married and working in the private sector
antecedents: (>50K, <=50K._ <=50K)
consequents: (marital-status_Never-married, workclass_Private)
antecedent support: 0.759190
consequent support: 0.251405
support: 0.241332
confidence: 0.317880
lift: 1.264415

2: Women who are never married have about a 96.5% chance of earning ≤50K
antecedents: (marital-status_Never-married, sex_Female)
consequents: (>50K, <=50K._ <=50K)
antecedent support: 0.146402
consequent support: 0.759190
support: 0.141304
confidence: 0.965177
lift: 1.271324

3: Among people who are married (civil spouse) and have a high school diploma, about 88% are labeled as “husband”
antecedents: (marital-status_Married-civ-spouse, education_HS-grad)
consequents: (relationship_Husband)
antecedent support: 0.148798
consequent support: 0.405178
support: 0.131261
confidence: 0.882147
lift: 2.177183

4: People whose relationship is “own child” have about an 82.8% chance of being low income (≤50K) and White
antecedents: (relationship_Own-child)
consequents: (>50K, <=50K._<=50K, race_White)
antecedent support: 0.155646
consequent support: 0.635699
support: 0.128866
confidence: 0.827940
lift: 1.302409

5: People whose relationship is “Not-in-family” have about an 81% chance of being low income (≤50K) and U.S.-born
antecedents: (relationship_Not-in-family)
consequents: (>50K, <=50K._<=50K, native-country_United-States)
antecedent support: 0.255060
consequent support: 0.675624
support: 0.206843
confidence: 0.810957
lift: 1.200308

In [61]:
# support level 1 
min_support = 0.10
freq_10 = apriori(encoded_df.astype(bool), min_support=min_support, use_colnames=True, max_len=3)

rules_conf_10 = association_rules(freq_10, metric="confidence", min_threshold=0.7)
rules_lift_10 = association_rules(freq_10, metric="lift", min_threshold=1.2)

# top 10 rules by confidence
top_conf_10 = rules_conf_10.sort_values('confidence', ascending=False)
top_conf_10.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
276,"(relationship_ Husband, >50K, <=50K._ >50K)",(sex_ Male),0.181751,0.669205,0.181751,1.0,1.494309,1.0,0.060122,inf,0.404271,0.271592,1.0,0.635796
41,(relationship_ Husband),(sex_ Male),0.405178,0.669205,0.405147,0.999924,1.494196,1.0,0.134,4364.171954,0.556038,0.605388,0.999771,0.80267
185,"(marital-status_ Married-civ-spouse, relationship_ Husband)",(sex_ Male),0.404902,0.669205,0.404871,0.999924,1.494196,1.0,0.133909,4361.194804,0.55578,0.604975,0.999771,0.802463
263,"(relationship_ Husband, race_ White)",(sex_ Male),0.366696,0.669205,0.366666,0.999916,1.494184,1.0,0.12127,3949.686435,0.522243,0.547887,0.999747,0.773914
273,"(relationship_ Husband, native-country_ United-States)",(sex_ Male),0.36427,0.669205,0.364239,0.999916,1.494183,1.0,0.120468,3923.553669,0.520249,0.544261,0.999745,0.772101
95,"(workclass_ Private, relationship_ Husband)",(sex_ Male),0.26326,0.669205,0.263229,0.999883,1.494135,1.0,0.087054,2835.570529,0.448891,0.393328,0.999647,0.696614
275,"(relationship_ Husband, >50K, <=50K._ <=50K)",(sex_ Male),0.223427,0.669205,0.223396,0.999863,1.494104,1.0,0.073878,2406.530051,0.425848,0.333808,0.999584,0.666843
146,"(relationship_ Husband, education_ HS-grad)",(sex_ Male),0.131415,0.669205,0.131384,0.999766,1.49396,1.0,0.043441,1415.469703,0.380663,0.19632,0.999294,0.598047
198,"(relationship_ Husband, >50K, <=50K._ >50K)",(marital-status_ Married-civ-spouse),0.181751,0.459937,0.181628,0.999324,2.172743,1.0,0.098034,799.023602,0.659643,0.394793,0.998748,0.697111
21,(relationship_ Husband),(marital-status_ Married-civ-spouse),0.405178,0.459937,0.404902,0.999318,2.172729,1.0,0.218545,791.672741,0.907413,0.879813,0.998737,0.93983


In [63]:
# top 10 rules by lift
top_lift_10 = rules_lift_10.sort_values('lift', ascending=False)
top_lift_10.head(10)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
159,(relationship_ Own-child),"(marital-status_ Never-married, >50K, <=50K._ <=50K)",0.155646,0.313012,0.136697,0.878256,2.805817,1.0,0.087978,5.642873,0.762237,0.411786,0.822785,0.657485
154,"(marital-status_ Never-married, >50K, <=50K._ <=50K)",(relationship_ Own-child),0.313012,0.155646,0.136697,0.436715,2.805817,1.0,0.087978,1.498981,0.93684,0.411786,0.33288,0.657485
150,"(marital-status_ Never-married, native-country_ United-States)",(relationship_ Own-child),0.294186,0.155646,0.127852,0.434597,2.792205,1.0,0.082063,1.493365,0.90939,0.397081,0.330371,0.628013
153,(relationship_ Own-child),"(marital-status_ Never-married, native-country_ United-States)",0.155646,0.294186,0.127852,0.821429,2.792205,1.0,0.082063,3.952557,0.760179,0.397081,0.746999,0.628013
146,"(marital-status_ Never-married, race_ White)",(relationship_ Own-child),0.268941,0.155646,0.116121,0.431769,2.774038,1.0,0.074261,1.485934,0.874779,0.376444,0.327022,0.588911
149,(relationship_ Own-child),"(marital-status_ Never-married, race_ White)",0.155646,0.268941,0.116121,0.746054,2.774038,1.0,0.074261,2.878792,0.757401,0.376444,0.652632,0.588911
157,(marital-status_ Never-married),"(>50K, <=50K._ <=50K, relationship_ Own-child)",0.328092,0.153589,0.136697,0.416643,2.712722,1.0,0.086306,1.450933,0.939662,0.396243,0.310788,0.653333
156,"(>50K, <=50K._ <=50K, relationship_ Own-child)",(marital-status_ Never-married),0.153589,0.328092,0.136697,0.890022,2.712722,1.0,0.086306,6.109477,0.745933,0.396243,0.83632,0.653333
147,"(race_ White, relationship_ Own-child)",(marital-status_ Never-married),0.130678,0.328092,0.116121,0.888602,2.708393,1.0,0.073246,6.03158,0.725597,0.33889,0.834206,0.621264
148,(marital-status_ Never-married),"(race_ White, relationship_ Own-child)",0.328092,0.130678,0.116121,0.353927,2.708393,1.0,0.073246,1.345548,0.938785,0.33889,0.256808,0.621264


In [64]:
# support level 2
min_support = 0.05
freq_05 = apriori(encoded_df.astype(bool), min_support=min_support, use_colnames=True, max_len=3)

rules_conf_05 = association_rules(freq_05, metric="confidence", min_threshold=0.7)
rules_lift_05 = association_rules(freq_05, metric="lift", min_threshold=1.2)

top_conf_05 = rules_conf_05.sort_values('confidence', ascending=False)
top_conf_05.head(10)


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
540,"(relationship_ Husband, >50K, <=50K._ >50K)",(sex_ Male),0.181751,0.669205,0.181751,1.0,1.494309,1.0,0.060122,inf,0.404271,0.271592,1.0,0.635796
327,"(education_ Some-college, relationship_ Husband)",(sex_ Male),0.07598,0.669205,0.07598,1.0,1.494309,1.0,0.025134,inf,0.357995,0.113538,1.0,0.556769
368,"(relationship_ Husband, occupation_ Exec-managerial)",(marital-status_ Married-civ-spouse),0.067166,0.459937,0.067166,1.0,2.174212,1.0,0.036274,inf,0.578949,0.146034,1.0,0.573017
100,"(workclass_ ?, native-country_ United-States)",(occupation_ ?),0.050951,0.056601,0.050951,1.0,17.66739,1.0,0.048067,inf,0.994046,0.900163,1.0,0.950081
464,"(occupation_ Craft-repair, relationship_ Husband)",(sex_ Male),0.077117,0.669205,0.077117,1.0,1.494309,1.0,0.02551,inf,0.358436,0.115236,1.0,0.557618
272,"(relationship_ Husband, education_ Bachelors)",(sex_ Male),0.074721,0.669205,0.074721,1.0,1.494309,1.0,0.024717,inf,0.357508,0.111657,1.0,0.555828
1,(workclass_ ?),(occupation_ ?),0.056386,0.056601,0.056386,1.0,17.66739,1.0,0.053195,inf,0.999772,0.996202,1.0,0.998101
105,"(>50K, <=50K._ <=50K, workclass_ ?)",(occupation_ ?),0.050521,0.056601,0.050521,1.0,17.66739,1.0,0.047661,inf,0.993596,0.892566,1.0,0.946283
483,"(relationship_ Husband, occupation_ Exec-managerial)",(sex_ Male),0.067166,0.669205,0.067166,1.0,1.494309,1.0,0.022218,inf,0.354612,0.100367,1.0,0.550184
508,"(occupation_ Prof-specialty, relationship_ Husband)",(sex_ Male),0.055404,0.669205,0.055404,1.0,1.494309,1.0,0.018327,inf,0.350197,0.08279,1.0,0.541395


In [65]:
top_lift_05 = rules_lift_05.sort_values('lift', ascending=False)
top_lift_05.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
80,(occupation_ ?),"(>50K, <=50K._ <=50K, workclass_ ?)",0.056601,0.050521,0.050521,0.892566,17.66739,1.0,0.047661,8.837831,1.0,0.892566,0.88685,0.946283
0,(occupation_ ?),(workclass_ ?),0.056601,0.056386,0.056386,0.996202,17.66739,1.0,0.053195,248.439961,1.0,0.996202,0.995975,0.998101
1,(workclass_ ?),(occupation_ ?),0.056386,0.056601,0.056386,1.0,17.66739,1.0,0.053195,inf,0.999772,0.996202,1.0,0.998101
79,"(>50K, <=50K._ <=50K, workclass_ ?)",(occupation_ ?),0.050521,0.056601,0.050521,1.0,17.66739,1.0,0.047661,inf,0.993596,0.892566,1.0,0.946283
76,(occupation_ ?),"(workclass_ ?, native-country_ United-States)",0.056601,0.050951,0.050951,0.900163,17.66739,1.0,0.048067,9.505968,1.0,0.900163,0.894803,0.950081
75,"(workclass_ ?, native-country_ United-States)",(occupation_ ?),0.050951,0.056601,0.050951,1.0,17.66739,1.0,0.048067,inf,0.994046,0.900163,1.0,0.950081
77,(workclass_ ?),"(occupation_ ?, native-country_ United-States)",0.056386,0.051166,0.050951,0.903595,17.660234,1.0,0.048065,9.842148,0.999748,0.900163,0.898396,0.949697
74,"(occupation_ ?, native-country_ United-States)",(workclass_ ?),0.051166,0.056386,0.050951,0.995798,17.660234,1.0,0.048065,224.580019,0.994247,0.900163,0.995547,0.949697
81,(workclass_ ?),"(>50K, <=50K._ <=50K, occupation_ ?)",0.056386,0.050736,0.050521,0.895969,17.659602,1.0,0.04766,9.124867,0.999746,0.892566,0.890409,0.945866
78,"(>50K, <=50K._ <=50K, occupation_ ?)",(workclass_ ?),0.050736,0.056386,0.050521,0.995763,17.659602,1.0,0.04766,222.692792,0.993794,0.892566,0.99551,0.945866


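Confidence alone can be misleading because it is inflated when the consequent is already common. For example, (race_ White) → (native-country_ United-States) has confidence ≈ 0.92, but the consequent's own support is ≈ 0.90, so the lift is only ≈ 1.03.

Lift measures how much more often the antecedent and consequent occur together than would be expected if they were independent, so it is better suited to discovering interesting relationships.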
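
The contrast between the two metrics can be illustrated with two rules taken from the tables above. The sketch below ranks them once by confidence and once by lift. The list `example_rules` and the helper `lift_of` are hypothetical names, and the numbers are the rounded values printed in the outputs.

```python
# (rule label, confidence, consequent support), copied from the tables above
example_rules = [
    ("(race_ White) -> (native-country_ United-States)", 0.921089, 0.895857),
    ("(relationship_ Own-child) -> (marital-status_ Never-married, "
     "native-country_ United-States)", 0.821429, 0.294186),
]


def lift_of(confidence, consequent_support):
    """Lift = confidence divided by the consequent's baseline support."""
    return confidence / consequent_support


by_confidence = sorted(example_rules, key=lambda r: r[1], reverse=True)
by_lift = sorted(example_rules, key=lambda r: lift_of(r[1], r[2]), reverse=True)

for label, conf, cons in by_lift:
    print(f"{label}: confidence={conf:.3f}, lift={lift_of(conf, cons):.3f}")
```

The first rule wins on confidence only because most records are already from the United States. The second rule has lower confidence but a much higher lift, which is why it ranks first once the baseline is taken into account.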