Checkpoint Objective

Association Rules

Let's say you are a machine learning engineer working for a clothing company and you want to adopt new strategies to improve the company's profit.

Use the dataset below and association rule mining to find new marketing plans.

Note that one of the strategies can be based on which items should be placed together.

dataset = [['Skirt', 'Sneakers', 'Scarf', 'Pants', 'Hat'],
           ['Sunglasses', 'Skirt', 'Sneakers', 'Pants', 'Hat'],
           ['Dress', 'Sandals', 'Scarf', 'Pants', 'Heels'],
           ['Dress', 'Necklace', 'Earrings', 'Scarf', 'Hat', 'Heels', 'Hat'],
           ['Earrings', 'Skirt', 'Skirt', 'Scarf', 'Shirt', 'Pants']]

Bonus: Try to do some visualization before applying the Apriori algorithm.
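
One way to approach the bonus, as a minimal sketch in plain Python: count in how many of the five transactions each item appears and print a text bar chart. The names `clothing` and `item_counts` are illustrative and do not come from the notebook. Repeated items inside one transaction (such as the second 'Hat') are counted once, on the assumption that a basket either contains an item or does not.

```python
from collections import Counter

# The clothing transactions from the task statement
clothing = [
    ['Skirt', 'Sneakers', 'Scarf', 'Pants', 'Hat'],
    ['Sunglasses', 'Skirt', 'Sneakers', 'Pants', 'Hat'],
    ['Dress', 'Sandals', 'Scarf', 'Pants', 'Heels'],
    ['Dress', 'Necklace', 'Earrings', 'Scarf', 'Hat', 'Heels', 'Hat'],
    ['Earrings', 'Skirt', 'Skirt', 'Scarf', 'Shirt', 'Pants'],
]

# Count each item once per transaction (set() removes in-basket duplicates)
item_counts = Counter(item for basket in clothing for item in set(basket))

# Text bar chart, most frequent items first, with the support of each item
for item, count in sorted(item_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    support = count / len(clothing)
    print(f"{item:<11}{'#' * count:<5}{support:.1f}")
```

Items near the top of the chart are the ones most likely to survive a minimum-support filter in the Apriori step.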

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

%matplotlib inline

In [2]:
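# Note: with the default header handling, the first line of the file
# (shrimp, almonds, ...) becomes the column names instead of a transaction;
# passing header=None would keep it as data.
data = pd.read_csv('Market_Basket_Optimisation.csv')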

In [3]:
data.head()

,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
0,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
1,chutney,,,,,,,,,,,,,,,,,,,
2,turkey,avocado,,,,,,,,,,,,,,,,,,
3,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
4,low fat yogurt,,,,,,,,,,,,,,,,,,,


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   shrimp             7500 non-null   object 
 1   almonds            5746 non-null   object 
 2   avocado            4388 non-null   object 
 3   vegetables mix     3344 non-null   object 
 4   green grapes       2528 non-null   object 
 5   whole weat flour   1863 non-null   object 
 6   yams               1368 non-null   object 
 7   cottage cheese     980 non-null    object 
 8   energy drink       653 non-null    object 
 9   tomato juice       394 non-null    object 
 10  low fat yogurt     255 non-null    object 
 11  green tea          153 non-null    object 
 12  honey              86 non-null     object 
 13  salad              46 non-null     object 
 14  mineral water      24 non-null     object 
 15  salmon             7 non-null      object 
 16  antioxydant juice  3 non

In [5]:
# Collect every distinct value in the table; this also picks up NaN from the empty cells
items = set()
for col in data:
    items.update(data[col].unique())

In [6]:
# One-hot encode each transaction: 1 if the item is in the row, 0 otherwise
itemset = set(items)
encoded_vals = []
for index, row in data.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset - rowset)            # items absent from this transaction
    commons = list(itemset.intersection(rowset))  # items present in this transaction
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)
df = pd.DataFrame(encoded_vals)
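
The loop above also creates columns for the missing-value marker, because NaN is a member of `items`. As a minimal sketch of the same one-hot idea in plain Python, the version below filters missing values out before encoding. The sample rows are the first three rows of `data.head()` with the NaN padding shortened, and the helper names `is_missing` and `one_hot` are hypothetical.

```python
import math

def is_missing(value):
    """True for the float NaN that marks an empty cell."""
    return isinstance(value, float) and math.isnan(value)

def one_hot(rows):
    """Return (columns, matrix) with one 0/1 row per transaction."""
    # Drop missing values and duplicates from every row
    baskets = [{v for v in row if not is_missing(v)} for row in rows]
    # Sorted union of all items gives a stable column order
    columns = sorted(set().union(*baskets))
    matrix = [[int(col in basket) for col in columns] for basket in baskets]
    return columns, matrix

nan = float('nan')
rows = [
    ['burgers', 'meatballs', 'eggs', nan],
    ['chutney', nan, nan, nan],
    ['turkey', 'avocado', nan, nan],
]

columns, encoded = one_hot(rows)
print(columns)
for line in encoded:
    print(line)
```

Because no NaN column is ever created here, a later clean-up step that drops such columns would not be needed with this approach.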

In [8]:
df.head()

,NaN,frozen smoothie,fresh tuna,cooking oil,clothes accessories,sparkling water,green tea,grated cheese,melons,eggplant,...,cookies,honey,mayonnaise,mint green tea,tomatoes,almonds,NaN.1,eggs,meatballs,burgers
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,1,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [22]:
# Drop the NaN-labelled columns that came from the empty cells
df = df[df.columns.dropna()]

In [25]:
df.head()

,frozen smoothie,fresh tuna,cooking oil,clothes accessories,sparkling water,green tea,grated cheese,melons,eggplant,magazines,...,cereals,cookies,honey,mayonnaise,mint green tea,tomatoes,almonds,eggs,meatballs,burgers
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Columns: 120 entries, frozen smoothie to burgers
dtypes: int64(120)
memory usage: 6.9 MB


In [55]:
frequent_items = apriori(df.astype('bool'), min_support=0.02, use_colnames=True, verbose=1)
frequent_items.head(10)

Processing 969 combinations | Sampling itemset size 3


,support,itemsets
0,0.0632,(frozen smoothie)
1,0.022267,(fresh tuna)
2,0.051067,(cooking oil)
3,0.132,(green tea)
4,0.0524,(grated cheese)
5,0.026533,(pepper)
6,0.095333,(frozen vegetables)
7,0.0272,(light mayo)
8,0.031733,(cottage cheese)
9,0.071333,(shrimp)
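
To make the level-wise search behind these supports concrete, here is a minimal sketch of the Apriori idea in plain Python, run on the clothing transactions from the task statement. It is an illustration under simplifying assumptions, not mlxtend's implementation; the function name `frequent_itemsets` and the 0.5 threshold are chosen for the example only.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1 candidates: every single item
    current = {frozenset([item]) for basket in baskets for item in basket}
    result = {}
    k = 1
    while current:
        # Keep only the candidates that are frequent at this level
        level = {c: s for c in current if (s := support(c)) >= min_support}
        result.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        current = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return result

clothing = [
    ['Skirt', 'Sneakers', 'Scarf', 'Pants', 'Hat'],
    ['Sunglasses', 'Skirt', 'Sneakers', 'Pants', 'Hat'],
    ['Dress', 'Sandals', 'Scarf', 'Pants', 'Heels'],
    ['Dress', 'Necklace', 'Earrings', 'Scarf', 'Hat', 'Heels', 'Hat'],
    ['Earrings', 'Skirt', 'Skirt', 'Scarf', 'Shirt', 'Pants'],
]

found = frequent_itemsets(clothing, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

The prune step is what keeps the number of candidates manageable: a larger itemset is only counted if all of its smaller subsets were already frequent.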


In [56]:
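# With min_support=0.02 above, every rule has confidence of at least 0.02,
# so a confidence threshold of 0.01 keeps all candidate rules.
rules = association_rules(frequent_items, metric="confidence", min_threshold=0.01)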
rules.head(10)

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(mineral water),(frozen smoothie),0.238267,0.0632,0.020133,0.084499,1.337012,0.005075,1.023265,0.330908
1,(frozen smoothie),(mineral water),0.0632,0.238267,0.020133,0.318565,1.337012,0.005075,1.117838,0.269069
2,(mineral water),(cooking oil),0.238267,0.051067,0.020133,0.084499,1.654683,0.007966,1.036518,0.519414
3,(cooking oil),(mineral water),0.051067,0.238267,0.020133,0.394256,1.654683,0.007966,1.257517,0.416947
4,(spaghetti),(green tea),0.174133,0.132,0.026533,0.152374,1.154346,0.003548,1.024036,0.161901
5,(green tea),(spaghetti),0.132,0.174133,0.026533,0.20101,1.154346,0.003548,1.033638,0.154042
6,(chocolate),(green tea),0.163867,0.132,0.023467,0.143206,1.084893,0.001836,1.013079,0.093586
7,(green tea),(chocolate),0.132,0.163867,0.023467,0.177778,1.084893,0.001836,1.016919,0.09015
8,(mineral water),(green tea),0.238267,0.132,0.030933,0.129827,0.983534,-0.000518,0.997502,-0.021505
9,(green tea),(mineral water),0.132,0.238267,0.030933,0.234343,0.983534,-0.000518,0.994876,-0.018922
