## Market Basket Analysis - Groceries Dataset

First, we import the libraries:

*   NumPy - for data manipulation.
*   Pandas - for data manipulation.
*   MatPlotLib - for data visualization.
*   Seaborn - for data visualization.
*   csv - to read the dataset (via its reader() function).
*   MLXtend - to apply the Apriori Algorithm.

In [7]:
# importing the libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from csv import reader
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

Next, we read the dataset into a list. Here, we use the reader() function from Python's csv module instead of the read_csv() function from the Pandas library because the number of columns differs from row to row (the rows are ragged).

In [9]:
# reading the dataset
groceries = []
with open('groceries.csv', 'r') as read_obj:
    csv_reader = reader(read_obj)
    for row in csv_reader:
        groceries.append(row)
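
To illustrate why a row-by-row reader suits ragged data, here is a minimal sketch that parses a small in-memory sample instead of the real file. The three sample transactions are made up for illustration and only stand in for groceries.csv.

```python
from csv import reader
from io import StringIO

# Hypothetical three-transaction sample standing in for groceries.csv;
# each line has a different number of items.
sample = StringIO("citrus fruit,margarine,ready soups\nwhole milk\nyogurt,coffee\n")

# csv.reader yields one list per line, whatever its length
rows = [row for row in reader(sample)]
row_lengths = [len(row) for row in rows]
print(row_lengths)
```

Each transaction keeps exactly the items it contains, so no padding values are introduced for the shorter rows.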

To answer the questions that follow, we first convert the transactions into a dataframe format to which we can apply the Apriori algorithm.

Naive approach to form the transaction dataframe:

1.   To find the unique items - flatten the list and convert it into a set. The conversion removes any duplicate values, and hence, we are left with only the unique items in the dataset.
2.   Create an empty Pandas dataframe whose columns are the unique items.
3.   For each transaction, record 1 for every item it contains and 0 for every item it does not, and append that row to the dataframe.



In [None]:
# items = set(sum(groceries, []))

In [None]:
# df = pd.DataFrame(columns=items)

In [None]:
# for i in range(len(groceries)):
#     transaction = []
#     for item in items:
#         if item in groceries[i]:
#             transaction.append(1)
#         else:
#             transaction.append(0)
#     print(transaction)
#     df = df.append(transaction, ignore_index=True)          
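
The three steps of the naive approach can be sketched on a toy list as follows. This is an illustrative variant rather than the notebook's exact code: the toy transactions are made up, the items are sorted so that the column order is predictable, and the rows are collected in a plain list rather than appended to a dataframe one at a time.

```python
# Hypothetical toy transactions (not taken from the groceries dataset)
toy_groceries = [
    ["milk", "bread"],
    ["yogurt"],
    ["milk", "yogurt"],
]

# Step 1: unique items, sorted so the column order is deterministic
items = sorted({item for transaction in toy_groceries for item in transaction})

# Steps 2 and 3: one row per transaction, 1 if the item is present, else 0
encoded = [
    [1 if item in transaction else 0 for item in items]
    for transaction in toy_groceries
]

print(items)
print(encoded)
# The rows could then be turned into a dataframe in a single call,
# e.g. pd.DataFrame(encoded, columns=items)
```

Building all rows first and constructing the dataframe once avoids the repeated row-by-row appends, but the nested loop over every item and every transaction remains.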

The naive approach is easy to follow but very slow, because it loops over every item for every transaction and grows the dataframe one row at a time. To solve the problem in significantly less time, we use the TransactionEncoder class from the MLXtend library. We fit the encoder on the list of transactions and transform it into a boolean array, in which True means the item was purchased in that transaction and False means it was not. We then convert the array to 1s and 0s and wrap it in a Pandas dataframe.

In [10]:
# fitting the list and converting the transactions to true and false
encoder = TransactionEncoder()
transactions = encoder.fit(groceries).transform(groceries)

In [11]:
# converting the true and false to 1 and 0
transactions = transactions.astype('int')

In [12]:
# converting the transactions array to a dataframe
df = pd.DataFrame(transactions, columns=encoder.columns_)

In [13]:
# viewing the first few rows of the dataframe
df.head()

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


#### **Q1.** How many transactions and items are there in the data set?


To solve Question 1, we use the shape attribute of a Pandas dataframe. Here, the number of rows represents the number of transactions, and the number of columns represents the number of items in the dataset.

In [14]:
# finding the dimensions of the dataframe
df.shape

(9835, 169)

As we can see, there are 9835 rows, meaning 9835 transactions, and 169 columns, meaning 169 items in the dataset.

To prepare the data for the following questions, we apply the Apriori algorithm on the dataframe and set the minimum support parameter to 2%.

In [15]:
# applying the apriori algorithm
frequent_itemsets = apriori(df, min_support=0.02, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets



Unnamed: 0,support,itemsets,length
0,0.033452,(UHT-milk),1
1,0.052466,(beef),1
2,0.033249,(berries),1
3,0.026029,(beverages),1
4,0.080529,(bottled beer),1
...,...,...,...
117,0.032232,"(whole milk, whipped/sour cream)",2
118,0.020742,"(whipped/sour cream, yogurt)",2
119,0.056024,"(whole milk, yogurt)",2
120,0.023183,"(whole milk, root vegetables, other vegetables)",3
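
To make the minimum support threshold concrete: the support of an itemset is the fraction of transactions that contain all of its items, and Apriori only extends itemsets that already meet the threshold. The following is a minimal pure-Python sketch of that level-wise idea on made-up toy data. The names support and toy_apriori are hypothetical helpers, and this is not how MLXtend implements apriori().

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def toy_apriori(transactions, min_support):
    """Level-wise search that only extends itemsets already found frequent."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while current:
        # keep the candidates of size k that meet the threshold
        level = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                level[candidate] = s
        result.update(level)
        k += 1
        # join frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune candidates that have any infrequent (k-1)-subset
        current = [
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return result

# Hypothetical toy transactions (not taken from the groceries dataset)
toy = [
    ["milk", "bread"],
    ["milk", "yogurt"],
    ["milk", "bread", "yogurt"],
    ["bread"],
    ["milk"],
]
found = toy_apriori(toy, min_support=0.4)
for itemset, s in sorted(found.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

With a threshold of 0.4, the pair bread and yogurt appears in only one of five transactions and is dropped, so the three-item candidate is pruned without its support ever being counted.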


#### **Q4.** Find top selling items with minimum support of 2%.

To solve Question 4, we first sort the dataframe by support in descending order, using the sort_values() function from the Pandas library with the by and ascending parameters set to support and False respectively.

In [16]:
# sorting the dataframe
frequent_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)

Next, we filter the dataframe to find itemsets with length 1 and support greater than or equal to 2%. Then we slice the sorted and filtered dataframe to show only the top 5 entries.

In [17]:
# finding top 5 items with minimum support of 2%
frequent_itemsets[ (frequent_itemsets['length'] == 1) &
                   (frequent_itemsets['support'] >= 0.02) ][0:5]

Unnamed: 0,support,itemsets,length
57,0.255516,(whole milk),1
39,0.193493,(other vegetables),1
43,0.183935,(rolls/buns),1
49,0.174377,(soda),1
58,0.139502,(yogurt),1


As we can see, whole milk, other vegetables, rolls/buns, soda and yogurt are the top 5 selling items, with support of approximately 25.6%, 19.3%, 18.4%, 17.4%, and 14.0% respectively.

#### **Q5.** Find all frequent itemsets with minimum support of 5%.

To solve Question 5, we filter the dataframe to find itemsets having length more than 1 and support of at least 5%.

In [18]:
# finding itemsets having length more than 1 and minimum support of 5%
frequent_itemsets[(frequent_itemsets['length'] > 1) & 
                  (frequent_itemsets['support'] >= 0.05)]

Unnamed: 0,support,itemsets,length
91,0.074835,"(whole milk, other vegetables)",2
103,0.056634,"(whole milk, rolls/buns)",2
119,0.056024,"(whole milk, yogurt)",2


As we can see, there are only 3 such itemsets: other vegetables and whole milk, rolls/buns and whole milk, and yogurt and whole milk. Each has length 2, and their support is approximately 7.48%, 5.66%, and 5.60% respectively.

#### **Q6.**  Find all frequent itemsets of length 2 with minimum support of 2%.

To solve Question 6, we filter the dataframe to find itemsets having length 2 and minimum support of 2%.

In [19]:
# finding itemsets having length 2 and minimum support of 2%
frequent_itemsets[(frequent_itemsets['length'] == 2) & 
                  (frequent_itemsets['support'] >= 0.02)]

Unnamed: 0,support,itemsets,length
91,0.074835,"(whole milk, other vegetables)",2
103,0.056634,"(whole milk, rolls/buns)",2
119,0.056024,"(whole milk, yogurt)",2
106,0.048907,"(whole milk, root vegetables)",2
85,0.047382,"(other vegetables, root vegetables)",2
...,...,...,...
75,0.020539,"(frankfurter, whole milk)",2
60,0.020437,"(bottled beer, whole milk)",2
76,0.020437,"(whole milk, frozen vegetables)",2
96,0.020437,"(pip fruit, tropical fruit)",2


As we can see, there are 61 itemsets having length 2 with support greater than or equal to 2%. The support ranges from about 7.5% down to about 2%, with other vegetables and whole milk having the highest support, and butter and other vegetables having the lowest.

#### **Q7.** Find the top 10 association rules with minimum support of 2%, sorted by confidence in descending order.


To solve Question 7, we first find the association rules using the association_rules() function from the MLXtend library and set the parameter metric to support, and the min_threshold to 2%.

In [20]:
# generating association rules with minimum support of 2%
rules = association_rules(frequent_itemsets, metric='support', min_threshold=0.02)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(whole milk),(other vegetables),0.255516,0.193493,0.074835,0.292877,1.513634,1.0,0.025394,1.140548,0.455803,0.200000,0.123228,0.339817
1,(other vegetables),(whole milk),0.193493,0.255516,0.074835,0.386758,1.513634,1.0,0.025394,1.214013,0.420750,0.200000,0.176286,0.339817
2,(whole milk),(rolls/buns),0.255516,0.183935,0.056634,0.221647,1.205032,1.0,0.009636,1.048452,0.228543,0.147942,0.046213,0.264776
3,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,1.0,0.009636,1.075696,0.208496,0.147942,0.070369,0.264776
4,(whole milk),(yogurt),0.255516,0.139502,0.056024,0.219260,1.571735,1.0,0.020379,1.102157,0.488608,0.165267,0.092688,0.310432
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129,(frozen vegetables),(whole milk),0.048094,0.255516,0.020437,0.424947,1.663094,1.0,0.008149,1.294636,0.418855,0.072172,0.227582,0.252466
130,(pip fruit),(tropical fruit),0.075648,0.104931,0.020437,0.270161,2.574648,1.0,0.012499,1.226392,0.661650,0.127619,0.184600,0.232464
131,(tropical fruit),(pip fruit),0.104931,0.075648,0.020437,0.194767,2.574648,1.0,0.012499,1.147931,0.683297,0.127619,0.128868,0.232464
132,(other vegetables),(butter),0.193493,0.055414,0.020031,0.103521,1.868122,1.0,0.009308,1.053661,0.576192,0.087517,0.050929,0.232494


Then we sort the generated association rules by confidence in descending order, using the sort_values() function from the Pandas library with the by and ascending parameters set to confidence and False respectively. Finally, we slice the sorted dataframe to show the top 10 rules.

In [21]:
# sorting the rules in the descending order by confidence
rules.sort_values(by='confidence', ascending=False)[0:10]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
97,"(other vegetables, yogurt)",(whole milk),0.043416,0.255516,0.022267,0.512881,2.007235,1.0,0.011174,1.52834,0.524577,0.080485,0.345695,0.300014
51,(butter),(whole milk),0.055414,0.255516,0.027555,0.497248,1.946053,1.0,0.013395,1.480817,0.514659,0.097237,0.324697,0.302543
60,(curd),(whole milk),0.053279,0.255516,0.026131,0.490458,1.919481,1.0,0.012517,1.461085,0.505984,0.092446,0.315577,0.296363
88,"(other vegetables, root vegetables)",(whole milk),0.047382,0.255516,0.023183,0.48927,1.914833,1.0,0.011076,1.457687,0.501524,0.082879,0.313982,0.289999
86,"(whole milk, root vegetables)",(other vegetables),0.048907,0.193493,0.023183,0.474012,2.44977,1.0,0.013719,1.53332,0.62223,0.105751,0.347821,0.296912
38,(domestic eggs),(whole milk),0.063447,0.255516,0.029995,0.472756,1.850203,1.0,0.013783,1.41203,0.490649,0.1038,0.2918,0.295073
31,(whipped/sour cream),(whole milk),0.071683,0.255516,0.032232,0.449645,1.759754,1.0,0.013916,1.352735,0.465077,0.109273,0.260757,0.287895
7,(root vegetables),(whole milk),0.108998,0.255516,0.048907,0.448694,1.756031,1.0,0.021056,1.350401,0.483202,0.154961,0.259479,0.320049
9,(root vegetables),(other vegetables),0.108998,0.193493,0.047382,0.434701,2.246605,1.0,0.026291,1.426693,0.622764,0.185731,0.299078,0.339789
129,(frozen vegetables),(whole milk),0.048094,0.255516,0.020437,0.424947,1.663094,1.0,0.008149,1.294636,0.418855,0.072172,0.227582,0.252466


As we can see, the top association rule is that customers who buy other vegetables and yogurt also tend to buy whole milk. The rule has a support of about 2.2%, a confidence of about 51% (roughly half of the transactions containing other vegetables and yogurt also contain whole milk), and a lift of about 2, indicating a positive association: whole milk appears in these transactions about twice as often as it would if the purchases were independent. The other rules can be read and interpreted in the same way.
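
As a sanity check on how confidence and lift relate to the support columns, the sketch below recomputes both for the top rule from the rounded values shown in the table. The helper names confidence and lift are hypothetical, and small differences from the table are expected because the inputs are rounded to six decimals.

```python
def confidence(support_both, support_antecedent):
    """Share of antecedent transactions that also contain the consequent."""
    return support_both / support_antecedent

def lift(support_both, support_antecedent, support_consequent):
    """Confidence relative to the consequent's overall support."""
    return confidence(support_both, support_antecedent) / support_consequent

# Rounded values for (other vegetables, yogurt) -> (whole milk) from the table
support_both = 0.022267        # support of the rule
support_antecedent = 0.043416  # antecedent support
support_consequent = 0.255516  # consequent support

rule_confidence = confidence(support_both, support_antecedent)
rule_lift = lift(support_both, support_antecedent, support_consequent)
print(round(rule_confidence, 3), round(rule_lift, 3))
```

Both values agree with the confidence and lift columns of the table to about four decimal places.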

#### **Q8.** Find association rules with minimum support of 2% and lift of more than 1.0.


To solve Question 8, we filter the rules dataframe to keep the rules with support of at least 2% and lift greater than 1.

In [22]:
# finding association rules with minimum support of 2% and having lift more than 1
rules[(rules['support'] >= 0.02) &
      (rules['lift'] > 1.0)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(whole milk),(other vegetables),0.255516,0.193493,0.074835,0.292877,1.513634,1.0,0.025394,1.140548,0.455803,0.200000,0.123228,0.339817
1,(other vegetables),(whole milk),0.193493,0.255516,0.074835,0.386758,1.513634,1.0,0.025394,1.214013,0.420750,0.200000,0.176286,0.339817
2,(whole milk),(rolls/buns),0.255516,0.183935,0.056634,0.221647,1.205032,1.0,0.009636,1.048452,0.228543,0.147942,0.046213,0.264776
3,(rolls/buns),(whole milk),0.183935,0.255516,0.056634,0.307905,1.205032,1.0,0.009636,1.075696,0.208496,0.147942,0.070369,0.264776
4,(whole milk),(yogurt),0.255516,0.139502,0.056024,0.219260,1.571735,1.0,0.020379,1.102157,0.488608,0.165267,0.092688,0.310432
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129,(frozen vegetables),(whole milk),0.048094,0.255516,0.020437,0.424947,1.663094,1.0,0.008149,1.294636,0.418855,0.072172,0.227582,0.252466
130,(pip fruit),(tropical fruit),0.075648,0.104931,0.020437,0.270161,2.574648,1.0,0.012499,1.226392,0.661650,0.127619,0.184600,0.232464
131,(tropical fruit),(pip fruit),0.104931,0.075648,0.020437,0.194767,2.574648,1.0,0.012499,1.147931,0.683297,0.127619,0.128868,0.232464
132,(other vegetables),(butter),0.193493,0.055414,0.020031,0.103521,1.868122,1.0,0.009308,1.053661,0.576192,0.087517,0.050929,0.232494


As we can see, there are 126 rules having support of 2% or more and lift more than 1. A lift above 1 means that the antecedent and consequent appear together more often than would be expected if they were purchased independently, so the items in each of these rules are positively associated.
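
The comparison against independence can be checked by hand. Under independence, the expected joint support is the product of the two individual supports; lift is the ratio of observed to expected support, and leverage is their difference. The sketch below uses the rounded values for the rule whole milk to other vegetables from the table. The helper names are hypothetical.

```python
def expected_support(support_a, support_c):
    """Joint support expected if the two sides were bought independently."""
    return support_a * support_c

def leverage(support_both, support_a, support_c):
    """Observed joint support minus the support expected under independence."""
    return support_both - expected_support(support_a, support_c)

# Rounded values for (whole milk) -> (other vegetables) from the table
support_both = 0.074835
support_milk = 0.255516
support_vegetables = 0.193493

baseline = expected_support(support_milk, support_vegetables)
rule_leverage = leverage(support_both, support_milk, support_vegetables)
rule_lift = support_both / baseline
print(round(baseline, 4), round(rule_leverage, 4), round(rule_lift, 3))
```

The observed support of about 7.5% exceeds the roughly 4.9% expected under independence, which is why this rule has positive leverage and a lift above 1.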