# Assignment Prep - Association Rule Mining

We will use [The Bread Basket Dataset](https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket). The dataset belongs to "The Bread Basket", a bakery located in Edinburgh.

Opendatasets is a Python package that makes it easier to import datasets from Kaggle.

Prerequisites:

- Kaggle account (preferably using your BU email ID)

Run the following two cells. The second cell will prompt you to enter your Kaggle username and key.

To find them, use this link - https://www.kaggle.com/settings/account

- Your username is shown on the right side of the screen.
- Scroll down to the API subheading and click '**Create new token**'.
- A .json file containing your username and key should download automatically.
- Copy and paste the username and key into the prompts that appear in the output of the second cell.

The dataset will then be visible in the folders tab on the left side of your Colab screen.

In [None]:
!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [None]:
import opendatasets as od
import pandas as pd
import numpy as np

od.download(
    "https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: snehaekka19
Your Kaggle Key: ··········
Downloading the-bread-basket.zip to ./the-bread-basket


100%|██████████| 98.9k/98.9k [00:00<00:00, 37.9MB/s]







### This cell is for installing any python packages you want to use

In [None]:
!pip install your-package-name

Collecting your-package-name
  Downloading your_package_name-1.0.0-py3-none-any.whl (1.5 kB)
Installing collected packages: your-package-name
Successfully installed your-package-name-1.0.0


In [None]:
# Importing Modules & Packages
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
# Loading the dataset
df = pd.read_csv('/content/the-bread-basket/bread basket.csv')
df.head()

Unnamed: 0,Transaction,Item,date_time,period_day,weekday_weekend
0,1,Bread,30-10-2016 09:58,morning,weekend
1,2,Scandinavian,30-10-2016 10:05,morning,weekend
2,2,Scandinavian,30-10-2016 10:05,morning,weekend
3,3,Hot chocolate,30-10-2016 10:07,morning,weekend
4,3,Jam,30-10-2016 10:07,morning,weekend


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20507 entries, 0 to 20506
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Transaction      20507 non-null  int64 
 1   Item             20507 non-null  object
 2   date_time        20507 non-null  object
 3   period_day       20507 non-null  object
 4   weekday_weekend  20507 non-null  object
dtypes: int64(1), object(4)
memory usage: 801.2+ KB


**Note:**
- No NULLs in any of the columns
- A given transaction is distributed across multiple rows
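
To make the second note concrete, here is a minimal plain-Python sketch of how rows that share a transaction ID can be collapsed into one basket per transaction. The rows are copied from the `df.head()` output above; the helper name `group_into_baskets` is hypothetical, and the notebook itself performs this step with pandas in a later cell.

```python
from collections import defaultdict

# (Transaction, Item) pairs copied from the df.head() output above
rows = [
    (1, "Bread"),
    (2, "Scandinavian"),
    (2, "Scandinavian"),
    (3, "Hot chocolate"),
    (3, "Jam"),
]

def group_into_baskets(rows):
    """Collect the items of each transaction ID into a single list."""
    baskets = defaultdict(list)
    for transaction_id, item in rows:
        baskets[transaction_id].append(item)
    return dict(baskets)

baskets = group_into_baskets(rows)
print(baskets)
```

Five rows become three baskets here, which is the same reason 20,507 rows can correspond to far fewer transactions.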

In [None]:
# Checking the number of unique transactions
df['Transaction'].nunique()

9465

In [None]:
# Collapsing each unique transaction to a single row

# First, aggregating the 'Item' column by creating itemsets
df_itemsets = df.groupby('Transaction', as_index=False)['Item'].apply(','.join)

# Combining this new dataframe with the original one to retain other columns
df_basket = df.merge(df_itemsets, how='left', on='Transaction')
df_basket.drop('Item_x', axis=1, inplace=True)
df_basket.rename(columns={'Item_y':'Itemsets'}, inplace=True)

# The items have now been combined into single itemsets
# Now, dropping duplicate rows so that we have one transaction per row
df_basket.drop_duplicates(keep='first', inplace=True, ignore_index=True)

df_basket

Unnamed: 0,Transaction,date_time,period_day,weekday_weekend,Itemsets
0,1,30-10-2016 09:58,morning,weekend,Bread
1,2,30-10-2016 10:05,morning,weekend,"Scandinavian,Scandinavian"
2,3,30-10-2016 10:07,morning,weekend,"Hot chocolate,Jam,Cookies"
3,4,30-10-2016 10:08,morning,weekend,Muffin
4,5,30-10-2016 10:13,morning,weekend,"Coffee,Pastry,Bread"
...,...,...,...,...,...
9460,9680,09-04-2017 14:24,afternoon,weekend,Bread
9461,9681,09-04-2017 14:30,afternoon,weekend,"Truffles,Tea,Spanish Brunch,Christmas common"
9462,9682,09-04-2017 14:32,afternoon,weekend,"Muffin,Tacos/Fajita,Coffee,Tea"
9463,9683,09-04-2017 14:57,afternoon,weekend,"Coffee,Pastry"


In [None]:
# Converting the 'Itemsets' column into a list of lists for the transformer
itemsets = df_basket.iloc[:, 4]
itemsets = list(itemsets.apply(lambda x: x.split(',')))
itemsets

[['Bread'],
 ['Scandinavian', 'Scandinavian'],
 ['Hot chocolate', 'Jam', 'Cookies'],
 ['Muffin'],
 ['Coffee', 'Pastry', 'Bread'],
 ['Medialuna', 'Pastry', 'Muffin'],
 ['Medialuna', 'Pastry', 'Coffee', 'Tea'],
 ['Pastry', 'Bread'],
 ['Bread', 'Muffin'],
 ['Scandinavian', 'Medialuna'],
 ['Bread', 'Medialuna', 'Bread'],
 ['Jam', 'Coffee', 'Tartine', 'Pastry', 'Tea'],
 ['Basket', 'Bread', 'Coffee'],
 ['Bread', 'Medialuna', 'Pastry'],
 ['Mineral water', 'Scandinavian'],
 ['Bread', 'Medialuna', 'Coffee'],
 ['Hot chocolate'],
 ['Farm House'],
 ['Farm House', 'Bread'],
 ['Bread', 'Medialuna'],
 ['Coffee', 'Coffee', 'Medialuna', 'Bread'],
 ['Jam'],
 ['Scandinavian', 'Muffin'],
 ['Bread'],
 ['Scandinavian'],
 ['Fudge'],
 ['Scandinavian'],
 ['Coffee', 'Bread'],
 ['Bread', 'Jam'],
 ['Bread'],
 ['Basket'],
 ['Scandinavian', 'Muffin'],
 ['Coffee'],
 ['Coffee', 'Muffin'],
 ['Muffin', 'Scandinavian'],
 ['Tea', 'Bread'],
 ['Coffee', 'Bread'],
 ['Bread', 'Tea'],
 ['Scandinavian'],
 ['Juice', 'Tartine', 

In [None]:
# Checking the unique items from all the itemsets
df['Item'].unique()

array(['Bread', 'Scandinavian', 'Hot chocolate', 'Jam', 'Cookies',
       'Muffin', 'Coffee', 'Pastry', 'Medialuna', 'Tea', 'Tartine',
       'Basket', 'Mineral water', 'Farm House', 'Fudge', 'Juice',
       "Ella's Kitchen Pouches", 'Victorian Sponge', 'Frittata',
       'Hearty & Seasonal', 'Soup', 'Pick and Mix Bowls', 'Smoothies',
       'Cake', 'Mighty Protein', 'Chicken sand', 'Coke',
       'My-5 Fruit Shoot', 'Focaccia', 'Sandwich', 'Alfajores', 'Eggs',
       'Brownie', 'Dulce de Leche', 'Honey', 'The BART', 'Granola',
       'Fairy Doors', 'Empanadas', 'Keeping It Local', 'Art Tray',
       'Bowl Nic Pitt', 'Bread Pudding', 'Adjustment', 'Truffles',
       'Chimichurri Oil', 'Bacon', 'Spread', 'Kids biscuit', 'Siblings',
       'Caramel bites', 'Jammie Dodgers', 'Tiffin', 'Olum & polenta',
       'Polenta', 'The Nomad', 'Hack the stack', 'Bakewell',
       'Lemon and coconut', 'Toast', 'Scone', 'Crepes', 'Vegan mincepie',
       'Bare Popcorn', 'Muesli', 'Crisps', 'Pintxos', 

In [None]:
# Cleaning the data

for ind, lst in enumerate(itemsets):
  # Removing leading/trailing spaces from items in each list
  lst = [i.strip() for i in lst]
  itemsets[ind] = lst

In [None]:
# Checking our unique items now

flat_list = []
for itemset in itemsets:
  flat_list = flat_list + itemset
set(flat_list)

{'Adjustment',
 'Afternoon with the baker',
 'Alfajores',
 'Argentina Night',
 'Art Tray',
 'Bacon',
 'Baguette',
 'Bakewell',
 'Bare Popcorn',
 'Basket',
 'Bowl Nic Pitt',
 'Bread',
 'Bread Pudding',
 'Brioche and salami',
 'Brownie',
 'Cake',
 'Caramel bites',
 'Cherry me Dried fruit',
 'Chicken Stew',
 'Chicken sand',
 'Chimichurri Oil',
 'Chocolates',
 'Christmas common',
 'Coffee',
 'Coffee granules',
 'Coke',
 'Cookies',
 'Crepes',
 'Crisps',
 'Drinking chocolate spoons',
 'Duck egg',
 'Dulce de Leche',
 'Eggs',
 "Ella's Kitchen Pouches",
 'Empanadas',
 'Extra Salami or Feta',
 'Fairy Doors',
 'Farm House',
 'Focaccia',
 'Frittata',
 'Fudge',
 'Gift voucher',
 'Gingerbread syrup',
 'Granola',
 'Hack the stack',
 'Half slice Monster',
 'Hearty & Seasonal',
 'Honey',
 'Hot chocolate',
 'Jam',
 'Jammie Dodgers',
 'Juice',
 'Keeping It Local',
 'Kids biscuit',
 'Lemon and coconut',
 'Medialuna',
 'Mighty Protein',
 'Mineral water',
 'Mortimer',
 'Muesli',
 'Muffin',
 'My-5 Fruit Shoo

In [None]:
# Now, converting the data from basket format to mlxtend/encoded format

# Transforming the data
te = TransactionEncoder()
te_data = te.fit_transform(itemsets)

# Saving the above data in a dataframe
df_encoded = pd.DataFrame(te_data, columns=te.columns_)
df_encoded

Unnamed: 0,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9460,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9461,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
9462,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9463,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
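
The encoded table above is a one-hot representation: one row per transaction, one boolean column per item. The sketch below reproduces the idea in plain Python on three baskets copied from the itemsets output earlier. It is an illustration of the encoding only, not mlxtend's implementation, and `one_hot_encode` is a hypothetical helper name.

```python
# Three baskets copied from the itemsets output earlier in the notebook
baskets = [
    ["Bread"],
    ["Scandinavian", "Scandinavian"],
    ["Coffee", "Pastry", "Bread"],
]

def one_hot_encode(baskets):
    """Return (sorted item names, one boolean row per basket)."""
    columns = sorted({item for basket in baskets for item in basket})
    rows = [[item in basket for item in columns] for basket in baskets]
    return columns, rows

columns, encoded = one_hot_encode(baskets)
print(columns)
for row in encoded:
    print(row)
```

Note that the repeated 'Scandinavian' in the second basket becomes a single `True`: a boolean encoding records whether an item was bought, not how many times.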


# Question 1 **(5 Points)**

In terms of purchase volume, find the top 5 *single* item recommendations based on any *single* item purchases in the bakery. These recommendations will be used to optimally place the two items within reach of each other.

Use the apriori algorithm with a reasonable minimum support (Justify your choice).

By what percentage has the apriori method reduced the computational cost of solving this query? Feel free to use a theoretical approach or an empirical one.

In [None]:
# Finding the most frequent itemsets based on a support threshold
frequent_itemsets = apriori(df_encoded, min_support=0.01, use_colnames=True)
frequent_itemsets.sort_values(by='support', ascending=False)

Unnamed: 0,support,itemsets
6,0.478394,(Coffee)
2,0.327205,(Bread)
26,0.142631,(Tea)
4,0.103856,(Cake)
34,0.090016,"(Coffee, Bread)"
...,...,...
11,0.010565,(Hearty & Seasonal)
20,0.010460,(Salad)
30,0.010354,"(Alfajores, Bread)"
58,0.010037,"(Coffee, Cake, Bread)"


In [None]:
# Generating all association rules from the frequent itemsets
# (they are filtered down to 1 antecedent --> 1 consequent in the next cell)
rules_1_1 = association_rules(frequent_itemsets, metric="support", min_threshold=0.01)
rules_1_1.sort_values(by='support', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
8,(Coffee),(Bread),0.478394,0.327205,0.090016,0.188163,0.575059,-0.066517,0.828731,-0.586210
9,(Bread),(Coffee),0.327205,0.478394,0.090016,0.275105,0.575059,-0.066517,0.719561,-0.523431
25,(Cake),(Coffee),0.103856,0.478394,0.054728,0.526958,1.101515,0.005044,1.102664,0.102840
24,(Coffee),(Cake),0.478394,0.103856,0.054728,0.114399,1.101515,0.005044,1.011905,0.176684
51,(Tea),(Coffee),0.142631,0.478394,0.049868,0.349630,0.730840,-0.018366,0.802014,-0.300482
...,...,...,...,...,...,...,...,...,...,...
69,"(Coffee, Cake)",(Tea),0.054728,0.142631,0.010037,0.183398,1.285822,0.002231,1.049923,0.235157
70,"(Tea, Cake)",(Coffee),0.023772,0.478394,0.010037,0.422222,0.882582,-0.001335,0.902779,-0.119934
71,(Coffee),"(Tea, Cake)",0.478394,0.023772,0.010037,0.020981,0.882582,-0.001335,0.997149,-0.203223
72,(Tea),"(Coffee, Cake)",0.142631,0.054728,0.010037,0.070370,1.285822,0.002231,1.016827,0.259266


In [None]:
# Keeping only rules with exactly one antecedent and one consequent
rules_1_1_filtered = rules_1_1[(rules_1_1['antecedents'].apply(lambda x: len(x) == 1))]
rules_1_1_filtered = rules_1_1_filtered[(rules_1_1_filtered['consequents'].apply(lambda x: len(x) == 1))]
# Keeping rules with confidence >= 0.1 and lift >= 1 (no negative association)
rules_1_1_filtered = rules_1_1_filtered[(rules_1_1_filtered['confidence'] >= 0.1) & (rules_1_1_filtered['lift'] >= 1)]
rules_1_1_filtered.sort_values(by='confidence', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
53,(Toast),(Coffee),0.033597,0.478394,0.023666,0.704403,1.472431,0.007593,1.764582,0.332006
49,(Spanish Brunch),(Coffee),0.018172,0.478394,0.010882,0.598837,1.251766,0.002189,1.300235,0.204851
36,(Medialuna),(Coffee),0.061807,0.478394,0.035182,0.569231,1.189878,0.005614,1.210871,0.170091
41,(Pastry),(Coffee),0.086107,0.478394,0.047544,0.552147,1.154168,0.006351,1.164682,0.146161
3,(Alfajores),(Coffee),0.036344,0.478394,0.019651,0.540698,1.130235,0.002264,1.135648,0.119574
35,(Juice),(Coffee),0.038563,0.478394,0.020602,0.534247,1.11675,0.002154,1.119919,0.108738
43,(Sandwich),(Coffee),0.071844,0.478394,0.038246,0.532353,1.112792,0.003877,1.115384,0.109205
25,(Cake),(Coffee),0.103856,0.478394,0.054728,0.526958,1.101515,0.005044,1.102664,0.10284
45,(Scone),(Coffee),0.034548,0.478394,0.018067,0.522936,1.093107,0.001539,1.093366,0.088224
31,(Cookies),(Coffee),0.054411,0.478394,0.028209,0.518447,1.083723,0.002179,1.083174,0.0817


**Answer 1**

To determine the top 5 single-item recommendations from bakery purchases, we used the apriori algorithm with a minimum support of 0.01, so an itemset must appear in at least 1% of the 9,465 transactions (roughly 95) to count as frequent. This threshold filters out rarely purchased items while still leaving a reasonable number of itemsets to build rules from. We then kept only rules with a single antecedent and a single consequent, a confidence of at least 0.1, and a lift of at least 1, so that buying the antecedent makes the consequent at least as likely as its baseline purchase rate.

Our analysis revealed the following top 5 recommendations:

- **Coffee** can be paired with Toast, Spanish Brunch, Medialuna, or Pastry, among other items.
- **Bread** can be recommended with Pastry.
- **Tea** can be recommended with Cake or Sandwich.
- **Cake** can be paired with Hot chocolate, Tea, or Coffee.
- **Hot chocolate** can be recommended with Cake.
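
As a check on how the confidence and lift columns relate to the support columns, the sketch below recomputes both metrics for the Toast --> Coffee rule from the three support values printed in its row of the filtered table. The function names are illustrative. Because the inputs are rounded to six decimals, the results match the table only to about three decimals.

```python
def confidence(support_xy, support_x):
    """P(Y | X): share of transactions containing X that also contain Y."""
    return support_xy / support_x

def lift(support_xy, support_x, support_y):
    """Ratio of observed co-occurrence to what independence would predict."""
    return support_xy / (support_x * support_y)

# Values copied from the (Toast) -> (Coffee) row of the filtered rules table
support_toast = 0.033597
support_coffee = 0.478394
support_both = 0.023666

conf = confidence(support_both, support_toast)
lft = lift(support_both, support_toast, support_coffee)
print(f"confidence = {conf:.3f}, lift = {lft:.3f}")
```

A lift above 1 means Toast buyers purchase Coffee more often than the average customer does, which is why the rule survives the `lift >= 1` filter.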

**Computational Cost**

In [None]:
%timeit apriori(df_encoded, min_support=0.01, use_colnames=True)

131 ms ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%timeit apriori(df_encoded, min_support=0.00000001, use_colnames=True)

13.5 s ± 1.81 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
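# Mean runtimes from the two %timeit cells above, both in seconds
time_with_threshold = 131/1000   # min_support=0.01 (131 ms)
time_without_threshold = 13.5    # min_support=0.00000001 (effectively no pruning)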

time_saved = time_without_threshold - time_with_threshold
percentage_saved = (time_saved / time_without_threshold) * 100

print("Percentage of computational cost saved: {:.2f}%".format(percentage_saved))

Percentage of computational cost saved: 99.03%
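
The question also allows a theoretical argument. One way to sketch it: apriori relies on the fact that a pair can only be frequent if both of its items are frequent, so candidate pairs are built from the frequent single items rather than from every item. The counts below (`n_items=100`, `n_frequent_items=30`) are illustrative placeholders, not values taken from this notebook, and `pair_candidates` is a hypothetical helper.

```python
from math import comb

def pair_candidates(n_items, n_frequent_items):
    """Compare candidate 2-itemsets with and without apriori pruning."""
    brute_force = comb(n_items, 2)             # every possible pair of items
    after_pruning = comb(n_frequent_items, 2)  # pairs of frequent items only
    reduction_pct = 100 * (1 - after_pruning / brute_force)
    return brute_force, after_pruning, reduction_pct

# Placeholder counts for illustration only
brute, pruned, pct = pair_candidates(n_items=100, n_frequent_items=30)
print(f"{brute} pairs without pruning, {pruned} with pruning ({pct:.1f}% fewer)")
```

Substituting the actual number of unique items and the number of frequent single items at the chosen support gives a theoretical estimate to set beside the timing result.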


# Question 2 **(5 Points)**

Find out how/if the recommendations from the previous question change based on the time of the day. (morning, afternoon, evening). Comment on how similar/different the associations are.

In [None]:
# Checking the unique times of day in the dataset
df_basket['period_day'].unique()

array(['morning', 'afternoon', 'evening', 'night'], dtype=object)

In [None]:
# Collapsing 'evening' and 'night' into one category
df_basket['period_day'] = df_basket['period_day'].replace({'night':'evening'})
df_basket['period_day'].unique()

array(['morning', 'afternoon', 'evening'], dtype=object)

In [None]:
# Creating a Function to calculate association rules based on time of day

def time_of_day(period):
    # Subsetting the data to work with
    df_period = df_basket[df_basket['period_day'] == period]

    # Converting the 'Itemsets' column into a list of lists for the transformer
    data = df_period.iloc[:, 4]
    data = list(data.apply(lambda x: x.split(',')))

    # Encoding the data
    data_encoded = te.fit_transform(data)
    df_encoded = pd.DataFrame(data_encoded, columns=te.columns_)

    # Finding the most frequent itemsets based on a support threshold
    freq_itemsets = apriori(df_encoded, min_support=0.001, use_colnames=True)
    # Generating association rules (filtered to 1 antecedent --> 1 consequent below)
    rules = association_rules(freq_itemsets, metric="support", min_threshold=0.01)

    # Filtering the rules for our specific recommendations
    rules_filtered = rules[(rules['antecedents'].apply(lambda x: len(x) == 1))]
    rules_filtered = rules_filtered[(rules_filtered['consequents'].apply(lambda x: len(x) == 1))]
    rules_filtered = rules_filtered[(rules_filtered['confidence'] >= 0.1) & (rules_filtered['lift'] >= 1)]
    rules_filtered = rules_filtered.sort_values(by='confidence', ascending=False).head(25)

    return rules_filtered

In [None]:
# Checking for frequent associations / recommendations in the 'morning'
time_of_day('morning')

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
42,(Toast),(Coffee),0.04972,0.514989,0.035827,0.720588,1.39923,0.010222,1.735829,0.30025
28,(Juice),(Coffee),0.031928,0.514989,0.019498,0.610687,1.185825,0.003055,1.245812,0.161874
24,(Cookies),(Coffee),0.047526,0.514989,0.028516,0.6,1.165073,0.00404,1.212527,0.148755
1,(Alfajores),(Coffee),0.024616,0.514989,0.014623,0.594059,1.153538,0.001946,1.194783,0.136461
30,(Medialuna),(Coffee),0.092615,0.514989,0.054594,0.589474,1.144633,0.006898,1.181437,0.139255
34,(Pastry),(Coffee),0.13941,0.514989,0.077261,0.554196,1.076131,0.005466,1.087946,0.082206
38,(Scone),(Coffee),0.026566,0.514989,0.014623,0.550459,1.068875,0.000942,1.078902,0.066195
18,(Brownie),(Coffee),0.029978,0.514989,0.01633,0.544715,1.057722,0.000891,1.065292,0.056259
33,(Muffin),(Coffee),0.035827,0.514989,0.019498,0.544218,1.056756,0.001047,1.064129,0.055703
26,(Hot chocolate),(Coffee),0.052888,0.514989,0.028516,0.539171,1.046955,0.001279,1.052474,0.047354


In [None]:
# Checking for frequent associations / recommendations in the 'afternoon'
time_of_day('afternoon')

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
59,(Toast),(Coffee),0.022401,0.459815,0.015131,0.675439,1.468935,0.00483,1.664353,0.32655
46,(Salad),(Coffee),0.017489,0.459815,0.011201,0.640449,1.392841,0.003159,1.502389,0.287063
44,(Pastry),(Coffee),0.045785,0.459815,0.025545,0.55794,1.2134,0.004493,1.221971,0.184308
55,(Spanish Brunch),(Coffee),0.024366,0.459815,0.013559,0.556452,1.210163,0.002355,1.217871,0.178003
48,(Sandwich),(Coffee),0.115936,0.459815,0.062291,0.537288,1.168487,0.008982,1.167432,0.163102
40,(Medialuna),(Coffee),0.037335,0.459815,0.020043,0.536842,1.167517,0.002876,1.166308,0.149046
26,(Cake),(Coffee),0.136766,0.459815,0.07192,0.525862,1.143638,0.009033,1.139299,0.145496
3,(Alfajores),(Coffee),0.044606,0.459815,0.022991,0.515419,1.120925,0.00248,1.114745,0.112916
50,(Scone),(Coffee),0.042051,0.459815,0.021222,0.504673,1.097556,0.001886,1.090562,0.092786
38,(Juice),(Coffee),0.043427,0.459815,0.021615,0.497738,1.082473,0.001647,1.075503,0.079648


In [None]:
# Checking for frequent associations / recommendations in the 'evening'
time_of_day('evening')

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
54,(Salad),(Coffee),0.010989,0.274725,0.010989,1.0,3.64,0.00797,inf,0.733333
85,(Scone),(Tea),0.014652,0.164835,0.010989,0.75,4.55,0.008574,3.340659,0.791822
57,(Scone),(Coffee),0.014652,0.274725,0.010989,0.75,2.73,0.006964,2.901099,0.643123
37,(Scone),(Cake),0.014652,0.102564,0.010989,0.75,7.3125,0.009486,3.589744,0.876084
60,(Tiffin),(Coffee),0.014652,0.274725,0.010989,0.75,2.73,0.006964,2.901099,0.643123
8,(Mineral water),(Alfajores),0.018315,0.058608,0.010989,0.6,10.2375,0.009916,2.35348,0.919154
72,(Mineral water),(Juice),0.018315,0.047619,0.010989,0.6,12.6,0.010117,2.380952,0.937811
82,(Postcard),(Tshirt),0.03663,0.076923,0.021978,0.6,7.8,0.01916,2.307692,0.904943
30,(Cake),(Coffee),0.102564,0.274725,0.058608,0.571429,2.08,0.030431,1.692308,0.578571
5,(Alfajores),(Coffee),0.058608,0.274725,0.032967,0.5625,2.0475,0.016866,1.657771,0.54345


**Answer 2**

Based on the findings above, we observe the following:
- **Morning:** Coffee, Tea, Pastry, and Medialuna show up as some of the items that can be heavily recommended based on purchases of other single items. These seem like fair recommendations, given that people usually buy a lot of coffee or an on-the-go breakfast in the morning.
- **Afternoon:** Coffee, Tea, Cake, and Sandwich emerge as the recommended items at this time of day. Although coffee and tea remain on the list, new items such as cake and sandwiches are purchased more in the afternoon and can therefore be recommended alongside other products.
- **Evening:** Again, coffee, tea, cake, and juice are among the items most purchased alongside items like salad, scone, or mineral water. Several of the evening rules share the same very low support (0.010989), so they rest on only a handful of transactions and should be read with caution.

Although the recommendations don't deviate much from the previous question, we do see some subtle trends by time of day, and these can be leveraged to promote some items more than others depending on when they are most likely to be bought.
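
One way to put a number on "how similar" the periods are is the Jaccard overlap between their rule sets. The sketch below uses only the ten rules displayed for each period in the outputs above (not the full rule sets), so the values are indicative at best; `jaccard` is an illustrative helper name.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

# (antecedent, consequent) pairs copied from the displayed output tables
morning = {(a, "Coffee") for a in [
    "Toast", "Juice", "Cookies", "Alfajores", "Medialuna",
    "Pastry", "Scone", "Brownie", "Muffin", "Hot chocolate"]}
afternoon = {(a, "Coffee") for a in [
    "Toast", "Salad", "Pastry", "Spanish Brunch", "Sandwich",
    "Medialuna", "Cake", "Alfajores", "Scone", "Juice"]}
evening = {
    ("Salad", "Coffee"), ("Scone", "Tea"), ("Scone", "Coffee"),
    ("Scone", "Cake"), ("Tiffin", "Coffee"), ("Mineral water", "Alfajores"),
    ("Mineral water", "Juice"), ("Postcard", "Tshirt"),
    ("Cake", "Coffee"), ("Alfajores", "Coffee")}

overlap_morning_afternoon = jaccard(morning, afternoon)
overlap_morning_evening = jaccard(morning, evening)
overlap_afternoon_evening = jaccard(afternoon, evening)
print(round(overlap_morning_afternoon, 2),
      round(overlap_morning_evening, 2),
      round(overlap_afternoon_evening, 2))
```

On these displayed rules, morning and afternoon overlap the most, while the evening rules differ most from the other two periods.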

# Question 3 **(10 Points)**

Find out if the day of the week (i.e., Monday, Tuesday, ..) affects the customers' purchase patterns. Compute the top 3 most common item associations for each day. Comment on how similar/different the rules are.

Use [to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) and [dayofweek](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.dayofweek.html) to generate the day of the week for any date.
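
The `date_time` strings in this dataset are day-first (for example `30-10-2016 09:58`), so the parsing step has to be told that; otherwise a value such as `09-04-2017` is ambiguous between 9 April and 4 September. The standard-library sketch below shows the same idea without pandas: parse with an explicit day-first format, then map the weekday number (Monday is 0, Sunday is 6, the same convention `dayofweek` uses) to a name. `day_name` is a hypothetical helper.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_name(date_string):
    """Parse a day-first 'DD-MM-YYYY HH:MM' string and return its weekday name."""
    parsed = datetime.strptime(date_string, "%d-%m-%Y %H:%M")
    return DAY_NAMES[parsed.weekday()]  # weekday(): Monday=0 ... Sunday=6

# First and last timestamps shown in the df_basket output
first_day = day_name("30-10-2016 09:58")
last_day = day_name("09-04-2017 14:24")
print(first_day, last_day)
```

Both dates fall on a Sunday, which agrees with the `weekend` label in the `weekday_weekend` column for those rows.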

In [None]:
# Converting the 'date_time' column to datetime type
df_basket['date_time'] = pd.to_datetime(df_basket['date_time'], dayfirst=True)

# Creating a 'day_of_week' column: extracting the weekday number from 'date_time'
# (Monday=0, ..., Sunday=6) and mapping it to the day name
day_names = {0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df_basket['day_of_week'] = df_basket['date_time'].dt.dayofweek.map(day_names)

df_basket.head()

Unnamed: 0,Transaction,date_time,period_day,weekday_weekend,Itemsets,day_of_week
0,1,2016-10-30 09:58:00,morning,weekend,Bread,Sunday
1,2,2016-10-30 10:05:00,morning,weekend,"Scandinavian,Scandinavian",Sunday
2,3,2016-10-30 10:07:00,morning,weekend,"Hot chocolate,Jam,Cookies",Sunday
3,4,2016-10-30 10:08:00,morning,weekend,Muffin,Sunday
4,5,2016-10-30 10:13:00,morning,weekend,"Coffee,Pastry,Bread",Sunday


In [None]:
# Creating a function to calculate association rules based on day of the week

def day_of_week(day):
    # Subsetting the data to work with
    df_day = df_basket[df_basket['day_of_week'] == day]

    # Converting the 'Itemsets' column into a list of lists for the transformer
    data = df_day.iloc[:, 4]
    data = list(data.apply(lambda x: x.split(',')))

    # Encoding the data
    data_encoded = te.fit_transform(data)
    df_encoded = pd.DataFrame(data_encoded, columns=te.columns_)

    # Finding the most frequent itemsets based on a support threshold
    freq_itemsets = apriori(df_encoded, min_support=0.001, use_colnames=True)
    # Generating association rules from the frequent itemsets
    rules = association_rules(freq_itemsets, metric="support", min_threshold=0.01)

    # Filtering the rules for our specific recommendations
    rules_filtered = rules[rules['confidence'] >= 0.5]
    rules_filtered = rules_filtered.sort_values(by='support', ascending=False).head(3)

    print("Top Most Common Associations on", day, "are:")
    return rules_filtered

In [None]:
# Top associations for Monday
day_of_week('Monday')

Top Most Common Associations on Monday are:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
32,(Pastry),(Coffee),0.085106,0.506206,0.052305,0.614583,1.214098,0.009224,1.281196,0.192747
28,(Medialuna),(Coffee),0.06383,0.506206,0.038121,0.597222,1.179802,0.00581,1.225972,0.162791
22,(Cookies),(Coffee),0.067376,0.506206,0.036348,0.539474,1.06572,0.002241,1.072239,0.066123


In [None]:
# Top associations for Tuesday
day_of_week('Tuesday')

Top Most Common Associations on Tuesday are:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
16,(Cake),(Coffee),0.110549,0.503797,0.069198,0.625954,1.242472,0.013504,1.326582,0.219408
34,(Pastry),(Coffee),0.097046,0.503797,0.057384,0.591304,1.173695,0.008492,1.214113,0.163895
36,(Sandwich),(Coffee),0.065823,0.503797,0.036287,0.551282,1.094253,0.003126,1.105823,0.092204


In [None]:
# Top associations for Wednesday
day_of_week('Wednesday')

Top Most Common Associations on Wednesday are:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
18,(Cake),(Coffee),0.106402,0.480613,0.064022,0.601695,1.251932,0.012883,1.303992,0.225196
32,(Pastry),(Coffee),0.091073,0.480613,0.055906,0.613861,1.277246,0.012135,1.345079,0.238815
34,(Sandwich),(Coffee),0.074842,0.480613,0.038774,0.518072,1.07794,0.002804,1.077728,0.078154


In [None]:
# Top associations for Thursday
day_of_week('Thursday')

Top Most Common Associations on Thursday are:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
44,(Sandwich),(Coffee),0.074519,0.463141,0.038462,0.516129,1.11441,0.003949,1.109509,0.110931
34,(Hot chocolate),(Coffee),0.06891,0.463141,0.036058,0.523256,1.129798,0.004143,1.126094,0.123389
32,(Cookies),(Coffee),0.063301,0.463141,0.032853,0.518987,1.120582,0.003535,1.116102,0.114878


In [None]:
# Top associations for Friday
day_of_week('Friday')

Top Most Common Associations on Friday are:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
42,(Pastry),(Coffee),0.075932,0.477966,0.040678,0.535714,1.120821,0.004385,1.124381,0.116654
38,(Medialuna),(Coffee),0.061017,0.477966,0.035932,0.588889,1.232072,0.006768,1.269812,0.200599
32,(Cookies),(Coffee),0.061017,0.477966,0.031864,0.522222,1.092593,0.0027,1.092629,0.090253


In [None]:
# Top associations for Saturday
day_of_week('Saturday')

Top Most Common Associations on Saturday are:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
48,(Sandwich),(Coffee),0.068966,0.436134,0.039825,0.577465,1.324053,0.009747,1.334483,0.262873
42,(Medialuna),(Coffee),0.064594,0.436134,0.034968,0.541353,1.241255,0.006797,1.229413,0.207785
38,(Hot chocolate),(Coffee),0.067994,0.436134,0.033997,0.5,1.146437,0.004343,1.127732,0.137051


In [None]:
# Top associations for Sunday
day_of_week('Sunday')

Top Most Common Associations on Sunday are:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
32,(Cake),(Coffee),0.127676,0.512292,0.073751,0.57764,1.12756,0.008343,1.154721,0.129687
48,(Medialuna),(Coffee),0.101507,0.512292,0.067407,0.664062,1.296258,0.015406,1.451782,0.254369
52,(Pastry),(Coffee),0.092784,0.512292,0.052339,0.564103,1.101135,0.004807,1.11886,0.10124


**Answer 3**

From the above analysis, the top 3 most common item associations for each day are:

- **Monday:** Coffee is purchased the most with pastry, followed by medialuna and cookies.
- **Tuesday:** Coffee is again on top of the list and gets purchased often with cake, pastry, or sandwich.
- **Wednesday:** Coffee is again purchased most with cake, pastry, or sandwich.
- **Thursday:** Coffee is purchased a lot with sandwich, hot chocolate, or cookies.
- **Friday:** Coffee is often bought with pastry, medialuna, or cookies.
- **Saturday:** Sandwich, medialuna, and hot chocolate are most bought alongside coffee.
- **Sunday:** Coffee is purchased with cake, medialuna, and pastry.

We notice that the top associations vary by day, although Coffee is the consequent in every one of them. Interestingly, Tuesday and Wednesday show the exact same top 3 associations, as do Monday and Friday. Sandwich appears in the top 3 on Tuesday, Wednesday, Thursday, and Saturday, and ranks first on Thursday and Saturday. Hot chocolate appears on the list on Thursdays and Saturdays. Cake and pastry, or pastry and cookies, also tend to appear together on the same days.
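
The day-by-day comparison can be tallied mechanically. The sketch below copies the antecedents of the top 3 rules from each day's output table (every consequent is Coffee), counts how many days each item appears on, and lists the pairs of days whose top 3 sets are identical. The variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Antecedents of the top 3 rules per day, copied from the output tables above
top3 = {
    "Monday":    ["Pastry", "Medialuna", "Cookies"],
    "Tuesday":   ["Cake", "Pastry", "Sandwich"],
    "Wednesday": ["Cake", "Pastry", "Sandwich"],
    "Thursday":  ["Sandwich", "Hot chocolate", "Cookies"],
    "Friday":    ["Pastry", "Medialuna", "Cookies"],
    "Saturday":  ["Sandwich", "Medialuna", "Hot chocolate"],
    "Sunday":    ["Cake", "Medialuna", "Pastry"],
}

# Number of days on which each item appears in the top 3
days_per_item = Counter(item for items in top3.values() for item in items)

# Pairs of days that share exactly the same three antecedents
identical_days = [(a, b) for a, b in combinations(top3, 2)
                  if set(top3[a]) == set(top3[b])]

print(days_per_item.most_common())
print(identical_days)
```

Pastry is the most persistent partner for Coffee (five of seven days), while Hot chocolate shows up on only two.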

# Question 4 **(8 Points)**

For the items that are bought together in more than 500 transactions:

1. For the sake of item promotion, suggest a strong rule indicating that the second item is *more likely than not* to be bought as well once the first one is bought.
2. Show a pair of items that seem to be ill-suited for being promoted together.

Explain your answers.

In [None]:
# Desired Support Threshold
support_500 = 500/len(df_basket)
print(support_500)

0.05282620179609086


**Answer 4.1**

In [None]:
# Finding the most frequent itemsets based on the desired support threshold
freq_itemsets = apriori(df_encoded, min_support=support_500, use_colnames=True)

# Finding the rules of interest
rules_500 = association_rules(freq_itemsets, metric="lift", min_threshold=1)
rules_500.sort_values(by='confidence', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Cake),(Coffee),0.103856,0.478394,0.054728,0.526958,1.101515,0.005044,1.102664,0.10284
1,(Coffee),(Cake),0.478394,0.103856,0.054728,0.114399,1.101515,0.005044,1.011905,0.176684


**Result:**

With the support threshold above and a minimum lift of 1, Cake --> Coffee emerges as a strong rule: it has a support of about 5.5%, a lift greater than 1 (1.10), and a confidence of 0.53, meaning that Coffee is also purchased in about 53% of the transactions that contain Cake. Because this confidence exceeds 0.5, Coffee is more likely than not to be bought once Cake is bought.

**Answer 4.2**

In [None]:
# Finding ill-suited rules
rules_ill = association_rules(freq_itemsets, metric="support", min_threshold=support_500)
rules_ill.sort_values(by='lift')

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
3,(Coffee),(Cake),0.478394,0.103856,0.054728,0.114399,1.101515,0.005044,1.011905,0.176684
1,(Coffee),(Bread),0.478394,0.327205,0.090016,0.188163,0.575059,-0.066517,0.828731,-0.58621
0,(Bread),(Coffee),0.327205,0.478394,0.090016,0.275105,0.575059,-0.066517,0.719561,-0.523431
2,(Cake),(Coffee),0.103856,0.478394,0.054728,0.526958,1.101515,0.005044,1.102664,0.10284


**Result:**

On the other hand, Coffee --> Bread may be an ill-suited pair when it comes to promoting them together. It has a fairly high support (about 9%), but a low confidence of 0.19 and a lift of 0.58 (< 1), which indicates that Bread and Coffee are bought together less often than would be expected if they were purchased independently.
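
To see what these supports mean in transactions, the sketch below converts them to approximate counts using the 9,465 unique transactions reported earlier, and compares the observed number of joint purchases with the number expected if the two items were bought independently. The supports are copied from the rules table; the counts are rounded estimates because the printed supports are themselves rounded.

```python
N_TRANSACTIONS = 9465  # number of unique transactions reported earlier

# (support_x, support_y, support_xy) copied from the rules table above
pairs = {
    ("Coffee", "Bread"): (0.478394, 0.327205, 0.090016),
    ("Cake", "Coffee"): (0.103856, 0.478394, 0.054728),
}

results = {}
for pair, (support_x, support_y, support_xy) in pairs.items():
    observed = round(support_xy * N_TRANSACTIONS)
    # Under independence, P(X and Y) = P(X) * P(Y)
    expected = round(support_x * support_y * N_TRANSACTIONS)
    results[pair] = (observed, expected)
    print(pair, "observed:", observed, "expected if independent:", expected)
```

Both pairs clear the 500-transaction bar, but Coffee and Bread are observed together far less often than independence would predict, whereas Cake and Coffee are observed slightly more often.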



# Question 5 **(2 Points)**

Given the following rule from the dataset:

(Valentine's card) -> (Tshirt)

Find its lift, confidence, and support. Do these metrics support the claim that placing valentine cards next to the t-shirt stand will substantially increase t-shirt sales? Explain your conclusion.

In [None]:
# Calculating Metrics
total_trans = len(df_basket)
x_trans = len(df_basket[df_basket['Itemsets'].str.contains("Valentine's card")])
y_trans = len(df_basket[df_basket['Itemsets'].str.contains("Tshirt")])
x_y_trans = len(df_basket[(df_basket['Itemsets'].str.contains("Valentine's card")) & (df_basket['Itemsets'].str.contains("Tshirt"))])
support_x = x_trans/total_trans
support_y = y_trans/total_trans
support_x_y = x_y_trans/total_trans

print("Metrics for the association Valentine's card --> Tshirt are:")
print("Support    =", np.round(support_x_y,4))
print("Confidence =", np.round(support_x_y/support_x,4))
print("Lift       =", np.round(support_x_y/(support_x*support_y),4))

Metrics for the association Valentine's card --> Tshirt are:
Support    = 0.0002
Confidence = 0.1538
Lift       = 69.3407


**Answer 5**

The association Valentine's card --> Tshirt has a very low support of 0.0002, which means it occurs in only about 0.02% of the transactions (roughly 2 of 9,465). A confidence of 0.1538 indicates that about 15% of the transactions that involve a Valentine's card also include a t-shirt. The lift of 69.34 is very high, indicating that the two items are bought together far more often than would be expected if they were purchased independently. In conclusion, the metrics are consistent with a positive association, so placing Valentine's cards next to t-shirts could increase t-shirt sales. However, because the rule rests on so few transactions, they do not establish a substantial increase, and any effect might only hold during the Valentine's season.
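
The printed metrics can be traced back to raw counts. The counts below (2 joint transactions, 13 with a Valentine's card, 21 with a Tshirt) are an inference: they are the integer counts that reproduce the three printed values for 9,465 transactions, not numbers the notebook prints directly.

```python
from fractions import Fraction

TOTAL = 9465    # unique transactions reported earlier
BOTH = 2        # inferred: transactions with both items
CARDS = 13      # inferred: transactions with a Valentine's card
TSHIRTS = 21    # inferred: transactions with a Tshirt

support = Fraction(BOTH, TOTAL)
confidence = Fraction(BOTH, CARDS)
lift = confidence / Fraction(TSHIRTS, TOTAL)

print("Support    =", round(float(support), 4))
print("Confidence =", round(float(confidence), 4))
print("Lift       =", round(float(lift), 4))
```

Seen this way, the very large lift comes from dividing by a tiny baseline: a single additional or missing joint purchase would change it dramatically, which is why it is weak evidence on its own.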