# Assignment Prep - Association Rule Mining

We will use [The Bread Basket Dataset](https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket). The dataset belongs to "The Bread Basket", a bakery located in Edinburgh.

Opendatasets is a Python package that makes it easier to import datasets from Kaggle.

Prerequisites:

- Kaggle account (preferably using your BU email ID)

Run the following two cells. The second cell will prompt you for your Kaggle username and key.

To find them, go to https://www.kaggle.com/settings/account

- Your username is shown on the right side of the screen.
- Scroll down to the API subheading and click '**Create new token**'.
- A .json file containing your username and key should download automatically.
- Copy and paste both values into the prompts in the output of the second cell.

The dataset will then appear in the file browser on the left side of your Colab screen.

In [3]:
!pip install opendatasets



In [3]:
import opendatasets as od
import pandas as pd
import numpy as np

# od.download(
#     "https://www.kaggle.com/datasets/mittalvasu95/the-bread-basket")

### This cell is for installing any Python packages you want to use

In [None]:
!pip install your-package-name

# Question 1 **(5 Points)**

Find the top 5 *single* item recommendations based on any *single* item purchases in the bakery. These recommendations will be used to optimally place the two items within reach of each other.

Use the apriori algorithm with a reasonable minimum support (Justify your choice).

By what percentage has the apriori method reduced the computational cost of solving this query?
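One hedged way to approach the cost question is to compare how many candidate pairs a brute-force search would have to count with how many remain after apriori prunes the infrequent single items. The sketch below uses a made-up list of baskets and an assumed `min_support` of 0.4; neither comes from the Bread Basket data, and the variable names are illustrative only.

```python
from math import comb

# Toy transactions (hypothetical, not from the Bread Basket data).
transactions = [
    {"Coffee", "Bread"},
    {"Coffee", "Cake"},
    {"Coffee", "Bread", "Tea"},
    {"Bread", "Jam"},
    {"Coffee", "Scone"},
]
min_support = 0.4  # assumed threshold, for illustration only

n = len(transactions)
items = sorted(set().union(*transactions))

# Support of each single item: fraction of transactions that contain it.
item_support = {i: sum(i in t for t in transactions) / n for i in items}

# Apriori pruning: a pair can only be frequent if both of its items are frequent.
frequent_items = [i for i in items if item_support[i] >= min_support]

brute_force_pairs = comb(len(items), 2)       # every possible pair of items
apriori_pairs = comb(len(frequent_items), 2)  # pairs built from frequent items only
reduction_pct = 100 * (1 - apriori_pairs / brute_force_pairs)

print(f"{brute_force_pairs} candidate pairs without pruning, {apriori_pairs} with pruning")
print(f"Reduction: {reduction_pct:.1f}%")
```

The same comparison can be made on the real data by replacing the toy list with the actual baskets and using the support threshold chosen for the answer.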

In [43]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

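# Load the data and keep only the columns needed to build baskets
df = pd.read_csv('bread basket.csv')
df_short = df[['Transaction', 'Item']]

# Collect the items of each transaction into one list per basket
# (building lists directly avoids splitting item names that contain commas)
data = df_short.groupby('Transaction')['Item'].apply(lambda items: items.astype(str).tolist()).tolist()

# One-hot encode the baskets: one row per transaction, one boolean column per item
encoder = TransactionEncoder()
data_encoded = encoder.fit(data).transform(data)
df_encoded = pd.DataFrame(data_encoded, columns=encoder.columns_)

# Mine frequent itemsets with a deliberately low threshold; filtering happens below
frequent_itemsets = apriori(df_encoded, min_support=0.00001, use_colnames=True)

# Keep only rules whose itemset support is at least 1%
rules = association_rules(frequent_itemsets, metric='support', min_threshold=0.01)

# Restrict to single-item -> single-item rules
rules_1_on_1 = rules[(rules['antecedents'].apply(lambda x: len(x) == 1) &
                      rules['consequents'].apply(lambda x: len(x) == 1))]

# Take the 10 highest-support rules; each pair appears in both directions
rules_1_on_1_sorted = rules_1_on_1.sort_values(by='support', ascending=False).head(10)

# Reduce the result to the antecedent, consequent, and support columns
rules_1_on_1_unique = rules_1_on_1_sorted.groupby(['antecedents', 'consequents'])['support'].mean().reset_index()
rules_1_on_1_unique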

| | antecedents | consequents | support |
|---|---|---|---|
| 0 | (Coffee) | (Bread) | 0.090016 |
| 1 | (Coffee) | (Cake) | 0.054728 |
| 2 | (Coffee) | (Tea) | 0.049868 |
| 3 | (Coffee) | (Pastry) | 0.047544 |
| 4 | (Coffee) | (Sandwich) | 0.038246 |
| 5 | (Bread) | (Coffee) | 0.090016 |
| 6 | (Cake) | (Coffee) | 0.054728 |
| 7 | (Tea) | (Coffee) | 0.049868 |
| 8 | (Pastry) | (Coffee) | 0.047544 |
| 9 | (Sandwich) | (Coffee) | 0.038246 |


As the table above shows, the five item pairs with the highest support are (from highest to lowest):

1. Coffee & Bread

2. Coffee & Cake

3. Coffee & Tea

4. Coffee & Pastry

5. Coffee & Sandwich

Thus I would recommend placing Coffee in the center, surrounded by the other five items.

For the apriori support threshold, I chose a very low value (0.00001) so that no candidate pairs are dropped at this stage. The actual filtering happens afterwards: `association_rules` keeps only rules with support of at least 0.01, and the final step keeps only single-item rules.

# Question 2 **(5 Points)**

Find out how/if the recommendations from the previous question change based on the time of day (morning, afternoon, evening). Comment on how similar/different the associations are.
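A possible first step is to split the transactions by time of day before mining each group separately. The sketch below is only an illustration on made-up rows: it assumes a timestamp column named `date_time`, and the cut points (before 12:00 is morning, 12:00 to 17:00 is afternoon, later is evening) are an arbitrary choice, not something the assignment specifies.

```python
import pandas as pd

# Toy rows (hypothetical); the real file is assumed to have a timestamp column.
df = pd.DataFrame({
    "Transaction": [1, 1, 2, 2, 3, 3],
    "Item": ["Coffee", "Bread", "Coffee", "Cake", "Tea", "Cake"],
    "date_time": ["2016-10-30 09:58", "2016-10-30 09:58",
                  "2016-10-30 13:05", "2016-10-30 13:05",
                  "2016-10-30 18:40", "2016-10-30 18:40"],
})

hours = pd.to_datetime(df["date_time"]).dt.hour

# Assumed cut points: [0, 12) morning, [12, 17) afternoon, [17, 24) evening.
df["period"] = pd.cut(hours, bins=[0, 12, 17, 24], right=False,
                      labels=["morning", "afternoon", "evening"])

# One list of items per transaction, kept separately for each period.
baskets_by_period = {
    period: group.groupby("Transaction")["Item"].apply(list).tolist()
    for period, group in df.groupby("period", observed=True)
}
print(baskets_by_period)
```

Each list in `baskets_by_period` can then be passed through the same encoding and rule-mining steps used in Question 1, so the top rules can be compared across periods.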

In [None]:
# Build a pipeline to process the data



# Question 3 **(10 Points)**

Find out if the day of the week (i.e., Monday, Tuesday, ...) affects the customers' purchase patterns. Compute the top 3 most common item associations for each day. Comment on how similar/different the rules are.

Use [to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) and [dayofweek](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.dayofweek.html) to generate the day of the week for any date.
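As a minimal sketch of that conversion, the example below parses a few made-up date strings and derives the weekday. On a pandas Series the attribute is reached through the `.dt` accessor; the sample dates and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample dates, not read from the dataset.
dates = pd.Series(["2016-10-30", "2016-10-31", "2016-11-05"])
parsed = pd.to_datetime(dates)

day_number = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
day_name = parsed.dt.day_name()   # readable label for the same information

print(pd.DataFrame({"date": dates, "dayofweek": day_number, "day_name": day_name}))
```

Grouping the transactions by `day_number` (or `day_name`) gives one subset per weekday, each of which can be mined for rules independently.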

# Question 4 **(8 Points)**

For the items that are bought together in more than 500 transactions:

1. For the sake of item promotion, suggest a strong rule indicating that the second item is *more likely than not* to be bought once the first one is bought.
2. Show a pair of items that seem to be ill-suited for being promoted together.

Explain your answers.
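The "more than 500 transactions" condition is a count, while apriori's `min_support` is a fraction, so the cutoff has to be converted or applied to raw counts. The sketch below counts co-occurring pairs directly on made-up baskets, with `min_count = 2` standing in for 500; all names and values are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy baskets (hypothetical); with the real data, min_count would be 500.
baskets = [
    ["Coffee", "Bread"],
    ["Coffee", "Bread", "Cake"],
    ["Coffee", "Cake"],
    ["Bread", "Tea"],
    ["Coffee", "Bread"],
]
min_count = 2  # stand-in for the 500-transaction cutoff

# Count how many baskets contain each unordered pair of distinct items.
pair_counts = Counter(
    pair
    for basket in baskets
    for pair in combinations(sorted(set(basket)), 2)
)

# "More than" is a strict comparison on the raw count.
frequent_pairs = {pair: c for pair, c in pair_counts.items() if c > min_count}

# The same cutoff as a support fraction. A support filter of the form
# support >= min_support would also admit pairs with exactly min_count baskets.
min_support = min_count / len(baskets)
print(frequent_pairs, min_support)
```

Once the qualifying pairs are known, confidence above 0.5 is one reasonable reading of "more likely than not", and a lift below 1 is one possible signal that two items are ill-suited for joint promotion.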

# Question 5 **(2 Points)**

Given the following rule from the dataset:

(Valentine's card) -> (Tshirt)

Find its lift, confidence, and support. Do these metrics support the claim that placing Valentine's cards next to the T-shirt stand will substantially increase T-shirt sales? Explain your conclusion.
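For reference, the three metrics of a rule A -> B can be computed from four transaction counts. The helper below is a sketch with a hypothetical name (`rule_metrics`) and made-up counts; it does not use values from the dataset.

```python
def rule_metrics(n_total, n_a, n_b, n_ab):
    """Support, confidence, and lift for the rule A -> B from raw transaction counts."""
    support = n_ab / n_total             # share of all transactions containing A and B
    confidence = n_ab / n_a              # share of A's transactions that also contain B
    lift = confidence / (n_b / n_total)  # confidence relative to B's overall frequency
    return support, confidence, lift

# Hypothetical counts, not taken from the Bread Basket data.
support, confidence, lift = rule_metrics(n_total=10_000, n_a=1_000, n_b=4_000, n_ab=600)

print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift close to 1 means B is bought about as often with A as without it, and a very small support means the rule rests on few transactions, so both are worth checking before reading much into a high confidence value.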

*Your answer goes here .... (i.e. edit this markdown cell by double clicking here)*

In [None]:
# Python code if any