---
# **Mod: Goal of Apriori Algorithm**
---

**Purpose:** We have been provided with a dataset of 7,501 transactions from a popular commodity shop. The goal is to identify the product pairs best suited to the shop owner's 'Buy 1 Get 1' offer. To do this, I applied the Apriori algorithm to find item pairs that are frequently bought together, based on the transaction patterns.

**Algorithm:** The Apriori algorithm is widely used for frequent itemset mining and association rule learning. **Support**, **confidence**, and **lift** are the key metrics for assessing the strength and relevance of the association rules it discovers in the data. A brief description of each metric is given below.

* $\color{green}{\text{Support}}$ tells you how frequent the items are in the dataset. 

* $\color{green}{\text{Confidence}}$ tells you how often items on the right-hand side (Y) are bought when the items on the left-hand side (X) are bought.

* $\color{green}{\text{Lift}}$ tells you how much more likely X and Y are to be bought together compared to if they were independent.
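
To make the three definitions concrete, here is a minimal sketch that computes them by hand on five made-up baskets. The baskets and the helper names `support`, `confidence`, and `lift` are illustrative assumptions, not part of the shop dataset or of any library.

```python
# Five hypothetical baskets (illustrative only, not taken from the shop data)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs", "tea"},
]

def support(itemset, transactions):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Of the baskets containing X, the fraction that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of X -> Y divided by the baseline frequency of Y."""
    return confidence(x, y, transactions) / support(y, transactions)

print("support    :", support({"milk", "bread"}, transactions))
print("confidence :", confidence({"milk"}, {"bread"}, transactions))
print("lift       :", lift({"milk"}, {"bread"}, transactions))
```

In this toy data the rule milk → bread has support 0.6, confidence about 0.75, and lift about 0.94. A lift below 1 means the pair co-occurs slightly less often than independence would predict, while a lift above 1 indicates a positive association.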


---
# **Mod: Packages**
---

In [109]:
# !pip install apyori

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

---
# **Mod: Read Data**
---

In [None]:
# Reading the dataset (header=None because the file has no header row)
df = pd.read_csv('.../Market_Basket_Optimisation.csv', low_memory=False, header=None)

In [111]:
#viewing dataset
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


**Observations**

- Each row provides a list of items purchased together.

- The customer in row 1 purchased burgers, meatballs, and eggs.

- The customer in row 4 purchased mineral water, milk, energy bar, whole wheat rice, and green tea.

In [112]:
# Flatten every cell of the table into one long column of items

# Note: this is an alias, not a copy, so renaming the columns below also renames them on df
df_all_items = df

# Prefix the integer column labels with "c" (0 -> "c0", 1 -> "c1", ...)
df_all_items.columns = "c" + df_all_items.columns.astype(str)

# Collect all cells (row by row, NaN included) into a single-column DataFrame
lst = pd.DataFrame({'col_all': df_all_items.to_numpy().ravel()})

In [113]:
df_table = pd.DataFrame(lst.col_all.value_counts(normalize=False)).reset_index()
df_table = df_table.sort_values(by='count', ascending=False)

In [114]:
print(f"Number of unique items: {lst.col_all.nunique()}")

Number of unique items: 120


## **Observations: Top 10 items purchased at the store**

In [115]:
import plotly.express as px

fig = px.histogram(df_table.head(10), x="col_all", y="count",  text_auto=True,  color_discrete_sequence=["orange"])
fig.update_layout(bargap=0.6)
fig.update_layout(yaxis_title='Number of items purchased', xaxis_title='List of items', title='')
fig.show()

## **Observations: Bottom 10 items purchased at the store**

In [116]:
import plotly.express as px

fig = px.histogram(df_table.tail(10), x="col_all", y="count",  text_auto=True,  color_discrete_sequence=["orange"])
fig.update_layout(bargap=0.6)
fig.update_layout(yaxis_title='Number of items purchased', xaxis_title='List of items')
fig.show()

In [117]:
# Count the non-NaN values (i.e. the number of items) in each row
df['non_na_count_row'] = df.count(axis=1)
df.head(10)

Unnamed: 0,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,...,c11,c12,c13,c14,c15,c16,c17,c18,c19,non_na_count_row
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,...,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil,20
1,burgers,meatballs,eggs,,,,,,,,...,,,,,,,,,,3
2,chutney,,,,,,,,,,...,,,,,,,,,,1
3,turkey,avocado,,,,,,,,,...,,,,,,,,,,2
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,...,,,,,,,,,,5
5,low fat yogurt,,,,,,,,,,...,,,,,,,,,,1
6,whole wheat pasta,french fries,,,,,,,,,...,,,,,,,,,,2
7,soup,light cream,shallot,,,,,,,,...,,,,,,,,,,3
8,frozen vegetables,spaghetti,green tea,,,,,,,,...,,,,,,,,,,3
9,french fries,,,,,,,,,,...,,,,,,,,,,1


In [118]:
res = pd.DataFrame(df.non_na_count_row.value_counts()).reset_index()
res = res.sort_values(by="non_na_count_row", ascending=True)
res['non_na_count_row'] = res['non_na_count_row'].astype(str)
res

Unnamed: 0,non_na_count_row,count
0,1,1754
1,2,1358
2,3,1044
3,4,816
4,5,665
5,6,495
6,7,388
7,8,327
8,9,259
9,10,139


In [119]:
fig = px.histogram(res, x="non_na_count_row", y="count",  text_auto=True,  color_discrete_sequence=["orange"])
fig.update_layout(bargap=0.2)
fig.update_layout(yaxis_title='Number of baskets', xaxis_title='Number of items in basket')
fig.show()

**Observations**

- 1754 baskets had one-item purchases.

- 1358 baskets had two-item purchases.

- 1044 baskets had three-item purchases.

- 1 basket had 20-item purchases. 

In [120]:
#basic information of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   c0                7501 non-null   object
 1   c1                5747 non-null   object
 2   c2                4389 non-null   object
 3   c3                3345 non-null   object
 4   c4                2529 non-null   object
 5   c5                1864 non-null   object
 6   c6                1369 non-null   object
 7   c7                981 non-null    object
 8   c8                654 non-null    object
 9   c9                395 non-null    object
 10  c10               256 non-null    object
 11  c11               154 non-null    object
 12  c12               87 non-null     object
 13  c13               47 non-null     object
 14  c14               25 non-null     object
 15  c15               8 non-null      object
 16  c16               4 non-null      object
 17  c17           

In [121]:
# Check whether the dataset contains any missing values (NaN)
print('Any missing data or NaN in the dataset:', df.isnull().values.any())

Any missing data or NaN in the dataset: True


In [122]:
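# Build one list of item names per transaction (first 20 columns only)
# Note: str() turns missing cells into the string 'nan', so apriori will treat 'nan' as an item
list_of_transactions = []
values = df.values  # convert once instead of on every cell access

for i in range(df.shape[0]):
    list_of_transactions.append([str(values[i, j]) for j in range(20)])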

In [123]:
list_of_transactions[0] 

['shrimp',
 'almonds',
 'avocado',
 'vegetables mix',
 'green grapes',
 'whole weat flour',
 'yams',
 'cottage cheese',
 'energy drink',
 'tomato juice',
 'low fat yogurt',
 'green tea',
 'honey',
 'salad',
 'mineral water',
 'salmon',
 'antioxydant juice',
 'frozen smoothie',
 'spinach',
 'olive oil']
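
Because the loop above converts every cell with `str()`, shorter baskets are padded with the string `'nan'`, which the algorithm then counts as if it were a product. This is likely why some rules further down show an empty right-hand side. A minimal sketch of building the transactions without those placeholders is shown below, using a small hypothetical frame (`toy`) in place of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-basket frame standing in for the real dataset
toy = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", np.nan, np.nan],
    ["turkey", "avocado", np.nan],
])

# Keep only the non-missing cells of each row
clean_transactions = [
    [item for item in row if pd.notna(item)]
    for row in toy.itertuples(index=False, name=None)
]

print(clean_transactions)
```

On the notebook's `df`, the same comprehension would need to run over the item columns only (for example `df.iloc[:, :20]`), since `df` also carries the `non_na_count_row` column at this point. Rerunning the analysis on cleaned transactions would change the supports and rules reported below.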

---
# **Mod: Train Apriori Algorithm**
---

In [124]:
# Training the Apriori algorithm on list_of_transactions
from apyori import apriori
rules = apriori(list_of_transactions, min_support=0.004, min_confidence=0.2, min_lift=3, min_length=2)
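
To get a feel for the `min_support` threshold, it helps to translate it into a number of baskets. The short calculation below is only a sketch of that arithmetic; the variable names are my own.

```python
import math

n_transactions = 7501   # number of rows in the dataset
min_support = 0.004     # threshold passed to apriori above

# Smallest basket count whose share of all baskets reaches min_support
min_count = math.ceil(min_support * n_transactions)
print(f"An itemset must appear in at least {min_count} baskets.")

# Example: the support reported below for {light cream, chicken}
pair_support = 0.004532728969470737
print(f"That pair appears in about {round(pair_support * n_transactions)} baskets.")
```

So a support of 0.004 is a low bar in absolute terms: a few dozen baskets out of 7,501 are enough for an itemset to qualify.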

In [125]:
# Materialize the rules returned by apriori into a list
results = list(rules)
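
For readers curious about what happens inside the library call, the following is a minimal pure-Python sketch of the level-wise search that gives Apriori its name. It is an illustration under my own assumptions (function name `frequent_itemsets`, toy baskets), not the `apyori` implementation, and it only finds frequent itemsets without deriving rules.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search: returns {itemset: support}."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item seen in the data
    current = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while current:
        # Count how many baskets contain each candidate
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its k-1... subsets of size k-1 are frequent
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical baskets (illustrative only)
toy_transactions = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["bread", "eggs"],
    ["milk", "eggs"],
    ["milk", "bread", "eggs", "tea"],
]

found = frequent_itemsets(toy_transactions, min_support=0.6)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is the key idea: an itemset can only be frequent if all of its smaller subsets are frequent, so most large combinations never need to be counted.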

In [126]:
# The first five rules in the results list
results[0:5]

[RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
 RelationRecord(items=frozenset({'pasta', 'escalope'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]),
 RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.015997866951073192, ordered_statistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}), items_add=frozenset({'ground beef'}), 

In [127]:
# To inspect the rules more easily, extract their elements into a pandas DataFrame.
# For each record, only the first ordered statistic is used, and only the first item
# of its left-hand side (items_base) and right-hand side (items_add) is kept,
# together with the support, confidence and lift values.

def inspect(results):
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

res = pd.DataFrame(inspect(results), columns=['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])
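
Since `inspect` keeps only the first item on each side, rules built from three or more items can look like duplicates of pair rules in the tables that follow. One possible alternative, sketched below, keeps every item and every ordered statistic. The `namedtuple` definitions are stand-ins that mirror the field names visible in the output above so the example runs without `apyori`; `rules_to_frame` and the second sample record (placeholder item names and numbers) are hypothetical.

```python
from collections import namedtuple
import pandas as pd

# Stand-ins mirroring the field names shown in the apyori output above
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

def rules_to_frame(records):
    """One row per ordered statistic, with all items on each side joined by commas."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "Left Hand Side": ", ".join(sorted(stat.items_base)),
                "Right Hand Side": ", ".join(sorted(stat.items_add)),
                "Support": record.support,
                "Confidence": stat.confidence,
                "Lift": stat.lift,
            })
    return pd.DataFrame(rows)

sample = [
    # First record copied from the results shown above
    RelationRecord(
        items=frozenset({"chicken", "light cream"}),
        support=0.004532728969470737,
        ordered_statistics=[
            OrderedStatistic(frozenset({"light cream"}), frozenset({"chicken"}),
                             0.29059829059829057, 4.84395061728395),
        ],
    ),
    # Hypothetical record with a two-item left-hand side (placeholder values)
    RelationRecord(
        items=frozenset({"item a", "item b", "item c"}),
        support=0.01,
        ordered_statistics=[
            OrderedStatistic(frozenset({"item a", "item b"}), frozenset({"item c"}), 0.5, 3.5),
        ],
    ),
]

rules_df = rules_to_frame(sample)
print(rules_df)
```

With this layout a rule such as "item a, item b → item c" stays distinguishable from the pair rule "item a → item c".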

------
# **Mod: Observation - Support** 
---

In [128]:
# Results
res = res.drop_duplicates(subset=res.columns, keep='last')
res.nlargest(n = 10, columns = 'Support')

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
22,herb & pepper,ground beef,0.015998,0.32345,3.291994
37,spaghetti,ground beef,0.008666,0.311005,3.165328
5,whole wheat pasta,olive oil,0.007999,0.271493,4.12241
29,whole wheat pasta,olive oil,0.007999,0.271493,4.130772
16,mineral water,frozen vegetables,0.007199,0.305085,3.200616
27,milk,olive oil,0.007199,0.203008,3.082509
41,mineral water,,0.007199,0.305085,3.200616
51,milk,olive oil,0.007199,0.203008,3.088761
19,spaghetti,tomatoes,0.006666,0.239234,3.498046
44,spaghetti,,0.006666,0.239234,3.498046


------
# **Mod: Confidence - The likelihood that item Y is bought when item X is bought**
---

In [129]:
# Results
res = res.sort_values(by='Confidence', ascending=False)
res = res.drop_duplicates(subset=res.columns, keep='last')
res.nlargest(n = 10, columns = 'Confidence')

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
54,olive oil,,0.004399,0.611111,3.509912
32,olive oil,spaghetti,0.004399,0.611111,3.509912
9,ground beef,spaghetti,0.004799,0.571429,3.281995
34,ground beef,,0.004799,0.571429,3.281995
50,ground beef,,0.005999,0.523256,3.005315
26,ground beef,spaghetti,0.005999,0.523256,3.005315
47,spaghetti,ground beef,0.006399,0.393443,4.00436
46,mineral water,ground beef,0.006666,0.390625,3.975683
25,tomato sauce,ground beef,0.005333,0.377358,3.840659
12,pasta,,0.005866,0.372881,4.700812


## Observation 

In [130]:
for row in [32, 9]:
    print(f"- If a customer buys {res['Left Hand Side'][row]} then there's a {round( 100*res['Confidence'][row], 1)}% chance that they will buy {res['Right Hand Side'][row]}.")
    print("")

- If a customer buys olive oil then there's a 61.1% chance that they will buy spaghetti.

- If a customer buys ground beef then there's a 57.1% chance that they will buy spaghetti.



------
# **Mod: Lift - How much more often two items are bought together than expected if they were independent**
---

In [131]:
# Results
res = res.sort_values(by='Lift', ascending=False)
res = res.drop_duplicates(subset=res.columns, keep='last')
res.head(10)

Unnamed: 0,Left Hand Side,Right Hand Side,Support,Confidence,Lift
7,light cream,chicken,0.004533,0.290598,4.843951
12,pasta,,0.005866,0.372881,4.700812
2,pasta,escalope,0.005866,0.372881,4.700812
30,pasta,shrimp,0.005066,0.322034,4.515096
6,pasta,shrimp,0.005066,0.322034,4.506672
55,ground beef,mineral water,0.004399,0.259843,4.350622
10,ground beef,herb & pepper,0.004133,0.206667,4.178455
35,ground beef,,0.004133,0.206667,4.178455
29,whole wheat pasta,olive oil,0.007999,0.271493,4.130772
5,whole wheat pasta,olive oil,0.007999,0.271493,4.12241


## Observation 

In [132]:
for row in [7, 55, 5]:
    print(f"- A lift for {res['Left Hand Side'][row]} and {res['Right Hand Side'][row]} is {round( res['Lift'][row], 1)} which means that both products are positively associated.")
    print("")

- A lift for light cream and chicken is 4.8 which means that both products are positively associated.

- A lift for ground beef and mineral water is 4.4 which means that both products are positively associated.

- A lift for whole wheat pasta and olive oil is 4.1 which means that both products are positively associated.
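
The three metrics are tied together by their definitions, so the individual item frequencies can be recovered from any one rule. The sketch below does this for the light cream → chicken rule, using the values printed in the results list earlier; the variable names are my own.

```python
# Values reported above for the rule light cream -> chicken
support_xy = 0.004532728969470737   # support of {light cream, chicken}
confidence = 0.29059829059829057    # P(chicken | light cream)
lift = 4.84395061728395
n_transactions = 7501

# confidence = support(X and Y) / support(X)  =>  support(X) = support(X and Y) / confidence
support_x = support_xy / confidence
# lift = confidence / support(Y)              =>  support(Y) = confidence / lift
support_y = confidence / lift

print(f"implied support(light cream) = {support_x:.4f} (~{round(support_x * n_transactions)} baskets)")
print(f"implied support(chicken)     = {support_y:.4f} (~{round(support_y * n_transactions)} baskets)")

# Lift written as observed co-occurrence over the co-occurrence expected under independence
recomputed_lift = support_xy / (support_x * support_y)
print(f"recomputed lift = {recomputed_lift:.4f}")
```

Read this way, a lift of about 4.8 says that light cream and chicken appear together roughly 4.8 times as often as they would if the two purchases were unrelated.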



**Observation**

Using results from the above analysis, the shop owner might include the following in their weekly advertisement to customers:

- Buy olive oil and get spaghetti at half price.

- Buy ground beef and one packet of spaghetti, and get a second packet of spaghetti free.

---