



# Apriori Algorithm - Association rule learning

In this tutorial, we use association rule learning with the Apriori algorithm to discover interesting relations between our variables. We will look at links between different items within a shop.

For more information about these topics, please check in the following links:

Association rule learning - https://en.wikipedia.org/wiki/Association_rule_learning

Apriori algorithm - https://en.wikipedia.org/wiki/Apriori_algorithm

Data set - https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

# Exploring the Data

When selecting data for association rule mining using the Apriori algorithm, several important considerations must be taken into account to ensure meaningful and actionable results. Here are the key aspects to consider:

1) Relevance of Data: Check the frequency of the data items. Items that appear very infrequently may not provide useful rules. **- You must show this in your report.**

2) Categorical Data: The Apriori algorithm is typically used with categorical data (e.g., items in a shopping cart). Ensure that the data is categorical, or has been appropriately binned if it is numerical. **- You must show this in your report.**

3) Data Cleaning: Ensure that the data is clean. Missing, duplicated, incomplete, or incorrect data can lead to misleading associations. **- You must show this in your report.**

4) Item Grouping: You will need to be able to group the items into different sets (e.g., products bought together in a single shopping cart, or products bought by a single customer). **- You should think about this when selecting your data.**

# Let's Start our setup

Google Colab provides a cloud-based Python environment with many pre-installed libraries. However, it doesn't include every possible library, especially specialized ones such as those used for machine learning and data mining tasks like the Apriori algorithm. This line:

"*!pip install apyori*"

ensures that the required library is installed before you attempt to use it in your code.



In [1]:
!pip install apyori
from apyori import apriori
import pandas as pd


Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5954 sha256=b9c770ff1980f520f4a89bd8af77c058ca98872649dc120397f1cd8841185c82
  Stored in directory: /root/.cache/pip/wheels/77/3d/a6/d317a6fb32be58a602b1e8c6b5d6f31f79322da554cad2a5ea
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2


# Exploring the Data - Code

**1. Categorical Data**

The dataset provided contains three columns: Member_number, Date, and itemDescription. Here, the itemDescription column represents the items purchased, which are categorical in nature. Categorical data refers to data that can be divided into specific categories or groups. In this case, each item in the itemDescription column is a category, like "tropical fruit," "whole milk," or "pip fruit."

**Why Categorical Data is Important for Apriori:**

The Apriori algorithm is designed to work with categorical data, specifically item sets, to identify frequent itemsets and generate association rules.

The items in itemDescription represent distinct categories of products, which can be analysed for patterns of co-occurrence (e.g., which items are frequently bought together).

**2. Item Grouping**

Both the Member_number and Date columns can be used to group the transactions.

Group Items into Transactions:

To perform association rule mining, we need to group the rows into transactions. This can be done by grouping the dataset by Member_number, by Date, or by both. Each grouping can give you different association rules.
       


In [2]:
# Task 1 - read the Groceries_dataset and find out if the data can be grouped and is suitable for association rule mining


The code below shows that there are enough transactions to create meaningful association rules and that there are no incorrect item descriptions. I could potentially remove items with counts lower than 10, although I could also increase the confidence threshold in the analysis to eliminate any weak rules during the association rule mining process. However, it is important to be aware of low counts within your data exploration.
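As a rough illustration of this kind of frequency check, here is a minimal sketch on a made-up toy column. The item values and the `min_count` threshold of 2 are assumptions chosen for the toy data; on the real dataset the threshold of 10 mentioned above would be the natural starting point.

```python
import pandas as pd

# Toy stand-in for the itemDescription column (values are made up)
items = pd.Series(
    ["whole milk", "whole milk", "pip fruit",
     "tropical fruit", "whole milk", "pip fruit"],
    name="itemDescription",
)

# Count how often each item appears, most frequent first
item_counts = items.value_counts()

# Flag items that fall below a hypothetical minimum count
min_count = 2
rare_items = item_counts[item_counts < min_count]

print(item_counts)
print("Rare items:", list(rare_items.index))
```

Rare items do not have to be dropped. Knowing which ones they are simply helps when interpreting rules with very low support.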

In [3]:
# Task 2  - find out if values are enough to make good rules



The code below cleans the data by checking for any duplicates and identifying any missing values that might impact the rules generated.
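A minimal sketch of these checks, assuming a DataFrame with the three columns described earlier. The rows below are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the groceries data (rows are made up)
df = pd.DataFrame({
    "Member_number": [1001, 1001, 1002, 1003],
    "Date": ["01-01-2015", "01-01-2015", "15-01-2015", "03-02-2015"],
    "itemDescription": ["whole milk", "whole milk", None, "pip fruit"],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = df.duplicated().sum()

# One possible clean-up: drop incomplete rows, then exact duplicates
cleaned = df.dropna().drop_duplicates()

print(missing_per_column)
print("Duplicated rows:", duplicate_rows)
print("Rows after cleaning:", len(cleaned))
```

Whether a duplicated row is an error or a genuine repeat purchase depends on the data, so it is worth inspecting the duplicates before removing them.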

In [4]:
# Task 3 - Check for missing data and any duplicated records


# Check for missing data


We will then create three groups and examine the rules associated with each grouping. The three groups are:

1) Grouping by month of transaction.

2) Grouping by member's transactions.

3) Grouping by member's transactions per month.
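The three groupings could be sketched as follows on a small made-up DataFrame. The day-month-year date format, the `YearMonth` column name, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the groceries data (rows are made up)
df = pd.DataFrame({
    "Member_number": [1001, 1001, 1002, 1002, 1001],
    "Date": ["01-01-2015", "01-01-2015", "15-01-2015",
             "03-02-2015", "20-02-2015"],
    "itemDescription": ["whole milk", "pip fruit", "whole milk",
                        "tropical fruit", "pip fruit"],
})

# Parse the text dates (day-month-year is assumed) and derive a Year-Month key
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
df["YearMonth"] = df["Date"].dt.to_period("M").astype(str)

# One list of items per group, for each of the three groupings
month_groups = df.groupby("YearMonth")["itemDescription"].apply(list)
member_groups = df.groupby("Member_number")["itemDescription"].apply(list)
member_month_groups = (
    df.groupby(["Member_number", "YearMonth"])["itemDescription"].apply(list)
)

print(month_groups)
print(member_groups)
print(member_month_groups)
```

Coarser groupings (by month only) produce fewer, larger baskets, while finer groupings (by member and month) produce many small baskets, which is why the resulting rules can differ.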

In [5]:
# Task 4 - Grouping the data

# I first need to change the data into a date data type


# Extract Year-Month and create a new column


# Create 3 groupings  -  1- Group items by Year and Month  , 2 - group the items by members , 3 - group the items by both members and date




Apyori is a Python library that provides an implementation of the Apriori algorithm, which is a popular algorithm for association rule mining in data mining. Association rule mining is a technique to discover relationships between variables in large datasets.

The apriori() function in apyori takes the following parameters:

  **transactions:** A list of transactions. Each transaction is itself a list containing the items bought together.

  **min_support:** Minimum support threshold. This parameter specifies the minimum frequency of an itemset to be considered significant. It is usually set to a small value between 0 and 1.

  **min_confidence:** Minimum confidence threshold. This parameter specifies the minimum confidence level for the rules to be considered significant. It is a value between 0 and 1.

  **min_lift :** Minimum lift threshold. This parameter specifies the minimum lift value for the rules to be considered significant. Lift measures how much more likely the antecedent and consequent of a rule are to occur together compared to if they were statistically independent.
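To make support, confidence, and lift concrete, the sketch below computes them by hand for a single rule on four made-up transactions. It uses plain Python rather than apyori, and the `support` helper is a hypothetical name introduced only for this example.

```python
# Four made-up transactions, each a set of items
transactions = [
    {"whole milk", "pip fruit"},
    {"whole milk", "tropical fruit"},
    {"whole milk", "pip fruit", "tropical fruit"},
    {"pip fruit"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule under test: pip fruit -> whole milk
lhs, rhs = {"pip fruit"}, {"whole milk"}

rule_support = support(lhs | rhs)         # both sides appear together
confidence = rule_support / support(lhs)  # P(rhs | lhs)
lift = confidence / support(rhs)          # confidence relative to P(rhs)

print(rule_support, confidence, lift)
```

Here the lift is below 1, so in this toy data buying pip fruit makes whole milk slightly less likely than its overall frequency would suggest.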

Let's try to calculate values for these parameters. These targets should be supported by literature.

So, as per our target:

  **min_support:** An item must appear in the list at least 3 times, divided by len(dataset), i.e. 3 / 728 = 0.00412087912

  **min_confidence:** We will start with 0.8 and then increase or decrease it as per the rules observed.

  **min_lift:** Similar to min_confidence.

  **min_length & max_length:** Since we want just one product that goes along with a product, min and max length must be 2.
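The min_support arithmetic above can be reproduced in a couple of lines. The variable names are illustrative, and 728 is the number of transactions quoted in the text, which will change with the grouping used.

```python
# Values taken from the text above; names are illustrative
min_occurrences = 3    # an itemset must appear at least this many times
n_transactions = 728   # number of transactions in the grouped data

min_support = min_occurrences / n_transactions
min_confidence = 0.8   # starting point, to be tuned after inspecting the rules

print(round(min_support, 11))
```

Computing the threshold from the transaction count, rather than hard-coding the decimal, keeps it correct when a different grouping yields a different number of transactions.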


Unlike all of the other models, we can't use a DataFrame or Series with this model, so we have to put the data into a list. Below is the code to iterate over the DataFrame and add it to a list.
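As a sketch of the idea, a grouped Series whose values are lists of items can be turned into a plain list of lists like this. The `grouped` Series is made up, and removing repeated items within a basket is an optional tidy-up rather than a requirement.

```python
import pandas as pd

# Toy grouped data: one list of items per group (values are made up)
grouped = pd.Series(
    [["whole milk", "pip fruit"],
     ["whole milk", "tropical fruit", "whole milk"]],
    index=["2015-01", "2015-02"],
)

# Build a list of lists, dropping repeated items within each basket
# while keeping their original order (optional tidy-up)
transactions = [list(dict.fromkeys(items)) for items in grouped]

print(transactions)
```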



In [6]:
# Task 5 - Turn each grouped data from task 4 into a python list


# Task 6 - Create 3 apriori model using the python list data




Unfortunately, the output isn't in a DataFrame either, so I am going to convert it back into one. The code is a bit more complex and uses list comprehensions to create arrays for each of the values. You should be able to use this code to extract the data and put it back into a DataFrame.

In [7]:
# Displaying the first results coming directly from the output of the apriori function
Month_group_results = list(Month_group_rules)
Member_group_results = list(Member_group_rules)
Member_Month_results = list(Member_Month_rules)

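# Putting the results well organised into a Pandas DataFrame
def inspect(results):
    # Each result is a record of (items, support, ordered_statistics).
    # Only the first ordered statistic (rule) of each record is used here.
    lhs         = [tuple(result[2][0][0])[0] for result in results]  # first item of the antecedent
    rhs         = [tuple(result[2][0][1])[0] for result in results]  # first item of the consequent
    supports    = [result[1] for result in results]                  # support of the whole itemset
    confidences = [result[2][0][2] for result in results]            # confidence of that rule
    lifts       = [result[2][0][3] for result in results]            # lift of that rule
    return list(zip(lhs, rhs, supports, confidences, lifts))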

Month_group_results_DataFrame = pd.DataFrame(inspect(Month_group_results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])
Member_group_results_DataFrame = pd.DataFrame(inspect(Member_group_results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])
Member_Month_results_DataFrame = pd.DataFrame(inspect(Member_Month_results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])


NameError: name 'Month_group_rules' is not defined

In [None]:
# Task 7 Sort the new dataframes by confidence and displaying the results




I am going to do an analysis on the "Rules for transactions by members in a month".

Each row in the table represents an association rule, showing items that tend to be purchased together. Let's break down what each column represents and what the results tell us.

Columns:

Left Hand Side (LHS): The item(s) on the left-hand side of the rule, which can be considered as the "antecedent" or "if" part of the rule.

Right Hand Side (RHS): The item(s) on the right-hand side of the rule, which can be considered as the "consequent" or "then" part of the rule.

Support: The proportion of transactions in which both the LHS and RHS items appear together. A support of 0.000072 means that this combination of items appears in 0.0072% of the total transactions.
   
Confidence: The likelihood that a transaction containing the LHS will also contain the RHS. A confidence of 1.0 (or 100%) indicates that every time the LHS appears, the RHS also appears.
  
Lift: This metric measures how much more likely the RHS is to appear when the LHS is present, compared to its general occurrence in the dataset. A lift greater than 1 indicates a positive correlation between the items.

**Analysis:**

All the rules shown have a confidence of 1.0, meaning that whenever the LHS item is purchased, the RHS item is always purchased as well. This indicates a strong relationship between these items.
        
The lift values are significantly greater than 1, suggesting that the RHS items are much more likely to be purchased when the LHS items are in the basket compared to random chance. For example, a lift of 114.92 for "kitchen utensil -> pasta" means that pasta is about 115 times more likely to be bought when a kitchen utensil is purchased than it would be by random chance.
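Because lift is the rule's confidence divided by the support of the RHS, the figures quoted above also tell us how common the RHS item is overall. The short calculation below is only a sanity check on those quoted numbers, and the variable names are illustrative.

```python
# Figures quoted in the analysis for "kitchen utensil -> pasta"
confidence = 1.0
lift = 114.92

# lift = confidence / support(RHS), so support(RHS) = confidence / lift
rhs_support = confidence / lift

print(f"Implied support of the RHS item: {rhs_support:.4%}")
```

In other words, pasta appears in a little under 1% of transactions, which is why a rule with confidence 1.0 produces such a large lift.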

The support values are very low (0.000072), meaning these item combinations are quite rare in the dataset. This suggests that while the associations are strong, they occur in only a tiny fraction of all transactions.