<a href="https://colab.research.google.com/github/NainaniJatinZ/MachineLearningRepo/blob/main/AssociationRuleLearning/ARL_apriori.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Association Rule Learning: Apriori


- Association rules uncover relationships between items that tend to occur together in large transaction databases.
- The rules do not capture an individual's preferences; they find relationships between the sets of items within each distinct transaction. This is what makes them different from collaborative filtering.
- E.g., association rules: lists of items with unique transaction IDs (from all users) are studied as one group --> placement of items in aisles
- Collaborative filtering, in contrast, ties all transactions back to a user ID to identify similarity between users' preferences --> recommendation

- Association Rule: Ex. {X → Y} represents finding Y in a basket that already contains X
- Itemset: Ex. {X,Y} is the list of all items that form the association rule
- Support: Fraction of transactions containing the itemset
- Confidence: Probability of occurrence of {Y} given {X} is present
- Lift: Ratio of confidence to baseline probability of occurrence of {Y}

Rule: ({a,b} -> {c,d})
then {a,b} is the Antecedent
and {c,d} is the Consequent
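
To make the three measures concrete, here is a minimal sketch that computes them on a handful of made-up transactions. The transaction data and the helper names (`support`, `confidence`, `lift`) are assumptions for illustration only; they are not part of the notebook's dataset or of any library.

```python
# Hypothetical toy transactions, invented purely for illustration.
transactions = [
    {"bread", "egg", "milk"},
    {"bread", "egg"},
    {"bread", "milk"},
    {"egg", "milk"},
    {"bread", "egg", "vegetables"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X and Y together) / support(X), an estimate of P(Y | X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Rule {bread} -> {egg}: antecedent is {bread}, consequent is {egg}.
antecedent, consequent = {"bread"}, {"egg"}
print("support   :", support(antecedent | consequent, transactions))
print("confidence:", confidence(antecedent, consequent, transactions))
print("lift      :", lift(antecedent, consequent, transactions))
```

A lift below 1, as in this toy rule, means the consequent is slightly less likely in baskets containing the antecedent than in baskets overall; a lift above 1 points to a positive association.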

![picture](https://miro.medium.com/max/3000/1*bqdq-z4Ec7Uac3TT3H_1Gg.png)

![picture](https://miro.medium.com/max/3000/1*E3mNKHcudWzHySGMvo_vPg.png)

![picture](https://miro.medium.com/max/3000/1*Rg429lteTXRKdYgCiHmLVw.png)

# Step 1: Generating itemsets from a list of items

--> Keep only the itemsets whose support value (fraction of transactions containing the itemset) is above a minimum threshold, minsup

--> Low support for an itemset means that we don't have enough data on it to form a rule.

## Apriori Principle 

--> All subsets of a frequent itemset must also be frequent

--> So if the support value of {Bread, Egg, Vegetables} is above minsup, then we can be assured that the support value of {Bread, Egg} is above minsup too.


--> This is called the **anti-monotone property** of support: if we drop an item from an itemset, the support value of the new itemset will either stay the same or increase.

--> This principle makes it easy to prune all supersets of an itemset that does not satisfy minsup.
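
The anti-monotone property can be checked directly on a small example. The sketch below uses made-up transactions and a hypothetical `support` helper; it compares the support of {bread, egg, vegetables} with the support of each of its proper subsets.

```python
from itertools import combinations

# Hypothetical toy transactions, invented purely for illustration.
transactions = [
    {"bread", "egg", "milk"},
    {"bread", "egg"},
    {"bread", "milk"},
    {"egg", "milk"},
    {"bread", "egg", "vegetables"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

full = frozenset({"bread", "egg", "vegetables"})
full_support = support(full, transactions)

# Support of every non-empty proper subset of the full itemset.
subset_supports = {
    subset: support(subset, transactions)
    for size in range(1, len(full))
    for subset in combinations(sorted(full), size)
}

for subset, s in subset_supports.items():
    # Dropping items can only keep the support the same or raise it.
    print(subset, s, ">=", full_support)
```

Read in the other direction, this is the pruning rule: once a subset falls below minsup, every itemset that contains it must fall below minsup as well, so it never needs to be counted.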

## Apriori Algorithm

refer: https://annalyzin.files.wordpress.com/2016/04/association-rules-apriori-tutorial-explanation.gif

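Generate all frequent itemsets (support ≥ minsup) having only one item. Next, generate candidate itemsets of length 2 as all possible combinations of those frequent items, and prune the candidates whose support falls below minsup. Repeat for itemsets of length 3, 4, and so on, building each level from the frequent itemsets of the previous one, until no new frequent itemsets are found.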
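
The level-wise search can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used by `apyori`: the function name `frequent_itemsets` and the toy transactions are hypothetical, and support is recounted by scanning all transactions for every candidate.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Level-wise search; returns {frozenset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= minsup:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune (Apriori principle): every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support and keep only the candidates that reach minsup.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= minsup:
                current[c] = s
    return frequent

# Hypothetical toy transactions, invented purely for illustration.
transactions = [
    {"bread", "egg", "milk"},
    {"bread", "egg"},
    {"bread", "milk"},
    {"egg", "milk"},
    {"bread", "egg", "vegetables"},
]
result = frequent_itemsets(transactions, minsup=0.4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is where the Apriori principle saves work: a candidate with any infrequent subset is discarded before its support is ever counted.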


# Step 2: Generating all possible rules from frequent itemsets 

--> forming candidate rules --> {a,b,c,d} has candidates such as (a,b,c->d); (a,c->b,d); (b->a,c,d) and so on

--> Aim is to identify rules that fall above a minimum confidence level (minconf).

--> Just like the anti-monotone property of support, confidence of rules generated from the same itemset also follows an anti-monotone property: moving items from the antecedent to the consequent can never increase confidence.

--> So this means that confidence of (a,b,c → d) ≥ (b,c → a,d) ≥ (c → a,b,d). As a reminder, confidence for {X → Y} = support of {X,Y} / support of {X}
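
The ordering follows directly from the definition of confidence. Writing supp for support and conf for confidence (notation chosen here for brevity), all three rules share the same numerator because they come from the same itemset {a,b,c,d}:

```latex
\begin{aligned}
\mathrm{conf}(\{a,b,c\} \to \{d\}) &= \frac{\mathrm{supp}(\{a,b,c,d\})}{\mathrm{supp}(\{a,b,c\})} \\
\mathrm{conf}(\{b,c\} \to \{a,d\}) &= \frac{\mathrm{supp}(\{a,b,c,d\})}{\mathrm{supp}(\{b,c\})} \\
\mathrm{conf}(\{c\} \to \{a,b,d\}) &= \frac{\mathrm{supp}(\{a,b,c,d\})}{\mathrm{supp}(\{c\})}
\end{aligned}
```

By the anti-monotone property of support, supp({a,b,c}) ≤ supp({b,c}) ≤ supp({c}). The denominators grow from top to bottom while the numerator stays fixed, so the confidences can only shrink.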



### Pruning using the above mentioned property of confidence

We start with a frequent itemset {a,b,c,d} and form rules with just one item in the consequent. Remove the rules that fail to satisfy the minconf condition. Then form rules with larger consequents by combining the consequents of the rules that remain. Keep repeating until only one item is left in the antecedent. This process has to be done for every frequent itemset.
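
A minimal sketch of this pruning procedure for a single frequent itemset follows. The function name `rules_from_itemset` and the support values are hypothetical and chosen only for illustration; the sketch assumes a dictionary holding the support of the itemset and of all its non-empty subsets.

```python
from itertools import combinations

def rules_from_itemset(itemset, support, minconf):
    """Generate rules X -> Y from one frequent itemset, growing the consequent.

    `support` maps frozensets to support values and must contain `itemset`
    and all of its non-empty subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    # Start with every single item as a candidate consequent.
    consequents = [frozenset([item]) for item in itemset]
    while consequents and len(consequents[0]) < len(itemset):
        kept = set()
        for y in consequents:
            x = itemset - y  # antecedent is whatever is left over
            conf = support[itemset] / support[x]
            if conf >= minconf:
                rules.append((x, y, conf))
                kept.add(y)
        # Merge surviving consequents into consequents one item larger.
        # A larger consequent is only tried if all of its smaller
        # sub-consequents survived, since confidence cannot go back up.
        size = len(consequents[0]) + 1
        merged = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        consequents = [
            c for c in merged
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return rules

# Hypothetical support values for {a, b, c} and its subsets.
fs = frozenset
support = {
    fs("a"): 0.5, fs("b"): 0.5, fs("c"): 0.25,
    fs("ab"): 0.4, fs("ac"): 0.2, fs("bc"): 0.25,
    fs("abc"): 0.2,
}
rules = rules_from_itemset(fs("abc"), support, minconf=0.6)
for x, y, conf in rules:
    print(sorted(x), "->", sorted(y), round(conf, 3))
```

In this toy case the rule with consequent {c} fails minconf at the first level, so no larger consequent containing c is ever evaluated.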


![picture](https://miro.medium.com/max/625/1*oHvr5DH3YJS2TEmajxCkHw.png)

# Step 3: Searching for highest values of Lift to make conclusions

--> with the rules that satisfy both minsup and minconf

## A few more terms:

Maximal frequent itemset: It is a frequent itemset for which none of the immediate supersets are frequent. This is like a frequent itemset X to which no item y can be added such that {X,y} still remains above minsup threshold.

--> Most compact form of frequent itemset representation

--> All the frequent itemsets can be derived as the subsets of maximal frequent itemsets. However, information on support of the subsets is lost. If this value is required, closed frequent itemset is another way to represent all the frequent itemsets.

Closed frequent itemset: It is a frequent itemset for which there exists no superset with the same support as the itemset. Consider an itemset X. If ALL occurrences of X are accompanied by the occurrence of some item Y that is not in X, then X is NOT a closed set.

--> Closed frequent itemsets help in removing some redundant itemsets without losing information about the support values.
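
Both definitions can be applied mechanically once the supports of all frequent itemsets are known. The sketch below is an illustration only: the function name `maximal_and_closed` and the support values are hypothetical. Checking only frequent supersets is enough for the closed test, because a superset with the same support as a frequent itemset is itself frequent.

```python
def maximal_and_closed(frequent):
    """`frequent` maps frozenset -> support for ALL frequent itemsets."""
    maximal, closed = [], []
    for itemset, supp in frequent.items():
        # Frequent proper supersets of this itemset.
        supersets = [other for other in frequent if itemset < other]
        # Maximal: no frequent superset exists at all.
        if not supersets:
            maximal.append(itemset)
        # Closed: every superset has strictly lower support.
        if all(frequent[other] < supp for other in supersets):
            closed.append(itemset)
    return maximal, closed

# Hypothetical supports: every basket containing b also contains a.
frequent = {
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.4,
    frozenset({"a", "b"}): 0.4,
}
maximal, closed = maximal_and_closed(frequent)
print("maximal:", [sorted(s) for s in maximal])
print("closed :", [sorted(s) for s in closed])
```

Here {b} is frequent but not closed, because {a, b} has the same support; dropping {b} from the representation loses nothing, since its support can be read off {a, b}.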



# References:

- https://www.analyticsvidhya.com/blog/2017/08/mining-frequent-items-using-apriori-algorithm/

- https://towardsdatascience.com/association-rules-2-aa9a77241654

- ML A-Z course on Udemy: https://www.udemy.com/share/101Wci2@Pm1KbFteSFcJd0JKOEtOfQ==/


# Code

Link to dataset: https://drive.google.com/file/d/16wlKvgyHvsXU96rLd-j2WHrN52thrp7-/view?usp=sharing

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
!pip install apyori

Collecting apyori
  Downloading https://files.pythonhosted.org/packages/5e/62/5ffde5c473ea4b033490617ec5caa80d59804875ad3c3c57c0976533a21a/apyori-1.1.2.tar.gz
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-cp37-none-any.whl size=5975 sha256=4be1cb430eb1afc68841a044776ef86cdd9b3e2c4c46e67cdd4958b06b52a1a8
  Stored in directory: /root/.cache/pip/wheels/5d/92/bb/474bbadbc8c0062b9eb168f69982a0443263f8ab1711a8cad0
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2


In [6]:
# importing libraries 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Data Preprocessing 

In [16]:
#loading the dataset (no header)
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)

# creating a list of transactions (one list of items per row) from the dataframe
transactions = []

# all elements in the lists passed to apyori must be str;
# note that str() also turns any missing (NaN) cells into the string 'nan'
for i in range(0, len(dataset.index)):
  transactions.append([str(dataset.values[i,j]) for j in range(0, len(dataset.columns))])



## Training model on Dataset

In [17]:
# assuming we want items that appear in at least 3 transactions per day over a week, minsup = 3*7/7501
# rule of thumb for min_confidence is to start with 0.8 and keep dividing by 2 until you get a desirable number of rules
# rules with lift less than 3 aren't that relevant in most cases
# min_length and max_length = 2 --> (product A -> product B) --> depends on the problem

from apyori import apriori
rules = apriori(transactions = transactions, min_support = 0.0027, min_confidence = 0.2, min_lift = 3, min_length = 2, max_length = 2)

## Visualising Results 

## Direct results

In [18]:
ap_results = list(rules)
ap_results 

[RelationRecord(items=frozenset({'chicken', 'extra dark chocolate'}), support=0.0027996267164378083, ordered_statistics=[OrderedStatistic(items_base=frozenset({'extra dark chocolate'}), items_add=frozenset({'chicken'}), confidence=0.23333333333333334, lift=3.8894074074074076)]),
 RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'mushroom cream sauce', 'escalope'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
 RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'esca

## Putting results in a pd frame

In [20]:
def inspect(results):
    # each result is (items, support, ordered_statistics); only the first
    # ordered statistic (items_base, items_add, confidence, lift) is read,
    # and only the first item of its base and add sets is kept
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

resultsinDataFrame = pd.DataFrame(inspect(ap_results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])

In [21]:
resultsinDataFrame


| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | extra dark chocolate | chicken | 0.0028 | 0.233333 | 3.889407 |
| 1 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 3 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 4 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 5 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 6 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 7 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
| 8 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 9 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |


## Sorted Final Results

In [22]:

resultsinDataFrame.nlargest(10, "Lift")

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 4 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 1 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 3 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 9 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 8 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 0 | extra dark chocolate | chicken | 0.0028 | 0.233333 | 3.889407 |
| 6 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 2 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 5 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 7 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
