# Data Mining - Apriori Algorithm
## Professor: Alberto López Cardoza
### Alumno: 6-A
#### Suggested Grade: 10

What is Association Rule Learning?
--

Association Rule Learning is rule-based learning for identifying the association between different variables in a database. **`One of the best and most popular examples of Association Rule Learning is the Market Basket Analysis`**. The problem analyses the association between various items that has the highest probability of being bought together by a customer.

For example, the association rule, {onions, chicken masala} => {chicken} says that a person who has got both onions and chicken masala in his or her basket has a high probability of buying chicken also.


Apriori Algorithm
--

The algorithm was first proposed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant. Apriori algorithm finds the most frequent itemsets or elements in a transaction database and identifies association rules between the items just like the above-mentioned example.


How Apriori works ?
--

To construct association rules between elements or items, the algorithm considers 3 important factors which are, support, confidence and lift. Each of these factors is explained as follows:

**Support:**

The support of item I is defined as the ratio between the number of transactions containing the item I by the total number of transactions expressed as :

<img src="https://miro.medium.com/max/403/0*pyOADkeaWyrVP2ft.png" />

<font color='darkgreen'><b>Support</b></font> indicates how popular an itemset is, as measured by the proportion of transactions in which an itemset appears. In Table 1 below, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%.

<img src='https://annalyzin.files.wordpress.com/2016/04/association-rule-support-table.png?w=503&h=447' />

<br />

<font color='red'><em>If you discover that sales of items beyond a certain proportion tend to have a significant impact on your profits, you might consider using that proportion as your support threshold. You may then identify itemsets with support values above this threshold as significant itemsets.</em></font>


**Confidence:**

This is measured by the proportion of transactions with item I1, in which item I2 also appears. The confidence between two items I1 and I2,  in a transaction is defined as the total number of transactions containing both items I1 and I2 divided by the total number of transactions containing I1. ( Assume I1 as X , I2 as Y )

<img src='https://miro.medium.com/max/576/1*50GI4dR58MnhwBP9dw6nFQ.png' />

<font color='darkgreen'><b>Confidence</b></font> says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with item X, in which item Y also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.

<img src='https://annalyzin.files.wordpress.com/2016/03/association-rule-confidence-eqn.png?w=527&h=77' />

<font color='red'><em>One drawback of the confidence measure is that it might misrepresent the importance of an association. This is because it only accounts for how popular apples are, but not beers. If beers are also very popular in general, there will be a higher chance that a transaction containing apples will also contain beers, thus inflating the confidence measure. To account for the base popularity of both constituent items, we use a third measure called lift. </em></font>


**Lift:**

Lift is the ratio between the confidence and support.

<font color='darkgreen'><b>Lift</b></font> says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is. In Table 1, the lift of {apple -> beer} is 1,which implies no association between items. A lift value greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought. ( *here X represents apple and Y represents beer* )

<img src='https://annalyzin.files.wordpress.com/2016/03/association-rule-lift-eqn.png?w=566&h=80' />

<hr>

for **Extra Reading** : refer <a href='https://towardsdatascience.com/association-rules-2-aa9a77241654'> this </a> 


In [123]:
#External package need to install
!pip install apyori

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import libaries

In [124]:
# Import Libraries
import pandas as pd
import numpy as np
from apyori import apriori

These are some example datasets I am providing for you.

In [125]:
# ItemsetA
itemsetA = [['arroz','frijol','tortillas','coca','chile'],['frijol','chile','charritoss','tortillas'],['tortillas','coca','frijol','chile'],['coca','tortillas','chile'],['coca','tortillas']]
minSupA = .002
minConfA = .6
minLenA = 2

itemsetB= [['leche','jamon','huevos','arroz','queso','yogurt'],['arroz','huevos','jamon','queso','salsa','sandia','cebolla'],['queso','huevos','yogurt','salsa','arroz','naranja'],['jamon','queso','cebolla','tomates','sandia','huevos'],['queso','jamon','salsa','arroz','limon','cebolla'],['tomates','yogurt','arroz','salsa','tomates','queso'],['yogurt','huevos','jamon','tomates','naranja','arroz','sandia'],['cocacola','jamon','yogurt','salsa','sandia']]
minSupB = .25
minConfB = .5
minLenB = 2

itemsetC = [['leche','jamon','huevos','arroz','queso','yogurt','coke','tortillas'],['arroz','huevos','jamon','queso','coke','salsa','sandia','cebolla'],['queso','huevos','yogurt','salsa','leche','tortillas','arroz','cocacola','naranja'],['jamon','queso','cebolla','tomates','tortillas','sandia','huevos'],['queso','jamon','salsa','arroz','limon','cebolla','cocacola','leche'],['tomates','yogurt','arroz','salsa','cocacola','tomates','queso'],['yogurt','huevos','jamon','tomates','naranja','tortillas','arroz','sandia'],['cocacola','jamon','leche','tortillas','yogurt','salsa','sandia']]
minSupC = .15
minConfC = .4
minLenC = 2

itemsetD = [['leche','jamon','huevos','arroz','queso','yogurt','coke','tortillas'],['arroz','huevos','jamon','queso','coke','salsa','sandia','cebolla'],['queso','huevos','yogurt','salsa','leche','tortillas','arroz','cocacola','naranja'],['jamon','queso','cebolla','tomates','tortillas','sandia','huevos'],['queso','jamon','salsa','arroz','limon','cebolla','cocacola','leche'],['tomates','yogurt','arroz','salsa','cocacola','tomates','queso'],['yogurt','huevos','jamon','tomates','naranja','tortillas','arroz','sandia'],['cocacola','jamon','leche','tortillas','yogurt','salsa','sandia'],['tortillas','salsa','arroz','limon','yogurt','coca'],['queso','jamon','salsa','arroz','limon','cebolla','cocacola','leche','toronja']]
minSupD = .1
minConfD = .45
minLenD = 2

Print first element of each itemSet

In [126]:
# Print itemsets
#A
print(itemsetA[0])
#B
print(itemsetB[0])
#C
print(itemsetC[0])
#D
print(itemsetD[0])

['arroz', 'frijol', 'tortillas', 'coca', 'chile']
['leche', 'jamon', 'huevos', 'arroz', 'queso', 'yogurt']
['leche', 'jamon', 'huevos', 'arroz', 'queso', 'yogurt', 'coke', 'tortillas']
['leche', 'jamon', 'huevos', 'arroz', 'queso', 'yogurt', 'coke', 'tortillas']


Run apriori and set to `rules`
Hare show me what you know about APRIORI and the parameters. 


*   Support
*   Confidance
*   Lift
*   Min Lenght







In [127]:
rulesA = apriori(itemsetA, min_confidance=minConfA, min_length=minLenA, min_support=minSupA)

Convert rules to a list and name it `result`. Print` result` for each itemset

In [128]:
resultA = list(rulesA)
print(resultA)

[RelationRecord(items=frozenset({'arroz'}), support=0.2, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'arroz'}), confidence=0.2, lift=1.0)]), RelationRecord(items=frozenset({'charritoss'}), support=0.2, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'charritoss'}), confidence=0.2, lift=1.0)]), RelationRecord(items=frozenset({'chile'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'chile'}), confidence=0.8, lift=1.0)]), RelationRecord(items=frozenset({'coca'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'coca'}), confidence=0.8, lift=1.0)]), RelationRecord(items=frozenset({'frijol'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'frijol'}), confidence=0.6, lift=1.0)]), RelationRecord(items=frozenset({'tortillas'}), support=1.0, ordered_statistics=[OrderedStatistic(it

Print first 10 rows

In [129]:
print(*resultA[:10], sep = "\n")

RelationRecord(items=frozenset({'arroz'}), support=0.2, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'arroz'}), confidence=0.2, lift=1.0)])
RelationRecord(items=frozenset({'charritoss'}), support=0.2, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'charritoss'}), confidence=0.2, lift=1.0)])
RelationRecord(items=frozenset({'chile'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'chile'}), confidence=0.8, lift=1.0)])
RelationRecord(items=frozenset({'coca'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'coca'}), confidence=0.8, lift=1.0)])
RelationRecord(items=frozenset({'frijol'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'frijol'}), confidence=0.6, lift=1.0)])
RelationRecord(items=frozenset({'tortillas'}), support=1.0, ordered_statistics=[OrderedStatistic(items_ba

Convert rules for each itemset to a list and then to a DataFrame

In [130]:

df_results = pd.DataFrame(resultA)
# print 5 fist rows
df_results.head(5)

Unnamed: 0,items,support,ordered_statistics
0,(arroz),0.2,"[((), (arroz), 0.2, 1.0)]"
1,(charritoss),0.2,"[((), (charritoss), 0.2, 1.0)]"
2,(chile),0.8,"[((), (chile), 0.8, 1.0)]"
3,(coca),0.8,"[((), (coca), 0.8, 1.0)]"
4,(frijol),0.6,"[((), (frijol), 0.6, 1.0)]"


Here extract `support`.

In [131]:
support = df_results.support

Get the other values. Use the first link I gave you.
Copy the code from there

In [132]:
#all four empty list which will contain lhs, rhs, confidance and lift respectively.
first_values = []
second_values = []
third_values = []
fourth_value = []
# loop number of rows time and append 1 by 1 value in a separate list.. 
# first and second element was frozenset which need to be converted in list..
for i in range(df_results.shape[0]):
    single_list = df_results['ordered_statistics'][i][0]
    first_values.append(list(single_list[0]))
    second_values.append(list(single_list[1]))
    third_values.append(single_list[2])
    fourth_value.append(single_list[3])

In [133]:
# convert all four list into dataframe for further operation..
lhs = pd.DataFrame(first_values)
rhs = pd.DataFrame(second_values)

confidance=pd.DataFrame(third_values,columns=['Confidance'])

lift=pd.DataFrame(fourth_value,columns=['lift'])

In [134]:
# concat all list together in a single dataframe
df_final = pd.concat([lhs,rhs,support,confidance,lift], axis=1)
df_final

Unnamed: 0,0,1,2,3,4,support,Confidance,lift
0,arroz,,,,,0.2,0.2,1.0
1,charritoss,,,,,0.2,0.2,1.0
2,chile,,,,,0.8,0.8,1.0
3,coca,,,,,0.8,0.8,1.0
4,frijol,,,,,0.6,0.6,1.0
5,tortillas,,,,,1.0,1.0,1.0
6,arroz,chile,,,,0.2,0.2,1.0
7,arroz,coca,,,,0.2,0.2,1.0
8,arroz,frijol,,,,0.2,0.2,1.0
9,tortillas,arroz,,,,0.2,0.2,1.0


In [135]:
'''
 we have some of place only 1 item in lhs and some place 3 or more so we need to a proper represenation for User to understand. 
 replacing none with ' ' and combining three column's in 1 
 example : coffee,none,none is converted to coffee, ,
'''

"\n we have some of place only 1 item in lhs and some place 3 or more so we need to a proper represenation for User to understand. \n replacing none with ' ' and combining three column's in 1 \n example : coffee,none,none is converted to coffee, ,\n"

In [136]:
df_final.fillna(value=' ', inplace=True)
df_final.head()

Unnamed: 0,0,1,2,3,4,support,Confidance,lift
0,arroz,,,,,0.2,0.2,1.0
1,charritoss,,,,,0.2,0.2,1.0
2,chile,,,,,0.8,0.8,1.0
3,coca,,,,,0.8,0.8,1.0
4,frijol,,,,,0.6,0.6,1.0


In [137]:
#set column name
df_final.columns = ['lhs',1,'rhs',2,3,'support','confidance','lift']
df_final.head()

Unnamed: 0,lhs,1,rhs,2,3,support,confidance,lift
0,arroz,,,,,0.2,0.2,1.0
1,charritoss,,,,,0.2,0.2,1.0
2,chile,,,,,0.8,0.8,1.0
3,coca,,,,,0.8,0.8,1.0
4,frijol,,,,,0.6,0.6,1.0


In [138]:
# add all three column to lhs itemset only
df_final['lhs'] = df_final['lhs'] + str(", ") + df_final[1]

df_final['rhs'] = df_final['rhs']+str(", ")+df_final[2] + str(", ") + df_final[3]

In [139]:
# Print forst 4 rows
df_final.head(4)

Unnamed: 0,lhs,1,rhs,2,3,support,confidance,lift
0,"arroz,",,", ,",,,0.2,0.2,1.0
1,"charritoss,",,", ,",,,0.2,0.2,1.0
2,"chile,",,", ,",,,0.8,0.8,1.0
3,"coca,",,", ,",,,0.8,0.8,1.0


In [140]:
#drop columns 1,2 and 3 because now we already appended to lhs column.
df_final.drop(columns=[1,2,3],inplace=True)

In [141]:
#this is final output. You can sort based on the support lift and confidance..
df_final.head()

Unnamed: 0,lhs,rhs,support,confidance,lift
0,"arroz,",", ,",0.2,0.2,1.0
1,"charritoss,",", ,",0.2,0.2,1.0
2,"chile,",", ,",0.8,0.8,1.0
3,"coca,",", ,",0.8,0.8,1.0
4,"frijol,",", ,",0.6,0.6,1.0


In [145]:
## Showing top 10 items, based on lift.  Sorting in desc order
df_final.sort_values('lift', ascending=True).head(10)

Unnamed: 0,lhs,rhs,support,confidance,lift
0,"arroz,",", ,",0.2,0.2,1.0
21,"tortillas, arroz","chile, ,",0.2,0.2,1.0
22,"arroz, coca","frijol, ,",0.2,0.2,1.0
23,"tortillas, arroz","coca, ,",0.2,0.2,1.0
24,"tortillas, arroz","frijol, ,",0.2,0.2,1.0
25,"chile, charritoss","frijol, ,",0.2,0.2,1.0
26,"tortillas, chile","charritoss, ,",0.2,0.2,1.0
27,"tortillas, charritoss","frijol, ,",0.2,0.2,1.0
20,"arroz, chile","frijol, ,",0.2,0.2,1.0
28,"chile, coca","frijol, ,",0.4,0.4,1.0
