# ASSOCIATION RULES

### The objective of this assignment is to introduce students to rule mining techniques, with a focus on market basket analysis, and to provide hands-on experience.

## Dataset:
### Use the Online retail dataset to mine association rules.

## Data Preprocessing:
### Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.

## Association Rule Mining:
### •	Implement the Apriori algorithm using a tool such as Python, with libraries such as pandas and mlxtend.
### •	Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.
### •	Set appropriate thresholds for support, confidence, and lift to extract meaningful rules.

## Analysis and Interpretation:
### •	Analyse the generated rules to identify interesting patterns and relationships between the products.
### •	Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [2]:
pip install apyori

Note: you may need to restart the kernel to use updated packages.


In [3]:
from apyori import apriori

In [4]:
store_data = pd.read_excel("Online retail.xlsx", header=None)
display(store_data.head())
print(store_data.shape)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


(7501, 20)


# Data Preprocessing

In [5]:
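# apyori expects the data as a list of transactions, each a list of item strings.
# Note: range(1, 7501) skips row 0, and str() turns empty cells into the string 'nan'.

records = []
for i in range(1, 7501):
    records.append([str(store_data.values[i, j]) for j in range(0, 20)])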

In [6]:
print(type(records))

<class 'list'>
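
The loop above starts at row 1 and converts empty cells to the string `'nan'`, which the algorithm then treats as a real item. A minimal sketch of one alternative follows, using a small stand-in frame named `toy` (hypothetical, not the real `store_data`): it keeps every row and drops the missing cells. Rerunning the notebook with records built this way would change the supports and the rule list reported below, which were produced by the original loop.

```python
import pandas as pd

# Toy stand-in for store_data: one transaction per row, padded with missing values.
toy = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", None, None],
    ["turkey", "avocado", None],
])

# Keep every row (including row 0) and skip cells that are missing.
records_clean = [
    [str(item) for item in row if pd.notna(item)]
    for row in toy.itertuples(index=False, name=None)
]
print(records_clean)
```

With the real data, replacing `toy` by `store_data` should give one variable-length list per transaction.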


# Apriori Algorithm

Now it is time to apply the algorithm to the data.

We pass min_support, min_confidence, min_lift, and min_length to control which rules are reported.

In [7]:
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)
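
To make the support threshold concrete, the short sketch below converts `min_support=0.0045` into an absolute transaction count. It assumes 7500 records, which is what the `range(1, 7501)` loop produces.

```python
import math

n_transactions = 7500   # rows 1..7500 collected into `records`
min_support = 0.0045    # threshold passed to apriori above

# Smallest number of transactions an itemset must appear in to pass the threshold.
min_count = math.ceil(min_support * n_transactions)
print(min_count)                   # 34
print(min_count / n_transactions)  # smallest support that can pass
```

An itemset therefore has to occur in at least 34 baskets, so the lowest support that can appear in the results is 34/7500, or roughly 0.00453.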

# Number of Relations Derived

In [8]:
print("There are {} Relation derived.".format(len(association_results)))

There are 48 Relation derived.


# Association Rules Derived

In [9]:
for i in range(0, len(association_results)):
    print(association_results[i][0])

frozenset({'chicken', 'light cream'})
frozenset({'escalope', 'mushroom cream sauce'})
frozenset({'escalope', 'pasta'})
frozenset({'herb & pepper', 'ground beef'})
frozenset({'ground beef', 'tomato sauce'})
frozenset({'olive oil', 'whole wheat pasta'})
frozenset({'shrimp', 'pasta'})
frozenset({'nan', 'chicken', 'light cream'})
frozenset({'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'ground beef', 'cooking oil'})
frozenset({'nan', 'escalope', 'mushroom cream sauce'})
frozenset({'nan', 'escalope', 'pasta'})
frozenset({'spaghetti', 'ground beef', 'frozen vegetables'})
frozenset({'olive oil', 'milk', 'frozen vegetables'})
frozenset({'shrimp', 'mineral water', 'frozen vegetables'})
frozenset({'spaghetti', 'olive oil', 'frozen vegetables'})
frozenset({'spaghetti', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'frozen vegetables', 'tomatoes'})
frozenset({'grated cheese', 'ground beef', 'spaghetti'})
frozenset({'herb & pepper', 'ground beef', 'mineral water'})


# Rules Generated

In [10]:
for item in association_results:
    # item[0] is the frozenset of items in the relation. Only its first two
    # elements are printed, and a frozenset has no guaranteed order.
    pair = item[0]
    items = list(pair)
    print("Rule: " + items[0] + " -> " + items[1])

    # item[1] is the support of the whole itemset.
    print("Support: " + str(item[1]))

    # item[2] is the list of ordered statistics; its first entry holds
    # (items_base, items_add, confidence, lift).
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

Rule: chicken -> light cream
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
Rule: escalope -> mushroom cream sauce
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
Rule: escalope -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
Rule: herb & pepper -> ground beef
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
Rule: ground beef -> tomato sauce
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
Rule: olive oil -> whole wheat pasta
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
Rule: shrimp -> pasta
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
Rule: nan -> chicken
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
Rule: shrimp -> frozen vegetables
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.
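
Because the loop above prints the first two elements of an unordered frozenset, the printed "rule" for a three-item set (for example `nan -> chicken`) does not necessarily match the antecedent and consequent that the confidence and lift refer to. The sketch below reads the antecedent (`items_base`) and consequent (`items_add`) from each ordered statistic instead. The two namedtuples are stand-ins that mirror apyori's record fields so the example runs without the library, and the demo numbers are made up.

```python
from collections import namedtuple

# Stand-ins mirroring apyori's record layout (used here only for the demo).
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"])


def format_rules(results):
    """Return one line per rule with explicit antecedent and consequent."""
    lines = []
    for record in results:
        for stat in record.ordered_statistics:
            base = ", ".join(sorted(stat.items_base))
            add = ", ".join(sorted(stat.items_add))
            lines.append(
                f"{base} -> {add} | support={record.support:.4f} "
                f"confidence={stat.confidence:.4f} lift={stat.lift:.4f}"
            )
    return lines


# Demo record with made-up numbers (not taken from the dataset).
demo = [RelationRecord(
    items=frozenset({"a", "b", "c"}),
    support=0.1,
    ordered_statistics=[
        OrderedStatistic(frozenset({"a", "b"}), frozenset({"c"}), 0.5, 2.0),
    ],
)]
for line in format_rules(demo):
    print(line)
```

With real results, `format_rules(association_results)` should work unchanged, since the records expose the same field names.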

In [11]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

In [12]:
pip install mlxtend

Note: you may need to restart the kernel to use updated packages.


In [13]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder


In [14]:
te = TransactionEncoder()
te_try = te.fit(dataset).transform(dataset)

In [15]:
df = pd.DataFrame(te_try, columns=te.columns_)


In [16]:
df


Unnamed: 0,Apple,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Unicorn,Yogurt
0,False,False,False,True,False,True,True,True,True,False,True
1,False,False,True,True,False,True,False,True,True,False,True
2,True,False,False,True,False,True,True,False,False,False,False
3,False,True,False,False,False,True,True,False,False,True,True
4,False,True,False,True,True,True,False,False,True,False,False


In [17]:
#Model Training

In [18]:
# Note: this rebinds the name `apriori`, replacing the apyori function imported earlier.
from mlxtend.frequent_patterns import apriori


In [19]:
apriori(df,min_support=0.5)


Unnamed: 0,support,itemsets
0,0.8,(3)
1,1.0,(5)
2,0.6,(6)
3,0.6,(8)
4,0.6,(10)
5,0.8,"(3, 5)"
6,0.6,"(8, 3)"
7,0.6,"(5, 6)"
8,0.6,"(8, 5)"
9,0.6,"(10, 5)"


In [20]:
# Return itemsets with column names instead of column indices

apriori(df,min_support=0.5, use_colnames=True)


Unnamed: 0,support,itemsets
0,0.8,(Eggs)
1,1.0,(Kidney Beans)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Kidney Beans, Eggs)"
6,0.6,"(Eggs, Onion)"
7,0.6,"(Kidney Beans, Milk)"
8,0.6,"(Kidney Beans, Onion)"
9,0.6,"(Kidney Beans, Yogurt)"


Calculate the length of each itemset

In [21]:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.8,(Eggs),1
1,1.0,(Kidney Beans),1
2,0.6,(Milk),1
3,0.6,(Onion),1
4,0.6,(Yogurt),1
5,0.8,"(Kidney Beans, Eggs)",2
6,0.6,"(Eggs, Onion)",2
7,0.6,"(Kidney Beans, Milk)",2
8,0.6,"(Kidney Beans, Onion)",2
9,0.6,"(Kidney Beans, Yogurt)",2


In [22]:
# Length is 2 and support is >= 0.8

In [23]:
frequent_itemsets[ (frequent_itemsets['length'] == 2) & (frequent_itemsets['support'] >= 0.8) ]


Unnamed: 0,support,itemsets,length
5,0.8,"(Kidney Beans, Eggs)",2


In [24]:
frequent_itemsets[ frequent_itemsets['itemsets'] == {'Onion', 'Eggs'} ]


Unnamed: 0,support,itemsets,length
6,0.6,"(Eggs, Onion)",2


Setting verbose=1 prints the number of candidate combinations processed and the itemset size being evaluated

In [25]:
apriori(df, min_support=0.6, use_colnames=True, verbose=1)


Processing 21 combinations | Sampling itemset size 3


Unnamed: 0,support,itemsets
0,0.8,(Eggs)
1,1.0,(Kidney Beans)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Kidney Beans, Eggs)"
6,0.6,"(Eggs, Onion)"
7,0.6,"(Kidney Beans, Milk)"
8,0.6,"(Kidney Beans, Onion)"
9,0.6,"(Kidney Beans, Yogurt)"


In [26]:
# Use max_len to cap the size of the itemsets that are generated
apriori(df, min_support=0.6, use_colnames=True, verbose=1, max_len=3)


Processing 21 combinations | Sampling itemset size 3


Unnamed: 0,support,itemsets
0,0.8,(Eggs)
1,1.0,(Kidney Beans)
2,0.6,(Milk)
3,0.6,(Onion)
4,0.6,(Yogurt)
5,0.8,"(Kidney Beans, Eggs)"
6,0.6,"(Eggs, Onion)"
7,0.6,"(Kidney Beans, Milk)"
8,0.6,"(Kidney Beans, Onion)"
9,0.6,"(Kidney Beans, Yogurt)"
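
The mlxtend cells above stop at frequent itemsets. As a rough sketch of the next step, the code below derives rules from the supports listed in the table above using plain Python. The helper name `derive_rules` and the 0.9 confidence threshold are illustrative choices, not part of the original notebook.

```python
from itertools import combinations

# Supports copied from the frequent-itemset table above.
support = {
    frozenset({"Eggs"}): 0.8,
    frozenset({"Kidney Beans"}): 1.0,
    frozenset({"Milk"}): 0.6,
    frozenset({"Onion"}): 0.6,
    frozenset({"Yogurt"}): 0.6,
    frozenset({"Kidney Beans", "Eggs"}): 0.8,
    frozenset({"Eggs", "Onion"}): 0.6,
    frozenset({"Kidney Beans", "Milk"}): 0.6,
    frozenset({"Kidney Beans", "Onion"}): 0.6,
    frozenset({"Kidney Beans", "Yogurt"}): 0.6,
}


def derive_rules(support, min_confidence):
    """Split each multi-item itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, supp in support.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                confidence = supp / support[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / support[consequent]
                    rules.append((antecedent, consequent, confidence, lift))
    return rules


rules = derive_rules(support, min_confidence=0.9)
for antecedent, consequent, confidence, lift in rules:
    print(sorted(antecedent), "->", sorted(consequent),
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

Rules whose consequent is Kidney Beans reach a confidence of 1 but a lift of only 1, because Kidney Beans appears in every transaction. This is why confidence alone can be misleading.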


In [27]:
#End of assignment

# Interview Questions:

## 1.	What is lift and why is it important in Association rules?
### Ans-
### Lift measures the strength of the association between two itemsets while accounting for how frequent each of them is in the dataset.
### It is calculated as the confidence of the rule divided by the support of the consequent (the second itemset).
### Lift compares the observed confidence with the confidence expected if the two itemsets were independent: a lift above 1 means they occur together more often than expected, a lift of 1 means no association, and a lift below 1 means they occur together less often than expected.
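
As a worked illustration, the definition can be written out and applied to the toy mlxtend example earlier in this notebook, where support(Onion) = 0.6, support(Eggs) = 0.8, and support(Eggs, Onion) = 0.6. The choice of this pair is only for illustration.

```latex
\[
\operatorname{lift}(X \Rightarrow Y)
  = \frac{\operatorname{confidence}(X \Rightarrow Y)}{\operatorname{support}(Y)}
  = \frac{\operatorname{support}(X \cup Y)}{\operatorname{support}(X)\,\operatorname{support}(Y)}
\]
\[
\operatorname{lift}(\text{Onion} \Rightarrow \text{Eggs})
  = \frac{0.6}{0.6 \times 0.8} = 1.25
\]
```

A value of 1.25 means that baskets with Onion contain Eggs 1.25 times as often as baskets in general.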

## 2.	What are support and confidence? How do you calculate them?
### Ans-
### Support is calculated by dividing the number of transactions containing an itemset by the total number of transactions.
### Support indicates how frequently an itemset appears in the data.

### Confidence is calculated by dividing the number of transactions containing both itemsets by the number of transactions containing the first itemset.
### Confidence indicates how often the if-then statement is found to be true, as a fraction of the transactions that contain the first itemset.

### Calculation-  confidence(X ⇒ Y) = support(X ∪ Y) / support(X).
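
The two definitions can be checked by counting directly. The sketch below reuses the five toy transactions from the mlxtend example in this notebook. The function names `support` and `confidence` are illustrative helpers, not library calls.

```python
dataset = [
    ['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
    ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
    ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
    ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'],
]
transactions = [set(t) for t in dataset]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)


def confidence(x, y):
    """Fraction of transactions containing `x` that also contain `y`."""
    return count(set(x) | set(y)) / count(x)


print(support({'Eggs'}))               # 0.8
print(support({'Onion', 'Eggs'}))      # 0.6
print(confidence({'Eggs'}, {'Onion'})) # 0.75
print(confidence({'Onion'}, {'Eggs'})) # 1.0
```

Confidence is not symmetric: every basket with Onion also contains Eggs, but only three of the four baskets with Eggs contain Onion.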

## 3.	What are some limitations or challenges of Association rules mining?
### Ans-
### Some of the main drawbacks of association rule algorithms are:
### The algorithms have too many parameters for someone inexperienced in data mining.
### The rules obtained are too numerous, and most of them are uninteresting and hard to interpret.
### A large dataset can generate an overwhelming number of rules, which are costly and complex to analyze.
### To address these issues, you can use techniques that reduce the search space and filter out irrelevant or redundant rules.
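
One concrete case of redundancy appears in the apyori results earlier in this notebook, where itemsets containing the placeholder `'nan'` repeat itemsets already found without it. A minimal sketch of filtering them out after mining is shown below, using a few of the itemsets printed above. Cleaning the transactions before mining is the more thorough fix.

```python
# A few itemsets copied from the results printed earlier in the notebook.
itemsets = [
    frozenset({'chicken', 'light cream'}),
    frozenset({'escalope', 'mushroom cream sauce'}),
    frozenset({'nan', 'chicken', 'light cream'}),
    frozenset({'nan', 'escalope', 'mushroom cream sauce'}),
]

# Drop every itemset that contains the missing-value placeholder.
filtered = [s for s in itemsets if 'nan' not in s]
for s in filtered:
    print(sorted(s))
```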