# **Data Preprocessing**

### Importing necessary libraries

In [50]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder  # required for the one-hot encoding in Step 5
from mlxtend.frequent_patterns import apriori, association_rules
import warnings

### Load the dataset

In [51]:
file_path = '/content/Online retail.xlsx'
data = pd.read_excel(file_path)

### Step 1: Rename single column to 'Transaction'

In [52]:
data.columns = ['Transaction']

### Step 2: Drop rows with missing values

In [53]:
data_cleaned = data.dropna()

### Step 3: Remove duplicate transactions

In [54]:
data_cleaned = data_cleaned.drop_duplicates()

### Step 4: Split each transaction string into a list of items

In [55]:
data_cleaned['Transaction'] = data_cleaned['Transaction'].apply(lambda x: x.split(','))
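A plain `split(',')` keeps any whitespace around item names, so the same product could end up under two labels. The `asparagus` and `asparagus.1` columns in the encoded output further down may be a symptom of this. The following is a minimal sketch of a more defensive split, assuming items are comma-separated strings; `split_items` is a hypothetical helper and not part of the original notebook.

```python
def split_items(row: str) -> list[str]:
    """Split a comma-separated transaction string into clean item names."""
    # Strip surrounding whitespace and drop empty fragments, such as the
    # one produced by a trailing comma.
    return [item.strip() for item in row.split(",") if item.strip()]


print(split_items("shrimp, almonds,avocado ,"))
```

If this behaviour is wanted, the helper could be passed to `apply` in place of the lambda, for example `data_cleaned['Transaction'].apply(split_items)`.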

### Step 5: Convert list of items into one-hot encoded format for association rules

In [56]:
te = TransactionEncoder()
te_ary = te.fit(data_cleaned['Transaction']).transform(data_cleaned['Transaction'])

### Step 6: Convert one-hot encoded array into DataFrame

In [57]:
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)
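To make the encoding step concrete, here is a small pure-Python sketch of what one-hot encoding does conceptually: one boolean column per distinct item, one row per transaction. The function `one_hot` and the toy baskets are illustrative assumptions; the notebook itself relies on `TransactionEncoder`.

```python
def one_hot(transactions):
    """Return (columns, rows); rows[i][j] is True if item j is in transaction i."""
    # One column per distinct item, in sorted order.
    columns = sorted({item for t in transactions for item in t})
    # One boolean row per transaction.
    rows = [[col in set(t) for col in columns] for t in transactions]
    return columns, rows


columns, rows = one_hot([["milk", "eggs"], ["eggs"], ["bread", "milk"]])
print(columns)
for row in rows:
    print(row)
```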

### Display first few rows of transformed dataset

In [58]:
df_encoded.head()

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


# **Association Rule Mining**

### Step 1: Apply Apriori algorithm

In [59]:
# `min_support` defines the minimum support threshold for an itemset to be considered frequent
frequent_itemsets = apriori(df_encoded, min_support=0.01, use_colnames=True)

In [60]:
# Display the frequent itemsets
frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.029179,(almonds)
1,0.011014,(antioxydant juice)
2,0.045797,(avocado)
3,0.01256,(bacon)
4,0.015459,(barbecue sauce)
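The `support` column is the fraction of transactions that contain the itemset, and `min_support` is the cut-off applied to it. The sketch below illustrates that idea on toy data by brute-force counting. It is not the Apriori algorithm itself, which prunes candidates instead of enumerating them all, and `frequent_itemsets_bruteforce` is a hypothetical name.

```python
from itertools import combinations


def frequent_itemsets_bruteforce(transactions, min_support, max_len=2):
    """Count support for every itemset up to max_len items (no pruning)."""
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))
    n = len(baskets)
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            # Support = share of baskets containing every item in the candidate.
            count = sum(1 for b in baskets if set(candidate) <= b)
            support = count / n
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result


toy = [["milk", "eggs"], ["milk", "bread"], ["milk", "eggs", "bread"], ["eggs"]]
found = frequent_itemsets_bruteforce(toy, min_support=0.5)
for itemset, support in found.items():
    print(sorted(itemset), support)
```

With a threshold of 0.5, the pair bread and eggs (support 0.25) is dropped while the other single items and pairs are kept.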


### Step 2: Apply Association Rule Mining

In [61]:
# Use the association_rules function to generate rules from the frequent itemsets
# `metric` defines the measure to evaluate the rules (confidence), and `min_threshold` sets its minimum value
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)

In [62]:
# Display the first few association rules
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(almonds),(mineral water),0.029179,0.29971,0.010821,0.370861,1.237399,0.002076,1.113092,0.197619
1,(avocado),(mineral water),0.045797,0.29971,0.015845,0.345992,1.154421,0.00212,1.070766,0.140185
2,(brownies),(mineral water),0.045024,0.29971,0.013913,0.309013,1.031039,0.000419,1.013463,0.031524
3,(burgers),(eggs),0.113816,0.208116,0.036135,0.317487,1.525531,0.012448,1.160248,0.388735
4,(burgers),(mineral water),0.113816,0.29971,0.034589,0.303905,1.013996,0.000477,1.006026,0.015576
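Several of the metric columns in the table above can be derived from just three numbers: the antecedent support, the consequent support, and the joint support. As a rough check, the sketch below recomputes confidence, lift, leverage, and conviction for rule 0 (almonds -> mineral water) using the standard definitions; `rule_metrics` is a hypothetical helper, and small differences from the table are expected because the inputs are rounded.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Derive common rule metrics from antecedent, consequent and joint support."""
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # Conviction is undefined when confidence is exactly 1; report infinity.
    conviction = float("inf") if confidence == 1 else (1 - support_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Values copied from rule 0 in the table above.
m = rule_metrics(support_a=0.029179, support_b=0.29971, support_ab=0.010821)
print({name: round(value, 4) for name, value in m.items()})
```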


### Step 3: Filter rules by lift and confidence

In [63]:
# Keep only rules with lift greater than 1 and confidence greater than 0.3
filtered_rules = rules[(rules['lift'] > 1) & (rules['confidence'] > 0.3)]

In [64]:
# Display the filtered rules
filtered_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(almonds),(mineral water),0.029179,0.29971,0.010821,0.370861,1.237399,0.002076,1.113092,0.197619
1,(avocado),(mineral water),0.045797,0.29971,0.015845,0.345992,1.154421,0.00212,1.070766,0.140185
2,(brownies),(mineral water),0.045024,0.29971,0.013913,0.309013,1.031039,0.000419,1.013463,0.031524
3,(burgers),(eggs),0.113816,0.208116,0.036135,0.317487,1.525531,0.012448,1.160248,0.388735
4,(burgers),(mineral water),0.113816,0.29971,0.034589,0.303905,1.013996,0.000477,1.006026,0.015576


### Step 4: Sort and interpret rules

In [65]:
# Sorting the filtered rules by lift to prioritize the strongest associations
sorted_rules = filtered_rules.sort_values(by='lift', ascending=False)

In [66]:
# Display the sorted rules
sorted_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
23,(herb & pepper),(ground beef),0.066473,0.135845,0.022802,0.343023,2.5251,0.013772,1.31535,0.646983
120,"(shrimp, mineral water)",(frozen vegetables),0.03343,0.129855,0.010435,0.312139,2.403747,0.006094,1.265001,0.604181
113,"(spaghetti, frozen vegetables)",(ground beef),0.039034,0.135845,0.01256,0.321782,2.368738,0.007258,1.274155,0.601306
114,"(ground beef, frozen vegetables)",(spaghetti),0.024541,0.229565,0.01256,0.511811,2.22948,0.006927,1.578149,0.565339
135,"(soup, mineral water)",(milk),0.03343,0.170048,0.012367,0.369942,2.175512,0.006682,1.317263,0.559026


# **Analysis and Interpretation**

In [67]:
# Sort the rules by 'lift' to prioritize the strongest associations
sorted_rules = filtered_rules.sort_values(by='lift', ascending=False)

In [68]:
# Display the top 10 rules sorted by lift
print("Top 10 rules sorted by lift:")
sorted_rules.head(10)

Top 10 rules sorted by lift:


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
23,(herb & pepper),(ground beef),0.066473,0.135845,0.022802,0.343023,2.5251,0.013772,1.31535,0.646983
120,"(shrimp, mineral water)",(frozen vegetables),0.03343,0.129855,0.010435,0.312139,2.403747,0.006094,1.265001,0.604181
113,"(spaghetti, frozen vegetables)",(ground beef),0.039034,0.135845,0.01256,0.321782,2.368738,0.007258,1.274155,0.601306
114,"(ground beef, frozen vegetables)",(spaghetti),0.024541,0.229565,0.01256,0.511811,2.22948,0.006927,1.578149,0.565339
135,"(soup, mineral water)",(milk),0.03343,0.170048,0.012367,0.369942,2.175512,0.006682,1.317263,0.559026
79,"(chocolate, frozen vegetables)",(milk),0.033043,0.170048,0.011594,0.350877,2.063397,0.005975,1.278574,0.532974
100,"(eggs, frozen vegetables)",(milk),0.030918,0.170048,0.010628,0.34375,2.021484,0.00537,1.264688,0.521436
37,(whole wheat pasta),(milk),0.04058,0.170048,0.013913,0.342857,2.016234,0.007013,1.26297,0.525344
133,"(shrimp, mineral water)",(milk),0.03343,0.170048,0.011401,0.34104,2.00555,0.005716,1.259488,0.518725
104,"(eggs, ground beef)",(spaghetti),0.028792,0.229565,0.012947,0.449664,1.958766,0.006337,1.399936,0.503985


In [None]:
# Analysis 1: A lift well above 1 indicates a strong positive association
# For each rule, we'll interpret the antecedents and consequents
# Items in rules with higher lift are bought together more often than independence would predict.

### Step 1: Analyze the top 5 rules with the highest lift

In [69]:
top_rules = sorted_rules.head(5)
top_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
23,(herb & pepper),(ground beef),0.066473,0.135845,0.022802,0.343023,2.5251,0.013772,1.31535,0.646983
120,"(shrimp, mineral water)",(frozen vegetables),0.03343,0.129855,0.010435,0.312139,2.403747,0.006094,1.265001,0.604181
113,"(spaghetti, frozen vegetables)",(ground beef),0.039034,0.135845,0.01256,0.321782,2.368738,0.007258,1.274155,0.601306
114,"(ground beef, frozen vegetables)",(spaghetti),0.024541,0.229565,0.01256,0.511811,2.22948,0.006927,1.578149,0.565339
135,"(soup, mineral water)",(milk),0.03343,0.170048,0.012367,0.369942,2.175512,0.006682,1.317263,0.559026


### Step 2: Loop through each rule and print the details for better interpretation

In [70]:
for index, rule in top_rules.iterrows():
    print(f"Rule {index + 1}:")
    print(f"Antecedents: {rule['antecedents']}")
    print(f"Consequents: {rule['consequents']}")
    print(f"Support: {rule['support']:.2f} - Confidence: {rule['confidence']:.2f} - Lift: {rule['lift']:.2f}")
    print(f"Interpretation: If customers buy {list(rule['antecedents'])}, they are likely to also buy {list(rule['consequents'])}.")
    print(" ")

Rule 24:
Antecedents: frozenset({'herb & pepper'})
Consequents: frozenset({'ground beef'})
Support: 0.02 - Confidence: 0.34 - Lift: 2.53
Interpretation: If customers buy ['herb & pepper'], they are likely to also buy ['ground beef'].
 
Rule 121:
Antecedents: frozenset({'shrimp', 'mineral water'})
Consequents: frozenset({'frozen vegetables'})
Support: 0.01 - Confidence: 0.31 - Lift: 2.40
Interpretation: If customers buy ['shrimp', 'mineral water'], they are likely to also buy ['frozen vegetables'].
 
Rule 114:
Antecedents: frozenset({'spaghetti', 'frozen vegetables'})
Consequents: frozenset({'ground beef'})
Support: 0.01 - Confidence: 0.32 - Lift: 2.37
Interpretation: If customers buy ['spaghetti', 'frozen vegetables'], they are likely to also buy ['ground beef'].
 
Rule 115:
Antecedents: frozenset({'ground beef', 'frozen vegetables'})
Consequents: frozenset({'spaghetti'})
Support: 0.01 - Confidence: 0.51 - Lift: 2.23
Interpretation: If customers buy ['ground beef', 'frozen vegetables']
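The labels in the output above ("Rule 24", "Rule 121", ...) are the DataFrame index plus one, not the rank of the rule in the sorted list. If rank-based labels and cleaner item names are preferred, something like the following could be used. This is a sketch on plain tuples with values copied from the table above; `describe` is a hypothetical helper, and in the notebook the same idea would apply to `enumerate(top_rules.iterrows(), start=1)`.

```python
# (antecedents, consequents, lift) for the two strongest rules in the table.
top_two = [
    (frozenset({"herb & pepper"}), frozenset({"ground beef"}), 2.5251),
    (frozenset({"shrimp", "mineral water"}), frozenset({"frozen vegetables"}), 2.403747),
]


def describe(rank, antecedents, consequents, lift):
    """Format one rule with a rank label and alphabetically sorted item names."""
    left = ", ".join(sorted(antecedents))
    right = ", ".join(sorted(consequents))
    return f"Rule {rank}: {left} -> {right} (lift {lift:.2f})"


lines = [
    describe(rank, a, c, lift)
    for rank, (a, c, lift) in enumerate(top_two, start=1)
]
print("\n".join(lines))
```

Sorting the item names also makes the printed text stable, since the display order of a `frozenset` is not guaranteed.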

### Step 3: Identifying actionable insights from patterns

In [71]:
# For example, products with high lift might suggest bundling or placement strategies.
# Let's also check rules with high confidence to discover likely purchases.

In [72]:
# Sorting by confidence to find rules with the highest confidence
sorted_by_confidence = filtered_rules.sort_values(by='confidence', ascending=False)

In [73]:
# Display the top 5 rules with the highest confidence
print("Top 5 rules sorted by confidence:")
print(sorted_by_confidence.head(5))

Top 5 rules sorted by confidence:
                          antecedents      consequents  antecedent support  \
134                      (soup, milk)  (mineral water)            0.021449   
112  (ground beef, frozen vegetables)  (mineral water)            0.024541   
146                 (spaghetti, soup)  (mineral water)            0.020676   
126           (ground beef, pancakes)  (mineral water)            0.020870   
70               (chicken, chocolate)  (mineral water)            0.021256   

     consequent support   support  confidence      lift  leverage  conviction  \
134             0.29971  0.012367    0.576577  1.923781  0.005939    1.653876   
112             0.29971  0.013333    0.543307  1.812775  0.005978    1.533393   
146             0.29971  0.010821    0.523364  1.746235  0.004624    1.469236   
126             0.29971  0.010821    0.518519  1.730067  0.004566    1.454448   
70              0.29971  0.011014    0.518182  1.728943  0.004644    1.453432   

     zhang

### **Interpretation:**

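High confidence means that when the antecedents are purchased, the consequents are often bought as well: for the top rules above, in roughly 52% to 58% of such transactions. All five of these rules have mineral water as the consequent, and mineral water already appears in about 30% of all transactions, so confidence is best read together with lift (about 1.73 to 1.92 here).

These insights are useful for designing promotions or product recommendations.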

### **Interview Questions:**

**1.	What is lift and why is it important in association rules?**

**Lift** is a key metric in association rule mining that measures the strength of a rule by comparing the observed co-occurrence of items to what would be expected if they were independent. Specifically, lift is the ratio of the observed support (how often the items appear together) to the expected support if the items were purchased independently. A lift **value greater than 1** indicates a positive association, meaning the presence of the antecedent increases the likelihood of the consequent being purchased. Conversely, a lift of **less than 1** suggests a negative association, and a lift of **exactly 1** means the items are independent. Lift is important because it helps identify truly interesting and meaningful patterns in the data, filtering out associations that could occur by chance and highlighting product combinations that are more likely to co-occur, which is valuable for decisions like cross-selling, product placement, or marketing strategies.
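Written as a formula, using the standard definition and the support and confidence notation from this notebook's rule tables:

$$
\text{lift}(A \to B) = \frac{\text{support}(A \cup B)}{\text{support}(A)\,\text{support}(B)} = \frac{\text{confidence}(A \to B)}{\text{support}(B)}
$$

As a worked example, take the top rule from the analysis above, herb & pepper -> ground beef, with the (rounded) values from the table:

$$
\text{lift} = \frac{0.022802}{0.066473 \times 0.135845} \approx \frac{0.022802}{0.009030} \approx 2.525
$$

This matches the reported lift of 2.5251: ground beef appears about 2.5 times as often in baskets containing herb & pepper as it would if the two were independent.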

**2.	What are support and confidence? How do you calculate them?**

**Support** and **confidence** are fundamental metrics in association rule mining. **Support** refers to the frequency or proportion of transactions in the dataset that contain a particular itemset. It is calculated by dividing the number of transactions that contain the itemset by the total number of transactions. This helps identify how popular or relevant an item or combination of items is. **Confidence**, on the other hand, measures the likelihood that the consequent of a rule is purchased when the antecedent is purchased. It is calculated as the ratio of transactions that contain both the antecedent and consequent to the number of transactions that contain just the antecedent. Confidence helps evaluate the reliability of the rule, indicating how often the rule has been found to be true.

**Support (A -> B)** = (Number of transactions containing both A and B) / (Total transactions).

**Confidence (A -> B)** = (Number of transactions containing both A and B) / (Number of transactions containing A).
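The two formulas translate directly into code. The following is a minimal sketch on five made-up transactions; the function names and the toy data are assumptions for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


toy = [
    ["bread", "butter"],
    ["bread"],
    ["bread", "butter", "jam"],
    ["jam"],
    ["butter"],
]
s = support(toy, {"bread", "butter"})        # 2 of 5 transactions
c = confidence(toy, {"bread"}, {"butter"})   # 2 of the 3 bread transactions
print(f"support = {s:.2f}, confidence = {c:.2f}")
```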

**3.	What are some limitations or challenges of association rule mining?**

Association rule mining, while powerful, comes with several limitations and challenges. One key issue is the **generation of a large number of rules**, many of which may be trivial or uninteresting, especially in large datasets. Filtering through these rules to find truly useful insights can be time-consuming. Another challenge is the **computational complexity**; mining frequent itemsets in large datasets can require significant processing power and memory. Additionally, association rules often identify correlations without establishing **causality**, meaning that while two items may appear together frequently, it doesn't necessarily imply that one causes the other to be purchased. Lastly, setting the right **thresholds for support and confidence** can be tricky: thresholds that are too high may miss interesting rules, while thresholds that are too low can generate an excessive number of meaningless rules.
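To give a feel for the first two points, the sketch below counts how many itemsets and rules are possible in principle for `d` distinct items, before any support or confidence threshold is applied. It uses the standard combinatorial counts (2^d - 1 non-empty itemsets and 3^d - 2^(d+1) + 1 rules with non-empty, disjoint sides); the function names are hypothetical.

```python
def count_itemsets(d: int) -> int:
    """Number of non-empty itemsets over d distinct items."""
    return 2**d - 1


def count_rules(d: int) -> int:
    """Number of rules A -> B with A and B non-empty and disjoint."""
    return 3**d - 2 ** (d + 1) + 1


for d in (3, 6, 10, 20):
    print(f"{d:>2} items: {count_itemsets(d):>9,} itemsets, {count_rules(d):>13,} rules")
```

Even a modest catalogue produces far more candidate rules than anyone could review by hand, which is why thresholds and pruning matter.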