**Data Preprocessing:**

In [1]:
import pandas as pd

In [2]:
# Load the dataset; header=None treats the first row as data rather than as column names
file_path = '/content/sample_data/Online retail.xlsx'
df = pd.read_excel(file_path, header=None)

In [3]:
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())


Missing Values:
0    0
dtype: int64


In [4]:
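# Remove duplicate rows. Note that identical baskets from different customers
# are also dropped here, which changes the support values computed later.
df.drop_duplicates(inplace=True)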

In [5]:
print(df.columns)

Index([0], dtype='int64')


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5176 entries, 0 to 7500
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5176 non-null   object
dtypes: object(1)
memory usage: 80.9+ KB


In [7]:
# Split the items in each transaction into a list
df['Transaction'] = df[0].apply(lambda x: x.split(','))

# Remove duplicates within each transaction
df['Transaction'] = df['Transaction'].apply(lambda x: list(set(x)))

# Drop the original column
df.drop(columns=[0], inplace=True)
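
One design point about the cell above: `list(set(x))` removes repeated items but does not preserve their order, and `split(',')` keeps any whitespace around item names. Below is a minimal sketch of an order-preserving alternative, assuming each row is a comma-separated string. The helper name `parse_transaction` and the sample string are hypothetical and not taken from the dataset.

```python
def parse_transaction(raw: str) -> list[str]:
    """Split a comma-separated basket, trim whitespace, drop empties and repeats."""
    items = (item.strip() for item in raw.split(","))
    # dict.fromkeys keeps the first occurrence of each item and preserves order
    return list(dict.fromkeys(item for item in items if item))


# Hypothetical basket with a repeated item, stray spaces, and a trailing comma
sample = "milk, bread,milk,eggs, "
print(parse_transaction(sample))  # ['milk', 'bread', 'eggs']
```

In the notebook this could be applied with `df[0].apply(parse_transaction)`, though doing so may change the recorded outputs if the raw strings contain padded item names.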

**Association Rule Mining:**

In [8]:
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [9]:
# Prepare the data for the Apriori algorithm
te = TransactionEncoder()
te_ary = te.fit(df['Transaction']).transform(df['Transaction'])
df_apriori = pd.DataFrame(te_ary, columns=te.columns_)



In [10]:
# Apply the Apriori algorithm
frequent_itemsets = apriori(df_apriori, min_support=0.01, use_colnames=True)



In [11]:
# Generate the association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
# Display the results
rules.sort_values(by='lift', ascending=False).head()



| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 425 | (whole wheat pasta) | (olive oil) | 0.040572 | 0.087713 | 0.011012 | 0.271429 | 3.094525 | 0.007454 | 1.252159 | 0.705471 |
| 424 | (olive oil) | (whole wheat pasta) | 0.087713 | 0.040572 | 0.011012 | 0.125551 | 3.094525 | 0.007454 | 1.09718 | 0.741925 |
| 794 | (mineral water, milk) | (soup) | 0.067813 | 0.070904 | 0.012365 | 0.182336 | 2.571586 | 0.007557 | 1.136281 | 0.655593 |
| 795 | (soup) | (mineral water, milk) | 0.070904 | 0.067813 | 0.012365 | 0.174387 | 2.571586 | 0.007557 | 1.129085 | 0.657774 |
| 298 | (ground beef) | (herb & pepper) | 0.135819 | 0.066461 | 0.022798 | 0.167852 | 2.525588 | 0.013771 | 1.121843 | 0.698989 |


In [12]:
top_rules = rules.sort_values(by='lift', ascending=False).head(10)
print(top_rules)

                        antecedents                     consequents  \
425             (whole wheat pasta)                     (olive oil)   
424                     (olive oil)             (whole wheat pasta)   
794           (mineral water, milk)                          (soup)   
795                          (soup)           (mineral water, milk)   
298                   (ground beef)                 (herb & pepper)   
299                 (herb & pepper)                   (ground beef)   
734         (shrimp, mineral water)             (frozen vegetables)   
739             (frozen vegetables)         (shrimp, mineral water)   
718  (spaghetti, frozen vegetables)                   (ground beef)   
719                   (ground beef)  (spaghetti, frozen vegetables)   

     antecedent support  consequent support   support  confidence      lift  \
425            0.040572            0.087713  0.011012    0.271429  3.094525   
424            0.087713            0.040572  0.011012    0.1



**Analysis and Interpretation:**

In [13]:
# Sort rules by confidence and lift
rules_sorted = rules.sort_values(by=['confidence', 'lift'], ascending=False)

# Display the top 10 rules
print(rules_sorted.head(10))

                          antecedents      consequents  antecedent support  \
793                      (soup, milk)  (mineral water)            0.021445   
711  (ground beef, frozen vegetables)  (mineral water)            0.024536   
829                 (soup, spaghetti)  (mineral water)            0.020672   
763           (ground beef, pancakes)  (mineral water)            0.020866   
498              (chocolate, chicken)  (mineral water)            0.021252   
775                 (olive oil, milk)  (mineral water)            0.024150   
717  (ground beef, frozen vegetables)      (spaghetti)            0.024536   
598            (chocolate, olive oil)  (mineral water)            0.023570   
751               (ground beef, milk)  (mineral water)            0.031685   
667               (ground beef, eggs)  (mineral water)            0.028787   

     consequent support   support  confidence      lift  leverage  conviction  \
793            0.299845  0.012365    0.576577  1.922913  0.0



**Support:** The fraction of transactions in the dataset that contain the itemset.

**Confidence:** The fraction of transactions containing the antecedent (left-hand side) that also contain the consequent (right-hand side).

**Lift:** How much more likely the consequent is given the antecedent than it would be if the two were independent. A lift greater than 1 indicates a positive association; the further above 1, the stronger the association.
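
As a quick sanity check on these definitions, the sketch below recomputes confidence and lift for the top rule, (whole wheat pasta) → (olive oil), from the support values printed in the output above. The variable names are illustrative. Because the inputs are the rounded values shown in the table, the results should agree with the reported 0.271429 and 3.094525 only to about three or four decimal places.

```python
import math

# Values copied from the rule (whole wheat pasta) -> (olive oil) in the output above
antecedent_support = 0.040572
consequent_support = 0.087713
support = 0.011012

# confidence = support(A and C together) / support(A)
confidence = support / antecedent_support
# lift = confidence / support(C)
lift = confidence / consequent_support

print(f"confidence = {confidence:.4f}")
print(f"lift       = {lift:.4f}")

# Compare with the values mlxtend reported, allowing for rounding of the inputs
print(math.isclose(confidence, 0.271429, abs_tol=1e-3))
print(math.isclose(lift, 3.094525, abs_tol=1e-2))
```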

2. Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.

Based on the discovered association rules:

### 1. **High Lift Values:**
   - **Interpretation:** A lift above 1 means the items in the rule are bought together more often than they would be if purchases were independent; the higher the lift, the stronger the association.
   - **Insight:** In the results above, the rule (whole wheat pasta) → (olive oil) has the highest lift, about 3.09. Customers who buy whole wheat pasta are roughly three times as likely to buy olive oil as customers overall, which makes the pair a candidate for targeted promotions or adjacent product placement.

### 2. **High Confidence Rules:**
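   - **Interpretation:** High confidence (close to 1) means that when the antecedent is purchased, the consequent is usually purchased as well.
   - **Insight:** The highest-confidence rule here, (soup, milk) → (mineral water), has a confidence of about 0.58 and a lift of about 1.92. Mineral water appears in roughly 30% of all transactions, so part of that confidence reflects its overall popularity. Rules that combine high confidence with a lift well above 1 are the better candidates for bundling or for discounting one item when the other is purchased.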

### 3. **Frequent Itemsets with High Support:**
   - **Interpretation:** High support indicates that a particular combination of items is commonly purchased together.
   - **Insight:** If "bread, butter" has high support, these items are frequently bought together. Consider placing these items near each other in the store or offering a combined discount.

### 4. **Cross-Selling Opportunities:**
   - **Interpretation:** Identify products that are frequently bought together but are not typically associated, offering cross-selling opportunities.
   - **Insight:** If "green tea" → "honey" has a strong rule, promoting honey alongside green tea could increase sales of both.

### 5. **Product Affinity:**
   - **Interpretation:** The rules can help identify product affinities that indicate customer preferences and shopping habits.
   - **Insight:** If several rules suggest that customers who buy healthy products like "spinach, olive oil, avocado" also buy "low fat yogurt", it shows a trend towards health-conscious shopping.


The association rules provide valuable insights into customer purchasing behavior, helping you understand which products are commonly bought together. This can inform marketing strategies, product placements, and inventory management to enhance sales and customer satisfaction.

**INTERVIEW QUESTIONS**

**1. What is lift and why is it important in association rules?**

**Lift** is a key metric in association rule mining that measures the strength of a rule over the baseline likelihood of the consequent (right-hand side) occurring independently of the antecedent (left-hand side).

### **Definition of Lift:**
Lift is defined as:

\[
\text{Lift} = \frac{\text{Confidence of the rule}}{\text{Support of the consequent}}
\]

Or alternatively:

\[
\text{Lift} = \frac{P(A \cap B)}{P(A) \times P(B)}
\]

Where:
- \( P(A \cap B) \) is the probability that both A and B occur together.
- \( P(A) \) is the probability that A occurs.
- \( P(B) \) is the probability that B occurs.
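
The two definitions above are equivalent. As a short derivation, write the confidence of the rule as the conditional probability \( P(B \mid A) \) and the support of the consequent as \( P(B) \):

\[
\text{Lift} = \frac{\text{Confidence of the rule}}{\text{Support of the consequent}} = \frac{P(B \mid A)}{P(B)} = \frac{P(A \cap B) / P(A)}{P(B)} = \frac{P(A \cap B)}{P(A) \times P(B)}
\]

The final form is symmetric in A and B, which is why mirrored rules in the output (for example rows 424 and 425) share the same lift even though their confidences differ.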

### **Interpretation of Lift:**
- **Lift > 1:** The presence of the antecedent increases the likelihood of the consequent. The items are positively associated, meaning they are more likely to be purchased together than by random chance.
- **Lift = 1:** The antecedent and consequent are independent; knowing the antecedent does not provide any information about the consequent.
- **Lift < 1:** The antecedent reduces the likelihood of the consequent, indicating a negative association.

### **Importance of Lift in Association Rules:**
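1. **Measures Association Strength:** Lift quantifies how far a rule departs from independence, so it highlights rules that reflect a real relationship between items rather than ones that are merely frequent. It is not a statistical significance test, however: a rule with very low support can show a high lift by chance.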
2. **Eliminates Coincidental Associations:** Unlike support and confidence, lift accounts for the overall occurrence of items, filtering out rules that are frequent only due to high individual item popularity.
3. **Guides Decision-Making:** A high lift value indicates a meaningful pattern, useful for marketing strategies, cross-selling, and promotions.
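
To illustrate the second point, here is a minimal sketch on hypothetical toy data (ten made-up baskets, not the dataset used above). The helper functions `support`, `confidence`, and `lift` are written for this example and are not part of any library. The rule bread → water has high confidence only because water is in most baskets, and its lift of 1 exposes that; tea → honey has a lift well above 1.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence of the rule divided by support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical baskets: water is popular (8 of 10), tea and honey are rare
transactions = [
    {"water", "bread"}, {"water", "bread"}, {"water", "bread"}, {"water", "bread"},
    {"bread"},
    {"water"}, {"water"},
    {"water", "tea", "honey"}, {"water", "tea", "honey"},
    {"honey"},
]

for antecedent, consequent in [({"bread"}, {"water"}), ({"tea"}, {"honey"})]:
    c = confidence(transactions, antecedent, consequent)
    l = lift(transactions, antecedent, consequent)
    print(f"{sorted(antecedent)} -> {sorted(consequent)}: confidence={c:.2f}, lift={l:.2f}")
```

Here bread → water reaches a confidence of 0.80, which simply equals the share of baskets containing water, so the lift is 1.00. The rule tea → honey has a lift of about 3.33.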

**2. What are support and confidence? How do you calculate them?**

**Support** and **Confidence** are two fundamental metrics used in association rule mining to evaluate the strength and relevance of discovered rules.

### **1. Support:**

**Definition:**
- Support measures how frequently an itemset appears in the dataset. It reflects the proportion of transactions in the dataset that contain the itemset.

**Calculation:**
\[
\text{Support} = \frac{\text{Number of transactions containing the itemset}}{\text{Total number of transactions}}
\]

**Example:**
- If the itemset {bread, butter} appears in 100 out of 1,000 transactions, the support is:
\[
\text{Support} = \frac{100}{1000} = 0.1
\]
- This means 10% of all transactions contain both bread and butter.

**Importance:**
- Support helps identify itemsets that are commonly bought together. High support indicates a frequent pattern that may be of interest.

### **2. Confidence:**

**Definition:**
- Confidence measures the likelihood that a transaction containing the antecedent (left-hand side) also contains the consequent (right-hand side). It represents the conditional probability of the consequent given the antecedent.

**Calculation:**
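\[
\text{Confidence} = \frac{\text{Support of (Antecedent} \cup \text{Consequent)}}{\text{Support of Antecedent}}
\]
Here Antecedent ∪ Consequent is the combined itemset, so its support is the fraction of transactions that contain both sides of the rule.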

**Example:**
- Consider the rule {bread} → {butter}:
  - If {bread, butter} appears in 100 transactions, and {bread} appears in 200 transactions, the confidence is:
\[
\text{Confidence} = \frac{100}{200} = 0.5
\]
- This means that in 50% of the transactions where bread is purchased, butter is also purchased.

**Importance:**
- Confidence provides insight into the reliability of a rule. High confidence means that the rule is frequently true in the dataset.
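
The two worked examples above can be reproduced in a few lines. This is a sketch on synthetic data built to match the stated counts (1,000 transactions, 200 containing bread, 100 of those also containing butter). The `milk` filler baskets and the helper `count` are assumptions made for illustration.

```python
# Synthetic transactions matching the counts in the examples above
transactions = (
    [{"bread", "butter"}] * 100   # bread and butter together
    + [{"bread"}] * 100           # bread without butter
    + [{"milk"}] * 800            # neither item (hypothetical filler)
)


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions)


n = len(transactions)
support_both = count({"bread", "butter"}) / n
confidence = count({"bread", "butter"}) / count({"bread"})

print(support_both, confidence)  # 0.1 0.5
```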

**3. What are some limitations or challenges of association rule mining?**

Association rule mining has some challenges and limitations:

1. **Complexity:** It can be slow and use a lot of memory, especially with large datasets.
2. **Too Many Rules:** It often generates more rules than are useful, making it hard to find the important ones.
3. **Threshold Setting:** Choosing the right support and confidence levels is tricky; thresholds set too high discard useful rules, while thresholds set too low flood the output with trivial ones.
4. **Rare Items:** It may miss important but infrequent patterns because of low support.
5. **No Causality:** It shows associations but doesn’t prove one item causes another to be bought.
6. **Handling High Dimensions:** The number of candidate itemsets grows exponentially with the number of distinct items, so useful patterns become harder to find.
7. **Subjectivity:** Deciding which rules are important can vary by person.
8. **Context Ignored:** It doesn’t consider when or by whom items are bought.
9. **Continuous Data:** It struggles with non-categorical data unless the values are discretized (binned) first.
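
To give a sense of scale for the "Too Many Rules" and "Handling High Dimensions" points, the sketch below counts how many distinct rules A → C (with A and C non-empty and disjoint) are possible over `d` items when no thresholds are applied. The closed form 3^d − 2^(d+1) + 1 is a standard counting result, checked here against brute-force enumeration; the function names are illustrative.

```python
from itertools import product


def count_rules_brute_force(d: int) -> int:
    """Count rules A -> C where A and C are disjoint, non-empty subsets of d items."""
    total = 0
    # Each item is placed in the antecedent (0), the consequent (1), or neither (2)
    for assignment in product(range(3), repeat=d):
        if 0 in assignment and 1 in assignment:
            total += 1
    return total


def count_rules_closed_form(d: int) -> int:
    """3^d placements, minus those with an empty antecedent or empty consequent."""
    return 3**d - 2 ** (d + 1) + 1


# The two counts should agree for small d
for d in (3, 6):
    assert count_rules_brute_force(d) == count_rules_closed_form(d)

for d in (3, 6, 10, 20):
    print(d, count_rules_closed_form(d))
```

Six items already allow 602 possible rules and ten items allow 57,002, which is why minimum support and confidence thresholds are needed to prune the search.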