In [11]:
# Data Preprocessing
import pandas as pd
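# Note: the sheet has no header row, so with the default header=0 the first
# transaction is read as the column name (see the output below); passing
# header=None to parse() would keep it as data.
new_data = pd.ExcelFile('Online retail.xlsx')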
df_new = new_data.parse('Sheet1')
df_new.head()


Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt


In [12]:
# Renaming the column for clarity
df_new.columns = ["Transaction"]
# Removing rows with missing transactions
df_new.dropna(subset=["Transaction"], inplace=True)
# Removing duplicate transactions, if any
df_new.drop_duplicates(subset=["Transaction"], inplace=True)
# Converting each transaction string into a list of items for further processing
transactions_new = df_new["Transaction"].apply(lambda x: x.split(','))
transactions_new.head()


Unnamed: 0,Transaction
0,"[burgers, meatballs, eggs]"
1,[chutney]
2,"[turkey, avocado]"
3,"[mineral water, milk, energy bar, whole wheat ..."
4,[low fat yogurt]


In [23]:
# Association Rule Mining
from itertools import combinations
from collections import defaultdict

# Flattening list of transactions for manual apriori-like functionality
all_items = [item.strip() for transaction in transactions_new for item in transaction]
# Counting individual items
item_counts = defaultdict(int)
for item in all_items:
    item_counts[item] += 1
# Calculating support for single items
total_transactions = len(transactions_new)
item_support = {item: count / total_transactions for item, count in item_counts.items()}
# Setting a threshold for minimum support
min_support = 0.05
frequent_items = {item: support for item, support in item_support.items() if support >= min_support}
# Generating pairs of items and counting their occurrences
pair_counts = defaultdict(int)
for transaction in transactions_new:
    transaction = [item.strip() for item in transaction]
    for pair in combinations(transaction, 2):
        pair_counts[pair] += 1
# Calculating support for item pairs
pair_support = {pair: count / total_transactions for pair, count in pair_counts.items()}
# Filtering pairs based on minimum support
frequent_pairs = {pair: support for pair, support in pair_support.items() if support >= min_support}
# Converting results to DataFrame for clarity
frequent_pairs_df = pd.DataFrame(
    [(pair[0], pair[1], support) for pair, support in frequent_pairs.items()],
    columns=["Item A", "Item B", "Support"],
)

# Displaying frequent item pairs with support
print("Frequent Item Pairs")
frequent_pairs_df.head()

Frequent Item Pairs


Unnamed: 0,Item A,Item B,Support
0,mineral water,milk,0.067826
1,mineral water,eggs,0.070145
2,spaghetti,mineral water,0.085024
3,spaghetti,eggs,0.051401
4,spaghetti,milk,0.050048


In [27]:
# Analysis and Interpretation

# Calculating confidence and lift for each pair
rules = []
# The loop variable is named `support` so it does not shadow the pair_support dict
for (item_a, item_b), support in frequent_pairs.items():
    confidence = support / item_support[item_a]
    lift = confidence / item_support[item_b]
    rules.append((item_a, item_b, support, confidence, lift))
# Creating a DataFrame for the rules
rules_df = pd.DataFrame(rules, columns=["Item A", "Item B", "Support", "Confidence", "Lift"])
# Sorting by Lift to identify the strongest associations
rules_df = rules_df.sort_values(by="Lift", ascending=False)
# Displaying the rules
print("Association Rules Analysis")
print(rules_df.head())




Association Rules Analysis
              Item A         Item B   Support  Confidence      Lift
5        ground beef      spaghetti  0.055845    0.411095  1.790756
6        ground beef  mineral water  0.058744    0.432432  1.442835
0      mineral water           milk  0.067826    0.226306  1.330831
8  frozen vegetables  mineral water  0.050435    0.388393  1.295895
4          spaghetti           milk  0.050048    0.218013  1.282068




### Analysis of the Rules:
1. **Strongest Associations**:
   - **Ground beef → Spaghetti**: This rule has the highest lift in the table (1.79) and a confidence of 41.1%. Customers who buy ground beef are about 1.8 times as likely to also buy spaghetti as customers overall, which suggests a meal-preparation pattern.
   - **Ground beef → Mineral water**: Confidence is 43.2% but lift is only 1.44, so part of that confidence reflects how popular mineral water is in general. The remaining association may come from meal combinations or grocery bundling.

2. **Other Interesting Patterns**:
   - **Mineral water → Milk**: A lift of 1.33 indicates a moderate association between these staples, suggesting customers often stock these together.
   - **Frozen vegetables → Mineral water**: This pair (lift 1.30) may reflect health-conscious or meal-planning shopping behavior.
   - **Spaghetti → Milk**: With a lift of 1.28, this pairing may point to complementary needs for meal preparation.

### Insights into Customer Behavior:
- **Meal Planning**: The strongest association, ground beef with spaghetti, reflects meal preparation habits.
- **Healthy Lifestyles**: The pairing of frozen vegetables and mineral water suggests health-conscious shopping preferences.
- **Staple Goods**: Items like milk and mineral water are purchased together, indicating that customers often stock up on everyday essentials.
- **Complementary Products**: Pairs such as spaghetti and milk suggest that customers pick up complementary grocery items in the same trip.


### Interview Question Answers:

#### 1. **What is Lift and Why is it Important in Association Rules?**
   - **Definition**: Lift measures how much more often two items are purchased together than would be expected if they were purchased independently.
   - **Formula**:
     Lift(A → B) = Confidence(A → B) / Support(B) = Support(A ∪ B) / (Support(A) × Support(B))
   - **Importance**:
     - A lift greater than 1 indicates a positive association: the presence of item A increases the likelihood of purchasing item B. A lift of 1 means the items are independent, and a lift below 1 means they co-occur less often than expected.
     - It helps identify meaningful and non-trivial relationships between items.
     - Lift is crucial for prioritizing rules, as it measures the strength of a rule beyond mere co-occurrence.
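
As a worked check of this interpretation, the sketch below takes the figures from the "ground beef → spaghetti" row of the rules table and compares the observed pair support with the support expected under independence. The single-item supports are not printed in the notebook, so they are back-calculated here from the rounded table values and are only approximate.

```python
# Figures copied from the "ground beef -> spaghetti" row of the rules table.
pair_support = 0.055845
confidence = 0.411095
reported_lift = 1.790756

# Implied single-item supports (approximate, since the inputs are rounded).
support_a = pair_support / confidence    # Support(ground beef)
support_b = confidence / reported_lift   # Support(spaghetti)

# If the two items were independent, the pair would appear this often.
expected_if_independent = support_a * support_b

# Lift is the ratio of observed to expected co-occurrence.
recomputed_lift = pair_support / expected_if_independent

print(f"observed pair support:   {pair_support:.4f}")
print(f"expected if independent: {expected_if_independent:.4f}")
print(f"lift (observed/expected): {recomputed_lift:.4f}")
```

Because the formula is symmetric in A and B, lift for "spaghetti → ground beef" would be the same value; only confidence depends on the direction of the rule.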

---

#### 2. **What is Support and Confidence? How Do You Calculate Them?**
   - **Support**:
     - **Definition**: The proportion of transactions that contain a specific item or itemset.
     - **Formula**:
       Support(A) = (Number of transactions containing A) / (Total number of transactions)
     - **Importance**: It ensures the itemset is frequent enough to be considered significant.

   - **Confidence**:
     - **Definition**: The likelihood that item B is purchased when item A is purchased.
     - **Formula**:
       Confidence(A → B) = Support(A ∪ B) / Support(A)
     - **Importance**: It measures the reliability of the rule A → B.
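
The two formulas can be sketched on a small example. The baskets and the helper names `support` and `confidence` below are hypothetical and chosen for illustration; they are not taken from the notebook.

```python
# Hypothetical toy baskets for illustration.
baskets = [
    {"bread", "butter"},
    {"bread"},
    {"bread", "butter", "jam"},
    {"jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / support(antecedent, transactions)

bread_to_butter = confidence({"bread"}, {"butter"}, baskets)
butter_to_bread = confidence({"butter"}, {"bread"}, baskets)
print(support({"bread"}, baskets), bread_to_butter, butter_to_bread)
```

Note that confidence is directional: in this toy data every butter basket contains bread, but only two of the three bread baskets contain butter, so the two rules have different confidence even though they share the same support.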

---

#### 3. **What Are Some Limitations or Challenges of Association Rules Mining?**
   - **Scalability**: Mining association rules in large datasets can be computationally expensive due to the exponential growth of candidate itemsets.
   - **Choice of Parameters**: Setting thresholds for support, confidence, and lift can be subjective and impact the relevance of discovered rules.
   - **Interpretability**: A large number of rules can overwhelm the analysis, making it difficult to identify the most meaningful ones.
   - **Sparsity**: Sparse datasets with many unique items may result in low support values, making it challenging to find significant rules.
   - **Redundancy**: Many rules may convey similar information, requiring additional filtering or post-processing.
   - **Context Dependency**: Rules may lack context or domain-specific insights, which are necessary for actionable decisions.
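
To make the scalability point concrete, the sketch below counts how many candidate itemsets exist for a given number of distinct items. The catalogue size of 120 is a hypothetical figure for illustration, not a count taken from this dataset.

```python
from math import comb

def candidate_counts(n_items, max_size):
    """Number of possible itemsets of each size from 1 up to max_size."""
    return {k: comb(n_items, k) for k in range(1, max_size + 1)}

n_items = 120  # hypothetical number of distinct products
counts = candidate_counts(n_items, 3)

# Every non-empty subset of the catalogue is a potential itemset.
total_nonempty_itemsets = 2**n_items - 1

for size, count in counts.items():
    print(f"itemsets of size {size}: {count}")
print(f"all non-empty itemsets: {total_nonempty_itemsets}")
```

This growth is why a minimum support threshold matters in practice: itemsets whose subsets are already infrequent never need to be counted.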

