## Association Rule

In [5]:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Load dataset
file_path = 'OnlineRetail.csv'
df = pd.read_csv(file_path, encoding='ISO-8859-1')

# Data Preprocessing
# Remove missing values
df.dropna(subset=['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID'], inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Remove transactions with negative or zero quantity
df = df[df['Quantity'] > 0]

# Create a basket with the quantity of each product per transaction
basket = (df.groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Convert quantities to booleans (for association rule mining).
# mlxtend works best with a boolean one-hot DataFrame, and a vectorized
# comparison avoids the deprecated DataFrame.applymap call.
basket = basket > 0

# Apply the FP-Growth algorithm to find frequent itemsets
frequent_itemsets = fpgrowth(basket, min_support=0.01, use_colnames=True)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Filter rules based on support, confidence, and lift thresholds
filtered_rules = rules[(rules['support'] >= 0.01) &
                       (rules['confidence'] >= 0.5) &
                       (rules['lift'] >= 1.2)]

# Print the rules that pass all three filters
print(filtered_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

# Save the rules to a CSV file for further analysis
filtered_rules.to_csv('association_rules_output.csv', index=False)



                            antecedents                          consequents  \
14          (POPPY'S PLAYHOUSE KITCHEN)         (POPPY'S PLAYHOUSE BEDROOM )   
15         (POPPY'S PLAYHOUSE BEDROOM )          (POPPY'S PLAYHOUSE KITCHEN)   
18         (ALARM CLOCK BAKELIKE GREEN)          (ALARM CLOCK BAKELIKE RED )   
19          (ALARM CLOCK BAKELIKE RED )         (ALARM CLOCK BAKELIKE GREEN)   
23          (ALARM CLOCK BAKELIKE PINK)          (ALARM CLOCK BAKELIKE RED )   
..                                  ...                                  ...   
911  (SET OF 12 MINI LOAF BAKING CASES)     (SET OF 6 TEA TIME BAKING CASES)   
913  (SET OF 6 SNACK LOAF BAKING CASES)  (SET OF 12 FAIRY CAKE BAKING CASES)   
914  (SET OF 6 SNACK LOAF BAKING CASES)   (SET OF 12 MINI LOAF BAKING CASES)   
915  (SET OF 12 MINI LOAF BAKING CASES)   (SET OF 6 SNACK LOAF BAKING CASES)   
933        (HAND WARMER RED LOVE HEART)             (HAND WARMER OWL DESIGN)   

      support  confidence       lift  


## Interview Q & A

1. What is lift and why is it important in association rules?

Lift measures the strength of a rule X → Y by comparing the observed support of X and Y together with the support expected if X and Y were independent:

$$\text{Lift}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \times \text{Support}(Y)}$$

In other words, it quantifies how much more likely the two itemsets are to be purchased together than independence would predict.

Importance: Lift helps to identify rules that are not just frequent but also informative. A lift greater than 1 means X and Y occur together more often than expected by chance (a positive association), a lift of 1 means they are independent, and a lift below 1 means they occur together less often than expected. High confidence alone can be misleading when Y is popular on its own; lift corrects for that.
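
As a small worked illustration of lift, the sketch below computes it for two rules on a made-up list of transactions. The toy baskets and the helper names `support` and `lift` are assumptions for illustration and are not part of the notebook code above.

```python
# Toy transactions (hypothetical data, not taken from OnlineRetail.csv)
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(x, y, transactions):
    """Observed joint support divided by the support expected under independence."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))

# bread -> butter: bought together more often than independence predicts (lift > 1)
print(lift({"bread"}, {"butter"}, transactions))

# bread -> jam: bought together less often than independence predicts (lift < 1)
print(lift({"bread"}, {"jam"}, transactions))
```

Here bread and butter each appear in 3 of 5 baskets and together in 2 of 5, so the lift is 0.4 / (0.6 × 0.6), a little above 1. Bread and jam appear together in only 1 of 5 baskets, which gives a lift below 1.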

2. What are support and confidence? How do you calculate them?

Support measures the proportion of transactions in which an item or itemset appears. It helps to identify the most frequent itemsets.

$$\text{Support}(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}$$

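Confidence measures the likelihood that item Y is purchased given that item X is purchased. It is a measure of the rule's reliability.

$$\text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

where Support(X ∪ Y) is the proportion of transactions containing both X and Y, and Support(X) is the support of item X.

Support helps to identify frequent itemsets, and confidence helps to assess the strength of the implication in the association rule.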
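
To make the two measures concrete, here is a minimal sketch that counts them directly on a toy dataset. It takes confidence of X → Y to be the support of X and Y together divided by the support of X. The transactions and the function names `support` and `confidence` are hypothetical and chosen only for illustration.

```python
# Toy transactions (hypothetical data)
transactions = [
    {"tea", "milk"},
    {"tea", "milk", "sugar"},
    {"tea"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing x, the fraction that also contain y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# tea appears in 3 of the 4 transactions
print(support({"tea"}, transactions))

# 2 of the 3 tea transactions also contain milk
print(confidence({"tea"}, {"milk"}, transactions))
```

Note that confidence is not symmetric: swapping the antecedent and the consequent changes the denominator, so X → Y and Y → X can have different confidence even though they share the same support.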

3. What are some limitations or challenges of association rule mining?

* Scalability: Association rule mining can be computationally expensive, especially with large datasets. The number of possible itemsets grows exponentially with the number of items, which can make the process slow and resource-intensive.

* Handling Large Itemsets: Long frequent itemsets are costly in their own right. Every subset of a frequent itemset is also frequent, so a single frequent itemset of k items implies 2^k - 1 frequent non-empty subsets that have to be generated and evaluated.

* Redundancy: Association rule mining may generate a large number of rules, some of which may be redundant or offer similar insights, making it difficult to extract actionable knowledge.

* Interpretability: The rules generated may not always be meaningful or easy to interpret. It can be challenging to determine which rules are practically useful for decision-making.

* Threshold Sensitivity: The results of association rule mining can be sensitive to the thresholds set for support and confidence. Different thresholds can lead to different sets of rules, potentially missing valuable associations.

* Dynamic Data: In dynamic environments where data changes frequently, maintaining up-to-date rules can be challenging. Rules that were significant at one time may become obsolete as data evolves.
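
The scalability and threshold-sensitivity points above can be seen even on a tiny example. The sketch below is a deliberately naive brute-force enumeration, not FP-Growth, run on made-up transactions; the data and the function name `frequent_itemsets` are assumptions for illustration only.

```python
from itertools import combinations

# Toy transactions (hypothetical data)
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]

# All distinct items, in a fixed order
items = sorted(set().union(*transactions))

def frequent_itemsets(transactions, min_support):
    """Brute force: test every non-empty itemset (2**n - 1 candidates for n items)."""
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = sum(set(candidate) <= t for t in transactions) / n
            if s >= min_support:
                result[candidate] = s
    return result

# Scalability: 4 items already give 15 candidate itemsets; 20 items give over a million
print(2 ** len(items) - 1)
print(2 ** 20 - 1)

# Threshold sensitivity: the number of frequent itemsets changes with min_support
for threshold in (0.2, 0.4, 0.6):
    print(threshold, len(frequent_itemsets(transactions, threshold)))
```

Algorithms such as FP-Growth avoid enumerating every candidate, but the number of frequent itemsets they must report still depends heavily on the chosen minimum support.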