# Association Rules

The objective of this assignment is to introduce students to rule mining techniques, with a focus on market basket analysis, and to provide hands-on experience.

## Step 1: Data Preprocessing
Preprocess the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.
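
The cleaning steps named above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual preprocessing: the helper name `clean_transactions` and the sample rows are assumptions, and "removing duplicates" is interpreted here as removing repeated items within a single basket.

```python
def clean_transactions(raw_rows):
    """Turn raw comma-separated strings into clean lists of items."""
    cleaned = []
    for row in raw_rows:
        # Skip missing values (None, NaN, or anything that is not a string)
        if not isinstance(row, str):
            continue
        items = []
        for item in row.split(","):
            item = item.strip().lower()
            # Drop empty fragments and repeated items within one basket
            if item and item not in items:
                items.append(item)
        # Keep only baskets that still contain at least one item
        if items:
            cleaned.append(items)
    return cleaned


# Hypothetical sample rows, including a missing value and an empty string
raw = ["burgers,meatballs,eggs", None, " Turkey , avocado,turkey", ""]
print(clean_transactions(raw))
```

Identical baskets from different customers are kept on purpose, since repeated transactions are exactly what support counts.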


In [1]:
import pandas as pd


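# Load the dataset
# Note: the file has no header row, so the default header=0 turns the first
# transaction into the column name (see the output below). Passing header=None
# would keep that transaction as data.
df = pd.read_excel('Online Retail.xlsx')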

# Display basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 1 columns):
 #   Column                                                                                                                                                                                                                           Non-Null Count  Dtype 
---  ------                                                                                                                                                                                                                           --------------  ----- 
 0   shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil  7500 non-null   object
dtypes: object(1)
memory usage: 58.7+ KB


In [2]:
df.head(7500)

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
...,...
7495,"butter,light mayo,fresh bread"
7496,"burgers,frozen vegetables,eggs,french fries,ma..."
7497,chicken
7498,"escalope,green tea"


In [3]:
df.columns = ['Transaction']


In [4]:
df.head()

Unnamed: 0,Transaction
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt


In [5]:
# Split each comma-separated transaction string into a list of items
# (the result is a pandas Series whose values are lists)
transactions = df['Transaction'].apply(lambda x: x.split(','))

In [6]:
transactions.head()

0                           [burgers, meatballs, eggs]
1                                            [chutney]
2                                    [turkey, avocado]
3    [mineral water, milk, energy bar, whole wheat ...
4                                     [low fat yogurt]
Name: Transaction, dtype: object

## Step 2: Association Rule Mining

In [7]:
!pip install mlxtend




In [21]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Initialize the transaction encoder
te = TransactionEncoder()

# Fit and transform the transaction data
te_ary = te.fit(transactions).transform(transactions)

# Convert the transaction data to a DataFrame
df = pd.DataFrame(te_ary, columns=te.columns_)


# Define support and confidence thresholds to explore
support_values = [0.01, 0.02, 0.03, 0.04, 0.05]
confidence_values = [0.1, 0.2, 0.3, 0.4, 0.5]

# Function to generate rules for given support and confidence thresholds
def generate_rules(support, confidence):
    # Apply the Apriori algorithm to find frequent itemsets
    frequent_itemsets = apriori(df, min_support=support, use_colnames=True)
    # Check if frequent itemsets is empty
    if frequent_itemsets.empty:
        return pd.DataFrame()    
    # Generate rules
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=confidence)
    return rules

In [22]:
# Explore different support and confidence thresholds
all_rules = []
for support in support_values:
    for confidence in confidence_values:
        rules = generate_rules(support, confidence)
        if not rules.empty:
            rules['support_threshold'] = support
            rules['confidence_threshold'] = confidence
            all_rules.append(rules)

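# Combine all rules into a single DataFrame.
# Note: a rule that satisfies several threshold pairs appears once per pair,
# so the combined DataFrame contains repeated rules.
all_rules_df = pd.concat(all_rules, ignore_index=True)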

## Step 3: Analysis and Interpretation

In [23]:
# Sort rules by lift in descending order
all_rules_df = all_rules_df.sort_values(by='lift', ascending=False)

# Display the top 10 rules for analysis
top_10_rules = all_rules_df.head(10)
print(top_10_rules)

# Interpret the results
for idx, row in top_10_rules.iterrows():
    antecedents = ', '.join(list(row['antecedents']))
    consequents = ', '.join(list(row['consequents']))
    print(f"Rule: {antecedents} -> {consequents}")
    print(f"Support: {row['support']:.2f}, Confidence: {row['confidence']:.2f}, Lift: {row['lift']:.2f}\n")

                    antecedents                 consequents  \
187             (herb & pepper)               (ground beef)   
499             (herb & pepper)               (ground beef)   
398             (herb & pepper)               (ground beef)   
188               (ground beef)             (herb & pepper)   
475  (spaghetti, mineral water)               (ground beef)   
310               (ground beef)  (spaghetti, mineral water)   
307  (spaghetti, mineral water)               (ground beef)   
318                 (olive oil)  (spaghetti, mineral water)   
315  (spaghetti, mineral water)                 (olive oil)   
392                  (tomatoes)         (frozen vegetables)   

     antecedent support  consequent support   support  confidence      lift  \
187            0.049467            0.098267  0.016000    0.323450  3.291555   
499            0.049467            0.098267  0.016000    0.323450  3.291555   
398            0.049467            0.098267  0.016000    0.323450  3.
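
Because the rules found at every threshold pair are concatenated, the same rule shows up several times in the ranking (rows 187, 499, and 398 above are identical). One way to collapse such repeats is sketched below on a small hand-made table standing in for `all_rules_df`. The item names, metric values, and the name `unique_rules` are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Toy stand-in for all_rules_df: the first rule was found at two threshold pairs.
# All values here are made up for illustration.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"bread"}), frozenset({"bread"}), frozenset({"milk"})],
    "consequents": [frozenset({"butter"}), frozenset({"butter"}), frozenset({"eggs"})],
    "lift": [3.0, 3.0, 2.0],
    "support_threshold": [0.01, 0.01, 0.01],
    "confidence_threshold": [0.1, 0.2, 0.1],
})

# Keep one row per (antecedents, consequents) pair, then rank by lift
unique_rules = (
    toy_rules
    .drop_duplicates(subset=["antecedents", "consequents"])
    .sort_values(by="lift", ascending=False)
    .reset_index(drop=True)
)
print(unique_rules[["antecedents", "consequents", "lift"]])
```

The metrics of a rule depend only on the data, not on the thresholds used to find it, so keeping the first occurrence loses no information.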




### a. What is lift and why is it important in Association rules?
Lift is a measure used in association rule mining to evaluate the strength and importance of a rule. It is defined as the ratio of the observed support of an itemset (A and B occurring together) to the expected support if A and B were independent. Mathematically, it can be expressed as:

$$\text{Lift}(A \rightarrow B) = \frac{P(A \cap B)}{P(A) \times P(B)}$$

Where:

1. P(A∩B) is the support of the itemset {A, B}, i.e., the proportion of transactions containing both A and B.

2. P(A) is the support of item A.

3. P(B) is the support of item B.

Importance of Lift:

1. Interpreting Association Strength: A lift greater than 1 indicates that A and B occur together more often than would be expected if they were independent. A lift of exactly 1 indicates independence, and a lift less than 1 suggests a negative association.

2. Comparison Across Rules: Because lift divides by the individual supports of A and B, rules can be compared on a common scale even when their items differ widely in frequency.

3. Identifying Interesting Rules: Rules with a high lift are often more interesting and actionable because they reveal stronger associations.
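
As a worked check of the definition, the sketch below recomputes the lift of the rule herb & pepper -> ground beef from the supports printed in the Step 3 output (antecedent support 0.049467, consequent support 0.098267, joint support 0.016000). The function name `lift` is an illustrative choice, not part of any library.

```python
def lift(support_ab, support_a, support_b):
    """Observed joint support divided by the support expected under independence."""
    return support_ab / (support_a * support_b)


# Supports as printed in the rules table for herb & pepper -> ground beef
value = lift(0.016000, 0.049467, 0.098267)
print(round(value, 2))
```

The result is about 3.29, which agrees with the reported lift of 3.291555 up to rounding of the printed inputs: the two items appear together roughly three times as often as independence would predict.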


### b. What are support and confidence? How do you calculate them?
Support and Confidence are fundamental metrics used in association rule mining to measure the significance and reliability of the rules.

Support: The support of an itemset is the proportion of transactions in the dataset that contain the itemset. It indicates how frequently an itemset appears in the dataset. For an itemset {A,B}, support is calculated as:

$$\text{Support}(A \cap B) = \frac{\text{Number of transactions containing } \{A, B\}}{\text{Total number of transactions}}$$

Confidence: Confidence of a rule A→B is the proportion of transactions containing A that also contain B. It measures the reliability of the inference made by the rule. Confidence is calculated as:

$$\text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cap B)}{\text{Support}(A)}$$

Where:

1. Support(A∩B) is the support of the itemset containing both A and B.

2. Support(A) is the support of the itemset containing A.
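
Both definitions can be checked by hand on a tiny example. The sketch below uses four made-up transactions. The data and the helper names `support` and `confidence` are illustrative assumptions, not part of the assignment dataset or of mlxtend.

```python
# Four hypothetical baskets, each stored as a set of items
transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)


print(support({"bread", "butter"}, transactions))       # 2 of 4 baskets
print(confidence({"bread"}, {"butter"}, transactions))  # 2 of the 3 bread baskets
```

Here bread and butter occur together in 2 of 4 baskets (support 0.5), and 2 of the 3 baskets containing bread also contain butter (confidence 2/3).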


### c. What are some limitations or challenges of association rule mining?
While association rule mining is a powerful technique, it comes with several limitations and challenges:

1. Scalability: Large datasets with many items can lead to an exponential number of possible itemsets, making the computation of frequent itemsets and rules very resource-intensive.

2. Data Sparsity: In datasets where items are sparsely distributed, finding meaningful associations can be challenging. Many discovered rules might have low support, making them less useful.

3. Interestingness Measures: Metrics like support, confidence, and lift may not always capture the most interesting or useful rules. Sometimes, additional measures such as conviction, leverage, or novelty are needed.

4. Redundancy: Association rule mining can produce a large number of redundant or trivial rules. Filtering and interpreting these rules require additional effort and domain knowledge.

5. Overfitting: Rules may fit the specific dataset well but may not generalize to new data. This can happen if the rules capture noise or spurious correlations.

6. Actionability: Not all discovered rules are actionable or useful in practice. Determining which rules are actionable requires domain expertise and further analysis.

7. Parameter Sensitivity: The results of association rule mining are sensitive to the choice of parameters such as minimum support and minimum confidence thresholds. Setting these parameters appropriately can be challenging.

8. Imbalanced Data: When dealing with imbalanced datasets, common items may overshadow rare but potentially interesting items, leading to biased results.
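
To make the additional interestingness measures concrete, the sketch below computes leverage and conviction for the rule herb & pepper -> ground beef, using the supports and confidence printed in the Step 3 output. It assumes the standard definitions, leverage = Support(A∩B) − Support(A) × Support(B) and conviction = (1 − Support(B)) / (1 − Confidence(A→B)). The function names are illustrative.

```python
def leverage(support_ab, support_a, support_b):
    """Difference between observed joint support and the support expected under independence."""
    return support_ab - support_a * support_b


def conviction(support_b, confidence_ab):
    """Ratio of how often B is expected to be absent to how often the rule actually fails."""
    if confidence_ab == 1.0:
        # The rule never fails, so conviction is unbounded
        return float("inf")
    return (1 - support_b) / (1 - confidence_ab)


# Values as printed in the rules table for herb & pepper -> ground beef
lev = leverage(0.016000, 0.049467, 0.098267)
conv = conviction(0.098267, 0.323450)
print(round(lev, 4), round(conv, 2))
```

A leverage of 0 and a conviction of 1 both correspond to independence, so the positive leverage (about 0.011) and the conviction above 1 (about 1.33) agree with the high lift reported for this rule.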
