# ASSOCIATION RULES

The objective of this assignment is to introduce students to rule mining techniques, with a focus on market basket analysis, and to provide hands-on experience.

## Dataset:
Use the Online retail dataset to mine association rules.

## Data Preprocessing:
Pre-process the dataset to ensure it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.


In [96]:
import pandas as pd

In [97]:
df = pd.read_excel('./Online retail.xlsx')
df

|      | Basket |
|------|--------|
| 0    | shrimp,almonds,avocado,vegetables mix,green gr... |
| 1    | burgers,meatballs,eggs |
| 2    | chutney |
| 3    | turkey,avocado |
| 4    | mineral water,milk,energy bar,whole wheat rice... |
| ...  | ... |
| 7496 | butter,light mayo,fresh bread |
| 7497 | burgers,frozen vegetables,eggs,french fries,ma... |
| 7498 | chicken |
| 7499 | escalope,green tea |


In [98]:
# Checking for Missing Values
df.isnull().sum()

Basket    0
dtype: int64

In [99]:
# Checking for Duplicates
df.duplicated().sum()

2325

In [100]:
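# Dropping duplicate baskets.
# Note: identical baskets can be genuine repeat purchases, so this step
# changes the support values computed later.
df.drop_duplicates(inplace=True)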

In [101]:
df.shape

(5176, 1)

In [102]:
df.dtypes

Basket    object
dtype: object

In [103]:
# Converting the data to transactional format:
# each basket string becomes a list of item names.
transactions = [basket.split(',') for basket in df['Basket']]
display(transactions[:5])

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 ['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado'],
 ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea']]

In [104]:
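# Converting the transactions list into a DataFrame
# (one column per basket position, padded with None for shorter baskets)
data = pd.DataFrame(transactions)

# Applying one-hot encoding to obtain boolean columns.
# Note: get_dummies encodes each position column separately, so an item that
# occurs at several basket positions ends up with several columns of the same name.
encoded_data = pd.get_dummies(data, prefix='', prefix_sep='')
encoded_data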

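|      | almonds | antioxydant juice | asparagus | avocado | babies food | bacon | barbecue sauce | black tea | blueberries | body spray | ... | antioxydant juice.1 | french fries | frozen smoothie | frozen smoothie.1 | protein bar | spinach | cereals | mayonnaise | spinach.1 | olive oil |
|------|---------|-------------------|-----------|---------|-------------|-------|----------------|-----------|-------------|------------|-----|---------------------|--------------|-----------------|-------------------|-------------|---------|---------|------------|-----------|-----------|
| 0    | False | False | False | False | False | False | False | False | False | False | ... | True | False | False | True | False | False | False | False | True | True |
| 1    | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2    | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3    | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4    | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ...  | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5171 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 5172 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 5173 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 5174 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |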
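
Because `data` holds one column per basket position, `pd.get_dummies` produces a separate indicator for each (position, item) pair, which is why item names such as `antioxydant juice.1` and `spinach.1` recur in the output above. A minimal sketch of one way to obtain a single boolean column per item, using only pandas, might look like the following. The three toy baskets and the names `toy_transactions` and `encoded_by_item` are illustrative assumptions, not part of the assignment data; mlxtend's `TransactionEncoder` is another option for the same step.

```python
import pandas as pd

# Toy baskets (hypothetical), in the same list-of-lists shape as `transactions`.
toy_transactions = [
    ["shrimp", "almonds", "avocado"],
    ["almonds", "eggs"],
    ["eggs"],
]

# One row per (transaction, item) pair; the index keeps the transaction id.
items = pd.Series(toy_transactions).explode()

# One indicator column per distinct item, then collapse back to one row
# per transaction: True if the item occurs anywhere in the basket.
encoded_by_item = pd.get_dummies(items).groupby(level=0).any()

print(encoded_by_item)
```

With this layout each item name appears exactly once as a column, so the support of an item is simply the mean of its column.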


## Association Rule Mining:

* Implement the Apriori algorithm using a tool such as Python, with libraries such as pandas and mlxtend.
* Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.
* Set appropriate thresholds for support, confidence, and lift to extract meaningful rules.


In [105]:
# Applying Apriori algorithm to find frequent itemsets
from mlxtend.frequent_patterns import apriori

frequent_items = apriori(encoded_data, min_support=0.01, use_colnames=True)
frequent_items

|     | support  | itemsets |
|-----|----------|----------|
| 0   | 0.098725 | (burgers) |
| 1   | 0.048879 | (chocolate) |
| 2   | 0.020672 | (eggs) |
| 3   | 0.011399 | (french fries) |
| 4   | 0.023570 | (fresh tuna) |
| ... | ...      | ... |
| 130 | 0.025309 | (mineral water, spaghetti) |
| 131 | 0.010819 | (ground beef, spaghetti) |
| 132 | 0.010626 | (milk, mineral water) |
| 133 | 0.018547 | (mineral water, spaghetti) |


In [106]:
# Generating Association Rules
from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_items, metric='lift', min_threshold=1.2)

# Sorting rules by lift in descending order:
rules = rules.sort_values(by='lift', ascending=False)
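
The cell above thresholds on lift only. As a hedged sketch of how a confidence threshold could be combined with it, the example below filters a small hand-made rules table. The table `toy_rules`, its numbers, and the value of `MIN_CONFIDENCE` are assumptions for illustration; only the column names follow the output of mlxtend's `association_rules`.

```python
import pandas as pd

# Hypothetical rules table using the column names found in the output of
# mlxtend's association_rules; the numbers are made up for illustration.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"ground beef"}), frozenset({"milk"}), frozenset({"eggs"})],
    "consequents": [frozenset({"spaghetti"}), frozenset({"mineral water"}), frozenset({"burgers"})],
    "support": [0.04, 0.05, 0.03],
    "confidence": [0.40, 0.15, 0.35],
    "lift": [2.3, 1.1, 1.6],
})

MIN_CONFIDENCE = 0.3  # assumed threshold, to be tuned for the real data
MIN_LIFT = 1.2        # same lift threshold as in the cell above

# Keep only rules that pass both thresholds, strongest lift first.
strong_rules = toy_rules[
    (toy_rules["confidence"] >= MIN_CONFIDENCE) & (toy_rules["lift"] >= MIN_LIFT)
].sort_values("lift", ascending=False)

print(strong_rules[["antecedents", "consequents", "confidence", "lift"]])
```

The same boolean mask could be applied to the real `rules` DataFrame once suitable thresholds have been chosen.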

## Analysis and Interpretation:
*	Analyse the generated rules to identify interesting patterns and relationships between the products.
*	Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.


In [107]:
# Interpretation of the five rules with the highest lift
print("Identified Patterns:")

for _, rule in rules.head().iterrows():
    print("\nCustomers who buy {} are {} times more likely to buy {}.".format(
        list(rule['antecedents']),
        round(rule['lift'], 2),
        list(rule['consequents'])
    ))

Identified Patterns:

Customers who buy ['ground beef'] are 82.02 times more likely to buy ['spaghetti'].

Customers who buy ['spaghetti'] are 82.02 times more likely to buy ['ground beef'].

Customers who buy ['ground beef'] are 67.02 times more likely to buy ['mineral water'].

Customers who buy ['mineral water'] are 67.02 times more likely to buy ['ground beef'].

Customers who buy ['spaghetti'] are 62.61 times more likely to buy ['mineral water'].


### Interpretation:

* Meal Pairings:

    The analysis reveals common meal pairings such as ground beef with spaghetti and shrimp with frozen vegetables, suggesting that customers tend to buy complementary items for preparing meals.

* Beverage Choices:

    Items like mineral water are frequently associated with food items like ground beef and spaghetti, indicating that customers often include beverages in their food purchases.

## Interview Questions:

### 1. What is lift and why is it important in Association rules?

Lift is a measure of how much more likely it is to observe the co-occurrence of two items in transactions compared to what would be expected if the items were independent of each other. Mathematically, it is defined as the ratio of the observed support of the itemset to the expected support under independence.

Lift is important in association rules because it measures the strength of the association between items while accounting for how common the consequent already is. A lift greater than 1 indicates a positive association: the presence of one item increases the likelihood of the other. A lift close to 1 indicates that the items are roughly independent, and a lift below 1 indicates a negative association, where the presence of one item makes the other less likely.
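
The verbal definition above can be written as a formula. The symbols $A$ (antecedent) and $B$ (consequent) are notation assumed here for illustration:

```latex
\[
\operatorname{lift}(A \Rightarrow B)
  = \frac{\operatorname{support}(A \cup B)}
         {\operatorname{support}(A)\,\operatorname{support}(B)}
  = \frac{\operatorname{confidence}(A \Rightarrow B)}{\operatorname{support}(B)}
\]
```

The denominator $\operatorname{support}(A)\,\operatorname{support}(B)$ is the support that would be expected if $A$ and $B$ occurred independently, so the formula is symmetric in $A$ and $B$.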

### 2. What are support and confidence? How do you calculate them?

Support measures the frequency of occurrence of an itemset in the dataset. It is the proportion of transactions that contain the itemset. Mathematically, support is calculated as the number of transactions containing the itemset divided by the total number of transactions.

Confidence measures the reliability of the association rule. It is the conditional probability that a transaction containing the antecedent also contains the consequent. Mathematically, confidence is calculated as the support of the combined itemset (antecedent and consequent) divided by the support of the antecedent.

Support and confidence are essential metrics in association rule mining, as they help identify frequent itemsets and strong association rules, respectively.
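
As a small worked example of these two calculations, the sketch below computes support and confidence by hand on five toy transactions. The data and the helper names `support` and `confidence` are hypothetical and chosen so that the arithmetic is easy to check.

```python
# Toy transactions (hypothetical) for a hand-checkable calculation.
toy_transactions = [
    {"spaghetti", "ground beef"},
    {"spaghetti", "ground beef", "mineral water"},
    {"spaghetti"},
    {"mineral water"},
    {"ground beef"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


sup_both = support({"spaghetti", "ground beef"}, toy_transactions)
conf = confidence({"spaghetti"}, {"ground beef"}, toy_transactions)
print(f"support = {sup_both:.2f}, confidence = {conf:.2f}")
```

Here two of the five baskets contain both items, so the support is 2/5 = 0.40, and three baskets contain spaghetti, so the confidence of spaghetti → ground beef is 0.40 / 0.60 ≈ 0.67.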

### 3. What are some limitations or challenges of association rule mining?

* Curse of Dimensionality:

    As the number of items or dimensions in the dataset increases, the number of possible itemsets grows exponentially, leading to computational challenges and increased memory requirements.

* Sparse Data:

    Association rule mining may struggle with sparse datasets where most itemsets have low support, making it difficult to identify meaningful associations.

* High Dimensionality:

    With large datasets containing numerous items, finding relevant and interpretable rules becomes challenging, and the sheer volume of rules generated can make interpretation cumbersome.

* Quality of Rules:

    Association rules may suffer from high false positive rates, where spurious associations are discovered due to noise or random fluctuations in the data.

* Interpretability:

    Understanding and interpreting the generated rules can be subjective and may require domain knowledge to derive actionable insights.

* Scalability:

    Mining association rules from large-scale datasets can be computationally intensive and time-consuming, requiring efficient algorithms and parallel processing techniques.

* Handling Continuous Data:

    Traditional association rule mining techniques are designed for categorical data, and extending them to handle continuous or mixed-type data requires preprocessing and discretization, which can introduce additional complexity and potential information loss.
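
The discretization mentioned in the last point can be sketched with `pd.cut`. The column `basket_value`, the bin edges, and the labels below are assumptions made up for illustration; choosing them for real data is exactly where the information loss arises.

```python
import pandas as pd

# Hypothetical continuous attribute attached to each transaction.
orders = pd.DataFrame({"basket_value": [2.5, 12.0, 40.0, 5.0]})

# Assumed bin edges and labels; the intervals are (0, 5], (5, 20], (20, inf].
orders["value_band"] = pd.cut(
    orders["basket_value"],
    bins=[0, 5, 20, float("inf")],
    labels=["low", "mid", "high"],
)

# Each band becomes a boolean "item" that can sit alongside product columns.
band_items = pd.get_dummies(orders["value_band"], prefix="value").astype(bool)
print(band_items)
```

After this step, values such as 2.5 and 5.0 are indistinguishable (both fall in the `low` band), which illustrates the trade-off described above.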

These limitations and challenges highlight the importance of careful data preprocessing, parameter tuning, and domain knowledge in association rule mining to ensure meaningful and actionable insights are derived from the analysis.
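
To make the exponential growth behind the dimensionality and scalability points concrete, the sketch below counts the non-empty itemsets that exist over a given number of distinct items. The function name `count_itemsets` and the item counts are hypothetical.

```python
from math import comb


def count_itemsets(n_items, max_size=None):
    """Number of non-empty itemsets over `n_items` items, optionally capped by size."""
    if max_size is None:
        max_size = n_items
    return sum(comb(n_items, k) for k in range(1, max_size + 1))


for n in (10, 20, 100):  # hypothetical numbers of distinct items
    print(n, count_itemsets(n), count_itemsets(n, max_size=2))
```

With no size cap the count is 2^n - 1, which is why Apriori relies on a minimum support threshold to prune candidates rather than enumerating them all.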