# Data Preprocessing:
Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.

In [1]:
import pandas as pd
df = pd.read_excel('Online retail.xlsx', header = None)
df.shape

(7501, 1)

In [2]:
df.head()

Unnamed: 0,0
0,"shrimp,almonds,avocado,vegetables mix,green gr..."
1,"burgers,meatballs,eggs"
2,chutney
3,"turkey,avocado"
4,"mineral water,milk,energy bar,whole wheat rice..."


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7501 non-null   object
dtypes: object(1)
memory usage: 58.7+ KB


In [4]:
df.describe()

Unnamed: 0,0
count,7501
unique,5176
top,cookies
freq,223


In [5]:
df.isnull().sum()

0    0
dtype: int64
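
The null check above only confirms that no row is empty. As a minimal sketch of the other clean-up steps mentioned at the top of this section, the helper below splits one raw comma-separated row, trims whitespace, normalises case, and drops blank or repeated items. `clean_row` and the sample string are hypothetical and not part of the original notebook.

```python
def clean_row(raw):
    """Split one comma-separated transaction string into a tidy item list."""
    seen = set()
    items = []
    for item in raw.split(','):
        item = item.strip().lower()      # normalise spacing and case
        if item and item not in seen:    # skip blanks and repeated items
            seen.add(item)
            items.append(item)
    return items


# Hypothetical messy row, for illustration only
print(clean_row('turkey, avocado,,Turkey '))
```

In the notebook, something like `df[0].apply(clean_row)` could apply the same helper to every row; whether this dataset actually needs it has not been checked here.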

# Association Rule Mining:
•	Implement the Apriori algorithm using a tool such as Python, with libraries such as Pandas and Mlxtend.

•	Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.

•	Set appropriate thresholds for support, confidence, and lift to extract meaningful rules.

In [6]:
# Split each comma-separated transaction into one column per item
df_new = df[0].str.split(',', expand = True)
df_new

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,butter,light mayo,fresh bread,,,,,,,,,,,,,,,,,
7497,burgers,frozen vegetables,eggs,french fries,magazines,green tea,,,,,,,,,,,,,,
7498,chicken,,,,,,,,,,,,,,,,,,,
7499,escalope,green tea,,,,,,,,,,,,,,,,,,


In [7]:
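# Build one list of 20 strings per transaction.
# Note: str() turns empty cells into the literal string 'None'.
trans = []
for i in range(len(df_new)):
    trans.append([str(df_new.values[i, j]) for j in range(df_new.shape[1])])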

In [8]:
len(trans)

7501

In [9]:
trans[0]

['shrimp',
 'almonds',
 'avocado',
 'vegetables mix',
 'green grapes',
 'whole weat flour',
 'yams',
 'cottage cheese',
 'energy drink',
 'tomato juice',
 'low fat yogurt',
 'green tea',
 'honey',
 'salad',
 'mineral water',
 'salmon',
 'antioxydant juice',
 'frozen smoothie',
 'spinach',
 'olive oil']

In [10]:
trans[50]

['spaghetti',
 'chocolate',
 'brownies',
 'white wine',
 'green tea',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None']

In [11]:
trans[2000]

['pancakes',
 'energy drink',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None']

In [12]:
trans[7500]

['eggs',
 'frozen smoothie',
 'yogurt cake',
 'low fat yogurt',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None',
 'None']
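
The baskets printed above are padded to 20 entries with the literal string 'None', because `str()` is applied to empty cells. With `min_lift = 3` these fillers should not surface in the final rules, since an item present in almost every basket has a lift close to 1, but they still add work for the algorithm. A sketch of an alternative that never creates the padding is shown below; `raw_rows` is a hypothetical stand-in for `df[0]`, and `trans_clean` is not a name used in the original notebook.

```python
# Stand-in for the raw column; in the notebook this would be df[0].tolist()
raw_rows = [
    'burgers,meatballs,eggs',
    'chutney',
    'turkey,avocado',
]

# Split each row directly, so no padding value is ever created
trans_clean = [[item.strip() for item in row.split(',') if item.strip()]
               for row in raw_rows]

print(trans_clean)
```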

In [13]:
# Implementing an Apriori Algorithm

from apyori import apriori
rules = apriori(transactions = trans,
                min_support = 0.003,
                min_confidence = 0.2, 
                min_lift = 3,
                min_length = 2,
                max_length = 2)

In [14]:
rules

<generator object apriori at 0x000002347AA2F480>

In [15]:
results = list(rules)
results

[RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
 RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]),
 RelationRecord(items=frozenset({'fromage blanc', 'honey'}), support=0.003332888948140248, ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confidence=0

In [16]:
results[0][1] # Support 

0.004532728969470737

In [17]:
results[0][2][0][0] # base item 

frozenset({'light cream'})

In [18]:
results[0][2][0][1] # add item 

frozenset({'chicken'})

In [19]:
results[0][2][0][2] # confidence  

0.29059829059829057

In [20]:
results[0][2][0][3] # lift 

4.84395061728395

In [21]:
a = []
b = []
c = []
d = []
e = []

for i in range(len(results)):         # loop over every rule instead of hard-coding 9
    c.append(results[i][1])           # Support 
    a.append(results[i][2][0][0])     # base item  
    b.append(results[i][2][0][1])     # add item 
    d.append(results[i][2][0][2])     # confidence  
    e.append(results[i][2][0][3])     # lift 

In [22]:
d1 = pd.DataFrame(a)
d2 = pd.DataFrame(b)
d3 = pd.DataFrame(c)
d4 = pd.DataFrame(d)
d5 = pd.DataFrame(e)

In [23]:
data_new = pd.concat([d1,d2,d3,d4,d5], axis = 1)
data_new.columns = ['Baseitem','Additem','Support','Confidence','Lift']
data_new

Unnamed: 0,Baseitem,Additem,Support,Confidence,Lift
0,light cream,chicken,0.004533,0.290598,4.843951
1,mushroom cream sauce,escalope,0.005733,0.300699,3.790833
2,pasta,escalope,0.005866,0.372881,4.700812
3,fromage blanc,honey,0.003333,0.245098,5.164271
4,herb & pepper,ground beef,0.015998,0.32345,3.291994
5,tomato sauce,ground beef,0.005333,0.377358,3.840659
6,light cream,olive oil,0.0032,0.205128,3.11471
7,whole wheat pasta,olive oil,0.007999,0.271493,4.12241
8,pasta,shrimp,0.005066,0.322034,4.506672
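
The printed `results` show that each record exposes named fields (`items`, `support`, `ordered_statistics`, and within each statistic `items_base`, `items_add`, `confidence`, `lift`). As a sketch, the same table can be assembled by field name rather than by positional index. The two namedtuples below are stand-ins that mirror the printed output so the snippet runs without apyori; in the notebook the loop would run over `results` instead of `sample_results`.

```python
from collections import namedtuple

# Stand-ins that mirror the field names in the printed output above
OrderedStatistic = namedtuple(
    'OrderedStatistic', ['items_base', 'items_add', 'confidence', 'lift'])
RelationRecord = namedtuple(
    'RelationRecord', ['items', 'support', 'ordered_statistics'])

# First record, copied from the printed results
sample_results = [
    RelationRecord(
        items=frozenset({'light cream', 'chicken'}),
        support=0.004532728969470737,
        ordered_statistics=[OrderedStatistic(
            items_base=frozenset({'light cream'}),
            items_add=frozenset({'chicken'}),
            confidence=0.29059829059829057,
            lift=4.84395061728395)]),
]

rows = []
for record in sample_results:
    for stat in record.ordered_statistics:   # one entry per rule direction
        rows.append({
            'Baseitem': ', '.join(sorted(stat.items_base)),
            'Additem': ', '.join(sorted(stat.items_add)),
            'Support': record.support,
            'Confidence': stat.confidence,
            'Lift': stat.lift,
        })

print(rows[0]['Baseitem'], '->', rows[0]['Additem'])
```

Passing `rows` to `pd.DataFrame` would give the same five columns in a single step, and iterating over `ordered_statistics` keeps every rule direction rather than only the first.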


# Analysis and Interpretation:
•	Analyse the generated rules to identify interesting patterns and relationships between the products.

•	Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.

a. Analysis of Generated Rules
1. High Support and High Confidence Rules:

- Look for rules with high support and confidence, as these indicate products frequently bought together. For example, if {milk} -> {bread} has high support and confidence, it means a substantial number of transactions include both milk and bread.
- High-support rules might reveal staple combinations or popular pairings, showing essential products or high-frequency buys.

2. High Lift:

- Rules with lift values significantly greater than 1 indicate a strong association beyond chance, suggesting a genuine preference among customers. For example, a rule like {pasta} -> {tomato sauce} with high lift would highlight a common meal-prep pattern, indicating that customers frequently purchase ingredients for specific recipes together.
- Identifying these can help in understanding product dependencies or habitual purchases.

3. Unique or Seasonal Patterns:

- Some item pairings may have moderate support but strong confidence and lift, suggesting niche buying behavior. For instance, {green tea} -> {honey} might suggest a segment of health-conscious customers purchasing these together for their dietary preferences.
- Seasonality might also come into play; for example, {hot chocolate} -> {marshmallows} could spike during winter months, indicating seasonal buying trends.
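
To connect these criteria to the nine rules actually mined, the sketch below ranks them by lift and picks out the rule with the highest support. The values are copied from the results table earlier in the notebook; the variable names are illustrative.

```python
# (base item, added item, support, confidence, lift) copied from the results table
rules_table = [
    ('light cream', 'chicken', 0.004533, 0.290598, 4.843951),
    ('mushroom cream sauce', 'escalope', 0.005733, 0.300699, 3.790833),
    ('pasta', 'escalope', 0.005866, 0.372881, 4.700812),
    ('fromage blanc', 'honey', 0.003333, 0.245098, 5.164271),
    ('herb & pepper', 'ground beef', 0.015998, 0.323450, 3.291994),
    ('tomato sauce', 'ground beef', 0.005333, 0.377358, 3.840659),
    ('light cream', 'olive oil', 0.003200, 0.205128, 3.114710),
    ('whole wheat pasta', 'olive oil', 0.007999, 0.271493, 4.122410),
    ('pasta', 'shrimp', 0.005066, 0.322034, 4.506672),
]

# Strongest associations first
by_lift = sorted(rules_table, key=lambda r: r[4], reverse=True)
for base, add, support, confidence, lift in by_lift[:3]:
    print(f'{base} -> {add}: lift={lift:.2f}, support={support:.4f}')

# The most frequent pairing among the mined rules
most_frequent = max(rules_table, key=lambda r: r[2])
print('Highest support:', most_frequent[0], '->', most_frequent[1])
```

In this table the highest-lift rule ({fromage blanc} -> {honey}) is also one of the rarest, while the highest-support rule ({herb & pepper} -> {ground beef}) has a more modest lift, which is why support and lift are worth reading together.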

b. Customer Behavior Insights
1. Health-Conscious and Meal-Planning Shoppers:

- Rules involving items like “low-fat yogurt,” “vegetable mix,” and “salad” may indicate a subset of customers focused on health. Similarly, combinations like {spinach} -> {salmon} could indicate meal-planning behavior for home-cooked, nutritious meals.
- These patterns suggest that targeting promotions on health-related items or providing recipe ideas could resonate well with this segment.

2. Impulse and Snack Purchases:

- If items like “chocolate,” “chips,” and “soda” are associated, it suggests an impulse-buy segment, where customers add these items to their cart as quick snacks. Cross-promotions or bundle offers on these items may increase sales.

3. Staple or Convenience Shopping:

- Associations like {bread} -> {milk} may suggest customers purchasing basic staples, possibly indicative of regular, frequent shopping trips.
- Placing staple items in convenient areas or bundling frequently bought items in a subscription or convenience package might enhance customer satisfaction and retention.

c. Business Recommendations
1. Product Placement and Promotions:

- Use identified associations to place frequently bought-together items near each other. If {energy drinks} -> {snack bars} is a common association, place these items together or offer bundle discounts.

2. Targeted Advertising and Recommendations:

- Target ads or personalized recommendations based on high-confidence, high-lift rules. For instance, if {green tea} -> {honey} is common, recommend honey to customers purchasing green tea to encourage additional purchases.

3. Bundling and Cross-Selling Opportunities:

- Create bundles based on the identified frequent item pairs, such as “breakfast essentials” for {bread, eggs, milk} or “health packs” for {low-fat yogurt, spinach, green tea}.

# 1.	What is lift and why is it important in Association rules?
- Lift is a key metric in association rule mining that measures the strength of the relationship between two items beyond their individual purchase probabilities. Specifically, it shows how much more likely two items are bought together than if they were bought independently.

- Definition of Lift: For an association rule 𝐴→𝐵, lift is calculated as: Lift(𝐴→𝐵) = Confidence(𝐴→𝐵) / Support(𝐵)
- Equivalently: Lift(𝐴→𝐵) = Support(𝐴 and 𝐵) / (Support(𝐴) × Support(𝐵))

- Interpretation of Lift Values
1. Lift > 1: The occurrence of A increases the likelihood of B, suggesting a positive association. For instance, if Lift = 2, A and B are bought together twice as often as they would be if they were independent.
2. Lift = 1: This implies no association; A and B occur together purely by chance and are statistically independent.
3. Lift < 1: This indicates a negative association, meaning A and B are less likely to be bought together than expected by chance.
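
As a worked example, the sketch below computes lift for the rule {light cream} -> {chicken} from raw counts. The counts (34 baskets with both items, 117 with light cream, 450 with chicken, out of 7501) are back-calculated from the support, confidence, and lift reported earlier; they are an inference, not values printed in the notebook.

```python
def lift(count_ab, count_a, count_b, n_transactions):
    """Lift of A -> B computed from raw transaction counts."""
    confidence = count_ab / count_a          # share of A-baskets that also hold B
    support_b = count_b / n_transactions     # share of all baskets that hold B
    return confidence / support_b


def lift_from_supports(count_ab, count_a, count_b, n_transactions):
    """Equivalent form: Support(A and B) / (Support(A) * Support(B))."""
    s_ab = count_ab / n_transactions
    s_a = count_a / n_transactions
    s_b = count_b / n_transactions
    return s_ab / (s_a * s_b)


# Counts inferred from the reported metrics (an assumption, see above)
value = lift(count_ab=34, count_a=117, count_b=450, n_transactions=7501)
print(round(value, 2))
```

Under these counts, customers who buy light cream are nearly five times as likely to buy chicken as an average customer, which agrees with the lift of about 4.84 reported for this rule.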

- Why Lift is Important
1. Identifies Strong Relationships: Lift provides a more robust indication of association strength compared to support or confidence alone. Confidence may indicate a strong relationship, but it can be misleading if the consequent item is already common. Lift adjusts for this by accounting for the baseline probability of both items occurring independently.

2. Reveals Potential Cross-Selling Opportunities: High-lift values identify combinations where buying one product significantly boosts the chance of buying the other, ideal for cross-selling strategies and product bundling.

3. Improves Decision-Making: For retailers, lift helps prioritize the associations with genuine influence on buying behavior, rather than those coincidental or expected purely from individual purchase frequencies.

# 2.	What are support and confidence? How do you calculate them?

1. Support: Support measures how frequently an itemset (a set of items bought together) appears in the dataset. It helps identify popular item combinations and decide whether a rule occurs often enough to be relevant.

- Formula: For an itemset 𝐴, the support is defined as: Support(𝐴) = (Number of transactions containing 𝐴) / (Total number of transactions)

- For a rule 𝐴→𝐵 (where A and B are sets of items), the support of the rule is: Support(𝐴→𝐵) = (Number of transactions containing both 𝐴 and 𝐵) / (Total number of transactions)

2. Confidence: Confidence measures the likelihood that the consequent (B) will be bought when the antecedent (A) is bought. It provides an estimate of the reliability of the rule.

- Formula: For a rule A→B, confidence is defined as: Confidence(A→B) = Support(A and B) / Support(A). Or alternatively: Confidence(A→B) = (Number of transactions containing both A and B) / (Number of transactions containing A)
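
A minimal sketch of both calculations is given below. Support counts the baskets that contain an itemset, and confidence divides the support of A and B together by the support of A. The `toy` baskets are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support(A and B) / Support(A) for the rule A -> B."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical baskets, for illustration only
toy = [
    ['pasta', 'escalope'],
    ['pasta', 'shrimp'],
    ['pasta', 'escalope', 'olive oil'],
    ['olive oil'],
]

print(support(toy, ['pasta']))                    # 3 of 4 baskets
print(confidence(toy, ['pasta'], ['escalope']))   # 2 of the 3 pasta baskets
```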

# 3.	What are some limitations or challenges of Association rules mining?

1. High Dimensionality and Data Sparsity
- In large datasets with numerous items, the number of possible item combinations becomes enormous, leading to high dimensionality. This can make association rule mining computationally expensive and result in a huge number of potential rules, many of which may not be useful.
- Data sparsity (when there are many items but each transaction has only a few items) further complicates mining, as it reduces the likelihood of finding frequent itemsets and can lead to an overabundance of low-support rules.
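
To give a feel for how quickly the search space grows, the sketch below counts the possible itemsets of each size. The catalogue size of 100 items is a hypothetical figure, not a property of this dataset.

```python
from math import comb


def candidate_counts(n_items, max_size):
    """Number of possible itemsets of each size from 1 to max_size."""
    return {k: comb(n_items, k) for k in range(1, max_size + 1)}


# n_items = 100 is a hypothetical catalogue size
counts = candidate_counts(100, 3)
print(counts)
```

Even for this modest catalogue there are 4950 possible pairs and 161700 possible triples, which is why a minimum-support cut-off and a cap such as `max_length` matter in practice.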

2. Choosing Optimal Thresholds for Support, Confidence, and Lift
- Selecting appropriate thresholds for support, confidence, and lift is often challenging. Setting thresholds too high may result in missing valuable rules, while setting them too low can produce too many rules, including trivial or insignificant ones.
- There is no universal threshold, so these values often require fine-tuning based on domain knowledge and experimentation, which can be time-consuming.
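
The effect of tightening a threshold can be sketched with the confidence values of the nine rules mined earlier in the notebook. The snippet only shows what raising `min_confidence` would discard; lowering any threshold below the notebook's settings would require re-running the mining step.

```python
# Confidence values copied from the results table earlier in the notebook
confidences = [0.290598, 0.300699, 0.372881, 0.245098, 0.323450,
               0.377358, 0.205128, 0.271493, 0.322034]


def rules_kept(values, threshold):
    """Count how many rules meet or exceed the given confidence threshold."""
    return sum(1 for v in values if v >= threshold)


for threshold in (0.2, 0.25, 0.3, 0.35):
    print(threshold, rules_kept(confidences, threshold))
```

Moving the cut-off from 0.2 to 0.35 reduces the nine rules to two, so a fairly small change in a threshold can remove most of the output.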

3. Interpretation and Actionability of Results
- Association rules can reveal relationships but don’t provide insights into causality. For example, an association rule might show that bread and butter are frequently purchased together, but it doesn’t explain why. Therefore, business context and expertise are required to interpret the results and derive actionable insights.
- Moreover, not all generated rules are meaningful or actionable, as some rules may appear simply due to random chance or customer preferences without strong predictive value.

4. Handling Rare but Important Associations
- Standard association rule mining techniques often miss rare but valuable associations due to low support. In some cases, these rare associations might represent critical customer segments or niche products (e.g., high-value items or specialty goods) that have unique purchasing patterns but occur infrequently.

5. Scalability and Computational Efficiency
- For very large datasets, traditional association rule algorithms (like Apriori) can be computationally intensive. As dataset size increases, so does the time and memory required to generate frequent itemsets and rules, limiting scalability.
- More advanced algorithms (such as FP-Growth) can help alleviate this, but the trade-off is often added complexity in implementation.

6. Managing Redundant or Uninteresting Rules
- Association rule mining can produce a high number of redundant rules (e.g., variations of the same association) or trivial rules that don’t add much value. For instance, rules that state the obvious (e.g., “bread -> butter”) are common but don’t provide new insights. This redundancy complicates analysis, as it requires manual filtering to identify truly useful rules.

7. Dynamic and Evolving Data
- Customer behavior can change over time, which means association rules based on historical data may not always reflect current trends. Retailers need to periodically update the rules to maintain relevance.
- Dynamic and streaming data environments present additional challenges, as traditional rule-mining algorithms may not be able to adapt efficiently to new data in real-time.

8. Binary Item Representation
- Association rule mining generally treats items as either "present" or "absent" in a transaction, ignoring quantities and other contextual factors (such as price or customer demographics). This simplistic approach limits the granularity of insights and makes it difficult to account for nuances, like customers buying in bulk versus those making one-off purchases.
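
A small sketch of this present/absent encoding is shown below. The baskets are hypothetical, and the repeated 'eggs' in the first one stands for a bulk purchase.

```python
def one_hot(transactions):
    """Return (sorted item list, one 0/1 row per transaction)."""
    items = sorted({item for t in transactions for item in t})
    rows = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, rows


# Hypothetical baskets; the repeated 'eggs' stands for a bulk purchase
baskets = [['eggs', 'eggs', 'milk'], ['milk'], ['eggs', 'bread']]
items, rows = one_hot(baskets)

print(items)
for row in rows:
    print(row)
```

The first basket is encoded with a single 1 in the 'eggs' column, so the fact that eggs were bought twice is lost once the data is in this form.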