##### CSE 5243 - Introduction to Data Mining
## Homework 5: Association Analysis
- Semester: Fall 2024
- Instructor: John Paparrizos
- Section: Tuesday/Thursday 11:10 AM
- Student Name: Khushboo Suchit Mundada
- Student Email: mundada.10@osu.edu
- Student ID: 500935462
***

# Introduction

### Objectives

In this lab, you will use the "hw5_data.csv" dataset provided on Carmen to find potential association rules.


The objectives of this assignment are:
1.    Practice the Association Analysis content we covered this semester.
2.    Understand “why” the particular topics, techniques, etc., are important from a practical perspective.
3.    Understand how to choose and use appropriate tools to solve the provided problems.

### The Dataset
- This workbook uses a market basket dataset containing 50 transactions, drawn from a universe of six items: Apples, Bananas, Carrots, Donuts, Eggs, Fish.  For simplicity, use the short form “A, B, C, D, E, F” for the items.
- There is one csv file that captures the data in "long format". Specifically, every row corresponds to the transaction id and the item. If the specific transaction id has multiple items, you will have multiple rows in your data.
- You can use it however you like, but it is recommended that you convert it into the one-hot-encoded data structure we used in class. This will allow you to easily use the mlxtend package.

### Proper answers
- To make everyone's lives a little easier, when writing itemsets and rules, please list them in lexicographical order:
  {A}, {B}, {A,B}, {A,C}, {A,B,C},…
  {A,B,C}->{D}, {A,B}->{C,D}

### Collaboration
For this assignment, you should work as an individual. You may informally discuss ideas with classmates, but your work should be your own.

### What you need to turn in:
1)	Turn in this Jupyter Notebook BOTH in notebook format and in html format. Failure to submit one or the other will result in points being deducted.
  - Submit your hw as HW5_Surname_DotNumber.zip
  
2)  Feel free to use the **mlxtend** package to help enumerate all possible combinations.

3)  If the question asks you to compute all possible rules, back up the calculation with a "formula" approach as well (similar to how we did in class/slides). This will act as "showing your work".
***

***
# Section 1: Getting Ready (20%)
1A) Load the data, and get it ready for association analysis. Do this with convenient python helper methods as appropriate. Feel free to use the tools we learned in class. HINT: If your code looks like you're writing it in C++/Java/etc. with lots of messy for loops, step back and re-evaluate.
    
    - Make the data one-hot encoded.
***

In [1]:
import pandas as pd  # For handling data in DataFrame format
from scipy.special import comb  # For calculating combinations (nCr)
from mlxtend.frequent_patterns import apriori, association_rules  # For generating frequent itemsets and rules


In [2]:
# Changing dataset from long format (TxId-ItemId pairs) to wide format (TxId with item presence as columns)

# Load the data
df = pd.read_csv('hw5_data.csv')  
print("Data loaded successfully!")
print(df.head())  

# Perform One-Hot Encoding using pd.get_dummies
df_one_hot = pd.get_dummies(df, columns=['ItemId'])
print("\nOne-hot encoded data (first few rows):")
print(df_one_hot.head())

# Groups transactions and ensures binary values are aggregated correctly for each item
df_one_hot = df_one_hot.groupby('TxId').max()
print("\nGrouped one-hot encoded data (by TxId):")
print(df_one_hot.head())

# Rename the columns by removing the 'ItemId_' prefix
df_one_hot.columns = df_one_hot.columns.str.replace('ItemId_', '')
print("\nRenamed columns (prefix removed):")
print(df_one_hot.head())

# Convert data to boolean type for Apriori algorithm compatibility
df_one_hot = df_one_hot.astype(bool)
print("\nConverted to boolean format:")
print(df_one_hot.head())

# Print the final shape of the dataset
print("\nShape of the transformed dataset (rows, columns):", df_one_hot.shape)

Data loaded successfully!
   TxId   ItemId
0     1   Donuts
1     1     Eggs
2     2  Bananas
3     2  Carrots
4     2     Fish

One-hot encoded data (first few rows):
   TxId  ItemId_Apples  ItemId_Bananas  ItemId_Carrots  ItemId_Donuts  \
0     1              0               0               0              1   
1     1              0               0               0              0   
2     2              0               1               0              0   
3     2              0               0               1              0   
4     2              0               0               0              0   

   ItemId_Eggs  ItemId_Fish  
0            0            0  
1            1            0  
2            0            0  
3            0            0  
4            0            1  

Grouped one-hot encoded data (by TxId):
      ItemId_Apples  ItemId_Bananas  ItemId_Carrots  ItemId_Donuts  \
TxId                                                                 
1                 0             

***
# Section 2: Basic Stats (20%)
2A) Calculate the total number of Itemsets that could be created from a universe of six items. Show your work.

2B) Calculate the total number of rules that can be created from a universe of six items. Show your work.

2C) Calculate the total number of ItemSets that could be created from a universe of 12 items. Show your work.

2D) Calculate the total number of rules that can be created from a universe of 12 items. Show your work.

2E) What do the calculations in 2A-2D tell you / hint at as a potential cause of concern? Hint: Complexity.
***

2A) Calculate the total number of Itemsets that could be created from a universe of six items. Show your work.<br>
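From a universe of six items, each item is either in or out of a given itemset.<br>
Formula for calculating the total number of itemsets: <br>
$= 2^d$ <br>
where $d$ is the number of items, here 6. This count includes the empty itemset; the non-empty itemsets (sizes 1 through 6) number $2^d - 1$.<br>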
Hence, the total number of itemsets is $=2^6 = 64$<br>
**So, there are 64 possible itemsets from a universe of six items.**<br>
<hr>

2B) Calculate the total number of rules that can be created from a universe of six items. Show your work.<br>
Formula for the total number of possible association rules: for a universe of $d$ items, the number of rules is $3^d-2^{d+1}+1$<br> 
Now, calculating the total rules for six items: <br>
Substituting $d=6$<br>
$\text{Total Rules} = 3^6-2^7+1$ <br>
$= 729 - 128 + 1 = 602$<br>
**There are 602 rules from 6 items.**<br>
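As a sanity check on the formula, the rules can also be counted by brute force. The sketch below is illustrative only (the helper names `count_rules_brute_force` and `count_rules_formula` are made up here, not taken from the course material): it picks every itemset of at least two items and every non-empty proper subset of it as the antecedent, which should agree with $3^d-2^{d+1}+1$ for small $d$.

```python
from itertools import combinations


def count_rules_brute_force(d):
    """Count rules X -> Y with X and Y non-empty and disjoint, over d items."""
    items = range(d)
    total = 0
    # A rule uses k >= 2 items in total (antecedent plus consequent)
    for k in range(2, d + 1):
        for itemset in combinations(items, k):
            # Any non-empty proper subset can be the antecedent;
            # the remaining items form the consequent
            for r in range(1, k):
                for _antecedent in combinations(itemset, r):
                    total += 1
    return total


def count_rules_formula(d):
    """Closed-form count used in 2B and 2D."""
    return 3**d - 2**(d + 1) + 1


print(count_rules_brute_force(6), count_rules_formula(6))  # 602 602
```

The brute-force loop is only practical for small universes, which is itself a hint at the complexity concern raised in 2E.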
<hr>

2C) Calculate the total number of ItemSets that could be created from a universe of 12 items. Show your work.<br>
The number of itemsets from 12 items is calculated the same way as in part 2A.<br>
Formula for calculating the total number of itemsets - <br>
$= 2^d$ <br>
where d=12<br>
Hence, the total number of itemsets is $=2^{12} = 4096$<br>
**So, there are 4096 possible itemsets from a universe of 12 items.**<br>
<hr>

2D) Calculate the total number of rules that can be created from a universe of 12 items. Show your work.<br>
The total number of rules for 12 items is calculated the same way as in part 2B. <br>
The number of rules for a universe of $d$ items is $3^d-2^{d+1}+1$, so for $d=12$: <br>
$\text{Total Rules} = 3^{12}-2^{13}+1$ <br>
$= 531441 - 8192 + 1 = 523250$ <br>
**There are 523250 rules from 12 items.** <br>
<hr>

2E) What do the calculations in 2A-2D tell you / hint at as a potential cause of concern? Hint: Complexity.
1. The number of rules increases exponentially with the number of items. For 6 items, there were 602 possible rules, but for 12 items, this jumps to 523,250. This rapid increase highlights how quickly the problem becomes computationally intensive as the dataset grows.
2. Processing and storing such a large number of rules is resource-intensive. It may not be feasible to generate or analyze all possible rules in larger datasets without significant computational power.
3. Analyzing large datasets requires thresholds like minimum support and confidence to filter out unimportant rules. Without these, the sheer volume of rules can overwhelm both computation and interpretation.
4. In practical scenarios, such as market basket analysis, most rules would not provide meaningful insights. Tools and techniques need to focus on extracting only the most relevant associations to ensure actionable results.

In [3]:
# Function to calculate the total number of rules for a given universe size
def totalrules(d):
    return (3**d - 2**(d+1) + 1)


print("For 6 items: ", totalrules(6))
print("For 12 items: ", totalrules(12))

For 6 items:  602
For 12 items:  523250


In [4]:
# !pip install --upgrade scikit-learn
!pip install mlxtend==0.22.0



***
# Section 3: Itemset Generation (20%)
3A) Calculate the number of Itemsets that are possible in the data file.

3B) Calculate the number of Itemsets with MinSup = 2.

3C) Calculate the number of Itemsets with MinSup = 3.

3D) Calculate the number of Itemsets with MinSup = 4.

3E) Using MinSup = 3, use the Apriori algorithm (feel free to use the package) to list all ItemSets with MinSup = 3.

3F) Using MinSup = 3, find all the Maximal Frequent Itemsets.

3G) Find all the Closed Itemsets.

3H) Using Minsup = 3, find the Closed Frequent Itemsets.

3I) How do the Closed Frequent Itemsets compare to the Maximal Frequent Itemsets?

3J) When might one use the Closed Frequent vs the Maximal Frequent Itemsets?
***

3A) Calculate the number of Itemsets that are possible in the data file.<br>
To calculate the total number of itemsets possible, I'm considering the dataset's unique items and the potential combinations that can be formed.<br> 
There are six items: Apples (A), Bananas (B), Carrots (C), Donuts (D), Eggs (E), and Fish (F). <br>
Hence, using the counting formula to calculate the total number of itemsets for a set of 6 items: <br>
$2^d = 2^6 = 64$<br>
**So, the total number of itemsets possible is 64, as calculated earlier.**<br>
<hr>

(Check code in next cell for below questions)<br>
3B) Calculate the number of Itemsets with MinSup = 2.<br>
MinSup (Minimum Support) specifies that we only count itemsets that appear in at least 2 transactions. To calculate the number of itemsets with MinSup = 2, I'm analyzing the frequency of all itemsets in the dataset and only counting those that appear in 2 or more transactions.<br>
**Number of Itemsets with MinSup = 2 is 43**<br>
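To make the counting concrete, here is a minimal sketch of what "support count at least MinSup" means, written without mlxtend. The five transactions are hypothetical toy data, not `hw5_data.csv`, and `frequent_itemsets_by_count` is a helper name invented for this example. It also shows the conversion from a count to the fraction that the `min_support` argument expects.

```python
from itertools import combinations

# Hypothetical toy transactions (NOT the homework data)
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
    {"B"},
]


def frequent_itemsets_by_count(transactions, min_count):
    """Return {itemset: support count} for itemsets found in >= min_count transactions."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            # Support count = number of transactions containing every item of the candidate
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                result[frozenset(candidate)] = count
    return result


min_count = 2
# A support count is turned into a fraction by dividing by the number of transactions
min_support_fraction = min_count / len(transactions)
freq = frequent_itemsets_by_count(transactions, min_count)
print(min_support_fraction, len(freq))  # 0.4 6
```

This exhaustive version checks every candidate, which is exactly what Apriori avoids by skipping supersets of infrequent itemsets.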
<hr>

3C) Calculate the number of Itemsets with MinSup = 3.<br>
Similar to part 3B. <br>
**Number of Itemsets with MinSup = 3 is 31**<br>
<hr>

3D) Calculate the number of Itemsets with MinSup = 4.<br>
Similar to part 3B. <br>
**Number of Itemsets with MinSup = 4 is 28**<br>
<hr>

In [5]:
# 3B) Apply Apriori with MinSup = 2 (which corresponds to at least 2 transactions having the itemset)
frequent_itemsets_2 = apriori(df_one_hot, min_support=2/len(df_one_hot), use_colnames=True)
# print(f"Frequent Itemsets with MinSup = 2: \n{frequent_itemsets_2}\n")
print(f"Number of itemsets with MinSup = 2: {len(frequent_itemsets_2)}\n")

# 3C) Apply Apriori with MinSup = 3 (which corresponds to at least 3 transactions having the itemset)
frequent_itemsets_3 = apriori(df_one_hot, min_support=3/len(df_one_hot), use_colnames=True)
# print(f"Frequent Itemsets with MinSup = 3: \n{frequent_itemsets_3}\n")
print(f"Number of itemsets with MinSup = 3: {len(frequent_itemsets_3)}\n")

# 3D) Apply Apriori with MinSup = 4 (which corresponds to at least 4 transactions having the itemset)
frequent_itemsets_4 = apriori(df_one_hot, min_support=4/len(df_one_hot), use_colnames=True)
# print(f"Frequent Itemsets with MinSup = 4: \n{frequent_itemsets_4}\n")
print(f"Number of itemsets with MinSup = 4: {len(frequent_itemsets_4)}\n")

Number of itemsets with MinSup = 2: 43

Number of itemsets with MinSup = 3: 31

Number of itemsets with MinSup = 4: 28



3E) Using MinSup = 3, use the Apriori algorithm (feel free to use the package) to list all ItemSets with MinSup = 3.<br>
This is essentially a repeat of part 3C, but now we’ll list all itemsets with support ≥ 3<br>
The result is as below - 


In [6]:
# Apply the Apriori algorithm with MinSup = 3
frequent_itemsets_3 = apriori(df_one_hot, min_support=3/len(df_one_hot), use_colnames=True)

# Display the frequent itemsets with MinSup = 3
print(f"Frequent Itemsets with MinSup = 3: \n")
frequent_itemsets_3.reset_index(drop=True)


# print(f"Number of itemsets with MinSup = 3: {len(frequent_itemsets_3)}")

Frequent Itemsets with MinSup = 3: 



Unnamed: 0,support,itemsets
0,0.34,(Apples)
1,0.58,(Bananas)
2,0.48,(Carrots)
3,0.4,(Donuts)
4,0.4,(Eggs)
5,0.42,(Fish)
6,0.26,"(Bananas, Apples)"
7,0.18,"(Apples, Carrots)"
8,0.16,"(Donuts, Apples)"
9,0.06,"(Apples, Eggs)"


In [7]:
# 3F) Using MinSup = 3, find all the Maximal Frequent Itemsets.
# Maximal frequent itemsets are itemsets that are not subsets of any larger frequent itemsets.
# Using the same list of itemsets with MinSup = 3, finding the maximal frequent itemsets.

# Use apriori to find frequent itemsets with MinSup = 3
frequent_itemsets = apriori(df_one_hot, min_support=3/len(df_one_hot), use_colnames=True)

# Find maximal frequent itemsets
maximal_frequent_itemsets = frequent_itemsets[frequent_itemsets['itemsets'].apply(lambda x: not any(x.issubset(y) for y in frequent_itemsets['itemsets'] if x != y))]

# Display maximal frequent itemsets
print("Maximal Frequent Itemsets:")
maximal_frequent_itemsets.reset_index(drop=True)

Maximal Frequent Itemsets:


Unnamed: 0,support,itemsets
0,0.16,"(Donuts, Eggs)"
1,0.08,"(Bananas, Apples, Donuts)"
2,0.06,"(Bananas, Apples, Eggs)"
3,0.08,"(Donuts, Apples, Fish)"
4,0.1,"(Bananas, Eggs, Carrots)"
5,0.06,"(Bananas, Eggs, Fish)"
6,0.08,"(Eggs, Carrots, Fish)"
7,0.16,"(Bananas, Apples, Fish, Carrots)"


In [8]:
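# 3G) Find all the Closed Itemsets.
# An itemset is closed if no proper superset has the same support; closedness does not require frequency.
# frequent_itemsets was built with MinSup = 3, so this finds the closed itemsets with support count >= 3.
# Covering every itemset in the data would need apriori with min_support=1/len(df_one_hot).

# Look up each itemset's support once instead of re-filtering the DataFrame for every pair
support_of = dict(zip(frequent_itemsets['itemsets'], frequent_itemsets['support']))

# Find closed itemsets (x < y tests "x is a proper subset of y" for frozensets)
closed_itemsets = frequent_itemsets[frequent_itemsets['itemsets'].apply(lambda x: not any(x < y and support_of[y] == support_of[x] for y in support_of))]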

# Display closed itemsets
print("Closed Itemsets:")
closed_itemsets.reset_index(drop=True)

Closed Itemsets:


Unnamed: 0,support,itemsets
0,0.34,(Apples)
1,0.58,(Bananas)
2,0.48,(Carrots)
3,0.4,(Donuts)
4,0.4,(Eggs)
5,0.42,(Fish)
6,0.26,"(Bananas, Apples)"
7,0.16,"(Donuts, Apples)"
8,0.22,"(Apples, Fish)"
9,0.34,"(Bananas, Carrots)"


In [9]:
# 3H) Using Minsup = 3, find the Closed Frequent Itemsets.
# Closed frequent itemsets are those that are frequent and also closed (i.e., no superset has the same frequency).

# Find closed frequent itemsets (Frequent + Closed)
closed_frequent_itemsets = closed_itemsets[closed_itemsets['support'] >= 3/len(df_one_hot)]

# Display closed frequent itemsets
print("Closed Frequent Itemsets:")
closed_frequent_itemsets.reset_index(drop=True)

Closed Frequent Itemsets:


Unnamed: 0,support,itemsets
0,0.34,(Apples)
1,0.58,(Bananas)
2,0.48,(Carrots)
3,0.4,(Donuts)
4,0.4,(Eggs)
5,0.42,(Fish)
6,0.26,"(Bananas, Apples)"
7,0.16,"(Donuts, Apples)"
8,0.22,"(Apples, Fish)"
9,0.34,"(Bananas, Carrots)"


3I) How do the Closed Frequent Itemsets compare to the Maximal Frequent Itemsets?<br>

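Every maximal frequent itemset is also a closed frequent itemset: it has no frequent superset at all, so it cannot have a superset with the same support. The reverse does not hold. A closed frequent itemset such as (Apples), with support 0.34, can still have frequent supersets such as (Bananas, Apples), with support 0.26, as long as their support is strictly lower. So maximal frequent ⊆ closed frequent ⊆ frequent: the eight maximal itemsets are the largest frequent itemsets, while the closed frequent itemsets form a larger collection that also preserves support information.<br>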
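The relationship between the three collections can be checked on a small example. This is a sketch on hypothetical toy transactions (not the homework data), with MinSup = 2 as an arbitrary choice; the variable names are illustrative.

```python
from itertools import combinations

# Hypothetical toy transactions (NOT the homework data)
transactions = [{"A", "B"}, {"A", "B"}, {"A", "B", "C"}, {"C"}, {"B", "C"}]
min_count = 2

# Support count of every itemset that occurs at least once
items = sorted(set().union(*transactions))
support = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        count = sum(1 for t in transactions if s <= t)
        if count > 0:
            support[s] = count

frequent = {s for s, c in support.items() if c >= min_count}
# Closed: no proper superset with the same support
closed_frequent = {
    s for s in frequent
    if not any(s < t and support[t] == support[s] for t in support)
}
# Maximal: no proper superset that is frequent
maximal_frequent = {s for s in frequent if not any(s < t for t in frequent)}


def show(itemsets):
    """Render itemsets as sorted strings such as 'AB'."""
    return sorted("".join(sorted(s)) for s in itemsets)


print(show(frequent))          # ['A', 'AB', 'B', 'BC', 'C']
print(show(closed_frequent))   # ['AB', 'B', 'BC', 'C']
print(show(maximal_frequent))  # ['AB', 'BC']
```

Here {A} is frequent but not closed, because {A, B} has the same support count of 3, and both maximal itemsets also appear in the closed list.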

3J) When might one use the Closed Frequent vs the Maximal Frequent Itemsets?<br>

Closed frequent itemsets are useful when support values matter, for example when generating rules, because the support of every frequent itemset can be recovered from them without rescanning the data. Maximal frequent itemsets are preferred when it only matters which itemsets are frequent: they are the most compact summary, since every frequent itemset is a subset of some maximal one, but they do not retain the supports of those subsets.

***
# Section 4: Generate Rules (20%)
4A) How many possible rules are there for the data that exists ONLY in the dataset? For example, if there is no data set with items {A,B}, do not list any rules such as {A,B} -> C.

4B) Calculate the number of rules possible if you had the itemset: {B, C, F}.

4C) List all the possible rules from the itemset shown in 4B.

4D) For the Itemset in 4B, using MinConf = 0.75 prune the rules using the anti-monotone property of rules. How many rules remain?

4E) List all the rules in 4D, and explain why.

4F) Explain why pruning rules might be advantageous for large data sets.
***

In [10]:
# 4A) How many possible rules are there for the data that exists ONLY in the dataset? 
# Possible rules are of the form {X} -> {Y}, where X and Y are non-empty, disjoint itemsets.
# min_support=0.01 is below 1/50, so every itemset that occurs in at least one transaction is kept.
frequent_itemsets_ap = apriori(df_one_hot, min_support=.01, use_colnames=True)
rules_ap = association_rules(frequent_itemsets_ap, metric="confidence", min_threshold=0)
rules_ap

# valid_rules = [rule for rule in rules_ap.itertuples() if set(rule[1]).issubset(df_one_hot.columns)]
# print(f"Total possible valid rules: {len(valid_rules)}")

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Bananas),(Apples),0.58,0.34,0.26,0.448276,1.318458,0.0628,1.196250,0.575092
1,(Apples),(Bananas),0.34,0.58,0.26,0.764706,1.318458,0.0628,1.785000,0.365967
2,(Apples),(Carrots),0.34,0.48,0.18,0.529412,1.102941,0.0168,1.105000,0.141414
3,(Carrots),(Apples),0.48,0.34,0.18,0.375000,1.102941,0.0168,1.056000,0.179487
4,(Donuts),(Apples),0.40,0.34,0.16,0.400000,1.176471,0.0240,1.100000,0.250000
...,...,...,...,...,...,...,...,...,...,...
597,(Eggs),"(Bananas, Donuts, Carrots, Apples, Fish)",0.40,0.02,0.02,0.050000,2.500000,0.0120,1.031579,1.000000
598,(Donuts),"(Bananas, Eggs, Carrots, Apples, Fish)",0.40,0.04,0.02,0.050000,1.250000,0.0040,1.010526,0.333333
599,(Carrots),"(Bananas, Eggs, Donuts, Apples, Fish)",0.48,0.02,0.02,0.041667,2.083333,0.0104,1.022609,1.000000
600,(Apples),"(Bananas, Eggs, Donuts, Carrots, Fish)",0.34,0.02,0.02,0.058824,2.941176,0.0132,1.041250,1.000000


In [19]:
# 4B) Calculate the number of rules possible if you had the itemset: {B, C, F}.
# {B, C, F} corresponds to the columns 'Bananas', 'Carrots', and 'Fish'.
# Restricting the data to these three columns makes mlxtend enumerate every rule
# X -> Y whose items all come from this itemset (with min_support = 0.05).

itemset = ['Bananas', 'Carrots', 'Fish']  # a list keeps the column order deterministic
frequent_bcf = apriori(df_one_hot[itemset], min_support=0.05, use_colnames=True)
rules_4b = association_rules(frequent_bcf, metric="confidence", min_threshold=0)
print(f"Number of possible rules: {len(rules_4b)}")

Number of possible rules: 0
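
Independently of the mlxtend output, the count for 4B can be backed up with the formula approach: for three items, $3^3-2^4+1 = 12$ rules, which matches the 12 rules listed for 4C. Of these, $2^3-2 = 6$ use all three items. The sketch below enumerates them directly; `all_rules` is a helper name made up for this example and does not touch the dataset.

```python
from itertools import combinations


def all_rules(universe):
    """All rules X -> Y with X, Y non-empty and disjoint, drawn from universe."""
    universe = sorted(universe)
    rules = []
    for k in range(2, len(universe) + 1):
        for itemset in combinations(universe, k):
            for r in range(1, k):
                for antecedent in combinations(itemset, r):
                    # The consequent is whatever the antecedent leaves over
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    rules.append((antecedent, consequent))
    return rules


rules = all_rules({"B", "C", "F"})
# Rules whose antecedent and consequent together cover the whole itemset {B, C, F}
full_itemset_rules = [r for r in rules if len(r[0]) + len(r[1]) == 3]

for antecedent, consequent in rules:
    print(f"{{{','.join(antecedent)}}} -> {{{','.join(consequent)}}}")
print(len(rules), len(full_itemset_rules))  # 12 6
```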


In [12]:
# 4C) List all the possible rules from the itemset shown in 4B.
# For the itemset {Bananas, Carrots, Fish}, the possible rules are all the combinations of subsets of this itemset 
# as antecedents and the remaining items as consequents.

for rule in rules_4b.itertuples():
    print(f"Rule: {rule.antecedents} -> {rule.consequents}")

Rule: frozenset({'Bananas'}) -> frozenset({'Fish'})
Rule: frozenset({'Fish'}) -> frozenset({'Bananas'})
Rule: frozenset({'Bananas'}) -> frozenset({'Carrots'})
Rule: frozenset({'Carrots'}) -> frozenset({'Bananas'})
Rule: frozenset({'Fish'}) -> frozenset({'Carrots'})
Rule: frozenset({'Carrots'}) -> frozenset({'Fish'})
Rule: frozenset({'Bananas', 'Fish'}) -> frozenset({'Carrots'})
Rule: frozenset({'Bananas', 'Carrots'}) -> frozenset({'Fish'})
Rule: frozenset({'Fish', 'Carrots'}) -> frozenset({'Bananas'})
Rule: frozenset({'Bananas'}) -> frozenset({'Fish', 'Carrots'})
Rule: frozenset({'Fish'}) -> frozenset({'Bananas', 'Carrots'})
Rule: frozenset({'Carrots'}) -> frozenset({'Bananas', 'Fish'})


In [13]:
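# 4D) For the Itemset in 4B, using MinConf = 0.75 prune the rules using the anti-monotone property of rules. 
# How many rules remain?
# Anti-monotone property: for rules generated from the same itemset, moving items from the
# antecedent to the consequent can never raise confidence. Once X -> Y fails MinConf,
# every rule from that itemset whose antecedent is a subset of X fails as well.
# The filter below checks each rule's confidence directly, which keeps the same rules that survive such pruning.

rules_4d = rules_4b[rules_4b['confidence'] >= 0.75]
print(f"Remaining rules after pruning: {len(rules_4d)}")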

Remaining rules after pruning: 3


In [14]:
# 4E) List all the rules in 4D, and explain why.
for rule in rules_4d.itertuples():
    print(f"Remaining rule: {rule.antecedents} -> {rule.consequents}")

# These 3 rules are the only ones of the 12 generated from {B, C, F} whose confidence reaches 0.75.
# The other 9 fall below the threshold, so the antecedent alone is a weak predictor of the consequent in those cases.

Remaining rule: frozenset({'Fish'}) -> frozenset({'Carrots'})
Remaining rule: frozenset({'Carrots'}) -> frozenset({'Fish'})
Remaining rule: frozenset({'Bananas', 'Fish'}) -> frozenset({'Carrots'})
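
The anti-monotone property asked about in 4D can be sketched as level-wise rule generation for a single itemset: start with one-item consequents, and only grow consequents whose smaller versions all passed MinConf. The support counts below are hypothetical values chosen for illustration (they are not read from `hw5_data.csv`), and `rules_with_pruning` is a name invented for this sketch. It covers only the rules that use all of {B, C, F}.

```python
from itertools import combinations

# Hypothetical support counts, for illustration only (NOT from the homework data)
support = {
    frozenset("B"): 30, frozenset("C"): 28, frozenset("F"): 18,
    frozenset("BC"): 24, frozenset("BF"): 16, frozenset("CF"): 17,
    frozenset("BCF"): 15,
}


def rules_with_pruning(itemset, support, min_conf):
    """Generate rules X -> itemset - X level by level, pruning by confidence."""
    itemset = frozenset(itemset)
    kept, evaluated = [], 0
    # Level 1: every single item as the consequent
    consequents = [frozenset([i]) for i in sorted(itemset)]
    while consequents:
        passed = []
        for cons in consequents:
            ante = itemset - cons
            if not ante:
                continue  # a rule needs a non-empty antecedent
            evaluated += 1
            conf = support[itemset] / support[ante]
            if conf >= min_conf:
                kept.append((ante, cons, conf))
                passed.append(cons)
        # Next level: merge passing consequents; a larger consequent is a candidate
        # only if every consequent one item smaller also passed (anti-monotone pruning)
        size = len(consequents[0]) + 1
        passed_set = set(passed)
        candidates = {a | b for a, b in combinations(passed, 2) if len(a | b) == size}
        consequents = sorted(
            (c for c in candidates if all(c - {i} in passed_set for i in c)),
            key=sorted,
        )
    return kept, evaluated


kept, evaluated = rules_with_pruning("BCF", support, min_conf=0.75)
for ante, cons, conf in kept:
    print(f"{sorted(ante)} -> {sorted(cons)}  confidence = {conf:.3f}")
print("rules evaluated:", evaluated)  # rules evaluated: 4
```

With these toy counts, {B, C} -> {F} fails MinConf, so {B} -> {C, F} and {C} -> {B, F} are never evaluated: four confidence checks instead of six.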


4F) Explain why pruning rules might be advantageous for large data sets.<br>

Pruning rules helps to reduce the complexity of the model. In large datasets, there can be a massive number of rules generated, many of which might be weak or insignificant. Pruning helps in the following ways:<br>

1. Faster Computation: By discarding rules with low confidence, we reduce the number of rules that need to be considered, improving the computational efficiency.
2. Better Interpretability: Fewer, stronger rules are easier to interpret and use for decision-making.
3. Reduced Overfitting: Keeping only strong rules helps in avoiding overfitting, where the model may fit too closely to the training data but perform poorly on unseen data.

Pruning reduces the search space and makes the process more efficient, especially in the context of large-scale datasets where the number of itemsets and potential rules can grow exponentially.
<br>

***
# Section 5: Rule Evaluation (20%)
5A) For the rules in the dataset, find a rule (if any) that has a lift >1. Show the rule and the lift.

5B) For the rules in the dataset, find a rule (if any) that has a lift <1. Show the rule and the lift.

5C) How would the lift in 5A) and 5B) change if we added 1000 transactions to the dataset and each of these 1000 transactions had items that were not part of the original dataset (for example, if we started adding things like Teslas and yachts and other such items)?

5D) What does the above tell you about lift? What other metrics might you consider?
***

In [15]:
# 5A) For the rules in the dataset, find a rule (if any) that has a lift >1. Show the rule and the lift.

frequent_itemsets = apriori(df_one_hot, min_support=0.05, use_colnames=True)

# Generate the association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0)

# Filter out rules with lift greater than 1
rules_lift_greater_than_1 = rules[rules['lift'] > 1]

# Display the rules with lift > 1
rules_lift_greater_than_1[['antecedents', 'consequents', 'lift']].reset_index(drop=True)

Unnamed: 0,antecedents,consequents,lift
0,(Bananas),(Apples),1.318458
1,(Apples),(Bananas),1.318458
2,(Apples),(Carrots),1.102941
3,(Carrots),(Apples),1.102941
4,(Donuts),(Apples),1.176471
...,...,...,...
63,"(Fish, Carrots)","(Bananas, Apples)",1.709402
64,(Bananas),"(Apples, Fish, Carrots)",1.532567
65,(Apples),"(Bananas, Fish, Carrots)",1.960784
66,(Fish),"(Bananas, Apples, Carrots)",2.380952


In [16]:
# 5B) For the rules in the dataset, find a rule (if any) that has a lift <1. Show the rule and the lift.

# Filter out rules with lift less than 1
rules_lift_less_than_1 = rules[rules['lift'] < 1]

# Display the rules with lift < 1
rules_lift_less_than_1[['antecedents', 'consequents', 'lift']].reset_index(drop=True)

Unnamed: 0,antecedents,consequents,lift
0,(Apples),(Eggs),0.441176
1,(Eggs),(Apples),0.441176
2,(Bananas),(Donuts),0.431034
3,(Donuts),(Bananas),0.431034
4,(Bananas),(Eggs),0.948276
5,(Eggs),(Bananas),0.948276
6,(Eggs),(Carrots),0.833333
7,(Carrots),(Eggs),0.833333
8,(Donuts),(Eggs),1.0
9,(Eggs),(Donuts),1.0
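
Rows 8 and 9 above display a lift of 1.0 even though the filter is `lift < 1`. One plausible explanation, stated here as an assumption rather than something verified against the data, is floating-point round-off: the true lift is exactly 1, but the computed value lands a hair below it. The sketch below uses the supports reported earlier in the notebook (Donuts 0.4, Eggs 0.4, both 0.16) and a tolerance-based comparison; `strictly_below_one` and its tolerance are illustrative choices.

```python
import math

# Supports reported earlier in the notebook: Donuts = 0.4, Eggs = 0.4, (Donuts, Eggs) = 0.16
support_donuts, support_eggs, support_both = 0.4, 0.4, 0.16

# Lift = s(X, Y) / (s(X) * s(Y)); mathematically this is exactly 1 here
lift = support_both / (support_donuts * support_eggs)
print(repr(lift), lift < 1)


def strictly_below_one(value, tol=1e-9):
    """True only if value is below 1 by more than the tolerance (tol is an arbitrary choice)."""
    return value < 1 and not math.isclose(value, 1.0, abs_tol=tol)


print(strictly_below_one(0.9999999999999998))  # False
print(strictly_below_one(0.948276))            # True
```

Applied to the `rules` DataFrame from 5A, the same idea would read `rules[rules['lift'] < 1 - 1e-9]`.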


5C) How would the lift in 5A) and 5B) change if we added 1000 transactions to the dataset and each of these 1000 transactions had items that were not part of the original dataset (for example, if we started adding things like Teslas and yachts and other such items)? <br>

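Lift would increase for every rule in 5A and 5B. Written with counts, $\text{lift}(X \to Y) = \dfrac{s(X \cup Y)}{s(X)\,s(Y)} = \dfrac{N\,\sigma(X \cup Y)}{\sigma(X)\,\sigma(Y)}$, where $\sigma$ is a support count and $N$ is the number of transactions. The new transactions contain none of the original items, so all counts stay the same while $N$ grows from 50 to 1050, which multiplies every lift by $1050/50 = 21$. For example, Bananas -> Apples goes from about 1.32 to about 27.7, and Apples -> Eggs goes from about 0.44 to about 9.3, so even the rules with lift < 1 end up above 1.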
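
The effect of the 1000 unrelated transactions can be worked through numerically. The sketch below uses the support counts implied by the notebook output for 50 transactions (Bananas 0.58 -> 29, Apples 0.34 -> 17, Eggs 0.40 -> 20, Bananas and Apples 0.26 -> 13, Apples and Eggs 0.06 -> 3); the function names are illustrative.

```python
def lift(count_xy, count_x, count_y, n):
    """Lift from raw support counts and the total number of transactions n."""
    return (count_xy / n) / ((count_x / n) * (count_y / n))


def confidence(count_xy, count_x):
    """Confidence depends only on counts, not on n."""
    return count_xy / count_x


n_before, n_after = 50, 50 + 1000  # the added transactions contain none of the six items

# Bananas -> Apples (lift > 1 in 5A)
ba_before = lift(13, 29, 17, n_before)
ba_after = lift(13, 29, 17, n_after)

# Apples -> Eggs (lift < 1 in 5B)
ae_before = lift(3, 17, 20, n_before)
ae_after = lift(3, 17, 20, n_after)

print(f"Bananas -> Apples: {ba_before:.4f} -> {ba_after:.4f}")
print(f"Apples  -> Eggs:   {ae_before:.4f} -> {ae_after:.4f}")
print(f"Confidence Bananas -> Apples stays at {confidence(13, 29):.4f}")
```

Both lifts grow by the same factor of 21, while confidence is untouched because it never involves the total transaction count.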

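5D) What does the above tell you about lift? What other metrics might you consider?<br>

Lift is not null-invariant: it depends on the total number of transactions, so transactions that contain none of the items in a rule still change its value. Adding such transactions inflates lift, which can make weakly or even negatively associated items look strongly associated.<br>
Other Metrics to Consider:<br>
1. Confidence: The fraction of transactions containing the antecedent that also contain the consequent. It does not involve the total transaction count, so unrelated transactions leave it unchanged.
2. Support: Measures how frequently an itemset appears in the dataset. It shrinks as unrelated transactions are added, so it is worth reading alongside lift.
3. Conviction: $(1 - s(Y)) / (1 - c(X \to Y))$, which compares how often the rule would be wrong if the items were independent with how often it is actually wrong. Like lift, it changes when unrelated transactions are added.
4. Leverage: Measures the difference between the observed frequency of an itemset and the expected frequency if the items were independent. It also depends on the total transaction count.
5. Null-invariant measures such as cosine or Jaccard: These depend only on the counts of the items involved, so they are unaffected by unrelated transactions.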