# Introduction to Data Science - Assignment Part II

This is the second part of the assignment in IDS 2023/2024.

This part of the assignment consists of five questions — each of these questions is contained in a separate Jupyter notebook:
- [Question 1: Data Preprocessing](Q1_Preprocessing_Visualization.ipynb)
- [Question 2: Association Rules](Q2_Frequent_Itemsets_Association_Rules.ipynb)
- [Question 3: Process Mining](Q3_Process_Mining.ipynb)
- [Question 4: Text Mining](Q4_Text_Mining.ipynb)
- [Question 5: Big Data](Q5_Big_Data.ipynb)

Additional required files are in two folders.
- [datasets](datasets/)
- [scripts](scripts/)

Please use the provided notebooks to work on the questions. When you are done, upload your version of each notebook to Moodle. Your submission will therefore consist of five Jupyter notebooks and _no_ additional files. Any additional files will not be considered in grading.
Enter your commented Python code and answers in the corresponding cells. Make sure to answer all questions in a clear and explicit manner and discuss your outputs. _Please do not change the general structure of this notebook_. You can, however, add additional markdown or code cells if necessary. Please **DO NOT CLEAR THE OUTPUT** of the notebook you are submitting! Additionally, please ensure that the code in the notebook runs if placed in the same folder as all of the provided files, delivering the same outputs as the ones you submit in the notebook. This includes being runnable in the bundled conda environment.

*Please make sure to include the names and matriculation numbers of all group members in the provided slots in each of the notebooks.* If a name or a student id is missing, the student will not receive any points.

Hint 1: **Plan your time wisely.** A few parts of this assignment may take some time to run. It might be necessary to consider time management when you plan your group work. Also, do not attempt to upload your assignment at the last minute before the deadline. This often does not work, and you will miss the deadline. Late submissions will not be considered.

Hint 2: RWTHMoodle allows multiple submissions, with every new submission overwriting the previous one. **Partial submissions are possible and encouraged.** This might be helpful in case of technical issues with RWTHMoodle, which may occur close to the deadline.

Hint 3: A technical note: some IDEs, such as DataSpell, may automatically strip Jupyter notebook cell metadata. If you can, please re-add it from the source notebooks before submission, as it is required for our grading.

Enter your group number and members with matriculation numbers below.

In [5]:
GROUP_NO = 112 # group number
GROUP_MEMBERS = {
    451963: "Paul Väthjunker", # mat. no. : name,
    413004: "Touyen Nguyenova"
}

---

In [6]:
# required imports
# please do not edit!
import csv
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
from mlxtend.frequent_patterns import apriori
import datetime
from mlxtend.frequent_patterns import association_rules as arule
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# Question 2: Frequent Item Sets and Association Rules (13 points)

In this question, you work with transaction data of customers' visits to the store.

### a) 
Load the transactions from the csv-file called **q2_store_transactions.csv** into a variable called `groceries`. The variable should be a list and each row in the csv-file should be represented as a list within this list.

In [7]:
import csv

# YOUR CODE HERE
# Read every CSV row as a list of item names (one inner list per transaction).
with open("datasets/q2_store_transactions.csv", newline='') as f:
    reader = csv.reader(f)
    groceries = list(reader)


In [8]:
# Please leave this cell empty - used for grading.

### b) 
Transform the entries from the list to a binary matrix using an object of *TransactionEncoder* as introduced in the exercise. Name the resulting dataframe `itemset_matrix` and display the first 20 rows.

In [9]:
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# YOUR CODE HERE
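# Learn the distinct items, then encode each transaction as a row of booleans.
te = TransactionEncoder().fit(groceries)
boolean_matrix = te.transform(groceries)

# Wrap the array in a DataFrame with one column per item.
itemset_matrix = pd.DataFrame(boolean_matrix, columns=te.columns_)
itemset_matrix.head(20)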

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,True,True,False,True,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [10]:
# Please leave this cell empty - used for grading.

In [11]:
# Please leave this cell empty - used for grading.
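
As a rough illustration of what the encoding step in b) does, the following pure-Python sketch builds the same kind of boolean matrix by hand. The toy transactions and the helper name `one_hot_encode` are made up for illustration; they are not part of the assignment data or of mlxtend, and the sketch does not claim to mirror how `TransactionEncoder` is implemented.

```python
def one_hot_encode(transactions):
    """Return (sorted item names, boolean rows) for a list of transactions."""
    # Collect every distinct item and fix a column order.
    columns = sorted({item for transaction in transactions for item in transaction})
    # One row per transaction: True where the item occurs in that transaction.
    rows = [[item in set(transaction) for item in columns] for transaction in transactions]
    return columns, rows


# Hypothetical toy transactions, not taken from q2_store_transactions.csv.
toy_transactions = [
    ["milk", "eggs"],
    ["eggs", "spaghetti", "mineral water"],
    ["milk"],
]

columns, rows = one_hot_encode(toy_transactions)
print(columns)
for row in rows:
    print(row)
```

Each column corresponds to one item and each row to one transaction, which is the layout that the frequent-itemset functions in the following tasks take as input.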

### c) 
Find all frequent itemsets with a **support of at least 0.03** using the Apriori algorithm and save them in a variable called `frequent_itemsets`. Display the resulting itemsets and the processing time (in milliseconds) required to detect them. 

In [34]:
%%time
from mlxtend.frequent_patterns import apriori
import datetime

# YOUR CODE HERE
start=datetime.datetime.now()
frequent_itemsets = apriori(itemset_matrix, min_support=0.03, use_colnames=True)
end=datetime.datetime.now()
print("Completion time",int((end-start).total_seconds()*1000),"ms")
frequent_itemsets

Completion time 30 ms
Wall time: 30 ms


Unnamed: 0,support,itemsets
0,0.033329,(avocado)
1,0.033729,(brownies)
2,0.087188,(burgers)
3,0.030129,(butter)
4,0.081056,(cake)
5,0.046794,(champagne)
6,0.059992,(chicken)
7,0.163845,(chocolate)
8,0.080389,(cookies)
9,0.05106,(cooking oil)


In [13]:
# Please leave this cell empty - used for grading.

### d)
Find the most frequent itemsets containing **more than one product** and a **support of more than 0.04** using the Apriori algorithm. Store them in a variable called `frequent_itemsets_filtered` and show the sets in your output.

In [14]:
# YOUR CODE HERE
# apriori's min_support is inclusive (>= 0.04), while the task asks for support > 0.04,
# so filter strictly afterwards and keep only itemsets with more than one product.
frequent_itemsets_filtered = apriori(itemset_matrix, min_support=0.04, use_colnames=True)
frequent_itemsets_filtered = frequent_itemsets_filtered[
    (frequent_itemsets_filtered['support'] > 0.04)
    & (frequent_itemsets_filtered['itemsets'].apply(len) > 1)
]
frequent_itemsets_filtered

Unnamed: 0,support,itemsets
30,0.05266,"(mineral water, chocolate)"
31,0.050927,"(mineral water, eggs)"
32,0.040928,"(mineral water, ground beef)"
33,0.047994,"(mineral water, milk)"
34,0.059725,"(mineral water, spaghetti)"


In [15]:
# Please leave this cell empty - used for grading.

In [16]:
# Please leave this cell empty - used for grading.

In [17]:
# Please leave this cell empty - used for grading.

### e)
Find all association rules in the data that have a **confidence of at least 0.3** and a **minimum lift of 1.2** based on the frequent itemsets with a support of at least 0.03 (`frequent_itemsets`). Create and show a dataframe `association_rules` listing the antecedents, consequents, support, confidence, and lift of each of these discovered rules. How do you interpret the quality of the discovered rules?

In [18]:
from mlxtend.frequent_patterns import association_rules as arule
import pandas as pd

# YOUR CODE HERE
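# Generate rules with confidence >= 0.3, then keep those that also reach lift >= 1.2.
# The task asks for the name `association_rules` (the mlxtend function is imported as `arule`).
association_rules = arule(frequent_itemsets, metric='confidence', min_threshold=0.3)
association_rules = association_rules[(association_rules['confidence'] >= 0.3) &
       (association_rules['lift'] >= 1.2)]

association_rules = association_rules[["antecedents", 'consequents', 'support', 'confidence', 'lift']]
association_rules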


Unnamed: 0,antecedents,consequents,support,confidence,lift
0,(chocolate),(mineral water),0.05266,0.3214,1.348332
1,(frozen vegetables),(mineral water),0.035729,0.374825,1.572463
2,(ground beef),(mineral water),0.040928,0.416554,1.747522
3,(ground beef),(spaghetti),0.039195,0.398915,2.291162
4,(milk),(mineral water),0.047994,0.37037,1.553774
5,(pancakes),(mineral water),0.033729,0.354839,1.488616
6,(spaghetti),(mineral water),0.059725,0.343032,1.439085


In [19]:
# Please leave this cell empty - used for grading.

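__Student Answer:__ All supports are low (about 0.03 to 0.06), meaning each rule applies to only a small share of all transactions. The confidences are moderate (0.32 to 0.42): the consequent is bought in roughly a third of the transactions that contain the antecedent. All lifts are above 1.2, so the items occur together more often than would be expected if they were independent. Six of the seven rules have mineral water as the consequent; since mineral water is itself a very frequent item, their confidence partly reflects its general popularity, and their lifts are correspondingly modest (1.35 to 1.75). The rule from ground beef to spaghetti has a significantly higher lift (2.29) than any other rule, suggesting a strong association between ground beef and spaghetti.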
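
To make the three measures concrete, here is a minimal sketch that computes support, confidence, and lift directly from their definitions. The ten toy baskets and the helper names `support` and `rule_metrics` are assumptions made for illustration; they are not the store data and not mlxtend functions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    joint = support(set(antecedent) | set(consequent), transactions)
    confidence = joint / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return joint, confidence, lift


# Hypothetical toy data: 10 baskets, not the store transactions.
toy = (
    [["ground beef", "spaghetti"]] * 2
    + [["ground beef"]] * 1
    + [["spaghetti"]] * 2
    + [["mineral water"]] * 5
)

rule_support, rule_confidence, rule_lift = rule_metrics(["ground beef"], ["spaghetti"], toy)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

Because lift is confidence divided by the support of the consequent, a rule pointing at a very frequent consequent needs a high confidence to reach a high lift. This is consistent with the table above: for the rule from ground beef to mineral water, 0.416554 / 1.747522 is approximately 0.238, the support of mineral water.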

### f) 
Find all frequent itemsets with a **support of at least 0.03** using **FP-Growth** and save them in a variable called `fp_frequent_itemsets`. Display the resulting itemsets and the processing time (in milliseconds) required to detect them. 

In [35]:
%%time
from mlxtend.frequent_patterns import fpgrowth

# YOUR CODE HERE

start=datetime.datetime.now()
fp_frequent_itemsets=fpgrowth(itemset_matrix,min_support=0.03,use_colnames=True)
end=datetime.datetime.now()
print("Completion time",int((end-start).total_seconds()*1000),"ms")
fp_frequent_itemsets

Completion time 161 ms
Wall time: 161 ms


Unnamed: 0,support,itemsets
0,0.238368,(mineral water)
1,0.132116,(green tea)
2,0.076523,(low fat yogurt)
3,0.071457,(shrimp)
4,0.065858,(olive oil)
5,0.063325,(frozen smoothie)
6,0.04746,(honey)
7,0.042528,(salmon)
8,0.033329,(avocado)
9,0.031862,(cottage cheese)


In [17]:
# Please leave this cell empty - used for grading.

### g)
Using the itemsets identified by **FP-Growth**: Find all association rules in the data that have a **confidence of at least 0.3** and a **minimum lift of 1.2** based on the frequent itemsets with a support of at least 0.03 (`fp_frequent_itemsets`). Create and show a dataframe `fp_association_rules` listing the antecedents, consequents, support, confidence, and lift of each of these discovered rules.

In [21]:
# YOUR CODE HERE

# Generate rules with confidence >= 0.3, then keep those that also reach lift >= 1.2.
fp_association_rules = arule(fp_frequent_itemsets, metric='confidence', min_threshold=0.3)
fp_association_rules = fp_association_rules[(fp_association_rules['confidence'] >= 0.3) &
       (fp_association_rules['lift'] >= 1.2)]

fp_association_rules[["antecedents", 'consequents', 'support', 'confidence', 'lift']]

Unnamed: 0,antecedents,consequents,support,confidence,lift
0,(milk),(mineral water),0.047994,0.37037,1.553774
1,(spaghetti),(mineral water),0.059725,0.343032,1.439085
2,(frozen vegetables),(mineral water),0.035729,0.374825,1.572463
3,(chocolate),(mineral water),0.05266,0.3214,1.348332
4,(pancakes),(mineral water),0.033729,0.354839,1.488616
5,(ground beef),(mineral water),0.040928,0.416554,1.747522
6,(ground beef),(spaghetti),0.039195,0.398915,2.291162


In [None]:
# Please leave this cell empty - used for grading.

### h) 
You would like to compare the Apriori algorithm and FP-Growth.

i) Both algorithms use the same data (transaction data) as an input and provide association rules as an output. How do the algorithms differ in the way they identify association rules?


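__Student Answer:__ The Apriori algorithm works level by level: it generates candidate itemsets of size k+1 from the frequent itemsets of size k and scans the dataset once per level to count their support, so it requires multiple passes over the data.

FP-Growth, on the other hand, requires only two passes over the dataset. It first builds a compact tree structure of the frequent items (the FP-tree) and then mines it by constructing conditional trees and extending them recursively, without generating candidates. In both cases, the association rules are then derived from the resulting frequent itemsets in the same way.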
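
The level-wise idea behind Apriori can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not mlxtend's implementation: the function name `apriori_sketch` and the toy baskets are hypothetical, and the `passes` counter is only there to show that the data is scanned once per itemset size.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Level-wise search; returns ({itemset: support}, number of passes over the data)."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent, passes, k = {}, 0, 1
    while candidates:
        # One pass over the data per level to count the current candidates.
        passes += 1
        counts = {c: sum(c <= b for b in baskets) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into candidates of size k + 1.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k - 1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, passes


# Hypothetical toy baskets, not the store transactions.
toy = [
    ["milk", "eggs", "spaghetti"],
    ["milk", "eggs"],
    ["milk", "spaghetti"],
    ["eggs"],
]
frequent, passes = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
print("passes over the data:", passes)
```

On this toy input the three-item candidate is pruned before it is ever counted, because one of its two-item subsets is infrequent. FP-Growth avoids the per-level scans altogether by working on the FP-tree; it is not sketched here.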

ii) Consider your results of the previous tasks. Do the two algorithms provide the same association rules? Is this always the case?

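__Student Answer:__ Yes, both runs yield the same seven association rules with identical support, confidence, and lift; only the row order differs. This is always the case when the same thresholds are used, because both algorithms are exact and return the complete set of frequent itemsets, from which the rules are derived.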
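
The claim can be checked without relying on row order. The sketch below copies the displayed (rounded) values from the two rule tables in e) and g) into plain tuples and compares them as sets; the variable names are chosen for illustration only.

```python
# Rules as (antecedent, consequent, support, confidence, lift), copied from the output of e).
apriori_rules = [
    ("chocolate", "mineral water", 0.05266, 0.3214, 1.348332),
    ("frozen vegetables", "mineral water", 0.035729, 0.374825, 1.572463),
    ("ground beef", "mineral water", 0.040928, 0.416554, 1.747522),
    ("ground beef", "spaghetti", 0.039195, 0.398915, 2.291162),
    ("milk", "mineral water", 0.047994, 0.37037, 1.553774),
    ("pancakes", "mineral water", 0.033729, 0.354839, 1.488616),
    ("spaghetti", "mineral water", 0.059725, 0.343032, 1.439085),
]

# Rules copied from the output of g), in the order FP-Growth produced them.
fp_growth_rules = [
    ("milk", "mineral water", 0.047994, 0.37037, 1.553774),
    ("spaghetti", "mineral water", 0.059725, 0.343032, 1.439085),
    ("frozen vegetables", "mineral water", 0.035729, 0.374825, 1.572463),
    ("chocolate", "mineral water", 0.05266, 0.3214, 1.348332),
    ("pancakes", "mineral water", 0.033729, 0.354839, 1.488616),
    ("ground beef", "mineral water", 0.040928, 0.416554, 1.747522),
    ("ground beef", "spaghetti", 0.039195, 0.398915, 2.291162),
]

# Sets ignore row order, so equality means both tables contain exactly the same rules.
same_rules = set(apriori_rules) == set(fp_growth_rules)
print("identical rule sets:", same_rules)
```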

iii) Compare the processing time for finding the frequent itemsets using the Apriori algorithm and FP-Growth. What do you notice? Is this the result you expected? Briefly explain your answers.


__Student Answer:__ 
The Apriori algorithm took less time than FP-Growth (30 ms versus 161 ms). This is generally not expected: in most cases, FP-Growth should be faster, as it requires fewer passes over the dataset. However, when the number of distinct items is relatively small, which appears to be the case here, Apriori has to count only few candidates and is quite fast.
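
Since each time above comes from a single run, the comparison may be noisy. A minimal sketch of a more robust measurement, assuming a hypothetical helper `median_runtime_ms` and a placeholder workload, could look like this:

```python
import statistics
import time


def median_runtime_ms(func, repeats=5):
    """Call `func` `repeats` times and return the median wall-clock time in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append((time.perf_counter() - start) * 1000)
    return statistics.median(durations)


# Placeholder workload; in the notebook this could be a lambda that wraps
# the apriori(...) or fpgrowth(...) call from the tasks above.
elapsed = median_runtime_ms(lambda: sum(range(100_000)))
print(f"median runtime: {elapsed:.2f} ms")
```

Taking the median over several runs reduces the influence of one-off effects such as caching or background load on the comparison.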