# Assignment 5: Association mining

## Objective of this assignment
The overall objective is to understand how frequent itemsets can be extracted by
the Apriori algorithm and be able to calculate and interpret association rules in terms of support and confidence.

## **Important:** When handing in your homework:
+ Hand in the notebook (and nothing else) named as follows: StudentName1_snumber_StudentName2_snumber.ipynb
+ Provide clear and complete answers to the questions below under a separate header (not hidden somewhere in your source code), and make sure to explain your answers / motivate your choices. Add Markdown cells where necessary.
+ Source code, output graphs, derivations, etc., should be included in the notebook.
+ Hand-in: upload to Blackboard.
+ Include name, student number, assignment (especially in filenames)!
+ When working in pairs only one of you should upload the assignment, and report the name of your partner in your filename.
+ For problems or questions: use the BB discussion board or email the student assistants.


## Advised Reading and Exercise Material
**The following reading material is recommended:**

- Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, *Introduction to Data Mining*, section 6.


## Additional Tools
For this exercise you will need to load the provided *apriorimining.py* script. 


##  5.1 Association mining for course data 
We will use the Apriori algorithm to automatically mine for associations. The implementation we use is an adapted version of the script found here: https://github.com/nalinaksh/Association-Rule-Mining-Python

Read through the script and its documentation and make sure you understand how the association rules are computed.



In [1]:
from Toolbox.apriorimining import generate_association_rules


#### 5.1.1

(0 points) The data for this exercise is represented in Table 1. Inspect the file `Data/courses.txt` and make sure you understand how the data in Table 1 is stored in the text file.

##### Table 1
| # | History | Math | Biology | Spanish | Economics | Physics | Chemistry | English |
| :-------------: | :-------------: | :-----------: | :----------: | :------------: | :-------------: | :------------: | :-------------: | :-------------: |
| student 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| student 2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| student 3 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| student 4 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| student 5 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| student 6 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |


#### 5.1.2
(1 point) We will analyze the data in Table 1 automatically using the function `apriorimining.generate_association_rules()` from the script. Analyze the data with $\text{minsupport} \geq 80\%$ and $\text{minconfidence} \geq 100\%$. What are the generated association rules? What kind of conclusions can you draw from these association rules about the subjects that students took?
  


In [2]:
generate_association_rules()

Please enter support value in %: 80
Please enter confidence value in %: 100
Enter the max number of rules you want to see (enter 0 to see all rules): 0
Please enter filepath\filename and extension: 'Data/courses.txt'
---------------TOP 10 FREQUENT 1-ITEMSET-------------------------
set= { 6 },  sup= 100.0
set= { 2 },  sup= 83.33
set= { 7 },  sup= 83.33
-----------------------------------------------------------------
-------TOP 10 (or less) FREQUENT 2-ITEMSET------------------------
set= { 2, 6 },  sup= 83.0
set= { 6, 7 },  sup= 83.0
------------------------------------------------------------------
---------------------ASSOCIATION RULES------------------
--------------------------------------------------------
Rule #1: {  } ==> { 6 }, sup= 100.00, conf= 100.00

Rule #2: { 2 } ==> { 6 }, sup= 83.33, conf= 100.00

Rule #3: { 7 } ==> { 6 }, sup= 83.33, conf= 100.00
--------------------------------------------------------


From these association rules we can conclude that every student takes course 6 (Physics), since Rule #1 has an empty antecedent and a support of 100%. Courses 2 and 7 (Math and Chemistry) are each taken by 83.33% of the students (five out of six). Rules #2 and #3 have a confidence of 100%, so every student who takes Math also takes Physics, and every student who takes Chemistry also takes Physics. These rules are not very informative, though: because everyone takes Physics, any rule with Physics as its consequent automatically has a confidence of 100%.
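The supports and confidences reported by the script can be checked by hand. The sketch below is not the `apriorimining.py` implementation. It re-encodes Table 1 as sets of course numbers, assuming the courses are numbered 1 to 8 in the column order of Table 1 (consistent with the output above, where 6 is Physics), and uses two hypothetical helper functions, `support` and `confidence`.

```python
# Table 1 as transactions. Assumption: course numbers follow the column
# order of Table 1 (1 = History, 2 = Math, ..., 6 = Physics, 7 = Chemistry,
# 8 = English).
transactions = [
    {2, 5, 6, 7, 8},     # student 1
    {1, 2, 3, 6, 7, 8},  # student 2
    {2, 4, 6, 8},        # student 3
    {3, 6, 7},           # student 4
    {2, 6, 7},           # student 5
    {2, 3, 6, 7, 8},     # student 6
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


for course in (2, 6, 7):
    print(f"support({{{course}}}) = {100 * support({course}):.2f}%")

for antecedent in ({2}, {7}):
    print(f"confidence({antecedent} -> {{6}}) = "
          f"{100 * confidence(antecedent, {6}):.2f}%")
```

With this encoding, Math and Chemistry each appear in five of the six transactions and Physics in all six, which matches the frequent 1-itemsets listed above.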

##  5.2 Association mining for MovieLens data

In this part of the exercise we consider a market basket data set containing the purchases of 1682 movies by 943 users. A total of 100,000 movies have been purchased. The data set is called MovieLens100K and is provided by http://www.grouplens.org/node/73; see also the readme `MovieLensData.txt` in the data folder. The data considered here is not the original data: it has been modified for the Apriori algorithm.

#### 5.2.1
  (0 points) The MovieLens data is stored in the file MovieLensData.txt. Inspect the file to see how the data is stored.


#### 5.2.2 
(1 point) Find association rules using the function below with $\text{minsupport} \geq 30\%$ and $\text{minconfidence} \geq 80\%$. What are the associations with the strongest confidence? Do these associations make sense? The script can use the file `Data/u.item` to print the movie titles instead of numbers. If you enter the filename `MovieLensData.txt`, the script offers an additional option for this.
  

In [4]:
generate_association_rules()

Please enter support value in %: 30
Please enter confidence value in %: 80
Enter the max number of rules you want to see (enter 0 to see all rules): 0
Please enter filepath\filename and extension: 'Data/MovieLensData.txt'
Do you want to print sets and rules with Movie names in stead of numbers? [y/n]: 'y'
---------------TOP 10 FREQUENT 1-ITEMSET-------------------------
set= { Star Wars (1977) },  sup= 61.82
set= { Contact (1997) },  sup= 53.98
set= { Fargo (1996) },  sup= 53.87
set= { Return of the Jedi (1983) },  sup= 53.76
set= { Liar Liar (1997) },  sup= 51.43
set= { English Patient, The (1996) },  sup= 51.01
set= { Scream (1996) },  sup= 50.69
set= { Toy Story (1995) },  sup= 47.93
set= { Air Force One (1997) },  sup= 45.71
set= { Independence Day (ID4) (1996) },  sup= 45.49
-----------------------------------------------------------------
-------TOP 10 (or less) FREQUENT 2-ITEMSET------------------------
set= { Return of the Jedi (1983), Star Wars (1977) },  sup= 50.0
set= { Farg

When the association rules are printed with movie numbers, it is hard to judge whether they make sense. With the movie titles printed instead, the rules are easy to interpret. Most rules with high confidence link movies from the same series, such as 'Star Wars'. When the movies in a rule do not belong to the same series, they are usually very popular movies, such as 'The Godfather', so the rule has a high confidence simply because most users have watched those movies.

#### 5.2.3 
(1 point) Which movies have been watched by the most users? There are only a few rules with more than three items. Why?

'Star Wars' (1977) has been watched by the most users, 61.82% to be precise. There is only one frequent four-itemset. This is because a user only counts towards the support of an itemset if that user bought all movies in it, so the support of an itemset can never be higher than the support of any of its subsets. A user who bought only three of the four movies does not count towards the four-itemset. Every extra item therefore lowers the support, and very few itemsets with four items still reach the minimum support of 30%.
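The way support shrinks as itemsets grow can be illustrated on a toy example. The baskets below are hypothetical and are not taken from the MovieLens data; the sketch only shows that the best achievable support cannot increase when an item is added.

```python
from itertools import combinations

# Hypothetical baskets, not taken from the MovieLens data.
baskets = [
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"A", "C"},
    {"B", "D"},
    {"A"},
]
items = sorted(set().union(*baskets))


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)


# Highest support reached by any itemset of size k.
best = {
    k: max(support(c) for c in combinations(items, k))
    for k in range(1, len(items) + 1)
}
for k, s in best.items():
    print(f"best support for {k}-itemsets: {s:.2f}")
```

With a fixed minimum support, fewer and fewer itemsets survive as the size grows, which is also what allows Apriori to prune candidates.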

#### 5.2.4
(0.5 points) Often we are interested in rules with high confidence. Is it possible for
itemsets to have very low support but still have a very high confidence?

Yes, a rule can have a very low support and still a very high confidence. Say there is a very good but obscure movie that almost nobody knows, and that movie has a sequel. People who watched the first movie will almost certainly watch the sequel, because the first one was very good, so the confidence of the rule is very high. But since most people have watched neither of those two movies, the support is very low.
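A small numeric version of this argument, using made-up counts: out of 1000 hypothetical users, 10 watched an obscure movie `X` and 9 of those also watched its sequel `X2`.

```python
# Hypothetical data: 1000 users, 10 watched "X", 9 of those also watched "X2".
baskets = [{"X", "X2"}] * 9 + [{"X"}] + [{"other"}] * 990

n_total = len(baskets)
n_x = sum("X" in b for b in baskets)
n_x_and_x2 = sum({"X", "X2"} <= b for b in baskets)

rule_support = n_x_and_x2 / n_total  # support of the itemset {X, X2}
rule_confidence = n_x_and_x2 / n_x   # confidence of the rule {X} -> {X2}

print(f"support    = {100 * rule_support:.1f}%")
print(f"confidence = {100 * rule_confidence:.1f}%")
```

The rule would be discarded by any realistic minimum support, even though its confidence is high.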

## 5.3 Calculating support, confidence and interest

Calculate these measures and write down how you computed things, not just the answers. 


#### 5.3.1
 Suppose we have market basket data consisting of 100 transactions and 20 items. The support for item $ \text{a} = 45 \%$, the support for item $ \text{b} = 80 \%$ and the support for itemset $ \text{ {a,b }} = 30 \%$. Let the support and confidence thresholds be 20$ \%$ and 60$ \%$, respectively.
  
1. (0.5 points) Compute the confidence of the association rule $ \text{ {a } } \rightarrow   \text{{b }} $. Is the rule interesting according to the confidence measure?

2. (0.5 points) Compute the interest measure (or lift, see slide 44 of chapter 6) for the association pattern $ \text{ {a,b}}$. Describe the nature of the relationship between item $ \text{a}$ and item $ \text{b}$  in terms of the itemset measure.
3. (1 point) What conclusion can you draw from the results of parts (1) and (2)?

1. The confidence of the rule $\text{{a}} \rightarrow \text{{b}}$ is the support of the itemset $\text{{a, b}}$ divided by the support of $\text{{a}}$: $s(\text{{a, b}})\ /\ s(\text{{a}}) = 30\%\ /\ 45\% = 2\ /\ 3 \approx 67\%$. The rule is interesting according to the confidence measure, because $67\%$ exceeds the confidence threshold of $60\%$ (and the support of $30\%$ exceeds the support threshold of $20\%$). A confidence of $67\%$ means that $67\%$ of the transactions that contain item $a$ also contain item $b$.

2. $\text{Lift} = P(Y | X)\ /\ P(Y)$. So the lift of the association pattern $\text{{a, b}}$ is $P(b | a)\ /\ P(b) = 67\%\ /\ 80\% = 5\ /\ 6 \approx 0.83$. A lift of $1$ would mean that $a$ and $b$ are statistically independent. Here the lift is below $1$, so items $a$ and $b$ are negatively correlated: a customer who buys item $a$ is less likely to buy item $b$ ($67\%$) than a customer in general ($80\%$).

3. The rule $\text{{a}} \rightarrow \text{{b}}$ passes both the support threshold ($30\% \geq 20\%$) and the confidence threshold ($67\% \geq 60\%$), so it would be reported as a strong rule. The lift of $0.83$ shows that this is misleading: the confidence is high only because item $b$ is bought in $80\%$ of all transactions, and buying item $a$ actually lowers the probability of buying item $b$. So a high confidence alone does not imply a positive relationship between two items; the support of the consequent has to be taken into account as well.
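The arithmetic for 5.3.1 can be reproduced in a few lines. This is a minimal sketch that uses only the supports and thresholds given in the exercise; the variable names are chosen for illustration.

```python
# Supports and thresholds as given in exercise 5.3.1.
sup_a, sup_b, sup_ab = 0.45, 0.80, 0.30
min_sup, min_conf = 0.20, 0.60

conf_a_to_b = sup_ab / sup_a       # P(b | a)
lift_ab = sup_ab / (sup_a * sup_b)  # equals P(b | a) / P(b)

print(f"confidence(a -> b) = {conf_a_to_b:.3f}")
print(f"lift(a, b)         = {lift_ab:.3f}")
print("passes thresholds: ", sup_ab >= min_sup and conf_a_to_b >= min_conf)
print("positively related:", lift_ab > 1)
```

The rule passes both thresholds while the lift stays below 1, which is exactly the situation where confidence alone gives the wrong impression.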

#### 5.3.2

1. (1 point) Let $c_1$, $c_2$, and $c_3$ be the confidence values of the rules $ \text{ {a } } \rightarrow   \text{{b }} $, $ \text{ {a } } \rightarrow   \text{{b,c }} $, and $ \text{ {a,c } } \rightarrow   \text{{b }} $ respectively. If we assume that $c_1$, $c_2$, and $c_3$ have different values, what are the possible inequality relationships (e.g. $c_1 \leq c_2 \leq c_3$) among $c_1$, $c_2$, and $c_3$? Which rule has the lowest confidence?
2. (0.5 points) Suppose the confidence of the rules  $ \text{ {a } } \rightarrow   \text{{b }} $ and $ \text{ {b } } \rightarrow   \text{{c }} $ are larger than the confidence threshold. Is it possible that $ \text{ {a } } \rightarrow   \text{{c }} $ has a confidence below that threshold? If no, explain why. If yes, give an example. 

1. Write $s(\cdot)$ for support. Then $c_1 = s(\text{{a, b}})\ /\ s(\text{{a}})$, $c_2 = s(\text{{a, b, c}})\ /\ s(\text{{a}})$ and $c_3 = s(\text{{a, b, c}})\ /\ s(\text{{a, c}})$. Every transaction that contains $\text{{a, b, c}}$ also contains $\text{{a, b}}$, so $s(\text{{a, b, c}}) \leq s(\text{{a, b}})$ and therefore $c_2 \leq c_1$. Likewise $s(\text{{a, c}}) \leq s(\text{{a}})$, so $c_2$ and $c_3$ have the same numerator but $c_3$ has the smaller denominator, which gives $c_2 \leq c_3$. There is no fixed relationship between $c_1$ and $c_3$. Since the values are assumed to be different, the possible orderings are $c_2 < c_1 < c_3$ and $c_2 < c_3 < c_1$. In both cases $c_2$, the confidence of the rule $\text{{a}} \rightarrow \text{{b, c}}$, is the lowest.

2. Yes, this is possible, because confidence is not transitive. A high confidence for $\text{{a}} \rightarrow \text{{b}}$ only says something about the transactions that contain $a$, and a high confidence for $\text{{b}} \rightarrow \text{{c}}$ only says something about the transactions that contain $b$. The transactions that contain both $b$ and $c$ do not have to be the ones that contain $a$. For example, take the four transactions $\text{{a, b}}$, $\text{{b, c}}$, $\text{{b, c}}$, $\text{{b, c}}$ and a confidence threshold of $60\%$. Then the confidence of $\text{{a}} \rightarrow \text{{b}}$ is $1\ /\ 1 = 100\%$ and the confidence of $\text{{b}} \rightarrow \text{{c}}$ is $3\ /\ 4 = 75\%$, but the confidence of $\text{{a}} \rightarrow \text{{c}}$ is $0\ /\ 1 = 0\%$.
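Both questions in 5.3.2 can be explored on small hypothetical datasets. The sketch below defines an illustrative `confidence` helper and three made-up transaction lists: two that show different orderings of $c_1$, $c_2$ and $c_3$, and one that checks whether confidence carries over from $\text{{a}} \rightarrow \text{{b}}$ and $\text{{b}} \rightarrow \text{{c}}$ to $\text{{a}} \rightarrow \text{{c}}$.

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent, computed from raw counts."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    return n_both / n_antecedent


def c1_c2_c3(transactions):
    """Confidences of a -> b, a -> {b, c} and {a, c} -> b."""
    return (
        confidence(transactions, "a", "b"),
        confidence(transactions, "a", "bc"),
        confidence(transactions, "ac", "b"),
    )


# Hypothetical datasets; each transaction is a set of single-letter items.
data_1 = [set("abc"), set("a"), set("ab"), set("a")]
data_2 = [set("abc"), set("ac"), set("ab"), set("ab")]
print(c1_c2_c3(data_1))  # c2 < c1 < c3
print(c1_c2_c3(data_2))  # c2 < c3 < c1

# Transitivity check with a confidence threshold of 60%.
chain = [set("ab"), set("bc"), set("bc"), set("bc")]
threshold = 0.60
for lhs, rhs in (("a", "b"), ("b", "c"), ("a", "c")):
    c = confidence(chain, lhs, rhs)
    print(f"{lhs} -> {rhs}: {c:.2f}", "(above)" if c > threshold else "(below)")
```

In every dataset $c_2$ is the smallest value, while $c_1$ and $c_3$ can appear in either order.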


#### 5.3.3

(3 points) Consider the relationships between customers who buy high-definition televisions and exercise machines as shown in Table 2 and 3.

1. Compute the odds ratios for both tables.
2. Compute the $\phi$-coefficient for both tables.
3. Compute the interest (or lift, in the book) factor for both tables.

For Table 3 you should compute these measures separately for College
Students and for Adults. For each of the measures, describe how the direction
of association changes when the data is pooled together (Table 2) instead of being
separated into two groups (Table 3).

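1. $\text{Odds ratio} = (\text{Both yes} \times \text{Both no}) / (\text{First yes second no} \times \text{First no second yes})$. For Table 2 the odds ratio is $(105 \times 62) / (87 \times 40) = 1.87$. For Table 3 the odds ratios are $(2 \times 20) / (9 \times 5) = 0.89$ for college students and $(103 \times 42) / (78 \times 35) = 1.58$ for working adults. The pooled data shows a positive association (odds ratio above $1$). When the data is separated into two groups, both odds ratios are lower: for working adults the association stays positive but becomes weaker, and for college students the odds ratio drops below $1$, so the direction of the association reverses.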

2. $\phi\text{-coefficient} = (\text{Both yes} \times \text{Both no} - \text{First yes second no} \times \text{First no second yes}) /$ $(\sqrt{\text{First yes} \times \text{First no} \times \text{Second yes} \times \text{Second no}})$, where the four factors under the square root are the row and column totals of the table. So the $\phi$-coefficient of Table 2 is $(105 \times 62 - 87 \times 40) /$ $(\sqrt{192 \times 102 \times 145 \times 149}) = 0.15$. The $\phi$-coefficients of Table 3 are $(2 \times 20 - 9 \times 5) / (\sqrt{11 \times 25 \times 7 \times 29}) = -0.02$ for college students and $(103 \times 42 - 78 \times 35) / (\sqrt{181 \times 77 \times 138 \times 120}) = 0.11$ for working adults. When the data is separated into two groups, both $\phi$-coefficients are lower than the pooled value. For college students the $\phi$-coefficient even goes below zero, so the association changes from weakly positive to weakly negative.

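3. $\text{Lift} = P(\text{both yes})\ /\ (P(\text{first yes}) \times P(\text{second yes})) = (N \times \text{Both yes})\ /\ (\text{First yes} \times \text{Second yes})$, where $N$ is the total number of customers in the table. So the lift of Table 2 is $(294 \times 105) / (192 \times 145) = 1.11$. The lifts of Table 3 are $(36 \times 2) / (11 \times 7) = 0.94$ for college students and $(258 \times 103) / (181 \times 138) = 1.06$ for working adults. When the data is separated into two groups, the lift for working adults stays slightly above $1$, but the lift for college students drops below $1$, so for that group the association turns from positive to slightly negative.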
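All three measures can be recomputed from the four cell counts of each table. The sketch below uses the cell counts that appear in the calculations above (both yes, first yes and second no, first no and second yes, both no) and takes lift to be $P(\text{both yes})$ divided by the product of the two marginal "yes" probabilities. The function name `measures` and the dictionary layout are illustrative choices.

```python
from math import sqrt


def measures(f11, f10, f01, f00):
    """Odds ratio, phi-coefficient and lift of a 2x2 contingency table.

    f11 = both yes, f10 = first yes / second no,
    f01 = first no / second yes, f00 = both no.
    """
    n = f11 + f10 + f01 + f00
    row_yes, row_no = f11 + f10, f01 + f00
    col_yes, col_no = f11 + f01, f10 + f00
    odds_ratio = (f11 * f00) / (f10 * f01)
    phi = (f11 * f00 - f10 * f01) / sqrt(row_yes * row_no * col_yes * col_no)
    lift = n * f11 / (row_yes * col_yes)
    return odds_ratio, phi, lift


# Cell counts as used in the calculations above.
tables = {
    "Table 2 (pooled)": (105, 87, 40, 62),
    "Table 3, college students": (2, 9, 5, 20),
    "Table 3, working adults": (103, 78, 35, 42),
}

for name, cells in tables.items():
    odds_ratio, phi, lift = measures(*cells)
    print(f"{name}: odds ratio = {odds_ratio:.2f}, "
          f"phi = {phi:.2f}, lift = {lift:.2f}")
```

An odds ratio or lift above 1, or a positive $\phi$, indicates a positive association; values below 1 (or a negative $\phi$) indicate a negative one.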