Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "Jake McCoy"
STUDENT_NUMBER = "s1106263"
COLLABORATOR_NAME = "Lavandier Theo"
COLLABORATOR_STUDENT_NUMBER = "s1103617"

---

## **Important:** When handing in your homework:
+ Hand in the notebook (and nothing else) **named as follows**: StudentName1_snumber_StudentName2_snumber.ipynb
+ Provide clear and complete answers to the questions below under a separate header (not hidden somewhere in your source code), and make sure to explain your answers / motivate your choices. Add Markdown cells where necessary.
+ Source code, output graphs, derivations, etc., should be included in the notebook.
+ Hand-in: upload to Brightspace.
+ Include name, student number, assignment (especially in filenames)!
+ When working in pairs only one of you should upload the assignment, and report the name of your partner in your filename.
+ Use the Brightspace discussion board or email the student assistants for questions on how to complete the exercises.
+ If you find mistakes, have suggestions, or would like to complain about the assignment material itself, please email me [Roel] at `Roel.Bouman@ru.nl`
+ Do not remove any cells in the notebook, else this might break the auto-grading system. **An invalid notebook will mean a severe reduction in your grade!**
+ Many online collaboration platforms remove metadata from notebooks, which breaks the auto-grading system. Again, **An invalid notebook will mean a severe reduction in your grade!**. Should you wish to use these platforms, copy the answers from the online notebook to one running on your own machine with Anaconda, and then execute all cells.
+ Only type your answers in those places where they are asked.
+ Remove any "raise NotImplementedError()" statements in the cells you answered.

# Assignment 5: Association mining

## Objective of this assignment
The overall objective is to understand how frequent itemsets can be extracted by
the Apriori algorithm and be able to calculate and interpret association rules in terms of support and confidence.


## Advised Reading and Exercise Material
**The following reading material is recommended:**

- Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, *Introduction to Data Mining*, section 6.


## Additional Tools
For this exercise you will need to load the provided *apriorimining.py* script. 


##  5.1 Association mining for course data 
We will use the Apriori algorithm to automatically mine for associations. The Apriori algorithm is an adapted version of the script found here: https://github.com/nalinaksh/Association-Rule-Mining-Python

Check out the script and doc and check if you understand how the association rules are computed. 




#### 5.1.1

(0.25 points) Look at the data file `data/courses.txt` by **using Python**. The data is represented in Table 1. Inspect the file `data/courses.txt` and make sure you understand how the data in Table 1 is stored in the text file. Explain how the table is a representation of the text file. What information that we would need for complete reconstruction of the table is missing from the text file?

##### Table 1
|#  |   History |Math| Biology| Spanish | Economics| Physics | Chemistry | English  |  
| :-------------: |:-------------:| :-----------:| :----------:| :------------:|:-------------:| :------------:|  :-------------: | :-------------: |
|student 1 | 0| 1 | 0 | 0 | 1| 1 |1 |1   
|student 2 | 1| 1 | 1 | 0 | 0| 1 |1 |1   
|student 3 | 0| 1 | 0 | 1 | 0| 1 |0 |1   
|student 4 | 0| 0 | 1 | 0 | 0| 1 |1 |0   
|student 5 | 0| 1 | 0 | 0 | 0| 1 |1 |0        
|student 6 | 0| 1 | 1 | 0 | 0| 1 |1 |1   


In [7]:
# Read the file and close it automatically when the block ends
with open('data/courses.txt', 'r') as f:
    lines = f.read().splitlines()
for line in lines:
    print(line)

2,5,6,7,8
1,2,3,6,7,8
2,4,6,8
3,6,7
2,6,7
2,3,6,7,8


Each row in the .txt file represents one unnamed student and lists the numbers of the classes that student takes, e.g. student 1 takes classes 2,5,6,7,8. A number in a row corresponds to a 1 in that column of Table 1, and an absent number corresponds to a 0. What is missing is which class each number represents: from the file alone there is no way to tell that class 2 is Math, for example.
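As a small sketch of this relationship, the 0/1 table can be rebuilt from the rows printed above. The `course_names` dictionary is an assumption: it takes the column order of Table 1 as the number-to-name mapping, which is exactly the information the text file does not contain.

```python
# Rows exactly as printed from data/courses.txt above.
rows = [
    "2,5,6,7,8",
    "1,2,3,6,7,8",
    "2,4,6,8",
    "3,6,7",
    "2,6,7",
    "2,3,6,7,8",
]

# Assumption: the column order of Table 1 gives the number -> name mapping.
course_names = {1: "History", 2: "Math", 3: "Biology", 4: "Spanish",
                5: "Economics", 6: "Physics", 7: "Chemistry", 8: "English"}

# One set of class numbers per student.
transactions = [{int(x) for x in row.split(",")} for row in rows]

# One 0/1 row per student, one column per class.
columns = sorted(course_names)
table = [[int(c in t) for c in columns] for t in transactions]

print("          " + " ".join(f"{course_names[c][:4]:>4}" for c in columns))
for i, bits in enumerate(table, start=1):
    print(f"student {i} " + " ".join(f"{b:>4}" for b in bits))
```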

#### 5.1.2
(0.75 points) We will analyze the data in Table 1 using the function `apriorimining.generate_association_rules()` from the toolbox. This function takes a string specifying the file path as input. Analyze the data with $ minsupport  \geq 80 \% $ and $ minconfidence \geq 100 \%$ (input as fractions into the function).
What are the generated association rules? What kind of conclusions can you make based on these association rules about the subjects that students took? 
You can optionally provide a dictionary `names`, translating the integers in the file to the names in the table, to the `generate_association_rules` function. 


In [9]:
from toolbox.apriorimining import generate_association_rules

generate_association_rules('data/courses.txt', 4/5, 1/1)

---------------TOP 10 FREQUENT 1-ITEMSET-------------------------
set= { 6 },  sup= 1.0
set= { 2 },  sup= 0.83
set= { 7 },  sup= 0.83
-----------------------------------------------------------------
-------TOP 10 (or less) FREQUENT 2-ITEMSET------------------------
set= { 2, 6 },  sup= 0.83
set= { 6, 7 },  sup= 0.83
------------------------------------------------------------------
---------------------ASSOCIATION RULES------------------
--------------------------------------------------------
Rule #1: {  } ==> { 6 }, sup= 1.00, conf= 1.00

Rule #2: { 2 } ==> { 6 }, sup= 0.83, conf= 1.00

Rule #3: { 7 } ==> { 6 }, sup= 0.83, conf= 1.00
--------------------------------------------------------


The generated association rules are as follows:

- Rule 1: {   } ==> { 6 }
- Rule 2: { 2 } ==> { 6 }
- Rule 3: { 7 } ==> { 6 }

We can conclude that every student in the data takes class 6 (Physics), regardless of which other classes they take. Rule 1 shows this directly: the empty set {   } is a subset of every itemset, so a support of 1 for {   } ==> { 6 } means that all six students take Physics. Rules 2 and 3 follow from that: everyone who takes class 2 (Math) or class 7 (Chemistry) also takes Physics, and each of these combinations occurs for 83% of the students.
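The support and confidence values reported for these rules can be checked by hand. The sketch below does not use the toolbox; `support` and `confidence` are hypothetical helper names that apply the textbook definitions to the six transactions printed earlier.

```python
# The six transactions from data/courses.txt, as printed earlier.
transactions = [
    {2, 5, 6, 7, 8},
    {1, 2, 3, 6, 7, 8},
    {2, 4, 6, 8},
    {3, 6, 7},
    {2, 6, 7},
    {2, 3, 6, 7, 8},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)

for lhs, rhs in [(set(), {6}), ({2}, {6}), ({7}, {6})]:
    print(sorted(lhs), "==>", sorted(rhs),
          "sup =", round(support(lhs | rhs), 2),
          "conf =", round(confidence(lhs, rhs), 2))
```

The empty antecedent is contained in every transaction, so its support is 1 and the confidence of Rule 1 equals the support of { 6 }.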

##  5.2 Association mining for MovieLens data 
  
  
In this part of the exercise we consider a Market Basket data set containing the purchases of 1682 movies by 943 users. A total of 100,000 movies
have been purchased. The data set is called MovieLens100K and is provided by http://www.grouplens.org/node/73, see also the readme `MovieLensData.txt` in the data folder. The data currently considered is not the original data but modified for the apriori algorithm.

#### 5.2.1
(0.25 points) The MovieLens data is stored in the files `data/MovieLensData.txt` and `data/u.item`. Inspect the files **using Python** to see how the data is stored. How do these files relate? 
  

In [17]:
# Print the transaction file: one line of movie IDs per user
with open('data/MovieLensData.txt', 'r') as f1:
    lines1 = f1.read().splitlines()
for line in lines1:
    print(line)

print(" ")
print("--------------------------")
print(" ")

# Print the movie description file: one '|'-separated record per movie
with open('data/u.item', 'r') as f2:
    lines2 = f2.read().splitlines()
for line in lines2:
    print(line)

1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272
1,10,13,14,19,25,50,

4,8,12,13,17,23,25,31,46,48,55,56,79,93,98,100,121,127,137,154,156,163,164,168,173,175,176,178,180,183,185,186,187,194,202,203,204,209,211,216,223,230,234,238,269,271,273,276,285,286,288,301,302,303,305,318,324,327,340,382,428,431,433,461,466,475,504,508,513,520,523,531,603,642,645,651,654,657,673,709,732,741,746,770,792,1065,1084
9,14,19,22,44,50,58,64,69,71,83,89,97,98,100,127,132,162,168,170,172,173,174,180,183,185,187,190,191,192,197,199,213,223,234,265,275,276,283,286,289,292,304,306,318,323,357,427,435,462,463,482,483,487,507,515,521,530,531,549,582,588,614,633,651,673,678,707,716,730,735,923,988,1039,1084
245,260,264,268,271,288,299,300,313,322,323,326,328,331,350,351,354,355,678,683,748,879
4,8,12,14,15,24,25,26,40,47,49,50,56,58,63,64,66,69,70,72,77,85,88,94,100,105,121,127,132,133,134,137,138,140,142,143,157,158,160,161,162,167,174,179,182,186,187,188,191,193,194,196,197,204,205,208,210,211,213,215,216,229,237,238,239,248,255,257,265,269,275,282,284,285,286,289,294,301,305,30

1,3,9,50,100,105,117,121,125,129,148,151,168,174,181,222,237,245,248,249,257,264,273,276,280,281,282,288,291,293,294,295,307,322,325,333,334,338,340,405,410,411,455,456,460,472,475,544,546,591,628,685,689,713,741,742,762,763,815,823,825,827,829,831,866,926,928,975,1001,1011,1012,1017,1028,1277
1,2,6,7,9,12,13,28,31,50,56,64,71,82,96,100,114,117,121,127,129,134,144,153,172,173,174,178,181,195,196,200,202,205,211,216,228,237,258,273,275,276,277,283,285,286,288,300,313,318,322,402,408,421,427,429,433,471,480,482,496,504,511,519,523,526,527,562,605,632,701,705,742,836,849,896,923,1011,1036,1149,1400,1478
5,56,98,185,200,217,218,219,245,260,288,299,323,324,325,327,332,333,447,558,559,561,563,567,672,678,682,773,788,816,876,948
237,245,258,262,269,272,286,288,289,292,294,300,302,303,313,315,321,322,325,340
1,7,8,11,15,24,25,28,29,38,41,56,63,64,67,69,71,72,79,82,91,94,95,96,99,105,111,118,121,125,132,138,143,154,155,158,168,174,195,204,210,217,222,227,228,229,230,237,240,255,257,274,278,288,

908|Half Baked (1998)|16-Jan-1998||http://us.imdb.com/M/title-exact?imdb-title-120693|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
909|Dangerous Beauty (1998)|23-Jan-1998||http://us.imdb.com/M/title-exact?imdb-title-118892|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
910|Nil By Mouth (1997)|06-Feb-1998||http://us.imdb.com/Title?Nil+By+Mouth+(1997)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
911|Twilight (1998)|30-Jan-1998||http://us.imdb.com/M/title-exact?imdb-title-119594|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|0|0|0
912|U.S. Marshalls (1998)|10-Mar-1998||http://us.imdb.com/Title?U.S.+Marshals+(1998)|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
913|Love and Death on Long Island (1997)|10-Mar-1998||http://us.imdb.com/Title?Love+and+Death+on+Long+Island+(1997)|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0
914|Wild Things (1998)|14-Mar-1998||http://us.imdb.com/Title?Wild+Things+(1998)|0|0|0|0|0|0|1|0|1|0|0|0|0|1|0|0|1|0|0
915|Primary Colors (1998)|20-Mar-1998||http://us.imdb.com/Title?Primary+Colors+(1998)|0|0|0|0|0|0|0|0|1|0|0|0

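`MovieLensData.txt` contains one line per user, listing the movies that user has purchased as comma-separated integers. Each integer is a movie ID that can be looked up in `u.item`, which stores one movie per line with fields such as the ID, title, release date and IMDb URL separated by `|`, e.g. movie 12 = Usual Suspects. Together the files describe the purchases of 943 users across the 1682 movies on offer.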
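A minimal sketch of how the two files link together, using two of the `u.item` lines printed above. The `basket` string is a hypothetical transaction line made up for illustration, and `titles` is not the toolbox's own lookup.

```python
# Two records copied from the u.item output above.
sample_items = [
    "908|Half Baked (1998)|16-Jan-1998||http://us.imdb.com/M/title-exact?imdb-title-120693|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0",
    "909|Dangerous Beauty (1998)|23-Jan-1998||http://us.imdb.com/M/title-exact?imdb-title-118892|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0",
]

# The first two '|'-separated fields are the movie ID and the title.
titles = {}
for record in sample_items:
    movie_id, title = record.split("|")[:2]
    titles[int(movie_id)] = title

# Hypothetical transaction line in the MovieLensData.txt format.
basket = "908,909"
print([titles[int(i)] for i in basket.split(",")])
```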

#### 5.2.2 
(0.75 point) Find association rules using the function below with $ minsupport  \geq 30 \% $ and $ minconfidence \geq 80 \%$. You can use the `max_rules` argument to set a maximum number of potential rules to print. Use the `movie_names` dict as the optional `names` argument in `generate_association_rules` to interpret the results. What are the associations with strongest confidence? Do these associations make sense? Explain.

In [18]:
from toolbox.load_movie_names import load_movie_names
movie_names = load_movie_names()

generate_association_rules('data/MovieLensData.txt', 3/10, 4/5, max_rules = 5, names = movie_names)

---------------TOP 10 FREQUENT 1-ITEMSET-------------------------
set= { Star Wars (1977) },  sup= 0.62
set= { Fargo (1996) },  sup= 0.54
set= { Return of the Jedi (1983) },  sup= 0.54
set= { Contact (1997) },  sup= 0.54
set= { English Patient, The (1996) },  sup= 0.51
set= { Scream (1996) },  sup= 0.51
set= { Liar Liar (1997) },  sup= 0.51
set= { Toy Story (1995) },  sup= 0.48
set= { Air Force One (1997) },  sup= 0.46
set= { Independence Day (ID4) (1996) },  sup= 0.45
-----------------------------------------------------------------
-------TOP 10 (or less) FREQUENT 2-ITEMSET------------------------
set= { Return of the Jedi (1983), Star Wars (1977) },  sup= 0.51
set= { Fargo (1996), Star Wars (1977) },  sup= 0.42
set= { Star Wars (1977), Toy Story (1995) },  sup= 0.4
set= { Raiders of the Lost Ark (1981), Star Wars (1977) },  sup= 0.4
set= { Independence Day (ID4) (1996), Star Wars (1977) },  sup= 0.38
set= { Godfather, The (1972), Star Wars (1977) },  sup= 0.38
set= { Fargo (1996), R

The strongest associations are rules 73, 51, 33, 56 and 57.

Rule 73: { Empire Strikes Back, The (1980), Raiders of the Lost Ark (1981), Return of the Jedi (1983) } ==> { Star Wars (1977) }
Rule 51: { Empire Strikes Back, The (1980), Return of the Jedi (1983) } ==> { Star Wars (1977) }

Rule 33: { Pulp Fiction (1994), Return of the Jedi (1983) } ==> { Star Wars (1977) }

Rule 56: { Raiders of the Lost Ark (1981), Return of the Jedi (1983) } ==> { Star Wars (1977) }

Rule 57: { Return of the Jedi (1983), Toy Story (1995) } ==> { Star Wars (1977) }

These all have a confidence in the range 0.98 to 1.0, and every one of them has Star Wars (1977) as its consequent. The rules make sense because each antecedent contains either a movie from the Star Wars series or another Lucasfilm movie (Raiders of the Lost Ark (1981)). It seems that someone who buys a Lucasfilm movie will also pick up Star Wars (1977).

#### 5.2.3 
(1 point) Which movies have been watched by the most users? There are only few rules with more than three items. Why?

Star Wars (1977) is the most watched movie, 62% of the users watched it.
Fargo (1996) is the second most watched movie, 54% of the users watched it.
Return of the Jedi (1983) is the third most watched movie, 54% of the users watched it.

There are only a few rules with more than three items because the support of an itemset can only stay the same or decrease as items are added to it. With 1682 different movies there is an enormous number of possible combinations, so the chance that one specific larger set appears in many users' purchases drops quickly as the set grows. Few itemsets with four or more movies therefore reach the 30% minimum support. This is why rules with fewer items are 'stronger' (have a higher support): they appear more frequently in the data.
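The effect can be illustrated on the small course data set from 5.1 (not on the MovieLens data): as items are added to an itemset, its support never goes up. This is a sketch with a hypothetical `support` helper, not output of the toolbox.

```python
# The six transactions from data/courses.txt (section 5.1).
transactions = [
    {2, 5, 6, 7, 8},
    {1, 2, 3, 6, 7, 8},
    {2, 4, 6, 8},
    {3, 6, 7},
    {2, 6, 7},
    {2, 3, 6, 7, 8},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# A chain of itemsets, each one extending the previous by one item.
chain = [{6}, {6, 7}, {2, 6, 7}, {2, 6, 7, 8}]
supports = [support(s) for s in chain]

for itemset, sup in zip(chain, supports):
    print(sorted(itemset), round(sup, 2))
```

With a minimum support of 80%, only the first two itemsets in this chain would survive.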

#### 5.2.4
(0.5 points) Often we are interested in rules with high confidence. Is it possible for itemsets to have very low support but still have corresponding rule with a very high confidence?

Yes, this is possible. An itemset can appear only rarely in the data, yet whenever its antecedent appears, the consequent appears with it. This results in an association rule with a low support and a high confidence.
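A made-up example makes this concrete. The transactions below are invented for illustration: items `x` and `y` occur together in 2 of 100 baskets and never apart.

```python
# Invented data: x and y appear together twice, everything else is just z.
transactions = [{"x", "y"}] * 2 + [{"z"}] * 98

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

rule_support = support({"x", "y"})
rule_confidence = support({"x", "y"}) / support({"x"})

print("support    of {x} ==> {y}:", rule_support)
print("confidence of {x} ==> {y}:", rule_confidence)
```

The rule would be discarded by any reasonable minimum support even though its confidence is perfect.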

## 5.3 Calculating support, confidence and interest

Calculate these measures and write down how you computed things, not just the answers. You can use Latex syntax for writing your answers.


#### 5.3.1
 Suppose we have market basket data consisting of 100 transactions and 20 items. The support for item $ \text{a} = 45 \%$, the support for item $ \text{b} = 80 \%$ and the support for itemset $ \text{ {a,b }} = 30 \%$. Let the support and confidence thresholds be 20$ \%$ and 60$ \%$, respectively.
  
1. (0.5 points) Compute the confidence of the association rule $ \text{ {a } } \rightarrow   \text{{b }} $. Is the rule interesting according to the confidence measure?

2. (0.5 points) Compute the interest measure (or lift, see slide 46 of chapter 6) for the association rule $ \text{ {a } } \rightarrow   \text{{b }} $. Describe the nature of the relationship between item $ \text{a}$ and item $ \text{b}$  in terms of the interest measure.

3. (1 points) What conclusion can you draw from the results of parts (1) and (2)?

1)
$$
\text{Confidence of {a}} \longrightarrow \text{ {b}} = \frac{\sigma(a,b)}{\sigma(a)} = \frac{30}{45} = \frac{2}{3} \approx 0.67
$$
2)
$$
\text{Interest Measure (Lift) of {a}} \longrightarrow \text{ {b}} = \frac{P(b \mid a)}{P(b)} = \frac{P(a \cap b)}{P(a)} \cdot \frac{1}{P(b)} = \frac{0.3}{0.45} \cdot \frac{1}{0.8} = \frac{5}{6} \approx 0.83
$$
3)  <br>
We have a reasonably high confidence: 67% of the time that a appears, b also appears. This is above the 60% confidence threshold, so the rule counts as interesting according to the confidence measure. However, the interest measure is below 1, meaning we see the combination less often than we would expect if a and b were independent (b on its own already appears in 80% of the transactions), so a and b are negatively correlated. This may mean the rule isn't as good as we'd hope it to be, and there may be a better association rule.
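The arithmetic for parts (1) and (2) can be reproduced in a few lines. The variable names are chosen for this sketch; the numbers are the supports and thresholds given in the exercise.

```python
# Supports and thresholds as given in the exercise.
sup_a, sup_b, sup_ab = 0.45, 0.80, 0.30
min_support, min_confidence = 0.20, 0.60

confidence = sup_ab / sup_a   # P(b | a)
lift = confidence / sup_b     # P(b | a) / P(b)

print("confidence:", round(confidence, 2))
print("lift:      ", round(lift, 2))
print("passes thresholds:", sup_ab >= min_support and confidence >= min_confidence)
print("negatively correlated:", lift < 1)
```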

#### 5.3.2

1. (1 points) Let $c_1$, $c_2$, and $c_3$ be the confidence values of the rules $ \text{ {a } } \rightarrow   \text{{b }} $, $ \text{ {a } } \rightarrow   \text{{b,c }} $, and $ \text{ {a,c } } \rightarrow   \text{{b }} $ respectively. If we assume that $c_1$, $c_2$, and $c_3$ have different values, what are the possible inequality relationships (e.g. $c_1 \leq c_2 \leq c_3$) among $c_1$, $c_2$, and $c_3$? Which rule has the lowest confidence?
2. (0.5 points) Suppose the confidence of the rules  $ \text{ {a } } \rightarrow   \text{{b }} $ and $ \text{ {b } } \rightarrow   \text{{c }} $ are larger than the confidence threshold. Is it possible that $ \text{ {a } } \rightarrow   \text{{c }} $ has a confidence below that threshold? If no, explain why. If yes, give an example. 

1)
$$
\begin{aligned}
    c_1 &= \frac{\sigma(a,b)}{\sigma(a)}\\
    c_2 &= \frac{\sigma(a,b,c)}{\sigma(a)}\\
    c_3 &= \frac{\sigma(a,b,c)}{\sigma(a,c)}
\end{aligned}
$$
$c_1$ will be higher than $c_2$. The two share the denominator $\sigma(a)$, and the only difference is that $c_2$ has a smaller numerator: every transaction that contains (abc) also contains (ab), so $\sigma(a,b,c) \leq \sigma(a,b)$. This leads us to our first inequality:

$$
\text{Inequality 1:} \quad c_1 > c_2
$$

We can now compare $c_3$ and $c_2$. They share the numerator $\sigma(a,b,c)$, and $c_3$ has the smaller denominator (a smaller denominator gives a larger number): the denominator of $c_3$ only counts transactions in which both (a) and (c) are present, whilst the denominator of $c_2$ counts every transaction in which (a) is present. Therefore we get the inequality:

$$
\text{Inequality 2:} \quad c_3 > c_2
$$

Inequalities 1 and 2 show that $c_2$ is always on the smaller side, so the rule {a} -> {b,c} has the lowest confidence of the three. Nothing fixes the order of $c_1$ and $c_3$, so the possible relationships are $c_2 < c_1 < c_3$ and $c_2 < c_3 < c_1$.
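As a sanity check on this argument (not a proof), the sketch below draws many small random data sets over the items a, b and c and confirms that $c_2$ never exceeds $c_1$ or $c_3$. The data sizes, the seed and the helper name `conf` are arbitrary choices for this illustration.

```python
import random

def conf(transactions, lhs, rhs):
    """Confidence of lhs -> rhs, or None if lhs never occurs."""
    lhs, both = set(lhs), set(lhs) | set(rhs)
    n_lhs = sum(lhs <= t for t in transactions)
    n_both = sum(both <= t for t in transactions)
    return n_both / n_lhs if n_lhs else None

random.seed(0)
orderings = set()
for _ in range(2000):
    # 12 random transactions; each item is included with probability 0.5.
    data = [{i for i in "abc" if random.random() < 0.5} for _ in range(12)]
    c1 = conf(data, "a", "b")
    c2 = conf(data, "a", "bc")
    c3 = conf(data, "ac", "b")
    if None in (c1, c2, c3):
        continue  # skip data sets where an antecedent never occurs
    assert c2 <= c1 and c2 <= c3
    if len({c1, c2, c3}) == 3:
        orderings.add("c2 < c1 < c3" if c1 < c3 else "c2 < c3 < c1")

print(sorted(orderings))
```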

2)
$$
\begin{aligned}
    \text{Rule 1 confidence} &= \frac{\sigma(a,b)}{\sigma(a)}\\
    \text{Rule 2 confidence} &= \frac{\sigma(b,c)}{\sigma(b)}\\
    \text{Rule 3 confidence} &= \frac{\sigma(a,c)}{\sigma(a)}
\end{aligned}
$$

Yes, it is absolutely possible that Rule 3 has a confidence below the given threshold, since there may be few transactions that contain (a) and (c) together even though each of them occurs in another rule that does hold (Rule 1 and Rule 2). In the data set below, Rule 3 has a confidence lower than the threshold.

| a | b | c |
| :-: | :-: | :-: |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 0 | 1 | 1 |
| 0 | 1 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 0 | 1 |
| 0 | 0 | 1 |
 
$$
\begin{aligned}
    \text{Rule 1 Confidence} &= \frac{\sigma(a,b)}{\sigma(a)} = \frac{2}{5} = 0.4\\
    \text{Rule 2 Confidence} &= \frac{\sigma(b,c)}{\sigma(b)} = \frac{3}{5} = 0.6\\
    \text{Rule 3 Confidence} &= \frac{\sigma(a,c)}{\sigma(a)} = \frac{1}{5} = 0.2
\end{aligned}
$$

If our threshold is 0.3, then {a} -> {c} clearly has a confidence below the threshold, whilst Rule 1 and Rule 2 are both above the threshold.
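The three confidences can be recomputed from the ten rows of the example data set. The helper name `confidence` is chosen for this sketch.

```python
# The ten rows of the example data set, as (a, b, c) indicator triples.
rows = [
    (1, 1, 0), (1, 1, 0), (0, 1, 1), (0, 1, 1), (0, 1, 1),
    (1, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1), (0, 0, 1),
]

# Convert each row into the set of items it contains.
transactions = [{name for name, bit in zip("abc", row) if bit} for row in rows]

def confidence(lhs, rhs):
    """Count of transactions with lhs and rhs, divided by count with lhs."""
    lhs, both = set(lhs), set(lhs) | set(rhs)
    n_lhs = sum(lhs <= t for t in transactions)
    n_both = sum(both <= t for t in transactions)
    return n_both / n_lhs

threshold = 0.3
for lhs, rhs in [("a", "b"), ("b", "c"), ("a", "c")]:
    c = confidence(lhs, rhs)
    print(f"{{{lhs}}} -> {{{rhs}}}: conf = {c:.1f}, above threshold: {c > threshold}")
```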


#### 5.3.3

(3 points) Consider the relationships between customers who buy high-definition televisions and exercise machines as shown in Table 2 and 3.

1. Compute the odds ratios for both tables.
2. Compute the $\phi$-coefficient for both tables.
3. Compute the interest (or lift, in the book) factor for both tables.

For Table 3 you should compute measures given above separately for College
Students and for Adults. **For each of the measures, describe how the direction
of association changes when data is pooled together** (Table 2) instead of being
separated into two groups (Table 3)

##### Table 2: Two way contingency table between the sale of high-definition television and exercise machine
| |   Buy Exercise machine |     |     |
| :------------- | -------------:| :-----------:| :----------:| 
| **Buy HDTV** | yes | no | total |
| yes  | 105| 87 | 192 | 
| no | 40| 62 | 102 |   
| total | 145 | 149 | 294 | 
 

##### Table 3: Example of three-way contingency table
| | |   Buy Exercise machine |     |     |
|--- | :------------- | -------------:| :-----------:| :----------:| 
|**Customer group** | **Buy HDTV** | yes | no | total |
|College students | yes  | 2| 9 | 11 | 
| | no | 5| 20 | 25 |
| Working adults | yes  | 103| 78 | 181 | 
| | no | 35| 42 | 77 |  



Let's define : 

- X : *Buy HDTV* 
- Y : *Exercise Machine* 

### Table 2: 


P(X) = 192/294   
P(Y) = 145/294   
P(X,Y) = 105/294  
$$
\begin{aligned}
    \text{odds ratio} &= \frac{105 \cdot 62}{87\cdot40} = \frac{6510}{3480} = 1.87\\
    \phi\text{-coefficient} &= \frac{P(X,Y) - P(X)P(Y)}{\sqrt{P(X)[1-P(X)]P(Y)[1-P(Y)]}}\\
    &= \frac{\frac{105}{294} - \frac{192}{294}\cdot\frac{145}{294}}{\sqrt{\frac{192}{294}\cdot\frac{102}{294}\cdot\frac{145}{294}\cdot\frac{149}{294}}} = 0.147\\
    \text{Lift} &= \frac{P(Y|X)}{P(Y)} = \frac{\frac{105}{192}}{\frac{145}{294}} = 1.109
\end{aligned}
$$

### Table 3:
#### College students 
P(X) = 11/36  
P(Y) = 7/36  
P(X,Y) = 2/36  
$$
\begin{aligned}
    \text{odds ratio} &= \frac{2\cdot20}{9\cdot5} = \frac{40}{45} = 0.889\\
    \phi\text{-coefficient} &= \frac{P(X,Y) - P(X)P(Y)}{\sqrt{P(X)[1-P(X)]P(Y)[1-P(Y)]}}\\
    &= \frac{\frac{2}{36} - \frac{11}{36}\cdot\frac{7}{36}}{\sqrt{\frac{11}{36}\cdot\frac{25}{36}\cdot\frac{7}{36}\cdot\frac{29}{36}}} = -0.021\\
    \text{Lift} &= \frac{P(Y|X)}{P(Y)} = \frac{\frac{2}{11}}{\frac{7}{36}} = 0.935
\end{aligned}
$$


#### Working adults  
P(X) = 181/258  
P(Y) = 138/258  
P(X,Y) = 103/258  

$$
\begin{aligned}
    \text{odds ratio} &= \frac{103\cdot42}{78\cdot35} = \frac{4326}{2730} = 1.58\\
    \phi\text{-coefficient} &= \frac{P(X,Y) - P(X)P(Y)}{\sqrt{P(X)[1-P(X)]P(Y)[1-P(Y)]}}\\
    &= \frac{\frac{103}{258} - \frac{181}{258}\cdot\frac{138}{258}}{\sqrt{\frac{181}{258}\cdot\frac{77}{258}\cdot\frac{138}{258}\cdot\frac{120}{258}}} = 0.105\\
    \text{Lift} &= \frac{P(Y|X)}{P(Y)} = \frac{\frac{103}{181}}{\frac{138}{258}} = 1.064
\end{aligned}
$$
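The three measures can be recomputed for all tables with one small function. This is a sketch: `measures` and the cell names `f11`, `f10`, `f01`, `f00` (HDTV yes/no crossed with exercise machine yes/no) are names chosen here, with the counts taken from Tables 2 and 3.

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Odds ratio, phi-coefficient and lift of a 2x2 contingency table.

    f11: HDTV yes, exercise machine yes   f10: HDTV yes, exercise machine no
    f01: HDTV no,  exercise machine yes   f00: HDTV no,  exercise machine no
    """
    n = f11 + f10 + f01 + f00
    p_x = (f11 + f10) / n   # P(X): buys HDTV
    p_y = (f11 + f01) / n   # P(Y): buys exercise machine
    p_xy = f11 / n          # P(X,Y): buys both
    odds_ratio = (f11 * f00) / (f10 * f01)
    phi = (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    lift = p_xy / (p_x * p_y)
    return odds_ratio, phi, lift

tables = {
    "pooled (Table 2)": (105, 87, 40, 62),
    "college students": (2, 9, 5, 20),
    "working adults": (103, 78, 35, 42),
}

results = {name: measures(*cells) for name, cells in tables.items()}
for name, (odds_ratio, phi, lift) in results.items():
    print(f"{name:18s} odds ratio = {odds_ratio:.3f}  phi = {phi:.3f}  lift = {lift:.3f}")
```

Lift is written here as P(X,Y) / (P(X)P(Y)), which is algebraically the same as P(Y|X) / P(Y).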





The first thing we can observe is that every measure (odds ratio, 𝜙-coefficient and lift) is higher in Table 2 than in either group of Table 3, sometimes substantially: the pooled odds ratio (1.87) is about double that of the college students (0.889). So combining the data into one table gives stronger association values than either group shows on its own.

The values for the working adults are close to those of Table 2, because working adults make up most of the data (258 of the 294 customers). The values for the college students, who are a small minority (36 customers), are lower than those of Table 2 and not similar to them.

Table 2 therefore mostly reflects the working adults, and looking at the college students separately leads to a very different interpretation, even though their data is part of Table 2. The direction of the association changes: for the college students the 𝜙-coefficient is negative (-0.021), the odds ratio is below 1 (0.889) and the lift is below 1 (0.935), which all indicate a weak negative association between buying an HDTV and buying an exercise machine. For the working adults and for the pooled data all three measures indicate a positive association (𝜙 above 0, odds ratio and lift above 1), even if it is not strong. Pooling the data thus hides the negative association within the college student group.

