# Association Rule Mining

Association Rule Mining (ARM) is an analysis tool for finding out whether certain items in a dataset 'go together', i.e. whether they often appear together in the same observation. ARM is performed on 'transaction datasets', datasets where each observation is treated as a group of items. Transaction datasets are similar to word-document matrices, but with some key distinctions. In transaction datasets:

    1. Order of items doesn't matter.
    2. Frequency of items also doesn't matter, beyond whether it is 0: an item is either present or absent.

In transaction data, each document is considered a set of words, where either a word is present in the document or it isn't. How often a word appears isn't relevant in this data. To perform ARM analysis on my text data, I'll first have to convert it to a transactional dataset.
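
As a minimal sketch of that idea (using made-up word counts, not rows from my actual dataset), here is how a row of counts collapses into a transaction. The names `doc_counts` and `to_transaction` are hypothetical and only there for illustration:

```python
# Toy word counts for two hypothetical documents (not taken from the real data).
doc_counts = [
    {"ban": 2, "care": 1, "ohio": 0},
    {"ban": 0, "care": 3, "ohio": 1},
]


def to_transaction(counts):
    """Keep only the terms that occur at least once; order and frequency are dropped."""
    return frozenset(term for term, n in counts.items() if n > 0)


transactions = [to_transaction(c) for c in doc_counts]
for t in transactions:
    print(sorted(t))
```

A set is a natural fit here because it has no order and no duplicates, which are exactly the two properties listed above.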

# Converting my data to transactional data

To start, I'm going to transform my word-document matrix (WDM) into a transactional data format, so I can perform ARM on it. I'm going to perform this analysis on the title dataset from WorldNewsAPI, as I think there could be some interesting relationships revealed by the small batch of words contained within the article titles.

Here's my Python code for doing this. Since I can't support Python and R in one notebook, I'll just display it, but I won't run it here. You can see the result of the code in the `transactions` object being read in afterwards.

### Python Code for creating transaction data
```python
import pandas as pd

wdm = pd.read_csv('data/wdms/count/worldnewsapi/lemmed/title.csv',index_col=0)

basket_df = pd.DataFrame(columns=['transactions'])
for i in range(len(wdm)):
    terms=wdm.iloc[i,4:].astype(int)
    # Keep only the terms with a count above 0, i.e. terms that appear at least once in the document.
    present = terms > 0
    terms = terms[present]
    # Join the remaining term names into one space-separated string to get the basket of terms.
    basket = ' '.join(list(terms.index))
    basket_df.loc[len(basket_df.index)] = basket

basket_df.to_csv('data/transaction/worldnews_title.csv')
```


In [33]:
library(arules)
#Reading the transaction data in
transactions <- arules::read.transactions('../data/transaction/worldnews_title.csv',format='basket')
#Learning some rules about the transactions data
FirstRule <- arules::apriori(transactions, parameter = list(support=0.007,
                                                    confidence =0.5,
                                                    minlen=2))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5   0.007      2
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 21 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[3913 item(s), 3102 transaction(s)] done [0.02s].
sorting and recoding items ... [154 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [20 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


In [34]:
inspect(FirstRule)

SortedRules1 <- sort(FirstRule, by="support", decreasing=TRUE)
(summary(SortedRules1))

     lhs                       rhs        support     confidence lift     count
[1]  {hate}                 => {crime}    0.008704062 0.8709677  67.54355 27   
[2]  {crime}                => {hate}     0.008704062 0.6750000  67.54355 27   
[3]  {murder}               => {ghey}     0.007736944 0.5454545  26.43750 24   
[4]  {accused}              => {ghey}     0.007414571 0.5609756  27.18979 23   
[5]  {brianna}              => {ghey}     0.012250161 0.7307692  35.41947 38   
[6]  {ghey}                 => {brianna}  0.012250161 0.5937500  35.41947 38   
[7]  {veto}                 => {governor} 0.008059317 0.5555556  33.14103 25   
[8]  {veto}                 => {ohio}     0.009026435 0.6222222  28.80796 28   
[9]  {veto}                 => {care}     0.008059317 0.5555556  12.67157 25   
[10] {veto}                 => {ban}      0.007736944 0.5333333  11.98841 24   
[11] {bathroom}             => {student}  0.008381689 0.5098039  13.51634 26   
[12] {high}                 => {school} 

set of 20 rules

rule length distribution (lhs + rhs):sizes
 2  3 
18  2 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0     2.0     2.0     2.1     2.0     3.0 

summary of quality measures:
    support           confidence          lift           count      
 Min.   :0.007415   Min.   :0.5000   Min.   :11.40   Min.   :23.00  
 1st Qu.:0.008301   1st Qu.:0.5142   1st Qu.:11.91   1st Qu.:25.75  
 Median :0.009188   Median :0.5583   Median :15.61   Median :28.50  
 Mean   :0.011348   Mean   :0.6124   Mean   :24.27   Mean   :35.20  
 3rd Qu.:0.012250   3rd Qu.:0.6687   3rd Qu.:29.89   3rd Qu.:38.00  
 Max.   :0.022244   Max.   :1.0000   Max.   :67.54   Max.   :69.00  

mining info:
         data ntransactions support confidence
 transactions          3102   0.007        0.5

# First Thoughts

Association Rule Mining has found some common themes in these titles. First, it picks out the Brianna Ghey killing as a distinct set of stories, along with the words making up those headlines. It also reveals a common headline about pushes to 'ban gender-affirming care', and the Ohio bill that was vetoed by the Ohio governor. We can see the Supreme Court getting shoutouts in some of these stories too, and of course there's the high-school bathroom bill, which is the subject of some stories and is probably responsible for the pairs 'high->school' and 'bathroom->student'.

Let's dig a bit deeper and apply ARM to the article text.

#### Now, let's apply the same process to the text.

In [39]:
#Reading the transaction data in
transactions <- arules::read.transactions('../data/transaction/worldnews_text.csv',format='basket')
#Learning some rules about the transactions data
FirstRule <- arules::apriori(transactions, parameter = list(support=0.07,
                                                    confidence=0.95,
                                                    minlen=2))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
       0.95    0.1    1 none FALSE            TRUE       5    0.07      2
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 217 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[12206 item(s), 3102 transaction(s)] done [0.26s].
sorting and recoding items ... [769 item(s)] done [0.02s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 done [1.85s].
writing ... [10185 rule(s)] done [0.03s].
creating S4 object  ... done [0.01s].


In [46]:
SortedRules <- sort(FirstRule, by="support", decreasing=TRUE)
inspect(SortedRules[1:30])
(summary(SortedRules))

     lhs                                    rhs           support   confidence
[1]  {identity,state}                    => {gender}      0.2159897 0.9517045 
[2]  {added,gender}                      => {transgender} 0.1708575 0.9532374 
[3]  {identity,people,state}             => {gender}      0.1698904 0.9547101 
[4]  {change,told}                       => {transgender} 0.1682785 0.9508197 
[5]  {identity,right,state}              => {gender}      0.1592521 0.9648438 
[6]  {identity,law,people}               => {gender}      0.1579626 0.9533074 
[7]  {added,people,year}                 => {transgender} 0.1566731 0.9548134 
[8]  {identity,state,year}               => {gender}      0.1560284 0.9546351 
[9]  {life,people,told}                  => {transgender} 0.1534494 0.9558233 
[10] {identity,people,state,transgender} => {gender}      0.1521599 0.9516129 
[11] {care,identity}                     => {gender}      0.1476467 0.9521830 
[12] {identity,law,state}                => {gender}

set of 30 rules

rule length distribution (lhs + rhs):sizes
 3  4  5 
 7 18  5 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   4.000   4.000   3.933   4.000   5.000 

summary of quality measures:
    support         confidence          lift           count      
 Min.   :0.1322   Min.   :0.9508   Min.   :1.133   Min.   :410.0  
 1st Qu.:0.1387   1st Qu.:0.9532   1st Qu.:1.139   1st Qu.:430.2  
 Median :0.1418   Median :0.9548   Median :1.354   Median :440.0  
 Mean   :0.1481   Mean   :0.9563   Mean   :1.349   Mean   :459.5  
 3rd Qu.:0.1554   3rd Qu.:0.9576   3rd Qu.:1.555   3rd Qu.:482.0  
 Max.   :0.2160   Max.   :0.9720   Max.   :1.574   Max.   :670.0  

mining info:
         data ntransactions support confidence
 transactions          3102    0.07       0.95

# Takeaways

It looks like the most common rules in the dataset have to do with the associations with 'gender' and 'transgender'. This tracks, given the overall topic of the dataset is ***transgender***. However, these connections also give us a sense of what people in the news associate **transgender** with. Some of the associated words are:

    added, people, year, life, told, policy, just, want

while the associations with **gender** include

    identity, policy, law, government, right, child, state

Frankly, these sets of words convey the connotations of **transgender** and **gender** better than any summary I could write; summarizing them risks obscuring them rather than clarifying them. I guess I'll try anyway: the **transgender** associations seem to relate to a recent push by transgender people to change policy this year. There are time-related words, words signifying desire, and action words like `told` and `added`. Meanwhile, the words associated with **gender** indicate that gender is somewhat dictated by government and law, but is also considered a personal identity and a right.

One issue with these rules is that while they are prevalent, they aren't very strong. `Lift` measures how strongly the terms in a rule are associated, i.e. how much the presence of the left-hand terms raises the probability of the right-hand term occurring, relative to how often that term occurs overall. Lift is commonly used to check the validity of associations in ARM: a high lift indicates that the terms truly are linked to one another, while a lift of 1 indicates that the terms appear together no more often than they would if they were independent. While the rules I inspected so far have a lot of support, they don't have much lift, with lift scores ranging between `1.1` and `1.6`. So...

#### ...let's look at the same rules, sorted by lift.
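
Before sorting, here is a minimal Python sketch of how support, confidence, and lift relate to raw counts. It uses the `{hate} => {crime}` rule from the title analysis as a worked example. The function name `rule_metrics` is hypothetical, and the individual counts for `hate` (31) and `crime` (40) are an assumption: they aren't printed anywhere, I back-calculated them from the reported count of 27 and the two confidences (27 / 0.8709677 ≈ 31 and 27 / 0.675 = 40).

```python
def rule_metrics(n_transactions, count_lhs, count_rhs, count_both):
    """Return (support, confidence, lift) for a rule lhs => rhs from raw counts."""
    support = count_both / n_transactions       # P(lhs and rhs)
    confidence = count_both / count_lhs         # P(rhs | lhs)
    lift = confidence / (count_rhs / n_transactions)  # P(rhs | lhs) / P(rhs)
    return support, confidence, lift


# {hate} => {crime}: 27 titles contain both words, out of 3102 titles.
# count_lhs=31 and count_rhs=40 are back-calculated assumptions (see the note above).
support, confidence, lift = rule_metrics(3102, 31, 40, 27)
print(support, confidence, lift)
```

Under those assumed counts, the three values line up with the `0.0087`, `0.871`, and `67.5` reported for that rule. A lift of about 67 says 'crime' is roughly 67 times more likely in a title that contains 'hate' than in a title picked at random.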

In [48]:
SortedRules <- sort(FirstRule, by="lift", decreasing=TRUE)
inspect(SortedRules[1:30])
(summary(SortedRules))

     lhs                              rhs       support    confidence lift    
[1]  {click,fox}                   => {app}     0.07672469 0.9875519  8.169029
[2]  {click,fox,news}              => {app}     0.07672469 0.9875519  8.169029
[3]  {app,fox}                     => {click}   0.07672469 0.9596774  8.089455
[4]  {app,fox,news}                => {click}   0.07672469 0.9596774  8.089455
[5]  {blocker,genderaffirming}     => {puberty} 0.07350097 0.9870130  6.542125
[6]  {blocker,care,medical}        => {puberty} 0.07188910 0.9867257  6.540220
[7]  {blocker,minor}               => {puberty} 0.07898130 0.9839357  6.521728
[8]  {blocker,law}                 => {puberty} 0.07640232 0.9834025  6.518193
[9]  {blocker,child,minor}         => {puberty} 0.07188910 0.9823789  6.511409
[10] {blocker,hormone,minor}       => {puberty} 0.07156673 0.9823009  6.510892
[11] {blocker,surgery}             => {puberty} 0.07543520 0.9750000  6.462500
[12] {blocker,child,medical}       => {puberty} 0.07

set of 10185 rules

rule length distribution (lhs + rhs):sizes
   2    3    4    5    6    7    8 
   6  268 2081 4458 2868  494   10 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   5.000   5.000   5.123   6.000   8.000 

summary of quality measures:
    support          confidence          lift           count      
 Min.   :0.07028   Min.   :0.9500   Min.   :1.132   Min.   :218.0  
 1st Qu.:0.07253   1st Qu.:0.9530   1st Qu.:1.145   1st Qu.:225.0  
 Median :0.07544   Median :0.9568   Median :1.339   Median :234.0  
 Mean   :0.07852   Mean   :0.9587   Mean   :1.377   Mean   :243.6  
 3rd Qu.:0.08124   3rd Qu.:0.9623   3rd Qu.:1.355   3rd Qu.:252.0  
 Max.   :0.21599   Max.   :1.0000   Max.   :8.169   Max.   :670.0  

mining info:
         data ntransactions support confidence
 transactions          3102    0.07       0.95

#### We found a new association!

It seems that the associations with the greatest lift involve the word `puberty`. Well, the second greatest: it seems that `Fox News` puts an advertisement for its app in the text of all of its news stories, and that snagged the number one spot.

But it appears that `puberty` only occurs in the discussion of a particular topic:

    hormone, blocker, surgery, genderaffirming, treatment, medical, care, child, minor

Clearly, gender-affirming care *during puberty* is a main subject of a cluster of stories relating to transgender rights, but only that subset of topics.
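
A word list like the one above can be pulled together by grouping the left-hand-side words of every rule by its right-hand side. Here's a rough Python sketch of that bookkeeping. It only uses the rule rows that are fully visible in the lift-sorted output, so its result is a subset of my list (and it includes `law`, which I left out). The helper name `lhs_words_by_rhs` is hypothetical.

```python
from collections import defaultdict

# (lhs, rhs) pairs copied from the visible rows of the lift-sorted output.
rules = [
    ({"click", "fox"}, "app"),
    ({"blocker", "genderaffirming"}, "puberty"),
    ({"blocker", "care", "medical"}, "puberty"),
    ({"blocker", "minor"}, "puberty"),
    ({"blocker", "law"}, "puberty"),
    ({"blocker", "child", "minor"}, "puberty"),
    ({"blocker", "hormone", "minor"}, "puberty"),
    ({"blocker", "surgery"}, "puberty"),
]


def lhs_words_by_rhs(rules):
    """Collect every word that appears on the left of a rule, keyed by the rule's right side."""
    grouped = defaultdict(set)
    for lhs, rhs in rules:
        grouped[rhs] |= lhs
    return {rhs: sorted(words) for rhs, words in grouped.items()}


associated = lhs_words_by_rhs(rules)
print(associated["puberty"])
```

The same grouping makes boilerplate easy to spot: everything keyed under `app` comes from the Fox News advertisement rather than from the stories themselves.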

# Takeaways

Overall, we found some interesting linguistic associations with key words in the transgender rights topic.

It seems like Association Rule Mining is extremely useful for finding the connotations between specific words within a set of documents. It's telling us what, on average, the people writing these articles associate with `puberty`, `gender`, and `transgender`. There could be more useful associations hidden within the text, or, more interestingly, within sorted clusters of text.