<a href="https://colab.research.google.com/github/DepartmentOfStatisticsPUE/cda-2021/blob/main/notebooks/cda_2021_06_01_lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We need to install the `arules` package to run market basket (association) analysis.

In [2]:
install.packages("arules")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(arules)

Read the data. This dataset contains the following competences:


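+ technical (techniczne),
+ mathematical (matematyczne),
+ artistic (kulturalne),
+ computer (komputerowe),
+ cognitive (kognitywne),
+ managerial (kierownicze),
+ interpersonal (interpersonalne),
+ individual / self-organization (indywidualne),
+ physical (fizyczne),
+ availability (dyspozycyjne),
+ office (biurowe).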

In [5]:
data <- readRDS("data-bkl.rds")
data <- data[, -1]  # drop the first column
head(data)
dim(data)

,techniczne,matematyczne,kulturalne,komputerowe,kognitywne,kierownicze,interpersonalne,indywidualne,fizyczne,dyspozycyjne,biurowe
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,0,0,1,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,1,0,1,0,0
6,0,0,0,0,0,0,1,1,0,0,0


Let's check the share of job ads (in %) in which each competence was mentioned.

In [10]:
print(colMeans(data)*100)

     techniczne    matematyczne      kulturalne     komputerowe      kognitywne 
      4.4461658       0.3408211      15.8171960      34.3067390      21.0921766 
    kierownicze interpersonalne    indywidualne        fizyczne    dyspozycyjne 
     31.1773819      61.0302091      66.1270333       8.0325329      23.6638265 
        biurowe 
      2.9434547 


To run basket/association analysis we take the following steps:

1. verify the format of the data (transactions vs matrix),
2. create a special object that is recognized by the `arules` package,
3. run the `apriori` function, which finds rules that satisfy given thresholds,
4. analyse the rules that were found.

In [22]:
## first step -- convert the data frame to a 0/1 matrix
data_m <- as.matrix(data)
## second step -- coerce the matrix to a transactions object
competences <- as(data_m, "transactions")
competences

transactions in sparse format with
 12910 transactions (rows) and
 11 items (columns)

In [23]:
summary(competences)

transactions as itemMatrix in sparse format with
 12910 rows (elements/itemsets/transactions) and
 11 columns (items) and a density of 0.244525 

most frequent items:
   indywidualne interpersonalne     komputerowe     kierownicze    dyspozycyjne 
           8537            7879            4429            4025            3055 
        (Other) 
           6800 

element (itemset/transaction) length distribution:
sizes
   0    1    2    3    4    5    6    7    8    9 
1567 1448 2465 3324 2599 1152  271   80    3    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00    3.00    2.69    4.00    9.00 

includes extended item information - examples:
        labels
1   techniczne
2 matematyczne
3   kulturalne

includes extended transaction information - examples:
  transactionID
1             1
2             2
3             3

In [24]:
## third step -- association analysis
results <- apriori(competences)
results

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 1291 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[11 item(s), 12910 transaction(s)] done [0.00s].
sorting and recoding items ... [7 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [11 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


set of 11 rules 
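
The call above relies on the defaults shown in the parameter specification: minimum support 0.1 (an absolute count of 0.1 × 12910 = 1291 ads) and minimum confidence 0.8. A minimal sketch of how other thresholds could be passed through the `parameter` argument follows; the toy matrix and the threshold values are assumptions made for illustration, not the job-ads data.

```r
library(arules)

# Toy 0/1 data (hypothetical, not the job-ads dataset).
set.seed(1)
toy <- matrix(rbinom(200 * 4, 1, 0.5), ncol = 4,
              dimnames = list(NULL, c("a", "b", "c", "d")))
toy_tr <- as(toy == 1, "transactions")

# Lower the thresholds and skip rules with an empty left-hand side.
rules <- apriori(toy_tr,
                 parameter = list(support = 0.05, confidence = 0.5, minlen = 2),
                 control = list(verbose = FALSE))

# Show up to three rules with the highest lift.
inspect(head(sort(rules, by = "lift"), 3))
```

On the notebook's data the same idea would read `apriori(competences, parameter = list(...))`; lower thresholds generally return more rules.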

In [29]:
## fourth step -- analyse the results
results_inspect <- inspect(results,  print = FALSE)

     lhs                               rhs               support   confidence
[1]  {kulturalne}                   => {interpersonalne} 0.1284276 0.8119491 
[2]  {kulturalne}                   => {indywidualne}    0.1318358 0.8334966 
[3]  {kognitywne}                   => {indywidualne}    0.1714175 0.8127066 
[4]  {kierownicze}                  => {indywidualne}    0.2600310 0.8340373 
[5]  {interpersonalne}              => {indywidualne}    0.4964369 0.8134281 
[6]  {kulturalne,interpersonalne}   => {indywidualne}    0.1108443 0.8630881 
[7]  {kulturalne,indywidualne}      => {interpersonalne} 0.1108443 0.8407756 
[8]  {kognitywne,interpersonalne}   => {indywidualne}    0.1275755 0.8351927 
[9]  {interpersonalne,dyspozycyjne} => {indywidualne}    0.1312936 0.8466533 
[10] {komputerowe,kierownicze}      => {indywidualne}    0.1014717 0.8624095 
[11] {kierownicze,interpersonalne}  => {indywidualne}    0.2047250 0.8332282 
     coverage  lift     count
[1]  0.1581720 1.330405 1658 
[2] 

In [30]:
as.data.frame(results_inspect)

,lhs,,rhs,support,confidence,coverage,lift,count
,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
[1],{kulturalne},=>,{interpersonalne},0.1284276,0.8119491,0.158172,1.330405,1658
[2],{kulturalne},=>,{indywidualne},0.1318358,0.8334966,0.158172,1.260448,1702
[3],{kognitywne},=>,{indywidualne},0.1714175,0.8127066,0.2109218,1.229008,2213
[4],{kierownicze},=>,{indywidualne},0.260031,0.8340373,0.3117738,1.261265,3357
[5],{interpersonalne},=>,{indywidualne},0.4964369,0.8134281,0.6103021,1.230099,6409
[6],"{kulturalne,interpersonalne}",=>,{indywidualne},0.1108443,0.8630881,0.1284276,1.305197,1431
[7],"{kulturalne,indywidualne}",=>,{interpersonalne},0.1108443,0.8407756,0.1318358,1.377638,1431
[8],"{kognitywne,interpersonalne}",=>,{indywidualne},0.1275755,0.8351927,0.1527498,1.263013,1647
[9],"{interpersonalne,dyspozycyjne}",=>,{indywidualne},0.1312936,0.8466533,0.1550736,1.280344,1695
[10],"{komputerowe,kierownicze}",=>,{indywidualne},0.1014717,0.8624095,0.1176607,1.304171,1310
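
To see how the quality measures relate to each other, rule [1] `{kulturalne} => {interpersonalne}` can be recomputed by hand. The sketch below uses base R and numbers copied from the outputs above: the rule count, the number of ads, and the item shares from `colMeans`. The variable names are illustrative.

```r
# Values copied from the outputs above.
n          <- 12910        # number of job ads
count_rule <- 1658         # ads mentioning both kulturalne and interpersonalne
share_lhs  <- 0.158171960  # share of ads with kulturalne (coverage of the rule)
share_rhs  <- 0.610302091  # share of ads with interpersonalne

support    <- count_rule / n         # P(lhs and rhs)
confidence <- support / share_lhs    # P(rhs | lhs)
lift       <- confidence / share_rhs # P(rhs | lhs) / P(rhs)

round(c(support = support, confidence = confidence, lift = lift), 6)
```

A lift above 1 means that the right-hand side appears more often among ads containing the left-hand side than among all ads.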


To visualize the results we may use the `arulesViz` package, which provides a `shiny` app for exploring the mined rules.
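
A minimal sketch of launching the app is given below. It assumes that `arulesViz` has been installed with `install.packages("arulesViz")`, and it uses the `Groceries` dataset bundled with `arules` instead of the job-ads data so that it runs on its own; the thresholds are arbitrary.

```r
library(arules)

# Example transactions shipped with arules (not the job-ads dataset).
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5),
                 control = list(verbose = FALSE))

# Open the interactive shiny explorer only in an interactive session
# and only if arulesViz is available.
if (interactive() && requireNamespace("arulesViz", quietly = TRUE)) {
  arulesViz::ruleExplorer(rules)
}
```

In this notebook the equivalent call would be `ruleExplorer(results)` after `library(arulesViz)`.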