# Mining Association Rules with Orange

We use [`Orange`](http://orange.biolab.si/), a general-purpose data mining library for Python. `Orange` is similar to `Scikit-Learn`, with additional features for visual programming and unsupervised mining algorithms (e.g., frequent item set mining and association rule mining). See the [`Orange.associate` documentation](https://docs.orange.biolab.si/2/reference/rst/Orange.associate.html) for more. To install it, run `pip install Orange` or `conda install Orange` on the command line. Make sure you install version 2.7 of `Orange`, which is the version used with Python 2.7.

We use a toy market basket data set, `market-basket.basket`, which comes with the library. Each line describes a basket:

```
Bread, Milk
Bread, Diapers, Beer, Eggs
Milk, Diapers, Beer, Cola
Bread, Milk, Diapers, Beer
Bread, Milk, Diapers, Cola
```

The `.basket` suffix tells Orange that this is basket data, i.e. each line can be seen as a set of items, and different lines may contain different numbers of items. For basket data, we can use the built-in `AssociationRulesSparseInducer` of `Orange` to mine association rules. The call below uses a minimum support threshold of 0.01 and a minimum confidence of 0.01:

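```python
import Orange

# Load the basket file; the .basket suffix selects the basket format.
data = Orange.data.Table("market-basket.basket")
# Mine all rules that meet the minimum support and confidence thresholds.
rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.01, confidence=0.01)
print "%4s %4s  %s" % ("Supp", "Conf", "Rule")
for r in rules:
    print "%.2f %.2f  %s" % (r.support, r.confidence, r)
```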

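To make the two thresholds concrete, the following sketch computes support and confidence by hand for the five toy baskets listed above. It uses plain Python rather than Orange, and the helper names `support` and `confidence` are illustrative, not part of the Orange API.

```python
# The five toy baskets from market-basket.basket, written as Python sets.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for b in baskets if items <= b) / float(len(baskets))


def confidence(lhs, rhs, baskets):
    """Support of lhs and rhs together, divided by the support of lhs."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)


print("%.2f" % support({"Diapers", "Beer"}, baskets))       # 0.60
print("%.2f" % confidence({"Diapers"}, {"Beer"}, baskets))  # 0.75
print("%.2f" % confidence({"Beer"}, {"Diapers"}, baskets))  # 1.00
```

The last two lines show that confidence is directional: every basket with Beer also contains Diapers, but not the other way around.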

`AssociationRulesSparseInducer` can also induce the frequent item sets, given a minimum support threshold:

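```python
# Create the inducer without data so that item sets can be requested from it;
# the second element of each result holds the baskets containing the item set.
inducer = Orange.associate.AssociationRulesSparseInducer(support=0.5, storeExamples=True)
itemsets = inducer.get_itemsets(data)

print "%4s %s" % ("Supp", "Rule")
for items, baskets in itemsets:
    # Support = number of baskets containing the item set / total number of baskets.
    print "%.2f %s" % (len(baskets) / float(len(data)),
                       " ".join(data.domain[item].name for item in items))
```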

```
Supp Rule
0.60 Beer
0.60 Beer Diapers
0.80 Diapers
0.60 Diapers Milk
0.60 Diapers Bread
0.80 Milk
0.60 Milk Bread
0.80 Bread
```
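As a cross-check on the idea of a frequent item set, here is a brute-force sketch in plain Python that enumerates the item sets of the five toy baskets with support of at least 0.5. It is not how Orange computes them, and the function name `frequent_itemsets` is made up for this illustration. Trying every combination is only feasible for toy data.

```python
from itertools import combinations

# The five toy baskets from market-basket.basket, written as Python sets.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def frequent_itemsets(baskets, min_support):
    """Return {itemset (sorted tuple): support} for all frequent item sets."""
    items = sorted(set().union(*baskets))
    total = float(len(baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            supp = sum(1 for b in baskets if set(candidate) <= b) / total
            if supp >= min_support:
                result[candidate] = supp
                found = True
        if not found:
            # A superset can never be more frequent than its subsets, so if
            # no set of this size is frequent, no larger set is either.
            break
    return result


for itemset, supp in sorted(frequent_itemsets(baskets, 0.5).items()):
    print("%.2f %s" % (supp, " ".join(itemset)))
```

On this data the sketch finds the same eight item sets as the Orange output above, listed in a different order.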


# Feature-Value Data Format

Another data format that is relevant for us is the feature-value data format. In this case, the first line typically specifies the names of the attributes, while the remaining lines specify the data. Observe that the number of values per line is the same in all lines. In `Orange`, the second line of the input file specifies the type of each attribute, e.g. discrete, continuous or string. The third line gives additional information that is not relevant for us at the moment. Below is an example (from the file `lenses.tab`), in which the tab separator is used.

```
age       prescription  astigmatic    tear_rate     lenses
discrete  discrete      discrete      discrete      discrete

young     myope         no            reduced       none
young     myope         no            normal        soft
young     myope         yes           reduced       none
young     myope         yes           normal        hard
young     hypermetrope  no            reduced       none
```
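The header layout described above can be read with a few lines of plain Python. This is a minimal sketch, not Orange's own loader: it assumes real tab characters between the columns and embeds the first two data rows of the example as a string instead of reading `lenses.tab` from disk.

```python
# First lines of the lenses example, with real tab separators.
raw = (
    "age\tprescription\tastigmatic\ttear_rate\tlenses\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\tdiscrete\n"
    "\n"
    "young\tmyope\tno\treduced\tnone\n"
    "young\tmyope\tno\tnormal\tsoft\n"
)

lines = raw.split("\n")
names = lines[0].split("\t")  # line 1: attribute names
types = lines[1].split("\t")  # line 2: attribute types
flags = lines[2]              # line 3: additional information (empty here)

# Every remaining non-empty line is one record with one value per attribute.
records = [dict(zip(names, line.split("\t"))) for line in lines[3:] if line]

# Each record can be viewed as a set of attribute=value items.
first = records[0]
print(" ".join("%s=%s" % (name, first[name]) for name in names))
# age=young prescription=myope astigmatic=no tear_rate=reduced lenses=none
```

Viewing a record as a list of `attribute=value` items is also the form in which items appear in the rules printed for feature-value data.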


If your input file uses the feature-value data format, use `AssociationRulesInducer` instead of `AssociationRulesSparseInducer`. In `Orange`, files with the suffixes `csv` and `tab` use the feature-value data format, with a ',' or a tab as the separator, respectively. See also the documentation [here](https://docs.orange.biolab.si/2/reference/rst/Orange.associate.html).



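```python
import Orange

# The .csv suffix tells Orange to read the file as feature-value data.
data = Orange.data.Table("mammographic_masses.csv")
# Keep only rules with support >= 0.1 and confidence >= 0.9.
rules = Orange.associate.AssociationRulesInducer(data, support=0.1, confidence=0.9)
print "%4s %4s  %s" % ("Supp", "Conf", "Rule")
for r in rules:
    print "%.2f %.2f  %s" % (r.support, r.confidence, r)
```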

```
Supp Conf  Rule
0.37 0.90  Shape=4 -> Density=3
0.13 0.93  Margin=5 -> Density=3
0.17 0.91  BI-RADS=4 Shape=1 -> Severity=0
0.16 0.90  BI-RADS=4 Shape=2 -> Severity=0
0.26 0.90  Margin=1 Density=3 -> BI-RADS=4
0.30 0.91  Margin=1 Severity=0 -> BI-RADS=4
0.30 0.91  BI-RADS=4 Margin=1 -> Severity=0
0.24 0.90  BI-RADS=5 Shape=4 -> Density=3
0.25 0.91  BI-RADS=5 Shape=4 -> Severity=1
0.14 0.90  Shape=2 Margin=1 -> Severity=0
0.11 0.96  Shape=4 Margin=5 -> Density=3
0.30 0.90  Shape=4 Severity=1 -> Density=3
0.11 0.93  Margin=5 Severity=1 -> Density=3
0.14 0.91  Shape=1 Margin=1 Density=3 -> BI-RADS=4
0.14 0.91  BI-RADS=4 Shape=1 Density=3 -> Margin=1
0.16 0.92  Shape=1 Margin=1 Severity=0 -> BI-RADS=4
0.16 0.91  BI-RADS=4 Shape=1 Margin=1 -> Severity=0
0.15 0.92  Shape=1 Density=3 Severity=0 -> BI-RADS=4
0.15 0.92  BI-RADS=4 Shape=1 Density=3 -> Severity=0
0.12 0.92  Shape=2 Margin=1 Severity=0 -> BI-RADS=4
0.12 0.94  BI-RADS=4 Shape=2 Margin=1 -> Severity=0
0.12 0.91  Shape=2 Density=3 Se
```



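Rules can be converted to strings using `str()`, and the resulting string can then be split into its left-hand and right-hand sides using `split('->')`. Example (the output shown refers to attributes of the lenses data, so `rules` here holds rules mined from that data set):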

```python
print str(rules[0]).split('->')
```

```
['lenses=none ', ' prescription=hypermetrope']
```
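Building on the `split('->')` idea, the sketch below turns a rule string into two lists of `(attribute, value)` pairs. The helper `parse_rule` is hypothetical and works on plain strings, so it runs without Orange. It assumes that items are separated by whitespace and that neither attribute names nor values contain spaces or `->`, which holds for the rule strings printed above.

```python
def parse_rule(text):
    """Split 'a=1 b=2 -> c=3' into two lists of (attribute, value) pairs."""
    left, right = text.split("->")

    def parse_side(side):
        # Items are separated by whitespace; each item is 'attribute=value'.
        return [tuple(item.split("=", 1)) for item in side.split()]

    return parse_side(left), parse_side(right)


lhs, rhs = parse_rule("BI-RADS=4 Shape=1 -> Severity=0")
print(lhs)  # [('BI-RADS', '4'), ('Shape', '1')]
print(rhs)  # [('Severity', '0')]
```

With an actual rule object, the input would be `str(rule)` instead of a literal string.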
