In [1]:
from orangecontrib.associate.fpgrowth import * 
from scipy.sparse import issparse
import Orange

import random
random.seed(42)

# Association rules

Orange can induce association rules from sparse (basket) data as well as from attribute-value data sets, and in both cases it also supports mining of frequent itemsets. The functions used in this notebook come from the `orangecontrib.associate.fpgrowth` module, which, as its name indicates, is based on FP-growth.

Let's start with market basket data:

In [2]:
data = Orange.data.Table("podatki/foodmart.basket")

Let's explore the data.

In [3]:
print(len(data))
print(type(data.X))
data[:5]

62560
<class 'scipy.sparse.csr.csr_matrix'>


[[Pasta=3.000, Soup=2.000, STORE_ID_2=1.000],
 [Soup=1.000, STORE_ID_2=1.000, Fresh Vegetables=3.000, Milk=3.000, Plastic Utensils=2.000, ...],
 [STORE_ID_2=1.000, Cheese=2.000, Deodorizers=1.000, Hard Candy=2.000, Jam=2.000, ...],
 [STORE_ID_2=1.000, Fresh Vegetables=2.000],
 [STORE_ID_2=1.000, Cleaners=1.000, Cookies=2.000, Eggs=2.000, Preserves=1.000, ...]
]

We can't use the table data directly; we first have to one-hot transform it.

The transformation gives us a database we can use to find frequent itemsets, and a mapping we will use later to revert the transformation.

In [4]:
X, mapping = OneHot.encode(data)
X, mapping

([[0, 1, 2],
  [1, 2, 3, 4, 5],
  [2, 6, 7, 8, 9],
  [2, 3],
  [2, 10, 11, 12, 13],
  [1, 2, 6, 14],
  [2, 15, 16, 17],
  [2, 11, 13, 15],
  [2, 3, 10, 18, 19, 20],
  [1, 2, 16, 21, 22, 23],
  [2, 24, 25, 26],
  [2, 3, 11, 12, 27, 28, 29],
  [2, 11, 30, 31, 32],
  [2, 3, 33, 34, 35],
  [1, 2, 4, 30, 36, 37, 38],
  [2, 14, 15, 26, 32, 39, 40],
  [2, 3, 31, 35, 41, 42, 43],
  [2, 3, 22, 30, 40, 44],
  [2, 3, 41, 42, 45, 46],
  [1, 2, 33],
  [2, 6, 11, 30, 47, 48],
  [2, 20, 27, 30],
  [2, 6, 45, 49],
  [2, 47, 50],
  [2, 27, 40, 41, 51],
  [1, 2, 28, 42],
  [2, 23, 30, 33, 39],
  [2, 38, 41, 50, 52],
  [2, 27, 53],
  [2, 12, 40],
  [2, 3, 5, 9, 34, 36, 54],
  [2, 3, 55],
  [2, 3, 23, 38, 56],
  [2, 50, 55, 57],
  [2, 3, 30],
  [2, 48],
  [2, 4, 9, 11, 58],
  [2, 30, 38],
  [2, 13, 19, 29, 30, 34, 41],
  [2, 10, 14, 18, 36, 59, 60],
  [2, 3, 4, 33, 40, 61],
  [2, 3, 47, 49],
  [1, 2, 3, 62, 63],
  [0, 2, 3, 22, 32, 33, 56],
  [2, 19, 59, 64, 65],
  [2, 16, 22],
  [2, 5, 27, 30, 32, 35, 66

We can use `decode` to link each index with an item's name.

In [5]:
# Basket items are named after their variable only; the value (quantity) is not used.
names = {item: var.name
         for item, var, _ in OneHot.decode(mapping, data, mapping)}
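
To make the encoding step concrete, here is a minimal pure-Python sketch of the same idea on toy baskets. It is not Orange's implementation; the `encode` helper and the basket contents are made up for illustration. Each distinct item name gets an integer id, each basket becomes a list of ids, and the inverse dictionary plays the role of `names`.

```python
# Toy baskets; the contents are invented for illustration.
baskets = [
    ["Pasta", "Soup"],
    ["Soup", "Milk", "Fresh Vegetables"],
    ["Milk"],
]


def encode(baskets):
    """Give each distinct item name an integer id (in order of first
    appearance) and rewrite every basket as a sorted list of ids."""
    item_ids = {}
    encoded = []
    for basket in baskets:
        row = []
        for name in basket:
            if name not in item_ids:
                item_ids[name] = len(item_ids)
            row.append(item_ids[name])
        encoded.append(sorted(row))
    return encoded, item_ids


encoded, item_ids = encode(baskets)

# Inverse mapping, used to turn ids back into readable names.
id_to_name = {idx: name for name, idx in item_ids.items()}

print(encoded)
print(id_to_name)
```

Keeping the transactions as small integer lists makes itemset mining cheap, while the inverse mapping is only needed at the end, when results are printed.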

We set a low minimum support (0.01% of transactions), since with more than 62,000 transactions it is hard to find itemsets, and therefore rules, that cover a large share of them.

In [6]:
itemsets = {}
for itemset, support in frequent_itemsets(X, 0.01/100):
    itemsets[itemset] = support
len(itemsets)

64760
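
As a rough illustration of what a minimum support threshold means, the following sketch enumerates frequent itemsets by brute force on four toy transactions. This is not how `frequent_itemsets` works internally (brute force is exponential in the number of items); the function name and the data are hypothetical. Support is reported as an absolute count, and the threshold is given as a fraction of all transactions.

```python
from itertools import combinations

# Toy transactions over integer item ids (invented data).
transactions = [
    {0, 1, 2},
    {1, 2, 3},
    {2, 3},
    {1, 2},
]


def brute_force_frequent_itemsets(transactions, min_support):
    """Return {itemset: absolute support} for every itemset contained in
    at least a min_support fraction of the transactions."""
    min_count = min_support * len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            # Count the transactions that contain every item of the candidate.
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                result[candidate] = count
    return result


toy_itemsets = brute_force_frequent_itemsets(transactions, 0.5)
for itemset, count in sorted(toy_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Lowering `min_support` admits more itemsets, which is why the very low threshold used on the basket data yields tens of thousands of them.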

Now we can generate all association rules that have at least 70% confidence:

In [7]:
rules = []
for rule in association_rules(itemsets, 0.7):
    left, right, support, confidence = rule
    left_str =  ', '.join(names[i] for i in sorted(left))
    right_str = ', '.join(names[i] for i in sorted(right))
    rules.append(left_str + " -> " + right_str)

In [8]:
rules[:10]

['Cheese, Cookies, Fresh Fruit, Wine -> Fresh Vegetables',
 'Soup, Fresh Fruit, Dried Fruit, Paper Wipes -> Fresh Vegetables',
 'Soup, Cheese, Fresh Fruit, Nuts -> Fresh Vegetables',
 'Cheese, Preserves, Fresh Fruit, Nuts -> Fresh Vegetables',
 'Soup, Preserves, Fresh Fruit, Nuts -> Fresh Vegetables',
 'Soup, Fresh Vegetables, Preserves, Nuts -> Fresh Fruit',
 'Fresh Vegetables, Cheese, Juice, Pizza -> Soup',
 'Soup, Fresh Vegetables, Juice, Pizza -> Cheese',
 'Soup, Deli Meats, Chips, Wine -> Fresh Vegetables',
 'Soup, Fresh Vegetables, Deli Meats, Chips -> Wine']
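
The confidence of a rule A -> B is the support of A and B together divided by the support of A alone. The sketch below computes it from a dictionary of itemset supports; the numbers are made up and the `confidence` helper is hypothetical, not part of Orange.

```python
# Absolute supports of a few itemsets (invented numbers, not from foodmart).
support = {
    frozenset({"Soup"}): 40,
    frozenset({"Cheese"}): 50,
    frozenset({"Soup", "Cheese"}): 30,
}


def confidence(antecedent, consequent, support):
    """conf(A -> B) = supp(A and B together) / supp(A)."""
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    return support[antecedent | consequent] / support[antecedent]


conf_soup_cheese = confidence({"Soup"}, {"Cheese"}, support)
conf_cheese_soup = confidence({"Cheese"}, {"Soup"}, support)
print(conf_soup_cheese, conf_cheese_soup)
```

With a 70% threshold only `Soup -> Cheese` would be kept here, which shows that confidence is not symmetric: the same itemset can give a strong rule in one direction and a weak one in the other.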

##### Question 5-5-1
Filter the rules. Find all rules that predict the purchase of cheese.

# Classification rules

We saw how to get association rules from sparse data. This time we'll go through the same process on dense (attribute-value) data.

In [9]:
data = Orange.data.Table('zoo')
data

[[1, 0, 0, 1, 0, ... | mammal] {aardvark},
 [1, 0, 0, 1, 0, ... | mammal] {antelope},
 [0, 0, 1, 0, 0, ... | fish] {bass},
 [1, 0, 0, 1, 0, ... | mammal] {bear},
 [1, 0, 0, 1, 0, ... | mammal] {boar},
 ...
]

Because the matrix also contains zeros, each attribute value (including 0) becomes its own item, so we include the value in the item's name.

In [10]:
X, mapping = OneHot.encode(data)
names = {item: ('{}={}').format(var.name, val)
                 for item, var, val in OneHot.decode(mapping, data, mapping)}
names

{0: 'hair=0',
 1: 'hair=1',
 2: 'feathers=0',
 3: 'feathers=1',
 4: 'eggs=0',
 5: 'eggs=1',
 6: 'milk=0',
 7: 'milk=1',
 8: 'airborne=0',
 9: 'airborne=1',
 10: 'aquatic=0',
 11: 'aquatic=1',
 12: 'predator=0',
 13: 'predator=1',
 14: 'toothed=0',
 15: 'toothed=1',
 16: 'backbone=0',
 17: 'backbone=1',
 18: 'breathes=0',
 19: 'breathes=1',
 20: 'venomous=0',
 21: 'venomous=1',
 22: 'fins=0',
 23: 'fins=1',
 24: 'legs=0',
 25: 'legs=2',
 26: 'legs=4',
 27: 'legs=5',
 28: 'legs=6',
 29: 'legs=8',
 30: 'tail=0',
 31: 'tail=1',
 32: 'domestic=0',
 33: 'domestic=1',
 34: 'catsize=0',
 35: 'catsize=1'}
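
The numbering above can be mimicked with a short pure-Python sketch: every (attribute, value) pair that occurs in the data becomes one item. This is only an illustration of the idea on invented rows, not Orange's code, and all names in it are hypothetical.

```python
# Toy attribute-value rows (invented data).
rows = [
    {"hair": 1, "legs": 4},
    {"hair": 0, "legs": 2},
    {"hair": 1, "legs": 2},
]
attributes = ["hair", "legs"]

# Assign consecutive ids to (attribute, value) pairs, attribute by attribute.
pair_ids = {}
for attr in attributes:
    for value in sorted({row[attr] for row in rows}):
        pair_ids[(attr, value)] = len(pair_ids)

# Each row becomes one transaction holding exactly one item per attribute.
encoded_rows = [[pair_ids[(attr, row[attr])] for attr in attributes]
                for row in rows]

# Readable names in the same 'attribute=value' style as above.
pair_names = {idx: "{}={}".format(attr, value)
              for (attr, value), idx in pair_ids.items()}

print(encoded_rows)
print(pair_names)
```

Because every row contains one item per attribute, transactions are much denser than in basket data, and items such as `venomous=0` occur in most rows.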

From here on, the process is the same as before. Because the data is dense, we can choose a higher minimum support and confidence.

In [11]:
itemsets = {}
for itemset, support in frequent_itemsets(X, 0.7):
    itemsets[itemset] = support
for rule in association_rules(itemsets, 0.8):
    left, right, support, confidence = rule
    left_str = ', '.join(names[i] for i in sorted(left))
    right_str = ', '.join(names[i] for i in sorted(right))
    print(left_str + " -> " + right_str)

venomous=0, fins=0 -> breathes=1
breathes=1, fins=0 -> venomous=0
fins=0 -> breathes=1, venomous=0
breathes=1, venomous=0 -> fins=0
breathes=1 -> venomous=0, fins=0
venomous=0, tail=1 -> backbone=1
backbone=1, tail=1 -> venomous=0
tail=1 -> backbone=1, venomous=0
backbone=1, venomous=0 -> tail=1
backbone=1 -> venomous=0, tail=1
feathers=0 -> domestic=0
domestic=0 -> feathers=0
feathers=0 -> airborne=0
airborne=0 -> feathers=0
backbone=1 -> domestic=0
domestic=0 -> backbone=1
venomous=0 -> domestic=0
domestic=0 -> venomous=0
feathers=0 -> venomous=0
airborne=0 -> venomous=0
venomous=0 -> backbone=1
backbone=1 -> venomous=0
venomous=0 -> breathes=1
breathes=1 -> venomous=0
fins=0 -> domestic=0
domestic=0 -> fins=0
fins=0 -> breathes=1
breathes=1 -> fins=0
fins=0 -> venomous=0
venomous=0 -> fins=0
tail=1 -> backbone=1
backbone=1 -> tail=1
tail=1 -> venomous=0


The data set has a class variable. Can we create rules that predict the class?

We pass `include_class=True` to `OneHot.encode` so that the class values are encoded as items too.

In [12]:
X, mapping = OneHot.encode(data, include_class=True)

We want itemsets with at least 40% support:

In [13]:
itemsets = dict(frequent_itemsets(X, .4))
len(itemsets)

520

The encoded items that correspond to class values are:

In [14]:
class_items = {item 
               for item, var, _ in OneHot.decode(mapping, data, mapping) 
               if var is data.domain.class_var}
sorted(class_items)

[36, 37, 38, 39, 40, 41, 42]

That makes sense, as our class variable has seven values:

In [15]:
data.domain.class_var.values

['amphibian', 'bird', 'fish', 'insect', 'invertebrate', 'mammal', 'reptile']

Now we can generate all association rules whose consequent is exactly one of the class values and whose confidence is at least 80% (i.e. classification rules):

In [16]:
rules = [(P, Q, supp, conf) 
         for P, Q, supp, conf in association_rules(itemsets, .8) 
         if len(Q) == 1 and Q & class_items]
len(rules)
rules

[(frozenset({2, 7, 17, 19, 20}), frozenset({41}), 41, 1.0),
 (frozenset({2, 7, 17, 19}), frozenset({41}), 41, 1.0),
 (frozenset({2, 7, 17, 20}), frozenset({41}), 41, 1.0),
 (frozenset({2, 7, 19, 20}), frozenset({41}), 41, 1.0),
 (frozenset({2, 17, 19, 20}), frozenset({41}), 41, 0.8723404255319149),
 (frozenset({7, 17, 19, 20}), frozenset({41}), 41, 1.0),
 (frozenset({2, 7, 17}), frozenset({41}), 41, 1.0),
 (frozenset({2, 7, 19}), frozenset({41}), 41, 1.0),
 (frozenset({2, 17, 19}), frozenset({41}), 41, 0.8367346938775511),
 (frozenset({7, 17, 19}), frozenset({41}), 41, 1.0),
 (frozenset({2, 7, 20}), frozenset({41}), 41, 1.0),
 (frozenset({7, 17, 20}), frozenset({41}), 41, 1.0),
 (frozenset({7, 19, 20}), frozenset({41}), 41, 1.0),
 (frozenset({2, 7}), frozenset({41}), 41, 1.0),
 (frozenset({7, 17}), frozenset({41}), 41, 1.0),
 (frozenset({7, 19}), frozenset({41}), 41, 1.0),
 (frozenset({7, 20}), frozenset({41}), 41, 1.0),
 (frozenset({7}), frozenset({41}), 41, 1.0)]
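
The filter in the cell above can be tried on a few hand-written rule tuples. The sketch assumes the same `(antecedent, consequent, support, confidence)` layout as in the output; the second and third tuples are made up to show what gets rejected.

```python
# Rule tuples in the (antecedent, consequent, support, confidence) layout.
# Only the first one is taken from the output above; the others are invented.
toy_rules = [
    (frozenset({7}), frozenset({41}), 41, 1.0),
    (frozenset({2}), frozenset({20}), 70, 0.9),      # consequent is not a class item
    (frozenset({7}), frozenset({19, 41}), 41, 1.0),  # consequent has two items
]
toy_class_items = {36, 37, 38, 39, 40, 41, 42}

# Keep rules whose consequent is a single item that encodes a class value.
# An empty intersection is falsy, so `Q & toy_class_items` acts as a test.
classification_rules = [
    (P, Q, supp, conf)
    for P, Q, supp, conf in toy_rules
    if len(Q) == 1 and Q & toy_class_items
]
print(classification_rules)
```

Requiring `len(Q) == 1` matters: without it, rules that predict a class value together with some other attribute value would pass the filter as well.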

To make the rules easier to read, we can use the mapping to transform their items back into table domain values, e.g. for the first five rules:

In [17]:
names = {item: '{}={}'.format(var.name, val) 
         for item, var, val in OneHot.decode(mapping, data, mapping)}
for ante, cons, supp, conf in rules[:5]:
    print(', '.join(names[i] for i in ante), '-->',
          names[next(iter(cons))],
          '(supp: {}, conf: {})'.format(supp, conf))

feathers=0, milk=1, backbone=1, breathes=1, venomous=0 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, feathers=0, breathes=1, milk=1 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, feathers=0, venomous=0, milk=1 --> type=mammal (supp: 41, conf: 1.0)
feathers=0, breathes=1, venomous=0, milk=1 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, feathers=0, breathes=1, venomous=0 --> type=mammal (supp: 41, conf: 0.8723404255319149)


# CN2

In [18]:
learner = Orange.classification.CN2Learner()
classifier = learner(data)

In [19]:
# consider up to 10 solution streams at one time
learner.rule_finder.search_algorithm.beam_width = 10

# continuous value space is constrained to reduce computation time
learner.rule_finder.search_strategy.bound_continuous = True

# found rules must cover at least 15 examples
learner.rule_finder.general_validator.min_covered_examples = 15

# found rules must combine at most 2 selectors (conditions)
learner.rule_finder.general_validator.max_rule_length = 2

classifier = learner(data)

Induced rules can be quickly reviewed and interpreted. Each has the form "if cond then predict class", that is, a conjunction of selectors followed by the predicted class.

In [20]:
for rule in classifier.rule_list:
    print(rule, rule.curr_class_dist.tolist())

IF feathers!=0 THEN type=bird  [0, 20, 0, 0, 0, 0, 0]
IF milk!=0 THEN type=mammal  [0, 0, 0, 0, 0, 41, 0]
IF legs==0 AND toothed!=0 THEN type=fish  [0, 0, 13, 0, 0, 0, 3]
IF backbone==0 AND domestic==0 THEN type=invertebrate  [0, 0, 0, 7, 10, 0, 0]
IF TRUE THEN type=mammal  [4, 20, 13, 8, 10, 41, 5]


If no other rule fires, the default rule (majority classification) is used. How the default rule is applied depends on the rule inducer.
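
One way to picture how such a rule list classifies an example is the sketch below. It assumes an ordered list in which the first rule that fires decides the class, with the default rule as a fallback; this is an illustration based on the printed rules, not Orange's classifier code, and the `predict` helper and the example animals are hypothetical.

```python
def predict(animal, rule_list, default):
    """Return the class of the first rule whose condition holds,
    or the default class when no rule fires."""
    for condition, label in rule_list:
        if condition(animal):
            return label
    return default


# Conditions copied from the printed rules, written as plain functions.
rule_list = [
    (lambda a: a["feathers"] != 0, "bird"),
    (lambda a: a["milk"] != 0, "mammal"),
    (lambda a: a["legs"] == 0 and a["toothed"] != 0, "fish"),
    (lambda a: a["backbone"] == 0 and a["domestic"] == 0, "invertebrate"),
]

# Two invented examples described by the attributes the rules test.
feathered = {"feathers": 1, "milk": 0, "legs": 2,
             "toothed": 0, "backbone": 1, "domestic": 0}
four_legged = {"feathers": 0, "milk": 0, "legs": 4,
               "toothed": 1, "backbone": 1, "domestic": 0}

print(predict(feathered, rule_list, default="mammal"))
print(predict(four_legged, rule_list, default="mammal"))
```

The second example matches none of the four rules, so it falls through to the default class. This is where a short rule list makes most of its mistakes, since the default simply predicts the majority class.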