# Association rules

Frequent itemset mining with the Apriori algorithm, using the ```mlxtend``` package.


This notebook was developed for the Google Colab environment ([colab.research.google.com](https://colab.research.google.com)).

Prof. Hugo de Paula



In [1]:
! pip install mlxtend
! pip install openpyxl



## Association rules generated from frequent itemsets

Source: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

The following example builds a transactional ```dataset``` as a "list of lists", where each inner list is one shopping basket from a hypothetical supermarket.

In this dataset, an ```itemset``` is considered frequent when its support is at least 0.6.
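As a rough illustration of what support means, the sketch below counts, in plain Python and without ```mlxtend```, the fraction of baskets that contain every item of a given set. The helper name `support` is an assumption made for this illustration; the baskets are the same five used in the example that follows.

```python
# Five hypothetical supermarket baskets (same data as the notebook example).
dataset = [['Leite', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Arroz', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Leite', 'Maçã', 'Feijão', 'Ovos'],
           ['Leite', 'Milho', 'Feijão', 'Iogurte'],
           ['Milho', 'Cebola', 'Feijão', 'Sorvete', 'Ovos']]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset.issubset(basket))
    return hits / len(transactions)


print(support({'Feijão'}, dataset))           # in all 5 baskets
print(support({'Cebola', 'Ovos'}, dataset))   # in 3 of 5 baskets
print(support({'Leite', 'Ovos'}, dataset))    # in 2 of 5 baskets
```

With a threshold of 0.6, the first two itemsets count as frequent and the third does not.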

In [2]:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.precision', 2)

# Transactional dataset of shopping baskets

dataset = [['Leite', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Arroz', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Leite', 'Maçã', 'Feijão', 'Ovos'],
           ['Leite', 'Milho', 'Feijão', 'Iogurte'],
           ['Milho', 'Cebola', 'Feijão', 'Sorvete', 'Ovos']]

te = TransactionEncoder()
te_ary = te.fit_transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
print(pd.DataFrame(dataset))
print(df)

frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)




       0       1       2        3     4        5
0  Leite  Cebola  Batata   Feijão  Ovos  Iogurte
1  Arroz  Cebola  Batata   Feijão  Ovos  Iogurte
2  Leite    Maçã  Feijão     Ovos  None     None
3  Leite   Milho  Feijão  Iogurte  None     None
4  Milho  Cebola  Feijão  Sorvete  Ovos     None
   Arroz  Batata  Cebola  Feijão  Iogurte  Leite   Maçã  Milho   Ovos  Sorvete
0  False    True    True    True     True   True  False  False   True    False
1   True    True    True    True     True  False  False  False   True    False
2  False   False   False    True    False   True   True  False   True    False
3  False   False   False    True     True   True  False   True  False    False
4  False   False    True    True    False  False  False   True   True     True
    support                itemsets
0       0.6                (Cebola)
1       1.0                (Feijão)
2       0.6               (Iogurte)
3       0.6                 (Leite)
4       0.8                  (Ovos)
5       0.6     

In [3]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.6,(Cebola)
1,1.0,(Feijão)
2,0.6,(Iogurte)
3,0.6,(Leite)
4,0.8,(Ovos)
5,0.6,"(Cebola, Feijão)"
6,0.6,"(Cebola, Ovos)"
7,0.6,"(Iogurte, Feijão)"
8,0.6,"(Leite, Feijão)"
9,0.8,"(Ovos, Feijão)"


In [4]:
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Cebola),(Feijão),0.6,1.0,0.6,1.0,1.0,0.0,inf
1,(Cebola),(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf
2,(Ovos),(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6
3,(Iogurte),(Feijão),0.6,1.0,0.6,1.0,1.0,0.0,inf
4,(Leite),(Feijão),0.6,1.0,0.6,1.0,1.0,0.0,inf
5,(Ovos),(Feijão),0.8,1.0,0.8,1.0,1.0,0.0,inf
6,(Feijão),(Ovos),1.0,0.8,0.8,0.8,1.0,0.0,1.0
7,"(Cebola, Ovos)",(Feijão),0.6,1.0,0.6,1.0,1.0,0.0,inf
8,"(Cebola, Feijão)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf
9,"(Ovos, Feijão)",(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6


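Generates association rules with a minimum lift of 1.2.

Keep in mind that a lift of 1 means the antecedent and the consequent occur together exactly as often as expected if they were independent, so the rule adds no predictive power. Values below 1 mean they co-occur less often than expected. Lift measures association, not causation.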
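To make the metrics concrete, here is a minimal sketch that recomputes confidence, lift, leverage and conviction from the three supports of a rule, using their usual definitions. The function name `rule_metrics` is an assumption for illustration, and the support values are read from the tables above.

```python
import math


def rule_metrics(support_a, support_c, support_ac):
    """Confidence, lift, leverage and conviction for a rule A -> C."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = math.inf
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return confidence, lift, leverage, conviction


# (Cebola) -> (Ovos): support(A)=0.6, support(C)=0.8, support(A and C)=0.6
conf, lift, lev, conv = rule_metrics(0.6, 0.8, 0.6)
print(round(conf, 2), round(lift, 2), round(lev, 2), conv)

# (Feijão) -> (Ovos): Feijão is in every basket, so the lift is 1
conf2, lift2, lev2, conv2 = rule_metrics(1.0, 0.8, 0.8)
print(round(conf2, 2), round(lift2, 2), round(lev2, 2), round(conv2, 2))
```

The first rule has a lift of 1.25, so it passes a threshold of 1.2. The second has a lift of 1 because its antecedent appears in every basket and therefore carries no information about the consequent.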

In [5]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Cebola),(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf
1,(Ovos),(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6
2,"(Cebola, Feijão)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf
3,"(Ovos, Feijão)",(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6
4,(Cebola),"(Ovos, Feijão)",0.6,0.8,0.6,1.0,1.25,0.12,inf
5,(Ovos),"(Cebola, Feijão)",0.8,0.6,0.6,0.75,1.25,0.12,1.6


In [7]:
rules["antecedent_len"] = rules["antecedents"].apply(len)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
0,(Cebola),(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf,1
1,(Ovos),(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1
2,"(Cebola, Feijão)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf,2
3,"(Ovos, Feijão)",(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6,2
4,(Cebola),"(Ovos, Feijão)",0.6,0.8,0.6,1.0,1.25,0.12,inf,1
5,(Ovos),"(Cebola, Feijão)",0.8,0.6,0.6,0.75,1.25,0.12,1.6,1


Shows only the rules whose antecedent has length 2 or more, confidence of at least 0.75, and lift above 1.2.

In [8]:
rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] >= 0.75) &
       (rules['lift'] > 1.2) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
2,"(Cebola, Feijão)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf,2
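The output above lists only rule 2, even though rule 3 is displayed with a confidence of 0.75 and the filter uses `>=`. A likely explanation is floating-point rounding. The sketch below assumes the confidence is stored as the quotient of the two supports, 0.6 / 0.8, which is an assumption about how the value was produced.

```python
import math

# Confidence of (Ovos, Feijão) -> (Cebola): support(A and C) / support(A)
confidence = 0.6 / 0.8

print(confidence)           # slightly below 0.75 in binary floating point
print(confidence >= 0.75)   # False, so a strict filter drops the rule

# A tolerance-based comparison treats the value as equal to 0.75.
print(confidence >= 0.75 or math.isclose(confidence, 0.75))  # True
```

When a threshold coincides with a value that appears in the data, comparing with a small tolerance (or rounding the column first) avoids this kind of surprise.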


Shows only the rules whose antecedent is exactly {Feijão, Ovos}.

In [9]:
rules[rules['antecedents'] == {'Ovos', 'Feijão'}]


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
3,"(Ovos, Feijão)",(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6,2


In [10]:
rules[rules['consequents'] == {'Ovos'}]


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
0,(Cebola),(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf,1
2,"(Cebola, Feijão)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf,2
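The two filters above work because ```mlxtend``` stores antecedents and consequents as `frozenset` objects, and Python compares a `frozenset` with a plain `set` by content, regardless of element order. A small sketch in plain Python:

```python
antecedents = frozenset(['Ovos', 'Feijão'])

# Equality ignores order and the set/frozenset distinction.
print(antecedents == {'Feijão', 'Ovos'})   # True
print(antecedents == {'Ovos'})             # False: equality means an exact match

# A subset test answers "does the antecedent contain Ovos?" instead.
print({'Ovos'} <= antecedents)             # True
```

Equality therefore selects only exact matches. To select every rule that merely contains an item, a subset test applied to each row would be needed.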


## Market basket analysis in Python

Source: Chris Moffitt (2017), Introduction to Market Basket Analysis in Python, http://pbpython.com/market-basket-analysis.html


This example uses the **Online Retail** dataset from UCI, available at [archive.ics.uci.edu/ml/machine-learning-databases/00352/Online Retail.xlsx](http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx)

In [11]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [12]:
df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
print(df.head())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

          InvoiceDate  UnitPrice  CustomerID         Country  
0 2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2 2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  


In [13]:
print(df.describe())

        Quantity  UnitPrice  CustomerID
count  541909.00  541909.00   406829.00
mean        9.55       4.61    15287.69
std       218.08      96.76     1713.60
min    -80995.00  -11062.06    12346.00
25%         1.00       1.25    13953.00
50%         3.00       2.08    15152.00
75%        10.00       4.13    16791.00
max     80995.00   38970.00    18287.00


### Data preparation


```strip()``` removes whitespace at the start and end of each string.

```dropna()``` removes records with missing values in the ```InvoiceNo``` field.

```df[~df['InvoiceNo'].str.contains('C')]``` removes records whose ```InvoiceNo``` contains the letter *'C'*; invoice numbers that start with *'C'* identify cancelled orders.
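The three steps can be tried on a tiny table before running them on the full dataset. The rows below are hypothetical values invented for this sketch, not records from the real file.

```python
import pandas as pd

# Toy invoices (hypothetical values, not taken from the real dataset).
toy = pd.DataFrame({
    'InvoiceNo': ['536365', 'C536379', None, '536370'],
    'Description': ['  WHITE METAL LANTERN ', 'Discount', 'POSTAGE', 'ALARM CLOCK'],
})

# 1. Trim leading and trailing whitespace from the descriptions.
toy['Description'] = toy['Description'].str.strip()
# 2. Drop rows without an invoice number.
toy.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
# 3. Drop cancelled orders (invoice numbers containing 'C').
toy['InvoiceNo'] = toy['InvoiceNo'].astype('str')
toy = toy[~toy['InvoiceNo'].str.contains('C')]

print(toy)
```

Only the first and last rows survive: one row is dropped for its missing invoice number and one for being a cancellation.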





In [14]:
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]
print(df.describe())

        Quantity  UnitPrice  CustomerID
count  532621.00  532621.00   397924.00
mean       10.24       3.85    15294.32
std       159.59      41.76     1713.17
min     -9600.00  -11062.06    12346.00
25%         1.00       1.25    13969.00
50%         3.00       2.08    15159.00
75%        10.00       4.13    16795.00
max     80995.00   13541.33    18287.00


Builds a dataset with orders from France only. The result is a pivot table in which each row is an invoice, each column is a product, and each cell holds the total quantity of that product purchased in that invoice.
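The shape of that pivot table is easier to see on a small input. The sketch below applies the same `groupby` / `unstack` chain to four hypothetical order lines; the invoice numbers and product names are made up for illustration.

```python
import pandas as pd

# Four hypothetical order lines: invoice A1 lists PEN twice.
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A1', 'A2'],
    'Description': ['PEN', 'PEN', 'CUP', 'CUP'],
    'Quantity': [2, 3, 1, 4],
})

toy_basket = (toy.groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum()           # total quantity per (invoice, product)
                 .unstack()       # products become columns
                 .reset_index().fillna(0)   # products not bought become 0
                 .set_index('InvoiceNo'))
print(toy_basket)
```

Invoice A1 ends up with 5 units of PEN (2 + 3) and 1 of CUP, while invoice A2 has 4 units of CUP and 0 of PEN.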


In [15]:
basket = (df[df['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
print(basket.head())

Description  10 COLOUR SPACEBOY PEN  12 COLOURED PARTY BALLOONS  \
InvoiceNo                                                         
536370                          0.0                         0.0   
536852                          0.0                         0.0   
536974                          0.0                         0.0   
537065                          0.0                         0.0   
537463                          0.0                         0.0   

Description  12 EGG HOUSE PAINTED WOOD  12 MESSAGE CARDS WITH ENVELOPES  \
InvoiceNo                                                                 
536370                             0.0                              0.0   
536852                             0.0                              0.0   
536974                             0.0                              0.0   
537065                             0.0                              0.0   
537463                             0.0                              0.0   

Desc

Converts the quantities to 0 or 1: any total of at least one unit becomes 1, and zero or negative totals become 0.

In [16]:
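def encode_units(x):
    # 1 if at least one unit was bought in the invoice, 0 otherwise
    return 1 if x >= 1 else 0

# applymap applies the function to every cell (renamed to DataFrame.map in pandas 2.1+)
basket_sets = basket.applymap(encode_units)
# POSTAGE is a shipping charge, not a product, so it is dropped
basket_sets.drop('POSTAGE', inplace=True, axis=1)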

print(basket_sets.head())

Description  10 COLOUR SPACEBOY PEN  12 COLOURED PARTY BALLOONS  \
InvoiceNo                                                         
536370                            0                           0   
536852                            0                           0   
536974                            0                           0   
537065                            0                           0   
537463                            0                           0   

Description  12 EGG HOUSE PAINTED WOOD  12 MESSAGE CARDS WITH ENVELOPES  \
InvoiceNo                                                                 
536370                               0                                0   
536852                               0                                0   
536974                               0                                0   
537065                               0                                0   
537463                               0                                0   

Desc
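The same 0/1 encoding can be sketched without a per-cell function by comparing the whole table at once. This is an alternative, not the approach used in the notebook; the toy values are hypothetical. Recent ```mlxtend``` releases also recommend boolean input, so the boolean table could be passed to `apriori` directly.

```python
import pandas as pd

# Hypothetical quantity table: rows are invoices, columns are products.
toy_basket = pd.DataFrame(
    {'CUP': [1.0, 4.0, 0.0], 'PEN': [5.0, 0.0, -2.0]},
    index=pd.Index(['A1', 'A2', 'A3'], name='InvoiceNo'),
)

# True where at least one unit was bought; zero or negative totals become False.
toy_sets = toy_basket >= 1
print(toy_sets.astype(int))

# Column means of a 0/1 table are the supports of the single items.
print(toy_sets.mean())
```

The vectorized comparison avoids calling a Python function once per cell, which matters on wide tables like the one built from the real dataset.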

In [17]:
print(basket_sets.describe())

Description  10 COLOUR SPACEBOY PEN  12 COLOURED PARTY BALLOONS  \
count                        392.00                      392.00   
mean                           0.03                        0.02   
std                            0.17                        0.12   
min                            0.00                        0.00   
25%                            0.00                        0.00   
50%                            0.00                        0.00   
75%                            0.00                        0.00   
max                            1.00                        1.00   

Description  12 EGG HOUSE PAINTED WOOD  12 MESSAGE CARDS WITH ENVELOPES  \
count                         3.92e+02                         3.92e+02   
mean                          2.55e-03                         5.10e-03   
std                           5.05e-02                         7.13e-02   
min                           0.00e+00                         0.00e+00   
25%                  

In [18]:
# Highest single-item support (largest column mean of the 0/1 table)
basket_sets.mean().max()


0.18877551020408162

### Generating frequent ```itemsets``` and association rules



In [19]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
print(frequent_itemsets)

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
print("\nSome of the generated association rules:\n", rules.head())
print("\nShape of the rules table:", rules.shape)

    support                                           itemsets
0      0.07                      (4 TRADITIONAL SPINNING TOPS)
1      0.10                       (ALARM CLOCK BAKELIKE GREEN)
2      0.10                        (ALARM CLOCK BAKELIKE PINK)
3      0.09                         (ALARM CLOCK BAKELIKE RED)
4      0.08                     (BAKING SET 9 PIECE RETROSPOT)
5      0.07                     (CHILDRENS CUTLERY DOLLY GIRL)
6      0.10                             (DOLLY GIRL LUNCH BOX)
7      0.10                          (JUMBO BAG RED RETROSPOT)
8      0.08                       (JUMBO BAG WOODLAND ANIMALS)
9      0.12                           (LUNCH BAG APPLE DESIGN)
10     0.08                      (LUNCH BAG DOLLY GIRL DESIGN)
11     0.15                          (LUNCH BAG RED RETROSPOT)
12     0.12                        (LUNCH BAG SPACEBOY DESIGN)
13     0.12                               (LUNCH BAG WOODLAND)
14     0.14                 (LUNCH BOX WITH CUTLERY RET

### Example filters on association rules

In [20]:
rules[ (rules['lift'] >= 6) &
       (rules['confidence'] >= 0.8) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.1,0.09,0.08,0.82,8.64,0.07,4.92
3,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.09,0.1,0.08,0.84,8.64,0.07,5.57
17,(SET/6 RED SPOTTY PAPER PLATES),(SET/20 RED RETROSPOT PAPER NAPKINS),0.13,0.13,0.1,0.8,6.03,0.09,4.34
18,(SET/6 RED SPOTTY PAPER CUPS),(SET/6 RED SPOTTY PAPER PLATES),0.14,0.13,0.12,0.89,6.97,0.1,7.85
19,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.13,0.14,0.12,0.96,6.97,0.1,21.56
20,"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...",(SET/6 RED SPOTTY PAPER PLATES),0.1,0.13,0.1,0.97,7.64,0.09,34.9
21,"(SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED...",(SET/6 RED SPOTTY PAPER CUPS),0.1,0.14,0.1,0.97,7.08,0.09,34.49
22,"(SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY...",(SET/20 RED RETROSPOT PAPER NAPKINS),0.12,0.13,0.1,0.81,6.12,0.08,4.63


In [21]:
basket['ALARM CLOCK BAKELIKE GREEN'].sum()

340.0

In [22]:
basket['ALARM CLOCK BAKELIKE RED'].sum()

316.0

In [23]:
basket_sets['ALARM CLOCK BAKELIKE RED'].sum()

37




### Market basket analysis for Germany

This code mirrors the one used to generate the rules for France. The goal is to show how the minimum support and minimum confidence can vary from one dataset to another. One country may have a more homogeneous purchasing profile and produce rules with higher support, while another may produce rules with lower support.
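One way to reason about a support threshold is to translate it into a number of invoices. The sketch below does this for the French data, where the 0/1 table has 392 invoices and a minimum support of 0.07 was used. The helper name `min_invoice_count` is an assumption for illustration, and the rounding is only approximate when the product falls very close to a whole number.

```python
import math


def min_invoice_count(min_support, n_invoices):
    """Smallest number of invoices an itemset needs to reach `min_support`."""
    return math.ceil(min_support * n_invoices)


# France: 392 invoices with min_support=0.07
print(min_invoice_count(0.07, 392))

# ALARM CLOCK BAKELIKE RED appears in 37 of the 392 French invoices.
print(round(37 / 392, 2))
```

For France, an itemset must appear in at least 28 invoices to be kept. The same relative threshold corresponds to a different absolute count in a dataset with a different number of invoices, which is one reason a threshold may need adjusting per country.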

In [24]:
basket2 = (df[df['Country'] =="Germany"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

basket_sets2 = basket2.applymap(encode_units)
basket_sets2.drop('POSTAGE', inplace=True, axis=1)
frequent_itemsets2 = apriori(basket_sets2, min_support=0.04, use_colnames=True)
rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1)

rules2[ (rules2['lift'] >= 4) &
        (rules2['confidence'] >= 0.5)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
2,(JUMBO BAG RED RETROSPOT),(JUMBO BAG WOODLAND ANIMALS),0.08,0.1,0.05,0.61,6.07,0.04,2.31
9,(PLASTERS IN TIN STRONGMAN),(PLASTERS IN TIN CIRCUS PARADE),0.07,0.12,0.05,0.69,5.93,0.04,2.83
11,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN WOODLAND ANIMALS),0.12,0.14,0.07,0.58,4.24,0.05,2.08
17,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN WOODLAND ANIMALS),0.11,0.14,0.06,0.57,4.15,0.05,2.01
24,(RED RETROSPOT CHARLOTTE BAG),(WOODLAND CHARLOTTE BAG),0.07,0.13,0.06,0.84,6.65,0.05,5.59
38,(SET/6 RED SPOTTY PAPER CUPS),(SET/6 RED SPOTTY PAPER PLATES),0.05,0.06,0.05,0.88,15.38,0.04,7.54
39,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.06,0.05,0.05,0.81,15.38,0.04,4.93
40,"(ROUND SNACK BOXES SET OF4 WOODLAND, PLASTERS ...",(ROUND SNACK BOXES SET OF 4 FRUITS),0.06,0.16,0.04,0.73,4.64,0.03,3.13


In [None]:
rules2