# **Frequent Pattern Mining**
*Prof. Orlando Junior*

Frequent pattern mining aims to discover recurring patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

## Libraries

In [1]:
!pip install apyori

Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: apyori
  Building wheel for apyori (setup.py) ... done
  Created wheel for apyori: filename=apyori-1.1.2-py3-none-any.whl size=5955 sha256=631e960a64cc499bd5637ce9a29ee6d94e029ce1e30d2d6ae27e9a4906d45aef
  Stored in directory: /root/.cache/pip/wheels/c4/1a/79/20f55c470a50bb3702a8cb7c94d8ada15573538c7f4baebe2d
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
from apyori import apriori

## Data Loading

In [8]:
df = pd.read_csv('store_data.csv', sep=',', header=None)

In [9]:
print(df.shape)

(7501, 20)


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7501 non-null   object
 1   1       5747 non-null   object
 2   2       4389 non-null   object
 3   3       3345 non-null   object
 4   4       2529 non-null   object
 5   5       1864 non-null   object
 6   6       1369 non-null   object
 7   7       981 non-null    object
 8   8       654 non-null    object
 9   9       395 non-null    object
 10  10      256 non-null    object
 11  11      154 non-null    object
 12  12      87 non-null     object
 13  13      47 non-null     object
 14  14      25 non-null     object
 15  15      8 non-null      object
 16  16      4 non-null      object
 17  17      4 non-null      object
 18  18      3 non-null      object
 19  19      1 non-null      object
dtypes: object(20)
memory usage: 1.1+ MB


In [11]:
# Preview the data
# Note: each row is one transaction and each column is an item position within the basket, not a specific product.
# NaN means the transaction has fewer items than that position.
df.head(10)

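|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | shrimp | almonds | avocado | vegetables mix | green grapes | whole weat flour | yams | cottage cheese | energy drink | tomato juice | low fat yogurt | green tea | honey | salad | mineral water | salmon | antioxydant juice | frozen smoothie | spinach | olive oil |
| 1 | burgers | meatballs | eggs |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 2 | chutney |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 3 | turkey | avocado |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 4 | mineral water | milk | energy bar | whole wheat rice | green tea |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 5 | low fat yogurt |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 6 | whole wheat pasta | french fries |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 7 | soup | light cream | shallot |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 8 | frozen vegetables | spaghetti | green tea |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| 9 | french fries |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |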
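Because each row is one transaction padded with NaN up to 20 columns, the number of non-null cells in a row gives the basket size. A minimal sketch on a hypothetical toy frame (`toy` stands in for `df`; it is not the original data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one transaction per row, padded with NaN
toy = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", np.nan, np.nan],
    ["turkey", "avocado", np.nan],
])

# Count the non-null cells in each row to get the basket size
basket_sizes = toy.notna().sum(axis=1)
print(basket_sizes.tolist())
```

Applied to `df`, the same expression would presumably return one basket size per transaction.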


## Exploratory Analysis

In [12]:
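# Flatten the DataFrame into a one-dimensional NumPy array
# Note: np.array casts the mix of strings and NaN to strings, so missing cells become "nan"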
transacoes = []

for i in range(0, df.shape[0]):
    for j in range(0, df.shape[1]):
        transacoes.append(df.values[i,j])

transacoes = np.array(transacoes)

In [13]:
df_t = pd.DataFrame(transacoes, columns=["items"])

In [14]:
df_t

|  | items |
|---|---|
| 0 | shrimp |
| 1 | almonds |
| 2 | avocado |
| 3 | vegetables mix |
| 4 | green grapes |
| ... | ... |
| 150015 |  |
| 150016 |  |
| 150017 |  |
| 150018 |  |


In [15]:
# Give every occurrence a count of 1
df_t["incidencias"] = 1

# Drop missing values (stored as the string "nan" after the NumPy conversion)
indexNames = df_t[df_t['items'] == "nan"].index
df_t.drop(indexNames, inplace=True)

# Build a per-item frequency table, sorted in descending order, for display
df_table = df_t.groupby("items").sum().sort_values("incidencias", ascending=False).reset_index()

In [16]:
# Show the 10 most frequent products
df_table.head(10).style.background_gradient(cmap='Greens')

|  | items | incidencias |
|---|---|---|
| 0 | mineral water | 1788 |
| 1 | eggs | 1348 |
| 2 | spaghetti | 1306 |
| 3 | french fries | 1282 |
| 4 | chocolate | 1230 |
| 5 | green tea | 991 |
| 6 | milk | 972 |
| 7 | ground beef | 737 |
| 8 | frozen vegetables | 715 |
| 9 | pancakes | 713 |
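The same kind of frequency table can be obtained without the explicit double loop. The sketch below is one possible alternative, shown on a hypothetical toy frame rather than on `df`; it flattens the values, drops the NaN padding, and counts each item with `Series.value_counts`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one transaction per row, padded with NaN
toy = pd.DataFrame([
    ["milk", "eggs", np.nan],
    ["eggs", np.nan, np.nan],
    ["bread", "milk", "eggs"],
])

# Flatten all cells, discard the NaN padding, then count each item
counts = pd.Series(toy.to_numpy().ravel()).dropna().value_counts()
print(counts)
```

`value_counts` sorts in descending order by default, so `counts.head(10)` would play the role of the top-10 table.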


## Data Preparation

In [30]:
# Convert the dataset into a list of lists, one per transaction
# (str() turns each missing cell into the string 'nan')
registros = []

for i in range(0, df.shape[0]):
    registros.append([str(df.values[i, j]) for j in range(0, df.shape[1])])

In [31]:
# Show the first record
registros[0:1]

[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil']]
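Since `str()` converts the NaN padding into the literal `'nan'`, every transaction with fewer than 20 items carries `'nan'` as if it were a product. One possible alternative, sketched here on a hypothetical toy frame, keeps only the non-null cells. The name `clean_records` is illustrative, and using such a list with `df` would change the rule counts and metrics reported in the sections that follow, which were produced with the original `registros`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one transaction per row, padded with NaN
toy = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", np.nan, np.nan],
])

# Keep only the real items of each row; pd.notna filters out the NaN padding
clean_records = [
    [str(item) for item in row if pd.notna(item)]
    for row in toy.to_numpy()
]
print(clean_records)
```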

## Learning

In [32]:
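# Learn the association rules with
# a minimum support of 0.3%,
# a minimum confidence of 20%,
# and a minimum lift of 3
# Note: apyori 1.1.2 appears to read only min_support, min_confidence, min_lift, and max_length, so min_length seems to have no effect
regras = apriori(registros, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2)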
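For reference, the thresholds refer to the usual association-rule metrics: support is the fraction of transactions that contain an itemset, the confidence of A → B is support(A ∪ B) / support(A), and lift is that confidence divided by support(B). A minimal sketch in plain Python on hypothetical toy transactions; the helpers `support`, `confidence`, and `lift` are illustrative and not part of apyori.

```python
# Hypothetical toy transactions, each represented as a set of items
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A and B together) divided by support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of A -> B divided by support(B)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"milk", "bread"}, transactions))
print(confidence({"milk"}, {"bread"}, transactions))
print(lift({"milk"}, {"bread"}, transactions))
```

A lift above 1 suggests the items occur together more often than independence would predict, which is why a threshold such as `min_lift = 3` keeps only strongly associated pairs.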

In [33]:
# Materialize the generator returned by apriori into a list
lista_regras = list(regras)

In [34]:
print(f"# Rules: {len(lista_regras)}")

# Rules: 160


In [35]:
# Print one of the records found
print(lista_regras[3])

RelationRecord(items=frozenset({'fromage blanc', 'honey'}), support=0.003332888948140248, ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confidence=0.2450980392156863, lift=5.164270764485569)])
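The printed record can be sanity-checked with a little arithmetic, since confidence = support(A ∪ B) / support(A) and lift = confidence / support(B). The sketch below reuses only the numbers printed above and the 7,501 transactions reported by `df.shape`; the variable names are illustrative.

```python
n_transactions = 7501                 # from df.shape
support_ab = 0.003332888948140248     # support of {'fromage blanc', 'honey'}
confidence_ab = 0.2450980392156863    # confidence of fromage blanc -> honey
lift_ab = 5.164270764485569           # lift of fromage blanc -> honey

# Invert the definitions to recover the supports of each side
support_a = support_ab / confidence_ab   # implied support of {'fromage blanc'}
support_b = confidence_ab / lift_ab      # implied support of {'honey'}

# Convert the supports back to approximate transaction counts
count_ab = round(support_ab * n_transactions)
count_a = round(support_a * n_transactions)
count_b = round(support_b * n_transactions)
print(count_ab, count_a, count_b)
```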


## Evaluation

In [36]:
# Helper to tabulate the results
# Note: it keeps only the first item of each side and the first ordered statistic of each record
def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))
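Because `inspect` takes only the first item of each side and only the first `ordered_statistics` entry, rules with several items in the antecedent or consequent show up truncated. A possible variant that keeps every item and every ordered statistic is sketched below. To stay runnable without apyori, it uses `namedtuple` stand-ins that mirror the field names visible in the printed `RelationRecord`; `inspect_all` is a hypothetical name.

```python
from collections import namedtuple

# Stand-ins that mirror the fields printed by apyori (assumption: same names and order)
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

def inspect_all(results):
    """Return one row per ordered statistic, keeping the full itemsets."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append((
                ", ".join(sorted(stat.items_base)),  # full antecedent
                ", ".join(sorted(stat.items_add)),   # full consequent
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return rows

# Example built from the record printed earlier in the notebook
example = [RelationRecord(
    items=frozenset({"fromage blanc", "honey"}),
    support=0.003332888948140248,
    ordered_statistics=[OrderedStatistic(
        items_base=frozenset({"fromage blanc"}),
        items_add=frozenset({"honey"}),
        confidence=0.2450980392156863,
        lift=5.164270764485569,
    )],
)]
rows = inspect_all(example)
print(rows)
```

The resulting rows have the same five fields as the output of `inspect`, so they could be passed to `pd.DataFrame` with the same column names.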

In [37]:
# Summarize the results in a DataFrame
df_resultados = pd.DataFrame(inspect(lista_regras),
                             columns = ['Antecedente', 'Consequente', 'Suporte', 'Confiança', 'Lift'])

In [38]:
# Display the results DataFrame
df_resultados.head(10)

|  | Antecedente | Consequente | Suporte | Confiança | Lift |
|---|---|---|---|---|---|
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 9 | spaghetti | milk | 0.003333 | 0.416667 | 3.215449 |


In [39]:
# Display the 10 results with the highest lift
df_resultados.nlargest(n = 10, columns = 'Lift')

|  | Antecedente | Consequente | Suporte | Confiança | Lift |
|---|---|---|---|---|---|
| 97 | frozen vegetables | milk | 0.003066 | 0.383333 | 7.987176 |
| 150 | frozen vegetables |  | 0.003066 | 0.383333 | 7.987176 |
| 96 | frozen vegetables | milk | 0.003333 | 0.294118 | 6.128268 |
| 149 | frozen vegetables |  | 0.003333 | 0.294118 | 6.128268 |
| 132 | mineral water |  | 0.003866 | 0.402778 | 6.128268 |
| 59 | mineral water | olive oil | 0.003866 | 0.402778 | 6.115863 |
| 50 | tomato sauce | spaghetti | 0.003066 | 0.216981 | 5.535971 |
| 122 | tomato sauce |  | 0.003066 | 0.216981 | 5.535971 |
| 28 | fromage blanc |  | 0.003333 | 0.245098 | 5.178818 |
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
