<a href="https://colab.research.google.com/github/Fagner608/MBA_apriori_with_R/blob/main/MBA_APRIORI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will work with a dataset of online sales from several European countries. By the end, we will know which associations exist among the best-selling items, which can lead us to a purchase recommendation model.

## About the dataset

Source: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx)

This dataset contains retail data: transactions from a website based in the United Kingdom.

## Motivation

Based on the transactions already recorded, we want to recommend purchases the customer is likely to make, and thereby increase the average ticket per customer.

## Starting the analysis

In [1]:
# Installing packages
install.packages('htmlwidgets')
install.packages('data.table')
install.packages('arules')
install.packages('tidyr')
install.packages('reshape2')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘stringr’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘plyr’, ‘Rcpp’




In [2]:
# Loading packages

library(htmlwidgets)
library(data.table)
library(arules)
library(tidyr)
library(reshape2)

Loading required package: Matrix


Attaching package: ‘arules’


The following objects are masked from ‘package:base’:

    abbreviate, write



Attaching package: ‘tidyr’


The following objects are masked from ‘package:Matrix’:

    expand, pack, unpack



Attaching package: ‘reshape2’


The following object is masked from ‘package:tidyr’:

    smiths


The following objects are masked from ‘package:data.table’:

    dcast, melt




Installing and loading dplyr, then suppressing warnings

In [3]:
install.packages('dplyr')
library(dplyr)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:arules’:

    intersect, recode, setdiff, setequal, union


The following objects are masked from ‘package:data.table’:

    between, first, last


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [4]:
# The option is named 'warn'; 'war' would only create an unused option
options(warn = -1)

Fetching the data

In [5]:
download.file('https://raw.githubusercontent.com/Fagner608/MBA_apriori_with_R/main/retail.csv', 'dados.csv')

In [6]:
# Reading the data

dados = read.csv('dados.csv')

head(dados)

Unnamed: 0_level_0,X,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<int>,<chr>,<dbl>,<int>,<chr>
1,27,536370,22728,ALARM CLOCK BAKELIKE PINK,24,2010-12-01 08:45:00,3.75,12583,France
2,28,536370,22727,ALARM CLOCK BAKELIKE RED,24,2010-12-01 08:45:00,3.75,12583,France
3,29,536370,22726,ALARM CLOCK BAKELIKE GREEN,12,2010-12-01 08:45:00,3.75,12583,France
4,30,536370,21724,PANDA AND BUNNIES STICKER SHEET,12,2010-12-01 08:45:00,0.85,12583,France
5,31,536370,21883,STARS GIFT TAPE,24,2010-12-01 08:45:00,0.65,12583,France
6,32,536370,10002,INFLATABLE POLITICAL GLOBE,48,2010-12-01 08:45:00,0.85,12583,France


In [7]:
# Dropping the columns that will not be used (InvoiceNo and Description remain)
dados = dados[, -c(1,3,5:9)]

In [8]:
head(dados)

Unnamed: 0_level_0,InvoiceNo,Description
Unnamed: 0_level_1,<chr>,<chr>
1,536370,ALARM CLOCK BAKELIKE PINK
2,536370,ALARM CLOCK BAKELIKE RED
3,536370,ALARM CLOCK BAKELIKE GREEN
4,536370,PANDA AND BUNNIES STICKER SHEET
5,536370,STARS GIFT TAPE
6,536370,INFLATABLE POLITICAL GLOBE


Let's create a helper function that takes the columns and checks for any data pattern that does not look normal.

In [9]:
# Function that reports NAs, distinct values and storage type for each column
info_cols = function(df){

  na_s = colSums(is.na(df))
  # sapply() visits each column as stored; apply() would first coerce the whole
  # data frame to a character matrix and report every column as "character"
  unique_v = sapply(df, n_distinct)
  tipo_dados = sapply(df, typeof)

  return(data.frame(na = na_s, valores_unicos = unique_v, tipo_dado = tipo_dados))
}

In [10]:
info_cols(dados)

Unnamed: 0_level_0,na,valores_unicos,tipo_dado
Unnamed: 0_level_1,<dbl>,<int>,<chr>
InvoiceNo,0,461,character
Description,0,1565,character
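
A caveat worth keeping in mind for per-column summaries of a data frame: `apply()` first converts the whole data frame to a matrix, so with mixed columns every column is reported as character, while `sapply()` inspects each column as it is stored. A minimal sketch on a hypothetical toy data frame (the `toy` values are made up for illustration):

```r
# Hypothetical toy data: one integer column and one character column
toy = data.frame(id = c(1L, 2L, 2L),
                 item = c("a", "b", NA),
                 stringsAsFactors = FALSE)

# apply() coerces the data frame to a character matrix before calling typeof()
types_apply = apply(toy, 2, typeof)

# sapply() calls typeof() on each column separately
types_sapply = sapply(toy, typeof)

print(types_apply)
print(types_sapply)
```

This matters once a column such as InvoiceNo is converted to integer, because only the `sapply()` version would show the new type.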


Let's check whether the InvoiceNo column contains alphabetic characters

In [11]:
indices = grep("[A-Za-z]", dados$InvoiceNo)

In [12]:
length(indices)

In [13]:
head(dados[indices, ])

Unnamed: 0_level_0,InvoiceNo,Description
Unnamed: 0_level_1,<chr>,<chr>
179,C537893,SILK PURSE BABUSHKA BLUE
180,C537893,CHILDS BREAKFAST SET SPACEBOY
181,C537893,DOLLY GIRL LUNCH BOX
355,C539104,LUNCH BAG DOLLY GIRL DESIGN
357,C539114,RECIPE BOX RETROSPOT
440,C540151,RED RETROSPOT CAKE STAND


These are cancelled purchases. Let's remove them from the analysis.

In [14]:
dados_clean = dados[-indices, ]
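
Negative indexing works here because matches were found. As a hedged aside, a logical mask built with `grepl()` behaves the same way and also stays correct when nothing matches, whereas `x[-integer(0), ]` returns zero rows. A small sketch on made-up invoices (the `invoices` data frame is hypothetical):

```r
# Hypothetical invoices: one of them carries the "C" (cancelled) prefix
invoices = data.frame(InvoiceNo = c("536370", "C537893", "536852"),
                      Description = c("A", "B", "C"),
                      stringsAsFactors = FALSE)

# TRUE for every row whose invoice number contains a letter
has_letter = grepl("[A-Za-z]", invoices$InvoiceNo)
kept = invoices[!has_letter, ]

# Pitfall: with no match, grep() returns integer(0) and negative indexing drops everything
no_match = grep("Z", invoices$InvoiceNo)
dropped_everything = invoices[-no_match, ]

print(kept)
print(nrow(dropped_everything))
```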

In [15]:
info_cols(dados_clean)

Unnamed: 0_level_0,na,valores_unicos,tipo_dado
Unnamed: 0_level_1,<dbl>,<int>,<chr>
InvoiceNo,0,392,character
Description,0,1564,character


Let's assign the correct type to the data

In [16]:
dados_clean$InvoiceNo = as.integer(dados_clean$InvoiceNo)

In [17]:
head(dados_clean)

Unnamed: 0_level_0,InvoiceNo,Description
Unnamed: 0_level_1,<int>,<chr>
1,536370,ALARM CLOCK BAKELIKE PINK
2,536370,ALARM CLOCK BAKELIKE RED
3,536370,ALARM CLOCK BAKELIKE GREEN
4,536370,PANDA AND BUNNIES STICKER SHEET
5,536370,STARS GIFT TAPE
6,536370,INFLATABLE POLITICAL GLOBE


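Using the dcast() function from the reshape2 package, we turn the data into a presence matrix in which each row is a purchase (invoice) and each column is a product, recording whether or not it was bought. With fun.aggregate = length, each cell holds how many lines of that invoice mention the product, so values above 1 can occur.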

In [18]:
dados_transacoes = dcast(dados_clean, InvoiceNo ~ Description, fun.aggregate = length)

Using Description as value column: use value.var to override.



In [19]:
head(dados_transacoes)

Unnamed: 0_level_0,InvoiceNo,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TRELLIS COAT RACK,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,⋯,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,536370,0,0,0,0,1,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2,536852,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
3,536974,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
4,537065,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
5,537463,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
6,537468,0,0,0,0,0,0,0,1,0,⋯,0,0,0,0,0,0,0,0,0,0
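
To make the reshaping step concrete, here is a minimal sketch of the same idea on made-up data. It swaps in base R's `table()` instead of `dcast()`, so it is an illustration of the resulting layout and not the notebook's own call; the `purchases` data frame is hypothetical.

```r
# Hypothetical long-format data: one row per invoice line
purchases = data.frame(InvoiceNo = c(1, 1, 1, 2, 2),
                       Description = c("MUG", "MUG", "PLATE", "PLATE", "CUP"),
                       stringsAsFactors = FALSE)

# Rows are invoices, columns are products, cells count the invoice lines
counts = table(purchases$InvoiceNo, purchases$Description)

# Presence version: 1 if the product appears in the invoice, 0 otherwise
presence = (counts > 0) * 1

print(counts)
print(presence)
```

Invoice 1 lists MUG twice, so the count matrix holds a 2 where the presence matrix holds a 1.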


In [20]:
dim(dados_transacoes)

Selecting the top 50 best-selling products

In [21]:
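# Counting zeros per product column and sorting ascending puts the products
# present in the most invoices first
top_50 = head(sort(colSums(dados_transacoes[, -1] == 0), decreasing = FALSE), 50)

dados_transacoes = dados_transacoes[, c('InvoiceNo', names(top_50))]

# Drops column 2, the first of the 50 products, leaving 49 item columns
dados_transacoes = dados_transacoes[, -2]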
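
The ranking relies on the fact that the product with the fewest zeros is the one present in the most invoices. A short sketch on a hypothetical count matrix, showing that both orderings agree:

```r
# Hypothetical count matrix: 3 invoices (rows) x 3 products (columns)
counts = matrix(c(1, 0, 1,
                  0, 0, 1,
                  2, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("A", "B", "C")))

# Fewest zeros first
by_zeros = names(sort(colSums(counts == 0), decreasing = FALSE))

# Most invoices containing the product first
by_presence = names(sort(colSums(counts > 0), decreasing = TRUE))

print(by_zeros)
print(by_presence)
```

Counting presences directly reads closer to the intent, at the cost of flipping the sort direction.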

In [22]:
head(dados_transacoes)

Unnamed: 0_level_0,InvoiceNo,RABBIT NIGHT LIGHT,RED TOADSTOOL LED NIGHT LIGHT,PLASTERS IN TIN WOODLAND ANIMALS,PLASTERS IN TIN CIRCUS PARADE,ROUND SNACK BOXES SET OF4 WOODLAND,LUNCH BAG RED RETROSPOT,LUNCH BOX WITH CUTLERY RETROSPOT,PLASTERS IN TIN SPACEBOY,RED RETROSPOT MINI CASES,⋯,SPACEBOY BIRTHDAY CARD,ASSORTED COLOUR MINI CASES,CHARLOTTE BAG APPLES DESIGN,CHILDRENS CUTLERY SPACEBOY,CIRCUS PARADE LUNCH BOX,COFFEE MUG APPLES DESIGN,LUNCH BOX I LOVE LONDON,RED HARMONICA IN BOX,RED RETROSPOT CHILDRENS UMBRELLA,SET OF 2 TEA TOWELS APPLE AND PEARS
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,536370,0,1,0,0,1,0,0,0,0,⋯,0,0,0,0,1,0,1,0,0,0
2,536852,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
3,536974,0,0,0,0,0,1,1,0,1,⋯,0,1,0,0,0,0,1,1,0,0
4,537065,0,0,1,0,1,1,1,1,1,⋯,0,1,0,0,0,0,0,0,0,0
5,537463,0,1,1,0,1,0,1,0,0,⋯,0,0,0,0,0,0,0,0,1,1
6,537468,0,0,0,0,0,1,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0


Converting the data to a matrix

In [23]:
dados_matrix = as.matrix(dados_transacoes[ , -1])

Casting to the 'transactions' type

In [24]:
transacoes = as(dados_matrix, 'transactions')

“matrix contains values other than 0 and 1! Setting all entries != 0 to 1.”
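
The message about values other than 0 and 1 appears when the matrix holds counts instead of flags. One way to make the conversion explicit, sketched here on a made-up count matrix and assuming the arules package is installed, is to turn the counts into a logical matrix before casting:

```r
library(arules)

# Hypothetical count matrix: 3 invoices (rows) x 3 products (columns)
counts = matrix(c(1, 0, 1,
                  0, 0, 1,
                  2, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("A", "B", "C")))

# TRUE where the product appears in the invoice, regardless of how many times
presence = counts > 0

# A logical matrix maps directly to transactions: one transaction per row
toy_transactions = as(presence, "transactions")

inspect(toy_transactions)
print(size(toy_transactions))  # number of items in each transaction
```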


Applying apriori

In [25]:
regras = apriori(transacoes,
                parameter = list(conf = 0.5, supp = 0.01, minlen = 3))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5    0.01      3
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 3 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[49 item(s), 392 transaction(s)] done [0.00s].
sorting and recoding items ... [49 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10

“Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!”


 done [0.01s].
writing ... [27563 rule(s)] done [0.01s].
creating S4 object  ... done [0.01s].


Filtering out redundant rules

In [26]:
regras_clean = regras[!is.redundant(regras)]

In [27]:
length(regras_clean)

Inspecting the top 20 rules by support

In [28]:
inspect(head(sort(regras_clean, by = 'support', decreasing = T),20))

     lhs                                       rhs                                      support confidence   coverage     lift count
[1]  {SET/20 RED RETROSPOT PAPER NAPKINS ,                                                                                          
      SET/6 RED SPOTTY PAPER PLATES}        => {SET/6 RED SPOTTY PAPER CUPS}         0.09948980  0.9750000 0.10204082 7.077778    39
[2]  {SET/6 RED SPOTTY PAPER CUPS,                                                                                                  
      SET/6 RED SPOTTY PAPER PLATES}        => {SET/20 RED RETROSPOT PAPER NAPKINS } 0.09948980  0.8125000 0.12244898 6.125000    39
[3]  {SET/6 RED SPOTTY PAPER CUPS,                                                                                                  
      SET/20 RED RETROSPOT PAPER NAPKINS }  => {SET/6 RED SPOTTY PAPER PLATES}       0.09948980  0.9750000 0.10204082 7.644000    39
[4]  {PLASTERS IN TIN CIRCUS PARADE ,                                
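
The quality measures printed for rule [1] can be reproduced by hand from the numbers in the output. The sketch below uses only values shown above (392 transactions, count 39, coverage 0.10204082, lift 7.077778); the variable names are made up, and the LHS and RHS counts are backed out from those printed values, not read from the data.

```r
n_transactions = 392     # from the apriori log
rule_count = 39          # 'count' column of rule [1]
coverage = 0.10204082    # 'coverage' column of rule [1]
lift = 7.077778          # 'lift' column of rule [1]

# support = share of transactions that contain LHS and RHS together
support = rule_count / n_transactions

# coverage = share of transactions that contain the LHS
lhs_count = round(coverage * n_transactions)

# confidence = support(LHS and RHS) / support(LHS)
confidence = rule_count / lhs_count

# lift = confidence / support(RHS), so the RHS support can be backed out
rhs_support = confidence / lift
rhs_count = round(rhs_support * n_transactions)

print(c(support = support, confidence = confidence, rhs_support = rhs_support))
```

A lift near 7 means the RHS item shows up about seven times more often in invoices containing the LHS than in invoices overall.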

Note that the product descriptions carry more than the product's name (its noun); they also include characteristics, some of them irrelevant, such as color.

This result may well have value, but what we want to see are the possible associations between products.

To do that, we go back to the start of the analysis and apply one more treatment before the transformation.

Let's reuse the labels from top_50

In [29]:
names(top_50)

Applying a filter to the data


In [30]:
dados_clean_2 = dados_clean

Now we apply a function to the product descriptions to standardize their names, discarding characteristics that do not interest us for now

In [31]:
lista_produtos = c('NIGHT LIGHT', 'PLASTERS', 'SNACK BOXES',
                  'LUNCH BAG', 'LUNCH BOX', 'MINI CASES', 'PAPER CUPS',
                  'PAPER NAPKINS', 'PAPER PLATES', 'REGENCY CAKESTAND', 'ALARM CLOCK',
                  'CAKE CASES', 'JUMBO BAG', 'BIRTHDAY CARD', 'TEA SET', 'BAKING SET', 'PAPER BUNTING',
                  'RED POLKADOT PARTY CANDLES', 'SPINNING TOPS', 'CHILDRENS CUTLERY', 'PICNIC BAG', 'CHARLOTTE BAG',
                  'COFFEE MUG', 'TEA TOWELS', 'CHILDRENS UMBRELLA')

In [32]:
for(i in lista_produtos){

  indices = grep(i, dados_clean_2$Description)
  dados_clean_2$Description[indices] = i

}

In [33]:
head(dados_clean_2)

Unnamed: 0_level_0,InvoiceNo,Description
Unnamed: 0_level_1,<int>,<chr>
1,536370,ALARM CLOCK
2,536370,ALARM CLOCK
3,536370,ALARM CLOCK
4,536370,PANDA AND BUNNIES STICKER SHEET
5,536370,STARS GIFT TAPE
6,536370,INFLATABLE POLITICAL GLOBE
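
The effect of the loop can be seen on a handful of made-up descriptions. This sketch uses `grepl()` with `fixed = TRUE` so the labels are matched as literal text; that flag and the variable names are choices made for the illustration, not part of the original cell.

```r
# Hypothetical descriptions and the generic labels to collapse them into
descriptions = c("ALARM CLOCK BAKELIKE PINK",
                 "ALARM CLOCK BAKELIKE RED",
                 "LUNCH BAG RED RETROSPOT",
                 "STARS GIFT TAPE")
labels = c("ALARM CLOCK", "LUNCH BAG")

standardized = descriptions
for (label in labels) {
  # Every description containing the label is replaced by the label itself
  contains_label = grepl(label, standardized, fixed = TRUE)
  standardized[contains_label] = label
}

print(standardized)
```

Descriptions that match no label, such as the gift tape, are left untouched and have to be filtered out in a later step.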


Filtering the products to keep only those in our list

In [34]:
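# %in% tests each description for membership in lista_produtos. The original cell
# used `==`, which recycles lista_produtos element by element (R warned that the
# "longer object length is not a multiple of shorter object length") and keeps only
# rows that happen to line up; the outputs recorded below come from that run.
dados_clean_2 = dados_clean_2[dados_clean_2$Description %in% lista_produtos, ]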
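
Filtering a column against a list of allowed values is a membership test. A minimal sketch on hypothetical toy values, contrasting `==` (which recycles the shorter vector element by element and can warn that the longer object length is not a multiple of the shorter) with `%in%`:

```r
# Hypothetical descriptions and allowed labels
descriptions = c("LUNCH BAG", "PAPER CUPS", "STARS GIFT TAPE", "LUNCH BAG", "PAPER CUPS")
allowed = c("PAPER CUPS", "LUNCH BAG")

# `==` compares position by position against PAPER CUPS, LUNCH BAG, PAPER CUPS, ...
by_equality = suppressWarnings(descriptions == allowed)

# %in% asks, for each description, whether it appears anywhere in `allowed`
by_membership = descriptions %in% allowed

print(by_equality)
print(by_membership)
```

With `==`, a matching row survives only when it happens to sit at a position where the recycled label is the same, so most valid rows are silently lost.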


In [35]:
dim(dados_clean_2)

In [36]:
dados_transacoes = dcast(dados_clean_2, InvoiceNo ~ Description, fun.aggregate = length)

Using Description as value column: use value.var to override.



Reordering columns

In [37]:
reordered = sort(colSums(dados_transacoes[, -1] == 0), decreasing = F)

dados_transacoes = dados_transacoes[, c('InvoiceNo', names(reordered))]

In [38]:
head(dados_transacoes)

Unnamed: 0_level_0,InvoiceNo,LUNCH BAG,ALARM CLOCK,JUMBO BAG,BIRTHDAY CARD,CHILDRENS CUTLERY,COFFEE MUG,LUNCH BOX,CHARLOTTE BAG,NIGHT LIGHT,⋯,BAKING SET,CHILDRENS UMBRELLA,PAPER CUPS,PICNIC BAG,TEA SET,TEA TOWELS,PAPER BUNTING,RED POLKADOT PARTY CANDLES,REGENCY CAKESTAND,SPINNING TOPS
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,537065,0,1,0,0,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2,537463,0,0,0,1,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
3,539607,0,0,0,0,0,1,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
4,540178,0,0,0,0,0,0,0,0,0,⋯,0,0,1,0,0,0,0,0,0,0
5,540521,0,0,0,0,0,0,0,0,0,⋯,0,0,0,0,0,1,0,0,0,0
6,540835,0,0,0,0,0,0,0,1,0,⋯,0,0,0,0,0,0,0,0,0,0


In [39]:
dados_matrix = as.matrix(dados_transacoes[ , -1])

In [40]:
transacoes = as(dados_matrix, 'transactions')

In [41]:
regras = apriori(transacoes,
                parameter = list(conf = 0.5, supp = 0.01, minlen = 3))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5    0.01      3
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 0 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[25 item(s), 89 transaction(s)] done [0.00s].
sorting and recoding items ... [25 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 done [0.00s].
writing ... [63 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


In [42]:
regras_clean = regras[!is.redundant(regras)]

In [43]:
length(regras_clean)

In [46]:
inspect(head(sort(regras_clean, by = 'support', decreasing = T), 20))

     lhs                                 rhs                 support   
[1]  {COFFEE MUG, PAPER BUNTING}      => {LUNCH BOX}         0.01123596
[2]  {LUNCH BOX, PAPER BUNTING}       => {COFFEE MUG}        0.01123596
[3]  {COFFEE MUG, LUNCH BOX}          => {PAPER BUNTING}     0.01123596
[4]  {PAPER NAPKINS, PAPER CUPS}      => {PAPER PLATES}      0.01123596
[5]  {PAPER PLATES, PAPER CUPS}       => {PAPER NAPKINS}     0.01123596
[6]  {PAPER PLATES, PAPER NAPKINS}    => {PAPER CUPS}        0.01123596
[7]  {JUMBO BAG, CAKE CASES}          => {CHILDRENS CUTLERY} 0.01123596
[8]  {CHILDRENS CUTLERY, CAKE CASES}  => {JUMBO BAG}         0.01123596
[9]  {JUMBO BAG, CHILDRENS CUTLERY}   => {CAKE CASES}        0.01123596
[10] {CHILDRENS CUTLERY, NIGHT LIGHT} => {ALARM CLOCK}       0.01123596
[11] {ALARM CLOCK, NIGHT LIGHT}       => {CHILDRENS CUTLERY} 0.01123596
[12] {JUMBO BAG, SNACK BOXES}         => {CHILDRENS CUTLERY} 0.01123596
[13] {CHILDRENS CUTLERY, SNACK BOXES} => {JUMBO BAG}         0.0
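
A quick arithmetic check helps when reading these supports. Using only numbers printed above (89 transactions, minimum support 0.01, rule support 0.01123596), the sketch below suggests that each of these rules rests on a single invoice; the variable names are made up for the illustration.

```r
n_transactions = 89          # from the apriori log
min_support = 0.01           # supp parameter passed to apriori
printed_support = 0.01123596 # support shown for the listed rules

# The threshold expressed in invoices: less than one
threshold_in_invoices = min_support * n_transactions

# Number of invoices behind a rule with the printed support
supporting_invoices = round(printed_support * n_transactions)

print(threshold_in_invoices)
print(supporting_invoices)
```

With a threshold below one invoice, any combination that occurs once qualifies, which is consistent with the "Absolute minimum support count: 0" line in the log.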

Analysis in progress

In [48]:
install.packages('arulesViz')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘iterators’, ‘foreach’, ‘zoo’, ‘tweenr’, ‘polyclip’, ‘RcppEigen’, ‘gridExtra’, ‘RcppArmadillo’, ‘later’, ‘TSP’, ‘qap’, ‘gclus’, ‘ca’, ‘registry’, ‘lmtest’, ‘ggforce’, ‘ggrepel’, ‘viridis’, ‘tidygraph’, ‘graphlayouts’, ‘crosstalk’, ‘promises’, ‘lazyeval’, ‘seriation’, ‘vcd’, ‘igraph’, ‘scatterplot3d’, ‘ggraph’, ‘DT’, ‘plotly’, ‘visNetwork’




In [49]:
library(arulesViz)

In [56]:
install.packages('plotly')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [57]:
library(plotly)

Loading required package: ggplot2


Attaching package: ‘plotly’


The following object is masked from ‘package:ggplot2’:

    last_plot


The following object is masked from ‘package:stats’:

    filter


The following object is masked from ‘package:graphics’:

    layout




In [59]:
plot(regras_clean, measure = "support", shading = "confidence", method = "graph", engine = 'plotly')

ERROR: ignored

In [53]:
help(arulesViz)