<a href="https://colab.research.google.com/github/FranciscoSales1968/app-pinheirosupermercado/blob/main/app_pinheirosupermercado.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Association Rule Mining with the Apriori Algorithm

In [None]:
"""
Application..................: app-apriori-python-market-supermercadopinheiro.ipynb
Data analyst.................: Esp Francisco José Sales Sampaio
Consulting...................: Esp Francisco José Sales Sampaio
--------------------------------------------------------------------------------------------------------------------
Created on...................: February 7, 2023
Modified on..................: February 8, 2023
Summary......................: Algorithm adapted for use by the market.
Client.......................: Supermarkets.
Version......................: Versão.1.0. Rev. 7.Rev.2023. Release. 001
"""

In [None]:
"""
Install the apyori package, which provides the Apriori algorithm for association rule mining.

"""

!pip install apyori  

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

In [None]:
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Importing the phone bill dataset from the telephone carrier.
Upload the file nomearquivo.xlsx

In [None]:
# Upload files from the local machine into the Colab session.
from google.colab import files
files.upload()

{}

In [None]:
"""
Notes / observations:

1. This block reads the data file in xlsx format from Google Drive and stores it in the variable { store_data }
2. The { skiprows } parameter excludes the listed rows
3. The { usecols } parameter specifies the columns that will be used


Example / template:

store_data = pd.read_excel(
'/content/drive/MyDrive/dev-apriori/bilhete_original_006080013_1_8586160492_1.xlsx',
skiprows=[0,1,2,3,4],
usecols=[0,1,2,3,7,8,9,10])

By Francisco José Sales Sampaio, on: 07/02/2023


Attention: The file { venda-cliente.xlsx } must contain the records of the acupuncture points used by the practitioners.
This file will be generated by the client's information technology staff, in xlsx format.

"""
# Create the store_data variable
store_data = pd.read_excel('/content/bilhete_pre_processado_versao_26nov2022_rev_01.xlsx', skiprows=[0], usecols=[1,5,7])
# Note: skips row 0 and uses columns 1, 5, and 7



In [None]:
# Count the records in the dataset
len(store_data)

847

In [None]:
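# Strip leading and trailing whitespace from every column.
# Note: .str.strip() yields NaN for cells that are not strings,
# so numeric cells in a mixed column become NaN.
for column in store_data.columns:
  store_data[column] = store_data[column].str.strip()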
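
The .str accessor returns NaN for any cell that is not a string, so the loop above can discard numeric cells in a mixed column. A minimal sketch of a strip that leaves non-string cells untouched is shown here; the helper name strip_strings and the demo frame with columns a and b are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

def strip_strings(df):
    """Return a copy of df with whitespace stripped from string cells only."""
    out = df.copy()
    for column in out.columns:
        # Non-string cells (numbers, missing values) pass through unchanged.
        out[column] = out[column].map(
            lambda v: v.strip() if isinstance(v, str) else v
        )
    return out

# Hypothetical demo data mixing strings, numbers, and a missing value.
demo = pd.DataFrame({"a": [" 0*100 ", 190, None], "b": ["x ", " y", 3.5]})
cleaned = strip_strings(demo)
print(cleaned)
```

Whether numeric cells should survive this step depends on what the later analysis expects, so treat this as one option rather than a drop-in replacement.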

In [None]:
# Distinct phone numbers.
itens = store_data.melt()['value'].dropna().sort_values()

In [None]:
# Count of distinct items
len(itens.unique())


1

In [None]:
# Print the distinct items.
print(f'Distinct phone numbers: {len(itens.unique())}\n', itens.unique())

Distinct phone numbers: 1
 ['0*100']


Let's call the head() function to see how the dataset looks:

In [None]:
store_data.head(200)

Unnamed: 0,8586160492,8587437187,359035039489450
0,8586160492,8587437187,3.590350e+14
1,8586160492,190,3.590350e+14
2,8586160492,190,3.590350e+14
3,8586160492,190,3.590350e+14
4,8586160492,8534571012,3.590350e+14
...,...,...,...
195,8586160492,8585374835,3.560101e+14
196,8586160492,8585374835,3.560101e+14
197,8586160492,8532521456,3.560101e+14
198,8586160492,8585374835,3.560101e+14


If you look carefully at the data, you can see that the header is actually the first transaction. Each row corresponds to a transaction and each column corresponds to an item in that transaction. A NaN tells us that the transaction has no item in that column.

This dataset has no header row, but by default the pd.read_csv function treats the first row as the header. To avoid losing that row, pass the header=None option to pd.read_csv, as shown below:

In [None]:
store_data = pd.read_csv('store_data.csv', header=None)
store_data.head()
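
To see what header=None changes, the sketch below reads the same two-line CSV text twice. The CSV content is made up for illustration (it reuses a few values visible in the head() output), and io.StringIO stands in for a file on disk:

```python
import io
import pandas as pd

# Two data lines and no header line (illustrative values only).
csv_text = "8586160492,190\n8586160492,8534571012\n"

# Default behaviour: the first line is consumed as column names.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is kept as data and columns are numbered 0, 1, ...
with_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(len(with_default), list(with_default.columns))
print(len(with_no_header), list(with_no_header.columns))
```

The same header=None option exists on pd.read_excel, which is the reader used earlier in this notebook.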

## Data Preprocessing

The Apriori library we are going to use requires our dataset to be in the form of a list of lists, where the whole dataset is a big list and each transaction in the dataset is an inner list within the outer big list. Currently we have data in the form of a pandas dataframe. To convert our pandas dataframe into a list of lists, execute the following script:

In [None]:
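# 7501 is the number of records in the original dataset
# 20 is the number of columns in the original dataset
# This dataset has 847 records and 3 columns.
# Note: str() turns a missing value into the string 'nan', which Apriori then treats as an item.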
records = []
for i in range(0, 847):
    records.append([str(store_data.values[i,j]) for j in range(0, 3)])
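
Because str() converts a missing cell to the text 'nan', the list built above can contain 'nan' as if it were an item. One possible alternative, sketched under the assumption that missing cells should simply be skipped, is shown here; the helper name to_transactions and the demo column names are hypothetical:

```python
import pandas as pd

def to_transactions(df):
    """Convert each row to a list of strings, skipping missing cells."""
    transactions = []
    for row in df.itertuples(index=False):
        transactions.append([str(value) for value in row if pd.notna(value)])
    return transactions

# Hypothetical demo frame: the second row has a missing callee.
demo = pd.DataFrame(
    {"caller": ["A", "A"], "callee": ["B", None], "device": ["D1", "D1"]}
)
print(to_transactions(demo))
```

This also avoids hard-coding the number of rows and columns, since the loop follows whatever shape the DataFrame has.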

## Applying Apriori

The next step is to apply the Apriori algorithm to the dataset. To do so, we can use the apriori function that we imported from the apyori library.

The apriori function takes several parameters. The first is the list of lists that you want to extract rules from. The second is min_support, which keeps only the itemsets whose support is at least the given value. Next, min_confidence keeps only the rules whose confidence is at least the given threshold. Similarly, min_lift sets the minimum lift for the shortlisted rules. Finally, min_length is meant to set the minimum number of items in a rule. Note, however, that apyori appears to read only min_support, min_confidence, min_lift, and max_length from its keyword arguments, so min_length is likely ignored.

Let's suppose that we want rules only for items that are purchased at least 5 times a day, or 7 x 5 = 35 times in one week, since the tutorial dataset covers a one-week period. The support for those items is 35/7500 ≈ 0.0047. The minimum confidence for the rules is 20% or 0.2. Similarly, we set the lift to 3, and min_length to 2 since we want at least two products in our rules. These values are chosen mostly arbitrarily, so you can experiment with them and see how the resulting rules change. The cell below uses different values for this notebook's 847 records: min_support=0.0050 (an itemset must appear in at least 5 records, since 0.005 x 847 ≈ 4.2), min_confidence=0.1, and min_lift=2.
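
The threshold arithmetic in the paragraph above can be checked directly. This is a small sketch; the function name support_for_count is an illustrative assumption:

```python
def support_for_count(count, n_records):
    """Support of an itemset that appears in exactly `count` records."""
    return count / n_records

# Tutorial example: 35 occurrences in about 7500 records.
print(round(support_for_count(35, 7500), 4))

# This notebook: 5 occurrences in 847 records.
print(round(support_for_count(5, 847), 4))
```

Any min_support at or below the second value keeps itemsets that occur in 5 of the 847 records.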

In [None]:
association_rules = apriori(records, min_support=0.0050, min_confidence=0.1, min_lift=2, min_length=2)
association_results = list(association_rules)

In the second line we convert the rules returned by the apriori function into a list, since the results are easier to inspect in that form.

## Viewing the Results

Let's first find the total number of rules mined by the apriori function. Execute the following script:

In [None]:
print(len(association_results))

12


The script above returns 12 for this dataset. Each item corresponds to one frequent itemset together with the rules derived from it.

Let's print the first item in the association_results list to see the first rule. Execute the following script:

In [None]:
print(association_results[0])

RelationRecord(items=frozenset({'0*100', 'nan'}), support=0.06965761511216056, ordered_statistics=[OrderedStatistic(items_base=frozenset({'0*100'}), items_add=frozenset({'nan'}), confidence=1.0, lift=4.785310734463277), OrderedStatistic(items_base=frozenset({'nan'}), items_add=frozenset({'0*100'}), confidence=0.3333333333333333, lift=4.785310734463277)])


The first item in the list is a RelationRecord with three fields: items, support, and ordered_statistics. The items field shows the items in the rule.

For instance, the first record shows that '0*100' and 'nan' occur together. Note that 'nan' is not a real item: it is the string form of a missing cell, produced when each value was converted with str(), so this rule mostly reflects missing data.

The support value for the first record is 0.0697. This number is the number of transactions containing both items divided by the total number of transactions (59 of 847 here). The confidence for the rule '0*100' -> 'nan' is 1.0, which shows that every transaction that contains '0*100' also contains 'nan'. Finally, the lift of 4.79 tells us that 'nan' is 4.79 times more likely to appear in transactions that contain '0*100' than in the dataset overall.
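
The three measures discussed above can be computed by hand on a toy dataset. The sketch below uses the standard definitions (support of the combined itemset, confidence as support divided by the support of the base, lift as confidence divided by the support of the added items); the function name rule_metrics and the toy transactions are assumptions made for illustration:

```python
def rule_metrics(transactions, base, add):
    """Return (support, confidence, lift) for the rule base -> add.

    Assumes base and add each occur in at least one transaction.
    """
    n = len(transactions)
    base, add = set(base), set(add)
    n_base = sum(1 for t in transactions if base <= set(t))
    n_add = sum(1 for t in transactions if add <= set(t))
    n_both = sum(1 for t in transactions if (base | add) <= set(t))
    support = n_both / n              # share of transactions with both sides
    confidence = n_both / n_base      # share of base transactions that also have add
    lift = confidence / (n_add / n)   # confidence relative to add's overall share
    return support, confidence, lift

toy = [["a", "b"], ["a", "b"], ["a"], ["c"]]
print(rule_metrics(toy, ["a"], ["b"]))
```

A lift above 1 means the added items appear more often alongside the base items than they do in the dataset as a whole.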

The following script displays the rule, the support, the confidence, and the lift for each rule more clearly:

In [None]:
for item in association_results:

    # items is a frozenset, so its iteration order is arbitrary: the
    # direction printed here may not match the base -> add direction
    # of the confidence and lift printed below.
    pair = item.items
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # support of the whole itemset
    print("Support: " + str(item.support))

    # confidence and lift of the first ordered statistic
    print("Confidence: " + str(item.ordered_statistics[0].confidence))
    print("Lift: " + str(item.ordered_statistics[0].lift))
    print("=====================================")

Rule: 0*100 -> nan
Support: 0.06965761511216056
Confidence: 1.0
Lift: 4.785310734463277
Rule: 558587118176 -> 356363027640530.0
Support: 0.0059031877213695395
Confidence: 1.0
Lift: 14.116666666666665
Rule: 356363027640530.0 -> 8532261759
Support: 0.0059031877213695395
Confidence: 0.8333333333333333
Lift: 11.763888888888888
Rule: 8587243357 -> 359035039489450.0
Support: 0.0059031877213695395
Confidence: 0.38461538461538464
Lift: 4.46259220231823
Rule: 8587568983 -> nan
Support: 0.0070838252656434475
Confidence: 0.6666666666666666
Lift: 3.190207156308851
Rule: nan -> 8588448552
Support: 0.03423848878394333
Confidence: 0.8529411764705883
Lift: 4.081588567630442


We have already discussed the first rule. Let's now discuss the second rule. The second rule states that 558587118176 and 356363027640530.0 appear together. The support for this pair is 0.0059, which is 5 of the 847 records. The confidence for this rule is 1.0, which means that every transaction containing the rule's base item also contains the other item. Finally, the lift of 14.12 shows that the added item is about 14 times more likely to appear in transactions containing the base item than in the dataset overall.
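
The rule-printing loop earlier takes the two item names from a frozenset, whose order is arbitrary, while confidence and lift belong to a specific base -> add direction. A minimal sketch that reads the direction from each ordered statistic instead is given below. The field names come from the RelationRecord printed earlier; the helper name flatten_rules is hypothetical, and the namedtuple stand-ins (with rounded values) exist only so the sketch runs without apyori installed:

```python
from collections import namedtuple

def flatten_rules(results):
    """Turn RelationRecord-like objects into one dict per directed rule."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append({
                "base": sorted(stat.items_base),
                "add": sorted(stat.items_add),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows

# Stand-ins mirroring the field names of the printed RelationRecord.
Record = namedtuple("Record", ["items", "support", "ordered_statistics"])
Stat = namedtuple("Stat", ["items_base", "items_add", "confidence", "lift"])

sample = [
    Record(
        frozenset({"0*100", "nan"}),
        0.0697,
        [
            Stat(frozenset({"0*100"}), frozenset({"nan"}), 1.0, 4.785),
            Stat(frozenset({"nan"}), frozenset({"0*100"}), 0.3333, 4.785),
        ],
    )
]

for row in flatten_rules(sample):
    print(row["base"], "->", row["add"], row["confidence"], row["lift"])
```

In the notebook, passing association_results in place of sample should give one row per directed rule, with the printed direction matching its confidence and lift.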

# Sources:

https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/

https://medium.com/@kbrook10/day-11-machine-learning-using-knn-k-nearest-neighbors-with-scikit-learn-350c3a1402e6

This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas