#  Market Basket Analysis with Apriori Algorithm

### Variables Descriptions:

   - BillNo: bill number -> operation.

   - Itemname: Product name

   - Quantity: Number of products -> how many of the products on the invoices were sold.

   - Date

   - Price

   - CustomerID: Unique customer number

   - Country
   
   - **produto_id -> foi criado essa coluna no dataset** 

## Referencias:

https://practicaldatascience.co.uk/data-science/how-to-use-the-apriori-algorithm-for-market-basket-analysis

# 0.0. Imports

In [54]:
import numpy    as np
import pandas   as pd
import datetime as dt
import pickle
import inflection
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# 0.1. Function

In [41]:
 def data_preparation( dataframe ):
    # Rename Columns  
    cols_old = ['BillNo', 'ItemName', 'Quantity', 'Date', 'Price', 'CustomerID', 'Country']

    snakecase = lambda x: inflection.underscore( x )
    cols_news = list( map( snakecase, cols_old ) )

    # Rename
    dataframe.columns = cols_news
    
    # Drop NA
    dataframe = dataframe.dropna(subset=['item_name','customer_id'])
    
    # Data Types
    #dataframe['customer_id'] = dataframe['customer_id'].astype(int)
    #dataframe['bill_no'] = dataframe['bill_no'].astype(int)
    
    #feature_engineering
    dataframe = dataframe.loc[dataframe['price'] >= 0.04,:]

    dataframe = dataframe[~dataframe["item_name"].str.contains("POST", na=False)]
    
    dataframe = dataframe[~dataframe['country'].isin( ["Unspecified"] )]

    dataframe = dataframe[~dataframe['customer_id'].isin( [16446] )]
    
    # removendo hora
    #dataframe['date'] = dataframe['date'].apply( get_month )
    # month
    #dataframe['month'] = dataframe['date'].dt.month

    # data product_id -> criando codigo unico para os produtos.
    df_product_id = dataframe.drop( ['bill_no', 'quantity', 'date', 'price','customer_id', 'country'], axis=1 ).drop_duplicates( ignore_index=True)
    df_product_id = pd.DataFrame( df_product_id ) 
    df_product_id['produto_id'] = pd.factorize( df_product_id['item_name'])[0]

    # merge produto_id com dataframe
    dataframe = pd.merge( dataframe, df_product_id, on='item_name', how='left' )
    
    return dataframe

**Função data_preparation foi construida a partir do dataset data-exploration, após a análise dos dados e entendimento de algumas Premissas de Negócios**

# 1.0. Loading Data

In [42]:
df_raw = pd.read_excel( '../data/raw/DataSet_Test.xlsx', usecols="A:G")

In [43]:
df1 = df_raw.copy()

# 2.0. Data preparation and Feature Engineering

In [44]:
df1 = data_preparation( df1 )
df1.head()

Unnamed: 0,bill_no,item_name,quantity,date,price,customer_id,country,produto_id
0,536365,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,0
1,536365,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,1
2,536365,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2
3,536365,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,3
4,536365,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,4


**Coluna produto_id foi adicionada no dataset**

# 3.0 Filtragem Country

In [6]:
def data_filter( dataframe, country=False, Country=""):
    if country:
        dataframe = dataframe[dataframe["country"] == Country]
    return dataframe

In [45]:
df_country = data_filter( df1, True, 'France' )
df_country.head()

Unnamed: 0,bill_no,item_name,quantity,date,price,customer_id,country,produto_id
26,536370,ALARM CLOCK BAKELIKE PINK,24,2010-12-01 08:45:00,3.75,12583.0,France,26
27,536370,ALARM CLOCK BAKELIKE RED,24,2010-12-01 08:45:00,3.75,12583.0,France,27
28,536370,ALARM CLOCK BAKELIKE GREEN,12,2010-12-01 08:45:00,3.75,12583.0,France,28
29,536370,PANDA AND BUNNIES STICKER SHEET,12,2010-12-01 08:45:00,0.85,12583.0,France,29
30,536370,STARS GIFT TAPE,24,2010-12-01 08:45:00,0.65,12583.0,France,30


**Vamos lidar com os dados de vendas da Alemanha como exemplo** 

# 4.0 Preparando a Matriz Compra-Produto

In [46]:
def create_purchase_product(dataframe, id=False):
    if id:
        return dataframe.groupby(['bill_no', 'produto_id'])['quantity'].sum().unstack().fillna(0). \
            applymap(lambda x: 1 if x > 0 else 0)
    else:
        return dataframe.groupby(['bill_no', 'item_name'])['quantity'].sum().unstack().fillna(0). \
            applymap(lambda x: 1 if x > 0 else 0)

In [47]:
purchase_product = create_purchase_product( df_country, id=True)
purchase_product.head()

produto_id,0,3,4,5,7,9,10,11,12,15,...,3781,3782,3784,3785,3786,3816,3817,3818,3820,3821
bill_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536370,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536974,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
537065,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Coluna produto_id foi criada para realizar a contrução dessa Matriz, associar compras por codigo do produto**

## 4.1 Check prouto_id

In [48]:
def check_produto_id( dataframe, produto_id ): 
    item_name = dataframe[ dataframe["produto_id"] == produto_id]["item_name"].unique()[0] 
    return produto_id, item_name 

In [49]:
check_produto_id( df_country, 34 )

(34, 'ROUND SNACK BOXES SET OF4 WOODLAND')

In [50]:
check_produto_id( df_country, 244 )

(244, 'SET OF 6 T-LIGHTS SANTA')

**Função para encontrar o produto através do código ID**

# 5.0 Determinação das Regras de Associação

In [51]:
frequent_itemsets = apriori( purchase_product, min_support=0.06, use_colnames=True)

frequent_itemsets.sort_values('support', ascending=False).head()



Unnamed: 0,support,itemsets
57,0.190104,(3064)
9,0.182292,(39)
32,0.174479,(423)
38,0.171875,(685)
4,0.161458,(34)


# *Note

In [52]:
asso_rules = association_rules( frequent_itemsets, metric = "lift", min_threshold=0.1)
asso_rules.sort_values('support', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
13,(171),(64),0.140625,0.130208,0.125,0.888889,6.826667,0.106689,7.828125
12,(64),(171),0.130208,0.140625,0.125,0.96,6.826667,0.106689,21.484375
26,(422),(423),0.138021,0.174479,0.106771,0.773585,4.433681,0.082689,3.64605
27,(423),(422),0.174479,0.138021,0.106771,0.61194,4.433681,0.082689,2.221254
21,(171),(170),0.140625,0.135417,0.104167,0.740741,5.470085,0.085124,3.334821
30,(685),(423),0.171875,0.174479,0.104167,0.606061,3.473541,0.074178,2.095553
31,(423),(685),0.174479,0.171875,0.104167,0.597015,3.473541,0.074178,2.054977
11,(170),(64),0.135417,0.130208,0.104167,0.769231,5.907692,0.086534,3.769097
10,(64),(170),0.130208,0.135417,0.104167,0.8,5.907692,0.086534,4.322917
20,(170),(171),0.135417,0.140625,0.104167,0.769231,5.470085,0.085124,3.723958


Métricas que vemos na tabela acima:

   - **suporte antecedente:** Se X é chamado de antecedente, 'suporte antecedente' calcula a proporção de transações que contêm o antecedente X.
   - **suporte conseqüente:** se Y for chamado de conseqüente, 'suporte conseqüente' calcula a proporção de transações que contêm o antecedente Y.
   - **support:** 'support' calcula a proporção de transações que contêm o antecedente X e Y.
   - **confiança:** Probabilidade de comprar Y quando X é comprado.
   - **lift:** representa quantas vezes a probabilidade de obter Y aumenta quando X é recebido.



# 6.0 Suggesting a Product to Users at the Basket Stage

In [18]:
sorted_rules = asso_rules.sort_values("lift", ascending=False)

In [19]:
produto_id = 34

check_produto_id( df_country, produto_id )

(34, 'ROUND SNACK BOXES SET OF4 WOODLAND')

In [20]:
product_id = 34
recommendation_list = []

for idx, product in enumerate(sorted_rules["antecedents"]):
    # antecendent tuple
    for j in list(product):
        if j == product_id:
            # indexi ne ise (idx) consequentte
            recommendation_list.append(list(sorted_rules.iloc[idx]["consequents"])[0])
            recommendation_list = list( dict.fromkeys(recommendation_list) )
            
            
list_top5 = recommendation_list[0:5]
list_top5          

for elem in list_top5:
    print( check_produto_id( df_country, elem ))  

(356, 'ROUND SNACK BOXES SET OF 4 FRUITS')
(35, 'SPACEBOY LUNCH BOX')
(423, 'PLASTERS IN TIN WOODLAND ANIMALS')
(1215, 'WOODLAND CHARLOTTE BAG')


# 6.  Function Recomendation System 

In [22]:
# ================ Data Preparation ================
    
def data_preparation( dataframe ):
    # Rename Columns  
    cols_old = ['BillNo', 'ItemName', 'Quantity', 'Date', 'Price', 'CustomerID', 'Country']

    snakecase = lambda x: inflection.underscore( x )
    cols_news = list( map( snakecase, cols_old ) )

    # Rename
    dataframe.columns = cols_news
    
    # Drop NA
    dataframe = dataframe.dropna(subset=['item_name','customer_id'])
      
    #feature_engineering
    dataframe = dataframe.loc[dataframe['price'] >= 0.04,:]

    dataframe = dataframe[~dataframe["item_name"].str.contains("POST", na=False)]
    
    dataframe = dataframe[~dataframe['country'].isin( ["Unspecified"] )]

    dataframe = dataframe[~dataframe['customer_id'].isin( [16446] )]

    # data product_id 
    df_product_id = dataframe.drop( ['bill_no', 'quantity', 'date', 'price','customer_id', 'country'], axis=1 ).drop_duplicates( ignore_index=True)
    df_product_id = pd.DataFrame( df_product_id ) 
    df_product_id['produto_id'] = pd.factorize( df_product_id['item_name'])[0]

    # merge 
    dataframe = pd.merge( dataframe, df_product_id, on='item_name', how='left' )
    
    return dataframe

# ================ Filtragem Country ================
def data_filter( dataframe, country=False, Country="" ):
    if country:
        dataframe = dataframe[dataframe["country"] == Country]
    return dataframe

# ================  Matriz Compra-Produto  ================
def create_purchase_product( dataframe, id=False ):
    if id:
        return dataframe.groupby(['bill_no', 'produto_id'])['quantity'].sum().unstack().fillna(0). \
            applymap( lambda x: 1 if x > 0 else 0 )
    else:
        return dataframe.groupby(['bill_no', 'item_name'])['quantity'].sum().unstack().fillna(0). \
            applymap( lambda x: 1 if x > 0 else 0 )

# ================ Apriori Algorithm & ARL Rules ================
def apriori_alg( dataframe, support_val=0.06 ):
    frequent_itemsets = apriori( purchase_product, min_support=support_val, use_colnames=True)
    frequent_itemsets.sort_values('support', ascending=False).head()
    asso_rules = association_rules( frequent_itemsets, metric = "lift", min_threshold=0.1)
    asso_rules.sort_values('support', ascending=False)
    return sorted_rules

# ================ recommend_product ================
def recommend_product( dataframe, product_id, support_val= 0.06, num_of_products=5 ):
    sorted_rules = apriori_alg( dataframe, support_val )
    recommendation_list = []  
    for idx, product in enumerate(sorted_rules["antecedents"]):
        for j in list(product):
            if j == product_id:
                recommendation_list.append(list(sorted_rules.iloc[idx]["consequents"])[0])
                recommendation_list = list( dict.fromkeys(recommendation_list) )
    return( recommendation_list[0:num_of_products])

In [23]:
# ================ Recommendation System ================    
def recommendation_system( dataframe, support_val=0.01, num_of_products=5 ):
    product_id = input( "Digite o código do Produto: " )
    
    if product_id in list( dataframe["produto_id"].astype("str").unique()):
        product_list = recommend_product( dataframe, int( product_id ), support_val, num_of_products )
        if len( product_list ) == 0:
            print("Não há nenhum produto pode ser recomendado!")
        else:
            print("Os produtos relacionados com o do produto_id:" , product_id , "podem ser vistos abaixo:")
        
            for i in range(0, len(product_list[0:num_of_products])):
                print( check_produto_id(dataframe, product_list[i] ))
            
    else:
        print("ID do produto inválido, tente novamente!")

In [24]:
# Loading Data
df_raw = pd.read_excel( '../data/raw/DataSet_Test.xlsx', usecols="A:G")

In [25]:
# Data Preparation:
df1 = df_raw.copy()

df1 = data_preparation( df1 )
df_country = data_filter( df1, True ,'France' )
df_country.head()

Unnamed: 0,bill_no,item_name,quantity,date,price,customer_id,country,produto_id
26,536370,ALARM CLOCK BAKELIKE PINK,24,2010-12-01 08:45:00,3.75,12583.0,France,26
27,536370,ALARM CLOCK BAKELIKE RED,24,2010-12-01 08:45:00,3.75,12583.0,France,27
28,536370,ALARM CLOCK BAKELIKE GREEN,12,2010-12-01 08:45:00,3.75,12583.0,France,28
29,536370,PANDA AND BUNNIES STICKER SHEET,12,2010-12-01 08:45:00,0.85,12583.0,France,29
30,536370,STARS GIFT TAPE,24,2010-12-01 08:45:00,0.65,12583.0,France,30


In [32]:
recommendation_system( df_country )

Digite o código do Produto: 34




Os produtos relacionados com o do produto_id: 34 podem ser vistos abaixo:
(356, 'ROUND SNACK BOXES SET OF 4 FRUITS')
(35, 'SPACEBOY LUNCH BOX')
(423, 'PLASTERS IN TIN WOODLAND ANIMALS')
(1215, 'WOODLAND CHARLOTTE BAG')


In [63]:
recommendation_system( df_country )

Digite o código do Produto: 100




Não há nenhum produto pode ser recomendado!


In [64]:
recommendation_system( df_country )

Digite o código do Produto: 500
ID do produto inválido, tente novamente!


In [39]:
recommendation_system( df_country )

Digite o código do Produto: 9




Não há nenhum produto pode ser recomendado!


# Lista de recomendação sugerida pelo algoritmo

In [58]:
list_asso_france = pickle.load(open('../data/processed/france_asso_rules.pkl', 'rb'))
list_asso_france.sort_values('support', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
13,(171),(64),0.140625,0.130208,0.125,0.888889,6.826667,0.106689,7.828125
12,(64),(171),0.130208,0.140625,0.125,0.96,6.826667,0.106689,21.484375
26,(422),(423),0.138021,0.174479,0.106771,0.773585,4.433681,0.082689,3.64605
27,(423),(422),0.174479,0.138021,0.106771,0.61194,4.433681,0.082689,2.221254
21,(171),(170),0.140625,0.135417,0.104167,0.740741,5.470085,0.085124,3.334821
30,(685),(423),0.171875,0.174479,0.104167,0.606061,3.473541,0.074178,2.095553
31,(423),(685),0.174479,0.171875,0.104167,0.597015,3.473541,0.074178,2.054977
11,(170),(64),0.135417,0.130208,0.104167,0.769231,5.907692,0.086534,3.769097
10,(64),(170),0.130208,0.135417,0.104167,0.8,5.907692,0.086534,4.322917
20,(170),(171),0.135417,0.140625,0.104167,0.769231,5.470085,0.085124,3.723958
