# Association rules for Abili-T data

### Relevant definitions:
**Support:** Measures how frequently the itemset appears across all recorded transactions. It helps identify the rules that are worth analyzing in more depth.

**Confidence:** Measures how likely an item is to occur as a consequence of another. The closer it is to 1, the more confident we are that when one item is present the other will appear as well.

**Lift:** Controls for the support (frequency) of the consequent while computing the conditional probability of ${Y}$ given ${X}$. It tells us how much the probability of having ${Y}$ increases once we know we have ${X}$: the probability of ${Y}$ given ${X}$, divided by the probability of ${Y}$ without knowing anything about ${X}$.

$$ \text{Lift}(\{X\} \rightarrow \{Y\}) = \frac{(\text{Transactions containing both } X \text{ and } Y)/(\text{Transactions containing } X)}{\text{Fraction of transactions containing } Y} $$

A lift greater than 1 means that ${X}$ and ${Y}$ appear together more often than they would if they were independent, which increases our confidence in the association rule. A lift of 1 indicates independence.

**Leverage:** The difference between the observed frequency of ${X}$ and ${Y}$ appearing together and the frequency that would be expected if ${X}$ and ${Y}$ were independent. A value of 0 indicates independence.

**Conviction:** A high conviction means that the consequent is highly dependent on the antecedent. Conviction is computed as $(1 - \text{support}(Y)) / (1 - \text{confidence}(X \rightarrow Y))$, so with a confidence of 1 the denominator becomes 0 (1 - 1) and the conviction is defined as ${inf}$. As with **Lift**, if the items are independent, the conviction is 1.
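
As a minimal sketch of how these definitions fit together, the helper below computes confidence, lift, leverage and conviction from three supports. The function name `rule_metrics` and its argument names are illustrative assumptions, not part of mlxtend. The example values are the supports of the rule (P01) → (P02) as reported in the rules table later in this notebook, so the printed metrics should agree with that row up to rounding.

```python
import math


def rule_metrics(support_x, support_y, support_xy):
    """Compute metrics for the rule X -> Y from the supports of X, Y and X-and-Y."""
    confidence = support_xy / support_x            # P(Y | X)
    lift = confidence / support_y                  # P(Y | X) / P(Y)
    leverage = support_xy - support_x * support_y  # observed minus expected co-occurrence
    # With confidence == 1 the denominator is 0, so conviction is defined as inf.
    conviction = math.inf if confidence == 1 else (1 - support_y) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports of (P01) -> (P02) copied from the rules table in this notebook.
metrics = rule_metrics(support_x=0.966442, support_y=0.986584, support_xy=0.956824)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Under independence, for example `rule_metrics(0.5, 0.4, 0.2)`, lift and conviction come out as 1 and leverage as 0, matching the definitions above.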

#### References

[Basic guide to association rules 1/2](https://towardsdatascience.com/association-rules-2-aa9a77241654)

[Basic guide to association rules 2/2](https://towardsdatascience.com/complete-guide-to-association-rules-2-2-c92072b56c84)

[Mlxtend Association rules user guide](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)

[Mlxtend Association rules parameters](https://rasbt.github.io/mlxtend/api_modules/mlxtend.frequent_patterns/association_rules/#association_rules)


### METHOD 1: Pandas DataFrame

This method is more computationally demanding because of the size of the matrix. Even so, it is useful to keep so that its results can be compared with those of method 2, allowing a more thorough analysis.

In [19]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import scipy
from arulesviz import Arulesviz

In [2]:
# Load the data
gma_data = pd.read_csv('C:/Users/Nico/Desktop/EB Metrics/BDD/GMA y Personali-T/GMA/Binary CSV/GMA_bin_consolidado.csv')
num_records = len(gma_data)
print(num_records)

27654


In [3]:
# Drop the 'Unnamed: 0' column (recent pandas versions require the keyword form)
gma_data = gma_data.drop(columns='Unnamed: 0')

In [4]:
# Make sure the table is a pandas DataFrame (read_csv already returns one, so this is effectively a no-op)
gma_data = pd.DataFrame(gma_data)
gma_data.describe()

,DGSG,EE,PRO,JEF,TECN,DIG,OPS,VEN,E1,E2,...,BINT34,BINT35,BINT36,BINT37,BINT38,BINT39,BINT40,BINT41,BINT42,BINTOT_T
count,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,...,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0,27654.0
mean,0.025602,0.049866,0.255334,0.030339,0.2954,0.018551,0.015477,0.008534,0.163014,0.288385,...,0.854922,0.89629,0.938164,0.910176,0.944565,0.934693,0.94876,0.964996,0.967021,0.768388
std,0.157948,0.217672,0.436057,0.171522,0.456231,0.134934,0.123442,0.091986,0.369385,0.453019,...,0.352186,0.30489,0.240861,0.285935,0.228832,0.247071,0.220492,0.183793,0.178585,0.42187
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
75%,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
# Convert the values to booleans
gma_data = gma_data.astype(bool)
gma_data.head()

,DGSG,EE,PRO,JEF,TECN,DIG,OPS,VEN,E1,E2,...,BINT34,BINT35,BINT36,BINT37,BINT38,BINT39,BINT40,BINT41,BINT42,BINTOT_T
0,False,False,True,False,False,False,False,False,False,True,...,True,True,True,True,True,True,True,True,True,True
1,False,False,True,False,False,False,False,False,False,True,...,True,True,True,True,True,True,True,True,True,True
2,False,False,False,False,False,False,False,False,False,True,...,True,True,True,True,True,True,True,True,True,True
3,False,True,False,False,False,False,False,False,False,False,...,True,True,True,True,True,True,True,True,True,True
4,False,True,False,False,False,False,False,False,False,True,...,True,True,True,True,True,True,True,True,True,True


In [6]:
# Apply the Apriori algorithm to the data. The call structure stays the same, but min_support can be adjusted depending on the requirement we want to set.
freq_itemsets = apriori(gma_data, min_support = 0.8, use_colnames = True, max_len = 7, verbose = 1, low_memory = True)

Processing 17848 combinations | Sampling itemset size 7
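
To illustrate what `min_support` and `max_len` control, here is a minimal pure-Python sketch of a level-wise frequent-itemset search. It is not mlxtend's implementation; the function name `frequent_itemsets` and the four toy transactions are assumptions made for illustration (the item names are borrowed from the dataset columns, the transactions themselves are invented).

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_len):
    """Level-wise search: return {itemset: support} for itemsets meeting min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    current = [frozenset([item]) for item in items]
    size = 1
    while current and size <= max_len:
        survivors = []
        for candidate in current:
            # Support = fraction of transactions that contain every item of the candidate.
            support = sum(candidate <= t for t in transactions) / n
            if support >= min_support:
                result[candidate] = support
                survivors.append(candidate)
        # Join survivors into candidates one item larger, and keep only those whose
        # subsets are all frequent (an itemset with an infrequent subset cannot be frequent).
        size += 1
        joined = {a | b for a, b in combinations(survivors, 2) if len(a | b) == size}
        current = [c for c in joined
                   if all(frozenset(s) in result for s in combinations(c, size - 1))]
    return result


# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"P01", "P02", "BINT41"},
    {"P01", "P02"},
    {"P01", "P02", "BINT41"},
    {"P02", "BINT41"},
]
found = frequent_itemsets(transactions, min_support=0.75, max_len=2)
for itemset, support in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Raising `min_support` prunes candidates earlier and shrinks the search, which is why a high threshold such as 0.8 keeps the run above tractable on a wide matrix.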


In [7]:
# Add a column with the length of each itemset.
freq_itemsets['length'] = freq_itemsets['itemsets'].apply(len)

# Display freq_itemsets
freq_itemsets

,support,itemsets,length
0,0.833767,(STATUS),1
1,0.966442,(P01),1
2,0.986584,(P02),1
3,0.873834,(P03),1
4,0.922290,(P04),1
...,...,...,...
9905,0.869820,"(BINT42, BINT38, BINT37, BINT36, BINT40, BINT3...",7
9906,0.869061,"(BINT42, BINT37, BINT36, BINT40, BINT35, BINT4...",7
9907,0.873545,"(BINT42, BINT38, BINT36, BINT40, BINT35, BINT4...",7
9908,0.870760,"(BINT42, BINT38, BINT37, BINT40, BINT35, BINT4...",7


In [8]:
# Generate the association rules, keeping those whose support is at least 0.9
rules = association_rules(freq_itemsets, metric = "support", min_threshold = 0.9)
rules

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(P01),(P02),0.966442,0.986584,0.956824,0.990047,1.003510,0.003347,1.347933
1,(P02),(P01),0.986584,0.966442,0.956824,0.969835,1.003510,0.003347,1.112455
2,(P01),(P06),0.966442,0.983619,0.954328,0.987465,1.003910,0.003717,1.306861
3,(P06),(P01),0.983619,0.966442,0.954328,0.970222,1.003910,0.003717,1.126912
4,(P01),(BINT36),0.966442,0.938164,0.906270,0.937739,0.999546,-0.000412,0.993159
...,...,...,...,...,...,...,...,...,...
3431,(P02),"(BINT42, BINT38, P06, BINT40, BINT41, BINT39)",0.986584,0.910863,0.904896,0.917201,1.006959,0.006253,1.076552
3432,(P06),"(BINT42, BINT38, P02, BINT40, BINT41, BINT39)",0.983619,0.914081,0.904896,0.919966,1.006438,0.005789,1.073532
3433,(BINT40),"(BINT42, BINT38, P02, P06, BINT41, BINT39)",0.948760,0.905113,0.904896,0.953768,1.053755,0.046161,2.052387
3434,(BINT41),"(BINT42, BINT38, P02, P06, BINT40, BINT39)",0.964996,0.905005,0.904896,0.937720,1.036149,0.031570,1.525298
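
The rule-generation step can be sketched as follows: every frequent itemset whose support meets the threshold is split into each possible antecedent/consequent pair. This is a simplified illustration under stated assumptions, not mlxtend's code; `rules_from_itemsets` is a hypothetical name, and the three supports are copied from the tables above.

```python
from itertools import combinations


def rules_from_itemsets(supports, min_rule_support):
    """Enumerate X -> Y splits of every itemset whose support meets the threshold."""
    rules = []
    for itemset, support in supports.items():
        if len(itemset) < 2 or support < min_rule_support:
            continue
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                # Confidence = support(X and Y) / support(X)
                confidence = support / supports[antecedent]
                rules.append((antecedent, consequent, support, confidence))
    return rules


# Supports of P01, P02 and (P01, P02) as reported in this notebook.
supports = {
    frozenset({"P01"}): 0.966442,
    frozenset({"P02"}): 0.986584,
    frozenset({"P01", "P02"}): 0.956824,
}
generated = rules_from_itemsets(supports, min_rule_support=0.9)
for antecedent, consequent, support, confidence in generated:
    print(sorted(antecedent), "->", sorted(consequent), round(support, 6), round(confidence, 6))
```

Each itemset of size two yields two rules, one per direction, which is why the table above lists (P01) → (P02) and (P02) → (P01) with the same support but different confidence.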


In [9]:
# Describe the 'lift', 'leverage' and 'conviction' variables to gather enough information
# to filter out the least important rules

describe_rules = rules[['lift', 'leverage', 'conviction']].describe()
describe_rules

,lift,leverage,conviction
count,3436.0,3436.0,3436.0
mean,1.041209,0.035685,18.63162
std,0.02096,0.017808,48.464743
min,0.999134,-0.000781,0.950585
25%,1.033118,0.029255,1.803733
50%,1.043592,0.038182,2.865883
75%,1.056839,0.048991,5.820598
max,1.076408,0.064372,439.002387


In [10]:
# Create variables with the length of the antecedents and of the consequents of each association rule.

rules['antecedent_len'] = rules['antecedents'].apply(len)
rules['consequent_len'] = rules['consequents'].apply(len)
rules

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
0,(P01),(P02),0.966442,0.986584,0.956824,0.990047,1.003510,0.003347,1.347933,1,1
1,(P02),(P01),0.986584,0.966442,0.956824,0.969835,1.003510,0.003347,1.112455,1,1
2,(P01),(P06),0.966442,0.983619,0.954328,0.987465,1.003910,0.003717,1.306861,1,1
3,(P06),(P01),0.983619,0.966442,0.954328,0.970222,1.003910,0.003717,1.126912,1,1
4,(P01),(BINT36),0.966442,0.938164,0.906270,0.937739,0.999546,-0.000412,0.993159,1,1
...,...,...,...,...,...,...,...,...,...,...,...
3431,(P02),"(BINT42, BINT38, P06, BINT40, BINT41, BINT39)",0.986584,0.910863,0.904896,0.917201,1.006959,0.006253,1.076552,1,6
3432,(P06),"(BINT42, BINT38, P02, BINT40, BINT41, BINT39)",0.983619,0.914081,0.904896,0.919966,1.006438,0.005789,1.073532,1,6
3433,(BINT40),"(BINT42, BINT38, P02, P06, BINT41, BINT39)",0.948760,0.905113,0.904896,0.953768,1.053755,0.046161,2.052387,1,6
3434,(BINT41),"(BINT42, BINT38, P02, P06, BINT40, BINT39)",0.964996,0.905005,0.904896,0.937720,1.036149,0.031570,1.525298,1,6


In [11]:
# Keep only the association rules that have more than one item in the consequent.

relevant_rules = rules[rules['consequent_len'] > 1]
relevant_rules

# 882 rules were discarded relative to the previous step.

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
89,(P01),"(P02, P06)",0.966442,0.977038,0.948398,0.981329,1.004392,0.004147,1.229842,1,2
90,(P02),"(P01, P06)",0.986584,0.954328,0.948398,0.961295,1.007299,0.006873,1.179977,1,2
91,(P06),"(P01, P02)",0.983619,0.956824,0.948398,0.964192,1.007701,0.007248,1.205792,1,2
95,(P01),"(P02, BINT38)",0.966442,0.931764,0.903992,0.935381,1.003882,0.003496,1.055980,1,2
96,(P02),"(P01, BINT38)",0.986584,0.913141,0.903992,0.916285,1.003443,0.003102,1.037555,1,2
...,...,...,...,...,...,...,...,...,...,...,...
3431,(P02),"(BINT42, BINT38, P06, BINT40, BINT41, BINT39)",0.986584,0.910863,0.904896,0.917201,1.006959,0.006253,1.076552,1,6
3432,(P06),"(BINT42, BINT38, P02, BINT40, BINT41, BINT39)",0.983619,0.914081,0.904896,0.919966,1.006438,0.005789,1.073532,1,6
3433,(BINT40),"(BINT42, BINT38, P02, P06, BINT41, BINT39)",0.948760,0.905113,0.904896,0.953768,1.053755,0.046161,2.052387,1,6
3434,(BINT41),"(BINT42, BINT38, P02, P06, BINT40, BINT39)",0.964996,0.905005,0.904896,0.937720,1.036149,0.031570,1.525298,1,6


In [12]:
# Filter by lift > 1 to keep only the association rules whose lift strengthens our confidence in the rule.

relevant_rules = relevant_rules[relevant_rules['lift'] > 1]
relevant_rules

# 127 rules were discarded relative to the previous step.

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
89,(P01),"(P02, P06)",0.966442,0.977038,0.948398,0.981329,1.004392,0.004147,1.229842,1,2
90,(P02),"(P01, P06)",0.986584,0.954328,0.948398,0.961295,1.007299,0.006873,1.179977,1,2
91,(P06),"(P01, P02)",0.983619,0.956824,0.948398,0.964192,1.007701,0.007248,1.205792,1,2
95,(P01),"(P02, BINT38)",0.966442,0.931764,0.903992,0.935381,1.003882,0.003496,1.055980,1,2
96,(P02),"(P01, BINT38)",0.986584,0.913141,0.903992,0.916285,1.003443,0.003102,1.037555,1,2
...,...,...,...,...,...,...,...,...,...,...,...
3431,(P02),"(BINT42, BINT38, P06, BINT40, BINT41, BINT39)",0.986584,0.910863,0.904896,0.917201,1.006959,0.006253,1.076552,1,6
3432,(P06),"(BINT42, BINT38, P02, BINT40, BINT41, BINT39)",0.983619,0.914081,0.904896,0.919966,1.006438,0.005789,1.073532,1,6
3433,(BINT40),"(BINT42, BINT38, P02, P06, BINT41, BINT39)",0.948760,0.905113,0.904896,0.953768,1.053755,0.046161,2.052387,1,6
3434,(BINT41),"(BINT42, BINT38, P02, P06, BINT40, BINT39)",0.964996,0.905005,0.904896,0.937720,1.036149,0.031570,1.525298,1,6


In [13]:
# Filter on 'leverage' > 0, since we are looking for relationships
# in which the variables are not completely independent.

relevant_rules = relevant_rules[relevant_rules['leverage'] > 0]
relevant_rules

# 0 rules were discarded relative to the previous step.

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
89,(P01),"(P02, P06)",0.966442,0.977038,0.948398,0.981329,1.004392,0.004147,1.229842,1,2
90,(P02),"(P01, P06)",0.986584,0.954328,0.948398,0.961295,1.007299,0.006873,1.179977,1,2
91,(P06),"(P01, P02)",0.983619,0.956824,0.948398,0.964192,1.007701,0.007248,1.205792,1,2
95,(P01),"(P02, BINT38)",0.966442,0.931764,0.903992,0.935381,1.003882,0.003496,1.055980,1,2
96,(P02),"(P01, BINT38)",0.986584,0.913141,0.903992,0.916285,1.003443,0.003102,1.037555,1,2
...,...,...,...,...,...,...,...,...,...,...,...
3431,(P02),"(BINT42, BINT38, P06, BINT40, BINT41, BINT39)",0.986584,0.910863,0.904896,0.917201,1.006959,0.006253,1.076552,1,6
3432,(P06),"(BINT42, BINT38, P02, BINT40, BINT41, BINT39)",0.983619,0.914081,0.904896,0.919966,1.006438,0.005789,1.073532,1,6
3433,(BINT40),"(BINT42, BINT38, P02, P06, BINT41, BINT39)",0.948760,0.905113,0.904896,0.953768,1.053755,0.046161,2.052387,1,6
3434,(BINT41),"(BINT42, BINT38, P02, P06, BINT40, BINT39)",0.964996,0.905005,0.904896,0.937720,1.036149,0.031570,1.525298,1,6
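
The leverage filter discards no rules, and a short derivation suggests why this is expected after filtering by lift. Writing $s(\cdot)$ for support (a notation introduced here for brevity):

$$ \text{Leverage}(\{X\} \rightarrow \{Y\}) = s(X \cup Y) - s(X)\,s(Y) = s(X)\,s(Y)\left(\frac{s(X \cup Y)}{s(X)\,s(Y)} - 1\right) = s(X)\,s(Y)\,\big(\text{Lift}(\{X\} \rightarrow \{Y\}) - 1\big) $$

Since $s(X)\,s(Y) > 0$ for every rule in the table, leverage is positive exactly when lift is greater than 1, so the two filters select the same rules.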


In [14]:
# Sort the table in descending order of 'conviction'.
# This way, we can see the rules whose consequents depend most strongly on the antecedent.

relevant_rules = relevant_rules.sort_values(by = 'conviction', ascending = False)
relevant_rules

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
3265,"(BINT39, BINT38, BINT40, BINT37)","(BINT42, BINT41)",0.907030,0.957041,0.906849,0.999801,1.044679,0.038785,215.510263,4,2
3203,"(BINT39, BINT38, BINT36, BINT40)","(BINT42, BINT41)",0.903992,0.957041,0.903811,0.999800,1.044679,0.038654,214.788544,4,2
3268,"(BINT41, BINT39, BINT40, BINT37)","(BINT42, BINT38)",0.907138,0.937152,0.906849,0.999681,1.066723,0.056723,197.075776,4,2
2584,"(BINT39, BINT40, BINT37)","(BINT41, BINT38)",0.907247,0.938381,0.906957,0.999681,1.065325,0.055614,193.243545,3,2
3261,"(BINT42, BINT39, BINT40, BINT37)","(BINT41, BINT38)",0.907138,0.938381,0.906849,0.999681,1.065325,0.055607,193.220438,4,2
...,...,...,...,...,...,...,...,...,...,...,...
143,(P01),"(BINT38, BINT40)",0.966442,0.933030,0.902148,0.933473,1.000475,0.000429,1.006666,1,2
590,(BINT41),"(P01, P02, P06)",0.964996,0.948398,0.915528,0.948737,1.000358,0.000327,1.006615,1,3
653,"(P01, P06)","(BINT41, BINT40)",0.954328,0.947060,0.904137,0.947406,1.000366,0.000330,1.006583,2,2
97,(BINT38),"(P01, P02)",0.944565,0.956824,0.903992,0.957046,1.000232,0.000210,1.005179,1,2


In [15]:
# Save the DataFrames so the rules found can be studied in detail.

relevant_rules.to_excel('C:/Users/Nico/Desktop/EB Metrics/BDD/Association Rules/Relevant GMA Association Rules.xlsx')

rules.to_excel('C:/Users/Nico/Desktop/EB Metrics/BDD/Association Rules/GMA Association Rules.xlsx')

## Visualization

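In [23]:
# Reference only (R, not Python): signature of the arulesViz plot method for rules,
# as printed in R by args(getS3method("plot", "rules")). Running these lines as Python
# raises "SyntaxError: positional argument follows keyword argument" at line 4,
# because of the trailing `...` after the keyword-style arguments.
#
# function (x, method = NULL, measure = "support", shading = "lift",
#           limit = NULL, interactive = NULL, engine = "default", data = NULL,
#           control = NULL, ...)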
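
The R signature above defaults to `measure = "support"` and `shading = "lift"`. A rough Python counterpart can be sketched with matplotlib, which this notebook already imports. The sketch below is an assumption about what such a plot could look like, not the arulesviz API: `plot_rules` is a hypothetical helper, the choice of confidence for the y axis is an assumption, and the four points are rows copied from the tables above rather than the full `rules` DataFrame.

```python
import matplotlib.pyplot as plt

# (support, confidence, lift) for a few rules copied from the tables above.
rule_rows = [
    (0.948398, 0.981329, 1.004392),
    (0.906849, 0.999801, 1.044679),
    (0.906849, 0.999681, 1.066723),
    (0.904896, 0.953768, 1.053755),
]


def plot_rules(rows):
    """Scatter rules by support (x) and confidence (y), shaded by lift."""
    support, confidence, lift = zip(*rows)
    fig, ax = plt.subplots()
    points = ax.scatter(support, confidence, c=lift, cmap="viridis")
    ax.set_xlabel("support")
    ax.set_ylabel("confidence")
    fig.colorbar(points, ax=ax, label="lift")
    return fig, ax


fig, ax = plot_rules(rule_rows)
```

With the real data, the same helper could presumably be fed from `rules[['support', 'confidence', 'lift']].itertuples(index=False)`.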