# Association rules for Personali-T data

### Relevant definitions
**Support:** Measures how frequently an itemset appears across all recorded transactions. It helps identify the rules that are worth analyzing in more depth.

**Confidence:** Measures how likely the consequent is to occur given that the antecedent is present. The closer it is to 1, the more confident we can be that when one item is present the other will appear as well.
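As a minimal sketch of these two definitions, the snippet below computes support and confidence on a handful of toy transactions. The data and the helper names `support` and `confidence` are hypothetical and are not part of the Personali-T dataset or of mlxtend.

```python
# Toy transactions (hypothetical data, not taken from the Personali-T dataset)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): joint support divided by antecedent support."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

print(support({"A", "B"}, transactions))       # 3 of the 5 transactions contain A and B
print(confidence({"A"}, {"B"}, transactions))  # 3 of the 4 transactions with A also contain B
```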

**Lift:** Controls for the support (frequency) of the consequent when computing the conditional probability of ${Y}$ given ${X}$. It tells us by how much knowing ${X}$ raises the probability of ${Y}$: it is the probability of ${Y}$ given ${X}$ divided by the probability of ${Y}$ when we do not know whether ${X}$ is present.

$$ \text{Lift}(X \rightarrow Y) = \frac{(\text{Transactions containing } X \text{ and } Y)/(\text{Transactions containing } X)}{\text{Fraction of transactions containing } Y} $$

A lift greater than 1 means that ${X}$ and ${Y}$ appear together more often than they would if they were independent, which strengthens our confidence in the association rule.
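As a quick check of the formula, the sketch below recomputes the lift of the rule (BIN_P_4) -> (BIN_P_11) from the supports reported in the rules table further down in this notebook. The function name `lift` and its parameter names are illustrative choices, not an mlxtend API.

```python
def lift(support_xy, support_x, support_y):
    """Lift of X -> Y: confidence(X -> Y) divided by support(Y)."""
    confidence = support_xy / support_x
    return confidence / support_y

# Supports of (BIN_P_4) -> (BIN_P_11) as reported in the rules table
value = lift(support_xy=0.946384, support_x=0.975459, support_y=0.966484)
print(round(value, 4))  # about 1.0038, in line with the 1.003838 shown in the table
```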

**Leverage:** The difference between the observed frequency with which ${X}$ and ${Y}$ appear together and the frequency that would be expected if ${X}$ and ${Y}$ were independent. A value of 0 indicates independence.
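Written in the same style as the lift formula, with $\text{Support}(X \cup Y)$ standing for the fraction of transactions that contain both ${X}$ and ${Y}$ (a notation assumed here), leverage can be expressed as:

$$ \text{Leverage}(X \rightarrow Y) = \text{Support}(X \cup Y) - \text{Support}(X) \cdot \text{Support}(Y) $$

For example, for the rule (BIN_P_4) -> (BIN_P_11) reported in the rules table further down, this gives $0.946384 - 0.975459 \cdot 0.966484 \approx 0.003618$.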

**Conviction:** Computed as $(1 - \text{Support}(Y)) / (1 - \text{Confidence}(X \rightarrow Y))$. A high conviction means that the consequent depends strongly on the antecedent. When the confidence is 1, the denominator becomes 0 (1 - 1), so the conviction is defined as $\infty$. As with **Lift**, if the items are independent, the conviction is 1.
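The edge cases described above can be sketched as follows. The function `conviction` is a hypothetical helper written for illustration; the first call reuses the confidence and consequent support of the rule (BIN_P_4) -> (BIN_P_11) from the rules table further down.

```python
import math

def conviction(confidence, support_y):
    """Conviction of X -> Y: (1 - support(Y)) / (1 - confidence(X -> Y))."""
    if confidence >= 1.0:
        # The denominator is zero, so conviction is defined as infinity
        return math.inf
    return (1.0 - support_y) / (1.0 - confidence)

print(conviction(confidence=0.970194, support_y=0.966484))  # rule (BIN_P_4) -> (BIN_P_11)
print(conviction(confidence=1.0, support_y=0.5))            # confidence of 1 gives infinity
print(conviction(confidence=0.5, support_y=0.5))            # independence gives 1.0
```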

#### References

[Basic guide to association rules 1/2](https://towardsdatascience.com/association-rules-2-aa9a77241654)

[Basic guide to association rules 2/2](https://towardsdatascience.com/complete-guide-to-association-rules-2-2-c92072b56c84)

[Mlxtend Association rules user guide](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)

[Mlxtend Association rules parameters](https://rasbt.github.io/mlxtend/api_modules/mlxtend.frequent_patterns/association_rules/#association_rules)



### Development

This method is more computationally demanding because of the size of the matrix. Even so, it is useful to have, since its results can be compared with those of method 2 to allow a more thorough analysis.

In [4]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import scipy

In [7]:
# Load the data
personality_data = pd.read_csv('C:/Users/Nico/Desktop/EB Metrics/BDD/GMA y Personali-T/Personali-T/Binary CSV/PERSONALI-T_BIN_CONSOLIDADO.csv')
num_records = len(personality_data)
print(num_records)
personality_data.head()

21393


Unnamed: 0.1,Unnamed: 0,DGSG,PRO,DIG,JEF,TECN,OPS,VEN,TOP_10,E1,...,BIN_P_122,BIN_P_123,BIN_PI_124,BIN_P_125,BIN_PI_126,BIN_P_127,BIN_PI_128,BIN_P_129,BIN_P_130,BIN_P_131
0,1,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,1,1,1,0,0
1,2,0,1,0,0,0,0,0,0,0,...,1,1,1,0,1,1,1,1,0,1
2,3,0,0,0,0,1,0,0,0,0,...,0,1,1,1,1,1,0,1,1,0
3,4,0,1,0,0,0,0,0,0,0,...,0,1,1,1,1,1,1,1,0,0
4,5,0,1,0,0,0,0,0,0,0,...,1,0,1,1,1,0,0,1,1,1


In [4]:
# Drop the 'Unnamed: 0' column (keyword form; the positional axis argument is not accepted by recent pandas versions)
personality_data = personality_data.drop(columns='Unnamed: 0')

In [5]:
# Make sure the table is a pandas DataFrame (read_csv already returns one) and summarize it
personality_data = pd.DataFrame(personality_data)
personality_data.describe()

Unnamed: 0,DGSG,PRO,DIG,JEF,TECN,OPS,VEN,TOP_10,E1,E2,...,BIN_P_122,BIN_P_123,BIN_PI_124,BIN_P_125,BIN_PI_126,BIN_P_127,BIN_PI_128,BIN_P_129,BIN_P_130,BIN_P_131
count,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,...,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0,21393.0
mean,0.02627,0.30351,0.023185,0.079465,0.535549,0.021082,0.010938,0.295517,0.194924,0.287571,...,0.449773,0.499088,0.895714,0.597812,0.633852,0.543215,0.713925,0.755481,0.425466,0.487215
std,0.159942,0.459784,0.150495,0.27047,0.498746,0.14366,0.104015,0.456286,0.396151,0.45264,...,0.497483,0.500011,0.305639,0.490351,0.481762,0.498141,0.451935,0.429812,0.494425,0.499848
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
75%,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
# Convert the values to booleans
personality_data = personality_data.astype(bool)
personality_data.head()

Unnamed: 0,DGSG,PRO,DIG,JEF,TECN,OPS,VEN,TOP_10,E1,E2,...,BIN_P_122,BIN_P_123,BIN_PI_124,BIN_P_125,BIN_PI_126,BIN_P_127,BIN_PI_128,BIN_P_129,BIN_P_130,BIN_P_131
0,False,True,False,False,False,False,False,False,False,True,...,False,False,False,True,False,True,True,True,False,False
1,False,True,False,False,False,False,False,False,False,True,...,True,True,True,False,True,True,True,True,False,True
2,False,False,False,False,True,False,False,False,False,True,...,False,True,True,True,True,True,False,True,True,False
3,False,True,False,False,False,False,False,False,False,False,...,False,True,True,True,True,True,True,True,False,False
4,False,True,False,False,False,False,False,False,False,True,...,True,False,True,True,True,False,False,True,True,True


In [7]:
# Apply the Apriori algorithm to the data. The structure of the call stays the same, but min_support can be changed depending on the requirement we want to set.
freq_itemsets = apriori(personality_data, min_support = 0.8, use_colnames = True, max_len = 7, verbose = 1, low_memory = True)

Processing 147664 combinations | Sampling itemset size 7
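To illustrate why `min_support` and `max_len` drive the cost of this step, here is a minimal pure-Python sketch of a level-wise Apriori search. It is not mlxtend's implementation: the function `frequent_itemsets` and the `toy` data are hypothetical. It relies on the property that every subset of a frequent itemset is itself frequent, so candidates of size k are built only from frequent itemsets of size k - 1.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len):
    """Level-wise search: size-k candidates come only from frequent (k-1)-itemsets."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items
    items = sorted({item for t in transactions for item in t})
    level = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            level[frozenset([item])] = s
    result = dict(level)

    k = 2
    while level and k <= max_len:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune step: every (k-1)-subset must itself be frequent
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1)):
                s = support(cand)
                if s >= min_support:
                    next_level[cand] = s
        result.update(next_level)
        level = next_level
        k += 1
    return result

# Hypothetical toy data; each frozenset is one transaction
toy = [frozenset(t) for t in (["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"])]
found = frequent_itemsets(toy, min_support=0.6, max_len=3)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Lowering `min_support` lets more itemsets survive each level, so the number of candidates at the next level grows quickly. This is consistent with the large number of combinations reported in the output above.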


In [8]:
# Add a column with the length of each itemset.
freq_itemsets['length'] = freq_itemsets['itemsets'].apply(len)

# Display freq_itemsets
freq_itemsets

Unnamed: 0,support,itemsets,length
0,0.811153,(ALTO),1
1,0.975459,(BIN_P_4),1
2,0.839761,(BIN_P_7),1
3,0.966484,(BIN_P_11),1
4,0.949703,(BIN_P_15),1
...,...,...,...
68642,0.806245,"(BIN_P_86, BIN_P_103, BIN_P_46, BIN_P_106, BIN...",7
68643,0.806245,"(BIN_P_86, BIN_P_103, BIN_P_106, BIN_P_43, BIN...",7
68644,0.802085,"(BIN_P_86, BIN_P_103, BIN_P_46, BIN_P_106, BIN...",7
68645,0.802552,"(BIN_P_86, BIN_P_103, BIN_P_46, BIN_P_54, BIN_...",7


In [9]:
# Generate the association rules, keeping only those with a support of at least 0.9
rules = association_rules(freq_itemsets, metric="support", min_threshold=0.9)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(BIN_P_4),(BIN_P_11),0.975459,0.966484,0.946384,0.970194,1.003838,0.003618,1.124444
1,(BIN_P_11),(BIN_P_4),0.966484,0.975459,0.946384,0.979203,1.003838,0.003618,1.180010
2,(BIN_P_4),(BIN_P_15),0.975459,0.949703,0.929790,0.953182,1.003663,0.003393,1.074303
3,(BIN_P_15),(BIN_P_4),0.949703,0.975459,0.929790,0.979032,1.003663,0.003393,1.170409
4,(BIN_P_4),(BIN_P_16),0.975459,0.952414,0.932875,0.956345,1.004127,0.003834,1.090030
...,...,...,...,...,...,...,...,...,...
2857,"(BIN_P_36, BIN_P_67)","(BIN_P_103, BIN_P_29)",0.949376,0.944141,0.904782,0.953028,1.009413,0.008438,1.189208
2858,(BIN_P_103),"(BIN_P_29, BIN_P_36, BIN_P_67)",0.975646,0.921657,0.904782,0.927367,1.006196,0.005571,1.078617
2859,(BIN_P_29),"(BIN_P_103, BIN_P_36, BIN_P_67)",0.964287,0.930678,0.904782,0.938291,1.008180,0.007341,1.123361
2860,(BIN_P_36),"(BIN_P_103, BIN_P_29, BIN_P_67)",0.967186,0.928808,0.904782,0.935479,1.007182,0.006452,1.103388
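The table above lists each pair of items in both directions (rows 0 and 1, for instance) and also contains rules with multi-item consequents. A plausible way to picture this is that every frequent itemset is split into all possible antecedent/consequent pairs. The sketch below enumerates those splits for one itemset; `rule_splits` is a hypothetical helper and not part of mlxtend.

```python
from itertools import combinations

def rule_splits(itemset):
    """Yield every (antecedent, consequent) split of `itemset` with both sides non-empty."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

splits = list(rule_splits({"BIN_P_4", "BIN_P_11", "BIN_P_15"}))
for antecedent, consequent in splits:
    print(antecedent, "->", consequent)
print(len(splits))  # 2**3 - 2 = 6 candidate rules for a 3-item itemset
```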


In [10]:
# Summarize the 'lift', 'leverage' and 'conviction' columns so that we have enough information
# to filter out the less important rules

describe_rules = rules[['lift', 'leverage', 'conviction']].describe()
describe_rules

Unnamed: 0,lift,leverage,conviction
count,2862.0,2862.0,2862.0
mean,1.006439,0.00581,1.171382
std,0.002638,0.002329,0.098491
min,1.001355,0.00123,1.02156
25%,1.00475,0.004309,1.096724
50%,1.006013,0.005424,1.151823
75%,1.007639,0.006876,1.225868
max,1.027446,0.024063,1.966044


In [11]:
# Add columns with the number of items in the antecedent and in the consequent of each rule.

rules['antecedent_len'] = rules['antecedents'].apply(len)
rules['consequent_len'] = rules['consequents'].apply(len)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
0,(BIN_P_4),(BIN_P_11),0.975459,0.966484,0.946384,0.970194,1.003838,0.003618,1.124444,1,1
1,(BIN_P_11),(BIN_P_4),0.966484,0.975459,0.946384,0.979203,1.003838,0.003618,1.18001,1,1
2,(BIN_P_4),(BIN_P_15),0.975459,0.949703,0.92979,0.953182,1.003663,0.003393,1.074303,1,1
3,(BIN_P_15),(BIN_P_4),0.949703,0.975459,0.92979,0.979032,1.003663,0.003393,1.170409,1,1
4,(BIN_P_4),(BIN_P_16),0.975459,0.952414,0.932875,0.956345,1.004127,0.003834,1.09003,1,1


In [27]:
# Keep only the association rules that have more than one item in the consequent.

relevant_rules = rules[rules['consequent_len'] > 1] 
relevant_rules

# 1467 rules were discarded relative to the previous step.

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
327,(BIN_P_4),"(BIN_P_15, BIN_P_11)",0.975459,0.924508,0.906839,0.929653,1.005565,0.005019,1.073138,1,2
328,(BIN_P_15),"(BIN_P_4, BIN_P_11)",0.949703,0.946384,0.906839,0.954865,1.008962,0.008054,1.187906,1,2
329,(BIN_P_11),"(BIN_P_4, BIN_P_15)",0.966484,0.929790,0.906839,0.938286,1.009137,0.008211,1.137664,1,2
333,(BIN_P_4),"(BIN_P_16, BIN_P_11)",0.975459,0.929556,0.911794,0.934733,1.005568,0.005049,1.079308,1,2
334,(BIN_P_16),"(BIN_P_4, BIN_P_11)",0.952414,0.946384,0.911794,0.957350,1.011587,0.010444,1.257099,1,2
...,...,...,...,...,...,...,...,...,...,...,...
2857,"(BIN_P_36, BIN_P_67)","(BIN_P_103, BIN_P_29)",0.949376,0.944141,0.904782,0.953028,1.009413,0.008438,1.189208,2,2
2858,(BIN_P_103),"(BIN_P_29, BIN_P_36, BIN_P_67)",0.975646,0.921657,0.904782,0.927367,1.006196,0.005571,1.078617,1,3
2859,(BIN_P_29),"(BIN_P_103, BIN_P_36, BIN_P_67)",0.964287,0.930678,0.904782,0.938291,1.008180,0.007341,1.123361,1,3
2860,(BIN_P_36),"(BIN_P_103, BIN_P_29, BIN_P_67)",0.967186,0.928808,0.904782,0.935479,1.007182,0.006452,1.103388,1,3


In [28]:
# Keep only the rules with lift > 1, i.e. those whose antecedent raises the probability of the consequent.

relevant_rules = relevant_rules[relevant_rules['lift'] > 1]
relevant_rules

# 0 rules were discarded relative to the previous step.

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
327,(BIN_P_4),"(BIN_P_15, BIN_P_11)",0.975459,0.924508,0.906839,0.929653,1.005565,0.005019,1.073138,1,2
328,(BIN_P_15),"(BIN_P_4, BIN_P_11)",0.949703,0.946384,0.906839,0.954865,1.008962,0.008054,1.187906,1,2
329,(BIN_P_11),"(BIN_P_4, BIN_P_15)",0.966484,0.929790,0.906839,0.938286,1.009137,0.008211,1.137664,1,2
333,(BIN_P_4),"(BIN_P_16, BIN_P_11)",0.975459,0.929556,0.911794,0.934733,1.005568,0.005049,1.079308,1,2
334,(BIN_P_16),"(BIN_P_4, BIN_P_11)",0.952414,0.946384,0.911794,0.957350,1.011587,0.010444,1.257099,1,2
...,...,...,...,...,...,...,...,...,...,...,...
2857,"(BIN_P_36, BIN_P_67)","(BIN_P_103, BIN_P_29)",0.949376,0.944141,0.904782,0.953028,1.009413,0.008438,1.189208,2,2
2858,(BIN_P_103),"(BIN_P_29, BIN_P_36, BIN_P_67)",0.975646,0.921657,0.904782,0.927367,1.006196,0.005571,1.078617,1,3
2859,(BIN_P_29),"(BIN_P_103, BIN_P_36, BIN_P_67)",0.964287,0.930678,0.904782,0.938291,1.008180,0.007341,1.123361,1,3
2860,(BIN_P_36),"(BIN_P_103, BIN_P_29, BIN_P_67)",0.967186,0.928808,0.904782,0.935479,1.007182,0.006452,1.103388,1,3


In [29]:
# Keep only the rules with leverage > 0, since we are looking for
# relationships in which the variables are not independent.

relevant_rules = relevant_rules[relevant_rules['leverage'] > 0]
relevant_rules

# 0 rules were discarded relative to the previous step.

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
327,(BIN_P_4),"(BIN_P_15, BIN_P_11)",0.975459,0.924508,0.906839,0.929653,1.005565,0.005019,1.073138,1,2
328,(BIN_P_15),"(BIN_P_4, BIN_P_11)",0.949703,0.946384,0.906839,0.954865,1.008962,0.008054,1.187906,1,2
329,(BIN_P_11),"(BIN_P_4, BIN_P_15)",0.966484,0.929790,0.906839,0.938286,1.009137,0.008211,1.137664,1,2
333,(BIN_P_4),"(BIN_P_16, BIN_P_11)",0.975459,0.929556,0.911794,0.934733,1.005568,0.005049,1.079308,1,2
334,(BIN_P_16),"(BIN_P_4, BIN_P_11)",0.952414,0.946384,0.911794,0.957350,1.011587,0.010444,1.257099,1,2
...,...,...,...,...,...,...,...,...,...,...,...
2857,"(BIN_P_36, BIN_P_67)","(BIN_P_103, BIN_P_29)",0.949376,0.944141,0.904782,0.953028,1.009413,0.008438,1.189208,2,2
2858,(BIN_P_103),"(BIN_P_29, BIN_P_36, BIN_P_67)",0.975646,0.921657,0.904782,0.927367,1.006196,0.005571,1.078617,1,3
2859,(BIN_P_29),"(BIN_P_103, BIN_P_36, BIN_P_67)",0.964287,0.930678,0.904782,0.938291,1.008180,0.007341,1.123361,1,3
2860,(BIN_P_36),"(BIN_P_103, BIN_P_29, BIN_P_67)",0.967186,0.928808,0.904782,0.935479,1.007182,0.006452,1.103388,1,3
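That the leverage filter discards nothing after the lift filter is expected, because the two conditions are equivalent. A short derivation, writing $s(\cdot)$ for support (a shorthand assumed here):

$$ \text{Leverage}(X \rightarrow Y) = s(X \cup Y) - s(X)\,s(Y) = s(X)\,s(Y)\left(\frac{s(X \cup Y)}{s(X)\,s(Y)} - 1\right) = s(X)\,s(Y)\,\big(\text{Lift}(X \rightarrow Y) - 1\big) $$

Since $s(X)$ and $s(Y)$ are strictly positive for any rule built from frequent itemsets, leverage > 0 holds exactly when lift > 1, so the second filter is redundant once the first has been applied.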


In [30]:
# Sort the table in descending order of 'conviction'.
# This way we can see the rules whose consequents depend most strongly on the antecedent.

relevant_rules = relevant_rules.sort_values(by = 'conviction', ascending = False)
relevant_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len,consequent_len
1187,(BIN_P_106),"(BIN_P_43, BIN_P_11)",0.944374,0.928388,0.900809,0.953868,1.027446,0.024063,1.552341,1,2
2152,(BIN_P_106),"(BIN_P_43, BIN_P_67)",0.944374,0.935633,0.904829,0.958125,1.024039,0.021241,1.537120,1,2
2159,(BIN_P_106),"(BIN_P_103, BIN_P_43)",0.944374,0.933156,0.902538,0.955700,1.024159,0.021290,1.508889,1,2
773,(BIN_P_106),"(BIN_P_4, BIN_P_43)",0.944374,0.932408,0.901557,0.954660,1.023866,0.021015,1.490792,1,2
2151,(BIN_P_43),"(BIN_P_106, BIN_P_67)",0.952414,0.928201,0.904829,0.950037,1.023525,0.020797,1.437042,1,2
...,...,...,...,...,...,...,...,...,...,...,...
1811,(BIN_P_67),"(BIN_P_44, BIN_P_18)",0.979246,0.925209,0.908755,0.928016,1.003033,0.002748,1.038987,1,2
1847,(BIN_P_67),"(BIN_P_86, BIN_P_18)",0.979246,0.923012,0.906652,0.925868,1.003094,0.002796,1.038517,1,2
2177,(BIN_P_67),"(BIN_P_54, BIN_P_44)",0.979246,0.922171,0.905857,0.925056,1.003129,0.002825,1.038499,1,2
1859,(BIN_P_67),"(BIN_P_115, BIN_P_18)",0.979246,0.916608,0.900481,0.919567,1.003227,0.002897,1.036780,1,2


In [35]:
# Save the DataFrames so that the rules found can be studied in detail.

relevant_rules.to_excel('C:/Users/Nico/Desktop/EB Metrics/BDD/Association Rules/Relevant Personali-T Association Rules.xlsx')

rules.to_excel('C:/Users/Nico/Desktop/EB Metrics/BDD/Association Rules/Personali-T Association Rules.xlsx')