# Exercise 1

In this exercise we focus on **association rules**: starting from a dataset that records which products each user buys, we mine the association rules it contains.

In [1]:
import pandas as pd
import numpy as np
import mlxtend as mlx
import matplotlib.pyplot as plt

In [2]:
!pip install apyori

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import apyori as ap
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

We load our dataset.

In [4]:
data = pd.read_csv('BlackFriday.csv', encoding = 'latin_1')
data

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1000004,P00128942,M,46-50,7,B,2,1,1,11.0,
1,1000009,P00113442,M,26-35,17,C,0,0,3,5.0,
2,1000010,P00288442,F,36-45,1,B,4+,1,5,14.0,
3,1000010,P00145342,F,36-45,1,B,4+,1,4,9.0,
4,1000011,P00053842,F,26-35,1,C,1,0,4,5.0,12.0
...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,F,26-35,15,B,4+,1,8,,
233595,1006036,P00254642,F,26-35,15,B,4+,1,5,8.0,
233596,1006036,P00031842,F,26-35,15,B,4+,1,1,5.0,12.0
233597,1006037,P00124742,F,46-50,1,C,4+,0,10,16.0,


The dataset has the following variables:

* User_ID: A unique value identifying each buyer.
* Product_ID: A unique value identifying each product.
* Gender: The buyer's gender (M or F).
* Age: The buyer's age range.
* Occupation: The buyer's occupation, given as a numeric code.
* City_Category: The category of the city where the purchase was made.
* Stay_In_Current_City_Years: The number of years the buyer has lived in their current city.
* Marital_Status: The buyer's marital status; 0 denotes single, 1 denotes married.
* Product_Category_1: The product's main category, given as a number.
* Product_Category_2: The product's first subcategory.
* Product_Category_3: The product's second subcategory.


Are there null values in our dataset?

In [5]:
data.isnull().values.any()

True

How many null values are there?

In [6]:
data.isnull().sum().sum()

234906

Where are the null values located?

In [7]:
data.columns[data.isnull().any()]

Index(['Product_Category_2', 'Product_Category_3'], dtype='object')
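
The three checks above can also be folded into one: `isnull().sum()` returns the null count per column, which shows both how many nulls there are and where they sit. A minimal sketch on a made-up frame (the values are illustrative, not taken from `BlackFriday.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "Product_Category_1": [1, 3, 5, 4],
    "Product_Category_2": [11.0, 5.0, np.nan, 9.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 12.0],
})

# Null count per column; keep only the columns that actually contain nulls.
nulls_per_column = toy.isnull().sum()
nulls_per_column = nulls_per_column[nulls_per_column > 0]
print(nulls_per_column)
```

Summing the resulting series gives the overall total, so a separate `.sum().sum()` call is not strictly needed.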

In [8]:
min_pc2 = data['Product_Category_2'].min()
min_pc2

2.0

In [9]:
min_pc3 = data['Product_Category_3'].min()
min_pc3

3.0

Neither column that contains nulls has 0 as its minimum (the minima are 2.0 and 3.0), so 0 is not in use as a category code and we assign it to the null values.

In [10]:
data['Product_Category_2'] = data['Product_Category_2'].fillna(0)
data['Product_Category_3'] = data['Product_Category_3'].fillna(0)
data

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1000004,P00128942,M,46-50,7,B,2,1,1,11.0,0.0
1,1000009,P00113442,M,26-35,17,C,0,0,3,5.0,0.0
2,1000010,P00288442,F,36-45,1,B,4+,1,5,14.0,0.0
3,1000010,P00145342,F,36-45,1,B,4+,1,4,9.0,0.0
4,1000011,P00053842,F,26-35,1,C,1,0,4,5.0,12.0
...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,F,26-35,15,B,4+,1,8,0.0,0.0
233595,1006036,P00254642,F,26-35,15,B,4+,1,5,8.0,0.0
233596,1006036,P00031842,F,26-35,15,B,4+,1,1,5.0,12.0
233597,1006037,P00124742,F,46-50,1,C,4+,0,10,16.0,0.0
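
Using 0 as a placeholder is only safe if 0 is not already a real category code. The following is a small sketch of that check on made-up values; `fill_with_unused_sentinel` is a hypothetical helper name, not something defined in this notebook:

```python
import numpy as np
import pandas as pd

def fill_with_unused_sentinel(series: pd.Series, sentinel: float = 0) -> pd.Series:
    """Fill nulls with `sentinel`, refusing if that value already occurs."""
    if (series.dropna() == sentinel).any():
        raise ValueError(f"{sentinel} is already a real value in {series.name}")
    return series.fillna(sentinel)

# Illustrative values only.
pc2 = pd.Series([11.0, 5.0, np.nan, 2.0], name="Product_Category_2")
filled = fill_with_unused_sentinel(pc2)
print(filled.tolist())
```

Raising an error instead of silently filling makes the assumption explicit if the data ever changes.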


To mine association rules we must convert our variables to one-hot encoding.

We convert all of them except User_ID and Product_ID.


In [11]:
dummy = pd.get_dummies(data['Gender'])
data1 = pd.concat([data, dummy], axis=1)
data1.drop('Gender', axis=1, inplace=True)
data1

Unnamed: 0,User_ID,Product_ID,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,F,M
0,1000004,P00128942,46-50,7,B,2,1,1,11.0,0.0,0,1
1,1000009,P00113442,26-35,17,C,0,0,3,5.0,0.0,0,1
2,1000010,P00288442,36-45,1,B,4+,1,5,14.0,0.0,1,0
3,1000010,P00145342,36-45,1,B,4+,1,4,9.0,0.0,1,0
4,1000011,P00053842,26-35,1,C,1,0,4,5.0,12.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,26-35,15,B,4+,1,8,0.0,0.0,1,0
233595,1006036,P00254642,26-35,15,B,4+,1,5,8.0,0.0,1,0
233596,1006036,P00031842,26-35,15,B,4+,1,1,5.0,12.0,1,0
233597,1006037,P00124742,46-50,1,C,4+,0,10,16.0,0.0,1,0


In [12]:
dummy = pd.get_dummies(data1['Age'])
data2 = pd.concat([data1, dummy], axis=1)
data2.drop('Age', axis=1, inplace=True)
data2

Unnamed: 0,User_ID,Product_ID,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,F,M,0-17,18-25,26-35,36-45,46-50,51-55,55+
0,1000004,P00128942,7,B,2,1,1,11.0,0.0,0,1,0,0,0,0,1,0,0
1,1000009,P00113442,17,C,0,0,3,5.0,0.0,0,1,0,0,1,0,0,0,0
2,1000010,P00288442,1,B,4+,1,5,14.0,0.0,1,0,0,0,0,1,0,0,0
3,1000010,P00145342,1,B,4+,1,4,9.0,0.0,1,0,0,0,0,1,0,0,0
4,1000011,P00053842,1,C,1,0,4,5.0,12.0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,15,B,4+,1,8,0.0,0.0,1,0,0,0,1,0,0,0,0
233595,1006036,P00254642,15,B,4+,1,5,8.0,0.0,1,0,0,0,1,0,0,0,0
233596,1006036,P00031842,15,B,4+,1,1,5.0,12.0,1,0,0,0,1,0,0,0,0
233597,1006037,P00124742,1,C,4+,0,10,16.0,0.0,1,0,0,0,0,0,1,0,0


In [13]:
dummy = pd.get_dummies(data2['Occupation'], prefix = 'Occ')
data3 = pd.concat([data2, dummy], axis=1)
data3.drop('Occupation', axis=1, inplace=True)
data3

Unnamed: 0,User_ID,Product_ID,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,F,M,...,Occ_11,Occ_12,Occ_13,Occ_14,Occ_15,Occ_16,Occ_17,Occ_18,Occ_19,Occ_20
0,1000004,P00128942,B,2,1,1,11.0,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1000009,P00113442,C,0,0,3,5.0,0.0,0,1,...,0,0,0,0,0,0,1,0,0,0
2,1000010,P00288442,B,4+,1,5,14.0,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1000010,P00145342,B,4+,1,4,9.0,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1000011,P00053842,C,1,0,4,5.0,12.0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,B,4+,1,8,0.0,0.0,1,0,...,0,0,0,0,1,0,0,0,0,0
233595,1006036,P00254642,B,4+,1,5,8.0,0.0,1,0,...,0,0,0,0,1,0,0,0,0,0
233596,1006036,P00031842,B,4+,1,1,5.0,12.0,1,0,...,0,0,0,0,1,0,0,0,0,0
233597,1006037,P00124742,C,4+,0,10,16.0,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
dummy = pd.get_dummies(data3['City_Category'])
data4 = pd.concat([data3, dummy], axis=1)
data4.drop('City_Category', axis=1, inplace=True)
data4

Unnamed: 0,User_ID,Product_ID,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,F,M,0-17,...,Occ_14,Occ_15,Occ_16,Occ_17,Occ_18,Occ_19,Occ_20,A,B,C
0,1000004,P00128942,2,1,1,11.0,0.0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
1,1000009,P00113442,0,0,3,5.0,0.0,0,1,0,...,0,0,0,1,0,0,0,0,0,1
2,1000010,P00288442,4+,1,5,14.0,0.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1000010,P00145342,4+,1,4,9.0,0.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,1000011,P00053842,1,0,4,5.0,12.0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,4+,1,8,0.0,0.0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
233595,1006036,P00254642,4+,1,5,8.0,0.0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
233596,1006036,P00031842,4+,1,1,5.0,12.0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
233597,1006037,P00124742,4+,0,10,16.0,0.0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


In [15]:
dummy = pd.get_dummies(data4['Stay_In_Current_City_Years'], prefix='City')
data5 = pd.concat([data4, dummy], axis=1)
data5.drop('Stay_In_Current_City_Years', axis=1, inplace=True)
data5

Unnamed: 0,User_ID,Product_ID,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,F,M,0-17,18-25,...,Occ_19,Occ_20,A,B,C,City_0,City_1,City_2,City_3,City_4+
0,1000004,P00128942,1,1,11.0,0.0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0
1,1000009,P00113442,0,3,5.0,0.0,0,1,0,0,...,0,0,0,0,1,1,0,0,0,0
2,1000010,P00288442,1,5,14.0,0.0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,1000010,P00145342,1,4,9.0,0.0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
4,1000011,P00053842,0,4,5.0,12.0,1,0,0,0,...,0,0,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,1,8,0.0,0.0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
233595,1006036,P00254642,1,5,8.0,0.0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
233596,1006036,P00031842,1,1,5.0,12.0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
233597,1006037,P00124742,0,10,16.0,0.0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,1


In [16]:
dummy = pd.get_dummies(data5['Marital_Status'], prefix='Marital')
data6 = pd.concat([data5, dummy], axis=1)
data6.drop('Marital_Status', axis=1, inplace=True)
data6


Unnamed: 0,User_ID,Product_ID,Product_Category_1,Product_Category_2,Product_Category_3,F,M,0-17,18-25,26-35,...,A,B,C,City_0,City_1,City_2,City_3,City_4+,Marital_0,Marital_1
0,1000004,P00128942,1,11.0,0.0,0,1,0,0,0,...,0,1,0,0,0,1,0,0,0,1
1,1000009,P00113442,3,5.0,0.0,0,1,0,0,1,...,0,0,1,1,0,0,0,0,1,0
2,1000010,P00288442,5,14.0,0.0,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,1
3,1000010,P00145342,4,9.0,0.0,1,0,0,0,0,...,0,1,0,0,0,0,0,1,0,1
4,1000011,P00053842,4,5.0,12.0,1,0,0,0,1,...,0,0,1,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,8,0.0,0.0,1,0,0,0,1,...,0,1,0,0,0,0,0,1,0,1
233595,1006036,P00254642,5,8.0,0.0,1,0,0,0,1,...,0,1,0,0,0,0,0,1,0,1
233596,1006036,P00031842,1,5.0,12.0,1,0,0,0,1,...,0,1,0,0,0,0,0,1,0,1
233597,1006037,P00124742,10,16.0,0.0,1,0,0,0,0,...,0,0,1,0,0,0,0,1,1,0


In [17]:
dummy = pd.get_dummies(data6['Product_Category_1'], prefix='PC1')
data7 = pd.concat([data6, dummy], axis=1)
data7.drop('Product_Category_1', axis=1, inplace=True)
data7

Unnamed: 0,User_ID,Product_ID,Product_Category_2,Product_Category_3,F,M,0-17,18-25,26-35,36-45,...,PC1_9,PC1_10,PC1_11,PC1_12,PC1_13,PC1_14,PC1_15,PC1_16,PC1_17,PC1_18
0,1000004,P00128942,11.0,0.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1000009,P00113442,5.0,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,1000010,P00288442,14.0,0.0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1000010,P00145342,9.0,0.0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1000011,P00053842,5.0,12.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,0.0,0.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
233595,1006036,P00254642,8.0,0.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
233596,1006036,P00031842,5.0,12.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
233597,1006037,P00124742,16.0,0.0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [18]:
dummy = pd.get_dummies(data7['Product_Category_2'], prefix='PC2')
data8 = pd.concat([data7, dummy], axis=1)
data8.drop('Product_Category_2', axis=1, inplace=True)
data8

Unnamed: 0,User_ID,Product_ID,Product_Category_3,F,M,0-17,18-25,26-35,36-45,46-50,...,PC2_9.0,PC2_10.0,PC2_11.0,PC2_12.0,PC2_13.0,PC2_14.0,PC2_15.0,PC2_16.0,PC2_17.0,PC2_18.0
0,1000004,P00128942,0.0,0,1,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
1,1000009,P00113442,0.0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1000010,P00288442,0.0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,1000010,P00145342,0.0,1,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
4,1000011,P00053842,12.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,0.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
233595,1006036,P00254642,0.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
233596,1006036,P00031842,12.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
233597,1006037,P00124742,0.0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


Let's see how many users our dataset contains in total.
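
One way to obtain that count is `Series.nunique()` on `User_ID`. A minimal sketch, using only the first five rows shown in the table at the top of the notebook rather than the full file:

```python
import pandas as pd

# First five rows of the dataset as displayed earlier (two columns only).
toy = pd.DataFrame({
    "User_ID": [1000004, 1000009, 1000010, 1000010, 1000011],
    "Product_ID": ["P00128942", "P00113442", "P00288442", "P00145342", "P00053842"],
})

# Each row is one purchase, so distinct User_ID values give the number of users.
n_users = toy["User_ID"].nunique()
print(n_users)
```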

In [19]:
dummy = pd.get_dummies(data8['Product_Category_3'], prefix='PC3')
data9 = pd.concat([data8, dummy], axis=1)
data9.drop('Product_Category_3', axis=1, inplace=True)
data9

Unnamed: 0,User_ID,Product_ID,F,M,0-17,18-25,26-35,36-45,46-50,51-55,...,PC3_9.0,PC3_10.0,PC3_11.0,PC3_12.0,PC3_13.0,PC3_14.0,PC3_15.0,PC3_16.0,PC3_17.0,PC3_18.0
0,1000004,P00128942,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,1000009,P00113442,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1000010,P00288442,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1000010,P00145342,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1000011,P00053842,1,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233595,1006036,P00254642,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233596,1006036,P00031842,1,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
233597,1006037,P00124742,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
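
The step-by-step encoding cells above could probably be condensed into a single `pd.get_dummies` call by passing `columns=` and a `prefix` mapping. The sketch below is illustrative only: it uses three rows from the table shown earlier, and the prefixes `Gender` and `CityCat` are assumptions, so the resulting column names differ from the unprefixed `F`, `M`, `A`, `B`, `C` used in this notebook.

```python
import pandas as pd

# First rows of the table shown earlier, restricted to a few columns.
toy = pd.DataFrame({
    "User_ID": [1000004, 1000009, 1000010],
    "Gender": ["M", "M", "F"],
    "City_Category": ["B", "C", "B"],
    "Product_Category_1": [1, 3, 5],
})

# One call encodes every listed column; `prefix` maps column name -> prefix.
encoded = pd.get_dummies(
    toy,
    columns=["Gender", "City_Category", "Product_Category_1"],
    prefix={
        "Gender": "Gender",
        "City_Category": "CityCat",
        "Product_Category_1": "PC1",
    },
    dtype=bool,  # boolean indicators instead of integer 0/1
)
print(encoded.columns.tolist())
```

Requesting `dtype=bool` is a design choice here: mlxtend's `apriori` works with boolean frames and warns about other dtypes.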


Now that our variables have been converted to this format, let's check that there are no errors.

In [20]:
data9.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233599 entries, 0 to 233598
Data columns (total 94 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   User_ID     233599 non-null  int64 
 1   Product_ID  233599 non-null  object
 2   F           233599 non-null  uint8 
 3   M           233599 non-null  uint8 
 4   0-17        233599 non-null  uint8 
 5   18-25       233599 non-null  uint8 
 6   26-35       233599 non-null  uint8 
 7   36-45       233599 non-null  uint8 
 8   46-50       233599 non-null  uint8 
 9   51-55       233599 non-null  uint8 
 10  55+         233599 non-null  uint8 
 11  Occ_0       233599 non-null  uint8 
 12  Occ_1       233599 non-null  uint8 
 13  Occ_2       233599 non-null  uint8 
 14  Occ_3       233599 non-null  uint8 
 15  Occ_4       233599 non-null  uint8 
 16  Occ_5       233599 non-null  uint8 
 17  Occ_6       233599 non-null  uint8 
 18  Occ_7       233599 non-null  uint8 
 19  Occ_8       233599 non-

We group by customer, which is the level we are interested in: we want to know which products a customer takes home together. For example, if a customer who buys jeans also tends to buy a belt, that is the kind of pattern we are looking for.

In [21]:
datos = data9.groupby('User_ID').max()
datos


User_ID,Product_ID,F,M,0-17,18-25,26-35,36-45,46-50,51-55,55+,...,PC3_9.0,PC3_10.0,PC3_11.0,PC3_12.0,PC3_13.0,PC3_14.0,PC3_15.0,PC3_16.0,PC3_17.0,PC3_18.0
1000001,P0096442,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1000002,P00364842,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,1,1,1,1,0
1000003,P00330242,0,1,0,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1
1000004,P00128942,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1000005,P0098142,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006036,P0099442,1,0,0,0,1,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
1006037,P00323542,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,1,1,1,0
1006038,P00316642,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
1006039,P0096442,1,0,0,0,0,0,1,0,0,...,1,0,0,1,1,1,1,1,1,0


We aggregate with the max function because, with one-hot encoded columns, a sum would add one for every row that user has, giving purchase counts instead of 0/1 indicators.
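
The difference between the two aggregations is easy to see on a tiny made-up frame (the rows below are illustrative, not from the dataset):

```python
import pandas as pd

# Made-up one-hot rows: user 1 has two purchases, user 2 has one.
toy = pd.DataFrame({
    "User_ID": [1, 1, 2],
    "PC1_5":   [1, 1, 1],
    "PC1_8":   [0, 1, 0],
})

by_max = toy.groupby("User_ID").max()  # stays 0/1: "did the user ever buy it?"
by_sum = toy.groupby("User_ID").sum()  # counts purchases, so values can exceed 1
print(by_max)
print(by_sum)
```

Only the `max` result keeps the 0/1 shape that frequent-itemset mining expects.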

We drop the Product_ID variable, since it provides no useful information.

In [22]:
datos = datos.drop(labels=['Product_ID'], axis=1)
datos

User_ID,F,M,0-17,18-25,26-35,36-45,46-50,51-55,55+,Occ_0,...,PC3_9.0,PC3_10.0,PC3_11.0,PC3_12.0,PC3_13.0,PC3_14.0,PC3_15.0,PC3_16.0,PC3_17.0,PC3_18.0
1000001,1,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1000002,0,1,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,1,1,1,0
1000003,0,1,0,0,1,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1
1000004,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000005,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006036,1,0,0,0,1,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
1006037,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,1,1,1,0
1006038,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
1006039,1,0,0,0,0,0,1,0,0,1,...,1,0,0,1,1,1,1,1,1,0


Let's run a first test and build the first frequent itemsets, from which the association rules are derived. The call below returns **all** itemsets with a support of at least 0.5.

Recall that we earlier replaced the null values in our dataset with 0; they were located in the Product_Category_2 and Product_Category_3 columns. We therefore do not want itemsets containing product 0 of categories 2 and 3 (PC2_0.0 and PC3_0.0), so we filter them out of the results.

In [23]:
soporte_minimo = 0.5
itemset_frecuentes = apriori(datos, min_support=soporte_minimo, use_colnames=True)
# Flag itemsets containing a placeholder item (PC2_0.0 / PC3_0.0 stand for former nulls)
a = itemset_frecuentes['itemsets'].apply(lambda x: "PC2_0.0" in x or "PC3_0.0" in x)
# Drop the flagged itemsets; the original row labels are preserved
b = itemset_frecuentes.drop(itemset_frecuentes[a].index)
b

Unnamed: 0,support,itemsets
0,0.717196,(M)
1,0.532847,(C)
2,0.580037,(Marital_0)
3,0.910711,(PC1_1)
4,0.543711,(PC1_2)
...,...,...
803,0.510440,"(PC2_2.0, PC2_8.0, PC1_1, PC1_8, PC2_14.0)"
809,0.508063,"(PC2_8.0, PC1_1, PC1_8, PC2_16.0, PC2_14.0)"
863,0.501952,"(PC2_2.0, PC1_5, PC2_8.0, PC1_8, PC2_14.0)"
868,0.508912,"(PC1_5, PC2_8.0, PC1_8, PC2_16.0, PC2_14.0)"
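
As an alternative way to write the same filter, the placeholder items can be collected in a set and each itemset tested with `isdisjoint`. This is a sketch on a made-up table with the same two columns (`support`, `itemsets`) as the output above; the support values are invented:

```python
import pandas as pd

# Illustrative frequent-itemset table: a `support` column and an
# `itemsets` column of frozensets. Supports are made up.
toy_itemsets = pd.DataFrame({
    "support": [0.9, 0.8, 0.7, 0.6],
    "itemsets": [
        frozenset({"PC1_1"}),
        frozenset({"PC2_0.0"}),
        frozenset({"PC1_1", "PC3_0.0"}),
        frozenset({"PC1_1", "PC2_8.0"}),
    ],
})

placeholders = {"PC2_0.0", "PC3_0.0"}
# Keep an itemset only if it shares no item with the placeholder set.
keep = toy_itemsets["itemsets"].apply(placeholders.isdisjoint)
filtered = toy_itemsets[keep]
print(filtered)
```

Adding another placeholder later only requires extending the set, not writing another lambda.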


In [24]:
b[b['itemsets'].apply(lambda x: len(x)) == 1]

Unnamed: 0,support,itemsets
0,0.717196,(M)
1,0.532847,(C)
2,0.580037,(Marital_0)
3,0.910711,(PC1_1)
4,0.543711,(PC1_2)
5,0.913597,(PC1_5)
6,0.529961,(PC1_6)
7,0.873196,(PC1_8)
9,0.729588,(PC2_2.0)
10,0.527075,(PC2_4.0)


This loop reports how many frequent itemsets are found for each itemset size k.

In [25]:
for k in range(1,7):
  
  itemset_frecuentes_k = b[b['itemsets'].apply(lambda x: len(x))==k]
  num_itemsets = itemset_frecuentes_k.shape[0]
  print("Se encontraron", num_itemsets, "itemsets frecuentes para k=", k)
  print(itemset_frecuentes_k)

Se encontraron 21 itemsets frecuentes para k= 1
     support     itemsets
0   0.717196          (M)
1   0.532847          (C)
2   0.580037  (Marital_0)
3   0.910711      (PC1_1)
4   0.543711      (PC1_2)
5   0.913597      (PC1_5)
6   0.529961      (PC1_6)
7   0.873196      (PC1_8)
9   0.729588    (PC2_2.0)
10  0.527075    (PC2_4.0)
11  0.592938    (PC2_5.0)
12  0.516890    (PC2_6.0)
13  0.830929    (PC2_8.0)
14  0.757936   (PC2_14.0)
15  0.656765   (PC2_15.0)
16  0.723986   (PC2_16.0)
18  0.533865    (PC3_5.0)
19  0.538618   (PC3_14.0)
20  0.565099   (PC3_15.0)
21  0.670175   (PC3_16.0)
22  0.556782   (PC3_17.0)
Se encontraron 78 itemsets frecuentes para k= 2
      support              itemsets
23   0.667119            (PC1_1, M)
24   0.652521            (PC1_5, M)
25   0.623833            (M, PC1_8)
27   0.551859          (PC2_2.0, M)
28   0.596673          (M, PC2_8.0)
..        ...                   ...
130  0.584281  (PC2_16.0, PC2_14.0)
132  0.554575  (PC3_16.0, PC2_14.0)
133  0.5

In [26]:
reglas = association_rules(b, metric="confidence", min_threshold=0.7)

# Show the rules and their confidence
reglas[['antecedents', 'consequents', 'confidence']]

Unnamed: 0,antecedents,consequents,confidence
0,(PC1_1),(M),0.732526
1,(M),(PC1_1),0.930178
2,(PC1_5),(M),0.714233
3,(M),(PC1_5),0.909822
4,(M),(PC1_8),0.869822
...,...,...,...
1615,"(PC2_2.0, PC1_8)","(PC1_1, PC1_5, PC2_8.0, PC2_14.0)",0.771660
1616,"(PC2_2.0, PC2_14.0)","(PC1_1, PC1_5, PC2_8.0, PC1_8)",0.858345
1617,"(PC2_8.0, PC2_14.0)","(PC2_2.0, PC1_1, PC1_5, PC1_8)",0.747472
1618,"(PC1_1, PC2_14.0)","(PC2_2.0, PC1_5, PC2_8.0, PC1_8)",0.712187
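
For intuition, the confidence filter can be reproduced by hand: a rule's confidence is the support of the whole itemset divided by the support of its antecedent. Below is a minimal pure-Python sketch using three supports copied from the itemset output above; `rules_by_confidence` is a hypothetical helper, not part of mlxtend:

```python
from itertools import combinations

# Supports copied from the frequent-itemset output above.
support = {
    frozenset({"M"}): 0.717196,
    frozenset({"PC1_1"}): 0.910711,
    frozenset({"PC1_1", "M"}): 0.667119,
}

def rules_by_confidence(support, min_confidence):
    """Return (antecedent, consequent, confidence) for every qualifying split."""
    rules = []
    for itemset, sup in support.items():
        # Every non-empty proper subset of the itemset is a candidate antecedent.
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                confidence = sup / support[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

for antecedent, consequent, confidence in rules_by_confidence(support, 0.7):
    print(set(antecedent), "->", set(consequent), round(confidence, 4))
```

The two rules it produces correspond to rows 0 and 1 of the table above.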


In [27]:
reglas[reglas['antecedents'].apply(lambda x: "M" in x)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1,(M),(PC1_1),0.717196,0.910711,0.667119,0.930178,1.021375,0.013961,1.278796
3,(M),(PC1_5),0.717196,0.913597,0.652521,0.909822,0.995868,-0.002707,0.958143
4,(M),(PC1_8),0.717196,0.873196,0.623833,0.869822,0.996136,-0.002420,0.974082
7,(M),(PC2_2.0),0.717196,0.729588,0.551859,0.769467,1.054661,0.028602,1.172991
8,(M),(PC2_8.0),0.717196,0.830929,0.596673,0.831953,1.001233,0.000734,1.006094
...,...,...,...,...,...,...,...,...,...
591,"(M, PC2_8.0, PC1_8)",(PC1_5),0.535902,0.913597,0.510949,0.953437,1.043608,0.021350,1.855607
592,"(PC1_5, M)","(PC2_8.0, PC1_8)",0.652521,0.748939,0.510949,0.783039,1.045530,0.022251,1.157168
593,"(M, PC2_8.0)","(PC1_5, PC1_8)",0.596673,0.814802,0.510949,0.856330,1.050967,0.024778,1.289050
594,"(M, PC1_8)","(PC1_5, PC2_8.0)",0.623833,0.782889,0.510949,0.819048,1.046186,0.022557,1.199823
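
The remaining columns of this table follow from the three support values by the standard definitions of lift, leverage and conviction. A short sketch that recomputes them for the rule (M) -> (PC1_1); `rule_metrics` is a hypothetical helper written for illustration:

```python
def rule_metrics(antecedent_support, consequent_support, joint_support):
    """Standard rule metrics computed from three support values."""
    confidence = joint_support / antecedent_support
    lift = confidence / consequent_support
    leverage = joint_support - antecedent_support * consequent_support
    # Conviction is infinite when confidence is exactly 1.
    conviction = (
        float("inf") if confidence == 1
        else (1 - consequent_support) / (1 - confidence)
    )
    return {"confidence": confidence, "lift": lift,
            "leverage": leverage, "conviction": conviction}

# Supports taken from the row (M) -> (PC1_1) in the table above.
metrics = rule_metrics(0.717196, 0.910711, 0.667119)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

A lift close to 1, as for this rule, suggests the antecedent and consequent are nearly independent even though the confidence is high.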


Repeat the analysis for each gender, age range, and product type.

In [28]:
data

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3
0,1000004,P00128942,M,46-50,7,B,2,1,1,11.0,0.0
1,1000009,P00113442,M,26-35,17,C,0,0,3,5.0,0.0
2,1000010,P00288442,F,36-45,1,B,4+,1,5,14.0,0.0
3,1000010,P00145342,F,36-45,1,B,4+,1,4,9.0,0.0
4,1000011,P00053842,F,26-35,1,C,1,0,4,5.0,12.0
...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,F,26-35,15,B,4+,1,8,0.0,0.0
233595,1006036,P00254642,F,26-35,15,B,4+,1,5,8.0,0.0
233596,1006036,P00031842,F,26-35,15,B,4+,1,1,5.0,12.0
233597,1006037,P00124742,F,46-50,1,C,4+,0,10,16.0,0.0


In [29]:
df = data

In [30]:
df = df.drop(labels=['Occupation','City_Category', 'Stay_In_Current_City_Years', 'Marital_Status' ], axis=1)
df

Unnamed: 0,User_ID,Product_ID,Gender,Age,Product_Category_1,Product_Category_2,Product_Category_3
0,1000004,P00128942,M,46-50,1,11.0,0.0
1,1000009,P00113442,M,26-35,3,5.0,0.0
2,1000010,P00288442,F,36-45,5,14.0,0.0
3,1000010,P00145342,F,36-45,4,9.0,0.0
4,1000011,P00053842,F,26-35,4,5.0,12.0
...,...,...,...,...,...,...,...
233594,1006036,P00118942,F,26-35,8,0.0,0.0
233595,1006036,P00254642,F,26-35,5,8.0,0.0
233596,1006036,P00031842,F,26-35,1,5.0,12.0
233597,1006037,P00124742,F,46-50,10,16.0,0.0


In [31]:
dummy = pd.get_dummies(df['Gender'])
df = pd.concat([df, dummy], axis=1)
df.drop('Gender', axis=1, inplace=True)
df

Unnamed: 0,User_ID,Product_ID,Age,Product_Category_1,Product_Category_2,Product_Category_3,F,M
0,1000004,P00128942,46-50,1,11.0,0.0,0,1
1,1000009,P00113442,26-35,3,5.0,0.0,0,1
2,1000010,P00288442,36-45,5,14.0,0.0,1,0
3,1000010,P00145342,36-45,4,9.0,0.0,1,0
4,1000011,P00053842,26-35,4,5.0,12.0,1,0
...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,26-35,8,0.0,0.0,1,0
233595,1006036,P00254642,26-35,5,8.0,0.0,1,0
233596,1006036,P00031842,26-35,1,5.0,12.0,1,0
233597,1006037,P00124742,46-50,10,16.0,0.0,1,0


In [32]:
dummy = pd.get_dummies(df['Age'])
df = pd.concat([df, dummy], axis=1)
df.drop('Age', axis=1, inplace=True)
df

Unnamed: 0,User_ID,Product_ID,Product_Category_1,Product_Category_2,Product_Category_3,F,M,0-17,18-25,26-35,36-45,46-50,51-55,55+
0,1000004,P00128942,1,11.0,0.0,0,1,0,0,0,0,1,0,0
1,1000009,P00113442,3,5.0,0.0,0,1,0,0,1,0,0,0,0
2,1000010,P00288442,5,14.0,0.0,1,0,0,0,0,1,0,0,0
3,1000010,P00145342,4,9.0,0.0,1,0,0,0,0,1,0,0,0
4,1000011,P00053842,4,5.0,12.0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,8,0.0,0.0,1,0,0,0,1,0,0,0,0
233595,1006036,P00254642,5,8.0,0.0,1,0,0,0,1,0,0,0,0
233596,1006036,P00031842,1,5.0,12.0,1,0,0,0,1,0,0,0,0
233597,1006037,P00124742,10,16.0,0.0,1,0,0,0,0,0,1,0,0


In [33]:
dummy = pd.get_dummies(df['Product_Category_1'], prefix='PC1')
df = pd.concat([df, dummy], axis=1)
df.drop('Product_Category_1', axis=1, inplace=True)
df

Unnamed: 0,User_ID,Product_ID,Product_Category_2,Product_Category_3,F,M,0-17,18-25,26-35,36-45,...,PC1_9,PC1_10,PC1_11,PC1_12,PC1_13,PC1_14,PC1_15,PC1_16,PC1_17,PC1_18
0,1000004,P00128942,11.0,0.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1000009,P00113442,5.0,0.0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,1000010,P00288442,14.0,0.0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1000010,P00145342,9.0,0.0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1000011,P00053842,5.0,12.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,0.0,0.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
233595,1006036,P00254642,8.0,0.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
233596,1006036,P00031842,5.0,12.0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
233597,1006037,P00124742,16.0,0.0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [34]:
dummy = pd.get_dummies(df['Product_Category_2'], prefix='PC2')
df = pd.concat([df, dummy], axis=1)
df.drop('Product_Category_2', axis=1, inplace=True)
df

Unnamed: 0,User_ID,Product_ID,Product_Category_3,F,M,0-17,18-25,26-35,36-45,46-50,...,PC2_9.0,PC2_10.0,PC2_11.0,PC2_12.0,PC2_13.0,PC2_14.0,PC2_15.0,PC2_16.0,PC2_17.0,PC2_18.0
0,1000004,P00128942,0.0,0,1,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
1,1000009,P00113442,0.0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1000010,P00288442,0.0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,1000010,P00145342,0.0,1,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
4,1000011,P00053842,12.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,0.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
233595,1006036,P00254642,0.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
233596,1006036,P00031842,12.0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
233597,1006037,P00124742,0.0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


In [35]:
dummy = pd.get_dummies(df['Product_Category_3'],prefix='PC3' )
df = pd.concat([df, dummy], axis=1)
df.drop('Product_Category_3', axis=1, inplace=True)
df

Unnamed: 0,User_ID,Product_ID,F,M,0-17,18-25,26-35,36-45,46-50,51-55,...,PC3_9.0,PC3_10.0,PC3_11.0,PC3_12.0,PC3_13.0,PC3_14.0,PC3_15.0,PC3_16.0,PC3_17.0,PC3_18.0
0,1000004,P00128942,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,1000009,P00113442,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1000010,P00288442,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1000010,P00145342,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1000011,P00053842,1,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233594,1006036,P00118942,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233595,1006036,P00254642,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
233596,1006036,P00031842,1,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
233597,1006037,P00124742,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
df = df.groupby('User_ID').max()
df

User_ID,Product_ID,F,M,0-17,18-25,26-35,36-45,46-50,51-55,55+,...,PC3_9.0,PC3_10.0,PC3_11.0,PC3_12.0,PC3_13.0,PC3_14.0,PC3_15.0,PC3_16.0,PC3_17.0,PC3_18.0
1000001,P0096442,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1000002,P00364842,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,1,1,1,1,0
1000003,P00330242,0,1,0,0,1,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1
1000004,P00128942,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1000005,P0098142,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006036,P0099442,1,0,0,0,1,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
1006037,P00323542,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,1,1,1,0
1006038,P00316642,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
1006039,P0096442,1,0,0,0,0,0,1,0,0,...,1,0,0,1,1,1,1,1,1,0


In [37]:
df = df.drop(labels=['Product_ID' ], axis=1)
df

User_ID,F,M,0-17,18-25,26-35,36-45,46-50,51-55,55+,PC1_1,...,PC3_9.0,PC3_10.0,PC3_11.0,PC3_12.0,PC3_13.0,PC3_14.0,PC3_15.0,PC3_16.0,PC3_17.0,PC3_18.0
1000001,1,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1000002,0,1,0,0,0,0,0,0,1,1,...,0,0,0,0,0,1,1,1,1,0
1000003,0,1,0,0,1,0,0,0,0,1,...,1,0,0,0,1,0,0,0,0,1
1000004,0,1,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1000005,0,1,0,0,1,0,0,0,0,1,...,0,0,0,0,0,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1006036,1,0,0,0,1,0,0,0,0,1,...,1,1,1,1,1,1,1,1,1,1
1006037,1,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,1,1,1,0
1006038,1,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,1,0,0,0,0
1006039,1,0,0,0,0,0,1,0,0,0,...,1,0,0,1,1,1,1,1,1,0
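
With one row per user and 0/1 columns, the support of an item inside a segment is simply the mean of its column within that segment. The sketch below is illustrative: it uses a made-up frame with the same layout as `df` and compares item support between the two gender segments.

```python
import pandas as pd

# Made-up per-user one-hot frame in the same layout as `df` above.
toy = pd.DataFrame(
    {
        "F":     [1, 0, 0, 1],
        "M":     [0, 1, 1, 0],
        "PC1_1": [0, 1, 1, 1],
        "PC1_5": [1, 1, 0, 1],
    },
    index=pd.Index([1000001, 1000002, 1000003, 1000004], name="User_ID"),
)

# The mean of a 0/1 column is the share of users holding that item,
# so a grouped mean gives each item's support inside each gender segment.
support_by_gender = toy.groupby("M")[["PC1_1", "PC1_5"]].mean()
support_by_gender.index = support_by_gender.index.map({0: "F", 1: "M"})
print(support_by_gender)
```

The same grouping could be applied to the age-range columns to compare segments before mining itemsets.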


We run it with a minimum support of 0.5.

In [38]:
soporte_minimo = 0.5
itemset_frecuentes = apriori(df, min_support=soporte_minimo, use_colnames=True)
# Flag itemsets containing a placeholder item (PC2_0.0 / PC3_0.0 stand for former nulls)
A = itemset_frecuentes['itemsets'].apply(lambda x: "PC2_0.0" in x or "PC3_0.0" in x)
# Drop the flagged itemsets; the original row labels are preserved
B = itemset_frecuentes.drop(itemset_frecuentes[A].index)
B

Unnamed: 0,support,itemsets
0,0.717196,(M)
1,0.910711,(PC1_1)
2,0.543711,(PC1_2)
3,0.913597,(PC1_5)
4,0.529961,(PC1_6)
...,...,...
787,0.510440,"(PC2_2.0, PC2_8.0, PC1_1, PC1_8, PC2_14.0)"
793,0.508063,"(PC2_8.0, PC1_1, PC1_8, PC2_16.0, PC2_14.0)"
847,0.501952,"(PC2_2.0, PC1_5, PC2_8.0, PC1_8, PC2_14.0)"
852,0.508912,"(PC1_5, PC2_8.0, PC1_8, PC2_16.0, PC2_14.0)"


In [39]:
B[B['itemsets'].apply(lambda x: len(x)) == 1]

Unnamed: 0,support,itemsets
0,0.717196,(M)
1,0.910711,(PC1_1)
2,0.543711,(PC1_2)
3,0.913597,(PC1_5)
4,0.529961,(PC1_6)
5,0.873196,(PC1_8)
7,0.729588,(PC2_2.0)
8,0.527075,(PC2_4.0)
9,0.592938,(PC2_5.0)
10,0.51689,(PC2_6.0)


In [40]:
for k in range(1,7):
  
  itemset_frecuentes_k = B[B['itemsets'].apply(lambda x: len(x))==k]
  num_itemsets = itemset_frecuentes_k.shape[0]
  print("Se encontraron", num_itemsets, "itemsets frecuentes para k=", k)
  

Se encontraron 19 itemsets frecuentes para k= 1
Se encontraron 75 itemsets frecuentes para k= 2
Se encontraron 96 itemsets frecuentes para k= 3
Se encontraron 63 itemsets frecuentes para k= 4
Se encontraron 17 itemsets frecuentes para k= 5
Se encontraron 1 itemsets frecuentes para k= 6
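
The per-k counts can also be obtained in one pass, without fixing the range of k in advance, by taking the length of every itemset and counting the values. A sketch on a made-up itemset table (supports are invented):

```python
import pandas as pd

# Illustrative itemset table; supports are made up.
toy_itemsets = pd.DataFrame({
    "support": [0.9, 0.8, 0.7, 0.6, 0.55],
    "itemsets": [
        frozenset({"PC1_1"}),
        frozenset({"PC1_5"}),
        frozenset({"PC1_1", "PC1_5"}),
        frozenset({"PC1_1", "PC1_8"}),
        frozenset({"PC1_1", "PC1_5", "PC1_8"}),
    ],
})

# Size of every itemset, then how many itemsets share each size k.
counts_by_k = toy_itemsets["itemsets"].apply(len).value_counts().sort_index()
print(counts_by_k)
```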


In [41]:
B[B['itemsets'].apply(lambda x: len(x)) == 2]

Unnamed: 0,support,itemsets
21,0.667119,"(PC1_1, M)"
22,0.652521,"(PC1_5, M)"
23,0.623833,"(M, PC1_8)"
25,0.551859,"(PC2_2.0, M)"
26,0.596673,"(M, PC2_8.0)"
...,...,...
122,0.584281,"(PC2_16.0, PC2_14.0)"
124,0.554575,"(PC3_16.0, PC2_14.0)"
125,0.536411,"(PC2_15.0, PC2_16.0)"
127,0.544559,"(PC2_15.0, PC3_16.0)"


In [42]:
B[B['itemsets'].apply(lambda x: len(x)) == 3]

Unnamed: 0,support,itemsets
135,0.609574,"(PC1_1, PC1_5, M)"
136,0.584281,"(PC1_1, M, PC1_8)"
138,0.551859,"(PC2_2.0, PC1_1, M)"
139,0.567815,"(PC1_1, M, PC2_8.0)"
140,0.501782,"(PC1_1, M, PC2_14.0)"
...,...,...
359,0.524868,"(PC2_2.0, PC2_8.0, PC2_16.0)"
361,0.518418,"(PC2_2.0, PC2_8.0, PC3_16.0)"
369,0.538109,"(PC2_8.0, PC2_16.0, PC2_14.0)"
371,0.518927,"(PC2_8.0, PC3_16.0, PC2_14.0)"


In [43]:
B[B['itemsets'].apply(lambda x: len(x)) == 4]

Unnamed: 0,support,itemsets
385,0.547615,"(PC1_1, PC1_5, M, PC1_8)"
387,0.508742,"(PC2_2.0, PC1_5, PC1_1, M)"
388,0.533356,"(PC1_1, PC1_5, M, PC2_8.0)"
391,0.513325,"(PC1_1, M, PC2_8.0, PC1_8)"
402,0.510949,"(PC1_5, M, PC2_8.0, PC1_8)"
...,...,...
600,0.502292,"(PC2_2.0, PC1_5, PC2_8.0, PC2_16.0)"
607,0.524699,"(PC1_5, PC2_8.0, PC2_16.0, PC2_14.0)"
609,0.507214,"(PC1_5, PC2_8.0, PC3_16.0, PC2_14.0)"
639,0.510440,"(PC2_2.0, PC2_8.0, PC1_8, PC2_14.0)"


In [44]:
B[B['itemsets'].apply(lambda x: len(x)) == 5]

Unnamed: 0,support,itemsets
711,0.562553,"(PC2_2.0, PC1_5, PC2_8.0, PC1_1, PC1_8)"
712,0.536581,"(PC2_2.0, PC1_5, PC1_1, PC1_8, PC2_14.0)"
713,0.515532,"(PC2_2.0, PC1_5, PC1_1, PC1_8, PC2_16.0)"
715,0.502292,"(PC2_2.0, PC1_5, PC3_16.0, PC1_1, PC1_8)"
717,0.589204,"(PC1_5, PC2_8.0, PC1_1, PC1_8, PC2_14.0)"
718,0.511119,"(PC2_15.0, PC1_5, PC2_8.0, PC1_1, PC1_8)"
719,0.557461,"(PC1_5, PC2_8.0, PC1_1, PC1_8, PC2_16.0)"
721,0.53081,"(PC1_5, PC2_8.0, PC3_16.0, PC1_1, PC1_8)"
722,0.526905,"(PC1_5, PC1_1, PC1_8, PC2_16.0, PC2_14.0)"
724,0.500424,"(PC1_5, PC3_16.0, PC1_1, PC1_8, PC2_14.0)"


In [45]:
B[B['itemsets'].apply(lambda x: len(x)) == 6]

Unnamed: 0,support,itemsets
921,0.501952,"(PC2_2.0, PC1_5, PC2_8.0, PC1_1, PC1_8, PC2_14.0)"


We run it with a minimum support of 0.7.

In [46]:
soporte_minimo = 0.7
itemset_frecuentes = apriori(df, min_support=soporte_minimo, use_colnames=True)
# Flag itemsets containing a placeholder item (PC2_0.0 / PC3_0.0 stand for former nulls)
A = itemset_frecuentes['itemsets'].apply(lambda x: "PC2_0.0" in x or "PC3_0.0" in x)
# Drop the flagged itemsets; the original row labels are preserved
B = itemset_frecuentes.drop(itemset_frecuentes[A].index)
B

Unnamed: 0,support,itemsets
0,0.717196,(M)
1,0.910711,(PC1_1)
2,0.913597,(PC1_5)
3,0.873196,(PC1_8)
5,0.729588,(PC2_2.0)
6,0.830929,(PC2_8.0)
7,0.757936,(PC2_14.0)
8,0.723986,(PC2_16.0)
11,0.835682,"(PC1_1, PC1_5)"
12,0.80275,"(PC1_1, PC1_8)"


In [47]:
for k in range(1,4):
  
  itemset_frecuentes_k = B[B['itemsets'].apply(lambda x: len(x))==k]
  num_itemsets = itemset_frecuentes_k.shape[0]
  print("Se encontraron", num_itemsets, "itemsets frecuentes para k=", k)
  

Se encontraron 8 itemsets frecuentes para k= 1
Se encontraron 10 itemsets frecuentes para k= 2
Se encontraron 4 itemsets frecuentes para k= 3


In [51]:
soporte_minimo = 0.3
itemset_frecuentes = apriori(df, min_support=soporte_minimo, use_colnames=True)
# Flag itemsets containing a placeholder item (PC2_0.0 / PC3_0.0 stand for former nulls)
A = itemset_frecuentes['itemsets'].apply(lambda x: "PC2_0.0" in x or "PC3_0.0" in x)
# Drop the flagged itemsets; the original row labels are preserved
B = itemset_frecuentes.drop(itemset_frecuentes[A].index)
B

Unnamed: 0,support,itemsets
0,0.717196,(M)
1,0.348498,(26-35)
2,0.910711,(PC1_1)
3,0.543711,(PC1_2)
4,0.488712,(PC1_3)
...,...,...
57105,0.315906,"(PC2_2.0, PC3_15.0, PC2_15.0, PC1_5, PC2_8.0, ..."
57360,0.300628,"(PC2_2.0, PC2_15.0, PC2_8.0, PC3_16.0, PC1_1, ..."
57379,0.312001,"(PC2_2.0, PC3_15.0, PC2_15.0, PC2_8.0, PC3_16...."
57736,0.308267,"(PC2_2.0, PC3_15.0, PC2_15.0, PC1_5, PC2_8.0, ..."


In [56]:
for k in range(1,11):
  
  itemset_frecuentes_k = B[B['itemsets'].apply(lambda x: len(x))==k]
  num_itemsets = itemset_frecuentes_k.shape[0]
  print("Se encontraron", num_itemsets, "itemsets frecuentes para k=", k)
  

Se encontraron 30 itemsets frecuentes para k= 1
Se encontraron 302 itemsets frecuentes para k= 2
Se encontraron 1363 itemsets frecuentes para k= 3
Se encontraron 3304 itemsets frecuentes para k= 4
Se encontraron 4520 itemsets frecuentes para k= 5
Se encontraron 3523 itemsets frecuentes para k= 6
Se encontraron 1550 itemsets frecuentes para k= 7
Se encontraron 376 itemsets frecuentes para k= 8
Se encontraron 48 itemsets frecuentes para k= 9
Se encontraron 1 itemsets frecuentes para k= 10
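
The three runs show the number of frequent itemsets growing quickly as the minimum support is lowered. The same effect can be reproduced on a toy example with a brute-force enumeration. This is a pure-Python sketch on made-up transactions, not the Apriori algorithm that mlxtend uses:

```python
from itertools import combinations

# Made-up transactions; each set lists the items one user bought.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration: fine for a toy example, not for real data."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            # Count the transactions that contain every item of the candidate.
            hits = sum(1 for t in transactions if set(candidate) <= t)
            if hits / n >= min_support:
                result[candidate] = hits / n
    return result

for threshold in (0.7, 0.5, 0.3):
    found = frequent_itemsets(transactions, threshold)
    print(threshold, len(found))
```

Every itemset that passes a higher threshold also passes a lower one, so the count can only grow as the threshold drops.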
