## Association rule learning

Goal:
    
    derive rules of the form {A} -> {B}
    

In [2]:
import numpy as np
from itertools import combinations, groupby
from collections import Counter

In [8]:
# Example
compras = np.array([[1,'manzanas'],[1,'mandarinas'], [1,'huevos'], [1,'leche'], [2,'leche'], [2,'huevos']], dtype=object)
compras


array([[1, 'manzanas'],
       [1, 'mandarinas'],
       [1, 'huevos'],
       [1, 'leche'],
       [2, 'leche'],
       [2, 'huevos']], dtype=object)

First we need to count how many times each item appears, then each pair of items, and so on.

For example, 'manzanas' appears once (in only one of the two purchases), so its support is 1/2, i.e. 50%.

In [4]:
c=Counter(compras[:,1])
print(c)

soportes={}
for key in c:    
    soportes[key]=c[key]/2  # 2 = total number of purchases
print('soportes:',soportes)

Counter({'huevos': 2, 'leche': 2, 'manzanas': 1, 'mandarinas': 1})
soportes: {'manzanas': 0.5, 'mandarinas': 0.5, 'huevos': 1.0, 'leche': 1.0}
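
The cell above divides by a hard-coded 2, the number of purchases in the toy data. A minimal sketch that derives the denominator from the data instead is shown below. The helper name `item_supports` and the variable names are hypothetical, introduced only for illustration.

```python
from collections import Counter

def item_supports(rows):
    """Return {item: fraction of purchases that contain the item}.

    `rows` is an iterable of (purchase_id, item) pairs.
    `item_supports` is a hypothetical helper, not a library function.
    """
    rows = [tuple(row) for row in rows]
    # Number of distinct purchases replaces the hard-coded 2
    n_purchases = len({purchase_id for purchase_id, _ in rows})
    # set() drops repeated (purchase, item) rows, so an item is
    # counted at most once per purchase
    counts = Counter(item for _, item in set(rows))
    return {item: count / n_purchases for item, count in counts.items()}

compras_demo = [(1, 'manzanas'), (1, 'mandarinas'), (1, 'huevos'),
                (1, 'leche'), (2, 'leche'), (2, 'huevos')]
soportes_demo = item_supports(compras_demo)
print(soportes_demo)
```

With the six rows above this gives 0.5 for 'manzanas' and 'mandarinas' and 1.0 for 'huevos' and 'leche', the same values as the hard-coded version.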


purchase 1:  **manzanas, mandarinas, huevos, leche**   -->  item pairs: {'manzanas', 'mandarinas'}, {'manzanas', 'huevos'}, {'manzanas', 'leche'}, {'mandarinas', 'huevos'}, {'mandarinas', 'leche'}, {'huevos', 'leche'}

purchase 2:  **huevos, leche**          -->  item pairs: {'huevos', 'leche'}


In [5]:
# Generator that yields item pairs, one at a time
# Requires the data to be sorted by purchase
def get_item_pairs(entrada):
    
    # For each order, generate a list of items in that order
    for order_id, order_object in groupby(entrada, lambda x: x[0]):
        item_list = [item[1] for item in order_object]      
    
        # For each item list, generate item pairs, one at a time
        for item_pair in combinations(item_list, 2):
            yield item_pair

In [6]:
c3=Counter(get_item_pairs(compras))
print (c3)
print(c3[('huevos', 'leche')])
print(c3[('manzanas', 'huevos')])

Counter({('manzanas', 'mandarinas'): 1, ('manzanas', 'huevos'): 1, ('manzanas', 'leche'): 1, ('mandarinas', 'huevos'): 1, ('mandarinas', 'leche'): 1, ('huevos', 'leche'): 1, ('leche', 'huevos'): 1})
1
1


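What happened?

The pair huevos/leche was counted under two different keys, `('huevos', 'leche')` and `('leche', 'huevos')`, because `combinations` keeps the order in which the items appear within each purchase.

**Warning: to avoid this error when using `get_item_pairs`, the data must be sorted first by purchase and then by item (for example alphabetically):**

purchase 1:  **manzanas, mandarinas, huevos, leche**   -->  item pairs: ('mandarinas', 'manzanas'), ('huevos', 'manzanas'), ('leche', 'manzanas'), ('huevos', 'mandarinas'), ('leche', 'mandarinas'), ('huevos', 'leche')

purchase 2:  **huevos, leche**          -->  item pairs: ('huevos', 'leche')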
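
One way to make the pair count independent of item order is to sort the items inside the generator itself. The sketch below is a hypothetical variant of `get_item_pairs` (the name `get_sorted_item_pairs` is made up here). It still assumes that the rows of each purchase are contiguous, since `groupby` only merges consecutive rows.

```python
from collections import Counter
from itertools import combinations, groupby

def get_sorted_item_pairs(entrada):
    """Yield item pairs with the items of each purchase sorted first.

    Hypothetical variant of get_item_pairs: rows must still be grouped
    by purchase id, but the item order inside a purchase no longer matters.
    """
    for _, rows in groupby(entrada, key=lambda row: row[0]):
        items = sorted({row[1] for row in rows})  # unique items, alphabetical
        yield from combinations(items, 2)

# Same rows as the first example, with purchase 2 in "leche, huevos" order
compras_demo = [(1, 'manzanas'), (1, 'mandarinas'), (1, 'huevos'),
                (1, 'leche'), (2, 'leche'), (2, 'huevos')]
pares = Counter(get_sorted_item_pairs(compras_demo))
print(pares[('huevos', 'leche')])
```

Here both purchases contribute to the single key `('huevos', 'leche')`, so no reversed duplicate key is created.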

In [12]:
# Sample data, with the items of purchase 2 reordered (huevos before leche, as in purchase 1)
compras = np.array([[1,'manzanas'],[1,'mandarinas'], [1,'huevos'], [1,'leche'], [2,'huevos'], [2,'leche']], dtype=object)

c3=Counter(get_item_pairs(compras))
print (c3)
print(c3[('huevos', 'leche')])
print(c3[('manzanas', 'huevos')])

Counter({('huevos', 'leche'): 2, ('manzanas', 'mandarinas'): 1, ('manzanas', 'huevos'): 1, ('manzanas', 'leche'): 1, ('mandarinas', 'huevos'): 1, ('mandarinas', 'leche'): 1})
2
1


In [11]:
transactions=[]  # requires the data to be sorted by purchase
for compra_id, compra_object in groupby(compras, lambda x: x[0]):
    transactions.append([item[1] for item in compra_object])
print(transactions)

[['manzanas', 'mandarinas', 'huevos', 'leche'], ['leche', 'huevos']]
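
`groupby` from `itertools` only merges consecutive rows that share a key, which is why the data must be sorted by purchase. As a hedged alternative, the sketch below groups rows with a dictionary, so the rows may arrive in any order. The helper `build_transactions` and the sample rows are hypothetical.

```python
from collections import defaultdict

def build_transactions(rows):
    """Group (purchase_id, item) rows into one item list per purchase.

    Hypothetical helper: works even if the rows are not sorted by purchase.
    """
    por_compra = defaultdict(list)
    for purchase_id, item in rows:
        por_compra[purchase_id].append(item)
    # dicts keep insertion order, so purchases appear in first-seen order
    return list(por_compra.values())

# Rows of purchases 1 and 2 are deliberately interleaved
filas = [(1, 'manzanas'), (2, 'leche'), (1, 'huevos'),
         (2, 'huevos'), (1, 'leche')]
transacciones_demo = build_transactions(filas)
print(transacciones_demo)
```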


https://pypi.org/project/efficient-apriori/

In [10]:
from efficient_apriori import apriori

# apriori from efficient_apriori takes a list of transactions (items need not be sorted within a transaction)

itemsets, rules = apriori(transactions, min_support=0.6,  min_confidence=0.6)  # min_support and min_confidence lie between 0 and 1
print(itemsets)
print(rules)
rules=sorted(rules, key=lambda rule: rule.confidence)
for rule in rules:
  print(rule) # Prints the rule and its confidence, support, lift, ...

{1: {('huevos',): 2, ('leche',): 2}, 2: {('huevos', 'leche'): 2}}
[{leche} -> {huevos}, {huevos} -> {leche}]
{leche} -> {huevos} (conf: 1.000, supp: 1.000, lift: 1.000, conv: 0.000)
{huevos} -> {leche} (conf: 1.000, supp: 1.000, lift: 1.000, conv: 0.000)


In [13]:
# Another synthetic example:
from efficient_apriori import apriori
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
itemsets, rules = apriori(transactions, min_support=0.5,  min_confidence=1)
print(rules)  # [{eggs} -> {bacon}, {soup} -> {bacon}]
rules=sorted(rules, key=lambda rule: rule.confidence)
for rule in rules:
  print(rule) # Prints the rule and its confidence, support, lift, ...

[{eggs} -> {bacon}, {soup} -> {bacon}]
{eggs} -> {bacon} (conf: 1.000, supp: 0.667, lift: 1.000, conv: 0.000)
{soup} -> {bacon} (conf: 1.000, supp: 0.667, lift: 1.000, conv: 0.000)
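
To see where the printed numbers come from, the sketch below recomputes support, confidence and lift by hand for `{eggs} -> {bacon}`, using the usual textbook definitions. The helper names `support` and `rule_metrics` are hypothetical, and conviction is left out because its edge cases (confidence equal to 1) may be handled differently by the library.

```python
transactions_demo = [('eggs', 'bacon', 'soup'),
                     ('eggs', 'bacon', 'apple'),
                     ('soup', 'bacon', 'banana')]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def rule_metrics(lhs, rhs, transactions):
    """Support, confidence and lift of the rule lhs -> rhs."""
    supp = support(set(lhs) | set(rhs), transactions)  # both sides together
    conf = supp / support(lhs, transactions)           # P(rhs | lhs)
    lift = conf / support(rhs, transactions)           # conf relative to P(rhs)
    return supp, conf, lift

supp, conf, lift = rule_metrics({'eggs'}, {'bacon'}, transactions_demo)
print(f"supp: {supp:.3f}, conf: {conf:.3f}, lift: {lift:.3f}")
```

The lift of 1 reflects that 'bacon' is in every transaction: knowing that 'eggs' was bought does not make 'bacon' any more likely than it already was.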


## Real data
### Analysis of the *Instacart* data
Available at:
https://www.instacart.com/datasets/grocery-shopping-2017

In [16]:
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter
from IPython.display import display

Data description:

order_products__prior.csv

order_id,product_id,add_to_cart_order,reordered

`orders`, i.e. the purchases (3.4m rows, 206k users):
* `order_id`: order identifier
* `user_id`: customer identifier
* `eval_set`: which evaluation set this order belongs in (see `SET` described below)
* `order_number`: the order sequence number for this user (1 = first, n = nth)
* `order_dow`: the day of the week the order was placed on
* `order_hour_of_day`: the hour of the day the order was placed on
* `days_since_prior`: days since the last order, capped at 30 (with NAs for `order_number` = 1)

`products` (50k rows):
* **`product_id`: product identifier**
* **`product_name`: name of the product**
* `aisle_id`: foreign key
* `department_id`: foreign key

`aisles` (134 rows):
* `aisle_id`: aisle identifier
* `aisle`: the name of the aisle

`departments` (21 rows):
* `department_id`: department identifier
* `department`: the name of the department

**`order_products__SET` (30m+ rows):**
* **`order_id`: foreign key**
* **`product_id`: foreign key**
* `add_to_cart_order`: order in which each product was added to cart
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise

where `SET` is one of the three following evaluation sets (`eval_set` in `orders`):
* **`"prior"`**: orders prior to that user's most recent order (~3.2m orders)
* `"train"`: training data supplied to participants (~131k orders)
* `"test"`: test data reserved for machine learning competitions (~75k orders)
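
Only the columns marked in bold are needed to build transactions with readable product names. The sketch below uses tiny made-up tables (not real Instacart rows) that mimic the column layout above, to illustrate how `product_id` links `order_products__SET` to `products`.

```python
import pandas as pd

# Toy rows, made up for illustration; only the column names follow the schema
order_products = pd.DataFrame({'order_id': [1, 1, 2],
                               'product_id': [10, 20, 10]})
products = pd.DataFrame({'product_id': [10, 20],
                         'product_name': ['Banana', 'Limes']})

# product_id is the foreign key that links both tables
joined = order_products.merge(products, on='product_id', how='inner')
# mergesort is stable, so rows keep their relative order within each order_id
joined = joined.sort_values('order_id', kind='mergesort')
print(joined[['order_id', 'product_name']])
```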

In [14]:
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))
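
`sys.getsizeof` delegates to the object's own `__sizeof__` method. For DataFrames, pandas also offers `DataFrame.memory_usage`, which reports the size per column. A minimal sketch of an equivalent helper follows; the name `size_mb` and the toy DataFrame are hypothetical.

```python
import pandas as pd

def size_mb(df):
    """Memory used by a DataFrame in MB; deep=True also counts string contents."""
    return "{0:.2f} MB".format(df.memory_usage(deep=True).sum() / (1000 * 1000))

# Toy DataFrame, for illustration only
demo = pd.DataFrame({'order_id': [2, 2, 3],
                     'product_id': [33120, 28985, 9327]})
print(size_mb(demo))
```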

In [17]:
orders = pd.read_csv('instacart_2017_05_01/order_products__prior_short.csv')
# working with the full dataset is very heavy...
#orders = pd.read_csv('instacart_2017_05_01/order_products__prior.csv')
print('orders -- dimensions: {0};   size: {1}'.format(orders.shape, size(orders)))
display(orders.head())
display(orders.tail())
items_names = pd.read_csv('instacart_2017_05_01/products.csv')
display(items_names.head())

# decode the product names
compras_df = pd.merge(orders[['order_id','product_id']], items_names[['product_id','product_name']] ,on='product_id', how= "inner")

display(compras_df.head())
compras_df=compras_df.sort_values( by='order_id', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
compras=compras_df.values[:,[0,2]]
print(compras)

orders -- dimensions: (5251, 4);   size: 0.17 MB


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
5246,555,29574,6,1
5247,555,41787,7,1
5248,555,47788,8,1
5249,555,46979,9,1
5250,555,47626,10,1


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


Unnamed: 0,order_id,product_id,product_name
0,2,33120,Organic Egg Whites
1,26,33120,Organic Egg Whites
2,120,33120,Organic Egg Whites
3,327,33120,Organic Egg Whites
4,390,33120,Organic Egg Whites


[[2 'Organic Egg Whites']
 [2 'Michigan Organic Kale']
 [2 'Classic Blend Cole Slaw']
 ...
 [555 'Tempt Unsweetened Vanilla Hemp Milk']
 [555 'Large Lemon']
 [555 'Honeydew Melon']]


In [18]:
transactions=[]
for orders_id, order_object in groupby(compras, lambda x: x[0]):
    transactions.append([item[1] for item in order_object])
print(transactions)
# with the full dataset this print fails with:
# IOPub data rate exceeded.
# The notebook server will temporarily stop sending output

[['Organic Egg Whites', 'Michigan Organic Kale', 'Classic Blend Cole Slaw', 'All Natural No Stir Creamy Almond Butter', 'Original Unflavored Gelatine Mix', 'Garlic Powder', 'Carrots', 'Natural Sweetener', 'Coconut Butter'], ['Air Chilled Organic Boneless Skinless Chicken Breasts', 'Organic Ezekiel 49 Bread Cinnamon Raisin', 'Organic Baby Spinach', 'Lemons', 'Unsweetened Almondmilk', 'Organic Ginger Root', 'Total 2% with Strawberry Lowfat Greek Strained Yogurt', 'Unsweetened Chocolate Almond Breeze Almond Milk'], ['Sugarfree Energy Drink', 'Energy Drink', 'Original Orange Juice', 'Goldfish Cheddar Baked Snack Crackers', 'Traditional Snack Mix', "Kellogg's Nutri-Grain Blueberry Cereal", 'Tiny Twists Pretzels', "Kellogg's Nutri-Grain Apple Cinnamon Cereal", 'Oats & Chocolate Chewy Bars', 'Chewy 25% Low Sugar Chocolate Chip Granola', 'Honey/Lemon Cough Drops', 'Plain Pre-Sliced Bagels', 'Nutri-Grain Soft Baked Strawberry Cereal Breakfast Bars'], ['Natural Artesian Water, Mini & Mobile', 'M
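
The same list of transactions can also be built directly in pandas, which does not require sorted input, and printing only a short preview avoids flooding the notebook output. The sketch below uses a made-up stand-in for `compras_df`; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for compras_df (made-up rows, deliberately unsorted)
compras_demo_df = pd.DataFrame({
    'order_id': [2, 3, 2, 3],
    'product_name': ['Carrots', 'Lemons', 'Garlic Powder', 'Organic Baby Spinach'],
})

# One list of product names per order
transactions_demo = (compras_demo_df.groupby('order_id')['product_name']
                     .apply(list).tolist())

print(len(transactions_demo))   # number of transactions
print(transactions_demo[:2])    # short preview instead of the whole list
```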

## Be careful with large datasets when using apriori from efficient-apriori

https://pypi.org/project/efficient-apriori/

Working with large datasets:

If you have data that is too large to fit into memory, you may pass a function returning a generator instead of a list. **The min_support will most likely have to be a large value, or the algorithm will take very long before it terminates**. If you have massive amounts of data, this Python implementation is likely not fast enough, and **you should consult more specialized implementations**.


    def data_generator(filename):
        """
        Data generator, needs to return a generator to be called several times.
        """
        def data_gen():
            with open(filename) as file:
                for line in file:
                    yield tuple(k.strip() for k in line.split(','))

        return data_gen

    transactions = data_generator('dataset.csv')
    itemsets, rules = apriori(transactions, min_support=0.9, min_confidence=0.6)
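
The docstring above says the generator must be callable several times. The sketch below illustrates the difference between a function that returns a fresh generator and a single generator object, which is exhausted after one pass. It writes a small made-up file to a temporary location so that it is self-contained; the file contents are hypothetical.

```python
import os
import tempfile

def data_generator(filename):
    """Return a function that opens the file afresh on every call."""
    def data_gen():
        with open(filename) as file:
            for line in file:
                yield tuple(k.strip() for k in line.split(','))
    return data_gen

# Write a tiny made-up file so the sketch can run on its own
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("eggs,bacon,soup\neggs,bacon,apple\n")
    path = tmp.name

gen_factory = data_generator(path)
first_pass = list(gen_factory())
second_pass = list(gen_factory())   # a fresh generator yields the rows again

single = gen_factory()
list(single)                        # consume the generator once
leftover = list(single)             # an exhausted generator yields nothing

os.remove(path)
print(first_pass == second_pass, leftover)
```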

In [24]:
from efficient_apriori import apriori
# WARNING: do not set min_support too small, e.g. 0.001!
# min_support and min_confidence must lie between 0 and 1
itemsets, rules = apriori(transactions, min_support=0.009,  min_confidence=0.3)
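
A rough way to see why a small `min_support` is costly: the lower the threshold, the more single items survive, and the number of candidate pairs grows roughly with the square of that number. The sketch below counts both on made-up transactions. The helper `frequent_items` is hypothetical and assumes a "support at least `min_support`" cutoff.

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """Items whose support is at least `min_support` (hypothetical helper)."""
    n = len(transactions)
    # set(t) counts each item at most once per transaction
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, c in counts.items() if c / n >= min_support}

# Made-up transactions, for illustration only
transactions_demo = [['a', 'b'], ['a', 'c'], ['a', 'b', 'd'], ['e']]
for min_support in (0.75, 0.5, 0.25):
    items = frequent_items(transactions_demo, min_support)
    n_pairs = len(items) * (len(items) - 1) // 2   # pairs that would be checked next
    print(min_support, sorted(items), "candidate pairs:", n_pairs)
```

On this toy data the candidate pairs go from 0 to 1 to 10 as the threshold drops from 0.75 to 0.25.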

In [25]:
#rules = filter(lambda rule: len(rule.lhs) == 2 and len(rule.rhs) == 1, rules)
rules=sorted(rules, key=lambda rule: rule.confidence)
for rule in rules:
  print(rule) # Prints the rule and its confidence, support, lift, ...

{Organic Raspberries} -> {Organic Hass Avocado} (conf: 0.300, supp: 0.011, lift: 4.922, conv: 1.341)
{Organic Yellow Onion} -> {Bag of Organic Bananas} (conf: 0.304, supp: 0.013, lift: 2.619, conv: 1.270)
{Carrots} -> {Bag of Organic Bananas} (conf: 0.312, supp: 0.010, lift: 2.690, conv: 1.286)
{Raspberries} -> {Bag of Organic Bananas} (conf: 0.312, supp: 0.010, lift: 2.690, conv: 1.286)
{Limes} -> {Banana} (conf: 0.333, supp: 0.013, lift: 2.365, conv: 1.289)
{Organic Baby Spinach} -> {Banana} (conf: 0.366, supp: 0.029, lift: 2.596, conv: 1.355)
{Raspberries} -> {Strawberries} (conf: 0.375, supp: 0.011, lift: 6.562, conv: 1.509)
{Cucumber Kirby} -> {Banana} (conf: 0.400, supp: 0.011, lift: 2.838, conv: 1.432)
{Honeycrisp Apple} -> {Banana} (conf: 0.400, supp: 0.015, lift: 2.838, conv: 1.432)
{Organic Blueberries} -> {Strawberries} (conf: 0.429, supp: 0.011, lift: 7.500, conv: 1.650)
{Organic Raspberries} -> {Bag of Organic Bananas} (conf: 0.450, supp: 0.017, lift: 3.873, conv: 1.607)
{