Author: mzibecchi  
email: mzibecchi@gmail.com

In [1]:
## Association rule learning

Objective:
    
    Obtain association rules between movies in the MovieLens dataset.
    We assume that users make recommendations, and we want to find association rules among those recommendations.
    We take as a premise that a user recommends the movies they rate above a certain threshold.
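
For reference, a rule X -> Y is usually scored with three measures: support (the fraction of transactions that contain both X and Y), confidence (support of X and Y together divided by the support of X), and lift (confidence divided by the support of Y). The following is a minimal sketch on made-up transactions; the item labels and the helper names `support`, `confidence` and `lift` are illustrative assumptions, not part of this notebook.

```python
# Toy transactions: each set holds the movies one hypothetical user recommends.
toy_transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Confidence of lhs -> rhs relative to how common `rhs` is on its own."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

print(support({"A", "B"}, toy_transactions))
print(confidence({"A"}, {"B"}, toy_transactions))
print(lift({"A"}, {"B"}, toy_transactions))
```

A lift below 1 suggests the two sides co-occur less often than they would if they were independent; a lift above 1 suggests the opposite.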
     

In [2]:
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter
from IPython.display import display

Description of the data:

movies.csv (movieId, title, genres)
ratings.csv (userId, movieId, rating, timestamp)



We develop on a smaller dataset first, then move on to the 20m one.

In [5]:
# small dataset
path = "ml-latest-small"

# 20m records dataset
#path = "ml-20m"

In [6]:
# in the end we did not use this dataset
#tags_df = pd.read_csv(path+"/tags.csv")
#tags_df.head()

In [7]:
movies_df = pd.read_csv(path+"/movies.csv")
print(len(movies_df))
movies_df.head()


9742


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
ratings1_df = pd.read_csv(path+"/ratings.csv")
print(len(ratings1_df))
ratings1_df.head()

100836


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


We merge both datasets, using movieId as the join key.

In [9]:
ratings_df = pd.merge(ratings1_df, movies_df, on='movieId')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
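
Note that `pd.merge` performs an inner join by default, so any rating whose `movieId` is missing from the movies table would be dropped without warning. The sketch below illustrates this on made-up frames (`ratings_toy`, `movies_toy` and `merged` are hypothetical names). It also passes `validate="many_to_one"`, which asks pandas to raise an error if a `movieId` appears more than once on the movies side.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 99],   # movieId 99 has no entry in movies_toy
    "rating": [4.0, 3.5, 5.0],
})
movies_toy = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
})

# Inner join on movieId; many ratings may point at one movie, never the reverse.
merged = pd.merge(ratings_toy, movies_toy, on="movieId", validate="many_to_one")

# Comparing lengths before and after reveals rows lost in the join.
print(len(ratings_toy), len(merged))
```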


Let's see which rating values exist:

In [10]:
ratings_df.rating.unique()

array([4. , 4.5, 2.5, 3.5, 3. , 5. , 0.5, 2. , 1.5, 1. ])

We will assume that a recommended movie is one with rating > 2.

In [11]:
ratings_df["recommended"] = ratings_df["rating"] > 2
recommended_df = ratings_df[ratings_df["recommended"]]
recommended_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,recommended
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,True
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,True
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,True
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,True
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,True


In [12]:
# version 2.0 of this notebook will have a more efficient implementation of this...
# for now, I was more interested in understanding how to build the transactions

transactions = []

grouped = recommended_df.groupby('userId')

for names, group  in grouped:
    
    transaction_user = []
    
    for row_index, movie in group.iterrows():
        transaction_user.append( movie['title'] )
        
    transactions.append( transaction_user )
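
The loop above could probably be replaced by a single groupby aggregation, which avoids the row-by-row cost of `iterrows`. This is a sketch under the assumption that only the `userId` and `title` columns matter; the toy frame `toy_recommended_df` stands in for `recommended_df`.

```python
import pandas as pd

# Stand-in for recommended_df with just the two columns the loop uses.
toy_recommended_df = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3, 3],
    "title": [
        "Toy Story (1995)", "Jumanji (1995)",
        "Toy Story (1995)",
        "Heat (1995)", "Toy Story (1995)", "Jumanji (1995)",
    ],
})

# One list of titles per user; groups come out sorted by userId.
toy_transactions = (
    toy_recommended_df.groupby("userId")["title"].apply(list).tolist()
)

print(len(toy_transactions))
```

The result has the same shape as `transactions`: a list with one inner list of titles per user.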
    

In [13]:
print("Number of transactions: "+str(len(transactions)))

Number of transactions: 610


In [14]:
from efficient_apriori import apriori

# apriori from efficient_apriori takes a list of transactions (the items within each transaction need not be sorted)

itemsets, rules = apriori(transactions, min_support=0.3, min_confidence=0.2, max_length=5)  # min_support and min_confidence are between 0 and 1

print("\nItemsets:\n")
print(itemsets)

rules=sorted(rules, key=lambda rule: rule.confidence)

print("\nRules:\n")

for rule in rules:
    print(rule)  # Prints the rule and its confidence, support, lift, ...
    print("\n")


Itemsets:

{1: {('American Beauty (1999)',): 192, ('Apollo 13 (1995)',): 188, ('Braveheart (1995)',): 224, ('Fight Club (1999)',): 209, ('Forrest Gump (1994)',): 318, ('Fugitive, The (1993)',): 186, ('Jurassic Park (1993)',): 221, ('Lord of the Rings: The Fellowship of the Ring, The (2001)',): 185, ('Matrix, The (1999)',): 264, ('Pulp Fiction (1994)',): 290, ('Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',): 195, ('Saving Private Ryan (1998)',): 183, ("Schindler's List (1993)",): 208, ('Seven (a.k.a. Se7en) (1995)',): 191, ('Shawshank Redemption, The (1994)',): 313, ('Silence of the Lambs, The (1991)',): 270, ('Star Wars: Episode IV - A New Hope (1977)',): 239, ('Star Wars: Episode V - The Empire Strikes Back (1980)',): 204, ('Star Wars: Episode VI - Return of the Jedi (1983)',): 186, ('Terminator 2: Judgment Day (1991)',): 216, ('Toy Story (1995)',): 207, ('Usual Suspects, The (1995)',): 199}, 2: {('Forrest Gump (1994)', 'Pulp Fiction (1994)'): 211, 
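
The numbers in the itemsets output are absolute counts, keyed by itemset length. With 610 transactions and `min_support=0.3`, an itemset needs to appear in roughly 183 of them to be kept, which matches the smallest counts shown. The following brute-force sketch shows how the first two levels could be counted by hand with `Counter` and `combinations`. The toy data and the helper `frequent_itemsets` are assumptions for illustration, and this is not how `efficient_apriori` works internally (it prunes candidates rather than enumerating everything).

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_length=2):
    """Count itemsets up to `max_length` and keep those meeting `min_support`.

    Brute force: enumerates every combination in every transaction with no
    Apriori pruning, so it is only practical for small inputs.
    """
    n = len(transactions)
    result = {}
    for k in range(1, max_length + 1):
        counts = Counter()
        for t in transactions:
            # sorted(set(t)) gives every itemset one canonical tuple form
            counts.update(combinations(sorted(set(t)), k))
        kept = {items: c for items, c in counts.items() if c / n >= min_support}
        if kept:
            result[k] = kept
    return result

toy = [
    ["Forrest Gump (1994)", "Pulp Fiction (1994)", "Matrix, The (1999)"],
    ["Forrest Gump (1994)", "Pulp Fiction (1994)"],
    ["Forrest Gump (1994)", "Matrix, The (1999)"],
    ["Pulp Fiction (1994)"],
]
toy_itemsets = frequent_itemsets(toy, min_support=0.5)
print(toy_itemsets)
```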

In [152]:
print(len(rules))

280


### Final report  
  
  
I managed to run it on the 20m dataset.
With a support of 0.5, we obtain neither itemsets nor rules;
with a support of 0.4, we obtain itemsets but no rules;
on selecting a low support (0.2), the number of rules found grows (280).
With a support of 0.05, no final