## Association Rules

Association rules identify patterns that occur frequently among the items of a large dataset. In this exercise we analyze viewing patterns on a movie platform (such as Netflix) where people watch films and series. Some patterns are easy to spot, such as people who like superhero movies or people who watch cartoons.

Association rules are usually written in the form **{A} -> {B}**, which means that there is a strong relationship between items A and B. For example, a plausible rule for a streaming platform is **{The Lord of the Rings} -> {The Hobbit}**.

If people who watch one movie frequently also watch another, that is, the two movies are often watched together, the platform can use this pattern to increase views of some titles through recommendations.

In the example above, **{The Lord of the Rings} -> {The Hobbit}**, **{The Lord of the Rings}** is the **antecedent** and **{The Hobbit}** is the **consequent**. Antecedents and consequents can contain multiple items; for example, a valid rule is **{Thor: Ragnarok, Avengers: Infinity War} -> {Avengers: Endgame}**.
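
One way to make the notation concrete is to store a rule as a pair of sets. The sketch below is only an illustration: the `Rule` class and its `applies_to` and `recommend` methods are hypothetical names, not part of this notebook or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: frozenset
    consequent: frozenset

    def applies_to(self, watched):
        """True when the user has watched every movie in the antecedent."""
        return self.antecedent <= set(watched)

    def recommend(self, watched):
        """Consequent movies not watched yet (empty if the rule does not apply)."""
        if not self.applies_to(watched):
            return set()
        return set(self.consequent) - set(watched)


rule = Rule(
    antecedent=frozenset({"Thor: Ragnarok", "Avengers: Infinity War"}),
    consequent=frozenset({"Avengers: Endgame"}),
)
user_history = ["Thor: Ragnarok", "Avengers: Infinity War", "Coco"]
print(rule.recommend(user_history))
```

A user who has watched only part of the antecedent gets no recommendation from this rule, which matches the reading of **{A} -> {B}** as "whoever watched all of A tends to watch B".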

Why use association rules?

* Easy to explain to non-technical people
* Little need for data preparation or feature engineering
* A good starting point for exploring data


## Identifying frequent patterns among video streaming users
In this example we use association rules to analyze a dataset of transactions, where each transaction consists of the movies that a single user of a movie platform watched within a given time interval.

Example based on the tutorial available at: https://medium.com/@fabio.italiano/the-apriori-algorithm-in-python-expanding-thors-fan-base-501950d55be9

<img src="fig_apriori/Streaming-Movie.jpg">

### Step 1) Reading the dataset

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
df = pd.read_csv('dataset_movies/movie_dataset.txt',header=None)

In [3]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,The Revenant,13 Hours,Allied,Zootopia,Jigsaw,Achorman,Grinch,Fast and Furious,Ghostbusters,Wolverine,Mad Max,John Wick,La La Land,The Good Dunosaur,Ninja Turtles,The Good Dunosaur Bad Moms,2 Guns,Inside Out,Valerian,Spiderman 3
1,Beirut,Martian,Get Out,,,,,,,,,,,,,,,,,
2,Deadpool,,,,,,,,,,,,,,,,,,,
3,X-Men,Allied,,,,,,,,,,,,,,,,,,
4,Ninja Turtles,Moana,Ghost in the Shell,Ralph Breaks the Internet,John Wick,,,,,,,,,,,,,,,


Each line of the file lists the movies that a given user watched. We treat this set of movies as the itemset of one transaction.

However, we need to transform the data into a dataframe in which each column corresponds to a movie and each row to a user. Each cell contains 1 if the user watched the movie and 0 otherwise.
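
On a toy example, the target format looks as follows. This is a minimal pure-Python sketch with made-up transactions (the `toy_transactions` list and the `one_hot` helper are assumptions for illustration); the notebook itself builds the same structure as a pandas dataframe.

```python
def one_hot(transactions):
    """Return (sorted movie list, 0/1 matrix) with one row per transaction."""
    movies = sorted({movie for t in transactions for movie in t})
    matrix = [[1 if movie in t else 0 for movie in movies] for t in transactions]
    return movies, matrix


toy_transactions = [
    ["Thor", "Avengers"],
    ["Moana"],
    ["Avengers", "Moana", "Coco"],
]
movies, matrix = one_hot(toy_transactions)
print(movies)
for row in matrix:
    print(row)
```

Each row has one position per distinct movie, so transactions of different lengths all end up with the same width.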

In [4]:
import numpy as np

In [5]:
rows = df.shape[0]

In [6]:
filmes = set()
for i in range(rows):
    filmes = filmes.union(set(df.iloc[i].unique()))


In [7]:
np.nan in filmes

True

In [8]:
filmes.difference_update({np.nan})

In [9]:
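# A set has no defined order and newer pandas versions reject it as `columns`, so convert it to a list
df_ = pd.DataFrame(columns=list(filmes),data=np.zeros((rows,len(filmes))))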

In [10]:
df_.head()

Unnamed: 0,Ted,Captain America,Watchmen,Red Sparrow,Plant of the Apes,Thor,Suicide Squad,Cafe Society,Beirut,Martian,...,Mamma Mia,Venom,Ghostbusters,Valerian,Vice,The Incredibles,Guardians of the Galaxy,La La Land,Hulk,Allied
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
def set_units(x):
    return 1

In [12]:
for i in range(rows):
    # .loc accepts several column labels at once; .at only addresses a single cell
    df_.loc[i, df.iloc[i].dropna().unique()] = 1.

In [13]:
df_.head()

Unnamed: 0,Ted,Captain America,Watchmen,Red Sparrow,Plant of the Apes,Thor,Suicide Squad,Cafe Society,Beirut,Martian,...,Mamma Mia,Venom,Ghostbusters,Valerian,Vice,The Incredibles,Guardians of the Galaxy,La La Land,Hulk,Allied
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### The Apriori Algorithm
A few concepts are essential for understanding the Apriori algorithm.


**Support**: the number of transactions in which the itemset appears divided by the total number of transactions.

$$supp(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}$$

For example, we can compute the support of the movie "Jumanji" as follows.

In [14]:
def supp(df_,X):
    union = np.prod(df_[X].values,axis=1)
    return len(np.nonzero(union)[0])/df_.shape[0]

In [26]:
supp(df_,["Jumanji"])

0.09825356619117451

In [16]:
supp(df_,['Jumanji','Wonder Woman'])

0.005332622317024397
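
The same definition can be checked by hand on a few transactions. The sketch below uses made-up data and a hypothetical `support` helper that works directly on Python sets instead of the 0/1 dataframe.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


toy = [
    {"Jumanji", "Moana"},
    {"Jumanji", "Wonder Woman"},
    {"Moana"},
    {"Jumanji"},
]
print(support(toy, ["Jumanji"]))                  # 3 of the 4 transactions
print(support(toy, ["Jumanji", "Wonder Woman"]))  # 1 of the 4 transactions
```

Adding an item to an itemset can only keep or lower its support, which is also what the two values computed on the real dataset show.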

**Frequent itemset**: A set of items $\{i_1,i_2, ..., i_n\}$ is frequent when its support is at least a minimum support threshold, $min\_supp$.

**Confidence**: indicates how often a rule holds. It is the fraction of the transactions containing X that also contain Y, so the higher the confidence, the more likely Y is to appear in a transaction that contains X. It is given by:

$$conf(X \rightarrow Y) = supp(X \cup Y)/supp(X)$$


For example, the confidence of the rule **{Avengers} -> {Thor}** is given by:

In [17]:
def confidence(df_, X, Y):
    return supp(df_,X+Y)/supp(df_,X)

In [19]:
confidence(df_, ['Avengers'], ['Thor'])

0.16279069767441862

**When a rule satisfies both a minimum support and a minimum confidence, we say it is a strong association rule.**

In general, association rule mining can be defined as two steps:

1 - Find all frequent itemsets;

2 - Generate strong association rules from those itemsets.
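
Step 2 can be sketched as follows, assuming step 1 has already produced the support of every frequent itemset. The `toy_supports` values are made up and `generate_rules` is a hypothetical helper; in this notebook the same job is done by `association_rules` from mlxtend.

```python
from itertools import combinations


def generate_rules(supports, min_conf):
    """Return (antecedent, consequent, confidence) for every rule with conf >= min_conf."""
    rules = []
    for itemset, supp_xy in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                # Every subset of a frequent itemset is frequent, so its support is known
                conf = supp_xy / supports[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, conf))
    return rules


toy_supports = {
    frozenset({"Thor"}): 0.4,
    frozenset({"Avengers"}): 0.5,
    frozenset({"Thor", "Avengers"}): 0.3,
}
strong_rules = generate_rules(toy_supports, min_conf=0.7)
for antecedent, consequent, conf in strong_rules:
    print(set(antecedent), "->", set(consequent), round(conf, 2))
```

With these numbers only {Thor} -> {Avengers} survives (confidence 0.3 / 0.4), while the reverse rule has confidence 0.3 / 0.5 and is dropped. This shows that confidence is not symmetric.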

### How does the algorithm work?

* It is called **Apriori** because it relies on prior knowledge of the properties of frequent itemsets;
* It is an iterative method in which frequent $k$-itemsets are used to explore $(k+1)$-itemsets;
* **General idea**: First find the set of frequent itemsets of size 1 that satisfy the minimum support, denoted $L_1$. Then use $L_1$ to find $L_2$, the frequent itemsets of size 2. $L_2$ is used to find $L_3$, and so on.
* **Apriori property**: Every non-empty subset of a frequent itemset is also frequent. Consequently, a candidate that has any infrequent subset can be discarded without counting its support.




<img src="fig_apriori/Apriori.jpg">

Source: http://www.lessons2all.com/Apriori.php
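
The level-wise search described above can be sketched in a few lines of pure Python. This is a simplified illustration on made-up transactions, not the mlxtend implementation; `apriori_sketch` and `toy` are hypothetical names.

```python
from itertools import combinations


def apriori_sketch(transactions, min_supp):
    """Return {itemset: support} for all frequent itemsets (level-wise search)."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent itemsets of size 1
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_supp}
    frequent = {s: support(s) for s in level}

    k = 1
    while level:
        # Join step: combine frequent k-itemsets into candidates of size k + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # Count step: keep the candidates that reach the minimum support
        level = {c for c in candidates if support(c) >= min_supp}
        frequent.update({c: support(c) for c in level})
        k += 1
    return frequent


toy = [
    {"Thor", "Avengers", "Hulk"},
    {"Thor", "Avengers"},
    {"Thor", "Hulk"},
    {"Avengers", "Moana"},
]
result = apriori_sketch(toy, min_supp=0.5)
for itemset, supp in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

In this toy run the candidate {Thor, Avengers, Hulk} is pruned without being counted, because its subset {Avengers, Hulk} is not frequent.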

### Using the Apriori algorithm

In [22]:
frequent_itemsets = apriori(df_, min_support=0.01, use_colnames=True)

#### Viewing the frequent itemsets

In [23]:
frequent_itemsets.head()

Unnamed: 0,support,itemsets
0,0.018531,(Ted)
1,0.059992,(Captain America)
2,0.019064,(Red Sparrow)
3,0.050527,(Thor)
4,0.046794,(Cafe Society)


#### Computing association rules using confidence as the metric

In [24]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Captain America),(Moana),0.059992,0.129583,0.014798,0.246667,1.903546,0.007024,1.155421
1,(Captain America),(Ninja Turtles),0.059992,0.238368,0.022797,0.38,1.594172,0.008497,1.228438
2,(Captain America),(Coco),0.059992,0.163845,0.014665,0.244444,1.491927,0.004835,1.106676
3,(Captain America),(Tomb Rider),0.059992,0.17411,0.017198,0.286667,1.646468,0.006752,1.15779
4,(Captain America),(Get Out),0.059992,0.179709,0.014398,0.24,1.33549,0.003617,1.07933
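
The columns of this table are related by simple formulas, which can be checked on the first row. The sketch below plugs in the rounded values printed above for **{Captain America} -> {Moana}**. The formulas for leverage and conviction are the usual textbook definitions and are stated here as an assumption about how those columns were computed, since the notebook does not define them.

```python
# Rounded values copied from row 0 of the rules table
antecedent_support = 0.059992   # supp(Captain America)
consequent_support = 0.129583   # supp(Moana)
support = 0.014798              # supp(Captain America and Moana together)

confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4), round(lift, 3), round(leverage, 4), round(conviction, 3))
```

Because the inputs are rounded to six decimals, the results agree with the table only up to small rounding differences.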


#### Viewing rules with a given confidence and lift

In [25]:
rules[ (rules['lift'] > 1.1) &
       (rules['confidence'] >= 0.45) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
6,(Thor),(Ninja Turtles),0.050527,0.238368,0.023064,0.456464,1.914955,0.01102,1.401255
123,"(Intern, Moana)",(Ninja Turtles),0.023597,0.238368,0.011065,0.468927,1.967236,0.00544,1.434136
129,"(Jumanji, Moana)",(Ninja Turtles),0.021997,0.238368,0.011065,0.50303,2.110308,0.005822,1.532552
143,"(Coco, Jumanji)",(Ninja Turtles),0.023064,0.238368,0.010932,0.473988,1.988472,0.005434,1.447937
155,"(Tomb Rider, Spotlight)",(Ninja Turtles),0.025197,0.238368,0.011465,0.455026,1.908923,0.005459,1.397557
157,"(Get Out, Jumanji)",(Ninja Turtles),0.019997,0.238368,0.010132,0.506667,2.125563,0.005365,1.543848


**Lift**: The lift of a rule is defined as:

$$lift(X \rightarrow Y) = \frac{supp(X \cup Y)}{supp(X) \times supp(Y)}$$

* lift = 1: the occurrence of X is independent of the occurrence of Y

* lift > 1: possible positive dependence between X and Y, which makes the rule useful for predicting future items

* lift < 1: the presence of X has a negative effect on the presence of Y, and vice versa.


For example, the lift of the rule **{Avengers} -> {Thor}** is given by:

In [20]:
def lift(df_, X, Y):
    return supp(df_,X+Y)/(supp(df_,X)*supp(df_,Y))

In [21]:
lift(df_,  ['Avengers'], ['Thor'])

3.221881327851752
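
To make the three lift regimes concrete, the sketch below evaluates the same formula on three tiny made-up datasets. The `support` and `lift` helpers here are hypothetical set-based versions, separate from the dataframe-based functions defined in the notebook.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def lift(transactions, X, Y):
    joint = support(transactions, X + Y)
    return joint / (support(transactions, X) * support(transactions, Y))


independent = [{"A", "B"}, {"A"}, {"B"}, set()]   # A and B co-occur as often as chance predicts
together = [{"A", "B"}, {"A", "B"}, set(), set()]  # A and B always appear together
apart = [{"A"}, {"A"}, {"B"}, {"B"}]               # A and B never appear together

for name, data in [("independent", independent), ("together", together), ("apart", apart)]:
    print(name, lift(data, ["A"], ["B"]))
```

The three datasets give a lift equal to 1, above 1 and below 1 respectively, matching the three cases listed above.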