### Association Rules

Association rules identify common patterns among the items of a large dataset. In this exercise, we will analyze viewing patterns on a movie platform (such as Netflix) where people watch movies and series. Some patterns are obvious, such as people who like superhero movies or people who watch cartoons.

Association rules are usually written in the form **{A} -> {B}**, which means that there is a strong relationship between items A and B. For example, a plausible rule for a streaming platform is **{The Lord of the Rings} -> {The Hobbit}**.

If people who watch one movie frequently also watch another, that is, the two movies are often watched together, the platform can use this pattern to increase the viewership of certain movies through recommendations.

In the example above, **{The Lord of the Rings} -> {The Hobbit}**, **{The Lord of the Rings}** is the **antecedent** and **{The Hobbit}** is the **consequent**. Antecedents and consequents may contain multiple items; for example, a valid rule is **{Thor: Ragnarok, Avengers: Infinity War} -> {Avengers: Endgame}**.

Why use association rules?

* Easy to explain to non-technical people

* No need for extensive data preparation or feature engineering

* A good starting point for exploring data


### Activity description
In this activity we will use association rules to analyze a dataset of transactions, where each transaction consists of the movies that a single user of a movie platform watched within a given time interval.

### Step 1) Reading the dataset

In [25]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [26]:
df = pd.read_csv('dataset_movies/movie_dataset.txt',header=None)

In [50]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,The Revenant,13 Hours,Allied,Zootopia,Jigsaw,Achorman,Grinch,Fast and Furious,Ghostbusters,Wolverine,Mad Max,John Wick,La La Land,The Good Dunosaur,Ninja Turtles,The Good Dunosaur Bad Moms,2 Guns,Inside Out,Valerian,Spiderman 3
1,Beirut,Martian,Get Out,,,,,,,,,,,,,,,,,
2,Deadpool,,,,,,,,,,,,,,,,,,,
3,X-Men,Allied,,,,,,,,,,,,,,,,,,
4,Ninja Turtles,Moana,Ghost in the Shell,Ralph Breaks the Internet,John Wick,,,,,,,,,,,,,,,


Each line of the file corresponds to the set of movies that a given user watched. We will treat this set of movies as the itemset of one transaction.

However, we need to transform the data into a DataFrame in which each column corresponds to a movie and each row to a user. Each cell contains 1 if the user watched the movie and 0 otherwise.

In [41]:
import numpy as np

In [54]:
rows = df.shape[0]

In [103]:
filmes = set()
for i in range(rows):
    filmes = filmes.union(set(df.iloc[i].unique()))


In [107]:
np.nan in filmes

True

In [108]:
filmes.difference_update({np.nan})

In [132]:
# Pass a list: newer pandas versions do not accept a set for `columns`
df_ = pd.DataFrame(columns=list(filmes), data=np.zeros((rows, len(filmes))))

In [116]:
df_.head()

Unnamed: 0,Ant Man,Mission Impossible,Spotlight,Justice League,The Revenant,Allied,Grinch,Wolverine,Captain America,Coco,...,Batman,Spiderman 3,Cinderella,John Wick,Vampire in Brooklyn,Hulk,The Good Dunosaur,Trolls,Intern,Star Trek
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [144]:
def set_units(x):
    return 1

In [159]:
for i in range(rows):
    # .loc accepts a list of column labels; .at is meant for a single cell
    df_.loc[i, df.iloc[i].dropna().tolist()] = 1.

In [161]:
df_.head()

Unnamed: 0,Ant Man,Mission Impossible,Spotlight,Justice League,The Revenant,Allied,Grinch,Wolverine,Captain America,Coco,...,Batman,Spiderman 3,Cinderella,John Wick,Vampire in Brooklyn,Hulk,The Good Dunosaur,Trolls,Intern,Star Trek
0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
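The encoding step above can be illustrated on a much smaller scale. The following is a minimal sketch in plain Python, assuming three made-up transactions (they are not taken from `movie_dataset.txt`); the names `transactions`, `titles`, and `encoded` are introduced here only for illustration.

```python
# Toy transactions (hypothetical data, not taken from movie_dataset.txt)
transactions = [
    ["Thor", "Avengers"],
    ["Avengers"],
    ["Thor", "Moana", "Coco"],
]

# Sorted list of distinct titles, used as the column order
titles = sorted({title for t in transactions for title in t})

# One row per transaction: 1 where the title is present, 0 otherwise
encoded = [[1 if title in t else 0 for title in titles] for t in transactions]

print(titles)
for row in encoded:
    print(row)
```

Each printed row plays the role of one row of `df_`, with the sorted titles as column names.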


### The Apriori Algorithm
Three concepts are essential for understanding the Apriori algorithm.


**Support**: the number of transactions in which an itemset appears, divided by the total number of transactions.

$$supp(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}$$

For example, we can compute the support of the movie "Mission Impossible" with the following operation.

In [262]:
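def supp(df_,X):
    # Row-wise product is nonzero only when every item in X is present
    all_present = np.prod(df_[X].values,axis=1)
    # Fraction of transactions that contain the whole itemset X
    return len(np.nonzero(all_present)[0])/df_.shape[0]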

In [263]:
supp(df_,["Mission Impossible"])

0.011998400213304892
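To make the definition concrete, here is a small sketch that computes support directly from sets, without a DataFrame. The four transactions and the helper `toy_supp` are hypothetical and exist only for this illustration.

```python
# Toy transactions (hypothetical data)
transactions = [
    {"Thor", "Avengers"},
    {"Avengers"},
    {"Thor", "Moana", "Coco"},
    {"Thor", "Avengers", "Moana"},
]

def toy_supp(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

supp_thor = toy_supp(transactions, {"Thor"})                   # 3 of 4 transactions
supp_thor_avengers = toy_supp(transactions, {"Thor", "Avengers"})  # 2 of 4 transactions
print(supp_thor, supp_thor_avengers)
```

Adding items to an itemset can only keep its support equal or lower it, which is the property the Apriori algorithm relies on.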

**Confidence**: indicates how often a rule holds. It estimates the fraction of transactions containing X that also contain Y, so the higher the confidence, the more often the consequent appears when the antecedent is present. It is given by:

$$conf(X \rightarrow Y) = supp(X \cup Y)/supp(X)$$


For example, the confidence of the rule **{Avengers} -> {Thor}** is given by:

In [264]:
def confidence(df_, X, Y):
    return supp(df_,X+Y)/supp(df_,X)

In [265]:
supp(df_,['Avengers','Thor'])/supp(df_,['Avengers'])

0.16279069767441862

In [266]:
supp(df_,['Jumanji','Wonder Woman'])/supp(df_,['Wonder Woman'])

0.3773584905660377

In [267]:
confidence(df_, ['Avengers'], ['Thor'])

0.16279069767441862
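Confidence is not symmetric: swapping antecedent and consequent changes the denominator. The sketch below shows this on four hypothetical transactions; `toy_supp` and `toy_confidence` are illustrative helpers, not part of the notebook.

```python
# Toy transactions (hypothetical data)
transactions = [
    {"Thor", "Avengers"},
    {"Avengers"},
    {"Thor", "Moana", "Coco"},
    {"Thor", "Avengers", "Moana"},
]

def toy_supp(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def toy_confidence(transactions, X, Y):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return toy_supp(transactions, set(X) | set(Y)) / toy_supp(transactions, X)

# Everyone who watched Moana also watched Thor, but not the other way around
conf_moana_thor = toy_confidence(transactions, {"Moana"}, {"Thor"})
conf_thor_moana = toy_confidence(transactions, {"Thor"}, {"Moana"})
print(conf_moana_thor, conf_thor_moana)
```

In this toy data the rule {Moana} -> {Thor} holds in every transaction containing Moana, while {Thor} -> {Moana} holds in only two of the three transactions containing Thor.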

**Lift**: the lift of a rule is defined as:

$$lift(X \rightarrow Y) = \frac{supp(X \cup Y)}{supp(X) \times supp(Y)}$$

* lift = 1: the occurrence of X is independent of the occurrence of Y

* lift > 1: possible dependence between X and Y, which makes the rule useful for predicting future items

* lift < 1: the presence of X has a negative effect on the presence of Y, and vice versa.


For example, the lift of the rule **{Avengers} -> {Thor}** is given by:

In [245]:
def lift(df_, X, Y):
    return supp(df_,X+Y)/(supp(df_,X)*supp(df_,Y))

In [268]:
supp(df_,['Avengers', 'Thor'])

0.0018664178109585388

In [269]:
(supp(df_,['Avengers'])*supp(df_,['Thor']))

0.0005792944000836327

In [261]:
0.06/0.0005

120.0

In [275]:
lift(df_,  ['Avengers'], ['Thor'])

3.221881327851752
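The lift returned above can be checked by hand from the two intermediate values printed earlier: the joint support of {Avengers, Thor} and the product of the individual supports. The sketch below simply repeats that division; the variable names are introduced here for illustration.

```python
# Values copied from the notebook outputs above
supp_xy = 0.0018664178109585388          # supp({Avengers, Thor})
supp_x_times_supp_y = 0.0005792944000836327  # supp({Avengers}) * supp({Thor})

# lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y))
lift_check = supp_xy / supp_x_times_supp_y
print(round(lift_check, 2))
```

The result is about 3.22, well above 1, so in this dataset the two movies are watched together roughly three times more often than would be expected if they were independent.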

### Using the Apriori algorithm

In [318]:
frequent_itemsets = apriori(df_, min_support=0.01, use_colnames=True)

#### Viewing frequent itemsets

In [330]:
frequent_itemsets.head()

| | support | itemsets |
|---|---|---|
| 0 | 0.010799 | (Ant Man) |
| 1 | 0.011998 | (Mission Impossible) |
| 2 | 0.095054 | (Spotlight) |
| 3 | 0.013465 | (Justice League) |
| 4 | 0.071457 | (The Revenant) |
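To give an idea of what a frequent-itemset search does, here is a simplified, pure-Python sketch of the level-wise Apriori idea on hypothetical toy data. It is not the `mlxtend` implementation; `toy_apriori` and the transactions are assumptions made for this illustration.

```python
from itertools import combinations

def toy_apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Toy transactions (hypothetical data)
transactions = [
    {"Thor", "Avengers"},
    {"Avengers"},
    {"Thor", "Moana", "Coco"},
    {"Thor", "Avengers", "Moana"},
]
result = toy_apriori(transactions, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: {Thor, Avengers, Moana} is discarded without counting it, because its subset {Avengers, Moana} is already known to be infrequent.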


#### Computing association rules using confidence as the metric

In [326]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (Spotlight) | (Coco) | 0.095054 | 0.163845 | 0.019864 | 0.208976 | 1.275452 | 0.00429 | 1.057054 |
| 1 | (Spotlight) | (Ninja Turtles) | 0.095054 | 0.238368 | 0.033729 | 0.354839 | 1.488616 | 0.011071 | 1.180529 |
| 2 | (Spotlight) | (Get Out) | 0.095054 | 0.179709 | 0.02173 | 0.228612 | 1.272118 | 0.004648 | 1.063395 |
| 3 | (Spotlight) | (Tomb Rider) | 0.095054 | 0.17411 | 0.025197 | 0.265077 | 1.522468 | 0.008647 | 1.123778 |
| 4 | (Spotlight) | (Hotel Transylvania) | 0.095054 | 0.170911 | 0.020131 | 0.211781 | 1.239135 | 0.003885 | 1.051852 |
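The rules table also contains two columns not defined earlier, leverage and conviction. Assuming the usual definitions, leverage = supp(X ∪ Y) − supp(X) × supp(Y) and conviction = (1 − supp(Y)) / (1 − conf(X → Y)), the first row of the table can be reproduced by hand. The sketch below uses the rounded values shown in that row, so it matches only up to rounding.

```python
# Rounded values copied from row 0 of the rules table: (Spotlight) -> (Coco)
antecedent_support = 0.095054
consequent_support = 0.163845
support = 0.019864
confidence = 0.208976

# lift expressed through confidence: conf(X -> Y) / supp(Y)
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: how much more often X would appear without Y if they were independent
conviction = (1 - consequent_support) / (1 - confidence)

print(round(lift, 3), round(leverage, 5), round(conviction, 3))
```

A leverage close to 0 and a conviction close to 1 both point to a weak rule, consistent with the modest lift of this row.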


#### Viewing rules above a given confidence and lift

In [332]:
rules[ (rules['lift'] > 1.1) &
       (rules['confidence'] >= 0.45) ]

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 64 | (Thor) | (Ninja Turtles) | 0.050527 | 0.238368 | 0.023064 | 0.456464 | 1.914955 | 0.01102 | 1.401255 |
| 116 | (Tomb Rider, Spotlight) | (Ninja Turtles) | 0.025197 | 0.238368 | 0.011465 | 0.455026 | 1.908923 | 0.005459 | 1.397557 |
| 119 | (Jumanji, Coco) | (Ninja Turtles) | 0.023064 | 0.238368 | 0.010932 | 0.473988 | 1.988472 | 0.005434 | 1.447937 |
| 136 | (Get Out, Jumanji) | (Ninja Turtles) | 0.019997 | 0.238368 | 0.010132 | 0.506667 | 2.125563 | 0.005365 | 1.543848 |
| 139 | (Moana, Jumanji) | (Ninja Turtles) | 0.021997 | 0.238368 | 0.011065 | 0.50303 | 2.110308 | 0.005822 | 1.532552 |
| 154 | (Moana, Intern) | (Ninja Turtles) | 0.023597 | 0.238368 | 0.011065 | 0.468927 | 1.967236 | 0.00544 | 1.434136 |
