## APRIORI Algorithm - Store Analysis

![title](apriori.png)

The Apriori algorithm relies on three key measures:

* Support
* Confidence
* Lift

### Support:
Support measures how popular an item is overall. It is the number of transactions containing the item divided by the total number of transactions.

* Support(B) = (Transactions containing (B))/(Total Transactions)
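As a minimal sketch of the support formula, the snippet below counts matching transactions in a small, made-up dataset. The item names and the helper `support` are illustrative assumptions, not part of the notebook's data.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"burger", "ketchup"},
    {"burger"},
    {"ketchup", "fries"},
    {"burger", "fries"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


support_burger = support({"burger"}, transactions)           # 3 of 4 transactions
support_pair = support({"burger", "ketchup"}, transactions)  # 1 of 4 transactions
print(support_burger, support_pair)
```

The same function works for single items and for larger itemsets, which is how support is used for rules later on.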

### Confidence:
Confidence refers to the likelihood that item B is also bought when item A is bought. It is the number of transactions in which A and B are bought together, divided by the number of transactions in which A is bought.

* Confidence(A→B) = (Transactions containing both (A and B)) / (Transactions containing A)
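The confidence formula can be sketched the same way. The toy data and the helpers `count` and `confidence` below are assumptions made for illustration. The example also shows that confidence depends on the direction of the rule.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"burger", "ketchup"},
    {"burger"},
    {"ketchup", "fries"},
    {"burger", "fries"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Confidence(A -> B) = count(A and B) / count(A)."""
    both = count(set(antecedent) | set(consequent), transactions)
    base = count(antecedent, transactions)
    return both / base if base else 0.0


conf_burger_ketchup = confidence({"burger"}, {"ketchup"}, transactions)  # 1 of 3
conf_ketchup_burger = confidence({"ketchup"}, {"burger"}, transactions)  # 1 of 2
print(conf_burger_ketchup, conf_ketchup_burger)
```

Because the denominator is the count of the antecedent only, Confidence(A→B) and Confidence(B→A) generally differ.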

### Lift:
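Lift(A → B) measures how much more likely B is to be sold when A is sold, relative to the overall popularity of B. It is calculated by dividing Confidence(A → B) by Support(B). A lift of 1 means A and B are bought independently of each other, while a lift above 1 means they are bought together more often than chance would suggest.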

* Lift(A→B) = (Confidence (A→B)) / (Support (B))
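The following sketch computes lift directly from raw counts. The function name `lift` and the counts are hypothetical. It also illustrates that, unlike confidence, lift is the same in both directions.

```python
import math


def lift(n_both, n_a, n_b, n_total):
    """Lift(A -> B) = Confidence(A -> B) / Support(B), from raw counts."""
    confidence_ab = n_both / n_a
    support_b = n_b / n_total
    return confidence_ab / support_b


# Hypothetical counts: 4 transactions, A in 3 of them, B in 2, both in 1.
lift_ab = lift(n_both=1, n_a=3, n_b=2, n_total=4)
lift_ba = lift(n_both=1, n_a=2, n_b=3, n_total=4)

print(lift_ab, lift_ba)
print(math.isclose(lift_ab, lift_ba))  # lift is symmetric in A and B
```

Here the lift is below 1, which means the two items appear together less often than they would if they were bought independently.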

**Instance** - Suppose we have a record of 1,000 customer transactions, and we want to find the Support, Confidence, and Lift for two items, e.g. burgers and ketchup. Out of the 1,000 transactions, 100 contain ketchup while 150 contain a burger. Of the 150 transactions in which a burger is purchased, 50 contain ketchup as well.
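Plugging the numbers from this instance into the three formulas gives the values below. The variable names are chosen here for readability and are not part of the notebook.

```python
total_transactions = 1000
ketchup_transactions = 100
burger_transactions = 150
both_transactions = 50  # burger and ketchup in the same transaction

support_burger = burger_transactions / total_transactions
support_ketchup = ketchup_transactions / total_transactions
confidence_burger_ketchup = both_transactions / burger_transactions
lift_burger_ketchup = confidence_burger_ketchup / support_ketchup

print(f"Support(burger)             = {support_burger:.2f}")             # 0.15
print(f"Support(ketchup)            = {support_ketchup:.2f}")            # 0.10
print(f"Confidence(burger->ketchup) = {confidence_burger_ketchup:.2f}")  # 0.33
print(f"Lift(burger->ketchup)       = {lift_burger_ketchup:.2f}")        # 3.33
```

A lift of about 3.33 says that customers who buy a burger are roughly 3.3 times more likely to buy ketchup than a randomly chosen customer.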

Evaluating every possible combination of items can be extremely slow on large datasets. To speed up the process, we perform the following steps:

1. Set a minimum value for support and confidence. This means that we are only interested in finding rules for items that occur with a certain baseline frequency (support) and that co-occur with other items at least a minimum share of the time (confidence).

2. Extract all the itemsets whose support is higher than the minimum threshold.

3. Select all the rules from the subsets with confidence value higher than the minimum threshold.

4. Order the rules by descending order of Lift.
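The four steps can be sketched on a toy dataset as follows. This is a simplified illustration rather than the level-wise candidate generation a real Apriori implementation uses: it only enumerates item pairs, it treats the thresholds as inclusive (`>=`), and the data, thresholds, and names are all made up for the example.

```python
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"burger", "ketchup"},
    {"burger", "ketchup"},
    {"burger", "ketchup", "soda"},
    {"soda", "chips"},
    {"soda", "chips"},
    {"burger"},
]

# Step 1: choose minimum thresholds (arbitrary values for this sketch).
MIN_SUPPORT = 0.3
MIN_CONFIDENCE = 0.6


def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


# Step 2: keep the item pairs that meet the support threshold.
items = sorted(set().union(*transactions))
frequent_pairs = [
    frozenset(pair)
    for pair in combinations(items, 2)
    if support(frozenset(pair)) >= MIN_SUPPORT
]

# Step 3: from each frequent pair, keep the rules that meet the confidence threshold.
rules = []
for pair in frequent_pairs:
    for item in sorted(pair):
        antecedent = frozenset([item])
        consequent = pair - antecedent
        conf = support(pair) / support(antecedent)
        if conf >= MIN_CONFIDENCE:
            rule_lift = conf / support(consequent)
            rules.append((item, next(iter(consequent)), conf, rule_lift))

# Step 4: sort the surviving rules by lift, highest first.
rules.sort(key=lambda rule: rule[3], reverse=True)

for antecedent, consequent, conf, rule_lift in rules:
    print(f"{antecedent} -> {consequent}: confidence={conf:.2f}, lift={rule_lift:.2f}")
```

The support filter in step 2 is what keeps the search tractable: rare pairs are discarded before any rule is built from them.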

In [2]:
import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori

In [13]:
# Raw string so the backslashes in the Windows path are not read as escape sequences.
movie_data = pd.read_csv(r'E:\RDataSet\movie_dataset_apriori.csv', header=None)
num_records = len(movie_data)
print(num_records)

7501


In [14]:
movie_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,The Revenant,13 Hours,Allied,Zootopia,Jigsaw,Achorman,Grinch,Fast and Furious,Ghostbusters,Wolverine,Mad Max,John Wick,La La Land,The Good Dunosaur,Ninja Turtles,The Good Dunosaur Bad Moms,2 Guns,Inside Out,Valerian,Spiderman 3
1,Beirut,Martian,Get Out,,,,,,,,,,,,,,,,,
2,Deadpool,,,,,,,,,,,,,,,,,,,
3,X-Men,Allied,,,,,,,,,,,,,,,,,,
4,Ninja Turtles,Moana,Ghost in the Shell,Ralph Breaks the Internet,John Wick,,,,,,,,,,,,,,,
5,Mad Max,,,,,,,,,,,,,,,,,,,
6,The Spy Who Dumped Me,Hotel Transylvania,,,,,,,,,,,,,,,,,,
7,Thor,London Has Fallen,The Lego Movie,,,,,,,,,,,,,,,,,
8,Intern,Tomb Rider,John Wick,,,,,,,,,,,,,,,,,
9,Hotel Transylvania,,,,,,,,,,,,,,,,,,,


### Apriori Preprocessing

The Apriori library we are going to use requires our dataset to be in the form of a list of lists, where the whole dataset is a big list and each transaction in the dataset is an inner list within the outer big list.

In [5]:
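records = []
for i in range(0, num_records):
    # Convert each of the 20 cells in the row to a string.
    # Note: empty cells (NaN) become the string 'nan' and are kept as items.
    records.append([str(movie_data.values[i, j]) for j in range(0, 20)])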
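Because the loop above converts missing cells to the string 'nan', that string is treated as an item in every short transaction. One possible alternative, sketched below with the standard-library `csv` module instead of pandas, drops empty cells while reading. The in-memory `sample_csv` is a hypothetical stand-in for the real file, and the results shown later in this notebook were produced with the original `records`, so counts may differ if this variant is used.

```python
import csv
import io

# Hypothetical stand-in for a few rows of the CSV file. For the real file,
# replace this with: open(path, newline='') inside a `with` block.
sample_csv = io.StringIO(
    "Beirut,Martian,Get Out,,\n"
    "Deadpool,,,,\n"
    "X-Men,Allied,,,\n"
)

# One inner list per transaction, keeping only the non-empty cells.
records_clean = [
    [title for title in row if title.strip()]
    for row in csv.reader(sample_csv)
]

print(records_clean)
```

Each inner list now contains only the titles that actually appear in the transaction.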

### Apriori Rules

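We can now specify the parameters of the `apriori` function:

* the list of transactions (`records`)
* min_support
* min_confidence
* min_lift
* min_length (the minimum number of items that you want in your rules, typically 2; note that apyori may silently ignore this keyword, since the length option it documents is `max_length`)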

In [20]:
association_rules = apriori(records, min_support=0.0053, 
                            min_confidence=0.20, 
                            min_lift=3, min_length=2)
association_results = list(association_rules)  

In [21]:
print(len(association_results))

32


In [19]:
print(association_results[0])

RelationRecord(items=frozenset({'Red Sparrow', 'Green Lantern'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Red Sparrow'}), items_add=frozenset({'Green Lantern'}), confidence=0.3006993006993007, lift=3.790832696715049)])


For instance, from the first item, **association_results[0]**, we can see that Red Sparrow and Green Lantern are commonly bought together.

The support value for the first rule is 0.0057. This number is calculated by dividing the number of transactions containing both Red Sparrow and Green Lantern by the total number of transactions. The confidence for the rule is 0.3007, which shows that of all the transactions that contain Red Sparrow, about 30% also contain Green Lantern. Finally, the lift of 3.79 tells us that Green Lantern is 3.79 times more likely to be bought by customers who buy Red Sparrow than its default likelihood of sale.
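As a sanity check, the three reported values can be turned back into transaction counts using the formulas from the start of this notebook. This is a sketch that assumes the total of 7501 transactions printed earlier. The variable names are illustrative.

```python
total_transactions = 7501            # from len(movie_data)
support_both = 0.005732568990801226  # Support({Red Sparrow, Green Lantern})
confidence = 0.3006993006993007      # Confidence(Red Sparrow -> Green Lantern)
lift = 3.790832696715049             # Lift(Red Sparrow -> Green Lantern)

# Support = both / total            =>  both = support * total
n_both = round(support_both * total_transactions)

# Confidence = both / red_sparrow   =>  Support(Red Sparrow) = support / confidence
n_red_sparrow = round(support_both / confidence * total_transactions)

# Lift = confidence / Support(Green Lantern)  =>  Support(Green Lantern) = confidence / lift
n_green_lantern = round(confidence / lift * total_transactions)

print(n_both, n_red_sparrow, n_green_lantern)
```

With the values above, this works out to 43 transactions containing both titles, 143 containing Red Sparrow, and 595 containing Green Lantern, which is consistent with 43/143 ≈ 0.3007.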

In [1]:
results = []
for item in association_results:
    # Each RelationRecord holds the itemset, its support, and a list of
    # OrderedStatistic entries (one per rule derived from the itemset).
    rule = item.ordered_statistics[0]

    # Antecedent (base) and consequent (add) of the first rule.
    # Using these fields keeps the rule direction, which plain iteration
    # over the unordered frozenset does not guarantee.
    title_1 = ', '.join(sorted(rule.items_base))
    title_2 = ', '.join(sorted(rule.items_add))

    results.append((title_1, title_2,
                    round(item.support, 5),
                    round(rule.confidence, 5),
                    round(rule.lift, 5)))

labels = ['Title 1', 'Title 2', 'Support', 'Confidence', 'Lift']
movie_suggestion = pd.DataFrame.from_records(results, columns=labels)

print(movie_suggestion)
