# Machine Learning A-Z: Section 24 Apriori Association Rule Learning

In the course so far we have looked at Regression, Classification, and Clustering. This notebook is about Association Rule Learning. Association Rule learning looks for common connections in the data.

For instance:
* Purchases containing milk also usually contain bread
* People who watch Mulan often also watch Tangled
* People who listen to Mozart often also listen to Bach

__Important Note:__
A rule which applies in one direction may not always apply in the reverse. So although purchases containing milk often contain bread, it may not be true the transactions containing bread often contain milk.

Association Rule Learning is all about finding these _A_ often implies _B_ type of relationships with-in our data.

Apriori ARL find _A_ implies _B_ rules by looking at 3 values.
1. Support(_A_): The percent of all records (transactions, users, etc.) which contain _A_.
1. Confidence (_A_->_B_): The percent of records which contain _A_ which also contain _B_.
1. Lift(_A_->_B_): The Confidence (_A_->_B_) divided by the Support(_B_)

The Support, Confidence, and Lift are used in Apriori ARL models as follows:
1. Calculate the Support for all items (products, movies, artists, etc.) in our records and select only those above a predetermined Support level. We do this for three reasons. First, if there are not enough records containing A, we won't be able to accurately find what A implies. Second, we are generally interested in finding rules for our most common items so we can leverage their influence. A strong rule that rarely applies often isn't very useful. Third, in the following steps we will look at combinations of items and looking at every possible combination of items can be computationally prohibitive for large datasets with many different items
1. For the subset of items selected in the previous step, find the Confidence that _A_ implies _B_ for every combination in the subset. Select only the rules having higher than a predetermined confidence level. Confidence can be thought of as a numerical representation of our confidence that a rule _may_ exist between _A_ and _B_.
1. For the subset of rules selected in the previous step, rank the rules by lift from highest to lowest. In this case lift tells us how strong a potential rule is. If a large number of records which contain _A_ also contain _B_, but _B_ is very common in the dataset overall (_B_ has a large support) it may not be significant that _B_ appears commonly with _A_ (small Lift). If _B_ is generally uncommon in the data (_B_ has a small support) it may be significant that it appears commonly with _A_ (large Lift)

For our example data, we will be looking at the contents of purchases and looking for items which are often purchased together.

## Step 1 Import and Prepare the data.

In [1]:
import numpy as np # Libraries for fast linear algebra and array manipulation
import pandas as pd # Import and manage datasets
from plotly import __version__ as py__version__
import plotly.express as px # Libraries for ploting data
import plotly.graph_objects as go # Libraries for ploting data
from sklearn import __version__ as skl__version__
from sklearn.model_selection import train_test_split # Library to split data into training and test sets.
from sklearn.preprocessing import StandardScaler # Library to do feature scaling
from sklearn.tree import DecisionTreeClassifier # Library to do Decision Tree classification
from sklearn.metrics import confusion_matrix #Function for computing the confusion matrix

Library versions used in this code:

In [2]:
print('Numpy: ' + np.__version__)
print('Pandas: ' + pd.__version__)
print('Plotly: ' + py__version__)
print('Scikit-learn: ' + skl__version__)

Numpy: 1.16.4
Pandas: 0.25.1
Plotly: 4.0.0
Scikit-learn: 0.21.2


In [3]:
def LoadData():
    dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
    return dataset

dataset = LoadData()
print(dataset.head(3))
print()
print(dataset.info())

        0          1        2               3             4   \
0   shrimp    almonds  avocado  vegetables mix  green grapes   
1  burgers  meatballs     eggs             NaN           NaN   
2  chutney        NaN      NaN             NaN           NaN   

                 5     6               7             8             9   \
0  whole weat flour  yams  cottage cheese  energy drink  tomato juice   
1               NaN   NaN             NaN           NaN           NaN   
2               NaN   NaN             NaN           NaN           NaN   

               10         11     12     13             14      15  \
0  low fat yogurt  green tea  honey  salad  mineral water  salmon   
1             NaN        NaN    NaN    NaN            NaN     NaN   
2             NaN        NaN    NaN    NaN            NaN     NaN   

                  16               17       18         19  
0  antioxydant juice  frozen smoothie  spinach  olive oil  
1                NaN              NaN      NaN       

Looking at the raw data we can see that each row represents a shopping basket and has the items contained in that basket in the columns. However, the Apriori ARL library we are using expects a list of lists with each inner list representing a basket so we will need to transform the data.

In [4]:
baskets = []
for i in range(0,len(dataset)):
    basket = []
    for j in range(0,len(dataset.columns)):
        if dataset.iloc[i,j] == dataset.iloc[i,j]: #This is kind of an odd check of np.NaN which is used for missing values does not equal itself and hence this will be false.
            basket.append(dataset.iloc[i,j])
    baskets.append(basket)
    
#print(baskets)

In [5]:
from apyori import apriori
rules = apriori(baskets, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)

In [6]:
results = list(rules)
for item in results[0:5]:

    # first index of the inner list
    # Contains base item and add item
    pair = item[0] 
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    #second index of the inner list
    print("Support: " + str(item[1]))

    #third index of the list located at 0th
    #of the third index of the inner list

    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

Rule: light cream -> chicken
Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801226
Confidence: 0.3006993006993007
Lift: 3.790832696715049
Rule: escalope -> pasta
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
Rule: honey -> fromage blanc
Support: 0.003332888948140248
Confidence: 0.2450980392156863
Lift: 5.164270764485569
Rule: herb & pepper -> ground beef
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285


If we look at the first result, we can understand the following:
* 0.45% of the all the baskets contained Light Cream which corresponds to 34 baskets out of the 7501 total
* 29% of the baskets containing Light Cream also contained Chicken
* Chicken occured in baskets containing light cream 4.8 times more frequently than it occured in the dataset overall

From this we could try to put Light Cream and Chicken close together in our store try to improve chicken sales, or we coul try to distance Light Cream and Chicken so people looking for both would see more of the products we are selling and pick-up something they wouldn't have otherwise.