#**(11) Apriori (Unsupervised Learning, Association Rule Based)**

- The Apriori algorithm is an unsupervised machine learning algorithm that uncovers relationships between the items in a dataset. It is a data mining technique used to extract frequent itemsets and the association rules derived from them.

- It is mainly used for **market basket analysis**, where it helps to find products that are frequently bought together.


##**Working of an Apriori algorithm**
- Let us try and understand the working of an Apriori algorithm with the help of a very famous business scenario, market basket analysis.

- Here is a dataset consisting of six transactions in an hour. Each transaction is a combination of 0s and 1s, where 0 represents the absence of an item and 1 represents the presence of it.

<figure align="center">
<img src="https://drive.google.com/uc?id=1YANAXEr9sCy3-oS78PeTiaELOFJrDL31" height="300px" width="400px">
</figure>

- The Apriori algorithm uses three measures to find the best association rules in a dataset. These measures are:

- **(a)Support:** It measures how often a particular item or combination of items occurs in a dataset, as a fraction of all transactions. The mathematical formula for support is:

$$Support(I) = \frac{\text{Number of transactions containing } I}{\text{Total number of transactions}}$$

- where $I$ is a particular item (or itemset) in the dataset.
- Support(wine) = 4/6 = 0.667
- Support(Bread) = 4/6 = 0.667

- **(b)Confidence:** Confidence is the likelihood that a customer who bought item $I_1$ also bought item $I_2$. It is calculated using the formula:

$$Confidence(I_1\rightarrow I_2) = \frac{\text{Transactions containing both } I_1 \text{ and } I_2}{\text{Transactions containing } I_1}$$

- Confidence(wine,chips)=(3/6)/(4/6)=0.75

- Confidence{(wine,chips),bread}=(2/6)/(3/6)=0.667

- **(c)Lift:** Lift is a measure of the strength of the association in a rule. It is obtained by dividing the confidence of the rule by the support of its consequent. Its mathematical formula is as follows:

$$Lift(I_1\rightarrow I_2) = \frac{Confidence(I_1\rightarrow I_2)}{Support(I_2)}$$
- Lift{(wine,chips),bread}= 0.667/0.667=1





- **Lift = 1** means that there is no correlation within the itemset.
- **Lift > 1** means that there is a positive correlation within the itemset, i.e., products in the itemset, x and y, are more likely to be bought together.
- **Lift < 1** means that there is a negative correlation within the itemset, i.e., products in itemset, x and y, are unlikely to be bought together.
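
The three measures above can be checked with a few lines of plain Python. This is a minimal sketch: the six transactions below are a hypothetical dataset (an assumption, not read from the figure) chosen so that the counts match the worked examples, and the helper names `support`, `confidence` and `lift` are illustrative.

```python
# Hypothetical six-transaction basket data (an assumption for illustration),
# chosen so the counts match the worked examples above.
transactions = [
    {"wine", "chips", "bread"},
    {"wine", "chips", "bread"},
    {"wine", "chips"},
    {"wine"},
    {"bread"},
    {"bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


print(round(support({"wine"}, transactions), 3))                       # 0.667
print(round(confidence({"wine"}, {"chips"}, transactions), 3))          # 0.75
print(round(confidence({"wine", "chips"}, {"bread"}, transactions), 3)) # 0.667
print(round(lift({"wine", "chips"}, {"bread"}, transactions), 3))       # 1.0
```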

##**Implementing Apriori Algorithm with Python**

-  In this section we will use the Apriori algorithm to find association rules in about 7,500 transactions recorded over the course of a week at a French retail store. Each transaction lists up to 20 products.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# for market basket analysis
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


In [None]:
from google.colab import drive     # Colab helper for mounting Google Drive in the virtual machine (VM)
drive.mount('/gdrive')              # Colab and Drive run on different servers, so mount Drive to access the data

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [None]:
store_data=pd.read_csv('/gdrive/My Drive/ML Project /Feature Engineering /4.ML Algorithms/store_data.csv',quoting=3)
                                 # Read the data file using its full path under My Drive (quoting=3 is csv.QUOTE_NONE)

In [None]:
store_data.head()

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
0,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
1,chutney,,,,,,,,,,,,,,,,,,,
2,turkey,avocado,,,,,,,,,,,,,,,,,,
3,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
4,low fat yogurt,,,,,,,,,,,,,,,,,,,


In the above dataset, the header is actually the first transaction, because the pd.read_csv function treats the first row as the header by default. To solve this problem, pass the header=None option to pd.read_csv, as shown below:

A NaN means that the transaction has no item in that position: each row lists one transaction's items from left to right, and the remaining cells are left empty.

In [None]:
store_data=pd.read_csv('/gdrive/My Drive/ML Project /Feature Engineering /4.ML Algorithms/store_data.csv',quoting=3,header=None)
                                 # Same file, but header=None keeps the first row as a transaction instead of a header

In [None]:
store_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


###**Data Preprocessing:**
- The Apriori library we are going to use requires our dataset to be in the form of a **list of lists,** where the whole dataset is a big list and each transaction in the dataset is an inner list within the outer big list.
- Currently we have data in the form of a pandas dataframe. To convert our pandas dataframe into a list of lists, execute the following script:



In [None]:
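# Convert each transaction (row) into a list of strings, one per column.
# Empty cells (NaN) become the string 'nan', so every inner list has the same length.
n_rows, n_cols = store_data.shape   # 7501 transactions, 20 item columns
records = []
for i in range(n_rows):
    records.append([str(store_data.values[i, j]) for j in range(n_cols)])

print(records)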

[['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil'], ['burgers', 'meatballs', 'eggs', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], ['chutney', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], ['turkey', 'avocado', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], ['low fat yogurt', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan

In [None]:
# converting it into a NumPy array
records = np.array(records)
print(records)


[['shrimp' 'almonds' 'avocado' ... 'frozen smoothie' 'spinach'
  'olive oil']
 ['burgers' 'meatballs' 'eggs' ... 'nan' 'nan' 'nan']
 ['chutney' 'nan' 'nan' ... 'nan' 'nan' 'nan']
 ...
 ['chicken' 'nan' 'nan' ... 'nan' 'nan' 'nan']
 ['escalope' 'green tea' 'nan' ... 'nan' 'nan' 'nan']
 ['eggs' 'frozen smoothie' 'yogurt cake' ... 'nan' 'nan' 'nan']]
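
Because the records above keep the string 'nan' for every empty cell, 'nan' would later be treated as if it were a product. One possible way to avoid this, sketched here on two rows copied from the printed output, is to drop the placeholders before encoding. The name `cleaned` is illustrative, and the resulting inner lists have different lengths, so this variant assumes the data is kept as a plain list of lists rather than converted to a NumPy array.

```python
# Two rows in the same string form as the records printed above.
records = [
    ['burgers', 'meatballs', 'eggs'] + ['nan'] * 17,
    ['turkey', 'avocado'] + ['nan'] * 18,
]

# Keep only real item names; the 'nan' placeholders are discarded.
cleaned = [[item for item in row if item != 'nan'] for row in records]

print(cleaned)
# [['burgers', 'meatballs', 'eggs'], ['turkey', 'avocado']]
```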


##**Using TransactionEncoder:**
- TransactionEncoder learns the unique labels in the dataset, and via the transform method, it transforms the input dataset (a Python list of lists) into a one-hot encoded NumPy boolean array.

- Here True means that the item represented by the column was purchased in that transaction, and False means that it was not.

In [None]:
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
data = te.fit_transform(records)
data = pd.DataFrame(data, columns = te.columns_)
print(data)

       asparagus  almonds  antioxydant juice  asparagus  avocado  babies food  \
0          False     True               True      False     True        False   
1          False    False              False      False    False        False   
2          False    False              False      False    False        False   
3          False    False              False      False     True        False   
4          False    False              False      False    False        False   
...          ...      ...                ...        ...      ...          ...   
7496       False    False              False      False    False        False   
7497       False    False              False      False    False        False   
7498       False    False              False      False    False        False   
7499       False    False              False      False    False        False   
7500       False    False              False      False    False        False   

      bacon  barbecue sauce
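
To make the encoding step concrete, here is a plain-Python sketch of the same idea on three small transactions. It is not mlxtend's implementation; the function name `one_hot_encode` and the sample baskets are assumptions made for illustration.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[col in set(t) for col in columns] for t in transactions]
    return columns, rows


# Hypothetical baskets using item names that appear in the dataset.
transactions = [
    ['turkey', 'avocado'],
    ['burgers', 'meatballs', 'eggs'],
    ['avocado', 'eggs'],
]

columns, rows = one_hot_encode(transactions)
print(columns)   # ['avocado', 'burgers', 'eggs', 'meatballs', 'turkey']
print(rows[0])   # [True, False, False, False, True]
```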

In [None]:
import warnings    
warnings.filterwarnings('ignore')   # suppress warning messages in the notebook output

# getting correlations for 121 items would be messy 
# so let's reduce the items from 121 to 40

data = data.loc[:, ['mineral water', 'burgers', 'turkey', 'chocolate', 'frozen vegetables', 'spaghetti',
                    'shrimp', 'grated cheese', 'eggs', 'cookies', 'french fries', 'herb & pepper', 'ground beef',
                    'tomatoes', 'milk', 'escalope', 'fresh tuna', 'red wine', 'ham', 'cake', 'green tea',
                    'whole wheat pasta', 'pancakes', 'soup', 'muffins', 'energy bar', 'olive oil', 'champagne', 
                    'avocado', 'pepper', 'butter', 'parmesan cheese', 'whole wheat rice', 'low fat yogurt', 
                    'chicken', 'vegetables mix', 'pickles', 'meatballs', 'frozen smoothie', 'yogurt cake']]



##**Applying Apriori:**

- The `apriori` function from mlxtend takes the **one-hot encoded DataFrame** prepared in the previous step as its first argument.
- The **min_support** parameter keeps only the itemsets whose **support is at least the value specified by the parameter** (here 0.05, i.e. 5% of all transactions).
- Setting **use_colnames = True** reports the itemsets by item name rather than by column index.
- Confidence and lift thresholds are applied in a second step: `association_rules` takes the frequent itemsets, a **metric** (here confidence) and a **min_threshold**, and keeps only the rules whose metric value is at least that threshold.
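
The role of the min_support threshold can be illustrated with a minimal pure-Python sketch of the level-wise Apriori search. It is a teaching sketch, not the library's implementation: the function name `frequent_itemsets` and the six sample baskets are assumptions. Itemsets of size k are built only from frequent itemsets of size k-1, because a set cannot be frequent if one of its subsets is not.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result


# Hypothetical baskets (an assumption for illustration).
baskets = [
    {"wine", "chips", "bread"},
    {"wine", "chips", "bread"},
    {"wine", "chips"},
    {"wine"},
    {"bread"},
    {"bread"},
]

found = frequent_itemsets(baskets, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 3))
```

With a threshold of 0.5, only the three single items and the pair (wine, chips) survive; the other pairs appear in just two of the six baskets and are discarded, so no three-item candidate is generated.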

In [None]:
from mlxtend.frequent_patterns import apriori

#Now, let us return the items and itemsets with at least 5% support:
frq_items=apriori(data, min_support = 0.05, use_colnames = True)


In [None]:
from mlxtend.frequent_patterns import association_rules
# Collecting the inferred rules in a dataframe 
rules = association_rules(frq_items, metric ="confidence", min_threshold =0.2) 

rules 


| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (chocolate) | (mineral water) | 0.163845 | 0.238368 | 0.05266 | 0.3214 | 1.348332 | 0.013604 | 1.122357 |
| 1 | (mineral water) | (chocolate) | 0.238368 | 0.163845 | 0.05266 | 0.220917 | 1.348332 | 0.013604 | 1.073256 |
| 2 | (mineral water) | (spaghetti) | 0.238368 | 0.17411 | 0.059725 | 0.250559 | 1.439085 | 0.018223 | 1.102008 |
| 3 | (spaghetti) | (mineral water) | 0.17411 | 0.238368 | 0.059725 | 0.343032 | 1.439085 | 0.018223 | 1.159314 |
| 4 | (mineral water) | (eggs) | 0.238368 | 0.179709 | 0.050927 | 0.213647 | 1.188845 | 0.00809 | 1.043158 |
| 5 | (eggs) | (mineral water) | 0.179709 | 0.238368 | 0.050927 | 0.283383 | 1.188845 | 0.00809 | 1.062815 |
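
The columns of the first rule, (chocolate) → (mineral water), can be recomputed from its three support values as a sanity check. Confidence and lift follow the formulas given earlier; for leverage and conviction this sketch assumes the commonly used definitions (support minus the product of the two individual supports, and (1 - consequent support) / (1 - confidence)). Small differences in the last digits come from the rounded inputs.

```python
# Support values copied from row 0 of the rules table.
antecedent_support = 0.163845   # support(chocolate)
consequent_support = 0.238368   # support(mineral water)
rule_support = 0.05266          # support(chocolate and mineral water together)

confidence = rule_support / antecedent_support
lift = confidence / consequent_support
# Assumed definitions for the two remaining columns:
leverage = rule_support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 4))   # 0.3214
print(round(lift, 3))         # 1.348
print(round(leverage, 4))     # 0.0136
print(round(conviction, 3))   # 1.122
```

A lift of about 1.35 means that, under these numbers, chocolate buyers purchase mineral water roughly 35% more often than customers in general.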
