# **Association Analysis**<br>
## **1. Introduction and Algorithm Description**
   This notebook uses a real-world itemset dataset to demonstrate the following association rule mining algorithms provided by hana-ml:<br>
- **Apriori**<br>
- **AprioriLite**<br>    
- **FPGrowth**<br>
- **KORD**<br>
<br>
<br>
- **Apriori**<br>
Apriori is a classic predictive analysis algorithm for finding association rules used in association analysis. Association analysis uncovers the hidden patterns, correlations or causal structures among a set of items or objects. For example, association analysis enables you to understand what products and services customers tend to purchase at the same time. By analyzing the purchasing trends of your customers with association analysis, you can predict their future behavior.<br>

 **Prerequisites**<br>
     - The input data does not contain null values.<br>
     - There are no duplicated items in each transaction.<br>

  The Apriori algorithm takes the transactions (itemsets) as input and generates association rules based on the minimum support specified when the model is set up.<br>
  
  <b>Apriori Property</b> - If an itemset is frequent, then all of its subsets are also frequent.<br>
  An itemset is frequent if its support meets the minimum support threshold. The minimum support given before the model is executed therefore filters which itemsets are processed further.
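
The Apriori property is what allows level-wise pruning: a candidate itemset only needs its support counted if all of its smaller subsets are already known to be frequent. The following is a minimal sketch of that idea in plain Python, not the PAL implementation; the toy transactions and the function name `frequent_itemsets` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy transactions (not the notebook's dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    result = {}
    size = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        size += 1
        # Join step: build candidates that are one item larger.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step (Apriori property): all smaller subsets must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, size - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return result

freq = frequent_itemsets(transactions, min_support=0.5)
for itemset, sup in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a minimum support of 0.5, the three single items and the three pairs are kept, while the three-item set (support 0.25) is dropped.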
  
- **AprioriLite**<br>
  This is a lightweight implementation of the Apriori algorithm for association rule mining. It only calculates two large item sets, so each rule has a single antecedent item and a single consequent item.
  
  **Example for Apriori**

  Suppose we use a minimum support of 0.25 for the frequent itemset calculation.

<b>The transaction data below is used for the association analysis:</b><br>
![image.png](attachment:image.png)

#### <b>After applying frequent itemset analysis with a minimum support of 0.25, the itemsets below are considered for rule generation</b>
![image.png](attachment:image.png)
    
All of these itemsets have a support of more than 0.25.

- **Support** - how frequently an itemset appears across all the transactions. 
               Support({X} -> {Y}) = Transactions containing both X & Y  /  Total number of transactions 
               e.g. Support({News} -> {Finance}) = 4 / 6 = 0.67 (67%)
- A support of 0.67 means that this itemset appears in about 67% of the 6 transactions. 
- **Confidence** - another main factor in association rule analysis. <br>
  This measure defines the likelihood that the consequent is in the cart given that the cart already contains the antecedent. 
                 
                 Confidence({X} -> {Y}) = Transactions containing both X & Y  / Transactions containing X 
                 e.g. Confidence({News} -> {Finance}) = 4 / 5 = 0.8 (80%)
                   
**Lift** - one of the most important factors for finding strongly associated rules among all the generated rules.

Lift controls for the support (frequency) of the consequent while calculating the conditional probability of occurrence of {Y} given {X}: it is the confidence of the rule divided by the fraction of transactions containing {Y}.
The name is literal: lift measures how much the presence of {X} lifts the likelihood of {Y}. 

                  Lift({X} -> {Y}) = ( (Transactions containing both X & Y) / (Transactions containing X) ) / 
                                        (Fraction of transactions containing Y)
                  e.g. if Finance appears in 4 of the 6 transactions: Lift({News} -> {Finance}) = (4/5) / (4/6) = 1.2
   
- A lift greater than 1 indicates a positive association between {X} and {Y}: {Y} is more likely to be bought when {X} is in the cart than it is overall.
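
The three measures can be checked by hand with a few lines of Python. This is only a sketch: the transactions below are hypothetical, chosen so that News appears in 5 of 6 transactions and News and Finance appear together in 4, as in the example; the assumption that Finance appears in exactly 4 transactions is ours.

```python
# Hypothetical transactions: News in 5 of 6, News and Finance together in 4,
# Finance in 4 (the last count is an assumption made for this sketch).
transactions = [
    {"News", "Finance"},
    {"News", "Finance"},
    {"News", "Finance", "Sports"},
    {"News", "Finance", "Arts"},
    {"News", "Sports"},
    {"Arts"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of the antecedent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

def lift(antecedent, consequent, transactions):
    """Confidence divided by the support of the consequent."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

x, y = {"News"}, {"Finance"}
print(round(support(x | y, transactions), 2))   # 0.67
print(round(confidence(x, y, transactions), 2)) # 0.8
print(round(lift(x, y, transactions), 2))       # 1.2
```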


- **FPGrowth**
   FP-Growth is an algorithm to find frequent patterns from transactions without generating candidate itemsets.

   In SAP HANA PAL, the FP-Growth algorithm is extended to find association rules in three steps:

   - Converts the transactions into a compressed frequent pattern tree (FP-Tree);<br>
   - Recursively finds frequent patterns from the FP-Tree;<br>
   - Generates association rules based on the frequent patterns found in Step 2.<br>
   FP-Growth with relational output is also supported.
   
    <b>Prerequisites</b>
     - The input data does not contain null values.
     - There are no duplicated items in each transaction.
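
The first of the three FP-Growth steps, building the compressed FP-Tree, can be sketched as follows. This is an illustrative plain-Python version under our own assumptions, not the PAL implementation: the class `FPNode`, the function `build_fp_tree` and the toy transactions are hypothetical, and the header table with node links that a full FP-Growth needs for the mining step is omitted.

```python
from collections import Counter

class FPNode:
    """One node of the FP-Tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Pass 2: insert each transaction with its frequent items ordered by
    # descending frequency (ties broken alphabetically), so that common
    # prefixes share nodes.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent

# Hypothetical toy transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "jam"],
]
root, frequent = build_fp_tree(transactions, min_count=2)
print(root.children["bread"].count)                   # 3
print(root.children["bread"].children["milk"].count)  # 2
```

The three transactions that start with bread share a single `bread` node, which is where the compression comes from; `jam` occurs only once and never enters the tree.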

## **2. Dataset**

We will analyze store data for frequent pattern mining. This is real-world data from Kaggle for market basket analysis.
Here is the data source - https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view

A quick preview of the transactions in the dataset:

![image.png](attachment:image.png)

- **Attribute Information** 
   - CUSTOMER - Transaction identifier.
   - ITEM - Item for a transaction.


## **3. Data loading**<br>

### **Import Packages**
First, import packages needed in the data loading.

In [None]:
from hana_ml import dataframe
from hana_ml.algorithms.pal.utility import Settings, DataSets

### **Setup Connection**
In our case, the data is loaded into a HANA table called "PAL_APRIORI_TRANS_TBL" from the csv file "apriori_item_data.csv". To do that, a connection to HANA is created and then passed to the data loader. The connection parameters are read from a config file, config/e2edata.ini. A sample section of the config file is shown below; it includes the HANA url, port, user and password.<br>
<br>
###################<br>
[hana]<br>
url=host-url<br>
user=username<br>
passwd=userpassword<br>
port=3xx15<br>
<br>
###################<br>
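
For orientation, a file of this shape is a standard INI file and can be read with Python's built-in `configparser`. The sketch below only illustrates the file format; it is not how `Settings.load_config` is implemented, and the port value 30015 is a placeholder standing in for the elided `3xx15`.

```python
import configparser

# Same shape as the sample section above; all values are placeholders.
sample = """
[hana]
url=host-url
user=username
passwd=userpassword
port=30015
"""

parser = configparser.ConfigParser()
parser.read_string(sample)

hana = parser["hana"]
url = hana["url"]
port = hana.getint("port")  # converted to int
user = hana["user"]
pwd = hana["passwd"]
print(url, port, user)
```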

In [None]:
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
# the connection
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
print(connection_context.connection.isconnected())

  **Load Data**<br>
   Then, the function DataSets.load_apriori_data() is used to decide whether to load the data or reload it from scratch. The first time it runs, the data is loaded into the table.

   If the data is already loaded, the return message is "Table XXX exists and data exists".


In [None]:
df = DataSets.load_apriori_data(connection_context)

In [None]:
df.head(3).collect()

In [None]:
df = df.dropna()  # Drop rows with NULL values, if any

Total number of records in dataset:

In [None]:
df.count()  # Total number of records to be processed by the Apriori algorithm

Columns:

In [None]:
display(df.columns)

In [None]:
# Show the items that belong to a single transaction (CUSTOMER = 0)
df.filter('CUSTOMER = 0').head(10).collect()

In [None]:
# Simple data analysis: the number of transaction rows in which each item appears
df.agg([('count' , 'ITEM' , 'TOTAL TRANSACTIONS')] , group_by='ITEM').head(10).collect()

In [None]:
# Describe the data in the dataframe, including the unique customers and unique items
df.describe().collect()

Display the distinct (CUSTOMER, ITEM) pairs:

In [None]:
df_distinct = df.distinct(cols=['CUSTOMER' , 'ITEM']).collect()
df_distinct

The distinct table shows that the same items recur across many transactions. For example, almonds appears in 565 different transactions, and it was not always bought on its own: it may have been bought together with one, two or three other items, which is what we are going to find out. The same applies to every other item in the list.

In [None]:
df_distinct.count()  # Row count after removing duplicate (CUSTOMER, ITEM) pairs

Display the bar chart for the first 15 rows:

In [None]:
%matplotlib inline
df_distinct_new = df_distinct.head(15)
df_distinct_new.plot(kind='bar' , x = 'ITEM' , y = 'CUSTOMER')

The bar chart plots the first 15 rows of the distinct table, with ITEM on the x-axis and CUSTOMER on the y-axis. This kind of cross analysis of items and customers is what the association analysis algorithms below perform at scale, to find out which items are frequently picked together.

Data Types:

In [None]:
print(df.dtypes())

## **4. Data analysis: application of association analysis**
In this section we are going to apply various association analysis methods for frequent pattern mining in data.

### **4.1 Apriori**
Our goal is to mine frequent patterns from the data, so we start by applying the Apriori method.

First, import the Apriori method from hana-ml:

In [None]:
from hana_ml.algorithms.pal.association import Apriori

- **Apriori Signature**
   Refer here for the detailed signature of the Apriori method - 
   https://help.sap.com/viewer/2cfbc5cf2bc14f028cfbe2a2bba60a50/1.0.12/en-US/7a073d66173a4c1589ef5fbe5bb3120f.html
   
   Minimum support - 0.001, because our transaction volume is very high; check the total number of transactions before choosing this value.<br>
   Minimum confidence - this also depends on the total number of transactions in the data (explained above).<br>
   Minimum lift - pass at least 1, since a lift above 1 indicates a positive association.<br>
   **Note** - you can also vary the parameter values of the Apriori method and analyse the results. 

In [None]:
ap = Apriori(min_support=0.001,
             min_confidence=0.5,
             relational=False,
             min_lift=1,
             max_conseq=1,
             max_len=5,
             ubiquitous=1.0,
             use_prefix_tree=False,
             thread_ratio=0,
             timeout=3600,
             pmml_export='multi-row')

- Measure the execution time of the Apriori method. 
- Pass the data into the method.

Applying fit() function:

In [None]:
# Count the time consumption of fit function
import time
start = time.time()
ap.fit(df)
end = time.time()
print(str(end-start) + " " +  "Secs")

Result:

In [None]:
ap.result_.head(100).collect()

Let's analyse the first rule:

  #### 0	pasta	shrimp	0.0115	0.676471	8.351489   
  **Support** for this rule is 0.0115. This is the frequency of the itemset in the dataset: the number of transactions in which the itemset appears divided by the total number of transactions. Because we are processing a dataset with a record count of 6418, this value will always be low.
  e.g. Shrimp total count = 162 and total transactions = 6418, so Support = 162/6418 = 0.025<br> 
  The probability of having shrimp in the cart given that pasta is already in the cart is about 0.68 (0.676471); this is the confidence.<br>     This is a high confidence value, so the rule can be considered a strong association rule.<br>
  Results are displayed based on the parameters passed to the Apriori method. For example, the minimum confidence is 0.5, so the result only contains rules whose confidence is greater than or equal to 0.5. <br>
  
 Let's analyse another rule: 
 #### 98	almonds&green tea	soup	0.0010	0.666667	11.594203
 Here the confidence is 0.67 and the lift is 11.59 (which is greater than 1): there is a 67% chance that a customer buys soup if he/she is buying almonds & green tea.<br>

  The result displays the antecedent and consequent items together with the likelihood of the consequent in terms of confidence and lift, so purchasing trends can be easily analysed with the Apriori method.
  

### Change Lift & Confidence 

In [None]:
# update min_confidence to be 0.7 and min_lift to be 5
ap = Apriori(min_support=0.001,
             min_confidence=0.7,
             relational=False,
             min_lift=5,
             max_conseq=1,
             max_len=5,
             ubiquitous=1.0,
             use_prefix_tree=False,
             thread_ratio=0,
             timeout=3600,
             pmml_export='multi-row')

Applying fit() function:

In [None]:
ap.fit(data=df)

The minimum confidence and lift were changed before re-running the Apriori method: min_confidence is now 0.7 and min_lift is 5. The result therefore contains only rules with a confidence of at least 0.7 (mostly 1) and a lift of at least 5, which we can consider strong association rules.


In [None]:
ap.result_.head(100).collect()
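
To make the effect of the two thresholds concrete, the sketch below applies them to the two rule rows quoted earlier in this notebook. It is plain Python on a hand-copied list, not a hana-ml call; the helper name `filter_rules` is hypothetical.

```python
# The two rule rows quoted earlier in this notebook:
# (antecedent, consequent, support, confidence, lift)
rules = [
    ("pasta", "shrimp", 0.0115, 0.676471, 8.351489),
    ("almonds&green tea", "soup", 0.0010, 0.666667, 11.594203),
]

def filter_rules(rules, min_confidence, min_lift):
    """Keep rules whose confidence and lift both meet the thresholds."""
    return [r for r in rules
            if r[3] >= min_confidence and r[4] >= min_lift]

loose = filter_rules(rules, min_confidence=0.5, min_lift=1)
strict = filter_rules(rules, min_confidence=0.7, min_lift=5)
print(len(loose), len(strict))  # 2 0
```

Both rules pass the first setting, but neither survives a minimum confidence of 0.7, even though their lift is well above 5.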

#### **Apriori algorithm set up using relational logic:**

In [None]:
apr = Apriori(min_support=0.001,
              min_confidence=0.5,
              relational=True,
              min_lift=1,
              max_conseq=1,
              max_len=5,
              ubiquitous=1.0,
              use_prefix_tree=False,
              thread_ratio=0,
              timeout=3600,
              pmml_export='multi-row')

Applying fit() function:

In [None]:
apr.fit(data=df)

In [None]:
apr.antec_.head(5).collect()

In [None]:
apr.conseq_.head(5).collect()

In [None]:
apr.stats_.head(5).collect()

**Result Analysis**
  The same result is returned in relational form, but split across three tables keyed by rule ID. One rule ID is generated for each mined rule. The first table holds the antecedent items (ANTECEDENTITEM), the second holds the consequent items (CONSEQUENTITEM), and the third holds the statistics (support, confidence and lift) for each rule ID. We can therefore filter the stats table for the rules with high association.
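
Joining the three relational tables back into complete rules amounts to a lookup by rule ID. The sketch below shows this on hand-written rows in plain Python. The column names ANTECEDENTITEM and CONSEQUENTITEM come from the text above; RULE_ID, SUPPORT, CONFIDENCE and LIFT, as well as the layout of one row per antecedent item, are assumptions made for illustration.

```python
# Hypothetical rows shaped like the three relational outputs.
antec = [
    {"RULE_ID": 0, "ANTECEDENTITEM": "pasta"},
    {"RULE_ID": 1, "ANTECEDENTITEM": "almonds"},
    {"RULE_ID": 1, "ANTECEDENTITEM": "green tea"},
]
conseq = [
    {"RULE_ID": 0, "CONSEQUENTITEM": "shrimp"},
    {"RULE_ID": 1, "CONSEQUENTITEM": "soup"},
]
stats = [
    {"RULE_ID": 0, "SUPPORT": 0.0115, "CONFIDENCE": 0.676471, "LIFT": 8.351489},
    {"RULE_ID": 1, "SUPPORT": 0.0010, "CONFIDENCE": 0.666667, "LIFT": 11.594203},
]

def join_rules(antec, conseq, stats):
    """Combine the three tables into one record per rule ID."""
    rules = {s["RULE_ID"]: {"antecedent": [], "consequent": [], **s}
             for s in stats}
    for row in antec:
        rules[row["RULE_ID"]]["antecedent"].append(row["ANTECEDENTITEM"])
    for row in conseq:
        rules[row["RULE_ID"]]["consequent"].append(row["CONSEQUENTITEM"])
    return rules

rules = join_rules(antec, conseq, stats)
# Keep only the rules with a lift above 10.
strong = {rid: r for rid, r in rules.items() if r["LIFT"] > 10}
print(strong[1]["antecedent"], "->", strong[1]["consequent"])
```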

### **Attributes/Parameters of Apriori method**

**Attributes**<br>
- **result_**

(DataFrame) Mined association rules and related statistics, structured as follows:
  - 1st column : antecedent (leading) items.
  - 2nd column : consequent (dependent) items.
  - 3rd column : support value.
  - 4th column : confidence value.
  - 5th column : lift value.

Available only when relational is False.<br>
- **model_**

(DataFrame) Apriori model trained from the input data, structured as follows:
  - 1st column : model ID.
  - 2nd column : model content, i.e. Apriori model in PMML format.<br>
- **antec_**

(DataFrame) Antecedent items of mined association rules, structured as follows:
  - 1st column : association rule ID.
  - 2nd column : antecedent items of the corresponding association rule.

Available only when relational is True.<br>
- **conseq_**

(DataFrame) Consequent items of mined association rules, structured as follows:
  - 1st column : association rule ID.
  - 2nd column : consequent items of the corresponding association rule.

Available only when relational is True.<br>
- **stats_**

(DataFrame) Statistics of the mined association rules, structured as follows:
  - 1st column : rule ID.
  - 2nd column : support value of the rule.
  - 3rd column : confidence value of the rule.
  - 4th column : lift value of the rule.

Available only when relational is True.<br>

### **4.2 AprioriLite**

A light version of the Apriori algorithm for association rule mining, where only two large item sets are calculated.
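
Restricting the search to pairs means every rule has the form "one item -> one item". The sketch below mines such rules from pair counts in plain Python. It only illustrates the idea and is not the PAL implementation; the toy transactions and the function name `pair_rules` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def pair_rules(transactions, min_support, min_confidence):
    """Mine single-item -> single-item rules from 2-item sets only."""
    n = len(transactions)
    item_counts = Counter(i for t in transactions for i in t)
    pair_counts = Counter(frozenset(p) for t in transactions
                          for p in combinations(sorted(t), 2))
    rules = []
    for pair, count in pair_counts.items():
        if count / n < min_support:
            continue
        a, b = sorted(pair)
        # Each frequent pair yields up to two rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            conf = count / item_counts[x]
            if conf >= min_confidence:
                lift = conf / (item_counts[y] / n)
                rules.append((x, y, count / n, conf, lift))
    return sorted(rules)

rules = pair_rules(transactions, min_support=0.5, min_confidence=0.6)
for antecedent, consequent, sup, conf, lift in rules:
    print(antecedent, "->", consequent, round(sup, 2), round(conf, 2), round(lift, 2))
```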

Set up parameters for light Apriori algorithm, ingest the input data, and check the result table: 

In [None]:
from hana_ml.algorithms.pal.association import AprioriLite  # Import the AprioriLite class from hana-ml

In [None]:
apl = AprioriLite(min_support=0.001,          # Minimum support of 0.001
                  min_confidence=0.6,         # Minimum confidence of 0.6
                  subsample=1.0,
                  recalculate=False,
                  timeout=3600,
                  pmml_export='multi-row')

Applying fit() function:

In [None]:
import time
start = time.time()
apl.fit(data=df)
end = time.time()
print(str(end-start) + " " +  "Secs")

- **Result Analysis** <br> 
  The result reads the same way as for the Apriori method above. The only difference is that AprioriLite works only on two large item sets, so each rule shows a single antecedent item and a single consequent item with their statistics: support, confidence and lift.

In [None]:
apl.result_.head(1000).collect()

### **4.3 FPGrowth**

In [None]:
from hana_ml.algorithms.pal.association import FPGrowth

In [None]:
fpg = FPGrowth(min_support=0.0001,
               min_confidence=0.5,
               relational=False,
               min_lift=1.0,
               max_conseq=1,
               max_len=5,
               ubiquitous=1.0,
               thread_ratio=0,
               timeout=3600)

In [None]:
import time
start = time.time()
fpg.fit(data=df)
end = time.time()
print(str(end-start) + " " +  "Secs")

- The FPGrowth method works on a divide-and-conquer approach and is typically faster than the Apriori algorithm.
- It builds an FP-tree to find the frequent itemsets.
- Apriori uses a level-wise approach: it generates patterns containing 1 item, then 2 items, then 3 items, etc.
- **Result Analysis**
  A lower minimum support lets in itemsets whose frequency across all transactions is very low. That is fine if someone wants to analyse those itemsets; otherwise we can consider only the items that are frequent enough. For example, to consider only itemsets that occur at least 50 times in a total of 10000 records, pass a minimum support of 0.005.
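
The conversion from "at least this many occurrences" to a minimum support value is a single division. A small helper, with the hypothetical name `min_support_for`, makes the arithmetic explicit:

```python
def min_support_for(min_count, n_transactions):
    """Support fraction that keeps itemsets seen at least min_count times."""
    if n_transactions <= 0:
        raise ValueError("n_transactions must be positive")
    return min_count / n_transactions

print(min_support_for(50, 10000))  # 0.005
```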

In [None]:
fpg.result_.collect()

- Let's increase the minimum support and re-run the method 

In [None]:
fpg = FPGrowth(min_support=0.005,
               min_confidence=0.6,
               relational=False,
               min_lift=1.0,
               max_conseq=1,
               max_len=5,
               ubiquitous=1.0,
               thread_ratio=0,
               timeout=3600)

In [None]:
import time
start = time.time()
fpg.fit(data=df)
end = time.time()
print(str(end-start) + " " +  "Secs")

In [None]:
fpg.result_.collect()

- The support for the first record is 0.0115, which means this itemset appears roughly 74 times (0.0115 × 6418 ≈ 73.8). The confidence tells us that if someone purchases pasta, there is about a 68% chance that they will also buy shrimp.
- The rules displayed above can be considered strong association rules, as they combine good support, confidence and lift. With all of the techniques mentioned, we can vary the support, confidence and lift parameters and analyse the different results for market basket analysis, since all three parameters play an important role in the execution of the method. 

### FPGrowth algorithm set up using relational logic:

In [None]:
fpgr = FPGrowth(min_support=0.001,
                min_confidence=0.6,
                relational=True,
                min_lift=1.0,
                max_conseq=1,
                max_len=5,
                ubiquitous=1.0,
                thread_ratio=0,
                timeout=3600)

In [None]:
import time
start = time.time()
fpgr.fit(data=df)
end = time.time()
print(str(end-start) + " " +  "Secs")

- Mine association rules again using the FPGrowth algorithm on the input data, and check the resulting tables:

- Antecedent items, with rule ID as key

In [None]:
fpgr.antec_.collect()

- Consequent items, with rule ID as key, that satisfy the criteria we passed in the FPGrowth method call
- Joining these two tables on the rule ID gives all the antecedent and consequent items of each rule; these rules can be considered strong association rules

In [None]:
fpgr.conseq_.collect()

In [None]:
fpgr.stats_.collect()

- The stats show that all rules with a strong association have been returned; for example, the support, confidence and lift for rule ID 0 are high enough to be considered.
- If the user wants to narrow these results further, a filter on the rules can be applied after the result is displayed 

In [None]:
fpgr.stats_.collect()

- Filter the result as required; here only the records with a lift greater than 5 are displayed

In [None]:
fpgr.stats_.filter('LIFT > 5').collect()

### **4.4 KORD**
- K-optimal rule discovery (KORD) follows the idea of generating association rules with respect to a well-defined measure, instead of first finding all frequent itemsets and then generating all possible rules.<br>
- Import the KORD algorithm from the hana-ml package
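
The output KORD aims for is the k best rules under one chosen measure. The sketch below shows only that selection step on an already enumerated rule list, which is exactly the exhaustive enumeration KORD avoids, so it illustrates the result rather than KORD's search. The first two rows are quoted from earlier in this notebook; the third row and the helper name `top_k_rules` are hypothetical.

```python
import heapq

# (antecedent, consequent, support, confidence, lift)
rules = [
    ("pasta", "shrimp", 0.0115, 0.676471, 8.351489),
    ("almonds&green tea", "soup", 0.0010, 0.666667, 11.594203),
    ("bread", "milk", 0.0200, 0.550000, 1.300000),  # hypothetical row
]
MEASURE_INDEX = {"support": 2, "confidence": 3, "lift": 4}

def top_k_rules(rules, k, measure="lift"):
    """Return the k rules with the highest value of the chosen measure."""
    idx = MEASURE_INDEX[measure]
    return heapq.nlargest(k, rules, key=lambda r: r[idx])

best = top_k_rules(rules, k=2, measure="lift")
print([r[0] for r in best])
```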

In [None]:
from hana_ml.algorithms.pal.association import KORD

- Set up a KORD instance:

In [None]:
krd =  KORD(k=50,
            measure='lift',
            min_support=0.001,
            min_confidence=0.5,
            epsilon=0.1,
            use_epsilon=False)

In [None]:
start = time.time()
krd.fit(data=df , transaction='CUSTOMER' , item='ITEM')
end = time.time()
print(str(end-start) + " " +  "Secs")

- **Result Analysis**
 - KORD returns its result as three tables. The first contains the antecedent items (ANTECEDENT) of the rules that satisfy the criteria passed above, for example the minimum confidence and support.<br>
 - The second table contains the CONSEQUENT items of the rules whose antecedents are in the first table. All the tables share the RULE_ID column, which can be used to join them.<br>
 - The third table contains the statistics of all the rules that passed the KORD criteria, i.e. the filtered rules.

In [None]:
krd.antec_.collect()

- Frequent consequent items

In [None]:
krd.conseq_.collect()

- KORD statistics for the mined association rules 

In [None]:
krd.stats_.collect()

Finally, close the connection to SAP HANA:

In [None]:
connection_context.close()

## *Note* - For detailed reading please follow the links - 
 *Apriori* -   https://help.sap.com/viewer/2cfbc5cf2bc14f028cfbe2a2bba60a50/1.0.12/en-US/7a073d66173a4c1589ef5fbe5bb3120f.html<br>
 *FPGrowth* - https://help.sap.com/viewer/2cfbc5cf2bc14f028cfbe2a2bba60a50/1.0.12/en-US/9495128435164c2680f064b65fef3774.html<br>
 *KORD* -      https://help.sap.com/viewer/2cfbc5cf2bc14f028cfbe2a2bba60a50/1.0.12/en-US/598818b3d063482f917e7b9d2f684a4e.html<br>