## **Association Analysis - Sequential Pattern Mining (SPM)**

### 1. Introduction and algorithm description
- This notebook uses an itemset dataset to demonstrate the sequential pattern mining algorithm described below, which is provided by hana_ml.<br>
<br>
- **SPM (Sequential Pattern Mining)**
 The sequential pattern mining algorithm searches for frequent patterns in sequence databases. A sequence database consists of ordered elements or events. For example, a customer first buys bread, then eggs and cheese, and then milk. This forms a sequence consisting of three ordered events. An event or subsequence is considered frequent if its support, that is, the number of sequences that contain it, is greater than a certain value. The algorithm finds the patterns in the input sequences that satisfy a user-defined minimum support.
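
To make the notion of support concrete, here is a minimal pure-Python sketch. It is not how PAL implements SPM. The toy database and the helper names `contains` and `support` are assumptions made up for illustration, and support is expressed as a fraction of sequences so that it is comparable to a threshold such as `min_support=0.5`.

```python
from typing import List, Set

# One customer's history: an ordered list of events, each event a set of items.
Sequence = List[Set[str]]


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """Return True if the events of `pattern` occur in `sequence` in the same order.

    Events need not be adjacent; each pattern event must be a subset of the
    sequence event it is matched against.
    """
    pos = 0
    for event in sequence:
        if pos < len(pattern) and pattern[pos] <= event:
            pos += 1
    return pos == len(pattern)


def support(database: List[Sequence], pattern: Sequence) -> float:
    """Fraction of sequences in `database` that contain `pattern`."""
    hits = sum(contains(seq, pattern) for seq in database)
    return hits / len(database)


# Hypothetical toy database with three customers.
database = [
    [{"bread"}, {"eggs", "cheese"}, {"milk"}],
    [{"bread"}, {"milk"}],
    [{"eggs"}, {"bread"}],
]

print(support(database, [{"bread"}, {"milk"}]))  # contained in 2 of 3 sequences
print(support(database, [{"eggs"}, {"milk"}]))   # contained in 1 of 3 sequences
```

With a minimum support of 0.5, the pattern bread -> milk would count as frequent in this toy database and eggs -> milk would not.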

**Understand Sequential Pattern Mining before going into practice**<br>

- T1: Find all subsets of items that occur in a specific order across the transactions:
      e.g. {Playing cricket -> high ECG -> Sweating}
- T2: Find all rules that correlate the occurrence of one set of items with a later set of items in the transaction database:
      e.g. 72% of users who perform a web search and then make a long eye gaze
           over the ads follow that with a successful ad click

**Prerequisites**<br>
● The input data does not contain null values.<br>
● There are no duplicated items in each transaction.<br>
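
The two prerequisites can be checked before fitting. The sketch below is one possible way to do it on plain Python tuples of the form (CUSTID, TRANSID, ITEMS). The helper name `find_violations` and the sample rows are hypothetical, and it assumes that a transaction is identified by the customer and transaction ID together.

```python
from typing import Iterable, List, Optional, Tuple

# (CUSTID, TRANSID, ITEMS); None stands for a null value.
Row = Tuple[Optional[str], Optional[int], Optional[str]]


def find_violations(rows: Iterable[Row]) -> List[str]:
    """List rows that contain a null or repeat an item within one transaction."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        if any(value is None for value in row):
            problems.append(f"row {index}: null value in {row}")
            continue
        if row in seen:
            problems.append(f"row {index}: duplicated item {row}")
        seen.add(row)
    return problems


# Hypothetical rows: the second repeats an item, the third has a null TRANSID.
rows = [
    ("A", 1, "Apple"),
    ("A", 1, "Apple"),
    ("B", None, "Apple"),
    ("B", 2, "Blueberry"),
]

for problem in find_violations(rows):
    print(problem)
```

An empty result means both prerequisites hold for the rows that were checked.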

## Dataset
We will analyze store data for frequent pattern mining. This is the sample data available on SAP's help webpage.

- **Attribute Information**<br>
 CUSTID -  Customer ID <br>
 TRANSID - Transaction ID <BR>
 ITEMS - Item of Transaction

### **Import Packages**
First, import the packages needed for data loading.

In [None]:
from hana_ml import dataframe
from hana_ml.algorithms.pal.utility import Settings, DataSets

## **Setup Connection**
In our case, the data is loaded from a CSV file into a HANA table called "PAL_SPM_DATA_TBL". To do that, a connection to HANA is created and then passed to the data loader. To create such a connection, a config file, config/e2edata.ini, is used to control the connection parameters. A sample section of the config file is shown below; it includes the HANA url, port, user and password.<br>
<br>
###################<br>
[hana]<br>
url=host-url<br>
user=username<br>
passwd=userpassword<br>
port=3xx15<br>
<br>
###################<br>
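
For readers unfamiliar with this file format, the sketch below shows how such a section could be read with Python's standard `configparser`. It only illustrates the format and is not hana_ml's `Settings.load_config` implementation. The helper name `read_hana_section` and the port value 30015 are placeholders chosen for the example.

```python
import configparser

# Hypothetical config text mirroring the sample section; placeholder values only.
SAMPLE = """
[hana]
url=host-url
user=username
passwd=userpassword
port=30015
"""


def read_hana_section(text: str):
    """Parse the [hana] section and return (url, port, user, passwd)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["hana"]
    return section["url"], section.getint("port"), section["user"], section["passwd"]


url, port, user, pwd = read_hana_section(SAMPLE)
print(url, port, user)
```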

In [None]:
url, port, user, pwd = Settings.load_config("../../config/e2edata.ini")
# the connection
#print(url , port , user , pwd)
connection_context = dataframe.ConnectionContext(url, port, user, pwd)
print(connection_context.connection.isconnected())

  **Load Data**<br>
   Next, the function DataSets.load_spm_data() decides whether to load the data or reload it from scratch. The first time the data is loaded, a message like the following is returned:
   
   ERROR:hana_ml.dataframe:Failed to get row count for the current Dataframe, (259, 'invalid table name:  Could not find table/view PAL_SPM_DATA_TBL in schema DM_PAL: line 1 col 37 (at pos 36)')<br>
Table PAL_SPM_DATA_TBL doesn't exist in schema DM_PAL<br>
Creating table PAL_SPM_DATA_TBL in schema DM_PAL ....<br>
Drop unsuccessful<br>
Creating table DM_PAL.PAL_SPM_DATA_TBL<br>
Data Loaded:100%<br>
   
   #####################<br>
   

In [None]:
df = DataSets.load_spm_data(connection_context)

In [None]:
df.head(100).collect()  # Display the first 100 rows

In [None]:
df = df.dropna()  # Drop rows with null values, if any are present in the dataset

In [None]:
print("Total Number of Records : " + str(df.count()))

In [None]:
print("Columns:")
df.columns

## **Filter**

In [None]:
df.filter("CUSTID = 'A'").head(10).collect()

In [None]:
df.filter('TRANSID = 1').head(100).collect()

In [None]:
df.filter("ITEMS = 'Apple'").head(10).collect()

### **Group by column**

In [None]:
df.agg([('count' , 'ITEMS' , 'TOTAL TRANSACTIONS')] , group_by='ITEMS').head(100).collect()

In [None]:
df.agg([('count' , 'CUSTID', 'TOTAL TRANSACTIONS')] , group_by='CUSTID').head(100).collect()

In [None]:
df.agg([('count' , 'TRANSID', 'TOTAL TRANSACTIONS')] , group_by='TRANSID').head(100).collect()

**Display the most popular items**

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
plt.rcParams['figure.figsize'] = (10, 10)
wordcloud = WordCloud(background_color = 'white', width = 500,  height = 500, max_words = 120).generate(str(df.head(100).collect()))
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Popular Items',fontsize = 10)
plt.show()
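
A word cloud gives only a visual impression. If exact counts are wanted, a small sketch with `collections.Counter` is one option. The item list and the helper name `top_items` below are hypothetical; with the real data the list could be taken from the ITEMS column of the collected DataFrame.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_items(items: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequent items together with their counts."""
    return Counter(items).most_common(n)


# Hypothetical item column, not the notebook's data.
items = ["Apple", "Blueberry", "Apple", "Cherry", "Apple", "Blueberry"]

for item, count in top_items(items):
    print(item, count)
```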

### Import SPM Method from HANA ML Library 

In [None]:
df.filter("ITEMS = 'Blueberry'").head(100).count()

In [None]:
from hana_ml.algorithms.pal.association import SPM

### **Setup SPM instance**

In [None]:
sp = SPM(min_support=0.5,
         relational=False,
         ubiquitous=1.0,
         max_len=10,
         min_len=1,
         calc_lift=True)

In [None]:
sp.fit(data=df, customer='CUSTID', transaction='TRANSID', item='ITEMS')

**Result Analysis**:<br>

- Itemset Apple has support 1.0, meaning it appears in every sequence, which makes it the most frequent item. Confidence and lift are 0 for all single items, since a single item has no antecedent and consequent.
- Consider (Apple, Blueberry): support is 0.88 (these items occur together in 88% of the sequences). Confidence is 88%, meaning that if someone buys Apple, there is an 88% chance that they will also have Blueberry in their basket. Lift is 0.89, slightly below 1, which indicates that Blueberry follows Apple about as often as Blueberry's overall frequency would predict. The high support and confidence therefore reflect how common both items are rather than a strong association between them.
- The benefit of such a result is that storekeepers can easily see the purchasing trends in their shops.
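
The relation between support, confidence and lift can be sketched as follows, assuming the usual association-rule definitions (confidence = support of the pattern / support of the antecedent, lift = confidence / support of the consequent). PAL's exact definitions for sequential patterns may differ. The support values and the tolerance in `describe_lift` are hypothetical and not taken from the output of this notebook.

```python
def confidence(support_pattern: float, support_antecedent: float) -> float:
    """Share of sequences containing the antecedent that also contain the full pattern."""
    return support_pattern / support_antecedent


def lift(support_pattern: float, support_antecedent: float, support_consequent: float) -> float:
    """Confidence relative to how often the consequent occurs on its own."""
    return confidence(support_pattern, support_antecedent) / support_consequent


def describe_lift(value: float, tolerance: float = 0.05) -> str:
    """Rough reading of a lift value; the tolerance is an arbitrary illustrative choice."""
    if value > 1 + tolerance:
        return "positive association"
    if value < 1 - tolerance:
        return "negative association"
    return "roughly independent"


# Hypothetical supports for antecedent A, consequent B and the pattern A -> B.
s_a, s_b, s_ab = 0.5, 0.25, 0.25

print(confidence(s_ab, s_a))                 # 0.5
print(lift(s_ab, s_a, s_b))                  # 2.0
print(describe_lift(lift(s_ab, s_a, s_b)))   # positive association
```

A lift well above 1 suggests that the consequent follows the antecedent more often than its overall frequency would predict, a lift near 1 suggests little association, and a lift well below 1 suggests the opposite.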


In [None]:
sp.result_.collect()

**Attributes**

- **result_**

(DataFrame) The overall frequent pattern mining result, structured as follows:
  - 1st column: mined frequent patterns
  - 2nd column: support values
  - 3rd column: confidence values
  - 4th column: lift values

Available only when relational is False.

- **pattern_**

(DataFrame) Result for mined frequent patterns, structured as follows:
  - 1st column: pattern ID
  - 2nd column: transaction ID
  - 3rd column: items

- **stats_**

(DataFrame) Statistics for frequent pattern mining, structured as follows:
  - 1st column: pattern ID
  - 2nd column: support values
  - 3rd column: confidence values
  - 4th column: lift values

In [None]:
connection_context.close()