# Finding frequent sequential patterns in sequence databases using SPAM

This tutorial has two parts. The first part describes the basic approach to finding frequent patterns in a sequence database using the SPAM algorithm. The second part describes an advanced approach, in which we evaluate the SPAM algorithm on a dataset at different *minimum support* thresholds.

## Prerequisites:

1. Installing the PAMI library

In [None]:
!pip install -U pami

Collecting pami
  Downloading pami-2023.7.31.7-py3-none-any.whl (795 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 795.3/795.3 kB 5.0 MB/s eta 0:00:00
Collecting resource (from pami)
  Downloading Resource-0.2.1-py2.py3-none-any.whl (25 kB)
Collecting validators (from pami)
  Downloading validators-0.20.0.tar.gz (30 kB)
  Preparing metadata (setup.py) ... done
Collecting JsonForm>=0.0.2 (from resource->pami)
  Downloading JsonForm-0.0.2.tar.gz (2.4 kB)
  Preparing metadata (setup.py) ... done
Collecting JsonSir>=0.0.2 (from resource->pami)
  Downloading JsonSir-0.0.2.tar.gz (2.2 kB)
  Preparing metadata (setup.py) ... done
Collecting python-easyconfig>=0.1.0 (from resource->pami)
  Downloading Python_EasyConfig-0.1.7-py2.py3-none-any.whl (5.4 kB)
Building wheels for collected packages: validators, JsonForm, JsonSir
  Building wheel for validators (setup.py) ... done
  Created wheel for validators

2. Downloading a sample dataset

In [None]:
!wget -nc https://www.dropbox.com/scl/fi/c2xdmns7rprxnkgd9h3gb/airPollution.csv?rlkey=q7zoop7mi2n4z3qi94lpc1jlf&dl=0

--2023-08-02 08:10:32--  https://www.dropbox.com/scl/fi/c2xdmns7rprxnkgd9h3gb/airPollution.csv?rlkey=q7zoop7mi2n4z3qi94lpc1jlf
Resolving www.dropbox.com (www.dropbox.com)... 162.125.80.18, 2620:100:6035:18::a27d:5512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.80.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.dropbox.com/e/scl/fi/c2xdmns7rprxnkgd9h3gb/airPollution.csv?rlkey=q7zoop7mi2n4z3qi94lpc1jlf [following]
--2023-08-02 08:10:33--  https://www.dropbox.com/e/scl/fi/c2xdmns7rprxnkgd9h3gb/airPollution.csv?rlkey=q7zoop7mi2n4z3qi94lpc1jlf
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc6e83fcae18345ed9f24fcc57db.dl.dropboxusercontent.com/cd/0/get/CBAVAN1GyFAZy61-87lk2Cwgyk0takEi7knNTV4auju5vv2Rku0JNhrL1MDk0hmf-05CZujskWg-tfxqfBlggW6GuEDjcNdWn0hGjWcKWwXoB572oEpalDl_wRWM3HtlSJ4uWQ0ZKUP6iF-S_-XfvmGH/file# [following]
--2023-08-02 08:10:33--  https://uc6e83fca

3. Converting the air pollution dataset to a sequence dataset

3.1 Rename the downloaded file to airPollution.csv

In [None]:
!mv "airPollution.csv?rlkey=q7zoop7mi2n4z3qi94lpc1jlf" airPollution.csv

3.2 Show the first 3 rows as an example of the database

In [None]:
!head -3 airPollution.csv

head: cannot open 'airPollution.csv' for reading: No such file or directory


3.3 Read the file as a dataset

In [None]:
import pandas as pd
dataset = pd.read_csv('airPollution.csv', index_col="TimeStamp")

dataset
# The rows displayed below are hourly records running from 2021-07-01 01:00:00 to 2022-07-24 21:00:00.

TimeStamp,Unnamed: 0,POINT(137.2331301 36.7425277),POINT(140.8733429 38.2932172),POINT(139.1103334 36.2974922),POINT(140.957261 37.6422006),POINT(139.2619009 36.0594871),POINT(135.5188107 34.7919888),POINT(141.7627117 40.1916885),POINT(140.7468006 41.8188869),POINT(139.7422865 36.2305774),...,POINT(139.6184164 35.402381),POINT(133.7693672 34.5091621),POINT(134.5801986 34.90361180000001),POINT(130.9395423 33.8302551),POINT(141.6892571 42.6698527),POINT(130.3518793 32.088342),POINT(141.6309688 42.6576551),POINT(138.4066959 34.9960412),POINT(140.0499266 39.3839601),POINT(130.4674218 32.9808242)
2021-07-01 01:00:00,0,3.0,,,,,,,,,...,12.0,,,1.0,14.0,4.0,,8.0,5.0,23.0
2021-07-01 02:00:00,1,3.0,,,,,,,,,...,7.0,,,1.0,5.0,5.0,,6.0,4.0,18.0
2021-07-01 03:00:00,2,4.0,,,,,,8.0,,,...,,,,2.0,4.0,3.0,,6.0,4.0,9.0
2021-07-01 04:00:00,3,11.0,,,,,,7.0,,,...,7.0,,,3.0,1.0,4.0,,2.0,5.0,1.0
2021-07-01 05:00:00,4,10.0,,,,,,4.0,,,...,8.0,,,3.0,2.0,4.0,,3.0,5.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-07-24 18:00:00,3521,7.0,,,,12.0,,2.0,,,...,11.0,,,20.0,3.0,11.0,,7.0,9.0,12.0
2022-07-24 19:00:00,3522,8.0,,,,10.0,,2.0,,,...,4.0,,,18.0,34.0,10.0,,8.0,8.0,11.0
2022-07-24 20:00:00,3523,4.0,,,,9.0,,4.0,,,...,9.0,,,25.0,1.0,10.0,,6.0,705.0,14.0
2022-07-24 21:00:00,3524,9.0,,,,13.0,,4.0,,,...,7.0,,,19.0,2.0,12.0,,7.0,303.0,18.0


3.4 Replace NaN values with 0

In [None]:
dataset = dataset.fillna(0)
dataset.head()

TimeStamp,Unnamed: 0,POINT(137.2331301 36.7425277),POINT(140.8733429 38.2932172),POINT(139.1103334 36.2974922),POINT(140.957261 37.6422006),POINT(139.2619009 36.0594871),POINT(135.5188107 34.7919888),POINT(141.7627117 40.1916885),POINT(140.7468006 41.8188869),POINT(139.7422865 36.2305774),...,POINT(139.6184164 35.402381),POINT(133.7693672 34.5091621),POINT(134.5801986 34.90361180000001),POINT(130.9395423 33.8302551),POINT(141.6892571 42.6698527),POINT(130.3518793 32.088342),POINT(141.6309688 42.6576551),POINT(138.4066959 34.9960412),POINT(140.0499266 39.3839601),POINT(130.4674218 32.9808242)
2021-07-01 01:00:00,0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,12.0,0.0,0.0,1.0,14.0,4.0,0.0,8.0,5.0,23.0
2021-07-01 02:00:00,1,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,0.0,0.0,1.0,5.0,5.0,0.0,6.0,4.0,18.0
2021-07-01 03:00:00,2,4.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,...,0.0,0.0,0.0,2.0,4.0,3.0,0.0,6.0,4.0,9.0
2021-07-01 04:00:00,3,11.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,...,7.0,0.0,0.0,3.0,1.0,4.0,0.0,2.0,5.0,1.0
2021-07-01 05:00:00,4,10.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,8.0,0.0,0.0,3.0,2.0,4.0,0.0,3.0,5.0,6.0


3.5 Convert the database into a sequential database and save it in "airDatabase.txt"

In [None]:
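class convertDenseDataframe2SequenceDatabase:
    """Convert a dense dataframe indexed by timestamp strings into a sequence database.

    Each day becomes one sequence and each hour becomes one subsequence;
    subsequences are separated by -1.
    """

    def __init__(self, df, minScore):
        # Instance attribute: a class-level list would be shared by every object.
        self.seq = []
        row = []
        last = None
        for i in df.index:
            # i[6:10] covers the last month digit and the day of 'YYYY-MM-DD hh:mm:ss',
            # so a change in this slice marks the start of a new day.
            if last != i[6:10] and row:
                row.pop()  # drop the trailing -1 before closing the sequence
                self.seq.append(row)
                row = []
            last = i[6:10]
            # Skip the first column, which holds the row counter.
            for k in df.columns[1:]:
                if df.loc[i, k] >= minScore:
                    row.append(k)
            # Close the hour's subsequence unless the hour added no items.
            if row and row[-1] != -1:
                row.append(-1)
        # Close the final day's sequence in the same way as the others.
        if row:
            row.pop()
            self.seq.append(row)

    def getSequence(self):
        return self.seq

    def save(self, outputFileName):
        with open(outputFileName, 'w') as f:
            for d in self.seq:
                for i in d:
                    # Replace the space inside 'POINT(x y)' so that each item is a single token.
                    p = str(i).replace(" ", ",")
                    f.write("%s\t" % p)
                f.write("\n")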

In [None]:
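# Re-read the raw file and fill missing readings with 0 before converting.
dataset = pd.read_csv('airPollution.csv', index_col="TimeStamp").fillna(0)
# A column (sensor location) becomes an item for an hour when its reading is at least 15.
x = convertDenseDataframe2SequenceDatabase(dataset, 15)
seq = x.getSequence()
x.save("airDatabase.txt")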


[['POINT(130.601994 32.507843)', 'POINT(132.2165637 34.1698473)', 'POINT(136.6548337 35.0051925)', 'POINT(130.2113464 32.7321302)', 'POINT(132.7326196 33.8884275)', 'POINT(130.3597423 33.5840497)', 'POINT(130.4105582 33.6051041)', 'POINT(132.7283802 33.8225127)', 'POINT(140.1138229 37.919914)', 'POINT(130.4674218 32.9808242)', -1, 'POINT(130.601994 32.507843)', 'POINT(132.2165637 34.1698473)', 'POINT(132.7326196 33.8884275)', 'POINT(130.3597423 33.5840497)', 'POINT(136.603013 36.598011)', 'POINT(130.4105582 33.6051041)', 'POINT(132.7283802 33.8225127)', 'POINT(130.4674218 32.9808242)', -1, 'POINT(130.601994 32.507843)', 'POINT(132.2165637 34.1698473)', 'POINT(130.3597423 33.5840497)', 'POINT(132.7283802 33.8225127)', -1, 'POINT(130.601994 32.507843)', 'POINT(136.6548337 35.0051925)', 'POINT(130.3597423 33.5840497)', 'POINT(132.7283802 33.8225127)', -1, 'POINT(130.601994 32.507843)', 'POINT(140.1138229 37.919914)', -1, 'POINT(130.601994 32.507843)', 'POINT(141.6892571 42.6698527)', -1, 
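
The conversion can be illustrated on a few hand-made records. The sketch below is not the class used in this tutorial: it works on plain Python lists instead of a dataframe, groups records by the full date string rather than by `i[6:10]`, and returns nested lists instead of a flat row with -1 separators. The names `to_sequences` and `toy` are made up for illustration.

```python
from itertools import groupby


def to_sequences(records, min_score):
    """Group hourly records by day and keep, per hour, the sensors at or above min_score.

    records: list of (timestamp, {sensor: value}) pairs, sorted by timestamp.
    Returns one list per day; each day is a list of hourly itemsets.
    """
    sequences = []
    # The first 10 characters of 'YYYY-MM-DD hh:mm:ss' identify the day.
    for _, day_records in groupby(records, key=lambda r: r[0][:10]):
        sequence = []
        for _, readings in day_records:
            itemset = [sensor for sensor, value in readings.items() if value >= min_score]
            if itemset:  # hours without any qualifying sensor are skipped
                sequence.append(itemset)
        if sequence:
            sequences.append(sequence)
    return sequences


toy = [
    ("2021-07-01 01:00:00", {"A": 20, "B": 3}),
    ("2021-07-01 02:00:00", {"A": 16, "B": 18}),
    ("2021-07-02 01:00:00", {"A": 2, "B": 1}),
    ("2021-07-02 02:00:00", {"A": 1, "B": 30}),
]
print(to_sequences(toy, 15))
```

With a threshold of 15, the first day yields two itemsets and the second day yields one, which mirrors how each row of the sequence database holds one day of hourly itemsets.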

4. Printing a few lines of the dataset to know its format

4.1 Sequential dataset

In [None]:
!head -2 airDatabase.txt

POINT(130.601994,32.507843)	POINT(132.2165637,34.1698473)	POINT(136.6548337,35.0051925)	POINT(130.2113464,32.7321302)	POINT(132.7326196,33.8884275)	POINT(130.3597423,33.5840497)	POINT(130.4105582,33.6051041)	POINT(132.7283802,33.8225127)	POINT(140.1138229,37.919914)	POINT(130.4674218,32.9808242)	-1	POINT(130.601994,32.507843)	POINT(132.2165637,34.1698473)	POINT(132.7326196,33.8884275)	POINT(130.3597423,33.5840497)	POINT(136.603013,36.598011)	POINT(130.4105582,33.6051041)	POINT(132.7283802,33.8225127)	POINT(130.4674218,32.9808242)	-1	POINT(130.601994,32.507843)	POINT(132.2165637,34.1698473)	POINT(130.3597423,33.5840497)	POINT(132.7283802,33.8225127)	-1	POINT(130.601994,32.507843)	POINT(136.6548337,35.0051925)	POINT(130.3597423,33.5840497)	POINT(132.7283802,33.8225127)	-1	POINT(130.601994,32.507843)	POINT(140.1138229,37.919914)	-1	POINT(130.601994,32.507843)	POINT(141.6892571,42.6698527)	-1	POINT(136.603013,36.598011)	-1	POINT(136.603013,36.598011)	-1	POINT(136.6548337,35.0051925)	POINT(

_Format:_ every row is one sequence, and its items are separated by a separator (a tab in this file). Within a row, subsequences are separated by "-1".

__Example:__

item1 item2 -1 item3 item4

item1 item4 -1 item6
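
To make the format concrete, the following minimal sketch splits a row into its subsequences. It assumes a tab separator, as in airDatabase.txt; the helper name `parse_sequence` is hypothetical and is not part of PAMI.

```python
def parse_sequence(line, sep="\t"):
    """Split one database row into a list of subsequences (lists of items)."""
    subsequences, current = [], []
    for token in line.strip().split(sep):
        if not token:
            continue  # ignore empty tokens, e.g. from a trailing separator
        if token == "-1":
            # "-1" closes the current subsequence
            if current:
                subsequences.append(current)
            current = []
        else:
            current.append(token)
    if current:
        subsequences.append(current)
    return subsequences


rows = ["item1\titem2\t-1\titem3\titem4", "item1\titem4\t-1\titem6"]
database = [parse_sequence(row) for row in rows]
print(database)
```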

## Part 1: Finding frequent sequential patterns using SPAM

### Step 1: Understanding the statistics of a sequence database

In [None]:
#import the class file
import PAMI.extras.dbStats.sequentilDatabaseStats as stats

#specify the file name
inputFile = 'airDatabase.txt'

#initialize the class
obj=stats.SequentialDatabase(inputFile,sep='\t')

#execute the class
obj.readDatabase()

### Step 2: Draw the item frequency graph and the sequence length distribution graph for more information

In [None]:
obj.printStats()

### Step 3: Choosing an appropriate *minSup* value

In [None]:
minSup = 0.4  # 0.4 is a fraction of the database size (a value between 0 and 1); minSup can also be specified as a count.
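
As a rough illustration of what this threshold means, the sketch below computes the relative support of a pattern in a toy sequence database and compares it with a fractional *minSup*. It assumes the usual definition of sequential support (the share of sequences that contain the pattern's subsequences in order); the functions `contains` and `relative_support` are hypothetical helpers, not PAMI's API, and SPAM itself uses a far more efficient bitmap representation.

```python
def contains(sequence, pattern):
    """Return True if the itemsets of `pattern` occur, in order, inside `sequence`."""
    pos = 0
    for itemset in pattern:
        # Advance to the first remaining itemset that covers this pattern itemset.
        while pos < len(sequence) and not set(itemset) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


def relative_support(database, pattern):
    """Fraction of sequences in `database` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


toy_db = [
    [["a", "b"], ["c"]],
    [["a"], ["c", "d"]],
    [["b"], ["a"]],
    [["a"], ["b"], ["c"]],
    [["c"], ["a"]],
]
pattern = [["a"], ["c"]]
support = relative_support(toy_db, pattern)
print(support, support >= 0.4)
```

Here the pattern "a, later c" occurs in three of the five sequences, so it would be reported as frequent at a threshold of 0.4 but not at 0.7.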

### Step 4: Mining frequent sequential patterns using SPAM

In [None]:
from PAMI.sequentialPattern.basic import SPAM as alg


_ap = alg.SPAM('airDatabase.txt', minSup, '\t')
_ap.mine()
_Patterns = _ap.getPatterns()
_memUSS = _ap.getMemoryUSS()
print("Total Memory in USS:", _memUSS)
_memRSS = _ap.getMemoryRSS()
print("Total Memory in RSS:", _memRSS)
_run = _ap.getRuntime()
print("Total ExecutionTime in seconds:", _run)
print("Total number of Frequent Patterns:", len(_Patterns))
_ap.save("results.txt")

### Step 5: Investigating the generated patterns

Open the patterns file and investigate the generated patterns. If the generated patterns are interesting, use them; otherwise, repeat Steps 3 and 4 with a different minSup value.

In [None]:
!head results.txt

The storage format is: _frequentPattern:support_
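
Given that format, a saved line can be split back into a pattern and its support. The sketch below assumes only the `pattern:support` layout stated above; the sample line and the helper name `parse_pattern_line` are made up for illustration, and the exact way PAMI writes the pattern part may differ.

```python
def parse_pattern_line(line):
    """Split a 'pattern:support' line into (pattern_text, support)."""
    # Split on the last colon so that the support is always the final field.
    pattern, support = line.rstrip("\n").rsplit(":", 1)
    return pattern.strip(), float(support)


# Hypothetical line, not taken from an actual results file.
sample = "item1\t-1\titem3\t:7\n"
print(parse_pattern_line(sample))
```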

## Part 2: Evaluating the SPAM algorithm on a dataset at different minSup values

### Step 1: Import the libraries and specify the input parameters

In [None]:
#Import the libraries
from PAMI.sequentialPattern.basic import SPAM as alg #import the algorithm
import pandas as pd

#Specify the input parameters
inputFile = "airDatabase.txt"
seperator='\t'
minimumSupportCountList = [0.4, 0.42, 0.44, 0.46, 0.48, 0.5]
# These thresholds are fractions of the database size (between 0 and 1). minSup can also be specified as a count.

In [None]:
result = pd.DataFrame(columns=['algorithm', 'minSup', 'patterns', 'runtime', 'memory'])
#initialize a data frame to store the results of SPAM algorithm

In [None]:
for minSupCount in minimumSupportCountList:
    obj = alg.SPAM(inputFile, minSup=minSupCount,sep=seperator)
    obj.mine()
    #store the results in the data frame
    result.loc[result.shape[0]] = ['SPAM', minSupCount, len(obj.getPatterns()), obj.getRuntime(), obj.getMemoryRSS()]

In [None]:
print(result)

In [None]:
from PAMI.extras.graph import plotLineGraphsFromDataFrame as plt

ab = plt.plotGraphsFromDataFrame(result)
ab.plotGraphsFromDataFrame()