# Sequence Mining - Frequent Subsequences

Sequence mining is the task of finding recurring patterns across data examples. Analyzing the patterns that occur most often can help uncover interesting behaviors hidden in a dataset.

## Implementation

In this notebook, we analyze the tile patterns of a Sokoban level. First, define the level as a nested list, then use the `numpy` library to convert it to an array.

In [None]:
# tile representation
BLOCK = {
    'WALL': '#',
    'BOX': '$',
    'TARGET': '*',
    'PLAYER': '@',
    'ROAD': '-'
}
# level representation
level = [
            ['#', '#', '#', '#', '#', '#'],
            ['#', '-', '-', '-', '*', '#'],
            ['#', '-', '$', '#', '-', '#'],
            ['#', '*', '-', '#', '#', '#'],
            ['#', '-', '-', '@', '-', '#'],
            ['#', '-', '$', '-', '-', '#'],
            ['#', '#', '#', '#', '#', '#'],
        ]

import numpy as np

level = np.array(level)
print(level)

Next, define the size of the window (filter) that slides over the two-dimensional array `level`. Every possible 2x2 subsequence is flattened into a string and counted in a dictionary that maps each pattern to its frequency.

In [None]:
# fh stands for filter height, fw stands for filter width
fh = 2
fw = 2
def lv2map(lv: np.ndarray, fh=2, fw=2):
    """Count every fh x fw window of lv, keyed by its flattened string."""
    counts = {}  # named counts to avoid shadowing the built-in map()
    h, w = lv.shape
    print(f"All subsequences of {fh}*{fw} in this level:")
    for i in range(h - fh + 1):
        for j in range(w - fw + 1):
            window = lv[i:i + fh, j:j + fw]
            print(window)
            # flatten row by row and join into a hashable string key
            k = ''.join(window.flatten())
            counts[k] = counts.get(k, 0) + 1
    return counts

frequency = lv2map(level, fh, fw)
print("Subsequences and their frequencies:")
print(frequency)
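
The same counting can also be written without explicit loops. The following is a minimal sketch, assuming NumPy 1.20 or later (for `sliding_window_view`); the function name `count_windows` and the variable `demo_level` are illustrative and not part of the original notebook. It should yield the same pattern counts as `lv2map`, without the printing.

```python
from collections import Counter

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def count_windows(lv: np.ndarray, fh: int = 2, fw: int = 2) -> Counter:
    """Count every fh x fw window of lv, keyed by its row-major flattened string."""
    # windows has shape (h - fh + 1, w - fw + 1, fh, fw)
    windows = sliding_window_view(lv, (fh, fw))
    # collapse each window into one row of fh * fw characters
    flat = windows.reshape(-1, fh * fw)
    return Counter(''.join(row) for row in flat)


# the Sokoban level from above, written as strings for brevity
demo_level = np.array([list(row) for row in [
    "######",
    "#---*#",
    "#-$#-#",
    "#*-###",
    "#--@-#",
    "#-$--#",
    "######",
]])

demo_counts = count_windows(demo_level)
print(demo_counts.most_common(5))
```

A `Counter` behaves like a regular dictionary, so the result can be passed to `top_n` unchanged.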

Import `defaultdict` from the `collections` library and initialize a dictionary `dct` whose default value is an empty list. Iterate over all key-value pairs of the frequency dictionary and group the subsequences by their frequency of occurrence. Finally, output the subsequences belonging to the top n frequencies.

In [None]:
from collections import defaultdict
def top_n(d, n):
    dct = defaultdict(list) 
    for k, v in d.items():
        dct[v].append(k)
    print("Show all subsequences with the same frequency:")
    print(dct.items())
    return sorted(dct.items(), reverse=True)[:n]

frequent_sub = top_n(frequency, 1)
print("Frequent subsequences:")
print(frequent_sub)

fre_lst = frequent_sub[0][1]
for pattern in fre_lst:
    # print each pattern row by row, fw characters per row
    for i in range(0, fh * fw, fw):
        print(pattern[i:i + fw])
    print()
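
Note that `top_n` ranks frequency levels rather than individual patterns, so `n = 1` can return several tied subsequences. For comparison, here is a small sketch using `collections.Counter.most_common`, which ranks individual patterns instead. The toy counts and the helper name `top_n_levels` are made up for illustration.

```python
from collections import Counter, defaultdict

# toy pattern counts, not taken from any real level
toy_counts = Counter({'##--': 2, '#-#-': 2, '###-': 1, '##-*': 1})

# most_common(n) returns the n highest-count (pattern, count) pairs,
# so a tie at the top is cut off after n patterns
print(toy_counts.most_common(1))


def top_n_levels(counts, n):
    """Group patterns by frequency and keep the n highest frequency levels."""
    by_freq = defaultdict(list)
    for pattern, freq in counts.items():
        by_freq[freq].append(pattern)
    return sorted(by_freq.items(), reverse=True)[:n]


# grouping by frequency keeps every pattern that shares the top count
print(top_n_levels(toy_counts, 1))
```

Which variant fits better depends on whether tied patterns should all be reported or only a fixed number of patterns is wanted.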

Let's try a more complex example: a Super Mario Bros level. The text file `Mario.txt` contains the tile representation of the level rendered below.

![smb_level](../examples/Mario_Render.png)

In [None]:
read_level = []
with open('../examples/Mario.txt', 'r') as f:
    for line in f:
        read_level.append(line.strip())
        print(line.strip())

In [None]:
smb_level = []
for line in read_level:
    smb_level.append(list(line))
print(smb_level)
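
`lv2map` unpacks `lv.shape` into a height and a width, so every row of the text file must have the same length for the conversion to a 2D array to work. The sketch below shows one possible guard against ragged rows. The helper name `rows_to_array`, the choice of `'-'` as the padding tile, and the toy rows are all assumptions for illustration, not something taken from `Mario.txt`.

```python
import numpy as np


def rows_to_array(rows, pad='-'):
    """Convert a list of strings into a 2D character array, right-padding short rows."""
    rows = [r for r in rows if r]  # drop empty lines
    width = max(len(r) for r in rows)
    return np.array([list(r.ljust(width, pad)) for r in rows])


# toy rows of unequal length (hypothetical tiles)
demo_rows = ["--?--", "---", "XXXXX"]
demo_arr = rows_to_array(demo_rows)
print(demo_arr.shape)
print(demo_arr)
```

If the file is already rectangular, the padding is a no-op and the result matches `np.array([list(line) for line in read_level])`.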

In [None]:
smb_level = np.array(smb_level)
smb_fh = 6
smb_fw = 6
smb_frequency = lv2map(smb_level, smb_fh, smb_fw)
print("Subsequences and their frequencies:")
print(smb_frequency)

smb_sub = top_n(smb_frequency, 5)
print("Frequent subsequences:")
print(smb_sub)

# Since sky occupies a large area of the level, the top entry is likely plain sky,
# so we print the subsequences with the second-highest frequency in matrix form
fre_sequence = smb_sub[1][1]
print("Subsequences with the second-highest frequency:")
print(fre_sequence)
print("In matrix form:")
for s in fre_sequence:
    # one row of the window per line, smb_fw characters each
    for i in range(0, smb_fw * smb_fh, smb_fw):
        print(s[i:i + smb_fw])
    print()
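
Instead of manually picking index 1 to skip sky-only windows, uniform patterns can be filtered out before ranking. This is a minimal sketch, assuming that a window made of a single repeated tile is uninteresting; the helper name `drop_uniform` and the toy counts are hypothetical.

```python
def drop_uniform(counts):
    """Return a copy of counts without patterns that consist of one repeated tile."""
    return {p: c for p, c in counts.items() if len(set(p)) > 1}


# toy 2x2 pattern counts (hypothetical tiles and numbers)
toy = {'----': 40, '--XX': 7, 'XXXX': 5, '-?--': 3}
filtered = drop_uniform(toy)
print(filtered)
```

The filtered dictionary can be passed to `top_n` in place of the raw frequency dictionary, for example `top_n(drop_uniform(smb_frequency), 5)`.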