## Creation of the FPTA 

To make predictions, we first need to build a frequency prefix tree acceptor (FPTA): a prefix tree that represents every input sequence and records how often each prefix occurs.

We start by creating the *TrieNode* class, which represents a node of our tree. It has the following attributes:
- a string of characters 'text', which is the label of the node (the prefix leading to it, or "λ" for the root).
- an 'is_word' boolean initialized to False. It becomes True when an inserted word ends at this node.
- a list 'children' of `[symbol, frequency, node]` entries. For every node, its first entry `['#', 0, self]` counts the words ending at that node (it complements the boolean presented above).
- an integer 'state', which is the id of the node. It is unique for each node of the tree.
- an integer 'frequency', incremented each time an insertion passes through the node.
- an 'is_displayed' boolean initialized to False. It is used when displaying the tree: if there are loops, each node is displayed only once.
- a list 'previous' of predecessors, which grows during insertion. For the root node, we consider that its predecessor is itself.

For the *PrefixTree* class, which represents the tree as such, we have:
- a *TrieNode* which represents the root node
- the class functions that we will detail below

### Class functions of *PrefixTree*.

#### Insert function:
It inserts a word (the `word` parameter), starting from a given node (the `current` parameter, the root by default). We enumerate the word, taking each character together with its index. If the character is not yet among the children of the current node, we create a child node labelled with the prefix read so far and give the transition a frequency of 1. If a child with the same character already exists, we increment the frequency of that transition by 1. In both cases we then move to the child. When the word is finished, we set 'is_word' to True on the last node and increment its word counter by 1.
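The counting logic of the insertion can be sketched on its own. This is a minimal illustration, not the notebook's classes: it assumes plain dictionaries as nodes, and the names `new_node`, `ends` and the `symbol -> [frequency, node]` layout are hypothetical choices made for brevity.

```python
# Simplified stand-in for TrieNode / PrefixTree (assumption: dict-based nodes).
def new_node(text=""):
    # children maps a symbol to [frequency of the transition, child node]
    return {"text": text, "ends": 0, "children": {}}

def insert(root, word):
    node = root
    for i, char in enumerate(word):
        if char not in node["children"]:
            # First time this symbol is seen here: create a child labelled with the prefix
            node["children"][char] = [0, new_node(word[:i + 1])]
        entry = node["children"][char]
        entry[0] += 1      # one more sequence goes through this transition
        node = entry[1]
    node["ends"] += 1      # one more sequence ends at this node

root = new_node()
for w in ["ab", "ab", "ac"]:
    insert(root, w)

a_freq, a_node = root["children"]["a"]
print(a_freq)                               # 3
print(a_node["children"]["b"][0])           # 2
print(a_node["children"]["b"][1]["ends"])   # 2
```

All three words share the transition `a`, so its frequency is 3, while only two of them continue with `b` and end there.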

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Trie_example.svg/1200px-Trie_example.svg.png" alt="Drawing" style="width: 200px;"/>

#### starts_with function:
It returns all the words that start with the given prefix, searching from the node passed as the `current` parameter. We first follow the prefix character by character, then visit the children of the reached node recursively. Every node with 'is_word' == True contributes its 'text', which holds the complete word.
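The two phases of the search (follow the prefix, then collect the words below it) can be sketched as follows. This is a simplified illustration that assumes dict-based nodes and a hypothetical `build` helper; it uses an explicit stack instead of recursion.

```python
def build(words):
    # Hypothetical helper: builds a dict-based trie where each node stores its full prefix
    root = {"text": "", "is_word": False, "children": {}}
    for word in words:
        node = root
        for i, char in enumerate(word):
            node = node["children"].setdefault(
                char, {"text": word[:i + 1], "is_word": False, "children": {}})
        node["is_word"] = True
    return root

def starts_with(root, prefix):
    # Phase 1: walk down the trie along the prefix
    node = root
    for char in prefix:
        if char not in node["children"]:
            return []
        node = node["children"][char]
    # Phase 2: collect every complete word in the subtree
    words, stack = [], [node]
    while stack:
        current = stack.pop()
        if current["is_word"]:
            words.append(current["text"])
        stack.extend(current["children"].values())
    return words

root = build(["car", "cart", "cat", "dog"])
print(sorted(starts_with(root, "ca")))  # ['car', 'cart', 'cat']
print(starts_with(root, "x"))           # []
```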

#### Size function:
It returns the number of nodes present in the tree. We traverse the tree and count each node once.
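Counting each node once matters because the structure may contain loops once states have been merged. A minimal sketch of such a count, assuming a hypothetical adjacency dictionary `graph` rather than the notebook's node objects:

```python
def count_nodes(start, successors):
    # Visit every reachable node exactly once, even when the graph has cycles
    visited = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                stack.append(nxt)
    return len(visited)

# Node 1 has a self-loop and node 2 points back to node 0
graph = {0: [1, 2], 1: [1, 3], 2: [0], 3: []}
print(count_nodes(0, graph))  # 4
```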

#### Display function: 
We traverse the tree and print each node.

In [1]:
class TrieNode:
    _COUNTER_NODE = 1
    def __init__(self, text = "λ"):
        self.text = text
        self.is_word = False
        self.children = list()
        self.children.append(['#', 0, self]) #1st element:SYMB 2nd element:FREQUENCY 3rd element:CHILDREN
        self.state = TrieNode._COUNTER_NODE
        self.frequency = 0
        TrieNode._COUNTER_NODE += 1
        self.previous = []

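    def get_frequency(self):
        '''
        Number of sequences leaving this node
        (the '#' end-of-word entry is excluded).
        '''
        freq = 0
        for child in self.children[1:]:
            freq += child[1]
        return freq

    def get_frequency2(self):
        '''
        Number of sequences reaching this node,
        including those that end here (the '#' entry).
        '''
        freq = 0
        for child in self.children:
            freq += child[1]
        return freq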

    def __str__(self):
        return 'symb:{} state:{} -> children:{}'.format(self.text, self.state, self.children)

class PrefixTree:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, current = None):
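        '''
        Inserts a given word in the trie, starting from current
        (the root by default).
        '''
        if not current:
            current = self.root
        # The predecessor starts as the start node itself, so that pre is
        # always defined, even when current is passed explicitly
        pre = current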
        for i, char in enumerate(word):
            current.frequency += 1
            if char not in [x[0] for x in current.children]:
                prefix = word[0:i+1]
                current.children.append([char, 1, TrieNode(prefix)])
                if(pre not in current.previous):
                    current.previous.append(pre)
                pre = current
                current = current.children[-1][2]

            else:
                for child in current.children:
                    if child[0] == char:
                        child[1] += 1
                        if(pre not in current.previous):
                            current.previous.append(pre)
                        pre = current
                        current = child[2]

        current.is_word = True   
        current.children[0][1] += 1
        current.frequency += 1
        
    def starts_with(self, prefix, current = None):
        '''
        Returns a list of all words beginning with the given prefix, or
        an empty list if no words begin with that prefix.
        '''
        words = list()
        if not current:
            current = self.root
        for char in prefix:
            if char not in [x[0] for x in current.children]:
                return list()
            for child in current.children:
                  if child[0] == char:
                    current = child[2]
        self.__child_words_for(current, words)
        return words
    
    def __child_words_for(self, node, words):
        '''
        Private helper function. Cycles through all children
        of node recursively, adding them to words if they
        constitute whole words (as opposed to merely prefixes).
        '''
        if node.is_word:
            words.append(node.text)
        for child in node.children[1:]:
            self.__child_words_for(child[2], words)
    
    def size(self, current = None, visited = None):
        '''
        Returns the size of this prefix tree, defined
        as the total number of nodes in the tree.
        '''
        # By default, get the size of the whole trie, starting at the root
        if not current:
            current = self.root
        
        if not visited:
            visited = []
            visited.append(current)
            
        count = 1
        
        for child in current.children[1:]:
            if(child[2] not in visited):
                visited.append(child[2])
                count += self.size(child[2], visited)
                
        return count

    def leaf(self, current = None, visited = None):
        '''
        Returns the number of nodes at which at least one word ends.
        '''
        if not current:
            current = self.root
        if not visited:
            visited = [current]
        return self.__leaves_for(current, visited)

    def __leaves_for(self, current, visited):
        # Integers are passed by value, so the count is returned
        # instead of being incremented through a parameter
        count = 1 if current.is_word else 0
        for child in current.children[1:]:
            if(child[2] not in visited):
                visited.append(child[2])
                count += self.__leaves_for(child[2], visited)
        return count
    
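    def transform2proba(self, current = None):
        '''
        Converts the frequencies stored in the children lists into probabilities.
        '''
        if not current:
            current = self.root

        freq = current.get_frequency2()
        for child in current.children:
            # A non-integer value means this node has already been converted
            if not isinstance(child[1], int) or freq == 0:
                return
            child[1] = child[1]/freq
            # Skip the '#' entry and self-loops, which point back to this node
            if child[2].state != current.state:
                self.transform2proba(child[2])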
    
    def predict(self, sequence, method = '', current = None):
        if not current:
            current = self.root
         
        for char in sequence:
            for child in current.children:
                if child[0] == char:
                    current = child[2]
        
        probas = [[x[0], x[1]] for x in current.children]
        probas = sorted(probas, key = lambda x: x[1], reverse = True)
        
        sum_ = 0
        items = []
        for subproba in probas:
            sum_ += subproba[1]
            items.append((subproba[0], subproba[1]))
            if(sum_ > 0.5):
                break   
        if method == 'fuzzy':
            return items
        else:
            return probas[0][0]

    def display(self):
        '''
        Prints the contents of this prefix tree.
        '''
        print('====================================================================================')
        self.__displayHelper(self.root)
        print('====================================================================================\n')
    
    def __displayHelper(self, current, visited = None):
        '''
        Private helper for printing the contents of this prefix tree.
        '''
        if visited is None:
            visited = []
        # Mark the node as soon as it is printed so that loops do not print it again
        visited.append(current)

        print(current)

        for child in current.children[1:]:
            if child[2] not in visited:
                self.__displayHelper(child[2], visited)

## Creation of the DFA with the FPTA:

To transform a prefix tree into a frequency automaton (and thus a probabilistic one), a state-merging algorithm is used. The one implemented from scratch below is ALERGIA, introduced by Carrasco and Oncina and presented in detail in Colin de la Higuera's work on grammatical inference:

The algorithm takes as parameters the sequences and an alpha parameter in (0, 1], which controls how strict the compatibility test is.
We create the tree and then we initialize two lists, red and blue:
- Red stores all nodes which have already been defined as representative nodes and will be included in the final output model.
- Blue stores the nodes that have not yet been tested.

The purpose of the algorithm is to keep only the red states. At the beginning, RED contains only the root node, while BLUE contains the immediate successors of the initial node, i.e. the child nodes of the root. At each iteration of the outer loop, the first node qb in BLUE is chosen. If there is a node qr in RED compatible with qb, then qb and its successor nodes are merged with qr and the corresponding successor nodes of qr, respectively, as described in the Merge&Fold section. If qb is not compatible with any of the states of RED, it is promoted: it moves to RED, and its children move to BLUE.
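The control flow of this loop can be sketched independently of the trie. The example below is a toy under stated assumptions: states are strings in a hypothetical adjacency dictionary `graph`, compatibility is decided by an arbitrary `kind` label instead of a statistical test, and `toy_merge` only redirects arcs (no frequencies, no fold).

```python
def red_blue(root, successors, compatible, merge):
    """Generic red/blue loop; returns the list of RED states."""
    red = [root]
    blue = [s for s in successors(root) if s != root]
    while blue:
        qb = blue.pop(0)                    # first BLUE state
        qr = next((r for r in red if compatible(r, qb)), None)
        if qr is not None:
            merge(qr, qb)                   # qb disappears into qr
            candidates = successors(qr)
        else:
            red.append(qb)                  # promotion
            candidates = successors(qb)
        for s in candidates:                # new frontier states become BLUE
            if s not in red and s not in blue:
                blue.append(s)
    return red

# Toy data (hypothetical): two states are "compatible" when they share a kind
graph = {"q0": ["q1", "q2"], "q1": ["q3"], "q2": [], "q3": []}
kind = {"q0": "A", "q1": "B", "q2": "B", "q3": "A"}

def toy_merge(qr, qb):
    # Redirect every arc pointing to qb toward qr, dropping duplicate arcs
    for state, succ in graph.items():
        graph[state] = list(dict.fromkeys(qr if s == qb else s for s in succ))
    # Attach the successors of qb to qr and delete qb
    graph[qr] = list(dict.fromkeys(graph[qr] + graph.pop(qb)))

red_states = red_blue("q0", lambda s: graph[s],
                      lambda qr, qb: kind[qr] == kind[qb], toy_merge)
print(red_states)  # ['q0', 'q1']
print(graph)       # {'q0': ['q1'], 'q1': ['q0']}
```

Here q1 is promoted, q2 is merged into q1, and q3 is merged into the root, which creates a loop between the two remaining states.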

#### Merge&Fold operation:
Given the DFA, the merge operation takes two states q and q′, where q is a RED state and q′ is a BLUE state found compatible with q by the Hoeffding test. The Merge operation consists of two steps. First, if q′ is a final state, then q becomes one as well (if it was not already) and the number of sequences ending in q′ is added to the number of sequences ending in q. Second, the incoming arcs of q′ are redirected to q. If such an arc already exists, the number of sequences passing through the redirected arc is added to the existing arc. The Fold operation then recursively merges the successor nodes of q′ into the corresponding successor nodes of q.
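The Hoeffding test compares two observed proportions $f_1/n_1$ and $f_2/n_2$ and accepts them as equal when $\left|\frac{f_1}{n_1}-\frac{f_2}{n_2}\right| < \sqrt{\tfrac{1}{2}\ln\tfrac{2}{\alpha}}\left(\tfrac{1}{\sqrt{n_1}}+\tfrac{1}{\sqrt{n_2}}\right)$. A minimal sketch of this bound as it is usually stated for ALERGIA follows; the function name `hoeffding_compatible` is a hypothetical choice, and both counts `n1` and `n2` are assumed to be strictly positive.

```python
import math

def hoeffding_compatible(f1, n1, f2, n2, alpha):
    """Return True when f1/n1 and f2/n2 are close enough to be considered equal."""
    # The bound shrinks as n1 and n2 grow: more data allows a stricter comparison
    bound = math.sqrt(0.5 * math.log(2 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) < bound

print(hoeffding_compatible(50, 100, 45, 90, alpha=0.05))   # True
print(hoeffding_compatible(90, 100, 10, 100, alpha=0.05))  # False
```

In the first call both proportions are 0.5, so the states are compatible. In the second, the gap of 0.8 exceeds the bound of roughly 0.27.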

In [2]:
import math

def Alergia(sequences, alpha):
    print('Starting building corresponding Trie \n')
    trie = PrefixTree()
    for subseq in sequences:
        trie.insert(subseq)
    print(trie.size())
    print(trie.leaf())
    #initial = trie.size()
    red = []
    red.append(trie.root)
    blue = []
    for x in trie.root.children[1:]:
        if(x[2] not in blue):
            blue.append(x[2])
    t0 = 0.5
    
    print('Running Alergia on trie \n')
    while(len(blue) > 0):
        qb = blue[0]
        blue.remove(qb)
        promote = True
        if(qb.get_frequency() >= t0):
            for qr in red:
                if(AlergiaCompatible(trie, qr, qb, alpha)):
                    #print('Merge accepted...')
                    #print(qr)
                    #print(qb)
                    trie = MergeFold(trie, qr, qb)
                    promote = False
                    for child in qr.children[1:]:    
                        if(child[2] not in blue and child[2].state != qr.state and child[2] not in red):
                            blue.append(child[2])
                    break
        
            if(promote == True):
                #print('No merge possible...')
                red.append(qb)
                for child in qb.children[1:]:
                    if(child[2] not in blue and child[2] not in red):
                        blue.append(child[2])
        else:
            continue
    
    trie.transform2proba()
    print('Algo done')
    return trie

def AlergiaCompatible(trie, qr, qb, alpha):
    # Compares the proportion of sequences ending in qr with the proportion ending in qb
    return AlergiaTest(qr.children[0][1], qr.get_frequency(), qb.children[0][1], qb.get_frequency(), alpha)
  
def AlergiaTest(f1, n1, f2, n2, alpha):
    # Hoeffding bound: |f1/n1 - f2/n2| < sqrt(1/2 * ln(2/alpha)) * (1/sqrt(n1) + 1/sqrt(n2))
    gamma = math.fabs(f1/n1 - f2/n2)
    root = math.sqrt(0.5*math.log(2/alpha))
    summ = (1/math.sqrt(n1)) + (1/math.sqrt(n2))
    hoeffding = root * summ
    return gamma < hoeffding

def MergeFold(trie, qr, qb): # OK, works
    q = qb.previous[0]
    if qb.text[-1] not in qr.text:
        qr.text = qr.text+ ', ' + qb.text[-1]
    for child in q.children:
        if(child[2].state == qb.state):
            child[2] = qr 
    qr.children[0][1] += qb.children[0][1]
    words = trie.starts_with('', current = qb)
    for word in words:
        trie.insert(word)
    return trie