# RNN for An Alignment Cost-Based Classification of Log Traces

> Paper : <u>An Alignment Cost-Based Classification of Log Traces </u><br>
> Date : June 2020 <br>
> Authors : <i>Mathilde Boltenhagen, Benjamin Chetioui, and Laurine Huber  </i> <br>

This notebook is organized as follows: <br> <br>
<b>0. Fitness function </b> 
- The lower-bound fitness is one of the contributions of the paper; please see the paper for more details. <br>

<b>1. Preprocessing the data:</b>
 - A function ```cleanDataForRNN``` contains all the preprocessing steps. It reads the file and creates the sequences and the targets. 
 
<b>2. Model: </b> 
   - The model is a bi-RNN with different layers. 
   
<b>3. Cross-Validation of the method : </b>
 - A function ```runKFoldForRNN``` runs a K-fold cross-validation to fit and test the model on the sequences.
 
<b> 4. Train and Test :</b>
- Train and test for several m_AC values.

## 0. Fitness

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd 
import numpy as np
np.random.seed(0)
import time

from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation, Embedding
from keras.preprocessing import sequence
from keras.layers import Bidirectional

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

from statistics import mean

Using TensorFlow backend.


The commented-out lines were used when the frequency of each variant was included in the fitness computation.
We finally decided to compute the fitness and its lower bound on the variants only, which keeps the approach easier to understand.

In [2]:
def fitness(packageForFitness, minRunLength):
    '''
    This function computes the fitness over all the traces. 
    @packageForFitness: DataFrame with the columns `lengths` (trace lengths) and `realCosts` (real alignment costs)
    @minRunLength: the length of the shortest run in the alignment dataset
    '''
    sumTraceFitness = 0
    totTraces = 0 
    for i in packageForFitness.index:
        sumTraceFitness += (1 - (packageForFitness.realCosts[i] / ( packageForFitness.lengths[i]  + minRunLength )))#*packageForFitness.freqs[i]
        #totTraces += packageForFitness.freqs[i]
        totTraces += 1
    return sumTraceFitness / totTraces

def LB_fitness(packageForFitness, minRunLength, m_AC, indices):
    '''
    This function computes the lower bound of the fitness given in the paper. 
    @packageForFitness: DataFrame with the column `lengths` (trace lengths)
    @minRunLength: the length of the shortest run in the alignment dataset
    @m_AC: maximal alignment cost, needed for the lower-bound formula
    @indices: indices of the positive traces; the sum runs over them only, while the average is taken over all the traces. 
    '''
    sumTraceFitness = 0
    totTraces = 0 
    for i in indices:
        sumTraceFitness += (1 - ((m_AC) / ( packageForFitness.lengths[i] + minRunLength )))#*packageForFitness.freqs[i]
    for i in packageForFitness.index:
        #totTraces += packageForFitness.freqs[i]
        totTraces += 1
    return sumTraceFitness / totTraces
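
As a small worked example of the two formulas above, the following sketch recomputes the fitness and its lower bound on three toy traces in plain Python. The values and the helper names `toy_fitness` and `toy_lb_fitness` are assumptions made for illustration; they are not taken from the paper's datasets. A trace is treated as positive when its real cost is at most `m_AC`.

```python
# Toy data (hypothetical values, for illustration only).
lengths = [4, 6, 5]       # trace lengths
real_costs = [0, 3, 1]    # real alignment costs
min_run_length = 2        # length of the shortest run
m_AC = 1                  # maximal alignment cost


def toy_fitness(lengths, costs, min_run_length):
    """Average of 1 - cost / (length + min_run_length) over all traces."""
    terms = [1 - c / (n + min_run_length) for n, c in zip(lengths, costs)]
    return sum(terms) / len(terms)


def toy_lb_fitness(lengths, positives, min_run_length, m_AC):
    """Sum over the positive traces only, averaged over all traces."""
    total = sum(1 - m_AC / (lengths[i] + min_run_length) for i in positives)
    return total / len(lengths)


# Positive traces are those whose real cost does not exceed m_AC.
positives = [i for i, c in enumerate(real_costs) if c <= m_AC]

real = toy_fitness(lengths, real_costs, min_run_length)
lower_bound = toy_lb_fitness(lengths, positives, min_run_length, m_AC)
print(round(real, 3), round(lower_bound, 3))
```

On this toy input the lower bound (about 0.563) stays below the real fitness (about 0.827): negative traces add nothing to the sum but are still counted in the average.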

## 1. Preprocessing the data

This function takes as input an alignment dataset and its Maximal Alignment Cost, and cleans the data in order to get the sequences and the target classes. Please see the definition of Maximal Alignment Cost Classification for more details on the target classes. The sequences are sequences of indices taken from a dictionary that stores one index per word (index_to_word is the inverse mapping). 
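
The labelling rule can be sketched on single values as follows. This is a simplified illustration in plain Python, assuming (as in the function below) that raw costs are divided by 10000 and that a trace is positive when the resulting cost is at most `m_AC`. `label_trace` is a hypothetical helper, not part of the notebook.

```python
def label_trace(raw_cost, m_AC):
    """Return the one-hot target: [0, 1] for positive, [1, 0] for negative."""
    # Keep only the part of the cost counted in units of 10000.
    alignment_cost = int(raw_cost / 10000)
    # Costs 0..m_AC fall in bucket 0, which is the positive class.
    is_positive = alignment_cost // (m_AC + 1) == 0
    return [0.0, 1.0] if is_positive else [1.0, 0.0]


print(label_trace(50003, 5))  # cost 5 <= 5: positive
print(label_trace(60000, 5))  # cost 6 > 5: negative
```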

In [3]:
# must be global variables: they are reused when encoding the mock (fake) data
word_to_index = {}
max_len = 0

In [4]:
def cleanDataForRNN(dataFile,m_AC, word_to_index_already_exists=None):
    '''
    Reads the file (1), specifies the target classes (2) and prepares the sequences of activities (3). 
    Finally, it prepares `packageForFitness` and `minLengthRun` for computing the fitness. 
    @dataFile: (String) filename of the alignment dataset
    @m_AC: (int) maximal alignment cost of the classifier
    @word_to_index_already_exists: if not None, the global `word_to_index` and `max_len` from a previous call are reused
    '''
    # ---- (1) read the file 
    data = pd.read_csv(dataFile,sep = ";", 
                   names = ["traces","tracesWithMoves","runs","runsWithMoves","costs","frequencies"])
    
    # ---- (2) create the positive and negative target depending on the m_AC parameter
    # the alignment cost that interests us is greater than 10000 (other costs are just silent moves)
    # temporarily mark the positive traces as "tmp_pos", to set them to 1 later
    y = (((data["costs"] / 10000).astype(int)) / (m_AC+1)).astype(int)
    max_y = y.max()
    y = y.replace(0,"tmp_pos")
    y = y.replace(range(1,max_y + 1), 0)
    y = y.replace("tmp_pos",1)

    # two columns are required for a binary classification in a RNN
    y = np.eye(2)[y.to_numpy().reshape(-1)]
    
    # ---- (3) prepare the sequences of activities 
    traces_to_matrix = data.traces.str.split(":::",expand=True,)
    if word_to_index_already_exists is None:
        global max_len
        number_of_traces, max_len = traces_to_matrix.shape
    
    # flatten the matrix into a single series
    traces_to_serie = pd.concat([traces_to_matrix[i] for i in range(0,traces_to_matrix.shape[1])], axis=0, 
                                          ignore_index=True, sort=False)
    # from the series, it is easy to get the unique words
    index_to_word = list(filter(None,(traces_to_serie).unique()))
    number_of_activities = len(index_to_word)

    # a simple dictionary: word -> index, starting at 1 (0 is kept for padding)
    if word_to_index_already_exists is None:
        global word_to_index 
        word_to_index = { index_to_word[i]:i+1 for i in range(0, len(index_to_word) ) }
    
    # loop over traces 
    x = np.zeros((data.traces.shape[0],max_len))   
    for i in range(data.traces.shape[0]):
        # Split the i-th trace into activities, dropping the last element of the split.
        trace = data.traces[i].split(":::")[:-1]
        # Set the (i, j)-th entry of x to the dictionary index of the j-th activity.
        for j, w in enumerate(trace):
            x[i, j] = word_to_index[w]
    
    # for fitness computation
    minLengthRun= len(data.runs.str.split(":::").min())
    lengths = data.traces.str.split(":::",expand=False,).str.len()-1
    realCosts = (data["costs"] / 10000).astype(int)
    packageForFitness = pd.DataFrame({"lengths":lengths, "realCosts": realCosts, "freqs": data.frequencies})         
    
    return x, y, number_of_activities, max_len, index_to_word, packageForFitness, minLengthRun, m_AC   
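
The encoding step (3) can be illustrated on two toy traces. The sketch below is a simplified plain-Python version written for illustration: `encode_traces` is a hypothetical helper, the vocabulary is built trace by trace (the function above builds it column by column, so index order may differ), and the padding length is the longest trace.

```python
def encode_traces(traces, sep=":::"):
    """Map activity names to integer indices (from 1) and right-pad with 0."""
    # Each trace ends with the separator, so the last split element is dropped.
    split = [t.split(sep)[:-1] for t in traces]
    vocab = {}
    for activities in split:
        for a in activities:
            # Index 0 is reserved for padding, so indices start at 1.
            vocab.setdefault(a, len(vocab) + 1)
    max_len = max(len(activities) for activities in split)
    encoded = [
        [vocab[a] for a in activities] + [0] * (max_len - len(activities))
        for activities in split
    ]
    return encoded, vocab


encoded, vocab = encode_traces(["a:::b:::", "a:::c:::b:::"])
print(encoded)
print(vocab)
```

Reserving index 0 for padding is why the embedding layer needs one more input dimension than the number of activities.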

In [5]:
# example of use
x, y, number_of_activities, max_len, index_to_word, packageForFitness, minLengthRun, m_AC  = cleanDataForRNN("alignments/A_2012_im.csv",5)
x, y

(array([[1., 2., 3., ..., 0., 0., 0.],
        [1., 2., 4., ..., 0., 0., 0.],
        [1., 2., 4., ..., 0., 0., 0.],
        ...,
        [1., 2., 5., ..., 0., 0., 0.],
        [1., 2., 5., ..., 0., 0., 0.],
        [1., 2., 4., ..., 0., 0., 0.]]), array([[0., 1.],
        [0., 1.],
        [0., 1.],
        ...,
        [0., 1.],
        [0., 1.],
        [0., 1.]]))

## 2. Model
Find below a function that creates the model instance depending on the size of the dictionary and the max length of a sequence. 

In [6]:
def myModel(max_len, number_of_activities, verbose=True):
    """    
    Arguments:
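    max_len -- length of the padded input sequences; the input shape is (max_len,)
    number_of_activities -- size of the vocabulary, used as the input dimension of the embedding
    verbose -- if True, print the model summary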
    
    Returns:
    model -- a model instance in Keras
    """
    # Input of the model, type is dtype 'int32' (as it contains indices of activities, which are integers).
    indices = Input(shape= (max_len,), dtype='int32')
    # Embedding layer: map each activity index to a 15-dimensional vector
    embeddings = Embedding(input_dim= number_of_activities, output_dim= 15, input_length=max_len)(indices)
    # Propagate the embeddings through a Bi-LSTM layer with a 50-dimensional hidden state per direction
    # Return the whole sequence of outputs, to feed the next LSTM layer
    X = Bidirectional(LSTM(units = 50, return_sequences=True, go_backwards=True),merge_mode='concat')(embeddings)
    # Add dropout with a probability of 0.5 to help with overfitting
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with a 50-dimensional hidden state
    X = LSTM(units = 50, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer with one unit per predicted class
    X = Dense(units=2)(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    
    # Create Model instance which converts indices into X.
    model = Model(inputs=indices, outputs=X)
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    if verbose:
        print(model.summary())
        
    return model

In [7]:
# example of use
model = myModel(max_len, number_of_activities)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 176)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 176, 15)           360       
_________________________________________________________________
bidirectional_1 (Bidirection (None, 176, 100)          26400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 176, 100)          0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 50)                30200     
_________________________________________________________________
dropout_2 (Dropout)  
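
The parameter counts in the summary can be checked by hand. The sketch below uses the usual LSTM parameter formula, 4 * (units * (input_dim + units) + units), and assumes a vocabulary size of 24, which is inferred from the 360 embedding parameters (360 / 15) rather than stated in the notebook.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


vocab_size = 24      # assumption: inferred from 360 / 15 in the summary
embedding_dim = 15
units = 50

embedding = vocab_size * embedding_dim           # one vector per index
bi_lstm = 2 * lstm_params(embedding_dim, units)  # forward + backward LSTM
second_lstm = lstm_params(2 * units, units)      # input is the concatenation
dense = units * 2 + 2                            # weights + biases, 2 classes

print(embedding, bi_lstm, second_lstm, dense)
```

These values match the 360, 26400 and 30200 shown for the embedding, bidirectional and LSTM layers.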

## 3.  Cross-Validation of the method 

We run a cross-validation with ```KFold``` from sklearn. The outputs are the average accuracy and the average loss (binary cross-entropy) over the folds. The loss is the ```binary_crossentropy``` function of <b> Tensorflow </b>. 

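- Note that the RNN is also meant to hold out a validation split of 33% of the training set. In Keras this corresponds to passing `validation_split=0.33` to `model.fit`; the `model.fit` call shown in `runKFoldForRNN` does not set it. 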

- In order to see whether positive or negative items get better results, we report the results for `all` the test items, for the `positive` items only and for the `negative` items only. 

Set `verbose=0` for silent mode.
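
To make the fold construction concrete, here is a minimal plain-Python sketch of how consecutive folds can be built without shuffling, which is what `KFold` does by default. `kfold_indices` is a hypothetical helper written for illustration; the notebook itself relies on sklearn.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra item.
    fold_sizes = [n_samples // n_splits] * n_splits
    for k in range(n_samples % n_splits):
        fold_sizes[k] += 1
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(kfold_indices(10, 3))
for train, test in folds:
    print("test:", test)
```

Each item appears in exactly one test fold, so the averaged metrics cover the whole training set once.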

In [8]:
def accLossPercentageForRNN(model, x, y, indices_of_test, accArr, lossArr, percentageArr=None,  classToTest=None, verbose=2,freqs=None):
    '''
    This function fills the result lists accArr, lossArr and percentageArr for all the test items, or only for the 
    negative items (classToTest=0), or only for the positive items (classToTest=1). percentageArr is optional because it 
    is not needed for the entire test set, i.e., when no class is specified. 
    The lists are filled in place, one entry per call. The function returns the current accuracy when 
    classToTest is None, and the indices of the selected items otherwise.
    Params:
    @model: a trained model
    @x: the sequences
    @y: the one-hot target classes
    @indices_of_test: the indices of the test items in x and y
    @accArr: a list of the previous accuracies, or an empty list
    @lossArr: a list of the previous losses, or an empty list
    @percentageArr: a list of the previous proportions, or an empty list
    @classToTest: 1 or 0 for positive or negative; selects the test items that belong to this class
    '''
    if classToTest is not None:
        indices = [i for i in indices_of_test if y[i][1]==classToTest]
        if len(indices)>0:
            loss, acc = model.evaluate(x[indices], y[indices],verbose=verbose)
            accArr.append(acc)
            lossArr.append(loss)
        #percentageArr.append(freqs[indices].sum()/freqs[indices_of_test].sum())
        percentageArr.append(len(indices)/len(indices_of_test))
        return indices
    else :
        loss, acc = model.evaluate(x[indices_of_test], y[indices_of_test],verbose=verbose)
        accArr.append(acc)
        lossArr.append(loss)
        return acc
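
The per-class selection used above can be sketched on a toy one-hot target. The helper `split_by_class` and the data are assumptions made for illustration. Note that the value stored in `percentageArr` is a proportion between 0 and 1.

```python
def split_by_class(y, test_indices):
    """Split test indices into positives (y[i][1] == 1) and negatives."""
    positives = [i for i in test_indices if y[i][1] == 1]
    negatives = [i for i in test_indices if y[i][1] == 0]
    return positives, negatives


# Toy one-hot targets: column 1 is the positive class.
y_toy = [[0, 1], [1, 0], [0, 1], [0, 1]]
test_indices = [0, 1, 3]

positives, negatives = split_by_class(y_toy, test_indices)
share_pos = len(positives) / len(test_indices)
share_neg = len(negatives) / len(test_indices)
print(positives, negatives, round(share_pos, 2), round(share_neg, 2))
```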

In [9]:
def runKFoldForRNN(numberOfFold, x, y, packageForFitness,minLengthRun, m_AC,
                   max_len, number_of_activities, verbose=2,epochs = 10, batch_size = 50):
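    '''
    This function trains and evaluates an RNN with a K-fold cross-validation and prints some metrics. 
    
    @numberOfFold: number of folds for the cross-validation
    @x: sequences
    @y: targets
    @packageForFitness, @minLengthRun, @m_AC: values returned by cleanDataForRNN, used for the fitness and its lower bound
    @max_len, @number_of_activities: used to build the model (see myModel)
    @epochs, @batch_size: training parameters
    Returns the model with the best accuracy over the folds.
    '''    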
    startModel = time.time()
    accAll, lossAll = [], []
    accNeg, lossNeg, percentageNeg = [], [], []
    accPos, lossPos, percentagePos = [], [], []
    
    realFitness, realLBFitness, predictedLBFitness = [], [], []
    packageForFitness.reset_index(drop=True,inplace=True)
    
    runtime = []

    # use a K-fold Cross-Validation and show average Loss and average Accuracy
    kfold = KFold(numberOfFold)
    bestmodel, bestAccuracy = None, 0
    
    for indices_of_train, indices_of_test in kfold.split(x):
        
        # train the model
        model = myModel(max_len, number_of_activities+1,verbose)
        model.fit(x[indices_of_train], y[indices_of_train], verbose=verbose, epochs=epochs, batch_size=batch_size)
        
        # compute loss and accuracy on the test items
        current_accuracy = accLossPercentageForRNN(model, x, y, indices_of_test, accAll, lossAll, verbose=verbose)
        
        # compute loss and accuracy on the test items that are negatives
        accLossPercentageForRNN(model, x, y, indices_of_test, accNeg, lossNeg, percentageNeg, 0, verbose)

        # compute loss and accuracy on the test items that are positives
        indices_of_pos = accLossPercentageForRNN(model, x, y, indices_of_test, accPos, lossPos, percentagePos, 1, verbose) 

        # compute fitness and lower-bound
        realFitness.append(fitness(packageForFitness.iloc[indices_of_test], minLengthRun))
        if len(indices_of_pos)>0:
            realLBFitness.append(LB_fitness(packageForFitness.iloc[indices_of_test], minLengthRun, m_AC, indices_of_pos))
        
        # compute predicted LB fitness and runtime
        start = time.time()
        predictions = model.predict(x[indices_of_test])
        runtime.append((time.time()-start)/len(indices_of_test))
        indices_of_predicted_as_positives = [indices_of_test[i] for i in range(0,len(predictions)) if predictions[i][1]>=0.5]
        if len(indices_of_predicted_as_positives)>0:
            predictedLBFitness.append(LB_fitness(packageForFitness.iloc[indices_of_test], minLengthRun, m_AC, indices_of_predicted_as_positives))

        if bestAccuracy < current_accuracy:
            bestmodel = model
            bestAccuracy = current_accuracy
    print("TIME:",time.time()-startModel)   
    print("[CROSS-VALIDATION]\n[ALL] Loss:", "{:.3f}".format(mean(lossAll)), "\t Acc:", "{:.3f}".format(mean(accAll)))
    print("[POSITIVE ({:.2f}%)] Loss:".format(100 * mean(percentagePos)), "{:.3f}".format(mean(lossPos)), "\t Acc:", "{:.3f}".format(mean(accPos)))
    print("[NEGATIVE ({:.2f}%)] Loss:".format(100 * mean(percentageNeg)), "{:.3f}".format(mean(lossNeg)),"\t Acc:", "{:.3f}".format(mean(accNeg)))
    print("Fitness {:.3f}".format(mean(realFitness)), "\t LB Fitness:", "{:.3f}\n".format(mean(realLBFitness)),"\t Predicted LB Fitness:", "{:.3f}\n".format(mean(predictedLBFitness)))
    print("Runtime (prediction per trace):{:.10f}".format(mean(runtime)))
    return bestmodel
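
The step that turns softmax outputs into the set of predicted positives can be sketched as follows. The predictions are made-up values and `predicted_positive_indices` is a hypothetical helper; the 0.5 threshold on the second column is the one used in the function above.

```python
def predicted_positive_indices(predictions, test_indices, threshold=0.5):
    """Return the dataset indices whose positive-class probability
    (second column) reaches the threshold."""
    return [
        test_indices[i]
        for i, probs in enumerate(predictions)
        if probs[1] >= threshold
    ]


# Toy softmax outputs for three test items located at dataset rows 7, 8, 9.
toy_predictions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
toy_test_indices = [7, 8, 9]

selected = predicted_positive_indices(toy_predictions, toy_test_indices)
print(selected)
```

The predicted lower bound is then computed over these indices instead of the true positives, so misclassified traces shift it away from the real lower bound.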

In [10]:
# example of use
runKFoldForRNN(3,x,y, packageForFitness,minLengthRun, m_AC,max_len, number_of_activities)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 176)               0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 176, 15)           375       
_________________________________________________________________
bidirectional_2 (Bidirection (None, 176, 100)          26400     
_________________________________________________________________
dropout_3 (Dropout)          (None, 176, 100)          0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 50)                30200     
_________________________________________________________________
dropout_4 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 102       
__________

<keras.engine.training.Model at 0x12db95dd0>

## 4. Train and Test

In this section, we run the cross-validation on a training set and then test on a separate held-out set, obtained with `train_test_split`. 

The loop gives the results for <b>different mAC values</b>: 2, 4, 6, 8 and 10. 

In [11]:
filename = "A_2012_shm.csv"

for i in [2,4,6,8,10]:
    print("\n","m_AC=",i)
    
    # read the data file; packageForFitness, minLengthRun and m_AC are needed to compute the fitness and its lower bound
    x, y, number_of_activities, max_len, index_to_word, packageForFitness, minLengthRun, m_AC  = cleanDataForRNN("alignments/"+filename,i)
    
    # split the dataset in TRAIN and TEST sets
    X_train, X_test, y_train, y_test, packageForFitness_train, packageForFitness_test = train_test_split(x, y, packageForFitness, test_size=0.33, random_state=42)

    # ----------------------------------------------
    #                     TRAIN 
    # ----------------------------------------------
    # run the cross-validation
    bestmodel = runKFoldForRNN(10, X_train, y_train, packageForFitness_train, minLengthRun, m_AC, max_len, number_of_activities, verbose=0)

    # ----------------------------------------------
    #                     TEST 
    # ----------------------------------------------
    # use the same function as in the cross-validation but for the test set. 
    accAll, lossAll = [], []
    accNeg, lossNeg, percentageNeg = [], [], []
    accPos, lossPos, percentagePos = [], [], []
    
    packageForFitness_test.reset_index(drop=True,inplace=True)
    
    #compute loss and accuracy on the test items 
    accLossPercentageForRNN(bestmodel, X_test, y_test, range(0,len(y_test)), accAll, lossAll, verbose=0)
        
    # compute loss and accuracy on the test items that are negatives
    accLossPercentageForRNN(bestmodel, X_test, y_test, range(0,len(y_test)), accNeg, lossNeg, percentageNeg, 0, 0)

    # compute loss and accuracy on the test items that are positives
    indices_of_pos = accLossPercentageForRNN(bestmodel, X_test, y_test, range(0,len(y_test)), accPos, lossPos,percentagePos,  1, 0) 

    # compute fitness and lower-bound
    realFitness = fitness(packageForFitness_test, minLengthRun)
    if len(indices_of_pos)>0:
        LBFitness = LB_fitness(packageForFitness_test, minLengthRun, m_AC, indices_of_pos)
        
    # compute predicted LB fitness
    predictions = bestmodel.predict(X_test)
    indices_of_predicted_as_positives = [i for i in range(0, len(X_test)) if predictions[i][1]>=0.5]
    if len(indices_of_predicted_as_positives)>0:
        predictedLBFitness = LB_fitness(packageForFitness_test, minLengthRun, m_AC, indices_of_predicted_as_positives)
        
    print("[TEST]\n[ALL] Loss:", "{:.3f}".format(mean(lossAll)), "\t Acc:", "{:.3f}".format(mean(accAll)))
    print("[POSITIVE ({:.2f}%)] Loss:".format(100 * mean(percentagePos)), "{:.3f}".format(mean(lossPos)), "\t Acc:", "{:.3f}".format(mean(accPos)))
    print("[NEGATIVE ({:.2f}%)] Loss:".format(100 * mean(percentageNeg)), "{:.3f}".format(mean(lossNeg)),"\t Acc:", "{:.3f}".format(mean(accNeg)))
    print("Fitness {:.3f}".format((realFitness)), "\t LB Fitness:", "{:.3f}\n".format((LBFitness)),"\t Predicted LB Fitness:", "{:.3f}\n".format((predictedLBFitness)) )

    fake_x, fake_y, fake_number_of_activities, fake_max_len, fake_index_to_word, fake_packageForFitness, fake_minLengthRun, fake_m_AC  = cleanDataForRNN("alignments/mock/"+filename,2,word_to_index_already_exists=True)
    accFake, lossFake = [], []
    accLossPercentageForRNN(bestmodel, fake_x, fake_y, range(0,len(fake_y)), accFake, lossFake)
    print("[MOCK] Loss:", "{:.3f}".format(mean(lossFake)), "\t Acc:", "{:.3f}".format(mean(accFake)))
    print("-------------------------------------------")



 m_AC= 2
TIME: 2458.9817349910736
[CROSS-VALIDATION]
[ALL] Loss: 0.027 	 Acc: 0.989
Fitness 0.836 	 LB Fitness: 0.074
 	 Predicted LB Fitness: 0.073

Runtime (prediction per trace):0.0076604263
[TEST]
[ALL] Loss: 0.023 	 Acc: 0.997
Fitness 0.837 	 LB Fitness: 0.071
 	 Predicted LB Fitness: 0.072

[FAKE] Loss: 0.141 	 Acc: 0.959
-------------------------------------------

 m_AC= 4
TIME: 2538.729465007782
[CROSS-VALIDATION]
[ALL] Loss: 0.064 	 Acc: 0.976
Fitness 0.836 	 LB Fitness: 0.174
 	 Predicted LB Fitness: 0.165

Runtime (prediction per trace):0.0130540386
[TEST]
[ALL] Loss: 0.047 	 Acc: 0.990
Fitness 0.837 	 LB Fitness: 0.169
 	 Predicted LB Fitness: 0.166

[FAKE] Loss: 0.339 	 Acc: 0.934
-------------------------------------------

 m_AC= 6
TIME: 2546.310353040695
[CROSS-VALIDATION]
[ALL] Loss: 0.120 	 Acc: 0.962
Fitness 0.836 	 LB Fitness: 0.462
 	 Predicted LB Fitness: 0.484

Runtime (prediction per trace):0.0210512503
[TEST]
[ALL] Loss: 0.114 	 Acc: 0.971
Fitness 0.837 	 LB 