In [1]:
# Import necessary script
from scripts.dda_functions import EventLogProcessor

# Declarative Data Augmentation
This notebook presents our declarative data augmentation approach. It walks through and explains every step needed to generate synthetic data, which is then used to augment the original data.

## Helpdesk

In [2]:
# Initialize EventLogProcessor class for Helpdesk 
helpdesk = event_processor = EventLogProcessor("helpdesk")

# Apply Train-Test Split
helpdesk.train_test_split()

# Convert train data to an event log
helpdesk.generate_train_log()

exporting log, completed traces ::   0%|          | 0/3664 [00:00<?, ?it/s]

As there are many possible parameter combinations, we now need to define the specific parameter settings for the declarative process discovery. In our thesis we provide results for three combinations:

* consider_vacuity = False, min_support = 0.8, itemsets_support = 0.8, max_declare_cardinality = 2
* consider_vacuity = True, min_support = 0.95, itemsets_support = 0.95, max_declare_cardinality = 2
* consider_vacuity = True, min_support = 0.95, itemsets_support = 0.95, max_declare_cardinality = 3

In this notebook, the user chooses which of these parameter settings the process model is discovered with.
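As a small sketch, the three combinations can be kept in one list and unpacked into the discovery call. The keyword names follow the `declare_discovery` call used in this notebook; the list name `DISCOVERY_SETTINGS` is only an illustrative choice.

```python
# The three parameter combinations from the list above, as keyword arguments.
# DISCOVERY_SETTINGS is a hypothetical name introduced for this sketch.
DISCOVERY_SETTINGS = [
    {"consider_vacuity": False, "min_support": 0.8,
     "itemsets_support": 0.8, "max_declare_cardinality": 2},
    {"consider_vacuity": True, "min_support": 0.95,
     "itemsets_support": 0.95, "max_declare_cardinality": 2},
    {"consider_vacuity": True, "min_support": 0.95,
     "itemsets_support": 0.95, "max_declare_cardinality": 3},
]

# Pick one combination (here the second one).
selected = DISCOVERY_SETTINGS[1]

# With an initialized processor, the call would look like this:
# helpdesk.declare_discovery(**selected)
print(selected)
```

Keeping the settings in data rather than in the call makes it easy to rerun the same cell for another combination by changing only the index.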

In [3]:
helpdesk.declare_discovery(consider_vacuity=True, min_support=0.95, itemsets_support=0.95, max_declare_cardinality=2)

parsing log, completed traces ::   0%|          | 0/3664 [00:00<?, ?it/s]

Computing discovery ...
Model activities:
-----------------
0 Assign seriousness
1 Take in charge ticket
2 Resolve ticket
3 Closed
4 Insert ticket
5 Wait
6 Create SW anomaly
7 Require upgrade
8 Resolve SW anomaly
9 Schedule intervention
10 RESOLVED
11 INVALID
12 VERIFIED


Model constraints:
-----------------
0 Existence1[Resolve ticket] | |
1 Existence1[Closed] | |
2 Absence2[Closed] | |
3 Exactly1[Closed] | |
4 End[Closed] | |
5 Existence1[Assign seriousness] | |
6 Init[Assign seriousness] | |
7 Choice[Resolve ticket, Closed] | | |
8 Choice[Closed, Resolve ticket] | | |
9 Responded Existence[Resolve ticket, Closed] | | |
10 Responded Existence[Closed, Resolve ticket] | | |
11 Response[Resolve ticket, Closed] | | |
12 Precedence[Resolve ticket, Closed] | | |
13 Alternate Precedence[Resolve ticket, Closed] | | |
14 Chain Precedence[Resolve ticket, Closed] | | |
15 Not Response[Closed, Resolve ticket] | | |
16 Not Precedence[Closed, Resolve ticket] | | |
17 Not Chain Response[Closed, Re
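The constraints listed in this output follow the standard Declare template semantics. As a minimal sketch (not the discovery library's implementation), a few of these templates can be checked against a single trace in plain Python. The helper function names are hypothetical; the example trace is one of the Helpdesk variants.

```python
# Minimal checkers for some Declare templates on one trace (a tuple of activities).
# These helpers are illustrative only and not part of the notebook's scripts.

def exactly(trace, a, n=1):
    """Exactly n: activity a occurs exactly n times."""
    return trace.count(a) == n

def init(trace, a):
    """Init: the trace starts with a."""
    return bool(trace) and trace[0] == a

def end(trace, a):
    """End: the trace ends with a."""
    return bool(trace) and trace[-1] == a

def response(trace, a, b):
    """Response: every a is eventually followed by b."""
    return all(b in trace[i + 1:] for i, x in enumerate(trace) if x == a)

def precedence(trace, a, b):
    """Precedence: every b has some a before it."""
    return all(a in trace[:i] for i, x in enumerate(trace) if x == b)

def chain_precedence(trace, a, b):
    """Chain Precedence: every b is immediately preceded by a."""
    return all(i > 0 and trace[i - 1] == a for i, x in enumerate(trace) if x == b)

trace = ("Assign seriousness", "Resolve ticket", "Take in charge ticket",
         "Resolve ticket", "Closed")

checks = {
    "Init[Assign seriousness]": init(trace, "Assign seriousness"),
    "End[Closed]": end(trace, "Closed"),
    "Exactly1[Closed]": exactly(trace, "Closed"),
    "Response[Resolve ticket, Closed]": response(trace, "Resolve ticket", "Closed"),
    "Precedence[Resolve ticket, Closed]": precedence(trace, "Resolve ticket", "Closed"),
    "Chain Precedence[Resolve ticket, Closed]": chain_precedence(trace, "Resolve ticket", "Closed"),
}
for name, holds in checks.items():
    print(name, holds)
```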

After the declarative process discovery, we define the parameters for the process simulation and for the synthetic timestamp generation. Two choices matter here: the required number of mean values for the time_delta, and whether business hours and weekdays should be considered when generating the timestamps.

For Helpdesk we define:
* number of required mean values: 1
* consider business hours: True
* consider weekday: True
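The notebook does not show how the two flags are applied internally. One plausible reading, sketched here under stated assumptions, is that a generated timestamp is moved forward to the next admissible slot. The function `shift_to_business_time` and the 09:00 to 17:00, Monday to Friday window are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

def shift_to_business_time(ts, consider_hour=True, consider_day=True,
                           start_hour=9, end_hour=17):
    """Move ts forward to the next slot allowed by the two flags.

    Hypothetical helper: the business window and the shifting rule are
    assumptions, not the behaviour of the notebook's own scripts.
    """
    def day_start(t):
        return t.replace(hour=start_hour, minute=0, second=0, microsecond=0)

    if consider_hour:
        if ts.hour < start_hour:
            ts = day_start(ts)                      # too early: wait for opening
        elif ts.hour >= end_hour:
            ts = day_start(ts + timedelta(days=1))  # too late: next morning
    if consider_day:
        while ts.weekday() >= 5:                    # 5 = Saturday, 6 = Sunday
            ts += timedelta(days=1)
            if consider_hour:
                ts = day_start(ts)
    return ts

# Friday evening is pushed to Monday morning when both flags are set.
friday_evening = datetime(2024, 1, 5, 18, 30)
print(shift_to_business_time(friday_evening))
# With both flags off, the timestamp is left unchanged.
print(shift_to_business_time(friday_evening, consider_hour=False, consider_day=False))
```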

In [4]:
# Get parameters for declarative process simulation
min_events, max_events, num_of_cases, orig_prob = helpdesk.get_syn_log_parameters()

# Simulate declarative process model
helpdesk.generate_syn_log(num_of_cases=num_of_cases, num_min_events=min_events, num_max_events=max_events, orig_prob=orig_prob)

# Convert the obtained synthetic log into a DataFrame for further use
helpdesk.convert_syn_log_to_df()

# Apply synthetic timestamp generation
# Generate timedelta based on the required number of mean values
time_delta = helpdesk.generate_timedelta(1)

# Get case arrival distribution
inter_case_dist = helpdesk.get_inter_case_dist()

# Generate the first timestamps of each case
df_syn = helpdesk.generate_first_timestamps(inter_case_dist=inter_case_dist)

# Generate remaining timestamps and define if business hours and weekday should be considered
helpdesk.generate_timestamps(df_syn=df_syn, time_delta=time_delta, consider_hour=True, consider_day=True)

DEBUG:ASP generator:Distribution for traces uniform
DEBUG:ASP generator:traces: 3664, events can have a trace min(2) max(15)
INFO:ASP generator:Computing distribution
DEBUG:Distributor:Distribution() uniform min_mu: 2 max_sigma: 15 num_traces: 3664 custom_prob: None
DEBUG:Distributor:Uniform() probabilities: [Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14), Fraction(1, 14)]


    Number of Events  Number of Cases
0                  2                1
1                  3              169
2                  4             2158
3                  5              717
4                  6              354
5                  7              137
6                  8               81
7                  9               20
8                 10               12
9                 11                7
10                12                4
11                13                2
12                14                1
13                15                1
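The custom trace-length probabilities that appear in the log output of this cell are the case counts from the table above divided by the total number of cases. A short sketch reproducing them (variable names are illustrative):

```python
from fractions import Fraction

# Number of cases per trace length, copied from the table above.
case_counts = {2: 1, 3: 169, 4: 2158, 5: 717, 6: 354, 7: 137, 8: 81,
               9: 20, 10: 12, 11: 7, 12: 4, 13: 2, 14: 1, 15: 1}

total_cases = sum(case_counts.values())

# Empirical probability of each trace length, ordered by length.
orig_prob = [case_counts[n] / total_cases for n in sorted(case_counts)]

# The uniform alternative assigns the same weight to each of the 14 lengths.
uniform_prob = [Fraction(1, len(case_counts))] * len(case_counts)

print(total_cases)
print([round(p, 10) for p in orig_prob[:3]])
```

Using the empirical probabilities keeps the synthetic log close to the original length profile, where traces with four events dominate.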


DEBUG:Distributor:Custom_dist() min_mu:2 max_sigma:15 num_traces:3664
DEBUG:Distributor:Probabilities sum 1
DEBUG:Distributor:Distribution result: [ 3  9 13 ...  7  6  2]
INFO:ASP generator:Distribution result Counter({11: 301, 7: 276, 4: 276, 5: 274, 15: 268, 14: 264, 12: 263, 6: 260, 10: 256, 3: 253, 2: 251, 9: 249, 13: 238, 8: 235})
DEBUG:ASP generator:Using custom traces length
INFO:ASP generator:Computing distribution
DEBUG:Distributor:Distribution() custom min_mu: 2 max_sigma: 15 num_traces: 3664 custom_prob: [0.0002729258, 0.0461244541, 0.5889737991, 0.1956877729, 0.0966157205, 0.0373908297, 0.0221069869, 0.0054585153, 0.0032751092, 0.0019104803, 0.0010917031, 0.0005458515, 0.0002729258, 0.000272925800000111]
DEBUG:Distributor:Custom_dist() min_mu:2 max_sigma:15 num_traces:3664
DEBUG:Distributor:Probabilities sum 1.0
DEBUG:Distributor:Distribution result: [4 4 5 ... 4 7 4]
INFO:ASP generator:Distribution result Counter({4: 2211, 5: 676, 6: 337, 7: 161, 3: 156, 8: 74, 10: 17, 9: 

exporting log, completed traces ::   0%|          | 0/3662 [00:00<?, ?it/s]

parsing log, completed traces ::   0%|          | 0/3662 [00:00<?, ?it/s]

{'distribution_name': 'gamma', 'distribution_params': [{'value': 7664.858238723535}, {'value': 200319470.91884205}, {'value': 5.0}, {'value': 61493.0}]}
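The fitted inter-case distribution above is a gamma distribution with four parameter values. As a hedged sketch, assume these four values are mean, variance, minimum, and maximum of the inter-arrival time in seconds; this interpretation is an assumption and is not confirmed by the notebook. Under it, case start times can be sampled with the standard library as follows (function names are hypothetical):

```python
import random
from datetime import datetime, timedelta

def gamma_shape_scale(mean, variance):
    """Convert mean and variance to gamma shape and scale parameters."""
    return mean ** 2 / variance, variance / mean

def sample_case_starts(n_cases, first_start, mean, variance, lower, upper, seed=0):
    """Sample n_cases start times with gamma-distributed gaps clipped to [lower, upper]."""
    rng = random.Random(seed)
    shape, scale = gamma_shape_scale(mean, variance)
    starts = [first_start]
    for _ in range(n_cases - 1):
        gap = rng.gammavariate(shape, scale)  # seconds between two case arrivals (assumed unit)
        gap = min(max(gap, lower), upper)     # clip to the assumed min and max
        starts.append(starts[-1] + timedelta(seconds=gap))
    return starts

# Parameter values copied from the output above; their meaning is assumed.
starts = sample_case_starts(
    n_cases=5,
    first_start=datetime(2024, 1, 1, 9, 0),
    mean=7664.858238723535,
    variance=200319470.91884205,
    lower=5.0,
    upper=61493.0,
)
for s in starts:
    print(s)
```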


The next step is to compute the common variants (variants that occur in both the original and the synthetic data) and to augment the original data. For this, we need to define whether common variants should be excluded from the augmentation and, if so, on what basis:
* *orig_proportion*: based on the proportion of the variant in the original data
* *common_ratio*: a defined percentage of the common variants is excluded from the augmentation process
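To make the two options concrete, here is a minimal sketch on toy data. It is not the implementation behind `get_most_frequent_common_variants` or `common_variants_to_remove`; the function names are hypothetical, and the exact way the threshold is applied (share of the original log for *orig_proportion*, top fraction ranked by original frequency for *common_ratio*) is an assumption.

```python
from collections import Counter

def common_variants(orig_traces, syn_traces):
    """Variants present in both logs, with frequencies and per-log proportions."""
    orig_counts = Counter(map(tuple, orig_traces))
    syn_counts = Counter(map(tuple, syn_traces))
    n_orig, n_syn = len(orig_traces), len(syn_traces)
    return {
        v: {
            "frequency_orig": orig_counts[v],
            "frequency_syn": syn_counts[v],
            "proportion_orig": orig_counts[v] / n_orig,
            "proportion_syn": syn_counts[v] / n_syn,
        }
        for v in orig_counts.keys() & syn_counts.keys()
    }

def select_variants_to_remove(common, method, threshold):
    """Pick common variants to exclude (threshold semantics are assumed)."""
    if method == "orig_proportion":
        # Assumption: exclude variants covering at least `threshold` of the original log.
        return [v for v, s in common.items() if s["proportion_orig"] >= threshold]
    if method == "common_ratio":
        # Assumption: exclude the top `threshold` fraction, ranked by original frequency.
        ranked = sorted(common, key=lambda v: common[v]["frequency_orig"], reverse=True)
        return ranked[: int(len(ranked) * threshold)]
    raise ValueError(f"unknown method: {method}")

orig = [("A", "B", "C")] * 3 + [("A", "C")]
syn = [("A", "B", "C")] + [("A", "C")] * 2 + [("A", "B")]

common = common_variants(orig, syn)
print(len(common))
print(select_variants_to_remove(common, "orig_proportion", 0.5))
print(select_variants_to_remove(common, "common_ratio", 0.5))
```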

In [5]:
# Compute common variants
common_variants = helpdesk.get_most_frequent_common_variants()

# Define if common variants should be removed
variants_to_remove = helpdesk.common_variants_to_remove(common_variants=common_variants, method='orig_proportion', threshold=0.5)

# Save filtered variants
helpdesk.save_filtered_variants(most_frequent_common_variants=variants_to_remove)

# Apply Train-Validation Split
helpdesk.train_val_split()

# Augment the original data with filtered synthetic data
helpdesk.augment_new_variants()

# Save the augmented and original data
helpdesk.save_total_data('original')
helpdesk.save_total_data('augmented')

Number of common variants: 25
Sorted common variants: [(('Assign seriousness', 'Resolve ticket', 'Take in charge ticket', 'Resolve ticket', 'Closed'), {'frequency_orig': 1, 'frequency_syn': 35, 'proportion_orig': 0.0002729257641921397, 'proportion_syn': 0.009557618787547788}), (('Assign seriousness', 'Take in charge ticket', 'Take in charge ticket', 'Require upgrade', 'Resolve ticket', 'Closed'), {'frequency_orig': 1, 'frequency_syn': 1, 'proportion_orig': 0.0002729257641921397, 'proportion_syn': 0.00027307482250136535}), (('Assign seriousness', 'Take in charge ticket', 'Schedule intervention', 'Resolve ticket', 'Closed'), {'frequency_orig': 1, 'frequency_syn': 3, 'proportion_orig': 0.0002729257641921397, 'proportion_syn': 0.0008192244675040961}), (('Assign seriousness', 'Resolve ticket', 'Wait', 'Wait', 'Resolve ticket', 'Closed'), {'frequency_orig': 1, 'frequency_syn': 1, 'proportion_orig': 0.0002729257641921397, 'proportion_syn': 0.00027307482250136535}), (('Assign seriousness', 'Re

## Sepsis

In [None]:
# Initialize EventLogProcessor class for Sepsis 
sepsis = event_processor = EventLogProcessor("sepsis")

# Apply Train-Test Split
sepsis.train_test_split()

# Convert train data to an event log
sepsis.generate_train_log()

As there are many possible parameter combinations, we now need to define the specific parameter settings for the declarative process discovery. In our thesis we provide results for three combinations:

* consider_vacuity = False, min_support = 0.8, itemsets_support = 0.8, max_declare_cardinality = 2
* consider_vacuity = True, min_support = 0.95, itemsets_support = 0.95, max_declare_cardinality = 2
* consider_vacuity = True, min_support = 0.95, itemsets_support = 0.95, max_declare_cardinality = 3

In this notebook, the user chooses which of these parameter settings the process model is discovered with.

In [None]:
sepsis.declare_discovery(consider_vacuity=True, min_support=0.95, itemsets_support=0.95, max_declare_cardinality=2)

After the declarative process discovery, we define the parameters for the process simulation and for the synthetic timestamp generation. Two choices matter here: the required number of mean values for the time_delta, and whether business hours and weekdays should be considered when generating the timestamps.

For Sepsis we define:
* number of required mean values: 4
* consider business hours: False
* consider weekday: False

In [None]:
# Get parameters for declarative process simulation
min_events, max_events, num_of_cases, orig_prob = sepsis.get_syn_log_parameters()

# Simulate declarative process model
sepsis.generate_syn_log(num_of_cases=num_of_cases, num_min_events=min_events, num_max_events=max_events, orig_prob=orig_prob)

# Convert the obtained synthetic log into a DataFrame for further use
sepsis.convert_syn_log_to_df()

# Apply synthetic timestamp generation
# Generate timedelta based on the required number of mean values
time_delta = sepsis.generate_timedelta(4)

# Get case arrival distribution
inter_case_dist = sepsis.get_inter_case_dist()

# Generate the first timestamps of each case
df_syn = sepsis.generate_first_timestamps(inter_case_dist=inter_case_dist)

# Generate remaining timestamps and define if business hours and weekday should be considered
sepsis.generate_timestamps(df_syn=df_syn, time_delta=time_delta, consider_hour=False, consider_day=False)


The next step is to compute the common variants (variants that occur in both the original and the synthetic data) and to augment the original data. For this, we need to define whether common variants should be excluded from the augmentation and, if so, on what basis:
* *orig_proportion*: based on the proportion of the variant in the original data
* *common_ratio*: a defined percentage of the common variants is excluded from the augmentation process

In [None]:
# Compute common variants
common_variants = sepsis.get_most_frequent_common_variants()

# Define if common variants should be removed
variants_to_remove = sepsis.common_variants_to_remove(common_variants=common_variants, method='orig_proportion', threshold=0.5)

# Save filtered variants
sepsis.save_filtered_variants(most_frequent_common_variants=variants_to_remove)

# Apply Train-Validation Split
sepsis.train_val_split()

# Augment the original data with filtered synthetic data
sepsis.augment_new_variants()

# Save the augmented and original data
sepsis.save_total_data('original')
sepsis.save_total_data('augmented')

## BPIC13C

In [None]:
# Initialize EventLogProcessor class for BPIC13C 
bpic13c = event_processor = EventLogProcessor("bpic13c")

# Apply Train-Test Split
bpic13c.train_test_split()

# Convert train data to an event log
bpic13c.generate_train_log()

As there are many possible parameter combinations, we now need to define the specific parameter settings for the declarative process discovery. In our thesis we provide results for three combinations:

* consider_vacuity = False, min_support = 0.8, itemsets_support = 0.8, max_declare_cardinality = 2
* consider_vacuity = True, min_support = 0.95, itemsets_support = 0.95, max_declare_cardinality = 2
* consider_vacuity = True, min_support = 0.95, itemsets_support = 0.95, max_declare_cardinality = 3

In this notebook, the user chooses which of these parameter settings the process model is discovered with.

In [None]:
bpic13c.declare_discovery(consider_vacuity=True, min_support=0.95, itemsets_support=0.95, max_declare_cardinality=2)

After the declarative process discovery, we define the parameters for the process simulation and for the synthetic timestamp generation. Two choices matter here: the required number of mean values for the time_delta, and whether business hours and weekdays should be considered when generating the timestamps.

For BPIC13C we define:
* number of required mean values: 4
* consider business hours: True
* consider weekday: True

In [None]:
# Get parameters for declarative process simulation
min_events, max_events, num_of_cases, orig_prob = bpic13c.get_syn_log_parameters()

# Simulate declarative process model
bpic13c.generate_syn_log(num_of_cases=num_of_cases, num_min_events=min_events, num_max_events=max_events, orig_prob=orig_prob)

# Convert the obtained synthetic log into a DataFrame for further use
bpic13c.convert_syn_log_to_df()

# Apply synthetic timestamp generation
# Generate timedelta based on the required number of mean values
time_delta = bpic13c.generate_timedelta(4)

# Get case arrival distribution
inter_case_dist = bpic13c.get_inter_case_dist()

# Generate the first timestamps of each case
df_syn = bpic13c.generate_first_timestamps(inter_case_dist=inter_case_dist)

# Generate remaining timestamps and define if business hours and weekday should be considered
bpic13c.generate_timestamps(df_syn=df_syn, time_delta=time_delta, consider_hour=True, consider_day=True)


The next step is to compute the common variants (variants that occur in both the original and the synthetic data) and to augment the original data. For this, we need to define whether common variants should be excluded from the augmentation and, if so, on what basis:
* *orig_proportion*: based on the proportion of the variant in the original data
* *common_ratio*: a defined percentage of the common variants is excluded from the augmentation process

In [None]:
# Compute common variants
common_variants = bpic13c.get_most_frequent_common_variants()

# Define if common variants should be removed
variants_to_remove = bpic13c.common_variants_to_remove(common_variants=common_variants, method='orig_proportion', threshold=0.5)

# Save filtered variants
bpic13c.save_filtered_variants(most_frequent_common_variants=variants_to_remove)

# Apply Train-Validation Split
bpic13c.train_val_split()

# Augment the original data with filtered synthetic data
bpic13c.augment_new_variants()

# Save the augmented and original data
bpic13c.save_total_data('original')
bpic13c.save_total_data('augmented')

## BPIC15_1

In [None]:
# Initialize EventLogProcessor class for BPIC15_1 
bpic15_1 = event_processor = EventLogProcessor("bpic15_1")

# Apply Train-Test Split
bpic15_1.train_test_split()

# Convert train data to an event log
bpic15_1.generate_train_log()

As there are many possible parameter combinations, we now need to define the specific parameter settings for the declarative process discovery. In our thesis we provide results for three combinations:

* consider_vacuity = False, min_support = 0.8, itemsets_support = 0.8, max_declare_cardinality = 2
* consider_vacuity = True, min_support = 0.95, itemsets_support = 0.95, max_declare_cardinality = 2
* consider_vacuity = True, min_support = 0.95, itemsets_support = 0.95, max_declare_cardinality = 3

In this notebook, the user chooses which of these parameter settings the process model is discovered with.

In [None]:
bpic15_1.declare_discovery(consider_vacuity=True, min_support=0.95, itemsets_support=0.95, max_declare_cardinality=2)

After the declarative process discovery, we define the parameters for the process simulation and for the synthetic timestamp generation. Two choices matter here: the required number of mean values for the time_delta, and whether business hours and weekdays should be considered when generating the timestamps.

For BPIC15_1 we define:
* number of required mean values: 4
* consider business hours: False
* consider weekday: False

In [None]:
# Get parameters for declarative process simulation
min_events, max_events, num_of_cases, orig_prob = bpic15_1.get_syn_log_parameters()

# Simulate declarative process model
bpic15_1.generate_syn_log(num_of_cases=num_of_cases, num_min_events=min_events, num_max_events=max_events, orig_prob=orig_prob)

# Convert the obtained synthetic log into a DataFrame for further use
bpic15_1.convert_syn_log_to_df()

# Apply synthetic timestamp generation
# Generate timedelta based on the required number of mean values
time_delta = bpic15_1.generate_timedelta(4)

# Get case arrival distribution
inter_case_dist = bpic15_1.get_inter_case_dist()

# Generate the first timestamps of each case
df_syn = bpic15_1.generate_first_timestamps(inter_case_dist=inter_case_dist)

# Generate remaining timestamps and define if business hours and weekday should be considered
bpic15_1.generate_timestamps(df_syn=df_syn, time_delta=time_delta, consider_hour=False, consider_day=False)


The next step is to compute the common variants (variants that occur in both the original and the synthetic data) and to augment the original data. For this, we need to define whether common variants should be excluded from the augmentation and, if so, on what basis:
* *orig_proportion*: based on the proportion of the variant in the original data
* *common_ratio*: a defined percentage of the common variants is excluded from the augmentation process

In [None]:
# Compute common variants
common_variants = bpic15_1.get_most_frequent_common_variants()

# Define if common variants should be removed
variants_to_remove = bpic15_1.common_variants_to_remove(common_variants=common_variants, method='orig_proportion', threshold=0.5)

# Save filtered variants
bpic15_1.save_filtered_variants(most_frequent_common_variants=variants_to_remove)

# Apply Train-Validation Split
bpic15_1.train_val_split()

# Augment the original data with filtered synthetic data
bpic15_1.augment_new_variants()

# Save the augmented and original data
bpic15_1.save_total_data('original')
bpic15_1.save_total_data('augmented')
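The four sections above repeat the same simulation and timestamp steps and differ only in three settings per dataset. As a sketch, these settings can be collected in one table and passed to a single helper. The names `TimestampSettings`, `SETTINGS`, and `run_simulation_steps` are hypothetical; the method names and the per-dataset values are taken from the cells and lists in this notebook.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimestampSettings:
    num_means: int        # number of required mean values for the time_delta
    consider_hour: bool   # consider business hours
    consider_day: bool    # consider weekday

# Per-dataset settings as stated in the sections above.
SETTINGS = {
    "helpdesk": TimestampSettings(1, True, True),
    "sepsis": TimestampSettings(4, False, False),
    "bpic13c": TimestampSettings(4, True, True),
    "bpic15_1": TimestampSettings(4, False, False),
}

def run_simulation_steps(processor, settings):
    """Run the simulation and timestamp steps on an initialized processor."""
    min_events, max_events, num_of_cases, orig_prob = processor.get_syn_log_parameters()
    processor.generate_syn_log(num_of_cases=num_of_cases, num_min_events=min_events,
                               num_max_events=max_events, orig_prob=orig_prob)
    processor.convert_syn_log_to_df()
    time_delta = processor.generate_timedelta(settings.num_means)
    inter_case_dist = processor.get_inter_case_dist()
    df_syn = processor.generate_first_timestamps(inter_case_dist=inter_case_dist)
    processor.generate_timestamps(df_syn=df_syn, time_delta=time_delta,
                                  consider_hour=settings.consider_hour,
                                  consider_day=settings.consider_day)

# Usage with a processor that has already run the split and discovery steps:
# run_simulation_steps(sepsis, SETTINGS["sepsis"])
```

Driving the steps from one table keeps the documented settings and the executed arguments in sync across datasets.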