<a href="https://colab.research.google.com/github/MojtabaValizadeh/paresy/blob/master/paresy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div class="markdown-google-sans">
<h1><strong>PaRESy</strong></h1>
</div>

<div class="markdown-google-sans">
<h1>Parallel Regular Expression Synthesiser</h1>
</div>



This notebook contains the code and other artifacts for the paper

> **Search-Based Regular Expression Inference on a GPU**

by Mojtaba Valizadeh and Martin Berger. A draft of the paper is available [here](https://users.sussex.ac.uk/~mfb21/pldi23.pdf).

# <strong>Introduction</strong>

Welcome to our Colab notebook.

In this work, the goal is to find a `precise` and `minimal` regular expression (RE) that accepts a given set of positive strings and rejects a given set of negative ones. To accomplish this, we developed an algorithm called `PaRESy`, which has two implementations: one for the CPU in C++, and another for the GPU in CUDA C++. Having both lets us measure the GPU speed-up on the most challenging examples.

In this version of the work, we use a simple grammar for the REs:

```
R ::= Φ|ε|a|R?|R*|R.R|R+R
```
Regarding the minimality, we need to use a cost function that maps every constructor in the RE to a positive integer. By summing up the costs of all the constructors, we can obtain the overall cost of the RE. This cost function helps us to avoid overfitting and returning the trivial RE that is the union of the positive strings.

**Example:**
- Positive: [`""`, `"010"`, `"1000"`, `"0101"`, `"1010"`, `"0010"`]
- Negative: [`"001"`, `"1011"`, `"111"`]

**Note:** In this work, we use `""` to represent the `epsilon (ε)` as the empty string.

We aim to find a regular expression (RE) that accepts all strings in the positive set and rejects all strings in the negative set. However, there could be an infinite number of such REs, and we want to identify the minimal one based on a given cost function. This cost function assigns positive integers to each constructor, including the `alphabet`, `question mark`, `star`, `concatenation`, and `union`. Let us assume that the costs of these constructors are given as [`c1`, `c2`, `c3`, `c4`, `c5`]. Based on the different costs, we can generate various REs, as follows.

-   [`1`, `1`, `1`, `1`, `1`]   --->    `(0+101?)*`
-   [`1`, `5`, `1`, `1`, `5`]   --->    `(01)*(1*0)*`

**Note:** In this work, we use the `+` symbol to represent the `union` constructor, which is commonly denoted by `|` in standard libraries.
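
As a quick sanity check of the `+` versus `|` convention, the following sketch translates the two REs above into Python's `re` syntax and tests them against the example sets. The helpers `to_python_regex` and `is_precise` are hypothetical (they are not part of PaRESy) and assume that the RE contains only alphabet symbols, `?`, `*`, parentheses, and `+`; `Φ` and `ε` are not handled.

```python
import re

def to_python_regex(paresy_re):
    # Hypothetical helper: PaRESy writes union as '+', Python's re uses '|'.
    return paresy_re.replace("+", "|")

def is_precise(paresy_re, pos, neg):
    # An RE is precise if it accepts every positive and rejects every negative string.
    pattern = re.compile(to_python_regex(paresy_re))
    accepts_all_pos = all(pattern.fullmatch(s) for s in pos)
    rejects_all_neg = not any(pattern.fullmatch(s) for s in neg)
    return accepts_all_pos and rejects_all_neg

pos = ["", "010", "1000", "0101", "1010", "0010"]
neg = ["001", "1011", "111"]

for candidate in ["(0+101?)*", "(01)*(1*0)*"]:
    print(candidate, is_precise(candidate, pos, neg))
```

Both candidates should be reported as precise. This check says nothing about minimality, which depends on the cost function.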

In this simple example, both REs are `precise` (i.e., they accept all positive and reject all negative examples) and `minimal` w.r.t. their own cost functions. We can observe that by increasing the costs of `question mark` and `union` in the second cost function, the resulting RE contains fewer instances of these constructors. To compensate, the regular expression tends to use more stars, which are cheaper in this particular cost function.
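
The trade-off can be checked by summing constructor costs by hand. The sketch below does this over a small tuple-based syntax tree. The tree encoding and the `re_cost` function are illustrative assumptions, not the data structures used by PaRESy, and `Φ` and `ε` are left out because their cost is not specified in this notebook.

```python
def re_cost(node, costs):
    # costs = [c1, c2, c3, c4, c5] for (alphabet, ?, *, concatenation, union)
    c_atom, c_opt, c_star, c_cat, c_union = costs
    kind = node[0]
    if kind == "char":
        return c_atom
    if kind == "opt":
        return c_opt + re_cost(node[1], costs)
    if kind == "star":
        return c_star + re_cost(node[1], costs)
    if kind == "cat":
        return c_cat + re_cost(node[1], costs) + re_cost(node[2], costs)
    if kind == "union":
        return c_union + re_cost(node[1], costs) + re_cost(node[2], costs)
    raise ValueError("unknown constructor: " + str(kind))

zero, one = ("char", "0"), ("char", "1")

# (0+101?)*
re_a = ("star", ("union", zero, ("cat", ("cat", one, zero), ("opt", one))))
# (01)*(1*0)*
re_b = ("cat", ("star", ("cat", zero, one)), ("star", ("cat", ("star", one), zero)))

for costs in ([1, 1, 1, 1, 1], [1, 5, 1, 1, 5]):
    print(costs, re_cost(re_a, costs), re_cost(re_b, costs))
```

Under the uniform cost function the first RE costs 9 and the second costs 10, while under [`1`, `5`, `1`, `1`, `5`] they cost 17 and 10. Each RE is therefore the cheaper of the two under the cost function it was synthesised for. This comparison alone does not establish minimality over all REs.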

# <strong>Initialization</strong>

To begin, please run the code below to transfer all of the necessary requirements, including the codes, dependencies, benchmarks, etc, to this notebook. **Note:** The code that was initially submitted to PLDI 2023 is now available using the tag `v0.1`. Once you run the script, you will see a message prompting you to select a version to use. To reproduce the data presented in the paper, please choose the `Initial Version` option in the script. Otherwise, leave it as it is and the script will clone the latest version of the work by default.

In [None]:
import os
import ipywidgets as widgets
from IPython.display import clear_output
from IPython.display import HTML, display

if not os.path.isdir("/content/paresy"):

    question = "Please specify which version of the code you would prefer to use:"
    options = ["Initial Version (v0.1)", "Latest Version"]

    def check_answer(b):
        user_answer = option_radio_buttons.value
        clear_output()
        ! git config --global advice.detachedHead false
        if user_answer == "Latest Version":
            ! git clone https://github.com/MojtabaValizadeh/paresy.git
        else:
            ! git clone --branch v0.1 https://github.com/MojtabaValizadeh/paresy.git
        print()
        print("Done")

    display(HTML(f"<h2>{question}</h2>"))
    option_radio_buttons = widgets.RadioButtons(options = options, description = '', value = "Latest Version", disabled = False)
    display(option_radio_buttons)
    button = widgets.Button(description="Clone the repo")
    button.on_click(check_answer)
    display(button)

else:
    print("A repository already exists!")
    print("If you would like to use a different version of the work, please disconnect first from:")
    print("Runtime > Disconnect and delete runtime")
    print("Then please run this script again.")

# <strong>GPU version of the algorithm</strong>

**Important:** To optimize memory and performance, we recommend upgrading your Colab to `Pro` or `Pro+` versions. These paid versions offer larger memory limits, faster processing speeds, and advanced hardware acceleration, enabling users to execute complex operations with greater efficiency. By contrast, the `free version` of Colab may be subject to limitations that can impact the performance and accuracy of the following tasks.

**To upgrade to Colab Pro:**

- Find the `Colab Pro` tab in `Tools > Settings`.
- Choose your desired plan: `Colab Pro` or `Colab Pro+`.
- Follow the prompts to enter your billing information and complete the purchase process.

**Note:** Colab Pro is billed on a monthly basis, and you can cancel at any time. Additionally, some countries or regions may not be eligible to purchase Colab Pro at this time, so you may need to check if it is available in your area before proceeding.

**Note:** To run a cell in Colab, you can either click on the "play" button located on the left side of the cell, or you can press "Shift+Enter" on your keyboard while the cell is selected.

To check if the notebook is connected to the GPU, please run the following command.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
  print("At 'Runtime > Change runtime type', please choose `GPU` for `Hardware accelerator`")
else:
  print(gpu_info)

For optimal performance, please ensure that the GPU you are connected to is an `NVIDIA A100-SXM` or better. The GPU name appears in the middle of the table printed above. To check that you are using a `high-RAM` runtime, please run the following script.

In [None]:
from psutil import virtual_memory

ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime!')
else:
  print('Great! You are using a high-RAM runtime!')

**Note:** If you encounter any issues, you can adjust the settings from 'Edit > Notebook settings' or 'Runtime > Change runtime type'. Then set

- `GPU` for Hardware accelerator
- `Premium` or `A100` for GPU class/type
- `High RAM` for Runtime shape

then press `save`.

<div class="markdown-google-sans">
<h1>Compiling the GPU code</h1>
</div>

To compile and run the GPU code, please run the following scripts.

**NOTE:** You can add `-D MEASUREMENT_MODE` to measure the running time instead of printing logs.

In [None]:
import os
if not os.path.isdir("/content/paresy"):
    ! git clone https://github.com/MojtabaValizadeh/paresy.git
! nvcc --extended-lambda -I /content/paresy/code/gpu_version/modified_libraries /content/paresy/code/gpu_version/gpu126.cu -o gpu126
print ("Done")

<div class="markdown-google-sans">
<h1>Running the GPU code</h1>
</div>

Generally, you can use
```
! ./gpu126 <input_file_address> <c1> <c2> <c3> <c4> <c5> <maxCost>
```
where
1. `input_file_address` refers to the address of the input file that contains your positive and negative examples.
2. (`c1`, `c2`, `c3`, `c4`, `c5`) are 5 small positive integers for the costs of (a, ?, *, ., +)
3. `maxCost` parameter is an integer that sets an upper limit on the cost of the regular expression that the algorithm will search for. In most cases, you can use a reasonably large integer, such as 500, which is appropriate for our cost functions.
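
If you prefer to call the binary from Python rather than with `!`, the argument order above can be captured in a small helper. This is a minimal sketch: `build_command` is a hypothetical name that is not part of the repository, and the validation only reflects the constraints stated in the list above.

```python
import shlex

def build_command(binary, input_file, costs, max_cost):
    # costs = (c1, c2, c3, c4, c5) for (a, ?, *, ., +)
    if len(costs) != 5 or any(not isinstance(c, int) or c <= 0 for c in costs):
        raise ValueError("costs must be five positive integers")
    return [binary, input_file, *map(str, costs), str(max_cost)]

cmd = build_command(
    "./gpu126",
    "/content/paresy/benchmarks/type1/type1_exp1.txt",
    (1, 1, 1, 1, 1),
    500,
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run`, optionally with a `timeout`.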

For example, to run the first example of `type1` in our benchmarks, use the code below.

In [None]:
# Check if connected to GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
  print("At 'Runtime > Change runtime type', please choose `GPU` for `Hardware accelerator`")
else:
    ! ./gpu126 /content/paresy/benchmarks/type1/type1_exp1.txt 1 1 1 1 1 500

<div class="markdown-google-sans">
<h1>Tips for utilizing arbitrary input</h1>
</div>

In order to create an input file using your arbitrary positive and negative sets, you can use the following script. It generates an `input` text file in the `content` directory.

**Note:** You can modify the `pos` and `neg` in the following script to create your own input. After running the script once, you can easily edit the `input` text file located in the `content` directory to create your next desired input.

**Note:** To see the content in Colab, you can open the file explorer by clicking on the folder icon on the left side of the Colab notebook interface. You can navigate through the folders and subfolders to find the file you want to see.

Now, please run the following script.

In [None]:
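def makeInputFile(pos, neg):
    # Write a header line, "++", the positive strings, "--", then the negative strings.
    # The `with` block closes the file even if writing fails.
    with open("input.txt", 'w') as f:
        f.write("Exp\n")
        f.write("++")
        for p in pos:
            f.write("\n")
            f.write("\"" + str(p) + "\"")
        f.write("\n--")
        for n in neg:
            f.write("\n")
            f.write("\"" + str(n) + "\"")
    print("Done")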

pos = ["10", "101", "100", "1010", "1011", "1000", "1001"]
neg = ["", "0", "1", "00", "11", "010"]

makeInputFile(pos, neg)

The script creates an `input.txt` text file in the `content` directory. Please take a look at this file. Here are some brief notes:

- To indicate positive and negative sets, we use "++" and "--", respectively. You can include or exclude quotation marks around your strings, and the algorithm will ignore any spaces in the strings to avoid potential issues. 

- To indicate `epsilon (ε)` as the empty string, you can use "" in your positive or negative sets. Any extra symbol will be treated as a new symbol of the alphabet.

- Our algorithm can handle larger alphabets, and any mistakes in the input can make the search space much bigger. So, please create/modify the input file carefully.

- The current version of the code supports inputs where `infix-closure(Pos u Neg) <= 126 bits` (For more information, please refer to our paper). Overall, if you exceed the limit, the algorithm will notify you. To resolve this issue, please try to shorten your strings, especially the longer ones.

- We use a uniform cost function (`1 1 1 1 1` in this case) to indicate the cost of each `alphabet`, `question mark`, `star`, `concatenation`, and `union`, respectively. If one constructor has a higher cost (i.e., is more expensive) than the others, the resulting regular expression tends to contain fewer occurrences of that constructor. For additional examples, please refer to Section 3 of the paper.

- The last integer (`500` in this example) sets the maximum cost of the search. You can leave it as it is or use any reasonably large integer for your cost function.
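
To get a feeling for the file format and the size limit described in the notes above, the following sketch reads the examples back and counts the distinct infixes (substrings) of all strings. Both `parse_examples` and `infix_closure` are hypothetical helpers written for this notebook, not the parser used by the C++/CUDA code. The sketch assumes that the limit counts distinct infixes including the empty string; the exact accounting in the implementation may differ, so treat the number as an estimate.

```python
def parse_examples(text):
    """Split the text of an input file into (pos, neg) lists."""
    pos, neg = [], []
    current = None                      # nothing is collected before "++"
    for line in text.splitlines():
        stripped = line.replace(" ", "")  # spaces are ignored
        if stripped == "":
            continue                    # skip blank lines
        if stripped == "++":
            current = pos
        elif stripped == "--":
            current = neg
        elif current is not None:
            # Quotation marks are optional; "" denotes the empty string.
            if len(stripped) >= 2 and stripped[0] == stripped[-1] == '"':
                stripped = stripped[1:-1]
            current.append(stripped)
    return pos, neg

def infix_closure(strings):
    """Return the set of all substrings of the given strings (assumed to include "")."""
    infixes = {""}
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                infixes.add(s[i:j])
    return infixes

text = 'Exp\n++\n"10"\n"101"\n--\n""\n"0"'
pos, neg = parse_examples(text)
print(pos, neg)
print(len(infix_closure(pos + neg)))
```

For a real file, replace `text` with `open("input.txt").read()`. If the printed count is far above 126, shortening the longest strings reduces it fastest, because a string of length `n` can contribute up to `n * (n + 1) / 2` infixes.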

Try using your own input (by modifying the `input.txt` file) and your own cost function, and run the code with the following script.

In [None]:
# Check if connected to GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
  print("At 'Runtime > Change runtime type', please choose `GPU` for `Hardware accelerator`")
else:
    ! ./gpu126 ./input.txt 1 1 1 1 1 500

# <strong>CPU version of the algorithm</strong>

Google Colab might use different CPUs at different times. This is because Colab is a cloud-based platform that relies on Google's infrastructure to run its operations. If you need to compare the CPU specifications used in our paper with the current CPU in this notebook, you can run the following script:

In [None]:
!cat /proc/cpuinfo

<div class="markdown-google-sans">
<h1>Compiling the CPU code</h1>
</div>

To compile and run the CPU version of our algorithm, please run the following scripts.

**NOTE:** You can add `-D MEASUREMENT_MODE` to measure the running time instead of printing logs.

In [None]:
import os
if not os.path.isdir("/content/paresy"):
    ! git clone https://github.com/MojtabaValizadeh/paresy.git
! g++ -O3 /content/paresy/code/cpu_version/cpu128.cpp -o cpu128
print ("Done")

<div class="markdown-google-sans">
<h1>Running the CPU code</h1>
</div>

Generally, you can use
```
! ./cpu128 <input_file_address> <c1> <c2> <c3> <c4> <c5> <maxCost>
```
where
1. `input_file_address` refers to the address of the input file that contains your positive and negative examples.
2. (`c1`, `c2`, `c3`, `c4`, `c5`) are 5 small positive integers for the costs of (a, ?, *, ., +)
3. `maxCost` parameter is an integer that sets an upper limit on the cost of the regular expression that the algorithm will search for. In most cases, you can use a reasonably large integer, such as 500, which is appropriate for our cost functions.

For example, to run the first example of `type1` in our benchmarks, use the code below.

In [None]:
! ./cpu128 /content/paresy/benchmarks/type1/type1_exp1.txt 1 1 1 1 1 500

# <strong>Artifact</strong>

This section provides you with the opportunity to replicate the results presented in our paper.

**Important:** To achieve higher memory and performance, we recommend using `Colab Pro` (or `Pro+`). The `free version` of Colab may run codes much slower, which can significantly impact the results.

**Note:** In the `content > paresy > code > exp_gen` directory, we have included a script that we used to generate random strings from both `type 1` and `type 2` categories. You can upload this notebook and use it to generate additional test suites using different `seeds`. We have omitted further details to avoid making this notebook excessively large. For more information, please refer to `Section 4.3, Benchmark Construction`, in our paper.

Now, before running a group of test suites, we recommend that you take a look at our benchmarks in the `content > paresy > benchmarks`.

**Note:** As we mentioned before, to see the content in Colab, you can open the file explorer by clicking on the folder icon on the left side of the Colab notebook interface. You can navigate through the folders and subfolders to find the file you want to see.

Our benchmarks are divided into two types of examples: `type1` and `type2` (the `alpha_regex` folder contains benchmarks for the CPU section). Each type has a different number of examples: `type1` has `200` examples, while `type2` has `230` examples.

## <strong>Threats to Validity</strong>

The virtualization of the GPU and CPU resources in Google Colab Pro is not fully transparent, which can potentially impact the reproducibility of measurements. The extent of virtualization can vary over time, which may introduce inconsistencies in performance results across different runs. For more information, please see `4.2 Threats to Validity` in the [paper](https://users.sussex.ac.uk/~mfb21/pldi23-draft-updated.pdf).

## <strong>Figure 1</strong>

In this section, we have rewritten the script to rebuild Figure 1. In our study, we executed our GPU code on all examples for all cost functions, and then filtered out any results that ran out of memory or time. Finally, we generated Figure 1 using the remaining data to demonstrate the impact of different cost functions on running times.

Given that this process can take several hours, we have defined a parameter called "step," which runs the code through the 0-th, step-th, 2*step-th, etc. examples in order to create a figure that is similar to the one presented in our paper, but in a shorter amount of time. You are welcome to adjust this parameter based on your time limitations.
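
The effect of `step` on the amount of work can be previewed with a few lines of Python. This is a small illustration with made-up sizes; `selected_indices` is a hypothetical helper and the list length `10` is not the size of the benchmark set.

```python
import math

def selected_indices(num_examples, step):
    # Indices 0, step, 2*step, ... that are smaller than num_examples
    return list(range(0, num_examples, step))

print(selected_indices(10, 1))   # every example
print(selected_indices(10, 3))   # every third example
print(len(selected_indices(10, 3)) == math.ceil(10 / 3))
```

In general, `step = k` runs `ceil(n / k)` of the `n` examples for each cost function, so `step = 2` roughly halves the running time.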

**Note:** We encountered several issues while using Google Colab Pro. For instance, after a specific time, it became exceedingly slow, even though we were using the premium version. Therefore, we attempted to halt the process, save the current data, reconnect to Colab, and then resume the process after a brief pause. Additionally, to ensure greater accuracy, we ran our code on all the examples three times, with each run shuffling the examples. We then averaged the final results to create `Figure 1`. Now, running a smaller portion of the examples (such as half of them, which should take around 45 minutes with A100 GPU) will hopefully produce a similar figure.

Please run the following script.

In [None]:
import os
import random
import subprocess

# Check if connected to GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
    print("At 'Runtime > Change runtime type', please choose `GPU` for `Hardware accelerator`")
else:
    # recompiling the GPU version of PaRESy in its MEASUREMENT_MODE
    print("Recompiling the code...")
    if not os.path.isdir("/content/paresy"):
        ! git clone https://github.com/MojtabaValizadeh/paresy.git
    ! nvcc --extended-lambda -D MEASUREMENT_MODE -I /content/paresy/code/gpu_version/modified_libraries /content/paresy/code/gpu_version/gpu126.cu -o gpu126
    print()

    def runGPU(typ, exp_idx, costfun, maxCost, timeOut):

        c1 = costfun[0]
        c2 = costfun[1]
        c3 = costfun[2]
        c4 = costfun[3]
        c5 = costfun[4]

        if typ == "type1":
            input = "/content/paresy/benchmarks/type1/type1_exp" + str(exp_idx) + ".txt"
        else:
            input = "/content/paresy/benchmarks/type2/type2_exp" + str(exp_idx) + ".txt"

        try:
            output = subprocess.run(
                ["./gpu126",
                input,
                str(c1), str(c2), str(c3), str(c4), str(c5),
                str(maxCost)],
                stdout = subprocess.PIPE,
                stderr = subprocess.PIPE,
                timeout = timeOut
            )
            return(str(output.stdout).replace('\\n', '\n')[1:])
        except subprocess.TimeoutExpired:
            return "timeOut!"

    print("ready!")

After executing the code with all the examples, a figure will be generated at the end. We have incorporated a progress bar to help you track the process. The progress bar will move quickly for the easier examples and slower for the more complex ones. To proceed, please run the following script.

In [None]:
import numpy as np
from tqdm import tqdm
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

step = 2         # uses examples 0, step, 2*step, ... from x_axis. So, step = 1 means the full test
maxCost = 500    # a reasonably large integer which has been used for our cost functions
timeOut = 5      # time out in sec

costFun = [
    [1, 1, 1, 1, 1],
    [10, 1, 1, 1, 1],
    [1, 10, 1, 1, 1],
    [1, 1, 10, 1, 1],
    [1, 1, 1, 10, 1],
    [1, 1, 1, 1, 10],
    [1, 10, 10, 10, 10],
    [10, 1, 10, 10, 10],
    [10, 10, 1, 10, 10],
    [10, 10, 10, 1, 10],
    [10, 10, 10, 10, 1], 
    [20, 20, 20, 5, 30]
]

# Examples "type_index" in x axis of Figure 1
x_axis = [
    "2_45", "1_162", "2_34", "2_3", "2_76", "2_59", "2_11", "1_152", "2_95", "2_142", "2_4", "2_48", "1_4", "2_19", "2_63", "2_26", "2_38", "2_1",
    "1_5", "2_98", "1_155", "2_103", "2_69", "2_56", "1_100", "2_41", "1_89", "2_156", "1_67", "1_44", "2_29", "1_35", "2_152", "2_33", "1_98",
    "2_21", "2_183", "1_102", "1_3", "2_14", "2_61", "2_137", "1_153", "1_87", "1_11", "1_21", "2_39", "1_6", "2_115", "2_60", "1_42", "1_134",
    "1_52", "1_36", "1_58", "1_1", "2_102", "1_46", "2_36", "2_65", "1_81", "1_14", "2_24", "1_23", "1_154", "1_43", "1_118", "2_5", "2_35", "2_20",
    "2_80", "2_16", "1_48", "2_13", "2_106", "2_9", "1_28", "1_19", "1_37", "2_17", "1_169", "1_24", "2_171", "2_7", "1_29", "2_31", "2_52", "1_101",
    "1_13", "2_126", "1_95", "1_92", "1_39", "1_18", "2_194", "2_84", "2_57", "2_23", "2_8", "1_15", "2_2", "1_149", "1_10", "1_7", "2_96", "2_82",
    "2_30", "1_84", "1_94", "2_162", "1_22", "2_27", "1_9", "2_10", "2_148", "1_119", "2_47", "1_131", "2_91", "1_90", "1_32", "1_8", "1_71", "2_75",
    "1_17", "1_173", "2_6", "1_25", "1_59", "1_103", "2_49", "2_78", "1_112", "2_22", "1_60", "1_176", "2_79", "1_77", "2_62", "2_25", "2_42", "1_68",
    "2_12", "2_15", "1_72", "2_120", "1_2", "1_80", "1_16", "1_12", "2_74", "2_87", "1_45", "2_40", "1_127", "2_44", "1_156", "1_38", "2_67", "2_71",
    "2_37", "2_157", "2_43", "1_128", "1_199", "2_138", "2_169", "1_30", "1_123", "1_86", "2_99", "2_127", "1_53", "2_66", "1_64", "2_54", "1_76",
    "2_28", "2_94", "2_155", "2_55", "1_105", "1_145", "1_136", "1_122", "2_141", "1_40", "2_46", "1_91", "2_135", "2_51", "2_64", "2_122", "1_120",
    "1_114", "2_111", "2_117", "2_188", "1_110", "1_26", "1_133", "2_175", "1_27", "2_72", "2_121", "2_110", "2_130", "1_56", "2_18", "2_134", "1_83",
    "1_135", "1_144", "2_32", "2_118", "2_50", "2_89", "1_54", "1_142", "2_77", "2_153", "2_70", "1_47", "2_92", "2_104", "1_113", "2_100", "1_187",
    "1_125", "2_58", "1_129", "1_85", "1_143", "1_108", "1_63", "2_123", "1_168", "2_68", "1_34", "1_124", "1_104", "1_93", "1_170", "2_170", "1_147",
    "2_200", "1_181", "2_132", "1_75", "1_31", "1_62", "2_128", "2_202", "1_167", "2_213", "2_97", "2_189", "2_124", "2_116", "1_55", "1_33", "1_188",
    "1_141", "2_112", "2_146", "1_140", "1_163", "2_86", "1_41", "2_101", "2_90", "1_109", "2_217", "2_108", "2_165", "2_133", "2_174", "2_154", "1_97",
    "2_85", "1_74", "2_161", "1_66", "1_126", "2_184", "2_81", "2_185", "1_51", "1_57", "1_69", "2_107", "1_158", "2_113", "1_193", "2_83", "2_150", "2_136",
    "2_158", "1_49", "1_61", "1_73", "2_88", "1_50", "1_20", "2_164", "2_181", "2_196", "1_117", "1_180", "2_201", "1_121", "2_173", "1_161", "1_174", "1_65",
    "1_150", "2_129", "1_116", "2_105", "1_106", "2_93", "2_205", "1_70", "2_140", "1_88", "1_111", "1_96", "2_177", "2_203", "2_125", "2_193", "1_164", "1_79",
    "2_163", "2_53", "2_73", "1_179", "2_145", "1_107", "2_209", "1_151", "2_180", "1_184", "1_194", "1_132", "1_115", "1_137", "2_191", "2_139", "2_114", "2_187"
]

data = []
n = len(costFun)
m = (len(x_axis)//step) if len(x_axis)%step == 0 else (len(x_axis)//step + 1)
with tqdm(total = n * m) as pbar:
    for cf in costFun:
        lst = []
        for i in range (0, len(x_axis), step):
            typ, exp_idx = x_axis[i].split("_")
            output = runGPU(typ, exp_idx, cf, maxCost, timeOut)
            if "Running Time:" in output and "not_found" not in output:
                lst.append(float(output.split("Running Time: ")[1].split()[0]))
            else:
                lst.append(timeOut + 1)
            pbar.update(1)
            print(output)
        data.append(lst)

for i in range(len(data) - 1, -1, -1):
    sorted_indices = sorted(range(len(data[0])), key=lambda x: data[i][x])
    data = [[row[i] for i in sorted_indices] for row in data]

xTicks = []
for i in range (0, len(x_axis), step):
    xTicks.append(x_axis[i])

figure(figsize = (30, 15), dpi = 75)
labels = ["(" + x + ")" for x in [', '.join(map(str, cf)) for cf in costFun]]
colours = cm.rainbow(np.linspace(0, 1, len(data)))

for i in range(len(data) - 1, -1, -1):
    plt.scatter(np.arange(len(data[i])), data[i], color = colours[i],
        alpha = 1, marker = "o", label = labels[i], s = 200)

plt.grid()
plt.xticks(size = 10)
plt.yticks(size = 20)
plt.xticks(rotation = 90)
plt.xlabel('Examples', fontsize = 20, labelpad = 50)
plt.ylabel('Running Time (sec)', fontsize = 20, labelpad = 50)
plt.xticks(np.arange(len(xTicks)), xTicks)
plt.gca().set_ylim([0, timeOut])
handles, labels = plt.gca().get_legend_handles_labels()
plt.legend([handles[i] for i in range(11, -1, -1)], [labels[i] for i in range(11, -1, -1)], title = 'Cost Function', 
    title_fontsize = 0, loc = 'upper left',  bbox_to_anchor = (0.05, 0.99), prop = {'size': 15})

plt.show()

## <strong>Table 1</strong>

Running the `24 hardest examples` of `Table 1` should take only a few minutes on the GPU (while it might take 1-2 days with the CPU version in the next section). To simplify the evaluation process, we have provided the type, example number, and cost function used for each example in Table 1.

### <strong>GPU part of Table 1</strong>

Please run the following code.

In [None]:
import os
import subprocess

# Check if connected to GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
  print("At 'Runtime > Change runtime type', please choose `GPU` for `Hardware accelerator`")
else:
    # recompiling the GPU version of PaRESy in its MEASUREMENT_MODE
    print("Recompiling the code...")
    if not os.path.isdir("/content/paresy"):
        ! git clone https://github.com/MojtabaValizadeh/paresy.git
    ! nvcc --extended-lambda -D MEASUREMENT_MODE -I /content/paresy/code/gpu_version/modified_libraries /content/paresy/code/gpu_version/gpu126.cu -o gpu126
    print()

    def runGPU(typ, exp_idx, costfun, maxSize, timeOut):

        c1 = costfun[0]
        c2 = costfun[1]
        c3 = costfun[2]
        c4 = costfun[3]
        c5 = costfun[4]

        if typ == "type1":
            input = "/content/paresy/benchmarks/type1/type1_exp" + str(exp_idx) + ".txt"
        else:
            input = "/content/paresy/benchmarks/type2/type2_exp" + str(exp_idx) + ".txt"

        try:
            output = subprocess.run(
                ["./gpu126",
                input,
                str(c1), str(c2), str(c3), str(c4), str(c5),
                str(maxSize)],
                stdout = subprocess.PIPE,
                stderr = subprocess.PIPE,
                timeout = timeOut
            )
            return(str(output.stdout).replace('\\n', '\n')[1:])
        except subprocess.TimeoutExpired:
            return "timeOut!"

    # Table 1, which includes the 24 hardest examples
    hardestExps = [
        ["type1", 50, [1, 1, 1, 1, 1]],
        ["type1", 51, [10, 1, 1, 1, 1]],
        ["type1", 73, [1, 10, 1, 1, 1]],
        ["type1", 20, [1, 1, 10, 1, 1]],
        ["type1", 73, [1, 1, 1, 10, 1]],
        ["type1", 31, [1, 1, 1, 1, 10]],
        ["type1", 57, [10, 10, 10, 10, 1]],
        ["type1", 50, [10, 10, 10, 1, 10]],
        ["type1", 57, [10, 10, 1, 10, 10]],
        ["type1", 97, [10, 1, 10, 10, 10]],
        ["type1", 61, [1, 10, 10, 10, 10]],
        ["type1", 88, [20, 20, 20, 5, 30]],
        ["type2", 88,  [1, 1, 1, 1, 1]],
        ["type2", 150, [10, 1, 1, 1, 1]],
        ["type2", 158, [1, 10, 1, 1, 1]],
        ["type2", 136, [1, 1, 10, 1, 1]],
        ["type2", 107, [1, 1, 1, 10, 1]],
        ["type2", 32,  [1, 1, 1, 1, 10]],
        ["type2", 136, [10, 10, 10, 10, 1]],
        ["type2", 200, [10, 10, 10, 1, 10]],
        ["type2", 107, [10, 10, 1, 10, 10]],
        ["type2", 81,  [10, 1, 10, 10, 10]],
        ["type2", 88,  [1, 10, 10, 10, 10]],
        ["type2", 158, [20, 20, 20, 5, 30]]
    ]

    timeOut = 20
    maxCost = 500 # a reasonably large integer which has been used for our cost functions

    for hardExp in hardestExps:
        print(str(hardExp[0]) + ", Exp " + str(hardExp[1]) + ":")
        print(runGPU(hardExp[0], hardExp[1], hardExp[2], maxCost, timeOut))
        print()

    print("Done")

### <strong>CPU part of Table 1</strong>

Before the main process, please run the code below.


In [None]:
import os
import subprocess

# recompiling the CPU version of PaRESy in its MEASUREMENT_MODE
print("Recompiling the code...")
if not os.path.isdir("/content/paresy"):
    ! git clone https://github.com/MojtabaValizadeh/paresy.git
! g++ -O3 -D MEASUREMENT_MODE /content/paresy/code/cpu_version/cpu128.cpp -o cpu128
print()

def runCPU(typ, exp_idx, costfun, maxSize, timeOut):

    c1 = costfun[0]
    c2 = costfun[1]
    c3 = costfun[2]
    c4 = costfun[3]
    c5 = costfun[4]

    input = "/content/paresy/benchmarks/"

    if typ == "type1":
        input += "type1/type1_exp"
    elif typ == "type2":
        input += "type2/type2_exp"
    else:
        input += "alpha_regex/no"

    input += str(exp_idx) + ".txt"

    try:
        output = subprocess.run(
            ["./cpu128",
            input,
            str(c1), str(c2), str(c3), str(c4), str(c5),
            str(maxSize)],
            stdout = subprocess.PIPE,
            stderr = subprocess.PIPE,
            timeout = timeOut
        )
        return(str(output.stdout).replace('\\n', '\n')[1:])
    except subprocess.TimeoutExpired:
        return "timeOut!"

print ("Ready")


To evaluate the speed-up of the GPU version of our algorithm compared to the CPU version, in `Table 1`, we selected the hardest examples from our test suite that took the longest to execute on the GPU. The average running time for these 24 examples on the GPU was approximately 4.1 seconds. The GPU version is more than `1000` times faster than the CPU version, which means that it could take a long time (about 1-2 hours per example) to execute these examples on the CPU version. 
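
The per-example estimate follows from simple arithmetic on the two figures quoted above. The snippet below is only a back-of-the-envelope calculation that treats the `1000`-fold speed-up as a lower bound and applies it to the average GPU time; actual CPU times vary per example.

```python
gpu_seconds = 4.1    # average GPU running time per example (from the text above)
speedup = 1000       # lower bound on the GPU speed-up (from the text above)

cpu_seconds = gpu_seconds * speedup
cpu_hours = cpu_seconds / 3600

print(f"{cpu_seconds:.0f} s = {cpu_hours:.2f} h per example")
print(f"{24 * cpu_hours:.1f} h for all 24 examples")
```

This gives a little over one hour per example and more than a day for all 24 examples on the CPU.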

Therefore, we recommend using the simpler examples from our test suite to evaluate the CPU version of our algorithm. To facilitate this process, we have provided a script that executes the first few examples from `type1` using the CPU version. These examples are relatively easy in our experience and should only take a few minutes to run.

In [None]:
timeOut = 300
maxCost = 500 # a reasonably large integer which has been used for our cost functions
easierExps = [3, 4, 5, 6, 8] # the first 5 examples which were easier in our experience
costFun = [1, 1, 1, 1, 1] # this cost function assigns equal cost to all the constructors

# running the first few examples from `type1` through a normal cost function
for expIdx in easierExps:
    print("Type1, Exp " + str(expIdx) + ":")
    print(runCPU("type1", expIdx, costFun, maxCost, timeOut))
    print()

print("Done")


However, if you have enough time to execute a portion or all of the most challenging examples using the CPU version, we have provided a script for this purpose.

In [None]:
# Important: Each example may require 1-2 hours to run on the CPU version.
# You may remove items from the list if you wish.

# Table 1 which includes the 24 hardest examples
hardestExps = [
    ["type1", 50, [1, 1, 1, 1, 1]],
    ["type1", 51, [10, 1, 1, 1, 1]],
    ["type1", 73, [1, 10, 1, 1, 1]],
    ["type1", 20, [1, 1, 10, 1, 1]],
    ["type1", 73, [1, 1, 1, 10, 1]],
    ["type1", 31, [1, 1, 1, 1, 10]],
    ["type1", 57, [10, 10, 10, 10, 1]],
    ["type1", 50, [10, 10, 10, 1, 10]],
    ["type1", 57, [10, 10, 1, 10, 10]],
    ["type1", 97, [10, 1, 10, 10, 10]],
    ["type1", 61, [1, 10, 10, 10, 10]],
    ["type1", 88, [20, 20, 20, 5, 30]],
    ["type2", 88,  [1, 1, 1, 1, 1]],
    ["type2", 150, [10, 1, 1, 1, 1]],
    ["type2", 158, [1, 10, 1, 1, 1]],
    ["type2", 136, [1, 1, 10, 1, 1]],
    ["type2", 107, [1, 1, 1, 10, 1]],
    ["type2", 32,  [1, 1, 1, 1, 10]],
    ["type2", 136, [10, 10, 10, 10, 1]],
    ["type2", 200, [10, 10, 10, 1, 10]],
    ["type2", 107, [10, 10, 1, 10, 10]],
    ["type2", 81,  [10, 1, 10, 10, 10]],
    ["type2", 88,  [1, 10, 10, 10, 10]],
    ["type2", 158, [20, 20, 20, 5, 30]]
]

timeOut = 20000
maxCost = 500 # a reasonably large integer which has been used for our cost functions

for hardExp in hardestExps:
    print(str(hardExp[0]) + ", Exp " + str(hardExp[1]) + ":")
    print(runCPU(hardExp[0], hardExp[1], hardExp[2], maxCost, timeOut))
    print()


## <strong>Table 2</strong>

In `Table 2`, we compare two implementations:
1. The CPU version of our algorithm `PaRESy`, implemented in `C++`
2. AlphaRegex, a related work implemented in `OCaml`

To conduct this comparison, we used all `25 examples` that AlphaRegex employs in its evaluation. The examples are located in the `content > paresy > benchmarks > alpha_regex` directory.

**Note:** Since you are using a Colab notebook to execute both codes, please anticipate longer latency in the running times.

Prior to assessing the codes against the benchmarks, please execute the following script.

In [None]:
import os
import subprocess

# recompiling the CPU version of PaRESy
print("Recompiling the code...")
if not os.path.isdir("/content/paresy"):
    ! git clone https://github.com/MojtabaValizadeh/paresy.git
! g++ -O3 /content/paresy/code/cpu_version/cpu128.cpp -o cpu128
print()

def runCPU(typ, exp_idx, costfun, maxSize, timeOut):

    c1 = costfun[0]
    c2 = costfun[1]
    c3 = costfun[2]
    c4 = costfun[3]
    c5 = costfun[4]

    input = "/content/paresy/benchmarks/"

    if typ == "type1":
        input += "type1/type1_exp"
    elif typ == "type2":
        input += "type2/type2_exp"
    else:
        input += "alpha_regex/no"

    input += str(exp_idx) + ".txt"

    try:
        output = subprocess.run(
            ["./cpu128",
            input,
            str(c1), str(c2), str(c3), str(c4), str(c5),
            str(maxSize)],
            stdout = subprocess.PIPE,
            stderr = subprocess.PIPE,
            timeout = timeOut
        )
        return(str(output.stdout).replace('\\n', '\n')[1:])
    except subprocess.TimeoutExpired:
        return "timeOut!"

print ("Ready")

Please note that the related work, AlphaRegex, uses a hard-coded cost function with values (`20`, `20`, `20`, `5`, `30`) assigned to (a, ?, *, ., +). Consequently, we use their cost function for all these benchmarks.

To execute our algorithm's CPU version against `25 examples` from the AlphaRegex benchmarks, please run the following script.

In [None]:
timeOut = 200
ARcostFun = [20, 20, 20, 5, 30]
maxCost = 500 # a reasonably large integer which has been used for our cost functions

for exp_idx in range(1, 26):
    print("AlphaRegex benchmarks, no" + str(exp_idx) + ":")
    print(runCPU("alpharegex", exp_idx, ARcostFun, maxCost, timeOut))
    print()

# <strong>More Info</strong>

If you need further information, please refer to our [paper](https://users.sussex.ac.uk/~mfb21/pldi23-draft-updated.pdf) or feel free to contact the authors directly.