To reproduce results for the ground-truth datasets #55

Closed
yoshitomo-matsubara opened this issue Sep 3, 2021 · 13 comments

@yoshitomo-matsubara

yoshitomo-matsubara commented Sep 3, 2021

Thanks @lacava for helping me resolve the dataset issue last time.

Based on the command in the README, I tried to reproduce the results reported in Figure 3 of your recently accepted paper for both the Strogatz and Feynman datasets, and I have some concerns/questions.

1. How should we interpret the produced results?

For the Strogatz datasets, I ran:
python analyze.py -results ../results_sym_data -target_noise 0.0 "/path/to/pmlb/datasets/strogatz*" -sym_data -n_trials 10 -time_limit 9:00 -tuned --local

Following that, many JSON files were produced. In strogatz_bacres1_tuned.FE_AFPRegressor_15795.json (AFP_FE), I found the following values:

>>> with open('strogatz_bacres1_tuned.FE_AFPRegressor_15795.json', 'r') as fp:
...    fe_afp_result = json.load(fp)
...
>>> fe_afp_result['r2_test']
0.9984022413915545

>>> fe_afp_result['symbolic_model']
'(log(((((((-0.567/cos(1.486))^2)/(x_1+(x_0/exp((cos(((0.069^2)*(x_0*
(x_0+cos((cos(((-0.065^2)*(x_0*x_0)))*x_1))))))^3)))))^3)^3)+exp(sin(log(((sqrt(|
(cos(((-0.064^2)*(x_0*x_0)))^3)|)/((0.286+
(x_1*0.017))^2))^2))))))*cos((log(x_0)/(log((x_0^3))-x_1))))'

>>> fe_afp_result['true_model']
' 20 - x - \\frac{x \\cdot y}{1+0.5 \\cdot x^2}$'

I think a) r2_test is what is called Accuracy in the paper, b) symbolic_model is the symbolic expression resulting from training on strogatz_bacres1, and c) the true symbolic expression is given by true_model.

Is my understanding correct for all a), b), c)?
Also, is the above symbolic_model the expected output of AFP_FE for strogatz_bacres1? Since the method is the 2nd best on the ground-truth datasets shown in Fig. 3, I expected a cleaner expression.

2. How is the solution rate derived?

Could you please clarify how the solution rate in Fig. 3 is derived?
Did you manually compare the produced expression symbolic_model to the true expression true_model and consider it solved only when the produced expression exactly matches the true one?

Or, if it is fully based on Definition 4.1 (Symbolic Solution) in the paper, what values of a and b are used in Fig. 3?

3. Operon build failed

On Ubuntu 18.04 and 20.04, the Operon build with your provided install.sh failed due to a version discrepancy between libceres-dev (which expects Eigen 3.4.0) and libeigen3-dev (whose latest available version is 3.3.7). I even tried to build Eigen v3.4.0 from source, but the build still failed.
Do you remember how you set up the dependencies for Operon?

4. Commands to reproduce the results in Fig. 3

Could you provide the exact commands to reproduce the results in Fig. 3?
For the Strogatz datasets with target noise = 0.0, I think the following command was used:
python analyze.py -results ../results_sym_data -target_noise 0.0 "/path/to/pmlb/datasets/strogatz*" -sym_data -n_trials 10 -time_limit 9:00 -tuned

but how about the Feynman datasets?
Also, how should we determine -time_limit?

5. Computing resource and estimated runtime

To estimate how long it will take to reproduce the results in Fig. 3, could you share the details of the computing resources used in the paper, e.g., how many machines with 24-28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz chipsets and 250 GB of RAM were used, and (roughly) how long it took to get the results, if you remember?

On a machine with a 4-core CPU, 128 GB RAM, and 2 GPUs, even strogatz_bacres1 (400 samples) is taking more than a day to complete with python analyze.py -results ../results_sym_data -target_noise 0.0 /path/to/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz -sym_data -n_trials 10 -time_limit 9:00 -tuned --local

Sorry for the many questions, but your responses would be greatly appreciated and helpful for using this great work in my research.

Thank you!

@lacava
Member

lacava commented Sep 10, 2021

Hi @yoshitomo-matsubara, sorry for the delay, and thanks for taking a deep dive. I will try to answer all your questions below. Some of them involve improvements to the repo workflow that we should separate out as their own issues.

Thanks @lacava for helping me resolve the dataset issue last time.

You're welcome

Based on the command in the README, I tried to reproduce the results reported in Figure 3 of your recently accepted paper for both the Strogatz and Feynman datasets, and I have some concerns/questions.

1. How should we interpret the produced results?

These are the full steps:

  1. This is the full script for running the ground truth experiment:
for start_seed in 0 1 2 3 10 11 12 13 14 15; do
    for data in "/path/to/pmlb/datasets/strogatz_" "/path/to/pmlb/datasets/feynman_" ; do
        for TN in 0 0.001 0.01 0.1; do
            python analyze.py -results ../results_sym_data -target_noise $TN $data"*" -sym_data -starting_seed $start_seed -n_trials 1 -m 16384 -time_limit 9:00 -job_limit 100000 -tuned
            if [ $? -gt 0 ] ; then
                break
            fi
        done
    done
done

The job_limit parameter was necessary just to avoid overloading our cluster. analyze.py can safely be run multiple times; in that case it only runs experiments it does not find in the queue or in the results. Note: the start_seed values are non-contiguous because we initially submitted results on two clusters, divvying up the seeds, but the second cluster had too many scheduling issues to use reliably. When we updated the noise levels in our revision, we did so with the results from these seeds.

  2. Following this, we used a post-processing script to perform post-run simplification of the models and make them sympy-compatible. This was done post hoc because, at the moment, not all methods provide string outputs that are compatible with sympy. It can be run by passing the assess_symbolic_model script to analyze.py, plugging it into the same loop above:
python analyze.py -results ../results_sym_data -target_noise $TN $data"*" -sym_data -starting_seed $start_seed -n_trials 1 -m 8192 -time_limit 2:00 -job_limit 100000 -tuned -script assess_symbolic_model

Here we limited jobs to 2 hours and 8 GB RAM. The script will only submit jobs for results it finds in the results folder (i.e., .json files).

This step creates .json.updated files containing additional key-value pairs, including simplified_symbolic_model; see the example below.
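For instance, a post-processed result can be inspected the same way as the raw .json above. This is just an illustration: it assumes the .updated file is itself plain JSON, and the filename simply appends .updated to the earlier example.

import json

# Hypothetical post-processed result file: the earlier .json plus '.updated'.
with open('strogatz_bacres1_tuned.FE_AFPRegressor_15795.json.updated', 'r') as fp:
    result = json.load(fp)

# The post-run simplification adds keys such as simplified_symbolic_model
# alongside the original ones (r2_test, symbolic_model, true_model, ...).
print(result['simplified_symbolic_model'])
print(result['r2_test'])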

  3. After running the symbolic model assessment (step 2 above), the results files are collated using collate_groundtruth_results.py, which produces the .feather file used when generating the figures in groundtruth_results.ipynb; a rough sketch of reading that file follows below.
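As a rough sketch of that collation output (not the exact code in groundtruth_results.ipynb; the file name and column names below are assumptions for illustration), the .feather file can be loaded with pandas and a per-algorithm solution rate computed:

import pandas as pd

# Load the collated ground-truth results (hypothetical file name).
df = pd.read_feather('results_sym_data.feather')

# Solution rate per algorithm: the fraction of runs whose simplified model was
# judged a symbolic solution. 'algorithm' and 'symbolic_solution' are assumed
# column names; the real ones may differ.
solution_rate = (
    df.groupby('algorithm')['symbolic_solution']
      .mean()
      .sort_values(ascending=False)
)
print(solution_rate)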

I think a) r2_test is what is called Accuracy in the paper, b) symbolic_model is the symbolic expression resulting from training on strogatz_bacres1, and c) the true symbolic expression is given by true_model.

Is my understanding correct for all a), b), c)?

a) Correct; this is also the R2 defined in Section 4.3.
b) Yes; however, we applied post-run simplification to the models as described above. Models are compared on the simplified_symbolic_model key.
c) Yes.

Also, is the above symbolic_model the expected output of AFP_FE for strogatz_bacres1? Since the method is the 2nd best on the ground-truth datasets shown in Fig. 3, I expected a cleaner expression.

simplified_symbolic_model will be simpler, but the overall solution rate for the Strogatz problems is also relatively low, and may be even lower for bacres1; I would have to check.
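To illustrate roughly what the post-run simplification does (a toy sketch, not the actual assess_symbolic_model code; the raw string below is made up and far shorter than the AFP_FE output above):

import sympy as sp

x0, x1 = sp.symbols('x_0 x_1')

# Made-up raw model string with noisy constants.
raw = '1.0000001*x_0 + 0.4999992*x_1*x_1'
expr = sp.sympify(raw, locals={'x_0': x0, 'x_1': x1})

# Round the floating-point constants, then simplify -- a rough imitation of the
# cleanup step that produces simplified_symbolic_model.
rounded = expr.xreplace({f: sp.Float(round(float(f), 3)) for f in expr.atoms(sp.Float)})
print(sp.simplify(rounded))  # roughly: 1.0*x_0 + 0.5*x_1**2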

2. How is the solution rate derived?

Could you please clarify how the solution rate in Fig. 3 is derived?
Did you manually compare the produced expression symbolic_model to the true expression true_model and consider it solved only when the produced expression exactly matches the true one?

No, solution rates are derived computationally, as shown in groundtruth_results.ipynb.

Or, if it is fully based on Definition 4.1 (Symbolic Solution) in the paper, what values of a and b are used in Fig. 3?

a and b can be any values subject to the definition constraints, because we are using sympy to symbolically compare the equations.
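A minimal sketch of what that symbolic comparison amounts to, assuming Definition 4.1 accepts a model that differs from the true expression by an additive constant a or a nonzero multiplicative constant b (the candidate model below is hypothetical, and this is not the exact code used in the repo):

import sympy as sp

x, y = sp.symbols('x y')

# True bacres1 model (from true_model above) and a hypothetical candidate
# that differs only by an additive constant.
phi     = 20 - x - x*y / (1 + 0.5*x**2)
phi_hat = 22 - x - x*y / (1 + 0.5*x**2)

diff  = sp.simplify(phi - phi_hat)   # should reduce to a constant a
ratio = sp.simplify(phi / phi_hat)   # or the ratio reduces to a constant b

# Accept if either comparison collapses to a constant; no particular numeric
# values of a or b need to be chosen -- sympy decides this symbolically.
is_solution = diff.is_constant() or (ratio.is_constant() and ratio != 0)
print(is_solution, diff)  # True -2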

3. Operon build failed

On Ubuntu 18.04 and 20.04, the Operon build with your provided install.sh failed due to a version discrepancy between libceres-dev (which expects Eigen 3.4.0) and libeigen3-dev (whose latest available version is 3.3.7). I even tried to build Eigen v3.4.0 from source, but the build still failed.
Do you remember how you set up the dependencies for Operon?

Paging @foolnotion.
I'm not sure, but it sounds like you're using the Debian package for Ceres, whereas I am using ceres-solver 2.0.0 in the environment.yml file.

4. Commands to reproduce the results in Fig. 3

Could you provide the exact commands to reproduce the results in Fig. 3?

See groundtruth_results.ipynb.

Also, how should we determine -time_limit?

We used 9:00, but methods are also limited internally (where possible) to 8 hours for the symbolic problems.

5. Computing resource and estimated runtime

To estimate how long it will take to reproduce the results in Fig. 3, could you share the details of the computing resources used in the paper, e.g., how many machines with 24-28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz chipsets and 250 GB of RAM were used, and (roughly) how long it took to get the results, if you remember?

We used a cluster with ~1100 cores, meaning we were able to run ~1000 jobs simultaneously. The maximum core hours for training the models is given in Table 2, assuming all methods use the whole budget (which is not the case). For the ground-truth datasets this is ~440K core hours; with 1000 cores that is 440 hours, or about 18 days. I recall it actually taking about 2 weeks, with the symbolic model assessment taking about 2 days.
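A quick back-of-the-envelope version of that calculation (the ~440K core-hour and ~1000-core figures are taken from the message above):

core_hours = 440_000   # approximate training budget for the ground-truth datasets
cores = 1_000          # jobs that can run simultaneously on the cluster

wall_clock_hours = core_hours / cores
print(wall_clock_hours, 'hours, i.e. about', round(wall_clock_hours / 24, 1), 'days')
# 440.0 hours, i.e. about 18.3 days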

On a machine with a 4-core CPU, 128 GB RAM, and 2 GPUs, even strogatz_bacres1 (400 samples) is taking more than a day to complete with python analyze.py -results ../results_sym_data -target_noise 0.0 /path/to/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz -sym_data -n_trials 10 -time_limit 9:00 -tuned --local

That is not surprising to me. Some of the benchmarked methods have very slow implementations.

Because of my access to a big cluster, I have offered to benchmark methods that are submitted to this repo, pending cluster availability. It is a lot of compute.

Sorry for the many questions, but your responses would be greatly appreciated and helpful for using this great work in my research.

Thank you!

Thanks for the detailed questions and let me know if anything is unclear.

@folivetti
Contributor

Regarding Operon, maybe running source activate srbench before running install.sh solves the issue.

In any case, I had another compilation problem and @foolnotion helped me out with a patch:

remove the following from environment.yml:

  • pybind11=2.6.1
  • tbb-devel=2020.2

then add the following into environment.yml:

  • taskflow=3.1.0
  • pybind11=2.6.2

In srbench/experiments/methods/src/operon_install.sh, remove the line:

git checkout 015d420944a64353a37e0493ae9be74c645b4198

@foolnotion
Contributor

@yoshitomo-matsubara It's best to let conda handle all the dependencies and not mix system libraries with conda/vcpkg/etc. The Ceres CMake module will complain if the detected Eigen version is different from the version Ceres itself was compiled with.

I will have to do a PR soon to update Operon. There is one last thing to finish before I do that, namely integrating NSGA2 into the Python module. Since Operon switched from intel-tbb to taskflow as its parallelism provider, if you do want to use the latest version you have to add taskflow to the conda environment. I've also created a Gitter community for Operon discussions and support: https://gitter.im/operongp/community, so if you have any more problems with Operon feel free to ask.

@yoshitomo-matsubara
Author

@lacava

Thank you for the detailed answers! I think many of my questions above are resolved now.

Or, if it is fully based on Definition 4.1 (Symbolic Solution) in the paper, what values of a and b are used in Fig. 3?
a and b can be any values subject to the definition constraints, because we are using sympy to symbolically compare the equations.

Could you clarify this point a little bit more? Does it mean the equation comparison (between the estimated and ground-truth equations) is completely left to sympy, and the constant parameters a and b are implicitly defined in the sympy package?

We used 9:00, but methods are also limited internally (where possible) to 8 hours for the symbolic problems.

At which level is -time_limit set? E.g., does -time_limit 9:00 mean 9 hours per equation (i.e., at most 9 hours per equation for each of the 14 methods)?

Because of my access to a big cluster, I have offered to benchmark methods that are submitted to this repo, pending cluster availability. It is a lot of compute.

Unfortunately, I do not have access to such a big cluster to distribute jobs and it would take forever to completely follow the experimental design.
What configurations would be reasonable if computing resources are limited? Maybe fewer random seeds (say 3) with -n_trials 1 per method?

@folivetti and @foolnotion

Thank you for the suggestions! I followed the suggestions from both of you, but unfortunately they didn't resolve the Eigen/libceres-dev issue.
I also confirmed that neither libeigen3-dev nor libceres-dev is installed (apt remove libeigen3-dev and apt remove libceres-dev in Docker) before the conda env setup.
Can I ask you to try installing Operon in a fresh environment (e.g., in a Docker container)? I have spent several days trying to make it work...

@lacava
Member

lacava commented Sep 14, 2021

Could you clarify this point a little bit more? Does it mean the equation comparison (between the estimated and ground-truth equations) is completely left to sympy, and the constant parameters a and b are implicitly defined in the sympy package?

Yes. Sympy evaluates the equations symbolically, meaning without any real values being passed in. So the result of comparing the true equation and the model is a symbolic expression, and it is assessed as such.

At which level is -time_limit set? E.g., does -time_limit 9:00 mean 9 hours per equation (i.e., at most 9 hours per equation for each of the 14 methods)?

This limits the training of a model on a single dataset for 1 trial. This time limit is sent to the job scheduler.

Unfortunately, I do not have access to such a big cluster to distribute jobs and it would take forever to completely follow the experimental design.

Here are some options, depending on your goals. Is your goal to reproduce the entire experiment or to compare against another method? We provided the .feather files with results from all methods so that people would not have to run the entire experiment. Similarly, if you are comparing a single algorithm, you can just generate results for that method. If that is still too much, submit it as a PR and I can run it on our cluster and update the repo.

What configurations would be reasonable if computing resources are limited? Maybe fewer random seeds (say 3) with -n_trials 1 per method?

"reasonable" is subjective and algorithm/problem dependent. IMO running three trials won't give a very good estimate of the likelihood of a randomized algorithm finding an exact solution to a specific problem. Averaged over all problems is perhaps less problematic, but could be misleading.

@yoshitomo-matsubara
Author

@lacava

Unfortunately, I do not have access to such a big cluster to distribute jobs and it would take forever to completely follow the experimental design.

Here are some options, depending on your goals. Is your goal to reproduce the entire experiment or to compare against another method? We provided the .feather files with results from all methods so that people would not have to run the entire experiment. Similarly, if you are comparing a single algorithm, you can just generate results for that method. If that is still too much, submit it as a PR and I can run it on our cluster and update the repo.

What configurations would be reasonable if computing resources are limited? Maybe fewer random seeds (say 3) with -n_trials 1 per method?

"reasonable" is subjective and algorithm/problem dependent. IMO running three trials won't give a very good estimate of the likelihood of a randomized algorithm finding an exact solution to a specific problem. Averaged over all problems is perhaps less problematic, but could be misleading.

Thank you for the further clarification and suggestions. My goals are 1) to apply the SR methods to my internal datasets (which cannot be shared at this moment) and 2) to compare their performance with my proposed SR model for a paper I'm working on. This is why I've been looking for a way to leverage this great project.

@foolnotion
Contributor

@yoshitomo-matsubara I will try to provide a Docker image for you, but I don't know how long that will take, as I don't have a lot of experience with Docker. In the meantime, please create an issue on our project's page and describe your installation steps.

Regarding the computational cost of running the benchmark, I've recently had a good experience with the AWS cloud. You can get a good price if you wait for spot instances to be available. Setting up an Ubuntu cloud machine with conda/srbench is pretty easy. I was able to run Operon/SRBench over one weekend on an AMD Epyc 96-core machine for just a little over 20€. Spot instances do have some caveats (you need to checkpoint your work often), but overall they are a good alternative.

@yoshitomo-matsubara
Author

@foolnotion Thank you for the offer. I just created a new issue for a Dockerfile: #56

@yoshitomo-matsubara
Author

@lacava

Inconsistency between the numbers above and those in Table 2

This is the full script for running the ground truth experiment:

for start_seed in 0 1 2 3 10 11 12 13 14 15; do
    for data in "/path/to/pmlb/datasets/strogatz_" "/path/to/pmlb/datasets/feynman_" ; do
        for TN in 0 0.001 0.01 0.1; do
            python analyze.py -results ../results_sym_data -target_noise $TN $data"*" -sym_data -starting_seed $start_seed -n_trials 1 -m 16384 -time_limit 9:00 -job_limit 100000 -tuned
            if [ $? -gt 0 ] ; then
                break
            fi
        done
    done
done

Table 2 of the paper (when accepted) says the No. of trials per dataset is 10, while the script above uses -n_trials 1 with 15 random seeds (15 trials per dataset). Which was used for the results reported in the paper?

Exact commands to complete train-to-evaluate pipeline

Besides the commands for training, could you please complete the instructions (exact commands) in the README to 1) train, 2) post-process the training results, and 3) evaluate?
Unfortunately, many components such as file paths are hardcoded in the scripts, which makes the complete pipeline difficult to follow.

Reopened the dataset issue

Since I found an issue with the Feynman datasets in PMLB, I reopened issue #54. Could you please address that issue as well?

Thank you so much

@lacava
Member

lacava commented Sep 16, 2021

@lacava

Inconsistency between the numbers above and those in Table 2

This is the full script for running the ground truth experiment:

for start_seed in 0 1 2 3 10 11 12 13 14 15; do
    for data in "/path/to/pmlb/datasets/strogatz_" "/path/to/pmlb/datasets/feynman_" ; do
        for TN in 0 0.001 0.01 0.1; do
            python analyze.py -results ../results_sym_data -target_noise $TN $data"*" -sym_data -starting_seed $start_seed -n_trials 1 -m 16384 -time_limit 9:00 -job_limit 100000 -tuned
            if [ $? -gt 0 ] ; then
                break
            fi
        done
    done
done

Table 2 of the paper (when accepted) says the No. of trials per dataset is 10, while the script above uses -n_trials 1 with 15 random seeds (15 trials per dataset). Which was used for the results reported in the paper?

You'll notice there aren't 15 seeds up there. As I mentioned above,

Note: the start_seed values are non-contiguous because we initially submitted results on two clusters, divvying up the seeds, but the second cluster had too many scheduling issues to use reliably. When we updated the noise levels in our revision, we did so with the results from these seeds.

This is probably unnecessarily complicated, so I'm planning to update SEEDS.py to make the seeds contiguous for reproducibility.

Exact commands to complete train-to-evaluate pipeline

Besides the commands for training, could you please complete the instructions (exact commands) in the README to 1) train, 2) post-process the training results, and 3) evaluate?

I'll update the README. In the meantime, please read my steps above.

@yoshitomo-matsubara
Author

@lacava

You'll notice there aren't 15 seeds up there. As I mentioned above,

My bad, I actually didn't count them.

I'll update the README. In the meantime, please read my steps above.

Thank you, I'll be looking forward to the updates

@lacava
Member

lacava commented Sep 21, 2021

Hi @yoshitomo-matsubara, see the README updates.

@yoshitomo-matsubara
Author

Hi @lacava
Thank you for updating the README! I'll close this issue.
