Ground-truth datasets are broken? #54

yoshitomo-matsubara · 2021-08-25T06:41:32Z

Hi!

Thank you for your great work and framework!
I wanted to try the benchmarked methods on the ground-truth datasets (i.e., the Feynman and Strogatz datasets) and followed the instructions in the README.

Is each of the datasets not in gzip format?

However, the datasets fetched from the pmlb repository look broken. Here is one of the errors I got for the Strogatz datasets when running
python analyze.py -results ../results_sym_data -target_noise 0.0 "/data/pmlb/datasets/strogatz*" -sym_data -n_trials 10 -time_limit 9:00 -tuned --local
(The same errors occurred for the Feynman datasets with "/data/pmlb/datasets/feynman_*".)

========================================
Evaluating tuned.FEATRegressor on
/data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
========================================
compression: gzip
filename: /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
Traceback (most recent call last):
  File "evaluate_model.py", line 291, in <module>
    **eval_kwargs)
  File "evaluate_model.py", line 39, in evaluate_model
    features, labels, feature_names = read_file(dataset)
  File "/opt/app/srbench/experiment/read_file.py", line 19, in read_file
    engine='python')
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/python_parser.py", line 100, in __init__
    self._make_reader(self.handles.handle)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/python_parser.py", line 203, in _make_reader
    line = f.readline()
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 300, in read1
    return self._buffer.read1(size)
  File "/opt/conda/envs/srbench/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 474, in read
    if not self._read_gzip_header():
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 422, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b've')

I also tried to gunzip the file manually, but the error message still says it is not in gzip format:

$ gunzip /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
gzip: /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz: not in gzip format

Could you please resolve this issue for both Feynman and Strogatz datasets?
Thank you!

lacava · 2021-08-26T02:06:23Z

can you confirm you ran 'git lfs fetch' in the pmlb repo? looks like they may be git lfs references still. i need to update the instructions as well since feynman and strogatz datasets are now in master in pmlb
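One way to check this, as a minimal sketch: an un-pulled Git LFS file is a small text pointer that begins with `version https://git-lfs.github.com/spec/v1`, so its first two bytes are `ve`, which matches the `b've'` in the error above. A real gzip file starts with the magic bytes `1f 8b`. The helper names `classify_dataset_file` and `find_lfs_pointers` are hypothetical and not part of srbench or pmlb, and the `*/*.tsv.gz` layout is assumed from the paths in the log.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"                         # first two bytes of any gzip stream
LFS_POINTER_PREFIX = b"version https://git-lfs"  # start of a Git LFS pointer file


def classify_dataset_file(path):
    """Return 'gzip', 'lfs-pointer', or 'unknown' based on the file's first bytes.

    Hypothetical helper for illustration; not part of srbench or pmlb.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    return "unknown"


def find_lfs_pointers(datasets_dir):
    """List every <name>/<name>.tsv.gz under datasets_dir that is still an LFS pointer."""
    return sorted(
        str(p)
        for p in Path(datasets_dir).glob("*/*.tsv.gz")
        if classify_dataset_file(p) == "lfs-pointer"
    )
```

If `find_lfs_pointers("/data/pmlb/datasets")` returns any paths, those files are still pointers and the actual data has not been placed in the working tree yet.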

yoshitomo-matsubara · 2021-08-26T02:18:41Z

Hi @lacava
Thank you for the response.

Yes, I did run git lfs fetch on the feynman branch. A few minutes ago I also fetched the master branch, but the downloaded tsv.gz files still look the same and are not in gzip format (they return the same error as shown above).

yoshitomo-matsubara · 2021-08-26T02:34:24Z

I think we need git lfs pull instead of git lfs fetch. analyze.py now seems to work with the downloaded datasets.

lacava · 2021-08-28T05:39:21Z

glad you found a solution. i believe git lfs fetch only downloads the LFS objects, while git lfs pull additionally checks the files out into the working tree. I'll update the instructions for the main PMLB branch asap.

yoshitomo-matsubara · 2021-09-03T06:01:48Z

Thank you for updating the repo! I'll close this issue

yoshitomo-matsubara · 2021-09-16T05:13:58Z

@lacava
It looks like the feynman datasets in PMLB are still incomplete.

The metadata.yaml files in the strogatz datasets look complete, and analyze.py works with those datasets.
However, the metadata.yaml files in the feynman datasets are incomplete (description = 'None yet. See our contributing guide to help us add one.'), so get_sym_model fails to extract model_str (the equation?) and analyze.py fails as follows:

========================================
Evaluating tuned.FEATRegressor on
/opt/pmlb/datasets/feynman_III_10_19/feynman_III_10_19.tsv.gz
========================================
compression: gzip
filename: /opt/pmlb/datasets/feynman_III_10_19/feynman_III_10_19.tsv.gz
Traceback (most recent call last):
  File "evaluate_model.py", line 291, in <module>
    **eval_kwargs)
  File "evaluate_model.py", line 41, in evaluate_model
    true_model = get_sym_model(dataset)
  File "/opt/app/srbench/experiment/symbolic_utils.py", line 239, in get_sym_model
    model_str = [ms for ms in description if '=' in ms][0].split('=')[-1]
IndexError: list index out of range
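
The IndexError comes from indexing `[0]` on an empty list: no line of the placeholder description contains `=`. Below is a minimal sketch of a more explicit check. It assumes `description` is a list of lines, as the list comprehension in the traceback suggests. The function name `extract_model_str`, its `dataset_name` parameter, the error message, and the extra `.strip()` are hypothetical and not the actual srbench code.

```python
def extract_model_str(description, dataset_name="<unknown>"):
    """Return the right-hand side of the first description line containing '='.

    Follows the list comprehension shown in the traceback, but raises a
    descriptive error instead of IndexError when no equation is present,
    and strips surrounding whitespace from the result.
    Hypothetical helper; not the actual srbench implementation.
    """
    equation_lines = [ms for ms in description if "=" in ms]
    if not equation_lines:
        raise ValueError(
            f"no equation found in the metadata description of {dataset_name}; "
            "the PMLB checkout may be out of date"
        )
    return equation_lines[0].split("=")[-1].strip()
```

With the placeholder description quoted above, this raises a ValueError that names the dataset, which points more directly at the incomplete metadata.yaml than a bare IndexError does.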

lacava · 2021-09-16T13:56:20Z

thanks for checking. hm, some of the changes didn't make it into master... i'll look into it.

lacava · 2021-09-16T14:01:27Z

issued a PR on PMLB to resolve: EpistasisLab/pmlb#158
will check back once it is merged into master.

yoshitomo-matsubara · 2021-09-16T14:32:19Z

@lacava
Thank you for the update! Let me know here once it's merged into master

lacava · 2021-09-16T14:35:07Z

merged, please update PMLB

marcovirgolin · 2022-02-17T16:01:12Z

Hi, I am trying this out myself now and get an error with all Strogatz problems this time (the Feynman ones run fine).
Namely, when running python analyze.py -script assess_symbolic_model as indicated in the README, I get errors like the one shown below:

========================================
Assessing tuned.GPGOMEARegressor model for 
../../pmlb/datasets/strogatz_predprey2/strogatz_predprey2.tsv.gz
========================================
looking for: ../results_sym_data/strogatz_predprey2//strogatz_predprey2_tuned.GPGOMEARegressor_860.json
['This is one state of a 2-state dynamic model for predator-prey populations. ', '', '$\\dot{x} = x  \\cdot \\left( 4 - x - \\frac{y}{1+x} \\right)$', '$\\dot{y} = y \\cdot \\left( \\frac{x}{1+x} - 0.075 \\cdot y \\right)$', '', 'It is adapted from Steven Strogatz\'s book "Chaos and Nonlinear Dynamics".  ', 'Each strogatz ODE system can exhibit chaotic and/or nonlinear behavior. ', 'For the purposes of modeling, these systems are simulated using initial conditions within stable basins of attraction. ', 'The systems are simulated using simulink and matlab. ', '']
ValueError: Error from parse_expr with transformed code: "x   \\Symbol ('cdot' ) \\Function ('left' )(Integer (4 )-x - \\frac {y }{Integer (1 )+x } \\Symbol ('right' ))$"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "assess_symbolic_model.py", line 158, in <module>
    feature_noise=args.X_NOISE)
  File "assess_symbolic_model.py", line 111, in assess_symbolic_model
    assess_symbolic_model_from_file(save_file+'.json', dataset)
  File "assess_symbolic_model.py", line 42, in assess_symbolic_model_from_file
    true_model = get_sym_model(dataset, return_str=False)
  File "/export/scratch1/home/virgolin/srbench/experiment/symbolic_utils.py", line 246, in get_sym_model
    local_dict = {k:Symbol(k) for k in features})
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 1026, in parse_expr
    raise e from ValueError(f"Error from parse_expr with transformed code: {code!r}")
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 1017, in parse_expr
    rv = eval_expr(code, local_dict, global_dict)
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 912, in eval_expr
    code, global_dict, local_dict)  # take local objects in preference
  File "<string>", line 1
    x   \Symbol ('cdot' ) \Function ('left' )(Integer (4 )-x - \frac {y }{Integer (1 )+x } \Symbol ('right' ))$
                                                                                                              ^
SyntaxError: unexpected character after line continuation character
 python analyze.py \
            -script assess_symbolic_model \{'INPUT_FILE': '../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz', 'ALG': 'tuned.GPGOMEARegressor', 'RDIR': '../results_sym_data/strogatz_shearflow2/', 'RANDOM_STATE': 860, 'TEST': False, 'Y_NOISE': 0.0, 'X_NOISE': 0.0, 'SYM_DATA': True, 'JSON_FILE': ''}

I do see that the "true_model" field in the .json results for Strogatz includes a trailing $.

Perhaps it suffices to add a

model_str = model_str.replace("$","")

in symbolic_utils.get_sym_model?

I'd do a PR but I am not sure whether this is (somehow) a problem only I got, since I see nobody else raising it.

EDIT: removing the $ is not enough
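
A small sketch of why stripping the $ alone does not help: per the traceback, parse_expr ends up evaluating the transformed string as Python code, and the LaTeX commands (\cdot, \left, \frac, \right) remain after the $ is removed. The check below uses the standard-library `ast` module as a stand-in for that evaluation step, so it only tests Python syntax and does not call sympy. The helper `parses_as_python` and the hand-translated `python_form` are assumptions for illustration.

```python
import ast


def parses_as_python(expr):
    """True if expr is a syntactically valid Python expression."""
    try:
        ast.parse(expr.strip(), mode="eval")
        return True
    except SyntaxError:
        return False


# Equation line copied from the strogatz_predprey2 description printed above.
latex_line = r"$\dot{x} = x  \cdot \left( 4 - x - \frac{y}{1+x} \right)$"

rhs = latex_line.split("=")[-1]     # same split as in get_sym_model
no_dollar = rhs.replace("$", "")    # the proposed one-line fix

# Hand-translated Python form of the same right-hand side (an assumption).
python_form = "x*(4 - x - y/(1 + x))"

print(parses_as_python(rhs))          # False: trailing $ and LaTeX commands
print(parses_as_python(no_dollar))    # False: backslash commands are still there
print(parses_as_python(python_form))  # True
```

So the description itself would need to carry the equation in Python form, or the LaTeX would need a dedicated parser.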

lacava · 2022-03-11T07:20:08Z

hi @marcovirgolin

you caught a set of changes I hadn't pushed into PMLB.

once the checks complete on EpistasisLab/pmlb#160, you can update from the pmlb master branch.
for now you can checkout the strogatz_metadata branch. it seems to work for me on your example:

srbench/experiment$ python assess_symbolic_model.py ../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz -ml tuned.GPGOMEARegressor -results ../../analysis/results_sym_data_new/strogatz_shearflow2/ -seed 860
{'INPUT_FILE': '../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz', 'ALG': 'tuned.GPGOMEARegressor', 'RDIR': '../../analysis/results_sym_data_new/strogatz_shearflow2/', 'RANDOM_STATE': 860, 'TEST': False, 'Y_NOISE': 0.0, 'X_NOISE': 0.0, 'SYM_DATA': False, 'JSON_FILE': ''}
========================================
Assessing tuned.GPGOMEARegressor model for
../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz
========================================
looking for: ../../analysis/results_sym_data_new/strogatz_shearflow2//strogatz_shearflow2_tuned.GPGOMEARegressor_860.json
> /mnt/d/projects/symbolic-regression/srbench/experiment/symbolic_utils.py(244)get_sym_model()
-> model_sym = parse_expr(model_str,
(Pdb) c
compression: gzip
filename: ../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz
replacing feature 0 with x
replacing feature 1 with y
parsing 0.000170+2.307729*(((((cos(sin(y))*PLOG(PLOG(14.465000)))*cos((cos(y)/(-11.097000--13.964000))))+cos(-20.929000))*sin(x)))
{'x': x, 'y': y, 'add': <class 'sympy.core.add.Add'>, 'mul': <class 'sympy.core.mul.Mul'>, 'max': Max, 'min': Min, 'sub': <function sub at 0x7f4e8ed32790>, 'div': <function div at 0x7f4e8d7e2040>, 'square': <function square at 0x7f4e8d7e20d0>, 'cube': <function cube at 0x7f4e8d7e2160>, 'quart': <function quart at 0x7f4e8d7e21f0>, 'PLOG': <function PLOG at 0x7f4e8d7e2280>, 'PLOG10': <function PLOG at 0x7f4e8d7e2280>, 'PSQRT': <function PSQRT at 0x7f4e8d7e23a0>}
round_floats
rounded: 2.31*(0.983*cos(sin(y))*cos(0.349*cos(y)) - 0.487)*sin(x)
simplify...
simplified: (2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x)
saving...
sym_diff: -(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x) + (0.1*sin(y)**2 + cos(y)**2)*sin(x)
sym_frac: (2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)/(0.1*sin(y)**2 + cos(y)**2)
simplified sym_diff: (-0.9*sin(y)**2 - 2.27*cos(sin(y))*cos(0.349*cos(y)) + 2.12)*sin(x)
{
    "dataset": "strogatz_shearflow2",
    "algorithm": "tuned.GPGOMEARegressor",
    "params": {
        "caching": false,
        "classweights": false,
        "elitism": 1,
        "erc": true,
        "evaluations": 1000000,
        "functions": "+_-_*_p/_plog_sqrt_sin_cos",
        "generations": -1,
        "gomea": true,
        "gomfos": "LT",
        "ims": false,
        "initmaxtreeheight": 6,
        "linearscaling": true,
        "maxsize": 1000,
        "maxtreeheight": 17,
        "parallel": false,
        "popsize": 1000,
        "prob": "symbreg",
        "reproduction": 0.0,
        "sbagx": 0.0,
        "sblibtype": false,
        "sbrdo": 0.0,
        "seed": -1,
        "silent": true,
        "subcross": 0.5,
        "submut": 0.5,
        "syntuniqinit": 1000,
        "time": 28800,
        "tournament": 4,
        "unifdepthvar": true
    },
    "random_state": 860,
    "process_time": 133.882689869,
    "time_time": 133.97960495948792,
    "target_noise": 0.0,
    "feature_noise": 0.0,
    "true_model": "(0.1*sin(y)**2 + cos(y)**2)*sin(x)",
    "model_size": 21,
    "symbolic_model": "0.000170+2.307729*(((((cos(sin(x1))*plog(plog(14.465000)))*cos((cos(x1)p/(-11.097000--13.964000))))+cos(-20.929000))*sin(x0)))",
    "mse_train": 1.2889293272279269e-06,
    "mae_train": 0.0008988973869460174,
    "r2_train": 0.9999751463811574,
    "mse_test": 1.3140537910085879e-06,
    "mae_test": 0.0009068998433162603,
    "r2_test": 0.9999816769614173,
    "simplified_symbolic_model": "(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x)",
    "simplified_complexity": 15,
    "symbolic_error": "(-0.9*sin(y)**2 - 2.27*cos(sin(y))*cos(0.349*cos(y)) + 2.12)*sin(x)",
    "symbolic_fraction": "(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)/(0.1*sin(y)**2 + cos(y)**2)",
    "symbolic_error_is_zero": false,
    "symbolic_error_is_constant": false,
    "symbolic_fraction_is_constant": false
}
saving...
done.

lacava · 2022-03-11T16:31:15Z

EpistasisLab/pmlb#160 was merged. update PMLB from git and you should be good to go.

lacava added a commit that referenced this issue Aug 28, 2021

see issue #54

2980bfe

yoshitomo-matsubara closed this as completed Sep 3, 2021

yoshitomo-matsubara mentioned this issue Sep 3, 2021

To reproduce results for the ground-truth datasets #55

Closed

yoshitomo-matsubara reopened this Sep 16, 2021

lacava closed this as completed Sep 16, 2021

marcovirgolin reopened this Feb 17, 2022

lacava mentioned this issue Mar 11, 2022

update eqn form to python for strogatz eqns EpistasisLab/pmlb#160

Merged

lacava closed this as completed Mar 11, 2022

gAldeia pushed a commit to gAldeia/srbench that referenced this issue May 29, 2024

see issue cavalab#54

31356c9
