Ground-truth datasets are broken? #54

Closed
yoshitomo-matsubara opened this issue Aug 25, 2021 · 13 comments

Comments

@yoshitomo-matsubara

Hi!

Thank you for your great work and framework!
I wanted to try the benchmarked methods on the ground-truth datasets (i.e., the Feynman and Strogatz datasets) and followed the instructions in the README.

However, the datasets fetched from the pmlb repository look broken: are the files not actually in gzip format? Here is one of the errors I got when running
python analyze.py -results ../results_sym_data -target_noise 0.0 "/data/pmlb/datasets/strogatz*" -sym_data -n_trials 10 -time_limit 9:00 -tuned --local
for the Strogatz datasets. (The same errors occurred for the Feynman datasets with "/data/pmlb/datasets/feynman_*" as well.)

========================================
Evaluating tuned.FEATRegressor on
/data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
========================================
compression: gzip
filename: /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
Traceback (most recent call last):
  File "evaluate_model.py", line 291, in <module>
    **eval_kwargs)
  File "evaluate_model.py", line 39, in evaluate_model
    features, labels, feature_names = read_file(dataset)
  File "/opt/app/srbench/experiment/read_file.py", line 19, in read_file
    engine='python')
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/python_parser.py", line 100, in __init__
    self._make_reader(self.handles.handle)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/python_parser.py", line 203, in _make_reader
    line = f.readline()
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 300, in read1
    return self._buffer.read1(size)
  File "/opt/conda/envs/srbench/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 474, in read
    if not self._read_gzip_header():
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 422, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b've')

I also tried to gunzip the file manually, but gzip likewise reports that it is not in gzip format:

$ gunzip /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
gzip: /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz: not in gzip format

Could you please resolve this issue for both Feynman and Strogatz datasets?
Thank you!

@lacava
Member

lacava commented Aug 26, 2021

Can you confirm you ran 'git lfs fetch' in the pmlb repo? It looks like the files may still be Git LFS references. I also need to update the instructions, since the Feynman and Strogatz datasets are now in master in pmlb.
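
One quick way to test this hypothesis is to look at the first bytes of a dataset file. Real gzip data starts with the magic bytes 0x1f 0x8b, while a Git LFS pointer is a small text file that starts with `version https://git-lfs.github.com/spec/v1`, which would explain the `b've'` in the error above. The following is a minimal sketch; the helper name `classify_dataset_file` is hypothetical and is not part of srbench or PMLB.

```python
GZIP_MAGIC = b"\x1f\x8b"
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def classify_dataset_file(path):
    """Return 'gzip', 'lfs_pointer', or 'unknown' based on the file's first bytes.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs_pointer"
    return "unknown"
```

Calling it on a path such as `/data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz` and getting `'lfs_pointer'` would indicate that the real file contents have not been checked out yet.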

@yoshitomo-matsubara
Author

Hi @lacava
Thank you for the response.

Yes, I did run git lfs fetch for the feynman branch. A few minutes ago, I also fetched the master branch, but the downloaded tsv.gz files still look the same and are not in gzip format (they return the same error as shown above).

@yoshitomo-matsubara
Author

I think we need git lfs pull instead of git lfs fetch. It seems analyze.py now works with the downloaded datasets.

@lacava
Member

lacava commented Aug 28, 2021

Glad you found a solution. I believe git lfs fetch only downloads the LFS objects, while git lfs pull additionally checks them out into the working tree. I'll update the instructions for the main PMLB branch asap.

lacava added a commit that referenced this issue Aug 28, 2021
@yoshitomo-matsubara
Author

Thank you for updating the repo! I'll close this issue

@yoshitomo-matsubara
Author

@lacava
It looks like the Feynman datasets in PMLB are still incomplete.

The metadata.yaml files in the Strogatz datasets look complete, and analyze.py works with those datasets.
However, the metadata.yaml files in the Feynman datasets are incomplete (description = 'None yet. See our contributing guide to help us add one.'), so model_str (the equation?) cannot be extracted and analyze.py fails as follows:

========================================
Evaluating tuned.FEATRegressor on
/opt/pmlb/datasets/feynman_III_10_19/feynman_III_10_19.tsv.gz
========================================
compression: gzip
filename: /opt/pmlb/datasets/feynman_III_10_19/feynman_III_10_19.tsv.gz
Traceback (most recent call last):
  File "evaluate_model.py", line 291, in <module>
    **eval_kwargs)
  File "evaluate_model.py", line 41, in evaluate_model
    true_model = get_sym_model(dataset)
  File "/opt/app/srbench/experiment/symbolic_utils.py", line 239, in get_sym_model
    model_str = [ms for ms in description if '=' in ms][0].split('=')[-1]                                     
IndexError: list index out of range
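
The IndexError arises because `[0]` is applied to an empty list when no line of the description contains `=`. A more defensive lookup could report the underlying problem directly. This is a minimal sketch, assuming `description` is a list of lines as in the traceback; the helper name `extract_model_str` is hypothetical and this is not the code in `symbolic_utils.py`.

```python
def extract_model_str(description):
    """Return the right-hand side of the first description line containing '='.

    Hypothetical helper for illustration. Raises ValueError with a readable
    message when no equation is present, e.g. for placeholder metadata.
    """
    equations = [line for line in description if "=" in line]
    if not equations:
        raise ValueError(
            "no equation found in dataset description; "
            "metadata.yaml may be incomplete"
        )
    # Unlike the original one-liner, this also strips surrounding whitespace.
    return equations[0].split("=")[-1].strip()
```

With the placeholder description quoted above, this raises a ValueError that names the metadata as the likely cause instead of a bare IndexError.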

@lacava
Member

lacava commented Sep 16, 2021

thanks for checking. hm, some of the changes didn't make it into master... i'll look into it.

@lacava
Member

lacava commented Sep 16, 2021

issued a PR on PMLB to resolve: EpistasisLab/pmlb#158
will check back once it is merged into master.

@yoshitomo-matsubara
Author

@lacava
Thank you for the update! Let me know here once it's merged into master

@lacava
Member

lacava commented Sep 16, 2021

merged, please update PMLB

@lacava lacava closed this as completed Sep 16, 2021
@marcovirgolin
Contributor

marcovirgolin commented Feb 17, 2022

Hi, I am trying this out myself now, and this time I get an error with all Strogatz problems (the Feynman problems run fine).
Namely, when using python analyze.py -script assess_symbolic_model as indicated in the README, I get errors like the one shown below:

========================================
Assessing tuned.GPGOMEARegressor model for 
../../pmlb/datasets/strogatz_predprey2/strogatz_predprey2.tsv.gz
========================================
looking for: ../results_sym_data/strogatz_predprey2//strogatz_predprey2_tuned.GPGOMEARegressor_860.json
['This is one state of a 2-state dynamic model for predator-prey populations. ', '', '$\\dot{x} = x  \\cdot \\left( 4 - x - \\frac{y}{1+x} \\right)$', '$\\dot{y} = y \\cdot \\left( \\frac{x}{1+x} - 0.075 \\cdot y \\right)$', '', 'It is adapted from Steven Strogatz\'s book "Chaos and Nonlinear Dynamics".  ', 'Each strogatz ODE system can exhibit chaotic and/or nonlinear behavior. ', 'For the purposes of modeling, these systems are simulated using initial conditions within stable basins of attraction. ', 'The systems are simulated using simulink and matlab. ', '']
ValueError: Error from parse_expr with transformed code: "x   \\Symbol ('cdot' ) \\Function ('left' )(Integer (4 )-x - \\frac {y }{Integer (1 )+x } \\Symbol ('right' ))$"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "assess_symbolic_model.py", line 158, in <module>
    feature_noise=args.X_NOISE)
  File "assess_symbolic_model.py", line 111, in assess_symbolic_model
    assess_symbolic_model_from_file(save_file+'.json', dataset)
  File "assess_symbolic_model.py", line 42, in assess_symbolic_model_from_file
    true_model = get_sym_model(dataset, return_str=False)
  File "/export/scratch1/home/virgolin/srbench/experiment/symbolic_utils.py", line 246, in get_sym_model
    local_dict = {k:Symbol(k) for k in features})
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 1026, in parse_expr
    raise e from ValueError(f"Error from parse_expr with transformed code: {code!r}")
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 1017, in parse_expr
    rv = eval_expr(code, local_dict, global_dict)
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 912, in eval_expr
    code, global_dict, local_dict)  # take local objects in preference
  File "<string>", line 1
    x   \Symbol ('cdot' ) \Function ('left' )(Integer (4 )-x - \frac {y }{Integer (1 )+x } \Symbol ('right' ))$
                                                                                                              ^
SyntaxError: unexpected character after line continuation character
 python analyze.py \
            -script assess_symbolic_model \{'INPUT_FILE': '../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz', 'ALG': 'tuned.GPGOMEARegressor', 'RDIR': '../results_sym_data/strogatz_shearflow2/', 'RANDOM_STATE': 860, 'TEST': False, 'Y_NOISE': 0.0, 'X_NOISE': 0.0, 'SYM_DATA': True, 'JSON_FILE': ''}

I do see that the "true_model" field in the .json results for Strogatz includes a trailing $ at the end.

Perhaps it suffices to add a

model_str = model_str.replace("$","")

in symbolic_utils.get_sym_model?

I'd do a PR but I am not sure whether this is (somehow) a problem only I got, since I see nobody else raising it.

EDIT: removing the $ is not enough
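
A small check illustrates why stripping `$` alone cannot be enough: the equation in the description printed above is LaTeX, so commands such as `\cdot`, `\left`, and `\frac` remain after the dollar signs are gone, and these are not valid Python/SymPy expression syntax. The sketch below only detects leftover LaTeX commands and does not convert them; `find_latex_commands` is a hypothetical helper name.

```python
import re

# A backslash followed by letters, e.g. \cdot, \left, \frac
LATEX_COMMAND = re.compile(r"\\[A-Za-z]+")


def find_latex_commands(expr):
    """Return the LaTeX commands still present in expr (hypothetical helper)."""
    return LATEX_COMMAND.findall(expr)


# Right-hand side of the first equation in the description printed above
rhs = r"x  \cdot \left( 4 - x - \frac{y}{1+x} \right)$"
stripped = rhs.replace("$", "")
leftover = find_latex_commands(stripped)
print(leftover)
```

A non-empty result means the string still needs a LaTeX-free form of the equation before it can be parsed as a plain expression.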

@lacava
Member

lacava commented Mar 11, 2022

hi @marcovirgolin

you caught a set of changes I hadn't pushed into PMLB.

once the checks complete on EpistasisLab/pmlb#160, you can update from the pmlb master branch.
for now you can checkout the strogatz_metadata branch. it seems to work for me on your example:

srbench/experiment$ python assess_symbolic_model.py ../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz -ml tuned.GPGOMEARegressor -results ../../analysis/results_sym_data_new/strogatz_shearflow2/ -seed 860
{'INPUT_FILE': '../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz', 'ALG': 'tuned.GPGOMEARegressor', 'RDIR': '../../analysis/results_sym_data_new/strogatz_shearflow2/', 'RANDOM_STATE': 860, 'TEST': False, 'Y_NOISE': 0.0, 'X_NOISE': 0.0, 'SYM_DATA': False, 'JSON_FILE': ''}
========================================
Assessing tuned.GPGOMEARegressor model for
../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz
========================================
looking for: ../../analysis/results_sym_data_new/strogatz_shearflow2//strogatz_shearflow2_tuned.GPGOMEARegressor_860.json
> /mnt/d/projects/symbolic-regression/srbench/experiment/symbolic_utils.py(244)get_sym_model()
-> model_sym = parse_expr(model_str,
(Pdb) c
compression: gzip
filename: ../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz
replacing feature 0 with x
replacing feature 1 with y
parsing 0.000170+2.307729*(((((cos(sin(y))*PLOG(PLOG(14.465000)))*cos((cos(y)/(-11.097000--13.964000))))+cos(-20.929000))*sin(x)))
{'x': x, 'y': y, 'add': <class 'sympy.core.add.Add'>, 'mul': <class 'sympy.core.mul.Mul'>, 'max': Max, 'min': Min, 'sub': <function sub at 0x7f4e8ed32790>, 'div': <function div at 0x7f4e8d7e2040>, 'square': <function square at 0x7f4e8d7e20d0>, 'cube': <function cube at 0x7f4e8d7e2160>, 'quart': <function quart at 0x7f4e8d7e21f0>, 'PLOG': <function PLOG at 0x7f4e8d7e2280>, 'PLOG10': <function PLOG at 0x7f4e8d7e2280>, 'PSQRT': <function PSQRT at 0x7f4e8d7e23a0>}
round_floats
rounded: 2.31*(0.983*cos(sin(y))*cos(0.349*cos(y)) - 0.487)*sin(x)
simplify...
simplified: (2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x)
saving...
sym_diff: -(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x) + (0.1*sin(y)**2 + cos(y)**2)*sin(x)
sym_frac: (2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)/(0.1*sin(y)**2 + cos(y)**2)
simplified sym_diff: (-0.9*sin(y)**2 - 2.27*cos(sin(y))*cos(0.349*cos(y)) + 2.12)*sin(x)
{
    "dataset": "strogatz_shearflow2",
    "algorithm": "tuned.GPGOMEARegressor",
    "params": {
        "caching": false,
        "classweights": false,
        "elitism": 1,
        "erc": true,
        "evaluations": 1000000,
        "functions": "+_-_*_p/_plog_sqrt_sin_cos",
        "generations": -1,
        "gomea": true,
        "gomfos": "LT",
        "ims": false,
        "initmaxtreeheight": 6,
        "linearscaling": true,
        "maxsize": 1000,
        "maxtreeheight": 17,
        "parallel": false,
        "popsize": 1000,
        "prob": "symbreg",
        "reproduction": 0.0,
        "sbagx": 0.0,
        "sblibtype": false,
        "sbrdo": 0.0,
        "seed": -1,
        "silent": true,
        "subcross": 0.5,
        "submut": 0.5,
        "syntuniqinit": 1000,
        "time": 28800,
        "tournament": 4,
        "unifdepthvar": true
    },
    "random_state": 860,
    "process_time": 133.882689869,
    "time_time": 133.97960495948792,
    "target_noise": 0.0,
    "feature_noise": 0.0,
    "true_model": "(0.1*sin(y)**2 + cos(y)**2)*sin(x)",
    "model_size": 21,
    "symbolic_model": "0.000170+2.307729*(((((cos(sin(x1))*plog(plog(14.465000)))*cos((cos(x1)p/(-11.097000--13.964000))))+cos(-20.929000))*sin(x0)))",
    "mse_train": 1.2889293272279269e-06,
    "mae_train": 0.0008988973869460174,
    "r2_train": 0.9999751463811574,
    "mse_test": 1.3140537910085879e-06,
    "mae_test": 0.0009068998433162603,
    "r2_test": 0.9999816769614173,
    "simplified_symbolic_model": "(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x)",
    "simplified_complexity": 15,
    "symbolic_error": "(-0.9*sin(y)**2 - 2.27*cos(sin(y))*cos(0.349*cos(y)) + 2.12)*sin(x)",
    "symbolic_fraction": "(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)/(0.1*sin(y)**2 + cos(y)**2)",
    "symbolic_error_is_zero": false,
    "symbolic_error_is_constant": false,
    "symbolic_fraction_is_constant": false
}
saving...
done.
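
For reading a result like the one above, the three boolean fields at the end of the JSON can be collected in one place. The sketch below assumes, and this is only an assumption about how the fields are meant to be used, that a model counts as a symbolic match when any of the three flags is true. The names `SYMBOLIC_FLAGS` and `is_symbolic_match` are hypothetical.

```python
import json

# Field names taken from the JSON output above
SYMBOLIC_FLAGS = (
    "symbolic_error_is_zero",
    "symbolic_error_is_constant",
    "symbolic_fraction_is_constant",
)


def is_symbolic_match(result):
    """Return True if any symbolic flag is set (assumed criterion, see lead-in)."""
    return any(bool(result.get(name, False)) for name in SYMBOLIC_FLAGS)


# The three flags as they appear in the strogatz_shearflow2 result above
example = json.loads(
    '{"symbolic_error_is_zero": false,'
    ' "symbolic_error_is_constant": false,'
    ' "symbolic_fraction_is_constant": false}'
)
print(is_symbolic_match(example))
```

For the run shown above all three flags are false, so under this assumed criterion the model is an accurate fit (test R2 close to 1) but not a symbolic match to the true model.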

@lacava
Member

lacava commented Mar 11, 2022

EpistasisLab/pmlb#160 was merged. update PMLB from git and you should be good to go.

@lacava lacava closed this as completed Mar 11, 2022
gAldeia pushed a commit to gAldeia/srbench that referenced this issue May 29, 2024