
WARNING - Feature set contains IDs that are not in folds dictionary. #225

Closed
mulhod opened this issue Feb 9, 2015 · 13 comments

@mulhod
Contributor

mulhod commented Feb 9, 2015

I am running a cross-validation experiment. Here's the relevant info from the config file I'm using:

[Input]
train_directory = /path/to/featureset-files
cv_folds_file = /path/to/folds/file.csv
featuresets = [["min", "max", "lpmc", "lpmm"]]
featureset_names = ["min+max+lpmc+lpmm"]
learners = ["RescaledRandomForestRegressor"]
label_col = Cohesive
suffix = .tsv

As indicated above, I have 4 feature files in /path/to/featureset-files named "min", "max", "lpmm", and "lpmc". Each has a label column called "Cohesive", an "id" column (so there's no need to specify one in the config file), and one feature-value column (though its name differs from the file name). These are TSV files. There is also a CSV folds file that contains exactly the same set of IDs as the feature files. In fact, if I run diff -s on the "id" column between any two feature files, or between any feature file and the folds file, it reports that the inputs are identical. I'm fairly positive that I'm not feeding in files that contain different IDs.

And, yet, run_experiment results in log files that have the following warning message (at the bottom):

WARNING - Feature set contains IDs that are not in folds dictionary. Skipping those IDs.

I can't figure out why this happens. Afterwards, the jobs eventually result in the following error:

Traceback (most recent call last):
  File "/opt/python/3.4/lib/python3.4/site-packages/gridmap/job.py", line 219, in execute
    self.ret = self.function(*self.args, **self.kwlist)
  File "/opt/python/3.4/lib/python3.4/site-packages/skll/experiments.py", line 827, in _classify_featureset
    json.dump(res, json_file)
  File "/opt/python/3.4/lib/python3.4/json/__init__.py", line 178, in dump
    for chunk in iterable:
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 420, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 317, in _iterencode_list
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 429, in _iterencode
    o = _default(o)
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 173, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 0 is not JSON serializable

Except for one feature file, all values are floats/ints. In one file, some values are the string "null". I don't know if that has anything to do with the above, though it probably doesn't: the log file above is from a cross-validation experiment with one feature only (I'm actually doing an ablation experiment with "--ablation 3"), and that feature is not the one with the "null" values.

@mulhod mulhod added the bug label Feb 9, 2015
@mulhod
Contributor Author

mulhod commented Feb 10, 2015

I just tried to run with a different data-set (and one that I had been able to use run_experiment with before in the same way) and it gave me the same kind of error. Here's the traceback:

Stacktrace: Traceback (most recent call last):
  File "/opt/python/3.4/lib/python3.4/site-packages/gridmap/job.py", line 219, in execute
    self.ret = self.function(*self.args, **self.kwlist)
  File "/opt/python/3.4/lib/python3.4/site-packages/skll/experiments.py", line 827, in _classify_featureset
    json.dump(res, json_file)
  File "/opt/python/3.4/lib/python3.4/json/__init__.py", line 178, in dump
    for chunk in iterable:
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 420, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 317, in _iterencode_list
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 429, in _iterencode
    o = _default(o)
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 173, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 2 is not JSON serializable

I am going to try with one of the example data-sets to see if I get the same error there too. Is there anything in this error that indicates what the problem might be?

@dan-blanchard
Contributor

These sorts of type errors are usually a good indication that a number is stored as a numpy int64 or float64 (instead of a plain old int or float), and those are not JSON serializable.
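A quick way to see the distinction Dan describes (a minimal sketch, assuming Python 3 and numpy; this is not SKLL code):

```python
# Sketch of the type distinction: in Python 3, numpy.int64 is not a
# subclass of the built-in int, so the json module refuses to encode it,
# while a plain int (or a value cast with int()) encodes fine.
import json

import numpy as np

x = np.array([1, 2])[0]            # a numpy.int64 scalar
assert not isinstance(x, int)      # not a plain Python int in Python 3

try:
    json.dumps(x)                  # raises TypeError: not JSON serializable
    raised = False
except TypeError:
    raised = True

print(raised)                      # True
print(json.dumps(int(x)))          # casting to a built-in int works: 1
```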

@mulhod
Contributor Author

mulhod commented Feb 17, 2015

Thanks for the info. I will try to investigate a little more.

However, I'm a little unsure about what I can change about the way that numbers are stored since the data is being read in from TSV files directly. Also, I don't see how this is related to the "WARNING - Feature set contains IDs that are not in folds dictionary. Skipping those IDs." message. As far as I understand, the IDs are the same across the folds file and any feature files I give as input.

@desilinguist
Member

Are there IDs in the feature files that are not in the folds file? If so,
then that warning is to be expected. Right, Dan?

@mulhod
Contributor Author

mulhod commented Feb 17, 2015

There are no IDs in the feature files that aren't in the folds file, and vice versa. That's why I don't understand the warning; it suggests to me that something else is happening. I've compared the IDs using diff -s just to make sure, and the result is a message saying the files are identical.

@aoifecahill
Collaborator

I got some strange errors recently when running skll 1.0 on a data set/config that had previously run without errors. The issue was stricter yaml parsing constraints (e.g. 1.0e-4 is required instead of 1e-4, and 1.0e+2 instead of 1e2). Have you tried making your ids strings instead of floats?
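For reference, the stricter parsing Aoife mentions can be reproduced directly with PyYAML (a sketch, under the assumption that the YAML 1.1 resolver is in play): a float literal without a decimal point before the exponent is loaded as a string, not a number.

```python
# PyYAML's YAML 1.1 float resolver requires a decimal point before the
# exponent, so "1e-4" is loaded as a string while "1.0e-4" becomes a float.
import yaml

print(repr(yaml.safe_load("1e-4")))    # '1e-4'   (a string!)
print(repr(yaml.safe_load("1.0e-4")))  # 0.0001   (a float)
```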

@mulhod
Contributor Author

mulhod commented Feb 17, 2015

Thanks, Aoife. I will look into the 1.0e-4 vs. 1e-4 issue. That likely does come up in my set, so I think it will be a big help.

Just for clarification on your question at the end: My IDs consist of a string of characters that includes letters, numbers, hyphens, and underscores. For example: testfile12-cat-dog. How should IDs be represented in a feature file (or a folds file) if not by testfile12-cat-dog? Should I put the IDs in quotations or something? I haven't done this in the past and SKLL was able to use the IDs as they were, so I didn't think it would be an issue.

@aoifecahill
Collaborator

Ah, I had thought that maybe your ids were simply numbers, but if they contain letters as well as numbers then they should never be converted to floats. The ids_to_floats option seems to default to False anyway, so this is unlikely to be the cause of your problems, sorry!

@dan-blanchard dan-blanchard added this to the 1.1 milestone Feb 18, 2015
@mulhod
Contributor Author

mulhod commented Feb 18, 2015

Whatever is happening is not happening in skll 0.27.0 (at least so far). Making only the changes to my config file needed to conform to the earlier format, I get no weird errors about numbers not being JSON serializable. I am still getting the warning message about certain IDs not being in the folds dictionary, though. (Side note: Wouldn't it be useful to print out the IDs that are missing? It's not very helpful to be told that IDs are missing when you've confirmed that the files' ID sets are exactly the same. Obviously, if that's the case, this shouldn't be happening at all, but printing that information in the log could be an easy improvement.)

@mulhod
Contributor Author

mulhod commented Feb 18, 2015

Yeah, everything worked. Both of the issues that I've brought up are real.

  1. The warning message shouldn't be logged (or, if it is, it should give a bit more information).
  2. Something is different about skll 1.0.0 that won't let me run an experiment. The "not JSON serializable" issue might be related to what was mentioned above (numpy.float64/numpy.int64 vs. float/int, or 1.0e-7 vs. 1e-7, perhaps). In any case, it seems the way the data is stored (for example, in TSV files) must be changed if you want to use skll 1.0.0.

@desilinguist
Member

So, apparently, json in python 3 can serialize numpy.float64 values but not numpy.int64 values.

In [1]: import json

In [2]: import numpy as np

In [3]: a = np.array([0.464, 0.744])

In [4]: type(a[0])
Out[4]: numpy.float64

In [5]: json.dumps(a[0])
Out[5]: '0.46400000000000002'

In [6]: a = np.array([1, 2])

In [7]: type(a[0])
Out[7]: numpy.int64

In [8]: json.dumps(a[0])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-853a3f781c77> in <module>()
----> 1 json.dumps(a[0])

/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    231         cls is None and indent is None and separators is None and
    232         default is None and not sort_keys and not kw):
--> 233         return _default_encoder.encode(obj)
    234     if cls is None:
    235         cls = JSONEncoder

/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in encode(self, o)
    189         # exceptions aren't as detailed.  The list call should be roughly
    190         # equivalent to the PySequence_Fast that ''.join() would do.
--> 191         chunks = self.iterencode(o, _one_shot=True)
    192         if not isinstance(chunks, (list, tuple)):
    193             chunks = list(chunks)

/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in iterencode(self, o, _one_shot)
    247                 self.key_separator, self.item_separator, self.sort_keys,
    248                 self.skipkeys, _one_shot)
--> 249         return _iterencode(o, 0)
    250 
    251 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in default(self, o)
    171 
    172         """
--> 173         raise TypeError(repr(o) + " is not JSON serializable")
    174 
    175     def encode(self, o):

TypeError: 1 is not JSON serializable

And the reason we haven't run into this issue so far is that Matt's data is special: he has round numbers as labels but is running a regression. The experiment runs fine, but when it's time to dump the results to the JSON file, the 'descriptive' stats in the regression results store the min and max as numpy.int64, and that's where the JSON encoding fails. I don't think anyone else has run a regression with round numbers as labels before.
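One possible workaround for dumping such results (a sketch, not SKLL's actual fix): hand json.dump a `default` hook that converts numpy scalars to built-in types before encoding.

```python
# Hypothetical workaround: a `default` hook that converts numpy scalar
# types to plain Python ints/floats so descriptive stats computed from
# integer labels (numpy.int64 min/max) can still be written to JSON.
import json

import numpy as np

labels = np.array([1, 2, 3])                      # integer regression labels
res = {"min": labels.min(), "max": labels.max()}  # numpy integer scalars

def numpy_default(obj):
    # Called by json only for objects it cannot encode natively.
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    raise TypeError(repr(obj) + " is not JSON serializable")

print(json.dumps(res, default=numpy_default))     # {"min": 1, "max": 3}
```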

@desilinguist
Member

Okay, I figured out the cv_folds_file warning that Matt is getting: his folds file has no header row, but SKLL unconditionally discards the first row of the file, assuming it is a header. The documentation doesn't say that there should be a header, so that should be fixed.
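A minimal sketch of this failure mode (hypothetical file contents and reading logic, not SKLL's actual code): if the reader unconditionally skips the first row as a header, a headerless folds file loses its first ID, which then looks like an ID that is "not in the folds dictionary".

```python
# Hypothetical illustration: a headerless folds file read by code that
# always discards the first row as a presumed header.
import csv
import io

folds_csv = "id1,0\nid2,1\nid3,0\n"   # three IDs, no header row

reader = csv.reader(io.StringIO(folds_csv))
next(reader)                          # first row ("id1,0") thrown away
folds = {row[0]: row[1] for row in reader}

print(folds)                          # {'id2': '1', 'id3': '0'}
print("id1" in folds)                 # False -- "id1" appears missing
```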

@dan-blanchard
Contributor

Well, it looks like this is already fixed in master (#219). Time to make a new release.
