
WARNING - Feature set contains IDs that are not in folds dictionary. #225

Closed
mulhod opened this issue Feb 9, 2015 · 13 comments

@mulhod
Contributor

mulhod commented Feb 9, 2015

I am running a cross-validation experiment. Here's the relevant info from the config file I'm using:

[Input]
train_directory = /path/to/featureset-files
cv_folds_file = /path/to/folds/file.csv
featuresets = [["min", "max", "lpmc", "lpmm"]]
featureset_names = ["min+max+lpmc+lpmm"]
learners = ["RescaledRandomForestRegressor"]
label_col = Cohesive
suffix = .tsv

As indicated above, I have 4 feature files in /path/to/featureset-files named "min", "max", "lpmm", and "lpmc". Each has a label column called "Cohesive", an "id" column (so there's no need to specify one in the config file), and one feature-value column (though its name differs from the file name). These are TSV files. There is also a CSV folds file that contains exactly the same set of IDs as the feature files. In fact, if I run diff -s on the "id" column between any two feature files, or between any feature file and the folds file, it reports that the inputs are identical. I'm fairly positive that I'm not feeding in files that contain different IDs.

And, yet, run_experiment results in log files that have the following warning message (at the bottom):

WARNING - Feature set contains IDs that are not in folds dictionary. Skipping those IDs.

I can't figure out why this happens. Afterwards, the jobs eventually result in the following error:

Traceback (most recent call last):
  File "/opt/python/3.4/lib/python3.4/site-packages/gridmap/job.py", line 219, in execute
    self.ret = self.function(*self.args, **self.kwlist)
  File "/opt/python/3.4/lib/python3.4/site-packages/skll/experiments.py", line 827, in _classify_featureset
    json.dump(res, json_file)
  File "/opt/python/3.4/lib/python3.4/json/__init__.py", line 178, in dump
    for chunk in iterable:
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 420, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 317, in _iterencode_list
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 429, in _iterencode
    o = _default(o)
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 173, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 0 is not JSON serializable

Except for one feature file, all values are floats/ints. In one file, some values are the string "null". I don't know if that has anything to do with the above, though it probably doesn't: the log file above is from a cross-validation experiment with one feature only (I'm actually doing an ablation experiment with "--ablation 3"), and that feature is not the one with the "null" values.

@mulhod mulhod added the bug label Feb 9, 2015
@mulhod
Contributor Author

mulhod commented Feb 10, 2015

I just tried to run with a different data-set (and one that I had been able to use run_experiment with before in the same way) and it gave me the same kind of error. Here's the traceback:

Stacktrace: Traceback (most recent call last):
  File "/opt/python/3.4/lib/python3.4/site-packages/gridmap/job.py", line 219, in execute
    self.ret = self.function(*self.args, **self.kwlist)
  File "/opt/python/3.4/lib/python3.4/site-packages/skll/experiments.py", line 827, in _classify_featureset
    json.dump(res, json_file)
  File "/opt/python/3.4/lib/python3.4/json/__init__.py", line 178, in dump
    for chunk in iterable:
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 420, in _iterencode
    yield from _iterencode_list(o, _current_indent_level)
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 317, in _iterencode_list
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 396, in _iterencode_dict
    yield from chunks
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 429, in _iterencode
    o = _default(o)
  File "/opt/python/3.4/lib/python3.4/json/encoder.py", line 173, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 2 is not JSON serializable

I am going to try with one of the example data-sets to see if I get the same error there too. Is there anything in this error that indicates what the problem might be?

@dan-blanchard
Contributor

These sorts of type errors are usually a good indication that a number is stored as a numpy int64 or float64 (instead of a plain old int or float), and those are not JSON serializable.
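A quick way to see the distinction Dan describes (a minimal sketch, assuming Python 3 and numpy; this is not SKLL code):

```python
# Sketch of the type distinction: in Python 3, numpy.int64 is not a
# subclass of the built-in int, so the json module refuses to encode it,
# while a plain int (or a value cast with int()) encodes fine.
import json

import numpy as np

x = np.array([1, 2])[0]            # a numpy.int64 scalar
assert not isinstance(x, int)      # not a plain Python int in Python 3

try:
    json.dumps(x)                  # raises TypeError: not JSON serializable
    raised = False
except TypeError:
    raised = True

print(raised)                      # True
print(json.dumps(int(x)))          # casting to a built-in int works: 1
```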

@mulhod
Contributor Author

mulhod commented Feb 17, 2015

Thanks for the info. I will try to investigate a little more.

However, I'm a little unsure about what I can change about the way that numbers are stored since the data is being read in from TSV files directly. Also, I don't see how this is related to the "WARNING - Feature set contains IDs that are not in folds dictionary. Skipping those IDs." message. As far as I understand, the IDs are the same across the folds file and any feature files I give as input.

@desilinguist
Member

Are there IDs in the feature files that are not in the folds file? If so,
then that warning is to be expected. Right, Dan?

@mulhod
Contributor Author

mulhod commented Feb 17, 2015

There are no IDs in the feature files that aren't in the folds file, and vice versa. That's why I don't understand the warning; it suggests to me that something else is happening. I've compared the IDs using diff -s just to make sure, and the result is a message saying the files are identical.

@aoifecahill
Collaborator

I got some strange errors recently when running skll 1.0 on a data set/config that had previously run without errors. The issue was stricter yaml parsing constraints (e.g. 1.0e-4 is required instead of 1e-4, and 1.0e+2 instead of 1e2). Have you tried making your ids strings instead of floats?
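For reference, the stricter parsing Aoife mentions can be reproduced directly with PyYAML (a sketch, under the assumption that the YAML 1.1 resolver is in play): a float literal without a decimal point before the exponent is loaded as a string, not a number.

```python
# PyYAML's YAML 1.1 float resolver requires a decimal point before the
# exponent, so "1e-4" is loaded as a string while "1.0e-4" becomes a float.
import yaml

print(repr(yaml.safe_load("1e-4")))    # '1e-4'   (a string!)
print(repr(yaml.safe_load("1.0e-4")))  # 0.0001   (a float)
```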

@mulhod
Contributor Author

mulhod commented Feb 17, 2015

Thanks, Aoife. I will look into the 1.0e-4 vs. 1e-4 issue. That likely does come up in my set, so I think it will be a big help.

Just for clarification on your question at the end: My IDs consist of a string of characters that includes letters, numbers, hyphens, and underscores. For example: testfile12-cat-dog. How should IDs be represented in a feature file (or a folds file) if not by testfile12-cat-dog? Should I put the IDs in quotations or something? I haven't done this in the past and SKLL was able to use the IDs as they were, so I didn't think it would be an issue.

@aoifecahill
Collaborator

Ah, I had thought that maybe your ids were simply numbers, but if they contain letters as well as numbers then they should never be converted to floats. The ids_to_floats option seems to default to False anyway, so this is unlikely to be the cause of your problems, sorry!

@dan-blanchard dan-blanchard added this to the 1.1 milestone Feb 18, 2015
@mulhod
Contributor Author

mulhod commented Feb 18, 2015

Whatever is happening is not happening in skll 0.27.0 (at least so far). Making only the changes to my config file needed to conform to the earlier format, I get no weird errors about numbers not being JSON serializable. I am still getting the warning message about certain IDs not being in the folds dictionary, though. (Side note: Wouldn't it be useful to print out the IDs that are missing? It's not very helpful to be told that IDs are missing when you've confirmed that the files' ID sets are exactly the same. Obviously, if that's the case, this shouldn't be happening at all, but printing that information in the log could be an easy improvement.)

@mulhod
Contributor Author

mulhod commented Feb 18, 2015

Yeah, everything worked. Both of the issues that I've brought up are real.

  1. The warning message shouldn't be logged (or, if it is, it should give a bit more information).
  2. Something is different about skll 1.0.0 that won't let me run an experiment. The "not JSON serializable" issue might be related to what was mentioned above (numpy.float64/numpy.int64 vs. float/int, or 1.0e-7 vs. 1e-7, perhaps). In any case, it seems the way the data is stored (for example, in TSV files) must be changed if you want to use skll 1.0.0.

@desilinguist
Member

So, apparently, json in python 3 can serialize numpy.float64 values but not numpy.int64 values.

In [1]: import json

In [2]: import numpy as np

In [3]: a = np.array([0.464, 0.744])

In [4]: type(a[0])
Out[4]: numpy.float64

In [5]: json.dumps(a[0])
Out[5]: '0.46400000000000002'

In [6]: a = np.array([1, 2])

In [7]: type(a[0])
Out[7]: numpy.int64

In [8]: json.dumps(a[0])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-853a3f781c77> in <module>()
----> 1 json.dumps(a[0])

/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    231         cls is None and indent is None and separators is None and
    232         default is None and not sort_keys and not kw):
--> 233         return _default_encoder.encode(obj)
    234     if cls is None:
    235         cls = JSONEncoder

/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in encode(self, o)
    189         # exceptions aren't as detailed.  The list call should be roughly
    190         # equivalent to the PySequence_Fast that ''.join() would do.
--> 191         chunks = self.iterencode(o, _one_shot=True)
    192         if not isinstance(chunks, (list, tuple)):
    193             chunks = list(chunks)

/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in iterencode(self, o, _one_shot)
    247                 self.key_separator, self.item_separator, self.sort_keys,
    248                 self.skipkeys, _one_shot)
--> 249         return _iterencode(o, 0)
    250 
    251 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in default(self, o)
    171 
    172         """
--> 173         raise TypeError(repr(o) + " is not JSON serializable")
    174 
    175     def encode(self, o):

TypeError: 1 is not JSON serializable

And the reason we haven't run into this issue so far is that Matt's data is special: he has round numbers as labels but is running a regression. The experiment runs fine, but when it's time to dump the results to the JSON file, the 'descriptive' stats in the regression results store the min and max as numpy.int64, and that's where the JSON encoding fails. I don't think anyone else has run a regression with round numbers as labels before.
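One possible workaround for dumping such results (a sketch, not SKLL's actual fix): hand json.dump a `default` hook that converts numpy scalars to built-in types before encoding.

```python
# Hypothetical workaround: a `default` hook that converts numpy scalar
# types to plain Python ints/floats so descriptive stats computed from
# integer labels (numpy.int64 min/max) can still be written to JSON.
import json

import numpy as np

labels = np.array([1, 2, 3])                      # integer regression labels
res = {"min": labels.min(), "max": labels.max()}  # numpy integer scalars

def numpy_default(obj):
    # Called by json only for objects it cannot encode natively.
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    raise TypeError(repr(obj) + " is not JSON serializable")

print(json.dumps(res, default=numpy_default))     # {"min": 1, "max": 3}
```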

@desilinguist
Member

Okay, I figured out the cv_folds_file warning that Matt is getting: his folds file has no header row, but SKLL unconditionally discards the first row of the file, assuming it is a header. The documentation doesn't say that there should be a header, so that should be fixed.
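A minimal sketch of this failure mode (hypothetical file contents and reading logic, not SKLL's actual code): if the reader unconditionally skips the first row as a header, a headerless folds file loses its first ID, which then looks like an ID that is "not in the folds dictionary".

```python
# Hypothetical illustration: a headerless folds file read by code that
# always discards the first row as a presumed header.
import csv
import io

folds_csv = "id1,0\nid2,1\nid3,0\n"   # three IDs, no header row

reader = csv.reader(io.StringIO(folds_csv))
next(reader)                          # first row ("id1,0") thrown away
folds = {row[0]: row[1] for row in reader}

print(folds)                          # {'id2': '1', 'id3': '0'}
print("id1" in folds)                 # False -- "id1" appears missing
```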

@dan-blanchard
Contributor

Well, it looks like this is already fixed in master (#219). Time to make a new release.
