WARNING - Feature set contains IDs that are not in folds dictionary. #225
I just tried to run with a different data-set (and one that I had been able to use run_experiment with before in the same way) and it gave me the same kind of error. Here's the traceback:
I am going to try with one of the example data-sets to see if I get the same error there too. Is there anything in this error that indicates what the problem might be? |
These sorts of type errors are usually a good indication that a number is stored as a numpy int64 or float64 (instead of a plain old int or float), and those are not JSON serializable. |
Thanks for the info. I will try to investigate a little more. However, I'm a little unsure about what I can change about the way that numbers are stored since the data is being read in from TSV files directly. Also, I don't see how this is related to the "WARNING - Feature set contains IDs that are not in folds dictionary. Skipping those IDs." message. As far as I understand, the IDs are the same across the folds file and any feature files I give as input. |
Are there IDs in the feature files that are not in the folds file? If so,
|
There are no IDs that are in the feature files that aren't in the folds file and vice versa. That's why I don't understand that warning. It's a warning sign to me that something else is happening. I've compared the IDs using diff -s just to make sure and that results in a message that says the files are identical. |
I got some strange errors recently when running skll 1.0 on a data set/config that had previously had no errors. The issue was stricter yaml parsing constraints (e.g. 1.0e-4 instead of 1e-4 and 1.0e+2 instead of 1e2). Have you tried making your ids strings instead of floats? |
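For anyone hitting the same config problem, the rewrite described above could be sketched as a small helper. This is only an illustration, assuming the stricter parser wants a decimal point in the mantissa and an explicit sign on the exponent, as the two examples suggest. `normalize_sci` is a hypothetical name and not part of skll.

```python
import re

# Matches plain scientific notation such as 1e-4, 1e2, 2.5E3 or 1.0e+2.
_SCI = re.compile(r"^([-+]?\d+)(\.\d*)?[eE]([-+]?)(\d+)$")


def normalize_sci(token):
    """Rewrite e.g. '1e-4' as '1.0e-4' and '1e2' as '1.0e+2'.

    Anything that is not scientific notation is returned unchanged.
    """
    match = _SCI.match(token.strip())
    if match is None:
        return token
    mantissa, fraction, sign, exponent = match.groups()
    # Add ".0" when the fractional part is missing or is just a bare dot.
    if not fraction or len(fraction) == 1:
        fraction = ".0"
    # Make the exponent sign explicit.
    sign = sign or "+"
    return f"{mantissa}{fraction}e{sign}{exponent}"


for value in ["1e-4", "1e2", "1.0e-4", "testfile12-cat-dog"]:
    print(value, "->", normalize_sci(value))
```

Values that already have both the decimal point and the signed exponent pass through untouched, so it should be safe to run over every numeric value in a config.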
Thanks, Aoife. I will look into the 1.0e-4 vs. 1e-4 issue. That probably does come up in my set, so I think that probably will be a big help. Just for clarification on your question at the end: My IDs consist of a string of characters that includes letters, numbers, hyphens, and underscores. For example: testfile12-cat-dog. How should IDs be represented in a feature file (or a folds file) if not by testfile12-cat-dog? Should I put the IDs in quotations or something? I haven't done this in the past and SKLL was able to use the IDs as they were, so I didn't think it would be an issue. |
Ah, I had thought that maybe your ids were simply numbers, but if they contain letters/numbers then they shouldn't ever be converted to floats. The ids_to_floats option is set to False anyway by default it seems, so this is unlikely to be the cause of your problems, sorry! |
Whatever is happening is not happening in skll 0.27.0 (at least so far). Making only changes to my config file so that it conforms to the earlier format, I'm getting no weird errors about numbers being JSON serializable or not. I am still getting the warning message about certain IDs not being in the folds dictionary, though. (Side note: Wouldn't it be useful to print out the IDs that are missing? It's not very helpful to find out that IDs are missing if you've confirmed that the file ID set is exactly the same. Obviously, if that's the case, this shouldn't be happening. But, it could be an easy thing to do to print that information out in the log.) |
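The kind of diagnostic suggested in the side note could look roughly like the sketch below, run outside of skll. It assumes both files have a header row with an "id" column; the folds header, the "value" column, the sample rows, and the function names are made up for illustration, and this is not how skll itself performs the check.

```python
import csv
import io


def read_ids(handle, delimiter, id_column="id"):
    """Collect the ID column values from a delimited file with a header row."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {row[id_column] for row in reader}


def report_id_mismatch(feature_ids, fold_ids):
    """Return (IDs only in the features, IDs only in the folds), both sorted."""
    only_in_features = sorted(feature_ids - fold_ids)
    only_in_folds = sorted(fold_ids - feature_ids)
    return only_in_features, only_in_folds


# Made-up sample data standing in for a TSV feature file and a CSV folds file.
features_tsv = io.StringIO(
    "id\tCohesive\tvalue\n"
    "testfile12-cat-dog\t1\t0.5\n"
    "testfile13-cat-dog\t2\t0.7\n"
)
folds_csv = io.StringIO(
    "id,fold\n"
    "testfile12-cat-dog,1\n"
    "testfile14-cat-dog,2\n"
)

feature_ids = read_ids(features_tsv, delimiter="\t")
fold_ids = read_ids(folds_csv, delimiter=",")
only_in_features, only_in_folds = report_id_mismatch(feature_ids, fold_ids)

# repr() makes stray whitespace or quoting differences visible in the output.
for example_id in only_in_features:
    print("not in folds:", repr(example_id))
for example_id in only_in_folds:
    print("not in features:", repr(example_id))
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="")` handles. If both lists come back empty, the ID sets really do match as parsed by the `csv` module, which would point to the warning being triggered somewhere else.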
Yeah, everything worked. Both of the issues that I've brought up are real.
|
So, apparently,
In [1]: import json
In [2]: import numpy as np
In [3]: a = np.array([0.464, 0.744])
In [4]: type(a[0])
Out[4]: numpy.float64
In [5]: json.dumps(a[0])
Out[5]: '0.46400000000000002'
In [6]: a = np.array([1, 2])
In [7]: type(a[0])
Out[7]: numpy.int64
In [8]: json.dumps(a[0])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-853a3f781c77> in <module>()
----> 1 json.dumps(a[0])
/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
231 cls is None and indent is None and separators is None and
232 default is None and not sort_keys and not kw):
--> 233 return _default_encoder.encode(obj)
234 if cls is None:
235 cls = JSONEncoder
/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in encode(self, o)
189 # exceptions aren't as detailed. The list call should be roughly
190 # equivalent to the PySequence_Fast that ''.join() would do.
--> 191 chunks = self.iterencode(o, _one_shot=True)
192 if not isinstance(chunks, (list, tuple)):
193 chunks = list(chunks)
/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in iterencode(self, o, _one_shot)
247 self.key_separator, self.item_separator, self.sort_keys,
248 self.skipkeys, _one_shot)
--> 249 return _iterencode(o, 0)
250
251 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
/Users/nmadnani/anaconda/envs/3.3/lib/python3.3/json/encoder.py in default(self, o)
171
172 """
--> 173 raise TypeError(repr(o) + " is not JSON serializable")
174
175 def encode(self, o):
TypeError: 1 is not JSON serializable
And the reason we haven't run into this issue so far is because Matt's data is special. He has round numbers as labels but he's trying to run a regression. So, the experiment runs just fine but when it comes time to dump the results to the JSON file, the |
Okay, I figured out the |
Well, it looks like this is already fixed in master (#219). Time to make a new release. |
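Until a release with that fix is available, one general workaround for the `TypeError` shown above is to hand `json.dumps` a fallback converter. The sketch below is not the change made in #219, which is not shown in this thread. It relies only on the fact that NumPy scalars and arrays expose a `tolist()` method returning plain Python values, and `to_builtin` is a hypothetical name.

```python
import json


def to_builtin(obj):
    """Fallback for json.dumps: turn NumPy-style values into built-in types.

    json.dumps only calls this for objects it cannot serialize on its own.
    """
    if hasattr(obj, "tolist"):
        # numpy.int64(1).tolist() gives the int 1; arrays give nested lists.
        return obj.tolist()
    raise TypeError(repr(obj) + " is not JSON serializable")


# Usage with NumPy installed (assumed, not imported here):
#     import numpy as np
#     json.dumps({"label": np.array([1, 2])[0]}, default=to_builtin)

# Objects json already understands never reach the fallback.
print(json.dumps({"score": 0.464, "label": 1}, default=to_builtin))
```

Because the converter is only consulted for otherwise unserializable objects, plain ints and floats are written exactly as before.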
I am running a cross-validation experiment. Here's the relevant info from the config file I'm using:
As indicated above, I have 4 feature files in /path/to/featureset-files named "min", "max", "lpmm", and "lpmc". Each has a label column called "Cohesive", an "id" column (so, no need to specify in config file), and one feature value column (although the name is different from the name of the file). These are TSV files. Also, there is a CSV folds file that contains the same exact set of IDs that the other files contain. In fact, if I do a diff -s on the "id" column between any two feature files or between any feature file and the folds file, the result tells me that the inputs are indeed identical. I'm fairly positive that I'm not feeding in files that contain different IDs.
And, yet, run_experiment results in log files that have the following warning message (at the bottom):
WARNING - Feature set contains IDs that are not in folds dictionary. Skipping those IDs.
I can't figure out why this happens. Afterwards, the jobs eventually result in the following error:
Except for one feature file, all values are float/int. In one file, some values are the string "null". I don't know if that has anything to do with what's above, though. Probably, it doesn't because the log file above is for doing a cross-validation experiment with one feature only (I'm actually doing an ablation experiment with "--ablation 3") and that feature is not the one with "null" values.