Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle unlabeled data in multiple places in SKLL #453

Merged
merged 9 commits into from Feb 11, 2019

Conversation

desilinguist
Copy link
Member

@desilinguist desilinguist commented Feb 5, 2019

  1. Allow skll_convert to handle unlabaled input data (skll_convert does not handle conversion of unlabeled data correctly in all cases #452)

    • Add --no_labels option to skll_convert
    • Make --no_labels and --label_col mutually exclusive and add test for this.
  2. Remove the .ndj format from the various conversion tests since it's identical to the .jsonlines format and just adds unnecessarily to the test run time.

  3. Fix FeatureSet.has_labels to recognize list of None objects which is what happens when you read in an unlabeled data set and pass label_col=None (Reader/Writers not totally compatible for unlabelled feature sets #426).

  4. Fix bug in ARFFWriter that adds/removes label_col from the field names even if it's None to begin with.

  5. Update test_convert_featureset() in test_featuresets.py to also test for unlabeled data.

Robert Pugh and others added 9 commits May 16, 2018 11:14
- Add `--no_labels` option to `skll_convert`
- Make `--no_labels` and `--label_col` mutually exclusive and add test for this.
- Remove `.ndj` from conversion test since it's identical to `.jsonlines` and just adds to the test time.
We do not need to add/remove `label_col` from the fieldnames if it's None to begin with.
- Remove unnecessary test and associated files.
@desilinguist desilinguist self-assigned this Feb 5, 2019
@desilinguist desilinguist added this to In progress in SKLL Release v2.5 Feb 5, 2019
@desilinguist desilinguist added this to the 2.0 milestone Feb 5, 2019
@coveralls
Copy link

Coverage Status

Coverage increased (+0.3%) to 92.72% when pulling df20b46 on unlabelled-readwrite-compatibility into f08b38d on master.

@desilinguist
Copy link
Member Author

@Lguyogiro @mulhod @jbiggsets any chance of reviewing this soon? I have another branch ready :)

@mulhod
Copy link
Contributor

mulhod commented Feb 11, 2019

I will take a look today.

Copy link
Contributor

@mulhod mulhod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I had a question about the case where you have unlabelled data being converted. You would now use the --no_labels flag, but what happens if you don't use it and there is no label column (y if unspecified)? In the csv to arff case, this still works, but it's probably not doing what we want. I would expect that if you don't pass in --no_labels and there is no label column, it would fail.

@desilinguist
Copy link
Member Author

This is because we allow the CSV reader to ignore non-existent columns and set the label to None if the column does not exist. Has nothing to do with conversion.

Copy link
Collaborator

@jbiggsets jbiggsets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@desilinguist desilinguist merged commit 0b13413 into master Feb 11, 2019
SKLL Release v2.5 automation moved this from In progress to Done Feb 11, 2019
@desilinguist desilinguist deleted the unlabelled-readwrite-compatibility branch February 11, 2019 20:11
@desilinguist desilinguist removed this from Done in SKLL Release v2.5 Sep 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants