Properly handle unlabeled data in multiple places in SKLL #453

desilinguist · 2019-02-05T23:45:10Z

- Allow skll_convert to handle unlabeled input data (skll_convert does not handle conversion of unlabeled data correctly in all cases #452)
  - Add --no_labels option to skll_convert.
  - Make --no_labels and --label_col mutually exclusive and add a test for this.
- Remove the .ndj format from the various conversion tests since it's identical to the .jsonlines format and only adds to the test run time.
- Fix FeatureSet.has_labels to recognize a list of None objects, which is what you get when you read in an unlabeled data set and pass label_col=None (Reader/Writers not totally compatible for unlabelled feature sets #426).
- Fix bug in ARFFWriter that adds/removes label_col from the field names even if it's None to begin with.
- Update test_convert_featureset() in test_featuresets.py to also test for unlabeled data.
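For reviewers who want to see the intended command-line behavior in isolation, here is a minimal sketch of two mutually exclusive options built with `argparse`. It is not the actual `skll_convert` code: `build_parser`, `resolve_label_col`, and the `convert_sketch` program name are hypothetical, and only the two flag names and the `y` default come from this PR.

```python
import argparse


def build_parser():
    """Build a toy parser with the two label options (hypothetical, not SKLL's code)."""
    parser = argparse.ArgumentParser(prog="convert_sketch")
    parser.add_argument("infile")
    parser.add_argument("outfile")
    # argparse rejects the command line if both options in the group are given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--label_col", default="y",
                       help="name of the column holding the labels")
    group.add_argument("--no_labels", action="store_true",
                       help="treat the input as unlabeled")
    return parser


def resolve_label_col(args):
    """Return None for unlabeled input, otherwise the requested label column."""
    return None if args.no_labels else args.label_col
```

Passing both `--label_col score` and `--no_labels` to this parser makes `argparse` exit with a usage error, which is the kind of failure the new test checks for.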

coveralls · 2019-02-06T00:08:58Z

Coverage increased (+0.3%) to 92.72% when pulling df20b46 on unlabelled-readwrite-compatibility into f08b38d on master.

desilinguist · 2019-02-11T15:35:26Z

@Lguyogiro @mulhod @jbiggsets any chance of reviewing this soon? I have another branch ready :)

mulhod · 2019-02-11T15:42:08Z

I will take a look today.

mulhod

Looks good.

I had a question about the case where unlabelled data is being converted. You would now use the --no_labels flag, but what happens if you don't use it and the input has no label column (y if --label_col is unspecified)? In the CSV-to-ARFF case, the conversion still works, but it's probably not doing what we want. I would expect it to fail if you don't pass --no_labels and there is no label column.

desilinguist · 2019-02-11T19:26:41Z

This is because we allow the CSV reader to ignore non-existent columns and set the label to None if the label column does not exist. It has nothing to do with the conversion itself.
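To make the point concrete, here is a rough sketch of that reader behavior together with a `has_labels`-style check. It is not SKLL's reader or `FeatureSet` code: `read_rows` and the standalone `has_labels` function are hypothetical stand-ins that use only the standard `csv` module.

```python
import csv
import io


def read_rows(csv_text, label_col="y"):
    """Toy reader: a missing label column yields a label of None for every row."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # pop() with a default tolerates a non-existent label column.
        labels.append(row.pop(label_col, None))
        features.append(row)
    return features, labels


def has_labels(labels):
    """Return True only if at least one label is not None."""
    return labels is not None and any(label is not None for label in labels)


# No "y" column here, so every label comes back as None.
features, labels = read_rows("id,f1,f2\nEX1,1,2\nEX2,3,4\n")
print(has_labels(labels))
```

Under these assumptions, a file without the label column reads without error and simply reports no labels, which is why the conversion goes through even when --no_labels is omitted.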

jbiggsets

Looks good to me!

Robert Pugh and others added 9 commits May 16, 2018 11:14

update 'has_labels' in featureset to return False if all labels are None

efef911

add featureset test

14ac9ec

add test data

f0b515b

fix test data path

610bfbb

Allow skll_convert to handle unlabaled input data

54e8f79

- Add `--no_labels` option to `skll_convert`
- Make `--no_labels` and `--label_col` mutually exclusive and add test for this.
- Remove `.ndj` from conversion test since it's identical to `.jsonlines` and just adds to the test time.

Fix has_labels to recognize list of None's

d515782

Fix bug in ARFFWriter

faf511e

We do not need to add/remove `label_col` from the fieldnames if it's None to begin with.

Update test_convert_featureset to also test for unlabeled data

a84f7dc

- Remove unnecessary test and associated files.

Merge branch 'master' into unlabelled-readwrite-compatibility

df20b46

# Conflicts: # skll/data/featureset.py

desilinguist self-assigned this Feb 5, 2019

desilinguist requested review from a user, mulhod, Lguyogiro and jbiggsets February 5, 2019 23:45

desilinguist added this to In progress in SKLL Release v2.5 Feb 5, 2019

desilinguist added this to the 2.0 milestone Feb 5, 2019

mulhod reviewed Feb 11, 2019

mulhod approved these changes Feb 11, 2019

jbiggsets approved these changes Feb 11, 2019

desilinguist merged commit 0b13413 into master Feb 11, 2019

SKLL Release v2.5 automation moved this from In progress to Done Feb 11, 2019

desilinguist deleted the unlabelled-readwrite-compatibility branch February 11, 2019 20:11

This was referenced Feb 11, 2019

Reader/Writers not totally compatible for unlabelled feature sets #426

Closed

skll_convert does not handle conversion of unlabeled data correctly in all cases #452

Closed

desilinguist removed this from Done in SKLL Release v2.5 Sep 12, 2019
