Warn about order of label_vocab for binary classification (#1435)

Summary:
Pull Request resolved: #1435

As Junteng reports:

> For example, suppose you have two possible labels in your training data, namely "0" and "1". If you specify label_vocab as ["0", "1"], then "0" gets mapped to 0, and "1" gets mapped to 1. On the other hand, if you specify label_vocab as ["1", "0"], then "0" gets mapped to 1, and "1" gets mapped to 0.

> Although this is not important for multi-class classification with negative log-likelihood loss, whether a label gets mapped to 0 or 1 matters in CosineEmbeddingLoss.

Reviewed By: m3rlin45

Differential Revision: D22641684

fbshipit-source-id: f74c83ed3320286d394546cb6394fd34e7e65f04
facebookresearch · Aug 29, 2020 · 49a45b7
1 parent 7bf61b2
commit 49a45b7
Showing 1 changed file with 10 additions and 5 deletions.
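The mapping described in the commit message can be illustrated with a minimal sketch. It assumes that a label's index is simply its position in `label_vocab`, as the report describes. The helpers `build_label_index` and `numberize` are hypothetical names used only for illustration; they are not PyText's actual implementation.

```python
from typing import Dict, List


def build_label_index(label_vocab: List[str]) -> Dict[str, int]:
    """Hypothetical helper: map each label to its position in label_vocab."""
    return {label: idx for idx, label in enumerate(label_vocab)}


def numberize(labels: List[str], label_vocab: List[str]) -> List[int]:
    """Hypothetical helper: convert raw string labels to integer indices."""
    index = build_label_index(label_vocab)
    return [index[label] for label in labels]


raw_labels = ["0", "1", "1", "0"]

# With [negative_class, positive_class], label "1" is numberized as 1.
negative_first = numberize(raw_labels, ["0", "1"])

# With the order reversed, every index is flipped: label "1" becomes 0.
positive_first = numberize(raw_labels, ["1", "0"])

print(negative_first)
print(positive_first)
```

Under this assumption the two orderings give exactly opposite targets for the same data. This is why the docstring added in this commit asks for `[negative_class, positive_class]` when the indices are fed to a binary loss that interprets them.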
diff --git a/pytext/data/tensorizers.py b/pytext/data/tensorizers.py
@@ -948,24 +948,29 @@ def sort_key(self, row):
 
 
 class LabelTensorizer(Tensorizer):
-    """Numberize labels. Label can be used as either input or target """
+    """Numberize labels. Label can be used as either input or target.
+
+    NB: if the labels are used as targets for binary classification with a loss
+    such as cosine distance, the order of the `label_vocab` *does* matter,
+    and it should be `[negative_class, positive_class]`.
+    """
 
     __EXPANSIBLE__ = True
 
     class Config(Tensorizer.Config):
         #: The name of the label column to parse from the data source.
         column: str = "label"
-        #: Whether to allow for unknown labels at test/prediction time
+        #: Whether to allow for unknown labels at test/prediction time.
         allow_unknown: bool = False
-        #: if vocab should have pad, usually false when label is used as target
+        #: Whether vocab should have pad, usually false when label is used as target.
         pad_in_vocab: bool = False
         #: The label values, if known. Will skip initialization step if provided.
         label_vocab: Optional[List[str]] = None
         #: File with the label values. This can be used when the label space is
         #: too large to specify these as a list. The file should not contain
-        #: a header
+        #: a header.
         label_vocab_file: Optional[str] = None
-        # Indicate if it can be used to generate input Tensors for prediction
+        # Indicate if it can be used to generate input Tensors for prediction.
         is_input: bool = False
 
     @classmethod