
Label Function Analysis #13

Closed
schopra8 opened this issue Aug 15, 2021 · 20 comments
Labels
enhancement New feature or request

Comments


schopra8 commented Aug 15, 2021

First of all, thanks for open sourcing such an awesome project!

Our team has been playing around with skweak for a sequential labeling task, and we were wondering if there were any plans on the roadmap to include tooling that helps practitioners understand the "impact" of their label functions statistically.

Snorkel, for example, provides an LF Analysis tool to understand how one's label functions apply to a dataset statistically (e.g., coverage, overlap, conflicts). Similar functionality would be tremendously helpful in gauging the efficacy of one's label functions for each class in a sequential labeling problem.
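For reference, a Snorkel-style analysis looks roughly like this (a toy sketch based on Snorkel's documented labeling API; exact details may vary across versions):

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis

ABSTAIN, SPAM = -1, 1

@labeling_function()
def lf_contains_offer(x):
    # toy LF: flag texts mentioning "offer" as SPAM, otherwise abstain
    return SPAM if "offer" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": ["Special offer today!", "See you tomorrow"]})
lfs = [lf_contains_offer]

L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)   # (n_examples, n_lfs) label matrix
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())       # coverage, overlaps, conflicts per LF

Something analogous at the span/token level is what we're after.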

Are there any plans to add such functionality down the line as a feature enhancement?

@plison plison added the enhancement New feature or request label Aug 15, 2021

plison commented Aug 15, 2021

Thanks Sahil! Yes, this would indeed be very useful. I'll definitely add it to our TODO list, though I can't promise anything in the short term. If you're willing to implement a first version and submit a pull request, do let us know; that would be a tremendous addition to the toolkit :-)


schopra8 commented Aug 16, 2021

For sure! I was planning to hack something together later this week; I'll try to make it a little more robust than an overnight hack and can issue a pull request with what I build so we can see if it's substantial enough to add!

@schopra8 schopra8 reopened this Aug 16, 2021

plison commented Aug 16, 2021

Brilliant :-) Don't hesitate to ask if you have any questions!


schopra8 commented Aug 17, 2021

@plison Wanted to get your quick thoughts on a general approach. In Snorkel, each prediction task is assumed to be an n-way classification problem, in contrast to the sequence labeling focus of skweak.

It seems like we should adapt the LFAnalysis functions so that they are applied on a "per-label" basis. For example, we'd want a label_coverage metric (i.e., % of the dataset given a label) for each entity type. So if we have 3 NER classes (PER, DATE, LOC), we would want 3 metrics: % of the dataset covered for PER, % covered for DATE, and % covered for LOC.

What do you think of this approach to adapting the classification metrics to the sequence labeling task?
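For concreteness, per-label document coverage could look something like this (just a rough sketch with a hypothetical per_label_coverage helper, assuming each LF writes its spans to doc.spans[lf_name] as skweak does):

from collections import defaultdict

def per_label_coverage(docs, labels=("PER", "DATE", "LOC")):
    # Fraction of documents containing at least one span of each label,
    # aggregated over the output of all LFs.
    counts = defaultdict(int)
    for doc in docs:
        labels_in_doc = {span.label_ for spans in doc.spans.values() for span in spans}
        for label in labels:
            if label in labels_in_doc:
                counts[label] += 1
    return {label: counts[label] / len(docs) for label in labels}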


schopra8 commented Aug 18, 2021

Under this framing we would have the following function definitions (adapted from Snorkel's LF Analysis).

label_coverage(Y: Optional[List[Doc]]): For each NER tag, compute the fraction of data points with at least one label (e.g., 60% of data points have a PER label, 50% have DATE, 40% have LOC).

Unlike in a classification setting like Snorkel's, coverage metrics won't be very helpful without priors on how frequently these entities should appear. For example, 60% of the data points having PER labels (according to the LFs) might actually be a good thing, if only 60% of the dataset actually contains PER entities.

Accordingly, I'm thinking we should give users the ability to provide the same set of documents with gold labeled spans (if they have such gold label data) so that they can see how much coverage is afforded by their LFs. Wdyt?

I've coded up a first version of this -- here.


label_conflict(): For each NER tag, compute the fraction of data points with conflicting (non-abstain) labels. There are two broad definitions of conflict:

  1. Disagreement among LFs for a shared label: If we have 2 LFs for the "DATE" label, and LF1 labels the span "June 29" as "DATE" while LF2 does not capture this date, we could consider this a conflict.
  2. Disagreement among LFs for different labels: If we have 1 LF for "PER" and 1 LF for "DATE", and LF1 incorrectly labels the span "June 29" as "PER" while LF2 labels it as "DATE", we could consider this a conflict.

These conflicts seem fundamentally different IMO; and given that this analysis tool is meant to help folks debug and improve their LFs, I'm thinking we may want to split label_conflict() into two separate functions, label_conflict_shared_target() and label_conflict_diff_target(). Wdyt?

Additionally, it seems like this metric would be most useful over spans rather than data points, given that we are solving a sequence labeling task rather than a classification task. Wdyt?


label_overlap(): For each NER tag, compute the fraction of data points with at least two (non-abstain) labels.

Similar to label_conflict(), it seems like this metric would be most useful over spans rather than data points. Wdyt?
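To make the token-level variant concrete, the counting could look roughly like this (a sketch with a hypothetical helper; the per-label breakdown and the span-level variant are omitted for brevity):

from collections import defaultdict

def token_overlap_and_conflict(docs):
    # For each token, collect the labels assigned by the different LFs, then count
    # tokens labelled by >= 2 LFs (overlap) and tokens where those labels disagree (conflict).
    n_tokens = overlap = conflict = 0
    for doc in docs:
        labels_per_token = defaultdict(list)          # token index -> labels from all LFs
        for lf_name, spans in doc.spans.items():
            for span in spans:
                for i in range(span.start, span.end):
                    labels_per_token[i].append(span.label_)
        n_tokens += len(doc)
        for labels in labels_per_token.values():
            if len(labels) >= 2:
                overlap += 1
                if len(set(labels)) > 1:
                    conflict += 1
    return {"overlap": overlap / n_tokens, "conflict": conflict / n_tokens}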


plison commented Aug 18, 2021

Yes, your suggestions do make sense! It's indeed much more informative to get per-label metrics, since most labelling functions are tailored to recognise a few specific labels (and ignore the rest).

Regarding coverage, it's indeed a difficult problem for sequence labelling (in contrast to standard classification, where each point belongs to exactly one class): when an LF outputs an "O", we don't know whether the LF abstains from giving a prediction or whether it predicts that the token does not belong to any category. Yes, one solution is indeed to give users the ability to provide gold standard docs, such that coverage could be computed based on them (in that case, one can directly compute the recall, precision and F1 for each label). On the other hand, this would basically amount to a classical evaluation on test data, which is quite a different kind of exercise from what is typically meant by LF analysis, where the aim is to compare LFs among themselves.

One alternative would be to say that, for a given label (say PER), we look at all the tokens where at least one LF has predicted PER, and we define the coverage in relation to that set. Of course, this estimate of coverage will tend to underestimate the actual coverage (since the denominator will also include false positive tokens for that label), but at least it could be computed without access to gold standard data. Would that make sense for you?

There are also two other aspects that should be ironed out:

  • should the metrics be computed at the token-level or entity-level? In other words, should the coverage of e.g. PER entities look at the proportion of PER tokens that are covered, or the proportion of PER entities (that may span multiple words)? I think the easiest would be to provide all metrics at the token-level, but that can of course be discussed. One advantage of computing the metrics at the token-level is that the metrics can be applied to all types of sequence labelling tasks, not just NER.
  • when computing the coverage, conflict, and overlap, how should we handle BIO tags? In other words, if one LF outputs B-PER and another LF outputs I-PER, should we count this as a match or as a conflict? It might be worth including an option that lets the user perform either "strict" matching (i.e. counting B-PER vs I-PER as a conflict) or robust matching (i.e. only looking at the category, which means B-PER vs I-PER would become a match); a small sketch of this distinction is included below.
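A minimal sketch of the strict/robust distinction, assuming token-level BIO tags and a hypothetical tags_match helper:

def tags_match(tag1: str, tag2: str, strict: bool = False) -> bool:
    if strict:
        return tag1 == tag2                    # strict: B-PER vs I-PER counts as a conflict
    strip = lambda t: t.split("-", 1)[-1]      # B-PER -> PER, I-PER -> PER, O -> O
    return strip(tag1) == strip(tag2)          # robust: B-PER vs I-PER counts as a match

assert tags_match("B-PER", "I-PER") is True
assert tags_match("B-PER", "I-PER", strict=True) is False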

Thanks again for this great work :-)


schopra8 commented Aug 18, 2021

Agreed that we need to iron out those cases :)

  1. Should the metrics be computed at the token-level or entity-level? Agreed, token-level seems to be a more encompassing approach. Let's go with token level for now ... in the future it's always possible to add bells and whistles to use spans instead.
  2. When computing the coverage, conflict, and overlap, how should we handle BIO tags? Agreed, we'll have to handle both "strict" and "robust" matching. I'm thinking of prioritizing "robust" matching up front for v1, wdyt?

On the topic of coverage specifically, can you clarify what the formula would be for coverage in your definition? Reading your description, it seems like this would provide a coverage % for each label function (akin to lf_coverages() in Snorkel), but it's not clear to me how you would compute a single coverage metric that encapsulates the coverage provided by combining all of your label functions.

Maybe it's nonsensical to provide a single coverage metric ... given that we can't distinguish "Abstain" from predicting that a token is not of a particular class? In that case, we would not port over label_coverages() and only port over lf_coverages(). What are your thoughts?


plison commented Aug 19, 2021

Ah, my bad, I didn't see the difference between lf_coverages and label_coverages! Yes, given that we do not know how many tokens should have a non-O label in the first place, I don't quite see how we can provide an implementation for label_coverage as in Snorkel. The easiest is indeed to simply drop that function.

I noticed I also did not respond to your question regarding the definition of label_conflict. If I understand your description correctly, for shared_target you also want to count as conflicts cases where one LF predicts e.g. DATE and the other one predicts O, assuming the two LFs are able to predict DATE? I think it may quickly become a bit messy, given that LFs may have very different recalls. I would think it would be easier to simply count as conflicts cases where two LFs predict incompatible, non-O labels. Or were you thinking of doing something different?


schopra8 commented Aug 19, 2021

Agreed that it would be messy, since we can't differentiate between abstaining and actively predicting that a token does not belong to a specific entity class.

Awesome so to summarize:

  • Token-level LF analysis metrics (later introducing possible entity-level metrics)
  • "Robust" matching (e.g., I-PER and B-PER are considered the same) for v1 (later introducing "strict" matching)
  • Don't implement label_coverages
  • Implement lf_coverages with respect to the set of tokens assigned a particular label (e.g. PER) by 1+ LFs
  • label_conflict should be implemented with respect to LFs returning different non-null labels for the same tokens (e.g., LF1 labels "Apple" as ORG while LF2 labels "Apple" as PER)

Does this seem correct? Just want to confirm :)


schopra8 commented Aug 19, 2021

A few follow-up questions:

  • Is there an easy way to enumerate all possible labels an LF can return? For TokenConstraintAnnotator this is super simple, as it is contained within the label property. But for other annotators (e.g., ModelAnnotator) it's not clear to me how we might access this information easily.
  • How should we handle gap tokens when computing token-level statistics? My assumption is that we simply ignore gap tokens. For example, if we have extracted a span "2017 - 2019" with label "DATE", should we only count the 2 tokens "2017" and "2019"?
  • Instead of calling the function lf_coverages, should we call it lf_agreements? Circling back to the definition you proposed earlier in the thread, we seem to be computing agreement amongst LFs with the same target label, rather than identifying the prevalence of labels across our dataset.

> One alternative would be to say that, for a given label (say PER), we look at all the tokens where at least one LF has predicted PER, and we define the coverage in relation to that set. Of course, this estimate of coverage will tend to underestimate the actual coverage (since the denominator will also include false positive tokens for that label), but at least it could be computed without access to gold standard data

Thanks in advance!


plison commented Aug 19, 2021

Yes, your summary is correct; that's exactly what I had in mind :-)

Coming to your follow-up questions:

  • No, there isn't a direct way to get the labels that can be produced by a given LF, unfortunately (without going into the details, the main problem comes from LFs such as DocumentMajorityAnnotator that rely on the outputs of other LFs). However, given a collection of documents annotated with LFs, it's relatively straightforward to loop over the documents and the spans defined in them to find out which labels are produced by each LF:
# `docs` is a collection of spaCy Doc objects already annotated by the LFs;
# each LF stores its output spans under doc.spans[lf_name].
labels_by_lf = {}
for doc in docs:
    for lf_name, spans in doc.spans.items():
        if lf_name not in labels_by_lf:
            labels_by_lf[lf_name] = set()
        # record every label this LF has actually produced
        labels_by_lf[lf_name].update(span.label_ for span in spans)
  • I would simply treat all tokens in the same manner, without any special treatment for "gap tokens" or similar. So the "-" would be treated as I-DATE in your example.
  • I see your point, but I fear the word "agreement" may give the wrong impression that it provides some annotator-agreement metric. In my view, the idea is that, for a given label such as PER, the number of tokens that are marked as PER by at least one LF (among all LFs) gives us a rough estimate of the actual number of PER tokens in the corpus. Of course this will be an overestimate due to false positives, but if we assume that this estimate is not too far off, then the number of tokens marked as PER by a given LF divided by this total will provide a conservative estimate of the LF's recall (and thus of the LF's coverage); a rough sketch of that computation is included below.
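Rough sketch of that coverage estimate (a sketch only, not the final implementation; again assuming each LF stores its spans under doc.spans[lf_name]):

from collections import defaultdict

def lf_coverages(docs, label="PER"):
    # Per-LF coverage for one label, relative to the set of tokens that at least
    # one LF has marked with that label (no gold data needed).
    tokens_by_lf = defaultdict(set)                   # lf_name -> {(doc_idx, token_idx)}
    for d, doc in enumerate(docs):
        for lf_name, spans in doc.spans.items():
            for span in spans:
                if span.label_ == label:
                    tokens_by_lf[lf_name].update((d, i) for i in range(span.start, span.end))
    covered = set().union(*tokens_by_lf.values()) if tokens_by_lf else set()
    return {lf_name: len(tokens) / len(covered) for lf_name, tokens in tokens_by_lf.items()}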

schopra8 commented

Makes sense on all accounts! Let's stick to the term coverage accordingly.


schopra8 commented Aug 21, 2021

Is there a standard code formatter and accompanying config file I should be using to make sure my code formatting matches that of the overall library (e.g., yapf or black)? Just want to make sure I get my ducks in a row before creating a pull request for you to review :)


plison commented Aug 23, 2021

No, I haven't used any recommended code formatter (but you're right that we probably should!)


schopra8 commented Aug 25, 2021

Good to know -- I shall leave that for a future pull request :)

I wanted to run my process for computing the LF-level accuracies past you and see if it resonates:

  • LFs can have different target label sets, and an LF need not cover all desired labels.
  • If we simply compute accuracy as the predictions of an LF vs. the ground truth, we will underestimate the accuracy of the LF if it does not cover all the labels specified in the ground truth.
  • As a result, we need to take the vector of ground-truth labels and transform it for each of the LFs, by setting all the ground-truth labels outside the LF's target set to the null token (0).
  • We can then run a typical accuracy calculation (i.e., number of matching values between the two vectors / size of the vector); see the sketch below.
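A rough sketch of that calculation (hypothetical lf_accuracy helper, with integer token labels and 0 as the null label):

import numpy as np

def lf_accuracy(preds: np.ndarray, gold: np.ndarray, lf_label_set: set) -> float:
    # Gold labels outside the LF's target set are mapped to the null label (0)
    # before computing token-level accuracy.
    gold_norm = np.where(np.isin(gold, list(lf_label_set)), gold, 0)
    return float((preds == gold_norm).mean())

# e.g. an LF that only knows labels {0, 1, 2}, evaluated against gold labels that include 3:
print(lf_accuracy(np.array([1, 0, 2]), np.array([1, 1, 3]), {0, 1, 2}))  # -> 0.33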


schopra8 commented Aug 26, 2021

One ambiguity in the empirical accuracies calculation is how to treat label mismatches between LFs and Gold Data. I imagine two situations when computing empirical accuracies over LFs across labels (this issue does not exist for LFs for individual labels).


LFs are missing labels which are present in the gold data. The proposed behavior is to set the gold-data value to 0, since the LF could not possibly have predicted the correct answer and it would be unfair to penalize it. We would also print a warning to help users catch typos in their label names between the LFs and the gold dataset. I see this as a fairly expected and common circumstance if one is only modeling a subset of the labels provided by a gold dataset or hasn't finished writing a full set of LFs.


Gold data is missing labels which are present among the LFs. This case seems more complicated. Let's take a look at three approaches and three examples:

  • Approach A: Remove tokens with labels outside the domain of the gold data from the accuracy calculation altogether.
  • Approach B: Re-assign null labels (0) to the tokens the LF provided the out-of-domain labels.
  • Approach C: Modify nothing, just compare ground truth vector (normalized to the domain of the LF) to the LF predictions.
  1. Correct label exists in the LF's domain
    The LF covers the labels [0, 1, 2], the gold dataset covers the labels [0, 1, 3]. The LF yields predictions [1, 0, 2] for a 3-point dataset. The gold dataset provides labels [1, 1, 1].
  • Approach A: Removing tokens with OOD labels means that we would drop the token assigned label 2 from the computation; we would be left with predictions [1, 0] and gold data [1, 1], giving 50% accuracy for the LF.
  • Approach B: Re-assigning the null label (0) to the tokens the LF gave out-of-domain labels would leave us with predictions [1, 0, 0] and gold data [1, 1, 1], giving 33% accuracy.
  • Approach C: Comparing the predictions [1, 0, 2] to the gold data [1, 1, 1] without any modification would give us 33% accuracy.
  2. Correct label is unknown or null
    The LF covers the labels [0, 1, 2], the gold dataset covers the labels [0, 1, 3]. The LF yields predictions [1, 0, 2] for a 3-point dataset. The gold dataset provides labels [1, 1, 0].
  • Approach A would yield 50% again (nothing has changed from the previous example).
  • Approach B would re-assign 0 to the token labeled 2, leaving predictions [1, 0, 0] and gold data [1, 1, 0], giving 66% accuracy.
  • Approach C would do nothing. This would mean that we would be left with predictions [1, 0, 2] and gold data [1, 1, 0], giving 33% accuracy.
  3. Correct label is outside of the LF's domain
    The LF covers the labels [0, 1, 2], the gold dataset covers the labels [0, 1, 3]. The LF yields predictions [1, 0, 2] for a 3-point dataset. The gold dataset provides labels [1, 1, 3]. Note that since 3 is not covered by the LF, we would normalize the gold labels to [1, 1, 0].
  • Approach A would yield 50% again (nothing has changed from the previous examples).
  • Approach B would re-assign 0 to the token labeled 2, leaving predictions [1, 0, 0] and gold data [1, 1, 0], giving 66% accuracy.
  • Approach C would do nothing (except normalize the ground-truth vector to the domain of the LF). This would mean that we would be left with predictions [1, 0, 2] and gold data [1, 1, 0], giving 33% accuracy.

In Example 1, Approaches B & C yield more intuitive accuracies, as we should be penalizing the LF for predicting incorrectly when the true label was in its domain.

In Example 2, the gold dataset's null token could be either a definitive labeling of the token as a non-entity or an indication that the entity type was not considered during labeling. If we believe our LF has high precision, then it would make sense to inflate the score and champion Approach B. But if we don't believe our LF has high precision, it would make sense to penalize the score and champion Approach C. Approach A splits the difference.

In Example 3, the gold dataset definitively labels the token as another class, so it would make sense to penalize the LF for lack of precision (it would have been safer to label the token null than to assign the wrong label). So here, Approach C makes the most intuitive sense.
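For concreteness, here's how the three approaches play out on Example 1 in code (a toy sketch with made-up variable names):

import numpy as np

# LF label set {0, 1, 2}, gold label set {0, 1, 3}, predictions [1, 0, 2], gold labels [1, 1, 1]
preds, gold = np.array([1, 0, 2]), np.array([1, 1, 1])
gold_domain = {0, 1, 3}

ood = ~np.isin(preds, list(gold_domain))                  # LF predictions outside the gold domain

acc_a = float((preds[~ood] == gold[~ood]).mean())         # A: drop OOD tokens          -> 0.50
acc_b = float((np.where(ood, 0, preds) == gold).mean())   # B: map OOD predictions to 0 -> 0.33
acc_c = float((preds == gold).mean())                     # C: compare as-is            -> 0.33
print(acc_a, acc_b, acc_c)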

schopra8 commented

To me it seems like Approach C is the simplest and most interpretable, even if it is pessimistic in Example 2. So that would be my vote for the best implementation strategy.

Which would you select and why?


plison commented Aug 26, 2021

Yes, if we wish to compute LF accuracies, I agree that Approach C seems to be the easiest strategy. But I'm still uncertain as to whether accuracy is the most appropriate metric for that kind of task. Why not compute precision and recall instead (either per-label or micro-averaged on the labels that are covered by the LF)? It's true that it gives us two measures instead of one, but it's also much more informative, as it clearly indicates whether the errors come from the occurrence of false positives or false negatives -- something that is conflated in the accuracy metric. If we use precision and recall, all the problems you mentioned will essentially vanish. Or am I missing something?
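For reference, per-label precision/recall for one LF against gold token tags could be as simple as this (a sketch with a hypothetical helper; "O"/abstain tokens never count as positives):

def precision_recall(preds, gold, label):
    # Token-level precision and recall for a single label.
    tp = sum(p == label and g == label for p, g in zip(preds, gold))
    fp = sum(p == label and g != label for p, g in zip(preds, gold))
    fn = sum(p != label and g == label for p, g in zip(preds, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(["PER", "O", "DATE"], ["PER", "PER", "O"], "PER"))  # -> (1.0, 0.5)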

schopra8 commented

You're totally right! I think I put the feature-parity blinders on too tightly :) I'll add precision and recall to the LF Analysis.

schopra8 commented

@plison I have opened #15 to introduce the LF Analysis tooling. Would love to get your eyes on it and gather your feedback. Please let me know if you'd prefer to connect for a live call, and I can walk you through it at a high level if that makes it easier to review.

@schopra8 schopra8 closed this as completed Oct 6, 2021