Label Function Analysis #13
Thanks Sahil! Yes, this would indeed be very useful. I'll definitely add it to our TODO list, but I can't promise anything in the short term. If you're willing to implement a first version and submit a pull request, do let us know -- that would be a tremendous addition to the toolkit :-)
For sure! I was planning to hack something together later this week; I'll try to make it a little more robust than an overnight hack and can issue a pull request with what I build so we can see if it's substantial enough to add!
Brilliant :-) Don't hesitate to ask if you have any questions!
@plison Wanted to get your quick thoughts on a general approach. In Snorkel, each prediction task is assumed to be an n-way classification problem, in contrast to the sequence labeling focus of skweak. It seems like we should be adapting the LFAnalysis functions so that they're applied on a "per-label" basis. For example, it seems we'd want a label_coverage metric (i.e., % of dataset given a label) for each entity type. So if we have 3 NER classes (PER, DATE, LOC), we would want 3 metrics -- % of dataset covered for PER, % of dataset covered for DATE, and % of dataset covered for LOC. What do you think of this approach to adapting the classification metrics to the sequence labeling task?
Under this framing we would have the following function definitions (adapted from Snorkel's LF Analysis).
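For instance, a per-label coverage helper could look roughly like the sketch below. This is purely illustrative -- the names and the data layout (each document represented as a mapping from LF name to the spans that LF produced) are assumptions for the example, not skweak's actual API:

```python
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label) -- hypothetical span layout

def label_coverage(docs: List[Dict[str, List[Span]]],
                   labels: Iterable[str]) -> Dict[str, float]:
    """Fraction of documents in which at least one LF predicts each label."""
    coverage = {}
    for label in labels:
        covered = sum(
            1 for doc in docs
            if any(span[2] == label for spans in doc.values() for span in spans)
        )
        coverage[label] = covered / len(docs) if docs else 0.0
    return coverage
```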
Unlike in a classification problem like Snorkel's, coverage metrics won't be very helpful without priors on how frequently these entities should appear. For example, 60% of the data points having PER labels (according to LFs) might actually be a good thing -- if only 60% of the dataset has PER entities. Accordingly, I'm thinking we should give users the ability to provide the same set of documents with gold labeled spans (if they have such gold label data) so that they can see how much coverage is afforded by their LFs. Wdyt? I've coded up a first version of this -- here.
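A rough sketch of such a gold-based coverage check (the names and the span representation here are hypothetical, not taken from the version linked above): for each label, what fraction of the gold spans is matched exactly by at least one LF?

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

def coverage_vs_gold(lf_spans: List[Dict[str, List[Span]]],
                     gold_spans: List[List[Span]],
                     labels: List[str]) -> Dict[str, float]:
    """For each label, the fraction of gold spans matched exactly by at least one LF."""
    covered = {label: 0 for label in labels}
    total = {label: 0 for label in labels}
    for doc_lfs, doc_gold in zip(lf_spans, gold_spans):
        predicted = {span for spans in doc_lfs.values() for span in spans}
        for span in doc_gold:
            label = span[2]
            if label not in total:
                continue
            total[label] += 1
            if span in predicted:
                covered[label] += 1
    return {label: covered[label] / total[label] if total[label] else 0.0
            for label in labels}
```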
These conflicts seem fundamentally different IMO; and given that this analysis tool is meant to help folks debug and improve their LFs, I'm thinking we may want to split them out. Additionally, it seems like this metric would be most useful over spans rather than datapoints, given that we are solving a sequence labeling task rather than a classification task. Wdyt?
Similar to
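To make the span-level view of conflicts a bit more concrete, here is one possible (purely hypothetical) formulation of a token-level conflict metric for a single document; the data layout is an assumption for illustration:

```python
from typing import Dict, List

def token_conflicts(token_labels: Dict[str, List[str]]) -> float:
    """Fraction of tokens where at least two LFs predict different non-"O" labels.

    token_labels[lf_name][i] is the label that LF `lf_name` assigns to token i
    of the document; "O" means no prediction.
    """
    if not token_labels:
        return 0.0
    n_tokens = len(next(iter(token_labels.values())))
    conflicting = 0
    for i in range(n_tokens):
        predictions = {labels[i] for labels in token_labels.values() if labels[i] != "O"}
        if len(predictions) > 1:
            conflicting += 1
    return conflicting / n_tokens if n_tokens else 0.0
```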
Yes, your suggestions do make sense! It's indeed much more informative to get per-label metrics, since most labelling functions are tailored to recognise a few specific labels (and ignore the rest).

Regarding the coverage, it's indeed a difficult problem for sequence labelling (in contrast to standard classification, where each point belongs to exactly one class) -- when an LF outputs an "O", we don't know whether the LF abstains from giving a prediction, or whether it predicts that the token does not belong to any category. Yes, one solution is indeed to give users the ability to provide gold standard docs, such that one could compute the coverage based on them (in that case, one can directly compute the recall, precision and F1 for each label). On the other hand, this would basically amount to doing a classical evaluation based on test data, which is quite a different kind of exercise from what is typically meant by LF analysis, where the aim is to compare LFs among themselves.

One alternative would be to say that, for a given label (say PER), we look at all the tokens where at least one LF has predicted PER, and we define the coverage in relation to that set. Of course, this estimate of coverage will tend to underestimate the actual coverage (since the denominator will also include false positive tokens for that label), but at least it could be computed without access to gold standard data. Would that make sense to you?

There are also two other aspects that should be ironed out:
Thanks again for this great work :-)
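To make the alternative sketched above concrete: for a given label, the denominator would be the set of token positions where at least one LF predicted that label, and each LF's coverage is measured against that set. A rough illustration (the data layout and names are assumptions for the example, not skweak's API):

```python
from collections import defaultdict
from typing import Dict, List

def relative_coverage(token_labels: List[Dict[str, List[str]]],
                      label: str) -> Dict[str, float]:
    """Per-LF coverage of `label`, relative to the set of token positions where
    at least one LF predicted that label (no gold data needed).

    token_labels[d][lf_name][i] is the label that LF `lf_name` assigns to
    token i of document d ("O" meaning no prediction).
    """
    union = set()                       # all (doc, token) positions with this label
    per_lf: Dict[str, set] = defaultdict(set)
    for d, doc in enumerate(token_labels):
        for lf_name, labels in doc.items():
            for i, lab in enumerate(labels):
                if lab == label:
                    union.add((d, i))
                    per_lf[lf_name].add((d, i))
    return {lf: len(positions) / len(union) if union else 0.0
            for lf, positions in per_lf.items()}
```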
Agreed that we need to iron out those cases :)
On the topic of coverage specifically, can you clarify what the formula would be for coverage in your definition? Reading your description, it seems like this would provide a coverage % for each label function. Maybe it's nonsensical to provide a single coverage metric ... given that we can't distinguish "Abstain" from predicting that a token is not of a particular class? In that case, we would not port over a single dataset-wide coverage metric.
Ah, my bad, I didn't see the difference between the two. I noticed I also did not respond to your question regarding the definition of coverage.
Agreed that it would be messy -- since we can't differentiate between Abstain and actively predicting that a token does not belong to a specific entity. Awesome, so to summarize:
Does this seem correct? Just want to confirm :)
A few follow-up questions:
Thanks in advance!
Yes, your summary is correct -- that's exactly what I had in mind :-) Coming to your follow-up questions:
Makes sense on all counts! Let's stick to the term coverage accordingly.
Is there a standard code formatter and accompanying config file I should be using to make sure my code formatting matches that of the overall library (e.g., yapf or black)? Just want to make sure I get my ducks in a row before creating a pull request for you to review :)
No, I haven't used any recommended code formatter (but you're right that we probably should!)
Good to know -- I shall leave that for a future pull request :) I wanted to run my process for computing the LF-level accuracies by you to see if it resonates:
One ambiguity in the empirical accuracy calculation is how to treat label mismatches between LFs and Gold Data. I imagine two situations when computing empirical accuracies over LFs across labels (this issue does not exist for LFs for individual labels):
1. LFs are missing labels which are present in the Gold Data. The proposed behavior would be to print a warning to the user and set the Gold Data value to 0. Since the LF could not have possibly predicted the correct answer, it is unfair to penalize it. We will print a warning to help users in case they have typos in their label names between the LFs and the Gold Dataset. I see this as a fairly expected and common circumstance, e.g. if one is only modeling a subset of the labels provided by a gold dataset or hasn't finished writing a full set of LFs.
2. Gold Data is missing labels which are present among the LFs. This case seems more complicated. Let's take a look at three examples and three approaches:
In Example 1, Approaches B & C yield more intuitive accuracies, as we should be penalizing the LF for predicting incorrectly when the true label was in the domain. In Example 2, the gold dataset's null-token could be read either as a definitive labeling of the token as a non-entity or as an indication that the entity was not considered during labeling. If we believe our LF has high precision, then it would make sense to inflate the score and champion Approach B. But if we don't believe our LF has high precision, it would make sense to penalize the score and champion Approach C. Approach A splits the difference. In Example 3, the gold dataset definitively labels the token as belonging to another class, so it would make sense to penalize the LF for lack of precision (it would have been safer to label the token null than to assign the wrong label). So here, Approach C would make the most intuitive sense.
To me it seems like Approach C is the simplest and most interpretable, even if it is pessimistic in Example 2. So that would be my vote for the best implementation strategy. Which would you select, and why?
Yes, if we wish to compute LF accuracies, I agree that Approach C seems to be the easiest strategy. But I'm still uncertain as to whether accuracy is the most appropriate metric for that kind of task. Why not compute precision and recall instead (either per-label or micro-averaged on the labels that are covered by the LF)? It's true that it gives us two measures instead of one, but it's also much more informative, as it clearly indicates whether the errors come from the occurrence of false positives or false negatives -- something that is conflated in the accuracy metric. If we use precision and recall, all the problems you mentioned will essentially vanish. Or am I missing something?
You're totally right! I think I put the feature-parity blinders on too tightly :) I'll add precision and recall to the LF Analysis.
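For instance, per-label precision and recall for a single LF against gold spans could be computed like this (a sketch only, assuming exact span matching; the names and span layout are illustrative and the eventual implementation may differ):

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

def lf_precision_recall(lf_spans: List[List[Span]],
                        gold_spans: List[List[Span]],
                        label: str) -> Tuple[float, float]:
    """Precision and recall of one LF for `label`, using exact span matching."""
    tp = fp = fn = 0
    for pred_doc, gold_doc in zip(lf_spans, gold_spans):
        pred = {s for s in pred_doc if s[2] == label}
        gold = {s for s in gold_doc if s[2] == label}
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```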
First of all, thanks for open sourcing such an awesome project!
Our team has been playing around with skweak for a sequence labeling task, and we were wondering if there were any plans in the roadmap to include tooling that helps practitioners understand the "impact" of their label functions statistically.
Snorkel, for example, provides an LF Analysis tool to understand how one's label functions apply to a dataset statistically (e.g., coverage, overlap, conflicts). Similar functionality would be tremendously helpful in gauging the efficacy of one's label functions for each class in a sequence labeling problem.
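For reference, here is roughly how that looks in Snorkel (a toy classification example using Snorkel's documented API):

```python
import pandas as pd
from snorkel.labeling import LFAnalysis, PandasLFApplier, labeling_function

@labeling_function()
def lf_contains_inc(x):
    # Toy rule: label 1 if the text mentions "Inc.", otherwise abstain (-1).
    return 1 if "Inc." in x.text else -1

df_train = pd.DataFrame({"text": ["Acme Inc. was founded in 1990.", "It rained today."]})
applier = PandasLFApplier(lfs=[lf_contains_inc])
L_train = applier.apply(df=df_train)

# Reports per-LF polarity, coverage, overlaps and conflicts.
print(LFAnalysis(L=L_train, lfs=[lf_contains_inc]).lf_summary())
```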
Are there any plans to add such functionality down the line as a feature enhancement?