
Add support for ConfTr #55

Closed
2 of 5 tasks
pat-alt opened this issue Feb 15, 2023 · 3 comments · Fixed by #60
Assignees: pat-alt
Labels: CCE 💯 · difficult (This is expected to be difficult.) · enhancement (New feature or request)

Comments


pat-alt commented Feb 15, 2023

This ICLR 2022 paper shows how to train conformal classifiers.

  • Add losses for the prediction step
  • Streamline (need separate score method for dealing with MLJFlux) - done in b4c7140
  • Add support for differentiable quantile computations (calibration step)
  • Implement batch training procedure
  • Test and document
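For reference, the core idea behind the tasks above - soft prediction sets thresholded at a smooth quantile, with a smooth size loss and a classification loss - can be sketched in a few lines. This is plain Python for illustration only (not the paper's or this package's implementation), and the function names, temperature, and threshold are made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_set(scores, tau, T=0.1):
    # Soft set membership C_k = sigma((E_k - tau) / T) for each class k,
    # where E_k is the conformity score and tau the (smooth) quantile.
    return [sigmoid((s - tau) / T) for s in scores]

def smooth_size_loss(C, kappa=1):
    # Penalise soft set sizes above a target size kappa.
    return max(0.0, sum(C) - kappa)

def classification_loss(C, y):
    # Penalise the true label y being (softly) excluded from the set.
    return 1.0 - C[y]

# Illustrative softmax scores for a 3-class example with true label 0
# and an arbitrary threshold tau = 0.5:
scores = [0.7, 0.2, 0.1]
C = soft_set(scores, tau=0.5)
print(smooth_size_loss(C), classification_loss(C, 0))
```

Because every step is built from sigmoids rather than hard indicators, both losses are differentiable in the scores and can be backpropagated through during training.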
@pat-alt added the enhancement (New feature or request) and difficult (This is expected to be difficult.) labels Feb 15, 2023

pat-alt commented Feb 15, 2023

Some questions that have come up so far:

  1. Is the Dirac delta really supposed to be an indicator function? Equation (5) on page 5. Maybe I'm just not familiar with this notation.
  2. Doesn't the smooth size loss depend a lot on the scale of the (non-)conformity scores? For $E_{\theta}(x,k)=\pi_{\theta,k}(x)\in[0,1]$, for example, we have that $\sigma(E_{\theta}(x,k) - \tau) \in [0.27,0.73]$. We can use temperature scaling, but can we really speak of 'probabilities' that labels are assigned to the set?
  3. More on smooth size loss: What about empty sets? Shouldn't they be penalised at least as heavily as complete sets?
    a. Could just penalise these cases as $K - \kappa$, that is, the maximum set size minus the target set size (1).
    b. Perhaps even better: penalise $\sum(1-C) - \kappa$, that is, the total sum of probabilities that labels are not assigned to $C$.
  4. As for the smooth quantile computation, it seems that Zygote.jl's AD actually lets me compute gradients as long as I sort the values beforehand (see this answer on SO). Is this surprising?
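To make the range in question 2 concrete, here is a quick numeric check (plain Python, for illustration): with $E_{\theta}(x,k)$ and $\tau$ both in $[0,1]$, the argument of the sigmoid is confined to $[-1,1]$, which pins the soft assignments to roughly $[0.27, 0.73]$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# E and tau both live in [0, 1], so E - tau ranges over [-1, 1].
lo = sigmoid(-1.0)  # most extreme "not in the set" case
hi = sigmoid(1.0)   # most extreme "in the set" case
print(round(lo, 2), round(hi, 2))  # 0.27 0.73
```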

@davidstutz would much appreciate your thoughts, if you get the chance. This is still early stages here, so there's absolutely no rush. Amazing paper by the way!!

@pat-alt pat-alt self-assigned this Feb 15, 2023
@davidstutz

Re:

  1. You can find a reference implementation for Equation (5) here.
  2. Reference implementation for that is here - but this does indeed depend on the scale. That's what the temperature term $T$ is for: $\sigma((E_\theta(x, k) - \tau)/T)$. Also, you can use the log-probabilities $E_\theta(x, k) = \log \pi_{\theta,k}(x)$ which works a bit better in practice.
  3. They can be penalized but this is generally not necessary. Basically, as long as there is one true label for each example, and $\alpha$ is reasonably low, the majority of prediction sets will contain at least the true label (so not be empty). This is mainly a result of the simple conformity score (for other conformal predictors this can be different). Beyond that, you are of course free to penalize that, but I am just saying that it is generally not required to learn good classifiers.
  4. Gradients with respect to what is the question. Generally, gradients are not a problem as long as the sorting is fixed; the key is getting gradients through the sorting itself - this is what the smooth sorter is for.
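To illustrate the temperature term in answer 2 (a plain-Python sketch, with an arbitrary example; not the reference implementation): dividing by $T$ before the sigmoid recovers a near-binary membership even when scores and threshold live in $[0,1]$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_assignment(E, tau, T):
    # sigma((E - tau) / T): smaller T -> sharper, closer to a hard indicator.
    return sigmoid((E - tau) / T)

# With T = 1 the assignment is squashed into [0.27, 0.73] ...
print(soft_assignment(1.0, 0.0, 1.0))   # ~0.73
# ... while a smaller temperature pushes it back towards {0, 1}.
print(soft_assignment(1.0, 0.0, 0.1))   # ~0.99995
```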

Hope that helps. If I am slow to respond on here, feel free to send me an email to follow up - I'm always curious to see what people do with conformal training, especially as I had some follow-up ideas but couldn't really pursue them.
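On point 4, one simple way to see how a quantile can be made smooth (this is an illustrative alternative in plain Python, not the paper's smooth sorter) is to define $\tau$ implicitly as the value at which the *soft* fraction of scores below it equals $q$; since that condition is built from sigmoids, $\tau$ depends smoothly on the scores and gradients can flow through it:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smooth_quantile(xs, q, T=0.05, iters=60):
    # Find tau such that mean_i sigma((tau - x_i) / T) = q, by bisection.
    # Unlike a hard sort-and-index quantile, tau varies smoothly with every
    # x_i, so it admits gradients (via the implicit function theorem)
    # without differentiating through a piecewise-constant sort.
    lo, hi = min(xs) - 1.0, max(xs) + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        frac = sum(sigmoid((mid - x) / T) for x in xs) / len(xs)
        if frac < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

xs = [0.1, 0.4, 0.35, 0.8, 0.9]
tau = smooth_quantile(xs, 0.6)  # lands between the 3rd and 4th order statistic
```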


pat-alt commented Feb 17, 2023

Wow, this was quick - thanks a lot 🙏

That all makes sense. Regarding the quantile computation, thanks for the clarification. For my current use case, I just need to differentiate with respect to a conformal model that has already been calibrated, but I see now why you need information about the sorting itself for training.

Thanks again for being responsive!
