# **Context**

## **Trustworthy AI in Medicine: Conformal Prediction and Test-Time Augmentation**

[MIT_Article_Link](https://news.mit.edu/2025/making-ai-models-more-trustworthy-high-stakes-settings-0501)

[GPT_Ans](https://chatgpt.com/share/681883b2-a7c8-8006-9fcc-7800614a98f3)

**Keywords**

`conformal prediction`

### **Understanding**


# Trustworthy AI in Medicine: Conformal Prediction and Test-Time Augmentation

## Explanation for a Non-Technical Audience

Imagine a doctor looking at a chest X-ray and trying to decide what’s wrong. There could be many possibilities: a buildup of fluid, infection, a broken bone, or something else. Modern AI tools can help by suggesting diagnoses from medical images, but these AIs are often uncertain. Traditionally, an AI might give **one** answer plus a “confidence” percentage (e.g. “I’m 20% sure it’s pleurisy”). However, much like a weather forecast whose “20% chance of rain” can mislead, these confidence scores are **not very reliable**. In fact, studies show that AI confidence numbers often don’t match reality, so doctors can’t fully trust them.

An alternative approach is to have the AI output a **set of possible answers** instead of just one guess. Think of it as getting a shortlist of the most likely diagnoses, with a guarantee that the right answer is in the list. This method is called **conformal prediction**. For example, instead of saying “It’s probably A with 20% chance,” the AI might say “It could be A, B, or C, and one of these is almost certainly correct”. This is much more informative for a doctor, because it explicitly shares uncertainty and focuses attention on the most plausible conditions.

However, there’s a catch: if the AI is very unsure, that “shortlist” can become **very long and hard to use**. For instance, in a tricky task like identifying an animal species from an image out of 10,000 possibilities, the AI might include hundreds of species on the list just to be safe. Sifting through 200 options to find the right one would be overwhelming for anyone. In other words, conformal prediction can give **too many choices** to be practical. Small changes in the image – like a slight rotation or crop – can also make the AI jump around and produce a different huge list of candidates.

The MIT researchers’ new method addresses this by making the AI’s predictions **more consistent and accurate** before forming the list. They borrowed an idea called **Test-Time Augmentation (TTA)** from computer vision. Here’s a helpful analogy: suppose you’re trying to recognize a familiar face in a photo that’s slightly blurry. You might tilt the photo, zoom in, or look at it under different lighting to get a clearer idea. TTA does something similar for images. The AI creates **many modified versions of the same image** – for example, by flipping it, zooming in, or cropping different parts – and it asks itself “What is this” for each version. Each version is like a slightly different viewpoint.

Then, instead of trusting any single answer, the AI **combines all those predictions**. It’s like polling a panel of experts: each augmented image “votes” or gives a probability for each diagnosis, and the AI aggregates these into one final judgment. This process tends to boost the confidence in the true diagnosis. (Remarkably, even if the true answer was originally very low on the list, these multiple views often raise its score significantly – for example, moving it from the 200th most likely to the 100th.)

Crucially, the MIT team found a smart way to **learn how to combine** these augmented predictions effectively. They set aside a portion of the labeled data that would otherwise be used for the conformal step (images where the correct answer is known) to learn the best way to merge the augmented results. After learning this aggregation strategy, they run the conformal prediction step on the final, combined output. The result is a **much shorter list of diagnoses** that still has the same high guarantee of containing the true answer. In fact, experiments showed that this method shrank the number of possibilities by about **10–30%** on common image tasks, without sacrificing reliability.

By giving doctors fewer, more precise options, the AI becomes a much more helpful assistant. A shorter list means the doctor can focus and decide quicker. As MIT’s Divya Shanmugam explains, “With fewer classes to consider, the sets of predictions are naturally more informative… you are not really sacrificing anything in terms of accuracy”. Although the research focused on medical images, the idea applies broadly. For example, an AI identifying wildlife in camera-trap photos would also benefit from smaller, reliable species lists. In the future, similar methods could be used for text problems too – imagine AI helping to flag issues in legal documents by giving a short list of likely categories.

In summary, the MIT approach makes AI outputs **safer and more user-friendly** by replacing a single risky guess with a shortlist of answers that is refined using multiple versions of the image, so the list is shorter yet keeps the same guarantee of containing the right answer. This should help doctors (and others in high-stakes jobs) trust AI guidance and make better decisions.

## Explanation for a Technical Audience

**The problem (large sets vs. coverage):** In high-stakes tasks like medical diagnosis, we want models to quantify uncertainty **reliably**. Conformal prediction is a formal framework that converts a classifier’s point prediction into a **prediction set** with a coverage guarantee. Formally, for a chosen error rate \$\alpha\$, a conformal predictor ensures that the prediction set contains the true label with probability at least \$1-\alpha\$ (a marginal guarantee, averaged over exchangeable calibration and test data). However, this guarantee often comes at the cost of very large sets: “achieving a suitably strong guarantee often leads to prediction sets that are uninformatively large”. Large sets burden the end-user (e.g. a doctor) and diminish the practical value of conformal methods. Moreover, conformal predictors inherit the **instability** of the underlying model: tiny input perturbations (like a small rotation) can dramatically alter which classes meet the threshold, hurting consistency.

**Conformal prediction basics:** In typical split-conformal classification, we train a model \$f(x)\$ (e.g. a neural network) on a training set and set aside a calibration set \$\mathcal{D}_{\text{cal}}\$ of labeled examples. For a new input \$x\$, we compute a score for each class \$y\$ (often derived from the model’s probability for \$y\$) and include all classes whose score is at least a threshold \$\tau\$. The threshold \$\tau\$ is chosen, as a quantile of the true-class scores on \$\mathcal{D}_{\text{cal}}\$, so that the empirical coverage there is at least \$1-\alpha\$. This yields a valid coverage guarantee that requires only that calibration and test data be exchangeable, not that the model be well calibrated. But because scores for many classes might exceed \$\tau\$ when the model is unsure, the resulting set \$\lbrace y : \text{score}(x,y) \ge \tau \rbrace\$ can be very large.
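The calibration step above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the score is simply the softmax probability of a class (the simplest choice; APS or RAPS scores would slot in the same way), the data are synthetic, and the function names `calibrate_threshold`, `prediction_set`, and `sample_batch` are hypothetical.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha):
    """Pick tau so that sets {y : p_y >= tau} cover the truth with prob >= 1 - alpha."""
    n = len(cal_labels)
    # Score of each calibration example = probability assigned to its true class.
    true_scores = cal_probs[np.arange(n), cal_labels]
    # Finite-sample rank: the k-th smallest true-class score, k = floor(alpha * (n + 1)).
    k = int(np.floor(alpha * (n + 1)))
    if k < 1:
        return 0.0  # too few calibration points: fall back to including every class
    return float(np.sort(true_scores)[k - 1])

def prediction_set(probs, tau):
    """Indices of all classes whose probability is at least tau."""
    return np.flatnonzero(probs >= tau)

def sample_batch(n, num_classes, rng):
    """Synthetic stand-in for model outputs (an assumption, not real data)."""
    logits = rng.normal(scale=2.0, size=(n, num_classes))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Draw each label from the predicted distribution via inverse-CDF sampling.
    cdf = probs.cumsum(axis=1)
    labels = (rng.random((n, 1)) > cdf).sum(axis=1)
    return probs, np.minimum(labels, num_classes - 1)

rng = np.random.default_rng(0)
cal_probs, cal_labels = sample_batch(2000, 5, rng)
test_probs, test_labels = sample_batch(2000, 5, rng)

tau = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
covered = test_probs[np.arange(len(test_labels)), test_labels] >= tau
print("threshold:", tau)
print("empirical coverage:", covered.mean())
print("example set:", prediction_set(test_probs[0], tau))
```

On fresh data drawn the same way, the empirical coverage should land close to the 90% target; the set for any single input can still be large when its probabilities are spread out.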

**The MIT solution (TTA + Conformal):** The key idea is to use **Test-Time Augmentation (TTA)** to improve the quality of the classifier’s scores _before_ forming the conformal set. TTA is a well-known technique in computer vision that creates an ensemble of predictions by applying label-preserving transformations to the input. Formally, for each test image \$x\$, we generate augmented versions \$\lbrace a_0(x), a_1(x), \dots, a_M(x) \rbrace\$ (e.g. crops, flips, rotations). The pretrained model \$f\$ then produces probability distributions \$f(a_i(x))\$ for each augmentation. These are aggregated (for example, by averaging or a learned weighted combination) into a single probability vector \$\hat{f}(x)\$. This aggregation typically yields a more **robust and accurate** estimate of class probabilities.
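As a rough sketch of the aggregation just described, the snippet below averages a model's outputs over a few label-preserving views. Everything here is an assumption made for illustration: the "model" is a random linear classifier built by the hypothetical helper `make_toy_model`, and the augmentation list (identity, horizontal flip, one-pixel shifts) is arbitrary rather than the set used in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def make_toy_model(num_classes, image_shape, seed=0):
    """Hypothetical stand-in for a pretrained classifier f: image -> class probabilities."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_classes, image_shape[0] * image_shape[1]))
    def f(image):
        return softmax(W @ image.ravel())
    return f

# Illustrative label-preserving augmentations a_0, ..., a_M (a_0 is the identity).
AUGMENTATIONS = [
    lambda img: img,
    lambda img: np.fliplr(img),            # horizontal flip
    lambda img: np.roll(img, 1, axis=1),   # shift one pixel right (wraps around)
    lambda img: np.roll(img, -1, axis=1),  # shift one pixel left (wraps around)
]

def tta_probs(f, image, augmentations=AUGMENTATIONS, weights=None):
    """Aggregate f over augmented views; uniform averaging unless weights are given."""
    per_view = np.stack([f(a(image)) for a in augmentations])  # shape (M + 1, K)
    if weights is None:
        weights = np.full(len(augmentations), 1.0 / len(augmentations))
    return weights @ per_view  # convex combination, still a probability vector

f = make_toy_model(num_classes=4, image_shape=(3, 3), seed=1)
image = np.arange(9.0).reshape(3, 3)
print("single view:", f(image))
print("TTA average:", tta_probs(f, image))
```

Passing a `weights` vector that sums to one turns the plain average into a weighted combination, which is the form a learned aggregation could take.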

Concretely, Shanmugam _et al._ propose **Test-Time-Augmented Conformal Prediction**. They allocate part of the labeled validation data to **learn an augmentation policy**. That is, they learn how to combine the augmented predictions \$f(a_i(x))\$ to maximize classification accuracy on this held-out set. Importantly, they use _distinct_ splits of data for (1) learning the TTA aggregation function and (2) conformal calibration. This split preserves exchangeability and thus the coverage guarantee of the conformal method. No retraining of the base classifier \$f\$ is needed – the model is fixed, and only the inference procedure is enhanced.

**Algorithmic steps (high-level):**

1. **Train base model:** Train (or use a pretrained) classifier \$f(x)\$ on the available training set.
2. **Partition labeled data:** Split a labeled validation set into two parts: one for learning TTA weights and one for conformal calibration.
3. **Learn augmentation policy:** For each \$x\$ in the TTA-training subset, generate augmentations \$\lbrace a_i(x) \rbrace\$. Learn an aggregation function \$g\$ (e.g. weights or voting rule) so that the combined output \$g(f(a_0(x)), \dots, f(a_M(x)))\$ improves accuracy (for example, by maximizing the probability of the true class on these held-out examples).
4. **Aggregate test predictions:** For a new test image \$x^{\ast}\$, apply the same augmentations and aggregate with the learned \$g\$: \$\hat{p} = g(f(a_0(x^{\ast})), \dots, f(a_M(x^{\ast})))\$.
5. **Conformal scoring and thresholding:** Compute the conformal score (e.g. cumulative probability of top-\$k\$ classes) based on \$\hat{p}\$ and compare to the threshold \$\tau\$ determined on the calibration set. The final prediction set is \$\lbrace y : \text{score}(\hat{p},y) \ge \tau \rbrace\$.
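The steps above can be sketched end to end as follows. This is a toy under explicit assumptions, not the authors' code: the per-view probabilities are simulated by the hypothetical `simulate_views` instead of coming from a real model, the aggregation \$g\$ is a convex weighting of views learned by gradient descent on the true-class log-likelihood (one simple choice among many), and the conformal score is the aggregated probability itself.

```python
import numpy as np

def simulate_views(n, rng, num_classes=20, strengths=(0.5, 1.0, 3.0)):
    """Toy stand-in for f(a_m(x)): per-view probabilities of shape (n, M, K)."""
    labels = rng.integers(num_classes, size=n)
    logits = rng.normal(size=(n, len(strengths), num_classes))
    for m, s in enumerate(strengths):
        logits[np.arange(n), m, labels] += s  # view m favours the true class by s
    logits -= logits.max(axis=2, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=2, keepdims=True), labels

def learn_weights(view_probs, labels, steps=200, lr=0.5):
    """Step 3: learn simplex weights over views (softmax-parametrised gradient descent)."""
    n, m, _ = view_probs.shape
    p_true = view_probs[np.arange(n), :, labels]  # (n, M): true-class prob per view
    theta = np.zeros(m)                           # start from uniform weights
    for _ in range(steps):
        w = np.exp(theta - theta.max())
        w /= w.sum()
        q = p_true @ w                            # aggregated true-class probability
        grad_w = -(p_true / q[:, None]).mean(axis=0)
        theta -= lr * w * (grad_w - w @ grad_w)   # chain rule through the softmax
    w = np.exp(theta - theta.max())
    return w / w.sum()

def aggregate(view_probs, weights):
    """Step 4: weighted combination of the per-view probability vectors."""
    return np.einsum("nmk,m->nk", view_probs, weights)

def calibrate_threshold(probs, labels, alpha):
    """Step 5 (calibration): k-th smallest true-class score, k = floor(alpha * (n + 1))."""
    n = len(labels)
    true_scores = probs[np.arange(n), labels]
    k = int(np.floor(alpha * (n + 1)))
    return float(np.sort(true_scores)[k - 1]) if k >= 1 else 0.0

rng = np.random.default_rng(0)
alpha = 0.1
# Step 2: disjoint labeled splits for learning g and for calibration, plus test data.
tta_views, tta_labels = simulate_views(2000, rng)
cal_views, cal_labels = simulate_views(2000, rng)
test_views, test_labels = simulate_views(2000, rng)

weights = learn_weights(tta_views, tta_labels)
tau = calibrate_threshold(aggregate(cal_views, weights), cal_labels, alpha)
sets = aggregate(test_views, weights) >= tau      # boolean (n, K) membership matrix
coverage = sets[np.arange(len(test_labels)), test_labels].mean()
print("learned view weights:", weights)
print("coverage:", coverage, "| mean set size:", sets.sum(axis=1).mean())
```

Because the weights are fit on one split and the threshold on another, the calibration and test examples are treated identically by the fixed aggregation, which is what keeps the coverage argument intact.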

This approach yields **smaller sets with the same coverage**. By improving the base probabilities, especially the rank of the true class, the conformal threshold cuts off more low-probability classes. The authors explain that TTA often **increases the predicted probability of the true class** even when it was originally ranked poorly. For example, a class that was 200th-most likely might move to 100th. Such changes don’t affect the single top prediction but do allow the conformal set to **drop many incorrect classes** while still including the truth.
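A tiny numeric illustration of this rank argument follows. The two probability vectors are invented for the example (they are not from the paper), and `classes_needed` is a hypothetical helper that counts how many classes a threshold-style set must contain before it reaches the true class.

```python
import numpy as np

def classes_needed(probs, true_class):
    """Size of the smallest threshold set {y : p_y >= tau} that contains the true class."""
    return int(np.sum(probs >= probs[true_class]))

true_class = 4
# Made-up class probabilities for one image, before and after TTA aggregation.
before = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
after = np.array([0.38, 0.22, 0.12, 0.08, 0.16, 0.04])

print("top prediction before/after:", np.argmax(before), np.argmax(after))
print("classes needed before:", classes_needed(before, true_class))
print("classes needed after:", classes_needed(after, true_class))
```

In this made-up case the top prediction stays the same, but the true class climbs from fifth to third place, so a set that must reach it can be smaller.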

**Results:** Across benchmarks (ImageNet, iNaturalist, CUB-Birds, etc.), the TTA-enhanced method consistently reduced set sizes **by about 10–14% on average**. In the most challenging cases, reductions were as high as **30%**. Crucially, these gains come **with no loss of the coverage guarantee**. Classes that originally needed many guesses benefit the most. The paper also shows that a smaller model (e.g. ResNet-50 with TTA) can outperform a larger model (ResNet-101 without TTA) in terms of set size, effectively “bridging gaps” between model capacities. Under distribution shifts (corrupted images), TTA-augmented conformal prediction was **even more efficient**, thanks to the added robustness of augmentations.

An important practical point is that using some labeled data for TTA can pay off. The method “sacrifices” a fraction of the calibration data to learn the aggregation, but it improves the underlying predictions enough that the resulting sets are smaller at the same coverage level. This raises new questions about how best to allocate labeled data after model training. The researchers note that dedicating labeled examples to TTA can yield more efficient uncertainty estimates than naively using all of them for calibration.

**Implications and future work:** The proposed method is straightforward to implement: it “requires no model retraining” and can be added on top of any classifier or conformal scoring rule. This makes it attractive for deployment. The authors plan to extend their approach to other domains, such as text classification (e.g. AI review of legal documents) and to optimize the computation cost of TTA (since multiple inferences per example can be expensive). Overall, combining TTA with conformal prediction is shown to yield **smaller, more reliable prediction sets** in high-stakes settings, improving the **trustworthiness** of AI decisions.

**Key Points (Technical Summary):**

- **Tradeoff in conformal prediction:** To guarantee (say) 95% coverage, classical conformal may output unwieldy sets, especially when model confidence is low.
- **Test-Time Augmentation:** Apply label-preserving transformations to each test image and aggregate model outputs, improving accuracy and stability.
- **TTA-Conformal Pipeline:** Use a separate validation split to learn augmentation aggregation, then feed the combined prediction into the conformal scoring mechanism.
- **Results:** Up to 30% smaller prediction sets (10–14% average reduction) while **preserving coverage**, across image datasets. Classes with the largest original sets see the biggest shrinkage.
- **No retraining needed:** This method works with any pretrained classifier and any conformal scoring rule (e.g. RAPS/APS), requiring only extra inference time.
- **Data allocation:** Interestingly, the benefit of dedicating a portion of labeled data to learning TTA often outweighs the loss of calibration samples.
- **Future directions:** Apply to NLP models and optimize TTA efficiency for real-time use.

These innovations make conformal prediction more practical in critical applications (like medical imaging), where reliability and interpretability are paramount. By ensuring a tight, well-grounded set of predictions, doctors and other experts can make better-informed decisions.


## **Gemini Version**

### **Conformal Predictions**

Conformal prediction provides a set of possible outputs together with a high-confidence guarantee that the correct output lies within that set.

Conformal prediction is especially helpful when the input is ambiguous or when the model encounters data that is significantly different from what it was trained on.

Let's take an example from our project, Baali Bigyan. If we give the model an image of a dog, it will still assign it to one of its 10 classes with some confidence.

Sticking with this example, if we want to show that our model's predictions are reliable, then minor changes in the input data should not change the output.

Let's say we provide a fresh `Blast` leaf image for which our model predicts the correct output. Now, to check the model's robustness, we provide the same image with slight changes. If the model still predicts the correct output, we can be more confident that it has learned the relevant features; if not, we may have created a biased model.

This is where `Conformal Prediction` shows its strength: instead of a single prediction, it gives us a set of outputs that contains the correct one with high probability.

**Disadvantage**

However, a notable limitation of conformal classification is its potential to produce
prediction sets that are impractically large, thereby diminishing its utility for clinicians
in real-world scenarios.

Suppose we have a classification task with 1000 classes. Conformal prediction might narrow the candidates down to a prediction set of about 200 classes per image. Although this set very likely contains the correct output, it is still large and leaves our prediction ambiguous.

When there is this much ambiguity, we cannot rely on the prediction, especially in sensitive fields like medicine.

This necessitates the development of methods to refine the prediction sets generated by conformal classifiers, making them smaller and more informative without compromising their statistical validity.

### **Test-Time Augmentation (TTA)**

Test-time augmentation (TTA) is a technique employed in machine learning,
particularly in computer vision, to enhance the reliability and accuracy of predictions
made by trained models.

TTA involves presenting an AI model with multiple slightly modified versions of the same input image during the prediction phase.

These modifications, or augmentations, can include transformations such as slight rotations, cropping, horizontal or vertical flipping, and adjustments in zoom level. The AI model then generates a prediction for each of these augmented versions
of the original image.

Finally, these individual predictions are combined, or aggregated, to produce a more robust and accurate final prediction for the original, unaugmented image.

TTA is a widely adopted strategy in computer vision aimed at improving the accuracy
of models during the inference stage. It has been shown to enhance the overall
accuracy and robustness of predictions by making the model less sensitive to minor
variations or noise in the input data. By exposing the model to slightly different
perspectives of the same image, TTA encourages it to focus on the more consistent
and diagnostically relevant features, rather than being swayed by minor artifacts or
variations in image presentation. The process of aggregating predictions from
multiple augmented views often leads to a more reliable and accurate final output
compared to a prediction based on a single, unaugmented image.

### **Combining Conformal Classification and TTA**

The core of the MIT researchers' method involves applying test-time augmentation
(TTA) to the input medical images before subjecting them to the conformal
classification process. This initial step entails creating multiple augmented versions of
each original medical image using a variety of label-preserving transformations, such
as cropping different regions of the image, flipping it horizontally or vertically, and
applying slight zoom adjustments. Subsequently, the underlying AI model, which is
typically a pre-trained computer vision model, is used to generate a prediction for
each of these augmented images. This results in a set of predictions for each
original medical image, corresponding to its various augmented forms.

To effectively leverage the information from these multiple augmented views, the
researchers developed a process to learn how to optimally combine the individual
predictions to maximize the accuracy of the underlying AI model. This learning
process was conducted on a held-out portion of labeled image data, which is a
subset of data that would normally be used in the conformal classification step. By
using this held-out data, the researchers could train a mechanism to automatically
determine the most effective way to aggregate the predictions obtained from the
different augmented versions of the images. This learned aggregation strategy
allows for a more nuanced combination of the augmented predictions compared to a
simple averaging approach.

Once the TTA-transformed predictions were obtained and optimally aggregated, the
researchers then applied the conformal classification method to these aggregated
predictions. The rationale behind this approach is that by first improving the
accuracy and robustness of the underlying predictions through the application of TTA
and a learned aggregation strategy, the subsequent conformal classification step can
then generate a smaller, more focused set of probable diagnoses while still
maintaining the crucial guarantee of including the correct diagnosis within that set.
As Divya Shanmugam aptly stated, with fewer classes to consider in the prediction
set, the results become naturally more informative.
