### Abstract

- Deep learning has gained significant attention in medical image segmentation.
- However, the limited availability of training data remains a major challenge, particularly in the medical field where data acquisition can be costly and subject to privacy regulations.
- In an effort to overcome this challenge, traditional data augmentation techniques have been proposed, but these techniques often produce limited and unconvincing results.
- For segmentation tasks, providing both images and their corresponding target masks is crucial.
- Furthermore, 3D medical image synthesis is impractical due to its tremendous memory cost and training difficulty, especially if the mask is synthesized too.
- To this end, we propose a novel yet simple slice-based latent diffusion architecture that reconstructs the volume slice by slice and models the joint distribution of medical images and their corresponding masks, so that both are generated simultaneously.
- Furthermore, our architecture allows conditioning on the tumor, giving control over its relative position, size, and shape.

- Our approach avoids the computational expense that makes diffusion models hard to use in practice, especially for the synthesis of 3D medical images.

- Experiments conducted on the BRATS dataset demonstrate the quality of the synthesized images and their efficacy in augmenting the training data for segmentation tasks.

### Abstract

Despite the increasing use of deep learning in medical image segmentation, the limited availability of annotated training data remains a major challenge, particularly in the medical field where data acquisition can be time-consuming and subject to privacy regulations. While conventional approaches to data augmentation have been proposed, they often produce limited and suboptimal results. In the context of segmentation tasks, providing both medical images and their corresponding target masks is essential. In this study, we introduce a novel slice-based latent diffusion architecture designed to address the complexities of volumetric data reconstruction, performed in a slice-by-slice fashion. This approach extends the joint distribution modeling of medical images and their associated masks, allowing both to be generated simultaneously. Additionally, this approach effectively mitigates the computational complexity and memory cost typically associated with diffusion models, making them more practical, especially for the synthesis of 3D medical images. Furthermore, our architecture allows conditioning on tumor characteristics, providing control over tumor size, shape, and relative position. Importantly, our method demonstrates improved image quality and mode coverage compared to approaches based on Generative Adversarial Networks (GANs) while training efficiently on limited data, resulting in competitive 3D MRI synthesis performance. Empirical experiments on a segmentation task using the BRATS2023 dataset confirm the effectiveness of the synthesized images and masks in improving segmentation, despite the limited training data.


Deep learning has witnessed remarkable growth in the domain of medical imaging, demonstrating notable effectiveness in image segmentation across various imaging modalities, including MRI (citations). However, the ongoing challenge of limited access to annotated medical imaging data is a major hurdle, primarily stemming from the rarity of certain medical conditions and the imposition of stringent medical privacy regulations. This limitation leads to a laborious and time-consuming manual delineation of tumor masks by medical professionals.

In this context, data augmentation has emerged as an integral component of deep learning, enabling models to overcome the constraints associated with a scarcity of training samples and generalize more effectively (Krizhevsky, 2017). Nevertheless, in the complex realm of medical imaging, conventional augmentation techniques may introduce deformations and generate anomalous data, resulting in deviations from the true data distribution.

To address these challenges, advanced deep learning-based data augmentation techniques have been proposed, striving to generate synthetic samples that closely resemble real data while preserving the semantic integrity of medical images (Kebaili et al., 2023; Song et al., 2021; Krizhevsky, 2017). These models also offer privacy preservation and data anonymization.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have found widespread application in medical imaging (Wang et al., 2023; Trullo et al., 2019) and have been advocated in numerous literature reviews for data augmentation due to their ability to generate realistic images (Chen et al., 2022). However, GANs exhibit certain limitations, including learning instability, convergence issues, and the well-documented problem of mode collapse (Mescheder et al., 2018), where the generator produces a limited range of samples.

In contrast, Variational Autoencoders (VAEs) (Kingma and Welling, 2013) have been proposed as an alternative to GANs, offering a more stable training process and a more efficient inference procedure. However, VAEs are known to produce blurry images (Chen et al., 2022) and are incapable of generating high-resolution images (Chen et al., 2022).

Recently, diffusion models have emerged as a promising method for image synthesis, offering superior image quality and realism compared to GANs while maintaining a good mode coverage, just like VAEs. This has led to the rise of these models and the development of various alternatives, such as the Latent Diffusion Model (LDM) (Ho et al., 2023).

While these models provide a compelling solution to the problem of limited training data, one common issue is their computational cost and high memory requirements, rendering them impractical for 3D medical image synthesis. This is especially true for diffusion-based models, which are more resource-intensive, making them challenging to deploy in clinical routine, particularly for tasks like data harmonization or imputation. Furthermore, existing literature predominantly focuses on 2D image generation (Kebaili et al., 2023).

In addition, recent studies have primarily concentrated on image generation or translation, which, in the context of tumor segmentation, is insufficient. The importance lies in generating both images and their corresponding tumor masks, as these masks serve as ground truth for segmentation tasks, adding complexity and cost to the generation process, as we must generalize not only the medical image but also the associated mask.

In this study, we introduce an optimized and lightweight variant of latent diffusion models, employing a slice-by-slice approach for the simultaneous generation of medical images and corresponding segmentation masks. Our architecture is trained under data limitations, and we demonstrate its efficacy in augmenting training data for segmentation tasks. Moreover, our model allows for precise control of tumor size, shape, and relative position—a vital feature for tumor segmentation tasks. This feature also serves as regularization to our model, enhancing supervision and mitigating overfitting. Our evaluation encompasses the quality of generated images, followed by a comprehensive assessment in the context of 3D segmentation tasks using synthesized volumes.

Our contributions encompass:
1. Introduction of an efficient, slice-by-slice diffusion model for the simultaneous generation of high-quality medical images and associated segmentation masks.
2. Highlighting the strength of our approach in data-scarce environments, in contrast to the data-intensive nature of GAN-based architectures.
3. Comprehensive evaluation showcasing the effectiveness of our high-quality synthesized MRIs in enhancing segmentation tasks.

### Introduction

Deep learning has seen considerable growth in medical imaging, proving highly effective for medical image segmentation across several imaging modalities, including MRI (citations). However, the limited availability of annotated medical imaging data poses a significant challenge, stemming from the rarity of certain pathologies and rigorous medical privacy regulations. Consequently, manual delineation of tumor masks by physicians becomes a laborious and time-consuming process.

In this context, data augmentation has emerged as an inseparable part of deep learning, enabling models to overcome the limitations of low sample size regimes, and generalize better \cite{krizhevsky2017imagenet}. However, when dealing with the complex structures inherent in medical images, conventional augmentation operations may introduce image deformations and generate aberrant data, leading to misrepresentations of the underlying data distribution.

Advanced deep learning-based data augmentation techniques have been proposed to address these challenges, aiming to generate synthetic samples that closely resemble real data while preserving the semantic integrity of medical images \cite{kebaili2023deep, song2021deep, krizhevsky2017imagenet}. These models offer ... privacy preservation and anonymization of the data.

Generative adversarial networks (GANs) \cite{goodfellow2014generative} have found wide application in the field of medical imaging \cite{wang2023fedmed,trullo2019multiorgan} and have been recommended in numerous literature reviews on data augmentation due to their ability to generate realistic images \cite{chen2022generative}. However, GANs exhibit certain limitations, including learning instability, convergence difficulties, and the well-known issue of mode collapse \cite{mescheder2018training}, wherein the generator produces a restricted range of samples.

Variational Autoencoders (VAEs) \cite{kingma2013auto} have been proposed as an alternative to GANs, offering a more stable training process and a more efficient inference procedure. However, VAEs are known to produce blurry images \cite{chen2022generative} and are unable to generate high-resolution images \cite{chen2022generative}.

Diffusion models have recently emerged as a promising method for image synthesis, offering better image quality and realism than GANs while retaining good mode coverage, since they are also likelihood-based models like VAEs. This has driven the rise of these models and the emergence of several alternatives, such as the latent diffusion model (LDM) \cite{ho2023denoising}.

These models offer a promising solution to the problem of limited training data, but one common issue is that they are often computationally expensive and require large amounts of memory, making them impractical for 3D medical image synthesis. This is especially true of diffusion-based models, which are more resource-hungry and therefore harder to deploy in clinical routine, notably for useful tasks such as harmonization or data imputation <verify this information if data imputation is in its good context>. Moreover, in the current state of the art, almost all research focuses on 2D generation \cite{kebaili2023deep}. Furthermore, recent studies have primarily focused on image generation or translation \cite{gan2022esophageal, liang2021data}, but in the context of tumor segmentation \cite{niyas2022medical}, generating images alone is insufficient. It is crucial to generate both images and their corresponding tumor masks, which serve as ground truth for segmentation tasks. This makes the generation process even more complex and costly, because we need to model not only the medical image but also the associated mask.

In this study, we present an optimized and lightweight variant of latent diffusion models in which the autoencoder operates on a slice-by-slice basis for the simultaneous generation of medical images and corresponding segmentation masks. Our architecture has been trained under a limited data regime, and we demonstrate its effectiveness in augmenting the training data for segmentation tasks. We also show that our model can be used to control the size, shape, and relative position of the tumor, which is a crucial feature for tumor segmentation tasks. This conditioning also serves as a regularization for our model and improves supervision to avoid overfitting. We evaluate <first the quality of the images> .. <then we evaluate on a 3D segmentation task using synthesized volumes>.

Our contributions are as follows:
<continue it ...>





### Method
#### Latent diffusion models ...
#### Slice-based diffusion models:

- Our approach utilizes a 2D Autoencoding architecture augmented with positional embedding to encode volumes slice by slice. This enables the synthesis of 3D volumes along with their corresponding segmentation masks.
- This is intended to overcome the difficulties of training diffusion models on 3D data, notably the computational cost and memory required to implement a 3D autoencoder, but also the amount of data needed to avoid overfitting.
- Training 3D models is inherently complex because of their excessive computational demands and their need for large amounts of data in order to generalize well.
- Sacrificing the three-dimensional aspect of the volumes in order to break the problem down into a simpler one (2D reconstruction) can help make better use of the limited resources and data available to us, yielding acceptable improvements in reconstructing high-quality volumes in a simple and effective manner.
- The positional embedding is a technique that allows the model to learn the relative position of each slice in the volume. This additional level of supervision allows the autoencoder to be more aware of the spatial context of the volume, which is essential for the synthesis of realistic volumes.
- The autoencoder is trained to encode pairs of images and segmentation masks into a lower-dimensional space denoted $z_i$, where $i$ is the slice level. This latent space $z_i$ contains features shared between the image and the segmentation mask at the same level.
- The latent spaces are then combined to form a 3D latent space $z$, and the diffusion model is trained to learn the joint distribution of images and segmentation masks in this 3D latent space $z$. Figure \ref{fig:latent_space} illustrates the architecture of our model, from encoding to diffusion.
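
The slice-by-slice encoding described in this list can be illustrated with a minimal NumPy sketch. It is not the trained model: `toy_encode_slice` is a hypothetical stand-in that average-pools the image and mask instead of running a learned 2D encoder, and the sinusoidal form of `slice_position_embedding` is an assumption, since the notes do not specify how the positional embedding is built. The sketch only shows the data flow: each image/mask pair becomes one latent $z_i$, and the $z_i$ are stacked into a 3D latent $z$.

```python
import numpy as np

def slice_position_embedding(index: int, num_slices: int, dim: int = 8) -> np.ndarray:
    """Sinusoidal embedding of a slice's relative position (assumed form)."""
    pos = index / max(num_slices - 1, 1)       # relative position in [0, 1]
    freqs = 2.0 ** np.arange(dim // 2)         # geometric ladder of frequencies
    angles = np.pi * pos * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def toy_encode_slice(image: np.ndarray, mask: np.ndarray, factor: int = 4) -> np.ndarray:
    """Hypothetical stand-in for the learned 2D encoder: joint average pooling."""
    pair = np.stack([image, mask], axis=0)     # (2, H, W): image and mask share one latent
    c, h, w = pair.shape
    return pair.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def encode_volume(volume: np.ndarray, masks: np.ndarray, factor: int = 4):
    """Encode each axial slice independently, then stack the z_i into a 3D latent z."""
    num_slices = volume.shape[0]
    latents, embeddings = [], []
    for i in range(num_slices):
        latents.append(toy_encode_slice(volume[i], masks[i], factor))
        # In a real model the embedding would be fed to the encoder; here it is
        # only computed alongside each slice to show what the encoder would receive.
        embeddings.append(slice_position_embedding(i, num_slices))
    z = np.stack(latents, axis=1)              # (C, S, H/factor, W/factor)
    return z, np.stack(embeddings)             # embeddings: (S, dim)

rng = np.random.default_rng(0)
volume = rng.normal(size=(16, 64, 64))
masks = (rng.random(size=(16, 64, 64)) > 0.9).astype(np.float64)
z, pos = encode_volume(volume, masks)
print(z.shape, pos.shape)  # (2, 16, 16, 16) (16, 8)
```

The 3D diffusion model would then operate on `z`, which is much smaller than the input volume and its mask.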

#### Conditioning on tumor characteristics:
... to be revised

Conditioning a deep learning model with labels, often referred to as label conditioning, is a technique that involves providing additional information (labels) to the model during training or inference. This additional information can offer several benefits depending on the specific task and architecture. Here are some benefits of conditioning a model with labels in deep learning:

Improved Supervision: Label conditioning provides a more explicit form of supervision. In tasks like classification, object detection, or semantic segmentation, labels serve as ground truth, guiding the model towards learning the correct associations between input data and corresponding classes or objects.

Regularization: Label conditioning can act as a form of regularization. It constrains the model's predictions to align more closely with the provided labels, reducing overfitting and improving generalization.
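
As a concrete illustration of what a tumor-characteristics label could look like, the sketch below derives a small conditioning vector from a binary 3D mask. The descriptor layout (normalized centroid, volume fraction, bounding-box extent) and the function name `tumor_condition_vector` are assumptions made for illustration; the notes do not specify how position, size, and shape are encoded.

```python
import numpy as np

def tumor_condition_vector(mask: np.ndarray) -> np.ndarray:
    """Summarize a binary 3D mask as [cz, cy, cx, size, ez, ey, ex], all in [0, 1]."""
    mask = mask.astype(bool)
    shape = np.array(mask.shape, dtype=np.float64)
    if not mask.any():
        return np.zeros(7)                          # no tumor: all-zero condition
    coords = np.argwhere(mask)                      # (N, 3) voxel indices
    centroid = (coords.mean(axis=0) + 0.5) / shape  # relative position
    size = mask.sum() / mask.size                   # fraction of the volume occupied
    extent = (coords.max(axis=0) - coords.min(axis=0) + 1) / shape  # bounding-box shape
    return np.concatenate([centroid, [size], extent])

mask = np.zeros((8, 16, 16), dtype=np.uint8)
mask[2:6, 4:12, 8:16] = 1                           # a box-shaped toy "tumor"
cond = tumor_condition_vector(mask)
print(np.round(cond, 3))  # [0.5 0.5 0.75 0.125 0.5 0.5 0.5]
```

A vector of this kind could be supplied during training and then edited at sampling time to request a tumor at a different position or of a different size.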

## Effects of conditioning

Semantic Understanding: Labels can help the model develop a deeper understanding of the data by providing semantic information. This understanding can be leveraged for tasks like natural language understanding, where labels can represent the meaning or intent of a sentence.

Multi-Modal Learning: For tasks that involve multiple modalities, such as text and images, label conditioning allows the model to learn cross-modal relationships. This is useful in applications like image captioning, where the model generates textual descriptions conditioned on the visual content.

Semi-Supervised Learning: In scenarios where labeled data is scarce but unlabeled data is abundant, label conditioning can be used in semi-supervised learning. By leveraging the small amount of labeled data, the model can learn from both labeled and unlabeled examples, potentially improving performance.

Adaptation to Specific Classes: In fine-grained classification tasks, label conditioning helps the model adapt to specific classes, making it more accurate in distinguishing between closely related categories.

Controlled Generation: For generative models like GANs (Generative Adversarial Networks), conditioning with labels allows you to control the generated samples. For example, in conditional image generation, you can specify the desired class or attributes for the generated images.

Data Augmentation: Labels can be used to guide data augmentation techniques, ensuring that augmented samples remain within the same class or category. This helps in data diversification without introducing label noise.

Structured Outputs: In structured prediction tasks like sequence-to-sequence models, labels can provide information about the structure of the output, helping the model generate structured and meaningful sequences.

Task Adaptability: Label conditioning makes it easier to adapt a pre-trained model to a new task. You can fine-tune the model with labeled data specific to the new task, preserving the knowledge learned from the original task.

### Possible arguments

- We do not use alternating 1D convolutions to capture the volumetric aspect; we directly use a standard 3D convolutional UNet for the diffusion.
- Our architecture is attention-free. We assume that attention greatly increases computational cost, especially on 3-dimensional data. Furthermore, attention struggles to converge on low-sample-size datasets and needs more data in order to scale. We thus argue that using a UNet with attention in such data-scarce regimes won't bring much improvement and, more importantly, can encourage overfitting. The same reasoning applies to the autoencoding architecture.
- Attention has quadratic complexity: applied to 3D volumes in the case of DDPMs, its memory and compute cost grows quadratically with the number of voxels in the volume. This makes attention impractical for 3D medical image synthesis, especially at high resolution. A simple 3D convolutional UNet with attention on the last layer already reaches 97 GB of VRAM, which exceeds the memory of most modern GPUs. This makes it impractical to use in real-life applications.
- For 2D latent diffusion models, attention is employed at resolutions 8, 4, and 2 with 32 heads. Since this configuration is not possible with 3D models, we keep attention only at the last resolution in order to demonstrate its impact on computational efficiency.
- By leveraging a 2D Autoencoding architecture supplied with a positional embedding to encode volumes in a slice-by-slice fashion, we are able to synthesize 3D volumes with corresponding segmentation almost instantaneously.
- Our approach utilizes a 2D Autoencoding architecture augmented with positional embedding to encode volumes slice by slice. This enables the rapid synthesis of 3D volumes along with their corresponding segmentation masks.
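
To make the quadratic cost of attention concrete, the following back-of-the-envelope sketch counts the bytes needed for a single self-attention score matrix over every spatial position. It is an assumption-laden estimate (one sample, one head, 32-bit floats, no activations kept for backpropagation), so it does not reproduce the VRAM figure quoted above; it only shows how quickly the score matrix grows when moving from 2D to 3D feature maps.

```python
def attention_map_bytes(spatial_shape, num_heads=1, bytes_per_value=4):
    """Bytes for one self-attention score matrix (N x N per head, N = number of positions)."""
    n = 1
    for s in spatial_shape:
        n *= s
    return num_heads * n * n * bytes_per_value

for shape in [(16, 16), (16, 16, 16), (32, 32, 32)]:
    gib = attention_map_bytes(shape) / 2**30
    print(shape, f"{gib:.4f} GiB")
# (16, 16) 0.0002 GiB
# (16, 16, 16) 0.0625 GiB
# (32, 32, 32) 4.0000 GiB
```

Doubling each side of a 3D feature map multiplies the number of positions by 8 and the score matrix by 64, which is why attention is usually restricted to the lowest resolutions.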

- Drawback:
Reconstructing slice-wise volumes can reduce quality when viewing the volume in sagittal and coronal views, since the images are decoded in the axial view. Pixels are therefore continuous in the axial view, but not in the sagittal and coronal views, which can be a problem for modalities where those views are important, such as PET imaging.
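
One simple way to quantify this drawback is to compare intensity jumps between neighboring voxels along the slice axis with those inside each slice. The sketch below is a hypothetical diagnostic, not part of the evaluation described in these notes; the toy volume and the alternating per-slice offset are illustrative assumptions that mimic slices decoded independently of one another.

```python
import numpy as np

def mean_abs_gradient(volume: np.ndarray) -> dict:
    """Mean absolute difference between neighboring voxels along each axis."""
    names = ("through_slice", "in_plane_y", "in_plane_x")
    return {name: float(np.abs(np.diff(volume, axis=ax)).mean())
            for ax, name in enumerate(names)}

# Smooth toy volume: a linear ramp, so every neighboring difference equals 1.
z, y, x = np.meshgrid(np.arange(16), np.arange(32), np.arange(32), indexing="ij")
smooth = (z + y + x).astype(np.float64)

# Same volume with an alternating +3 / -3 intensity offset per axial slice.
offsets = (3.0 * (-1.0) ** np.arange(16)).reshape(16, 1, 1)
slice_wise = smooth + offsets

g_smooth = mean_abs_gradient(smooth)
g_slice = mean_abs_gradient(slice_wise)
print(g_smooth)
print(g_slice)
```

In the second volume the in-plane gradients are unchanged while the through-slice gradient is several times larger, which is the kind of discontinuity that would show up as banding in sagittal and coronal views.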

### Computation efficiency of diffusion models

#### Autoencoders
- **3D VAE + Discriminator (Original LDM)**: 15.324TFLOPS, 85.949M + 14.745M (~101M) params
- **Slice-based VAE (Ours)**: 195.996GFLOPS, 27.082M params

#### Diffusion models
- **3D DDPM with Attention at resolution = 16**: 33.432TFLOPS, 268.214M params
- **3D LDM**: 2.180TFLOPS, 627.556M + 101M (Autoencoder) = ~728M params
- **3D LDM reduced number of filters**: 545.149GFLOPS, 156.922M + 101M (Autoencoder) = ~258M params
- **Slice-based LDM (Ours)**: 523.735GFLOPS, 132.101M + 27.082M (Autoencoder) = ~159M params
- **Slice-based LDM with 2D diffusion (Ours)**: 588.329GFLOPS, 45.247M + 27.082M (Autoencoder) = ~72M params
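
The relative savings implied by these figures can be checked with a few lines of arithmetic. The numbers below are copied from the two lists in this section, with the compute figures expressed in the same GFLOPS units and parameters in millions; the choice of the full 3D LDM as baseline is ours, and the autoencoder is counted as 101M for the 3D variants, as in the list.

```python
# (GFLOPS, millions of parameters including the autoencoder), copied from the lists above.
models = {
    "3D LDM": (2180.0, 627.556 + 101.0),
    "3D LDM, reduced filters": (545.149, 156.922 + 101.0),
    "Slice-based LDM": (523.735, 132.101 + 27.082),
    "Slice-based LDM, 2D diffusion": (588.329, 45.247 + 27.082),
}

baseline_flops, baseline_params = models["3D LDM"]
summary = {}
for name, (gflops, params) in models.items():
    # Reduction factors relative to the full 3D LDM baseline.
    summary[name] = (baseline_flops / gflops, baseline_params / params)
    print(f"{name:32s} {params:8.3f}M params  "
          f"{summary[name][0]:5.2f}x fewer FLOPs  {summary[name][1]:5.2f}x fewer params")

# Autoencoders: 3D VAE + discriminator versus the slice-based VAE.
ae_flops_ratio = 15324.0 / 195.996
ae_param_ratio = (85.949 + 14.745) / 27.082
print(f"autoencoder: {ae_flops_ratio:.1f}x fewer FLOPs, {ae_param_ratio:.2f}x fewer params")
```

On these figures, the slice-based autoencoder accounts for the largest saving in compute, while the 2D-diffusion variant has the smallest total parameter count.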