# **Data Science Project: Brain Tumor Detection (BTD) from MRI Scans**

### **Authors: Sofiene HERMI, Hajer JRAD, Riadh ZIDI**

# **1. Phase 1: Business Understanding**

### **1. Medical Context**
Detecting and characterizing brain tumors is a major public health challenge, and MRI (Magnetic Resonance Imaging) is the primary diagnostic tool for both tasks. The three tumor types in our dataset have distinct characteristics:

- **Glioma**: Primary brain tumor developing from glial cells; often aggressive and requires prompt treatment.
- **Meningioma**: Usually benign tumor arising from the meninges; more common in women.
- **Pituitary tumor**: Tumor of the pituitary gland, usually benign but can affect hormone levels.

### **2. Project Objectives**
**Main objective:**
Develop an automatic classification system that performs:

- **Binary detection:** Identify presence or absence of a tumor.
- **Multi-class classification:** If a tumor is detected, determine its type among the three categories.

**Secondary objectives:**

- Assist radiologists in early diagnosis
- Reduce MRI analysis time
- Provide a second opinion for validation
- Improve diagnostic accessibility in under-resourced areas

**Success criteria:**

- **Overall accuracy:** > 90% on the test set
- **Sensitivity (Recall):** > 95% for tumor detection (minimize false negatives)
- **Specificity:** > 90% to avoid false positives
- Balanced confusion matrix across classes
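
As a rough illustration of how these criteria could be checked, the sketch below computes accuracy, sensitivity, and specificity from the four counts of a binary confusion matrix. The helper name `binary_metrics` and the example counts are hypothetical; they are not results from this project.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return accuracy, sensitivity (recall) and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # share of tumor scans correctly flagged
    specificity = tn / (tn + fp)   # share of healthy scans correctly cleared
    return accuracy, sensitivity, specificity


# Hypothetical counts, for illustration only
acc, sens, spec = binary_metrics(tp=96, fn=4, fp=8, tn=92)
meets_criteria = acc > 0.90 and sens > 0.95 and spec > 0.90
print(f"accuracy={acc:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}, ok={meets_criteria}")
```

Sensitivity is held to the strictest threshold because a missed tumor (false negative) is the most costly error in this setting.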

**Constraints & considerations:**

- **Medical ethics:** The model is an aid, not a replacement for clinical diagnosis
- **Interpretability:** Importance of understanding model decisions
- **Class imbalance:** Handle the lower number of 'no_tumor' images

# **2. Phase 2: Data Understanding**

Before any transformation, an exploratory data analysis (EDA) is necessary to validate image quality.

**Distribution analysis:** The imbalance between the tumor classes and the "no_tumor" class may bias the model toward the majority classes and distort its error rates (e.g., more false positives on healthy scans when "no_tumor" is underrepresented).

**Visual variance analysis:** Images in the "no_tumor" class show variable dimensions, whereas tumor images are often 512x512. A uniform resizing step is required to ensure consistent input tensors.
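
A minimal sketch of what uniform resizing achieves: images of different shapes are mapped to one fixed size so they can be stacked into a single tensor. The notebook itself relies on `cv2.resize`; nearest-neighbour indexing is swapped in here only so that the sketch depends on NumPy alone. The helper `resize_nearest` and the synthetic arrays are assumptions for illustration.

```python
import numpy as np


def resize_nearest(img, size):
    """Resize a 2-D grayscale array to (height, width) by nearest-neighbour sampling."""
    new_h, new_w = size
    h, w = img.shape
    rows = np.arange(new_h) * h // new_h   # source row for each output row
    cols = np.arange(new_w) * w // new_w   # source column for each output column
    return img[rows[:, None], cols]


# Two synthetic "scans" with different shapes (stand-ins for real MRI slices)
scans = [np.random.randint(0, 256, (512, 512), dtype=np.uint8),
         np.random.randint(0, 256, (236, 350), dtype=np.uint8)]
batch = np.stack([resize_nearest(s, (128, 128)) for s in scans])
print(batch.shape)  # (2, 128, 128)
```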

**Integrity check:** Visual inspection using pixel histograms helps verify that brightness and contrast are consistent across folders, preventing the model from learning acquisition artifacts instead of pathology.
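
One way such a check could be quantified is to average the normalized intensity histograms of each folder and measure how far apart they are. The sketch below uses synthetic arrays in place of real folders; `mean_histogram` is a hypothetical helper and the L1 distance is just one possible comparison.

```python
import numpy as np


def mean_histogram(images, bins=32):
    """Average the normalized intensity histograms of a list of 8-bit grayscale images."""
    hists = []
    for img in images:
        counts, _ = np.histogram(img, bins=bins, range=(0, 256))
        hists.append(counts / counts.sum())
    return np.mean(hists, axis=0)


rng = np.random.default_rng(0)
# Synthetic stand-ins for two folders; folder_b is deliberately brighter
folder_a = [rng.integers(0, 128, (64, 64)) for _ in range(5)]
folder_b = [rng.integers(64, 192, (64, 64)) for _ in range(5)]

hist_a, hist_b = mean_histogram(folder_a), mean_histogram(folder_b)
# L1 distance between the average histograms: 0 means identical, 2 means disjoint
shift = np.abs(hist_a - hist_b).sum()
print(f"histogram distance: {shift:.2f}")
```

A large distance between folders that should look alike would suggest an acquisition or preprocessing difference worth inspecting before training.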

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Display configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Dataset path
BASE_PATH = '/kaggle/input/brain-tumor-classification-mri'



## **2.1 DATASET STRUCTURE EXPLORATION**

In [None]:
dataset_info = {
    'Training': defaultdict(int),
    'Testing': defaultdict(int)
}

image_dimensions = {
    'Training': defaultdict(list),
    'Testing': defaultdict(list)
}

# Walk through the Training and Testing splits
for split in ['Training', 'Testing']:
    split_path = os.path.join(BASE_PATH, split)
    
    if not os.path.exists(split_path):
        print(f"\n The folder {split} does not exist!")
        continue
        
    print(f"\n Folder: {split}")
    print("-" * 70)
    
    # Iterate over the class folders
    for class_name in os.listdir(split_path):
        class_path = os.path.join(split_path, class_name)
        
        if os.path.isdir(class_path):
            # Count the images
            images = [f for f in os.listdir(class_path) 
                     if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
            count = len(images)
            dataset_info[split][class_name] = count
            
            print(f"  ├─ {class_name:25s}: {count:4d} images")
            
            # Sample up to 50 images to analyze their dimensions
            sample_size = min(50, count)
            for img_name in images[:sample_size]:
                img_path = os.path.join(class_path, img_name)
                try:
                    with Image.open(img_path) as img:
                        image_dimensions[split][class_name].append(img.size)
                except (OSError, ValueError):
                    # Skip unreadable or corrupt files
                    pass
    
    total = sum(dataset_info[split].values())
    print(f"  └─ {'TOTAL':25s}: {total:4d} images")


## **2.2 IMAGE DIMENSIONS ANALYSIS**

In [None]:
for split in ['Training', 'Testing']:
    print(f"\n {split}:")
    print("-" * 70)
    
    for class_name, dims in image_dimensions[split].items():
        if dims:
            widths = [d[0] for d in dims]
            heights = [d[1] for d in dims]
            
            unique_dims = set(dims)
            
            print(f"\n  {class_name}:")
            print(f"    Unique dimensions: {len(unique_dims)}")
            
            if len(unique_dims) == 1:
                print(f"    Uniform size: {dims[0][0]}x{dims[0][1]}")
            else:
                print(f"    Width   - Min: {min(widths):4d}, Max: {max(widths):4d}, Mean: {np.mean(widths):.1f}")
                print(f"    Height  - Min: {min(heights):4d}, Max: {max(heights):4d}, Mean: {np.mean(heights):.1f}")
                print(f"    Most frequent dimensions: {max(set(dims), key=dims.count)}")


## **2.3 CLASS IMBALANCE ANALYSIS**

In [None]:
for split in ['Training', 'Testing']:
    print(f"\n {split}:")
    print("-" * 70)
    
    counts = dataset_info[split]
    if not counts:
        continue
        
    total = sum(counts.values())
    
    # Build a DataFrame for the analysis
    df = pd.DataFrame({
        'Class': list(counts.keys()),
        'Count': list(counts.values()),
        'Percentage': [v/total*100 for v in counts.values()]
    })
    
    df = df.sort_values('Count', ascending=False)
    print(df.to_string(index=False))
    
    # Imbalance ratio
    max_count = max(counts.values())
    min_count = min(counts.values())
    imbalance_ratio = max_count / min_count if min_count > 0 else 0
    
    print(f"\n  Imbalance ratio: {imbalance_ratio:.2f}:1")
    print("  (Majority class / Minority class)")


## **2.4 SAMPLE VISUALIZATION**

In [None]:
split = 'Training'
split_path = os.path.join(BASE_PATH, split)
classes = sorted([d for d in os.listdir(split_path) 
                 if os.path.isdir(os.path.join(split_path, d))])

samples_per_class = 3

fig, axes = plt.subplots(len(classes), samples_per_class, 
                         figsize=(15, 4*len(classes)))
fig.suptitle('Sample images per class (Training)', 
             fontsize=16, fontweight='bold')

for i, class_name in enumerate(classes):
    class_path = os.path.join(split_path, class_name)
    images = [f for f in os.listdir(class_path) 
             if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    
    # Pick random samples
    samples = np.random.choice(images, 
                              min(samples_per_class, len(images)), 
                              replace=False)
    
    for j, img_name in enumerate(samples):
        img_path = os.path.join(class_path, img_name)
        img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        
        ax = axes[i, j] if len(classes) > 1 else axes[j]
        ax.imshow(img, cmap='gray')
        ax.axis('off')
        
        if j == 0:
            ax.set_title(f"{class_name}\n{img.shape}", 
                       fontsize=10, fontweight='bold')
        else:
            ax.set_title(f"{img.shape}", fontsize=9)

plt.tight_layout()
plt.show()


## **2.5 PIXEL STATISTICAL ANALYSIS**

In [None]:
split = 'Training'
split_path = os.path.join(BASE_PATH, split)
classes = sorted([d for d in os.listdir(split_path) 
                 if os.path.isdir(os.path.join(split_path, d))])

sample_size = 100
stats = {}

for class_name in classes:
    class_path = os.path.join(split_path, class_name)
    images = [f for f in os.listdir(class_path) 
             if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    
    # Draw a random sample of images
    samples = np.random.choice(images, 
                              min(sample_size, len(images)), 
                              replace=False)
    
    pixel_values = []
    
    for img_name in samples:
        img_path = os.path.join(class_path, img_name)
        img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        # Keep each image as a flat NumPy array (cheaper than extending a Python list pixel by pixel)
        pixel_values.append(img.ravel())
    
    pixel_values = np.concatenate(pixel_values)
    
    stats[class_name] = {
        'mean': np.mean(pixel_values),
        'std': np.std(pixel_values),
        'min': np.min(pixel_values),
        'max': np.max(pixel_values),
        'median': np.median(pixel_values)
    }

# Display the results
print(f"\n Statistics (on up to {sample_size} images per class):")
print("-" * 70)

df_stats = pd.DataFrame(stats).T
df_stats = df_stats.round(2)
print(df_stats)


## **2.6 PIXEL INTENSITY DISTRIBUTION**

In [None]:
print("\n Generating distribution histograms...")

split = 'Training'
split_path = os.path.join(BASE_PATH, split)
classes = sorted([d for d in os.listdir(split_path) 
                 if os.path.isdir(os.path.join(split_path, d))])

sample_size_hist = 50

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Pixel intensity distribution per class', 
             fontsize=16, fontweight='bold')
axes = axes.flatten()

for idx, class_name in enumerate(classes):
    class_path = os.path.join(split_path, class_name)
    images = [f for f in os.listdir(class_path) 
             if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    
    samples = np.random.choice(images, 
                              min(sample_size_hist, len(images)), 
                              replace=False)
    
    pixel_values = []
    for img_name in samples:
        img_path = os.path.join(class_path, img_name)
        img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        pixel_values.extend(img.flatten())
    
    axes[idx].hist(pixel_values, bins=50, alpha=0.7, edgecolor='black')
    axes[idx].set_title(f'{class_name} (n={len(samples)} images)')
    axes[idx].set_xlabel('Pixel intensity')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## **2.7 CLASS DISTRIBUTION PLOTS**

In [None]:
print("\n Generating distribution plots...")

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle('Class distribution in the dataset', 
             fontsize=16, fontweight='bold')

for idx, split in enumerate(['Training', 'Testing']):
    counts = dataset_info[split]
    if not counts:
        continue
    
    classes_list = list(counts.keys())
    values = list(counts.values())
    
    # Bar chart
    bars = axes[idx].bar(classes_list, values, 
                        color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
    axes[idx].set_title(f'{split} Set', fontsize=14, fontweight='bold')
    axes[idx].set_ylabel('Number of images', fontsize=12)
    axes[idx].set_xlabel('Classes', fontsize=12)
    axes[idx].tick_params(axis='x', rotation=45)
    
    # Add the count above each bar
    for bar in bars:
        height = bar.get_height()
        axes[idx].text(bar.get_x() + bar.get_width()/2., height,
                      f'{int(height)}',
                      ha='center', va='bottom', fontweight='bold')
    
    axes[idx].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## **2.8 EXPLORATORY PHASE SUMMARY**

### **Key findings:**
- Dataset structured into Training/Testing
- 4 classes: glioma_tumor, meningioma_tumor, pituitary_tumor, no_tumor
- Significant imbalance: 'no_tumor' underrepresented
- Variable dimensions in Testing; Training images are generally 512x512 except in the NO_TUMOR class

### **Class distribution (Training)**

Total: 2,870 images
- Glioma: 826 (28.8%)
- Meningioma: 822 (28.6%)
- Pituitary: 827 (28.8%)
- No Tumor: 395 (13.8%) -> Underrepresented

### **Class distribution (Testing)**

Total: 389 images, relatively balanced across classes

### **Technical characteristics**
- Training format: mostly 512×512 grayscale images (the no_tumor class has variable dimensions)
- Testing format: JPG images of variable dimensions -> requires preprocessing
- Channels: Grayscale images (1 channel)

**Critical observation:** There is a significant class imbalance: pathological classes are overrepresented compared to the healthy class in the training set
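
To make the imbalance concrete, the short calculation below uses the training counts listed above together with the rebalancing targets stated in Phase 3 (200 images per tumor type, no_tumor raised to 600). It is only an arithmetic sketch; the variable names are illustrative.

```python
# Training counts reported in the summary above
train_counts = {"glioma": 826, "meningioma": 822, "pituitary": 827, "no_tumor": 395}

tumor_total = sum(n for name, n in train_counts.items() if name != "no_tumor")
ratio_before = tumor_total / train_counts["no_tumor"]

# Targets stated in Phase 3: 200 images per tumor type, no_tumor raised to 600
target_per_tumor_type = 200
target_no_tumor = 600
tumor_after = 3 * target_per_tumor_type
synthetic_needed = target_no_tumor - train_counts["no_tumor"]
ratio_after = tumor_after / target_no_tumor

print(f"before: {tumor_total} tumor vs {train_counts['no_tumor']} no_tumor ({ratio_before:.2f}:1)")
print(f"after:  {tumor_after} tumor vs {target_no_tumor} no_tumor ({ratio_after:.2f}:1)")
print(f"synthetic no_tumor images needed: {synthetic_needed}")
```

In the binary view, tumor scans outnumber healthy scans by roughly 6 to 1 before rebalancing.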

# **3. Phase 3: Data Preparation**

This is the pivotal step of our project. Here we aim to balance the tumor and no_tumor classes. We run K-means clustering with K = 200 on each tumor type and keep one representative image per cluster, so that the merged "tumor" class contains 600 images (200 + 200 + 200). To augment the no_tumor class, we use a diffusion model to generate synthetic images.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
import shutil
from tqdm import tqdm
import zipfile
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')


### **3.1 SETUP**

In [None]:
BASE_PATH = '/kaggle/input/brain-tumor-classification-mri'
OUTPUT_BASE = '/kaggle/working/kmeans_output'

# Create the folder structure
folders = [
    'selected_images/glioma',
    'selected_images/meningioma',
    'selected_images/pituitary',
    'original_no_tumor',
    'metrics'
]

print("\n Creating folder structure...")
for folder in folders:
    os.makedirs(os.path.join(OUTPUT_BASE, folder), exist_ok=True)
print(f" {len(folders)} folders created")

### **3.2 STEP 1: K-MEANS TO SELECT 200 IMAGES PER TUMOR TYPE**

### **Methodology:**

For each tumor type:

1. Extract descriptors (features) from images (e.g., flattening, PCA, or CNN embeddings).
2. Apply K-means with K = 200.
3. Select one image per cluster (the image closest to the centroid).

Thus, for each tumor type we obtain 200 representative images:

- 200 Glioma
- 200 Meningioma
- 200 Pituitary

These images are then merged to form a single 'tumor' class for the binary classification task.

**Advantages:**

- Selection guided by the data structure.
- Reduced redundancy bias.
- Better coverage of the feature space.
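
The selection step described above can be sketched on synthetic data as follows. The notebook uses scikit-learn's `KMeans`; a plain NumPy version of Lloyd's algorithm is swapped in here only to keep the example dependency-light. The helpers `kmeans` and `representatives`, the random feature matrix, and K = 5 are all assumptions for illustration.

```python
import numpy as np


def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Distance of every sample to every centroid, shape (n_samples, k)
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members):               # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels, centroids


def representatives(features, labels, centroids):
    """Index of the sample closest to each non-empty cluster's centroid."""
    chosen = []
    for c, centroid in enumerate(centroids):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        d = np.linalg.norm(features[idx] - centroid, axis=1)
        chosen.append(int(idx[d.argmin()]))
    return chosen


# Synthetic stand-in for flattened images: 60 samples, 16 features
rng = np.random.default_rng(42)
features = rng.random((60, 16)).astype("float32")
labels, centroids = kmeans(features, k=5)
selected = representatives(features, labels, centroids)
print(selected)
```

Because clusters are disjoint, each selected index is unique; an empty cluster simply yields no representative, so the final count can fall slightly below K.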

#### **3.2.1 IMAGE SELECTION WITH K-MEANS**

In [None]:
tumor_classes = ['glioma_tumor', 'meningioma_tumor', 'pituitary_tumor']
target_per_class = 200
image_size = (128, 128)

print(f"\nTarget: {target_per_class} images per class")
print("Metrics: Inertia, Silhouette, Davies-Bouldin")

kmeans_metrics = {}

for tumor_class in tumor_classes:
    print(f"\n{'='*70}")
    print(f"Processing: {tumor_class}")
    print(f"{'='*70}")
    
    class_path = os.path.join(BASE_PATH, 'Training', tumor_class)
    all_images = [f for f in os.listdir(class_path) 
                  if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    
    print(f"   Total: {len(all_images)}")
    
    # Feature extraction: resized grayscale pixels scaled to [0, 1] and flattened
    features_list = []
    valid_images = []
    
    for img_name in tqdm(all_images, desc="  Features"):
        img = cv2.imread(os.path.join(class_path, img_name), cv2.IMREAD_GRAYSCALE)
        if img is None:
            # cv2.imread returns None for unreadable files; skip them
            continue
        img_resized = cv2.resize(img, image_size)
        features = (img_resized.astype('float32') / 255.0).flatten()
        features_list.append(features)
        valid_images.append(img_name)
    
    features_array = np.array(features_list)
    print(f"   Features: {features_array.shape}")
    
    # K-Means
    print(f"   K-Means (K={target_per_class})...")
    kmeans = KMeans(n_clusters=target_per_class, random_state=42, n_init=10, max_iter=300)
    cluster_labels = kmeans.fit_predict(features_array)
    
    # Clustering quality metrics
    inertia = kmeans.inertia_
    silhouette = silhouette_score(features_array, cluster_labels)
    davies_bouldin = davies_bouldin_score(features_array, cluster_labels)
    
    print("\n   METRICS:")
    print(f"  • Inertia: {inertia:.2f}")
    print(f"  • Silhouette: {silhouette:.4f}")
    print(f"  • Davies-Bouldin: {davies_bouldin:.4f}")
    
    kmeans_metrics[tumor_class] = {
        'inertia': inertia,
        'silhouette': silhouette,
        'davies_bouldin': davies_bouldin
    }
    
    # Select the image closest to each centroid
    print("\n   Selecting representative images...")
    selected_indices = []
    
    for cluster_id in range(target_per_class):
        cluster_indices = np.where(cluster_labels == cluster_id)[0]
        if len(cluster_indices) == 0:
            continue
        centroid = kmeans.cluster_centers_[cluster_id]
        distances = np.linalg.norm(features_array[cluster_indices] - centroid, axis=1)
        selected_indices.append(cluster_indices[np.argmin(distances)])
    
    selected_names = [valid_images[idx] for idx in selected_indices]
    print(f"   {len(selected_names)} images selected")
    
    # Copy the selected images
    class_short = tumor_class.replace('_tumor', '')
    dest_path = os.path.join(OUTPUT_BASE, 'selected_images', class_short)
    
    for img_name in tqdm(selected_names, desc="  Copy"):
        shutil.copy2(os.path.join(class_path, img_name), 
                    os.path.join(dest_path, img_name))


#### **3.2.2 K-MEANS EVALUATION METRICS**

In [None]:
print(f"{'Class':<15} {'Inertia':<15} {'Silhouette':<15} {'Davies-Bouldin':<15}")
print("-"*70)
for tumor_class, metrics in kmeans_metrics.items():
    class_name = tumor_class.replace('_tumor', '').upper()
    print(f"{class_name:<15} {metrics['inertia']:<15.2f} {metrics['silhouette']:<15.4f} {metrics['davies_bouldin']:<15.4f}")


In [None]:
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle('K-Means Metrics', fontsize=16, fontweight='bold')

classes_short = [k.replace('_tumor', '') for k in kmeans_metrics.keys()]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

# Inertia
inertia_vals = [kmeans_metrics[k]['inertia'] for k in kmeans_metrics.keys()]
axes[0].bar(classes_short, inertia_vals, color=colors, alpha=0.8, edgecolor='black')
axes[0].set_title('Inertia (lower = better)')
axes[0].set_ylabel('Inertia')
axes[0].grid(True, alpha=0.3, axis='y')
for i, v in enumerate(inertia_vals):
    axes[0].text(i, v, f'{v:.0f}', ha='center', va='bottom', fontweight='bold')

# Silhouette
sil_vals = [kmeans_metrics[k]['silhouette'] for k in kmeans_metrics.keys()]
axes[1].bar(classes_short, sil_vals, color=colors, alpha=0.8, edgecolor='black')
axes[1].axhline(y=0.5, color='green', linestyle='--', label='Excellent')
axes[1].axhline(y=0.3, color='orange', linestyle='--', label='Good')
axes[1].set_title('Silhouette (higher = better)')
axes[1].set_ylabel('Score')
axes[1].set_ylim(0, 1)
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
for i, v in enumerate(sil_vals):
    axes[1].text(i, v, f'{v:.3f}', ha='center', va='bottom', fontweight='bold')

# Davies-Bouldin
db_vals = [kmeans_metrics[k]['davies_bouldin'] for k in kmeans_metrics.keys()]
axes[2].bar(classes_short, db_vals, color=colors, alpha=0.8, edgecolor='black')
axes[2].axhline(y=1.0, color='green', linestyle='--', label='Excellent')
axes[2].axhline(y=1.5, color='orange', linestyle='--', label='Good')
axes[2].set_title('Davies-Bouldin (lower = better)')
axes[2].set_ylabel('Score')
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis='y')
for i, v in enumerate(db_vals):
    axes[2].text(i, v, f'{v:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_BASE, 'metrics', 'kmeans_metrics.png'), dpi=150, bbox_inches='tight')
plt.show()


#### **3.2.3 COPY ORIGINAL NO_TUMOR IMAGES**

In [None]:
no_tumor_path = os.path.join(BASE_PATH, 'Training', 'no_tumor')
no_tumor_images = [f for f in os.listdir(no_tumor_path) 
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

print(f"Total no_tumor: {len(no_tumor_images)}")

dest_no_tumor = os.path.join(OUTPUT_BASE, 'original_no_tumor')

for img_name in tqdm(no_tumor_images, desc="Copy"):
    shutil.copy2(os.path.join(no_tumor_path, img_name), 
                os.path.join(dest_no_tumor, img_name))

print(f" {len(no_tumor_images)} images copied")


In [None]:
# ==========================================
# CREATE ZIP FOR DOWNLOAD
# ==========================================
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
zip_filename = f'kmeans_results_{timestamp}.zip'
zip_path = os.path.join('/kaggle/working', zip_filename)

print("\n Compressing...")

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, dirs, files in os.walk(OUTPUT_BASE):
        for file in files:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, OUTPUT_BASE)
            zipf.write(file_path, arcname)

zip_size = os.path.getsize(zip_path) / (1024 * 1024)
print("  ZIP created!")
print(f"   Name: {zip_filename}")
print(f"   Size: {zip_size:.2f} MB")
print(f"   Location: /kaggle/working/{zip_filename}")

print("\n" + "="*70)
print("DOWNLOAD INSTRUCTIONS")
print("="*70)
print("1. Click 'Output' in the right-hand panel")
print(f"2. Find: {zip_filename}")
print("3. Click '...' then 'Download'")
print("4. KEEP THIS ZIP LOCALLY!")
print("="*70)

print("\n  PART 1 COMPLETE - DOWNLOAD THE ZIP NOW!")
print("   Then run PART 2 (Diffusion Model)")
print("="*70 + "\n")

### **3.3 STEP 2: SYNTHETIC IMAGE GENERATION (DIFFUSION MODEL)**

The no_tumor class initially contains about 400 images (395 in the training set), which is fewer than the 600 tumor images selected above and insufficient for robust training of a deep model. Artificial augmentation is therefore necessary.

Classical methods (rotation, flip, noise) are limited and produce little semantic diversity. Diffusion models (**DDPM – Denoising Diffusion Probabilistic Models**) offer a powerful alternative.

### **Principle of diffusion models:**

DDPMs rely on:

- A progressive noising process applied to images.
- A learned reverse process, implemented by a neural network, to reconstruct images.

Once trained, the model can generate new realistic images that are statistically close to the original MRI images.
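
The progressive noising process has a closed form, which is conceptually what the noise scheduler applies during training. The NumPy sketch below illustrates it under stated assumptions: a linear beta schedule from 1e-4 to 0.02 over 1000 steps (commonly used values, assumed here rather than read from the notebook) and a random array standing in for a normalized MRI slice. The function `add_noise` is a local illustration, not the library call.

```python
import numpy as np

num_train_timesteps = 1000
# Linear beta schedule; the 1e-4 and 0.02 endpoints are assumed values
betas = np.linspace(1e-4, 0.02, num_train_timesteps)
alphas_cumprod = np.cumprod(1.0 - betas)


def add_noise(x0, noise, t):
    """Closed-form forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(128, 128))   # stand-in for a normalized MRI slice
noise = rng.standard_normal(x0.shape)

for t in (0, 250, 999):
    xt = add_noise(x0, noise, t)
    corr = np.corrcoef(x0.ravel(), xt.ravel())[0, 1]
    print(f"t={t:4d}  signal weight={np.sqrt(alphas_cumprod[t]):.3f}  corr with x0={corr:.3f}")
```

At small `t` the image is almost untouched, while at the last step it is nearly pure Gaussian noise; the network is trained to predict the added noise so that this process can be reversed.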

### **Why choose DDPMs:**

- Ability to generate realistic medical images.
- Preservation of anatomical structures.
- Reduced overfitting through dataset diversification.

In this work, a DDPM is used to generate about 200 synthetic no_tumor images (600 minus the number of original images), bringing the no_tumor class to 600 images.

#### **3.3.1 GENERAL CONFIGURATION**

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import cv2
from tqdm import tqdm
import zipfile
from datetime import datetime
import shutil
import warnings
warnings.filterwarnings('ignore')

In [None]:
BASE_PATH = '/kaggle/input/brain-tumor-classification-mri'
OUTPUT_BASE = '/kaggle/working/ddpm_output'

print(f"Dataset original: {BASE_PATH}")
print(f"Output: {OUTPUT_BASE}")

# Create the folder structure
folders = [
    'original_no_tumor',
    'generated_images',
    'model',
    'samples'
]

print("\n Creating folder structure...")
for folder in folders:
    os.makedirs(os.path.join(OUTPUT_BASE, folder), exist_ok=True)
print(f" {len(folders)} folders created")


In [None]:
# ==========================================
# COPY ORIGINAL NO_TUMOR IMAGES
# ==========================================

no_tumor_path = os.path.join(BASE_PATH, 'Training', 'no_tumor')
no_tumor_images = [f for f in os.listdir(no_tumor_path) 
                   if f.lower().endswith(('.jpg', '.jpeg', '.png'))]

print(f"Total no_tumor: {len(no_tumor_images)}")

temp_no_tumor = os.path.join(OUTPUT_BASE, 'original_no_tumor')

for img_name in tqdm(no_tumor_images, desc="Copy"):
    shutil.copy2(os.path.join(no_tumor_path, img_name), 
                os.path.join(temp_no_tumor, img_name))

print(f" {len(no_tumor_images)} images copied")

In [None]:

# INSTALL DEPENDENCIES


import subprocess
import sys

print(" Installing diffusers, torch...")
try:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", 
                          "diffusers", "transformers", "accelerate", "torch", "torchvision"])
    print(" Dependencies installed!")
except Exception as e:
    print(f" Error: {e}")

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from diffusers import DDPMScheduler, UNet2DModel, DDPMPipeline
from diffusers.optimization import get_cosine_schedule_with_warmup
from tqdm.auto import tqdm as tqdm_auto
from PIL import Image as PILImage

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\n Device: {device}")
if device.type == "cuda":
    print(f"   GPU: {torch.cuda.get_device_name(0)}")

#### **3.3.2 NO_TUMOR DATASET PREPARATION**

In [None]:
class BrainMRIDataset(Dataset):
    def __init__(self, image_dir, image_size=128):
        self.image_dir = image_dir
        self.image_files = [f for f in os.listdir(image_dir) 
                           if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),
            transforms.Grayscale(num_output_channels=3),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5])
        ])
    
    def __len__(self):
        return len(self.image_files)
    
    def __getitem__(self, idx):
        img_path = os.path.join(self.image_dir, self.image_files[idx])
        image = PILImage.open(img_path)
        return self.transform(image)

image_size = 128
dataset = BrainMRIDataset(temp_no_tumor, image_size=image_size)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=2)

print(f"  Dataset: {len(dataset)} images")
print(f"   Batches: {len(dataloader)}")

#### **3.3.4 DDPM MODEL CONFIGURATION**

In [None]:
model = UNet2DModel(
    sample_size=image_size,
    in_channels=3,
    out_channels=3,
    layers_per_block=2,
    block_out_channels=(128, 128, 256, 256, 512, 512),
    down_block_types=(
        "DownBlock2D", "DownBlock2D", "DownBlock2D", 
        "DownBlock2D", "AttnDownBlock2D", "DownBlock2D"
    ),
    up_block_types=(
        "UpBlock2D", "AttnUpBlock2D", "UpBlock2D", 
        "UpBlock2D", "UpBlock2D", "UpBlock2D"
    ),
)
model.to(device)

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

print("   Model created")
print(f"   Parameters: {sum(p.numel() for p in model.parameters()):,}")

#### **3.3.5 MODEL TRAINING**

In [None]:
num_epochs = 300
learning_rate = 1e-4

print(f"Configuration:")
print(f"  • Epochs: {num_epochs}")
print(f"  • Learning rate: {learning_rate}")
print(f"  • Batch size: 16")

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=(len(dataloader) * num_epochs),
)

print("\n Starting training...")

model.train()
losses = []

for epoch in range(num_epochs):
    epoch_loss = 0
    progress_bar = tqdm_auto(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}")
    
    for batch in progress_bar:
        clean_images = batch.to(device)
        noise = torch.randn(clean_images.shape).to(device)
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, 
                                 (clean_images.shape[0],), device=device).long()
        
        noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
        noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
        loss = torch.nn.functional.mse_loss(noise_pred, noise)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        
        epoch_loss += loss.item()
        progress_bar.set_postfix({"loss": loss.item()})
    
    avg_loss = epoch_loss / len(dataloader)
    losses.append(avg_loss)
    
    print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_loss:.4f}")
    
    # Preview samples on the first epoch and then every 10 epochs
    if (epoch + 1) % 10 == 0 or epoch == 0:
        model.eval()
        with torch.no_grad():
            pipeline = DDPMPipeline(unet=model, scheduler=noise_scheduler)
            pipeline.to(device)
            
            samples = pipeline(batch_size=4, num_inference_steps=50, output_type="numpy").images
            
            fig, axes = plt.subplots(1, 4, figsize=(16, 4))
            fig.suptitle(f'Epoch {epoch+1}', fontsize=14, fontweight='bold')
            for i, ax in enumerate(axes):
                img = (samples[i] * 255).astype(np.uint8)
                ax.imshow(img[:, :, 0], cmap='gray')
                ax.axis('off')
            
            plt.tight_layout()
            plt.savefig(os.path.join(OUTPUT_BASE, 'samples', f'epoch_{epoch+1:03d}.png'))
            plt.show()
        
        model.train()

print("\n Training finished.")

# Loss curve
plt.figure(figsize=(12, 6))
plt.plot(losses, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('DDPM Training Loss')
plt.grid(True, alpha=0.3)
plt.savefig(os.path.join(OUTPUT_BASE, 'training_loss.png'), dpi=150, bbox_inches='tight')
plt.show()


#### **3.3.6 FINAL IMAGE GENERATION**

In [None]:
target_total = 600
images_to_generate = target_total - len(no_tumor_images)

print(f"Images to generate: {images_to_generate}")

model.eval()
pipeline = DDPMPipeline(unet=model, scheduler=noise_scheduler)
pipeline.to(device)

generated_folder = os.path.join(OUTPUT_BASE, 'generated_images')
generated_count = 0
batch_size_gen = 4

print("\n Generating images...")

max_batch_failures = 5  # give up after repeated batch errors instead of retrying forever
batch_failures = 0

with torch.no_grad():
    while generated_count < images_to_generate:
        try:
            current_batch = min(batch_size_gen, images_to_generate - generated_count)
            
            output = pipeline(batch_size=current_batch, num_inference_steps=50, output_type="numpy")
            images = output.images
            
            for i in range(current_batch):
                try:
                    # Convert [0, 1] floats to 8-bit grayscale and upscale to 512x512
                    img = (images[i] * 255).astype(np.uint8)
                    img_gray = img[:, :, 0]
                    img_resized = cv2.resize(img_gray, (512, 512))
                    
                    save_path = os.path.join(generated_folder, f'ddpm_{generated_count:04d}.jpg')
                    cv2.imwrite(save_path, img_resized)
                    
                    generated_count += 1
                    
                    if generated_count % 10 == 0:
                        print(f"  {generated_count}/{images_to_generate}")
                
                except Exception as e:
                    # Advance the index even on failure so the loop always terminates;
                    # the file check below reports how many images were actually saved
                    print(f"   Image error {generated_count}: {e}")
                    generated_count += 1
        
        except Exception as e:
            batch_failures += 1
            print(f"   Batch error: {e}")
            if generated_count >= images_to_generate * 0.8 or batch_failures >= max_batch_failures:
                break

print(f"\n {generated_count} images generated")

# Verification
saved_imgs = [f for f in os.listdir(generated_folder) if f.lower().endswith('.jpg')]
print(f" Verification: {len(saved_imgs)} files")

# Visual comparison
print("\n Comparison...")

original_samples = no_tumor_images[:6]
generated_samples = saved_imgs[:6]

fig, axes = plt.subplots(2, 6, figsize=(18, 6))
fig.suptitle('Originals (top) vs DDPM (bottom)', fontsize=16, fontweight='bold')

for idx, img_name in enumerate(original_samples):
    img = cv2.imread(os.path.join(temp_no_tumor, img_name), cv2.IMREAD_GRAYSCALE)
    axes[0, idx].imshow(img, cmap='gray')
    axes[0, idx].set_title('Original', fontsize=10, fontweight='bold')
    axes[0, idx].axis('off')

for idx, img_name in enumerate(generated_samples):
    img = cv2.imread(os.path.join(generated_folder, img_name), cv2.IMREAD_GRAYSCALE)
    axes[1, idx].imshow(img, cmap='gray')
    axes[1, idx].set_title('DDPM', fontsize=10, fontweight='bold')
    axes[1, idx].axis('off')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_BASE, 'comparison.png'), dpi=150, bbox_inches='tight')
plt.show()

#### **3.3.7 SUMMARY**

In [None]:

# SUMMARY
print(f"  • Original images: {len(no_tumor_images)}")
print(f"  • Generated images: {generated_count}")
print(f"  • Total: {len(no_tumor_images) + generated_count}")


### **3.4 Data Splitting**

15% of the data from each class is reserved for validation.

The original Testing folder is kept for an independent final evaluation.

This choice follows best practices in machine learning and avoids information leakage.
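A minimal sketch of the per-class 15% hold-out described above, assuming the file names are already grouped by class in a dictionary. The helper name `stratified_split`, the seed, and the toy file names are illustrative and do not come from the notebook.

```python
import random

def stratified_split(files_by_class, val_fraction=0.15, seed=42):
    """Reserve `val_fraction` of each class for validation (hypothetical helper)."""
    rng = random.Random(seed)
    train, val = {}, {}
    for class_name, files in files_by_class.items():
        shuffled = sorted(files)          # sort first so the shuffle is reproducible
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        val[class_name] = shuffled[:n_val]
        train[class_name] = shuffled[n_val:]
    return train, val

# Toy example with made-up file names
toy = {
    "glioma": [f"g_{i}.jpg" for i in range(100)],
    "no_tumor": [f"n_{i}.jpg" for i in range(40)],
}
train_files, val_files = stratified_split(toy)
for name in toy:
    print(name, len(train_files[name]), len(val_files[name]))
```

Splitting inside each class keeps the class proportions identical in the training and validation sets, which matters when one class is much smaller than the others.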

#### **3.4.1 FINAL VISUALIZATIONS**

In [None]:
DATASET = '/kaggle/input/brain-tumor-balanced'
# Count images per split and class
distribution = {'binary': {}, 'multiclass': {}}

for split in ['train', 'validation', 'test']:
    for class_name in ['tumor', 'no_tumor']:
        path = os.path.join(DATASET, 'binary_classification', split, class_name)
        count = len([f for f in os.listdir(path) if f.lower().endswith(('.jpg', '.jpeg', '.png'))])
        if split not in distribution['binary']:
            distribution['binary'][split] = {}
        distribution['binary'][split][class_name] = count

for split in ['train', 'validation', 'test']:
    for class_name in ['glioma', 'meningioma', 'pituitary']:
        path = os.path.join(DATASET, 'multiclass_classification', split, class_name)
        count = len([f for f in os.listdir(path) if f.lower().endswith(('.jpg', '.jpeg', '.png'))])
        if split not in distribution['multiclass']:
            distribution['multiclass'][split] = {}
        distribution['multiclass'][split][class_name] = count

# Bar charts: binary classification
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
fig.suptitle('Distribution - Binary Classification', fontsize=16, fontweight='bold')

for idx, split in enumerate(['train', 'validation', 'test']):
    data = distribution['binary'][split]
    axes[idx].bar(data.keys(), data.values(), color=['#FF6B6B', '#4ECDC4'], alpha=0.8, edgecolor='black')
    axes[idx].set_title(f'{split.upper()}', fontsize=14, fontweight='bold')
    axes[idx].set_ylabel('Images')
    axes[idx].grid(True, alpha=0.3, axis='y')
    
    for i, (k, v) in enumerate(data.items()):
        axes[idx].text(i, v, str(v), ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# Bar charts: multi-class classification
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
fig.suptitle('Distribution - Multi-class Classification', fontsize=16, fontweight='bold')

for idx, split in enumerate(['train', 'validation', 'test']):
    data = distribution['multiclass'][split]
    axes[idx].bar(data.keys(), data.values(), color=['#FF6B6B', '#4ECDC4', '#45B7D1'], alpha=0.8, edgecolor='black')
    axes[idx].set_title(f'{split.upper()}', fontsize=14, fontweight='bold')
    axes[idx].set_ylabel('Images')
    axes[idx].grid(True, alpha=0.3, axis='y')
    axes[idx].tick_params(axis='x', rotation=15)
    
    for i, (k, v) in enumerate(data.items()):
        axes[idx].text(i, v, str(v), ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n BINARY CLASSIFICATION:")
for split in ['train', 'validation', 'test']:
    for class_name in ['tumor', 'no_tumor']:
        count = distribution['binary'][split][class_name]
        print(f"  {split:12s} / {class_name:10s}: {count:4d} images")

print("\n MULTI-CLASS CLASSIFICATION:")
for split in ['train', 'validation', 'test']:
    for class_name in ['glioma', 'meningioma', 'pituitary']:
        count = distribution['multiclass'][split][class_name]
        print(f"  {split:12s} / {class_name:11s}: {count:4d} images")

# **4. Phase 4: Data Modeling**

### **Objectives of the Modeling Phase**

The modeling phase aims to design, train, and compare several deep learning models in order to identify the best-performing architecture for classifying brain MRI images. In line with the CRISP-DM methodology, this phase builds directly on the decisions made during data preparation.

Two modeling objectives are pursued:

- **Binary classification:** distinguish patients with a brain tumor (`tumor`) from those without (`no_tumor`).
- **Conditional multi-class classification:** identify the tumor type (glioma, meningioma, or pituitary) only when a tumor has been detected.

This section focuses on a comparative study of transfer learning models, which are well suited to medical datasets of limited size.

### **Transfer Learning and Medical Imaging**

Training a deep convolutional neural network from scratch requires a very large volume of data, which is rarely available in medical imaging. Transfer learning reuses models pre-trained on large datasets (such as ImageNet) and transfers the learned knowledge to a new task.

In medical imaging, this approach has several advantages:

- Shorter training time.
- Better model convergence.
- Improved performance despite a limited number of annotated images.

Many studies have shown that the first layers of a CNN learn generic features (edges, textures) that remain useful for MRI images.

### **Selected Models**

Two reference architectures were selected: EfficientNetB0 and DenseNet121. This choice is based on their demonstrated effectiveness in recent medical imaging work.

### **1. EfficientNetB0**

EfficientNet is built on compound scaling, which balances three dimensions simultaneously:

- network depth,
- layer width,
- input image resolution.

EfficientNetB0 is the baseline model of this family and offers an excellent trade-off between performance, parameter count, and computational cost.

These characteristics make it particularly well suited to resource-constrained environments and to medical datasets.
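The compound scaling principle can be illustrated with a few lines of arithmetic. The sketch below uses the base coefficients reported in the EfficientNet paper (depth 1.2, width 1.1, resolution 1.15); the function name is illustrative, and the released B1 to B7 variants use rounded multipliers, so this shows the principle rather than the exact published configurations.

```python
# Base coefficients reported for the EfficientNet grid search
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution

def compound_scaling(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# FLOPs grow roughly with depth * width^2 * resolution^2
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2

for phi in range(3):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
print(f"FLOPs factor per unit of phi: {flops_factor:.2f}")
```

With `phi = 0` all multipliers equal 1, which corresponds to the B0 baseline used in this project; each additional unit of `phi` roughly doubles the computational cost.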

### **2. DenseNet121**

DenseNet121 belongs to the family of densely connected networks. Within a dense block, each layer receives as input the feature maps of all preceding layers.

The main advantages of DenseNet121 are:

- better gradient propagation,
- efficient feature reuse,
- reduced risk of overfitting.

These properties are particularly valuable for detecting the fine, complex structures typical of MRI images.
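To make dense connectivity concrete, the sketch below traces how the number of feature maps grows through the four dense blocks of DenseNet121. The constants are the standard DenseNet121 settings (growth rate 32, block sizes 6, 12, 24, 16, and transition layers that halve the channel count); the function name is illustrative.

```python
GROWTH_RATE = 32                 # feature maps added by each dense layer
BLOCK_SIZES = (6, 12, 24, 16)    # dense layers per block in DenseNet121
INITIAL_CHANNELS = 64            # channels after the stem convolution
COMPRESSION = 0.5                # transition layers halve the channel count

def channel_trace():
    """Return the channel count at the end of each dense block."""
    channels = INITIAL_CHANNELS
    trace = []
    for i, n_layers in enumerate(BLOCK_SIZES):
        # Each layer concatenates GROWTH_RATE new maps onto everything before it
        channels += n_layers * GROWTH_RATE
        trace.append(channels)
        if i < len(BLOCK_SIZES) - 1:
            channels = int(channels * COMPRESSION)
    return trace

print(channel_trace())  # [256, 512, 1024, 1024]
```

The final value, 1024 channels, is the depth of the feature map that the classification head receives when the network is loaded without its top.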

### **Training Strategies**

For each architecture, three transfer learning strategies were implemented in order to assess their impact on performance.

### **Mode 1: Feature Extraction**

In this mode:

- The weights of the pre-trained model are entirely frozen.
- Only the final layers (the classifier) are trained.

**Advantages:** fast training, limited risk of overfitting.

**Limitations:** restricted adaptation to the specifics of MRI images.

This mode serves as the baseline for comparison.

### **Mode 2: Full Fine-tuning**

In this mode:

- All layers of the model are trained.
- The pre-trained weights serve only as initialization.

**Advantages:** complete adaptation to MRI data, maximum performance potential.

**Limitations:** higher risk of overfitting, higher computational cost.

This strategy is relevant when the data are sufficiently diverse, notably thanks to augmentation with diffusion models.

### **Mode 3: Partial Fine-tuning**

Partial fine-tuning is a compromise between the two previous approaches:

- The first (low-level) layers are frozen.
- The deeper layers are retrained.

**Advantages:** generic features are preserved, adaptation targets the patterns specific to tumors, and the bias/variance trade-off is better balanced.

This approach is often considered the most suitable for medical imaging.
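The three strategies differ only in which backbone layers are trainable. The sketch below expresses that as a list of boolean flags; the helper `trainable_flags` is hypothetical, and the 238-layer depth is an assumption inferred from the notebook's EfficientNetB0 setting (160 frozen layers, 78 trainable), so `len(base_model.layers)` should be checked in practice.

```python
def trainable_flags(n_layers, mode, freeze_until=0):
    """Return one boolean per backbone layer: True means the layer is trained."""
    if mode == "feature_extraction":
        return [False] * n_layers
    if mode == "full_fine_tuning":
        return [True] * n_layers
    if mode == "partial_fine_tuning":
        return [i >= freeze_until for i in range(n_layers)]
    raise ValueError(f"unknown mode: {mode}")

N_LAYERS = 238  # assumed backbone depth; check len(base_model.layers) in practice
for mode in ("feature_extraction", "full_fine_tuning", "partial_fine_tuning"):
    flags = trainable_flags(N_LAYERS, mode, freeze_until=160)
    print(f"{mode:20s}: {sum(flags):3d} trainable / {N_LAYERS} layers")
```

In Keras, the same flags would be applied by setting `layer.trainable` on each element of `base_model.layers` before compiling the model.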

### **4.1. Library Imports**

In [None]:
# Library imports
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras import layers, models, optimizers
import warnings
warnings.filterwarnings('ignore')

### **4.2. Configuration and Data Loading**

In [None]:


# Parameters
IMG_SIZE = (224, 224)
BATCH_SIZE = 32
BASE_PATH = '/kaggle/input/brain-tumor-balanced'

# Preprocessing: the Keras EfficientNet models rescale inputs internally,
# so only data augmentation is applied here, to limit overfitting.
train_datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1
)

test_val_datagen = ImageDataGenerator()  # No augmentation for validation/test

# 1. BINARY generators
train_gen_bin = train_datagen.flow_from_directory(
    os.path.join(BASE_PATH, 'binary_classification/train'), 
    target_size=IMG_SIZE, batch_size=BATCH_SIZE, class_mode='binary')

val_gen_bin = test_val_datagen.flow_from_directory(
    os.path.join(BASE_PATH, 'binary_classification/validation'),
    target_size=IMG_SIZE, batch_size=BATCH_SIZE, class_mode='binary', shuffle=False)

# 2. MULTI-CLASS generators
train_gen_multi = train_datagen.flow_from_directory(
    os.path.join(BASE_PATH, 'multiclass_classification/train'),
    target_size=IMG_SIZE, batch_size=BATCH_SIZE, class_mode='categorical')

val_gen_multi = test_val_datagen.flow_from_directory(
    os.path.join(BASE_PATH, 'multiclass_classification/validation'),
    target_size=IMG_SIZE, batch_size=BATCH_SIZE, class_mode='categorical', shuffle=False)

### **4.3. Building the EfficientNetB0 Transfer Learning Model**

In [None]:
def build_efficientnet_model(num_classes, activation):
    # Load the EfficientNetB0 base pre-trained on ImageNet
    base_model = EfficientNetB0(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    base_model.trainable = False  # Freeze the backbone (feature extraction)

    model = models.Sequential([
        base_model,
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(rate=0.2),
        layers.Dense(num_classes, activation=activation)
               
    ])
    
    return model
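The size of the classification head built above can be checked by hand. The sketch below counts the parameters of the first dense layer for the notebook's `Flatten` head and, for comparison, for a global-average-pooling head, which is an alternative not used by the authors. It assumes the 7 x 7 x 1280 feature map that EfficientNetB0 produces for 224 x 224 inputs; the helper name is illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# EfficientNetB0 without its top returns a 7 x 7 x 1280 feature map for 224 x 224 inputs
H, W, C = 7, 7, 1280

flatten_head = dense_params(H * W * C, 512)   # Flatten -> Dense(512), as in the notebook
pooled_head = dense_params(C, 512)            # GlobalAveragePooling2D -> Dense(512), an alternative

print(f"Flatten + Dense(512): {flatten_head:,} parameters")
print(f"GAP + Dense(512):     {pooled_head:,} parameters")
```

Under these assumptions the `Flatten` head alone holds about 32 million parameters, several times more than the EfficientNetB0 backbone itself, which is one reason a pooled head is sometimes preferred on small datasets.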


### **4.3.1 Binary Classification Model (Feature Extraction):**

In [None]:
# Model 1: binary (tumor vs. no tumor)
model_bin = build_efficientnet_model(1, 'sigmoid')
model_bin.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])


In [None]:
history_bin = model_bin.fit(
    train_gen_bin,
    validation_data=val_gen_bin,
    epochs=10
)

In [None]:
import matplotlib.pyplot as plt

def plot_training_history(history, title="Training history"):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs_range = range(len(acc))

    plt.figure(figsize=(14, 5))

    # Accuracy plot
    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, acc, label='Training accuracy', color='#2ecc71', marker='o')
    plt.plot(epochs_range, val_acc, label='Validation accuracy', color='#e74c3c', marker='s')
    plt.title(f'{title} - Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Score')
    plt.legend(loc='lower right')
    plt.grid(True, linestyle='--', alpha=0.6)

    # Loss plot
    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, loss, label='Training loss', color='#2ecc71', marker='o')
    plt.plot(epochs_range, val_loss, label='Validation loss', color='#e74c3c', marker='s')
    plt.title(f'{title} - Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend(loc='upper right')
    plt.grid(True, linestyle='--', alpha=0.6)

    plt.tight_layout()
    plt.show()
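Alongside the curves, a numeric summary can help when comparing runs. The sketch below finds the epoch with the lowest validation loss and the gap between training and validation accuracy at that epoch. It works on a plain dictionary shaped like Keras's `History.history`; the helper name and the numbers are made up for illustration.

```python
def summarize_history(history):
    """Return the epoch with the lowest validation loss and the accuracy gap there."""
    val_loss = history["val_loss"]
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    gap = history["accuracy"][best_epoch] - history["val_accuracy"][best_epoch]
    return best_epoch, gap

# Made-up numbers shaped like a Keras `History.history` dictionary
toy_history = {
    "accuracy":     [0.80, 0.90, 0.95, 0.98],
    "val_accuracy": [0.78, 0.88, 0.90, 0.89],
    "loss":         [0.50, 0.30, 0.15, 0.08],
    "val_loss":     [0.52, 0.33, 0.30, 0.36],
}
epoch, gap = summarize_history(toy_history)
print(f"best epoch: {epoch}, train/val accuracy gap: {gap:.2f}")
```

A validation loss that rises after its minimum while training accuracy keeps improving, as in the toy data, is the usual sign of overfitting.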



In [None]:
# Loss and accuracy curves
plot_training_history(history_bin, title="Training history (binary)")

In [None]:
# Evaluation and confusion matrix

def evaluate_and_plot(model, generator, labels, title):
    # Predictions (the generator must be created with shuffle=False so that
    # the prediction order matches generator.classes)
    preds = model.predict(generator)
    if len(labels) == 2:  # Binary case: threshold the sigmoid output
        y_pred = (preds > 0.5).astype(int).flatten()
    else:  # Multi-class case: take the most probable class
        y_pred = np.argmax(preds, axis=1)
    
    y_true = generator.classes
    
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.title(f'Confusion Matrix: {title}')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

    # Classification report (precision, recall, F1)
    print(f"--- Classification report: {title} ---")
    print(classification_report(y_true, y_pred, target_names=labels))
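The success criteria set in Phase 1 are expressed as sensitivity and specificity, which can be read directly from the binary confusion matrix. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predictions) with class 0 = no tumor and class 1 = tumor; the helper name and the counts are made up for illustration.

```python
def sensitivity_specificity(cm):
    """cm is [[TN, FP], [FN, TP]], with class 0 = no tumor and class 1 = tumor."""
    (tn, fp), (fn, tp) = cm
    sensitivity = tp / (tp + fn)   # recall on the tumor class
    specificity = tn / (tn + fp)   # recall on the no-tumor class
    return sensitivity, specificity

# Made-up counts for illustration only
toy_cm = [[90, 10],
          [4, 96]]
sens, spec = sensitivity_specificity(toy_cm)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy case the sensitivity (0.96) would meet the 95% target, while the specificity (0.90) would sit exactly at the 90% boundary rather than above it.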

In [None]:
# Binary evaluation
evaluate_and_plot(model_bin, val_gen_bin, ['No Tumor', 'Tumor'], "Binary Classification")

In [None]:
model_bin.save('model_bin.h5')  # Save the binary model

### **4.3.2 Multi-class Classification Model (Feature Extraction):**

In [None]:
# Model 2: multi-class (glioma, meningioma, pituitary)
model_multi = build_efficientnet_model(3, 'softmax')
model_multi.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])

In [None]:
# Train the multi-class model
history_multi = model_multi.fit(
    train_gen_multi,
    validation_data=val_gen_multi,
    epochs=10
)

In [None]:
# Loss and accuracy curves (multi-class model)
plot_training_history(history_multi, title="Training history (multi-class)")

In [None]:
# Evaluate the multi-class classification model
evaluate_and_plot(model_multi, val_gen_multi, ['glioma_tumor', 'meningioma_tumor', 'pituitary_tumor'], "Multi-class Classification")

### **4.3.3. Full Fine-tuning of the Multi-class Model:**

In [None]:
# Unfreeze the base model
def build_efficientnet_model(num_classes, activation):
    # Load the EfficientNetB0 base pre-trained on ImageNet
    base_model = EfficientNetB0(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    base_model.trainable = True  # Unfreeze every layer (full fine-tuning)

    model = models.Sequential([
        base_model,
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(rate=0.2),
        layers.Dense(num_classes, activation=activation)
    ])
    
    return model


In [None]:
# Model 3: multi-class (glioma, meningioma, pituitary) with full fine-tuning
model_multi_fine = build_efficientnet_model(3, 'softmax')
model_multi_fine.compile(
    optimizer=optimizers.Adam(learning_rate=1e-5), 
    loss='categorical_crossentropy',
    metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()]
)

In [None]:
callbacks = [
    # Stop training if the validation loss has not improved for 3 epochs
    tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True, monitor='val_loss'),
    # Reduce the learning rate if the model plateaus
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=1e-7)
]
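To see what the `ReduceLROnPlateau` settings above do to the learning rate, the sketch below replays the same rule (factor 0.2, patience 2, floor 1e-7, starting from 1e-5) on a made-up sequence of validation losses. It is a simplified re-implementation for illustration: it ignores the cooldown option and assumes the Keras default `min_delta` of 1e-4.

```python
def simulate_reduce_lr(val_losses, lr=1e-5, factor=0.2, patience=2,
                       min_lr=1e-7, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss          # improvement: reset the patience counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule

# Made-up validation losses: two improvements, then a plateau
toy_losses = [1.00, 0.90, 0.95, 0.96, 0.97, 0.98]
for epoch, lr in enumerate(simulate_reduce_lr(toy_losses), start=1):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

Because `EarlyStopping` uses a patience of 3 on the same metric, a run that stops improving would typically see a single learning rate reduction before training is halted.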

In [None]:
# Start fine-tuning
history_fine = model_multi_fine.fit(
    train_gen_multi,
    validation_data=val_gen_multi,
    epochs=10, 
    callbacks=callbacks
)

In [None]:
plot_training_history(history_fine, title="Training history (multi-class, full fine-tuning)")

In [None]:
# Evaluation and confusion matrix
# Multi-class evaluation
evaluate_and_plot(model_multi_fine, val_gen_multi, ['glioma_tumor', 'meningioma_tumor', 'pituitary_tumor'], "Multi-class Classification with Full Fine-tuning")

### **4.3.4. Partial Fine-tuning of the Multi-class Model:**

In [None]:
# Partially unfreeze the base model
def build_efficientnet_model(num_classes, activation):
    # Load the EfficientNetB0 base pre-trained on ImageNet
    base_model = EfficientNetB0(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    
    # We chose to train 78 layers,
    # i.e. freeze the first 160 layers
    # and unfreeze the rest:
    for layer in base_model.layers[:160]:
        layer.trainable = False
    for layer in base_model.layers[160:]:
        layer.trainable = True
    
    model = models.Sequential([
        base_model,
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(rate=0.2),
        layers.Dense(num_classes, activation=activation)
    ])
    
    return model

In [None]:
# Model 4: multi-class (glioma, meningioma, pituitary) with partial fine-tuning
model_multi_fine = build_efficientnet_model(3, 'softmax')
model_multi_fine.compile(
    optimizer=optimizers.Adam(learning_rate=1e-5), 
    loss='categorical_crossentropy',
    metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()]
)

In [None]:
callbacks = [
    # Stop training if the validation loss has not improved for 3 epochs
    tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True, monitor='val_loss'),
    # Reduce the learning rate if the model plateaus
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=1e-7)
]

In [None]:
# Start fine-tuning
history_fine = model_multi_fine.fit(
    train_gen_multi,
    validation_data=val_gen_multi,
    epochs=10, 
    callbacks=callbacks
)

In [None]:
plot_training_history(history_fine, title="Training history (multi-class, partial fine-tuning)")

In [None]:
# Evaluation and confusion matrix
# Multi-class evaluation
evaluate_and_plot(model_multi_fine, val_gen_multi, ['glioma_tumor', 'meningioma_tumor', 'pituitary_tumor'], "Multi-class Classification with Partial Fine-tuning")

In [None]:
model_multi_fine.save('model_multi_fine_partiel.h5')  # Save the multi-class model

In [None]:
# Final decision pipeline

def diagnostic_pipeline(image_path, model_bin, model_multi_fine):
    # 1. Load and prepare the image
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(224, 224))
    img_array = tf.keras.preprocessing.image.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    
    # 2. Stage 1: tumor detection (binary)
    is_tumor_prob = model_bin.predict(img_array)[0][0]
    
    if is_tumor_prob < 0.5:
        return "Result: no tumor detected."
    else:
        # 3. Stage 2: if a tumor is detected, classify its type (multi-class).
        # Use the model passed as an argument rather than the global `model_multi`.
        type_preds = model_multi_fine.predict(img_array)
        classes_multi = ['Glioma', 'Meningioma', 'Pituitary']
        detected_type = classes_multi[np.argmax(type_preds)]
        return f"Result: tumor detected. Suspected type: {detected_type} (confidence: {np.max(type_preds)*100:.2f}%)"
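The cascade logic of the pipeline can be exercised without loading any model by feeding it probabilities directly. The sketch below is a stripped-down, hypothetical version of that decision rule: `cascade` and its dictionary output are illustrative names, and the probabilities are made up.

```python
CLASSES = ["Glioma", "Meningioma", "Pituitary"]

def cascade(tumor_prob, type_probs, threshold=0.5):
    """Two-stage decision: binary gate first, tumor type only if the gate fires."""
    if tumor_prob < threshold:
        return {"tumor": False, "type": None, "confidence": 1.0 - tumor_prob}
    best = max(range(len(CLASSES)), key=lambda i: type_probs[i])
    return {"tumor": True, "type": CLASSES[best], "confidence": type_probs[best]}

print(cascade(0.12, [0.2, 0.5, 0.3]))   # gate closed: type probabilities are ignored
print(cascade(0.97, [0.1, 0.1, 0.8]))   # gate open: most probable type is reported
```

Separating the decision rule from model inference in this way makes the threshold and the class order easy to unit-test; the class order must match the alphabetical folder order used by `flow_from_directory`.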



In [None]:
from tensorflow.keras.models import load_model

test_img_path = '/kaggle/input/brain-tumor-balanced/multiclass_classification/test/pituitary/image(10).jpg'
binary_model = load_model('/kaggle/input/model-for-deployement/other/default/1/model_bin_eff.h5')
multi_model = load_model('/kaggle/input/model-for-deployement/other/default/1/model_multi_fine_partiel.h5')
diagnostic_pipeline(test_img_path, binary_model, multi_model)

### **4.4. Building the DenseNet121 Transfer Learning Model**

In [None]:
from tensorflow.keras.applications import DenseNet121

def build_densenet_model(num_classes, activation):
    # Load pre-trained DenseNet121
    base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    base_model.trainable = False  # Freeze the backbone for the initial transfer

    model = models.Sequential([
        base_model,
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(rate=0.2),
        layers.Dense(num_classes, activation=activation)
    ])
    
    return model
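One point may be worth checking here: unlike the Keras EfficientNet models, the Keras DenseNet models do not rescale inputs internally, while the generators from section 4.2 pass raw 0-255 pixel values. `tf.keras.applications.densenet.preprocess_input` scales pixels to [0, 1] and then normalizes each channel with the ImageNet mean and standard deviation. The sketch below reproduces that arithmetic on a single RGB pixel in plain Python; the function name is illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def torch_style_preprocess(pixel):
    """Normalize one RGB pixel given as 0-255 values."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
    )

print(torch_style_preprocess((0, 0, 0)))
print(torch_style_preprocess((255, 255, 255)))
```

In the notebook, one way to apply this, assuming the authors wanted to match the ImageNet pre-training statistics, would be to pass `tf.keras.applications.densenet.preprocess_input` as the `preprocessing_function` argument of `ImageDataGenerator`, and to apply the same function at inference time.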



### **4.4.1. Binary Model with DenseNet121:**

In [None]:
# Build the binary model
model_bin_dense = build_densenet_model(1, 'sigmoid')
model_bin_dense.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])

In [None]:
history_bin_dense = model_bin_dense.fit(
    train_gen_bin,
    validation_data=val_gen_bin,
    epochs=10
)

In [None]:
model_bin_dense.save('model_bin_dense.h5')  # Save the binary model

In [None]:
# Loss and accuracy curves
plot_training_history(history_bin_dense, title="Training history (binary, DenseNet121)")

In [None]:
# Binary evaluation
evaluate_and_plot(model_bin_dense, val_gen_bin, ['No Tumor', 'Tumor'], "Binary Classification (DenseNet121)")

### **4.4.2. Multi-class Model with DenseNet121 (Feature Extraction):**

In [None]:
# Build the multi-class model
model_multi_dense = build_densenet_model(3, 'softmax')
model_multi_dense.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])

In [None]:
# Train the multi-class model
history_multi_dense = model_multi_dense.fit(
    train_gen_multi,
    validation_data=val_gen_multi,
    epochs=10
)

In [None]:
# Loss and accuracy curves
plot_training_history(history_multi_dense, title="Training history (multi-class, DenseNet121)")

In [None]:
# Evaluation and confusion matrix
# Multi-class evaluation
evaluate_and_plot(model_multi_dense, val_gen_multi, ['glioma_tumor', 'meningioma_tumor', 'pituitary_tumor'], "Multi-class Classification (DenseNet121)")

### **4.4.3. Full Fine-tuning of the Multi-class Model with DenseNet121:**

In [None]:
from tensorflow.keras.applications import DenseNet121

def build_densenet_model_fine(num_classes, activation):
    # Load pre-trained DenseNet121
    base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    base_model.trainable = True  # Unfreeze every layer (full fine-tuning)

    model = models.Sequential([
        base_model,
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(rate=0.2),
        layers.Dense(num_classes, activation=activation)
    ])
    
    return model

In [None]:
# Multi-class model (glioma, meningioma, pituitary) with full fine-tuning
model_multi_dense_fine = build_densenet_model_fine(3, 'softmax')
model_multi_dense_fine.compile(
    optimizer=optimizers.Adam(learning_rate=1e-5), 
    loss='categorical_crossentropy',
    metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()]
)

In [None]:
callbacks = [
    # Stop training if the validation loss has not improved for 3 epochs
    tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True, monitor='val_loss'),
    # Reduce the learning rate if the model plateaus
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=1e-7)
]

In [None]:
# Start fine-tuning
history_dense_fine = model_multi_dense_fine.fit(
    train_gen_multi,
    validation_data=val_gen_multi,
    epochs=10, 
    callbacks=callbacks
)

In [None]:
# Loss and accuracy curves
plot_training_history(history_dense_fine, title="Training history (multi-class, DenseNet121 full fine-tuning)")

In [None]:
# Evaluation and confusion matrix
# Multi-class evaluation
evaluate_and_plot(model_multi_dense_fine, val_gen_multi, ['glioma_tumor', 'meningioma_tumor', 'pituitary_tumor'], "Multi-class Classification (DenseNet121, Full Fine-tuning)")

### **4.4.4. Partial Fine-tuning of the Multi-class Model with DenseNet121:**

In [None]:
base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
print(len(base_model.layers))

In [None]:
from tensorflow.keras.applications import DenseNet121

def build_densenet_model_fine(num_classes, activation):
    # Load pre-trained DenseNet121
    base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    
    # Freeze the first 300 layers and fine-tune the remaining ones
    # (the total layer count is printed by the previous cell):
    for layer in base_model.layers[:300]:
        layer.trainable = False
    for layer in base_model.layers[300:]:
        layer.trainable = True 

    model = models.Sequential([
        base_model,
        layers.Flatten(),
        layers.Dense(512, activation='relu'),
        layers.Dropout(rate=0.2),
        layers.Dense(num_classes, activation=activation)
    ])
    
    return model

In [None]:
# Multi-class model (glioma, meningioma, pituitary) with partial fine-tuning
model_multi_dense_fine = build_densenet_model_fine(3, 'softmax')
model_multi_dense_fine.compile(
    optimizer=optimizers.Adam(learning_rate=1e-5), 
    loss='categorical_crossentropy',
    metrics=['accuracy', tf.keras.metrics.Recall(), tf.keras.metrics.Precision()]
)

In [None]:
callbacks = [
    # Stop training if the validation loss has not improved for 3 epochs
    tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True, monitor='val_loss'),
    # Reduce the learning rate if the model plateaus
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=1e-7)
]

In [None]:
# Start fine-tuning
history_dense_fine = model_multi_dense_fine.fit(
    train_gen_multi,
    validation_data=val_gen_multi,
    epochs=10, 
    callbacks=callbacks
)

In [None]:
# Loss and accuracy curves
plot_training_history(history_dense_fine, title="Training history (multi-class, DenseNet121 partial fine-tuning)")

In [None]:
# Evaluation and confusion matrix
# Multi-class evaluation
evaluate_and_plot(model_multi_dense_fine, val_gen_multi, ['glioma_tumor', 'meningioma_tumor', 'pituitary_tumor'], "Multi-class Classification (DenseNet121, Partial Fine-tuning)")

In [None]:
model_multi_dense_fine.save('model_multi_dense.h5')  # Save the multi-class model

In [None]:
# Final decision pipeline

def diagnostic_pipeline(image_path, model_bin, model_multi):
    # 1. Load and prepare the image
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(224, 224))
    img_array = tf.keras.preprocessing.image.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    
    # 2. Stage 1: tumor detection (binary)
    is_tumor_prob = model_bin.predict(img_array)[0][0]
    
    if is_tumor_prob < 0.5:
        return "Result: no tumor detected."
    else:
        # 3. Stage 2: if a tumor is detected, classify its type (multi-class)
        type_preds = model_multi.predict(img_array)
        classes_multi = ['Glioma', 'Meningioma', 'Pituitary']
        detected_type = classes_multi[np.argmax(type_preds)]
        return f"Result: tumor detected. Suspected type: {detected_type} (confidence: {np.max(type_preds)*100:.2f}%)"



In [None]:
test_img_path = '/kaggle/input/brain-tumor-balanced/multiclass_classification/test/meningioma/image(109).jpg'

diagnostic_pipeline(test_img_path, model_bin_dense, model_multi_dense_fine)

## **5. Deployment**

In [None]:
import gradio as gr
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import load_model
from tensorflow.keras.applications.efficientnet import preprocess_input

# 1. Load the models (once, at startup)
try:
    binary_model = load_model('/kaggle/input/model-for-deployement/other/default/1/model_bin_eff.h5')
    multi_model = load_model('/kaggle/input/model-for-deployement/other/default/1/model_multi_fine_partiel.h5')
    print("Models loaded successfully!")
except Exception as e:
    print(f"Loading error: {e}")

def predict_tumor(input_img):
    if input_img is None:
        return "Please upload an image."

    # 2. Prepare the image (Gradio provides a NumPy array)
    img = tf.image.resize(input_img, (224, 224))
    img_array = tf.keras.preprocessing.image.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    
    # 3. EfficientNet-specific preprocessing (a pass-through in Keras,
    # since the EfficientNet models rescale their inputs internally)
    img_array = preprocess_input(img_array)
    
    # 4. Stage 1: binary detection
    is_tumor_prob = binary_model.predict(img_array)[0][0]
    
    # Threshold at 0.5 (adjustable if needed)
    if is_tumor_prob < 0.5:
        return " Result: no tumor detected."
    
    # 5. Stage 2: classification if a tumor is detected
    type_preds = multi_model.predict(img_array)[0]
    classes_multi = ['Glioma', 'Meningioma', 'Pituitary']
    
    # Per-class probabilities (not displayed by the Textbox output below)
    results = {classes_multi[i]: float(type_preds[i]) for i in range(3)}
    
    detected_type = classes_multi[np.argmax(type_preds)]
    confiance = np.max(type_preds) * 100
    
    return f" Tumor detected: {detected_type} ({confiance:.2f}%)"

# 6. Build the Gradio interface
interface = gr.Interface(
    fn=predict_tumor,
    inputs=gr.Image(label="Upload a brain MRI"),
    outputs=gr.Textbox(label="Diagnosis"),
    title="Brain Tumor Diagnosis System",
    description="This system uses EfficientNet to detect the presence of a tumor, then classify its type (glioma, meningioma, or pituitary).",
    examples=['/kaggle/input/brain-tumor-balanced/multiclass_classification/test/pituitary/image(10).jpg'] # Optional
)

# 7. Launch
if __name__ == "__main__":
    interface.launch()
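The 0.5 threshold used in `predict_tumor` is described as adjustable. One possible way to choose it, sketched below, is to pick the highest threshold that still reaches the target sensitivity on validation data, in line with the 95% recall criterion from Phase 1. The helper `pick_threshold` is hypothetical and the scores are made up; in practice the probabilities would come from the binary model's predictions on the validation generator.

```python
def pick_threshold(probs, labels, target_sensitivity=0.95):
    """Return the highest threshold whose sensitivity meets the target, or None."""
    n_pos = sum(labels)
    for threshold in sorted(set(probs), reverse=True):
        # A case is flagged as tumor when its probability is >= threshold
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
        if tp / n_pos >= target_sensitivity:
            return threshold
    return None

# Made-up validation scores: 1 = tumor, 0 = no tumor
toy_probs  = [0.95, 0.80, 0.65, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1,    1,    1,    1,    0,    0,    0]
print(pick_threshold(toy_probs, toy_labels))  # 0.4
```

Lowering the threshold trades specificity for sensitivity, so the chosen value should also be checked against the 90% specificity target before it replaces the default.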