## 📊 SEMI-SUPERVISED DATASET CREATION SCRIPT

---

This script builds a dataset for **semi-supervised learning** models, with a fixed labelled/unlabelled ratio and strictly balanced classes.

### 📝 Key Dataset Parameters

| Parameter | Value | Purpose |
| :--- | :--- | :--- |
| **Supervision ratio** | **30% / 70%** | Labelled / unlabelled share of each class; most of the data is used without labels. |
| **Labelled images** | 30% | Seed data for supervised training, split internally 70/15/15 into train/validation/test. |
| **Unlabelled images** | 70% | Used by *pseudo-labeling* or consistency-based techniques. |
| **Class size** | **6,844 images** | Every class is reduced to the common minimum so that all classes are perfectly balanced. |

---
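The per-class counts implied by these parameters can be worked out directly. The sketch below assumes the 70/15/15 internal split of the labelled portion and floor rounding, with the test split receiving the remainder, as in the script's `calculate_counts`; the helper name `split_counts` is hypothetical.

```python
def split_counts(total_per_class, labelled_ratio=0.30,
                 train_ratio=0.70, val_ratio=0.15):
    """Return the per-class image counts for each split (floor rounding)."""
    labelled = int(total_per_class * labelled_ratio)
    unlabelled = total_per_class - labelled
    train = int(labelled * train_ratio)
    val = int(labelled * val_ratio)
    test = labelled - train - val  # test absorbs the rounding remainder
    return {
        "labelled": labelled,
        "unlabelled": unlabelled,
        "train": train,
        "val": val,
        "test": test,
    }


counts = split_counts(6844)
print(counts)
```

For 6,844 images per class this gives 2,053 labelled images (1,437 train, 307 validation, 309 test) and 4,791 unlabelled images, which matches the configuration printed by the script.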

### 💡 Execution Note

Before building the dataset, make sure the down-sampling step targets the common minimum of **6,844 images** per class; this is what guarantees balanced classes.
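As a rough illustration of down-sampling to the common minimum, the sketch below works on in-memory lists of file names rather than on real folders. The helper name `downsample_to_common_minimum` and the toy file names are assumptions made for the example, not part of the script.

```python
import random


def downsample_to_common_minimum(files_by_class, seed=42):
    """Keep the same number of files in every class: the size of the smallest one."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    target = min(len(files) for files in files_by_class.values())
    return {
        # Sorting first makes the draw independent of the input order
        cls: rng.sample(sorted(files), target)
        for cls, files in files_by_class.items()
    }


toy = {
    "dark": [f"dark_{i}.jpg" for i in range(5)],
    "light": [f"light_{i}.jpg" for i in range(3)],
    "mid-dark": [f"mid_dark_{i}.jpg" for i in range(4)],
}
balanced = downsample_to_common_minimum(toy)
print({cls: len(files) for cls, files in balanced.items()})
```

With the real data, the target would be the 6,844 images of the smallest class instead of the 3 files of the toy example.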

In [1]:


import os
import random
import shutil
import sys

class SemiSupervisedDatasetCreator:
    def __init__(self):
        self.source_dir = "skintone"  # Source folder (one sub-folder per class)
        self.target_dir = "skintone_data"  # Output folder
        self.random_seed = 42  # For reproducibility

        # Dataset-level parameters
        self.TOTAL_PER_CLASS = 6844  # Images kept per class (common minimum)
        self.LABELLED_RATIO = 0.30   # 30% labelled
        self.UNLABELLED_RATIO = 0.70  # 70% unlabelled

        # Internal split of the labelled data
        self.TRAIN_RATIO = 0.70      # 70% train
        self.VAL_RATIO = 0.15        # 15% validation
        self.TEST_RATIO = 0.15       # 15% test (receives the rounding remainder)

        # Derive the per-class counts
        self.calculate_counts()

        # Classes (must match the sub-folder names in source_dir)
        self.classes = ['dark', 'light', 'mid-dark', 'mid-light']

        random.seed(self.random_seed)
    
    def calculate_counts(self):
        """Compute the number of images in each split."""
        self.labelled_per_class = int(self.TOTAL_PER_CLASS * self.LABELLED_RATIO)
        self.unlabelled_per_class = self.TOTAL_PER_CLASS - self.labelled_per_class

        self.train_per_class = int(self.labelled_per_class * self.TRAIN_RATIO)
        self.val_per_class = int(self.labelled_per_class * self.VAL_RATIO)
        self.test_per_class = self.labelled_per_class - self.train_per_class - self.val_per_class

        print("=" * 60)
        print("SEMI-SUPERVISED DATASET CONFIGURATION")
        print("=" * 60)
        print(f"Images per class: {self.TOTAL_PER_CLASS:,}")
        print(f"Ratio: {self.LABELLED_RATIO*100:.0f}% labelled / {self.UNLABELLED_RATIO*100:.0f}% unlabelled")
        print("\nPer-class details:")
        print(f"  Labelled: {self.labelled_per_class:,} images")
        print(f"    ├── Train: {self.train_per_class:,} (70%)")
        print(f"    ├── Validation: {self.val_per_class:,} (15%)")
        print(f"    └── Test: {self.test_per_class:,} (15%)")
        print(f"  Unlabelled: {self.unlabelled_per_class:,} images")
        print("=" * 60)
    
    def verify_source_data(self):
        """Check that the source folder and every class sub-folder exist."""
        print("\n🔍 CHECKING SOURCE DATA")

        if not os.path.exists(self.source_dir):
            print(f"❌ ERROR: The folder '{self.source_dir}' does not exist!")
            print(f"   Place this script in the same folder as '{self.source_dir}/'")
            return False

        # Check each class
        missing_classes = []
        for cls in self.classes:
            class_path = os.path.join(self.source_dir, cls)
            if not os.path.exists(class_path):
                missing_classes.append(cls)
            else:
                # Count the images
                images = [f for f in os.listdir(class_path)
                         if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
                print(f"  ✓ {cls}: {len(images):,} images")

                # Warn when the class is smaller than the target
                if len(images) < self.TOTAL_PER_CLASS:
                    print(f"    ⚠️  Warning: {cls} has only {len(images)} images")
                    print(f"    Using {len(images)} instead of {self.TOTAL_PER_CLASS}")

        if missing_classes:
            print(f"\n❌ Missing classes: {missing_classes}")
            return False
        
        return True
    
    def create_directory_structure(self):
        """Create the target folder structure."""
        print("\n📁 CREATING THE FOLDER STRUCTURE")

        # Top-level folders
        main_dirs = [
            f"{self.target_dir}/labelled/train",
            f"{self.target_dir}/labelled/val",
            f"{self.target_dir}/labelled/test",
            f"{self.target_dir}/unlabelled",
            f"{self.target_dir}/metadata"
        ]

        # Per-class sub-folders
        for split in ['train', 'val', 'test']:
            for cls in self.classes:
                main_dirs.append(f"{self.target_dir}/labelled/{split}/{cls}")

        # Create everything
        for dir_path in main_dirs:
            os.makedirs(dir_path, exist_ok=True)
            print(f"  ✓ {dir_path}")
        
        return True
    
    def process_class(self, cls):
        """Sample, split and copy the images of one class."""
        print(f"\n📊 Processing class: {cls}")

        # Source path
        source_path = os.path.join(self.source_dir, cls)

        # List all images; sorting makes the seeded sampling reproducible,
        # since os.listdir() returns entries in arbitrary order
        all_images = sorted(
            f for f in os.listdir(source_path)
            if f.lower().endswith(('.jpg', '.jpeg', '.png'))
        )
        
        # If fewer images than planned are available, scale the plan down
        available_count = len(all_images)
        if available_count < self.TOTAL_PER_CLASS:
            print(f"  ⚠️  Only {available_count} images available")
            print("  Adjusting the plan...")

            # Recompute the counts from what is available
            actual_total = available_count
            actual_labelled = int(actual_total * self.LABELLED_RATIO)
            actual_unlabelled = actual_total - actual_labelled

            actual_train = int(actual_labelled * self.TRAIN_RATIO)
            actual_val = int(actual_labelled * self.VAL_RATIO)
            actual_test = actual_labelled - actual_train - actual_val
        else:
            # Take exactly the planned number
            actual_total = self.TOTAL_PER_CLASS
            actual_labelled = self.labelled_per_class
            actual_unlabelled = self.unlabelled_per_class
            actual_train = self.train_per_class
            actual_val = self.val_per_class
            actual_test = self.test_per_class

        print(f"  Total to use: {actual_total} images")

        # Random sampling; random.sample() already returns the selection in
        # random order, so no extra shuffle is needed
        selected_images = random.sample(all_images, min(actual_total, len(all_images)))
        
        # Labelled vs. unlabelled split
        labelled_images = selected_images[:actual_labelled]
        unlabelled_images = selected_images[actual_labelled:actual_total]

        # Split the labelled images into train/val/test
        train_images = labelled_images[:actual_train]
        val_images = labelled_images[actual_train:actual_train + actual_val]
        test_images = labelled_images[actual_train + actual_val:]

        # Copy the labelled images into labelled/<split>/<class>/
        stats = {'labelled': 0, 'unlabelled': 0}
        splits = {'train': train_images, 'val': val_images, 'test': test_images}

        for split, images in splits.items():
            for img in images:
                src = os.path.join(source_path, img)
                dst = os.path.join(self.target_dir, 'labelled', split, cls, img)
                shutil.copy2(src, dst)
                stats['labelled'] += 1

        # Copy the unlabelled images under a unique name
        unlabelled_dir = os.path.join(self.target_dir, 'unlabelled')

        for unlabelled_count, img in enumerate(unlabelled_images):
            src = os.path.join(source_path, img)

            # Unique name prefixed with the original class; create_metadata()
            # reads this prefix back to build unlabelled_mapping.csv
            img_name, img_ext = os.path.splitext(img)
            unique_name = f"{cls}_{img_name}_{unlabelled_count:06d}{img_ext}"
            dst = os.path.join(unlabelled_dir, unique_name)

            shutil.copy2(src, dst)
            stats['unlabelled'] += 1

        print(f"  ✓ Labelled: {stats['labelled']} images (train: {len(train_images)}, val: {len(val_images)}, test: {len(test_images)})")
        print(f"  ✓ Unlabelled: {stats['unlabelled']} images")

        return stats
    
    def create_metadata(self):
        """Write metadata files containing the dataset statistics."""
        print("\n📝 CREATING METADATA FILES")

        metadata_dir = os.path.join(self.target_dir, 'metadata')

        # Count the images in each folder
        stats = {
            'labelled': {'train': {}, 'val': {}, 'test': {}},
            'unlabelled': 0
        }

        # Count labelled images
        for split in ['train', 'val', 'test']:
            for cls in self.classes:
                split_dir = os.path.join(self.target_dir, 'labelled', split, cls)
                if os.path.exists(split_dir):
                    count = len([f for f in os.listdir(split_dir)
                               if f.lower().endswith(('.jpg', '.jpeg', '.png'))])
                    stats['labelled'][split][cls] = count

        # Count unlabelled images
        unlabelled_dir = os.path.join(self.target_dir, 'unlabelled')
        if os.path.exists(unlabelled_dir):
            stats['unlabelled'] = len([f for f in os.listdir(unlabelled_dir)
                                      if f.lower().endswith(('.jpg', '.jpeg', '.png'))])

        # Write the statistics
        with open(os.path.join(metadata_dir, 'statistics.txt'), 'w', encoding='utf-8') as f:
            f.write("=" * 60 + "\n")
            f.write("SEMI-SUPERVISED DATASET STATISTICS\n")
            f.write("=" * 60 + "\n\n")

            f.write("CONFIGURATION:\n")
            f.write(f"Source: {self.source_dir}\n")
            f.write(f"Destination: {self.target_dir}\n")
            f.write(f"Ratio: {self.LABELLED_RATIO*100:.0f}% labelled / {self.UNLABELLED_RATIO*100:.0f}% unlabelled\n")
            f.write(f"Images per class (target): {self.TOTAL_PER_CLASS}\n\n")

            f.write("RESULTS PER CLASS:\n")
            f.write("-" * 60 + "\n")

            # Running totals
            total_labelled = 0
            total_unlabelled = stats['unlabelled']

            for cls in self.classes:
                f.write(f"\n{cls.upper()}:\n")

                # Labelled
                cls_labelled = 0
                for split in ['train', 'val', 'test']:
                    count = stats['labelled'][split].get(cls, 0)
                    cls_labelled += count
                    f.write(f"  {split}: {count} images\n")

                total_labelled += cls_labelled
                f.write(f"  Total labelled: {cls_labelled}\n")

            f.write("\n" + "=" * 60 + "\n")
            f.write("OVERALL TOTALS:\n")
            f.write(f"  Labelled images: {total_labelled}\n")
            f.write(f"  Unlabelled images: {total_unlabelled}\n")
            f.write(f"  TOTAL: {total_labelled + total_unlabelled}\n\n")

            f.write("FINAL RATIOS:\n")
            total_all = total_labelled + total_unlabelled
            if total_all > 0:
                f.write(f"  Labelled: {(total_labelled/total_all*100):.1f}%\n")
                f.write(f"  Unlabelled: {(total_unlabelled/total_all*100):.1f}%\n")

        print(f"  ✓ statistics.txt created in {metadata_dir}")

        # Write a CSV file mapping each unlabelled image to its original class
        with open(os.path.join(metadata_dir, 'unlabelled_mapping.csv'), 'w', encoding='utf-8') as f:
            f.write("filename,original_class\n")

            unlabelled_dir = os.path.join(self.target_dir, 'unlabelled')
            if os.path.exists(unlabelled_dir):
                for filename in sorted(os.listdir(unlabelled_dir)):
                    if filename.lower().endswith(('.jpg', '.jpeg', '.png')):
                        # The original class is the prefix before the first underscore
                        original_class = filename.split('_', 1)[0]
                        f.write(f"{filename},{original_class}\n")

        print("  ✓ unlabelled_mapping.csv created")
    
    def run(self):
        """Run the full pipeline."""
        print("🚀 STARTING SEMI-SUPERVISED DATASET CREATION")
        print("=" * 60)

        # Step 1: verification
        if not self.verify_source_data():
            print("\n❌ Cannot continue. Check your source data.")
            return False

        # Step 2: folder structure
        self.create_directory_structure()

        # Step 3: process each class
        print("\n⚙️  PROCESSING CLASSES")
        print("-" * 60)

        all_stats = {}
        for cls in self.classes:
            all_stats[cls] = self.process_class(cls)

        # Step 4: metadata
        self.create_metadata()

        # Step 5: final summary
        self.print_summary()

        return True
    
    def print_summary(self):
        """Print a final summary."""
        print("\n" + "=" * 60)
        print("✅ DATASET CREATED SUCCESSFULLY!")
        print("=" * 60)

        # Compute the totals
        total_labelled = 0
        total_unlabelled = 0

        for split in ['train', 'val', 'test']:
            split_dir = os.path.join(self.target_dir, 'labelled', split)
            for cls in self.classes:
                class_dir = os.path.join(split_dir, cls)
                if os.path.exists(class_dir):
                    count = len([f for f in os.listdir(class_dir)
                               if f.lower().endswith(('.jpg', '.jpeg', '.png'))])
                    total_labelled += count

        unlabelled_dir = os.path.join(self.target_dir, 'unlabelled')
        if os.path.exists(unlabelled_dir):
            total_unlabelled = len([f for f in os.listdir(unlabelled_dir)
                                  if f.lower().endswith(('.jpg', '.jpeg', '.png'))])

        total_all = total_labelled + total_unlabelled

        print("\n📊 FINAL STATISTICS:")
        print(f"  Labelled images: {total_labelled:,}")
        print(f"  Unlabelled images: {total_unlabelled:,}")
        print(f"  TOTAL: {total_all:,}")

        if total_all > 0:
            print("\n📈 FINAL RATIOS:")
            print(f"  Labelled: {(total_labelled/total_all*100):.1f}% (target: 30%)")
            print(f"  Unlabelled: {(total_unlabelled/total_all*100):.1f}% (target: 70%)")

        print("\n📁 STRUCTURE CREATED:")
        print(f"  {self.target_dir}/")
        print("    ├── labelled/")
        print("    │   ├── train/ (images organized by class)")
        print("    │   ├── val/   (images organized by class)")
        print("    │   └── test/  (images organized by class)")
        print("    ├── unlabelled/ (all images mixed together)")
        print("    └── metadata/ (statistics files)")

        print("\n✨ Your dataset is ready for:")
        print("   - Supervised training: use 'labelled/train/'")
        print("   - Semi-supervised methods: use 'labelled/train/' + 'unlabelled/'")
        print("   - Evaluation: use 'labelled/test/'")


def main():
    """Entry point."""
    print("🎯 SEMI-SUPERVISED DATASET CREATOR")
    print("Ratio: 30% labelled / 70% unlabelled")
    print("=" * 60)

    # Create the instance
    creator = SemiSupervisedDatasetCreator()

    # Run the pipeline
    success = creator.run()

    if success:
        print("\n" + "=" * 60)
        print("✅ Operation completed successfully!")
        print(f"📁 Dataset available in: {creator.target_dir}")
        print("=" * 60)
    else:
        print("\n❌ Dataset creation failed.")
        sys.exit(1)


if __name__ == "__main__":
    """
    INSTRUCTIONS:
    1. Place this script in the same folder as 'skintone/'
    2. Run: python create_semi_supervised_dataset.py
    3. Wait for processing to finish

    EXPECTED INPUT STRUCTURE:
    skintone/
        ├── dark/           (.jpg/.jpeg/.png images)
        ├── light/          (.jpg/.jpeg/.png images)
        ├── mid-dark/       (.jpg/.jpeg/.png images)
        └── mid-light/      (.jpg/.jpeg/.png images)
    """
    main()

🎯 SEMI-SUPERVISED DATASET CREATOR
Ratio: 30% labelled / 70% unlabelled
SEMI-SUPERVISED DATASET CONFIGURATION
Images per class: 6,844
Ratio: 30% labelled / 70% unlabelled

Per-class details:
  Labelled: 2,053 images
    ├── Train: 1,437 (70%)
    ├── Validation: 307 (15%)
    └── Test: 309 (15%)
  Unlabelled: 4,791 images
🚀 STARTING SEMI-SUPERVISED DATASET CREATION

🔍 CHECKING SOURCE DATA
❌ ERROR: The folder 'skintone' does not exist!
   Place this script in the same folder as 'skintone/'

❌ Cannot continue. Check your source data.

❌ Dataset creation failed.


SystemExit: 1
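The run shown here stopped because `skintone/` was not found in the working directory. As a minimal sketch, the expected layout can be checked up front without calling `sys.exit`, which raises `SystemExit` when executed inside a notebook. The helper name `missing_class_dirs` is hypothetical and not part of the script; the class names are the ones the script uses.

```python
from pathlib import Path


def missing_class_dirs(source_dir, classes):
    """Return the class sub-folders that are absent from source_dir."""
    root = Path(source_dir)
    # If source_dir itself does not exist, every class is reported as missing
    return [cls for cls in classes if not (root / cls).is_dir()]


classes = ['dark', 'light', 'mid-dark', 'mid-light']
missing = missing_class_dirs("skintone", classes)
if missing:
    print(f"Missing under 'skintone/': {missing}")
else:
    print("All class folders found.")
```

Running this check in the notebook's working directory before creating `SemiSupervisedDatasetCreator` shows which folders still need to be put in place.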

