## 📝 Notebook usage guide: class rebalancing

This notebook contains the full pipeline for rebalancing the dataset's classes after outlier cleaning.

### 🔧 **What this notebook does**
- Loads only **the image paths** (not the images themselves, to avoid using too much memory).
- Builds a clean DataFrame with one row per image.
- Creates the `sain / malade` (healthy / sick) labels and the pathology sub-categories.
- Performs a **healthy/sick rebalancing (50/50)** by downsampling the larger group.
- Then performs an **internal rebalancing of the sick classes**  
  (COVID / Lung Opacity / Viral Pneumonia) by downsampling each one to the size of the smallest. This shrinks the sick group, so the 50/50 ratio from the previous step does not carry over to the final table (10,192 healthy vs 3,699 sick images in the run shown here).
- Generates a `final_paths_dict` dictionary containing the **paths of the selected images**, keyed by class and ready for the modeling step (CNN).

---

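### 📂 **Where the data comes from**
For now, the notebook reads the images from the raw dataset folder, `data/01_raw/COVID-19_Radiography_Dataset`. The cleaned folder `data/02_clean` appears in the code below but is commented out: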

In [None]:
from pathlib import Path
from glob import glob
import pandas as pd
import numpy as np

PROJECT_ROOT = Path().resolve().parents[0]
RAW_DATA = PROJECT_ROOT / "data" / "01_raw" / "COVID-19_Radiography_Dataset"
#RAW_DATA = PROJECT_ROOT / "data" / "02_clean"

REPS = ["COVID", "Lung_Opacity", "Viral Pneumonia", "Normal"]

print("PROJECT_ROOT:", PROJECT_ROOT)
print("RAW_DATA:", RAW_DATA)
print("RAW_DATA exists:", RAW_DATA.exists())

# Load the image paths for each class
image_paths_dict = {}

for rep in REPS:
    key = rep.lower().replace(" ", "_")
    img_dir = RAW_DATA / rep / "images"
    paths = sorted(glob(str(img_dir / "*.png")))
    image_paths_dict[key] = paths

    print(f"\nClass {key}:")
    print(f"  Folder: {img_dir}")
    print(f"  No. of images: {len(paths)}")
    print("  Examples:", paths[:3])

PROJECT_ROOT: /home/ubuntu/sep25_alt1_mle_ds_covid1
RAW_DATA: /home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset
RAW_DATA exists: True

Class covid:
  Folder: /home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/COVID/images
  No. of images: 3616
  Examples: ['/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/COVID/images/COVID-1.png', '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/COVID/images/COVID-10.png', '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/COVID/images/COVID-100.png']

Class lung_opacity:
  Folder: /home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Lung_Opacity/images
  No. of images: 6012
  Examples: ['/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Lung_Opacity/images/Lung_Opacity-1.png', '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Da
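
The path-loading step above can also be sketched as a small helper that fails early when a class folder is missing. The function name `collect_image_paths` and its `pattern` parameter are hypothetical, and the layout `<root>/<class>/images/*.png` is assumed from the cell above; this is an illustration, not part of the original notebook.

```python
from pathlib import Path


def collect_image_paths(root, class_dirs, pattern="*.png"):
    """Map each normalized class name to its sorted list of image paths."""
    root = Path(root)
    paths_by_class = {}
    for class_dir in class_dirs:
        img_dir = root / class_dir / "images"
        # Fail loudly instead of silently returning an empty class
        if not img_dir.is_dir():
            raise FileNotFoundError(f"Missing image folder: {img_dir}")
        key = class_dir.lower().replace(" ", "_")
        # Only path strings are stored; no image is opened here
        paths_by_class[key] = sorted(str(p) for p in img_dir.glob(pattern))
    return paths_by_class
```

With the notebook's variables, a call would look like `collect_image_paths(RAW_DATA, REPS)`.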

In [2]:
rows = []
for label, paths in image_paths_dict.items():
    for p in paths:
        rows.append({"path": p, "label": label})

df = pd.DataFrame(rows)
df.head()

| | path | label |
|---|---|---|
| 0 | /home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_... | covid |
| 1 | /home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_... | covid |
| 2 | /home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_... | covid |
| 3 | /home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_... | covid |
| 4 | /home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_... | covid |


In [3]:
df["is_healthy"] = df["label"].apply(lambda x: "sain" if x == "normal" else "malade")
df["disease_type"] = df["label"]

df["label"].value_counts(), df["is_healthy"].value_counts()

(label
 normal             10192
 lung_opacity        6012
 covid               3616
 viral_pneumonia     1345
 Name: count, dtype: int64,
 is_healthy
 malade    10973
 sain      10192
 Name: count, dtype: int64)
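
To quantify the imbalance, the counts above can be turned into proportions. The sketch below copies the four class counts from the output shown (it does not read the dataset); the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the output above
counts = pd.Series(
    {"normal": 10192, "lung_opacity": 6012, "covid": 3616, "viral_pneumonia": 1345}
)

shares = counts / counts.sum()                 # share of each class in the full dataset
sick_share = 1 - shares["normal"]              # everything that is not "normal"
imbalance_ratio = counts.max() / counts.min()  # largest class vs smallest class

print(shares.round(3))
print(f"sick share: {sick_share:.3f}, max/min ratio: {imbalance_ratio:.2f}")
```

The healthy/sick split is already close to even, while the spread between the largest and smallest class is much wider, which is why the internal step removes far more rows than the 50/50 step.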

In [4]:
def simple_resample(df_subset, n_samples, replace=False, random_state=42):
    """Draw n_samples rows from df_subset by index label, seeded for reproducibility."""
    rng = np.random.default_rng(random_state)
    idx = rng.choice(df_subset.index.to_numpy(), size=n_samples, replace=replace)
    return df_subset.loc[idx]
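
As a quick check of how `simple_resample` behaves, the sketch below applies it to a small made-up DataFrame. The toy data is hypothetical, and the helper is repeated only so the example runs on its own.

```python
import numpy as np
import pandas as pd


def simple_resample(df_subset, n_samples, replace=False, random_state=42):
    # Same helper as in the cell above, repeated so this example is self-contained
    rng = np.random.default_rng(random_state)
    idx = rng.choice(df_subset.index.to_numpy(), size=n_samples, replace=replace)
    return df_subset.loc[idx]


# Hypothetical toy data: 10 rows with a non-default index
toy = pd.DataFrame(
    {"path": [f"img_{i}.png" for i in range(10)]}, index=range(100, 110)
)

down = simple_resample(toy, n_samples=4)               # downsample, no duplicates
up = simple_resample(toy, n_samples=15, replace=True)  # oversample, duplicates expected

print(len(down), down.index.is_unique)
print(len(up), up.index.is_unique)

# Asking for more rows than exist without replacement raises a ValueError
try:
    simple_resample(toy, n_samples=11, replace=False)
except ValueError as err:
    print("ValueError:", err)
```

Because the seed defaults to 42, repeated calls with the same arguments return the same rows. For plain row sampling, `DataFrame.sample(n=..., replace=..., random_state=...)` covers the same need.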

In [5]:
df_sain = df[df.is_healthy == "sain"]
df_malade = df[df.is_healthy == "malade"]

n = min(len(df_sain), len(df_malade))
print("n =", n)

df_sain_bal = simple_resample(df_sain, n_samples=n, replace=False)
df_malade_bal = simple_resample(df_malade, n_samples=n, replace=False)

df_balanced = pd.concat([df_sain_bal, df_malade_bal]).sample(frac=1, random_state=42).reset_index(drop=True)

df_balanced["is_healthy"].value_counts()

n = 10192


is_healthy
malade    10192
sain      10192
Name: count, dtype: int64

In [6]:
df_m = df_balanced[df_balanced.is_healthy == "malade"]

counts = df_m["disease_type"].value_counts()
min_count = counts.min()

dfs = []
for disease in counts.index:
    subset = df_m[df_m.disease_type == disease]
    # Downsample without replacement so that no image is duplicated
    dfs.append(simple_resample(subset, n_samples=min_count, replace=False))

df_m_balanced = pd.concat(dfs).sample(frac=1, random_state=42).reset_index(drop=True)

df_m_balanced["disease_type"].value_counts()

disease_type
covid              1233
lung_opacity       1233
viral_pneumonia    1233
Name: count, dtype: int64
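
The per-class loop above can also be written with `DataFrameGroupBy.sample`, available in pandas 1.1 and later. This is an alternative sketch on hypothetical toy data, not the notebook's own method; it samples without replacement, so no row appears twice.

```python
import pandas as pd

# Hypothetical toy data standing in for the sick subset `df_m`
df_m_toy = pd.DataFrame(
    {
        "path": [f"img_{i}.png" for i in range(12)],
        "disease_type": ["covid"] * 6 + ["lung_opacity"] * 4 + ["viral_pneumonia"] * 2,
    }
)

min_count = df_m_toy["disease_type"].value_counts().min()

# Draw `min_count` rows from every class, then shuffle the result
balanced = (
    df_m_toy.groupby("disease_type")
    .sample(n=min_count, random_state=42)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["disease_type"].value_counts())
```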

In [8]:
df_final = pd.concat([df_sain_bal, df_m_balanced])
df_final = df_final.sample(frac=1, random_state=42).reset_index(drop=True)

print("\nHealthy vs sick in df_final:")
print(df_final["is_healthy"].value_counts())

print("\nDiseases in df_final:")
print(df_final["disease_type"].value_counts())


Healthy vs sick in df_final:
is_healthy
sain      10192
malade     3699
Name: count, dtype: int64

Diseases in df_final:
disease_type
normal             10192
covid               1233
lung_opacity        1233
viral_pneumonia     1233
Name: count, dtype: int64
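
In the output above, `df_final` holds 10,192 healthy rows against 3,699 sick rows, so the 50/50 ratio from the earlier step is no longer in place. One possible way to restore it, sketched here on hypothetical toy data, is to downsample the healthy rows to the size of the balanced sick set before concatenating. The names `healthy`, `sick`, and `final` are illustrative stand-ins for `df_sain_bal`, `df_m_balanced`, and `df_final`.

```python
import pandas as pd

# Hypothetical toy stand-ins for `df_sain_bal` and `df_m_balanced`
healthy = pd.DataFrame(
    {
        "path": [f"n_{i}.png" for i in range(20)],
        "is_healthy": "sain",
        "disease_type": "normal",
    }
)
sick = pd.DataFrame(
    {
        "path": [f"s_{i}.png" for i in range(6)],
        "is_healthy": "malade",
        "disease_type": ["covid", "lung_opacity", "viral_pneumonia"] * 2,
    }
)

# Downsample the healthy rows to the number of sick rows (without replacement)
healthy_down = healthy.sample(n=len(sick), random_state=42)

final = (
    pd.concat([healthy_down, sick])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(final["is_healthy"].value_counts())
```

This discards a large share of the healthy images, so whether it is preferable to keeping them and compensating with class weights during training is a modeling decision the sketch does not settle.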


In [9]:
final_paths_dict = {
    label: df_final[df_final.disease_type == label]["path"].tolist()
    for label in df_final.disease_type.unique()
}

final_paths_dict

{'normal': ['/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Normal/images/Normal-8053.png',
  '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Normal/images/Normal-1029.png',
  '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Normal/images/Normal-2601.png',
  '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Normal/images/Normal-8203.png',
  '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Normal/images/Normal-4338.png',
  '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Normal/images/Normal-7615.png',
  '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Normal/images/Normal-3351.png',
  '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Dataset/Normal/images/Normal-179.png',
  '/home/ubuntu/sep25_alt1_mle_ds_covid1/data/01_raw/COVID-19_Radiography_Datas