## 📂 Dataset Setup Instructions

1. **Create a directory** in your Google Drive (e.g., `BreakHisDataset`).
2. **Download the dataset** from [Kaggle - BreakHis Dataset](https://www.kaggle.com/datasets/ambarish/breakhis).
3. **Unzip the downloaded file** on your local machine.
4. Inside the unzipped contents, locate the folder named `Breast`.
5. **Upload the entire `Breast` folder** to the directory you created in Google Drive.

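> You will later access this folder from Google Colab after mounting your Drive with `drive.mount()`. The code below reads images from `/content/drive/MyDrive/ICAN/breast`, so adjust `image_folder` if your directory or folder name differs.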


In [1]:
import os
import shutil
import pandas as pd
from pathlib import Path
from tqdm import tqdm
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


### STEP 1: Build DataFrame and Assign Labels

In [7]:
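image_folder = Path('/content/drive/MyDrive/ICAN/breast')
# Collect images recursively, matching extensions explicitly (case-insensitive)
# instead of a character-class glob that also matches unintended suffixes.
image_extensions = {'.png', '.jpg', '.jpeg'}
image_files = [p for p in image_folder.rglob('*') if p.suffix.lower() in image_extensions]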
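
A mistyped Drive path can leave `image_files` empty without any error, and the empty list then flows quietly into the DataFrame. The following is a minimal sketch of a defensive variant. The helper name `find_images` and the extension set are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path

# Assumed extension set; adjust to the files actually present in your copy.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def find_images(folder, extensions=IMAGE_EXTENSIONS):
    """Return a sorted list of image paths under `folder`, failing loudly if none exist."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Image folder not found: {folder}")
    # Walk the tree recursively and keep only regular files with a known suffix.
    files = sorted(
        p for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    if not files:
        raise ValueError(f"No image files found under {folder}")
    return files
```

In the notebook this could be called as `image_files = find_images('/content/drive/MyDrive/ICAN/breast')`. Sorting the result also makes the later `random_state=42` steps reproducible across runs, since the file order no longer depends on the filesystem.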


In [8]:
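# === STEP 1: Build DataFrame and Assign Labels ===
df = pd.DataFrame({'image': image_files})
# The class is read from the file name: names containing '_B_' are benign, all others are treated as malignant.
df['label'] = df['image'].apply(lambda x: 0 if '_B_' in x.name else 1)  # 0 = benign, 1 = malignant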
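
The substring check labels every file without `_B_` as malignant, including any stray file that does not follow the naming scheme. The following is a stricter sketch, assuming file names have the layout `<procedure>_<B or M>_<rest>`. The helper name `label_from_name` and the two example names are hypothetical and only illustrate that assumed layout.

```python
def label_from_name(filename):
    """Map a file name to 0 (benign) or 1 (malignant) using its second '_'-separated token."""
    tokens = filename.split("_")
    if len(tokens) < 3:
        raise ValueError(f"Unexpected file name layout: {filename}")
    tumor_class = tokens[1]
    if tumor_class == "B":
        return 0
    if tumor_class == "M":
        return 1
    # Anything else is rejected instead of being silently counted as malignant.
    raise ValueError(f"Unknown class token {tumor_class!r} in {filename}")


# Illustrative names following the assumed layout
print(label_from_name("SOB_B_A-14-22549AB-100-001.png"))  # 0
print(label_from_name("SOB_M_DC-14-2523-100-017.png"))    # 1
```

It could replace the lambda with `df['image'].apply(lambda x: label_from_name(x.name))`, at the cost of raising on any file that does not match.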


In [9]:
print("\nOriginal class distribution:")
print(df['label'].value_counts())



Original class distribution:
label
1    5429
0    2480
Name: count, dtype: int64
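
To make the imbalance concrete, the counts above can be turned into a ratio and into the number of benign rows that upsampling has to add. This is a small arithmetic sketch using only the two numbers printed above.

```python
# Class counts copied from the output above (0 = benign, 1 = malignant)
counts = {0: 2480, 1: 5429}

majority = max(counts.values())
minority = min(counts.values())

imbalance_ratio = majority / minority  # malignant images per benign image
extra_samples = majority - minority    # benign rows added by upsampling
balanced_total = 2 * majority          # rows after both classes match

print(f"Imbalance ratio: {imbalance_ratio:.2f}")        # 2.19
print(f"Benign rows to add: {extra_samples}")           # 2949
print(f"Rows after balancing: {balanced_total}")        # 10858
```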


### STEP 2: Balance Dataset (Upsample Benign to Match Malignant)


In [None]:
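# Separate the two classes
benign_df = df[df.label == 0]
malignant_df = df[df.label == 1]

# Sample benign rows with replacement until they match the malignant count
benign_upsampled = resample(
    benign_df,
    replace=True,
    n_samples=len(malignant_df),
    random_state=42
)

# Recombine both classes and shuffle the rows
df_balanced = pd.concat([malignant_df, benign_upsampled]).sample(frac=1, random_state=42).reset_index(drop=True)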


### STEP 3: Train/Validation Split

In [None]:
train_df, val_df = train_test_split(
    df_balanced,
    stratify=df_balanced['label'],
    test_size=0.3,
    random_state=42
)

print(f"\nTraining samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")


Training samples: 7600
Validation samples: 3258
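
Because Step 2 upsamples before Step 3 splits, copies of the same benign image can land in both the training and validation sets, which tends to inflate validation scores. One common alternative is to split the unique images first and upsample only the training portion. The following is a minimal sketch of that ordering, not the notebook's method. The helper name `split_then_upsample` is hypothetical, and it would produce different sample counts from those printed above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def split_then_upsample(df, test_size=0.3, random_state=42):
    """Split unique rows first, then upsample the minority class in the training split only."""
    train_df, val_df = train_test_split(
        df, stratify=df["label"], test_size=test_size, random_state=random_state
    )

    # Identify the minority class within the training split
    minority_label = train_df["label"].value_counts().idxmin()
    minority_df = train_df[train_df["label"] == minority_label]
    majority_df = train_df[train_df["label"] != minority_label]

    # Sample minority rows with replacement up to the majority count
    minority_upsampled = resample(
        minority_df,
        replace=True,
        n_samples=len(majority_df),
        random_state=random_state,
    )

    train_balanced = (
        pd.concat([majority_df, minority_upsampled])
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )
    # The validation split keeps its original, unduplicated class distribution
    return train_balanced, val_df.reset_index(drop=True)
```

With this ordering, the validation set reflects the original class distribution, so metrics such as accuracy should be read alongside per-class recall.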


### STEP 4: Prepare Destination Folders

In [None]:

out_base = Path('/content/drive/MyDrive/ICAN/breakhis_pytorch')
for split in ['train', 'val']:
    for label in ['benign', 'malignant']:
        (out_base / split / label).mkdir(parents=True, exist_ok=True)

### STEP 5: Copy Images into Train/Val Structure

In [None]:

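def copy_images(df, split):
    for idx, row in tqdm(df.iterrows(), total=len(df), desc=f'Copying {split} images'):
        label_folder = 'benign' if row.label == 0 else 'malignant'
        # Upsampled rows repeat the same source file. Prefix the row index so that
        # repeated files do not overwrite each other in the destination folder.
        dest_path = out_base / split / label_folder / f'{idx}_{row.image.name}'
        shutil.copy(row.image, dest_path)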

copy_images(train_df, 'train')
copy_images(val_df, 'val')

print("\nDataset is now ready in /ICAN/breakhis_pytorch/train and /val for PyTorch ImageFolder.")

Copying train images: 100%|██████████| 7600/7600 [30:48<00:00,  4.11it/s]
Copying val images: 100%|██████████| 3258/3258 [10:31<00:00,  5.16it/s]


Dataset is now ready in /ICAN/breakhis_pytorch/train and /val for PyTorch ImageFolder.
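
A quick way to confirm what actually landed on disk is to count the files per split and class and compare the totals with `len(train_df)` and `len(val_df)`. The following is a minimal sketch. The helper name `count_images` is hypothetical, and the default folder names mirror those created in Step 4.

```python
from pathlib import Path


def count_images(base, splits=("train", "val"), labels=("benign", "malignant")):
    """Return {(split, label): number of files} for an ImageFolder-style directory tree."""
    base = Path(base)
    counts = {}
    for split in splits:
        for label in labels:
            folder = base / split / label
            # A missing folder is reported as zero files instead of raising
            if folder.is_dir():
                counts[(split, label)] = sum(1 for p in folder.iterdir() if p.is_file())
            else:
                counts[(split, label)] = 0
    return counts
```

For example, `count_images('/content/drive/MyDrive/ICAN/breakhis_pytorch')` returns one entry per split and class. A total that is lower than the number of copied rows would indicate that some destination paths collided.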



