### Library Imports

Imports the libraries used for file handling, data manipulation, and organization. `pandas` is a third-party package; the rest are part of the Python standard library:

- `pandas`: For reading and manipulating CSV/tabular data.
- `os`: For filesystem operations like path handling and directory checks.
- `shutil`: For high-level file operations (e.g., copying files).
- `random`: For random sampling and shuffling.
- `defaultdict` (from `collections`): For grouping items with default list behavior.

In [None]:
import pandas as pd
import os
import shutil
import random
from collections import defaultdict

### Load Dataset and Count Positive Cancer Cases

- Loads the dataset from `input/train.csv` into a DataFrame `df`.
- Counts the number of rows where the `cancer` label is `1` (i.e., positive cases).
- Prints the total number of cancer-positive images.

In [1]:
df = pd.read_csv('input/train.csv')

# Count the number of images labeled cancer=1
num_cancer_1_images = (df['cancer'] == 1).sum()

print(f"Total cancer=1 images: {num_cancer_1_images}")

Total cancer=1 images: 1158
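
The raw count is easier to interpret next to the class ratio. Below is a minimal sketch on a made-up toy table; the `toy` DataFrame and its values are illustrative assumptions, not rows from `train.csv`.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, for illustration only).
toy = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3],
    "image_id":   [10, 11, 20, 21, 30, 31],
    "cancer":     [0, 1, 0, 0, 0, 0],
})

# Image counts per label, ordered by label value (0 first, then 1).
counts = toy["cancer"].value_counts().sort_index()

# Because the label is 0/1, its mean is the fraction of positive images.
positive_fraction = toy["cancer"].mean()

# Number of negative images for each positive image.
neg_per_pos = counts.loc[0] / counts.loc[1]

print(counts.to_dict())
print(f"Positive fraction: {positive_fraction:.3f}")
print(f"Negatives per positive: {neg_per_pos:.1f}")
```

Running the same three lines on the real `df` would show how far the original data is from the 5:1 ratio targeted later.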


### Identify Patients with Conflicting Cancer Labels

- Loads the dataset from `input/train.csv`.
- Groups the data by `patient_id` and counts the number of **unique cancer labels** per patient.
- Filters for patients who have **both `cancer=0` and `cancer=1`** labels (i.e., `nunique() > 1`).
- Prints:
  - The number of such patients.
  - A list of their `patient_id`s.

In [2]:
# Reload the original dataset
df = pd.read_csv('input/train.csv')

# Group by patient_id and check unique cancer labels
mixed_patients = df.groupby('patient_id')['cancer'].nunique()

# Select patients where there are more than 1 unique label
mixed_patients = mixed_patients[mixed_patients > 1]

print(f"Number of patients with mixed labels: {len(mixed_patients)}")
print("Patient IDs with mixed labels:")
print(mixed_patients.index.tolist())

Number of patients with mixed labels: 480
Patient IDs with mixed labels:
[106, 236, 283, 500, 729, 826, 865, 1025, 1109, 1336, 1524, 1703, 1775, 1878, 1963, 2133, 2179, 2346, 2489, 2679, 2938, 2989, 3021, 3346, 3510, 3542, 3568, 3626, 3670, 3713, 4083, 4202, 4340, 4696, 4824, 4888, 4917, 4953, 5059, 5235, 5444, 5608, 5769, 5820, 5878, 5911, 6018, 6038, 6107, 6303, 6637, 6658, 6659, 6668, 6853, 7053, 7151, 7196, 7339, 7493, 7780, 7964, 8248, 8403, 8631, 8641, 8675, 8732, 9010, 9014, 9029, 9167, 9201, 9345, 9481, 9559, 9707, 9840, 9851, 10130, 10226, 10432, 10589, 10635, 10638, 10668, 10940, 11094, 11249, 11341, 11365, 11817, 11919, 11937, 12153, 12195, 12258, 12282, 12305, 12392, 12463, 12485, 12522, 12651, 12725, 12918, 12988, 13101, 13116, 13267, 13331, 13463, 13756, 13845, 13920, 14292, 14327, 14706, 14769, 14941, 14962, 15078, 15268, 15696, 15945, 16145, 16249, 16346, 16451, 16639, 16668, 16694, 16703, 16955, 17222, 17535, 17562, 17891, 17894, 18026, 18316, 18399, 18421, 18709, 1883
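
To make the grouping logic concrete, here is a small sketch that splits patients into mixed, positive-only, and negative-only groups. The `toy` table is a hypothetical example, and using `min`/`max` instead of `nunique()` is just one possible way to express the same check for a 0/1 label.

```python
import pandas as pd

# Hypothetical image-level labels for three patients.
toy = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3],
    "cancer":     [0, 1, 0, 0, 1, 1],
})

# For a 0/1 label, min != max is equivalent to nunique() > 1.
per_patient = toy.groupby("patient_id")["cancer"].agg(["min", "max"])

mixed_ids = per_patient.index[per_patient["min"] != per_patient["max"]].tolist()
positive_only_ids = per_patient.index[per_patient["min"] == 1].tolist()
negative_only_ids = per_patient.index[per_patient["max"] == 0].tolist()

print("Mixed:", mixed_ids)
print("Positive only:", positive_only_ids)
print("Negative only:", negative_only_ids)
```

Mixed patients matter for the trimming step because their positive and negative images are handled separately there.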

### Construct Trimmed Dataset with Balanced Negative Sampling

Creates a trimmed version of the dataset that keeps every positive (`cancer=1`) image and subsamples negative (`cancer=0`) images patient by patient until the negative-to-positive image ratio is roughly **5:1**:

---

**1. Define Paths and Load Data**:
- Sets file paths for input and output.
- Creates output directory `trimmed_train_images`.
- Loads the original `train.csv`.

---

**2. Separate Positive and Negative Cases**:
- `df_pos`: All positive samples (`cancer=1`) — retained fully.
- `df_neg`: All negative samples (`cancer=0`).

---

**3. Group Negatives by Patient and Subsample**:
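- Maps each `patient_id` to the list of its negative `image_id`s.
- Shuffles the patients randomly so the selection does not depend on file order. No random seed is set, so the selected patients differ between runs.
- Adds patients one at a time until the negative image count reaches 5× the number of positive images.
  - A patient is skipped if adding its images would push the total more than 200 images past the target, so the overshoot is capped at 200.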
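
The selection rule can be sketched as a standalone function. This is an illustration under assumptions, not the notebook's own code: the function name `select_negative_patients`, the `seed` parameter, and the toy input are hypothetical, and the seeded `random.Random` is added only to make the result repeatable.

```python
import random

def select_negative_patients(patient_to_images, target, buffer=200, seed=42):
    """Greedily pick whole patients until `target` negative images are reached.

    A patient is skipped when adding it would exceed target + buffer.
    """
    rng = random.Random(seed)             # local generator, so global state is untouched
    patients = sorted(patient_to_images)  # fixed starting order before shuffling
    rng.shuffle(patients)

    selected, total = [], 0
    for pid in patients:
        n = len(patient_to_images[pid])
        if total + n > target + buffer:
            continue                      # too large for the remaining budget
        selected.append(pid)
        total += n
        if total >= target:
            break
    return selected, total

# Toy input: 100 hypothetical patients with 4 negative images each.
toy = {pid: list(range(4)) for pid in range(100)}
selected, total = select_negative_patients(toy, target=50, buffer=5)
print(len(selected), total)
```

With 4 images per patient, the running total moves from 48 to 52, which is past the target of 50 but inside the buffer, so the loop stops at 13 patients and 52 images.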

---

**4. Build and Save Trimmed Dataset**:
- Combines all positive samples with selected negative ones.
- Saves to `trimmed_train.csv`.

---

**5. Log Excluded Patients**:
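- Computes the patients whose negative images were dropped, and the number of dropped images, for transparency.
- Saves those patient IDs to `trimmed_patient_ids.csv`. The list is built from negative images only, so it can include patients who remain in the trimmed dataset through their positive images.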
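
If a list of patients that are absent from the trimmed dataset altogether is needed, one option is to compare patient sets before and after trimming. The sketch below uses a hypothetical toy table and a hypothetical `selected_patients` result; the names `fully_removed` and `partially_kept` are illustrative.

```python
import pandas as pd

# Hypothetical data: patient 1 is mixed, 2 and 3 are negative-only, 4 is positive-only.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4],
    "image_id":   [10, 11, 20, 21, 30, 31, 40],
    "cancer":     [0, 1, 0, 0, 0, 0, 1],
})
selected_patients = [2]  # assumed outcome of the negative subsampling

df_pos = df[df["cancer"] == 1]
df_neg = df[df["cancer"] == 0]
df_trimmed = pd.concat(
    [df_pos, df_neg[df_neg["patient_id"].isin(selected_patients)]],
    ignore_index=True,
)

kept_patients = set(df_trimmed["patient_id"])

# Patients with no image left in the trimmed dataset.
fully_removed = set(df["patient_id"]) - kept_patients

# Patients whose negatives were dropped but who stay in through positive images.
partially_kept = (set(df_neg["patient_id"]) - set(selected_patients)) & kept_patients

print("Fully removed:", sorted(fully_removed))
print("Partially kept:", sorted(partially_kept))
```

Here patient 3 is fully removed, while patient 1 loses its negative image but stays in the dataset.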

---

**6. Copy Corresponding Image Folders**:
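- Copies the whole image folder of each retained patient from `train_images/` to `trimmed_train_images/`. Folders are copied as-is, so a patient kept only for positive images may still bring along negative images that are not listed in `trimmed_train.csv`.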
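
A re-runnable version of the folder copy might look like the sketch below. It assumes one sub-folder per patient, as the paths in this notebook suggest; the helper name `copy_patient_folders`, the `.dcm` file name, and the temporary directories are hypothetical and used only for the demonstration. `dirs_exist_ok` requires Python 3.8 or newer.

```python
import shutil
import tempfile
from pathlib import Path

def copy_patient_folders(patient_ids, src_root, dest_root):
    """Copy <src_root>/<patient_id>/ to <dest_root>/<patient_id>/ for each ID.

    Returns (copied, missing) lists of patient IDs.
    """
    src_root, dest_root = Path(src_root), Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for pid in patient_ids:
        src = src_root / str(pid)
        if not src.is_dir():
            missing.append(pid)  # record instead of silently skipping
            continue
        # dirs_exist_ok=True makes a second run overwrite rather than fail.
        shutil.copytree(src, dest_root / str(pid), dirs_exist_ok=True)
        copied.append(pid)
    return copied, missing

# Demonstration in a throwaway directory with one fake patient folder.
with tempfile.TemporaryDirectory() as tmp:
    src_root = Path(tmp) / "train_images"
    (src_root / "101").mkdir(parents=True)
    (src_root / "101" / "1.dcm").write_bytes(b"placeholder")
    dest_root = Path(tmp) / "trimmed_train_images"

    copied, missing = copy_patient_folders([101, 202], src_root, dest_root)
    copied_again, _ = copy_patient_folders([101, 202], src_root, dest_root)
    dest_files = sorted(p.name for p in (dest_root / "101").iterdir())

print(copied, missing, dest_files)
```

Returning the missing IDs makes it easy to spot patients that are listed in the CSV but have no folder on disk.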

---

**7. Summary Report**:
- Prints the number of patients and images retained.
- Displays cancer label distribution in the final trimmed dataset.


In [3]:
input_folder = 'input'
train_csv_path = os.path.join(input_folder, 'train.csv')
train_images_path = os.path.join(input_folder, 'train_images')
trimmed_images_path = os.path.join(input_folder, 'trimmed_train_images')
trimmed_csv_path = os.path.join(input_folder, 'trimmed_train.csv')
trimmed_patients_csv_path = os.path.join(input_folder, 'trimmed_patient_ids.csv')

os.makedirs(trimmed_images_path, exist_ok=True)
df = pd.read_csv(train_csv_path)

# Split positives and negatives
df_pos = df[df['cancer'] == 1]
df_neg = df[df['cancer'] == 0]

# Group negative samples by patient_id
patient_to_images = defaultdict(list)
for idx, row in df_neg.iterrows():
    patient_to_images[row['patient_id']].append(row['image_id'])

# Shuffle patients
negative_patients = list(patient_to_images.keys())
random.shuffle(negative_patients)

# Select patients until reaching approximately 5:1 ratio
selected_patients = []
selected_image_count = 0
target_negatives = 5 * len(df_pos)

for patient_id in negative_patients:
    num_images = len(patient_to_images[patient_id])
    if selected_image_count + num_images > target_negatives + 200:
        continue
    selected_patients.append(patient_id)
    selected_image_count += num_images
    if selected_image_count >= target_negatives:
        break

print(f"Selected {len(selected_patients)} negative patients totaling {selected_image_count} images.")

# Build the new dataframe
selected_negatives = df_neg[df_neg['patient_id'].isin(selected_patients)]
df_trimmed = pd.concat([df_pos, selected_negatives], ignore_index=True)

# Save the new CSV
df_trimmed.to_csv(trimmed_csv_path, index=False)

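# Patients whose negative images were dropped. This set is built from negatives only,
# so it can include patients still present in df_trimmed through positive images.
all_negative_patients = set(df_neg['patient_id'].unique())
retained_negative_patients = set(selected_patients)
removed_negative_patients = all_negative_patients - retained_negative_patients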

removed_df = df_neg[df_neg['patient_id'].isin(removed_negative_patients)]
num_removed_patients = len(removed_negative_patients)
num_removed_images = len(removed_df)

print(f"Removed {num_removed_patients} patients totaling {num_removed_images} images.")

removed_patients_df = pd.DataFrame({'patient_id': list(removed_negative_patients)})
removed_patients_df.to_csv(trimmed_patients_csv_path, index=False)

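# Copy the image folder of every patient that appears in the trimmed dataset
patient_ids_to_copy = df_trimmed['patient_id'].unique()

for patient_id in patient_ids_to_copy:
    src_dir = os.path.join(train_images_path, str(patient_id))
    dest_dir = os.path.join(trimmed_images_path, str(patient_id))

    if os.path.exists(src_dir):
        # dirs_exist_ok=True (Python 3.8+) lets the cell be re-run without FileExistsError
        shutil.copytree(src_dir, dest_dir, dirs_exist_ok=True)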

# Summary report: patients and images kept, plus the final label distribution
final_counts = df_trimmed['cancer'].value_counts().to_dict()
print("\n--- Summary Report ---")
print(f"Patients kept: {len(patient_ids_to_copy)}")
print(f"Total images kept: {len(df_trimmed)}")
print("Overall Cancer Label Distribution:")
for label, count in final_counts.items():
    print(f"Cancer {label}: {count}")
print("\nTrimmed images and CSV have been created successfully.")

Selected 1280 negative patients totaling 5792 images.
Removed 10627 patients totaling 47756 images.

--- Summary Report ---
Patients kept: 1719
Total images kept: 6950
Overall Cancer Label Distribution:
Cancer 0: 5792
Cancer 1: 1158

Trimmed images and CSV have been created successfully.
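
The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below only restates figures printed above; the variable names are illustrative.

```python
# Figures copied from the printed output above.
num_pos_images = 1158
num_neg_images_kept = 5792
num_neg_patients_selected = 1280
num_patients_kept = 1719
total_images_kept = 6950

target_negatives = 5 * num_pos_images               # 5:1 target
overshoot = num_neg_images_kept - target_negatives  # must stay within the 200-image buffer
ratio = num_neg_images_kept / num_pos_images

# Kept patients that were not selected as negative patients,
# i.e. patients present only because of their positive images.
kept_via_positives_only = num_patients_kept - num_neg_patients_selected

print(f"Target negatives: {target_negatives}")
print(f"Overshoot: {overshoot}")
print(f"Ratio: {ratio:.3f}")
print(f"Images add up: {num_pos_images + num_neg_images_kept == total_images_kept}")
print(f"Patients kept via positives only: {kept_via_positives_only}")
```

The selection overshoots the 5,790-image target by 2 images, and 439 of the kept patients are present only through their positive images.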
