## Dataset attributes

- **site_id** - ID code for the source hospital.
- **patient_id** - ID code for the patient.
- **image_id** - ID code for the image.
- **laterality** - Whether the image is of the left or right breast.
- **view** - The orientation of the image. The default for a screening exam is to capture two views per breast.
- **age** - The patient's age in years.
- **implant** - Whether or not the patient had breast implants. Site 1 only provides breast implant information at the patient level, not at the breast level.
- **density** - A rating for how dense the breast tissue is, with A being the least dense and D being the most dense. Extremely dense tissue can make diagnosis more difficult. Only provided for train.
- **machine_id** - An ID code for the imaging device.
- **cancer** - Whether or not the breast was positive for malignant cancer. The target value. Only provided for train.
- **biopsy** - Whether or not a follow-up biopsy was performed on the breast. Only provided for train.
- **invasive** - If the breast is positive for cancer, whether or not the cancer proved to be invasive. Only provided for train.
- **BIRADS** - 0 if the breast required follow-up, 1 if the breast was rated as negative for cancer, and 2 if the breast was rated as normal. Only provided for train.
- **prediction_id** - The ID for the matching submission row. Multiple images will share the same prediction ID. Test only.
- **difficult_negative_case** - True if the case was unusually difficult. Only provided for train.
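
Because several images can belong to the same breast while the label is given per breast, image-level model outputs generally have to be combined before they are compared with the target. The snippet below is a minimal sketch of that step. The `score` column and the toy values are hypothetical, and averaging per `patient_id` and `laterality` is just one plausible aggregation, not something the dataset prescribes.

```python
import pandas as pd

# Toy image-level table; the `score` column is hypothetical model output.
images = pd.DataFrame({
    "patient_id": [101, 101, 101, 101, 202, 202],
    "image_id": [1, 2, 3, 4, 5, 6],
    "laterality": ["L", "L", "R", "R", "L", "R"],
    "score": [0.25, 0.75, 0.5, 0.5, 1.0, 0.0],
})

# One row per breast: average the scores of all images of that breast.
per_breast = (
    images.groupby(["patient_id", "laterality"], as_index=False)["score"]
    .mean()
)
print(per_breast)
```

Other reductions (for example the maximum score per breast) would fit the same `groupby` pattern; which one works best is an empirical question.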

In [2]:
import pandas as pd
from itables import init_notebook_mode

In [None]:
init_notebook_mode(all_interactive=True)

In [8]:
dataset_dir = "./dataset"
train_dataset = pd.read_csv('./train.csv')
test_dataset = pd.read_csv('./test.csv')

train_dataset


site_id,patient_id,image_id,laterality,view,age,cancer,biopsy,invasive,BIRADS,implant,density,machine_id,difficult_negative_case


In [5]:
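# Sort by label (0 = negative, 1 = positive) so the unpacking does not depend on which class is more frequent.
train_neg_samples_cnt, train_pos_samples_cnt = train_dataset['cancer'].value_counts().sort_index()
test_neg_samples_cnt, test_pos_samples_cnt = test_dataset['cancer'].value_counts().sort_index()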

print(f"Train dataset: {train_neg_samples_cnt} negative samples and {train_pos_samples_cnt} positive samples")
print(f"Test dataset: {test_neg_samples_cnt} negative samples and {test_pos_samples_cnt} positive samples")

Train dataset: 42778 negative samples and 966 positive samples
Test dataset: 10770 negative samples and 192 positive samples


In [9]:
# build dataset paths
from pathlib import Path

def build_image_path(image_id) -> str:
    # Return the absolute path as a plain string, matching the annotation.
    return str(Path(dataset_dir, f"{image_id}.png").resolve())

train_dataset['image_path'] = train_dataset['image_id'].apply(build_image_path)
test_dataset['image_path'] = test_dataset['image_id'].apply(build_image_path)

In [10]:
# Class-imbalance ratio: number of negative samples per positive sample.
train_aug_factor = train_neg_samples_cnt / train_pos_samples_cnt
test_aug_factor = test_neg_samples_cnt / test_pos_samples_cnt
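
The two factors measure how heavily negatives outnumber positives in each split. As a worked check using the counts printed earlier, the sketch below recomputes them in plain Python. The notebook does not say how the factor is used downstream, so the replication step at the end is only an assumed, hypothetical use, and `copies_per_positive` and `balanced_pos` are illustrative names.

```python
# Counts copied from the printed train/test summary.
train_neg, train_pos = 42778, 966
test_neg, test_pos = 10770, 192

train_ratio = train_neg / train_pos  # negatives per positive in train
test_ratio = test_neg / test_pos     # negatives per positive in test

print(f"train: {train_ratio:.2f} negatives per positive")
print(f"test:  {test_ratio:.2f} negatives per positive")

# Hypothetical use: replicate (or augment) each positive this many times
# so that the positive class roughly matches the negative class in size.
copies_per_positive = round(train_ratio)
balanced_pos = train_pos * copies_per_positive
print(copies_per_positive, balanced_pos)
```

The same ratio could instead serve as a positive-class weight in the loss; either choice is a design decision the notebook leaves open.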

