---
## 1 | Introduction
---

## Goal of the Competition

The goal of this competition is to identify cases of breast cancer in mammograms from screening exams. It is important to identify cases of cancer for obvious reasons, but false positives also have downsides for patients. As millions of women get mammograms each year, a useful machine learning tool could help a great many people.

## Competition's Metric

Submissions are evaluated using the **probabilistic F1 score** (pF1). This extension of the traditional F score accepts probabilities instead of binary classifications. Our model should output the likelihood of cancer in the corresponding image. You can find a Python implementation [here](https://www.kaggle.com/code/sohier/probabilistic-f-score).

With $p_X$ as the probabilistic version of X:

$$p_{F_1} = 2 \cdot \frac{p_{precision} \cdot p_{recall}}{p_{precision} + p_{recall}}$$

where: 

$$p_{precision} = \frac{p_{TP}}{p_{TP} + p_{FP}}$$

$$p_{recall} = \frac{p_{TP}}{TP + FN}$$
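
To make the formulas above concrete, here is a minimal NumPy sketch of the metric. The function name `probabilistic_f1` is my own (it is not the official implementation linked above), and it assumes that labels are 0/1 and that predictions are clipped to [0, 1].

```python
import numpy as np

def probabilistic_f1(labels, predictions):
    """Probabilistic F1: predictions are probabilities, labels are 0/1."""
    labels = np.asarray(labels, dtype=float)
    preds = np.clip(np.asarray(predictions, dtype=float), 0.0, 1.0)

    p_tp = np.sum(preds * labels)          # probability mass placed on positives
    p_fp = np.sum(preds * (1.0 - labels))  # probability mass placed on negatives
    n_pos = np.sum(labels)                 # TP + FN, i.e. all actual positives

    if p_tp == 0 or n_pos == 0:
        return 0.0                         # avoid division by zero
    p_precision = p_tp / (p_tp + p_fp)
    p_recall = p_tp / n_pos
    return float(2 * p_precision * p_recall / (p_precision + p_recall))

print(probabilistic_f1([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))
```

Note that a constant prediction of 0.5 on a balanced set scores 0.5, so confident and correct probabilities are rewarded.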

## Images file format

> Images are given in dicom format. Here you'll find a tutorial to get started -> [Pulmonary Dicom Preprocessing](https://www.kaggle.com/code/allunia/pulmonary-dicom-preprocessing)

DICOM (Digital Imaging and Communications in Medicine) is the international standard to transmit, store, retrieve, print, process, and display medical imaging information, and its files can be sourced from different modalities. A DICOM file groups information into a single data set: alongside the image itself, the file contains the patient ID, date of birth, age, sex, and other information about the diagnosis. The figure below shows the main components of a medical image.

![](https://miro.medium.com/max/1400/1*BvVR-348gg0qRmVmm8gtxw.webp)

* **Pixel Depth**: is the number of bits used to encode the information of each pixel. For example, an 8-bit raster can have 256 unique values that range from 0 to 255.

* **Photometric Interpretation**: specifies how the pixel data should be interpreted for the correct image display as a monochrome or color image. To specify whether the color information is stored in the image pixel values, we introduce the concept of samples per pixel, also known as the number of channels.

* **Metadata**: is the information that describes the image (e.g. patient ID, date of the image).

* **Pixel Data**: is the section where the numerical values of the pixels are stored. This is the data that we are going to feed to the network.

All the components are essential, but for our purposes the pixel depth and the pixel data matter most. To my knowledge, converting ultrasound images to another format is not an issue, but here we have to take the depth of the image into consideration: converting a 16-bit DICOM image to an 8-bit JPEG or PNG might degrade the image quality and the image features.
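
As a rough illustration of what is lost when the bit depth is reduced, the sketch below min-max scales a synthetic 12-bit ramp (stored in a 16-bit array) down to 8 bits. The helper `to_uint8` and the synthetic data are assumptions for demonstration only; real mammograms would come from `pixel_array`.

```python
import numpy as np

def to_uint8(pixels):
    """Min-max scale an integer image to 8 bits (lossy for deeper images)."""
    pixels = pixels.astype(np.float32)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:                       # flat image: avoid division by zero
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8)

# Synthetic 12-bit ramp stored as uint16 (stand-in for real pixel data)
ramp = np.arange(4096, dtype=np.uint16).reshape(64, 64)
small = to_uint8(ramp)
print(len(np.unique(ramp)), "->", len(np.unique(small)), "distinct gray levels")
```

Whatever the input depth, the output can hold at most 256 distinct gray levels, so neighbouring intensities get merged.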

In [None]:
!pip install -U pylibjpeg pylibjpeg-openjpeg pylibjpeg-libjpeg pydicom python-gdcm

In [None]:
from IPython.display import clear_output, display_html
import os
import warnings
from pathlib import Path

# Basic libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

# Set Color Palettes for the notebook
# Inspired by: https://www.kaggle.com/code/andradaolteanu/rsna-fracture-detection-dicom-images-explore
custom_colors = ['#74a09e','#86c1b2','#98e2c6','#f3c969','#f2a553', '#d96548', '#c14953']
print('Custom Colors Palette: ')
sns.palplot(sns.color_palette(custom_colors))

import scipy as sc
from scipy import stats

# Train Test Split
from sklearn.model_selection import train_test_split

# Cross Validation
from sklearn.model_selection import KFold, cross_val_score, StratifiedKFold, learning_curve, train_test_split

# Tensorflow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go

warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

---
## 2 | Exploratory Data Analysis
---

In this section we'll focus on examining both the metadata and the images that we're given. Let's start by loading the metadata datasets. 

In [None]:
def load_data():
    '''Load each of the datasets we are given.'''
    
    data_dir = Path("../input/rsna-breast-cancer-detection")
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    sample_submission = pd.read_csv(data_dir / 'sample_submission.csv')
    return train, test, sample_submission

from termcolor import colored
def data_info(csv, name="Train"):
    '''Prints basic information about the datasets we are given.'''
    # Inspired by: https://www.kaggle.com/code/andradaolteanu/rsna-fracture-detection-dicom-images-explore
    
    print(colored('==== {} ===='.format(name), 'cyan', attrs=['bold']))
    print(colored('Shape: ', 'cyan', attrs=['bold']), csv.shape)
    print(colored('NaN Values: ', 'cyan', attrs=['bold']), csv.isnull().sum().sum(), '\n')
    #print(colored('Columns: ', 'blue', attrs=['bold']), list(csv.columns))
    
    display_html(csv.head())
    if name != 'Sample Submission': print("\n")

train, test, sample_submission = load_data()
clear_output()

names = ["Train", "Test", "Sample Submission"]
for i, df in enumerate([train, test, sample_submission]): 
    data_info(df, names[i])

📌 **Early insights:** 
* In the metadata training file we have plenty of missing values. 
* Repeated values for `patient_id`. It seems that for each patient, 4 images have been taken. 
* Some features from the training set do not appear in the testing one. 

To analyse the data properly, we're going to load the metadata stored in the image files into a dataframe (the code below reads the first 5,000 files). Some of this data may be useful afterwards for model training and splitting strategies. Below, you have a quick example of the metadata in a .dcm file. 

In [None]:
import pydicom
from os import listdir

dcm_path = "/kaggle/input/rsna-breast-cancer-detection/train_images/10006/1459541791.dcm"
img = pydicom.dcmread(dcm_path)
img

The image data is stored in `Pixel Data`. Everything else is metadata.

* The `Rows` and `Columns` values tell us the image size.
* The `Pixel Spacing` and `Slice Thickness` tell us the pixel size and thickness.
* The `Window Center` and `Window Width` give information about the brightness and contrast of the image respectively.
* The `Rescale Intercept` and `Rescale Slope` determine the range of pixel values.
* `ImagePositionPatient` tells us the x, y, and z coordinates of the top left corner of each image in mm.
* `InstanceNumber` is the slice number.
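
To show how `Window Center` and `Window Width` relate to brightness and contrast, here is a minimal sketch of a linear window. `apply_window` is a hypothetical helper, and the formula is a simplified linear window rather than the exact definition in the DICOM standard; it assumes `width > 0`.

```python
import numpy as np

def apply_window(pixels, center, width):
    """Clip to [center - width/2, center + width/2] and rescale to [0, 1]."""
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = np.clip(pixels.astype(np.float32), low, high)
    return (clipped - low) / (high - low)

demo = np.array([[0, 1000], [2000, 4000]], dtype=np.uint16)
print(apply_window(demo, center=2000, width=2000))
```

Moving the center shifts which intensities land mid-gray (brightness), while a narrower width stretches a smaller intensity range over the full output (contrast).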

In [None]:
dcms = []
for root, dirs, fnames in os.walk('/kaggle/input/rsna-breast-cancer-detection/train_images/'):
    dcms += list(os.path.join(root, f) for f in fnames if f.endswith('.dcm'))
print(f'There are {len(dcms)} images')

attrs = set()
for fname in tqdm(dcms[:5000]):
    with pydicom.dcmread(fname) as obj:
        attrs.update(obj.dir())

dcm_keys = list(attrs)
dcm_keys.remove('PixelData') # The actual array of pixels, this is not metadata
dcm_keys

In [None]:
meta = []
typemap = {
    pydicom.uid.UID: str,
    pydicom.multival.MultiValue: list
}
def cast(x):
    return typemap.get(type(x), lambda x: x)(x)

for i, fname in enumerate(tqdm(dcms[:5000])):
    with pydicom.dcmread(fname) as obj:
        meta.append([cast(obj.get(key, np.nan)) for key in dcm_keys])

dfmeta = pd.DataFrame(meta, columns=dcm_keys)
dfmeta.head()

In [None]:
print('Values for Photometric Interpretation: {}'.format(dfmeta['PhotometricInterpretation'].unique()))
print('Values for VOILUTFunction: {}\n'.format(dfmeta['VOILUTFunction'].unique()))

plt.figure(figsize = (22,16))
for i, col in enumerate(dfmeta.select_dtypes([int, float]).columns):
    plt.subplot(4,4, i+1)
    # histplot replaces the deprecated sns.distplot
    sns.histplot(dfmeta[col], kde=True, stat='density', color = custom_colors[0])

📌 **Early Insights**:
* Large image sizes: the number of rows peaks near 4k and the number of columns near 3k. Moreover, we observe that images come in different sizes and resolutions. Later on, we'll determine whether padding is going to be needed. 

In [None]:
dfmeta[['Rows','Columns']].describe().T.style.background_gradient(cmap='GnBu_r')

* `Photometric Interpretation` takes the values **MONOCHROME1** and **MONOCHROME2**. We have to be careful about that, as image interpretation varies from one type to the other. The same happens with `VOILUTFunction`: different values are given. 

* The dataset contains **compressed Pixel Data**. By itself pydicom can only handle Pixel Data that hasn't been compressed, but if you install [one or more optional libraries](https://pydicom.github.io/pydicom/stable/tutorials/installation.html#install-the-optional-libraries) then it can handle various compressions. [This table](https://pydicom.github.io/pydicom/stable/old/image_data_handlers.html#supported-transfer-syntaxes) tells you which package is required.

* `BodyPartThickness` refers to the average thickness in mm of the body part examined when compressed, if compression has been applied during exposure.

In [None]:
dfmeta[['CompressionForce','BodyPartThickness']].describe().T.style.background_gradient(cmap='GnBu_r')
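
Coming back to the two `Photometric Interpretation` values noted above: in MONOCHROME1 the lowest pixel value is displayed as white, while in MONOCHROME2 it is displayed as black. A minimal sketch for bringing both to the same convention could look like this. `to_monochrome2` is a hypothetical helper, and inverting around the image maximum is one common choice, not the only one.

```python
import numpy as np

def to_monochrome2(pixels, photometric_interpretation):
    """Invert MONOCHROME1 images so that low values are dark in every image."""
    if photometric_interpretation == "MONOCHROME1":
        return pixels.max() - pixels   # stays within the unsigned range
    return pixels

demo = np.array([[0, 10], [5, 10]], dtype=np.uint16)
print(to_monochrome2(demo, "MONOCHROME1"))
```

With a pydicom dataset `ds`, this would be called as `to_monochrome2(ds.pixel_array, ds.PhotometricInterpretation)`.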

Let's now show some breast images. 

In [None]:
dcm_path = "/kaggle/input/rsna-breast-cancer-detection/train_images/"

def patient_images(p_id): 
    ''' Shows up to four of the images that are associated with the patient for whom the ID is given. '''

    figure = plt.figure(figsize = (22,5))
    # The grid only has 4 slots, so cap the number of files shown
    for i, file in enumerate(listdir(dcm_path + str(p_id) + '/')[:4]):
        plt.subplot(1, 4, i+1)
        dataset = pydicom.dcmread(dcm_path + str(p_id) + '/' + file)
        plt.imshow(dataset.pixel_array, cmap=plt.cm.bone)
        plt.axis('off')

patient_images(train['patient_id'].unique()[0])

* Breasts occupy only a **small portion** of the image, so it'd be nice to **crop out** the sections of the images that do not contain any useful information. As you may have observed, some images seem to contain a vertical line. It could be useful to take these lines into account when cropping.
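
One simple way to approach the cropping idea is a threshold followed by a bounding box. The sketch below is only an illustration on a synthetic array: `crop_to_foreground` and the `threshold` value are assumptions, and it presumes a dark background (a bright background would need to be inverted first).

```python
import numpy as np

def crop_to_foreground(image, threshold):
    """Crop to the bounding box of pixels brighter than `threshold`."""
    mask = image > threshold
    if not mask.any():                 # nothing above threshold: keep as is
        return image
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Synthetic stand-in: a bright block on a dark canvas
canvas = np.zeros((100, 80), dtype=np.uint16)
canvas[20:70, 0:30] = 500
cropped = crop_to_foreground(canvas, threshold=0)
print(canvas.shape, "->", cropped.shape)
```

A thin bright artifact such as a vertical line would enlarge this bounding box, so real images may need extra filtering before this step.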

## Site and Patient

Starting with the hospital (site), we can observe that there are only two of them in our dataset. Apart from that, **background colors depend on the site** where the image was taken: for site 2 the background is blank (white), whereas for site 1 it is almost black.

In [None]:
dcm_path = "/kaggle/input/rsna-breast-cancer-detection/train_images/"

def images_site(site_id):
    ids = train[train.site_id == site_id]['patient_id'].unique()
    for i, id_ in enumerate(ids[[0,3]]):
        patient_path = dcm_path + str(id_) +'/'
        fig = plt.figure(figsize = (22,5))
        for j, file in enumerate(listdir(patient_path)[:4]):  # the grid only has 4 slots
            plt.subplot(1, 4, j+1)
            dataset = pydicom.dcmread(patient_path + file)
            p = plt.imshow(dataset.pixel_array, cmap=plt.cm.bone)
            plt.axis('off');

print('There are {} different hospitals in the dataset.\n'.format(len(train.site_id.unique())))            
for val in train.site_id.unique(): 
    images_site(val)

In [None]:
print('There are {} unique patients in the Train Set.'.format(len(train['patient_id'].unique())))

data = train.groupby(by="patient_id")['laterality'].count().reset_index(drop=False)
data = data.sort_values(['laterality']).reset_index(drop=True)

print("\nMinimum number of entries are: {}".format(data["laterality"].min()), "\n" +
      "Maximum number of entries are: {}\n".format(data["laterality"].max()))

plt.figure(figsize = (16, 4))
img = sns.barplot(x=data.index, y=data['laterality'], color=custom_colors[2])  # keyword args are required by recent seaborn
plt.title("Number of Entries per Patient", fontsize = 17)
plt.ylabel('Frequency', fontsize=14)
img.axes.get_xaxis().set_visible(False);

* The most common frequency is **4 images per patient**. However, a large number of patients have 5 or 6 images. Rarely, a patient has more images associated. 
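
Since each patient contributes several images, a validation split should probably keep all images of a patient on the same side. Here is a minimal sketch of such a patient-level split on toy data. `split_by_patient`, `val_fraction` and `seed` are hypothetical names, and the split does not stratify on `cancer` (scikit-learn's `StratifiedGroupKFold` may be worth a look for that).

```python
import numpy as np
import pandas as pd

def split_by_patient(df, val_fraction=0.2, seed=0):
    """Hold out a fraction of patients so no patient is in both parts."""
    rng = np.random.default_rng(seed)
    patients = rng.permutation(df["patient_id"].unique())
    n_val = max(1, int(round(len(patients) * val_fraction)))
    is_val = df["patient_id"].isin(patients[:n_val])
    return df[~is_val], df[is_val]

# Toy metadata: 10 patients with 4 images each
toy_meta = pd.DataFrame({"patient_id": np.repeat(np.arange(10), 4)})
train_part, val_part = split_by_patient(toy_meta)
print(len(train_part), len(val_part))
```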

## Laterality, View and Age 

* We have almost the same number of left breast images as right ones. 
* Very few values under 40 years old for `Age`. Some peaks between 50 and 70 years old. 
* Six different values for the `view` feature, and they are quite imbalanced (**CC** and **MLO** are the most common ones). 

The `Laterality` feature indicates whether the image is of the left or right breast. If left and right images need a common orientation, this can be fixed quite fast with OpenCV tools, for example. We'll focus on it later. `View`, instead, refers to the orientation of the image. The default for a screening exam is to capture two views per breast, which is why we have almost the same number of left and right breast images.  

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (16,4))
# Recent seaborn versions require x= as a keyword; histplot replaces the deprecated distplot
sns.countplot(x = train.laterality, ax = axes[0], palette = custom_colors)
axes[0].set_title('Laterality Count')
sns.countplot(x = train.view, ax = axes[1], palette = custom_colors[3:])
axes[1].set_title('View Count')
sns.histplot(train.age, kde = True, stat = 'density', ax = axes[2], color = custom_colors[5])
axes[2].set_title('Age Distribution')
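
As a sketch of the orientation fix mentioned above, the snippet below flips an image horizontally when most of its intensity sits in the right half. It uses NumPy rather than OpenCV to stay dependency-free; `face_left` is a hypothetical helper, and the intensity heuristic assumes a dark background.

```python
import numpy as np

def face_left(image):
    """Flip horizontally if most of the intensity sits in the right half."""
    half = image.shape[1] // 2
    left_mass = image[:, :half].sum(dtype=np.float64)
    right_mass = image[:, half:].sum(dtype=np.float64)
    return np.fliplr(image) if right_mass > left_mass else image

# Synthetic stand-in: bright tissue touching the right edge
demo = np.zeros((4, 6), dtype=np.uint16)
demo[:, 4:] = 1
print(face_left(demo))
```

The `laterality` column could be used instead of the heuristic, provided it maps consistently to the side on which the tissue appears.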

## ❗❗❗View feature is more important than you think❗❗❗

Now, **let's explain the `View` feature in more detail**. Recently, [@hengck23](https://www.kaggle.com/hengck23) posted a discussion about the fact that the ADMANI dataset[1] is part of this competition's data (see the discussion at https://www.kaggle.com/competitions/rsna-breast-cancer-detection/discussion/370333#2076911), together with a notebook explaining the details of one paper that shows good results with the ADMANI dataset. Let's take a closer look at the image attached to the explanation: 

![](https://i.ibb.co/4S9BcxG/Selection-318.png)

You can observe that images are classified as **main** and **auxiliary** images. Let's now look at an example image for each of the different types of view we're given. 

In [None]:
fig, axes = plt.subplots(nrows = 2, ncols = 3, figsize = (22,10))
for i, val in enumerate(train.view.unique()):
    ids = train[train.view == val]['patient_id'].unique()
    image_id = train[(train.patient_id == ids[0]) & (train.view == val)]['image_id']
    img_id = dcm_path + str(ids[0]) + '/' + str(image_id.values[0]) + '.dcm'
    dataset = pydicom.dcmread(img_id)
    axes[i // 3, i % 3].imshow(dataset.pixel_array, cmap=plt.cm.bone)
    axes[i // 3, i % 3].axis('off')
    axes[i // 3, i % 3].set_title('View: {}'.format(val))

In fact, **CC and MLO correspond to the main and auxiliary types**, respectively. To check this yourself, please head to the following [article](https://radiopaedia.org/articles/craniocaudal-view). Therefore, trying different approaches when training our models (such as making a distinction between the two views) could make a significant difference on the LB. It's going to be worth the time to do some research about it. 
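
One possible first step towards treating the two views differently is to pair the CC and MLO images of each breast. The sketch below does this on a toy dataframe that mimics the `patient_id`, `laterality`, `view` and `image_id` columns; the values are made up, and `.first()` keeps only one image per view, which is a simplification.

```python
import pandas as pd

# Toy rows mimicking the layout of train.csv (values are invented)
toy = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 2],
    "laterality": ["L", "L", "R", "R", "L", "L"],
    "view":       ["CC", "MLO", "CC", "MLO", "CC", "MLO"],
    "image_id":   [11, 12, 13, 14, 21, 22],
})

main_views = toy[toy["view"].isin(["CC", "MLO"])]
pairs = (
    main_views
    .groupby(["patient_id", "laterality", "view"])["image_id"]
    .first()            # one image per (breast, view)
    .unstack("view")    # columns: CC, MLO
)
print(pairs)
```

Each row of `pairs` then describes one breast, with the CC and MLO image ids side by side.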

Moreover, let's examine whether there is any relationship between these values and the diagnosis of cancer. As a reminder, apart from CC and MLO, the views have very few samples. We must take this into account to analyse these plots properly. 

> As can be seen, the values are the same for the CC and MLO types. This makes sense given what I said above. 

In [None]:
fig, axes = plt.subplots(nrows = 2, ncols = 3, figsize = (22, 8))
for i, val in enumerate(train.view.unique()):
    dt = [train[(train.view == val) & (train.cancer == c)].shape[0] for c in [0,1]]
    axes[i // 3, i % 3].pie(dt, labels = ['No Cancer','Cancer'], colors=[custom_colors[0], 
                            custom_colors[5]], autopct='%.2f%%')
    axes[i // 3, i % 3].set_title('View: {}'.format(val))

## Cancer, Biopsy, Invasive and BIRADS

These features are only provided for training. First of all, let's show images with both a negative and a positive diagnosis. At first glance, I can't notice any significant difference between these images. 



In [None]:
fig, axes = plt.subplots(nrows = 2, ncols = 4, figsize = (22,8))
for i, val in enumerate(train.cancer.unique()):
    ids = train[train.cancer == val]['patient_id'].unique()
    patient_path = dcm_path + str(ids[i]) +'/'
    for j, file in enumerate(listdir(patient_path)[:4]): 
        dataset = pydicom.dcmread(patient_path + file)
        axes[i,j].imshow(dataset.pixel_array, cmap=plt.cm.bone)
        axes[i,j].axis('off')
        axes[i,j].set_title('Cancer: {}'.format(val))

* The cancer, biopsy and invasive distributions are very imbalanced. It seems that there could be a relationship between them.
* In the `BIRADS` plot, we observe that there are lots of ratings that are negative for cancer. Rating a breast as normal is the least common one.

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 4, figsize = (22,5))
# Recent seaborn versions require x= as a keyword argument
sns.countplot(x = train.cancer, ax = axes[0], palette = custom_colors)
axes[0].set_title('Cancer')
sns.countplot(x = train.biopsy, ax = axes[1], palette = custom_colors[2:])
axes[1].set_title('Biopsy')
sns.countplot(x = train.invasive, ax = axes[2], palette = custom_colors[3:])
axes[2].set_title('Invasive')
sns.countplot(x = train.BIRADS, ax = axes[3], palette = custom_colors[4:])
axes[3].set_title('BIRADS')

Is there any relationship between them? Let's explore it below. 

* `Invasive` and `cancer` are **very correlated** (this makes sense if we observe features' definitions).   
* Negative correlation between `BIRADS` and the rest of the features. 

In [None]:
corr= train[['cancer','biopsy','invasive','BIRADS']].corr()
# Getting the Upper Triangle of the co-relation matrix
matrix = np.triu(corr)

fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (22,8))
# Heatmap without absolute values
sns.heatmap(corr, mask=matrix, center = 0, cmap = 'vlag', ax = axes[0]).set_title('Without absolute values')
# Heatmap with absolute values
sns.heatmap(abs(corr), mask=matrix, center = 0, cmap = 'vlag', ax = axes[1]).set_title('With absolute values')

fig.tight_layout(h_pad=1.0, w_pad=0.5)

The `Biopsy` feature indicates whether or not a follow-up biopsy was performed on the breast. Thus, it could be interesting to analyse its relation with the `Cancer` and `View` features.

In [None]:
fig, axes = plt.subplots(nrows = 2, ncols = 3, figsize = (22, 8))
fig.suptitle('Biopsy vs View', fontsize = 15)
for i, val in enumerate(train.view.unique()):
    dt = [train[(train.view == val) & (train.biopsy == c)].shape[0] for c in [0,1]]
    axes[i // 3, i % 3].pie(dt, labels = ['No Biopsy','Biopsy'], colors=[custom_colors[0], 
                            custom_colors[5]], autopct='%.2f%%')
    axes[i // 3, i % 3].set_title('View: {}'.format(val))

Now we focus on its relation with cancer diagnosis. We can see that when no biopsy was needed, a cancer diagnosis is ruled out. However, when a biopsy was performed, we get about the same number of positive and negative results. 

In [None]:
for b in [0,1]: 
    print('Biopsy: {}'.format('Not Performed' if b == 0 else 'Performed'))
    for c in [0,1]: 
        # Each row of train.csv is one image, so these are image counts
        n = train[(train.biopsy == b) & (train.cancer == c)].shape[0]
        print('\tImages with{}diagnosed cancer: {}'.format(' no ' if c == 0 else ' ', n))
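
The same biopsy versus cancer counts can also be read from a contingency table. Here is a minimal sketch with `pd.crosstab` on a toy dataframe; the values are made up and only mimic the `biopsy` and `cancer` columns.

```python
import pandas as pd

# Toy rows mimicking the biopsy and cancer columns (values are invented)
toy = pd.DataFrame({
    "biopsy": [0, 0, 0, 1, 1, 1],
    "cancer": [0, 0, 0, 0, 1, 1],
})

# Rows: biopsy, columns: cancer, with row and column totals
table = pd.crosstab(toy["biopsy"], toy["cancer"], margins=True)
print(table)
```

On the real data this would be `pd.crosstab(train.biopsy, train.cancer, margins=True)`.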

Now, we head into `Invasive` feature analysis. No clear difference between one type of image and another.

In [None]:
def invasive_images(val): 
    fig, axes = plt.subplots(nrows = 3, ncols = 4, figsize = (22,15))
    fig.suptitle('Invasive: {}'.format(val), fontsize = 10)
    ids = train[train.invasive == val]['patient_id'].unique()
    for i in range(3): 
        patient_path = dcm_path + str(ids[i]) +'/'
        for j, file in enumerate(listdir(patient_path)[:4]): 
            dataset = pydicom.dcmread(patient_path + file)
            axes[i,j].imshow(dataset.pixel_array, cmap=plt.cm.bone)
            axes[i,j].axis('off')
        
for val in train.invasive.unique():
    invasive_images(val)

In [None]:
# Each row of train.csv is one image, so this is a share of images
s1 = len(train[train.cancer == 1])
s2 = len(train[(train.cancer == 1) & (train.invasive == 1)])
print('Percentage of cancer images with an invasive cancer: {}%'.format(round(s2/s1 * 100, 2)))

## Implant and Density

Let's explore now the other two image features that we're given in this dataset. 

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (16,5))
# Recent seaborn versions require x= as a keyword argument
sns.countplot(x = train.implant, ax = axes[0], palette = custom_colors)
axes[0].set_title('Implant Count')
sns.countplot(x = train.density, ax = axes[1], palette = custom_colors[2:])
axes[1].set_title('Density Count')

* Almost no breasts have implants. We have just a few images of breasts with implants. 
* B and C are the most common values for `density`. As a reminder, highly dense tissue (case D) can make diagnosis more difficult, so this is something that we'll need to take into account in validation strategies. 

The following chart shows the four different types of density that we have. The closer the value is to D, the more white spots appear towards the end of the breast. For A density images, on the other hand, the colour is quite uniform.

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 4, figsize = (22,5))
fig.suptitle('Exploring Density Types', fontsize = 12)
# Drop missing values explicitly and show the types in A-D order
for i, val in enumerate(sorted(train.density.dropna().unique())):
    ids = train[train.density == val]['patient_id'].unique()
    patient_path = dcm_path + str(ids[9]) +'/'
    file = listdir(patient_path)[0]  # one image per density type
    dataset = pydicom.dcmread(patient_path + file)
    axes[i].imshow(dataset.pixel_array, cmap=plt.cm.bone)
    axes[i].axis('off')
    axes[i].set_title('Density: {}'.format(val))

## Machine ID

This feature does not seem likely to have a significant effect. However, we notice that most of the pictures were taken with machine 49. 

In [None]:
fig = plt.figure(figsize = (16,5))
sns.countplot(x = train.machine_id, palette = custom_colors)  # x= keyword is required by recent seaborn

---
## 3 | Preprocessing

---

To be continued ...



## Cropping out ROI

## Different sizes

## Different background colours

## Image Augmentation



---
## 4 | References

---

* [DICOM Metadata Extracting Attributes to DataFrame. Author: @anarthal](https://www.kaggle.com/code/anarthal/dicom-metadata-extracting-attributes-to-dataframe/notebook)

* [RSNA Fracture Detection: DICOM & Images Explore. Author: @andradaolteanu](https://www.kaggle.com/code/andradaolteanu/rsna-fracture-detection-dicom-images-explore)

* [MVCCL Model for Admani dataset. Author: @hengck23](https://www.kaggle.com/code/hengck23/mvccl-model-for-admani-dataset)