<a href="https://www.kaggle.com/code/gebreyowhansh/rsna-bcd-dicom-png-roi?scriptVersionId=135823583" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

* This notebook performs some exploratory data analysis and converts the images from DICOM format to PNG using the third-party Python library dicomsdl.


# <span style="color:teal">1. Install Libraries <a class="anchor"  id="Libraries"></a></span>

## <span style="color:teal">1.1 dicomsdl is a third-party Python library used to convert DICOM images to PNG <a class="anchor"  id="dicomsdl"></a></span>

In [None]:
from IPython.display import clear_output
!pip -q install dicomsdl
!pip install pylibjpeg
!pip install python-gdcm
!pip install plotly==5.11.0
clear_output()

## <span style="color:teal">1.2 Importing pre-installed libraries <a class="anchor"  id="additionalLibraries"></a></span>

In [None]:
import os, random, cv2, dicomsdl
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

from tqdm import tqdm
from joblib import Parallel, delayed
from collections import Counter
from pathlib import Path

import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
import seaborn as sns
import plotly.express as px
import pydicom
import pylibjpeg

sns.set_style("darkgrid")

# <span style="color:teal"> 2. Create basic configuration class <a class="anchor"  id="configuration"></a></span>
 * The configuration class collects the input paths, output paths, and image sizes used throughout the notebook

In [None]:
class Config:
    def __init__(self):

        self.path = '/kaggle/input/rsna-breast-cancer-detection/'
        self.train_path = self.path + 'train_images/'
        self.test_path = self.path + 'test_images/'
        self.train_csv = self.path + 'train.csv'
        self.test_csv = self.path + 'test.csv'
        self.sample_submission = self.path + 'sample_submission.csv'
        
        self.output_path = '/kaggle/working/'
        self.train_output_path = self.output_path + 'train_images/'
        self.test_output_path = self.output_path + 'test_images/'

        self.img_size = 1024            # side length of the square intermediate image
        self.resize_dim = (1024, 512)   # (height, width) of the saved ROI image

config = Config()

# <span style="color:teal"> 3. Exploratory Data Analysis  <a class="anchor"  id="trainingdatainfo"></a></span>

## <span style="color:teal"> 3.1 Training dataframe info  <a class="anchor"  id="trainingdatainfo"></a></span>

In [None]:
train_df = pd.read_csv(config.train_csv)
train_df.describe()

In [None]:
train_df.shape

In [None]:
train_df.head()

In [None]:
train_df.info()

#### <span style="color:teal"> Profiling Report  <a class="anchor"  id="profile"></a></span>
 * Pandas Profiling is a library that summarizes the exploratory information about a dataframe in a single report

In [None]:
profile=ProfileReport(train_df,title="RSNA Train data profiling", explorative=True)
profile.to_notebook_iframe()

---
#### <span style="color:teal"> Early Observations  <a class="anchor"  id="profile1"></a></span>

- **image_id** is the unique key in the training data
- The training data covers **54,706** images (no duplicates)
- These images come from **11,913** patients
- The observations come from two sites
- There are 6 distinct views, but **MLO** and **CC** account for >99% of images
- **age** ranges from 26 to 89 years and is missing for 0.1% (37) of images
- **cancer**, the target variable, is positive for 2.1% of the 54,706 images
- 5.4% of the images warranted a biopsy
- 1.5% of the images showed an invasive cancer; among cancer-positive images, 70.6% showed invasive cancer
- **BIRADS** information is missing for a majority of images
- Fewer than 2.7% of images were of breasts with implants
- A large proportion of **density** information is missing
- 10 different machines were used across the two sites
- 15% of all images were difficult to classify as negative for cancer
---
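
Several of the figures above can be recomputed directly from the training dataframe. The sketch below shows the kind of pandas expressions involved, run on a tiny made-up frame so it works without the competition data. `toy_df` and all of its values are assumptions for illustration; only the column names (`patient_id`, `image_id`, `view`, `cancer`, `invasive`) follow `train.csv`.

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "image_id":   [10, 11, 12, 13, 14, 15, 16, 17],
    "view":       ["CC", "MLO", "CC", "MLO", "CC", "MLO", "CC", "AT"],
    "cancer":     [0, 0, 1, 1, 0, 0, 0, 0],
    "invasive":   [0, 0, 1, 0, 0, 0, 0, 0],
})

# image_id should be a unique key: one row per image.
image_id_is_unique = toy_df["image_id"].is_unique

# Image-level cancer rate, in percent.
cancer_rate = toy_df["cancer"].mean() * 100

# Share of cancer-positive images that are also invasive, in percent.
invasive_share = toy_df.loc[toy_df["cancer"] == 1, "invasive"].mean() * 100

# Share of images taken in the two dominant views, in percent.
main_view_share = toy_df["view"].isin(["CC", "MLO"]).mean() * 100

print(image_id_is_unique, cancer_rate, invasive_share, main_view_share)
```

Replacing `toy_df` with `train_df` should reproduce the corresponding percentages in the list above.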

### <span style="color:teal"> Patient Level Profiles  <a class="anchor"  id="profile2"></a></span>


In [None]:
total_patients = train_df['patient_id'].nunique()
# A patient counts as positive if at least one of their images is labeled cancer.
unique_cancer_patients = (train_df.groupby('patient_id')['cancer'].sum() > 0).sum()
print(f"\nThe total number of patients diagnosed with cancer is {unique_cancer_patients}")
print(f"This indicates a prevalence rate of {round(unique_cancer_patients / total_patients * 100, 2)}% in {total_patients} patients.")

## <span style="color:teal"> 3.2 Testing dataframe info  <a class="anchor"  id="test"></a></span>

In [None]:
test_df = pd.read_csv(config.test_csv)
test_df.head()

In [None]:
test_df.shape

In [None]:
test_df.info()

## <span style="color:teal"> 3.3 Sample submission dataframe info <a class="anchor"  id="test"></a></span>

In [None]:
sample_submission_df = pd.read_csv(config.sample_submission)
sample_submission_df

## <span style="color:teal"> 3.4 Number of images per patient <a class="anchor"  id="test"></a></span>

In [None]:
images_counter = train_df["patient_id"].value_counts().sort_index()

fig = px.histogram(images_counter, text_auto=True, title="Number of images per patient")
fig.update_layout(bargap=0.2)
fig.show()


## <span style="color:teal"> 3.5 Distribution of patients' age <a class="anchor"  id="age"></a></span>

In [None]:
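# Drop patients with no recorded age instead of filling with 0,
# which would add a spurious bar at age 0 to the histogram.
person_age = train_df.groupby("patient_id")['age'].max().sort_index().dropna().astype('int64')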

fig = px.histogram(person_age, title="Distribution of patients' age")
fig.show()

## <span style="color:teal"> 3.6 Sample image data <a class="anchor"  id="sampleimagedata"></a></span>

In [None]:
sample_image = config.train_path+"10006/1459541791.dcm"
pydicom.dcmread(sample_image)

## <span style="color:teal"> 3.7 Handle missing values <a class="anchor"  id="missingData"></a></span>

In [None]:
def filter_missing_Features(df):
    total_missing_data = [df[col].isnull().sum() for col in df.columns]
    percentage_of_missing = [df[col].isnull().mean() for col in df.columns]
    result = pd.DataFrame(zip(total_missing_data, percentage_of_missing), columns=['total_missing_data', 'percentage_of_missing'], index=df.columns)
    result = result.sort_values('total_missing_data', ascending=False)
    return result
filter_missing_Features(train_df)

* The **BIRADS** column is missing for almost half of the rows and there is no obvious way to fill it, so we drop this feature
* The **age** column has very few missing values, so we fill them with the mean age of the remaining rows
* The **density** column is also missing for almost half of the rows, so we fill it with an extra category labeled 'E'

In [None]:
train_df = train_df.drop(['BIRADS'], axis=1)
train_df['age'] = train_df['age'].fillna(train_df['age'].mean())
train_df['density'] = train_df['density'].fillna('E')

filter_missing_Features(train_df)


## <span style="color:teal">3.8 Fix Data Types <a class="anchor"  id="FixingDataType"></a></span>

In [None]:
train_df['laterality'] = train_df['laterality'].astype('category')
train_df['view'] = train_df['view'].astype('category')
train_df['age'] = train_df['age'].astype('int64')
train_df['density'] = train_df['density'].astype('category')
train_df['cancer'] = train_df['cancer'].astype('float32')

train_df.info()

## <span style="color:teal">3.9 Numerical categorization of the age feature <a class="anchor"  id="FixingDataType"></a></span>

In [None]:
train_df["age_bin"] = pd.cut(train_df['age'].values.reshape(-1), bins=5, labels=False)
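
With an integer `bins`, `pd.cut` splits the observed range into that many equal-width intervals, and `labels=False` returns the integer index of the interval each value falls into. A small sketch on made-up ages (the `ages` values are assumptions chosen to span the 26 to 89 range noted in the observations) prints the resulting codes and interval edges:

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([26, 38, 40, 52, 63, 77, 89])

# bins=5       -> five equal-width intervals over [min, max]
# labels=False -> integer bin codes 0..4 instead of Interval labels
# retbins=True -> also return the computed interval edges
age_bin, edges = pd.cut(ages, bins=5, labels=False, retbins=True)

print(age_bin.tolist())
print(edges.round(1).tolist())
```

Because the bins are equal in width rather than equal in count, sparsely populated age ranges can end up with few or no rows.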

## <span style="color:teal">3.10 Generate dummy variables for the categorical columns <a class="anchor"  id="FixingDataType"></a></span>

In [None]:
cat_cols = ['laterality', 'view', 'density', 'difficult_negative_case']
train_df = pd.get_dummies(train_df, columns=cat_cols)
train_df.info()
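
`pd.get_dummies` replaces each listed column with one indicator column per distinct value, named `<column>_<value>`, and leaves the other columns in place. A minimal sketch on a made-up frame (`toy_df` is hypothetical and only mimics two of the categorical columns):

```python
import pandas as pd

# Invented rows for illustration only.
toy_df = pd.DataFrame({
    "image_id":   [10, 11, 12],
    "laterality": ["L", "R", "L"],
    "view":       ["CC", "MLO", "CC"],
})

# Each listed column is expanded into one indicator column per category.
encoded = pd.get_dummies(toy_df, columns=["laterality", "view"])

print(encoded.columns.tolist())
```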

## <span style="color:teal">3.11 Add columns for the DICOM image path and the processed image path <a class="anchor"  id="FixingDataType"></a></span>


In [None]:
train_df['dicom_path'] = config.train_path + train_df['patient_id'].astype(str) + '/' + train_df['image_id'].astype(str) + '.dcm'
train_df['image_path'] = config.train_output_path + train_df['patient_id'].astype(str) + '/' + train_df['image_id'].astype(str) + '.png'

test_df['dicom_path'] = config.test_path + test_df['patient_id'].astype(str) + '/' + test_df['image_id'].astype(str) + '.dcm'
test_df['image_path'] = config.test_output_path + test_df['patient_id'].astype(str) + '/' + test_df['image_id'].astype(str) + '.png'

In [None]:
train_df.head()

In [None]:
test_df.head()

## <span style="color:teal">3.12 Save the prepared training dataframe to CSV <a class="anchor"  id="FixingDataType"></a></span>

In [None]:
train_df.to_csv("train_df_processed.csv", sep=',',index=False)

# <span style="color:teal"> 4. Functions to convert DICOM images, extract the ROI, and save the result <a class="anchor"  id="UtilityFunction"></a></span>

## <span style="color:teal"> 4.1 DICOM to PNG <a class="anchor"  id="dicomtopng"></a></span>

* In **dicom_to_png()** below, `dicom.pixelData(storedvalue=False)` reads the pixel array through the dicomsdl library. With **storedvalue=False**, the values are returned after the file's rescale slope and intercept (if present) have been applied, rather than as raw stored values. The two lines that follow then min-max normalize the array to floating-point values between 0 and 1.
* The check **dicom.PhotometricInterpretation == 'MONOCHROME1'** inverts the image when needed: in **MONOCHROME1** files higher pixel values represent darker areas, whereas in **MONOCHROME2** files higher pixel values represent brighter areas.

In [None]:
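def dicom_to_png(dicom_path):
    dicom = dicomsdl.open(dicom_path)
    image = dicom.pixelData(storedvalue=False)
    # Min-max normalize to [0, 1]; skip the division for a constant image.
    image = image - np.min(image)
    max_value = np.max(image)
    if max_value > 0:
        image = image / max_value

    # MONOCHROME1 stores dark areas as high values, so invert it.
    if dicom.PhotometricInterpretation == 'MONOCHROME1':
        image = 1.0 - image

    # Resize to a square and convert to 8-bit for saving as PNG.
    image = cv2.resize(image, (config.img_size, config.img_size), interpolation=cv2.INTER_LINEAR)
    image = (image * 255).astype(np.uint8)
    return image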
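
The normalization and inversion steps can be checked without a DICOM file. The sketch below applies the same arithmetic to a small made-up array; `normalize_and_invert` and its `photometric` argument are hypothetical names used only for this illustration, and the resize step is left out.

```python
import numpy as np

def normalize_and_invert(pixels, photometric):
    """Min-max scale to [0, 1], invert MONOCHROME1, and return uint8."""
    image = pixels.astype(np.float32)
    image = image - image.min()
    peak = image.max()
    if peak > 0:  # a constant image would otherwise divide by zero
        image = image / peak
    if photometric == "MONOCHROME1":
        image = 1.0 - image
    return (image * 255).astype(np.uint8)

# Invented pixel values for illustration only.
raw = np.array([[100, 600], [1100, 2100]])

mono2 = normalize_and_invert(raw, "MONOCHROME2")
mono1 = normalize_and_invert(raw, "MONOCHROME1")

print(mono2)
print(mono1)
```

The two outputs mirror each other around mid-gray, which is what the inversion is meant to achieve.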

## <span style="color:teal"> 4.2 Region of Interest (ROI): <a class="anchor"  id="regionOfInterest"></a></span>
 * In image processing and computer vision, a Region of Interest (ROI) is a portion of an image that is selected for further processing or analysis. An ROI can be defined as a rectangular, circular, or polygonal area that contains the object or region of interest.
 * It is an important step in many image processing and computer vision applications because it allows us to focus on the most relevant parts of an image and reduce the amount of data that needs to be processed.
 * **Contours** are the boundaries of objects or shapes in an image. In image processing, a contour is defined as a curve joining all the continuous points along a boundary that have the same color or intensity.

In [None]:
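def png_to_roi(image, image_path):
    # image_path is unused; it is kept so existing calls keep working.
    # Threshold away the near-black background.
    bin_image = cv2.threshold(image, 20, 255, cv2.THRESH_BINARY)[1]
    contours, _ = cv2.findContours(bin_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if contours:
        # Crop to the bounding box of the largest contour.
        x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
        roi = image[y:y + h, x:x + w]
    else:
        # Nothing above the threshold: fall back to the full image.
        roi = image
    # cv2.resize expects (width, height); resize_dim is stored as (height, width).
    return cv2.resize(roi, config.resize_dim[::-1], interpolation=cv2.INTER_LINEAR)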
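
The cropping idea can be illustrated on a synthetic image. The sketch below is a simplified NumPy-only variant: it takes the bounding box of all pixels above the threshold instead of the largest OpenCV contour, so it is not a drop-in replacement for `png_to_roi`. `crop_to_foreground` is a hypothetical helper name, and the canvas is invented for illustration.

```python
import numpy as np

def crop_to_foreground(image, threshold=20):
    """Crop to the bounding box of all pixels brighter than threshold."""
    mask = image > threshold
    if not mask.any():
        return image  # nothing to crop to
    rows = np.flatnonzero(mask.any(axis=1))  # row indices containing foreground
    cols = np.flatnonzero(mask.any(axis=0))  # column indices containing foreground
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Black canvas with one bright rectangle standing in for the breast region.
canvas = np.zeros((8, 10), dtype=np.uint8)
canvas[2:6, 3:8] = 200

roi = crop_to_foreground(canvas)
print(canvas.shape, "->", roi.shape)
```

Unlike the contour-based version, this variant would also include small bright artifacts far from the main region, which is one reason to prefer the largest contour on real mammograms.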


## <span style="color:teal"> 4.3 Process and save images into the output directory <a class="anchor"  id="process"></a></span>

In [None]:
def process(dicom_path, image_path):
    image = dicom_to_png(dicom_path)
    os.makedirs(os.path.dirname(image_path), exist_ok=True)
    image = png_to_roi(image, image_path)
    cv2.imwrite(image_path, image)

In [None]:
Parallel(n_jobs=4, backend='threading')(delayed(process)(dicom_path, image_path) 
                   for dicom_path, image_path in tqdm(zip(test_df['dicom_path'], 
                                                          test_df['image_path'])))
clear_output()

In [None]:
Parallel(n_jobs=4, backend='threading')(delayed(process)(dicom_path, image_path) 
                   for dicom_path, image_path in tqdm(zip(train_df['dicom_path'], 
                                                          train_df['image_path'])))
clear_output()
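
The two cells above fan the per-image work out over four threads. The same pattern is sketched below with the standard library's `ThreadPoolExecutor`, used here as a stand-in for joblib's threading backend so the example has no third-party dependencies. `fake_process` is a hypothetical placeholder for `process`, and the paths are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_process(dicom_path, image_path):
    # Placeholder for the real conversion: just report what would be written.
    return f"{dicom_path} -> {image_path}"

# Invented paths for illustration only.
dicom_paths = ["a/1.dcm", "a/2.dcm", "b/3.dcm"]
image_paths = ["out/a/1.png", "out/a/2.png", "out/b/3.png"]

# map() passes one item from each iterable per call and keeps the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_process, dicom_paths, image_paths))

print(results)
```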

In [None]:
!cp /kaggle/input/rsna-breast-cancer-detection/train.csv /kaggle/working/
!cp /kaggle/input/rsna-breast-cancer-detection/test.csv /kaggle/working/
!cp /kaggle/input/rsna-breast-cancer-detection/sample_submission.csv /kaggle/working/