# Overview

* This competition asks you to classify findings on chest radiographs into 15 categories, one of which is normal ("No finding").
* It combines object detection with disease classification.
* All images in the dataset are in DICOM format, so we need to convert them to NumPy arrays. The article [Convert dicom to np.array - the correct way](https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way) is helpful here.

* The competition host describes the task as follows in [this thread](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/207741):
* In this competition, you’re given a set of training X-ray images in DICOM format, each of which was blindly annotated with bounding boxes of 14 classes by 3 radiologists from a pool of 17, encoded with Rad IDs from R1 to R17. The task is to automatically correctly predict boxes around abnormalities and classify them for the test images, whose ground-truth labels are hidden. Unlike the labels of the training set, those of the test set were already a consensus of 5 radiologists per image.
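
Because each training image carries boxes from three radiologists, the same abnormality is often boxed more than once. One common way to measure how much two annotations agree is intersection over union (IoU). The sketch below is only an illustration: the function name `box_iou` and the sample coordinates are assumptions, not part of the competition data or tooling.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlapping region (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes drawn by two radiologists on the same image
first_reader = (100, 100, 300, 300)
second_reader = (200, 200, 400, 400)
print(box_iou(first_reader, second_reader))
```

An IoU close to 1 means the two readers drew nearly the same box, while 0 means the boxes do not overlap at all.
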
# Columns


* image_id - unique image identifier

* class_name - the name of the class of detected object (or "No finding")

* class_id - the ID of the class of detected object

* rad_id - the ID of the radiologist that made the observation

* x_min - minimum X coordinate of the object's bounding box

* y_min - minimum Y coordinate of the object's bounding box

* x_max - maximum X coordinate of the object's bounding box

* y_max - maximum Y coordinate of the object's bounding box

In [None]:
import os
import re
import csv
import math
import random
import warnings
from glob import glob
from random import randint

import numpy as np
import pandas as pd

# DICOM and image handling
import pydicom
import pydicom as dcm
from pydicom.pixel_data_handlers.util import apply_voi_lut
from PIL import Image
import cv2

# Data visualization
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff

from bokeh.plotting import figure
from bokeh.io import output_notebook, show, output_file
from bokeh.models import ColumnDataSource, HoverTool, Panel
from bokeh.models.widgets import Tabs

# Data augmentation
import albumentations as albu

# Profiling and modelling libraries
import pandas_profiling
import xgboost as xgb
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

warnings.filterwarnings("ignore")

# Config File

In [None]:
class Cfg(object):
  
  def __init__(self):
    super(Cfg, self).__init__()
    self.dim = 512
    self.batch= 8
    self.steps= 500
    self.epochs= 10
    self.train_csv= '../input/vinbigdata-{}-image-dataset/vinbigdata/train.csv'.format(self.dim)
    self.test_csv= '../input/vinbigdata-{}-image-dataset/vinbigdata/test.csv'.format(self.dim)
    self.img_dir= '../input/vinbigdata-{}-image-dataset/vinbigdata/train/'.format(self.dim)
    
    self.git= 'https://github.com/fizyr/keras-retinanet.git'
    self.model_url= ['https://github.com/fizyr/keras-retinanet/releases/download/0.5.1/resnet50_coco_best_v2.1.0.h5',
               'https://github.com/fizyr/keras-retinanet/releases/download/0.5.1/resnet101_oid_v1.0.0.h5',
               'https://github.com/fizyr/keras-retinanet/releases/download/0.5.1/resnet152_oid_v1.0.0.h5']
      
    self.color_code=   {'Cardiomegaly':(124,252,0), 'Aortic enlargement':(135,206,250),
                        'Pleural thickening':(199,21,133),'ILD':(245,245,220), 'Nodule/Mass':(220,20,60),
                        'Pulmonary fibrosis':(0,255,255), 'Lung Opacity':(128,128,0), 'Atelectasis':(255,0,255),
                        'Other lesion':(176,224,230), 'Infiltration':(210,105,30),'Pleural effusion':(105,105,105),
                        'Calcification':(138,43,226) ,'Consolidation':(250,240,230),'Pneumothorax':(100,149,237)}
    
cfg= Cfg()

# Loading data

In [None]:
# Setup the paths to train and test images
TEST_DIR = "../input/vinbigdata-chest-xray-abnormalities-detection/test/"
TRAIN_DIR = "../input/vinbigdata-chest-xray-abnormalities-detection/train/"
dataset_dir = "../input/vinbigdata-chest-xray-abnormalities-detection/"

In [None]:
# Glob the directories and get the lists of train and test images
train_fns = glob(TRAIN_DIR + '*')
test_fns = glob(TEST_DIR + '*')

# Loading training data and test data
train_df = pd.read_csv(dataset_dir+'train.csv')
sample = pd.read_csv(dataset_dir+'sample_submission.csv')

In [None]:
# Images with no abnormal findings will be omitted.
abnormal_train = train_df[train_df['class_name']!="No finding"]

In [None]:
from tqdm import tqdm

# Read only the DICOM headers (no pixel data) to collect image size and patient sex
rows, columns, sex = [], [], []
ids = abnormal_train['image_id'].unique()
for i in tqdm(ids):
    path = dataset_dir + 'train/' + i + '.dicom'
    ds = dcm.dcmread(path, stop_before_pixels=True)
    rows.append(ds.Rows)
    columns.append(ds.Columns)
    sex.append(ds.PatientSex)

In [None]:
info = pd.DataFrame({'image_id':ids, 'rows':rows, 'columns':columns, 'sex':sex})

In [None]:
dicom_meta = pd.read_csv('../input/eda-dicom-reading-vinbigdata-chest-x-ray/train_dicom_properties.csv.bz2').rename(columns={'file':'image_id'})

In [None]:
df= pd.read_csv(cfg.train_csv)
print(df.shape)
df= df[df.class_name != 'No finding']

# Useful Functions

In [None]:
def get_feature_distribution(data, feature):
    # Get the count for each label
    label_counts = data[feature].value_counts()

    # Get total number of samples
    total_samples = len(data)

    # Print the count and the percentage (truncated to two decimals) for each label
    print("Feature: {}".format(feature))
    for label, count in label_counts.items():
        percent = int((count / total_samples) * 10000) / 100
        print("{:<30s}: {} ({}%)".format(str(label), count, percent))

In [None]:
def show_dicom_images(data):
    # Show up to nine DICOM images (given as a list of file paths) in a 3x3 grid
    f, ax = plt.subplots(3,3, figsize=(6,8))
    for i, image_path in enumerate(data[:9]):
        data_row_img = dcm.dcmread(image_path)
        ax[i//3, i%3].imshow(data_row_img.pixel_array, cmap=plt.cm.bone) 
        ax[i//3, i%3].axis('off')
    plt.show()

In [None]:
def load_img(path):
    img= cv2.imread(path)
    img= cv2.resize(img, (cfg.dim, cfg.dim))
    return img

def normalize_cod(df):
    df.x_min= (df.x_min/ df.width)* cfg.dim
    df.x_max= (df.x_max/ df.width)* cfg.dim
    
    df.y_min= (df.y_min/ df.height)* cfg.dim
    df.y_max= (df.y_max/ df.height)* cfg.dim
    return df

df= normalize_cod(df.copy())
df= df.reset_index(drop = True)

In [None]:
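def read_xray(path, voi_lut = True, fix_monochrome = True):
    dicom = dcm.dcmread(path)

    # Apply the VOI LUT from the DICOM header (if requested) to get display-ready values
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array

    # MONOCHROME1 images are stored inverted, so flip the intensities
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data

    # Rescale to the 0-255 range and convert to 8-bit
    data = data - np.min(data)
    data = data / np.max(data)
    return (data * 255).astype(np.uint8)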

In [None]:
def plot(name):
    train_cls = train_df[train_df['class_name'] == name]
    fig, axes = plt.subplots(4,4, figsize=(10, 10))
    fig.suptitle(name+" examples", fontsize=10)
    for i in range(4):
        for j in range(4):
            # randrange excludes the upper bound, so the index is always valid
            row = train_cls.iloc[random.randrange(len(train_cls))]
            path = dataset_dir + 'train/' + row['image_id'] + '.dicom'
            axes[i][j].imshow(read_xray(path), cmap='gray')
            axes[i][j].add_patch(patches.Rectangle(
                (row['x_min'], row['y_min']), 
                row['x_max'] - row['x_min'], 
                row['y_max'] - row['y_min'], 
                edgecolor='blue', 
                fill=False)
            )
    plt.show()

In [None]:
def plot_width_of__bounding_boxes(data):
    
    fig, axes = plt.subplots(7, 2, figsize=(16,20), sharex=True)
    fig.suptitle("width of bounding box for different categories", fontsize=16)
    
    classes = data.class_name.unique()
    for j, i in enumerate(classes[~np.isin(classes, 'No finding')]):
        data_ = data[data['class_name']==i]
        sns.distplot(data_['x_max'] - data_['x_min'], ax=axes[j%7, j//7]);
        axes[j%7, j//7].title.set_text(i);
    plt.show()

In [None]:
classes= df.class_name.unique()
ind= df.class_id.unique()

In [None]:
##### Color-Code ########
color= cfg.color_code

def show_bb(i):
    df_mini= df[df.image_id==df.image_id[i]]
    path= cfg.img_dir + df.image_id[i] + '.png'
    img= load_img(path)
    rep_class=[]
    font = cv2.FONT_HERSHEY_SIMPLEX 
    for _, row in df_mini.iterrows():
        class_n= row['class_name']
        if class_n in rep_class:
            continue                          # Draw only one box per class
        rep_class.append(class_n)
        x_min= int(row['x_min']); x_max= int(row['x_max'])
        y_min= int(row['y_min']); y_max= int(row['y_max'])
        img= cv2.rectangle(img, (x_min, y_min), (x_max, y_max), color[class_n], 2)
        fontScale= (x_max- x_min)*2.5/img.shape[1]
        # putText takes the color before thickness and line type
        img= cv2.putText(img, class_n, (x_min, y_min-5), font, fontScale, color[class_n], 2, cv2.LINE_AA)
    return img

# Check statistics

In [None]:
train_df.info()

In [None]:
train_df.nunique().to_frame().rename(columns={0:"Unique Values"}).style.background_gradient(cmap="plasma")

In [None]:
train_df[['class_id', 'class_name', 'rad_id']].groupby(['class_id', 'class_name']).count().rename(columns={'rad_id': 'Number of records'})

In [None]:
df.head()

In [None]:
# Name the columns explicitly so the code does not depend on the pandas version
train_rad = train_df['rad_id'].value_counts().rename_axis('rad_id').reset_index(name='n_observations')
fig = go.Figure(data=[go.Table(header=dict(values=['Radiologist ID', 'Number of Observations'], fill_color='yellow'),
                 cells=dict(values=[train_rad['rad_id'], train_rad['n_observations']], fill_color='lavender'))
                     ])
fig.show()

In [None]:
def plot_distribution_classes(x_values, y_values, title):
    
    #colors = ['rgb(26, 118, 255)',] * 15
    #colors[0] = 'lightslategray'

    fig = go.Figure(data=[go.Bar(
        x=x_values, 
        y=y_values,
        text=y_values
        #marker_color=colors
    )])

    fig.update_layout(height=400, width=700, title_text=title)
    fig.update_xaxes(type="category")

    fig.show()

In [None]:
# value_counts() is sorted in descending order and keeps each ID paired with its own count
counts = train_df.rad_id.value_counts()

x = counts.index.tolist()
y = counts.values.tolist()

plot_distribution_classes(x, y, 
                          title="Distribution of Annotations by Radiologist")

* Rad IDs from R1 to R17
* rad_id is the ID of the radiologist that made the observation

In [None]:
whatassigned = train_df[['rad_id', 'class_id', 'image_id']]\
    .groupby(['rad_id', 'class_id'])\
    .count()\
    .reset_index()\
    .pivot(index='rad_id', columns='class_id',values='image_id')\
    .add_prefix('class')\
    .fillna(0)\
    .astype(np.int64)
whatassigned['Percent with no finding'] = [f'{tmpvar}%' for tmpvar in np.round(100*whatassigned['class14'].values/whatassigned.sum(axis=1).values,2)]
whatassigned

In [None]:
plt.figure(figsize=(6, 6))
sns.countplot(x="class_id", data=train_df)
plt.title("Class ID Distribution")
plt.show()

In [None]:
plt.figure(figsize=(6, 6))
sns.countplot(x="rad_id", data=train_df)
plt.title("RAD ID Distribution")
plt.show()

In [None]:
fig, axes = plt.subplots(figsize = (20,10), nrows=2, ncols=2)
ax0, ax1, ax2, ax3 = axes.flatten()

ax0.hist(train_df.x_min, bins=55, color = "skyblue")
ax0.axvline(x=602, color='royalblue', linestyle='dashed', linewidth=2)
ax0.axvline(x=1457, color='royalblue', linestyle='dashed', linewidth=2)
ax0.axvline(x=1014, color='cornflowerblue', linewidth=2)
ax0.set_title('X minimum', fontsize=18)

ax1.hist(train_df.x_max, bins=55, color = "skyblue")
ax1.axvline(x=1010, color='royalblue', linestyle='dashed', linewidth=2)
ax1.axvline(x=1567, color='cornflowerblue', linewidth=2)
ax1.axvline(x=1947, color='royalblue', linestyle='dashed', linewidth=2)
ax1.set_title('X maximum', fontsize=18)

ax2.hist(train_df.y_min, bins=55)
ax2.axvline(x=627, color='orchid', linestyle='dashed', linewidth=2)
ax2.axvline(x=935, color='cornflowerblue', linewidth=2)
ax2.axvline(x=1471, color='orchid', linestyle='dashed', linewidth=2)
ax2.set_title('Y minimum', fontsize=18)

ax3.hist(train_df.y_max, bins=55)
ax3.axvline(x=1009, color='orchid', linestyle='dashed', linewidth=2)
ax3.axvline(x=1411, color='cornflowerblue', linewidth=2)
ax3.axvline(x=1911, color='orchid', linestyle='dashed', linewidth=2)
ax3.set_title('Y maximum', fontsize=18)
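
The vertical lines in the histograms above are drawn at hard-coded positions. Assuming they are meant to mark the quartiles of each coordinate (an assumption, since the notebook does not say how the numbers were obtained), they could be computed from the data instead. The sketch below uses only the standard library and toy values; `quartile_positions` is a hypothetical helper name.

```python
import math
import statistics


def quartile_positions(values):
    """Return the 25th, 50th and 75th percentiles, ignoring missing values (None or NaN)."""
    clean = [float(v) for v in values if v is not None and not math.isnan(v)]
    # method="inclusive" interpolates linearly between the observed data points
    return statistics.quantiles(clean, n=4, method="inclusive")


# Toy coordinates; rows for "No finding" have no box, hence the NaN
toy_x_min = [100, 200, 300, 400, 500, float("nan")]
print(quartile_positions(toy_x_min))
```

In the notebook itself, `train_df.x_min.quantile([0.25, 0.5, 0.75])` is the pandas way to get the same kind of summary, and each resulting value can be passed to `ax.axvline(x=...)`.
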

# Submission File
[The competition page](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/overview/evaluation) describes the submission format as follows:
Images in the test set may contain more than one object. For each object in a given test image, you must predict a class ID, confidence score, and bounding box in format xmin ymin xmax ymax. If you predict that there are NO objects in a given image, you should predict 14 1.0 0 0 1 1, where 14 is the class ID for "No finding", 1.0 is the confidence, and 0 0 1 1 is a one-pixel bounding box.

The submission file should contain a header row; `sample.head(3)` below shows the expected layout.
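
A minimal sketch of turning model output into one prediction string per image, following the rule quoted above. The helper name `format_prediction_string`, the three-decimal rounding, the integer pixel coordinates, and the idea that several detections are joined with single spaces are all assumptions for illustration; check them against the sample submission before relying on them.

```python
def format_prediction_string(detections):
    """Build a prediction string from (class_id, confidence, x_min, y_min, x_max, y_max) tuples."""
    if not detections:
        # Rule from the competition page: no objects means class 14,
        # confidence 1.0 and a one-pixel box
        return "14 1.0 0 0 1 1"

    parts = []
    for class_id, conf, x_min, y_min, x_max, y_max in detections:
        parts.append(
            f"{int(class_id)} {conf:.3f} {int(x_min)} {int(y_min)} {int(x_max)} {int(y_max)}"
        )
    # Assumption: all detections of one image are joined with spaces
    return " ".join(parts)


# Hypothetical detections for one image, then an image with no detections
print(format_prediction_string([(0, 0.9, 10, 20, 110, 220), (3, 0.55, 5, 5, 50, 60)]))
print(format_prediction_string([]))
```
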



In [None]:
# Confirmation of the format of samples for submission
sample.head(3)

* Check the structure of the training data: it has three ID columns (image, class, and radiologist), the disease class name, and the minimum and maximum x and y coordinates of each bounding box.

In [None]:
# Display some of the training data
train_df.head()

In [None]:
# Display of training data
print(train_df)

In [None]:
# Check for missing values in the training data
train_df.isnull().sum()

In [None]:
# Count the number of annotations per image ID (images with findings only).
abnormal_train.image_id.value_counts()

* Let's plot the number of annotations per disease class in the training data.

In [None]:
fig = plt.figure(figsize=(6,6))
sns.countplot(y ='class_name', data=abnormal_train);

* Next, let's check the same counts per disease class as numbers.

In [None]:
train_df[['class_id', 'class_name', 'rad_id']].groupby(['class_id', 'class_name']).count().rename(columns={'rad_id': 'Number of records'}).style.applymap(lambda x: 'background-color:lightsteelblue')

# No finding

In [None]:
plot("No finding")

# Aortic enlargement
* Aortic enlargement is known as a sign of an aortic aneurysm. This condition often occurs in the ascending aorta.

In [None]:
plot("Aortic enlargement")

In [None]:
target1 = abnormal_train[abnormal_train['class_id']==0]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target1['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target1['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target1['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target1['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Cardiomegaly

* Cardiomegaly can be caused by many conditions, including hypertension, coronary artery disease, infections, inherited disorders, and cardiomyopathies.
* Cardiomegaly is usually diagnosed when the cardiothoracic ratio, that is, the ratio of the heart's width to the width of the chest, is greater than 0.5 (50%). This diagnostic criterion may be an essential basis for this competition.
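
The 0.5 criterion can be written down directly. The sketch below is a toy illustration: the function names and pixel widths are hypothetical, and the dataset does not provide a chest-width measurement, so any widths used with real images would have to be estimated separately.

```python
def cardiothoracic_ratio(heart_width, chest_width):
    """Ratio of the heart's width to the chest's width (both in the same unit)."""
    if chest_width <= 0:
        raise ValueError("chest_width must be positive")
    return heart_width / chest_width


def suggests_cardiomegaly(ratio, threshold=0.5):
    """Apply the usual criterion: a ratio greater than 0.5 suggests cardiomegaly."""
    return ratio > threshold


# Hypothetical widths in pixels, measured on the same image
ratio = cardiothoracic_ratio(heart_width=1300, chest_width=2400)
print(round(ratio, 3), suggests_cardiomegaly(ratio))
```
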

In [None]:
plot("Cardiomegaly")

In [None]:
target2 = abnormal_train[abnormal_train['class_id']==3]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target2['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target2['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target2['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target2['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Pleural thickening

* The pleura is the membrane that covers the lungs, and the change in the thickness of the pleura is called pleural thickening.
* It is often seen in the uppermost part of the lung field (the apex of the lung).

In [None]:
plot("Pleural thickening")

In [None]:
target3 = abnormal_train[abnormal_train['class_id']==11]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target3['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target3['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target3['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target3['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Pulmonary fibrosis

* Pulmonary Fibrosis is inflammation of the lung interstitium due to various causes, resulting in thickening and hardening of the walls, fibrosis, and scarring.
* The fibrotic areas lose their air content, which often results in dense cord shadows or granular shadows.

In [None]:
plot("Pulmonary fibrosis")

In [None]:
target4 = abnormal_train[abnormal_train['class_id']==13]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target4['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target4['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target4['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target4['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Nodule/Mass

* Nodules and masses are seen primarily in lung cancer and in metastases from cancers elsewhere in the body, such as colon or kidney cancer. They also occur with tuberculosis, pulmonary mycosis, non-tuberculous mycobacterial infection, old (healed) pneumonia, and benign tumors.
* A nodule is a round opacity on a chest X-ray image, typically less than 3 cm in diameter (larger lesions are usually called masses), which results in much smaller than average bounding boxes.
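
To relate the 3 cm figure to bounding boxes, a box size in pixels can be converted to centimetres if the pixel spacing is known. The sketch below assumes a DICOM-style spacing pair of (row spacing, column spacing) in millimetres; whether the files in this dataset carry a usable spacing value is an assumption to verify, and the helper name `box_size_cm` and the numbers are hypothetical.

```python
def box_size_cm(x_min, y_min, x_max, y_max, pixel_spacing_mm):
    """Convert a box in pixel coordinates to (width_cm, height_cm).

    pixel_spacing_mm is (row_spacing, column_spacing) in millimetres:
    row spacing is the vertical distance between rows, column spacing the
    horizontal distance between columns.
    """
    row_spacing, col_spacing = pixel_spacing_mm
    width_cm = (x_max - x_min) * col_spacing / 10.0
    height_cm = (y_max - y_min) * row_spacing / 10.0
    return width_cm, height_cm


# Hypothetical box of 200 x 150 pixels with 0.14 mm spacing in both directions
width_cm, height_cm = box_size_cm(1000, 800, 1200, 950, (0.14, 0.14))
print(round(width_cm, 2), round(height_cm, 2))
```

Sizes measured on a projection radiograph are affected by magnification, so values obtained this way are only rough estimates.
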

In [None]:
plot("Nodule/Mass")

In [None]:
target5 = abnormal_train[abnormal_train['class_id']==8]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target5['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target5['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target5['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target5['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Lung Opacity

* Lung opacity can often be identified as any area in the chest radiograph that is more white than it should be.
* Please see [this kaggle discussion](https://www.kaggle.com/zahaviguy/what-are-lung-opacities) for more information.

In [None]:
plot("Lung Opacity")

In [None]:
target6 = abnormal_train[abnormal_train['class_id']==7]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target6['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target6['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target6['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target6['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Pleural effusion

* Pleural effusion is the accumulation of fluid outside the lungs in the chest cavity.
* The outside of the lungs is covered by a thin membrane consisting of two layers known as the pleura. Fluid accumulation between these two layers (chest-wall/parietal-pleura and the lung-tissue/visceral-pleura) is called pleural effusion.
* The findings of pleural effusion vary widely, depending in particular on whether the radiograph is taken in the upright or supine position.
* The most common presentations of pleural effusion are elevation of the diaphragm on one side, flattening of the diaphragm, or blunting of the angle between rib and diaphragm (typically to more than 30 degrees).

In [None]:
plot("Pleural effusion")

In [None]:
target7 = abnormal_train[abnormal_train['class_id']==10]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target7['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target7['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target7['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target7['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Other lesion

* "Other lesion" covers all abnormalities that do not fall into any other category, for example bone lesions, fractures, and subcutaneous emphysema.

In [None]:
plot("Other lesion")

In [None]:
target8 = abnormal_train[abnormal_train['class_id']==9]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target8['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target8['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target8['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target8['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Infiltration

* The infiltration of some fluid component into the alveoli causes an infiltrative shadow (Infiltration).
* It is difficult to distinguish from consolidation and, in some cases, impossible to distinguish. Please see [this link](https://allnurses.com/consolidation-vs-infiltrate-vs-opacity-t483538/) for more information.

In [None]:
plot("Infiltration")

In [None]:
target9 = abnormal_train[abnormal_train['class_id']==6]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target9['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target9['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target9['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target9['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# ILD  

* ILD stands for "Interstitial Lung Disease".
* Interstitial Lung Disease is a general term for many conditions in which the interstitial space is injured.

In [None]:
plot("ILD")

In [None]:
target10 = abnormal_train[abnormal_train['class_id']==5]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target10['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target10['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target10['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target10['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Calcification

In [None]:
plot("Calcification")

* Many diseases or conditions can cause calcification on a chest X-ray.
* Calcium may be deposited in areas where previous inflammation of the lungs or pleura has healed.
* Calcification may also occur in the aorta (as with atherosclerosis) or in mediastinal lymph nodes (as with previous infection, tuberculosis, or histoplasmosis).


In [None]:
target11 = abnormal_train[abnormal_train['class_id']==2]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target11['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target11['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target11['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target11['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Consolidation

In [None]:
plot("Consolidation")

* Consolidation is formally referred to as air-space consolidation. It appears as reduced radiolucency of the lung, caused by fluid, cells, or tissue replacing the air in the alveoli.

In [None]:
target12 = abnormal_train[abnormal_train['class_id']==4]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target12['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target12['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target12['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target12['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Atelectasis

In [None]:
plot("Atelectasis")

* Atelectasis is a condition in which part or all of a lung contains no air and has collapsed. A common cause of atelectasis is obstruction of the bronchi.

In [None]:
target13 = abnormal_train[abnormal_train['class_id']==1]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target13['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target13['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target13['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target13['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

# Pneumothorax

In [None]:
plot("Pneumothorax")

In [None]:
target14 = abnormal_train[abnormal_train['class_id']==12]
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target14['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target14['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target14['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target14['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

In [None]:
from tqdm import tqdm
import time

# Set the total value 
bar = tqdm(total = 1000)
# Add description
bar.set_description('Progress rate')
for i in range(40):
    # Advance the bar by 25 per step (40 steps * 25 = 1000, the total)
    bar.update(25)
    time.sleep(1)
bar.close()

# Data Visualization

In [None]:
target1 = abnormal_train[abnormal_train['class_name']=='Aortic enlargement']
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(target1['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(target1['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(target1['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(target1['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

* Let's look at the distribution of Rad IDs.
* R10, R9 and R8 stand out with far more annotations than the other radiologists.

* Now let's look at the gender distribution

In [None]:
fig = plt.figure(figsize=(6,6))
# `info` was built from abnormal_train, so it holds one row per image with at least one finding
sns.countplot(x='sex', data=info)
plt.title("Sex distribution per image (images with findings)")
plt.show()

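* Let's also look at the sex distribution per annotation, excluding images with no findings. There seems to be no particular difference.

In [None]:
fig = plt.figure(figsize=(6,6))
# Attach the sex of each image to its annotations so that every annotation is counted
sns.countplot(x='sex', data=abnormal_train.merge(info[['image_id', 'sex']], on='image_id'))
plt.title("Sex distribution per annotation, excluding those with no findings")
plt.show()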

In [None]:
plt.figure(figsize=(6, 6))
sns.countplot(x="rad_id", data=train_df)
plt.title("RAD ID Distribution including those with no findings")
plt.show()

In [None]:
fig = plt.figure(figsize=(6,6))
sns.countplot(x='rad_id', data=abnormal_train)
plt.title("RAD ID Distribution excluding those with no findings")
plt.show()

# Shape Analysis

In [None]:
fig = plt.figure(figsize=(6,6))
ax = sns.scatterplot(x='rows', y='columns', data=info, alpha=0.3)
plt.title("rows (x) vs. columns (y) scatter plot")
plt.show()

In [None]:
fig = plt.figure(figsize=(6,6))
ax = sns.scatterplot(x='x_min', y='y_min', data=abnormal_train, alpha=0.3)
plt.title("min coordinate scatter plot")
plt.show()

In [None]:
fig = plt.figure(figsize=(6,6))
ax = sns.scatterplot(x='x_max', y='y_max', data=abnormal_train, alpha=0.3)
plt.title("max coordinate scatter plot")
plt.show()

In [None]:
sns.set_style('whitegrid')
plt.figure()
fig, ax = plt.subplots(2,2,figsize=(6,6))
sns.distplot(abnormal_train['x_max'],kde=True,bins=50, color="red", ax=ax[0,0])
sns.distplot(abnormal_train['y_max'],kde=True,bins=50, color="blue", ax=ax[0,1])
sns.distplot(abnormal_train['x_min'],kde=True,bins=50, color="green", ax=ax[1,0])
sns.distplot(abnormal_train['y_min'],kde=True,bins=50, color="magenta", ax=ax[1,1])
locs, labels = plt.xticks()
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

In [None]:
plt.figure(figsize = (6, 6))
x = info["rows"]
y = info["columns"]
sns.distplot(x * y, kde = True, color = "brown")
plt.xlabel("pixel count", fontsize = 16)
plt.title("Pixel Count Analysis", fontsize = 18)
plt.grid(True)
plt.axis("on")

# Features based on where bounding boxes for a class tend to be

In [None]:
# Create a list of classes and a dictionary
classes = train_df[['class_id', 'class_name', 'rad_id']]\
    .groupby(['class_id', 'class_name'])\
    .count()\
    .rename(columns={'rad_id': 'Number of records'})\
    .reset_index()

for index, row in classes.iterrows():
    if index==0:
        label_dict = {row['class_id']: row['class_name']}
    else:
        label_dict.update({row['class_id']: row['class_name']})

train_df = pd.merge(train_df, dicom_meta, on='image_id', how='left')
train_df['x_max'] = train_df['x_max']/train_df['Columns']
train_df['x_min'] = train_df['x_min']/train_df['Columns']
train_df['y_max'] = train_df['y_max']/train_df['Rows']
train_df['y_min'] = train_df['y_min']/train_df['Rows']
train_df['width'] = (train_df['x_max']-train_df['x_min'])
train_df['height'] = (train_df['y_max']-train_df['y_min'])
train_df['area'] = train_df['height']*train_df['width']
train_df['x_center'] = (train_df['x_max']+train_df['x_min'])/2
train_df['y_center'] = (train_df['y_max']+train_df['y_min'])/2
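To make the normalization above concrete, here is a small sketch on two made-up boxes. The numbers are invented for illustration. Dividing x by `Columns` and y by `Rows` puts every box into the 0 to 1 range regardless of image resolution.

```python
import pandas as pd

# Two hypothetical boxes on images of different sizes (pixel coordinates).
boxes = pd.DataFrame({
    "x_min": [100.0, 500.0], "y_min": [200.0, 0.0],
    "x_max": [300.0, 1000.0], "y_max": [600.0, 500.0],
    "Columns": [1000, 2000],  # image width in pixels
    "Rows": [2000, 1000],     # image height in pixels
})

# Normalize x by the width and y by the height.
for col in ["x_min", "x_max"]:
    boxes[col] = boxes[col] / boxes["Columns"]
for col in ["y_min", "y_max"]:
    boxes[col] = boxes[col] / boxes["Rows"]

boxes["width"] = boxes["x_max"] - boxes["x_min"]
boxes["height"] = boxes["y_max"] - boxes["y_min"]
boxes["area"] = boxes["width"] * boxes["height"]
boxes["x_center"] = (boxes["x_max"] + boxes["x_min"]) / 2
boxes["y_center"] = (boxes["y_max"] + boxes["y_min"]) / 2
print(boxes[["width", "height", "area", "x_center", "y_center"]])
```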

In [None]:
# In this cell, we create a feature for how many other classes a radiologist has already assigned 
# for the same image.

tmp1 = pd.merge(train_df, train_df, on=['image_id', 'rad_id'], how='left').fillna(0)
#tmp1 = tmp1[ (tmp1['class_id_x']!=tmp1['class_id_y']) | (tmp1['x_min_x']!=tmp1['x_min_y']) | (tmp1['x_max_x']!=tmp1['x_max_y']) | (tmp1['y_min_x']!=tmp1['y_min_y']) | (tmp1['y_max_x']!=tmp1['y_max_y'])]


tmp_cols = ['image_id', 'rad_id', 'class_id_x', 'x_min_x', 'y_min_x', 'x_max_x', 'y_max_x']
tmp1 = tmp1[ tmp_cols + ['class_id_y', 'x_max_y']]\
    .groupby(tmp_cols+['class_id_y'])\
    .count()\
    .reset_index()\
    .pivot(index=tmp_cols,
           columns='class_id_y', values='x_max_y')\
    .add_prefix('other_')\
    .reset_index()\
    .rename(columns={'class_id_x': 'class_id',
                     'x_min_x': 'x_min',
                     'y_min_x': 'y_min',
                     'x_max_x': 'x_max',
                     'y_max_x': 'y_max'})


train_df = pd.merge(train_df, tmp1, 
                 on=['image_id', 'rad_id', 'class_id', 'x_min', 'y_min', 'x_max', 'y_max'],
                 how='left')
# np.int was removed in NumPy 1.24; the builtin int gives the same result
train_df[['other_'+str(i) for i in range(15)]] = train_df[['other_'+str(i) for i in range(15)]]\
    .fillna(0)\
    .astype(int)

# Finally, we subtract the extra count of +1 for each label itself (when we want to predict it, we
# do not want to have a feature that leaks the label, which it otherwise would).
# Vectorized per class: avoids a slow row loop and writing through .values
for class_id in range(14):
    train_df.loc[train_df['class_id'] == class_id, 'other_' + str(class_id)] -= 1
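The idea behind the `other_*` features is easier to see on a tiny example. This sketch uses a made-up annotation table and a simpler groupby/unstack route than the cell above, so treat it as an illustration of the idea, not a drop-in replacement.

```python
import pandas as pd

# Hypothetical annotations: radiologist R1 marks image "a" three times.
ann = pd.DataFrame({
    "image_id": ["a", "a", "a", "b"],
    "rad_id":   ["R1", "R1", "R1", "R2"],
    "class_id": [0, 0, 3, 11],
})

# Count annotations per (image, radiologist, class), one column per class.
counts = (ann.groupby(["image_id", "rad_id", "class_id"]).size()
             .unstack("class_id", fill_value=0)
             .add_prefix("other_")
             .reset_index())

feat = ann.merge(counts, on=["image_id", "rad_id"], how="left")

# Remove each row's own annotation so the feature does not leak its label.
for c in ann["class_id"].unique():
    feat.loc[feat["class_id"] == c, f"other_{c}"] -= 1

print(feat)
```

For the first row (class 0 on image "a"), `other_0` is 1 because the same radiologist drew one more class 0 box, and `other_3` is 1 for the Cardiomegaly box.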
    

In [None]:
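from tqdm import tqdm  # make sure tqdm is available in this cell

# Per class, count how often each cell of a 1000x1000 normalized grid is covered by a box
locations = np.zeros((14, 1000, 1000))
for index, row in tqdm(train_df.iterrows(), total=train_df.shape[0]):
    if row['class_id'] < 14:
        # np.int was removed in NumPy 1.24; use the builtin int instead
        y0, y1 = int(round(row['y_min'] * 1000)), int(round(row['y_max'] * 1000))
        x0, x1 = int(round(row['x_min'] * 1000)), int(round(row['x_max'] * 1000))
        locations[int(row['class_id']), y0:y1, x0:x1] += 1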
        
classcounts = train_df[['image_id', 'rad_id', 'class_id','class_name']]\
    .groupby(['image_id', 'rad_id', 'class_id'])\
    .count()\
    .reset_index()\
    .pivot(index=['image_id', 'rad_id'], columns='class_id', values='class_name')\
    .rename(columns={i:'n_class'+str(i) for i in range(15)})\
    .fillna(0)
   
classareas = train_df[['image_id', 'rad_id', 'class_id','area']]\
    .groupby(['image_id', 'rad_id', 'class_id'])\
    .sum()\
    .reset_index()\
    .pivot(index=['image_id', 'rad_id'], columns='class_id', values='area')\
    .rename(columns={i:'area_class'+str(i) for i in range(15)})\
    .fillna(0)

train_df = pd.merge( pd.merge( train_df, classcounts, on=['image_id', 'rad_id'], how='left'), 
                  classareas, on=['image_id', 'rad_id'], how='left')
train_df = train_df[train_df['class_id']!=14]

classes = train_df[['class_id', 'class_name', 'rad_id']]\
    .groupby(['class_id', 'class_name'])\
    .count()\
    .rename(columns={'rad_id': 'Number of records'})\
    .reset_index()
    
for index, row in classes.iterrows():
    if index==0:
        label_dict = {row['class_id']: row['class_name']}
    else:
        label_dict.update({row['class_id']: row['class_name']})
        
f, axs = plt.subplots(5, 3, sharey=True, sharex=True, figsize=(16, 28))

for class_id in range(14):
    ax = axs[class_id // 3, class_id % 3]
    ax.imshow(locations[class_id], cmap='inferno', interpolation='nearest')
    ax.set_title(str(class_id) + ': ' + label_dict[class_id])

axs[4, 2].axis('off')  # 14 classes on a 5x3 grid leave the last panel unused
plt.show()
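The heatmaps above are built by adding 1 to every grid cell that a normalized box covers. Here is a minimal sketch of that accumulation on a 10x10 grid with two made-up boxes. The helper name `accumulate_boxes` is hypothetical and not part of the notebook.

```python
import numpy as np

def accumulate_boxes(boxes, grid=10):
    """Count, for every grid cell, how many normalized boxes cover it."""
    heat = np.zeros((grid, grid), dtype=int)
    for x_min, y_min, x_max, y_max in boxes:
        r0, r1 = int(round(y_min * grid)), int(round(y_max * grid))
        c0, c1 = int(round(x_min * grid)), int(round(x_max * grid))
        heat[r0:r1, c0:c1] += 1  # rows follow y, columns follow x
    return heat

# Two overlapping boxes given as (x_min, y_min, x_max, y_max) in the 0..1 range.
heat = accumulate_boxes([(0.0, 0.0, 0.5, 0.5), (0.2, 0.2, 0.6, 0.6)])
print(heat)
```

Cells covered by both boxes end up with a count of 2, which is what shows up as the bright regions in the per-class plots.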

# Trends in bounding boxes differences

In [None]:
plot_width_of__bounding_boxes(abnormal_train)

In [None]:
classes = train_df[['class_id', 'class_name', 'rad_id']].groupby(['class_id', 'class_name']).count().rename(columns={'rad_id': 'Number of records'}).reset_index()

for index, row in classes.iterrows():
    if index==0:
        label_dict = {row['class_id']: row['class_name']}
    else:
        label_dict.update({row['class_id']: row['class_name']})

In [None]:
cols = ['#e41a1c', '#377eb8','#4daf4a','#984ea3','#ff7f00','#ffff33','#a65628','#f781bf','#999999', '#000000', '#1b9e77', '#d95f02', '#7570b3', '#e7298a']

In [None]:
train_df['size'] = (train_df['x_max']-train_df['x_min'])*(train_df['y_max']-train_df['y_min'])
sizes = train_df.loc[train_df['class_id']<14, ['class_id', 'size']].groupby('class_id').mean().reset_index()

plt.figure(figsize=(6, 6));
plt.bar(sizes['class_id'], sizes['size'], 
        tick_label=[str(i) + ': ' + label_dict[i] for i in range(14)],
        color=cols);
plt.xticks(rotation='vertical');

In [None]:
# Average number of boxes of each class per (image, radiologist) pair
numbers = (train_df.loc[train_df['class_id']<14, ['image_id', 'rad_id', 'class_id', 'size']]
           .groupby(['image_id', 'rad_id', 'class_id']).count().reset_index()
           .groupby('class_id')['size'].mean().reset_index())

plt.figure(figsize=(6, 6));
plt.bar(numbers['class_id'], numbers['size'], tick_label=[str(i) + ': ' + label_dict[i] for i in range(14)], color=cols);
plt.xticks(rotation='vertical');

In [None]:
tmpdf = train_df[['class_id', 'image_id', 'rad_id']].groupby(['class_id', 'image_id']).count().reset_index()
tmpdf['rad_id'] = np.minimum(tmpdf['rad_id'].values, 1)
corr  = tmpdf.pivot(index='image_id', columns='class_id', values='rad_id').fillna(0).reset_index(drop=True).corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient(cmap='coolwarm', vmin=-1.0, vmax=1.0).format(precision=2)

* Let's see how the presence of each class correlates with the others.
* The following pairs were found to be strongly correlated:
* '0: Aortic enlargement' and '3: Cardiomegaly'.
* '10: Pleural effusion' and '11: Pleural thickening'.
* '11: Pleural thickening' and '13: Pulmonary fibrosis'.
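The correlation table is computed on a binary "class present in image" matrix. Here is a small sketch of that idea on made-up annotations. It uses `pd.crosstab` instead of the groupby/pivot in the cell above, and the data is invented for illustration.

```python
import pandas as pd

# Hypothetical annotations: classes 0 and 3 always appear together, class 11 never with them.
ann = pd.DataFrame({
    "image_id": ["a", "a", "a", "b", "b", "c", "d"],
    "class_id": [0, 0, 3, 0, 3, 11, 11],
})

# 1 if the class was annotated at least once in the image, else 0.
presence = (pd.crosstab(ann["image_id"], ann["class_id"]) > 0).astype(int)

# Pearson correlation between the presence indicators of each pair of classes.
corr_toy = presence.corr()
print(corr_toy)
```

A value near 1 means the two findings tend to appear in the same images, and a value near -1 means they rarely do.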

The 15 class IDs (14 abnormalities plus "No finding") correspond to:

* 0 - Aortic enlargement
* 1 - Atelectasis
* 2 - Calcification
* 3 - Cardiomegaly
* 4 - Consolidation
* 5 - ILD
* 6 - Infiltration
* 7 - Lung Opacity
* 8 - Nodule/Mass
* 9 - Other lesion
* 10 - Pleural effusion
* 11 - Pleural thickening
* 12 - Pneumothorax
* 13 - Pulmonary fibrosis
* 14 - No finding

* View the distribution by object's bounding box and class ID

# Plot bounding box

In [None]:
def dicom2array(path, voi_lut=True, fix_monochrome=True):
    dicom = pydicom.dcmread(path)  # read_file is deprecated in favor of dcmread
    # VOI LUT (if available by DICOM device) is used to
    # transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
    # depending on this value, X-ray may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME1":
        data = np.amax(data) - data
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data
        
    
def plot_img(img, size=(7, 7), is_rgb=True, title="", cmap='gray'):
    plt.figure(figsize=size)
    plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()
    

def plot_imgs(imgs, cols=4, size=7, is_rgb=True, title="", cmap='gray', img_size=(500,500)):
    rows = (len(imgs) + cols - 1) // cols  # ceiling division, no empty extra row
    fig = plt.figure(figsize=(cols*size, rows*size))
    for i, img in enumerate(imgs):
        if img_size is not None:
            img = cv2.resize(img, img_size)
        fig.add_subplot(rows, cols, i+1)
        plt.imshow(img, cmap=cmap)
    plt.suptitle(title)
    plt.show()
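The rescaling at the end of `dicom2array` can be tried without a DICOM file. This sketch applies the same MONOCHROME1 inversion and min-max scaling to a made-up 12-bit array. The helper `to_uint8` and the guard against constant images are my own additions, not part of the original function.

```python
import numpy as np

def to_uint8(pixels, invert=False):
    """Min-max scale raw pixel values to 0..255, optionally inverting first."""
    data = pixels.astype(np.float64)
    if invert:  # MONOCHROME1 stores bright tissue as low values
        data = data.max() - data
    data = data - data.min()
    peak = data.max()
    if peak > 0:  # assumption: leave a constant image as all zeros
        data = data / peak
    return (data * 255).astype(np.uint8)

# Hypothetical 12-bit pixel values.
raw = np.array([[0, 2048], [4095, 1024]], dtype=np.uint16)
normal = to_uint8(raw)
inverted = to_uint8(raw, invert=True)
print(normal)
print(inverted)
```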

In [None]:
imgs = []
img_ids = abnormal_train['image_id'].values
class_ids = abnormal_train['class_id'].unique()

# map label_id to specify color
label2color = {class_id: [random.randint(0, 255) for i in range(3)] for class_id in class_ids}
thickness = 3
scale = 5



for i in range(8):
    img_id = random.choice(img_ids)
    img_path = f'{dataset_dir}/train/{img_id}.dicom'
    img = dicom2array(path=img_path)
    img = cv2.resize(img, None, fx=1/scale, fy=1/scale)
    img = np.stack([img, img, img], axis=-1)
    
    boxes = abnormal_train.loc[abnormal_train['image_id'] == img_id, ['x_min', 'y_min', 'x_max', 'y_max']].values/scale
    # Select the column as a Series so a single box still yields a 1-D array for zip()
    labels = abnormal_train.loc[abnormal_train['image_id'] == img_id, 'class_id'].values
    
    for label_id, box in zip(labels, boxes):
        color = label2color[label_id]
        img = cv2.rectangle(
            img,
            (int(box[0]), int(box[1])),
            (int(box[2]), int(box[3])),
            color, thickness
        )
    img = cv2.resize(img, (500,500))
    imgs.append(img)
    
plot_imgs(imgs, cmap=None)
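In the cell above the boxes are drawn before the final resize, so they are stretched together with the image. If you ever need the box coordinates themselves at the new size, for example to feed a detector, they have to be scaled by the same factors. A minimal sketch, where `scale_boxes` is a hypothetical helper and the sizes are made up:

```python
import numpy as np

def scale_boxes(boxes, src_size, dst_size):
    """Rescale (x_min, y_min, x_max, y_max) boxes from src (w, h) to dst (w, h)."""
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    # x coordinates scale with the width ratio, y coordinates with the height ratio.
    factors = np.array([dst_w / src_w, dst_h / src_h] * 2)
    return np.asarray(boxes, dtype=float) * factors

# One box on a hypothetical 2000x3000 (width x height) image, resized to 500x500.
scaled = scale_boxes([[1000, 500, 2000, 1500]], src_size=(2000, 3000), dst_size=(500, 500))
print(scaled)
```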

# Modeling

In [None]:
# training dataset
features = ['image_id' ,'class_id', 'rad_id', 'x_min', 'y_min', 'x_max', 'y_max']
train_ftr = train_df[features]
train_ftr.head()

In [None]:
from tqdm import tqdm
import time

# Progress-bar placeholder: 40 updates of 25 steps add up to the total of 1000
bar = tqdm(total=1000)
# Add description
bar.set_description('Progress rate')
for i in range(40):
    # Advance the progress
    bar.update(25)
    time.sleep(1)
bar.close()

# Submission

In [None]:
# predictions.to_csv('submission.csv',index= False)

# Acknowledgements
* [EDA - VinBigData Chest X-ray Abnormalities](https://www.kaggle.com/trungthanhnguyen0502/eda-vinbigdata-chest-x-ray-abnormalities)
* [VinBigData: EDA All You need to know](https://www.kaggle.com/dhananjay3/vinbigdata-eda-all-you-need-to-know)
* [Chest_X-ray: Knowledges for the 14 abnormalities](https://www.kaggle.com/sakuraandblackcat/chest-x-ray-knowledges-for-the-14-abnormalities)
* [Convert dicom to np.array - the correct way](https://www.kaggle.com/raddar/convert-dicom-to-np-array-the-correct-way)
* [VinBigData Chest X-ray Abnormalities Detection](https://www.kaggle.com/hamditarek/vinbigdata-chest-x-ray-abnormalities-detection)
* [EDA & .dicom reading: VinBigData Chest X-ray](https://www.kaggle.com/bjoernholzhauer/eda-dicom-reading-vinbigdata-chest-x-ray)
* [Chest_X-ray_Starter](https://www.kaggle.com/drcapa/chest-x-ray-starter)
* [VinBigData Chest X-ray EDA with Plotly](https://www.kaggle.com/debarshichanda/vinbigdata-chest-x-ray-eda-with-plotly)
* [VinBigData Retinanet-Detection [Training] ](https://www.kaggle.com/akhileshdkapse/vinbigdata-retinanet-detection-training/data)
* [VBD Chest X-ray Abnormalities Detection | EDA📊🔴](https://www.kaggle.com/mrutyunjaybiswal/vbd-chest-x-ray-abnormalities-detection-eda)
* [Finding data issues and mislabeled bounding boxes](https://www.kaggle.com/bjoernholzhauer/finding-data-issues-and-mislabeled-bounding-boxes)
* [Chest X-ray Abnormalities Doctor-EDA](https://www.kaggle.com/anantgupt/chest-x-ray-abnormalities-doctor-eda)
* [All you need to know about DICOM](https://www.kaggle.com/asimzahid/all-you-need-to-know-about-dicom)
* [EDA_train_csv](https://www.kaggle.com/soudainchat/eda-train-csv)

# Work in progress…

# Your upvote is my motivation