# Table Of Contents
* [Intro](#section-one)
    - [Check File Size](#chapter-one)
    - [Check train file](#chapter-two)
    - [Definitions of Variables](#chapter-three)
    - [Data Distribution of Each Variable](#chapter-four)
        + [Quick Visualization](#ch04_sub_chapter-one)
    - [Background Knowledge](#chapter-five)
        + [Endotracheal Tube](#ch05_sub_chapter-one)
        + [Nasogastric Tube](#ch05_sub_chapter-two)
        + [Central venous catheter](#ch05_sub_chapter-three)
        + [Swan Ganz Catheter Present](#ch05_sub_chapter-four)
    - [Check train annotation file](#chapter-six)
    - [Visualization of X-ray Images](#chapter-seven)
    


<a id="section-one"></a>
# Intro
- Thanks to [RANZCR/resnext50_32x4d starter [training]](https://www.kaggle.com/yasufuminakama/ranzcr-resnext50-32x4d-starter-training)
    + Please visit here and upvote

In [None]:
import os

import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

<a id="chapter-one"></a>
## Check File Size
- Check the size of each dataset folder in this competition
    + train_tfrecords = 4.5GB
    + test_tfrecords = 0.5MB
    + train (image data) = 6.5GB
    + test (image data) = 0.8MB

In [None]:
import os

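def get_folder_size(file_directory):
    """Print the cumulative size of every folder under file_directory, in MB."""
    dir_sizes = {}
    # Walk bottom-up so each subfolder's total is known before its parent is processed
    for root, dirs, files in os.walk(file_directory, topdown=False):
        size = sum(os.path.getsize(os.path.join(root, name)) for name in files + dirs)
        size += sum(dir_sizes.get(os.path.join(root, name), 0) for name in dirs)
        dir_sizes[root] = size
        # Round to 2 decimals inside round(); the 2 was previously passed to format() by mistake
        print("{} is {} MB".format(root, round(size / 2**20, 2)))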
  
base_dir = '../input/ranzcr-clip-catheter-line-classification'
get_folder_size(base_dir)
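
The folder sizes listed above mix GB and MB, while the function prints everything in MB. A minimal sketch of a helper that picks the unit automatically is shown here; `format_size` is a hypothetical name that is not part of the original notebook, and it assumes binary (1024-based) units.

```python
def format_size(n_bytes):
    """Return a human-readable string for a byte count, using 1024-based units."""
    size = float(n_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    # Anything that is still 1024 GB or more is reported in TB
    return f"{size:.1f} TB"

print(format_size(512 * 2**20))   # a byte count equal to 512 MB
print(format_size(4.5 * 2**30))   # a byte count equal to 4.5 GB
```

Inside `get_folder_size`, the printed value could be replaced by `format_size(size)` if mixed units are preferred.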

<a id="chapter-two"></a>
## Check train file
- Let's describe the train data

In [None]:
train = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/train.csv', index_col = 0)
test = pd.read_csv('../input/ranzcr-clip-catheter-line-classification/sample_submission.csv', index_col = 0)
display(train.head())
display(test.head())

<a id="chapter-three"></a>
## Definitions of Variables 
- What's inside the data?
    + StudyInstanceUID - unique ID for each image
    + ETT - Abnormal - endotracheal tube placement abnormal
    + ETT - Borderline - endotracheal tube placement borderline abnormal
    + ETT - Normal - endotracheal tube placement normal
    + NGT - Abnormal - nasogastric tube placement abnormal
    + NGT - Borderline - nasogastric tube placement borderline abnormal
    + NGT - Incompletely Imaged - nasogastric tube placement inconclusive due to imaging
    + NGT - Normal - nasogastric tube placement normal
    + CVC - Abnormal - central venous catheter placement abnormal
    + CVC - Borderline - central venous catheter placement borderline abnormal
    + CVC - Normal - central venous catheter placement normal
    + Swan Ganz Catheter Present - a Swan-Ganz catheter is present in the image
    + PatientID - unique ID for each patient in the dataset


<a id="chapter-four"></a>
## Data Distribution of Each Variable
- Why do the two counts below differ?
    + Some patients need catheters and lines placed in more than one position, so a single image can carry several labels.
    + For example, look at PatientID bf4c6da3c.
- Note that the three groups (ETT, NGT, CVC) are counted separately.

In [None]:
print("Total Rows of Train Data is", len(train))
print("Total Count of Each Variable in Train Data is", train.iloc[:, :-1].sum().sum())

var_cal_tmp = train.iloc[:, :-1].sum()
print(var_cal_tmp)

In [None]:
train.iloc[1].to_frame().T
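
The gap between the row count and the label count comes from rows with more than one positive label. The sketch below shows this on a toy frame; the column subset, the IDs, and the values are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Toy stand-in for train.csv; all values here are invented for illustration
toy = pd.DataFrame(
    {
        "ETT - Normal": [1, 0, 0],
        "NGT - Normal": [1, 0, 1],
        "CVC - Normal": [1, 1, 0],
        "PatientID": ["p1", "p2", "p1"],
    },
    index=["img_a", "img_b", "img_c"],
)

label_cols = [c for c in toy.columns if c != "PatientID"]
# Number of positive labels carried by each image
labels_per_image = toy[label_cols].sum(axis=1)

print("Rows:", len(toy))
print("Total positive labels:", int(labels_per_image.sum()))
print(labels_per_image.value_counts().sort_index())
```

On the real data, the same `sum(axis=1)` over the label columns would show how many images carry two or more labels.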

<a id="ch04_sub_chapter-one"></a>
### Quick Visualization
- Overall, CVC labels outnumber those of the other groups.

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x = var_cal_tmp.values, y = var_cal_tmp.index, ax=ax)
ax.tick_params(axis="x", labelsize=14)
ax.tick_params(axis="y", labelsize=14)
ax.set_xlabel("Number of Images", fontsize=15)
ax.set_title("Distribution of Labels", fontsize=15)

- The number of unique patients is smaller than the number of rows.
- This means some patients were imaged more than once, and the number of images varies by patient.

In [None]:
print("Number of Unique Patients: ", train["PatientID"].unique().shape[0])
print("Number of Total Data: ", len(train["PatientID"]))

In [None]:
tmp = train['PatientID'].value_counts()
print(tmp)
fig, ax = plt.subplots(figsize=(24, 6))
sns.countplot(x = tmp.values, ax=ax)
ax.tick_params(axis="x", labelsize=10)
ax.tick_params(axis="y", labelsize=14)
ax.set_xlabel("Number of Images per Patient", fontsize=15)
ax.set_title("Distribution of Images per Patient", fontsize=15)

- Now let's look at the distribution of values in each variable.

In [None]:
target_cols = ['ETT - Abnormal', 'ETT - Borderline', 'ETT - Normal', 'NGT - Abnormal', 
               'NGT - Borderline', 'NGT - Incompletely Imaged', 'NGT - Normal', 'CVC - Abnormal',
               'CVC - Borderline', 'CVC - Normal', 'Swan Ganz Catheter Present']

fig, ax = plt.subplots(4, 3, figsize=(16, 10))
for i, col in enumerate(target_cols):
    print(i, col)
    # Map the flat index onto the 4 x 3 grid (row, column)
    r, c = divmod(i, 3)
    ax[r, c].hist(train[col].values)
    ax[r, c].set_title(f'target: {col}')

fig.tight_layout()
fig.subplots_adjust(top=0.95)

- How to interpret the graphs?
    + The CVC group has the most positive labels of the three groups.
    + Within each group, Normal is the most frequent label.
- The dataset is imbalanced, and since one image can carry several labels at once, this is a multi-label classification problem.
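
One way to put a number on the imbalance is the fraction of positive rows per label. This is a minimal sketch on a toy frame; the values are invented and only the column names mirror the dataset.

```python
import pandas as pd

# Toy label matrix; values are made up and do not come from train.csv
toy_labels = pd.DataFrame(
    {
        "CVC - Normal": [1, 1, 1, 0],
        "NGT - Normal": [0, 1, 0, 0],
        "ETT - Abnormal": [0, 0, 0, 0],
    }
)

# For 0/1 labels, the column mean is the share of positive rows
positive_rate = toy_labels.mean().sort_values(ascending=False)
print(positive_rate)
```

On the real data this would be `train[target_cols].mean()`, which lists the labels from most to least frequent after sorting.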


<a id = "chapter-five"></a>
## Background Knowledge
- Since my background is far from medicine, it is difficult to figure out what to classify in these images.
- So I looked for some videos to understand the procedures.
- Thanks to [RANZCR CLiP: Visualize and Understand Dataset](https://www.kaggle.com/nayuts/ranzcr-clip-visualize-and-understand-dataset)
    + Please visit here and upvote


<a id = "ch05_sub_chapter-one"></a>
### Endotracheal Tube
- It is called ETT in this dataset.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('FtJr7i7ENMY')

<a id = "ch05_sub_chapter-two"></a>
### Nasogastric Tube
- It is called NGT in this dataset.

In [None]:
YouTubeVideo('Abf3Gd6AaZQ')

<a id = "ch05_sub_chapter-three"></a>
### Central venous catheter
- It is called CVC in this dataset.

In [None]:
YouTubeVideo('mTBrCMn86cU')

<a id = "ch05_sub_chapter-four"></a>
### Swan Ganz Catheter Present
- It is labeled Swan Ganz Catheter Present in this dataset.

In [None]:
YouTubeVideo('YkN30T6ig30')

<a id = "chapter-six"></a>
## Check train annotation file
- What's inside the train_annotations file?
    + Its stated purpose: 'These are segmentation annotations for training samples that have them. They are included solely as additional information for competitors.'
- Let's look at the data.
    

In [None]:
annot = pd.read_csv("../input/ranzcr-clip-catheter-line-classification/train_annotations.csv")
annot.head(30)

<a id="chapter-seven"></a>
## Visualization of X-ray Images
- Combining train and train_annotations, let's draw a sample image.


In [None]:
from PIL import Image

def train_base_chest_plot(row_ind, base_dir):
    # Look up the annotation row, then open the matching X-ray image
    row = annot.loc[row_ind]
    train_img = Image.open(base_dir + row['StudyInstanceUID'] + '.jpg')
    label = row['label']
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.imshow(train_img)
    ax.set_title(f"train: {label}")

base_dir = '../input/ranzcr-clip-catheter-line-classification/train/'
train_base_chest_plot(1, base_dir)

- What we really want is to draw the tube itself, so we use the 'data' column of the annotations in the plot.

In [None]:
import ast 
import numpy as np

def train_base_tube_plot(row_ind, base_dir):
    row = annot.loc[row_ind]
    train_img = Image.open(base_dir + row['StudyInstanceUID'] + '.jpg')
    label = row['label']
    # 'data' is stored as a string; parse it into an (N, 2) array of x, y points
    data = np.array(ast.literal_eval(row['data']))
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.imshow(train_img)
    # Trace the annotated tube path on top of the X-ray
    ax.plot(data[:, 0], data[:, 1], color = 'b', linewidth=2, marker='o')
    ax.set_title(f"train: {label}")

base_dir = '../input/ranzcr-clip-catheter-line-classification/train/'
train_base_tube_plot(1, base_dir)
train_base_tube_plot(2, base_dir)
train_base_tube_plot(25, base_dir)
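
The plots above draw one annotation row at a time. Assuming a single image can have several annotation rows (one per tube), a small helper can collect all of them for one StudyInstanceUID before plotting. This is only a sketch: `annotations_for_image` is a hypothetical helper, and the toy frame below uses made-up UIDs and coordinates in place of train_annotations.csv.

```python
import ast
import pandas as pd

# Toy stand-in for train_annotations.csv; UIDs, labels and points are invented
toy_annot = pd.DataFrame(
    {
        "StudyInstanceUID": ["uid_1", "uid_1", "uid_2"],
        "label": ["CVC - Normal", "NGT - Normal", "ETT - Normal"],
        "data": [
            "[[10, 20], [15, 40]]",
            "[[5, 5], [6, 30], [8, 60]]",
            "[[1, 2], [3, 4]]",
        ],
    }
)

def annotations_for_image(annot_df, uid):
    """Return a list of (label, points) pairs for one image; points are [x, y] lists."""
    rows = annot_df[annot_df["StudyInstanceUID"] == uid]
    # The 'data' column holds a string, so parse it into a Python list
    return [(row["label"], ast.literal_eval(row["data"])) for _, row in rows.iterrows()]

for label, points in annotations_for_image(toy_annot, "uid_1"):
    print(label, len(points), "points")
```

Each returned pair could then be passed to `ax.plot` with its own color, so that all tubes of one image appear on the same X-ray.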

- It is still difficult to tell what distinguishes normal from abnormal placement, so I stopped drawing more examples here.