<a href="https://colab.research.google.com/github/cornellradiology/SIIM19_EDA/blob/master/medium_pneumonia_eda_with_interactive_Q_and_A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SIIM - Data Science - Tier 1
Pneumonia accounts for over 15% of all deaths of children under 5 years old internationally. In 2015, 920,000 children under the age of 5 died from the disease. In the United States, pneumonia accounted for over 500,000 visits to emergency departments and over 50,000 deaths in 2015, keeping the ailment on the list of top 10 causes of death in the country.

Diagnosing pneumonia usually requires review of a chest radiograph (CXR) by highly trained specialists and confirmation through clinical history, vital signs and laboratory exams. Pneumonia usually manifests as an area or areas of increased opacity on CXR. However, the diagnosis of pneumonia on CXR is complicated because a number of other conditions in the lungs can also manifest with the same findings, including fluid overload (pulmonary edema), bleeding, volume loss (atelectasis or collapse), lung cancer, or post-radiation or surgical changes. To improve the efficiency and reach of diagnostic services, the RSNA and STR annotated a dataset of 30,000 chest x-rays for the presence or absence of pneumonia, an effort that involved over 15 radiologists from multiple academic institutions.  The areas of opacity suspicious for pneumonia were outlined by a bounding box, and exams without suspicious opacities were labeled as either Normal, or Not Opacity / Not Normal.

A Kaggle competition was held in the fall of 2018 (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge) using this dataset.  In this course, we will start by doing exploratory data analysis (EDA) while learning about data science techniques.  Performing EDA is a prerequisite for building high-quality machine learning algorithms.

# Preliminaries

## Create directory structure

We import the `os` package so we can change directories from Python.

We define `ROOT_PATH` on Google Colab with a `pneumonia` subdirectory and then change directory into that path.

In Colab, you run bash (Linux shell) commands by prefixing the command with an exclamation point (`!`), which we use to create the `pneumonia` subdirectory.

Within a bash command run from a notebook, you can substitute in the value of a Python variable using the curly brace syntax: `{variable_name}` is replaced by the current contents of the variable.


In [0]:
import os
ROOT_PATH = '/content/pneumonia' # on colab you start in the path /content/
!mkdir -p {ROOT_PATH}
os.chdir(ROOT_PATH)
print(ROOT_PATH)

## Download the Data

As above, the `!` prefix runs a bash command.  The first command below uses `wget` to download a file from a given URL, in this case a compressed zip file, to your Colab instance.

The `{DOWNLOAD_URL}` in the command is replaced by the value of the Python variable `DOWNLOAD_URL`, using the curly brace substitution described above.


In [0]:
#DOWNLOAD_URL = 'http://quarkonia.info/media/small_rsna_pneumonia.zip' # 12MB zip file - 60 train images; 20 test images

DOWNLOAD_URL = 'http://quarkonia.info/media/medium_rsna_pneumonia.zip' # 53MB zip file - 300 train images; 100 test images

ZIP_FILE = DOWNLOAD_URL.split('/')[-1] # split string at '/' and take last part which is file name

if not os.path.exists(ZIP_FILE): # Download if it doesn't exist
  !wget --show-progress {DOWNLOAD_URL}
else:
  print("Zip file already exists")

print("Current path:")
!pwd
print("\nCurrent directory contents:")
!ls


## Unzip into expected directory structure

This takes the zip file you've downloaded, fixes permissions (initially restrictive on the full RSNA dataset), and extracts the training images into a directory called `pneumonia/train`.

In [0]:
if not os.path.exists('train/'):
  !unzip {ZIP_FILE}
  !chmod 664 *.zip *.csv
  os.mkdir('train/')
  !unzip -q stage_2_train_images.zip -d train/
else:
  print("Already unzipped.")
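For readers running outside Colab, where the `unzip` command may not be available, the same extraction step can be sketched with Python's built-in `zipfile` module. The archive and file names below are made up for illustration and live in a temporary directory, so the snippet does not touch the real dataset.

```python
import os
import tempfile
import zipfile

# Build a tiny throwaway archive so the example is self-contained (hypothetical file names).
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, 'example_images.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('a.dcm', b'fake image bytes')
    zf.writestr('b.dcm', b'more fake bytes')

# Rough equivalent of `unzip -q example_images.zip -d train/`
train_dir = os.path.join(work_dir, 'train')
if not os.path.exists(train_dir):
    os.mkdir(train_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(train_dir)

extracted = sorted(os.listdir(train_dir))
print(extracted)
```

As in the cell above, the `os.path.exists` check makes the step safe to re-run.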

# Basic import statements to load common libraries

* `os` is a built-in python library for exploring the file system.
* [pandas](https://pandas.pydata.org/) is a library for reading and processing tabular data.
* [numpy](https://www.numpy.org/) is a library for fast numerical operations on arrays of numerical data
* [matplotlib](https://matplotlib.org/) is a plotting library
* [pydicom](https://pydicom.github.io/pydicom/stable/getting_started.html) is a library for opening and loading dicom images
* [seaborn](https://seaborn.pydata.org/) is a plotting library built on top of matplotlib with a higher level interface with better defaults for data visualization ([seaborn tutorial](https://www.datacamp.com/community/tutorials/seaborn-python-tutorial))

## Install libraries not installed by default in Google Colab

The packages pandas, numpy, and matplotlib are installed by default on Google Colab; however, pydicom is not, so we have to install it with `pip` (the Python package installer).  Re-running the cell below is harmless, because pip skips packages that are already installed.

In [0]:
!pip install pydicom
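If you would rather skip the install step when a package is already present, one option is to check for it first. This is only a sketch: `is_installed` is a hypothetical helper written for this example, not part of pip or Colab, and it is meant for top-level module names.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed('os'))                                # standard library module
print(is_installed('a_module_that_does_not_exist_123'))  # made-up name

# In a notebook cell you could then guard the install:
# if not is_installed('pydicom'):
#     !pip install pydicom
```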

To import a module into python, it is simply `import module` and you can access functions (or submodules, classes, variables, etc.) belonging to the module via `module.name_of_thing_in_module`.  For importing `pandas`, `numpy`, `matplotlib.pyplot` it is convenient to import using common abbreviations.

In [0]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pydicom as dcm

import warnings
warnings.filterwarnings('ignore') # done to silence a seaborn deprecation warning


# Explore the Spreadsheet Data

## Open the data into pandas dataframes

There are two CSVs associated with RSNA pneumonia challenge that we will load into pandas dataframes.

A pandas **dataframe** is a collection of tabular data, similar to a spreadsheet, and is the primary data structure for pandas.

Since the tabular data is stored in CSV files, we will use `pd.read_csv(file_name)` to read each file into a dataframe.

In [0]:
bbox_df = pd.read_csv('stage_2_train_labels.csv')         # CSV of labeled bounding box information
class_df = pd.read_csv('stage_2_detailed_class_info.csv') # CSV with extra detailed_class_info

### Jupyter Notebook Tip #1: How to find documentation

If you need to get documentation in google colab for a function when you forgot the syntax or options, either:

1. execute a cell where the function ends with a question mark (`pd.read_csv?`)
2. type the name of the function with an open parenthesis `pd.read_csv(` and press tab to have documentation pop up in a hover window.


---


In a Jupyter notebook (similar to Colab but run locally) you can press shift-tab to pop up the same information.

Similarly, Colab and Jupyter notebooks have tab auto-complete.  (E.g., press tab after typing `pd.` to see things in the pandas namespace, or tab after `pd.rea` to see a list of options.)

In [0]:
pd.read_csv?

In [0]:
# Type "pd.merge(" below and press tab inside the parentheses.

### Jupyter Notebook Tip #2: Jupyter pretty-prints last value returned in codeblock

You can insert multiple lines of python code inside a Jupyter codeblock.

If the last Python line in a code block returns a value, it will be displayed below the code block (formatted more nicely than the output of a regular `print()` call).  An assignment statement does NOT return a value, so nothing is displayed for it.

Thus a common pattern in code blocks is to calculate a dataframe, assign it to a variable, and then repeat the variable name on the last line to display it.

In [0]:
df = bbox_df.copy() # create a copy of bbox_df
df

## View Samples of CSV Data in DataFrame

`df.head(n)` is a useful function to look at the first n rows of a dataframe to get a feel for the dataframe.

`df.tail(n)` acts similarly looking at the last n rows of the dataframe

`df.sample(n)` looks at n randomly selected entries.  

If unspecified, n defaults to 5 for `head` and `tail`, and to 1 for `sample`.
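A quick way to get a feel for these three methods is to try them on a small made-up dataframe. The sketch below uses a hypothetical `toy_df` rather than the competition data; `random_state` is passed only so that the random draw is repeatable.

```python
import pandas as pd

toy_df = pd.DataFrame({'value': range(10)})

first_rows = toy_df.head()                       # no n given: first 5 rows
last_rows = toy_df.tail(3)                       # last 3 rows
three_random = toy_df.sample(3, random_state=0)  # 3 randomly selected rows

print(len(first_rows), len(last_rows), len(three_random))
print(last_rows['value'].tolist())
```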

### Data field descriptions from [RSNA pneumonia dataset](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data):

* `patientId`- A patientId. Each patientId corresponds to a unique image.
* `x` - the upper-left x coordinate of the bounding box.
* `y` - the upper-left y coordinate of the bounding box.
* `width` - the width of the bounding box.
* `height` - the height of the bounding box.
* `Target` - the binary Target, indicating whether this sample has evidence of pneumonia.


### Q1 - Q3


In [0]:
# Q1: View first 5 rows of bounding box training label data frame.

In [0]:
# Q2: View last 10 rows of detailed class label data frame

In [0]:
# Q3: Sample 20 random entries of bounding box training label data frame

#### Q1 - Q3 solutions



In [0]:
bbox_df.head() # Note it defaults to N=5

In [0]:
class_df.tail(10)

In [0]:
bbox_df.sample(20)

## Describe the Dataset

`df.shape` returns the shape of the dataframe.  Note this is a property of the dataframe, not a function (hence no parentheses).

`df.describe()` shows summary statistics about a dataframe.

For object (categorical/string/timestamp) data, it shows:

* the **count** of each column, 
* the number of **unique** entries, 
* the **top** (most frequent) entry
* the **freq**uency of the top entry (number of occurrences)

For numerical data, it shows the **count**, **mean** (average), **std** (standard deviation), **min**imum, **25%** percentile, **50%** percentile (median), **75%** percentile, and **max**imum.

If both categorical data and numerical data columns are present, by default `df.describe()` will only describe the numerical columns.  However, you can show both with `df.describe(include='all')`.
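The difference between the default behavior and `include='all'` is easiest to see on a tiny mixed dataframe. The example below is a sketch on made-up data (`toy_df` and its columns are hypothetical).

```python
import pandas as pd

toy_df = pd.DataFrame({
    'label': ['a', 'b', 'a', 'a'],   # string column
    'score': [1.0, 2.0, 3.0, 4.0],   # numerical column
})

numeric_only = toy_df.describe()             # only 'score' is summarized
everything = toy_df.describe(include='all')  # both columns are summarized

print(numeric_only.columns.tolist())
print(everything.loc['top', 'label'], everything.loc['freq', 'label'])
```

Rows that do not apply to a column (for example `mean` for a string column) are filled with `NaN` in the `include='all'` output.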

### Q4 - Q5

In [0]:
# Q4: What are the dimensions of bbox_df and class_df?

In [0]:
# Q5: What is the most frequent category of descriptive class label and for how many bounding boxes does this label occur?

#### Q4 - Q5 solutions

In [0]:
print("BBox Shape:", bbox_df.shape)
print("Class info Shape:", class_df.shape)

In [0]:
class_df.describe(include='all')

## Pandas Slicing - How to select a column

First you can select a column from a dataframe in pandas by either the syntax:

`df['column_name']` or `df.column_name`

which returns a pandas Series (an indexed array of data values) containing that column of data.  The first method will always work regardless of column name, but the second is a convenience method that can be used when there are no spaces or other special characters in the column name and the name doesn't otherwise clash (e.g., isn't a member function of a pandas dataFrame).

You can then compare this Series element by element with standard comparison operators; e.g., the following returns a Series that is True for each row with a Target of 0 and False otherwise.


In [0]:
bbox_df.Target == 0 # Returns pandas.Series filled with True when the row has a Target of 0, False otherwise.

#### Q6

In [0]:
# Q6: demonstrate that the first column patientId is the same in class_df and bbox_df

##### Q6 Hint (Python's `any` and `all` built-ins)

In [0]:
# Python's built-in functions `any` and `all` test whether any or all elements of an iterable are True
# once each element is converted to a boolean.

any([False, True, False]) # returns True - at least one value is True
all([False, True, True])  # returns False - not all values are True

# Aside: any and all convert each value to a boolean before testing it.
# Values that are empty or zero convert to False (they are "Falsy")
# (e.g., "", 0, None, [], {}, etc)
# Non-empty containers and non-zero numbers are "Truthy": bool(non_empty_non_zero_value) == True

any([False, "", 0, None, [], {}]) # returns False - False, "", 0, None, [], {} are all Falsy values in Python.
all([True, "Hello", 1]) # returns True - non-empty strings and non-zero numbers evaluate to True when converted to bool

##### Q6 Answer

In [0]:
all(bbox_df['patientId'] == class_df['patientId']) 
# demonstrating the patientId column is the same in both df when compared row by row.

## Boolean Array Indexing - How to select rows that match criteria

You can then use this True/False pandas.Series to index your dataframe and only select rows where the entry in the Series is True.

`df[df.Target == 0]` # picks out rows from `df` where the value of `Target` column is `0`.

We can then verify the non-null count using the function `df.count()` which returns the number of rows where each column is non-null.

Note this type of slicing (Boolean Array Indexing) is not only used for selecting slices of a dataframe, but can also be used for updating slices of a data frame.
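As a small illustration of updating through a boolean mask, the sketch below uses a made-up dataframe (`toy_df` and the `note` column are hypothetical) and `.loc` to write a value only into the rows where the mask is True.

```python
import pandas as pd

toy_df = pd.DataFrame({'Target': [0, 1, 0, 1], 'note': ['', '', '', '']})

mask = toy_df['Target'] == 1           # boolean Series, one entry per row
toy_df.loc[mask, 'note'] = 'opacity'   # update only the rows where mask is True

print(toy_df)
```

Using `.loc[mask, column]` for the assignment, rather than chaining two indexing steps, makes sure the original dataframe is the one being modified.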



In [0]:
bbox_df[bbox_df.Target == 0] # This selects all rows of bbox_df with Target==0

In [0]:
bbox_df[bbox_df.Target == 0].count()

## Adding a Column to DataFrame

The syntax for adding a column to a dataframe is simple; you just create a new column name and assign to it with a Series of the correct size.  

For example, to add a column called `area` that is the product of `bbox_df.width` and `bbox_df.height`:

In [0]:
bbox_df['area'] = bbox_df.width * bbox_df.height
bbox_df

If we don't want the column, we can delete that column using the `del` keyword and using the following indexing notation to select the column we want to remove.

Obviously if we run the following cell twice, we'll raise a `KeyError` when trying to delete a column that no longer exists.


In [0]:
del bbox_df['area']
bbox_df

## Merge DataFrames

It makes sense to merge the two dataframes.  Since we showed in Q6 that the `patientId` columns are identical, with a one-to-one correspondence between rows, there are two valid ways to merge the dataframes in this case.

We could simply add a new column to `bbox_df` by copying the `class` column from `class_df`:


In [0]:
bbox_df['patId_from_class'] = class_df['patientId']
bbox_df['class'] = class_df['class']
bbox_df

In [0]:
# After again verifying that 'patId_from_class' matches 'patientId', we can delete the duplicate column
if all(bbox_df['patId_from_class'] == bbox_df['patientId']):
  del bbox_df['patId_from_class']
  
bbox_df

In [0]:
# But we are going to merge in a more generic way that doesn't require the patientId columns of
# the two data frames to be identical.
# So let's delete the 'class' column
del bbox_df['class']
bbox_df

### Using merge to join DataFrame akin to SQL JOIN

We can take two distinct dataframes in pandas and merge them into one concatenated dataframe, similar to a [SQL JOIN](https://i.stack.imgur.com/VQ5XP.png).

In `left_df.merge(right_df, on='patientId', how='inner')`

the parameter `on` determines the column we merge on  and `how` specifies whether we do an

* `inner`:  INNER join (use intersection of keys),  
* `left`:  LEFT OUTER join (always keep left key)
* `right`:  RIGHT OUTER join (always keep right key)
* `outer`:  OUTER join (use union of two keys)
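To make the four `how` options concrete, here is a sketch on two small made-up dataframes (`left_toy` and `right_toy` are hypothetical names) that share only some keys. It records which `patientId` values survive each kind of merge.

```python
import pandas as pd

left_toy = pd.DataFrame({'patientId': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right_toy = pd.DataFrame({'patientId': ['b', 'c', 'd'], 'label': ['B', 'C', 'D']})

keys_kept = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged_toy = left_toy.merge(right_toy, on='patientId', how=how)
    keys_kept[how] = sorted(merged_toy['patientId'])
    print(how, keys_kept[how])
```

Keys that exist on only one side are kept by `left`, `right`, or `outer` merges, with `NaN` filled in for the columns that come from the other dataframe.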

Note that because `stage_2_detailed_class_info.csv` has one row for every bounding box (with the same label for all bounding boxes belonging to the same patient), we should first remove duplicate rows with `df.drop_duplicates()` before doing a JOIN-type merge.

If we do not remove duplicates before joining, a naive inner JOIN on `patientId` goes wrong.  When there are, say, three bounding boxes on a patient image, there will be three rows in `bbox_df` and three corresponding rows in `class_df`.  The JOIN pairs every matching row on the left with every matching row on the right (a Cartesian product within each `patientId`), so after the JOIN we would have 9 rows (instead of 3) in our merged df.

### Q7

In [0]:
# Create a dataframe with no repeated rows with the information in class_df
# Do a merge between bbox_df and class_df with no duplicates.  Call this merged dataframe `df`
# Inspect the merged dataframe


#### Q7 Solution

In [0]:
class_df_no_dups = class_df.drop_duplicates()
df = bbox_df.merge(class_df_no_dups, on='patientId', how='inner')
df.head(8)

In [0]:
df.describe(include='all') # show numerical and string data
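To see why dropping duplicates matters, the row blow-up can be reproduced on a toy example. The dataframes below are made up (one hypothetical patient `p1` with three bounding boxes) and mimic the structure of the two CSV files.

```python
import pandas as pd

# One patient with three bounding boxes; the class table repeats the label once per box.
toy_bbox = pd.DataFrame({'patientId': ['p1'] * 3, 'x': [10, 20, 30]})
toy_class = pd.DataFrame({'patientId': ['p1'] * 3, 'class': ['Lung Opacity'] * 3})

naive = toy_bbox.merge(toy_class, on='patientId', how='inner')
deduped = toy_bbox.merge(toy_class.drop_duplicates(), on='patientId', how='inner')

print(len(naive), len(deduped))
```

The naive merge pairs each of the three left rows with each of the three right rows, while the deduplicated merge keeps one row per bounding box.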

## View distribution of data with histograms

`df.hist()` plots a histogram showing the distribution of values of each column of numerical data.  Note these plots are generated from pandas by internally calling matplotlib.


In [0]:
hist = df.hist(bins=50, layout=(2,3), figsize=(15,6))

## Inspect Missing Data

We've defined a function below to inspect the missing data.
 
 `df.isnull()` returns `False` (0) if a cell is non-empty or `True` (1) when a cell is null (or `NaN`).
 
 `df.isnull().sum()` then counts the number of null rows in every column of `df` (note in python `True+True==2`)
 
 `df.index` returns the index of row labels that can quickly be counted to specify number of rows using `len(df.index)`
 
 `pd.concat` concatenates pandas Series (a column of pandas data) along an axis and appends appropriate labels.

In [0]:
def missing_data(df):
  null_data = df.isnull().sum()
  num_rows = len(df.index)
  percent_null = 100.*null_data/num_rows
  return pd.concat([null_data, percent_null.round(1)], axis=1, keys=['Missing', 'PercentMissing'])

missing_data(df)

### Q8

In [0]:
# Let's compare the count of missing data to the value of the Target column
# How many rows in df have a Target == 0?
# How many rows in df have a Target == 1?

#### Q8 Answer

In [0]:
# Recall 
print(df.Target == 0) # Show True/False values based on whether Target is zero

In [0]:
print("Target 0")
print("="*20)
print(df[df.Target == 0].count())
print("")
print("Target 1")
print("="*20)
print(df[df.Target == 1].count())

Conclusion: bounding box is only defined for Target = 1; otherwise x, y, width, height are null (which makes sense with missing data above).

## Distribution of counts in our dataset

`series.value_counts()` takes a pandas series and returns the number of occurrences of each unique value in that series, sorted by default in descending order of count.

In [0]:
df['class'].value_counts()

### Q9

In [0]:
# We can also see it as a percentage by doing math on the Series data returned by `value_counts()`
# Convert the value_counts() into a percentage to find the percent of labeled bounding boxes that fall into each patient category

#### Q9 Answer

In [0]:
df['class'].value_counts()*(100.0)/len(df.index)
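An alternative to dividing by the number of rows yourself is the `normalize=True` option of `value_counts()`, which returns fractions that can be scaled to percentages. The sketch below uses a small made-up Series rather than the real `class` column.

```python
import pandas as pd

toy_classes = pd.Series(['Normal', 'Lung Opacity', 'Normal', 'Normal'])

percentages = toy_classes.value_counts(normalize=True) * 100
print(percentages)
```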

### Q10

In [0]:
# Use Boolean Array Selection and Describe to Verify that class = 'Lung Opacity' corresponds to Target = 1

#### Q10 Answer

In [0]:
df[df.Target==1].describe(include='all') # Note we selected on Target = 1 and show only one unique value of class, the top value seen is "Lung Opacity"

In [0]:
df[df['class'] == 'Lung Opacity'].describe(include='all') # Looking at it the other way, we selected on Class and saw the numerical value Target appear 9555 times as 1.0 (as seen with a max and min value of 1.0)

## We can also generate a countplot of this series

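`sns.countplot(x=series)` creates a bar graph of the number of occurrences of each distinct value in the pandas series.

In [0]:
countplot = sns.countplot(x=df['class'])  # keyword x= works across seaborn versions; newer ones treat the first positional argument as `data`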

## Grouping the categories

We can do another `countplot` where we first group the x-axis by `Target` value and then break up each `Target` group by `class` using the `hue` keyword.  In this type of plot there are groups of colored bars (colored based on the column indicated by `hue`).


In [0]:
target_class_plot = sns.countplot(data=df, x='Target', hue='class')

## Plotting density of values when Target = 1

We can observe the general distributions of the `x`, `y`, `width`, and `height` fields with `distplot`. For each plot we pass the data column we're interested in. We specify `kde=True` to overlay a Gaussian kernel density estimate on the histogram. We specify `bins=50` to spread our data over 50 bins. We specify `color` for plot colorization. We use matplotlib to generate a figure with four subplots in a 2x2 grid and use the `ax` argument to place each plot in a specific subplot before showing.  Consult [matplotlib.pyplot.subplots](https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.subplots.html#matplotlib.pyplot.subplots) for more documentation. Note that `distplot` is deprecated in recent seaborn releases in favor of `histplot` and `displot`.

In [0]:
target1_df = df[df['Target']==1]

fig, ax = plt.subplots(2,2, figsize=(12,12))
sns.distplot(target1_df['x'],     kde=True, bins=50, color="red",     ax=ax[0,1])
sns.distplot(target1_df['y'],     kde=True, bins=50, color="blue",    ax=ax[0,0])
sns.distplot(target1_df['width'], kde=True, bins=50, color="green",   ax=ax[1,0])
sns.distplot(target1_df['height'],kde=True, bins=50, color="magenta", ax=ax[1,1])
plt.tick_params(axis='both', which='major', labelsize=12)
plt.show()

## Plotting midpoints

The midpoint of each window is $x_c = x + \text{width}/2$ and $y_c = y + \text{height}/2$.

We can calculate these new columns and add them to our dataframe and then create a scatter plot of their values.

### Q11

In [0]:
# For rows where there is bounding box (lung opacity), add columns named 'xc' and 'yc'
# representing the midpoint of the box, so we can graph them later.

#### Q11 Answer

In [0]:
target1_df = target1_df.copy()  # work on an explicit copy so we are not writing into a slice of df
target1_df['xc'] = target1_df['x'] + target1_df['width']/2
target1_df['yc'] = target1_df['y'] + target1_df['height']/2
target1_df.head()

### Create a scatter plot of these midpoints

We will use the built in plotting functionality provided by pandas (via matplotlib) to create a scatter plot associated with this new data.  

We can access a set of built-in plot functions on our dataframe through the namespace `df.plot`, including a scatter plot with `df.plot.scatter`.

We specify the x-center column `xc` and the y-center column `yc`. We specify the range of values for the x and y axes via `xlim` and `ylim` respectively. We specify an `alpha` (opacity) value of 0.5 (0 transparent, 1 opaque). We also specify a point [marker](https://matplotlib.org/api/markers_api.html#module-matplotlib.markers) `'.'` along with the color.

In [0]:
scatter_plot = target1_df.plot.scatter(x='xc', y='yc', xlim=(0,1024), ylim=(0, 1024), alpha=0.5, marker='.', color='red')


## Exploring DICOM data

Let's look at the DICOM data in the training set located in subdirectory `train/`.

We've only loaded a sample of the images, so let's browse the directory to see which files are present.



`os.listdir()` is a useful function that takes in a string and gives back a Python list of the contents of the directory associated with that string.



### Q12

In [0]:
# How can we get a list of our training images?
# Can we count the number of training images that we have?

#### Q12 Answer


In [0]:
image_file_names = os.listdir('train/')
print(image_file_names[0:5]) # print first 5 file names
print(len(image_file_names))

### Create dataFrame with names of Images and patientId

First, let's convert this python list into a 1 column pandas dataframe.

Then we add a second column based on truncating the last 4 characters (`.dcm`) off each string to get the `patientId` corresponding to each image.


In [0]:
images_df = pd.DataFrame(image_file_names, columns=['file_name'])
images_df['patientId'] = images_df['file_name'].str[:-4] # add a patientId column taken from the file name
images_df
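Slicing off the last four characters assumes every file name ends in exactly `.dcm`. If extensions could vary in length, one alternative is `os.path.splitext`, sketched here on made-up file names.

```python
import os

file_names = ['abc123.dcm', 'scan.final.dicom']  # hypothetical file names

# splitext splits at the last dot only, so earlier dots in the name are preserved
patient_ids = [os.path.splitext(name)[0] for name in file_names]
print(patient_ids)
```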

### Q13

In [0]:
# Create merged_df by merging the newly created dataframe (with image names and patientId for ~300 images)
# with the full df with 30,227 rows from above.
# Be careful not to lose bbox data in the merge from rows that aren't present in both data frames.


#### Q13 Answer

In [0]:
merged_df = df.merge(images_df, on='patientId', how='left') # left merge keeps every row of df, even those without an image

### Q14


In [0]:
#Create a column in merged_df called 'image_exists' that is True when the image exists in the directory and False otherwise.

#### Q14 Hints

In [0]:
# Python has several bitwise operators:

# Note for examples: in binary 42 is 0b101010; 15 is 0b001111
# & (bitwise and)          (42 & 15 == 0b001010 == 10)
# | (bitwise inclusive or) (42 | 15 == 0b101111 == 47)
# ^ (bitwise exclusive or) (42 ^ 15 == 0b100101 == 37)
# ~ (bitwise negation)     (~42 == 0b1111...010101 == -43 using two's complement)

#These bitwise operators can be used on pandas series (e.g., dataframe column) to act on them 
# element by element in an efficient vectorized manner

pd.Series([True, True, False, False]) & pd.Series([False, True, True, False]) 
# above returns pd.Series([False, True, False, False])
pd.Series([True, True, False, False]) | pd.Series([False, True, True, False]) 
# returns pd.Series([True, True, True, False])
pd.Series([True, True, False, False]) ^ pd.Series([False, True, True, False]) 
# returns pd.Series([True, False, True, False])
~pd.Series([True, True, False, False]) 
# returns pd.Series([False, False, True, True])


# Note that Python's normal boolean operators (`and`, `or`, `not`) do not work element by element when
# applied to pandas Series.
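To see what goes wrong with the plain boolean operators, the sketch below applies `and` to two Series and catches the resulting error. It also shows a related pitfall: when combining comparisons with `&`, each comparison needs its own parentheses because `&` binds more tightly than `>` or `<`.

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([False, True, True, False])

try:
    a and b          # asks for a single truth value of a whole Series
    raised = False
except ValueError:
    raised = True    # pandas refuses: the truth value of a Series is ambiguous

print(raised)

values = pd.Series([1, 2, 3, 4])
in_range = (values > 1) & (values < 4)   # parentheses around each comparison
print(in_range.tolist())
```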




In [0]:
# Hint 2:

# Recall pandas has isnull() function that works on both DataFrames and Series.

#### Q14 Answer

In [0]:
merged_df['image_exists'] = ~merged_df['file_name'].isnull() 
# set a boolean column when file_name is not null.  Note `~` which is python's bitwise negation; 
# in pandas bitwise operators will act on each element of a data frame column


# This is convenient, because now we can simply select rows with an image present with:
merged_df[merged_df['image_exists']] 
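There are other ways to build the same flag. The sketch below, on made-up dataframes (`toy_labels` and `toy_images` are hypothetical), uses the `indicator=True` option of `merge`, which adds a `_merge` column recording whether each row was found in both dataframes, and compares it with `notnull()`.

```python
import pandas as pd

toy_labels = pd.DataFrame({'patientId': ['a', 'b', 'c']})
toy_images = pd.DataFrame({'patientId': ['a', 'c'], 'file_name': ['a.dcm', 'c.dcm']})

toy_merged = toy_labels.merge(toy_images, on='patientId', how='left', indicator=True)
toy_merged['image_exists'] = toy_merged['_merge'] == 'both'

# notnull() gives the same answer as ~isnull() without the explicit negation
same_answer = toy_merged['file_name'].notnull()

print(toy_merged[['patientId', 'image_exists']])
```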


### Unique Patients

The file names are the patient IDs.  In this limited sample for this demonstration, we only have 300 DICOM training images loaded.

If we had downloaded the full dataset we would have seen 26684 images in the training set and 3000 images in the test set.  The 26684 training images differ from the 30227 rows in our dataframe because a patient with several bounding boxes has several rows.  The image count should instead match the number of unique `patientId` values present in our dataframe:

In [0]:
print("Unique patientId in merged_df: ", merged_df['patientId'].nunique())

## Explore Patients With Multiple Rows with `groupby`

Let's see which patients have multiple rows of data (multiple bounding boxes) and their distribution.

First we take our dataframe and call `groupby([column_names])`, which, combined with an aggregating function (e.g., `count()`, `mean()`, `max()`, `min()`, `sum()`, `median()`), returns each column aggregated within each group.  We'll group by `patientId` and look at the counts of the various columns.

In [0]:
grouped_by_patient = merged_df.groupby(['patientId'])
grouped_by_patient.count() # shows how many non-null values each column has per value of 'patientId'

We only care about the number of rows present, so we can select the column `patientId` out of `grouped_by_patient` like:

`grouped_by_patient['patientId'].count()`

This returns a pandas Series, though we'd rather work with DataFrames so we convert it to a frame with `series.to_frame('numRows')` which creates a DataFrame with the values of the series in a column named `'numRows'` in our example.

However, the result still uses `patientId` (the column we grouped on) as its index.  We want `patientId` as an ordinary column instead, which we can do with `.reset_index()`.

In [0]:
# We care only about the number of rows present, which would be:
# grouped_by_patient['patientId'].count()
# Note this is a pandas Series, which we convert to a DataFrame with to_frame
num_rows_df = grouped_by_patient['patientId'].count().to_frame('numRows').reset_index()
num_rows_df

In [0]:
# We can then merge this into our dataframe with :

df = merged_df.merge(num_rows_df, on='patientId', how='left')
df

### Q15

In [0]:
# Can we find the maximum # of bounding boxes for our sample of patients?
# Can we easily find the patients who have the maximum number of bounding boxes?
# Can we obtain their bounding boxes?
# Do we have those dicoms available? If not, which patients have the next highest number of bounding boxes, and do we have those dicoms?

#### Q15 Answer

In [0]:
#max_dicom_patients_df = merged_df.groupby(['patientId']).count().to_frame('num_rows')

# Note this is a pandas Series, which we use to_frame
#max_dicom_patients_df = grouped_by_patient.count().to_frame('numRows')
#max_dicom_patients_df

# We care only about number of rows present, which would be:
# grouped_by_patient['patientId'].count()
grouped_by_patient['patientId'].count().describe()

In [0]:
num_rows_df = grouped_by_patient['patientId'].count().to_frame('numRows')
# ie those with maximum number of bounding boxes
max_num_patient_count_df = num_rows_df[num_rows_df.numRows == 4]

In [0]:
right_df = merged_df.merge(max_num_patient_count_df, on='patientId', how='right')
right_df[right_df.image_exists]

In [0]:
max_num_patient_count_df = num_rows_df[num_rows_df.numRows == 3]
right_df = merged_df.merge(max_num_patient_count_df, on='patientId', how='right')
right_df[right_df.image_exists]

### Plot Distribution of Rows in our dataframe by Patient Class

In [0]:
count_plot2 = sns.countplot(x='class', hue='numRows',data=df)

Note the above plot is not really that telling, as we can't differentiate the cases with zero bounding boxes from those with exactly one bounding box present (as there would be a row in both cases).

Instead let's count the number of rows where 'x' is not empty.  This lets us get an accurate count of the number of bounding boxes.

In [0]:
num_bboxes_df = df.groupby('patientId')['x'].count().to_frame('NumBBoxes').reset_index()
df2 = df.merge(num_bboxes_df, on='patientId', how='left')
count_plot3 = sns.countplot(x='class', hue='NumBBoxes', data=df2)
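The distinction between counting rows and counting non-null values can be checked on a toy example. In the sketch below (made-up data, hypothetical patient IDs), `size()` counts every row in a group while `count()` on the `x` column skips the `NaN` entry of the patient without a bounding box.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'patientId': ['p1', 'p1', 'p2', 'p3'],
    'x': [100.0, 300.0, np.nan, 50.0],   # p2 has no bounding box
})

rows_per_patient = toy_df.groupby('patientId').size()         # counts rows, NaN included
boxes_per_patient = toy_df.groupby('patientId')['x'].count()  # counts non-null x only

print(rows_per_patient.to_dict())
print(boxes_per_patient.to_dict())
```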

## Reading DICOM Meta Data

Pick a patient ID from our populated list.

We do this by first looking at the dataframe of images that exist, `df[df['image_exists']]`, and creating a new dataframe of six randomly sampled rows with `.sample(6)`.

We then use integer-based indexing with the `.iloc` indexer, and take the first row with `.iloc[0]`.
From this returned row we look at the `file_name` entry.

In [0]:
sample_with_images = df[df['image_exists']].sample(6)
dicom_file_name = sample_with_images.iloc[0]['file_name']
dicom_file_name
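Because a sampled dataframe keeps its original row labels, position-based and label-based indexing can give different rows. The sketch below contrasts `.iloc` (by position) with `.loc` (by label) on a made-up dataframe whose index labels are deliberately not 0, 1, 2.

```python
import pandas as pd

toy_df = pd.DataFrame({'file_name': ['a.dcm', 'b.dcm', 'c.dcm']}, index=[10, 20, 30])

by_position = toy_df.iloc[0]['file_name']   # first row, whatever its label is
by_label = toy_df.loc[20]['file_name']      # row whose index label is 20

print(by_position, by_label)
```

This is why `.iloc[0]` is the safer choice after `.sample()`: the first sampled row rarely has the label 0.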

We then build the full path to the file and get the `patient_id` from the file name.

In [0]:
dicom_full_path = os.path.join(ROOT_PATH, 'train', dicom_file_name)
patient_id = dicom_file_name[:-4]
dicom_full_path, patient_id

Open the dicom file using `pydicom` module

In [0]:
dicom_data = dcm.dcmread(dicom_full_path)  # same reader used in the plotting functions below
dicom_data

The DICOM metadata contains some useful information that may have predictive value, for example:

* Patient sex;
* Patient age;
* Modality;
* Body part examined;
* View position;
* Rows & Columns;
* Pixel Spacing.

Let's sample a few images with **Target = 1**.

## Plot DICOM images with Target = 1



In [0]:
def plot_six_dicom_images(df):
    img_data = list(df.T.to_dict().values())
    f, ax = plt.subplots(2,3, figsize=(16,12))
    for i,data_row in enumerate(img_data):
        patientImage = data_row['patientId']+'.dcm'
        imagePath = os.path.join(ROOT_PATH, 'train', patientImage)
        dcmdata = dcm.dcmread(imagePath)
        modality = dcmdata.Modality
        age = dcmdata.PatientAge
        sex = dcmdata.PatientSex
        ax[i//3, i%3].imshow(dcmdata.pixel_array, cmap=plt.cm.bone) 
        ax[i//3, i%3].axis('off')
        ax[i//3, i%3].set_title('ID: {}\nModality: {} Age: {} Sex: {} Target: {}\nClass: {}\nWindow: {}:{}:{}:{}'.format(
                data_row['patientId'],
                modality, age, sex, data_row['Target'], data_row['class'], 
                data_row['x'],data_row['y'],data_row['width'],data_row['height']))
    plt.show()
    


Create a sample of six images with `Target == 1`.  We can either create a new dataframe containing only rows whose image exists, or use bitwise operators between columns when selecting rows of our dataframe.

In [0]:
images_present_df = df[df['image_exists']]
target1_sample = images_present_df[images_present_df['Target']==1].sample(6)

# alternatively, if you didn't save a dataframe containing only the present images,
# you can use the bitwise & operator on two boolean masks to select rows where Target is 1 and image_exists is True

target1_sample = df[df['image_exists'] & (df['Target']==1)].sample(6)

plot_six_dicom_images(target1_sample)
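To make the two selection styles concrete, here is a minimal sketch on a toy dataframe. The values are made up; only the column names mirror the ones used above.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook's.
toy_df = pd.DataFrame({
    "patientId": ["a", "b", "c", "d"],
    "Target": [1, 0, 1, 1],
    "image_exists": [True, True, False, True],
})

# One step: combine two boolean masks with &.
# The parentheses matter because & binds more tightly than ==.
both = toy_df[toy_df["image_exists"] & (toy_df["Target"] == 1)]

# Two steps: filter on image_exists first, then on Target.
present = toy_df[toy_df["image_exists"]]
two_step = present[present["Target"] == 1]

print(both["patientId"].tolist())
print(both.equals(two_step))
```

Both routes select the same rows, so the choice mostly depends on whether the intermediate dataframe is needed again later.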

We would like to show the images with the bounding boxes superimposed. For this, we first need to look up every row with Target = 1 for a patient and gather the coordinates of all the boxes showing a Lung Opacity on the same image.


In [0]:
from matplotlib.patches import Rectangle

def show_dicom_images_with_boxes(data):
    img_data = list(data.T.to_dict().values())
    f, ax = plt.subplots(2,3, figsize=(16,12))
    for i,data_row in enumerate(img_data):
        patientImage = data_row['patientId']+'.dcm'
        imagePath = os.path.join(ROOT_PATH, 'train', patientImage)
        dcmdata = dcm.dcmread(imagePath)
        modality = dcmdata.Modality
        age = dcmdata.PatientAge
        sex = dcmdata.PatientSex
        ax[i//3, i%3].imshow(dcmdata.pixel_array, cmap=plt.cm.bone) 
        ax[i//3, i%3].axis('off')
        ax[i//3, i%3].set_title('ID: {}\nModality: {} Age: {} Sex: {} Target: {}\nClass: {}'.format(
                data_row['patientId'],modality, age, sex, data_row['Target'], data_row['class']))
        rows = merged_df[merged_df['patientId']==data_row['patientId']]
        box_data = list(rows.T.to_dict().values())
        for j, row in enumerate(box_data):
            ax[i//3, i%3].add_patch(Rectangle(xy=(row['x'], row['y']),
                        width=row['width'],height=row['height'], 
                        color="yellow",alpha = 0.1))   
    plt.show()


In [0]:
show_dicom_images_with_boxes(target1_sample)

### Q16

In [0]:
# Can you plot the patient case where there are 3 bounding boxes?

#### Q16 Answer

In [0]:
show_dicom_images_with_boxes(right_df[right_df.image_exists].sample(1))
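If a dataframe of such cases is not already at hand, one possible way to find them is to count label rows per patient. This is a sketch on a made-up table that assumes one row per bounding box; the names `toy_labels` and `three_box_ids` are illustrative only.

```python
import pandas as pd

# Toy label table with made-up values: one row per bounding box.
toy_labels = pd.DataFrame({
    "patientId": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "Target": [1, 1, 1, 1, 1, 1],
})

# Count bounding-box rows per patient among the positive cases.
positives = toy_labels[toy_labels["Target"] == 1]
boxes_per_patient = positives.groupby("patientId").size()

# Keep only the patients with exactly three boxes.
three_box_ids = boxes_per_patient[boxes_per_patient == 3].index.tolist()
print(three_box_ids)
```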


For some of the images with **Target=1**, we might see multiple areas (boxes/rectangles) with Lung Opacity.

Let's sample a few images with **Target = 0**.

## Plot DICOM images with Target = 0


In [0]:
## Normal images
target0_sample_normal = images_present_df[images_present_df['class']=='Normal'].sample(6)
plot_six_dicom_images(target0_sample_normal)

In [0]:
## No Lung Opacity
target0_sample_abnormal = images_present_df[images_present_df['class']=='No Lung Opacity / Not Normal'].sample(6)
plot_six_dicom_images(target0_sample_abnormal)

# Add meta information from DICOM files

## Train data

We will read the DICOM metadata from the DICOM files and add it to the train dataset.

In [0]:
dcmfields = ['PatientID', 'Modality', 'PatientAge', 'PatientSex', 'BodyPartExamined', 'ViewPosition', 'ConversionType', 'Rows', 'Columns', 'PixelSpacing']

def process_dicom_image_dir(data_path):
    image_names = os.listdir(data_path)
    image_meta_data = []
    for i, img_name in enumerate(image_names):
        imagePath = os.path.join(data_path, img_name)
        dcmdata = dcm.dcmread(imagePath, stop_before_pixels=True)
        meta_data_row = [dcmdata.get(field) for field in dcmfields]
        image_meta_data.append(meta_data_row)
    return pd.DataFrame(image_meta_data, columns=dcmfields)

In [0]:
dcm_metadata_df = process_dicom_image_dir(os.path.join(ROOT_PATH, 'train'))
dcm_metadata_df = dcm_metadata_df.rename(columns={'PatientID':'patientId'})
dcm_metadata_df

In [0]:
merged_df = df.merge(dcm_metadata_df, on='patientId', how='left')
merged_df[merged_df['image_exists']]
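Because only a subset of images is present, a left merge leaves the metadata columns empty for every patient without a file. A minimal sketch, using made-up tables, of how the `indicator` option of `merge` can report how many rows found a match:

```python
import pandas as pd

# Made-up stand-ins for the label table and the extracted metadata.
toy_labels = pd.DataFrame({"patientId": ["p1", "p2", "p3"], "Target": [1, 0, 1]})
toy_meta = pd.DataFrame({"patientId": ["p1", "p3"], "ViewPosition": ["AP", "PA"]})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left merge.
checked = toy_labels.merge(toy_meta, on="patientId", how="left", indicator=True)
match_counts = checked["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` are labels with no metadata, which should correspond to the images that were not downloaded.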

## We only downloaded a subset of data

Using the same function as above, we can read the metadata of all 26,684 training DICOMs and 3,000 test DICOMs in 30 seconds of total processing. The cell below loads that metadata from CSV files in the `extras` directory.


In [0]:
dcm_train_metadata_full_df = pd.read_csv('extras/dcm_train_metadata_full.csv')
dcm_test_metadata_full_df = pd.read_csv('extras/dcm_test_metadata_full.csv')  

In [0]:
merged_df = df.merge(dcm_train_metadata_full_df, on='patientId', how='left')


## Inspecting the DCM Data Fields

If we look at the DICOM fields of our full training data, we see:

In [0]:
merged_df[['Modality', 'BodyPartExamined', 'PatientSex', 'ViewPosition', 'ConversionType']].describe()

We see that **Modality**, **BodyPartExamined** and **ConversionType** each have exactly one value in all their cells (`unique=1`), so all these images are modality `CR` (Computed Radiography) of body part `CHEST`, with ConversionType `WSD` (Workstation).

We do have 2 unique values each for **PatientSex** and **ViewPosition**.



In [0]:
merged_df[['PatientAge', 'Rows', 'Columns']].describe()

We also see from the numerical **Rows** and **Columns** data that every image has a fixed shape of 1024 rows by 1024 columns. These columns were read in as numerical data, but we can tell that all their values are identical either because the **std** is 0 or because the **min** and **max** are equal.
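Constant columns can also be found directly, without reading two `describe()` tables. A minimal sketch on made-up metadata, assuming the same column names: `nunique()` works for text and numeric columns alike.

```python
import pandas as pd

# Made-up metadata rows; column names mirror the notebook's.
toy_meta = pd.DataFrame({
    "Modality": ["CR", "CR", "CR"],
    "PatientSex": ["M", "F", "M"],
    "Rows": [1024, 1024, 1024],
})

# A column is constant when it holds exactly one distinct value.
unique_counts = toy_meta.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```

Columns flagged this way carry no information for telling cases apart.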

Meanwhile, there is a wide range of **PatientAge** values, including several data points that appear to be incorrect (PatientAge = 155).

## Inspecting View Position

We observe that the ViewPosition varies between **PA** (posteroanterior, with the patient facing away from the X-ray source) and **AP** (anteroposterior, with the patient facing towards the X-ray source) in our dataset.

In [0]:
ax=sns.countplot(data=merged_df, x='ViewPosition', hue='class')
print(merged_df[merged_df['Target'] == 1]['ViewPosition'].value_counts())
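Raw counts are hard to compare when the two views have different numbers of exams. A minimal sketch, on made-up values that do not reflect the real dataset, of using `pd.crosstab` with row normalization to get the share of each Target value within each view:

```python
import pandas as pd

# Made-up exams; the proportions here are illustrative only.
toy = pd.DataFrame({
    "ViewPosition": ["AP", "AP", "AP", "AP", "PA", "PA", "PA", "PA"],
    "Target": [1, 1, 1, 0, 1, 0, 0, 0],
})

# normalize="index" makes each row (view) sum to 1.
rates = pd.crosstab(toy["ViewPosition"], toy["Target"], normalize="index")
print(rates)
```

The column labelled `1` then reads as the fraction of exams in that view with Target = 1.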

## Inspecting PatientSex



In [0]:
ax=sns.countplot(data=merged_df, x='PatientSex', hue='class')

## Inspecting PatientAge



In [0]:
fig = plt.figure(figsize=(16,6))
plt.xticks(rotation=90)
ax=sns.countplot(data=dcm_train_metadata_full_df, x='PatientAge')

print("Incorrect ages (greater than 130):", (dcm_train_metadata_full_df['PatientAge'] > 130).sum())
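If the implausible ages should not distort later statistics, one option is to treat them as missing. This is a sketch on made-up values; the threshold of 130 comes from the cell above, while masking the entries is an assumption and not a step this notebook performs.

```python
import pandas as pd

# Made-up ages, including two implausible entries.
toy_ages = pd.Series([34, 155, 61, 148, 2])

# Flag ages above the threshold and replace them with NaN.
implausible = toy_ages > 130
cleaned_ages = toy_ages.mask(implausible)

print("Flagged:", implausible.sum())
print("Max plausible age:", cleaned_ages.max())
```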

# Number of Target Cases plot by PatientSex

In [0]:
fig = plt.figure(figsize=(16,12))
ax1 = fig.add_subplot(211)
plt.xticks(rotation=90)
ax2 = fig.add_subplot(212)
plt.xticks(rotation=90)
ax1.set_title("Train: Male Chest Exams by Age and Target")
ax1=sns.countplot(x='PatientAge', hue='Target', data=merged_df[merged_df['PatientSex']=='M'], ax=ax1)
ax2.set_title("Train: Female Chest Exams by Age and Target")
ax2=sns.countplot(x='PatientAge', hue='Target', data=merged_df[merged_df['PatientSex']=='F'], ax=ax2)
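With one bar pair per year of age, the plots above can be noisy. A minimal sketch, on made-up values, of grouping ages into bins with `pd.cut` and computing the share of Target = 1 per sex and age bin. The bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up exams; values are illustrative only.
toy = pd.DataFrame({
    "PatientAge": [5, 25, 35, 45, 65, 70],
    "PatientSex": ["M", "M", "M", "F", "F", "F"],
    "Target": [0, 1, 0, 0, 1, 1],
})

# Left-closed 20-year bins: [0, 20), [20, 40), [40, 60), [60, 80).
toy["age_bin"] = pd.cut(
    toy["PatientAge"],
    bins=[0, 20, 40, 60, 80],
    labels=["0-19", "20-39", "40-59", "60-79"],
    right=False,
).astype(str)

# The mean of a 0/1 column is the fraction of positive cases in each group.
positive_rate = toy.groupby(["PatientSex", "age_bin"])["Target"].mean()
print(positive_rate)
```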


## Bounding Box Mask

To determine where to look for potential lung opacity, let's create a 1024x1024 map that sums all the labeled bounding box regions in our training set.

`np.linspace(0, 1023, 1024)` creates a NumPy array `[0, 1, 2, ..., 1023]` with 1024 values.

`xx, yy = np.meshgrid(arr1, arr2)` creates two 2d-arrays (`xx`, `yy`) of the shape `len(arr2) x len(arr1)`. 

Every row of `xx` is a copy of `arr1` and there are `len(arr2)` copies present.

Every column of `yy` is a copy of `arr2` and there are `len(arr1)` copies present.
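A small example with arrays of different lengths makes the layout easier to see. This is a sketch with made-up inputs; the names `small_xx` and `small_yy` are illustrative only.

```python
import numpy as np

arr1 = np.array([0, 1, 2])   # x values, length 3
arr2 = np.array([10, 20])    # y values, length 2

small_xx, small_yy = np.meshgrid(arr1, arr2, indexing="xy")

print(small_xx.shape)  # (2, 3): len(arr2) rows, len(arr1) columns
print(small_xx)        # every row is a copy of arr1
print(small_yy)        # every column is a copy of arr2
```

With `indexing="xy"`, the first array index selects the row (the y value) and the second selects the column (the x value).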



In [0]:
xx, yy = np.meshgrid(np.linspace(0, 1023, 1024),
                     np.linspace(0, 1023, 1024), 
                     indexing='xy')
# xx is a 1024 x 1024 2d array where every row is [0, 1, 2, 3, ... ,1023]
# yy is a 1024 x 1024 2d array where the zeroth row is [0, 0, 0, ... ]
# the next row is [1, 1, 1, ...] ; i-th row is [i, i, i, ]

# in our example xx[i, j] will always be j (the column) and yy[i, j] will always be i (the row),
# which makes it possible to quickly do vectorized math based on bounding box coordinates.


bboxes = {}
bboxes['AP'] = np.zeros_like(xx)
bboxes['PA'] = np.zeros_like(xx)
## Creates a zero filled array in same shape


for view in ('AP', 'PA'):
  for i, bbox in merged_df[(merged_df['Target']==1) & (merged_df['ViewPosition']==view)].sample(1767).iterrows():
      mask  = (xx >= bbox['x']) & (xx <= (bbox['x'] + bbox['width'] ))
      mask &= (yy >= bbox['y']) & (yy <= (bbox['y'] + bbox['height']))
      bboxes[view] += mask
    
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 6))
ax1.set_title("AP Bounding Boxes")
subfig1 = ax1.imshow(bboxes['AP'], cmap='hot')

ax2.set_title("PA Bounding Boxes")
subfig2 = ax2.imshow(bboxes['PA'], cmap='hot')
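The mask built from `xx` and `yy` is equivalent to adding 1 to a rectangular slice of the array. A minimal sketch on an 8x8 grid with one made-up box, assuming integer coordinates (the real labels may be floats and would need rounding first):

```python
import numpy as np

size = 8
grid_x, grid_y = np.meshgrid(np.arange(size), np.arange(size), indexing="xy")

# Hypothetical bounding box with integer coordinates.
box = {"x": 2, "y": 3, "width": 3, "height": 2}

# Mask approach, as in the cell above (both edges inclusive).
mask = (grid_x >= box["x"]) & (grid_x <= box["x"] + box["width"])
mask &= (grid_y >= box["y"]) & (grid_y <= box["y"] + box["height"])

# Slice approach: rows are y, columns are x; +1 keeps the far edge inclusive.
heat = np.zeros((size, size))
heat[box["y"]:box["y"] + box["height"] + 1,
     box["x"]:box["x"] + box["width"] + 1] += 1

print(np.array_equal(heat, mask.astype(float)))
```

The slice version touches only the pixels inside the box, so it can be faster when many boxes are accumulated. The mask version handles non-integer coordinates without extra work.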


## Overlay

We can then overlay this bounding box heat map, scaled by its maximum count, on images from our training set.

In [0]:
def show_dicom_images_with_bbox_dist(data, bboxes):    
    img_data = list(data.T.to_dict().values())
    f, ax = plt.subplots(2,3, figsize=(16,12))
    for i, data_row in enumerate(img_data):
        patientImage = data_row['patientId']+'.dcm'
        imagePath = os.path.join(ROOT_PATH, 'train', patientImage)
        dcmdata = dcm.dcmread(imagePath)
        modality = dcmdata.Modality
        age = dcmdata.PatientAge
        sex = dcmdata.PatientSex
        view = dcmdata.ViewPosition
        img = plt.cm.gray(dcmdata.pixel_array)
        img += 0.25*plt.cm.hot(bboxes[view]/bboxes[view].max())
        img = np.clip(img, 0, 1)

        ax[i//3, i%3].imshow(img) 
        ax[i//3, i%3].axis('off')
        ax[i//3, i%3].set_title('ID: {}\nModality: {} Age: {} Sex: {} Target: {}\nView: {}, Class: {}'.format(
                data_row['patientId'],modality, age, sex, data_row['Target'], view, data_row['class']))
        rows = merged_df[merged_df['patientId']==data_row['patientId']]
        box_data = list(rows.T.to_dict().values())       
        for j, row in enumerate(box_data):
            ax[i//3, i%3].add_patch(Rectangle(xy=(row['x'], row['y']),
                        width=row['width'],height=row['height'], 
                        color="green", fill=False))   
    plt.show()


In [0]:
show_dicom_images_with_bbox_dist(target1_sample, bboxes)
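The blending step inside the function can be tried on its own. This is a minimal sketch with tiny made-up arrays standing in for `pixel_array` and the bounding box counts. It relies on how matplotlib colormaps treat their input: integer arrays are used as indices into the 256-entry color table, while float arrays are expected to lie in [0, 1].

```python
import numpy as np
from matplotlib import cm

# Stand-ins for dcmdata.pixel_array (uint8) and bboxes[view] (counts).
gray_pixels = np.array([[0, 128], [255, 64]], dtype=np.uint8)
heat_counts = np.array([[0.0, 2.0], [4.0, 1.0]])

# Integer input indexes the color table directly: 0 is black, 255 is white.
base_rgba = cm.gray(gray_pixels)

# Float input must be scaled to [0, 1] first, hence the division by the max.
heat_rgba = cm.hot(heat_counts / heat_counts.max())

# Add a quarter-strength heat layer and clip back into the valid range.
blended = np.clip(base_rgba + 0.25 * heat_rgba, 0, 1)
print(blended.shape)
```

Each result is an RGBA array with one extra axis of length 4, which `imshow` displays directly.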

# What next?  MACHINE LEARNING!

You can go to the Kaggle competition website and look at code used to analyze the data we've been exploring:

* One example of a [Mask R-CNN used to detect pneumonia](https://www.kaggle.com/liangbinghao/my-mask-rcnn-sample-starter-code)
* [Winning submission using advanced ensemble models](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/discussion/70421)

Also, don't forget to build your data processing pipeline (resize, center, and normalize images), insert quality controls in labelling, analyze mislabeled cases, augment your dataset, and then experiment and repeat.