# Cleaning and Inventory

This pack contains 326 stroke patients with a total of 8454 associated images. From the folder of images received from hospital X, we create part1_inventory_test.csv, which contains the metadata of all received images. The original files follow the DICOM standard, which you can browse here: https://dicom.innolitics.com/ciods. In part1_inventory_test.csv, each row represents an image in the package and each column a DICOM standard metadata field.

To apply our deep learning algorithms to this dataset, we first have to select the correct NCCT (Non-Contrast CT) image, out of the many available, for each patient. Selecting the right NCCT is critical to achieving good performance in the clinical study and to training a good algorithm.

The challenge consists of a small, simplified version of this task:

● Select the correct NCCT image for each of the 326 patients. 

● The correct NCCT must meet the following criteria: non-contrast image, CT modality, axial orientation, slice thickness between 2.5 and 5 mm, and first NCCT acquired (in time).
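The criteria above can be read as a chain of pandas filters followed by a per-patient "earliest" pick. The following is a minimal sketch on a toy table, not the reference solution. The `is_contrast` and `is_axial` flags are hypothetical columns assumed to have been derived beforehand, and ordering by `StudyDate` and then `SeriesNumber` is only an assumption about how "first in time" could be approximated.

```python
import pandas as pd

# Toy inventory. is_contrast and is_axial are hypothetical, precomputed flags.
toy = pd.DataFrame({
    "PatientID": ["P1", "P1", "P1", "P2", "P2"],
    "SeriesInstanceUID": ["s1", "s2", "s3", "s4", "s5"],
    "Modality": ["CT", "CT", "MR", "CT", "CT"],
    "SliceThickness": [5.0, 3.0, 5.0, 1.0, 2.5],
    "StudyDate": [20200102, 20200101, 20200101, 20200105, 20200106],
    "SeriesNumber": [2, 3, 1, 1, 2],
    "is_contrast": [False, False, False, False, False],
    "is_axial": [True, True, True, True, True],
})

def select_first_ncct(inventory: pd.DataFrame) -> pd.DataFrame:
    """Keep one candidate NCCT row per patient: the earliest that passes all filters."""
    mask = (
        (inventory["Modality"] == "CT")
        & ~inventory["is_contrast"]
        & inventory["is_axial"]
        & inventory["SliceThickness"].between(2.5, 5.0)  # bounds are inclusive
    )
    candidates = inventory[mask]
    # Assumption: StudyDate, then SeriesNumber, approximates acquisition order.
    ordered = candidates.sort_values(["PatientID", "StudyDate", "SeriesNumber"])
    return ordered.drop_duplicates("PatientID", keep="first").reset_index(drop=True)

selected = select_first_ncct(toy)
print(selected[["PatientID", "SeriesInstanceUID"]])
```

On the real inventory, some patients may have no row that passes every filter, so it is worth comparing the number of selected rows against the 326 expected patients.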

Some hints on the data: 

● A patient (PatientID) may have several studies (StudyInstanceUID, a group of images), and each study may contain many images (SeriesInstanceUID, a single image).

● Each row of part1_inventory_test.csv is the metadata of one image (unique SeriesInstanceUID).

● In this inventory you may find different kinds of image modalities: CT (NCCT or CTA), DWI, MRI, CTP, etc. Note that the DICOM modality column is not enough to complete the exercise, as NCCT and CTA are both CTs.

● The hardest ambiguity to resolve using only image metadata is whether an image is an NCCT or a CTA. The image above shows how they look (a CTA is a CT acquisition with an injection of contrast in the patient's arteries).

● Other important data fields: Modality, ImageOrientationPatient, SeriesDescription, StudyDescription, ImageType, SliceThickness, etc. 
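Two of the hints above lend themselves to small helper functions. In DICOM, ImageOrientationPatient holds six direction cosines (the row direction followed by the column direction), and the cross product of the two gives the slice normal, so an image is roughly axial when that normal points mostly along the patient's z axis. The sketch below assumes the value arrives as text containing six numbers (the exact encoding in the CSV is not known here, so the parser just extracts numbers), and the `tolerance` of 0.9 is an arbitrary assumption. The keyword check on SeriesDescription is a hypothetical heuristic for spotting CTA series, not a rule taken from the dataset.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")

def is_axial(iop, tolerance=0.9):
    """Return True if the slice normal is mostly along the patient z axis."""
    values = [float(x) for x in NUMBER.findall(str(iop))]
    if len(values) != 6:
        return False  # missing or malformed orientation
    rx, ry, rz, cx, cy, cz = values
    # z component of the cross product (row direction x column direction)
    normal_z = rx * cy - ry * cx
    return abs(normal_z) >= tolerance

def looks_contrast(series_description):
    """Hypothetical heuristic: flag descriptions with a CTA/angio token."""
    tokens = re.split(r"[^a-z0-9]+", str(series_description).lower())
    return any(t == "cta" or t.startswith("angio") for t in tokens)

print(is_axial(r"1\0\0\0\1\0"))          # axial orientation
print(is_axial("[0, 1, 0, 0, 0, -1]"))   # sagittal orientation
print(looks_contrast("CTA_HEAD 1.0"))
print(looks_contrast("Head 5.0 Axial"))
```

Keyword matching will miss series whose description does not mention contrast at all, so the result should be cross-checked against other fields such as ImageType or StudyDescription.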

Since we have already solved it, we provide an example of 25 selected NCCT images (example_solution.csv) with some relevant fields, to show what the final result should look like.

In [41]:
import pandas as pd

In [42]:
!pwd

/home/carlosgil/code/Charlie5545/data-specialist-methinks-challenge/data-specialist-methinks-challenge


In [90]:
df = pd.read_csv('data/part1_inventory_test.csv', low_memory=False)

In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8454 entries, 0 to 8453
Columns: 206 entries, PatientID to RouteOfAdmissions
dtypes: float64(113), int64(13), object(80)
memory usage: 13.3+ MB


In [131]:
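# Keep only columns with enough data: `thresh` is the minimum number of
# non-null values a column needs to survive (here, 30% of the rows).
threshold = int(df.shape[0] * 0.3)
df_cleaned = df.dropna(axis=1, thresh=threshold)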

In [132]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8454 entries, 0 to 8453
Data columns (total 83 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   PatientID                                8454 non-null   object 
 1   StudyInstanceUID                         8454 non-null   object 
 2   StudyDescription                         8454 non-null   object 
 3   StudyDate                                8454 non-null   int64  
 4   SeriesInstanceUID                        8454 non-null   object 
 5   SeriesDescription                        8453 non-null   object 
 6   Manufacturer                             8454 non-null   object 
 7   Instances                                8454 non-null   int64  
 8   LastUpdate                               8454 non-null   object 
 9   SeriesNumber                             8454 non-null   int64  
 10  ContentDate                              8424 no

In [133]:
# Get the original list of columns
original_columns = df.columns

# Get the new list of columns after dropping
new_columns = df_cleaned.columns

# Find the columns that were removed
removed_columns = original_columns.difference(new_columns)

print("Removed columns:", removed_columns)

Removed columns: Index(['AcquisitionMatrix', 'AcquisitionType', 'AngioFlag',
       'AveragePulseWidth', 'BurnedInAnnotation',
       'CTAdditionalXRaySourceSequence', 'CTDIPhantomTypeCodeSequence',
       'CineRate', 'ConfidentialityCode', 'ContrastBolusAgent',
       ...
       'TablePosition', 'TableSpeed', 'TemporalPositionIndex',
       'TransmitCoilName', 'TubeAngle', 'TypeOfPatientID', 'Units',
       'VOILUTFunction', 'VariableFlipAngleFlag', 'dBdt'],
      dtype='object', length=123)
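The removed list above includes ContrastBolusAgent, a sparse field that may still help separate NCCT from CTA. One option is to protect a few named columns from the sparsity filter. The sketch below uses toy data and a hypothetical helper, `drop_sparse_columns`. It works on the fraction of non-null values, which approximates the `thresh` count used above rather than reproducing it exactly.

```python
import pandas as pd

def drop_sparse_columns(frame, min_fraction=0.3, keep=()):
    """Drop columns whose non-null fraction is below min_fraction, except those in keep."""
    non_null_fraction = frame.notna().mean()
    sparse = non_null_fraction[non_null_fraction < min_fraction].index
    to_drop = sparse.difference(list(keep))
    return frame.drop(columns=to_drop)

# Toy data: both ContrastBolusAgent and dBdt are only 25% populated.
toy = pd.DataFrame({
    "Modality": ["CT", "CT", "CT", "MR"],
    "ContrastBolusAgent": [None, "IODINE", None, None],
    "dBdt": [None, None, None, 1.2],
})

cleaned = drop_sparse_columns(toy, keep=["ContrastBolusAgent"])
print(list(cleaned.columns))
```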


In [134]:
solution_df = pd.read_csv('data/example_solution.csv', low_memory=False)

In [135]:
solution_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PatientID          25 non-null     object 
 1   StudyInstanceUID   25 non-null     object 
 2   SeriesInstanceUID  25 non-null     object 
 3   StudyDescription   25 non-null     object 
 4   SeriesDescription  25 non-null     object 
 5   PixelSpacing       25 non-null     object 
 6   SliceThickness     25 non-null     float64
 7   ConvolutionKernel  25 non-null     object 
dtypes: float64(1), object(7)
memory usage: 1.7+ KB


In [12]:
solution_df.columns

Index(['PatientID', 'StudyInstanceUID', 'SeriesInstanceUID',
       'StudyDescription', 'SeriesDescription', 'PixelSpacing',
       'SliceThickness', 'ConvolutionKernel'],
      dtype='object')

In [77]:
solution_df.isnull().sum()

PatientID            0
StudyInstanceUID     0
SeriesInstanceUID    0
StudyDescription     0
SeriesDescription    0
PixelSpacing         0
SliceThickness       0
ConvolutionKernel    0
dtype: int64

In [60]:
filtered_df = unfiltered_df[solution_df.columns]

KeyError: "['SeriesDescription', 'PixelSpacing', 'SliceThickness', 'ConvolutionKernel'] not in index"
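The KeyError above means the frame being indexed does not contain all of the requested columns. A defensive way to do the same selection is to report what is missing and keep only the columns that exist. This is a minimal sketch with toy frames, and `select_available` is a hypothetical helper rather than a pandas function.

```python
import pandas as pd

def select_available(frame, wanted):
    """Return frame restricted to the wanted columns that exist, in the wanted order."""
    missing = [c for c in wanted if c not in frame.columns]
    if missing:
        print("Missing columns:", missing)
    present = [c for c in wanted if c in frame.columns]
    return frame[present]

# Toy frames standing in for the inventory and the example solution.
inventory = pd.DataFrame({"PatientID": ["P1"], "StudyInstanceUID": ["u1"], "Modality": ["CT"]})
wanted_columns = ["PatientID", "StudyInstanceUID", "SliceThickness"]

subset = select_available(inventory, wanted_columns)
print(list(subset.columns))
```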

In [37]:
solution_df.StudyDescription.unique().size

7

In [36]:
solution_df.SeriesDescription.unique().size

13

In [38]:
solution_df.PixelSpacing.unique().size

17

In [40]:
solution_df.SliceThickness.unique()

array([5.  , 3.  , 3.75])

In [32]:
solution_df.StudyInstanceUID.unique()

array(['1.3.12.2.1107.5.1.4.326841954957759929311289542521699260521',
       '1.3.12.2.1107.5.1.4.238857691497332572709100675877600626548',
       '1.3.12.2.1107.5.1.4.167235305454598923027682309313262206041',
       '1.3.12.2.1107.5.1.4.97427260740542717980461021888564870593',
       '1.3.12.2.1107.5.1.4.105076704470053803362550393506575745220',
       '1.3.12.2.1107.5.1.4.46104404406294049309673197744443069356',
       '1.3.12.2.1107.5.1.4.171298045711787897636307998027206005333',
       '1.3.12.2.1107.5.1.4.298158445625631474037403991027935234555',
       '1.3.12.2.1107.5.1.4.87689265667367200807072424104807954004',
       '1.3.12.2.1107.5.1.4.244637133215950936685061510712113215344',
       '1.3.12.2.1107.5.1.4.74059431288974628151054094532771979282',
       '1.3.12.2.1107.5.1.4.92814992245386364185969317284217509548',
       '1.3.12.2.1107.5.1.4.141223865578061624423976536064915077257',
       '1.3.12.2.1107.5.1.4.315776492648473720424017194932502913545',
       '1.3.12.2.1107.5.1