In [716]:
# PENDING: first (in time) NCCT acquired.

# Cleaning and Inventory

This pack contains 326 stroke patients with a total of 8454 associated images. From the folder of images received from hospital X, we create part1_inventory_test.csv, which contains all the metadata of the received images. The initial file format follows the DICOM standard, which you can check here: https://dicom.innolitics.com/ciods. In part1_inventory_test.csv, each row represents an image in the package and each column a DICOM standard metadata field.

In order to apply our deep learning algorithms to this dataset, we first have to select the correct NCCT (Non-Contrast CT) image (out of the many images that we have) for each patient. Selecting the right NCCT is critical to achieve a good performance in the clinical study or to train a good algorithm. 

The challenge consists of a small, simplified version of this task: 

● Select the correct NCCT image for each of the 326 patients. 

● The correct NCCT must meet the following characteristics: non-contrast image, CT modality, axial orientation, slice thickness between 2.5 and 5 mm, first (in time) NCCT acquired. 

Some hints on the data: 

● A patient (PatientID) may have several studies (StudyInstanceUID - group of images), and within the study there can be many images (SeriesInstanceUID - single image).

● Each row of the part1_inventory_test.csv is the metadata of the image (unique SeriesInstanceUID). 

● In this inventory you may find different kinds of image modalities: CT (NCCT or CTA), DWI, MRI, CTP, etc. Note that the DICOM modality column is not enough to complete the exercise, as NCCT and CTA are both CTs.

● The most difficult ambiguity to resolve using only image metadata is whether an image is an NCCT or a CTA (a CTA is a CT acquisition with contrast injected into the patient's arteries). The image above shows how they look.

● Other important data fields: Modality, ImageOrientationPatient, SeriesDescription, StudyDescription, ImageType, SliceThickness, etc. 

Since we have already solved it, we provide an example of 25 selected NCCT images (example_solution.csv) with some relevant fields, to show what the final result should look like. 

In [679]:
import pandas as pd

In [680]:
!pwd

/home/carlosgil/code/Charlie5545/data-specialist-methinks-challenge/data-specialist-methinks-challenge


In [681]:
df = pd.read_csv('data/part1_inventory_test.csv',low_memory = False)

In [682]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8454 entries, 0 to 8453
Columns: 206 entries, PatientID to RouteOfAdmissions
dtypes: float64(113), int64(13), object(80)
memory usage: 13.3+ MB


In [683]:
solution_df = pd.read_csv('data/example_solution.csv',low_memory = False)

In [684]:
solution_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   PatientID          25 non-null     object 
 1   StudyInstanceUID   25 non-null     object 
 2   SeriesInstanceUID  25 non-null     object 
 3   StudyDescription   25 non-null     object 
 4   SeriesDescription  25 non-null     object 
 5   PixelSpacing       25 non-null     object 
 6   SliceThickness     25 non-null     float64
 7   ConvolutionKernel  25 non-null     object 
dtypes: float64(1), object(7)
memory usage: 1.7+ KB


In [685]:
filtered_columns = list(solution_df.columns) + ['ImageOrientationPatient','Modality']
filtered_columns

['PatientID',
 'StudyInstanceUID',
 'SeriesInstanceUID',
 'StudyDescription',
 'SeriesDescription',
 'PixelSpacing',
 'SliceThickness',
 'ConvolutionKernel',
 'ImageOrientationPatient',
 'Modality']

In [686]:
df = df[filtered_columns]

In [687]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8454 entries, 0 to 8453
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PatientID                8454 non-null   object 
 1   StudyInstanceUID         8454 non-null   object 
 2   SeriesInstanceUID        8454 non-null   object 
 3   StudyDescription         8454 non-null   object 
 4   SeriesDescription        8453 non-null   object 
 5   PixelSpacing             8365 non-null   object 
 6   SliceThickness           7410 non-null   float64
 7   ConvolutionKernel        5033 non-null   object 
 8   ImageOrientationPatient  7709 non-null   object 
 9   Modality                 8454 non-null   object 
dtypes: float64(1), object(9)
memory usage: 660.6+ KB


### Removing columns with fewer than 30% non-null values

In [688]:
threshold = int(df.shape[0] * 0.3)
df = df.dropna(axis=1, thresh=threshold)
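As a reminder of how `thresh` behaves: `dropna(axis=1, thresh=k)` keeps a column only if it has at least `k` non-null values, so a threshold of 30% of the rows drops columns that are more than 70% empty. The toy frame below is made up for illustration and is not part of the inventory.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows: 'a' is complete, 'b' has 4 non-null values,
# and 'c' has only 2 non-null values.
toy = pd.DataFrame({
    "a": range(10),
    "b": [1.0, 2.0, 3.0, 4.0] + [np.nan] * 6,
    "c": [1.0, 2.0] + [np.nan] * 8,
})

# Require at least 30% of the rows (3 values) to be non-null.
threshold = int(toy.shape[0] * 0.3)
kept = toy.dropna(axis=1, thresh=threshold)

print(list(kept.columns))
```

Here 'b' survives even though 60% of its values are missing, which matches how ConvolutionKernel stays in the inventory.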

The correct NCCT must meet the following characteristics:

    1) non-contrast image -> cannot be read directly from a single field
    
    2) CT modality -> filter on 'Modality' == 'CT'
    
    3) axial orientation -> filter on 'ImageOrientationPatient' == ['1', '0', '0', '0', '1', '0']
    
    4) slice thickness between 2.5 and 5 mm -> filter on 2.5 <= 'SliceThickness' <= 5
    
    5) first (in time) NCCT acquired -> pending
    
    Other columns used in solution_df or mentioned in the instructions that help with 1) non-contrast image:
    
    6) 'SeriesDescription'
    
    7) 'StudyDescription'
   
    8) 'ImageType'
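Criterion 5 could be handled once the other filters have run, by sorting the remaining candidates by acquisition time and keeping the earliest row per patient. The sketch below is only illustrative: the column name `AcquisitionDateTime` and its `YYYYMMDDHHMMSS` format are assumptions, and the real inventory may expose the timestamp under different fields (for example separate date and time columns).

```python
import pandas as pd

# Toy candidates that already passed the other filters.
# 'AcquisitionDateTime' is an assumed column name, not confirmed in the inventory.
candidates = pd.DataFrame({
    "PatientID": ["P1", "P1", "P2"],
    "SeriesInstanceUID": ["1.2", "1.1", "2.1"],
    "AcquisitionDateTime": ["20240102103000", "20240101090000", "20240105120000"],
})

# Parse the timestamp; unparseable values become NaT and sort last.
candidates["acquired_at"] = pd.to_datetime(
    candidates["AcquisitionDateTime"], format="%Y%m%d%H%M%S", errors="coerce"
)

# Keep the earliest acquisition for each patient.
first_ncct = (
    candidates.sort_values(["PatientID", "acquired_at"])
    .drop_duplicates(subset="PatientID", keep="first")
    .reset_index(drop=True)
)

print(first_ncct[["PatientID", "SeriesInstanceUID"]])
```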
    

## Modality == 'CT'

In [689]:
df.Modality.unique()

array(['CT', 'MR', 'OT', 'XA'], dtype=object)

In [690]:
modality_df = df[df.Modality == 'CT']

In [691]:
modality_df.PatientID.unique().size

326

## Axial Orientation == ['1', '0', '0', '0', '1', '0']

In [692]:
df[df.ImageOrientationPatient == '1\\0\\0\\0\\1\\0'].PatientID.unique().size

306

Defining 0.1 tolerance for Axial Orientation

In [693]:
def tolerance(value, target, tol=0.1):
    """Return True if every orientation component is within `tol` of `target`."""
    if pd.isna(value):  # A missing orientation never matches
        return False
    
    # Split the backslash-separated strings into float components
    components = list(map(float, str(value).split('\\')))
    target_components = list(map(float, target.split('\\')))
    
    return all(abs(comp - tgt) <= tol for comp, tgt in zip(components, target_components))

# Target string to compare against
target_value = '1\\0\\0\\0\\1\\0'

# Filtering the DataFrame based on closeness to the target value
axial_df = modality_df[modality_df.ImageOrientationPatient.apply(lambda x: tolerance(x, target_value))]

In [694]:
axial_df.PatientID.unique().size

319

In [695]:
axial_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4324 entries, 1 to 8447
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PatientID                4324 non-null   object 
 1   StudyInstanceUID         4324 non-null   object 
 2   SeriesInstanceUID        4324 non-null   object 
 3   StudyDescription         4324 non-null   object 
 4   SeriesDescription        4324 non-null   object 
 5   PixelSpacing             4324 non-null   object 
 6   SliceThickness           4294 non-null   float64
 7   ConvolutionKernel        3165 non-null   object 
 8   ImageOrientationPatient  4324 non-null   object 
 9   Modality                 4324 non-null   object 
dtypes: float64(1), object(9)
memory usage: 371.6+ KB


## Slice Thickness

In [696]:
thickness_df = axial_df[(axial_df.SliceThickness >= 2.5) & (axial_df.SliceThickness <= 5)]

In [697]:
thickness_df.PatientID.unique().size

305

In [698]:
thickness_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 867 entries, 1 to 8393
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PatientID                867 non-null    object 
 1   StudyInstanceUID         867 non-null    object 
 2   SeriesInstanceUID        867 non-null    object 
 3   StudyDescription         867 non-null    object 
 4   SeriesDescription        867 non-null    object 
 5   PixelSpacing             867 non-null    object 
 6   SliceThickness           867 non-null    float64
 7   ConvolutionKernel        688 non-null    object 
 8   ImageOrientationPatient  867 non-null    object 
 9   Modality                 867 non-null    object 
dtypes: float64(1), object(9)
memory usage: 74.5+ KB


## Series Description

Words to take into account and further study: Stroke, Neuro,..?

In [699]:
df.SeriesDescription.unique().size

772

In [700]:
solution_df.SeriesDescription.unique()

array(['Head__5_0__J37s__1', 'Head__3_0__J30s', 'Head_5_0',
       'brain_Head_3_0', 'noncontrast_Head_3_0', 'brain_ST_Head_3_0',
       'Head_SPIRAL_Spiral', 'Head__3_0__J37s__1', 'AXIAL_HEAD',
       'noncontrast_Head_3_000', 'Head_3_0__Axial____FC64',
       'HEAD_3_75mm_Soft', 'Head_WO__3_0__J30f__SOFT'], dtype=object)

In [701]:
# Keep series whose description contains 'head', 'brain' or 'Topogram' (case-insensitive)
seriesdesc_df = thickness_df[
    thickness_df.SeriesDescription.str.contains('head|brain|Topogram', case=False, na=False)
]

In [702]:
seriesdesc_df.PatientID.unique().size

291

In [703]:
seriesdesc_df.SeriesDescription.unique().size

58

In [507]:
seriesdesc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 659 entries, 1 to 8393
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PatientID                659 non-null    object 
 1   StudyInstanceUID         659 non-null    object 
 2   SeriesInstanceUID        659 non-null    object 
 3   StudyDescription         659 non-null    object 
 4   SeriesDescription        659 non-null    object 
 5   PixelSpacing             659 non-null    object 
 6   SliceThickness           659 non-null    float64
 7   ConvolutionKernel        651 non-null    object 
 8   ImageOrientationPatient  659 non-null    object 
 9   Modality                 659 non-null    object 
dtypes: float64(1), object(9)
memory usage: 56.6+ KB


## Study Description

Not sure about:
    
CT BRAIN WO, CTA BRAIN & CEREBRAL PERFUSION (70496, 0042T) -> included in solution

CT ANGIO HEAD & NECK W/WO (70496, 70498) -> not included in solution

CT ANGIO HEAD W/WO CONTRAST (70496) -> not included in solution

In [582]:
df.StudyDescription.unique()

array(['CT BRAIN WO CONTRAST (70450)',
       'CT BRAIN WO, CTA BRAIN & CEREBRAL PERFUSION (70496, 0042T)',
       'MRI BRAIN W/WO CONTRAST (70553)',
       'CT ANGIO HEAD & NECK W/WO (70496, 70498)',
       'CTA BRAIN & CEREBRAL PERFUSION (70496, 0042T)',
       'EXTERNAL CT-STORE & INTERPRET',
       'CT ANGIO HEAD W/WO CONTRAST (70496)', 'EXTERNAL CT - STORE ONLY',
       'MRI BRAIN WO CONTRAST (70551)', 'EXTERNAL CT - STORE ONLY - RAD',
       'IR NEURO RADIOLOGY PROCEDURE',
       'EXTERNAL CT BRAIN - STORE ONLY - RAD',
       'EXTERNAL CTA HEAD - STORE ONLY - RAD',
       'EXTERNAL CT BRAIN - STORE ONLY', 'EXTERNAL CTA HEAD - STORE ONLY',
       'EXTERNAL CT BRAIN INTERPRET', 'EXTERNAL CTA HEAD INTERPRET'],
      dtype=object)

In [583]:
solution_df.StudyDescription.unique()

array(['CT BRAIN WO, CTA BRAIN & CEREBRAL PERFUSION (70496, 0042T)',
       'EXTERNAL CT - STORE ONLY', 'EXTERNAL CT - STORE ONLY - RAD',
       'EXTERNAL CT BRAIN - STORE ONLY - RAD',
       'EXTERNAL CT BRAIN - STORE ONLY', 'CT BRAIN WO CONTRAST (70450)',
       'EXTERNAL CT BRAIN INTERPRET'], dtype=object)

In [584]:
# Exclude studies whose description contains 'angio' or 'CTA HEAD' ('CTA BRAIN' studies are kept)
studydesc_df = seriesdesc_df[
    (~seriesdesc_df.StudyDescription.str.contains('angio|CTA HEAD', case=False, na=False))
]

In [585]:
studydesc_df.StudyDescription.unique()

array(['CT BRAIN WO CONTRAST (70450)',
       'CT BRAIN WO, CTA BRAIN & CEREBRAL PERFUSION (70496, 0042T)',
       'EXTERNAL CT-STORE & INTERPRET',
       'CTA BRAIN & CEREBRAL PERFUSION (70496, 0042T)',
       'EXTERNAL CT - STORE ONLY - RAD', 'EXTERNAL CT - STORE ONLY',
       'EXTERNAL CT BRAIN - STORE ONLY - RAD',
       'EXTERNAL CT BRAIN INTERPRET', 'EXTERNAL CT BRAIN - STORE ONLY'],
      dtype=object)

In [586]:
studydesc_df.PatientID.unique().size

287

In [587]:
studydesc_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 626 entries, 1 to 8393
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PatientID                626 non-null    object 
 1   StudyInstanceUID         626 non-null    object 
 2   SeriesInstanceUID        626 non-null    object 
 3   StudyDescription         626 non-null    object 
 4   SeriesDescription        626 non-null    object 
 5   PixelSpacing             626 non-null    object 
 6   SliceThickness           626 non-null    float64
 7   ConvolutionKernel        618 non-null    object 
 8   ImageOrientationPatient  626 non-null    object 
 9   Modality                 626 non-null    object 
dtypes: float64(1), object(9)
memory usage: 53.8+ KB


## Pixel Spacing

Check whether a threshold is needed for this value.

In [600]:
df.PixelSpacing.unique().size

863

In [604]:
solution_df.PixelSpacing.unique()

array(['[0.46875, 0.46875]', '[0.47265625, 0.47265625]', '[0.417, 0.417]',
       '[0.456, 0.456]', '[0.517, 0.517]', '[0.430, 0.430]',
       '[0.455078125, 0.455078125]', '[0.39086294416244, 0.390625]',
       '[0.390625, 0.390625]', '[0.527000, 0.527000]', '[0.429, 0.429]',
       '[0.451171875, 0.451171875]', '[0.454, 0.454]',
       '[0.488281, 0.488281]', '[0.459, 0.459]', '[0.4296875, 0.4296875]',
       '[0.38850174216028, 0.388671875]'], dtype=object)
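The example solution stores PixelSpacing as a bracketed list (`'[0.46875, 0.46875]'`), while the inventory uses backslash-separated values (`'0.46875\0.46875'`). A minimal sketch of a parser that accepts both, so the two sources can be compared, is shown below. The helper name `parse_pixel_spacing` is hypothetical, and the sample strings are copied from the outputs in this notebook.

```python
import ast

def parse_pixel_spacing(value):
    """Return (row, col) spacing in mm from either string format, or None."""
    text = str(value).strip()
    try:
        if text.startswith("["):
            parts = ast.literal_eval(text)   # bracketed list format
        else:
            parts = text.split("\\")         # backslash-separated format
        row, col = (float(p) for p in parts)
    except (ValueError, SyntaxError, TypeError):
        return None  # missing or malformed value
    return row, col

# A few values taken from solution_df.PixelSpacing.unique()
solution_values = [
    "[0.46875, 0.46875]",
    "[0.517, 0.517]",
    "[0.527000, 0.527000]",
    "[0.39086294416244, 0.390625]",
]
largest = max(max(parse_pixel_spacing(v)) for v in solution_values)
print(largest)

# One value in the inventory format
print(parse_pixel_spacing("1.033203125\\1.033203125"))
```

The largest spacing among the example solution values listed above is 0.527 mm, which is consistent with a cutoff of 0.53 mm.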

In [621]:
studydesc_df.PixelSpacing.unique()

array(['0.46875\\0.46875', '0.36156351791531\\0.361328125',
       '0.25390625\\0.25390625', '0.48046875\\0.48046875',
       '0.392578125\\0.392578125', '0.392578125\\0.39279869067103',
       '0.396484375\\0.396484375', '0.37890625\\0.37890625',
       '0.427734375\\0.427734375', '0.41030534351145\\0.41015625',
       '0.33177570093458\\0.33203125', '0.515625\\0.515625',
       '0.488281\\0.488281', '1.033203125\\1.033203125',
       '1.029296875\\1.029296875', '0.51953125\\0.51953125',
       '1.02734375\\1.02734375', '0.9765625\\0.9765625',
       '1.03125\\1.03125', '0.978515625\\0.978515625',
       '1.03515625\\1.03515625', '0.98046875\\0.98046875',
       '0.541015625\\0.541015625', '0.533203125\\0.533203125',
       '0.468\\0.468', '0.98828125\\0.98828125', '0.430\\0.430',
       '0.47265625\\0.47265625', '0.456\\0.456', '0.527000\\0.527000',
       '0.435546875\\0.435546875', '0.4296875\\0.4296875'], dtype=object)

In [630]:
def filter_pixel_spacing(value):
    # Split the PixelSpacing string
    values = value.split('\\')
    
    # Convert the split values into floats
    val1 = float(values[0])
    val2 = float(values[1])
    
    # Check if both values are less than or equal to 0.53
    return val1 <= 0.53 and val2 <= 0.53

In [631]:
pixelsp_df = studydesc_df[studydesc_df.PixelSpacing.apply(filter_pixel_spacing)]

In [632]:
pixelsp_df.PatientID.unique().size

287

In [634]:
pixelsp_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 609 entries, 1 to 8393
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PatientID                609 non-null    object 
 1   StudyInstanceUID         609 non-null    object 
 2   SeriesInstanceUID        609 non-null    object 
 3   StudyDescription         609 non-null    object 
 4   SeriesDescription        609 non-null    object 
 5   PixelSpacing             609 non-null    object 
 6   SliceThickness           609 non-null    float64
 7   ConvolutionKernel        601 non-null    object 
 8   ImageOrientationPatient  609 non-null    object 
 9   Modality                 609 non-null    object 
dtypes: float64(1), object(9)
memory usage: 52.3+ KB


## Convolution Kernel

In [638]:
df.ConvolutionKernel.unique() 

array(['T20f', 'J37s\\1', 'J70h\\1', 'Tr20f', 'Hc40s\\2', 'Br64s\\2', nan,
       'T80f', 'H20f', 'Hr36d', 'H41s', 'H70h', 'Hr38s\\2', 'Br64s\\3',
       'J37f\\1', 'Q40f\\1', 'Hr68h\\2', 'J30f\\2', 'T20s', 'H37s',
       'H40s', 'STANDARD', 'BONE', 'H41f', 'UC', 'H31f', 'H60f',
       'I30f\\3', 'I40f\\3', 'B30f', 'H31s', 'FC03', 'FL04', 'FL03',
       'FC43', 'J37s\\3', 'J70h\\3', 'Hc40s', 'I40s\\3', 'B60s', 'B70s',
       'Hc40s\\3', 'Bv36d\\3', 'Bv36d', 'LUNG', 'J37f\\3', 'H40f', 'H10s',
       'I80s\\1', 'J40s\\2', 'I30f\\2', 'J30f\\3', 'H60s', '01', '54',
       '42', '12', 'SOFT', 'FC21', 'FC30', 'J30s\\1', 'FC26', 'I40f\\2',
       'B20f', 'FC68', 'FC35', 'BONEPLUS', 'FC41', 'H50s', 'FC64',
       'Qr36d\\3', 'Qr40s\\2'], dtype=object)

In [639]:
solution_df.ConvolutionKernel.unique()

array(["['J37s', '1']", "['J30s', '1']", 'FC26', 'FC68', 'FC21', 'H31f',
       'FC64', 'STANDARD', "['J30f', '3']"], dtype=object)

In [663]:
pixelsp_df.ConvolutionKernel.unique()

array(['J37s\\1', 'J70h\\1', 'Hc40s\\2', 'Br64s\\2', 'H41s', 'H70h',
       'Hr38s\\2', nan, 'J37f\\1', 'Q40f\\1', 'Hr68h\\2', 'H37s', 'H40s',
       'STANDARD', 'BONE', 'H41f', 'Hc40s\\3', 'J40s\\2', 'J70h\\3', '54',
       'FC21', 'J30s\\1', 'FC68', 'FC30', 'H31s', 'J30f\\3', 'H60f'],
      dtype=object)

## Further analysis: are there duplicated SeriesInstanceUIDs?

Duplicated SeriesInstanceUIDs were explored here, but deduplication was not applied in the final pipeline.

In [641]:
filtered_df = df[~df.duplicated(subset=['SeriesInstanceUID'], keep='first')]

In [642]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8299 entries, 0 to 8453
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PatientID                8299 non-null   object 
 1   StudyInstanceUID         8299 non-null   object 
 2   SeriesInstanceUID        8299 non-null   object 
 3   StudyDescription         8299 non-null   object 
 4   SeriesDescription        8298 non-null   object 
 5   PixelSpacing             8213 non-null   object 
 6   SliceThickness           7267 non-null   float64
 7   ConvolutionKernel        4943 non-null   object 
 8   ImageOrientationPatient  7565 non-null   object 
 9   Modality                 8299 non-null   object 
dtypes: float64(1), object(9)
memory usage: 713.2+ KB


In [643]:
filtered_df.PatientID.unique().size

323

In [644]:
filtered_df.StudyInstanceUID.unique().size

502

In [645]:
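# Patients present in the inventory but absent after deduplication.
# Assumption: the undefined name `sfiltered_df` was meant to be `df`
# (326 patients in the inventory vs 323 after deduplication).
missing_patients = set(df.PatientID) - set(filtered_df.PatientID)
missing_patients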

{'D2', 'D44', 'D56'}

In [646]:
# Get every row whose SeriesInstanceUID appears more than once.
# This must run on df: filtered_df is already deduplicated, so it would return nothing.
duplicate_series = df[df.SeriesInstanceUID.duplicated(keep=False)]

# Extract the unique duplicated SeriesInstanceUID values
unique_duplicate_series = duplicate_series.SeriesInstanceUID.unique()

## Compilation of Previous code - End of First Task

In [655]:
import pandas as pd

# Load the dataset; low_memory=False reads the file in one pass so column dtypes are inferred consistently
df = pd.read_csv('data/part1_inventory_test.csv', low_memory=False)

# Define the target columns to be used in the analysis
target_columns = ['PatientID',
                  'StudyInstanceUID',
                  'SeriesInstanceUID',
                  'StudyDescription',
                  'SeriesDescription',
                  'PixelSpacing',
                  'SliceThickness',
                  'ConvolutionKernel',
                  'ImageOrientationPatient',
                  'Modality']

# Keep only the target columns and drop any column with fewer than 30% non-null values
threshold = int(df.shape[0] * 0.3)
df = df[target_columns].dropna(axis=1, thresh=threshold)

# Filter for rows where the modality is 'CT' (Computed Tomography)
df = df[df.Modality == 'CT']

# Function to check if the ImageOrientationPatient values are within a tolerance of the target
def tolerance(value, target, tol=0.1):
    """Check if the orientation components of `value` are within `tol` of `target`."""
    if pd.isna(value):  # Handle missing values
        return False
    
    # Convert value and target into lists of float components
    components = list(map(float, str(value).split('\\')))
    target_components = list(map(float, target.split('\\')))
    
    # Check if all components are within the tolerance range
    return all(abs(comp - tgt) <= tol for comp, tgt in zip(components, target_components))

# Define the target axial orientation to compare against
target_orientation = '1\\0\\0\\0\\1\\0'

# Filter based on closeness to the target axial orientation
df = df[df.ImageOrientationPatient.apply(lambda x: tolerance(x, target_orientation))]

# Filter rows where Slice Thickness is between 2.5 and 5 mm
df = df[(df.SliceThickness >= 2.5) & (df.SliceThickness <= 5)]

# Filter SeriesDescription to include only rows with 'head' or 'brain' (case insensitive)
df = df[df.SeriesDescription.str.contains('head|brain', case=False, na=False)]

# Filter StudyDescription to exclude rows with 'CTA HEAD' or 'Angio' (case insensitive)
df = df[~df.StudyDescription.str.contains('angio|CTA HEAD', case=False, na=False)]

# Function to filter PixelSpacing values where both components are less than or equal to 0.53 mm
def filter_pixel_spacing(value):
    """Return True if both components of PixelSpacing are <= 0.53 mm, otherwise False."""
    try:
        # Split the PixelSpacing into two values and convert them to floats
        val1, val2 = map(float, value.split('\\'))
    except (ValueError, AttributeError):
        return False  # Handle any potential issues with missing or malformed data
    
    # Check if both values are less than or equal to the threshold (0.53 mm)
    return val1 <= 0.53 and val2 <= 0.53

# Apply PixelSpacing filter to keep rows with PixelSpacing <= 0.53 mm for both dimensions
final_df = df[df.PixelSpacing.apply(filter_pixel_spacing)]

# Display the resulting filtered DataFrame
final_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 609 entries, 1 to 8393
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PatientID                609 non-null    object 
 1   StudyInstanceUID         609 non-null    object 
 2   SeriesInstanceUID        609 non-null    object 
 3   StudyDescription         609 non-null    object 
 4   SeriesDescription        609 non-null    object 
 5   PixelSpacing             609 non-null    object 
 6   SliceThickness           609 non-null    float64
 7   ConvolutionKernel        601 non-null    object 
 8   ImageOrientationPatient  609 non-null    object 
 9   Modality                 609 non-null    object 
dtypes: float64(1), object(9)
memory usage: 52.3+ KB
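The example solution can double as a sanity check on the filtered result: every selected series in example_solution.csv should still be present among the candidates. The sketch below uses made-up stand-ins for `final_df` and `solution_df`; the frame names `candidates` and `solution` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the filtered inventory and the example solution.
candidates = pd.DataFrame({
    "PatientID": ["P1", "P1", "P2"],
    "SeriesInstanceUID": ["1.1", "1.2", "2.1"],
})
solution = pd.DataFrame({
    "PatientID": ["P1", "P3"],
    "SeriesInstanceUID": ["1.2", "3.1"],
})

# Flag the solution series that survived the filters.
found = solution.SeriesInstanceUID.isin(candidates.SeriesInstanceUID)
missed = solution[~found]

print(f"Solution series retained: {found.mean():.0%}")
print(missed.PatientID.tolist())
```

Any patient listed in `missed` points at a filter that is too strict for that series and is worth inspecting by hand.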


## ETL Integration

In [656]:
import pandas as pd
import logging

# Set up logging for the ETL process
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def extract_data(file_path):
    """Extract data from CSV file."""
    try:
        logging.info("Extracting data from CSV file.")
        df = pd.read_csv(file_path, low_memory=False)
        logging.info(f"Data extraction complete. Rows: {df.shape[0]}, Columns: {df.shape[1]}")
        return df
    except Exception as e:
        logging.error(f"Error during data extraction: {e}")
        raise

def transform_data(df):
    """Transform the data by filtering and processing."""
    logging.info("Starting data transformation.")

    # Define the target columns to be used in the analysis
    target_columns = ['PatientID',
                      'StudyInstanceUID',
                      'SeriesInstanceUID',
                      'StudyDescription',
                      'SeriesDescription',
                      'PixelSpacing',
                      'SliceThickness',
                      'ConvolutionKernel',
                      'ImageOrientationPatient',
                      'Modality']

    # Keep only the target columns and drop any column with fewer than 30% non-null values
    threshold = int(df.shape[0] * 0.3)
    df = df[target_columns].dropna(axis=1, thresh=threshold)

    # Filter for rows where the modality is 'CT'
    df = df[df.Modality == 'CT']

    # Function to check if the ImageOrientationPatient values are within a tolerance of the target
    def tolerance(value, target, tol=0.1):
        if pd.isna(value):
            return False
        components = list(map(float, str(value).split('\\')))
        target_components = list(map(float, target.split('\\')))
        return all(abs(comp - tgt) <= tol for comp, tgt in zip(components, target_components))

    # Target orientation to compare against
    target_orientation = '1\\0\\0\\0\\1\\0'
    df = df[df.ImageOrientationPatient.apply(lambda x: tolerance(x, target_orientation))]

    # Filter rows where Slice Thickness is between 2.5 and 5 mm
    df = df[(df.SliceThickness >= 2.5) & (df.SliceThickness <= 5)]

    # Filter SeriesDescription to include only rows with 'head' or 'brain'
    df = df[df.SeriesDescription.str.contains('head|brain', case=False, na=False)]

    # Filter StudyDescription to exclude rows with 'CTA HEAD' or 'Angio'
    df = df[~df.StudyDescription.str.contains('angio|CTA HEAD', case=False, na=False)]

    # Function to filter PixelSpacing values
    def filter_pixel_spacing(value):
        try:
            val1, val2 = map(float, value.split('\\'))
        except (ValueError, AttributeError):
            return False
        return val1 <= 0.53 and val2 <= 0.53

    # Apply PixelSpacing filter
    df = df[df.PixelSpacing.apply(filter_pixel_spacing)]

    logging.info(f"Data transformation complete. Rows after transformation: {df.shape[0]}")
    return df

def load_data(df, output_file):
    """Load transformed data to a CSV file."""
    try:
        logging.info(f"Loading data to {output_file}.")
        df.to_csv(output_file, index=False)
        logging.info("Data loading complete.")
    except Exception as e:
        logging.error(f"Error during data loading: {e}")
        raise

def main():
    """Main ETL function to orchestrate the process."""
    input_file = 'data/part1_inventory_test.csv'
    output_file = 'data/transformed_inventory_data.csv'
    
    # Extract, Transform, Load process
    try:
        df = extract_data(input_file)
        transformed_df = transform_data(df)
        load_data(transformed_df, output_file)
    except Exception as e:
        logging.error(f"ETL process failed: {e}")

if __name__ == "__main__":
    main()


2024-10-08 17:20:28,514 - INFO - Extracting data from CSV file.
2024-10-08 17:20:28,690 - INFO - Data extraction complete. Rows: 8454, Columns: 206
2024-10-08 17:20:28,691 - INFO - Starting data transformation.
2024-10-08 17:20:28,740 - INFO - Data transformation complete. Rows after transformation: 609
2024-10-08 17:20:28,741 - INFO - Loading data to data/transformed_inventory_data.csv.
2024-10-08 17:20:28,766 - INFO - Data loading complete.


# Analysis

In [None]:
import pandas as pd

In [659]:
file_path = 'data/part2_inferences.csv'

In [660]:
df2 = pd.read_csv(file_path, low_memory=False)

In [674]:
df2[~df2.Patient_name.isin(pixelsp_df.PatientID)]

Unnamed: 0,Patient_name,Model_1,Model_2,Ground_truth
17,K80,0.524756,0.802197,0
32,K67,0.065052,0.330898,1
42,G136,0.034389,0.760785,1
44,K50,0.25878,0.770967,0
45,G116,0.662522,0.493796,1


In [712]:
df[df.PatientID == 'K50'].ImageOrientationPatient.unique()

array(['0\\1\\0\\0\\0\\-1',
       '1\\0\\0\\0\\0.99254615164132\\-0.1218693434051',
       '0.99416247517374\\-0.007978126208\\0.10759796679597\\-0.000924026793\\0.99659619588001\\0.082432812229',
       '0.99416247517374\\-0.007978126208\\0.10759796679597\\0.10788938376703\\0.08205103823703\\-0.9907711683303',
       '-0.000924027214\\0.99659619588342\\0.08243281218303\\0.10788938337703\\0.08205103204803\\-0.9907711688853'],
      dtype=object)
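The second orientation above looks like a tilted axial acquisition: its last component is about -0.12, which falls outside the 0.1 tolerance, and this may be why K50 drops out of the filtered set. One possible alternative, sketched below and not used anywhere in this notebook, is to classify the slice plane from the normal vector (the cross product of the row and column direction cosines) and call a series axial when the z component dominates. The helper name `slice_plane` is hypothetical.

```python
import numpy as np

def slice_plane(orientation, sep="\\"):
    """Classify an ImageOrientationPatient string by its dominant slice normal."""
    values = np.array([float(v) for v in str(orientation).split(sep)])
    if values.size != 6:
        return None  # missing or malformed orientation
    # Normal to the slice = row direction x column direction
    normal = np.cross(values[:3], values[3:])
    return ("sagittal", "coronal", "axial")[int(np.argmax(np.abs(normal)))]

tilted = "1\\0\\0\\0\\0.99254615164132\\-0.1218693434051"

# Component-wise deviation from the ideal axial orientation 1\0\0\0\1\0
target = np.array([1, 0, 0, 0, 1, 0], dtype=float)
deviation = np.abs(np.array([float(v) for v in tilted.split("\\")]) - target).max()

print(deviation > 0.1)                   # exceeds the 0.1 tolerance
print(slice_plane(tilted))               # still classified as axial
print(slice_plane("0\\1\\0\\0\\0\\-1"))  # first K50 orientation above
```

With this approach the tilted series is still treated as axial, while the first K50 orientation is classified as sagittal.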

In [713]:
df[df.PatientID == 'K50'].SeriesDescription.unique()

array(['Topogram__0_6__T20s', 'Head__5_0__J37s__1', 'Head__3_0__J70h__1',
       'Head__5_0__Axial', 'Head__5_0__Coronal', 'Head__5_0__Sagittal'],
      dtype=object)

In [714]:
df[df.PatientID == 'K50'].StudyDescription.unique()

array(['CT BRAIN WO CONTRAST (70450)'], dtype=object)

In [715]:
df[df.PatientID == 'K50'].PixelSpacing.unique()

array(['1\\1', '0.46875\\0.46875', '0.3828125\\0.3828125',
       '0.32421875\\0.32397003745318', '0.32421875\\0.32396694214876'],
      dtype=object)