# RSNA Lumbar Spine Degenerative Classification - Inference with Pre-trained CNN Model

## Introduction

Lumbar spine degeneration is a common condition affecting a large portion of the global population, often leading to pain and mobility issues. Accurate classification and diagnosis of this condition through imaging, particularly MRI and X-ray scans, is crucial for timely and effective treatment. In this notebook, we will perform inference using a pre-trained Convolutional Neural Network (CNN) model to classify lumbar spine degeneration based on imaging data.

Our model has been pre-trained on the **RSNA 2024 Lumbar Spine Dataset**, which includes labeled MRI series of patients with varying levels of degenerative disease. The goal is to use the trained model to make predictions on new, unseen images, assigning each one a severity grade: normal/mild, moderate, or severe.

## Objective

The main objective of this notebook is to load the pre-trained CNN model and use it for inference on new test data. We will perform the following steps:

1. **Load the pre-trained model:** We'll load the saved CNN model that was trained to classify lumbar spine degeneration.
2. **Preprocess new images:** New test images will be preprocessed to match the input format expected by the model.
3. **Make predictions:** The model will be used to predict the level of degeneration for each image.
4. **Interpret the results:** We will interpret the model's output and map predictions to the corresponding categories of degeneration.

## Dataset Information

The RSNA 2024 dataset consists of labeled lumbar spine MRI studies, with each condition graded by severity of degeneration. This classification helps healthcare professionals assess the severity of degeneration, which aids in treatment planning.

## Pre-trained Model

The CNN model was trained using transfer learning, leveraging a state-of-the-art architecture that has been fine-tuned on the RSNA dataset for better performance on spine degeneration classification.


In [1]:
import pandas as pd 
import pydicom

### Loading the Dataset

In this section, we define the path to the dataset and load the test series descriptions, which will help us understand the data we'll be working with.

1. **Dataset Path:**
   - We define `train_path`, which, despite its name, points to the root of the dataset in a Kaggle directory. This path is used to load various files related to lumbar spine degeneration classification.
   
2. **Loading Test Series Descriptions:**
   - We use the `pd.read_csv()` function to load a CSV file named `test_series_descriptions.csv`. This file contains important information about the test data series that we will use for inference.

3. **Dataset Overview:**
   - The `info()` method is used to display a summary of the DataFrame, which includes the total number of entries, column names, data types, and the count of non-null values. This provides a quick understanding of the dataset's structure and whether there are any missing values.
   
By loading this information, we ensure we have a clear view of the test dataset before proceeding with further analysis and inference.


In [2]:
# the path to the dataset

train_path = '/kaggle/input/rsna-2024-lumbar-spine-degenerative-classification/'

test_description = pd.read_csv(train_path + 'test_series_descriptions.csv')


In [3]:
test_description.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   study_id            3 non-null      int64 
 1   series_id           3 non-null      int64 
 2   series_description  3 non-null      object
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes


### Generating Image Paths

In this section, we define a function that helps us generate the paths to the test images based on the dataset’s directory structure. This is crucial as the images are stored in a nested folder format.

1. **Importing Necessary Libraries:**
   - We import the `os` library to handle directory operations and the `cv2` (OpenCV) library for image processing. OpenCV is not used in this cell; it is needed later for reading and resizing images.

2. **`generate_image_paths()` Function:**
   - This function generates the full paths to the images based on the directory structure. Each image is stored in a nested directory where the folders represent different `study_id` and `series_id` combinations.
   
   - **Arguments:**
     - `df`: A DataFrame that contains columns `study_id` and `series_id`, which correspond to unique identifiers for studies and series in the dataset.
     - `data_dir`: The base directory where the images are stored.
     
   - **Process:**
     - For each row in the DataFrame, the function constructs the path to the folder containing the images using `study_id` and `series_id`. 
     - It then lists all the files (images) in the directory and generates the full path for each image by combining the folder path with the image filename.
     - These image paths are collected in a list, `image_paths`, which is returned at the end.

3. **Applying the Function:**
   - The function is applied to the `test_description` DataFrame using the test image directory path (`f'{train_path}/test_images'`). This results in a list of full paths to all test images, which is stored in `test_image_paths`.

By generating the full image paths, we prepare the data for later stages such as image loading and preprocessing, which are necessary before performing inference with the CNN model.
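
As a rough illustration of the directory layout the function expects, the following sketch builds a throwaway `study_id/series_id` tree in a temporary directory and lists its files. The IDs and file names are made up, and `list_series_files` is a hypothetical helper, not part of this notebook. Sorting the listing is an optional choice added here, because `os.listdir()` returns entries in arbitrary order.

```python
import os
import tempfile


def list_series_files(data_dir, study_id, series_id):
    """Return sorted full paths of the files in data_dir/study_id/series_id."""
    series_dir = os.path.join(data_dir, str(study_id), str(series_id))
    # os.listdir() gives no ordering guarantee, so sort for reproducibility.
    return [os.path.join(series_dir, name) for name in sorted(os.listdir(series_dir))]


# Build a tiny fake dataset: one study, one series, three empty placeholder files.
demo_root = tempfile.mkdtemp()
demo_series_dir = os.path.join(demo_root, "101", "202")
os.makedirs(demo_series_dir)
for name in ("2.dcm", "1.dcm", "3.dcm"):
    open(os.path.join(demo_series_dir, name), "wb").close()

demo_paths = list_series_files(demo_root, 101, 202)
demo_names = [os.path.basename(p) for p in demo_paths]
print(demo_names)
```

Note that `sorted()` orders names as strings, so `10.dcm` would come before `2.dcm`. If slice order matters, a numeric sort key would be needed.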


In [4]:
import os
import cv2

# Function to generate image paths based on directory structure
def generate_image_paths(df, data_dir):
    image_paths = []
    for study_id, series_id in zip(df['study_id'], df['series_id']):
        study_dir = os.path.join(data_dir, str(study_id))
        series_dir = os.path.join(study_dir, str(series_id))
        images = os.listdir(series_dir)
        image_paths.extend([os.path.join(series_dir, img) for img in images])
    return image_paths


test_image_paths = generate_image_paths(test_description, f'{train_path}/test_images')

### Condition Mapping

This section defines a dictionary, `condition_mapping`, which maps each MRI series description (imaging plane and sequence) to the lumbar spine conditions assessed on that series.

1. **Purpose:**
   - The `condition_mapping` dictionary is used to associate specific image views (such as Sagittal and Axial) with conditions affecting the lumbar spine. These mappings will help us interpret the model's output and classify the type of spine degeneration based on which view of the spine is being analyzed.

2. **Key-Value Pairs:**
   - Each key in the dictionary is a series description, that is, an imaging plane combined with an MRI sequence:
     - **Sagittal T1:** This refers to images in the sagittal plane (side view) taken using T1-weighted MRI scans. The dictionary specifies two conditions, one for the left side and one for the right side:
       - `left_neural_foraminal_narrowing`: Narrowing of the neural foramen on the left side.
       - `right_neural_foraminal_narrowing`: Narrowing of the neural foramen on the right side.
   
     - **Axial T2:** This refers to images in the axial plane (cross-sectional view) using T2-weighted MRI scans. The conditions mapped are:
       - `left_subarticular_stenosis`: Subarticular stenosis (narrowing) on the left side.
       - `right_subarticular_stenosis`: Subarticular stenosis on the right side.
   
     - **Sagittal T2/STIR:** This corresponds to sagittal images using T2-weighted or STIR (Short-TI Inversion Recovery) techniques. The mapped condition is:
       - `spinal_canal_stenosis`: Narrowing of the spinal canal.

3. **Use Case:**
   - This mapping is important because different image types highlight different anatomical structures and conditions. For example, sagittal views are more useful for assessing neural foraminal narrowing, while axial views can show subarticular stenosis. By using this dictionary, we can automatically determine which condition is related to the image type being processed.

This mapping provides a structured way to interpret the model's predictions based on the specific MRI sequence used for each image.
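
Because the dictionary values are either a nested dictionary or a plain string, any lookup has to handle both shapes. The sketch below shows one way to do that. `conditions_for` is a hypothetical helper introduced only for illustration, and `"Coronal T2"` is a made-up description used to show the fallback for unknown keys.

```python
# Copy of the mapping used in this notebook.
condition_mapping = {
    'Sagittal T1': {'left': 'left_neural_foraminal_narrowing', 'right': 'right_neural_foraminal_narrowing'},
    'Axial T2': {'left': 'left_subarticular_stenosis', 'right': 'right_subarticular_stenosis'},
    'Sagittal T2/STIR': 'spinal_canal_stenosis'
}


def conditions_for(series_description):
    """Return the list of condition names for a series description."""
    entry = condition_mapping.get(series_description)
    if entry is None:
        return []  # unknown description: nothing to predict
    if isinstance(entry, str):
        return [entry]  # single condition with no left/right split
    return list(entry.values())


axial_conditions = conditions_for("Axial T2")
canal_conditions = conditions_for("Sagittal T2/STIR")
unknown_conditions = conditions_for("Coronal T2")
print(axial_conditions, canal_conditions, unknown_conditions)
```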


In [5]:
condition_mapping = {
    'Sagittal T1': {'left': 'left_neural_foraminal_narrowing', 'right': 'right_neural_foraminal_narrowing'},
    'Axial T2': {'left': 'left_subarticular_stenosis', 'right': 'right_subarticular_stenosis'},
    'Sagittal T2/STIR': 'spinal_canal_stenosis'
}


### Expanding the Dataset with Image Paths and Conditions

In this section, we are expanding the test dataset by associating each image with its corresponding condition(s) based on the series description. The result is a new DataFrame (`test_df`) that contains additional details about each image.

#### Key Components:

1. **`get_image_paths(row)` Function:**
   - This function generates the file paths for all images associated with a particular row (study and series) in the DataFrame.
   - **Parameters:**
     - `row`: A row from the `test_description` DataFrame containing information such as `study_id` and `series_id`.
   - **Process:**
     - The function constructs the path to the series folder by joining `base_path` with the study and series IDs.
     - If the folder exists, it lists all files in that directory and returns their full paths.
     - If the folder doesn't exist, it returns an empty list.

2. **Expanding the Rows:**
   - The `expanded_rows` list will store the expanded data for each image and its corresponding condition(s).
   - We iterate over each row in the `test_description` DataFrame using `iterrows()` to access the details of each study and series.

3. **Mapping Conditions to Series Descriptions:**
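   - For each row, we use `condition_mapping` to retrieve the relevant conditions based on the `series_description`. If a single condition is returned (e.g., for `Sagittal T2/STIR`), it is converted into a dictionary that assigns the same condition to both the left and right keys. As a result, each image of such a series appears twice in the expanded data with identical values. If the series description is not in the mapping, an empty dictionary is used and no rows are produced.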
   - If multiple conditions (e.g., left and right for neural foraminal narrowing) are returned, they are stored in a dictionary format.

4. **Appending Expanded Data:**
   - For each condition (left or right) and image path generated by `get_image_paths()`, we create a dictionary with the following fields:
     - `study_id`: The unique identifier for the study.
     - `series_id`: The identifier for the imaging series.
     - `series_description`: A description of the imaging series (e.g., Sagittal T1, Axial T2).
     - `image_path`: The full path to the image file.
     - `condition`: The condition being assessed (e.g., left neural foraminal narrowing, spinal canal stenosis).
     - `row_id`: A unique identifier for the row, constructed by concatenating the `study_id` and the condition.

   - This dictionary is appended to the `expanded_rows` list, essentially "flattening" the relationship between series, images, and conditions.

5. **Creating the Final DataFrame (`test_df`):**
   - Finally, the expanded data stored in `expanded_rows` is converted into a new DataFrame (`test_df`). This DataFrame contains additional rows, with each row representing a unique image and its corresponding condition.

#### Purpose:
This approach expands the test dataset so that each individual image is associated with the correct condition(s) based on the type of imaging series it belongs to. This structure is crucial for making accurate predictions during inference, as each image is linked with the medical condition that the model needs to classify.

By organizing the data in this way, we ensure that the CNN model can be applied effectively to each test image and its corresponding medical condition.
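
The number of rows produced for a series is the number of images multiplied by the number of entries in its condition dictionary. The sketch below works through that arithmetic on made-up data for a `Sagittal T2/STIR` series, where the single condition is copied to both sides, and shows how `drop_duplicates()` could collapse the repeated rows. This is an optional variation, not what the notebook does.

```python
import pandas as pd

# Made-up example: one Sagittal T2/STIR series with two images.
image_paths = ["img_1.dcm", "img_2.dcm"]
conditions = {"left": "spinal_canal_stenosis", "right": "spinal_canal_stenosis"}

# Same nesting as the notebook: conditions in the outer loop, images in the inner loop.
rows = [
    {"study_id": 1, "image_path": path, "condition": condition}
    for side, condition in conditions.items()
    for path in image_paths
]
demo_df = pd.DataFrame(rows)

# Rows that are identical in every column are collapsed into one.
deduplicated_df = demo_df.drop_duplicates().reset_index(drop=True)

print(len(demo_df), len(deduplicated_df))  # prints: 4 2
```

Duplicated rows do not change a per-`row_id` mean when every image is duplicated equally, but they do double the inference work. Removing them also changes row positions, which matters for any later step that depends on the row index.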


In [6]:
base_path = '/kaggle/input/rsna-2024-lumbar-spine-degenerative-classification/test_images/'


In [7]:
def get_image_paths(row):
    series_path = os.path.join(base_path, str(row['study_id']), str(row['series_id']))
    if os.path.exists(series_path):
        return [os.path.join(series_path, f) for f in os.listdir(series_path) if os.path.isfile(os.path.join(series_path, f))]
    return []


In [8]:
expanded_rows = []
for index, row in test_description.iterrows():
    image_paths = get_image_paths(row)
    conditions = condition_mapping.get(row['series_description'], {})
    if isinstance(conditions, str):  # Single condition
        conditions = {'left': conditions, 'right': conditions}
    for side, condition in conditions.items():
        for image_path in image_paths:
            expanded_rows.append({
                'study_id': row['study_id'],
                'series_id': row['series_id'],
                'series_description': row['series_description'],
                'image_path': image_path,
                'condition': condition,
                'row_id': f"{row['study_id']}_{condition}"
            })


In [9]:
test_df = pd.DataFrame(expanded_rows)

### Adding Levels to the `row_id`

In this section, we modify the `row_id` to append a spinal level (e.g., L1-L2, L2-L3) to the identifier, giving it the form `study_id_condition_level`. Note that the level is assigned from the row's position in `test_df`, not derived from the image itself.

#### Key Components:

1. **Defining Levels:**
   - We define a list called `levels` that represents various spinal levels:
     - `'l1_l2'`, `'l2_l3'`, `'l3_l4'`, `'l4_l5'`, and `'l5_s1'`.
   - These levels correspond to different regions of the lumbar spine where degeneration is typically assessed.

2. **`update_row_id()` Function:**
   - This function updates the `row_id` by appending one of the spinal levels from the `levels` list.
   - **Parameters:**
     - `row`: A row from the `test_df` DataFrame.
     - `levels`: The list of spinal levels that will be cycled through for each row.
   - **Process:**
     - The function calculates the appropriate level for the current row using `row.name % len(levels)`. The modulus operator ensures that the function cycles through the levels repeatedly, so each row is assigned one of the spinal levels in sequence.
     - The new `row_id` is constructed by concatenating the `study_id`, `condition`, and the assigned level.
   
3. **Applying the `update_row_id()` Function:**
   - We use the `apply()` method to apply the `update_row_id()` function to each row in `test_df`. This updates the `row_id` column to include the spinal level.
   - The lambda function ensures that each row in `test_df` is passed to `update_row_id()`, and the updated `row_id` is returned for each row.

4. **Viewing the Updated DataFrame:**
   - After updating the `row_id`, we display the first few rows of the updated DataFrame using `head()`. This allows us to verify that the `row_id` now includes a level suffix along with the `study_id` and `condition`.

#### Purpose:
By incorporating spinal levels into the `row_id`, each identifier names a study, a condition, and a spinal level, so predictions can be reported per level. Because levels are assigned cyclically by row position, an image is not guaranteed to show the level named in its `row_id`. Whenever a series and condition have more than five image rows, several images share the same `row_id`.

Treat this as a simple heuristic for producing level-specific identifiers, not as an anatomical localization step.
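
To make the cycling behavior concrete, the sketch below reproduces the modulo assignment for seven consecutive row positions. It also shows an alternative that enumerates every study, condition, and level combination with `itertools.product`. `build_row_ids` is a hypothetical helper, and the alternative assumes the aim is exactly one `row_id` per combination, which this notebook does not state.

```python
from itertools import product

levels = ['l1_l2', 'l2_l3', 'l3_l4', 'l4_l5', 'l5_s1']

# Cyclic assignment as in this notebook: the level depends only on the row position.
cyclic_levels = [levels[i % len(levels)] for i in range(7)]


def build_row_ids(study_ids, conditions, levels):
    """Enumerate every study/condition/level combination as a row_id string."""
    return [
        f"{study_id}_{condition}_{level}"
        for study_id, condition, level in product(study_ids, conditions, levels)
    ]


all_row_ids = build_row_ids([44036939], ["spinal_canal_stenosis"], levels)
print(cyclic_levels)
print(all_row_ids)
```

After five rows the cyclic assignment wraps around to `l1_l2` again, which is why several image rows end up with the same `row_id`.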


In [10]:
# Levels for row_id
levels = ['l1_l2', 'l2_l3', 'l3_l4', 'l4_l5', 'l5_s1']

# update row_id with levels
def update_row_id(row, levels):
    level = levels[row.name % len(levels)]  
    return f"{row['study_id']}_{row['condition']}_{level}"

# Update row_id in test_df to include levels
test_df['row_id'] = test_df.apply(lambda row: update_row_id(row, levels), axis=1)

test_df.head()

|   | study_id | series_id | series_description | image_path | condition | row_id |
|---|----------|-----------|--------------------|------------|-----------|--------|
| 0 | 44036939 | 2828203845 | Sagittal T1 | /kaggle/input/rsna-2024-lumbar-spine-degenerat... | left_neural_foraminal_narrowing | 44036939_left_neural_foraminal_narrowing_l1_l2 |
| 1 | 44036939 | 2828203845 | Sagittal T1 | /kaggle/input/rsna-2024-lumbar-spine-degenerat... | left_neural_foraminal_narrowing | 44036939_left_neural_foraminal_narrowing_l2_l3 |
| 2 | 44036939 | 2828203845 | Sagittal T1 | /kaggle/input/rsna-2024-lumbar-spine-degenerat... | left_neural_foraminal_narrowing | 44036939_left_neural_foraminal_narrowing_l3_l4 |
| 3 | 44036939 | 2828203845 | Sagittal T1 | /kaggle/input/rsna-2024-lumbar-spine-degenerat... | left_neural_foraminal_narrowing | 44036939_left_neural_foraminal_narrowing_l4_l5 |
| 4 | 44036939 | 2828203845 | Sagittal T1 | /kaggle/input/rsna-2024-lumbar-spine-degenerat... | left_neural_foraminal_narrowing | 44036939_left_neural_foraminal_narrowing_l5_s1 |


### Test Dataset Class for Inference

This section defines a custom class `TestDataset` that is used to handle the loading and batching of test images for inference. The class efficiently processes images and prepares them for the model by organizing them into batches, resizing, and normalizing as needed.

#### Key Components:

1. **`__init__(self, dataframe, batch_size=16, image_size=(256, 256), normalize=False)`**
   - **Purpose:** Initializes the dataset with key parameters for image processing and batching.
   - **Parameters:**
     - `dataframe`: The DataFrame containing image paths and other relevant data (e.g., `row_id`).
     - `batch_size`: Number of images to process per batch. Defaults to 16.
     - `image_size`: The target size (width, height) to which each image will be resized. Defaults to (256, 256) pixels.
     - `normalize`: A flag to indicate whether images should be normalized (scaled to a range of 0 to 1).

2. **`load_image(self, image_path)`**
   - **Purpose:** Loads an image from the given file path and prepares it for inference.
   - **Process:**
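     - If the image is in **DICOM** format (with `.dcm` extension), it is read using the `pydicom` library and its `pixel_array` is used.
     - For other image formats (e.g., `.png`, `.jpg`), the image is read using OpenCV’s `cv2.imread()`. If OpenCV cannot load the file, a `FileNotFoundError` is raised.
     - If the pixel data is not already `uint8`, it is cast with `astype(np.uint8)`. This cast does not rescale intensities, so values above 255 (common in 16-bit DICOM data) are not preserved.
     - If the image is grayscale (2D), it is stacked into 3 channels to simulate an RGB image.
     - The image is normalized (divided by 255) if `normalize` is set to `True`.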

3. **`__getitem__(self, index)`**
   - **Purpose:** Retrieves a batch of images and their corresponding `row_id`s based on the current batch index.
   - **Process:**
     - Computes the start and end indices for the batch.
     - Loops through the DataFrame rows corresponding to this batch.
     - For each row, it loads the image using `load_image()`, resizes it to the specified `image_size`, and adds the image and its `row_id` to separate lists.
     - Converts the lists of images and `row_id`s to NumPy arrays for easier batch processing in deep learning frameworks.
   
4. **`__len__(self)`**
   - **Purpose:** Returns the number of batches in the dataset, calculated as the ceiling of the total number of images divided by the batch size. This ensures that any remainder images in the dataset will be included in the final batch.

#### How It Works:

- **Batch Processing:** The class enables batch processing by allowing the user to retrieve a batch of images and their associated `row_id`s by indexing (`__getitem__`). This is useful when working with models that expect a batch of inputs for efficient computation during inference.
  
- **Image Loading and Resizing:** Images are loaded, resized, and normalized as per the requirements of the model. This ensures that the images have the correct dimensions and value ranges for accurate predictions.

- **Grayscale to RGB Conversion:** If an image is grayscale (2D), it is converted to a 3-channel format (simulating RGB), which is a common requirement for CNN models that are trained on RGB images.

- **DICOM Handling:** Since medical images are often stored in DICOM format, the class supports loading these images using `pydicom`, ensuring compatibility with common medical imaging formats.

This custom dataset class ensures that the images are processed efficiently and in a format ready for inference with a pre-trained convolutional neural network (CNN).
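
One point worth illustrating is the `uint8` cast. On 16-bit integer data, `astype(np.uint8)` keeps only the low 8 bits of each value, so intensities wrap around instead of being scaled. The sketch below contrasts that with a min-max rescale. `to_uint8` is a hypothetical helper, and whether rescaling is appropriate depends on how the training images were prepared, which this notebook does not show.

```python
import numpy as np


def to_uint8(pixels):
    """Linearly rescale an array to the 0-255 range and cast to uint8."""
    pixels = pixels.astype(np.float32)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:
        # Constant image: avoid division by zero and return all zeros.
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# Made-up 16-bit pixel values standing in for a DICOM pixel_array.
raw = np.array([[0, 256], [512, 1024]], dtype=np.uint16)

wrapped = raw.astype(np.uint8)  # plain cast: every value here is a multiple of 256
rescaled = to_uint8(raw)        # min-max rescale keeps the relative contrast

print(wrapped)
print(rescaled)
```

In this toy example the plain cast maps every pixel to 0, while the rescaled version spans the full 0 to 255 range.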


In [11]:
import numpy as np  # used by TestDataset; imported here so this cell runs on its own

class TestDataset:
    def __init__(self, dataframe, batch_size=16, image_size=(256, 256), normalize=False):
        self.dataframe = dataframe
        self.batch_size = batch_size
        self.image_size = image_size
        self.normalize = normalize

    def load_image(self, image_path):
        if image_path.lower().endswith('.dcm'):
            dicom = pydicom.dcmread(image_path, force=True)
            image = dicom.pixel_array
        else:
            image = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
            if image is None:
                raise FileNotFoundError(f"Could not load image from {image_path}")

        # Cast to uint8 if necessary (astype does not rescale, so values above 255 are not preserved)
        if image.dtype != np.uint8:
            image = image.astype(np.uint8)

        # If the image is grayscale, stack it to make 3 channels
        if len(image.shape) == 2:
            image = np.stack([image] * 3, axis=-1)

        # Normalize the image if the flag is set
        if self.normalize:
            image = image / 255.0  # Normalization to [0, 1]

        return image

    def __getitem__(self, index):
        # Get the starting index for this batch
        start_index = index * self.batch_size
        end_index = min((index + 1) * self.batch_size, len(self.dataframe))

        images = []
        row_ids = []

        for i in range(start_index, end_index):
            row = self.dataframe.iloc[i]
            image_path = row['image_path']
            row_id = row['row_id']

            # Load and resize image
            image = self.load_image(image_path)
            image = cv2.resize(image, self.image_size)

            images.append(image)
            row_ids.append(row_id)

        # Convert the list of images and row_ids to numpy arrays
        images = np.array(images)
        row_ids = np.array(row_ids)

        return images, row_ids

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.dataframe) / self.batch_size))


### Creating an Instance of the TestDataset Class

In this section, we create an instance of the `TestDataset` class to prepare our test dataset for inference with the model.


In [12]:
test_dataset = TestDataset(test_df, batch_size=16, image_size=(256, 256), normalize=True)


### Loading the Pre-trained Model

In this section, we load a pre-trained convolutional neural network (CNN) model that has been previously trained and saved. This model will be used to make predictions on our test dataset of lumbar spine images.

#### Steps Involved:

1. **Importing Required Libraries:**
   We start by importing the necessary Keras module from TensorFlow, which provides a convenient interface for building and using deep learning models.

2. **Loading the Pre-trained Model:**
   The pre-trained model is loaded using the `load_model()` function provided by Keras. This function requires the file path of the saved model as an argument.

   - **Model Path:**
     - The specified path is where the model was saved. It is important to ensure that this path is correct and that the model file is accessible in the current environment. This path suggests that the model is stored on Kaggle's platform.

3. **Purpose of Loading the Model:**
   By loading the pre-trained model, we can utilize the knowledge that has been acquired during its training. The model is equipped to recognize patterns and features relevant to the classification of lumbar spine degenerative conditions based on the training data it has seen.

4. **Next Steps:**
   Once the model is loaded, we can proceed to perform inference on the test dataset. This will allow us to classify the images and predict their associated conditions. Since the test images come without labels in this notebook, the predictions themselves are the output; measuring performance and generalization would require labeled data.

This step is essential for applying deep learning techniques to medical image classification tasks, enabling informed decision-making based on the predictions produced by the model.


In [13]:
# Load the saved model (make sure this is the correct path to your model)
from tensorflow import keras
model = keras.models.load_model("/kaggle/input/cnn_us_lsdc/keras/default/1/CNN_LSDC_model.keras")



### Making Predictions on the Test Dataset

In this section, we perform inference using the pre-trained model on the test dataset. The code processes the images in batches, makes predictions, and stores the results.

#### Key Steps:

1. **Imports and Initialization:**
   - We import necessary libraries such as `tqdm` for displaying progress bars and `numpy` and `pandas` for data handling.
   - We initialize a dictionary called `results` to store the prediction results, which includes `row_id` and probabilities for each class: normal/mild, moderate, and severe.

2. **Setting the Batch Size:**
   - We define a `batch_size` variable set to 16. Note that the loop below does not read this variable; the batch size that takes effect is the one passed to `TestDataset` when `test_dataset` was created (also 16).

3. **Processing Images with a Progress Bar:**
   - We utilize `tqdm` to create a progress bar that provides feedback on the processing status of the entire dataset.

   - **Batch Iteration:**
     - We loop through the dataset using the index to process each batch of images. For each iteration:
       - We retrieve a batch of images and their corresponding `row_ids` from the `test_dataset`.
       - We check that each image in the batch has shape `(256, 256, 3)` by comparing `images.shape[1:]`; the batch dimension is not checked, since the last batch may be smaller. If the shape differs, a `ValueError` is raised.
       - We use the loaded model to predict the probabilities for each class based on the batch of images. The predictions will have a shape of `(batch_size, num_classes)`.

4. **Storing Predictions:**
   - For each image in the batch, we:
     - Extract the probabilities from the predictions.
     - Append the `row_id` and the respective class probabilities (normal/mild, moderate, severe) to the `results` dictionary.

5. **Updating Progress Bar:**
   - After processing each batch, we update the progress bar to reflect the number of batches processed.

6. **Handling Errors:**
   - Any exception raised while processing a batch is caught and printed along with the batch index. The failed batch is skipped, so its `row_id`s will be missing from the results.

7. **Creating a DataFrame:**
   - After processing all images, we convert the results dictionary into a pandas DataFrame called `results_df`.

8. **Normalizing Probabilities:**
   - We normalize the predicted probabilities to ensure they sum to 1 for each instance. This step is important for ensuring that the model's output can be interpreted as probabilities.

9. **Saving Results:**
   - Finally, we save the `results_df` DataFrame to a CSV file named `test_predictions.csv`, which contains the predictions for further analysis or submission.

This step produces the per-image predictions for the test dataset in a structured format that can be easily analyzed or visualized later.
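
The batching arithmetic and the error handling can be sketched without the real model. In the example below, `fake_predict` is a stand-in for `model.predict` that returns uniform probabilities, and the images are tiny placeholder arrays instead of 256×256 inputs. The image count (194) and batch size (16) are taken from this notebook. Recording failed batches in a list, instead of only printing them, is a suggested variation.

```python
import numpy as np


def fake_predict(images):
    """Stand-in for model.predict: returns uniform 3-class probabilities."""
    return np.full((len(images), 3), 1.0 / 3.0, dtype=np.float32)


num_images, batch_size = 194, 16
num_batches = int(np.ceil(num_images / batch_size))

all_probs = []
failed_batches = []  # (batch index, error message) for batches that raised
last_batch_len = None

for idx in range(num_batches):
    start = idx * batch_size
    end = min(start + batch_size, num_images)
    # Small placeholder batch; the real loop would fetch test_dataset[idx].
    images = np.zeros((end - start, 4, 4, 3), dtype=np.float32)
    last_batch_len = len(images)
    try:
        all_probs.append(fake_predict(images))
    except Exception as exc:
        failed_batches.append((idx, str(exc)))

probs = np.concatenate(all_probs, axis=0)
print(num_batches, probs.shape, len(failed_batches))
```

With 194 images and a batch size of 16, this gives 13 batches, the last one holding 2 images, which matches the `13/13` shown by the progress bar in the notebook output. Checking `failed_batches` after the loop makes it obvious when some `row_id`s are missing from the results.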


In [14]:
from tqdm import tqdm
import numpy as np
import pandas as pd

# Initialize results storage
results = {
    'row_id': [],
    'normal_mild': [],
    'moderate': [],
    'severe': []
}

batch_size = 16  

# Use tqdm to create a progress bar for the entire dataset
with tqdm(total=len(test_dataset), desc="Processing images") as pbar:
    # Iterate over batches of data
    for idx in range(len(test_dataset)):
        try:
            # Get a batch of images and corresponding row IDs
            images, row_ids = test_dataset[idx]

            # Ensure the images have the shape (batch_size, 256, 256, 3)
            if images.shape[1:] != (256, 256, 3):
                raise ValueError(f"Image batch shape is {images.shape}, expected (?, 256, 256, 3)")

            # Make predictions on the batch with verbose=0 to suppress output
            predictions = model.predict(images, verbose=0)  # Shape: (batch_size, num_classes)

            # Append results for each image in the batch
            for i in range(len(row_ids)):
                probs = predictions[i]

                results['row_id'].append(row_ids[i])
                results['normal_mild'].append(probs[0])  # Class 0: Normal/Mild
                results['moderate'].append(probs[1])     # Class 1: Moderate
                results['severe'].append(probs[2])       # Class 2: Severe

            # Advance the progress bar by one batch
            pbar.update(1)

        except Exception as e:
            print(f"Error processing index {idx}: {e}")

# Convert results to DataFrame
results_df = pd.DataFrame(results)

# Normalize probabilities to ensure they sum to 1
results_df[['normal_mild', 'moderate', 'severe']] = results_df[['normal_mild', 'moderate', 'severe']].div(
    results_df[['normal_mild', 'moderate', 'severe']].sum(axis=1), axis=0
)

# Save the results to a CSV file
results_df.to_csv('test_predictions.csv', index=False)


Processing images: 100%|██████████| 13/13 [00:30<00:00,  2.32s/it]


### Overview of the Results DataFrame

After generating the predictions, we utilize the `info()` method on the `results_df` DataFrame to obtain a summary of its structure and contents. This step is crucial for understanding the data we have produced during the inference process.

#### Key Aspects of `results_df.info()`:

1. **DataFrame Summary:**
   - The `info()` method provides a concise summary of the DataFrame, including:
     - The number of entries (rows) and columns.
     - The index range.
     - The data types of each column.
     - The count of non-null entries in each column.

2. **Column Breakdown:**
   - The DataFrame consists of the following columns:
     - `row_id`: Identifier combining the study, condition, and spinal level. It is not unique per row here, because several images can share the same `row_id`.
     - `normal_mild`: Probability score for the image being classified as normal or mild.
     - `moderate`: Probability score indicating a moderate classification.
     - `severe`: Probability score suggesting a severe classification.

3. **Data Types:**
   - The data types of the columns will typically include:
     - `row_id`: Object (string type).
     - `normal_mild`, `moderate`, `severe`: Float (probability values).

4. **Non-null Counts:**
   - The count of non-null entries helps verify that there are no missing values in the predictions, ensuring the integrity of the results.

5. **Importance of the Summary:**
   - By examining this information, we can confirm that the predictions have been successfully generated and structured correctly. It also helps identify any issues, such as missing values or incorrect data types, before proceeding with further analysis or visualizations.

Overall, running `results_df.info()` is an important step to validate the output of our model and prepare for subsequent tasks like analysis, visualization, or reporting of the results.
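
`info()` reports structure, but it does not check the values themselves. A few explicit checks can complement it, as in the sketch below. `check_predictions` is a hypothetical helper and the DataFrame is made-up data with the same columns as `results_df`.

```python
import numpy as np
import pandas as pd

prob_cols = ["normal_mild", "moderate", "severe"]


def check_predictions(df):
    """Return a dict of simple sanity checks on a predictions DataFrame."""
    probs = df[prob_cols].to_numpy(dtype=np.float64)
    return {
        "no_missing": not df.isna().any().any(),
        "in_unit_range": bool(((probs >= 0) & (probs <= 1)).all()),
        "rows_sum_to_one": bool(np.allclose(probs.sum(axis=1), 1.0, atol=1e-5)),
    }


# Made-up predictions with the same columns as results_df.
demo_df = pd.DataFrame({
    "row_id": ["1_spinal_canal_stenosis_l1_l2", "1_spinal_canal_stenosis_l2_l3"],
    "normal_mild": [0.7, 0.2],
    "moderate": [0.2, 0.3],
    "severe": [0.1, 0.5],
})

checks = check_predictions(demo_df)
print(checks)
```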


In [15]:
results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   row_id       194 non-null    object 
 1   normal_mild  194 non-null    float32
 2   moderate     194 non-null    float32
 3   severe       194 non-null    float32
dtypes: float32(3), object(1)
memory usage: 3.9+ KB


### Averaging Results per `row_id`

In this section, we aggregate the prediction results for each unique `row_id` by calculating the average probabilities across the corresponding images. This process is essential for consolidating the predictions to reflect the overall assessment for each condition associated with a specific study.

#### Key Steps:

1. **Grouping by `row_id`:**
   - We use the `groupby()` method on the `results_df` DataFrame to group the data by `row_id`, so that all image-level predictions sharing a `row_id` are consolidated into a single row.

   - The `as_index=False` parameter is set to maintain `row_id` as a regular column in the resulting DataFrame instead of using it as an index.

2. **Calculating Mean Probabilities:**
   - For each `row_id`, we compute the mean of the probability scores for the classes: `normal_mild`, `moderate`, and `severe`. This results in a new DataFrame called `averaged_results_df`, which contains the average probability scores for each `row_id`.

3. **Normalizing Probabilities:**
   - To ensure that the probabilities across the three classes sum to 1 for each `row_id`, we perform normalization:
     - We calculate the sum of the probabilities for each `row_id` using `sum(axis=1)`.
     - Each probability column (`normal_mild`, `moderate`, `severe`) is divided by the corresponding sum of probabilities. This step ensures that the predicted probabilities are properly scaled.

4. **Checking for Invalid Values:**
   - We check that none of the normalized probabilities is negative. Because the values in each row sum to 1 after normalization, non-negativity also implies that no value exceeds 1, so every probability lies in [0, 1].
   - If any negative probabilities are found, a `ValueError` is raised, indicating that the submission contains invalid values. Note that this check does not catch `NaN` values, which would appear if the probabilities for a `row_id` summed to 0.

5. **Importance of Averaging and Normalization:**
   - Averaging results across multiple predictions for the same `row_id` provides a more robust estimate of the true condition.
   - Normalizing the probabilities is essential for interpretation: the three class scores for a `row_id` can then be read as a probability distribution and compared directly.

This step is crucial for preparing the final prediction results, ensuring they are valid and ready for analysis, visualization, or submission.
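
As a minimal sketch of the grouping, averaging, normalization, and validation steps described above, the example below runs them on a small made-up frame. The function name `average_and_normalize` and the toy values are assumptions for illustration. The validation is slightly stricter than a negative-value check: it also rejects zero sums and `NaN` values.

```python
import numpy as np
import pandas as pd

PROB_COLS = ["normal_mild", "moderate", "severe"]


def average_and_normalize(df: pd.DataFrame) -> pd.DataFrame:
    # Mean probability per row_id across all of its images.
    out = df.groupby("row_id", as_index=False)[PROB_COLS].mean()

    # Divide each row by its own sum so the three classes add up to 1.
    row_sums = out[PROB_COLS].sum(axis=1)
    if (row_sums <= 0).any():
        raise ValueError("Found a row with a non-positive probability sum.")
    out[PROB_COLS] = out[PROB_COLS].div(row_sums, axis=0)

    # Reject NaN values and anything outside [0, 1].
    values = out[PROB_COLS].to_numpy()
    if np.isnan(values).any() or (values < 0).any() or (values > 1).any():
        raise ValueError("Probabilities must be finite and within [0, 1].")
    return out


# Toy per-image predictions (hypothetical values): two images for "a", one for "b".
toy = pd.DataFrame({
    "row_id": ["a", "a", "b"],
    "normal_mild": [0.6, 0.2, 0.5],
    "moderate": [0.3, 0.3, 0.25],
    "severe": [0.3, 0.5, 0.25],
})
toy_avg = average_and_normalize(toy)
print(toy_avg)
```

Using `div(row_sums, axis=0)` normalizes all three columns in one call, which avoids repeating the division once per column.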


In [16]:
# Average results per row_id
averaged_results_df = results_df.groupby('row_id', as_index=False).mean()

# Normalize probabilities to ensure they sum to 1
sum_probs = averaged_results_df[['normal_mild', 'moderate', 'severe']].sum(axis=1)
averaged_results_df['normal_mild'] = averaged_results_df['normal_mild'] / sum_probs
averaged_results_df['moderate'] = averaged_results_df['moderate'] / sum_probs
averaged_results_df['severe'] = averaged_results_df['severe'] / sum_probs

# Check for any invalid values
if (averaged_results_df[['normal_mild', 'moderate', 'severe']] < 0).any().any():
    raise ValueError("Found negative probabilities in submission.")

### Verification of Probability Normalization

In this step, we verify that the normalized probabilities for each condition sum to 1 for every `row_id` in the `averaged_results_df` DataFrame. This check ensures that the normalization process was successful and that the probability values are valid.

#### Key Steps:

1. **Calculating the Sum of Probabilities:**
   - We create a new column called `sum_check` in the `averaged_results_df` DataFrame. This column will hold the sum of the probabilities for the three conditions: `normal_mild`, `moderate`, and `severe`.
   - The sum is calculated using the `sum(axis=1)` method, which sums the values across the specified columns for each row.

2. **Rounding the Sum Values:**
   - To enhance readability, we apply `round(x, 2)` to round each sum to two decimal places. Note that this makes the check coarse: any sum within about 0.005 of 1 is displayed as `1.0`.

3. **Displaying the Normalization Check:**
   - We print a confirmation message "Normalization Check:" to indicate that we are about to display the results of the verification process.
   - The code prints the `row_id` and `sum_check` columns for the whole DataFrame, which allows us to assess whether the normalization was successful across the samples. To limit the display to the first five rows, `.head()` could be appended to the selection.

4. **Importance of the Normalization Check:**
   - Ensuring that the sum of probabilities equals 1 is crucial for validating the model's outputs. Probabilities must adhere to the properties of a probability distribution, where the sum of all possible outcomes should equal 1.
   - This check serves as a final validation step before proceeding to utilize or submit the results, providing confidence in the integrity of the data.

By performing this verification, we confirm that our predictions are reliable and ready for further analysis or reporting.
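
A tolerance-based comparison is one possible alternative to rounding. The sketch below uses `numpy.isclose` on made-up values in which the second row is deliberately off by 0.004, to show a deviation that a two-decimal rounding check would not reveal. The frame `toy` and the tolerance `atol=1e-6` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

PROB_COLS = ["normal_mild", "moderate", "severe"]

# Toy frame (hypothetical values); row "b" sums to 1.004 instead of 1.
toy = pd.DataFrame({
    "row_id": ["a", "b"],
    "normal_mild": [0.5, 0.5],
    "moderate": [0.3, 0.3],
    "severe": [0.2, 0.204],
})
row_sums = toy[PROB_COLS].sum(axis=1)

# Two-decimal rounding reports both rows as summing to 1.0.
rounded_ok = row_sums.round(2) == 1.0

# A tolerance-based comparison flags the row that is off.
close_ok = np.isclose(row_sums, 1.0, atol=1e-6)

print("Rows failing the tolerance check:", toy.loc[~close_ok, "row_id"].tolist())
```

The appropriate tolerance depends on the dtype; for `float32` values, a tolerance much tighter than about 1e-6 may flag ordinary rounding error.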


In [17]:
# Verify that the sum of probabilities is 1 for each row
averaged_results_df['sum_check'] = averaged_results_df[['normal_mild', 'moderate', 'severe']].sum(axis=1).apply(lambda x: round(x, 2))
print("Normalization Check:")
print(averaged_results_df[['row_id', 'sum_check']])

Normalization Check:
                                             row_id  sum_check
0    44036939_left_neural_foraminal_narrowing_l1_l2        1.0
1    44036939_left_neural_foraminal_narrowing_l2_l3        1.0
2    44036939_left_neural_foraminal_narrowing_l3_l4        1.0
3    44036939_left_neural_foraminal_narrowing_l4_l5        1.0
4    44036939_left_neural_foraminal_narrowing_l5_s1        1.0
5         44036939_left_subarticular_stenosis_l1_l2        1.0
6         44036939_left_subarticular_stenosis_l2_l3        1.0
7         44036939_left_subarticular_stenosis_l3_l4        1.0
8         44036939_left_subarticular_stenosis_l4_l5        1.0
9         44036939_left_subarticular_stenosis_l5_s1        1.0
10  44036939_right_neural_foraminal_narrowing_l1_l2        1.0
11  44036939_right_neural_foraminal_narrowing_l2_l3        1.0
12  44036939_right_neural_foraminal_narrowing_l3_l4        1.0
13  44036939_right_neural_foraminal_narrowing_l4_l5        1.0
14  44036939_right_neural_foramina

### Insights on Normalization Check Results

1. **Successful Normalization:**
   - The `sum_check` column indicates that the sum of the predicted probabilities for each `row_id` equals **1.0** for all entries shown. This confirms that the normalization process was successful and that the probability values are valid.

2. **Consistency Across Conditions:**
   - Each `row_id` corresponds to a specific spinal condition (e.g., left neural foraminal narrowing at various lumbar levels). The sum is **1.0** across all conditions shown, as expected: the division step enforces this regardless of what the model outputs, so the three values for each `row_id` can be read as the relative likelihood of each classification (normal/mild, moderate, severe).

3. **Data Integrity:**
   - All rows show a sum of **1.0** when rounded to two decimal places, which indicates that the averaging and normalization steps ran without error and that the values satisfy the sum-to-one property of a probability distribution. The check says nothing about predictive accuracy, which has to be assessed against labeled data.

4. **Interpretation of Results:**
   - Since the probabilities are normalized, they can be directly interpreted as the model’s confidence levels for each condition. For instance, if the probabilities were `[0.7, 0.2, 0.1]` for a specific `row_id`, this would indicate a strong confidence that the condition is classified as normal/mild.

5. **Validation for Clinical Use:**
   - Normalization is especially important in medical applications because it is a precondition for reading the outputs as probabilities. Whether the predictions can support decision-making depends on the model's measured performance, not on this check.

6. **Next Steps:**
   - With the normalization verified, the predictions can be formatted for submission. It would also be beneficial to visualize some of the results to better understand the model's behavior and identify areas for improvement.
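
To illustrate the interpretation described in point 4, the following sketch derives the most likely class and its probability from normalized scores. The frame `toy` and the column names `predicted` and `confidence` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

PROB_COLS = ["normal_mild", "moderate", "severe"]

# Toy normalized predictions (hypothetical values).
toy = pd.DataFrame({
    "row_id": ["a", "b"],
    "normal_mild": [0.7, 0.35],
    "moderate": [0.2, 0.44],
    "severe": [0.1, 0.21],
})

# Name of the column holding the highest probability in each row.
toy["predicted"] = toy[PROB_COLS].idxmax(axis=1)
# The probability assigned to that class.
toy["confidence"] = toy[PROB_COLS].max(axis=1)

print(toy[["row_id", "predicted", "confidence"]])
```

A row such as `[0.35, 0.44, 0.21]` still yields a predicted class, but the low maximum shows that the model does not strongly favor any class.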

### Conclusion
Overall, the normalization check shows that the predicted probabilities for each `row_id` form a valid distribution. This is a prerequisite for submission and for any downstream interpretation of the results.


### Preparing the Submission DataFrame

In this section, we create a new DataFrame specifically designed for submission purposes. This DataFrame will contain the final average probability results for each unique `row_id`, allowing for straightforward analysis or submission.

#### Key Steps:

1. **Creating the Submission DataFrame:**
   - We define `submission_df` by selecting specific columns from the `averaged_results_df`. The columns included are:
     - `row_id`: The identifier for each prediction row, combining the study, condition, and spinal level. After averaging, each `row_id` appears exactly once.
     - `normal_mild`: The averaged probability score indicating the likelihood that the condition is classified as normal or mild.
     - `moderate`: The averaged probability score representing a moderate classification.
     - `severe`: The averaged probability score suggesting a severe classification.

   This selection ensures that only the relevant data for submission is retained in the new DataFrame.

2. **Structure of the Submission DataFrame:**
   - The resulting `submission_df` will have the following columns:
     - **row_id**: A string that uniquely identifies each row.
     - **normal_mild**: A float representing the predicted probability of the normal/mild condition.
     - **moderate**: A float representing the predicted probability of the moderate condition.
     - **severe**: A float representing the predicted probability of the severe condition.

3. **Viewing the Submission DataFrame:**
   - Evaluating `submission_df` as the last expression in the cell displays the structure and contents of the DataFrame. This step is essential for verifying that the data has been correctly organized and is ready for further processing, such as exporting to a CSV file for submission or reporting.

4. **Importance of the Submission DataFrame:**
   - The `submission_df` serves as the final output of our inference process. It consolidates all the predictions into a format that can be easily interpreted, shared, or submitted to relevant stakeholders or platforms.
   - Ensuring that the data is structured correctly is crucial for effective communication of the model’s findings and performance on the test dataset.

This preparation step is vital for the completion of the project, marking the transition from model inference to result dissemination.
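
Before exporting, a few structural properties of the submission frame can be checked in code. The sketch below is an illustration under assumptions: `check_submission` is a hypothetical helper, and the expected column order is taken from the column selection described above. The two sample rows reuse values shown in this notebook's output.

```python
import pandas as pd

EXPECTED_COLS = ["row_id", "normal_mild", "moderate", "severe"]


def check_submission(df: pd.DataFrame) -> None:
    """Raise ValueError if the frame does not have the expected shape."""
    if list(df.columns) != EXPECTED_COLS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    if df["row_id"].duplicated().any():
        raise ValueError("Duplicate row_id values found.")
    if df[EXPECTED_COLS[1:]].isna().any().any():
        raise ValueError("Missing probabilities found.")


toy_submission = pd.DataFrame({
    "row_id": [
        "44036939_left_neural_foraminal_narrowing_l1_l2",
        "44036939_left_neural_foraminal_narrowing_l2_l3",
    ],
    "normal_mild": [0.379058, 0.386446],
    "moderate": [0.290456, 0.305325],
    "severe": [0.330486, 0.308228],
})
check_submission(toy_submission)  # returns None when every check passes
```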


In [18]:
submission_df = averaged_results_df[['row_id', 'normal_mild', 'moderate', 'severe']]
submission_df

| | row_id | normal_mild | moderate | severe |
|---|---|---|---|---|
| 0 | 44036939_left_neural_foraminal_narrowing_l1_l2 | 0.379058 | 0.290456 | 0.330486 |
| 1 | 44036939_left_neural_foraminal_narrowing_l2_l3 | 0.386446 | 0.305325 | 0.308228 |
| 2 | 44036939_left_neural_foraminal_narrowing_l3_l4 | 0.409427 | 0.312947 | 0.277626 |
| 3 | 44036939_left_neural_foraminal_narrowing_l4_l5 | 0.375377 | 0.28416 | 0.340463 |
| 4 | 44036939_left_neural_foraminal_narrowing_l5_s1 | 0.384381 | 0.2928 | 0.322819 |
| 5 | 44036939_left_subarticular_stenosis_l1_l2 | 0.412525 | 0.373115 | 0.214359 |
| 6 | 44036939_left_subarticular_stenosis_l2_l3 | 0.41357 | 0.36827 | 0.21816 |
| 7 | 44036939_left_subarticular_stenosis_l3_l4 | 0.374798 | 0.35047 | 0.274732 |
| 8 | 44036939_left_subarticular_stenosis_l4_l5 | 0.345216 | 0.44329 | 0.211494 |
| 9 | 44036939_left_subarticular_stenosis_l5_s1 | 0.426723 | 0.361181 | 0.212096 |
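
The `row_id` values in the output above appear to follow a `<study_id>_<condition>_<level>` pattern. Assuming that pattern holds for every row, the sketch below splits an identifier into its parts. `split_row_id` is a hypothetical helper, not part of the original notebook.

```python
def split_row_id(row_id: str) -> dict:
    """Split a row_id, assuming the form <study_id>_<condition>_<level>.

    The level is assumed to be the last two underscore-separated tokens
    (for example "l1_l2"), and the study ID the first token.
    """
    parts = row_id.split("_")
    return {
        "study_id": parts[0],
        "condition": "_".join(parts[1:-2]),
        "level": "_".join(parts[-2:]),
    }


example = split_row_id("44036939_left_neural_foraminal_narrowing_l1_l2")
print(example)
```

Splitting the identifier this way makes it possible to group or filter predictions by condition or spinal level.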


### Saving the Submission File

In this final step, we save the prepared submission DataFrame (`submission_df`) to a CSV file. This file can be used for further analysis, sharing, or submission to relevant platforms.

#### Key Steps:

1. **Exporting the DataFrame to CSV:**
   - We utilize the `to_csv()` method from the pandas library to export `submission_df` to a CSV file.
   - The argument `index=False` is specified to prevent pandas from writing row indices to the CSV file. This is important because we want to keep the file clean and ensure it only contains the relevant data columns.

2. **Naming the Submission File:**
   - The submission file is named `submission.csv`, and this name is indicated in the print statement for clarity. The CSV file will contain the `row_id` and the corresponding averaged probability scores for the conditions: normal/mild, moderate, and severe.

3. **Outputting a Confirmation Message:**
   - A confirmation message is printed to the console to inform the user that the submission file has been successfully saved. This provides assurance that the data has been correctly exported.

4. **Saving to Kaggle Working Directory:**
   - The command `submission_df.to_csv('/kaggle/working/submission.csv', index=False)` ensures that the file is saved to the Kaggle working directory, making it accessible for download or further use within the Kaggle environment.

5. **Importance of Saving the Submission File:**
   - Saving the predictions in a structured format (CSV) is essential for effective communication of results.
   - The submission file can be submitted to competitions or assessments where model performance needs to be evaluated.
   - This step marks the conclusion of the data processing and inference pipeline, transitioning from analysis to actionable outcomes.

By completing this step, we ensure that our results are preserved and ready for future reference or evaluation.


In [19]:
# Save the submission file to the Kaggle working directory
submission_df.to_csv('/kaggle/working/submission.csv', index=False)
print("Submission file saved as 'submission.csv'.")

Submission file saved as 'submission.csv'.
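
To confirm that a file was written as intended, it can be read back and compared with the original frame. The sketch below is a minimal illustration that writes a made-up frame to a temporary directory (an assumption made so that the example runs anywhere, rather than in `/kaggle/working`). It also shows the effect of `index=False`: the reloaded file contains only the four data columns.

```python
import os
import tempfile

import pandas as pd

# Toy submission frame (values reused from this notebook's output).
toy_submission = pd.DataFrame({
    "row_id": [
        "44036939_left_neural_foraminal_narrowing_l1_l2",
        "44036939_left_neural_foraminal_narrowing_l2_l3",
    ],
    "normal_mild": [0.379058, 0.386446],
    "moderate": [0.290456, 0.305325],
    "severe": [0.330486, 0.308228],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)  # index=False omits the row index
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.shape)
```

Without `index=False`, the row index would be written as an extra unnamed first column, which pandas labels `Unnamed: 0` when the file is read back.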


### Conclusion

In this notebook, we successfully implemented the inference process for the RSNA Lumbar Spine Degenerative Classification task using a pre-trained Convolutional Neural Network (CNN). The key steps undertaken include:

1. **Data Preparation:** 
   - We loaded and processed the test dataset, ensuring that all images were correctly accessed and organized.

2. **Model Loading and Predictions:** 
   - We utilized a pre-trained CNN model to make predictions on the test dataset. The model output was carefully handled to ensure accurate results.

3. **Results Aggregation and Normalization:** 
   - We computed the average probabilities for each condition across multiple images corresponding to each `row_id`, ensuring that the probabilities were normalized to sum to 1.

4. **Verification:** 
   - A final check confirmed that the normalized probabilities adhered to the fundamental properties of probability distributions.

5. **Submission Preparation:**
   - The processed results were saved in a structured CSV format, ready for submission or further analysis.

Through these steps, we have built a robust inference pipeline that can be adapted for similar classification tasks in medical imaging. The results generated can be valuable for clinical evaluations and decision-making regarding lumbar spine conditions. Future work may focus on improving model accuracy through fine-tuning and exploring additional data augmentation techniques.

Thank you for reviewing this notebook, and I look forward to any questions or feedback!