The project aims to address the critical issue of prompt and accurate diagnosis of abdominal trauma, which is a common cause of death and a major public health concern globally. Abdominal trauma, often resulting from motor vehicle accidents, can lead to severe injuries to internal organs and internal bleeding.

### **Background Information:**

Traumatic injury is a significant global health concern, especially affecting individuals in the first four decades of life. It is responsible for millions of annual deaths worldwide and poses a substantial public health challenge. Prompt and accurate diagnosis of traumatic injuries is crucial for improving patient outcomes and increasing survival rates. Among various diagnostic tools, computed tomography (CT) has emerged as a vital technology for evaluating individuals suspected of having abdominal injuries. CT scans provide detailed cross-sectional images of the abdomen, aiding in the detection and assessment of traumatic injuries.


Interpreting CT scans for abdominal trauma can be a complex and time-consuming task, particularly when dealing with multiple injuries or subtle areas of active bleeding. This complexity often requires the expertise of medical professionals, and even for them, it can be challenging to make rapid and precise diagnoses. The need for timely intervention and appropriate treatment underscores the importance of improving the diagnostic process.



### **Problem Statement:**

With more than 5 million deaths caused by traumatic injury each year, it is the largest cause of early-life mortality and a major public health concern worldwide. Among these, blunt abdominal trauma is frequently sustained in car accidents and can cause serious internal bleeding and damage. In Kenya, a country of over 50 million people, this challenge is magnified by the severe shortage of healthcare infrastructure—only about 50 CT scanners and 200 trained radiologists are available nationwide. This shortage leads to misdiagnoses, delayed treatments due to average waiting times of several weeks, and a lack of access to vital healthcare services for many Kenyans. Despite government initiatives to invest in new CT scanners and train more radiologists, the need for rapid and accurate diagnosis remains critical. However, it is sometimes difficult and time-consuming for medical personnel to interpret CT scans for abdominal injuries. Therefore, there is an urgent need for automated, accurate, and rapid diagnostic solutions as any delay can be fatal.



### **Objectives:**

* To develop AI algorithms that can automatically and accurately detect traumatic injuries to internal abdominal organs using CT scans.

* To classify the discovered injuries according to their severity, thereby providing medical experts a vital tool to start proper treatment.

* To rigorously evaluate the developed algorithms using performance metrics that are relevant for both machine learning models and clinical applicability.





### **Research Questions:**

* How effective are AI algorithms in automatically detecting traumatic injuries to internal abdominal organs like the liver, kidneys, spleen, and bowel using CT scans?

* What features and patterns in CT scans are most indicative of different severities of abdominal injuries, and how can they be utilized for automated injury grading?

* What are the appropriate metrics for evaluating the performance of the developed AI algorithms in terms of both machine learning benchmarks and clinical utility?
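
The last question can be made concrete with a small sketch. The helper below is hypothetical (it is not part of the original notebook) and assumes binary ground-truth and predicted labels. It computes sensitivity and specificity, two measures commonly reported for diagnostic tests, alongside plain accuracy.

```python
import numpy as np

def binary_clinical_metrics(y_true, y_pred):
    """Hypothetical helper: sensitivity, specificity and accuracy for 0/1 labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = int(np.sum(y_true & y_pred))    # injured, predicted injured
    tn = int(np.sum(~y_true & ~y_pred))  # healthy, predicted healthy
    fp = int(np.sum(~y_true & y_pred))   # healthy, predicted injured
    fn = int(np.sum(y_true & ~y_pred))   # injured, predicted healthy

    return {
        # Share of truly injured cases that were detected
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of truly healthy cases that were cleared
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / y_true.size,
    }

# Toy labels for illustration only (not taken from the dataset)
metrics = binary_clinical_metrics([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1])
print(metrics)
```

Reporting sensitivity separately matters in a trauma setting because a missed injury (false negative) usually carries a higher cost than a false alarm, and accuracy alone hides that difference.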

### **Importing Libraries**

In [None]:
import numpy as np 
import pandas as pd
import pydicom
import matplotlib.pyplot as plt
import cv2
import seaborn as sns
import tensorflow as tf
import os

### **Loading the datasets**

In [None]:
labels = pd.read_csv('/kaggle/input/rsna-2023-abdominal-trauma-detection/image_level_labels.csv')
train=pd.read_csv('/kaggle/input/rsna-2023-abdominal-trauma-detection/train.csv')
train_meta = pd.read_csv('/kaggle/input/rsna-2023-abdominal-trauma-detection/train_series_meta.csv')
test_meta = pd.read_csv('/kaggle/input/rsna-2023-abdominal-trauma-detection/test_series_meta.csv')

In [None]:
#Displaying the first few rows of each dataset
train.head(), labels.head(), train_meta.head()

**labels (image_level_labels.csv):**

* patient_id: The unique identifier for each patient.
* series_id: Identifier for the series of images for the patient.
* instance_number: The specific image instance number within the series.
* injury_name: The name or type of injury detected in the image.

**train (train.csv):**

This dataset provides the labels for different types of injuries for each patient.
Columns like bowel_healthy, bowel_injury, extravasation_healthy, etc., indicate the health status or injury severity of various organs for each patient.

**train_meta (train_series_meta.csv):**

* patient_id: The unique identifier for each patient.
* series_id: Identifier for the series of images for the patient.
* aortic_hu: A quantitative measure, in Hounsfield units (HU), related to the aorta.
* incomplete_organ: A binary indicator specifying whether the organ is incomplete in the images.

In [None]:
merged_df = pd.merge(train, train_meta, on='patient_id', how='inner')

In [None]:
complete_df = pd.merge(merged_df, labels, on='patient_id', how='inner')
complete_df

In [None]:
corr_df = complete_df.drop(['patient_id', 'any_injury','series_id_x','series_id_y', 'instance_number', 'injury_name'], axis=1)

In [None]:
correlation_matrix = corr_df.corr()
correlation_matrix

In [None]:
import seaborn as sns
plt.figure(figsize=(10, 8))
# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Add a title
plt.title('Correlation Matrix Heatmap')

# Show the plot
plt.show()

In [None]:
# Compute the correlations once and reuse them for the plot
correlation_with_aortic_hu = corr_df.corr()['aortic_hu']
plt.figure(figsize=(12, 6))
correlation_with_aortic_hu.plot(kind='bar', color='skyblue')
plt.xlabel('Columns')
plt.ylabel('Correlation')
plt.title('Correlation of "aortic_hu" with Other Columns')
plt.xticks(rotation=90)
plt.show()

# **Data Understanding**

In [None]:
# Basic information for the 'train' dataset
train_info = {
    "Number of Rows": train.shape[0],
    "Number of Columns": train.shape[1],
    "Columns": train.columns.tolist(),
    "Data Types": train.dtypes.tolist(),
    "Unique Values per Column": train.nunique().tolist()
}

# Basic information for the 'labels' dataset
labels_info = {
    "Number of Rows": labels.shape[0],
    "Number of Columns": labels.shape[1],
    "Columns": labels.columns.tolist(),
    "Data Types": labels.dtypes.tolist(),
    "Unique Values per Column": labels.nunique().tolist()
}

# Basic information for the 'train_meta' dataset
train_meta_info = {
    "Number of Rows": train_meta.shape[0],
    "Number of Columns": train_meta.shape[1],
    "Columns": train_meta.columns.tolist(),
    "Data Types": train_meta.dtypes.tolist(),
    "Unique Values per Column": train_meta.nunique().tolist()
}

train_info, labels_info, train_meta_info

**1. labels (image_level_labels.csv) Dataset:**

- Number of Rows: 12,029
- Number of Columns: 4
- Columns:
    - patient_id: Unique identifier of the patient.
    - series_id: An identifier for the series of images associated with each patient.
    - instance_number: Specific image instance number within the series.
    - injury_name: Type of injury detected in the image.
- Data Types: The data types are appropriate with integer types for identifiers and object (string) type for the injury name.
- Unique Values: There are 246 unique patients, 330 unique series, and 925 unique instance numbers. The injury_name column has 2 unique values, indicating two types of injuries: Active_Extravasation and bowel.

**2. train (train.csv) Dataset:**

- Number of Rows: 3,147
- Number of Columns: 15
- Columns:
    - patient_id: Unique identifier of the patient.
    - The other 14 columns represent the health status and injury severity of various organs for each patient, recorded as binary variables where 0 indicates the absence of a condition, and 1 indicates the presence of a condition.
- Data Types: All columns are of integer type.
- Unique Values: There are 3,147 unique patients. The injury-related columns have binary values (0 or 1), indicating the absence or presence of a specific injury type.
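
Since severity grading is one of the objectives, it can help to collapse the per-organ binary columns into a single grade. The sketch below is hypothetical and assumes that, for a given organ, exactly one of the `_healthy`, `_low`, and `_high` columns is set per patient; the helper name and the 0/1/2 grade encoding are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Column order assumed for illustration: healthy, low, high
KIDNEY_COLUMNS = ["kidney_healthy", "kidney_low", "kidney_high"]

def one_hot_to_grade(one_hot_rows):
    """Map rows of [healthy, low, high] indicators to grades 0, 1, 2."""
    arr = np.asarray(one_hot_rows)
    # Guard the assumption that each row has exactly one indicator set
    if not np.all(arr.sum(axis=1) == 1):
        raise ValueError("each row must have exactly one indicator set")
    return arr.argmax(axis=1)

# Toy rows for illustration only: healthy, low-grade, high-grade
toy_rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
grades = one_hot_to_grade(toy_rows)
print(grades)
```

With the real data, the same helper could be called as `one_hot_to_grade(train[KIDNEY_COLUMNS].values)`, provided the one-indicator-per-row assumption holds.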

**3. train_meta (train_series_meta.csv) Dataset:**

- Number of Rows: 4,711
- Number of Columns: 4
- Columns:
    - patient_id: Unique identifier of the patient.
    - series_id: An identifier for the series of images associated with each patient.
    - aortic_hu: A quantitative measure in HU related to the aorta.
    - incomplete_organ: A binary indicator where 0 signifies the absence of an incomplete organ, and 1 signifies the presence of an incomplete organ.
- Data Types: The data types are appropriate with integer and float types.
- Unique Values: There are 3,147 unique patients and 4,711 unique series. The incomplete_organ column has 2 unique values.

In [None]:
# Checking for missing values and duplicates
def check_missing_and_duplicates(datasets):
    # Initializing lists to store the results
    dataset_names = []
    missing_values_list = []
    duplicates_list = []
    
    for dataset_name, dataset in datasets.items():
        # Calculating missing values
        missing_values = dataset.isnull().sum().sum()
        
        # Checking for duplicates
        duplicates = dataset.duplicated().sum()
        
        # Appending 
        dataset_names.append(dataset_name)
        missing_values_list.append(missing_values)
        duplicates_list.append(duplicates)
    
    # Creating a summary DataFrame
    summary_df = pd.DataFrame({
        "Dataset": dataset_names,
        "Missing Values": missing_values_list,
        "Duplicates": duplicates_list
    })
    
    return summary_df

datasets = {
    "train": train,
    "labels": labels,
    "train_meta": train_meta
}

summary = check_missing_and_duplicates(datasets)

print(summary)

- There are no missing values in any of the datasets.
- There are no duplicated rows in any of the datasets.

In [None]:
print("Descriptive statistics for the train dataset:")
print(train.describe())

**Statistical Summary of the train dataset**

**1. Patient IDs Distribution:**

- Representing the unique patient identifier, the 'patient_id' column ranges from 19 to 65,508, suggesting a wide range of patients in the dataset.

**2. Organ Health Status:**

- Several columns (e.g., 'bowel_healthy', 'extravasation_healthy', 'kidney_healthy', 'liver_healthy', 'spleen_healthy') are binary indicators of organ health.
- On average, most patients have healthy organs, as indicated by values close to 1.
- The mean values of these columns are as follows:
    - 'bowel_healthy' has a mean of approximately 0.98.
    - 'extravasation_healthy' has a mean of 0.94.
    - 'kidney_healthy' has a mean of 0.94.
    - 'liver_healthy' has a mean of 0.90.
    - 'spleen_healthy' has a mean of 0.89.
- This suggests that injuries to these organs are relatively rare in the dataset.

**3. Organ Injury Severity:**

- Columns like 'bowel_injury', 'extravasation_injury', 'kidney_low', 'kidney_high', 'liver_low', 'liver_high', 'spleen_low', and 'spleen_high' represent binary indicators of injury severity for various organs. The mean values of these columns are as follows:
    - 'liver_high' has a mean of around 0.02, while 'liver_low' has a mean of 0.08.
    - 'spleen_high' is at 0.05, with 'spleen_low' at 0.06.
    - 'kidney_high' is at 0.02, with 'kidney_low' at 0.04.
    - 'bowel_injury' is at 0.02, with 'extravasation_injury' at 0.06.
- These columns have low mean values, further confirming that injuries, and high-grade injuries in particular, are uncommon compared with healthy organs.

**4. Overall Injury Presence:**

- The 'any_injury' column is a binary indicator of the presence of any injury in a patient.
- On average, approximately 27% of patients in the dataset have at least one injury (mean value of 0.27).

**Conclusions:**

- The dataset appears to be relatively imbalanced, with most patients having healthy organs and a minority experiencing injuries.

- Bowel injuries, extravasation injuries, kidney injuries, liver injuries, and spleen injuries are relatively rare, as indicated by low mean values for their respective columns.

- Most patients have healthy organs, suggesting that the dataset may contain a majority of cases without severe injuries.

**Approximately 27% of patients in the dataset have at least one injury, indicating that injuries, while less common, are still present in a significant portion of the population. This underscores the importance of conducting further EDA to gain deeper insights into the nature, patterns, and potential risk factors associated with these injuries. EDA will help us better understand the characteristics of injuries and their impact on patient outcomes, leading to more informed decision-making in the field of trauma care and intervention.**
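
One common way to account for this kind of imbalance during training is to weight each class inversely to its frequency. The sketch below is an illustration only, not something the original notebook does. It uses the 27% figure reported above and the usual "balanced" heuristic, weight = 1 / (number of classes × class frequency); the function name is hypothetical.

```python
def inverse_frequency_weights(positive_fraction):
    """Balanced class weights for a binary target (hypothetical helper).

    Each class gets weight 1 / (2 * class_frequency), so both classes
    contribute equally to the weighted loss on average.
    """
    if not 0.0 < positive_fraction < 1.0:
        raise ValueError("positive_fraction must be strictly between 0 and 1")
    negative_fraction = 1.0 - positive_fraction
    return {
        0: 1.0 / (2.0 * negative_fraction),
        1: 1.0 / (2.0 * positive_fraction),
    }

# 0.27 is the mean of 'any_injury' reported in the summary above
weights = inverse_frequency_weights(0.27)
print(weights)
```

A dictionary of this shape matches what Keras accepts for the `class_weight` argument of `model.fit`, although whether weighting actually helps here would need to be checked empirically.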

## **Exploratory Data Analysis**

### **Univariate Analysis**

In [None]:
# Visualizing the distribution of injury types in the 'labels' dataset
plt.figure(figsize=(10, 6))
sns.countplot(data=labels, x='injury_name')
plt.title('Distribution of Injury Types in labels Dataset')
plt.ylabel('Count')
plt.xlabel('Injury Type')
plt.show()

The data suggests that extravasation (active bleeding) is more frequently identified in the provided images than bowel injuries.

In [None]:
# Visualizing the distribution of injury-related columns in the 'train' dataset
injury_columns = [col for col in train.columns if col != "patient_id"]
injury_counts = train[injury_columns].sum()

plt.figure(figsize=(14, 8))
injury_counts.sort_values().plot(kind='barh')
plt.title('Distribution of Injury-Related Columns in train Dataset')
plt.xlabel('Count')
plt.ylabel('Injury Type / Health Status')
plt.show()

In [None]:
# Visualizing the distribution of the 'aortic_hu' column in the 'train_meta' dataset
plt.figure(figsize=(10, 6))
sns.histplot(train_meta['aortic_hu'], bins=50, kde=True)
plt.title('Distribution of Aortic HU in train_meta Dataset')
plt.xlabel('Aortic HU')
plt.ylabel('Count')
plt.show()

Hounsfield Units (HU) are a measure used in CT scans to describe radiodensity, and the distribution gives us an idea of the variation in these values across different images.
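
To make the HU scale concrete, the sketch below shows how stored CT pixel values are usually mapped to HU and then clipped to a display window. It assumes a linear rescale using the standard DICOM `RescaleSlope` and `RescaleIntercept` attributes (available on a pydicom dataset under those names) and a simplified windowing formula. The function names, the toy pixel values, and the window of center 50 / width 400 are illustrative assumptions, not values taken from this dataset.

```python
import numpy as np

def to_hounsfield(raw_pixels, slope, intercept):
    """Linear rescale: HU = raw * RescaleSlope + RescaleIntercept."""
    return np.asarray(raw_pixels, dtype=np.float32) * slope + intercept

def apply_window(hu, center, width):
    """Clip HU values to a window and scale the result to [0, 1]."""
    low = center - width / 2
    high = center + width / 2
    return (np.clip(hu, low, high) - low) / (high - low)

# Toy stored values for illustration only
raw = np.array([[0, 1024], [1074, 2000]])
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)

# Assumed soft-tissue style window: center 50 HU, width 400 HU
windowed = apply_window(hu, center=50, width=400)
print(hu)
print(windowed)
```

Windowing before contrast enhancement keeps the intensity range focused on soft tissue, rather than spending most of it on air and bone.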

**Relationship Analysis:**

In [None]:
# Visualizing the relationship between 'aortic_hu' and 'incomplete_organ' in the 'train_meta' dataset
plt.figure(figsize=(10, 6))
sns.boxplot(data=train_meta, x='incomplete_organ', y='aortic_hu')
plt.title('Relationship between Aortic HU and Incomplete Organ in train_meta Dataset')
plt.xlabel('Incomplete Organ (0 = Complete, 1 = Incomplete)')
plt.ylabel('Aortic HU')
plt.show()

This suggests that there might be some relationship between the completeness of the organ in the image and the aortic_hu values.

**Outliers Analysis:**

In [None]:
# Outlier analysis for the 'aortic_hu' column using the IQR method

# Calculate Q1, Q3, and IQR
Q1 = train_meta['aortic_hu'].quantile(0.25)
Q3 = train_meta['aortic_hu'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = train_meta[(train_meta['aortic_hu'] < lower_bound) | (train_meta['aortic_hu'] > upper_bound)]

# Percentage of data points that are outliers
outlier_percentage = (len(outliers) / len(train_meta)) * 100

outlier_summary = {
    "Lower Bound": lower_bound,
    "Upper Bound": upper_bound,
    "Number of Outliers": len(outliers),
    "Percentage of Outliers": outlier_percentage
}

outlier_summary

In [None]:
# Visualizing outliers for the 'aortic_hu' column
plt.figure(figsize=(12, 8))
# Plot on the y-axis so the horizontal bound lines share the same scale
sns.boxplot(y=train_meta['aortic_hu'])
plt.axhline(lower_bound, color='r', linestyle='--', label=f"Lower Bound: {lower_bound:.1f}")
plt.axhline(upper_bound, color='g', linestyle='--', label=f"Upper Bound: {upper_bound:.1f}")
plt.title('Boxplot of Aortic HU with Outliers Highlighted')
plt.ylabel('Aortic HU')
plt.legend()
plt.show()

From the plot, we can observe a cluster of data points above the upper bound, indicating potential outliers with higher aortic_hu values.

**Relationship Analysis:**

**1. Injury Type vs. Aortic HU:**

In [None]:
# Merging the 'label' and 'train_meta' datasets on 'patient_id' and 'series_id'
merged_data = pd.merge(labels, train_meta, on=['patient_id', 'series_id'])

# Visualizing the distribution of 'aortic_hu' based on 'injury_name'
plt.figure(figsize=(12, 8))
sns.boxplot(data=merged_data, x='injury_name', y='aortic_hu')
plt.title('Distribution of Aortic HU based on Injury Type')
plt.xlabel('Injury Type')
plt.ylabel('Aortic HU')
plt.show()

For bowel_injury, the distribution appears to have a slightly higher median and is more compact in terms of the interquartile range (IQR) compared to extravasation.
The extravasation injury (which represents active bleeding) has a broader IQR, indicating more variability in the aortic_hu values for this injury type. There are also a few potential outliers present for this injury type.

**2. Injury Type vs. Completeness of Organ:**

In [None]:
# Visualizing the relationship between 'injury_name' and 'incomplete_organ'
plt.figure(figsize=(10, 6))
sns.countplot(data=merged_data, x='injury_name', hue='incomplete_organ')
plt.title('Injury Type vs. Completeness of Organ')
plt.xlabel('Injury Type')
plt.ylabel('Count')
plt.legend(title='Incomplete Organ (0 = Complete, 1 = Incomplete)')
plt.show()

For both bowel_injury and extravasation injury types, the majority of the organs in the images are complete (incomplete_organ = 0).
The number of images with incomplete organs (incomplete_organ = 1) is relatively lower for both injury types, with extravasation having a slightly higher count of incomplete organs compared to bowel_injury.

## Importing Images

Start by creating image paths for the test dataset.

In [None]:
# Build the series-level directory path (no instance_number) for each test series
def test_img_path(row):
    return f"/kaggle/input/rsna-2023-abdominal-trauma-detection/test_images/{row['patient_id']}/{row['series_id']}/"

test_meta['test_img_path'] = test_meta.apply(test_img_path, axis=1)

# Display the first few rows of the test_meta dataframe with the new 'test_img_path' column
test_meta.head()

Creating image paths for the train dataset

In [None]:
def img_path(row):
    return f"/kaggle/input/rsna-2023-abdominal-trauma-detection/train_images/{row['patient_id']}/{row['series_id']}/{row['instance_number']}.dcm"

labels['img_path'] = labels.apply(img_path, axis=1)

**DICOM Image Visualization:**

In [None]:
# (Re)generating Kaggle reference paths for the 'labels' dataset
labels['img_path'] = labels.apply(img_path, axis=1)

# Displaying the first few rows of the 'labels' dataset with the 'img_path' column
labels.head()

Randomly display injury type and image

In [None]:
import pydicom
import matplotlib.pyplot as plt

def read_dicom_image(path):
    """
    Reads a DICOM image and returns its pixel array.
    """
    dicom_img = pydicom.dcmread(path)
    return dicom_img.pixel_array

# Sample 20 rows from the labels dataset
sample_data = labels.sample(20)

# Extract the img_paths and corresponding injury names for labeling
sample_img_paths = sample_data['img_path'].tolist()
sample_labels = sample_data['injury_name'].tolist()

# Set up the figure for visualization
plt.figure(figsize=(15, 30))

# Loop through the sampled image paths and display them in rows of 3 with labels
# (the loop variable is named `path` so it does not shadow the img_path() function)
for idx, (path, label) in enumerate(zip(sample_img_paths, sample_labels), start=1):
    plt.subplot(7, 3, idx)  # 7 rows, 3 columns
    plt.imshow(read_dicom_image(path), cmap='gray')
    plt.title(label)
    plt.axis('off')

plt.tight_layout()
plt.show()

**Comparison of images for each injury type**

In [None]:
import pydicom
import matplotlib.pyplot as plt

def read_dicom_image(path):
    """
    Reads a DICOM image and returns its pixel array.
    """
    dicom_img = pydicom.dcmread(path)
    return dicom_img.pixel_array

# Sample one image path for each injury type
sample_img_paths = labels.groupby('injury_name')['img_path'].apply(lambda paths: paths.sample(1).iloc[0])
sample_labels = sample_img_paths.index.tolist()

# Set up the figure for visualization
plt.figure(figsize=(15, 5))

# Loop through the sampled image paths and display them side by side with labels
# (the loop variable is named `path` so it does not shadow the img_path() function)
for idx, (path, label) in enumerate(zip(sample_img_paths, sample_labels), start=1):
    plt.subplot(1, len(sample_img_paths), idx)
    plt.imshow(read_dicom_image(path), cmap='gray')
    plt.title(label)
    plt.axis('off')

plt.tight_layout()
plt.show()

Randomly Display Images by Patient ID

In [None]:
import pydicom
import matplotlib.pyplot as plt
import random

def read_dicom_image(path):
    """
    Reads a DICOM image and returns its pixel array.
    """
    dicom_img = pydicom.dcmread(path)
    return dicom_img.pixel_array

# Get unique patient IDs from your DataFrame
unique_patient_ids = labels['patient_id'].unique()

# Randomly select 5 patient IDs (or you can select a fixed set)
random_patient_ids = random.sample(list(unique_patient_ids), 5)

# Set up a grid for displaying images
num_rows = 5  # Number of rows in the grid (one row per patient)
num_cols = 5  # Number of columns in the grid (up to 5 images per patient)
plt.figure(figsize=(15, 10))

# Iterate through randomly selected patient IDs
for row, random_patient_id in enumerate(random_patient_ids, start=1):
    # Filter the DataFrame to get all images for the randomly selected patient
    patient_images = labels[labels['patient_id'] == random_patient_id]
    
    # Get unique series IDs for the patient
    unique_series_ids = patient_images['series_id'].unique()
    
    # Randomly select up to 5 unique series IDs (you can adjust the number)
    random_series_ids = random.sample(list(unique_series_ids), min(5, len(unique_series_ids)))
    
    # Iterate through randomly selected series IDs for the patient
    for col, random_series_id in enumerate(random_series_ids, start=1):
        # Filter the DataFrame to get all images for the selected series
        series_images = patient_images[patient_images['series_id'] == random_series_id]
        
        # Show one labeled slice per series; drawing every slice into the
        # same subplot would leave only the last one visible
        image_row = series_images.iloc[0]
        plt.subplot(num_rows, num_cols, (row - 1) * num_cols + col)
        plt.imshow(read_dicom_image(image_row['img_path']), cmap='gray')
        plt.title(f'Patient ID: {random_patient_id}\nSeries ID: {random_series_id}')
        plt.axis('off')

plt.tight_layout()
plt.show()

**Preprocessing:**

* Rescaling: Adjusting the intensity values to a standard scale, e.g., between 0 and 1.
* Resizing: Making sure all images have the same size, especially if they are being fed into a neural network.
* Histogram Equalization: Enhancing the contrast of images.
* Normalization: Removing the mean and scaling to unit variance.
* Data Augmentation: Techniques such as rotation, zooming, and flipping to artificially increase the size of the dataset (useful for training deep learning models).
* Smoothing: Reducing noise, here with a Gaussian filter.
* Padding: Adding a border of zeros around the image.
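
Two of the steps listed above, normalization and augmentation, are not exercised by the code that follows. The sketch below illustrates them on a toy array. The helper names are hypothetical, and the horizontal flip is only one possible augmentation.

```python
import numpy as np

def standardize(img, eps=1e-8):
    """Remove the mean and scale to (approximately) unit variance."""
    img = np.asarray(img, dtype=np.float32)
    return (img - img.mean()) / (img.std() + eps)

def random_flip(img, rng):
    """Flip the image left-right with probability 0.5."""
    return np.fliplr(img) if rng.random() < 0.5 else img

# Toy 4x4 "image" for illustration only
rng = np.random.default_rng(0)
toy = np.arange(16, dtype=np.float32).reshape(4, 4)

standardized = standardize(toy)
augmented = random_flip(standardized, rng)
print(standardized.mean(), standardized.std())
print(augmented.shape)
```

Flipping abdominal CT slices left-right also swaps anatomical sides (for example liver and spleen), so whether this augmentation is appropriate depends on the task.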

In [None]:
import pydicom
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Load a sample DICOM image
sample_path = labels['img_path'].iloc[0]
dicom_img = pydicom.dcmread(sample_path).pixel_array

# Rescale the image to the range [0, 1]
rescaled_img = cv2.normalize(dicom_img, None, 0, 1, cv2.NORM_MINMAX, dtype=cv2.CV_32F)

# Apply histogram equalization
equalized_img = cv2.equalizeHist((rescaled_img * 255).astype(np.uint8))

# Plot original and preprocessed images side by side
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.imshow(dicom_img, cmap='gray')
plt.title('Original Image')
plt.subplot(1, 2, 2)
plt.imshow(equalized_img, cmap='gray')
plt.title('Preprocessed Image')
plt.show()

In [None]:
import pydicom
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Function to read a DICOM image and return its pixel array
def read_dicom_image(path):
    dicom_img = pydicom.dcmread(path)
    return dicom_img.pixel_array

# Load a sample DICOM image
sample_path = labels['img_path'].iloc[0]
dicom_img = read_dicom_image(sample_path)

# Rescale the image to the range [0, 1]
rescaled_img = cv2.normalize(dicom_img, None, 0, 1, cv2.NORM_MINMAX, dtype=cv2.CV_32F)

# Apply histogram equalization
equalized_img = cv2.equalizeHist((rescaled_img * 255).astype(np.uint8))

# Apply Gaussian smoothing
k_size = (5, 5)  # Kernel size for Gaussian filter
sigma = 0.5      # Standard deviation for Gaussian filter
smoothed_img = cv2.GaussianBlur(equalized_img, k_size, sigma)

# Define padding size (top, bottom, left, right)
padding_size = (20, 20, 20, 20)

# Apply zero-padding
padded_img = np.pad(smoothed_img, ((padding_size[0], padding_size[1]), (padding_size[2], padding_size[3])), mode='constant', constant_values=0)

# Plot original, rescaled, equalized, smoothed, and padded images
plt.figure(figsize=(20, 8))
plt.subplot(1, 5, 1)
plt.imshow(dicom_img, cmap='gray')
plt.title('Original Image')

plt.subplot(1, 5, 2)
plt.imshow(rescaled_img, cmap='gray')
plt.title('Rescaled Image')

plt.subplot(1, 5, 3)
plt.imshow(equalized_img, cmap='gray')
plt.title('Equalized Image')

plt.subplot(1, 5, 4)
plt.imshow(smoothed_img, cmap='gray')
plt.title('Smoothed Image')

plt.subplot(1, 5, 5)
plt.imshow(padded_img, cmap='gray')
plt.title('Padded Image')

plt.show()

In [None]:
import pydicom
import cv2
import numpy as np
import pandas as pd
import gc

def read_dicom_image(path):
    dicom_img = pydicom.dcmread(path)
    return dicom_img.pixel_array

def process_image(img):
    rescaled_img = cv2.normalize(img, None, 0, 1, cv2.NORM_MINMAX, dtype=cv2.CV_32F)
    equalized_img = cv2.equalizeHist((rescaled_img * 255).astype(np.uint8))
    k_size = (5, 5)
    sigma = 0.5
    smoothed_img = cv2.GaussianBlur(equalized_img, k_size, sigma)
    padding_size = (20, 20, 20, 20)
    padded_img = np.pad(smoothed_img, ((padding_size[0], padding_size[1]), (padding_size[2], padding_size[3])), mode='constant', constant_values=0)
    
    # Resize to a fixed size
    resized_img = cv2.resize(padded_img, (256, 256))
    
    return resized_img / 255.0  # normalize to [0,1]

def process_batch(batch):
    batch_images = []
    for _, row in batch.iterrows():
        img = read_dicom_image(row['img_path'])
        processed_img = process_image(img)
        batch_images.append(processed_img)
    # Add a trailing channel axis so the batch is (N, 256, 256, 1), matching a grayscale CNN input
    return np.stack(batch_images)[..., np.newaxis]

def image_generator(labels_df, batch_size):
    num_samples = len(labels_df)
    
    while True:
        for start in range(0, num_samples, batch_size):
            end = min(start + batch_size, num_samples)
            batch = labels_df.iloc[start:end]
            batch_images = process_batch(batch)
            
            yield batch_images
            
            # Free up memory
            del batch_images
            gc.collect()

# Sample usage
batch_size = 100
data_gen = image_generator(labels, batch_size=batch_size)

# To get the next batch of images:
# next_batch = next(data_gen)


In [None]:
labels.head()

**Modelling**

**Steps for Model Building:**
* Data Preparation: Split the data into training and validation sets.
* Data Augmentation: Use data augmentation techniques to artificially increase the size of the training dataset.
* Model Architecture: Define the CNN architecture.
* Model Compilation: Specify the loss function, optimizer, and metrics.
* Model Training: Train the model using the training data.
* Model Evaluation: Evaluate the model's performance on the validation data.

Merging Pre-Processed Labels and Train CSV

In [None]:

model_df = pd.merge(train, labels, on='patient_id', how='inner')
model_df.shape

In [None]:
model_df = model_df.drop_duplicates()
model_df.shape
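
Because `image_level_labels.csv` lists only slices annotated with an injury, the inner merge keeps only patients who appear in that file, so a patient-level target such as `any_injury` may end up containing a single class. A quick balance check before training can catch this. The helper below is a hypothetical sketch shown on toy values.

```python
from collections import Counter

def label_balance(values):
    """Return the fraction of rows per label value (hypothetical helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy targets for illustration only
toy_targets = [1, 1, 1, 0]
balance = label_balance(toy_targets)
print(balance)
```

On the merged frame this could be called as `label_balance(model_df['any_injury'])`. If only one label value comes back, accuracy on that target says little about how well the model detects injuries.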

**1. Data Preparation**

Split the data into training and validation sets.

In [None]:
from sklearn.model_selection import train_test_split

def batch_generator(data_df, batch_size):
    while True:
        for start in range(0, len(data_df), batch_size):
            end = min(start + batch_size, len(data_df))
            batch_df = data_df.iloc[start:end]
            batch_images = process_batch(batch_df)
            batch_labels = batch_df[y_train_columns].values
            yield batch_images, batch_labels

# Splitting patients into training (80%) and validation (20%) sets so that no patient appears in both
train_patient_ids, val_patient_ids = train_test_split(model_df['patient_id'].unique(), test_size=0.2, random_state=42)

# Use boolean indexing to filter rows in model_df based on patient IDs
train_df = model_df[model_df['patient_id'].isin(train_patient_ids)]
val_df = model_df[model_df['patient_id'].isin(val_patient_ids)]

# Define column names for labels
y_train_columns = [ 'any_injury']

# Create generators for training and validation
batch_size = 32
train_gen = batch_generator(train_df, batch_size=batch_size)
val_gen = batch_generator(val_df, batch_size=batch_size)

# Number of steps per epoch
train_steps_per_epoch = len(train_df) // batch_size
val_steps_per_epoch = len(val_df) // batch_size

# Sample training loop
# history = model.fit(train_gen, 
#                     steps_per_epoch=train_steps_per_epoch,
#                     validation_data=val_gen, 
#                     validation_steps=val_steps_per_epoch, 
#                     epochs=10)


In [None]:
# Convert the train and val patient IDs to sets
train_patient_ids_set = set(train_patient_ids)
val_patient_ids_set = set(val_patient_ids)

# Check for common patient IDs between train and val
common_patient_ids = train_patient_ids_set.intersection(val_patient_ids_set)

# Check if there are any common patient IDs
if len(common_patient_ids) > 0:
    print("Common Patient IDs between Train and Validation Sets:")
    print(common_patient_ids)
else:
    print("No Common Patient IDs between Train and Validation Sets")

**2. Model Architecture Definition**

In [None]:
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

# Create a simple CNN: four convolution/pooling stages followed by a dense classifier
model = Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(256, 256, 1)),  # 256x256 grayscale input
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(512, activation='relu'),
    layers.Dense(len(y_train_columns), activation='sigmoid')  # One sigmoid output per column in y_train_columns
])

**3. Model Compilation**

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


**4. Model Training**

In [None]:
sample_batch_images, sample_batch_labels = next(train_gen)
print(sample_batch_images.shape)


In [None]:
history = model.fit(train_gen, 
                    steps_per_epoch=train_steps_per_epoch,
                    validation_data=val_gen, 
                    validation_steps=val_steps_per_epoch, 
                    epochs=10)


**5. Model Evaluation**

In [None]:
# Evaluate the model on the validation generator
val_loss, val_accuracy = model.evaluate(val_gen, steps=val_steps_per_epoch)

print(f"Validation Loss: {val_loss:.4f}")
print(f"Validation Accuracy: {val_accuracy:.4f}")


**6. Visualize Training and Validation Accuracy**

In [None]:
import matplotlib.pyplot as plt

# Extract accuracy values from the history object
train_acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

# Extract loss values from the history object (optional if you want to plot loss as well)
train_loss = history.history['loss']
val_loss = history.history['val_loss']

# Create a range of values for epochs (starting from 1)
epochs = range(1, len(train_acc) + 1)

# Plot training and validation accuracy
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(epochs, train_acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

# Plot training and validation loss (optional)
plt.subplot(1, 2, 2)
plt.plot(epochs, train_loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()



In [None]:
# Extract a batch of images and labels from the validation generator
val_images, val_labels = next(val_gen)

# Predict using the model
predictions = model.predict(val_images)

# Convert predictions to binary labels
predicted_labels = (predictions > 0.5).astype(int)

# Visualize the first few images, true labels, and predicted labels
plt.figure(figsize=(15, 5))
for i in range(min(10, len(val_images))):  # Up to 10 images; the batch may hold fewer
    plt.subplot(2, 5, i+1)
    plt.imshow(val_images[i].squeeze(), cmap='gray')  # Assuming images are grayscale
    plt.title(f"True: {val_labels[i]}, Predicted: {predicted_labels[i]}")
    plt.axis('off')
plt.tight_layout()
plt.show()


In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Collect predictions and true labels over val_steps_per_epoch validation batches
all_predictions = []
all_true_labels = []

for _ in range(val_steps_per_epoch):
    val_images, val_labels = next(val_gen)
    predictions = model.predict(val_images, verbose=0)
    predicted_labels = (predictions > 0.5).astype(int)
    # Flatten the (batch, 1) arrays so the matrix is built from 1-D label vectors
    all_predictions.extend(predicted_labels.ravel())
    all_true_labels.extend(val_labels.ravel())

# Generate the confusion matrix; labels=[0, 1] keeps it 2x2 even if a class is absent
cm = confusion_matrix(all_true_labels, all_predictions, labels=[0, 1])

# Visualize the confusion matrix
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', cbar=False)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()
