## Data Acquisition (Assumes Data is Already Downloaded)

This notebook assumes you have already downloaded the LCTSC dataset from the TCIA. Instructions for downloading can be found on the TCIA website: [https://www.cancerimagingarchive.net/collections/lung-ct-segmentation-challenge-lctsc/](https://www.cancerimagingarchive.net/collections/lung-ct-segmentation-challenge-lctsc/). Due to the size of medical imaging data and the complexities of direct programmatic download within a notebook environment, we will focus on analyzing data that is assumed to be locally stored.

### Directory Structure:
It's recommended to organize your downloaded data into patient-specific directories. For example:
```
 LCTSC/
     LCTSC-Test-S0001/
         1-001.dcm
         1-002.dcm
         ...
     LCTSC-Test-S0002/
         ...
     ...
```
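Before pointing the notebook at your data, it can help to confirm that the folder matches this layout. The sketch below uses a hypothetical helper, `count_dicom_files`, which is not part of the notebook. It assumes each patient has its own sub-directory and counts the `.dcm` files found anywhere beneath it. It is demonstrated on a throwaway directory with empty placeholder files, so it runs without the real dataset.

```python
import tempfile
from pathlib import Path

def count_dicom_files(root_dir):
    """Return {patient_directory_name: number of .dcm files beneath it}."""
    counts = {}
    for patient_dir in sorted(Path(root_dir).iterdir()):
        if patient_dir.is_dir():
            # rglob also finds files in nested study/series sub-folders
            counts[patient_dir.name] = sum(1 for _ in patient_dir.rglob('*.dcm'))
    return counts

# Demonstration on a temporary directory with empty placeholder files (not real DICOM data)
with tempfile.TemporaryDirectory() as tmp:
    for patient, n_files in [('LCTSC-Test-S0001', 3), ('LCTSC-Test-S0002', 2)]:
        patient_path = Path(tmp) / 'LCTSC' / patient
        patient_path.mkdir(parents=True)
        for i in range(1, n_files + 1):
            (patient_path / f'1-{i:03d}.dcm').touch()
    demo_counts = count_dicom_files(Path(tmp) / 'LCTSC')

print(demo_counts)
```

On the real data, the same function could be called on the directory that contains the patient folders; a patient with zero files usually points to an incomplete download or a wrong path.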

## Define the root directory where your LCTSC data is stored:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Now you can access files in your Google Drive under /content/drive/MyDrive/
data_base_dir = '/content/drive/MyDrive/2024_2025/ArtificialIntelligenceForPhysics/DATA' # Adjust the path accordingly
data_csv = '/content/drive/MyDrive/2024_2025/ArtificialIntelligenceForPhysics/DATA/lctsc_metadata.csv' # Adjust the path accordingly

## Introduction

This notebook guides you through working with lung CT scan data from The Cancer Imaging Archive (TCIA), specifically the Lung CT Segmentation Challenge (LCTSC) dataset. Starting from a locally stored copy of the data, you will learn how to read the DICOM files, visualize individual slices and 3D volumes, perform statistical analysis, and apply basic Machine Learning techniques to the dataset.

## Prerequisites:
 - Advanced knowledge of Python programming.
 - Familiarity with basic concepts of medical imaging (e.g., DICOM format, CT scans).
 - Installation of the required Python libraries (see "Setup and Imports" section).

## Learning Objectives:
 - Learn how to programmatically interact with TCIA data (though this notebook will focus on downloaded data).
 - Understand the structure of DICOM files and how to extract relevant information.
 - Visualize 2D slices and reconstruct 3D volumes from CT scans.
 - Perform statistical analysis on the dataset (e.g., patient demographics, image characteristics).
 - Apply basic Machine Learning techniques for classification or prediction tasks (e.g., using image features).

In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn pydicom ipywidgets itkwidgets

In [None]:
# Import necessary Python libraries. Make sure you have these installed (`pip install pandas numpy matplotlib seaborn scikit-learn pydicom ipywidgets itkwidgets`).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pydicom
import os

from ipywidgets import interact, IntSlider
import itkwidgets as viewer

## Exploring the Dataset Structure and Metadata

Let's start by exploring the directory structure and reading the metadata from the DICOM files.

In [None]:
# Load the metadata.csv file
metadata_df = pd.read_csv(data_csv)

# Display the first few rows and column information
print(metadata_df.head())
print(metadata_df.info())

In [None]:
# Build a dictionary that maps each patient ID to its series locations, keyed by modality
patient_dict = {}

# itertuples() is more efficient than iterrows(). Column names that are not valid
# Python identifiers are replaced by positional names (_1, _2, ...), hence '_5' and '_16'.
for row in metadata_df.itertuples():
  patient_id = getattr(row, '_5') # unique patient name (5th column)
  file_type = getattr(row, 'Modality') # 'CT' or 'RTSTRUCT'
  file_location = getattr(row, '_16') # path to the images (16th column)
  # setdefault() creates the inner dict the first time a patient is seen
  patient_dict.setdefault(patient_id, {})[file_type] = file_location

print(patient_dict)
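The positional names `_5` and `_16` used above come from how `itertuples()` treats column names that are not valid Python identifiers. The sketch below illustrates this on a toy table. The column names `Subject ID` and `File Location` and the paths are assumptions made for the illustration; check `metadata_df.columns` for the names in your own file. It also shows one way to build the same dictionary by column name, which does not depend on the column order.

```python
import pandas as pd

# Toy table; the column names and paths are assumptions made for this illustration
toy_df = pd.DataFrame({
    'Subject ID': ['LCTSC-Test-S0001', 'LCTSC-Test-S0001', 'LCTSC-Test-S0002'],
    'Modality': ['CT', 'RTSTRUCT', 'CT'],
    'File Location': ['./a/ct', './a/rt', './b/ct'],
})

# 'Subject ID' and 'File Location' contain spaces, so itertuples() renames them by position
first_row = next(toy_df.itertuples())
print(first_row._fields)

# Indexing by column name avoids depending on the column order
toy_patient_dict = {}
for subject, modality, location in zip(
        toy_df['Subject ID'], toy_df['Modality'], toy_df['File Location']):
    toy_patient_dict.setdefault(subject, {})[modality] = location

print(toy_patient_dict)
```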

In [None]:
def load_patient_dicom_files(patient_dir):
    """Loads all DICOM files from a patient directory."""
    dicom_files = []
    for filename in sorted(os.listdir(patient_dir)):  # sorted, because os.listdir() returns files in arbitrary order
        if filename.endswith('.dcm'):
            file_path = os.path.join(patient_dir, filename)
            try:
                dicom_files.append(pydicom.dcmread(file_path))
            except Exception as e:
                print(f"Error reading {file_path}: {e}")
    return dicom_files

def get_patient_metadata(dicom_files):
    """Extracts relevant metadata from a list of DICOM files for a patient."""
    if not dicom_files:
        return None
    first = dicom_files[0]  # patient-level fields are read from the first file only
    metadata = {
        # getattr() returns None when a tag is missing from the file
        'PatientID': getattr(first, 'PatientID', None),
        'PatientSex': getattr(first, 'PatientSex', None),
        'PatientAge': getattr(first, 'PatientAge', None),
        'StudyDate': getattr(first, 'StudyDate', None),
        'SeriesDescription': getattr(first, 'SeriesDescription', None),
        'NumberOfSlices': len(dicom_files),
        # Add more relevant metadata as needed
    }
    return metadata

# Iterate through the patient directories and collect metadata
all_metadata = []

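for patient_id, series in patient_dict.items():
  ct_location = series.get('CT')  # .get() avoids a KeyError for patients without a CT series
  if ct_location:
    dicom_files = load_patient_dicom_files(os.path.join(data_base_dir, ct_location.replace('./','')))
    if dicom_files:
        metadata = get_patient_metadata(dicom_files)
        all_metadata.append(metadata)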

# Create a Pandas DataFrame from the collected metadata
# (note: this replaces the CSV-based metadata_df loaded earlier)
metadata_df = pd.DataFrame(all_metadata)

# Display the first few rows of the metadata DataFrame
print("First few rows of the metadata DataFrame:")
print(metadata_df.head())

# Basic statistics of the metadata
print("\nBasic statistics of the metadata:")
print(metadata_df.describe(include='all'))



## Visualizing Individual CT Slices

Let's visualize individual slices from a CT scan.

In [None]:
def display_slice(dicom_file):
    """Displays a single CT scan slice."""
    plt.figure(figsize=(8, 8))
    plt.imshow(dicom_file.pixel_array, cmap=plt.cm.gray)
    plt.title(f"Slice: {dicom_file.ImagePositionPatient[2]:.2f}")
    plt.xlabel("Pixels")
    plt.ylabel("Pixels")
    plt.axis('off')
    plt.show()

# Example: Load and display a slice from the first patient
first_key = next(iter(patient_dict))
first_value = patient_dict[first_key]

ct_location = first_value.get('CT')  # None if the first patient has no CT series
if ct_location:
  first_patient_files = load_patient_dicom_files(os.path.join(data_base_dir,ct_location.replace('./','')))
  if first_patient_files:
      display_slice(first_patient_files[len(first_patient_files) // 2]) # Display the slice in the middle of the loaded list


In [None]:
# Interactive slice viewer
def interactive_slice_viewer(dicom_files):
    """Interactive viewer for browsing through CT scan slices."""
    slices = [dcm.pixel_array for dcm in dicom_files]
    def show_slice(slice_num):
        plt.figure(figsize=(8, 8))
        plt.imshow(slices[slice_num], cmap=plt.cm.gray)
        plt.title(f"Slice: {slice_num}")
        plt.xlabel("Pixels")
        plt.ylabel("Pixels")
        plt.axis('off')
        plt.show()

    interact(show_slice, slice_num=IntSlider(min=0, max=len(slices) - 1, step=1, description='Slice Number'))

# Example: Interactive slice viewing for the first patient
first_key = next(iter(patient_dict))
first_value = patient_dict[first_key]

ct_location = first_value.get('CT')  # None if the first patient has no CT series
if ct_location:
  first_patient_files = load_patient_dicom_files(os.path.join(data_base_dir,ct_location.replace('./','')))
  if first_patient_files:
      interactive_slice_viewer(first_patient_files)


## Reconstructing and Visualizing 3D Volumes

To visualize the 3D volume, we need to stack the 2D slices.

In [None]:

def load_and_sort_slices(patient_dir):
    """Loads DICOM files and sorts them by slice position."""
    slices = [pydicom.dcmread(os.path.join(patient_dir, filename)) for filename in os.listdir(patient_dir) if filename.endswith('.dcm')]
    slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))
    return slices

def get_pixel_array_3d(slices):
    """Combines a list of DICOM slices into a 3D NumPy array."""
    image = np.stack([s.pixel_array for s in slices])
    # Convert to Hounsfield Units (HU) if the Rescale Slope and Intercept are present
    if hasattr(slices[0], 'RescaleIntercept') and hasattr(slices[0], 'RescaleSlope'):
        image = image * slices[0].RescaleSlope + slices[0].RescaleIntercept
    return image

# Interactive slice viewer for a 3D volume
def interactive_slice_sorted_viewer(slices):
    """Interactive viewer for browsing a 3D array indexed as [slice, row, column]."""
    def show_slice(slice_num):
        plt.figure(figsize=(8, 8))
        plt.imshow(slices[slice_num], cmap=plt.cm.gray)
        plt.title(f"Slice: {slice_num}")
        plt.xlabel("Pixels")
        plt.ylabel("Pixels")
        plt.axis('off')
        plt.show()

    interact(show_slice, slice_num=IntSlider(min=0, max=len(slices) - 1, step=1, description='Slice Number'))

# Example: Load the sorted slices, build a 3D volume, and browse it for the first patient
first_key = next(iter(patient_dict))
first_value = patient_dict[first_key]

ordered_patient_dicom_files = []  # stays empty if the first patient has no CT series
if first_value.get('CT'):
  ordered_patient_dicom_files = load_and_sort_slices(os.path.join(data_base_dir,first_value['CT'].replace('./','')))

if ordered_patient_dicom_files:
    volume_3d = get_pixel_array_3d(ordered_patient_dicom_files)
    print("\nShape of the 3D volume:", volume_3d.shape)
    print("\nInteractive visualization:")
    interactive_slice_sorted_viewer(volume_3d)
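The three steps used in this section (sort by slice position, stack, rescale to Hounsfield Units) can be followed on a tiny synthetic example that needs no DICOM files. The positions, pixel values, slope, and intercept below are made-up values chosen for illustration; in real data they come from `ImagePositionPatient`, `pixel_array`, `RescaleSlope`, and `RescaleIntercept`.

```python
import numpy as np

# Synthetic stand-ins for DICOM slices: (z position in mm, 2x2 stored pixel values)
toy_slices = [
    (10.0, np.full((2, 2), 1024)),
    (0.0, np.full((2, 2), 0)),
    (5.0, np.full((2, 2), 24)),
]
rescale_slope, rescale_intercept = 1.0, -1024.0  # assumed values for this illustration

# Sort by z position so that neighbouring slices end up adjacent in the volume
toy_slices.sort(key=lambda item: item[0])
toy_volume = np.stack([pixels for _, pixels in toy_slices])

# Linear rescale from stored values to Hounsfield Units: HU = stored * slope + intercept
toy_volume_hu = toy_volume * rescale_slope + rescale_intercept

print(toy_volume_hu.shape)     # (slices, rows, columns)
print(toy_volume_hu[:, 0, 0])  # one voxel per slice, in sorted order
```

With these assumed parameters, a stored value of 0 maps to -1024 HU and a stored value of 1024 maps to 0 HU, the value of water on the Hounsfield scale.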



## Statistical Analysis of the Dataset

Now, let's perform some statistical analysis on the metadata we collected.

In [None]:
# Distribution of Patient Sex
if 'PatientSex' in metadata_df.columns:
    plt.figure(figsize=(6, 4))
    sns.countplot(data=metadata_df, x='PatientSex')
    plt.title('Distribution of Patient Sex')
    plt.xlabel('Sex')
    plt.ylabel('Number of Patients')
    plt.show()

In [None]:
# Distribution of Patient Age
if 'PatientAge' in metadata_df.columns:
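    # DICOM stores ages as strings such as '065Y', which pd.to_numeric cannot parse directly.
    # Extract the digits first; this assumes every age is given in years (the 'Y' suffix).
    metadata_df['PatientAge'] = pd.to_numeric(metadata_df['PatientAge'].astype(str).str.extract(r'(\d+)')[0], errors='coerce')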
    plt.figure(figsize=(8, 6))
    sns.histplot(metadata_df['PatientAge'].dropna(), bins=20, kde=True)
    plt.title('Distribution of Patient Age')
    plt.xlabel('Age')
    plt.ylabel('Number of Patients')
    plt.show()
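The DICOM Age String format consists of three digits followed by a unit letter: `D` (days), `W` (weeks), `M` (months), or `Y` (years). The sketch below shows one possible parser that also handles the non-year units. `parse_dicom_age` is a hypothetical helper, and the conversion factors to years are approximations.

```python
import re

def parse_dicom_age(age_string):
    """Convert a DICOM Age String such as '065Y' to approximate years, or None."""
    match = re.fullmatch(r'(\d{3})([DWMY])', str(age_string).strip())
    if match is None:
        return None  # missing or malformed value
    value, unit = int(match.group(1)), match.group(2)
    # Approximate number of each unit per year
    units_per_year = {'D': 365.25, 'W': 52.18, 'M': 12.0, 'Y': 1.0}
    return value / units_per_year[unit]

print(parse_dicom_age('065Y'))
print(parse_dicom_age('018M'))
print(parse_dicom_age(''))
```

Applied to the DataFrame, this could look like `metadata_df['PatientAge'].map(parse_dicom_age)`, which leaves missing values as `None`.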

In [None]:
# Distribution of Number of Slices per Patient
if 'NumberOfSlices' in metadata_df.columns:
    plt.figure(figsize=(8, 6))
    sns.histplot(metadata_df['NumberOfSlices'], bins=30, kde=True)
    plt.title('Distribution of Number of Slices per Patient')
    plt.xlabel('Number of Slices')
    plt.ylabel('Number of Patients')
    plt.show()

You can add more statistical analyses here, such as:
- Analysis of 'StudyDate'
- Distribution of 'SeriesDescription'
- Correlations between numerical features (if available)
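
As a starting point for the 'StudyDate' analysis, the sketch below parses DICOM date strings (format `YYYYMMDD`) and counts studies per year. The example dates are made up, and `parse_dicom_date` is a hypothetical helper rather than part of the notebook.

```python
from collections import Counter
from datetime import datetime

def parse_dicom_date(date_string):
    """Parse a DICOM date ('YYYYMMDD') into a datetime.date, or return None."""
    try:
        return datetime.strptime(str(date_string), '%Y%m%d').date()
    except ValueError:
        return None  # missing or malformed value

# Made-up example values, not taken from the dataset
example_dates = ['20040211', '20040615', '20051103', None, 'unknown']
parsed = [parse_dicom_date(d) for d in example_dates]
studies_per_year = Counter(d.year for d in parsed if d is not None)

print(studies_per_year)
```

In the notebook, the same function could be mapped over `metadata_df['StudyDate']` before plotting the counts.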

## Basic Machine Learning Application (Example: Predicting Patient Sex from the Number of Slices)

This is a simplified example to demonstrate how ML techniques can be applied. We will try to predict the patient's sex from the number of slices in the CT series. **Note:** This is not a clinically relevant task but serves as an educational illustration.

### Prepare the data for ML

In [None]:
if 'PatientSex' in metadata_df.columns and 'NumberOfSlices' in metadata_df.columns:
    # One-hot encode the sex; this yields a 'PatientSex_M' column when both 'F' and 'M' occur in the data
    metadata_df_encoded = pd.get_dummies(metadata_df, columns=['PatientSex'], drop_first=True)
    metadata_df_ml = metadata_df_encoded[['PatientSex_M', 'NumberOfSlices']].dropna() # encoded sex (target) and slice count (feature)

    X = metadata_df_ml[['NumberOfSlices']]
    y = metadata_df_ml['PatientSex_M']

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a simple Logistic Regression model (a classifier, which suits the binary target)
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    print("\n--- Basic Machine Learning Example ---")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred, zero_division=0))
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
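
An accuracy value is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most common training label; a model that does not beat this number has learned little from the feature. The helper `majority_baseline_accuracy` and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Made-up labels (1 = male, 0 = female), not taken from the dataset
toy_train = [1, 1, 1, 0, 0, 1, 0, 1]
toy_test = [1, 0, 1, 1]
baseline = majority_baseline_accuracy(toy_train, toy_test)
print(baseline)
```

In the notebook, the corresponding call would be `majority_baseline_accuracy(y_train, y_test)`.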

This ML training example only shows how to extract data from the DICOM files and how to apply a simple ML model. Further ML ideas could be:
 - Search for other useful fields in the DICOM files or in the metadata file.
 - Feature engineering from image data (requires more advanced image processing).
 - Predicting other patient characteristics or outcomes (if available in the dataset or annotations).
 - Exploring different ML models (e.g., Support Vector Machines, Random Forests).
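
As one possible first step towards feature engineering from image data, the sketch below summarises a volume in Hounsfield Units with a few scalar features. The function name `intensity_features` and the -500 HU cut-off for air-like voxels are assumptions chosen for illustration, and the volume is synthetic.

```python
import numpy as np

def intensity_features(volume_hu, air_threshold_hu=-500.0):
    """Summarise a volume in HU with a few scalar features.

    air_threshold_hu is an assumed cut-off for air-like voxels.
    """
    volume_hu = np.asarray(volume_hu, dtype=float)
    return {
        'mean_hu': float(volume_hu.mean()),
        'std_hu': float(volume_hu.std()),
        'air_fraction': float((volume_hu < air_threshold_hu).mean()),
    }

# Synthetic 2x2x2 volume: half air-like voxels (-1000 HU), half water-like voxels (0 HU)
toy_scan_hu = np.array([[[-1000, -1000], [-1000, -1000]],
                        [[0, 0], [0, 0]]])
features = intensity_features(toy_scan_hu)
print(features)
```

Features like these could be computed per patient from the 3D volume and added as extra columns to the metadata table before training.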

## Conclusion

This notebook provided a foundational workflow for working with lung CT scan data from TCIA. You learned how to:
- Load and explore DICOM metadata.
- Visualize individual CT slices interactively.
- Reconstruct and visualize 3D volumes.
- Perform basic statistical analysis on the dataset's metadata.
- Implement a simple Machine Learning example using the metadata.

### Further Exploration:
- Investigate the segmentation masks associated with the LCTSC dataset.
- Implement more advanced image processing techniques (e.g., lung segmentation).
- Consider the ethical implications of working with medical imaging data.