## Data Acquisition (Assumes Data is Already Downloaded)

This notebook assumes you have already downloaded the LCTSC dataset from The Cancer Imaging Archive (TCIA). Download instructions are available on the TCIA website: [https://www.cancerimagingarchive.net/collections/lung-ct-segmentation-challenge-lctsc/](https://www.cancerimagingarchive.net/collections/lung-ct-segmentation-challenge-lctsc/). Because medical imaging data is large and downloading it programmatically from within a notebook is cumbersome, we focus on analyzing data that is already stored locally.

### Directory Structure:
It's recommended to organize your downloaded data into patient-specific directories. For example:
```
 LCTSC/
     LCTSC-Test-S0001/
         1-001.dcm
         1-002.dcm
         ...
     LCTSC-Test-S0002/
         ...
     ...
```
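
To confirm that your download matches this layout, you can count the DICOM files in each patient directory. The helper below is a minimal sketch, not part of the course library: the name `summarize_dataset` is hypothetical, and it assumes that every subdirectory of the root is one patient and that DICOM files use the `.dcm` extension.

```python
from pathlib import Path


def summarize_dataset(root):
    """Return {patient_directory_name: number_of_dcm_files} for each subdirectory of root."""
    root = Path(root)
    summary = {}
    for patient_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # rglob also counts files nested in study/series subfolders
        summary[patient_dir.name] = sum(1 for _ in patient_dir.rglob("*.dcm"))
    return summary
```

Calling `summarize_dataset("LCTSC")` on the tree above would return one entry per patient directory. A count of zero usually points to an incomplete download or an unexpected folder layout.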

## Mount the Google Drive where your LCTSC data is stored:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Let's install the Python module for this course

Clone the library that manages the LCTSC dataset from GitHub. Have a look at the repository [here](https://github.com/AI-For-Experimental-And-Applied-Physics/LCTSCProject2024).

In [None]:
!git clone https://github.com/AI-For-Experimental-And-Applied-Physics/LCTSCProject2024.git

## Introduction

This notebook uses the downloaded library to generate the NumPy arrays used to train the neural network. The next cell installs the library's dependencies.

In [None]:
!pip install -r LCTSCProject2024/requirements.txt

### Load Required Modules and Append Git Repository to Path

The next cell imports the required Python modules and appends the cloned Git repository to `sys.path`, so that the custom `lctsc_preprocessor` library for managing the LCTSC dataset can be imported in the notebook.

In [None]:
import sys
from pathlib import Path

# Add the cloned LCTSCProject2024 repository to the Python path
sys.path.append("LCTSCProject2024/")

import pandas as pd
from lctsc_preprocessor.utils import preprocess_case
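
If the import fails, the usual cause is that the relative path `LCTSCProject2024/` does not resolve from the notebook's current working directory. The sketch below is one way to check this before importing. The helper name `add_repo_to_path` is hypothetical, and it uses only the standard library.

```python
import importlib.util
import sys
from pathlib import Path


def add_repo_to_path(repo_dir, package_name):
    """Append repo_dir to sys.path (once) and report whether package_name can be found."""
    repo_dir = str(Path(repo_dir).resolve())  # absolute path, independent of later chdir calls
    if repo_dir not in sys.path:
        sys.path.append(repo_dir)
    importlib.invalidate_caches()  # make sure newly created directories are seen
    # find_spec locates a top-level package without importing it
    return importlib.util.find_spec(package_name) is not None
```

For example, `add_repo_to_path("LCTSCProject2024/", "lctsc_preprocessor")` should return `True` once the repository has been cloned next to the notebook.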

### Metadata and Raw Data Paths Initialization

The next block sets the path to the metadata CSV file (`metadata_file`), reads that file into a DataFrame (`metadata_df`), and sets the directory containing the raw data (`raw_data_path`). A flag (`plot`) enables or disables plotting during preprocessing. The remaining cells rely on these variables to locate and process the LCTSC dataset.

In [None]:
metadata_file = "PATH/TO/YOUR/lctsc_metadata.csv"  # Replace with the actual path to your metadata file
metadata_df = pd.read_csv(metadata_file)

raw_data_path = "PATH/TO/YOUR/RAW/DATA"  # Replace with the actual path to your raw data directory
plot = True
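
Both paths above are placeholders that you must replace, so it can help to check them before going further. The following is a minimal sketch; `find_path_problems` is a hypothetical helper, not part of the course library.

```python
from pathlib import Path


def find_path_problems(metadata_file, raw_data_path):
    """Return a list of human-readable problems; an empty list means both paths look fine."""
    problems = []
    if not Path(metadata_file).is_file():
        problems.append(f"metadata file not found: {metadata_file}")
    if not Path(raw_data_path).is_dir():
        problems.append(f"raw data directory not found: {raw_data_path}")
    return problems
```

Calling `find_path_problems(metadata_file, raw_data_path)` with the unmodified placeholders returns two messages, which is a quick reminder that the paths still need to be set.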

### Reading Metadata and Populating Patient Dictionary

In the next code block, the rows of the metadata DataFrame loaded above are used to populate a dictionary (`patient_dict`). Each patient's unique identifier (`patient_id`) maps to the paths of that patient's CT and RTSTRUCT files, so later steps can look up both with a single key.

In [None]:
# Build a mapping: patient_id -> {modality: path to the images}
patient_dict = {}

# itertuples() is more efficient than iterrows(). Columns whose names are not
# valid Python identifiers are exposed by position as '_<position>' ('_5', '_16').
for row in metadata_df.itertuples():
    patient_id = getattr(row, '_5')  # unique name of the patient
    file_type = getattr(row, 'Modality')  # either RTSTRUCT or CT
    file_location = getattr(row, '_16')  # path to the images
    # Create the inner dictionary on first sight of a patient, then add the path
    patient_dict.setdefault(patient_id, {})[file_type] = file_location
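
The attribute names `'_5'` and `'_16'` come from how pandas builds the row tuples: a column whose name is not a valid Python identifier (for example, one containing a space) is renamed to `_<position>`. The toy example below illustrates this. The column names `Subject ID` and `File Location` and all values are assumptions chosen for illustration, since the real metadata header is not shown here.

```python
import pandas as pd

# Toy metadata table; column names and values are illustrative assumptions
toy_df = pd.DataFrame(
    {
        "Subject ID": ["P1", "P1", "P2"],
        "Modality": ["CT", "RTSTRUCT", "CT"],
        "File Location": ["./P1/ct", "./P1/rtstruct", "./P2/ct"],
    }
)

# Names with spaces are replaced by their position in the tuple
first_row = next(toy_df.itertuples())
print(first_row._fields)  # ('Index', '_1', 'Modality', '_3')

# Selecting columns by name avoids depending on their position
toy_dict = {}
columns = ["Subject ID", "Modality", "File Location"]
for subject, modality, location in toy_df[columns].itertuples(index=False, name=None):
    toy_dict.setdefault(subject, {})[modality] = location
```

Selecting the columns by name, as in the second loop, keeps the code working if columns are added or reordered, whereas positional names such as `'_5'` silently point at a different column.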

### Function `preprocess_case`

The `preprocess_case` function is a utility from the `lctsc_preprocessor.utils` module. It is used to process individual patient cases by taking the CT and RTSTRUCT file paths as inputs. This function performs the following tasks:

1. **Data Preprocessing**: It processes the CT images and associated RTSTRUCT files to extract relevant information.
2. **Numpy Array Generation**: Converts the processed data into numpy arrays, which are essential for training machine learning models, particularly neural networks.
3. **Visualization (Optional)**: If the `plot` flag is set to `True`, it generates visualizations of the processed data for verification and debugging purposes.

In the next block, we iterate through the `patient_dict` dictionary, which contains patient-specific data, and use the `preprocess_case` function to generate the numpy arrays required for training the neural network.

In [None]:
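raw_data_dir = Path(raw_data_path)
for case_id, paths in patient_dict.items():
    ct_path = raw_data_dir / paths["CT"]  # path to the CT images
    # The RTSTRUCT file is expected to be named '1-1.dcm' inside its folder
    rtstruct_path = raw_data_dir / paths["RTSTRUCT"] / "1-1.dcm"
    print(ct_path)
    preprocess_case(
        case_id=case_id,
        ct_path=ct_path,
        rtstruct_path=rtstruct_path,
        plot=plot,
    )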
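
The loop above looks up both the `"CT"` and the `"RTSTRUCT"` entry for every patient, so a patient with only one modality in the metadata would stop it with a `KeyError`. One possible safeguard, sketched here with a hypothetical helper name, is to separate complete from incomplete cases first.

```python
def split_complete_cases(patient_dict, required=("CT", "RTSTRUCT")):
    """Split patients into those with all required modalities and those missing some."""
    complete, incomplete = {}, {}
    for patient_id, paths in patient_dict.items():
        missing = [modality for modality in required if modality not in paths]
        if missing:
            incomplete[patient_id] = missing  # remember what is missing, for reporting
        else:
            complete[patient_id] = paths
    return complete, incomplete
```

You could then iterate over `complete` only and print `incomplete` to see which patients were skipped and why.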

### Student Exercises

In this section, you will perform tasks to deepen your understanding of the data preprocessing pipeline and prepare the data for training a neural network. Follow the steps below and answer the questions:

1. **Create a New Notebook**:
    - Write a new Jupyter Notebook that reads the numpy files generated in the previous steps.
    - Explore the contents of the numpy files and document your findings.

2. **Analyze the Data**:
    - Determine the voxel size in mm along each axis (and hence the voxel volume in mm³). Is there metadata or information in the numpy files that provides this detail?
    - Check if all the images have the same shape. If not, document the differences and consider how this might impact training.

3. **Implement a Keras Sequence Class**:
    - Write a custom `keras.utils.Sequence` class capable of reading the numpy files.
    - The class should output:
      - The input image.
      - The label mask of the lung.
    - Bonus: Extend the implementation to support multi-class segmentation for those interested in advanced tasks.

4. **Document Your Workflow**:
    - For each step, write a brief explanation of what you did and why.
    - Include any challenges you faced and how you resolved them.
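
As a possible starting point for the shape check in exercise 2, the sketch below counts how often each array shape occurs in a directory. It assumes that each case is stored as a single array in its own `.npy` file. The actual file layout produced by the preprocessing may differ, so verify this first. The function name `shape_histogram` is hypothetical.

```python
from collections import Counter
from pathlib import Path

import numpy as np


def shape_histogram(directory, pattern="*.npy"):
    """Count how many .npy files in directory have each array shape."""
    shapes = Counter()
    for path in sorted(Path(directory).glob(pattern)):
        # Memory-mapping avoids loading the full volume just to read its shape
        array = np.load(path, mmap_mode="r")
        shapes[array.shape] += 1
    return shapes
```

If the resulting counter has more than one key, the volumes differ in shape and will need cropping, padding, or resampling before they can be batched together.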

These exercises will help you understand the data preparation process and set the foundation for training a robust neural network.