# Preprocessing of IMC Data For Image Segmentation

### **Authors:** oscardong4@gmail.com and heeva.baharlou@gmail.com (29/04/2024). Script adapted from the [BodenmillerGroup `imc_preprocessing.ipynb` notebook](https://github.com/BodenmillerGroup/ImcSegmentationPipeline/blob/main/scripts/imc_preprocessing.ipynb)

## Requirements for input data in `raw` folder

The `raw` folder should contain **2 types** of files:

#### **File Type #1:** The zipped `.mcd` files

The Hyperion Imaging System produces vendor-controlled `.mcd` and `.txt` files in the following folder structure:

> ├── {XYZ}.mcd <br>
├── {XYZ}_ROI_001_1.txt <br>
├── {XYZ}_ROI_002_2.txt <br>
├── {XYZ}_ROI_003_3.txt <br>

where `XYZ` defines the filename, `ROI_001`, `ROI_002` and `ROI_003` are the descriptions of the selected regions of interest (ROIs), and `1`, `2` and `3` are the acquisition identifiers. The ROI description can be specified in the Fluidigm software when selecting ROIs. The `.mcd` file contains the raw imaging data of all acquired ROIs, while each `.txt` file contains the data of a single ROI.

To enforce a consistent naming scheme and to bundle all metadata, **zip the folder** containing these files before preprocessing. Each `.zip` file should contain data from only a **single** `.mcd` file (and any associated `.txt` files), and should be named `Sample1.zip`, `Sample2.zip` and so on.

Your file directory should look something like this:

> raw <br>
├── Sample1.zip <br>
│   ├── {A}.mcd <br>
│   ├── {A}_ROI_001_1.txt <br>
│   ├── {A}_ROI_002_2.txt <br>
│   └── {A}_ROI_003_3.txt <br>
├── Sample2.zip <br>
│   ├── {B}.mcd <br>
│   ├── {B}_ROI_001_1.txt <br>
│   ├── {B}_ROI_002_2.txt <br>
│   └── {B}_ROI_003_3.txt <br>
└── ... <br>
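
If you prefer to create the archives from Python rather than with your file manager, the zipping step can be sketched as below. This is a minimal illustration, not part of the original pipeline: the function name `zip_sample_folder` is hypothetical, and it assumes the `.mcd` and `.txt` files sit directly inside one acquisition folder.

```python
from pathlib import Path
import zipfile


def zip_sample_folder(folder, zip_path):
    """Bundle the .mcd and .txt files found directly in `folder` into `zip_path`.

    Hypothetical helper: raises if the folder does not hold exactly one .mcd file.
    """
    folder = Path(folder)
    zip_path = Path(zip_path)

    # Keep only visible .mcd/.txt files; anything else in the folder is ignored
    members = sorted(
        p for p in folder.iterdir()
        if p.is_file()
        and p.suffix.lower() in {".mcd", ".txt"}
        and not p.name.startswith(".")
    )

    mcd_files = [p for p in members if p.suffix.lower() == ".mcd"]
    if len(mcd_files) != 1:
        raise ValueError(
            f"Expected exactly one .mcd file in {folder}, found {len(mcd_files)}"
        )

    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for member in members:
            # Store each file at the top level of the archive
            archive.write(member, arcname=member.name)
    return zip_path
```

For example, `zip_sample_folder("acquisition_A", "raw/Sample1.zip")` would create `Sample1.zip` from a folder called `acquisition_A` (both paths are placeholders).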

#### **File Type #2:** The `panel.csv` file

The panel file (in `.csv` format) lists the antibodies that were used in the experiment, plus any additional channels (e.g. metals used for counterstaining) that you want to include in downstream processing. Example entries in the `panel.csv` file should look like this:

<div align="center">

| Metal Tag | Target      | Full | Segment |
|-----------|-------------|------|---------|
| Nd145     | CD83        | 1    | 0       |
| Nd146     | CD8         | 1    | 1       |
| Sm147     | Podoplanin  | 1    | 0       |
| Nd148     | CD16        | 1    | 0       |

</div>

- **Metal Tag**: indicates the isotope used
- **Target**: indicates the target marker for the particular isotope
- **Full**: a `1` specifies channels you wish to analyse later (e.g. channels for which you want to calculate marker intensities), while a `0` specifies channels you do not wish to analyse
- **Segment**: a `1` specifies channels that will specifically be used for segmentation in Cellpose later, while a `0` indicates channels that will not be used
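
Before running the pipeline, it can help to confirm that `panel.csv` follows the column rules above. The sketch below is one possible check, not part of the original script: `check_panel` is a hypothetical helper that uses only the standard library, and the rule that at least one channel must have `Segment` set to `1` is an assumption based on the segmentation stack needing some content.

```python
import csv

REQUIRED_COLUMNS = ["Metal Tag", "Target", "Full", "Segment"]


def check_panel(panel_path):
    """Return a list of human-readable problems found in a panel.csv file."""
    # utf-8-sig tolerates the byte-order mark that some spreadsheet exports add
    with open(panel_path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in header]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        rows = list(reader)

    def cell(row, column):
        # Short rows yield None for missing cells, so normalise to a string
        return (row.get(column) or "").strip()

    problems = []
    for column in ("Full", "Segment"):
        bad = [cell(r, "Metal Tag") for r in rows if cell(r, column) not in {"0", "1"}]
        if bad:
            problems.append(f"'{column}' must be 0 or 1 for: {', '.join(bad)}")

    # Assumption: a segmentation stack without any channel is never intended
    if not any(cell(r, "Segment") == "1" for r in rows):
        problems.append("no channel has Segment set to 1")
    return problems
```

An empty list means no problems were detected by these particular checks; it does not guarantee that the metal tags match the channels in your `.mcd` files.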

Your `raw` folder should look like this:

> raw <br>
├── panel.csv <br>
├── Sample1.zip <br>
├── Sample2.zip <br>
├── Sample3.zip <br>
└── ... <br>
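
The layout above can also be checked programmatically. The following is a rough sketch under a few assumptions: `check_raw_folder` is a hypothetical helper, the archives sit directly inside `raw`, and their names match the `*Sample*.zip` pattern used later in this notebook.

```python
from pathlib import Path
import zipfile


def check_raw_folder(raw_dir, pattern="*Sample*.zip"):
    """Return a list of problems with the layout of the raw folder."""
    raw = Path(raw_dir)
    problems = []

    if not (raw / "panel.csv").is_file():
        problems.append("panel.csv not found")

    zip_files = sorted(raw.glob(pattern))
    if not zip_files:
        problems.append(f"no files matching {pattern}")

    for zip_file in zip_files:
        with zipfile.ZipFile(zip_file) as archive:
            # Count visible .mcd entries, ignoring hidden files such as '._name.mcd'
            mcd_names = [
                name for name in archive.namelist()
                if name.lower().endswith(".mcd")
                and not Path(name).name.startswith(".")
            ]
        if len(mcd_names) != 1:
            problems.append(
                f"{zip_file.name}: expected 1 .mcd file, found {len(mcd_names)}"
            )
    return problems
```

Calling `check_raw_folder(raw_dir)` after setting `raw_dir` in the next section returns an empty list when `panel.csv` is present and every archive holds exactly one `.mcd` file.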

## Set your variables

In the code chunk below, alter the following variables:
- `analysis_dir`: set this to your `analysis` folder
- `raw_dir`: set this to your `raw` folder

Remember to **run** the code chunk after setting the variables above. 

In [None]:
# Set this to your 'analysis' folder
analysis_dir = ""

# Set this to your 'raw' folder
raw_dir = ""

## Run the rest of the code

You are now ready to **run** the rest of the code below. After running, you will notice **3 new** folders created in your `analysis` folder:

1. `1a_extracted_mcd`: contains individual folders (one per sample), each of which contains multiple `.ome.tiff` files (one per acquisition) and other files extracted from the original `.mcd` file
2. `1b_for_segmentation`: contains the segmentation stacks for use in Cellpose, as well as `.csv` files indicating the channel order
3. `1c_full_images`: contains the full stacks for analysis, as well as `.csv` files indicating the channel order
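
The code below passes the selected metal tags through `sort_channels_by_mass`, so the channel order in the stacks may differ from the row order in `panel.csv`. The snippet below only illustrates the idea of ordering tags by isotope mass; it is not the library's implementation, and `mass_of` is a hypothetical helper that assumes the mass is the number embedded in the tag.

```python
import re


def mass_of(metal_tag):
    """Extract the isotope mass from a tag such as 'Nd145' (assumed format)."""
    match = re.search(r"\d+", metal_tag)
    if match is None:
        raise ValueError(f"No mass found in metal tag: {metal_tag!r}")
    return int(match.group())


# Tags from the example panel, deliberately shuffled
tags = ["Sm147", "Nd145", "Nd148", "Nd146"]
ordered = sorted(tags, key=mass_of)
print(ordered)  # ['Nd145', 'Nd146', 'Sm147', 'Nd148']
```

When in doubt, rely on the `.csv` files written alongside the stacks, since they record the actual channel order.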

After you see the message **'Done!'** printed, you can move to the next step **(1b. Removing outliers from images)** on the GitHub page. 
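
Once the remaining code has finished, a quick count of the files in each output folder can confirm that all three were populated. This is an optional sketch: `summarise_outputs` is a hypothetical helper, and counting `.tiff` and `.csv` files is simply a convenient proxy for "something was written".

```python
from pathlib import Path

EXPECTED_FOLDERS = ["1a_extracted_mcd", "1b_for_segmentation", "1c_full_images"]


def summarise_outputs(analysis_dir):
    """Count .tiff and .csv files in each expected output folder.

    Returns a dict mapping folder name to a dict of counts, or to None
    when the folder does not exist.
    """
    summary = {}
    for name in EXPECTED_FOLDERS:
        folder = Path(analysis_dir) / name
        if not folder.is_dir():
            summary[name] = None
            continue
        # Search recursively because 1a_extracted_mcd holds one subfolder per sample
        files = [p for p in folder.rglob("*") if p.is_file()]
        summary[name] = {
            "tiff": sum(p.suffix.lower() in {".tiff", ".tif"} for p in files),
            "csv": sum(p.suffix.lower() == ".csv" for p in files),
        }
    return summary
```

Running `summarise_outputs(analysis_dir)` after the pipeline should show non-zero `tiff` counts for all three folders; a `None` entry or a zero count suggests that a step did not produce output.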

In [None]:
# Import libraries
from pathlib import Path
from tempfile import TemporaryDirectory
import pandas as pd
import imcsegpipe
from imcsegpipe.utils import sort_channels_by_mass

# Glob pattern (not a regular expression) used to select the zipped sample files
file_regex = "*Sample*.zip"

# Working directory storing all outputs
work_dir = Path(analysis_dir)
work_dir.mkdir(exist_ok=True)

# Set and create output directories
acquisitions_dir = work_dir / "1a_extracted_mcd"
segment_dir = work_dir / "1b_for_segmentation"
output_dir = work_dir / "1c_full_images"
acquisitions_dir.mkdir(exist_ok=True)
segment_dir.mkdir(exist_ok=True)
output_dir.mkdir(exist_ok=True)

# Raw directory with raw data files
raw = Path(raw_dir)

# Extract .mcd files to the '1a_extracted_mcd' folder
temp_dirs = []
try:
    # Unpack all matching .zip archives into a temporary directory
    for search_dir in [raw]:
        zip_files = list(search_dir.rglob(file_regex))
        if len(zip_files) > 0:
            temp_dir = TemporaryDirectory()
            temp_dirs.append(temp_dir)
            for zip_file in sorted(zip_files):
                imcsegpipe.extract_zip_file(zip_file, temp_dir.name)
    # Look for .mcd files in the raw folder and in the unpacked archives
    for search_dir in [raw] + [Path(temp_dir.name) for temp_dir in temp_dirs]:
        # Skip hidden files (names starting with '.')
        mcd_files = [f for f in search_dir.rglob("*.mcd") if not f.name.startswith(".")]
        if len(mcd_files) > 0:
            txt_files = [f for f in search_dir.rglob("*.txt") if not f.name.startswith(".")]
            matched_txt_files = imcsegpipe.match_txt_files(mcd_files, txt_files)
            for mcd_file in mcd_files:
                imcsegpipe.extract_mcd_file(
                    mcd_file,
                    acquisitions_dir / mcd_file.stem,
                    txt_files=matched_txt_files[mcd_file]
                )
finally:
    # Always remove the temporary directories, even if extraction fails
    for temp_dir in temp_dirs:
        temp_dir.cleanup()
    del temp_dirs

# Generate image stacks containing all channels (_full) and only channels for segmentation (_segment)
panel = pd.read_csv(raw / "panel.csv")
for acquisition_dir in acquisitions_dir.glob("[!.]*"):
    if acquisition_dir.is_dir():
        imcsegpipe.create_analysis_stacks(
            acquisition_dir=acquisition_dir,
            analysis_dir=output_dir,
            analysis_channels=sort_channels_by_mass(
                panel.loc[panel["Full"] == 1, "Metal Tag"].tolist()
            ),
            suffix="_full",
            hpf=50.0
        )
        imcsegpipe.create_analysis_stacks(
            acquisition_dir=acquisition_dir,
            analysis_dir=segment_dir,
            analysis_channels=sort_channels_by_mass(
                panel.loc[panel["Segment"] == 1, "Metal Tag"].tolist()
            ),
            suffix="_segment",
            hpf=50.0
        )

print("Done!")