# Setup

Edit the parameters in this section; the remaining sections can run as-is once they are set.

---

`coord_suffix` must align with 
* the sub-directory (corresponds to region, e.g., "mucosa") under the `dir_coord` directory where the Xenium Explorer-exported selection files are stored, and
* the suffixes of the coordinate selection files (see file naming conventions below).
  
The `AnnData` objects created will have this suffix as well (e.g., `Uninflamed-50452A_mucosa.h5ad`).

---

Selection files should be named by this convention:
`<library_id>_<coord_suffix>.csv`.

For example, if `dir_coord` is `.../coordinates/mucosa`, the mucosa selection file for sample 50452A should be at `.../coordinates/mucosa/50452A_mucosa.csv`.

More specifically, if the coordinates directory is `/mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/coordinates`, and the selection region is "mucosa," `dir_coord` should be `/mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/coordinates/mucosa`, and the full file path for this sample would be `/mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/coordinates/mucosa/50452A_mucosa.csv`.
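As a minimal sketch of the convention above, the expected selection-file path can be assembled from the coordinates root, the region, and the library ID. The helper name `selection_file_path` and the relative root `"coordinates"` are hypothetical placeholders for illustration; they are not part of `corescpy`.

```python
import os


def selection_file_path(coordinates_root, coord_suffix, library_id):
    """Build <root>/<coord_suffix>/<library_id>_<coord_suffix>.csv."""
    # dir_coord is the region sub-directory under the coordinates root
    dir_coord = os.path.join(coordinates_root, coord_suffix)
    # the region appears again as the file-name suffix
    return os.path.join(dir_coord, f"{library_id}_{coord_suffix}.csv")


# "coordinates" is a placeholder root; substitute your own directory
print(selection_file_path("coordinates", "mucosa", "50452A"))
```

Repeating the region in both the directory and the file name is what lets a misplaced file be spotted by name alone.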

---

**As with any other file naming schema, suffixes/directory names should not contain any special characters other than underscores (`_`) (no periods, dashes, spaces, etc.).**

N.B. In the above explanation, `library_id` refers to library/original sample ID without condition (e.g., "50452A", not "Uninflamed-50452A" like in other places). Remember that `coord_suffix` should also be the name of the parent directory of the coordinate file. I include this information in both the directory and file name to prevent mix-ups should files be moved or placed in the wrong folder.
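The naming rule above can be checked before any files are read. The following is a sketch only: `is_valid_suffix` is a hypothetical helper, and it assumes that "no special characters" means ASCII letters, digits, and underscores.

```python
import re


def is_valid_suffix(name):
    """Return True if name has only ASCII letters, digits, and underscores."""
    # fullmatch requires the whole string to match, so "" is rejected too
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None


# Hypothetical candidates for coord_suffix
for candidate in ["mucosa", "mucosa_v2", "mucosa-1", "my region", "mucosa.v2"]:
    print(f"{candidate!r}: {is_valid_suffix(candidate)}")
```

The check is meant for `coord_suffix` and directory names only; object IDs such as `Uninflamed-50452A` legitimately contain a dash.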

In [21]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import re
import math
import functools
import anndata
import scanpy as sc
import numpy as np
import pandas as pd
import corescpy as cr

# Process Options
coord_suffix = "mucosa"  # sub-directory & file/object suffix
panel_id = "TUQ97N"  # Xenium panel ID
show_cols = [cr.pp.COL_SUBJECT]  # only change if not TUQ97N
cso = cr.pp.COL_SAMPLE_ID_O  # original library ID; change if not TUQ97N
col_sample = cr.pp.COL_SAMPLE_ID  # object sample ID; change if not TUQ97N
libs = [  # sample IDs from patients for whom we have all conditions
    "50452A", "50452B", "50452C",  # old segmentation
    "50006A", "50006B", "50006C",  # rest are new segmentation
    "50217A", "50217B", "50217C",
    "50336B", "50336C", "50336A",
    "50403A2", "50403B", "50403C1"
]  # excludes low-quality sample/condition replicates 50403A1 & 50403C2
# libs = None  # to run all available samples
input_suffix = ""  # suffix of the input objects to crop, if any
# e.g., to crop a subsidiary object such as
# "Stricture-50452C_downsampled.h5ad", set input_suffix = "_downsampled".
# For "main" objects, leave input_suffix = ""

# Files & Directories
dir_entry = "/mnt/cho_lab/disk2"  # Spark writeable data directory
mdf = str("/mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/"
          "samples.csv")  # metadata file path (for now; will soon be on NFS)
dir_writeable = os.path.join(
    dir_entry, f"{os.getlogin()}/data/shared-xenium-library")  # your folder
direc = "/mnt/cho_lab/bbdata2/"  # mounted NFS with data
dir_data = os.path.join(direc, f"outputs/{panel_id}")
out_dir = os.path.join(
    dir_writeable, f"outputs/{panel_id}/nebraska")  # object output directory
dir_coord = os.path.join(
    out_dir, "coordinates", coord_suffix)  # coordinates (also maybe NFS soon)

# Constants (Shouldn't Need Edits Unless Extreme Process Changes)
col_obj = cr.pp.COL_OBJECT



# Find Files & Load Metadata

This step lets us find the object IDs that correspond to the sample IDs. For TUQ97N, object IDs have the format `<condition>-<block_id>`, where the condition is Inflamed, Uninflamed, or Stricture (e.g., `Uninflamed-50452A`).
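To illustrate the lookup, here is a small sketch with a toy metadata table. The column name `library_id` is a hypothetical stand-in for whatever `cr.pp.COL_SAMPLE_ID_O` resolves to, and the real table returned by `cr.pp.get_metadata_cho` has more columns than shown here.

```python
import pandas as pd

# Toy metadata indexed by object ID; "library_id" is a hypothetical column name
toy_metadata = pd.DataFrame({
    "Sample": ["Uninflamed-50452A", "Inflamed-50452B", "Stricture-50452C"],
    "library_id": ["50452A", "50452B", "50452C"],
}).set_index("Sample")

# Re-index by original library ID, then read the object ID back out
by_library = toy_metadata.reset_index().set_index("library_id")
object_id = by_library.loc["50452A"]["Sample"]
print(object_id)  # Uninflamed-50452A
```

This lookup assumes each library ID appears exactly once in the metadata; a duplicated ID would make `.loc` return several rows instead of one.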

In [22]:
files = functools.reduce(lambda i, j: i + j, [[os.path.join(
    run, i) for i in os.listdir(os.path.join(
        dir_data, run))] for run in os.listdir(dir_data)])  # all data paths
os.makedirs(out_dir, exist_ok=True)  # make output directory if needed
os.makedirs(os.path.join(
    out_dir, coord_suffix), exist_ok=True)  # cropped-object sub-directory
metadata = cr.pp.get_metadata_cho(direc, mdf, panel_id=panel_id, samples=libs)
metadata[show_cols]

| Sample | Name |
|---|---|
| Uninflamed-50452A | 50452 |
| Inflamed-50452B | 50452 |
| Stricture-50452C | 50452 |
| Inflamed-50006A | 50006 |
| Uninflamed-50006B | 50006 |
| Stricture-50006C | 50006 |
| Inflamed-50217A | 50217 |
| Uninflamed-50217B | 50217 |
| Stricture-50217C | 50217 |
| Inflamed-50336B | 50336 |


# Subset Data by Coordinate Files & Write Cropped Objects

Subset the data by coordinates (`corescpy` can use Xenium Explorer-exported manual selection files to get those coordinates) and then write the cropped objects to `out_dir/<coord_suffix>`.
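Because the loop fails on the first sample whose selection file is absent, it may help to confirm that every expected file exists before cropping. The sketch below uses a hypothetical helper, `missing_selection_files`, and a temporary directory in place of the real `dir_coord`; it relies only on the file naming convention described in the setup section.

```python
import os
import tempfile


def missing_selection_files(dir_coord, library_ids, coord_suffix):
    """Return the expected selection-file paths that do not exist on disk."""
    expected = [os.path.join(dir_coord, f"{lib}_{coord_suffix}.csv")
                for lib in library_ids]
    return [path for path in expected if not os.path.isfile(path)]


# Demo in a temporary directory with only one of two files present
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "50452A_mucosa.csv"), "w").close()
    missing = missing_selection_files(tmp_dir, ["50452A", "50452B"], "mucosa")
    print([os.path.basename(path) for path in missing])
    # ['50452B_mucosa.csv']
```

In the notebook, the equivalent call would be `missing_selection_files(dir_coord, libs, coord_suffix)`, provided `libs` is a list rather than `None`.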

In [24]:
for s in libs:  # iterate samples
    print(f"\n\n{'*' * 40}\n{s}\n{'*' * 40}\n\n")
    file_path = np.array(files)[np.where([s == os.path.basename(
        x).split("__")[2].split("-")[0] for x in files])[0][0]]
    lib = metadata.reset_index().set_index(cso).loc[s][col_sample]
    self = cr.Spatial(os.path.join(dir_data, file_path), library_id=lib)
    # print("Input: ", os.path.join(
    #     out_dir, f"{lib}{input_suffix}.h5ad"))
    # print("Coordinates: ", os.path.join(
    #     dir_coord, s + f"_{coord_suffix}.csv"))
    # print("Outputs: ", os.path.join(
    #     out_dir, coord_suffix, f"{lib}_{coord_suffix}.h5ad"))
    self.update_from_h5ad(os.path.join(
        out_dir, f"{lib}{input_suffix}.h5ad"))  # load processed object
    self.adata = self.crop(os.path.join(
        dir_coord, s + f"_{coord_suffix}.csv"))  # crop data to coordinates
    # sdata.pl.render_labels("cell_labels").pl.show()
    self.write(os.path.join(
        out_dir, coord_suffix, f"{lib}_{coord_suffix}.h5ad"))  # write cropped



****************************************
50452A
****************************************


Input:  /mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/Uninflamed-50452A.h5ad
Coordinates:  /mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/coordinates/mucosa/50452A_mucosa.csv
Outputs:  /mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/mucosa/Uninflamed-50452A_mucosa.h5ad


****************************************
50452B
****************************************


Input:  /mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/Inflamed-50452B.h5ad
Coordinates:  /mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/coordinates/mucosa/50452B_mucosa.csv
Outputs:  /mnt/cho_lab/disk2/elizabeth/data/shared-xenium-library/outputs/TUQ97N/nebraska/mucosa/Inflamed-50452B_mucosa.h5ad


****************************************
50452C
******************************