# `0_DataProcessing`

## `1_prepare_data.ipynb`

Required files
1. Cell Ranger output of all MSI samples (.h5ad)
2. Metadata of each dataset (detailing sample accession)

For each dataset (Chen, Joanito-KUL, and Joanito-SG1):
- Read raw h5ad for each sample and run `Scrublet`
- Label `SampleID`, `PatientID`, `BiopsySite` and write the scrublet-processed h5ad to disk
- Merge all h5ad files into one, then run `sc.pp.calculate_qc_metrics`
- Save the merged h5ad file to disk

By the end of this notebook, you should have three merged h5ad files, one per dataset.<br/>
※ SD patients and duodenal carcinoma patients from Chen et al. are excluded.

## `2_ScanpyPreprocessing.ipynb`

Required files
1. h5ad files for each dataset generated from `1_prepare_data.ipynb`
2. Cell metadata of each dataset (detailing author annotation of cells)

- Using the author-provided metadata, label the cell type of each cell
    - Many cells will remain unlabeled after this step.
    - For now, do **not** remove these cells
- Save the annotated file to disk
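
A minimal sketch of the labeling step, assuming the author metadata is a table with one row per barcode. The column names (`barcode`, `cell_type`), the output column `CellType`, and the `"Unlabeled"` placeholder are hypothetical; `obs` stands in for `adata.obs`.

```python
import pandas as pd


def label_cell_types(obs, author_meta, barcode_col="barcode", label_col="cell_type"):
    """Return a copy of obs with a CellType column; unmatched cells are kept."""
    lookup = author_meta.drop_duplicates(barcode_col).set_index(barcode_col)[label_col]
    out = obs.copy()
    out["CellType"] = out.index.map(lookup)
    # Cells missing from the author metadata are marked, not dropped.
    out["CellType"] = out["CellType"].fillna("Unlabeled")
    return out


# Toy example: the second barcode has no author annotation.
obs = pd.DataFrame(index=["AAAC-1", "AAAG-1", "AAAT-1"])
author_meta = pd.DataFrame(
    {"barcode": ["AAAC-1", "AAAT-1"], "cell_type": ["Epithelial", "T cell"]}
)
labeled = label_cell_types(obs, author_meta)
print(labeled["CellType"].value_counts())
```

Keeping unlabeled cells under an explicit placeholder makes it easy to count them now and decide later whether to filter them.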

## `3_NanoMnT_Labeling.ipynb`

Required files
1. h5ad files for each dataset generated from `2_ScanpyPreprocessing.ipynb`
2. NanoMnT allele tables for each sample processed by Cell Ranger

- Load NanoMnT allele tables and process them by
    - Filtering out reads with low-quality flanking sequences (e.g., indels within the flanking regions)
    - Removing G- or C-repeats
    - Filtering out reads without a CB or UMI
- Merge the allele tables together (with sample/patient information labeled)
- Overlay the STR profile of each cell onto adata
- Save the NanoMnT-labeled adata to disk
- Save the merged allele table
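
The filtering and merging steps could look roughly like this in pandas. The column names used here (`flanking_indels`, `repeat_unit`, `CB`, `UMI`, `allele`) are assumptions for illustration and may not match the actual NanoMnT allele table layout; adapt them to the real headers.

```python
import pandas as pd


def filter_allele_table(table):
    """Keep reads with clean flanks, A/T repeats, and both CB and UMI present."""
    has_clean_flanks = table["flanking_indels"] == 0        # hypothetical column
    is_not_gc_repeat = ~table["repeat_unit"].isin(["G", "C"])
    has_barcodes = table["CB"].notna() & table["UMI"].notna()
    return table[has_clean_flanks & is_not_gc_repeat & has_barcodes].copy()


def merge_allele_tables(tables, sample_to_patient):
    """Filter each per-sample table, label it, and stack everything."""
    labeled = []
    for sample_id, table in tables.items():
        kept = filter_allele_table(table)
        kept["SampleID"] = sample_id
        kept["PatientID"] = sample_to_patient[sample_id]
        labeled.append(kept)
    return pd.concat(labeled, ignore_index=True)


# Toy tables for two samples (made-up values).
s1 = pd.DataFrame({
    "repeat_unit": ["A", "G", "T", "A"],
    "flanking_indels": [0, 0, 1, 0],
    "CB": ["AAAC", "AAAG", "AAAT", None],
    "UMI": ["U1", "U2", "U3", "U4"],
    "allele": [12, 10, 15, 11],
})
s2 = pd.DataFrame({
    "repeat_unit": ["T", "C"],
    "flanking_indels": [0, 0],
    "CB": ["CCCA", "CCCG"],
    "UMI": ["U5", "U6"],
    "allele": [9, 14],
})
merged_alleles = merge_allele_tables({"S1": s1, "S2": s2}, {"S1": "P1", "S2": "P2"})
print(merged_alleles)
```

Only one read per toy sample survives: the others fail the G/C-repeat, flanking, or missing-CB checks.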

# `1_EpithelialAnalysis`

## `1_Epithelial_Subclustering.ipynb`

Required files
1. h5ad files for each dataset generated from `3_NanoMnT_Labeling.ipynb` (previous notebook)
2. Metadata for each dataset that contains MSI status

※ Process each dataset separately

First, add MSI status to adata using the metadata.<br/>
Then subset adata to epithelial cells using the author annotation (same procedure as in `2_ScanpyPreprocessing.ipynb`).<br/>

**Goal: Distinguishing normal vs. tumor epithelial cells**

Principles for annotating epithelial tumor cells, in order of importance:<br/>
1. Leiden clusters that are composed of multiple patients represent normal cells
2. Leiden clusters that are mostly NAT-derived represent normal cells
3. Leiden clusters that satisfy all 3 conditions represent tumor cells: 
    - patient-specific
    - exhibit MSI-like STR profiles (for MSI)
    - high iCMS3 module expression
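
The first two principles and the patient-specificity check can be quantified per Leiden cluster, as in the sketch below. The cutoffs (`patient_cutoff`, `nat_cutoff`), the `"NAT"` label in `BiopsySite`, and the function names are hypothetical. The resulting call is only provisional: a "tumor candidate" still needs its STR profile and iCMS3 module expression checked by hand.

```python
import pandas as pd


def summarize_clusters(obs, cluster_col="leiden", patient_col="PatientID",
                       site_col="BiopsySite", nat_label="NAT"):
    """Per cluster: share of the dominant patient and share of NAT-derived cells."""
    grouped = obs.groupby(cluster_col, observed=True)
    top_patient_frac = grouped[patient_col].agg(
        lambda s: s.value_counts(normalize=True).iloc[0]
    )
    nat_frac = grouped[site_col].agg(lambda s: (s == nat_label).mean())
    return pd.DataFrame({"top_patient_frac": top_patient_frac, "nat_frac": nat_frac})


def provisional_calls(summary, patient_cutoff=0.9, nat_cutoff=0.5):
    """Apply principles 1 and 2; everything else is a candidate for principle 3."""
    out = summary.copy()
    is_normal = (out["top_patient_frac"] < patient_cutoff) | (out["nat_frac"] > nat_cutoff)
    out["provisional_call"] = is_normal.map({True: "normal", False: "tumor candidate"})
    return out


# Toy obs table: cluster 0 is multi-patient, 1 is patient-specific tumor tissue,
# 2 is patient-specific but mostly NAT-derived.
obs = pd.DataFrame({
    "leiden":     ["0", "0", "0", "0", "1", "1", "1", "1", "2", "2", "2", "2"],
    "PatientID":  ["P1", "P2", "P3", "P1", "P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"],
    "BiopsySite": ["NAT", "NAT", "Tumor", "NAT", "Tumor", "Tumor", "Tumor", "Tumor",
                   "NAT", "NAT", "NAT", "Tumor"],
})
calls = provisional_calls(summarize_clusters(obs))
print(calls)
```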

## `2_MSI_intensity.ipynb`

Required files
1. h5ad files that contain epithelial cells for each dataset generated from `1_Epithelial_Subclustering.ipynb` (previous notebook)
2. Allele table generated from 

1. MSI intensity of each patient
    - Inspect the STR profiles of all cells in each dataset separately and check that the results are sound
    - Visualize the STR allele profiles of each patient as Seaborn heatmaps across various STR lengths (tumor cells vs. normal cells)
        - Further validate the results by comparing multiple individual STR loci with higher coverage
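
One way to build the matrix behind such a heatmap is to tabulate, for each reference STR length, how far the observed alleles deviate from it. This is a sketch under assumptions: the column names (`reference_STR_allele`, `allele`, `CellClass`) and the row-normalized "allele shift" representation are illustrative choices, not necessarily what the notebook does.

```python
import pandas as pd


def str_shift_profile(alleles, ref_col="reference_STR_allele", obs_col="allele"):
    """Rows: reference STR length. Columns: observed minus reference length.
    Values: fraction of reads, normalized within each reference length."""
    shift = (alleles[obs_col] - alleles[ref_col]).rename("allele_shift")
    return pd.crosstab(alleles[ref_col], shift, normalize="index")


# Toy reads from one patient (made-up values): tumor reads show contractions.
alleles = pd.DataFrame({
    "CellClass": ["Tumor"] * 4 + ["Normal"] * 4,
    "reference_STR_allele": [12] * 8,
    "allele": [12, 10, 9, 10, 12, 12, 12, 11],
})
profiles = {
    cell_class: str_shift_profile(reads)
    for cell_class, reads in alleles.groupby("CellClass")
}
print(profiles["Tumor"])
print(profiles["Normal"])
```

Each matrix can then be passed to `seaborn.heatmap`, for example one panel for tumor cells and one for normal cells per patient.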

## `3_MSI_intensity_driver.ipynb`

Required files
1. h5ad files that contain epithelial cells for each dataset generated from `1_Epithelial_Subclustering.ipynb`
2. Allele table from `3_NanoMnT_Labeling.ipynb`

Investigate transcriptomic factors that associate with MSI intensity: gene expression, module score, cell cycle proportions, and other relevant genes.<br/>
**→ Results are unsatisfactory**

## `4_MSI_intensity_prediction.ipynb`

Required files
1. h5ad files that contain epithelial cells for each dataset generated from `1_Epithelial_Subclustering.ipynb`

Use XGBoost to identify features (genes) that can be used to predict MSI detection.<br/>
**→ Results are unsatisfactory, except perhaps for S100A6, which may not be relevant.**

# `2_TME_analysis`

## Cell type annotation

### `1_1_Chen_TME_annotation_1.ipynb`

Annotate B and T cell lineages

### `1_2_Chen_TME_annotation_2.ipynb`

Annotate myeloid and epithelial lineages

### `1_3_Chen_TME_annotation_3.ipynb`

Annotate endothelial and stromal lineages

### `1_4_Joanito_TME_annotation_1.ipynb`

### `1_5_Joanito_TME_annotation_2.ipynb`

### `1_6_Joanito_TME_annotation_3.ipynb`

## `2_TME_proportion_analysis.ipynb`

## `2_2_manual_annot.ipynb`

Because Chen et al. did not provide adequate metadata, the cells need to be manually re-annotated.

## `2_3_extract_epi.ipynb`

Extract epithelial cells from the manually re-annotated data

## `3_NanoMnT_analysis.ipynb`

Overlay NanoMnT results and visualize association with ICI response

## `4_MSI_intensity_analysis.ipynb`

Find transcriptomic features associated with MSI intensity