# Downloading the cancer dataset
In this tutorial, we will run the analysis on the cancer dataset following https://github.com/Regaler/EpitopeGen/tree/main/research/cancer_wu. The **Wu et al.** dataset (https://www.nature.com/articles/s41586-020-2056-8) comprises paired single-cell RNA and TCR sequencing data from 14 cancer patients. As a first step, download the annotated dataset. 

In [None]:
cd EpitopeGen/research
python research/scripts/run_cancer_wu.py --download

This will download and extract the dataset to the `cancer_wu/data` directory. The dataset contains the following files:
- `cancer_wu/data/GSE139555_tcell_integrated.h5ad`: AnnData object containing the single-cell RNA-seq data of T cells
- `cancer_wu/data/GSE139555%5Ftcell%5Fmetadata.txt`: Observation metadata of the T cells
- `CN1`, `EN1`, `ET2`, ..., `RN3`: per-patient data. Each directory contains the following:
    - `*.filtered_contig_annotations.csv`: contains the TCR sequences
    - `*.mtx`, `*.barcodes.tsv`, `*.genes.tsv`: contain the single-cell RNA-seq data (before preprocessing). These files are used when analyzing gene expression levels.
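
Before moving on, it can help to confirm that the download produced the layout listed above. The following is a minimal sketch using only the Python standard library; the helper name `summarize_download` is hypothetical and not part of EpitopeGen, and the file names are copied from the list above.

```python
from pathlib import Path

# Top-level file names as listed in this tutorial
EXPECTED_FILES = [
    "GSE139555_tcell_integrated.h5ad",
    "GSE139555%5Ftcell%5Fmetadata.txt",
]


def summarize_download(data_dir):
    """Return (missing top-level files, sub-directories holding a TCR contig CSV)."""
    root = Path(data_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if not root.is_dir():
        return missing, []
    # A per-patient directory is recognized by its *filtered_contig_annotations.csv file
    samples = sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and any(p.glob("*filtered_contig_annotations.csv"))
    )
    return missing, samples


# Run from EpitopeGen/research; prints what is missing and which samples were found
missing, samples = summarize_download("cancer_wu/data")
print("missing files:", missing)
print("sample directories:", samples)
```

An empty `missing` list and a non-empty list of sample directories suggest that the download and extraction completed.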

Now, download the tumor-associated epitopes database and the annotated observation that includes EpitopeGen inference results:

In [None]:
cd EpitopeGen/research/cancer_wu/data
wget https://zenodo.org/records/14861864/files/tumor_associated_epitopes.csv
wget https://zenodo.org/records/14861864/files/obs_annotated_cancer_wu_ens_th0.5.csv

The `tumor_associated_epitopes.csv` file serves as the reference database for epitope queries. 

# Analysis
In `obs_annotated_cancer_wu_ens_th0.5.csv`, each viable T cell has a set of prediction columns. The `pred_{i}` columns contain the generated epitope sequences, and each `match_{i}` column is 1 if the generated epitope `pred_{i}` matches an epitope in the reference database and 0 otherwise. The analyses are summarized below:

- **Phenotype-Association (PA) Ratio Analysis**
    - Quantifies Phenotype-Associated T cells within specific repertoire subgroups
    - Enables comparison of tumor-associated T cell proportions across site patterns and cell types

- **Phenotype-Relative Expansion (PRE) Analysis**
    - Evaluates clone sizes of PA T cells
    - Provides comparative analysis against NA T cell expansion

- **Gene Expression Analysis**
    - Performs differential gene expression analysis between PA and NA T cells
    - Identifies distinctive gene expression patterns in PA T cells
    - Requires per-patient raw data (accessed via `read_all_raw_data()`)

## PA Ratio analysis
The PA Ratio analysis quantifies the proportion of PA T cells within specific repertoire subgroups. For more details, please see **Fig. 5c** and section "EpiGen discovers tumor-associated CD8+ T cells" in the manuscript. 

In [None]:
# In EpitopeGen/research
from epitopegen import PARatioAnalyzer
from cancer_wu.utils import read_all_data, CELL_TYPES, PATTERN_NAMES_CORE, PATTERN_NAMES2DESC, SITE_PATTERNS_CORE

print("CELL_TYPES", CELL_TYPES)
print("PATTERN_NAMES", PATTERN_NAMES_CORE)
print("PATTERN_NAMES2DESC", PATTERN_NAMES2DESC)

mdata = read_all_data(data_dir="cancer_wu/data", obs_cache="cancer_wu/data/obs_annotated_cancer_wu_ens_th0.5.csv", transpose=True)

pa_analyzer = PARatioAnalyzer(
    cell_types=CELL_TYPES,
    pattern_names=PATTERN_NAMES_CORE,
    pattern_descriptions=PATTERN_NAMES2DESC,
    patterns_dict=dict(zip(PATTERN_NAMES_CORE, SITE_PATTERNS_CORE)),
    output_dir='cancer_wu/analysis/PA_ratios'
)
pa_analyzer.analyze(mdata, top_k=4, per_patient=False)
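
As a rough illustration of the quantity involved, the proportion of PA T cells within a subgroup is the number of PA cells divided by the total number of cells in that subgroup. The sketch below computes this for toy data. It is not the implementation of `PARatioAnalyzer`, and the group names and the function `pa_ratio_by_group` are hypothetical.

```python
from collections import defaultdict


def pa_ratio_by_group(cells):
    """cells: iterable of (group, is_pa) pairs. Returns {group: n_PA / n_total}."""
    totals = defaultdict(int)
    pa_counts = defaultdict(int)
    for group, is_pa in cells:
        totals[group] += 1
        pa_counts[group] += bool(is_pa)
    return {group: pa_counts[group] / totals[group] for group in totals}


# Synthetic cells: (subgroup label, PA flag)
toy_cells = [
    ("type_A", True), ("type_A", False),
    ("type_B", True), ("type_B", False), ("type_B", False), ("type_B", False),
]

print(pa_ratio_by_group(toy_cells))  # {'type_A': 0.5, 'type_B': 0.25}
```

A subgroup here could be a cell type, a site pattern, or a combination of both, depending on how the cells are grouped.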

## PRE analysis
The PRE analysis evaluates the clone sizes of PA T cells and provides comparative analysis against NA T cell expansion. For more details, please see **Fig. 5d** and the related section in the manuscript. 

In [None]:
# In EpitopeGen/research
from epitopegen import PRERatioAnalyzer
from cancer_wu.utils import read_all_data, CELL_TYPES, PATTERN_NAMES_CORE, PATTERN_NAMES2DESC, SITE_PATTERNS_CORE

print("CELL_TYPES", CELL_TYPES)
print("PATTERN_NAMES", PATTERN_NAMES_CORE)
print("PATTERN_NAMES2DESC", PATTERN_NAMES2DESC)

mdata = read_all_data(data_dir="cancer_wu/data", obs_cache="cancer_wu/data/obs_annotated_cancer_wu_ens_th0.5.csv", transpose=True)

pre_analyzer = PRERatioAnalyzer(
    cell_types=CELL_TYPES,
    pattern_names=PATTERN_NAMES_CORE,
    pattern_descriptions=PATTERN_NAMES2DESC,
    patterns_dict=dict(zip(PATTERN_NAMES_CORE, SITE_PATTERNS_CORE)),
    output_dir='cancer_wu/analysis/PRE_ratios'
)
pre_analyzer.analyze(mdata, top_k=4, per_patient=False)
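
To make the idea of comparing clone sizes concrete, the sketch below treats the clone size as the number of cells that share a clonotype and divides the mean size of PA clones by the mean size of NA clones. This is only one plausible way to express relative expansion. The statistic computed by `PRERatioAnalyzer` may be defined differently, and the function name and clonotype labels are hypothetical.

```python
from collections import Counter
from statistics import mean


def relative_expansion(cells):
    """cells: list of (clonotype_id, is_pa) pairs.

    Returns mean PA clone size / mean NA clone size, or None if either set is empty.
    A clone is treated as PA if any of its cells is flagged PA.
    """
    sizes = Counter(clone_id for clone_id, _ in cells)
    pa_clones = {clone_id for clone_id, is_pa in cells if is_pa}
    pa_sizes = [n for clone_id, n in sizes.items() if clone_id in pa_clones]
    na_sizes = [n for clone_id, n in sizes.items() if clone_id not in pa_clones]
    if not pa_sizes or not na_sizes:
        return None
    return mean(pa_sizes) / mean(na_sizes)


# Synthetic data: two PA clones (sizes 4 and 2), two NA clones (sizes 1 and 2)
toy_cells = (
    [("clone1", True)] * 4
    + [("clone2", True)] * 2
    + [("clone3", False)] * 1
    + [("clone4", False)] * 2
)

print(relative_expansion(toy_cells))  # 2.0
```

A value above 1 in this toy formulation means that PA clones are, on average, larger than NA clones.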

## DEG analysis
The DEG analysis performs differential gene expression analysis between PA and NA T cells. It identifies distinctive gene expression patterns in PA T cells. For more details, please see **Fig. 5e** and the related section in the manuscript. 

In [None]:
# In EpitopeGen/research
from epitopegen import DEGAnalyzer
from cancer_wu.utils import read_all_data, read_all_raw_data, filter_and_update_combined_adata, SITE_PATTERNS_CORE, PATTERN_NAMES_CORE

# Read the processed gene expression data of CD8+ T cells and then inject our epitope annotation
mdata = read_all_data(data_dir="cancer_wu/data", obs_cache="cancer_wu/data/obs_annotated_cancer_wu_ens_th0.5.csv", transpose=True)
# Read the raw gene expression data of CD8+ T cells
raw_adata = read_all_raw_data(data_dir="cancer_wu/data")
# Merge the raw gene expression data with the previous TCR-GEX data
raw_adata_filtered = filter_and_update_combined_adata(raw_adata, mdata['gex'])

# Perform DEG analysis (grouped), repeated for k = 1..top_k
top_k = 4
for k in range(1, 1 + top_k):
    analyzer = DEGAnalyzer(output_dir="cancer_wu/analysis/gex_grouped", top_k=k)
    analyzer.analyze(raw_adata_filtered.copy())

# Perform DEG analysis (per site pattern), repeated for k = 1..top_k
top_k = 4
for k in range(1, 1 + top_k):
    analyzer = DEGAnalyzer(
        output_dir="cancer_wu/analysis/gex_per_pattern",
        top_k=k,
        patterns_list=SITE_PATTERNS_CORE,
        pattern_names=PATTERN_NAMES_CORE,
    )
    analyzer.analyze(raw_adata_filtered.copy(), analyze_patterns=True)
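
For intuition about what a differential expression comparison between PA and NA T cells measures, the sketch below computes a log2 fold change of mean expression per gene, with a pseudocount to avoid division by zero. It illustrates the effect size only. A full DEG analysis also applies a statistical test and multiple-testing correction, and the method used inside `DEGAnalyzer` is not described in this tutorial. The gene names, values, and function name are hypothetical.

```python
import math


def log2_fold_change(pa_values, na_values, pseudocount=1.0):
    """log2 of (mean PA expression + pseudocount) / (mean NA expression + pseudocount)."""
    pa_mean = sum(pa_values) / len(pa_values)
    na_mean = sum(na_values) / len(na_values)
    return math.log2((pa_mean + pseudocount) / (na_mean + pseudocount))


# Synthetic expression counts per gene: (values in PA cells, values in NA cells)
toy_expression = {
    "gene_a": ([3, 3], [1, 1]),  # higher in PA cells
    "gene_b": ([1, 1], [1, 1]),  # unchanged
    "gene_c": ([0, 0], [3, 3]),  # lower in PA cells
}

fold_changes = {
    gene: log2_fold_change(pa, na) for gene, (pa, na) in toy_expression.items()
}

# Rank genes by the magnitude of change, largest first
ranked = sorted(fold_changes, key=lambda g: abs(fold_changes[g]), reverse=True)
print(fold_changes)  # {'gene_a': 1.0, 'gene_b': 0.0, 'gene_c': -2.0}
print(ranked)        # ['gene_c', 'gene_a', 'gene_b']
```

Positive values indicate higher mean expression in PA cells, and negative values indicate higher mean expression in NA cells.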