# Downloading the COVID-19 dataset
In this tutorial, we will run the analysis on a COVID-19 dataset following https://github.com/Regaler/EpitopeGen/tree/main/research/covid19_su. The **Su et al.** dataset (https://www.cell.com/cell/fulltext/S0092-8674(20)31444-6?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867420314446%3Fshowall%3Dtrue) comprises paired single-cell RNA and TCR sequencing data from 139 COVID-19 patients. The dataset categorizes patients into four severity groups: healthy, mild, moderate, and severe. As a first step, download the annotated dataset. 

In [None]:
cd EpitopeGen/research/covid19_su/data
wget https://zenodo.org/records/14861864/files/cd8_gex_covid19_su.h5ad
wget https://zenodo.org/records/14861864/files/covid19_epitopes.csv
wget https://zenodo.org/records/14896012/files/obs_annotated_covid19_su_ens_th0.5.csv

These commands download the files into the `covid19_su/data` directory. The dataset contains the following files:
- `covid19_su/data/cd8_gex_covid19_su.h5ad`: AnnData object containing the single-cell RNA-seq data of CD8+ T cells
- `covid19_su/data/obs_annotated_covid19_su_ens_th0.5.csv`: Observation metadata of the T cells, including the **EpitopeGen** inference results. For each viable T cell, the `pred_{i}` columns contain the generated epitope sequences, and each `match_{i}` column is 1 if the generated epitope `pred_{i}` matches an epitope in the reference database and 0 otherwise.
- `covid19_epitopes.csv`: COVID-19 epitopes database
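
To make the `pred_{i}` / `match_{i}` layout concrete, the following minimal sketch builds a toy table with the same column naming and flags cells with at least one matching prediction. The peptide strings, cell names, and the "any match among the first k predictions" criterion are assumptions for illustration; they are not taken from the real file and are not necessarily the rule `epitopegen` applies internally.

```python
import pandas as pd

TOP_K = 3  # the tutorial uses top_k=8; kept small here for readability

# Toy stand-in for the observation metadata (placeholder values, not real data)
obs = pd.DataFrame(
    {
        "pred_0": ["AAAAAAAAA", "CCCCCCCCC", "DDDDDDDDD"],
        "pred_1": ["EEEEEEEEE", "FFFFFFFFF", "GGGGGGGGG"],
        "pred_2": ["HHHHHHHHH", "IIIIIIIII", "KKKKKKKKK"],
        "match_0": [0, 1, 0],
        "match_1": [1, 0, 0],
        "match_2": [0, 0, 0],
    },
    index=["cell_a", "cell_b", "cell_c"],
)

match_cols = [f"match_{i}" for i in range(TOP_K)]

# Assumed criterion: a cell is flagged if any of its first TOP_K predictions matched
obs["any_match"] = obs[match_cols].eq(1).any(axis=1)

print(obs["any_match"].tolist())  # [True, True, False]
```

On the real file, the same column selection would apply after loading it with `pd.read_csv(..., index_col=0)`, assuming the first column holds the cell barcodes.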

# Analysis
The tutorial covers the following analyses:

- **Phenotype-Association (PA) Ratio Analysis**
    - Quantifies Phenotype-Associated T cells within specific repertoire subgroups
    - Enables comparison of COVID-19-associated T cell proportions across site patterns and cell types

- **Clonal Expansion Analysis**
    - Evaluates clone sizes of PA T cells
    - Provides comparative analysis against NA T cell expansion

- **Gene Expression Analysis**
    - Performs differential gene expression analysis between PA and NA T cells
    - Identifies distinctive gene expression patterns in PA T cells
    - Requires per-patient raw data

- **Antigen Analysis**
    - Identifies recognized COVID-19 antigens in PA T cells
    - Covers both clonally expanded and non-expanded PA T cells

## PA Ratio analysis
The PA Ratio analysis quantifies the proportion of PA T cells within specific repertoire subgroups. For more details, please see **Fig. 6b** and section "EpiGen discovers COVID-19-associated CD8+ T cells" in the manuscript. 

In [None]:
# In EpitopeGen/research
from epitopegen import PARatioAnalyzer
from covid19_su.utils import read_all_data, CELL_NAMES, PATTERN_NAMES, WOS_PATTERNS, clean_wos_get_single_pair

print("CELL_NAMES", CELL_NAMES)
print("PATTERN_NAMES", PATTERN_NAMES)
print("WOS_PATTERNS", WOS_PATTERNS)

# Load the expression data together with the annotated observation metadata
adata = read_all_data(
    data_dir="covid19_su/data",
    gex_cache="covid19_su/data/cd8_gex_covid19_su.h5ad",
    obs_cache="covid19_su/data/obs_annotated_covid19_su_ens_th0.5.csv",
)
# Clean the observation dataframe
df = clean_wos_get_single_pair(adata.obs)
# Create a new AnnData with only the kept cells
adata_clean = adata[df.index].copy()
adata_clean.obs = df
# Replace the original adata with the cleaned copy
adata = adata_clean

pa_analyzer = PARatioAnalyzer(
    cell_types=CELL_NAMES,
    pattern_names=PATTERN_NAMES,
    pattern_descriptions={k: k for k in PATTERN_NAMES},
    patterns_dict={k: v for k, v in zip(PATTERN_NAMES, WOS_PATTERNS)},
    output_dir='covid19_su/analysis/PA_ratios'
)
pa_analyzer.analyze({'gex': adata}, top_k=8, per_patient=False)
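
As a rough illustration of the quantity being computed, the sketch below derives a PA ratio per severity group from a toy table. The column names `group` and `is_pa` and all values are hypothetical; `PARatioAnalyzer` may define and aggregate the ratio differently (for example per cell type or per patient).

```python
import pandas as pd

# Toy per-cell table: severity group and whether the cell was flagged as PA
cells = pd.DataFrame(
    {
        "group": ["healthy", "healthy", "mild", "mild", "mild", "severe"],
        "is_pa": [False, False, True, False, True, True],
    }
)

# PA ratio = number of PA cells / number of cells, within each group
pa_ratio = cells.groupby("group")["is_pa"].mean()

for group, ratio in pa_ratio.items():
    print(f"{group}: {ratio:.2f}")
```

Taking the mean of a boolean column gives the fraction of `True` entries, which is why no explicit division is needed.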

## Clonal expansion analysis
The clonal expansion analysis evaluates the clone sizes of PA T cells per patient group (healthy, mild, moderate, and severe). For more details, please see **Fig. 6c** and the related section in the manuscript. 

In [None]:
# In EpitopeGen/research
from epitopegen import TCRUMAPVisualizer
from covid19_su.utils import read_all_data, clean_wos_get_single_pair, WOS_PATTERNS, PATTERN_NAMES

# Load the expression data together with the annotated observation metadata
adata = read_all_data(
    data_dir="covid19_su/data",
    gex_cache="covid19_su/data/cd8_gex_covid19_su.h5ad",
    obs_cache="covid19_su/data/obs_annotated_covid19_su_ens_th0.5.csv",
)
# Clean the observation dataframe
df = clean_wos_get_single_pair(adata.obs)
# Create a new AnnData with only the kept cells
adata_clean = adata[df.index].copy()
adata_clean.obs = df
# Replace the original adata with the cleaned copy
adata = adata_clean

# Initialize visualizer
visualizer = TCRUMAPVisualizer(
    patterns=WOS_PATTERNS,  # [[0], [1,2], [3,4], [5,6,7]]
    pattern_names=PATTERN_NAMES,
    pattern_descriptions={
        'healthy': 'Healthy controls',
        'mild': 'Mild disease',
        'moderate': 'Moderate disease',
        'severe': 'Severe disease'
    },
    output_dir='covid19_su/analysis/tcr_umap'
)

# Create visualization
visualizer.visualize_umap(
    adata,
    match_columns=[f'match_{k}' for k in range(8)],
    primary_color='red',
    sample_size=4000,
    n_proc=40,
)
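
The notion of clone size can be illustrated independently of the visualizer. The sketch below assumes that a clone is the set of cells sharing a clonotype identifier and that cells not flagged as PA form the comparison group (taken here to be what the tutorial calls NA). The identifiers and flags are toy values, not results from the dataset.

```python
from collections import Counter
from statistics import mean

# One entry per cell: (clonotype id, flagged as PA?) -- hypothetical toy values
cells = [
    ("c1", True), ("c1", True), ("c1", True),
    ("c2", False),
    ("c3", False), ("c3", False),
]

# Clone size = number of cells that share the same clonotype id
clone_size = Counter(clonotype for clonotype, _ in cells)

# Attach the size of its clone to every cell, split by PA status
pa_sizes = [clone_size[clonotype] for clonotype, is_pa in cells if is_pa]
na_sizes = [clone_size[clonotype] for clonotype, is_pa in cells if not is_pa]

print("mean clone size, PA cells:", mean(pa_sizes))
print("mean clone size, other cells:", round(mean(na_sizes), 2))
```

In a real analysis the comparison would be made within each patient group and with an appropriate statistical test rather than a bare mean.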

## DEG analysis
The DEG analysis performs differential gene expression analysis between PA and NA T cells. It identifies distinctive gene expression patterns in PA T cells. For more details, please see **Fig. 6d** and the related section in the manuscript. 

In [None]:
# In EpitopeGen/research
from epitopegen import DEGAnalyzer
from covid19_su.utils import read_all_data, CELL_NAMES, PATTERN_NAMES, WOS_PATTERNS, clean_wos_get_single_pair

print("CELL_NAMES", CELL_NAMES)
print("PATTERN_NAMES", PATTERN_NAMES)
print("WOS_PATTERNS", WOS_PATTERNS)

# Load the expression data together with the annotated observation metadata
adata = read_all_data(
    data_dir="covid19_su/data",
    gex_cache="covid19_su/data/cd8_gex_covid19_su.h5ad",
    obs_cache="covid19_su/data/obs_annotated_covid19_su_ens_th0.5.csv",
)
# Clean the observation dataframe
df = clean_wos_get_single_pair(adata.obs)
# Create a new AnnData with only the kept cells
adata_clean = adata[df.index].copy()
adata_clean.obs = df
# Replace the original adata with the cleaned copy
adata = adata_clean

# Perform DEG analysis (grouped)
analyzer = DEGAnalyzer(output_dir="covid19_su/analysis/gex_grouped", top_k=8)
analyzer.analyze(adata.copy())

# Perform DEG analysis (per site pattern)
analyzer = DEGAnalyzer(
    output_dir="covid19_su/analysis/gex_per_pattern",
    top_k=8,
    patterns_list=WOS_PATTERNS,
    pattern_names=PATTERN_NAMES,
)
analyzer.analyze(adata.copy(), analyze_patterns=True)
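
For intuition about what a differential expression comparison measures, the sketch below ranks genes by log2 fold change between two toy groups of cells. It is only an effect-size calculation on made-up numbers with hypothetical gene names. It includes no statistical test and should not be read as the procedure `DEGAnalyzer` runs.

```python
import numpy as np

genes = ["GENE_A", "GENE_B", "GENE_C"]  # hypothetical gene names

# Rows are cells, columns are genes; toy normalized expression values
pa_expr = np.array([[4.0, 1.0, 2.0],
                    [6.0, 1.0, 2.0]])
na_expr = np.array([[1.0, 1.0, 4.0],
                    [1.5, 1.0, 4.0]])

# A pseudocount avoids division by zero for genes with no expression
pseudocount = 1.0
log2fc = np.log2(
    (pa_expr.mean(axis=0) + pseudocount) / (na_expr.mean(axis=0) + pseudocount)
)

# Sort genes from most up-regulated to most down-regulated in the PA group
order = np.argsort(-log2fc)
ranked = [genes[i] for i in order]

for gene, value in zip(ranked, log2fc[order]):
    print(f"{gene}: {value:+.2f}")
```

A positive value means higher mean expression in the first group; a value near zero means no difference.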

## Antigen analysis
The antigen analysis identifies recognized COVID-19 epitopes in PA T cells. For more details, please see **Fig. 6e** and the related section in the manuscript.

In [None]:
# In EpitopeGen/research
from epitopegen import AntigenAnalyzer, CoronavirusProteinStandardizer
from covid19_su.utils import read_all_data, CELL_NAMES, PATTERN_NAMES, WOS_PATTERNS, clean_wos_get_single_pair, LEIDEN2CELLNAME

print("CELL_NAMES", CELL_NAMES)
print("PATTERN_NAMES", PATTERN_NAMES)
print("WOS_PATTERNS", WOS_PATTERNS)

# Load the expression data together with the annotated observation metadata
adata = read_all_data(
    data_dir="covid19_su/data",
    gex_cache="covid19_su/data/cd8_gex_covid19_su.h5ad",
    obs_cache="covid19_su/data/obs_annotated_covid19_su_ens_th0.5.csv",
)
# Clean the observation dataframe
df = clean_wos_get_single_pair(adata.obs)
# Create a new AnnData with only the kept cells
adata_clean = adata[df.index].copy()
adata_clean.obs = df
# Replace the original adata with the cleaned copy
adata = adata_clean

# Initialize analyzer with default coronavirus protein standardizer
analyzer = AntigenAnalyzer(
    condition_patterns={
        'nan': 'healthy',
        '1': 'mild', '1 or 2': 'mild', '2': 'mild',
        '3': 'moderate', '4': 'moderate',
        '5': 'severe', '6': 'severe', '7': 'severe'
    },
    pattern_order=['healthy', 'mild', 'moderate', 'severe'],
    protein_colors={
        'Non-structural proteins (NSP)': '#7F63B8',
        'Accessory proteins (ORFs)': '#FF6B6B',
        'Spike (S) protein': '#4ECDC4',
        'Nucleocapsid (N) protein': '#FFD700',
        'Membrane (M) protein': '#4641F0',
        'Envelope (E) protein': '#ED8907',
        'Other': '#9FA4A9'
    },
    protein_standardizer=CoronavirusProteinStandardizer(),
    cell_type_column='leiden',
    cell_type_mapping=LEIDEN2CELLNAME,
    output_dir='covid19_su/analysis/antigen'
)

# Run analysis
results = analyzer.analyze(
    adata,
    top_k=8,
    top_n_proteins=20,
    condition_column="Who Ordinal Scale"
)
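
The `condition_patterns` argument above maps raw values of the `Who Ordinal Scale` column to the four severity groups. The sketch below applies the same dictionary to a handful of toy values to show the grouping. Converting each value with `str()` before the lookup is an assumption made here so that a missing value (`nan`) hits the `'nan'` key; how `AntigenAnalyzer` performs the lookup internally is not shown in this tutorial.

```python
from collections import Counter

# Same mapping as passed to AntigenAnalyzer in the cell above
condition_patterns = {
    "nan": "healthy",
    "1": "mild", "1 or 2": "mild", "2": "mild",
    "3": "moderate", "4": "moderate",
    "5": "severe", "6": "severe", "7": "severe",
}

# Toy column values; a float NaN stands in for a missing score
raw_scores = [float("nan"), "1", "1 or 2", "4", "7", "5", "3"]

# Assumed lookup: stringify each value, then map it to its severity group
groups = [condition_patterns[str(score)] for score in raw_scores]
counts = Counter(groups)

for name in ["healthy", "mild", "moderate", "severe"]:
    print(name, counts[name])
```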