# T-cell vaccine design

Design conserved immune-optimised vaccine(s) to induce a broad T-cell response.

A graph-based method is used to design T-cell vaccines from a set of input sequences for a (viral) target protein. A sliding window of length `k` is applied to the input sequences to split them into potential T-cell epitopes (PTEs) `e`, which are represented as nodes in a graph `G`. Two nodes are connected by an edge when the last `k - 1` amino acids of one match the first `k - 1` amino acids of the other. The epitopes are scored on their conservation and their likelihood of being presented by MHC molecules. The graph is then traversed to find the optimal path, that is, the chain of overlapping epitopes that maximises the sum of the scores; this path represents the vaccine design.
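
The sliding-window step can be sketched in a few lines of plain Python. This is a minimal illustration, not the `tvax` implementation: the helper names `kmerise` and `kmer_frequencies` and the toy sequences are hypothetical, and conservation is assumed here to mean the fraction of input sequences that contain a given `k`-mer.

```python
from collections import Counter


def kmerise(seq, k):
    """Return every length-k window of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_frequencies(seqs, k):
    """Fraction of input sequences that contain each k-mer at least once."""
    counts = Counter()
    for seq in seqs:
        # Use a set so that a k-mer is counted once per sequence
        counts.update(set(kmerise(seq, k)))
    return {kmer: n / len(seqs) for kmer, n in counts.items()}


# Toy input (hypothetical sequences, not taken from the dataset)
toy_seqs = ["MSAMQL", "MSARQL", "MGARQL", "MSARQL"]
freqs = kmer_frequencies(toy_seqs, k=3)
print(freqs["MSA"], freqs["GAR"])  # 0.75 0.25
```

In this toy set `MSA` occurs in three of the four sequences and `GAR` in one, so a frequency-only score would favour paths through `MSA`.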

In [None]:
import networkx as nx

from tvax.analyse import run_parameter_sweep
from tvax.config import EpitopeGraphConfig
from tvax.design import design_vaccines
from tvax.eval import compute_population_coverage, compute_pathogen_coverage, compute_eigen_dist
from tvax.graph import build_epitope_graph
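from tvax.plot import (
    plot_corr, plot_epitope_graph, plot_mhc_heatmap, plot_param_sweep_heatmap,
    plot_param_sweep_scatter, plot_score, plot_scores, plot_vaccine_design_pca,
)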
from tvax.seq import compute_percent_match, path_to_seq

In [None]:
params = {
    'fasta_path': 'data/input/sar_rbd_protein.fasta',
    'results_dir': 'data/results_sar_rbd',
    'human_proteome_path': 'data/input/human_proteome_2023.03.14.fasta.gz',
    'mhc1_alleles_path': '../optivax/scoring/MHC1_allele_mary_cleaned.txt',
    'mhc2_alleles_path': '../optivax/scoring/MHC2_allele_marry.txt',
    'hap_freq_mhc1_path': '../optivax/haplotype_frequency_marry.pkl',
    'hap_freq_mhc2_path': '../optivax/haplotype_frequency_marry2.pkl',
    'k': [9, 12],
    'm': 1,
    'n_target': 1,
    'aligned': False,
    'decycle': True,
    'equalise_clades': True,
    'n_clusters': None,
    'weights': {
        'frequency': 1,
        'population_coverage_mhc1': 0,
        'population_coverage_mhc2': 0
    }
}

config = EpitopeGraphConfig(**params)
print(config.json(indent=4))

## Simple example

In [None]:
# Define the input epitopes
# Cyclic example: kmers_dict = {'MSA': 0.6, 'SAM': 0.2, 'AMS': 0.4}
kmers_dict = {
    'MSA': {'score': 0.6},
    'SAM': {'score': 0.2},
    'AMQ': {'score': 0.2},
    'MQL': {'score': 0.2},
    'SAR': {'score': 0.4},
    'MGA': {'score': 0.3},
    'GAR': {'score': 0.7},
    'ARQ': {'score': 0.4},
    'RQL': {'score': 0.4}
}

# Construct the graph
G = build_epitope_graph(config, kmers_dict=kmers_dict)

# Find the optimal path(s) through the graph of epitopes
Q = design_vaccines(G, config)
print([path_to_seq(path) for path in Q])

# Plot the results
fig = plot_epitope_graph(G, Q, node_size=2000, ylim=[-0.1, 1], with_labels=True, interactive=False)

## Construct the epitope graph
Create a directed graph (`DiGraph`) using the `networkx` package, where each epitope `e` is a node and an edge runs from `ea` to `eb` when the last `k - 1` characters of `ea` match the first `k - 1` characters of `eb`. For computational convenience, two extra nodes, `BEGIN` and `END`, are added. The `BEGIN` node connects to every node that has no predecessors `P(e)` (epitopes that are the first `k` characters of a sequence). Nodes that have no successors `S(e)` (epitopes that are the last `k` characters of a sequence) are connected to the `END` node. For plotting convenience, the length of the shortest path to the `BEGIN` node is stored as a node attribute.
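
The construction described above can be sketched without `networkx`, using a plain adjacency dictionary. This is an illustrative sketch only: `build_overlap_graph` and `distance_from_begin` are hypothetical names, and the k-mers are those of the simple example earlier in the notebook.

```python
from collections import deque


def build_overlap_graph(kmers):
    """Map each node to its successors S(e), adding BEGIN and END nodes."""
    # a -> b when the last k-1 characters of a equal the first k-1 of b
    succ = {a: [b for b in kmers if a[1:] == b[:-1]] for a in kmers}
    has_pred = {b for targets in succ.values() for b in targets}
    starts = [e for e in kmers if e not in has_pred]
    for e in kmers:
        if not succ[e]:
            succ[e] = ["END"]  # nodes without successors lead to END
    succ["BEGIN"] = starts     # BEGIN leads to nodes without predecessors
    succ["END"] = []
    return succ


def distance_from_begin(succ):
    """Breadth-first search giving the shortest path length from BEGIN."""
    dist = {"BEGIN": 0}
    queue = deque(["BEGIN"])
    while queue:
        node = queue.popleft()
        for nxt in succ[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


kmers = ["MSA", "SAM", "AMQ", "MQL", "SAR", "MGA", "GAR", "ARQ", "RQL"]
succ = build_overlap_graph(kmers)
dist = distance_from_begin(succ)
print(succ["BEGIN"], dist["ARQ"])  # ['MSA', 'MGA'] 3
```

The distance computed here plays the role of the x-coordinate used when plotting the graph.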

In [None]:
epitope_graph = build_epitope_graph(config)
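# nx.info() was removed in NetworkX 3.0, so report the graph size directly
print(f'{epitope_graph.number_of_nodes()} nodes, {epitope_graph.number_of_edges()} edges')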

## Design
Take a path through the graph to optimise epitope frequency.

The forward loop computes the function `F(e)` (the largest sum achievable by any path that terminates with the epitope `e`) for all nodes in a stepwise manner. The backward loop chooses the node with the maximum value of `F(e)` as the final epitope of the optimal string and works backwards to build the path that achieves the maximal score.
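
The two loops can be sketched as a dynamic programme over a directed acyclic graph. The sketch below is not the code behind `design_vaccines`: it assumes the graph contains no cycles (the `decycle` option suggests cycles are removed beforehand), the names `best_path` and `path_to_sequence` are hypothetical, and the toy scores are chosen so that the best path is unique.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy scores (hypothetical values, not real epitope scores)
scores = {
    "MSA": 0.5, "SAM": 0.25, "AMQ": 0.25, "MQL": 0.25, "SAR": 0.5,
    "MGA": 0.5, "GAR": 0.75, "ARQ": 0.5, "RQL": 0.5,
}
# Predecessors P(e): epitopes whose last k-1 residues match the first k-1 of e
preds = {e: [p for p in scores if p[1:] == e[:-1]] for e in scores}


def best_path(scores, preds):
    """Return the maximum-score path and its score in an acyclic graph."""
    F, back = {}, {}
    # Forward loop: visit nodes so that every predecessor is scored first
    for e in TopologicalSorter(preds).static_order():
        prev = max(preds[e], key=F.__getitem__, default=None)
        F[e] = scores[e] + (F[prev] if prev is not None else 0.0)
        back[e] = prev
    # Backward loop: start at the best final epitope and follow the pointers
    end = max(F, key=F.get)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = back[node]
    path.reverse()
    return path, F[end]


def path_to_sequence(path):
    """Join overlapping k-mers: first k-mer plus the last residue of the rest."""
    return path[0] + "".join(e[-1] for e in path[1:])


path, total = best_path(scores, preds)
print(path_to_sequence(path), total)  # MGARQL 2.25
```

Each node is visited once and each edge is examined once, so the cost grows linearly with the size of the graph.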

In [None]:
# Find the optimal path(s) through the graph of epitopes
vaccine_designs = design_vaccines(epitope_graph, config)
[path_to_seq(path) for path in vaccine_designs]

## Plot the epitope graph
The nodes are the epitopes `e`, and the edges connect epitopes whose sequences overlap by `k - 1` amino acids. The x-axis shows the shortest path length to the `BEGIN` node and the y-axis shows the epitope frequency `f(e)` in the target sequence set. The optimal path, shown in red, corresponds to the protein sequence that maximises epitope coverage of the population.

In [None]:
# Plot the epitope graph with the optimal path(s) highlighted
fig = plot_epitope_graph(epitope_graph, vaccine_designs, node_size=50)

## Evaluate the implemented scores

Compare the different scores to see whether they are correlated, which would indicate that they are redundant, and whether there is any other relationship between them.
> Here, we can see that the frequency and immune score (MHC binding) are NOT correlated. This means that we have not identified why the epitope frequency is having a disproportionately large effect on the overall score. It also means that the immune scores are not redundant and provide additional information that is not captured by the epitope frequency.

In [None]:
fig = plot_corr(epitope_graph)

View the distribution of the scores for all the epitopes in the graph.
> Looking at the distribution of the epitope frequencies, we can see that there are many infrequent PTEs and a few frequent PTEs. My current hypothesis is that the epitope frequency has a disproportionately large effect on the overall score because of the **"zero-sum"** nature of epitope frequency. At a particular position, for one epitope to have a high frequency, the other epitopes at that position must have a low frequency. Crucially, because this holds at every position, the epitope frequency influences the choice of epitope at each decision point along the path.

In [None]:
fig = plot_score(epitope_graph, score='population_coverage_mhc2')
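
The zero-sum argument above can be illustrated with a small worked example. This sketch assumes equal-length, aligned toy sequences so that "position" is well defined; the helper `positional_frequencies` and the sequences are hypothetical.

```python
from collections import Counter


def positional_frequencies(aligned_seqs, k, pos):
    """Frequency of each k-mer that starts at column `pos`."""
    counts = Counter(seq[pos:pos + k] for seq in aligned_seqs)
    return {kmer: n / len(aligned_seqs) for kmer, n in counts.items()}


# Toy alignment (hypothetical): three sequences share MSA, one carries MGA
toy_seqs = ["MSARQL", "MSARQL", "MSARQL", "MGARQL"]
freqs = positional_frequencies(toy_seqs, k=3, pos=0)
print(freqs)  # {'MSA': 0.75, 'MGA': 0.25}
```

The frequencies at a single position always sum to 1, so any gain for one epitope is a loss for its competitors at that position.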

## Evaluate the vaccine design(s)

Determine how each of the individual scores contributes to the overall score.
> Here, we can see that the epitope frequency has a disproportionately large effect on the overall score (despite the frequency score having a smaller weight). This is because every epitope has a non-zero frequency score, whereas the population coverage scores are zero for most epitopes.

In [None]:
fig, score_dict = plot_scores(epitope_graph, config.weights, vaccine_designs, percent=True)

In [None]:
# Report the percentage contribution of each weighted score
for score_name in dict(config.weights):
    contrib = score_dict[score_name]
    print(f'The score {score_name} contributed {contrib:.2f}% of the total score of the vaccine design, averaged across positions')
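
The effect described above can be reproduced with a toy calculation. The sketch below is an assumption about how such percentages might be derived (weighted component sums over the path divided by the total); it is not the logic of `plot_scores`, and the score names, weights and values are hypothetical.

```python
def score_contributions(path_scores, weights):
    """Percentage of the total weighted path score due to each component."""
    totals = {
        name: weights[name] * sum(s[name] for s in path_scores)
        for name in weights
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in weights}
    return {name: 100 * t / grand_total for name, t in totals.items()}


# Hypothetical path of four epitopes: frequency is non-zero everywhere,
# population coverage is non-zero for a single epitope only
path_scores = [
    {"frequency": 0.75, "population_coverage": 0.0},
    {"frequency": 0.75, "population_coverage": 0.0},
    {"frequency": 0.75, "population_coverage": 0.5},
    {"frequency": 0.75, "population_coverage": 0.0},
]
weights = {"frequency": 1, "population_coverage": 2}
contrib = score_contributions(path_scores, weights)
print(contrib)  # {'frequency': 75.0, 'population_coverage': 25.0}
```

Even with half the weight, the frequency term supplies three quarters of the total here, because it is the only component that is non-zero at every position.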

In [None]:
n_target = 5
pop_cov_mhc1 = compute_population_coverage(vaccine_designs[0], n_target, config, "mhc1")
pop_cov_mhc2 = compute_population_coverage(vaccine_designs[0], n_target, config, "mhc2")
path_cov = compute_pathogen_coverage(vaccine_designs[0], config)

print(
    f"{pop_cov_mhc1 * 100:.2f}% of the population is predicted to have ≥ {n_target} Class I peptide-HLA hits produced by the vaccine"
)
print(
    f"{pop_cov_mhc2 * 100:.2f}% of the population is predicted to have ≥ {n_target} Class II peptide-HLA hits produced by the vaccine"
)
print(
    f"{path_cov * 100:.2f}% of potential T-cell epitopes for each target input sequence are covered by the vaccine on average"
)
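
The pathogen coverage reported above can be sketched as the fraction of each target sequence's `k`-mers that also occur in the vaccine, averaged over targets. This is an assumed definition based on the printed message, not the code of `compute_pathogen_coverage`; the function names and sequences are hypothetical, and every target is assumed to be at least `k` residues long.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pathogen_coverage(vaccine_seq, target_seqs, k):
    """Mean fraction of each target's k-mers that appear in the vaccine."""
    vaccine_kmers = kmer_set(vaccine_seq, k)
    fractions = []
    for seq in target_seqs:
        target_kmers = kmer_set(seq, k)
        fractions.append(len(target_kmers & vaccine_kmers) / len(target_kmers))
    return sum(fractions) / len(fractions)


# Toy example (hypothetical sequences)
coverage = pathogen_coverage("MGARQL", ["MSARQL", "MGARQL"], k=3)
print(f"{coverage * 100:.2f}%")  # 75.00%
```

The first target shares two of its four 3-mers with the vaccine and the second shares all four, giving a mean of 75%.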

Compare the generated vaccine design to a design optimised only for epitope frequency to determine whether the additional scores change the vaccine design.

> The additional immune scores (MHC binding and immunogenicity) do not change the vaccine design.

In [None]:
seq_sar_freq = 'MSDNGPQNQRSAPRITFGGPTDSTDNNQDGGRSGARPKQRRPQGLPNNTASWFTALTQHGKEELRFPRGQGVPINTNSGKDDQIGYYRRATRRVRGGDGKMKELSPRWYFYYLGTGPEASLPYGANKEGIVWVATEGALNTPKDHIGTRNPNNNAAIVLQLPQGTTLPKGFYAEGSRGGSQASSRSSSRSRGNSRNSTPGSSRGNSPARMASGGGETALALLLLDRLNQLESKVSGKGQQQQGQTVTKKSAAEASKKPRQKRTATKQYNVTQAFGRRGPEQTQGNFGDQELIRQGTDYKHWPQIAQFAPSASAFFGMSRIGMEVTPSGTWLTYHGAIKLDDKDPQFKDNVILLNKHIDAYKTFPPTEPKKDKKKKTDEAQPLPQRQKKQPTVTLLPAADMDDFSRQLQNSMSGASADSTQA'
seq_design = path_to_seq(vaccine_designs[0])
align, perc_match = compute_percent_match(seq_sar_freq, seq_design)

print(align)
print(f'The sequences are {perc_match:.2f}% similar')

### See where the vaccine design(s) are in sequence space
For different numbers of clusters, plot the input target sequences and the vaccine design(s) on a PCA plot.

In [None]:
n_clusters = [6]

for n in n_clusters:
    config.n_clusters = n
    epitope_graph = build_epitope_graph(config)
    vaccine_designs = design_vaccines(epitope_graph, config)
    fig, comp_df = plot_vaccine_design_pca(vaccine_designs, config)
    eigen_dist = compute_eigen_dist(comp_df)
    print(f'Using {n} clusters, the vaccine design is {eigen_dist:.2f} standard deviations away from the mean of the input/target sequences')

## Iterative clade vaccine design
Design a vaccine iteratively: first design a vaccine for each cluster, then combine the per-cluster designs into a single vaccine design.

In [None]:
from tvax.graph import load_fasta, assign_clades

n_clusters = [3]

for n in n_clusters:

    # Set the parameters
    config.n_clusters = n
    config.equalise_clades = True

    # Load the FASTA sequences
    seqs_dict = load_fasta(config.fasta_path)

    # Assign the sequences to clusters
    clusters_dict = assign_clades(seqs_dict, config)
    # Organise the sequences into clusters
    seqs_clusters_dict = {cluster: [] for cluster in clusters_dict.values()}
    for seq, cluster in clusters_dict.items():
        seqs_clusters_dict[cluster].append(seq)
    
    # Design a vaccine for each cluster
    config.equalise_clades = False
    vaccine_seqs_dict = {}
    for cluster, seqs in seqs_clusters_dict.items():
        cluster_seqs = {seq: seqs_dict[seq] for seq in seqs}
        epitope_graph = build_epitope_graph(config, seqs_dict=cluster_seqs)
        vaccine_designs = design_vaccines(epitope_graph, config)
        vaccine_seqs_dict[f'cluster_{cluster}_vaccine_design'] = path_to_seq(vaccine_designs[0])
    
    # Generate a vaccine design using the clade designs
    config.n_clusters = 1
    epitope_graph = build_epitope_graph(config, seqs_dict=vaccine_seqs_dict)
    vaccine_designs = design_vaccines(epitope_graph, config)

    # Plot the PCA of where the vaccine design falls in the space of the original target sequences
    config.n_clusters = n
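    # plot_vaccine_design_pca returns a (figure, components DataFrame) pair, as in the PCA cell above
    fig, comp_df = plot_vaccine_design_pca(vaccine_designs, config)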

### Quality control - MHC binding
Check which MHC alleles are represented in the vaccine design.

In [None]:
alleles = None # [a.replace('*','') for a in config.alleles]
fig, peptide_hla_hits_df = plot_mhc_heatmap(vaccine_designs, config, alleles)
fig.show()
peptide_hla_hits_df.style.bar(subset=['n_peptide_hla_hits'], color='#5fba7d')

## Perform a parameter sweep

Generate a set of vaccine designs for different values of `n_clusters` and of the population coverage weights in `config.weights`, then compare the designs by their % population coverage and % pathogen coverage.

In [None]:
param_sweep_df = run_parameter_sweep(
    n_clusters=list(range(1, 21)),
    pop_cov_weights=list(range(1, 21)),
    results_path='data/parameter_sweep_results.csv',
    config=config
)

heatmap_fig = plot_param_sweep_heatmap(param_sweep_df)
scatter_fig = plot_param_sweep_scatter(param_sweep_df, x="path_cov", y="pop_cov")

heatmap_fig.show()
scatter_fig.show()