## Abstract

Despite the header being `## Abstract`, this section will render as a highlighted section titled *Summary*. Ensure this section is a **maximum** of 280 characters.

----

:::{.callout-note title="AI usage disclosure" collapse="true"}
This is a placeholder for the AI usage disclosure. Once all authors sign the AI code form on Airtable, SlackBot will message you an AI disclosure that you should place here.
:::

## Purpose

Once edited by you, this file will become your publication. Alternatively, if you already have a notebook written that you're trying to transform into a pub, replace this file with your own, but be sure to add the YAML front matter (the first cell) to your notebook.

Your pub should begin with a section titled **Purpose** where you, as briefly as possible, explain why you did the work described in the pub, the key takeaway, your primary audience, and how you think it could be useful to them/why you're sharing it.

## Introduction

The MSA Pairformer preprint (bioRxiv 2025.08.02.668173v1) introduces a 111M-parameter architecture that leverages multiple sequence alignments and a query-biased pair representation to achieve state-of-the-art performance in contact prediction, protein–protein interface identification, and variant effect prediction. Beyond benchmarks, the paper advances a mechanistic claim: that the model mitigates phylogenetic averaging by weighting sequences according to their evolutionary relevance to the query, thereby recovering subfamily-specific signals.

This interpretation hinges on an assumption that has not been directly tested: that learned sequence weights align with phylogenetic similarity, not merely raw sequence identity. The authors provide indirect evidence (bimodal weight distributions within the response regulator family, improved recovery of subfamily-specific contacts, and weight–identity correlations), but none of these establishes whether the model is actually sensitive to tree-based lineage structure.

In this notebook, we examine the relationship between learned sequence weights and phylogenetic distance. Our goal is not to challenge the reported performance, but to evaluate the evolutionary interpretation. By linking weights to tree-based measures of relatedness, we can clarify whether the Pairformer is genuinely capturing subfamily-specific phylogenetic signal, or whether its behavior is better explained by local sequence identity and compositional effects.

## Re-examining the Response Regulator Case Study

The original paper highlights the response regulator (RR) family as a case study for subfamily-specific signal. There, query-biased weights displayed a bimodal distribution: sequences belonging to the same subfamily as the query were consistently upweighted, while sequences from other subfamilies were downweighted. This was presented as evidence that the model captures lineage structure, rather than relying solely on raw sequence identity.

To evaluate this claim more directly, we begin by reproducing the RR multiple sequence alignment (MSA) used in the paper. This alignment serves as the foundation for both the subfamily analysis and our subsequent phylogenetic analysis. By rebuilding the MSA in a transparent, reproducible way, we can:

- Confirm that the alignment and sequence set are comparable to those used in the published case study.
- Provide input to the MSA Pairformer, enabling us to extract query-biased weights via a forward pass.
- Establish a consistent dataset on which to perform downstream analyses, including the correlation of weights with tree-based phylogenetic distances.
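The last goal above, correlating weights with tree-based distances, can be illustrated with a rank correlation. The following is a minimal sketch, not the analysis code used in this notebook: the function names are hypothetical, and the weights and distances are toy values chosen only to show the call, not results. A rank-based (Spearman) correlation is used here on the assumption that the relationship between weight and distance is monotonic rather than linear.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy values only (hypothetical): one weight and one distance per sequence
toy_weights = [0.9, 0.7, 0.4, 0.1]
toy_distances_to_query = [0.1, 0.3, 0.8, 1.2]
rho = spearman(toy_weights, toy_distances_to_query)
```

Running the same function with sequence identity in place of tree distance gives a second coefficient to compare against, which is the contrast this notebook is interested in.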

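In the next cell, we generate the RR MSAs, one per reference structure (1NXS, 4CBV, and 4E7P). This involves retrieving the same Pfam family (PF00072) and constructing each alignment under similar constraints (e.g., filtering thresholds, depth caps). The cell also infers a phylogenetic tree from each alignment with FastTree. The resulting MSAs are then used as input to the model and, together with the trees, form the basis of our re-analysis.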
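To make "filtering thresholds" and "depth caps" concrete, here is a minimal sketch of one way such constraints could be applied to an alignment. It is not the implementation of `download_and_process_msa`, whose internals are not shown here. The function name, the gap-fraction criterion, and the default values are all assumptions made for illustration.

```python
import random


def gap_fraction(aligned_seq):
    """Fraction of alignment columns that are gaps ('-') in this row."""
    if not aligned_seq:
        return 1.0  # treat an empty row as fully gapped
    return aligned_seq.count("-") / len(aligned_seq)


def filter_and_cap(records, max_gap_fraction=0.5, max_depth=4096, seed=0):
    """Filter gappy rows, then subsample so the MSA has at most max_depth rows.

    `records` is a list of (name, aligned_sequence) tuples whose first entry
    is the query. The query is always kept and stays in first position.
    """
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    query, rest = records[0], records[1:]
    kept = [(n, s) for n, s in rest if gap_fraction(s) <= max_gap_fraction]
    if len(kept) > max_depth - 1:
        # Seeded sampling so that repeated runs select the same rows
        kept = random.Random(seed).sample(kept, max_depth - 1)
    return [query] + kept


# Toy alignment (hypothetical sequences), with the query first
toy_records = [("query", "ACDE"), ("a", "A--E"), ("b", "----"), ("c", "ACD-")]
filtered = filter_and_cap(toy_records, max_gap_fraction=0.5, max_depth=4096)
```

Fixing the seed matters for reproducibility: the sequence weights are only comparable across runs if the same rows are fed to the model each time.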

```python
from pathlib import Path

from analysis.pfam import download_and_process_msa
from analysis.tree import run_fasttree

# Directory that holds the processed alignments and the inferred trees
pfam_msa_output_dir = Path("./data/response_regulators")
pfam_msa_output_dir.mkdir(parents=True, exist_ok=True)

# Pfam family, reference PDB structures, and the matching subfamily Pfam IDs
family_id = "PF00072"
references = ["1NXS", "4CBV", "4E7P"]
subfamily_ids = ["PF00486", "PF04397", "PF00196"]

# Map from the subfamily PFAM ID to the reference RCSB PDB ID
subfamily_pdb_map = dict(zip(subfamily_ids, references, strict=True))

for reference in references:
    # Build the MSA for this reference unless a cached copy already exists
    msa_path = pfam_msa_output_dir / f"{family_id}.alignment.final_{reference}.a3m"
    if not msa_path.exists():
        download_and_process_msa(
            family_id=family_id,
            subfamily_ids=list(subfamily_pdb_map.keys()),
            subset_size=4096,
            reference=reference,
            output_dir=pfam_msa_output_dir,
            keep_intermediate=True,
        )

    # Infer a tree from the MSA with FastTree, again skipping cached results
    fasttree_path = pfam_msa_output_dir / f"{family_id}.alignment.final_{reference}.fasttree.newick"
    if not fasttree_path.exists():
        run_fasttree(msa_path, fasttree_path)
```
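The trees written above are the source of the tree-based distances used later. As an illustration of what such a distance is, the sketch below computes patristic distance (the sum of branch lengths along the path between two leaves) directly from a Newick string. It is a simplified, hypothetical helper rather than the notebook's actual code: it assumes unquoted labels that contain none of `(`, `)`, `,`, `;`, treats labels on internal nodes (such as support values) as ignorable, and uses a toy tree rather than the real FastTree output.

```python
import re


def parse_newick(newick):
    """Parse Newick text into (parent, length, leaves) dictionaries.

    parent maps node id -> parent id, length maps node id -> branch length
    to its parent, and leaves maps leaf name -> node id.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    parent, length, leaves = {}, {}, {}
    stack = []      # internal nodes that are currently open
    closed = None   # internal node just closed by ')', awaiting its label
    next_id = 0
    for tok in tokens:
        if not tok.strip():
            continue  # skip whitespace between tokens
        if tok == "(":
            if stack:
                parent[next_id] = stack[-1]
            stack.append(next_id)
            next_id += 1
            closed = None
        elif tok == ")":
            closed = stack.pop()
        elif tok == ",":
            closed = None
        elif tok == ";":
            break
        else:
            label, _, branch = tok.partition(":")
            if closed is None:
                # A label that does not follow ')' names a new leaf
                node = next_id
                next_id += 1
                if stack:
                    parent[node] = stack[-1]
                leaves[label.strip()] = node
            else:
                # A label after ')' (e.g. a support value) belongs to that node
                node, closed = closed, None
            length[node] = float(branch) if branch.strip() else 0.0
    return parent, length, leaves


def patristic_distance(parent, length, a, b):
    """Sum of branch lengths on the path between nodes a and b."""
    dist_from_a = {}
    node, dist = a, 0.0
    while True:  # record the distance from a to each of its ancestors
        dist_from_a[node] = dist
        if node not in parent:
            break
        dist += length.get(node, 0.0)
        node = parent[node]
    node, dist = b, 0.0
    while node not in dist_from_a:  # climb from b to the first shared ancestor
        dist += length.get(node, 0.0)
        node = parent[node]
    return dist_from_a[node] + dist


def distances_to_query(newick, query_name):
    """Patristic distance from the query leaf to every leaf in the tree."""
    parent, length, leaves = parse_newick(newick)
    query = leaves[query_name]
    return {
        name: patristic_distance(parent, length, query, node)
        for name, node in leaves.items()
    }


# Toy tree (hypothetical); 0.9 is a support value on the (A, B) clade
toy_tree = "((A:0.1,B:0.2)0.9:0.3,C:0.4);"
toy_distances = distances_to_query(toy_tree, "A")
```

For real analyses an established tree library is likely the safer choice, since it also handles quoted labels, comments, and other Newick variants that this sketch does not.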