# Estimating Persistent Homology Dimension of Protein Sequences in the Latent Space of ESM-2

This is an attempt at an implementation of the estimation of Persistent Homology Dimension described in [Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts](https://arxiv.org/abs/2306.04723). This is a fractal dimension that aims to detect protein sequences generated by AI. 




1. **Model and Tokenizer Initialization**: An ESM-2 protein model and tokenizer are initialized. These represent the mapping function $ f: M \rightarrow \mathbb{R}^n $.

2. **Protein Sequence Input and Tokenization**: A protein sequence, which can be thought of as a finite subset $ X \subseteq M $, is provided as input and tokenized. The tokens are then converted to tensors, which are the required input format for the ESM model.

3. **Embedding Generation**: The token tensors are passed through the model $ f $ to generate a set of embeddings $ Y = f(X) $. Each embedding is a point in $ \mathbb{R}^n $.

4. **Embedding Subsetting**: The embeddings corresponding to the first and last tokens are removed. These special tokens do not represent actual amino acids in the protein sequence and so are not part of the subset $ X $.

5. **Subsampling, Distance Matrix Calculation, and MST Calculation**: A series of subsets $ S_i \subseteq Y $, $i = 1, \ldots, k $ are created, with sizes $ n_i $ varying from 2 to $ |Y| $. For each subset $ S_i $, a distance matrix is calculated and used to compute the minimum spanning tree (MST). The MST corresponds to a graph $ G \subseteq Y $.

6. **Persistent Score Calculation**: The persistent score $E_0^\alpha(S_i) $ is calculated as the maximum edge length in the MST, which corresponds to the lifespans of 0-dimensional features in the PH computation. In this case, $ \alpha = 1 $.

7. **Data Preparation**: The sizes of the subsets and the corresponding persistent scores are logged (natural logarithm) and stored for linear regression.

8. **Linear Regression**: Linear regression is performed on the log-transformed sizes and persistent scores to approximate the relationship $ \log E_0^\alpha(S_i) \sim (1 - \frac{\alpha}{d}) \log n_i $. The slope of the regression line is then used to calculate the Persistent Homology Dimension (PHD) using the formula $ d = \frac{1}{1 - \text{reg.coef}[0][0]} $, which corresponds to $ d = \text{dim}_0^{PH}(M) $.

In summary, the code is estimating the 0-dimensional Persistent Homology Dimension of the manifold $M$ represented by the protein sequence. This is achieved by generating embeddings for the sequence, calculating the MST for subsets of these embeddings, and then performing linear regression on the sizes of these subsets and their corresponding persistent scores. It is unclear whether or not there is any difference in Persistent Homology Dimension for "Natural Proteins" vs. "AI Proteins", so this is just an initial exploratory notebook. 


In [18]:
import numpy as np
from sklearn.linear_model import LinearRegression
from transformers import AutoTokenizer, AutoModel, EsmModel
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Input text
text = "MAPLRKTYVLKLYVAGNTPNSVRALKTLNNILEKEFKGVYALKVIDVLKNPQLAEEDKILATPTLAKVLPPPVRRIIGDLSNREKVLIGLDLLYEEIGDQAEDDLGLE"

# Tokenize the input and convert to tensors
inputs = tokenizer(text, return_tensors='pt')

# Get the embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[0].numpy()

# Remove the first and last embeddings (<CLS> and <EOS>)
embeddings = embeddings[1:-1]

# Sizes for the subsets to sample
sizes = np.linspace(2, len(embeddings), num=100, dtype=int)

# Prepare data for linear regression
x = []
y = []

for size in sizes:
    # Sample a subset of the embeddings
    subset = np.random.choice(len(embeddings), size, replace=False)
    subset_embeddings = embeddings[subset]
    
    # Compute the distance matrix
    dist_matrix = np.sqrt(np.sum((subset_embeddings[:, None] - subset_embeddings)**2, axis=-1))

    # Compute the minimum spanning tree
    mst = minimum_spanning_tree(dist_matrix).toarray()

    # Calculate the persistent score E (the maximum edge length in the MST)
    E = np.max(mst)

    # Append to the data for linear regression
    x.append(np.log(size))
    y.append(np.log(E))

# Reshape for sklearn
X = np.array(x).reshape(-1, 1)
Y = np.array(y).reshape(-1, 1)

# Linear regression
reg = LinearRegression().fit(X, Y)

# Estimated Persistent Homology Dimension
phd = 1 / (1 - reg.coef_[0][0])
print(phd)


Some weights of the model checkpoint at facebook/esm2_t6_8M_UR50D were not used when initializing EsmModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing EsmModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EsmModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['esm.pooler.dense.weight', 'esm.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


1.0783164527027296
