This script takes gene-name sequences derived from the single-cell RNA-seq dataset of [Bassez et al. (2021)](https://lambrechtslab.sites.vib.be/en/single-cell) and embeds them with the [mixedbread](https://www.mixedbread.com/docs/inference/embedding) embedding model. The original data include paired pre- and on-treatment tumour biopsies from breast-cancer patients receiving anti-PD-1 therapy. For each cell, a ranked gene-name sequence was generated by ordering the 25,288 genes by decreasing expression. The input file contains 175,942 sequences; 157,760 cells remain after rows for patients with missing outcome data are removed.

The code was run on an **A100 GPU** (Colab Pro).

Input:  gene_name_sequences.txt generated from Bassez et al. (2021)

Output: embeddings.npy (NumPy array of shape [n_cells, embedding_dim])
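
The ranking step that produces the input file is not part of this notebook. As a rough illustration only, the sketch below orders the genes of a single cell by decreasing expression and joins their names into one string. The function name `ranked_gene_sequence`, the space separator, the alphabetical tie-break, and the option to drop unexpressed genes are all assumptions, not details taken from the original pipeline.

```python
def ranked_gene_sequence(expression, drop_zeros=True):
    """Return gene names ordered by decreasing expression, joined by spaces.

    `expression` maps gene name -> expression value for one cell.
    Dropping zero-expression genes and using a space separator are assumptions.
    """
    items = list(expression.items())
    if drop_zeros:
        items = [(gene, value) for gene, value in items if value > 0]
    # Sort by expression (descending); ties are broken alphabetically so the
    # result is deterministic.
    ordered = sorted(items, key=lambda gv: (-gv[1], gv[0]))
    return " ".join(gene for gene, _ in ordered)


# Toy cell with made-up expression values.
cell = {"CD3E": 5.0, "GAPDH": 12.0, "EPCAM": 0.0, "PTPRC": 5.0}
sequence = ranked_gene_sequence(cell)
print(sequence)  # GAPDH CD3E PTPRC
```

Writing one such string per line would give a file in the layout that the loading cell below expects.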

# Generate the mixedbread cell embeddings

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
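file_path = '/content/drive/gene_name_sequences.txt'

# Read one gene-name sequence per line
with open(file_path, 'r') as file:
    sequences = file.readlines()

# Strip whitespace and drop empty lines
sequences = [seq.strip() for seq in sequences if seq.strip()]

print(f"Loaded {len(sequences)} sequences")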

Loaded 175942 sequences


In [None]:
!pip install transformers sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np



Model: mixedbread-ai/mxbai-embed-large-v1 (Hugging Face)

In [None]:
desired_dimension = 768

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=desired_dimension)
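
To make the effect of `truncate_dim` concrete, here is a minimal NumPy sketch of what truncating an embedding amounts to: keep the leading components, then rescale each row to unit length. It assumes the model's full output is 1024-dimensional and that normalization is applied after truncation. The helper `truncate_and_normalize` is hypothetical and random vectors stand in for real model outputs.

```python
import numpy as np


def truncate_and_normalize(vectors, dim):
    """Keep the first `dim` components of each row, then L2-normalize the rows."""
    truncated = np.asarray(vectors, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return truncated / np.clip(norms, 1e-12, None)


rng = np.random.default_rng(0)
# Stand-in for full-size model outputs (assumed 1024-d).
full = rng.normal(size=(4, 1024)).astype(np.float32)
reduced = truncate_and_normalize(full, 768)
print(reduced.shape)  # (4, 768)
```

Because the rows are re-normalized after truncation, dot products between the reduced vectors can still be read as cosine similarities.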

In [None]:
embeddings = []
for i, sequence in enumerate(sequences):
    print(f"Processing sequence {i + 1}/{len(sequences)}")

    # Generate embeddings
    sequence_embedding = model.encode(sequence, normalize_embeddings=True)
    embeddings.append(sequence_embedding)
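
Encoding one sequence per call leaves most of the GPU idle. A possible alternative, sketched here under assumptions, is to pass lists of sequences to `encode` in chunks. The helper `encode_in_batches` and the `DummyModel` class are hypothetical; the dummy only exists so the example runs without downloading the real model, and its two-number "embedding" is meaningless.

```python
import numpy as np


def encode_in_batches(model, sequences, batch_size=64):
    """Encode `sequences` chunk by chunk and stack the results row-wise."""
    chunks = []
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        # `encode` is expected to accept a list of strings and return a 2-D array.
        chunks.append(model.encode(batch, normalize_embeddings=True))
    return np.vstack(chunks)


class DummyModel:
    """Hypothetical stand-in for the real model, used only for this demo."""

    def encode(self, texts, normalize_embeddings=False):
        vecs = np.array(
            [[len(t), t.count(" ") + 1.0] for t in texts], dtype=np.float32
        )
        if normalize_embeddings:
            vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        return vecs


demo = encode_in_batches(
    DummyModel(), ["GAPDH CD3E", "PTPRC", "EPCAM KRT8 KRT18"], batch_size=2
)
print(demo.shape)  # (3, 2)
```

With the real model, `SentenceTransformer.encode` also accepts the full list directly together with `batch_size` and `show_progress_bar=True`, which replaces the per-sequence `print` with a single progress bar.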

In [None]:
# Convert the list of per-cell vectors to a 2-D NumPy array
embeddings = np.array(embeddings)

np.save('embeddings.npy', embeddings)

print("Generated embeddings with shape:", embeddings.shape)

Generated embeddings with shape: (175942, 768)


Remove the rows (cells) that belong to patients with missing outcome data.

In [None]:
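import numpy as np

# The index file is 1-based; convert to 0-based positions for NumPy
removed_indices = np.loadtxt("removed_patient_indices.txt", dtype=int)
removed_indices_python = removed_indices - 1

embeddings = np.delete(embeddings, removed_indices_python, axis=0)

# Overwrite the saved file so that embeddings.npy holds the filtered matrix
np.save('embeddings.npy', embeddings)

print("New matrix shape:", embeddings.shape)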

New matrix shape: (157760, 768)
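
The 1-based to 0-based conversion is easy to get wrong: a stray `0` in the index file would become `-1`, and `np.delete` would then silently drop the last row. The sketch below wraps the same removal in a hypothetical helper, `drop_rows_one_based`, that checks the range first. It is demonstrated on a toy matrix rather than the real embeddings.

```python
import numpy as np


def drop_rows_one_based(matrix, one_based_indices):
    """Delete rows given 1-based indices, rejecting out-of-range values."""
    idx = np.unique(np.asarray(one_based_indices, dtype=int)) - 1
    if idx.size and (idx.min() < 0 or idx.max() >= matrix.shape[0]):
        raise IndexError("row index outside 1..n_rows")
    return np.delete(matrix, idx, axis=0)


toy = np.arange(10).reshape(5, 2)  # rows: [0 1] [2 3] [4 5] [6 7] [8 9]
kept = drop_rows_one_based(toy, [1, 4])  # removes the 1st and 4th rows
print(kept.tolist())  # [[2, 3], [4, 5], [8, 9]]
```

Comparing the resulting row count against the expected number of cells (157,760 here) is a cheap additional check.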


In [None]:
from google.colab import files
files.download('embeddings.npy')
