## Immunosuppressive paths between cancer epithelial and CD8 T cells in BC datasets
<br>
<b>Description</b> : This notebook generates the perfectly paired, paired, and unpaired single-cell and spatial transcriptomics datasets used for benchmarking in Figure 2c.<br>
<b>Author</b> : Hejin Huang (huang.hejin@gene.com)<br>

In [1]:
import numpy as np
import os
import pandas as pd
import scanpy as sc
import tangram2 as tg2
import tangram as tg
from sklearn.metrics import jaccard_score



In [3]:
# --- Data Loading and Initial Filtering ---
path = '../../data/tangram2_paper_data/original/bc/sc/'
ad_sc = sc.read_h5ad(path + 'bc.h5ad')

subtype = 'TNBC'
ad_sc_TNBC = ad_sc[ad_sc.obs['subtype'] == subtype].copy() # .copy() turns the AnnData view into an independent object, avoiding ImplicitModificationWarning on later edits

patient_IDs = list(ad_sc_TNBC.obs['orig.ident'].unique())
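
The loop in the next cell copies each patient's cells before checking how many there are. A possible alternative, sketched here on a toy `obs`-like table, is to count cells per patient first with `value_counts` and keep only the patients above the threshold. The toy data, the `min_cells` parameter, and the helper `patients_above` are illustrative assumptions; only the `orig.ident` column name comes from the notebook.

```python
import pandas as pd

def patients_above(obs: pd.DataFrame, id_col: str, min_cells: int) -> list:
    """Return the IDs that have strictly more than `min_cells` rows in `obs`."""
    counts = obs[id_col].value_counts()
    return counts[counts > min_cells].index.tolist()

# Toy stand-in for ad_sc_TNBC.obs (hypothetical patient IDs and sizes)
toy_obs = pd.DataFrame({"orig.ident": ["P1"] * 5 + ["P2"] * 2 + ["P3"] * 4})
selected = patients_above(toy_obs, "orig.ident", min_cells=3)
print(selected)
```

With the real data this would presumably be called as `patients_above(ad_sc_TNBC.obs, 'orig.ident', 3000)`, so that only qualifying patients are subset and copied.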



In [4]:
# --- Data Generation Loop for Patients with > 3000 cells ---
output_base_path = '../../data/tangram2_paper_data/imod/bc/simulated_dataset/'
label_used = 'celltype_major' # Define label column once

for patient in patient_IDs:
    ad_sc_patient = ad_sc_TNBC[ad_sc_TNBC.obs['orig.ident'] == patient].copy()

    # Check if the patient has more than 3000 cells
    if len(ad_sc_patient) > 3000:
        print(f"Processing patient: {patient} with {len(ad_sc_patient)} cells.")

        # Create patient-specific directory if it doesn't exist
        patient_output_dir = os.path.join(output_base_path, patient)
        os.makedirs(patient_output_dir, exist_ok=True)

        # Randomly assign each cell to group 1 or 2 (50% probability each);
        # the labels 1 and 2 match the original notebook's splitting logic
        ad_sc_patient.obs['random'] = np.random.randint(1, 3, size=len(ad_sc_patient))

        ad_sc_patient_A1 = ad_sc_patient[ad_sc_patient.obs['random'] == 1].copy()
        ad_sc_paired_unperfect = ad_sc_patient[ad_sc_patient.obs['random'] == 2].copy()

        # Generate paired dataset using cellmix
        ad_sp, ad_sc_paired = tg2.evalkit.datagen.cellmix.cellmix.cellmix(
            ad_sc_patient_A1,
            n_spots=100,
            n_cells_per_spot=10,
            n_types_per_spot=3,
            label_col=label_used,
            encode_spatial=True,
        )

        # Save the generated paired and imperfectly paired datasets (writes are commented out here)
        # ad_sc_paired.write_h5ad(os.path.join(patient_output_dir, 'ad_sc_paired.h5ad'))
        # ad_sc_paired_unperfect.write_h5ad(os.path.join(patient_output_dir, 'ad_sc_paired_unperfect.h5ad'))
        # ad_sp.write_h5ad(os.path.join(patient_output_dir, 'ad_sp.h5ad')) # Save ad_sp as well

Processing patient: CID4495 with 7985 cells.
Processing patient: CID44971 with 7986 cells.
Processing patient: CID44991 with 7023 cells.
Processing patient: CID4513 with 5619 cells.
Processing patient: CID4515 with 4149 cells.
Processing patient: CID3963 with 3527 cells.
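
The split above draws from NumPy's global random state without a seed, so the two halves differ on every run. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for the benchmark. The helper `split_labels` and the seed value are illustrative and not part of the notebook.

```python
import numpy as np

def split_labels(n_cells: int, seed: int = 0) -> np.ndarray:
    """Assign each cell the label 1 or 2 with equal probability, reproducibly."""
    rng = np.random.default_rng(seed)
    # integers(1, 3) draws from {1, 2}; the upper bound is exclusive
    return rng.integers(1, 3, size=n_cells)

labels = split_labels(7985, seed=0)
first_half = labels == 1   # mask for the cells passed to the simulator
second_half = labels == 2  # mask for the held-out cells
print(first_half.sum(), second_half.sum())
```

In the notebook this would correspond to something like `ad_sc_patient.obs['random'] = split_labels(len(ad_sc_patient))`. Deriving a different seed per patient would keep the splits independent across patients.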
