<center><h1> Benchmarking Datasets </h1></center>

## Context

This notebook loads the datasets used in GEARS and in the benchmarking paper [\[1\]](#benchmark).

In particular, we will be using the following [datasets](#datasets).

## Table of Contents

- [Raw Data](#raw-pipeline)
- [Pre-processed Data](#pre-processed-pipeline)

## Datasets

### Raw Data

- [Adamson](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90546): 
  - ht<span>tps://</span>www.ncbi.nlm.nih.gov/geo/download/?acc=GSE90546&format=file
  - Assay type: Perturb-seq, a combination of droplet-based scRNA-seq with a strategy for barcoding CRISPR-mediated perturbations.
  - Single CRISPRi perturbation.
  - Overall: 3 different pooled CRISPR screening experiments were conducted via Perturb-seq.
- [Replogle](https://plus.figshare.com/articles/dataset/_Mapping_information-rich_genotype-phenotype_landscapes_with_genome-scale_Perturb-seq_Replogle_et_al_2022_processed_Perturb-seq_datasets/20029387):
  - Single cell raw **K562 genome scale day 8 post-transduction** ht<span>tps://</span>plus.figshare.com/ndownloader/files/35775507
  - Single cell raw **K562 essential scale day 6 post-transduction** ht<span>tps://</span>plus.figshare.com/ndownloader/files/35773219
  - Single cell raw **RPE1 essential scale day 7 post-transduction** ht<span>tps://</span>plus.figshare.com/ndownloader/files/35775606
- [Norman](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133344):
  - raw_barcodes.tsv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fbarcodes.tsv.gz
  - raw_cell_identities.csv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fcell%5Fidentities.csv.gz
  - raw_genes.tsv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fgenes.tsv.gz
  - raw_matrix.mtx.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fmatrix.mtx.gz

### Pre-processed Data

For completeness, here are the datasets used by GEARS or scFoundation and the benchmarking paper [\[1\]](#benchmark).

**GEARS**:

- Single-gene perturbation:
  - [Adamson](https://dataverse.harvard.edu/file.xhtml?fileId=6154417&version=3.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/6154417
  - [Replogle K562](https://dataverse.harvard.edu/file.xhtml?fileId=7458695&version=6.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/7458695
  - [Replogle RPE1](https://dataverse.harvard.edu/file.xhtml?fileId=7458694&version=6.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/7458694

- Multiple-gene perturbation
  - [Norman](https://dataverse.harvard.edu/file.xhtml?fileId=6154020&version=3.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/6154020
  - 131 two-gene perturbations

**scFoundation**:

Another preprocessing strategy for the Norman et al. dataset:

- [Norman scFoundation](https://figshare.com/articles/dataset/scFoundation_Large_Scale_Foundation_Model_on_Single-cell_Transcriptomics_-_processed_datasets/24049200?file=44477939): ht<span>tps://</span>figshare.com/ndownloader/files/44477939

## References

1. Ahlmann-Eltze, Huber, and Anders, “Deep Learning-Based Predictions of Gene Perturbation Effects Do Not yet Outperform Simple Linear Baselines.”<a id="benchmark"></a>
2. Adamson et al., “A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response.”<a id="adamson"></a>


## 1) Download Datasets

Note that the bash script is run from the notebook, but you can also run it from the shell. Files are downloaded according to the YAML config passed with `--config` (here `fine_tune/datasets/config/benchmarking.yaml`).

In [None]:
%%bash
cd ../../ # cd to root of the project
python fine_tune/scripts/download/cli.py --config fine_tune/datasets/config/benchmarking.yaml

## 2) Adamson

In [1]:
# All imports here
from functools import reduce
from pathlib import Path

import anndata as ad
import numpy as np
import pandas as pd
import pooch
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy.io import mmread

from fine_tune.scripts import format_obs, format_var

data_path = Path("./../datasets")


%reload_ext autoreload
%autoreload 2

**Note on Adamson**

| Accession              | Experiment                     | Model                               | Protocol                                                                                    | Additional Protocol                                                                                                                                                                                                                                                                                                                                               | sgRNA Count                                                                               | Cell Count | Assay Type           |
|------------------------|--------------------------------|-------------------------------------|---------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|------------|----------------------|
| **GSM2406675\_10X001** | **Pilot Experiment**           | K562 cells with dCas9-KRAB (cBA010) | Individual transduction → pooled after 3 days → selection + 5 days combined growth          | Growth in the presence of puromycin (3µg/mL)                                                                                                                                                                                                                                                                                                                      | 8 distinct GBCs (1 control)                                                               | 5,768      | 10x 3' v1            |
| **GSM2406677\_10X005** | **UPR Epistasis Experiment**   | K562 cells with dCas9-KRAB (cBA010) | Individual transduction → pooled after 3 days → selection + 2 days combined growth          | To limit heterogeneous effects of cell microenvironments caused by cell settling, the sorted cells were grown with continuous agitation on an orbital shaker. <br> <br> Pharmacological treatment: <br> • Thapsigargin (ER Ca²⁺ pump inhibitor) (100 nM for 4 hr)<br> • Tunicamycin (N-glycosylation inhibitor) (4 μg/mL for 6 hr) <br> • DMSO (control) for 6 hr | 7 GBCs (3-guide vectors of 3 genes: single, double, triple) + 2 triple negative controls | 15,006     | 10x 3' v1            |
| **GSM2406681\_10X010** | **UPR Perturb-seq Experiment** | K562 cells with dCas9-KRAB (cBA011) | Individual lentivirus production → pooled transduction → selection + 7 days combined growth | To limit heterogeneous effects of cell microenvironments caused by cell settling, the sorted cells were grown with continuous agitation on an orbital shaker. <br><br> Sequencing across two separate runs totaling 10 lanes.                                                                                                                                      | 93 GBCs (2 control GBCs; targets cover 82 genes)                                          | 65,337     | 10x 3' v1 (10 lanes) |

**Remarks:**
- UPR: unfolded protein response.
- The assay technology is assumed to be v1 of the Chromium Single Cell 3′ Solution. The version is not specified, but the paper dates from 2016, the year the technology was launched.
- When the identity of a Cas9-targeting single guide RNA (sgRNA) couldn't be uniquely determined, the cell is labeled as either:
  - "multiplet" 
  - "NaN"
- In the "UPR Epistasis Experiment" [\[2\]](#adamson): 
  > Note in the raw sequencing data the tunicamycin-treated cells have gemgroup 1 (as a BAM tag), the thapsigargin-treated cells have gemgroup 2, and the DMSO-treated cells have gemgroup 3.

In [2]:
adamson_dir = data_path.joinpath("raw/raw_adamson")
mtx_files = sorted(adamson_dir.glob("*_matrix.mtx.txt.gz"))

all_adatas = {}

for mtx_file in mtx_files:
    prefix = mtx_file.name.split("_matrix")[0]

    # Paths
    barcodes_file = adamson_dir / f"{prefix}_barcodes.tsv.gz"
    genes_file = adamson_dir / f"{prefix}_genes.tsv.gz"
    identities_file = adamson_dir / f"{prefix}_cell_identities.csv.gz"

    # Load metadata
    genes = pd.read_csv(genes_file, header=None, sep="\t")
    barcodes = pd.read_csv(barcodes_file, header=None)
    cell_identities = pd.read_csv(identities_file, index_col=0)

    # Create AnnData
    adata = ad.io.read_mtx(mtx_file).T
    adata.var_names = genes[0].values
    adata.obs_names = barcodes[0].values
    # adata.obs = cell_identities.loc[adata.obs_names] fail because more barcodes than cell_identity.
    adata.obs = pd.DataFrame(index=adata.obs_names)
    adata.obs = adata.obs.join(cell_identities) # fill with NaN barcodes with non found identity

    # Optional: keep track of the experiment
    adata.obs["experiment"] = prefix
    all_adatas[prefix] = adata

all_adatas

{'GSM2406675_10X001': AnnData object with n_obs × n_vars = 5768 × 35635
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells', 'experiment',
 'GSM2406677_10X005': AnnData object with n_obs × n_vars = 15006 × 32738
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells', 'experiment',
 'GSM2406681_10X010': AnnData object with n_obs × n_vars = 65337 × 32738
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells', 'experiment'}

### a) Retrieve correct perturbation

Let's check the number of unique guide barcodes, which should correspond to the number of unique perturbations (except that some barcodes target the same gene).

In [3]:
print(
    "Number of unique 'guide barcode' detected:\n" +
    "\n".join([
        f"{exp}: {adata.obs['guide identity'].unique().__len__()}, "
        f"expected {gbc}, of which {control} controls. "
        for (exp, adata), (gbc, control) in zip(
            all_adatas.items(),
            zip([8, 9, 93],[1, 2, 2]) # issued from the table above, sgRNA column
        )]) +
    "\nPotentially NaN or '*' for non identifiable or multiplets should be removed."
)

Number of unique 'guide barcode' detected:
GSM2406675_10X001: 10, expected 8, of which 1 controls. 
GSM2406677_10X005: 21, expected 9, of which 2 controls. 
GSM2406681_10X010: 115, expected 93, of which 2 controls. 
Potentially NaN or '*' for non identifiable or multiplets should be removed.


**Filtering is required:**
- GSM2406675_10X001: 
  - Remove NaN and '*'
- GSM2406677_10X005: 
  - Remove NaN and '*'
  - Targeted genes should be a single, double or triple combination of:
    - ATF6
    - IRE1
    - PERK
- GSM2406681_10X010: 
  - Remove NaN and '*'
  - Use "Table S1. Protospacer Sequences of sgRNAs" to identify the valid guides
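
The filtering rules above can be sketched on plain Python values, without AnnData. This is only an illustration: the guide strings and vector IDs (`pXX1`, ...) are hypothetical placeholders, and the helper names are not part of the project.

```python
import math


def is_identified(guide) -> bool:
    """Return True when a guide identity is present and not flagged with '*'."""
    if guide is None:
        return False
    if isinstance(guide, float) and math.isnan(guide):
        return False
    return "*" not in str(guide)


def targets_allowed(guide: str, allowed=("ctrl", "ATF6", "IRE1", "PERK")) -> bool:
    """Return True when the guide identity mentions at least one allowed target."""
    return any(name in guide for name in allowed)


# Hypothetical guide identities, for illustration only
guides = ["ATF6_only_pXX1", float("nan"), "PERK_IRE1_pXX2*", "3x_neg_ctrl_pXX3", None]

# Keep identified guides whose targets belong to the epistasis experiment
kept = [g for g in guides if is_identified(g) and targets_allowed(g)]
print(kept)
```

On real data the same logic is applied column-wise with `pandas` string methods, as in the cells that follow.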

In [4]:
adamson_pilot = all_adatas["GSM2406675_10X001"]
adamson_epistasis = all_adatas["GSM2406677_10X005"]
adamson_perturb = all_adatas["GSM2406681_10X010"]

In [5]:
# Pilot experiment - remove NaN and '*'
# (no capture group in the pattern, which avoids the pandas "match groups" UserWarning)
adamson_pilot = adamson_pilot[
    ~adamson_pilot.obs["guide identity"].astype(str).str.contains(r"nan|\*")
].copy()  # copy so that later column assignments do not act on a view
# UPR epistasis experiment - keep only guides containing any of ctrl, IRE1, PERK, ATF6
adamson_epistasis = adamson_epistasis[
    adamson_epistasis.obs["guide identity"].astype(str).str.contains("ctrl|IRE1|PERK|ATF6")
].copy()

# UPR Perturb-seq experiment - retrieve Table S1 and keep only the listed guides plus the 'mod' controls
file = pooch.retrieve(
    url="https://ars.els-cdn.com/content/image/1-s2.0-S0092867416316609-mmc1.xlsx",
    known_hash="sha256:9b5935cb15ba2f6d60d3017832de2918e7d4f172db6f202be7999cba5feea82b"
)
protospace_df = pd.read_excel(file, header=1)
# Some gene names list aliases separated by '/'.
# Since the convention chosen in the dataset is not known, such rows are duplicated, one row per alias.
protospace_df["Gene"] = protospace_df["Gene"].apply(lambda x: x.split("/") if isinstance(x, str) else x)
protospace_df = protospace_df.explode("Gene").reset_index(drop=True)

pert = adamson_perturb.obs["guide identity"]
# guide identities listed in the protospacer table, formatted as <gene>_<vector id>
valid_guides = (protospace_df["Gene"] + "_" + protospace_df["Perturb-seq_Vector_ID"]).dropna()
# controls ('mod') with a large occurrence (more than 100 cells)
guide_counts = pert.value_counts()
frequent_guides = guide_counts[guide_counts > 100].index
adamson_perturb = adamson_perturb[
    pert.isin(valid_guides) | (pert.str.contains("mod", na=False) & pert.isin(frequent_guides))
].copy()



### b) Formatting Adamson

Add more metadata:
- `obs`
  - `gene_perturbed`
    - parse `guide identity`:
      - **Renaming target:**
        - GSM2406675_10X001
          - The control is written with 'mod'. It should be renamed to 'NT' (non-targeting)
        - GSM2406677_10X005
          - IRE1 -> ERN1 to match the official gene name
          - PERK -> EIF2AK3 to match the official gene name
          - Single perturbations should have 'only' removed and be padded with NT (gene+NT+NT), since triple-guide vectors are used
          - Double or triple perturbations should be joined with '+'
          - Controls are written with 'ctrl'. Since these are triple guides, they should be marked as NT+NT+NT
        - GSM2406681_10X010
          - Controls are written with 'mod'. They should be renamed to 'NT'
  - `tissue` 
    - cell line
  - `cell_type`
    - lymphoblasts
  - `cell_line` 
    - K562
  - `disease` 
    - chronic myelogenous leukemia
  - `organism` 
    - Homo sapiens
  - `assay` 
    - 10x 3' v1
  - `cell_barcode`
    - Reindex the obs table
  - `perturbation_type` 
    - CRISPRi, CRISPRa, CRISPRko, drug (sum of those)
  - `drug` 
    - (name of the drug or sum of those, (control = DMSO))
  - `drug_canonical_smiles` 
    - SMILES representation of the drug. Should be canonicalized.
  - `drug_dose` 
    - (dosage of the drug in µM). Should be formatted as a list of doses: `[(drug1, drug_dose1, drug_unit1), ...]`
  - `drug_targets` 
    - (name of the drug targeted gene, NaN if unknown) Different targets are separated by a `","`
  - `drug_moa` 
    - (name of the drug mechanism of action, NaN if unknown). Different MoA are separated by a `","`
  - `gene_perturbed` 
    - (name of the gene perturbed or sum of those (control = NT) (if single perturbation but double or triple guide, should be gene+NT / gene+NT+NT))
  - `perturbation_id` 
    - **Assigning unique perturbation identifiers**
  
      To facilitate downstream analysis, we add a `perturbation_id` column that uniquely identifies the biological perturbation applied in each sample. This string encodes key information from the metadata, such as the perturbation type (e.g., CRISPRi, drug), the target gene or compound, and the dose when applicable.

      This is useful for:
      - Grouping cells by perturbation
      - Comparing effects across perturbations
      - Ensuring consistent identifiers across datasets

      For example, a row with:
      - `perturbation_type = "CRISPRi"`
      - `gene_perturbed = "TP53"`

      Will yield:  
      `perturbation_id = "TP53_CRISPRi"`

      A drug-based example:
      - `perturbation_type = "drug"`
      - `drug = "bortezomib"`
      - `drug_dose = [("bortezomib", 0.1, "uM")]`

      Will yield:  
      `perturbation_id = "bortezomib_0.1(uM)"`
  - `is_control` 
    - True if drug = DMSO and gene_perturbed = NaN, or if gene_perturbed.split("+") contains only NT. else False.

- `var`
  - `gene_name`
    - Join with `genome.py`
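
To make the `perturbation_id` and `is_control` conventions above concrete, here is a minimal sketch in plain Python. It is not the `format_obs` implementation: the function names are hypothetical, the `+` separator for multiple drugs is an assumption, and `is_control` follows one possible reading of the rule (no targeted gene and no drug other than DMSO).

```python
import math


def _missing(value) -> bool:
    """Return True for None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def make_perturbation_id(perturbation_type, gene_perturbed=None, drug_dose=None) -> str:
    """Build an identifier following the two examples given in the text."""
    if perturbation_type == "drug":
        # drug_dose is a list of (name, dose, unit) tuples
        return "+".join(f"{name}_{dose:g}({unit})" for name, dose, unit in drug_dose)
    return f"{gene_perturbed}_{perturbation_type}"


def is_control(drug, gene_perturbed) -> bool:
    """Assumed rule: control = only NT guides (or none) and only DMSO (or no drug)."""
    no_gene = _missing(gene_perturbed) or set(gene_perturbed.split("+")) == {"NT"}
    no_drug = _missing(drug) or drug == "DMSO"
    return no_gene and no_drug


print(make_perturbation_id("CRISPRi", gene_perturbed="TP53"))
print(make_perturbation_id("drug", drug_dose=[("bortezomib", 0.1, "uM")]))
print(is_control("DMSO", "NT+NT+NT"), is_control("Tunicamycin", "NT+NT+NT"))
```

The two printed identifiers reproduce the examples from the list above (`TP53_CRISPRi` and `bortezomib_0.1(uM)`).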



#### i) obs

**Rename control and Parse guide identity**

In [6]:
for ds, pattern, replacement in [
    (adamson_pilot, r"\(mod\)", "NT"),
    (adamson_epistasis, "ctrl", "NT+NT+NT"), # those are triple guide control
    (adamson_perturb, r"\(mod\)", "NT"),
]:
    ds.obs["gene_perturbed"] = ds.obs["guide identity"].where(
        ~ds.obs["guide identity"].str.contains(pattern), replacement
    )

# simply remove the guide id
adamson_pilot.obs["gene_perturbed"] = adamson_pilot.obs["gene_perturbed"].str.split("_").apply(lambda x: x[0])

# remove guide id,
# change "IRE1" -> "ERN1" and "PERK" -> "EIF2AK3"
# make sure simple or double perturbation remains marked
# as +NT+NT or +NT because triple guide were used for that experiment
adamson_epistasis.obs["gene_perturbed"] = adamson_epistasis.obs["gene_perturbed"].str.split("_").apply(
    lambda row: row[0] if len(row) == 1 else (
        "+".join(
            [
                {"only": "NT", "IRE1": "ERN1", "PERK": "EIF2AK3"}.get(x, x)
                for x in row[:-1]
            ] + (["NT"] if len(row[:-1]) < 3 else [])
        )
    )
)

# simply remove the guide id
adamson_perturb.obs["gene_perturbed"] = adamson_perturb.obs["gene_perturbed"].str.split("_").apply(lambda x: x[0])

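
As an illustration of the epistasis parsing rule used in the cell above, here is a standalone sketch in plain Python. The vector IDs (`pXX1`, `pXX2`, `pXX3`) are hypothetical placeholders; only the alias mapping and the NT padding mirror the notebook code.

```python
ALIASES = {"only": "NT", "IRE1": "ERN1", "PERK": "EIF2AK3"}


def parse_epistasis_guide(guide: str) -> str:
    """Turn a guide identity into a '+'-joined list of targets, padded with NT."""
    parts = guide.split("_")
    if len(parts) == 1:
        # already renamed controls ("NT+NT+NT") contain no underscore
        return parts[0]
    # drop the trailing vector ID, then map aliases
    targets = [ALIASES.get(p, p) for p in parts[:-1]]
    if len(targets) < 3:
        targets.append("NT")
    return "+".join(targets)


for guide in ["IRE1_only_pXX1", "PERK_IRE1_pXX2", "ATF6_PERK_IRE1_pXX3", "NT+NT+NT"]:
    print(guide, "->", parse_epistasis_guide(guide))
```

A single perturbation written as `gene_only_vector` becomes `gene+NT+NT`, a double perturbation becomes `gene1+gene2+NT`, and a triple perturbation is left without padding.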


**Add chemical treatment for the UPR epistasis experiment**

We use the BAM tag number at the end of each barcode to identify the chemical treatment, cf [\[2\]](#adamson)

- BAM-1 tunicamycin
  - Paper Dosage: 4µg/mL 6hr
  - SMILES: 
    ```
    CC(C)CCCCCCCC\C=C\C(=O)N[C@@H]1[C@@H](O)[C@@H](O)[C@@H](C[C@@H](O)[C@H]2O[C@H]([C@H](O)[C@@H]2O)N3C=CC(=O)NC3=O)O[C@H]1O[C@@H]4O[C@@H](CO)[C@H](O)[C@@H](O)[C@@H]4NC(C)=O
    ```
    - Source: [tunicamycin (Sigma, T7765)](https://www.sigmaaldrich.com/US/en/product/sigma/t7765) (Assumed same provider as Thapsigargin (Sigma)).
    - There are multiple versions; notably, the fatty carbon chain has a length from 8 to 11. Here (CH2)8 is chosen, which matches the SMILES given in the Connectivity Map.
  
  Other info fetched from: [Tunicamycin Connectivity Map Broad Institute](https://clue.io/command?q=tunicamycin)
  - MoA:
    - GLCNAC phosphotransferase inhibitor
  - Target:
    - DPAGT1, GNPTAB
    - Remark: DPAGT1 is known to be targeted by tunicamycin (source: [DPAGT1 gene card](https://www.genecards.org/cgi-bin/carddisp.pl?gene=DPAGT1)). GNPTAB is reported only by the Connectivity Map and has not necessarily been experimentally tested.
  - ID:
    - Broad: BRD-K10573841
    - PubChem: 16220051

- BAM-2 thapsigargin (Sigma, T9033),
  - Paper Dosage: 100 nM 4hr
  - SMILES:
    ```
    [H][C@@]12C([C@H](OC([C@]3(O)C)=O)[C@]3(O)[C@@H](OC(CCC)=O)C[C@]2(C)OC(C)=O)=C(C)[C@H](OC(/C(C)=C(C)/[H])=O)[C@H]1OC(CCCCCCC)=O
    ```
    - Source: [thapsigargin (Sigma, T9033)](https://www.sigmaaldrich.com/US/en/product/sigma/t9033)
    - The SMILES provided by the Connectivity Map omits the stereochemistry. The Sigma product page includes it, so that SMILES is used instead.
  
  Other info fetched on: [Thapsigargin Connectivity Map Broad Institute](https://clue.io/command?q=thapsigargin)
  - MoA:
    - ATPase inhibitor
  - Target: 
    - ATP2A1, ATP2A2, ATP2A3
    - Remark: ATP2A1 has the strongest binding affinity  (Kd < 100 nM) (source: [thapsigargin LINCS HMS Database](https://web.archive.org/web/20250425124838/https://lincs.hms.harvard.edu/db/sm/10293-999/)) and it is also reported in the Connectivity Map. ATP2A2 and ATP2A3 are reported with moderate binding affinity:  (100 nM < Kd < 1µM)
  - ID
    - Broad: BRD-A62809825
    - PubChem: 5353993
  

- BAM-3 DMSO

Additionally, `is_control` should now be updated so that only cells treated with DMSO and carrying no gene perturbation are flagged as controls. 
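
The mapping from barcode suffix to treatment described above can be sketched as follows. The barcode sequence is a made-up example and the helper name is hypothetical; only the gemgroup-to-treatment mapping comes from the text.

```python
# Gemgroup suffix -> treatment, as stated in the paper note quoted earlier
GEMGROUP_TO_TREATMENT = {"1": "Tunicamycin", "2": "Thapsigargin", "3": "DMSO"}


def treatment_from_barcode(barcode: str) -> str:
    """Read the gemgroup after the last '-' and return the matching treatment."""
    gemgroup = barcode.rsplit("-", 1)[-1]
    # raises KeyError if the suffix is missing or unknown, which surfaces bad barcodes early
    return GEMGROUP_TO_TREATMENT[gemgroup]


print(treatment_from_barcode("ACGTACGTACGTAC-2"))
```

Failing loudly on an unknown suffix is a deliberate choice here: a silent default would hide barcodes that do not follow the expected `<sequence>-<gemgroup>` pattern.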

**Note:**

`format_obs` takes care of formatting:
- `drug_canonical_smiles`
  - Any valid SMILES is canonicalized with `rdkit`.
- `drug_dose`
  - Accepts an `int`, a `list[int]`, or a `list[tuple(str,int)]` for multiple doses / dose and unit.
  - As a rule of thumb, the unit is always given in `uM`.
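
Because doses are expected in `uM`, concentrations reported by mass (µg/mL) or in nM must be converted first. The sketch below shows the arithmetic only. The function names are hypothetical, and the molar mass of 500 g/mol is an arbitrary illustrative value, not that of any compound used here.

```python
def ug_per_ml_to_um(conc_ug_per_ml: float, molar_mass_g_per_mol: float) -> float:
    """Convert a mass concentration in µg/mL to a molar concentration in µM."""
    grams_per_liter = conc_ug_per_ml * 1e-6 * 1e3  # µg -> g, then per mL -> per L
    mol_per_liter = grams_per_liter / molar_mass_g_per_mol
    return mol_per_liter * 1e6  # mol/L -> µmol/L


def nm_to_um(conc_nm: float) -> float:
    """Convert nM to µM."""
    return conc_nm / 1e3


# 4 µg/mL of a hypothetical 500 g/mol compound is about 8 µM
print(round(ug_per_ml_to_um(4, 500.0), 1))
# 100 nM is 0.1 µM
print(nm_to_um(100))
```

In the notebook, the molar mass is not hard-coded but computed from the SMILES with `rdkit`.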

In [7]:
def get_molar_mass(smiles: str) -> float:
    """
    Calculate the molar mass (molecular weight) of a molecule from its SMILES string.

    Args:
        smiles (str): The SMILES representation of the molecule.

    Returns:
        float: The molecular weight in g/mol.
    """
    mol = Chem.MolFromSmiles(smiles)
    return Descriptors.MolWt(mol)  # in g/mol


tunicamycin_smiles = (
    "CC(C)CCCCCCCC\\C=C\\C(=O)N[C@@H]1[C@@H](O)[C@@H](O)[C@@H](C[C@@H](O)[C@H]2O[C@H]([C@H](O)[C@@H]2O)"
    "N3C=CC(=O)NC3=O)O[C@H]1O[C@@H]4O[C@@H](CO)[C@H](O)[C@@H](O)[C@@H]4NC(C)=O"
)
# 4 µg/mL → g/L -> mol/L -> umol/L = uM (micro molar)
tunicamycin_dose = np.round((4 * 1e-6 * 1e3 / get_molar_mass(tunicamycin_smiles)) * 1e6, decimals=1)

thapsigargin_smiles = (
    "[H][C@@]12C([C@H](OC([C@]3(O)C)=O)[C@]3(O)[C@@H](OC(CCC)=O)C[C@]2(C)OC(C)=O)=C(C)"
    "[C@H](OC(/C(C)=C(C)/[H])=O)[C@H]1OC(CCCCCCC)=O"
)
dmso_smiles = "CS(=O)C"

compound_metadata = {
    "1": {
        "drug": "Tunicamycin",
        "drug_dose": tunicamycin_dose,
        "drug_canonical_smiles": tunicamycin_smiles,
        "drug_moa_broad": "inhibitor/antagonist",
        "drug_moa_fine": "GLCNAC phosphotransferase inhibitor",
        "drug_targets": "DPAGT1, GNPTAB",
        "drug_pubchem_cid": "16220051"
    },
    "2": {
        "drug": "Thapsigargin",
        "drug_dose": 0.1,
        "drug_canonical_smiles": thapsigargin_smiles,
        "drug_moa_broad": "inhibitor/antagonist",
        "drug_moa_fine": "ATPase inhibitor",
        "drug_targets": "ATP2A1, ATP2A2, ATP2A3",
        "drug_pubchem_cid": "5353993"
    },
    "3": {
        "drug": "DMSO",
        "drug_dose": 0.0,
        "drug_canonical_smiles": dmso_smiles,
        "drug_moa_broad": None,
        "drug_moa_fine": None,
        "drug_targets": None,
        "drug_pubchem_cid": "679"
    }
}
compound_metadata = pd.DataFrame.from_dict(compound_metadata, orient="index")

# update compound metadata
adamson_epistasis.obs[
    ["drug", "drug_dose", "drug_canonical_smiles", "drug_moa_broad",
     "drug_moa_fine", "drug_targets", "drug_pubchem_cid"]
] = adamson_epistasis.obs_names.to_series().str.split("-").apply(lambda x: compound_metadata.loc[x[1]])

**Add the rest of the obs metadata**

In [8]:
default_metadata = {
    "tissue": "cell line",
    "cell_type": "lymphoblast",
    "cell_line": "K562",
    "disease": "chronic myelogenous leukemia",
    "assay": "10x 3' v1",
    "organism": "Homo sapiens",
    "gene_perturbation_type": "CRISPRi",
    "drug_dose_unit": "uM",
}
extra_dtype_map = {
    # Basic cell and read metadata
    'cell_barcode': str,               # Unique identifier per cell
    'guide identity': 'category',      # Guide RNA identifier (e.g., sgRNA ID)
    'read count': int,                 # Raw read count for this cell
    'UMI count': int,                  # Unique Molecular Identifier count
    'coverage': float,                 # Sequencing depth or gene coverage
    'good coverage': bool,             # QC filter (e.g., sufficient reads per cell)
    'number of cells': int,            # Count aggregated per perturbation (optional per-cell)
    'experiment': 'category',          # Experiment/batch label (e.g., plate ID)
}

# Apply to each AnnData object
for ds in [adamson_perturb, adamson_epistasis, adamson_pilot]:
    format_obs(ds, default_metadata=default_metadata, extra_dtype_map=extra_dtype_map)

#### ii) var

**Add gene_name**

In [9]:
%%bash
cd ../../ # cd to root of the project
python fine_tune/scripts/genome.py

Found local copy of fine_tune/datasets/genome/gencode.v32.primary_assembly.annotation.csv


In [10]:
gene_name = pd.read_csv(
    data_path.joinpath("genome/gencode.v32.primary_assembly.annotation.csv"),
    index_col="ensembl_id"
)

In [11]:
adamson_pilot = format_var(adamson_pilot, gene_name)
adamson_epistasis = format_var(adamson_epistasis, gene_name)
adamson_perturb = format_var(adamson_perturb, gene_name)



### c) Save Adamson

In [12]:
adamson_pilot.write_h5ad(data_path.joinpath("anndata/adamson_pilot_GSM2406675_10X001.h5ad"))
adamson_epistasis.write_h5ad(data_path.joinpath("anndata/adamson_epistasis_GSM2406677_10X005.h5ad"))
adamson_perturb.write_h5ad(data_path.joinpath("anndata/adamson_perturb_GSM2406681_10X010.h5ad"))

## 3) Norman

In [None]:
# Define paths
norman_dir = data_path.joinpath("raw/raw_norman")

# Load gene, barcode names, cell identities
_genes = pd.read_csv(norman_dir / "genes.tsv.gz", header=None, sep="\t")
barcodes = pd.read_csv(norman_dir / "barcodes.tsv.gz", header=None)
cell_identities = pd.read_csv(norman_dir / "cell_identities.csv.gz", index_col=0)

# Load matrix and filter out barcodes with no information:
mask = barcodes.isin(cell_identities.index).to_numpy().ravel()
matrix = mmread(norman_dir.joinpath("matrix.mtx.gz")).tocsr().T[mask, :]

# retain solely ensembl_id and not gene_name as we use custom gene_name
genes = _genes.rename(columns={0: "ensembl_id"})[["ensembl_id"]].set_index("ensembl_id")

# Build AnnData
adata_norman = ad.AnnData(
    X=matrix,
    obs=cell_identities,
    var=genes
)
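
The cell above passes `cell_identities` directly as `obs`, which assumes that its rows are in the same order as the barcodes kept by `mask`. If that assumption is in doubt, the metadata can be reordered explicitly. The sketch below illustrates the idea with plain Python containers; the barcodes and identities are invented.

```python
def align_obs_to_matrix(barcodes, identities):
    """Keep barcodes that have an identity and return metadata in matrix row order.

    barcodes: list of barcodes, in the row order of the count matrix
    identities: dict mapping barcode -> metadata (any order)
    """
    mask = [bc in identities for bc in barcodes]
    kept = [bc for bc, keep in zip(barcodes, mask) if keep]
    obs_rows = [identities[bc] for bc in kept]
    return mask, kept, obs_rows


# Invented example: identities are listed in a different order than the matrix rows
barcodes = ["BC1-1", "BC2-1", "BC3-1", "BC4-1"]
identities = {"BC4-1": "GENEA_GENEB", "BC1-1": "NegCtrl0_NegCtrl0"}

mask, kept, obs_rows = align_obs_to_matrix(barcodes, identities)
print(mask, kept, obs_rows)
```

With pandas, the equivalent reordering would be `cell_identities.loc[kept]`, provided the index of `cell_identities` has no duplicated barcodes.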

### a) Retrieve correct perturbation

The supplementary information provided with the paper is needed to verify that every `guide_identity` belongs to the expected 287 perturbations of the Norman dataset.

- Table S7: transcriptional readouts for the 287 perturbations

It cannot be retrieved with `curl` or `wget` (CLI-based approaches) or with `pooch`, `urllib`, or `requests` (Python-based approaches) because the server returns an HTTP 403 error.

To work around this, download the file manually on your local computer and then forward it to the server:

- Click on the link to download [Table S7](https://pmc.ncbi.nlm.nih.gov/articles/instance/6746554/bin/NIHMS1045467-supplement-Table_S7.xlsx)
- Then forward it to your server using your preferred method.
  Here we use `scp`, whose syntax is similar to `ssh user@server-name`:
  - Syntax: 
    ```bash
    scp path-to-table-S7 user@server-name:output-path-to-save-table-S7
    ```
  - Actual Example:
    ```bash
    scp Downloads/NIHMS1045467-supplement-Table_S7.xlsx hhakem@hhakem-cellarium-t4:projects/cellarium-ml/fine_tune/datasets/raw/raw_norman
    ```

#### i) Process Table S7

Let's verify that the mapping between target and `ensembl_id` agrees with the genome assembly annotation used to name each `ensembl_id`.

In [None]:
tableS7 = pd.read_excel(
    norman_dir.joinpath("NIHMS1045467-supplement-Table_S7.xlsx"),
    header=21, # the top 21 rows are metadata
    index_col=0
).rename_axis("guide_identity")


gene_name = pd.read_csv(
    data_path.joinpath("genome/gencode.v32.primary_assembly.annotation.csv"),
    index_col="ensembl_id"
)

In [None]:
map_target_id = tableS7.reset_index()[["first_target", "first_id", "second_target", "second_id"]]
map_target_id = pd.concat(
    [map_target_id[["first_target", "first_id"]].rename(
        columns={"first_target": "Table S7", "first_id": "ensembl_id"}),
     map_target_id[["second_target", "second_id"]].rename(
         columns={"second_target": "Table S7", "second_id": "ensembl_id"})]
).drop_duplicates().sort_values(by="Table S7")

# filter out NegCtrl
map_target_id = map_target_id[~map_target_id["Table S7"].str.contains("NegCtrl")]

if map_target_id.isna().any(axis=1).any():
    raise ValueError(
        "There are invalid target / ensembl_id pairs. See below:\n"
        f"{map_target_id[map_target_id.isna().any(axis=1)]}"
    )
else:
    map_target_id = map_target_id.set_index("ensembl_id")
    map_target_id["Assembly"] = gene_name.loc[map_target_id.index]["symbol"]
    # rows where the Table S7 name differs from the assembly symbol
    mismatch = map_target_id["Table S7"] != map_target_id["Assembly"]
    print(
        f"There are {mismatch.sum()} genes that differ between Table S7 and the genome assembly. "
        "Those are:\n"
        f"{map_target_id[mismatch]}"
    )

For these mismatches, Table S7 uses gene names that are no longer in use:

| Ensembl_id      | Table S7    | Assembly | Gene Card Link                                                                                                                |
|-----------------|-------------|----------|-------------------------------------------------------------------------------------------------------------------------------|
| ENSG00000156030 | ELMSAN1     | MIDEAS   | https://www.genecards.org/cgi-bin/carddisp.pl?gene=MIDEAS                                                                     |
| ENSG00000143674 | RP5-862P8.2 | MAP3K21  | https://www.genecards.org/cgi-bin/carddisp.pl?gene=MAP3K21 / https://thebiogrid.org/124089/summary/homo-sapiens/kiaa1804.html |

These are then renamed:

In [None]:
tableS7 = tableS7.reset_index().map(
    lambda x: x.replace("ELMSAN1", "MIDEAS").replace("RP5-862P8.2", "MAP3K21") if isinstance(x, str) else x
).set_index("guide_identity")

#### ii) Processing target_identity

`guide_identity` has the form `gene1_gene2__gene1_gene2`: two pairs of targeted genes (or controls), separated by a double underscore. Most of the time both pairs are identical, but to make sure:
- split at `__`
- retrieve the unique pairs

These guide identities can then be compared to **Table S7**.
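
The split-and-deduplicate step can be illustrated on plain strings. The gene names below are placeholders (`GENEA`, `GENEB`) and the helper name is hypothetical; `dict.fromkeys` is used because it removes duplicates while preserving order.

```python
def unique_pairs(guide_identity: str) -> str:
    """Split at '__' and keep each distinct pair once, in order of appearance."""
    pairs = guide_identity.split("__")
    return "__".join(dict.fromkeys(pairs))


# identical pairs collapse to a single pair
print(unique_pairs("GENEA_GENEB__GENEA_GENEB"))
# pairs that differ, even by a suffix, are both kept
print(unique_pairs("GENEA_GENEB__GENEA_GENEB_2"))
```

Any identity that still contains `__` after this step therefore signals that its two halves were not strictly identical.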

In [None]:
unique_guide = (adata_norman.obs["guide_identity"]
 .drop_duplicates()
 # `dict.fromkeys()` ~ `set` but preserve order
 .str.split("__").apply(lambda row: "__".join(list(dict.fromkeys(row))))
 .drop_duplicates()
 .to_frame()
 .reset_index(drop=True)
)

unique_guide[["first_target", "second_target"]] = unique_guide["guide_identity"].str.split("_", n=1, expand=True)
unique_guide[["first_target", "second_target"]] = unique_guide[["first_target", "second_target"]].where(
    unique_guide["guide_identity"].str.count("_") == 1,
    unique_guide[["guide_identity"]].values.repeat(2, axis=1),
    axis=1
)

# comparison to table S7
unique_gene = pd.concat([unique_guide["first_target"], unique_guide["second_target"]]).drop_duplicates()
unique_geneS7 = pd.concat([tableS7["first_target"], tableS7["second_target"]]).drop_duplicates()
print("Non matching gene target in the norman `guide_identity` col compared to TableS7:\n"\
      f"{unique_gene[~unique_gene.isin(unique_geneS7)]}\n\n"
      "Non matching gene target in the TableS7 compared to `guide_identity` col:\n"\
      f"{unique_geneS7[~unique_geneS7.isin(unique_gene)]}\n"
      )

First: 
- Some guides are still separated by a double underscore. They were not collapsed because, after the `__` split, the left part differed from the right part due to a trailing **tag** such as `_1` or `_2`. This is most likely a formatting error. 
  - These trailing tags should be removed.
- `no_reads_found` indicates cells whose identity couldn't be retrieved. Those should be filtered out. 

Second, there is a mismatch due to the use of gene aliases: 


| guide_identity | table S7 | Gene Card Link                                             |
|----------------|----------|------------------------------------------------------------|
| C19orf26       | CBARP    | https://www.genecards.org/cgi-bin/carddisp.pl?gene=CBARP   |
| C3orf72        | FOXL2NB  | https://www.genecards.org/cgi-bin/carddisp.pl?gene=FOXL2NB |
| ELMSAN1        | MIDEAS   | https://www.genecards.org/cgi-bin/carddisp.pl?gene=MIDEAS  |
| KIAA1804       | MAP3K21  | https://www.genecards.org/cgi-bin/carddisp.pl?gene=MAP3K21 |

[RHOXF2](https://www.genecards.org/cgi-bin/carddisp.pl?gene=RHOXF2) and [RHOXF2B](https://www.genecards.org/cgi-bin/carddisp.pl?gene=RHOXF2B) are however two distinct genes. Since the paper mentions **RHOXF2B**, it will be used as the actual target.

Putting everything together gives:


In [None]:
adata_norman = adata_norman[
    ~adata_norman.obs["guide_identity"].str.contains("no_reads_found")
].copy()
replacements = {
    "RHOXF2": "RHOXF2B",
    "C19orf26" : "CBARP",
    "C3orf72" : "FOXL2NB",
    "ELMSAN1": "MIDEAS",
    "KIAA1804": "MAP3K21",
}
adata_norman.obs["guide_identity"] = (
    adata_norman.obs["guide_identity"]
    .str.replace(r"_\d+$", "", regex=True) # remove ending tag
    .map(lambda x: reduce(lambda s, kv: s.replace(*kv), replacements.items(), x)) # replace gene aliases
)

Re-running the comparison cell above with the updated `adata_norman` leads to no more mismatches, confirming that `guide_identity` has been processed correctly.
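
One caveat with plain substring replacement is that `RHOXF2` is a prefix of `RHOXF2B`, so applying the replacement twice would produce `RHOXF2BB`. The sketch below is an alternative (not the notebook's method) that renames whole tokens only, which makes the cleaning step safe to re-run. The helper name and `GENEA` are hypothetical; the alias table is the one given above.

```python
import re

ALIASES = {
    "RHOXF2": "RHOXF2B",
    "C19orf26": "CBARP",
    "C3orf72": "FOXL2NB",
    "ELMSAN1": "MIDEAS",
    "KIAA1804": "MAP3K21",
}


def normalize_guide(guide_identity: str) -> str:
    """Remove a trailing numeric tag, then rename aliases token by token."""
    without_tag = re.sub(r"_\d+$", "", guide_identity)
    pairs = without_tag.split("__")
    renamed = [
        "_".join(ALIASES.get(token, token) for token in pair.split("_"))
        for pair in pairs
    ]
    return "__".join(renamed)


once = normalize_guide("RHOXF2_GENEA__RHOXF2_GENEA_2")
twice = normalize_guide(once)
print(once, twice == once)
```

Because each token is looked up exactly, an identity that already contains `RHOXF2B` is left unchanged on a second pass.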

### b) Formatting Norman

Norman is processed in the same way as [Adamson](#b-formatting-adamson).

#### i) obs

**Rename control and Parse guide identity**

In [None]:
adata_norman.obs["gene_perturbed"] = (
    adata_norman
    .obs["guide_identity"]
    .str.split("__", expand=True)[0]
    .str.replace(r"NegCtrl\d+", "NT", regex=True)
    .str.replace("_", "+")
)

In [None]:
adata_norman.obs["tissue"] = "cell line"
adata_norman.obs["cell_type"] = "lymphoblast"
adata_norman.obs["cell_line"] = "K562"
adata_norman.obs["disease"] = "chronic myelogenous leukemia"
adata_norman.obs["perturbation_type"] = "CRISPRa"
adata_norman.obs["compound"] = None
adata_norman.obs["compound_target"] = None
adata_norman.obs["compound_moa"] = None
adata_norman.obs["compound_dose_uM"] = None
adata_norman.obs["organism"] = "Homo sapiens"
adata_norman.obs["assay"] = "10x 3' v2"