<center><h1> GEARS Dataset </h1></center>

## Context

This notebook loads the datasets used in GEARS and in the benchmarking paper [\[1\]](#benchmark).

In particular, we will be using the following [datasets](#datasets).

## Table of Contents

- [Raw Data](#raw-pipeline)
- [Pre-processed Data](#pre-processed-pipeline)

## Datasets

### Raw Data

- [Adamson](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90546): 
  - ht<span>tps://</span>www.ncbi.nlm.nih.gov/geo/download/?acc=GSE90546&format=file
  - Assay type: Perturb-seq, which combines droplet-based scRNA-seq with a strategy for barcoding CRISPR-mediated perturbations.
  - Single CRISPRi perturbation.
  - Overall: 3 different pooled CRISPR screening experiments were conducted via Perturb-seq.
- [Replogle](https://plus.figshare.com/articles/dataset/_Mapping_information-rich_genotype-phenotype_landscapes_with_genome-scale_Perturb-seq_Replogle_et_al_2022_processed_Perturb-seq_datasets/20029387):
  - Single cell raw **K562 genome scale day 8 post-transduction** ht<span>tps://</span>plus.figshare.com/ndownloader/files/35775507
  - Single cell raw **K562 essential scale day 6 post-transduction** ht<span>tps://</span>plus.figshare.com/ndownloader/files/35773219
  - Single cell raw **RPE1 essential scale day 7 post-transduction** ht<span>tps://</span>plus.figshare.com/ndownloader/files/35775606
- [Norman](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133344):
  - raw_barcodes.tsv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fbarcodes.tsv.gz
  - raw_cell_identities.csv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fcell%5Fidentities.csv.gz
  - raw_genes.tsv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fgenes.tsv.gz
  - raw_matrix.mtx.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fmatrix.mtx.gz

### Pre-processed Data
For completeness, here are the datasets used by GEARS or scFoundation and the benchmarking paper [\[1\]](#benchmark).

**GEARS**:

- Single-gene perturbation:
  - [Adamson](https://dataverse.harvard.edu/file.xhtml?fileId=6154417&version=3.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/6154417
  - [Replogle K562](https://dataverse.harvard.edu/file.xhtml?fileId=7458695&version=6.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/7458695
  - [Replogle RPE1](https://dataverse.harvard.edu/file.xhtml?fileId=7458694&version=6.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/7458694

- Multiple-gene perturbation
  - [Norman](https://dataverse.harvard.edu/file.xhtml?fileId=6154020&version=3.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/6154020
  - 131 two-gene perturbations

**scFoundation**:

Another preprocessing strategy for the Norman et al. dataset
- [Norman scFoundation](https://figshare.com/articles/dataset/scFoundation_Large_Scale_Foundation_Model_on_Single-cell_Transcriptomics_-_processed_datasets/24049200?file=44477939): ht<span>tps://</span>figshare.com/ndownloader/files/44477939
## References

1. Ahlmann-Eltze, Huber, and Anders, “Deep Learning-Based Predictions of Gene Perturbation Effects Do Not yet Outperform Simple Linear Baselines.”<a id="benchmark"></a>


## 1) Download Datasets

The bash cell below is run from the notebook, but you can also run the same command from a shell. Files are downloaded according to the YAML config in `fine_tune/datasets/config/download.yaml`.

In [None]:
%%bash
cd ../../ # cd to root of the project
python fine_tune/scripts/download/cli.py --config fine_tune/datasets/config/download.yaml

## 2) Create AnnData From Raw

In [2]:
# All imports here
from pathlib import Path

import anndata as ad
import numpy as np
import pandas as pd
import pooch

data_path = Path("./../datasets")


%reload_ext autoreload
%autoreload 2


### a) Adamson

**Note on Adamson**

| Accession              | Experiment                    | Model                               | Protocol                                                                                    | Additional Protocol                                                                                                                          | sgRNA Count                                                                               | Cell Count | Assay Type           |
| ---------------------- | ------------------------------ | ----------------------------------- | ------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ---------- | -------------------- |
| **GSM2406675\_10X001** | **Pilot Experiment**           | K562 cells with dCas9-KRAB (cBA010) | Individual transduction → pooled after 3 days → selection + 5 days combined growth          | Growth in the presence of puromycin (3µg/mL)                                                                                                                                            | 8 distinct GBCs (1 control)                                                               | 5,768      | 10x 3' v1  |
| **GSM2406677\_10X005** | **UPR Epistasis Experiment**   | K562 cells with dCas9-KRAB (cBA010) | Individual transduction → pooled after 3 days → selection + 2 days combined growth          | To limit heterogenous effects of cell microenvironments caused by cell settling, the sorted cells were grown with continuous agitation on an orbital shaker. <br> <br> Pharmacological treatment: <br> • Thapsigargin (ER Ca²⁺ pump inhibitor) (100 nM/mL for 4 hr.)<br> • Tunicamycin (N-glycosylation inhibitor) (4 μg/mL for 6h) <br> • DMSO (control) for 6hr | 7 GBCs (3-guide vectors of 3 genes — single, double, triple) + 2 triple negative controls | 15,006     | 10x 3' v1            |
| **GSM2406681\_10X010** | **UPR Perturb-seq Experiment** | K562 cells with dCas9-KRAB (cBA011) | Individual lentivirus production → pooled transduction → selection + 7 days combined growth | To limit heterogenous effects of cell microenvironments caused by cell settling, the sorted cells were grown with continuous agitation on an orbital shaker. <br><br> Sequencing across two separate runs totaling 10 lanes.                                                                                                                                            | 93 GBCs (2 control GBCs; targets cover 82 genes)                                          | 65,337     | 10x 3' v1 (10 lanes) |

**Remarks:**
- UPR: unfolded protein response.
- The assay is assumed to be v1 of the Chromium Single Cell 3′ Solution. The version is not specified, but the experiments date from 2016, the year the technology was launched.
- When the identity of a Cas9-targeting single guide RNA (sgRNA) could not be uniquely determined, the cell was labeled either:
  - "multiplet" 
  - "NaN"

In [3]:
adamson_dir = data_path.joinpath("raw/raw_adamson")
mtx_files = sorted(adamson_dir.glob("*_matrix.mtx.txt.gz"))

all_adatas = {}

for mtx_file in mtx_files:
    prefix = mtx_file.name.split("_matrix")[0]

    # Paths
    barcodes_file = adamson_dir / f"{prefix}_barcodes.tsv.gz"
    genes_file = adamson_dir / f"{prefix}_genes.tsv.gz"
    identities_file = adamson_dir / f"{prefix}_cell_identities.csv.gz"

    # Load metadata
    genes = pd.read_csv(genes_file, header=None, sep="\t")
    barcodes = pd.read_csv(barcodes_file, header=None)
    cell_identities = pd.read_csv(identities_file, index_col=0)

    # Create AnnData
    adata = ad.io.read_mtx(mtx_file).T
    adata.var_names = genes[0].values
    adata.obs_names = barcodes[0].values
    # adata.obs = cell_identities.loc[adata.obs_names] fails: there are more barcodes than cell identities.
    adata.obs = pd.DataFrame(index=adata.obs_names)
    adata.obs = adata.obs.join(cell_identities)  # left join: barcodes without an identity get NaN

    # Optional: keep track of the experiment
    adata.obs["experiment"] = prefix
    all_adatas[prefix] = adata

all_adatas

{'GSM2406675_10X001': AnnData object with n_obs × n_vars = 5768 × 35635
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells', 'experiment',
 'GSM2406677_10X005': AnnData object with n_obs × n_vars = 15006 × 32738
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells', 'experiment',
 'GSM2406681_10X010': AnnData object with n_obs × n_vars = 65337 × 32738
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells', 'experiment'}
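
The join used above can be illustrated on a toy table. This is a minimal sketch, assuming made-up barcodes and guide labels (none of these values come from the dataset): a left join on the barcode index keeps every barcode of the matrix and leaves NaN where no identity was recorded.

```python
import pandas as pd

# Hypothetical barcodes, standing in for the rows of the count matrix
barcodes = pd.Index(["AAAC-1", "AAAG-1", "AACT-1"])

# Hypothetical identities table: it covers only a subset of the barcodes
cell_identities = pd.DataFrame(
    {"guide identity": ["ATF6_pX001", "*"]},
    index=pd.Index(["AAAC-1", "AACT-1"]),
)

# DataFrame.join defaults to a left join on the index:
# every barcode is kept, and unmatched ones receive NaN
obs = pd.DataFrame(index=barcodes).join(cell_identities)
n_missing = int(obs["guide identity"].isna().sum())
```

Here `obs` has one row per barcode, and `n_missing` counts the barcodes for which no guide identity was found.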

#### i) Retrieve correct perturbation

Let's check the number of unique guide barcodes, which should correspond to the number of unique perturbations (except that some barcodes target the same gene).

In [7]:
print(
    "Number of unique 'guide barcode' detected:\n" +
    "\n".join([
        f"{exp}: {len(adata.obs['guide identity'].unique())}, "
        f"expected {gbc}, of which {control} controls. "
        for (exp, adata), (gbc, control) in zip(
            all_adatas.items(),
            zip([8, 9, 93], [1, 2, 2])  # taken from the table above, sgRNA Count column
        )]) +
    "\nPotentially NaN or '*' for non identifiable or multiplets should be removed."
)

Number of unique 'guide barcode' detected:
GSM2406675_10X001: 10, expected 8, of which 1 controls. 
GSM2406677_10X005: 21, expected 9, of which 2 controls. 
GSM2406681_10X010: 115, expected 93, of which 2 controls. 
Potentially NaN or '*' for non identifiable or multiplets should be removed.


**Filtering required:**
- GSM2406675_10X001: 
  - Remove NaN and '*'
- GSM2406677_10X005: 
  - Remove NaN and '*'
  - Targeted genes should be a single, double or triple combination of:
    - ATF6
    - ERN1 (IRE1α)
    - EIF2AK3 (PERK): PERK must be renamed to EIF2AK3 to match the actual gene name
- GSM2406681_10X010: 
  - Remove NaN and '*'
  - Use "Table S1. Protospacer Sequences of sgRNAs" to identify the valid guides
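
The rules above can be sketched on plain strings, independently of AnnData. This is only an illustration: the helper name `keep_guide` and the example labels are assumptions introduced here, not names or values taken from the dataset.

```python
import math
import re

# Labels that cannot be used: missing values or multiplets marked with '*'
_INVALID = re.compile(r"nan|\*")
# Targets expected in the UPR epistasis experiment, as written in the raw labels
_EPISTASIS = re.compile(r"ctrl|IRE1|PERK|ATF6")


def keep_guide(label, epistasis=False):
    """Return True if a guide label should be kept (hypothetical helper)."""
    if label is None or (isinstance(label, float) and math.isnan(label)):
        return False
    text = str(label)
    if _INVALID.search(text):
        return False
    # For the epistasis experiment, additionally require a known target
    return bool(_EPISTASIS.search(text)) if epistasis else True


# Hypothetical labels, for illustration only
labels = ["ATF6_PERK_pX150", "*", float("nan"), "3x_neg_ctrl_pX144", "GATA1_pX011"]
kept_all = [label for label in labels if keep_guide(label)]
kept_epistasis = [label for label in labels if keep_guide(label, epistasis=True)]
```

The patterns are written without a capturing group, which is also what keeps `pandas.Series.str.contains` from warning about match groups when the same expressions are used on a Series.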

In [221]:
adamson_pilot = all_adatas["GSM2406675_10X001"]
adamson_epistasis = all_adatas["GSM2406677_10X005"]
adamson_perturb = all_adatas["GSM2406681_10X010"]

In [222]:
# Pilot experiment: drop cells whose guide identity is NaN or contains '*'
# (no capturing group in the pattern, and .copy() to get a real AnnData instead of a view)
adamson_pilot = adamson_pilot[
    ~adamson_pilot.obs["guide identity"].astype(str).str.contains(r"nan|\*")
].copy()
# UPR epistasis experiment: keep only guides containing any of ctrl, IRE1, PERK, ATF6
adamson_epistasis = adamson_epistasis[
    adamson_epistasis.obs["guide identity"].astype(str).str.contains("ctrl|IRE1|PERK|ATF6")
].copy()

# UPR Perturb-seq experiment: retrieve Table S1, keep the guides it lists plus the 'mod' controls
file = pooch.retrieve(
    url="https://ars.els-cdn.com/content/image/1-s2.0-S0092867416316609-mmc1.xlsx",
    known_hash="sha256:9b5935cb15ba2f6d60d3017832de2918e7d4f172db6f202be7999cba5feea82b"
)
protospace_df = pd.read_excel(file, header=1)
# Some gene names list aliases separated by '/'.
# Since the naming convention used in the dataset is not known, such rows are duplicated, one per alias.
protospace_df["Gene"] = protospace_df["Gene"].apply(lambda x: x.split("/") if isinstance(x, str) else x)
protospace_df = protospace_df.explode("Gene").reset_index(drop=True)

adamson_perturb = adamson_perturb[(
    # keep guide identities listed in the protospacer table
    (pert := adamson_perturb.obs["guide identity"]).isin(
        (protospace_df["Gene"].dropna() + "_" + protospace_df["Perturb-seq_Vector_ID"]).dropna()) |
    # keep controls with a large occurrence (more than 100 cells); NaN labels count as non-matching
    (pert.str.contains("mod", na=False) & pert.isin(pert.value_counts()[pert.value_counts() > 100].index))
)].copy()



#### ii) Formatting Adamson

Add more metadata:
- `obs`
  - `gene_perturbed`
    - parse `guide identity`:
      - **Renaming targets:**
        - GSM2406675_10X001
          - The control is written with 'mod' and should be renamed to 'NT' (non-targeting)
        - GSM2406677_10X005
          - IRE1 -> ERN1 to match the actual gene name
          - PERK -> EIF2AK3 to match the actual gene name
          - Single perturbations: replace 'only' with NT and pad to gene+NT+NT, since triple-guide vectors are used
          - Double or triple perturbations should be joined with '+'
          - Controls are written with 'ctrl'. Since they are triple guides, they should be marked as NT+NT+NT
        - GSM2406681_10X010
          - Controls are written with 'ctrl' and should be renamed to 'NT'
  - `tissue` 
    - cell line
  - `cell_type`
    - lymphoblasts
  - `cell_line` 
    - K562
  - `disease` 
    - chronic myelogenous leukemia
  - `perturbation_type` 
    - CRISPRi, CRISPRa, CRISPRko, compound (or a '+'-joined combination of those)
  - `compound` 
    - (name of the compound, or a '+'-joined combination of names; control = DMSO)
  - `compound_target` 
    - (name of the gene targeted by the compound, NaN if unknown)
  - `compound_moa` 
    - (mechanism of action of the compound, NaN if unknown)
  - `compound_dose_uM` 
    - (dose of the compound in µM)
  - `gene_perturbed` 
    - (name of the perturbed gene, or a '+'-joined combination of names; control = NT; a single perturbation delivered with a double or triple guide is written gene+NT or gene+NT+NT)
  - `perturbation_id` 
    - `gene_perturbed` + "_" + `perturbation_type` for CRISPR perturbations, otherwise `compound` + "_" + `compound_dose_uM` with "uM" appended (function still to be written)
  - `is_control` 
    - True if `compound` is DMSO and `gene_perturbed` is NaN, or if `gene_perturbed.split("+")` contains only NT; False otherwise (function still to be written)
  - `organism` 
    - Homo sapiens
  - `assay` 
    - 10x 3' v1
  - `cell_barcode`
    - Reindex the obs table
- `var`
  - `gene_name`
    - Join with `genome.py`



##### (1) obs

**Rename control and Parse guide identity**

In [223]:
for ds, pattern, replacement in [
    (adamson_pilot, r"\(mod\)", "NT"),
    (adamson_epistasis, "ctrl", "NT+NT+NT"), # those are triple guide control
    (adamson_perturb, "ctrl", "NT"),
]:
    ds.obs["gene_perturbed"] = ds.obs["guide identity"].where(
        ~ds.obs["guide identity"].str.contains(pattern), replacement
    )

# simply remove the guide id
adamson_pilot.obs["gene_perturbed"] = adamson_pilot.obs["gene_perturbed"].str.split("_").apply(lambda x: x[0])

# remove the guide id,
# rename "IRE1" -> "ERN1" and "PERK" -> "EIF2AK3",
# and pad single or double perturbations with +NT+NT or +NT,
# because triple-guide vectors were used for that experiment
adamson_epistasis.obs["gene_perturbed"] = adamson_epistasis.obs["gene_perturbed"].str.split("_").apply(
    lambda row: row[0] if len(row) == 1 else (
        "+".join(
            [
                {"only": "NT", "IRE1": "ERN1", "PERK": "EIF2AK3"}.get(x, x)
                for x in row[:-1]
            ] + (["NT"] if len(row[:-1]) < 3 else [])
        )
    )
)

# simply remove the guide id
adamson_perturb.obs["gene_perturbed"] = adamson_perturb.obs["gene_perturbed"].str.split("_").apply(lambda x: x[0])



**Add other metadata**

In [None]:
def extend_obs(adata):
    obs = adata.obs.copy()

    gene_split = obs["gene_perturbed"].str.split("+")
    gene_counts = obs["gene_perturbed"].str.count(r"\+")

    obs["tissue"] = "cell line"
    obs["cell_type"] = "lymphoblasts"
    obs["cell_line"] = "K562"
    obs["disease"] = "chronic myelogenous leukemia"
    obs["organism"] = "Homo sapiens"
    obs["assay"] = "10x 3' v1"
    obs["compound"] = np.nan
    obs["compound_target"] = np.nan
    obs["compound_moa"] = np.nan

    # Compute perturbation_type: "CRISPRi" + ("+CRISPRi" * number of additional perturbations)
    obs["perturbation_type"] = ["+".join(["CRISPRi"] * (c + 1)) for c in gene_counts]

    # Compute is_control: True if only 'NT', False otherwise
    obs["is_control"] = gene_split.apply(lambda x: set(x) == {"NT"})

    # Save original barcode
    obs = obs.reset_index().rename(columns={"index": "original_barcode"})

    adata.obs = obs
    return adata

# Apply to each AnnData object
for ds in [adamson_perturb, adamson_epistasis, adamson_pilot]:
    extend_obs(ds)



In [None]:
%%bash
cd ../../ # cd to root of the project
python fine_tune/scripts/genome.py

In [12]:
gene_name = pd.read_csv(
    data_path.joinpath("genome/gencode.v32.primary_assembly.annotation.csv"),
    index_col="ensembl_id"
)

#### Parse target and plasmid from guide identity

### b) Norman

In [None]:
# Define paths
norman_dir = data_path.joinpath("raw/raw_norman")

# Load gene and barcode names
genes = pd.read_csv(norman_dir / "genes.tsv.gz", header=None, sep="\t")
barcodes = pd.read_csv(norman_dir / "barcodes.tsv.gz", header=None)

# Load cell identities
cell_identities = pd.read_csv(norman_dir / "cell_identities.csv.gz", index_col=0)

# Build AnnData
adata_norman = ad.io.read_mtx(norman_dir / "matrix.mtx.gz").T
adata_norman.var_names = genes[0].values
adata_norman.obs_names = barcodes[0].values
# adata_norman.obs = cell_identities.loc[adata_norman.obs_names] fails: there are more barcodes than cell identities.
adata_norman.obs = pd.DataFrame(index=adata_norman.obs_names)
adata_norman.obs = adata_norman.obs.join(cell_identities)  # left join: barcodes without an identity get NaN