<center><h1> GEARS Dataset </h1></center>

## Context

This notebook loads the datasets used in GEARS and in the benchmarking paper [\[1\]](#benchmark).

In particular, we will be using the following [datasets](#datasets).

## Table of Contents

- [Raw Data](#raw-pipeline)
- [Pre-processed Data](#pre-processed-pipeline)

## Datasets

### Raw Data

- [Adamson](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90546): ht<span>tps://</span>www.ncbi.nlm.nih.gov/geo/download/?acc=GSE90546&format=file
- [Norman](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133344):
  - raw_barcodes.tsv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fbarcodes.tsv.gz
  - raw_cell_identities.csv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fcell%5Fidentities.csv.gz
  - raw_genes.tsv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fgenes.tsv.gz
  - raw_matrix.mtx.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fmatrix.mtx.gz
- [Replogle](https://plus.figshare.com/articles/dataset/_Mapping_information-rich_genotype-phenotype_landscapes_with_genome-scale_Perturb-seq_Replogle_et_al_2022_processed_Perturb-seq_datasets/20029387):
  - Single cell raw K562 genome scale, day 8 post-transduction: ht<span>tps://</span>plus.figshare.com/ndownloader/files/35775507
  - Single cell raw K562 essential scale, day 6 post-transduction: ht<span>tps://</span>plus.figshare.com/ndownloader/files/35773219
  - Single cell raw RPE1 essential scale, day 7 post-transduction: ht<span>tps://</span>plus.figshare.com/ndownloader/files/35775606

### Pre-processed Data

**GEARS**:

- Single-gene perturbation:
  - [Replogle RPE1](https://dataverse.harvard.edu/file.xhtml?fileId=7458694&version=6.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/7458694
  - [Replogle K562](https://dataverse.harvard.edu/file.xhtml?fileId=7458695&version=6.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/7458695
  - [Adamson](https://dataverse.harvard.edu/file.xhtml?fileId=6154417&version=3.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/6154417
- Multiple-gene perturbation:
  - [Norman](https://dataverse.harvard.edu/file.xhtml?fileId=6154020&version=3.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/6154020
  - 131 two-gene perturbations

**scFoundation**:

Another preprocessing strategy for the Norman et al. dataset:

- [Norman scFoundation](https://figshare.com/articles/dataset/scFoundation_Large_Scale_Foundation_Model_on_Single-cell_Transcriptomics_-_processed_datasets/24049200?file=44477939): ht<span>tps://</span>figshare.com/ndownloader/files/44477939

## References

1. Ahlmann-Eltze, Huber, and Anders, “Deep Learning-Based Predictions of Gene Perturbation Effects Do Not yet Outperform Simple Linear Baselines.”<a id="benchmark"></a>


In [1]:
# All imports here
from pathlib import Path

import anndata as ad
import pandas as pd
import scanpy as sc

from fine_tune.scripts import FileEntry, download_url, iter_download_url

data_path = Path("./../../datasets")


%load_ext autoreload
%autoreload 2

# Raw Data <a id="raw-pipeline"></a>


In [2]:
norman_url = (
    "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/"
    "suppl/GSE133344%5Fraw%5F{name_ext}"
)
file_map = {
    # "adamson": FileEntry(
    #     url="https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE90546&format=file",
    #     out_path=data_path / "raw",
    #     raw_path=data_path / ".tar" / "raw",
    #     # filename="adamson.tar",
    # ),
    "norman": {
        n: FileEntry(
            url=norman_url.format(name_ext=n),
            out_path=data_path / "raw",
            # "%5F" is a percent-encoded "_", so give this file a clean local name.
            filename="cell_identities.csv.gz" if n == "cell%5Fidentities.csv.gz" else None,
        )
        for n in [
            "barcodes.tsv.gz",
            "cell%5Fidentities.csv.gz",
            "genes.tsv.gz",
            "matrix.mtx.gz",
        ]
    },
}
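
The `filename` override above is needed because one URL fragment contains a percent-encoded underscore (`%5F`). As a rough sketch, the clean local name could also be derived generically with `urllib.parse.unquote`. The helper `local_name` is hypothetical and is not part of `fine_tune.scripts`.

```python
from urllib.parse import unquote

# Same URL template as in the cell above.
norman_url = (
    "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/"
    "suppl/GSE133344%5Fraw%5F{name_ext}"
)


def local_name(name_ext: str) -> str:
    """Hypothetical helper: decode percent-escapes to get a clean file name."""
    return unquote(name_ext)


names = ["barcodes.tsv.gz", "cell%5Fidentities.csv.gz", "genes.tsv.gz", "matrix.mtx.gz"]
for n in names:
    # Left: name to use on disk. Right: URL the file is fetched from.
    print(local_name(n), "<-", norman_url.format(name_ext=n))
```

Names without escapes pass through unchanged, so the same helper could be applied to every entry.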


In [3]:
iter_download_url(file_map=file_map, use_key_name=True)

Downloading barcodes.tsv.gz...
100%|██████████| 18.2M/18.2M [00:00<00:00, 51.4MiB/s]
Done!
Downloading cell_identities.csv.gz...
100%|██████████| 2.24M/2.24M [00:00<00:00, 11.9MiB/s]
Done!
Downloading genes.tsv.gz...
100%|██████████| 265k/265k [00:00<00:00, 2.47MiB/s]
Done!
Downloading matrix.mtx.gz...
100%|██████████| 1.44G/1.44G [00:41<00:00, 35.0MiB/s]
Done!


In [6]:
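# 10x-style genes.tsv files have no header row: column 1 is the gene ID, column 2 the gene name.
df = pd.read_csv(
    data_path / "adamson_raw" / "GSM2406675_10X001_genes.tsv.gz",
    sep="\t",
    compression="infer",
    header=None,
    names=["gene_id", "gene_name"],
)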
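
Before loading a gene table with pandas, it can help to check whether the file starts with a header row. The sketch below peeks at the first rows of a gzipped TSV using only the standard library. The helper `peek_tsv_gz` and the toy file contents are made up for illustration, and the assumption that 10x-style `genes.tsv.gz` files carry no header should be verified on the real file.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def peek_tsv_gz(path, n=3):
    """Return up to the first n rows of a gzipped TSV as lists of strings."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [row for _, row in zip(range(n), reader)]


# Toy file standing in for a genes.tsv.gz (made-up IDs and names).
toy_path = Path(tempfile.mkdtemp()) / "toy_genes.tsv.gz"
with gzip.open(toy_path, "wt") as handle:
    handle.write("ENSG00000000001\tGENE_A\nENSG00000000002\tGENE_B\n")

rows = peek_tsv_gz(toy_path)
print(rows)
```

If the first row already looks like data rather than column names, passing `header=None` to `pd.read_csv` keeps that row from being consumed as a header.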

In [15]:
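# 10x .mtx files are stored as genes × cells, so obs (rows) here are genes;
# transpose with `.T` to get the usual cells × genes layout.
adata = sc.read_mtx(data_path / "adamson_raw" / "GSM2406681_10X010_matrix.mtx.txt.gz")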

In [None]:
adata

AnnData object with n_obs × n_vars = 35635 × 5768

In [14]:
adata

AnnData object with n_obs × n_vars = 32738 × 15006

In [16]:
adata

AnnData object with n_obs × n_vars = 32738 × 65337

# Pre-processed Data <a id="pre-processed-pipeline"></a>

In [None]:
dataverse_url = "https://dataverse.harvard.edu/api/access/datafile/"
figshare_url = "https://figshare.com/ndownloader/files/"
file_map = {
    "adamson": {
        "server": dataverse_url,
        "file": "6154417"
    },
    "norman": {
        "server": dataverse_url,
        "file": "6154020"
    },
    "replogle_rpe1": {
        "server": dataverse_url,
        "file": "7458694"
    },
    "replogle_k562": {
        "server": dataverse_url,
        "file": "7458695"
    },
    "norman_sc_foundation": {
        "server": figshare_url,
        "file": "44477939"
    }
}
data_path = Path("./../../datasets")

for key, value in file_map.items():
    download_url(
        url=value["server"] + value["file"],
        save_path=data_path / ".zip" / "processed" / key,
        data_path=data_path / "processed",
        ext=".zip"
    )

Found local copy...
Found local copy...
Found local copy...
Found local copy...
Found local copy...
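
The `Found local copy...` messages suggest that `download_url` skips files that are already on disk. A minimal sketch of that pattern follows. `fetch_if_missing` is a hypothetical stand-in written for illustration, not the implementation in `fine_tune.scripts`, and it leaves out archive extraction entirely.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, save_path: Path) -> bool:
    """Hypothetical helper: download url to save_path unless the file exists.

    Returns True if a download happened, False if a local copy was reused.
    """
    if save_path.exists():
        print("Found local copy...")
        return False
    save_path.parent.mkdir(parents=True, exist_ok=True)
    print(f"Downloading {save_path.name}...")
    urllib.request.urlretrieve(url, save_path)
    print("Done!")
    return True


# Demonstrate the skip path with a file that already exists (no network needed).
cached = Path(tempfile.mkdtemp()) / "adamson.zip"
cached.write_bytes(b"placeholder")
downloaded = fetch_if_missing(
    "https://dataverse.harvard.edu/api/access/datafile/6154417", cached
)
```

Checking only for existence is the simplest policy. A partially downloaded file would also be treated as a valid local copy, so a size or checksum check may be worth adding.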


In [39]:
adata_map = {
    key: ad.read_h5ad(next((data_path / key).glob("*.h5ad")))
    for key in file_map
}
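
Each loaded object carries a `condition` column in `obs`, which can be used to separate control, single-gene, and multiple-gene perturbations. The sketch below assumes the GEARS labelling convention, where control cells are `ctrl`, single perturbations look like `GENE+ctrl`, and double perturbations look like `GENE1+GENE2`. That format is an assumption to check against the actual data, and the labels used here are made up.

```python
from collections import Counter


def perturbation_kind(condition: str) -> str:
    """Classify a condition label, assuming '+'-separated genes and 'ctrl' fillers."""
    genes = [g for g in condition.split("+") if g != "ctrl"]
    if not genes:
        return "control"
    return "single" if len(genes) == 1 else "multiple"


# Made-up labels; with real data, iterate over adata.obs["condition"].unique().
toy_conditions = ["ctrl", "AAA+ctrl", "ctrl+BBB", "AAA+BBB", "CCC+DDD"]
counts = Counter(perturbation_kind(c) for c in toy_conditions)
print(counts)
```

Applied to the unique conditions of the Norman object, the `multiple` count gives a quick check against the number of two-gene perturbations listed in the dataset overview.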

In [4]:
adata_map

{'adamson': AnnData object with n_obs × n_vars = 68603 × 5060
     obs: 'condition', 'cell_type', 'dose_val', 'control', 'condition_name'
     var: 'gene_name'
     uns: 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20',
 'norman': AnnData object with n_obs × n_vars = 91205 × 5045
     obs: 'condition', 'cell_type', 'dose_val', 'control', 'condition_name'
     var: 'gene_name'
     uns: 'non_dropout_gene_idx', 'non_zeros_gene_idx', 'rank_genes_groups_cov_all', 'top_non_dropout_de_20', 'top_non_zero_de_20'
     layers: 'counts',
 'replogle_rpe1': AnnData object with n_obs × n_vars = 162733 × 5000
     obs: 'condition', 'cell_type', 'cov_drug_dose_name', 'dose_val', 'control', 'condition_name'
     var: 'gene_name', 'chr', 'start', 'end', 'class', 'strand', 'length', 'in_matrix', 'mean', 'std', 'cv', 'fano', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
     uns: 'hvg', 'non_dropout_gene_idx', 'non_zer