<center><h1> GEARS Dataset </h1></center>

## Context

This notebook loads the datasets used in GEARS and in the benchmarking paper [\[1\]](#benchmark).

In particular, we will be using the following [datasets](#datasets).

## Table of Contents

- [Raw Data](#raw-pipeline)
- [Pre-processed Data](#pre-processed-pipeline)

## Datasets

### Raw Data

- [Adamson](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90546): ht<span>tps://</span>www.ncbi.nlm.nih.gov/geo/download/?acc=GSE90546&format=file
- [Norman](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133344):
  - raw_barcodes.tsv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fbarcodes.tsv.gz
  - raw_cell_identities.csv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fcell%5Fidentities.csv.gz
  - raw_genes.tsv.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fgenes.tsv.gz
  - raw_matrix.mtx.gz: ht<span>tps://</span>ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344%5Fraw%5Fmatrix.mtx.gz
- [Replogle](https://plus.figshare.com/articles/dataset/_Mapping_information-rich_genotype-phenotype_landscapes_with_genome-scale_Perturb-seq_Replogle_et_al_2022_processed_Perturb-seq_datasets/20029387):
  - Single-cell raw K562, genome scale, day 8 post-transduction: ht<span>tps://</span>plus.figshare.com/ndownloader/files/35775507
  - Single-cell raw K562, essential scale, day 6 post-transduction: ht<span>tps://</span>plus.figshare.com/ndownloader/files/35773219
  - Single-cell raw RPE1, essential scale, day 7 post-transduction: ht<span>tps://</span>plus.figshare.com/ndownloader/files/35775606

### Pre-processed Data

**GEARS**:

- Single-gene perturbation:
  - [Replogle RPE1](https://dataverse.harvard.edu/file.xhtml?fileId=7458694&version=6.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/7458694
  - [Replogle K562](https://dataverse.harvard.edu/file.xhtml?fileId=7458695&version=6.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/7458695
  - [Adamson](https://dataverse.harvard.edu/file.xhtml?fileId=6154417&version=3.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/6154417
- Multiple-gene perturbation:
  - [Norman](https://dataverse.harvard.edu/file.xhtml?fileId=6154020&version=3.0&toolType=PREVIEW): ht<span>tps://</span>dataverse.harvard.edu/api/access/datafile/6154020
  - 131 two-gene perturbations

**scFoundation**:

Another preprocessing strategy for the Norman et al. dataset:

- [Norman scFoundation](https://figshare.com/articles/dataset/scFoundation_Large_Scale_Foundation_Model_on_Single-cell_Transcriptomics_-_processed_datasets/24049200?file=44477939): ht<span>tps://</span>figshare.com/ndownloader/files/44477939

## References

1. Ahlmann-Eltze, Huber, and Anders, “Deep Learning-Based Predictions of Gene Perturbation Effects Do Not yet Outperform Simple Linear Baselines.”<a id="benchmark"></a>


## Download Datasets

The download script is invoked from this notebook in the cell below; it can also be run directly from a shell.

In [None]:
%%bash
cd ../../../ # cd to root of the project
python fine_tune/scripts/download/cli.py --config fine_tune/config/download.yaml
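
For readers without access to the project's download CLI, a single file from the lists above could be fetched with the standard library alone. The following is a minimal sketch, not the logic of `cli.py`: the helper name `download_file` and the `.part` temporary suffix are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: Path) -> Path:
    """Stream `url` to `destination`; skip the download if the file exists.

    Hypothetical helper, not part of the project's download script.
    """
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.exists():
        return destination

    # Write to a temporary sibling file first so that an interrupted
    # download never leaves a truncated file under the final name.
    partial = destination.with_name(destination.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(destination)
    return destination
```

As a usage example, `download_file("https://dataverse.harvard.edu/api/access/datafile/6154020", Path("datasets/norman_gears.zip"))` would fetch the pre-processed Norman file listed above; the destination path and file name are placeholders, not names used by the project.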

### Create AnnData From Raw

In [2]:
# All imports here
from pathlib import Path

import pandas as pd
import scipy.io
from anndata import AnnData
from anndata.experimental import AnnCollection

data_path = Path("./../../datasets")


%reload_ext autoreload
%autoreload 2


### Raw Adamson

In [3]:
adamson_dir = data_path.joinpath("raw/raw_adamson")
mtx_files = sorted(adamson_dir.glob("*_matrix.mtx.txt.gz"))

all_adatas = {}

for mtx_file in mtx_files:
    prefix = mtx_file.name.split("_matrix")[0]

    # Paths
    barcodes_file = adamson_dir / f"{prefix}_barcodes.tsv.gz"
    genes_file = adamson_dir / f"{prefix}_genes.tsv.gz"
    identities_file = adamson_dir / f"{prefix}_cell_identities.csv.gz"

    # Load matrix
    matrix = scipy.io.mmread(mtx_file).T.tocsr()

    # Load metadata
    genes = pd.read_csv(genes_file, header=None, sep="\t")
    barcodes = pd.read_csv(barcodes_file, header=None)
    cell_identities = pd.read_csv(identities_file, index_col=0)

    # Create AnnData
    adata = AnnData(X=matrix)
    adata.var_names = genes[0].values
    adata.obs_names = barcodes[0].values
    # Indexing with cell_identities.loc[adata.obs_names] raises a KeyError
    # because some barcodes have no entry in cell_identities.
    adata.obs = pd.DataFrame(index=adata.obs_names)
    # Left join: barcodes without a known identity get NaN in every column.
    adata.obs = adata.obs.join(cell_identities)

    # Optional: keep track of the experiment
    adata.uns["experiment"] = prefix
    all_adatas[adata.uns["experiment"]] = adata
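
The loop above attaches cell identities with a left join rather than a strict `.loc` lookup. A toy sketch of the difference, using invented barcodes and guide names (none of them come from the dataset):

```python
import pandas as pd

# Toy stand-ins: three barcodes, only two of which have a known identity.
barcodes = ["AAAC-1", "AAAG-1", "AACT-1"]
cell_identities = pd.DataFrame(
    {"guide identity": ["GUIDE_A", "GUIDE_B"]},  # hypothetical guide names
    index=["AAAC-1", "AACT-1"],
)

# A strict lookup fails as soon as one barcode is missing from the table.
strict_lookup_failed = False
try:
    cell_identities.loc[barcodes]
except KeyError:
    strict_lookup_failed = True

# A left join keeps every barcode and fills the unmatched rows with NaN.
obs = pd.DataFrame(index=barcodes).join(cell_identities)
n_missing = int(obs["guide identity"].isna().sum())
```

On the real data, `adata.obs["guide identity"].isna().sum()` gives the number of barcodes without an assigned guide.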

In [4]:
all_adatas

{'GSM2406675_10X001': AnnData object with n_obs × n_vars = 5768 × 35635
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells'
     uns: 'experiment',
 'GSM2406677_10X005': AnnData object with n_obs × n_vars = 15006 × 32738
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells'
     uns: 'experiment',
 'GSM2406681_10X010': AnnData object with n_obs × n_vars = 65337 × 32738
     obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells'
     uns: 'experiment'}

In [5]:
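# Combine all experiments lazily: keep only genes shared by every experiment
# (join_vars="inner") and the union of the obs columns (join_obs="outer").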
adata_adamson = AnnCollection(all_adatas, join_obs="outer", join_vars="inner", index_unique="-")
adata_adamson

AnnCollection object with n_obs × n_vars = 86111 × 32738
  constructed from 3 AnnData objects
    obs: 'guide identity', 'read count', 'UMI count', 'coverage', 'good coverage', 'number of cells'
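
With `join_vars="inner"`, genes that are not present in every experiment are left out of the collection. Assuming gene identifiers are unique within each experiment, the shapes above suggest that 35635 − 32738 = 2897 genes of `GSM2406675_10X001` are absent from the result. A minimal sketch for listing such genes per experiment; the function name is hypothetical and the toy gene names are invented:

```python
def genes_dropped_by_inner_join(var_names_by_experiment):
    """Return, per experiment, the genes missing from the shared gene set.

    `var_names_by_experiment` maps an experiment name to an iterable of
    gene identifiers. Hypothetical helper for illustration only.
    """
    gene_sets = {name: set(genes) for name, genes in var_names_by_experiment.items()}
    shared = set.intersection(*gene_sets.values())
    return {name: genes - shared for name, genes in gene_sets.items()}


# Toy example: "g1" exists only in the first experiment.
dropped = genes_dropped_by_inner_join(
    {
        "exp_a": ["g1", "g2", "g3"],
        "exp_b": ["g2", "g3"],
        "exp_c": ["g3", "g2"],
    }
)
```

On the notebook's objects this could be called as `genes_dropped_by_inner_join({k: a.var_names for k, a in all_adatas.items()})`.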

### Raw Norman

In [6]:
# Define paths
norman_dir = data_path.joinpath("raw/raw_norman")

# Load matrix
matrix = scipy.io.mmread(norman_dir / "matrix.mtx.gz").T.tocsr()

# Load gene and barcode names
genes = pd.read_csv(norman_dir / "genes.tsv.gz", header=None, sep="\t")
barcodes = pd.read_csv(norman_dir / "barcodes.tsv.gz", header=None)

# Load cell identities
cell_identities = pd.read_csv(norman_dir / "cell_identities.csv.gz", index_col=0)

# Build AnnData
adata_norman = AnnData(X=matrix)
adata_norman.var_names = genes[0].values
adata_norman.obs_names = barcodes[0].values
# Indexing with cell_identities.loc[adata_norman.obs_names] raises a KeyError
# because some barcodes have no entry in cell_identities.
adata_norman.obs = pd.DataFrame(index=adata_norman.obs_names)
# Left join: barcodes without a known identity get NaN in every column.
adata_norman.obs = adata_norman.obs.join(cell_identities)

In [7]:
adata_norman

AnnData object with n_obs × n_vars = 5898240 × 33694
    obs: 'guide_identity', 'read_count', 'UMI_count', 'coverage', 'gemgroup', 'good_coverage', 'number_of_cells', 'cellranger_called'
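
The raw Norman matrix contains 5,898,240 barcodes, and the left join leaves NaN in `obs` for every barcode without an identity. One possible filter, sketched on a toy `obs` table, keeps barcodes that have a guide identity and are flagged in `cellranger_called`. Both the filtering criterion and the assumption that `cellranger_called` holds booleans are guesses, not something the notebook establishes:

```python
import pandas as pd

# Toy obs table mimicking the result of the left join (values are invented).
obs = pd.DataFrame(
    {
        "guide_identity": ["GUIDE_A", None, "GUIDE_B", None],
        "cellranger_called": [True, None, False, None],
    },
    index=["bc1", "bc2", "bc3", "bc4"],
)

# Barcodes that were matched to a guide.
has_identity = obs["guide_identity"].notna()
# Barcodes flagged as cells; missing values compare as False.
called = obs["cellranger_called"].eq(True)

keep = has_identity & called
filtered_obs = obs[keep]
```

The same boolean mask can subset an AnnData object, for example `adata_norman[keep.to_numpy()].copy()`.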

## Inspect AnnData

### Todo

- [ ] Filter cells from the raw Norman and raw Adamson data.
  - [ ] Ask Fedor whether this is required.
- [ ] Make sure no gene was omitted when creating the AnnCollection.
  - [ ] There might be a smart way to extend the AnnData with more genes.