# Data Preparation Tutorial

This tutorial shows how to convert combinatorial phenotypic screening results into a format that NAIAD can parse.

NAIAD expects the data as a CSV with the following columns:

`| gene1 | gene2 | score |`

The columns `gene1` and `gene2` should contain gene names. `score` should contain the numeric phenotype value for the combined genetic perturbation of `gene1` and `gene2`. The table should also include scores for all single-gene perturbations in the dataset. For these rows, `gene1` is the perturbed gene, `gene2` is the literal string 'negative', and `score` is the phenotype score of that single perturbation.
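
As a minimal sketch of this layout, the toy table below mixes single-gene rows with one pair row. The gene names and scores are made up, and `missing_single_gene_rows` is a hypothetical helper written for this example, not part of NAIAD. It lists genes that appear in a pair but have no matching 'negative' row.

```python
import pandas as pd

# Toy table in the layout described above (gene names and scores are made up).
toy = pd.DataFrame(
    {
        "gene1": ["GENE_A", "GENE_B", "GENE_A"],
        "gene2": ["negative", "negative", "GENE_B"],
        "score": [-0.10, -0.25, -0.40],
    }
)


def missing_single_gene_rows(df: pd.DataFrame) -> set:
    """Return genes that appear in a pair but have no (gene, 'negative') row."""
    is_single = df["gene2"] == "negative"
    singles = set(df.loc[is_single, "gene1"])
    pairs = df.loc[~is_single, ["gene1", "gene2"]]
    genes_in_pairs = set(pairs["gene1"]) | set(pairs["gene2"])
    return genes_in_pairs - singles


# An empty set means every gene in a pair also has a single-gene score.
print(missing_single_gene_rows(toy))
```

Running a check like this on your own table before loading it can catch genes whose single-gene score was dropped during preprocessing.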


We will show two examples in this tutorial:

1) Using the [Simpson2023](https://www.biorxiv.org/content/10.1101/2023.08.19.553986v1) dataset, which measures cell viability in CRISPRi screens of pairwise genetic interactions for ~500 genes.
2) Using the [Horlbeck2018](https://www.cell.com/cell/fulltext/S0092-8674(18)30735-9) dataset, which measures cell viability in CRISPRi screens of pairwise genetic interactions for ~400 genes.

## Set up notebook

In [1]:
import os

import numpy as np
import pandas as pd

from naiad import load_naiad_data

In [2]:
# set some configuration settings for the notebook
pd.set_option("mode.copy_on_write", True)

## Download Simpson Data

Download the original data files from the Simpson2023 dataset, hosted at https://parpi.princeton.edu/map/. 

NOTE: The download URLs used below contain a session ID and aren't statically hosted, so you may need to download the files manually from the "FAQ and Downloads" section of https://parpi.princeton.edu/map/.

Save the `Individual sgRNA phenotypes` file as `individual_sgRNA_phenotypes.txt` and the `sgRNA pair phenotypes` file as `pair_sgRNA_phenotypes.txt`.
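
After a manual download, a quick check like the following can confirm that both files are where the later cells expect them. This is only a sketch: `find_missing` is a hypothetical helper, and the directory and file names are the ones assumed in this tutorial. Empty files are reported as well, since a failed download can leave one behind.

```python
from pathlib import Path

# Assumed location and file names, matching the ones used in this tutorial.
simpson_dir = Path("./data/simpson_raw_data")
expected_files = ["individual_sgRNA_phenotypes.txt", "pair_sgRNA_phenotypes.txt"]


def find_missing(directory: Path, names: list[str]) -> list[str]:
    """Return the expected file names that are absent or empty in `directory`."""
    missing = []
    for name in names:
        path = directory / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


missing = find_missing(simpson_dir, expected_files)
if missing:
    print("Missing or empty files:", missing)
else:
    print("All expected files are present.")
```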

In [3]:
simpson_dir = './data/simpson_raw_data'

# Use some shell commands to download the files into a target directory
!mkdir -p '{simpson_dir}'
!wget 'https://parpi.princeton.edu/map/session/e4d7d21aacd31d51dd2c5cf7da180ec5/download/dlpairsgpheno?w=' -O '{os.path.join(simpson_dir, "pair_sgRNA_phenotypes.txt")}'
!wget 'https://parpi.princeton.edu/map/session/e4d7d21aacd31d51dd2c5cf7da180ec5/download/dlindsgpheno?w=' -O '{os.path.join(simpson_dir, "individual_sgRNA_phenotypes.txt")}'


--2024-10-31 16:07:17--  https://parpi.princeton.edu/map/session/e4d7d21aacd31d51dd2c5cf7da180ec5/download/dlpairsgpheno?w=
Resolving parpi.princeton.edu (parpi.princeton.edu)... 128.112.116.129
Connecting to parpi.princeton.edu (parpi.princeton.edu)|128.112.116.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 156833149 (150M) [text/plain]
Saving to: ‘./data/simpson_raw_data/pair_sgRNA_phenotypes.txt’


2024-10-31 16:07:29 (13.9 MB/s) - ‘./data/simpson_raw_data/pair_sgRNA_phenotypes.txt’ saved [156833149/156833149]

--2024-10-31 16:07:29--  https://parpi.princeton.edu/map/session/e4d7d21aacd31d51dd2c5cf7da180ec5/download/dlindsgpheno?w=
Resolving parpi.princeton.edu (parpi.princeton.edu)... 128.112.116.129
Connecting to parpi.princeton.edu (parpi.princeton.edu)|128.112.116.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104920 (102K) [text/plain]
Saving to: ‘./data/simpson_raw_data/individual_sgRNA_phenotypes.txt’


2024-10-31 

In [4]:
pair_data = pd.read_csv(os.path.join(simpson_dir, 'pair_sgRNA_phenotypes.txt'), sep='\t')
single_data = pd.read_csv(os.path.join(simpson_dir, 'individual_sgRNA_phenotypes.txt'), sep='\t')

Each row of the pair data file contains a combinatorial perturbation. The `FirstPosition` and `SecondPosition` columns describe which guide RNAs were used for the CRISPR screen. We don't care about the specific guides when using NAIAD, only the genes they target.

For our phenotypic measurements, we use `Gamma`, which is a measure of cellular viability. Specifically, it corresponds to the log fold change in the number of cells carrying each guide between the beginning and end of the experiment, normalized by the number of cell doublings. For more information, refer to the "Statistical Analysis" section of the Appendix of the [Simpson2023](https://www.biorxiv.org/content/10.1101/2023.08.19.553986v1.full.pdf) paper.

In [5]:
pair_data.head()

Unnamed: 0,FirstPosition,FirstGene,SecondPosition,SecondGene,Gamma,Tau,Rho
0,AARS2_+_44281006.23-P1P2,AARS2,AARS2_+_44281006.23-P1P2,AARS2,-0.142116,-0.074401,0.305809
1,AARS2_+_44281006.23-P1P2,AARS2,AARS2_+_44281027.23-P1P2,AARS2,-0.212971,-0.212569,0.210664
2,AARS2_+_44281027.23-P1P2,AARS2,AARS2_+_44281006.23-P1P2,AARS2,-0.206146,-0.211862,0.186636
3,AARS2_+_44281027.23-P1P2,AARS2,AARS2_+_44281027.23-P1P2,AARS2,-0.283184,-0.379132,0.038958
4,AARS2_+_44281006.23-P1P2,AARS2,AATF_-_35306286.23-P1P2,AATF,-0.167829,-0.144589,0.232376


To get the phenotypic measurements from single-gene genetic perturbations, we use the data from the CRISPRi experiments that only targeted a single gene with a single guide at a time.

Again, we use `Gamma` as the phenotypic measurement of interest.

In [6]:
single_data.head()

Unnamed: 0,sgRNA.ID,gene,Gamma,Tau,Rho
0,AARS2_+_44281006.23-P1P2,AARS2,-0.093799,-0.078114,0.134639
1,AARS2_+_44281027.23-P1P2,AARS2,-0.240023,-0.268461,0.170866
2,AATF_-_35306286.23-P1P2,AATF,-0.204515,-0.219527,0.165118
3,AATF_-_35306346.23-P1P2,AATF,-0.172567,-0.178428,0.158923
4,ABCB1_-_87342256.23-ENST00000265724.3,ABCB1,-0.002231,-0.004137,-0.00311


In [7]:
# add a placeholder second gene to the single-gene data frame;
# every entry in this column is 'negative'
single_data.rename(columns={'gene': 'gene1', 'Gamma': 'score'}, inplace=True)
single_data['gene2'] = 'negative'

# rename relevant columns of the pair data frame 
pair_data.rename(columns={'FirstGene': 'gene1', 'SecondGene': 'gene2', 'Gamma': 'score'}, inplace=True)

# extract relevant columns from each data frame
single_data = single_data[['gene1','gene2', 'score']]
pair_data = pair_data[['gene1', 'gene2', 'score']]

# concatenate the two data frames, since NAIAD expects single-gene perturbations to be paired with a 'negative' gene
simpson_score_data = pd.concat([single_data, pair_data], ignore_index=True) 

# save file
simpson_comb_file = os.path.join(simpson_dir, 'simpson_gamma_score.csv')
simpson_score_data.to_csv(simpson_comb_file)

# inspect data
simpson_score_data.head()

Unnamed: 0,gene1,gene2,score
0,AARS2,negative,-0.093799
1,AARS2,negative,-0.240023
2,AATF,negative,-0.204515
3,AATF,negative,-0.172567
4,ABCB1,negative,-0.002231


Since multiple guides may target the same gene, the `simpson_score_data` data frame often contains several rows for the same gene combination. The function `load_naiad_data` averages these duplicates so that each unique combination gets a single phenotype score.
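
The effect of this averaging can be illustrated with a plain pandas `groupby`. This is a sketch of the idea only, not the actual implementation of `load_naiad_data`. It uses the two AARS2 single-guide scores shown earlier in this tutorial.

```python
import pandas as pd

# The two AARS2 single-guide rows from the table above.
toy = pd.DataFrame(
    {
        "gene1": ["AARS2", "AARS2"],
        "gene2": ["negative", "negative"],
        "score": [-0.093799, -0.240023],
    }
)

# Collapse duplicate (gene1, gene2) rows to one mean score per combination.
averaged = toy.groupby(["gene1", "gene2"], as_index=False)["score"].mean()
print(averaged)
```

The mean of the two guide scores is -0.166911, which matches the `g1_score` reported for AARS2 in the `load_naiad_data` output in the next cell.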

In [8]:
# load the data frame in the form used by the NAIAD model
naiad_data = load_naiad_data(simpson_comb_file)
naiad_data.head()

Unnamed: 0,gene1,gene2,comb_score,g1_score,g2_score
0,AARS2,AARS2,-0.211104,-0.166911,-0.166911
1,AARS2,AATF,-0.199281,-0.166911,-0.188541
2,AARS2,ABCB1,-0.164574,-0.166911,-0.00132
3,AARS2,ABL1,-0.17287,-0.166911,-0.000182
4,AARS2,ADPRM,-0.215765,-0.166911,-0.16683


## Download Horlbeck Data

We access the Horlbeck data from this URL: https://data.mendeley.com/datasets/rdzk59n6j4/1. 

Note: Unlike the Simpson dataset, the Horlbeck dataset doesn't have separate "Individual" and "Pair" data files. Instead, all data is in a "pair" combination data file, which includes pairs of genes perturbed together, as well as target genes paired with non-targeting guides. We extract the single-gene perturbation effect from the [gene]-negative pairs.
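
To make the idea of [gene]-negative pairs concrete, here is a small sketch on a toy table with made-up scores. It assumes the control guide is labelled 'negative' after the gene names are parsed, and it keeps rows where exactly one side is a control. The cells below do not need this step, since single-gene rows are passed to `load_naiad_data` together with the pairs.

```python
import pandas as pd

# Toy combined table (made-up scores); 'negative' marks a non-targeting guide.
toy = pd.DataFrame(
    {
        "gene1": ["GENE_A", "negative", "GENE_A", "negative"],
        "gene2": ["negative", "GENE_B", "GENE_B", "negative"],
        "score": [-0.10, -0.20, -0.35, 0.01],
    }
)

g1_neg = toy["gene1"] == "negative"
g2_neg = toy["gene2"] == "negative"

# Exactly one side is a control, so the row is a single-gene perturbation.
singles = toy[g1_neg ^ g2_neg].copy()

# Report the targeted gene regardless of which guide position it occupied.
singles["gene"] = singles["gene1"].where(singles["gene1"] != "negative", singles["gene2"])
print(singles[["gene", "score"]])
```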

In [9]:
horlbeck_dir = './data/horlbeck_raw_data'

# Use some shell commands to download the files into a target directory
!mkdir -p '{horlbeck_dir}'
!wget 'https://data.mendeley.com/public-files/datasets/rdzk59n6j4/files/7ab168b7-3944-4f35-b249-5a672b32b20c/file_downloaded' -P '{horlbeck_dir}'
!unzip '{os.path.join(horlbeck_dir, "file_downloaded")}' -d '{horlbeck_dir}'
!rm '{os.path.join(horlbeck_dir, "file_downloaded")}'

--2024-10-31 16:08:26--  https://data.mendeley.com/public-files/datasets/rdzk59n6j4/files/7ab168b7-3944-4f35-b249-5a672b32b20c/file_downloaded
Resolving data.mendeley.com (data.mendeley.com)... 162.159.133.86, 162.159.130.86
Connecting to data.mendeley.com (data.mendeley.com)|162.159.133.86|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com/84a5c5f2-574d-4469-b9ff-9c21450db673 [following]
--2024-10-31 16:08:27--  https://prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com/84a5c5f2-574d-4469-b9ff-9c21450db673
Resolving prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com (prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com)... 3.5.68.106, 52.92.20.130, 3.5.64.122, ...
Connecting to prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com (prod-dcd-datasets-public-files-eu-west-1.s3.eu-west-1.amazonaws.com)|3.5.68.106|:443... co

In [10]:
horlbeck_data = pd.read_csv(os.path.join(horlbeck_dir, 'CRISPRi_Jurkat_all_pair_phenotypes.txt'), sep='\t')

horlbeck_data.head()


Unnamed: 0.1,Unnamed: 0,Triple Sequencing,Triple Sequencing.1,sgRNA Sequencing,sgRNA Sequencing.1,Barcode Sequencing,Barcode Sequencing.1
0,,Replicate 1,Replicate 2,Replicate 1,Replicate 2,Replicate 1,Replicate 2
1,,,,,,,
2,AARS2_+_44281027.23-P1P2++AARS2_+_44281027.23-...,-0.0832550349677,-0.208204422149,-0.15589322304,-0.25314490123,-0.0785635910659,-0.19039927648
3,AARS2_+_44281027.23-P1P2++AARS2_+_44281044.23-...,-0.166878226682,-0.178913177538,-0.175819976716,-0.178862197243,-0.157838276089,-0.158146616817
4,AARS2_+_44281027.23-P1P2++AATF_-_35306286.23-P1P2,-0.157906418633,-0.161295489826,-0.178218281841,-0.190490319206,-0.158745626035,-0.160077426188


In [11]:
# preprocess file: keep only rows with a value in both replicate columns
horlbeck_data_filtered = horlbeck_data[~horlbeck_data['Triple Sequencing'].isna()] # first replicate
horlbeck_data_filtered = horlbeck_data_filtered[~horlbeck_data_filtered['Triple Sequencing.1'].isna()] # second replicate
# drop the remaining header row that holds the 'Replicate 1' / 'Replicate 2' labels
horlbeck_data_filtered = horlbeck_data_filtered[1:]
horlbeck_data_filtered.reset_index(drop=True, inplace=True)

# extract gene names
gene_pairs = horlbeck_data_filtered.iloc[:, 0]
g1 = []
g2 = []

for pert in gene_pairs:
    pert = pert.split('++')
    pert = [g.split('_')[0] for g in pert]
    g1.append(pert[0])
    g2.append(pert[1])

horlbeck_data_filtered['gene1'] = g1
horlbeck_data_filtered['gene2'] = g2

# average phenotype scores between the two replicates and extract relevant columns
horlbeck_data_filtered['score'] = (horlbeck_data_filtered['Triple Sequencing'].astype(np.float32) + horlbeck_data_filtered['Triple Sequencing.1'].astype(np.float32)) / 2
horlbeck_data_filtered = horlbeck_data_filtered.loc[:, ['gene1', 'gene2', 'score']]

# sort the two genes in each combination alphabetically so that scores for the same pair,
# in either guide order, are averaged together
switch_rows = horlbeck_data_filtered['gene1'] > horlbeck_data_filtered['gene2']
horlbeck_data_filtered.loc[switch_rows, 'gene1'], horlbeck_data_filtered.loc[switch_rows, 'gene2'] = horlbeck_data_filtered.loc[switch_rows, 'gene2'], horlbeck_data_filtered.loc[switch_rows, 'gene1']
horlbeck_data_filtered = horlbeck_data_filtered.groupby(by=['gene1', 'gene2']).mean().reset_index()

# save data to file
horlbeck_comb_file = os.path.join(horlbeck_dir, 'horlbeck_gamma_score.csv')
horlbeck_data_filtered.to_csv(horlbeck_comb_file)

# inspect data
horlbeck_data_filtered.head()

Unnamed: 0,gene1,gene2,score
0,AARS2,AARS2,-0.126785
1,AARS2,AATF,-0.186201
2,AARS2,ABCB7,-0.495732
3,AARS2,ACTL6A,-0.307017
4,AARS2,ACTR10,-0.173424
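
The parsing and sorting steps in the cell above can be summarized in one small function. This is a sketch with a hypothetical helper name, `sorted_gene_pair`. Like the loop above, it assumes guide IDs are joined by `++` and that the gene name is the text before the first underscore, so it would not handle gene names that contain underscores.

```python
def sorted_gene_pair(pair_id: str) -> tuple[str, str]:
    """Parse a 'guide1++guide2' ID into an alphabetically sorted gene pair."""
    first, second = (guide.split("_")[0] for guide in pair_id.split("++"))
    return tuple(sorted((first, second)))


# The same two guides in either order map to the same key.
forward = "AARS2_+_44281027.23-P1P2++AATF_-_35306286.23-P1P2"
reverse = "AATF_-_35306286.23-P1P2++AARS2_+_44281027.23-P1P2"
print(sorted_gene_pair(forward))
print(sorted_gene_pair(reverse))
```

Because Python and pandas compare strings by code point, a lowercase label such as 'negative' sorts after upper-case gene symbols. After sorting, it therefore ends up in `gene2`, which matches the single-gene convention described at the start of this tutorial.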


In [12]:
# load data to use for NAIAD model
naiad_data = load_naiad_data(horlbeck_comb_file)
naiad_data.head()

Unnamed: 0,gene1,gene2,comb_score,g1_score,g2_score
0,AARS2,AARS2,-0.126785,-0.133196,-0.133196
1,AARS2,AATF,-0.186201,-0.133196,-0.150909
2,AARS2,ABCB7,-0.495732,-0.133196,-0.420355
3,AARS2,ACTL6A,-0.307017,-0.133196,-0.286111
4,AARS2,ACTR10,-0.173424,-0.133196,-0.088533
