# CRISPR screen analysis with Perturb-tools

In this tutorial, we will cover:
*  Loading three delimited text files (e.g. .csv) holding guide, sample, and guide count information into a single Screen object
*  Slicing (indexing) a Screen object to select a subset of guides and samples
*  Adding Screen objects to combine technical replicates
*  Concatenating Screen objects to combine biological replicates
*  Normalizing counts and calculating the log fold change (LFC) of guides between two conditions
*  Writing a Screen object to an .h5ad or .xlsx file

In [6]:
! pip install perturb-tools==0.1.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting perturb-tools==0.1.3
  Downloading perturb-tools-0.1.3.tar.gz (147 kB)
     |████████████████████████████████| 147 kB 4.0 MB/s 
Building wheels for collected packages: perturb-tools
  Building wheel for perturb-tools (setup.py) ... done
  Created wheel for perturb-tools: filename=perturb_tools-0.1.3-py3-none-any.whl size=54599 sha256=4cbc4940b2a1f6e06b0486f4ae800ae1d902f63be38310d508c63be50aba0c18
  Stored in directory: /root/.cache/pip/wheels/c3/3a/b1/4110b2bf4c6148000079130a9d31b905ed21ea9a8c0f3d868b
Successfully built perturb-tools
Installing collected packages: perturb-tools
  Attempting uninstall: perturb-tools
    Found existing installation: perturb-tools 0.1.2
    Uninstalling perturb-tools-0.1.2:
      Successfully uninstalled perturb-tools-0.1.2
Successfully installed perturb-tools-0.1.3


In [7]:
import pandas as pd
import perturb_tools as pt

We will download a public CRISPR/Cas9 knockout dataset: the [TKO](http://tko.ccbr.utoronto.ca/) HeLa data.

In [8]:
! wget http://tko.ccbr.utoronto.ca/Data/readcount-HeLa-lib1.gz
! wget http://tko.ccbr.utoronto.ca/Data/readcount-HeLa-lib2.gz
! gunzip readcount-HeLa-lib1.gz
! gunzip readcount-HeLa-lib2.gz

--2022-09-21 05:19:31--  http://tko.ccbr.utoronto.ca/Data/readcount-HeLa-lib1.gz
Resolving tko.ccbr.utoronto.ca (tko.ccbr.utoronto.ca)... 142.150.76.126
Connecting to tko.ccbr.utoronto.ca (tko.ccbr.utoronto.ca)|142.150.76.126|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2877516 (2.7M) [application/x-gzip]
Saving to: ‘readcount-HeLa-lib1.gz’


2022-09-21 05:19:33 (1.52 MB/s) - ‘readcount-HeLa-lib1.gz’ saved [2877516/2877516]

--2022-09-21 05:19:33--  http://tko.ccbr.utoronto.ca/Data/readcount-HeLa-lib2.gz
Resolving tko.ccbr.utoronto.ca (tko.ccbr.utoronto.ca)... 142.150.76.126
Connecting to tko.ccbr.utoronto.ca (tko.ccbr.utoronto.ca)|142.150.76.126|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2680113 (2.6M) [application/x-gzip]
Saving to: ‘readcount-HeLa-lib2.gz’


2022-09-21 05:19:36 (879 KB/s) - ‘readcount-HeLa-lib2.gz’ saved [2680113/2680113]



In [9]:
!cut -f 1,2 readcount-HeLa-lib1 > guide_info-HeLa-lib1.tsv
!cut -f 1,3- readcount-HeLa-lib1 > guide_count-HeLa-lib1.tsv
!cut -f 1,2 readcount-HeLa-lib2 > guide_info-HeLa-lib2.tsv
!cut -f 1,3- readcount-HeLa-lib2 > guide_count-HeLa-lib2.tsv

# Loading text files into a Screen object

A Screen object holds three types of information:
*  `Screen.X`: guide count matrix (numpy array with shape (n_guides, n_samples))
*  `Screen.guides`: guide RNA information, e.g. sequence and target element (pandas DataFrame with length n_guides)
*  `Screen.condit`: information on the samples that gave rise to the guide counts (pandas DataFrame with length n_samples)

You can construct a Screen object from any number of these three elements.
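
As a rough illustration of how one read-count table maps onto these three parts, the sketch below splits a tiny tab-separated table using only the standard library. It does not use perturb-tools. The first header name (`GUIDE`) is a hypothetical placeholder, and only three sample columns of the lib1 data are included.

```python
import csv
import io

# Two guides and three samples taken from the lib1 read counts.
# "GUIDE" is a placeholder header name, not necessarily the one in the file.
raw = (
    "GUIDE\tGENE\tT08A\tT08B\tT08C\n"
    "A1BG_CACCTTCGAGCTGCTGCGCG\tA1BG\t310\t226\t338\n"
    "A1BG_AAGAGCGCCTCGGTCCCAGC\tA1BG\t46\t1\t0\n"
)

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, body = rows[0], rows[1:]

# Guide information: one record per guide (first two columns).
guides = [{"name": r[0], "GENE": r[1]} for r in body]
# Sample information: one record per sample column.
condit = [{"index": s} for s in header[2:]]
# Count matrix: n_guides rows by n_samples columns.
X = [[float(v) for v in r[2:]] for r in body]

# The three parts must agree on their dimensions.
assert len(X) == len(guides)
assert all(len(row) == len(condit) for row in X)
print(len(guides), len(condit))  # 2 3
```

This mirrors the `cut` commands above: columns 1 and 2 describe the guides, and the remaining columns are per-sample counts.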

In [11]:
screen = pt.read_csv(X_path="guide_count-HeLa-lib1.tsv", guide_path="guide_info-HeLa-lib1.tsv", condit_path=None, sep="\t")

  super().__init__(X, obs = guides, var = condit, *args, **kwargs)


In [12]:
screen.X

array([[310., 226., 338., ...,  49., 296., 469.],
       [ 46.,   1.,   0., ...,   1.,  52., 213.],
       [239., 216., 285., ..., 250., 269., 363.],
       ...,
       [508., 479., 248., ..., 331., 386., 566.],
       [ 97.,  50.,  10., ...,  28.,  23.,   3.],
       [ 91., 115., 130., ...,  76., 120., 390.]], dtype=float32)

Alternatively, you can read the files manually and initialize the Screen objects yourself.

In [None]:
tbl = pd.read_csv("readcount-HeLa-lib1", sep = "\t")
tbl2 = pd.read_csv("readcount-HeLa-lib2", sep = "\t")

In [None]:
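def make_screen(tbl, guide_name_col=None):
  # Default to the first column as the guide name column.
  if guide_name_col is None:
    guide_name_col = tbl.columns[0]
  tbl = tbl.rename(columns={guide_name_col:"name"})
  # One row per sample column; names look like "T08A" (time 8, replicate A).
  sample_df = pd.DataFrame(tbl.columns[2:]).rename(columns={0:"index"}).set_index("index")
  sample_df["replicate"] = sample_df.index.str[-1]
  # Names with nothing between the first and last character (e.g. "T0") get time -1.
  sample_df["time"] = sample_df.index.str[1:-1].map(lambda s: int(s) if s else -1)
  return pt.Screen(X=tbl.values[:,2:], guides=tbl.iloc[:,:2], 
                   condit=sample_df)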

In [None]:
adata = make_screen(tbl)
bdata = make_screen(tbl2)

  super().__init__(X, obs = guides, var = condit, *args, **kwargs)


In [None]:
adata.guides

Unnamed: 0,name,GENE
0,A1BG_CACCTTCGAGCTGCTGCGCG,A1BG
1,A1BG_AAGAGCGCCTCGGTCCCAGC,A1BG
2,A1BG_TGGACTTCCAGCTACGGCGC,A1BG
3,A1BG_CACTGGCGCCATCGAGAGCC,A1BG
4,A1BG_GCTCGGGCTTGTCCACAGGA,A1BG
...,...,...
91315,luciferase_CCTCTAGAGGATGGAACCGC,luciferase
91316,luciferase_ACAACTTTACCGACCGCGCC,luciferase
91317,luciferase_CTTGTCGTATCCCTGGAAGA,luciferase
91318,luciferase_GGCTATGAAGAGATACGCCC,luciferase


In [None]:
adata.condit

index,replicate,time
T08A,A,8
T08B,B,8
T08C,C,8
T12A,A,12
T12B,B,12
T12C,C,12
T15A,A,15
T15B,B,15
T15C,C,15
T18A,A,18


### Slicing

In [None]:
adata_cut = adata[adata.guides.GENE == "A1BG", :]
adata_cut

Genome Editing Screen comprised of n_guides x n_conditions = 6 x 13
   guides:    'name', 'GENE'
   condit:    'replicate', 'time'
   condit_m:  
   condit_p:  
   layers:    
   uns:       

In [None]:
adata_t8 = adata[:, adata.condit.time == 8]
adata_t8

Genome Editing Screen comprised of n_guides x n_conditions = 91320 x 3
   guides:    'name', 'GENE'
   condit:    'replicate', 'time'
   condit_m:  
   condit_p:  
   layers:    
   uns:       

### Writing

In [None]:
adata.write("HeLa_lib1.h5ad")

In [None]:
# import anndata as ad
# adata_ann = ad.read_h5ad("HeLa_lib1.h5ad")
# adata_pt = pt.Screen.from_adata(adata_ann)

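The written file is compatible with AnnData's .h5ad format, so it can be read back with `anndata` and converted to a Screen object, as in the commented code above.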

## Arithmetic

### Adding
If the guides and conditions of two Screen objects are exactly the same, the objects can be added, e.g. to combine technical replicates.
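
Conceptually, adding amounts to an element-wise sum of the count matrices once the guide and sample labels have been checked. The following is a minimal standard-library sketch of that idea; `add_counts` is a hypothetical helper and not the perturb-tools implementation.

```python
def add_counts(x1, x2, guides1, guides2, samples1, samples2):
    """Element-wise sum of two count matrices with identical labels."""
    if guides1 != guides2 or samples1 != samples2:
        raise ValueError("guides and samples must match exactly")
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x1, x2)]

guide_names = ["A1BG_CACCTTCGAGCTGCTGCGCG", "A1BG_AAGAGCGCCTCGGTCCCAGC"]
sample_names = ["T08A", "T08B"]
counts = [[310, 226], [46, 1]]

# Adding a matrix to itself doubles every count, like `adata + adata`.
summed = add_counts(counts, counts, guide_names, guide_names,
                    sample_names, sample_names)
print(summed)  # [[620, 452], [92, 2]]
```

The shape is unchanged by the addition, which matches the 91320 x 13 output of the cell below.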

In [None]:
adata + adata

Genome Editing Screen comprised of n_guides x n_conditions = 91320 x 13
   guides:    'name', 'GENE'
   condit:    'replicate', 'time'
   condit_m:  
   condit_p:  
   layers:    
   uns:       

### Concatenating
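Biological replicates can be concatenated along the 'condit' axis. Note that the call below, run with default arguments, stacks the two objects along the guide axis instead, as its output shape (182640 x 13) shows.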
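
For intuition, concatenating along the sample axis means appending the columns of one count matrix to those of another for the same set of guides. The sketch below shows this with plain lists. `concat_samples` and the `rep1_`/`rep2_` label prefixes are hypothetical and say nothing about how `pt.concat` is implemented or which arguments it accepts.

```python
def concat_samples(x1, x2, guides1, guides2):
    """Append the sample columns of x2 to x1; guides must match."""
    if guides1 != guides2:
        raise ValueError("guides must match to concatenate samples")
    return [r1 + r2 for r1, r2 in zip(x1, x2)]

guide_names = ["A1BG_CACCTTCGAGCTGCTGCGCG", "A1BG_AAGAGCGCCTCGGTCCCAGC"]
rep1 = [[310, 226], [46, 1]]   # columns T08A, T08B
rep2 = [[338], [0]]            # column T08C

combined = concat_samples(rep1, rep2, guide_names, guide_names)
# Hypothetical prefixes keep the combined sample labels unique.
labels = ["rep1_T08A", "rep1_T08B", "rep2_T08C"]
print(combined)  # [[310, 226, 338], [46, 1, 0]]
```

The number of guides stays the same while the number of samples grows, which is the opposite of the 182640 x 13 shape printed below.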

In [None]:
pt.concat((adata, adata))

  utils.warn_names_duplicates("obs")


Genome Editing Screen comprised of n_guides x n_conditions = 182640 x 13
   guides:    'name', 'GENE'
   condit:    
   condit_m:  
   condit_p:  
   layers:    
   uns:       

## Normalization & LFC calculation

In [None]:
adata.log_norm()
adata

Genome Editing Screen comprised of n_guides x n_conditions = 91320 x 13
   guides:    'name', 'GENE'
   condit:    'replicate', 'time'
   condit_m:  
   condit_p:  
   layers:    'lognorm_counts'
   uns:       

In [None]:
adata.layers['lognorm_counts']

array([[3.8361688 , 3.723453  , 4.3636794 , ..., 2.0999844 , 4.0123653 ,
        3.575257  ],
       [1.5709194 , 0.0759054 , 0.        , ..., 0.09367192, 1.8715795 ,
        2.5751176 ],
       [3.4906108 , 3.6632092 , 4.1305914 , ..., 4.1514244 , 3.8833196 ,
        3.2405565 ],
       ...,
       [4.508811  , 4.7482824 , 3.942206  , ..., 4.536323  , 4.374373  ,
        3.8255777 ],
       [2.3662837 , 1.8879594 , 0.65947205, ..., 1.5252234 , 1.1218393 ,
        0.0974056 ],
       [2.292497  , 2.8505118 , 3.0931478 , ..., 2.6084018 , 2.835301  ,
        3.3334548 ]], dtype=float32)
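
The values in this layer are consistent with a transform of the form log2(1 + scaled count), with one scale factor per sample. The sketch below assumes the scaling is reads per million of the sample total. That factor is an assumption, and the formula used by `log_norm()` may differ. The counts are made up for illustration.

```python
import math

def log_norm(counts):
    """log2(1 + reads per million) per sample; counts is guides x samples."""
    n_samples = len(counts[0])
    # Total reads per sample (column sums).
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        [math.log2(1 + row[j] * 1e6 / totals[j]) for j in range(n_samples)]
        for row in counts
    ]

# Hypothetical counts: two guides, two samples.
counts = [[3, 0], [999_997, 500_000]]
lognorm = log_norm(counts)
print(lognorm[0])  # [2.0, 0.0]
```

A guide with zero reads maps to 0, as seen in the second row of the array above.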

Calculate the LFC of sample T18A relative to T08A (T=18 vs. T=8, replicate A).
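
The LFC values reported in this tutorial are consistent with a simple difference of log-normalized counts, first condition minus second. The sketch below checks this for the second A1BG guide, whose log-normalized value in T08A is 1.5709194 and whose raw T18A count is 0. The helper `log_fold_change` here is a hypothetical stand-alone function, not the library method.

```python
def log_fold_change(lognorm_cond1, lognorm_cond2):
    """Per-guide difference of log-normalized counts: cond1 - cond2."""
    return [a - b for a, b in zip(lognorm_cond1, lognorm_cond2)]

# Second A1BG guide: log-normalized value in T08A, and 0 in T18A
# (a raw count of 0 gives log2(1 + 0) = 0 under the assumed transform).
t08a = [1.5709194]
t18a = [0.0]

lfc = log_fold_change(t18a, t08a)
print(round(lfc[0], 6))  # -1.570919
```

A negative value means the guide is depleted at the later time point.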

In [None]:
adata.log_fold_change("T18A", "T08A")

In [None]:
adata.guides

Unnamed: 0,name,GENE,T18A_T08A.lfc
0,A1BG_CACCTTCGAGCTGCTGCGCG,A1BG,-0.341631
1,A1BG_AAGAGCGCCTCGGTCCCAGC,A1BG,-1.570919
2,A1BG_TGGACTTCCAGCTACGGCGC,A1BG,-0.060601
3,A1BG_CACTGGCGCCATCGAGAGCC,A1BG,-0.279657
4,A1BG_GCTCGGGCTTGTCCACAGGA,A1BG,0.912809
...,...,...,...
91315,luciferase_CCTCTAGAGGATGGAACCGC,luciferase,0.946176
91316,luciferase_ACAACTTTACCGACCGCGCC,luciferase,0.848200
91317,luciferase_CTTGTCGTATCCCTGGAAGA,luciferase,0.026325
91318,luciferase_GGCTATGAAGAGATACGCCC,luciferase,-1.009472


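Calculate the LFC of T=18 vs. T=8 separately for each replicate. The T0 sample, whose replicate label is parsed as "0" by `make_screen`, is excluded first.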

In [None]:
adata_t = adata[:, adata.condit.replicate != "0"]

In [None]:
adata_t.log_fold_change_reps(18, 8, rep_condit="replicate", compare_condit="time")

Unnamed: 0,A.18_8.lfc,B.18_8.lfc,C.18_8.lfc
0,-0.341631,-1.623469,-0.351314
1,-1.570919,0.017767,1.871580
2,-0.060601,0.488215,-0.247272
3,-0.279657,0.893938,0.169553
4,0.912809,1.240354,0.766471
...,...,...,...
91315,0.946176,0.096584,0.677010
91316,0.848200,2.029430,0.480280
91317,0.026325,-0.211959,0.432167
91318,-1.009472,-0.362736,0.462367


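Aggregate the per-replicate LFCs with `aggregate_fn`, one of `median`, `mean`, or `sd`. Note that the call below passes the conditions in the order (8, 18), so the resulting `8_18.lfc.median` values have the opposite sign to the 18-vs-8 LFCs above.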
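
As a quick check of what median aggregation does, the sketch below takes the first three rows of the per-replicate 18-vs-8 table above and computes the per-guide median with the standard library. It illustrates the arithmetic only and is not the library's code path.

```python
import statistics

# 18-vs-8 LFCs for replicates A, B, C (first three guides of the table above).
per_replicate_lfc = [
    [-0.341631, -1.623469, -0.351314],
    [-1.570919, 0.017767, 1.871580],
    [-0.060601, 0.488215, -0.247272],
]

# One aggregated value per guide.
median_lfc = [statistics.median(row) for row in per_replicate_lfc]
print(median_lfc)  # [-0.351314, 0.017767, -0.060601]
```

These medians equal the `8_18.lfc.median` values reported below up to sign, which follows from the reversed order of the two conditions in that call.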

In [None]:
adata_t.log_fold_change_aggregate(8, 18, aggregate_condit="replicate", compare_condit="time", aggregate_fn = "median")

In [None]:
adata_t.guides

Unnamed: 0,name,GENE,T18A_T08A.lfc,8_18.lfc.median
0,A1BG_CACCTTCGAGCTGCTGCGCG,A1BG,-0.341631,0.351314
1,A1BG_AAGAGCGCCTCGGTCCCAGC,A1BG,-1.570919,-0.017767
2,A1BG_TGGACTTCCAGCTACGGCGC,A1BG,-0.060601,0.060601
3,A1BG_CACTGGCGCCATCGAGAGCC,A1BG,-0.279657,-0.169553
4,A1BG_GCTCGGGCTTGTCCACAGGA,A1BG,0.912809,-0.912809
...,...,...,...,...
91315,luciferase_CCTCTAGAGGATGGAACCGC,luciferase,0.946176,-0.677010
91316,luciferase_ACAACTTTACCGACCGCGCC,luciferase,0.848200,-0.848200
91317,luciferase_CTTGTCGTATCCCTGGAAGA,luciferase,0.026325,-0.026325
91318,luciferase_GGCTATGAAGAGATACGCCC,luciferase,-1.009472,0.362736


# Writing

In [None]:
adata.to_Excel("Hela_lib1.xlsx")

Writing to: Hela_lib1.xlsx

	Sheet 1:	X
	Sheet 2:	lognorm_counts
	Sheet 3:	guides
	Sheet 4:	condit


In [None]:
adata.to_mageck_input("Hela_mageck_input.txt", target_column="GENE")

In [None]:
! head Hela_mageck_input.txt

sgRNA	gene	T08A	T08B	T08C	T12A	T12B	T12C	T15A	T15B	T15C	T18A	T18B	T18C	T0
A1BG_CACCTTCGAGCTGCTGCGCG	A1BG	310	226	338	356	249	224	186	60	296	125	49	296	469
A1BG_AAGAGCGCCTCGGTCCCAGC	A1BG	46	1	0	7	22	142	0	1	52	0	1	52	213
A1BG_TGGACTTCCAGCTACGGCGC	A1BG	239	216	285	117	244	116	172	298	269	119	250	269	363
A1BG_CACTGGCGCCATCGAGAGCC	A1BG	289	83	166	164	111	14	184	160	214	122	137	214	678
A1BG_GCTCGGGCTTGTCCACAGGA	A1BG	205	34	217	205	148	355	326	100	432	212	85	432	559
A1BG_CAAGAGAAAGACCACGAGCA	A1BG	389	331	468	1074	364	158	664	286	499	464	235	499	647
A1CF_CGTGGCTATTTGGCATACAC	A1CF	452	240	390	630	509	261	471	255	301	322	210	301	898
A1CF_GGTATACTCTCCTTGCAGCA	A1CF	71	30	29	119	155	153	131	76	56	94	61	56	199
A1CF_GACATGGTATTGCAGTAGAC	A1CF	207	227	223	118	141	173	176	198	42	118	166	42	271
