## CERES post-processing

**Input:**
CERES output file (a score for each gene in each cell line)

**Output:**
Cleaned up and filtered gene scores file to use for the rest of the analysis

For the output, use the same format as the table from DepMap.  
* Genes as column headings
* Cell lines as rows

Post-processing steps:
1. Filter out genes targeted by too few guides (fewer than 3).  
2. Drop genes that DepMap dropped for quality reasons (e.g. no unique guides).  
3. Rescale scores to the reference essentials / non-essentials.
4. Update genes (columns) to Entrez IDs.

In [65]:
import pandas as pd
import numpy as np
import os
import re
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
import sklearn.decomposition

get_data_path = lambda folders, fname: os.path.normpath(os.environ['3RD_PARTY_DIR']+'/'+'/'.join(folders) +'/'+fname)
get_local_data_path = lambda folders, fname: os.path.normpath('../local_data/' +'/'.join(folders) +'/'+ fname)

# Input from running CERES / sgRNA mapping
file_ceres_unscaled = get_local_data_path(['processed', 'depmap20Q2'], 'ceres_gene_unscaled_26_05_20.csv')
file_guides_per_gene = get_local_data_path(['processed', 'depmap20Q2'], 'guides_per_gene_26_05_20.csv')

# Inputs from DepMap
# 20Q2 positive controls: intersection of Hart (2015) and Blomen (2014)
file_ref_essentials = get_data_path(['depmap', '20Q2'], 'common_essentials.csv')
file_ref_nonessentials = get_data_path(['depmap', '20Q2'], 'nonessentials.csv')
file_depmap_scores = get_data_path(['depmap', '20Q2'], 'Achilles_gene_effect_unscaled.csv')

file_id_map = get_local_data_path(['processed'], 'HGNC_gene_id_map.csv')

# OUTPUT
file_gene_scores = get_local_data_path(['processed', 'depmap20Q2'], 'gene_scores_26_05_20.csv')
file_table_s2 = get_local_data_path(['supplemental_files'], 'Table_S2.csv')

### Filter & normalize CERES output

In [42]:
scores_raw = pd.read_csv(file_ceres_unscaled, index_col=0)
scores = scores_raw.T
scores = scores.dropna(axis=1, how='all') # Drop columns (genes) where all values are NaN
scores.index = scores.index.str.replace('.', '-', regex=False) # Literal replacement: 'ACH.000004' -> 'ACH-000004'
print('Num genes:', scores.shape[1])
print('Num cell lines:', scores.shape[0])
scores[:2]

Num genes: 17056
Num cell lines: 769


Unnamed: 0,SHOC2,NDUFA12,SDAD1,FAM98A,ZNF253,HIST1H2BF,SYNE2,BATF2,MYSM1,EIF2B1,...,OR2L3,LCE1A,GOLGA6B,NUTM2B,ARL1,IFNA5,SRSF10,STEAP1B,MTRNR2L4,UQCRH
ACH-001382,0.199746,-0.35773,-0.782451,0.213855,-0.229298,0.467091,0.474494,-0.649738,-0.053038,-0.2493,...,0.695681,0.244071,-0.787054,0.533119,0.734525,0.56657,-0.92381,0.263426,1.273218,-1.514653
ACH-000250,-0.537468,0.418319,-1.733739,0.739512,-0.301952,0.325332,0.546244,-0.32483,0.661585,-3.411545,...,1.24817,-0.356473,-0.473085,1.274457,-0.115068,0.603075,-2.195758,0.537216,1.700887,-3.864688


#### 1. Drop genes targeted by too few guides (< 3)

In [43]:
# Drop genes that were targeted by too few guides (fewer than 3)
# Some genes that I expect to be included might not be there if there was no copy number data for them.
guides_per_gene = pd.read_csv(file_guides_per_gene, index_col=0)
guides_per_gene = guides_per_gene.rename(columns={'ccds_symbol':'symbol'})
print('Genes in guide-gene map:', guides_per_gene.symbol.nunique())
display(guides_per_gene[:2])

print('Genes in CERES output that are not in my guide-per-gene map:')
display(scores.loc[:, ~scores.columns.isin(guides_per_gene.symbol)].columns)
# Check if any of the missing ones are protein-coding
id_map = pd.read_csv(file_id_map).dropna(subset=['entrez_id']).astype({'entrez_id':'int'})
#display(id_map[id_map.symbol.isin(scores.loc[:,~scores.columns.isin(guides_per_gene.symbol)].columns)])

# Check raw scores
print('Genes in my guide-per-gene map that are not in CERES output:')
print(guides_per_gene[~guides_per_gene.symbol.isin(scores_raw.index)].symbol.values)

# Filter guides per gene down to genes that are in CERES output (additional genes dropped are due to NaN values)
guides_per_gene = guides_per_gene[guides_per_gene.symbol.isin(scores.columns)]
print('Filtered guide-per-gene map:', guides_per_gene.shape[0])

Genes in guide-gene map: 17047


Unnamed: 0,symbol,entrez_id,guides_per_gene
0,A1BG,1,4
1,A1CF,29974,4


Genes in CERES output that are not in my guide-per-gene map:


Index(['ZNF286B', 'LOC102723996', 'CCDC144NL', 'TCP10', 'UGT2A2', 'TCP10L2',
       'PCDHA13', 'GDF5OS', 'FAM86C1', 'C9orf66'],
      dtype='object')

Genes in my guide-per-gene map that are not in CERES output:
[]
Filtered guide-per-gene map: 17046


In [44]:
guides_per_gene[guides_per_gene.symbol=='UBC']

Unnamed: 0,symbol,entrez_id,guides_per_gene
15711,UBC,7316,2


In [45]:
# Genes without any guides were already filtered out
print('Genes w/ 3 guides:', sum(guides_per_gene.guides_per_gene == 3))
print('Genes w/ 2 guides:', sum(guides_per_gene.guides_per_gene == 2))
print('Genes w/ 1 guide:', sum(guides_per_gene.guides_per_gene == 1))
scores_filtered = scores.loc[:, scores.columns.isin(guides_per_gene[guides_per_gene.guides_per_gene >= 3].symbol)]
print('Num genes after filtering for too few guides:', scores_filtered.shape[1],'/', guides_per_gene.shape[0])

Genes w/ 3 guides: 1509
Genes w/ 2 guides: 363
Genes w/ 1 guide: 243
Num genes after filtering for too few guides: 16440 / 17046
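
The guide-count filter above can be illustrated on toy data. This is a minimal sketch, not the notebook's actual inputs: the gene names, cell line names, and values in `toy_scores` and `toy_guides` are made up, and only the threshold of 3 guides comes from the text.

```python
import pandas as pd

# Hypothetical stand-ins for `scores` (cell lines x genes) and `guides_per_gene`.
toy_scores = pd.DataFrame(
    {"GENE_A": [0.1, -0.2], "GENE_B": [-1.0, -0.9], "GENE_C": [0.3, 0.0]},
    index=["LINE_1", "LINE_2"],
)
toy_guides = pd.DataFrame(
    {"symbol": ["GENE_A", "GENE_B", "GENE_C"], "guides_per_gene": [4, 2, 3]}
)

MIN_GUIDES = 3  # threshold used in this notebook

# Symbols of genes targeted by at least MIN_GUIDES guides
keep = toy_guides.loc[toy_guides.guides_per_gene >= MIN_GUIDES, "symbol"]

# Keep only the score columns for those genes
toy_filtered = toy_scores.loc[:, toy_scores.columns.isin(keep)]
print(list(toy_filtered.columns))  # ['GENE_A', 'GENE_C']
```

`GENE_B` has only 2 guides, so it is the one column removed.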


#### 2. Drop genes not in DepMap gene scores

The DepMap portal flags some genes as "No Unique Guides", meaning the gene shares all of its guides with other genes. Beginning with 20Q1, DepMap drops these genes from its post-CERES files because their CERES scores are inaccurate.

In [46]:
# Load DepMap scores
depmap_scores_raw = pd.read_csv(file_depmap_scores, index_col=0)

In [47]:
print(depmap_scores_raw.shape)
depmap_scores_raw[:1]

(769, 18119)


DepMap_ID,A1BG (1),A1CF (29974),A2M (2),A2ML1 (144568),A3GALT2 (127550),A4GALT (53947),A4GNT (51146),AAAS (8086),AACS (65985),AADAC (13),...,ZWILCH (55055),ZWINT (11130),ZXDA (7789),ZXDB (158586),ZXDC (79364),ZYG11A (440590),ZYG11B (79699),ZYX (7791),ZZEF1 (23140),ZZZ3 (26009)
ACH-000004,0.550556,0.31976,-0.191171,0.198493,0.247514,-0.117256,0.816571,-0.559049,0.657462,0.507134,...,0.013756,-0.581937,,,0.526685,0.592893,-0.573396,0.728964,0.443095,-0.114864


In [48]:
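# DepMap column labels look like 'A1BG (1)'; keep only the symbol (raw string avoids invalid-escape warnings)
get_gene_symbol = lambda x: re.search(r'([\w-]+)\s\(\w+\)', x).group(1)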
depmap_scores = depmap_scores_raw.rename(columns=get_gene_symbol)
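
DepMap labels each column as `SYMBOL (EntrezID)`, for example `A1BG (1)`. The sketch below shows one way to pull out both parts at once. The helper `split_label` and the pattern name `LABEL_RE` are hypothetical, and unlike the `re.search` call above this variant uses `fullmatch`, so it raises an error on labels that do not fit the expected shape instead of matching a substring.

```python
import re

# Symbol: word characters and hyphens; ID: word characters inside parentheses
LABEL_RE = re.compile(r"([\w-]+)\s\((\w+)\)")

def split_label(label):
    """Split a 'SYMBOL (ID)' column label into (symbol, id) strings."""
    match = LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"Unexpected column label: {label!r}")
    return match.group(1), match.group(2)

symbol, entrez_id = split_label("A1BG (1)")
print(symbol, entrez_id)  # A1BG 1
```

Failing loudly on an unexpected label can be preferable here, because a silent partial match would produce a wrong column name that only surfaces later as a missing gene.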

In [49]:
# Drop genes that weren't in original DepMap gene score file (likely removed for a QC reason)
scores_filtered_2 = scores_filtered.loc[:, scores_filtered.columns.isin(depmap_scores.columns)]
# Dropped genes 
display(scores_filtered.loc[:, ~scores_filtered.columns.isin(depmap_scores.columns)].columns)
print('Filtering 2 - dropped genes not in DepMap:', scores_filtered_2.shape[1], '/', scores_filtered.shape[1])

Index(['TRAPPC2B', 'ANKRD20A1'], dtype='object')

Filtering 2 - dropped genes not in DepMap: 16438 / 16440


#### 3. Normalize CERES gene scores to reference essential/non-essential genes from DepMap

In [50]:
# Get the reference essential and non-essential set of genes to scale scores (downloaded from DepMap)
# Labels look like 'AAMP (14)': symbol followed by the Entrez ID in parentheses
get_gene_name = lambda x: re.search(r'([\w-]+)\s\(\w+\)', x).group(1)
get_gene_id = lambda x: re.search(r'[\w-]+\s\((\w+)\)', x).group(1)

ref_essential = pd.read_csv(file_ref_essentials)
ref_essential = ref_essential.assign(symbol = ref_essential.gene.apply(get_gene_name),
                                     entrez_id = ref_essential.gene.apply(get_gene_id))
print('Num ref essentials in my data:', ref_essential[ref_essential.symbol.isin(scores_filtered.columns)].shape[0])

ref_non_essential = pd.read_csv(file_ref_nonessentials)
ref_non_essential = ref_non_essential.assign(symbol = ref_non_essential.gene.apply(get_gene_name),
                                             entrez_id = ref_non_essential.gene.apply(get_gene_id))
print('Num ref non-essentials in my data:', 
      ref_non_essential[ref_non_essential.symbol.isin(scores_filtered.columns)].shape[0])
ref_essential[:1]

Num ref essentials in my data: 1196
Num ref non-essentials in my data: 651


Unnamed: 0,gene,symbol,entrez_id
0,AAMP (14),AAMP,14


In [51]:
# Normalize CERES gene scores, per cell line, according to reference essential and non-essential genes
# scale_to_essentials is equivalent to scale_to_essentials in the CERES package
# Result: scaled gene effects where, in every cell line, the median of the essential genes is -1
# and the median of the non-essential genes is 0.

# scaled = (gene_score - median(non-essentials)) / (median(non-essentials) - median(essentials))

def scale_to_essentials(cell_line):
    return ((cell_line - cell_line[cell_line.index.isin(ref_non_essential.symbol)].median()) / 
             (cell_line[cell_line.index.isin(ref_non_essential.symbol)].median() -
              cell_line[cell_line.index.isin(ref_essential.symbol)].median()))

# Normalize per cell line
scores_normed = scores_filtered_2.apply(lambda line: scale_to_essentials(line), axis=1)

# Verify normalization (compare with a tolerance rather than exact float equality)
assert np.isclose(scores_normed.loc[:,scores_normed.columns.isin(ref_essential.symbol)].median(axis=1).median(), -1)
assert np.isclose(scores_normed.loc[:,scores_normed.columns.isin(ref_non_essential.symbol)].median(axis=1).median(), 0)

# Order index and columns
scores_normed = scores_normed.reindex(sorted(scores_normed.columns), axis=1).sort_index()
print('N=', scores_normed.shape[1])
scores_normed[:1]

N= 16438


Unnamed: 0,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZUP1,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
ACH-000004,0.153299,0.037479,-0.244043,-0.025512,-0.019168,-0.207653,0.309718,-0.443775,0.225995,0.144711,...,-0.239938,-0.198167,-0.131967,-0.460693,0.154474,0.170274,-0.477166,0.266623,0.106486,-0.21637
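
The scaling in this step can be checked on a small example. The sketch below is a vectorized variant of the per-row function, run on made-up data: the function name `scale_rows`, the gene names, and the values are all assumptions for illustration. It applies the same formula, (score - median of non-essentials) / (median of non-essentials - median of essentials), to every cell line at once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: cell lines as rows, genes as columns
toy = pd.DataFrame(
    {"ESS_1": [-2.0, -1.0], "ESS_2": [-4.0, -3.0],
     "NON_1": [0.5, 1.0], "NON_2": [1.5, 3.0], "OTHER": [-1.0, 0.0]},
    index=["LINE_1", "LINE_2"],
)
toy_essentials = ["ESS_1", "ESS_2"]
toy_nonessentials = ["NON_1", "NON_2"]

def scale_rows(df, essentials, nonessentials):
    """Scale each row so its essential median is -1 and its non-essential median is 0."""
    ess_median = df.loc[:, df.columns.isin(essentials)].median(axis=1)
    non_median = df.loc[:, df.columns.isin(nonessentials)].median(axis=1)
    # Subtract and divide row-wise (axis=0 aligns the medians with the row index)
    return df.sub(non_median, axis=0).div(non_median - ess_median, axis=0)

toy_scaled = scale_rows(toy, toy_essentials, toy_nonessentials)

# Every cell line should now have the target medians
assert np.allclose(toy_scaled[toy_essentials].median(axis=1), -1)
assert np.allclose(toy_scaled[toy_nonessentials].median(axis=1), 0)
```

Checking the medians per cell line, as in the last two lines, is a stricter test than checking the median of the per-line medians, since it would catch a single badly scaled cell line.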


#### 4. Create df with Entrez gene IDs

In [52]:
# Genes that were dropped
dropped_genes = depmap_scores.loc[:, ~depmap_scores.columns.isin(scores_normed.columns)].columns.values
print('N. genes dropped:', len(dropped_genes), '/', depmap_scores.shape[1])

N. genes dropped: 1681 / 18119


In [66]:
final_scores = pd.merge(guides_per_gene[['symbol', 'entrez_id']], 
                        scores_normed.T.reset_index().rename(columns={'index':'symbol'}))
final_scores = final_scores.drop(columns=['symbol']).astype({'entrez_id':'str'}).set_index('entrez_id').T
print('Final num genes:', final_scores.shape[1], ', cell lines:', final_scores.shape[0])
final_scores[:1]

Final num genes: 16438 , cell lines: 769


entrez_id,1,29974,2,144568,127550,53947,51146,8086,65985,13,...,221302,9183,55055,11130,79364,440590,79699,7791,23140,26009
ACH-000004,0.153299,0.037479,-0.244043,-0.025512,-0.019168,-0.207653,0.309718,-0.443775,0.225995,0.144711,...,-0.239938,-0.198167,-0.131967,-0.460693,0.154474,0.170274,-0.477166,0.266623,0.106486,-0.21637
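
The inner merge above drops any symbol that has no entry in the guide map without reporting it. As an illustration, the sketch below does the same symbol-to-Entrez relabeling with an explicit lookup, so that unmapped symbols are visible. The data are toy values: `NOT_MAPPED` and the scores are made up, while the two ID pairs (A1BG to 1, A1CF to 29974) are taken from the table shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for `scores_normed` (cell lines x gene symbols)
toy_normed = pd.DataFrame(
    {"A1BG": [0.15, -0.10], "A1CF": [0.04, 0.20], "NOT_MAPPED": [0.0, 0.1]},
    index=["LINE_1", "LINE_2"],
)
# Hypothetical stand-in for the symbol / entrez_id columns of `guides_per_gene`
toy_map = pd.DataFrame({"symbol": ["A1BG", "A1CF"], "entrez_id": [1, 29974]})

# Lookup from symbol to Entrez ID (as strings, matching the exported table)
symbol_to_id = toy_map.set_index("symbol")["entrez_id"].astype(str).to_dict()

# Symbols that would be lost in the conversion
unmapped = [col for col in toy_normed.columns if col not in symbol_to_id]
print("Unmapped symbols:", unmapped)  # Unmapped symbols: ['NOT_MAPPED']

# Drop unmapped genes, then relabel the remaining columns
toy_final = toy_normed.drop(columns=unmapped).rename(columns=symbol_to_id)
print(list(toy_final.columns))  # ['1', '29974']
```

In this notebook the gene count is 16438 both before and after the conversion, so no genes appear to be lost at this step. The explicit check matters mainly if the inputs change.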


In [55]:
# Export
final_scores.to_csv(file_gene_scores)

In [67]:
# Also export scores as Table S2
final_scores.to_csv(file_table_s2)

### Precision-recall analysis with the DepMap reference essentials

In [56]:
# Reduce scores down to overlap
depmap_overlap = depmap_scores.loc[depmap_scores.index.isin(scores_normed.index), 
                                   depmap_scores.columns.isin(scores_normed.columns)]
# Drop cell lines that have NA scores in DepMap
depmap_overlap = depmap_overlap.dropna(axis=0)
depmap_overlap = depmap_overlap.apply(lambda line: scale_to_essentials(line), axis=1)

my_overlap = scores_normed.loc[scores_normed.index.isin(depmap_overlap.index),
                               scores_normed.columns.isin(depmap_overlap.columns)]
print(my_overlap.shape, '==', depmap_overlap.shape)
depmap_overlap[:1]

(757, 16438) == (757, 16438)


DepMap_ID,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZUP1,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
ACH-000004,0.164044,0.036997,-0.244257,-0.029757,-0.002772,-0.203569,0.310479,-0.446764,0.222893,0.140141,...,-0.239401,-0.203131,-0.13145,-0.459363,0.150904,0.18735,-0.454662,0.262253,0.10489,-0.202252


In [57]:
# The benchmark
print('Essential reference:', ref_essential.shape[0])
print('Non-essential reference:',ref_non_essential.shape[0])
display(ref_essential[:1])
print('Included reference genes:', 
      depmap_overlap.loc[:,depmap_overlap.columns.isin(ref_essential.symbol) | 
                           depmap_overlap.columns.isin(ref_non_essential.symbol)].shape[1], '/',
      ref_essential.shape[0]+ref_non_essential.shape[0])

Essential reference: 1246
Non-essential reference: 758


Unnamed: 0,gene,symbol,entrez_id
0,AAMP (14),AAMP,14


Included reference genes: 1847 / 2004


In [58]:
def compute_AUC_for_cell_line(scores, true_values):
    precision, recall, _ = precision_recall_curve(true_values, -scores)
    return auc(recall, precision)

def compute_AP_for_cell_line(scores, true_values):
    return average_precision_score(true_values, -scores)

def compute_AUCs(scores, essentials, nonessentials):
    scores = scores.T.reset_index().rename(columns={'index':'symbol'})
    # Reduce scores down to essential and non-essential genes
    scores = scores[scores.symbol.isin(essentials.symbol) | scores.symbol.isin(nonessentials.symbol)]
    true_values = scores.symbol.apply(lambda x: 1 if x in essentials.symbol.values else 0).values
    scores = scores.set_index('symbol')
    aucs = scores.apply(lambda x: compute_AUC_for_cell_line(x, true_values))
    aps = scores.apply(lambda x: compute_AP_for_cell_line(x, true_values))
    return aucs, aps
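
The functions above negate the scores before calling scikit-learn because more negative gene scores indicate stronger essentiality, while `precision_recall_curve` and `average_precision_score` expect higher values for the positive class. A minimal sketch on made-up scores for one hypothetical cell line (the arrays `toy_gene_scores` and `toy_is_essential` are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Three essential genes (label 1) with strongly negative scores,
# three non-essential genes (label 0) with scores near zero
toy_gene_scores = np.array([-1.2, -0.9, -1.0, 0.1, -0.1, 0.3])
toy_is_essential = np.array([1, 1, 1, 0, 0, 0])

# Negate so that essential genes receive the highest ranking values
precision, recall, _ = precision_recall_curve(toy_is_essential, -toy_gene_scores)
pr_auc = auc(recall, precision)  # trapezoidal area under the PR curve
ap = average_precision_score(toy_is_essential, -toy_gene_scores)  # step-wise summary

print("PR AUC: %.3f, AP: %.3f" % (pr_auc, ap))
```

Here every essential gene scores below every non-essential gene, so both summaries reach their maximum of 1. The two metrics summarize the same curve differently (trapezoidal interpolation versus a step function), which is why the notebook reports both and why they can differ slightly on real data.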

In [64]:
depmap_auc, depmap_ap = compute_AUCs(depmap_overlap, ref_essential, ref_non_essential)
print('Mean AUC for DepMap gene scores: %.6f, AP: %.6f' % (depmap_auc.mean(), depmap_ap.mean()))

Mean AUC for DepMap gene scores: 0.988319, AP: 0.988323


In [63]:
my_auc, my_ap = compute_AUCs(my_overlap, ref_essential, ref_non_essential)
print('Mean AUC for my reprocessed gene scores: %.6f, AP: %.6f' % (my_auc.mean(), my_ap.mean()))

Mean AUC for my reprocessed gene scores: 0.988115, AP: 0.988120
