# SLKB Pipeline

Here, we walk through the SLKB pipeline using toy data. Feel free to adapt this file to analyze your own dataset. The file is divided into three main parts: (1) data creation, (2) score calculation, and (3) querying results.

## Before getting started

Make sure that an R environment with GEMINI installed and the MAGeCK tool are available on your path. To check whether their respective scores can be run, use the following commands:

```
import shutil
shutil.which('R') ## should yield accessed R environment location
shutil.which('mageck') ## should yield MAGeCK location
```
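The same check can be wrapped in a small helper that reports which executables are missing. This is a minimal sketch: `find_missing_tools` is a hypothetical name, not part of SLKB, and it only confirms that the executables are on the path. It does not confirm that the GEMINI package is installed inside R.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


# 'R' and 'mageck' are the executables named in this guide.
missing = find_missing_tools(["R", "mageck"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If anything is reported as missing, activate the environment that provides it before running the scoring steps.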

In addition, make sure to install the SLKB Python package. Details can be found on its [website](https://github.com/BirkanGokbag/SLKB-Analysis-Pipeline).



In [1]:
## First, we load in our packages
import SLKB
import pandas as pd
import os
import sqlalchemy
from sqlalchemy.engine import URL
# setting warning to None
pd.set_option('mode.chained_assignment', None)

# Section 0 - Checking for MAGeCK and GEMINI Installation
Make sure that the R environment used for GEMINI and the MAGeCK script are both accessible.

In [2]:
import shutil
print(shutil.which('R')) ## should yield accessed R environment location
print(shutil.which('mageck')) ## should yield MAGeCK location

/usr/local/bin/R
/Users/gokbag.1/Documents/CodingProjectsLocal/SLKB-Analysis-Pipeline/SLKB_env/bin/mageck


# Section 1 - Data Preparation and Database Creation

First, we load a pickle file that contains the demo data (Pickle version 4). Not all input files are required: for score calculation, the sequence and counts files are sufficient.

In [3]:
# taken from https://pubmed.ncbi.nlm.nih.gov/36060092/
demo_data = SLKB.load_demo_data()

sequence_ref = demo_data['sequence_ref']
counts_ref = demo_data['counts_ref']
score_ref = demo_data['score_ref']

In [4]:
# Create a local database to store the results in and connect to it.
# SLKB has schemas for MySQL and sqlite3 databases.
url_object = URL.create(
    "mysql+mysqlconnector",
    username="root",
    password="password",  # plain (unescaped) text
    host="localhost",
    port=3306,
    #database="SLKB_mysql_live",
) # In MySQL, the database 'SLKB_mysql_live' will be created and can be connected to later. After database creation, update the database parameter.
# Alternatively, use a sqlite3 database (this overrides the MySQL URL above)
url_object = 'sqlite:///SLKB_sqlite3'

SLKB_engine = sqlalchemy.create_engine(url_object)

# create the database at the url_object
SLKB.create_SLKB(engine = SLKB_engine, db_type = 'sqlite3') # or mysql

In [5]:
# if mysql was chosen, the following command should be run before view is accessed (based on mysql version)
# with SLKB_engine.begin() as transaction:
#     transaction.execute(sqlalchemy.text('SET sql_mode=(SELECT REPLACE(@@sql_mode,\'ONLY_FULL_GROUP_BY\',\'\'));'))

In [6]:
SLKB_engine

Engine(sqlite:///SLKB_sqlite3)
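To confirm that the schema exists after creation, one option for the sqlite3 backend is to list the tables and views directly. The sketch below uses only Python's built-in `sqlite3` module. `list_tables_and_views` is a hypothetical helper, and the in-memory table is a placeholder rather than part of the SLKB schema.

```python
import sqlite3


def list_tables_and_views(conn):
    """Return the sorted names of all tables and views in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Demonstration on an in-memory database with a placeholder table.
# For the database created above, connect to the 'SLKB_sqlite3' file instead.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE example_table (id INTEGER PRIMARY KEY)")
print(list_tables_and_views(demo_conn))
demo_conn.close()
```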

In [7]:
print(counts_ref.columns)

Index(['guide_1', 'guide_2', 'gene_1', 'gene_2', 'count_replicates',
       'cell_line_origin', 'study_conditions', 'study_origin'],
      dtype='object')


In [8]:
print(sequence_ref.columns)

Index(['sgRNA_guide_name', 'sgRNA_guide_seq', 'sgRNA_target_name'], dtype='object')


In [9]:
print(score_ref.columns)

Index(['gene_1', 'gene_2', 'study_origin', 'cell_line_origin', 'SL_score',
       'SL_score_cutoff', 'statistical_score', 'statistical_score_cutoff'],
      dtype='object')


## Inserting to DB

Once the data is prepared, the study is ready to be inserted into the database. The ```prepare_study_for_export``` function checks the data and prepares it for insertion. It raises errors where necessary, so make sure that your files match the template.
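One way to catch template mismatches early is to compare column names before calling `prepare_study_for_export`. The sketch below is not part of SLKB. The column lists are copied from the demo data printed above, and treating every one of them as required is an assumption.

```python
# Column names taken from the demo data shown earlier in this guide.
# Assumption: all of them are expected by the template.
EXPECTED_COLUMNS = {
    "sequence_ref": ["sgRNA_guide_name", "sgRNA_guide_seq", "sgRNA_target_name"],
    "counts_ref": ["guide_1", "guide_2", "gene_1", "gene_2", "count_replicates",
                   "cell_line_origin", "study_conditions", "study_origin"],
    "score_ref": ["gene_1", "gene_2", "study_origin", "cell_line_origin",
                  "SL_score", "SL_score_cutoff", "statistical_score",
                  "statistical_score_cutoff"],
}


def missing_columns(table_name, columns):
    """Return the expected columns of `table_name` that are absent from `columns`."""
    present = set(columns)
    return [col for col in EXPECTED_COLUMNS[table_name] if col not in present]


# Example with a deliberately incomplete sequence table.
print(missing_columns("sequence_ref", ["sgRNA_guide_name", "sgRNA_guide_seq"]))
```

With the demo data loaded, `missing_columns("counts_ref", counts_ref.columns)` would be the corresponding call.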

Make sure your control gene list is set up properly so that the counts file is categorized correctly. The counts file will gain a ```target_type``` column that contains three categories:
1. Dual - The two sgRNAs target different genes.
2. Single - The two sgRNAs target a single gene (i.e., gene_1 + gene_1, or gene_1 + control).
3. Control - Both sgRNAs target controls.
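The three categories can be illustrated with a small classification function. This is a sketch of the rule as described in the list, not SLKB's own implementation. `classify_target_type` is a hypothetical name, and the label strings simply mirror the list above.

```python
def classify_target_type(gene_1, gene_2, controls):
    """Classify a guide pair as 'Dual', 'Single', or 'Control'.

    `controls` is the collection of control target names, such as the
    study_controls list passed to prepare_study_for_export.
    """
    is_control_1 = gene_1 in controls
    is_control_2 = gene_2 in controls
    if is_control_1 and is_control_2:
        return "Control"
    if is_control_1 or is_control_2 or gene_1 == gene_2:
        return "Single"
    return "Dual"


controls = {"0SAFE", "0SAFE-SAFE-GE"}
print(classify_target_type("AKT3", "AR", controls))              # two different genes
print(classify_target_type("AKT3", "0SAFE", controls))           # gene + control
print(classify_target_type("0SAFE", "0SAFE-SAFE-GE", controls))  # two controls
```

A gene that is missing from the control list would be treated as a regular gene here, which is why an incomplete control list shifts pairs from the control and single categories into the dual category.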

In [10]:
study_controls = ['0SAFE',
                 '0SAFE-SAFE-GE',
                 '0SAFE-SAFE-SP',
                 '0SAFE-SAFE-MP',
                 '0SAFE-SAFE-U2',
                 '0SAFE-SAFE-DTKP',
                 '0SAFE-SAFE-ACOC',
                 '0SAFE-SAFE-TMM',
                 '0SAFE-SAFE-U1',
                 '0SAFE-SAFE-U3']
study_conditions = [["T0_1", 
                     "T0_2"],
                    ["T12_1",
                     "T12_2"]]

db_inserts = SLKB.prepare_study_for_export(sequence_ref = sequence_ref.copy(), 
                                      counts_ref = counts_ref.copy(),
                                      score_ref = score_ref.copy(),
                                      study_controls = study_controls,
                                      study_conditions = study_conditions)


Starting processing...
Score reference...
Controls within SL score that are removed: 
0
---
Only GI cutoff is present...
Counts reference...
Number of double pairs: 37767
Number of controls: 614
Number of singles: 10550
Sequence reference...
Done! Returning...


In [11]:
print(db_inserts['score_ref'].columns)

Index(['gene_1', 'gene_2', 'study_origin', 'cell_line_origin', 'SL_score',
       'SL_score_cutoff', 'statistical_score', 'statistical_score_cutoff',
       'gene_pair', 'SL_or_not'],
      dtype='object')


In [12]:
print(db_inserts['sequence_ref'].columns)

Index(['sgRNA_guide_name', 'sgRNA_guide_seq', 'sgRNA_target_name',
       'study_origin'],
      dtype='object')


In [13]:
print(db_inserts['counts_ref'].columns)

Index(['guide_1', 'guide_2', 'gene_1', 'gene_2', 'count_replicates',
       'cell_line_origin', 'study_conditions', 'study_origin', 'target_type',
       'T0_counts', 'T0_replicate_names', 'TEnd_counts',
       'TEnd_replicate_names', 'gene_pair', 'gene_pair_orientation'],
      dtype='object')


In [14]:
# Finally, insert the data to the database
SLKB.insert_study_to_db(SLKB_engine, db_inserts)

Updating gene pairs with seperator |...
Final QC...
Beginning transaction...
Done sequence
Done counts
Done score
Successfully inserted!
Added Record stats...
Sequence insert: 247
Counts insert: 48931
Score insert: 1225
Done!
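The insert log mentions that gene pairs are updated with the `|` separator. A plausible reading is that each pair gets an order-independent key. The sketch below assumes alphabetical ordering, which is consistent with keys such as `AKT3|AR` in the query results later in this guide but is not confirmed by the source. `make_gene_pair` is a hypothetical helper.

```python
def make_gene_pair(gene_1, gene_2, separator="|"):
    """Build an order-independent key for a gene pair.

    Assumption: genes are sorted alphabetically before joining.
    """
    return separator.join(sorted([gene_1, gene_2]))


# Both orientations map to the same key.
print(make_gene_pair("AR", "AKT3"))
print(make_gene_pair("AKT3", "AR"))
```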


# Section 2 - Score Calculation

Here, we calculate the scores and add them to the database. First, we start by querying the data we just deposited.

In [15]:
# read the data

# experiment design
experiment_design = pd.read_sql_query(con=SLKB_engine.connect(), 
                              sql=sqlalchemy.text('SELECT * from CDKO_EXPERIMENT_DESIGN'.lower()), index_col = 'sgRNA_id')
experiment_design.reset_index(drop = True, inplace = True)
experiment_design.index.rename('sgRNA_id', inplace = True)

# counts
counts = pd.read_sql_query(con=SLKB_engine.connect(), 
                              sql=sqlalchemy.text('SELECT * from joined_counts'.lower()), index_col = 'sgRNA_pair_id')

# scores
scores = pd.read_sql_query(con=SLKB_engine.connect(), 
                              sql=sqlalchemy.text('SELECT * from CDKO_ORIGINAL_SL_RESULTS'.lower()), index_col = 'id')
scores.reset_index(drop = True, inplace = True)
scores.index.rename('gene_pair_id', inplace = True)

curr_study = '36060092'
curr_cl = '22RV1'
curr_counts = counts[(counts['study_origin'] == curr_study) & (counts['cell_line_origin'] == curr_cl)]
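For larger databases, the study and cell line filter can also be pushed into the SQL query so that only the relevant rows are loaded. The sketch below uses the built-in `sqlite3` module on an in-memory stand-in. The table and column names `joined_counts`, `study_origin`, and `cell_line_origin` come from this guide, while the three-column toy table and the `fetch_counts` helper are assumptions for illustration.

```python
import sqlite3


def fetch_counts(conn, study, cell_line):
    """Fetch rows of joined_counts for one study and cell line.

    Uses bound parameters rather than string formatting.
    """
    query = (
        "SELECT * FROM joined_counts "
        "WHERE study_origin = ? AND cell_line_origin = ?"
    )
    return conn.execute(query, (study, cell_line)).fetchall()


# Toy stand-in for the real table; the real one has many more columns.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE joined_counts "
    "(sgRNA_pair_id INTEGER, study_origin TEXT, cell_line_origin TEXT)"
)
demo_conn.executemany(
    "INSERT INTO joined_counts VALUES (?, ?, ?)",
    [(1, "36060092", "22RV1"), (2, "36060092", "OTHER"), (3, "00000000", "22RV1")],
)
rows = fetch_counts(demo_conn, "36060092", "22RV1")
print(rows)
```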

## Median B/NB Score

In [16]:
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'median_nb_score', SLKB_engine):
    median_res = SLKB.run_median_scores(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'MEDIAN_Files')
    SLKB.add_table_to_db(curr_counts.copy(), median_res['MEDIAN_NB_SCORE'], 'median_nb_score', SLKB_engine)
    if median_res['MEDIAN_B_SCORE'] is not None:
        SLKB.add_table_to_db(curr_counts.copy(), median_res['MEDIAN_B_SCORE'], 'median_b_score', SLKB_engine)

Checking if score already computed: median_nb_score
Running median scores...
Getting raw counts...
Filtering enabled... Condition: 35 counts
Filtered a total of 8134 out of 48931 sgRNAs.

---

Not full normalization...
Normalization enabled...
Current counts:
T0_1    4174896.0
T0_2    4200469.0
dtype: float64
Normalize based on a specific value... 4771163.0 counts
Normalization enabled...
Current counts:
T12_1    5341857.0
T12_2    6456429.0
dtype: float64
Normalize based on a specific value... 4771163.0 counts
Normalization enabled...
Current counts:
T0_1    34419.0
T0_2    34661.0
dtype: float64
Normalize based on a specific value... 44888.0 counts
Normalization enabled...
Current counts:
T12_1    55115.0
T12_2    65234.0
dtype: float64
Normalize based on a specific value... 44888.0 counts
Normalization enabled...
Current counts:
T0_1    877394.0
T0_2    878730.0
dtype: float64
Normalize based on a specific value... 1064517.0 counts
Normalization enabled...
Current counts:
T12_1    1
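The normalization targets in the log above are consistent with the median of the replicate totals: the median of 4174896, 4200469, 5341857, and 6456429 is 4771163. The sketch below assumes that this is the rule and that each replicate is scaled so its total matches the target. This is an inference from the log, not a confirmed description of SLKB's implementation, and `normalization_factors` is a hypothetical name.

```python
from statistics import median


def normalization_factors(totals):
    """Return (target, factors), where target is the median replicate total
    and factors maps each replicate to target / its own total."""
    target = median(totals.values())
    factors = {name: target / total for name, total in totals.items()}
    return target, factors


# Replicate totals copied from the first normalization pass in the log.
totals = {
    "T0_1": 4174896.0,
    "T0_2": 4200469.0,
    "T12_1": 5341857.0,
    "T12_2": 6456429.0,
}
target, factors = normalization_factors(totals)
print(target)
for name, factor in factors.items():
    print(name, round(factor, 4))
```

Multiplying every count in a replicate by its factor would bring that replicate's total to the shared target.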

## sgRNA Derived B/NB Score

In [17]:
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'sgrna_derived_nb_score', SLKB_engine):
    sgRNA_res = SLKB.run_sgrna_scores(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'sgRNA-DERIVED_Files')
    SLKB.add_table_to_db(curr_counts.copy(), sgRNA_res['SGRNA_DERIVED_NB_SCORE'], 'sgrna_derived_nb_score', SLKB_engine)
    if sgRNA_res['SGRNA_DERIVED_B_SCORE'] is not None:
        SLKB.add_table_to_db(curr_counts.copy(), sgRNA_res['SGRNA_DERIVED_B_SCORE'], 'sgrna_derived_b_score', SLKB_engine)

Checking if score already computed: sgrna_derived_nb_score
Running sgrna derived score...
Getting raw counts...
Filtering enabled... Condition: 35 counts
Filtered a total of 8134 out of 48931 sgRNAs.

---

Not full normalization...
Normalization enabled...
Current counts:
T0_1    4174896.0
T0_2    4200469.0
dtype: float64
Normalize based on a specific value... 4771163.0 counts
Normalization enabled...
Current counts:
T12_1    5341857.0
T12_2    6456429.0
dtype: float64
Normalize based on a specific value... 4771163.0 counts
Normalization enabled...
Current counts:
T0_1    34419.0
T0_2    34661.0
dtype: float64
Normalize based on a specific value... 44888.0 counts
Normalization enabled...
Current counts:
T12_1    55115.0
T12_2    65234.0
dtype: float64
Normalize based on a specific value... 44888.0 counts
Normalization enabled...
Current counts:
T0_1    877394.0
T0_2    878730.0
dtype: float64
Normalize based on a specific value... 1064517.0 counts
Normalization enabled...
Current count

## Horlbeck Score

In [18]:
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'horlbeck_score', SLKB_engine):
    horlbeck_res = SLKB.run_horlbeck_score(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'HORLBECK_Files', do_preprocessing = True, re_run = False)
    SLKB.add_table_to_db(curr_counts.copy(), horlbeck_res['HORLBECK_SCORE'], 'horlbeck_score', SLKB_engine)

Checking if score already computed: horlbeck_score
Running horlbeck score...
Running preprocessing...
Getting raw counts...
Sorting gene pairs and guides based on ordering gene ordering...
For replicate 1
Total of 12 sgRNAs were filtered out of 222
For replicate 2
Total of 7 sgRNAs were filtered out of 222




Calculating GI_Score_1...
Calculating GI_Score_2...




---------ADDING-TO-DB---------
Processing table for: horlbeck_score
Beginning transaction...
Successfully inserted!
Added Record stats...
Score insert: 1225


## GEMINI Score

In [19]:
cmd_params = []#['module load R/4.1.0']
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'gemini_score', SLKB_engine):
    gemini_res = SLKB.run_gemini_score(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'GEMINI_Files', command_line_params = cmd_params, re_run = False)
    SLKB.add_table_to_db(curr_counts.copy(), gemini_res['GEMINI_SCORE'], 'gemini_score', SLKB_engine)

Checking if score already computed: gemini_score
Running gemini score...
Getting raw counts...
Running GEMINI...
Finished running GEMINI!
---------ADDING-TO-DB---------
Processing table for: gemini_score
Beginning transaction...
Successfully inserted!
Added Record stats...
Score insert: 1225


## MAGeCK Score

In [20]:
cmd_params = []#'conda activate myEnv'
if not SLKB.check_if_added_to_table(curr_counts.copy(), 'mageck_score', SLKB_engine):
    mageck_res = SLKB.run_mageck_score(curr_counts.copy(), curr_study = curr_study, curr_cl = curr_cl, store_loc = os.getcwd(), save_dir = 'MAGECK_Files', command_line_params = cmd_params,re_run = False)
    SLKB.add_table_to_db(curr_counts.copy(), mageck_res['MAGECK_SCORE'], 'mageck_score', SLKB_engine)

Checking if score already computed: mageck_score
Running mageck score...
Getting raw counts...
Paired Status = True
Running mageck...
Finished running mageck!
Loading computed results...
Filtered gene count: 0
---------ADDING-TO-DB---------
Processing table for: mageck_score
Beginning transaction...
Successfully inserted!
Added Record stats...
Score insert: 1225
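Each scoring cell above follows the same pattern: check whether the score is already stored, run the method if it is not, then store the result. That pattern can be factored into one helper. The sketch below is generic: `compute_and_store_once` and the toy stand-ins are hypothetical, and in practice the three callables would wrap `SLKB.check_if_added_to_table`, one of the `SLKB.run_*` functions, and `SLKB.add_table_to_db`.

```python
def compute_and_store_once(is_stored, compute, store):
    """Run `compute` and pass its result to `store` unless `is_stored()` is true.

    Returns True when a new result was computed and stored, False otherwise.
    """
    if is_stored():
        return False
    store(compute())
    return True


# Toy stand-ins for the database check, the scoring run, and the insert.
fake_db = {}


def run_toy_step():
    return compute_and_store_once(
        is_stored=lambda: "toy_score" in fake_db,
        compute=lambda: [0.1, -0.2],
        store=lambda result: fake_db.update(toy_score=result),
    )


ran_first = run_toy_step()   # computes and stores
ran_second = run_toy_step()  # skipped, already stored
print(ran_first, ran_second)
```

Keeping the check first means that re-running the notebook does not repeat a long scoring step for a study and cell line that are already in the database.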


# Section 3 - Query Results

Finally, we can query the data to produce the calculation table.

In [21]:
all_scores = pd.read_sql_query(con=SLKB_engine.connect(), 
                              sql=sqlalchemy.text('SELECT * from calculated_sl_table'))

In [22]:
all_scores.head(15)

| | gene_1 | gene_2 | study_origin | cell_line_origin | gene_pair_id | median_nb_score_SL_score | median_nb_score_standard_error | median_nb_score_Z_SL_score | median_b_score_SL_score | median_b_score_standard_error | ... | sgrna_derived_b_score_SL_score | sgrna_derived_nb_score_SL_score | horlbeck_score_SL_score | horlbeck_score_standard_error | mageck_score_SL_score | mageck_score_standard_error | mageck_score_Z_SL_score | gemini_score_SL_score_Strong | gemini_score_SL_score_SensitiveLethality | gemini_score_SL_score_SensitiveRecovery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AKT3 | AR | 36060092 | 22RV1 | 496 | -0.022646 | 0.056881 | -0.398139 | -0.062565 | 0.056881 | ... | -2.490022 | -1.760639 | -0.589468 | 0.258965 | -0.06816 | 0.066297 | -1.028105 | -0.091424 | -0.091424 | |
| 1 | AKT3 | AURKA | 36060092 | 22RV1 | 497 | 0.06047 | 0.102366 | 0.59072 | 0.020551 | 0.102366 | ... | 0.346373 | 0.496084 | -0.014408 | 0.156112 | 0.044937 | 0.086791 | 0.517758 | -0.046752 | -0.046752 | |
| 2 | AKT3 | BMP6 | 36060092 | 22RV1 | 498 | -0.055041 | 0.044215 | -1.244839 | -0.09496 | 0.044215 | ... | -1.648458 | -0.954725 | -0.304914 | 0.168399 | -0.079574 | 0.064855 | -1.226956 | -0.035922 | -0.035922 | |
| 3 | AKT3 | CCNE2 | 36060092 | 22RV1 | 499 | -0.012519 | 0.043639 | -0.286871 | -0.052437 | 0.043639 | ... | 0.015597 | 0.616933 | -0.262514 | 0.123153 | 0.004231 | 0.051476 | 0.082203 | -0.045427 | -0.045427 | |
| 4 | AKT3 | CDC6 | 36060092 | 22RV1 | 500 | 0.044958 | 0.10914 | 0.411928 | 0.005039 | 0.10914 | ... | 0.07923 | 0.253455 | 0.015164 | 0.186259 | -0.046699 | 0.09884 | -0.472464 | 0.002442 | 0.002442 | |
| 5 | AKT3 | CDK2 | 36060092 | 22RV1 | 501 | 0.046438 | 0.06529 | 0.711268 | 0.00652 | 0.06529 | ... | 1.357099 | 2.168888 | -0.015867 | 0.121892 | 0.127801 | 0.062357 | 2.049505 | -0.082281 | -0.082281 | |
| 6 | AKT3 | CTNNB1 | 36060092 | 22RV1 | 502 | -0.003363 | 0.039923 | -0.084237 | -0.043282 | 0.039923 | ... | -2.260286 | -1.494126 | -0.368898 | 0.151781 | -0.093034 | 0.047451 | -1.960613 | -0.126063 | -0.126063 | |
| 7 | AKT3 | DHFR | 36060092 | 22RV1 | 503 | 0.108463 | 0.099658 | 1.088355 | 0.068544 | 0.099658 | ... | -1.010516 | 0.270471 | -0.207039 | 0.126963 | 0.115901 | 0.087806 | 1.319972 | -0.072729 | -0.072729 | |
| 8 | AKT3 | ETF1 | 36060092 | 22RV1 | 504 | -0.163307 | 0.213862 | -0.763611 | -0.203226 | 0.213862 | ... | -0.36606 | -0.069083 | 0.16715 | 0.218769 | 0.084006 | 0.162383 | 0.517337 | -0.132287 | | 0.046196 |
| 9 | AKT3 | EZH2 | 36060092 | 22RV1 | 505 | 0.06099 | 0.059958 | 1.017218 | 0.021071 | 0.059958 | ... | -0.024138 | 0.495638 | 0.016892 | 0.199305 | 0.066378 | 0.05906 | 1.12391 | -0.052789 | -0.052789 | |


# Section 3.5 - Query Results for specific tables
If you have created additional tables and calculated your own scores, you can access those results directly.

In [23]:
temp = SLKB.query_result_table(curr_counts.copy(), 'median_b_score', curr_study, curr_cl, SLKB_engine)

Accessing table: median_b_score
Available gene pairs: 1225


In [24]:
temp.head(15)

| | gene_pair | median_b_score_SL_score | median_b_score_standard_error | median_b_score_Z_SL_score | study_origin | cell_line_origin |
|---|---|---|---|---|---|---|
| 0 | AKT3\|AR | -0.062565 | 0.056881 | -1.099934 | 36060092 | 22RV1 |
| 1 | AKT3\|AURKA | 0.020551 | 0.102366 | 0.200761 | 36060092 | 22RV1 |
| 2 | AKT3\|BMP6 | -0.09496 | 0.044215 | -2.14766 | 36060092 | 22RV1 |
| 3 | AKT3\|CCNE2 | -0.052437 | 0.043639 | -1.201618 | 36060092 | 22RV1 |
| 4 | AKT3\|CDC6 | 0.005039 | 0.10914 | 0.046173 | 36060092 | 22RV1 |
| 5 | AKT3\|CDK2 | 0.00652 | 0.06529 | 0.099859 | 36060092 | 22RV1 |
| 6 | AKT3\|CTNNB1 | -0.043282 | 0.039923 | -1.084114 | 36060092 | 22RV1 |
| 7 | AKT3\|DHFR | 0.068544 | 0.099658 | 0.687798 | 36060092 | 22RV1 |
| 8 | AKT3\|ETF1 | -0.203226 | 0.213862 | -0.950267 | 36060092 | 22RV1 |
| 9 | AKT3\|EZH2 | 0.021071 | 0.059958 | 0.351438 | 36060092 | 22RV1 |
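In the rows displayed above, the Z score column matches the SL score divided by its standard error, up to rounding. The sketch below checks that relationship on three of the displayed rows. It is an observation about this output rather than a documented definition, and `z_score` is a hypothetical helper.

```python
def z_score(sl_score, standard_error):
    """Return the SL score expressed in units of its standard error."""
    return sl_score / standard_error


# (SL score, standard error, reported Z score) copied from the rows above.
displayed_rows = [
    (-0.062565, 0.056881, -1.099934),
    (0.020551, 0.102366, 0.200761),
    (-0.09496, 0.044215, -2.14766),
]

for sl, se, reported in displayed_rows:
    print(round(z_score(sl, se), 3), reported)
```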


# Section 4 - SLKB Dump File
SLKB's database dump can also be loaded into your local database. After downloading the dump file from the following [link](https://figshare.com/s/06c2fc68cb33ec22f591), its contents can be inserted into the database either through Python, as shown here, or through the command-line interface (CLI).

In [None]:
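sql_dump_loc = ''  # set this to the path of the downloaded dump file
# read the dump file
with open(sql_dump_loc) as f:
    command = f.read()

# execute each statement, skipping empty fragments left by the split
with SLKB_engine.begin() as transaction:
    for com in command.split(';\n'):
        if com.strip():
            transaction.execute(sqlalchemy.text(com))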
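For a sqlite3 database, an alternative is to let SQLite parse the statement boundaries itself through `executescript` from Python's built-in `sqlite3` module. The sketch below assumes the dump is written in SQLite-compatible SQL, which this guide does not state. `load_sqlite_dump` and the toy dump text are hypothetical.

```python
import sqlite3


def load_sqlite_dump(conn, dump_text):
    """Execute a multi-statement SQL script on a sqlite3 connection."""
    conn.executescript(dump_text)
    conn.commit()


# Toy dump used for illustration only; it is not the SLKB dump file.
demo_dump = """
CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO demo (name) VALUES ('first');
INSERT INTO demo (name) VALUES ('second');
"""

dump_conn = sqlite3.connect(":memory:")
load_sqlite_dump(dump_conn, demo_dump)
row_count = dump_conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(row_count)
```

For the real file, the dump text would be read from the downloaded path and the connection would point at the `SLKB_sqlite3` database file.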

# Closing the connection
Following SLKB access, the connection should be closed.

In [25]:
SLKB_engine.dispose()

# Section 5 - Loading up SLKB Web App for Analysis
Following database creation, SLKB's web app can be used locally for data browsing and analysis. Run the following command to copy the SLKB web app contents to your designated folder, then edit the server.R file within the web app to get started.

In [26]:
SLKB.extract_SLKB_webapp(location = os.getcwd())

Extracting to location: /Users/gokbag.1/Documents/CodingProjectsLocal/SLKB-Analysis-Pipeline/SLKB/files
Done!
