# Excluding sgRNAs

Once all scores had been calculated and imported for the sgRNAs, these scores and other factors were used to exclude sgRNAs that were unlikely to be active or likely to cause off-target cleavage (7,421 of 26,344 sgRNAs excluded).

Exclude Reason | Character | # of sgRNAs
-------------- | --------- | -----------
Targets repetitive sequence | R | 447
Contains poly(T) | T | 1183
Cleaves only in extended region | C | 4283
Other exact matches | D | 864
Zhang score < 0.2 | L | 475

## sgRNAs Targeting Repetitive Sequences

Those sgRNAs with more than 10 identical matches in the human genome besides the targeted miRNA(s) were excluded. Off-target effects are likely for these sgRNAs, and cleaving more than 10 target sites is also expected to create genomic instability, which may kill the cell.

In [None]:
import data_processing as dp

def exclude_repetitive(db_name, sql_version="MySQL", firewall=False):
    """
        Adds 'R' to exclude column where NumExactMatch > 10 + number of sites in miRNA(s)
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    # Reset the Exclude column for every sgRNA before any filter is applied
    db_con.update_row({"Exclude": None}, {}, "SingleGuideRNA")
    
    rows = db_con.fetch_query("""SELECT t.SgID
FROM SingleGuideRNA AS s
JOIN SgRNATargetInformation AS t
ON s.SgID = t.SgID
GROUP BY t.SgID, s.NumExactMatch
HAVING s.NumExactMatch > COUNT(t.SgID)+10;""")
    
    if sql_version == "MSSQL":
        sgIDs = [row.SgID for row in rows]
    else:
        sgIDs = [sg for sg, in rows]
        
    db_con.update_many_rows({"Exclude": ["R"]*len(sgIDs)}, {"SgID": sgIDs}, "SingleGuideRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [None]:
exclude_repetitive("miR-test", firewall=True)

447 sgRNAs are excluded because they target repetitive sequences when hg19 genomic alignment is used.
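The `HAVING` clause compares `NumExactMatch` with the number of target-site rows joined for each sgRNA, so exact matches inside the targeted miRNA(s) do not count toward the limit of 10. The following is a minimal sketch of that query against an in-memory SQLite database rather than the MySQL/MSSQL backends used here. The toy rows and the `'stem'` and `'loop'` site labels are assumptions for illustration only.

```python
import sqlite3

# Toy stand-ins for the two tables; only the columns the query needs.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SingleGuideRNA (SgID INTEGER PRIMARY KEY, NumExactMatch INTEGER);
CREATE TABLE SgRNATargetInformation (SgID INTEGER, CleavageSite TEXT);
INSERT INTO SingleGuideRNA VALUES (1, 1), (2, 12), (3, 12), (4, 11);
INSERT INTO SgRNATargetInformation VALUES
    (1, 'stem'), (2, 'stem'), (3, 'stem'), (3, 'loop'), (4, 'stem');
""")

# Same GROUP BY / HAVING logic as exclude_repetitive, with ORDER BY added
# so the toy output is deterministic.
repetitive_ids = [sg_id for (sg_id,) in con.execute("""
SELECT t.SgID
FROM SingleGuideRNA AS s
JOIN SgRNATargetInformation AS t ON s.SgID = t.SgID
GROUP BY t.SgID, s.NumExactMatch
HAVING s.NumExactMatch > COUNT(t.SgID) + 10
ORDER BY t.SgID;
""")]
con.close()

print(repetitive_ids)
```

In this toy data, sgRNAs 2 and 3 both have 12 exact matches, but only sgRNA 2 is flagged: sgRNA 3 has two intended target sites, which leaves exactly 10 off-target matches.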

## Poly(T)

Those sgRNAs with four or more consecutive T's, which can lead to <a href="https://doi.org/10.1016/0092-8674(81)90522-5">RNA Pol III termination</a>, were excluded.

In [None]:
import data_processing as dp

def exclude_polyT(db_name, sql_version="MySQL", firewall=False):
    """
        Adds 'T' to exclude column if sgRNA has 4 or more T's in a row
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    rows = db_con.fetch_query("SELECT SgID FROM SingleGuideRNA WHERE SgRNA LIKE '%TTTT%' AND Exclude IS NULL;")
    if sql_version == "MSSQL":
        sgIDs = [row.SgID for row in rows]
    else:
        sgIDs = [sg for sg, in rows]
    db_con.update_many_rows({"Exclude": ["T"]*len(sgIDs)}, {"SgID": sgIDs}, "SingleGuideRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [None]:
exclude_polyT("miR-test", firewall=True)

1183 sgRNAs are excluded because of poly(T).
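The `LIKE '%TTTT%'` pattern matches any sgRNA containing a run of at least four T's. A minimal pure-Python equivalent could look like the sketch below; `has_poly_t` and its `run_length` parameter are hypothetical names, not part of `data_processing`.

```python
def has_poly_t(sequence, run_length=4):
    """Return True if `sequence` contains at least `run_length` consecutive T's."""
    # A run of five or more T's also contains a run of four, so a simple
    # substring test is enough.
    return "T" * run_length in sequence.upper()


# Example sequences are made up for illustration.
for seq in ("GACTTTTGCAGGATCCAGTA", "GACTTTGCATTTGGATCCAG"):
    print(seq, has_poly_t(seq))
```

The second sequence contains two separate runs of three T's and is therefore not flagged.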

## Extended Cleavage

Those sgRNAs which cleave outside the primary miRNA stemloop were excluded, as indels in these regions are unlikely to knock out the targeted miRNA (our data).

In [None]:
import data_processing as dp

def exclude_extended_cleavage(db_name, sql_version="MySQL", firewall=False):
    """
        Adds 'C' to exclude if all cleavage sites are 'ext'
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    
    rows_ext = db_con.fetch_query("""SELECT DISTINCT t.SgID
FROM SgRNATargetInformation AS t 
JOIN SingleGuideRNA AS s 
ON t.SgID = s.SgID
WHERE t.CleavageSite LIKE 'ext' 
AND s.Exclude IS NULL 
AND t.SgID NOT IN
(SELECT DISTINCT SgID 
FROM SgRNATargetInformation 
WHERE CleavageSite NOT LIKE 'ext');""")
    
    if sql_version == "MSSQL":
        ext = [row.SgID for row in rows_ext]
    else:
        ext = [sgID for sgID, in rows_ext]
    
    up_dict = {"Exclude": ["C"]*len(ext)}
    select_dict = {"SgID": ext}
    
    db_con.update_many_rows(up_dict, select_dict, "SingleGuideRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [None]:
exclude_extended_cleavage("miR-test", firewall=True)

There are 4283 sgRNAs which should be excluded due to targeting regions outside the stemloop. 
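The `NOT IN` subquery ensures an sgRNA is flagged only when every one of its cleavage sites is `'ext'`; an sgRNA that cuts inside the stemloop of at least one targeted miRNA is kept. The same rule can be sketched in plain Python as below. The function name, the `(SgID, CleavageSite)` tuples, and the `'stem'` label are assumptions for illustration, and the sketch ignores the `Exclude IS NULL` condition.

```python
from collections import defaultdict


def all_extended(target_rows):
    """Return sorted SgIDs whose cleavage sites are all 'ext'.

    `target_rows` is an iterable of (SgID, CleavageSite) pairs.
    """
    sites_by_sg = defaultdict(list)
    for sg_id, site in target_rows:
        sites_by_sg[sg_id].append(site)
    return sorted(
        sg_id
        for sg_id, sites in sites_by_sg.items()
        if all(site == "ext" for site in sites)
    )


rows = [(1, "ext"), (1, "ext"), (2, "ext"), (2, "stem"), (3, "stem")]
print(all_extended(rows))
```

Here sgRNA 2 is kept because one of its two sites lies inside a stemloop.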

## Other Exact Matches

Those sgRNAs with at least one but no more than 10 off-target sites in hg19 which exactly match the sgRNA were excluded.

In [None]:
import data_processing as dp

def exclude_multiple(db_name, sql_version="MySQL", firewall=False):
    """
        Adds 'D' to exclude column where NumExactMatch greater than the number of sites in miRNA(s)
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
        
    rows = db_con.fetch_query("""SELECT t.SgID
FROM SingleGuideRNA AS s
JOIN SgRNATargetInformation AS t
ON s.SgID = t.SgID
GROUP BY t.SgID, s.NumExactMatch, s.Exclude
HAVING s.NumExactMatch > COUNT(t.SgID) AND s.Exclude IS NULL;""")
    if sql_version == "MSSQL":
        sgIDs = [row.SgID for row in rows]
    else:
        sgIDs = [sg for sg, in rows]
        
    db_con.update_many_rows({"Exclude": ["D"]*len(sgIDs)}, {"SgID": sgIDs}, "SingleGuideRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [1]:
exclude_multiple("miR-test", firewall=True)

There are 864 sgRNAs with more than the expected miRNA target sites as exact matches which are not excluded for other reasons.
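The two exact-match filters differ only in how many off-target sites they tolerate, where the off-target count is `NumExactMatch` minus the number of intended target sites. A hedged sketch of that arithmetic follows; `off_target_code` and its parameters are hypothetical names, and the poly(T) and extended-cleavage filters that run in between are left out.

```python
def off_target_code(num_exact_match, num_target_sites, repetitive_cutoff=10):
    """Return 'R', 'D', or None based on the number of exact off-target matches."""
    off_targets = num_exact_match - num_target_sites
    if off_targets > repetitive_cutoff:
        return "R"  # more than 10 off-target exact matches
    if off_targets > 0:
        return "D"  # between 1 and 10 off-target exact matches
    return None     # every exact match is an intended target site


# (NumExactMatch, number of intended target sites); values are made up.
examples = [(1, 1), (2, 1), (11, 1), (12, 1), (12, 2)]
codes = [off_target_code(n, k) for n, k in examples]
print(codes)
```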

## Low Zhang score

Those sgRNAs with Zhang scores below 0.2 were excluded due to possible off-target effects. This letter was added to the Exclude column after the oligos had been ordered, which is why the functions used to select sgRNAs apply additional filters to remove low-scoring sgRNAs.

In [None]:
import data_processing as dp

def exclude_zhang(db_name, sql_version="MySQL", firewall=False):
    """
        Adds 'L' to exclude if Zhang score below 0.2
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    
    rows_low = db_con.fetch_query("""SELECT SgID
FROM SingleGuideRNA
WHERE ZhangScore < 0.2  
AND Exclude IS NULL;""")
    
    if sql_version == "MSSQL":
        low = [row.SgID for row in rows_low]
    else:
        low = [sgID for sgID, in rows_low]
    
    up_dict = {"Exclude": ["L"]*len(low)}
    select_dict = {"SgID": low}
    
    db_con.update_many_rows(up_dict, select_dict, "SingleGuideRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [None]:
exclude_zhang("miR-test", firewall=True)

475 sgRNAs which were not excluded for other reasons have Zhang scores less than 0.2.
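Because every filter after the first only considers rows where `Exclude IS NULL`, each sgRNA receives at most one letter: the first reason that applies in the order R, T, C, D, L. The sketch below mimics that cascade on in-memory records instead of the database. The dictionary keys, the example guides, and the `'stem'` label are assumptions for illustration only.

```python
from collections import Counter


def exclusion_code(sg):
    """Return the first matching exclusion letter, or None if the sgRNA is kept."""
    sites = sg["cleavage_sites"]
    off_targets = sg["num_exact_match"] - len(sites)
    if off_targets > 10:
        return "R"
    if "TTTT" in sg["sequence"]:
        return "T"
    if sites and all(site == "ext" for site in sites):
        return "C"
    if off_targets > 0:
        return "D"
    if sg["zhang_score"] < 0.2:
        return "L"
    return None


def guide(sequence, sites, num_exact_match, zhang_score):
    """Build a toy sgRNA record (hypothetical structure)."""
    return {"sequence": sequence, "cleavage_sites": sites,
            "num_exact_match": num_exact_match, "zhang_score": zhang_score}


clean = "GACGTACGTAGCTAGCTAGC"
guides = [
    guide(clean, ["stem"], 1, 0.8),                   # kept
    guide("GACTTTTGTAGCTAGCTAGC", ["stem"], 1, 0.1),  # poly(T) wins over low score
    guide(clean, ["ext"], 3, 0.8),                    # extended wins over exact matches
    guide(clean, ["stem"], 1, 0.1),                   # only the low score applies
    guide(clean, ["stem"], 15, 0.8),                  # repetitive
    guide(clean, ["stem"], 2, 0.8),                   # one other exact match
]
codes = [exclusion_code(g) for g in guides]
print(Counter(codes))
```

This ordering is why the per-reason counts describe sgRNAs "not excluded for other reasons": an sgRNA with both a poly(T) run and a low Zhang score is counted once, under T.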