# SgRNA Controls

We also include 1,000 control sgRNAs that do not target any sequence in the human genome.

## Extracting from Literature

To test the efficacy of the knockout, we may want to target EGFP or the puromycin resistance gene, so we included 4 control sgRNAs against each. The EGFP-targeting control sgRNAs (1, 2, 5 and 6) were taken from the supplement of <a href="https://doi.org/10.1126/science.1247005">Shalem et al., 2014</a>:

<img src="Published Libraries/Shalem et al EGFP sgRNA.png">


The puromycin control sgRNAs were CGGCGTCTCGCCCGACCACC, GTCGGGCGAGACGCCGACGG, CGTGGTCCAGACCGCCACCG and ACGCGCGTCGGGCTCGACAT.

We wanted the rest of the controls to not target any sequence in the human genome. These other control sequences were taken from <a href="https://doi.org/10.1126/science.1246981">Wang et al., 2014</a> (pLX-sgRNA), <a href="https://doi.org/10.1126/science.aac7041">Wang et al., 2015</a> (Essential-gene) and <a href="https://doi.org/10.1038/nmeth.3047">Sanjana et al., 2014</a> (Human-GeCKOv2; library sequences downloaded from <a href="http://genome-engineering.org/gecko/?page_id=15">here</a>). The control sgRNA sequences were extracted from the list of sgRNAs in each pool and compiled into a single file without duplicating any sgRNA:

In [None]:
import pandas as pd

with open("Control sgRNAs/All Literature Control sgRNAs.csv", "w") as fout:
    fout.write("sgID,sgRNA Sequence,Library,Lab\n")
    fout.write("EGFP_1,GGGCGAGGAGCTGTTCACCG,EGFP,Lander\n")
    fout.write("EGFP_2,GAGCTGGACGGCGACGTAAA,EGFP,Lander\n")
    fout.write("EGFP_5,GAAGTTCGAGGGCGACACCC,EGFP,Lander\n")
    fout.write("EGFP_6,GGTGAACCGCATCGAGCTGA,EGFP,Lander\n")
    fout.write("Puro_1,CGGCGTCTCGCCCGACCACC,Puro,\n")
    fout.write("Puro_2,GTCGGGCGAGACGCCGACGG,Puro,\n")
    fout.write("Puro_3,CGTGGTCCAGACCGCCACCG,Puro,\n")
    fout.write("Puro_4,ACGCGCGTCGGGCTCGACAT,Puro,\n")
    
    sg_list = ["GGGCGAGGAGCTGTTCACCG", "GAGCTGGACGGCGACGTAAA", "GAAGTTCGAGGGCGACACCC", "GGTGAACCGCATCGAGCTGA", 
               "CGGCGTCTCGCCCGACCACC", "GTCGGGCGAGACGCCGACGG", "CGTGGTCCAGACCGCCACCG", "ACGCGCGTCGGGCTCGACAT"]
    
    # Look at GeCKOv2 A (controls are identical in library A and B)
    # The file uses \r as its line break, so plain Python iteration reads it as a single line; use pandas instead
    df = pd.read_csv("Published Libraries/Human_GeCKOv2_Library_A_09Mar2015.csv", header=0)
    df_con = df[df["gene_id"].str.contains("Control")]
    for i in df_con.index.tolist():
        sgID = df.loc[i, "UID"]
        seq = df.loc[i, "seq"]
        if seq not in sg_list:
            sg_list += [seq]
            out_str = "{},{},Human_GeCKOv2_Library,Zhang\n".format(sgID, seq)
            fout.write(out_str)
        else:
            print("SgRNA ID {} with sequence {} already in library".format(sgID, seq))
            
    # Look at pLX-sgRNA
    with open("Published Libraries/Wang et al 2014 Supplementary Table 1 sgRNA Sequences.csv", "r") as pLX_f:
        for line in pLX_f:
            if "control" not in line:
                continue
            ele = line.split(",")
            sgID = ele[0]
            seq = ele[6]
            if seq not in sg_list:
                sg_list += [seq]
                out_str = "{},{},pLX-sgRNA_Library,Lander\n".format(sgID, seq)
                fout.write(out_str)
            else:
                print("SgRNA ID {} with sequence {} already in library".format(sgID, seq))
            
    # Look at Essential-gene screen
    with open("Published Libraries/Wang et al 2015 Supplementary Table 1 sgRNA Sequences.csv", "r") as ess_f:
        for line in ess_f:
            if "CTRL" not in line:
                continue
            ele = line.split(",")
            sgID = ele[0]
            seq = ele[5]
            if seq not in sg_list:
                sg_list += [seq]
                out_str = "{},{},Essential-gene_Library,Sabatini\n".format(sgID, seq)
                fout.write(out_str)
            else:
                print("SgRNA ID {} with sequence {} already in library".format(sgID, seq))
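
The cell above tracks sequences it has already written in a list. As a minimal sketch of the same first-occurrence-wins rule, the hypothetical helper below (`dedupe_controls` is not part of the notebook) uses a set for the membership test and returns the skipped records instead of printing them. The second row is a made-up duplicate for illustration.

```python
def dedupe_controls(records, seen=None):
    """Split records into (kept, skipped), keeping the first use of each sequence.

    `records` is an iterable of (sg_id, sequence, library, lab) tuples.
    `seen` is an optional collection of sequences that are already taken.
    """
    seen = set() if seen is None else set(seen)
    kept, skipped = [], []
    for record in records:
        seq = record[1].strip().upper()
        if seq in seen:
            skipped.append(record)   # sequence already present, do not write it again
        else:
            seen.add(seq)
            kept.append(record)
    return kept, skipped


# Illustrative rows: "dup_example" is a made-up entry reusing the EGFP_1 sequence.
rows = [
    ("EGFP_1", "GGGCGAGGAGCTGTTCACCG", "EGFP", "Lander"),
    ("dup_example", "GGGCGAGGAGCTGTTCACCG", "Example", "Example"),
    ("Puro_1", "CGGCGTCTCGCCCGACCACC", "Puro", ""),
]
kept, skipped = dedupe_controls(rows)
```

With roughly a thousand sequences the list lookup is not a bottleneck; the set mainly makes the intent explicit.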

## Making fastq File

We then wanted to align these control sequences to the human genome to ensure no matches were found. A fastq file containing the sgRNA sequences was created:

In [None]:
control_file = "Control sgRNAs/All Literature Control sgRNAs.csv"
fastq_file = "Control sgRNAs/lit_controls.fastq"
with open(fastq_file, "w") as fout, open(control_file, "r") as fin:
    next(fin) # skip header line
    for line in fin:
        ele = line.split(",")
        sg = ele[1]
        sglen = len(sg)
        out = "@{}\n{}\n+\n{}\n".format(ele[0], sg, "I"*sglen)
        fout.write(out)
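
Each sgRNA becomes one four-line FASTQ record: an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence. The sketch below isolates that formatting step in a hypothetical helper (`fastq_record` is an illustrative name) and adds a base check that the cell above does not perform. The constant `I` corresponds to Phred quality 40 under the common Phred+33 encoding.

```python
def fastq_record(read_id, seq, qual_char="I"):
    """Format one four-line FASTQ record with a constant quality string."""
    seq = seq.strip()
    if not seq or set(seq.upper()) - set("ACGTN"):
        raise ValueError("unexpected base in {!r}".format(seq))
    # The quality line must have exactly one character per base.
    return "@{}\n{}\n+\n{}\n".format(read_id, seq, qual_char * len(seq))


record = fastq_record("EGFP_1", "GGGCGAGGAGCTGTTCACCG")
lines = record.splitlines()
```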

## Align to hg19

The control sgRNAs were then aligned to the hg19 version of the human genome using <a href="http://bowtie-bio.sourceforge.net/index.shtml">bowtie</a>.

In [None]:
import data_processing.trim_align as ta

# Create TrimAndAlign object
taObj = ta.TrimAndAlign("Control sgRNAs/Literature Control Alignment Log.log")
# move fastq file to server
taObj.fileToServer("Control sgRNAs/lit_controls.fastq", "lit_controls", ext=".fastq")
# align using bowtie
taObj.align_bowtie("lit_controls", "../UCSC/hg19/bowtie-indexes/hg19", options="-v 0 -a -p 4")
# move aligned sam file back from server
taObj.fileFromServer("Control sgRNAs/", "lit_controls_bowtie-aligned", ext=".sam")
# delete files on server
taObj.cleanUp("lit_controls_aligned")
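
The internals of `TrimAndAlign` are not shown here, so the following is only a sketch of the bowtie 1 command line that such a wrapper would plausibly run, assuming the `options` string is passed through unchanged. In bowtie 1, `-v 0` allows no mismatches, `-a` reports every valid alignment, `-p 4` uses four threads, and `-S` requests SAM output. The function name and file names are illustrative.

```python
def bowtie_command(index_prefix, fastq_path, sam_path, threads=4):
    """Build (but do not run) a bowtie 1 command for exact-match alignment."""
    return [
        "bowtie",
        "-v", "0",            # no mismatches allowed
        "-a",                 # report all valid alignments, not just one
        "-p", str(threads),   # number of alignment threads
        "-S",                 # write SAM instead of bowtie's default format
        index_prefix,
        fastq_path,
        sam_path,
    ]


cmd = bowtie_command("../UCSC/hg19/bowtie-indexes/hg19",
                     "lit_controls.fastq", "lit_controls.sam")
```

The resulting list could be handed to `subprocess.check_call(cmd)` on a machine where bowtie and the hg19 index are installed.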

## Filter by alignment

The sgRNAs were then filtered to remove any that aligned to the genome. Because the alignment did not require an adjacent PAM sequence, some of the removed sgRNAs might not actually direct cutting of the genome, but they were excluded to be safe.

In [None]:
with open("Control sgRNAs/filtered_lit_controls.csv", "w") as fout:
    with open("Control sgRNAs/lit_controls_aligned.sam", "r") as alCon:
        for line in alCon:
            # Skip SAM header lines
            if line[0] == "@":
                continue
            ele = line.split("\t")
            # Skip if it does align to the genome
            if ele[2] != "*" or ele[3] != "0":
                continue
            outStr = "{},{}\n".format(ele[0], ele[9])
            fout.write(outStr)
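
The filter above identifies unaligned reads by their reference name (`*`) and position (`0`). The SAM format also marks unmapped reads with bit `0x4` of the FLAG field, and the sketch below uses that bit as an alternative test. `unaligned_reads` is a hypothetical helper and the two records are made-up examples, not output from the alignment above.

```python
def unaligned_reads(sam_lines):
    """Yield (read_id, sequence) for SAM records whose FLAG has bit 0x4 set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue                      # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue                      # not a complete alignment record
        if int(fields[1]) & 0x4:          # 0x4 = segment unmapped
            yield fields[0], fields[9]


# Made-up SAM content: one unmapped read and one read aligned to chr1.
example_sam = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "ctrl_a\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTACGTACGTACGT\tIIIIIIIIIIIIIIIIIIII\n",
    "ctrl_b\t0\tchr1\t100\t255\t20M\t*\t0\t0\tTTGCATTGCATTGCATTGCA\tIIIIIIIIIIIIIIIIIIII\n",
]
passed = list(unaligned_reads(example_sam))
```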

## Sort

So that the same control sgRNAs are selected every time this code is run, the sgRNAs were sorted before selection.

In [None]:
import pandas as pd

def sort_df(df):
    """
        Sorts the dataframe by the first number embedded in each LiteratureSgRNAID string.
        Adapted from
        http://stackoverflow.com/questions/37693600/how-to-sort-dataframe-based-on-particular-stringcolumns-using-python-pandas
    """
    # raw string so that \d reaches the regex engine unchanged
    name_ser = df.loc[:, "LiteratureSgRNAID"].str.extract(r"(\d+)", expand=False)
    df = df.assign(sort=pd.to_numeric(name_ser)) # Add new column with name 'sort'
    df = df.sort_values("sort")
    df = df.drop("sort", axis=1) # remove 'sort' column
    return df
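
Sorting on the extracted number rather than on the raw ID string matters whenever the numbers are not zero-padded. The standard-library sketch below shows the difference on hypothetical IDs (`ctrl_1`, `ctrl_2`, `ctrl_10` are illustrative, not taken from the libraries), using the same first-run-of-digits rule as `sort_df`.

```python
import re


def numeric_key(sg_id):
    """Return the first run of digits in sg_id as an int (IDs without digits sort last)."""
    match = re.search(r"\d+", sg_id)
    return int(match.group()) if match else float("inf")


ids = ["ctrl_10", "ctrl_2", "ctrl_1"]
lexicographic = sorted(ids)                 # ['ctrl_1', 'ctrl_10', 'ctrl_2']
numeric = sorted(ids, key=numeric_key)      # ['ctrl_1', 'ctrl_2', 'ctrl_10']
```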

In [None]:
df = pd.read_csv("Control sgRNAs/filtered_lit_controls.csv", header=None, 
                 names=["LiteratureSgRNAID", "SgRNA"])
# divide the dataframe into parts by origin
egfp_df = df[df["LiteratureSgRNAID"].str.contains("EGFP")]
puro_df = df[df["LiteratureSgRNAID"].str.contains("Puro")]
geckoA_df = df[df["LiteratureSgRNAID"].str.contains("HGLibA")]
geckoB_df = df[df["LiteratureSgRNAID"].str.contains("HGLibB")]
pLX_df = df[(df["LiteratureSgRNAID"].str.contains("CTRL")) & (df["LiteratureSgRNAID"].str.len()==8)]
ess_df = df[(df["LiteratureSgRNAID"].str.contains("CTRL")) & (df["LiteratureSgRNAID"].str.len()==9)]

part_dfs = [(egfp_df, "EGFP"), (puro_df, "Puro"), (geckoA_df, "GeCKOv2"), (geckoB_df, "GeCKOv2"), (pLX_df, "pLX-sgRNA"), 
            (ess_df, "Essential")]
part_dfs_sorted = []
n = 1
for part_df, part_name in part_dfs:
    sorted_df = sort_df(part_df)
    sorted_df["SgRNAName"] = ""    # create new column for name
    for ind in sorted_df.index.tolist():
        # assign the running control name to this row
        sorted_df.at[ind, "SgRNAName"] = "control_{}_{}".format(n, part_name)
        n += 1
    part_dfs_sorted += [sorted_df]
    
# put all of the partial dataframes back together
full_df = pd.concat(part_dfs_sorted, ignore_index=True)
# output to file
full_df.to_csv("Control sgRNAs/Filtered and Sorted Literature Control sgRNAs.csv", index=False)

# Making ControlSgRNA Table

To hold the control sgRNA sequences along with their literature sgRNA IDs, the ControlSgRNA table was created.

In [None]:
import data_processing as dp

def create_control_table(db_name, sql_version="MySQL", firewall=False):
    """
        Creates the ControlSgRNA table
    """
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    db_con.make_table("ControlSgRNA", {"SgRNAName": ["VARCHAR(50)", "NOT NULL"], "SgRNA": ["CHAR(20)"], 
                                       "LiteratureSgRNA": ["VARCHAR(200)"]}, ["PRIMARY KEY (SgRNAName)"])
    db_con.close_cursor()
    db_con.close_connection()

In [None]:
create_control_table("miR-test", firewall=True)
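
`dp.DatabaseConnection.make_table` belongs to a project-specific module, so the SQL it issues is not visible here. As a rough sketch of the schema described by its arguments, the block below creates an equivalent table in an in-memory SQLite database. SQLite stands in for MySQL purely so the example runs without a server, and the exact statement generated by `make_table` is an assumption.

```python
import sqlite3

# Column names, types and the primary key follow the make_table call above.
SCHEMA = """
CREATE TABLE ControlSgRNA (
    SgRNAName       VARCHAR(50) NOT NULL,
    SgRNA           CHAR(20),
    LiteratureSgRNA VARCHAR(200),
    PRIMARY KEY (SgRNAName)
)
"""

con = sqlite3.connect(":memory:")
con.execute(SCHEMA)
# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in con.execute("PRAGMA table_info(ControlSgRNA)")]
con.close()
```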

## Fill ControlSgRNA Table

The first 1,000 control sgRNAs were then imported into the new table.

In [None]:
import pandas as pd
import data_processing as dp

def import_controls(db_name, sql_version="MySQL", firewall=False):
    df = pd.read_csv("Control sgRNAs/Filtered and Sorted Literature Control sgRNAs.csv", 
                     header=0)
    first_df = df.head(1000) # take first 1,000 rows
    
    insert_dict = {"SgRNAName": [], "SgRNA": [], "LiteratureSgRNA": []}
    for i in first_df.index.tolist():
        insert_dict["SgRNAName"] += [first_df.at[i, "SgRNAName"]]
        insert_dict["SgRNA"] += [first_df.at[i, "SgRNA"]]
        insert_dict["LiteratureSgRNA"] += [first_df.at[i, "LiteratureSgRNAID"]]
    
    db_con = dp.DatabaseConnection(sql_version, db_name=db_name, firewall=firewall)
    db_con.make_many_rows(insert_dict, "ControlSgRNA")
    db_con.close_cursor()
    db_con.close_connection()

In [None]:
import_controls("miR-test", firewall=True)
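
What `make_many_rows` does internally is not shown. One plausible reading is a single bulk insert of the first 1,000 rows, which the sketch below illustrates with `executemany` on an in-memory SQLite database. SQLite again stands in for MySQL (MySQL drivers typically use `%s` placeholders instead of `?`), `insert_controls` is a hypothetical helper, and the rows are placeholder data.

```python
import sqlite3


def insert_controls(con, rows, limit=1000):
    """Insert at most `limit` (name, sequence, literature_id) rows; return the count."""
    selected = rows[:limit]
    con.executemany(
        "INSERT INTO ControlSgRNA (SgRNAName, SgRNA, LiteratureSgRNA) "
        "VALUES (?, ?, ?)",
        selected,
    )
    con.commit()
    return len(selected)


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE ControlSgRNA (SgRNAName VARCHAR(50) NOT NULL PRIMARY KEY, "
    "SgRNA CHAR(20), LiteratureSgRNA VARCHAR(200))"
)
# Placeholder rows: names, sequence and IDs are made up for illustration.
rows = [("control_{}_Example".format(i), "ACGTACGTACGTACGTACGT", "example_{}".format(i))
        for i in range(1, 1201)]
inserted = insert_controls(con, rows)
count = con.execute("SELECT COUNT(*) FROM ControlSgRNA").fetchone()[0]
```

Because `SgRNAName` is the primary key, re-running the insert against an already filled table raises an integrity error rather than silently duplicating controls.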