# Dataset from Guillaume

In [63]:
using CSV, DataFrames, Plots, cmdstan_utils, DelimitedFiles
plotlyjs()

Plots.PlotlyJSBackend()

Import the file that maps all ~17K TSSs tested in the paper to 2,387 operons in E. coli. He matched a TSS with an operon if the operon was within 500 bp downstream (smaller margins are possible). The data frame contains the position and strength of each TSS/promoter, whether they considered the TSS 'active' or 'inactive', and the coordinates of the operon (left, right, strand).
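To make the matching rule concrete, here is a minimal sketch of how "within 500 bp downstream" could be tested. The helper name `operon_downstream` and the strand convention (the operon starts at `left` on the `+` strand and at `right` on the `-` strand) are assumptions for illustration, not necessarily how the original matching was done.

```julia
# Hypothetical helper: does an operon start within `margin` bp downstream of a TSS?
# Assumption: on the "+" strand the operon starts at `left`,
# on the "-" strand it starts at `right`.
function operon_downstream(tss_position::Integer, tss_strand::AbstractString,
                           left::Integer, right::Integer; margin::Integer=500)
    if tss_strand == "+"
        # Downstream means increasing coordinates on the forward strand.
        return 0 <= left - tss_position <= margin
    else
        # Downstream means decreasing coordinates on the reverse strand.
        return 0 <= tss_position - right <= margin
    end
end
```

A smaller margin can be tried by passing, for example, `margin=100`.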

In [2]:
df = CSV.read("../data/gen2/tss_operon_regulation.txt")
names(df)

9-element Array{Symbol,1}:
 :tss_name
 :tss_strand
 :tss_position
 :prom_expression
 :active
 :left
 :right
 :operon
 :strand

## Update genome positions

This data set uses an old E. coli genome, so we first have to update all positions from the old genome version to the new one. To do so, we use the [EcoCyc tool](https://ecocyc.org/ECOLI/map-seq-coords-form?chromosome=COLI-K12), making sure that the old format is U00096.2 and the new one is U00096.3.

In [3]:
old_pos = df.tss_position
old_pos_arr = convert(Array{Int64, 1}, old_pos)
writedlm("position_list.txt", old_pos_arr)

In [4]:
new_pos = readdlm("new_positions.txt")
new_pos = convert(Array{Int64, 1}, new_pos[:, 1])
df.new_tss_position = new_pos;
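
Because the new positions are attached to the data frame purely by row order, a length check before the assignment may catch a truncated or reordered export. The sketch below is one way to do that; `read_mapped_positions` is a hypothetical helper, and the demo uses a temporary file rather than the real `new_positions.txt`.

```julia
using DelimitedFiles

# Hypothetical helper: read a one-column file of mapped positions and
# verify that it has as many rows as the list that was exported.
function read_mapped_positions(path::AbstractString, n_expected::Integer)
    raw = readdlm(path)
    positions = Int64.(raw[:, 1])
    length(positions) == n_expected ||
        error("expected $n_expected positions, got $(length(positions))")
    return positions
end

# Self-contained demo with made-up positions written to a temporary file.
tmp = tempname()
writedlm(tmp, [100, 200, 300])
demo_positions = read_mapped_positions(tmp, 3)
```

In the notebook this would be called with the number of rows of `df` as `n_expected`.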

In [5]:
df2 = CSV.read("../data/gen2/scanning_mutagenesis_data.txt", missingstring="NA")
df2 = df2[findall(x -> ~ismissing(x), df2.tss_position), :];

In [6]:
old_pos2 = df2.tss_position
old_pos_arr2 = convert(Array{Int64, 1}, old_pos2)
writedlm("position_list2.txt", old_pos_arr2)

In [7]:
new_pos2 = readdlm("new_positions2.txt")
new_pos2 = convert(Array{Int64, 1}, new_pos2[:, 1])
df2.new_tss_position = new_pos2;
df2 = df2[findall(x -> ~ismissing(x), df2.scramble_start), :];
df2 = df2[findall(x -> ~ismissing(x), df2.relative_exp), :];

Let's look at all the column names in the mutagenesis file.

In [8]:
names(df2)

18-element Array{Symbol,1}:
 :name
 :tss_name
 :tss_position
 :strand
 :scramble_start
 :scramble_end
 :prom_left_position
 :prom_right_position
 :scramble_start_pos
 :scramble_end_pos
 :scramble_pos_rel_tss
 :expn_med_fitted
 :unscrambled_exp
 :relative_exp
 :num_barcodes_integrated
 :category
 :active
 :new_tss_position

## Load Reg-Seq transcription start sites

Many genes in the RegSeq list are actually part of an operon. So for every gene that is not in the operon list, I checked whether it is part of an operon that is in the list. The dictionary below maps these RegSeq gene names to the operon names in Guillaume's data set. Some genes are also known under a different name, which can be found on EcoCyc.

In [9]:
regseq_tss = CSV.read("../data/RegSeq/wtsequences.csv")
regseq_tss[!, :start_site] = convert.(Int64, regseq_tss[!,:start_site]);

In [10]:
gene_to_operon = Dict(
    "ecnB" => "ecnAB",
    "fdoH" => "fdoGHI-fdhE",
    "dinJ" => "dinJ-yafQ",
    "ompR" => "ompR-envZ",
    "dicB" => "dicB-ydfDE-insD-intQ",
    "dicC" => "dicC-ydfXW",
    "xylA" => "xylAB",
    "xylF" => "xylFGHR",
    "tar" => "tar-tap-cheRBYZ",
    "mscM" => "psd-mscM",
    "rspA" => "rspAB",
    "poxB" => "poxB-ltaE-ybjT",
    "yehT" => "yehUT",
    "yehU" => "yehUT",
    "yggW" => "yggSTU-rdgB-yggW",
    "ydjA" => "ydjA-selD-topB",
    "ybeZ" => "ybeZYX-lnt",
    "minC" => "minCDE",
    "hicB" => "hicAB",
    "rcsF" => "rcsF-trmO",
    "pcm" => "umpG-pcm",
    "ybiP"=>"opgE",
    "yecE" => "yecDE",
    "yajL" => "panE-yajL",
    "modE" => "modEF",
    "ymgG" => "ymgGD",
    "hslU" => "hslVU",
    "ydhO" => "mepH",
    "fdhE" => "fdoGHI-fdhE",
    "htrB" => "lpxL",
    "yagH" => "yagGH",
    "mtgA" => "elbB-mtgA",
    "ybjT" => "poxB-ltaE-ybjT",
    "ykgE" => "ykgEFG",
    "thiM" => "thiMD",
    "sdaB" => "sdaCB",
    "fdoH" => "fdoGHI-fdhE",
    "leuABCD" => "leuLABCD",
    "araAB" => "araBAD",
    "WaaA-coaD" => "waaA-coaD",
    "acuI" => "yhdH",
    "dnaE_start2" => "bamA-skp-lpxD-fabZ-lpxAB-rnhB-dnaE",
    "dnaE_start1" => "bamA-skp-lpxD-fabZ-lpxAB-rnhB-dnaE",
    "rlmC" => "ybjO-rlmC"
    
)

Dict{String,String} with 43 entries:
  "ompR"        => "ompR-envZ"
  "poxB"        => "poxB-ltaE-ybjT"
  "pcm"         => "umpG-pcm"
  "yehT"        => "yehUT"
  "dnaE_start1" => "bamA-skp-lpxD-fabZ-lpxAB-rnhB-dnaE"
  "modE"        => "modEF"
  "tar"         => "tar-tap-cheRBYZ"
  "dinJ"        => "dinJ-yafQ"
  "ymgG"        => "ymgGD"
  "ecnB"        => "ecnAB"
  "ybeZ"        => "ybeZYX-lnt"
  "rspA"        => "rspAB"
  "mtgA"        => "elbB-mtgA"
  "ydhO"        => "mepH"
  "minC"        => "minCDE"
  "ykgE"        => "ykgEFG"
  "yagH"        => "yagGH"
  "fdoH"        => "fdoGHI-fdhE"
  "acuI"        => "yhdH"
  "yggW"        => "yggSTU-rdgB-yggW"
  "leuABCD"     => "leuLABCD"
  "rcsF"        => "rcsF-trmO"
  "ybjT"        => "poxB-ltaE-ybjT"
  "mscM"        => "psd-mscM"
  "yajL"        => "panE-yajL"
  ⋮             => ⋮
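
A dictionary like this can be combined with a fallback, so that a gene whose name already is an operon name passes through unchanged. The following is a minimal sketch; `resolve_operon` is a hypothetical helper and the demo dictionary contains only two of the entries above.

```julia
# Two entries copied from the dictionary above, for demonstration only.
demo_gene_to_operon = Dict("ecnB" => "ecnAB", "xylA" => "xylAB")

# Hypothetical helper: return the mapped operon name, or the gene name
# itself when the gene has no entry in the mapping.
resolve_operon(gene, mapping) = get(mapping, gene, gene)

resolve_operon("ecnB", demo_gene_to_operon)  # mapped to "ecnAB"
resolve_operon("sdiA", demo_gene_to_operon)  # no entry, stays "sdiA"
```

The returned name still has to be checked against the operon column, since a gene without an entry may not be an operon name either.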

Now we can compare the transcription start sites for each promoter that is present in both data sets. For now we only look for matching (or very close) TSSs, which here means every TSS within 10 bp of the RegSeq TSS. Later we can go through other cases as well.
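
The 10 bp criterion on its own can be sketched as a small filter. `close_matches` is a hypothetical helper; the numbers in the demo are the sdaB and ykgE start sites reported in this notebook.

```julia
# Hypothetical helper: keep the candidate TSSs that lie within `tol` bp
# of a reference TSS.
close_matches(reference, candidates; tol=10) =
    filter(c -> abs(c - reference) <= tol, candidates)

# sdaB: RegSeq TSS and the TSSs listed for operon sdaCB
sdaB_close = close_matches(2928035, [2927935, 2927935, 2928103, 2928035])

# ykgE: RegSeq TSS and the TSSs listed for operon ykgEFG
ykgE_close = close_matches(321511, [321511, 321481, 321537])
```

For both genes only the exactly matching TSS passes the filter; the other candidates are 26 bp or more away.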

In [13]:
# Array to store all differences in TSS
differences = Int64[]

# Array to store the TSS for matching genes
positions = Int64[]

# Get gene names which are part of operons
gene_keys = collect(keys(gene_to_operon))

# Store genes which are not in Guillaume's data set
found_none = String[]

# Store TSSs which are close to RegSeq TSSs
exact_match = Int64[]

# Unique operon names in Guillaume's data set
unique_operons = df.operon |> unique

for gene in regseq_tss.name
    if gene in unique_operons
        println("$gene is in the operon list.")
        
        reqseq_start = regseq_tss[regseq_tss.name.== gene, :start_site][1]
        guillaume_start = df[df.operon .== gene, :new_tss_position]
        
        println("RegSeq TSS: ", reqseq_start)
        println("Guillaume TSS: ", guillaume_start)
        
        diffs = guillaume_start .- reqseq_start
        
        println("Differences in start sites: ", diffs)
        
        push!(differences, diffs...)
        push!(positions, guillaume_start...)
        
        println()
        
        exact = findall(x->abs(x - reqseq_start) <= 10, guillaume_start)
        for x in exact
            push!(exact_match, guillaume_start[x]...)
        end
        
    elseif gene in gene_keys
        lab = gene_to_operon[gene]
        
        println("$gene is in the operon list for operon $lab.")
        
        reqseq_start = regseq_tss[regseq_tss.name.== gene, :start_site][1]
        guillaume_start = df[df.operon .== lab, :new_tss_position]
        
        println("RegSeq TSS: ", reqseq_start)
        println("Guillaume TSS: ", guillaume_start)
        
        diffs = guillaume_start .- reqseq_start
        
        println("Differences in start sites: ", diffs)
        
        push!(differences, diffs...)
        push!(positions, guillaume_start...)
        
        exact = findall(x->abs(x - reqseq_start) <= 10, guillaume_start)
        for x in exact
            push!(exact_match, guillaume_start[x]...)
        end

        println()
    else
        push!(found_none, gene)
    end
end
println("Found no operon for genes $found_none")

fdoH is in the operon list for operon fdoGHI-fdhE.
RegSeq TSS: 4085867
Guillaume TSS: [4085961]
Differences in start sites: [94]

sdaB is in the operon list for operon sdaCB.
RegSeq TSS: 2928035
Guillaume TSS: [2927935, 2927935, 2928103, 2928035]
Differences in start sites: [-100, -100, 68, 0]

thiM is in the operon list for operon thiMD.
RegSeq TSS: 2185451
Guillaume TSS: [2185450]
Differences in start sites: [-1]

ykgE is in the operon list for operon ykgEFG.
RegSeq TSS: 321511
Guillaume TSS: [321511, 321481, 321537]
Differences in start sites: [0, -30, 26]

sdiA is in the operon list.
RegSeq TSS: 1996867
Guillaume TSS: [1996851]
Differences in start sites: [-16]

yqhC is in the operon list.
RegSeq TSS: 3155262
Guillaume TSS: [3155384]
Differences in start sites: [122]

ybjT is in the operon list for operon poxB-ltaE-ybjT.
RegSeq TSS: 909320
Guillaume TSS: [911075]
Differences in start sites: [1755]

mtgA is in the operon list for operon elbB-mtgA.
RegSeq TSS: 3350504
Guillaume TSS: 

Let's have a look at all the observed differences. 

In [14]:
Plots.scatter(positions, differences, label=:none, xlabel="Position", ylabel="difference in TSS")

It seems we found many nearly matching TSSs: 79 are within 10 bp of the RegSeq TSS.

In [15]:
exact_match

79-element Array{Int64,1}:
 2928035
 2185450
  321511
 3350497
 4269355
 4474096
 1523276
  285350
 2303797
 1116709
 2585567
 2230395
  197821
       ⋮
 2524910
 3957912
 3927129
 4494597
 1977300
  933135
   87969
 3997907
 4338042
 1942634
 3637612
   83728

Let's check how many of those TSSs were observed to be active.

In [16]:
active_close_df = df[map(x->x in exact_match, df.new_tss_position) .* (df.active .== "active"), :]

| | tss_name | tss_strand | tss_position | prom_expression | active |
| --- | --- | --- | --- | --- | --- |
| | String | String | Int64 | Float64 | String |
| 1 | TSS_14751_wanner | + | 3806532 | 2.35042 | active |
| 2 | TSS_353_regulondb | - | 83728 | 3.68341 | active |
| 3 | TSS_17053_storz_regulondb | + | 4368639 | 2.37858 | active |
| 4 | TSS_9945_storz_wanner_regulondb | - | 2642935 | 2.43472 | active |
| 5 | TSS_16340_storz_regulondb | + | 4176377 | 12.9788 | active |
| 6 | TSS_785_regulondb | + | 189712 | 5.96017 | active |
| 7 | TSS_16903_regulondb | - | 4336065 | 2.98382 | active |
| 8 | TSS_8330_wanner | - | 2183472 | 2.26949 | active |
| 9 | TSS_3578_storz_regulondb | + | 932358 | 2.06769 | active |
| 10 | TSS_15199_wanner | - | 3920547 | 4.45137 | active |


Only 21 of them are active?
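
The filter used here combines two conditions: the position is one of the close matches, and the TSS is labelled active. A minimal sketch of that combination on made-up data (all `demo_*` values are invented for illustration) could look like this:

```julia
# Made-up positions and labels, not taken from the data set.
demo_positions = [100, 200, 300]
demo_active = ["active", "inactive", "active"]
demo_matches = [100, 200]

# Element-wise membership test; Ref keeps demo_matches from being broadcast over.
is_match = in.(demo_positions, Ref(demo_matches))
is_active = demo_active .== "active"

# Element-wise AND of both masks selects active TSSs among the close matches.
active_and_close = demo_positions[is_match .& is_active]
```

Note that a row count from such a filter counts data frame rows, so a position that appears in more than one row is counted more than once.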

Finally, let's see for which TSSs we have data in the mutagenesis file. To do so, we check which entries have the same TSS as the ones we extracted above (here we can look either at all of them or only at the active ones).

In [17]:
#inds = map(x-> x in active_close_df.new_tss_position, df2.new_tss_position)
inds = map(x-> x in exact_match, df2.new_tss_position)
# Treat missing membership results as "no match"
inds = map(x -> ismissing(x) ? false : x, inds);
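
The extra pass over `inds` matters because a membership test on a `missing` value returns `missing` rather than `false`, and such a mask cannot be used for row indexing. The sketch below shows the same idea on made-up data, using `coalesce` as an alternative way to replace `missing` by `false`; the `demo_*` names are assumptions for illustration.

```julia
# Made-up TSS column with one missing entry, and a list of matched positions.
demo_tss = [2185450, missing, 42]
demo_matches = [2185450, 2928035]

# `missing in demo_matches` evaluates to `missing`, so the raw mask is not all Bool.
raw_mask = map(x -> x in demo_matches, demo_tss)

# Replace every `missing` by `false` to obtain a mask usable for indexing.
mask = coalesce.(raw_mask, false)
```

If the column contains no missing values, both masks are identical and the extra step is a no-op.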

In [18]:
matching_tss = df2[inds, :]

| | name | tss_name |
| --- | --- | --- |
| | String | String |
| 1 | TSS_8330_wanner,2183472,-_scrambled5-15 | TSS_8330_wanner |
| 2 | TSS_7564_storz_regulondb,1975324,-_scrambled10-20 | TSS_7564_storz_regulondb |
| 3 | TSS_3400_regulondb,889164,-_scrambled15-25 | TSS_3400_regulondb |
| 4 | TSS_3076_wanner,793872,-_scrambled5-15 | TSS_3076_wanner |
| 5 | TSS_15537_storz_regulondb,3995930,+_scrambled30-40 | TSS_15537_storz_regulondb |
| 6 | TSS_7290_storz,1905108,-_scrambled15-25 | TSS_7290_storz |
| 7 | TSS_3076_wanner,793872,-_scrambled100-110 | TSS_3076_wanner |
| 8 | TSS_15199_wanner,3920547,-_scrambled40-50 | TSS_15199_wanner |
| 9 | TSS_6644_regulondb,1732381,+_scrambled105-115 | TSS_6644_regulondb |
| 10 | TSS_8604_storz_regulondb,2264236,+_scrambled25-35 | TSS_8604_storz_regulondb |


In [19]:
matched_tss_list = matching_tss.new_tss_position |> unique 

40-element Array{Int64,1}:
 2185450
 1977300
  889941
  794649
 3997907
 1907084
 3922524
 1734357
 2266214
 1942634
 2214671
 2928035
 1395972
       ⋮
   87969
  486489
 3069871
 1972715
 2644913
 4472550
 1226139
 3808509
 4640508
 2999918
 3536707
  189712

I don't fully understand why there are scrambles for only 40 out of 79. However, if only the active ones are considered above, then we find scrambles for all TSSs.
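
One way to investigate the gap is to list the matched TSSs that have no entry in the mutagenesis data. The sketch below shows the idea with `setdiff` on made-up positions; in the notebook the two inputs would presumably be `exact_match` and `matched_tss_list`.

```julia
# Made-up stand-ins for the close matches and the TSSs with scramble data.
demo_exact_match = [100, 200, 300, 400]
demo_with_scrambles = [200, 400]

# Positions that are close matches but have no scramble data.
without_scrambles = setdiff(demo_exact_match, demo_with_scrambles)
```

Looking these positions up in `df` would show whether they are all labelled inactive, which is what the observation above suggests.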

In [64]:
# One subplot per matched TSS, stacked vertically
p = plot(layout = (length(matched_tss_list), 1), size = (300, 300*length(matched_tss_list)))
for (i, tss) in enumerate(matched_tss_list)
    # Rows of the mutagenesis data that belong to this TSS
    rows = matching_tss.new_tss_position .== tss
    x = matching_tss[rows, :scramble_start]
    y = matching_tss[rows, :relative_exp]
    tss_name = unique(matching_tss[rows, :tss_name])[1]
    # Use the operon name from the TSS table as the subplot title
    operon = df[df.tss_name .== tss_name, :operon][1]
    scatter!(p[i, 1], x, y, title=operon, label=:none)
end
savefig(p, "test")