# Swiss-Prot length distributions

In [1]:
using BioAlignments
using FASTX
using ProgressMeter
using Printf
using CSV
using Plots
include("source/utils.jl")
include("source/io.jl")
include("source/alignment.jl")
include("source/pctid.jl")

percentid

### SCOP2 superfamily representative sequences
Download the SCOP2 superfamily representative sequences into the `data/` directory.

```
cd data
wget https://scop.mrc-lmb.cam.ac.uk/files/scop_sf_represeq_lib_latest.fa
```

For convenience, rename the file.

```
mv scop_sf_represeq_lib_latest.fa scop.fa
```

### Swiss-Prot
Download and unzip Swiss-Prot, still working inside `data/`.

```
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta sprot.fa
```

Filter Swiss-Prot to exclude sequences of 2000 or more residues and sequences containing 'X' characters.

In [None]:
sprot = readsequences("data/sprot.fa")
sprot = [seq for seq=sprot if length(seq) < 2000 && !occursin('X', seq)]
writesequences("data/sprot.fa", sprot)
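
The filter can be tried on a toy input first. This is a minimal sketch that assumes sequences are plain `String`s; `readsequences` lives in `source/io.jl`, which is not shown here, so its return type is an assumption. The `keepsequence` predicate is a hypothetical name introduced only for illustration.

```julia
# Hypothetical predicate mirroring the filter above; assumes plain strings.
keepsequence(seq::AbstractString; maxlength::Int=2000) =
    length(seq) < maxlength && !occursin('X', seq)

# Made-up sequences: one valid, one with 'X', one too long, one just under the limit.
toy = ["MKTAYIAKQR", "MKXAYIAKQR", "A"^2000, "A"^1999]
kept = filter(keepsequence, toy)
```

Only the first and last toy sequences survive, because the length test is strict (`<`) and any 'X' disqualifies a sequence.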

### Many-to-many alignment with BLAST
Construct a BLAST database from the SCOP2 superfamily representatives. First create a subdirectory `data/scopdb` to contain the files, then run `makeblastdb` to generate the database. 

```
mkdir scopdb
makeblastdb -dbtype prot -in scop.fa -out scopdb/scopsf -title scopsf
```

Align Swiss-Prot against the SCOP2 superfamily representative sequences. Set `-num_threads` to suit your system.

```
cd ..
mkdir outputs
blastp -query data/sprot.fa -db data/scopdb/scopsf -outfmt "6 qseqid qstart qend sseqid sstart send" -num_threads 12 > outputs/sprot_scopsf.out
```

### Filter Swiss-Prot to sequences with >=95% identity to a SCOP2 superfamily representative sequence
First, map each Swiss-Prot sequence ID to its index in the list of records, and do the same for the SCOP2 SF sequences.

In [None]:
sprot = readfasta("data/sprot.fa")
sprotids = record_id.(sprot)
lookup_sprot_index = Dict(sprotids[i] => i for i=1:length(sprotids))
scop = readfasta("data/scop.fa")
scopids = record_id.(scop)
lookup_scop_index = Dict(scopids[i] => i for i=1:length(scopids))
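
The same ID-to-index map can be built with `enumerate`. This is a minimal sketch on made-up identifiers, which stand in for the output of `record_id` (a helper from the local source files that is not shown here).

```julia
# Toy identifiers standing in for the output of record_id.
ids = ["P12345", "Q67890", "O11111"]

# Map each identifier to its 1-based position in the list.
lookup = Dict(id => i for (i, id) in enumerate(ids))

lookup["Q67890"]  # 2
```

If an identifier occurs more than once, the later position silently overwrites the earlier one, so the map is only reliable when identifiers are unique.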

Read the search results. Column names can be parsed from BLAST's `-outfmt` argument.

In [None]:
columns = split("6 qseqid qstart qend sseqid sstart send", ' ')[2:end]
outputs = readtable("outputs/sprot_scopsf.out", columns)
m = size(outputs)[1]
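
To make the table layout concrete, here is a sketch that parses two made-up hits with Base Julia only. `parsehits` is a hypothetical stand-in for `readtable`, whose actual implementation in the local source files is not shown, and the identifiers and coordinates are invented.

```julia
# Column names taken from the -outfmt string, dropping the leading format code "6".
columns = split("6 qseqid qstart qend sseqid sstart send", ' ')[2:end]

# Two made-up tab-separated hits in the same layout as the BLAST output.
raw = "sp|P1|A\t5\t120\t4000123\t1\t116\nsp|P2|B\t10\t90\t4000456\t3\t83\n"

# Hypothetical stand-in for readtable: one NamedTuple per row, coordinates parsed as Int.
function parsehits(text::AbstractString)
    rows = NamedTuple[]
    for line in eachline(IOBuffer(text))
        isempty(line) && continue
        f = split(line, '\t')
        push!(rows, (qseqid=String(f[1]), qstart=parse(Int, f[2]), qend=parse(Int, f[3]),
                     sseqid=String(f[4]), sstart=parse(Int, f[5]), send=parse(Int, f[6])))
    end
    return rows
end

hits = parsehits(raw)
```

Each row then exposes the query and subject coordinates by name, for example `hits[1].qstart` and `hits[1].qend`.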

We could have configured the BLASTP output table to include percent identities. But BLAST calculates `(alignment_length - gaps) / alignment_length`, whereas we want to calculate `(alignment_length - mismatches - gaps) / alignment_length`. We therefore need to rerun the alignments using BioAlignments.jl's global alignment.
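
As a worked example of the second formula, the sketch below computes it directly from two already-aligned strings, with `-` marking gaps. `percentidentity` is a hypothetical helper written for illustration; it is not the `percentid` function from `source/pctid.jl`, which is not shown here.

```julia
# Percent identity over the full alignment length, counting gap columns as non-identical.
# Assumes both strings are already aligned (equal length) with '-' marking gaps.
function percentidentity(a::AbstractString, b::AbstractString)
    length(a) == length(b) || throw(ArgumentError("aligned strings must have equal length"))
    len = length(a)
    gaps = count(((x, y),) -> x == '-' || y == '-', zip(a, b))
    mismatches = count(((x, y),) -> x != '-' && y != '-' && x != y, zip(a, b))
    return (len - mismatches - gaps) / len
end

# 8 columns, 1 gap (-/A), 1 mismatch (I/L): (8 - 1 - 1) / 8
percentidentity("MKT-YIAK", "MKTAYLAK")  # 0.75
```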

In [None]:
qsubstrings = (FASTX.sequence(sprot[lookup_sprot_index[string(outputs.qseqid[i])]])[outputs.qstart[i]:outputs.qend[i]] 
    for i=1:m)
ssubstrings = (FASTX.sequence(scop[lookup_scop_index[string(outputs.sseqid[i])]])[outputs.sstart[i]:outputs.send[i]] 
    for i=1:m)
pctid = align(Pairwise(), 
    qsubstrings, 
    ssubstrings, 
    model=GlobalAlignment(),
    verbose=true,
    formatter=x::PairwiseAlignmentResult -> percentid(x));

Organize the results and, for each Swiss-Prot sequence, keep the best alignment with percent identity of at least $95\%$.

In [None]:
n = length(sprot)
sprot_pct = zeros(Float64, n)
sprot_rep = zeros(Int, n)
for i=1:m
    pct = pctid[i]
    sprotidx = lookup_sprot_index[string(outputs.qseqid[i])]
    scopid = string(outputs.sseqid[i])
    if pct >= 0.95
        if pct > sprot_pct[sprotidx]
            sprot_pct[sprotidx] = pct
            sprot_rep[sprotidx] = parse(Int, scopid)
        end
    end
end
represented_sprotidx = [i for i=1:n if sprot_pct[i] >= 0.95]
represented_sprotid = [sprotids[i] for i=represented_sprotidx]
represented_scopidx = [lookup_scop_index[string(sprot_rep[i])] for i=represented_sprotidx]
represented_scopid = [scopids[i] for i=represented_scopidx]
represented_pct = [sprot_pct[i] for i=represented_sprotidx]
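
The best-hit selection can be checked on a small example. This is a sketch with made-up `(query index, SCOP id, identity)` triples; `besthits` is a hypothetical function name, and identities are assumed to be fractions in `[0, 1]`, matching the `0.95` threshold used above.

```julia
# Made-up (query index, SCOP id, identity) triples standing in for the alignment results.
hits = [(1, 4000123, 0.97), (1, 4000456, 0.99), (2, 4000789, 0.80), (3, 4000123, 0.95)]

# Keep, per query, the hit with the highest identity at or above the threshold.
function besthits(hits, nqueries::Int; threshold::Float64=0.95)
    bestpct = zeros(Float64, nqueries)
    bestrep = zeros(Int, nqueries)
    for (q, rep, pct) in hits
        if pct >= threshold && pct > bestpct[q]
            bestpct[q] = pct
            bestrep[q] = rep
        end
    end
    return bestpct, bestrep
end

bestpct, bestrep = besthits(hits, 3)

# Queries with at least one hit at or above the threshold.
represented = findall(>=(0.95), bestpct)
```

Query 1 keeps its second hit (0.99 beats 0.97), query 2 has no qualifying hit and keeps the placeholder values `0.0` and `0`, and query 3 is retained because the threshold is inclusive.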

Now that Swiss-Prot has been registered to the SCOP2 superfamily representative sequences, write the table to `outputs/`.

In [None]:
columns = ["sprot-index", "sprot-id", "scop-index", "scop-id", "percent identity"]
sprot_scop_registration = DataFrame(
    zip(represented_sprotidx, represented_sprotid, represented_scopidx, represented_scopid, represented_pct), 
    columns)
writeframe("outputs/sprot_scop_registration.df", sprot_scop_registration)

Write the subset of Swiss-Prot sequences with at least $95\%$ identity to a SCOP2 SF representative to `data/`.

In [None]:
sprot_scop_registration = readframe("outputs/sprot_scop_registration.df")
sprot_scop = [sprot[i] for i=sprot_scop_registration."sprot-index"]
writefasta("data/sprot_scop.fa", sprot_scop)

## Plot

In [None]:
sprot = readsequences("data/sprot.fa")
sprot_scop = readsequences("data/sprot_scop.fa")
plot(xlabel="Sequence length",
     ylabel="Occurrences",
     xrange=[50, 2000],
     xscale=:log,
     xticks=maketicks([50, 100, 200, 500, 1000, 2000]),
     legend=:topright,
     dpi=500)
histogram!(length.(sprot), label=latexstring(sprotall))
histogram!(length.(sprot_scop), label=latexstring(sprotscop))
savefig("figures/LengthDistribution_SwissProt.png")
plot!()