# Figure 1
### Sequence length distributions in Swiss-Prot and Swiss-Prot restricted by 95% identity to SCOP2 superfamily representatives

Download the SCOP2 superfamily representative sequences into the `data/` directory.

```
cd data
wget https://scop.mrc-lmb.cam.ac.uk/files/scop_sf_represeq_lib_latest.fa
```

For convenience, rename the file.

```
mv scop_sf_represeq_lib_latest.fa scop.fa
```

Build a BLAST database from the SCOP2 superfamily representatives: create a subdirectory to hold the database files, then run `makeblastdb`.

```
mkdir scopdb
makeblastdb -in scop.fa -out scopdb/scop -dbtype prot -title scop
```

Download, unzip, and rename Swiss-Prot.

```
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta sprot.fa
```

Align Swiss-Prot against the SCOP2 superfamily representative sequences. Set `-num_threads` to suit your system.

(On my Latitude 5040, the BLAST search completes in about six hours.)

```
cd ..
mkdir outputs
blastp -query data/sprot.fa -db data/scopdb/scop -word_size 4 -outfmt "6 qseqid qstart qend sstart send pident" -num_threads 12 > outputs/sprot_scopsf.out
```

In [None]:
# parse the output table
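
A minimal sketch of this parsing step, assuming the BLAST output is tab-separated with a query identifier in the first column and percent identity in the last. The name `sketch_parse_blast_table` is hypothetical, and reading "95% identity" as "at least 95" is also an assumption rather than something stated above.

```julia
# Collect the query identifiers that have at least one hit at or above
# `minidentity` percent identity.
# Assumed layout (not confirmed by the notebook): tab-separated columns,
# query identifier first, percent identity last.
function sketch_parse_blast_table(io::IO; minidentity::Float64=95.0)
    hits = Set{String}()
    for line in eachline(io)
        isempty(strip(line)) && continue      # skip blank lines
        fields = split(line, '\t')
        length(fields) >= 2 || continue       # skip malformed rows
        pident = parse(Float64, fields[end])
        if pident >= minidentity
            push!(hits, String(fields[1]))
        end
    end
    return hits
end
```

It could be applied to the search output with `open(sketch_parse_blast_table, "outputs/sprot_scopsf.out")`; the resulting set can then be used to select the Swiss-Prot records whose lengths are plotted.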


# Figure 2
## Methods
* BLAST
* EASEL
    * installation
* disjointpermutation.jl

### a) Generate Swiss-Prot variants
- random: find a permutation $\rho$ of sprot so that sequence $i$ and sequence $\rho(i)$ are from different SCOP2 superfamilies
- shuf$_{\text{all}}$, shuf: 
- rev$_{\text{all}}$, rev: 
- shufrev$_{\text{all}}$, shufrev: esl-shuffle
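
One way the `random` pairing could be built is to shuffle the indices and then repair conflicts with random swaps. The sketch below assumes a vector `labels` holding one superfamily label per sequence, which is a simplification: a sequence matching several superfamilies would need a set comparison instead. The name `sketch_disjoint_permutation` is hypothetical, and `disjointpermutation.jl` may work differently.

```julia
using Random: shuffle!, MersenneTwister

# Return a permutation ρ of 1:n such that labels[i] != labels[ρ[i]] for all i.
# Strategy (an assumption, not necessarily the repository's method):
# start from a random shuffle, then swap conflicting positions with random
# partners whenever the swap leaves both positions conflict-free.
function sketch_disjoint_permutation(labels::AbstractVector;
                                     rng=MersenneTwister(1),
                                     maxsweeps::Int=1000)
    n = length(labels)
    ρ = shuffle!(rng, collect(1:n))
    for _ in 1:maxsweeps
        conflicts = [i for i in 1:n if labels[i] == labels[ρ[i]]]
        isempty(conflicts) && return ρ
        for i in conflicts
            labels[i] == labels[ρ[i]] || continue   # fixed by an earlier swap
            j = rand(rng, 1:n)
            if labels[i] != labels[ρ[j]] && labels[j] != labels[ρ[i]]
                ρ[i], ρ[j] = ρ[j], ρ[i]
            end
        end
    end
    error("no conflict-free permutation found within $maxsweeps sweeps")
end
```

Because a fixed point pairs a sequence with itself, the result is also a derangement. The repair loop gives up with an error when no valid pairing is found, for example when a single superfamily covers more than half of the sequences.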

### b) Run paired Swiss-Prot alignments
- using BLAST for pairwise alignment

# Figure 3
## Methods
* BLAST
* tantan
    * installation
* mask.jl

### a) Generate soft masks with tantan

### b) Run hard-masked paired alignments
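
As a rough illustration of the step from soft to hard masks, the sketch below assumes the soft mask is encoded as lowercase residues in otherwise uppercase sequences. The name `sketch_hardmask` and the choice of `X` as the mask character are assumptions; `mask.jl` may do this differently.

```julia
# Replace every soft-masked (lowercase) residue with a mask character so that
# the aligner cannot match inside masked regions.
# Assumption: lowercase marks masked residues, uppercase marks unmasked ones.
sketch_hardmask(seq::AbstractString; maskchar::Char='X') =
    map(c -> islowercase(c) ? maskchar : c, seq)
```

For example, `sketch_hardmask("MKTaaaaaLLV")` leaves the flanking residues unchanged and replaces the five lowercase residues.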

# Figure 4
## Methods
* BioAlignments
* Score matrix distributions
    * parsing
    * talk to Jack
* Sequence mutation (BLOSUM.jl)
* Wilcoxon (wilcoxon.jl)

### a) Generate BLOSUM90 variants
### b) Generate mutations of Swiss-Prot variants
### c) Run paired alignments

# Figure 5
## Methods
* manacher.jl
* BioAlignments
    * Configuration for exact alignment

### a) Splice repetitive regions from Swiss-Prot using tantan masks
### b) Run exact paired alignments and Manacher's algorithm
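
For reference, a compact sketch of Manacher's algorithm, which finds the longest palindromic substring in linear time. This is an independent illustration, not the contents of `manacher.jl`; the name `sketch_longest_palindrome` is hypothetical, and the code assumes the sequence never contains the `'\0'` separator character.

```julia
# Longest palindromic substring by Manacher's algorithm.
# A separator is interleaved between residues so that odd- and even-length
# palindromes are both centered on a single position.
function sketch_longest_palindrome(s::AbstractString)
    chars = collect(s)
    n = length(chars)
    n == 0 && return ""
    t = fill('\0', 2n + 1)          # t[2i] = chars[i]; odd positions are separators
    for i in 1:n
        t[2i] = chars[i]
    end
    m = length(t)
    radius = zeros(Int, m)          # radius[i]: palindrome half-width around t[i]
    center, right = 1, 1            # rightmost palindrome found so far
    for i in 1:m
        if i < right                # reuse the mirrored radius inside [center, right]
            radius[i] = min(right - i, radius[2center - i])
        end
        while i - radius[i] - 1 >= 1 && i + radius[i] + 1 <= m &&
              t[i - radius[i] - 1] == t[i + radius[i] + 1]
            radius[i] += 1
        end
        if i + radius[i] > right
            center, right = i, i + radius[i]
        end
    end
    best = argmax(radius)           # first center with the largest radius
    len = radius[best]              # radius in t equals the length in s
    start = (best - len) ÷ 2 + 1
    return String(chars[start:start + len - 1])
end
```

When several palindromes share the maximum length, this version returns the leftmost one.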

# Figure 6
## Methods
* manacher.jl
* BioAlignments
* EASEL
* chromosome.jl

### a) Download chromosome 22 and NCBI annotations for protein-coding genes, non-coding RNA, and pseudogenes
### b) Partition chromosome 22 by NCBI annotations
### c) Generate shuffled variant of chromosome 22
### d) Gather LCS, LPS, and alignment

# Figure 7
## Methods
* manacher.jl
* BioAlignments
* trypsin.jl

### a) Extract tryptic peptides from Swiss-Prot and masked Swiss-Prot
### b) Shuffle tryptic peptides from both sources
### c) Gather LCS and LPS for each
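
The digestion step can be illustrated with a short sketch. It assumes the common trypsin rule (cleave after K or R unless the next residue is P) and drops peptides outside the length bounds. The name `sketch_tryptic_peptides` is hypothetical, and `trypticpeptides` in `source/trypsin.jl` may apply different rules.

```julia
# In-silico tryptic digest.
# Assumed rule: cleave after K or R, except when the next residue is P.
# Peptides shorter than `minlength` or longer than `maxlength` are discarded.
function sketch_tryptic_peptides(seq::AbstractString, minlength::Int, maxlength::Int)
    chars = collect(seq)
    n = length(chars)
    peptides = String[]
    start = 1
    for i in 1:n
        cleave = (chars[i] == 'K' || chars[i] == 'R') &&
                 (i == n || chars[i + 1] != 'P')
        if cleave || i == n         # the C-terminal fragment always ends a peptide
            len = i - start + 1
            if minlength <= len <= maxlength
                push!(peptides, String(chars[start:i]))
            end
            start = i + 1
        end
    end
    return peptides
end
```

With the bounds used below (5 and 100), `sketch_tryptic_peptides("AKGGGGGR", 5, 100)` discards the two-residue fragment `AK` and keeps `GGGGGR`.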

In [None]:
using FASTX: sequence, FASTARecord, FASTAWriter
using ProgressMeter: @showprogress, Progress, next!
using Base.Threads: @threads, nthreads
println("threads: ", nthreads())
include("source/io.jl")
include("source/trypsin.jl")
include("source/palindrome.jl")
sprot = readfasta("data/sprot.fa");
# mkpath does nothing if the directory already exists; the previous
# `catch IOError` bound every exception to a variable named IOError
mkpath("data/tryptic")
minlength = 5
maxlength = 100
peptidename(i::Int, j::Int, minlength::Int, maxlength::Int) = "sprot $(i) tryptic peptide $(j) minlength $(minlength) maxlength $(maxlength)"
p = Progress(length(sprot); dt=1, desc="Digesting...")
FASTAWriter(open("data/tryptic/sprot.fa", "w")) do writer
    for i=1:length(sprot)
        seq = sequence(sprot[i])
        tryptic = trypticpeptides(seq, minlength, maxlength)
        for j=1:length(tryptic)
            header = peptidename(i, j, minlength, maxlength)
            record = FASTARecord(header, tryptic[j])
            write(writer, record)
        end
        next!(p)
    end
end

In [None]:
using FASTX: sequence, FASTARecord, FASTAWriter
using ProgressMeter: @showprogress, Progress, next!
using Base.Threads: @threads, nthreads
println("threads: ", nthreads())
include("source/io.jl")
include("source/trypsin.jl")
include("source/palindrome.jl")
sprot = readfasta("data/sprot.fa");
# mkpath does nothing if the directory already exists; the previous
# `catch IOError` bound every exception to a variable named IOError
mkpath("data/tryptic")
minlength = 5
maxlength = 100
# no space after '>' so that the first word of the header is the identifier
peptidename(i::Int, j::Int, minlength::Int, maxlength::Int) = ">sprot $(i) tryptic peptide $(j) minlength $(minlength) maxlength $(maxlength)\n"
p = Progress(length(sprot); dt=1, desc="Digesting...")
open("data/tryptic/sprot.fa", "w") do writer
    for i=1:length(sprot)
        seq = sequence(sprot[i])
        tryptics = trypticpeptides(seq, minlength, maxlength)
        # scalars broadcast as-is, so i, minlength, and maxlength need no array wrappers
        headers = peptidename.(i, 1:length(tryptics), minlength, maxlength)
        # write each record as raw FASTA text: header line, then sequence line
        write.(writer, headers .* tryptics .* '\n')
        next!(p)
    end
end

threads: 12