# Figure 1
### Sequence length distributions in Swiss-Prot and in Swiss-Prot restricted by 95% identity to SCOP2 superfamily representatives

### SCOP2 superfamily database
Download the SCOP2 superfamily representative sequences into the `data/` directory.

```
cd data
wget https://scop.mrc-lmb.cam.ac.uk/files/scop_sf_represeq_lib_latest.fa
```

For convenience, rename the file.

```
mv scop_sf_represeq_lib_latest.fa scop.fa
```

### Swiss-Prot
Download and unzip Swiss-Prot.

```
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
```

Filter Swiss-Prot for sequences with lengths between 50 and 2000 and write to `data/sprot.fa`.

```
```
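The length filter above can be sketched with `awk`. This is a minimal sketch, not necessarily the original method: the function name `filter_fasta_by_length` is hypothetical, and both bounds are assumed to be inclusive. Sequences that wrap over several lines are measured as a whole, and records are written back unchanged.

```shell
# Hypothetical helper: keep FASTA records whose sequence length lies in [MIN, MAX].
# Usage: filter_fasta_by_length MIN MAX < input.fa > output.fa
filter_fasta_by_length() {
    awk -v min="$1" -v max="$2" '
        # Print the buffered record if its total length is within bounds (inclusive, an assumption).
        function emit() {
            if (header != "" && len >= min && len <= max) printf "%s\n%s", header, body
        }
        /^>/ { emit(); header = $0; body = ""; len = 0; next }   # new record starts
        { body = body $0 "\n"; len += length($0) }                # accumulate sequence lines
        END { emit() }                                            # flush the last record
    '
}
```

Run from inside `data/`, the call would look like `filter_fasta_by_length 50 2000 < uniprot_sprot.fasta > sprot.fa`.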

### BLAST alignment
Construct a BLAST database from the SCOP2 superfamily representatives. First create a subdirectory `data/scopdb` to contain the files, then run `makeblastdb` to generate the database. 

```
mkdir scopdb
makeblastdb -in scop.fa -out scopdb/scop -dbtype prot -title scop
```

Align Swiss-Prot against the SCOP2 superfamily representative sequences. Set `-num_threads` to match the number of cores available on your system.

(On my Latitude 5040, the BLAST search completes in about six hours.)

```
cd ..
mkdir -p outputs
blastp -query data/sprot.fa -db data/scopdb/scop -word_size 4 -outfmt "6 qseqid sseqid qstart qend sstart send pident" -num_threads 12 > outputs/sprot_scopsf.out
```

Parse sequence IDs from the output table and register each sequence in Swiss-Prot to a superfamily representative sequence in SCOP2.

```
```
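One way to sketch the registration step is to keep, for each Swiss-Prot query, the single hit with the highest percent identity. The function name `best_hits` is hypothetical. The sketch assumes a tab-separated table with the query ID in column 1, the subject ID in column 2, and percent identity in the last column (for example `-outfmt "6 qseqid sseqid qstart qend sstart send pident"`). It also assumes that the 95% identity named in the Figure 1 title is applied as a minimum cutoff on that column.

```shell
# Hypothetical helper: for each query, report the subject with the highest percent identity.
# Usage: best_hits [MIN_IDENTITY] < blast_table.tsv   (MIN_IDENTITY defaults to 95, an assumption)
best_hits() {
    awk -F'\t' -v minid="${1:-95}" '
        # Keep a row if it passes the cutoff and beats the best identity seen for this query.
        ($NF + 0) >= minid && (!($1 in best) || ($NF + 0) > best[$1]) {
            best[$1] = $NF + 0
            row[$1] = $1 "\t" $2 "\t" $NF
        }
        END { for (q in row) print row[q] }
    ' | sort   # awk iterates in arbitrary order, so sort for a stable result
}
```

Each output line holds the query ID, the chosen representative ID, and the percent identity. Ties are resolved in favour of the first hit encountered.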

Classify each sequence in Swiss-Prot according to its superfamily's fold. 

```
```
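The fold classification can be sketched as a table lookup. Everything named here is an assumption: `classify_by_fold` is a hypothetical helper, and the two-column tab-separated mapping file (representative ID, fold ID) would have to be derived separately from the SCOP2 classification, which is not shown. The input is a tab-separated hit table with the query ID in column 1 and the representative ID in column 2.

```shell
# Hypothetical helper: append a fold label to each (query, representative) row.
# Usage: classify_by_fold MAP.tsv < hits.tsv
#   MAP.tsv: representative ID <TAB> fold ID (assumed layout, built elsewhere)
classify_by_fold() {
    awk -F'\t' '
        FILENAME == ARGV[1] { fold[$1] = $2; next }   # first file: load the mapping
        {
            f = ($2 in fold) ? fold[$2] : "NA"        # unknown representatives get "NA"
            print $1 "\t" $2 "\t" f
        }
    ' "$1" -
}
```

Rows whose representative is missing from the mapping are kept and labelled `NA`, so no hits are dropped silently.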

# Figure 2
## Methods
* BLAST
* EASEL
    * installation
* disjointpermutation.jl

### a) Generate Swiss-Prot variants
- random: find a permutation $\rho$ of the Swiss-Prot sequences such that sequence $i$ and sequence $\rho(i)$ belong to different SCOP2 superfamilies
- shuf$_{\text{all}}$, shuf: 
- rev$_{\text{all}}$, rev: 
- shufrev$_{\text{all}}$, shufrev: esl-shuffle

### b) Run paired Swiss-Prot alignments
- using BLAST for pairwise alignment

# Figure 3
## Methods
* BLAST
* tantan
    * installation
* mask.jl

### a) Generate soft masks with tantan

### b) Run hard-masked paired alignments

# Figure 4
## Methods
* BioAlignments
* Score matrix distributions
    * parsing
    * talk to Jack
* Sequence mutation (BLOSUM.jl)
* Wilcoxon (wilcoxon.jl)

### a) Generate BLOSUM90 variants
### b) Generate mutations of Swiss-Prot variants
### c) Run paired alignments

# Figure 5
## Methods
* manacher.jl
* BioAlignments
    * Configuration for exact alignment

### a) Splice repetitive regions from Swiss-Prot using tantan masks
### b) Run exact paired alignments and Manacher's algorithm

# Figure 6
### Exact and approximate palindromes in DNA 

In [None]:
include("source/io.jl")
include("source/chromosome.jl")

### Human Chromosome 22
Retrieve human chromosome 22 from NCBI Nucleotide.
- <a href="https://www.ncbi.nlm.nih.gov/nuccore/NC_000022?report=fasta">GRCh38.p14 Primary Assembly /  NC_000022.11</a>

Download the sequence file in FASTA format using the `Send to` dropdown menu.
<img src="assets/ncbinucleotidesendto.png"/>

### Chromosome Annotations
Retrieve NCBI Gene annotations for chromosome 22 using the queries from <a href="https://en.wikipedia.org/wiki/Chromosome_22#cite_note-NCBI_coding-9">Wikipedia</a>.
- <a href="https://www.ncbi.nlm.nih.gov/gene?term=22%5BCHR%5D%20AND%20%22Homo%20sapiens%22%5BOrganism%5D%20AND%20%28%22genetype%20protein%20coding%22%5BProperties%5D%20AND%20alive%5Bprop%5D%29&cmd=DetailsSearch"> Protein-coding gene annotations </a>
- <a href="https://www.ncbi.nlm.nih.gov/gene?term=22%5BCHR%5D%20AND%20%22Homo%20sapiens%22%5BOrganism%5D%20AND%20%28%28%22genetype%20miscrna%22%5BProperties%5D%20OR%20%22genetype%20ncrna%22%5BProperties%5D%20OR%20%22genetype%20rrna%22%5BProperties%5D%20OR%20%22genetype%20trna%22%5BProperties%5D%20OR%20%22genetype%20scrna%22%5BProperties%5D%20OR%20%22genetype%20snrna%22%5BProperties%5D%20OR%20%22genetype%20snorna%22%5BProperties%5D%29%20NOT%20%22genetype%20protein%20coding%22%5BProperties%5D%20AND%20alive%5Bprop%5D%29&cmd=DetailsSearch"> Noncoding RNA annotations </a>
- <a href="https://www.ncbi.nlm.nih.gov/gene?term=22%5BCHR%5D%20AND%20%22Homo%20sapiens%22%5BOrganism%5D%20AND%20%28%22genetype%20pseudo%22%5BProperties%5D%20AND%20alive%5Bprop%5D%29&cmd=DetailsSearch"> Pseudogene annotations </a>
    
Download each annotation file in tabular text format using the `Send to` dropdown menu.
<img src="assets/ncbigenesendto.png"/>

# Figure 7
### Distribution of LPS and LCS in tryptic peptides of Swiss-Prot

In [1]:
using Base.Threads: @threads
using FASTX: sequence
using ProgressMeter: Progress, next!
using DelimitedFiles: writedlm

include("source/io.jl");
include("source/trypsin.jl");
include("source/palindrome.jl");

### LPS and LCS

In [2]:
sprotall = [record for record=readfasta("data/sprot.fa") if 50 < length(sequence(record)) <= 2000]
nsprotall = length(sprotall)
# rows: palindrome length; columns: length of the parent sequence
palindromedistribution = zeros(Int, (2000, 2000))
minlength = 5
maxlength = 100
p = Progress(nsprotall, 1, "Digesting...")
countlock = ReentrantLock()  # guards the shared count matrix against concurrent updates
@threads for i=1:nsprotall
    seq = sequence(sprotall[i])
    sequencelength = length(seq)
    trypticintervals = trypticpeptides(seq, minlength, maxlength)
    trypticsequences = (seq[start:stop] for (start,stop)=trypticintervals)
    lps = longestpalindromicsubstring.(trypticsequences)
    # different threads can hit the same matrix cell, so update it under the lock
    lock(countlock) do
        for (start, stop)=lps
            palindromelength = stop - start
            palindromedistribution[palindromelength, sequencelength] += 1
        end
    end
    next!(p)
end
writedlm("outputs/sprot_all-tryptic-palindrome-distribution.dlm", palindromedistribution)

Digesting... 100%|███████████████████████████████████████| Time: 0:01:33
