# Figure 1
### Sequence length distributions in Swiss-Prot and Swiss-Prot restricted by 95% identity to SCOP2 superfamily representatives

### SCOP2 superfamily database
Download the SCOP2 superfamily representative sequences into the `data/` directory.

```
cd data
wget https://scop.mrc-lmb.cam.ac.uk/files/scop_sf_represeq_lib_latest.fa
```

For convenience, rename the file.

```
mv scop_sf_represeq_lib_latest.fa scop.fa
```

### Swiss-Prot
Download and unzip Swiss-Prot.

```
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
```

Filter Swiss-Prot for sequences with lengths between 50 and 2000 and write to `data/sprot.fa`.

```
```
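
The filtering step can be sketched in Julia without external packages. This is a minimal illustration, not the script used for the figures: the function names `read_fasta` and `filter_fasta` are hypothetical, and the bounds 50 and 2000 are assumed to be inclusive.

```julia
# Minimal FASTA length filter (illustrative sketch; names are hypothetical).

"Read FASTA records from `io` as (header, sequence) pairs."
function read_fasta(io::IO)
    records = Tuple{String,String}[]
    header = nothing
    buf = IOBuffer()
    for line in eachline(io)
        if startswith(line, ">")
            # A new header closes the previous record, if there was one.
            header === nothing || push!(records, (header, String(take!(buf))))
            header = String(line[2:end])
        elseif header !== nothing
            print(buf, strip(line))  # sequences may wrap over several lines
        end
    end
    header === nothing || push!(records, (header, String(take!(buf))))
    return records
end

"Write records with minlen <= length <= maxlen to `outpath`; return how many were kept."
function filter_fasta(inpath, outpath; minlen=50, maxlen=2000)
    records = open(read_fasta, inpath)
    kept = 0
    open(outpath, "w") do out
        for (header, seq) in records
            if minlen <= length(seq) <= maxlen  # assumption: inclusive bounds
                println(out, ">", header)
                println(out, seq)
                kept += 1
            end
        end
    end
    return kept
end
```

Under these assumptions, calling `filter_fasta("data/uniprot_sprot.fasta", "data/sprot.fa")` from the repository root would produce the filtered file.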

### Many-to-many alignment with BLAST
Construct a BLAST database from the SCOP2 superfamily representatives. First create a subdirectory `data/scopdb` to contain the files, then run `makeblastdb` to generate the database. 

```
mkdir scopdb
makeblastdb -in scop.fa -out scopdb/scop -dbtype prot -title scop
```

Align Swiss-Prot against the SCOP2 superfamily representative sequences. Set `num_threads` for your system.

(On my Latitude 5040, the BLAST search completes in about six hours.)

```
cd ..
mkdir outputs
blastp -query data/sprot.fa -db data/scopdb/scop -word_size 4 -outfmt "6 qseqid sseqid qstart qend sstart send pident" -num_threads 12 > outputs/sprot_scopsf.out
```

Parse sequence IDs from the output table and register each sequence in Swiss-Prot to a superfamily representative sequence in SCOP2.

```
```
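
One way to do this is sketched below. It assumes a tab-separated table whose first two columns are the query and subject IDs and whose last column is the percent identity, and it assumes that "restricted by 95% identity" means keeping hits with identity of at least 95 and taking the best one per query. The helper names `accession` and `best_hits` are hypothetical.

```julia
# Assign each query to its best SCOP2 hit (illustrative sketch; assumptions in comments).

"Return the accession from an ID such as `sp|P12345|NAME_HUMAN`; other IDs are returned unchanged."
function accession(id::AbstractString)
    parts = split(id, '|')
    return length(parts) >= 2 ? String(parts[2]) : String(id)
end

"Map each query to its highest-identity subject among hits with identity >= `minident`."
function best_hits(io::IO; minident=95.0)
    best = Dict{String,Tuple{String,Float64}}()
    for line in eachline(io)
        isempty(strip(line)) && continue
        fields = split(line, '\t')
        # Assumption: column 1 = query ID, column 2 = subject ID, last column = percent identity.
        query, subject = accession(fields[1]), String(fields[2])
        ident = parse(Float64, fields[end])
        ident >= minident || continue
        if !haskey(best, query) || ident > best[query][2]
            best[query] = (subject, ident)
        end
    end
    return best
end
```

For example, `open(best_hits, "outputs/sprot_scopsf.out")` would return a dictionary from Swiss-Prot accession to a `(subject, identity)` pair.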

Classify each sequence in Swiss-Prot according to its superfamily's fold. 

```
```

# Figure 2
### BLAST score distributions

### Swiss-Prot variants

### Paired alignment with BLAST

# Figure 3
### BLAST score by sequence length

### Masking with tantan

### Converting tantan masks to BLAST format

### Paired alignment with BLAST

# Figure 4
### Reverse affinity of homologs

### BLOSUM90 variants

### Inducing mutation

### Paired alignment with BLAST

# Figure 5
### LPS and LCS in Swiss-Prot

### Splice repetitive regions from Swiss-Prot using tantan masks

### Generate LPS and gapless exact alignments

# Figure 6
### Exact and approximate palindromes in DNA 

### Human Chromosome 22
Retrieve chromosome 22 on NCBI Nucleotide.
- <a href="https://www.ncbi.nlm.nih.gov/nuccore/NC_000022?report=fasta">GRCh38.p14 Primary Assembly /  NC_000022.11</a>

Download the sequence file in FASTA format using the `Send to` dropdown menu and save it as `data/ncbi_chr22_sequence.fasta`.
<img src="assets/ncbinucleotidesendto.png"/>

### NCBI Gene Annotations
Retrieve annotations for chromosome 22, e.g., using the NCBI Gene queries from <a href="https://en.wikipedia.org/wiki/Chromosome_22#cite_note-NCBI_coding-9">Wikipedia</a>.
- <a href="https://www.ncbi.nlm.nih.gov/gene?term=22%5BCHR%5D%20AND%20%22Homo%20sapiens%22%5BOrganism%5D%20AND%20%28%22genetype%20protein%20coding%22%5BProperties%5D%20AND%20alive%5Bprop%5D%29&cmd=DetailsSearch"> Protein-coding gene annotations </a>
- <a href="https://www.ncbi.nlm.nih.gov/gene?term=22%5BCHR%5D%20AND%20%22Homo%20sapiens%22%5BOrganism%5D%20AND%20%28%28%22genetype%20miscrna%22%5BProperties%5D%20OR%20%22genetype%20ncrna%22%5BProperties%5D%20OR%20%22genetype%20rrna%22%5BProperties%5D%20OR%20%22genetype%20trna%22%5BProperties%5D%20OR%20%22genetype%20scrna%22%5BProperties%5D%20OR%20%22genetype%20snrna%22%5BProperties%5D%20OR%20%22genetype%20snorna%22%5BProperties%5D%29%20NOT%20%22genetype%20protein%20coding%22%5BProperties%5D%20AND%20alive%5Bprop%5D%29&cmd=DetailsSearch"> Noncoding RNA annotations </a>
- <a href="https://www.ncbi.nlm.nih.gov/gene?term=22%5BCHR%5D%20AND%20%22Homo%20sapiens%22%5BOrganism%5D%20AND%20%28%22genetype%20pseudo%22%5BProperties%5D%20AND%20alive%5Bprop%5D%29&cmd=DetailsSearch"> Pseudogene annotations </a>
    
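Download each annotation file in tabular text format using the `Send to` dropdown menu and save the files as `data/ncbi_chr22_protein.txt`, `data/ncbi_chr22_rna.txt`, and `data/ncbi_chr22_pseudo.txt`.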
<img src="assets/ncbigenesendto.png"/>

### Masking with tantan
Run `tantan` on `data/ncbi_chr22_sequence.fasta` to mask repetitive and low-complexity regions, and store the masked file in `data/ncbi_chr22_maskedsequence.fasta`. By default tantan marks masked regions with lowercase letters; the `-x N` argument makes it mask with 'N' characters instead. The subsequent script automatically removes these along with the 'N' characters already present in the chromosome.

```
tantan -x N data/ncbi_chr22_sequence.fasta > data/ncbi_chr22_maskedsequence.fasta
```
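
The removal of 'N' characters can be illustrated with a short sketch. How `fig6.jl` does this is not shown here, so the helper name `unmasked_region` and the 1-based inclusive coordinates are assumptions.

```julia
# Extract an annotated region and drop masked or unknown bases (illustrative sketch).

"Return the 1-based inclusive region `start:stop` of `seq` with every 'N' removed."
function unmasked_region(seq::AbstractString, start::Integer, stop::Integer)
    region = SubString(seq, start, stop)  # assumption: ASCII nucleotide sequence
    return filter(!=('N'), region)
end
```

Because masked bases are deleted rather than skipped, the length of a masked region is shorter than its annotated length.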

### Distribution of LPS in chromosome 22
Using the `lps` flag, generate the distribution of maximal palindromes in protein-coding, noncoding RNA, and pseudogene regions of chromosome 22.
```
julia -t 12 fig6.jl data/ncbi_chr22_sequence.fasta data/ncbi_chr22_protein.txt outputs/fig6_chr22palindromes_protein.dlm lps

julia -t 12 fig6.jl data/ncbi_chr22_sequence.fasta data/ncbi_chr22_pseudo.txt outputs/fig6_chr22palindromes_pseudo.dlm lps

julia -t 12 fig6.jl data/ncbi_chr22_sequence.fasta data/ncbi_chr22_rna.txt outputs/fig6_chr22palindromes_rna.dlm lps
```
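
For reference, the longest palindromic substring (LPS) of a sequence can be computed by expanding around each possible center. The sketch below is an assumption about the quantity being measured, not the code in `fig6.jl`: it treats a palindrome as a string that reads the same in both directions, whereas palindromes in DNA are often defined by reverse complementarity. The function name `lps_length` is hypothetical.

```julia
# Longest palindromic substring by center expansion, O(n^2) time (illustrative sketch).

"Length of the longest substring of `seq` that reads the same forwards and backwards."
function lps_length(seq::AbstractString)
    chars = collect(seq)
    n = length(chars)
    best = 0
    for center in 1:(2n - 1)
        # Odd centers sit on a character; even centers sit between two characters.
        left = (center + 1) ÷ 2
        right = left + (iseven(center) ? 1 : 0)
        while left >= 1 && right <= n && chars[left] == chars[right]
            left -= 1
            right += 1
        end
        # The palindrome spans left+1 through right-1.
        best = max(best, right - left - 1)
    end
    return best
end
```

For the sequence `"GATTACA"` the longest such substring is `"ATTA"`, so `lps_length("GATTACA")` returns 4.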

Generate the LPS distribution with tantan masks. 

```
julia -t 12 fig6.jl data/ncbi_chr22_maskedsequence.fasta data/ncbi_chr22_protein.txt outputs/fig6_maskedchr22palindromes_protein.dlm lps

julia -t 12 fig6.jl data/ncbi_chr22_maskedsequence.fasta data/ncbi_chr22_pseudo.txt outputs/fig6_maskedchr22palindromes_pseudo.dlm lps

julia -t 12 fig6.jl data/ncbi_chr22_maskedsequence.fasta data/ncbi_chr22_rna.txt outputs/fig6_maskedchr22palindromes_rna.dlm lps
```

For a control, rerun each job with the `shuffle` keyword to shuffle each region before the LPS is calculated.
```
julia -t 12 fig6.jl data/ncbi_chr22_sequence.fasta data/ncbi_chr22_protein.txt outputs/fig6_shuffledchr22palindromes_protein.dlm lps shuffle

julia -t 12 fig6.jl data/ncbi_chr22_sequence.fasta data/ncbi_chr22_pseudo.txt outputs/fig6_shuffledchr22palindromes_pseudo.dlm lps shuffle

julia -t 12 fig6.jl data/ncbi_chr22_sequence.fasta data/ncbi_chr22_rna.txt outputs/fig6_shuffledchr22palindromes_rna.dlm lps shuffle
```

### Plot

```
using Plots
using DataFrames
using DelimitedFiles
```

```
chr22palindromes_protein = readdlm("outputs/fig6_chr22palindromes_protein.dlm", Int)
chr22palindromes_pseudo = readdlm("outputs/fig6_chr22palindromes_pseudo.dlm", Int)
chr22palindromes_rna = readdlm("outputs/fig6_chr22palindromes_rna.dlm", Int)
maskedchr22palindromes_protein = readdlm("outputs/fig6_maskedchr22palindromes_protein.dlm", Int)
maskedchr22palindromes_pseudo = readdlm("outputs/fig6_maskedchr22palindromes_pseudo.dlm", Int)
maskedchr22palindromes_rna = readdlm("outputs/fig6_maskedchr22palindromes_rna.dlm", Int)
shuffledchr22palindromes_protein = readdlm("outputs/fig6_shuffledchr22palindromes_protein.dlm", Int)
shuffledchr22palindromes_pseudo = readdlm("outputs/fig6_shuffledchr22palindromes_pseudo.dlm", Int)
shuffledchr22palindromes_rna = readdlm("outputs/fig6_shuffledchr22palindromes_rna.dlm", Int);
```

```
seqlengths_and_palindromes = [
    chr22palindromes_protein,
    chr22palindromes_pseudo,
    chr22palindromes_rna,
    maskedchr22palindromes_protein,
    maskedchr22palindromes_pseudo,
    maskedchr22palindromes_rna,
    shuffledchr22palindromes_protein,
    shuffledchr22palindromes_pseudo,
    shuffledchr22palindromes_rna]
labels = [
    "protein-coding genes",
    "pseudogenes",
    "non-coding rna",
    "masked protein-coding genes",
    "masked pseudogenes",
    "masked non-coding rna",
    "shuffled protein-coding genes",
    "shuffled pseudogenes",
    "shuffled non-coding rna"]
colors = [:red, :darkorange, :maroon, :cyan, :lightgreen, :lightblue, :yellow, :pink, :violet];
```

```
include("source/utils.jl")
function plot_distribution_table!(distribution::Matrix{Int}, label, color)
    seq_lengths, lps_lengths = unzip(Tuple.(eachrow(distribution)))
    scatter!(seq_lengths, lps_lengths, label=label, color=color, ms=3, ma=0.5)
end;
```

```
plot(ylabel="longest palindromic substring",
     xlabel="sequence length",
     xscale=:log,
     legend=:topleft,
     dotsize=100,)
subset = 1:length(seqlengths_and_palindromes)
plot_distribution_table!.(seqlengths_and_palindromes[subset], labels[subset], colors[subset])
savefig("figures/fig6_chr22palindromes.png")
plot!()
```

# Figure 7
### Distribution of LPS in tryptic peptides of Swiss-Prot

### Generate LPS distributions of tryptic peptides
Generate LPS distributions of tryptic peptides with lengths between `5` and `50` in `data/sprot.fa`, `data/maskedsprot.fa`, and `data/shuf.fa`, and write them to `outputs/fig7_trypticpalindromes_sprot.dlm`, `outputs/fig7_trypticpalindromes_maskedsprot.dlm`, and `outputs/fig7_trypticpalindromes_shuf.dlm`.
```
julia -t 12 fig7.jl data/sprot.fa outputs/fig7_trypticpalindromes_sprot.dlm 5 50

julia -t 12 fig7.jl data/maskedsprot.fa outputs/fig7_trypticpalindromes_maskedsprot.dlm 5 50

julia -t 12 fig7.jl data/shuf.fa outputs/fig7_trypticpalindromes_shuf.dlm 5 50
```
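
The digestion step can be sketched as follows. The cleavage rule used by `fig7.jl` is not stated here, so this sketch assumes the common trypsin rule (cleave after K or R unless the next residue is P). The function names `tryptic_peptides` and `in_range` are hypothetical.

```julia
# In-silico tryptic digestion with a length filter (illustrative sketch).

"Split `protein` after every K or R, except when the next residue is P (assumed rule)."
function tryptic_peptides(protein::AbstractString)
    residues = collect(protein)
    peptides = String[]
    start = 1
    for i in eachindex(residues)
        is_site = residues[i] in ('K', 'R') &&
                  (i == length(residues) || residues[i + 1] != 'P')
        if is_site
            push!(peptides, String(residues[start:i]))
            start = i + 1
        end
    end
    # Keep the C-terminal remainder if the protein does not end at a cleavage site.
    start <= length(residues) && push!(peptides, String(residues[start:end]))
    return peptides
end

"Keep peptides whose length lies between `minlen` and `maxlen`, inclusive."
in_range(peptides; minlen=5, maxlen=50) =
    filter(p -> minlen <= length(p) <= maxlen, peptides)
```

With this rule, `tryptic_peptides("AAKRPGGRCC")` yields `["AAK", "RPGGR", "CC"]`, and `in_range` with the default bounds keeps only `"RPGGR"`.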

Also generate the LPS distributions for shuffled tryptic peptides in `data/sprot.fa` and `data/maskedsprot.fa`. The keyword `shuffle` shuffles each tryptic peptide before it is passed to the LPS algorithm. This lets us compare palindromicity that results from low complexity (e.g., dipeptide repeats are always palindromic) against palindromicity that results from patterns that are not permutation invariant.
```
julia -t 12 fig7.jl data/sprot.fa outputs/fig7_shuffledtrypticpalindromes_sprot.dlm 5 50 shuffle

julia -t 12 fig7.jl data/maskedsprot.fa outputs/fig7_shuffledtrypticpalindromes_maskedsprot.dlm 5 50 shuffle
```
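
The idea behind the shuffled control can be illustrated on a single peptide. Shuffling preserves amino acid composition, so any LPS excess of the original over its shuffled copies must come from residue order. The sketch below averages over many shuffles for illustration, whereas the `shuffle` keyword shuffles each peptide once. The function names, the number of trials, and the fixed seed are assumptions.

```julia
using Random

# Compare a peptide's LPS with the mean LPS of its shuffled copies (illustrative sketch).

"Brute-force LPS length; adequate for peptides of at most 50 residues."
function lps_bruteforce(chars::Vector{Char})
    n = length(chars)
    # Try the longest windows first and return at the first palindrome found.
    for len in n:-1:1, i in 1:(n - len + 1)
        window = view(chars, i:(i + len - 1))
        window == reverse(window) && return len
    end
    return 0
end

"Mean LPS length over `trials` random permutations of `peptide`."
function shuffled_lps_mean(peptide::AbstractString; trials=100, rng=MersenneTwister(1))
    chars = collect(peptide)
    return sum(lps_bruteforce(shuffle(rng, chars)) for _ in 1:trials) / trials
end
```

For the dipeptide repeat `"GAGAGAGA"`, `lps_bruteforce(collect("GAGAGAGA"))` returns 7, and `shuffled_lps_mean("GAGAGAGA")` gives the composition-only baseline to compare against.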

### Plot

```
using Plots
using DataFrames
using DelimitedFiles
using StatsBase
```

```
minlength = 5
maxlength = 50
sprot_dist = readdlm("outputs/fig7_trypticpalindromes_sprot.dlm", Int)
maskedsprot_dist = readdlm("outputs/fig7_trypticpalindromes_maskedsprot.dlm", Int)
sprotshuffled_dist = readdlm("outputs/fig7_shuffledtrypticpalindromes_sprot.dlm", Int)
maskedsprotshuffled_dist = readdlm("outputs/fig7_shuffledtrypticpalindromes_maskedsprot.dlm", Int)
shuf_dist = readdlm("outputs/fig7_trypticpalindromes_shuf.dlm", Int);
```

```
function plot_distribution_matrix!(distribution::Matrix{Int}, label)
    global minlength, maxlength
    @assert maxlength == size(distribution)[1]
    distribution = maxlength .* distribution ./ sum(distribution)
    # calculate mean length palindrome for each sequence length
    avg_pal = [sum(collect(1:maxlength) .* distribution[seq, :]) for seq=minlength:maxlength]
    scatter!(minlength:maxlength, avg_pal, label=label)
end;
```
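
As a point of comparison, the mean LPS length can also be computed by normalizing each row separately. This is an alternative sketch, not the notebook's method: the function above scales by the total over the whole matrix, and the two results coincide only when every row holds the same number of peptides. The sketch assumes that entry `[s, p]` counts peptides of length `s` whose LPS has length `p`; the name `rowwise_mean_lps` is hypothetical.

```julia
# Per-row mean LPS length from a count matrix (illustrative alternative normalization).

"Mean LPS length for each row of `counts`; rows with no peptides give NaN."
function rowwise_mean_lps(counts::AbstractMatrix{<:Integer})
    lps_lengths = 1:size(counts, 2)
    # Weighted average of the LPS lengths, weighted by the counts in that row.
    return [sum(lps_lengths .* counts[s, :]) / sum(counts[s, :]) for s in axes(counts, 1)]
end
```

Under that assumption, `rowwise_mean_lps(sprot_dist)[minlength:maxlength]` could be plotted against `minlength:maxlength` in the same way.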

```
plot(title="LPS in tryptic peptides length 5-50",
     ylabel="longest palindromic substring",
     yrange=(2, 4),
     xlabel="sequence length",
     xrange=(5, 60),
     xscale=:log,
     legend=:topleft)
plot_distribution_matrix!(shuf_dist, "tryptic substrings of random sequences")
plot_distribution_matrix!(sprot_dist, "tryptic peptides in Swiss-Prot")
plot_distribution_matrix!(maskedsprot_dist, "tryptic peptides in masked Swiss-Prot")
savefig("figures/fig7_trypticpeptides.png")
plot!()
```

```
function plot_difference!(dist_A::Matrix{Int}, dist_B::Matrix{Int}, label)
    global minlength, maxlength
    @assert maxlength == size(dist_A)[1] == size(dist_B)[1]
    dist_A = maxlength .* dist_A ./ sum(dist_A)
    dist_B = maxlength .* dist_B ./ sum(dist_B)
    avg_pal_A =  [sum(collect(1:maxlength) .* dist_A[seq, :]) for seq=minlength:maxlength]
    avg_pal_B =  [sum(collect(1:maxlength) .* dist_B[seq, :]) for seq=minlength:maxlength]
    # pass the sequence lengths explicitly so the x-axis starts at minlength rather than 1
    plot!(minlength:maxlength, avg_pal_A .- avg_pal_B, label=label)
end;
```

```
plot(title="difference in means of LPS in tryptic palindromes\n and shuffled tryptic palindromes",
     ylabel="longest palindromic substring",
     xlabel="sequence length",
     xscale=:log,
     xrange=(5, 50),
     legend=:topleft)
plot_difference!(sprot_dist, sprotshuffled_dist, "Swiss-Prot")
plot_difference!(maskedsprot_dist, maskedsprotshuffled_dist, "Masked Swiss-Prot")
savefig("figures/fig7_reverseaffinity.png")
plot!()
```