Here you learn how to run external repeat detection software on your favourite sequence file and output the results.

Requirements for this tutorial:

- Install TRAL. TRAL ships with the data needed for this tutorial.
- Install XSTREAM. You can also install one or more other tandem repeat detectors instead.

## Read in your sequences.

In [2]:
import os
from tral.sequence import sequence
from tral.paths import PACKAGE_DIRECTORY

proteome_HIV = os.path.join(PACKAGE_DIRECTORY, "examples", "data", "HIV-1_388796.faa")
sequences_HIV = sequence.Sequence.create(file = proteome_HIV, input_format = 'fasta')

## Run the external repeat detector.

Detect tandem repeats on a single sequence with T-REKS:

In [3]:
tandem_repeats = sequences_HIV[0].detect(denovo = True, detection = {"detectors": ["T-REKS"]})

As an example, the first detected putative tandem repeat looks as follows (interpretation):

In [5]:
print(tandem_repeats.repeats[0])

> begin:385 l_effective:7 n:2
LFNSTKLE
LFNSST-N
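
To make the header line easier to read, here is a minimal sketch that recomputes `n` and `l_effective` from the aligned repeat units. It assumes that `l_effective` counts the alignment columns in which fewer than half of the entries are gaps. That rule matches the outputs shown in this tutorial, but it is an assumption, not a statement of TRAL's exact implementation. The helper name `effective_length` is hypothetical.

```python
def effective_length(msa_rows):
    """Count alignment columns in which fewer than half of the entries are gaps."""
    columns = zip(*msa_rows)  # transpose the repeat units into alignment columns
    return sum(1 for column in columns if 2 * column.count("-") < len(column))


msa = ["LFNSTKLE", "LFNSST-N"]  # repeat units copied from the output above
n = len(msa)                    # number of repeat units
print(n, effective_length(msa))  # prints: 2 7
```

The seventh column contains one gap in two rows, so it is not counted, which gives 7 rather than 8.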


Detect tandem repeats on a single sequence with XSTREAM:

In [6]:
tandem_repeats = sequences_HIV[0].detect(denovo = True, detection = {"detectors": ["XSTREAM"]})

In [7]:
print(tandem_repeats.repeats[0])

> begin:316 l_effective:4 n:2
GDII
GDIR


Detect tandem repeats on all sequences with all de novo tandem repeat detection algorithms defined in the configuration file:

In [8]:
for iSequence in sequences_HIV:
    iTandem_repeats = iSequence.detect(denovo = True)
    iSequence.set_repeatlist(iTandem_repeats, "denovo")

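Detect tandem repeats on all sequences with a sequence profile HMM. `detect` expects a list of HMM objects in `lHMM`; passing a file path instead raises `Exception: The lHMM value is not a list.`

In [9]:
from tral.hmm import hmm

# Assumes Kelch_1.hmm is in the package's examples/data directory,
# as in the absolute path used originally.
path_Kelch_HMM = os.path.join(PACKAGE_DIRECTORY, "examples", "data", "Kelch_1.hmm")
HMM_Kelch = hmm.HMM.create(input_format = 'hmmer', file = path_Kelch_HMM)
for iSequence in sequences_HIV:
    iTandem_repeats = iSequence.detect(lHMM = [HMM_Kelch])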

In [30]:
for iSequence in sequences_HIV:
    print(iSequence)
    iSequence.get_repeatlist("denovo")

<tral.sequence.sequence.Sequence object at 0x7f566444bf28>
<tral.sequence.sequence.Sequence object at 0x7f566444be48>
<tral.sequence.sequence.Sequence object at 0x7f566444bf98>
<tral.sequence.sequence.Sequence object at 0x7f562d917898>
<tral.sequence.sequence.Sequence object at 0x7f562d9178d0>
<tral.sequence.sequence.Sequence object at 0x7f566444beb8>
<tral.sequence.sequence.Sequence object at 0x7f566444bf60>
<tral.sequence.sequence.Sequence object at 0x7f562d91e780>
<tral.sequence.sequence.Sequence object at 0x7f562d91e7f0>


Different algorithms usually detect different tandem repeats. These are the absolute numbers of detections in the HIV proteome for a few algorithms:

**Warning: not all of the numbers below are correct. This part of the tutorial currently does not seem to work reliably.**

In [4]:
len([i for j in sequences_HIV for i in j.get_repeatlist('denovo').repeats if i.TRD == "T-REKS"])

4

In [5]:
# The count is not always the same: the original tutorial reports 6 repeats,
# an earlier run here also gave 6, and now it is 4.
# The number of repeats may differ between runs.
len([i for j in sequences_HIV for i in j.get_repeatlist('denovo').repeats if i.TRD == "TRUST"])

6

In [6]:
len([i for j in sequences_HIV for i in j.get_repeatlist('denovo').repeats if i.TRD == "XSTREAM"])

9
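
The three per-detector counts above can also be gathered in a single pass. The sketch below uses `collections.Counter` on stand-in objects that only carry a `TRD` attribute; the `SimpleNamespace` objects and the `repeats_per_sequence` list are hypothetical placeholders. With TRAL, each inner list would come from `get_repeatlist('denovo').repeats`.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for TRAL repeats: only the TRD attribute is modelled.
repeats_per_sequence = [
    [SimpleNamespace(TRD="T-REKS"), SimpleNamespace(TRD="XSTREAM")],
    [SimpleNamespace(TRD="XSTREAM")],
    [],  # a sequence without any detected repeat
]

# Count the repeats per detector across all sequences.
counts = Counter(
    repeat.TRD for repeats in repeats_per_sequence for repeat in repeats
)
print(counts["T-REKS"], counts["XSTREAM"], counts["TRUST"])  # prints: 1 2 0
```

A `Counter` returns 0 for detectors that found nothing, so no special case is needed for missing keys.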

As an example, T-REKS detects the following repeat in the second HIV sequence (interpretation):

In [7]:
print([i for i in sequences_HIV[1].get_repeatlist('denovo').repeats if i.TRD == "T-REKS"][0])
# This output seems correct.

> begin:449 l_effective:10 n:2
RPEPTAPP-ESL
RPEPTAPPPES-


## Output the detected tandem repeats.

Write a single repeat_list to .tsv format:

In [12]:
path_to_output_tsv_file = "outputfile.tsv" # Choose your path and filename
tandem_repeats.write(output_format = "tsv", file = path_to_output_tsv_file)

The created .tsv looks as follows (interpretation):

In [17]:
!cat outputfile.tsv # The output differs from the original tutorial because a different detector (TRD) was used.

begin	msa_original	l_effective	n_effective	repeat_region_length	divergence	pvalue
385	LFNSTKLE,LFNSST-N	7	2.0	15	None	None
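
The .tsv file can be read back with the standard `csv` module. The sketch below parses the two lines shown above from an in-memory string. The helper `parse_row` and the choice of converted fields are illustrative assumptions, not part of TRAL.

```python
import csv
import io

# The header and the single record shown above, as tab-separated text.
tsv_text = (
    "begin\tmsa_original\tl_effective\tn_effective\t"
    "repeat_region_length\tdivergence\tpvalue\n"
    "385\tLFNSTKLE,LFNSST-N\t7\t2.0\t15\tNone\tNone\n"
)


def parse_row(row):
    """Convert one TSV record into typed values (hypothetical helper)."""
    return {
        "begin": int(row["begin"]),
        "units": row["msa_original"].split(","),  # one entry per repeat unit
        "l_effective": int(row["l_effective"]),
        "n_effective": float(row["n_effective"]),
        # Missing scores are written as the literal string "None".
        "pvalue": None if row["pvalue"] == "None" else float(row["pvalue"]),
    }


reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
records = [parse_row(row) for row in reader]
print(records[0]["begin"], records[0]["units"])  # prints: 385 ['LFNSTKLE', 'LFNSST-N']
```

To read the real file, replace `io.StringIO(tsv_text)` with an open file handle, for example `open("outputfile.tsv", newline="")`.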


Write a single repeat_list to .pickle format:

In [14]:
path_to_output_pickle_file = "outputfile.pickle"  # Choose your path and filename
tandem_repeats.write(output_format = "pickle", file = path_to_output_pickle_file)

A repeat_list in pickle format can easily be read in again:

In [15]:
from tral.repeat_list import repeat_list
tandem_repeats = repeat_list.RepeatList.create(input_format = "pickle", file = path_to_output_pickle_file)

Save multiple sequences together with their tandem repeat annotations:

In [16]:
import pickle
path_to_output_pickle_file = "outputfile.pickle" # Choose your path and filename
with open(path_to_output_pickle_file, 'wb') as fh:
    pickle.dump(sequences_HIV, fh)
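
Sequences saved this way can be loaded again with `pickle.load`. The sketch below shows the round trip on a stand-in list, so it runs without TRAL. The `sequences` list and the temporary file name are hypothetical placeholders for `sequences_HIV` and your own path.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder for sequences_HIV.
sequences = ["stand-in sequence 1", "stand-in sequence 2"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sequences.pickle")

    # Save the whole list in one file.
    with open(path, "wb") as fh:
        pickle.dump(sequences, fh)

    # Load it again; the result is an equal, independent copy.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == sequences)  # prints: True
```

When real TRAL objects are unpickled, TRAL must be importable in the loading session, because pickle stores references to classes rather than the classes themselves. Only unpickle files from sources you trust.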