# Tardigrades: from genestealers to space marines 

**Project 4.**\
Lab journal by Anna Ogurtsova

---

### Step 0. Obtaining data. Genome sequence


For this project, the genome sequence of [*Ramazzottius varieornatus*](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=947166) was used.\
Taxonomy ID: 947166

In [None]:
%%bash
mkdir raw_data
cd raw_data/

### Step 1. Structural annotation

Functional annotation comprises a search for homologous proteins and conserved domains to assign potential functions to the genes and proteins found.

The extraction of protein sequences (fasta) from the prediction output (usually GFF/GTF) is needed.

I use gene prediction results obtained with AUGUSTUS and the [getAnnoFasta.pl](http://augustus.gobics.de/binaries/scripts/getAnnoFasta.pl) script, which creates FASTA sequence files from the AUGUSTUS output.

1. To calculate the number of predicted proteins:

In [None]:
!cat raw_data/augustus.whole.aa | grep '^>' | wc -l 

AUGUSTUS predicted 16435 proteins.

Let’s find how many peptides were identified by tandem mass spectrometry.

In [None]:
!cat raw_data/peptides.fa | grep '^>' | wc -l

Tandem mass spectrometry (TMS) identified 43 peptides.
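The two counts above rely on the fact that every FASTA record starts with a `>` header line. A minimal Python sketch of the same idea is shown below; the helper name `count_fasta_records` and the toy sequences are illustrative assumptions, not part of the original notebook.

```python
import io

def count_fasta_records(handle):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))

# Toy input standing in for raw_data/augustus.whole.aa (hypothetical sequences)
toy_fasta = io.StringIO(">g1.t1\nMKV\nLLA\n>g2.t1\nMSS\n")
n_records = count_fasta_records(toy_fasta)
print(n_records)
```

On the real data the same function could be called on an opened file, for example `with open("raw_data/peptides.fa") as fh: count_fasta_records(fh)`.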

### Step 2. Physical localization

Tandem mass spectrometry (TMS) provided a set of peptides that are presumed to be fragments of different DNA-binding proteins. The goal now is to match these peptide fragments to *R. varieornatus* protein sequences.

In [None]:
!mkdir database
!makeblastdb -in raw_data/augustus.whole.aa -dbtype prot -out database/tardigrade_db

Search for peptides in local DB

In [None]:
!blastp -db database/tardigrade_db -query raw_data/peptides.fa -outfmt "6 sseqid pident"  -out result_custom
!blastp -db database/tardigrade_db -query raw_data/peptides.fa -outfmt "6 sseqid"  -out all_ids.txt
!wc -l result_custom #output 118 

BLAST returned 118 hits, with the percentage of identical positions varying from 53.846% to 100%. \
The next step is to extract the matched proteins from the FASTA file with the 16435 *R. varieornatus* protein sequences.
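The hit count and the identity range reported above can be read directly from the two-column tabular output (`sseqid`, `pident`). The following is a hedged sketch using only the standard library; `summarize_hits` is a hypothetical helper and the toy rows are invented for illustration, not taken from `result_custom`.

```python
import csv
import io

def summarize_hits(handle):
    """Summarize a tab-separated BLAST table with two columns: sseqid, pident."""
    rows = [(row[0], float(row[1]))
            for row in csv.reader(handle, delimiter="\t") if row]
    idents = [pident for _, pident in rows]
    return {
        "n_hits": len(rows),                           # one row per alignment
        "n_subjects": len({sseqid for sseqid, _ in rows}),  # distinct proteins hit
        "min_pident": min(idents),
        "max_pident": max(idents),
    }

# Toy stand-in for result_custom (hypothetical values)
toy_hits = io.StringIO("g10.t1\t100.000\ng10.t1\t60.000\ng22.t1\t100.000\n")
summary = summarize_hits(toy_hits)
print(summary)
```

Keeping the number of hits and the number of distinct subject proteins separate makes it explicit that several peptides can match the same protein.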

Extraction of the unique protein IDs from the BLAST hits with the pandas library

In [None]:
%%python
import pandas as pd
import numpy as np
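# all_ids.txt has no header row; without header=None the first ID would be consumed as a column name
df_all_ids = pd.read_csv('all_ids.txt', sep='\t', header=None, names=['ID'])
# keep each protein ID once, in order of first appearance
x = df_all_ids['ID'].unique()
np.savetxt('unique.txt', x, fmt='%s')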


Among the 118 BLAST hits, only 34 proteins were unique.\
The 34 proteins of interest were then extracted with the [seqtk subseq utility](https://github.com/lh3/seqtk).

The obtained seq_subset_unique.fasta was further used for localization prediction.
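The extraction step amounts to keeping only the FASTA records whose ID appears in a list. The sketch below illustrates that filtering in plain Python; it is not the command used in this journal, and `subset_fasta`, `record_id` and the toy sequences are assumptions made for the example.

```python
import io

def record_id(header):
    """Return the first whitespace-separated word of a FASTA header (without '>')."""
    parts = header.split()
    return parts[0] if parts else ""

def subset_fasta(fasta_handle, wanted_ids):
    """Yield (header, sequence) for records whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    header, chunks = None, []
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None and record_id(header) in wanted:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    # Emit the last record if it was requested
    if header is not None and record_id(header) in wanted:
        yield header, "".join(chunks)

# Toy stand-ins for the protein FASTA and unique.txt (hypothetical content)
toy_fasta = io.StringIO(">g1.t1\nMKV\nLLA\n>g2.t1\nMSS\n>g3.t1\nMTT\n")
selected = list(subset_fasta(toy_fasta, ["g1.t1", "g3.t1"]))
for header, seq in selected:
    print(f">{header}\n{seq}")
```

Matching on the first word of the header means that any description after the ID does not affect the lookup.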


### Step 4. Localization prediction

**4a. WoLF PSORT**

[WoLF PSORT](https://wolfpsort.hgc.jp/) predicts the subcellular localization of proteins from their amino acid sequence, using features such as sorting signals (for example an N-terminal signal peptide), amino acid composition and functional motifs.

Results:
```Output
g702.t1 details extr: 29, plas: 2, lyso: 1
g1285.t1 details extr: 25, plas: 5, mito: 1, lyso: 1
g3428.t1 details mito: 18, cyto: 11, extr: 2, nucl: 1
g3679.t1 details extr: 26, mito: 2, lyso: 2, plas: 1, E.R.: 1
g4106.t1 details E.R.: 14.5, E.R._golg: 9.5, extr: 7, golg: 3.5, lyso: 3, pero: 2, plas: 1, mito: 1
g5237.t1 details plas: 24, mito: 8
g5467.t1 details extr: 27, plas: 4, mito: 1
g5502.t1 details extr: 31, lyso: 1
g5503.t1 details extr: 29, plas: 1, mito: 1, lyso: 1
g5510.t1 details plas: 23, mito: 7, E.R.: 1, golg: 1
g5616.t1 details extr: 31, mito: 1
g5641.t1 details extr: 31, lyso: 1
g10513.t1 details nucl: 20, cyto_nucl: 14.5, cyto: 7, extr: 3, E.R.: 1, golg: 1
g10514.t1 details nucl: 19, cyto_nucl: 15, cyto: 9, extr: 3, mito: 1
g12510.t1 details plas: 29, cyto: 3
g12562.t1 details extr: 30, lyso: 2
g13530.t1 details extr: 13, nucl: 6.5, lyso: 5, cyto_nucl: 4.5, plas: 3, E.R.: 3, cyto: 1.5
g14472.t1 details nucl: 28, plas: 2, cyto: 1, cysk: 1
g15153.t1 details extr: 32
g15484.t1 details nucl: 17.5, cyto_nucl: 15.3333, cyto: 12, cyto_mito: 6.83333, plas: 1, golg: 1
```

I found the output format inconvenient to work with, so I did not use these results further and relied on the TargetP Server predictions instead.
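For reference, the WoLF PSORT entries can be reduced to one top-scoring compartment per protein. The sketch below assumes one `ID details loc: score, ...` entry per line; `parse_wolfpsort_line` is a hypothetical helper, and the two example lines are copied from the output shown in this section.

```python
def parse_wolfpsort_line(line):
    """Parse 'ID details loc: score, loc: score, ...' into (ID, {loc: score})."""
    protein_id, _, rest = line.partition(" details ")
    scores = {}
    for item in rest.split(","):
        loc, _, value = item.partition(":")
        if value.strip():  # skip truncated or empty fields
            scores[loc.strip()] = float(value)
    return protein_id.strip(), scores

example_lines = [
    "g702.t1 details extr: 29, plas: 2, lyso: 1",
    "g14472.t1 details nucl: 28, plas: 2, cyto: 1, cysk: 1",
    "g14472.t1 details nucl: 28, plas: 2, cyto: 1, cysk: 1",  # repeated entry
]

# Keep the best-scoring compartment per protein; repeated entries collapse in the dict
top_localization = {}
for line in example_lines:
    protein_id, scores = parse_wolfpsort_line(line)
    if scores:
        top_localization[protein_id] = max(scores, key=scores.get)

print(top_localization)
```

Storing the result in a dictionary keyed by protein ID also removes repeated entries for the same protein.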

**4b. TargetP Server**

TargetP predicts the localization of eukaryotic proteins based on the presence of an N-terminal presequence: chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP).

Among 34 proteins of interest, 21 proteins were classified as **"OTHER"** localisation by TargetP Server. 


I extracted these 21 proteins for a further BLAST search against the “UniProtKB/Swiss-Prot” database.

In [None]:
%%python
import pandas as pd  # each %%python cell runs in a separate process, so the import is repeated

df_predicted = pd.read_csv('output_protein_type.txt', sep = '\t')
# keep only the proteins classified as "OTHER"
df_predicted_filtered = df_predicted[df_predicted['Prediction'] == 'OTHER']
df_predicted_filtered['# ID'].to_csv('genes_to_check1.txt', index=False, header=False)

In [None]:
%%bash
seqtk subseq raw_data/augustus.whole.aa genes_to_check1.txt > seq_subset_21.fasta

Link for [the results](https://drive.google.com/file/d/1pu7TsT1Mn2un-qchcXAco6TAvE7QNlv_/view?usp=drive_link)

### Step 5. BLAST search


Using **seq_subset_21.fasta** as the input file, I performed **a BLAST search against the “UniProtKB/Swiss-Prot” database** without specifying an organism.

**Only one of the 21 proteins was a suitable candidate: g14472.t1, Damage suppressor protein (UniProtKB/Swiss-Prot: P0DOW4.1)**.

10 proteins did not align at all, and 10 proteins had various homologues among eukaryotic proteins, but none of these belongs to the DNA-repair-related proteins.

Link for [the results](https://drive.google.com/file/d/1WSlnXJcyc8C2HwJm9ayewGAvtCB9qlCo/view?usp=drive_link)
