# Overview 
- FsxAs were manually picked to give a meaningful representation of the whole set of FsxAs in terms of clades and ecology.
- FsxAs are filtered to have a number of cysteines similar to FsxA11 (9 CYS residues); FsxAs with at least 8 CYS are selected to stay close to this number.
- All chosen FsxAs are pairwise aligned against our structural reference (FsxA11), so that MODELLER can subsequently be used for structural modelling. Alignments are performed both with and without the FsxA ectodomain HMM as a reference.
- A subset of the selected tree is presented (I'll try, at least), or at least a version with these clades selected.

# Picking FsxAs
## FsxAs I selected as a pre-selection

- AntAceMinimDraft_4_1070372.scaffolds.fasta_scaffold07707_5 (TBA) 
- RifCSPhighO2_12_1023870.scaffolds.fasta_scaffold04391_3 (TBA) 
- MGYP000359581082 
- RifOxyB1_1023888.scaffolds.fasta_scaffold05169_1 (TBA) 
- 3300018015-Ga0187866-1000629-915963-9
- RLG94066.1 (Candidatus Bathyarchaeota archaeon) 
- SAMEA2619974_10776_4 
- MGYP000565992219 
- RKX41251.1 (Thermotogae bacterium)
- HDO79935.1 or HEX32987.1 (Candidatus Aenigmarchaeota archaeon) 
- AntAceMinimDraft_4_1070372.scaffolds.fasta_scaffold89608_1 (1 ORF scaffold) 
- mc20532409|Representative=scaffold68885_2/1553 (TBA) 
- RKZ11204.1 (Candidatus Fermentibacteria bacterium) 
- NOZ47386.1 (Chlorobi bacterium) 
- PlaIllAssembly_1097288.scaffolds.fasta_scaffold00281_9 (TBA) 
- AntAceMinimDraft_4_1070372.scaffolds.fasta_scaffold28048_2 (TBA) 
- HID09282.1 (Candidatus Micrarchaeota archaeon) 
- 3300014613-Ga0180008-1001212-875221-12
- 3300014206-Ga0172377-10000119-870930-129 (Methanomicrobia)
- HGF63239.1 (Candidatus Micrarchaeota archaeon) 
- LKMP01000007_1 (Nanohaloarchaea archaeon B1-Br10_U2g21)
- MGYP000103175163 
- WP_049937247.1 - WP_157573584.1 (Haloplanus natans DSM 17983)
- HHR27186.1 (Candidatus Bathyarchaeota archaeon) 

In [32]:
# import libraries
import os
import subprocess
import glob
from Bio import SeqIO
import pandas as pd

# create vector with IDs of candidates
fsxAs_preselection = ['AntAceMinimDraft_4_1070372.scaffolds.fasta_scaffold07707_5', 
                      'RifCSPhighO2_12_1023870.scaffolds.fasta_scaffold04391_3', 
                      'MGYP000359581082', 
                      'RLG94066.1', 
                      'SAMEA2619974_10776_4', 
                      'HDO79935.1', 
                      'mc20532409|Representative=scaffold68885_2/1553', 
                      'RKZ11204.1', 
                      'NOZ47386.1', 
                      'PlaIllAssembly_1097288.scaffolds.fasta_scaffold00281_9', 
                      'HID09282.1', 
                      'WP_058826362.1',
                      'WP_174701778.1',
                      'WP_007110832.1',
                      'WP_179268568.1',
                      'WP_163487151.1',
                      '3300014613-Ga0180008-1001212-875221-12', 
                      '3300014206-Ga0172377-10000119-870930-129', 
                      'HGF63239.1', 
                      'LKMP01000007_1', 
                      'MGYP000103175163', 
                      'WP_049937247.1', 
                      'HHR27186.1']

# auxiliary function to create a directory (and any missing parent directories)
def create_dir(path):
    os.makedirs(path, exist_ok=True)

# creating relevant directories to hold the alignments
relevant_dirs = ['../results/data_for_modelling',
                 '../results/data_for_modelling/seqs',
                 '../results/data_for_modelling/alns',
                 '../results/data_for_modelling/alns/hmm_guided',
                 '../results/data_for_modelling/alns/hmm_guided/pairwise_alns',
                 '../results/data_for_modelling/alns/without_hmm_guidance',]

for path in relevant_dirs:
    create_dir(path)

# get their number of CYS and create a pandas DataFrame
# load all FsxA sequences
fsxA_seqs = [record for record in SeqIO.parse('../data/sequences/FsxA_full_length.faa', 'fasta')]

# create a list to allocate rows of DataFrame
fsxa_seqs_df_list = []

# iterate and consider each seq and their number of CYS
for record in fsxA_seqs:
    fsxa_seqs_df_list.append(pd.DataFrame.from_dict({'ID': [record.id], 'Num of CYS': [record.seq.count('C')]}))

# concatenate
fsxa_seqs_df = pd.concat(fsxa_seqs_df_list)

# the pre-selected candidates are checked against the 8-CYS threshold in the next cell

In [33]:
fsxa_seqs_df_filtered = fsxa_seqs_df.query("`ID` in @fsxAs_preselection")
fsxa_seqs_df_filtered

| | ID | Num of CYS |
|---|---|---|
| 0 | 3300014206-Ga0172377-10000119-870930-129 | 10 |
| 0 | 3300014613-Ga0180008-1001212-875221-12 | 6 |
| 0 | AntAceMinimDraft_4_1070372.scaffolds.fasta_sca... | 14 |
| 0 | mc20532409\|Representative=scaffold68885_2/1553 | 14 |
| 0 | PlaIllAssembly_1097288.scaffolds.fasta_scaffol... | 18 |
| 0 | RifCSPhighO2_12_1023870.scaffolds.fasta_scaffo... | 16 |
| 0 | SAMEA2619974_10776_4 | 18 |
| 0 | MGYP000359581082 | 22 |
| 0 | MGYP000103175163 | 9 |
| 0 | WP_058826362.1 | 8 |
0,WP_058826362.1,8


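Most of our sequences have the appropriate number of CYS (at least 8); among the rows shown, the exception is 3300014613-Ga0180008-1001212-875221-12, with only 6.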
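
The threshold check can also be made explicit rather than read off the table. The following is a minimal sketch, assuming the 8-CYS cutoff stated in the overview; `split_by_cys` and `toy_df` are hypothetical names, and the toy table reuses four rows from the output above rather than the full dataset.

```python
import pandas as pd

MIN_CYS = 8  # assumed cutoff, taken from the overview

# toy table with four rows copied from the output above (illustrative subset only)
toy_df = pd.DataFrame({
    'ID': ['3300014206-Ga0172377-10000119-870930-129',
           '3300014613-Ga0180008-1001212-875221-12',
           'MGYP000103175163',
           'WP_058826362.1'],
    'Num of CYS': [10, 6, 9, 8],
})

def split_by_cys(df, min_cys=MIN_CYS):
    """Return (passing, failing) rows according to the CYS threshold."""
    passing = df[df['Num of CYS'] >= min_cys]
    failing = df[df['Num of CYS'] < min_cys]
    return passing, failing

passing, failing = split_by_cys(toy_df)
# IDs that fall below the threshold
print(failing['ID'].tolist())
```

In the notebook itself, the same helper could be applied to `fsxa_seqs_df_filtered`.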

In [34]:
fsxa_seqs_df.query("`Num of CYS` < 15")

| | ID | Num of CYS |
|---|---|---|
| 0 | 3300000868-JGI12330J12834-1000008-299010-8 | 8 |
| 0 | 3300014206-Ga0172377-10000119-870930-129 | 10 |
| 0 | 3300014208-Ga0172379-10000243-871512-158 | 10 |
| 0 | 3300014208-Ga0172379-10001592-871560-40 | 10 |
| 0 | 3300014613-Ga0180008-1001212-875221-12 | 6 |
| 0 | 3300018015-Ga0187866-1000629-915963-9 | 14 |
| 0 | AntAceMinimDraft_18_1070375.scaffolds.fasta_sc... | 12 |
| 0 | AntAceMinimDraft_18_1070375.scaffolds.fasta_sc... | 14 |
| 0 | AntAceMinimDraft_18_1070375.scaffolds.fasta_sc... | 9 |
| 0 | AntAceMinimDraft_18_1070375.scaffolds.fasta_sc... | 14 |


> **Conclusion:** it might be that the number of cysteines is not similar across different ecological conditions. **This could indeed be a biologically relevant factor to consider!**

In [35]:
# saving desired sequences into a FASTA file
output_fasta = '../results/data_for_modelling/seqs/fsxAs_for_modelling.fasta'

# loading FsxA11 and adding it to the sequence list
fsxa11 = [record for record in SeqIO.parse('../data/sequences/fsx11_chB.afasta', 'fasta')]

# filtering fsxA sequences for modelling
fsxA_target_seqs = [record for record in fsxA_seqs if record.id in fsxAs_preselection]

# joining with FsxA11
fsxA_target_seqs = fsxA_target_seqs + fsxa11

# saving file
if not os.path.exists(output_fasta):
    with open(output_fasta, 'w') as output_handle:
        SeqIO.write(fsxA_target_seqs, output_handle, 'fasta')
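
Because the selection relies on exact ID matches, a pre-selected ID that is misspelled or absent from the FASTA file is dropped silently. A minimal sketch of a sanity check follows, assuming a hypothetical helper `missing_ids` and toy ID lists (not the real data):

```python
def missing_ids(wanted_ids, found_ids):
    """Return the wanted IDs that are absent from found_ids, keeping their order."""
    found = set(found_ids)
    return [seq_id for seq_id in wanted_ids if seq_id not in found]

# toy example: one of the three wanted IDs is not present in the loaded records
wanted = ['RLG94066.1', 'HDO79935.1', 'MGYP000359581082']
found = ['MGYP000359581082', 'RLG94066.1']
print(missing_ids(wanted, found))  # ['HDO79935.1']
```

In the notebook, the call would look like `missing_ids(fsxAs_preselection, [record.id for record in fsxA_seqs])`; an empty list means every pre-selected FsxA was found.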


# Creating HMM for FsxA ectodomains

## Creating alignment with MAFFT

In [36]:
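# run MAFFT alignment (--localpair with --maxiterate 1000 corresponds to L-INS-i)
mafft_output = '../results/data_for_modelling/fsxA_ectodomain.msa'
if not os.path.exists(mafft_output):
    mafft_command = 'mafft --maxiterate 1000 --localpair ../data/sequences/FsxA_ectodomains.faa'.split(' ')
    # MAFFT prints the alignment to stdout; the context manager closes the file afterwards
    with open(mafft_output, 'w') as out_file:
        subprocess.run(mafft_command, stdout=out_file, check=True)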
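
One caveat of guarding a command with `os.path.exists` is that a failed run can leave an empty output file behind, which then blocks any rerun. The sketch below shows one possible way around this, under the assumption that writing to a temporary file and renaming it on success is acceptable. `run_to_file` is a hypothetical helper, and the demo runs the Python interpreter instead of MAFFT so that it works without external tools.

```python
import os
import shutil
import subprocess
import sys
import tempfile

def run_to_file(command, output_path):
    """Run `command` and store its stdout in `output_path`.

    Returns False without running anything if the output already exists.
    """
    if os.path.exists(output_path):
        return False
    if shutil.which(command[0]) is None:
        raise FileNotFoundError('{0} not found on PATH'.format(command[0]))
    tmp_path = output_path + '.tmp'
    try:
        with open(tmp_path, 'w') as out_file:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run(command, stdout=out_file, check=True)
    except subprocess.CalledProcessError:
        os.remove(tmp_path)  # do not leave a partial file behind
        raise
    os.replace(tmp_path, output_path)  # only reached when the command succeeded
    return True

# demo with the Python interpreter standing in for an external aligner
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_output = os.path.join(tmp_dir, 'demo.txt')
    demo_command = [sys.executable, '-c', 'print("aligned")']
    first_run = run_to_file(demo_command, demo_output)
    second_run = run_to_file(demo_command, demo_output)
    with open(demo_output) as handle:
        demo_text = handle.read().strip()
```

The second call returns `False` because the output already exists, which mirrors the caching behaviour of the cells in this notebook.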

## Creating HMM with hmmbuild

In [37]:
hmm_profile = '../results/data_for_modelling/fsxA_ectodomain.hmm'
if not os.path.exists(hmm_profile):
    # hmmbuild takes the output HMM file first, then the input alignment
    hmmbuild_cmd = ['hmmbuild', hmm_profile, '../results/data_for_modelling/fsxA_ectodomain.msa']
    subprocess.run(hmmbuild_cmd, check=True)

# Pairwise alignment against FsxA11

## Without FsxA ectodomain HMM as a reference

## With FsxA ectodomain HMM as a reference
- Sequences are aligned against the FsxA ectodomain HMM (using hmmalign), and pairwise sub-alignments are then extracted from this alignment.

In [38]:
aln_output = '../results/data_for_modelling/alns/hmm_guided/fsxAs_hmm_aligned.sto'
if not os.path.exists(aln_output):
    hmmalign_cmd = 'hmmalign --amino --informat FASTA --outformat Stockholm ../results/data_for_modelling/fsxA_ectodomain.hmm ../results/data_for_modelling/seqs/fsxAs_for_modelling.fasta'.split(' ')
    # hmmalign prints the alignment to stdout; the context manager closes the file afterwards
    with open(aln_output, 'w') as out_file:
        subprocess.run(hmmalign_cmd, stdout=out_file, check=True)

### Convert from Stockholm to FASTA

In [39]:
# import libraries
from Bio import AlignIO

# loading HMM guided alignment
fsxa_msa = AlignIO.read('../results/data_for_modelling/alns/hmm_guided/fsxAs_hmm_aligned.sto', 'stockholm')

# save alignment in FASTA format
fasta_msa = '../results/data_for_modelling/alns/hmm_guided/fsxAs_hmm_aligned.msa'
if not os.path.exists(fasta_msa):
    with open(fasta_msa, 'w') as handle:
        AlignIO.write(fsxa_msa, handle, 'fasta')
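
`hmmalign` writes residues in insert columns in lower case and pads those columns with `.`, and both conventions are carried over unchanged when the Stockholm file is rewritten as FASTA. Whether the downstream modelling step tolerates them is not established here, so the following is only a sketch of one possible clean-up, using a hypothetical helper `normalise_row` on a toy aligned row:

```python
def normalise_row(aligned_row):
    """Turn '.' padding into '-' gaps and upper-case every residue."""
    return aligned_row.replace('.', '-').upper()

# toy aligned row with lower-case insert residues and '.' padding
toy_row = 'MK-Cag..TC'
print(normalise_row(toy_row))  # MK-CAG--TC
```

Note that this discards the distinction between match and insert columns, which may or may not matter for the intended use.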

### Subset all pairwise sequences with FsxA11

In [40]:
# import libraries
from itertools import combinations

# import the alignment
fsxa_msa = AlignIO.read('../results/data_for_modelling/alns/hmm_guided/fsxAs_hmm_aligned.sto', 'stockholm')

# get all combinations of sequence IDs
fsxa_combinations = combinations(fsxa_msa, 2)

# create list to allocate
fsxa_pairwise_alns = []

# helper that strips characters that are awkward in filenames from an ID
def clean_tag(tag):
    return tag.replace('|', '').replace('=', '').replace('/', '')

# iterating over every pair of aligned sequences
for seq_1, seq_2 in fsxa_combinations:
    # only considering pairs in which one of the sequences is FsxA11
    if fsxa11[0].id not in (seq_1.id, seq_2.id):
        continue
    # creating output filename
    output_filename = '../results/data_for_modelling/alns/hmm_guided/pairwise_alns/{0}_vs_{1}.msa'.format(clean_tag(seq_1.id), clean_tag(seq_2.id))
    # saving the pairwise alignment in FASTA format
    if not os.path.exists(output_filename):
        with open(output_filename, 'w') as handle:
            SeqIO.write([seq_1, seq_2], handle, 'fasta')
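
Pairs cut out of a larger alignment keep every column of that alignment, including columns where both sequences have a gap. If those columns are unwanted, they could be removed as in this sketch; `drop_shared_gaps` is a hypothetical helper working on plain strings, and treating both `-` and `.` as gaps is an assumption.

```python
GAP_CHARS = {'-', '.'}  # assumed gap symbols

def drop_shared_gaps(row_1, row_2):
    """Remove the columns in which both aligned rows carry a gap."""
    if len(row_1) != len(row_2):
        raise ValueError('aligned rows must have the same length')
    kept = [(a, b) for a, b in zip(row_1, row_2)
            if not (a in GAP_CHARS and b in GAP_CHARS)]
    new_1 = ''.join(a for a, _ in kept)
    new_2 = ''.join(b for _, b in kept)
    return new_1, new_2

# columns 4 and 6 are gaps in both toy rows and are dropped
print(drop_shared_gaps('AC--G-', 'A-T-G-'))  # ('AC-G', 'A-TG')
```

Columns with a gap in only one of the two rows are kept, so the pairwise alignment itself is unchanged.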

# Checking that I saved the correct number of sequences
Checked: got 23 files, one per pre-selected FsxA (each paired with FsxA11).

In [43]:
len(fsxAs_preselection)

23
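
The file count can also be verified in code instead of by eye. A minimal sketch, assuming the `.msa` extension used above; `count_pairwise_files` is a hypothetical helper, and the demo uses a temporary directory with made-up file names rather than the real results folder.

```python
import glob
import os
import tempfile

def count_pairwise_files(directory, extension='.msa'):
    """Count the files in `directory` whose names end with `extension`."""
    pattern = os.path.join(glob.escape(directory), '*' + extension)
    return len(glob.glob(pattern))

# demo: two alignment files and one unrelated file
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('a_vs_ref.msa', 'b_vs_ref.msa', 'notes.txt'):
        with open(os.path.join(tmp_dir, name), 'w') as handle:
            handle.write('')
    demo_count = count_pairwise_files(tmp_dir)

print(demo_count)  # 2
```

In the notebook, the result for the `pairwise_alns` directory could be compared directly with `len(fsxAs_preselection)`.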

# Creating files for data sharing

In [54]:
# create dir for data sharing
create_dir('../results/data_for_modelling/data_sharing')

In [55]:
%%bash
cp -r ../results/data_for_modelling/alns/hmm_guided/pairwise_alns/ ../results/data_for_modelling/data_sharing/
tar -czvf ../results/data_for_modelling/fsxAs_for_structural_modelling_210517.tar.gz ../results/data_for_modelling/data_sharing/

../results/data_for_modelling/data_sharing/
../results/data_for_modelling/data_sharing/pairwise_alns/
../results/data_for_modelling/data_sharing/pairwise_alns/AntAceMinimDraft_4_1070372.scaffolds.fasta_scaffold07707_5_vs_fx11_chB_26.509.pdbchain.msa
../results/data_for_modelling/data_sharing/pairwise_alns/mc20532409Representativescaffold68885_21553_vs_fx11_chB_26.509.pdbchain.msa
../results/data_for_modelling/data_sharing/pairwise_alns/LKMP01000007_1_vs_fx11_chB_26.509.pdbchain.msa
../results/data_for_modelling/data_sharing/pairwise_alns/WP_049937247.1_vs_fx11_chB_26.509.pdbchain.msa
../results/data_for_modelling/data_sharing/pairwise_alns/WP_179268568.1_vs_fx11_chB_26.509.pdbchain.msa
../results/data_for_modelling/data_sharing/pairwise_alns/WP_058826362.1_vs_fx11_chB_26.509.pdbchain.msa
../results/data_for_modelling/data_sharing/pairwise_alns/PlaIllAssembly_1097288.scaffolds.fasta_scaffold00281_9_vs_fx11_chB_26.509.pdbchain.msa
../results/data_for_modelling/data_sharing/pairwise_alns/

tar: Removing leading `../' from member names
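
The warning above appears because the archive members are given with a leading `../`. If a cleaner archive layout is preferred, Python's `tarfile` module lets the stored name be chosen explicitly via `arcname`. A sketch under that assumption, with a hypothetical helper `archive_dir` and a temporary directory standing in for the real data:

```python
import os
import tarfile
import tempfile

def archive_dir(source_dir, archive_path, arcname):
    """Pack source_dir into a gzipped tar under `arcname` and return the member names."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(source_dir, arcname=arcname)
    with tarfile.open(archive_path, 'r:gz') as tar:
        return sorted(tar.getnames())

# demo: one made-up alignment file inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    source_dir = os.path.join(tmp_dir, 'pairwise_alns')
    os.mkdir(source_dir)
    with open(os.path.join(source_dir, 'demo_vs_ref.msa'), 'w') as handle:
        handle.write('>demo\nACGT\n')
    archive_path = os.path.join(tmp_dir, 'demo.tar.gz')
    member_names = archive_dir(source_dir, archive_path, 'data_sharing/pairwise_alns')

print(member_names)
```

The recipient then unpacks a self-contained `data_sharing/` folder, independent of where the notebook was run.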

