# Protein BLAST of Roary and Roary+CLARC results against a database of essential gene sequences

This notebook takes each of our results folders (as found in the Zenodo folder 10.5281/zenodo.14187853) as input. For the original Roary accessory and core gene sequences and the new CLARC accessory and core gene sequences, it:

(1) Translates them into proteins and saves them in the appropriate locations
(2) Calls the .sh script to BLAST them against the essential gene database and saves the results in the appropriate locations

Let's go!

The same code can be used on each of the three folders (i80, i90, i95); only the path needs to change. Running it creates a folder named 'essential_blasting_sw' in each subdirectory. The Zenodo folders already include these results, so to rerun the analysis, delete that folder and run the code here.
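As a rough sketch of how the three folders could be processed in one pass instead of editing the path by hand, the helper below collects every run subdirectory under each identity threshold. The function name `list_run_folders` and the base path `~/clarc_results` are hypothetical and only illustrate the idea; adjust them to wherever the Zenodo folders were unpacked.

```python
import os

def list_run_folders(base_dir, thresholds=("i80", "i90", "i95")):
    """Return (threshold, run_folder_path) pairs for every run subdirectory found."""
    base_dir = os.path.expanduser(base_dir)  # os.listdir does not expand '~'
    pairs = []
    for threshold in thresholds:
        threshold_dir = os.path.join(base_dir, threshold)
        if not os.path.isdir(threshold_dir):
            continue  # skip thresholds that are not present on disk
        for entry in sorted(os.listdir(threshold_dir)):
            path = os.path.join(threshold_dir, entry)
            if os.path.isdir(path):  # ignore stray files next to the run folders
                pairs.append((threshold, path))
    return pairs

# Hypothetical base path; prints nothing if the folder does not exist.
for threshold, folder in list_run_folders("~/clarc_results"):
    print(threshold, folder)
```

Each returned path could then play the role of `clarc_folder` in the loop further down.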

### Import necessary packages

In [5]:
import subprocess
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import os

### Translate and pBLAST

In [6]:
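def translate_dna_to_protein(input_fasta, output_fasta):
    """Translate every DNA record in input_fasta and write the proteins to output_fasta."""
    protein_records = []
    for record in SeqIO.parse(input_fasta, "fasta"):
        # Default translation table; stop codons appear as '*' in the protein sequence
        protein_seq = record.seq.translate()
        # Keep the original ID and description so BLAST hits map back to the gene
        protein_record = SeqRecord(protein_seq, id=record.id, description=record.description)
        protein_records.append(protein_record)
    SeqIO.write(protein_records, output_fasta, "fasta")

def run_blast_script(acc_cogs, core_cogs, out_acc, out_core):
    """Run the BLAST shell script on the accessory and core protein FASTA files."""
    # The script path is relative, so the notebook must run from the script's folder
    blast_script_path = "./essential_protein_blast_paper.sh"
    subprocess.run([blast_script_path, acc_cogs, core_cogs, out_acc, out_core])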
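`subprocess.run` as called above does not raise when the script fails, so problems can go unnoticed. A minimal sketch of a stricter wrapper is shown here; `run_checked` is a hypothetical helper, not part of the original analysis, and the demonstration uses the Python interpreter as a stand-in for the BLAST script.

```python
import subprocess
import sys

def run_checked(command):
    """Run a command, capture its output, and raise if it exits with a non-zero code."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with code {result.returncode}:\n{result.stderr}"
        )
    return result

# Stand-in command; in the notebook this would be the script path plus its four arguments.
ok = run_checked([sys.executable, "-c", "print('blast finished')"])
print(ok.stdout.strip())
```

A non-zero exit code only catches hard failures. A script that prints an error and carries on may still exit with 0, so it can be worth inspecting `result.stderr` as well.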

In [8]:
# os.listdir does not expand '~', so expand it explicitly
results_directory = os.path.expanduser("~/i80")

# Loop through all folders within the given folder
for entry in os.listdir(results_directory):
    path = os.path.join(results_directory, entry)
    
    if os.path.isdir(path):
    
        dir_name = entry
        clarc_folder = path

        out_path_essential = clarc_folder+'/essential_blasting_sw'

        # Create directory
        os.makedirs(out_path_essential, exist_ok=True)

        # Original (Roary) representative sequences and their output paths
        og_acc_dna_fasta = clarc_folder+'/clarc_output/accessory_rep_seqs.fasta'
        og_core_dna_fasta = clarc_folder+'/clarc_output/core_rep_seqs.fasta'

        og_acc_protein_fasta = out_path_essential+'/og_acc_protein_rep_seqs.fasta'
        og_blast_out_acc = out_path_essential+'/og_acc_protein_essential_blast.tsv'

        og_core_protein_fasta = out_path_essential+'/og_core_protein_rep_seqs.fasta'
        og_blast_out_core = out_path_essential+'/og_core_protein_essential_blast.tsv'

        # CLARC sequences and their output paths
        clarc_acc_dna_fasta = clarc_folder+'/clarc_output/clarc_results/clarc_acc_cog_seqs.fasta'
        clarc_core_dna_fasta = clarc_folder+'/clarc_output/clarc_results/clarc_core_cog_seqs.fasta'

        clarc_acc_protein_fasta = out_path_essential+'/clarc_acc_protein_rep_seqs.fasta'
        clarc_blast_out_acc = out_path_essential+'/clarc_acc_protein_essential_blast.tsv'

        clarc_core_protein_fasta = out_path_essential+'/clarc_core_protein_rep_seqs.fasta'
        clarc_blast_out_core = out_path_essential+'/clarc_core_protein_essential_blast.tsv'

        # Translate DNA to protein
        translate_dna_to_protein(og_acc_dna_fasta, og_acc_protein_fasta)
        translate_dna_to_protein(og_core_dna_fasta, og_core_protein_fasta)
        translate_dna_to_protein(clarc_acc_dna_fasta, clarc_acc_protein_fasta)
        translate_dna_to_protein(clarc_core_dna_fasta, clarc_core_protein_fasta)

        # Run BLAST script

        # For original method (Roary or Panaroo)
        run_blast_script(og_acc_protein_fasta, og_core_protein_fasta, og_blast_out_acc, og_blast_out_core)

        # For CLARC results
        run_blast_script(clarc_acc_protein_fasta, clarc_core_protein_fasta, clarc_blast_out_acc, clarc_blast_out_core)


CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.






Building a new DB, current time: 07/08/2024 12:37:09
New DB name:   /Users/indragonzalez/Dropbox/Lipsitch_Rotation/NFDS/Scripts/clean_projects/CLARC/analyses/essential_genes/essential_fasta/strep_pneumo_essential_protein_seqs.fasta
New DB title:  /Users/indragonzalez/Dropbox/Lipsitch_Rotation/NFDS/Scripts/clean_projects/CLARC/analyses/essential_genes/essential_fasta/strep_pneumo_essential_protein_seqs.fasta
Sequence type: Protein
Deleted existing Protein BLAST database named /Users/indragonzalez/Dropbox/Lipsitch_Rotation/NFDS/Scripts/clean_projects/CLARC/analyses/essential_genes/essential_fasta/strep_pneumo_essential_protein_seqs.fasta
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 343 sequences in 0.00490093 seconds.
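The `CommandNotFoundError` in the output above suggests that a `conda activate` call, presumably inside the shell script, did not take effect because the shell started by `subprocess` had not been initialized for conda. The `makeblastdb` output shows that the BLAST tools were still found on the PATH in this run. One possible alternative, sketched below, is to launch the script through `conda run -n <env>`, which does not rely on shell initialization. The helper `conda_run_command` and the environment name `blast_env` are assumptions for illustration only.

```python
def conda_run_command(env_name, script_path, *script_args):
    """Build an argument list that runs a script inside a named conda environment."""
    return ["conda", "run", "-n", env_name, script_path, *script_args]

# 'blast_env' is a hypothetical environment name; the FASTA and TSV names are placeholders.
command = conda_run_command(
    "blast_env",
    "./essential_protein_blast_paper.sh",
    "acc_proteins.fasta",
    "core_proteins.fasta",
    "acc_blast.tsv",
    "core_blast.tsv",
)
print(" ".join(command))
```

The resulting list can be passed directly to `subprocess.run` in place of the bare script path. With this approach, any `conda activate` line inside the script would be redundant.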




