# Pilot 1 & 2
I've pivoted away from the initial idea of comparing with plasmids, and instead I'm going to focus on the 16S rRNA gene.

## Basic analysis

Starting from {cite:t}`Pike_2013`, I wanted to expand on their phylogenetic analysis, which mainly used the minimum evolution (ME) method. I wanted to see whether I could replicate their results using a different method, and whether I could improve on them by adding more 16S sequences.

1. They took their 16S ribosomal RNA samples and sequenced them.
2. They used BLASTN to quantify phylogenetic relatedness of sequences within GenBank.
3. They also used EzTaxon (now called EzBioCloud) to identify the closest relatives of their sequences.
4. They aligned using BioEdit v7.0.5.3.

    > BioEdit is (was) a sequence alignment editor written for Windows 95/98/NT/2000/XP/7. It is free for academic and non-profit use. It is no longer supported, but the last version is still available for download. It is not available for Mac or Linux. {cite:p}`hall1999bioedit`

5. They used MEGA v4.0 (2007) {cite}`Tamura_2007` to construct phylogenetic trees with the following settings:

    Generally:
    - Neighbor-joining method: for initial tree construction
    - bootstrapping with 1000 replications
    - distance calculated with Jukes-Cantor model

    Method 1:
    - minimum evolution method: Close-Neighbor-Interchange (CNI) algorithm at search level 1
    - complete deletion of gaps/missing data at all positions

    Method 2:
    - maximum parsimony method

    Method 3:
    - UPGMA method

    > MEGA v4.0 (2007) is a program for conducting molecular evolutionary analysis of DNA sequences. Depending on the version, it packages many of the workflows used in phylogenetic analysis into one program. Version 4.0 in particular added automatic caption generation and the maximum composite likelihood (MCL) method for estimating evolutionary distances between all pairs of sequences. Recent versions have expanded its feature set and performance. It is available for Windows, Mac and Linux, and is free for academic and non-profit use. {cite:p}`Tamura_2007`
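
To make the distance step above concrete, here is a minimal sketch of the Jukes-Cantor correction, $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the proportion of differing sites between two aligned sequences. The function names are hypothetical, and the gap handling here is pairwise (a site is skipped only if one of the two sequences has a non-ACGT character), which is a simplification of the complete deletion option used in the paper.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped
    (pairwise deletion; an assumption made for this sketch).
    """
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites between the two sequences")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        # The correction is undefined once p reaches the saturation value 3/4.
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# One difference over eight comparable sites.
distance = jukes_cantor(p_distance("ACGTACGT", "ACGTACGA"))
```

The corrected distance is always at least as large as $p$, because it accounts for multiple substitutions at the same site.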


```{figure} ../../outputs/pike_basic_analysis/papers_tree.png
---
name: pike-2013-tree
---
Tree showing the phylogenetic relationship of Endozoicomonas to other bacterial species (43 type strains based on 16S rRNA gene sequences).
```

But first, we need to wrangle our data.

## Wrangle data

In [1]:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import os

Let's write a function that takes a list of directories and returns all of the FASTA sequences within them.

In [7]:
dir_paths = [os.path.realpath("../../datasets/common"),
             os.path.realpath("../../datasets/pike_basic_analysis")]

In [3]:
def parse_fasta(dir_paths: list[str]) -> list[SeqRecord]:
  """Collect every FASTA record found in the given directories."""
  if not isinstance(dir_paths, list):
    raise TypeError("dir_paths must be a list of directory paths")

  fasta_records: list[SeqRecord] = []
  for dir_path in dir_paths:
    for file_name in os.listdir(dir_path):
      # Match on the file name only, so a parent directory containing "fasta" is ignored.
      if "fasta" in file_name:
        # Use a context manager so the file handle is closed after parsing.
        with open(os.path.join(dir_path, file_name)) as handle:
          fasta_records.extend(SeqIO.parse(handle, "fasta"))

  return fasta_records

In [4]:
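def remove_duplicate(fasta_records: list[SeqRecord]) -> tuple[list[int], list[str], list[SeqRecord]]:
  """Keep the first record for each accession ID and report the duplicates."""
  seen_ids: set[str] = set()  # a set gives constant-time membership checks
  duplicates: list[int] = []  # positions of the repeated records in the input
  duplicate_ids: list[str] = []
  retained_records: list[SeqRecord] = []

  for idx, fasta in enumerate(fasta_records):
    if fasta.id in seen_ids:
      duplicates.append(idx)
      duplicate_ids.append(fasta.id)
    else:
      seen_ids.add(fasta.id)
      retained_records.append(fasta)

  return duplicates, duplicate_ids, retained_records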

In [5]:
def write_unaligned_fasta(dir_paths_for_fastas: list[str], id_path_str: str) -> list[SeqRecord]:
  """Write the records named in the accession list to all_unaligned.fasta next to that list."""
  dir_path = os.path.dirname(id_path_str)
  # Read the accession IDs, closing the file afterwards and skipping blank lines.
  with open(id_path_str) as handle:
    ids = {line.strip() for line in handle if line.strip()}
  fasta_records = parse_fasta(dir_paths_for_fastas)
  _, _, retained_records = remove_duplicate(fasta_records)
  filtered = [record for record in retained_records if record.id in ids]
  SeqIO.write(filtered, os.path.join(dir_path, "all_unaligned.fasta"), "fasta")
  print("Done writing unaligned fasta")
  return filtered
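
Because `write_unaligned_fasta` only keeps records whose ID is in the accession list, an accession with no matching FASTA record is dropped without any warning. A small check along the following lines could flag that before alignment. This is only a sketch: `find_missing_ids` is a hypothetical helper that is not part of the notebook, and it works on plain strings so it does not depend on Biopython.

```python
def find_missing_ids(requested_ids: list[str], record_ids: list[str]) -> list[str]:
    """Return requested accession IDs that have no matching record, in order."""
    found = set(record_ids)
    reported: set[str] = set()
    missing: list[str] = []
    for accession in requested_ids:
        accession = accession.strip()
        # Skip blank lines and IDs that were already reported once.
        if accession and accession not in found and accession not in reported:
            missing.append(accession)
            reported.add(accession)
    return missing


# Toy example: one requested accession has no record available.
missing = find_missing_ids(
    ["AB196667.1", "FJ347758.1", ""],
    ["AB196667.1", "AB695088.1"],
)
print(missing)  # ['FJ347758.1']
```

In the notebook, `record_ids` could be built as `[record.id for record in filtered]` from the list that `write_unaligned_fasta` returns.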


## Pike et al (2013) basic analysis

Let us load the fasta files for the 16S sequences from the Pike et al (2013) study.

**To repeat the basic analysis, I did the following:**

1. I downloaded [MEGA](https://www.megasoftware.net/), but version 11 rather than v4.0 {cite:p}`Tamura_2021`.
2. I downloaded the 16S sequences using the [accession IDs](../../outputs/pike_basic_analysis/list_of_accession.txt) in the Endozoicomonas subtree ({numref}`pike-2013-tree`).
3. I concatenated the sequences (see below).
4. I aligned the sequences using MUSCLE v5.1 {cite:p}`Edgar_2022`.

    ```bash
    muscle5.1 -align outputs/pike_basic_analysis/all_unaligned.fasta -output outputs/pike_basic_analysis/muscle_aligned.fasta
    ```

5. Inside MEGA I selected the following settings:

:::{table} MEGA settings to repeat basic analysis
| Setting | Value |
| --- | --- |
| Scope | All taxa |
| Statistical method | Minimum evolution |
| Test of phylogeny | Bootstrap method |
| No. of bootstrap replicates | 1000 |
| Substitution type | Nucleotide |
| Model/method | Jukes-Cantor model |
| Gaps/missing data treatment | Complete deletion |
| ME Search Level | 1 |
:::
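
As an illustration of what the "Complete deletion" setting does, the sketch below removes every alignment column in which at least one sequence has a gap or missing-data symbol. It is not MEGA's implementation: the function name is hypothetical, and treating `-` and `?` as the gap and missing-data symbols is an assumption.

```python
def complete_deletion(aligned: dict[str, str], missing: str = "-?") -> dict[str, str]:
    """Drop every column where any sequence has a gap or missing-data symbol."""
    sequences = list(aligned.values())
    if not sequences:
        return {}
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")

    # Keep a column only if it is informative in every sequence.
    keep = [
        i
        for i in range(len(sequences[0]))
        if all(seq[i] not in missing for seq in sequences)
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}


toy_alignment = {"seq1": "AC-GT", "seq2": "ACG-T", "seq3": "ACGGT"}
trimmed = complete_deletion(toy_alignment)
print(trimmed)  # {'seq1': 'ACT', 'seq2': 'ACT', 'seq3': 'ACT'}
```

A single short or gappy sequence therefore reduces the number of sites available to every other sequence in the analysis.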

In [8]:
write_unaligned_fasta(dir_paths, "../../outputs/pike_basic_analysis/list_of_accession.txt")

Done writing unaligned fasta


[SeqRecord(seq=Seq('CTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAG...CGG'), id='AB196667.1', name='AB196667.1', description='AB196667.1 Endozoicimonas elysicola gene for 16S ribosomal RNA, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAACAGAACTAG...AAG'), id='AB695088.1', name='AB695088.1', description='AB695088.1 Endozoicomonas numazuensis gene for 16S rRNA, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('TGCAAGTCGAGCGGTAACAGAACTAGCTTGCTAGTTGCTGACGAGCGGCGGACG...AAG'), id='FJ347758.1', name='FJ347758.1', description='FJ347758.1 Endozoicomonas montiporae CL-33 16S ribosomal RNA gene, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('AGAGTTTTGGATCTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAA...GGC'), id='JX488684.2', name='JX488684.2', description='JX488684.2 Endozoicomonas euniceicola strain EF212 16S ribosomal RNA gene, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('AGAGTTTGATCCTGGCGCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAG...

## Pilot Analysis (1 & 2)

To improve on the basic analysis, I added more 16S sequences by searching for Endozoicomonas and coral in the NCBI database. I targeted sequences that were fairly long (~1000 bp or more, except for one sequence). I downloaded the sequences and aligned them with MUSCLE v5.1 from the command line {cite:p}`Edgar_2022`.

:::{table} List of sequences added to the alignment
| Species (or Isolate/Strain) | Description | Length | Accession | GI |
| :------ | :---------: | :----- | :-------- | :- |
| Endozoicomonas **atrinae** strain WP70 | 16S ribosomal RNA gene, partial sequence | 1,465 bp linear DNA | KC878324.1 | 499141103 |
| Endozoicomonas **montiporae** strain Ab112_MC | 16S ribosomal RNA (16S) gene, complete sequence | 1,024 bp linear DNA | KJ372452.1 | 646280569 |
| Endozoicomonas **ascidiicola** strain AVMART05 | 16S ribosomal RNA gene, partial sequence | 1,501 bp linear DNA | KT364257.1 | 1016200855 |
| Endozoicomonas **sp.** isolate PM28062 | 16S ribosomal RNA gene, partial sequence | 1,408 bp linear DNA | KX780138.1 | 1226602311 |
| Uncultured Endozoicomonas **sp.** | partial 16S rRNA gene, clone Dpd21_3_44 | 1,473 bp linear DNA | LN626318.1 | 966207698 |
| Endozoicomonas **sp.** strain Acr-14 | partial 16S rRNA gene, isolate Sea coral | 1,452 bp linear DNA | LN875493.1 | 916534561 |
| Endozoicomonas **sp.** Acr-12 | partial 16S rRNA gene, strain Acr-12, isolate Sea coral | 1,450 bp linear DNA | LN879492.1 | 928189784 |
| Endozoicomonas **sp.** strain LZHN29 | 16S ribosomal RNA gene, partial sequence | 1,550 bp linear DNA | MH201322.1 | 1377517605 |
| Endozoicomonas **sp.** Hp36 | 16S ribosomal RNA gene, partial sequence | 1,531 bp linear DNA | MK633876.1 | 1591471162 |
| Endozoicomonas **sp.** strain XS200 | 16S ribosomal RNA gene, partial sequence | 562 bp linear DNA | OQ618154.1 | 2456913731 |
:::
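
The length criterion can be checked directly against the table above. The sketch below copies the lengths from the table and lists the accessions that fall under the cut-off. The exact threshold of 1000 bp is an assumption based on the "~1000 or more" wording. Short sequences are worth flagging because, under complete deletion, the final dataset cannot contain more sites than the shortest sequence provides.

```python
# Lengths in bp, copied from the table of added sequences.
added_lengths = {
    "KC878324.1": 1465,
    "KJ372452.1": 1024,
    "KT364257.1": 1501,
    "KX780138.1": 1408,
    "LN626318.1": 1473,
    "LN875493.1": 1452,
    "LN879492.1": 1450,
    "MH201322.1": 1550,
    "MK633876.1": 1531,
    "OQ618154.1": 562,
}

MIN_LENGTH = 1000  # assumed cut-off, taken from the "~1000 or more" criterion


def short_sequences(lengths: dict[str, int], min_length: int) -> list[str]:
    """Return the accessions whose sequence is shorter than min_length."""
    return sorted(acc for acc, length in lengths.items() if length < min_length)


short = short_sequences(added_lengths, MIN_LENGTH)
print(short)  # ['OQ618154.1']
```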

### Pilot #1

This was a run in which one sequence (FJ347758.1) from the original study {cite}`Pike_2013` was missing.
1. I downloaded the 16S sequences using the [accession IDs](../../outputs/pilot_1/accession_id_list.txt).
2. I concatenated the sequences (see below).
3. I aligned the sequences using MUSCLE v5.1 {cite:p}`Edgar_2022`.

    ```bash
    muscle5.1 -align outputs/pilot_1/all_unaligned.fasta -output outputs/pilot_1/muscle/Endozoicomonas_sp.phy
    ```
4. I used ModelTest-NG in raxmlGUI {cite:p}`Edler2021, Darriba2019`. **TPM1uf+G4** was the best model.
5. I then ran a tree search using raxmlHPC in raxmlGUI {cite:p}`Edler2021, Kozlov2019`. However, I appear to have run it with GTRGAMMAIX by mistake instead of the selected model, using ML + thorough bootstrap.

**This produced:**
```{figure} ../../outputs/pilot_1/Ito_Kaede_PartB_screenshot_page-0001.jpg
---
name: pilot-1-raxml-tree
---
Tree inferred by maximum likelihood (ML) search in raxmlGUI {cite:p}`Edler2021, Kozlov2019`.
```
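
The MUSCLE alignment step is the same in every run, so it could also be scripted from the notebook rather than typed in a shell. The following is a sketch only: the function names are hypothetical, and the executable name `muscle5.1` and the `-align` and `-output` flags are taken from the commands shown on this page.

```python
import shutil
import subprocess


def build_muscle_command(unaligned: str, aligned: str, executable: str = "muscle5.1") -> list[str]:
    """Build the argument list for the MUSCLE call used on this page."""
    return [executable, "-align", unaligned, "-output", aligned]


def run_muscle(unaligned: str, aligned: str, executable: str = "muscle5.1") -> None:
    """Run MUSCLE, failing early if the executable is not on the PATH."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on the PATH")
    # check=True raises CalledProcessError if MUSCLE exits with an error.
    subprocess.run(build_muscle_command(unaligned, aligned, executable), check=True)


command = build_muscle_command(
    "outputs/pilot_1/all_unaligned.fasta",
    "outputs/pilot_1/muscle/Endozoicomonas_sp.phy",
)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.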

In [9]:
write_unaligned_fasta(dir_paths, "../../outputs/pilot_1/accession_id_list.txt")

Done writing unaligned fasta


[SeqRecord(seq=Seq('ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGCGGGAAGAG...GTG'), id='KC878324.1', name='KC878324.1', description='KC878324.1 Endozoicomonas atrinae strain WP70 16S ribosomal RNA gene, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGCAGTCGAGCGGTGACAGAACTAGCTTGCTAGTTGCTGACGAGCGGCGGACG...AGG'), id='KJ372452.1', name='KJ372452.1', description='KJ372452.1 Endozoicomonas montiporae strain Ab112_MC 16S ribosomal RNA (16S) gene, complete sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('AGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAG...GTA'), id='KT364257.1', name='KT364257.1', description='KT364257.1 Endozoicomonas ascidiicola strain AVMART05 16S ribosomal RNA gene, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('TAACATTCCCAGCTTGCTGGGAGATGACGAGCGGCGGACGGGTGAGTAACACGT...GAT'), id='KX780138.1', name='KX780138.1', description='KX780138.1 Endozoicomonas sp. isolate PM28062 16S ribosomal RNA gene, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('TGGCTCAGATTGAACGCT

### Pilot #2
Then I ran a model test in raxmlGUI (v2.0.10) {cite:p}`Edler2021, Darriba2019` using the full list of sequences (including the one missing from Pilot #1), which selected TrN (Tamura-Nei) as the best model.
1. I downloaded the 16S sequences using the [accession IDs](../../outputs/pilot_2/accession_id_list.txt).
2. I concatenated the sequences (see below), including those in the Endozoicomonas subtree ({numref}`pike-2013-tree`).
3. I aligned the sequences using MUSCLE v5.1 {cite:p}`Edgar_2022`.

    ```bash
    muscle5.1 -align outputs/pilot_2/all_unaligned.fasta -output outputs/pilot_2/muscle/16S_all_aligned.fasta
    ```

4. Inside MEGA I selected the following settings:
:::{table} MEGA settings for basic analysis (with missing sequence)
| Setting | Value |
| --- | --- |
| Scope | All taxa |
| Statistical method | Maximum likelihood |
| Test of phylogeny | Bootstrap method |
| No. of bootstrap replicates | 700 |
| Substitution type | Nucleotide |
| Model/method | Tamura-Nei model |
| Gaps/missing data treatment | Complete deletion |
| ML Heuristic method | Nearest-Neighbor-Interchange (NNI) |
| Initial tree for ML | Neighbor-joining |
:::

**This produced:**

```{figure} ../../outputs/pilot_2/16s_ml_all_aligned.png
The evolutionary history was inferred by using the Maximum Likelihood method and Tamura-Nei model {cite:p}`Tamura_1993`. The bootstrap consensus tree inferred from 500 replicates {cite:p}`Felsenstein_1985` is taken to represent the evolutionary history of the taxa analyzed. Branches corresponding to partitions reproduced in less than 50% of bootstrap replicates are collapsed. The percentages of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) are shown next to the branches {cite:p}`Felsenstein_1985`. Initial tree(s) for the heuristic search were obtained by applying the Neighbor-Joining method to a matrix of pairwise distances estimated using the Tamura-Nei model. This analysis involved 20 nucleotide sequences. Codon positions included were 1st+2nd+3rd+Noncoding. All positions containing gaps and missing data were eliminated (complete deletion option). There were a total of 545 positions in the final dataset. Evolutionary analyses were conducted in MEGA11 {cite:p}`Tamura_2021`.
```
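
The bootstrap procedure described in the caption can be sketched in two parts: resampling alignment columns with replacement to create a replicate dataset, and counting how often each clade appears across the replicate trees. The code below is a simplified illustration and not what MEGA does internally. All function names are hypothetical, tree inference itself is left out, and each replicate tree is assumed to be already summarised as a set of clades (each clade a `frozenset` of taxon names).

```python
import random
from collections import Counter


def bootstrap_columns(aligned: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement, using the same columns for every taxon."""
    n_sites = len(next(iter(aligned.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in aligned.items()}


def clade_support(replicate_clades: list[set[frozenset[str]]]) -> dict[frozenset[str], float]:
    """Percentage of replicate trees in which each clade appears."""
    counts = Counter(clade for clades in replicate_clades for clade in clades)
    n_replicates = len(replicate_clades)
    return {clade: 100 * count / n_replicates for clade, count in counts.items()}


def retained_clades(support: dict[frozenset[str], float], threshold: float = 50.0) -> dict[frozenset[str], float]:
    """Keep clades at or above the threshold; the rest would be collapsed."""
    return {clade: pct for clade, pct in support.items() if pct >= threshold}


# Toy example: four replicate trees summarised by their clades.
ab, cd, ac = frozenset({"A", "B"}), frozenset({"C", "D"}), frozenset({"A", "C"})
support = clade_support([{ab}, {ab}, {ab, cd}, {ac}])
kept = retained_clades(support)  # only the A+B clade reaches 50%

replicate = bootstrap_columns({"A": "ACGT", "B": "ACGA"}, random.Random(0))
```

In the toy example the A+B clade appears in three of four replicates (75%), so it is the only one kept at the 50% threshold used in the caption.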

In [10]:
write_unaligned_fasta(dir_paths, "../../outputs/pilot_2/accession_id_list.txt")

Done writing unaligned fasta


[SeqRecord(seq=Seq('ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGCGGGAAGAG...GTG'), id='KC878324.1', name='KC878324.1', description='KC878324.1 Endozoicomonas atrinae strain WP70 16S ribosomal RNA gene, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('ATGCAGTCGAGCGGTGACAGAACTAGCTTGCTAGTTGCTGACGAGCGGCGGACG...AGG'), id='KJ372452.1', name='KJ372452.1', description='KJ372452.1 Endozoicomonas montiporae strain Ab112_MC 16S ribosomal RNA (16S) gene, complete sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('AGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAG...GTA'), id='KT364257.1', name='KT364257.1', description='KT364257.1 Endozoicomonas ascidiicola strain AVMART05 16S ribosomal RNA gene, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('TAACATTCCCAGCTTGCTGGGAGATGACGAGCGGCGGACGGGTGAGTAACACGT...GAT'), id='KX780138.1', name='KX780138.1', description='KX780138.1 Endozoicomonas sp. isolate PM28062 16S ribosomal RNA gene, partial sequence', dbxrefs=[]),
 SeqRecord(seq=Seq('TGGCTCAGATTGAACGCT