# 1. Sequence Scraping - Introduction
This notebook is an automated tool designed for retrieving 16S rRNA sequences from the GenBank database, managed by the National Center for Biotechnology Information (NCBI). The focus is on sequences associated with various genera found in water bodies, as part of a broader study.

# Key Steps and Features:
### Reading Genera List:
The script starts by reading a list of genera from an Excel file. These genera are of particular interest for the study.
### NCBI Database Interaction:
It establishes a connection with the NCBI database and iteratively queries each genus. This process involves fetching the unique identifiers (accession numbers) of the relevant 16S rRNA sequences.
### Data Storage and Organization:
The retrieved data, mainly accession numbers, are stored in a dictionary for subsequent processing.
### Sequence Retrieval and Verification:
The notebook includes a function for fetching the actual sequence data from GenBank using the accession numbers. There's also a provision for validating the retrieved sequences.
### Data Manipulation:
Extensive data manipulation is performed, including transforming and merging dataframes to align the sequence data with the original genera list.
### Phylogenetic Analysis:
The primary goal is to align these sequences to construct a dendrogram, offering insights into the genomic relationships among the genera.
The notebook implements the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method for tree construction, with a mention of the possibility of using Neighbor-Joining (NJ).
The script includes steps for sequence alignment, conversion to FASTA format, and preparation for phylogenetic tree construction.
### Bootstrap Analysis for Tree Reliability:
To assess the reliability of the phylogenetic trees, a bootstrap analysis is performed. This involves generating multiple pseudo-replicated datasets and analyzing the resulting tree structures.
### Consensus Tree Construction:
A consensus tree is constructed from the bootstrap trees, providing a robust representation of the phylogenetic relationships.
### Additional Notes:
The script contains several safety and optimization measures, such as using time.sleep to avoid overloading the NCBI server and handling exceptions during sequence retrieval.
There's a focus on improving the script, including suggestions for secure API key storage, error handling enhancements, and code documentation.

I am having problems installing packages from the terminal, so I am installing Biopython from here.

# 2. Preprocessing the data
This notebook has been run in both Colab and VS Code; the code for whichever environment is not in use is silenced (commented out).
## 2.1 Mounting the data in colab

In [8]:
'''from google.colab import drive  #silence for vscode
drive.mount('/content/drive')

#change the path
os.chdir('/content/drive/My Drive/MIC')'''

## 2.2 Importing the necessary libraries

In [49]:
# For Colab and the terminal
#!pip install biopython
from pathlib import Path  # used to create an organized folder structure
import time
import numpy as np  # used for the bootstrap statistics in section 7.7
import pandas as pd
from Bio import Phylo
import matplotlib.pyplot as plt
import os
from Bio import Entrez, SeqIO, AlignIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo import draw
from random import choice
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.Consensus import majority_consensus

In [None]:
# For Colab
'''
from Bio import Phylo
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))  # Adjust size as needed
Phylo.draw(consensus_tree)
plt.show()
'''

## 2.3. Creating a Folder for the Results: Data_tree
A dedicated folder keeps the results and the bootstrapping output of the present notebook.

In [19]:
# For VSCode
base_dir = Path("/home/beatriz/MIC/2_Micro/data_tree")
bootstrap_dir = base_dir / "bootstrapping"
bootstrap_dir.mkdir(parents=True, exist_ok=True)  # parents=True also creates base_dir if it is missing

# For Colab
'''
from google.colab import drive
drive.mount('/content/drive')
base_dir = Path('/content/drive/My Drive/MIC/data')
bootstrap_dir = base_dir / "bootstrapping"
bootstrap_dir.mkdir(exist_ok=True)
'''


## 2.4. Loading, cleaning and preparing the dataframe
The 'selected' DataFrame is imported from notebook 3_feature_selection.

In [26]:
# Read the Excel file
selected = pd.read_excel("data/df_after_pca.xlsx", sheet_name='selected', header=[0,1,2,3,4,5,6,7])
# Drop first row specifically (index 0 which contains NaNs)
selected = selected.drop(index=0)
# Drop first column (the index column with Level1, Level2, etc)
selected = selected.drop(selected.columns[0], axis=1)
# Remove 'Unnamed' level names
selected.columns = selected.columns.map(lambda x: tuple('' if 'Unnamed' in str(level) else level for level in x))
# Setting index to Sites
selected= selected.set_index("Sites")
selected_taxa = selected.T
selected_taxa  = selected_taxa.reset_index()
selected_taxa.head()

Sites,level_0,level_1,level_2,level_3,level_4,level_5,level_6,level_7,site_1,site_2,...,site_61,site_62,site_63,site_64,site_65,site_66,site_67,site_68,site_69,site_70
0,Category,,,,,,,,3.0,1.0,...,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,1.0,1.0
1,Rhodocyclales_Rhodocyclaceae_Azospira,Bacteria,Proteobacteria,Betaproteobacteria,Rhodocyclales,Rhodocyclaceae,Azospira,110.0,26.928048,1.85923,...,0.353291,0.571304,0.624133,0.26,4.518236,0.4,0.004886,0.0,1.47,1.72
2,Actinomycetales_Dermabacteraceae_Brachybacterium,Bacteria,Actinobacteria,Actinobacteria,Actinomycetales,Dermabacteraceae,Brachybacterium,140.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.054437,0.0,0.0,0.021172,0.0,0.0
3,Erysipelotrichales_Erysipelotrichaceae_Bulleidia,Bacteria,Firmicutes,Erysipelotrichi,Erysipelotrichales,Erysipelotrichaceae,Bulleidia,154.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Actinomycetales_Promicromonosporaceae_Cellulos...,Bacteria,Actinobacteria,Actinobacteria,Actinomycetales,Promicromonosporaceae,Cellulosimicrobium,201.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
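
The column clean-up in the cell above replaces the placeholder names that pandas generates for empty header cells. A minimal sketch of that single step on a plain tuple may make it easier to follow; `clean_levels` and the example tuple are hypothetical and only serve as illustration.

```python
def clean_levels(column):
    """Replace 'Unnamed: ...' placeholders in a multi-level column tuple with ''."""
    return tuple('' if 'Unnamed' in str(level) else level for level in column)


# Hypothetical column tuple with one placeholder level
example = ('Rhodocyclales_Rhodocyclaceae_Azospira', 'Bacteria', 'Unnamed: 3_level_2', 110)
print(clean_levels(example))
```

Non-string levels such as the numeric GID pass through unchanged because the test is done on `str(level)` only.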


# 3. Establishing Communication with the NCBI through the API
## 3.1. Configuration and link
The communication details (e-mail and API key) are kept in a Python configuration file, config.py, which is listed in .gitignore so that it is not committed.
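
One possible way to keep the key out of the repository altogether is to read the credentials from environment variables. The sketch below is only an assumption about how config.py could be written; the variable names `NCBI_EMAIL` and `NCBI_API_KEY` and the helper `load_ncbi_credentials` are hypothetical.

```python
import os


def load_ncbi_credentials(env=None):
    """Return (email, api_key) read from environment variables.

    NCBI_EMAIL and NCBI_API_KEY are variable names assumed for this sketch.
    """
    env = os.environ if env is None else env
    email = env.get("NCBI_EMAIL")
    api_key = env.get("NCBI_API_KEY")  # optional; None if not set
    if not email:
        raise RuntimeError("NCBI_EMAIL is not set")
    return email, api_key


# Example with an explicit mapping instead of the real environment
print(load_ncbi_credentials({"NCBI_EMAIL": "user@example.org", "NCBI_API_KEY": "abc"}))
```

With such a helper, config.py would only need to assign the two returned values to the names imported by the notebook.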

In [21]:
# Communicating with the NCBI
from config import NCBI_API_KEY, NCBI_EMAIL
Entrez.email = NCBI_EMAIL
Entrez.api_key = NCBI_API_KEY

## 3.2. Selecting the Genera 
If necessary, it is possible to retrieve only the wild strains and water environments like this:
```
search_term = f"{genus}[Orgn] AND 16S rRNA[Gene] AND wild[Properties] AND water[Environment]"
```
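
To switch between the plain and the filtered query without editing the string by hand, the term can be assembled from its clauses. This is a small sketch; `build_search_term` is a hypothetical helper, and whether NCBI accepts the Properties and Environment qualifiers exactly as written has not been verified here.

```python
def build_search_term(genus, extra_filters=()):
    """Join the genus and gene clauses with any extra '[Field]' clauses using AND."""
    clauses = [f"{genus}[Orgn]", "16S rRNA[Gene]"]
    clauses.extend(extra_filters)
    return " AND ".join(clauses)


print(build_search_term("Azospira"))
print(build_search_term("Azospira", ["wild[Properties]", "water[Environment]"]))
```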

In [27]:
# Extract Genera from the multi-index
genera = selected.columns.get_level_values(6).to_list()

# Dictionary to store the results
results = {}
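
The results in section 3.4 include one entry with an empty genus name (`: []`), presumably coming from a column without a genus label such as Category. A possible clean-up before querying is sketched below; `clean_genera` is a hypothetical helper, not part of the original notebook.

```python
def clean_genera(names):
    """Drop empty or missing names and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for name in names:
        name = "" if name is None else str(name).strip()
        if name and name.lower() != "nan" and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


print(clean_genera(["", "Azospira", "Brachybacterium", "Azospira"]))
```

Applying it as `genera = clean_genera(genera)` would avoid sending an empty search term to the NCBI.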

## 3.3. Retrieving Data from the NCBI
The following code is silenced so that I do not run it again by mistake and request the accession numbers from the NCBI a second time.

In [None]:
# Loop over the genera names
for genus in genera:
    print(f"Searching for {genus}...")
    search_term = f"{genus}[Orgn] AND 16S rRNA[Gene]"
    handle = Entrez.esearch(db="nucleotide", term=search_term)
    record = Entrez.read(handle)
    handle.close()  # release the connection once the result has been parsed
    print(f"Found {len(record['IdList'])} sequences")
    results[genus] = record["IdList"]
    time.sleep(10)  # pause for 10 seconds
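
An alternative to silencing the cell is to cache the identifiers on disk, so that re-running it only queries genera that are still missing. The sketch below assumes a JSON cache file; `load_or_fetch` and its `fetch_ids` argument are hypothetical, with `fetch_ids` standing in for a function that wraps the `Entrez.esearch` call above.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, genera, fetch_ids):
    """Return {genus: ids}, calling fetch_ids only for genera missing from the cache."""
    cache_path = Path(cache_path)
    results = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for genus in genera:
        if genus not in results:
            results[genus] = list(fetch_ids(genus))
            # Write after every genus so an interrupted run keeps its progress
            cache_path.write_text(json.dumps(results, indent=2))
    return results
```

A second run with the same cache file then returns immediately for every genus that was already searched.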

## 3.4. Resulting accession numbers

In [16]:
# Print the results
for genus, ids in results.items():
    print(f"{genus}: {ids}")

: []
Azospira: ['1219551464', '1219551462', '1219551461', '1219551459', '1219551458', '1219551457', '1219551456', '1219551450', '1219551441', '1219551423', '1219551402', '1219551399', '1219551250', '1219551238', '1219551233', '1219551203', '1219550793', '1219550379', '1033472697', '1033471739']
Brachybacterium: ['2265463493', '1962359363', '1962354592', '1674370651', '1605110036', '1465391646', '1417735331', '1417735330', '1336499538', '1279335676', '558476457', '1100787592', '1041788017', '1041788016', '1041788015', '1041788014', '953940236', '953940153', '998612131', '984880459']
Bulleidia: ['2318612070', '1328373273', '1328373068', '1328372999', '1328372966', '1328372367', '1328372305', '1328372078', '1328372025', '1328371988', '1328371942', '1328371755', '1328371587', '1328371526', '1328371383', '1328371371', '1328371182', '1328370916', '1328370910', '1328370543']
Cellulosimicrobium: ['2283394915', '2056114386', '1997760214', '1635510468', '1635510435', '1417735350', '1417735349', 

In [None]:
# Create a DataFrame from the dictionary
df_accension = pd.DataFrame(list(results.items()), columns=['Genus', 'IDs'])
df_accension.head(16)

Unnamed: 0,Genus,IDs


## 3.5 Combining the Taxa with the Abundance Values in a Dataframe 

In [28]:
current_columns = selected_taxa.columns.tolist()
new_columns = ['Jointax', 'Kingdom','Phylum', 'Class', 'Order', 'Family', 'Genus', 'GID']
combined_columns= new_columns + current_columns[8:]
selected_taxa.columns = combined_columns

In [29]:
# Saving the intermediary results 
selected_taxa.to_excel("data_tree/selected_to_note.xlsx")

## 3.6. Merging Taxa, Abundance and Accession Numbers in a dataframe

In [None]:
# Merge the two DataFrames on the 'Genus' column
taxa_accension = pd.merge(selected_taxa, df_accension, on='Genus')
taxa_accension  # formerly merged_sequences, which was not an accurate name

In [22]:
# Save the merged sequences for use in the dendrogram notebook
taxa_accension.to_csv('data_tree/taxa_accension.csv', index=False)

# 4. Retrieving Sequences with the Accession Numbers
These numbers are the GenBank accession numbers, which are unique identifiers for sequences in the GenBank database. The next step is to retrieve the actual sequences using these accession numbers.
## 4.1. Define get_sequence function and validate it

In [44]:
# Specify email (required by NCBI)
Entrez.email = "wattsbeatrizamanda@gmail.com"

# Retrieve the sequence for a given accession number
def get_sequence(accession):
    try:
        handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()
        return record.seq
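    except Exception as e:
        # Report the failure instead of hiding it, then signal that no sequence was found
        print(f"Could not retrieve {accession}: {e}")
        return None
# Sequence validation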
def validate_sequence(sequence):
    """Validate 16S rRNA sequence quality"""
    if len(sequence) < 500:
        print(f"Warning: Sequence too short ({len(sequence)}bp)")
        return False
    
    base_counts = {base: str(sequence).upper().count(base) for base in 'ATCG'}
    total = sum(base_counts.values())
    
    if total == 0:
        print("Warning: Empty sequence")
        return False
        
    for base, count in base_counts.items():
        percentage = (count/total) * 100
        if percentage < 10 or percentage > 40:
            print(f"Warning: Unusual {base} content: {percentage:.1f}%")
            return False
            
    return True
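
Since the introduction lists error handling as an area for improvement, one option is to retry a failed request a few times before giving up. The following is a generic sketch, not the notebook's method; `with_retries` is a hypothetical helper and it would have to wrap a fetch that lets exceptions propagate rather than returning None.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n seconds and try again."""
    for n in range(attempts):
        try:
            return func()
        except Exception as exc:
            if n == attempts - 1:
                raise  # out of attempts: let the caller see the error
            print(f"Attempt {n + 1} failed ({exc}); retrying")
            sleep(base_delay * 2 ** n)
```

The `sleep` parameter exists only so that the waiting can be replaced in tests; in normal use the default `time.sleep` applies.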

Build the DataFrame with the actual accession numbers rather than just the row index. This code is difficult to run on this machine.

In [25]:
results = []

# Record the actual accession number for each row instead of the index
# Loop over the rows in the DataFrame
for i, row in taxa_accension.iterrows():
    # Get the genus, GID and accession numbers for the current row
    genus = row['Genus']
    gid = row['GID']
    accession_numbers = row['IDs']

    # Initialize a variable to store the accession number that returns a valid sequence
    valid_accession = None

    # Loop over the accession numbers
    for accession in accession_numbers:
        # Retrieve the sequence
        sequence = get_sequence(accession)

        # Check if a sequence was found
        if sequence is not None and validate_sequence(sequence):
            # Store the accession number and break the loop
            valid_accession = accession
            break
    
    # Check if a valid accession number was found
    if valid_accession is not None:
        # Store the result in the list
        results.append({'Genus': genus, 'GID': gid, 'Accession': valid_accession, 'Sequence': sequence})

    # Pause for 5 seconds
    time.sleep(5)

# Convert the list of results to a DataFrame
final_sequences = pd.DataFrame(results)

# Print the first few rows of the DataFrame
final_sequences.head(16)

Unnamed: 0,Genus,GID,Accession,Sequence
0,Azospira,110,1219551464,"(G, G, G, G, A, A, T, T, T, T, G, G, A, C, A, ..."
1,Brachybacterium,140,2265463493,"(A, C, G, A, T, G, A, C, G, G, T, G, G, T, G, ..."
2,Bulleidia,154,2318612070,"(T, A, C, G, T, A, G, G, T, A, G, C, G, A, G, ..."
3,Cellulosimicrobium,201,2283394915,"(T, A, C, G, G, A, G, A, G, T, T, T, G, A, T, ..."
4,Clostridium,214,2318625970,"(T, A, C, G, T, A, G, G, T, G, G, C, A, A, G, ..."
5,Corynebacterium,229,2734208655,"(A, G, G, C, C, G, T, G, G, C, G, G, C, G, T, ..."
6,Halomonas,354,2652435758,"(C, A, C, T, T, C, T, G, G, T, G, C, A, G, T, ..."
7,Legionella,408,1985648187,"(T, A, C, G, G, A, G, G, G, T, G, C, G, A, G, ..."
8,Mycoplana,471,802131290,"(A, A, C, G, A, A, C, G, C, T, G, G, C, G, G, ..."
9,Oerskovia,497,1041788013,"(C, A, T, G, C, C, G, T, A, A, A, C, G, T, T, ..."


Note on item 14 ("For sequence retrieval, add data quality checks"): I am not sure where to insert it. I also checked the taxa and found that the selected genera have no missing taxon names, so there is no need to check for that. Where a genus name was missing, I believe I built the name from the next higher known level; for instance, if a member of Rhodocyclales had no genus name, I would use something like Rhodocyclae_genus in place of a name such as Azospira. This applied only to the few unknowns for which a higher level was known. There are many unknown taxa overall, but they were neither frequent nor abundant, so they were not selected in the third notebook. This is the fourth notebook; in the fifth I build the iTOOL_tree. So the only outstanding request is item 14.
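
One place where the data quality checks of item 14 could go is right after each sequence is retrieved, next to `validate_sequence`. The sketch below is a suggestion only; `sequence_quality_report` is a hypothetical helper, and the thresholds (500 bp minimum length, at most 1% ambiguous bases) are assumed values that would need to be chosen for the study.

```python
def sequence_quality_report(sequence, min_length=500, max_ambiguous_fraction=0.01):
    """Return length, ambiguous-base fraction, GC content and a pass flag."""
    seq = str(sequence).upper()
    length = len(seq)
    canonical = sum(seq.count(base) for base in "ACGT")
    ambiguous_fraction = (length - canonical) / length if length else 1.0
    gc_content = (seq.count("G") + seq.count("C")) / canonical if canonical else 0.0
    return {
        "length": length,
        "ambiguous_fraction": ambiguous_fraction,
        "gc_content": gc_content,
        "passed": length >= min_length and ambiguous_fraction <= max_ambiguous_fraction,
    }


print(sequence_quality_report("ACGT" * 200))
```

Returning a dictionary instead of printing warnings makes it possible to store the report alongside each accession number and review the rejected sequences later.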

## 4.2. Validating the Results

In [26]:
# Select a random sample of 10 rows
sample = final_sequences.sample(10)

In [27]:
# Loop over the rows in the sample
for i, row in sample.iterrows():
    # Get the accession number for the current row
    accession = row['Accession']

    # Retrieve the sequence from the NCBI database
    sequence = get_sequence(accession)

    # Print the accession number and the sequence
    print(f"Accession number: {accession}")
    print(f"Sequence from NCBI: {sequence}")
    print(f"Sequence from final_sequences: {row['Sequence']}")

Accession number: 2283394915
Sequence from NCBI: TACGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGATGATGCCCAGCTTGCTGGGTTGATTAGTGGCGAACGGGTGAGTAACACGTGAGTAACCTGCCCTTGACTTCGGGATAACTCCGGGAAACCGGGGCTAATACCGGATATGAGCCGTCCTCGCATGGGGGTGGTTGGAAAGTTTTTCGGTCAGGGATGGGCTCGCGGCCTATCAGCTTGTTGGTGGGGTGATGGCCTACCAAGGCGACGACGGGTAGCCGGCCTGAGAGGGCGACCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGCAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTCGCGTCTGGTGTGAAAACTCGAGGCTCAACCTCGAGCTTGCATCGGGTACGGGCAGACTAGAGTGCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCCGCAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGTTGGGCACTAGGTGTGGGGCTCATTCCACGAGTTCCGTGCCGCAGCAAACGCATTAAGTGCCCCGCCTGGGGAGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGCGGATTAATTCGATGCAACGCGAAG

### Validation was successful: the sequences were retrieved correctly, so the next step is to align them
Sequence alignment is a crucial step in comparative genomics. It identifies regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Biopython's interface to MUSCLE can be used for the sequence alignment.
Before aligning the sequences, the DataFrame must be converted to a FASTA file.
# 5. Converting Sequences to a FASTA Format

In [28]:
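# Checking for NaN values in every row, not only the last one left over from an earlier loop
for i, row in final_sequences.iterrows():
    if pd.isna(row['Sequence']):
        print(f"Missing value at index {i}")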

In [29]:
# Initialize an empty list to store the SeqRecord objects
seq_records = []

# Loop over the rows in the DataFrame
for i, row in final_sequences.iterrows():
       
    # Validate sequence before adding
    if pd.isna(row['Sequence']):
        print(f"Warning: Missing sequence for {row['Genus']}")
        continue
    try:
        seq = Seq(str(row['Sequence'])) # Create a Seq object from the sequence string
        seq_record = SeqRecord(seq,  # Create a SeqRecord object from the Seq object
                             id=row['Genus'], 
                             description=f"Accession:{row['Accession']}")
        seq_records.append(seq_record)  #   Add the SeqRecord object to the list
    except Exception as e:
        print(f"Error processing {row['Genus']}: {str(e)}")    

# Write the SeqRecord objects to a FASTA file
with open("data_tree/final_sequences.fasta", "w") as output_handle:
    SeqIO.write(seq_records, output_handle, "fasta")
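
For reference, the FASTA layout that `SeqIO.write` produces is a header line starting with `>` followed by the sequence wrapped over several lines. The pure-Python sketch below only illustrates that layout; `to_fasta` is a hypothetical helper, the record shown is made up, and the 60-character default line width is an assumption.

```python
def to_fasta(records, width=60):
    """Format (identifier, description, sequence) triples as FASTA text."""
    lines = []
    for identifier, description, sequence in records:
        lines.append(f">{identifier} {description}".rstrip())
        sequence = str(sequence)
        # Wrap the sequence into chunks of at most `width` characters
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


print(to_fasta([("GenusA", "Accession:0000", "ACGTACGTACGTACG")], width=10))
```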

In [30]:
# Check that every stored sequence is a Seq object
for i, row in final_sequences.iterrows():
    if not isinstance(row['Sequence'], Seq):
        print(f"Sequence at index {i} is not a Seq object")

# 6. Final Alignment
The following snippet is the code that I use on Colab to do the final alignment. The present PC is not powerful enough to carry out the next step, the alignment of the sequences.

In [None]:
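# For VSCode it is better to install MUSCLE from the terminal
import subprocess

def run_muscle_alignment(in_fasta="data/final_sequences.fasta", out_fasta="data/aligned_sequences.fasta"):
    try:
        # Same -in/-out options as the original shell call; check=True raises if MUSCLE fails
        subprocess.run(["muscle", "-in", in_fasta, "-out", out_fasta], check=True)
    except (FileNotFoundError, subprocess.CalledProcessError) as e:
        print(f"MUSCLE could not be run: {e}")
        print("For VSCode, install MUSCLE using:")
        print("sudo apt-get update && sudo apt-get install muscle")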

'''# Installing the software from the jpnb shell, I silence it. For Colab 
def run_muscle_alignment():
    !apt-get update
    !apt-get install muscle -y
    !muscle -in data/final_sequences.fasta -out data/aligned_sequences.fasta
'''

# 7. Bootstrapping: Checking alignment correctness
Bootstrapping is a resampling technique used to assess the reliability of the estimated tree. The bootstrap replicates are resampled datasets used to generate multiple trees, which are then analyzed to provide support values for the branches in the original tree. They are not the actual data but are used for statistical validation of the tree topology.
This script generates bootstrap replicates from the alignment file. Each pseudo-replicated dataset is built by resampling the columns of the original alignment with replacement; a distance matrix and a tree are then computed for each replicate, and the resulting trees are compared. This shows how consistently each grouping is recovered across the pseudo-replicated datasets. I keep these results in a separate folder named bootstrapping.
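
The "Resample columns" step used for the replicates in section 7.4 can be illustrated without Biopython. The sketch below works on a toy alignment stored as a dictionary of gapped strings; `bootstrap_columns` and the toy data are hypothetical and only show the idea of drawing alignment columns with replacement.

```python
import random


def bootstrap_columns(aligned, rng):
    """Resample alignment columns with replacement; aligned maps name -> gapped string."""
    length = len(next(iter(aligned.values())))
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("All aligned sequences must have the same length")
    # The same column indices are used for every sequence, so columns stay intact
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aligned.items()}


toy = {"A": "ACGT-A", "B": "ACGTTA", "C": "A-GTTC"}
replicate = bootstrap_columns(toy, random.Random(0))
print(replicate)
```

Each replicate has the same number of columns as the original alignment, and every column in it is a copy of some original column; some columns appear more than once and others not at all.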
## 7.1. Check Sequence Reading

In [36]:
# Check Sequence Reading
for seq_record in SeqIO.parse("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))

Halomonas
Seq('------------------------------------------------------...---')
Pseudoalteromonas
Seq('-----------------------------------------------------G...---')
Legionella
Seq('------------------------------------------------------...---')
Azospira
Seq('------------------------------------------------------...---')
Corynebacterium
Seq('------------------------------------------------------...---')
Cellulosimicrobium
Seq('TACGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACAT...TTT')
Brachybacterium
Seq('------------------------------ACGATGACGGTGGTGCTTGCAC--...---')
Oerskovia
Seq('------------------------------------------------------...---')
Mycoplana
Seq('-------------------------AACGAACGCTGGCGGCAGGCTTAACACAT...---')
Paracoccus
Seq('-------------------------AGCGAGATCT----TCGGATCT-------...---')
Bulleidia
Seq('------------------------------------------------------...---')
Clostridium
Seq('------------------------------------------------------...---')
Oxobacter
Seq('--------GTTTGATCC

## 7.2. Check alignment object

In [37]:
#Check alignment object
alignment = AlignIO.read("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences.fasta", "fasta")
print(alignment)

Alignment with 13 rows and 1763 columns
--------------------------------------------...--- Halomonas
--------------------------------------------...--- Pseudoalteromonas
--------------------------------------------...--- Legionella
--------------------------------------------...--- Azospira
--------------------------------------------...--- Corynebacterium
TACGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTG...TTT Cellulosimicrobium
------------------------------ACGATGACGGTGGT...--- Brachybacterium
--------------------------------------------...--- Oerskovia
-------------------------AACGAACGCTGGCGGCAGG...--- Mycoplana
-------------------------AGCGAGATCT----TCGGA...--- Paracoccus
--------------------------------------------...--- Bulleidia
--------------------------------------------...--- Clostridium
--------GTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTG...TT- Oxobacter


## 7.3. Check Distance between aligned segments

In [38]:
#check
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
print(dm)

Halomonas	0
Pseudoalteromonas	0.7328417470221213	0
Legionella	0.7169597277368123	0.3238797504254113	0
Azospira	0.6891661939875213	0.26091888825865006	0.1440726035167328	0
Corynebacterium	0.7419171866137266	0.38230289279636986	0.1378332387975043	0.1962563811684629	0
Cellulosimicrobium	0.763471355643789	0.5677821894498014	0.763471355643789	0.7022121384004538	0.810550198525241	0
Brachybacterium	0.7260351673284176	0.15995462280204198	0.31934203062960864	0.2648893930799773	0.366988088485536	0.4815655133295519	0
Oerskovia	0.7441860465116279	0.3698241633579127	0.1537152580828134	0.17583664208735106	0.1866137266023823	0.7941009642654566	0.34940442427680096	0
Mycoplana	0.7073170731707317	0.5042541123085649	0.6795235394214407	0.6250709018718095	0.7379466817923993	0.26659103800340334	0.4946114577424844	0.7192285876347135	0
Paracoccus	0.6976744186046512	0.23199092456040837	0.38854225751559845	0.33238797504254114	0.443562110039705	0.47419171866137266	0.21837776517300056	0.42768009075439595	0.351673
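
To make the numbers in this matrix easier to read, the sketch below computes an identity-based distance for two aligned strings: one minus the fraction of positions that carry the same non-gap character. It illustrates the idea only; `identity_distance` is a hypothetical helper, and treating gap positions as non-matches is an assumption about how the 'identity' model behaves, not a copy of Biopython's code.

```python
def identity_distance(seq1, seq2, skip="-*"):
    """One minus the fraction of aligned positions with an identical, non-gap character."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be aligned to the same length")
    matches = sum(1 for a, b in zip(seq1, seq2) if a == b and a not in skip)
    return 1 - matches / len(seq1)


# 5 of the 8 positions match (the two shared gap columns do not count as matches)
print(identity_distance("ACGT--AC", "ACGA--AC"))
```

Under this assumption, sequences that are mostly gaps in the alignment end up with large distances to everything else, which may be worth keeping in mind when reading the values above.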

## 7.4. Loading Original Aligned Sequences and Generating Bootstrap Replicates

In [39]:
# Read the original alignment
alignment = AlignIO.read("/home/beatriz/MIC/2_Micro/data_tree/aligned_sequences.fasta", "fasta")

In [40]:
# Number of bootstrap replicates (assumed value; it is not defined elsewhere in this notebook)
BOOTSTRAP_REPLICATES = 100

# Initialize a list to store the distance matrix of each bootstrap replicate
bootstrap_replicates = []
calculator = DistanceCalculator('identity')
n_cols = alignment.get_alignment_length()

# Generate bootstrap replicates with validation and progress tracking
for i in range(BOOTSTRAP_REPLICATES):
    # Resample columns with replacement and create a new MultipleSeqAlignment object
    cols = [choice(range(n_cols)) for _ in range(n_cols)]
    new_seqs = [
        SeqRecord(Seq("".join(record.seq[c] for c in cols)), id=record.id, description="")
        for record in alignment
    ]
    new_alignment = MultipleSeqAlignment(new_seqs)

    # Sequence length validation
    if new_alignment.get_alignment_length() != n_cols:
        print(f"Warning: Replicate {i} length mismatch")
        continue

    if i % 10 == 0:  # Every 10 replicates
        print(f"Generated {i}/{BOOTSTRAP_REPLICATES} bootstrap replicates")

    # Calculate and store the distance matrix
    dm = calculator.get_distance(new_alignment)
    bootstrap_replicates.append(dm)

'''# Check the replicates
for replicate in bootstrap_replicates:
    print(replicate)'''


## 7.6. Generating Individual Trees for Each Bootstrap Replicate

In [41]:
# Initialize a list to store bootstrap trees
bootstrap_trees = []

# Initialize the tree constructor
constructor = DistanceTreeConstructor()

# Generate trees from each bootstrap replicate
for dm in bootstrap_replicates:
    # Construct the tree (using UPGMA here, but you can use 'nj' for Neighbor-Joining)
    tree = constructor.upgma(dm)

    # Store the tree
    bootstrap_trees.append(tree)

# Check the trees
for tree in bootstrap_trees:
    print(tree)

Tree(rooted=True)
    Clade(branch_length=0, name='Inner12')
        Clade(branch_length=0.2307745143221781, name='Inner11')
            Clade(branch_length=0.16815874485961424, name='Inner9')
                Clade(branch_length=0.13053034600113445, name='Mycoplana')
                Clade(branch_length=0.03098411798071471, name='Inner7')
                    Clade(branch_length=0.09954622802041974, name='Oxobacter')
                    Clade(branch_length=0.09954622802041974, name='Cellulosimicrobium')
            Clade(branch_length=0.18779856335082243, name='Inner10')
                Clade(branch_length=0.09625150666477597, name='Inner6')
                    Clade(branch_length=0.0860482841747022, name='Oerskovia')
                    Clade(branch_length=0.0033944271128757697, name='Inner5')
                        Clade(branch_length=0.08265385706182643, name='Azospira')
                        Clade(branch_length=0.01529707884288145, name='Inner3')
                            Clade(

## 7.7. Bootstrap Basic Statistics

In [None]:
# Some basic analytics for quality assurance
# Note: consensus_tree is created in section 8.1, so that cell must be run first
import numpy as np

support_values = []
for clade in consensus_tree.find_clades():
    if clade.confidence is not None:
        support_values.append(clade.confidence)

print(f"Average bootstrap support: {np.mean(support_values):.1f}%")
print(f"Minimum bootstrap support: {np.min(support_values):.1f}%")
print(f"Maximum bootstrap support: {np.max(support_values):.1f}%")

# 8. Consensus Tree: Phylogenetic Robustness Assessment
Visualising the consensus tree helps in understanding how well supported each clade is in the original tree. Multiple trees are generated from the bootstrap replicates and then summarised into a consensus tree. The support values in the consensus tree represent the percentage of bootstrap trees that contain that particular branch. In summary, the final consensus tree summarises all the bootstrap trees.
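
How a support value arises can be shown on toy trees written as nested tuples: each clade is identified by its set of leaves, and its support is the percentage of trees that contain that set. This is a simplified sketch of the counting idea, not the algorithm inside `majority_consensus`; `clades`, `clade_support` and the toy trees are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return the leaf set (frozenset) of every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def clade_support(trees):
    """Percentage of trees in which each clade appears."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    return {clade: 100 * n / len(trees) for clade, n in counts.items()}


toy_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]
support = clade_support(toy_trees)
print(support[frozenset({"A", "B"})])
```

In this toy set the grouping of A with B appears in three of the four trees, so it would survive a 50% majority cutoff, while the grouping of A with C appears only once and would not.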
## 8.1. Generating and Saving the Consensus Tree
Generating the consensus tree has been challenging because the packages resist installation: Phylo is reported as not recognised and Phylo.draw seems to be obsolete.

In [51]:
# Generate the majority consensus tree
consensus_tree = majority_consensus(bootstrap_trees, 0.5)  # 0.5 is the cutoff for majority rule

# Print the consensus tree
print('Consensus Tree:')
print(consensus_tree)
# Preparing to plot
consensus_tree.ladderize()  # Organize branches
for clade in consensus_tree.find_clades():
    if clade.confidence:  # Add bootstrap values as branch labels
        # Internal clades have no name, so fall back to an empty string instead of 'None'
        clade.name = f"{clade.name or ''}({clade.confidence:.0f})"

# Save in Newick format
Phylo.write([consensus_tree], 'data_tree/consensus_tree.newick', 'newick')

Consensus Tree:
Tree(rooted=True)
    Clade()
        Clade(branch_length=0.36130486032331255, name='Halomonas')
        Clade(branch_length=0.23541484242058996, confidence=100.0)
            Clade(branch_length=0.1825928925304878, confidence=100.0)
                Clade(branch_length=0.0898881412188032, confidence=100.0)
                    Clade(branch_length=0.0860482841747022, name='Oerskovia')
                    Clade(branch_length=0.0027987247791913137, confidence=98.0)
                        Clade(branch_length=0.08265385706182643, name='Azospira')
                        Clade(branch_length=0.014845079410096431, confidence=100.0)
                            Clade(branch_length=0.06735677821894498, name='Corynebacterium')
                            Clade(branch_length=0.05177254679523538, confidence=100.0)
                                Clade(branch_length=0.016661939875212695, name='Legionella')
                                Clade(branch_length=0.001597604133270856, confi

NameError: name 'Phylo' is not defined

## 8.2. Plotting the Consensus Tree

In [None]:
def plot_tree_with_annotations(consensus_tree, output_path):
    fig, ax = plt.subplots(figsize=(10, 8))

    # Draw tree with bootstrap values on the axes created above;
    # clades without a name get an empty label instead of 'None'
    Phylo.draw(consensus_tree,
               axes=ax,
               do_show=False,
               show_confidence=True,
               label_func=lambda x: f"{x.name or ''} ({x.confidence:.0f})" if x.confidence else (x.name or ''))

    ax.set_title("16S rRNA Phylogenetic Tree with Bootstrap Support")
    ax.set_xlabel("Genetic Distance")

    # Save in multiple formats before showing; saving after plt.show() may write a blank figure
    fig.savefig(output_path / 'tree.png', dpi=300, bbox_inches='tight')
    fig.savefig(output_path / 'tree.pdf', format='pdf', bbox_inches='tight')
    plt.show()

# 9. Interpretation of the Consensus Tree
High Confidence Clades: Clades with 100% bootstrap support, such as the one containing 'Haemophilus' and 'Legionella', are highly reliable. This indicates that these taxonomic groupings are very likely to be accurate.

Moderate Confidence Clades: The clade with a bootstrap value of 68%, containing 'Oxobacter' and 'Pseudarthrobacter', has moderate support. This is close to the 70% level that is commonly taken as the lower limit for strong support, so the grouping is more likely to be true than not but should be interpreted with caution.

Taxonomic Implications: The high bootstrap values for clades containing genera like 'Haemophilus', 'Legionella', 'Roseococcus', etc., may have implications for our understanding of the relationships among these bacteria. For example, it could suggest that 'Haemophilus' and 'Legionella' share a more recent common ancestor with each other than with other genera in the tree.

Data Quality: High bootstrap values across the tree also speak to the quality of our sequence data and the alignment, suggesting that the results are not artifacts of poor data or methodology.

In [None]:
'''# Debugging
# check the function to generate replicates
for replicate in bootstrap_replicates:
    print(replicate)
# Store the concatenated replicates in the dictionary
bootstrap_replicates[0] = concatenated_replicates

# Print first bootstrap replicate to check
print('First bootstrap replicate:', bootstrap_replicates[0])
print("Number of sequences in alignment:", len(alignment))
print("IDs of sequences in alignment:", [record.id for record in alignment])

# # Tree basic info
for i, tree in enumerate(trees):
    print(f"Tree {i}: {tree}")
from Bio.Phylo.BaseTree import Tree
#Are the trees objects?
for i, tree in enumerate(trees):
    if not isinstance(tree, Tree):
        print(f"The element at index {i} is not a Tree object.")
        # Is the trees list empty?
if not trees:
    print("The list of trees is empty.")'''

# 10. Summarising the Results

In [None]:
def summarize_analysis(final_sequences, consensus_tree, output_dir):
    """Generate analysis summary"""
    summary = {
        'total_genera': len(final_sequences['Genus'].unique()),
        'sequences_retrieved': len(final_sequences),
        'avg_sequence_length': final_sequences['Sequence'].str.len().mean(),
        'genera_list': ', '.join(final_sequences['Genus'].unique())
    }
    
    # Save summary
    with open(output_dir / 'analysis_summary.txt', 'w') as f:
        for key, value in summary.items():
            f.write(f"{key}: {value}\n")
            
    return summary

In [None]:
# Save the tree in Newick format
Phylo.write([consensus_tree], 'data/consensus_tree.newick', 'newick')

1