# Scraping Sequences Automatically
This notebook is a tool I developed to automate the retrieval of 16S rRNA sequences for specific genera from GenBank, the database maintained by the National Center for Biotechnology Information (NCBI). The sequences of interest belong to the genera found in the various water bodies identified in my study.

My script reads a list of these genera, connects to the NCBI, and then iterates over the list, sending one search request to the NCBI per genus. The results are stored in a dictionary. The goal is to align these sequences and use them to construct a dendrogram, which will give a visual representation of the comparative genomic relationships among the genera at the different sites and their corresponding corrosion categories. Ultimately, I am interested in understanding the genetic relationships and evolutionary history of the microbial communities found at the different sites.

I am having problems installing packages from the terminal, so I am installing Biopython from within the notebook.

In [7]:
#!{sys.executable} -m pip install requests biopython

In [8]:
# Importing the necessary libraries
from Bio import Entrez
import time
import pandas as pd

In [9]:
# Communicating with the NCBI
Entrez.email = "wattsbeatrizamanda@gmail.com"
Entrez.api_key = "01d2f369faef0e78cd4906063672fab7c809"

In [10]:
# Read the Excel file
df = pd.read_excel("data/genera.xlsx")

# Get the list of genera
genera = df["Genus"].tolist()
# Dictionary to store the results
results = {}

If necessary, the search can be restricted to wild strains and water environments like this:
```python
search_term = f"{genus}[Orgn] AND 16S rRNA[Gene] AND wild[Properties] AND water[Environment]"
```

I have commented out the following code so that I don't rerun it by mistake and request the IDs from the NCBI again.

In [11]:
'''# Loop over the genera names
for genus in genera:
    search_term = f"{genus}[Orgn] AND 16S rRNA[Gene]"
    handle = Entrez.esearch(db="nucleotide", term=search_term)
    record = Entrez.read(handle)
    results[genus] = record["IdList"]
    time.sleep(10)  # pause for 10 seconds
'''
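One alternative to commenting the loop out is to cache the search results on disk, so that the cell can be rerun without contacting the NCBI again. The sketch below is only an illustration of that idea: `load_or_fetch`, the cache path, and the `fetch` callable are hypothetical names that do not appear elsewhere in this notebook.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return cached results if the file exists; otherwise call fetch() and cache them.

    `fetch` is any zero-argument callable returning a JSON-serializable dict,
    for example a function wrapping the Entrez.esearch loop (assumption).
    """
    path = Path(cache_path)
    if path.exists():
        # Cache hit: no request is sent
        return json.loads(path.read_text())
    results = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2))
    return results
```

With a function such as `run_search()` (hypothetical) that wraps the loop and returns the `results` dictionary, the call would be `results = load_or_fetch("data/esearch_results.json", run_search)`.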


In [12]:
# Print the results
for genus, ids in results.items():
    print(f"{genus}: {ids}")

In [13]:
# Create a DataFrame from the dictionary
df_sequences = pd.DataFrame(list(results.items()), columns=['Genus', 'IDs'])
df_sequences.head(10)

Unnamed: 0,Genus,IDs


In [14]:
# Merge the two DataFrames on the 'Genus' column
merged_sequences = pd.merge(df, df_sequences, on='Genus')

In [15]:
# Save the merged sequences for use in the dendrogram notebook
merged_sequences.to_csv('data/sequences.csv', index=False)
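Because the `IDs` column holds Python lists, `to_csv` writes each list as its string representation, so reading the file back yields strings rather than lists. Below is a minimal sketch of converting such a cell back into a list, assuming the cell looks like `"['123', '456']"`. The helper name `parse_id_list` is hypothetical.

```python
import ast


def parse_id_list(cell):
    """Convert a stringified list such as "['123', '456']" back into a list of strings."""
    if isinstance(cell, list):
        # Already a list (data still in memory), so only normalize the items
        return [str(x) for x in cell]
    # literal_eval only evaluates Python literals, never arbitrary code
    value = ast.literal_eval(cell)
    return [str(x) for x in value]
```

After `pd.read_csv("data/sequences.csv")`, the column could then be restored with `df["IDs"].apply(parse_id_list)`.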

These numbers are the GenBank identifiers (UIDs) returned by the search, each of which uniquely identifies a sequence record in the GenBank database. The rest of this notebook refers to them as accession numbers. The next step is to retrieve the actual sequences using these identifiers.

In [16]:
from Bio import Entrez, SeqIO

# Specify your email (required by NCBI)
Entrez.email = "wattsbeatrizamanda@gmail.com"

# Retrieve the sequence for a given accession number
def get_sequence(accession):
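    try:
        handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
        try:
            record = SeqIO.read(handle, "genbank")
        finally:
            # Close the handle even if parsing fails
            handle.close()
        return record.seq
    except Exception:
        # Any network or parsing error is reported as a missing sequence
        return None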
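Because `get_sequence` returns `None` for any failure, a temporary network error looks the same as a missing record. One possible mitigation is to retry a few times before giving up. The sketch below is generic and does not contact the NCBI itself. `with_retries` and its parameters are hypothetical names chosen for illustration.

```python
import time


def with_retries(fetch, accession, attempts=3, delay=1.0):
    """Call fetch(accession) up to `attempts` times and return the first non-None result.

    The wait doubles after each failed attempt. Returns None if every attempt fails.
    """
    wait = delay
    for attempt in range(attempts):
        result = fetch(accession)
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Back off before the next attempt
            time.sleep(wait)
            wait *= 2
    return None
```

It would be called as `with_retries(get_sequence, accession)` in place of a direct call to `get_sequence(accession)`.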

In [17]:
# Get the first 10 rows of the DataFrame, just to try
merged_sample = merged_sequences.head(10)
merged_sample.head()

Unnamed: 0,Kingdom,Phylum,Class,Order,Familia,Genus,GID,IDs


Build the DataFrame so that it stores the actual accession number for each row and not just the index. This code is difficult to run on this machine.

In [18]:
results = []

# Record the actual accession number for each row instead of the row index
# Loop over the rows in the DataFrame
for i, row in merged_sequences.iterrows():
    # Get the genus, GID and accession numbers for the current row
    genus = row['Genus']
    gid = row['GID']
    accession_numbers = row['IDs']
    
    # Initialize a variable to store the accession number that returns a valid sequence
    valid_accession = None
    
    # Loop over the accession numbers
    for accession in accession_numbers:
        # Retrieve the sequence
        sequence = get_sequence(accession)
        
        # Check if a sequence was found
        if sequence is not None:
            # Store the accession number and break the loop
            valid_accession = accession
            break

    # Check if a valid accession number was found
    if valid_accession is not None:
        # Store the result in the list
        results.append({'Genus': genus, 'GID': gid, 'Accession': valid_accession, 'Sequence': sequence})

    # Pause for 5 seconds
    time.sleep(5)

# Convert the list of results to a DataFrame
final_sequences = pd.DataFrame(results)

# Print the first few rows of the DataFrame
final_sequences.head()
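The inner loop, which keeps the first accession number that returns a sequence, can be isolated in a small function. This makes the selection logic testable without network access. This is a sketch under assumptions: `first_valid_sequence` is a hypothetical name, and `fetch` stands for any function that behaves like `get_sequence`.

```python
def first_valid_sequence(accessions, fetch):
    """Return (accession, sequence) for the first accession that fetch() resolves.

    Returns (None, None) when no accession yields a sequence.
    """
    for accession in accessions:
        sequence = fetch(accession)
        if sequence is not None:
            # Store the sequence as plain text so it survives CSV export
            return accession, str(sequence)
    return None, None
```

Inside the row loop this would read `valid_accession, sequence = first_valid_sequence(accession_numbers, get_sequence)`.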

## Checking the results

In [None]:
# Select a random sample of 10 rows
sample = final_sequences.sample(10)

In [None]:
# Loop over the rows in the sample
for i, row in sample.iterrows():
    # Get the accession number for the current row
    accession = row['Accession']
    
    # Retrieve the sequence from the NCBI database
    sequence = get_sequence(accession)
    
    # Print the accession number and the sequence
    print(f"Accession number: {accession}")
    print(f"Sequence from NCBI: {sequence}")
    print(f"Sequence from final_sequences: {row['Sequence']}")


Accession number: 219903933
Sequence from NCBI: None
Sequence from final_sequences: GCTTAACACATGCAAGTCGAACGGGCGTAGCAATACGTCAGTGGCAGACGGGTGAGTAACACGTGGGAACCTTCCTCGTAGTACGGAACAACTCAGGGAAACTTGAGCTAATACCGTATACGTCCGAGAGGAGAAAGATTTATCGCTATGAGACGGGCCCGCGTCCGATTAGCTAGTTGGTGGGGTAACGGCCTACCAAGGCAACGATCGGTAGCTGATCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATCTTGGACAATGGGCGCGAGCCTGATCCAGCCATGCCGCGTGAGTGAAGAAGGCCTTAGGGTTGTAAAGCTCTTTTAGCAGGGACGATAATGACGGTACCTGCAGAATAAGCCCCGGCAAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGATTGTTAAGTCAGGGGTGAAATCCCGAGGCTCAACCTCGGAACTGCCTTTGATACTGGCAATCTTGAGGCTGGAAGAGGTTGGTAGAATTCCCAGTGTAGAGGTGAAATTCGTAGATATTGGGAAGAATACCAGTGGCGAGGGCGGCCAACTGGTCCAGATCTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACTATGGGTGCTAGCCGTCAGCGGGCTTGCTCGTTGGTGGCGCAGCTAACGCATTAAGCACCCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTTAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCTACCCTTGACATCCCGGTCGCGGACACCAGAG

### Validation was successful: the sequences were retrieved correctly, so I am now proceeding to align them
Sequence alignment is a crucial step in comparative genomics. It makes it possible to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Biopython's interface can be used together with MUSCLE for sequence alignment:

Before aligning, I need to convert the DataFrame to a FASTA file.

In [None]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Initialize an empty list to store the SeqRecord objects
seq_records = []

# Loop over the rows in the DataFrame
for i, row in final_sequences.iterrows():
    # Create a Seq object from the sequence string
    seq = Seq(row['Sequence'])
    
    # Create a SeqRecord object from the Seq object
    seq_record = SeqRecord(seq, id=f"{row['GID']}_{row['Accession']}", description='')
    
    # Add the SeqRecord object to the list
    seq_records.append(seq_record)

# Write the SeqRecord objects to a FASTA file
with open("data/sequences.fasta", "w") as output_handle:
    SeqIO.write(seq_records, output_handle, "fasta")
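The record IDs written above combine `GID` and `Accession`, and downstream tools rely on these IDs to tell records apart. A quick way to check for repeated IDs before writing the file is sketched below. `duplicated_ids` is a hypothetical helper, not part of Biopython.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the identifiers that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [identifier for identifier, n in counts.items() if n > 1]
```

For the list built in the cell above, `duplicated_ids(r.id for r in seq_records)` returns an empty list when every ID is unique.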


## Final Alignment



The following snippet is the code I used on Colab to do the final alignment. This PC is not powerful enough to run the alignment step.

In [None]:
!./muscle3.8.31_i86linux64 -in final_sequences.fasta -out /content/drive/MyDrive/aligned.fasta

# Checking alignment correctness
This script generates bootstrap replicates to check the alignment. The bootstrapping is applied to the original unaligned sequences: multiple pseudo-replicate datasets are generated by resampling them, each dataset is aligned separately, and the resulting alignments are compared. This makes it possible to assess the reliability of the alignment by checking how consistent the alignments are across the pseudo-replicates. This step was also done in Colab.

In [None]:
# Generate the bootstrap replicates, then loop over them and run the alignment command for each file.
# Packages needed
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import random

# Load the original sequences
sequences = list(SeqIO.parse("/content/drive/MyDrive/sequences.fasta", "fasta"))

# Number of bootstrap replicates
n = 100

# Generate the bootstrap replicates
for i in range(n):
    # Generate a random sample of indices
    indices = [random.randint(0, len(sequences)-1) for _ in range(len(sequences))]

    # Create a new set of sequences from the sampled indices
    # (idx avoids reusing the name of the outer loop variable i)
    bootstrap_sequences = [sequences[idx] for idx in indices]

    # Write the bootstrap sequences to a new FASTA file
    output_file = f"/content/drive/MyDrive/replicate_new_{i}.fasta"
    SeqIO.write(bootstrap_sequences, output_file, "fasta")

    # Run MUSCLE on the bootstrap replicate
    aligned_file = f"/content/drive/MyDrive/output_replica_new_{i}.fasta"
    !./muscle3.8.31_i86linux64 -in {output_file} -out {aligned_file}
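Sampling with replacement can also be written with `random.choices`, and a dedicated random generator with a fixed seed makes the replicates reproducible between runs. Below is a minimal sketch on plain Python lists. `bootstrap_sample` and the seed value are assumptions for illustration only.

```python
import random


def bootstrap_sample(items, rng):
    """Draw len(items) elements from items with replacement using the given RNG."""
    return rng.choices(items, k=len(items))


# A fixed (arbitrary) seed gives the same replicate on every run
rng = random.Random(42)
replicate = bootstrap_sample(["seq1", "seq2", "seq3", "seq4"], rng)
```

Because the draw is with replacement, a replicate usually contains some records more than once and omits others, which is why duplicate identifiers appear in the replicate files.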

Removing the duplicates introduced by the bootstrap process

In [None]:
from Bio import SeqIO

def remove_duplicates(input_file, output_file):
    # Create a dictionary to store sequences
    sequences = {}

    # Parse the input FASTA file
    for record in SeqIO.parse(input_file, "fasta"):
        # Use the sequence identifier as the key and only keep sequences which haven't been encountered yet
        if record.id not in sequences:
            sequences[record.id] = record
        else:
            print(f"Warning: duplicate identifier {record.id} found")

    # Write the unique sequences to a new file
    SeqIO.write(sequences.values(), output_file, "fasta")

# Use the function for each of your files
for i in range(70):
    input_file = f"/content/drive/MyDrive/output_replica_new_{i}.fasta"
    output_file = f"/content/drive/MyDrive/output_replica_no_duplicates_{i}.fasta"
    remove_duplicates(input_file, output_file)

# Comparing the Replicates:
Loading the necessary libraries and reading in the alignment files.

In [None]:
from Bio import AlignIO

# List to store the alignments
alignments = []

# Loop through the alignment files
for i in range(70):
    # Read the alignment file
    alignment = AlignIO.read(f"/content/drive/MyDrive/output_replica_no_duplicates_{i}.fasta", "fasta")
       
    # Add the alignment to the list
    alignments.append(alignment)

Now, we calculate the pairwise distances for each alignment and store them in a list. This gives a measure of how similar the sequences in each alignment are to each other. The "identity" argument tells the calculator to score each pair of sequences by the proportion of positions at which they are identical, and the distance is one minus that proportion.

In [None]:
from Bio.Phylo.TreeConstruction import DistanceCalculator

# List to store the distance matrices
distances = []

# Loop through the alignments
for alignment in alignments:
    # Calculate the distance matrix
    calculator = DistanceCalculator('identity')
    dm = calculator.get_distance(alignment)
    
    # Add the distance matrix to the list
    distances.append(dm)
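To make the identity distance concrete, the sketch below computes it for two aligned strings in plain Python. It is a simplified illustration that counts every column, including gaps, and Biopython's `DistanceCalculator` may treat gap characters differently. `identity_distance` is a hypothetical name.

```python
def identity_distance(seq1, seq2):
    """Fraction of aligned columns at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    mismatches = sum(a != b for a, b in zip(seq1, seq2))
    return mismatches / len(seq1)
```

For example, `identity_distance("ACGT", "ACGA")` is 0.25, since one of the four columns differs.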

Calculating the average pairwise distance for each alignment and storing it, so that we know the overall similarity and therefore the robustness of the MUSCLE algorithm

In [None]:
import numpy as np

# List to store the average distances
average_distances = []

# Loop through the distance matrices
for dm in distances:
    # Calculate the average distance
    average_distance = np.mean([dm[i, j] for i in range(len(dm)) for j in range(i+1, len(dm))])
    
    # Add the average distance to the list
    average_distances.append(average_distance)

The test used below is the Kruskal-Wallis H-test, a non-parametric alternative to one-way ANOVA. The null hypothesis in an ANOVA is that all group means are equal, whereas in the Kruskal-Wallis test it is that all groups come from the same distribution.
If the p-value is less than the chosen significance level (often 0.05), the null hypothesis is rejected, suggesting that at least one group differs significantly from the others.

In [None]:
# Simple test of whether the pairwise distances differ significantly between replicates.
# The * operator unpacks the groups into separate arguments.
import scipy.stats as stats

# kruskal needs one sample (a sequence of values) per group, so each group holds the
# pairwise distances of one alignment rather than a single average
groups = [[dm[i, j] for i in range(len(dm)) for j in range(i + 1, len(dm))] for dm in distances]
H, p_value = stats.kruskal(*groups)

print("H-value:", H)
print("p-value:", p_value)
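For reference, the H statistic of a Kruskal-Wallis test can be computed by hand as H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), where N is the total number of observations, n_i is the size of group i, and R_i is the sum of the ranks in group i. The sketch below implements this formula in plain Python with average ranks for ties. It omits the tie correction that `scipy.stats.kruskal` applies and does not compute a p-value, so it is an illustration rather than a replacement. `kruskal_h` is a hypothetical name.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic without tie correction (ties get average ranks)."""
    values = sorted(v for g in groups for v in g)
    n_total = len(values)
    # Map each distinct value to the average of the 1-based ranks it occupies
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    # Sum over groups of (rank sum squared) / (group size)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
```

With three fully separated groups, `kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])`, the rank sums are 6, 15, and 24, which gives H = 7.2.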