<a href="https://colab.research.google.com/github/Palaeoprot/Bear/blob/main/Compile_FASTAs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses the tools described in "The EMBL-EBI Job Dispatcher sequence analysis tools framework in 2024":

Madeira F et al
Nucleic Acids Research, 01 Jul 2024, 52(W1):W521-W525
https://doi.org/10.1093/nar/gkae241


The **EMBL-EBI Job Dispatcher** sequence analysis tools framework (https://www.ebi.ac.uk/jdispatcher) enables the scientific community to perform a diverse range of sequence analyses using popular bioinformatics applications. Free access to the tools and required sequence datasets is provided through user-friendly web applications, as well as via RESTful and SOAP-based APIs. These are integrated into popular EMBL-EBI resources such as UniProt, InterPro, ENA and Ensembl Genomes. This paper overviews recent improvements to Job Dispatcher, including its brand new website and documentation, enhanced visualisations, improved job management, and a rising trend of user reliance on the service from low- and middle-income regions.

Documentation: https://www.uniprot.org/help/api
WEBSITE_API = "https://rest.uniprot.org/"

Documentation: https://www.ebi.ac.uk/proteins/api/doc/
PROTEINS_API = "https://www.ebi.ac.uk/proteins/api"
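
As a minimal sketch of how the UniProt base URL is used, the helper below builds a UniProtKB search URL with the standard library. The function name `build_search_url` is hypothetical (it is not part of the notebook). The query syntax and field names mirror those used in the collection cell further down, and no request is sent.

```python
from urllib.parse import urlencode

WEBSITE_API = "https://rest.uniprot.org/"

def build_search_url(gene_name, taxonomy_id, fields, fmt="tsv"):
    """Return a UniProtKB search URL (hypothetical helper; sends no request)."""
    query = f"{gene_name} AND (taxonomy_id:{taxonomy_id})"
    params = urlencode({"query": query, "fields": ",".join(fields), "format": fmt})
    # rstrip avoids a double slash, since WEBSITE_API ends with "/"
    return f"{WEBSITE_API.rstrip('/')}/uniprotkb/search?{params}"

example_url = build_search_url("COL1A1", 9632, ["gene_names", "sequence", "ft_chain"])
print(example_url)
```

Encoding the query with `urlencode` keeps spaces, colons and parentheses from being pasted raw into the URL.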

In [None]:
!pip install biopython  # Install the Biopython library

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.2/3.2 MB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: biopython
Successfully installed biopython-1.84


In [None]:
from google.colab import drive
drive.mount('/content/drive')
#from Bio import SeqIO
import json
import os
import sys
import time

import pandas as pd
import requests
from IPython.display import display

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Collect sequences
This cell fetches data from UniProt, isolates chains, removes duplicate sequences, submits alignment jobs, skips genes whose aligned files already exist, and handles genes with a single sequence.
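
The chain-isolation and de-duplication steps can be sketched on their own, without any network access. This is an illustration under assumptions: `extract_chains` is a hypothetical helper, and the sequence and `ft_chain` string are made up to resemble the `CHAIN start..end; ...` layout that the cell below expects. Unlike that cell, which logs chains it cannot parse, this sketch simply skips them.

```python
import re

def extract_chains(sequence, chain_info):
    """Return (start, end, subsequence) for each 'CHAIN start..end' feature.

    Features with uncertain bounds (e.g. '?..120' or '<1..50') do not match
    the pattern and are skipped.
    """
    chains = []
    for start, end in re.findall(r"CHAIN (\d+)\.\.(\d+)", chain_info):
        start, end = int(start), int(end)
        # Positions are 1-based and inclusive, so shift the start by one
        chains.append((start, end, sequence[start - 1:end]))
    return chains

# Made-up example data, not a real UniProt record
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
chain_info = 'CHAIN 3..10; /note="Example chain"; /id="PRO_0000000000"'

chains = extract_chains(sequence, chain_info)

# Duplicate the records on purpose, then remove duplicates while keeping order
records = [sub for _, _, sub in chains] * 2
unique_sequences = list(dict.fromkeys(records))
print(chains, unique_sequences)
```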

In [None]:
# Slightly improved version
import requests
import sys
import pandas as pd
import os
from google.colab import drive
from Bio import SeqIO

# Mount Google Drive
drive.mount('/content/drive')

# Documentation: https://www.uniprot.org/help/api
WEBSITE_API = "https://rest.uniprot.org/"
taxonomy_id = 9632

# Helper function to download data; raises requests.HTTPError on a failed request
def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
    return response

# List of gene names
gene_names = ["COL1A1", "COL1A2"]

# Output directory
output_dir = "/content/drive/MyDrive/7Papers/26_Ursus_abstrusus/Ancient_Bear/Ancient_Bear_Analysis/FASTAs/Computational_FASTAs/Fasta_Bear_proteins/"
os.makedirs(output_dir, exist_ok=True)

# DataFrame to store the results
df = pd.DataFrame(columns=["Gene Names", "Organism", "Organism (ID)", "Taxonomic lineage", "Length", "Sequence", "Chain", "Aligned Sequence"])

# Check existing files in the output directory
existing_files = set(os.listdir(output_dir))

# Error log file
error_log_file = os.path.join(output_dir, "error_log.txt")

# Iterate over each gene name and fetch the data
for gene_name in gene_names:
    aligned_file = f"{gene_name}_aligned.fasta"
    if aligned_file in existing_files:
        print(f"Skipping {gene_name} as it is already aligned.")
        continue

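    print(f"Fetching data for gene: {gene_name}")
    # Let requests URL-encode the query. The search endpoint is paginated,
    # so only the first page of results is read here.
    url = f"{WEBSITE_API.rstrip('/')}/uniprotkb/search"
    params = {
        "query": f"{gene_name} AND (taxonomy_id:{taxonomy_id})",
        "fields": "gene_names,organism_name,organism_id,lineage,length,sequence,ft_chain",
        "format": "tsv",
    }
    response = get_url(url, params=params)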

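    # Parse the response and add to the DataFrame.
    # splitlines() is used instead of strip().split('\n') so that an empty
    # trailing chain column on the last row is not stripped away.
    lines = response.text.splitlines()[1:]  # Skip the header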
    for line in lines:
        parts = line.split('\t')
        if len(parts) == 7:
            gene, organism, organism_id, lineage, length, sequence, chain_info = parts
            # Check if chain_info contains the expected "CHAIN" substring
            if "CHAIN" in chain_info:
                try:
                    # Handle multiple chains if they exist
                    chains = chain_info.split('CHAIN ')
                    for chain in chains[1:]:
                        chain_start, chain_end = chain.split(';')[0].split('..')
                        truncated_sequence = sequence[int(chain_start)-1:int(chain_end)]
                        new_row = {
                            "Gene Names": gene,
                            "Organism": organism,
                            "Organism (ID)": organism_id,
                            "Taxonomic lineage": lineage,
                            "Length": length,
                            "Sequence": truncated_sequence,
                            "Chain": f"{chain_start}..{chain_end}",
                            "Aligned Sequence": ""
                        }
                        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
                except ValueError:
                    with open(error_log_file, 'a') as log:
                        log.write(f"Error parsing chain_info for {gene_name}: {chain_info}\n")
                    print(f"Error parsing chain_info for {gene_name}: {chain_info}")
            else:
                # Use the whole sequence if no valid chain is found
                new_row = {
                    "Gene Names": gene,
                    "Organism": organism,
                    "Organism (ID)": organism_id,
                    "Taxonomic lineage": lineage,
                    "Length": length,
                    "Sequence": sequence,
                    "Chain": "full_sequence",
                    "Aligned Sequence": ""
                }
                df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
                print(f"Using full sequence for {gene_name}: {sequence}")

# Remove duplicate sequences after isolating the chain
df = df.drop_duplicates(subset=["Sequence"])

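# Filter out genes with fewer than two sequences.
# Note: after this filter the single-sequence branch in the FASTA loop below
# can never run; drop the filter if such genes should be kept.
df = df.groupby('Gene Names').filter(lambda x: len(x) > 1)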

# Save DataFrame to a CSV file
df.to_csv(os.path.join(output_dir, "uniprot_bear_proteins_truncated.csv"), index=False)

# Display the DataFrame to see the data
display(df)

# Create FASTA files for each gene, ensuring unique identifiers
job_ids = []
for gene_name in df["Gene Names"].unique():
    gene_df = df[df["Gene Names"] == gene_name]
    fasta_content = ""
    for idx, row in gene_df.iterrows():
        # Create a unique identifier using the index to differentiate potential duplicates
        unique_id = f"{row['Gene Names']}_{idx}"
        fasta_content += f">{unique_id}|{row['Organism']}|{row['Organism (ID)']}|{row['Chain']}\n{row['Sequence']}\n"

    fasta_file = os.path.join(output_dir, f"{gene_name}.fasta")
    with open(fasta_file, 'w') as file:
        file.write(fasta_content)

    if len(gene_df) == 1:
        # If there is only one sequence, rename the file to indicate it is aligned
        os.rename(fasta_file, os.path.join(output_dir, f"{gene_name}_aligned.fasta"))
        print(f"Only one sequence for {gene_name}, marked as aligned.")
        continue

    # Submit alignment job using Clustal Omega
    r = requests.post("https://www.ebi.ac.uk/Tools/services/rest/clustalo/run", data={
        "email": "example@example.com",  # Placeholder: replace with a valid contact address
        "iterations": 0,
        "outfmt": "fa",  # Using FASTA format
        "order": "aligned",
        "title": gene_name,  # Naming the job with the gene name
        "sequence": fasta_content
    })

    if r.status_code != 200:
        # Log errors during job submission
        with open(error_log_file, 'a') as log:
            log.write(f"Error submitting job for {gene_name}: {r.text}\n")
        print(f"Error submitting job for {gene_name}: {r.text}")
        continue

    job_id = r.text
    print(f"Job ID for {gene_name}: {job_id}")
    job_ids.append((gene_name, job_id))

# Save job IDs to a file
job_ids_file = os.path.join(output_dir, "job_ids.txt")
with open(job_ids_file, 'w') as f:
    for gene_name, job_id in job_ids:
        f.write(f"{gene_name}\t{job_id}\n")

print(f"Job IDs saved to {job_ids_file}")


## Assembling Results into a DataFrame and Saving to a CSV File
This script checks the status of each alignment job, retrieves the finished results, adds them to a DataFrame, and saves the DataFrame to a CSV file and the aligned sequences to a combined FASTA file.
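
The cell below checks each job's status once. If jobs may still be running, a polling loop is one option; the following is a minimal sketch under assumptions, not the notebook's method. `wait_for_job` and the `"TIMEOUT"` label are hypothetical, the set of waiting states is assumed (the notebook itself only tests for `"FINISHED"`), and the status source is passed in as a callable so the example runs without network access.

```python
import time

def wait_for_job(job_id, fetch_status, poll_seconds=5, max_polls=60):
    """Poll fetch_status(job_id) until the job leaves a waiting state.

    fetch_status is any callable returning the status text; in the notebook
    that would be a GET on the clustalo status URL. Returns the last status
    seen, or "TIMEOUT" (a label invented here) after max_polls attempts.
    """
    # Assumed waiting states; adjust to whatever the service actually reports
    waiting_states = {"QUEUED", "PENDING", "RUNNING"}
    for _ in range(max_polls):
        status = fetch_status(job_id).strip()
        if status not in waiting_states:
            return status
        time.sleep(poll_seconds)
    return "TIMEOUT"

# Stand-in for the real status endpoint, for illustration only
fake_statuses = iter(["RUNNING", "RUNNING", "FINISHED"])
final_status = wait_for_job("example-job-id", lambda _id: next(fake_statuses), poll_seconds=0)
print(final_status)
```

Capping the number of polls keeps a stuck job from blocking the notebook indefinitely.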

In [17]:
import os
import requests
import pandas as pd
from time import sleep
from google.colab import drive
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Mount Google Drive
drive.mount('/content/drive')

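# Set directories. job_ids.txt is read from here, so input_directory must be the
# folder the submission cell wrote it to (that cell's output_dir ends in Fasta_Bear_proteins/).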
input_directory = "/content/drive/MyDrive/7Papers/26_Ursus_abstrusus/Ancient_Bear/Ancient_Bear_Analysis/FASTAs/Computational_FASTAs"
job_ids_file = os.path.join(input_directory, "job_ids.txt")
output_file = os.path.join(input_directory, "uniprot_bear_proteins_aligned.csv")
fasta_output_file = os.path.join(input_directory, "combined_bear_proteins.fasta")
error_log_file = os.path.join(input_directory, "error_log.txt")

# Helper function to download data
def get_url(url, **kwargs):
    response = requests.get(url, **kwargs)
    if not response.ok:
        print(response.text)
        response.raise_for_status()
    return response

# Read job IDs from the file, handling potential formatting issues
job_ids = []
with open(job_ids_file, 'r') as f:
    for line in f:
        parts = line.strip().split('\t')
        if len(parts) == 2:  # Check if the line has the expected two parts
            gene_name, job_id = parts
            job_ids.append((gene_name, job_id))
        else:
            print(f"Skipping malformed line: {line.strip()}")  # Log or handle malformed lines

# Initialize DataFrame to store the results
df = pd.DataFrame(columns=["Gene Names", "Organism", "Organism (ID)", "Taxonomic lineage", "Length", "Sequence", "Chain", "Aligned Sequence"])

# Process each job ID
for gene_name, job_id in job_ids:
    status_url = f"https://www.ebi.ac.uk/Tools/services/rest/clustalo/status/{job_id}"
    result_url = f"https://www.ebi.ac.uk/Tools/services/rest/clustalo/result/{job_id}/fa"

    # Check job status
    status_response = get_url(status_url)
    status = status_response.text.strip()
    print(f"Status for {gene_name} ({job_id}): {status}")

    if status == "FINISHED":
        # Get the alignment result
        result_response = get_url(result_url)
        aligned_sequences = result_response.text

        # Save the alignment result to a file
        aligned_fasta_file = os.path.join(input_directory, f"{gene_name}_aligned.fasta")
        with open(aligned_fasta_file, 'w') as file:
            file.write(aligned_sequences)
        print(f"Alignment result saved to {aligned_fasta_file}")

        # Parse the aligned sequences and add to the DataFrame
        for record in SeqIO.parse(aligned_fasta_file, "fasta"):
            aligned_sequence = str(record.seq)
            parts = record.description.split('|')
            if len(parts) >= 4:
                organism = parts[1].strip()
                organism_id = parts[2].strip()
                chain = parts[3].strip()
            else:
                organism = "Unknown"
                organism_id = "Unknown"
                chain = "Unknown"
                with open(error_log_file, 'a') as log:
                    log.write(f"Incomplete description for {record.id} in gene {gene_name}: {record.description}\n")

            df.loc[len(df)] = {
                "Gene Names": gene_name,
                "Organism": organism,
                "Organism (ID)": organism_id,
                "Taxonomic lineage": "",  # This can be filled with the correct lineage if available
                "Length": len(aligned_sequence),
                "Sequence": "",  # Original sequence is not required here
                "Chain": chain,
                "Aligned Sequence": aligned_sequence
            }

    else:
        # Log any issues
        with open(error_log_file, 'a') as log:
            log.write(f"Job {job_id} for gene {gene_name} is not finished or has errors: Status {status}\n")
        print(f"Job {job_id} for gene {gene_name} is not finished or has errors: Status {status}")

    # Avoid hitting the server too hard
    sleep(5)

# Save DataFrame to CSV
df.to_csv(output_file, index=False)
print(f"Aligned sequences DataFrame saved to {output_file}")

# Combine all aligned sequences into a single FASTA file
all_sequences = []
for _, row in df.iterrows():
    record = SeqRecord(Seq(row['Aligned Sequence']),
                       id=f"{row['Gene Names']}_{row.name}",
                       description=f"GN={row['Gene Names']} | OS={row['Organism']} | OID={row['Organism (ID)']} | Chain={row['Chain']}")
    all_sequences.append(record)

SeqIO.write(all_sequences, fasta_output_file, "fasta")
print(f"Combined FASTA file created: {fasta_output_file}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Skipping malformed line: clustalo-R20240716-103746-0657-81195443-p1m
Status for ALB (clustalo-R20240716-103735-0785-39921992-p1m): FINISHED
Alignment result saved to /content/drive/MyDrive/7Papers/26_Ursus_abstrusus/Ancient_Bear/Ancient_Bear_Analysis/FASTAs/Computational_FASTAs/ALB_aligned.fasta
Status for OXA1L (clustalo-R20240716-103741-0890-23030550-p1m): FINISHED
Alignment result saved to /content/drive/MyDrive/7Papers/26_Ursus_abstrusus/Ancient_Bear/Ancient_Bear_Analysis/FASTAs/Computational_FASTAs/OXA1L_aligned.fasta
Status for GPR143 (clustalo-R20240716-103749-0683-6263135-p1m): FINISHED
Alignment result saved to /content/drive/MyDrive/7Papers/26_Ursus_abstrusus/Ancient_Bear/Ancient_Bear_Analysis/FASTAs/Computational_FASTAs/GPR143_aligned.fasta
Status for AEBP1 (clustalo-R20240716-103825-0303-17310005-p1m): FINISHED
Alignment result saved to /content/d