<a href="https://colab.research.google.com/github/PDBeurope/afdb-notebooks/blob/main/AFDB_FTP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src = "https://www.embl.org/about/info/communications/wp-content/uploads/2017/09/Ebi_official_logo.png"
 height="100" align="right">

#Downloading AlphaFold Protein Structures via FTP with Google Colab

FTP, or File Transfer Protocol, is a standard network protocol facilitating the exchange of files between computers.

The European Bioinformatics Institute (EMBL-EBI) makes AlphaFold structures available for download through its FTP server. This includes protein fragments, which are accessible only through FTP.


This guide helps you download protein structures from the AlphaFold Database FTP area using Google Colab.
<br>
<br>

##Available Downloads

Model organism proteomes: As of September 2023, the FTP area hosts compressed files (TAR archives) containing structures for 48 organisms including model organisms and pathogens.

Swiss-Prot: This database contains predicted structures for over 542,000 proteins. You can download these structures in two formats: PDB and CIF.

##Understanding Folder Structure

Proteome archive names combine three fields, joined by underscores and followed by the database version (e.g., `UP000000429_85962_HELPY_v4.tar`):

* **Reference Proteome (UPID):** The UniProt identifier of the reference proteome used (e.g., UP000000429).
* **Taxonomy ID:** A unique identifier for the organism (e.g., 85962).
* **Organism:** A short name derived from the genus and species names (e.g., HELPY for Helicobacter pylori).
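
The naming scheme above can be split back into its fields with a short helper. This is a minimal sketch: the regular expression is an assumption based on names such as `UP000005640_9606_HUMAN_v4.tar`, and `ProteomeArchive` and `parse_archive_name` are hypothetical names introduced only for illustration.

```python
import re
from typing import NamedTuple


class ProteomeArchive(NamedTuple):
    """Fields encoded in a proteome archive name (hypothetical helper type)."""
    upid: str
    taxonomy_id: int
    organism: str
    version: int


# Assumed pattern: <UPID>_<taxonomy ID>_<organism>_v<version>.tar
_ARCHIVE_RE = re.compile(r"^(UP\d+)_(\d+)_([A-Z0-9]+)_v(\d+)\.tar$")


def parse_archive_name(name: str) -> ProteomeArchive:
    """Split an archive name into UPID, taxonomy ID, organism and version."""
    match = _ARCHIVE_RE.match(name)
    if match is None:
        raise ValueError(f"Unexpected archive name: {name!r}")
    upid, taxonomy_id, organism, version = match.groups()
    return ProteomeArchive(upid, int(taxonomy_id), organism, int(version))


print(parse_archive_name("UP000005640_9606_HUMAN_v4.tar"))
```

Names that do not follow the assumed pattern raise a `ValueError` rather than being silently misparsed.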

<br>

The Swiss-Prot predictions (542,378 entries) are provided on the FTP server as two TAR archives, one per file format:

|File type|File name|Size|
|---------|--------------|---------------------|
|Swiss-Prot (CIF Files)|swissprot_cif_v4.tar| 37,643 MB|
|Swiss-Prot (PDB files)|swissprot_pdb_v4.tar|26,935 MB|
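
Because these archives are tens of gigabytes, it can help to check the free disk space before starting a download. The sketch below uses the sizes from the table and assumes that "MB" means 10^6 bytes; `has_room_for` and the 20% safety margin are illustrative choices, not part of the AlphaFold DB tooling.

```python
import shutil

# Sizes taken from the table above, in MB (assumed to mean 10**6 bytes)
SWISSPROT_SIZES_MB = {
    "swissprot_cif_v4.tar": 37_643,
    "swissprot_pdb_v4.tar": 26_935,
}


def has_room_for(size_mb: float, path: str = ".", safety_factor: float = 1.2) -> bool:
    """Return True if `path` has at least size_mb * safety_factor MB free."""
    free_mb = shutil.disk_usage(path).free / 1_000_000
    return free_mb >= size_mb * safety_factor


for name, size_mb in SWISSPROT_SIZES_MB.items():
    status = "fits" if has_room_for(size_mb) else "does not fit"
    print(f"{name}: {size_mb / 1000:.1f} GB, {status} in the free space of the current directory")
```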

<br>

For detailed information on the available data and folder structure, see the AlphaFold DB resources listed below.

##Further Information

* Downloads tab: https://alphafold.ebi.ac.uk/download
* CHANGELOG: https://ftp.ebi.ac.uk/pub/databases/alphafold


#Running Code in Colab

Before running the code in this section, make sure to select the desired options using the provided dropdown menus.

To execute a code block, simply click the "Run" button (usually a play triangle symbol) next to the code cell.

In [1]:
#@title ##Run to see what's in the FTP area
#@markdown Run this block to see what's in the FTP area
import ftplib
import io
import json

ftp_server = ftplib.FTP("ftp.ebi.ac.uk")

# Login as an anonymous user
ftp_server.login("anonymous", "anonymous@")

# Navigate to the directory
ftp_server.cwd("/pub/databases/alphafold/")

# Retrieve and print the names of the entries in the directory
file_names = ftp_server.nlst()
for file_name in file_names:
    print(file_name)

CHANGELOG.txt
README.txt
accession_ids.csv
download_metadata.json
latest
sequences.fasta
v1
v2
v3
v4
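
The listing mixes metadata files with versioned directories (`v1` to `v4`). As a minimal sketch, the version directories can be picked out of such a listing and sorted numerically, so that a future `v10` would sort after `v9`. The helper name `version_directories` is hypothetical, and the listing is copied from the output above rather than fetched again.

```python
import re


def version_directories(names):
    """Return the entries that look like 'v<number>', sorted by version number."""
    versions = [name for name in names if re.fullmatch(r"v\d+", name)]
    return sorted(versions, key=lambda name: int(name[1:]))


# Listing copied from the output above
listing = [
    "CHANGELOG.txt", "README.txt", "accession_ids.csv", "download_metadata.json",
    "latest", "sequences.fasta", "v1", "v2", "v3", "v4",
]

versions = version_directories(listing)
print(versions)      # ['v1', 'v2', 'v3', 'v4']
print(versions[-1])  # v4
```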


In [None]:
#@title #See all the available proteomes in one version
#@markdown This block will retrieve a list of the files inside the version archive you define

#@markdown Before running the code in this cell, please make sure you've selected your desired options from the dropdown menu provided.
folder_navigate = "v4" #@param["v1", "v2", "v3", "v4"]
#@markdown `folder_navigate` is the version of the AFDB that you want to use to download

# Navigate to the selected version directory.
# An absolute path keeps this cell safe to re-run.
ftp_server.cwd(f"/pub/databases/alphafold/{folder_navigate}")

# Retrieve and print the list of file names in the directory
file_names = ftp_server.nlst()
for file_name in file_names:
    print(file_name)

In [None]:
#@title #Get the metadata
#@markdown Run this block to retrieve the metadata and print it, if available.
ftp_host = "ftp.ebi.ac.uk"  # Separate name so the `ftp_server` connection from the first cell is not overwritten

try:
    with ftplib.FTP(ftp_host) as ftp:
        print("Accessing metadata...")
        ftp.login(user="anonymous", passwd="anonymous")
        ftp.cwd("/pub/databases/alphafold")  # Navigate to the directory containing the metadata file

        with io.BytesIO() as bio:
            ftp.retrbinary('RETR download_metadata.json', bio.write)
            bio.seek(0)  # Go to the start of the BytesIO buffer
            metadata = json.load(bio)

    # The metadata is expected to be a list of dictionaries
    if metadata:
        headers = list(metadata[0].keys())
        print("\t".join(headers))

        # Print the values for each record; missing keys are printed as empty fields
        for record in metadata:
            values = [str(record.get(key, "")) for key in headers]
            print("\t".join(values))
    else:
        print("No metadata found.")

except ftplib.all_errors as e:
    # Connection, login and transfer problems
    print(f"Error accessing FTP: {e}")
except json.JSONDecodeError as e:
    print(f"Could not parse the metadata file: {e}")


In [None]:
#@title #Extract all fragments for a UniProt accession. <br>
#@markdown This block will download all the fragments for a specific UniProt accession <br>
#@markdown Before running the block in this cell, please make sure you've selected your desired options from the dropdown menus provided.
#@markdown This cell also provides variables you can modify directly to adjust the download behavior. <br><br>
#@markdown <strong>NOTE:</strong>  You will get a pop-up window to download files to your local computer
from ftplib import FTP
import tarfile
import tempfile
import os
import zipfile
from google.colab import files
import ftplib
import io
import json


def extract_files_and_zip_from_ftp(folder_navigate, tar_file, file_fragment_name, file_type):
    username = 'anonymous'
    password = 'anonymous'
    ftp_server = "ftp.ebi.ac.uk"
    base_path = f'pub/databases/alphafold/{folder_navigate}/'
    file_path = base_path + tar_file
    suffix = f'.{file_type.strip().lower()}.gz'  # Members are gzip-compressed, e.g. ".cif.gz"
    extracted_files = []  # To keep track of extracted file names

    # Use a temporary file to store the tar file
    tmp_file = tempfile.NamedTemporaryFile(delete=False)
    tar_file_path = tmp_file.name
    try:
        # Connect to the FTP server and download the archive
        try:
            with tmp_file, FTP(ftp_server) as ftp:
                ftp.login(username, password)
                ftp.retrbinary(f'RETR {file_path}', tmp_file.write)
        except ftplib.all_errors as e:
            print(f"Error downloading the file: {e}")
            return

        # Open the temporary tar file for reading
        try:
            with tarfile.open(tar_file_path, mode="r:*") as tar:
                # Scan the whole archive for members whose name contains the
                # accession and ends with the requested, gzip-compressed file type
                for member in tar:
                    if file_fragment_name in member.name and member.name.endswith(suffix):
                        extracted_file = tar.extractfile(member)
                        if extracted_file:
                            output_filename = member.name.replace('/', '_')
                            with open(output_filename, 'wb') as f_out:
                                f_out.write(extracted_file.read())
                            print(f"Extracted and saved {member.name} as {output_filename}.")
                            extracted_files.append(output_filename)
        except tarfile.TarError as e:
            print(f"Error reading the tar file: {e}")
    finally:
        # Clean up the temporary file, whether or not the download succeeded
        os.remove(tar_file_path)

    if not extracted_files:
        print(f"No {suffix} files found for {file_fragment_name} in {tar_file}.")
        return

    # Zip the extracted files
    zip_filename = "extracted_fragments_files.zip"
    with zipfile.ZipFile(zip_filename, 'w') as zipf:
        for file in extracted_files:
            zipf.write(file)
            os.remove(file)  # Remove the file after adding it to the zip to save space
    print(f"Created zip archive: {zip_filename}")

    # Download the zip file
    files.download(zip_filename)


#Input parameters
database_version = 'v4' #@param["v1", "v2", "v3", "v4"]
tar_file = 'UP000005640_9606_HUMAN_v4.tar' #@param {type:"string"}
#@markdown Make sure this file matches the database version you selected, i.e. the tar file name should end in `v4` if you are searching in version 4 of the database.
UniProt_accession = 'Q5T4S7'  #@param {type:"string"}
file_type = "cif" #@param["pdb", "cif"]
#@markdown Choose `cif` or `pdb`; the value must not contain any whitespace.


extract_files_and_zip_from_ftp(database_version, tar_file, UniProt_accession, file_type)
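
The files collected by the cell above are still gzip-compressed (`.cif.gz` or `.pdb.gz`). After unpacking the zip archive, they can be decompressed with Python's standard `gzip` module. This is a minimal sketch: `gunzip_file` is a hypothetical helper, and the demo file is a small stand-in created locally, not a real AlphaFold model.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, keep_original=False):
    """Decompress `name.ext.gz` to `name.ext` and return the new path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"Not a .gz file: {gz_path}")
    out_path = gz_path.with_suffix("")  # Drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # Stream the data instead of loading it all at once
    if not keep_original:
        gz_path.unlink()
    return out_path


# Demo with a tiny stand-in file (not a real structure)
demo = Path("example_model.cif.gz")
with gzip.open(demo, "wb") as handle:
    handle.write(b"data_example\n")

print(gunzip_file(demo))
```

To decompress every extracted file in a folder, the helper could be called in a loop such as `for path in Path(".").glob("*.gz"): gunzip_file(path)`.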
