# Introduction to the CASP14 Data Download Notebook

## Overview

This IPython notebook automates downloading and organizing AlphaFold2 data from CASP14. It fetches the target sequences, the target structures, and the AlphaFold2 predictions.

## Contents of the Notebook

1. **Utility Functions:**
   - `get_tar_gz_links(url)`: Collects the links to all `.tar.gz` files (one archive per target) listed on the page at `url`.
   - `download_predictions(url, save_dir)`: Downloads each archive and saves the AlphaFold2 (group `TS427`) model 1 prediction as `<target>.pdb`.
   - `download_targets(url, save_dir)`: Downloads the target archive and extracts it into `save_dir`.
   - `download_sequences(url, save_dir)`: Downloads the sequence data and saves it as `sequences.fasta`.

2. **Main Function:**
   - `main(save_dir)`: Creates the `sequences`, `targets`, and `predictions` subdirectories under `save_dir` and calls the three download functions in turn.
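
The link collection done by `get_tar_gz_links` can be tried offline. The sketch below is not the notebook's implementation: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages, and the class `TarGzLinkParser`, the function `tar_gz_links_from_html`, the sample HTML, and the `example.org` base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TarGzLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose target ends with '.tar.gz'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".tar.gz"):
            self.hrefs.append(href)


def tar_gz_links_from_html(html, base_url):
    """Return absolute URLs for all .tar.gz links found in an HTML string."""
    parser = TarGzLinkParser()
    parser.feed(html)
    # urljoin resolves relative hrefs against the listing page's URL.
    return [urljoin(base_url, href) for href in parser.hrefs]


# Made-up directory listing, for illustration only.
listing_html = """
<a href="../">Parent Directory</a>
<a href="T0001.tar.gz">T0001.tar.gz</a>
<a href="README.txt">README.txt</a>
<a href="T0002.tar.gz">T0002.tar.gz</a>
"""
base_url = "https://example.org/predictions/regular/"
links = tar_gz_links_from_html(listing_html, base_url)
print(links)
```

Only the two `.tar.gz` entries survive the filter, and both are returned as absolute URLs.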

## How to Use

1. **Specify Save Directory:** Set the `save_dir` variable to the directory where the data should be saved.
2. **Run the Notebook:** Execute the cells in the notebook sequentially.
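
The main function in this notebook stores data in `sequences`, `targets`, and `predictions` subdirectories of `save_dir`. As a minimal sketch of that layout, the hypothetical helper `prepare_dirs` below (not part of the notebook) creates the three subdirectories with `os.makedirs(..., exist_ok=True)`, which makes repeated runs safe. It uses a temporary directory so that nothing is written to a real data path.

```python
import os
import tempfile


def prepare_dirs(save_dir):
    """Create the three data subdirectories and return their paths by name."""
    paths = {}
    for name in ("sequences", "targets", "predictions"):
        path = os.path.join(save_dir, name)
        # exist_ok=True: no error if the directory is already there.
        os.makedirs(path, exist_ok=True)
        paths[name] = path
    return paths


with tempfile.TemporaryDirectory() as root:
    demo_save_dir = os.path.join(root, "casp14_alphafold")
    paths = prepare_dirs(demo_save_dir)
    prepare_dirs(demo_save_dir)  # calling it again changes nothing
    created = sorted(os.listdir(demo_save_dir))

print(created)
```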

## Requirements

- A stable internet connection, so that downloads are not interrupted.
- The third-party packages `requests`, `beautifulsoup4` (imported as `bs4`), and `tqdm` must be installed. The other modules used (`os`, `shutil`, `tempfile`, `tarfile`, `urllib`) are part of the Python standard library.
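
One way to check the requirements before running the notebook is to ask Python whether each module can be found. The sketch below uses `importlib.util.find_spec`; the helper name `missing_modules` is hypothetical and not part of the notebook.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Import names of the third-party packages used by this notebook.
third_party = ["requests", "bs4", "tqdm"]
missing = missing_modules(third_party)

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All third-party modules are available.")
```

Note that the import name `bs4` corresponds to the `beautifulsoup4` package on PyPI.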


In [111]:
import os
import shutil
import tempfile
import tarfile
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from tqdm import tqdm

def get_tar_gz_links(url):
    """
    Get links for .tar.gz files from the given URL.

    Args:
    - url (str): The URL containing the .tar.gz files to be downloaded.

    Returns:
    - list: A list of URLs for the .tar.gz files.
    """
    response = requests.get(url)
    response.raise_for_status()  # stop on HTTP errors rather than parsing an error page
    soup = BeautifulSoup(response.content, 'html.parser')

    links = soup.find_all('a', href=lambda href: href and href.endswith('.tar.gz'))
    
    return [urljoin(url, link['href']) for link in links]

def download_predictions(url, save_dir):
    """
    Download the AlphaFold2 (group TS427) model 1 prediction for every target listed at the given URL.
    Targets that already have a .pdb file in save_dir are skipped.

    Args:
    - url (str): The URL of the page listing the .tar.gz prediction archives.
    - save_dir (str): The directory where each 'TS427_1' model is saved as <target>.pdb.
    """

    links = get_tar_gz_links(url)
    
    tmp_dir = tempfile.mkdtemp()
    
    for link in tqdm(links, desc="Downloading", unit="file"):
        target = os.path.basename(link)[:-7]
        if os.path.exists(os.path.join(save_dir, target + '.pdb')):
            continue
        file_name = os.path.join(tmp_dir, os.path.basename(link))
        response = requests.get(link)
        response.raise_for_status()  # fail early instead of saving an HTTP error page
        with open(file_name, 'wb') as f:
            f.write(response.content)

        # Save only the first archive member whose name contains 'TS427_1'
        # (model 1 from group TS427) as <target>.pdb.
        with tarfile.open(file_name, 'r:gz') as tar:
            for member in tar.getmembers():
                if 'TS427_1' in member.name:
                    file_content = tar.extractfile(member).read()
                    with open(os.path.join(save_dir, target + '.pdb'), 'wb') as f:
                        f.write(file_content)
                    break
        
    shutil.rmtree(tmp_dir)

def download_targets(url, save_dir):
    """
    Download a .tgz file from the given URL, extract its contents to a specified directory, and remove the .tgz file.

    Args:
    - url (str): The URL of the .tgz file to be downloaded.
    - save_dir (str): The directory where the extracted files will be saved.
    """
    tmp_dir = tempfile.mkdtemp()
    file_name = os.path.join(tmp_dir, os.path.basename(url))
    response = requests.get(url)
    response.raise_for_status()  # fail early instead of saving an HTTP error page
    with open(file_name, 'wb') as f:
        f.write(response.content)

    with tarfile.open(file_name, 'r:gz') as tar:
        tar.extractall(path=save_dir)
    
    print("Downloaded targets")
    shutil.rmtree(tmp_dir)

def download_sequences(url, save_dir):
    """
    Download sequences from the given URL and save them as a FASTA file in the specified directory.

    Args:
    - url (str): The URL of the sequences file to be downloaded.
    - save_dir (str): The directory where the sequences will be saved as a FASTA file.
    """
    file_name = os.path.join(save_dir, 'sequences.fasta')
    response = requests.get(url)
    response.raise_for_status()  # fail early instead of saving an HTTP error page
    with open(file_name, 'wb') as f:
        f.write(response.content)
    print("Downloaded sequences")
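
The member selection inside `download_predictions` can be exercised without network access. The sketch below builds a small `.tar.gz` in memory and reads the first member whose name contains a marker string, mirroring the `'TS427_1'` check. The helpers `make_demo_archive` and `read_first_match`, the member names, and the payloads are all made up for illustration and do not reflect the actual layout of the CASP14 archives.

```python
import io
import tarfile


def make_demo_archive():
    """Build a small in-memory .tar.gz with made-up member names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in [
            ("T0000/T0000TS001_1", b"model from another group\n"),
            ("T0000/T0000TS427_1", b"model 1 from group 427\n"),
            ("T0000/T0000TS427_2", b"model 2 from group 427\n"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # size must match the payload length
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_first_match(fileobj, marker):
    """Return the bytes of the first regular file whose name contains marker, or None."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if member.isfile() and marker in member.name:
                return tar.extractfile(member).read()
    return None


content = read_first_match(make_demo_archive(), "TS427_1")
print(content)
```

Returning `None` when nothing matches lets the caller decide how to handle a target without a matching model, instead of silently writing no file.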


In [112]:
def main(save_dir):
    """
    Download the CASP14 sequences, target structures, and AlphaFold2 predictions.

    Args:
    - save_dir (str): The root directory; data is saved to its 'sequences', 'targets', and 'predictions' subdirectories.
    """
    save_dir_sequences = os.path.join(save_dir, "sequences")
    if not os.path.exists(save_dir_sequences):
        os.makedirs(save_dir_sequences)
        
    url_sequences = "https://predictioncenter.org/download_area/CASP14/sequences/casp14.seq.txt"
    download_sequences(url_sequences, save_dir_sequences)
    
    save_dir_targets = os.path.join(save_dir, "targets")
    if not os.path.exists(save_dir_targets):
        os.makedirs(save_dir_targets)
        
    url_targets = "https://predictioncenter.org/download_area/CASP14/targets/_4invitees/casp14.targ.whole.4invitees.tgz"
    download_targets(url_targets, save_dir_targets)

    save_dir_predictions = os.path.join(save_dir, "predictions")
    if not os.path.exists(save_dir_predictions):
        os.makedirs(save_dir_predictions)
        
    url_predictions = "https://predictioncenter.org/download_area/CASP14/predictions/regular/"
    download_predictions(url_predictions, save_dir_predictions)

In [113]:
save_dir = "../data/casp14_alphafold"
main(save_dir)

Downloaded sequences
Downloaded targets


Downloading: 100%|██████████| 84/84 [00:50<00:00,  1.66file/s]
