# AlphaFold: extract average prediction confidence

## Overview
The `main` function downloads, unpacks, and processes predicted AlphaFold structures from the EBI database (https://ftp.ebi.ac.uk/pub/databases/alphafold/). Its goal is to compute an average confidence score for each predicted protein structure in the dataset.

## Functionality
- **Download and Unpack**: If the tar archive named by `url` is not already in the download directory, the function downloads it; if the unpacked folder is missing, it unpacks the archive.
- **Data Processing**: For each `.pdb.gz` structure file, the function averages the per-residue confidence, read from the B-factor field of each residue's Cα atom (where AlphaFold stores its pLDDT score).
- **Output Generation**: It then writes a CSV file containing these averages along with identifying information parsed from each file name. Files that cannot be processed are reported and skipped.
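
As a rough illustration of the data-processing step, the sketch below averages the B-factor column over Cα atoms of PDB-format text using only the standard library. It assumes the fixed-column PDB `ATOM` layout and AlphaFold's convention of writing per-residue confidence into the B-factor field. `atom_line` and `mean_ca_bfactor` are hypothetical helper names, and the two-residue input is made up for the example; it is not the Biopython-based code used by this notebook.

```python
def atom_line(serial, atom_name, res_name, res_seq, b_factor):
    """Build a minimal fixed-column PDB ATOM record (coordinates set to zero)."""
    return (
        f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} A{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b_factor:6.2f}"
    )


def mean_ca_bfactor(pdb_lines):
    """Return the mean B-factor over CA atoms, or None if there are none."""
    scores = [
        float(line[60:66])  # columns 61-66 hold the B-factor
        for line in pdb_lines
        # columns 13-16 hold the atom name
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    return sum(scores) / len(scores) if scores else None


# Made-up input: two residues, with scores 90.0 and 50.0.
lines = [
    atom_line(1, " N  ", "MET", 1, 90.0),
    atom_line(2, " CA ", "MET", 1, 90.0),
    atom_line(3, " CA ", "ALA", 2, 50.0),
]
print(mean_ca_bfactor(lines))  # 70.0
```

Only Cα atoms are counted, so each residue contributes exactly one value to the mean regardless of how many atoms it has.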

## Parameters
1. `url`: The URL of a tar file containing the predicted AlphaFold structures.
2. `output_file` (optional): The path to the output CSV file. Default is `"../data/alphafold/output.csv"`.
3. `download_path` (optional): The base path where the data will be downloaded and unpacked. Default is `"../data/alphafold/download"`.
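
The sketch below shows how the archive path and the unpack folder follow from `url` and `download_path`, mirroring the string handling in `main`. The helper name `derive_paths` is hypothetical and exists only for this illustration.

```python
import os


def derive_paths(url, download_path):
    """Return (tar_file, data_folder) for an archive URL and a download folder."""
    filename = os.path.basename(url)  # last URL segment, e.g. "X.tar"
    tar_file = f"{download_path}/{filename}"  # where the archive is saved
    # The unpack folder is the archive name without its ".tar" extension.
    data_folder = f"{download_path}/{os.path.splitext(filename)[0]}"
    return tar_file, data_folder


url = "https://ftp.ebi.ac.uk/pub/databases/alphafold/v4/UP000005640_9606_HUMAN_v4.tar"
tar_file, data_folder = derive_paths(url, "./data")
print(tar_file)     # ./data/UP000005640_9606_HUMAN_v4.tar
print(data_folder)  # ./data/UP000005640_9606_HUMAN_v4
```

Because both paths depend only on the file name in the URL, rerunning with the same `download_path` reuses an existing archive and unpacked folder.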

## Output
The function outputs a CSV file with the following columns:
- `protein_id`: The identifier of the protein.
- `f_value`: The fragment label taken from the file name (for example `F1`).
- `model`: The model version used in the prediction (for example `model_v4`).
- `avg_confidence`: The average confidence score for the predicted structure, computed as the mean over its residues.

### Example Output
| protein_id | f_value | model    | avg_confidence |
|------------|---------|----------|----------------|
| A0A024R1R8 | F1      | model_v4 | 72.2203125     |
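
The first three columns come from each structure's file name. The sketch below assumes names of the form `AF-<protein_id>-<f_value>-<model>.pdb.gz`, which is consistent with the example row above; `parse_structure_filename` is a hypothetical helper name used only for illustration.

```python
def parse_structure_filename(filename):
    """Split a structure file name into [protein_id, f_value, model]."""
    stem = filename.split(".")[0]  # drop the ".pdb.gz" extension
    return stem.split("-")[1:]     # drop the leading "AF" prefix


print(parse_structure_filename("AF-A0A024R1R8-F1-model_v4.pdb.gz"))
# ['A0A024R1R8', 'F1', 'model_v4']
```

A file name that does not yield exactly three fields would not line up with the CSV columns, so names outside this pattern are worth checking before processing.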

## Usage Notes
- Ensure the URL provided points directly to a tar file containing the AlphaFold data.
- Internet access is required only when the archive has not already been downloaded.
- Adequate disk space should be available for downloading and unpacking the data.
- The function depends on the third-party libraries `requests`, `pandas`, `tqdm`, and Biopython (`Bio.PDB`), in addition to standard-library modules such as `os`, `tarfile`, and `gzip`.
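
A pre-flight check for the disk-space note could look like the sketch below. It uses `shutil.disk_usage` from the standard library. The helper name `has_free_space` and the 50 GiB threshold are assumptions made for illustration, not measured requirements of any particular proteome archive.

```python
import os
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    # Walk up to the nearest existing directory, since the download folder may not exist yet.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    return shutil.disk_usage(probe).free >= required_bytes


REQUIRED_BYTES = 50 * 1024**3  # hypothetical threshold (50 GiB); adjust to the archive in use
if not has_free_space("./data", REQUIRED_BYTES):
    print("Warning: free disk space may be insufficient for download and unpacking.")
```

Both the archive and its unpacked contents stay on disk after a run, so the threshold should cover the two together.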

## Example Usage
```python
main(url="https://ftp.ebi.ac.uk/pub/databases/alphafold/v4/UP000005640_9606_HUMAN_v4.tar",
     output_file="./output.csv",
     download_path="./data")
```

```python
import os
import requests
import tarfile
from tqdm import tqdm
import gzip
import pandas as pd
from Bio.PDB import PDBParser

def download_file(url, save_path):
    """
    Downloads a file from the given URL to the specified path.

    Args:
    - url: URL of the file to download.
    - save_path: Path where the file should be saved.
    """
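    # dirname is '' for a bare file name, so fall back to the current directory
    os.makedirs(os.path.dirname(save_path) or ".", exist_ok=True)

    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # fail early instead of saving an HTTP error page
        file_size = int(r.headers.get('content-length', 0))

        with open(save_path, 'wb') as f, tqdm(total=file_size or None, unit='B', unit_scale=True,
                                              desc=os.path.basename(save_path)) as bar:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
                bar.update(len(chunk))  # advance by the bytes actually written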

def unpack_tar(tar_file, extract_path):
    """
    Extracts a .tar file to the specified path.

    Args:
    - tar_file: Path to the .tar file.
    - extract_path: Path where the contents of the .tar file should be extracted.
    """
    os.makedirs(extract_path, exist_ok=True)

    with tarfile.open(tar_file, 'r') as tar:
        members = tar.getmembers()
        for member in tqdm(members, desc=os.path.basename(tar_file), unit='file'):
            tar.extract(member, extract_path)

def get_avg_confidence(pdb_file):
    """
    Extracts the average confidence of all residues in a PDB file.

    Args:
    - pdb_file: Path to the gzipped PDB file.

    Returns:
    - avg_confidence_score: Average confidence score of all residues.
    """
    try:
        with gzip.open(pdb_file, 'rt') as f:
            parser = PDBParser()
            structure = parser.get_structure(os.path.splitext(pdb_file)[0], f)
            confidence_scores = []
    
            for chain in structure[0]:
                for residue in chain:
                    # AlphaFold stores the per-residue confidence (pLDDT) in the B-factor field
                    confidence_scores.append(residue["CA"].get_bfactor())
    
        if confidence_scores:
            return sum(confidence_scores) / len(confidence_scores)
    except Exception as e:
        print(f"unable to process {pdb_file}. Got error: {e}")

    return None

def main(url, output_file="../data/alphafold/output.csv", download_path="../data/alphafold/download"):
    """
    Computes the average AlphaFold confidence per structure, downloading and unpacking the data if necessary.

    Args:
    - url: AlphaFold predicted proteome URL location.
    - output_file: Path for the output file.
    - download_path: Path to the download folder.
    """
    filename = os.path.basename(url)
    tar_file = f"{download_path}/{filename}"
    data_folder = f"{download_path}/{os.path.splitext(filename)[0]}"

    if not os.path.exists(tar_file):
        print("Downloading data...")
        download_file(url, tar_file)
        print("Download complete.")

    if not os.path.exists(data_folder):
        print("Unpacking data...")
        unpack_tar(tar_file, data_folder)
        print("Unpacking complete.")

    pdb_files = [f for f in os.listdir(data_folder) if f.endswith(".pdb.gz")]
    data = []
    for pdb_file in tqdm(pdb_files, desc="Processing files", unit="file"):
        avg_confidence = get_avg_confidence(os.path.join(data_folder, pdb_file))
        if avg_confidence is not None:
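            # e.g. "AF-A0A024R1R8-F1-model_v4.pdb.gz" -> ["A0A024R1R8", "F1", "model_v4"]
            no_extension = pdb_file.split('.')[0]
            splits = no_extension.split("-")[1:]
            data.append(splits + [avg_confidence])
    columns = ['protein_id', 'f_value', 'model', 'avg_confidence']
    # Create the output directory if needed; dirname is '' for a bare file name
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
    pd.DataFrame(data, columns=columns).to_csv(output_file, index=False)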

url = "https://ftp.ebi.ac.uk/pub/databases/alphafold/v4/UP000005640_9606_HUMAN_v4.tar"
main(url, output_file="../data/alphafold/output.csv", download_path="../data/alphafold/download")
```

Output:

```
Processing files: 100%|██████████| 23391/23391 [42:47<00:00,  9.11file/s]
```
