<a href="https://colab.research.google.com/github/ThomasCMcLean/Lazy_AF/blob/main/Lazy_AF_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Lazy_AF Workflow Part 1**

This workflow is currently designed to feed into [AlphaFold2_BATCH](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.5.2/batch/AlphaFold2_batch.ipynb) for high-throughput protein structure and complex prediction.

**Usage**

Place the .txt file containing your target FASTA sequences in a folder in your Google Drive; this folder is your `input_dir`. Results are written to `result_dir`.

Specify the `input_file` name, including the .txt extension. Also include your bait sequence (`protein_sequence`) and bait name (`protein_name`).

**N.B. Each FASTA header must contain a [gene=] and/or a [protein_id=] tag; otherwise that sequence will not be included in the output.**
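
As a rough illustration of this rule, the sketch below shows how an identifier could be pulled from a header line. The helper name `extract_identifier` and the example headers are hypothetical and not part of the workflow. The patterns mirror those used in the code cell further down, where `[gene=]` takes priority over `[protein_id=]`.

```python
import re

def extract_identifier(header):
    """Return the [gene=...] value, else the [protein_id=...] value, else None."""
    for tag in ("gene", "protein_id"):
        match = re.search(rf"\[{tag}=(.*?)\]", header)
        if match:
            return match.group(1)
    return None

# Hypothetical headers, for illustration only
headers = [
    ">lcl|X_prot_1 [gene=dnaA] [protein_id=ABC123.1]",
    ">lcl|X_prot_2 [protein_id=ABC124.1]",
    ">lcl|X_prot_3 [locus_tag=XYZ_0003]",
]
for header in headers:
    print(extract_identifier(header))
```

A header for which this returns `None` carries neither tag, so that sequence would be left out of the output.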

Once you have done the modelling you can rank the results using **Lazy_AF Part 2** : https://colab.research.google.com/drive/1j7WJLcUHTR8BrjkWDaU549rFk6X5Zu18#scrollTo=GMqFOU0SeAEO

For details, refer to our manuscript: *in prep*

For more details, check out the [Lazy_AF GitHub](https://github.com/ThomasCMcLean/Lazy_AF).

In [None]:
#@title Mount google drive
from google.colab import drive
drive.mount('/content/drive')
from sys import version_info
python_version = f"{version_info.major}.{version_info.minor}"

In [None]:
#@title Input directories and file name from Google Drive. Then runtime -> run all
from google.colab import drive
drive.mount('/content/drive')

import os
import re

# Directory and input file locations
input_dir = '/content/drive/MyDrive/input' #@param {type:"string"}
result_dir = '/content/drive/MyDrive/results' #@param {type:"string"}
input_file = "genome.txt" #@param {type:"string"}

# Read the input .txt file
with open(os.path.join(input_dir, input_file), 'r') as file:
    txt_data = file.readlines()

# Input bait protein sequence and bait protein name
protein_sequence = "MSEPVLAVSGVNKSFPIYRSPWQALWHALNPKADVKVFQALRDIELTVYRGETIGIVGHN" #@param {type:"string"}
protein_name = "AbcA" #@param {type:"string"}

# Initialize variables to store gene data
gene_data = {}
current_gene_name = None
divide = ':'

# Create the result directory if it doesn't exist
os.makedirs(result_dir, exist_ok=True)

# Loop through the lines of the input file
for line in txt_data:
    # Prefer the [gene=...] tag; fall back to [protein_id=...]
    id_match = (re.search(r'\[gene=(.*?)\]', line)
                or re.search(r'\[protein_id=(.*?)\]', line))
    if id_match:
        # Start a new record keyed by the identifier
        current_gene_name = id_match.group(1)
        gene_data[current_gene_name] = [line]
    elif line.startswith('>'):
        # Header without either tag: skip this record so its sequence
        # is not appended to the previous gene
        current_gene_name = None
    elif current_gene_name is not None:
        # Sequence line: append it to the current record
        gene_data[current_gene_name].append(line)

# Write one .fasta file per gene: the target record, then ':' and the bait sequence
# (the bait name is used only in the output file name)
for gene_name, gene_content in gene_data.items():
    fasta_content = gene_content + [divide, protein_sequence, '\n']
    output_file = os.path.join(result_dir, f"{gene_name}_{protein_name}.fasta")
    with open(output_file, 'w') as file:
        file.writelines(fasta_content)