# Sequence coding

This notebook generates a dataframe that maps the original identifier of each sequence to a new code in the 'gene_X_database' format.

Replace every instance of 'bvbrc' in the code with the name of the database from which the sequences originated.

The original sequence headers will then be replaced by the new codes in another script.
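
The replacement script itself is not part of this notebook. As a rough illustration of that later step, the sketch below rewrites FASTA header lines from a header-to-code mapping. The function name `replace_headers`, the mapping and the sample records are hypothetical. It assumes the lookup key is the first whitespace-delimited token of each header line, which is what Biopython stores in `record.id`.

```python
def replace_headers(fasta_lines, code_by_header):
    """Return FASTA lines with each known header replaced by its new code."""
    renamed = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            original_id = fields[0] if fields else ""
            if original_id in code_by_header:
                renamed.append(">" + code_by_header[original_id])
            else:
                # Headers without a code are kept exactly as they were.
                renamed.append(line)
        else:
            renamed.append(line)
    return renamed


# Hypothetical mapping and records, for illustration only.
mapping = {"seq_a": "nanA_1_bv", "seq_b": "nanA_2_bv"}
records = [">seq_a some description", "ATGC", ">seq_b", "GGCC", ">seq_c", "TTAA"]
renamed_records = replace_headers(records, mapping)
```

Headers missing from the mapping are passed through untouched here; a real script might prefer to raise an error instead.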

In [1]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.85


In [2]:
from google.colab import files
import os
import pandas as pd
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

In [3]:
# In the local repository, compress the folder of the database of interest, which holds its FASTA sequences:
# tar czvf new_folder_name.tar.gz folder_name

In [4]:
# Upload the compressed folder with the fasta sequences

uploaded = files.upload()

Saving bvbrc.tar.gz to bvbrc.tar.gz


In [5]:
# Extract the archive

!tar -xvzf bvbrc.tar.gz

bvbrc/
bvbrc/siaM_bv.fasta
bvbrc/nanB_bv.fasta
bvbrc/kpsC_bvbrc.fasta
bvbrc/nanT_bv.fasta
bvbrc/nane_bv.fasta
bvbrc/kpsM_bvbrc.fasta
bvbrc/siaP_bv.fasta
bvbrc/nanM_bv.fasta
bvbrc/nanR_bv.fasta
bvbrc/nanU_bv.fasta
bvbrc/neuE_bv.fasta
bvbrc/cpsK_bvbrc.fasta
bvbrc/neuD_bv.fasta
bvbrc/kpsD_bv.fasta
bvbrc/SIAE_bv.fasta
bvbrc/neuA_bv.fasta
bvbrc/neuB_bv.fasta
bvbrc/nagA_bv.fasta
bvbrc/neuC_bv.fasta
bvbrc/satA_bv.fasta
bvbrc/lst_bv.fasta
bvbrc/nanH_bv.fasta
bvbrc/kpsT_bvbrc.fasta
bvbrc/nanC_bv.fasta
bvbrc/neuS_bv.fasta
bvbrc/nagZ_bv.fasta
bvbrc/ompF_bv.fasta
bvbrc/nanA_bv.fasta
bvbrc/nagB_bv.fasta
bvbrc/satC_bv.fasta
bvbrc/siaT_bv.fasta
bvbrc/lic3A_bv.fasta
bvbrc/satB_bv.fasta
bvbrc/nanE_bv.fasta
bvbrc/kpsE_bv.fasta
bvbrc/ompC_bv.fasta
bvbrc/kpsF_bv.fasta
bvbrc/neuO_bv.fasta
bvbrc/kpsS_bv.fasta
bvbrc/satD_bv.fasta
bvbrc/siaQ_bv.fasta
bvbrc/nanK_bv.fasta
bvbrc/nanQ_bv.fasta
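
The listing contains both `nane_bv.fasta` and `nanE_bv.fasta`, two names that differ only in letter case. They extract as separate files on a case-sensitive filesystem, but they may collide on a case-insensitive one, and the codes derived from them will also differ only in case. A minimal sketch for flagging such pairs before coding, using a hypothetical helper named `find_case_collisions`:

```python
from collections import defaultdict


def find_case_collisions(filenames):
    """Group file names that are identical once letter case is ignored."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name.lower()].append(name)
    # Keep only the groups with more than one spelling.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# A few names taken from the listing above.
listing = ["nanE_bv.fasta", "nane_bv.fasta", "nanA_bv.fasta", "SIAE_bv.fasta"]
collisions = find_case_collisions(listing)
```

In the notebook, the same check could be run on `os.listdir('bvbrc')`.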


1. Generating the dataframe with new codes

In [6]:
def sequence_coding(fasta_file, input_filename):
    """
    Process a FASTA file to generate a DataFrame containing headers and internal gene codes.

    Args:
        fasta_file (str): Path to the input FASTA file.
        input_filename (str): Name of the input file (used for generating codes).

    Returns:
        None: Saves the resulting DataFrame to a CSV file.
    """

    # List to store the headers
    headers = []

    # Read the FASTA file and extract the headers
    for record in SeqIO.parse(fasta_file, "fasta"):
        headers.append(record.id)

    # Create a DataFrame with the headers
    df = pd.DataFrame(headers, columns=['Header'])

    # Add a column with gene codes in the format gene_X_database,
    # where gene and database are taken from the file name (e.g. nanA_bv.fasta)
    gene = input_filename.split('_')[0]
    database = input_filename.split('_')[1].replace('.fasta', '')
    df['Gene_Code'] = [f"{gene}_{i + 1}_{database}" for i in range(len(headers))]

    # Save the DataFrame to a CSV file with the input filename in the output name
    output_filename = f"{input_filename.split('.')[0]}_codes_df.csv"
    df.to_csv(output_filename, index=False)
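
To make the naming rule concrete, the sketch below isolates the code-building step from the function above, without the FASTA parsing or CSV output. The helper name `make_gene_codes` is hypothetical. Note that the database tag comes from the file name rather than the directory, so a file ending in `_bv.fasta` yields codes ending in `_bv`, while one ending in `_bvbrc.fasta` yields codes ending in `_bvbrc`.

```python
def make_gene_codes(input_filename, n_records):
    """Build codes gene_1_database ... gene_N_database from a file name."""
    parts = input_filename.split("_")
    gene = parts[0]
    database = parts[1].replace(".fasta", "")
    # Numbering starts at 1 and follows the record order in the file.
    return [f"{gene}_{i}_{database}" for i in range(1, n_records + 1)]


codes_bv = make_gene_codes("siaM_bv.fasta", 3)
codes_bvbrc = make_gene_codes("kpsC_bvbrc.fasta", 1)
```

A file name without an underscore would raise an `IndexError` here, as it would in `sequence_coding`.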

In [7]:
# Applying the function to all fasta files in the directory of interest

# Directory where the files are located
directory = 'bvbrc'  # Adapt according to the directory name

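# List all entries in the directory
# (named dir_entries so the google.colab `files` module imported above is not overwritten)
dir_entries = os.listdir(directory)

# Filter the list to include only FASTA files
fasta_files = [f for f in dir_entries if f.endswith('.fasta')]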

# Iterate over the FASTA files
for fasta_file in fasta_files:
    full_path = os.path.join(directory, fasta_file)
    sequence_coding(full_path, fasta_file)
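
`os.listdir` returns entries in arbitrary order, so the order in which the files are processed can differ between runs. That does not affect the codes within a single file, but it does affect the row order of any table built from several files. A small sketch of a sorted listing, assuming a hypothetical helper `list_fasta_files` and a temporary directory with made-up file names:

```python
import os
import tempfile


def list_fasta_files(directory):
    """Return the .fasta file names in a directory, sorted alphabetically."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(".fasta") and os.path.isfile(os.path.join(directory, name))
    )


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["nanT_bv.fasta", "nanA_bv.fasta", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_fasta_files(tmp)
```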

In [8]:
# Organizing files

!mkdir -p bvbrc_code_dataframes
!mv *.csv bvbrc_code_dataframes/

In [9]:
# Concatenating the dataframes

# Only the per-gene tables are read, so a merged file left over from an earlier run is skipped
csv_files = [file for file in os.listdir('bvbrc_code_dataframes') if file.endswith('_codes_df.csv')]

dataframes = []

for file in csv_files:
    file_path = os.path.join('bvbrc_code_dataframes', file)
    df = pd.read_csv(file_path)
    dataframes.append(df)

merged_df = pd.concat(dataframes, ignore_index=True)

# Saving the dataframe
merged_df.to_csv('bvbrc_merged_df.csv', index=False)
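
Because the merged table is meant to serve as a lookup from original header to new code, a header that appears in more than one gene file would make that lookup ambiguous. The sketch below shows one way to check for this with the standard `csv` module. The helper `find_duplicates` and the sample rows are hypothetical; only the column names `Header` and `Gene_Code` come from the notebook.

```python
import csv
import io
from collections import Counter


def find_duplicates(rows, column):
    """Return the values that occur more than once in the given column."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical merged table, for illustration only.
sample_csv = (
    "Header,Gene_Code\n"
    "seq_a,nanA_1_bv\n"
    "seq_b,nanA_2_bv\n"
    "seq_a,nanB_1_bv\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
duplicate_headers = find_duplicates(rows, "Header")
duplicate_codes = find_duplicates(rows, "Gene_Code")
```

On the real data, the rows could be read from `bvbrc_merged_df.csv` in the same way.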

In [10]:
# Organizing and compressing files

!mv bvbrc_merged_df.csv bvbrc_code_dataframes/
!tar -czvf bvbrc_code_dataframes.tar.gz bvbrc_code_dataframes/

bvbrc_code_dataframes/
bvbrc_code_dataframes/nanC_bv_codes_df.csv
bvbrc_code_dataframes/nanT_bv_codes_df.csv
bvbrc_code_dataframes/satD_bv_codes_df.csv
bvbrc_code_dataframes/kpsC_bvbrc_codes_df.csv
bvbrc_code_dataframes/satB_bv_codes_df.csv
bvbrc_code_dataframes/kpsD_bv_codes_df.csv
bvbrc_code_dataframes/neuD_bv_codes_df.csv
bvbrc_code_dataframes/neuA_bv_codes_df.csv
bvbrc_code_dataframes/nane_bv_codes_df.csv
bvbrc_code_dataframes/nagB_bv_codes_df.csv
bvbrc_code_dataframes/siaP_bv_codes_df.csv
bvbrc_code_dataframes/nanQ_bv_codes_df.csv
bvbrc_code_dataframes/kpsT_bvbrc_codes_df.csv
bvbrc_code_dataframes/nanH_bv_codes_df.csv
bvbrc_code_dataframes/satA_bv_codes_df.csv
bvbrc_code_dataframes/bvbrc_merged_df.csv
bvbrc_code_dataframes/nanR_bv_codes_df.csv
bvbrc_code_dataframes/siaM_bv_codes_df.csv
bvbrc_code_dataframes/ompC_bv_codes_df.csv
bvbrc_code_dataframes/satC_bv_codes_df.csv
bvbrc_code_dataframes/kpsS_bv_codes_df.csv
bvbrc_code_dataframes/nanB_bv_codes_df.csv
bvbrc_code_dataframes/nagA

In [None]:
# Download the compressed file for storage in the local repository
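
The cell above has no code. One possible way to fill it in is sketched below, using `files.download` from `google.colab`. The wrapper `download_archive` is hypothetical. The module is imported under an alias so the call does not depend on any variable named `files` elsewhere in the notebook, and the function returns `False` when it is not running in Colab.

```python
import os


def download_archive(path):
    """Start a browser download in Colab; return False outside Colab."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    try:
        # Alias avoids a clash with any notebook variable called `files`.
        from google.colab import files as colab_files
    except ImportError:
        return False
    colab_files.download(path)
    return True
```

In this notebook the call would be `download_archive('bvbrc_code_dataframes.tar.gz')`.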