# FASTA Format Converter 

The following code replaces the `Locus_tag` in each sequence description with the corresponding `Genbank_acc`. *Note: Ensure that all the required files are in the same directory.*

Import packages

In [None]:
import os
import pandas as pd
import Bio.SeqIO as sio

Regular expressions for the GenBank accession number formats, based on the information at https://www.ncbi.nlm.nih.gov/Sequin/acc.html

In [None]:
#   1 letter + 5 numerals
#pattern_1 = r"^[A-Z]\d{5}$"

#   2 letters + 6 numerals
#pattern_2 = r"^[A-Z]{2}\d{6}$"

#   2 letters + 8 numerals
#pattern_3 = r"^[A-Z]{2}\d{8}$"

#   4 letters + x numerals
#pattern_4 = r"^[A-Z]{4}\d+$"
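A minimal sketch of how the four formats listed in the comments above could be checked with the standard `re` module. It assumes accessions use uppercase letters and carry no version suffix such as `.1`; the helper name `looks_like_accession` and the sample values are hypothetical.

```python
import re

# Assumed patterns: uppercase letters followed by digits, one per listed format.
ACCESSION_PATTERNS = [
    re.compile(r"[A-Z]\d{5}"),     # 1 letter + 5 numerals
    re.compile(r"[A-Z]{2}\d{6}"),  # 2 letters + 6 numerals
    re.compile(r"[A-Z]{2}\d{8}"),  # 2 letters + 8 numerals
    re.compile(r"[A-Z]{4}\d+"),    # 4 letters + one or more numerals
]


def looks_like_accession(value):
    """Return True if the whole string matches one of the assumed formats."""
    return any(pattern.fullmatch(value) for pattern in ACCESSION_PATTERNS)


# Hypothetical sample values, for illustration only.
for sample in ["U12345", "AF123456", "ABCD01000001", "NA", "TAG_0001"]:
    print(sample, looks_like_accession(sample))
```

`fullmatch` is used so that the whole value has to fit the format, which plays the role of the `^` and `$` anchors.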

### Part 1: Data Frame

**Enter the path to the current directory.** Ensure that all the required files show up in the output.

In [None]:
# List the directory contents to confirm that the required files are present
files = os.listdir(path = 'C:/Users/vishwakarmas/Downloads/fasta_format_converter/')
print(files)

**Enter the file name.** Read all the sheets and concatenate them into one `DataFrame`.

In [None]:
seq_info = pd.concat(pd.read_excel('1-s2.0-S0960982220305868-mmc2.xlsx', sheet_name = None), ignore_index = True)
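With `sheet_name = None`, `pd.read_excel` returns a dictionary of DataFrames keyed by sheet name, and `pd.concat` accepts that dictionary directly. The sketch below uses in-memory DataFrames in place of the spreadsheet; the sheet names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the dict returned by read_excel(..., sheet_name=None).
sheets = {
    "Sheet1": pd.DataFrame({"Locus_tag": ["TAG_0001"], "Genbank_acc": ["AB123456"]}),
    "Sheet2": pd.DataFrame({"Locus_tag": ["TAG_0002"], "Genbank_acc": ["ABCD"]}),
}

# ignore_index=True discards the per-sheet row labels and renumbers from 0.
combined = pd.concat(sheets, ignore_index=True)
print(combined)
```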

Create two filtered `DataFrames` that exclude missing accessions. *`filter_null` - removes rows where `Genbank_acc` is `nan`.* *`filter_NA` - additionally removes rows where `Genbank_acc` begins with the literal word `NA`.*

In [None]:
filter_null = seq_info[seq_info['Genbank_acc'].notnull()]
# No capture group in the pattern, so pandas does not warn about match groups
filter_NA = filter_null[~filter_null.Genbank_acc.str.contains(r'\ANA\b', regex = True)]
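To illustrate the two filtering steps, here is a small sketch on a made-up table. It uses the pattern `\ANA\b`, which matches the same values as the one in the cell above; all locus tags and accessions are hypothetical.

```python
import pandas as pd

# Hypothetical rows: a real-looking accession, a missing value, the text "NA",
# and an accession that merely starts with the letters "NA".
df = pd.DataFrame({
    "Locus_tag": ["TAG_0001", "TAG_0002", "TAG_0003", "TAG_0004"],
    "Genbank_acc": ["AB123456", None, "NA", "NAAB01000001"],
})

# Step 1: drop rows with a missing accession.
has_value = df[df["Genbank_acc"].notnull()]

# Step 2: drop rows whose accession starts with the whole word "NA".
kept = has_value[~has_value["Genbank_acc"].str.contains(r"\ANA\b", regex=True)]
kept_tags = kept["Locus_tag"].tolist()
print(kept_tags)
```

The word boundary `\b` is what keeps `NAAB01000001` in the result: the character after `NA` is a letter, so the pattern does not match there.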

### Part 2: Modify `Genbank_acc` values

Create a function that right-pads the value with 8 zeros when it has exactly 4 characters, giving 12 characters in total. All other values are returned unchanged.

In [None]:
def applyZeros(genbank_val):
    if len(genbank_val) == 4:
        return genbank_val.ljust(12, '0')
    else:
        return genbank_val
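As a quick check of the padding rule, the sketch below applies an equivalent function to a few made-up values. The name `pad_four_char_prefix` and the sample values are hypothetical.

```python
def pad_four_char_prefix(genbank_val):
    """Right-pad a 4-character value with zeros to 12 characters."""
    if len(genbank_val) == 4:
        return genbank_val.ljust(12, "0")
    return genbank_val


# Hypothetical sample values, for illustration only.
padded = pad_four_char_prefix("ABCD")
unchanged = pad_four_char_prefix("AB123456")
print(padded, unchanged)
```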

Apply the function to all cells in the `Genbank_acc` column.

In [None]:
filter_NA_new_col = filter_NA['Genbank_acc'].apply(applyZeros).rename('New_Genbank_acc')

**View list of `Locus_tags` included.**

In [None]:
updated_locus_tags = filter_NA['Locus_tag'].tolist()
print(updated_locus_tags)

**View `DataFrame` for more information.**

In [None]:
print(filter_NA)

**View `list` of `Locus_tags` not included.**

In [None]:
not_included = seq_info[~seq_info['Locus_tag'].isin(filter_NA['Locus_tag'])]
print(not_included['Locus_tag'].tolist())

**View `DataFrame` for more information.**

In [None]:
print(not_included)

Concatenate the new `Series` to the currently existing `DataFrame`.

In [None]:
updated_seq_info = pd.concat([filter_NA, filter_NA_new_col], axis = 1)

Build a `Dictionary` from the `DataFrame` that maps each `Locus_tag` (key) to its `New_Genbank_acc` (value).

In [None]:
dict_filter_NA = pd.Series(updated_seq_info.New_Genbank_acc.values, index = updated_seq_info.Locus_tag).to_dict()
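A small sketch of the same mapping step on a made-up table, with an extra check for repeated locus tags. The check is an addition for illustration, not part of the original workflow; since a dictionary holds one value per key, a repeated `Locus_tag` would end up with only one accession. All values are hypothetical.

```python
import pandas as pd

# Hypothetical table with the two columns used for the mapping.
table = pd.DataFrame({
    "Locus_tag": ["TAG_0001", "TAG_0002"],
    "New_Genbank_acc": ["AB123456", "ABCD00000000"],
})

# Locus tags that occur more than once (empty for this sample).
duplicated_tags = table.loc[table["Locus_tag"].duplicated(), "Locus_tag"].tolist()

# Same construction as in the cell above: index = keys, values = accessions.
tag_to_acc = pd.Series(table["New_Genbank_acc"].values, index=table["Locus_tag"]).to_dict()
print(tag_to_acc)
```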

### Part 3: Converting

Enter the name of the `fasta` file to read and the name of the corrected `fasta` file to create. The `Dictionary` is used to replace each matched key (`Locus_tag`) with its value (`New_Genbank_acc`).

In [None]:
with open('Sample_fasta_file.txt', 'r') as original, open('New_fasta_file.txt', 'w') as corrected:
    for seq_record in sio.parse(original, 'fasta'):
        if seq_record.id in dict_filter_NA:
            seq_record.id = seq_record.description = dict_filter_NA[seq_record.id]
        sio.write(seq_record, corrected, 'fasta')
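To see what the loop does without touching any files, the sketch below runs the same logic on an in-memory FASTA string. The mapping, record names, and sequences are hypothetical.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical mapping and FASTA text, for illustration only.
mapping = {"TAG_0001": "AB123456"}
fasta_text = (
    ">TAG_0001 hypothetical protein\n"
    "ATGCATGC\n"
    ">TAG_0002 kinase\n"
    "GGCCGGCC\n"
)

source = StringIO(fasta_text)
target = StringIO()
for record in SeqIO.parse(source, "fasta"):
    if record.id in mapping:
        # Overwrite both fields so the header holds only the accession.
        record.id = record.description = mapping[record.id]
    SeqIO.write(record, target, "fasta")

result = target.getvalue()
print(result)
```

Because `description` is overwritten together with `id`, the header of a matched record contains only the accession. Records without a match are written back with their original header.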