## About the dataset

This dataset accompanies the paper:
[Characterization and identification of antimicrobial peptides with different functional activities](https://academic.oup.com/bib/article/21/3/1098/5498047)

It contains positive and negative records for antifungal activity.

We will use the `Biopython` library, which is very useful for working with biological sequences.

In [1]:
%pip install biopython

Collecting biopython
  Downloading biopython-1.86-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)
Downloading biopython-1.86-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.86


In [2]:
from Bio import SeqIO
import pandas as pd

In [3]:
fasta_train_pos = '/content/train_pos_antifungal.fasta'
fasta_test_pos = '/content/test_pos_antifungal.fasta'
fasta_train_neg = '/content/train_neg_antifungal.fasta'
fasta_test_neg = '/content/test_neg_antifungal.fasta'

## Positive sequences

In [4]:
data = []
try:
    for record in SeqIO.parse(fasta_train_pos, "fasta"):
        data.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq)
        })

    df_train_pos = pd.DataFrame(data)
    display(df_train_pos.head())

except FileNotFoundError:
    print(f"Error: file not found at the specified path: {fasta_train_pos}")
except Exception as e:
    print(f"An error occurred while parsing the FASTA file: {e}")

Unnamed: 0,ID,Description,Sequence
0,AntiFP87,AntiFP87,KHRKKRKAWLLALA
1,ADAM_0387,ADAM_0387,ATCRKPSMYFSGACFSDTNCQKACNREDWPNGKCLVGFKCECQRPC
2,ADAM_6743,ADAM_6743,SMGAVKLAKLLIDKMKCEVTKAC
3,AP00376,AP00376,GWKDWAKKAGGWLKKKGPGMAKAALKAAMQ
4,AntiFP1283,AntiFP1283,KQLKTCTSVIKLGHPCDIESCLNECFRVYNTGFATCRGDKYSQLCT...
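To make the three columns concrete, here is a minimal pure-Python sketch of what reading FASTA text involves. It is not how Biopython implements `SeqIO.parse`, and it ignores edge cases such as comment lines. The function names `read_fasta` and `_make_record`, the line split inside the first sequence, and the text "toy description" are assumptions made for illustration.

```python
from io import StringIO


def _make_record(header, chunks):
    # The identifier is the first whitespace-delimited token of the header;
    # the description is the whole header line without the leading ">".
    parts = header.split()
    identifier = parts[0] if parts else ""
    return identifier, header, "".join(chunks)


def read_fasta(handle):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so they are collected and joined.
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Illustrative input: the sequences appear in the tables of this notebook,
# the line wrapping and the description text are made up.
example = StringIO(
    ">AntiFP87\nKHRKKRK\nAWLLALA\n>AP02542 toy description\nLRLKSIVSYAKKVL\n"
)
records = list(read_fasta(example))
for identifier, description, sequence in records:
    print(identifier, "|", description, "|", sequence)
```

In this dataset the headers hold only an identifier, which is why the `ID` and `Description` columns show the same value.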


In [5]:
data = []
try:
    for record in SeqIO.parse(fasta_test_pos, "fasta"):
        data.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq)
        })

    df_test_pos = pd.DataFrame(data)
    display(df_test_pos.head())

except FileNotFoundError:
    print(f"Error: file not found at the specified path: {fasta_test_pos}")
except Exception as e:
    print(f"An error occurred while parsing the FASTA file: {e}")

Unnamed: 0,ID,Description,Sequence
0,AntiFP1398,AntiFP1398,YDLFTGIGIDARTVPPTCYESCNATFQNPECNKMCVGLAYKDGSCI...
1,AntiFP768,AntiFP768,NVQQKRSNRCIDFPVNPKTGLCVLKDCESVCKKTSKGLEGICWKFN...
2,AP02542,AP02542,LRLKSIVSYAKKVL
3,AntiFP1392,AntiFP1392,MARSVPLVSTIFVFLLLLVATGPSMVAEARTCESQSHKFKGPCASD...
4,AP00476,AP00476,GLNALKKVFQGIHEAIKLINNHVQ


In [6]:
# Select the positive sequences; .copy() gives independent DataFrames,
# so adding a column below does not raise SettingWithCopyWarning
train_pos = df_train_pos[['Sequence']].copy()
test_pos = df_test_pos[['Sequence']].copy()

# Add a label column (1 = has antifungal activity)
train_pos['label'] = 1
test_pos['label'] = 1
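Adding a column to a column selection is a common trigger for pandas' `SettingWithCopyWarning`. One way to sidestep it is `DataFrame.assign`, which returns a new DataFrame instead of writing into an existing one. The sketch below uses a small stand-in frame (`df_toy`, a hypothetical name) rather than the real data.

```python
import pandas as pd

# Toy frame standing in for df_train_pos (two rows copied from the output above).
df_toy = pd.DataFrame({
    "ID": ["AntiFP87", "AP00376"],
    "Sequence": ["KHRKKRKAWLLALA", "GWKDWAKKAGGWLKKKGPGMAKAALKAAMQ"],
})

# assign() builds a new DataFrame with the extra column,
# so nothing is written into a selection of df_toy.
toy_pos = df_toy[["Sequence"]].assign(label=1)

print(toy_pos)
```

The original frame is left untouched, which keeps the `ID` and `Description` columns available if they are needed later.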


## Negative sequences

In [7]:
data = []
try:
    for record in SeqIO.parse(fasta_train_neg, "fasta"):
        data.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq)
        })

    df_train_neg = pd.DataFrame(data)
    display(df_train_neg.head())

except FileNotFoundError:
    print(f"Error: file not found at the specified path: {fasta_train_neg}")
except Exception as e:
    print(f"An error occurred while parsing the FASTA file: {e}")

Unnamed: 0,ID,Description,Sequence
0,AVP0705,AVP0705,RRCPTRPEGQNYTEG
1,AVP1180,AVP1180,VSTALPQWRIYSYAGDNI
2,AP01624,AP01624,HAEHKVKIGVEQKYGQFPQGTEVTYTCSGNYFLM
3,AVP1630,AVP1630,VRVLEDGVNYATGNLPGC
4,ParaPep_1490,ParaPep_1490,VGALAVVVWLWLWLW


In [8]:
data = []
try:
    for record in SeqIO.parse(fasta_test_neg, "fasta"):
        data.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq)
        })

    df_test_neg = pd.DataFrame(data)
    display(df_test_neg.head())

except FileNotFoundError:
    print(f"Error: file not found at the specified path: {fasta_test_neg}")
except Exception as e:
    print(f"An error occurred while parsing the FASTA file: {e}")

Unnamed: 0,ID,Description,Sequence
0,AVP0942,AVP0942,CEGLPNIDC
1,AVP1001,AVP1001,MDVNPTFLFLKVPAQ
2,AP01365,AP01365,AAKPMGITCDLLSLWKVGHAACAAHCLVLGDVGGYCTKEGLCVCKE
3,AP02948,AP02948,RLLLVMIGLRSKIKWHSGI
4,AVP0891,AVP0891,CSLTPHRSC
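The four loading cells above repeat the same loop, so it could be factored into one helper. The following is only a sketch: `records_to_frame` is a hypothetical name, and the stand-in records built with `SimpleNamespace` let it run without the FASTA files. In the notebook the argument would be the iterator returned by `SeqIO.parse(path, "fasta")`.

```python
from types import SimpleNamespace

import pandas as pd


def records_to_frame(records):
    """Build a DataFrame from objects exposing .id, .description and .seq."""
    rows = [
        {"ID": r.id, "Description": r.description, "Sequence": str(r.seq)}
        for r in records
    ]
    # Passing the column names keeps the layout stable even for empty input.
    return pd.DataFrame(rows, columns=["ID", "Description", "Sequence"])


# Stand-in records (values copied from the output above) so the sketch runs
# on its own; they only mimic the attributes that the loop reads.
fake_records = [
    SimpleNamespace(id="AVP0942", description="AVP0942", seq="CEGLPNIDC"),
    SimpleNamespace(id="AVP0891", description="AVP0891", seq="CSLTPHRSC"),
]
df_fake = records_to_frame(fake_records)
print(df_fake)
```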


In [9]:
# Select the negative sequences; .copy() gives independent DataFrames,
# so adding a column below does not raise SettingWithCopyWarning
train_neg = df_train_neg[['Sequence']].copy()
test_neg = df_test_neg[['Sequence']].copy()

# Add a label column (0 = no antifungal activity)
train_neg['label'] = 0
test_neg['label'] = 0


In [10]:
train_neg.head()

Unnamed: 0,Sequence,label
0,RRCPTRPEGQNYTEG,0
1,VSTALPQWRIYSYAGDNI,0
2,HAEHKVKIGVEQKYGQFPQGTEVTYTCSGNYFLM,0
3,VRVLEDGVNYATGNLPGC,0
4,VGALAVVVWLWLWLW,0


## Concatenate both sets of sequences

In [11]:
# Concatenate the DataFrames
df_combined = pd.concat([train_pos, train_neg, test_pos, test_neg])

In [12]:
df_combined.head()

Unnamed: 0,Sequence,label
0,KHRKKRKAWLLALA,1
1,ATCRKPSMYFSGACFSDTNCQKACNREDWPNGKCLVGFKCECQRPC,1
2,SMGAVKLAKLLIDKMKCEVTKAC,1
3,GWKDWAKKAGGWLKKKGPGMAKAALKAAMQ,1
4,KQLKTCTSVIKLGHPCDIESCLNECFRVYNTGFATCRGDKYSQLCT...,1


In [13]:
df_combined.describe()

Unnamed: 0,label
count,5148.0
mean,0.530692
std,0.499106
min,0.0
25%,0.0
50%,1.0
75%,1.0
max,1.0
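Because `label` is binary, the `mean` reported by `describe()` is the fraction of positive records. The arithmetic sketch below uses the rounded figures printed above to recover approximate class counts. Running `df_combined["label"].value_counts()` on the real data would give the exact numbers.

```python
# Values copied from the describe() output. The mean is rounded to six
# decimals, so the derived counts are an estimate, not a measured result.
count = 5148
mean_label = 0.530692

n_pos = round(count * mean_label)  # records with label == 1
n_neg = count - n_pos              # records with label == 0

print(n_pos, n_neg)
```

Under these figures the two classes are close to balanced, with slightly more positive than negative records.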


In [15]:
df_combined["Sequence"].unique().shape

(5148,)

In [16]:
df = df_combined.drop_duplicates().copy()

In [17]:
df.shape

(5148, 2)
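Called without arguments, `drop_duplicates()` compares whole rows, so a sequence that appeared with both labels would survive as two rows. Here the number of unique sequences equals the number of rows, so no such case exists. On other data, a check along the following lines could be used. The frame `toy` and its `AAA` and `KKK` sequences are made up for the sketch.

```python
import pandas as pd

# Made-up rows: "AAA" appears with both labels, "KKK" is an exact duplicate.
toy = pd.DataFrame({
    "Sequence": ["AAA", "AAA", "KKK", "KKK", "LFR"],
    "label": [1, 0, 1, 1, 0],
})

# Exact duplicates (same sequence and same label) are removed.
deduped = toy.drop_duplicates()

# Sequences that still occur with more than one distinct label.
labels_per_seq = deduped.groupby("Sequence")["label"].nunique()
conflicting = labels_per_seq[labels_per_seq > 1].index.tolist()

print(len(deduped), conflicting)
```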

## Check for canonical amino acids

In [18]:
# Define the 20 canonical amino acids (one-letter codes)
aminoacidos = set("ACDEFGHIKLMNPQRSTVWY")

In [19]:
# Flag the sequences made up only of canonical amino acids
df["is_valid"] = [set(seq).issubset(aminoacidos) for seq in df["Sequence"]]

In [20]:
no_validas = df[~df["is_valid"]]
print("Invalid sequences:", len(no_validas))

Invalid sequences: 0


In [21]:
df_validas = df[df["is_valid"]].drop(columns="is_valid")
print("Remaining valid sequences:", len(df_validas))

Remaining valid sequences: 5148
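When this check does reject sequences, it helps to know which characters caused the rejection. The sketch below counts them on made-up input. `non_canonical_counts` and `CANONICAL` are hypothetical names. Because the reference set is uppercase, lowercase letters are also treated as non-canonical, exactly as in the check above.

```python
from collections import Counter

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def non_canonical_counts(sequences):
    """Count every character that is not one of the 20 canonical residues."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch not in CANONICAL)
    return counts


# Made-up inputs: X, B and lowercase letters all fall outside the set.
toy_sequences = ["KHRKKRKAWLLALA", "ACDXXB", "lfr"]
counts = non_canonical_counts(toy_sequences)
print(counts)
```

If lowercase input is expected, applying `str.upper()` to the sequences before the check is one option.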


In [22]:
display(df_validas)

Unnamed: 0,Sequence,label
0,KHRKKRKAWLLALA,1
1,ATCRKPSMYFSGACFSDTNCQKACNREDWPNGKCLVGFKCECQRPC,1
2,SMGAVKLAKLLIDKMKCEVTKAC,1
3,GWKDWAKKAGGWLKKKGPGMAKAALKAAMQ,1
4,KQLKTCTSVIKLGHPCDIESCLNECFRVYNTGFATCRGDKYSQLCT...,1
...,...,...
1150,KKKKLLLATLFFF,0
1151,NKGCATCSIGAACLVDGPIPDFEIAGATGLFGLWG,0
1152,LFR,0
1153,TKPTLLGLPLGAGPAAGPGKR,0


In [23]:
df_validas = df_validas.rename(columns={'Sequence': 'sequence'})
display(df_validas.head())

Unnamed: 0,sequence,label
0,KHRKKRKAWLLALA,1
1,ATCRKPSMYFSGACFSDTNCQKACNREDWPNGKCLVGFKCECQRPC,1
2,SMGAVKLAKLLIDKMKCEVTKAC,1
3,GWKDWAKKAGGWLKKKGPGMAKAALKAAMQ,1
4,KQLKTCTSVIKLGHPCDIESCLNECFRVYNTGFATCRGDKYSQLCT...,1


In [24]:
df_validas.to_csv('AMPfun_labeled.csv', index=False, header=True)

## Build the metadata

In [25]:
import json

In [26]:
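def export_json(path_to_export, data_to_export):
    """Write a dictionary to a JSON file."""
    # An explicit UTF-8 encoding keeps accented keys portable across platforms
    with open(path_to_export, 'w', encoding='utf-8') as doc_export:
        json.dump(
            data_to_export,
            doc_export,
            indent=4,
            default=str,  # values JSON cannot serialize are written as strings
            ensure_ascii=False)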

In [27]:
def create_metada_with_multiple_values(df_metada_filter, full_df):
    """Collapse each metadata column into a single value and add the sequence count."""
    dict_metadata = {}

    for column in df_metada_filter.columns:
        values = df_metada_filter[column].unique().tolist()

        if len(values) > 1:
            # Several distinct values: join them into one ";"-separated string
            dict_metadata[column] = ";".join(str(value) for value in values)
        else:
            # A single distinct value is stored as-is
            dict_metadata[column] = values[0]

    # Record how many sequences the exported dataset contains
    dict_metadata["number_of_sequences"] = len(full_df)

    return dict_metadata
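To show what this function returns, the sketch below applies the same logic to a made-up metadata frame. `collapse_columns` is a condensed restatement written so the block runs on its own, and the cell values are invented. Only the two column names come from the notebook's metadata sheet.

```python
import pandas as pd


def collapse_columns(df_filter, full_df):
    # Condensed restatement of create_metada_with_multiple_values.
    out = {}
    for column in df_filter.columns:
        values = df_filter[column].unique().tolist()
        if len(values) > 1:
            out[column] = ";".join(str(v) for v in values)
        else:
            out[column] = values[0]
    out["number_of_sequences"] = len(full_df)
    return out


# Invented metadata rows: one column with a single value, one with two.
toy_meta = pd.DataFrame({
    "formato": ["fasta", "fasta"],
    "peptide property": ["antifungal", "antimicrobial"],
})
toy_sequences = pd.DataFrame({"sequence": ["LFR", "CEGLPNIDC"], "label": [0, 0]})

result = collapse_columns(toy_meta, toy_sequences)
print(result)
```

Note that the function assumes at least one matching row: with an empty frame, `values[0]` would raise an `IndexError`.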

In [28]:
def read_metadata(path_data, name_source):
    """Read the metadata spreadsheet and keep the rows and columns for one source."""
    df_metada = pd.read_excel(path_data)
    # Keep only the rows that describe the requested source
    df_metada_filter = df_metada[df_metada["name source"] == name_source]
    # Keep only the columns that go into the exported metadata
    df_metada_filter = df_metada_filter[['type source',
                                         'estatico-dinamico',
                                         'licencia',
                                         'año de publicación',
                                         'fecha ultima actualizacion',
                                         'download date',
                                         'formato',
                                         'peptide property',
                                         'informacion del dataset',
                                         'unidad de medida',
                                         'Construccion de dataset negativos',
                                         'repositorio o servidor',
                                         'Publicacion']]
    return df_metada_filter

In [29]:
df_metada_filter = read_metadata("/content/description_raw_data.xlsx", "AMPfun")

In [30]:
dict_metadata = create_metada_with_multiple_values(df_metada_filter, df_validas)

In [31]:
export_json("metadata.json", dict_metadata)
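A quick round trip shows what the export options do to the values. The sketch below writes a made-up dictionary to a temporary file and reads it back. The `export_json` defined here mirrors the notebook's function with an explicit UTF-8 encoding, and all values except `number_of_sequences` are invented.

```python
import json
import os
import tempfile
from datetime import date


def export_json(path_to_export, data_to_export):
    # Same json.dump options as in the notebook, with an explicit encoding.
    with open(path_to_export, "w", encoding="utf-8") as doc_export:
        json.dump(data_to_export, doc_export, indent=4, default=str, ensure_ascii=False)


# Invented metadata; the keys mirror column names used above.
toy_metadata = {
    "año de publicación": 2000,
    "download date": date(2024, 1, 31),
    "number_of_sequences": 5148,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metadata.json")
    export_json(path, toy_metadata)
    with open(path, encoding="utf-8") as doc:
        loaded = json.load(doc)

print(loaded)
```

Because of `default=str`, the date comes back as a plain string rather than a date object, so any code that reads `metadata.json` has to parse such fields itself.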