## About the dataset

It is associated with the paper:
[UniDL4BioPep: a universal deep learning architecture for binary classification in peptide bioactivity](https://academic.oup.com/bib/article/24/3/bbad135/7107929)

It contains positive and negative records of antifungal peptides.

We will use the `Biopython` library, which is well suited to working with biological sequences.

In [None]:
%pip install biopython

Collecting biopython
  Downloading biopython-1.86-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)
Downloading biopython-1.86-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.86


In [None]:
from Bio import SeqIO
import pandas as pd

In [None]:
fasta_train_pos = '/content/afp_train_pos.fa'
fasta_test_pos = '/content/afp_test_pos.fa'
fasta_train_neg = '/content/afp_train_negpp.fa'
fasta_test_neg = '/content/afp_test_negpp.fa'
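
The paths above appear to assume that the four FASTA files were uploaded to `/content`, the default working directory in Google Colab. As a minimal sketch, a check like the following could confirm the files are present before parsing. The helper name `missing_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Same four paths as defined in the cell above
fasta_paths = [
    '/content/afp_train_pos.fa',
    '/content/afp_test_pos.fa',
    '/content/afp_train_negpp.fa',
    '/content/afp_test_negpp.fa',
]

missing = missing_files(fasta_paths)
if missing:
    print("Missing FASTA files:", missing)
else:
    print("All FASTA files found.")
```

Checking up front gives one clear message listing every missing file, rather than a separate error per parsing cell.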

## Positive sequences

In [None]:
data = []
try:
    for record in SeqIO.parse(fasta_train_pos, "fasta"):
        data.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq)
        })

    df_train_pos = pd.DataFrame(data)
    display(df_train_pos.head())

except FileNotFoundError:
    print(f"Error: file not found at the specified path: {fasta_train_pos}")
except Exception as e:
    print(f"An error occurred while parsing the FASTA file: {e}")

Unnamed: 0,ID,Description,Sequence
0,PPepDB_3939,PPepDB_3939,SGECNMYGRCPPGYCCSKFGYCGGVRAYCG
1,PPepDB_2354,PPepDB_2354,DNKAKSKKRDKEKPSSGRPGQTNSVPNAAIQVYKED
2,PPepDB_1628,PPepDB_1628,RECRSQSKQFVGLCVSDTNCASVCLTEHFPGGKCDGYRRCFCTKDC
3,PPepDB_753,PPepDB_753,LPSDATLVLDQTGKELDARL
4,PPepDB_3409,PPepDB_3409,LCCSQFGFCGTTR
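
The parsing loop above is the same for every FASTA file in this notebook. As a sketch only, it could be wrapped in a small helper so each file is loaded with a single call. The function name `fasta_to_dataframe` and the in-memory example records are assumptions made for illustration; they do not come from the dataset.

```python
from io import StringIO

import pandas as pd
from Bio import SeqIO


def fasta_to_dataframe(source):
    """Parse a FASTA file (path or open text handle) into a DataFrame."""
    rows = [
        {
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq),
        }
        for record in SeqIO.parse(source, "fasta")
    ]
    # Fix the column order so an empty file still yields the same columns
    return pd.DataFrame(rows, columns=["ID", "Description", "Sequence"])


# Illustrative in-memory FASTA text (made-up records, not from the dataset)
example_fasta = StringIO(
    ">pep_1\nACDEFGHIK\n>pep_2 second example\nLMNPQRSTVWY\n"
)
df_example = fasta_to_dataframe(example_fasta)
print(df_example)
```

With such a helper, a call like `fasta_to_dataframe(fasta_train_pos)` would stand in for the loop; error handling is left out here to keep the sketch short.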


In [None]:
data = []
try:
    for record in SeqIO.parse(fasta_test_pos, "fasta"):
        data.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq)
        })

    df_test_pos = pd.DataFrame(data)
    display(df_test_pos.head())

except FileNotFoundError:
    print(f"Error: file not found at the specified path: {fasta_test_pos}")
except Exception as e:
    print(f"An error occurred while parsing the FASTA file: {e}")

Unnamed: 0,ID,Description,Sequence
0,PPepDB_1568,PPepDB_1568,KSTCKAESNTFPGLCITKPPCRKACLSEKFTDGKCSKILRRCICYKPC
1,PPepDB_596,PPepDB_596,AAKMAKNVDKPLFTATFNVQASSADYATFIAGIRNKLRNPAHFSHN...
2,PPepDB_2214,PPepDB_2214,GLPVCGETCVGGTCNTPGCTCSWPVCTRN
3,PPepDB_2099,PPepDB_2099,KSCCRNTTARNCYNVCRIPG
4,PPepDB_3509,PPepDB_3509,NILEASFNTDYEEIEK


In [None]:
# Select the positive sequences (copy to avoid pandas' SettingWithCopyWarning)
train_pos = df_train_pos[['Sequence']].copy()
test_pos = df_test_pos[['Sequence']].copy()

# Add a label column (1 = has antifungal activity)
train_pos['label'] = 1
test_pos['label'] = 1
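
Selecting a subset of columns with `df[['Sequence']]` and then assigning a new column to the result is a pattern that can make pandas emit a `SettingWithCopyWarning`, because it is ambiguous whether the original DataFrame should change. The following is a minimal sketch of two common ways to avoid that ambiguity, using a toy DataFrame (`df_toy` and its contents are made up for illustration).

```python
import pandas as pd

# Toy stand-in for a parsed FASTA DataFrame (hypothetical records)
df_toy = pd.DataFrame({
    "ID": ["pep_1", "pep_2"],
    "Sequence": ["ACDEFGHIK", "LMNPQRSTVWY"],
})

# Option 1: take an explicit copy, then add the column in place
labeled_a = df_toy[["Sequence"]].copy()
labeled_a["label"] = 1

# Option 2: let assign() return a new DataFrame with the extra column
labeled_b = df_toy[["Sequence"]].assign(label=1)

# Both options give the same result and leave df_toy untouched
print(labeled_a.equals(labeled_b))
print(list(df_toy.columns))
```

Either option makes it explicit that the labeled table is a new object, so the source DataFrame keeps its original columns.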


## Negative sequences

In [None]:
data = []
try:
    for record in SeqIO.parse(fasta_train_neg, "fasta"):
        data.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq)
        })

    df_train_neg = pd.DataFrame(data)
    display(df_train_neg.head())

except FileNotFoundError:
    print(f"Error: file not found at the specified path: {fasta_train_neg}")
except Exception as e:
    print(f"An error occurred while parsing the FASTA file: {e}")

Unnamed: 0,ID,Description,Sequence
0,PPepDB_1740,PPepDB_1740,GIPCGESCVWIPCISGAIGCSCKSKVCYKN
1,PPepDB_2585,PPepDB_2585,KSCCKDTLGRDCYDLCRARGAPKLCSTLCRCKITSGLSCPKDFPK
2,PPepDB_5802,PPepDB_5802,ICINCCSGYKGCNYYNSFGKFICEGESDPKRPNACTFNCDPNIAYS...
3,PPepDB_5294,PPepDB_5294,ECNTICMNKKYKEGSCVGFGIPPTSKYCCCKTMKMESSKMLVVFTL...
4,PPepDB_2246,PPepDB_2246,MKFSMRLISAVLFLVMIFVATGMGPVTVEARTCASQSQRFKGKCVS...


In [None]:
data = []
try:
    for record in SeqIO.parse(fasta_test_neg, "fasta"):
        data.append({
            "ID": record.id,
            "Description": record.description,
            "Sequence": str(record.seq)
        })

    df_test_neg = pd.DataFrame(data)
    display(df_test_neg.head())

except FileNotFoundError:
    print(f"Error: file not found at the specified path: {fasta_test_neg}")
except Exception as e:
    print(f"An error occurred while parsing the FASTA file: {e}")

Unnamed: 0,ID,Description,Sequence
0,PPepDB_5874,PPepDB_5874,KAYKNGKGTCWQKFCQCVYDCMSKPTVIVIFMAILVLGMATKETQG...
1,PPepDB_1773,PPepDB_1773,GLPVCGETCVGGTCNTEYCTCSWPVCTRD
2,PPepDB_6039,PPepDB_6039,MANISWSHFLILMLVFSVVKKGKGDQTDKYCTIIIDPRTPCDLVDC...
3,PPepDB_5239,PPepDB_5239,DVRISFRAYTTCLQSTEWHIDSELAAGRRHVITGPVKDPSPSGREN...
4,PPepDB_4494,PPepDB_4494,ANVCHNEGFVGGNCRGFRRRCFCTRHCMKLSMRLISAVLIMFMIFV...


In [None]:
# Select the negative sequences (copy to avoid pandas' SettingWithCopyWarning)
train_neg = df_train_neg[['Sequence']].copy()
test_neg = df_test_neg[['Sequence']].copy()

# Add a label column (0 = no antifungal activity)
train_neg['label'] = 0
test_neg['label'] = 0


In [None]:
train_neg.head()

Unnamed: 0,Sequence,label
0,GIPCGESCVWIPCISGAIGCSCKSKVCYKN,0
1,KSCCKDTLGRDCYDLCRARGAPKLCSTLCRCKITSGLSCPKDFPK,0
2,ICINCCSGYKGCNYYNSFGKFICEGESDPKRPNACTFNCDPNIAYS...,0
3,ECNTICMNKKYKEGSCVGFGIPPTSKYCCCKTMKMESSKMLVVFTL...,0
4,MKFSMRLISAVLFLVMIFVATGMGPVTVEARTCASQSQRFKGKCVS...,0


## Concatenating the positive and negative sequences

In [None]:
# Concatenate the DataFrames; ignore_index=True gives the result a fresh 0..n-1 index
df_combined = pd.concat([train_pos, train_neg, test_pos, test_neg], ignore_index=True)

In [None]:
df_combined.head()

Unnamed: 0,Sequence,label
0,SGECNMYGRCPPGYCCSKFGYCGGVRAYCG,1
1,DNKAKSKKRDKEKPSSGRPGQTNSVPNAAIQVYKED,1
2,RECRSQSKQFVGLCVSDTNCASVCLTEHFPGGKCDGYRRCFCTKDC,1
3,LPSDATLVLDQTGKELDARL,1
4,LCCSQFGFCGTTR,1


In [None]:
df_combined.describe()

Unnamed: 0,label
count,954.0
mean,0.5
std,0.500262
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0
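
Because `label` only takes the values 0 and 1, its mean is the fraction of positive records: a mean of 0.5 over 954 rows corresponds to 477 positives and 477 negatives. The sketch below shows how the per-class counts could be checked directly. The toy DataFrame is a made-up stand-in, not data from the dataset.

```python
import pandas as pd

# Toy stand-in for df_combined (hypothetical sequences)
toy = pd.DataFrame({
    "Sequence": ["ACDEFGHIK", "LMNPQRSTVWY", "GGKKRR", "MKFSAAL"],
    "label": [1, 1, 0, 0],
})

# Number of rows per class; a balanced set has equal counts
counts = toy["label"].value_counts().sort_index()
print(counts)

# The mean of a 0/1 column is the fraction of rows labeled 1
positive_fraction = toy["label"].mean()
print(positive_fraction)
```

On the real data, the same two calls on `df_combined["label"]` would report the class counts behind the summary statistics.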


In [None]:
df_combined.to_csv('PTAMP_combinado.csv', index=False, header=True)
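
With `index=False` the row index is not written, so reading the file back does not produce an extra `Unnamed: 0` column. As a small sketch, the round trip can be checked on a toy DataFrame; an in-memory buffer is used here instead of a file on disk, and the records are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Toy stand-in for df_combined (hypothetical sequences)
toy = pd.DataFrame({
    "Sequence": ["ACDEFGHIK", "LMNPQRSTVWY"],
    "label": [1, 0],
})

# Write to an in-memory buffer with the same options as the cell above
buffer = StringIO()
toy.to_csv(buffer, index=False, header=True)

# Rewind and read the CSV text back into a DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.shape)
print(list(restored.columns))
```

The same check against the saved file would use `pd.read_csv('PTAMP_combinado.csv')` in place of the buffer.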