# Instituto Tecnológico y de Estudios Superiores de Monterrey
## Maestría en Inteligencia Artificial Aplicada
### Proyecto Integrador (Gpo 10) - TC5035.10

### **Project: Accelerated Design of GLP-1 Hormone Agonist Drugs**

### Progress Report 4: Alternative Models

#### **Instructors:**
- Dra. Grettel Barceló Alonso - Lead Professor
- Dr. Luis Eduardo Falcón Morales - Lead Professor
- Dra. Eduviges Ludivina Facundo Flores - Tutor Professor

### **Advisors**
- Dr. Juan Arturo Nolazco Flores
- Dr. Carlos Alberto Brizuela Rodríguez

#### **Team members:**
- Cesar Ivan Herrera Martinez A01796392  
- Juan Antonio Cruz Acosta A01795375 
- Julio Baltazar Colín A01794476 

# Alternative Models
Generation of New GLP-1 Agonist Peptide Sequences with Language Models and In Silico Evaluation of Their Biological Activity

## Introduction

## Hemolysis

Hemolysis is the disruption of red blood cell membranes, which shortens the cells' lifespan. Identifying antimicrobial agents or peptides that do not cause hemolysis is essential to ensure they can be used safely, without toxicity, against bacterial infections. PeptideBERT labels a peptide as hemolytic when it lyses 50% of red blood cells (RBCs) at an activity below 100 µg/mL.
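The labeling rule can be written as a one-line threshold check. The sketch below is only an illustration of that rule: the function name, the `hc50_ug_ml` parameter (the concentration that lyses 50% of RBCs), and the example values are hypothetical and are not taken from PeptideBERT's code or data.

```python
# Threshold quoted in the text: activity below 100 µg/mL counts as hemolytic.
HEMOLYTIC_THRESHOLD_UG_ML = 100.0


def is_hemolytic(hc50_ug_ml: float) -> bool:
    """Return True when 50% lysis is reached below the threshold concentration."""
    return hc50_ug_ml < HEMOLYTIC_THRESHOLD_UG_ML


# Hypothetical values in µg/mL, for illustration only.
examples = {"peptide_a": 12.5, "peptide_b": 250.0}
labels = {name: is_hemolytic(value) for name, value in examples.items()}
print(labels)
```

A strict inequality is used here, so a peptide measured at exactly 100 µg/mL would be labeled non-hemolytic; how the boundary is treated in the original dataset is an assumption.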

## Solubility

A peptide's solubility is its ability to dissolve in a given solvent, usually water or a buffer solution. PeptideBERT classifies peptides as soluble or insoluble using data from PROSO II. The labels were assigned retrospectively from electronic laboratory records kept within the Protein Structure Initiative.

## Non-Adherence

Non-adherence (non-fouling) in peptides is the inability of a peptide to interact with and bind stably to a given surface, whether a membrane, a synthetic material, or a biomolecule such as a protein or DNA. PeptideBERT classifies the non-fouling behavior of peptides following the methodology of White et al., which is based on the distribution of amino acids on the outer surfaces of proteins. Samples are labeled according to the following criteria:

Positive examples (non-fouling peptides):

- Peptides that follow the amino acid distribution observed on the surfaces of soluble proteins, especially in environments with a high tendency toward aggregation, such as the cytoplasm.
- Peptides designed to self-assemble and form surfaces that minimize non-specific adsorption.
- Peptides with characteristics similar to those of chaperone proteins, where avoiding non-specific interactions is essential.

Negative examples (fouling peptides):

- Peptides that do not follow this pattern and show a greater tendency toward non-specific interactions and aggregation on surfaces.

For this proof of concept, the three PeptideBERT models are run in a recursive function and the results are concatenated into a single DataFrame for easier handling.
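As a rough sketch of that idea, the snippet below runs three scoring functions over a list of sequences and collects one row per sequence. The three `toy_*` predictors are hypothetical stand-ins, not the PeptideBERT models, and the sketch uses a plain loop rather than recursion.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for the three classifiers. Each maps a sequence to a
# score in [0, 1]; the real models would be loaded from disk instead.
def toy_hemolysis(sequence: str) -> float:
    return sequence.count("K") / len(sequence)


def toy_solubility(sequence: str) -> float:
    return sequence.count("E") / len(sequence)


def toy_non_fouling(sequence: str) -> float:
    return sequence.count("G") / len(sequence)


PREDICTORS: Dict[str, Callable[[str], float]] = {
    "hemolysis": toy_hemolysis,
    "solubility": toy_solubility,
    "non_fouling": toy_non_fouling,
}


def score_sequences(sequences: List[str]) -> List[dict]:
    """Run every predictor on every sequence and return one row per sequence."""
    rows = []
    for sequence in sequences:
        row = {"sequence": sequence}
        for task, predict in PREDICTORS.items():
            row[task] = predict(sequence)
        rows.append(row)
    return rows


rows = score_sequences([
    "HAEGTFTSDVSSYLEGQAAKEFIAWLVKR",
    "HSQGTFTSDYSKYLDSRRAQDFVQWLEEGE",
])
for row in rows:
    print(row)
```

In the notebook, a list of dictionaries like `rows` can be passed to `pd.DataFrame(rows)` to obtain the combined table.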

## Generation of New GLP-1 Agonist Peptide Sequences with Language Models

### Loading the base data for generating new sequences

In [3]:
import os
import sys
import torch
import numpy as np
import pandas as pd
from pathlib import Path

# Machine learning and Transformers
from sklearn.model_selection import train_test_split
from transformers import (
    XLNetLMHeadModel,
    XLNetTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import Dataset

# Path of the current notebook's directory
notebook_dir = Path.cwd()
directorio_base = Path.cwd().parent
sys.path.append(str(directorio_base))

from src.plotting import plot_pca_3d

# Make sure W&B is disabled when it is not used
os.environ["WANDB_DISABLED"] = "true"

In [4]:
# Set the paths to the data files

directorio_datos = directorio_base / "data"
directorio_modelos = directorio_base / "models"
directorio_modelos_automl = directorio_modelos / "pycaret"
raw_data_dir = directorio_datos / "raw"
processed_data_dir = directorio_datos / "processed"

# Path to the AutoML models
ruta_mejor_modelo_final = directorio_modelos_automl / "modelos_GLP1_no_pca" / "mejor_modelo_final"
# Path to the generative models
ruta_modelo_protxlnet = directorio_modelos / "prot_xlnet_finetuned"

# Data with known activity
ruta_125_ec50 = processed_data_dir / "descriptores_125.csv"

# Data without known activity
ruta_peptidos_eval = processed_data_dir / "descriptores_cdhit.csv"

# Directory for the new sequences
directorio_nuevas_secuencias = processed_data_dir / "secuencias_nuevas"
os.makedirs(directorio_nuevas_secuencias, exist_ok=True)

# Directory for the PeptideBERT models
directorio_modelos_peptidebert = directorio_modelos / "peptideBert"


In [5]:
# Load the processed data
df_125_conocidos = pd.read_csv(ruta_125_ec50)
df_125_conocidos.set_index('ID', inplace=True)
df_125_conocidos.columns = df_125_conocidos.columns.str.replace('.', '_', regex=False)
df_125_conocidos['pEC50'] = -np.log10(df_125_conocidos["EC50_T2"] * 1e-12)

df_glp1 = pd.read_csv(ruta_peptidos_eval)
df_glp1.set_index('ID', inplace=True)
df_glp1.columns = df_glp1.columns.str.replace('.', '_', regex=False)
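The `pEC50` column is the negative base-10 logarithm of the EC50 expressed in mol/L. The `1e-12` factor in the cell above suggests that `EC50_T2` is stored in picomolar; under that assumption, the conversion can be sketched as a standalone function (the function name is illustrative):

```python
import math


def pec50_from_picomolar(ec50_pm: float) -> float:
    """Convert an EC50 given in picomolar to pEC50 = -log10(EC50 in mol/L)."""
    if ec50_pm <= 0:
        raise ValueError("EC50 must be positive")
    return -math.log10(ec50_pm * 1e-12)


# EC50_T2 values taken from the table of known peptides in this notebook.
for ec50 in (563.0, 6.03, 0.96):
    print(ec50, round(pec50_from_picomolar(ec50), 6))
```

Lower EC50 values give higher pEC50 values, which is why the later cells sort by `pEC50` in descending order to find the most active peptides.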

In [6]:
# Inspect the known data
df_125_conocidos.head()

ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,AAC_L,...,NMBroto_BEGF750103_lag1,NMBroto_BEGF750103_lag2,NMBroto_BEGF750103_lag3,NMBroto_BHAR880101_lag1,NMBroto_BHAR880101_lag2,NMBroto_BHAR880101_lag3,sequence,EC50_T2,EC50_LOG_T2,pEC50
seq_pep1,0.033333,0.0,0.1,0.1,0.066667,0.066667,0.033333,0.0,0.033333,0.066667,...,-0.02798,-0.182783,0.054222,0.190428,-0.142437,0.090372,HSQGTFTSDYSKYLDSRRAQDFVQWLEEGE,563.0,-9.25,9.249492
seq_pep2,0.033333,0.0,0.1,0.1,0.066667,0.066667,0.033333,0.0,0.033333,0.066667,...,0.00091,-0.316149,0.170202,0.157133,-0.144228,0.115217,HSQGTFTSDYSKYLDSRRAEDFVQWLENGE,552.0,-9.26,9.258061
seq_pep3,0.034483,0.0,0.103448,0.068966,0.068966,0.034483,0.034483,0.0,0.034483,0.068966,...,-0.004817,-0.250582,0.18155,0.098041,-0.203722,0.127012,HSQGTFTSDYSKYLDSRRAEDFVQWLENT,252.0,-9.6,9.598599
seq_pep4,0.055556,0.0,0.083333,0.027778,0.055556,0.166667,0.027778,0.0,0.027778,0.055556,...,0.22509,-0.097965,0.052838,0.377701,0.150231,0.286987,HSQGTFTSDYSKYLDSRRAEDFVQWLVAGGSGSGSG,6.03,-11.22,11.219683
seq_pep5,0.066667,0.0,0.1,0.066667,0.066667,0.066667,0.033333,0.0,0.033333,0.066667,...,0.088858,-0.190213,0.020097,0.069381,-0.184796,0.222087,HSQGTFTSDYSKYLDSRRAQDFVQWLEAEG,238.0,-9.62,9.623423


In [7]:
# Inspect the data to predict
df_glp1.head()

ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,AAC_L,...,NMBroto_BEGF750102_lag1,NMBroto_BEGF750102_lag2,NMBroto_BEGF750102_lag3,NMBroto_BEGF750103_lag1,NMBroto_BEGF750103_lag2,NMBroto_BEGF750103_lag3,NMBroto_BHAR880101_lag1,NMBroto_BHAR880101_lag2,NMBroto_BHAR880101_lag3,sequence
AF-A0A060VXS0-F1,0.1,0.0,0.066667,0.066667,0.033333,0.066667,0.033333,0.0,0.066667,0.066667,...,0.037142,-0.508484,0.112768,0.224928,-0.142711,0.120597,0.09933,-0.372417,0.041586,HAEGTYTSDMSSYLQDQAAKEFVSWLKNGR
AF-A0A060VY52-F1,0.1,0.0,0.066667,0.066667,0.033333,0.066667,0.033333,0.0,0.066667,0.066667,...,0.037142,-0.508484,0.112768,0.145803,-0.195554,0.086696,0.178218,-0.292699,0.046946,HAEGTYTSDVSSYLQDQAAKEFVSWLKNGR
AF-A0A060WDT4-F1,0.1,0.0,0.133333,0.0,0.033333,0.066667,0.033333,0.0,0.066667,0.1,...,-0.029432,-0.341003,0.010969,-0.055908,-0.413757,-0.005239,0.086834,-0.334601,-0.073197,HADGTYTSDVSTYLQDQAAKDFVSWLKSGL
AF-A0A087VEU7-F1,0.133333,0.0,0.033333,0.1,0.033333,0.1,0.033333,0.066667,0.033333,0.066667,...,0.068401,-0.469261,-0.028003,0.179833,-0.232789,0.257321,0.296206,-0.151547,-0.086574,HAEGTYTSDITSYLEGQAAKEFIAWLVNGR
AF-A0A087XPV4-F1,0.1,0.0,0.133333,0.0,0.066667,0.066667,0.033333,0.033333,0.1,0.066667,...,0.094549,-0.460741,0.040947,0.156853,-0.296064,-0.234672,0.149114,-0.209235,-0.250359,HADGTFTSDVSSYLKDQAIKDFVAQLKSGQ


### Activity prediction for the GLP-1 peptides

In [8]:
# Load the saved model for predictions with PyCaret
from pycaret.regression import load_model, predict_model
modelo_pycaret = load_model(ruta_mejor_modelo_final)
modelo_pycaret

Transformation Pipeline and Model Successfully Loaded


In [9]:
# Predict the activity of the GLP-1 peptides
df_predicciones_glp1 = predict_model(modelo_pycaret, data=df_glp1)
df_predicciones_glp1.rename(columns={'prediction_label': 'pEC50'}, inplace=True)
df_predicciones_glp1.head(10)

ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,AAC_L,...,NMBroto_BEGF750102_lag2,NMBroto_BEGF750102_lag3,NMBroto_BEGF750103_lag1,NMBroto_BEGF750103_lag2,NMBroto_BEGF750103_lag3,NMBroto_BHAR880101_lag1,NMBroto_BHAR880101_lag2,NMBroto_BHAR880101_lag3,sequence,pEC50
AF-A0A060VXS0-F1,0.1,0.0,0.066667,0.066667,0.033333,0.066667,0.033333,0.0,0.066667,0.066667,...,-0.508484,0.112768,0.224928,-0.142711,0.120597,0.09933,-0.372417,0.041586,HAEGTYTSDMSSYLQDQAAKEFVSWLKNGR,8.729368
AF-A0A060VY52-F1,0.1,0.0,0.066667,0.066667,0.033333,0.066667,0.033333,0.0,0.066667,0.066667,...,-0.508484,0.112768,0.145803,-0.195554,0.086696,0.178218,-0.292699,0.046946,HAEGTYTSDVSSYLQDQAAKEFVSWLKNGR,8.988079
AF-A0A060WDT4-F1,0.1,0.0,0.133333,0.0,0.033333,0.066667,0.033333,0.0,0.066667,0.1,...,-0.341003,0.010969,-0.055908,-0.413757,-0.005239,0.086834,-0.334601,-0.073197,HADGTYTSDVSTYLQDQAAKDFVSWLKSGL,9.318927
AF-A0A087VEU7-F1,0.133333,0.0,0.033333,0.1,0.033333,0.1,0.033333,0.066667,0.033333,0.066667,...,-0.469261,-0.028003,0.179833,-0.232789,0.257321,0.296206,-0.151547,-0.086574,HAEGTYTSDITSYLEGQAAKEFIAWLVNGR,9.398033
AF-A0A087XPV4-F1,0.1,0.0,0.133333,0.0,0.066667,0.066667,0.033333,0.033333,0.1,0.066667,...,-0.460741,0.040947,0.156853,-0.296064,-0.234672,0.149114,-0.209235,-0.250359,HADGTFTSDVSSYLKDQAIKDFVAQLKSGQ,9.803035
AF-A0A091DI12-F1,0.133333,0.0,0.033333,0.1,0.066667,0.1,0.033333,0.033333,0.066667,0.066667,...,-0.530788,-0.04763,0.16739,-0.286063,0.189741,0.250639,-0.305108,-0.179249,HAEGTFTSDVSSYLEGQAAKEFIAWLVKGR,10.234846
AF-A0A091N9Y7-F1,0.033333,0.0,0.1,0.033333,0.1,0.033333,0.066667,0.033333,0.133333,0.066667,...,-0.143338,-0.219515,-0.1579,-0.401052,0.154808,0.038722,-0.216945,0.023633,HSEGTFTSDFTRYLDKMKAKDFVHWLINTK,9.780633
AF-A0A091P079-F1,0.034483,0.0,0.057471,0.057471,0.045977,0.045977,0.011494,0.022989,0.08046,0.091954,...,0.151567,0.134447,0.000482,-0.007747,0.045643,0.039176,0.142432,0.085655,MKMKSVYFIAGLLLMIVQGSWQNPLQDTEEKSRSFKASQSEPLDES...,9.102588
AF-A0A0F8AUA0-F1,0.133333,0.0,0.1,0.0,0.066667,0.066667,0.033333,0.033333,0.1,0.066667,...,-0.429825,0.067043,-0.01986,-0.094131,-0.228686,-0.009579,-0.200217,-0.346726,HADGTFTSDVSSYLKQQAIKDFVARLKAGQ,10.139276
AF-A0A0H4A7I9-F1,0.1,0.0,0.066667,0.1,0.1,0.033333,0.033333,0.033333,0.066667,0.066667,...,-0.389142,0.039464,0.1238,-0.018384,0.067122,0.189003,-0.391515,-0.086112,HADGTFTSDVASYLERQTVKAFIKFLQEES,9.557083


In [10]:
# Select the sequences with the highest biological activity as seeds for generating new sequences
df_125_conocidos.sort_values(by='pEC50', ascending=False, inplace=True)
df_predicciones_glp1.sort_values(by='pEC50', ascending=False, inplace=True)

# Combine the top 50 known peptides with the top 50 predictions of at most 60 residues
df_todas_actividades = pd.concat(
    [
        df_125_conocidos.head(50),
        df_predicciones_glp1[df_predicciones_glp1['sequence'].str.len() <= 60].head(50),
    ],
    axis=0,
)
df_todas_actividades.sort_values(by='pEC50', ascending=False, inplace=True)
df_todas_actividades

ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,AAC_L,...,NMBroto_BEGF750103_lag1,NMBroto_BEGF750103_lag2,NMBroto_BEGF750103_lag3,NMBroto_BHAR880101_lag1,NMBroto_BHAR880101_lag2,NMBroto_BHAR880101_lag3,sequence,EC50_T2,EC50_LOG_T2,pEC50
seq_pep117,0.137931,0.0,0.034483,0.103448,0.068966,0.068966,0.034483,0.034483,0.068966,0.068966,...,0.146081,-0.260145,0.256426,0.190963,-0.335632,-0.156167,HAEGTFTSDVSSYLEGQAAKEFIAWLVKR,0.96,-12.02,12.017729
seq_pep26,0.076923,0.0,0.051282,0.128205,0.051282,0.102564,0.000000,0.025641,0.025641,0.102564,...,0.319468,0.005239,-0.012149,0.284983,-0.065844,0.122248,YSEGTFTSDYSKLLEEEAVRDFIEWLLAGGPSSGAPPPS,1.03,-11.99,11.987163
seq_pep7,0.068966,0.0,0.034483,0.137931,0.068966,0.068966,0.000000,0.034483,0.034483,0.137931,...,0.308208,0.194179,0.178745,0.100220,-0.256994,0.033664,YSQGTFTSDYSKYLEEEAVRLFIEWLLAG,1.06,-11.97,11.974694
seq_pep11,0.100000,0.0,0.050000,0.025000,0.050000,0.150000,0.025000,0.000000,0.050000,0.075000,...,0.204633,0.032283,0.272735,0.360519,0.151019,0.076209,HSQGTFTSDYSKYLDSRAAAKFVQWLLNGGPSSGAPPEGG,1.49,-11.83,11.826814
seq_pep93,0.068966,0.0,0.068966,0.068966,0.068966,0.034483,0.034483,0.034483,0.034483,0.068966,...,0.092625,-0.049220,0.099761,0.111584,-0.335907,-0.103850,HSQGTFTSDYSKYLDSRAASEFVQWLISE,1.57,-11.80,11.804100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AF-A0A3Q2E1H1-F1,0.100000,0.0,0.133333,0.033333,0.066667,0.066667,0.033333,0.033333,0.133333,0.066667,...,0.087096,-0.336384,-0.244321,0.170322,-0.207247,-0.244486,HADGTFTSDVSSYLKDQAIKDFVAKLKSGE,,,9.896199
AF-A0A5J5D8T6-F1,0.066667,0.0,0.100000,0.066667,0.066667,0.033333,0.033333,0.033333,0.133333,0.100000,...,-0.117962,-0.290027,0.184839,-0.012187,-0.264375,0.027516,HSEGTFTSDLTRYLDKIKAKDFVEWLASTK,,,9.889078
AF-A0A218UDQ3-F1,0.033333,0.0,0.100000,0.033333,0.100000,0.033333,0.066667,0.033333,0.100000,0.066667,...,-0.145938,-0.413293,0.136696,0.022909,-0.224721,-0.003699,HSEGTFTSDFTRYLDRMKAKDFVHWLINTK,,,9.878462
AF-A0A3B4EEM2-F1,0.133333,0.0,0.133333,0.000000,0.033333,0.066667,0.033333,0.033333,0.066667,0.066667,...,-0.098377,-0.202901,-0.078202,0.147971,-0.185318,-0.172646,HADGTYTSDVSAYLQDQAAKDFITWLKSGQ,,,9.875917
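The seed selection in the cell above combines the most active known peptides with the most active predicted peptides, applying a length limit only to the predicted ones. A dependency-free sketch of the same logic, using `(sequence, pEC50)` tuples instead of DataFrames, might look as follows. The function name is illustrative, and the 87-residue entry is a made-up placeholder that stands in for a long precursor sequence.

```python
from typing import List, Tuple

Record = Tuple[str, float]  # (sequence, pEC50)


def select_seeds(known: List[Record], predicted: List[Record],
                 top_n: int = 50, max_len: int = 60) -> List[Record]:
    """Top-N known peptides plus top-N short predicted peptides, sorted by pEC50."""
    def activity(record: Record) -> float:
        return record[1]

    top_known = sorted(known, key=activity, reverse=True)[:top_n]
    # The length filter is applied before taking the top N, as in the notebook cell.
    short_predicted = [r for r in predicted if len(r[0]) <= max_len]
    top_predicted = sorted(short_predicted, key=activity, reverse=True)[:top_n]
    return sorted(top_known + top_predicted, key=activity, reverse=True)


known = [
    ("HAEGTFTSDVSSYLEGQAAKEFIAWLVKR", 12.017729),
    ("HSQGTFTSDYSKYLDSRRAQDFVQWLEEGE", 9.249492),
]
predicted = [
    ("HAEGTFTSDVSSYLEGQAAKEFIAWLVKGR", 10.234846),
    ("M" * 87, 11.5),  # placeholder long sequence, removed by the length filter
]
seeds = select_seeds(known, predicted, top_n=1, max_len=60)
print(seeds)
```

Filtering by length before truncating matters: if the order were reversed, long sequences could occupy some of the top-N slots and fewer than N predicted seeds would remain.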


### Generating new sequences with a generative model

In [11]:
longitud_maxima = df_todas_actividades['sequence'].str.len().max()
longitud_minima = df_todas_actividades['sequence'].str.len().min()
sequences_base = df_todas_actividades['sequence'].tolist()
print(f"Maximum sequence length: {longitud_maxima}")
print(f"Minimum sequence length: {longitud_minima}")

Maximum sequence length: 41
Minimum sequence length: 24


In [12]:
# Generate new sequences with ProtXLNet
from src.ProtXLNet_generator import generate_peptide_variants, generate_peptide_variants_fast
# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

print(f"Loading model from: {ruta_modelo_protxlnet}")
tokenizer = XLNetTokenizer.from_pretrained(ruta_modelo_protxlnet)
model = XLNetLMHeadModel.from_pretrained(ruta_modelo_protxlnet)
model.to(device)
# Generate the new sequences
print("\nStarting variant generation with the imported function...")

# Call the function
nuevas_variantes = generate_peptide_variants_fast(
    prompt_sequences=sequences_base,
    model=model,
    tokenizer=tokenizer,
    top_k=5,
    num_variants_per_seq=30,  # Generate N variants for each base sequence
    min_length=longitud_minima,
    max_length=longitud_maxima
)



Using device: cuda
Loading model from: d:\source\Proyecto Integrador\glp-1_drug_discovery\models\prot_xlnet_finetuned

Starting variant generation with the imported function...
Generando 3000 variantes en lotes de 32...
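The generation call passes `top_k=5`. The body of `generate_peptide_variants_fast` is not shown in this notebook, so the following is only a generic sketch of top-k sampling, not the project's implementation: at each step, only the k most probable residues are kept and one of them is drawn in proportion to its probability. The probability table is a toy example.

```python
import random
from typing import Dict


def top_k_sample(probabilities: Dict[str, float], k: int, rng: random.Random) -> str:
    """Draw one token among the k most probable, weighted by their probabilities."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)[:k]
    weights = [probabilities[token] for token in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]


# Toy next-residue distribution, for illustration only.
next_residue_probs = {
    "A": 0.30, "E": 0.25, "K": 0.20, "L": 0.10, "S": 0.08, "G": 0.05, "W": 0.02,
}
rng = random.Random(0)
draws = [top_k_sample(next_residue_probs, k=5, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

With k=5, the two least probable residues in the toy table ("G" and "W") can never be drawn. A small k keeps variants close to what the model considers likely, and a larger k allows more diverse substitutions.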


Generando:   0%|          | 0/94 [00:00<?, ?it/s]


This is a friendly reminder - the current text generation call has exceeded the model's predefined maximum length (-1). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

In [13]:
if model is not None:
    del model
if tokenizer is not None:
    del tokenizer
if device.type == 'cuda':
    torch.cuda.empty_cache()

In [16]:
df_secuencias_nuevas = pd.DataFrame(nuevas_variantes, columns=["sequence"])
df_secuencias_nuevas['ID'] = [f"secuencia_{idx}" for idx in range(1, len(df_secuencias_nuevas) + 1)]
df_secuencias_nuevas = df_secuencias_nuevas[['ID', 'sequence']]
df_secuencias_nuevas.head(10)

Unnamed: 0,ID,sequence
0,secuencia_1,HGEGTFTSDVSSYMERQSVDEFIAWLQKGR
1,secuencia_2,HSQGTFTSDMSKYLDEAAASDFVQWLVAGG
2,secuencia_3,HSQGTFTSDYSKYLDSERASEFVQWLVSE
3,secuencia_4,HSEGVFTNDVTRLLEEKATSEFIAWLLKGL
4,secuencia_5,HSECTFTSDYSKYLENKQAKDFVRWLMNAK
5,secuencia_6,HSQGTFTSDPSEYLDSRRASEFVQWLISEY
6,secuencia_7,HAEGTFTSDVSSYLEGQAAKEFIAQLVKGRYY
7,secuencia_8,HSQGTFTSDYHKYLDSEAASDFVQWLVAGG
8,secuencia_9,HGEGGFTSDVSSYMESQLVDEFIAWLLKGR
9,secuencia_10,HSQGTFTSDYSKYLDRAAASDFVQWLVAQG


In [17]:
from datetime import datetime

# Build a timestamp in YearMonthDay_HourMinuteSecond format
File_timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')

In [18]:
# Save the newly generated sequences (df_secuencias_nuevas) to a timestamped CSV file
nombre_archivo_csv = f"secuencias_nuevas_{File_timestamp}.csv"
ruta_salida_secuencias_nuevas = directorio_nuevas_secuencias / nombre_archivo_csv

df_secuencias_nuevas.to_csv(ruta_salida_secuencias_nuevas, index=False)
print(f"File saved: {ruta_salida_secuencias_nuevas}")


File saved: d:\source\Proyecto Integrador\glp-1_drug_discovery\data\processed\secuencias_nuevas\secuencias_nuevas_20251018_202905.csv


In [None]:
File_timestamp = "20251018_195804"
ruta_salida_secuencias_nuevas = Path(directorio_nuevas_secuencias / f"secuencias_nuevas_{File_timestamp}.csv")

In [19]:
df_secuencias_nuevas = pd.read_csv(ruta_salida_secuencias_nuevas)
#df_secuencias_nuevas.reset_index(drop=True, inplace=True)
df_secuencias_nuevas.head()

Unnamed: 0,ID,sequence
0,secuencia_1,HGEGTFTSDVSSYMERQSVDEFIAWLQKGR
1,secuencia_2,HSQGTFTSDMSKYLDEAAASDFVQWLVAGG
2,secuencia_3,HSQGTFTSDYSKYLDSERASEFVQWLVSE
3,secuencia_4,HSEGVFTNDVTRLLEEKATSEFIAWLLKGL
4,secuencia_5,HSECTFTSDYSKYLENKQAKDFVRWLMNAK


## Computing the physicochemical properties of the newly generated sequences

In [20]:
# Save the newly generated variants in FASTA format
from src.bio_utils import save_df_as_fasta, fasta_to_dataframe, inspect_fasta_file

nombre_archivo_fasta = f"secuencias_nuevas_{File_timestamp}.fasta"
ruta_salida_fasta = directorio_nuevas_secuencias / nombre_archivo_fasta

save_df_as_fasta(
    dataframe=df_secuencias_nuevas,
    id_col='ID',
    seq_col='sequence',
    output_file=ruta_salida_fasta
)

results = inspect_fasta_file(ruta_salida_fasta)

if results and results['is_valid']:
    print(f"'{ruta_salida_fasta}' is valid.")
    print(f"Found {results['record_count']} valid records.")
else:
    print(f"\nValidation failed for '{ruta_salida_fasta}'. Please review the records.")

Success! DataFrame has been saved to 'd:\source\Proyecto Integrador\glp-1_drug_discovery\data\processed\secuencias_nuevas\secuencias_nuevas_20251018_202905.fasta'.
Inspecting file: d:\source\Proyecto Integrador\glp-1_drug_discovery\data\processed\secuencias_nuevas\secuencias_nuevas_20251018_202905.fasta...
  - OK! File is structurally valid. Found 3886 records.
'd:\source\Proyecto Integrador\glp-1_drug_discovery\data\processed\secuencias_nuevas\secuencias_nuevas_20251018_202905.fasta' is valid.
Found 3886 valid records.
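The helpers `save_df_as_fasta` and `inspect_fasta_file` live in `src.bio_utils` and their code is not shown here. As a hedged illustration of what such helpers could do, the sketch below writes two-line FASTA records and then checks the file's structure. The function names, the validation rules, and the restriction to the 20 standard residues are assumptions; only the `is_valid` and `record_count` keys mirror the notebook's usage.

```python
import tempfile
from pathlib import Path

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids (assumed rule)


def write_fasta(records, path):
    """Write (identifier, sequence) pairs as two-line FASTA records."""
    with open(path, "w", encoding="utf-8") as handle:
        for record_id, sequence in records:
            handle.write(f">{record_id}\n{sequence}\n")


def inspect_fasta(path):
    """Count records and check that every header is followed by a valid sequence."""
    record_count = 0
    is_valid = True
    has_sequence = False
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if record_count and not has_sequence:
                    is_valid = False  # the previous header had no sequence
                record_count += 1
                has_sequence = False
            else:
                if record_count == 0 or not set(line) <= VALID_RESIDUES:
                    is_valid = False  # sequence before any header, or unknown residue
                has_sequence = True
    if record_count == 0 or not has_sequence:
        is_valid = False
    return {"is_valid": is_valid, "record_count": record_count}


records = [
    ("secuencia_1", "HGEGTFTSDVSSYMERQSVDEFIAWLQKGR"),
    ("secuencia_2", "HSQGTFTSDMSKYLDEAAASDFVQWLVAGG"),
]
with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = Path(tmp_dir) / "example.fasta"
    write_fasta(records, fasta_path)
    report = inspect_fasta(fasta_path)
print(report)
```

Validating the file before passing it to the descriptor calculation is a cheap way to catch empty or malformed records produced by the generative model.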


In [21]:
# Feature computation with iFeature Omega

from src.ifeature_process import *

# Load the settings
ifeatures_settings_json = Path(directorio_datos / 
                               "iFeature Settings" / 
                               "Protein_parameters_setting.json") 
ifeatures_settings_json

WindowsPath('d:/source/Proyecto Integrador/glp-1_drug_discovery/data/iFeature Settings/Protein_parameters_setting.json')

In [22]:
# Define the list of descriptors
descriptores = [
    "AAC",             # Amino acid composition
    "CKSAAGP type 1",  # Composition of k-spaced amino acid group pairs, type 1 (normalized)
    "DPC type 1",      # Dipeptide composition, type 1 (normalized)
    "CTDC",            # Composition/Transition/Distribution: composition
    "CTDT",            # Composition/Transition/Distribution: transition
    "CTDD",            # Composition/Transition/Distribution: distribution
    "CTriad",          # Conjoint triad
    "GAAC",            # Grouped amino acid composition
    "Moran",           # Moran autocorrelation
    "SOCNumber",       # Sequence-order-coupling number
    "QSOrder",         # Quasi-sequence-order descriptors
    "PAAC",            # Pseudo-amino acid composition
    "APAAC",           # Amphiphilic pseudo-amino acid composition
    "NMBroto",         # Normalized Moreau-Broto autocorrelation
]

In [23]:
# Compute the descriptors
df_descriptores_cdhit = compute_peptide_features(ruta_salida_fasta, descriptores, ifeatures_settings_json)

Calculando descriptor: AAC
File imported successfully.
Calculando descriptor: CKSAAGP type 1
File imported successfully.
Calculando descriptor: DPC type 1
File imported successfully.
Calculando descriptor: CTDC
File imported successfully.
Calculando descriptor: CTDT
File imported successfully.
Calculando descriptor: CTDD
File imported successfully.
Calculando descriptor: CTriad
File imported successfully.
Calculando descriptor: GAAC
File imported successfully.
Calculando descriptor: Moran
File imported successfully.
Calculando descriptor: SOCNumber
File imported successfully.
Calculando descriptor: QSOrder
File imported successfully.
Calculando descriptor: PAAC
File imported successfully.
Calculando descriptor: APAAC
File imported successfully.
Calculando descriptor: NMBroto
File imported successfully.
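To give a sense of what the simplest of these descriptors measures, the sketch below computes the amino acid composition (AAC) by hand: the fraction of each of the 20 standard residues in a sequence. It is an independent illustration rather than iFeature's code, and the function name is an assumption; the `AAC_` column prefix follows the tables in this notebook.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in the sequence."""
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return {f"AAC_{aa}": sequence.count(aa) / length for aa in AMINO_ACIDS}


# secuencia_3 from the generated sequences shown earlier (29 residues).
aac = amino_acid_composition("HSQGTFTSDYSKYLDSERASEFVQWLVSE")
print(round(aac["AAC_A"], 6), round(aac["AAC_D"], 6), round(aac["AAC_E"], 6))
```

For this sequence the fractions are 1/29, 2/29, and 3/29 for A, D, and E, which agrees with the `AAC_A`, `AAC_D`, and `AAC_E` values listed for `secuencia_3` in the descriptor table.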


In [24]:
df_descriptores_cdhit

Unnamed: 0,ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,...,NMBroto_BEGF750101.lag3,NMBroto_BEGF750102.lag1,NMBroto_BEGF750102.lag2,NMBroto_BEGF750102.lag3,NMBroto_BEGF750103.lag1,NMBroto_BEGF750103.lag2,NMBroto_BEGF750103.lag3,NMBroto_BHAR880101.lag1,NMBroto_BHAR880101.lag2,NMBroto_BHAR880101.lag3
0,secuencia_1,0.033333,0.000000,0.066667,0.100000,0.066667,0.100000,0.033333,0.033333,0.033333,...,-0.081643,-0.009895,-0.514782,0.134478,-0.184003,0.071976,-0.057906,0.192952,-0.117623,-0.049551
1,secuencia_2,0.133333,0.000000,0.100000,0.033333,0.066667,0.100000,0.033333,0.000000,0.033333,...,-0.036895,0.097693,-0.436367,-0.087864,0.248581,-0.274171,-0.248948,0.011279,-0.403324,-0.089791
2,secuencia_3,0.034483,0.000000,0.068966,0.103448,0.068966,0.034483,0.034483,0.000000,0.034483,...,0.037012,0.202303,-0.406988,-0.053519,-0.059251,0.073899,0.143754,0.102505,-0.160051,0.019291
3,secuencia_4,0.066667,0.000000,0.033333,0.133333,0.066667,0.066667,0.033333,0.033333,0.066667,...,0.297831,0.283548,-0.471178,-0.265712,0.008244,-0.066317,0.179920,0.047736,-0.254851,-0.134250
4,secuencia_5,0.066667,0.033333,0.066667,0.066667,0.066667,0.000000,0.033333,0.000000,0.133333,...,0.170242,0.220029,-0.242840,-0.130685,-0.089329,-0.358443,0.145675,-0.110942,-0.041814,0.106294
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3881,secuencia_3882,0.066667,0.000000,0.100000,0.066667,0.033333,0.033333,0.033333,0.033333,0.133333,...,0.084193,-0.069624,-0.262649,-0.190357,-0.125136,-0.338590,0.177134,0.000873,-0.186753,0.090478
3882,secuencia_3883,0.133333,0.000000,0.033333,0.100000,0.066667,0.100000,0.033333,0.033333,0.100000,...,0.071269,0.089790,-0.530788,-0.047630,0.184253,-0.313379,0.195469,0.242121,-0.322439,-0.183522
3883,secuencia_3884,0.076923,0.000000,0.025641,0.102564,0.051282,0.102564,0.000000,0.025641,0.051282,...,0.103204,0.349309,0.089313,0.097455,0.060270,-0.069555,-0.090891,0.277620,0.050686,0.174199
3884,secuencia_3885,0.102564,0.000000,0.051282,0.000000,0.025641,0.128205,0.025641,0.000000,0.051282,...,0.029828,0.319848,0.087955,0.181520,0.283135,0.183683,0.206869,0.360807,0.177749,-0.063624


In [25]:
# Merge the descriptor DataFrame with the identifying data
df_resultado_desconocidos = pd.merge(
    left=df_descriptores_cdhit,
    right=df_secuencias_nuevas[['ID', 'sequence']],
    on='ID',
    how='inner'
)

df_resultado_desconocidos

Unnamed: 0,ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,...,NMBroto_BEGF750102.lag1,NMBroto_BEGF750102.lag2,NMBroto_BEGF750102.lag3,NMBroto_BEGF750103.lag1,NMBroto_BEGF750103.lag2,NMBroto_BEGF750103.lag3,NMBroto_BHAR880101.lag1,NMBroto_BHAR880101.lag2,NMBroto_BHAR880101.lag3,sequence
0,secuencia_1,0.033333,0.000000,0.066667,0.100000,0.066667,0.100000,0.033333,0.033333,0.033333,...,-0.009895,-0.514782,0.134478,-0.184003,0.071976,-0.057906,0.192952,-0.117623,-0.049551,HGEGTFTSDVSSYMERQSVDEFIAWLQKGR
1,secuencia_2,0.133333,0.000000,0.100000,0.033333,0.066667,0.100000,0.033333,0.000000,0.033333,...,0.097693,-0.436367,-0.087864,0.248581,-0.274171,-0.248948,0.011279,-0.403324,-0.089791,HSQGTFTSDMSKYLDEAAASDFVQWLVAGG
2,secuencia_3,0.034483,0.000000,0.068966,0.103448,0.068966,0.034483,0.034483,0.000000,0.034483,...,0.202303,-0.406988,-0.053519,-0.059251,0.073899,0.143754,0.102505,-0.160051,0.019291,HSQGTFTSDYSKYLDSERASEFVQWLVSE
3,secuencia_4,0.066667,0.000000,0.033333,0.133333,0.066667,0.066667,0.033333,0.033333,0.066667,...,0.283548,-0.471178,-0.265712,0.008244,-0.066317,0.179920,0.047736,-0.254851,-0.134250,HSEGVFTNDVTRLLEEKATSEFIAWLLKGL
4,secuencia_5,0.066667,0.033333,0.066667,0.066667,0.066667,0.000000,0.033333,0.000000,0.133333,...,0.220029,-0.242840,-0.130685,-0.089329,-0.358443,0.145675,-0.110942,-0.041814,0.106294,HSECTFTSDYSKYLENKQAKDFVRWLMNAK
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3881,secuencia_3882,0.066667,0.000000,0.100000,0.066667,0.033333,0.033333,0.033333,0.033333,0.133333,...,-0.069624,-0.262649,-0.190357,-0.125136,-0.338590,0.177134,0.000873,-0.186753,0.090478,HSEGTVTSDLTRYLDKIKAKDFVEWLASTK
3882,secuencia_3883,0.133333,0.000000,0.033333,0.100000,0.066667,0.100000,0.033333,0.033333,0.100000,...,0.089790,-0.530788,-0.047630,0.184253,-0.313379,0.195469,0.242121,-0.322439,-0.183522,HAEGTFTSDVKSYLEGQAAKEFIAWLVKGR
3883,secuencia_3884,0.076923,0.000000,0.025641,0.102564,0.051282,0.102564,0.000000,0.025641,0.051282,...,0.349309,0.089313,0.097455,0.060270,-0.069555,-0.090891,0.277620,0.050686,0.174199,YSEGTFNSDLSILKEKEANREFVNWLLAGGPSSGAPPPS
3884,secuencia_3885,0.102564,0.000000,0.051282,0.000000,0.025641,0.128205,0.025641,0.000000,0.051282,...,0.319848,0.087955,0.181520,0.283135,0.183683,0.206869,0.360807,0.177749,-0.063624,HSQGTFTSDYSKYLDSRKAAATVQWLLNGGPSSGAPPPG


### Predicting the properties of the new sequences


In [26]:
# Predictions
df_resultado_desconocidos.columns = df_resultado_desconocidos.columns.str.replace('.', '_', regex=False)
df_predicciones_nuevos = predict_model(modelo_pycaret, data=df_resultado_desconocidos)
df_predicciones_nuevos.rename(columns={'prediction_label': 'pEC50'}, inplace=True)
df_predicciones_nuevos.sort_values(by='pEC50', ascending=False, inplace=True)
df_predicciones_nuevos.head(10)

Unnamed: 0,ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,...,NMBroto_BEGF750102_lag2,NMBroto_BEGF750102_lag3,NMBroto_BEGF750103_lag1,NMBroto_BEGF750103_lag2,NMBroto_BEGF750103_lag3,NMBroto_BHAR880101_lag1,NMBroto_BHAR880101_lag2,NMBroto_BHAR880101_lag3,sequence,pEC50
1320,secuencia_1321,0.076923,0.025641,0.051282,0.128205,0.051282,0.102564,0.0,0.025641,0.025641,...,-0.169605,0.074522,0.349972,0.049621,0.033465,0.286369,-0.070351,0.090447,YSEGTFTSDYSKLLEEEAVRDFIECLLAGGPSSGAPPPS,11.82017
559,secuencia_560,0.076923,0.0,0.051282,0.102564,0.051282,0.102564,0.0,0.025641,0.025641,...,-0.125669,0.111158,0.173205,-0.144977,-0.033151,0.285003,-0.064386,0.124307,YSEGTFTSDYSKLLEEPAVRDFIEWLLAGGPSSGAPPPS,11.805154
3652,secuencia_3653,0.076923,0.0,0.051282,0.128205,0.076923,0.102564,0.0,0.025641,0.025641,...,-0.270233,-0.088306,0.34743,-0.131722,-0.079127,0.275253,-0.226546,-0.010614,YSEGTFTSDYSKLLEEEAVRDFIEWLLAGGPSSGAFPPS,11.752389
1084,secuencia_1085,0.068966,0.0,0.068966,0.0,0.068966,0.068966,0.034483,0.034483,0.034483,...,-0.299824,-0.089697,0.054295,0.009845,0.042675,0.187532,-0.082585,-0.103402,HSQGTFTSDYSKYLDSRRASAFVQWLISG,11.721878
3614,secuencia_3615,0.076923,0.0,0.051282,0.102564,0.051282,0.102564,0.025641,0.025641,0.025641,...,-0.161775,0.054169,0.267846,-0.047778,-0.058125,0.279102,-0.071885,0.144865,YSEGTFTSDYSKLLHEEAVRDFIEWLLAGGPSSGAPPPS,11.718813
3473,secuencia_3474,0.068966,0.0,0.068966,0.068966,0.068966,0.034483,0.034483,0.034483,0.0,...,-0.412163,-0.162117,0.110239,-0.054948,0.139628,0.136839,-0.360385,-0.041452,HSQGTFTSDYSRYLDSRAASEFVQWLISE,11.715201
3637,secuencia_3638,0.1,0.0,0.066667,0.066667,0.066667,0.1,0.033333,0.0,0.0,...,-0.165916,-0.129738,0.000942,-0.2613,-0.169466,0.086565,-0.132599,-0.010995,HSQGTFTSDYSAYLESERARDFVQWLVAGG,11.712342
3460,secuencia_3461,0.051282,0.0,0.051282,0.076923,0.051282,0.102564,0.0,0.025641,0.051282,...,-0.119083,0.216736,0.203331,-0.027119,-0.016381,0.24382,0.089825,0.12286,YSEGTFTSDYSKLLERQAIDEFVNWRLKGGPSSGAPPPS,11.699745
687,secuencia_688,0.076923,0.0,0.051282,0.128205,0.051282,0.102564,0.0,0.025641,0.025641,...,-0.182822,0.096674,0.290394,-0.024621,-0.003429,0.300871,-0.049527,0.126275,YSEGTFTSDYSKLLEEEAVRDFIEWLLAGGRSSGAPPPS,11.681476
1098,secuencia_1099,0.051282,0.0,0.051282,0.076923,0.051282,0.102564,0.025641,0.025641,0.051282,...,-0.16873,0.234699,0.200251,-0.042708,-0.010343,0.396022,0.02563,0.056006,YSEGTFTSDYSKLLERQAIDEFVNWHLKGGPSSGAPPPS,11.680554
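The prediction cell first replaces `.` with `_` in the column names, because the freshly computed descriptors use names such as `NMBroto_BHAR880101.lag3` while the tables used for prediction use `NMBroto_BHAR880101_lag3`. A small check of this kind, run before calling the model, can make a naming mismatch visible early. The helper names and the `expected` list are hypothetical; the real feature list would come from the trained pipeline.

```python
def normalise_feature_names(columns):
    """Replace dots with underscores, mirroring the notebook's column renaming."""
    return [name.replace(".", "_") for name in columns]


def missing_features(columns, expected):
    """Return the expected feature names that are absent from the columns."""
    present = set(columns)
    return [name for name in expected if name not in present]


raw_columns = ["ID", "AAC_A", "NMBroto_BHAR880101.lag3", "sequence"]
expected = ["AAC_A", "NMBroto_BHAR880101_lag3"]  # hypothetical subset of model features

print(missing_features(raw_columns, expected))
print(missing_features(normalise_feature_names(raw_columns), expected))
```

Before renaming, the autocorrelation feature is reported as missing; after renaming, the list of missing features is empty.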


In [27]:
# Save the predictions DataFrame to CSV
nombre_predicciones_csv = f"predicciones_{File_timestamp}.csv"
ruta_salida_csv = Path(directorio_nuevas_secuencias) / nombre_predicciones_csv

df_predicciones_nuevos.to_csv(ruta_salida_csv, index=False)
print(f"Archivo guardado: {ruta_salida_csv}")


Archivo guardado: d:\source\Proyecto Integrador\glp-1_drug_discovery\data\processed\secuencias_nuevas\predicciones_20251018_202905.csv


In [28]:
df_predicciones_nuevos = pd.read_csv(ruta_salida_csv)
df_predicciones_nuevos.head()

Unnamed: 0,ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,...,NMBroto_BEGF750102_lag2,NMBroto_BEGF750102_lag3,NMBroto_BEGF750103_lag1,NMBroto_BEGF750103_lag2,NMBroto_BEGF750103_lag3,NMBroto_BHAR880101_lag1,NMBroto_BHAR880101_lag2,NMBroto_BHAR880101_lag3,sequence,pEC50
0,secuencia_1321,0.076923,0.025641,0.051282,0.128205,0.051282,0.102564,0.0,0.025641,0.025641,...,-0.169605,0.074522,0.349972,0.049621,0.033465,0.286369,-0.070351,0.090446,YSEGTFTSDYSKLLEEEAVRDFIECLLAGGPSSGAPPPS,11.82017
1,secuencia_560,0.076923,0.0,0.051282,0.102564,0.051282,0.102564,0.0,0.025641,0.025641,...,-0.125669,0.111158,0.173205,-0.144977,-0.033151,0.285003,-0.064386,0.124307,YSEGTFTSDYSKLLEEPAVRDFIEWLLAGGPSSGAPPPS,11.805154
2,secuencia_3653,0.076923,0.0,0.051282,0.128205,0.076923,0.102564,0.0,0.025641,0.025641,...,-0.270233,-0.088306,0.34743,-0.131722,-0.079127,0.275253,-0.226546,-0.010614,YSEGTFTSDYSKLLEEEAVRDFIEWLLAGGPSSGAFPPS,11.752389
3,secuencia_1085,0.068966,0.0,0.068966,0.0,0.068966,0.068966,0.034483,0.034483,0.034483,...,-0.299824,-0.089697,0.054295,0.009845,0.042675,0.187532,-0.082586,-0.103402,HSQGTFTSDYSKYLDSRRASAFVQWLISG,11.721878
4,secuencia_3615,0.076923,0.0,0.051282,0.102564,0.051282,0.102564,0.025641,0.025641,0.025641,...,-0.161775,0.054169,0.267846,-0.047778,-0.058125,0.279102,-0.071885,0.144865,YSEGTFTSDYSKLLHEEAVRDFIEWLLAGGPSSGAPPPS,11.718813
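
The `pEC50` column is on a negative base-10 logarithmic scale, so small differences between rows correspond to sizeable changes in potency. As a rough illustration, and assuming the underlying EC50 values are expressed in mol/L (an assumption; the units are not stated in this section), a predicted value can be converted back to a concentration. The helper name `pec50_to_ec50_pm` is hypothetical.

```python
def pec50_to_ec50_pm(pec50: float) -> float:
    """Convert pEC50 (assumed to be -log10 of EC50 in mol/L) to EC50 in picomolar."""
    ec50_molar = 10.0 ** (-pec50)
    return ec50_molar * 1e12  # 1 mol/L = 1e12 pmol/L


# Highest and lowest pEC50 among the five rows displayed above
for pec50 in (11.82017, 11.718813):
    print(f"pEC50 {pec50:.3f} -> EC50 ~ {pec50_to_ec50_pm(pec50):.2f} pM")
```

Under that assumption, a difference of about 0.1 pEC50 units between the first and fifth rows corresponds to roughly a 1.26-fold difference in predicted EC50.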


### Computing hemolysis, solubility, and non-fouling properties for the newly generated sequences


In [35]:
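import torch
import yaml
from models.peptideBert.network import create_model

# Load a pretrained PeptideBERT classifier for one property ('hemo', 'sol' or 'nf')
def load_bert_model(feature, device):
    model_dir = f'{directorio_modelos_peptidebert}/{feature}'
    with open(f'{model_dir}/config.yaml', 'r') as f:
        config = yaml.load(f, Loader=yaml.FullLoader)
    config['device'] = device
    model = create_model(config)
    # weights_only=False unpickles arbitrary objects: only load trusted checkpoints
    checkpoint = torch.load(f'{model_dir}/model.pt', map_location=device, weights_only=False)
    model.load_state_dict(checkpoint['model_state_dict'], strict=False)
    model.eval()  # switch off dropout so repeated predictions are deterministic
    return model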

In [52]:
# Run PeptideBERT on a list of sequences

def predict_peptidebert(sequences, feats=('hemo', 'sol', 'nf')):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'Usando dispositivo: {device}')
    max_len = max(map(len, sequences))
    # Token id = position in this list (standard amino acids only;
    # any other character raises KeyError)
    vocab = ['[PAD]','[UNK]','[CLS]','[SEP]','[MASK]','L',
        'A','G','V','E','S','I','K','R','D','T','P','N',
        'Q','F','Y','M','H','C','W']
    mapping = {token: idx for idx, token in enumerate(vocab)}

    # Encode into a new list so the caller's sequences are not modified,
    # then right-pad with [PAD] (id 0) up to the longest sequence
    encoded = []
    for seq in sequences:
        ids = [mapping[c] for c in seq]
        ids.extend([0] * (max_len - len(ids)))
        encoded.append(ids)

    results = pd.DataFrame({'Sequence': list(sequences)})
    with torch.inference_mode():
        for c in feats:
            model = load_bert_model(c, device)
            preds = []
            for ids in encoded:
                input_ids = torch.tensor([ids]).to(device)
                attention_mask = (input_ids != 0).float()
                # Keep the raw score; thresholding is applied later when filtering
                output = float(model(input_ids, attention_mask)[0])
                preds.append(output)

            # Release the model before loading the next one
            del model
            if device.type == 'cuda':
                torch.cuda.empty_cache()

            results[c] = preds

    return results
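
The encoding step inside `predict_peptidebert` can be checked in isolation, without loading any model. The sketch below reproduces the same vocabulary and right-padding on two toy sequences and builds the attention mask as plain Python lists. `encode_batch` is a hypothetical helper name, not part of PeptideBERT.

```python
VOCAB = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'L',
         'A', 'G', 'V', 'E', 'S', 'I', 'K', 'R', 'D', 'T', 'P', 'N',
         'Q', 'F', 'Y', 'M', 'H', 'C', 'W']
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}
PAD_ID = TOKEN_TO_ID['[PAD]']


def encode_batch(sequences):
    """Return (input_ids, attention_mask) as lists, padded to the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        ids = [TOKEN_TO_ID[aa] for aa in seq]  # KeyError on non-standard residues
        pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * pad)
        attention_mask.append([1.0] * len(ids) + [0.0] * pad)
    return input_ids, attention_mask


ids, mask = encode_batch(['HSQG', 'YS'])
print(ids)
print(mask)
```

Because no residue maps to id 0, marking real tokens explicitly as done here gives the same mask as comparing the padded ids against 0.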
    

In [53]:
seqs = df_secuencias_nuevas['sequence'].tolist()
bert_results = predict_peptidebert(seqs, ['hemo','sol'])

Usando dispositivo: cuda


In [56]:
# PeptideBERT scores for the generated sequences
bert_results

Unnamed: 0,Sequence,hemo,sol
0,HGEGTFTSDVSSYMERQSVDEFIAWLQKGR,0.059255,0.840228
1,HSQGTFTSDMSKYLDEAAASDFVQWLVAGG,0.063566,0.792326
2,HSQGTFTSDYSKYLDSERASEFVQWLVSE,0.061086,0.841271
3,HSEGVFTNDVTRLLEEKATSEFIAWLLKGL,0.092241,0.778248
4,HSECTFTSDYSKYLENKQAKDFVRWLMNAK,0.048258,0.839455
...,...,...,...
3881,HSEGTVTSDLTRYLDKIKAKDFVEWLASTK,0.070470,0.830053
3882,HAEGTFTSDVKSYLEGQAAKEFIAWLVKGR,0.063577,0.845732
3883,YSEGTFNSDLSILKEKEANREFVNWLLAGGPSSGAPPPS,0.060271,0.690746
3884,HSQGTFTSDYSKYLDSRKAAATVQWLLNGGPSSGAPPPG,0.045773,0.834750


In [57]:
# Merge the PeptideBERT scores with the earlier pEC50 predictions
df_final_bert = pd.merge(df_predicciones_nuevos, bert_results, left_on='sequence', right_on='Sequence', how='inner')
df_final_bert.drop(columns=['Sequence'], inplace=True)
df_final_bert.head()

Unnamed: 0,ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,...,NMBroto_BEGF750103_lag1,NMBroto_BEGF750103_lag2,NMBroto_BEGF750103_lag3,NMBroto_BHAR880101_lag1,NMBroto_BHAR880101_lag2,NMBroto_BHAR880101_lag3,sequence,pEC50,hemo,sol
0,secuencia_1321,0.076923,0.025641,0.051282,0.128205,0.051282,0.102564,0.0,0.025641,0.025641,...,0.349972,0.049621,0.033465,0.286369,-0.070351,0.090446,YSEGTFTSDYSKLLEEEAVRDFIECLLAGGPSSGAPPPS,11.82017,0.072858,0.708222
1,secuencia_560,0.076923,0.0,0.051282,0.102564,0.051282,0.102564,0.0,0.025641,0.025641,...,0.173205,-0.144977,-0.033151,0.285003,-0.064386,0.124307,YSEGTFTSDYSKLLEEPAVRDFIEWLLAGGPSSGAPPPS,11.805154,0.069578,0.685881
2,secuencia_3653,0.076923,0.0,0.051282,0.128205,0.076923,0.102564,0.0,0.025641,0.025641,...,0.34743,-0.131722,-0.079127,0.275253,-0.226546,-0.010614,YSEGTFTSDYSKLLEEEAVRDFIEWLLAGGPSSGAFPPS,11.752389,0.073255,0.657103
3,secuencia_1085,0.068966,0.0,0.068966,0.0,0.068966,0.068966,0.034483,0.034483,0.034483,...,0.054295,0.009845,0.042675,0.187532,-0.082586,-0.103402,HSQGTFTSDYSKYLDSRRASAFVQWLISG,11.721878,0.067476,0.675657
4,secuencia_3615,0.076923,0.0,0.051282,0.102564,0.051282,0.102564,0.025641,0.025641,0.025641,...,0.267846,-0.047778,-0.058125,0.279102,-0.071885,0.144865,YSEGTFTSDYSKLLHEEAVRDFIEWLLAGGPSSGAPPPS,11.718813,0.068158,0.674377
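
An inner merge keyed on the raw sequence string multiplies rows whenever the same sequence appears more than once in the right-hand table. Whether that happens here depends on the generated set, so the following is only a defensive sketch on toy data: it drops duplicate sequences from the scores table and uses the `validate` argument of `pd.merge` so that pandas raises an error if the key is still not unique. The toy frames and their values are made up for illustration.

```python
import pandas as pd

preds = pd.DataFrame({'sequence': ['AAA', 'CCC'], 'pEC50': [11.8, 11.7]})
scores = pd.DataFrame({'Sequence': ['AAA', 'AAA', 'CCC'],
                       'hemo': [0.07, 0.07, 0.06],
                       'sol': [0.70, 0.70, 0.68]})

# A plain inner merge repeats the 'AAA' row once per match on the right
naive = pd.merge(preds, scores, left_on='sequence', right_on='Sequence', how='inner')

# De-duplicate the right side, then let pandas verify the key is unique there
unique_scores = scores.drop_duplicates(subset='Sequence')
safe = pd.merge(preds, unique_scores, left_on='sequence', right_on='Sequence',
                how='inner', validate='many_to_one').drop(columns=['Sequence'])

print(len(naive), len(safe))
```

Comparing `len()` before and after the merge is a cheap additional check that no rows were gained or lost.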


In [62]:
# Save the merged PeptideBERT results
nombre_bert_csv = f"bert_vegf_features_{File_timestamp}.csv"
df_final_bert.to_csv(directorio_nuevas_secuencias/nombre_bert_csv, index=False)


## Filtering the generated sequences by the computed properties

In [63]:
# Keep sequences scored as non-hemolytic (hemo < 0.5) and soluble (sol > 0.5),
# ranked by predicted pEC50 and then by solubility score
df_seleccionadas = (
    df_final_bert[(df_final_bert['hemo'] < 0.5) & (df_final_bert['sol'] > 0.5)]
    .sort_values(by=['pEC50', 'sol'], ascending=False)
)
df_seleccionadas.head()


Unnamed: 0,ID,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,...,NMBroto_BEGF750103_lag1,NMBroto_BEGF750103_lag2,NMBroto_BEGF750103_lag3,NMBroto_BHAR880101_lag1,NMBroto_BHAR880101_lag2,NMBroto_BHAR880101_lag3,sequence,pEC50,hemo,sol
0,secuencia_1321,0.076923,0.025641,0.051282,0.128205,0.051282,0.102564,0.0,0.025641,0.025641,...,0.349972,0.049621,0.033465,0.286369,-0.070351,0.090446,YSEGTFTSDYSKLLEEEAVRDFIECLLAGGPSSGAPPPS,11.82017,0.072858,0.708222
1,secuencia_560,0.076923,0.0,0.051282,0.102564,0.051282,0.102564,0.0,0.025641,0.025641,...,0.173205,-0.144977,-0.033151,0.285003,-0.064386,0.124307,YSEGTFTSDYSKLLEEPAVRDFIEWLLAGGPSSGAPPPS,11.805154,0.069578,0.685881
2,secuencia_3653,0.076923,0.0,0.051282,0.128205,0.076923,0.102564,0.0,0.025641,0.025641,...,0.34743,-0.131722,-0.079127,0.275253,-0.226546,-0.010614,YSEGTFTSDYSKLLEEEAVRDFIEWLLAGGPSSGAFPPS,11.752389,0.073255,0.657103
3,secuencia_1085,0.068966,0.0,0.068966,0.0,0.068966,0.068966,0.034483,0.034483,0.034483,...,0.054295,0.009845,0.042675,0.187532,-0.082586,-0.103402,HSQGTFTSDYSKYLDSRRASAFVQWLISG,11.721878,0.067476,0.675657
4,secuencia_3615,0.076923,0.0,0.051282,0.102564,0.051282,0.102564,0.025641,0.025641,0.025641,...,0.267846,-0.047778,-0.058125,0.279102,-0.071885,0.144865,YSEGTFTSDYSKLLHEEAVRDFIEWLLAGGPSSGAPPPS,11.718813,0.068158,0.674377
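
The 0.5 cut-offs treat the PeptideBERT scores as probabilities, and the number of surviving candidates depends on where those cut-offs are placed. As a small illustration using only the five rows displayed above (not the full set), the same filter-and-rank logic can be written in plain Python and re-run with a stricter solubility threshold. The `select` helper and the alternative thresholds are hypothetical choices for illustration.

```python
# (ID, pEC50, hemo, sol) copied from the five rows displayed above
rows = [
    ('secuencia_1321', 11.82017, 0.072858, 0.708222),
    ('secuencia_560', 11.805154, 0.069578, 0.685881),
    ('secuencia_3653', 11.752389, 0.073255, 0.657103),
    ('secuencia_1085', 11.721878, 0.067476, 0.675657),
    ('secuencia_3615', 11.718813, 0.068158, 0.674377),
]


def select(rows, max_hemo=0.5, min_sol=0.5):
    """Keep rows with hemo < max_hemo and sol > min_sol, ranked by pEC50 then sol."""
    kept = [r for r in rows if r[2] < max_hemo and r[3] > min_sol]
    return sorted(kept, key=lambda r: (r[1], r[3]), reverse=True)


for min_sol in (0.5, 0.68, 0.7):
    ids = [r[0] for r in select(rows, min_sol=min_sol)]
    print(f"sol > {min_sol}: {len(ids)} kept -> {ids}")
```

On these five rows every candidate passes the default filter, while raising the solubility threshold to 0.7 leaves only `secuencia_1321`.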


## Predicting the biological activity (pEC50) of the newly generated sequences with the previously developed model

In [None]:
# cargade

## Visualization of the results

## Conclusions and next steps