# **Library installation**

In [5]:
!pip install dask[dataframe]
!pip install lifelines==0.26.0
!pip install biopython
!pip install padelpy
!pip install padel-pywrapper
!pip install rdkit
!pip install mol2vec
!pip install rdkit-pypi
!pip install propy3
!pip install PyBioMed
!pip install PyProtein

Collecting dask-expr<1.2,>=1.1 (from dask[dataframe])
  Downloading dask_expr-1.1.21-py3-none-any.whl.metadata (2.6 kB)
INFO: pip is looking at multiple versions of dask-expr to determine which version is compatible with other requirements. This could take a while.
  Downloading dask_expr-1.1.20-py3-none-any.whl.metadata (2.6 kB)
  Downloading dask_expr-1.1.19-py3-none-any.whl.metadata (2.6 kB)
  Downloading dask_expr-1.1.18-py3-none-any.whl.metadata (2.6 kB)
  Downloading dask_expr-1.1.16-py3-none-any.whl.metadata (2.5 kB)
Downloading dask_expr-1.1.16-py3-none-any.whl (243 kB)
   243.2/243.2 kB 5.3 MB/s eta 0:00:00
Installing collected packages: dask-expr
Successfully installed dask-expr-1.1.16
Collecting lifelines==0.26.0
  Downloading lifelines-0.26.0-py3-none-any.whl.metadata (4.0 kB)
Collecting autograd-gamma>=0.3 (from lifelines==0.26.0)
  Downloading autograd-gamma-0.5.0.tar.gz (4.0 kB)
  Prep

In [1]:
# libraries for reading data and files
import pandas as pd
import numpy as np
from scipy.stats import boxcox
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [2]:
!mkdir datasets
!mkdir "Plots figures"
!mkdir "Trained models"

# Development stage 2: descriptor search

In [6]:
dfs_info_cov_transformed = pd.read_excel("datasets/dfs_info_cov_transformed.xlsx")
filtered_OBP_info_new = pd.read_excel("datasets/filtered_OBP_info_new.xlsx")
dfs_info_groups_transformed = pd.read_excel("datasets/dfs_info_groups_transformed.xlsx")

In [7]:
compounds_PBPs_GOBPs = dfs_info_cov_transformed['Smiles']
compounds_PBPs_GOBPs

Unnamed: 0,Smiles
0,CC(=O)/C=C/C1=C(C)CCCC1(C)C
1,CC(C)=CCCC(=C)C=C
2,O=Cc1ccccc1
3,CC\C=C/CCO
4,CCC/C=C/C=O
...,...
249,CC(C)CCCCCO
250,CC(O)CCC(C)O
251,CCC(O)C(C)C
252,COCc1ccccc1


In [8]:
sequences_OBPs = filtered_OBP_info_new['AA Sequence W/O signal peptide']
sequences_OBPs

Unnamed: 0,AA Sequence W/O signal peptide
0,DVNVMKDVTLGFGQALDKCRQESDLTEEKMEEFFHFWRDDFKFEHR...
1,TAEVMSHVTAHFGKALEECRDESGLSAEVLEEFQHFWREDFEVVHR...
2,SQEIIKNLSLQFAKPLEDCKKEMDLSDTVITDFYNFWKEGYEFTNR...
3,SQEVVASFSKGFTNVVEHCKAEVNAGEHIMQDIYNFWREEYQLVNR...
4,EIEPSKDAMKYITSGFVKVLEECKQELNMNDRIIADLFHYWKLDYT...
...,...
105,TAEVMSHVTAHFGKALEECREESGLSAEVLEEFQHFWREDFEVVHR...
106,SQDLMVKMTKGFTRVVDDCKTELNVGDHIMQDMYNYWREDYQLINR...
107,SQDVMKQMTLNFAKLVDLCKKELDLPDTISKDFANFWKEGYEISDR...
108,SQDVMKSMTKNFLKAYEVCSKEYNLPENTANELVNFWKEDFTTTNR...


## Descriptor search functions
---
PaDEL-Descriptor is a widely used tool for computing a broad range of molecular descriptors, which characterize the chemical and physicochemical properties of compounds from their structure in SMILES (Simplified Molecular-Input Line-Entry System) format. PaDEL-Descriptor provides 1875 molecular descriptors, which can be grouped into the main categories below:

1. Atom-based descriptors
    - Atom counts: The total number of atoms, as well as the number of specific atoms such as oxygen, nitrogen, etc.
    - Atom distribution: Parameters related to the spatial distribution and connectivity of the atoms in the molecule.
2. Topological descriptors
    - Connectivity indices: Descriptors that account for how the atoms are connected, such as the Randic connectivity index.
    - Distance indices: Describe the distances between atoms, measured along shortest paths in the molecular structure.
    - Balaban and Wiener indices: Represent properties based on the connectivity of the molecular graph.
3. Charge descriptors
    - Partial charge: The distribution of partial charge over specific atoms and the center of electronic charge.
    - Electric dipole: Descriptors related to the dipole moment of the molecule, relevant to its reactivity and behavior in electric fields.
4. General physicochemical descriptors
    - Hydrophobicity (LogP): Measures the lipophilicity of the molecule, important for solubility and permeability.
    - Polar surface area: Related to the ability of the molecule to form hydrogen bonds, which affects absorption and bioavailability.
    - Estimated melting and boiling points: Important factors for stability and physical-state properties.
5. Molecular fragment descriptors
    - Presence of functional groups: Reports the presence or absence of specific functional groups such as esters, amines, ketones, etc.
    - Frequency of certain subfragments: Counts the occurrences of certain structural subfragments, which is useful for similarity analysis.
6. Three-dimensional (3D) descriptors
    - Shape indices: Evaluate the overall shape of the molecule, which influences its interaction with other molecules.
    - Steric properties: Descriptors that represent the space occupied by the molecule, such as molecular volume and molecular surface area.
7. Charge and polarity descriptors
    - Dipole moment: Describes how the charges are distributed in the molecule.
    - Polarizability indices: Indicate how easily a molecule can be polarized by an external electric field.
8. Electronic descriptors
    - Molecular orbital properties: Descriptors related to the HOMO and LUMO frontier orbitals, which are important for chemical reactivity.
    - Electronegativity: Measures the tendency of a molecule to attract electrons.
9. Information-based descriptors
    - Information entropy: Quantifies the amount of structural information in the molecule.
    - Complexity indices: Evaluate the structural complexity of the molecule from its graph representations.
10. Specific cheminformatics descriptors
    - Fingerprints: Binary representations that encode the presence of specific structural patterns, useful for molecular similarity analysis.
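To make the topological category more concrete, the following is a minimal sketch of the Wiener index, defined as the sum of shortest-path distances between all pairs of heavy atoms in the molecular graph. It is not PaDEL-Descriptor's implementation: the adjacency dictionaries are written by hand for two small alkanes, and the function name `wiener_index` is hypothetical.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered atom pairs."""
    total = 0
    for source in adjacency:
        # Breadth-first search yields shortest path lengths in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    # Each pair was counted twice, once from each end.
    return total // 2


# Hydrogen-suppressed graph of n-butane (SMILES "CCCC"): a chain of four carbons.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# Isobutane (SMILES "CC(C)C"): one central carbon bonded to three others.
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane), wiener_index(isobutane))  # prints "10 9"
```

The two isomers have the same atom counts but different Wiener indices, which is the kind of structural information that topological descriptors add on top of simple composition.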

---
The `GetALL()` method of the propy3 library, called on a `GetProDes` object, computes a variety of descriptors from the amino acid sequence of a protein. These descriptors capture structural, physicochemical and sequence-level characteristics of proteins, which are essential for bioinformatics studies and the prediction of biological function. The main categories of descriptors are:

1. Amino acid composition (AAC)
    - The percentage of each of the 20 amino acids in the sequence. This descriptor captures the relative abundance of the different amino acids in the protein.
    - Example: A sequence with a high proportion of leucine may point to specific properties such as hydrophobicity.
2. Dipeptide composition (DPC)
    - The percentage of each possible dipeptide formed by two consecutive amino acids. There are 400 possible combinations, so this descriptor is more detailed than AAC and captures information about the order of the amino acids.
    - Application: Dipeptides can help identify motifs that influence the function or stability of the protein.
3. Amino acid triplet composition
    - Similar to the dipeptide descriptor, this descriptor analyzes amino acid triplets to provide more detailed information about the sequence and its possible functional motifs.
    - Example: The presence of a specific triplet could be related to an enzymatic active site or binding site.
4. Physicochemical properties based on experimental scales
    - Descriptors based on experimental properties of the amino acids, such as:
        - Hydrophobicity: Measures how residues interact with water, using scales such as Kyte-Doolittle.
        - Isoelectric point (pI): The pH at which the protein has zero net charge.
        - Molecular volume: The space occupied by the amino acids in the structure.
        - Charge properties: Measures related to the positive and negative charges of the amino acids.
5. Composition, transition and distribution (CTD) descriptors
    - Composition (C): The percentage of amino acids that have a given property (hydrophobic, polar, etc.).
    - Transition (T): The frequency with which a property changes (for example, from hydrophobic to polar) along the sequence.
    - Distribution (D): The positions along the sequence at which the first, middle and last amino acids with a given property occur.
6. Pseudo amino acid composition (PseAAC) descriptors
    - Descriptors that combine amino acid sequence information with physicochemical properties to capture the structural complexity of proteins.
        - Classic PseAAC: An extension of AAC that incorporates the correlation of amino acid properties along the sequence.
        - Extended PseAAC: Adds information about the spatial context of the protein and its composition, useful in function prediction studies.
7. Autocorrelation descriptors
    - Measure the correlation of a given property along the sequence, based on distances defined by the number of peptide bonds. Common types of autocorrelation include:
        - Property autocorrelation: How a physicochemical property (such as hydrophobicity) varies as a function of the distance between residues.
        - Cross-correlation: The correlation between two different properties along the sequence.
8. Descriptors based on the content of particular amino acids
    - Specific amino acid content: The frequency of certain amino acids that matter for specific functions, such as cysteine (which can form disulfide bridges) or proline (which can induce structural turns).
9. Secondary structure descriptors
    - Secondary structure prediction: The proportion of alpha helices, beta sheets and turns in the predicted secondary structure of the protein.
    - Structural propensity: The tendency of the amino acids to form different secondary structures.
10. Sequence complexity descriptors
    - Measure the complexity of the sequence based on information entropy. A low-complexity sequence may be repetitive, while a high-complexity sequence may have diverse functional implications.
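The first two categories can be illustrated with a short, library-free sketch. It assumes that composition is expressed as a percentage of the sequence length (AAC) or of the number of consecutive pairs (DPC); the function names, the rounding and the toy sequence are hypothetical choices for illustration and are not taken from propy3.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"


def amino_acid_composition(sequence):
    """Percentage of each of the 20 amino acids in the sequence."""
    counts = Counter(sequence)
    return {aa: round(100 * counts[aa] / len(sequence), 3) for aa in AMINO_ACIDS}


def dipeptide_composition(sequence):
    """Percentage of each of the 400 possible consecutive amino acid pairs."""
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = len(sequence) - 1
    return {
        a + b: round(100 * pairs[a + b] / total, 2)
        for a, b in product(AMINO_ACIDS, repeat=2)
    }


toy_sequence = "ACDA"  # hypothetical four-residue sequence
aac = amino_acid_composition(toy_sequence)
dpc = dipeptide_composition(toy_sequence)

print(aac["A"], aac["C"], len(dpc), dpc["AC"])  # prints "50.0 25.0 400 33.33"
```

AAC discards residue order entirely, whereas DPC distinguishes "AC" from "CA", which is why it needs 400 columns instead of 20.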



In [9]:
from padelpy import from_smiles
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors
from propy import PyPro
import numpy as np


def calculate_descriptors_for_smiles(smiles_list):
    # List that collects one descriptor dictionary per SMILES
    descriptors_list = []

    # Iterate over each SMILES in the list
    for smile in smiles_list:
        try:
            # Compute the descriptors with PaDELPy
            descriptors = from_smiles(smile, descriptors=True, fingerprints=False)
            #descriptors["SMILES"] = smile  # Add the SMILES to the descriptor dictionary
            descriptors_list.append(descriptors)
        except Exception as e:
            # A failing SMILES is skipped, so the result can have fewer rows than the input
            print(f"Error computing descriptors for SMILES {smile}: {e}")
            continue

    # Convert the list of descriptor dictionaries to a pandas DataFrame
    df = pd.DataFrame(descriptors_list)

    return df


def calculate_fingerprints_for_smiles(smiles_list):
    # List that collects one fingerprint dictionary per SMILES
    fingerprints_list = []

    # Iterate over each SMILES in the list
    for smile in smiles_list:
        try:
            # Compute the fingerprints with PaDELPy
            fingerprint = from_smiles(smile, fingerprints=True, descriptors=False)
            #fingerprint["SMILES"] = smile  # Add the SMILES to the fingerprint dictionary
            fingerprints_list.append(fingerprint)
        except Exception as e:
            # A failing SMILES is skipped, so the result can have fewer rows than the input
            print(f"Error computing fingerprints for SMILES {smile}: {e}")
            continue

    # Convert the list of fingerprint dictionaries to a pandas DataFrame
    df = pd.DataFrame(fingerprints_list)

    return df, fingerprints_list


def calculate_fingerprints(smiles_list):
    # Dictionary that stores the SMILES and their fingerprints
    fingerprints_dict = {
        "SMILES": [],
        "Fingerprints": []
    }

    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol:
            # Compute the Morgan (circular) fingerprint with radius 2 and 2048 bits
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
            # ConvertToNumpyArray resizes arr to the fingerprint length
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)
            fingerprints_dict["Fingerprints"].append(arr.tolist())
        else:
            fingerprints_dict["Fingerprints"].append([0] * 2048)  # All-zero placeholder for an invalid SMILES

        fingerprints_dict["SMILES"].append(smiles)

    return fingerprints_dict

def calculate_sequence_descriptors(sequence_list):
    # Earlier dictionary-based layout, kept for reference
    # descriptors_dict = {
    #     "Protein_Sequence": [],
    #     "Descriptors": []
    # }

    # Despite its name, descriptors_dict is a list with one descriptor dictionary per sequence
    descriptors_dict = []

    for sequence in sequence_list:
        try:
            des_object = PyPro.GetProDes(sequence)
            descriptors = des_object.GetALL()

            descriptors_dict.append(descriptors)
            #descriptors_dict["Descriptors"].append(descriptors)
            #descriptors_dict["Protein_Sequence"].append(sequence)
        except Exception as e:
            # A failing sequence is skipped, so the result can have fewer rows than the input
            print(f"Error computing descriptors for sequence {sequence}: {e}")
            continue

    return descriptors_dict
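The PaDELPy and propy3 helpers above skip any input that raises an exception. If that happens, the result has fewer rows than the input, and a later `pd.concat(..., axis=1)` with the source table would pair descriptors with the wrong compound or protein. The sketch below shows one possible way to keep positions aligned by storing an empty dictionary for each failure, which `pd.DataFrame` turns into an all-NaN row. `compute_aligned` and `toy_descriptor` are hypothetical names used for illustration; they are not part of the notebook, PaDELPy or propy3.

```python
def compute_aligned(items, compute):
    """Apply `compute` to every item, keeping exactly one result per input position."""
    results, failed = [], []
    for position, item in enumerate(items):
        try:
            results.append(compute(item))
        except Exception as error:
            # An empty dict keeps the row slot, so later rows are not shifted upward.
            results.append({})
            failed.append((position, item, str(error)))
    return results, failed


def toy_descriptor(smiles):
    """Stand-in for a real descriptor call; fails on an empty string."""
    if not smiles:
        raise ValueError("empty SMILES")
    return {"length": len(smiles)}


rows, failures = compute_aligned(["CCO", "", "O=Cc1ccccc1"], toy_descriptor)
print(rows)      # prints "[{'length': 3}, {}, {'length': 11}]"
print(failures)  # prints "[(1, '', 'empty SMILES')]"
```

Returning the list of failures alongside the results also makes it easy to check, before concatenating, whether any input was affected.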

In [10]:
list_sequences = sequences_OBPs.tolist()
list_smiles = compounds_PBPs_GOBPs.tolist()

## Computing VOC descriptors from their SMILES with PaDELPy


In [11]:
all_smiles_df = calculate_descriptors_for_smiles(list_smiles)

In [12]:
all_smiles_df

Unnamed: 0,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,nAromBond,nAtom,nHeavyAtom,nH,...,P1s,P2s,E1s,E2s,E3s,Ts,As,Vs,Ks,Ds
0,0,1.5767999999999986,2.4862982399999956,60.661500000000004,37.01785999999998,0,0,34,14,20,...,,,,,,,,,,
1,0,3.0936,9.570360959999999,48.49419999999999,28.268687999999976,0,0,26,10,16,...,,,,,,,,,,
2,0,-0.19240000000000002,0.037017760000000004,6.2939,17.122757999999997,6,6,14,8,6,...,0.6383662540079179,0.33936245577007623,0.5527969565969456,0.4328184166840602,0.18605774423740984,4.923937853865384,5.780357905664708,11.28028819648454,0.4665930646669912,1.1716731175184156
3,0,-0.5055999999999994,0.25563135999999936,29.317300000000003,19.363515999999994,0,0,19,7,12,...,,,,,,,,,,
4,0,0.187600000000001,0.03519376000000037,29.799500000000002,18.029929999999997,0,0,17,7,10,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,0,-0.4842000000000004,0.2344496400000004,34.7312,26.884273999999973,0,0,27,9,18,...,0.8326806109596323,0.11103344482941373,0.6003090439414169,0.45953765924506623,0.38143441661931043,9.908495941175014,14.292131386708963,29.263015346264734,0.7490209164394486,1.4412811198057935
250,0,-1.4948000000000001,2.2344270400000004,29.5456,21.499101999999986,0,0,22,8,14,...,0.7764040001439427,0.14828148699720184,0.5551115722850664,0.4844067989858877,0.33315655819801987,7.033262534406277,9.139906879705615,19.18981198005572,0.6646060002159142,1.372674929468974
251,0,-0.17450000000000054,0.03045025000000019,29.040599999999998,20.69710199999999,0,0,21,7,14,...,0.6538276657984088,0.24641881321701012,0.5301997584223208,0.5223907844834279,0.41448448277474026,5.5420986170238615,7.706916452030066,15.984838954826504,0.48074149869761323,1.467075025680489
252,0,-0.040199999999999875,0.0016160399999999898,12.027999999999999,21.549929999999986,6,6,19,9,10,...,0.7759000323587145,0.18670434734715882,0.5570812466264286,0.39509863508285953,0.2773294065085866,7.4900062969763255,10.146342689619086,19.91263846153928,0.663850048538072,1.2295092882178746


In [13]:
all_smiles_df.describe()

Unnamed: 0,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,nAromBond,nAtom,nHeavyAtom,nH,...,P1s,P2s,E1s,E2s,E3s,Ts,As,Vs,Ks,Ds
count,254,254.0,254.0,254.0,254.0,254,254,254,254,254,...,254.0,254.0,254.0,254.0,254.0,254.0,254.0,254.0,254.0,254.0
unique,2,214.0,214.0,214.0,132.0,5,5,51,20,26,...,146.0,146.0,146.0,146.0,144.0,146.0,146.0,146.0,146.0,147.0
top,0,-1.7193000000000034,2.9559924900000114,62.1684,28.268687999999976,0,0,26,10,16,...,,,,,,,,,,
freq,245,5.0,5.0,5.0,10.0,211,211,21,37,26,...,106.0,106.0,106.0,106.0,106.0,106.0,106.0,106.0,106.0,106.0


In [14]:
all_smiles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 0 to 253
Columns: 1875 entries, nAcid to Ds
dtypes: object(1875)
memory usage: 3.6+ MB


Converting the object columns to their corresponding data types

In [15]:

# Convert the 'object' columns to numeric types.
# With errors='coerce', values that cannot be parsed (such as empty strings) become NaN
# instead of raising, so no fallback branch is needed.
for col in all_smiles_df.columns:
    if all_smiles_df[col].dtype == 'object':
        all_smiles_df[col] = pd.to_numeric(all_smiles_df[col], errors='coerce')

# Show the resulting column data types
all_smiles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 0 to 253
Columns: 1875 entries, nAcid to Ds
dtypes: float64(1842), int64(33)
memory usage: 3.6 MB


### **Building the compound descriptor dataset**


In [16]:
cov_descriptors = pd.concat([dfs_info_cov_transformed[['Compound name', 'Smiles']], all_smiles_df], axis=1)
cov_descriptors = cov_descriptors.fillna(0)
cov_descriptors

Unnamed: 0,Compound name,Smiles,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,nAromBond,nAtom,...,P1s,P2s,E1s,E2s,E3s,Ts,As,Vs,Ks,Ds
0,ionone (beta),CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,1.5768,2.486298,60.6615,37.017860,0,0,34,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,"beta-myrcene / 7-methyl-3-methylene-1,6-octad...",CC(C)=CCCC(=C)C=C,0,3.0936,9.570361,48.4942,28.268688,0,0,26,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,benzaldehyde,O=Cc1ccccc1,0,-0.1924,0.037018,6.2939,17.122758,6,6,14,...,0.638366,0.339362,0.552797,0.432818,0.186058,4.923938,5.780358,11.280288,0.466593,1.171673
3,Z-3-hexen-1-ol / cis-3-Hexen-1-ol,CC\C=C/CCO,0,-0.5056,0.255631,29.3173,19.363516,0,0,19,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,(E)-2-Hexenal,CCC/C=C/C=O,0,0.1876,0.035194,29.7995,18.029930,0,0,17,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,Isooctanol / isooctyl alcohol,CC(C)CCCCCO,0,-0.4842,0.234450,34.7312,26.884274,0,0,27,...,0.832681,0.111033,0.600309,0.459538,0.381434,9.908496,14.292131,29.263015,0.749021,1.441281
250,"2,5‐Hexanediol",CC(O)CCC(C)O,0,-1.4948,2.234427,29.5456,21.499102,0,0,22,...,0.776404,0.148281,0.555112,0.484407,0.333157,7.033263,9.139907,19.189812,0.664606,1.372675
251,2-methyl-3-pentanol,CCC(O)C(C)C,0,-0.1745,0.030450,29.0406,20.697102,0,0,21,...,0.653828,0.246419,0.530200,0.522391,0.414484,5.542099,7.706916,15.984839,0.480741,1.467075
252,Methyl benzyl ether,COCc1ccccc1,0,-0.0402,0.001616,12.0280,21.549930,6,6,19,...,0.775900,0.186704,0.557081,0.395099,0.277329,7.490006,10.146343,19.912638,0.663850,1.229509


In [17]:
#all_smiles_df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Tesis/Datasets/Imput/all_smiles_df.xlsx')
cov_descriptors.to_excel('datasets/cov_descriptors.xlsx')

### Concatenating functional groups (optional)

In [18]:
dfs_info_groups_transformed2 = dfs_info_groups_transformed.iloc[:, 4:]
dfs_info_groups_transformed2

Unnamed: 0,acetylenic carbon,aldehyde,amide,amino acid,azo nitrogen,azole,diazene,bromine,carbamate,carbamic ester,...,Streptococus pneumoniae,Clostridium difficile,Mycobaterium Tuberculosis,Haemophilus Influenzae,Escherichia coli,Klebisiella pneumoniae,Pseudomonas aeruginosa,COPD,Lung cancer,Asthma
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
250,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
251,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
252,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
cov_descriptors_grpf = pd.concat([dfs_info_cov_transformed[['Compound name', 'Smiles']], dfs_info_groups_transformed2, all_smiles_df], axis=1)
cov_descriptors_grpf = cov_descriptors_grpf.fillna(0)
cov_descriptors_grpf

Unnamed: 0,Compound name,Smiles,acetylenic carbon,aldehyde,amide,amino acid,azo nitrogen,azole,diazene,bromine,...,P1s,P2s,E1s,E2s,E3s,Ts,As,Vs,Ks,Ds
0,ionone (beta),CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,"beta-myrcene / 7-methyl-3-methylene-1,6-octad...",CC(C)=CCCC(=C)C=C,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,benzaldehyde,O=Cc1ccccc1,0,1,0,0,0,0,0,0,...,0.638366,0.339362,0.552797,0.432818,0.186058,4.923938,5.780358,11.280288,0.466593,1.171673
3,Z-3-hexen-1-ol / cis-3-Hexen-1-ol,CC\C=C/CCO,0,0,0,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,(E)-2-Hexenal,CCC/C=C/C=O,0,1,0,0,0,0,0,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,Isooctanol / isooctyl alcohol,CC(C)CCCCCO,0,0,0,0,0,0,0,0,...,0.832681,0.111033,0.600309,0.459538,0.381434,9.908496,14.292131,29.263015,0.749021,1.441281
250,"2,5‐Hexanediol",CC(O)CCC(C)O,0,0,0,0,0,0,0,0,...,0.776404,0.148281,0.555112,0.484407,0.333157,7.033263,9.139907,19.189812,0.664606,1.372675
251,2-methyl-3-pentanol,CCC(O)C(C)C,0,0,0,0,0,0,0,0,...,0.653828,0.246419,0.530200,0.522391,0.414484,5.542099,7.706916,15.984839,0.480741,1.467075
252,Methyl benzyl ether,COCc1ccccc1,0,0,0,0,0,0,0,0,...,0.775900,0.186704,0.557081,0.395099,0.277329,7.490006,10.146343,19.912638,0.663850,1.229509


In [20]:
cov_descriptors_grpf.to_excel('datasets/cov_descriptors_grpf.xlsx')

## Computing OBP descriptors from their amino acid sequences with propy3

In [21]:
all_sequences_df = calculate_sequence_descriptors(list_sequences)

In [22]:
all_sequences_df = pd.DataFrame(all_sequences_df)
all_sequences_df.fillna(0, inplace=True)
all_sequences_df

Unnamed: 0,A,R,N,D,C,E,Q,G,H,I,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,6.897,5.517,2.759,6.897,4.828,11.724,4.138,4.138,7.586,3.448,...,0.031120,0.032513,0.033323,0.030630,0.025502,0.030334,0.032630,0.029757,0.030729,0.031002
1,9.220,3.546,2.837,7.092,4.255,11.348,2.128,4.965,5.674,3.546,...,0.030421,0.033105,0.031694,0.031302,0.026401,0.030630,0.032248,0.029420,0.031807,0.030529
2,8.392,0.699,2.797,9.091,4.196,6.993,4.895,3.497,3.497,4.895,...,0.030708,0.032094,0.032838,0.030650,0.025914,0.030020,0.032498,0.029065,0.030219,0.031733
3,12.676,2.113,3.521,3.521,4.225,13.380,3.521,4.930,4.930,4.225,...,0.031123,0.033101,0.030987,0.031228,0.027111,0.030033,0.033080,0.029036,0.031036,0.030560
4,5.479,4.110,2.740,10.274,4.110,9.589,2.055,4.795,4.110,7.534,...,0.028859,0.031950,0.030161,0.028564,0.028090,0.032414,0.032521,0.029962,0.031094,0.031886
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,9.220,3.546,2.837,5.674,4.255,12.057,2.128,4.255,6.383,3.546,...,0.030173,0.033000,0.031407,0.031059,0.025827,0.031193,0.033089,0.029832,0.031605,0.030435
106,6.383,3.546,2.128,11.348,4.255,7.092,4.255,4.255,4.965,4.965,...,0.031578,0.032051,0.028764,0.029270,0.029657,0.031639,0.033000,0.028432,0.030474,0.031283
107,8.451,0.704,3.521,9.155,4.225,7.042,3.521,4.225,3.521,4.225,...,0.029223,0.031227,0.030380,0.031943,0.026600,0.029306,0.031012,0.030084,0.030320,0.033429
108,8.333,0.694,7.639,6.944,4.167,9.722,2.778,4.167,4.167,2.778,...,0.028995,0.032412,0.029701,0.032108,0.027156,0.030182,0.033364,0.028037,0.029328,0.030488


In [23]:
all_sequences_df.describe()

Unnamed: 0,A,R,N,D,C,E,Q,G,H,I,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
count,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,...,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0,110.0
mean,8.021845,3.136336,3.457055,7.118318,4.327327,10.644582,3.093009,4.685064,4.932027,4.953227,...,0.030081,0.032248,0.03174,0.030563,0.026934,0.030745,0.032033,0.029631,0.030912,0.031655
std,1.745267,1.502265,1.362925,1.66376,0.26737,2.134748,1.265137,1.060232,1.256901,1.622837,...,0.001486,0.001043,0.001526,0.00114,0.001565,0.001225,0.000999,0.000989,0.001148,0.001224
min,3.521,0.69,0.787,3.521,3.497,5.517,0.0,2.759,0.0,1.389,...,0.026583,0.029326,0.028764,0.028186,0.023186,0.027799,0.028782,0.02733,0.028088,0.028221
25%,7.00525,2.113,2.778,6.25,4.196,9.04375,2.2695,4.138,4.167,4.17425,...,0.028896,0.031576,0.030607,0.029639,0.025849,0.02989,0.031265,0.028941,0.030219,0.030779
50%,8.3045,2.847,2.837,7.042,4.225,10.977,2.817,4.3325,4.965,4.861,...,0.030166,0.032334,0.031708,0.030578,0.026694,0.030715,0.032118,0.029565,0.030971,0.031619
75%,9.155,4.196,4.196,8.17125,4.27825,12.057,3.5905,4.93,5.674,5.664,...,0.031155,0.032993,0.033066,0.03151,0.027971,0.031663,0.032716,0.030277,0.031603,0.032401
max,12.676,7.639,7.639,11.348,4.965,13.889,7.639,9.524,8.276,10.49,...,0.033818,0.035233,0.034475,0.032943,0.032618,0.033147,0.034071,0.033125,0.034123,0.034451


In [24]:
all_sequences_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Columns: 1547 entries, A to QSOgrant50
dtypes: float64(1547)
memory usage: 1.3 MB


### **Building the protein descriptor dataset**


In [25]:
pbp_gobp_descriptors = pd.concat([filtered_OBP_info_new[['Binding Protein Name', 'AA Sequence W/O signal peptide']], all_sequences_df], axis=1)
pbp_gobp_descriptors = pbp_gobp_descriptors.fillna(0)
pbp_gobp_descriptors

Unnamed: 0,Binding Protein Name,AA Sequence W/O signal peptide,A,R,N,D,C,E,Q,G,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,AipsGOBP1,DVNVMKDVTLGFGQALDKCRQESDLTEEKMEEFFHFWRDDFKFEHR...,6.897,5.517,2.759,6.897,4.828,11.724,4.138,4.138,...,0.031120,0.032513,0.033323,0.030630,0.025502,0.030334,0.032630,0.029757,0.030729,0.031002
1,AipsGOBP2,TAEVMSHVTAHFGKALEECRDESGLSAEVLEEFQHFWREDFEVVHR...,9.220,3.546,2.837,7.092,4.255,11.348,2.128,4.965,...,0.030421,0.033105,0.031694,0.031302,0.026401,0.030630,0.032248,0.029420,0.031807,0.030529
2,AipsPBP1,SQEIIKNLSLQFAKPLEDCKKEMDLSDTVITDFYNFWKEGYEFTNR...,8.392,0.699,2.797,9.091,4.196,6.993,4.895,3.497,...,0.030708,0.032094,0.032838,0.030650,0.025914,0.030020,0.032498,0.029065,0.030219,0.031733
3,AipsPBP2,SQEVVASFSKGFTNVVEHCKAEVNAGEHIMQDIYNFWREEYQLVNR...,12.676,2.113,3.521,3.521,4.225,13.380,3.521,4.930,...,0.031123,0.033101,0.030987,0.031228,0.027111,0.030033,0.033080,0.029036,0.031036,0.030560
4,AipsPBP3,EIEPSKDAMKYITSGFVKVLEECKQELNMNDRIIADLFHYWKLDYT...,5.479,4.110,2.740,10.274,4.110,9.589,2.055,4.795,...,0.028859,0.031950,0.030161,0.028564,0.028090,0.032414,0.032521,0.029962,0.031094,0.031886
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,SlitGOBP2,TAEVMSHVTAHFGKALEECREESGLSAEVLEEFQHFWREDFEVVHR...,9.220,3.546,2.837,5.674,4.255,12.057,2.128,4.255,...,0.030173,0.033000,0.031407,0.031059,0.025827,0.031193,0.033089,0.029832,0.031605,0.030435
106,SlitPBP1,SQDLMVKMTKGFTRVVDDCKTELNVGDHIMQDMYNYWREDYQLINR...,6.383,3.546,2.128,11.348,4.255,7.092,4.255,4.255,...,0.031578,0.032051,0.028764,0.029270,0.029657,0.031639,0.033000,0.028432,0.030474,0.031283
107,TintPBP1,SQDVMKQMTLNFAKLVDLCKKELDLPDTISKDFANFWKEGYEISDR...,8.451,0.704,3.521,9.155,4.225,7.042,3.521,4.225,...,0.029223,0.031227,0.030380,0.031943,0.026600,0.029306,0.031012,0.030084,0.030320,0.033429
108,TintPBP2,SQDVMKSMTKNFLKAYEVCSKEYNLPENTANELVNFWKEDFTTTNR...,8.333,0.694,7.639,6.944,4.167,9.722,2.778,4.167,...,0.028995,0.032412,0.029701,0.032108,0.027156,0.030182,0.033364,0.028037,0.029328,0.030488


In [26]:
pbp_gobp_descriptors.to_excel('datasets/pbp_gobp_descriptors.xlsx')

### Concatenating cystine count and protein type (optional)

In [27]:
pbp_gobp_descriptors_ct = pd.concat([filtered_OBP_info_new[['Binding Protein Name', 'AA Sequence W/O signal peptide', 'Binding Protein Type','Cystine count', 'Species']], all_sequences_df], axis=1)
pbp_gobp_descriptors_ct = pbp_gobp_descriptors_ct.fillna(0)
pbp_gobp_descriptors_ct

Unnamed: 0,Binding Protein Name,AA Sequence W/O signal peptide,Binding Protein Type,Cystine count,Species,A,R,N,D,C,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,AipsGOBP1,DVNVMKDVTLGFGQALDKCRQESDLTEEKMEEFFHFWRDDFKFEHR...,GOBP1,7,Agrotis ipsilon,6.897,5.517,2.759,6.897,4.828,...,0.031120,0.032513,0.033323,0.030630,0.025502,0.030334,0.032630,0.029757,0.030729,0.031002
1,AipsGOBP2,TAEVMSHVTAHFGKALEECRDESGLSAEVLEEFQHFWREDFEVVHR...,GOBP2,6,Agrotis ipsilon,9.220,3.546,2.837,7.092,4.255,...,0.030421,0.033105,0.031694,0.031302,0.026401,0.030630,0.032248,0.029420,0.031807,0.030529
2,AipsPBP1,SQEIIKNLSLQFAKPLEDCKKEMDLSDTVITDFYNFWKEGYEFTNR...,PBP,6,Agrotis ipsilon,8.392,0.699,2.797,9.091,4.196,...,0.030708,0.032094,0.032838,0.030650,0.025914,0.030020,0.032498,0.029065,0.030219,0.031733
3,AipsPBP2,SQEVVASFSKGFTNVVEHCKAEVNAGEHIMQDIYNFWREEYQLVNR...,PBP,6,Agrotis ipsilon,12.676,2.113,3.521,3.521,4.225,...,0.031123,0.033101,0.030987,0.031228,0.027111,0.030033,0.033080,0.029036,0.031036,0.030560
4,AipsPBP3,EIEPSKDAMKYITSGFVKVLEECKQELNMNDRIIADLFHYWKLDYT...,PBP,6,Agrotis ipsilon,5.479,4.110,2.740,10.274,4.110,...,0.028859,0.031950,0.030161,0.028564,0.028090,0.032414,0.032521,0.029962,0.031094,0.031886
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,SlitGOBP2,TAEVMSHVTAHFGKALEECREESGLSAEVLEEFQHFWREDFEVVHR...,GOBP2,6,Spodoptera litura,9.220,3.546,2.837,5.674,4.255,...,0.030173,0.033000,0.031407,0.031059,0.025827,0.031193,0.033089,0.029832,0.031605,0.030435
106,SlitPBP1,SQDLMVKMTKGFTRVVDDCKTELNVGDHIMQDMYNYWREDYQLINR...,PBP,6,Spodoptera litura,6.383,3.546,2.128,11.348,4.255,...,0.031578,0.032051,0.028764,0.029270,0.029657,0.031639,0.033000,0.028432,0.030474,0.031283
107,TintPBP1,SQDVMKQMTLNFAKLVDLCKKELDLPDTISKDFANFWKEGYEISDR...,PBP,6,Tryporyza intacta,8.451,0.704,3.521,9.155,4.225,...,0.029223,0.031227,0.030380,0.031943,0.026600,0.029306,0.031012,0.030084,0.030320,0.033429
108,TintPBP2,SQDVMKSMTKNFLKAYEVCSKEYNLPENTANELVNFWKEDFTTTNR...,PBP,6,Tryporyza intacta,8.333,0.694,7.639,6.944,4.167,...,0.028995,0.032412,0.029701,0.032108,0.027156,0.030182,0.033364,0.028037,0.029328,0.030488


Converting the binding protein type into dummy (one-hot) columns

In [28]:
# One-hot encode 'Binding Protein Type'; the result is stored in df
df = pd.get_dummies(pbp_gobp_descriptors_ct, columns=['Binding Protein Type'])

In [29]:
df

Unnamed: 0,Binding Protein Name,AA Sequence W/O signal peptide,Cystine count,Species,A,R,N,D,C,E,...,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50,Binding Protein Type_GOBP1,Binding Protein Type_GOBP2,Binding Protein Type_PBP
0,AipsGOBP1,DVNVMKDVTLGFGQALDKCRQESDLTEEKMEEFFHFWRDDFKFEHR...,7,Agrotis ipsilon,6.897,5.517,2.759,6.897,4.828,11.724,...,0.030630,0.025502,0.030334,0.032630,0.029757,0.030729,0.031002,True,False,False
1,AipsGOBP2,TAEVMSHVTAHFGKALEECRDESGLSAEVLEEFQHFWREDFEVVHR...,6,Agrotis ipsilon,9.220,3.546,2.837,7.092,4.255,11.348,...,0.031302,0.026401,0.030630,0.032248,0.029420,0.031807,0.030529,False,True,False
2,AipsPBP1,SQEIIKNLSLQFAKPLEDCKKEMDLSDTVITDFYNFWKEGYEFTNR...,6,Agrotis ipsilon,8.392,0.699,2.797,9.091,4.196,6.993,...,0.030650,0.025914,0.030020,0.032498,0.029065,0.030219,0.031733,False,False,True
3,AipsPBP2,SQEVVASFSKGFTNVVEHCKAEVNAGEHIMQDIYNFWREEYQLVNR...,6,Agrotis ipsilon,12.676,2.113,3.521,3.521,4.225,13.380,...,0.031228,0.027111,0.030033,0.033080,0.029036,0.031036,0.030560,False,False,True
4,AipsPBP3,EIEPSKDAMKYITSGFVKVLEECKQELNMNDRIIADLFHYWKLDYT...,6,Agrotis ipsilon,5.479,4.110,2.740,10.274,4.110,9.589,...,0.028564,0.028090,0.032414,0.032521,0.029962,0.031094,0.031886,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,SlitGOBP2,TAEVMSHVTAHFGKALEECREESGLSAEVLEEFQHFWREDFEVVHR...,6,Spodoptera litura,9.220,3.546,2.837,5.674,4.255,12.057,...,0.031059,0.025827,0.031193,0.033089,0.029832,0.031605,0.030435,False,True,False
106,SlitPBP1,SQDLMVKMTKGFTRVVDDCKTELNVGDHIMQDMYNYWREDYQLINR...,6,Spodoptera litura,6.383,3.546,2.128,11.348,4.255,7.092,...,0.029270,0.029657,0.031639,0.033000,0.028432,0.030474,0.031283,False,False,True
107,TintPBP1,SQDVMKQMTLNFAKLVDLCKKELDLPDTISKDFANFWKEGYEISDR...,6,Tryporyza intacta,8.451,0.704,3.521,9.155,4.225,7.042,...,0.031943,0.026600,0.029306,0.031012,0.030084,0.030320,0.033429,False,False,True
108,TintPBP2,SQDVMKSMTKNFLKAYEVCSKEYNLPENTANELVNFWKEDFTTTNR...,6,Tryporyza intacta,8.333,0.694,7.639,6.944,4.167,9.722,...,0.032108,0.027156,0.030182,0.033364,0.028037,0.029328,0.030488,False,False,True
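The indicator columns above come out as `True`/`False`. If 0/1 integers are preferred from the start, `pd.get_dummies` accepts a `dtype` argument. The following is a minimal sketch on a toy frame (the three rows are illustrative stand-ins, not the real `pbp_gobp_descriptors_ct`):

```python
import pandas as pd

# Toy frame standing in for pbp_gobp_descriptors_ct (illustrative rows only).
toy = pd.DataFrame({
    "Binding Protein Name": ["AipsGOBP1", "AipsGOBP2", "AipsPBP1"],
    "Binding Protein Type": ["GOBP1", "GOBP2", "PBP"],
})

# dtype=int makes the indicator columns 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["Binding Protein Type"], dtype=int)
print(encoded)
```

With this option a later `astype('int')` on the dummy columns becomes unnecessary, although keeping it does no harm.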


Encode Species as integers with LabelEncoder

In [30]:
from sklearn.preprocessing import LabelEncoder

# Replace each species name in df['Species'] with an integer code
label_encoder = LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])

In [31]:
df

Unnamed: 0,Binding Protein Name,AA Sequence W/O signal peptide,Cystine count,Species,A,R,N,D,C,E,...,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50,Binding Protein Type_GOBP1,Binding Protein Type_GOBP2,Binding Protein Type_PBP
0,AipsGOBP1,DVNVMKDVTLGFGQALDKCRQESDLTEEKMEEFFHFWRDDFKFEHR...,7,0,6.897,5.517,2.759,6.897,4.828,11.724,...,0.030630,0.025502,0.030334,0.032630,0.029757,0.030729,0.031002,True,False,False
1,AipsGOBP2,TAEVMSHVTAHFGKALEECRDESGLSAEVLEEFQHFWREDFEVVHR...,6,0,9.220,3.546,2.837,7.092,4.255,11.348,...,0.031302,0.026401,0.030630,0.032248,0.029420,0.031807,0.030529,False,True,False
2,AipsPBP1,SQEIIKNLSLQFAKPLEDCKKEMDLSDTVITDFYNFWKEGYEFTNR...,6,0,8.392,0.699,2.797,9.091,4.196,6.993,...,0.030650,0.025914,0.030020,0.032498,0.029065,0.030219,0.031733,False,False,True
3,AipsPBP2,SQEVVASFSKGFTNVVEHCKAEVNAGEHIMQDIYNFWREEYQLVNR...,6,0,12.676,2.113,3.521,3.521,4.225,13.380,...,0.031228,0.027111,0.030033,0.033080,0.029036,0.031036,0.030560,False,False,True
4,AipsPBP3,EIEPSKDAMKYITSGFVKVLEECKQELNMNDRIIADLFHYWKLDYT...,6,0,5.479,4.110,2.740,10.274,4.110,9.589,...,0.028564,0.028090,0.032414,0.032521,0.029962,0.031094,0.031886,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,SlitGOBP2,TAEVMSHVTAHFGKALEECREESGLSAEVLEEFQHFWREDFEVVHR...,6,29,9.220,3.546,2.837,5.674,4.255,12.057,...,0.031059,0.025827,0.031193,0.033089,0.029832,0.031605,0.030435,False,True,False
106,SlitPBP1,SQDLMVKMTKGFTRVVDDCKTELNVGDHIMQDMYNYWREDYQLINR...,6,29,6.383,3.546,2.128,11.348,4.255,7.092,...,0.029270,0.029657,0.031639,0.033000,0.028432,0.030474,0.031283,False,False,True
107,TintPBP1,SQDVMKQMTLNFAKLVDLCKKELDLPDTISKDFANFWKEGYEISDR...,6,30,8.451,0.704,3.521,9.155,4.225,7.042,...,0.031943,0.026600,0.029306,0.031012,0.030084,0.030320,0.033429,False,False,True
108,TintPBP2,SQDVMKSMTKNFLKAYEVCSKEYNLPENTANELVNFWKEDFTTTNR...,6,30,8.333,0.694,7.639,6.944,4.167,9.722,...,0.032108,0.027156,0.030182,0.033364,0.028037,0.029328,0.030488,False,False,True
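After encoding, the species names are no longer visible in the table, so it helps to keep the code-to-name mapping. With `LabelEncoder` the mapping is stored in `label_encoder.classes_`. The sketch below shows a pandas-only equivalent on a toy series (the four values are illustrative); it assumes, as `LabelEncoder` does, that codes follow the alphabetical order of the names:

```python
import pandas as pd

# Toy species column (illustrative values, not the full dataset).
species = pd.Series([
    "Spodoptera litura",
    "Agrotis ipsilon",
    "Tryporyza intacta",
    "Agrotis ipsilon",
])

# pd.Categorical sorts the distinct names, so codes follow alphabetical order.
cat = pd.Categorical(species)
codes = pd.Series(cat.codes, index=species.index)

# Mapping from integer code back to species name.
code_to_species = dict(enumerate(cat.categories))
print(codes.tolist())
print(code_to_species)
```

Note that integer codes impose an arbitrary order on the species; whether that matters depends on the model that later consumes the column.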


In [32]:
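# Feature block: columns at positions 2 to 1550, starting at 'Cystine count'
df1 = df.iloc[:, 2:1551]
# The three one-hot 'Binding Protein Type_*' columns, converted from bool to 0/1
df2 = df.iloc[:, -3:]
df2 = df2.astype('int')
# Final order: identifier columns, protein-type dummies, then the feature block
df3 = pd.concat([df[['Binding Protein Name', 'AA Sequence W/O signal peptide']], df2, df1], axis=1)
df3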

Unnamed: 0,Binding Protein Name,AA Sequence W/O signal peptide,Binding Protein Type_GOBP1,Binding Protein Type_GOBP2,Binding Protein Type_PBP,Cystine count,Species,A,R,N,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,AipsGOBP1,DVNVMKDVTLGFGQALDKCRQESDLTEEKMEEFFHFWRDDFKFEHR...,1,0,0,7,0,6.897,5.517,2.759,...,0.031120,0.032513,0.033323,0.030630,0.025502,0.030334,0.032630,0.029757,0.030729,0.031002
1,AipsGOBP2,TAEVMSHVTAHFGKALEECRDESGLSAEVLEEFQHFWREDFEVVHR...,0,1,0,6,0,9.220,3.546,2.837,...,0.030421,0.033105,0.031694,0.031302,0.026401,0.030630,0.032248,0.029420,0.031807,0.030529
2,AipsPBP1,SQEIIKNLSLQFAKPLEDCKKEMDLSDTVITDFYNFWKEGYEFTNR...,0,0,1,6,0,8.392,0.699,2.797,...,0.030708,0.032094,0.032838,0.030650,0.025914,0.030020,0.032498,0.029065,0.030219,0.031733
3,AipsPBP2,SQEVVASFSKGFTNVVEHCKAEVNAGEHIMQDIYNFWREEYQLVNR...,0,0,1,6,0,12.676,2.113,3.521,...,0.031123,0.033101,0.030987,0.031228,0.027111,0.030033,0.033080,0.029036,0.031036,0.030560
4,AipsPBP3,EIEPSKDAMKYITSGFVKVLEECKQELNMNDRIIADLFHYWKLDYT...,0,0,1,6,0,5.479,4.110,2.740,...,0.028859,0.031950,0.030161,0.028564,0.028090,0.032414,0.032521,0.029962,0.031094,0.031886
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,SlitGOBP2,TAEVMSHVTAHFGKALEECREESGLSAEVLEEFQHFWREDFEVVHR...,0,1,0,6,29,9.220,3.546,2.837,...,0.030173,0.033000,0.031407,0.031059,0.025827,0.031193,0.033089,0.029832,0.031605,0.030435
106,SlitPBP1,SQDLMVKMTKGFTRVVDDCKTELNVGDHIMQDMYNYWREDYQLINR...,0,0,1,6,29,6.383,3.546,2.128,...,0.031578,0.032051,0.028764,0.029270,0.029657,0.031639,0.033000,0.028432,0.030474,0.031283
107,TintPBP1,SQDVMKQMTLNFAKLVDLCKKELDLPDTISKDFANFWKEGYEISDR...,0,0,1,6,30,8.451,0.704,3.521,...,0.029223,0.031227,0.030380,0.031943,0.026600,0.029306,0.031012,0.030084,0.030320,0.033429
108,TintPBP2,SQDVMKSMTKNFLKAYEVCSKEYNLPENTANELVNFWKEDFTTTNR...,0,0,1,6,30,8.333,0.694,7.639,...,0.028995,0.032412,0.029701,0.032108,0.027156,0.030182,0.033364,0.028037,0.029328,0.030488
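The reordering above relies on fixed column positions, which breaks silently if the number of descriptors changes. A possible alternative, sketched here on a toy frame with made-up values, is to select the column groups by name; the `Binding Protein Type_` prefix is the one produced by `get_dummies` above:

```python
import pandas as pd

# Toy frame with the same kinds of columns as df (values are illustrative).
toy = pd.DataFrame({
    "Binding Protein Name": ["AipsGOBP1", "AipsPBP1"],
    "AA Sequence W/O signal peptide": ["DVNV", "SQEI"],
    "Cystine count": [7, 6],
    "A": [6.897, 8.392],
    "Binding Protein Type_GOBP1": [True, False],
    "Binding Protein Type_PBP": [False, True],
})

id_cols = ["Binding Protein Name", "AA Sequence W/O signal peptide"]
dummy_cols = [c for c in toy.columns if c.startswith("Binding Protein Type_")]
feature_cols = [c for c in toy.columns if c not in id_cols + dummy_cols]

# Identifiers first, then the dummies as 0/1 integers, then all other features.
reordered = toy[id_cols + dummy_cols + feature_cols].astype({c: int for c in dummy_cols})
print(reordered.columns.tolist())
```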


In [33]:
#all_sequences_df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Tesis/Datasets/Imput/all_sequences_df.xlsx')

In [34]:
df3.to_excel('datasets/proteins_descriptors_grpf.xlsx')
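
By default `to_excel` also writes the row index as an extra first column, and reading the file back then yields a column named `Unnamed: 0` unless `index=False` is passed on export or `index_col=0` on import. The sketch below illustrates the effect with an in-memory CSV round trip (used here only to stay self-contained; `to_excel`/`read_excel` behave analogously). The toy values are illustrative:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Binding Protein Name": ["AipsGOBP1", "AipsPBP1"],
    "Cystine count": [7, 6],
})

# Default export keeps the index; on re-import it becomes 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False drops it, so the columns survive the round trip unchanged.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Whether the extra column matters depends on how the exported file is consumed later.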

# Stage 3 of development: Building the Dataset.

## Unpivoting the dataset df_obps_covs_info_filtrered

Dropping the Smiles column

In [35]:
dfs_info_cov_transformed2 = dfs_info_cov_transformed.iloc[:, 4:]
dfs_info_cov_transformed3 = pd.concat([dfs_info_cov_transformed[['Compound name']], dfs_info_cov_transformed2], axis=1 )
dfs_info_cov_transformed3

Unnamed: 0,Compound name,AipsGOBP1,AipsGOBP2,AipsPBP1,AipsPBP2,AipsPBP3,AlepGOBP1,AlepGOBP2,AlepGOBP2 F118A,AlepGOBP2 F12A,...,SexiPBP3,SinfPBP1,SinfPBP2,SinfPBP3,SlitGOBP1,SlitGOBP2,SlitPBP1,TintPBP1,TintPBP2,TintPBP3
0,ionone (beta),10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,9.66,10000.0,10000.0,10000.0,10000.0,10000.0,10000.00,10000.0,10000.0,10000.0
1,"beta-myrcene / 7-methyl-3-methylene-1,6-octad...",20.1,20.1,10000.0,10000.0,10000.0,20.0,20.0,10000.0,10000.0,...,10000.00,10000.0,10000.0,10000.0,10000.0,10000.0,10000.00,10000.0,10000.0,10000.0
2,benzaldehyde,30.0,30.0,10000.0,10000.0,10000.0,20.0,20.0,10000.0,10000.0,...,20.00,10000.0,10000.0,10000.0,10000.0,10000.0,13.13,10000.0,10000.0,10000.0
3,Z-3-hexen-1-ol / cis-3-Hexen-1-ol,13.3,4.3,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,40.00,10000.0,10000.0,10000.0,10000.0,10000.0,10000.00,10000.0,10000.0,10000.0
4,(E)-2-Hexenal,30.0,24.3,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,40.00,10000.0,10000.0,10000.0,10000.0,10000.0,10000.00,10000.0,10000.0,10000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,Isooctanol / isooctyl alcohol,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,10000.00,10000.0,10000.0,10000.0,10000.0,10000.0,10000.00,10000.0,10000.0,10000.0
250,"2,5‐Hexanediol",10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,10000.00,10000.0,10000.0,10000.0,10000.0,10000.0,10000.00,10000.0,10000.0,10000.0
251,2-methyl-3-pentanol,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,10000.00,10000.0,10000.0,10000.0,10000.0,10000.0,10000.00,10000.0,10000.0,10000.0
252,Methyl benzyl ether,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,10000.00,10000.0,10000.0,10000.0,10000.0,10000.0,10000.00,10000.0,10000.0,10000.0


In [36]:
# Working copy of the compound-by-protein affinity table
VOC_Ki = dfs_info_cov_transformed3.copy()

# List that collects one [compound, protein, affinity] record per retained value
data = []

# Iterate over the rows of VOC_Ki
for index, row in VOC_Ki.iterrows():
    compound_name = row['Compound name']
    #smiles = row['Smiles']
    for column in VOC_Ki.columns[1:]:  # Every column after 'Compound name'
        affinity = row[column]
        if affinity != 10000:
            data.append([compound_name, column, affinity]) # keep only values other than the 10000 placeholder

# Build a DataFrame from the collected records
new_df = pd.DataFrame(data, columns=['Compound name', 'protein_name', 'affinity'])

# Display the new DataFrame
new_df

Unnamed: 0,Compound name,protein_name,affinity
0,ionone (beta),CmedPBP4,7.13
1,ionone (beta),CpunPBP2,10.06
2,ionone (beta),CpunPBP5,9.85
3,ionone (beta),CsinGOBP1,12.93
4,ionone (beta),CsinGOBP2,30.00
...,...,...,...
1454,2-methyl-3-pentanol,CsinGOBP1,30.00
1455,2-methyl-3-pentanol,CsinGOBP2,9.57
1456,Methyl benzyl ether,CsinGOBP1,24.11
1457,Methyl benzyl ether,CsinGOBP2,30.00
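
The nested loop above turns the wide compound-by-protein table into one row per (compound, protein) pair. The same reshaping can be sketched with `DataFrame.melt`, which avoids `iterrows`. The toy table and the constant name `PLACEHOLDER` are assumptions made for illustration; 10000 is the value filtered out by the loop:

```python
import pandas as pd

PLACEHOLDER = 10000  # value excluded by the loop above

# Toy wide table: one row per compound, one column per protein.
wide = pd.DataFrame({
    "Compound name": ["ionone (beta)", "benzaldehyde"],
    "AipsGOBP1": [10000.0, 30.0],
    "SexiPBP3": [9.66, 20.0],
})

# Wide -> long: every protein column becomes a (protein_name, affinity) pair.
long_df = wide.melt(id_vars="Compound name",
                    var_name="protein_name",
                    value_name="affinity")

# Drop the placeholder entries and renumber the rows.
long_df = long_df[long_df["affinity"] != PLACEHOLDER].reset_index(drop=True)
print(long_df)
```

Unlike the loop, `melt` walks the table column by column, so the rows come out grouped by protein rather than by compound; the set of retained pairs is the same.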


## Compound descriptors

In [37]:
cov_descriptors1 = cov_descriptors.copy()
cov_descriptors1

Unnamed: 0,Compound name,Smiles,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,nAromBond,nAtom,...,P1s,P2s,E1s,E2s,E3s,Ts,As,Vs,Ks,Ds
0,ionone (beta),CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,1.5768,2.486298,60.6615,37.017860,0,0,34,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,"beta-myrcene / 7-methyl-3-methylene-1,6-octad...",CC(C)=CCCC(=C)C=C,0,3.0936,9.570361,48.4942,28.268688,0,0,26,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,benzaldehyde,O=Cc1ccccc1,0,-0.1924,0.037018,6.2939,17.122758,6,6,14,...,0.638366,0.339362,0.552797,0.432818,0.186058,4.923938,5.780358,11.280288,0.466593,1.171673
3,Z-3-hexen-1-ol / cis-3-Hexen-1-ol,CC\C=C/CCO,0,-0.5056,0.255631,29.3173,19.363516,0,0,19,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,(E)-2-Hexenal,CCC/C=C/C=O,0,0.1876,0.035194,29.7995,18.029930,0,0,17,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249,Isooctanol / isooctyl alcohol,CC(C)CCCCCO,0,-0.4842,0.234450,34.7312,26.884274,0,0,27,...,0.832681,0.111033,0.600309,0.459538,0.381434,9.908496,14.292131,29.263015,0.749021,1.441281
250,"2,5‐Hexanediol",CC(O)CCC(C)O,0,-1.4948,2.234427,29.5456,21.499102,0,0,22,...,0.776404,0.148281,0.555112,0.484407,0.333157,7.033263,9.139907,19.189812,0.664606,1.372675
251,2-methyl-3-pentanol,CCC(O)C(C)C,0,-0.1745,0.030450,29.0406,20.697102,0,0,21,...,0.653828,0.246419,0.530200,0.522391,0.414484,5.542099,7.706916,15.984839,0.480741,1.467075
252,Methyl benzyl ether,COCc1ccccc1,0,-0.0402,0.001616,12.0280,21.549930,6,6,19,...,0.775900,0.186704,0.557081,0.395099,0.277329,7.490006,10.146343,19.912638,0.663850,1.229509


## Protein descriptors

In [38]:
pbp_gobp_descriptors1 = pbp_gobp_descriptors.copy()
pbp_gobp_descriptors1

Unnamed: 0,Binding Protein Name,AA Sequence W/O signal peptide,A,R,N,D,C,E,Q,G,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,AipsGOBP1,DVNVMKDVTLGFGQALDKCRQESDLTEEKMEEFFHFWRDDFKFEHR...,6.897,5.517,2.759,6.897,4.828,11.724,4.138,4.138,...,0.031120,0.032513,0.033323,0.030630,0.025502,0.030334,0.032630,0.029757,0.030729,0.031002
1,AipsGOBP2,TAEVMSHVTAHFGKALEECRDESGLSAEVLEEFQHFWREDFEVVHR...,9.220,3.546,2.837,7.092,4.255,11.348,2.128,4.965,...,0.030421,0.033105,0.031694,0.031302,0.026401,0.030630,0.032248,0.029420,0.031807,0.030529
2,AipsPBP1,SQEIIKNLSLQFAKPLEDCKKEMDLSDTVITDFYNFWKEGYEFTNR...,8.392,0.699,2.797,9.091,4.196,6.993,4.895,3.497,...,0.030708,0.032094,0.032838,0.030650,0.025914,0.030020,0.032498,0.029065,0.030219,0.031733
3,AipsPBP2,SQEVVASFSKGFTNVVEHCKAEVNAGEHIMQDIYNFWREEYQLVNR...,12.676,2.113,3.521,3.521,4.225,13.380,3.521,4.930,...,0.031123,0.033101,0.030987,0.031228,0.027111,0.030033,0.033080,0.029036,0.031036,0.030560
4,AipsPBP3,EIEPSKDAMKYITSGFVKVLEECKQELNMNDRIIADLFHYWKLDYT...,5.479,4.110,2.740,10.274,4.110,9.589,2.055,4.795,...,0.028859,0.031950,0.030161,0.028564,0.028090,0.032414,0.032521,0.029962,0.031094,0.031886
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,SlitGOBP2,TAEVMSHVTAHFGKALEECREESGLSAEVLEEFQHFWREDFEVVHR...,9.220,3.546,2.837,5.674,4.255,12.057,2.128,4.255,...,0.030173,0.033000,0.031407,0.031059,0.025827,0.031193,0.033089,0.029832,0.031605,0.030435
106,SlitPBP1,SQDLMVKMTKGFTRVVDDCKTELNVGDHIMQDMYNYWREDYQLINR...,6.383,3.546,2.128,11.348,4.255,7.092,4.255,4.255,...,0.031578,0.032051,0.028764,0.029270,0.029657,0.031639,0.033000,0.028432,0.030474,0.031283
107,TintPBP1,SQDVMKQMTLNFAKLVDLCKKELDLPDTISKDFANFWKEGYEISDR...,8.451,0.704,3.521,9.155,4.225,7.042,3.521,4.225,...,0.029223,0.031227,0.030380,0.031943,0.026600,0.029306,0.031012,0.030084,0.030320,0.033429
108,TintPBP2,SQDVMKSMTKNFLKAYEVCSKEYNLPENTANELVNFWKEDFTTTNR...,8.333,0.694,7.639,6.944,4.167,9.722,2.778,4.167,...,0.028995,0.032412,0.029701,0.032108,0.027156,0.030182,0.033364,0.028037,0.029328,0.030488


## Merging compound and protein descriptors

In [39]:
merge_df = pd.merge(new_df, cov_descriptors1, on='Compound name', how='left')
merge_df = pd.merge(merge_df, pbp_gobp_descriptors1, left_on='protein_name', right_on='Binding Protein Name', how='left')
merge_df = merge_df.drop(columns=['Binding Protein Name'])

In [40]:
merge_df

Unnamed: 0,Compound name,protein_name,affinity,Smiles,nAcid,ALogP,ALogp2,AMR,apol,naAromAtom,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,ionone (beta),CmedPBP4,7.13,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,1.5768,2.486298,60.6615,37.017860,0,...,0.029493,0.031452,0.030593,0.032004,0.026301,0.031368,0.032444,0.027974,0.029908,0.034341
1,ionone (beta),CpunPBP2,10.06,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,1.5768,2.486298,60.6615,37.017860,0,...,0.029432,0.031559,0.031934,0.032102,0.026964,0.029060,0.033730,0.031201,0.029511,0.032615
2,ionone (beta),CpunPBP5,9.85,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,1.5768,2.486298,60.6615,37.017860,0,...,0.031201,0.030126,0.031509,0.030460,0.027874,0.028269,0.031272,0.030974,0.032708,0.031396
3,ionone (beta),CsinGOBP1,12.93,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,1.5768,2.486298,60.6615,37.017860,0,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
4,ionone (beta),CsinGOBP2,30.00,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,1.5768,2.486298,60.6615,37.017860,0,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2-methyl-3-pentanol,CsinGOBP1,30.00,CCC(O)C(C)C,0,-0.1745,0.030450,29.0406,20.697102,0,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1455,2-methyl-3-pentanol,CsinGOBP2,9.57,CCC(O)C(C)C,0,-0.1745,0.030450,29.0406,20.697102,0,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1456,Methyl benzyl ether,CsinGOBP1,24.11,COCc1ccccc1,0,-0.0402,0.001616,12.0280,21.549930,6,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1457,Methyl benzyl ether,CsinGOBP2,30.00,COCc1ccccc1,0,-0.0402,0.001616,12.0280,21.549930,6,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
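
A left merge keeps every interaction row, but a compound or protein name that has no match in the descriptor table silently receives NaN descriptors, and duplicated names in the descriptor table would multiply rows. A minimal sketch of how both cases could be detected with the `validate` and `indicator` options of `merge` follows; the toy frames and the name "unknown compound" are made up for illustration:

```python
import pandas as pd

# Toy interaction table; "unknown compound" has no descriptor row on purpose.
pairs = pd.DataFrame({
    "Compound name": ["benzaldehyde", "benzaldehyde", "unknown compound"],
    "protein_name": ["AipsGOBP1", "SexiPBP3", "AipsGOBP1"],
    "affinity": [30.0, 20.0, 12.5],
})
compound_desc = pd.DataFrame({
    "Compound name": ["benzaldehyde"],
    "nAtom": [14],
})

# validate raises if the descriptor table has duplicate keys;
# indicator adds a '_merge' column telling where each row found a match.
merged = pairs.merge(compound_desc, on="Compound name", how="left",
                     validate="many_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "Compound name"].unique().tolist()
print(unmatched)
```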


Reordering the columns

In [41]:

# Reorder the DataFrame columns
cols = merge_df.columns.tolist()
smiles_index = cols.index('Smiles')
sequence_index = cols.index('AA Sequence W/O signal peptide')
affinity_index = cols.index('affinity')

# Move 'AA Sequence W/O signal peptide' next to 'Smiles'. 'affinity' then lands
# right after the sequence: popping it from before 'Smiles' shifts the later columns left by one
cols.insert(smiles_index + 1, cols.pop(sequence_index))
cols.insert(smiles_index + 1, cols.pop(affinity_index))

# Rebuild the DataFrame with the reordered columns
merge_df = merge_df[cols]

merge_df['affinity'] = merge_df['affinity'].astype(float)

merge_df



Unnamed: 0,Compound name,protein_name,Smiles,AA Sequence W/O signal peptide,affinity,nAcid,ALogP,ALogp2,AMR,apol,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,ionone (beta),CmedPBP4,CC(=O)/C=C/C1=C(C)CCCC1(C)C,MEVEMLPEGMKQLTGGFIKVFEACKTELGLKDGMLTDMYHLWREEY...,7.13,0,1.5768,2.486298,60.6615,37.017860,...,0.029493,0.031452,0.030593,0.032004,0.026301,0.031368,0.032444,0.027974,0.029908,0.034341
1,ionone (beta),CpunPBP2,CC(=O)/C=C/C1=C(C)CCCC1(C)C,MMKDMTKNFLKAYGECQQELHLTDDTARDLMFFWKEDYEVTSREAG...,10.06,0,1.5768,2.486298,60.6615,37.017860,...,0.029432,0.031559,0.031934,0.032102,0.026964,0.029060,0.033730,0.031201,0.029511,0.032615
2,ionone (beta),CpunPBP5,CC(=O)/C=C/C1=C(C)CCCC1(C)C,SQEVMKKMSATFFKLLEECKKELSVTDDMIQGLVRFWLEDSALGER...,9.85,0,1.5768,2.486298,60.6615,37.017860,...,0.031201,0.030126,0.031509,0.030460,0.027874,0.028269,0.031272,0.030974,0.032708,0.031396
3,ionone (beta),CsinGOBP1,CC(=O)/C=C/C1=C(C)CCCC1(C)C,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,12.93,0,1.5768,2.486298,60.6615,37.017860,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
4,ionone (beta),CsinGOBP2,CC(=O)/C=C/C1=C(C)CCCC1(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.00,0,1.5768,2.486298,60.6615,37.017860,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2-methyl-3-pentanol,CsinGOBP1,CCC(O)C(C)C,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,30.00,0,-0.1745,0.030450,29.0406,20.697102,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1455,2-methyl-3-pentanol,CsinGOBP2,CCC(O)C(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,9.57,0,-0.1745,0.030450,29.0406,20.697102,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1456,Methyl benzyl ether,CsinGOBP1,COCc1ccccc1,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,24.11,0,-0.0402,0.001616,12.0280,21.549930,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1457,Methyl benzyl ether,CsinGOBP2,COCc1ccccc1,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.00,0,-0.0402,0.001616,12.0280,21.549930,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829


In [42]:
merge_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Columns: 3427 entries, Compound name to QSOgrant50
dtypes: float64(3390), int64(33), object(4)
memory usage: 38.1+ MB


In [43]:
merge_df.to_excel('datasets/dataset_cov_obps_4.xlsx', index=False)

## Merging compound functional groups and extra protein data into the dataset (optional)

In [44]:
merge_df1 = pd.merge(new_df, cov_descriptors_grpf, on='Compound name', how='left')
merge_df1 = pd.merge(merge_df1, df3, left_on='protein_name', right_on='Binding Protein Name', how='left')
merge_df1 = merge_df1.drop(columns=['Binding Protein Name'])

In [45]:
merge_df1

Unnamed: 0,Compound name,protein_name,affinity,Smiles,acetylenic carbon,aldehyde,amide,amino acid,azo nitrogen,azole,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,ionone (beta),CmedPBP4,7.13,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,0,0,0,0,0,...,0.029493,0.031452,0.030593,0.032004,0.026301,0.031368,0.032444,0.027974,0.029908,0.034341
1,ionone (beta),CpunPBP2,10.06,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,0,0,0,0,0,...,0.029432,0.031559,0.031934,0.032102,0.026964,0.029060,0.033730,0.031201,0.029511,0.032615
2,ionone (beta),CpunPBP5,9.85,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,0,0,0,0,0,...,0.031201,0.030126,0.031509,0.030460,0.027874,0.028269,0.031272,0.030974,0.032708,0.031396
3,ionone (beta),CsinGOBP1,12.93,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,0,0,0,0,0,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
4,ionone (beta),CsinGOBP2,30.00,CC(=O)/C=C/C1=C(C)CCCC1(C)C,0,0,0,0,0,0,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2-methyl-3-pentanol,CsinGOBP1,30.00,CCC(O)C(C)C,0,0,0,0,0,0,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1455,2-methyl-3-pentanol,CsinGOBP2,9.57,CCC(O)C(C)C,0,0,0,0,0,0,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1456,Methyl benzyl ether,CsinGOBP1,24.11,COCc1ccccc1,0,0,0,0,0,0,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1457,Methyl benzyl ether,CsinGOBP2,30.00,COCc1ccccc1,0,0,0,0,0,0,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829


Reordering the columns

In [46]:

# Reorder the DataFrame columns
cols = merge_df1.columns.tolist()
smiles_index = cols.index('Smiles')
sequence_index = cols.index('AA Sequence W/O signal peptide')
affinity_index = cols.index('affinity')

# Move 'AA Sequence W/O signal peptide' next to 'Smiles'. 'affinity' then lands
# right after the sequence: popping it from before 'Smiles' shifts the later columns left by one
cols.insert(smiles_index + 1, cols.pop(sequence_index))
cols.insert(smiles_index + 1, cols.pop(affinity_index))

# Rebuild the DataFrame with the reordered columns
merge_df1 = merge_df1[cols]

merge_df1['affinity'] = merge_df1['affinity'].astype(float)

merge_df1


Unnamed: 0,Compound name,protein_name,Smiles,AA Sequence W/O signal peptide,affinity,acetylenic carbon,aldehyde,amide,amino acid,azo nitrogen,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,ionone (beta),CmedPBP4,CC(=O)/C=C/C1=C(C)CCCC1(C)C,MEVEMLPEGMKQLTGGFIKVFEACKTELGLKDGMLTDMYHLWREEY...,7.13,0,0,0,0,0,...,0.029493,0.031452,0.030593,0.032004,0.026301,0.031368,0.032444,0.027974,0.029908,0.034341
1,ionone (beta),CpunPBP2,CC(=O)/C=C/C1=C(C)CCCC1(C)C,MMKDMTKNFLKAYGECQQELHLTDDTARDLMFFWKEDYEVTSREAG...,10.06,0,0,0,0,0,...,0.029432,0.031559,0.031934,0.032102,0.026964,0.029060,0.033730,0.031201,0.029511,0.032615
2,ionone (beta),CpunPBP5,CC(=O)/C=C/C1=C(C)CCCC1(C)C,SQEVMKKMSATFFKLLEECKKELSVTDDMIQGLVRFWLEDSALGER...,9.85,0,0,0,0,0,...,0.031201,0.030126,0.031509,0.030460,0.027874,0.028269,0.031272,0.030974,0.032708,0.031396
3,ionone (beta),CsinGOBP1,CC(=O)/C=C/C1=C(C)CCCC1(C)C,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,12.93,0,0,0,0,0,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
4,ionone (beta),CsinGOBP2,CC(=O)/C=C/C1=C(C)CCCC1(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.00,0,0,0,0,0,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2-methyl-3-pentanol,CsinGOBP1,CCC(O)C(C)C,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,30.00,0,0,0,0,0,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1455,2-methyl-3-pentanol,CsinGOBP2,CCC(O)C(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,9.57,0,0,0,0,0,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1456,Methyl benzyl ether,CsinGOBP1,COCc1ccccc1,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,24.11,0,0,0,0,0,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1457,Methyl benzyl ether,CsinGOBP2,COCc1ccccc1,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.00,0,0,0,0,0,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
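
The same reordering is applied to both merged tables, and it depends on how list indices shift after each `pop`. One way to make the intent explicit is a small helper that places named columns right after an anchor column. The function `move_after` below is a hypothetical helper written for this sketch, not part of pandas, and the toy frame is illustrative:

```python
import pandas as pd

def move_after(df, anchor, columns):
    """Return a copy of df with `columns` placed right after `anchor`, in the given order."""
    remaining = [c for c in df.columns if c not in columns]
    pos = remaining.index(anchor) + 1
    return df[remaining[:pos] + list(columns) + remaining[pos:]].copy()

# Toy frame with the same column layout as the merged tables (illustrative values).
toy = pd.DataFrame({
    "Compound name": ["benzaldehyde"],
    "protein_name": ["AipsGOBP1"],
    "affinity": [30.0],
    "Smiles": ["O=Cc1ccccc1"],
    "nAcid": [0],
    "AA Sequence W/O signal peptide": ["DVNV"],
})

reordered = move_after(toy, "Smiles", ["AA Sequence W/O signal peptide", "affinity"])
print(reordered.columns.tolist())
```

Returning a copy also means that a later column assignment, such as the `astype(float)` conversion, acts on an independent frame.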


In [47]:
merge_df1.to_excel("datasets/data_obps_covs_more_info.xlsx")

## Adding a new protein-ligand interaction row to the dataset

In [78]:
# @title Functions to compute descriptors for VOCs and OBPs
from padelpy import from_smiles  # assumed import: from_smiles is provided by padelpy


def calculate_descriptors_for_smile(smile):
    """Compute the PaDEL descriptors (no fingerprints) for a single SMILES string."""
    try:
        # Compute the descriptors with padelpy
        descriptors = from_smiles(smile, descriptors=True, fingerprints=False)
    except Exception as e:
        print(f"Error computing descriptors for SMILES {smile}: {e}")
        raise  # otherwise the return below would fail with UnboundLocalError

    return descriptors




from propy import PyPro  # assumed import: PyPro.GetProDes matches the propy (propy3) API


def calculate_descriptors_for_sequence(sequence):
    """Compute all available protein descriptors for a single amino-acid sequence."""
    try:
        des_object = PyPro.GetProDes(sequence)
        descriptors = des_object.GetALL()
    except Exception as e:
        print(f"Error computing descriptors for sequence {sequence}: {e}")
        raise  # otherwise the return below would fail with UnboundLocalError

    return descriptors

def filter_and_fill_missing(combined_df, columns_X_caracteristics):
    """
    Filters a DataFrame based on a list of columns and fills missing columns with zeros.

    Args:
        combined_df: The input DataFrame.
        columns_X_caracteristics: A list of columns to keep.

    Returns:
        A new DataFrame containing only the specified columns, with missing columns filled with zeros.
    """

    # Create a new DataFrame with the specified columns
    filtered_df = pd.DataFrame(columns=columns_X_caracteristics)

    # Iterate through the requested columns
    for col in columns_X_caracteristics:
        if col in combined_df.columns:
            # If the column exists, add it to the new DataFrame
            filtered_df[col] = combined_df[col]
        else:
            # If the column does not exist, add a column filled with zeros
            filtered_df[col] = 0

    return filtered_df
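
The column alignment done by `filter_and_fill_missing` can also be sketched with `DataFrame.reindex`, which keeps the requested columns in order and fills absent ones with a constant. The toy descriptor names and values below are illustrative, and `extra_descriptor` is a made-up column. Listing the absent columns first is worthwhile, because a zero-filled descriptor is indistinguishable from a genuine zero:

```python
import pandas as pd

# One-row frame of computed descriptors (toy values; 'extra_descriptor' is made up).
computed = pd.DataFrame([{"nAcid": 0, "ALogP": 4.3258, "extra_descriptor": 1.0}])

# Columns expected by the dataset, in dataset order.
expected_columns = ["nAcid", "ALogP", "AMR"]

# Report which expected columns were not computed before filling them.
missing = [c for c in expected_columns if c not in computed.columns]

# Keep only the expected columns; absent ones are created and filled with 0.
aligned = computed.reindex(columns=expected_columns, fill_value=0)
print(missing)
print(aligned)
```

This sketch assumes the column labels are unique; `reindex` raises an error when the frame contains duplicate labels.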

In [3]:
dataset = pd.read_excel("datasets/dataset_cov_obps_4.xlsx") # load the dataset

In [64]:
dataset

Unnamed: 0,Compound name,protein_name,Smiles,AA Sequence W/O signal peptide,affinity,nAcid,ALogP,ALogp2,AMR,apol,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,ionone (beta),CmedPBP4,CC(=O)/C=C/C1=C(C)CCCC1(C)C,MEVEMLPEGMKQLTGGFIKVFEACKTELGLKDGMLTDMYHLWREEY...,7.13,0,1.5768,2.486298,60.6615,37.017860,...,0.029493,0.031452,0.030593,0.032004,0.026301,0.031368,0.032444,0.027974,0.029908,0.034341
1,ionone (beta),CpunPBP2,CC(=O)/C=C/C1=C(C)CCCC1(C)C,MMKDMTKNFLKAYGECQQELHLTDDTARDLMFFWKEDYEVTSREAG...,10.06,0,1.5768,2.486298,60.6615,37.017860,...,0.029432,0.031559,0.031934,0.032102,0.026964,0.029060,0.033730,0.031201,0.029511,0.032615
2,ionone (beta),CpunPBP5,CC(=O)/C=C/C1=C(C)CCCC1(C)C,SQEVMKKMSATFFKLLEECKKELSVTDDMIQGLVRFWLEDSALGER...,9.85,0,1.5768,2.486298,60.6615,37.017860,...,0.031201,0.030126,0.031509,0.030460,0.027874,0.028269,0.031272,0.030974,0.032708,0.031396
3,ionone (beta),CsinGOBP1,CC(=O)/C=C/C1=C(C)CCCC1(C)C,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,12.93,0,1.5768,2.486298,60.6615,37.017860,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
4,ionone (beta),CsinGOBP2,CC(=O)/C=C/C1=C(C)CCCC1(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.00,0,1.5768,2.486298,60.6615,37.017860,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2-methyl-3-pentanol,CsinGOBP1,CCC(O)C(C)C,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,30.00,0,-0.1745,0.030450,29.0406,20.697102,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1455,2-methyl-3-pentanol,CsinGOBP2,CCC(O)C(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,9.57,0,-0.1745,0.030450,29.0406,20.697102,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1456,Methyl benzyl ether,CsinGOBP1,COCc1ccccc1,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,24.11,0,-0.0402,0.001616,12.0280,21.549930,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1457,Methyl benzyl ether,CsinGOBP2,COCc1ccccc1,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.00,0,-0.0402,0.001616,12.0280,21.549930,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829


### Input for a new protein-ligand interaction row and computation of its descriptors

In [96]:
# @title Input for the new interaction row
compound_name = '(E)‐β‐farnesene' #@param {type:"string"}
protein_name = 'LbotPBP1' #@param {type:"string"}
affinity =  86.5 #@param {type:"number"}
smile = 'C=CC(CC/C=C(C)/CC/C=C(C)/C)=C' #@param {type:"string"}
sequence = 'SKEVVKDMSVNFKKALDVCIAEMNLPDTIFIDFINFWKEDYVITNRDTGCAIMCLSTKLEIVDPDLKLHHGNANDFVTQNGADEALAKELVNIIHVCETNLPQFDDGCLKVLEWAKCFKAEIHKKGMAPSMEVAAGEMLAEV' #@param {type:"string"}

In [97]:
smile_descriptor = calculate_descriptors_for_smile(smile)
sequence_descriptor = calculate_descriptors_for_sequence(sequence)

In [98]:
# Each descriptor function returns one dictionary; wrap it in a one-row DataFrame
smile_df = pd.DataFrame([smile_descriptor])
sequence_df = pd.DataFrame([sequence_descriptor])

# Convert 'object' columns to numeric; values that cannot be parsed become NaN
for col in smile_df.columns:
  if smile_df[col].dtype == 'object':
    smile_df[col] = pd.to_numeric(smile_df[col], errors='coerce')


# Concatenate the compound and protein descriptors side by side
descriptors_df = pd.concat([smile_df, sequence_df], axis=1)

# Keep only the descriptor columns of the dataset (everything after 'affinity')
descriptors_matrix = filter_and_fill_missing(descriptors_df, dataset.iloc[:,5:].columns.tolist())

# Identifier columns of the new row, taken from the input cell
new_row = {
    'Compound name': compound_name,
    'protein_name': protein_name,
    'Smiles': smile,
    'AA Sequence W/O signal peptide': sequence,
    'affinity': affinity,
}

# Convert the new row to a DataFrame
new_row_df = pd.DataFrame([new_row])

# Attach the descriptors to the identifier columns
new_row_df2 = pd.concat([new_row_df, descriptors_matrix], axis=1)

new_row_df2

Unnamed: 0,Compound name,protein_name,Smiles,AA Sequence W/O signal peptide,affinity,nAcid,ALogP,ALogp2,AMR,apol,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,(E)‐β‐farnesene,LbotPBP1,C=CC(CC/C=C(C)/CC/C=C(C)/C)=C,SKEVVKDMSVNFKKALDVCIAEMNLPDTIFIDFINFWKEDYVITNR...,86.5,0,4.3258,18.712546,72.412,42.403032,...,0.03059,0.033452,0.031702,0.02902,0.026075,0.029873,0.031522,0.028853,0.030992,0.029858
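
Row-wise `pd.concat` aligns frames by column name and fills any column missing from one of them with NaN, so a mismatch between the new row and the dataset would pass silently. A minimal sketch of a pre-append check follows; `column_mismatch` is a hypothetical helper and the toy frames only stand in for `dataset` and `new_row_df2`, with made-up values.

```python
import pandas as pd


def column_mismatch(reference, candidate):
    """List columns present in one frame but not the other (hypothetical helper)."""
    missing = [col for col in reference.columns if col not in candidate.columns]
    extra = [col for col in candidate.columns if col not in reference.columns]
    return {"missing": missing, "extra": extra}


# Toy frames standing in for `dataset` and `new_row_df2`; values are made up.
toy_dataset = pd.DataFrame({"Compound name": ["a"], "affinity": [1.0], "ALogP": [0.5]})
toy_new_row = pd.DataFrame({"Compound name": ["b"], "affinity": [2.0], "AMR": [3.0]})

report = column_mismatch(toy_dataset, toy_new_row)
print(report)  # {'missing': ['ALogP'], 'extra': ['AMR']}
```

Both lists being empty would indicate that the new row can be appended without introducing NaN-filled columns.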


### Final dataset with the new row appended

In [99]:
# Concatenate the new row to the existing DataFrame
dataset2 = pd.concat([dataset, new_row_df2], axis=0)
dataset2.reset_index(drop=True, inplace=True)

# Verify the addition of the new row
dataset2.tail()

Unnamed: 0,Compound name,protein_name,Smiles,AA Sequence W/O signal peptide,affinity,nAcid,ALogP,ALogp2,AMR,apol,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
1455,2-methyl-3-pentanol,CsinGOBP2,CCC(O)C(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,9.57,0,-0.1745,0.03045,29.0406,20.697102,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1456,Methyl benzyl ether,CsinGOBP1,COCc1ccccc1,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,24.11,0,-0.0402,0.001616,12.028,21.54993,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1457,Methyl benzyl ether,CsinGOBP2,COCc1ccccc1,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.0,0,-0.0402,0.001616,12.028,21.54993,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1458,Hexamethyldisiloxane,CbuqPBP1,C[Si](C)(C)O[Si](C)(C)C,LSESLVDEMKEKLQKYGLECAEKEKASEEDIQALMNHERPVTHAGK...,36.89,0,4.1543,17.258208,36.2961,34.124274,...,0.033185,0.029595,0.029654,0.0294,0.026501,0.029821,0.030959,0.029028,0.033224,0.031223
1459,(E)‐β‐farnesene,LbotPBP1,C=CC(CC/C=C(C)/CC/C=C(C)/C)=C,SKEVVKDMSVNFKKALDVCIAEMNLPDTIFIDFINFWKEDYVITNR...,86.5,0,4.3258,18.712546,72.412,42.403032,...,0.03059,0.033452,0.031702,0.02902,0.026075,0.029873,0.031522,0.028853,0.030992,0.029858
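
Re-running the append cell adds the same compound and protein pair again, and a later lookup by name would then match more than one row. A small sketch of a duplicate check is shown below, assuming the pair (`Compound name`, `protein_name`) is meant to identify a row uniquely; the toy frame stands in for `dataset2` and its values are made up.

```python
import pandas as pd

# Toy stand-in for `dataset2`; names and affinities are made up.
toy = pd.DataFrame(
    {
        "Compound name": ["ligand A", "ligand B", "ligand A"],
        "protein_name": ["P1", "P1", "P1"],
        "affinity": [10.0, 20.0, 12.5],
    }
)

# Assumption: these two columns together should identify one interaction.
key_columns = ["Compound name", "protein_name"]

# keep=False marks every member of a duplicated group, not only the repeats.
duplicate_rows = toy[toy.duplicated(subset=key_columns, keep=False)]
print(duplicate_rows.index.tolist())  # [0, 2]
```

If the pair is allowed to repeat in the real data (for example, replicate measurements), this check would only be informative rather than an error condition.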


In [100]:
dataset2.to_excel("datasets/dataset_cov_obps_5.xlsx", index=False)

## Replace a protein-ligand affinity in the dataset, looked up by name

In [101]:
dataset2

Unnamed: 0,Compound name,protein_name,Smiles,AA Sequence W/O signal peptide,affinity,nAcid,ALogP,ALogp2,AMR,apol,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
0,ionone (beta),CmedPBP4,CC(=O)/C=C/C1=C(C)CCCC1(C)C,MEVEMLPEGMKQLTGGFIKVFEACKTELGLKDGMLTDMYHLWREEY...,7.13,0,1.5768,2.486298,60.6615,37.017860,...,0.029493,0.031452,0.030593,0.032004,0.026301,0.031368,0.032444,0.027974,0.029908,0.034341
1,ionone (beta),CpunPBP2,CC(=O)/C=C/C1=C(C)CCCC1(C)C,MMKDMTKNFLKAYGECQQELHLTDDTARDLMFFWKEDYEVTSREAG...,10.06,0,1.5768,2.486298,60.6615,37.017860,...,0.029432,0.031559,0.031934,0.032102,0.026964,0.029060,0.033730,0.031201,0.029511,0.032615
2,ionone (beta),CpunPBP5,CC(=O)/C=C/C1=C(C)CCCC1(C)C,SQEVMKKMSATFFKLLEECKKELSVTDDMIQGLVRFWLEDSALGER...,9.85,0,1.5768,2.486298,60.6615,37.017860,...,0.031201,0.030126,0.031509,0.030460,0.027874,0.028269,0.031272,0.030974,0.032708,0.031396
3,ionone (beta),CsinGOBP1,CC(=O)/C=C/C1=C(C)CCCC1(C)C,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,12.93,0,1.5768,2.486298,60.6615,37.017860,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
4,ionone (beta),CsinGOBP2,CC(=O)/C=C/C1=C(C)CCCC1(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.00,0,1.5768,2.486298,60.6615,37.017860,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,2-methyl-3-pentanol,CsinGOBP2,CCC(O)C(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,9.57,0,-0.1745,0.030450,29.0406,20.697102,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1456,Methyl benzyl ether,CsinGOBP1,COCc1ccccc1,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,24.11,0,-0.0402,0.001616,12.0280,21.549930,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1457,Methyl benzyl ether,CsinGOBP2,COCc1ccccc1,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.00,0,-0.0402,0.001616,12.0280,21.549930,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1458,Hexamethyldisiloxane,CbuqPBP1,C[Si](C)(C)O[Si](C)(C)C,LSESLVDEMKEKLQKYGLECAEKEKASEEDIQALMNHERPVTHAGK...,36.89,0,4.1543,17.258208,36.2961,34.124274,...,0.033185,0.029595,0.029654,0.029400,0.026501,0.029821,0.030959,0.029028,0.033224,0.031223


### Input to find the interaction row by protein and ligand name

In [111]:
# @title Input1
compound_name_search = '(E)‐β‐farnesene' #@param {type:"string"}
protein_name_search = 'LbotPBP1' #@param {type:"string"}

### Input for the replacement affinity of the selected protein-ligand pair

In [110]:
# @title Input2
affinity_replace =  86.5 #@param {type:"number"}

In [108]:
# Find the index of the row(s) matching both names
row_index_to_replace = dataset2[(dataset2['Compound name'] == compound_name_search) & (dataset2['protein_name'] == protein_name_search)].index

# Check if the row exists
if not row_index_to_replace.empty:
    # Overwrite the affinity of every matching row
    dataset2.loc[row_index_to_replace, 'affinity'] = affinity_replace

    # Save the updated DataFrame to the Excel file
    dataset2.to_excel("datasets/dataset_cov_obps_5.xlsx", index=False)
    print("Row replaced successfully!")
else:
    print("Row not found in the DataFrame.")

Row replaced successfully!
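
The lookup relies on exact string equality, and the compound name used here appears to contain a non-ASCII hyphen character, so typing the same name with a keyboard hyphen would not match. One possible mitigation, sketched under the assumption that hyphen variants, surrounding whitespace, and letter case should not distinguish names, is to compare normalized keys. `normalize_name` is a hypothetical helper and is not used anywhere in this notebook.

```python
# Hypothetical helper, not part of the original notebook.
# Map a few Unicode hyphen/dash look-alikes to the ASCII hyphen-minus.
_HYPHEN_VARIANTS = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2212"), "-")


def normalize_name(name):
    """Return a comparison key: ASCII hyphens, trimmed, case-folded."""
    return name.translate(_HYPHEN_VARIANTS).strip().casefold()


typed_with_unicode_hyphen = "(E)\u2010\u03b2\u2010farnesene"
typed_with_ascii_hyphen = " (E)-\u03b2-farnesene"

print(typed_with_unicode_hyphen == typed_with_ascii_hyphen)  # False
print(normalize_name(typed_with_unicode_hyphen) == normalize_name(typed_with_ascii_hyphen))  # True
```

In a DataFrame the same key could be built with `dataset2['Compound name'].map(normalize_name)` and compared against the normalized search string, leaving the stored names untouched.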


In [105]:
dataset2.tail()

Unnamed: 0,Compound name,protein_name,Smiles,AA Sequence W/O signal peptide,affinity,nAcid,ALogP,ALogp2,AMR,apol,...,QSOgrant41,QSOgrant42,QSOgrant43,QSOgrant44,QSOgrant45,QSOgrant46,QSOgrant47,QSOgrant48,QSOgrant49,QSOgrant50
1455,2-methyl-3-pentanol,CsinGOBP2,CCC(O)C(C)C,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,9.57,0,-0.1745,0.03045,29.0406,20.697102,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1456,Methyl benzyl ether,CsinGOBP1,COCc1ccccc1,KVEVMKDVTLGFGEALQHCREQSQLTEEKMEEFFHFWRDDFKFEHR...,24.11,0,-0.0402,0.001616,12.028,21.54993,...,0.031101,0.033718,0.033598,0.028898,0.025392,0.030858,0.032347,0.029792,0.030958,0.033562
1457,Methyl benzyl ether,CsinGOBP2,COCc1ccccc1,TAEIMSHVTAHFGKLLEECRQESGLTTDILEEFQHFWREDFEVVHR...,30.0,0,-0.0402,0.001616,12.028,21.54993,...,0.031391,0.031523,0.031068,0.032233,0.026548,0.030795,0.031066,0.030422,0.029792,0.031829
1458,Hexamethyldisiloxane,CbuqPBP1,C[Si](C)(C)O[Si](C)(C)C,LSESLVDEMKEKLQKYGLECAEKEKASEEDIQALMNHERPVTHAGK...,36.89,0,4.1543,17.258208,36.2961,34.124274,...,0.033185,0.029595,0.029654,0.0294,0.026501,0.029821,0.030959,0.029028,0.033224,0.031223
1459,(E)‐β‐farnesene,LbotPBP1,C=CC(CC/C=C(C)/CC/C=C(C)/C)=C,SKEVVKDMSVNFKKALDVCIAEMNLPDTIFIDFINFWKEDYVITNR...,86.6,0,4.3258,18.712546,72.412,42.403032,...,0.03059,0.033452,0.031702,0.02902,0.026075,0.029873,0.031522,0.028853,0.030992,0.029858
