# What else could be done?

- Test data augmentation with the LSTM
- Try the tokenizers used in transformers to see how they perform
- Run t-SNE on the embeddings and check whether they are interpretable
- Try other enzymes or proteins
- Use the trained embeddings to analyze results for proteins or enzymes with less data
- Train a neural network on the features (fingerprints, for example) and compare the results with the embeddings


# Pick any of these proposals, or one of your own, and develop it

In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
import pandas as pd

In [None]:

df = pd.read_csv('data/acetylcholinesterase_02_bioactivity_data_preprocessed_tokenizada.csv')

## FEATURES

### Lipinski descriptors

Christopher Lipinski, a scientist at Pfizer, proposed a set of rules of thumb for evaluating the drug-likeness of compounds. Drug-likeness here refers to a compound's Absorption, Distribution, Metabolism and Excretion (ADME) properties, also known as its pharmacokinetic profile. Lipinski analyzed orally active drugs to formulate what became known as the Rule of Five, or Lipinski's Rule.

Lipinski's Rule states the following:

- Molecular weight < 500 Dalton
- Octanol-water partition coefficient (LogP) ≤ 5
- Hydrogen bond donors ≤ 5
- Hydrogen bond acceptors ≤ 10

Lipinski's rule of five is a purely empirical rule. It gives a qualitative assessment of how suitable a chemical compound is likely to be for a given pharmacological function or biological activity once it is taken as an oral drug by humans.

As the rule indicates, for an active ingredient to be suitable for oral administration it should, in general, violate no more than one of the criteria listed above.
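
As a minimal sketch of how the criteria above could be applied, the function below counts violations for a single compound. The function names are hypothetical, and the sketch assumes that the LogP, donor and acceptor limits are inclusive (a value exactly at the limit is not a violation) while the molecular weight limit is strict. The MW, LogP and donor values are taken from one row of this notebook's output; the acceptor count is a placeholder, since that column is not visible in the output.

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count how many rule-of-five criteria a compound breaks."""
    checks = [
        mw >= 500,         # molecular weight should be below 500 Da
        logp > 5,          # LogP should not exceed 5
        h_donors > 5,      # at most 5 hydrogen bond donors
        h_acceptors > 10,  # at most 10 hydrogen bond acceptors
    ]
    return sum(checks)


def passes_rule_of_five(mw, logp, h_donors, h_acceptors):
    """A compound passes if it violates no more than one criterion."""
    return lipinski_violations(mw, logp, h_donors, h_acceptors) <= 1


# MW, LogP and donors as shown for CHEMBL131588 in this notebook;
# h_acceptors=6 is a hypothetical placeholder value.
n_violations = lipinski_violations(mw=426.851, logp=5.3574, h_donors=0, h_acceptors=6)
print(n_violations, passes_rule_of_five(426.851, 5.3574, 0, 6))
```

The same function can be applied row by row to the `MW`, `LogP`, `NumHDonors` and `NumHAcceptors` columns computed in the next cells.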



In [None]:
# Compute the Lipinski descriptors (MW, LogP, H-bond donors and acceptors) from SMILES strings
# Inspired by: https://codeocean.com/explore/capsules?query=tag:data-curation
def lipinski(smiles, verbose=False):
    columnNames = ["MW", "LogP", "NumHDonors", "NumHAcceptors"]

    rows = []
    for elem in smiles:
        mol = Chem.MolFromSmiles(elem)
        if mol is None:
            # Unparseable SMILES: keep a NaN row so the output stays aligned with the input
            if verbose:
                print(f"Could not parse SMILES: {elem}")
            rows.append([np.nan] * len(columnNames))
            continue

        rows.append([Descriptors.MolWt(mol),
                     Descriptors.MolLogP(mol),
                     Lipinski.NumHDonors(mol),
                     Lipinski.NumHAcceptors(mol)])

    # Build the DataFrame once instead of stacking arrays row by row
    descriptors = pd.DataFrame(data=rows, columns=columnNames, dtype=float)

    return descriptors

In [None]:
df_lipinski = lipinski(df.canonical_smiles)
df_lipinski

In [None]:
df_token_lipinski = pd.concat([df,df_lipinski], axis=1)

In [None]:
df_token_lipinski.head()

### Categorical feature: active, intermediate and inactive compounds

In [None]:
bioactivity_class = []
for i in df_token_lipinski.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")

In [None]:
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class')
df_token_lipinski_clasificado = pd.concat([df_token_lipinski, bioactivity_class], axis=1)

In [None]:
df_token_lipinski_clasificado.head()
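
The thresholds used for `bioactivity_class` can also be read on the pIC50 scale. The sketch below assumes that `standard_value` is an IC50 expressed in nM; the notebook does not state the unit, although the `pIC50` column in the output further down is consistent with it. Under that assumption, 1000 nM corresponds to pIC50 = 6 and 10000 nM to pIC50 = 5. The helper names are hypothetical.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L).

    Assumption: the input is in nM, so -log10(x * 1e-9) = 9 - log10(x).
    """
    return 9.0 - math.log10(ic50_nm)


def classify(ic50_nm):
    """Same thresholds as the loop above, applied to a single value."""
    if ic50_nm >= 10000:
        return "inactive"
    if ic50_nm <= 1000:
        return "active"
    return "intermediate"


# standard_value entries that appear in this notebook's output
for value in (750.0, 100.0, 50000.0):
    print(value, round(pic50_from_nm(value), 6), classify(value))
```

Working on the pIC50 scale makes the three classes equivalent to pIC50 ≥ 6 (active), pIC50 ≤ 5 (inactive) and anything in between (intermediate).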

### FINGERPRINT DESCRIPTORS: PaDEL Descriptors

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df_token_lipinski_clasificado[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! cat padel.sh

In [None]:
! bash padel.sh

In [None]:
df_fingerPrint = pd.read_csv('descriptors_output.csv')

In [None]:
df_fingerPrint.head()

In [25]:
df_token_lipinski_clasificado_fingerPrint = pd.concat([df_token_lipinski_clasificado,df_fingerPrint], axis=1)
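
Concatenating with `axis=1` pairs rows by position, so it is only correct if the fingerprint file lists the molecules in the same order as the bioactivity table. The sketch below shows one way to check that, and to join by identifier instead. It uses small made-up frames, and it assumes that the PaDEL output carries a `Name` column holding the identifier written to `molecule.smi`; that column name and the uniqueness of the identifiers are assumptions to verify against the real `descriptors_output.csv`.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two tables in this notebook
bio = pd.DataFrame({
    "molecule_chembl_id": ["ID_A", "ID_B", "ID_C"],
    "pIC50": [6.1, 7.0, 4.3],
})
# Assumption: 'Name' holds the molecule identifier; rows are deliberately shuffled
fp = pd.DataFrame({
    "Name": ["ID_C", "ID_A", "ID_B"],
    "PubchemFP0": [1, 0, 1],
})

# Positional concat: check whether the identifiers actually line up
aligned = pd.concat([bio, fp], axis=1)
same_order = bool((aligned["molecule_chembl_id"] == aligned["Name"]).all())
print("rows aligned by position:", same_order)

# Key-based join: independent of row order.
# validate="one_to_one" raises if an identifier is duplicated on either side.
merged = bio.merge(
    fp,
    left_on="molecule_chembl_id",
    right_on="Name",
    how="left",
    validate="one_to_one",
).drop(columns="Name")
print(merged)
```

If the order check passes on the real data, the positional `pd.concat` above is fine as written; if it fails, the key-based merge avoids attaching fingerprints to the wrong molecule.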

In [26]:
df_token_lipinski_clasificado_fingerPrint.head()

| Unnamed: 0 | molecule_chembl_id | canonical_smiles | standard_value | standard_value_norm | pIC50 | X_seq | X_seq_pad | MW | LogP | NumHDonors | ... | PubchemFP871 | PubchemFP872 | PubchemFP873 | PubchemFP874 | PubchemFP875 | PubchemFP876 | PubchemFP877 | PubchemFP878 | PubchemFP879 | PubchemFP880 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEMBL133897 | CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1 | 750.0 | 750.0 | 6.124939 | [1, 1, 6, 1, 4, 5, 5, 2, 15, 1, 7, 1, 1, 1, 1,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 312.325 | 2.8032 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | CHEMBL336398 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1 | 100.0 | 100.0 | 7.0 | [6, 8, 1, 2, 5, 4, 1, 1, 1, 1, 1, 4, 3, 5, 4, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 376.913 | 4.5546 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | CHEMBL131588 | CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1 | 50000.0 | 50000.0 | 4.30103 | [1, 5, 2, 1, 2, 8, 6, 3, 5, 4, 5, 1, 2, 15, 1,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 426.851 | 5.3574 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | CHEMBL130628 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F | 300.0 | 300.0 | 6.522879 | [6, 8, 1, 2, 5, 4, 1, 1, 1, 1, 1, 4, 3, 5, 4, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 404.845 | 4.7069 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | CHEMBL130478 | CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C | 800.0 | 800.0 | 6.09691 | [1, 22, 1, 4, 5, 1, 2, 15, 1, 7, 1, 1, 1, 2, 6... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 346.334 | 3.0953 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [27]:
df_token_lipinski_clasificado_fingerPrint.to_csv('acetylcholinesterase_02_bioactivity_data_preprocessed_token_descriptors.csv' ,index=False)