# Dense Neural Networks

Hello there!

In the previous approach we considered a linear estimate of the bio-activity, which gave an average $R^{2}=0.62$ and a MAPE of $7.19$. In this notebook we present a new approach based on Deep Neural Networks; in this initial case we use only Dense layers, i.e. a feed-forward network. The descriptors are selected with Mutual Information (MI): we first kept the descriptors whose MI is higher than $0.4$, which reduced the dimension from 1200 to just 99 descriptors.

Then, from those 99 descriptors, we selected the one with the highest MI (piPC4) together with two further variables that are independent of each other, meaning that the MI values among them are the lowest possible.
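
The MI-based filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the column names `d0` to `d4` are hypothetical, and scikit-learn's `mutual_info_regression` is used as the estimator, which may differ from the one used in the previous notebook. Only the $0.4$ threshold comes from the text.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# Synthetic stand-in for the descriptor table: 500 molecules, 5 descriptors (hypothetical names).
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"d{i}" for i in range(5)])
# The synthetic target depends only on d0, so only d0 should carry a high MI.
y = 2.0 * X["d0"].to_numpy() + 0.1 * rng.normal(size=500)

# Estimate the MI between each descriptor and the target (nearest-neighbour estimator).
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)

MI_THRESHOLD = 0.4  # threshold quoted in the text
selected = mi[mi > MI_THRESHOLD].index.tolist()  # descriptors kept by the filter
best = mi.idxmax()                               # descriptor with the highest MI

print(mi.sort_values(ascending=False))
print("selected:", selected, "| highest MI:", best)
```

On the real descriptor table, `selected` would play the role of the 99 retained descriptors and `best` the role of piPC4.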

## Used libraries

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

In [None]:
colab = False

In [None]:
if colab: 
    import sys
    sys.path.append('/content/drive/MyDrive/Colaboracion_Quimica/Main_Codes/AutoEncoders/AmaroX/AmaroX')
    ! pip install python-telegram-bot

    from ai_functions import *
    from ai_models import *
    from utilities import *
    from data_manipulation import *
    import pandas as pd
else: 
    from AmaroX.AmaroX.ai_functions import *
    from AmaroX.AmaroX.ai_models import *
    from AmaroX.AmaroX.utilities import *
    from AmaroX.AmaroX.data_manipulation import *
    import pandas as pd

In [None]:
import keras_tuner
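import sklearn
# The submodules used below are not guaranteed to be loaded by a bare `import sklearn`
import sklearn.model_selection
import sklearn.preprocessing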

## Data

The data correspond to molecules with their SMILES representation and descriptors, along with the biological activity. Let's first take a quick look at the shape of the data.

* All the data presented here were obtained in collaboration with Dr. Erick Padilla at Facultad de Estudios Superiores Zaragoza - UNAM.

### Downloading the data

In [None]:
if colab:
    ! gdown --id 1cHM9neEhTOZ82UU9HaZkdGdlwE1d4SJT
    ! gdown --id 1wZp9pou63ElEYyGGjBeC2pDtscgRgCpj

The _data.xlsx_ file contains all the molecular descriptors of each molecule, along with a SMILES representation.

In [None]:
compounds_md = pd.read_excel("../Data/data.xlsx")
activity = pd.read_excel("../Data/Actividad.xlsx")

In [None]:
compounds_md.head()

In [None]:
compounds_md.shape

In [None]:
activity.head()

In [None]:
activity.shape

* The variable _x_ holds the molecular descriptors; we're only interested in numerical properties:

In [None]:
x = compounds_md.copy()
x = x.select_dtypes("number")

## Applying Mutual Information to Molecular Descriptors

In the previous notebook, we selected 3 molecular descriptors that are mutually independent and have a high MI with respect to the bio-activity. The cell below keeps only piPC4.

In [None]:
x_array = np.array(x[ ['piPC4'] ])
x_array.shape

In [None]:
y_array = np.array( activity )[:, 0]
y_array.shape

## Standardize Features

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
x_std = scaler.fit_transform(x_array)
x_std.shape
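
As a reminder of what `StandardScaler` computes, here is a minimal NumPy sketch of the column-wise z-score. The helper `standardize` and the array `x_demo` are hypothetical names introduced only for this illustration; the population standard deviation (`ddof=0`) is used, which matches scikit-learn's scaler.

```python
import numpy as np

def standardize(x):
    """Column-wise z-score: subtract the mean and divide by the population std (ddof=0)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / std, mean, std

# Toy single-feature array with the same (n_samples, 1) layout as x_array
x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
x_demo_std, mean, std = standardize(x_demo)

print("mean:", mean, "std:", std)
print("standardized:", x_demo_std.ravel())
```

After the transformation each column has zero mean and unit variance, which keeps the input on a scale that suits the `he_normal` initialisation used later.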

In [None]:
plot_xy([x_std, y_array])

## Splitting Train and Test

In [None]:
N_BINS = 6         # discretizer: number of bins (this was 10 before)
N_SPLITS = 6       # splitter: number of re-shuffled splits
TEST_SIZE = 21/70  # splitter: fraction of samples held out for testing

In [None]:
# Split into train and test sets using stratified sampling
discretizer = sklearn.preprocessing.KBinsDiscretizer(n_bins=N_BINS, encode="ordinal", strategy="uniform")
splitter = sklearn.model_selection.StratifiedShuffleSplit(n_splits=N_SPLITS,test_size=TEST_SIZE, random_state=13)
y_discrete = discretizer.fit_transform(np.expand_dims(y_array, axis = -1))
split, split_test = next(splitter.split(np.expand_dims(x_std, axis = -1), y_discrete ))

In [None]:
x_train = x_std[split]
x_test = x_std[split_test]
y_train = y_array[split]
y_test = y_array[split_test]

In [None]:
x_train.shape, x_test.shape
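
The idea behind the split above is that a continuous target cannot be stratified directly, so it is first binned and the bin labels are used as strata. The following self-contained sketch reproduces that idea on synthetic data; the arrays `x_demo` and `y_demo` and the sample count of 70 are assumptions made for illustration (70 is suggested by `TEST_SIZE = 21/70`, but the real dataset size is not shown here).

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(13)
n_samples = 70  # assumed sample count, for illustration only

# Synthetic feature and continuous target
x_demo = rng.normal(size=(n_samples, 1))
y_demo = rng.uniform(0.0, 10.0, size=n_samples)

# Bin the continuous target into 6 equal-width intervals; the bin index is the stratum
discretizer = KBinsDiscretizer(n_bins=6, encode="ordinal", strategy="uniform")
y_bins = discretizer.fit_transform(y_demo.reshape(-1, 1)).ravel()

# Draw one stratified split so train and test follow a similar target distribution
splitter = StratifiedShuffleSplit(n_splits=1, test_size=21 / 70, random_state=13)
train_idx, test_idx = next(splitter.split(x_demo, y_bins))

print("train size:", len(train_idx), "test size:", len(test_idx))
print("test samples per bin:", np.bincount(y_bins[test_idx].astype(int), minlength=6))
```

Note that `next(...)` takes only the first of the generated splits, so the remaining splits produced by `n_splits` are never used.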

In [None]:
# Create a figure with two side-by-side subplots (1 row, 2 columns)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))  # figsize sets the figure size

# Plot the histogram of the training targets in the first subplot
ax1.hist(y_train, color='blue', label='train', bins = 6)
ax1.set_title('Train Density')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.legend()

# Plot the histogram of the test targets in the second subplot
ax2.hist(y_test, color='red', label='test', bins = 6)
ax2.set_title('Test Density')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.legend()

# Adjust the spacing between the subplots
plt.tight_layout()

# Show the figure
plt.show()

## Paths

In [None]:
name = 'DNN_MI_1_2_Reg'
if colab:
    folder_path = '/content/drive/MyDrive/Colaboracion_Quimica/Main_Codes/AutoEncoders/models'
else: 
    folder_path = '../models'
    
final_path = os.path.join(folder_path, name)

## Callbacks

In [None]:
callbacks = standard_callbacks(folder_name= name,
                               folder_path= folder_path,
                               patiences= [250, 250], # patience values in epochs; both are set to 250 here
                               monitor = 'val_mape',
                               flow_direction = 'min')

## DNN Model

In [None]:
def _DNN(DP, L1, L2):

  inputs = keras.layers.Input((1,))

  _DNN_ = G_Dense(
      inputs = inputs,
      nodes = [14, 300],
      DP = DP,
      n_final = 1,
      act_func = 'leaky_relu',
      final_act_func = 'relu',
      WI = 'he_normal',
      L1 = L1, 
      L2 = L2,
      use_bias = True
  )

  return keras.models.Model(inputs = inputs, outputs = _DNN_)
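
`G_Dense` comes from AmaroX and its implementation is not shown here. Assuming it simply stacks fully connected layers with the sizes and activations passed above, the forward pass of the resulting 1 → 14 → 300 → 1 network can be sketched in plain NumPy. Dropout and L1/L2 regularisation are omitted, the weights use a plain normal approximation of He initialisation, and the leaky-ReLU slope of 0.2 is an illustrative value rather than a confirmed setting.

```python
import numpy as np

def leaky_relu(z, negative_slope=0.2):
    # negative_slope is an assumed, illustrative value
    return np.where(z > 0, z, negative_slope * z)

def relu(z):
    return np.maximum(z, 0.0)

def init_dense(rng, fan_in, fan_out):
    # He-style initialisation: zero-mean normal with std sqrt(2 / fan_in), zero bias
    w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return w, b

layer_sizes = [1, 14, 300, 1]  # input, two hidden layers, one output
rng = np.random.default_rng(0)
params = [init_dense(rng, a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    h = x
    for w, b in params[:-1]:
        h = leaky_relu(h @ w + b)  # hidden layers
    w, b = params[-1]
    return relu(h @ w + b)         # the final ReLU keeps predictions non-negative

n_params = sum(w.size + b.size for w, b in params)
x_demo = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
y_demo = forward(x_demo, params)

print("output shape:", y_demo.shape, "| trainable parameters:", n_params)
```

Under these assumptions the network has 28 + 4500 + 301 = 4829 trainable parameters, and the final ReLU means it can never predict a negative activity.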

In [None]:
def compile_model(DP, L1, L2, optimizer, modelo):

  model = modelo(DP = DP, L1 = L1, L2=L2)

  model.compile(optimizer = optimizer,
                loss = 'mae',
                metrics = ['mape', 'r2_score'])

  return model
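
For reference, the loss and the two metrics named in `compile_model` can be written out in NumPy. This is a sketch of the standard definitions, not a copy of the Keras code; the small `eps` guard in the MAPE denominator and the toy arrays are assumptions made for this example.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred, eps=1e-7):
    """Mean absolute percentage error, in percent; eps guards against division by zero."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), eps))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([2.0, 4.0, 6.0, 8.0])
y_pred_demo = np.array([2.5, 3.5, 6.0, 9.0])

print("MAE :", mae(y_true_demo, y_pred_demo))
print("MAPE:", mape(y_true_demo, y_pred_demo))
print("R2  :", r2_score(y_true_demo, y_pred_demo))
```

For these toy values the three quantities are 0.5, 12.5 and 0.925. MAPE is expressed in percent, so a MAPE of $7.19$ means an average relative error of about 7%.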

In [None]:
def build_model(hp):

  #nodes = [hp.Int('Nodes-1', min_value = 1, max_value = 300, step = 1),
  #         hp.Int('Nodes-2', min_value = 1, max_value = 300, step = 1)
  #         ]

  DP = hp.Int('Dropout', min_value = 0, max_value = 50, step = 2)

  L1 = hp.Choice('L1', [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])

  L2 = hp.Choice('L2', [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])

  optimizer = hp.Choice('optimizer', ['adam',]) #'sgd', 'adagrad'])

  # Build the optimizer instance that matches the sampled name
  if optimizer == 'adam':
    opt = keras.optimizers.Adam(learning_rate = 0.001)
  elif optimizer == 'sgd':
    opt = keras.optimizers.SGD(learning_rate = 0.001)
  elif optimizer == 'adagrad':
    opt = keras.optimizers.Adagrad(learning_rate = 0.001)

  # Pass the configured instance (not the name string) so the learning rate above is actually used
  model_f = compile_model(DP = DP, L1= L1, L2=L2, optimizer = opt, modelo = _DNN)

  return model_f

In [None]:
build_model(keras_tuner.HyperParameters())

In [None]:
tuner = keras_tuner.BayesianOptimization(
    hypermodel=build_model,
    objective= keras_tuner.Objective('val_mape', 'min') ,
    max_trials= 50, # number of hyperparameter configurations to try (lower it, e.g. to 3, for a quick test run)
    executions_per_trial = 2,
    overwrite=True,
    directory= final_path,
    project_name="DNN-MI-KT",
)

In [None]:
tuner.search_space_summary()
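
To get a feeling for the size of this search space, the discrete choices defined in `build_model` can be enumerated directly. The sketch below only counts combinations. Bayesian optimization does not walk through a grid, so the coverage figure is just an upper bound on how much of the space 50 trials can touch.

```python
from itertools import product

# Discrete values taken from build_model
dropout_values = list(range(0, 50 + 1, 2))  # hp.Int with min 0, max 50 (inclusive), step 2
l1_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
l2_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
optimizers = ["adam"]

grid = list(product(dropout_values, l1_values, l2_values, optimizers))
n_configs = len(grid)

# Tuner budget taken from the BayesianOptimization call
max_trials = 50
executions_per_trial = 2
n_trainings = max_trials * executions_per_trial  # models actually trained
coverage = max_trials / n_configs                # fraction of the grid visited at most

print(f"{n_configs} configurations, {n_trainings} trainings, "
      f"at most {coverage:.1%} of the grid explored")
```

With 26 dropout values and 7 choices each for L1 and L2 there are 1274 configurations, of which the tuner evaluates at most 50, each trained twice.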

In [None]:
tuner.search(x_train, y_train, epochs=250, validation_data=(x_test, y_test), batch_size=12)

In [None]:
import sys  # needed here: `sys` is otherwise only imported in the Colab branch

file_path = os.path.join(final_path, 'best_models.txt')

with open(file_path, "w") as file:
    # Save the original stdout
    original_stdout = sys.stdout
    try:
        sys.stdout = file  # Redirect stdout to the file
        tuner.results_summary()  # Write the tuner summary into the file
    finally:
        sys.stdout = original_stdout  # Always restore stdout, even on error