# 3. Data Preparation:

This phase covers all activities to construct the final dataset (data that will be fed into the modeling tools) from the initial raw data. Tasks usually include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

Text Cleaning and Normalization

1. Remove special characters and numbers: We'll use regex to remove any non-alphabetic characters, keeping only letters and spaces.
2. Convert to lowercase: This step ensures consistency across all text entries.
3. Remove accents: Spanish text often contains accented characters, which we'll normalize to their non-accented equivalents.
4. Remove extra whitespaces: Trim leading and trailing spaces and replace multiple spaces with a single space.

spaCy dependency for the Spanish language model: !python -m spacy download es_core_news_sm

In [5]:
## Data Manipulation
import pandas as pd
import numpy as np

## Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

## Natural Language Processing
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from collections import defaultdict
import re
import unicodedata

## Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

## Deep Learning (TensorFlow / Keras)
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

## Hyperparameter search (Bayesian optimization)
from bayes_opt import BayesianOptimization

## 3.1 Null or NaN fixing

In [6]:
# Load the dataset
df = pd.read_csv('../files/intent.csv')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6679 entries, 0 to 6678
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   motivos  6679 non-null   object 
 1   crec     6679 non-null   int64  
 2   cred     6679 non-null   int64  
 3   equ      6679 non-null   int64  
 4   inic     6679 non-null   int64  
 5   inv      6679 non-null   int64  
 6   mkt      6679 non-null   int64  
 7   no       6679 non-null   int64  
 8   renta    6679 non-null   int64  
 9   sueldo   6679 non-null   int64  
 10  temp     6673 non-null   float64
dtypes: float64(1), int64(9), object(1)
memory usage: 574.1+ KB


In [8]:
# As we saw in the data analysis phase, row 1130 is the only row that has a NaN and no category assigned, so let's look at its "motivos" value
df.loc[1130]['motivos']

'Capital de trabajo\n,0,0,0,0,0,0,0,1,0,0\nRe invertir en materiales de manufacturacion,0,0,0,0,1,0,0,0,0,0\nPARA COMPRAR MAS PRODUCTOS  ASI COMO REFRIGERADORES.,0,0,1,0,1,0,0,0,0,0\nPARA COMPRA DE EQUIPO DE COMPUTO Y CONSUMIBLES,0,0,1,0,0,0,0,0,0,0\nInversión en desarrollo departamentos y compra de herramientas para la empresa,0,0,1,0,0,0,0,0,0,0\nRealizar ampliación de mi negocio independiente,1,0,0,0,0,0,0,0,0,0\nPara materias primas herramientas insumos y ampliación,0,0,0,0,1,0,0,0,0,0\nIncrementar mi negocio comprando suministros y manejar mas créditos,0,0,0,0,1,0,0,0,0,0\nInvertir en equipo de computo y tecnologias de la información,0,0,1,0,0,0,0,0,0,0\nInversion en productos y pago de facturas,0,1,0,0,0,0,0,0,0,0\nRe invertir en materiales de manufacturacion,0,0,1,0,0,0,0,0,0,0\nPAGAR CREDITOS CAROS E IMPUESTOS PENDIENTES,0,1,0,0,0,0,0,0,0,0\nExpancion y remodelacion de mobiliario del negocio,0,0,0,0,0,0,0,1,0,0\nCompra y diversificación de productos.El negocio es virtual y con 

In [9]:
# The motivo contains several rows packed into a single cell, probably because of a bad query or export. We need to fix it.

# The problematic row
motivo_text = df.loc[1130]['motivos']

# Split the text into lines
lines = motivo_text.strip().split('\n')

# Define the target labels
target_columns = ['crec', 'cred', 'equ', 'inic', 'inv', 'mkt', 'no', 'renta', 'sueldo', 'temp']

# Prepare a list to store parsed data
data = []

# Parse each line. The first line holds only a motivo (no labels) and the
# second line holds only labels (empty motivo), so the two are merged afterwards.
for line in lines:
    parts = line.split(',')
    motivo = parts[0].strip()  # Extract the motivo text
    values = list(map(int, parts[1:]))  # Convert the values to integers
    data.append([motivo] + values)  # Combine motivo and values into a single list

# Move the first motivo to the second row
if len(data) > 1:
    data[1][0] = data[0][0]  # Copy motivo from the first row to the second row
    data.pop(0)  # Remove the first row

# Create a new DataFrame with the parsed data
parsed_df = pd.DataFrame(data, columns=['motivos'] + target_columns)


# Display the resulting DataFrame
display(parsed_df)

Unnamed: 0,motivos,crec,cred,equ,inic,inv,mkt,no,renta,sueldo,temp
0,Capital de trabajo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,Re invertir en materiales de manufacturacion,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,PARA COMPRAR MAS PRODUCTOS ASI COMO REFRIGERA...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,PARA COMPRA DE EQUIPO DE COMPUTO Y CONSUMIBLES,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Inversión en desarrollo departamentos y compra...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Realizar ampliación de mi negocio independiente,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Para materias primas herramientas insumos y am...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,Incrementar mi negocio comprando suministros y...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
8,Invertir en equipo de computo y tecnologias de...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Inversion en productos y pago de facturas,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
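
The line-by-line parse above relies on `split(',')`, which would misalign the labels if a motivo itself contained a comma. As a hedged alternative, the sketch below uses a regular expression that treats the last ten comma-separated 0/1 values of each record as the labels and everything before them as the motivo. The name `parse_embedded_rows`, the sample text, and the assumption that every complete record ends in exactly ten binary flags are illustrative and not part of the original notebook.

```python
import re

# Assumption: a complete record is "<motivo>,<ten 0/1 flags>" followed by a newline or end of text.
RECORD = re.compile(r"(.*?),((?:[01],){9}[01])[ \t]*(?:\n|$)", re.DOTALL)

def parse_embedded_rows(text):
    """Return [motivo, flag1, ..., flag10] lists for every complete record in text."""
    rows = []
    for motivo, flags in RECORD.findall(text):
        motivo = " ".join(motivo.split())  # also joins a motivo that spans two lines
        rows.append([motivo] + [int(v) for v in flags.split(",")])
    return rows

# Hypothetical sample shaped like the problematic cell (the last record is cut off).
sample = (
    "Capital de trabajo\n,0,0,0,0,0,0,0,1,0,0\n"
    "Inversion en productos, insumos y pago de facturas,0,1,0,0,0,0,0,0,0,0\n"
    "Compra y diversificacion de productos"
)
rows = parse_embedded_rows(sample)
for row in rows:
    print(row)
```

With this approach the incomplete trailing record is skipped automatically, so no separate step is needed to drop it.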


In [10]:
# Delete the last row 
parsed_df = parsed_df.iloc[:-1]

# Reset the index
parsed_df.reset_index(drop=True, inplace=True)

It appears that the descriptions of the "motivos" fit the assigned labels. Now, we must concatenate the dataframe with the main one.

In [11]:
df = pd.concat([df, parsed_df], axis=0)

Now let's fill the remaining NaNs with zero, since those rows already have a category assigned.

In [12]:
# Fill NaN values with 0
df = df.fillna(0)

## 3.2 Data type transformation

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6701 entries, 0 to 21
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   motivos  6701 non-null   object 
 1   crec     6701 non-null   float64
 2   cred     6701 non-null   float64
 3   equ      6701 non-null   float64
 4   inic     6701 non-null   float64
 5   inv      6701 non-null   float64
 6   mkt      6701 non-null   float64
 7   no       6701 non-null   float64
 8   renta    6701 non-null   float64
 9   sueldo   6701 non-null   float64
 10  temp     6701 non-null   float64
dtypes: float64(10), object(1)
memory usage: 628.2+ KB


Originally all the category columns were int, but after the concatenation and NaN filling they are float, which is wrong for binary labels. Let's convert them back to int.

In [14]:
# Convert the target columns to int type
df[target_columns] = df[target_columns].astype(int)

# Print the dtypes of the target columns to confirm the change
print(df[target_columns].dtypes)

crec      int64
cred      int64
equ       int64
inic      int64
inv       int64
mkt       int64
no        int64
renta     int64
sueldo    int64
temp      int64
dtype: object


## 3.3 Text Normalization

In [15]:
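# Text cleaning function
def clean_text(text):
    # Keep only letters (including Spanish accented vowels and ñ) and whitespace
    text = re.sub(r'[^a-zA-ZáéíóúÁÉÍÓÚñÑ\s]', '', text)
    # Lowercase for consistency
    text = text.lower()
    # Decompose accented characters (NFD) and drop the combining marks; note that ñ becomes n
    text = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
    # Collapse runs of whitespace and trim both ends
    text = ' '.join(text.split())
    return text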

# Apply text cleaning
df['cleaned_text'] = df['motivos'].apply(clean_text)
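
A quick sanity check helps confirm what `clean_text` does to typical input. The snippet below repeats the function so that it runs on its own and applies it to made-up strings. `clean_text_nfd_first` is a hypothetical variant, not part of the notebook, that strips accents before filtering, so that characters outside the regex whitelist (such as ü) are normalized instead of deleted.

```python
import re
import unicodedata

def clean_text(text):
    # Same steps as the notebook function above.
    text = re.sub(r'[^a-zA-ZáéíóúÁÉÍÓÚñÑ\s]', '', text)
    text = text.lower()
    text = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
    return ' '.join(text.split())

def clean_text_nfd_first(text):
    # Hypothetical variant: lowercase and strip accents first, then keep only a-z and whitespace.
    text = unicodedata.normalize('NFD', text.lower())
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    text = re.sub(r'[^a-z\s]', '', text)
    return ' '.join(text.split())

print(clean_text("  Inversión en EQUIPO #2,  año 2024 "))  # inversion en equipo ano
print(clean_text("Pago de antigüedad"))                     # pago de antigedad
print(clean_text_nfd_first("Pago de antigüedad"))           # pago de antiguedad
```

Whether the difference matters depends on how often such characters appear in the `motivos` column, which is worth checking before changing the pipeline.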

Text Preprocessing

1. Tokenization: Split the text into individual words or tokens.
2. Remove stopwords: Eliminate common Spanish words that don't carry significant meaning for our classification task.
3. Lemmatization: Reduce words to their base or dictionary form. This is particularly important for Spanish, which has rich verb conjugations and noun/adjective agreements.

In [16]:
# Load Spanish language model for spaCy
nlp = spacy.load('es_core_news_sm')

# Text preprocessing function
def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    return ' '.join(tokens)

# Apply text preprocessing
df['preprocessed_text'] = df['cleaned_text'].apply(preprocess_text)

In [17]:
df.head()

Unnamed: 0,motivos,crec,cred,equ,inic,inv,mkt,no,renta,sueldo,temp,cleaned_text,preprocessed_text
0,Crear un departamento de ventas e inversión a ...,0,0,0,0,0,1,0,0,0,0,crear un departamento de ventas e inversion a ...,crear departamento venta inversion publicidad
1,establecerme en un local y agregar materia pri...,0,0,0,0,1,0,0,1,0,0,establecerme en un local y agregar materia pri...,establecerme local agregar materia prima stock
2,Compra de equipo e incrementar inventario,0,0,1,0,1,0,0,0,0,0,compra de equipo e incrementar inventario,compra equipo incrementar inventario
3,Invertir en crecimiento de flotilla de unidade...,0,0,1,0,0,0,0,0,0,0,invertir en crecimiento de flotilla de unidade...,invertir crecimiento flotilla unidad carga seg...
4,Para comprar mercancía y comprar lonas nuevas,0,0,0,0,1,0,0,0,0,0,para comprar mercancia y comprar lonas nuevas,comprar mercancia comprar lona


The preprocessed text keeps only the most important words, with stopwords removed and verbs reduced to the infinitive. We will have to evaluate later how well the model performs with this preprocessing.

Automated Keyword Extraction

This approach uses word frequency and TF-IDF scores, so we need to create an automated keyword extraction step for each class.

* TF-IDF (Term Frequency-Inverse Document Frequency): Calculate TF-IDF scores for all words in each class.
* Select the top N words with the highest TF-IDF scores as keywords for each class.
* TextRank algorithm: an alternative that extracts keywords from the text of each class; it is not implemented in the code that follows, which relies on TF-IDF only.
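
To make the TF-IDF ranking concrete, here is a minimal pure-Python sketch on a made-up three-document class. It assumes the smoothed IDF formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 and L2 row normalization, which to our understanding match the defaults of scikit-learn's `TfidfVectorizer`. The helper name `tfidf_top_terms` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, n_keywords=3):
    """Rank terms by their TF-IDF score summed over all documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    totals = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        counts = Counter(tokens)
        # Raw term count times smoothed IDF (assumed formula, see the text above).
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        # L2-normalize the document vector before accumulating.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for term, weight in weights.items():
            totals[term] += weight / norm
    return [term for term, _ in totals.most_common(n_keywords)]

docs = ["pagar deudas", "pagar credito", "pagar proveedores"]
top_terms = tfidf_top_terms(docs)
print(top_terms[0])  # pagar
```

Because the scores are summed over every document of the class, a term that appears in most documents can rank first even though its IDF is low, so generic words may end up as keywords for several classes.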

In [18]:
# First, we create an automated keyword extraction process.
def extract_keywords(df, class_columns, n_keywords=5):
    keywords = defaultdict(list)
    
    for class_col in class_columns:
        # Get text for this class
        class_text = df[df[class_col] == 1]['motivos']
        
        # TF-IDF vectorization
        # Create a custom stop words list in Spanish
        spanish_stop_words = ["de", "la", "que", "el", "en", "y", "a", "los", "del", "se", "las", "por", "un", "para", "con", "no", "una", "su", "al", "lo", "como", "más", "pero", "sus", "le", "ya", "o", "este", "sí", "porque", "esta", "entre", "cuando", "muy", "sin", "sobre", "también", "me", "hasta", "hay", "donde", "quien", "desde", "todo", "nos", "durante", "todos", "uno", "les", "ni", "contra", "otros", "ese", "eso", "ante", "ellos", "e", "esto", "mí", "antes", "algunos", "qué", "unos", "yo", "otro", "otras", "otra", "él", "tanto", "esa", "estos", "mucho", "quienes", "nada", "muchos", "cual", "poco", "ella", "estar", "estas", "algunas", "algo", "nosotros", "mi", "mis", "tú", "te", "ti", "tu", "tus", "ellas", "nosotras", "vosotros", "vosotras", "os", "mío", "mía", "míos", "mías", "tuyo", "tuya", "tuyos", "tuyas", "suyo", "suya", "suyos", "suyas", "nuestro", "nuestra", "nuestros", "nuestras", "vuestro", "vuestra", "vuestros", "vuestras"]
        tfidf = TfidfVectorizer(stop_words=spanish_stop_words)

        tfidf_matrix = tfidf.fit_transform(class_text)
        
        # Get feature names and their scores
        feature_names = tfidf.get_feature_names_out()
        tfidf_scores = tfidf_matrix.sum(axis=0).A1
        
        # Sort words by TF-IDF score and select top N
        top_indices = tfidf_scores.argsort()[-n_keywords:][::-1]
        keywords[class_col] = [feature_names[i] for i in top_indices]
    
    return keywords

# Usage
class_columns = ['crec', 'cred', 'equ', 'inic', 'inv', 'mkt', 'no', 'renta', 'sueldo', 'temp']
auto_keywords = extract_keywords(df, class_columns)

In [19]:
auto_keywords

defaultdict(list,
            {'crec': ['negocio', 'capital', 'trabajo', 'sucursal', 'mas'],
             'cred': ['pagar', 'deudas', 'pago', 'negocio', 'crédito'],
             'equ': ['equipo', 'compra', 'trabajo', 'comprar', 'maquinaria'],
             'inic': ['negocio', 'poner', 'proyecto', 'quiero', 'iniciar'],
             'inv': ['compra', 'negocio', 'comprar', 'trabajo', 'capital'],
             'mkt': ['publicidad', 'negocio', 'marketing', 'compra', 'equipo'],
             'no': ['casa', 'pagar', 'gastos', 'necesito', 'comprar'],
             'renta': ['mobiliario', 'compra', 'equipo', 'local', 'negocio'],
             'sueldo': ['personal', 'pago', 'equipo', 'trabajo', 'capital'],
             'temp': ['capital',
              'trabajo',
              'clientes',
              'temporada',
              'compra']})

Feature Engineering

* Text length: Create a new feature representing the length of the original text, which might be indicative of certain classes.
* Word count: Add a feature for the number of words in each text entry after preprocessing.
* Presence of specific keywords: Create binary features for the presence of key terms related to each class (e.g., "compra" for 'inv', "publicidad" for 'mkt').

In [20]:
df

Unnamed: 0,motivos,crec,cred,equ,inic,inv,mkt,no,renta,sueldo,temp,cleaned_text,preprocessed_text
0,Crear un departamento de ventas e inversión a ...,0,0,0,0,0,1,0,0,0,0,crear un departamento de ventas e inversion a ...,crear departamento venta inversion publicidad
1,establecerme en un local y agregar materia pri...,0,0,0,0,1,0,0,1,0,0,establecerme en un local y agregar materia pri...,establecerme local agregar materia prima stock
2,Compra de equipo e incrementar inventario,0,0,1,0,1,0,0,0,0,0,compra de equipo e incrementar inventario,compra equipo incrementar inventario
3,Invertir en crecimiento de flotilla de unidade...,0,0,1,0,0,0,0,0,0,0,invertir en crecimiento de flotilla de unidade...,invertir crecimiento flotilla unidad carga seg...
4,Para comprar mercancía y comprar lonas nuevas,0,0,0,0,1,0,0,0,0,0,para comprar mercancia y comprar lonas nuevas,comprar mercancia comprar lona
...,...,...,...,...,...,...,...,...,...,...,...,...,...
17,Para incrementar mercancia en inventario,0,0,0,0,1,0,0,0,0,0,para incrementar mercancia en inventario,incrementar mercancia inventario
18,Para expancion de productibidad y compra de ma...,0,0,1,0,0,0,0,0,0,0,para expancion de productibidad y compra de ma...,expancion productibidad compra makinaria contr...
19,INVERSIÓN EN EQUIPO DE TRABAJO Y CONSUMIBLES,0,0,1,0,0,0,0,0,0,0,inversion en equipo de trabajo y consumibles,inversion equipo trabajo consumibl
20,PARA TENER LIQUIDEZ PARA COMPRA DE REFACCIONES,0,0,0,0,1,0,0,0,0,0,para tener liquidez para compra de refacciones,liquidez compra refacción


In [21]:
# Feature engineering
df['text_length'] = df['motivos'].str.len()
df['word_count'] = df['preprocessed_text'].str.split().str.len()

# For each class, set the feature to 1 if any of its keywords appears in the preprocessed text (substring match), otherwise 0.
for label, words in auto_keywords.items():
    df[f'{label}_keywords'] = df['preprocessed_text'].apply(lambda x: any(word in x for word in words)).astype(int)
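
Two details of the keyword flags are worth checking. The keywords were extracted from the raw `motivos` column (so some keep their accents, such as 'crédito'), while the match runs against `preprocessed_text`, which is accent-free. In addition, `word in x` is a substring test, so a short keyword such as 'mas' also matches inside longer words. The sketch below shows a whole-token, accent-insensitive alternative; `strip_accents`, `has_keyword`, and the sample strings are hypothetical and not part of the notebook.

```python
import unicodedata

def strip_accents(text):
    # Decompose characters and drop combining marks (e.g. "crédito" -> "credito").
    return "".join(c for c in unicodedata.normalize("NFD", text) if unicodedata.category(c) != "Mn")

def has_keyword(text, keywords):
    # Compare whole tokens, after lowercasing and accent stripping on both sides.
    tokens = set(strip_accents(text.lower()).split())
    wanted = {strip_accents(k.lower()) for k in keywords}
    return int(bool(tokens & wanted))

text = "comprar comida mascota"
substring_hit = any(word in text for word in ["mas"])   # substring match, as in the loop above
token_hit = has_keyword(text, ["mas"])                   # whole-token match
accent_hit = has_keyword("pagar credito", ["crédito"])   # accent-insensitive match
print(substring_hit, token_hit, accent_hit)  # True 0 1
```

Lemmatization can also change word forms, so matching against lemmatized keywords might be needed as well; that part is not covered by this sketch.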

In [22]:
df.columns

Index(['motivos', 'crec', 'cred', 'equ', 'inic', 'inv', 'mkt', 'no', 'renta',
       'sueldo', 'temp', 'cleaned_text', 'preprocessed_text', 'text_length',
       'word_count', 'crec_keywords', 'cred_keywords', 'equ_keywords',
       'inic_keywords', 'inv_keywords', 'mkt_keywords', 'no_keywords',
       'renta_keywords', 'sueldo_keywords', 'temp_keywords'],
      dtype='object')

This adds 10 new binary features, one keyword-presence flag per class:

['crec_keywords', 'cred_keywords', 'equ_keywords','inic_keywords', 'inv_keywords', 'mkt_keywords', 'no_keywords','renta_keywords', 'sueldo_keywords', 'temp_keywords']


In [23]:
# Assign the y labels.
y = df[target_columns].values

# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(df['preprocessed_text'])

# Combine TF-IDF features with engineered features
X_extra = df[['text_length', 'word_count', 'crec_keywords', 'cred_keywords', 'equ_keywords','inic_keywords', 'inv_keywords', 'mkt_keywords', 'no_keywords','renta_keywords', 'sueldo_keywords', 'temp_keywords']]
X = np.hstack((X_tfidf.toarray(), X_extra))

In [24]:
y

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [25]:


def split_data(X, y, test_size=0.2, val_size=0.2, random_state=42):
    # First, split into train+val and test sets
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    
    # Then split the train+val set into train and validation sets
    val_size_adjusted = val_size / (1 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=val_size_adjusted, random_state=random_state
    )
    
    return X_train, X_val, X_test, y_train, y_val, y_test

# Usage
X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y)

In [26]:
print(f"The shape of the data is:\nX_train:{X_train.shape} - y_train: {y_train.shape}\nX_test:{X_test.shape} - y_test: {y_test.shape}\nX_valid:{X_val.shape} - y_valid: {y_val.shape}")

The shape of the data is:
X_train:(4020, 5012) - y_train: (4020, 10)
X_test:(1341, 5012) - y_test: (1341, 10)
X_valid:(1340, 5012) - y_valid: (1340, 10)
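
The shapes above follow from the two-step split: 20% of all rows go to the test set, and the validation fraction is rescaled to 0.2 / (1 - 0.2) = 0.25 of what remains, which gives roughly a 60/20/20 split. The arithmetic can be reproduced with the short sketch below; `split_sizes` is a hypothetical helper, and it assumes that fractional test sizes are rounded up, which is consistent with the shapes printed above.

```python
import math

def split_sizes(n_samples, test_size=0.2, val_size=0.2):
    # Assumption: the held-out count is rounded up and the remainder goes to training.
    n_test = math.ceil(test_size * n_samples)
    n_train_val = n_samples - n_test
    # Rescale the validation fraction so it refers to the original dataset size.
    val_size_adjusted = val_size / (1 - test_size)
    n_val = math.ceil(val_size_adjusted * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

sizes = split_sizes(6701)
print(sizes)  # (4020, 1340, 1341)
```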


# 4. Modeling:
In this phase, various modeling techniques and algorithms are selected and applied to the dataset, and their parameters are calibrated to optimal values.

Define the Model:
We'll use a simple feedforward neural network with dropout layers to reduce overfitting. The output layer has 10 units, one per class, with a sigmoid activation function to handle multilabel outputs.
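
The reason for pairing sigmoid outputs with binary cross-entropy is that each label is scored independently, so several labels can be active for the same text. The plain-Python sketch below illustrates this on made-up logits for three labels; the function names and numbers are illustrative only and are not taken from the trained model.

```python
import math

def sigmoid(z):
    # Squash a logit into an independent probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Mean of the per-label binary cross-entropy terms.
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

logits = [2.0, -1.0, 0.0]          # made-up scores for three labels
probs = [sigmoid(z) for z in logits]
print([round(p, 3) for p in probs])
print(round(binary_cross_entropy([1, 0, 1], probs), 4))
```

Unlike softmax, the probabilities do not have to sum to 1, which is what allows a single motivo to belong to more than one class.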

## 4.1 Model creation

In [27]:
def create_model(input_dim):
    model = Sequential()
    model.add(Dense(256, input_dim=input_dim, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(10, activation='sigmoid'))  # 10 output units for multilabel classification
    return model

model = create_model(input_dim=5012)



In [28]:
model.summary()

In [29]:
model.compile(optimizer=Adam(learning_rate=0.001), 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

## 4.2 Model Training

In [30]:
# Early stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5)

history = model.fit(X_train, y_train, 
                    epochs=100, 
                    batch_size=32, 
                    validation_data=(X_val, y_val),
                    callbacks=[early_stopping])

Epoch 1/100
126/126 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step - accuracy: 0.1744 - loss: 0.6411 - val_accuracy: 0.2918 - val_loss: 0.3423
Epoch 2/100
126/126 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.2918 - loss: 0.3626 - val_accuracy: 0.2918 - val_loss: 0.3195
Epoch 3/100
126/126 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.3130 - loss: 0.3341 - val_accuracy: 0.3970 - val_loss: 0.3012
Epoch 4/100
126/126 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.3435 - loss: 0.3089 - val_accuracy: 0.4291 - val_loss: 0.2881
Epoch 5/100
126/126 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.3694 - loss: 0.2984 - val_accuracy: 0.4433 - val_loss: 0.2782
Epoch 6/100
126/126 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.3974 - loss: 0.2845 - val_accuracy: 0.4925 - val_loss: 0.2618
Epoch 7/100
...

In [31]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')

42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6279 - loss: 0.2051
Test Loss: 0.199889674782753, Test Accuracy: 0.6569724082946777


In [32]:
predictions = model.predict(X_test)
# Convert probabilities to binary outputs (e.g., threshold at 0.5)
predicted_labels = (predictions > 0.5).astype(int)

42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step


In [33]:
predicted_labels

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
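
With ten independent sigmoid outputs, a single accuracy number can be hard to interpret, so it may help to look at multilabel-specific metrics for `predicted_labels`. The sketch below computes subset accuracy (every label of a row correct), Hamming loss (fraction of wrong label cells), and micro-averaged F1 in plain Python on made-up arrays. `multilabel_report` is a hypothetical helper; scikit-learn provides equivalent functions (`accuracy_score`, `hamming_loss`, `f1_score`).

```python
def multilabel_report(y_true, y_pred):
    """Compute subset accuracy, Hamming loss and micro-F1 for 0/1 label matrices."""
    tp = fp = fn = exact = wrong_cells = 0
    for true_row, pred_row in zip(y_true, y_pred):
        exact += int(list(true_row) == list(pred_row))
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
            wrong_cells += int(t != p)
    n_rows = len(y_true)
    n_cells = n_rows * len(y_true[0])
    denom = 2 * tp + fp + fn
    return {
        "subset_accuracy": exact / n_rows,
        "hamming_loss": wrong_cells / n_cells,
        "micro_f1": (2 * tp / denom) if denom else 0.0,
    }

# Made-up labels and predictions for three rows and three classes.
y_true_demo = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]
y_pred_demo = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
report = multilabel_report(y_true_demo, y_pred_demo)
print(report)
```

In the notebook, the same function could be called as `multilabel_report(y_test, predicted_labels)`, since iterating over NumPy arrays row by row works the same way.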

## 4.3 Model Fine-Tuning

In [34]:
# Define a function to create and compile the model with hyperparameters
def create_model(input_dim, learning_rate, dropout_rate, units_1, units_2):
    model = Sequential()
    model.add(Dense(int(units_1), input_dim=input_dim, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(int(units_2), activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(10, activation='sigmoid'))  # 10 output units for multilabel classification
    
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, 
                  loss='binary_crossentropy', 
                  metrics=['accuracy'])
    return model

In [35]:
# Define a function to be optimized by Bayesian Optimization
def train_evaluate_model(learning_rate, dropout_rate, units_1, units_2):
    # Create and compile the model
    model = create_model(input_dim=5012,
                         learning_rate=learning_rate,
                         dropout_rate=dropout_rate,
                         units_1=units_1,
                         units_2=units_2)
    
    # Early stopping to prevent overfitting
    early_stopping = EarlyStopping(monitor='val_loss', patience=5)
    
    # Train the model
    history = model.fit(X_train, y_train,
                        epochs=1000,
                        batch_size=32,
                        validation_data=(X_val, y_val),
                        callbacks=[early_stopping],
                        verbose=0)  # Set verbose to 0 for faster optimization
    
    # Return the validation accuracy of the last epoch; BayesianOptimization maximizes the target, so no sign flip is needed
    val_accuracy = history.history['val_accuracy'][-1]
    return val_accuracy
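
The objective above returns the validation accuracy of the last epoch, which with `patience=5` can be several epochs past the best one. The sketch below simulates the patience rule on a made-up list of validation losses to show the gap between the stopping epoch and the best epoch. `early_stopping_epoch` is a hypothetical helper that mimics the rule, not the Keras implementation itself.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stopping_epoch, best_epoch), both 0-based, under a patience rule."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: the best value is at epoch 5, then the loss drifts up.
val_losses = [0.34, 0.32, 0.30, 0.29, 0.28, 0.26, 0.27, 0.27, 0.28, 0.29, 0.30, 0.31]
stopped_at, best_at = early_stopping_epoch(val_losses)
print(stopped_at, best_at)  # 10 5
```

If the metric of the best epoch is preferred as the optimization target, two options are returning `max(history.history['val_accuracy'])` or passing `restore_best_weights=True` to `EarlyStopping`; either choice changes what the optimizer is maximizing, so results would not be directly comparable with the runs shown here.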

In [36]:
# Define the parameter space for Bayesian Optimization
pbounds = {
    'learning_rate': (0.0001, 0.01),
    'dropout_rate': (0.1, 0.5),
    'units_1': (64, 512),
    'units_2': (32, 256)
}

In [37]:
# Initialize Bayesian Optimization
optimizer = BayesianOptimization(
    f=train_evaluate_model,
    pbounds=pbounds,
    random_state=42,
)

# Perform optimization
optimizer.maximize(
    init_points=10,
    n_iter=30,
)

# Print the best parameters found
print("Best parameters found: ", optimizer.max)

| iter | target | dropout_rate | learning_rate | units_1 | units_2 |
|------|--------|--------------|---------------|---------|---------|
| 1    | 0.5858 | 0.2498       | 0.009512      | 391.9   | 166.1   |
| 2    | 0.6403 | 0.1624       | 0.001644      | 90.02   | 226.0   |
| 3    | 0.5201 | 0.3404       | 0.00711       | 73.22   | 249.3   |
| 4    | 0.6142 | 0.433        | 0.002202      | 145.5   | 73.08   |
| 5    | 0.609  | 0.2217       | 0.005295      | 257.5   | 97.24   |
| 6    | 0.6575 | 0.3447       | 0.001481      | 194.9   | 114.1   |
| 7    | 0.4821 | 0.2824       | 0.007873      | 153.5   | 147.2   |
| 8    | 0.6522 | 0.337        | 0.0005599     | 336.2   | 70.2    |
| 9    | 0.5806 | 0.126        | 0.009494      | 496.6   | 213.1   |
| 10   | 0.6567 | 0.2218       | 0.001067      | 370.5   | 130.6   |
| 11   | 0.6604 | 0.2453       | ...           | ...     | ...     |

In [38]:
# Let's evaluate the model
# Extract the best hyperparameters
best_params = optimizer.max['params']

# Create and compile a new model with the best hyperparameters
best_model = create_model(input_dim=5012,
                          learning_rate=best_params['learning_rate'],
                          dropout_rate=best_params['dropout_rate'],
                          units_1=int(best_params['units_1']),
                          units_2=int(best_params['units_2']))

# Train the model on the full training data (including validation data)
best_model.fit(np.concatenate((X_train, X_val)), np.concatenate((y_train, y_val)),
               epochs=1000,
               batch_size=32,
               callbacks=[early_stopping],
               verbose=0) 

<keras.src.callbacks.history.History at 0x7776d707aa70>

In [39]:
# Evaluate the model on the test dataset
test_loss, test_accuracy = best_model.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')

42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6185 - loss: 0.4368
Test Loss: 0.40732723474502563, Test Accuracy: 0.6323639154434204


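As we can see, Bayesian hyperparameter optimization did not improve the test metrics as expected: the tuned model reaches a test accuracy of about 0.632 (loss 0.407), compared with about 0.657 (loss 0.200) for the baseline model. One possible contributor is that the final fit passes no validation data, so the early-stopping callback that monitors val_loss has nothing to monitor and training may run for the full 1000 epochs.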