# Model Training and Evaluation

This notebook trains and evaluates four machine learning models used in
the thesis:

- Support Vector Machine (SVM)
- Random Forest
- Convolutional Neural Network (CNN)
- XGBoost

Models are evaluated using log loss as the primary metric, since it
scores the predicted class probabilities themselves rather than only the
predicted labels, and it penalizes confident wrong predictions heavily.
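
As a rough illustration of why this metric is sensitive to overconfident mistakes, the sketch below computes binary log loss by hand using only the standard library. The helper name `binary_log_loss` and the toy probabilities are assumptions made for this example; they are not values from the thesis dataset.

```python
import math


def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of the true labels (illustrative helper)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(y_true)


y_true = [1, 0, 1, 0]

# Every prediction is on the correct side, with moderate confidence.
cautious = [0.7, 0.3, 0.7, 0.3]

# Three near-certain correct predictions and one near-certain mistake (last item).
overconfident = [0.99, 0.01, 0.99, 0.99]

loss_cautious = binary_log_loss(y_true, cautious)
loss_overconfident = binary_log_loss(y_true, overconfident)
```

In this toy case the single confident mistake dominates the average, so the overconfident predictions score worse than the cautious ones even though three of the four are nearly perfect.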

In [1]:
import sys
import os

PROJECT_ROOT = os.path.abspath("..")
if PROJECT_ROOT not in sys.path:
    sys.path.append(PROJECT_ROOT)

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, classification_report
from sklearn.preprocessing import LabelEncoder
from src.utils import load_dataset

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv1D, Flatten
from tensorflow.keras.utils import to_categorical


The categorical target labels were encoded into numerical format using
Label Encoding to ensure compatibility with machine learning algorithms.

In [3]:
df = load_dataset("../data/dataset_final_pca.csv")

# Features: every column except the label
X = df.drop(columns=["target"]).values

# Encode the string class labels as integers
# (LabelEncoder assigns indices in sorted class order)
le = LabelEncoder()
y = le.fit_transform(df["target"].values)
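
To make the encoding step concrete, here is a plain-Python sketch of the mapping that label encoding produces: each distinct label receives the index of its position in the sorted list of classes, which is how `LabelEncoder` orders its `classes_` attribute. The sample label sequence is made up for illustration; only the two class names are taken from this notebook.

```python
# Toy label column (illustrative order, not the real data).
labels = ["Normal", "Contactos flojos", "Normal", "Normal", "Contactos flojos"]

# Distinct classes in sorted order, mirroring LabelEncoder.classes_.
classes = sorted(set(labels))
class_to_index = {name: i for i, name in enumerate(classes)}

# Forward mapping (like fit_transform) and inverse mapping (like inverse_transform).
encoded = [class_to_index[name] for name in labels]
decoded = [classes[i] for i in encoded]
```

Under this ordering *Contactos flojos* maps to 0 and *Normal* to 1, which is worth keeping in mind when reading the columns returned by `predict_proba`, since they follow the same class order.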

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
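
The `stratify=y` argument keeps the class proportions of the full dataset in both the training and test partitions. The sketch below shows the idea in plain Python; `stratified_split` is a hypothetical helper written for this illustration and is not how scikit-learn implements it internally.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced labels: 30 samples of class 0 and 70 of class 1.
labels = [0] * 30 + [1] * 70
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
```

Both partitions keep the 30/70 ratio of the toy labels, which matters most when one class is rare and a purely random split could leave it underrepresented in the test set.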

In [5]:
svm = SVC(kernel="rbf", probability=True, random_state=42)
svm.fit(X_train, y_train)

y_pred_svm = svm.predict(X_test)
y_proba_svm = svm.predict_proba(X_test)

svm_logloss = log_loss(y_test, y_proba_svm)
svm_logloss


0.18994861900515245

In [6]:
rf = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)

rf_logloss = log_loss(y_test, y_proba_rf)
rf_logloss

0.3337451156971589

In [7]:
xgb = XGBClassifier(
    n_estimators=300,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)

xgb.fit(X_train, y_train)

y_pred_xgb = xgb.predict(X_test)
y_proba_xgb = xgb.predict_proba(X_test)

xgb_logloss = log_loss(y_test, y_proba_xgb)
xgb_logloss

0.3877156351341244

In [8]:
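# Conv1D expects input shaped (samples, steps, channels):
# treat each feature as one step with a single channel
X_train_cnn = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test_cnn = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

# One-hot encode the integer labels for the categorical cross-entropy loss
y_train_cat = to_categorical(y_train)
y_test_cat = to_categorical(y_test)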
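
The two transformations in this cell can be pictured with nested lists. The sketch below is a plain-Python stand-in for the reshape and for one-hot encoding; `add_channel_axis` and `one_hot` are hypothetical helper names, and the tiny feature matrix is invented for the example.

```python
def add_channel_axis(rows):
    """(samples, features) -> (samples, features, 1)."""
    return [[[value] for value in row] for row in rows]


def one_hot(labels, num_classes):
    """Integer labels -> one row per sample with a single 1.0 at the class index."""
    return [[1.0 if k == y else 0.0 for k in range(num_classes)] for y in labels]


X_toy = [[0.5, -1.2, 3.0],
         [1.1, 0.0, -0.7]]   # 2 samples, 3 features
y_toy = [1, 0]

X_toy_cnn = add_channel_axis(X_toy)
y_toy_cat = one_hot(y_toy, num_classes=2)
```

Each scalar feature becomes a one-element list (the channel axis), and each integer label becomes a probability-style row that can be compared directly against the softmax output.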

In [9]:
cnn = Sequential([
    Conv1D(32, kernel_size=3, activation="relu", input_shape=(X_train.shape[1], 1)),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(y_train_cat.shape[1], activation="softmax")
])

cnn.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)
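
To get a feel for the size of this network, the sketch below works out the layer output sizes and trainable parameter counts by hand, assuming the Keras defaults used above (stride 1, `"valid"` padding, one input channel). The input width of 10 features is a hypothetical value chosen for the example, since the number of PCA components is not stated here; `cnn_summary` is likewise an illustrative helper.

```python
def conv1d_output_length(n_steps, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (n_steps - kernel_size) // stride + 1


def cnn_summary(n_features, n_classes, filters=32, kernel_size=3, dense_units=64):
    """Flattened size and per-layer parameter counts for the architecture above."""
    conv_len = conv1d_output_length(n_features, kernel_size)
    flat = conv_len * filters
    params = {
        # (kernel weights per filter + 1 bias) * number of filters, 1 input channel
        "conv1d": (kernel_size * 1 + 1) * filters,
        # (inputs + 1 bias) * units
        "dense": (flat + 1) * dense_units,
        "output": (dense_units + 1) * n_classes,
    }
    return conv_len, flat, params


# Hypothetical input width: 10 PCA features, 2 classes.
conv_len, flat, params = cnn_summary(n_features=10, n_classes=2)
total_params = sum(params.values())
```

Even at this assumed input width the network has more than sixteen thousand trainable weights, most of them in the first dense layer, which is worth keeping in mind when the training set is small.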


In [10]:
cnn.fit(
    X_train_cnn,
    y_train_cat,
    epochs=30,
    batch_size=16,
    verbose=0
)

<keras.src.callbacks.history.History at 0x168a7114d50>

In [11]:
y_proba_cnn = cnn.predict(X_test_cnn)
cnn_logloss = log_loss(y_test, y_proba_cnn)
cnn_logloss

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 140ms/step

3.2811238035660373

In [12]:
results = pd.DataFrame({
    "Model": ["SVM", "Random Forest", "XGBoost", "CNN"],
    "Log Loss": [
        svm_logloss,
        rf_logloss,
        xgb_logloss,
        cnn_logloss
    ]
})

results.sort_values("Log Loss")

| Model | Log Loss |
|---|---|
| SVM | 0.189949 |
| Random Forest | 0.333745 |
| XGBoost | 0.387716 |
| CNN | 3.281124 |


## Final Model Selection

After preprocessing, dimensionality reduction, and dataset reduction,
the classification task was limited to two classes: *Normal* and
*Contactos flojos* (loose contacts).

Under these conditions, the Support Vector Machine achieved the lowest
log loss (0.19), indicating more reliable class-probability estimates
than the other evaluated models.

Given the limited dataset size, SVM was selected as the final model: it
had the lowest log loss on the held-out test set, and kernel SVMs tend to
generalize well in small-sample settings.
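
One way to put the reported numbers in context is to compare them with an uninformative baseline. A classifier that always predicts equal probability for every class has a log loss of ln(K) for K classes, so about 0.693 in a two-class problem. The sketch below makes that comparison using the log-loss values from the results table, assuming the two-class setup described in this section applies to those runs.

```python
import math


def uniform_log_loss(n_classes):
    """Log loss of always predicting 1/n_classes for every class."""
    return math.log(n_classes)


baseline = uniform_log_loss(2)

# Log-loss values as reported in the results table of this notebook.
reported = {
    "SVM": 0.189949,
    "Random Forest": 0.333745,
    "XGBoost": 0.387716,
    "CNN": 3.281124,
}

beats_baseline = {name: loss < baseline for name, loss in reported.items()}
```

Read this way, SVM, Random Forest, and XGBoost all improve on the uniform baseline, while the CNN's value lies well above it, a pattern that is typical of confident but incorrect probability estimates.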


## Preliminary Conclusion

Under the current experimental conditions, SVM demonstrated superior
probabilistic performance. However, further experiments using a larger
dataset or additional electrical features may shift performance in favor
of more complex models.