## Model Development
The aim of this notebook is to use our articles dataset (≈6000 articles) to train a model that can predict the category of a given article. We will go through 5 steps:
1. Data pre-processing: We run our DataProcessingPipeline when the embedded dataset is not already stored. This pipeline cleans and embeds the articles, allowing us to capture the semantic meaning of each article.
2. Divide the dataset into 3: Train (≈68%), Cross-Validation (≈12%) and Test (20%). The test set is held out first, and the cross-validation set is 15% of the remaining 80%.
3. Hyperparameter search: Once our data is ready, we will search for hyperparameters in order to find the 'best' model.
4. Training: With the hyperparameters we found, we will train our `XGBoost` model using 'multi:softprob' in order to get the probability of each category (an article may plausibly belong to more than one category).
5. Evaluation: We will run our model on the test dataset and evaluate the results with macro- and micro-averaged metrics (accuracy, precision, recall and F1-score).

After all these steps, if the model performs as needed, we will save it as `.pkl` so we can later use it for classification.

#### Import all packages

In [4]:
import os
import sys
import ast
sys.path.append('../')  # Adjust path if src is not in notebook's parent directory

import joblib
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, log_loss

from src.data_processing.processing_pipeline import DataProcessingPipeline
from src.data_processing.steps.cleaner import Cleaner
from src.data_processing.steps.embedder import Embedder

#### Step 1 - Data Pre-Processing
The idea is that this step won't run every single time but only when we have new data or when we don't have our embedded_dataset available. 

In [None]:
# One-off fix for a dataset whose embeddings were saved as np.float64(...) strings

import ast
import re

import pandas as pd

def fix_embedding_string(s):
    # Replace np.float64(...) with the float value inside
    s = re.sub(r'np\.float64\(([^)]+)\)', r'\1', s)
    try:
        # Parse as a Python list literal; unlike eval, literal_eval never executes code
        emb = ast.literal_eval(s)
        return [float(x) for x in emb]
    except (ValueError, SyntaxError, TypeError):
        # Fallback: extract all numbers, keeping signs and exponents
        numbers = re.findall(r'[-+]?(?:\d*\.\d+|\d+)(?:[eE][-+]?\d+)?', s)
        return [float(num) for num in numbers]

df = pd.read_csv("../data/embedded_dataset.csv")
df['embedding'] = df['embedding'].apply(fix_embedding_string)
df['embedding'] = df['embedding'].apply(lambda x: str(x))
df.to_csv("embedded_dataset_fixed.csv", index=False)

In [5]:
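def parse_list_string(list_string):
    """Parse a stringified list of floats (as stored in the CSV) into a numpy array."""
    try:
        return np.array(ast.literal_eval(list_string))
    except (ValueError, SyntaxError):
        # Print only the start of the string; a full embedding is very long
        print("Malformed embedding string:", list_string[:80])
        return np.array([])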
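
As a quick sanity check of the storage format, the sketch below round-trips one embedding through the string representation used in the CSV and parses it back with `ast.literal_eval`. The vector values are made up for illustration.

```python
import ast

import numpy as np

# Hypothetical 4-dimensional embedding; real embeddings are much longer
embedding = np.array([0.12, -0.5, 3.0, 1e-05])

# Serialize the way the notebook saves embeddings: a list of plain Python floats
serialized = str([float(x) for x in embedding])

# Parse it back without executing arbitrary code
restored = np.array(ast.literal_eval(serialized))

print(serialized)
print(np.allclose(embedding, restored))
```

Converting each value with `float(x)` before calling `str` matters: with NumPy 2.x, the string form of a list of `np.float64` scalars contains `np.float64(...)` wrappers, which is a likely origin of the malformed strings handled above.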

In [None]:
file_path = "../data/embedded_dataset.csv"
is_re_run_active = False

if os.path.exists(file_path) and not is_re_run_active:
    print(f"Loading '{file_path}' from repo...")
    df = pd.read_csv(file_path, converters={'embedding': parse_list_string})
else:
    print("Embeddings dataset was not found or a re-run was requested.")
    print("Loading dataset and running pre-processing...")

    # Load our dataset for pre-processing
    df = pd.read_csv("../data/dataset.csv")

    # Variables for pre-processing pipeline
    MODEL = "jina/jina-embeddings-v2-base-es:latest"
    CHUNK_SIZE = 3000
    CHUNK_OVERLAP = 450
    THRESHOLD = 15000

    # Creating instance of pipeline
    pipeline = DataProcessingPipeline(steps=[
        Cleaner(step_name="Cleaner"),
        Embedder(model_name=MODEL,
                 chunk_size=CHUNK_SIZE,
                 chunk_overlap=CHUNK_OVERLAP,
                 threshold=THRESHOLD,
                 step_name="Embedder")
    ])

    # Run pipeline
    df = pipeline.run(df=df)

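    # Save results as embeddings dataset.
    # Serialize a copy so that df keeps its numeric embeddings for the next steps.
    df_to_save = df.copy()
    df_to_save['embedding'] = df_to_save['embedding'].apply(lambda emb: str([float(x) for x in emb]))
    df_to_save.to_csv(file_path, index=False)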

#### Step 2 - Create Train/CV/Test splits
We are going to split our dataset into train, cross-validation and test sets, as mentioned above. In addition, we will encode our categories as integers (0...N-1).

In [7]:
X_train_temp, X_test, y_train_temp, y_test = train_test_split(
    df["embedding"], df["category"],
    test_size=0.2,
    shuffle=True,
    stratify=df["category"],
    random_state=42
)

X_train, X_cv, y_train, y_cv = train_test_split(
    X_train_temp, y_train_temp,
    test_size=0.15,
    shuffle=True,
    stratify=y_train_temp,
    random_state=42
)
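
Because the second `test_size=0.15` applies to the 80% left after the first split, the resulting proportions are worth checking. The sketch below repeats both splits on synthetic data; the sample count and the number of classes (8) are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 6000 samples, 8 balanced classes (assumed)
n_samples = 6000
X = np.arange(n_samples).reshape(-1, 1)
y = np.arange(n_samples) % 8

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_temp, y_temp, test_size=0.15, stratify=y_temp, random_state=42
)

# Report each split as a fraction of the whole dataset
for name, part in [("train", X_train), ("cv", X_cv), ("test", X_test)]:
    print(f"{name}: {len(part)} samples ({len(part) / n_samples:.0%})")
```

The fractions come out at roughly 68% / 12% / 20%. If an exact 70/10/20 split is wanted, the second call would need `test_size=0.125`, since 0.125 × 0.8 = 0.10.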

##### Encode labels to be numerical (0...N-1)

In [8]:
label_encoder = LabelEncoder()
# Fit the encoder on the training labels only, then reuse it for the CV and test labels
y_train = label_encoder.fit_transform(y_train)
y_cv = label_encoder.transform(y_cv)
y_test = label_encoder.transform(y_test)
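
A small illustration of what the encoder does, and of how integer predictions can be mapped back to category names later on. The category names here are hypothetical and may not match the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category names; the real dataset's categories may differ
categories = ["sports", "politics", "economy", "sports", "culture"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)

# classes_ holds the sorted unique labels; position i corresponds to integer i
print(list(encoder.classes_))
print(list(encoded))

# Map integer predictions back to category names
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

This mapping is the reason the fitted encoder is saved alongside the model: without it, the integer outputs of the classifier cannot be translated back into categories.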

##### Save label encoder model

In [None]:
joblib.dump(label_encoder, "../models/label_encoder.pkl")

##### Convert embedding column to 2D numpy array

In [6]:
X_train = np.stack(X_train)
X_cv = np.stack(X_cv)
X_test = np.stack(X_test)
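
For reference, a minimal sketch of what `np.stack` does with a column of per-article vectors, and how it fails when one row has a different length (for example an empty result from a failed parse). The 3-dimensional vectors are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical Series where each cell holds a 3-dimensional embedding
embeddings = pd.Series([
    np.array([0.1, 0.2, 0.3]),
    np.array([0.4, 0.5, 0.6]),
])

# Stack the per-row vectors into one 2D array
matrix = np.stack(embeddings)
print(matrix.shape)  # (n_samples, embedding_dim)

# A row with a different length makes np.stack raise a ValueError
try:
    np.stack([np.array([0.1, 0.2, 0.3]), np.array([])])
except ValueError as err:
    print("Cannot stack:", err)
```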

#### Step 3 - Hyperparameter tuning
We first train a basic version of the XGBoost model to see the performance without tuning. After that, we search for hyperparameters using RandomizedSearchCV.

In [None]:
# Train base model with default hyperparameters
xgb_classifier = xgb.XGBClassifier(objective="multi:softprob", random_state=42)
xgb_classifier.fit(X_train, y_train)

# Predict using Cross-Validation set
y_pred_proba_cv = xgb_classifier.predict_proba(X_cv)
y_pred_cv = np.argmax(y_pred_proba_cv, axis=1)

print(f"Cross-Validation accuracy: {accuracy_score(y_cv, y_pred_cv):.4f}")
print(f"Macro F1 (CV): {f1_score(y_cv, y_pred_cv, average='macro'):.4f}")
print(f"Log Loss (CV): {log_loss(y_cv, y_pred_proba_cv):.4f}")
print(f"\nClassification report (CV):\n{classification_report(y_cv, y_pred_cv)}")
print(f"Confusion matrix (CV):\n{confusion_matrix(y_cv, y_pred_cv)}")
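
Since 'multi:softprob' yields one probability per category, the same output can also be used to list the runner-up categories of an article rather than only the top one. The sketch below assumes a small, made-up probability matrix in place of the real `predict_proba` output.

```python
import numpy as np

# Hypothetical predict_proba output: 2 articles, 4 categories, rows sum to 1
proba = np.array([
    [0.05, 0.55, 0.35, 0.05],
    [0.70, 0.10, 0.15, 0.05],
])

# Single-label prediction, as in the notebook
top1 = np.argmax(proba, axis=1)

# Indices of the two most probable categories per article, best first
top2 = np.argsort(proba, axis=1)[:, ::-1][:, :2]

print(top1)
print(top2)
```

Note that softmax probabilities always sum to 1 across categories, so this remains a single-label model; the second-ranked category is only a hint that an article may sit between two topics.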

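Let's start the hyperparameter search. We will create a grid containing the parameters to explore:
* `max_depth`: Maximum depth of a tree.
* `learning_rate`: Step-size shrinkage applied to each new tree's contribution; smaller values make boosting more conservative.
* `n_estimators`: Number of trees (boosting rounds) in the model.
* `subsample`: Fraction of training instances sampled for each tree.
* `reg_alpha`: L1 regularization term on the leaf weights.
* `reg_lambda`: L2 regularization term on the leaf weights.
* `min_child_weight`: Minimum sum of instance weights (Hessian) required in a child node; larger values help control over-fitting.
* `colsample_bytree`: Fraction of features (columns) sampled for each tree; the added randomness reduces over-fitting.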

In [None]:
param_grid = {
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [200, 400, 600, 800],
    "subsample": [0.6, 0.8, 1.0],
    "reg_alpha": [0, 0.001, 0.01, 0.1, 1, 10],
    "reg_lambda": [0, 0.1, 0.5, 1, 2],
    "min_child_weight": [1, 3, 5, 10],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

# Create a new XGBoost instance
xgb_model = xgb.XGBClassifier(objective="multi:softprob", random_state=42)
# Set up randomized search
randomized_search = RandomizedSearchCV(
    xgb_model,
    param_grid,
    cv=3,
    scoring="accuracy",
    n_iter=30,
    n_jobs=-1,
    verbose=2,
    random_state=42,
)
randomized_search.fit(X_train, y_train)
# Best xgb model obtained
best_xgb = randomized_search.best_estimator_

print(f"Best Parameters: {randomized_search.best_params_}")
print(f"Best Accuracy: {randomized_search.best_score_}")
# {'subsample': 0.6, 'reg_lambda': 0.1, 'reg_alpha': 0, 'n_estimators': 600, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': 0.1, 'colsample_bytree': 0.6}
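
To put `n_iter=30` in perspective, the sketch below counts how many combinations the grid above contains, using scikit-learn's `ParameterGrid`. The grid is copied from the cell above; nothing is trained here.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [200, 400, 600, 800],
    "subsample": [0.6, 0.8, 1.0],
    "reg_alpha": [0, 0.001, 0.01, 0.1, 1, 10],
    "reg_lambda": [0, 0.1, 0.5, 1, 2],
    "min_child_weight": [1, 3, 5, 10],
    "colsample_bytree": [0.6, 0.8, 1.0],
}

# Total number of distinct parameter combinations in the grid
n_combinations = len(ParameterGrid(param_grid))
n_iter = 30

print(n_combinations)
print(f"{n_iter / n_combinations:.4%} of the grid is sampled")
```

An exhaustive grid search would need 69,120 candidates, each fitted once per fold. The randomized search samples 30 of them, which with `cv=3` means 90 fits plus one final refit on the whole training set, so the 'best' model is the best of a small sample rather than of the full grid.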

##### Save model

In [None]:
joblib.dump(best_xgb, '../models/xgboost_model_v1.pkl')
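
A minimal sketch of how the two saved artifacts (model and label encoder) could be loaded later for classification. To keep it self-contained, a `LogisticRegression` on toy data stands in for the XGBoost model, and the files are written to a temporary directory; the file names, data and categories are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy data: 2-dimensional "embeddings" and hypothetical category names
X = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]])
labels = ["economy", "economy", "sports", "sports"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)
# LogisticRegression is only a lightweight stand-in for the XGBoost model
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    encoder_path = os.path.join(tmp_dir, "label_encoder.pkl")
    joblib.dump(model, model_path)
    joblib.dump(encoder, encoder_path)

    # Later, e.g. in a classification script: load both artifacts
    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(encoder_path)

# Classify a new embedding and translate the integer back to a category name
new_embedding = np.array([[0.95, 0.95]])
predicted_index = loaded_model.predict(new_embedding)
predicted_label = loaded_encoder.inverse_transform(predicted_index)
print(predicted_label[0])
```

Pickled models are generally tied to the library versions that produced them, so it is worth loading the `.pkl` files in an environment that matches the one used for training.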

#### Step 4 - Evaluation
Now that we have our 'best' model we are going to predict our test data and get the metrics in order to evaluate the performance.

In [None]:
# Predict test set using 'best' model (XGBoost)
y_pred_proba = best_xgb.predict_proba(X_test)
y_pred = np.argmax(y_pred_proba, axis=1)

print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Macro F1 (Test): {f1_score(y_test, y_pred, average='macro'):.4f}")
print(f"Log Loss (Test): {log_loss(y_test, y_pred_proba):.4f}")
print(f"\nClassification report (Test):\n{classification_report(y_test, y_pred)}")
print(f"Confusion matrix (Test):\n{confusion_matrix(y_test, y_pred)}")
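
To make the difference between micro and macro averaging concrete, here is a small example on made-up, imbalanced labels. Micro averaging pools all decisions (for single-label problems it equals accuracy), while macro averaging gives every class the same weight.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: class 0 is common, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")

print(f"micro F1: {micro:.4f}")
print(f"macro F1: {macro:.4f}")
```

One missed example of the rare class barely moves the micro score but pulls the macro score down noticeably. A macro F1 that stays close to accuracy therefore suggests that no single class is being sacrificed.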

### Conclusion
After pre-processing the dataset, training an XGBoost model and predicting with the 'best' model (obtained from the hyperparameter search), we got the following results.

Cross-Validation evaluation:
* Cross-Validation accuracy: 0.8950
* Macro F1 (CV): 0.8944
* Log Loss (CV): 0.3298

Test set evaluation:
* Test accuracy: 0.8992
* Macro F1 (Test): 0.8986
* Log Loss (Test): 0.3620

Performance is generally balanced across classes; class #0 and class #6 tend to score slightly below the rest, but not by much. We will keep trying to improve this model, but we consider an accuracy of almost 90% a great result.