# <FONT COLOR = "red">***EXERCISE - LESSON 2 - MODEL MONITORING AND MAINTENANCE***</FONT>
---
---

This notebook presents solutions to the activities in module 5 of lesson 2, "*Continuous monitoring and maintenance*".

## <FONT COLOR="orange">**EXERCISE 1. PERFORMANCE TRACKING**</FONT>
---
---

This exercise implements a monitoring system that logs the metrics and error of a classification model.

To do so, we need to develop a function that analyzes drift in the test dataset and raises an alert when a significant change is detected.

In [2]:
# IMPORT COMMON LIBRARIES
import numpy as np

In [3]:
# METRICS LOGGING SYSTEM
def log_metrics(precision, err):
  with open('metrics.log', 'a') as file:
    file.write(f'Precision: {precision}, Error: {err}.\n')
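
The plain-text log above is easy to read but harder to parse later. As a minimal sketch, assuming a machine-readable history is wanted, each record could instead be appended as one JSON object per line with a UTC timestamp. The names `log_metrics_json`, `read_metrics` and the file `metrics.jsonl` are hypothetical and not part of the exercise.

```python
import json
from datetime import datetime, timezone


def log_metrics_json(precision, err, path='metrics.jsonl'):
    """Append one timestamped metrics record as a JSON line (hypothetical helper)."""
    record = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'precision': float(precision),
        'error': float(err),
    }
    with open(path, 'a') as file:
        file.write(json.dumps(record) + '\n')


def read_metrics(path='metrics.jsonl'):
    """Load every logged record back as a list of dictionaries."""
    with open(path) as file:
        return [json.loads(line) for line in file if line.strip()]
```

Calling `log_metrics_json(precision, error)` after each evaluation and `read_metrics()` later would give a list that can be plotted or checked for trends.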

In [4]:
# AUTOMATIC ALERTS
def check_alerts(precision):
  # Threshold
  if precision < .8:
    print('Alert! Precision is below the threshold.')

In [5]:
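# ANALYZE CONCEPT DRIFT
def concept_drift_analysis(data_old, data_new, threshold=0.1):
  # AN EXACT != ON FLOAT MEANS FIRES FOR ALMOST ANY TWO SAMPLES,
  # SO COMPARE THE MEAN SHIFT TO A TOLERANCE (0.1 IS AN ILLUSTRATIVE VALUE)
  if abs(np.mean(data_new) - np.mean(data_old)) > threshold:
    print('Alert! Concept drift is detected on the data distribution.')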
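
Comparing two sample means directly does not account for sampling noise. One possible refinement, sketched here under the assumption that both samples are independent draws of a single numeric feature, is to divide the mean shift by its standard error (a Welch-style statistic) and alert only when the result exceeds a chosen number of standard errors. The function names and the default `z_threshold=3.0` are illustrative assumptions, and the sketch does not guard against zero variance in both samples.

```python
import numpy as np


def mean_shift_zscore(data_old, data_new):
    """Difference in sample means divided by its standard error."""
    data_old = np.asarray(data_old, dtype=float)
    data_new = np.asarray(data_new, dtype=float)
    # Standard error of the difference between two independent sample means
    std_err = np.sqrt(data_old.var(ddof=1) / data_old.size
                      + data_new.var(ddof=1) / data_new.size)
    return (data_new.mean() - data_old.mean()) / std_err


def drift_detected(data_old, data_new, z_threshold=3.0):
    """Flag drift when the mean shift exceeds z_threshold standard errors."""
    return bool(abs(mean_shift_zscore(data_old, data_new)) > z_threshold)


rng = np.random.default_rng(0)
reference = rng.random(100)
shifted = reference + rng.random(100)      # adds roughly 0.5 to the mean

print(drift_detected(reference, reference))  # identical samples give z = 0, so False
print(drift_detected(reference, shifted))
```

This only checks a shift in the mean; changes in spread or shape would need a different statistic.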

To try out this system, I generate synthetic data.

In [10]:
# GENERATE RANDOM PRECISION AND ERROR VALUES
precision = np.random.rand()
error = np.random.rand()

# GENERATE RANDOM DATA_OLD
data_old = np.random.rand(100)

# SHIFT THE DATA_OLD DISTRIBUTION TO SIMULATE A SIGNIFICANT CHANGE
data_new = data_old + np.random.rand(100)
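
Because the values above are drawn without a seed, the precision alert fires only on runs where the random precision falls below the threshold. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator and an arbitrary seed of 42:

```python
import numpy as np

# A seeded generator makes every run produce the same synthetic values
rng = np.random.default_rng(42)  # the seed value is an arbitrary choice

precision = rng.random()
error = rng.random()

data_old = rng.random(100)
data_new = data_old + rng.random(100)  # same shift as above, now repeatable
```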

In [11]:
# LOG THE METRICS
log_metrics(precision, error)

In [12]:
# DISPLAY AUTOMATIC ALERT
check_alerts(precision)

Alert! Precision is below the threshold.


In [13]:
# CHECK FOR DRIFT IN THE DATA DISTRIBUTION
concept_drift_analysis(data_old, data_new)

Alert! Concept drift is detected on the data distribution.


## <FONT COLOR="orange">**EXERCISE 2. MAINTENANCE AND UPDATE**</FONT>
---
---

Developing scripts that automatically update the ML model with new data is very important. However, the latest update is not necessarily an improvement, so it is important to run tests that measure the update's impact on model performance. Finally, obtaining a good model also requires detecting and fixing its errors.

In [41]:
# IMPORT COMMON LIBRARIES
import numpy as np
import pandas as pd

# IMPORT DATASET
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# IMPORT MODEL LIBRARIES
from sklearn.ensemble import RandomForestClassifier

# IMPORT SAVE MODEL LIBRARY
import joblib

In [34]:
# UPDATE MODEL FUNCTION
def update_model(model, new_data):
  # TRAIN THE MODEL WITH NEW DATA
  # NOTE: fit() REFITS THE ESTIMATOR IN PLACE AND RETURNS THE SAME OBJECT
  updated_model = model.fit(new_data['X'], new_data['y'])
  # SAVE THE UPDATED MODEL
  joblib.dump(updated_model, 'updated_model.pkl')

In [38]:
# TEST THE IMPACT OF THE UPDATE
def run_test(model, test_data):
  # MEASURE THE MODEL ACCURACY ON THE TEST DATA
  performance = model.score(test_data['X'], test_data['y'])
  # COMPARE AGAINST A FIXED THRESHOLD (NOT AGAINST THE PREVIOUS MODEL'S SCORE)
  if performance < .9:
    print('Alert! Model performance has decreased with the last update.')
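
`run_test` checks the score against a fixed threshold, so it cannot tell whether the update itself made things worse. One way to measure that, sketched below, is to score the previous and the updated model on the same test set. This assumes a separate copy of the previous model was kept (for instance with `copy.deepcopy` before updating), since `fit` modifies a scikit-learn estimator in place. `accuracy`, `should_promote`, `tolerance` and the stand-in `ThresholdModel` class are hypothetical names used only for illustration; any object with a `predict` method would work.

```python
import numpy as np


def accuracy(model, data):
    """Fraction of correct predictions for any object exposing predict()."""
    predictions = np.asarray(model.predict(data['X']))
    return float(np.mean(predictions == np.asarray(data['y'])))


def should_promote(current_model, candidate_model, test_data, tolerance=0.0):
    """Accept the candidate only if it scores within `tolerance` of the current model."""
    current_score = accuracy(current_model, test_data)
    candidate_score = accuracy(candidate_model, test_data)
    return candidate_score >= current_score - tolerance


class ThresholdModel:
    """Hypothetical stand-in classifier: predicts 1 when the first feature exceeds a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, X):
        return (np.asarray(X)[:, 0] > self.cutoff).astype(int)


toy_test = {'X': np.array([[0.1], [0.4], [0.6], [0.9]]), 'y': np.array([0, 0, 1, 1])}
previous = ThresholdModel(cutoff=0.5)   # classifies all four toy records correctly
candidate = ThresholdModel(cutoff=0.3)  # misclassifies the record at 0.4

print(should_promote(previous, candidate, toy_test))  # False
print(should_promote(candidate, previous, toy_test))  # True
```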

In [48]:
# AUTO-ERROR CORRECTION
def auto_error_correction(model, data):
  # ERROR DETECTION
  errors = detect_errors(model, data)
  # ERROR FIX
  corrected_model = correct_errors(model, errors, data)
  # SAVE THE ADJUSTED MODEL
  joblib.dump(corrected_model, 'corrected_model.pkl')

In [46]:
# ERROR DETECTION
def detect_errors(model, data):
  X = data['X']
  y_true = data['y']
  y_pred = model.predict(X)
  errors_index = np.where(y_true != y_pred)[0]  # Get indices of incorrect predictions
  return errors_index

In [47]:
# CORRECT ERRORS
def correct_errors(model, errors, data):
  # FILTER DATA TO INCLUDE ONLY RECORDS WHERE ERRORS WERE DETECTED
  error_data_X = data['X'][errors]
  error_data_y = data['y'][errors]

  # RETRAIN THE MODEL ON THE SUBSET OF THE DATA WHERE ERRORS WERE DETECTED
  # NOTE: WITH THE DEFAULT warm_start=False, fit() DISCARDS THE PREVIOUS FOREST,
  # SO THE RETURNED MODEL ONLY KNOWS THE ERROR SUBSET
  corrected_model = model.fit(error_data_X, error_data_y)
  return corrected_model
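
Retraining only on the misclassified records means the model no longer sees the examples it previously got right. A possible alternative, which is not the approach used in this exercise, is to keep the full dataset and give the misclassified records a larger weight. The sketch below only builds the weight vector; `error_sample_weights` and the default `boost=3.0` are hypothetical choices.

```python
import numpy as np


def error_sample_weights(n_samples, error_indices, boost=3.0):
    """Weight 1.0 for every record, `boost` for the records that were misclassified."""
    weights = np.ones(n_samples)
    weights[error_indices] = boost
    return weights


weights = error_sample_weights(6, np.array([1, 4]))
print(weights)  # [1. 3. 1. 1. 3. 1.]
```

For estimators whose `fit` accepts a `sample_weight` argument, such as `RandomForestClassifier`, the vector could then be passed as `model.fit(data['X'], data['y'], sample_weight=weights)`.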

As a proof of concept, I generate data to test each one of the functions.

In [31]:
# NUMBER OF SAMPLES AND FEATURES
n_samples = 1000
n_features = 100

# GENERATE DATA
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=50,  # Number of informative features
    n_redundant=30,    # Number of redundant features
    n_classes=2,       # Number of classes (binary classification)
    random_state=42    # For reproducibility
)

# CREATE DATASETS
data = {'X': X[:500], 'y': y[:500]}            # First 500 records
new_data = {'X': X[500:900], 'y': y[500:900]}  # Next 400 records
test_data = {'X': X[900:], 'y': y[900:]}       # Last 100 records

In [32]:
# CREATE AND TRAIN THE FIRST MODEL
model = RandomForestClassifier(random_state=42)
model.fit(data['X'], data['y'])

Next, I run the functions that update the model and test the impact of that update.

In [36]:
# UPDATE MODEL
update_model(model, new_data)

# LOAD THE UPDATED MODEL
updated_model = joblib.load('updated_model.pkl')

In [39]:
# TEST THE UPDATE IMPACT
run_test(updated_model, test_data)

Alert! Model performance has decreased with the last update.


In [49]:
# AUTO-ERROR CORRECTION
auto_error_correction(updated_model, data)