<div style="text-align: center;">
    <h1><strong>Train ResNet Model Notebook</strong></h1>
</div> 

# Goal of the notebook:
#### The overarching goal of this notebook is to implement a pipeline for the custom training of <strong>ResNet models from the Keras library</strong>
#### <strong>MLflow</strong> is used to track training experiments and store their results (i.e. run parameters and artifacts)
#### In this notebook, <strong>Azure PostgreSQL Database</strong> and <strong>Azure Blob Storage</strong> are used as storage solutions

# Summary:
### 1- Import of Packages and Dependencies
### 2- Import Environment Variables
### 3- Set the parameters to get the resnet model and build the dataset
### 4- Build the datasets
### 5- Generate a trainable model
### 6- We configure MLflow
### 7- We run the training of the Model using an MLflow experiment
### 8- We save the trained model

# 1- Import of Packages and Dependencies

In [1]:
import os
import tempfile
from datetime import datetime

from dotenv import load_dotenv
import mlflow
import tensorflow as tf  # used explicitly in the model-loading cells at the end
from azure.storage.blob import BlobServiceClient
from tensorflow.keras.callbacks import ModelCheckpoint

# Project helpers: dataset builders (section 4) and model builders (section 5)
from utils.build_dataset import *
from utils.build_model import *

2025-01-02 12:04:42.892054: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-02 12:04:42.892539: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-01-02 12:04:42.894885: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-01-02 12:04:42.901600: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1735815882.913234   27551 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1735815882.91

# 2- Import Environment Variables  

In [2]:
# Load environment variables from the .env file
load_dotenv()

# Access environment variables using the os.getenv() function
# We need api_key and api_url to connect to the API and get the data
api_key = os.getenv("API_KEY")
api_url = os.getenv("API_URL")

# We need the following variables to connect to the Azure Blob Storage
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")
storage_account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
connection_string = os.getenv("AZURE_STORAGE_CONNECTION_STRING")

# We need the following variables to connect to the Azure PostgreSQL Database
pghost = os.getenv("PGHOST")
pguser = os.getenv("PGUSER")
pgport = os.getenv("PGPORT")
pgdatabase = os.getenv("PGDATABASE")
pgpassword = os.getenv("PGPASSWORD")
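
`os.getenv()` returns `None` for a variable that is missing from the `.env` file, and the problem then only surfaces much later (for example when the MLflow tracking URI is built). A minimal sketch of an early check is shown below; the helper name `find_missing_vars` is hypothetical, and treating every variable as mandatory is an assumption.

```python
import os

# Names taken from the cell above; treating all of them as mandatory is an assumption.
REQUIRED_VARS = [
    "API_KEY", "API_URL",
    "AZURE_STORAGE_CONTAINER_NAME", "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_CONNECTION_STRING",
    "PGHOST", "PGUSER", "PGPORT", "PGDATABASE", "PGPASSWORD",
]


def find_missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` reports every missing name at once instead of failing on the first one.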

# 3- Set the parameters to get the resnet model and build the dataset

### Parameter settings for trainable model compilation
- Please set the name of the model to be used
- It can be 'ResNet101', 'ResNet101V2', 'ResNet152', 'ResNet152V2', 'ResNet50' or 'ResNet50V2'
- Run the cell to select the ResNet model to be trained

In [3]:
# Uncomment the model_name you want to use

model_name = "ResNet50" 
#model_name = "ResNet50V2" 
#model_name = "ResNet101" 
#model_name = "ResNet101V2" 
#model_name = "ResNet152" 
#model_name = "ResNet152V2" 

### Parameter settings for dataset collection
- Set the start_date using the "YYYY-MM-DD" format (e.g. "2020-08-01")
- Set the end_date using the "YYYY-MM-DD" format (e.g. "2020-08-01")
- Set the labels as a string or a list (e.g. labels = ['vine', 'grass', 'ground'] or labels = 'ground')

In [4]:
# We set the start date and end date for the training data
start_date = "2021-05-27"
end_date = "2021-06-01"

# We set a single label (e.g. 'vine', 'grass' or 'ground') or the list of labels we want to train the model on (e.g. ['vine', 'grass', 'ground'])
labels = ['vine', 'grass', 'ground']
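
Since the dates are plain strings and the labels can be either a string or a list, a small validation step can catch typos before any API request is sent. The sketch below is only an illustration: `check_dataset_params` and `ALLOWED_LABELS` are hypothetical names, and restricting the labels to the three mentioned in this notebook is an assumption.

```python
from datetime import datetime

ALLOWED_LABELS = {"vine", "grass", "ground"}  # labels named in this notebook (assumed exhaustive)


def check_dataset_params(start_date, end_date, labels):
    """Validate the dates and labels; return the labels as a list."""
    # strptime raises ValueError if a date does not follow YYYY-MM-DD
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    if start > end:
        raise ValueError("start_date must not be after end_date")
    # Accept a single label given as a string as well as a list of labels
    label_list = [labels] if isinstance(labels, str) else list(labels)
    unknown = set(label_list) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return label_list
```

For example, `check_dataset_params(start_date, end_date, labels)` returns the label list unchanged for the values set above.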

# 4- Build the datasets 

#### The train and validation datasets are created following 5 steps:
##### 1- The urls of the images are collected according to the parameters we have set (i.e. labels, start_date, end_date)
##### 2- A dataframe is created in order to map the data of the samples (df_sample_map)
##### 3- From the df_sample_map, HTTP requests are performed to collect the images and store them locally in the 'media' folder
##### 4- The dataframe is saved locally to be exported later on as an artifact
##### 5- Train and validation datasets (usable as model input) are generated with respect to the ResNet model used (i.e. preprocessing)

In [5]:
# We collect the image urls for the labels and the dates
image_urls = get_image_urls_with_multiple_labels(labels, start_date, end_date, api_key, api_url)

# We create a dataframe with the image urls and the labels
df_sample_map = create_sample_map(image_urls)

# We download the images and save them in the media folder
image_dir = 'media'
df_sample_map = download_images(df_sample_map, image_dir)

# We save the dataset as a .csv file
df_sample_map.to_csv("dataset_csv.csv")

# We create the train and validation datasets for the given model
train_dataset, val_dataset = create_train_val_datasets(
    df_sample_map,
    image_dir=image_dir,
    model_name=model_name,
)

Number of urls collected for vine: 16
Number of urls collected for grass: 5
Number of urls collected for ground: 13
Dataframe created successfully with shape : (34, 4)
Preprocess_input function for 'ResNet50' loaded successfully.
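
With so few images per label (16, 5 and 13 in the output above), a purely random split can leave a label out of the validation set. The sketch below shows one way a per-label (stratified) split could be done in plain Python. It is not how `create_train_val_datasets` works internally, which this notebook does not show; the function name, the `(path, label)` tuple layout and the 20% validation fraction are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.2, seed=42):
    """Split (path, label) pairs so every label with at least two samples is in both subsets."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Aim for at least one validation sample per label, but never take all of them
        n_val = max(1, round(len(items) * val_fraction))
        n_val = min(n_val, len(items) - 1)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```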


2025-01-02 12:05:15.634252: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


# 5- Generate a trainable model

#### The strategy behind the compile_new_model function (from the build_model module) can be broken down into these steps:
##### 1- The model is imported dynamically from tf.keras.applications
##### 2- The model weights are imported without the top layers
##### 3- Custom top layers are added with respect to the original architecture and the use case (3 classes)
##### 4- The new model is compiled and returned

In [6]:
# We generate the trainable model
model = compile_new_model(model_name)

Model 'ResNet50' found in tf.keras.applications.
Base_model 'ResNet50' loaded successfully.
New ResNet50 compiled successfully and is ready to be trained!
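
The four steps listed for `compile_new_model` can be sketched as follows. This is not the code of the `build_model` module, which is not shown in this notebook: the function name `build_resnet_classifier`, the pooling and dense head, the optimizer and the loss are assumptions chosen for illustration. The 224x224x3 input shape matches the signature reported when the model is exported later in this notebook.

```python
SUPPORTED_MODELS = (
    "ResNet50", "ResNet50V2", "ResNet101", "ResNet101V2", "ResNet152", "ResNet152V2",
)


def build_resnet_classifier(model_name, num_classes=3,
                            input_shape=(224, 224, 3), weights="imagenet"):
    """Build and compile a ResNet backbone with a small classification head."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model name: {model_name!r}")

    # Imported lazily so the name check above works without TensorFlow installed
    import tensorflow as tf

    # 1- Dynamic lookup of the model class in tf.keras.applications
    base_cls = getattr(tf.keras.applications, model_name)
    # 2- Load the backbone without its original top layers
    base_model = base_cls(include_top=False, weights=weights, input_shape=input_shape)

    # 3- Add a custom head (assumed design: global pooling + softmax layer)
    inputs = tf.keras.Input(shape=input_shape)
    x = base_model(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)

    # 4- Compile (optimizer and loss are assumptions; the loss expects integer labels)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Passing `weights=None` builds the same architecture with random weights and avoids downloading the ImageNet weights, which can be handy for a quick shape check.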


# 6- We configure MLflow

In [7]:
# Construct the Azure Blob Storage URI for the collection of artifacts
artifact_location = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net?"

# Construct the PostgreSQL URI of the backend store and set it as the MLflow tracking URI
tracking_uri=f"postgresql://{pguser}:{pgpassword}@{pghost}:{pgport}/{pgdatabase}"
mlflow.set_tracking_uri(tracking_uri)

# We instantiate the Azure Blob Storage client
blob_service_client = BlobServiceClient.from_connection_string(connection_string)

# We set the experiment name
experiment_name = "my_experiment"

# Attempt to get the experiment by name
existing_experiment = mlflow.get_experiment_by_name(experiment_name)

# We check if the experiment exists and create it if it doesn't
if existing_experiment is None:
    # If the experiment doesn't exist, create it
    experiment_id = mlflow.create_experiment(
        experiment_name,
        artifact_location=artifact_location,
        tags={"version": "v1", "priority": "P1"},
    )
    print(f"Experiment '{experiment_name}' created.")
else:
    # If the experiment exists, use the existing experiment
    experiment_id = existing_experiment.experiment_id
    print(f"Experiment '{experiment_name}' already exists. Using the existing experiment.")

Experiment 'my_experiment' already exists. Using the existing experiment.
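
One pitfall with the tracking URI built in the cell above: if the database user or password contains characters such as `@`, `:` or `/`, the resulting `postgresql://` URI is ambiguous. A minimal sketch of a safer construction, percent-encoding the credentials with the standard library, is given below; the helper name `build_tracking_uri` is hypothetical.

```python
from urllib.parse import quote_plus


def build_tracking_uri(user, password, host, port, database):
    """Build a PostgreSQL URI with percent-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )
```

It could be used as `mlflow.set_tracking_uri(build_tracking_uri(pguser, pgpassword, pghost, pgport, pgdatabase))`.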


# 7- We run the training of the Model using an MLflow experiment 

In [8]:
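# We create a temporary directory for ModelCheckpoint.
# tempfile.mkdtemp() is used instead of a TemporaryDirectory context manager:
# the context manager deletes the directory as soon as its block ends,
# that is, before the training in the following cells can write to it.
temp_dir = tempfile.mkdtemp()
checkpoint_filepath = os.path.join(temp_dir, "best_model.keras")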

In [9]:
# We define the ModelCheckpoint callback
model_checkpoint = ModelCheckpoint(
    filepath=checkpoint_filepath,  # Temporary location
    monitor='val_loss',             # Metric to monitor
    save_best_only=True,            # Save only the best model
    save_weights_only=False,        # Save the entire model (architecture + weights)
    mode='min',                     # 'min' for loss
    verbose=1                       # Print saving information
)
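
With `monitor='val_loss'`, `mode='min'` and `save_best_only=True`, the callback keeps the model from the epoch with the lowest validation loss. The same selection can be reproduced afterwards from the `history.history` dictionary returned by `model.fit`. The helper below is a hypothetical illustration, not part of Keras.

```python
def best_epoch(history, monitor="val_loss", mode="min"):
    """Return (1-based epoch number, value) of the best monitored metric."""
    values = history[monitor]
    pick = min if mode == "min" else max
    # On ties, min/max return the first occurrence, i.e. the earliest epoch
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]
```

After training, `best_epoch(history.history)` tells which epoch the saved `best_model.keras` file corresponds to.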

In [10]:
# We set the number of epochs
number_of_epochs = 5

# Start a new MLflow run
with mlflow.start_run(experiment_id=experiment_id) as run:
    
    # Enable Keras autologging so that the training metrics and the model (in the .keras file format) are logged to MLflow
    mlflow.keras.autolog()
    
    # We train the model
    history = model.fit(
        train_dataset,
        validation_data=val_dataset,
        epochs=number_of_epochs,
        callbacks=[model_checkpoint])

    # Log other parameters    
    mlflow.log_param("model_name", model_name)
    mlflow.log_param("labels", labels)
    mlflow.log_param("start_date", start_date)
    mlflow.log_param("end_date", end_date)
    # Log the dataset as artifact
    mlflow.log_artifact("dataset_csv.csv")

2025-01-02 12:05:18.643264: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-01-02 12:05:18.865840: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Epoch 1/5
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 786ms/step - accuracy: 0.4300 - loss: 5.7590
Epoch 1: val_loss improved from inf to 0.21295, saving model to /tmp/tmphiooz5cl/best_model.keras
6/6 ━━━━━━━━━━━━━━━━━━━━ 26s 1s/step - accuracy: 0.4400 - loss: 5.7577 - val_accuracy: 0.8333 - val_loss: 0.2129
Epoch 2/5
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 670ms/step - accuracy: 0.7282 - loss: 0.7956
Epoch 2: val_loss did not improve from 0.21295
6/6 ━━━━━━━━━━━━━━━━━━━━ 4s 722ms/step - accuracy: 0.7364 - loss: 0.8095 - val_accuracy: 0.3333 - val_loss: 2530.3411
Epoch 3/5
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 691ms/step - accuracy: 0.9244 - loss: 0.3640
Epoch 3: val_loss did not improve from 0.21295
6/6 ━━━━━━━━━━━━━━━━━━━━ 4s 744ms/step - accuracy: 0.9199 - loss: 0.3728 - val_accuracy: 0.6667 - val_loss: 20013.2559
Epoch

In [11]:
# We end the run (the `with` block above already closes it, so this call is only a safeguard)
mlflow.end_run()

In [12]:
stop

NameError: name 'stop' is not defined

In [12]:
# Train the model
number_of_epochs = 3
history = model.fit(train_dataset, validation_data=val_dataset, epochs=number_of_epochs)

Epoch 1/3
6/6 ━━━━━━━━━━━━━━━━━━━━ 28s 1s/step - accuracy: 0.3907 - loss: 3.1763 - val_accuracy: 0.8333 - val_loss: 5.9545
Epoch 2/3
6/6 ━━━━━━━━━━━━━━━━━━━━ 4s 712ms/step - accuracy: 0.7013 - loss: 1.9444 - val_accuracy: 0.8333 - val_loss: 28.8344
Epoch 3/3
6/6 ━━━━━━━━━━━━━━━━━━━━ 4s 736ms/step - accuracy: 0.8349 - loss: 0.5273 - val_accuracy: 0.3333 - val_loss: 5007.9819




In [41]:
pred = model.predict(val_dataset)

1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step


In [42]:
pred

array([[0.34405488, 0.38155767, 0.27438748],
       [0.39924031, 0.41176608, 0.18899363]], dtype=float32)
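
Each row of `pred` holds one probability per class, so the predicted class is the position of the largest value in the row. The sketch below turns such rows into label names. The mapping from index to label is an assumption (the order of the `labels` list), since this notebook does not show how the dataset helpers encode the classes, and `probabilities_to_labels` is a hypothetical helper.

```python
CLASS_NAMES = ["vine", "grass", "ground"]  # assumed index order, not confirmed by the notebook


def probabilities_to_labels(probabilities, class_names=CLASS_NAMES):
    """Return (label, probability) for the most likely class of each row."""
    decoded = []
    for row in probabilities:
        # Index of the largest probability in the row (argmax)
        index = max(range(len(row)), key=row.__getitem__)
        decoded.append((class_names[index], row[index]))
    return decoded
```

For the two rows printed above, the largest value sits at index 1 in both cases.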

# 8- We save the trained model

- The model is saved in the "./model_archive/" folder
- The name of the saved model follows the {model_name}_trained_model_{datetime_stamp} pattern (ex: ResNet152V2_trained_model_20241221-174855)
- Where the "model_name" is the name of the ResNet model used as a base model
- Where the "datetime_stamp" is the date and time at the end of the training in YYYYmmDD-HHMMSS format
- Note that `model.export` writes a TensorFlow SavedModel directory intended for inference; to get a single file with the .keras extension, use `model.save` with a path ending in ".keras"
- Run the cell to save the model

In [33]:
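# Check if the 'model_archive' folder exists; if not, create it
archive_folder = 'model_archive'
os.makedirs(archive_folder, exist_ok=True)

# Build the timestamp first: reusing double quotes inside an f-string is a syntax error before Python 3.12
datetime_stamp = datetime.now().strftime("%Y%m%d-%H%M%S")

# Export the trained model as a TensorFlow SavedModel directory
model.export(os.path.join(archive_folder, f"{model_name}_trained_model_{datetime_stamp}"))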

INFO:tensorflow:Assets written to: model_archive/ResNet152V2_trained_model_20241222-075754/assets




Saved artifact at 'model_archive/ResNet152V2_trained_model_20241222-075754'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name='keras_tensor_1749')
Output Type:
  TensorSpec(shape=(None, 3), dtype=tf.float32, name=None)
Captures:
  139787717345936: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717347856: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717348624: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717348816: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717347088: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717348240: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717350736: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717351504: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717351696: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139787717351312: TensorSpec(shape=(), dt

# Additional snippet to load the trained models 

In [36]:
os.listdir(archive_folder)

['ResNet152V2_trained_model_20241222-075754',
 'ResNet152V2_trained_model_20241222-074953.keras',
 'ResNet152V2_trained_model_20241222-074803.keras']
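
`os.listdir()` returns entries in arbitrary order, so indexing the result with `[0]` does not reliably select the most recent archive. Because the names end with a sortable `YYYYmmDD-HHMMSS` stamp, the newest entry can be picked by sorting on that stamp. The helper below is a hypothetical sketch that assumes every entry in the folder follows the naming pattern described in the previous section.

```python
import os


def timestamp_of(name):
    """Extract the trailing datetime stamp from an archive name."""
    stem = name[:-len(".keras")] if name.endswith(".keras") else name
    return stem.rsplit("_", 1)[-1]


def latest_archive(folder, suffix=""):
    """Return the path of the newest archive whose name ends with `suffix`."""
    names = [name for name in os.listdir(folder) if name.endswith(suffix)]
    if not names:
        raise FileNotFoundError(f"No archive ending with {suffix!r} in {folder!r}")
    return os.path.join(folder, max(names, key=timestamp_of))
```

`latest_archive(archive_folder, ".keras")` restricts the search to `.keras` files, while `latest_archive(archive_folder)` considers SavedModel directories as well.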

In [39]:
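# We load the exported SavedModel as an inference-only layer.
# 'serve' is the endpoint name reported by model.export() above.
# Note: os.listdir() order is arbitrary, so index 0 is not guaranteed to be the SavedModel directory.
loaded_model = tf.keras.layers.TFSMLayer(os.path.join(archive_folder, os.listdir(archive_folder)[0]), call_endpoint='serve')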

In [40]:
# We load a trained model saved in the .keras format.
# In Keras 3, load_model() only reads `.keras` files and legacy `.h5` files;
# a SavedModel directory must be loaded with TFSMLayer instead (see the previous cell).
keras_archives = sorted(f for f in os.listdir(archive_folder) if f.endswith(".keras"))
loaded_model = tf.keras.models.load_model(os.path.join(archive_folder, keras_archives[-1]))

# We verify it works
#loaded_model.summary()

In [None]:
# Define a checkpoint directory
checkpoint_dir = "path_to_directory/checkpoints"
# With save_weights_only=True, Keras 3 requires the filepath to end in ".weights.h5"
checkpoint_prefix = f"{checkpoint_dir}/ckpt.weights.h5"

# Create a callback to save model checkpoints
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True,  # Saves only weights, not the full model
    save_best_only=True,     # Saves the best model (based on validation loss)
    monitor="val_loss",      # Metric to monitor
    verbose=1
)

# Train the model with the checkpoint callback
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
    callbacks=[checkpoint_callback]
)
