<div style="text-align: center;">
    <h1><strong>Training a ResNet Model</strong></h1>
    <h1><strong>Azure PostgreSQL Database & Azure Blob Storage</strong></h1>
</div> 

# Goal of the notebook:
#### The overarching goal of this notebook is to implement a pipeline for the custom training of <strong>ResNet models from the Keras library</strong>
#### <strong>MLflow</strong> is used to track training experiments and store the results (i.e. run parameters and artifacts)
#### In this notebook, <strong>Azure PostgreSQL Database</strong> and <strong>Azure Blob Storage</strong> are used as storage solutions

# Summary:
### 1- Import of Packages and Dependencies
### 2- Import Environment Variables
### 3- Set the parameters to get the ResNet model and build the dataset
### 4- Build the datasets
### 5- Generate a trainable model
### 6- We configure MLflow
### 7- We run the training of the Model using an MLflow experiment 

# 1- Import of Packages and Dependencies

In [1]:
import os
from dotenv import load_dotenv
from datetime import datetime
from utils.build_dataset import *
from utils.build_model import *
import mlflow
from azure.storage.blob import BlobServiceClient
import tempfile
from tensorflow.keras.callbacks import ModelCheckpoint

2025-01-02 16:07:13.200313: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-02 16:07:13.209214: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-01-02 16:07:13.319584: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-01-02 16:07:13.398530: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1735830433.452856   27643 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1735830433.47

# 2- Import Environment Variables  

In [2]:
# Load environment variables from the .env file
load_dotenv()

# Access environment variables using the os.getenv() function
# We need api_key and api_url to connect to the API and get the data
api_key = os.getenv("API_KEY")
api_url = os.getenv("API_URL")

# We need the following variables to connect to the Azure Blob Storage
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")
storage_account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
connection_string = os.getenv("AZURE_STORAGE_CONNECTION_STRING")

# We need the following variables to connect to the Azure PostgreSQL Database
pghost = os.getenv("PGHOST")
pguser = os.getenv("PGUSER")
pgport = os.getenv("PGPORT")
pgdatabase = os.getenv("PGDATABASE")
pgpassword = os.getenv("PGPASSWORD")
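
`os.getenv()` returns `None` when a variable is missing, so a typo in the `.env` file would only surface later as a failed connection. Below is a minimal sketch of an early check, assuming all ten variables read above are required. The helper name `find_missing_vars` is hypothetical and is not part of the notebook's `utils` package.

```python
import os

# Variable names read from the .env file in the cell above
REQUIRED_VARS = (
    "API_KEY", "API_URL",
    "AZURE_STORAGE_CONTAINER_NAME", "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_CONNECTION_STRING",
    "PGHOST", "PGUSER", "PGPORT", "PGDATABASE", "PGPASSWORD",
)


def find_missing_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = find_missing_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` makes configuration problems visible before any API or database call is attempted.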

# 3- Set the parameters to get the ResNet model and build the dataset

### Parameter settings for trainable model compilation
- Please set the name of the model to be used
- Can be 'ResNet101', 'ResNet101V2', 'ResNet152', 'ResNet152V2', 'ResNet50', 'ResNet50V2'
- Run the cell to obtain a ResNet model ready to be trained

In [3]:
# Uncomment the model_name you want to use

model_name = "ResNet50" 
#model_name = "ResNet50V2" 
#model_name = "ResNet101" 
#model_name = "ResNet101V2" 
#model_name = "ResNet152" 
#model_name = "ResNet152V2" 

### Parameter settings for dataset collection
- Set the start_date using the "YYYY-MM-DD" format (ex: "2020-08-01")
- Set the end_date using the "YYYY-MM-DD" format (ex: "2020-08-01")
- Set the labels as a string or a list (ex: labels = ['vine', 'grass', 'ground'] or labels = 'ground')

In [4]:
# We set the start date and end date for the training data
start_date = "2021-05-27"
end_date = "2021-06-01"

# We set a single label (i.e. 'vine', 'grass' or 'ground') or a list of labels to train the model on (i.e. ['vine', 'grass', 'ground'])
# Uncomment the line below to train on all three labels instead
# labels = ['vine', 'grass', 'ground']
labels = 'vine'
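
Since `labels` may be either a string or a list and the dates are plain strings, it can help to validate them before calling the API. The sketch below is one possible way to do so. The helper names `normalize_labels` and `validate_date_range` are hypothetical, and whether the `utils.build_dataset` functions already perform such checks is not shown in this notebook.

```python
from datetime import datetime

# Labels named in this notebook
ALLOWED_LABELS = ("vine", "grass", "ground")


def normalize_labels(labels):
    """Accept a single label or a list of labels and return a list."""
    label_list = [labels] if isinstance(labels, str) else list(labels)
    unknown = [label for label in label_list if label not in ALLOWED_LABELS]
    if unknown:
        raise ValueError(f"Unknown labels: {unknown}")
    return label_list


def validate_date_range(start_date, end_date):
    """Parse two 'YYYY-MM-DD' strings and check that start <= end."""
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end
```

For example, `normalize_labels('vine')` returns `['vine']`, and `validate_date_range("2021-05-27", "2021-06-01")` returns two `datetime` objects that are five days apart.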

# 4- Build the datasets 

#### The train and validation datasets are created in 5 steps:
##### 1- The URLs of the images are collected according to the parameters we have set (i.e. labels, start_date, end_date)
##### 2- A dataframe is created to map the sample data (df_sample_map)
##### 3- From the df_sample_map, HTTP requests are performed to collect the images and store them locally in the 'media' folder
##### 4- The dataframe is saved locally to be exported later on as an artifact
##### 5- Train and validation datasets (usable as model input) are generated with respect to the ResNet model used (i.e. preprocessing)
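
Step 5 depends on the model because Keras ships two different `preprocess_input` functions: one in `tf.keras.applications.resnet` for ResNet50/101/152 and one in `tf.keras.applications.resnet_v2` for the V2 variants. The sketch below shows how the matching function could be selected by model name. The helper names are hypothetical, and the actual logic inside `create_train_val_datasets` is not shown in this notebook and may differ.

```python
# Keras groups the six architectures into two submodules,
# each with its own preprocess_input function.
PREPROCESS_MODULES = {
    "ResNet50": "resnet",
    "ResNet101": "resnet",
    "ResNet152": "resnet",
    "ResNet50V2": "resnet_v2",
    "ResNet101V2": "resnet_v2",
    "ResNet152V2": "resnet_v2",
}


def preprocess_module_for(model_name):
    """Return the tf.keras.applications submodule name for a model."""
    try:
        return PREPROCESS_MODULES[model_name]
    except KeyError:
        raise ValueError(f"Unsupported model name: {model_name!r}") from None


def load_preprocess_input(model_name):
    """Return the preprocess_input function matching the model."""
    import tensorflow as tf  # imported lazily; loading TensorFlow is slow

    submodule = getattr(tf.keras.applications, preprocess_module_for(model_name))
    return submodule.preprocess_input
```

Using the wrong function would not raise an error, but the pretrained weights would then receive inputs scaled differently from what they were trained on.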

In [5]:
# We collect the image urls for the labels and the dates
image_urls = get_image_urls_with_multiple_labels(labels, start_date, end_date, api_key, api_url)

# We create a dataframe with the image urls and the labels
df_sample_map = create_sample_map(image_urls)

# We download the images and save them in the media folder
image_dir = 'media'
df_sample_map = download_images(df_sample_map, image_dir)

# We save the dataset as a .csv file
df_sample_map.to_csv("dataset_csv.csv")

# We create the train and validation datasets for the given model
train_dataset, val_dataset = create_train_val_datasets(df_sample_map,
                              image_dir = image_dir,
                              model_name = model_name,
                              )

Number of urls collected for vine: 16
Dataframe created successfully with shape : (16, 4)
Preprocess_input function for 'ResNet50' loaded successfully.


2025-01-02 16:07:31.548913: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


# 5- Generate a trainable model

#### The strategy behind the compile_new_model function (from the build_model module) can be broken down into the following steps:
##### 1- The model is imported dynamically from tf.keras.applications
##### 2- The model weights are loaded without the top layers
##### 3- Custom top layers are added with respect to the original architecture and the use case (3 classes)
##### 4- The new model is compiled and returned
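
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the code of `compile_new_model`: the function name `build_resnet_classifier`, the 224x224x3 input shape, the Adam optimizer and the categorical cross-entropy loss are all assumptions. Only the metrics (accuracy, precision, recall) are taken from the training log of this notebook.

```python
SUPPORTED_MODELS = (
    "ResNet50", "ResNet50V2", "ResNet101",
    "ResNet101V2", "ResNet152", "ResNet152V2",
)


def build_resnet_classifier(model_name, num_classes=3, weights="imagenet",
                            input_shape=(224, 224, 3)):
    """Hypothetical sketch: ResNet backbone with a new classification head."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported model name: {model_name!r}")

    import tensorflow as tf  # imported lazily; loading TensorFlow is slow

    # Step 1: look up the model constructor by name
    base_class = getattr(tf.keras.applications, model_name)

    # Step 2: load the weights without the original classification head
    base_model = base_class(include_top=False, weights=weights,
                            input_shape=input_shape)

    # Step 3: add a head that mirrors the original one
    # (global average pooling followed by a softmax layer)
    x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs=base_model.input, outputs=outputs)

    # Step 4: compile (optimizer and loss are assumptions)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=[
            "accuracy",
            tf.keras.metrics.Precision(name="precision"),
            tf.keras.metrics.Recall(name="recall"),
        ],
    )
    return model
```

Passing `weights=None` builds the same architecture with random initialization, which avoids downloading the ImageNet weights when only the structure needs to be inspected.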

In [6]:
# We generate the trainable model
model = compile_new_model(model_name)

Model 'ResNet50' found in tf.keras.applications.
Base_model 'ResNet50' loaded successfully.
New ResNet50 compiled successfully and is ready to be trained!


# 6- We configure MLflow

In [7]:
# Construct the Azure Blob Storage URI for the collection of artifacts
artifact_location = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net?"

# Construct the URI of the PostgreSQL backend store and set it as the MLflow tracking URI
tracking_uri=f"postgresql://{pguser}:{pgpassword}@{pghost}:{pgport}/{pgdatabase}"
mlflow.set_tracking_uri(tracking_uri)

# We instantiate the Azure Blob Storage client
blob_service_client = BlobServiceClient.from_connection_string(connection_string)

# We set the experiment name
experiment_name = "my_experiment_v2"

# Attempt to get the experiment by name
existing_experiment = mlflow.get_experiment_by_name(experiment_name)

# We check if the experiment exists and create it if it doesn't
if existing_experiment is None:
    # If the experiment doesn't exist, create it
    experiment_id = mlflow.create_experiment(
        experiment_name,
        artifact_location=artifact_location,
        tags={"version": "v1", "priority": "P1"},
    )
    print(f"Experiment '{experiment_name}' created.")
else:
    # If the experiment exists, use the existing experiment
    experiment_id = existing_experiment.experiment_id
    print(f"Experiment '{experiment_name}' already exists. Using the existing experiment.")

Experiment 'my_experiment_v2' already exists. Using the existing experiment.
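
The tracking URI above interpolates the user name and password directly. If either contains characters such as `@`, `:` or `/`, the resulting URI is parsed incorrectly. A minimal sketch of a safer construction using percent-encoding is given below. The helper name `build_tracking_uri` and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus, urlsplit


def build_tracking_uri(user, password, host, port, database):
    """Build a PostgreSQL URI with percent-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up credentials, used only to illustrate the encoding
uri = build_tracking_uri("mlflow", "p@ss:w/rd", "db.example.com", "5432", "mlflowdb")
print(uri)
# postgresql://mlflow:p%40ss%3Aw%2Frd@db.example.com:5432/mlflowdb

# The host and port are still parsed correctly after encoding
parts = urlsplit(uri)
print(parts.hostname, parts.port)
```

In this notebook the call would be `build_tracking_uri(pguser, pgpassword, pghost, pgport, pgdatabase)`, with the result passed to `mlflow.set_tracking_uri()`.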


# 7- We run the training of the Model using an MLflow experiment 

In [8]:
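# We create a temporary directory for ModelCheckpoint.
# mkdtemp() is used instead of a TemporaryDirectory context manager,
# which would delete the directory before training starts in a later cell.
temp_dir = tempfile.mkdtemp()
checkpoint_filepath = os.path.join(temp_dir, "best_model.keras")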

In [9]:
# We define the ModelCheckpoint callback
model_checkpoint = ModelCheckpoint(
    filepath=checkpoint_filepath,  # Temporary location
    monitor='val_loss',             # Metric to monitor
    save_best_only=True,            # Save only the best model
    save_weights_only=False,        # Save the entire model (architecture + weights)
    mode='min',                     # 'min' for loss
    verbose=1                       # Print saving information
)
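
With `save_best_only=True` and `mode='min'`, the callback writes the model only when the monitored value is strictly lower than the best value seen so far. The pure-Python sketch below illustrates this decision rule. It is a simplified illustration, not the Keras implementation, and the function name `epochs_that_save` is hypothetical.

```python
import math


def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a new lowest val_loss occurs."""
    best = math.inf  # the callback starts from 'inf' in 'min' mode
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved


# The first two values come from the training log of this notebook;
# the third one is made up to show an epoch without improvement.
print(epochs_that_save([0.04145, 0.00012, 0.00050]))  # -> [1, 2]
```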

In [10]:
# We set the number of epochs
number_of_epochs = 5

# Start a new MLflow run
with mlflow.start_run(experiment_id=experiment_id) as run:
    
    # Enable autologging with the Keras autolog so that the model is saved in the .keras file format
    mlflow.keras.autolog()
    
    # We train the model
    history = model.fit(
        train_dataset,
        validation_data=val_dataset,
        epochs=number_of_epochs,
        callbacks=[model_checkpoint])

    # Log other parameters    
    mlflow.log_param("model_name", model_name)
    mlflow.log_param("labels", labels)
    mlflow.log_param("start_date", start_date)
    mlflow.log_param("end_date", end_date)
    # Log the dataset as artifact
    mlflow.log_artifact("dataset_csv.csv")

2025-01-02 16:07:34.701821: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-01-02 16:07:34.935366: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Epoch 1/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 677ms/step - accuracy: 0.3718 - loss: 0.9177 - precision: 0.4508 - recall: 0.3718
Epoch 1: val_loss improved from inf to 0.04145, saving model to /tmp/tmpdv4e3fxm/best_model.keras
3/3 ━━━━━━━━━━━━━━━━━━━━ 23s 2s/step - accuracy: 0.4327 - loss: 0.8425 - precision: 0.5199 - recall: 0.4327 - val_accuracy: 1.0000 - val_loss: 0.0415 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 2/5
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 625ms/step - accuracy: 1.0000 - loss: 0.0016 - precision: 1.0000 - recall: 1.0000
Epoch 2: val_loss improved from 0.04145 to 0.00012, saving model to /tmp/tmpdv4e3fxm/best_model.keras
3/3 ━━━━━━━━━━━━━━━━━━━━ 3s 1s/step - accuracy: 1.0000 - loss: 0.0015 - precision: 1.0000 - recall: 1.0000 - val_accuracy: 1.0000 - val_loss: 1.2285e-04 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 3/5
3/3

In [11]:
# We end the run
mlflow.end_run()
