This notebook measures the performance of a neural-network model built on different architectures, with or without pre-trained weights.<br>
Because GPU memory is unstable in this test environment, each experiment reloads its environment after a kernel restart to avoid crashes.<br>


# **EXPERIMENT SCRATCH**

## **Global Average Pooling Run**

In [2]:
# Please restart the kernel before running this cell to free up GPU memory
%run nn_env_a.ipynb

# Name of the MLflow run
run = "GlobalAveragePooling1D"

# Args for dataset preparation
col_name = "text"
val_split = 0.2
batch_size = 32

# Args for the model
max_tokens = 10000
seq_length = 100
embedding_dim = 16
embedding_trainable = True
epochs = 20
layers = (tf.keras.layers.GlobalAveragePooling1D(),)

2025-01-05 19:12:23.203522: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-05 19:12:24.705507: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdirectml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.so
2025-01-05 19:12:24.705573: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdxcore.so
2025-01-05 19:12:24.710590: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libd3d12.so
Dropped Escape call with ulEscapeCode : 0x03007703
Dropped Escape call with ulEscapeCode : 0x03007703
2025-01-05 19:12:25.288729: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.


Random seed set to 314
Tensorflow framework: GPU is available


In [3]:
# Create the tensorflow dataset
train_ds, val_ds, test_ds = dl.to_tensorflow_dataset(
    col_name,
    PATH_PARQUET,
    X_train_full,
    X_test_full,
    y_train,
    y_test,
    val_split,
    batch_size,
)
# Start the MLflow run & autolog
mlflow.tensorflow.autolog(checkpoint=False, log_models=True)
with mlflow.start_run(run_name=run) as active_run:
    # Create & build the model with defined parameters
    model = dl.create_tf_model(
        max_tokens, seq_length, None, embedding_dim, layers, train_ds
    )
    model.get_layer("embedding").trainable = embedding_trainable
    model.build(input_shape=(None, 1))

    # Fit the model
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        verbose=1,
        callbacks=[
            EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
        ],
    )

    # Evaluate the model
    print("Evaluate on test data")
    print("==============")
    start_time = time.time()
    loss, accuracy = model.evaluate(test_ds)
    inference_time = time.time() - start_time
    mlflow.log_metric("test_loss", loss)
    mlflow.log_metric("test_accuracy", accuracy)
    # Wall-clock time of the evaluation pass over the test set
    mlflow.log_metric("inference_time", inference_time)

    mlflow.log_param("batch_size_", batch_size)
    mlflow.log_param("validation_split_", val_split)
    # Log the additional metrics & parameters
    # mlflow.log_metrics({"val_loss": val_loss, "val_accuracy": val_accuracy, "inference_time": inference_time})
    # mlflow.log_params({"data_preparation": col_name, "test_size_ratio": test_split, "val_splits": len(val_scores["test_accuracy"])})

    # Evaluate on the test set with the model logged in MLflow
    # evaluation_data = pd.concat([X_test, y_test], axis=1).assign(predictions=y_pred)
    model_uri = f"runs:/{active_run.info.run_id}/model"
    mlflow.evaluate(
        model=model_uri,
        model_type="classifier",
        data=test_ds,
        targets=y_test,
        evaluators=None,
        evaluator_config={
            "log_model_explainability": False
        },  # Disable SHAP explanations
    )

2025-01-05 19:12:33.003355: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-05 19:12:33.004801: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (AMD Radeon RX 6700 XT)
Dropped Escape call with ulEscapeCode : 0x03007703
2025-01-05 19:12:33.091568: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-01-05 19:12:33.091604: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:37] Ignoring the value of TF_FORCE_GPU_ALLOW_GROWTH because force_memory_growth was requested by the device.
2025-01-05 19:12:33.091625: I t

Vocabulary size:  10000
Epoch 1/20
  1/479 [..............................] - ETA: 3:29 - loss: 0.6968 - binary_accuracy: 0.3125

2025-01-05 19:12:34.815253: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-01-05 19:12:34.882288: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-01-05 19:12:34.882342: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2025-01-05 19:12:34.883984: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-01-05 19:12:34.884027: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_f



2025-01-05 19:12:45.006526: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-01-05 19:12:45.027028: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-01-05 19:12:45.027069: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20


2025-01-05 19:15:27.285910: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


INFO:tensorflow:Assets written to: /tmp/tmp6z_iq5bt/model/data/model/assets
Evaluate on test data


2025/01/05 19:15:32 INFO mlflow.tracking._tracking_service.client: 🏃 View run GlobalAveragePooling1D at: http://localhost:5000/#/experiments/4/runs/486c6beca0a441f982db1ce1b3aa895e.
2025/01/05 19:15:32 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/4.


TypeError: TextVectorizer.__init__() got an unexpected keyword argument 'name'
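
The `TypeError` above is raised while the logged model is reloaded. A plausible cause, stated here as an assumption because the `TextVectorizer` source is not shown, is that the custom class does not forward extra keyword arguments to its parent constructor, while the saved configuration contains a `name` entry. The sketch below reproduces that pattern in plain Python; `BaseLayer`, `BrokenVectorizer`, and `FixedVectorizer` are hypothetical stand-ins, not the Keras API or the notebook's class.

```python
class BaseLayer:
    """Hypothetical stand-in for a framework layer that serializes its name."""

    def __init__(self, name=None):
        self.name = name or type(self).__name__.lower()

    def get_config(self):
        return {"name": self.name}

    @classmethod
    def from_config(cls, config):
        # Deserialization passes every saved key back to the constructor
        return cls(**config)


class BrokenVectorizer(BaseLayer):
    def __init__(self, max_tokens):  # no **kwargs: 'name' is rejected on reload
        super().__init__()
        self.max_tokens = max_tokens

    def get_config(self):
        return {**super().get_config(), "max_tokens": self.max_tokens}


class FixedVectorizer(BaseLayer):
    def __init__(self, max_tokens, **kwargs):  # forwards 'name' to the parent
        super().__init__(**kwargs)
        self.max_tokens = max_tokens

    def get_config(self):
        return {**super().get_config(), "max_tokens": self.max_tokens}


try:
    BrokenVectorizer.from_config(BrokenVectorizer(10000).get_config())
except TypeError as error:
    broken_error = str(error)
    print(broken_error)

restored = FixedVectorizer.from_config(FixedVectorizer(10000).get_config())
print(restored.name, restored.max_tokens)
```

If this diagnosis applies, accepting `**kwargs` in the custom layer's constructor and passing them to `super().__init__` would be the usual remedy.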

In [3]:
# Create the datasets
train_ds, val_ds, test_ds = dl.to_tensorflow_dataset(
    col_name,
    PATH_PARQUET,
    X_train_full,
    X_test_full,
    y_train,
    y_test,
    val_split,
    batch_size,
)
# Create the model
model = dl.create_tf_model(
    max_tokens, seq_length, None, embedding_dim, layers, train_ds
)
model.get_layer("embedding").trainable = embedding_trainable
model.build(input_shape=(None, 1))
mlflow.tensorflow.autolog(checkpoint=False, log_models=True)
with mlflow.start_run(run_name=run):
    # Fit the model
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        verbose=1,
        callbacks=[
            EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
        ],
    )
    mlflow.log_param("batch_size_", batch_size)
    mlflow.log_param("validation_split_", val_split)

    # Evaluate the model
    print("Evaluate on test data")
    print("==============")
    loss, accuracy = model.evaluate(test_ds)
    mlflow.log_metric("test_loss", loss)
    mlflow.log_metric("test_accuracy", accuracy)

2025-01-05 18:53:01.357576: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-01-05 18:53:01.359070: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (AMD Radeon RX 6700 XT)
Dropped Escape call with ulEscapeCode : 0x03007703
2025-01-05 18:53:01.438897: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-01-05 18:53:01.438929: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:37] Ignoring the value of TF_FORCE_GPU_ALLOW_GROWTH because force_memory_growth was requested by the device.
2025-01-05 18:53:01.438949: I t

Vocabulary size:  10000
Epoch 1/20
  1/479 [..............................] - ETA: 3:20 - loss: 0.6917 - binary_accuracy: 0.6875

2025-01-05 18:53:03.206742: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-01-05 18:53:03.258409: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-01-05 18:53:03.258456: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2025-01-05 18:53:03.260134: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-01-05 18:53:03.260174: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_f



2025-01-05 18:53:13.254556: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2025-01-05 18:53:13.274288: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-01-05 18:53:13.274329: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


2025-01-05 18:56:48.175787: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


INFO:tensorflow:Assets written to: /tmp/tmp88xal1dj/model/data/model/assets
Evaluate on test data


2025/01/05 18:56:52 INFO mlflow.tracking._tracking_service.client: 🏃 View run GlobalAveragePooling1D at: http://localhost:5000/#/experiments/4/runs/960f98dcbf494daf9085f409ec0e8a80.
2025/01/05 18:56:52 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/4.
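
Both runs above pass `EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)`; one log ends at epoch 18 and the other reaches epoch 20. The following is a minimal sketch of patience-based stopping that may help when reading those logs. It is a simplified illustration rather than the Keras implementation, and the loss values are invented for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, invented for illustration only
losses = [0.60, 0.55, 0.52, 0.53, 0.54, 0.56, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the weights kept after stopping are those of the best epoch, not those of the last epoch run.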


## **Global Max Pooling Run**

In [1]:
# Please restart the kernel before running this cell to free up GPU memory
%run environnement.ipynb

2024-11-30 14:38:37.629832: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-30 14:38:39.043850: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdirectml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.so
2024-11-30 14:38:39.043904: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdxcore.so
2024-11-30 14:38:39.048544: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libd3d12.so
Dropped Escape call with ulEscapeCode : 0x03007703
Dropped Escape call with ulEscapeCode : 0x03007703
2024-11-30 14:38:39.648404: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.


Random seed set to 314
Tensorflow framework: GPU is available
(Index(['hour', 'target', 'text', 'tokenizer with lowercase',
       'tokenizer with lowercase, handle stripping, and length reduction',
       'tokenizer with lowercase and alpha',
       'tokenizer with lowercase, alpha and emoji',
       'tokenizer with lowercase, alpha, and no stop words',
       'tokenizer with lowercase, alpha and emoji, and no stop words'],
      dtype='object'), array([2, 0, 1, 3, 4, 5, 6, 7, 8]))
<class 'pandas.core.frame.DataFrame'>
Index: 1596630 entries, 0 to 799999
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1596630 non-null  object
 1   target  1596630 non-null  int8  
dtypes: int8(1), object(1)
memory usage: 25.9+ MB
None
(12772,) (3194,) (12772,) (3194,)


In [3]:
# Define the names of the MLflow experiment and run
experiment = "nn_scratch_embedding"
run = "GlobalMaxPooling1D"

# Args for dataset preparation
col_name = "text"
val_split = 0.2
batch_size = 32

# Args for the model
max_tokens = 5000
seq_length = 100
embedding_dim = 16
embedding_trainable = True
epochs = 5
layers = (tf.keras.layers.GlobalMaxPooling1D(),)
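
The only architectural difference between this run and the previous one is the pooling layer. As a rough sketch of what the two reductions compute on one embedded sequence of shape `(seq_length, embedding_dim)`, the plain-Python functions below collapse the time axis by mean or by maximum. The function names and the toy sequence are illustrative assumptions, and masking of padded positions is ignored.

```python
def global_average_pooling_1d(sequence):
    """Average each embedding dimension over all time steps."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]


def global_max_pooling_1d(sequence):
    """Keep the largest value of each embedding dimension over all time steps."""
    return [max(column) for column in zip(*sequence)]


# Toy sequence: 3 tokens, embedding_dim = 2 (values invented for illustration)
sequence = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]

averaged = global_average_pooling_1d(sequence)
maximum = global_max_pooling_1d(sequence)
print(averaged, maximum)
```

Average pooling lets every token contribute to the sentence vector, whereas max pooling keeps only the strongest activation per dimension.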

In [4]:
# Set the tracking URI
mlflow.set_tracking_uri(URI)
# Check that the tracking server is reachable by sending it a real request
try:
    mlflow.search_experiments(max_results=1)
except Exception:
    print(f"Cannot connect to the server: {URI}. Check the server status.")
    raise
# Set the experiment (set_experiment creates it if it does not exist yet)
mlflow.set_experiment(experiment)

<Experiment: artifact_location='/home/hedredo/github/p7/mlflow/689416981458083287', creation_time=1732973943871, experiment_id='689416981458083287', last_update_time=1732973943871, lifecycle_stage='active', name='nn_scratch_embedding', tags={}>

In [7]:
# Create the datasets
train_ds, val_ds, test_ds = to_tensorflow_dataset(
    col_name, PATH_PARQUET, X_train, X_test, y_train, y_test, val_split, batch_size
)
# Create the model
model = create_tf_model(
    max_tokens, seq_length, custom_standardization, embedding_dim, layers, train_ds
)
model.get_layer("embedding").trainable = embedding_trainable
model.build(input_shape=(None, 1))
mlflow.tensorflow.autolog(checkpoint=False, log_datasets=False, log_models=True)
with mlflow.start_run(run_name=run):
    # Fit the model
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        verbose=1,
    )
    mlflow.log_param("batch_size_", batch_size)
    mlflow.log_param("validation_split_", val_split)
    # Evaluate the model
    print("Evaluate on test data")
    print("==============")
    loss, accuracy = model.evaluate(test_ds)
    mlflow.log_param("dataset_training_rows", X_train.shape[0])
    mlflow.log_param(
        "dataset_nb_features", X_train.shape[1] if len(X_train.shape) > 1 else 1
    )
    mlflow.log_param("dataset_schema", str(X_train.dtypes))
    mlflow.log_metric("test_loss", loss)
    mlflow.log_metric("test_accuracy", accuracy)

2024-11-30 14:44:00.393956: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Vocabulary size:  5000
Epoch 1/5
  9/320 [..............................] - ETA: 4s - loss: 0.6928 - binary_accuracy: 0.5104

2024-11-30 14:44:00.205966: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-30 14:44:00.254594: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-30 14:44:00.254653: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2024-11-30 14:44:00.256125: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-30 14:44:00.256161: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_f



2024-11-30 14:44:04.804892: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-30 14:44:04.826158: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-30 14:44:04.826204: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


2024-11-30 14:44:22.927136: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


INFO:tensorflow:Assets written to: /tmp/tmp9e166qd6/model/data/model/assets




Evaluate on test data


2024/11/30 14:44:26 INFO mlflow.tracking._tracking_service.client: 🏃 View run GlobalMaxPooling1D at: http://localhost:5000/#/experiments/689416981458083287/runs/b5efda2cee954ff1a88923f357bc0525.
2024/11/30 14:44:26 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/689416981458083287.


# **GLOVE**

## **Global Average Pooling Run**

In [1]:
# Please restart the kernel before running this cell to free GPU memory
%run environnement.ipynb
from gensim.models import KeyedVectors

2024-11-22 19:14:55.072454: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-22 19:14:56.631337: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdirectml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.so
2024-11-22 19:14:56.631397: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdxcore.so
2024-11-22 19:14:56.636721: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libd3d12.so
Dropped Escape call with ulEscapeCode : 0x03007703
Dropped Escape call with ulEscapeCode : 0x03007703
2024-11-22 19:14:57.231003: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.


Random seed set to 314
Tensorflow framework: GPU is available
(Index(['hour', 'target', 'text', 'tokenizer with lowercase',
       'tokenizer with lowercase, handle stripping, and length reduction',
       'tokenizer with lowercase and alpha',
       'tokenizer with lowercase, alpha and emoji',
       'tokenizer with lowercase, alpha, and no stop words',
       'tokenizer with lowercase, alpha and emoji, and no stop words'],
      dtype='object'), array([2, 0, 1, 3, 4, 5, 6, 7, 8]))
<class 'pandas.core.frame.DataFrame'>
Index: 1596630 entries, 0 to 799999
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1596630 non-null  object
 1   target  1596630 non-null  int8  
dtypes: int8(1), object(1)
memory usage: 25.9+ MB
None
(12772,) (3194,) (12772,) (3194,)


In [2]:
# Load the glove-twitter-100 model
from huggingface_hub import hf_hub_download

repo_id = "fse/glove-twitter-100"
model_file = hf_hub_download(repo_id=repo_id, filename="glove-twitter-100.model")
# Download the vectors file as well: KeyedVectors.load expects it next to the model file
vector_file = hf_hub_download(
    repo_id=repo_id, filename="glove-twitter-100.model.vectors.npy"
)
glove = KeyedVectors.load(model_file, mmap="r")
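
The loaded vectors are later passed to `create_tf_model` through `pretrained_weights=glove`. That helper is not shown here, so the following is only a sketch of one common way to turn pre-trained vectors into an embedding matrix aligned with a vocabulary. `build_embedding_matrix`, the toy vectors, and the toy vocabulary are hypothetical; a plain dictionary stands in for the `KeyedVectors` object, which supports the same `token in vectors` and `vectors[token]` lookups.

```python
def build_embedding_matrix(vocabulary, vectors, embedding_dim):
    """Return (matrix, hits): one row per vocabulary token.

    Tokens missing from the pre-trained vectors get an all-zero row.
    """
    matrix = []
    hits = 0
    for token in vocabulary:
        if token in vectors:
            matrix.append(list(vectors[token]))
            hits += 1
        else:
            matrix.append([0.0] * embedding_dim)
    return matrix, hits


# Toy stand-ins, invented for illustration only
toy_vectors = {"good": [0.1, 0.2, 0.3], "bad": [-0.1, -0.2, -0.3]}
# A TextVectorization vocabulary starts with the padding token "" and "[UNK]"
vocabulary = ["", "[UNK]", "good", "bad", "meh"]

matrix, hits = build_embedding_matrix(vocabulary, toy_vectors, embedding_dim=3)
print(len(matrix), hits)
```

The ratio `hits / len(vocabulary)` gives a quick coverage check; with `embedding_trainable = False`, tokens left at zero stay uninformative for the whole run.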

In [3]:
# Define the names of the experiment and run
experiment = "nn_glove_embeddings"
run = "GlobalAveragePooling1D"
# Args for dataset preparation
col_name = "text"
val_split = 0.2
batch_size = 32

# Args for the model
max_tokens = 5000
seq_length = 100
embedding_dim = 100
embedding_trainable = False
epochs = 3
layers = (tf.keras.layers.GlobalAveragePooling1D(),)

In [4]:
# Set the tracking URI
mlflow.set_tracking_uri(URI)
# Check that the tracking server is reachable by sending it a real request
try:
    mlflow.search_experiments(max_results=1)
except Exception:
    print(f"Cannot connect to the server: {URI}. Check the server status.")
    raise
# Set the experiment (set_experiment creates it if it does not exist yet)
mlflow.set_experiment(experiment)

<Experiment: artifact_location='mlflow-artifacts:/246698213555700691', creation_time=1732299334237, experiment_id='246698213555700691', last_update_time=1732299334237, lifecycle_stage='active', name='nn_glove_embeddings', tags={}>

In [5]:
# Create the datasets
train_ds, val_ds, test_ds = to_tensorflow_dataset(
    col_name, PATH_PARQUET, X_train, X_test, y_train, y_test, val_split, batch_size
)

# Create the model
model = create_tf_model(
    max_tokens, seq_length, embedding_dim, layers, train_ds, pretrained_weights=glove
)
# Freeze or unfreeze the embedding layer
model.get_layer("embedding").trainable = embedding_trainable
mlflow.tensorflow.autolog(checkpoint=False, log_models=True)
with mlflow.start_run(run_name=run):
    # Fit the model
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        verbose=1,
    )
    mlflow.log_param("batch_size_", batch_size)
    mlflow.log_param("validation_split_", val_split)

    # Evaluate the model
    print("Evaluate on test data")
    print("==============")
    loss, accuracy = model.evaluate(test_ds)
    mlflow.log_metric("test_loss", loss)
    mlflow.log_metric("test_accuracy", accuracy)

2024-11-22 19:15:43.453711: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-22 19:15:43.455013: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (AMD Radeon RX 6700 XT)
Dropped Escape call with ulEscapeCode : 0x03007703
2024-11-22 19:15:43.527045: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:15:43.527077: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:37] Ignoring the value of TF_FORCE_GPU_ALLOW_GROWTH because force_memory_growth was requested by the device.
2024-11-22 19:15:43.527100: I t

Vocabulary size:  5000




Epoch 1/3


2024-11-22 19:15:45.006199: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-22 19:15:45.064555: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:15:45.064633: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2024-11-22 19:15:45.066950: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:15:45.067014: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_f



2024-11-22 19:15:58.370998: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-22 19:15:58.393988: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:15:58.394042: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)


Epoch 2/3
Epoch 3/3


2024-11-22 19:16:28.861709: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


INFO:tensorflow:Assets written to: /tmp/tmpc6cww1vm/model/data/model/assets




Evaluate on test data


2024/11/22 19:16:33 INFO mlflow.tracking._tracking_service.client: 🏃 View run GlobalAveragePooling1D at: http://localhost:5000/#/experiments/246698213555700691/runs/8c97a6124a964b228811579944c9f76e.
2024/11/22 19:16:33 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/246698213555700691.


## **Global Max Pooling Run**

In [1]:
# Please restart the kernel before running this cell to free GPU memory
%run environnement.ipynb
from gensim.models import KeyedVectors

2024-11-22 19:16:54.391837: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-22 19:16:55.233833: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdirectml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.so
2024-11-22 19:16:55.233903: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdxcore.so
2024-11-22 19:16:55.238588: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libd3d12.so
Dropped Escape call with ulEscapeCode : 0x03007703
Dropped Escape call with ulEscapeCode : 0x03007703
2024-11-22 19:16:55.404099: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.


Random seed set to 314
Tensorflow framework: GPU is available
(Index(['hour', 'target', 'text', 'tokenizer with lowercase',
       'tokenizer with lowercase, handle stripping, and length reduction',
       'tokenizer with lowercase and alpha',
       'tokenizer with lowercase, alpha and emoji',
       'tokenizer with lowercase, alpha, and no stop words',
       'tokenizer with lowercase, alpha and emoji, and no stop words'],
      dtype='object'), array([2, 0, 1, 3, 4, 5, 6, 7, 8]))
<class 'pandas.core.frame.DataFrame'>
Index: 1596630 entries, 0 to 799999
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1596630 non-null  object
 1   target  1596630 non-null  int8  
dtypes: int8(1), object(1)
memory usage: 25.9+ MB
None
(12772,) (3194,) (12772,) (3194,)


In [3]:
# Load the glove-twitter-100 model
from huggingface_hub import hf_hub_download

repo_id = "fse/glove-twitter-100"
model_file = hf_hub_download(repo_id=repo_id, filename="glove-twitter-100.model")
# Download the vectors file as well: KeyedVectors.load expects it next to the model file
vector_file = hf_hub_download(
    repo_id=repo_id, filename="glove-twitter-100.model.vectors.npy"
)
glove = KeyedVectors.load(model_file, mmap="r")

In [4]:
# Define the names of the experiment and run
experiment = "nn_glove_embeddings"
run = "GlobalMaxPooling1D"
# Args for dataset preparation
col_name = "text"
val_split = 0.2
batch_size = 32

# Args for the model
max_tokens = 5000
seq_length = 100
embedding_dim = 100
embedding_trainable = False
epochs = 3
layers = (tf.keras.layers.GlobalMaxPooling1D(),)

In [5]:
# Set the tracking URI
mlflow.set_tracking_uri(URI)
# Check that the tracking server is reachable by sending it a real request
try:
    mlflow.search_experiments(max_results=1)
except Exception:
    print(f"Cannot connect to the server: {URI}. Check the server status.")
    raise
# Set the experiment (set_experiment creates it if it does not exist yet)
mlflow.set_experiment(experiment)

<Experiment: artifact_location='mlflow-artifacts:/246698213555700691', creation_time=1732299334237, experiment_id='246698213555700691', last_update_time=1732299334237, lifecycle_stage='active', name='nn_glove_embeddings', tags={}>

In [6]:
# Create the datasets
train_ds, val_ds, test_ds = to_tensorflow_dataset(
    col_name, PATH_PARQUET, X_train, X_test, y_train, y_test, val_split, batch_size
)

# Create the model
model = create_tf_model(
    max_tokens, seq_length, embedding_dim, layers, train_ds, pretrained_weights=glove
)
# Freeze or unfreeze the embedding layer
model.get_layer("embedding").trainable = embedding_trainable
mlflow.tensorflow.autolog(checkpoint=False, log_models=True)
with mlflow.start_run(run_name=run):
    # Fit the model
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        verbose=1,
    )
    mlflow.log_param("batch_size_", batch_size)
    mlflow.log_param("validation_split_", val_split)

    # Evaluate the model
    print("Evaluate on test data")
    print("==============")
    loss, accuracy = model.evaluate(test_ds)
    mlflow.log_metric("test_loss", loss)
    mlflow.log_metric("test_accuracy", accuracy)

2024-11-22 19:17:20.117409: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-22 19:17:20.118892: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (AMD Radeon RX 6700 XT)
Dropped Escape call with ulEscapeCode : 0x03007703
2024-11-22 19:17:20.180930: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:17:20.180959: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:37] Ignoring the value of TF_FORCE_GPU_ALLOW_GROWTH because force_memory_growth was requested by the device.
2024-11-22 19:17:20.180983: I t

Vocabulary size:  5000




Epoch 1/3


2024-11-22 19:17:21.667546: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-22 19:17:21.729626: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:17:21.729674: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2024-11-22 19:17:21.731446: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:17:21.731484: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_f



2024-11-22 19:17:37.965533: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-22 19:17:37.986932: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:17:37.986977: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)


Epoch 2/3
Epoch 3/3


2024-11-22 19:18:08.437812: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


INFO:tensorflow:Assets written to: /tmp/tmp1ci9kodr/model/data/model/assets




Evaluate on test data


2024/11/22 19:18:12 INFO mlflow.tracking._tracking_service.client: 🏃 View run GlobalMaxPooling1D at: http://localhost:5000/#/experiments/246698213555700691/runs/a4d4aa4b49994aff9aad8c6927d9292f.
2024/11/22 19:18:12 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/246698213555700691.


# **FASTTEXT EMBEDDINGS + CUSTOM NN**

## **Global Average Pooling Run**

In [1]:
# Please restart the kernel before running this cell to free GPU memory
%run environnement.ipynb
import fasttext

2024-11-22 19:19:04.747451: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-22 19:19:05.618176: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdirectml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.so
2024-11-22 19:19:05.618239: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdxcore.so
2024-11-22 19:19:05.623202: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libd3d12.so
Dropped Escape call with ulEscapeCode : 0x03007703
Dropped Escape call with ulEscapeCode : 0x03007703
2024-11-22 19:19:05.789649: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.


Random seed set to 314
Tensorflow framework: GPU is available
(Index(['hour', 'target', 'text', 'tokenizer with lowercase',
       'tokenizer with lowercase, handle stripping, and length reduction',
       'tokenizer with lowercase and alpha',
       'tokenizer with lowercase, alpha and emoji',
       'tokenizer with lowercase, alpha, and no stop words',
       'tokenizer with lowercase, alpha and emoji, and no stop words'],
      dtype='object'), array([2, 0, 1, 3, 4, 5, 6, 7, 8]))
<class 'pandas.core.frame.DataFrame'>
Index: 1596630 entries, 0 to 799999
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1596630 non-null  object
 1   target  1596630 non-null  int8  
dtypes: int8(1), object(1)
memory usage: 25.9+ MB
None
(12772,) (3194,) (12772,) (3194,)


In [2]:
# Load the pretrained English fastText vectors (trained on Common Crawl and Wikipedia)
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="facebook/fasttext-en-vectors", filename="model.bin"
)
fastxt = fasttext.load_model(model_path)

In [3]:
# define the name of your experiment
experiment = "nn_fasttext_embeddings"
run = "GlobalAveragePooling1D"

# Args for dataset preparation
col_name = "text"
val_split = 0.2
batch_size = 32

# Args for the model
max_tokens = 5000
seq_length = 100
embedding_dim = 300
embedding_trainable = False
epochs = 3
layers = (tf.keras.layers.GlobalAveragePooling1D(),)

In [4]:
# Create the datasets
train_ds, val_ds, test_ds = to_tensorflow_dataset(
    col_name, PATH_PARQUET, X_train, X_test, y_train, y_test, val_split, batch_size
)
# Create the model
model = create_tf_model(
    max_tokens, seq_length, embedding_dim, layers, train_ds, pretrained_weights=fastxt
)
# set the embedding layer trainable or not
model.get_layer("embedding").trainable = embedding_trainable
mlflow.tensorflow.autolog(checkpoint=False, log_models=True)
with mlflow.start_run(run_name=run):
    # Fit the model
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        verbose=1,
    )
    mlflow.log_param("batch_size_", batch_size)
    mlflow.log_param("validation_split_", val_split)

    # Evaluate the model
    print("Evaluate on test data")
    print("==============")
    loss, accuracy = model.evaluate(test_ds)
    mlflow.log_metric("test_loss", loss)
    mlflow.log_metric("test_accuracy", accuracy)

2024-11-22 19:19:14.117706: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-22 19:19:14.121834: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (AMD Radeon RX 6700 XT)
Dropped Escape call with ulEscapeCode : 0x03007703
2024-11-22 19:19:14.926604: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:19:14.926815: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:37] Ignoring the value of TF_FORCE_GPU_ALLOW_GROWTH because force_memory_growth was requested by the device.
2024-11-22 19:19:14.927028: I t

Vocabulary size:  5000




Epoch 1/3


2024-11-22 19:19:18.494021: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-22 19:19:18.629321: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:19:18.629378: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2024-11-22 19:19:18.632752: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:19:18.632809: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_f



2024-11-22 19:20:00.362353: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-22 19:20:00.385027: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-22 19:20:00.385072: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)


Epoch 2/3
Epoch 3/3


2024-11-22 19:22:57.383693: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


INFO:tensorflow:Assets written to: /tmp/tmpr1gxd5sp/model/data/model/assets




Evaluate on test data


## **Global Max Pooling Run**

In [None]:
# Please restart the kernel before running this cell to free GPU memory
%run environnement.ipynb
import fasttext



In [None]:
# Load the pretrained English fastText vectors (trained on Common Crawl and Wikipedia)
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="facebook/fasttext-en-vectors", filename="model.bin"
)
fastxt = fasttext.load_model(model_path)

In [None]:
# define the name of your experiment
experiment = "nn_fasttext_embeddings"
run = "GlobalMaxPooling1D"

# Args for dataset preparation
col_name = "text"
val_split = 0.2
batch_size = 32

# Args for the model
max_tokens = 5000
seq_length = 100
embedding_dim = 300
embedding_trainable = False
epochs = 3
layers = (tf.keras.layers.GlobalMaxPooling1D(),)
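
The two fastText runs differ only in the pooling layer placed after the embedding. As a rough illustration of what each layer computes, here is a plain-Python sketch over a toy sequence; the function names and the toy values are made up for this example and are not part of the project code. It assumes no mask is applied, so every time step (padding included) contributes to the result.

```python
def global_average_pooling_1d(sequence):
    """Mean over the time axis: one value per embedding dimension."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]


def global_max_pooling_1d(sequence):
    """Maximum over the time axis: one value per embedding dimension."""
    return [max(column) for column in zip(*sequence)]


# Toy sequence: 3 tokens, each with a 2-dimensional embedding (illustrative values)
embedded_tokens = [
    [1.0, -2.0],
    [3.0, 0.0],
    [2.0, 5.0],
]

print(global_average_pooling_1d(embedded_tokens))  # [2.0, 1.0]
print(global_max_pooling_1d(embedded_tokens))      # [3.0, 5.0]
```

Both layers collapse a `(seq_length, embedding_dim)` matrix into a single vector of size `embedding_dim`; averaging keeps a contribution from every token, while the maximum keeps only the strongest activation per dimension.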

In [None]:
# Create the datasets
train_ds, val_ds, test_ds = to_tensorflow_dataset(
    col_name, PATH_PARQUET, X_train, X_test, y_train, y_test, val_split, batch_size
)
# Create the model
model = create_tf_model(
    max_tokens, seq_length, embedding_dim, layers, train_ds, pretrained_weights=fastxt
)
# set the embedding layer trainable or not
model.get_layer("embedding").trainable = embedding_trainable
mlflow.tensorflow.autolog(checkpoint=False, log_models=True)
with mlflow.start_run(run_name=run):
    # Fit the model
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        verbose=1,
    )
    mlflow.log_param("batch_size_", batch_size)
    mlflow.log_param("validation_split_", val_split)

    # Evaluate the model
    print("Evaluate on test data")
    print("==============")
    loss, accuracy = model.evaluate(test_ds)
    mlflow.log_metric("test_loss", loss)
    mlflow.log_metric("test_accuracy", accuracy)




# **BIDIRECTIONAL LSTM**

In [None]:
# Please restart the kernel before running this cell to free GPU memory
%run environnement.ipynb

2024-11-22 18:39:58.435638: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-22 18:39:59.344145: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdirectml.d6f03b303ac3c4f2eeb8ca631688c9757b361310.so
2024-11-22 18:39:59.344209: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libdxcore.so
2024-11-22 18:39:59.349589: I tensorflow/c/logging.cc:34] Successfully opened dynamic library libd3d12.so
Dropped Escape call with ulEscapeCode : 0x03007703
Dropped Escape call with ulEscapeCode : 0x03007703
2024-11-22 18:39:59.516451: I tensorflow/c/logging.cc:34] DirectML device enumeration: found 1 compatible adapters.


Random seed set to 314
Tensorflow framework: GPU is available
(Index(['hour', 'target', 'text', 'tokenizer with lowercase',
       'tokenizer with lowercase, handle stripping, and length reduction',
       'tokenizer with lowercase and alpha',
       'tokenizer with lowercase, alpha and emoji',
       'tokenizer with lowercase, alpha, and no stop words',
       'tokenizer with lowercase, alpha and emoji, and no stop words'],
      dtype='object'), array([2, 0, 1, 3, 4, 5, 6, 7, 8]))
<class 'pandas.core.frame.DataFrame'>
Index: 1596630 entries, 0 to 799999
Data columns (total 2 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   text    1596630 non-null  object
 1   target  1596630 non-null  int8  
dtypes: int8(1), object(1)
memory usage: 25.9+ MB
None
(12772,) (3194,) (12772,) (3194,)


In [None]:
# Define the name of your experiment
experiment = "nn_bidirectionnalLSTM_embedding"

# Set the tracking URI
mlflow.set_tracking_uri(URI)
# Check that the server is reachable: get_tracking_uri() only returns the
# configured string, so issue a lightweight request instead
try:
    mlflow.search_experiments(max_results=1)
except Exception:
    print(f"Cannot connect to the server: {URI}. Check the server status.")
    raise
# Select the experiment; set_experiment creates it if it does not exist yet
mlflow.set_experiment(experiment)

<Experiment: artifact_location='mlflow-artifacts:/846190965426187584', creation_time=1731654957941, experiment_id='846190965426187584', last_update_time=1731654957941, lifecycle_stage='active', name='neural_network_SEQ_embedding', tags={}>

In [None]:
# Args for dataset preparation
col_name = "text"
val_split = 0.2
batch_size = 32
# Create the datasets
train_ds, val_ds, test_ds = prepare_tf_dataset(col_name, val_split, batch_size)

# Args for the model
max_tokens = 1000
seq_length = 100
embedding_dim = 16
embedding_trainable = True
epochs = 15
additionnal_layers = [
    (
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8)),
    ),
]
runs = ("BiderectionalLSTM",)

In [None]:
for layers, run_name in zip(additionnal_layers, runs):
    # Create the model
    model = create_tf_model(max_tokens, seq_length, embedding_dim, layers)
    model.get_layer("embedding").trainable = embedding_trainable
    mlflow.tensorflow.autolog(checkpoint=False, log_models=True)
    with mlflow.start_run(run_name=run_name):
        # Fit the model
        history = model.fit(
            train_ds,
            validation_data=val_ds,
            epochs=epochs,
            verbose=1,
        )
        mlflow.log_param("batch_size_", batch_size)
        mlflow.log_param("validation_split_", val_split)

        # Evaluate the model
        print("Evaluate on test data")
        print("==============")
        loss, accuracy = model.evaluate(test_ds)
        mlflow.log_metric("test_loss", loss)
        mlflow.log_metric("test_accuracy", accuracy)

2024-11-15 19:59:42.045555: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Vocabulary size:  1000




Epoch 1/15


2024-11-15 19:59:50.504612: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2024-11-15 19:59:50.831360: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-15 19:59:50.831405: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2024-11-15 19:59:50.832566: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-11-15 19:59:50.832599: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_f

InvalidArgumentError: Graph execution error:

No OpKernel was registered to support Op 'CudnnRNN' used by {{node CudnnRNN}} with these attrs: [T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=0, is_training=true, dropout=0, seed=0]
Registered devices: [CPU, GPU]
Registered kernels:
  <no registered kernels>

	 [[CudnnRNN]]
	 [[sequential_2/bidirectional_2/forward_lstm_2/PartitionedCall]] [Op:__inference_train_function_55753]
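
The traceback above reports that no kernel is registered for the `CudnnRNN` op on this device. Keras documents a set of conditions under which `tf.keras.layers.LSTM` selects the fused cuDNN implementation, and `LSTM(8)` with default arguments meets all of them. The sketch below restates the configuration-level conditions in plain Python; the helper name and the dictionary are hypothetical and only serve to show which arguments matter. The runtime conditions (eager execution, right-padded masks) are left out.

```python
# Defaults of tf.keras.layers.LSTM that matter for kernel selection
LSTM_DEFAULTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}


def meets_cudnn_requirements(**overrides):
    """Hypothetical helper: True if the layer arguments satisfy the
    documented configuration conditions for the fused cuDNN kernel."""
    cfg = {**LSTM_DEFAULTS, **overrides}
    return (
        cfg["activation"] == "tanh"
        and cfg["recurrent_activation"] == "sigmoid"
        and cfg["recurrent_dropout"] == 0
        and not cfg["unroll"]
        and cfg["use_bias"]
    )


print(meets_cudnn_requirements())             # True: default LSTM(8)
print(meets_cudnn_requirements(unroll=True))  # False: generic kernel is used
```

A possible workaround on a backend without cuDNN, not verified in this notebook, is to break one of these conditions (for example `unroll=True`) or to wrap `tf.keras.layers.LSTMCell(8)` in `tf.keras.layers.RNN`, which always runs the generic recurrent loop.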

# **SENTENCE TRANSFORMER**

In [None]:
from sentence_transformers import SentenceTransformer


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
# Load the x_train data aligned on cleaned text corpus
X_train, X_test, y_train, y_test = load_splits_from_parquet(
    X_train, X_test, y_train, y_test, col_name="tokenizer with lowercase"
)

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")
X_train_encoded = model.encode(X_train.values, device="cpu")
X_test_encoded = model.encode(X_test.values, device="cpu")

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train_encoded, y_train)
preds = rfc.predict(X_test_encoded)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.73      0.76      0.75      1601
           1       0.75      0.72      0.74      1593

    accuracy                           0.74      3194
   macro avg       0.74      0.74      0.74      3194
weighted avg       0.74      0.74      0.74      3194



In [None]:
logit = LogisticRegression()
logit.fit(X_train_encoded, y_train)
preds = logit.predict(X_test_encoded)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.78      0.77      0.78      1601
           1       0.77      0.79      0.78      1593

    accuracy                           0.78      3194
   macro avg       0.78      0.78      0.78      3194
weighted avg       0.78      0.78      0.78      3194



In [None]:
cols

(Index(['hour', 'target', 'text', 'tokenizer with lowercase',
        'tokenizer with lowercase, handle stripping, and length reduction',
        'tokenizer with lowercase and alpha',
        'tokenizer with lowercase, alpha and emoji',
        'tokenizer with lowercase, alpha, and no stop words',
        'tokenizer with lowercase, alpha and emoji, and no stop words'],
       dtype='object'),
 array([2, 0, 1, 3, 4, 5, 6, 7, 8]))

In [None]:
# Load the x_train data aligned on cleaned text corpus
X_train, X_test, y_train, y_test = load_splits_from_parquet(
    X_train,
    X_test,
    y_train,
    y_test,
    col_name="tokenizer with lowercase, handle stripping, and length reduction",
)

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")
X_train_encoded = model.encode(X_train.values, device="cpu")
X_test_encoded = model.encode(X_test.values, device="cpu")

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train_encoded, y_train)
preds = rfc.predict(X_test_encoded)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.74      0.74      0.74      1601
           1       0.74      0.74      0.74      1593

    accuracy                           0.74      3194
   macro avg       0.74      0.74      0.74      3194
weighted avg       0.74      0.74      0.74      3194



In [None]:
logit = LogisticRegression()
logit.fit(X_train_encoded, y_train)
preds = logit.predict(X_test_encoded)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.77      0.77      0.77      1601
           1       0.77      0.77      0.77      1593

    accuracy                           0.77      3194
   macro avg       0.77      0.77      0.77      3194
weighted avg       0.77      0.77      0.77      3194



In [None]:
from datasets import load_dataset


In [None]:
pd.concat([X_train, y_train], axis=1).to_parquet("../data/processed/X_train.parquet")
pd.concat([X_test, y_test], axis=1).to_parquet("../data/processed/X_test.parquet")

In [None]:
dataset = load_dataset(
    "parquet",
    data_files={
        "train": "../data/processed/X_train.parquet",
        "test": "../data/processed/X_test.parquet",
    },
)


In [None]:
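from transformers import AutoTokenizer
# Tokenizer matching the checkpoint that is fine-tuned below
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)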


tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=2
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [None]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",  # Output directory
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=8,  # Batch size for training
    per_device_eval_batch_size=8,  # Batch size for evaluation
    warmup_steps=500,  # Number of warmup steps
    weight_decay=0.01,  # Strength of weight decay
    logging_dir="./logs",  # Directory for storing logs
    logging_steps=10,
)

# Define the Trainer
trainer = Trainer(
    model=model,  # The instantiated 🤗 Transformers model to be trained
    args=training_args,  # Training arguments, defined above
    train_dataset=tokenized_datasets["train"],  # Training dataset
    tokenizer=tokenizer,  # Tokenizer
)

# Train the model
trainer.train()
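
The `accuracy` metric is loaded earlier but the `Trainer` above receives neither an evaluation set nor a metric function, so training reports only the loss. A minimal sketch of a metric callback follows, written in plain Python so it can be checked without any model; the helper `argmax` and the toy values are illustrative assumptions, and the accuracy is computed by hand rather than through the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a 1-D sequence of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check: hand-written logits for three examples and two classes
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.0]]
toy_labels = [0, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

To use it, one would pass `compute_metrics=compute_metrics` and `eval_dataset=tokenized_datasets["test"]` when building the `Trainer`, then call `trainer.evaluate()` after training.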