<a href="https://colab.research.google.com/github/SebastianSaldarriagaC1/os-final-project-tinyml/blob/main/TinyML02_Model_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download preprocessed dataset

First, we download the cleaned dataset produced by the [TinyML01 - Data preprocessing Notebook](https://colab.research.google.com/drive/1qHDEBMzlEsFVm5CmYwjsLib4Gr91EQuq?authuser=1#scrollTo=H-DEAkH15pCs). Noise and irrelevant information were already removed there, so the data is ready for training and the model can learn more effectively.

In [None]:
# Import training libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
!pip install gdown



In [None]:
import gdown  # installed in the previous cell; needed for gdown.download below

url = 'https://drive.google.com/uc?id=1lJWy8niBfia6uacFvtpPKLVQpn-zKQmN'
output = 'processed-earth-surface-temperature-data.csv'
gdown.download(url, output, quiet=False)

df = pd.read_csv(output)

Downloading...
From (original): https://drive.google.com/uc?id=1lJWy8niBfia6uacFvtpPKLVQpn-zKQmN
From (redirected): https://drive.google.com/uc?id=1lJWy8niBfia6uacFvtpPKLVQpn-zKQmN&confirm=t&uuid=7595d221-0999-451b-ac31-ab1bf0ced738
To: /content/processed-earth-surface-temperature-data.csv
100%|██████████| 202M/202M [00:01<00:00, 136MB/s]


In [None]:
df

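| Unnamed: 0 | AverageTemperature | Latitude | Longitude | Month | Year |
|---|---|---|---|---|---|
| 0 | 6.068 | 57.05 | 10.33 | 11 | 1743 |
| 1 | 10.644 | 57.05 | 10.33 | 5 | 1744 |
| 2 | 14.051 | 57.05 | 10.33 | 6 | 1744 |
| 3 | 16.082 | 57.05 | 10.33 | 7 | 1744 |
| 4 | 12.781 | 57.05 | 10.33 | 9 | 1744 |
| ... | ... | ... | ... | ... | ... |
| 7149067 | 7.710 | 52.24 | 5.26 | 4 | 2013 |
| 7149068 | 11.464 | 52.24 | 5.26 | 5 | 2013 |
| 7149069 | 15.043 | 52.24 | 5.26 | 6 | 2013 |
| 7149070 | 18.775 | 52.24 | 5.26 | 7 | 2013 |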


# Model Training


## Data normalization

First, we select the features the model will learn from: AverageTemperature, Latitude, Longitude, Month, and Year. We then normalize them with StandardScaler so that each feature has a mean of 0 and a standard deviation of 1. This puts features with very different ranges (for example, Month and Year) on a common scale, which typically speeds up training and improves performance.

In [None]:
features = df[['AverageTemperature', 'Latitude', 'Longitude', 'Month', 'Year']]

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
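
As a rough illustration of what the scaling step computes, here is a minimal NumPy sketch. The toy array `X` is hypothetical and only stands in for two columns on different scales. It uses the population standard deviation (`ddof=0`), which is the convention StandardScaler follows.

```python
import numpy as np

# Hypothetical toy data: two columns on very different scales (think Month and Year).
X = np.array([[1.0, 1750.0],
              [6.0, 1850.0],
              [12.0, 2013.0]])

# Per-column statistics; ddof=0 (population std) matches StandardScaler's convention.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=0)

# Standardize each column: z = (x - mean) / std
Z = (X - mean) / std

# Undo the scaling to get back to the original units.
X_back = Z * std + mean

print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

The inverse step is the same operation that `scaler.inverse_transform` performs, which is useful when a flagged sample needs to be read in degrees and years rather than in standardized units.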

## Create autoencoder model

Here, we define and build our autoencoder model. An autoencoder is a neural network trained to reproduce its own input through a bottleneck, which forces it to learn an efficient representation of the data. Because it learns to reconstruct typical samples well, a large reconstruction error can be used to flag anomalies. We start by defining the input dimension from the scaled features. The encoder passes the input through Dense layers of 32, 16, and 8 units, and the decoder mirrors these layers to reconstruct the input from the 8-unit representation. Note that the 8-unit bottleneck is narrower than the surrounding hidden layers but still wider than the 5 input features. The model is then compiled with the Adam optimizer and mean squared error as the loss function.


In [None]:
# Defining input dimensions
input_dim = scaled_features.shape[1]

# Defining Autoencoder architecture
input_layer = Input(shape=(input_dim,))
encoder = Dense(32, activation="relu")(input_layer)
encoder = Dense(16, activation="relu")(encoder)
encoder = Dense(8, activation="relu")(encoder)
decoder = Dense(16, activation="relu")(encoder)
decoder = Dense(32, activation="relu")(decoder)
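# Linear output: standardized features can be negative or exceed 1, which a sigmoid (range 0 to 1) cannot reproduce
output_layer = Dense(input_dim, activation="linear")(decoder)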

autoencoder = Model(inputs=input_layer, outputs=output_layer)

# Compile the model
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

Epoch 1/50
...
Epoch 20/50

<keras.src.callbacks.History at 0x7e2829492d10>
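
Since the model is meant for TinyML deployment, it helps to know how small it is. The sketch below counts the trainable parameters from the layer widths in the architecture cell (5 inputs, then 32-16-8-16-32 units, then 5 outputs). The byte estimate assumes 4-byte float32 weights and ignores any file-format overhead, so treat it as a lower bound rather than the final file size.

```python
# Layer widths read off the architecture: 5 inputs, 32-16-8-16-32 hidden units, 5 outputs.
layer_sizes = [5, 32, 16, 8, 16, 32, 5]

# A Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)

# Rough float32 footprint of the weights alone (4 bytes per parameter).
approx_bytes = total_params * 4

print(params_per_layer)  # [192, 528, 136, 144, 544, 165]
print(total_params)      # 1709
print(approx_bytes)      # 6836
```

The total should agree with the figure reported by `autoencoder.summary()`.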

## Split the dataset and train the model

In this step, we split the scaled dataset into training and test sets with an 80/20 ratio so that the model can be evaluated on data it has not seen. During training, `validation_split=0.2` additionally holds out 20% of the training set to compute the validation loss. We define an early stopping callback that stops training when the validation loss has not improved for 5 consecutive epochs, which helps prevent overfitting; with `restore_best_weights=True`, the model is rolled back to the weights from its best epoch. Finally, we train the autoencoder with the training data as both input and target, so that it learns to reconstruct its input.


**Warning:** The next cell takes at least 2 hours to run. Execute it at your own risk.
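
A little arithmetic gives a feel for why training is slow. The sketch below assumes the DataFrame has 7,149,071 rows (inferred from the last index in the preview, 7149070). The exact rounding that scikit-learn and Keras apply to the splits is also an assumption, so the numbers are estimates.

```python
import math

n_rows = 7_149_071        # assumption: rows 0..7149070, as in the DataFrame preview
test_size = 0.2           # train_test_split(test_size=0.2)
validation_split = 0.2    # fit(validation_split=0.2)
batch_size = 32           # fit(batch_size=32)

# Assumed rounding: the test set is rounded up, the remainder is used for training.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Keras holds out a fraction of the training samples for validation.
n_val = int(n_train * validation_split)
n_fit = n_train - n_val

# Number of gradient updates needed to pass over the fitting data once.
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_train, n_test)
print(n_fit, n_val)
print(steps_per_epoch)
```

Under these assumptions each epoch runs roughly 143,000 batches, and up to 50 epochs are allowed. A larger `batch_size` would reduce the number of steps per epoch, at the cost of changing the training dynamics.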

In [None]:
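# Hold out 20% of the rows as a test set (fixed seed for reproducibility)
X_train, X_test = train_test_split(scaled_features, test_size=0.2, random_state=42)

# Stop when val_loss has not improved for 5 epochs and restore the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the autoencoder to reconstruct its own input (input and target are both X_train)
autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, validation_split=0.2, callbacks=[early_stopping])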
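
The patience rule can be sketched in plain Python. This is a simplified illustration, not Keras's actual implementation: the function `early_stopping_epoch` and the loss values are hypothetical, and the sketch counts any strict decrease in validation loss as an improvement.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without exhausting the patience budget.
    return len(val_losses), best_epoch


# Hypothetical validation losses: improvement until epoch 3, then a plateau.
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39, 0.40]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch, best_epoch)  # 8 3
```

In this example training stops at epoch 8, and restoring the best weights would return the model to its state after epoch 3.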

# Model evaluation

In this section, we evaluate the trained autoencoder on the test set. For each sample, we compute the mean squared error (MSE) between the original and the reconstructed feature vector, which measures how well the model reconstructs that sample. To detect anomalies, we set the threshold at the 95th percentile of the test-set MSE values and flag every sample whose MSE exceeds it. Note that this choice flags about 5% of the test samples by construction, so the count reflects the chosen percentile rather than how many anomalies the data actually contains. We then print the number of anomalies detected and list their indices.

In [None]:
# Evaluate model
# Prediction with test split data
reconstructed = autoencoder.predict(X_test)
mse = np.mean(np.power(X_test - reconstructed, 2), axis=1)

threshold = np.percentile(mse, 95)

anomalies = mse > threshold

print(f'Number of anomalies detected: {np.sum(anomalies)}')


Number of anomalies detected: 71491
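
The effect of a percentile threshold can be seen on synthetic data. In this minimal sketch the `errors` array is hypothetical and merely stands in for `mse`. With distinct values, thresholding at the 95th percentile flags 5% of the samples regardless of how the errors are distributed.

```python
import numpy as np

# Hypothetical reconstruction errors: 1000 distinct positive values standing in for `mse`.
rng = np.random.default_rng(0)
errors = rng.exponential(scale=1.0, size=1000)

# Same rule as in the evaluation cell: flag everything above the 95th percentile.
threshold = np.percentile(errors, 95)
flags = errors > threshold

print(flags.sum())  # 50, i.e. 5% of 1000 samples
```

This matches the notebook's result: 71491 flagged samples is about 5% of a test set of roughly 1.43 million rows.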


In [None]:
# Get the positions (within X_test) of the detected anomalies
anomaly_indices = np.where(mse > threshold)[0]

# Print the anomaly indices
print("Indices of detected anomalies:")
print(anomaly_indices)

Indices of detected anomalies:
[     80      99     173 ... 1429677 1429695 1429715]


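Here, we look up a row of the original DataFrame `df` by index (1429677 in this case) to inspect a data point flagged as an anomaly and get a sense of what the model considers anomalous. Keep in mind that `anomaly_indices` holds positions within `X_test`, which `train_test_split` shuffled, so a position in `X_test` does not in general coincide with the row of `df` that carries the same label. The test-set positions have to be mapped back to `df` labels before the row shown can be attributed to a specific flagged sample.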

In [None]:
fila_especifica = df.loc[1429677]

print(fila_especifica)


AverageTemperature      10.183
Latitude                34.560
Longitude              -81.730
Month                   11.000
Year                  1871.000
Name: 1429677, dtype: float64
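
One way to recover the original row behind a test-set position is to split the row labels together with the data. The sketch below uses small hypothetical arrays (`data`, `row_labels`) in place of `scaled_features` and `df.index`. It relies on `train_test_split` accepting several arrays and splitting them consistently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `scaled_features`: 10 rows, 2 columns.
data = np.arange(20, dtype=float).reshape(10, 2)
row_labels = np.arange(len(data))  # stands in for df.index

# Splitting the labels together with the data keeps them aligned after shuffling.
X_train, X_test, idx_train, idx_test = train_test_split(
    data, row_labels, test_size=0.2, random_state=42
)

# Position 0 in X_test corresponds to original row idx_test[0].
position = 0
original_row = idx_test[position]
print(original_row, X_test[position], data[original_row])
```

With the notebook's variables, and assuming the split is redone with `df.index` passed alongside `scaled_features`, the flagged rows could then be retrieved with `df.loc[idx_test[anomaly_indices]]`.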


# Convert the model to TensorFlow Lite

In this final step, we convert the trained autoencoder model into the TensorFlow Lite format. TensorFlow Lite models are optimized for mobile and embedded device deployment, making them ideal for TinyML applications. After conversion, we save the model as a .tflite file, which can then be deployed on edge devices for real-time anomaly detection.

In [None]:
# Convert the model to TensorFlow Lite
converter = tf.lite.TFLiteConverter.from_keras_model(autoencoder)
tflite_model = converter.convert()

# Save the converted model
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
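
To check that a converted model behaves as expected before deploying it, it can be run through the TensorFlow Lite interpreter in Python. The sketch below is self-contained: it builds a hypothetical, untrained stand-in model with 5 inputs and 5 outputs instead of using the trained autoencoder, so the numbers it prints are meaningless and only the workflow carries over.

```python
import numpy as np
import tensorflow as tf

# Hypothetical tiny stand-in model (5 inputs, 5 outputs), not the trained autoencoder.
inputs = tf.keras.layers.Input(shape=(5,))
hidden = tf.keras.layers.Dense(3, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(5)(hidden)
model = tf.keras.models.Model(inputs, outputs)

# Convert in memory, using the same converter call as in the notebook.
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()

# Load the converted model into the interpreter and allocate its tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
interpreter.allocate_tensors()
input_info = interpreter.get_input_details()[0]
output_info = interpreter.get_output_details()[0]

# Run one standardized sample through the model (float32, batch of 1).
sample = np.zeros((1, 5), dtype=np.float32)
interpreter.set_tensor(input_info["index"], sample)
interpreter.invoke()
reconstruction = interpreter.get_tensor(output_info["index"])

# Per-sample reconstruction error, the quantity that is thresholded during evaluation.
error = float(np.mean((sample - reconstruction) ** 2))
print(reconstruction.shape, error)
```

For the saved file, `tf.lite.Interpreter(model_path='model.tflite')` loads the model from disk instead. Inputs must be standardized with the same scaler statistics used during training before they are passed to the interpreter.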
