# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Model training</span>

<span style="font-width:bold; font-size: 1.4rem;"> This notebook explains how to read from a feature group and create a training dataset within the feature store. You will then train a model on that training dataset using TensorFlow, although it could just as well be trained with other machine learning frameworks such as Scikit-learn, Keras, and PyTorch. You will also see some of the exploration that can be done in Hopsworks, notably the search functions and the lineage.</span>

## **🗒️ This notebook is divided into the following steps:** 
1. **Feature Selection**: Select the features you want to train your model on.
2. **Feature Transformation**: Define how the features should be preprocessed.
3. **Training Dataset Creation**: Create a dataset for training the anomaly detection model.
4. **Model Training**: Train your anomaly detection model.
5. **Model Registry**: Register the model in the Hopsworks model registry.
6. **Model Deployment**: Deploy the model for real-time inference.

![tutorial-flow](../images/02_training-dataset.png) 

## <span style='color:#ff5f27'> 📝 Imports

In [1]:
import ast
import numpy as np
import pandas as pd
import tensorflow as tf
import os

from anomaly_detection import GanEncAnomalyDetector

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login()

# Get the feature store handle for the project's feature store
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://staging.cloud.hopsworks.ai/p/120
Connected. Call `.close()` to terminate connection gracefully.


## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You start by selecting all the features you want to include for model training/inference.

In [3]:
# Retrieve Feature Groups
transactions_monthly_fg = fs.get_feature_group(
    name="transactions_monthly", 
    version=1,
)

graph_embeddings_fg = fs.get_feature_group(
    name="graph_embeddings", 
    version=1,
) 

party_fg = fs.get_feature_group(
    name="party_labels", 
    version=1,
)

In [4]:
# AML model query 
aml_model_query = transactions_monthly_fg.select(
    [
        "monthly_in_count", 
        "monthly_in_total_amount", 
        "monthly_in_mean_amount", 
        "monthly_in_std_amount", 
        "monthly_out_count", 
        "monthly_out_total_amount", 
        "monthly_out_mean_amount", 
        "monthly_out_std_amount",
    ]
).join(
    graph_embeddings_fg.select(["graph_embeddings"]),
).join(
    party_fg.select(["type", "is_sar"]), 
)
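Conceptually, the query above picks columns from three feature groups and joins them on a shared key. The pandas sketch below mimics that on toy data. The `id` key column, the inner-join behaviour, and all values are assumptions made for illustration; the real join is executed by the feature store, not by pandas.

```python
import pandas as pd

# Toy stand-ins for the three feature groups; `id` is an assumed shared key.
transactions_monthly = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "monthly_in_count": [3, 0, 5],
    "monthly_out_count": [1, 4, 2],
})
graph_embeddings = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "graph_embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})
party_labels = pd.DataFrame({
    "id": ["a1", "a2"],
    "type": [1, 0],
    "is_sar": [0, 1],
})

# Chain the joins the same way the query chains `.join(...)` calls.
joined = (
    transactions_monthly
    .merge(graph_embeddings, on="id", how="inner")
    .merge(party_labels, on="id", how="inner")
)

print(joined.columns.tolist())
print(len(joined))
```

In this sketch, account `a3` drops out because it has no row in the label table, which is the kind of effect worth checking when a joined query returns fewer rows than expected.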

In [5]:
# Uncomment this line if you would like to see the query results
#aml_model_query.show(5)

### <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>

Transformation functions are a mathematical mapping of input data that may be stateful, requiring statistics from the parent feature view (such as the number of instances of a category, or the mean value of a numerical feature).

We will preprocess our data using *min-max scaling* on numerical features and *label encoding* on categorical features. To do this, we simply define a mapping between our features and transformation functions; the cell below maps the eight numerical features to the built-in `min_max_scaler`. This ensures that transformation functions such as *min-max scaling* are fitted only on the training data (and not the validation/test data), which ensures that there is no data leakage.
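To make the leakage argument concrete, here is a minimal NumPy sketch of min-max scaling in which the statistics come from the training values only. The helper names `fit_min_max` and `apply_min_max` are hypothetical and are not part of the Hopsworks API; the built-in `min_max_scaler` may be implemented differently.

```python
import numpy as np

def fit_min_max(train_values):
    """Compute the min and max from the training data only."""
    return float(np.min(train_values)), float(np.max(train_values))

def apply_min_max(values, stats):
    """Scale values with previously fitted statistics (assumes max > min)."""
    lo, hi = stats
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

train = np.array([10.0, 20.0, 30.0, 50.0])
test = np.array([20.0, 60.0])

stats = fit_min_max(train)               # fitted on the training split
train_scaled = apply_min_max(train, stats)
test_scaled = apply_min_max(test, stats)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Because the test split reuses the training statistics, a test value larger than the training maximum scales to a number above 1. That is expected and is the price of not peeking at the test data.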

In [6]:
# Load built in transformation functions.
min_max_scaler = fs.get_transformation_function(name="min_max_scaler")

# Map features to transformations.
transformation_functions = {
    "monthly_in_count": min_max_scaler,
    "monthly_in_total_amount": min_max_scaler,
    "monthly_in_mean_amount": min_max_scaler,
    "monthly_in_std_amount": min_max_scaler,
    "monthly_out_count": min_max_scaler,
    "monthly_out_total_amount": min_max_scaler,
    "monthly_out_mean_amount": min_max_scaler,
    "monthly_out_std_amount": min_max_scaler,
}

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

In Hopsworks, you write features to feature groups (where the features are stored) and you read features from feature views. A feature view is a logical view over features stored in feature groups, and it typically contains the features used by a specific model. This way, feature views enable features stored in different feature groups to be reused across many different models. A feature view lets you define a schema in the form of a query with filters, a model target feature/label, and additional transformation functions.
To create a feature view, use `fs.create_feature_view()`.

In [7]:
# Create the 'aml_feature_view' feature view
feature_view = fs.create_feature_view(
    name='aml_feature_view',
    query=aml_model_query,
    labels=["is_sar"],
    transformation_functions=transformation_functions,
)

Feature view created successfully, explore it at 
https://staging.cloud.hopsworks.ai/p/120/fs/68/fv/aml_feature_view/version/1


## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent feature view, with an optional snapshot on disk of the data returned by the query.

**A training dataset may contain splits such as:** 
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

A training dataset is created using the `feature_view.training_data()` method.

**With the feature view APIs you can also create training datasets based on event-time filters by specifying `start_time` and `end_time`.**
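The two ideas above, event-time filtering and train/validation/test splits, can be illustrated with plain pandas. This is only a sketch on toy data: the `event_time` column, the inclusive time bounds, and the 60/20/20 proportions are assumptions, and it does not call the feature view API.

```python
import numpy as np
import pandas as pd

# Toy data with one row per day; `event_time` is an assumed column name.
df = pd.DataFrame({
    "event_time": pd.date_range("2023-01-01", periods=10, freq="D"),
    "feature": np.arange(10, dtype=float),
})

# Event-time filter: keep rows between start_time and end_time (inclusive here).
start_time = pd.Timestamp("2023-01-03")
end_time = pd.Timestamp("2023-01-08")
in_range = df[(df["event_time"] >= start_time) & (df["event_time"] <= end_time)]

# Random 60/20/20 split into training, validation and test sets.
shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
n_train = len(shuffled) * 6 // 10
n_val = len(shuffled) * 2 // 10
train = shuffled.iloc[:n_train]
val = shuffled.iloc[n_train:n_train + n_val]
test = shuffled.iloc[n_train + n_val:]

print(len(in_range), len(train), len(val), len(test))
```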


In [8]:
# Get training data
X_train, y_train = feature_view.training_data(
    description='AML training dataset',
)



Finished: Reading data from Hopsworks, using Hive (53.57s) 




In [9]:
# Displaying the first three rows of the training data
X_train.head(3)

| | monthly_in_count | monthly_in_total_amount | monthly_in_mean_amount | monthly_in_std_amount | monthly_out_count | monthly_out_total_amount | monthly_out_mean_amount | monthly_out_std_amount | graph_embeddings | type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.076923 | 0.03902 | 0.149079 | 0.17756 | 0.1875 | 0.103499 | 0.27494 | 0.065549 | [0.9999758,0.99986506,0.99988717,0.99990636,0.... | 1 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1875 | 0.0654 | 0.173731 | 0.188546 | [0.999906,0.99997395,0.9998985,0.9999284,0.999... | 1 |
| 2 | 0.115385 | 0.068557 | 0.174616 | 0.12408 | 0.0625 | 0.02666 | 0.212459 | 0.0 | [0.99997574,0.9998652,0.99988717,0.99990636,0.... | 1 |


In [10]:
# Displaying the first three rows of the target data
y_train.head(3)

| | is_sar |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |


## <span style="color:#ff5f27;"> 👓 Exploration </span>

### Similar to Feature Groups, Feature Views and Training Datasets are now accessible and searchable in the UI
![fv-overview](images/feature_views_explore.gif)

## 📊 Statistics
We can explore feature statistics in the feature views. 

![fv-stats](images/feature_view_stats.gif)


# <span style="color:#ff5f27;">🤖 Model Building</span>


In [11]:
# Converting string representations of Python literals in 'graph_embeddings' column to actual objects
X_train['graph_embeddings'] = X_train['graph_embeddings'].apply(ast.literal_eval)

In [12]:
# Convert each element in the 'graph_embeddings' column to a NumPy array
X_train['graph_embeddings'] = X_train['graph_embeddings'].apply(np.array)

# Merge the original DataFrame with a DataFrame of exploded embeddings
X_train = X_train.merge(
    pd.DataFrame(X_train['graph_embeddings'].to_list()).add_prefix('emb_'), 
    left_index=True, 
    right_index=True,
).drop('graph_embeddings', axis=1)

# Display the first three rows of the modified DataFrame
X_train.head(3)

| | monthly_in_count | monthly_in_total_amount | monthly_in_mean_amount | monthly_in_std_amount | monthly_out_count | monthly_out_total_amount | monthly_out_mean_amount | monthly_out_std_amount | type | emb_0 | ... | emb_22 | emb_23 | emb_24 | emb_25 | emb_26 | emb_27 | emb_28 | emb_29 | emb_30 | emb_31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.076923 | 0.03902 | 0.149079 | 0.17756 | 0.1875 | 0.103499 | 0.27494 | 0.065549 | 1 | 0.999976 | ... | 0.999954 | 0.999849 | 0.999985 | 0.999922 | 0.999919 | 0.999951 | 0.999989 | 0.999666 | 0.99998 | 0.999654 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1875 | 0.0654 | 0.173731 | 0.188546 | 1 | 0.999906 | ... | 0.999889 | 0.999644 | 0.999986 | 0.999971 | 0.99985 | 0.999947 | 0.999994 | 0.999854 | 0.999989 | 0.999334 |
| 2 | 0.115385 | 0.068557 | 0.174616 | 0.12408 | 0.0625 | 0.02666 | 0.212459 | 0.0 | 1 | 0.999976 | ... | 0.999954 | 0.999849 | 0.999985 | 0.999922 | 0.999919 | 0.999952 | 0.999989 | 0.999666 | 0.99998 | 0.999654 |
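The embedding preprocessing above (parse the string, then spread the list over `emb_*` columns) can be tried on a toy frame. This is a sketch with made-up values and a three-element embedding. It uses `DataFrame.join` with an explicit index rather than the `merge` call used in the notebook, so that rows stay aligned even when the index is not the default `0..n-1`.

```python
import ast
import pandas as pd

# Toy frame: embeddings arrive as string representations of Python lists.
toy = pd.DataFrame({
    "monthly_in_count": [0.1, 0.2],
    "graph_embeddings": ["[0.5, 0.6, 0.7]", "[0.8, 0.9, 1.0]"],
})

# Step 1: turn each string into an actual list of floats.
toy["graph_embeddings"] = toy["graph_embeddings"].apply(ast.literal_eval)

# Step 2: one column per embedding dimension, prefixed with 'emb_'.
expanded = pd.DataFrame(
    toy["graph_embeddings"].to_list(), index=toy.index
).add_prefix("emb_")

# Step 3: attach the new columns and drop the original list column.
toy = toy.join(expanded).drop(columns="graph_embeddings")

print(toy.columns.tolist())
```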


You are going to train a [GAN for anomaly detection](https://arxiv.org/pdf/1905.11034.pdf). During the training step, you will provide only the features of accounts that have never been reported for suspicious activity. You will disclose previously reported accounts to the model only in the evaluation step.
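The protocol described above (fit on normal examples only, then score everything) does not depend on the GAN itself. The sketch below swaps in a much simpler detector, a per-feature z-score distance, purely to illustrate the idea on synthetic data. It is not the model used in this notebook, and all names and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 "normal" rows and 10 "suspicious" rows with shifted features.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
suspicious = rng.normal(loc=6.0, scale=1.0, size=(10, 3))
X = np.vstack([normal, suspicious])
y = np.array([0] * len(normal) + [1] * len(suspicious))

# "Training": estimate statistics from the non-suspicious rows only.
mu = X[y == 0].mean(axis=0)
sigma = X[y == 0].std(axis=0)

def anomaly_score(rows):
    """Distance from the normal profile, in units of standard deviations."""
    return np.sqrt((((rows - mu) / sigma) ** 2).sum(axis=1))

# "Evaluation": score all rows, including the previously held-back suspicious ones.
scores = anomaly_score(X)
print(scores[y == 0].mean(), scores[y == 1].mean())
```

Rows that look unlike anything seen during training receive higher scores, which is the same signal the GAN-based detector is meant to provide with a far more expressive model.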

In [13]:
# Keep only accounts that were never reported as suspicious (is_sar == 0).
# ravel() flattens the single-column label frame into a 1-D boolean mask.
non_sar_transactions = X_train[y_train.values.ravel() == 0]

# Drop any rows with missing values from the non-suspicious subset
non_sar_transactions = non_sar_transactions.dropna()

Now let's define a TensorFlow Dataset, since we are going to train a Keras model with TensorFlow.

In [14]:
def windowed_dataset(dataset, window_size, batch_size):
    # Create a windowed dataset using the specified window_size and shift of 1
    # Drop any remaining elements that do not fit in complete windows
    ds = dataset.window(window_size, shift=1, drop_remainder=True)

    # Flatten the nested datasets into a single dataset of windows
    ds = ds.flat_map(lambda x: x.batch(window_size))

    # Batch the windows into batches of the specified batch_size
    # Use drop_remainder=True to ensure that all batches have the same size
    # Prefetch one batch to improve performance
    return ds.batch(batch_size, drop_remainder=True).prefetch(1)

In [15]:
# Convert non_sar_transactions to a TensorFlow dataset, casting the values to float32
training_dataset = tf.data.Dataset.from_tensor_slices(tf.cast(non_sar_transactions.astype('float32'), tf.float32))

# Use the windowed_dataset function to create a windowed dataset
# Parameters: window_size=2 (sequence length), batch_size=16 (number of sequences in each batch)
training_dataset = windowed_dataset(
    training_dataset, 
    window_size=2, 
    batch_size=16,
)

training_dataset

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


<PrefetchDataset element_spec=TensorSpec(shape=(16, None, 41), dtype=tf.float32, name=None)>
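In the element spec above, 16 is the batch size, the `None` dimension is the window length (2 at run time, but not known statically), and 41 is the number of features. To see what the windowing does without TensorFlow, the NumPy sketch below reproduces the same logic on a small array. The function name `windowed_batches` and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def windowed_batches(rows, window_size, batch_size):
    """Sliding windows with shift 1, grouped into full batches only."""
    n_windows = len(rows) - window_size + 1
    # Window i holds rows i .. i + window_size - 1 (incomplete windows are dropped).
    windows = np.stack([rows[i:i + window_size] for i in range(n_windows)])
    # Keep only complete batches, like drop_remainder=True.
    n_batches = n_windows // batch_size
    usable = windows[: n_batches * batch_size]
    return usable.reshape(n_batches, batch_size, window_size, rows.shape[1])

rows = np.arange(30, dtype=np.float32).reshape(10, 3)  # 10 rows, 3 features
batches = windowed_batches(rows, window_size=2, batch_size=4)

print(batches.shape)
print(batches[0, 1])
```

With 10 rows and a window of 2 there are 9 overlapping windows; a batch size of 4 keeps 2 full batches and discards the ninth window.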

## <span style="color:#ff5f27;"> 🏃 Train Model</span>

Next, we'll train the model. Because the model sees only non-suspicious accounts during training, no labels or class weights are passed to `fit`.

## <span style="color:#ff5f27;">🧬 Model architecture</span>

![tutorial-flow](images/model_architecture.png)

In [17]:
# Create an instance of the GanEncAnomalyDetector model with input dimensions [2, 41]
model = GanEncAnomalyDetector([2, 41])

# Compile the model
model.compile()

In [18]:
# Iterate through each layer in the model
for layer in model.layers:
    # Print the name and output shape of each layer
    print(layer.name, layer.output_shape)

encoder_model (None, 1, 41)
generator_model (None, 2, 41)
discriminator_model (None, 1)


In [19]:
# Train the model using the training_dataset
# Set the number of epochs to 2 and suppress verbose output during training
history = model.fit(
    training_dataset,  # Training dataset used for model training
    epochs=2,          # Number of training epochs
    verbose=0,         # Verbosity mode (0: silent, 1: progress bar, 2: one line per epoch)
)

In [20]:
# Create a dictionary to store metrics
# The key is 'loss', and the value is the initial value of the generator loss from the training history
metrics = {
    'loss': history.history["g_loss"][0],
}

### <span style="color:#ff5f27;">⚙️ Model Schema</span>

The model needs to be set up with a [Model Schema](https://docs.hopsworks.ai/3.0/user_guides/mlops/registry/model_schema/), which describes the inputs and outputs for a model.

A Model Schema can be automatically generated from training examples, as shown below.

In [21]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Define the input schema using the values of X_train
input_schema = Schema(X_train)

# Define the output schema using y_train
output_schema = Schema(y_train)

# Create a ModelSchema object specifying the input and output schemas
model_schema = ModelSchema(
    input_schema=input_schema, 
    output_schema=output_schema,
)

# Convert the model schema to a dictionary for further inspection or serialization
model_schema.to_dict()

{'input_schema': {'columnar_schema': [{'name': 'monthly_in_count',
    'type': 'float64'},
   {'name': 'monthly_in_total_amount', 'type': 'float64'},
   {'name': 'monthly_in_mean_amount', 'type': 'float64'},
   {'name': 'monthly_in_std_amount', 'type': 'float64'},
   {'name': 'monthly_out_count', 'type': 'float64'},
   {'name': 'monthly_out_total_amount', 'type': 'float64'},
   {'name': 'monthly_out_mean_amount', 'type': 'float64'},
   {'name': 'monthly_out_std_amount', 'type': 'float64'},
   {'name': 'type', 'type': 'int64'},
   {'name': 'emb_0', 'type': 'float64'},
   {'name': 'emb_1', 'type': 'float64'},
   {'name': 'emb_2', 'type': 'float64'},
   {'name': 'emb_3', 'type': 'float64'},
   {'name': 'emb_4', 'type': 'float64'},
   {'name': 'emb_5', 'type': 'float64'},
   {'name': 'emb_6', 'type': 'float64'},
   {'name': 'emb_7', 'type': 'float64'},
   {'name': 'emb_8', 'type': 'float64'},
   {'name': 'emb_9', 'type': 'float64'},
   {'name': 'emb_10', 'type': 'float64'},
   {'name': 'em
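The columnar schema shown above is essentially a list of column names paired with their dtypes. The sketch below derives a similar structure from a toy DataFrame with plain pandas. The helper `columnar_schema` is hypothetical and is not how `hsml` builds its schema internally; it only illustrates the shape of the result.

```python
import pandas as pd

def columnar_schema(df):
    """List each column with its dtype name, mirroring the structure shown above."""
    return [{"name": name, "type": str(dtype)} for name, dtype in df.dtypes.items()]

toy_X = pd.DataFrame({
    "monthly_in_count": [0.1, 0.2],
    "type": [1, 0],
    "emb_0": [0.99, 0.98],
})

schema = columnar_schema(toy_X)
print(schema)
```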

## <span style="color:#ff5f27;">📝 Register model</span>

One of the features in Hopsworks is the model registry. This is where we can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [22]:
# Set the path for exporting the trained model
export_path = "aml_model"
print('Exporting trained model to: {}'.format(export_path))

# Get the concrete function for serving the model
call = model.serve_function.get_concrete_function(tf.TensorSpec([None, None, None], tf.float32))

# Save the model to the specified export path with the serving signature
tf.saved_model.save(model, export_path, signatures=call)

# Access the model registry in your project
mr = project.get_model_registry()

# Create a TensorFlow model in the model registry with specified metadata
mr_model = mr.tensorflow.create_model(
    name="aml_model",                                    # Specify the model name
    metrics=metrics,                                     # Include model metrics
    model_schema=model_schema,                           # Include model schema
    description="Adversarial anomaly detection model.",  # Model description
    input_example=["70408aef"],                          # Input example
)

# Save the registered model to the model registry
mr_model.save(export_path)

Exporting trained model to: aml_model
2023-12-12 14:13:33,556 INFO: Assets written to: aml_model/assets
Connected. Call `.close()` to terminate connection gracefully.


  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://staging.cloud.hopsworks.ai/p/120/models/aml_model/1


Model(name: 'aml_model', version: 1)

## <span style="color:#ff5f27;"> 🚀 Model Deployment</span>


In [26]:
fv = fs.get_feature_view("aml_feature_view", 1)

In [28]:
fv.query.show(5).columns



Finished: Reading data from Hopsworks, using Hive (47.70s) 


Index(['monthly_in_count', 'monthly_in_total_amount', 'monthly_in_mean_amount',
       'monthly_in_std_amount', 'monthly_out_count',
       'monthly_out_total_amount', 'monthly_out_mean_amount',
       'monthly_out_std_amount', 'graph_embeddings', 'type', 'is_sar'],
      dtype='object')

In [None]:
fv.init_serving(1)

In [23]:
%%writefile aml_model_transformer.py

import os
import hsfs
import numpy as np

class Transformer(object):
    
    def __init__(self):        
        # get feature store handle
        fs_conn = hsfs.connection()
        self.fs = fs_conn.get_feature_store()
        
        # get feature views
        self.fv = self.fs.get_feature_view("aml_feature_view", 1)
        
        # initialise serving
        self.fv.init_serving(1)
    
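    def preprocess(self, inputs):
        # Look up the feature vector for the account id sent in the request
        feature_vector = self.fv.get_feature_vector({"id": inputs["inputs"][0]})
        # Flatten the nested embedding list and reshape to one row of 41 features
        return {
            "inputs": np.array(
                list(self.flat2gen(feature_vector))
            ).reshape(1, 41).tolist(),
        }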

    def postprocess(self, outputs):
        return outputs

    def flat2gen(self, alist):
        for item in alist:
            if isinstance(item, list):
                for subitem in item: yield subitem
            else:
                yield item
    

Overwriting aml_model_transformer.py
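The flattening step in the transformer can be checked in isolation. The sketch below uses a made-up feature vector with the column layout listed earlier for the feature view (eight monthly features, one 32-element embedding list, and `type`) and confirms that it flattens to a single row of 41 values. The numbers are placeholders, and `flat2gen` is written here as a standalone function.

```python
import numpy as np

def flat2gen(alist):
    """Yield scalars as-is and unpack one level of nested lists."""
    for item in alist:
        if isinstance(item, list):
            yield from item
        else:
            yield item

# Placeholder vector: 8 scalars, one 32-element embedding list, then `type`.
feature_vector = [0.1] * 8 + [[0.9] * 32] + [1]

flat = np.array(list(flat2gen(feature_vector))).reshape(1, 41)
print(flat.shape)
```

Flattening preserves the order of the feature vector, so in this layout the embedding values come before `type`, whereas the training frame above has `type` before the `emb_*` columns. It may be worth verifying that the serving order matches the order the model was trained on.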


In [24]:
from hsml.transformer import Transformer

# Get the dataset API from the project
dataset_api = project.get_dataset_api()

# Upload the transformer script file to the "Models" dataset
uploaded_file_path = dataset_api.upload(
    "aml_model_transformer.py",   # Name of the script file
    "Models",                     # Destination folder in the dataset
    overwrite=True,               # Overwrite the file if it already exists
)

# Construct the full path to the uploaded transformer script file
transformer_script_path = os.path.join("/Projects", project.name, uploaded_file_path)

# Create a Transformer object using the uploaded script
transformer_script = Transformer(
    script_file=transformer_script_path,
)


Uploading: 0.000%|          | 0/954 elapsed<00:00 remaining<?

In [25]:
# Retrieve the "aml_model" from the model registry
model = mr.get_model("aml_model", version=1)

# Deploy the model with the specified name ("amlmodeldeployment") and associated transformer
deployment = model.deploy(
    name="amlmodeldeployment",      # Specify the deployment name
    transformer=transformer_script, # Associate the transformer script with the deployment
)

RestAPIError: Metadata operation error: (url: https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/120/serving). Server response: 
HTTP code: 400, HTTP reason: Bad Request, body: b'{"errorCode":240021,"usrMsg":"Transformers are only supported on KServe deployments","errorMsg":"Transformers not supported"}', error code: 240021, error msg: Transformers not supported, user msg: Transformers are only supported on KServe deployments

In [None]:
print("Deployment: " + deployment.name)
deployment.describe()

> The deployment has now been registered. However, to start it you need to run:

In [None]:
deployment.start(await_running=300)

> For troubleshooting, you can use the `get_logs` method:

In [None]:
deployment.get_logs()

> To stop the deployment you simply run:

In [None]:
deployment.stop()

---
## <span style="color:#ff5f27;"> ⏭️ Next: Part 03: Online Inference </span>
    
In the next notebook you will use your deployment for online inference.