# <span style="color:#ff5f27"> 👨🏻‍🏫 PyTorch Model and Sklearn Transformation Functions Registration in the Model Registry</span>

In this notebook you will see how to **register Sklearn transformation functions and a PyTorch model** in the Hopsworks Model Registry, how to **retrieve** them, and how to use them for **batch and feature vector prediction**.

## <span style="color:#ff5f27">🗄️ Table of Contents</span>
- [📝 Imports](#1)
- [💽 Loading Data](#2)
- [🔮 Connecting to Hopsworks Feature Store](#3)
- [🪄 Creating Feature Groups](#4)
- [🖍 Feature View Creation](#5)
- [👩🏻‍🔬 Data Transformation](#6)
- [👔 Transformer instances fit](#7)
- [🧬 Modeling](#8)
- [💾 Saving the Model and Transformation Functions](#9)
- [📮 Retrieving the Model and Transformation Functions from Model Registry](#10)
- [👨🏻‍⚖️ Batch Prediction](#11)
- [👨🏻‍⚖️ Serving Feature Vector Prediction](#12)

<a name='1'></a>
## <span style='color:#ff5f27'> 📝 Imports </span>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import joblib

import torch
import torch.nn as nn
import torch.optim as optim

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score

<a name='2'></a>
## <span style="color:#ff5f27;"> 💽 Loading Data </span>

In [None]:
# Load the data
df_original = pd.read_csv("https://repo.hops.works/dev/davit/air_quality/backfill_pm2_5_eu.csv")
# Generate a random binary target column (for demonstration purposes only)
df_original['target'] = np.random.choice([0, 1], size=len(df_original))
df_original.head(3)
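
The target above is drawn at random, so it changes on every run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this demo (the seed value `42` and the toy frame are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_original (illustrative values only)
df_demo = pd.DataFrame(
    {"city_name": ["Amsterdam", "Paris", "Berlin"], "pm2_5": [10.0, 12.5, 8.1]}
)

# A seeded generator yields the same labels on every run
rng = np.random.default_rng(42)
df_demo["target"] = rng.integers(0, 2, size=len(df_demo))

# Re-creating the generator with the same seed reproduces the labels
rng_again = np.random.default_rng(42)
target_again = rng_again.integers(0, 2, size=len(df_demo))
```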

<a name='3'></a>
## <span style="color:#ff5f27;"> 🔮 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

<a name='4'></a>
## <span style="color:#ff5f27;">🪄 Creating Feature Groups</span>

In [None]:
feature_group = fs.get_or_create_feature_group(
    name='feature_group_online',
    description='Online Feature Group',
    version=1,
    primary_key=['city_name', 'date'],
    online_enabled=True,
)    
feature_group.insert(df_original)
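
The feature group declares `city_name` and `date` as its primary key, so each pair should identify exactly one row. A small pre-insert check along these lines can catch duplicates early. This is a sketch with pandas on toy data, and `find_duplicate_keys` is a hypothetical helper, not a Hopsworks API:

```python
import pandas as pd

def find_duplicate_keys(df, primary_key):
    """Return the rows whose primary-key combination occurs more than once."""
    return df[df.duplicated(subset=primary_key, keep=False)]

toy = pd.DataFrame({
    "city_name": ["Amsterdam", "Amsterdam", "Paris"],
    "date": ["2013-01-01", "2013-01-01", "2013-01-01"],
    "pm2_5": [10.0, 11.0, 9.5],
})
duplicates = find_duplicate_keys(toy, ["city_name", "date"])
print(len(duplicates))  # 2: both Amsterdam rows share the same key
```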

<a name='5'></a>
## <span style="color:#ff5f27;"> 🖍 Feature View Creation</span>

In [None]:
# Create a Query object
query = feature_group.select_except(['date'])

feature_view = fs.get_or_create_feature_view(
    name='serving_fv',
    version=1,
    query=query,
    labels=['target']
)

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>


In [None]:
# Create a train-test split dataset
td_version, job = feature_view.create_train_test_split(
    test_size=0.1,
    description='Description of the dataset',
    data_format='csv'
)

### <span style="color:#ff5f27;">🪝 Training Dataset Retrieval</span>

In [None]:
# Retrieve the train-test split
X_train, X_test, y_train, y_test = feature_view.get_train_test_split(
    training_dataset_version=td_version
)
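
To make `test_size=0.1` concrete: roughly one row in ten ends up in the test set. The sketch below mimics that proportion with plain pandas on toy data. It only illustrates the ratio and the four returned objects, and does not reproduce how Hopsworks performs the split.

```python
import pandas as pd

toy = pd.DataFrame({"pm2_5": range(10), "target": [0, 1] * 5})

# Hold out 10% of the rows and keep the rest for training
test_part = toy.sample(frac=0.1, random_state=0)
train_part = toy.drop(test_part.index)

# Separate the features from the label, mirroring X_train, X_test, y_train, y_test
X_train_toy, y_train_toy = train_part.drop(columns="target"), train_part[["target"]]
X_test_toy, y_test_toy = test_part.drop(columns="target"), test_part[["target"]]
```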

In [None]:
X_train.head(3)

In [None]:
y_train.head(3)

<a name='6'></a>
## <span style="color:#ff5f27;">👩🏻‍🔬 Data Transformation</span>

For data transformation, let's create two functions: `to_df` and `transform_data`.

- `to_df` converts a feature vector, or a list of feature vectors, into a pandas DataFrame.
- `transform_data` applies the fitted `OneHotEncoder` and `StandardScaler` to the input data.

In [None]:
def to_df(feature_vector):
    """
    Convert a feature vector or a list of feature vectors into a pandas DataFrame.

    Parameters:
        feature_vector (a list, or list of lists): 
            A feature vector or a list of feature vectors. A feature vector is 
            represented as a list containing two elements: the first 
            element corresponds to the city name (categorical feature), and the 
            second element corresponds to the PM2.5 value (numerical feature).

    Returns:
        pandas.DataFrame: A DataFrame representing the feature vector(s). 
        The DataFrame will have two columns: 'city_name' for the city names 
        and 'pm2_5' for the corresponding PM2.5 values.

    Example:
        >>> feature_vector = ['New York', 15.3]
        >>> to_df(feature_vector)
          city_name  pm2_5
        0  New York   15.3

        >>> multiple_vectors = [['New York', 15.3], ['Los Angeles', 10.7]]
        >>> to_df(multiple_vectors)
             city_name  pm2_5
        0     New York   15.3
        1  Los Angeles   10.7
    """
    if isinstance(feature_vector[0], list): 
        city_names = [vector[0] for vector in feature_vector]
        pm2_5_values = [vector[1] for vector in feature_vector]
        data = pd.DataFrame(
            {
                'city_name': city_names,
                'pm2_5': pm2_5_values,
            }
        )
        return data

    data = pd.DataFrame(
        {
            'city_name': [feature_vector[0]],
            'pm2_5': [feature_vector[1]],
        }
    )
    return data

In [None]:
def transform_data(data, one_hot_encoder, standard_scaler):
    """
    Apply transformations to the input data using OneHotEncoder and StandardScaler.

    Parameters:
        data (pandas.DataFrame):
            The input DataFrame containing the columns 'city_name' (categorical feature)
            and 'pm2_5' (numerical feature) to be transformed.

        one_hot_encoder (sklearn.preprocessing.OneHotEncoder):
            The fitted OneHotEncoder object used to encode the 'city_name' column into binary vectors.

        standard_scaler (sklearn.preprocessing.StandardScaler):
            The fitted StandardScaler object used to standardize the 'pm2_5' column.

    Returns:
        pandas.DataFrame:
            A new DataFrame with the 'city_name' column encoded and the 'pm2_5' column
            standardized using StandardScaler. The new DataFrame contains all the original
            columns except 'city_name', and the encoded 'city_name' columns as binary vectors.
    """
    # Transform the 'city_name' column using OneHotEncoder
    city_encoded = one_hot_encoder.transform(data[['city_name']])

    # Create a new DataFrame with the encoded values, reusing the input index
    # so that pd.concat aligns the rows correctly
    encoded_df = pd.DataFrame(
        city_encoded,
        columns=one_hot_encoder.categories_[0],
        index=data.index,
    )

    # Concatenate the encoded DataFrame with the original DataFrame
    data = pd.concat([data.drop('city_name', axis=1), encoded_df], axis=1)
    
    # Transform the 'pm2_5' column using StandardScaler
    data['pm2_5'] = standard_scaler.transform(data[['pm2_5']])

    return data
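
One detail worth checking when the encoded columns are concatenated back: `pd.concat(..., axis=1)` aligns rows by index label, not by position. If the input frame does not carry a default `0..n-1` index (which can happen after a split or a filter), a freshly built frame of encoded values will not line up unless it reuses the input's index. A toy illustration with made-up values:

```python
import pandas as pd

# Frame with a non-default index, as can happen after a split or filter
data = pd.DataFrame({"pm2_5": [10.0, 20.0]}, index=[5, 7])
encoded_values = [[1.0, 0.0], [0.0, 1.0]]
columns = ["Amsterdam", "Paris"]

# Default index (0, 1) does not match (5, 7): rows are misaligned and padded with NaN
misaligned = pd.concat(
    [data, pd.DataFrame(encoded_values, columns=columns)], axis=1
)

# Reusing the input index keeps every row intact
aligned = pd.concat(
    [data, pd.DataFrame(encoded_values, columns=columns, index=data.index)], axis=1
)

print(misaligned.shape, aligned.shape)  # (4, 3) (2, 3)
```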


<a name='7'></a>
### <span style="color:#ff5f27;"> 👔 Transformer instances fit</span>

The next step is to create instances of the `OneHotEncoder` and `StandardScaler` transformers and fit them on the `X_train` dataset.

In [None]:
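# Create instances of OneHotEncoder and StandardScaler
# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
one_hot_encoder = OneHotEncoder(sparse_output=False)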
standard_scaler = StandardScaler()

In [None]:
one_hot_encoder.fit(X_train[['city_name']])
standard_scaler.fit(X_train[['pm2_5']])
print('✅ Done!')
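
After fitting, each transformer stores what it learned from the training data: the encoder keeps the sorted list of categories, and the scaler keeps the mean and standard deviation. The sketch below shows this on toy data (values are illustrative). It also shows `handle_unknown='ignore'`, an optional setting that is not used in this notebook, which encodes an unseen city as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame(
    {"city_name": ["Paris", "Amsterdam", "Paris"], "pm2_5": [10.0, 20.0, 30.0]}
)

encoder = OneHotEncoder(handle_unknown="ignore").fit(toy[["city_name"]])
scaler = StandardScaler().fit(toy[["pm2_5"]])

print(list(encoder.categories_[0]))  # ['Amsterdam', 'Paris']
print(scaler.mean_[0])               # 20.0

# A city that was not seen during fit becomes an all-zero row
unseen = pd.DataFrame({"city_name": ["Oslo"]})
unseen_encoded = encoder.transform(unseen).toarray()
print(unseen_encoded)  # [[0. 0.]]
```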

### <span style="color:#ff5f27;">⛳️ Train Data Transformation</span>

Now let's use the `transform_data` function to transform `X_train` and `X_test` with the fitted `OneHotEncoder` and `StandardScaler` transformers.

In [None]:
X_train_transformed = transform_data(X_train, one_hot_encoder, standard_scaler)
X_train_transformed.head(3)

### <span style="color:#ff5f27;">⛳️ Test Data Transformation</span>

In [None]:
X_test_transformed = transform_data(X_test, one_hot_encoder, standard_scaler)
X_test_transformed.head(3)

<a name='8'></a>
## <span style="color:#ff5f27;">🧬 Modeling</span>

In this part, you will build a PyTorch binary classification model and fit it on the transformed `X_train` dataset.

In addition, let's create the `to_tensor` function to **convert a pandas DataFrame** into a **PyTorch tensor**.

In [None]:
def to_tensor(dataframe):
    """
    Convert a pandas DataFrame to a PyTorch tensor.

    Parameters:
        dataframe (pandas.DataFrame):
            The input DataFrame to be converted to a tensor.

    Returns:
        torch.Tensor:
            A PyTorch tensor containing the values from the input DataFrame.
            The data type of the tensor is torch.float32.
    """
    return torch.tensor(dataframe.values, dtype=torch.float32)

In [None]:
# Convert data to PyTorch tensors
X_train_transformed_tensor = to_tensor(X_train_transformed)
y_train_tensor = to_tensor(y_train)

X_train_transformed_tensor[0]

In [None]:
# Define the model class
class BinaryClassificationModel(nn.Module):
    def __init__(self, input_dim):
        super(BinaryClassificationModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

# Create the model instance
input_dim = X_train_transformed_tensor.shape[1]
model = BinaryClassificationModel(input_dim)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.005)

# Train the model
num_epochs = 5
batch_size = 32
num_batches = len(X_train_transformed_tensor) // batch_size

for epoch in range(num_epochs):
    for i in range(num_batches):
        # Prepare mini-batches
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        batch_X, batch_y = X_train_transformed_tensor[start_idx:end_idx], y_train_tensor[start_idx:end_idx]

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y.view(-1, 1))

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print training progress at the end of each epoch
        if (i + 1) == num_batches:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{num_batches}], Loss: {loss.item():.4f}')
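
The notebook imports `accuracy_score` but does not evaluate the trained model. One possible evaluation step is to threshold the sigmoid outputs at 0.5 and compare them with the labels. The sketch below uses hard-coded probabilities in place of real model outputs; in the notebook they would come from the model applied to the transformed test set. Because the target here is random, an accuracy close to 0.5 would be expected on the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Stand-ins for model outputs and true labels (illustrative values)
probabilities = np.array([0.1, 0.7, 0.4, 0.9])
y_true = np.array([0, 1, 1, 1])

# Sigmoid outputs are probabilities; 0.5 is a common (assumed) decision threshold
y_pred = (probabilities >= 0.5).astype(int)

print(accuracy_score(y_true, y_pred))  # 0.75
```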

## <span style="color:#ff5f27;">🗄 Model Registry</span>

In [None]:
mr = project.get_model_registry()

### <span style="color:#ff5f27;">⚙️ Model Schema</span>


In [None]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_schema = Schema(X_train_transformed.values)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_schema.to_dict()

<a name='9'></a>
### <span style="color:#ff5f27;">💾 Saving the Model and Transformation Functions</span>

In [None]:
model_dir = "torch_tf_model"

# Create the directory if it does not exist yet
os.makedirs(model_dir, exist_ok=True)

# Save Transformation Functions
joblib.dump(one_hot_encoder, model_dir + '/one_hot_encoder.pkl')
joblib.dump(standard_scaler, model_dir + '/standard_scaler.pkl')

# Save the model
joblib.dump(model, model_dir + '/torch_classifier.pkl')
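
The cell above pickles the whole model object with `joblib`, which requires the `BinaryClassificationModel` class to be available wherever the file is loaded. A commonly used alternative in PyTorch is to save only the `state_dict` and rebuild the model before loading the weights. The following is a sketch of that alternative with a small stand-in network, not what this notebook does; the file name is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in network (the notebook's model would be used instead)
net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "torch_classifier_state.pt")  # hypothetical file name

    # Save only the learned parameters
    torch.save(net.state_dict(), path)

    # Rebuild the same architecture, then load the parameters into it
    restored = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())
    restored.load_state_dict(torch.load(path))

# Both networks should now produce identical outputs for the same input
sample = torch.ones(1, 3)
with torch.no_grad():
    same_output = torch.equal(net(sample), restored(sample))
```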

In [None]:
# Create a model in the model registry
# (a separate variable name avoids overwriting the trained PyTorch `model`)
registry_model = mr.torch.create_model(
    name="torch_model",
    description="PyTorch model",
    input_example=X_train.sample(),
    model_schema=model_schema,
)

registry_model.save(model_dir)

<a name='10'></a>
## <span style="color:#ff5f27;"> 📮 Retrieving the Model and Transformation Functions from Model Registry </span>

In [None]:
# Retrieve your model from the model registry
retrieved_model = mr.get_model(
    name="torch_model",
    version=1
)
saved_model_dir = retrieved_model.download()

In [None]:
# Retrieve the PyTorch model
retrieved_torch_model = joblib.load(saved_model_dir + "/torch_classifier.pkl")

# Retrieve Transformation Functions
one_hot_encoder = joblib.load(saved_model_dir + "/one_hot_encoder.pkl")
standard_scaler = joblib.load(saved_model_dir + "/standard_scaler.pkl")

<a name='11'></a>
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Batch Prediction </span>

In [None]:
# Initialise feature view to retrieve batch data
feature_view.init_batch_scoring(training_dataset_version=td_version)

# Retrieve batch data
batch_data = feature_view.get_batch_data()
batch_data.head(3)

In [None]:
# Apply transformations to the batch data using transform_data function
batch_data_transformed = transform_data(batch_data, one_hot_encoder, standard_scaler)
batch_data_transformed.head(3)

In [None]:
# Predict batch data using retrieved model (no gradients are needed for inference)
with torch.no_grad():
    predictions_batch = retrieved_torch_model(to_tensor(batch_data_transformed))
predictions_batch[:10]
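
For prediction, gradients are not needed. A common pattern is to switch the model to evaluation mode and wrap the forward pass in `torch.no_grad()`, so that the output is a plain tensor that can be converted to NumPy directly. A sketch with a stand-in model and made-up inputs:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())  # stand-in for the retrieved model
inputs = torch.tensor([[0.5, 1.0], [-1.0, 2.0]])

net.eval()  # disables training-only behaviour such as dropout
with torch.no_grad():
    probabilities = net(inputs)

print(probabilities.requires_grad)  # False

# Flatten to a 1-D NumPy array of probabilities, then derive 0/1 labels
labels = (probabilities.numpy().ravel() >= 0.5).astype(int)
```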

<a name='12'></a>
## <span style="color:#ff5f27;"> 👨🏻‍⚖️ Serving Feature Vector Prediction</span>

In [None]:
# Initialise feature view to retrieve feature vector
feature_view.init_serving(training_dataset_version=td_version)

# Retrieve a feature vector
feature_vector = feature_view.get_feature_vector(
    entry={
        "city_name": 'Amsterdam',
        "date": '2013-01-01',
    }
)
feature_vector

In [None]:
# Transform feature vector to pandas dataframe
feature_vector_df = to_df(feature_vector)
feature_vector_df

In [None]:
# Apply transformations to the feature vector df using transform_data function
feature_vector_transformed = transform_data(feature_vector_df, one_hot_encoder, standard_scaler)
feature_vector_transformed.head(3)

In [None]:
# Predict transformed feature vector using retrieved model
with torch.no_grad():
    prediction_feature_vector = retrieved_torch_model(to_tensor(feature_vector_transformed))
prediction_feature_vector

In [None]:
# Retrieve feature vectors from feature store
feature_vectors = feature_view.get_feature_vectors(
    entry=[
        {"city_name": 'Amsterdam', "date": '2013-01-01'},
        {"city_name": 'Amsterdam', "date": '2014-01-01'},
    ]
)
feature_vectors

In [None]:
# Convert feature vectors to pandas dataframe
feature_vectors_df = to_df(feature_vectors)
feature_vectors_df

In [None]:
# Apply transformations to the feature vectors df using transform_data function
feature_vectors_transformed = transform_data(feature_vectors_df, one_hot_encoder, standard_scaler)
feature_vectors_transformed.head(3)

In [None]:
# Predict transformed feature vectors using retrieved model
with torch.no_grad():
    prediction_feature_vectors = retrieved_torch_model(to_tensor(feature_vectors_transformed))
prediction_feature_vectors

---