# <span style="font-weight:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-weight:bold; font-size: 3rem; color:#333;">- Part 03: Training Pipeline</span>

<span style="font-weight:bold; font-size: 1.4rem;">This notebook explains how to create a feature view, create a training dataset, train a model, and save it in the Hopsworks Model Registry.</span>

## 🗒️ This notebook is divided into the following sections:

1. Fetch Feature Groups.
2. Create a Feature View.
3. Create a Training Dataset.
4. Train a model.
5. Save trained model in the Model Registry.

![part2](../../images/02_training-dataset.png) 

### <span style='color:#ff5f27'> 📝 Imports</span>

In [None]:
import os
import joblib

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings("ignore")

## <span style="color:#ff5f27;"> 📡 Connecting to Hopsworks Feature Store </span>

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

In [None]:
# Retrieve feature groups
air_quality_fg = fs.get_feature_group(
    name='air_quality',
    version=1,
)
weather_fg = fs.get_feature_group(
    name='weather',
    version=1,
)

## <span style="color:#ff5f27;"> 🖍 Feature View Creation and Retrieval </span>

In [None]:
# Select features for training data.
selected_features = air_quality_fg.select_all().join(
    weather_fg.select_except(['city_name', 'unix_time', 'date']), 
)

In [None]:
# Uncomment this if you would like to view your selected features
# selected_features.show(5)

**Feature Views** sit between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we create a **Feature View**, which stores metadata about our data. From a **Feature View** we can then create a **Training Dataset**.

A Feature View defines a schema in the form of a query with filters, and can also define the model's target feature (label) and additional transformation functions.

To create a Feature View, use the `FeatureStore.create_feature_view()` method.

You can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - the target variable(s).

- `transformation_functions` - functions to transform the features.

- `query` - the query object that selects the data.

In [None]:
# Get or create the 'air_quality_fv' feature view
feature_view = fs.get_or_create_feature_view(
    name='air_quality_fv',
    version=1,
    query=selected_features,
)

Your `Feature View` is now saved in Hopsworks, and you can retrieve it using `FeatureStore.get_feature_view()`.
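
As a minimal sketch of that retrieval, assuming the name and version used in the cell above: the helper `load_feature_view` is hypothetical and only wraps `FeatureStore.get_feature_view()`, and `fs` is assumed to be the handle returned by `project.get_feature_store()`.

```python
def load_feature_view(fs, name="air_quality_fv", version=1):
    """Fetch an existing feature view from a connected feature store handle.

    Hypothetical helper: it only forwards the name and version to
    `fs.get_feature_view()`.
    """
    return fs.get_feature_view(name=name, version=version)


# Usage in this notebook (requires an active Hopsworks connection):
# feature_view = load_feature_view(fs)
```

This can be handy in a later session, where the feature view already exists and only needs to be looked up.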

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query where the projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.
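
To make the three splits concrete, here is a minimal plain-Python sketch that shuffles a set of rows and partitions them. It does not use the Hopsworks API; the function name `three_way_split` and the split proportions are illustrative assumptions.

```python
import random


def three_way_split(rows, validation_size=0.1, test_size=0.2, seed=42):
    """Shuffle rows and partition them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_test = int(len(shuffled) * test_size)
    n_val = int(len(shuffled) * validation_size)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))  # 70 10 20
```

Every row lands in exactly one split, so the test set stays unseen during training and hyperparameter tuning.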

To create a training dataset, use the `FeatureView.training_data()` method.

Here are some important things to know:

- It inherits the name of the Feature View.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- You can choose the format with the **data_format** parameter.

- Use **start_time** and **end_time** to restrict the dataset to a specific time range.
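
A minimal sketch of the time-range filter, assuming a feature view like the one created above. The helper `training_data_between` is hypothetical, and the dates in the usage comment are illustrative rather than taken from the dataset.

```python
from datetime import datetime


def training_data_between(feature_view, start, end):
    """Return in-memory training data restricted to the given time range.

    Hypothetical helper: it forwards `start` and `end` to the
    `start_time` and `end_time` parameters of `training_data()`.
    """
    features, labels = feature_view.training_data(
        start_time=start,
        end_time=end,
        description="Air Quality dataset (time-filtered)",
    )
    return features, labels


# Usage (requires an active Hopsworks connection; dates are illustrative):
# X_2023, _ = training_data_between(
#     feature_view, datetime(2023, 1, 1), datetime(2023, 12, 31)
# )
```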

In [None]:
X, _ = feature_view.training_data(
    description='Air Quality dataset',
)

## <span style="color:#ff5f27;">🧬 Modeling</span>

In [None]:
# Create a LabelEncoder object
label_encoder = LabelEncoder()

# Fit the encoder to the 'city_name' column (LabelEncoder expects a 1-D input, so pass a Series)
label_encoder.fit(X['city_name'])

# Transform the 'city_name' column data using the fitted encoder
encoded = label_encoder.transform(X['city_name'])
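
For intuition, label encoding can be sketched in plain Python: each distinct label is mapped to an integer in the sorted order of the labels, which is also how scikit-learn's `LabelEncoder` orders its classes. The function name and the city values below are illustrative assumptions, not data from the feature group.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


cities = ["Paris", "Berlin", "Paris", "Amsterdam"]  # illustrative values only
mapping = fit_label_mapping(cities)
codes = [mapping[city] for city in cities]

print(mapping)  # {'Amsterdam': 0, 'Berlin': 1, 'Paris': 2}
print(codes)    # [2, 1, 2, 0]
```

The fitted encoder is kept so that the same mapping can be applied to new data at inference time.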

In [None]:
# Add the label-encoded city names as a new column
# (assigning the array directly avoids index misalignment from pd.concat)
X['city_name_encoded'] = encoded

# Drop columns 'date', 'city_name', 'unix_time' from the DataFrame 'X'
X = X.drop(columns=['date', 'city_name', 'unix_time'])

In [None]:
# Extract the target variable 'pm2_5' from the DataFrame 'X' and assign it to the variable 'y'
y = X.pop('pm2_5')

In [None]:
# Split the data into training and testing sets using the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=42,
)

X_train.head(3)

In [None]:
y_train.head(3)

## <span style='color:#ff5f27'>🏃🏻‍♂️ Model Training</span>

In [None]:
# Create an instance of the XGBoost Regressor
xgb_regressor = XGBRegressor()

# Fit the XGBoost Regressor to the training data
xgb_regressor.fit(X_train, y_train)

## <span style='color:#ff5f27'> ⚖️ Model Validation</span>

In [None]:
# Predict target values on the test set
y_pred = xgb_regressor.predict(X_test)

# Calculate Mean Squared Error (MSE) using sklearn
mse = mean_squared_error(y_test, y_pred)
print("⛳️ MSE:", mse)

# Calculate Root Mean Squared Error (RMSE) as the square root of the MSE
# (the `squared=False` argument is deprecated in recent scikit-learn releases)
rmse = mse ** 0.5
print("⛳️ RMSE:", rmse)

# Calculate R squared using sklearn
r2 = r2_score(y_test, y_pred)
print("⛳️ R^2:", r2)
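
To see what these three metrics measure, the following plain-Python sketch computes them by hand on a tiny made-up example. The values are illustrative only and are not output from this notebook; the `_example` suffixes keep the names from clashing with the variables above.

```python
import math

y_true_example = [10.0, 20.0, 30.0, 40.0]  # illustrative values only
y_pred_example = [12.0, 18.0, 33.0, 37.0]

n = len(y_true_example)
squared_errors = [(t - p) ** 2 for t, p in zip(y_true_example, y_pred_example)]

# MSE: mean of the squared errors
mse_example = sum(squared_errors) / n

# RMSE: square root of the MSE, expressed in the same units as the target
rmse_example = math.sqrt(mse_example)

# R^2: 1 minus the ratio of residual to total sum of squares
mean_true = sum(y_true_example) / n
ss_res = sum(squared_errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true_example)
r2_example = 1 - ss_res / ss_tot

print(mse_example, rmse_example, r2_example)
```

Here the MSE is 6.5, the RMSE is about 2.55, and R² is about 0.948. An R² close to 1 means the predictions explain most of the variance in the target.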

In [None]:
# Create a DataFrame 'df_pred' to store true and predicted values for evaluation
df_pred = pd.DataFrame({
    "y_true": y_test,
    "y_pred": y_pred,
})
df_pred.head()

In [None]:
def create_residual_plot(df_pred):
    """Create a residual plot with specified styling."""
    plt.figure(figsize=(10, 6))
    residplot = sns.residplot(
        data=df_pred, 
        x="y_true", 
        y="y_pred", 
        color='orange',
        scatter_kws={'alpha': 0.5}
    )
    
    plt.title('Model Residuals', fontsize=14)
    plt.xlabel('True Values', fontsize=12)
    plt.ylabel('Residuals (Error)', fontsize=12)
    plt.tight_layout()
    
    return residplot.get_figure()

def create_feature_importance_plot(xgb_regressor):
    """Create a feature importance plot with specified styling."""
    # Pass the axes explicitly so the plot is drawn on the figure sized here;
    # without `ax`, plot_importance creates its own figure and ignores figsize
    fig, ax = plt.subplots(figsize=(12, 8))
    plot_importance(
        xgb_regressor,
        ax=ax,
        max_num_features=25,
        title='Feature Importance',
        xlabel='F-score',
        ylabel='Features'
    )
    fig.tight_layout()
    
    return fig

In [None]:
# Display plot
residual_fig = create_residual_plot(df_pred)
plt.show()

In [None]:
feature_importance_fig = create_feature_importance_plot(xgb_regressor)
plt.show()

## <span style='color:#ff5f27'>🗄 Model Registry</span>

One of the features in Hopsworks is the model registry. This is where you can store different versions of models and compare their performance. Models from the registry can then be served as API endpoints.

In [None]:
# Retrieve the model registry
mr = project.get_model_registry()

In [None]:
# Create local directories for the model artifacts and plots
model_dir = "air_quality_model"
images_dir = os.path.join(model_dir, "images")
os.makedirs(images_dir, exist_ok=True)

In [None]:
# Save model artifacts
joblib.dump(label_encoder, os.path.join(model_dir, 'label_encoder.pkl'))
joblib.dump(xgb_regressor, os.path.join(model_dir, 'xgboost_regressor.pkl'))

In [None]:
# Save plots
residual_fig.savefig(
    os.path.join(images_dir, "residuals.png"),
    dpi=300,
    bbox_inches='tight'
)
feature_importance_fig.savefig(
    os.path.join(images_dir, "feature_importance.png"),
    dpi=300,
    bbox_inches='tight'
)

# Close figures to free memory
plt.close(residual_fig)
plt.close(feature_importance_fig)

In [None]:
# Create a Python model in the model registry named 'air_quality_xgboost_model'
aq_model = mr.python.create_model(
    name="air_quality_xgboost_model", 
    description="Air Quality (PM2.5) predictor",
    metrics={
        "RMSE": rmse,
        "MSE": mse,
        "R squared": r2,
    },
    input_example=X_test.sample().values, 
    feature_view=feature_view,
)

# Upload the contents of the local 'air_quality_model' directory to the model registry
aq_model.save(model_dir)
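
A minimal sketch of fetching the saved model back from the registry, for example in an inference notebook. The helper `download_registered_model` is hypothetical, and `version=1` is an assumption that holds only if this is the first model saved under this name.

```python
def download_registered_model(mr, name="air_quality_xgboost_model", version=1):
    """Download a registered model's files and return the local directory path.

    Hypothetical helper: `mr` is assumed to be the handle returned by
    `project.get_model_registry()`.
    """
    model = mr.get_model(name=name, version=version)
    return model.download()


# Usage (requires an active Hopsworks connection):
# saved_model_dir = download_registered_model(mr)
# regressor = joblib.load(os.path.join(saved_model_dir, "xgboost_regressor.pkl"))
```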

---
## <span style="color:#ff5f27;">⏭️ **Next:** Part 04: Batch Inference</span>

In the following notebook you will use your model for Batch Inference.
