# **ID2223 Project: Training Pipeline**

*Overview of pipeline*

**PREPARATION**
1. Install requirements
2. Check Colab GPU and CPU setup
3. Login to Hopsworks

---

**DATASET**
1.  Download full dataset from Hopsworks and save it to Colab.

---

**XGBOOST REGRESSOR**
1. Load full dataset.
2. Assign correct datatypes to the variables.
3. Separate the features (X) and the label (Y).
4. Split full dataset (*seed = 7*) into a training set (80%) and test set (20%).


5. Initialize XGBOOST Regressor model.
6. Optimize hyperparameters with GridSearch.
7. Train the XGBOOST model on the training data.
8. Let the model predict on the testing data.
9. Calculate performance metrics: *MAE* and *MAPE*.
10. Plot model's predictions vs. original values.


11. Upload model to Hopsworks Model Registry.
12. Download model from Hopsworks Model Registry.

---
**AUTOGLUON TABULAR PREDICTOR**
1. Load full dataset and assign correct datatypes.
2. Separate the features (X) and the label (Y).
3. Split full dataset (*seed = 7*) into a training set (80%) and test set (20%).


4. Initialize, train, and then save AutoGluon Tabular Predictor.
5. Let the model predict on the testing data.
6. Calculate performance metrics: *MAE* and *MAPE*.


7. Upload model to Hopsworks Model Registry
8. Download model from Hopsworks Model Registry

---


Install requirements

In [None]:
!pip install hopsworks
!pip install datasets
!pip install pandas
!pip install xgboost==1.7.2
!pip install autogluon

Check Colab setup

In [None]:
# Check GPU info
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)
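
The `!nvidia-smi` shell escape only works inside a notebook. As a minimal sketch of the same check in plain Python, assuming that an `nvidia-smi` binary on the PATH that exits successfully is a good enough signal for a usable GPU (`gpu_available` is a hypothetical helper, not part of the original notebook):

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is installed and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # tool not on the PATH, so no NVIDIA driver is visible
    try:
        result = subprocess.run(
            [exe], capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU detected" if gpu_available() else "Not connected to a GPU")
```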

In [None]:
# Check CPU info
!lscpu |grep 'Model name'

Log in to Hopsworks

In [None]:
import hopsworks

project = hopsworks.login()
dataset_api = project.get_dataset_api()

## Download full dataset from Hopsworks

In [None]:
# Download full dataset from Hopsworks
HW_PATH = "/Projects/alexao00/Project/features.csv"
LOCAL_PATH = "/content"

downloaded_file_path = dataset_api.download(
    HW_PATH, 
    local_path = LOCAL_PATH, overwrite=True)
print('The following file has been successfully downloaded from Hopsworks and is available at:' + '\n' + downloaded_file_path)

# XGBOOST REGRESSOR 🌲

Load dataset and assign correct datatypes

In [None]:
def fix_datatypes_in_dataframe(df):
  '''
  Drop unnecessary features.
  Assign correct datatypes to features in a dataframe.
  '''

  df = df.drop(['area', 'brf'], axis=1)
 
  features_to_categorical = ["streetName", "agency"]
  features_to_float = ["number", "sqm", "rooms", "price", "soldDate", "monthlyFee",
                      "monthlyCost", "floor", "yearBuilt", "gdp", "unemployment",
                      "interestRate"]

  df[features_to_categorical] = df[features_to_categorical].astype("category")
  df[features_to_float] = df[features_to_float].astype(float)
  
  df = df[(df['lat'] != 0) | (df['lon'] != 0)]

  return df
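
The last filter in `fix_datatypes_in_dataframe` keeps a row when `lat` or `lon` is non-zero, so only rows where both coordinates are zero are dropped. A small pure-Python sketch of the same predicate, using made-up coordinates (`rows` and `keep` are hypothetical names for illustration):

```python
# Hypothetical coordinates, not taken from the dataset.
rows = [
    {"lat": 59.33, "lon": 18.06},  # both coordinates set
    {"lat": 0.0, "lon": 18.06},    # only lat is zero
    {"lat": 59.33, "lon": 0.0},    # only lon is zero
    {"lat": 0.0, "lon": 0.0},      # both are zero
]


def keep(row):
    # Same logic as (df['lat'] != 0) | (df['lon'] != 0)
    return row["lat"] != 0 or row["lon"] != 0


kept = [row for row in rows if keep(row)]
print(len(kept))  # 3: only the row with both coordinates at zero is dropped
```

If rows with a single zero coordinate should also be removed, the condition would need `&` instead of `|`; the notebook does not state which behaviour is intended.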

In [None]:
import pandas as pd

# Read full dataset 
full_dataset = pd.read_csv ("/content/features.csv", sep=";")

# Assign correct datatypes to the features
full_dataset = fix_datatypes_in_dataframe(full_dataset)

# Preview dataset
print(full_dataset.info())

Separate the features (X) and the label (Y)

In [None]:
# full_dataset = full_dataset.drop(['streetName', 'agency'], axis=1) # If training for UI without categorical
xgb_X_full = full_dataset.loc[:, full_dataset.columns != 'price']
xgb_Y_full = full_dataset.loc[:, full_dataset.columns == 'price']

Split data into a train set (*80%*) and a test set (*20%*)

In [None]:
from sklearn.model_selection import train_test_split
seed = 7
test_size = 0.20
# Note: with shuffle=False the rows are split in their existing order, so random_state is not used
xgb_X_train, xgb_X_test, xgb_Y_train, xgb_Y_test = train_test_split(xgb_X_full, xgb_Y_full, test_size=test_size, random_state=seed, shuffle=False)
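
Because the call passes `shuffle=False`, the split is positional: the leading rows become the training set and the trailing rows the test set. A minimal sketch of that behaviour on a plain list, assuming the test size is rounded up as scikit-learn appears to do (`ordered_split` is a hypothetical helper):

```python
import math


def ordered_split(rows, test_size=0.20):
    """Split a sequence by position: leading rows train, trailing rows test."""
    n_test = math.ceil(test_size * len(rows))  # assumed rounding rule
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


train, test = ordered_split(list(range(10)), test_size=0.20)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

If the CSV is sorted (for example by sale date), the test set therefore contains the last 20% of rows rather than a random sample.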

Initialize XGBOOST Regressor model and tune its hyperparameters

In [None]:
from xgboost import XGBRegressor
# The following options are important because they enable the usage of 
#   categorical features: (tree_method="gpu_hist", enable_categorical=True). 
#   (reference: https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html)

# Create temporary xgb regressor model:
temp_model = XGBRegressor(tree_method="gpu_hist", enable_categorical=True)

# Perform Grid Search to find optimal hyperparameters:
from sklearn.model_selection import GridSearchCV

# Set up search grid
param_grid = {"max_depth":    [4, 6, 8, 10],
              "n_estimators": [500, 3000, 5000],
              "learning_rate": [0.01, 1e-3, 1e-5]} # Example grid only; the actual search was more thorough

# Try out every combination of the grid's values
search = GridSearchCV(temp_model, param_grid, cv=5).fit(xgb_X_train, xgb_Y_train)

print("The best hyperparameters are ",search.best_params_)
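
To get a feel for the cost of the search, the grid can be expanded by hand. This sketch only counts the combinations; it does not train anything, and `combinations` and `n_fits` are illustrative names:

```python
from itertools import product

param_grid = {
    "max_depth": [4, 6, 8, 10],
    "n_estimators": [500, 3000, 5000],
    "learning_rate": [0.01, 1e-3, 1e-5],
}
cv_folds = 5

# Every combination of the grid's values, one dict per candidate
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_fits = len(combinations) * cv_folds
print(len(combinations))  # 36
print(n_fits)             # 180 cross-validation fits
```

With up to 5000 trees per candidate, 180 fits can take a long time, which is worth keeping in mind before widening the grid.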

In [None]:
# Initialize XGBOOST Regressor and use GridSearch's recommended hyperparameter values
xgb_model = XGBRegressor(tree_method="gpu_hist", 
                     enable_categorical=True,
                     learning_rate = search.best_params_["learning_rate"],
                     n_estimators  = search.best_params_["n_estimators"],
                     max_depth     = search.best_params_["max_depth"]
                     )

Train XGBOOST Regressor model

In [None]:
# Train XGB Regressor
xgb_model.fit(xgb_X_train, xgb_Y_train)

# Generate training score
train_score = xgb_model.score(xgb_X_train, xgb_Y_train)  
print("Training score: ", train_score)

Test XGBOOST Regressor model

In [None]:
# Let the model predict on the test data
xgb_Y_pred = xgb_model.predict(xgb_X_test)

# Generate test score
test_score = xgb_model.score(xgb_X_test, xgb_Y_test)
print("Test score: ", test_score)
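
For scikit-learn style regressors, `score` returns the coefficient of determination R², not an error in price units. A minimal sketch of that quantity on made-up values (`r2_score_manual` is a hypothetical helper written for illustration):

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets, not taken from the dataset
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, y_true)
baseline = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect)   # 1.0
print(baseline)  # 0.0
```

A score of 1.0 means perfect predictions, while 0.0 is no better than always predicting the mean of the targets.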

Calculate performance metrics: *MAE* and *MAPE*

In [None]:
xgb_df = pd.DataFrame()
xgb_df['Target']=xgb_Y_test
xgb_df['Prediction']=xgb_Y_pred

xgb_df['pred_Diff_']=xgb_df['Prediction']-xgb_df['Target']
xgb_df['pred_MAPE_']=(xgb_df['Prediction']-xgb_df['Target'])/xgb_df['Target']
xgb_df['pred_MAE']=xgb_df['pred_Diff_'].abs()
xgb_df['pred_MAPE']=xgb_df['pred_MAPE_'].abs()

print('_____XGBOOST REGRESSOR_____')
print('MAE: {}'.format(xgb_df['pred_MAE'].mean()))
print('MAPE: {}'.format(xgb_df['pred_MAPE'].mean()))
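
The same two metrics can be written without pandas. This sketch uses made-up numbers and hypothetical helper names (`mae`, `mape`); as in the cell above, MAPE is reported as a fraction rather than multiplied by 100:

```python
def mae(targets, predictions):
    """Mean absolute error: average of |prediction - target|."""
    return sum(abs(p - t) for t, p in zip(targets, predictions)) / len(targets)


def mape(targets, predictions):
    """Mean absolute percentage error as a fraction: average of |(prediction - target) / target|."""
    return sum(abs((p - t) / t) for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical prices, not taken from the dataset
targets = [100.0, 200.0, 400.0]
predictions = [110.0, 180.0, 400.0]

mae_value = mae(targets, predictions)
mape_value = mape(targets, predictions)
print(mae_value)             # 10.0
print(round(mape_value, 4))  # 0.0667
```

Note that MAPE is undefined when a target is zero, so it is only meaningful here because sale prices are positive.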

Plot the XGBOOST Regressor's predictions vs. correct values

In [None]:
# Plot the model's predictions and the original values
import matplotlib.pyplot as plt 

x_ax = range(len(xgb_Y_test))
plt.plot(x_ax, xgb_Y_test, label="original")
plt.plot(x_ax, xgb_Y_pred, label="predicted")
plt.title("XGBOOST Regressor performance")
plt.legend()
plt.show()

Upload XGBOOST Regressor model to Hopsworks Model Registry

In [None]:
import os
import joblib
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# fs is a reference to the Hopsworks Feature Store
fs = project.get_feature_store()

# Create an object for the Hopsworks model registry
mr = project.get_model_registry()

# Create a directory in which the model is saved
model_dir = "xgboost_model"
os.makedirs(model_dir, exist_ok=True)
joblib.dump(xgb_model, model_dir + "/xgboost_model.pkl")


# Create a schema for the model which specifies the input (=X_train) and output (=y_train) data
input_schema = Schema(xgb_X_train)
output_schema = Schema(xgb_Y_train)
model_schema = ModelSchema(input_schema, output_schema)

# Create an entry for the model in the model registry
xgboost_model = mr.python.create_model(
  name="xgboost_model",
  version=3,
  metrics={"MAE":xgb_df['pred_MAE'].mean(), "MAPE":xgb_df['pred_MAPE'].mean()},
  model_schema=model_schema,
  description="XGBOOST Regressor on Stockholm apartment sales data"
)

# Upload the model to the model registry
xgboost_model.save(model_dir)

Download XGBOOST Regressor model from Hopsworks Model Registry

In [None]:
import joblib
# Get the model from Hopsworks
mr = project.get_model_registry()
temp = mr.get_model("xgboost_model", version=3)
model_path = temp.download()

xgb_model = joblib.load(model_path + "/xgboost_model.pkl")
print(xgb_model)

# AUTOGLUON TABULAR PREDICTOR 🧠

Download full dataset from Hopsworks

In [None]:
# Download full dataset from Hopsworks
HW_PATH = "/Projects/alexao00/Project/features.csv"
LOCAL_PATH = "/content"

downloaded_file_path = dataset_api.download(
    HW_PATH, 
    local_path = LOCAL_PATH, overwrite=True)
print('The following file has been successfully downloaded from Hopsworks and is available at:' + '\n' + downloaded_file_path)

Load dataset and assign correct datatypes



In [None]:
import pandas as pd
full_dataset = pd.read_csv ("/content/features.csv", sep=";")
full_dataset = fix_datatypes_in_dataframe(full_dataset)

Dataset preparation



In [None]:
# Separate the features (X) and the label (Y)
ag_X_full = full_dataset.loc[:, full_dataset.columns != 'price']
ag_Y_full = full_dataset.loc[:, full_dataset.columns == 'price']

# Split data into a train set (80%) and a test set (20%)
from sklearn.model_selection import train_test_split
seed = 7
test_size = 0.20
# Note: with shuffle=False the rows are split in their existing order, so random_state is not used
ag_X_train, ag_X_test, ag_Y_train, ag_Y_test = train_test_split(ag_X_full, ag_Y_full, test_size=test_size, random_state=seed, shuffle=False)

# Concatenate ag_X_train and ag_Y_train:
ag_XY_train = pd.concat([ag_X_train, ag_Y_train], axis=1)

Delete previous AutoGluon models from Colab





In [None]:
!rm -rf "/content/AutogluonModels/"

Initialize and train AutoGluon Tabular Predictor

In [None]:
from autogluon.tabular import TabularPredictor

# Choose label
label = 'price'

# Choose model name
ag_model_name = "ag_model_20230109"
ag_model_folder_path = "/content/AutogluonModels/" + ag_model_name


predictor = TabularPredictor(label=label,
                             path=ag_model_folder_path,
                             eval_metric='root_mean_squared_error').fit(
    ag_XY_train,
    auto_stack=True,
    time_limit=30*60,
    num_gpus=0,
)
                             
# ___ Notes ___                             
# "time_limit" is in seconds
# num_gpus=0 --> Use CPU
# num_gpus=1 --> Use GPU

Test AutoGluon Tabular Predictor

In [None]:
# Load model
predictor = TabularPredictor.load(ag_model_folder_path)

# Let model predict on test data
ag_Y_pred = predictor.predict(ag_X_test)

# Summarize results
predictor.fit_summary(show_plot=True)

Calculate performance metrics *MAE* and *MAPE*

In [None]:
ag_df = pd.DataFrame()
ag_df['Target']= ag_Y_test
ag_df['Prediction']=ag_Y_pred

ag_df['pred_Diff_']=ag_df['Prediction']-ag_df['Target']
ag_df['pred_MAPE_']=(ag_df['Prediction']-ag_df['Target'])/ag_df['Target']
ag_df['pred_MAE']=ag_df['pred_Diff_'].abs()
ag_df['pred_MAPE']=ag_df['pred_MAPE_'].abs()
print('_____AUTOGLUON_____')
print('MAE: {}'.format(ag_df['pred_MAE'].mean()))
print('MAPE: {}'.format(ag_df['pred_MAPE'].mean()))

Upload AutoGluon Tabular Predictor to Hopsworks Model Registry

In [None]:
import os
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# fs is a reference to the Hopsworks Feature Store
fs = project.get_feature_store()

# Create an object for the Hopsworks model registry
mr = project.get_model_registry()

# Create a schema for the model which specifies the input (=X_train) and output (=y_train) data
ag_input_schema = Schema(ag_X_train)
ag_output_schema = Schema(ag_Y_train)
ag_model_schema = ModelSchema(ag_input_schema, ag_output_schema)

# Create an entry for the model in the model registry
ag_model = mr.python.create_model(
  name="ag_model_20230109",
  version=2,
  metrics={"MAE":ag_df['pred_MAE'].mean(), "MAPE":ag_df['pred_MAPE'].mean()},
  model_schema=ag_model_schema,
  description="AutoGluon Tabular Predictor on Stockholm apartment sales data"
)

# Upload the model to the model registry
ag_folder_path = "/content/AutogluonModels"
ag_model.save(ag_folder_path)

Download AutoGluon Tabular Predictor model from Hopsworks Model Registry

In [None]:
# Get the model from Hopsworks
mr = project.get_model_registry()
temp = mr.get_model("ag_model_20230109", version=2)
ag_downloaded_folder_path = temp.download()

print(ag_downloaded_folder_path)

In [None]:
# Move Autogluon model folder to the correct folder
import shutil
original = ag_downloaded_folder_path
target = "/content/AutogluonModels"

shutil.move(original, target)

Load AutoGluon model

In [None]:
# Use the following path: "/content/AutogluonModels/[model version]/"+ag_model_name
predictor = TabularPredictor.load("/content/AutogluonModels/5/"+ag_model_name)



---


# Alternative approach: ZIP AutoGluon model and save it to Hopsworks File Browser 
### Save model
1.   Compress AutoGluon model folder to a zip-file
2.   Upload zip-file to Hopsworks

### Reuse model
1.   Download zip-file from Hopsworks to Colab
2.   Unzip the file
3.   Move output to correct folder
4.   Delete the leftover folder
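
The compress and unpack steps listed above can also be done with the Python standard library instead of shell commands. The following is a minimal sketch that runs in a temporary directory with a placeholder file standing in for the model; the folder name `ag_model_example` and the file `predictor.pkl` are assumptions for illustration, and the Hopsworks upload and download steps are left out:

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Stand-in for /content/AutogluonModels/<model name>
    model_dir = tmp / "AutogluonModels" / "ag_model_example"
    model_dir.mkdir(parents=True)
    (model_dir / "predictor.pkl").write_bytes(b"placeholder")

    # Compress the model folder; paths in the archive are relative to root_dir
    archive = shutil.make_archive(
        base_name=str(tmp / "ag_model_example"),
        format="zip",
        root_dir=str(model_dir.parent),
        base_dir=model_dir.name,
    )

    # Unpack into a separate folder, as one would after downloading the zip-file
    restore_root = tmp / "restored"
    restore_root.mkdir()
    shutil.unpack_archive(archive, extract_dir=str(restore_root))

    restored_file = restore_root / "ag_model_example" / "predictor.pkl"
    restored_ok = restored_file.read_bytes() == b"placeholder"
    print(restored_ok)  # True
```

Because the archive stores paths relative to `root_dir`, the unpacked folder does not carry a nested `content/AutogluonModels/` prefix, which would make the later move and clean-up steps unnecessary.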


In [None]:
# Zip folder
!zip -r {ag_model_folder_path}.zip {ag_model_folder_path}

In [None]:
# Upload zip-file to Hopsworks
import os

LOCAL_PATH = ag_model_folder_path + ".zip"
UPLOAD_PATH = "/Projects/alexao00/Project/"

dataset_api = project.get_dataset_api()

ag_hw_path = dataset_api.upload(
    local_path = LOCAL_PATH, 
    upload_path = UPLOAD_PATH, overwrite=True)

print('The following file has been successfully added to Hopsworks and is available at:' + '\n' + ag_hw_path)

In [None]:
# Download saved Autogluon model from Hopsworks and then unzip
HW_PATH = ag_hw_path
LOCAL_PATH = "/content"

ag_downloaded_file_path = dataset_api.download(
    HW_PATH, 
    local_path = LOCAL_PATH, overwrite=True)
print('The following file has been successfully downloaded from Hopsworks and is available at:' + '\n' + ag_downloaded_file_path)

In [None]:
# Unzip Autogluon model
!unzip "{ag_downloaded_file_path}" -d "zip_output"

# Move Autogluon model folder to the correct folder
import shutil
original = r"/content/zip_output/content/AutogluonModels/" + ag_model_name
target = r"/content/AutogluonModels"
!rm -rf {ag_model_folder_path}  # Delete existing folder to avert overwrite error
shutil.move(original, target)

# Clean up by removing the empty folder
!rm -rf "/content/zip_output"