# Predictive Modelling with the `sert` Python Package

**Author:** Amin Shoari Nejad &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Date created:** 2023/09/04

In this notebook, we demonstrate how to use the `sert` package to predict a continuous outcome. The workflow covers data loading, preprocessing, model instantiation, training, and evaluation.

# 1. Imports

Assuming the `sert` package is installed, we start by importing the necessary modules.

In [24]:
from sert.models import SERT
from sert.preprocessing import DataPreparer
from sert.losses import MaskedMSE

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

- **SERT**: This is one of the primary classes of the `sert` package, suitable for both classification and regression predictive modelling. In this notebook, we'll use it for a regression problem.

- **DataPreparer**: This class assists in preparing data. Models within the `sert` package expect data in a particular format, and `DataPreparer` facilitates this transformation. It employs the same `fit_transform` and `transform` syntax found in scikit-learn.

- **MaskedMSE**: This is a custom loss function for training models on sparse data. It is a masked version of the mean squared error (MSE) loss function that masks out the missing values in the output.
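
To make the idea of a masked loss concrete, here is a minimal NumPy sketch. It is not the package's implementation: the function name `masked_mse` and its signature are assumptions made for illustration, and the real `MaskedMSE` operates on TensorFlow tensors.

```python
import numpy as np

def masked_mse(y_true, mask, y_pred):
    """Mean squared error computed only over entries where mask == 1."""
    y_true = np.asarray(y_true, dtype=float)
    mask = np.asarray(mask, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_error = mask * (y_true - y_pred) ** 2
    # Guard against an all-missing batch to avoid dividing by zero.
    return squared_error.sum() / max(mask.sum(), 1.0)

y_true = np.array([3.0, 0.0, 5.0])  # the 0.0 is a placeholder for a missing target
mask = np.array([1.0, 0.0, 1.0])    # 1 = observed, 0 = missing
y_pred = np.array([2.0, 9.0, 5.0])

loss = masked_mse(y_true, mask, y_pred)
print(loss)  # 0.5
```

The large error on the second entry does not contribute, because its mask value is zero.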

# 2. Load the Dataset

In this tutorial we use the Ames housing dataset, a well-known dataset that includes both numerical and categorical features. Our goal is to predict the sale price of each house from its features. To learn more about the dataset, please refer to [this link](https://www.kaggle.com/datasets/shashanknecrothapa/ames-housing-dataset).

In [25]:
# Loading the data
url = "https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/ames_housing_no_missing.csv"

ames_data = pd.read_csv(url)
ames_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1460 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          1460 non-null   object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

The dataset is complete with no missing values. However, to make the prediction task more challenging we introduce missing values in the dataset by randomly replacing 10% of the feature values with NaNs.

In [27]:
# separating the target from input variables
X = ames_data.drop(['SalePrice'], axis=1)
# randomly replace 10% of the data with NaN
X = X.mask(np.random.random(X.shape) < .1)
# choose the target variable
y = ames_data[['SalePrice']]
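
The masking step can be checked on a small toy frame. The sketch below uses a seeded generator so the result is reproducible; the seed and the toy columns `area` and `zone` are assumptions added for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, added here for reproducibility

toy = pd.DataFrame({
    "area": np.arange(1000, dtype=float),
    "zone": ["RL", "RM"] * 500,
})

# DataFrame.mask replaces cells with NaN wherever the condition is True,
# for numerical and categorical columns alike.
toy_missing = toy.mask(rng.random(toy.shape) < 0.1)

missing_fraction = toy_missing.isna().to_numpy().mean()
print(f"Fraction of missing cells: {missing_fraction:.3f}")
```

The observed fraction fluctuates around 10% because each cell is dropped independently.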

# 3. Data Preprocessing

To feed our data into the **SERT** model, we need a few preprocessing steps.
First, we split the data into training and test sets and scale the numerical features with scikit-learn's `StandardScaler`. The scaler is fitted on the training data only and then used to transform both the training and test data.

In [28]:
# splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# IMPORTANT STEP: resetting the index to avoid problems with the shuffling of the data in the subsequent steps
X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)

# Standardising numerical predictors
# Identify numerical columns in the training set
numerical_cols = X_train.select_dtypes(exclude=['object']).columns

# Instantiate the scaler
scaler = StandardScaler()

# Fit the scaler to the numerical columns of the training data and transform them
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])

# Use the fitted scaler to transform the numerical columns of the test data
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
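
`StandardScaler` disregards NaNs when fitting and keeps them as NaN when transforming, which is why it can be applied after the missing values were introduced. The NumPy sketch below mimics that behaviour on toy arrays (the arrays are made up for illustration) and shows that the test data is scaled with the training statistics.

```python
import numpy as np

train = np.array([[1.0], [2.0], [np.nan], [3.0]])
test = np.array([[4.0], [np.nan]])

# Statistics come from the training data only; NaNs are ignored.
mean = np.nanmean(train, axis=0)
std = np.nanstd(train, axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Missing entries remain NaN after scaling, so they can still be treated as missing in the later steps.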

Models in `sert` are designed to work with data in set format (i.e., each row is a single observation consisting of a sequence identifier, a variable name, and a value). We melt the data into this long format as follows:

In [29]:
X_train_long = X_train.reset_index().melt(
    id_vars=['index'], value_vars=X.columns)
X_test_long = X_test.reset_index().melt(
    id_vars=['index'], value_vars=X.columns)
X_train_long

| | index | variable | value |
|---|---|---|---|
| 0 | 0 | MSSubClass | -0.867671 |
| 1 | 1 | MSSubClass | 0.076689 |
| 2 | 2 | MSSubClass | -0.631581 |
| 3 | 3 | MSSubClass | -0.159401 |
| 4 | 4 | MSSubClass | -0.159401 |
| ... | ... | ... | ... |
| 92267 | 1163 | SaleCondition | Normal |
| 92268 | 1164 | SaleCondition | Normal |
| 92269 | 1165 | SaleCondition | Normal |
| 92270 | 1166 | SaleCondition | Normal |
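
The same reshaping on a two-row toy frame makes the layout easier to see. The frame `wide` and its values are made up for illustration.

```python
import pandas as pd

wide = pd.DataFrame({
    "LotArea": [0.5, -1.2],
    "Street": ["Pave", "Grvl"],
})

# One output row per (house, variable) pair; "index" identifies the house.
long = wide.reset_index().melt(id_vars=["index"], value_vars=list(wide.columns))
print(long)
```

Each house now contributes one row per column, and the `value` column mixes numbers and strings.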


Next, we'll convert the input data into a list of NumPy arrays suitable for the model. We utilize the `DataPreparer` class for this, which accepts a single argument: `token_capacity`. This represents the maximum number of observations (a.k.a. tokens) expected per sequence in the training data.

After instantiating the `DataPreparer` class, you'll need to:

- Call the `fit_transform` method on the training data. This method returns a list of numpy arrays prepared for training.
- Call the `transform` method on the test data. This method returns a list of numpy arrays ready for testing. Note that this method assumes that the `fit_transform` method has already been called on the training data and the training and test data are of the same format.

The `fit_transform` method takes four arguments: the long-format data frame itself, followed by three column names:

1. **index**: The name of the column that contains the unique identifier for each sequence. In our dataset, this is the `index` column.
2. **names**: The name of the column that holds the variable names. In this example, it's the `variable` column.
3. **values**: The name of the column that holds the value of each variable for each observation. In our example, it's the `value` column. Note that in this particular example the column contains both numerical and categorical values. This is not an issue: `DataPreparer` appends the string values to the variable names, so that they are encoded together in the model, and masks their entries in the value column. Alternatively, you can one-hot encode the categorical variables before the subsequent steps. This is not necessary, but it may improve the performance of the model in some applications.
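
The following sketch illustrates the idea of merging categorical values into the variable names. It is only an illustration: the internals of `DataPreparer` are not shown in this notebook, so the `<variable>_<category>` naming scheme and the `token_name` and `token_value` columns are hypothetical.

```python
import pandas as pd

long = pd.DataFrame({
    "index": [0, 0, 1, 1],
    "variable": ["LotArea", "Street", "LotArea", "Street"],
    "value": [0.5, "Pave", -1.2, "Grvl"],
})

is_text = [isinstance(v, str) for v in long["value"]]

# Hypothetical scheme: a categorical value becomes part of the token name.
long["token_name"] = [
    f"{name}_{value}" if text else name
    for name, value, text in zip(long["variable"], long["value"], is_text)
]

# Categorical tokens carry no numeric value of their own, so they are masked (NaN).
long["token_value"] = pd.to_numeric(long["value"].where([not t for t in is_text]))

n_tokens = long["token_name"].nunique()
print(long[["index", "token_name", "token_value"]])
print(n_tokens)  # 3
```

Two original columns yield three distinct token names here, which is why the number of encoded variables can exceed the number of columns.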

In [30]:
# Determine the token capacity based on the maximum input length in the training set
token_cap = X_train_long.groupby('index').size().max()

processor = DataPreparer(token_capacity=token_cap)

train_input = processor.fit_transform(X_train_long,
                                      index='index',
                                      names='variable',
                                      values='value')

test_input = processor.transform(X_test_long)
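
A note on how the token capacity is computed: `melt` keeps rows whose value is NaN, and `groupby(...).size()` counts those rows too, so the capacity above equals the number of feature columns. The toy sketch below contrasts `size()` with `count()`, which only counts non-missing values. Whether `DataPreparer` drops missing rows internally is not stated in this notebook, so treat this only as an illustration of the pandas behaviour.

```python
import numpy as np
import pandas as pd

long = pd.DataFrame({
    "index": [0, 0, 0, 1, 1, 1],
    "variable": ["a", "b", "c"] * 2,
    "value": [1.0, np.nan, 3.0, np.nan, np.nan, 2.0],
})

rows_per_sequence = long.groupby("index").size()                # includes NaN rows
observed_per_sequence = long.groupby("index")["value"].count()  # excludes NaN values

print(rows_per_sequence.max())      # 3
print(observed_per_sequence.max())  # 2
```

Using the `size()`-based capacity is the safe upper bound, since no sequence can contain more tokens than it has rows.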

Our input data is now ready to be fed into the model. We also need to prepare the target data. Since we're using a masked MSE loss function, we need to create a mask that marks which target values are observed.

In [31]:
# create the output mask: 1 if the value is not missing, 0 otherwise
y_mask = ~np.isnan(y_train.values)
# impute missing values with 0; they are masked out later and do not affect the loss
y_train = np.nan_to_num(y_train.values)

Next, we stack the target data and the mask together into a single NumPy array, because `MaskedMSE` expects them in this stacked form.

In [32]:
train_output = np.stack([y_train, y_mask], axis=-1)
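
The effect of the mask, imputation, and stacking steps is easiest to see on a tiny target array. The values below are made up, and one of them is deliberately missing.

```python
import numpy as np

y = np.array([[200000.0], [np.nan], [150000.0]])  # shape (3, 1): one target per row

mask = ~np.isnan(y)          # True where the target is observed
y_filled = np.nan_to_num(y)  # NaN -> 0.0, ignored later thanks to the mask

stacked = np.stack([y_filled, mask], axis=-1)
print(stacked.shape)  # (3, 1, 2)
print(stacked[1, 0])  # target placeholder and mask for the missing row
```

The last axis holds the pair (target, mask), and the Boolean mask is converted to 0.0 and 1.0 when it is stacked with the float targets.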

# 4. Model Instantiation

Now, we'll instantiate the SERT model with appropriate hyperparameters.

In [33]:
num_var = len(processor.name_to_int)

model = SERT(num_var=num_var,
             emb_dim=15,
             num_head=3,
             ffn_dim=5,
             num_repeat=1,
             num_out=1)

### Hyperparameters:

- **num_var**: Represents the number of variables in the dataset required by the embedding layer to encode variable ids. In our example, `processor` has already encoded the variable names into ids and stored the mapping in the `name_to_int` attribute. We can use this attribute to determine the number of variables in the dataset. Note that this is not necessarily equal to the number of columns minus the target in the original dataset, since the categorical values are merged with the variable names to create new combinations that need to be embedded.
- **emb_dim**: Dimension of the embedding layer. Represents the dimension of the latent space.
- **num_head**: Number of attention heads.
- **ffn_dim**: Dimension of the feedforward layer.
- **num_repeat**: Number of times the encoder block is repeated.
- **num_out**: Dimension of the target variable. In this case, we are only predicting a single variable, so this is equal to one. But if we were predicting multiple variables, this would be equal to the number of target variables.

The `emb_dim`, `num_head`, `ffn_dim`, and `num_repeat` hyperparameters determine the size of the model. The larger these values are, the more complex the model becomes. 

# 5. Model Compilation

The instantiated model is a TensorFlow model. As with any TensorFlow model, we must compile it using the `compile` method, specifying the optimizer, the loss function, and any metrics we want to track. For this example, we use the Adam optimizer and the masked MSE loss function provided by the package.

In [34]:
model.compile(loss=MaskedMSE(), optimizer='adam')

# 6. Model Training

With our data and model ready, we can now train the SERT model.

In [45]:
model.fit(train_input, train_output, epochs=200, batch_size=250)

Epoch 1/200
Epoch 2/200
...

<keras.src.callbacks.History at 0x2b09bdd60>

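# 7. Model Evaluation

Finally, we predict on the test set and compare the predictions with the observed sale prices using the RMSE and R2 metrics.

In [47]: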
# Predict on the test set
y_pred = model.predict(test_input)
y_pred = y_pred.reshape(-1)

test_obs = y_test.to_numpy().reshape(-1)

# Calculate performance metrics
rmse = np.sqrt(mean_squared_error(test_obs, y_pred))
r2 = r2_score(test_obs, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R2: {r2:.2f}")

RMSE: 41267.59
R2: 0.78
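
For reference, the two metrics reported above can be written out directly in NumPy. The helper names `rmse` and `r2` and the toy values are chosen here for illustration; the notebook itself uses the scikit-learn functions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])

print(f"RMSE: {rmse(y_true, y_pred):.2f}")  # RMSE: 8.16
print(f"R2: {r2(y_true, y_pred):.2f}")      # R2: 0.99
```

Because the target was not rescaled, the RMSE is expressed in the original sale-price units.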


# Conclusion

This notebook demonstrated how to effectively use the `sert` package for regression problems. With its intuitive API and preprocessing utilities, working with tabular data with missing values becomes efficient and straightforward.

### A few notes:

- In this tutorial, we didn't discuss the hyperparameter tuning process as the goal was to demonstrate how to work with the package API. We encourage you to experiment with different hyperparameters and see how they affect the model performance.

- We showed that `SERT` can handle categorical variables in the dataset without the need for one-hot encoding. 

- We also showed that `SERT` can handle missing values in the dataset naturally, without the need for imputation. However, the model's performance may depend on the percentage of missing values in the dataset. In this tutorial, we introduced 10% missing values completely at random into the dataset. You can experiment with different percentages and with other missingness mechanisms, such as missing not at random (MNAR), to see how they impact the model's performance.

- The package also provides an alternative to `SERT` called `SERNN`, which does not use the transformer architecture and relies only on set encoding and feedforward layers. `SERNN` runs much faster but may sacrifice some predictive performance. You can simply replace `SERT` with `SERNN`, which has fewer hyperparameters, as shown below:

```python
from sert.models import SERNN

# num_var was computed earlier as len(processor.name_to_int)
model = SERNN(num_var=num_var,
              emb_dim=15,
              num_out=y_train.shape[1],
              task='regression')
```
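
For the missingness experiments suggested in the notes above, the sketch below generates two masks with roughly the same overall missing rate: one completely at random (as used in this tutorial) and one MNAR. The MNAR mechanism shown, where the largest values are more likely to be missing, is just one hypothetical choice, and the probabilities are picked so that both rates are about 10%.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# Completely at random: every value has the same 10% chance of being dropped.
mcar_mask = rng.random(x.shape) < 0.1

# MNAR (hypothetical mechanism): the top 20% of values go missing with
# probability 0.4, the rest with probability 0.025 (about 10% overall).
p_missing = np.where(x > np.quantile(x, 0.8), 0.4, 0.025)
mnar_mask = rng.random(x.shape) < p_missing

print(f"Missing rate, random: {mcar_mask.mean():.3f}")
print(f"Missing rate, MNAR:   {mnar_mask.mean():.3f}")
print(f"Observed mean, random: {x[~mcar_mask].mean():.3f}")
print(f"Observed mean, MNAR:   {x[~mnar_mask].mean():.3f}")
```

Under the MNAR mask the observed values are shifted downwards, because missingness depends on the value itself. This is the kind of shift that makes such mechanisms harder for a model than random missingness.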