# Classification with the `sert` Python Package

**Author:** Amin Shoari Nejad &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Date created:** 2023/09/04

In this notebook, we demonstrate how to use the `sert` package to predict a categorical outcome (a class label). The process includes data loading, preprocessing, model instantiation, training, and evaluation.

# 1. Imports

Assuming that you have installed the sert package, we can import the necessary modules.

In [73]:
from sert.models import SERT
from sert.preprocessing import DataPreparer
from sert.losses import WeightedCrossentropy

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.datasets import load_iris

- **SERT**: This is one of the primary classes of the `sert` package, suitable for both classification and regression predictive modelling. In this notebook, we'll use it for a classification problem.

- **DataPreparer**: This class assists in preparing data. Models within the `sert` package expect data in a particular format, and `DataPreparer` facilitates this transformation. It employs the same `fit_transform` and `transform` syntax found in scikit-learn.

- **MaskedMSE**: This is a custom loss function for training models on sparse data. It is a masked version of the mean squared error (MSE) loss function that masks out the missing values in the output. It is intended for regression and is neither imported nor used in this notebook.

- **WeightedCrossentropy**: This is a custom loss function for training models, essentially a weighted version of the cross-entropy loss function. The weights can be adjusted to balance the loss function. For instance, when the class distribution is skewed, the loss can be weighted to prioritize the underrepresented class. In this notebook, we'll use this loss function for training, though we assign the same weight to every class.
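To make the idea of class weighting concrete, here is a minimal NumPy sketch of a weighted cross-entropy. It is an illustration only: the function `weighted_crossentropy` and its arguments are hypothetical and are not the `sert` implementation, which may differ in details such as normalization.

```python
import numpy as np

def weighted_crossentropy(y_true, y_prob, class_weights, eps=1e-7):
    """Mean categorical cross-entropy with one weight per class.

    Illustrative sketch only; not the sert implementation.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    w = np.asarray(class_weights, dtype=float)
    # Per-sample loss: -sum_c w_c * y_c * log(p_c)
    per_sample = -np.sum(w * y_true * np.log(y_prob), axis=1)
    return per_sample.mean()

# Two samples with one-hot targets (class 0 and class 2).
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_prob = np.array([[0.8, 0.1, 0.1], [0.2, 0.3, 0.5]])

equal = weighted_crossentropy(y_true, y_prob, [1, 1, 1])
upweighted = weighted_crossentropy(y_true, y_prob, [1, 1, 2])
print(equal, upweighted)
```

With equal weights this reduces to the ordinary cross-entropy; doubling the weight of class 2 doubles the contribution of the second sample to the loss.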

# 2. Load the Dataset

In this tutorial we will use the well-known [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris). The dataset contains 150 samples, each with four features and one of three possible classes. The goal is to predict the class of each sample.

In [74]:
data = load_iris()
X = data.data
# randomly replace about 5% of the entries with NaN
np.random.seed(1)
X[np.random.random(X.shape) < .05] = np.nan
y = data.target
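The masking step above draws one uniform random number per entry and replaces the entry when the draw falls below the threshold, so the expected share of missing values equals the threshold itself (0.05 in the code). The following sketch checks this on a dummy array of the same shape as the Iris features; `X_demo` and `missing_fraction` are names introduced here for illustration.

```python
import numpy as np

rng = np.random.RandomState(1)

# Dummy array with the same shape as the Iris feature matrix (150 x 4).
X_demo = np.ones((150, 4))

# Each entry is masked independently with probability 0.05.
mask = rng.random_sample(X_demo.shape) < 0.05
X_demo[mask] = np.nan

missing_fraction = np.isnan(X_demo).mean()
print(f"Missing fraction: {missing_fraction:.3f}")
```

Because the mask is random, the realized fraction only approximates the threshold, especially on a dataset this small.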

# 3. Data Preprocessing

In order to feed our data into the **SERT** model, we need to perform some preprocessing steps.
First we need to scale the data. We use the `StandardScaler` from scikit-learn for this purpose. To do so we need to fit the scaler on the training data and then transform both the training and test data.

In [75]:
# splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

features = data.feature_names
X_test.columns = features
X_train.columns = features

# Scaling the data
# Instantiate the scaler
scaler = StandardScaler()

# Fit the scaler to the numerical columns of the training data and transform them
X_train[features] = scaler.fit_transform(X_train[features])

# Use the fitted scaler to transform the numerical columns of the test data
X_test[features] = scaler.transform(X_test[features])
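Note that the data already contains `NaN` values at this point. According to the scikit-learn documentation, `StandardScaler` treats `NaN` as missing: such entries are ignored when fitting and left as `NaN` when transforming. The NumPy sketch below mimics that behaviour on a toy array to show what the scaling step does; it is an equivalent calculation, not the scikit-learn code.

```python
import numpy as np

train = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0]])
test = np.array([[np.nan, 20.0]])

# Statistics are computed on the training data only, skipping NaNs.
mean = np.nanmean(train, axis=0)
std = np.nanstd(train, axis=0)  # population standard deviation (ddof=0)

# The same statistics are applied to both splits; NaNs stay NaN.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(train_scaled)
print(test_scaled)
```

Fitting on the training split alone keeps information from the test split out of the preprocessing step.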

Models in `sert` are designed to work with data in set format, in which each row is a single observation consisting of a sample identifier, a variable name, and a value. We melt the data into this format as follows:

In [76]:
X_train_long = X_train.reset_index().melt(
    id_vars=['index'], value_vars=features)
X_test_long = X_test.reset_index().melt(
    id_vars=['index'], value_vars=features)

X_train_long.sort_values(by=['index'])

| | index | variable | value |
|---|---|---|---|
| 0 | 0 | sepal length (cm) | -1.468028 |
| 360 | 0 | petal width (cm) | -1.289006 |
| 120 | 0 | sepal width (cm) | 1.263388 |
| 240 | 0 | petal length (cm) | -1.545837 |
| 1 | 1 | sepal length (cm) | -0.134894 |
| ... | ... | ... | ... |
| 118 | 118 | sepal length (cm) | -0.013700 |
| 239 | 119 | sepal width (cm) | -0.142236 |
| 119 | 119 | sepal length (cm) | 1.561822 |
| 359 | 119 | petal length (cm) | 1.269136 |
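The reshaping is easier to follow on a tiny example. The sketch below applies the same `reset_index().melt(...)` call to a two-row frame with made-up column names `a` and `b`. Note that `melt` keeps rows whose value is `NaN`, so every sample ends up with one row per feature.

```python
import numpy as np
import pandas as pd

# Two samples, two features, one missing value.
wide = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# One row per (sample, variable) pair.
long = wide.reset_index().melt(id_vars=["index"], value_vars=["a", "b"])

# Number of rows (observations) per sample.
rows_per_sample = long.groupby("index").size()

print(long.sort_values(by=["index"]))
print(rows_per_sample)
```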


Next, we'll convert the input data into a list of numpy arrays suitable for the model. We use the `DataPreparer` class for this, which accepts a single argument: `token_capacity`. This is the maximum number of observations (also called tokens) expected for any single sample in the training data.

After instantiating the `DataPreparer` class, you'll need to:

- Call the `fit_transform` method on the training data. This method returns a list of numpy arrays prepared for training.
- Call the `transform` method on the test data. This method returns a list of numpy arrays ready for testing. Note that this method assumes that the `fit_transform` method has already been called on the training data and the training and test data are of the same format.

The `fit_transform` method takes four arguments: the long-format DataFrame, followed by three column names:

1. **index**: The name of the column that contains the unique identifier for each sequence. In our dataset, this is the `index` column.
2. **names**: The name of the column that holds the variable names. In this example, it's the `variable` column.
3. **values**: The name of the column that holds the value of the variable for each observation. In our example, it's the `value` column.
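This notebook does not show how `DataPreparer` lays out its arrays internally. As a rough illustration of the general idea behind a token capacity, the sketch below pads each sample's (variable, value) pairs to a fixed length. Everything in it is an assumption made for illustration: the helper `pack_sets`, the integer variable ids, the use of 0 as padding, and the choice to drop missing values are not taken from `sert`.

```python
import numpy as np

def pack_sets(rows, var_ids, token_capacity):
    """Hypothetical helper: pad each sample's observations to a fixed length.

    rows: iterable of (sample_id, variable_name, value) tuples.
    var_ids: dict mapping variable names to positive integer ids.
    """
    samples = sorted({sample_id for sample_id, _, _ in rows})
    position = {sample_id: i for i, sample_id in enumerate(samples)}
    # Id 0 is reserved for padding (an assumption of this sketch).
    names = np.zeros((len(samples), token_capacity), dtype=int)
    values = np.zeros((len(samples), token_capacity), dtype=float)
    counts = [0] * len(samples)
    for sample_id, var, val in rows:
        if np.isnan(val):
            continue  # assumption: missing observations are simply omitted
        i = position[sample_id]
        if counts[i] >= token_capacity:
            raise ValueError("sample has more observations than token_capacity")
        names[i, counts[i]] = var_ids[var]
        values[i, counts[i]] = val
        counts[i] += 1
    return names, values

rows = [(0, "a", 1.0), (0, "b", 2.0), (1, "a", float("nan")), (1, "b", 4.0)]
names, values = pack_sets(rows, {"a": 1, "b": 2}, token_capacity=2)
print(names)
print(values)
```

The point of the sketch is that samples with different numbers of observed values can still share one fixed array shape, which is what a model trained in batches needs.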

In [77]:
# Determine the token capacity based on the maximum input length in the training set
token_cap = X_train_long.groupby('index').size().max()

processor = DataPreparer(token_capacity=token_cap)

train_input = processor.fit_transform(X_train_long,
                                      index='index',
                                      names='variable',
                                      values='value')

test_input = processor.transform(X_test_long)

Our input data is now ready to be fed into the model. We also need to prepare the target data. The `WeightedCrossentropy` loss expects one-hot encoded targets, so we encode the class labels with the `OneHotEncoder` class from scikit-learn.

In [78]:
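# `sparse` was renamed to `sparse_output` in scikit-learn 1.2;
# on older versions use OneHotEncoder(sparse=False) instead
encoder = OneHotEncoder(sparse_output=False)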
train_output = encoder.fit_transform(y_train.reshape(-1, 1))
test_output = encoder.transform(y_test.reshape(-1, 1))
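For integer labels such as 0, 1, 2, one-hot encoding simply places a 1 in the column that matches the label. The NumPy sketch below shows the encoding and how to map it back to class labels; the toy label vector `y` is made up for illustration.

```python
import numpy as np

y = np.array([0, 2, 1, 2])

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(3)[y]

# argmax recovers the original integer labels.
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```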

# 4. Model Instantiation

Now, we'll instantiate the SERT model with appropriate hyperparameters.

In [79]:
model = SERT(num_var=4,
             emb_dim=15,
             num_head=3,
             ffn_dim=5,
             num_repeat=1,
             num_out=3,
             task='classification')

### Hyperparameters:

- **num_var**: Represents the number of variables in the dataset required by the embedding layer to encode variable names. 
- **emb_dim**: Dimension of the embedding layer. Represents the dimension of the latent space.
- **num_head**: Number of attention heads.
- **ffn_dim**: Dimension of the feedforward layer.
- **num_repeat**: Number of times the encoder block is repeated.
- **num_out**: Number of output classes.

The `emb_dim`, `num_head`, `ffn_dim`, and `num_repeat` hyperparameters determine the size of the model. The larger these values are, the more complex the model becomes. The `num_out` hyperparameter is set to 3, as there are three classes. The `task` argument is set to `classification` since the goal is to classify the species. For regression, the `task` argument is set to `regression`, which is the default.


# 5. Model Compilation

The instantiated model is a TensorFlow model. We must compile it using the `compile` method, specifying the optimizer, loss function, and metrics we want to track, as with any TensorFlow model. For this example, we are using the Adam optimizer, the weighted cross-entropy loss function provided by the package, and the accuracy as the metric. 

In [80]:
model.compile(optimizer='adam',
              loss=WeightedCrossentropy([1, 1, 1]),
              metrics=['accuracy'])

In this example, we use the weight vector [1,1,1] for the weighted cross-entropy loss function, which means that the loss treats all classes as equally important. You can adjust the weights to prioritize one class over the others. For instance, if the class distribution is skewed, you might want to assign a higher weight to the underrepresented class. You can use `compute_class_weight` from scikit-learn to compute the weights and use them in the loss function like below:

In [81]:
# from sklearn.utils.class_weight import compute_class_weight
# import tensorflow as tf

# class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
# class_weights = tf.cast(class_weights, dtype=tf.float32)
# model.compile(optimizer='adam', loss=WeightedCrossentropy(class_weights), metrics=['accuracy'])
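The `'balanced'` option of `compute_class_weight` follows the heuristic documented by scikit-learn, `n_samples / (n_classes * np.bincount(y))`, so that rarer classes receive larger weights. The sketch below reproduces the formula with NumPy on a made-up, imbalanced label vector.

```python
import numpy as np

# Imbalanced toy labels: four samples of class 0, two each of classes 1 and 2.
y = np.array([0, 0, 0, 0, 1, 1, 2, 2])

counts = np.bincount(y)  # samples per class
balanced = len(y) / (len(counts) * counts)

print(counts)
print(balanced)
```

Under this weighting every class contributes the same total weight, `n_samples / n_classes`, regardless of how many samples it has.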

# 6. Model Training

With our data and model ready, we can now train the SERT model.

In [82]:
model.fit(train_input, train_output, epochs=100, batch_size=75, verbose=0)

<keras.src.callbacks.History at 0x13aed5cf0>

In [83]:
_loss_, accuracy = model.evaluate(test_input, test_output)
print(f"Test Accuracy: {accuracy:.2f}")

Test Accuracy: 1.00
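The test split holds 30 samples (20% of 150), so each misclassified sample would lower the accuracy by roughly 0.033. As a sketch of what the accuracy metric measures for one-hot targets, assuming the model outputs one probability per class in the same column order as the targets, the predicted class is the column with the highest probability. The arrays below are made up for illustration.

```python
import numpy as np

# One-hot targets and predicted class probabilities for four samples.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.5, 0.3],
                   [0.3, 0.4, 0.3]])

predicted = y_prob.argmax(axis=1)
actual = y_true.argmax(axis=1)

# Fraction of samples whose predicted class matches the target class.
accuracy = (predicted == actual).mean()
print(f"Accuracy: {accuracy:.2f}")
```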


# Conclusion

This notebook demonstrated how to effectively use the `sert` package for classification. With its intuitive API and preprocessing utilities, working with tabular data with missing values becomes efficient and straightforward.

### A few notes:

- In this tutorial, we didn't discuss the hyperparameter tuning process as the goal was to demonstrate how to work with the package API. Nonetheless, the model performed perfectly on the test data with arbitrary hyperparameters.

- We showed that with the right informative features, the model can achieve high performance even on small datasets with missing values. 

- The package also provides an alternative to `SERT` called `SERNN`, which doesn't use the transformer architecture and relies only on set encoding and feedforward layers. `SERNN` runs much faster but might compromise performance. You can simply replace `SERT` with `SERNN`, which has fewer hyperparameters, like below:

```python
from sert.models import SERNN

model = SERNN(num_var=4,   # the Iris data has four variables
              emb_dim=15,
              num_out=3,   # three classes
              task='classification')
```