# Multimodal data: Image classification as an example

We take the MNIST image classification task as an example of loading multimodal data. This tutorial is for those that have read all parts in "Get Started" and advanced parts in "Advanced Usage" including "New data derivers" and "Customized model base".

Although we support multimodal data, multimodal models are currently not integrated as part of the package (that's why this part is in "Advanced Usage"). `pytorch_widedeep` (`WideDeep` in this package) and `autogluon` (`AutoGluon` in this package) support some multimodal models. If you are willing to develop multimodal models or add support to model bases, you are welcome to contribute on GitHub.

In [1]:
import tabensemb
import torch
import os
from tempfile import TemporaryDirectory

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")

device = "cuda" if torch.cuda.is_available() else "cpu"

The following code is copied from [an official example](https://github.com/pytorch/examples/blob/main/mnist/main.py) of `pytorch` that defines the network and transformation of images and downloads the dataset.

**Remark**: Note that the `Net` returns logits instead of the `log_softmax` transformed values in the official example for compatibility with the framework. We have emphasized this in "Customized model base".

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return x

transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])

dataset1 = datasets.MNIST(os.path.join(temp_path.name, "data"), train=True, download=True, transform=transform)
dataset2 = datasets.MNIST(os.path.join(temp_path.name, "data"), train=False, transform=transform)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /tmp/tmp5ss42mgl/data/MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/9912422 [00:00<?, ?it/s]

Extracting /tmp/tmp5ss42mgl/data/MNIST/raw/train-images-idx3-ubyte.gz to /tmp/tmp5ss42mgl/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /tmp/tmp5ss42mgl/data/MNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/28881 [00:00<?, ?it/s]

Extracting /tmp/tmp5ss42mgl/data/MNIST/raw/train-labels-idx1-ubyte.gz to /tmp/tmp5ss42mgl/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /tmp/tmp5ss42mgl/data/MNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/1648877 [00:00<?, ?it/s]

Extracting /tmp/tmp5ss42mgl/data/MNIST/raw/t10k-images-idx3-ubyte.gz to /tmp/tmp5ss42mgl/data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /tmp/tmp5ss42mgl/data/MNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/4542 [00:00<?, ?it/s]

Extracting /tmp/tmp5ss42mgl/data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /tmp/tmp5ss42mgl/data/MNIST/raw



In this tutorial, the images are loaded into the memory.

In [3]:
import numpy as np

train_images = []
train_targets = []
test_images = []
test_targets = []
for img, target in dataset1:
    train_images.append(img)
    train_targets.append(target)
for img, target in dataset2:
    test_images.append(img)
    test_targets.append(target)
images_array = torch.concat(train_images + test_images, dim=0).numpy()
targets_array = np.array(train_targets + test_targets)
images_array.shape, targets_array.shape

((70000, 28, 28), (70000,))

Under this framework, multimodal data is loaded through data derivers. For data derivers, to load images for each data point, we need a column (here we name it `image_index`) in the tabular dataset that indicates the location of the image. In our case, the location is the index of the image in `images_array`. In other cases, the location might be a path to the image in the drive.

The MNIST dataset has a separate testing set (index>=60,000 in the `images_array` and `targets_array` defined above). We will use it after training to see the performance.

In [4]:
import pandas as pd

train_df = pd.DataFrame({"image_index": list(range(len(train_images))), "target": train_targets})
test_df = pd.DataFrame({"image_index": list(range(len(train_images), len(train_images) + len(test_images))), "target": test_targets})

The data deriver to load images is very simple. Multimodal data is not in the tabular data, so `stacked=False` is set. The tabular data `df` contains indices of images that can be used to extract images from the above `images_array`. We need the user to pass an argument `image_path` to specify the column that indicates the location of images. This is not necessary because we can directly use `"image_index"` instead of `self.kwargs["image_path"]` since we already know which column is needed.

In [5]:
from tabensemb.data import AbstractDeriver
from tabensemb.data.dataderiver import deriver_mapping

class MNISTLoader(AbstractDeriver):
    def _required_cols(self):
        return ["image_path"]

    def _defaults(self):
        return dict(stacked=False, derived_name="images", intermediate=False)

    def _derive(self, df, datamodule):
        images = images_array[df[self.kwargs["image_path"]]]
        print(f"Loaded images: {images.shape}")
        return images

deriver_mapping["MNISTLoader"] = MNISTLoader

The network of the official example can be easily migrated to the framework. In the forward passing, loaded images from the data deriver can be accessed in `derived_tensors`, and the key is `"images"` as defined above in `_defaults`. The tensor is of the shape `(n_samples, width, height)` and we transform it into `(n_samples, n_channels, width, height)` where `n_channels=1` to meet the requirement of `Net`.

In [6]:
from tabensemb.model import TorchModel, AbstractNN

class NetNN(AbstractNN):
    def __init__(self, datamodule, **kwargs):
        super(NetNN, self).__init__(datamodule, **kwargs)
        self.net = Net()

    def _forward(self, x, derived_tensors):
        images = derived_tensors["images"].unsqueeze(1)
        return self.net(images)

The implementation of the model base is straightforward.

In [7]:
class NetModel(TorchModel):
    def _initial_values(self, model_name):
        return self.trainer.chosen_params

    def _space(self, model_name):
        return self.trainer.SPACE

    def _new_model(self, model_name: str, verbose: bool, **kwargs):
        return NetNN(self.trainer.datamodule, **kwargs)

    def _get_program_name(self):
        return "NetModel"

    def _get_model_names(self):
        return ["Net"]

Then we configure the `Trainer`. Importantly, the `MNISTLoader` defined above is used to load images, and the argument `image_path` is given here.

In [8]:
from tabensemb.config import UserConfig
from tabensemb.trainer import Trainer

cfg = UserConfig.from_dict({
    "database": "mnist",
    "label_name": ["target"],
    "task": "multiclass",
    "data_derivers": [("MNISTLoader", {"image_path": "image_index"})],
    "epoch": 100,
})
trainer = Trainer(device=device)
trainer.load_config(config=cfg)

The project will be saved to /tmp/tmp5ss42mgl/output/mnist/2023-09-12-11-23-01-0_UserInputConfig


Since we have a separate testing set, during the training stage, we use the first 50,000 images for training and the last 10,000 images for validation and testing. We use the `DataModule.set_data` API instead of `load_data` to configure the dataset using these indices, which will skip the data splitter.

In [9]:
train_indices = np.arange(50000)
val_indices = np.arange(50000, 60000)
test_indices = val_indices
trainer.datamodule.set_data(train_df, cont_feature_names=[], cat_feature_names=[], label_name=["target"], train_indices=train_indices, val_indices=val_indices, test_indices=test_indices)

Loaded images: (60000, 28, 28)


We can see that the images are loaded in `DataModule.derived_data`

In [10]:
trainer.datamodule.derived_data["images"].shape

(60000, 28, 28)

Now train the model. The default loss function is cross entropy loss as shown in the output.

In [11]:
trainer.clear_modelbase()
trainer.add_modelbases([NetModel(trainer)])
trainer.train(stderr_to_stdout=True)


-------------Run NetModel-------------

Training Net
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                | Type             | Params
---------------------------------------------------------
0 | default_loss_fn     | CrossEntropyLoss | 0     
1 | default_output_norm | Softmax          | 0     
2 | net                 | Net              | 1.2 M 
---------------------------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M

RuntimeError: Input type (double) and bias type (float) should be the same

It is easy to make inferences on the testing set for both predicted classes and probabilities. The data deriver again loads images from `images_array`.

In [None]:
predictions = trainer.get_modelbase("NetModel").predict(test_df, model_name="Net")
proba = trainer.get_modelbase("NetModel").predict_proba(test_df, model_name="Net")

The prediction accuracy reaches around 99% on the testing set.

In [None]:
from tabensemb.utils import auto_metric_sklearn

auto_metric_sklearn(targets_array[60000:], proba, "accuracy_score", "multiclass")