# Imputing missing data in teras

Using state-of-the-art deep learning imputation models for tabular data can be quite a challenge, not just because of how complex the model architectures can get, but also because of the data preprocessing and transformation steps involved. teras makes it as easy as a classification or regression task.


As of v0.3, teras offers two GAN-based architectures for data imputation: ``GAIN`` and ``PCGAIN``.

For the sake of this tutorial, we'll use the ``GAIN`` architecture.


So without further ado, let's get to coding!

As always, the first step is to configure your backend. I'll be using JAX because it's almost always the fastest of the three.

To configure your backend for teras, you need to set the ``KERAS_BACKEND`` environment variable.

**NOTE:** You need to configure your backend before importing ``teras``/``keras``.

In [1]:
import os
os.environ["KERAS_BACKEND"] = "jax"
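Because the backend has to be chosen before the first import, a small guard can catch the mistake of setting it too late. The sketch below is only an illustration: ``set_keras_backend`` and its ``loaded_modules`` parameter are hypothetical names, not part of teras or Keras, and the list of valid names assumes the three backends mentioned in this tutorial.

```python
import os
import sys

# Backend names as Keras 3 spells them (PyTorch is "torch").
VALID_BACKENDS = ("jax", "tensorflow", "torch")


def set_keras_backend(name, loaded_modules=None):
    """Set KERAS_BACKEND, refusing to do so once keras/teras is imported.

    Hypothetical helper for illustration; not part of teras or Keras.
    """
    if loaded_modules is None:
        loaded_modules = sys.modules
    if name not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {VALID_BACKENDS}")
    if "keras" in loaded_modules or "teras" in loaded_modules:
        raise RuntimeError("set the backend before importing keras or teras")
    os.environ["KERAS_BACKEND"] = name
    return name


# Passing an empty mapping here simulates a fresh interpreter.
set_keras_backend("jax", loaded_modules={})
print(os.environ["KERAS_BACKEND"])
```

Calling the helper after ``import keras`` would raise instead of silently leaving the previously selected backend in place.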

For this tutorial, we'll be using the Boston Housing dataset made available by keras.

In [2]:
from keras.datasets import boston_housing

(X_train, y_train), (X_test, y_test) = boston_housing.load_data()

Let's combine all the data. Since our task here is self-supervised, we don't need labels or a test split to compute any metrics.

In [None]:
import numpy as np

dataset = np.concatenate([np.concatenate([X_train, y_train[:, np.newaxis]], axis=1),
                          np.concatenate([X_test, y_test[:, np.newaxis]], axis=1)],
                         axis=0)
dataset.shape

(506, 14)

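It's always a good idea to normalize our dataset. Note that scikit-learn's ``Normalizer``, used below, rescales each row (sample) to unit norm rather than scaling each feature.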

In [None]:
from sklearn.preprocessing import Normalizer

normalizer = Normalizer()
dataset = normalizer.fit_transform(dataset)
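To make the effect of this step concrete, here is a minimal NumPy sketch contrasting row-wise unit-norm scaling (the default behaviour of scikit-learn's ``Normalizer``) with per-feature min-max scaling, which is another common choice before training GAN-based imputers. The function names and the toy array are made up for illustration; which scheme suits your data is a judgment call.

```python
import numpy as np


def l2_normalize_rows(x):
    """Scale each row to unit Euclidean norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return x / norms


def min_max_scale_columns(x):
    """Rescale each column (feature) to the [0, 1] range."""
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns map to 0
    return (x - col_min) / col_range


toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

rows = l2_normalize_rows(toy)      # every row now has length 1
cols = min_max_scale_columns(toy)  # every column now spans [0, 1]
print(rows)
print(cols)
```

With row-wise scaling, the large second feature dominates every row; with column-wise scaling, both features end up on the same footing.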

Now, this dataset in itself doesn't contain any missing values, so we'll inject missing values ourselves to simulate a real-world scenario.

And for that, teras offers a handy utility for quickly simulating such situations. It's conveniently named ``inject_missing_values``.

In [None]:
from teras.utils import inject_missing_values

print("# of missing values: ")
print("Before injecting: ", np.isnan(dataset).sum())
dataset = inject_missing_values(dataset, 0.2)
print("After injecting: ", np.isnan(dataset).sum())

# of missing values: 
Before injecting:  0
After injecting:  1426
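The count above is consistent with the second argument being the fraction of entries to remove: 1426 of 506 × 14 = 7084 entries is roughly 20%. As a rough illustration of the idea, the sketch below removes values completely at random with NumPy. It is a stand-in written for this explanation, not the implementation of ``inject_missing_values``, and the function name and ``seed`` parameter are assumptions.

```python
import numpy as np


def inject_missing_completely_at_random(x, miss_rate, seed=None):
    """Return a float copy of `x` with roughly `miss_rate` of entries set to NaN.

    Illustrative stand-in; not teras' inject_missing_values.
    """
    rng = np.random.default_rng(seed)
    out = np.array(x, dtype=float, copy=True)
    drop = rng.random(out.shape) < miss_rate  # True where a value is removed
    out[drop] = np.nan
    return out


toy = np.arange(506 * 14, dtype=float).reshape(506, 14)
holed = inject_missing_completely_at_random(toy, 0.2, seed=0)
print(np.isnan(holed).sum(), "of", holed.size, "entries are missing")
```

Each entry is dropped independently, so the realized fraction fluctuates slightly around the requested rate.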


The ``GAIN`` architecture that we'll be using requires the dataset in the form ``(x_generator, x_discriminator)``.

There's a handy data utility function in teras for this purpose named ``create_gain_dataset``.

**NOTE:** As of teras v0.3.0, you need to have TensorFlow installed to use this function, since it uses ``tf.data`` to create a TensorFlow dataset that Keras 3 can then consume with any backend.
The same is true for the data sampling classes available in teras. You may not like TensorFlow, but you cannot not like ``tf.data``.

In [None]:
from teras.data_utils import create_gain_dataset

gain_dataset = create_gain_dataset(dataset)

# Remember to batch your tensorflow dataset
BATCH_SIZE = 64
gain_dataset = gain_dataset.batch(BATCH_SIZE)
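As a quick sanity check on the batching, the arithmetic below shows how many batches to expect, assuming the dataset yields one element per row and the final partial batch is kept (the default for ``tf.data.Dataset.batch``). The helper name is made up for illustration.

```python
import math


def batch_sizes(n_rows, batch_size):
    """Sizes of consecutive batches when the last partial batch is kept."""
    n_batches = math.ceil(n_rows / batch_size)
    return [min(batch_size, n_rows - i * batch_size) for i in range(n_batches)]


sizes = batch_sizes(506, 64)
print(len(sizes), sizes[-1])  # prints: 8 58
```

Eight batches per epoch agrees with the ``8/8`` step counter in the training log further down.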



Now let's import ``GAIN``

Since ``GAIN`` is a generative adversarial network, it requires instances of a generator and a discriminator, which we'll also import.

In [None]:
from teras.models import GAIN
from teras.models import GAINGenerator
from teras.models import GAINDiscriminator

If you look at the documentation, to instantiate either ``GAINGenerator`` or ``GAINDiscriminator`` you need one positional argument, ``data_dim``.
It's usually the same as the input dimensionality of the dataset, but it has its own name for cases where the input dataset has a different dimensionality from the original dataset due to data transformations and other preprocessing.

Anyway, here ``data_dim`` refers to the dimensionality of the original dataset.

In [None]:
dataset.shape[1]

14

In [None]:
generator = GAINGenerator(data_dim=dataset.shape[1])

discriminator = GAINDiscriminator(data_dim=dataset.shape[1])

gain = GAIN(generator,
            discriminator)

**NOTE:** You can customize these models further by specifying various keyword arguments. Look up the docs! I'll just stick with the defaults for the sake of this tutorial.

Now let's compile our model. Note that we're not passing any loss function to the ``compile`` method of the ``GAIN`` instance, because these specialized architectures compute their losses internally.

In [None]:
import keras

gain.compile(generator_optimizer=keras.optimizers.Adam(),
             discriminator_optimizer=keras.optimizers.Adam())
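For readers curious what "computing the loss internally" can look like, here is a NumPy sketch of the two losses as described in the GAIN paper and its reference implementation. This is an assumption-laden illustration, not teras' code: the function names, the ``alpha=100.0`` weight, and the toy arrays are all chosen for this example, and ``x`` is assumed to have its missing entries already replaced by zeros.

```python
import numpy as np

EPS = 1e-8  # avoids log(0)


def discriminator_loss(mask, d_prob):
    """Cross-entropy between the true mask and the discriminator's guess."""
    return -np.mean(mask * np.log(d_prob + EPS)
                    + (1.0 - mask) * np.log(1.0 - d_prob + EPS))


def generator_loss(x, mask, x_generated, d_prob, alpha=100.0):
    """Adversarial term on missing entries plus weighted
    reconstruction error on observed entries."""
    adversarial = -np.mean((1.0 - mask) * np.log(d_prob + EPS))
    reconstruction = np.mean((mask * x - mask * x_generated) ** 2) / np.mean(mask)
    return adversarial + alpha * reconstruction


# Toy batch: mask is 1 where a value is observed, 0 where it is missing.
mask = np.array([[1.0, 0.0], [1.0, 1.0]])
x = np.array([[0.5, 0.0], [0.2, 0.8]])            # missing entry stored as 0
x_generated = np.array([[0.4, 0.3], [0.2, 0.6]])  # stand-in generator output
d_prob = np.array([[0.9, 0.2], [0.8, 0.7]])       # stand-in discriminator output

d_loss = discriminator_loss(mask, d_prob)
g_loss = generator_loss(x, mask, x_generated, d_prob)
print(d_loss, g_loss)
```

The discriminator is rewarded for telling observed entries from imputed ones, while the generator is rewarded both for fooling it on missing entries and for reproducing the observed ones.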

The rule of thumb for GAN-based models in teras is to ALWAYS build them yourself. The dataset that we pass to such architectures usually deviates from the normal (X, y) paired dataset, so Keras cannot infer the expected input shape and fails to build such models automatically.

So let's build the model ourselves!

In [None]:
gain.build((BATCH_SIZE, dataset.shape[1]))

Now, if and only if you're using the JAX backend, you'll have to call the `build_optimizers` method when using any GAN-based model or any model that makes use of more than one optimizer. It is not needed for other backends like TensorFlow or PyTorch, nor is it needed for any architecture that uses only a single optimizer, which is how it is in 99.99% of cases.

Anyway, since we ARE using the JAX backend, we'll call this method.

In [None]:
gain.build_optimizers()

**WARNING:** Calling the ``build_optimizers`` method on a backend other than JAX will result in an error!
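If the same script has to run on several backends, the rule above can be written as a small guard. The helper below is hypothetical and simply encodes what the text states; it reads the ``KERAS_BACKEND`` environment variable set at the top of this tutorial, although in a live session ``keras.backend.backend()`` also reports the active backend.

```python
import os


def needs_build_optimizers(n_optimizers, backend=None):
    """True when the tutorial's rule calls for `build_optimizers()`.

    Hypothetical helper: JAX backend and more than one optimizer.
    """
    if backend is None:
        # Fall back to the environment variable set before importing keras.
        backend = os.environ.get("KERAS_BACKEND", "")
    return backend == "jax" and n_optimizers > 1


# Usage with the tutorial's model (commented out so the sketch runs without teras):
# if needs_build_optimizers(n_optimizers=2):
#     gain.build_optimizers()

print(needs_build_optimizers(2, backend="jax"))
print(needs_build_optimizers(2, backend="torch"))
```

This keeps the JAX-only call out of the way when the backend is switched, which avoids the error mentioned in the warning.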

In [None]:
history = gain.fit(gain_dataset, epochs=2)

Epoch 1/2


8/8 ━━━━━━━━━━━━━━━━━━━━ 5s 384ms/step - discriminator_loss: 0.7368 - generator_loss: 48.5096
Epoch 2/2
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - discriminator_loss: 0.7002 - generator_loss: 47.1353


Now the model is trained. Cool. But if we can't put it to use, it's useless. So let's put it to use.

To impute data with missing values, you can either use the ``predict`` method of the trained ``GAIN`` instance or use the ``Imputer`` class available in the ``teras.tasks`` module. The ``Imputer`` class may not be that useful here, but it can be very useful in cases where you transform your data using a data transformer class.

So, assuming you already know how to use ``predict``, we'll use the ``Imputer`` class here. It offers an ``impute`` method that takes in a dataset with missing values and returns the imputed data. If a data transformer instance is passed in during instantiation, it will return the imputed data in its original format.

Since we're not using any data transformer class, we'll set the ``reverse_transform`` parameter of the ``impute`` method to ``False``; otherwise it'll result in an error.

In [None]:
from teras.tasks import Imputer

gain_imputer = Imputer(gain)

imputed_dataset = gain_imputer.impute(dataset, reverse_transform=False)

16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 32ms/step


In [None]:
print("Missing values in the original dataset: ", np.isnan(dataset).sum())
print("Missing values in the imputed dataset: ", np.isnan(imputed_dataset).sum())

Missing values in the original dataset:  1426
Missing values in the imputed dataset:  0
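A count of zero only tells us that every gap was filled, not how well. In the GAIN formulation, observed entries are kept as they are and generated values are used only where data is missing, so quality is usually judged on the hidden entries alone. The sketch below illustrates both ideas on a toy array; the function names are made up for this example, and it assumes you kept a copy of the complete data before injecting missing values (this tutorial overwrote ``dataset``, so you would need to save one first).

```python
import numpy as np


def fill_missing(x_with_nans, generated):
    """Keep observed entries; use generated values only where data is missing."""
    observed = ~np.isnan(x_with_nans)
    return np.where(observed, x_with_nans, generated)


def rmse_on_missing(x_true, x_with_nans, x_imputed):
    """RMSE restricted to the entries that were hidden from the model."""
    hidden = np.isnan(x_with_nans)
    return float(np.sqrt(np.mean((x_true[hidden] - x_imputed[hidden]) ** 2)))


x_true = np.array([[1.0, 2.0], [3.0, 4.0]])         # complete data, kept aside
x_holed = np.array([[1.0, np.nan], [np.nan, 4.0]])  # after injecting NaNs
generated = np.array([[9.0, 2.5], [2.0, 9.0]])      # stand-in for model output

x_imputed = fill_missing(x_holed, generated)
score = rmse_on_missing(x_true, x_holed, x_imputed)
print(x_imputed)
print(score)
```

Note that the generated values at observed positions (the 9.0 entries) never reach the result, which is why they do not affect the score.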


And that wraps it up! As you saw, it's super easy and intuitive to use complex state-of-the-art architectures for data imputation, thanks to teras!

If you have any questions or run into an issue, reach us on Twitter [@TerasML](https://twitter.com/TerasML) or file an issue at the [teras GitHub repository](https://github.com/KhawajaAbaid/teras).