# Data Imputation Tutorial

## Introduction

In this tutorial, I'll walk you through imputing missing data using state-of-the-art deep learning architectures in `Teras`.

**Model**: `GAIN`

**Dataset**: Gemstone dataset (from Kaggle)

**Task**: Imputing missing data using GAIN

## Data Loading and Preprocessing

In [1]:
import pandas as pd

# We'll use the first 10000 instances
# We also drop the id column since that is useless
gem_df = pd.read_csv("./datasets/gemstone_dataset.csv").drop("id", axis=1)[:10000]
gem_df.head(3)

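| | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52 | Premium | F | VS2 | 62.2 | 58.0 | 7.27 | 7.33 | 4.55 | 13619 |
| 1 | 2.03 | Very Good | J | SI2 | 62.0 | 58.0 | 8.06 | 8.12 | 5.05 | 13387 |
| 2 | 0.7 | Ideal | G | VS1 | 61.2 | 57.0 | 5.69 | 5.73 | 3.5 | 2772 |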


In [2]:
categorical_feats = ["cut", "color", "clarity"]
numerical_feats = ["carat", "depth", "table", "x", "y", "z"]

### Injecting missing values
Since this dataset doesn't contain any NaN values, we'll introduce some ourselves to demonstrate the imputation process.

For that, we'll use a utility function from `Teras`.

In [3]:
from teras.utils import inject_missing_values

gem_missing_df = inject_missing_values(gem_df)



In [4]:
# Let's verify that our new dataset does indeed contain nan values

print("# of missing values in original dataset: ", gem_df.isna().sum().sum())
print("# of missing values in missing dataset: ", gem_missing_df.isna().sum().sum())

# of missing values in original dataset:  0
# of missing values in missing dataset:  9845
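
To make the idea concrete, here is a minimal sketch of how values could be removed at random using only `pandas` and `numpy`. It is not the implementation of `inject_missing_values`; the function name `inject_nans_at_random` and its `miss_rate` and `seed` parameters are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def inject_nans_at_random(df: pd.DataFrame, miss_rate: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Return a copy of `df` in which each cell is replaced by NaN with probability `miss_rate`.

    Hypothetical helper for illustration only, not part of Teras.
    """
    rng = np.random.default_rng(seed)
    # Boolean array with the same shape as the frame: True marks a cell to blank out.
    drop_mask = rng.random(df.shape) < miss_rate
    # DataFrame.mask replaces the cells where the condition is True with NaN
    # and leaves the input frame untouched.
    return df.mask(drop_mask)


# Small stand-in frame with a float, a string, and an integer column.
toy = pd.DataFrame({
    "carat": np.linspace(0.2, 2.0, 100),
    "cut": ["Ideal", "Good"] * 50,
    "price": np.arange(100),
})
toy_missing = inject_nans_at_random(toy, miss_rate=0.2, seed=0)

print("missing before:", toy.isna().sum().sum())
print("missing after: ", toy_missing.isna().sum().sum())
```

Note that integer columns are upcast to floats once they contain NaN, which is standard `pandas` behaviour.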


### Data Transformer and Data Sampler classes for generative models
Generative architectures like `GAIN`, `PCGAIN`, `CTGAN` and `TVAE` require specialized data preprocessing and transformation, as well as a specific way of generating batches of data. To make this easier for users, `Teras` implements `DataTransformer` and `DataSampler` classes for each of these models.

These can be imported from the `teras.preprocessing.<architecture_name>` module.
For instance, for GAIN, we import its `DataSampler` and `DataTransformer` classes as follows:

`from teras.preprocessing.gain import DataSampler, DataTransformer`

In [5]:
from teras.preprocessing.gain import DataTransformer, DataSampler

data_transformer = DataTransformer(numerical_features=numerical_feats,
                                   categorical_features=categorical_feats)
gem_transformed = data_transformer.fit_transform(gem_missing_df, return_dataframe=True)

data_sampler = DataSampler(batch_size=1024)
dataset = data_sampler.get_dataset(gem_transformed)
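
For intuition about why a transformed dataset can end up with more columns than the original, here is a rough sketch of one common preprocessing scheme for GAIN-style models: min-max scaling of numerical columns and one-hot encoding of categorical columns, with missing entries kept as NaN. This is an assumption for illustration and not necessarily what Teras's `DataTransformer` does internally; the helper name `minmax_onehot` is hypothetical.

```python
import numpy as np
import pandas as pd


def minmax_onehot(df: pd.DataFrame, numerical: list, categorical: list) -> pd.DataFrame:
    """Scale numerical columns to [0, 1] and one-hot encode categorical ones, keeping NaNs.

    Hypothetical helper for illustration only, not part of Teras.
    """
    parts = []
    for col in numerical:
        # min() and max() skip NaNs by default, so missing cells stay NaN after scaling.
        # A constant column (hi == lo) would divide by zero; this sketch does not handle it.
        lo, hi = df[col].min(), df[col].max()
        parts.append((df[col] - lo) / (hi - lo))
    for col in categorical:
        dummies = pd.get_dummies(df[col], prefix=col, dtype=float)
        # get_dummies encodes a missing category as an all-zero row;
        # overwrite those rows with NaN so the missingness is not lost.
        dummies.loc[df[col].isna(), :] = np.nan
        parts.append(dummies)
    return pd.concat(parts, axis=1)


toy = pd.DataFrame({
    "carat": [0.5, np.nan, 1.5],
    "cut": ["Ideal", "Good", None],
})
toy_transformed = minmax_onehot(toy, numerical=["carat"], categorical=["cut"])
# 1.0 where a value is observed, 0.0 where it is missing.
toy_mask = toy_transformed.notna().astype(float)

print(toy_transformed.shape)  # two input columns become three after one-hot encoding
```

Under this scheme, a single categorical column with many levels adds one output column per level, so the transformed width can differ substantially from the number of input columns.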

## Training

In [None]:
from teras.impute import GAIN

# Notice how we use data_sampler.data_dim instead of the dimensions
# of the original dataset. That is because data transformation
# often expands the data dimensions considerably,
# so it's safer to use `DataSampler`'s `.data_dim` attribute.
gain_imputer = GAIN(data_dim=data_sampler.data_dim)

# For highly customized architectures like these, 
# which employ custom loss functions, `Teras` has default values in place.
# So it's recommended to just call the compile method without any parameters
# unless you understand the underlying structure.
# Read more at Section 4 of General Guidelines and FAQs notebook in tutorial directory.
gain_imputer.compile()


history = gain_imputer.fit(dataset, epochs=2)

## Imputation

All imputation models offered by `Teras` have a `.impute` method that can be used after training to impute the dataset with missing values.

It receives the dataset with missing values. It is also recommended to pass it the `DataTransformer` instance that was used to transform the original data during preprocessing, which is then used to reverse transform the imputed data back to its original format and distribution.

In [7]:
test_chunk = gem_transformed[500:1000]
x_imputed = gain_imputer.impute(test_chunk,
                                data_transformer=data_transformer,
                                reverse_transform=True)

x_imputed.head()



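| | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | Ideal | H | SI1 | 61.5 | 57.0 | 4.66 | 4.7 | 2.88 | 717.000013 |
| 1 | 1.01 | Good | G | VS1 | 63.7 | 56.0 | 6.37 | 6.4 | 4.06 | 6449.00015 |
| 2 | 1.580129 | Ideal | F | VS2 | 62.3 | 89.586648 | 6.52 | 6.45 | 4.04 | 8928.822703 |
| 3 | 0.32 | Ideal | F | VS2 | 62.1 | 56.0 | 4.43 | 4.38 | 2.74 | 828.000017 |
| 4 | 1.03 | Ideal | H | SI1 | 60.6 | 57.0 | 6.51 | 6.55 | 3.95 | 4485.000018 |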
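
Because the missing values were injected into a complete dataset, the imputed numbers can be compared against the ground truth. The sketch below computes the root mean squared error over only those numerical cells that were missing. The helper `masked_rmse` is hypothetical and not part of Teras, and the toy frames stand in for the original, missing, and imputed data.

```python
import numpy as np
import pandas as pd


def masked_rmse(original: pd.DataFrame, with_missing: pd.DataFrame,
                imputed: pd.DataFrame, columns: list) -> float:
    """RMSE between imputed and true values, restricted to cells that were missing.

    Hypothetical helper for illustration only. Rows are compared by position,
    so all three frames must hold the same rows in the same order.
    """
    was_missing = with_missing[columns].isna().to_numpy()
    diff = imputed[columns].to_numpy(dtype=float) - original[columns].to_numpy(dtype=float)
    return float(np.sqrt(np.mean(diff[was_missing] ** 2)))


# Toy stand-ins: the ground truth, a copy with two cells removed,
# and a copy where those two cells were filled in by some imputer.
original = pd.DataFrame({"carat": [1.0, 2.0, 3.0], "depth": [10.0, 20.0, 30.0]})
with_missing = original.copy()
with_missing.loc[1, "carat"] = np.nan
with_missing.loc[2, "depth"] = np.nan
imputed = with_missing.copy()
imputed.loc[1, "carat"] = 2.5
imputed.loc[2, "depth"] = 28.0

toy_rmse = masked_rmse(original, with_missing, imputed, ["carat", "depth"])
print(f"RMSE on imputed cells: {toy_rmse:.4f}")
```

To apply the same idea to this tutorial's data, the three frames would need to cover the same rows, for example rows 500 to 1000 of `gem_df` and `gem_missing_df` alongside `x_imputed`. Columns on very different scales, such as `carat` and `price`, are better scored separately or after normalization.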


## Wrapping it up!

And that wraps up our data imputation tutorial using Teras.

If you need more help, consult the documentation and other available resources. If that still leaves you with questions, feel free to raise an issue or email me at khawaja.abaid@gmail.com.

If you find `Teras` useful, please consider giving it a star on GitHub and sharing it with others!

Thank you!