# Teras Classification Tutorial

## Introduction

In this tutorial, we'll use Teras and both of its APIs, the default one and LayerFlow, for a classification task.

**Model**: `TabTransformerClassifier`

**Dataset**: Adult Income dataset

## Data loading and Preprocessing

In [1]:
import pandas as pd
import numpy as np

adult_df = pd.read_csv("./datasets/adult_income_dataset.csv")
adult_df.head(3)

| Unnamed: 0 | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |


In [2]:
categorical_feats = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country']
numerical_feats = ['age', 'hours-per-week']

### Fill in the missing values

In [3]:
adult_df.replace("?", np.nan, inplace=True)
adult_df.loc[:, categorical_feats] = adult_df.loc[:, categorical_feats].fillna("missing")
adult_df.loc[:, numerical_feats] = adult_df.loc[:, numerical_feats].fillna(0)

In [4]:
adult_df.isna().sum().sum()

0

### Label Encode the target column

In [5]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
adult_df["income"] = le.fit_transform(adult_df["income"])

### You don't need to encode your categorical values except for string values!

Most of the models offered by Teras contain either a `CategoricalFeatureEmbedding` layer or a variant of it, which takes care of encoding the categorical features. However, it expects those values to be in integer format, as Teras has removed support for string values. So you only need to encode the features manually when they contain string values; otherwise, you can just set the `encode` parameter to `True` and Teras will take care of it.

Here we're using the `TabTransformer` model, a **Transformer**-based architecture for classification and regression. It contains a `CategoricalFeatureEmbedding` layer which handles the categorical features. Just set the `encode` parameter to `True` and that layer will take care of encoding them.

Just to make it clear, here **encoding** means converting *string values to integers/floats* and **embedding** means converting those *encoded values to dense representations* of `embedding_dim` dimensionality.
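To make the distinction concrete, here is a minimal pure-Python sketch of the two steps on a toy column. It only illustrates the idea and is not how Teras implements it: the sorted vocabulary, the randomly initialised `embedding_table` and the `embedding_dim` value of 4 are all assumptions made for this example. In a real model the embedding table is learned during training.

```python
import random

# Toy categorical column with string values.
workclass = ["Private", "Local-gov", "Private", "Self-emp", "Local-gov"]

# Encoding: map each distinct string to an integer index.
# (Sorting the vocabulary is an assumption made for reproducibility.)
vocabulary = sorted(set(workclass))
lookup = {value: index for index, value in enumerate(vocabulary)}
encoded = [lookup[value] for value in workclass]

# Embedding: map each integer index to a dense vector of size embedding_dim.
# The table is randomly initialised here purely for illustration.
embedding_dim = 4
rng = random.Random(0)
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
                   for _ in vocabulary]
embedded = [embedding_table[index] for index in encoded]

print(encoded)                         # integer codes, one per row
print(len(embedded), len(embedded[0]))  # number of rows, embedding_dim
```

Rows that share a category share the same integer code and therefore the same dense vector.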


**Tip**: All transformer-based models contain some sort of `CategoricalFeatureEmbedding` layer. You can always print a model's summary or plot a model using the `keras.utils.plot_model` utility function to take a look at the layers it uses.

Since in our case, the categorical features are of string type, we'll encode these values ourselves using `sklearn`'s `OrdinalEncoder`.

In [6]:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
adult_df[categorical_feats] = oe.fit_transform(adult_df[categorical_feats])

### What's features metadata and why do we need it?

Some layers, like `CategoricalFeatureEmbedding`, apply certain operations only to features of a particular type, so they need a way to tell categorical and numerical features apart.

For that very purpose, we've introduced the concept of **Features Metadata**: a dictionary that contains two sub-dictionaries, one for categorical features and one for numerical features.

The categorical features dictionary maps each categorical feature name to a tuple of the feature's index in the dataset and a list of the unique values in that feature, i.e. `{feature_name: (feature_index, list_of_unique_values)}`. The list of unique values within a categorical feature is also sometimes referred to as the "*vocabulary*".
It can be accessed through `features_metadata["categorical"]`

The numerical features dictionary maps each numerical feature name to the feature's index in the dataset, i.e. `{feature_name: feature_index}`.
It can be accessed through `features_metadata["numerical"]`
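As a rough illustration of that structure, the sketch below builds a dictionary of the same shape for a tiny toy dataset. The helper `build_features_metadata` is hypothetical and written only for this example; it is not the Teras implementation, and details such as sorting the vocabulary are assumptions.

```python
def build_features_metadata(columns, categorical_features=None,
                            numerical_features=None):
    """Hypothetical helper (not part of Teras) mirroring the described layout.

    `columns` maps column names to lists of values, in dataset column order.
    """
    categorical_features = categorical_features or []
    numerical_features = numerical_features or []
    metadata = {"categorical": {}, "numerical": {}}
    for index, name in enumerate(columns):
        if name in categorical_features:
            # (feature_index, vocabulary); sorting is an assumption.
            metadata["categorical"][name] = (index, sorted(set(columns[name])))
        elif name in numerical_features:
            metadata["numerical"][name] = index
    return metadata


toy_columns = {
    "age": [25, 38, 28],
    "workclass": ["Private", "Private", "Local-gov"],
    "gender": ["Male", "Male", "Female"],
}
toy_metadata = build_features_metadata(toy_columns,
                                       categorical_features=["workclass", "gender"],
                                       numerical_features=["age"])
print(toy_metadata["categorical"])
print(toy_metadata["numerical"])
```

Each categorical entry carries both the column position and the vocabulary, while each numerical entry only needs the column position.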

#### Why does the categorical features dictionary contain a list of unique values for each feature?
If the user wants to encode the categorical values in the features, we need a lookup table that maps these categorical values to their integer indices.

During training, the model receives data in batches rather than as a whole, so a batch is all we have access to at that point. A single batch is likely to miss many of the categorical values that exist in the dataset, and a lookup table built over just those values would produce unexpected and erroneous results. That's why the lookup table is constructed over all of the categorical values up front.
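The problem with per-batch lookup tables can be seen in a small pure-Python sketch. The column values, the batch size of 2 and the `build_lookup` helper are all made up for this illustration.

```python
def build_lookup(values):
    """Map each distinct value to an integer index (sorted for determinism)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


full_column = ["Private", "Self-emp", "Local-gov",
               "Private", "State-gov", "Local-gov"]
first_batch = full_column[:2]  # what the model would see in one batch

full_lookup = build_lookup(full_column)
batch_lookup = build_lookup(first_batch)

# Values that a lookup built from the first batch alone cannot encode.
unseen = sorted(set(full_column) - set(batch_lookup))

print(full_lookup)
print(batch_lookup)
print(unseen)
```

The batch-only table both misses values that appear later and assigns different indices to the values it does know, so encodings would not be consistent from one batch to the next.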

### We've got you covered!

No, you don't need to worry about constructing that features metadata dictionary yourself. `Teras` offers a handy utility function just for that purpose!

Just import the `get_features_metadata_for_embedding` function from the `teras.utils` module and pass it the dataset as a pandas `DataFrame`, along with a list of categorical feature names and a list of numerical feature names. If your dataset has no features of either type, leave the corresponding argument as `None`.

In [7]:
from teras.utils import get_features_metadata_for_embedding

metadata = get_features_metadata_for_embedding(adult_df,
                                               categorical_features=categorical_feats,
                                               numerical_features=numerical_feats)

### Convert dataframe to tensorflow dataset

Once you're done with preprocessing, you must convert your pandas DataFrame to a TensorFlow dataset.

If you're familiar with creating TensorFlow datasets, you can do that yourself. Otherwise, don't worry: `Teras` makes it easy for you to create one.

Just import the `dataframe_to_tf_dataset` function from `teras.utils` and pass it your dataframe, along with the optional `target` and `batch_size` parameters.

In [8]:
from teras.utils import dataframe_to_tf_dataset

adult_ds = dataframe_to_tf_dataset(dataframe=adult_df,
                                   target="income")

And that's pretty much it for the preprocessing!

## Importing and instantiating the model

Teras offers two different APIs for accessing and customizing the models, to satisfy different levels of accessibility and flexibility needs.

1. **Parametric API**: It is the default API and something you're already familiar with: you import the model class and specify values for the parameters that determine how the sub-layers, models or blocks within the given model are constructed.

    For instance, specify the `embedding_dim` parameter when instantiating `TabTransformerClassifier` and that will be the dimensionality used to construct the `CategoricalFeatureEmbedding` layer. Simple enough, right?

    In this API, most of the parameters come with default values, so you only need to specify the ones you care about.


2. **LayerFlow API**: It maximizes flexibility while minimizing the interface. That may sound a bit contradictory at first, but let me explain. Here, instead of specifying individual parameters specific to sub-layers/models, the user specifies instances of the sub-layers, models or blocks that are used in the given model architecture.

    For instance, instead of specifying the `embedding_dim` parameter value, the user specifies an instance of the `CategoricalFeatureEmbedding` layer.

    In this instance, we're just passing one parameter instead of another, so it may not seem very beneficial at first glance, but let me highlight how it can help immensely depending on your use case:

    1. Since all you need to pass is an instance of a layer, it can be any layer; there's no restriction that it must be an instance of `CategoricalFeatureEmbedding`. This means you get complete control: you can not only customize the existing layers offered by Teras but also design an embedding layer of your own to work in place of the original `CategoricalFeatureEmbedding` layer, or any other layer for that matter. This is especially useful if you're designing a new architecture and want to rapidly test modifications of existing architectures by plugging custom layers of your own in place of the default ones. Pretty cool, right?

    2. To reduce the plethora of parameters and keep only the most important ones, the Parametric API often doesn't expose parameters specific to sub-layers or sub-models at the top level of the given architecture. If you need to tweak such parameters, you can use the LayerFlow API: create an instance of that layer/model with the desired parameters and pass it to the model.

    3. There are no separate model classes for regression and classification. There's just one model class whose `head` parameter determines the task at hand. For classification, for example, we pass an instance of the `ClassificationHead` layer or any custom layer for that purpose.

    Please note that you are required to pass values for all parameters when using the LayerFlow API.

    Now enough with theory, let's put things into practice!
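As a warm-up before the real models, the contrast between the two styles can be sketched with plain Python classes. Everything here (`ToyEmbedding`, `ToyHead`, `CustomEmbedding`, `ParametricToyModel`, `LayerFlowToyModel`) is a hypothetical stand-in invented for this illustration; none of these are Teras classes, and the sketch ignores everything a real Keras layer does.

```python
class ToyEmbedding:
    """Stand-in for a default embedding layer (hypothetical)."""
    def __init__(self, embedding_dim=32):
        self.embedding_dim = embedding_dim


class ToyHead:
    """Stand-in for a classification head (hypothetical)."""
    def __init__(self, num_classes=2):
        self.num_classes = num_classes


class CustomEmbedding:
    """A user-defined replacement with a parameter the default lacks."""
    def __init__(self, embedding_dim, dropout):
        self.embedding_dim = embedding_dim
        self.dropout = dropout


class ParametricToyModel:
    """Parametric style: the model builds its sub-layers from plain values."""
    def __init__(self, embedding_dim=32, num_classes=2):
        self.embedding = ToyEmbedding(embedding_dim=embedding_dim)
        self.head = ToyHead(num_classes=num_classes)


class LayerFlowToyModel:
    """LayerFlow style: the caller passes ready-made sub-layer instances."""
    def __init__(self, embedding, head):
        self.embedding = embedding
        self.head = head


# Parametric: only the exposed parameters can be tuned.
parametric = ParametricToyModel(embedding_dim=16)

# LayerFlow: any compatible object can be plugged in, including custom ones.
layerflow = LayerFlowToyModel(
    embedding=CustomEmbedding(embedding_dim=16, dropout=0.1),
    head=ToyHead(num_classes=2),
)
print(type(parametric.embedding).__name__, type(layerflow.embedding).__name__)
```

The parametric model can only vary what its constructor exposes, whereas the LayerFlow-style model accepts whatever instance it is handed, at the cost of the caller having to supply every sub-layer.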


### Using default API

In [9]:
from tensorflow import keras
from teras.models import TabTransformerClassifier

# There are 14 input features in the input dataset, so input_dim = 14
model = TabTransformerClassifier(num_classes=2,
                                input_dim=14,
                                features_metadata=metadata)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
              loss=keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])
history = model.fit(adult_ds, epochs=1)


### Using LayerFlow API
Let's now use the LayerFlow API and customize the `head` layer a bit before plugging it into the rest of the architecture.

In [16]:
from teras.layerflow.models import TabTransformer
from teras.layers import (TabTransformerColumnEmbedding,
                          CategoricalFeatureEmbedding,
                          Encoder,
                          NumericalFeatureNormalization,
                          ClassificationHead)

EMBEDDING_DIM = 32

categorical_feature_embedding = CategoricalFeatureEmbedding(features_metadata=metadata,
                                                            embedding_dim=EMBEDDING_DIM,
                                                            encode=False)
col_embedding = TabTransformerColumnEmbedding(num_categorical_features=len(categorical_feats),
                                              embedding_dim=EMBEDDING_DIM)
encoder = Encoder(embedding_dim=EMBEDDING_DIM)
numerical_normalization = NumericalFeatureNormalization(features_metadata=metadata,
                                                        normalization="layer")
head = ClassificationHead(num_classes=2)

model_lf = TabTransformer(input_dim=14,
                          categorical_feature_embedding=categorical_feature_embedding,
                          column_embedding=col_embedding,
                          encoder=encoder,
                          numerical_feature_normalization=numerical_normalization,
                          head=head)
model_lf.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
                 loss=keras.losses.BinaryCrossentropy(),
                 metrics=["accuracy"])
history = model_lf.fit(adult_ds, epochs=1)



## Wrapping it up!

And that wraps up our classification tutorial using Teras.
If you need more help, consult the documentation and other available resources. If that still leaves you with questions, feel free to raise an issue or email me at khawaja.abaid@gmail.com

If you find `Teras` useful, please consider giving it a star on GitHub and sharing it with others! Thank you!