# New data processors

Data processors are the core components of the data processing procedure. They can:

1. Add new data points by inheriting `tabensemb.data.AbstractAugmenter`;
2. Remove data points by inheriting `tabensemb.data.AbstractProcessor`;
3. Change values of features by inheriting `tabensemb.data.AbstractTransformer` or `tabensemb.data.AbstractScaler`;
4. Reduce the number of features by inheriting `tabensemb.data.AbstractFeatureSelector`.

The other classes listed above are all subclasses of `tabensemb.data.AbstractProcessor`. A subclass of `AbstractProcessor` should implement `_fit_transform` and `_transform`. `_fit_transform` fits the processor and transforms the training set and the validation set. `_transform` is called to transform the testing set or an upcoming dataset using the fitted processor. For all these classes, you can implement `_required_kwargs` and `_defaults` as we did in "New data derivers", because they all inherit `tabensemb.data.AbstractRequireKwargs`.
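The fit/transform contract described above can be sketched without the package. `MeanCenterSketch` below is a hypothetical stand-in that does not inherit from any `tabensemb` class and omits the `datamodule` argument of the real methods; it only illustrates that `_fit_transform` learns and stores state, while `_transform` reuses that state without refitting.

```python
import pandas as pd


class MeanCenterSketch:
    """Hypothetical stand-in (NOT a tabensemb class) for the two-method contract."""

    def __init__(self, columns):
        self.columns = list(columns)

    def _fit_transform(self, data: pd.DataFrame) -> pd.DataFrame:
        # Learn statistics from the training/validation data and store them.
        self.means_ = data[self.columns].mean()
        data = data.copy()
        data[self.columns] = data[self.columns] - self.means_
        return data

    def _transform(self, data: pd.DataFrame) -> pd.DataFrame:
        # Reuse the stored statistics; do not refit on the new data.
        data = data.copy()
        data[self.columns] = data[self.columns] - self.means_
        return data


train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [10.0]})

processor = MeanCenterSketch(["x"])
train_out = processor._fit_transform(train)  # centered with the training mean (2.0)
test_out = processor._transform(test)        # centered with the same stored mean
```

The testing row is shifted by the training mean rather than by its own mean, which is the behavior a fitted processor is expected to have on unseen data.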

The usage of processors is introduced in "Using data functionalities".

The implemented processors should be registered as follows to be recognized by `DataModule.set_data_processors` automatically.

```python
from tabensemb.data.dataprocessor import processor_mapping
processor_mapping["ADataProcessor"] = ADataProcessor
```

```python
from tabensemb.data import AbstractAugmenter, AbstractProcessor, AbstractTransformer, AbstractScaler, AbstractFeatureSelector, DataModule
import pandas as pd
import numpy as np
```

## `AbstractAugmenter`

We provide an example of data augmentation in the package, which simply copies the last two data points of the input `DataFrame` that contains the training set and the validation set. The method `_get_augmented`, which returns a `DataFrame` containing new data points, is the only method that needs to be implemented.

```python
class SampleDataAugmentor(AbstractAugmenter):
    def _get_augmented(
        self, data: pd.DataFrame, datamodule: DataModule
    ) -> pd.DataFrame:
        # Copy the last two rows; they are returned as the new data points.
        augmented = data.loc[data.index[-2:], :].copy()
        return augmented
```

## `AbstractProcessor`

It is the base class for data processors. The other classes described here implement its two methods, `_fit_transform` and `_transform`, and expose higher-level methods that simplify subclassing. Currently, only processors that remove data points are implemented directly under `AbstractProcessor`. Take `tabensemb.data.dataprocessors.FeatureValueSelector` as an example.

`FeatureValueSelector` selects data points that have a specific value (the argument `value`) of a certain feature (the argument `feature`). These two arguments are declared in `_required_kwargs`.

```python
import warnings  # used by `_transform` below


class FeatureValueSelector(AbstractProcessor):
    def _required_kwargs(self):
        return ["feature", "value"]
```

It directly removes unwanted data points from the `DataFrame`.

**Remark**: **DO NOT** reset the index of the returned `DataFrame`. The index is used to update the indices of the training/validation/testing sets.

```python
    def _fit_transform(self, data: pd.DataFrame, datamodule: DataModule):
        feature = self.kwargs["feature"]
        value = self.kwargs["value"]
        where_value = data.index[np.where(data[feature] == value)[0]]
        data = data.loc[where_value, :]
        self.feature, self.value = feature, value
        return data
```
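The remark about the index can be illustrated with plain `pandas`. The sketch below uses made-up data and a hypothetical split stored as index labels; how the package updates its splits internally is not shown here, so the intersection step is only an assumption about what "updating the indices" amounts to.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {"material": ["A", "B", "A", "B", "A"], "y": [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=[10, 11, 12, 13, 14],
)
# Hypothetical split, stored as index labels of the original frame.
train_indices = np.array([10, 11, 12])
val_indices = np.array([13, 14])

# Same selection idiom as in `_fit_transform` above.
where_value = data.index[np.where(data["material"] == "A")[0]]
selected = data.loc[where_value, :]

# The surviving labels still identify the original rows,
# so each split can be narrowed down to the rows that remain.
train_kept = np.intersect1d(train_indices, selected.index)
val_kept = np.intersect1d(val_indices, selected.index)

# After reset_index(drop=True) the labels become 0..n-1 and this link is lost.
reset = selected.reset_index(drop=True)
```

Here `selected` keeps the labels 10, 12, and 14, so the splits can still be matched against it, whereas `reset` is labeled 0, 1, and 2 and no longer corresponds to the original rows.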

`FeatureValueSelector` behaves differently when processing the dataset at hand (`datamodule.training==True`) and an upcoming dataset (`datamodule.training==False`). In the former case, data points can be removed from the validation or testing sets because only the specified value of the feature should remain in the entire dataset, and an exception is raised if that value is absent. In the latter case, data points are not removed because predictions are expected for every input; only a warning is issued if the value is absent.

```python
    def _transform(self, data: pd.DataFrame, datamodule: DataModule):
        if datamodule.training:
            if self.value not in list(data[self.feature]):
                raise Exception(
                    f"Value {self.value} not available for feature {self.feature}. Select from {data[self.feature].unique()}"
                )
            where_value = data.index[np.where(data[self.feature] == self.value)[0]]
            data = data.loc[where_value, :]
        else:
            if self.value not in list(data[self.feature]):
                warnings.warn(
                    f"Value {self.value} not available for feature {self.feature} selected by "
                    f"{self.__class__.__name__}."
                )
        return data
```

## `AbstractFeatureSelector`

`AbstractFeatureSelector` is used to select tabular features and thus reduce the dimension of the problem. The only required method is `_get_feature_names_out`, which returns a list of the selected features. Take `tabensemb.data.dataprocessors.VarianceFeatureSelector`, which uses `sklearn.feature_selection.VarianceThreshold`, as an example. It accepts an optional parameter `thres`. The input `DataFrame` contains the training and validation sets.

```python
from sklearn.feature_selection import VarianceThreshold

class VarianceFeatureSelector(AbstractFeatureSelector):
    def _defaults(self):
        return dict(thres=0.8)

    def _get_feature_names_out(self, data, datamodule):
        thres = self.kwargs["thres"]
        sel = VarianceThreshold(threshold=(thres * (1 - thres)))
        sel.fit(
            data[datamodule.all_feature_names],
            data[datamodule.label_name].values.flatten()
            if len(datamodule.label_name) == 1
            else data[datamodule.label_name].values,  # Ignored.
        )
        retain_features = list(sel.get_feature_names_out())
        return retain_features
```
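The expression `thres * (1 - thres)` is the variance of a Bernoulli variable with probability `thres`, so for boolean features the default `thres=0.8` corresponds to dropping features that take the same value in more than about 80% of the samples. The sketch below reproduces the criterion with `pandas` alone on made-up data, assuming the population variance (`ddof=0`) is what gets compared with the threshold; it is an illustration of the criterion, not the selector itself.

```python
import pandas as pd

thres = 0.8
threshold = thres * (1 - thres)  # Bernoulli variance p * (1 - p), about 0.16

data = pd.DataFrame(
    {
        "balanced": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],    # p = 0.5, variance 0.25
        "mostly_one": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # p = 0.9, variance 0.09
        "constant": [1] * 10,                          # variance 0
    }
)

# Population variance of each column, then keep columns above the threshold.
variances = data.var(ddof=0)
retained = list(variances[variances > threshold].index)
```

Only `balanced` exceeds the threshold in this toy example, so it is the only feature that would be retained.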

## `AbstractTransformer`

`AbstractTransformer` is used to modify the values of features. Its implementation is exactly the same as that of `AbstractProcessor`. It serves mostly as a classification criterion that tells the user what the processor does, as does `AbstractScaler`, which inherits from it. A typical example is `tabensemb.data.dataprocessors.CategoricalOrdinalEncoder`, which turns categorical features containing meaningful strings into numerical representations.

The method `DataModule.get_var_change` calculates what a specific value of a specific feature becomes after going through all the `AbstractTransformer`s in use. It is useful, for example, when zero values must be identified or kept unchanged after transformation.
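The idea of tracking a value through the transformations can be illustrated with `numpy` alone. The sketch below does not call or reproduce `DataModule.get_var_change`; the fitted transformation is assumed to be a plain standardization of one made-up column, and `standardize` is a hypothetical helper.

```python
import numpy as np

# Hypothetical training column and the statistics a scaler would store.
train_values = np.array([0.0, 2.0, 4.0, 6.0])
mean, std = train_values.mean(), train_values.std()


def standardize(x):
    # Apply the already fitted transformation to any value or array.
    return (x - mean) / std


# What the original value 0 becomes after the transformation.
zero_after = standardize(0.0)

# Rows that were zero before scaling can be located by comparing with it.
was_zero = np.isclose(standardize(train_values), zero_after)
```

After standardization the original zero is no longer zero, so any logic that relies on zeros (masking, padding, and so on) has to compare against the transformed value instead.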

## `AbstractScaler`

It inherits `AbstractTransformer`. The last data processor defined in a `DataModule` must be an `AbstractScaler`. As shown in "Customized model base", some representations of the dataset in the `DataModule` are stored in unscaled form, which means they have gone through all data processors except the last one. Call `datamodule.data_transform(df, scaler_only=True)` to scale them with the last data processor (the `AbstractScaler`). The implementation is similar to that of `AbstractProcessor`. Take `tabensemb.data.dataprocessors.StandardScaler`, which uses `sklearn.preprocessing.StandardScaler`, as an example:

```python
from sklearn.preprocessing import StandardScaler as skStandardScaler

class StandardScaler(AbstractScaler):
    def _fit_transform(self, data: pd.DataFrame, datamodule: DataModule):
        scaler = skStandardScaler()
        if len(datamodule.cont_feature_names) > 0:
            data[datamodule.cont_feature_names] = scaler.fit_transform(
                data[datamodule.cont_feature_names]
            ).astype(np.float64)

        self.transformer = scaler
        return data

    def _transform(self, data: pd.DataFrame, datamodule: DataModule):
        if len(datamodule.cont_feature_names) > 0:
            data[datamodule.cont_feature_names] = self.transformer.transform(
                data[datamodule.cont_feature_names]
            ).astype(np.float64)
        return data
```

**Remark**: It is highly recommended to use 64-bit float (double) precision to avoid inconsistent results between `_fit_transform` and `_transform`.
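One way such an inconsistency could arise is when one code path stores its output in single precision and the other in double precision. The mechanism is an assumption made for illustration, and the data below are randomly generated; the sketch only shows the size of the discrepancy for a standardized column.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=1000)
mean, std = raw.mean(), raw.std()

# The same standardization, stored in double and in single precision.
scaled64 = ((raw - mean) / std).astype(np.float64)
scaled32 = ((raw - mean) / std).astype(np.float32)

# Largest element-wise disagreement between the two results.
max_diff = np.max(np.abs(scaled64 - scaled32.astype(np.float64)))
```

The disagreement is small but nonzero, which is enough to break exact equality checks between data produced by the two paths.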

**Remark**: A dataset may contain no continuous features, no categorical features, or neither. Please confirm that your `AbstractProcessor`s support empty `datamodule.cont_feature_names` and/or `datamodule.cat_feature_names`.
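A length check, as used in `StandardScaler` above, is one simple way to support empty feature lists. The sketch below applies the same guard in a hypothetical helper, `standardize_columns`, which is not part of the package and works on made-up data.

```python
import pandas as pd


def standardize_columns(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper (not part of tabensemb) with an empty-list guard."""
    data = data.copy()
    if len(columns) > 0:  # nothing to scale when the list is empty
        sub = data[columns]
        data[columns] = (sub - sub.mean()) / sub.std(ddof=0)
    return data


# A dataset with no continuous features passes through unchanged.
only_categorical = pd.DataFrame({"color": [0, 1, 2]})
unchanged = standardize_columns(only_categorical, [])

# A dataset with one continuous feature is scaled as usual.
scaled = standardize_columns(pd.DataFrame({"x": [1.0, 3.0]}), ["x"])
```

Without the guard, the processor would have to rely on how the underlying library treats a zero-column selection, which is better avoided.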