# Example usage of the `torchTextClassifiers` library

*Warning*

*The `torchTextClassifiers` library is still under active development. Check
<https://github.com/inseefrlab/torchTextClassifiers> regularly for the
latest information.*

To install the package, run the following snippet:

In [None]:
# Install the package from the local checkout (parent directory)
%uv pip install --system ..
%uv pip install --system captum unidecode nltk scikit-learn


In [None]:
from torchTextClassifiers import ModelConfig, TrainingConfig, torchTextClassifiers
from torchTextClassifiers.dataset import TextClassificationDataset
from torchTextClassifiers.model import TextClassificationModel, TextClassificationModule
from torchTextClassifiers.model.components import (
    AttentionConfig,
    CategoricalVariableNet,
    ClassificationHead,
    TextEmbedder,
    TextEmbedderConfig,
)
from torchTextClassifiers.tokenizers import HuggingFaceTokenizer, WordPieceTokenizer
from torchTextClassifiers.utilities.plot_explainability import (
    map_attributions_to_char,
    map_attributions_to_word,
    plot_attributions_at_char,
    plot_attributions_at_word,
)

%load_ext autoreload
%autoreload 2

# Load and preprocess data

In this guide, we illustrate the main features of the package
using the following `DataFrame`:

In [None]:
import pandas as pd

df = pd.read_parquet("https://minio.lab.sspcloud.fr/projet-ape/data/08112022_27102024/naf2008/split/df_train.parquet")
df = df.sample(100000)

In [None]:
df

Our goal is to build a multiclass classifier that predicts the
`apet_finale` label, using `libelle` as the text feature.

## Enriching our test dataset

Unlike `fastText`, this package supports several feature columns of
different types (a string for the text column and additional variables
in numeric form, for example). To illustrate this, we enrich the example
dataset as follows:

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def categorize_surface(
    df: pd.DataFrame, surface_feature_name: str, like_sirene_3: bool = True
) -> pd.DataFrame:
    """
    Categorize the surface of the activity.

    Args:
        df (pd.DataFrame): DataFrame to categorize.
        surface_feature_name (str): Name of the surface feature.
        like_sirene_3 (bool): If True, categorize like Sirene 3.

    Returns:
        pd.DataFrame: Copy of `df` whose surface column holds the integer
            surface category (0 for missing values).
    """
    # Check that the surface feature exists before using it
    if surface_feature_name not in df.columns:
        raise ValueError(f"Surface feature {surface_feature_name} not found in DataFrame.")
    df_copy = df.copy()
    # Missing values arrive as the string "nan": restore real NaNs, then cast to float
    df_copy[surface_feature_name] = df_copy[surface_feature_name].replace("nan", np.nan)
    df_copy[surface_feature_name] = df_copy[surface_feature_name].astype(float)
    # Check surface feature is a float variable
    if not (pd.api.types.is_float_dtype(df_copy[surface_feature_name])):
        raise ValueError(f"Surface feature {surface_feature_name} must be a float variable.")

    if like_sirene_3:
        # Categorize the surface
        df_copy["surf_cat"] = pd.cut(
            df_copy[surface_feature_name],
            bins=[0, 120, 400, 2500, np.inf],
            labels=["1", "2", "3", "4"],
        ).astype(str)
    else:
        # Log transform the surface
        # Log transform the surface (use the cleaned float column, not the raw input)
        df_copy["surf_log"] = np.log(df_copy[surface_feature_name])

        # Categorize the surface
        df_copy["surf_cat"] = pd.cut(
            df_copy.surf_log,
            bins=[0, 3, 4, 5, 12],
            labels=["1", "2", "3", "4"],
        ).astype(str)

    df_copy[surface_feature_name] = df_copy["surf_cat"].replace("nan", "0")
    df_copy[surface_feature_name] = df_copy[surface_feature_name].astype(int)
    df_copy = df_copy.drop(columns=["surf_log", "surf_cat"], errors="ignore")
    return df_copy


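def clean_and_tokenize_df(
    df,
    categorical_features=("CJ", "NAT", "TYP", "CRT"),
    text_feature="libelle_processed",
    label_col="apet_finale",
):
    """Encode the categorical features and bin the surface; return (df, encoders)."""
    # fillna returns a new DataFrame, so the caller's object is left untouched
    df = df.fillna("nan")
    les = []
    for col in categorical_features:
        # One encoder per column, kept so that codes can be mapped back later
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        les.append(le)

    # "SRF" is binned into integer categories (0 = missing or out of range)
    df = categorize_surface(df, "SRF", like_sirene_3=True)
    df = df[[text_feature, "CJ", "NAT", "TYP", "SRF", "CRT", label_col]]

    return df, les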
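
To make the binning step of `categorize_surface` more concrete, here is a small standalone sketch on invented surface values. It reuses the bin edges of the `like_sirene_3` branch, but takes a shortcut that is not the function's own code path: `labels=False` returns the bin index directly, so no string conversion is needed.

```python
import numpy as np
import pandas as pd

# Invented surface values, including a zero and a missing one
surfaces = pd.Series([50.0, 120.0, 300.0, 3000.0, 0.0, np.nan])

# Same bin edges as the like_sirene_3 branch; intervals are right-closed:
# (0, 120], (120, 400], (400, 2500], (2500, inf]
bin_index = pd.cut(surfaces, bins=[0, 120, 400, 2500, np.inf], labels=False)

# Shift bins to 1..4 and reserve 0 for values outside every bin (NaN or <= 0)
categories = (bin_index + 1).fillna(0).astype(int)
print(categories.tolist())  # [1, 1, 2, 4, 0, 0]
```

Note that a surface of exactly 0 falls outside the first interval and therefore ends up in category 0, like a missing value.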

In [None]:
categorical_features = [ "CJ", "NAT", "TYP", "SRF", "CRT"]
text_feature = "libelle"
y = "apet_finale"

For now, the model requires the label (variable `y`) to be numeric.
If the label is a text variable, we recommend using scikit-learn’s
[LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
to convert it into integers. The fitted encoder can then be used to
recover the original labels after running predictions.

In [None]:
encoder = LabelEncoder()
df["apet_finale"] = encoder.fit_transform(df["apet_finale"])
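
As a minimal sketch of the round trip described above, the snippet below encodes a few labels and then maps integer predictions back to the original strings with `inverse_transform`. The label codes and the `predicted_ids` array are invented for illustration; they do not come from the dataset or from a trained model.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Invented label codes, for illustration only
raw_labels = np.array(["4711D", "1071C", "4711D", "9602A"])

toy_encoder = LabelEncoder()
# Classes are sorted before being numbered: 1071C -> 0, 4711D -> 1, 9602A -> 2
encoded = toy_encoder.fit_transform(raw_labels)

# Pretend these integers came out of a classifier
predicted_ids = np.array([2, 0])
decoded = toy_encoder.inverse_transform(predicted_ids)

print(encoded.tolist())  # [1, 0, 1, 2]
print(decoded.tolist())  # ['9602A', '1071C']
```

Keeping the fitted encoder around (or persisting it alongside the model) is what makes this reverse mapping possible at prediction time.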

The feature array `X` requires a specific column layout, which we build
from the output of `clean_and_tokenize_df`:

-   First column contains the processed text (str)
-   Next ones contain the “encoded” categorical (discrete) variables in
    int format

In [None]:
df, _ = clean_and_tokenize_df(df, text_feature="libelle")
X = df[["libelle", "CJ", "NAT", "TYP", "CRT", "SRF"]].values
y = df["apet_finale"].values

In [None]:
X.shape, y.shape
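
The column layout described above can be illustrated on a toy array. This is only a sketch with invented rows and a hypothetical `X_toy` name; it shows how the text column and the integer-encoded categorical columns can be separated again, and how a per-column vocabulary size follows from the largest code.

```python
import numpy as np

# Invented rows: text first, then integer-encoded categorical variables
X_toy = np.array(
    [
        ["boulangerie patisserie", 3, 0, 1],
        ["coiffure a domicile", 1, 2, 0],
    ],
    dtype=object,
)

# Column 0 holds the text, as a plain list of str
texts = X_toy[:, 0].tolist()

# Remaining columns hold the categorical codes: (n_samples, n_categorical)
categoricals = X_toy[:, 1:].astype(int)

# One vocabulary size per categorical column: largest code + 1
vocab_sizes = (categoricals.max(axis=0) + 1).tolist()
print(vocab_sizes)  # [4, 3, 2]
```

Because text and integers share one array, NumPy stores it with `dtype=object`; casting the categorical slice to `int` is what makes it usable as embedding indices.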

## Splitting in train-test sets

As usual in machine learning, you need to split your data into
training and test/validation samples to obtain robust performance
statistics.

This step is the responsibility of the package’s users. Please make sure that `np.max(y_train) == len(np.unique(y_train)) - 1` (i.e. your labels are encoded as consecutive integers starting from 0), and that every possible label appears at least once in the training set.

We provide the function `stratified_train_test_split` to meet these requirements. Note that the cell below uses scikit-learn's plain `train_test_split`, which does not guarantee them.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
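
The two label requirements stated above can be verified with a quick check before training. The helper below is a hypothetical sketch written for this guide (`check_labels` is not part of `torchTextClassifiers`), and it only relies on NumPy.

```python
import numpy as np


def check_labels(y_train: np.ndarray, y_all: np.ndarray) -> None:
    """Raise ValueError if labels are not 0..K-1 or a class is absent from y_train."""
    all_classes = np.unique(y_all)
    train_classes = np.unique(y_train)

    # Consecutive encoding starting at 0: unique values must be exactly 0, 1, ..., K-1
    if not np.array_equal(all_classes, np.arange(len(all_classes))):
        raise ValueError("Labels must be consecutive integers starting from 0.")

    # Every class must be seen at least once during training
    if not np.array_equal(train_classes, all_classes):
        missing = np.setdiff1d(all_classes, train_classes)
        raise ValueError(f"Classes missing from the training set: {missing.tolist()}")


# Toy labels: the full set has classes 0, 1 and 2, and so does the training subset
y_all_toy = np.array([0, 1, 2, 2, 1, 0])
check_labels(y_train=np.array([0, 1, 2, 2]), y_all=y_all_toy)  # passes silently
```

With the notebook's variables, the call would be `check_labels(y_train, y)`. Passing `stratify=y` to scikit-learn's `train_test_split` preserves class proportions and may help, although it raises an error when a class has a single sample.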

## Tokenizer

In [None]:
text = X_train[:, 0].tolist()

In [None]:
tokenizer = HuggingFaceTokenizer.load_from_pretrained("google-bert/bert-base-uncased")
tokenizer.tokenize(text[0]).input_ids.shape

In [None]:
tokenizer = WordPieceTokenizer(5000, output_dim=125)
tokenizer.train(text)
tokenizer.tokenize(text[:256]).input_ids.shape

## Consider each component independently

In [None]:
vocab_size = tokenizer.vocab_size
padding_idx = tokenizer.padding_idx

embedding_dim = 96
n_layers = 1
n_head = 4
n_kv_head = n_head
sequence_len = tokenizer.output_dim

In [None]:
attention_config = AttentionConfig(
    n_layers=n_layers,
    n_head=n_head,
    n_kv_head=n_kv_head,
    sequence_len=sequence_len,
)

text_embedder_config = TextEmbedderConfig(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    padding_idx=padding_idx,
    attention_config=attention_config,
)


text_embedder = TextEmbedder(
    text_embedder_config=text_embedder_config,
)
text_embedder.init_weights()

In [None]:
X[:, 1:].max(axis=0).tolist()

In [None]:
categorical_vocab_sizes = (X[:, 1:].max(axis=0) + 1).tolist()
categorical_embedding_dims = 25

categorical_var_net = CategoricalVariableNet(
    categorical_vocabulary_sizes=categorical_vocab_sizes,
    categorical_embedding_dims=categorical_embedding_dims,
)

In [None]:
num_classes = int(y.max() + 1)
expected_input_dim = embedding_dim + categorical_var_net.output_dim
classification_head = ClassificationHead(
    input_dim=expected_input_dim,
    num_classes=num_classes,
)

In [None]:
model = TextClassificationModel(
    text_embedder=text_embedder,
    categorical_variable_net=categorical_var_net,
    classification_head=classification_head,
)
model

In [None]:
import torch

module = TextClassificationModule(
    model=model,
    loss=torch.nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam,
    optimizer_params={"lr": 1e-3},
    scheduler=None,
    scheduler_params=None,
    scheduler_interval="epoch",
)
module

## Using the wrapper

In [None]:
model_config = ModelConfig(
    embedding_dim=embedding_dim,
    categorical_vocabulary_sizes=categorical_vocab_sizes,
    categorical_embedding_dims=categorical_embedding_dims,
    num_classes=num_classes,
    attention_config=attention_config,
)

training_config = TrainingConfig(
    lr=1e-3,
    batch_size=256,
    num_epochs=10,
)

ttc = torchTextClassifiers(
    tokenizer=tokenizer,
    model_config=model_config,
)

In [None]:
X_train[1, :]

In [None]:
tokenizer.tokenize(X_train[:256, 0].tolist()).input_ids.shape

In [None]:
ttc.train(
    X_train=X_train,
    y_train=y_train,
    X_val=X_test,
    y_val=y_test,
    training_config=training_config,
)