This notebook shows how to use the pipeline for sklearn-like models for text classification.

The pipeline first fits a specified vectorizer, then trains a selected classifier on a selected dataset. Next, it computes text classification evaluation metrics and saves them to a JSON file at a specified path.

Apart from the "SklearnClassificationPipeline" class, all you need to import is a sklearn-like classifier of your choice and any sklearn vectorizer, such as CountVectorizer or TfidfVectorizer.
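
To make the steps described above concrete, here is a minimal sketch of the same flow written with plain scikit-learn. It is not the library's implementation: the toy texts, the LogisticRegression classifier, the metric names and the "toy_evaluation.json" file name are all assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Toy data standing in for a real dataset (illustrative only).
train_texts = ["good film", "great movie", "bad film", "awful movie"]
train_labels = [1, 1, 0, 0]
test_texts = ["great film", "awful film"]
test_labels = [1, 0]

# 1. Fit the vectorizer on the training texts, then transform both splits.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# 2. Train the classifier on the vectorized training data.
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

# 3. Compute evaluation metrics on the test split.
y_pred = classifier.predict(X_test)
metrics = {
    "accuracy": float(accuracy_score(test_labels, y_pred)),
    "f1_macro": float(f1_score(test_labels, y_pred, average="macro")),
}

# 4. Save the metrics as JSON (hypothetical file name, temporary directory).
output_file = Path(tempfile.mkdtemp()) / "toy_evaluation.json"
output_file.write_text(json.dumps(metrics, indent=2))
print(metrics)
```

The pipeline class wraps these four steps behind a single `run()` call, so the cells below only have to choose the dataset, the vectorizer and the classifier.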

In [1]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from xgboost.sklearn import XGBClassifier

from embeddings.pipeline.sklearn_classification import SklearnClassificationPipeline

Arguments you need to pass to the SklearnClassificationPipeline class:
- __dataset kwargs__: the name of the dataset and the names of the X and Y columns, respectively. You can unpack them from a dict, as in all the examples below, or pass them directly.
- __output_path__: the path where the file with evaluation metrics should be saved.

The remaining arguments are optional. Note that the arguments __"embedding_kwargs"__ and __"classifier_kwargs"__ are passed to the class as dicts, __without "**"__.
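
The difference between unpacking a dict with "**" and passing it as a single argument can be sketched with a small stand-in class. `ToyPipeline` is hypothetical and only mirrors the argument names used in this notebook; it is not the real SklearnClassificationPipeline.

```python
from typing import Any, Dict, Optional


class ToyPipeline:
    """Hypothetical stand-in that only stores its arguments."""

    def __init__(
        self,
        dataset_name: str,
        input_column_name: str,
        target_column_name: str,
        output_path: str,
        classifier_kwargs: Optional[Dict[str, Any]] = None,
        embedding_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.dataset_name = dataset_name
        self.input_column_name = input_column_name
        self.target_column_name = target_column_name
        self.output_path = output_path
        # The dicts are kept whole so they can be forwarded later.
        self.classifier_kwargs = classifier_kwargs or {}
        self.embedding_kwargs = embedding_kwargs or {}


dataset_kwargs = {
    "dataset_name": "clarin-pl/polemo2-official",
    "input_column_name": "text",
    "target_column_name": "target",
}
classifier_kwargs = {"n_estimators": 100}

# Dataset arguments are unpacked with **; classifier_kwargs is passed as a dict.
pipeline = ToyPipeline(
    **dataset_kwargs,
    output_path=".",
    classifier_kwargs=classifier_kwargs,
)
print(pipeline.classifier_kwargs)

# Unpacking classifier_kwargs would send "n_estimators" to the constructor
# itself, which does not accept it.
unpack_error = None
try:
    ToyPipeline(**dataset_kwargs, output_path=".", **classifier_kwargs)
except TypeError as error:
    unpack_error = error
    print(f"Unpacking fails: {error}")
```

Keeping the two dicts whole lets the pipeline forward each one to the right component, presumably the vectorizer and the classifier respectively.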

In [2]:
dataset_kwargs = {
    "dataset_name": "clarin-pl/polemo2-official",
    "input_column_name": "text",
    "target_column_name": "target"
}

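# Keyword arguments forwarded to the vectorizer (TfidfVectorizer below).
embeddings_kwargs = {
    "max_features": 100,  # keep only the 100 most frequent terms
    "max_df": 10  # an int is an absolute count: ignore terms found in more than 10 documents
}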

classifier_kwargs = {
    "n_estimators": 100
}

evaluation_filename = "adaboost_tfidf_evaluation.json"  # default name: evaluation_filename.json
output_path = "."

adaboost_tfidf_pipeline = SklearnClassificationPipeline(
    **dataset_kwargs,
    output_path=output_path,
    classifier=AdaBoostClassifier,
    vectorizer=TfidfVectorizer,
    evaluation_filename=evaluation_filename,
    classifier_kwargs=classifier_kwargs,
    embedding_kwargs=embeddings_kwargs
)
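
The two vectorizer options used above behave as follows in scikit-learn's TfidfVectorizer: an integer `max_df` is an absolute document count, a float `max_df` is a proportion of documents, and `max_features` keeps the most frequent terms. The sketch below shows this on a toy corpus that is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only): "the" is in 3 documents, "sat" in 2.
corpus = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
]

# Integer max_df: drop terms that occur in more than 2 documents.
by_count = TfidfVectorizer(max_df=2)
by_count.fit(corpus)
print(sorted(by_count.vocabulary_))

# Float max_df: drop terms that occur in more than 50% of the documents.
by_fraction = TfidfVectorizer(max_df=0.5)
by_fraction.fit(corpus)
print(sorted(by_fraction.vocabulary_))

# max_features: keep only the 2 terms with the highest corpus frequency.
top_two = TfidfVectorizer(max_features=2)
top_two.fit(corpus)
print(sorted(top_two.vocabulary_))
```

With `"max_df": 10` as in the cell above, any term that appears in more than 10 documents is discarded before `max_features` is applied, so it is worth checking that this is the intended setting for a dataset of this size.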

In [3]:
print(adaboost_tfidf_pipeline.run())

No config specified, defaulting to: pol_emo2/all_text
Reusing dataset pol_emo2 (/Users/mariuszkossakowski/.cache/huggingface/datasets/clarin-pl___pol_emo2/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)


  0%|          | 0/3 [00:00<?, ?it/s]

{'accuracy': {'accuracy': 0.42073170731707316}, 'f1__average_macro': {'f1': 0.22407483222530247}, 'recall__average_macro': {'recall': 0.28367992757721533}, 'precision__average_macro': {'precision': 0.4174189364461738}, 'data': {'y_pred': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 3, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1

In [4]:
svm_kwargs = {
    "kernel": "linear",
    "C": 0.6
}

evaluation_filename_svm_tfidf = "svm_tfidf_evaluation.json"

svm_tfidf_pipeline = SklearnClassificationPipeline(
    **dataset_kwargs,
    output_path=output_path,
    classifier=SVC,
    vectorizer=TfidfVectorizer,
    evaluation_filename=evaluation_filename_svm_tfidf,
    classifier_kwargs=svm_kwargs,
    embedding_kwargs=embeddings_kwargs
)

In [5]:
print(svm_tfidf_pipeline.run())

No config specified, defaulting to: pol_emo2/all_text
Reusing dataset pol_emo2 (/Users/mariuszkossakowski/.cache/huggingface/datasets/clarin-pl___pol_emo2/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)


  0%|          | 0/3 [00:00<?, ?it/s]

{'accuracy': {'accuracy': 0.43414634146341463}, 'f1__average_macro': {'f1': 0.22266715757115924}, 'recall__average_macro': {'recall': 0.28829274638237506}, 'precision__average_macro': {'precision': 0.5283213327686008}, 'data': {'y_pred': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 3, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1
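
In the visible part of the AdaBoost and SVM outputs above, almost every prediction is class 1, and accuracy (about 0.42 to 0.43) is roughly twice the macro F1 (about 0.22). The sketch below reproduces that pattern on made-up labels to show why macro-averaged F1 drops when a classifier collapses to a single class. The label distribution is an assumption for illustration and is not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (illustrative only): four classes, class 1 is the most frequent.
y_true = [0, 0, 1, 1, 1, 1, 2, 2, 3, 3]

# A degenerate classifier that predicts class 1 for every sample.
y_pred = [1] * len(y_true)

# Accuracy only counts correct predictions, so it equals the share of class 1.
accuracy = accuracy_score(y_true, y_pred)

# Macro F1 averages per-class F1 with equal weight, so the three classes
# that are never predicted each contribute 0 to the average.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

A large gap between accuracy and macro F1 is therefore a hint that the model mostly predicts one class.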

In [6]:
embeddings_kwargs = {
    "max_features": 200
}

xgb_kwargs = {
    "n_estimators": 200,
    "max_depth": 7
}

evaluation_filename_xgb_tfidf = "xgb_tfidf_evaluation.json"

xgb_tfidf_pipeline = SklearnClassificationPipeline(
    **dataset_kwargs,
    output_path=output_path,
    classifier=XGBClassifier,
    vectorizer=TfidfVectorizer,
    evaluation_filename=evaluation_filename_xgb_tfidf,
    classifier_kwargs=xgb_kwargs,
    embedding_kwargs=embeddings_kwargs
)

In [7]:
print(xgb_tfidf_pipeline.run())

No config specified, defaulting to: pol_emo2/all_text
Reusing dataset pol_emo2 (/Users/mariuszkossakowski/.cache/huggingface/datasets/clarin-pl___pol_emo2/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70)


  0%|          | 0/3 [00:00<?, ?it/s]

{'accuracy': {'accuracy': 0.6792682926829269}, 'f1__average_macro': {'f1': 0.6650572917071165}, 'recall__average_macro': {'recall': 0.6657629993435544}, 'precision__average_macro': {'precision': 0.6650214067253306}, 'data': {'y_pred': array([1, 3, 2, 2, 2, 0, 0, 0, 1, 3, 1, 0, 2, 2, 2, 1, 1, 1, 1, 3, 3, 2,
       2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 3, 3, 2, 2, 1, 1, 2, 3, 1, 2,
       0, 1, 1, 0, 2, 1, 2, 0, 1, 3, 1, 2, 0, 1, 2, 2, 2, 2, 3, 1, 0, 3,
       0, 1, 0, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 0, 3, 1, 1, 1, 1, 2, 1, 3,
       3, 1, 0, 3, 2, 3, 1, 1, 3, 3, 1, 2, 2, 0, 1, 1, 1, 0, 2, 2, 2, 1,
       3, 1, 1, 1, 1, 0, 2, 1, 0, 3, 0, 2, 0, 1, 1, 1, 1, 1, 2, 0, 2, 2,
       1, 1, 0, 2, 1, 2, 1, 0, 2, 1, 3, 1, 1, 2, 1, 0, 2, 1, 1, 0, 1, 1,
       1, 1, 0, 2, 2, 2, 0, 3, 1, 1, 1, 2, 2, 0, 1, 2, 3, 1, 3, 1, 1, 1,
       3, 0, 2, 1, 1, 1, 0, 2, 2, 1, 1, 1, 1, 3, 1, 1, 2, 2, 1, 1, 2, 1,
       0, 1, 2, 1, 1, 0, 1, 1, 2, 1, 0, 2, 0, 0, 1, 1, 0, 1, 1, 2, 1, 1,
       1, 1, 1, 2, 1, 2, 1, 2, 1, 3
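
The metrics printed by the three runs can be put side by side with a few lines of Python. The numbers below are copied from the outputs above and rounded to four decimals; the flat dict layout and the labels are assumptions made for this sketch. Note that the XGBoost run also used different vectorizer settings (`max_features` of 200 and no `max_df`), so this is not a like-for-like comparison of classifiers.

```python
# Metric values copied from the cell outputs above, rounded to 4 decimals.
results = {
    "AdaBoost + TF-IDF": {
        "accuracy": 0.4207, "f1_macro": 0.2241,
        "recall_macro": 0.2837, "precision_macro": 0.4174,
    },
    "SVC + TF-IDF": {
        "accuracy": 0.4341, "f1_macro": 0.2227,
        "recall_macro": 0.2883, "precision_macro": 0.5283,
    },
    "XGBoost + TF-IDF": {
        "accuracy": 0.6793, "f1_macro": 0.6651,
        "recall_macro": 0.6658, "precision_macro": 0.6650,
    },
}

metric_names = ["accuracy", "f1_macro", "recall_macro", "precision_macro"]

# Rank the runs by macro F1, best first.
ranked = sorted(results, key=lambda name: results[name]["f1_macro"], reverse=True)
best = ranked[0]

# Print a simple fixed-width table.
print(f"{'model':<20}" + "".join(f"{name:>17}" for name in metric_names))
for model in ranked:
    row = "".join(f"{results[model][name]:>17.4f}" for name in metric_names)
    print(f"{model:<20}{row}")
print(f"Best macro F1: {best}")
```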