# XLNET


### Introduction
In this notebook, our objective is to train an XLNET model for a challenging multilabel classification task. This task involves categorizing textual arguments into one or more of 20 distinct categories, each representing a fundamental human value.

The categories are as follows:
- Self-direction: thought
- Self-direction: action
- Stimulation
- Hedonism
- Achievement
- Power: dominance
- Power: resources
- Face
- Security: personal
- Security: societal
- Tradition
- Conformity: rules
- Conformity: interpersonal
- Humility
- Benevolence: caring
- Benevolence: dependability
- Universalism: concern
- Universalism: nature
- Universalism: tolerance
- Universalism: objectivity





### Flow of the notebook


The notebook is organized into the following sections:

1. Installing the required libraries
2. Importing the libraries
3. Loading the datasets
4. Defining and fine-tuning the model
  - XLNET Base
  - XLNET Large
5. Evaluation

## Installing the Required Libraries
The following libraries are not pre-installed in the Google Colab runtime, so we install them first:

- transformers
- SentencePiece
- wandb
- simpletransformers

In [None]:
!pip install transformers

In [None]:
!pip install SentencePiece

In [None]:
!pip install wandb --upgrade

In [None]:
!pip install --upgrade wandb simpletransformers

In [None]:
!pip install simpletransformers

## Importing the Required Libraries

In [None]:
from transformers import AutoTokenizer

import logging
import wandb

import torch

import numpy as np

from simpletransformers.classification import MultiLabelClassificationModel, ClassificationArgs

import pandas as pd
from sklearn.metrics import f1_score, classification_report

## Loading the Data
We use the data published on [Zenodo](https://zenodo.org/record/7550385#.Y8wMquzMK3I) for the [Human Value Detection 2023 competition](https://touche.webis.de/semeval23/touche23-web/index.html). We work with the following files: `arguments-training.tsv`, `arguments-validation.tsv`, `arguments-test.tsv`, `labels-training.tsv`, `labels-validation.tsv`, and `labels-test.tsv`.


In [None]:
# defining the label titles in our datasets
label_cols = ['Self-direction: thought', 'Self-direction: action', 'Stimulation',
       'Hedonism', 'Achievement', 'Power: dominance', 'Power: resources',
       'Face', 'Security: personal', 'Security: societal', 'Tradition',
       'Conformity: rules', 'Conformity: interpersonal', 'Humility',
       'Benevolence: caring', 'Benevolence: dependability',
       'Universalism: concern', 'Universalism: nature',
       'Universalism: tolerance', 'Universalism: objectivity']

In [None]:
# loading the data
train_args = pd.read_csv("arguments-training.tsv",delimiter='\t')
train_labels = pd.read_csv("labels-training.tsv",delimiter='\t')

val_labels = pd.read_csv("labels-validation.tsv",delimiter='\t')
val_args = pd.read_csv("arguments-validation.tsv",delimiter='\t')

test_labels = pd.read_csv("labels-test.tsv",delimiter='\t')
test_args = pd.read_csv("arguments-test.tsv",delimiter='\t')

Simple Transformers' multilabel models expect a DataFrame with a 'text' column containing the input text and a 'labels' column containing the list of binary labels for each row. We build the 'text' column by concatenating each argument's conclusion, stance, and premise.
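As an illustration of that target format, here is a minimal sketch on made-up data. The argument IDs, sentences, and the two label columns are invented for the example; the real frames have 20 label columns.

```python
import pandas as pd

# Toy stand-ins for the arguments and labels files (invented rows)
toy_args = pd.DataFrame({
    "Argument ID": ["X001", "X002"],
    "Conclusion": ["We should ban fast food", "We should adopt a four-day week"],
    "Stance": ["in favor of", "against"],
    "Premise": ["fast food harms public health", "shorter weeks reduce output"],
})
toy_labels = pd.DataFrame({
    "Argument ID": ["X001", "X002"],
    "Security: personal": [1, 0],
    "Achievement": [0, 1],
})

# One 'text' string and one list of 0/1 labels per row
toy = pd.DataFrame({
    "text": toy_args["Conclusion"] + " " + toy_args["Stance"] + " " + toy_args["Premise"],
    "labels": toy_labels.drop(columns="Argument ID").values.tolist(),
})

print(toy)
```

Each row pairs one string with one fixed-length list of 0/1 values, in the same order as the label columns.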

In [None]:
train_args['text'] = train_args['Conclusion'] + " " + train_args['Stance'] + " " + train_args['Premise']
train_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

val_args['text'] = val_args['Conclusion'] + " " + val_args['Stance'] + " " + val_args['Premise']
val_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

test_args['text'] = test_args['Conclusion'] + " " + test_args['Stance'] + " " + test_args['Premise']
test_args.drop(labels=['Conclusion', 'Stance', 'Premise'], axis=1, inplace=True)

print(val_labels)

     Argument ID  Self-direction: thought  Self-direction: action  \
0         A01001                        0                       0   
1         A01012                        0                       0   
2         A02001                        0                       0   
3         A02002                        0                       1   
4         A02009                        0                       0   
...          ...                      ...                     ...   
1891      E08014                        1                       0   
1892      E08021                        1                       0   
1893      E08022                        0                       1   
1894      E08024                        0                       1   
1895      E08025                        0                       1   

      Stimulation  Hedonism  Achievement  Power: dominance  Power: resources  \
0               0         0            0                 0                 0   
1          

In [None]:
def prepare_labels(data):
  df = pd.DataFrame(data)

  final_list = df.copy()  # Make a copy of the DataFrame
  final_list['labels'] = df.iloc[:, 1:].apply(lambda row: tuple(row), axis=1)

  return final_list

In [None]:
# Adding the 'labels' column
train_labels_compressed = prepare_labels(train_labels)
val_labels_compressed = prepare_labels(val_labels)
test_labels_compressed = prepare_labels(test_labels)

print(train_labels_compressed)

     Argument ID  Self-direction: thought  Self-direction: action  \
0         A01002                        0                       0   
1         A01005                        0                       0   
2         A01006                        0                       0   
3         A01007                        0                       0   
4         A01008                        0                       0   
...          ...                      ...                     ...   
5388      E08016                        0                       0   
5389      E08017                        0                       0   
5390      E08018                        0                       0   
5391      E08019                        0                       0   
5392      E08020                        0                       1   

      Stimulation  Hedonism  Achievement  Power: dominance  Power: resources  \
0               0         0            0                 0                 0   
1          

In [None]:
# Merging the content of two files into one dataset
merged_train = pd.merge(train_args, train_labels_compressed, on='Argument ID', how='inner')
merged_val = pd.merge(val_args, val_labels_compressed, on='Argument ID', how='inner')
merged_test = pd.merge(test_args, test_labels_compressed, on='Argument ID', how='inner')

print(merged_train)

     Argument ID                                               text  \
0         A01002  We should ban human cloning in favor of we sho...   
1         A01005  We should ban fast food in favor of fast food ...   
2         A01006  We should end the use of economic sanctions ag...   
3         A01007  We should abolish capital punishment against c...   
4         A01008  We should ban factory farming against factory ...   
...          ...                                                ...   
5388      E08016  The EU should integrate the armed forces of it...   
5389      E08017  Food whose production has been subsidized with...   
5390      E08018  Food whose production has been subsidized with...   
5391      E08019  Food whose production has been subsidized with...   
5392      E08020  The EU should integrate the armed forces of it...   

      Self-direction: thought  Self-direction: action  Stimulation  Hedonism  \
0                           0                       0            0 

The following function selects the decision threshold after training. The model outputs one probability per label, and these probabilities must be converted to binary values (0 or 1) by comparing them against a threshold. The function tries thresholds from 0.10 to 0.85 in steps of 0.05 and keeps the one with the highest macro F1 score.
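The thresholding step can be sketched without any dependencies. The helpers `binarize` and `macro_f1` and the toy data are assumptions made for this illustration; the notebook itself relies on `np.where` and scikit-learn's `f1_score`. Scoring a label with no positives as 1.0 is a choice made for this sketch.

```python
def binarize(probabilities, threshold):
    """Map each probability to 1 if it reaches the threshold, else 0."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    num_labels = len(y_true[0])
    scores = []
    for j in range(num_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # A label with no true or predicted positives is scored 1.0 in this sketch
        scores.append(2 * tp / denom if denom else 1.0)
    return sum(scores) / num_labels


# Toy data: 3 examples, 2 labels (invented values)
y_true = [[1, 0], [0, 1], [1, 1]]
probs = [[0.9, 0.2], [0.3, 0.6], [0.4, 0.1]]

thresholds = [0.1, 0.3, 0.5]
scores = {t: macro_f1(y_true, binarize(probs, t)) for t in thresholds}
best = max(scores, key=scores.get)

for t, score in scores.items():
    print("threshold {:.2f} -> macro F1 {:.3f}".format(t, score))
print("best threshold:", best)
```

Lowering the threshold turns more probabilities into positive predictions, which trades precision for recall.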

In [None]:
# Find the threshold that maximizes the macro F1 score
def get_threshold(labels, model_output):
  results = {}
  labels_compressed = labels.drop("Argument ID", axis=1)

  df_prediction = labels_compressed.copy()
  for tr in np.arange(0.1, 0.9, 0.05):
      tr = round(tr, 2)
      for i, label in enumerate(label_cols):
          prediction = np.where(model_output[:,i] >= tr, 1, 0)
          df_prediction[label] = prediction

      y_pred = df_prediction.values.tolist()
      y_test = labels_compressed.values.tolist()
      f1 = f1_score(y_test, y_pred, average = "macro", zero_division = 1)
      results[tr] = f1

  for k,v in results.items():
      print("THRESHOLD: {:.2f} ".format(k), "F1 score: {:.3f}".format(v))

  THRESHOLD = max(results, key = results.get)

  print("\nBest threshold obtained:", THRESHOLD , "having F1 score of: {:.2f}".format(max(results.values())))

  return THRESHOLD

The function below prints the precision, recall, and F1 score of each class for a given threshold.

In [None]:
# Obtaining the final report of our scores
def get_report(labels, model_output, THRESHOLD):
  labels_compressed = labels.drop("Argument ID", axis=1)
  df_prediction = labels_compressed.copy()

  for i, label in enumerate(label_cols):
    prediction = np.where(model_output[:,i] >= THRESHOLD, 1, 0)
    df_prediction[label] = prediction


  y_pred = df_prediction.values.tolist()
  y_test = labels_compressed.values.tolist()

  print(classification_report(y_test, y_pred, target_names = label_cols))

## Defining the Model
In this section, we introduce our XLNET model and fine-tune it under several hyperparameter configurations. To run these experiments systematically, we use Weights & Biases (W&B) Sweeps.

We work with two versions of the XLNET model, "xlnet-base-cased" and "xlnet-large-cased". Because of limited CUDA resources, we run the hyperparameter search only on "xlnet-base-cased". Once we have identified the best hyperparameters for "xlnet-base-cased", we train both models with them, which lets us compare their performance on our datasets under identical settings.

### Defining Hyperparameters


We run a grid search over the number of training epochs and the training batch size.

In [None]:
sweep_config = {
    "method": "grid",  # grid, random
    "parameters": {
        "num_train_epochs": {"values": [3, 5, 10]},
        "train_batch_size": {"values": [16, 32]}
    },
}
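To make explicit what a grid search over these values enumerates, the sketch below expands the same parameter lists into individual configurations. `expand_grid` is a hypothetical helper written for this illustration; W&B performs the enumeration itself and does not require it.

```python
from itertools import product

# Same values as in sweep_config["parameters"]
sweep_parameters = {
    "num_train_epochs": [3, 5, 10],
    "train_batch_size": [16, 32],
}


def expand_grid(parameters):
    """Return every combination of the listed parameter values."""
    names = list(parameters)
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


grid = expand_grid(sweep_parameters)
for config in grid:
    print(config)
print(len(grid), "configurations")
```

Three epoch settings times two batch sizes gives six configurations, which matches the `count=6` passed to `wandb.agent` later in the notebook.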

In [None]:
# create a sweep in W&B
sweep_id = wandb.sweep(sweep_config, project="Human Value Detection with XLNET")

In [None]:
# Logging setup: INFO by default, warnings only for the transformers logger
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [None]:
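# Fixed arguments shared by every sweep run
args = {
      'manual_seed': 42,
      'overwrite_output_dir': True,
      'gradient_accumulation_steps': 8,
      # NOTE: Simple Transformers documents the learning-rate option as
      # 'learning_rate'; verify that 'lr' is actually picked up.
      "lr": 2e-4,
      "optimizer": 'AdamW'
      }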
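With gradient accumulation, the number of examples that contribute to one optimizer update is the batch size multiplied by the number of accumulation steps. The sketch below works through that arithmetic for the two batch sizes in the sweep. The helper names are hypothetical, and the training-set size of 5393 is read off the printed training frame (row indices 0 to 5392).

```python
import math


def effective_batch_size(train_batch_size, gradient_accumulation_steps):
    """Examples contributing to a single optimizer update."""
    return train_batch_size * gradient_accumulation_steps


def batches_per_epoch(num_examples, train_batch_size):
    """Forward/backward passes needed to see every example once."""
    return math.ceil(num_examples / train_batch_size)


NUM_TRAIN_EXAMPLES = 5393  # assumption: size of the merged training frame
ACCUMULATION_STEPS = 8     # 'gradient_accumulation_steps' in args

for batch_size in (16, 32):
    print(
        "batch size", batch_size,
        "| effective batch size", effective_batch_size(batch_size, ACCUMULATION_STEPS),
        "| batches per epoch", batches_per_epoch(NUM_TRAIN_EXAMPLES, batch_size),
    )
```

So the two sweep settings correspond to effective batch sizes of 128 and 256.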

### XLNET Base
This section contains the training phase for our XLNET base model:

In [None]:
def train():
    # Initialize a new wandb run
    wandb.init()

    # Create a TransformerModel
    model = MultiLabelClassificationModel(
        'xlnet',
        'xlnet-base-cased',
        use_cuda=True,
        args=args,
        sweep_config=wandb.config,
        num_labels=20
    )

    # Train the model
    model.train_model(merged_train)

    # Evaluate the model
    model.eval_model(merged_val)

    # Mark the wandb run as finished
    wandb.finish()


In [None]:
wandb.agent(sweep_id, train, count=6)

In [None]:
# Initialize the W&B API
api = wandb.Api()

# Look up the W&B project that holds the sweep
project = api.project('Human Value Detection with XLNET')

project

In [None]:
# Get the sweep
sweep = api.sweep(f"{project.name}/{sweep_id}")

# Retrieve the best run
best_run = sweep.best_run()

# Get the best run's parameters and results
best_params = best_run.config
best_results = best_run.summary



In [None]:
print("Best Parameters:", best_params)
print("Best Results:", best_results)

Best Parameters: {'num_train_epochs': 10, 'train_batch_size': 32}
Best Results: {'lr': 2.0304568527918785e-06, '_step': 3, '_wandb': {'runtime': 755}, '_runtime': 701.5600302219391, '_timestamp': 1693835222.798767, 'global_step': 200, 'Training loss': 0.29959309101104736}


In [None]:
# Defining the model using the best parameters obtained
model_xlnet_base = MultiLabelClassificationModel(
        'xlnet',
        'xlnet-base-cased',
        use_cuda=True,
        args={
              'num_train_epochs': best_params['num_train_epochs'],
              'train_batch_size': best_params['train_batch_size'],
              'manual_seed': 42,
              'overwrite_output_dir': True,
              'gradient_accumulation_steps': 8,
              "lr": 2e-4,
              "optimizer": 'AdamW'
            },
        num_labels=20
)

Some weights of XLNetForMultiLabelSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.bias', 'logits_proj.bias', 'sequence_summary.summary.weight', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# layers of our xlnet base model
model_xlnet_base.model

XLNetForMultiLabelSequenceClassification(
  (transformer): XLNetModel(
    (word_embedding): Embedding(32000, 768)
    (layer): ModuleList(
      (0-11): 12 x XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (layer_1): Linear(in_features=768, out_features=3072, bias=True)
          (layer_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation_function): GELUActivation()
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (sequence_summary): SequenceSummary(
    (summary): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
    (first_dropout): Identity()

In [None]:
# training the model
model_xlnet_base.train_model(merged_train, eval_df=merged_val)

### XLNET Large

This section contains the training phase for our XLNET large model:

In [None]:
# Defining our xlnet large model with the hyperparameters we got from xlnet base
model_xlnet_large = MultiLabelClassificationModel(
        'xlnet',
        'xlnet-large-cased',
        use_cuda=True,
        args={
              'num_train_epochs': 10,
              'train_batch_size': 32,
              'manual_seed': 42,
              'overwrite_output_dir': True,
              'gradient_accumulation_steps': 8,
              "lr": 2e-4,
              "optimizer": 'AdamW'
            },
        num_labels=20
)

In [None]:
# Layers of our xlnet large model
model_xlnet_large.model

In [None]:
# Training our xlnet large model
model_xlnet_large.train_model(merged_train, eval_df=merged_val)

## Evaluation of the models


In this section, we assess the performance of the XLNET base and XLNET large models on both the validation and test sets. We gauge their performance using the F1 score as the primary metric for evaluation.
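
For reference, the macro-averaged F1 score that `get_threshold` maximizes (through `average = "macro"`) can be written as follows, where $TP_c$, $FP_c$, and $FN_c$ denote the true positive, false positive, and false negative counts for category $c$. This is the standard definition, given here as a reading aid.

$$
F1_c = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}, \qquad \text{macro-}F1 = \frac{1}{20}\sum_{c=1}^{20} F1_c
$$

Because every category carries the same weight, a rare category such as Conformity: interpersonal (60 validation examples) influences the score as much as a frequent one such as Security: personal (759 validation examples).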

### Evaluation of xlnet-base-cased
Here, we evaluate the XLNET base model on both the validation and test datasets.

**Evaluation on validation set**

In [None]:
result_xlnet_base, model_outputs_xlnet_base, wrong_predictions_xlnet_base = model_xlnet_base.eval_model(merged_val)
print("Results of xlnet base on validation set:\n", result_xlnet_base)

In [None]:
xlnet_base_threshold = get_threshold(val_labels, model_outputs_xlnet_base)

THRESHOLD: 0.10  F1 score: 0.402
THRESHOLD: 0.15  F1 score: 0.424
THRESHOLD: 0.20  F1 score: 0.420
THRESHOLD: 0.25  F1 score: 0.403
THRESHOLD: 0.30  F1 score: 0.382
THRESHOLD: 0.35  F1 score: 0.363
THRESHOLD: 0.40  F1 score: 0.342
THRESHOLD: 0.45  F1 score: 0.321
THRESHOLD: 0.50  F1 score: 0.299
THRESHOLD: 0.55  F1 score: 0.272
THRESHOLD: 0.60  F1 score: 0.250
THRESHOLD: 0.65  F1 score: 0.226
THRESHOLD: 0.70  F1 score: 0.197
THRESHOLD: 0.75  F1 score: 0.172
THRESHOLD: 0.80  F1 score: 0.142
THRESHOLD: 0.85  F1 score: 0.108

Best threshold obtained: 0.15 having F1 score of: 0.42


In [None]:
val_labels

Unnamed: 0,Argument ID,Self-direction: thought,Self-direction: action,Stimulation,Hedonism,Achievement,Power: dominance,Power: resources,Face,Security: personal,...,Tradition,Conformity: rules,Conformity: interpersonal,Humility,Benevolence: caring,Benevolence: dependability,Universalism: concern,Universalism: nature,Universalism: tolerance,Universalism: objectivity
0,A01001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,A01012,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,A02001,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,A02002,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,A02009,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1891,E08014,1,0,0,0,1,0,0,0,1,...,0,1,0,0,0,0,1,0,0,1
1892,E08021,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,1,0,0,1
1893,E08022,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,1,1,0,0,1
1894,E08024,0,1,0,0,0,1,0,0,1,...,0,0,0,0,0,0,1,0,0,1


In [None]:
get_report(val_labels, model_outputs_xlnet_base, xlnet_base_threshold)

                            precision    recall  f1-score   support

   Self-direction: thought       0.33      0.73      0.45       251
    Self-direction: action       0.42      0.74      0.53       496
               Stimulation       0.33      0.31      0.32       138
                  Hedonism       0.40      0.38      0.39       103
               Achievement       0.50      0.85      0.63       575
          Power: dominance       0.24      0.40      0.30       164
          Power: resources       0.31      0.82      0.45       132
                      Face       0.22      0.21      0.22       130
        Security: personal       0.51      0.95      0.66       759
        Security: societal       0.40      0.89      0.55       488
                 Tradition       0.32      0.49      0.39       172
         Conformity: rules       0.35      0.84      0.49       455
 Conformity: interpersonal       0.21      0.13      0.16        60
                  Humility       0.17      0.08

**Evaluation on test set**

In [None]:
result_xlnet_base_test, model_outputs_xlnet_base_test, wrong_predictions_xlnet_base_test = model_xlnet_base.eval_model(merged_test)
print("Results of xlnet base on test set:\n", result_xlnet_base_test)

In [None]:
xlnet_base_threshold_test = get_threshold(test_labels, model_outputs_xlnet_base_test)

THRESHOLD: 0.10  F1 score: 0.379
THRESHOLD: 0.15  F1 score: 0.410
THRESHOLD: 0.20  F1 score: 0.414
THRESHOLD: 0.25  F1 score: 0.403
THRESHOLD: 0.30  F1 score: 0.390
THRESHOLD: 0.35  F1 score: 0.368
THRESHOLD: 0.40  F1 score: 0.351
THRESHOLD: 0.45  F1 score: 0.326
THRESHOLD: 0.50  F1 score: 0.302
THRESHOLD: 0.55  F1 score: 0.282
THRESHOLD: 0.60  F1 score: 0.257
THRESHOLD: 0.65  F1 score: 0.233
THRESHOLD: 0.70  F1 score: 0.211
THRESHOLD: 0.75  F1 score: 0.188
THRESHOLD: 0.80  F1 score: 0.156
THRESHOLD: 0.85  F1 score: 0.130

Best threshold obtained: 0.2 having F1 score of: 0.41


In [None]:
model_outputs_xlnet_base_test

array([[0.0526123 , 0.07666016, 0.03417969, ..., 0.01554108, 0.14880371,
        0.17443848],
       [0.08978271, 0.15625   , 0.02328491, ..., 0.04708862, 0.21728516,
        0.20446777],
       [0.03796387, 0.07434082, 0.03567505, ..., 0.0692749 , 0.05059814,
        0.1895752 ],
       ...,
       [0.20117188, 0.08990479, 0.11126709, ..., 0.25048828, 0.14147949,
        0.48828125],
       [0.09075928, 0.17712402, 0.01020813, ..., 0.01306915, 0.13000488,
        0.38525391],
       [0.04553223, 0.34399414, 0.04525757, ..., 0.04360962, 0.04821777,
        0.29736328]])

In [None]:
test_labels

Unnamed: 0,Argument ID,Self-direction: thought,Self-direction: action,Stimulation,Hedonism,Achievement,Power: dominance,Power: resources,Face,Security: personal,...,Tradition,Conformity: rules,Conformity: interpersonal,Humility,Benevolence: caring,Benevolence: dependability,Universalism: concern,Universalism: nature,Universalism: tolerance,Universalism: objectivity
0,A26004,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,1,0,1,0
1,A26010,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,1,1
2,A26016,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,1,1,0,0,0
3,A26024,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,A26026,0,0,0,0,1,0,0,0,1,...,0,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1571,E07272,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1572,E07273,0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,1,0,0,0
1573,E07275,1,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,1,0
1574,E07280,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,1


In [None]:
get_report(test_labels, model_outputs_xlnet_base_test, xlnet_base_threshold_test)

                            precision    recall  f1-score   support

   Self-direction: thought       0.32      0.67      0.44       143
    Self-direction: action       0.51      0.70      0.59       391
               Stimulation       0.24      0.08      0.12        77
                  Hedonism       0.31      0.19      0.24        26
               Achievement       0.49      0.72      0.59       412
          Power: dominance       0.31      0.38      0.34       108
          Power: resources       0.38      0.70      0.49       105
                      Face       0.21      0.07      0.11        96
        Security: personal       0.51      0.93      0.66       537
        Security: societal       0.39      0.89      0.54       397
                 Tradition       0.39      0.72      0.51       168
         Conformity: rules       0.32      0.83      0.47       287
 Conformity: interpersonal       0.33      0.08      0.12        53
                  Humility       0.07      0.03

### Evaluation of xlnet-large-cased
Here, we evaluate the XLNET large model on both the validation and test datasets.

**Evaluation on validation set**

In [None]:
result_xlnet_large, model_outputs_xlnet_large, wrong_predictions_xlnet_large = model_xlnet_large.eval_model(merged_val)
print("Results of xlnet large on validation set:\n", result_xlnet_large)

In [None]:
xlnet_large_threshold = get_threshold(val_labels, model_outputs_xlnet_large)

THRESHOLD: 0.10  F1 score: 0.427
THRESHOLD: 0.15  F1 score: 0.457
THRESHOLD: 0.20  F1 score: 0.463
THRESHOLD: 0.25  F1 score: 0.457
THRESHOLD: 0.30  F1 score: 0.445
THRESHOLD: 0.35  F1 score: 0.430
THRESHOLD: 0.40  F1 score: 0.412
THRESHOLD: 0.45  F1 score: 0.393
THRESHOLD: 0.50  F1 score: 0.368
THRESHOLD: 0.55  F1 score: 0.356
THRESHOLD: 0.60  F1 score: 0.335
THRESHOLD: 0.65  F1 score: 0.309
THRESHOLD: 0.70  F1 score: 0.289
THRESHOLD: 0.75  F1 score: 0.264
THRESHOLD: 0.80  F1 score: 0.234
THRESHOLD: 0.85  F1 score: 0.194

Best threshold obtained: 0.2 having F1 score of: 0.46


In [None]:
get_report(val_labels, model_outputs_xlnet_large, xlnet_large_threshold)

                            precision    recall  f1-score   support

   Self-direction: thought       0.38      0.70      0.49       251
    Self-direction: action       0.50      0.71      0.59       496
               Stimulation       0.37      0.33      0.35       138
                  Hedonism       0.44      0.39      0.41       103
               Achievement       0.57      0.80      0.66       575
          Power: dominance       0.29      0.32      0.30       164
          Power: resources       0.37      0.77      0.50       132
                      Face       0.27      0.25      0.26       130
        Security: personal       0.57      0.92      0.71       759
        Security: societal       0.49      0.81      0.61       488
                 Tradition       0.41      0.49      0.44       172
         Conformity: rules       0.47      0.78      0.58       455
 Conformity: interpersonal       0.36      0.17      0.23        60
                  Humility       0.11      0.13

**Evaluation on test set**

In [None]:
result_xlnet_large_test, model_outputs_xlnet_large_test, wrong_predictions_xlnet_large_test = model_xlnet_large.eval_model(merged_test)
print("Results of xlnet large on test set:\n", result_xlnet_large_test)

In [None]:
xlnet_large_threshold_test = get_threshold(test_labels, model_outputs_xlnet_large_test)

THRESHOLD: 0.10  F1 score: 0.402
THRESHOLD: 0.15  F1 score: 0.429
THRESHOLD: 0.20  F1 score: 0.438
THRESHOLD: 0.25  F1 score: 0.440
THRESHOLD: 0.30  F1 score: 0.426
THRESHOLD: 0.35  F1 score: 0.420
THRESHOLD: 0.40  F1 score: 0.411
THRESHOLD: 0.45  F1 score: 0.395
THRESHOLD: 0.50  F1 score: 0.379
THRESHOLD: 0.55  F1 score: 0.360
THRESHOLD: 0.60  F1 score: 0.342
THRESHOLD: 0.65  F1 score: 0.322
THRESHOLD: 0.70  F1 score: 0.293
THRESHOLD: 0.75  F1 score: 0.268
THRESHOLD: 0.80  F1 score: 0.243
THRESHOLD: 0.85  F1 score: 0.198

Best threshold obtained: 0.25 having F1 score of: 0.44


In [None]:
get_report(test_labels, model_outputs_xlnet_large_test, xlnet_large_threshold_test)

                            precision    recall  f1-score   support

   Self-direction: thought       0.37      0.62      0.46       143
    Self-direction: action       0.63      0.71      0.66       391
               Stimulation       0.42      0.06      0.11        77
                  Hedonism       0.21      0.15      0.18        26
               Achievement       0.55      0.66      0.60       412
          Power: dominance       0.36      0.41      0.38       108
          Power: resources       0.36      0.58      0.44       105
                      Face       0.21      0.10      0.14        96
        Security: personal       0.60      0.89      0.72       537
        Security: societal       0.47      0.82      0.60       397
                 Tradition       0.47      0.71      0.56       168
         Conformity: rules       0.38      0.76      0.51       287
 Conformity: interpersonal       0.42      0.09      0.15        53
                  Humility       0.07      0.04