[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RonPlusSign/llms4subjects/blob/main/embedding_similarity_tagging.ipynb)

# Embedding Similarity Tagging

The goal of this notebook is to run the `embedding_similarity_tagging.py` script with different parameters (e.g. different embedding models).

The script uses a SentenceTransformer model to encode both the document texts and the GND tags,
then compares the two sets of embeddings and assigns each document the GND tags most similar to it.
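
The ranking step can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses toy three-dimensional vectors in place of real SentenceTransformer embeddings; the function names and tag ids are hypothetical and are not taken from `embedding_similarity_tagging.py`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_tags(doc_embedding, tag_embeddings, k=3):
    """Return the ids of the k tags most similar to the document embedding."""
    scored = [(cosine(doc_embedding, vec), tag_id)
              for tag_id, vec in tag_embeddings.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [tag_id for _, tag_id in scored[:k]]


# Toy embeddings (hypothetical ids); real ones would come from the model
tag_embeddings = {
    "gnd:A": [1.0, 0.0, 0.0],
    "gnd:B": [0.0, 1.0, 0.0],
    "gnd:C": [0.7, 0.7, 0.0],
}
doc_embedding = [0.9, 0.1, 0.0]

ranked = top_k_tags(doc_embedding, tag_embeddings, k=2)
print(ranked)  # ['gnd:A', 'gnd:C']
```

With tens of thousands of tags, a vectorized matrix product over normalized embeddings would likely be preferable to this per-tag loop.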

The quality of the tagging results is evaluated using the `shared-task-eval-script/llms4subjects-evaluation.py` script.
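
The evaluation output further down shows scores being computed at cutoffs `k` of 5, 10 and 15. As a rough illustration of what a per-document score at a cutoff can look like, the sketch below assumes precision, recall and F1 over the top `k` predictions; the function name and the exact definitions are assumptions and may differ from what the official evaluation script does.

```python
def precision_recall_f1_at_k(predicted, gold, k):
    """Score the first k predicted tags against the set of gold tags."""
    top_k = predicted[:k]
    hits = sum(1 for tag in top_k if tag in gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ranked predictions and gold labels for one document
predicted = ["a", "b", "c", "d", "e"]
gold = {"a", "d", "x"}

p, r, f1 = precision_recall_f1_at_k(predicted, gold, k=5)
print(p, r, f1)
```

Here two of the five predictions are correct, so precision is 2/5, recall is 2/3, and F1 is their harmonic mean.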

In [None]:
# Run this cell only when the notebook is running in Google Colab

# Clone the repository and move its contents into the current directory
!git clone https://github.com/RonPlusSign/llms4subjects.git
!mv llms4subjects/* .
!rm -r llms4subjects

# Install required packages
!pip install -r requirements.txt

#### Tagging with Different Embedding Models

In [12]:
models = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "distiluse-base-multilingual-cased-v1",
    # The next model logs "No sentence-transformers model found with name ...";
    # this is harmless, the library falls back to a new model with mean pooling
    "T-Systems-onsite/cross-en-de-roberta-sentence-transformer",
    "intfloat/multilingual-e5-large",
]

In [13]:
for model_name in models:

    model_name_folder = model_name.split("/")[-1]
    tag_embeddings_file = f"results/{model_name_folder}/tag_embeddings.json" # Where to save the tag embeddings
    results_dir = f"results/{model_name_folder}" # Where to save the tagging results
    docs_path = "shared-task-datasets/TIBKAT/tib-core-subjects/data/dev" # Documents to tag
    tag_file = "shared-task-datasets/GND/dataset/GND-Subjects-tib-core.json" # Tag list definition

    print(f"\n------Running tagging with model: {model_name} ------")
    %run embedding_similarity_tagging.py \
            --model_name {model_name} \
            --tags_file {tag_file} \
            --tag_embeddings_file {tag_embeddings_file} \
            --results_dir {results_dir} \
            --docs_path {docs_path}


------Running tagging with model: sentence-transformers/all-MiniLM-L6-v2 ------




Loading model...
Loading GND tags...
Encoding tag descriptions...


Batches: 100%|██████████| 2483/2483 [01:17<00:00, 32.07it/s]


Processing test documents and computing similarities...
Found 6980 documents in shared-task-datasets/TIBKAT/tib-core-subjects/data/dev.


Tagging documents: 100%|██████████| 6980/6980 [02:31<00:00, 46.21it/s]


Tagging complete. Individual results saved in corresponding files.

------Running tagging with model: distiluse-base-multilingual-cased-v1 ------
Loading model...
Loading GND tags...
Encoding tag descriptions...


Batches: 100%|██████████| 2483/2483 [04:22<00:00,  9.46it/s]


Processing test documents and computing similarities...
Found 6980 documents in shared-task-datasets/TIBKAT/tib-core-subjects/data/dev.


Tagging documents: 100%|██████████| 6980/6980 [02:41<00:00, 43.30it/s]


Tagging complete. Individual results saved in corresponding files.

------Running tagging with model: T-Systems-onsite/cross-en-de-roberta-sentence-transformer ------
Loading model...


No sentence-transformers model found with name T-Systems-onsite/cross-en-de-roberta-sentence-transformer. Creating a new one with mean pooling.


Loading GND tags...
Encoding tag descriptions...


Batches: 100%|██████████| 2483/2483 [09:01<00:00,  4.59it/s]


Processing test documents and computing similarities...
Found 6980 documents in shared-task-datasets/TIBKAT/tib-core-subjects/data/dev.


Tagging documents: 100%|██████████| 6980/6980 [07:12<00:00, 16.14it/s]


Tagging complete. Individual results saved in corresponding files.

------Running tagging with model: intfloat/multilingual-e5-large ------
Loading model...
Loading GND tags...
Encoding tag descriptions...


Batches: 100%|██████████| 2483/2483 [25:40<00:00,  1.61it/s]


Processing test documents and computing similarities...
Found 6980 documents in shared-task-datasets/TIBKAT/tib-core-subjects/data/dev.


Tagging documents: 100%|██████████| 6980/6980 [20:48<00:00,  5.59it/s]


Tagging complete. Individual results saved in corresponding files.


#### Evaluation

In [14]:
# Evaluate the tagging results using the evaluation script.
for model_name in models:
    print(f"\n------Evaluating tagging results for model: {model_name} ------")

    model_name_folder = model_name.split("/")[-1]
    true_labels_dir = "shared-task-datasets/TIBKAT/tib-core-subjects/data/dev"
    pred_labels_dir = f"results/{model_name_folder}"
    results_dir = f"results/{model_name_folder}"

    %run "shared-task-eval-script/llms4subjects-evaluation.py" \
            --team_name {model_name_folder} \
            --true_labels_dir {true_labels_dir} \
            --pred_labels_dir {pred_labels_dir} \
            --results_dir {results_dir}


------Evaluating tagging results for model: sentence-transformers/all-MiniLM-L6-v2 ------

LLMs4Subjects Shared Task -- Evaluations

Reading the True GND labels...
Reading the Predicted GND labels...

Evaluating the directory structure of the predicted folder...

Evaluating the predicted GND labels...

Evaluating GND Subject Codes -- Granularity Level: Combined Language and Record-levels and k: 5
Evaluating GND Subject Codes -- Granularity Level: Record Type level and k: 5
Evaluating GND Subject Codes -- Granularity Level: Language level and k: 5

Evaluating GND Subject Codes -- Granularity Level: Combined Language and Record-levels and k: 10
Evaluating GND Subject Codes -- Granularity Level: Record Type level and k: 10
Evaluating GND Subject Codes -- Granularity Level: Language level and k: 10

Evaluating GND Subject Codes -- Granularity Level: Combined Language and Record-levels and k: 15
Evaluating GND Subject Codes -- Granularity Level: Record Type level and k: 15
Evaluating GND S

## SentenceTransformer fine-tuning

The `finetune_sentence_transformer.py` script fine-tunes a SentenceTransformer model on training data for subject tagging.
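
The log below mentions building a tag mapping and then creating training examples from the documents. One plausible way to build such examples, shown only as a sketch, is to pair each document text with the name of every gold tag attached to it. The record layout (`text`, `tags`), the tag ids and the function name are hypothetical and are not taken from `finetune_sentence_transformer.py`.

```python
def build_training_pairs(documents, tag_names):
    """Pair each document text with the name of every gold tag it carries."""
    pairs = []
    for doc in documents:
        for tag_id in doc["tags"]:
            name = tag_names.get(tag_id)
            if name is not None:  # skip tags missing from the mapping
                pairs.append((doc["text"], name))
    return pairs


# Hypothetical tag mapping and documents
tag_names = {"t1": "Machine learning", "t2": "Linguistics"}
documents = [
    {"text": "Neural networks for text classification", "tags": ["t1", "t2"]},
    {"text": "A grammar of Old High German", "tags": ["t2", "t9"]},
]

pairs = build_training_pairs(documents, tag_names)
print(len(pairs))  # 3
```

Pairs of this shape are the kind of input that contrastive objectives for sentence embeddings typically consume, with the tag name acting as the positive for its document.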

In [None]:
# Models to fine-tune (uncomment an entry to include it)
models = [
    "sentence-transformers/all-MiniLM-L6-v2",
    # "distiluse-base-multilingual-cased-v1",
    # The next model logs "No sentence-transformers model found with name ...";
    # this is harmless, the library falls back to a new model with mean pooling
    # "T-Systems-onsite/cross-en-de-roberta-sentence-transformer",
    # "intfloat/multilingual-e5-large",
]

In [2]:
# Fine-tune each selected SentenceTransformer model on the training data
for model_name in models:
    print(f"\n------Fine-tuning model: {model_name} ------")

    model_name_clean = model_name.split("/")[-1]
    training_data_dir = "shared-task-datasets/TIBKAT/tib-core-subjects/data/train"
    eval_data_dir = "shared-task-datasets/TIBKAT/tib-core-subjects/data/dev"
    gnd_tags_file = "shared-task-datasets/GND/dataset/GND-Subjects-tib-core.json"
    output_model_path = f"models/finetuned/{model_name_clean}"

    %run finetune_sentence_transformer.py \
            --training_path {training_data_dir} \
            --eval_path {eval_data_dir} \
            --gnd_tags_file {gnd_tags_file} \
            --model_name {model_name} \
            --output_model_path {output_model_path} \
            --batch_size 16 \
            --num_epochs 1


------Fine-tuning model: sentence-transformers/all-MiniLM-L6-v2 ------




Loading model...
Loading GND tags and building mapping...
Loaded 79427 GND tags.
Building training examples...
Found 41902 documents in shared-task-datasets/TIBKAT/tib-core-subjects/data/train.


Building examples: 100%|██████████| 41902/41902 [19:35<00:00, 35.64it/s]


Created 87896 training examples.
Building evaluation examples...
Found 6980 documents in shared-task-datasets/TIBKAT/tib-core-subjects/data/dev.


Building examples: 100%|██████████| 6980/6980 [03:22<00:00, 34.46it/s]


Created 14711 evaluation examples.


You are adding a <class 'transformers.integrations.integration_utils.WandbCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
WandbCallback


Starting fine-tuning...


wandb: Currently logged in as: andrea-delli (andrea-delli-politecnico-di-torino) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


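| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 4.748         | 4.558064        |
| 200  | 4.4213        | 4.305213        |
| 300  | 4.2537        | 4.226685        |
| 400  | 4.1853        | 4.184414        |
| 500  | 4.1446        | 4.028315        |
| 600  | 4.0034        | 3.880127        |
| 700  | 3.953         | 3.848956        |
| 800  | 3.8884        | 3.851114        |
| 900  | 3.9726        | 4.06273         |
| 1000 | 3.8973        | 3.918621        |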


Fine-tuning complete. Model saved to models/finetuned/all-MiniLM-L6-v2.
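
The validation loss in the log above does not decrease monotonically, so the last step is not necessarily the best one. The short sketch below copies the logged values and picks the step with the lowest validation loss; it only inspects the numbers and makes no claim about which checkpoint the script actually saves.

```python
# (step, training loss, validation loss) copied from the training log
log = [
    (100, 4.748, 4.558064),
    (200, 4.4213, 4.305213),
    (300, 4.2537, 4.226685),
    (400, 4.1853, 4.184414),
    (500, 4.1446, 4.028315),
    (600, 4.0034, 3.880127),
    (700, 3.953, 3.848956),
    (800, 3.8884, 3.851114),
    (900, 3.9726, 4.06273),
    (1000, 3.8973, 3.918621),
]

# Select the row with the smallest validation loss
best_step, _, best_val_loss = min(log, key=lambda row: row[2])
print(best_step, best_val_loss)  # 700 3.848956
```

The minimum is reached at step 700, after which the validation loss fluctuates and ends slightly higher at step 1000.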


#### Tag using the fine-tuned models

In [6]:
# Derive the local paths where the fine-tuned models were saved
model_names = [model_name.split("/")[-1] for model_name in models]
finetuned_models_path = [f"models/finetuned/{name}" for name in model_names]

for model_name in finetuned_models_path:

    model_name_folder = model_name.split("/")[-1]
    tag_embeddings_file = f"results/finetuned_{model_name_folder}/tag_embeddings.json" # Where to save the tag embeddings
    results_dir = f"results/finetuned_{model_name_folder}" # Where to save the tagging results
    docs_path = "shared-task-datasets/TIBKAT/tib-core-subjects/data/dev" # Documents to tag
    tag_file = "shared-task-datasets/GND/dataset/GND-Subjects-tib-core.json" # Tag list definition

    print(f"\n------Running tagging with model: {model_name} ------")
    %run embedding_similarity_tagging.py \
            --model_name {model_name} \
            --tags_file {tag_file} \
            --tag_embeddings_file {tag_embeddings_file} \
            --results_dir {results_dir} \
            --docs_path {docs_path}


------Running tagging with model: models/finetuned/all-MiniLM-L6-v2 ------
Loading model...
Loading GND tags...
Encoding tag descriptions...


Batches: 100%|██████████| 2483/2483 [01:19<00:00, 31.35it/s]


Processing test documents and computing similarities...
Found 6980 documents in shared-task-datasets/TIBKAT/tib-core-subjects/data/dev.


Tagging documents: 100%|██████████| 6980/6980 [02:24<00:00, 48.31it/s]


Tagging complete. Individual results saved in corresponding files.


#### Evaluate the fine-tuned models

In [11]:
# Evaluate the fine-tuned models using the evaluation script.
for model_name in finetuned_models_path:
    print(f"\n------Evaluating fine-tuned model: {model_name} ------")

    model_name_clean = model_name.split("/")[-1]
    true_labels_dir = "shared-task-datasets/TIBKAT/tib-core-subjects/data/dev"
    pred_labels_dir = f"results/finetuned_{model_name_clean}"
    results_dir = f"results/finetuned_{model_name_clean}"
    result_name = f"finetuned_{model_name_clean}"

    %run "shared-task-eval-script/llms4subjects-evaluation.py" \
            --team_name {result_name} \
            --true_labels_dir {true_labels_dir} \
            --pred_labels_dir {pred_labels_dir} \
            --results_dir {results_dir}
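
The output of this cell reports `'utf-8' codec can't decode byte 0x8c in position 11` while the predicted labels are being read. The cause is not visible in the log. A first diagnostic step, sketched here under the assumption that some file under the predictions directory is not valid UTF-8 text, is to scan that directory and list the offending files. The helper name is hypothetical, and the demo runs on a temporary directory rather than on the real results.

```python
import tempfile
from pathlib import Path


def find_non_utf8_files(root):
    """Return (path, byte offset) for every file under root that is not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            bad.append((str(path), err.start))  # offset of the first invalid byte
    return bad


# Demo on a throwaway directory: one valid JSON file, one binary file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ok.json").write_text('{"a": 1}', encoding="utf-8")
    (Path(tmp) / "bad.bin").write_bytes(b"0123456789A\x8c")
    report = find_non_utf8_files(tmp)
    names = [(Path(p).name, pos) for p, pos in report]

print(names)  # [('bad.bin', 11)]
```

In the notebook, calling `find_non_utf8_files(pred_labels_dir)` would show whether such a file exists among the predictions.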


------Evaluating fine-tuned model: models/finetuned/all-MiniLM-L6-v2 ------

LLMs4Subjects Shared Task -- Evaluations

Reading the True GND labels...
Reading the Predicted GND labels...
Exception Occured: 'utf-8' codec can't decode byte 0x8c in position 11: invalid start byte

Evaluating the directory structure of the predicted folder...

Evaluating the predicted GND labels...

Evaluating GND Subject Codes -- Granularity Level: Combined Language and Record-levels and k: 5
Evaluating GND Subject Codes -- Granularity Level: Record Type level and k: 5
Evaluating GND Subject Codes -- Granularity Level: Language level and k: 5

Evaluating GND Subject Codes -- Granularity Level: Combined Language and Record-levels and k: 10
Evaluating GND Subject Codes -- Granularity Level: Record Type level and k: 10
Evaluating GND Subject Codes -- Granularity Level: Language level and k: 10

Evaluating GND Subject Codes -- Granularity Level: Combined Language and Record-levels and k: 15
Evaluating GND Sub