[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RonPlusSign/llms4subjects/blob/main/embedding_similarity_tagging.ipynb)

# Embedding Similarity Tagging

This notebook runs the `embedding_similarity_tagging.py` script with different parameters, such as different embedding models.

The script uses a SentenceTransformer model to encode both the document texts and the GND tags.
It then computes the similarity between the resulting embeddings and assigns each document the most similar GND tags.
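
The ranking step described above can be illustrated with a small, dependency-free sketch. It assumes cosine similarity as the similarity measure and uses hand-written toy vectors in place of real SentenceTransformer embeddings. The function names (`cosine_similarity`, `top_k_tags`) and the tag labels are hypothetical and are not taken from `embedding_similarity_tagging.py`, whose actual implementation may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_tags(doc_embedding, tag_embeddings, k=3):
    """Return the k (tag, score) pairs with the highest similarity to the document."""
    scored = [
        (tag, cosine_similarity(doc_embedding, emb))
        for tag, emb in tag_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (assumption: real ones come from the model).
tag_embeddings = {
    "Machine learning": [0.9, 0.1, 0.0],
    "Organic chemistry": [0.0, 0.2, 0.9],
    "Statistics": [0.7, 0.6, 0.1],
}
doc_embedding = [0.8, 0.3, 0.0]

ranked = top_k_tags(doc_embedding, tag_embeddings, k=2)
for tag, score in ranked:
    print(f"{tag}: {score:.3f}")
```

With real data, the toy vectors would be replaced by the model's embeddings of the document text and of each GND tag, and `k` would be the number of tags to assign per document.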

The tagging results are evaluated with the `shared-task-eval-script/llms4subjects-evaluation.py` script.
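
To give a sense of what such an evaluation measures, the sketch below computes precision, recall and F1 over the top-k predicted tags of a single document. This is an assumption about the kind of metric involved, not a description of the evaluation script, which may aggregate or define these values differently. The function name `precision_recall_f1_at_k` and the tag identifiers are hypothetical.

```python
def precision_recall_f1_at_k(predicted, gold, k):
    """Score the first k predicted tags of one document against its gold tags."""
    top_k = predicted[:k]
    gold_set = set(gold)
    hits = sum(1 for tag in top_k if tag in gold_set)

    precision = hits / k if k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ranked predictions and gold labels for one document.
predicted = ["tag_a", "tag_b", "tag_c", "tag_d"]
gold = ["tag_b", "tag_d", "tag_e"]

p, r, f1 = precision_recall_f1_at_k(predicted, gold, k=2)
print(f"precision@2={p:.3f} recall@2={r:.3f} f1@2={f1:.3f}")
```

Here only `tag_b` among the top two predictions is correct, so one of two predictions is right and one of three gold tags is recovered.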

In [None]:
# Run this cell only if you are running the notebook in Google Colab

# Clone the repository and move its contents into the current directory
!git clone https://github.com/RonPlusSign/llms4subjects.git
!mv llms4subjects/* .
!rm -r llms4subjects

# Install required packages
!pip install -r requirements.txt

#### Tagging with Different Embedding Models

In [None]:
models = [
    "distiluse-base-multilingual-cased-v1",
    "sentence-transformers/all-MiniLM-L6-v2",
    "T-Systems-onsite/cross-en-de-roberta-sentence-transformer", # emits the warning "No sentence-transformers model found with name ...", which can be ignored
    "intfloat/multilingual-e5-large"
]

for model_name in models:

    model_name_folder = model_name.split("/")[-1]
    tag_embeddings_file = f"results/{model_name_folder}/tag_embeddings.json" # Where to save the tag embeddings
    results_dir = f"results/{model_name_folder}/dev" # Where to save the tagging results
    docs_path = "shared-task-datasets/TIBKAT/tib-core-subjects/data/dev" # Documents to tag
    tag_file = "shared-task-datasets/GND/dataset/GND-Subjects-tib-core.json" # Tag list definition

    print(f"\n------Running tagging with model: {model_name} ------")
    !python embedding_similarity_tagging.py \
            --model_name {model_name} \
            --tags_file {tag_file} \
            --tag_embeddings_file {tag_embeddings_file} \
            --results_dir {results_dir} \
            --docs_path {docs_path}

#### Evaluation

In [None]:
# Evaluate the tagging results using the evaluation script.
for model_name in models:
    model_name_folder = model_name.split("/")[-1]
    true_labels_dir = "shared-task-datasets/TIBKAT/tib-core-subjects/data/dev"
    pred_labels_dir = f"results/{model_name_folder}/dev"
    results_dir = f"results/{model_name_folder}"

    !python "shared-task-eval-script/llms4subjects-evaluation.py" \
            --team_name=PoliTo \
            --true_labels_dir {true_labels_dir} \
            --pred_labels_dir {pred_labels_dir} \
            --results_dir {results_dir}