Annif is an open-source tool for automated subject indexing and classification, developed at the National Library of Finland. Annif uses a user-supplied controlled vocabulary (e.g. a thesaurus) and pre-labeled data to train models that can then be used to assign subjects to new input text. This repository contains files created while researching the use of Annif with ebook data from the KB, using the Brinkman thesaurus as the controlled vocabulary.
- annif_uitkomsten.xlsx contains all Annif evaluation outcomes of experiments using different backends and settings. Each tab contains the outcomes for a different subset of the original dataset (the datasets are not on GitHub):
  - subset 1 ggc1: summaries from all ebooks in the original dataset.
  - subset 2 ggc2: summaries, titles and subtitles from all ebooks in the original dataset.
  - subset 3 ggc_zaaktrefwoorden: summaries, titles and subtitles from ebooks that were assigned one or more Brinkman subjects referring to content.
  - subset 4 ggc_vormtrefwoorden: summaries, titles and subtitles from ebooks that were assigned one or more Brinkman subjects referring to form.
- The Annif aantekeningen folder contains documentation as TeX/PDF.
- generate_dataset_annif.ipynb is a Jupyter notebook that generates a document corpus usable by Annif from raw GGC data.
- Initial_analysis_ggc_dataset.ipynb is a Jupyter notebook for getting a feel for the Brinkman subjects available in the source data.
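
To illustrate the kind of output generate_dataset_annif.ipynb is meant to produce, the sketch below writes a small document corpus with one document per line: the text, a tab, then the subject URIs in angle brackets. This layout is assumed from Annif's documented TSV corpus format and is not taken from the notebook itself; the URI separator (a space here) should be checked against the documentation of the Annif version in use. The record fields (`title`, `summary`, `subjects`), the example.org URIs and the function names are hypothetical.

```python
import io


def to_corpus_line(record):
    """Build one corpus line: text, a tab, then subject URIs in angle brackets."""
    # Join title and summary (as in subset ggc2) and collapse all whitespace,
    # so tabs or newlines inside the text cannot break the one-line layout.
    raw_text = f"{record.get('title', '')} {record.get('summary', '')}"
    text = " ".join(raw_text.split())
    # Assumption: URIs are wrapped in angle brackets and separated by spaces.
    uris = " ".join(f"<{uri}>" for uri in record["subjects"])
    return f"{text}\t{uris}"


def write_corpus(records, handle):
    """Write every record that has at least one subject; return the count."""
    written = 0
    for record in records:
        if not record["subjects"]:
            continue  # unlabeled documents cannot be used for training
        handle.write(to_corpus_line(record) + "\n")
        written += 1
    return written


# Hypothetical records; real GGC data will have a different structure.
EXAMPLE_RECORDS = [
    {
        "title": "Tulpen in de Gouden Eeuw",
        "summary": "Een geschiedenis\tvan de tulpenhandel.",
        "subjects": [
            "http://example.org/brinkman/1",
            "http://example.org/brinkman/2",
        ],
    },
    {"title": "Zonder onderwerp", "summary": "Geen trefwoorden.", "subjects": []},
]

buffer = io.StringIO()
count = write_corpus(EXAMPLE_RECORDS, buffer)
print(count)
print(buffer.getvalue(), end="")
```

Writing to a file instead of an in-memory buffer only requires passing a handle opened with `open(path, "w", encoding="utf-8")`.
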
Files located in the thema_thes folder:
- Generate_thema_tsv.py is a Python script that converts the Thema thesaurus XML file into a TSV file to be used as an Annif vocabulary.
- compare_Brinkman_and_Thema.ipynb contains an exploratory investigation of the similarities between the Brinkman and Thema thesauri.
- brinkman_thema_overlap.tsv contains the Brinkman subjects that are also found in Thema.
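
The two steps described for the thema_thes folder (turning a thesaurus XML file into a TSV vocabulary, and finding subjects shared by two vocabularies) can be sketched roughly as follows. This is not the code from Generate_thema_tsv.py or the comparison notebook. The XML element names (`Code`, `CodeValue`, `CodeDescription`), the example.org base URI and the case-insensitive exact-label matching are all assumptions made for illustration; the TSV layout (URI in angle brackets, tab, label) follows the TSV vocabulary format that Annif documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real Thema export may use other element names.
SAMPLE_XML = """<Codes>
  <Code><CodeValue>AB</CodeValue><CodeDescription>Kunst: algemeen</CodeDescription></Code>
  <Code><CodeValue>WN</CodeValue><CodeDescription>Natuur</CodeDescription></Code>
</Codes>"""

BASE_URI = "http://example.org/thema/"  # placeholder namespace, not an official one


def thema_to_vocab_rows(xml_text, base_uri=BASE_URI):
    """Return (uri, label) pairs for every code that has a value and a label."""
    root = ET.fromstring(xml_text)
    rows = []
    for code in root.iter("Code"):
        value = code.findtext("CodeValue", default="").strip()
        label = code.findtext("CodeDescription", default="").strip()
        if value and label:
            rows.append((f"<{base_uri}{value}>", label))
    return rows


def rows_to_tsv(rows):
    """Serialize the pairs as one '<uri> TAB label' line per subject."""
    return "".join(f"{uri}\t{label}\n" for uri, label in rows)


def label_overlap(brinkman_labels, thema_labels):
    """Brinkman labels that also occur in Thema, compared case-insensitively."""
    thema_lookup = {label.casefold() for label in thema_labels}
    return sorted(
        label for label in brinkman_labels if label.casefold() in thema_lookup
    )


rows = thema_to_vocab_rows(SAMPLE_XML)
vocab_tsv = rows_to_tsv(rows)
# Hypothetical Brinkman labels, used only to exercise the comparison.
overlap = label_overlap(["natuur", "tulpen"], [label for _, label in rows])
print(vocab_tsv, end="")
print(overlap)
```

Exact label matching is the simplest possible comparison; it misses near-synonyms and spelling variants, so any overlap found this way is a lower bound.
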