Experiments using Annif to suggest suitable Brinkman subjects

Introduction

Annif is an open-source tool for automated subject indexing and classification, developed at the National Library of Finland. Annif takes a user-supplied controlled vocabulary (e.g. a thesaurus) and pre-labeled data to train models that can then be used to assign subjects to new input text. This repository contains the files created while researching the use of Annif on ebook data from the KB (Koninklijke Bibliotheek, the National Library of the Netherlands), with the Brinkman thesaurus as the controlled vocabulary.

Results

annif_uitkomsten.xlsx contains all Annif evaluation outcomes of the experiments, which were run with different backends and settings. Each tab holds the outcomes for a different subset of the original dataset (the datasets themselves are not on GitHub):
- subset 1 ggc1: summaries of all ebooks in the original dataset.
- subset 2 ggc2: summaries, titles and subtitles of all ebooks in the original dataset.
- subset 3 ggc_zaaktrefwoorden: summaries, titles and subtitles of ebooks that were assigned one or more Brinkman subjects referring to content.
- subset 4 ggc_vormtrefwoorden: summaries, titles and subtitles of ebooks that were assigned one or more Brinkman subjects referring to form.

The Annif aantekeningen folder contains documentation in TeX and PDF form.

Generate document corpus for use in Annif

generate_dataset_annif.ipynb is a Jupyter notebook that generates an Annif-compatible document corpus from raw GGC data.
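
As a rough illustration of what such a corpus looks like, the sketch below builds rows for Annif's TSV document corpus layout: one document per line, with the text in the first column and the subject URIs, each in angle brackets, in the second. This is not the notebook's actual code. The record keys (`title`, `subtitle`, `summary`), the helper names and the `example.org` URIs are assumptions made for the example; the combination of fields mirrors subset 2 described under Results.

```python
import io


def make_corpus_row(record, subject_uris):
    """Build one (text, subjects) pair for a TSV document corpus.

    `record` is a dict with the hypothetical keys title, subtitle and
    summary; missing or empty fields are skipped.
    """
    parts = [record.get(key) for key in ("title", "subtitle", "summary")]
    text = " ".join(p.strip() for p in parts if p and p.strip())
    # Collapse tabs and newlines so every document stays on a single line.
    text = " ".join(text.split())
    # Each subject URI is wrapped in angle brackets, separated by spaces.
    subjects = " ".join(f"<{uri}>" for uri in subject_uris)
    return text, subjects


def write_corpus(rows, handle):
    """Write (text, subjects) pairs as tab-separated lines."""
    for text, subjects in rows:
        handle.write(f"{text}\t{subjects}\n")


if __name__ == "__main__":
    # Placeholder record and URI, not real GGC data.
    example = {"title": "Example title", "summary": "Example summary."}
    buffer = io.StringIO()
    write_corpus(
        [make_corpus_row(example, ["http://example.org/brinkman/1"])], buffer
    )
    print(buffer.getvalue(), end="")
```

Keeping the row construction separate from the file writing makes it easy to produce the different subsets (summaries only, or summaries with titles and subtitles) by changing which keys are joined.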

Initial analysis of Brinkman subjects in GGC-dataset

Initial_analysis_ggc_dataset.ipynb is a Jupyter notebook that explores the Brinkman subjects available in the source data.
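
A first look at subject data of this kind often starts with simple counts. The sketch below is an assumption about what such an exploration could contain, not a copy of the notebook: it takes one list of subject labels per ebook and reports how many distinct subjects occur, how many ebooks have no subject at all, and which subjects are most frequent. The function name and the input shape are hypothetical.

```python
from collections import Counter


def subject_stats(assigned_subjects, top_n=5):
    """Summarise subject usage.

    `assigned_subjects` holds one list of subject labels per ebook;
    duplicate labels within a single ebook are counted once.
    """
    per_doc = [set(labels) for labels in assigned_subjects]
    counts = Counter(label for labels in per_doc for label in labels)
    n_docs = len(per_doc)
    total_assignments = sum(len(labels) for labels in per_doc)
    return {
        "distinct_subjects": len(counts),
        "documents_without_subject": sum(1 for labels in per_doc if not labels),
        "mean_subjects_per_document": total_assignments / n_docs if n_docs else 0.0,
        "most_common": counts.most_common(top_n),
    }


if __name__ == "__main__":
    # Placeholder labels, not real Brinkman subjects.
    print(subject_stats([["A", "B"], ["A"], []]))
```

A long tail of subjects that occur only once or twice is worth noticing at this stage, because a trained model has very little data to learn those subjects from.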

Preliminary investigation on similarities between Thema and Brinkman thesaurus

The files below are located in the thema_thes folder:

- Generate_thema_tsv.py is a Python script that converts the Thema thesaurus XML file into a TSV file for use as an Annif vocabulary.
- compare_Brinkman_and_Thema.ipynb contains an exploratory investigation of the similarities between the Brinkman and Thema thesauri.
- brinkman_thema_overlap.tsv contains the Brinkman subjects that are also found in Thema.
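
The two steps described here, converting a thesaurus XML file into a vocabulary TSV and finding subjects shared by two thesauri, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this folder: the XML element names (`Code`, `CodeValue`, `CodeDescription`), the base URI and the sample labels are hypothetical, and matching on case-insensitive labels is only one possible definition of overlap. The output uses the Annif TSV vocabulary layout of a URI in angle brackets, a tab, and the label.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: element names and content are placeholders; the real Thema
# XML file may be structured differently.
SAMPLE_XML = """<ThemaCodes>
  <Code><CodeValue>X1</CodeValue><CodeDescription>First label</CodeDescription></Code>
  <Code><CodeValue>X2</CodeValue><CodeDescription>Second label</CodeDescription></Code>
</ThemaCodes>"""

BASE_URI = "http://example.org/thema/"  # hypothetical URI prefix


def thema_to_vocab_rows(xml_text, base_uri=BASE_URI):
    """Extract (uri, label) pairs from the assumed XML structure."""
    root = ET.fromstring(xml_text)
    rows = []
    for code in root.iter("Code"):
        value = code.findtext("CodeValue", default="").strip()
        label = code.findtext("CodeDescription", default="").strip()
        if value and label:  # skip incomplete entries
            rows.append((f"<{base_uri}{value}>", label))
    return rows


def to_tsv(rows):
    """Serialise (uri, label) pairs as one tab-separated line each."""
    return "".join(f"{uri}\t{label}\n" for uri, label in rows)


def normalise(label):
    """Lower-case a label and collapse internal whitespace."""
    return " ".join(label.lower().split())


def label_overlap(brinkman_labels, thema_labels):
    """Return the Brinkman labels whose normalised form also occurs in Thema."""
    thema_norm = {normalise(label) for label in thema_labels}
    return sorted(l for l in brinkman_labels if normalise(l) in thema_norm)


if __name__ == "__main__":
    rows = thema_to_vocab_rows(SAMPLE_XML)
    print(to_tsv(rows), end="")
    print(label_overlap(["first label", "Unrelated"], [label for _, label in rows]))
```

Exact label matching is a deliberately strict choice: it misses synonyms and spelling variants, so the overlap it finds is a lower bound on how much the two thesauri share.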
