# AAI Project Bibliotheek
## Annif Models
### Information

This notebook trains multiple models within Annif on our LCSH datasets. For a detailed explanation of how the datasets were created, please refer to [data_processing.ipynb](data_processing.py). Additionally, this notebook lists the actions needed to set up a project in Annif, such as loading a vocabulary. These steps are also covered by the exercises in the [Annif-tutorial repository](https://github.com/NatLibFi/Annif-tutorial/blob/master/exercises/02_tfidf_project.md). Many of the steps performed in this notebook are also described in the following [paper about implementing Annif with LCSH](https://publications.drdo.gov.in/ojs/index.php/djlit/article/view/18619/7915).

### Implementation

#### Install tools
First I will install the following tools: Annif and Skosify. Annif is used for subject indexing, and Skosify is used for cleaning the LCSH vocabulary file. You could also make a development install of Annif by cloning the repository and opening it in a Docker container, which makes it possible to use custom backends in Annif.

In [None]:
%pip install annif
%pip install skosify

#### Add subject vocabulary to project
I added the Library of Congress Subject Headings as my vocabulary, which can be found at the following [link](https://id.loc.gov/authorities/subjects.html). There I downloaded the SKOS/RDF .ttl file listed under bulk downloads.
Because this vocabulary can contain a lot of redundant information, duplicates and other inconsistencies, the tool [Skosify](https://github.com/NatLibFi/Skosify) is used to clean the file. This takes a while.<br>

In [None]:
# Reads subjects.skosrdf.ttl, writes SKOS file lc_subjects.ttl and sets the concept scheme to "Subjects"
!skosify subjects.skosrdf.ttl -o lc_subjects.ttl --label "Subjects"


Load the new file into Annif as a vocabulary. This only has to be done once; the vocabulary can then be reused for all of the following projects.

In [14]:
!annif load-vocab lcsh lc_subjects.ttl

Now the LCSH vocabulary is added to Annif.

#### Make __projects.cfg__

The next step is to make a project in Annif. To initialise a project, a __projects.cfg__ file has to be created.
The parameters can differ for each available backend / model, so make sure to check the Annif-tutorial repository for the initialization of specific backends.

Three backends will be tested and evaluated in this notebook:

- [__TF-IDF__](https://monkeylearn.com/blog/what-is-tf-idf/): Short for _Term Frequency-Inverse Document Frequency_, it combines two scores for each word: _term frequency_ (how often the word occurs in a document) and _inverse document frequency_ (how rare the word is across all documents, so that words occurring in many documents receive a lower weight).
- [__Omikuji-Parabel__](https://github.com/tomtung/omikuji): An implementation of Partitioned Label Trees (a tree-based machine learning algorithm) for extreme multi-label classification, which suits the large number of subjects in our assignment.
- [__XTransformer__](https://github.com/amzn/pecos/blob/mainline/pecos/xmc/xtransformer/README.md): A module from the [PECOS](https://github.com/amzn/pecos/tree/mainline) framework using transformer models for extreme multi-label classification.
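To make the two TF-IDF scores described above more concrete, here is a minimal sketch in plain Python. The three documents are made up for illustration and are not taken from our datasets, and the unsmoothed `log(N / df)` formula is just one common variant; the TF-IDF backend in Annif may weight and normalize terms differently.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, not taken from the LCSH datasets).
docs = [
    "machine learning with python",
    "learning python for libraries",
    "history of libraries",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(term, tokens):
    """Share of the tokens in one document that equal `term`."""
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, corpus):
    """log(N / df): higher when fewer documents contain `term`.

    Assumes `term` occurs in at least one document of the corpus.
    """
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)


def tf_idf(term, tokens, corpus):
    """Product of the two scores for one term in one document."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# "machine" occurs in one document, "learning" in two.
for term in ("machine", "learning"):
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
```

In the first document both words occur once, but "machine" gets the higher score because it appears in fewer documents of the corpus.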

`[lcsh-tfidf-en]`<br>
`name=LCSH TFIDF Titles project` <br>
`language=en`<br>
`backend=tfidf`<br>
`vocab=lcsh`<br>
`analyzer=snowball(english)`

`[lcsh-tfidf2-en]`<br>
`name=LCSH TFIDF Summaries project` <br>
`language=en`<br>
`backend=tfidf`<br>
`vocab=lcsh`<br>
`analyzer=snowball(english)`

`[lcsh-omikuji-parabel-en]`<br>
`name=LCSH Omikuji Parabel project` <br>
`language=en`<br>
`backend=omikuji`<br>
`vocab=lcsh`<br>
`analyzer=snowball(english)`

`[lcsh-xtransformer-distilbert-en]`<br>
`name=LCSH XTransformer Distilbert project` <br>
`language=en`<br>
`backend=xtransformer`<br>
`vocab=lcsh`<br>
`analyzer=snowball(english)`
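The four sections above can also be generated from Python instead of being typed by hand. The following is a sketch using the standard library `configparser` module; it assumes that __projects.cfg__ should be written to the current working directory, and it will overwrite an existing file with that name.

```python
import configparser

# Project IDs, names and backends copied from the sections listed above.
projects = {
    "lcsh-tfidf-en": ("LCSH TFIDF Titles project", "tfidf"),
    "lcsh-tfidf2-en": ("LCSH TFIDF Summaries project", "tfidf"),
    "lcsh-omikuji-parabel-en": ("LCSH Omikuji Parabel project", "omikuji"),
    "lcsh-xtransformer-distilbert-en": ("LCSH XTransformer Distilbert project", "xtransformer"),
}

config = configparser.ConfigParser()
for project_id, (name, backend) in projects.items():
    # All four projects share the same language, vocabulary and analyzer.
    config[project_id] = {
        "name": name,
        "language": "en",
        "backend": backend,
        "vocab": "lcsh",
        "analyzer": "snowball(english)",
    }

# Writes key=value pairs without spaces, matching the layout shown above.
with open("projects.cfg", "w", encoding="utf-8") as cfg_file:
    config.write(cfg_file, space_around_delimiters=False)
```

Backend-specific parameters are left out here, just as in the sections above.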

Execute the following command to view the new projects:

In [15]:
# Command for listing projects
!annif list-projects 

Project ID                       Project Name                          Vocabulary ID  Language  Trained  Modification time
--------------------------------------------------------------------------------------------------------------------------
lcsh-tfidf-en                    LCSH TFIDF Titles project             lcsh           en        False    -                
lcsh-tfidf2-en                   LCSH TFIDF Summaries project          lcsh           en        False    -                
lcsh-omikuji-parabel-en          LCSH Omikuji Parabel project          lcsh           en        False    -                
lcsh-xtransformer-distilbert-en  LCSH XTransformer Distilbert project  lcsh           en        False    -                


#### Train & Evaluate Models
Training will be done separately for each model. The models are not all trained on the same amount of data, since the full size of some datasets has proven to be too computationally intensive. The dataset used may also vary, since one contains only titles and the other contains summaries as well.
To evaluate the models, I have made a relatively small test dataset from the original dataset.

In [16]:
from sklearn.model_selection import train_test_split
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

As can be seen in the code below, the test dataset is only 2% of the total dataset. I want to keep as much data as possible for training the models, so that they cover as many subjects as possible. Because the datasets are quite big, 2% of the data still contains a lot of records.

In [17]:
# Reads the original .tsv files (tab-separated, no header row)
summaries = pd.read_csv("datasets/summaries_uri.tsv", delimiter='\t', header=None)
titles = pd.read_csv("datasets/titles_uri.tsv", delimiter='\t', header=None)

# Splits summaries and titles into a train set (98%) and a test set (2%).
train_summaries, test_summaries = train_test_split(summaries, test_size=0.02)
train_titles, test_titles = train_test_split(titles, test_size=0.02)

# Exports new train & test datasets
train_summaries.to_csv("datasets/train_summaries_uri.tsv", sep='\t', index=False, header=None)
test_summaries.to_csv("datasets/test_summaries_uri.tsv", sep='\t', index=False, header=None)
train_titles.to_csv("datasets/train_titles_uri.tsv", sep='\t', index=False, header=None)
test_titles.to_csv("datasets/test_titles_uri.tsv", sep='\t', index=False, header=None)
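As a sanity check on the 2% split, the sketch below works out the expected set sizes with plain Python. It assumes that the summaries file held 219,437 rows, which is the sum of the 215,048 training examples and the 4,389 evaluated documents reported in the outputs further down, and that a fractional `test_size` is rounded up, which is how I understand `train_test_split` to behave.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


# Assumed row count: 215048 training examples + 4389 evaluated documents.
n_train, n_test = split_sizes(219437, 0.02)
print(n_train, n_test)  # 215048 4389
```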

During evaluation a plethora of metrics will be produced, including [precision, recall](https://en.wikipedia.org/wiki/Precision_and_recall) and [F1-scores](https://en.wikipedia.org/wiki/F-score). 

One particular metric which will also be analysed is [NDCG](https://towardsdatascience.com/demystifying-ndcg-bee3be58cfe0), which stands for _Normalized Discounted Cumulative Gain_. This metric measures ranking quality and is therefore often used to evaluate search engines and recommendation systems. Here, the ranking is the list of suggested subjects, ordered by score. Each suggested subject receives a relevance score, which is discounted based on its position in the ranking, so correct subjects near the top count more; the sum is then normalized by the score of an ideal ranking. Because this is a good measurement for our task, we will primarily be looking at this metric.
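The idea behind NDCG can be sketched in a few lines of Python. This is a simplified illustration with binary relevance (1 for a correct subject, 0 for an incorrect one) and a made-up ranking. The ideal ordering is derived from the suggested list itself, which ignores correct subjects that were never suggested, so it is not necessarily the exact calculation Annif performs.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of its ideal ordering."""
    ideal = sorted(relevances, reverse=True)
    if k is not None:
        relevances, ideal = relevances[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking of five suggested subjects: 1 = correct, 0 = incorrect.
ranking = [0, 1, 1, 0, 0]
print(round(ndcg(ranking), 4))
```

A ranking that places both correct subjects at the top would score 1.0; the further down they appear, the lower the score.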

##### TF-IDF
The TF-IDF backend will be trained twice, separately: once on the dataset with titles only, and once on the summaries.

First with the titles dataset:

In [18]:
!annif train lcsh-tfidf-en datasets/train_titles_uri.tsv

Backend tfidf: transforming subject corpus
Backend tfidf: creating vectorizer
Backend tfidf: creating similarity index


In [19]:
!annif eval lcsh-tfidf-en datasets/test_titles_uri.tsv

Precision (doc avg):          	0.0419
Recall (doc avg):             	0.3037
F1 score (doc avg):           	0.0722
Precision (subj avg):         	0.0011
Recall (subj avg):            	0.0020
F1 score (subj avg):          	0.0012
Precision (weighted subj avg):	0.2308
Recall (weighted subj avg):   	0.2883
F1 score (weighted subj avg): 	0.2177
Precision (microavg):         	0.0416
Recall (microavg):            	0.2883
F1 score (microavg):          	0.0728
F1@5:                         	0.1005
NDCG:                         	0.2148
NDCG@5:                       	0.1930
NDCG@10:                      	0.2148
Precision@1:                  	0.1430
Precision@3:                  	0.0878
Precision@5:                  	0.0652
True positives:               	2692
False positives:              	61946
False negatives:              	6647
Documents evaluated:          	6560


As the metrics show, the TF-IDF model trained purely on title data does not perform well. It has an NDCG score of 0.21, which is quite low. There is room for improvement.
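The micro-averaged precision, recall and F1 score in the output above follow directly from the true positive, false positive and false negative counts. The small sketch below recomputes them from those three counts using the standard definitions.

```python
def micro_metrics(tp, fp, fn):
    """Micro-averaged precision, recall and F1 from pooled counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the evaluation output of lcsh-tfidf-en.
precision, recall, f1 = micro_metrics(tp=2692, fp=61946, fn=6647)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # 0.0416 0.2883 0.0728
```

The low precision is caused by the large number of false positives relative to the true positives.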

Next a separate TF-IDF model will be trained on the summaries dataset.

In [20]:
!annif train lcsh-tfidf2-en datasets/train_summaries_uri.tsv

Backend tfidf: transforming subject corpus
Backend tfidf: creating vectorizer
Backend tfidf: creating similarity index


In [21]:
!annif eval lcsh-tfidf2-en datasets/test_summaries_uri.tsv

Precision (doc avg):          	0.1442
Recall (doc avg):             	0.4487
F1 score (doc avg):           	0.2072
Precision (subj avg):         	0.0006
Recall (subj avg):            	0.0014
F1 score (subj avg):          	0.0007
Precision (weighted subj avg):	0.2759
Recall (weighted subj avg):   	0.4199
F1 score (weighted subj avg): 	0.3006
Precision (microavg):         	0.1442
Recall (microavg):            	0.4199
F1 score (microavg):          	0.2147
F1@5:                         	0.2272
NDCG:                         	0.3574
NDCG@5:                       	0.3030
NDCG@10:                      	0.3574
Precision@1:                  	0.2994
Precision@3:                  	0.2361
Precision@5:                  	0.1971
True positives:               	6328
False positives:              	37562
False negatives:              	8741
Documents evaluated:          	4389


This model already produces a better NDCG score of 0.36. Although this is better, there is still room for improvement.

##### Omikuji Parabel
The next model is Omikuji Parabel. As described before, this is a tree-based model which is well suited for extreme multi-label classification.

In [23]:
!annif train lcsh-omikuji-parabel-en datasets/train_summaries_uri.tsv

Backend omikuji: creating vectorizer
Backend omikuji: creating train file
2024-01-14T12:01:58.955Z INFO [omikuji::data] Loading data from data/projects/lcsh-omikuji-parabel-en/omikuji-train.txt
2024-01-14T12:01:59.911Z INFO [omikuji::data] Parsing data
2024-01-14T12:02:01.076Z INFO [omikuji::data] Loaded 215048 examples; it took 2.12s
2024-01-14T12:02:01.202Z INFO [omikuji::model::train] Training model with hyper-parameters HyperParam { n_trees: 3, min_branch_size: 100, max_depth: 20, centroid_threshold: 0.0, collapse_every_n_layers: 0, linear: HyperParam { loss_type: Hinge, eps: 0.1, c: 1.0, weight_threshold: 0.1, max_iter: 20 }, cluster: HyperParam { k: 2, balanced: true, eps: 0.0001, min_size: 2 }, tree_structure_only: false, train_trees_1_by_1: false }
2024-01-14T12:02:01.202Z INFO [omikuji::model::train] Initializing tree trainer
2024-01-14T12:02:01.216Z INFO [omikuji::model::train] Computing label centroids
2024-01-14T12

In [24]:
!annif eval lcsh-omikuji-parabel-en datasets/test_summaries_uri.tsv

2024-01-14T12:04:54.105Z INFO [omikuji::model] Loading model from data/projects/lcsh-omikuji-parabel-en/omikuji-model...
2024-01-14T12:04:54.105Z INFO [omikuji::model] Loading model settings from data/projects/lcsh-omikuji-parabel-en/omikuji-model/settings.json...
2024-01-14T12:04:54.111Z INFO [omikuji::model] Loaded model settings Settings { n_features: 336785, classifier_loss_type: Hinge }...
2024-01-14T12:04:54.114Z INFO [omikuji::model] Loading tree from data/projects/lcsh-omikuji-parabel-en/omikuji-model/tree0.cbor...
2024-01-14T12:04:57.135Z INFO [omikuji::model] Loading tree from data/projects/lcsh-omikuji-parabel-en/omikuji-model/tree1.cbor...
2024-01-14T12:05:00.300Z INFO [omikuji::model] Loading tree from data/projects/lcsh-omikuji-parabel-en/omikuji-model/tree2.cbor...
2024-01-14T12:05:03.305Z INFO [omikuji::model] Loaded model with 3 trees; it took 9.20s
Precision (doc avg):          	0.2186
Recall (doc a

As expected, this model performs better than the TF-IDF models, since it is well suited for extreme multi-label classification. It has an NDCG score of 0.5867, which is a big improvement.

##### XTransformer
Another model which can be used is XTransformer. This is a custom backend proposed in this [pull request](https://github.com/NatLibFi/Annif/pull/540). It adds the ability to take a transformer-based model from Hugging Face and use it for extreme multi-label classification through the [PECOS](https://github.com/amzn/pecos/blob/mainline/pecos/xmc/xtransformer/README.md) framework. We have modified this backend so that it can be used with the current Annif version.

In [25]:
# Reads the training summaries data.
data = pd.read_csv("datasets/train_summaries_uri.tsv", delimiter='\t', header=None)
# Takes the first 50000 entries of the data.
data_subset = data.head(50000)

# Exports the dataframe to a .tsv file.
data_subset.to_csv("datasets/xtf_train_summaries_uri.tsv", sep='\t', index=False, header=None)

This training might take longer than two hours...

In [26]:
!annif train lcsh-xtransformer-distilbert-en datasets/xtf_train_summaries_uri.tsv

Backend xtransformer: creating vectorizer
Backend xtransformer: creating training file
Backend xtransformer: Start training
Downloaded distilbert-base-multilingual-cased model from s3.
INFO:pecos.xmc.xtransformer.matcher:Downloaded distilbert-base-multilingual-cased model from s3.
***** Encoding data len=50000 truncation=128*****
INFO:pecos.xmc.xtransformer.matcher:***** Encoding data len=50000 truncation=128*****
***** Finished with time cost=7.673071384429932 *****
INFO:pecos.xmc.xtransformer.matcher:***** Finished with time cost=7.673071384429932 *****
trn tensors saved to /tmp/tmpj13lyfc5/X_trn.pt
INFO:pecos.xmc.xtransformer.matcher:trn tensors saved to /tmp/tmpj13lyfc5/X_trn.pt
Start fine-tuning transformer matcher...
INFO:pecos.xmc.xtransformer.matcher:Start fine-tuning transformer matcher...
***** Running training *****
INFO:pecos.xmc.xtransformer.matcher:***** Running training *****
  Num examples = 50000
INFO:pecos.xmc.xtransformer.matcher:  Num examples = 50000
  Num labels =

After training the model for 3 epochs, it can be evaluated using the regular test_summaries dataset. `2>/dev/null` is added to suppress CUDA user warnings: one warning is produced for every record in the test set, which would otherwise result in 1000+ warning messages.
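The same kind of redirection can be done from Python when a command is started with `subprocess` instead of the `!` shell syntax. The sketch below uses a small stand-in child process (hypothetical, not Annif itself) that writes one line to stdout and one warning to stderr, and shows that only stdout is kept.

```python
import subprocess
import sys

# Stand-in child process (hypothetical): prints one result line to stdout
# and one warning to stderr, mimicking a noisy evaluation run.
child_code = "import sys; print('metrics'); print('UserWarning: ...', file=sys.stderr)"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,     # keep the regular output
    stderr=subprocess.DEVNULL,  # discard warnings, like 2>/dev/null
    text=True,
    check=True,
)
print(result.stdout, end="")
```

Keep in mind that this discards everything written to stderr, so real errors are hidden as well.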

In [29]:
!annif eval lcsh-xtransformer-distilbert-en datasets/test_summaries_uri.tsv 2>/dev/null

Precision (doc avg):          	0.1707
Recall (doc avg):             	0.5205
F1 score (doc avg):           	0.2441
Precision (subj avg):         	0.0005
Recall (subj avg):            	0.0009
F1 score (subj avg):          	0.0005
Precision (weighted subj avg):	0.2746
Recall (weighted subj avg):   	0.4970
F1 score (weighted subj avg): 	0.3330
Precision (microavg):         	0.1707
Recall (microavg):            	0.4970
F1 score (microavg):          	0.2541
F1@5:                         	0.3251
NDCG:                         	0.4870
NDCG@5:                       	0.4618
NDCG@10:                      	0.4870
Precision@1:                  	0.5386
Precision@3:                  	0.3740
Precision@5:                  	0.2848
True positives:               	7490
False positives:              	36400
False negatives:              	7579
Documents evaluated:          	4389


This transformer model, which makes use of DistilBERT, gets an NDCG score of 0.4870. This is worse than Omikuji Parabel, which could be because it is trained on a lot less data (50,000 records instead of the full training set).
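To compare the models at a glance, the NDCG scores reported in this notebook can be collected and sorted. The values below are copied from the evaluation results above. Note that the titles model was evaluated on a different test set than the other three, so its score is not directly comparable.

```python
# NDCG values copied from the evaluation results in this notebook.
ndcg_scores = {
    "TF-IDF (titles)": 0.2148,
    "TF-IDF (summaries)": 0.3574,
    "Omikuji Parabel": 0.5867,
    "XTransformer (DistilBERT)": 0.4870,
}

# Sort from best to worst NDCG.
ranked = sorted(ndcg_scores.items(), key=lambda item: item[1], reverse=True)
for model, score in ranked:
    print(f"{model:<28}{score:.4f}")
```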

##### Example test
Below I test all models on a simple query, to give an example of their usage.
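Each suggestion line printed by `annif suggest` in the cells of this section consists of a subject URI in angle brackets, the subject label and a score. If the suggestions need to be processed further, they can be parsed with a few lines of Python. This is a sketch which assumes that the three fields are separated by tabs, as they appear in the output; the `Suggestion` type and the `parse_suggestions` helper are my own names and not part of Annif.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    uri: str
    label: str
    score: float


def parse_suggestions(output):
    """Parse tab-separated suggestion lines, skipping anything else (e.g. log lines)."""
    suggestions = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or not parts[0].startswith("<"):
            continue
        uri, label, score = parts
        suggestions.append(Suggestion(uri.strip("<>"), label, float(score)))
    return suggestions


# Two lines copied from the XTransformer output in this notebook.
sample = (
    "<http://id.loc.gov/authorities/subjects/sh85079324>\tMachine learning\t0.1592\n"
    "<http://id.loc.gov/authorities/subjects/sh89003285>\tComputer science\t0.1273\n"
)
for suggestion in parse_suggestions(sample):
    print(suggestion.label, suggestion.score)
```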

###### TF-IDF (Titles)
The TF-IDF model trained on titles gives promising subjects, but it also gives some odd subjects for this query, such as _Dreams and the arts_. An explanation might be that the first part of the query, "What can you tell", is associated with that subject. Since the model is trained on short texts (titles), it associates those otherwise neutral words with that subject.

In [34]:
!echo "What can you tell me about Machine learning algorithms, which use mathematical models" | annif suggest lcsh-tfidf-en

<http://id.loc.gov/authorities/subjects/sh2003001414>	Dreams and the arts	0.3677
<http://id.loc.gov/authorities/subjects/sh85079324>	Machine learning	0.3652
<http://id.loc.gov/authorities/subjects/sh2002007921>	Mathematical models	0.2907
<http://id.loc.gov/authorities/subjects/sh85082123>	Mathematical literature	0.2900
<http://id.loc.gov/authorities/subjects/sh91000149>	Computer algorithms	0.2834
<http://id.loc.gov/authorities/subjects/sh90004643>	Hispanic Americans in motion pictures	0.2772
<http://id.loc.gov/authorities/subjects/sh85075520>	Learning	0.2758
<http://id.loc.gov/authorities/subjects/sh2006000311>	African American fraternal organizations	0.2738
<http://id.loc.gov/authorities/subjects/sh85082139>	Mathematics	0.2699
<http://id.loc.gov/authorities/subjects/sh85014235>	Biomathematics	0.2634


###### TF-IDF (Summaries)
This TF-IDF model is trained on the summaries and, as can be seen, it produces different results. There are some odd subjects here as well, but all in all it performs better.

In [35]:
!echo "What can you tell me about Machine learning algorithms, which use mathematical models" | annif suggest lcsh-tfidf2-en

<http://id.loc.gov/authorities/subjects/sh85093206>	Number concept	0.2988
<http://id.loc.gov/authorities/subjects/sh85077449>	LISP (Computer program language)	0.2898
<http://id.loc.gov/authorities/subjects/sh96008834>	Python (Computer program language)	0.2865
<http://id.loc.gov/authorities/subjects/sh92006189>	Women periodical editors	0.2853
<http://id.loc.gov/authorities/subjects/sh98004629>	Libraries and distance education	0.2628
<http://id.loc.gov/authorities/subjects/sh99003437>	Open source software	0.2609
<http://id.loc.gov/authorities/subjects/sh87007505>	C++ (Computer program language)	0.2384
<http://id.loc.gov/authorities/subjects/sh2014000535>	Makerspaces	0.2343
<http://id.loc.gov/authorities/subjects/sh85042371>	Electronic spreadsheets	0.2252
<http://id.loc.gov/authorities/subjects/sh91000149>	Computer algorithms	0.2240


###### Omikuji Parabel
The next model is the Omikuji Parabel model, which uses an entirely different technique, as mentioned before. As can be seen from the suggested subjects, they are all relevant to the search query.

In [36]:
!echo "What can you tell me about Machine learning algorithms, which use mathematical models" | annif suggest lcsh-omikuji-parabel-en

2024-01-14T15:25:01.588Z INFO [omikuji::model] Loading model from data/projects/lcsh-omikuji-parabel-en/omikuji-model...
2024-01-14T15:25:01.588Z INFO [omikuji::model] Loading model settings from data/projects/lcsh-omikuji-parabel-en/omikuji-model/settings.json...
2024-01-14T15:25:01.594Z INFO [omikuji::model] Loaded model settings Settings { n_features: 336785, classifier_loss_type: Hinge }...
2024-01-14T15:25:01.598Z INFO [omikuji::model] Loading tree from data/projects/lcsh-omikuji-parabel-en/omikuji-model/tree0.cbor...
2024-01-14T15:25:04.023Z INFO [omikuji::model] Loading tree from data/projects/lcsh-omikuji-parabel-en/omikuji-model/tree1.cbor...
2024-01-14T15:25:06.424Z INFO [omikuji::model] Loading tree from data/projects/lcsh-omikuji-parabel-en/omikuji-model/tree2.cbor...
2024-01-14T15:25:08.898Z INFO [omikuji::model] Loaded model with 3 trees; it took 7.31s
<http://id.loc.gov/authorities/subjects/sh85008180>

###### XTransformer
Lastly we have the XTransformer. This model performs similarly to the Omikuji model. It is evident that these models have a better understanding of the query than simpler models like TF-IDF.

In [37]:
!echo "What can you tell me about Machine learning algorithms, which use mathematical models" | annif suggest lcsh-xtransformer-distilbert-en

<http://id.loc.gov/authorities/subjects/sh85079324>	Machine learning	0.1592
<http://id.loc.gov/authorities/subjects/sh89003285>	Computer science	0.1273
<http://id.loc.gov/authorities/subjects/sh85008180>	Artificial intelligence	0.1177
<http://id.loc.gov/authorities/subjects/sh85079341>	Machine theory	0.1171
<http://id.loc.gov/authorities/subjects/sh85010089>	Automatic control	0.1136
<http://id.loc.gov/authorities/subjects/sh85114628>	Robotics	0.1086
<http://id.loc.gov/authorities/subjects/sh87007398>	Software engineering	0.1072
<http://id.loc.gov/authorities/subjects/sh85043176>	Engineering	0.1012
<http://id.loc.gov/authorities/subjects/sh85107313>	Programming languages (Electronic computers)	0.1006
<http://id.loc.gov/authorities/subjects/sh00005852>	Quality control	0.0962
