# AAI Project Team Bibliotheek 
## Data processing

### Information
Our [Annif](https://github.com/NatLibFi/Annif) implementation uses the [Library of Congress Subject Headings](https://id.loc.gov/authorities/subjects.html) (LCSH). For an explanation of how to create an LCSH Annif project, please refer to [make_project.ipynb](../make_project.ipynb). One step of that notebook is training the project with data. This notebook therefore explains the steps for transforming the [MARC XML data from the University Library of Amsterdam](https://uba.uva.nl/ondersteuning/open-data/datasets-en-publicatiekanalen/datasets-en-publicatiekanalen.html#Boeken) and the [MARC XML data from Springer Nature](https://metadata.springernature.com/metadata/books) into training data that can be used by Annif.

In [2]:
# Import standard libraries
from make_dataset import *
import pandas as pd
import glob
import os

import warnings
warnings.filterwarnings("ignore")

#### Splitting MARC XML file
The MARC XML file is large, so it is first split into multiple smaller XML files that can be processed individually. This is done using `split_data.py`, which produces 233 smaller XML files. These are used in the next step.
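
The contents of `split_data.py` are not shown in this notebook. As a rough sketch of the idea only, assuming the input is a MARCXML `<collection>` of `<record>` elements in the standard `http://www.loc.gov/MARC21/slim` namespace, a large file could be streamed and written out in fixed-size chunks. The function names, the chunk size and the `part_0000.xml` naming are hypothetical and may differ from what `split_data.py` actually does.

```python
import os
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
# Serialize the MARC namespace as the default namespace (no "ns0:" prefixes).
ET.register_namespace("", MARC_NS)


def _write_chunk(records, out_dir, index):
    """Wrap a list of <record> elements in a <collection> and write it to disk."""
    collection = ET.Element(f"{{{MARC_NS}}}collection")
    collection.extend(records)
    path = os.path.join(out_dir, f"part_{index:04d}.xml")
    ET.ElementTree(collection).write(path, encoding="utf-8", xml_declaration=True)
    return path


def split_marcxml(source, out_dir, records_per_file=5000):
    """Stream `source` and write files of at most `records_per_file` records each."""
    os.makedirs(out_dir, exist_ok=True)
    paths, batch = [], []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != f"{{{MARC_NS}}}record":
            continue
        batch.append(elem)
        if len(batch) == records_per_file:
            paths.append(_write_chunk(batch, out_dir, len(paths)))
            for record in batch:
                record.clear()  # drop the parsed content to limit memory use
            batch = []
    if batch:  # write the final, possibly smaller, chunk
        paths.append(_write_chunk(batch, out_dir, len(paths)))
    return paths
```

Streaming with `iterparse` avoids loading the whole file into memory at once, which is the reason for splitting in the first place.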

#### Obtaining relevant data from files
The next step is to process the data and collect the fields that are relevant for our project. Since we use Annif, we need training data that is compatible with this tool and that is in English.
This should be a `.tsv` file in the following format:

`summary/title   <subjectURI>`
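
To make the format concrete, here is a minimal sketch of how one line could be built. It assumes that the text and the subjects are separated by a tab, that each URI is wrapped in angle brackets, and that multiple URIs are separated by spaces; the Annif documentation on corpus formats remains the authority on the exact layout. The helper `to_annif_line` is hypothetical and not part of `make_dataset.py`.

```python
def to_annif_line(text, uris):
    """Build one TSV line: the text, a tab, then the URIs in angle brackets."""
    # Collapse tabs and newlines in the text so each document stays on one line.
    clean_text = " ".join(text.split())
    # Skip empty values so missing subjects do not produce "<>".
    subjects = " ".join(f"<{uri}>" for uri in uris if uri)
    return f"{clean_text}\t{subjects}"


line = to_annif_line(
    "Stochastic Processes",
    ["http://id.loc.gov/authorities/subjects/sh85107090", ""],
)
print(line)
```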

So we need to identify which tags contain the language, summaries, titles and LCSH subjects. These are documented in the [MARC 21 Format for Bibliographic Data](https://www.loc.gov/marc/bibliographic/), which lists the following tags that are relevant for this implementation.

| __Tag__       | 245                          | 520                |650       |
|---------------|--------------------|----------|-|
|__Description__| [Title](https://www.loc.gov/marc/bibliographic/bd245.html) | [Summary](https://www.loc.gov/marc/bibliographic/bd520.html) | [Subjects](https://www.loc.gov/marc/bibliographic/bd650.html) |

For each tag we want subfield `a`. For the subjects we only want the fields with `ind2=0`, since this value of the second indicator marks the heading as an LCSH subject. Although the language tag is available, I will not use it to select records, since the summaries or titles might still be in a language other than English.
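
The selection described above can be sketched with the standard library as follows. This is an illustration under the assumption that records use the `http://www.loc.gov/MARC21/slim` namespace; the helpers `subfield_a` and `extract_record` and the sample record are made up for this example and are not the functions from `make_dataset.py`.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfield_a(record, tag, ind2=None):
    """Return all subfield `a` values of the datafields with the given tag."""
    values = []
    for field in record.findall(f"marc:datafield[@tag='{tag}']", NS):
        # Optionally keep only fields with a specific second indicator.
        if ind2 is not None and field.get("ind2") != ind2:
            continue
        for sub in field.findall("marc:subfield[@code='a']", NS):
            if sub.text:
                values.append(sub.text.strip())
    return values


def extract_record(record):
    """Collect title (245), summary (520) and LCSH subjects (650, ind2=0)."""
    return {
        "title": subfield_a(record, "245"),
        "summary": subfield_a(record, "520"),
        "subjects": subfield_a(record, "650", ind2="0"),
    }


# Made-up sample record: one LCSH subject and one subject from another scheme.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="4"><subfield code="a">The story of the Mennonites /</subfield></datafield>
  <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Mennonites</subfield></datafield>
  <datafield tag="650" ind1=" " ind2="7"><subfield code="a">Non-LCSH example term</subfield></datafield>
</record>"""

fields = extract_record(ET.fromstring(SAMPLE))
print(fields)
```

The subject with `ind2="7"` is skipped, so only the LCSH heading ends up in `subjects`.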

Getting this information into a file is performed with functions defined in [make_dataset.py](./make_dataset.py).
In the example below, two datasets will be made: one with titles and subjects, and another with summaries and subjects.

##### Data from University Library Amsterdam

First the data from the University Library of Amsterdam will be processed. 

In [2]:
# Path to the folder containing XML files
ub_xml_folder = "datasets/xml_ub"
# Use glob to get a list of all XML files in the folder
ub_xml = glob.glob(os.path.join(ub_xml_folder, "*.xml"))

ub_titles_raw = "datasets/ub_titles_raw.csv"
ub_summaries_raw = "datasets/ub_summaries_raw.csv"

In [3]:
# Call the functions with the list of XML files and output CSV file
ubxml_to_titles(ub_xml, ub_titles_raw)
ubxml_to_summaries(ub_xml, ub_summaries_raw)

Processing XML files:   0%|          | 0/233 [00:00<?, ?file/s]

Processing XML files: 100%|██████████| 233/233 [12:05<00:00,  3.11s/file]
Processing XML files: 100%|██████████| 233/233 [11:03<00:00,  2.85s/file]


##### Data from Springer Nature

Additional data from Springer Nature will be used to cover more subjects.

In [4]:
sn_xml = "datasets/xml_sn/SpringerNature_Books_MARC21_20231120_021727.xml"

sn_titles_raw = "datasets/sn_titles_raw.csv"
sn_summaries_raw = "datasets/sn_summaries_raw.csv"

In [5]:
# Call the functions with the single Springer Nature XML file and the output CSV files
snxml_to_titles(sn_xml, sn_titles_raw)
snxml_to_summaries(sn_xml, sn_summaries_raw)

##### Combining data

Here the individual title and summary datasets from each source will be added together.

In [6]:
ub_titles_df = pd.read_csv(ub_titles_raw)
sn_titles_df = pd.read_csv(sn_titles_raw)

print("UB Titles shape: ", ub_titles_df.shape)
print("SN Titles shape: ", sn_titles_df.shape)

combined_titles_df = pd.concat([ub_titles_df,sn_titles_df])
print("Combined Titles shape: ", combined_titles_df.shape)
combined_titles_df.to_csv("datasets/combined_titles_raw.csv", index=False)

UB Titles shape:  (510625, 14)
SN Titles shape:  (203461, 14)
Combined Titles shape:  (714086, 14)


In [7]:
ub_summaries_df = pd.read_csv(ub_summaries_raw)
sn_summaries_df = pd.read_csv(sn_summaries_raw)

print("UB Summaries shape: ", ub_summaries_df.shape)
print("SN Summaries shape: ", sn_summaries_df.shape)

combined_summaries_df = pd.concat([ub_summaries_df,sn_summaries_df])
print(combined_summaries_df.shape)
combined_summaries_df.to_csv("datasets/combined_summaries_raw.csv", index=False)

UB Summaries shape:  (55743, 14)
SN Summaries shape:  (191056, 14)
(246799, 14)


Now we have parsed the MARC XML files into two datasets, one with summaries and the other with titles. Most importantly, both contain Library of Congress Subject Headings.

#### Data cleaning
Now that we have the datasets, they have to be checked for inconsistencies, such as duplicate values, stray symbols and inconsistent languages.
This data cleaning process will be done separately for each dataset.

For the cleaning, a function called _clean_dataset()_ has been written, which can be found in the file _make_dataset.py_. This function performs all of the cleaning steps mentioned above and gives us some examples from the data.


##### Titles dataset
We start with the titles dataset.


In [8]:
# Define raw (original) and cleaned dataset paths.
titles_raw = "datasets/combined_titles_raw.csv"
titles_clean = "datasets/combined_titles_clean.csv"

In [9]:
# Perform dataset cleaning
original_titles, cleaner_titles, clean_titles = clean_dataset(titles_raw, titles_clean)

Original size:  (714086, 14)
Removing duplicates...
New size:  (661255, 14)
Removing unnecessary symbols...
Detecting English language... (may take a while)
Removing non-English text...
Final size:  (407171, 14)
Converting to csv...


First we can see that the dataset contains 714086 rows and 14 columns, so 714086 books, with one column for the content and 13 columns for potential subjects.
The first five entries of this data can be viewed below:

In [16]:
original_titles

Unnamed: 0,Content,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,Subject7,Subject8,Subject9,Subject10,Subject11,Subject12,Subject13
0,De Reformatie in Culemborg /,Reformation,,,,,,,,,,,,
1,The story of the Mennonites /,Mennonites,,,,,,,,,,,,
2,Ein Leib in Christo werden :,Anabaptists.,Marriage,Sex,,,,,,,,,,
3,L'image du mal en Egypte :,Gnosticism.,Demonology.,Good and evil.,"Cosmogony, Ancient.",,,,,,,,,
4,Sur les traces de la bibliothèque médiévale...,"Manuscripts, Hebrew",Manuscript fragments,Jews,Jews,Genizah.,"Manuscripts, Aramaic","Manuscripts, Medieval","Paleography, Hebrew.",Manuscript fragments,Manuscript fragments,Manuscript fragments,,


The first noticeable issue with the data is the `/` and `:` characters at the end of each title; these can be replaced by a blank space.
The dataset also contains a number of duplicates. Here we only look for duplicates in the __Content__ column, since multiple titles can be assigned the same subject. This leaves us with a new size of 661255 records. The result can be viewed below.
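
These two steps can be sketched without pandas as shown here. This is not the code of _clean_dataset()_; it is an illustration that assumes rows are plain dictionaries, follows the order of the log output (duplicates first, then symbols), and strips the trailing mark instead of replacing it with a space. It also only handles a mark at the very end of the content, not one in the middle of a title and summary pair.

```python
import re

# A trailing "/" or ":" left over from MARC punctuation, plus surrounding whitespace.
TRAILING_MARK = re.compile(r"\s*[/:]\s*$")


def drop_duplicate_content(rows):
    """Keep the first row for each distinct Content value."""
    seen = set()
    unique = []
    for row in rows:
        if row["Content"] not in seen:
            seen.add(row["Content"])
            unique.append(row)
    return unique


def strip_trailing_mark(rows):
    """Return copies of the rows with a trailing '/' or ':' removed from Content."""
    return [{**row, "Content": TRAILING_MARK.sub("", row["Content"])} for row in rows]


rows = [
    {"Content": "De Reformatie in Culemborg /", "Subject1": "Reformation"},
    {"Content": "Ein Leib in Christo werden :", "Subject1": "Anabaptists."},
    {"Content": "De Reformatie in Culemborg /", "Subject1": "Reformation"},
]
cleaned = strip_trailing_mark(drop_duplicate_content(rows))
print([row["Content"] for row in cleaned])
```

Duplicates are judged on `Content` alone, so two different titles that share a subject are both kept.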

In [14]:
cleaner_titles

Unnamed: 0,Content,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,Subject7,Subject8,Subject9,Subject10,Subject11,Subject12,Subject13
0,De Reformatie in Culemborg,Reformation,,,,,,,,,,,,
1,The story of the Mennonites,Mennonites,,,,,,,,,,,,
2,Ein Leib in Christo werden,Anabaptists.,Marriage,Sex,,,,,,,,,,
3,L'image du mal en Egypte,Gnosticism.,Demonology.,Good and evil.,"Cosmogony, Ancient.",,,,,,,,,
4,Sur les traces de la bibliothèque médiévale...,"Manuscripts, Hebrew",Manuscript fragments,Jews,Jews,Genizah.,"Manuscripts, Aramaic","Manuscripts, Medieval","Paleography, Hebrew.",Manuscript fragments,Manuscript fragments,Manuscript fragments,,


In this dataset we can also see that a lot of titles are not in English. Since we would like to train our model on English data only, these have to be removed. Therefore the __langid__ library is used to detect whether the text is in English, and only those records are kept. This leaves us with a final total of 407171 records. The cleaned data can be viewed below.
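
The language filter can be sketched as follows. The helper `keep_english` is hypothetical and takes the classifier as a parameter; it assumes a callable that, like `langid.classify`, returns a `(language_code, score)` tuple. How _clean_dataset()_ applies langid internally is not shown in this notebook.

```python
def keep_english(rows, classify):
    """Keep the rows whose Content is classified as English ("en").

    `classify` is expected to return a (language_code, score) tuple,
    which is the shape returned by `langid.classify`.
    """
    return [row for row in rows if classify(row["Content"])[0] == "en"]


# Usage with the third-party langid package installed (pip install langid):
#   import langid
#   english_rows = keep_english(rows, langid.classify)
```

Short titles give a language detector little text to work with, so some records may be misclassified in either direction.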

In [17]:
clean_titles

Unnamed: 0,Content,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,Subject7,Subject8,Subject9,Subject10,Subject11,Subject12,Subject13
1,The story of the Mennonites,Mennonites,,,,,,,,,,,,
7,The medieval cultures of the Irish Sea and the...,"Civilization, Medieval.",,,,,,,,,,,,
15,Gastronomic pleasures,Gastronomy.,Cooking in literature.,,,,,,,,,,,
17,The foods of love,Cooking.,Aphrodisiacs.,,,,,,,,,,,
18,A dictionary of aphrodisiacs,Aphrodisiacs,,,,,,,,,,,,


Now the cleaned dataset can be passed to the reconciliation service.

##### Summaries dataset

I will repeat the same process for the summaries. In this dataset the summary is appended to the title. I will not comment on the individual steps, since they are identical to the cleaning of the titles above.

In [10]:
summaries_raw = "datasets/combined_summaries_raw.csv"
summaries_clean = "datasets/combined_summaries_clean.csv"

In [11]:
original_summaries, cleaner_summaries, clean_summaries = clean_dataset(summaries_raw, summaries_clean)

Original size:  (246799, 14)
Removing duplicates...
New size:  (244444, 14)
Removing unnecessary symbols...
Detecting English language... (may take a while)
Removing non-English text...
Final size:  (223885, 14)
Converting to csv...


In [18]:
original_summaries

Unnamed: 0,Content,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,Subject7,Subject8,Subject9,Subject10,Subject11,Subject12,Subject13
0,"Ein Leib in Christo werden : | ""Die vorliegend...",Anabaptists.,Marriage,Sex,,,,,,,,,,
1,Sur les traces de la bibliothèque médiévale...,"Manuscripts, Hebrew",Manuscript fragments,Jews,Jews,Genizah.,"Manuscripts, Aramaic","Manuscripts, Medieval","Paleography, Hebrew.",Manuscript fragments,Manuscript fragments,Manuscript fragments,,
2,The medieval cultures of the Irish Sea and the...,"Civilization, Medieval.",,,,,,,,,,,,
3,"Le monde en sphères : | ""Des consoles de bure...",Globes,Celestial globes,Cartographic materials,,,,,,,,,,
4,Eating my words : | A restaurant critic discus...,Women food writers,Food writing.,,,,,,,,,,,


In [19]:
cleaner_summaries

Unnamed: 0,Content,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,Subject7,Subject8,Subject9,Subject10,Subject11,Subject12,Subject13
0,Ein Leib in Christo werden | Die vorliegende ...,Anabaptists.,Marriage,Sex,,,,,,,,,,
1,Sur les traces de la bibliothèque médiévale...,"Manuscripts, Hebrew",Manuscript fragments,Jews,Jews,Genizah.,"Manuscripts, Aramaic","Manuscripts, Medieval","Paleography, Hebrew.",Manuscript fragments,Manuscript fragments,Manuscript fragments,,
2,The medieval cultures of the Irish Sea and the...,"Civilization, Medieval.",,,,,,,,,,,,
3,Le monde en sphères | Des consoles de bureau...,Globes,Celestial globes,Cartographic materials,,,,,,,,,,
4,Eating my words | A restaurant critic discuss...,Women food writers,Food writing.,,,,,,,,,,,


In [20]:
clean_summaries

Unnamed: 0,Content,Subject1,Subject2,Subject3,Subject4,Subject5,Subject6,Subject7,Subject8,Subject9,Subject10,Subject11,Subject12,Subject13
2,The medieval cultures of the Irish Sea and the...,"Civilization, Medieval.",,,,,,,,,,,,
4,Eating my words | A restaurant critic discuss...,Women food writers,Food writing.,,,,,,,,,,,
5,From cooking vessels to cultural practices in ...,Bronze age,"Pottery, Ancient",Cookware,Material culture,Excavations (Archaeology),Social archaeology,,,,,,,
6,Heat | Writer Buford's memoir of his headlong...,"Cooking, Italian",Food,,,,,,,,,,,
7,The joy of eating | A rich and satisfying col...,Gastronomy,Food,Food habits,,,,,,,,,,


#### Reconcile dataset with LCSH

The generated datasets contain the subjects as plain text. Although these may represent LCSH subjects, they still need to be matched against the external source. This verifies and cleans the dataset against an authority, LCSH in our case. This process is called [reconciling](https://openrefine.org/docs/manual/reconciling).

To implement this reconciling, I used the tool [OpenRefine](https://openrefine.org/). I will highlight the important steps.

##### 1. Load dataset file, and create a project

![image-1.png](img/img1.PNG)

##### 2. Select a LCSH reconciliation service

There is currently no hosted version of an LCSH reconciliation service, so it has to be hosted locally. To do this, clone this [repository](https://github.com/cmharlow/lc-reconcile) and follow the instructions under _Run Locally Instructions_.


![image2.png](img/img2.PNG)

Select _Library of Congress Subject Headings_ when selecting which type to reconcile, then start reconciling. This process can take a long time...

![image3.png](img/img3.PNG)

##### 3. Select all matched records

When the reconciliation process is done, you will be left with three types of records: __matched__, __none__ and __uncertain__.

- __Matched__ records are the majority; these are the ones that are useful.
- __None__ records do not have a match and can be deleted.
- __Uncertain__ records are contained within __None__. For these records no candidate subject has a high score, but there are candidates with a lower score, so the user has to select one. Because we want to have as many records as possible, we will match these subjects to the best candidate.

![image4.png](img/img4.PNG)

Next we will remove the unmatched subjects. To do this, create a facet _by judgement_.

![image5.png](img/img5.PNG)

Now include _none_ and _unreconciled_; this will give all the records with no match.

![image6.png](img/img6.PNG)

And remove all the matching rows.

![image7.png](img/img7.PNG)

##### 4. Add the subject URIs

For these instructions, please refer back to the LC reconcile service [repository](https://github.com/cmharlow/lc-reconcile).
A new column needs to be added based on the _Subjects_ column.

![image8.png](img/img8.PNG)

Input the following GREL expression: `cell.recon.match.id`. I named this new column _Subjects_URI_.

![image9.png](img/img9.PNG)

This adds a new column with the corresponding subject URIs, which will be used by Annif.

The _Subjects_ column can now be removed.

![image10.png](img/img10.PNG)

This leaves us with a dataset containing the title or summary and the subject URIs.

##### 5. Export dataset

Now we are done with OpenRefine, so the dataset can be exported to a `.tsv` file.

![image11.png](img/img11.PNG)

The dataset is almost ready for use in Annif. The only thing left to do is to remove the headers from the files. In addition, some subjects may not have been reconciled properly, which leaves NaN values in between the subjects. For a tidier representation of the data, all subjects are shifted to the left, which is done with the following code:

In [3]:
input_tsv = "datasets/combined_titles_clean_uri.tsv"
output_tsv = "datasets/titles_final.tsv"

# Shift the subjects to the left
nonshifted_data, shifted_data = shift_subjects(input_tsv, output_tsv)

Here you can see the difference before and after the shifting.

In [5]:
nonshifted_data

Unnamed: 0,Content,Subject1_URI,Subject2_URI,Subject3_URI,Subject4_URI,Subject5_URI,Subject6_URI,Subject7_URI,Subject8_URI,Subject9_URI,Subject10_URI,Subject11_URI,Subject12_URI,Subject13_URI
0,Indo-Caribbean Feminist Thought | Bringing tog...,,http://id.loc.gov/authorities/subjects/sh85077507,http://id.loc.gov/authorities/subjects/sh85120549,,,,,,,,,,
1,Multidisciplinary Approaches to Allergies | Al...,http://id.loc.gov/authorities/subjects/sh85076841,http://id.loc.gov/authorities/subjects/sh85003662,http://id.loc.gov/authorities/subjects/sh85014263,http://id.loc.gov/authorities/subjects/sh20180...,http://id.loc.gov/authorities/subjects/sh85044173,http://id.loc.gov/authorities/subjects/sh85100603,,,,,,,
2,Synchronizing E-Security | Synchronizing E-Sec...,http://id.loc.gov/authorities/subjects/sh85034453,http://id.loc.gov/authorities/subjects/sh94001524,http://id.loc.gov/authorities/subjects/sh89003285,http://id.loc.gov/authorities/subjects/sh85029552,http://id.loc.gov/authorities/subjects/sh85107267,http://id.loc.gov/authorities/subjects/sh96008434,,,,,,,
3,The History of Physics in Cuba | This book bri...,,,http://id.loc.gov/authorities/subjects/sh85112362,http://id.loc.gov/authorities/subjects/sh85125938,http://id.loc.gov/authorities/subjects/sh85045198,http://id.loc.gov/authorities/subjects/sh85034755,,,,,,,
4,Stochastic Processes | This book is the result...,http://id.loc.gov/authorities/subjects/sh85107090,http://id.loc.gov/authorities/subjects/sh20020...,,,,,,,,,,,


In [7]:
shifted_data

Unnamed: 0,Content,Subject1_URI,Subject2_URI,Subject3_URI,Subject4_URI,Subject5_URI,Subject6_URI,Subject7_URI,Subject8_URI,Subject9_URI,Subject10_URI,Subject11_URI,Subject12_URI,Subject13_URI
0,Indo-Caribbean Feminist Thought | Bringing tog...,http://id.loc.gov/authorities/subjects/sh85077507,http://id.loc.gov/authorities/subjects/sh85120549,,,,,,,,,,,
1,Multidisciplinary Approaches to Allergies | Al...,http://id.loc.gov/authorities/subjects/sh85076841,http://id.loc.gov/authorities/subjects/sh85003662,http://id.loc.gov/authorities/subjects/sh85014263,http://id.loc.gov/authorities/subjects/sh20180...,http://id.loc.gov/authorities/subjects/sh85044173,http://id.loc.gov/authorities/subjects/sh85100603,,,,,,,
2,Synchronizing E-Security | Synchronizing E-Sec...,http://id.loc.gov/authorities/subjects/sh85034453,http://id.loc.gov/authorities/subjects/sh94001524,http://id.loc.gov/authorities/subjects/sh89003285,http://id.loc.gov/authorities/subjects/sh85029552,http://id.loc.gov/authorities/subjects/sh85107267,http://id.loc.gov/authorities/subjects/sh96008434,,,,,,,
3,The History of Physics in Cuba | This book bri...,http://id.loc.gov/authorities/subjects/sh85112362,http://id.loc.gov/authorities/subjects/sh85125938,http://id.loc.gov/authorities/subjects/sh85045198,http://id.loc.gov/authorities/subjects/sh85034755,,,,,,,,,
4,Stochastic Processes | This book is the result...,http://id.loc.gov/authorities/subjects/sh85107090,http://id.loc.gov/authorities/subjects/sh20020...,,,,,,,,,,,
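
The shift shown above can be sketched with the standard library as follows. This is an illustration and not the implementation of `shift_subjects` in _make_dataset.py_. It assumes that the first column holds the content, that missing subjects are empty cells in the TSV, and that the header row should be dropped; the names `shift_left` and `shift_file` are hypothetical.

```python
import csv


def shift_left(values):
    """Move non-empty values to the front, keeping their order and the row width."""
    filled = [value for value in values if value not in ("", None)]
    return filled + [""] * (len(values) - len(filled))


def shift_file(input_tsv, output_tsv):
    """Copy a TSV without its header row, shifting the subject columns to the left."""
    with open(input_tsv, newline="", encoding="utf-8") as src, \
         open(output_tsv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        next(reader, None)  # skip the header row
        for row in reader:
            if not row:  # ignore completely empty lines
                continue
            content, *subjects = row
            writer.writerow([content, *shift_left(subjects)])


# One row with gaps before the first subject, as in the unshifted data above.
example = ["", "", "http://id.loc.gov/authorities/subjects/sh85112362", ""]
print(shift_left(example))
```

Keeping the row width constant means every output line has the same number of columns as the input.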


Now the data is ready for use.