# Prediction Pipelines with `article_relevance`

The Article Relevance Tool was designed for use with the [Neotoma Paleoecology Database](https://www.neotomadb.org) to provide a workflow in which users supply an updatable list of DOIs for publications. This list of DOIs is used to retrieve metadata from [CrossRef](https://crossref.org), which is then converted into text embeddings from which we can develop predictive models. The tooling allows us to fit multiple predictive models, and to search across hyper-parameters to tune each of them.

The workflow provides predictive outputs as to whether an article might be suited for inclusion into a research database, a probability estimate for the prediction, as well as time-stamped predictions and the ability of a user to override the prediction. In this way we can test model evolution, and provide the opportunity for curated stewardship of model predictions.

![./assets/overview_image.svg](./assets/overview_image.svg)

Our goal is to provide a fully developed research infrastructure that connects labelled publication data from a particular research database to classification models trained to predict article relevance, that is, the suitability of an "unknown" journal article for inclusion within the database.

## Using the NeotomaART (Article Relevance Tool) Docker Container

The [Docker Container](https://github.com/NeotomaDB/article_project) connects three separate elements:

* A [Postgres database](https://github.com/NeotomaDB/article_project/tree/main/article_database) with a [pre-defined data schema](https://github.com/NeotomaDB/article_project/blob/main/article_database/create_database.sql).
* A [node/Express API](https://github.com/NeotomaDB/article_project/tree/main/article_api) to interface with the database.
* A Python package/framework to interact with the API.

This structure allows a research team to set up their own local or cloud-based instance of the ART to support data discovery and ingest for particular research groups. The goal of this project is to allow research teams to submit publications relevant to their research project, along with other publications, all identified by DOI. Models are then built using a range of parameters and saved, and can then be used to predict whether or not "unseen" articles are relevant for the research group.

## Running the script

Assuming that the Docker container is running, following the instructions within the [`README`](https://github.com/NeotomaDB/article_project/blob/main/README.md), we can load in the packages and begin to run the code. Note that a `.env` file is used here to help manage any environment variables we might need. At present there is only a single variable in `.env`: `API_HOME="localhost:8080"`. This value is pulled from the [`docker-compose` file in `article_project`](https://github.com/NeotomaDB/article_project/blob/main/docker-compose.yml#L20).
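As a minimal sketch of how that variable might be consumed, the snippet below reads `API_HOME` from the environment and turns it into a base URL. The helper name `api_base_url` and the `http://` scheme are assumptions made for illustration; they are not part of `article_relevance`.

```python
import os


def api_base_url(default="localhost:8080"):
    """Return a base URL built from the API_HOME environment variable.

    Hypothetical helper: the http:// scheme is an assumption for a local setup.
    """
    host = os.environ.get("API_HOME", default).strip().strip('"')
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host.rstrip("/")


# load_dotenv() would normally populate the environment from .env;
# here the value is set directly so the sketch runs on its own.
os.environ["API_HOME"] = "localhost:8080"
print(api_base_url())
```

Keeping the host in `.env` means the same notebook can point at a local container or a cloud instance without code changes.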

In [2]:
import os
from dotenv import load_dotenv
import article_relevance as ar
import csv
import re
import pandas as pd
import json
from collections import Counter

load_dotenv()

True

The `article_relevance` package has several dependencies. All of them are listed in a `requirements.txt` file.

In addition, at minimum, we require a file with properly labelled DOI data. This data should contain the following columns:

* doi: A properly formatted DOI, with only the shoulder and endpoint. e.g., `10.5467/22343.whatever`
* label: A categorical label that can be used to identify whether or not the article is suitable for inclusion into the database.
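Before submitting a labelled file, it may help to confirm that both columns are present. The check below is a small sketch using only the standard library; `missing_columns` is a hypothetical helper, and the sample row is illustrative.

```python
import csv
import io

REQUIRED_COLUMNS = {"doi", "label"}


def missing_columns(handle):
    """Return the required columns that are absent from a CSV file's header."""
    reader = csv.DictReader(handle)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# A small in-memory stand-in for a labelled file (illustrative values only).
sample = io.StringIO("doi,label\n10.5467/22343.whatever,Neotoma\n")
print(missing_columns(sample))  # an empty set means both columns are present
```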

It is also possible to load in unlabelled data as a list of DOI values. Throughout we use DOIs and ORCIDs as unique identifiers to link labels, articles, and predictions.

This project contains three data resources in the `data` folder:

* `raw/neotoma_crossref.csv`: data directly exported from Neotoma (publication is already included in Neotoma).
* `raw/labelled_data.csv`: manually labelled data for model training.
* `raw/project_2_labelled_data.csv`: data labelled using the [`SMART` application](https://github.com/RTIInternational/SMART).

These objects represent the file elements we'll be working with for the predictive models.

## Loading Raw Data

For our purposes we want to identify both the data source and the labelling. For any object we want to know the data source and the certainty with which it was labelled. Data from our own source should be the most trustworthy; labelled data should carry some indication of the labeller; unlabelled data should be flagged as such.

The first set of data we load is raw data from the database itself (`db_data`), and a set of labelled data (`labelled_data.csv`). The labelled data was all labelled by the same person in this case ("Simon Goring"), so when we insert the data to the `LABELLING_STORE` document we'll make sure to add that.

DOIs get entered in many ways by users. Incorrect data entry may result in DOIs that do not properly resolve, or are simply incorrect. Additionally, data entry errors may result in leading or trailing whitespace. To help us ensure that data is entered correctly we will use the function `ar.clean_dois()`. This function returns a `dict` with the `clean` and `removed` DOIs submitted by the user. This helps support data cleaning down the road.
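To make the idea concrete, here is a rough sketch of the kind of cleaning described above. It is not the implementation of `ar.clean_dois()`: the function name `clean_doi_list`, the regular expression, and the handling of resolver prefixes are all assumptions, and the sketch only checks the shape of a DOI, not whether it resolves.

```python
import re

# A permissive pattern: "10.", a registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def clean_doi_list(dois):
    """Split DOIs into those that look well formed and those that do not."""
    clean, removed = [], []
    for raw in dois:
        candidate = (raw or "").strip()
        # Drop a resolver prefix such as https://doi.org/ if one was pasted in.
        candidate = re.sub(r"^https?://(dx\.)?doi\.org/", "", candidate, flags=re.IGNORECASE)
        if DOI_PATTERN.match(candidate):
            clean.append(candidate)
        else:
            removed.append(raw)
    return {"clean": clean, "removed": removed}


result = clean_doi_list([
    " 10.1007/s00334-011-0339-6 ",
    "https://doi.org/10.1063/1.4742131",
    "not a doi",
    "",
])
print(result)
```

Returning both lists, rather than silently dropping bad entries, is what makes it possible to review and correct the input later.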

In [3]:
with open('data/raw/neotoma_dois.csv') as file:
    db_data = list(csv.DictReader(file))

with open('data/raw/labelled_data.csv') as file:
    label_data = list(csv.DictReader(file))

all_doi = set([i.get('doi') for i in db_data] + [i.get('doi') for i in label_data])
doi_set = ar.clean_dois(all_doi)
print(f"There are {len(doi_set.get('clean'))} good DOIs and {len(doi_set.get('removed'))} removed DOIs in the full set of DOIs submitted.")

There are 2652 good DOIs and 48 removed DOIs in the full set of DOIs submitted.


`doi_set` is a `dict` object. We can write out and review the DOIs that failed the cleaning to see why they failed, or we can continue with our analysis. Here we will simply continue, using the clean results.
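If we did want to review the failures, one option is to write them to a file. The sketch below assumes the removed entries are plain strings; `write_removed`, the sample values, and the output path are hypothetical.

```python
import csv
import os
import tempfile


def write_removed(removed, path):
    """Write removed DOI strings to a one-column CSV and return the row count."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["doi"])
        for doi in removed:
            writer.writerow([doi])
    return len(removed)


# Stand-in for doi_set.get('removed'); the values are illustrative.
removed_example = ["not a doi", "10.1234"]
out_path = os.path.join(tempfile.gettempdir(), "removed_dois.csv")
print(write_removed(removed_example, out_path))
```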

## Registering Articles

The function `ar.register_dois()` accepts a list of DOI values and submits them to the database. At the same time, the API itself queries CrossRef to pull in additional metadata. This metadata includes the article title and abstract (when provided by the publisher), along with additional information that may be of use in classification. The function prints its progress by default, but here we set `verbose = False` to avoid generating a large volume of output. Depending on the number of DOIs submitted, this process may take some time, in part because it reaches out to the CrossRef API.

In [4]:
registered = ar.register_dois(doi_set.get('clean'), verbose = False)

Connection failed for DOI {'doi': '10.3760/cma.j.cn112150-20220508-00458'}:
HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)
Connection failed for DOI {'doi': '10.3760/cma.j.cn112150-20220428-00425'}:
HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)
Connection failed for DOI {'doi': '10.3760/cma.j.cn115330-20200929-00778'}:
HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)
Connection failed for DOI {'doi': '10.3760/cma.j.cn115330-20210701-00416'}:
HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)


You will find that this step can be very time consuming since we are pulling in external data that will be used to build the embeddings and model inputs. As with `ar.clean_dois()`, this function returns a `dict` object with the elements `submitted`, `rejected`, `inserted` and `present`. This allows us to see rejections that result from secondary issues (invalid CrossRef endpoints for example) beyond valid DOI paths. We can also import DOIs from multiple sources, without overwriting data that already exists within the database.
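A quick way to get an overview of that returned object is to count the entries under each key. The sketch below assumes each key holds a list of `{'doi': ...}` dictionaries, as the printed output in this notebook suggests; `summarise_registration` and the example values are illustrative.

```python
def summarise_registration(registered):
    """Count the entries under each key of a registration result."""
    keys = ("submitted", "rejected", "inserted", "present")
    return {key: len(registered.get(key) or []) for key in keys}


# Illustrative stand-in for the object returned by ar.register_dois().
example = {
    "submitted": [
        {"doi": "10.1007/s00334-011-0339-6"},
        {"doi": "10.3760/cma.j.cn112150-20220508-00458"},
    ],
    "rejected": [{"doi": "10.3760/cma.j.cn112150-20220508-00458"}],
    "inserted": [{"doi": "10.1007/s00334-011-0339-6"}],
    "present": [],
}
print(summarise_registration(example))
```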

In [5]:
print(f"Submitted: {registered.get('submitted')[0] or ''}\nRejected: {registered.get('rejected')[0]}")

Submitted: {'doi': '10.1007/s00334-011-0339-6'}
Rejected: {'doi': '10.3760/cma.j.cn112150-20220508-00458'}


From here we can check the `rejected` articles to see if we can understand why they may have been rejected. The CrossRef API allows us to see metadata about a particular article using the DOI, for example, we can query our rejected DOI: [https://api.crossref.org/works/10.3760%2Fcma.j.cn112150-20220508-00458](https://api.crossref.org/works/10.3760%2Fcma.j.cn112150-20220508-00458).

With this particular article we see that the DOI simply does not resolve. We can use this to clean our input data if we would like to. We can use the `ar.register_dois()` function to add new or corrected DOIs if we choose to update our original input files, or we can load in new external files and update the database.
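The same lookup can be scripted. The sketch below builds the CrossRef URL for a DOI (percent-encoding the slash, as in the link above) and defines a lookup that returns `None` when CrossRef responds with a 404. The helper names are hypothetical and the network call is defined but not executed here.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def crossref_url(doi):
    """Build the CrossRef works URL for a DOI, percent-encoding the slash."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="")


def crossref_lookup(doi, timeout=10):
    """Return the CrossRef 'message' record for a DOI, or None if not found."""
    try:
        with urllib.request.urlopen(crossref_url(doi), timeout=timeout) as response:
            return json.load(response).get("message")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


print(crossref_url("10.3760/cma.j.cn112150-20220508-00458"))
```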

In all cases, the data is added directly to the database within our Docker container, meaning that, as long as we retain the Docker image, the data persists. Assuming we are prepared to add some new DOIs to the database, we can simply call the following (this time using `verbose = True`, the default):

In [6]:
new_dois = ['10.1590/s0102-69922012000200010', '10.1090/S0002-9939-2012-11404-2', '10.1063/1.4742131', '10.1007/s13355-012-0130-x']

check = ar.register_dois(new_dois)

4 unique DOIs submitted.
4 DOIs valid.
doi was present: 10.1007/s13355-012-0130-x
doi was present: 10.1063/1.4742131
doi was present: 10.1590/s0102-69922012000200010
doi was present: 10.1090/S0002-9939-2012-11404-2


## Pre-Processing the Data

The data pipeline goes from registering the DOIs to developing model-specific embeddings. Because of the way data is stored within the database, we can pre-process and develop embeddings for articles using multiple embedding models. In each case we need to define the form of the text string to be transformed. For this we use the `ar.data_preprocessing()` function. It queries the database for all article metadata that has yet to be embedded with a particular model. The `ar.data_preprocessing()` function takes the argument `model_name`, which can be any valid model shared on [HuggingFace](https://huggingface.co/models?pipeline_tag=feature-extraction). We chose the `allenai/specter2` model as the default since it was explicitly trained on a large scientific journal Title-Abstract dataset.

Within the database, we store the metadata both as a `jsonb` column (`crossrefmeta`) containing the full set of CrossRef metadata, and also in the columns `doi`, `title`, `subtitle`, `author`, `subject`, `abstract`, `containertitle`, `language`, `published` and `publisher`. These fields are all drawn from CrossRef metadata and are not consistently filled. Currently the NeotomaART API returns structured JSON data for each article. It is possible to use other fields by [modifying the API code](https://github.com/NeotomaDB/article_project/blob/main/article_api/v0.1/helpers/dois/dois.js#L198), but we pre-defined these fields and return them as a list of JSON objects:

```json
{
    "doi":"10.1126/sciadv.aav3809",
    "title":"Central Europe temperature constrained by speleothem fluid inclusion water isotopes over the past 14,000 years",
    "subtitle":null,
    "abstract":"<jats:p>Past precipitation water sealed in stalagmites from Switzerland gives insight into temperature changes for the past 14,000 years.</jats:p>",
    "language":"en",
    "containertitle":"Science Advances"
}
```

It is important to note that data from CrossRef is inconsistent. Of the approximately 2,500 articles originally registered, only ~60% of articles had full abstracts, and approximately 10% of records were missing information about language of origin.

The data pre-processing step combines title and abstract into a single string element and imputes language (if missing).
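As an illustration of that step, the sketch below combines the title and abstract of one record and fills in a missing language. It is not the code behind `ar.data_preprocessing()`: stripping the JATS tags and falling back to a fixed default language are assumptions made for this example, and `build_text` is a hypothetical name.

```python
import re


def build_text(record, default_language="en"):
    """Combine title and abstract into one string and fill a missing language."""
    title = record.get("title") or ""
    abstract = record.get("abstract") or ""
    # CrossRef abstracts often carry JATS markup such as <jats:p>; strip the tags.
    abstract = re.sub(r"<[^>]+>", " ", abstract)
    text = re.sub(r"\s+", " ", f"{title} {abstract}").strip()
    return {
        "doi": record.get("doi"),
        "text": text,
        # Assumption: a fixed fallback stands in for real language imputation.
        "language": record.get("language") or default_language,
    }


record = {
    "doi": "10.1126/sciadv.aav3809",
    "title": "Central Europe temperature constrained by speleothem fluid inclusion water isotopes over the past 14,000 years",
    "abstract": "<jats:p>Past precipitation water sealed in stalagmites from Switzerland gives insight into temperature changes for the past 14,000 years.</jats:p>",
    "language": None,
}
print(build_text(record))
```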

In [7]:
processed_data = ar.data_preprocessing(model_name = 'allenai/specter2_base')

Running the command returns a list of `dict` objects, each with the keys `doi`, `text` and `language`. This list is then passed to the embeddings step:

In [8]:
embeddings = ar.add_embeddings(processed_data, text_col = 'text', model_name = 'allenai/specter2_base')

Fetching 4 files: 100%|██████████| 4/4 [00:00<00:00, 53601.33it/s]

Building embeddings for 0 objects.




## Registering a Project

We can load papers, and build embeddings without defining a particular project we're working on. The purpose of this project structure is to allow multiple projects to share the same embedded data and sets of papers. This simply reduces overhead for everyone using the system.

To register a project we only have to define a project name and a simple description. We also first check to see if it exists:

In [18]:

project_exists = ar.project_exists('Neotoma Relevance')
ar.register_project('Neotoma Relevance', 'A project to manage models for assessing publication relevance for Neotoma.')

Now the data is stored in the database, and we can begin to associate labels with the project, allowing us to link "classification" labels to the project and to the papers, providing the base data for our models.

## Adding Labels

We've already seen the file `data/raw/neotoma_dois.csv`. It is a file of DOIs for publications that are a part of Neotoma. They are, by definition, articles of interest to the database (since we've entered them already). To add labels we take this list of DOIs and assign labels to them (I'm going to say that I was the assigner):

In [20]:
with open('data/raw/neotoma_dois.csv') as file:
    db_data = list(csv.DictReader(file))
first_labels = ar.add_paper_labels(label_data, project = 'Neotoma Relevance', create = True)

neotoma_labels = [{'doi': i.get('doi'), 'label': 'In Neotoma', 'person': '0000-0002-2700-4605'} for i in db_data]
all_labels = ar.add_paper_labels(neotoma_labels, project = 'Neotoma Relevance', create = True)

## Building the Models

With labels added for the project we can now begin to take the data and build models. Using the `ar.get_model_data()` function we request all data associated with a particular project and embedding model. This function returns a list of `dict` objects, one per DOI, each with the DOI, the embedding vector and the label. Each element of `data_model` is a `dict` returned from the API, for example:

```json
{"doi": "10.1126/sciadv.aav3809", "embeddings": [-0.5892478, -1.4083375, ..., -0.8913986, 0.13861942, -0.07215089, -0.120212786, -0.5112147, -1.6048986, -0.18262713, -0.95949847, 0.07596018, 0.03217636, -0.81287014, -0.5136357], "label": "Neotoma"}
```

The script below takes all papers that have embeddings for the model `allenai/specter2_base`, both labelled and unlabelled. It removes any unlabelled records so that we can properly build the model (since these are of "unknown" suitability). For the classifier we want only two classes, 1 and 0. We had four labelled classes in the dataset: `'Neotoma'`, `'Not Neotoma'`, `'In Neotoma'` and `'Maybe Neotoma'`. For this classification scheme we assign `0` to `Not Neotoma` and `1` to all the other classes, which we assume are of interest to Neotoma. We use the `re.search()` function to look for the word `Not` as a way to make this assignment.
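The mapping described here can be checked on the four labels directly. This is a small standalone sketch; `to_target` is a hypothetical helper name.

```python
import re

labels = ["Neotoma", "Not Neotoma", "In Neotoma", "Maybe Neotoma"]


def to_target(label):
    """Return 0 for labels containing 'Not' and 1 for every other label."""
    return 0 if re.search("Not", label) else 1


print({label: to_target(label) for label in labels})
```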

In [22]:
# Now need to load in the labelled data and do the train/test split
data_model = ar.get_model_data(project = "Neotoma Relevance", model = "allenai/specter2_base")

# Remove unlabelled data (not suitable for building the model)
data_model = [i for i in data_model if i.get('label') is not None]

# Refine the labelling to two classes, and map the classes to integer values (1 and 0).
# Labels containing 'Not' (i.e. 'Not Neotoma') map to 0; every other label maps to 1.
data_model = [dict(item, target=int(re.search('Not', item['label']) is None)) for item in data_model]
# Convert the embeddings to a list, and then a data frame with columns named `embedding_xxx` where xxx is the embedding dimension.
data_embedding = [i['embeddings'] for i in data_model]
data_input = pd.DataFrame(data_embedding, columns = [f'embedding_{str(i)}' for i in range(len(data_model[0]['embeddings']))])
data_input = data_input.assign(doi = [i['doi'] for i in data_model])
data_input = data_input.assign(target = [i['target'] for i in data_model])


Once this code block has executed, we have a Pandas DataFrame with a column for the DOI, a column for the `target` class (0 or 1) and the columns for the embeddings that we will use for classification. Ultimately, any model supported by `scikit-learn` can be applied to the text embeddings, and the model can be stored within the database as a `joblib` file (note that loading a `joblib` file can execute arbitrary code, so avoid using or opening `joblib` files from outside trusted environments).

Here we use the `sklearn` package to produce our models, and begin by splitting the data into training and testing data objects, using the `target` column as our target, and stratifying the sampling, since our data is highly unbalanced. We have set a specific `random_state` here for reproducibility purposes.

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(data_input.copy(),
                                                    data_input['target'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=data_input['target'])


## Classifiers

As mentioned, it is possible to use multiple classifiers, defined through imports from `sklearn`. The `ar.relevancePredictTrain()` function uses a randomized search across parameters by default. This means that we can define parameter ranges for any of the classifier parameters (as defined in their relevant help documentation). We use these models as examples to showcase support for multi-model approaches. It is possible to further tune models, providing "fixed" model parameterization.
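For readers unfamiliar with randomized search, the sketch below shows the general technique using scikit-learn's `RandomizedSearchCV` on synthetic data. It does not show the internals of `ar.relevancePredictTrain()`; the scoring metric, the number of iterations and folds, and the synthetic dataset are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the embedding matrix and target column.
X, y = make_classification(
    n_samples=200, n_features=20, weights=[0.8, 0.2], random_state=42
)

# Lists of candidate values; the search samples combinations from these.
param_distributions = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=5,  # sample 5 of the 10 possible combinations
    scoring="f1",
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

Sampling a subset of combinations trades exhaustiveness for speed, which matters when several classifiers are tuned in one run.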

In [25]:

classifiers = [
    (LogisticRegression(max_iter=1000), {
        'C': [0.001, 0.01, 0.1, 1, 10],
        'max_iter': [100, 1000, 10000],
        'penalty': ['l2'],
        'solver': ['liblinear', 'lbfgs']
    }),
    (DecisionTreeClassifier(class_weight="balanced"), {
        'max_depth': range(10, 100, 10)
    }),
    (KNeighborsClassifier(weights='uniform', algorithm='auto'), {
        'n_neighbors': range(5, 100, 10)
    }),
    (BernoulliNB(binarize=0.0), {
        'alpha': [0.001, 0.01, 0.1, 1.0]
    }),
    (RandomForestClassifier(), {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30]
    })
]


We define our classifiers using lists or range functions, passed in for each parameter of interest. Once the classifiers have been defined we can then run the model using `ar.relevancePredictTrain()`. This function directly outputs a dictionary of model results (accuracy, recall, precision and the F1 statistic for both test and training sets). In addition to these model assessment values, the function also outputs the optimized model for each of the classification types as a `joblib` file in the `data/models` directory. The files are named using the model name and a timestamp string.
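The naming pattern can be reproduced as follows. The format is inferred from the file names used elsewhere in this notebook (for example `bernoullinb_2024-10-15_10-58-45.joblib`); `model_filename` is a hypothetical helper, not a function from the package.

```python
from datetime import datetime


def model_filename(model_name, when):
    """Build a file name from a lower-cased model name and a timestamp."""
    stamp = when.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{model_name.lower()}_{stamp}.joblib"


print(model_filename("BernoulliNB", datetime(2024, 10, 15, 10, 58, 45)))
```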

In [26]:
resultsDict = ar.relevancePredictTrain(x_train = X_train, y_train = y_train, classifiers = classifiers)
with open('results.json', 'w', encoding='UTF-8') as f:
    json.dump(resultsDict['report'], f, indent=4, sort_keys=True, default=str)

Setting up features
Beginning training
Training logisticregression.
Starting fit at 2024-10-15_10-58-34
Training decisiontreeclassifier.
Starting fit at 2024-10-15_10-58-39
Training kneighborsclassifier.
Starting fit at 2024-10-15_10-58-44
Training bernoullinb.
Starting fit at 2024-10-15_10-58-45
Training randomforestclassifier.
Starting fit at 2024-10-15_10-58-46
finished process; returning results


In [32]:
resultsDict

{'model_name': ['LogisticRegression',
  'DecisionTreeClassifier',
  'KNeighborsClassifier',
  'BernoulliNB',
  'RandomForestClassifier'],
 'model': [Pipeline(steps=[('columntransformer',
                   ColumnTransformer(remainder='passthrough',
                                     transformers=[('doi', 'drop', ['doi'])])),
                  ('simpleimputer',
                   SimpleImputer(fill_value=0, strategy='constant')),
                  ('logisticregression',
                   LogisticRegression(C=0.01, max_iter=1000,
                                      solver='liblinear'))]),
  Pipeline(steps=[('columntransformer',
                   ColumnTransformer(remainder='passthrough',
                                     transformers=[('doi', 'drop', ['doi'])])),
                  ('simpleimputer',
                   SimpleImputer(fill_value=0, strategy='constant')),
                  ('decisiontreeclassifier',
                   DecisionTreeClassifier(class_weight='balanced',
 

We can see that the `resultsDict` object also includes information about the "best" fit model.

## Prediction

Given any of the models of interest, we can then call the `ar.relevancePredict()` function, which will load the `joblib` file and use the model to predict values based on the embeddings provided.

In [30]:
results = ar.relevancePredict(data_input, model = 'decisiontreeclassifier_2024-09-22_22-30-35.joblib')

## A Full Run-Through

Here we see a complete run-through using a pre-existing model and "new" DOIs that have not been examined before. The output is a set of metadata for publications predicted to be relevant to the database, based on our prior labelled data:

In [31]:
with open('./data/raw/newdois.csv', 'r') as file:
    new_dois = file.read().splitlines()

# Clean and register DOIs
clean = ar.clean_dois(new_dois)
check = ar.register_dois(clean['clean'], verbose = False)

# Get text objects for papers that haven't been processed by the named model:
processed_data = ar.data_preprocessing(model_name = 'allenai/specter2_base')

# Build embeddings locally.
embeddings = ar.add_embeddings(processed_data, text_col = 'text', model_name = 'allenai/specter2_base')

new_data_model = ar.get_model_data(project = None, model = "allenai/specter2_base")

data_embedding = [i['embeddings'] for i in new_data_model]
data_input = pd.DataFrame(data_embedding, columns = [f'embedding_{str(i)}' for i in range(len(new_data_model[0]['embeddings']))])
data_input = data_input.assign(doi = [i['doi'] for i in new_data_model])

results = ar.relevancePredict(data_input, model = 'bernoullinb_2024-10-15_10-58-45.joblib')
results.to_csv('/tmp/output.csv')

goodpapers = results.loc[results['prediction'] == 1]['doi'].tolist()
pubs = [ar.get_publication_metadata(i) for i in goodpapers]
