# Prediction Pipelines with `article_relevance`

The `article_relevance` tool was designed for use with the Neotoma Paleoecology Database. It provides a workflow in which users supply an updatable list of DOIs from publications. The DOIs are used to retrieve metadata from [CrossRef](https://crossref.org), which is then converted into text embeddings from which we can develop predictive models. The tooling allows us to fit multiple predictive models and to run a grid search for hyperparameter tuning across all of them.

For each article, the workflow reports a prediction of whether it is suited for inclusion in a research database, a probability estimate for that prediction, and a timestamp. Users can also override a prediction. In this way we can track how the models evolve and support curated stewardship of model predictions.

In [None]:
import os
from dotenv import load_dotenv
import src.article_relevance as ar

load_dotenv()

This package has several dependencies, all of which are listed in the `requirements.txt` file.

In addition, at minimum, we require a file with properly labelled DOI data. This data should contain the following columns:

* doi: A properly formatted DOI, with only the shoulder and endpoint, e.g., `10.5467/22343.whatever`
* label: A categorical label that can be used to identify whether or not the article is suitable for inclusion into the database.
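
A quick pre-flight check on the labelled file can catch missing columns or malformed DOIs before anything is sent to CrossRef. The sketch below is illustrative only: `check_labelled_rows` and the regular expression are assumptions, not part of `article_relevance`, and the pattern is a common approximation of modern DOI syntax rather than a complete validator.

```python
import csv
import io
import re

# Approximate pattern for modern DOIs (an assumption, not a full validator).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
REQUIRED_COLUMNS = {"doi", "label"}


def check_labelled_rows(handle):
    """Return (valid_rows, problems) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    valid, problems = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        doi = (row["doi"] or "").strip()
        label = (row["label"] or "").strip()
        if not DOI_PATTERN.match(doi):
            problems.append((line_no, "malformed DOI"))
        elif not label:
            problems.append((line_no, "empty label"))
        else:
            valid.append({"doi": doi, "label": label})
    return valid, problems


# Hypothetical sample data standing in for a labelled CSV file.
sample = io.StringIO(
    "doi,label\n"
    "10.1063/1.4742131,not relevant\n"
    "https://doi.org/10.1063/1.4742131,relevant\n"
    "10.1007/s13355-012-0130-x,\n"
)
valid_rows, problems = check_labelled_rows(sample)
```

Here the second data row is rejected because it carries a resolver URL rather than a bare DOI, and the third because its label is empty.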

This project contains two data resources in the `data` folder:

* `raw/neotoma_crossref.csv`: data directly exported from Neotoma (publication is already included in Neotoma)
* `raw/labelled_data.csv`: manually labelled data for model training

Our first goal is to import the data and add the required CrossRef metadata. To do that we build one of several data objects. We are storing data in memory, but using an AWS S3 bucket to maintain data consistency.

In [None]:
DOI_STORE = {'Bucket': os.environ['S3_BUCKET'], 'Key': 'doi_store.parquet'}
METADATA_STORE = {'Bucket': os.environ['S3_BUCKET'], 'Key': 'metadata_store.parquet'}
EMBEDDING_STORE = {'Bucket': os.environ['S3_BUCKET'], 'Key': 'embedding_store.parquet'}
PREDICTION_STORE = {'Bucket': os.environ['S3_BUCKET'], 'Key': 'prediction_store.parquet'}
LABELLING_STORE = {'Bucket': os.environ['S3_BUCKET'], 'Key': 'labelling_store.parquet'}

These objects represent the file elements we'll be working with for the predictive models.

## Loading Raw Data

For our purposes we want to record, for every object, both the data source and the certainty with which it was labelled. Data from the database itself should be the most trustworthy; labelled data should carry some indication of the labeller; and unlabelled data should be marked as such.

The first data we load are the raw records from the database itself (`db_data`) and a set of manually labelled data (`labelled_data.csv`). In this case all of the labelled data was labelled by the same person ("Simon Goring"), so when we insert the data into the `LABELLING_STORE` document we'll make sure to record that.

In [30]:
import pandas as pd
from datetime import datetime

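# Records from the database itself, and the manually labelled records.
db_data = pd.read_csv('data/raw/neotoma_dois.csv')
label_data = pd.read_csv('data/raw/labelled_data.csv')

# De-duplicate DOIs across both sources and stamp each with the time it was added.
unique_doi = set(db_data['doi'].tolist() + label_data['doi'].tolist())
doi_store2 = pd.DataFrame({'doi': list(unique_doi), 'date': datetime.now()})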

The `doi_store2` DataFrame now holds all of our DOIs, labelled (and, ultimately, unlabelled), each with a timestamp for the date and time it was added. This is a separate document from the full metadata markup for these records, and we'll further separate this all from the embeddings.

We do this in part to reduce our file input/output overhead. The workflow assumes that all DOIs in this set of records stem from this core `doi_store` record. We keep a `datetime` tag so that we can assess updates and model outcomes against the available data at the time of model building or prediction.
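
Because each DOI carries the time it was added, the set of records that existed when a model was built can be recovered with a simple filter. A minimal sketch, using plain Python records rather than the package's storage format; the function name `dois_available_at` and the sample dates are hypothetical:

```python
from datetime import datetime


def dois_available_at(records, cutoff):
    """Return DOIs whose 'date' is at or before the cutoff timestamp."""
    return [rec["doi"] for rec in records if rec["date"] <= cutoff]


# Hypothetical records mirroring the 'doi' and 'date' columns used above.
records = [
    {"doi": "10.1063/1.4742131", "date": datetime(2023, 1, 10, 9, 0)},
    {"doi": "10.1007/s13355-012-0130-x", "date": datetime(2023, 6, 2, 14, 30)},
]

model_built = datetime(2023, 3, 1)
training_pool = dois_available_at(records, model_built)
```

Only the first DOI predates `model_built`, so it is the only one that could have informed that model.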

The `push_s3` function takes several options, including `check` and `create`. The `create` flag allows us to create a new object in the S3 bucket if the Bucket/Key combination in the `s3_object` parameter does not yet exist, and the `check` flag allows us to validate our objects before we push them up to the cloud.
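
The interaction between `create` and `check` can be pictured with an in-memory stand-in for the bucket. This is a hypothetical sketch of the behaviour described above, not the implementation of `push_s3`: `fake_bucket`, `push_object`, and the specific validation rule are all assumptions.

```python
fake_bucket = {}  # stands in for S3: maps (Bucket, Key) -> stored rows


def push_object(s3_object, rows, check=True, create=False):
    """Store rows under the Bucket/Key pair, honouring check and create."""
    location = (s3_object["Bucket"], s3_object["Key"])
    if location not in fake_bucket and not create:
        raise FileNotFoundError(f"{location} does not exist; pass create=True")
    if check:
        # Assumed validation rule: every row needs a non-empty 'doi' field.
        bad = [row for row in rows if not row.get("doi")]
        if bad:
            raise ValueError(f"{len(bad)} row(s) failed validation")
    fake_bucket[location] = list(rows)
    return location


store = {"Bucket": "example-bucket", "Key": "doi_store.parquet"}
rows = [{"doi": "10.1063/1.4742131"}]

try:
    push_object(store, rows)  # object is absent and create defaults to False
    first_attempt = "stored"
except FileNotFoundError:
    first_attempt = "refused"

push_object(store, rows, create=True)  # the first upload creates the object
```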

### First Run

#### Upload the raw DOI data

The first time we run this workflow we can assume that the data does not exist in the cloud, so we'll call `push_s3()` with `create = True`. This will push our full set of DOIs into the `DOI_STORE` object.

In [None]:
# Assumes push_s3 is exposed by the package imported above as `ar`.
ar.push_s3(s3_object = DOI_STORE, pa_object = doi_store2, check = False, create = True)

In the future, if we are adding new records to the complete set of DOIs, we can use the function `update_dois()`. This function accepts a list of DOI strings, validates the DOIs, concatenates them with the list in `DOI_STORE`, and updates `DOI_STORE` with the new DOIs and their timestamps:

In [None]:
new_dois = ['10.1590/s0102-69922012000200010', '10.1090/S0002-9939-2012-11404-2', '10.1063/1.4742131', '10.1007/s13355-012-0130-x']

# Assumes update_dois is exposed by the package imported above as `ar`.
ar.update_dois(s3_object = DOI_STORE, dois = new_dois)
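
The steps that `update_dois()` is described as performing (validate, concatenate, timestamp) might look roughly like the following when applied to plain Python records. This is a sketch under assumptions: `merge_new_dois` is a hypothetical helper, the validity check is a simple pattern match, and existing DOIs are assumed to keep their original timestamps.

```python
import re
from datetime import datetime

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")  # approximate DOI syntax


def merge_new_dois(existing, new_dois, now=None):
    """Append valid, previously unseen DOIs with a timestamp."""
    now = now or datetime.now()
    # DOIs are case-insensitive, so compare on the lower-cased form.
    seen = {rec["doi"].lower() for rec in existing}
    merged = list(existing)
    for doi in new_dois:
        doi = doi.strip()
        if DOI_PATTERN.match(doi) and doi.lower() not in seen:
            merged.append({"doi": doi, "date": now})
            seen.add(doi.lower())
    return merged


existing = [{"doi": "10.1063/1.4742131", "date": datetime(2023, 1, 10)}]
incoming = ["10.1063/1.4742131", "10.1090/S0002-9939-2012-11404-2", "not-a-doi"]
updated = merge_new_dois(existing, incoming, now=datetime(2023, 6, 2))
```

The duplicate and the malformed string are dropped, so `updated` gains exactly one new record.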

#### Get Article Metadata

Article metadata will be stored in the object `METADATA_STORE`. It will likely be a large object, so we want to limit how often we pass data back and forth. We also want to keep it synced with the object stored in `DOI_STORE`, so that we know which articles we've accessed or processed and have a guide for future labelling.

To get article metadata we want to use the `crossref_query()` function. We have to pass in both the `DOI_STORE` and the `METADATA_STORE`. If there are records that are in the `DOI_STORE` but not in the `METADATA_STORE` then we will poll CrossRef for the metadata. We can also check to see if there are any records in `METADATA_STORE` that aren't in `DOI_STORE`. This should not happen, but we can always check to be sure.

In [None]:
# Assumes crossref_query is exposed by the package imported above as `ar`.
metadata = ar.crossref_query(DOI_STORE, METADATA_STORE, create = True)
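
The comparison between the two stores comes down to two set differences. A minimal sketch with hypothetical DOI lists; `compare_stores` is not a function in the package:

```python
def compare_stores(doi_store, metadata_store):
    """Return (to_fetch, orphans) as sorted lists of DOIs."""
    dois = set(doi_store)
    with_metadata = set(metadata_store)
    to_fetch = sorted(dois - with_metadata)   # need a CrossRef lookup
    orphans = sorted(with_metadata - dois)    # should normally be empty
    return to_fetch, orphans


doi_store = ["10.1063/1.4742131", "10.1007/s13355-012-0130-x"]
metadata_store = ["10.1063/1.4742131"]
to_fetch, orphans = compare_stores(doi_store, metadata_store)
```

A non-empty `orphans` list would signal that the metadata store has drifted out of sync with the DOI store.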

From here we have an object in the cloud with DOIs and object metadata where it is available. The output `DataFrame` contains several key metadata fields from CrossRef, including `title` and `subject`, that we will use later in analysis. We also append a field `date`, to indicate the date on which a record was accessed, and a field `valid`, to indicate whether or not a record was available from CrossRef. We can then pass the `metadata` `DataFrame` into an embedding model to create a numeric matrix that can be used for predictive modeling.

## Building Embeddings

Before we build embeddings we need to define which fields we will use to generate them. We choose to use the article title and abstract (when available), all subject labels, and the journal title. We treat the journal title and subjects with one-hot encoding. This may have limitations with "unseen" journals; however, we can perform *post hoc* tests to examine the weight that these encodings have with regard to predictive probabilities.
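
The limitation with unseen journals is easy to see in a small example: a journal that was absent when the encoder was fitted has no column of its own. The sketch below is a hand-rolled illustration in plain Python, not the encoder used by the package; the journal names are placeholders, and mapping unseen values to an all-zero vector is one common convention, assumed here.

```python
def fit_categories(values):
    """Return the sorted list of distinct categories seen during fitting."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode value as a 0/1 vector; unseen values give all zeros."""
    return [1 if value == category else 0 for category in categories]


# Placeholder journal titles standing in for a training set.
training_journals = ["Quaternary Research", "The Holocene", "Quaternary Research"]
categories = fit_categories(training_journals)

seen = one_hot("The Holocene", categories)
unseen = one_hot("Journal of Examples", categories)
```

Under this convention an unseen journal contributes no signal at all, which is why its effect on predicted probabilities is worth examining after the fact.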