# Running Training and Prediction Pipeline
---
This notebook provides all the commands needed to reproduce the reported results: training the models and running prediction on the full corpus.

You do not need to run this process to update the inventory. It is only needed to reproduce the reported results, and it is the process that originally produced them.

This pipeline has the following steps:

*   Split the manually curated datasets
*   Train all models on the classification and NER tasks
*   Select the best model for each task
*   Evaluate all models for each task on their test sets
*   Perform classification of full corpus
*   Run NER model on predicted biodata resource papers
*   Extract URLs from predicted positives
*   Process the predicted names
*   Perform automated initial deduplication
*   Flag the inventory for selective manual review

### ***Warning***:

Running the full pipeline trains many models, and their "checkpoint" files are quite large (~0.5GB per model, ~15GB in total). Simply running prediction requires far fewer resources, including storage space.
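
Before starting a full run, it may be worth confirming that enough storage is available. The following is a minimal sketch using only the Python standard library; the helper name `has_free_space` is hypothetical and not part of this repository, and the 15 GB default simply mirrors the estimate above.

```python
import shutil

def has_free_space(path=".", required_gb=15.0):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    Hypothetical helper: the 15 GB default is taken from the estimate in this notebook.
    """
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> gigabytes
    return free_gb >= required_gb

# Report whether the current working directory has room for all checkpoints.
print("Enough space for all checkpoints:", has_free_space("."))
```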

### Other use-cases

If you want to compare a new model to the models already evaluated, you can add another row to `config/models_info.tsv`. This pipeline will train the new model and compare it to the others. If checkpoint files for the other models are still present from a previous run, those models will not be retrained.
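
One way to add that row without breaking the file layout is to append it programmatically and check it against the existing header. This is only a sketch: `append_model_row` is a hypothetical helper, it assumes the file is a plain tab-separated table whose first line is the header, and it makes no assumption about which columns `config/models_info.tsv` actually contains.

```python
import csv

def append_model_row(tsv_path, new_row):
    """Append `new_row` (a dict) to a TSV file, requiring its keys to match the header.

    Hypothetical helper: assumes the first line of the file is a tab-separated header.
    """
    with open(tsv_path, newline="") as fh:
        content = fh.read()
    lines = content.splitlines()
    if not lines:
        raise ValueError(f"{tsv_path} is empty, so there is no header to match")
    header = lines[0].split("\t")
    if set(new_row) != set(header):
        raise ValueError(f"Row keys must match the header columns: {header}")
    with open(tsv_path, "a", newline="") as fh:
        # Make sure the new row starts on its own line.
        if not content.endswith("\n"):
            fh.write("\n")
        writer = csv.DictWriter(
            fh, fieldnames=header, delimiter="\t", lineterminator="\n"
        )
        writer.writerow(new_row)
```

To use it, first open `config/models_info.tsv` to see its column names, then pass a dictionary with exactly those keys.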

# Setup
---
### Mount Drive

First, mount Google Drive to have access to files necessary for the run:


In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/GitHub/inventory_2022/

Run the `setup` make target to install the Python dependencies and download the full corpus that was used during training and evaluation.

In [None]:
! make setup

# Running the pipeline
---
Now we are ready to run the pipeline.

## Previewing what has to be done

The following can be run to get a preview of what has to be done.

In [None]:
! make dryrun_reproduction

## Run it

The following cell will run the entire pipeline described above. It takes a while, even with GPU acceleration. Without a GPU, it will take a very long time, if it finishes at all.
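
Since the run depends so heavily on a GPU, a quick check beforehand can save time. The sketch below is an assumption about the environment rather than part of the pipeline: it treats a working `nvidia-smi` command as evidence of an NVIDIA GPU, and `gpu_visible` is a hypothetical helper name.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and bool(result.stdout.strip())

print("GPU detected:", gpu_visible())
```

In Colab, if this reports `False`, a GPU runtime can usually be selected from the runtime settings before running the cell below.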

In [None]:
! make train_and_predict

# Selective Manual Review

After running the initial pipeline, the inventory has been flagged for selective manual review.

The file to be reviewed is located at:

`out/original_query/for_manual_review/predictions.csv`

Review the flagged columns according to the instruction sheet, then place the manually reviewed file in the following folder:

`out/original_query/manually_reviewed/`

The file must still be named `predictions.csv`.
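
If the reviewed file was saved elsewhere or under a different name, it can be copied into place with a few lines of Python. This is a sketch only: `place_reviewed_file` and the example source filename are hypothetical, while the destination folder and the required `predictions.csv` name come from the instructions above.

```python
import shutil
from pathlib import Path

def place_reviewed_file(reviewed_path,
                        dest_dir="out/original_query/manually_reviewed"):
    """Copy a reviewed file into `dest_dir`, always naming it predictions.csv."""
    src = Path(reviewed_path)
    if not src.is_file():
        raise FileNotFoundError(f"Reviewed file not found: {src}")
    dest = Path(dest_dir) / "predictions.csv"
    dest.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    shutil.copyfile(src, dest)
    return dest

# Example call (the source filename here is hypothetical):
# place_reviewed_file("my_reviewed_predictions.csv")
```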


# Processing Manual Review

Next, further processing is performed on the manually reviewed inventory.

In [None]:
! make process_manually_reviewed

# Results
---
Once everything is complete, there are a few important output files:


## Final inventory

The final inventory, including names, URLs, and metadata, is found in the file:
*    `out/original_query/processed_countries/predictions.csv`

## Model training stats

The per-epoch training statistics for all models are in the files:

*    `out/classif_train_out/combined_train_stats/combined_stats.csv`
*    `out/ner_train_out/combined_train_stats/combined_stats.csv`

## Test set evaluation

Performance measures of the trained models on the test set are located in the files:

*    `out/classif_train_out/combined_test_stats/combined_stats.csv`
*    `out/ner_train_out/combined_test_stats/combined_stats.csv`
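
To take a quick look at one of these files without assuming anything about its columns, the header and first few rows can be read with the standard `csv` module. The helper name `peek_csv` is hypothetical; the same approach should work for the training statistics files.

```python
import csv

def peek_csv(path, n_rows=5):
    """Return (header, first `n_rows` data rows) of a CSV file as lists of strings."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])  # empty list if the file has no lines
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows

# Example call, using a path listed above:
# header, rows = peek_csv("out/classif_train_out/combined_test_stats/combined_stats.csv")
# print(header)
```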

## Selected models
The names of the best models are in the files:

*    `out/classif_train_out/best/best_checkpt.txt`
*    `out/ner_train_out/best/best_checkpt.txt`
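
As a final sanity check, the sketch below reports which of the output files listed in this section are missing. The paths are copied from this notebook; the helper name `missing_outputs` is hypothetical, and it only tests for existence, not for valid content.

```python
from pathlib import Path

# Output files named in the Results section of this notebook.
EXPECTED_OUTPUTS = [
    "out/original_query/processed_countries/predictions.csv",
    "out/classif_train_out/combined_train_stats/combined_stats.csv",
    "out/ner_train_out/combined_train_stats/combined_stats.csv",
    "out/classif_train_out/combined_test_stats/combined_stats.csv",
    "out/ner_train_out/combined_test_stats/combined_stats.csv",
    "out/classif_train_out/best/best_checkpt.txt",
    "out/ner_train_out/best/best_checkpt.txt",
]

def missing_outputs(root="."):
    """Return the expected output paths that do not exist under `root`."""
    return [p for p in EXPECTED_OUTPUTS if not (Path(root) / p).is_file()]

# Print any expected output that is not present in the current directory tree.
for path in missing_outputs("."):
    print("Missing:", path)
```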
