# Live Demo

This notebook contains the code that predicts acceptance for a given paper. A paper is uploaded through the interface in the notebook, either as a PDF file or as a parsed JSON file representing the paper.

Note that a PDF file can only be used when the notebook runs locally with the parser server turned on (more details below), so PDF input fails on Colab. Colab can still classify already parsed .json files from the dataset, but only the `{PAPER}_content.json` files can be used.
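As a small illustration of the naming rule above, the sketch below checks whether a file name follows the `{PAPER}_content.json` pattern before it is uploaded. The helper `is_content_json` and the example file names are hypothetical and not part of the project scripts.

```python
from pathlib import Path

# Suffix taken from the naming rule stated above
CONTENT_SUFFIX = "_content.json"


def is_content_json(path: str) -> bool:
    """Return True if the file name looks like {PAPER}_content.json (hypothetical helper)."""
    name = Path(path).name
    # Require a non-empty {PAPER} part in front of the suffix
    return name.endswith(CONTENT_SUFFIX) and len(name) > len(CONTENT_SUFFIX)


# Example file names are made up for illustration
print(is_content_json("papers/example_paper_content.json"))
print(is_content_json("papers/example_paper.pdf"))
```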

Inference uses the selected best-performing model, so the parameters should match those with which the model achieved that performance.

The notebook contains only high-level code; the implementation details are in the corresponding scripts.

The code was developed using the **Kaggle** platform, but it is adapted to be run on the **Google Colab** platform.

To run this notebook, you need to:
1. Set **BASE_PATH**, the project directory that contains the dataset and the code.
2. Make sure the data has been downloaded.
3. Select the stored model that will be used for inference.
4. Choose the parameters with which the stored model achieved its best performance.
5. Have a file in a suitable format (PDF or parsed JSON) to make a prediction on.
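The first three prerequisites can be checked up front with a few lines of code. The sketch below is one possible way to do so; `missing_paths` is a hypothetical helper, and the paths simply mirror the values used later in this notebook, so adjust them to your setup.

```python
import os


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not os.path.exists(p)]


# Values mirror the ones defined later in this notebook; adjust to your setup
BASE_PATH = "/content/drive/MyDrive/Reviewer2"
required = [
    BASE_PATH,                                  # project directory
    f"{BASE_PATH}/ICLR_Dataset",                # downloaded data
    f"{BASE_PATH}/Best_Models/RoBERTa_Model",   # stored model used for inference
]

for path in missing_paths(required):
    print(f"Missing: {path}")
```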


# Connect to the Google Drive

First, connect to Google Drive so that papers can be read from and stored in it.

If another platform is used to run the notebook, comment this cell out.
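Instead of commenting the cell out by hand, the mount can be guarded so that it only runs on Colab. This is a minimal sketch, assuming that a successful import of `google.colab` is a good enough signal; `running_in_colab` and `mount_drive_if_colab` are hypothetical helpers.

```python
def running_in_colab() -> bool:
    """Return True if the google.colab package can be imported (hypothetical helper)."""
    try:
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


def mount_drive_if_colab(mount_point: str = "/content/drive") -> bool:
    """Mount Google Drive on Colab; do nothing elsewhere. Returns whether a mount was attempted."""
    if not running_in_colab():
        print("Not running on Colab; skipping the Google Drive mount.")
        return False
    from google.colab import drive
    drive.mount(mount_point)
    return True


mount_drive_if_colab()
```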

In [1]:
from google.colab import drive, files

colab_path = '/content/drive'
drive.mount(colab_path)

Mounted at /content/drive


# Install and Import Required Libraries

In [None]:
#@title Install Libraries
!pip install transformers
!pip install gradio
!pip install science_parse_api

## Import Essential Libraries

In [3]:
import os
import sys
import re
import time
import datetime
import json
import string
import requests
import random
from collections import Counter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset, Sampler, RandomSampler, SequentialSampler, random_split, Dataset

from tqdm.auto import tqdm
import gradio as gr

# Define Paths and Import Local Scripts

Set the absolute paths for **BASE_PATH** and **DATA_PATH**.

In [4]:
BASE_PATH = "/content/drive/MyDrive/Reviewer2"

DATA_PATH = f"{BASE_PATH}/ICLR_Dataset"

module_path = os.path.abspath(BASE_PATH)
if module_path not in sys.path:
    sys.path.append(module_path)


from scripts.utils import get_device
from scripts.inference import inference_for_single_document

# Define Constants, Hyperparameters and Configurations

Choose the best-performing model from one of these folders:
1. BERT - **{BASE_PATH}/Best_Models/BERT_Model**
2. RoBERTa - **{BASE_PATH}/Best_Models/RoBERTa_Model**
3. Longformer - **{BASE_PATH}/Best_Models/Longformer_Model**

In [5]:
# Selected Parameters
batch_size = 16
add_paper_metadata = True

# Choose from these models: "bert-base-cased", "roberta-base", "allenai/longformer-base-4096"
model_name = "roberta-base"

selected_model_path = f"{BASE_PATH}/Best_Models/RoBERTa_Model"

# For Longformer, this was used: MAX_TOKENS_NUMBER = 2048
# For BERT and RoBERTa, this was used: MAX_TOKENS_NUMBER = 512
MAX_TOKENS_NUMBER = 512

TEXT_SELECTION_MODES = {"INTRODUCTION_WITH_ABSTRACT": 0, 
                        "INTRODUCTION_WITHOUT_ABSTRACT": 1, 
                        "MIDDLE": 2, 
                        "TAIL": 3,
                        "ABSTRACT_WITH_TAIL": 4}

# Selected mode of splitting text 
mode = TEXT_SELECTION_MODES["INTRODUCTION_WITHOUT_ABSTRACT"]

plt.rcParams["figure.figsize"] = (12, 6)
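Because the model identifier, the model folder and the token limit have to agree, they can be derived from a single lookup table instead of being edited separately. The sketch below uses only the values listed above; `MODEL_SETTINGS` and `resolve_model_settings` are hypothetical names and not part of the project scripts.

```python
# Model identifier -> (folder under Best_Models, token limit), as listed above
MODEL_SETTINGS = {
    "bert-base-cased": ("BERT_Model", 512),
    "roberta-base": ("RoBERTa_Model", 512),
    "allenai/longformer-base-4096": ("Longformer_Model", 2048),
}


def resolve_model_settings(model_name: str, base_path: str):
    """Return (model_path, max_tokens) for a supported model identifier."""
    if model_name not in MODEL_SETTINGS:
        raise ValueError(f"Unsupported model: {model_name!r}")
    folder, max_tokens = MODEL_SETTINGS[model_name]
    return f"{base_path}/Best_Models/{folder}", max_tokens


model_path, max_tokens = resolve_model_settings(
    "roberta-base", "/content/drive/MyDrive/Reviewer2"
)
print(model_path, max_tokens)
```

Selecting an unsupported identifier raises a `ValueError` early, rather than failing later when the model is loaded.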

# Obtain the Device
Check whether a GPU is available. A GPU is preferable, as running the model on a CPU takes considerably longer.

In [6]:
device = get_device()

No GPU available, using the CPU instead.
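The implementation of `get_device` lives in `scripts/utils` and is not shown here. A device check of this kind is commonly built on `torch.cuda.is_available()`; the sketch below assumes that approach and is not the project's actual code. The hypothetical `pick_device_name` falls back to the CPU when PyTorch is not installed, so the snippet also runs outside the notebook environment.

```python
def pick_device_name() -> str:
    """Return "cuda" if a GPU is usable through PyTorch, otherwise "cpu" (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report the CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device_name = pick_device_name()
if device_name == "cuda":
    print("GPU available, using it for inference.")
else:
    print("No GPU available, using the CPU instead.")
```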


# Load the Model

In [7]:
# Read a model from local storage which is used to make predictions
model = AutoModelForSequenceClassification.from_pretrained(selected_model_path)
tokenizer = AutoTokenizer.from_pretrained(model_name)

_ = model.to(device)

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

# Inference on the Selected Paper


## Inference on a Parsed JSON File

For this, provide a parsed paper file in the same format as the files in the dataset.

In [None]:
def ui_json_inference(json_file):
    # Load the parsed paper uploaded through the interface
    with open(json_file.name) as f:
        json_as_dict = json.load(f)

    # Run the model after the file has been read and closed
    probability_labels = inference_for_single_document(json_as_dict, model, tokenizer, MAX_TOKENS_NUMBER, device, batch_size)
    return probability_labels

gr.Interface(fn=ui_json_inference, 
             inputs=gr.File(), 
             outputs=gr.Label(num_top_classes=1),
             allow_flagging='never',
             title='ICLR Acceptance Predictor',
           ).launch()


## Inference on a PDF File
**A running parsing server is required.** At the moment the server only runs in Docker, which must be started locally. To accept a paper as a PDF, the notebook therefore has to be set up locally; on Colab this section will fail.

Use this command to start the server for the parser on Linux/WSL:

`sudo docker run -p 127.0.0.1:8080:8080 --rm --init ucrel/ucrel-science-parse:3.0.1`
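Before launching the interface, it can help to confirm that something is listening on the port published by the Docker command. The sketch below only tests that a TCP connection can be opened; it does not verify that the service behind the port is the parser. `parser_server_reachable` is a hypothetical helper.

```python
import socket


def parser_server_reachable(host: str = "127.0.0.1", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


if not parser_server_reachable():
    print("Parser server not reachable on 127.0.0.1:8080; start the Docker container first.")
```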

In [9]:
from science_parse_api.api import parse_pdf
from pathlib import Path


host = 'http://127.0.0.1'
port = '8080'


def pdf_inference(file):
    json_as_dict = parse_pdf(host, Path(file.name).resolve(), port=port)
    probability_labels = inference_for_single_document(json_as_dict, model, tokenizer, MAX_TOKENS_NUMBER, device, batch_size)
    return probability_labels

In [None]:
gr.Interface(fn=pdf_inference, 
             inputs=gr.File(), 
             outputs=gr.Label(num_top_classes=1),
             allow_flagging='never',
             title='ICLR Acceptance Predictor',
             ).launch()