**This notebook was created as part of the AI NER module of the new version of the ANDDigest system.**

The module predicts whether the name of a biological entity matches a given entity type, based on the context in which the name is mentioned. It uses the following fine-tuned models:

```
Timofey/PubMedBERT_Cell_Components_Context_Classifier
Timofey/PubMedBERT_Genes_Proteins_Context_Classifier
Timofey/PubMedBERT_Drugs_Metabolites_Context_Classifier
Timofey/PubMedBERT_Diseases_Side_Effects_Context_Classifier
Timofey/PubMedBERT_Pathways_Context_Classifier
```
To gain access to the fine-tuned models, complete the following steps:
1. Create a free [huggingface](https://huggingface.co) account.
2. Generate your User Access Token by following this [guide](https://huggingface.co/docs/hub/security-tokens).
3. Agree to share your contact information (the username and email used for registration in step 1) with the developers of these models.
4. Paste your generated token into the corresponding field when the cell that calls `notebook_login()` is executed.

**Input/Output Formats description:**

> **Input data:**<br>
The input format is similar to that of the datasets used to fine-tune these models. It is a file of PubMed abstracts, one per line, where each line consists of two tab-separated parts:<br>
<br>
1. Abstract identification number (PMID)<br>
_Please note that the regular expression used by the program is designed for numeric PubMed identifiers, which consist only of digits. If your identifiers have a different format, adjust the regular expression template stored in the `regexp` variable, in the cell that reads the input file._
2. Abstract text, in which the analyzed name is replaced by the **\<andsystem\-candidate\>** tag
**Example:**
```
30342689   Early detection of Parkinson's disease through patient questionnaire and predictive modelling. Early detection of Parkinson's disease (PD) is important which can enable early initiation of therapeutic interventions and management strategies. However, methods for early detection still remain an unmet clinical need in <andsystem-candidate>. In this study, we use the Patient Questionnaire (PQ) portion from the widely used Movement Disorder Society-Unified Parkinson's Disease Rating Scale (MDS-UPDRS) to develop prediction models that can classify early PD from healthy normal using machine learning techniques that are becoming popular in biomedicine: logistic regression, random forests, boosted trees and support vector machine (SVM). We carried out both subject-wise and record-wise validation for evaluating the machine learning techniques.
...
```
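The input format described above can be sketched as follows. This is a minimal illustration, not part of the ANDDigest pipeline: the helper names `make_input_line` and `split_input_line`, the character offsets, and the `PMC123` identifier are hypothetical. It shows how a mention is replaced by the tag, and how the identifier pattern could be loosened for non-numeric identifiers.

```python
import re

TAG = "<andsystem-candidate>"

def make_input_line(pmid, abstract, start, end):
    """Replace abstract[start:end] (the analyzed name) with the tag and
    prepend the identifier, separated by a tab."""
    tagged = abstract[:start] + TAG + abstract[end:]
    # Tabs or newlines inside the text would break the one-line, two-part layout
    tagged = re.sub(r"\s+", " ", tagged).strip()
    return f"{pmid}\t{tagged}"

def split_input_line(line, id_pattern=r"^\d+\s+"):
    """Split a line back into (identifier, text); the default pattern
    accepts numeric identifiers only."""
    match = re.match(id_pattern, line)
    if match is None:
        raise ValueError("line does not start with an identifier")
    return match.group(0).rstrip(), line[match.end():].rstrip()

text = "Methods for early detection remain an unmet clinical need in PD."
start = text.index("PD")
line = make_input_line("30342689", text, start, start + 2)
print(line)
print(split_input_line(line))

# Hypothetical non-numeric identifier: accept any run of non-whitespace characters
print(split_input_line("PMC123\tSome text.", id_pattern=r"^\S+\s+"))
```

Only one mention is tagged per line, so an abstract with several candidate names would yield several input lines.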

> **Output data:**<br>
The output is a list of numeric values, in which each line corresponds to the line with the same number in the input file. By default, all results are saved to a file in the output directory. The file contains three columns separated by several space characters:<br>
1. Abstract identification number (PMID), as in the input data
2. Probability that the analyzed name does not match the context typical of the entity type the model was fine-tuned for
3. Probability that the analyzed name matches the entity type checked by the model
**Example:**
```
30342689   2.9851287308702013e-06   0.9999970197677612
```
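A small sketch of how the output file could be read back, assuming the three-column layout shown above. The function names and the 0.5 cut-off are hypothetical choices for illustration; the notebook itself does not apply any threshold.

```python
def parse_output_line(line):
    """Split one output line into (pmid, p_mismatch, p_match)."""
    # split() without arguments handles the runs of spaces between columns
    pmid, p_mismatch, p_match = line.split()
    return pmid, float(p_mismatch), float(p_match)

def accepted_ids(lines, threshold=0.5):
    """Return identifiers whose 'match' probability reaches the
    (hypothetical) threshold."""
    accepted = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        pmid, _, p_match = parse_output_line(line)
        if p_match >= threshold:
            accepted.append(pmid)
    return accepted

sample = ["30342689   2.9851287308702013e-06   0.9999970197677612"]
print(parse_output_line(sample[0]))
print(accepted_ids(sample))
```

Because the two probabilities come from a softmax over two classes, they sum to approximately 1, so either column is enough for filtering.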

**Data preprocessing, performed by the program:**

Note that the maximum input length for BERT models is 512 tokens (subword units from the transformer's vocabulary). The notebook therefore tokenizes each abstract and checks its length against this limit. If the limit is reached, the program splits the abstract into sentences using the **_sent_tokenize_** function from the [Natural Language Toolkit](https://www.nltk.org), removes the last sentence, and tokenizes the shortened abstract again. This is repeated until the length is below 512 tokens. If the tag of the checked name (\<andsystem\-candidate\>) appears in the last sentence, the notebook removes one sentence from the beginning instead.
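The shortening procedure can be sketched in isolation as below. To keep the example self-contained, it makes several assumptions that differ from the notebook: `split_sentences` is a naive regular-expression stand-in for `sent_tokenize`, `count_tokens` counts whitespace-separated words plus two special tokens instead of calling the BERT tokenizer, and the loop never removes the only remaining sentence. As in the notebook's loop, shortening continues while the count is at least `max_len`.

```python
import re

TAG = "<andsystem-candidate>"

def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize: split after '.', '!' or '?'
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def count_tokens(text):
    # Stand-in for the BERT tokenizer: words plus [CLS] and [SEP]
    return len(text.split()) + 2

def shorten(abstract, max_len=512):
    """Drop sentences one at a time until the token count is below max_len."""
    sentences = split_sentences(abstract)
    while len(sentences) > 1 and count_tokens(" ".join(sentences)) >= max_len:
        if TAG in sentences[-1]:
            # The tag is in the last sentence: remove the first one instead
            sentences = sentences[1:]
        else:
            sentences = sentences[:-1]
    return " ".join(sentences)

example_a = "First sentence here. Second mentions <andsystem-candidate> now. Third one ends."
example_b = "Alpha beta gamma. Delta epsilon. Zeta <andsystem-candidate> eta."

# Small limits are used here only so that the short examples get truncated
print(shorten(example_a, max_len=10))
print(shorten(example_b, max_len=8))
```

In both cases the sentence containing the tag survives, which is the point of choosing the end to cut from.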

In [None]:
!nvidia-smi

In [2]:
# Create the folder where the prediction results will be stored
# (-p avoids an error if the folder already exists)
!mkdir -p '/content/output/'

In [None]:
!pip install transformers
!pip install nltk

In [None]:
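import nltk
nltk.download( 'punkt' )
# Recent NLTK releases load the sentence tokenizer data from 'punkt_tab' instead
nltk.download( 'punkt_tab' )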

In [None]:
# Note: enter your User Access Token here before running the next cells.
# Alternatively, pass the token to the from_pretrained method through its use_auth_token parameter
# (named token in recent transformers releases).
# For more information see https://huggingface.co/docs/transformers/main_classes/model

from huggingface_hub import notebook_login

notebook_login()

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification

import torch
import re
import sys

from ipywidgets import IntProgress
from IPython.display import display
import time

# Specify the model to use here (one of the five classifiers listed at the top of the notebook).
model_dir = 'Timofey/PubMedBERT_Cell_Components_Context_Classifier'

# Path to the input file:
input_dir = '/content/pubmed_corpus.BERT_input.components.short.txt'

# Path to the output file:
output_file = open( '/content/output/pubmed_corpus.BERT_output.components.short.txt', 'w' )

# Number of examples (input file lines). This value is optional and is used only to scale the progress bar:
max_count = 146526

model = BertForSequenceClassification.from_pretrained( model_dir ).to( 'cuda' )
tokenizer = BertTokenizer.from_pretrained( model_dir )

In [None]:
f = IntProgress( min = 0, max = max_count ) # instantiate the bar
display( f ) # display the bar

# PMID prefix: digits followed by whitespace (adjust for non-numeric identifiers)
regexp = r'^\d+\s+'
regexp_tag = r'<andsystem-candidate>'
max_len = 512 # abstracts are shortened until they have fewer tokens than this

counter = 0

# output_file.write( "PMID FALSE TRUE\n" )
with open( input_dir, 'rb' ) as input_file:
    for line in input_file:
        line = line.decode( errors = 'replace' )
        match = re.match( regexp, line )
        if match is None:
            raise ValueError( 'Line %d does not start with a numeric PMID' % ( counter + 1 ) )
        lp = match.group( 0 )
        pmid = lp.rstrip()

        # Strip only the leading PMID prefix, not later occurrences of the same text
        abstract = line[ len( lp ): ].rstrip()

        input_seq = tokenizer.encode( abstract, add_special_tokens = True )

        # Shorten over-long abstracts one sentence at a time
        while len( input_seq ) >= max_len:
            abstract_split = nltk.sent_tokenize( abstract )

            if re.search( regexp_tag, abstract_split[ -1 ] ) is None:
                # The tag is not in the last sentence: drop the last sentence
                abstract = abstract.replace( abstract_split[ -1 ], '' )
            else:
                # The tag is in the last sentence: drop the first sentence instead
                abstract = abstract.replace( abstract_split[ 0 ], '' )

            input_seq = tokenizer.encode( abstract, add_special_tokens = True )

        inputs = tokenizer( abstract, return_tensors = "pt" ).to( 'cuda' )
        # Inference only: no labels and no gradient tracking are needed
        with torch.no_grad():
            outputs = model( **inputs )

        prediction = outputs.logits.softmax( dim = -1 ).tolist()

        print( str( pmid ), " ", prediction[ 0 ][ 0 ], " ", prediction[ 0 ][ 1 ], file = output_file )

        counter += 1
        f.value += 1 # signal to increment the progress bar

output_file.close()

In [None]:
print( 'Done' )