This notebook is intended to be run in Google Colab.

When researchers submit a grant application, they also tell the grant agency which discipline the application belongs to. This choice is not always accurate and can play a decisive role in whether the grant is awarded. Applications can be assigned to suitable disciplines automatically by training a classification model on the application summaries, as shown below.

Reference I used for my case: BERT NLP Tutorial 2 - IMDB Movies Sentiment Analysis using BERT & TensorFlow 2 | NLP BERT Tutorial https://www.youtube.com/watch?v=sZdIybqppqQ

In [None]:
#ktrain is a lightweight wrapper for the deep learning library TensorFlow
!pip install ktrain

In [None]:
#the deep learning library TensorFlow
!pip install tensorflow

In [None]:
#importing required modules
import numpy as np
import pandas as pd
import tensorflow as tf
import ktrain
from ktrain import text

Developing a classification model requires data to train on and data to test on. In the following steps, a model is trained on the training data (summary and discipline per application) of a given grant round and then applied to the test data (summary and discipline per application) of the same grant round. I usually use half of the grant applications (26000) for training and the other half (26000) for testing, but this also depends on the number of applications.
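
The half-and-half split described above could be prepared along the following lines. This is only a sketch: the column names, the toy data and the helper `split_half` are assumptions for illustration, and the notebook itself reads two spreadsheets that have already been split.

```python
import pandas as pd

def split_half(df, seed=42):
    """Shuffle the rows and split them into two halves (train, test)."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    midpoint = len(shuffled) // 2
    return shuffled.iloc[:midpoint], shuffled.iloc[midpoint:]

# Hypothetical toy data standing in for the real application spreadsheet.
applications = pd.DataFrame({
    "application_number": range(1, 11),
    "summary": [f"summary text {i}" for i in range(1, 11)],
    "discipline": ["physics", "chemistry"] * 5,
})

train_part, test_part = split_half(applications)
print(len(train_part), len(test_part))
```

Shuffling before splitting avoids a train/test split that merely follows the submission order of the applications.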

In [None]:
#reading the application Excel file as TRAINING data with the following columns: application number, summary, discipline (selected by the applicant)
#The Excel file needs to be available in Colab at the path used below (/content/sample_data)
df_train=pd.read_excel('/content/sample_data/train_data.xlsx')

In [None]:
#Check
df_train.head()

In [None]:
#dropping the rows where the summary is missing (NaN)
df_train = df_train[df_train['summary'].notna()]

In [None]:
#reading the application Excel file as TESTING data with the following columns: application number, summary, discipline (selected by the applicant)
#The Excel file needs to be available in Colab at the path used below (/content/sample_data)
df_test=pd.read_excel('/content/sample_data/test_data.xlsx')

In [None]:
#test data to validate
df_test.head()

In [None]:
#dropping the nan values 
df_test = df_test[df_test['summary'].notna()]
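
Because the same cleaning step is applied to both the training and the test data, it can be wrapped in one small helper. The sketch below is an assumption rather than part of the original workflow: the function name `drop_missing_summaries` is hypothetical, and it additionally removes summaries that contain only whitespace, which `notna()` alone would keep.

```python
import pandas as pd

def drop_missing_summaries(df, text_column="summary"):
    """Return a copy without rows whose summary is NaN or only whitespace."""
    has_text = df[text_column].notna() & (df[text_column].astype(str).str.strip() != "")
    return df[has_text].reset_index(drop=True)

# Hypothetical toy data: two usable summaries, one missing, one blank.
toy = pd.DataFrame({
    "summary": ["Foam mobility in porous media", None, "   ", "Protein folding"],
    "discipline": ["engineering", "physics", "chemistry", "biology"],
})

cleaned = drop_missing_summaries(toy)
print(len(cleaned))
```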

In [None]:
#preprocessing the text for the BERT NLP model; maxlen=400 limits each summary to 400 tokens
(X_train, y_train), (X_test, y_test), preprocess = text.texts_from_df(
    train_df=df_train,
    text_column='summary',
    label_columns='discipline',
    val_df=df_test,
    maxlen=400,
    preprocess_mode='bert')
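
To judge whether `maxlen=400` fits the data, one can check what share of the summaries stays within that limit. The sketch below counts whitespace-separated words, which is only a rough proxy: BERT splits text into subword tokens, so the real token count is usually somewhat higher. The function name and the toy summaries are assumptions for illustration.

```python
def share_within_limit(summaries, maxlen=400):
    """Fraction of summaries whose whitespace-separated word count is <= maxlen."""
    if not summaries:
        return 0.0
    within = sum(1 for s in summaries if len(s.split()) <= maxlen)
    return within / len(summaries)

# Hypothetical toy summaries: two short ones and one of 500 words.
toy_summaries = ["short summary", "word " * 500, "another brief text"]
print(share_within_limit(toy_summaries, maxlen=400))
```

With the real data, the same function could be called on `df_train['summary'].tolist()` after the missing values have been dropped.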

In [None]:
#Checking the shape of X_train
X_train[0].shape

In [None]:
#building a model, preprocess mode with bert NLP model
model= text.text_classifier(name='bert', train_data=(X_train, y_train), preproc=preprocess)

In [None]:
#wrapping the model and the data in a learner; batch_size=6 means 6 summaries are processed at a time
learner = ktrain.get_learner(model=model, train_data=(X_train, y_train),val_data=(X_test, y_test), batch_size=6)

In [None]:
##this might take days to run
#learner.lr_find()
#learner.lr_plot()

#Optimal learning rate for this model is 2e-5

In [None]:
#training the model for one epoch with the one-cycle policy and a maximum learning rate of 2e-5
learner.fit_onecycle(lr=2e-5, epochs=1)
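
The idea behind a one-cycle policy is that the learning rate rises to a maximum and then falls again over the course of training. The sketch below illustrates this with a simplified triangular schedule. It is not ktrain's exact implementation, and the lower bound `min_lr` as well as the function name `one_cycle_lr` are assumptions chosen for illustration.

```python
def one_cycle_lr(step, total_steps, max_lr=2e-5, min_lr=2e-6):
    """Triangular schedule: linear ramp up to max_lr at the midpoint, then linear decay."""
    half = total_steps / 2
    if step <= half:
        fraction = step / half
    else:
        fraction = (total_steps - step) / half
    return min_lr + (max_lr - min_lr) * fraction

# Learning rate at each of 11 points of a hypothetical 10-step training run.
total = 10
schedule = [one_cycle_lr(s, total) for s in range(total + 1)]
for step, lr in enumerate(schedule):
    print(step, lr)
```

The peak is reached halfway through the run, and the schedule starts and ends at the lower bound.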

In [None]:
#building a predictor to predict 
predictor=ktrain.get_predictor(learner.model, preprocess)

Once the predictor has been built, it can predict the discipline of a given application from the summary of that application, as follows:

In [None]:
#Put the summary of the grant application whose discipline you want to classify here
#The summary is passed as a list containing one string
data = ['Foam generated with a surfactant solution and nitrogen is used for oil recovery, acid diversion and aquifer remediation. In laboratory experiments, the foam mobility is expressed in terms of the pressure drop across the porous medium and is related to many physical processes. There is lack of data that relate the pressure drop to a combination of three or more variables simultaneously. This paper investigates the steady state pressure drop for a combination of six variables, viz., permeability, surfactant concentration,pH, salinity, surfactant solution velocity and gas velocity.']

In [None]:
#prediction
predictor.predict(data)
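
Since the discipline chosen by the applicant is not always accurate, a natural follow-up is to list the applications where the predicted discipline differs from the chosen one. The sketch below assumes the predictions are available as a plain list of labels, one per summary. The helper `flag_mismatches` and the toy values are hypothetical.

```python
def flag_mismatches(application_numbers, chosen, predicted):
    """Return (application number, chosen, predicted) for every disagreement."""
    return [
        (number, c, p)
        for number, c, p in zip(application_numbers, chosen, predicted)
        if c != p
    ]

# Hypothetical toy values; in practice `predicted` would hold the labels
# returned by the predictor for the corresponding summaries.
numbers = [101, 102, 103]
chosen = ["physics", "chemistry", "engineering"]
predicted = ["physics", "engineering", "engineering"]

mismatches = flag_mismatches(numbers, chosen, predicted)
print(mismatches)  # [(102, 'chemistry', 'engineering')]
```

Flagged applications are candidates for manual review rather than automatic reassignment, because the model itself can be wrong.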

In [None]:
#saving the predictor to the folder
predictor.save('/content/bert')