This notebook is meant to be run in Google Colab.

When researchers submit a grant application, they also tell the grant agency which discipline the application belongs to. This choice is not always accurate and could play a decisive role in whether the grant is awarded. Applications can be assigned to suitable disciplines automatically by training a classification model on the summaries of the applications, as shown below.

Tutorial I used as a reference: Keith Galli - Natural Language Processing (NLP) in Python - From Zero to Hero
https://www.youtube.com/watch?v=vyOgWhwUmec

In [None]:
#using spaCy pipelines for pretrained BERT
!pip install spacy-transformers


In [None]:
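#downloading the pretrained BERT model for spaCy
#note: this model name belongs to spaCy 2.x / spacy-transformers 0.x; newer releases use other model names (see https://spacy.io/models)
!python -m spacy download en_trf_bertbaseuncased_lg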

In [None]:
#importing useful modules
import spacy
import torch #for deep learning
import pandas as pd

The development of a classification model needs data to train on and data to test on. In the steps below, a model is trained on the training data (summary and discipline per application) of a given grant round and then applied to the test data (summary and discipline per application) of the same grant round. I usually use half of the grant applications (26000) for training and the other half (26000) for testing, but the split also depends on the number of applications.
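
A minimal sketch of the half-and-half split described above, assuming all applications of a round sit in one DataFrame with the `summary` and `discipline` columns used in this notebook. The toy data and the names `df_all`, `df_train_part` and `df_test_part` are made up for illustration; `train_test_split` comes from scikit-learn and is one possible way to split, not necessarily the way the Excel files in this notebook were produced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for one grant round (hypothetical data)
df_all = pd.DataFrame({
    'Application number': range(1, 9),
    'summary': [f'summary text {i}' for i in range(1, 9)],
    'discipline': ['physics', 'biology'] * 4,
})

# Half for training, half for testing; stratify keeps the discipline
# proportions similar in both halves, random_state makes the split repeatable
df_train_part, df_test_part = train_test_split(
    df_all, test_size=0.5, stratify=df_all['discipline'], random_state=0
)

print(len(df_train_part), len(df_test_part))
```

The two halves could then be written to `train_data.xlsx` and `test_data.xlsx` with `DataFrame.to_excel`, or used directly in place of the files read below.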

In [None]:
#reading the application Excel file as TRAINING data with the following columns: application number, summary, discipline (selected by the applicant)
#the Excel file needs to be in your Google Drive, which has to be mounted under /content/drive first
from google.colab import drive
drive.mount('/content/drive')
df=pd.read_excel('/content/drive/My Drive/train_data.xlsx')

In [None]:
#dropping the rows without a summary (NaN values)
df = df[df['summary'].notna()]
#conversion of columns to lists
train_x=df['summary'].to_list()
train_y=df['discipline'].to_list()
#making sure every summary and discipline is a string
train_x = [str(item) for item in train_x]
train_y = [str(item) for item in train_y]

In [None]:
#loading the spacy model (Check https://spacy.io/models)
nlp = spacy.load('en_trf_bertbaseuncased_lg')

In [None]:
#using NLP engine on the summary of the grant applications
docs=[nlp(text) for text in train_x]

In [None]:
#taking one vector per summary (doc.vector) as the input features for the classifier
train_x_word_vectors=[x.vector for x in docs]

In [None]:
#fitting a support vector classifier on the summary vectors
from sklearn import svm
#alternative with an RBF kernel; you can play a little with the parameters here to get better results
#C = 1.0  # SVM regularization parameter
#clf_svm_wv=svm.SVC(kernel='rbf', gamma=0.7, C=C)
clf_svm_wv=svm.SVC(kernel='linear')
clf_svm_wv.fit(train_x_word_vectors, train_y)
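
The commented-out RBF line suggests trying other kernels and parameters. One possible way to do that systematically is a small cross-validated grid search with scikit-learn's `GridSearchCV`. The sketch below uses random synthetic vectors in place of the BERT vectors, and the parameter grid is only an illustrative assumption, not a recommendation.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train_x_word_vectors / train_y (hypothetical data):
# two well-separated clusters of 16-dimensional vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 16)), rng.normal(3.0, 1.0, (20, 16))])
y = ['physics'] * 20 + ['biology'] * 20

# Candidate settings; the values are illustrative assumptions
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1.0, 10.0]},
    {'kernel': ['rbf'], 'C': [0.1, 1.0, 10.0], 'gamma': ['scale', 0.01]},
]

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `X` and `y` would be `train_x_word_vectors` and `train_y`, and `search.best_estimator_` would be a fitted classifier that could be used in place of `clf_svm_wv`.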

In [None]:
#reading the application Excel file as TEST data (same columns as the training data)
df_test=pd.read_excel('/content/drive/My Drive/test_data.xlsx')
#dropping the rows without a summary (NaN values)
df_test = df_test[df_test['summary'].notna()]
#conversion of columns to lists
test_x=df_test['summary'].to_list()
#making sure every summary is a string
test_x = [str(item) for item in test_x]

In [None]:
#applying the same steps as before to the test data
test_docs=[nlp(text) for text in test_x]
test_x_word_vectors=[x.vector for x in test_docs]
#prediction by the word vector model for the summaries in the test data
output_wordvector=clf_svm_wv.predict(test_x_word_vectors)

In [None]:
#creating a summary dataframe to compare the discipline predicted by the word vector model with the discipline given by the applicant
df_summary = pd.DataFrame({
    'Application number': df_test['Application number'],
    'given_disci': df_test['discipline'],
    'wordvector_disci': output_wordvector,
})

In [None]:
#quick check of how many predictions match the discipline given by the applicant
total=len(df_summary)
#the given disciplines are compared as strings because the model was trained on string labels
correct=int((df_summary['given_disci'].astype(str)==df_summary['wordvector_disci']).sum())
wrong=total-correct
print("total " + str(total))
print("correct "+ str(correct))
print("wrong "+ str(wrong))
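
As an alternative to counting by hand, scikit-learn's metrics can summarise the same comparison and also break it down per discipline. A minimal sketch on a toy frame with the same column names as `df_summary` (the rows and the name `df_demo` are invented for illustration):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

# Toy stand-in for df_summary (hypothetical rows)
df_demo = pd.DataFrame({
    'given_disci': ['physics', 'biology', 'physics', 'chemistry'],
    'wordvector_disci': ['physics', 'biology', 'chemistry', 'chemistry'],
})

# Fraction of applications where the prediction equals the given discipline
accuracy = accuracy_score(df_demo['given_disci'], df_demo['wordvector_disci'])
print(accuracy)

# Precision, recall and F1 per discipline
print(classification_report(df_demo['given_disci'], df_demo['wordvector_disci'],
                            zero_division=0))
```

Because the discipline chosen by the applicant is not always accurate, a disagreement in this comparison is not necessarily a mistake of the model.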

In [None]:
#saving the dataframe as an Excel file
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl import Workbook
from openpyxl.styles import PatternFill, Font
from openpyxl.formatting.rule import FormulaRule

wb = Workbook()
ws = wb.active

ws.title='Discipline classification'

for r in dataframe_to_rows(df_summary, index=False, header=True):
    ws.append(r)

#conditional formatting: rows where the given (column B) and the predicted (column C) discipline are equal are marked in red
#(use '$B2<>$C2' instead to mark the mismatches)
red_text = Font(color="9C0006")
redFill = PatternFill(bgColor="FFC7CE")
last_row = len(df_summary) + 1 #header row plus one row per application
ws.conditional_formatting.add(f'B2:C{last_row}', FormulaRule(formula=['$B2=$C2'], stopIfTrue=False, font=red_text, fill=redFill))
#saving the Excel workbook
Exportfile_name=input('Give export file name ') #write the name of the file with extension .xlsx
wb.save(Exportfile_name)