# Research Problem Extraction System

The basic idea is to solve the problem as a Named Entity Recognition (NER) task, since the research problems to be extracted are components of the source texts. <br>
For this purpose, a NER model will be trained first, for which the research questions from the training data will receive the entity 'RESEARCH_PROBLEM'. The sections of the test data for which this entity is recognized in the second step are predicted as research problems for the current paper.

In [1]:
import os
import re
import random
import pypickle as pypickle
import json
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


LABEL = 'RESEARCH_PROBLEM'

## 1. Training of the Model

First, the system iterats through the directories over all papers and finally the extracted data are made available in a data structure. This consists of an array in which each paper has a dictionary, which contains the path to the folder, the title, the abstract and the labeled research questions for the paper.

In [3]:
#Path to the training data
training_data_path = "training-data-master"

#Path to the test data
test_data_path = "test-data-master"

#Subpath to the json file containing the research questions
research_questions_file_subpath = "/info-units/research-problem.json"

In [4]:
"""
List all directories in a given directory
"""
def get_list_of_directories_in_directory(path:str):
    return [ directory for directory in os.listdir(path) if os.path.isdir(os.path.join(path, directory))]

In [5]:
"""
General data preprocessing
"""
def data_preprocessing(text:str):
    #lower all texts
    text = text.lower()
    
    return text

In [6]:
"""
Load the research-problem.json file for the current paper and extract the research problems
"""
def load_preprocess_problem(path:str):
    with open(path + research_questions_file_subpath, 'r') as research_problem_file:
        research_problem_file_data = json.load(research_problem_file)

    if isinstance(research_problem_file_data["has research problem"][0], list):
        #more then one research problems
        output_problems = [data_preprocessing(problem_array[0]) for problem_array in research_problem_file_data["has research problem"]]
    else:
        #only one research problem
        output_problems = research_problem_file_data["has research problem"][0]
        
    return output_problems

Observation about the structure of the paper of the provided output of the plaintext preprocessed by Stanza:<br>
- First line:  "titel"<br>
- Second line: [content of the title]<br>
- Third line:  "abstract"<br>
- Next lines:  [content of the title]<br>
- Then:        "introduction"

In [7]:
"""
Read the content of one paper and extrakt title and abstract
"""
def read_paper(path:str):
    #read the content of the paper by reading the preprocessed file
    content = data_preprocessing(open(path + "/" + [file for file in os.listdir(path) 
                                                         if re.match(".*Stanza-out.txt", file)][0], "r")
                                                         .read()).split("\n")
    #extract the title
    title = content[1]
    
    #extract the abstract
    abstract = ""
    current_line=3
    #the abstracts ends if the next line contains the word introduction
    while "introduction" not in content[current_line] and current_line < 10:
        abstract = abstract + "\n" + content[current_line]
        current_line += 1
        
    return title, abstract

In [8]:
"""
Find all paper in a sub-sub-directory of the given path, 
read the content and pass it into a dictionary-structure.
"""
def read_data(path:str):
    paper_data = []
    #iterate over all task folders 
    for task in get_list_of_directories_in_directory(training_data_path):
        task_path = training_data_path + "/" + task
        #iterate over all paper in the current task folder
        for paper in get_list_of_directories_in_directory(training_data_path + "/" + task):
            data = {}
            #save the path of the current paper folder
            data["path"] = task_path + "/" + paper
            #extract and save the title and the abstract of the current paper
            data["title"], data["abstract"] = read_paper(data["path"])
            #extract and save the research problems of the current paper
            data["research-problem"]=load_preprocess_problem(data["path"])
            #add the current paper dictionary to an array containg all paper
            paper_data.append(data)
            
    return paper_data

Load training data.

In [9]:
train_paper_data = read_data(training_data_path)

Load the test data set and split it in validation and test data.

In [10]:
validation_test_paper_data = read_data(training_data_path)
#split validation and test data in equal size
validation_size = int(len(validation_test_paper_data) * 0.5)
#randomly distribute test and validation data to prevent a bias with respect to the topics of the paper
#using a seed to create reproducible results
random.Random(0).shuffle(validation_test_paper_data)
#build validation data
validation_paper_data = validation_test_paper_data[validation_size:]
#build test data
test_paper_data = validation_test_paper_data[:validation_size]

Load the language model.

In [11]:
#load spacy transformer model
nlp = spacy.load("en_core_web_trf")
#disable unused pipeline elements
nlp.disable_pipes('ner', 'tagger', 'parser')

['ner', 'tagger', 'parser']

In [12]:
"""
Build a dictionary used for training the NER model, containing spans with entities taged as "RESEARCH_PROBLEM"
"""
def create_ner_dictionary(paper_data:dict):
    doc_bin = DocBin()
    ner = {'classes' : ['RESEARCH_PROBLEM'], 'annotations': []}
    for data in paper_data:
        data_dic = {}
        data_dic["text"] = data["title"]
        #apply the language model
        nlp_doc = nlp.make_doc(data_dic["text"] )
        data_dic['entities'] = []
        ner_entities = []
        #find spans of the research problems in the passed text
        for problem in data["research-problem"]:
            if problem in data["title"]:
                #find start position
                start_position = data["title"].index(problem)
                #find end position
                end_position = start_position + len(problem)
                #build span
                data_dic['entities'].append((start_position, end_position, LABEL))
                ner_span = nlp_doc.char_span(start_position, end_position, label=LABEL, alignment_mode="contract")
                if ner_span is not None:
                    ner_entities.append(ner_span)
            #remove duplicates or overlaps
            nlp_doc.ents = filter_spans(ner_entities)
            #add data to docBin
            doc_bin.add(nlp_doc)
        ner['annotations'].append(data_dic)
        ner['doc_bin'] = doc_bin
    return ner

Build NER dictionaries for training and validation data.

In [13]:
train_ner = create_ner_dictionary(train_paper_data)
validation_ner = create_ner_dictionary(validation_paper_data)

Save NER dictionaries

In [14]:
train_ner['doc_bin'].to_disk("training_data.spacy")
validation_ner['doc_bin'].to_disk("validation_data.spacy")

Initilaiz the basic configuration for training the model

In [15]:
!python -m spacy init fill-config base_config.cfg config.cfg

[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Train the model

In [16]:
!python -m spacy train config.cfg --output ./output --paths.train ./training_data.spacy --paths.dev ./validation_data.spacy

[38;5;4mℹ Saving to output directory: output[0m
[38;5;4mℹ Using CPU[0m
[1m
[2022-07-06 15:47:19,212] [INFO] Set up nlp object from config
[2022-07-06 15:47:19,220] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-07-06 15:47:19,223] [INFO] Created vocabulary
[2022-07-06 15:47:19,224] [INFO] Finished initializing nlp object
[2022-07-06 15:47:19,717] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     41.83    0.00    0.00    0.00    0.00
  2     200         69.50   1986.98   95.03   93.18   96.96    0.95
  6     400         93.40    262.92   97.59   99.30   95.95    0.98
 10     600        139.88    183.77   96.86   94.82   98.99    0.97
 15     800         84.67    146.23   97.18   95.44   98.9

Save the extracted information. (Only necessary for tests and reusage)

In [17]:
pypickle.save("train_paper_data", train_paper_data)
pypickle.save("validation_paper_data", validation_paper_data)
pypickle.save("test_paper_data", test_paper_data)

[pypickle] Pickle file saved: [train_paper_data]
[pypickle] Pickle file saved: [validation_paper_data]
[pypickle] Pickle file saved: [test_paper_data]


True

## Testing the Model

Loading the model

In [18]:
nlp_ner = spacy.load("output/model-best")

Loading the testdata. (Only necessary for tests and reusage)

In [19]:
test_paper_data = pypickle.load("test_paper_data")

[pypickle] Pickle file loaded: [test_paper_data]


Metric for testing the model:<br>
- detected research-problem is part of training data: true positive<br>
- detected research-problem is not part of training data: false positiv<br>
- training data contains a research-problem that is not detected: false negativ<br>

In [20]:
#count true positive
tp = 0
#count false positive
fp = 0
#count false negativ
fn = 0

for paper in test_paper_data:
    #apply the learned model on the current testdata
    doc = nlp_ner(paper["title"])
    detected_research_problems = [entity.text for entity in doc.ents]

    for detected_research_problem in detected_research_problems:
        if detected_research_problem in paper["research-problem"]:
            #correct predicted research problem
            tp += 1
        else:
            #wrong predicted research problem
            fp += 1
    for labeled_research_problem in paper["research-problem"]:
        if labeled_research_problem not in detected_research_problems:
            #labeled research problem was not predicted
            fn += 1

#calculate precision, recall and f1-score
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("precision: %f; recall: %f; f1-score %f" % (precision, recall, f1))

precision: 1.000000; recall: 0.331269; f1-score 0.497674
