# PART I (Data creation)


# 1. Introduction

This notebook contains the data creation for the Kaggle challenge "Coleridge Initiative: Show US the Data" (https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/). The challenge is about recognizing public datasets used in scientific papers. In particular, we want to extract dataset mentions from scientific papers using several NLP approaches. In this project, we test both BERT and SciBERT. BERT was introduced by Devlin, J., Chang, M. W., Lee, K., and Toutanova, K. in 2018 [1]. Source code of BERT can be found [here](https://github.com/google-research/bert). SciBERT was introduced by Beltagy, I., Lo, K., and Cohan, A. in 2019 [2]. Source code of SciBERT can be found [here](https://github.com/allenai/scibert).

Furthermore, we extend the existing data with a specialized corpus for dataset tagging. TDMSci is a corpus of data annotated for tasks, datasets and metrics [3]. Here, B-DATASET and I-DATASET are the NER labels indicating that a word is (part of) a dataset. Source code (and annotated data) of TDMSci can be found [here](https://github.com/IBM/science-result-extractor).


We have created three notebooks: one for **dataset creation** (Part I), one for **training** ([Part IIa]() and [Part IIb]()) and one for **testing** ([Part III]()). This notebook creates the data used by our method.


[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.  
[2] Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.  
[3] Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2021). TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. arXiv preprint arXiv:2101.10273.



# 2. Preparing Notebook

In [1]:
!pip install datasets --no-index --find-links=file:///kaggle/input/coleridge-packages/packages/datasets 
!pip install ../input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl 
!pip install ../input/coleridge-packages/tokenizers-0.10.1-cp37-cp37m-manylinux1_x86_64.whl 
!pip install ../input/coleridge-packages/transformers-4.5.0.dev0-py3-none-any.whl 
!pip install datasets 
!pip install --upgrade fsspec
!pip install flair

Looking in links: file:///kaggle/input/coleridge-packages/packages/datasets
Processing /kaggle/input/coleridge-packages/packages/datasets/datasets-1.5.0-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/huggingface_hub-0.0.7-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/xxhash-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/tqdm-4.49.0-py2.py3-none-any.whl
Installing collected packages: tqdm, xxhash, huggingface-hub, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.59.0
    Uninstalling tqdm-4.59.0:
      Successfully uninstalled tqdm-4.59.0
Successfully installed datasets-1.5.0 huggingface-hub-0.0.7 tqdm-4.49.0 xxhash-2.0.0
Processing /kaggle/input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
Processing /kaggle/input/coleridge-packages/tokenizers-

In [2]:
#Import necessary libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import re
import os
from os import listdir
from os.path import isfile, join
import json
import time
import datetime
import random
import glob
import importlib
import allennlp
import numpy as np
from transformers import *
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from flair.datasets import ColumnCorpus
from sklearn.utils import shuffle


  '"sox" backend is being deprecated. '


In [3]:
#Initialize paths for data
path_abs = '/kaggle/input/coleridgeinitiative-show-us-the-data/'
path_train = os.path.join(path_abs,'train/')
path_train_metadata = os.path.join(path_abs, 'train.csv')
path_test = os.path.join(path_abs, 'test/')
path_sample_submission = os.path.join(path_abs, 'sample_submission.csv')

path_abs_tdmsci = '/kaggle/input/tdmsci/'
path_test_tdmsci = os.path.join(path_abs_tdmsci, 'test_500_v2.txt')
path_train_tdmsci = os.path.join(path_abs_tdmsci,'train_1500_v2.txt')
path_train_nerjson = '/kaggle/working/train_ner.json'


# 3. Get to know the data

Here, we load the provided train and test data and look into the provided labels.
In total, there are 14316 papers for training purposes and 4 papers for testing purposes.

## 3.1. Load data

In [5]:
#Load metadata
label_df = pd.read_csv(path_train_metadata)
label_df = label_df.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()

sample_submission = pd.read_csv(path_sample_submission)
#Inspect data
label_df

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label
0,0007f880-0a9b-492d-9a58-76eb0b0e0bd7,The Impact of ICT Training on Income Generatio...,Program for the International Assessment of Ad...,Program for the International Assessment of Ad...,program for the international assessment of ad...
1,0008656f-0ba2-4632-8602-3017b44c2e90,Finnish Ninth Graders’ Gender Appropriateness ...,Trends in International Mathematics and Scienc...,Trends in International Mathematics and Scienc...,trends in international mathematics and scienc...
2,000e04d6-d6ef-442f-b070-4309493221ba,Economic Research Service: Specialized Agency...,Agricultural Resource Management Survey,Agricultural Resources Management Survey,agricultural resources management survey
3,000efc17-13d8-433d-8f62-a3932fe4f3b8,Risk factors and global cognitive status relat...,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI|Alzheimer's Disease Neuroimaging Initiati...,adni|alzheimer s disease neuroimaging initiati...
4,0010357a-6365-4e5f-b982-582e6d32c3ee,Timelines of COVID-19 Vaccines,SARS-CoV-2 genome sequence,genome sequence of COVID-19,genome sequence of covid 19
...,...,...,...,...,...
14311,ffd19b3c-f941-45e5-9382-934b5041ec96,Water quality of the Mississippian carbonate a...,Census of Agriculture,Census of Agriculture,census of agriculture
14312,ffd4d86a-0f26-44cc-baed-f0e209cc22af,A Spherical Brain Mapping of MR Images for the...,Alzheimer's Disease Neuroimaging Initiative (A...,Alzheimer's Disease Neuroimaging Initiative (A...,alzheimer s disease neuroimaging initiative adni
14313,ffe7f334-245a-4de7-b600-d7ff4e28bfca,COVID-19 and Possible Pharmacological Preventi...,SARS-CoV-2 genome sequence,genome sequences of SARS-CoV-2,genome sequences of sars cov 2
14314,ffeb3568-7aed-4dbe-b177-cbd7f46f34af,Abandoning mathematics. Reconstructing the pro...,Trends in International Mathematics and Scienc...,Trends in International Mathematics and Scienc...,trends in international mathematics and scienc...


In [6]:
#Create dataframe with labels
label_df = label_df[['Id','cleaned_label']]


In [7]:
filenames_train = [f for f in listdir(path_train) if isfile(join(path_train, f))]
sample_filename = filenames_train[0] 

with open(path_train + sample_filename) as f:
    sample_dict = json.load(f)
    
print(sample_dict[0].keys())
print(len(filenames_train))

dict_keys(['section_title', 'text'])
14316


# 4. Load train data (and apply basic NLP)

Here, we load the provided train data and apply basic NLP methods to it: normalizing the sentences and shortening them. Normalization replaces every character that is not a letter or digit (punctuation such as `+`, `/`, `.`, `\`, `-`) with a space. Shortening is needed because BERT accepts at most 512 tokens per input sequence. We split each sentence into words and cut it into windows of at most 64 words, where consecutive windows overlap by 20 words.

Because we deal with a large amount of training data (14316 papers, each with many sentences), we create the training data in batches of 5000 papers. Because the provided test set is very small, we hold out the first 100 papers of the training data for validation purposes.
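
To make the windowing concrete, the following minimal sketch computes where the 64-word windows with a 20-word overlap start and end for a sentence of a given length. The helper name `window_spans` is hypothetical and only serves as an illustration; it mirrors the step size `MAX_LENGTH - OVERLAP` used in this notebook.

```python
MAX_LENGTH = 64  # maximum number of words per window (value used in this notebook)
OVERLAP = 20     # number of words shared by consecutive windows (value used in this notebook)

def window_spans(n_words, max_length=MAX_LENGTH, overlap=OVERLAP):
    """Return the (start, end) word offsets of the windows cut from one sentence.

    Hypothetical helper, for illustration only.
    """
    if n_words <= max_length:
        return [(0, n_words)]  # short sentences are kept as a single window
    step = max_length - overlap  # each new window starts 44 words after the previous one
    return [(start, min(start + max_length, n_words))
            for start in range(0, n_words, step)]

print(window_spans(150))
# [(0, 64), (44, 108), (88, 150), (132, 150)]
```

Note that the last window of this 150-word example, `(132, 150)`, lies entirely inside the previous one. Such trailing windows only repeat words that are already covered.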

In [10]:
MAX_LENGTH = 64
OVERLAP = 20

def clean_paper_sentence(sentence):
    """
    Input: sentence (string), Returns: sentence (string)
    This function is essentially clean_text without lowercasing.
    """
    sentence = re.sub('[^A-Za-z0-9]+', ' ', str(sentence)).strip()
    sentence = re.sub(' +', ' ', sentence)
    return sentence

def shorten_sentences(sentences):
    """
    Input: sentences (list), Returns: short_sentences (list)
    
    Sentences that have more than MAX_LENGTH words will be split
    into multiple sentences with overlappings.
    """
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p+MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences


def concatenate_text(json_dict):
    '''
    Input: json_dict (list of section dictionaries), Returns: sentences (list)
    
    Concatenate the text of all sections and split it into sentences, as the BERT (and SciBERT) model
    has the constraint of a maximum sequence length of 512.
    '''
    total_text = ""
        
    for section_dict in json_dict:
        total_text += section_dict['text'] + '\n'
    #sentences = nltk.tokenize.sent_tokenize(total_text) # This seems to take a lot of time
    sentences = re.split(r'\. ', total_text) #Raw string avoids an invalid escape sequence warning
    sentences = [clean_paper_sentence(s) for s in sentences]
   
    sentences = shorten_sentences(sentences)
    sentences = [sentence for sentence in sentences if len(sentence) > 10] #Drop fragments of 10 characters or fewer

    return sentences

def create_train_df(path, start_batch, end_batch):
    '''
    Arguments: path (string), start_batch (int), end_batch (int)
    Returns: df (dataframe) with one row per paper: its Id and its list of sentences
    '''
    #Initialize dictionary with right keys
    final_dict = dict()
    final_dict["Id"] = []
    final_dict["Sentences"] = []
    
    for root, _, files in os.walk(path):
        files.sort()
        files = files[start_batch:end_batch]
        
        for filename in files:
            paper_id = filename[:-5] #Remove .json from filename to retrieve id
            #Open the file relative to the directory being walked, not the global path_train
            with open(os.path.join(root, filename)) as f:
                json_dict = json.load(f)
            sentences = concatenate_text(json_dict)
            final_dict["Id"].append(paper_id)
            final_dict["Sentences"].append(sentences)
    df = pd.DataFrame.from_dict(final_dict)

    return df




# 5. Create dataset

We see that the training data now consists of an Id (of a paper) with a list of corresponding sentences. We need to preprocess this further in order to apply NER. As we are only interested in datasets, we annotate the data in the same way as the authors who introduced TDMSci [1]. We use the same annotation format as the [CoNLL-2003 dataset](https://www.clips.uantwerpen.be/conll2003/ner/).  

This means we can have two types of NER tags: _B-types_ and _I-types_. The first type marks the first word of a named entity (or phrase). The second type marks the remaining words of a named entity (or phrase). A word that is not part of any entity (or phrase) is labelled 'O'. For this task, we use three different NER tags: **B-DATASET**, **I-DATASET** and **O**.  

This boils down to labelling every word not being (part of) a dataset as **O**, whilst datasets are being tagged with either **B-DATASET**, or **I-DATASET**.
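
As an illustration of this tagging scheme, the sketch below tags one example sentence against one cleaned label. The helper `bio_tags` and the example sentence are hypothetical and simplified (only the first mention is tagged); they are not the functions used later in this notebook.

```python
def bio_tags(tokens, label_tokens):
    """Tag the first occurrence of label_tokens in tokens (case-insensitive).

    Hypothetical helper, for illustration only.
    """
    tags = ['O'] * len(tokens)  # default: token is not part of a dataset
    n = len(label_tokens)
    if n == 0:
        return list(zip(tokens, tags))
    lowered = [token.lower() for token in tokens]
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == label_tokens:
            tags[i] = 'B-DATASET'          # first word of the mention
            for j in range(i + 1, i + n):
                tags[j] = 'I-DATASET'      # remaining words of the mention
            break
    return list(zip(tokens, tags))

sentence = 'We use the Census of Agriculture data'
for token, tag in bio_tags(sentence.split(), 'census of agriculture'.split()):
    print(f'{token}\t{tag}')
# We            O
# use           O
# the           O
# Census        B-DATASET
# of            I-DATASET
# Agriculture   I-DATASET
# data          O
```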

[1] Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2021). TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. arXiv preprint arXiv:2101.10273.


We first apply NER in combination with SciBERT. For this purpose, we need the [AutoModelForTokenClassification](https://huggingface.co/transformers/model_doc/auto.html#automodelfortokenclassification) class with pretrained SciBERT weights. The class is available through the `transformers` import at the beginning of this notebook.

## 5.1. Preprocess data

In [13]:

def find_dataset_indices(tokens, label_tokens):
    '''
    Arguments: tokens (list), label_tokens (list)
    Returns: dataset_indices (list)
    
    Gets the indices of a mentioned dataset in a list of words.
    If the dataset is mentioned more than once, the indices of the last mention are returned.
    '''
    dataset_indices = []
    for i in range(0, len(tokens)):
        if(tokens[i] == label_tokens[0] and tokens[i:i+len(label_tokens)]== label_tokens):
            dataset_indices = [*range(i, i+len(label_tokens))]
    return dataset_indices

def tag_sentence(sentence, tokenizer, datasets, words_with_label, counter_sent):
    '''
    Arguments: sentence (string), tokenizer (unused), datasets (list), words_with_label (list), counter_sent (int)
    Returns: tagged tokens (list of (token, tag) tuples), found_dataset (boolean), words_with_label (list)
    
    Sentences are tagged with either 'B-DATASET', 'I-DATASET' or 'O', based on whether a dataset in [datasets] is mentioned
    in [sentence].
    
    NB: tokenizer and words_with_label are included for testing purposes and can be ignored.
    '''
    tokens = sentence.split() #Tokenize sentence
    assert(len(tokens) < 512)
    NER_sent = ['O'] * len(tokens) #Default: no token is part of a dataset
    lower_sentence = sentence.lower()
    lower_tokens = [token.lower() for token in tokens]

    #A sentence is positive if any cleaned label occurs in it as a substring
    found_dataset = any(dataset in lower_sentence for dataset in datasets)

    if found_dataset:
        for dataset in datasets:
            tokens_dataset = dataset.split() #Tokenize dataset label
            dataset_indices = find_dataset_indices(lower_tokens, tokens_dataset)
            #Only touch the matched span, so tags set for an earlier label are not reset to 'O'
            for position, token in enumerate(dataset_indices):
                tag = 'B-DATASET' if position == 0 else 'I-DATASET'
                NER_sent[token] = tag
                words_with_label.append((tokens[token], tag, counter_sent))

    return list(zip(tokens, NER_sent)), found_dataset, words_with_label

def preprocess_data_ner(data_df, label_df, tokenizer):
    '''
    Arguments: data_df (dataframe), label_df (dataframe), tokenizer (unused; passed on to tag_sentence)
    Returns: dict_final (dictionary), words_with_label (list), count_postives (int), count_negatives (int), counter_sent (int)
    
    Adds NER-labels to each token of data_df based on the following rules:
       - IF token is start of dataset THEN tag =  'B-DATASET'
       - IF token is not start but part of dataset THEN tag =  'I-DATASET'
       - IF token is not part of dataset THEN tag = 'O'
    
    All positive sentences are included in the final dictionary [dict_final].
    Hard negatives are sentences that contain the word 'data' or 'study' but do not actually mention a dataset.
    A random 50% of them are sampled, and per paper at most as many hard negatives as positives are kept.
    count_negatives counts the sampled hard negatives before this per-paper cap.
    
    NB: words_with_label is included for testing purposes and can be ignored.
    '''
     
    #Initalize and prepare dictionary
    dict_final = dict()
    dict_final['Id'] = []
    dict_final['Sentences'] = []
    dict_final['NER-labels'] = []
    
    words_with_label = [] # For checking purposes
    count_postives = 0
    count_negatives = 0
    counter_sent = 0

    for _, row in data_df.iterrows():
            key_id = row['Id'] #Access columns by name instead of by position
            dict_final['Id'].append(key_id)
            sentences = row['Sentences']
            dict_final['Sentences'].append(sentences)
            datasets = re.split(r'\|', list(label_df.loc[label_df['Id'] == key_id]["cleaned_label"])[0]) #Look up labels based on id

            NER_labels = []
            NER_labels_NEG = []
            
#             random_sentence_id = random.randrange(0, len(sentences))
            #Add NER-labels to each token of each sentence
            for sent_id, sentence in enumerate(sentences):
                
                
                NER_sent, dataset_found, words_with_label = tag_sentence(sentence, tokenizer, datasets, words_with_label, counter_sent)
                if dataset_found:
                                  
                    count_postives += 1
                    NER_labels.append(NER_sent)
                elif ( any(word in sentence.lower() for word in ['data', 'study'])): 
                   #Randomly remove half of the hard negatives

                    if random.choice([False, True]):
                        NER_labels_NEG.append(NER_sent)
                        count_negatives+= 1

                
                counter_sent += 1
            
            random.shuffle(NER_labels_NEG)
            NER_labels+= NER_labels_NEG[0:len(NER_labels)]
            dict_final['NER-labels'].append(NER_labels)
                        
            
    return dict_final, words_with_label, count_postives, count_negatives, counter_sent
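
The per-paper selection of hard negatives can be sketched in isolation as follows. This is a simplified illustration with hypothetical names (`select_sentences`, placeholder sentence strings) and an arbitrary seed; it reproduces the two steps of the function above: a coin flip per hard negative, followed by a cap at the number of positives.

```python
import random

def select_sentences(positives, hard_negatives, rng):
    """Keep all positives plus a capped random subset of the hard negatives.

    Hypothetical helper, for illustration only.
    """
    # Step 1: keep each hard negative with probability 0.5
    sampled = [s for s in hard_negatives if rng.choice([False, True])]
    # Step 2: shuffle and keep at most as many negatives as there are positives
    rng.shuffle(sampled)
    return positives + sampled[:len(positives)]

rng = random.Random(0)  # the seed is an arbitrary choice for reproducibility
positives = ['pos_1', 'pos_2']
hard_negatives = [f'neg_{i}' for i in range(10)]
selected = select_sentences(positives, hard_negatives, rng)
print(len(selected))  # never more than 2 * len(positives)
```

Because of the cap, the number of negatives that end up in the training file can be lower than the number of sampled negatives reported per batch.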



## 5.2. Save preprocessed data

Now we save the preprocessed training data. We can load this data into our project to use it in the future.

In [16]:
def save_train_ner(filename, column, ids):
    ''' 
    Arguments: filename (string), column (pandas Series of tagged sentences), ids (pandas Series of paper ids)
    Returns: Nothing
    
    Appends the (batch of the) training data to the file [filename], one JSON object per line, in a format usable for BERT training and testing.
    '''
    
    with open(filename, 'a+') as f:
        for row_i in range(len(column)):
            if(column[row_i] != []):
                for sentence in column[row_i]:      
                    if(sentence != []):
                        
                        words, nes = list(zip(*sentence))
                        assert(len(words) == len(nes))
                
                        if(len(words) > 512):
                            print("uhhoh words")
                            print(ids[ row_i])
                        row_json = {'tokens' : words, 'tags' : nes, 'id': ids[row_i]}
                       
                        json.dump(row_json, f)
                        f.write('\n')
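
The resulting file holds one JSON object per line, with the keys `tokens`, `tags` and `id`. The minimal sketch below writes and reads back one such record; it uses an in-memory buffer instead of a file, and the example sentence and id are made up for illustration.

```python
import io
import json

# Hypothetical tagged sentence in (token, tag) format
sentence = [('the', 'O'), ('Census', 'B-DATASET'), ('of', 'I-DATASET'), ('Agriculture', 'I-DATASET')]
words, tags = zip(*sentence)

buffer = io.StringIO()  # stands in for the output file
json.dump({'tokens': words, 'tags': tags, 'id': 'example-id'}, buffer)
buffer.write('\n')

# Reading back: one json.loads call per line
buffer.seek(0)
records = [json.loads(line) for line in buffer]
print(records[0]['tokens'])  # the tuples are stored as JSON arrays and come back as lists
```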


# 6. Create the training data in batches

Here, the actual training datasets are created in batches of 5000 articles to prevent OOM errors.
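
The batch boundaries follow from simple arithmetic, sketched below with a hypothetical helper `batch_bounds`. It assumes, as in this notebook, that the first 100 papers are skipped and that the upper bound of the last batch may exceed the number of papers, since list slicing truncates it.

```python
def batch_bounds(n_items, batch_size, start=100):
    """Yield (start, end) slice bounds; the final end may exceed n_items.

    Hypothetical helper, for illustration only.
    """
    while start < n_items:
        yield start, start + batch_size
        start += batch_size

files = list(range(14316))  # stand-in for the sorted list of 14316 paper files
for lower, upper in batch_bounds(len(files), 5000):
    print(lower, upper, len(files[lower:upper]))
# 100 5100 5000
# 5100 10100 5000
# 10100 15100 4216
```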

In [17]:
# Load TDMSci data

from flair.data import Corpus
from flair.datasets import ColumnCorpus

# Make flair column corpus with TDMSci data
def read_tdmsci(filename):
    '''
    Arguments: filename (string)
    Returns: corpus1 (<ColumnCorpus>)
    
    Imports the TDMSci dataset in the ColumnCorpus format.
    '''
    # define columns
    columns = {0: 'text', 1: 'pos', 2: 'ner'} 

    # init a corpus using column format, data folder and the names of the train, dev and test files
    corpus1: Corpus = ColumnCorpus(path_abs_tdmsci, columns,
                                   train_file=filename,
                                   test_file=filename,
                                   dev_file=filename)
    return corpus1

# Get tuple of text and NER-tag for TDMSci input token
def get_tdmsci_tuple(token):
    return (token.text, token.annotation_layers['ner'][0].value)

# Process TDMSci data
def tdmsci_to_df(filename, tdmsci_id):
    '''
    Arguments: filename (string), tdmsci_id (string)
    Returns: df_final (dataframe)
    
    Creates training data from the TDMSci dataset, in a format usable for BERT training and testing.
    Sentences that contain a TASK or METRIC tag are skipped, so only DATASET and O tags remain.
    '''
    tdmsci_corpus = read_tdmsci(filename)
    dict_final = dict()
    dict_final['Id'] = []
    dict_final['Sentences'] = []
    dict_final['NER-labels'] = []
    for sent in tdmsci_corpus.train:
        intermediate_NER_labels = []
        for tok in sent:
            (token, tag) = get_tdmsci_tuple(tok)
            intermediate_NER_labels.append((token, tag))
        sentence = sent.to_plain_string()
        sentence = clean_paper_sentence(sentence)
        sentences = shorten_sentences([sentence])
        allTags = [x[1] for x in intermediate_NER_labels]
        if(len(sentence) > 10):
            if(np.all([x not in allTags for x in ['B-METRIC', 'B-TASK', 'I-METRIC', 'I-TASK']])):
                dict_final['Id'].append(tdmsci_id)
                dict_final['Sentences'].append(sentences)
                dict_final['NER-labels'].append([intermediate_NER_labels])
    df_final = pd.DataFrame.from_dict(dict_final)
    return df_final
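
The filter on TDMSci sentences can be illustrated without flair. The sketch below uses made-up tagged sentences and a hypothetical helper `keep_for_dataset_ner`; it keeps a sentence only if none of its tokens carries a TASK or METRIC tag, so that the combined training data contains only the labels B-DATASET, I-DATASET and O.

```python
EXCLUDED_TAGS = {'B-TASK', 'I-TASK', 'B-METRIC', 'I-METRIC'}

def keep_for_dataset_ner(tagged_sentence):
    """Return True if no token carries a TASK or METRIC tag.

    Hypothetical helper, for illustration only.
    """
    return all(tag not in EXCLUDED_TAGS for _, tag in tagged_sentence)

# Made-up tagged sentences, not taken from TDMSci
examples = [
    [('We', 'O'), ('train', 'O'), ('on', 'O'), ('ImageNet', 'B-DATASET')],
    [('We', 'O'), ('report', 'O'), ('F1', 'B-METRIC'), ('on', 'O'), ('SQuAD', 'B-DATASET')],
]
kept = [sentence for sentence in examples if keep_for_dataset_ner(sentence)]
print(len(kept))  # 1: the sentence with a METRIC tag is dropped
```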







In [20]:
def make_training_data(batch_size):
    '''
    Arguments: batch_size (int)
    Returns: Nothing
    
    Creates training data suitable for executing NER with the BERT model, with addition of TDMSci data.
    The first 100 training papers are skipped; the remaining papers are processed in batches of [batch_size]
    and appended to train_ner.json. The TDMSci data is added to the last batch.
    '''
    N_train = len(filenames_train)
    start_idx = 100
    
    #Empty file
    with open('train_ner.json', 'w') as f:
              f.write('')
    
    
    while((start_idx + batch_size) < (N_train + batch_size)):
        print(f'Batch {start_idx} through {start_idx + batch_size} of {N_train}')
        #Create train data
        train_df_ner = create_train_df(path_train, start_idx, (start_idx + batch_size))
        print(f'Batch size: {len(train_df_ner)}')

        #Preprocess train data for NER classification
        train_preprocessed, words_with_label, count_postives, count_negatives, counter_sent = preprocess_data_ner(train_df_ner, label_df, True)
        print(f'There are {count_postives} positive sentences and {count_negatives} negative sentences out of {counter_sent} sentences')
        train_preprocessed_df = pd.DataFrame.from_dict(train_preprocessed)
        
        #Append TDMSci NER to Coleridge dataset
        if((start_idx + batch_size) > N_train):
            
            train_tdmsci_df = tdmsci_to_df(path_train_tdmsci, 'TDMSCI_train')
            test_tdmsci_df = tdmsci_to_df(path_test_tdmsci, 'TMDSCI_test')
            
            #Append to Coleridge data
            print(f'Adding TDMSci data (size: {(len(train_tdmsci_df["Sentences"])+len(test_tdmsci_df["Sentences"]))})')
            #pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            train_preprocessed_df = pd.concat([train_preprocessed_df, train_tdmsci_df, test_tdmsci_df], ignore_index=True)
        
        train_preprocessed_df = shuffle(train_preprocessed_df)
        #Append batch to final json file
        save_train_ner('train_ner.json', train_preprocessed_df["NER-labels"], train_preprocessed_df["Id"])
        start_idx = start_idx + batch_size
        
        
       

In [22]:
make_training_data(5000)

Batch 100 through 5100 of 14316
Batch size: 5000
There are 19252 positive sentences and 88353 negative sentences out of 1378229 sentences
Batch 5100 through 10100 of 14316
Batch size: 5000
There are 19346 positive sentences and 87899 negative sentences out of 1335900 sentences
Batch 10100 through 15100 of 14316
Batch size: 4216
There are 15163 positive sentences and 79173 negative sentences out of 1156448 sentences
2021-06-06 17:47:40,619 Reading data from /kaggle/input/tdmsci
2021-06-06 17:47:40,621 Train: /kaggle/input/tdmsci/train_1500_v2.txt
2021-06-06 17:47:40,621 Dev: /kaggle/input/tdmsci/train_1500_v2.txt
2021-06-06 17:47:40,622 Test: /kaggle/input/tdmsci/train_1500_v2.txt
2021-06-06 17:47:48,289 Reading data from /kaggle/input/tdmsci
2021-06-06 17:47:48,290 Train: /kaggle/input/tdmsci/test_500_v2.txt
2021-06-06 17:47:48,291 Dev: /kaggle/input/tdmsci/test_500_v2.txt
2021-06-06 17:47:48,292 Test: /kaggle/input/tdmsci/test_500_v2.txt
Adding TDMSci data (size: 551)


In [23]:
with open(path_train_nerjson) as f:
    acc = 0
    labels = []
    for row in f:
        rowjson = json.loads(row)
        acc +=1
        labels+= rowjson["tags"]
    
print("There are {} rows in train_ner.json".format(acc))
print("The unique labels are:{}".format(np.unique(labels)))
        

There are 95394 rows in train_ner.json
The unique labels are:['B-DATASET' 'I-DATASET' 'O']
