# Part III (Testing)

# 1. Introduction

This notebook contains the prediction phase for the Kaggle challenge "Coleridge Initative: Show US the Data" (https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/). To recap, this challenge is about datasets used in scientific papers. In particular, we want to extract the datasets for scientific paper, with several NLP approaches. In this notebook, we test both BERT and SciBERT. The first model is introduced by Devlin, J., Chang, M. W., Lee, K., and Toutanova, K., in 2018 [1]. Source code of BERT can be fuond [here](https://github.com/google-research/bert). The second model is  introduced by Beltagy, I., Lo, K., and Cohan, A. in 2019 [2]. Source code of SciBERT can be found [here](https://github.com/allenai/scibert).


Furthermore, we append the existing data with a specialized Corpus for dataset tagging. TDMSci is a Corpus existing of annotated data for tasks, metrices and datasets. Here, B-DATASET and I-DATASET are the NER-labels indicating a word is (part of) a dataset [3]. Source code (and annotated data) of TDMSci can be found [here](https://github.com/IBM/science-result-extractor).

This boils down to exectuing Named Entity Recognition (NER), in particular token classfication.

We have created three notebooks, one for **dataset creation** ([Part I](https://github.com/Josien94/MLiP/blob/main/Challenge2%20-%20Coleridge%20Initiative%20-%20Show%20US%20the%20Data/Part%20I_Creating_Dataset.ipynb)), one for **training** ([Part IIa](https://github.com/Josien94/MLiP/blob/main/Challenge2%20-%20Coleridge%20Initiative%20-%20Show%20US%20the%20Data/Part_IIa_BERT_Training.ipynb) and [Part IIb](https://github.com/Josien94/MLiP/blob/main/Challenge2%20-%20Coleridge%20Initiative%20-%20Show%20US%20the%20Data/Part_IIb_SciBERT_Training.ipynb)) and this one for **testing** (Part III). This part exists of the testing phase of the model. We loaded the pre-trained model from either _part IIa_ or _part IIb_, and executed predictions with the pre-trained model on our validation set.


[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.  
[2] Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.  
[3] Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2021). TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. arXiv preprint arXiv:2101.10273.


# 2. Preparing Notebook

In [1]:
!pip install datasets --no-index --find-links=file:///kaggle/input/coleridge-packages/packages/datasets 
!pip install ../input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl 
!pip install ../input/coleridge-packages/tokenizers-0.10.1-cp37-cp37m-manylinux1_x86_64.whl 
!pip install ../input/coleridge-packages/transformers-4.5.0.dev0-py3-none-any.whl 
!pip install datasets 

Looking in links: file:///kaggle/input/coleridge-packages/packages/datasets
Processing /kaggle/input/coleridge-packages/packages/datasets/datasets-1.5.0-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/xxhash-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/huggingface_hub-0.0.7-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/tqdm-4.49.0-py2.py3-none-any.whl
Installing collected packages: tqdm, xxhash, huggingface-hub, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.56.2
    Uninstalling tqdm-4.56.2:
      Successfully uninstalled tqdm-4.56.2
Successfully installed datasets-1.5.0 huggingface-hub-0.0.7 tqdm-4.49.0 xxhash-2.0.0
Processing /kaggle/input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
Processing /kaggle/input/coleridge-packages/tokenizers-

In [2]:
#Import necessary libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import re
import os
from os import listdir
from os.path import isfile, join
import re
import json
import time
import datetime
import random
import glob
import importlib
import allennlp
import numpy as np
import pandas as pd
from transformers import *
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import torch
from typing import List



  '"sox" backend is being deprecated. '


In [3]:
acc = 0
with open('/kaggle/input/tdmsci/train_ner.json') as f:
    for row in f:
        acc += 1

print("There are {} rows in the training set!".format(acc))

There are 95394 rows in the training set!


In [4]:
#Initialize paths for data
path_abs = '/kaggle/input/coleridgeinitiative-show-us-the-data/'
path_test = os.path.join(path_abs, 'test/')
path_train = os.path.join(path_abs, 'train/')
path_metadata = os.path.join(path_abs, 'train.csv')
path_ner_json = '/kaggle/input/tdmsci/train_ner.json'
sample_submission_path = '../input/coleridgeinitiative-show-us-the-data/sample_submission.csv'
sample_submission = pd.read_csv(sample_submission_path)

adnl_govt_labels_path = '../input/coleridge-additional-gov-datasets-22000popular/data_set_800_with10000popular.csv'

onlyBert = True #CHANGE THIS TO FALSE FOR FINAL SUBMSSION 

# 3. Get to know the data

## 3.1. Load data

In [5]:
#Load metadata
label_df = pd.read_csv(path_metadata)
label_df = label_df.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()

# sample_submission = pd.read_csv(path_sample_submission)
# #Inpsect data

label_df.head()

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label
0,0007f880-0a9b-492d-9a58-76eb0b0e0bd7,The Impact of ICT Training on Income Generatio...,Program for the International Assessment of Ad...,Program for the International Assessment of Ad...,program for the international assessment of ad...
1,0008656f-0ba2-4632-8602-3017b44c2e90,Finnish Ninth Graders’ Gender Appropriateness ...,Trends in International Mathematics and Scienc...,Trends in International Mathematics and Scienc...,trends in international mathematics and scienc...
2,000e04d6-d6ef-442f-b070-4309493221ba,Economic Research Service: Specialized Agency...,Agricultural Resource Management Survey,Agricultural Resources Management Survey,agricultural resources management survey
3,000efc17-13d8-433d-8f62-a3932fe4f3b8,Risk factors and global cognitive status relat...,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI|Alzheimer's Disease Neuroimaging Initiati...,adni|alzheimer s disease neuroimaging initiati...
4,0010357a-6365-4e5f-b982-582e6d32c3ee,Timelines of COVID-19 Vaccines,SARS-CoV-2 genome sequence,genome sequence of COVID-19,genome sequence of covid 19


In [6]:
computeFBeta = True
if(len(sample_submission) > 4):
    computeFBeta = False

# 4. Create testdata (and basic NLP)

Here, we will load our provided test data and apply basic NLP methods on this. 

In [7]:
MAX_LENGTH = 64
OVERLAP = 20

def clean_paper_sentence(s):
    """
    This function is essentially clean_text without lowercasing.
    """
    s = re.sub('[^A-Za-z0-9]+', ' ', str(s)).strip()
    s = re.sub(' +', ' ', s)
    return s

def shorten_sentences(sentences):
    """
    Sentences that have more than MAX_LENGTH words will be split
    into multiple sentences with overlappings.
    """
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p+MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences

#Concatenate text and split sentences, as the BERT (and SciBERT) model 
#has the contraint of maximum sequence length of 512.
def concatenate_text(json_dict):
    total_text = ""
        
    for section_dict in json_dict:
        total_text += section_dict['text']+ '\n'
    #sentences = nltk.tokenize.sent_tokenize(total_text) # This seems to take a lot of time?
    sentences = re.split('\. ', total_text)
    sentences = [clean_paper_sentence(s) for s in sentences]
   
    sentences = shorten_sentences(sentences)
    return sentences

def create_test_df(path, dataset, N_test):
    '''
    
    '''
    #Initialize dictionary with right keys
    final_dict = dict()
    final_dict["Id"] = []
    final_dict["Sentences"] = []
    
    max_length = N_test
    
    counter = 0
    for root, _, files in os.walk(path):
        for filename in range(0,max_length):#files:
            id = files[filename][:-5] #Remove .json from filename to retrieve id
            with open(path + files[filename]) as f:
                json_dict = json.load(f)
                sentences = concatenate_text(json_dict)
                final_dict["Id"].append(id)
                final_dict["Sentences"].append(sentences)
                
            counter += 1
    df = pd.DataFrame.from_dict(final_dict)

    return df




In [8]:
#Create dataframe with labels
#label_df = label_df[['Id','cleaned_label']]

/#Create train test data for NER classification
if(computeFBeta):
    N_test = len([name for name in os.listdir(path_test)])
    print(N_test)
    test_df_ner = create_test_df(path_test, "test", N_test)

#create test data for MLM classification
#test_df_mlm = create_train_test_df(path_test, "test",0, N_test)

4


# 5. Literal matching

To improve performance, we introduce literal matching, where sentences from the test data are matched against a created knowledge bank. Here, literal matches between datasets in the knowledge bank and sentences in the test data are found. 

Furthermore, we add the top [10,000 datasets](https://www.kaggle.com/chienhsianghung/coleridge-additional-gov-datasets-22000popular) from the [U.S. Government's open data](https://www.data.gov/). To exclude datasets exiting of one word - mostly a common used word like "earth" - we restricted the inclusion of a dataset with a wordcount larger than 3.

In [10]:
#Literal matching (create knowledge bank)
train = pd.read_csv(path_metadata)
papers = {}

for paper_id in train['Id'].unique():
    with open(f'{path_train}/{paper_id}.json', 'r') as f:
        paper = json.load(f)
        papers[paper_id] = paper

print(len(list(papers.keys())))


14316


In [11]:
sample_submission_path = '../input/coleridgeinitiative-show-us-the-data/sample_submission.csv'
sample_submission = pd.read_csv(sample_submission_path)
paper_test_folder = '../input/coleridgeinitiative-show-us-the-data/test'
for paper_id in sample_submission['Id']:
    with open(f'{paper_test_folder}/{paper_id}.json', 'r') as f:
        paper = json.load(f)
        papers[paper_id] = paper

In [12]:
all_labels = set()

for label_1, label_2, label_3 in train[['dataset_title', 'dataset_label', 'cleaned_label']].itertuples(index=False):
    all_labels.add(str(label_1).lower())
    all_labels.add(str(label_2).lower())
    all_labels.add(str(label_3).lower())

adnl_govt_labels = pd.read_csv(adnl_govt_labels_path)

for l in adnl_govt_labels.title:
    
    if (len(l.split()) > 3):
        all_labels.add(l.lower())
        all_labels.add(l)
        all_labels.add(clean_paper_sentence(l))
    
    
all_labels = set(all_labels)
print(f'No. different labels: {len(all_labels)}')


No. different labels: 12230


# 6. NER with BERT

Here we apply NER on the pretrained model.

In [13]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def totally_clean_text(txt):
    txt = clean_text(txt)
    txt = re.sub(' +', ' ', txt)
    return txt

## 6.1. Apply  Literal matching

After we have created the knowledgebank, we can now apply our literal matches by looping over the papers and see wheter the text contains a label out of the knowledgebank.

In [14]:
literal_preds = []

if(not computeFBeta):
    for paper_id in sample_submission['Id']:
        paper = papers[paper_id]
        text_1 = '. '.join(section['text'] for section in paper).lower()
        text_2 = totally_clean_text(text_1)

        labels = set()
        for label in all_labels:
            if label in text_1 or label in text_2:
                labels.add(clean_text(label))

        literal_preds.append('|'.join(labels))
else:
      first_100_papers = sorted(os.listdir(path_train))[0:100]
  
      for papername in first_100_papers:
        with open(f'{path_train}{papername}', 'r') as f:
            paper = json.load(f)
       
        text_1 = '. '.join(section['text'] for section in paper).lower()
        text_2 = totally_clean_text(text_1)

        labels = set()
        for label in all_labels:
            if label in text_1 or label in text_2:
                labels.add(clean_text(label))

        literal_preds.append('|'.join(labels))

In [15]:
literal_preds[:5]

['program for the international assessment of adult competencies',
 'trends in international mathematics and science study',
 'agricultural resources management survey',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'genome sequence of covid 19']

## 6.2. Preprocess test data

After applying literal matching, we preprocess our test data by applying the same basic NLP techniques to our test data. Furthermore, to make the input suitable for BERT, dummy tags are added.

In [16]:
train = train.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()

print(f'No. grouped training rows: {len(train)}')

No. grouped training rows: 14316


In [17]:
def clean_training_text(txt):
    """
    similar to the default clean_text function but without lowercasing.
    """
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt)).strip()

def shorten_sentences(sentences):
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p+MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences

In [18]:
def preprocess_test_ner():
    test_rows = [] # test data in NER format
    paper_lengths = []

    for paper_id in sample_submission['Id']:
       
        paper = papers[paper_id]
        
        sentences = [clean_training_text(sentence) for section in paper 
                 for sentence in section['text'].split('.')
                ]
        #The author of this code does this, but I am not sure if this is necessary
        sentences = [sentence for sentence in sentences if len(sentence) > 10] # only accept sentences with length > 10 chars
        sentences = [sentence for sentence in sentences if any(word in sentence.lower() for word in ['data', 'study'])]
        
        #Add NER-labels to each token of each sentence
        for sentence in sentences:
            
            tokens = sentence.split()
            dummy_tags = ['O'] * len(tokens)
            
            test_rows.append({'tokens' : tokens, 'tags' : dummy_tags, 'id': paper_id})
        paper_lengths.append(len(sentences))
    


    return test_rows, paper_lengths

def preprocess_test_validation():
    #For testing purposes
    test_rows = [] # test data in NER format
    paper_lengths = []
    ids = []

    first_100_papers = sorted(os.listdir(path_train))[0:100]
  
    for papername in first_100_papers:
        ids.append(papername[:-5])
        with open(f'{path_train}{papername}', 'r') as f:
            paper = json.load(f)
          
            #print(paper["id"])
            sentences = [clean_training_text(sentence) for section in paper 
                     for sentence in section['text'].split('.')
                    ]
            #The author of this code does this, but I am not sure if this is necessary
            sentences = [sentence for sentence in sentences if len(sentence) > 10] # only accept sentences with length > 10 chars
            sentences = [sentence for sentence in sentences if any(word in sentence.lower() for word in ['data', 'study'])]

            #Add NER-labels to each token of each sentence
            for sentence in sentences:

                tokens = sentence.split()
                dummy_tags = ['O'] * len(tokens)

                test_rows.append({'tokens' : tokens, 'tags' : dummy_tags, 'id': paper_id})
            paper_lengths.append(len(sentences))
    return test_rows, paper_lengths, ids
  
        

In [19]:

if(not computeFBeta):
    test_rows, paper_lengths = preprocess_test_ner()
else:
    test_rows, paper_lengths, validationIds = preprocess_test_validation()
    
print(len(test_rows))
print(test_rows[0:5])

3079
[{'tokens': ['The', 'aim', 'of', 'this', 'study', 'was', 'to', 'identify', 'if', 'acquiring', 'ICT', 'skills', 'through', 'DOT', 'Lebanon', 's', 'ICT', 'training', 'program', 'a', 'local', 'NGO', 'improved', 'income', 'generation', 'opportunities', 'after', '3', 'months', 'of', 'completing', 'the', 'training'], 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'id': '8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60'}, {'tokens': ['This', 'study', 'was', 'completed', 'in', 'an', 'effort', 'to', 'find', 'creative', 'and', 'digital', 'solutions', 'to', 'the', 'high', 'rate', 'of', 'youth', 'unemployment', 'in', 'Lebanon', '37', 'one', 'of', 'the', 'highest', 'rates', 'in', 'the', 'world'], 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'id': '8e6996b4-ca08-

## 6.3. Predict with BERT

Now we run the BERT model with the preloaded trained model on the test set and save the results in *bert\_outputs*.

In [20]:
max_length = 64 # max no. words for each sentence.
overlap = 20 # if a sentence exceeds MAX_LENGTH, we split it to multiple sentences with overlapping
path_train_nerjson = '/kaggle/input/tdmsci/train_ner.json'
pred_save_path = './pred'
prediction_file = 'test_predictions.txt'
test_input_save_path = './input_data'
path_pretrained_scibert = '/kaggle/input/tdmsci/scibert/output' 
path_pretrained_bert =  '/kaggle/input/tdmsci/results/output' 
train_file = path_train_nerjson
filename_test = 'test_ner_input.json'

os.environ["MODEL_PATH"] = f"{path_pretrained_bert}"
os.environ["TRAIN_FILE"] = f"{train_file}"
os.environ["VALIDATION_FILE"] = f"{train_file}"
os.environ["TEST_FILE"] = f"{test_input_save_path}/{filename_test}"
os.environ["OUTPUT_DIR"] = f"{pred_save_path}"


In [21]:
# copy my_seqeval.py to the working directory because the input directory is non-writable
!cp /kaggle/input/coleridge-packages/my_seqeval.py ./
os.makedirs(test_input_save_path, exist_ok=True)

In [22]:
def predict_scibert_ner():
    !python ../input/kaggle-ner-utils/kaggle_run_ner.py \
    --model_name_or_path "$MODEL_PATH" \
    --train_file "$TRAIN_FILE" \
    --validation_file "$VALIDATION_FILE" \
    --test_file "$TEST_FILE" \
    --output_dir "$OUTPUT_DIR" \
    --report_to 'none' \
    --seed 123 \
    --do_predict

In [24]:
bert_outputs = []
batch_size = 64000

for batch_begin in range(0, len(test_rows), batch_size):#len(test_rows), batch_size):
    print(len(test_rows))
    # write data rows to input file
    with open(f"{test_input_save_path}/{filename_test}", 'w') as f:
        for row in test_rows[batch_begin:batch_begin+batch_size]:
            json.dump(row, f)
            f.write('\n')
            
    with open(f"{test_input_save_path}/{filename_test}", 'r') as f:
        content = f.read()
        
    # remove output dir
    !rm -r "$OUTPUT_DIR"
    
    # do predict
    predict_scibert_ner()
    
    # read predictions
    with open(f'{pred_save_path}/{prediction_file}') as f:
        this_preds = f.read().split('\n')[:-1]
        bert_outputs += [pred.split() for pred in this_preds]

3079
rm: cannot remove './pred': No such file or directory
2021-06-13 09:45:57.945036: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-53e6f0991a8d2d90/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-53e6f0991a8d2d90/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02. Subsequent calls will reuse this data.
[INFO|configuration_utils.py:470] 2021-06-13 09:46:26,058 >> loading configuration file /kaggle/input/tdmsci/results/output/config.json
[INFO|configuration_utils.py:508] 2021-06-13 09:46:26,059 >> Model config BertConfig {
  "_name_or_path": "allenai/scibert_scivocab_cased",
  "archi

## 6.4. Restore labels
After prediction, we need to restore the labels to actual words, with the use of the earleir defined paper lengths

In [26]:

test_sentences = [row['tokens'] for row in test_rows]
#del test_rows

bert_dataset_labels = [] # store all dataset labels for each publication

for length in paper_lengths:
    print(length)
    labels = set()
    for sentence, pred in zip(test_sentences[:length], bert_outputs[:length]):
        curr_phrase = ''
        for word, tag in zip(sentence, pred):
            if tag == 'B-DATASET': # start a new phrase
                if curr_phrase:
                    labels.add(curr_phrase)
                    curr_phrase = ''
                curr_phrase = word
            elif tag == 'I-DATASET' and curr_phrase: # continue the phrase
                curr_phrase += ' ' + word
            else: # end last phrase (if any)
                if curr_phrase:
                    labels.add(curr_phrase)
                    curr_phrase = ''
        # check if the label is the suffix of the sentence
        if curr_phrase:
            labels.add(curr_phrase)
            curr_phrase = ''
    
    # record dataset labels for this publication
    bert_dataset_labels.append(labels)
    
    del test_sentences[:length], bert_outputs[:length]


      



14
38
46
61
2
26
41
29
43
14
38
56
40
25
6
19
43
38
20
15
25
41
6
24
22
16
19
16
62
31
28
18
27
5
28
26
36
24
14
23
14
42
24
21
89
36
65
30
21
31
96
21
42
2
1
47
15
47
36
13
106
11
10
35
26
53
23
28
26
41
45
28
8
11
15
26
33
33
31
21
22
13
10
7
23
1
7
48
28
21
10
45
38
71
41
23
12
200
27
24


In [27]:
bert_dataset_labels[:5]


[set(),
 {'Trends in International Mathematics and Science Study'},
 set(),
 {'ADNI', 'Alzheimer s Disease Neuroimaging Initiative ADNI'},
 set()]

## 6.5. Filter results
We filter the reults by first looking at the jaccard simalirty between the predicted labels, to not get nearly-duplicates.
After this, we match our results to the literal matching list defined earlier in this notebook. In case a literal match is present for a predicted label, the literal match will be taken as final output.

In [28]:
def jaccard_similarity(s1, s2):
    l1 = s1.split(" ")
    l2 = s2.split(" ")    
    intersection = len(list(set(l1).intersection(l2)))
    union = (len(l1) + len(l2)) - intersection
    return float(intersection) / union

filtered_bert_labels = []

for labels in bert_dataset_labels:
    filtered = []
    
    for label in sorted(labels, key=len):
        label = clean_text(label)
        if len(filtered) == 0 or all(jaccard_similarity(label, got_label) < 0.75 for got_label in filtered):
            filtered.append(label)
    
    filtered_bert_labels.append('|'.join(filtered))

In [29]:
filtered_bert_labels[:100]


['',
 'trends in international mathematics and science study',
 '',
 'adni|alzheimer s disease neuroimaging initiative adni',
 '',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'early childhood longitudinal study',
 'baltimore longitudinal study of aging',
 '',
 'baltimore longitudinal study of aging',
 'baltimore longitudinal study of aging',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'adni',
 'baltimore longitudinal study of aging',
 'adni',
 '',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'adni',
 'beginning postsecondary students',
 '',
 'covid 19 death data|covid 19 mortality data',
 'early childhood longitudinal study',
 '',
 'early childhood longitudinal study|national education longitudinal study',
 '',
 'survey of earned doctorates',
 'education longitudinal study',
 '',
 '',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'adni|alzheimer s disease neuroimaging initiative adni',
 '',
 'agricultural resource management surv

In [30]:

if(not onlyBert):
    print("Adding literal matching!")
    final_predictions = []
    for literal_match, bert_pred in zip(literal_preds, filtered_bert_labels):
        if literal_match:
            final_predictions.append(literal_match)
        else:
            final_predictions.append(bert_pred)
else:
    print("No literal matching!")
    final_predictions = filtered_bert_labels

No literal matching!


In [31]:
final_predictions[0:100]


['',
 'trends in international mathematics and science study',
 '',
 'adni|alzheimer s disease neuroimaging initiative adni',
 '',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'early childhood longitudinal study',
 'baltimore longitudinal study of aging',
 '',
 'baltimore longitudinal study of aging',
 'baltimore longitudinal study of aging',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'adni',
 'baltimore longitudinal study of aging',
 'adni',
 '',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'adni',
 'beginning postsecondary students',
 '',
 'covid 19 death data|covid 19 mortality data',
 'early childhood longitudinal study',
 '',
 'early childhood longitudinal study|national education longitudinal study',
 '',
 'survey of earned doctorates',
 'education longitudinal study',
 '',
 '',
 'adni|alzheimer s disease neuroimaging initiative adni',
 'adni|alzheimer s disease neuroimaging initiative adni',
 '',
 'agricultural resource management surv

## 6.6. Generate submission file

In [32]:
#In case of submission
if(not computeFBeta):
    sample_submission['PredictionString'] = final_predictions
    sample_submission.head()
    

In [33]:
# In case of submission
if(not computeFBeta):
    sample_submission.to_csv(f'submission.csv', index=False)

## 6.7. Perform validation
In Part I of this set of notebooks, we left the first 100 papers out for validation. Here,  we perfrom evaluation on the first 100 papers by computing the Jaccard-based FBeta score.

In [34]:
def compute_fbeta(y_true: List[List[str]],
                  y_pred: List[List[str]],
                  beta: float = 0.5) -> float:
    """Compute the Jaccard-based micro FBeta score.

    References
    ----------
    - https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/overview/evaluation
    """

    def _jaccard_similarity(str1: str, str2: str) -> float:
        a = set(str1.split()) 
        b = set(str2.split())
        c = a.intersection(b)
        return float(len(c)) / (len(a) + len(b) - len(c))

    tp = 0  # true positive
    fp = 0  # false positive
    fn = 0  # false negative
    for ground_truth_list, predicted_string_list in zip(y_true, y_pred):
        

        predicted_string_list_sorted = sorted(predicted_string_list)
        for ground_truth in sorted(ground_truth_list):         

            if len(predicted_string_list_sorted) == 0:
                fn += 1
            else:
                similarity_scores = [
                    _jaccard_similarity(ground_truth, predicted_string)
                    for predicted_string in predicted_string_list_sorted
                ]
                matched_idx = np.argmax(similarity_scores)
                if similarity_scores[matched_idx] >= 0.5:
                    predicted_string_list_sorted.pop(matched_idx)
                    tp += 1
                else:
                    fn += 1
        fp += len(predicted_string_list_sorted)

    tp *= (1 + beta ** 2)
    fn *= beta ** 2
    fbeta_score = tp / (tp+ fp + fn)
    return fbeta_score

In [35]:
def getTrainLabels(ids, final_predictions, label_df):
    labels = []
    for paper_id in ids:
        
        
        
        value = label_df.loc[label_df['Id'] == paper_id]
        cleanedLabel = value["cleaned_label"].values
        labels.append(cleanedLabel)
    return labels
      

In [36]:
#In case of development (using the validation data)
if(computeFBeta):
    trainLabels = getTrainLabels(validationIds, final_predictions,  label_df)
    print(trainLabels)
    print(final_predictions)
    fbeta = compute_fbeta(trainLabels, [x.split('|')for x in final_predictions])
    print(fbeta)


[array(['program for the international assessment of adult competencies'],
      dtype=object), array(['trends in international mathematics and science study'],
      dtype=object), array(['agricultural resources management survey'], dtype=object), array(['adni|alzheimer s disease neuroimaging initiative adni '],
      dtype=object), array(['genome sequence of covid 19'], dtype=object), array(['adni|alzheimer s disease neuroimaging initiative adni '],
      dtype=object), array(['early childhood longitudinal study'], dtype=object), array(['baltimore longitudinal study of aging blsa |baltimore longitudinal study of aging'],
      dtype=object), array(['noaa tide gauge'], dtype=object), array(['baltimore longitudinal study of aging'], dtype=object), array(['baltimore longitudinal study of aging blsa |baltimore longitudinal study of aging'],
      dtype=object), array(['adni|alzheimer s disease neuroimaging initiative adni '],
      dtype=object), array(['adni'], dtype=object), array(['ba