# Part III (Testing)

# 1. Introduction

This notebook contains the prediction phase for the Kaggle challenge "Coleridge Initiative: Show US the Data" (https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/).

To recap, this challenge is about datasets used in scientific papers. In particular, we want to extract the datasets mentioned in a scientific paper, using several NLP approaches. In this notebook, we test both BERT and SciBERT. BERT was introduced by Devlin, J., Chang, M. W., Lee, K., and Toutanova, K. in 2018 [1]. The source code of BERT can be found [here](https://github.com/google-research/bert). SciBERT was introduced by Beltagy, I., Lo, K., and Cohan, A. in 2019 [2]. The source code of SciBERT can be found [here](https://github.com/allenai/scibert).


Furthermore, we augment the existing data with a specialized corpus for dataset tagging. TDMSci is a corpus consisting of annotated data for tasks, metrics and datasets. Here, B-DATASET and I-DATASET are the NER labels indicating that a word is (part of) a dataset [3]. The source code (and annotated data) of TDMSci can be found [here](https://github.com/IBM/science-result-extractor).

This boils down to executing Named Entity Recognition (NER), in particular token classification.

We have created three notebooks: one for **dataset creation** ([Part I]()), one for **training** ([Part IIa]() and [Part IIb]()) and this one for **testing** (Part III). This part consists of the testing phase of the model. We loaded the pre-trained model from either [Part IIa]() or [Part IIb]() and executed predictions with it on our validation set.

For Part I (creating the dataset), we refer to [this notebook](https://www.kaggle.com/lunaelise/fork-of-mlip-group25-scibert-dataset).

For Part II (training), we refer to [this notebook](https://www.kaggle.com/lunaelise/fork-of-mlip-group25-scibert-training).






[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[2] Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

[3] Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2021). TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. arXiv preprint arXiv:2101.10273.


# 2. Preparing Notebook

In [1]:
!pip install datasets --no-index --find-links=file:///kaggle/input/coleridge-packages/packages/datasets 
!pip install ../input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl 
!pip install ../input/coleridge-packages/tokenizers-0.10.1-cp37-cp37m-manylinux1_x86_64.whl 
!pip install ../input/coleridge-packages/transformers-4.5.0.dev0-py3-none-any.whl 
!pip install datasets 

Looking in links: file:///kaggle/input/coleridge-packages/packages/datasets
Processing /kaggle/input/coleridge-packages/packages/datasets/datasets-1.5.0-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/huggingface_hub-0.0.7-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/tqdm-4.49.0-py2.py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/xxhash-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl
Installing collected packages: tqdm, xxhash, huggingface-hub, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.56.2
    Uninstalling tqdm-4.56.2:
      Successfully uninstalled tqdm-4.56.2
Successfully installed datasets-1.5.0 huggingface-hub-0.0.7 tqdm-4.49.0 xxhash-2.0.0
Processing /kaggle/input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
Processing /kaggle/input/coleridge-packages/tokenizers-

In [2]:
#Import necessary libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import re
import os
from os import listdir
from os.path import isfile, join
import json
import time
import datetime
import random
import glob
import importlib
import allennlp
import numpy as np
from transformers import *
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import torch


  '"sox" backend is being deprecated. '


In [3]:
acc = 0
with open('/kaggle/input/tdmsci/train_ner.json') as f:
    for row in f:
        acc += 1

print("There are {} rows in the training set!".format(acc))

There are 309999 rows in the training set!


In [4]:
#!cp /kaggle/input/coleridge-packages/my_seqeval.py ./
import os

#Make directory in kaggle output of the model
# path_pretrained_SciBERT = '/kaggle/input/tdmsci/output/output' #'/kaggle/working/output'
# os.makedirs(path_pretrained_SciBERT, exist_ok=True)

# os.environ["ModelOutputPath"] = f"{path_pretrained_SciBERT}"
# #Refer to pretrained SciBERT model
# #path_SciBERT_model = '/kaggle/input/tdmsci/output/output' 
# for root, dirs, files in os.walk("/kaggle/input/mlip-group25-scibert-training", topdown=False):
#     for filename in files:
#         print(os.path.join(root, filename))
#         name = os.path.join(root, filename)
#         os.environ["filename"] = f"{name}"
#         !cp "$filename" "$ModelOutputPath"
    

In [5]:
model_name = 'bert-base-cased' #'allenai/scibert_scivocab_cased'
#model_name =  'bert-base-cased'
#Initialize paths for data
path_abs = '/kaggle/input/coleridgeinitiative-show-us-the-data/'
path_test = os.path.join(path_abs, 'test/')
path_train = os.path.join(path_abs, 'train/')
path_metadata = os.path.join(path_abs, 'train.csv')
path_ner_json = '/kaggle/input/tdmsci/train_ner.json'
sample_submission_path = '../input/coleridgeinitiative-show-us-the-data/sample_submission.csv'
sample_submission = pd.read_csv(sample_submission_path)
# path_sample_submission = os.path.join(path_abs, 'sample_submission.csv')
# path_abs_tdmsci = '/kaggle/input/tdmsci/'

adnl_govt_labels_path = '../input/coleridge-additional-gov-datasets-22000popular/data_set_800_with10000popular.csv'

# 3. Get to know the data

## 3.1. Load data

In [6]:
# #Load metadata
# label_df = pd.read_csv(path_train_metadata)
# label_df = label_df.groupby('Id').agg({
#     'pub_title': 'first',
#     'dataset_title': '|'.join,
#     'dataset_label': '|'.join,
#     'cleaned_label': '|'.join
# }).reset_index()

# sample_submission = pd.read_csv(path_sample_submission)
# #Inpsect data
# label_df.head()

# 4. Create test data (and basic NLP)

Here, we load the provided test data and apply basic NLP methods to it.

In [7]:
MAX_LENGTH = 64
OVERLAP = 20

def clean_paper_sentence(s):
    """
    This function is essentially clean_text without lowercasing.
    """
    s = re.sub('[^A-Za-z0-9]+', ' ', str(s)).strip()
    s = re.sub(' +', ' ', s)
    return s

def shorten_sentences(sentences):
    """
    Sentences that have more than MAX_LENGTH words will be split
    into multiple sentences with overlappings.
    """
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p+MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences

#Concatenate text and split sentences, as the BERT (and SciBERT) model
#has a constraint on the maximum sequence length of 512 tokens.
# def concatenate_text(json_dict):
#     total_text = ""
        
#     for section_dict in json_dict:
#         total_text += section_dict['text']+ '\n'
#     #sentences = nltk.tokenize.sent_tokenize(total_text) # This seems to take a lot of time?
#     sentences = re.split('\. ', total_text)
#     sentences = [clean_paper_sentence(s) for s in sentences]
   
#     sentences = shorten_sentences(sentences)
#     return sentences

# def create_test_df(path, dataset, N_test):
#     '''
    
#     '''
#     #Initialize dictionary with right keys
#     final_dict = dict()
#     final_dict["Id"] = []
#     final_dict["Sentences"] = []
    
#     max_length = N_test
    
#     counter = 0
#     for root, _, files in os.walk(path):
#         for filename in range(0,max_length):#files:
#             id = files[filename][:-5] #Remove .json from filename to retrieve id
#             with open(path + files[filename]) as f:
#                 json_dict = json.load(f)
#                 sentences = concatenate_text(json_dict)
#                 final_dict["Id"].append(id)
#                 final_dict["Sentences"].append(sentences)
                
#             counter += 1
#     df = pd.DataFrame.from_dict(final_dict)

#     return df
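
The windowing performed by `shorten_sentences` can be illustrated with a small sketch. It only reproduces the index arithmetic for one long sentence; the helper name `window_starts` and the toy 100-word sentence are assumptions made for illustration.

```python
MAX_LENGTH = 64  # maximum number of words per chunk (value used in this notebook)
OVERLAP = 20     # number of words shared by two consecutive chunks


def window_starts(n_words, max_length=MAX_LENGTH, overlap=OVERLAP):
    """Return the start offsets of the chunks for a sentence of n_words words."""
    if n_words <= max_length:
        return [0]  # short sentences are kept as a single chunk
    return list(range(0, n_words, max_length - overlap))


# Toy sentence of 100 placeholder words
words = [f"w{i}" for i in range(100)]
starts = window_starts(len(words))
chunks = [words[p:p + MAX_LENGTH] for p in starts]

print(starts)                    # [0, 44, 88]
print([len(c) for c in chunks])  # [64, 56, 12]
```

Consecutive chunks share 20 words. In this example the last chunk is fully contained in the previous one, so the fixed stride can produce a redundant trailing chunk.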




In [8]:
# N_test = len([name for name in os.listdir(path_test)])
# print(N_test)

In [9]:
#Create dataframe with labels
#label_df = label_df[['Id','cleaned_label']]

#Create train test data for NER classification
#test_df_ner = create_test_df(path_test, "test", N_test)

#create test data for MLM classification
#test_df_mlm = create_train_test_df(path_test, "test",0, N_test)

In [10]:
#test_df_ner.head()

## 4.1. Load data

In [11]:
#Literal matching (create knowledge bank)
train = pd.read_csv(path_metadata)
papers = {}

for paper_id in train['Id'].unique():
    with open(f'{path_train}/{paper_id}.json', 'r') as f:
        paper = json.load(f)
        papers[paper_id] = paper

print(len(list(papers.keys())))


14316


In [12]:
sample_submission_path = '../input/coleridgeinitiative-show-us-the-data/sample_submission.csv'
sample_submission = pd.read_csv(sample_submission_path)
paper_test_folder = '../input/coleridgeinitiative-show-us-the-data/test'
for paper_id in sample_submission['Id']:
    with open(f'{paper_test_folder}/{paper_id}.json', 'r') as f:
        paper = json.load(f)
        papers[paper_id] = paper

In [13]:
all_labels = set()

for label_1, label_2, label_3 in train[['dataset_title', 'dataset_label', 'cleaned_label']].itertuples(index=False):
    all_labels.add(str(label_1).lower())
    all_labels.add(str(label_2).lower())
    all_labels.add(str(label_3).lower())

adnl_govt_labels = pd.read_csv(adnl_govt_labels_path)

for l in adnl_govt_labels.title:
    
    if (len(l.split()) > 3):
        all_labels.add(l.lower())
        all_labels.add(l)
        all_labels.add(clean_paper_sentence(l))
    
    
all_labels = set(all_labels)
print(f'No. different labels: {len(all_labels)}')


No. different labels: 12230


# 5. NER with SciBERT

We first apply NER in combination with SciBERT. For this purpose, we need the [AutoModelForTokenClassification](https://huggingface.co/transformers/model_doc/auto.html#automodelfortokenclassification) model trained on top of SciBERT in Part II; the prediction script used later in this notebook loads it from the saved checkpoint.

We first take a look at our training data.

We see that the training data for now consists of an Id (of a paper) with a list of corresponding sentences. We need to preprocess this in order to apply NER. As we are only interested in datasets, we annotated the data in the same way as the authors who introduced TDMSci [1]. We use the same annotation format as the [CoNLL-2003 dataset](https://www.clips.uantwerpen.be/conll2003/ner/).  

This means we can have two types of NER tags: _B-types_ and _I-types_. The first type marks the beginning of a named entity (or phrase). The second type marks the remaining words of a named entity (or phrase). A word that is not part of any entity (or phrase) is labelled 'O'. For this task, we use three different NER tags: **B-DATASET**, **I-DATASET** and **O**.  

This boils down to labelling every word not being (part of) a dataset as **O**, whilst datasets are being tagged with either **B-DATASET**, or **I-DATASET**.
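
The tagging scheme described above can be sketched on a single sentence. This is a minimal illustration that assumes exact token matching against a known dataset name; the helper `bio_tag` and the example sentence are hypothetical and not necessarily how the training data was annotated in Part I.

```python
def bio_tag(tokens, dataset_tokens):
    """Tag every exact occurrence of dataset_tokens in tokens with B-/I-DATASET."""
    tags = ["O"] * len(tokens)
    n = len(dataset_tokens)
    if n == 0:
        return tags
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == dataset_tokens:
            tags[i] = "B-DATASET"          # first word of the mention
            for j in range(i + 1, i + n):  # remaining words of the mention
                tags[j] = "I-DATASET"
    return tags


tokens = "We use the Common Core of Data in this study".split()
tags = bio_tag(tokens, "Common Core of Data".split())

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Only the four words of the dataset name receive a DATASET tag; every other word is tagged `O`.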

[1] Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2021). TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. arXiv preprint arXiv:2101.10273.

In [14]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def totally_clean_text(txt):
    txt = clean_text(txt)
    txt = re.sub(' +', ' ', txt)
    return txt
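
To make the effect of this cleaning concrete, the sketch below applies the same regular expression to two dataset names. The example strings are chosen for illustration only.

```python
import re


def clean_text(txt):
    # Replace every run of non-alphanumeric characters by one space, then lowercase
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()


examples = [
    "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
    "Rural-Urban Continuum Codes",
]
cleaned = [clean_text(raw) for raw in examples]

for raw, out in zip(examples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

Because the pattern ends in `+`, a whole run of punctuation and whitespace collapses into a single space, so the cleaned text never contains two consecutive spaces.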

## 5.1. Test SciBERT model 

In [15]:
literal_preds = []

for paper_id in sample_submission['Id']:
    paper = papers[paper_id]
    text_1 = '. '.join(section['text'] for section in paper).lower()
    text_2 = totally_clean_text(text_1)
    
    labels = set()
    for label in all_labels:
        if label in text_1 or label in text_2:
            labels.add(clean_text(label))
    
    literal_preds.append('|'.join(labels))

In [16]:
literal_preds[:5]

['alzheimer s disease neuroimaging initiative adni|adni',
 'integrated postsecondary education data system|common core of data|schools and staffing survey|nces common core of data|trends in international mathematics and science study|progress in international reading literacy study',
 'sea lake and overland surges from hurricanes|noaa storm surge inundation|slosh model',
 'rural urban continuum codes']
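
The literal matching step is a plain substring search of every known label in both the lowercased and the cleaned paper text. The following self-contained sketch reproduces that idea on toy data; the function name `literal_match`, the label set and the sentence are assumptions for illustration.

```python
import re


def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()


def literal_match(text, known_labels):
    """Return the cleaned form of every known label that occurs in the text."""
    raw = text.lower()
    cleaned = clean_text(text)
    return {clean_text(label) for label in known_labels
            if label in raw or label in cleaned}


# Toy knowledge bank (hypothetical subset of labels)
known_labels = {"adni", "common core of data", "rural-urban continuum codes"}
text = "Data were obtained from ADNI and the Rural-Urban Continuum Codes."

matches = literal_match(text, known_labels)
print(sorted(matches))  # ['adni', 'rural urban continuum codes']
```

Since this is a substring test rather than a word-boundary test, a short label can also match inside a longer word, which is one reason to be careful with very short labels.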

In [17]:
train = train.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()

print(f'No. grouped training rows: {len(train)}')

No. grouped training rows: 14316


In [18]:
def clean_training_text(txt):
    """
    similar to the default clean_text function but without lowercasing.
    """
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt)).strip()

def shorten_sentences(sentences):
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p+MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences

In [19]:
def preprocess_test_ner():
    test_rows = [] # test data in NER format
    paper_lengths = []

    for paper_id in sample_submission['Id']:
       
        paper = papers[paper_id]
        
        sentences = [clean_training_text(sentence) for section in paper 
                 for sentence in section['text'].split('.')
                ]
        #The author of this code does this, but I am not sure if this is necessary
        sentences = [sentence for sentence in sentences if len(sentence) > 10] # only accept sentences with length > 10 chars
        sentences = [sentence for sentence in sentences if any(word in sentence.lower() for word in ['data', 'study'])]
        
        #Add NER-labels to each token of each sentence
        for sentence in sentences:
            
            tokens = sentence.split()
            dummy_tags = ['O'] * len(tokens)
            
            test_rows.append({'tokens' : tokens, 'tags' : dummy_tags, 'id': paper_id})
        paper_lengths.append(len(sentences))
    
    #For testing purposes
#     first_100_papers = sorted(os.listdir(path_train))[0:100]
#     for papername in first_100_papers:
    
#         with open(f'{path_train}{papername}', 'r') as f:
#             paper = json.load(f)
#             sentences = [clean_training_text(sentence) for section in paper 
#                      for sentence in section['text'].split('.')
#                     ]
#             #The author of this code does this, but I am not sure if this is necessary
#             sentences = [sentence for sentence in sentences if len(sentence) > 10] # only accept sentences with length > 10 chars
#             sentences = [sentence for sentence in sentences if any(word in sentence.lower() for word in ['data', 'study'])]

#             #Add NER-labels to each token of each sentence
#             for sentence in sentences:

#                 tokens = sentence.split()
#                 dummy_tags = ['O'] * len(tokens)

#                 test_rows.append({'tokens' : tokens, 'tags' : dummy_tags, 'id': paper_id})
#             paper_lengths.append(len(sentences))
  

    return test_rows, paper_lengths
        

In [20]:
test_rows, paper_lengths = preprocess_test_ner()
print(len(test_rows))
print(test_rows[0:5])

365
[{'tokens': ['A', 'recent', 'large', 'genomewide', 'association', 'study', 'GWAS', 'reported', 'a', 'genome', 'wide', 'significant', 'locus', 'for', 'years', 'of', 'education', 'which', 'subsequently', 'demonstrated', 'association', 'to', 'general', 'cognitive', 'ability', 'g', 'in', 'overlapping', 'cohorts'], 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'id': '2100032a-7c33-4bff-97ef-690822c43466'}, {'tokens': ['The', 'current', 'study', 'was', 'designed', 'to', 'test', 'whether', 'GWAS', 'hits', 'for', 'educational', 'attainment', 'are', 'involved', 'in', 'general', 'cognitive', 'ability', 'in', 'an', 'independent', 'large', 'scale', 'collection', 'of', 'cohorts'], 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], 'id': '2100032a-7c33-4bff-97ef-690822c43466'}, {'tokens': ['We', 
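
The sentence selection inside `preprocess_test_ner` can be sketched in isolation: split on periods, clean, and keep only sentences longer than 10 characters that mention "data" or "study". The helper name `candidate_sentences` and the example text are hypothetical.

```python
import re

KEYWORDS = ('data', 'study')


def clean_training_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt)).strip()


def candidate_sentences(section_text, min_chars=10):
    """Keep cleaned sentences that are long enough and contain a keyword."""
    sentences = [clean_training_text(s) for s in section_text.split('.')]
    return [s for s in sentences
            if len(s) > min_chars and any(k in s.lower() for k in KEYWORDS)]


text = "We thank the reviewers. The study uses the Common Core of Data. Fig. 2 shows results."
kept = candidate_sentences(text)

for sentence in kept:
    print(sentence.split())
```

Splitting on `.` is a rough heuristic: it also cuts abbreviations such as "Fig. 2" into fragments, which the length and keyword filters then usually discard.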

In [21]:
max_length = 64 # max no. words for each sentence.
overlap = 20 # if a sentence exceeds MAX_LENGTH, we split it to multiple sentences with overlapping
path_train_nerjson = '/kaggle/input/tdmsci/train_ner.json'
pred_save_path = './pred'
prediction_file = 'test_predictions.txt'
test_input_save_path = './input_data'
path_pretrained_scibert = '/kaggle/input/tdmsci/results/output' #'/kaggle/working/output'
train_file = path_train_nerjson
filename_test = 'test_ner_input.json'

os.environ["MODEL_PATH"] = f"{path_pretrained_scibert}"
os.environ["TRAIN_FILE"] = f"{train_file}"
os.environ["VALIDATION_FILE"] = f"{train_file}"
os.environ["TEST_FILE"] = f"{test_input_save_path}/{filename_test}"
os.environ["OUTPUT_DIR"] = f"{pred_save_path}"


In [22]:
# copy my_seqeval.py to the working directory because the input directory is non-writable
!cp /kaggle/input/coleridge-packages/my_seqeval.py ./
os.makedirs(test_input_save_path, exist_ok=True)

In [23]:
def predict_scibert_ner():
    !python ../input/kaggle-ner-utils/kaggle_run_ner.py \
    --model_name_or_path "$MODEL_PATH" \
    --train_file "$TRAIN_FILE" \
    --validation_file "$VALIDATION_FILE" \
    --test_file "$TEST_FILE" \
    --output_dir "$OUTPUT_DIR" \
    --report_to 'none' \
    --seed 123 \
    --do_predict

In [24]:
# import requests

# paper_length = [] # store the number of sentences each paper has
# def prepare_testdata(filename):
#     test_rows = []
#     with open(filename) as f:
#         for row in f:
#             json_row = json.loads(row)
#             test_rows.append(json_row)
#     return test_rows
    
# test_rows = prepare_testdata(train_file) #test data in NER format


In [25]:
bert_outputs = []
batch_size = 64000

for batch_begin in range(0, len(test_rows), batch_size):
    # write data rows to input file (one JSON object per line)
    with open(f"{test_input_save_path}/{filename_test}", 'w') as f:
        for row in test_rows[batch_begin:batch_begin+batch_size]:
            json.dump(row, f)
            f.write('\n')

    # remove output dir (-f: do not complain if it does not exist yet)
    !rm -rf "$OUTPUT_DIR"

    # do predict
    predict_scibert_ner()

    # read predictions (one line of space-separated tags per input sentence)
    with open(f'{pred_save_path}/{prediction_file}') as f:
        this_preds = f.read().split('\n')[:-1]
        bert_outputs += [pred.split() for pred in this_preds]

2021-06-09 09:41:36.646120: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-3e668983be5bb6cc/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-3e668983be5bb6cc/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02. Subsequent calls will reuse this data.
[INFO|configuration_utils.py:470] 2021-06-09 09:42:17,670 >> loading configuration file /kaggle/input/tdmsci/results/output/config.json
[INFO|configuration_utils.py:508] 2021-06-09 09:42:17,671 >> Model config BertConfig {
  "_name_or_path": "allenai/scibert_scivocab_cased",
  "architectu
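
The input file handed to the prediction script is written in JSON Lines form, that is, one JSON object per line with the keys `tokens`, `tags` and `id`. A minimal round-trip sketch, using a temporary directory and made-up rows (the ids `paper-1` and `paper-2` are hypothetical):

```python
import json
import os
import tempfile

# Toy rows in the same shape as test_rows
rows = [
    {"tokens": ["The", "study", "uses", "ADNI"], "tags": ["O", "O", "O", "O"], "id": "paper-1"},
    {"tokens": ["Survey", "data", "were", "used"], "tags": ["O", "O", "O", "O"], "id": "paper-2"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_ner_input.json")

    # Write one JSON object per line
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Read the file back line by line
    with open(path) as f:
        loaded = [json.loads(line) for line in f]

print(loaded == rows)  # True
```

The dummy `O` tags carry no information; they only give the test rows the same schema as the training rows.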

In [26]:
#print(bert_outputs)

# labelArrays = []
# for output in bert_outputs:
#     if('B-DATASET' in output or 'I-DATASET' in output):
#         labelArrays.append(output)
# print(labelArrays)

## 5.2. Restore labels

In [27]:

test_sentences = [row['tokens'] for row in test_rows]
#del test_rows

bert_dataset_labels = [] # store all dataset labels for each publication

for length in paper_lengths:
    print(length)
    labels = set()
    for sentence, pred in zip(test_sentences[:length], bert_outputs[:length]):
        curr_phrase = ''
        for word, tag in zip(sentence, pred):
            if tag == 'B-DATASET': # start a new phrase
                if curr_phrase:
                    labels.add(curr_phrase)
                    curr_phrase = ''
                curr_phrase = word
            elif tag == 'I-DATASET' and curr_phrase: # continue the phrase
                curr_phrase += ' ' + word
            else: # end last phrase (if any)
                if curr_phrase:
                    labels.add(curr_phrase)
                    curr_phrase = ''
        # check if the label is the suffix of the sentence
        if curr_phrase:
            labels.add(curr_phrase)
            curr_phrase = ''
    
    # record dataset labels for this publication
    bert_dataset_labels.append(labels)
    
    del test_sentences[:length], bert_outputs[:length]
print(len(bert_outputs))

# def getTags(sentences, preds):
#     labels = []
#     dataset = ""

#     for sentence, preds in zip(sentences, preds):
#         dataset = ""

#         for word, tag in zip(sentence.split(), preds):
#             #print("Word:{}, tag:{}".format(word,tag) )
#             if(tag == "B-DATASET"):
#                 dataset += tag + ' '
#             elif (tag == "I-DATASET" and dataset != ""):
#                 print("dfsdfdsf")
#                 dataset += tag + ' '
#             elif(tag != "B-DATSET" and tag != "I-DATASET" and dataset != ""):
#                 labels.append(dataset)
#                 dataset= ""
    
#     if(dataset ==""):
#         labels.append("")
#     return ("|".join(labels))
            
    
# def restoreLabels(predictions, test_df):
#     test_sentences_papers = test_df["Sentences"]
#     ids = test_df["Id"]
#     startIndex = 0
    
#     acc = 0
#     predictionStrings = []
#     for paper_i in range(len(test_sentences_papers)):
#         length = len(test_sentences_papers[paper_i])
#         paper_id = ids[paper_i]
#         predictions_paper = predictions[startIndex:startIndex+length]
#         labels = getTags(test_sentences_papers[paper_i], predictions_paper)
#         predictionStrings.append(labels)
#         startIndex += length
#         acc += 1
#     return predictionStrings
      



34
150
97
84
0


In [28]:
bert_dataset_labels[:5]


[{'Alzheimer s Disease Neuroimaging Initiative ADNI'},
 {'Common Core of Data',
  'Trends in International Mathematics and Science Study',
  'trends in International Mathematics and Science Study'},
 set(),
 set()]
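
The decoding of tag sequences into dataset phrases can also be written as a small pure function that does not modify its inputs. This is a sketch of the same logic on one toy sentence; the function name `extract_phrases` is an assumption.

```python
def extract_phrases(tokens, tags):
    """Collect the phrases marked by B-DATASET / I-DATASET tags."""
    phrases, current = set(), []
    for token, tag in zip(tokens, tags):
        if tag == 'B-DATASET':               # start a new phrase
            if current:
                phrases.add(' '.join(current))
            current = [token]
        elif tag == 'I-DATASET' and current:  # continue the open phrase
            current.append(token)
        else:                                 # 'O' or an I-tag without a B-tag
            if current:
                phrases.add(' '.join(current))
            current = []
    if current:                               # phrase that ends the sentence
        phrases.add(' '.join(current))
    return phrases


tokens = "We use the Common Core of Data and ADNI".split()
tags = ['O', 'O', 'O', 'B-DATASET', 'I-DATASET', 'I-DATASET', 'I-DATASET', 'O', 'B-DATASET']

found = extract_phrases(tokens, tags)
print(sorted(found))  # ['ADNI', 'Common Core of Data']
```

An `I-DATASET` tag that does not follow a `B-DATASET` tag is ignored, which matches the behaviour of the loop above.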

In [29]:
def jaccard_similarity(s1, s2):
    # Jaccard index on word sets: size of the intersection divided by size of the union
    w1 = set(s1.split(" "))
    w2 = set(s2.split(" "))
    intersection = len(w1 & w2)
    union = len(w1 | w2)
    return float(intersection) / union

filtered_bert_labels = []

for labels in bert_dataset_labels:
    filtered = []
    
    for label in sorted(labels, key=len):
        label = clean_text(label)
        if len(filtered) == 0 or all(jaccard_similarity(label, got_label) < 0.75 for got_label in filtered):
            filtered.append(label)
    
    filtered_bert_labels.append('|'.join(filtered))

In [30]:
filtered_bert_labels[:100]


['alzheimer s disease neuroimaging initiative adni',
 'common core of data|trends in international mathematics and science study',
 '',
 '']
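
The filtering step keeps the shortest label of each group of near-duplicates, where two labels count as near-duplicates if their word-level Jaccard similarity is at least 0.75. The sketch below uses word sets for both intersection and union; the helper names and the toy labels are assumptions.

```python
def jaccard(a, b):
    """Word-level Jaccard index of two strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(labels, threshold=0.75):
    """Keep a label only if it is dissimilar to all shorter labels kept so far."""
    kept = []
    for label in sorted(labels, key=len):
        if all(jaccard(label, k) < threshold for k in kept):
            kept.append(label)
    return kept


labels = {
    "common core of data",
    "nces common core of data",
    "schools and staffing survey",
}
result = drop_near_duplicates(labels)
print(result)  # ['common core of data', 'schools and staffing survey']
```

Here "nces common core of data" shares 4 of 5 distinct words with "common core of data" (similarity 0.8), so only the shorter variant is kept.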

In [31]:
final_predictions = []
for literal_match, bert_pred in zip(literal_preds, filtered_bert_labels):
    print("literal_match:{} --- bert_pred:{} \n--------------------------------------".format(literal_match, bert_pred))
    if literal_match:
        final_predictions.append(literal_match)
    else:
        final_predictions.append(bert_pred)

literal_match:alzheimer s disease neuroimaging initiative adni|adni --- bert_pred:alzheimer s disease neuroimaging initiative adni 
--------------------------------------
literal_match:integrated postsecondary education data system|common core of data|schools and staffing survey|nces common core of data|trends in international mathematics and science study|progress in international reading literacy study --- bert_pred:common core of data|trends in international mathematics and science study 
--------------------------------------
literal_match:sea lake and overland surges from hurricanes|noaa storm surge inundation|slosh model --- bert_pred: 
--------------------------------------
literal_match:rural urban continuum codes --- bert_pred: 
--------------------------------------


In [32]:
final_predictions[0:100]

['alzheimer s disease neuroimaging initiative adni|adni',
 'integrated postsecondary education data system|common core of data|schools and staffing survey|nces common core of data|trends in international mathematics and science study|progress in international reading literacy study',
 'sea lake and overland surges from hurricanes|noaa storm surge inundation|slosh model',
 'rural urban continuum codes']
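
The combination rule used here is a simple fallback: the literal match is used whenever it is non-empty, and the model prediction is used only otherwise. A compact sketch with made-up prediction strings (the function name `combine` is hypothetical):

```python
def combine(literal_preds, model_preds):
    """Prefer the literal match; fall back to the model prediction if it is empty."""
    return [lit if lit else mod for lit, mod in zip(literal_preds, model_preds)]


literal = ["adni", "", "rural urban continuum codes"]
model = ["alzheimer s disease neuroimaging initiative adni", "common core of data", ""]

combined = combine(literal, model)
print(combined)  # ['adni', 'common core of data', 'rural urban continuum codes']
```

With this rule, labels found only by the model are discarded for any paper that has at least one literal match.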

In [33]:
sample_submission['PredictionString'] = final_predictions
sample_submission.head()

| | Id | PredictionString |
|---|---|---|
| 0 | 2100032a-7c33-4bff-97ef-690822c43466 | alzheimer s disease neuroimaging initiative ad... |
| 1 | 2f392438-e215-4169-bebf-21ac4ff253e1 | integrated postsecondary education data system... |
| 2 | 3f316b38-1a24-45a9-8d8c-4e05a42257c6 | sea lake and overland surges from hurricanes\|n... |
| 3 | 8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60 | rural urban continuum codes |


In [34]:
sample_submission.to_csv('submission.csv', index=False)

## Generate submission file

In [35]:
# def generate_submission_file(test_ids, test_predictions):
#     submission_dict = {"Id": test_ids, "PredictionString": test_predictions}
#     submission_df = pd.DataFrame.from_dict(submission_dict)
#     submission_df.to_csv(f'submission.csv', index=False)
    
    

# generate_submission_file(test_df_ner["Id"], final_predictions)