<a href="https://colab.research.google.com/github/AbderrahimAl/Show-US-the-Data_Coleridge-Initiative/blob/main/bert_masked_language_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Coleridge Initiative - Show US the Data<br>Discover how data is used for the public good

> 📑 Context : This competition challenges data scientists to show how publicly funded data are used to serve science and society. Evidence through data is critical if government is to address the many threats facing society, including pandemics, climate change, Alzheimer’s disease, child hunger, increasing food production, maintaining biodiversity, and many other challenges. Yet much of the information about data necessary to inform evidence and science is locked inside publications.

> In this competition, you'll use natural language processing (NLP) to automate the discovery of how scientific data are referenced in publications. Utilizing the full text of scientific publications from numerous research areas gathered from CHORUS publisher members and other sources, you'll identify data sets that the publications' authors used in their work.

> 📌 Goal : The objective of the competition is to identify the mention of datasets within scientific publications. 

# <font color='#FF4500'> Table of Contents
* [1. Importing necessary packages and libraries📚](#1)
* [2. Loading the data ⌛](#2)
* [3. Data Pre-Processing🔧](#3)
* [4. Matching 📑](#4)
* [5. Masked Language Modeling  🤗](#5)


# <font color='#FF4500'>Importing necessary packages and libraries📚</font>

## Install packages :

In [None]:
!pip install datasets --no-index --find-links=file:///kaggle/input/coleridge-packages/packages/datasets
!pip install ../input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl
!pip install ../input/coleridge-packages/tokenizers-0.10.1-cp37-cp37m-manylinux1_x86_64.whl
!pip install ../input/coleridge-packages/transformers-4.5.0.dev0-py3-none-any.whl

from IPython.display import clear_output
clear_output()

## Importing Libraries:

In [None]:
import numpy as np
import pandas as pd 
import json
import os 
import re
import string
import random

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

from wordcloud import WordCloud

#Text Color
from termcolor import colored

#NLP
import spacy

from tqdm.auto import tqdm

import pathlib
import glob


from datasets import load_dataset
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, \
AutoModelForMaskedLM, Trainer, TrainingArguments, pipeline

from typing import List
from functools import partial



import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

plt.rcParams['figure.figsize']=(8,6)

In [None]:
def SeedEverything(seed: int):

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    print(f'Pipeline seed set to {seed}')

SEED=2021
SeedEverything(SEED)

Pipeline seed set to 2021


# <font color='#FF4500'> Loading the data ⌛

* `train.csv` - labels and metadata for the training set, drawn from the scientific publications in the `train` folder
* `train` - the full text of the training set's publications in JSON format, broken into sections with section titles
* `test` - the full text of the test set's publications in JSON format, broken into sections with section titles
* `sample_submission.csv` - a sample submission file in the correct format

### 1 CSV files :

In [None]:
DATA_PATH = pathlib.Path('../input/coleridgeinitiative-show-us-the-data')

**Columns Description :**

* `Id` - publication id; note that some training documents have multiple rows, indicating multiple mentioned datasets
* `pub_title` - title of the publication (a small number of publications have the same title)
* `dataset_title` - the title of the dataset that is mentioned within the publication
* `dataset_label` - a portion of the text that indicates the dataset
* `cleaned_label` - the `dataset_label`, as passed through the `clean_text` function from the Evaluation page
* `PredictionString` - the predicted labels, in the same cleaned form as `cleaned_label` (this column appears only in the sample submission)

In [None]:
#reading train.csv
train=pd.read_csv(DATA_PATH /'train.csv')
train.sample(5)

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label
7763,bcd1b4eb-d6a3-4669-84ca-5fb5bf5e7193,7. AN ASSESSMENT OF POLICIES TO IMPROVE TEACHE...,Trends in International Mathematics and Scienc...,Trends in International Mathematics and Scienc...,trends in international mathematics and scienc...
11297,a7b26e45-1939-4950-9707-3a2b06e021d0,Modeling high-impact weather and climate: less...,International Best Track Archive for Climate S...,IBTrACS,ibtracs
11750,051f87f6-0b60-42fe-8d22-84614b95f859,Observations and a Model of the Mean Circulati...,World Ocean Database,World Ocean Database,world ocean database
17503,2558d9bc-89f4-4310-8f02-12243ed7930e,Sensitivity of Ground-Based Remote Sensing Est...,Census of Agriculture,Census of Agriculture,census of agriculture
5189,84387aee-43ad-48aa-bc4f-748bd8bcdc22,"High body mass index, brain metabolism and con...",Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,adni


* Let's take a look at `sample_submission.csv`:

In [None]:
sampleSubmission=pd.read_csv(DATA_PATH /'sample_submission.csv')
sampleSubmission.head()

Unnamed: 0,Id,PredictionString
0,2100032a-7c33-4bff-97ef-690822c43466,
1,2f392438-e215-4169-bebf-21ac4ff253e1,
2,3f316b38-1a24-45a9-8d8c-4e05a42257c6,
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,


### 2 Basic Analysis :

* Training set shape :

In [None]:
print('Dimension of the training Dataset : {}'.format(colored(train.shape,'blue')))

Dimension of the training Dataset : (19661, 5)


* Data Description :

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19661 entries, 0 to 19660
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Id             19661 non-null  object
 1   pub_title      19661 non-null  object
 2   dataset_title  19661 non-null  object
 3   dataset_label  19661 non-null  object
 4   cleaned_label  19661 non-null  object
dtypes: object(5)
memory usage: 768.1+ KB


In [None]:
train.isnull().sum().to_frame('NaN Values')

Unnamed: 0,NaN Values
Id,0
pub_title,0
dataset_title,0
dataset_label,0
cleaned_label,0


The training set has no missing values. Next, let's count the unique values in each column:

In [None]:
print('All rows :',colored(train.shape[0],'red'))
for column in train.columns:
    print("{} : {}".format(column,colored(len(train[column].unique()),'blue')))

All rows : 19661
Id : 14316
pub_title : 14271
dataset_title : 45
dataset_label : 130
cleaned_label : 130


* There are 14316 unique publication `Id` values across 19661 rows, which means some publications mention more than one dataset.
* `pub_title` has fewer unique values (14271) than `Id`, so some distinct publications share the same title.
* There are 45 unique `dataset_title` values but 130 unique `dataset_label` values, which means some datasets are referred to by several labels.
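The observations above can be checked by grouping labels per publication and per dataset title. The following is a minimal sketch in plain Python on a handful of made-up rows (the ids `pub-1` to `pub-3` are hypothetical, not taken from `train.csv`); on the real DataFrame a `groupby` followed by `nunique` would presumably give the same counts.

```python
from collections import defaultdict

# Hypothetical rows in the shape (Id, dataset_title, dataset_label)
rows = [
    ("pub-1", "Alzheimer's Disease Neuroimaging Initiative (ADNI)", "ADNI"),
    ("pub-1", "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
     "Alzheimer's Disease Neuroimaging Initiative (ADNI)"),
    ("pub-2", "World Ocean Database", "World Ocean Database"),
    ("pub-3", "Census of Agriculture", "Census of Agriculture"),
    ("pub-3", "World Ocean Database", "World Ocean Database"),
]

labels_per_pub = defaultdict(set)    # publication id -> labels it mentions
labels_per_title = defaultdict(set)  # dataset title  -> labels used for it
for pub_id, title, label in rows:
    labels_per_pub[pub_id].add(label)
    labels_per_title[title].add(label)

# Publications with more than one label, and datasets with more than one label
multi_label_pubs = sorted(p for p, s in labels_per_pub.items() if len(s) > 1)
multi_label_titles = sorted(t for t, s in labels_per_title.items() if len(s) > 1)

print(multi_label_pubs)
print(multi_label_titles)
```

Here `pub-1` and `pub-3` each carry two labels, and only the ADNI title is reached through more than one label.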

### 3 Reading JSON format :

The publications used for training and testing are provided in JSON format, broken up into sections with section titles.

In [None]:
trainFilesPath =DATA_PATH /'train'
testFilesPath = DATA_PATH /'test'

In [None]:
def ReadJsonFiles(fileName, InputPath):
    """
    Return the full text of a publication from its JSON file,
    joining the section texts and dropping the section titles.
    """
    
    JsonPATH = os.path.join(InputPath, (fileName+'.json'))
    
    publicationSections = []
    
    with open(JsonPATH, 'r') as file:
        json_decode = json.load(file)
        for data in json_decode:
            publicationSections.append(data.get('text'))
    
    publicationText = ' '.join(publicationSections)
    
    return publicationText

Let's apply `ReadJsonFiles` to the training set and to the submission set (the Kaggle test set):

In [None]:
# Extract the text from each JSON file and add it as a new column to the train and sampleSubmission DataFrames
tqdm.pandas()
train['publicationText']=train['Id'].progress_apply(lambda x:ReadJsonFiles(x,trainFilesPath))
sampleSubmission['publicationText']=sampleSubmission['Id'].progress_apply(lambda x:ReadJsonFiles(x,testFilesPath))


In [None]:
train.sample(5)

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label,publicationText
15141,6db465a5-51ed-4bb8-8211-2be2fa1a298a,The U.S. Scientific and Technical Workforce I...,Survey of Graduate Students and Postdoctorates...,Survey of Graduate Students and Postdoctorates...,survey of graduate students and postdoctorates...,The Office of Science and Technology Policy (O...
13086,49dc08db-15c0-4949-aa20-7ce1fe1a9e0b,Choosing a Postsecondary Institution. Statisti...,Beginning Postsecondary Student,Beginning Postsecondary Student,beginning postsecondary student,"In general, beginning postsecondary students w..."
5684,f80261df-de7a-4635-a174-62e746ded42d,Early Cerebral Small Vessel Disease and Brain ...,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,adni,Objective: Decline in cognitive function begin...
9813,4d05fc57-5d03-4937-9674-29357955ca80,Extension and refinement of the predictive val...,Alzheimer's Disease Neuroimaging Initiative (A...,Alzheimer's Disease Neuroimaging Initiative (A...,alzheimer s disease neuroimaging initiative adni,Background: This study examined the predictive...
16772,d1562dc2-f009-49eb-acd7-349ee33828fd,Trajectories of physiological dysregulation pr...,Baltimore Longitudinal Study of Aging (BLSA),Baltimore Longitudinal Study of Aging,baltimore longitudinal study of aging,Scientists studying aging do so along two fron...


In [None]:
sampleSubmission.head()

Unnamed: 0,Id,PredictionString,publicationText
0,2100032a-7c33-4bff-97ef-690822c43466,,Cognitive deficits and reduced educational ach...
1,2f392438-e215-4169-bebf-21ac4ff253e1,,This report describes how the education system...
2,3f316b38-1a24-45a9-8d8c-4e05a42257c6,,"Cape Hatteras National Seashore (CAHA), locate..."
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,,A significant body of research has been conduc...


# <font color='#FF4500'>Data Pre-Processing 🔧

Let's do some data cleaning

The `TextCleaning` function converts the text to lower case and removes punctuation, special characters, emojis and repeated spaces.

In [None]:
def TextCleaning(text):
    # Delete ASCII punctuation outright, so "alzheimer's" becomes "alzheimers"
    text = ''.join([k for k in text if k not in string.punctuation])
    # Lowercase and replace every remaining non-alphanumeric run (emojis included) with one space
    text = re.sub('[^A-Za-z0-9]+', ' ', str(text).lower()).strip()
    # Collapse repeated spaces
    text = re.sub(' +', ' ', text)

    # Emoji ranges; the substitution above has already removed them, so this is only a safeguard
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text

In [None]:
#Example :
TextCleaning('Hello World 😀')

'hello world'
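One detail worth seeing side by side: deleting punctuation before normalising gives a different string than replacing it with a space, which is what the `clean_text` rule does. The sketch below uses two helper names invented for this comparison (`strip_punctuation_then_clean` and `replace_punctuation_with_space`); they mirror the two rules but are not part of the notebook.

```python
import re
import string

def strip_punctuation_then_clean(text):
    # Same order as TextCleaning: delete punctuation, then lowercase and normalise
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub("[^A-Za-z0-9]+", " ", text.lower()).strip()

def replace_punctuation_with_space(text):
    # Same rule as clean_text: every non-alphanumeric run becomes one space
    return re.sub("[^A-Za-z0-9]+", " ", text.lower()).strip()

label = "Alzheimer's Disease Neuroimaging Initiative (ADNI)"
deleted = strip_punctuation_then_clean(label)
spaced = replace_punctuation_with_space(label)

print(deleted)  # alzheimers disease neuroimaging initiative adni
print(spaced)   # alzheimer s disease neuroimaging initiative adni
```

The second form matches the `cleaned_label` shown in the training sample, so labels produced from text cleaned the first way may differ from it wherever an apostrophe is involved.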

In [None]:
tqdm.pandas()
train['publicationText']=train['publicationText'].progress_apply(lambda x:TextCleaning(x))
sampleSubmission['publicationText']=sampleSubmission['publicationText'].progress_apply(lambda x: TextCleaning(x))


In [None]:
sampleSubmission

Unnamed: 0,Id,PredictionString,publicationText
0,2100032a-7c33-4bff-97ef-690822c43466,,cognitive deficits and reduced educational ach...
1,2f392438-e215-4169-bebf-21ac4ff253e1,,this report describes how the education system...
2,3f316b38-1a24-45a9-8d8c-4e05a42257c6,,cape hatteras national seashore caha located a...
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,,a significant body of research has been conduc...


In [None]:
papers = {}
for paper_id in sampleSubmission['Id']:
    with open(f'{testFilesPath}/{paper_id}.json', 'r') as f:
        paper = json.load(f)
        papers[paper_id] = paper

# <font color='#FF4500'>Matching 📑

In [None]:
adnl_govt_labels = pd.read_csv('../input/bigger-govt-dataset-list/data_set_800.csv')

literal_preds = []
to_append = []
for _, row in tqdm(sampleSubmission.iterrows(), total=len(sampleSubmission)):
    clean_string = row['publicationText']
    matches = ''  # '|'-separated titles found in this publication

    for query_string in adnl_govt_labels['title'].astype(str):
        if query_string in clean_string:
            if matches == '':
                matches = query_string
            elif query_string not in matches:
                # Skip titles already contained in the matches collected so far
                matches = matches + '|' + query_string
    literal_preds.append(matches)


In [None]:
literal_preds

['adni|alzheimers disease neuroimaging initiative|pubmed',
 'common core of data|nces common core of data|trends in international mathematics and science study|schools and staffing survey|integrated postsecondary education data system|ipeds|progress in international reading literacy study',
 'slosh model|noaa storm surge inundation|sea lake and overland surges from hurricanes',
 '']
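The matching step is plain substring search of known titles in the cleaned publication text. Below is a minimal, self-contained sketch of that idea on made-up inputs; `literal_match` and `known_titles` are hypothetical names, and this version skips exact repeats of a title rather than using the substring test from the cell above.

```python
def literal_match(clean_string, titles):
    """Return the known titles found in clean_string, joined with '|'."""
    found = []
    for title in titles:
        if title in clean_string and title not in found:
            found.append(title)
    return "|".join(found)

# Hypothetical list of already cleaned dataset titles
known_titles = ["adni", "world ocean database", "census of agriculture"]

hit = literal_match(
    "we used adni scans and the world ocean database for validation",
    known_titles,
)
miss = literal_match("no dataset is named here", known_titles)

print(hit)
print(repr(miss))
```

Because the test is a raw substring check, a short title can also match inside a longer word, so very short titles may need extra care.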

# <font color='#FF4500'>Masked Language Modeling 🤗

## Load model and tokenizer

In [None]:
PRETRAINED_PATH = '../input/coleridge-mlm-model/output-mlm/checkpoint-48000'
TOKENIZER_PATH = '../input/coleridge-mlm-model/model_tokenizer'

MAX_LENGTH = 64
OVERLAP = 20

PREDICT_BATCH = 32

DATASET_SYMBOL = '$' # this symbol represents a dataset name
NONDATA_SYMBOL = '#' # this symbol represents a non-dataset name
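`MAX_LENGTH` and `OVERLAP` describe a sliding window over the words of a long sentence: each window holds at most `MAX_LENGTH` words and starts `MAX_LENGTH - OVERLAP` words after the previous one. The sketch below works through the arithmetic for a 100-word sentence; `window_starts` is a helper name made up for this illustration.

```python
MAX_LENGTH = 64  # same values as the configuration cell
OVERLAP = 20

def window_starts(n_words, max_length=MAX_LENGTH, overlap=OVERLAP):
    """Start offsets of the windows used for a sentence of n_words words."""
    if n_words <= max_length:
        return [0]  # short sentences are kept whole
    return list(range(0, n_words, max_length - overlap))

words = [f"w{i}" for i in range(100)]
starts = window_starts(len(words))
windows = [words[s:s + MAX_LENGTH] for s in starts]
lengths = [len(w) for w in windows]

# Words shared by the first two windows
shared = len(set(windows[0]) & set(windows[1]))

print(starts)   # [0, 44, 88]
print(lengths)  # [64, 56, 12]
print(shared)   # 20
```

With a stride of 44 words, the last window can be quite short, as the 12-word tail here shows.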

In [None]:
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(PRETRAINED_PATH)

mlm = pipeline(
    'fill-mask', 
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

In [None]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def jaccard_similarity(s1, s2):
    l1 = s1.split(" ")
    l2 = s2.split(" ")    
    intersection = len(list(set(l1).intersection(l2)))
    union = (len(l1) + len(l2)) - intersection
    return float(intersection) / union

def clean_paper_sentence(s):
    """
    This function is essentially clean_text without lowercasing.
    """
    s = re.sub('[^A-Za-z0-9]+', ' ', str(s)).strip()
    s = re.sub(' +', ' ', s)
    return s

def shorten_sentences(sentences):
    """
    Sentences that have more than MAX_LENGTH words will be split
    into multiple sentences with overlappings.
    """
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p+MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences

connection_tokens = {'s', 'of', 'and', 'in', 'on', 'for', 'data', 'dataset'}
def find_mask_candidates(sentence):
    """
    Extract masking candidates for Masked Dataset Modeling from a given $sentence.
    A candidate should be a continuous sequence of at least 2 words, 
    each of these words either has the first letter in uppercase or is one of
    the connection words ($connection_tokens). Furthermore, the connection 
    tokens are not allowed to appear at the beginning and the end of the
    sequence.
    """
    def candidate_qualified(words):
        while len(words) and words[0].lower() in connection_tokens:
            words = words[1:]
        while len(words) and words[-1].lower() in connection_tokens:
            words = words[:-1]
        
        return len(words) >= 2
    
    candidates = []
    
    phrase_start, phrase_end = -1, -1
    # Scanning starts at index 1, so the first word of a sentence never opens a candidate
    for idx in range(1, len(sentence)):
        word = sentence[idx]
        if word[0].isupper() or word in connection_tokens:
            if phrase_start == -1:
                phrase_start = phrase_end = idx
            else:
                phrase_end = idx
        else:
            if phrase_start != -1:
                if candidate_qualified(sentence[phrase_start:phrase_end+1]):
                    candidates.append((phrase_start, phrase_end))
                phrase_start = phrase_end = -1
    
    if phrase_start != -1:
        if candidate_qualified(sentence[phrase_start:phrase_end+1]):
            candidates.append((phrase_start, phrase_end))
    
    return candidates
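To see the candidate rule in action, here is a compact re-implementation run on one example sentence. It is a sketch written for this illustration (`candidate_spans` and `qualifies` are hypothetical names), intended to follow the same rule as `find_mask_candidates`, including the skip of the first word.

```python
CONNECTION_TOKENS = {"s", "of", "and", "in", "on", "for", "data", "dataset"}

def qualifies(words):
    # Trim connection words from both ends, then require at least two words
    while words and words[0].lower() in CONNECTION_TOKENS:
        words = words[1:]
    while words and words[-1].lower() in CONNECTION_TOKENS:
        words = words[:-1]
    return len(words) >= 2

def candidate_spans(words):
    spans, start = [], None
    # Index 0 is skipped; the extra final step (i == len(words)) closes an open run
    for i in range(1, len(words) + 1):
        in_phrase = i < len(words) and (
            words[i][0].isupper() or words[i] in CONNECTION_TOKENS
        )
        if in_phrase:
            if start is None:
                start = i
        elif start is not None:
            if qualifies(words[start:i]):
                spans.append((start, i - 1))
            start = None
    return spans

sentence = "We use data from the Baltimore Longitudinal Study of Aging in this work".split()
spans = candidate_spans(sentence)
phrases = [" ".join(sentence[a:b + 1]) for a, b in spans]

print(spans)
print(phrases)
```

Note that the trimming is used only to decide whether a run qualifies. The returned span still covers the whole run, which is why the phrase here keeps its trailing `in`, and why the lone `data` at index 2 is rejected.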

## Transform :

In [None]:
mask = mlm.tokenizer.mask_token

In [None]:
all_test_data = []


for paper_id in sampleSubmission['Id']:
    # load paper
    paper = papers[paper_id]
    
    # extract sentences
    sentences = set([clean_paper_sentence(sentence) for section in paper 
                     for sentence in section['text'].split('.')
                    ])
    sentences = shorten_sentences(sentences) # make sentences short
    sentences = [sentence for sentence in sentences if len(sentence) > 10] # only accept sentences with length > 10 chars
    sentences = [sentence for sentence in sentences if any(word in sentence.lower() for word in ['data', 'study'])]
    sentences = [sentence.split() for sentence in sentences] # sentence = list of words
    
    # mask
    test_data = []
    for sentence in sentences:
        for phrase_start, phrase_end in find_mask_candidates(sentence):
            dt_point = sentence[:phrase_start] + [mask] + sentence[phrase_end+1:]
            test_data.append((' '.join(dt_point), ' '.join(sentence[phrase_start:phrase_end+1]))) # (masked text, phrase)
    
    all_test_data.append(test_data)
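Each entry built in the loop above pairs a masked sentence with the phrase that was hidden. A minimal sketch of that construction follows; the literal `[MASK]` is an assumption made for this example (the notebook reads the real token from `mlm.tokenizer.mask_token`), and `mask_span` is a hypothetical helper.

```python
MASK = "[MASK]"  # assumed mask token, for illustration only

def mask_span(words, start, end, mask=MASK):
    """Replace words[start..end] (inclusive) with a single mask token."""
    masked = words[:start] + [mask] + words[end + 1:]
    phrase = words[start:end + 1]
    return " ".join(masked), " ".join(phrase)

words = "Data come from the World Ocean Database and were cleaned".split()
masked_text, phrase = mask_span(words, 4, 6)

print(masked_text)  # Data come from the [MASK] and were cleaned
print(phrase)       # World Ocean Database
```

The whole multi-word phrase collapses to one mask position, so the model is asked a single question per candidate.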
    

## Predict :

In [None]:
pred_labels = []

pbar = tqdm(total = len(all_test_data))
for test_data in all_test_data:
    pred_bag = set()
    
    if len(test_data):
        texts, phrases = list(zip(*test_data))
        mlm_pred = []
        for p_id in range(0, len(texts), PREDICT_BATCH):
            batch_texts = texts[p_id:p_id+PREDICT_BATCH]
            batch_pred = mlm(list(batch_texts), targets=[f' {DATASET_SYMBOL}', f' {NONDATA_SYMBOL}'])
            
            if len(batch_texts) == 1:
                batch_pred = [batch_pred]
            
            mlm_pred.extend(batch_pred)
        
        for (result1, result2), phrase in zip(mlm_pred, phrases):
            if (result1['score'] > result2['score']*1.5 and result1['token_str'] == DATASET_SYMBOL) or\
               (result2['score'] > result1['score']*1.5 and result2['token_str'] == NONDATA_SYMBOL):
                pred_bag.add(clean_text(phrase))
    
    # filter labels by jaccard score 
    filtered_labels = []
    
    for label in sorted(pred_bag, key=len, reverse=True):
        if len(filtered_labels) == 0 or all(jaccard_similarity(label, got_label) < 0.75 for got_label in filtered_labels):
            filtered_labels.append(label)
            
    pred_labels.append('|'.join(filtered_labels))
    pbar.update(1)

In [None]:
pred_labels

['lothian birth cohort study lbc1936',
 'trends in international mathematics and science study timss|progress in international|pirls pisa',
 'nc sea level rise risk management study slrrms|dataset data management in arcgis',
 'iri cnp data']
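The Jaccard filter keeps the longest labels first and drops any later label that overlaps too much with one already kept. Here is a small worked sketch with three made-up candidate labels; `filter_labels` is a name introduced for this example, and the 0.75 threshold is the one used in the prediction cell.

```python
def jaccard_similarity(s1, s2):
    l1, l2 = s1.split(" "), s2.split(" ")
    intersection = len(set(l1).intersection(l2))
    union = len(l1) + len(l2) - intersection
    return intersection / union

def filter_labels(labels, threshold=0.75):
    kept = []
    # Longest labels first, so shorter near-duplicates are the ones dropped
    for label in sorted(labels, key=len, reverse=True):
        if all(jaccard_similarity(label, k) < threshold for k in kept):
            kept.append(label)
    return kept

long_label = "trends in international mathematics and science study timss"
short_label = "trends in international mathematics and science study"
candidates = {long_label, short_label, "pirls pisa"}

score = jaccard_similarity(long_label, short_label)
kept = filter_labels(candidates)

print(score)  # 0.875
print(kept)
```

The two TIMSS variants share 7 words out of a union of 8, giving 0.875, so only the longer one survives, while `pirls pisa` shares nothing and is kept.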

In [None]:
final_predictions = []
for literal_match, mlm_pred in zip(literal_preds, pred_labels):
    if literal_match != '' and mlm_pred not in literal_match:
        # Literal matches exist and the MLM string is not already contained in them
        final_predictions.append(literal_match + '|' + mlm_pred)
    elif literal_match:
        final_predictions.append(literal_match)
    else:
        final_predictions.append(mlm_pred)

sampleSubmission['PredictionString'] = final_predictions
sample_submission = sampleSubmission[['Id', 'PredictionString']]
sample_submission.to_csv('submission.csv', index=False)
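The cell above compares the two prediction strings as wholes, so a label that appears in both sources can still be repeated in the result. As a possible variant, and not the method used in this notebook, the two strings could be merged label by label; `merge_predictions` is a hypothetical helper sketched under that assumption.

```python
def merge_predictions(literal_match, mlm_pred):
    """Union of the '|'-separated labels of both strings, first occurrence wins."""
    labels = []
    for part in (literal_match, mlm_pred):
        for label in part.split("|"):
            if label and label not in labels:
                labels.append(label)
    return "|".join(labels)

both = merge_predictions("adni|pubmed", "adni|lothian birth cohort study lbc1936")
only_mlm = merge_predictions("", "iri cnp data")
only_literal = merge_predictions("slosh model", "")

print(both)
print(only_mlm)
print(only_literal)
```

This removes exact duplicates only. Near-duplicates such as a title with and without its acronym would still both appear unless a similarity filter is applied as well.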

In [None]:
sample_submission['PredictionString'].head().to_list()

['adni|alzheimers disease neuroimaging initiative|pubmed|lothian birth cohort study lbc1936',
 'common core of data|nces common core of data|trends in international mathematics and science study|schools and staffing survey|integrated postsecondary education data system|ipeds|progress in international reading literacy study|trends in international mathematics and science study timss|progress in international|pirls pisa',
 'slosh model|noaa storm surge inundation|sea lake and overland surges from hurricanes|nc sea level rise risk management study slrrms|dataset data management in arcgis',
 'iri cnp data']