# <p style="font-family:newtimeroman; text-align:center; font-size:150%">Coleridge Initiative - Show US the Data<br>Discover how data is used for the public good</p>

> 📑 Context : This competition challenges data scientists to show how publicly funded data are used to serve science and society. Evidence through data is critical if government is to address the many threats facing society, including pandemics, climate change, Alzheimer’s disease, child hunger, increasing food production, maintaining biodiversity, and addressing many other challenges. Yet much of the information about data necessary to inform evidence and science is locked inside publications.

> In this competition, you'll use natural language processing (NLP) to automate the discovery of how scientific data are referenced in publications. Utilizing the full text of scientific publications from numerous research areas gathered from CHORUS publisher members and other sources, you'll identify data sets that the publications' authors used in their work.

> 📌 Goal : The objective of the competition is to identify the mention of datasets within scientific publications. 

<a id='0'></a>
## <p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:center; border-radius: 10px 25px;">Table of Contents</p>
* [1. Importing necessary packages and libraries📚](#1)
* [2. Loading the data ⌛](#2)
* [3. Data Pre-Processing🔧](#3)
* [4. Matching 📑](#4)
* [5. Masked Language Modeling  🤗](#5)


<a id='1'></a>
## <p style="background-color:skyblue ; font-family:newtimeroman; font-size:120%; text-align:center; border-radius: 10px 25px;">Importing necessary packages and libraries📚</p>

## Install packages :

In [None]:
!pip install datasets --no-index --find-links=file:///kaggle/input/coleridge-packages/packages/datasets
!pip install ../input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl
!pip install ../input/coleridge-packages/tokenizers-0.10.1-cp37-cp37m-manylinux1_x86_64.whl
!pip install ../input/coleridge-packages/transformers-4.5.0.dev0-py3-none-any.whl


from IPython.display import clear_output
clear_output()

## Importing Libraries:

In [None]:
import numpy as np
import pandas as pd 
import json
import os 
import re
import string
import random
import statistics
import Levenshtein
import math


# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

from wordcloud import WordCloud
from collections import Counter

#Text Color
from termcolor import colored

#NLP
import spacy

from tqdm.auto import tqdm

import pathlib
import glob


from datasets import load_dataset
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, \
AutoModelForMaskedLM, Trainer, TrainingArguments, pipeline

from typing import List
from functools import partial
from statistics import mean, median, mode, stdev
from sklearn.metrics import accuracy_score, classification_report




import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

plt.rcParams['figure.figsize']=(8,6)

In [None]:
def SeedEverything(seed: int):

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    print(f'Set pipeline seed = {seed}')

SEED=2021
SeedEverything(SEED)

<a id='2'></a>
## <p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:center; border-radius: 10px 25px;">Loading the data ⌛</p>

* `train.csv` - labels and metadata for the training set, drawn from the scientific publications in the train folder
* `train` - the full text of the training set's publications in JSON format, broken into sections with section titles
* `test` - the full text of the test set's publications in JSON format, broken into sections with section titles
* `sample_submission.csv` - a sample submission file in the correct format

### 1 CSV files :

In [None]:
DATA_PATH = pathlib.Path('../input/coleridgeinitiative-show-us-the-data')

**Columns Description :**

* `Id` - publication id - note that there are multiple rows for some training documents, indicating multiple mentioned datasets
* `pub_title` - title of the publication (a small number of publications have the same title)
* `dataset_title` - the title of the dataset that is mentioned within the publication
* `dataset_label` - a portion of the text that indicates the dataset
* `cleaned_label` - the dataset_label, as passed through the clean_text function from the Evaluation page
* `PredictionString`- To be filled with equivalent of cleaned_label of train data (just in sample submission).
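
As a rough illustration of what a `cleaned_label` looks like, the sketch below applies the same regular expression as the `clean_text` helper defined later in this notebook. The sample label is made up for illustration and is not taken from `train.csv`.

```python
import re

def clean_text(txt):
    # Lowercase, then replace every run of non-alphanumeric characters with one space
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

# Hypothetical label, for illustration only
sample_label = "National Education Longitudinal Study (NELS:88)"
cleaned = clean_text(sample_label)
print(cleaned)
```

Punctuation such as parentheses and colons becomes a single space, so a prediction only matches a label after both have gone through the same cleaning.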

In [None]:
MAX_SAMPLE = None

train_path = '../input/coleridgeinitiative-show-us-the-data/train.csv'
paper_train_folder = '../input/coleridgeinitiative-show-us-the-data/train/'

train = pd.read_csv(train_path)
print (train.shape)
train = train[:MAX_SAMPLE]
#print (train.head())
print ("=====================")
# Group by publication, training labels should have the same form as expected output.
df = train.groupby('Id').agg({
    'pub_title': 'first',
    'dataset_title': '|'.join,
    'dataset_label': '|'.join,
    'cleaned_label': '|'.join
}).reset_index()    
#print (train.query("dataset_title.str.contains('|')"))
# print(train['dataset_title'][138])
# print(train['dataset_label'][138])
# print(train['cleaned_label'][138])
print ("==================================")
print('train size: ', len(df))
print ("==================================")
print (df.head())
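
The grouping step above collapses all rows of a publication into one row whose labels are joined with `|`, the same separator used in `PredictionString`. A minimal pandas-free sketch of that idea follows; the ids and labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in the shape of train.csv: (Id, cleaned_label)
rows = [
    ("pub-1", "adni"),
    ("pub-1", "alzheimer s disease neuroimaging initiative adni"),
    ("pub-2", "common core of data"),
]

# Collect every label that belongs to the same publication
labels_by_id = defaultdict(list)
for pub_id, label in rows:
    labels_by_id[pub_id].append(label)

# One '|'-separated string per publication
grouped = {pub_id: "|".join(labels) for pub_id, labels in labels_by_id.items()}
print(grouped)
```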

In [None]:
#reading train.csv
#train=pd.read_csv(DATA_PATH /'train.csv')
print (df.shape)
df.sample(5)


* Rather than loading `sample_submission.csv`, this notebook holds out the last 3000 grouped training publications as the evaluation set `output`:

In [None]:
output = df.tail(3000).copy()  # copy so that new columns can be added without touching df
#output=pd.read_csv(DATA_PATH /'sample_submission.csv') # output file
output.head()

### 2 Basic Analysis :

* Training set shape :

In [None]:
print('Dimension of the training Dataset : {}'.format(colored(train.shape,'blue')))

* Data Description :

In [None]:
output.info()

In [None]:
output.isnull().sum().to_frame('NaN Values')

The training set does not contain any duplicated values.

In [None]:
print('All rows :',colored(train.shape[0],'red'))
for column in train.columns:
    print("{} : {}".format(column,colored(len(train[column].unique()),'blue')))

* There are 14316 unique `Publication Id` values, which means some publications mention more than one dataset.
* `publication title` has fewer unique values than `Publication Id`, so some different publications share the same title.
* There are 45 unique `Dataset title` values and 130 unique `Dataset Label` values, which means some datasets have multiple labels.

### 3 Reading JSON format :

The publications that we will use in train and test are provided  in JSON format, broken up into sections with section titles.
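
To make the layout concrete, here is a small sketch of one publication as a list of sections. The `text` key is the one this notebook reads; the `section_title` key name and the sample sentences are assumptions made for illustration.

```python
import json

# Hypothetical publication with two sections (key name `section_title` is assumed)
raw = json.dumps([
    {"section_title": "Introduction", "text": "We study reading scores."},
    {"section_title": "Data", "text": "We use the Common Core of Data."},
])

sections = json.loads(raw)
# Keep only the body text and skip sections without a usable `text` field
full_text = " ".join(s["text"] for s in sections if s.get("text"))
print(full_text)
```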

In [None]:
#trainFilesPath =DATA_PATH /'train' #we will use this only, last 3000 files
#testFilesPath = DATA_PATH /'test'
testFilesPath = DATA_PATH /'train'

In [None]:
def ReadJsonFiles(fileName, InputPath):
    """
    Return the full text of a publication from its JSON file, without the section titles.
    """
    
    JsonPATH = os.path.join(InputPath, (fileName+'.json'))
    
    publicationSections = []
    
    with open(JsonPATH, 'r') as file:
        json_decode = json.load(file)
        for data in json_decode:
            publicationSections.append(data.get('text'))
    
    publicationText = ' '.join(publicationSections)
    
    return publicationText

Let's apply `ReadJsonFiles` to the `output` set (the corresponding calls for the train and Kaggle test sets are commented out):

In [None]:
# Extract the text from each JSON file and add it as a new column
tqdm.pandas()
#train['publicationText']=train['Id'].progress_apply(lambda x:ReadJsonFiles(x,trainFilesPath))
#output['publicationText']=output['Id'].progress_apply(lambda x:ReadJsonFiles(x,testFilesPath))
output['publicationText']=output['Id'].progress_apply(lambda x:ReadJsonFiles(x,testFilesPath))

In [None]:
#train.sample(5)

In [None]:
output.info()

<a id='3'></a>
## <p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:center; border-radius: 10px 25px;">Data Pre-Processing 🔧</p>

Let's do some data cleaning

The `TextCleaning` function converts all text to lower case and removes special characters, emojis and repeated spaces.

In [None]:
def TextCleaning(text):
    
   
    text = ''.join([k for k in text if k not in string.punctuation])#Delete punctuation
    text = re.sub('[^A-Za-z0-9]+', ' ', str(text).lower()).strip()  #Convert all text to lower case
    text = re.sub(' +', ' ', text)

    
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    return text

In [None]:
#Example :
TextCleaning('Hello World 😀')
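
One detail worth checking is that `TextCleaning` deletes punctuation before the regular expression runs, whereas the `clean_text` helper used for labels replaces punctuation with a space. The sketch below compares the two on a made-up string. It omits the emoji pattern, on the observation that the `[^A-Za-z0-9]+` substitution has already removed emojis by the time that pattern is applied.

```python
import re
import string

def text_cleaning(text):
    # Same steps as TextCleaning, without the emoji pattern
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub('[^A-Za-z0-9]+', ' ', str(text).lower()).strip()
    return re.sub(' +', ' ', text)

def clean_text(txt):
    # Helper used for labels: punctuation becomes a space instead of vanishing
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

sample = "COVID-19 data, 2020 😀"  # hypothetical input
print(text_cleaning(sample))
print(clean_text(sample))
```

The hyphenated token ends up glued together in the first result and split in the second, so the two cleaners are not interchangeable when comparing text against labels.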

In [None]:
tqdm.pandas()
#train['publicationText']=train['publicationText'].progress_apply(lambda x:TextCleaning(x))
output['publicationText']=output['publicationText'].progress_apply(lambda x: TextCleaning(x))

In [None]:
output.head()

In [None]:
papers = {}
for paper_id in output['Id']:
    with open(f'{testFilesPath}/{paper_id}.json', 'r') as f:
        paper = json.load(f)
        papers[paper_id] = paper

In [None]:
len(papers)

In [None]:
#papers

<a id='4'></a>
## <p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:center; border-radius: 10px 25px;">Matching 📑</p>

<a id='5'></a>
## <p style="background-color:skyblue; font-family:newtimeroman; font-size:120%; text-align:center; border-radius: 10px 25px;">Masked Language Modeling 🤗</p>

## Load model and tokenizer

In [None]:
#PRETRAINED_PATH = '../input/coleridge-mlm-model/output-mlm/checkpoint-48000'
#TOKENIZER_PATH = '../input/coleridge-mlm-model/model_tokenizer'
PRETRAINED_PATH = '../input/dataset1/thesis-model/checkpoint-72000'
TOKENIZER_PATH = '../input/dataset1/abdd-model_tokenizer'


MAX_LENGTH = 64
OVERLAP = 20

PREDICT_BATCH = 32

DATASET_SYMBOL = '$' # this symbol represents a dataset name
NONDATA_SYMBOL = '#' # this symbol represents a non-dataset name

In [None]:
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, use_fast=True)
model = AutoModelForMaskedLM.from_pretrained(PRETRAINED_PATH)

mlm = pipeline(
    'fill-mask', 
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

In [None]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def jaccard_similarity(s1, s2):
    l1 = s1.split(" ")
    l2 = s2.split(" ")    
    intersection = len(list(set(l1).intersection(l2)))
    union = (len(l1) + len(l2)) - intersection
    return float(intersection) / union

def clean_paper_sentence(s):
    """
    This function is essentially clean_text without lowercasing.
    """
    s = re.sub('[^A-Za-z0-9]+', ' ', str(s)).strip()
    s = re.sub(' +', ' ', s)
    return s

def shorten_sentences(sentences):
    """
    Sentences that have more than MAX_LENGTH words will be split
    into multiple sentences with overlappings.
    """
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p+MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences

connection_tokens = {'s', 'of', 'and', 'in', 'on', 'for', 'data', 'dataset'}
def find_mask_candidates(sentence):
    """
    Extract masking candidates for Masked Dataset Modeling from a given $sentence.
    A candidate should be a continuous sequence of at least 2 words, 
    each of these words either has the first letter in uppercase or is one of
    the connection words ($connection_tokens). Furthermore, the connection 
    tokens are not allowed to appear at the beginning and the end of the
    sequence.
    """
    def candidate_qualified(words):
        while len(words) and words[0].lower() in connection_tokens:
            words = words[1:]
        while len(words) and words[-1].lower() in connection_tokens:
            words = words[:-1]
        
        return len(words) >= 2
    
    candidates = []
    
    phrase_start, phrase_end = -1, -1
    for id in range(1, len(sentence)):
        word = sentence[id]
        if word[0].isupper() or word in connection_tokens:
            if phrase_start == -1:
                phrase_start = phrase_end = id
            else:
                phrase_end = id
        else:
            if phrase_start != -1:
                if candidate_qualified(sentence[phrase_start:phrase_end+1]):
                    candidates.append((phrase_start, phrase_end))
                phrase_start = phrase_end = -1
    
    if phrase_start != -1:
        if candidate_qualified(sentence[phrase_start:phrase_end+1]):
            candidates.append((phrase_start, phrase_end))
    
    return candidates
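
To see how the overlapping windows of `shorten_sentences` behave, the sketch below reuses the function with the same `MAX_LENGTH` and `OVERLAP` values on a synthetic 100-word sentence. The sentence itself is an assumption made for illustration.

```python
MAX_LENGTH = 64
OVERLAP = 20

def shorten_sentences(sentences):
    short_sentences = []
    for sentence in sentences:
        words = sentence.split()
        if len(words) > MAX_LENGTH:
            # Window start advances by MAX_LENGTH - OVERLAP = 44 words
            for p in range(0, len(words), MAX_LENGTH - OVERLAP):
                short_sentences.append(' '.join(words[p:p + MAX_LENGTH]))
        else:
            short_sentences.append(sentence)
    return short_sentences

# Synthetic 100-word sentence: w0 w1 ... w99
long_sentence = ' '.join(f'w{i}' for i in range(100))
chunks = shorten_sentences([long_sentence])
for chunk in chunks:
    words = chunk.split()
    print(words[0], words[-1], len(words))
```

The windows start at words 0, 44 and 88. The last window is fully contained in the previous one, so short trailing chunks repeat text that has already been seen.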

## Transform :

In [None]:
mask = mlm.tokenizer.mask_token

In [None]:
output

In [None]:
all_test_data = []


for paper_id in output['Id']:
    #print (paper_id)
    #load paper
    paper = papers[paper_id]
    
    # extract sentences
    sentences = set([clean_paper_sentence(sentence) for section in paper 
                     for sentence in section['text'].split('.')
                    ])
    sentences = shorten_sentences(sentences) # make sentences short
    sentences = [sentence for sentence in sentences if len(sentence) > 10] # only accept sentences with length > 10 chars
    sentences = [sentence for sentence in sentences if any(word in sentence.lower() for word in ['data', 'study'])]
    sentences = [sentence.split() for sentence in sentences] # sentence = list of words
    
    
    # mask
    test_data = []
    for sentence in sentences:
        for phrase_start, phrase_end in find_mask_candidates(sentence):
            dt_point = sentence[:phrase_start] + [mask] + sentence[phrase_end+1:]
            test_data.append((' '.join(dt_point), ' '.join(sentence[phrase_start:phrase_end+1]))) # (masked text, phrase)
    
    all_test_data.append(test_data)
    
#print (len(all_test_data))
#print(all_test_data[1])
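
The candidate search and masking steps can be traced on a single sentence. This sketch copies `find_mask_candidates` (with the loop variable renamed) and uses `<mask>` as a placeholder; in the notebook the mask token comes from the tokenizer, so the literal value here is an assumption, as is the sample sentence.

```python
connection_tokens = {'s', 'of', 'and', 'in', 'on', 'for', 'data', 'dataset'}

def find_mask_candidates(sentence):
    def candidate_qualified(words):
        # Trim connection tokens from both ends, then require at least 2 words
        while len(words) and words[0].lower() in connection_tokens:
            words = words[1:]
        while len(words) and words[-1].lower() in connection_tokens:
            words = words[:-1]
        return len(words) >= 2

    candidates = []
    phrase_start, phrase_end = -1, -1
    for idx in range(1, len(sentence)):  # the first word is skipped
        word = sentence[idx]
        if word[0].isupper() or word in connection_tokens:
            if phrase_start == -1:
                phrase_start = phrase_end = idx
            else:
                phrase_end = idx
        else:
            if phrase_start != -1:
                if candidate_qualified(sentence[phrase_start:phrase_end + 1]):
                    candidates.append((phrase_start, phrase_end))
                phrase_start = phrase_end = -1
    if phrase_start != -1:
        if candidate_qualified(sentence[phrase_start:phrase_end + 1]):
            candidates.append((phrase_start, phrase_end))
    return candidates

MASK = '<mask>'  # assumed placeholder; the notebook reads it from the tokenizer
sentence = 'We use data from the Common Core of Data collected yearly'.split()
candidates = find_mask_candidates(sentence)

pairs = []
for start, end in candidates:
    masked = ' '.join(sentence[:start] + [MASK] + sentence[end + 1:])
    phrase = ' '.join(sentence[start:end + 1])
    pairs.append((masked, phrase))
print(pairs)
```

The lone word `data` at index 2 starts a run but is rejected, because nothing is left once connection tokens are trimmed. Only the capitalized run `Common Core of Data` survives and is replaced by the mask.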
    

## Predict :

In [None]:
pred_labels = []

pbar = tqdm(total = len(all_test_data))
for test_data in all_test_data:
    
    pred_bag = set()
    
    if len(test_data):
        texts, phrases = list(zip(*test_data))
        #print (texts, "+++++++++++++++++++++++++++++++++++++++++++++++",phrases)
        #print (phrases, "Phrases")
        mlm_pred = []
        
        for p_id in range(0, len(texts), PREDICT_BATCH):
            print (p_id)
            batch_texts = texts[p_id:p_id+PREDICT_BATCH]
            
            batch_pred = mlm(list(batch_texts), targets=[f' {DATASET_SYMBOL}', f' {NONDATA_SYMBOL}'])
            #print (batch_pred) # important
            if len(batch_texts) == 1:
                batch_pred = [batch_pred]
            
            mlm_pred.extend(batch_pred)
        
        for (result1, result2), phrase in zip(mlm_pred, phrases):
            print ("*************************************************")
            print (result1['score'], result2['score'])
            print (result1['token_str'], result2['token_str'])
            print (phrase, "Phrase")
            print ("*************************************************")
            if (result1['score'] > result2['score']*1.5 and result1['token_str'] == DATASET_SYMBOL) or\
               (result2['score'] > result1['score']*1.5 and result2['token_str'] == NONDATA_SYMBOL):
                pred_bag.add(clean_text(phrase))
    
    # filter labels by jaccard score 
    filtered_labels = []
    #print (pred_bag)
    
    for label in sorted(pred_bag, key=len, reverse=True):
        if len(filtered_labels) == 0 or all(jaccard_similarity(label, got_label) < 0.75 for got_label in filtered_labels):
            filtered_labels.append(label)
            
    pred_labels.append('|'.join(filtered_labels))
    pbar.update(1)
    #print (pred_labels)
    #print ("========================================================")
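
The final filtering step keeps longer labels first and drops any label that is too similar to one already kept. A small sketch with hypothetical predictions, using the same `jaccard_similarity` definition and the same 0.75 threshold as above:

```python
def jaccard_similarity(s1, s2):
    l1 = s1.split(" ")
    l2 = s2.split(" ")
    intersection = len(set(l1).intersection(l2))
    union = (len(l1) + len(l2)) - intersection
    return float(intersection) / union

# Hypothetical predictions for one paper
pred_bag = {
    "common core of data",
    "nces common core of data",
    "schools and staffing survey",
}

filtered_labels = []
for label in sorted(pred_bag, key=len, reverse=True):
    # Keep the label only if it is dissimilar to every label kept so far
    if all(jaccard_similarity(label, kept) < 0.75 for kept in filtered_labels):
        filtered_labels.append(label)

prediction_string = '|'.join(filtered_labels)
print(prediction_string)
```

Here `common core of data` shares 4 of 5 words with the longer `nces common core of data` (similarity 0.8), so only the longer variant is kept.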

In [None]:
pred_labels

In [None]:
final_predictions=[]
#for literal_match, mlm_pred in zip(literal_preds, pred_labels):
        #if literal_match!='' and mlm_pred not in literal_match:
         #   final_predictions.append(literal_match +'|'+mlm_pred)
        #else:
         #   if literal_match:
          #      final_predictions.append(literal_match)
           # else :
            #    final_predictions.append(mlm_pred)
            

            
#final_predictions
            
#output['PredictionString'] = final_predictions
output['PredictionString'] = pred_labels
output_file = output[['Id','cleaned_label','PredictionString']].copy()  # copy so score columns can be added safely
#output_file.to_csv('OutputFile.csv', index=False)
#print (output_file)

In [None]:
#print (output.cleaned_label)
print(output)

In [None]:
## EVALUATION METRICS

#1. Jaccard score (computed over sets of characters, not words)
def jaccard(str1, str2): 
    a = set(str1.lower()) 
    b = set(str2.lower())
    c = a.intersection(b)
    return round(float(len(c)) / (len(a) + len(b) - len(c)), 2) 



#2. Levenshtein Distance
def levenshtein_distance(string1, string2):

    # the Levenshtein distance between string1 and string2
    l_dist = Levenshtein.distance(string1, string2)
    
    return l_dist


#3. Cosine Similarity

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)


def cos_sim_driver_function(string1, string2):

    vector1 = text_to_vector(string1)
    vector2 = text_to_vector(string2)

    cosine = get_cosine(vector1, vector2)

    return round(cosine, 2)
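
A quick worked example may help when reading the scores. The sketch below reproduces the `jaccard` function and a compact equivalent of the cosine helpers on one hypothetical label and prediction pair. The Levenshtein distance is left out because it depends on the external `Levenshtein` package.

```python
import math
import re
from collections import Counter

def jaccard(str1, str2):
    # Character-level: set() over a string yields single characters
    a = set(str1.lower())
    b = set(str2.lower())
    c = a.intersection(b)
    return round(float(len(c)) / (len(a) + len(b) - len(c)), 2)

def cosine(str1, str2):
    # Word-count vectors, as in text_to_vector and get_cosine
    v1 = Counter(re.findall(r"\w+", str1))
    v2 = Counter(re.findall(r"\w+", str2))
    numerator = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm1 = math.sqrt(sum(n ** 2 for n in v1.values()))
    norm2 = math.sqrt(sum(n ** 2 for n in v2.values()))
    denominator = norm1 * norm2
    return round(numerator / denominator, 2) if denominator else 0.0

truth, pred = "adni", "adni data"  # hypothetical label and prediction
print(jaccard(truth, pred))
print(cosine(truth, pred))
```

The two strings share 4 of 6 distinct characters, so the character-level Jaccard score is 0.67, while a word-level Jaccard on the same pair would be 0.5. This is worth keeping in mind when comparing these numbers with the word-based `jaccard_similarity` used for filtering.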

In [None]:
output_file['jaccard_score'] = output_file.apply(lambda x: jaccard(x['cleaned_label'], x['PredictionString']),axis=1)
output_file['levenshtein_distance'] = output_file.apply(lambda x: levenshtein_distance(x['cleaned_label'], x['PredictionString']),axis=1)
output_file['cosine_similarity'] = output_file.apply(lambda x: cos_sim_driver_function(x['cleaned_label'], x['PredictionString']),axis=1)
    
    

In [None]:
#output_file[['Id', 'pub_title', 'dataset_title', 'dataset_label', 'cleaned_label',
 #      'PredictionString', 'jaccard_score',
  #     'hamming_distance', 'levenshtein_distance', 'cosine_similarity']]

In [None]:
print("Mean jaccard_score:",statistics.mean(output_file['jaccard_score']))
print("Mean levenshtein_distance:", statistics.mean(output_file['levenshtein_distance']))
print("Mean cosine_similarity:", statistics.mean(output_file['cosine_similarity']))


In [None]:

print(output_file[10:15])

In [None]:
#sample_submission["PredictionString"][1]

#'common core of data|nces common core of data|trends in international mathematics and science study|schools and staffing survey|integrated postsecondary education data system|ipeds|progress in international reading literacy study'

In [None]:
#y_eval= output_file['cleaned_label'].values
#predict_y= output_file['PredictionString'].values
#print(classification_report(y_eval, predict_y))