<h1 style='color:white; background:blue; border:0'><center>Show US the Data: Start Our Study</center></h1>

![](https://oerc.osu.edu/sites/oerc/themes/oerc/images/projects/coleridge.png)

This competition challenges data scientists to show how publicly funded data and evidence are used to serve science and society. Data, evidence, and science are critical if government is to address the many threats facing society (pandemics, climate change and coastal inundation, Alzheimer’s disease, child hunger) and to support science and innovation, increase food production, maintain biodiversity, and meet many other challenges. Yet much of the information about the data needed to inform evidence and science is locked inside publications.

Can natural language processing find the hidden-in-plain-sight data citations? Can machine learning find the link between the words used in research articles and the data referenced in the article?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

import tqdm
from tqdm.auto import tqdm as tqdmp
tqdmp.pandas()

# NLP
import unicodedata, string, re, nltk, json
from nltk.corpus import stopwords
from nltk.util import ngrams
from wordcloud import WordCloud, STOPWORDS
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.dates as mdates

# ignoring warnings
import warnings
warnings.simplefilter("ignore")

# Some settings for visualizations
sns.set(rc={'axes.facecolor':'black', 'figure.facecolor':'black', 
            'xtick.color': 'white', 'ytick.color': 'white', 
            'grid.color': 'white', 'axes.labelcolor': 'white',
            'figure.dpi': 150, 'grid.linestyle': ':', 'grid.alpha': .6,
            'font.family': 'fantasy'})

In [None]:
train = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
ss = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/sample_submission.csv')
train.head()

In [None]:
ss.head()

# EDA

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
# Note that ALL ground truth texts have been cleaned for matching purposes using the following code:
def text_cleaner(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def show_wordcloud(data, stop, mask = None, title = None, color = 'black'):
    """
    Function for creating wordclouds (with or without mask)
    """
    from wordcloud import WordCloud, ImageColorGenerator
    wordcloud = WordCloud(background_color = color,
                         stopwords = stop,
                         mask = mask,
                         max_words = 100,
                         scale = 3,
                         width = 4000, 
                         height = 2000,
                         collocations = False,
                         colormap = 'rainbow',
                         random_state = 1)
    
    wordcloud = wordcloud.generate(data)
    
    plt.figure(1, figsize = (16, 8), dpi = 300)
    plt.title(title, size = 15)
    plt.axis('off')
    if mask is None:
        plt.imshow(wordcloud, interpolation = "bilinear")
        plt.show()
    else:
        image_colors = ImageColorGenerator(mask)
        plt.imshow(wordcloud.recolor(color_func = image_colors), 
                   interpolation = "bilinear")
        plt.show()
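
To make the cleaning rule concrete, here is a small self-contained check. It repeats `text_cleaner` from the cell above so that it runs on its own; the sample string is made up for illustration and is not taken from the competition data.

```python
import re

def text_cleaner(txt):
    # Lowercase, replace every run of non-alphanumeric characters with one space, trim the ends
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

# Hypothetical dataset mention, invented for this example
sample = "National Education Longitudinal Study (NELS:88)"
cleaned_sample = text_cleaner(sample)
print(cleaned_sample)
```

Case, brackets and punctuation all disappear, so it seems sensible to pass any predicted text through the same function before comparing it with the ground truth.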

Let's first analyze the `pub_title` column of the training data.

In [None]:
title_cleaned = train['pub_title'].apply(lambda x: text_cleaner(x))

In [None]:
title_length = title_cleaned.str.len()

plt.title('Title length', size = 15, color = 'white')
sns.distplot(title_length, kde = False, color = 'blue', 
             hist_kws = dict(alpha = 1))
plt.xlabel('Length of title (symbols)')
plt.show()

In [None]:
title_words = title_cleaned.str.split().map(lambda x: len(x))

plt.title('Title words', size = 15, color = 'white')
sns.distplot(title_words, kde = False, color = 'blue', 
             hist_kws = dict(alpha = 1))
plt.xlabel('Length of title (words)')
plt.show()

In [None]:
title_word_len = title_cleaned.str.split().apply(lambda x: [len(i) for i in x]).map(lambda x: np.mean(x))

plt.title('Title words length', size = 15, color = 'white')
sns.distplot(title_word_len, kde = False, color = 'blue', 
             hist_kws = dict(alpha = 1))
plt.xlabel('Mean word length in title (symbols)')
plt.show()

In [None]:
words = title_cleaned.str.split().values.tolist()
title_corpus = [word for i in words for word in i]

title_counter = Counter(title_corpus)
title_most = title_counter.most_common()

stop = set(stopwords.words('english'))

title_top_words, title_top_words_count = [], []
for word, count in title_most[:100]:
    if word not in stop:
        title_top_words.append(word)
        title_top_words_count.append(count)

In [None]:
plt.title('TOP-10 title words', color = 'white', size = 15)
sns.barplot(y = title_top_words[:10], x = title_top_words_count[:10], 
            edgecolor = 'black', color = 'blue')
plt.show()

In [None]:
title_word_string = ' '.join(title_corpus)
show_wordcloud(title_word_string, stop)

## Extract all publication texts

Publications are provided in JSON format, broken up into sections with section titles. Let's extract all the texts with a little function.

In [None]:
def text_extractor(url_id):
    url = '../input/coleridgeinitiative-show-us-the-data/train/{}.json'.format(url_id)
    return ' '.join(pd.read_json(url).text)
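
If the section titles are needed later, the same file can be read section by section instead of being flattened. The sketch below works on an in-memory JSON string rather than a real file. The `text` key is the one used by `text_extractor`; the `section_title` key and the sample content are assumptions made for this example.

```python
import json

# Hypothetical publication with two sections; the key name "section_title" is an assumption
sample_json = '''[
  {"section_title": "Abstract", "text": "We study reading outcomes."},
  {"section_title": "Data", "text": "We use the Early Childhood Longitudinal Study."}
]'''

def join_sections(raw_json):
    # Same idea as text_extractor: concatenate the text of every section with spaces
    sections = json.loads(raw_json)
    return ' '.join(section['text'] for section in sections)

def sections_by_title(raw_json):
    # Keep the structure instead: map each section title to its text
    return {section['section_title']: section['text'] for section in json.loads(raw_json)}

full_text = join_sections(sample_json)
by_title = sections_by_title(sample_json)
print(full_text)
print(list(by_title))
```

Keeping the titles makes it possible to look only at selected sections, at the cost of a slightly more complex data structure.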

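Note that some training documents appear in multiple rows, which indicates that they mention multiple datasets. Duplicate `Id` values are therefore dropped before the texts are extracted, so each publication is read only once.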

In [None]:
train_texts = train.drop_duplicates(subset = ['Id'])['Id'].progress_apply(lambda x: text_extractor(x))
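
The effect of repeated `Id` values can be seen on a toy frame. This is only a sketch: the `dataset_label` column name and all values are assumptions chosen for illustration, not rows from `train.csv`.

```python
import pandas as pd

# Toy stand-in for the training table; "dataset_label" is an assumed column name
toy = pd.DataFrame({
    'Id': ['doc-a', 'doc-a', 'doc-b'],
    'dataset_label': ['Dataset One', 'Dataset Two', 'Dataset One'],
})

# Number of rows (that is, labelled mentions) per publication
rows_per_doc = toy.groupby('Id').size()

# All labels of a publication collected into one pipe-separated string
labels_per_doc = toy.groupby('Id')['dataset_label'].agg(lambda s: '|'.join(sorted(set(s))))

# One row per publication, as used before extracting the texts
unique_docs = toy.drop_duplicates(subset=['Id'])

print(rows_per_doc)
print(labels_per_doc)
```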

In [None]:
train_texts

In [None]:
text_cleaned = train_texts.progress_apply(lambda x: text_cleaner(x))

In [None]:
text_cleaned

In [None]:
text_length = text_cleaned.str.len()

plt.title('Pub text length', size = 15, color = 'white')
sns.distplot(text_length, kde = False, color = 'blue', 
             hist_kws = dict(alpha = 1))
plt.xlabel('Length of pub text (symbols)')
plt.show()

In [None]:
text_words = text_cleaned.str.split().map(lambda x: len(x))

plt.title('Pub text words', size = 15, color = 'white')
sns.distplot(text_words, kde = False, color = 'blue', 
             hist_kws = dict(alpha = 1))
plt.xlabel('Length of pub text (words)')
plt.show()

In [None]:
text_word_len = text_cleaned.str.split().progress_apply(lambda x: [len(i) for i in x]).map(lambda x: np.mean(x))

plt.title('Pub text words length', size = 15, color = 'white')
sns.distplot(text_word_len, kde = False, color = 'blue', 
             hist_kws = dict(alpha = 1))
plt.xlabel('Mean word length in pub text (symbols)')
plt.show()

In [None]:
words = text_cleaned.str.split().values.tolist()
text_corpus = [word for i in words for word in i]

# Count words from the publication texts (not from the titles)
text_counter = Counter(text_corpus)
text_most = text_counter.most_common()

stop = set(stopwords.words('english'))

text_top_words, text_top_words_count = [], []
for word, count in text_most[:100]:
    if word not in stop:
        text_top_words.append(word)
        text_top_words_count.append(count)

In [None]:
plt.title('TOP-20 text words', color = 'white', size = 15)
sns.barplot(y = text_top_words[:20], x = text_top_words_count[:20], 
            edgecolor = 'black', color = 'blue')
plt.show()

The goal in this competition is not just to match known dataset strings but to generalize to datasets that have never been seen before using NLP and statistical techniques. Not all datasets have been identified in train, but you have been provided enough information to generalize.

The objective of the competition is to identify the mention of datasets within scientific publications. Your predictions will be short excerpts from the publications that appear to note a dataset.
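
A natural starting point, before any generalization, is plain string matching against labels that are already known. The sketch below is one possible baseline and not a method prescribed by the competition: `match_known_labels` is a hypothetical helper, and the text and labels are invented for the example. It reuses the cleaning rule shown earlier.

```python
import re

def text_cleaner(txt):
    # Same cleaning rule as applied to the ground truth texts
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def match_known_labels(text, known_labels):
    # Pad with spaces so that labels only match on whole-word boundaries
    cleaned_text = ' ' + text_cleaner(text) + ' '
    found = []
    for label in known_labels:
        cleaned_label = text_cleaner(label)
        if cleaned_label and cleaned_label not in found \
                and ' ' + cleaned_label + ' ' in cleaned_text:
            found.append(cleaned_label)
    return found

# Hypothetical publication sentence and label list
sample_text = "We analyse data from the Early Childhood Longitudinal Study (ECLS)."
known = ["Early Childhood Longitudinal Study", "Survey of Doctorate Recipients"]
matches = match_known_labels(sample_text, known)
print(matches)
```

By construction this finds only labels that are already in the list, which is exactly the limitation the competition asks participants to go beyond.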

Submissions are evaluated on a Jaccard-based FBeta score between predicted texts and ground truth texts, with Beta = 0 (an F0 or precision score). Multiple predictions are delineated with a pipe (|) character in the submission file.
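
The pipe-delimited format can be produced with a simple join. The helper names below are hypothetical, and dropping duplicates and empty strings is a choice made for this sketch rather than a stated requirement.

```python
def to_prediction_string(predictions):
    # Keep the first occurrence of each non-empty prediction, then join with '|'
    kept = []
    for prediction in predictions:
        prediction = prediction.strip()
        if prediction and prediction not in kept:
            kept.append(prediction)
    return '|'.join(kept)

def from_prediction_string(prediction_string):
    # Inverse operation: split on '|' and ignore empty pieces
    return [piece for piece in prediction_string.split('|') if piece]

# Hypothetical cleaned predictions for one publication
prediction_string = to_prediction_string(['adni', 'nces common core of data', 'adni'])
print(prediction_string)
```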

The following is Python reference code for the Jaccard score:

In [None]:
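def jaccard_similarity(s1, s2):
    # Compare the sets of space-separated tokens of the two strings
    l1 = set(s1.split(" "))
    l2 = set(s2.split(" "))
    intersection = len(l1.intersection(l2))
    # Size of the union; this identity only holds for sets, not for lists with repeated words
    union = (len(l1) + len(l2)) - intersection
    return float(intersection) / union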
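
To see how a Jaccard score could feed into an FBeta score, here is a rough sketch for a single publication. It is not the official scoring code: the greedy matching of each prediction to its best remaining ground truth and the `threshold` of 0.5 are assumptions made for this example, and all strings are invented. With `beta = 0` the formula reduces to precision, TP / (TP + FP).

```python
def jaccard(s1, s2):
    # Jaccard similarity on sets of whitespace-separated tokens
    a, b = set(s1.split()), set(s2.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fbeta_for_publication(predictions, truths, beta=0.0, threshold=0.5):
    # threshold and the greedy matching strategy are assumptions of this sketch
    unmatched = list(truths)
    tp = fp = 0
    for prediction in predictions:
        # Best remaining ground truth for this prediction, if any are left
        best = max(unmatched, key=lambda truth: jaccard(prediction, truth), default=None)
        if best is not None and jaccard(prediction, best) >= threshold:
            tp += 1
            unmatched.remove(best)
        else:
            fp += 1
    fn = len(unmatched)  # ground truths that no prediction matched
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0

# Hypothetical cleaned predictions and ground truths
preds = ['adni', 'national education longitudinal study']
truths = ['adni', 'survey of doctorate recipients', 'common core of data']
print(fbeta_for_publication(preds, truths, beta=0.0))
print(fbeta_for_publication(preds, truths, beta=1.0))
```

In this toy case one of the two predictions matches, so the `beta = 0` score equals the precision of one half, while a larger `beta` also penalizes the two ground truths that were missed.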

<h1 style='color:white; background:blue; border:0'><center>WORK IN PROGRESS...</center></h1>