## <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">1. Understanding the Competition</p>

![ci](https://oerc.osu.edu/sites/oerc/themes/oerc/images/projects/coleridge.png)

**Background** - The Coleridge Initiative is a not-for-profit organization, originally established at New York University, that is working with governments to ensure that data are more effectively used for public decision-making.

### Competition Objective

One-liner - We need to build an algorithm that finds which datasets a publication uses.

Description - In this competition, we need to develop an algorithm that automates the discovery of how scientific data are referenced in publications. Given the full text of scientific publications from numerous research areas, we'll identify the data sets that the publications' authors used in their work.

We have a labelled dataset (train set) that we'll use to develop our algorithm. The unlabelled dataset (test set) will be used to evaluate it. Let's look into the data to understand both the data and the competition better.

This type of automation will be very useful in showing which datasets are used in a particular type of publication or, in reverse, what the potential uses of a dataset are.

> It's also important to understand the evaluation process of this competition because it is a little different. See the [Evaluation Process📢(Jaccard,FBeta)](https://www.kaggle.com/pashupatigupta/ci-how-score-is-calculated-jaccard-fbeta) notebook for a detailed explanation of the evaluation process.

## <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">2. Understanding the Data</p>


In [None]:
import os
import re
import json
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid")
from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objects as go

import warnings
warnings.filterwarnings('ignore')

List of data files provided as input:

In [None]:
os.listdir('../input/coleridgeinitiative-show-us-the-data')

We have a train.csv file, plus train and test folders. Let's look into the folders first, then the files.

In [None]:
print("Files in train directory : \n")
print(os.listdir('../input/coleridgeinitiative-show-us-the-data/train')[:5])
print("\nFiles in test directory : \n")
print(os.listdir('../input/coleridgeinitiative-show-us-the-data/test')[:5])

We have some json files in both directories. Let's open one to find out what they contain -

In [None]:
with open('../input/coleridgeinitiative-show-us-the-data/train/f8b03c87-9d1a-4f20-b76b-cb6c69d447b2.json') as f:
    sample = json.load(f)

In [None]:
sample[:2]

Well, these json files are full-text versions of the publications. Let's see which sections a paper has -

In [None]:
for s in sample:
    print(s['section_title'])

Woah! So we have each and every detail of a publication available in json format. We can use these details to generate some features for model building. That covers the json files.
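
As a rough illustration of what "generating features" from these files could look like, here is a minimal sketch. It assumes each file is a list of sections with `section_title` and `text` keys (the two keys this notebook reads); the sample data and the `section_features` helper are made up for the example.

```python
# Toy stand-in for one parsed publication file: a list of sections,
# each with a 'section_title' and a 'text' field. The content is invented.
sample_sections = [
    {"section_title": "Abstract", "text": "We study reading scores."},
    {"section_title": "Data", "text": "We use the Early Childhood Longitudinal Study."},
    {"section_title": "Results", "text": "Scores improved over time."},
]

def section_features(sections):
    """Return a few simple per-publication features (hypothetical helper)."""
    titles = [s["section_title"] for s in sections]
    word_counts = [len(s["text"].split()) for s in sections]
    return {
        "n_sections": len(sections),
        "n_words": sum(word_counts),
        # Whether any section title mentions "data" (case-insensitive).
        "has_data_section": any("data" in t.lower() for t in titles),
    }

features = section_features(sample_sections)
print(features)
```

Features like these are only a starting point; whether they help depends on the model built later.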

Now let's look at the train.csv file, which I believe holds the label information.

In [None]:
train = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
train.sample(10)

In [None]:
train.columns

#### Columns

- Id (publication id) - note that there are multiple rows for some training documents, indicating multiple mentioned datasets
- pub_title - title of the publication (a small number of publications have the same title)
- dataset_title - the title of the dataset that is mentioned within the publication
- dataset_label - a portion of the text that indicates the dataset
- cleaned_label - the dataset_label, as passed through the clean_text function from the Evaluation page
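
To make the relationship between `dataset_label` and `cleaned_label` concrete, here is a small sketch using the same `clean_text` one-liner that appears later in this notebook. The input string is only an illustrative example, not a row taken from train.csv.

```python
import re

def clean_text(txt):
    # Lowercase, replace every run of non-alphanumeric characters
    # with a single space, then trim leading/trailing spaces.
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

raw_label = "National Education Longitudinal Study (NELS:88)"  # illustrative input
cleaned = clean_text(raw_label)
print(cleaned)
```

Punctuation and case are dropped, so differently formatted mentions of the same dataset can collapse to one cleaned string.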

So, we have the 'Id', 'pub_title', 'dataset_title', 'dataset_label' and 'cleaned_label' columns. The Id column matches the json filenames, so we can use it to look up the full text of any publication.

## <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">3. EDA and Data Preparation</p>

### 3.1 EDA

In [None]:
def basic_eda(df, row_limit=5, list_elements_limit=10):
    ### rows and columns
    print('Info : There are {} columns in the dataset'.format(df.shape[1]))
    print('Info : There are {} rows in the dataset'.format(df.shape[0]))
    
    print("==================================================")
    
    ## data types
    print("\nData type information of different columns")
    dtypes_df = pd.DataFrame(df.dtypes).reset_index().rename(columns={0:'dtype', 'index':'column_name'})
    cat_df = dtypes_df[dtypes_df['dtype']=='object']
    num_df = dtypes_df[dtypes_df['dtype']!='object']
    print('Info : There are {} categorical columns'.format(len(cat_df)))
    print('Info : There are {} numerical columns'.format(len(dtypes_df)-len(cat_df)))
    
    if list_elements_limit >= len(cat_df):
        print("Categorical columns : ", list(cat_df['column_name']))
    else:
        print("Categorical columns : ", list(cat_df['column_name'])[:list_elements_limit])
        
    if list_elements_limit >= len(num_df):
        print("Numerical columns : ", list(num_df['column_name']))
    else:
        print("Numerical columns : ", list(num_df['column_name'])[:list_elements_limit])
    
    #dtypes_df['dtype'].value_counts().plot.bar()
    display(dtypes_df.head(row_limit))
    
#     print("==================================================")
#     print("\nDescription of numerical variables")
    
#     #### Describing numerical columns
#     desc_df_num = df[list(num_df['column_name'])].describe().T.reset_index().rename(columns={'index':'column_name'})
#     display(desc_df_num.head(row_limit))
    
    print("==================================================")
    print("\nDescription of categorical variables")
    
    desc_df_cat = df[list(cat_df['column_name'])].describe().T.reset_index().rename(columns={'index':'column_name'})
    display(desc_df_cat.head(row_limit))
    
    return

In [None]:
basic_eda(train)

### Observations

- 1) There are duplicate Ids, meaning that some publications use multiple datasets. That's why the Id repeats.
- 2) The same is true of pub_title: a single publication can use multiple datasets.
- 3) There is NO one-to-one mapping between Id and pub_title, meaning there are cases where two different publications (from two different authors) have the same title. Well, interesting!!!
- 4) There are 45 dataset titles but 130 dataset labels, meaning that some datasets have multiple labels. We'll look into how these two are related.
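
Observations like these can be checked directly with `nunique` and `value_counts`. The sketch below uses a tiny made-up frame with the same column names as train.csv; the values are invented, so the counts do not match the real data.

```python
import pandas as pd

# Toy frame with the same column names as train.csv; the values are made up.
toy = pd.DataFrame({
    "Id": ["a1", "a1", "b2", "c3"],
    "pub_title": ["Paper X", "Paper X", "Paper Y", "Paper Y"],
    "dataset_title": ["Survey A", "Survey B", "Survey A", "Survey A"],
    "dataset_label": ["Survey A", "Survey B", "SA", "Survey A"],
})

# Number of distinct values per column (fewer than the row count means repeats).
unique_counts = toy.nunique()

# How many rows each Id has; more than one row means several dataset mentions.
rows_per_id = toy["Id"].value_counts()

print(unique_counts)
print(rows_per_id[rows_per_id > 1])
```

Running the same two calls on `train` should reproduce the numbers behind the observations above.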

### 3.1.1. Duplicate Id's and dataset labels

In [None]:
id_df = train[train['Id'] == '170113f9-399c-489e-ab53-2faf5c64c5bc'].drop_duplicates('dataset_title')
id_df[['Id', 'dataset_title']]

#### Note - As we can see, the Id "170113f9-399c-489e-ab53-2faf5c64c5bc" mentions multiple datasets. So, for each Id in test we'll need to predict all possible datasets used.
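
Since one Id can map to several datasets, it can help to collapse the training rows into one target string per Id. This is a minimal sketch on made-up data; it assumes the labels are joined with `|`, the same separator the submission cell in this notebook uses.

```python
import pandas as pd

# Made-up rows: Id 'a1' mentions two datasets, one of them twice.
toy = pd.DataFrame({
    "Id": ["a1", "a1", "a1", "b2"],
    "cleaned_label": ["survey b", "survey a", "survey a", "survey a"],
})

# One row per Id: unique cleaned labels, sorted, joined with '|'.
targets = (
    toy.groupby("Id")["cleaned_label"]
       .agg(lambda labels: "|".join(sorted(set(labels))))
)
print(targets)
```

The result has the same shape as a prediction string, which makes it easy to compare predictions against it per publication.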

### 3.1.2. Duplicate pub_title and dataset label

In [None]:
pub_df = train[train['pub_title'] == 'Science and Engineering Indicators 2008'].drop_duplicates('dataset_title')
pub_df[['pub_title', 'dataset_title']]

#### Note - As the output above shows, some publication titles are associated with multiple datasets.

### 3.1.3. Multiple publications having same title

There is NO one-to-one mapping between Id and pub_title: in some cases two different publications (from two different authors) have the same title.
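
A compact way to find such titles is to count distinct Ids per title with `groupby`. The sketch below runs on made-up data with the train.csv column names; it is an alternative to looping over every title, not the notebook's own cell.

```python
import pandas as pd

# Made-up rows: 'Paper Y' is shared by two different Ids.
toy = pd.DataFrame({
    "Id": ["a1", "a1", "b2", "c3"],
    "pub_title": ["Paper X", "Paper X", "Paper Y", "Paper Y"],
})

# Number of distinct publication Ids behind each title.
ids_per_title = toy.groupby("pub_title")["Id"].nunique()

# Titles that belong to more than one publication.
shared_titles = ids_per_title[ids_per_title > 1].index.tolist()
print(shared_titles)
```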

In [None]:
print("Five example titles (pub_title) where case 3.1.3 happens...\n")
i=0
for pt in train['pub_title'].unique():
    pub_df = train[train['pub_title'] == pt].drop_duplicates('Id')
    if pub_df.shape[0] > 1:
        print(pt)
        i = i+1
    if i==5:
        break

In [None]:
pub_df = train[train['pub_title'] == 'Characteristics and Production Costs of U.S. Hog Farms, 2004'].drop_duplicates('Id')
pub_df

### 3.1.4. Dataset titles and labels

A single dataset can have multiple labels.
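
The title-to-label relationship can also be summarised with a `groupby` instead of an explicit loop. This is a sketch on made-up data; the column names `label_count` and `label_list` are chosen for the example.

```python
import pandas as pd

# Made-up rows: 'Survey A' appears under two different labels.
toy = pd.DataFrame({
    "dataset_title": ["Survey A", "Survey A", "Survey A", "Survey B"],
    "dataset_label": ["Survey A", "SA", "SA", "Survey B"],
})

# Keep each (title, label) pair once, then group the labels by title.
grouped = (
    toy.drop_duplicates(["dataset_title", "dataset_label"])
       .groupby("dataset_title")["dataset_label"]
)
label_summary = pd.DataFrame({
    "label_count": grouped.nunique(),
    "label_list": grouped.apply(list),
}).reset_index()

# Titles that are referred to by more than one label.
multi_label = label_summary[label_summary["label_count"] > 1]
print(multi_label)
```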

In [None]:
unique_titles = train['dataset_title'].unique()
dup_title = []
count = []
dup_list = []
for ut in unique_titles:
    title_df = train[train['dataset_title'] == ut]
    tdf = title_df[['Id', 'dataset_title', 'dataset_label']].drop_duplicates('dataset_label')
    if tdf.shape[0] > 1:
        #print(ut)
        dup_title.append(ut)
        count.append(tdf.shape[0])
        dup_list.append(list(tdf['dataset_label']))
        
dup_df = pd.DataFrame({'dataset_title':dup_title, 'label_count':count, 'label_list':dup_list})

In [None]:
dup_df.set_index('dataset_title')['label_count'].sort_values(ascending=False).plot.barh(figsize=(12,18))
plt.title("Number of labels per dataset")
plt.xlabel('labels_count')
plt.show()

### 3.2 Data Preparation

We'll read the text of a publication from the json file and put it in the train dataframe
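
Because the same Id appears on several rows, one possible variation is to read each distinct file once and map the text back onto the rows. The sketch below is self-contained: it writes a toy json file to a temporary folder, and `read_publication_text` is a hypothetical helper, not the notebook's `get_text`.

```python
import json
import os
import tempfile

import pandas as pd

def read_publication_text(folder, pub_id):
    """Join the 'text' field of every section in <folder>/<pub_id>.json."""
    with open(os.path.join(folder, f"{pub_id}.json"), encoding="utf-8") as f:
        sections = json.load(f)
    return " ".join(section["text"] for section in sections)

# Self-contained demo: write one toy publication to a temporary folder.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a1.json"), "w", encoding="utf-8") as f:
    json.dump([{"section_title": "Abstract", "text": "First part."},
               {"section_title": "Data", "text": "Second part."}], f)

# Two rows share the same Id, as happens in train.csv.
toy = pd.DataFrame({"Id": ["a1", "a1"], "dataset_label": ["X", "Y"]})

# Read each distinct Id once, then map the result onto every row with that Id.
texts = {pub_id: read_publication_text(demo_dir, pub_id)
         for pub_id in toy["Id"].unique()}
toy["text"] = toy["Id"].map(texts)
print(toy)
```

On the real data this avoids parsing the same file repeatedly, which matters when an Id has many rows.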

In [None]:
def get_text(filename, test=False):
    if test:
        df = pd.read_json('../input/coleridgeinitiative-show-us-the-data/test/{}.json'.format(filename))
    else:
        df = pd.read_json('../input/coleridgeinitiative-show-us-the-data/train/{}.json'.format(filename))
    text = " ".join(list(df['text']))
    return text

In [None]:
train['text'] = train['Id'].apply(get_text)
train.sample(5)

Now that we have the text content of each publication let's do some wordcloud analysis.
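
A word cloud is essentially a picture of word frequencies, so the same information can be inspected as plain counts. This is a minimal sketch with `collections.Counter`; the stopword set and the texts are tiny made-up stand-ins, not the `STOPWORDS` list from the wordcloud package.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
DEMO_STOPWORDS = {"the", "of", "and", "in", "a", "to", "we"}

def top_words(texts, n=3):
    """Return the n most frequent non-stopword tokens across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(w for w in words if w not in DEMO_STOPWORDS)
    return counts.most_common(n)

demo_texts = [
    "We analyse the survey data and the census data.",
    "The survey covers education in the census.",
]
print(top_words(demo_texts))
```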

### 3.2.1 WordCloud of publication titles

In [None]:
words_in_titles = list(train.pub_title.str.split(expand=True).stack())

wordcloud = WordCloud(stopwords = STOPWORDS,
                      background_color = "white",
                      width = 3000,
                      height = 2000
                     ).generate(' '.join(words_in_titles))
plt.figure(1, figsize = (18, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

### 3.2.2 WordCloud of most frequent words in the texts

In [None]:
text = ' '.join(train['text'].sample(frac=0.3))
wordcloud = WordCloud(background_color='white', stopwords=STOPWORDS, width=2560, height=1440).generate(text)

fig_dim = (15, 15)
# plt.subplots returns a (figure, axes) tuple, so unpack both
fig, ax = plt.subplots(figsize=fig_dim, facecolor='w')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

In [None]:
# A text cleaning function
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

## <p style="background-color:skyblue; font-family:newtimeroman; font-size:180%; text-align:center">4. Baseline Model</p>

### Hypothesis building 

Instead of directly jumping into models like BERT, XLNet, GPT-3, let's think simple here. In any publication, the authors mention the names of the datasets used in their work. So with simple string matching we can find out whether a dataset is mentioned in a publication or not.

So, instead of inferring from the publication which datasets are used, we'll check whether each dataset from a known list is mentioned in the publication. For this we need a list of possible datasets, and we can get it from the training set. BUT this isn't what the competition demands. This is just a baseline hypothesis.
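
The lookup idea can be sketched in a few lines. This variant is an assumption-laden illustration rather than the submission cell itself: it matches on cleaned text, pads with spaces so a short label only matches as a whole word, and drops duplicates. `match_known_labels` and the sample strings are hypothetical.

```python
import re

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def match_known_labels(text, known_labels):
    """Return the cleaned known labels that occur as whole words in the cleaned text."""
    padded_text = f" {clean_text(text)} "
    found = []
    for label in known_labels:
        cleaned = clean_text(label)
        # Padding with spaces makes the match respect word boundaries,
        # and the 'not in found' check removes labels that clean to the same string.
        if cleaned and f" {cleaned} " in padded_text and cleaned not in found:
            found.append(cleaned)
    return "|".join(found)

known = ["ADNI", "Early Childhood Longitudinal Study", "adni"]  # illustrative list
text = "Data came from the Early Childhood Longitudinal Study (ECLS). Padnit is not a dataset."
prediction = match_known_labels(text, known)
print(prediction)
```

Plain substring matching would also fire on "adni" inside "Padnit"; whether the stricter whole-word rule scores better on the real metric is something to test, not a given.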

### 4.1 Preparing test set

In [None]:
test_files = os.listdir('../input/coleridgeinitiative-show-us-the-data/test')
test_files

In [None]:
test = pd.DataFrame({'Id':test_files})
test['Id'] = test['Id'].apply(lambda x : x.split('.')[0])
test['text'] = test['Id'].apply(get_text, test=True)
test

### 4.2 Checking the hypothesis on the training data (to see if it's even worth using)

In [None]:
is_present = []
for exp in train.iterrows():
    if exp[1]['cleaned_label'] in clean_text(exp[1]['text']):
        is_present.append(1)
    else:
        is_present.append(0)

In [None]:
train['present'] = is_present
train.head()

Let's check the accuracy

In [None]:
acc = (train['present'].sum() / len(train))*100
print("Accuracy on Training set : {}%".format(acc))

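Superb! This hypothesis gives 100% accuracy on the training set. That is not too surprising: each dataset_label is a portion of the publication's own text, so this check mainly confirms that the labels can be found again after cleaning. It says nothing about datasets that are missing from the training list.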

### 4.3 Making submission file

In [None]:
submission_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/sample_submission.csv')
ids = submission_df['Id']

In [None]:
datasets_titles = [x.lower() for x in set(train['dataset_title'].unique()).union(set(train['dataset_label'].unique()))]

labels = []
for index in submission_df['Id']:
    publication_text = test[test['Id'] == index].text.str.cat(sep='\n').lower()
    #print(publication_text)
    label = []
    for dataset_title in datasets_titles:
        if dataset_title in publication_text:
            label.append(clean_text(dataset_title))
    labels.append('|'.join(label))

submission_df['PredictionString'] = labels

In [None]:
submission = pd.DataFrame()
submission['Id'] = ids
submission['PredictionString'] = labels

In [None]:
submission

In [None]:
submission.to_csv('submission.csv', index=False)

#### We can see that string matching gives 100% accuracy on the train set. On submission it will probably give a good score as well. This model can definitely serve as a baseline.

#### Note - Accuracy isn't the actual evaluation metric. The actual evaluation metric is a Jaccard-similarity-based FBeta(0.5) score. I have prepared this [Notebook](https://www.kaggle.com/pashupatigupta/ci-how-score-is-calculated-jaccard-fbeta) that implements the evaluation metric and also evaluates the baseline on the actual metric.
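
To give a feel for how such a score behaves, here is a rough sketch and not the official implementation. It assumes a prediction counts as a hit when its word-level Jaccard similarity with a still-unmatched ground-truth label reaches a threshold (0.5 here is an assumption), and it computes F-beta from the resulting counts. All function names and sample strings are made up for the example.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def fbeta(tp, fp, fn, beta=0.5):
    """F-beta from counts; beta < 1 weights precision more than recall."""
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0

def score_one(predicted, truth, threshold=0.5):
    """Count TP/FP/FN for one publication; each truth label is matched at most once."""
    remaining = list(truth)
    tp = fp = 0
    for pred in predicted:
        best = max(remaining, key=lambda t: jaccard(pred, t), default=None)
        if best is not None and jaccard(pred, best) >= threshold:
            tp += 1
            remaining.remove(best)
        else:
            fp += 1
    return tp, fp, len(remaining)  # unmatched truth labels are false negatives

tp, fp, fn = score_one(
    predicted=["national education longitudinal study", "census data"],
    truth=["education longitudinal study", "adni"],
)
print(tp, fp, fn, round(fbeta(tp, fp, fn), 3))
```

With beta = 0.5 a wrong prediction costs more than a missed label, so a lookup baseline that over-predicts can lose more than its 100% train accuracy suggests.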

#### If you found it useful please consider appreciating it by an UPVOTE. Thanks!