![](https://bigdata.umd.edu/sites/bigdata.umd.edu/files/styles/500w/public/Coleridge%20Initiative.png?itok=jKqCpybk)

<h1> <center> 📜 Coleridge Initiative </center> </h1>
<h2> <center> 🔍 Complete EDA </center> </h2>


* [1. Introduction](#section-one)
* [2. Data Understanding](#section-two)
* [3. EDA for text publications](#section-three)
* [4. EDA for dataset titles](#section-four)
* [5. EDA for publication titles](#section-five)


<h2> <center> <a href="section-one"> 1. Introduction </a> </center> </h2>


> 📑 Context : This competition challenges data scientists to show how publicly funded data are used to serve science and society. Evidence through data is critical if government is to address the many threats facing society, including pandemics, climate change, Alzheimer’s disease, child hunger, increasing food production, maintaining biodiversity, and addressing many other challenges. Yet much of the information about data necessary to inform evidence and science is locked inside publications.

> In this competition, you'll use natural language processing (NLP) to automate the discovery of how scientific data are referenced in publications. Utilizing the full text of scientific publications from numerous research areas gathered from CHORUS publisher members and other sources, you'll identify data sets that the publications' authors used in their work.

> 📌 Goal : The objective of the competition is to identify the mention of datasets within scientific publications. 

> Challenges : 
* It is an unsupervised task: the test set can contain datasets other than those present in the train folder.
* Some dataset labels are identical in the ground truth.

#### Libraries 📚

In [None]:
!pip install textstat

#Basics
import numpy as np
import pandas as pd
import glob
import seaborn as sn
import matplotlib.pyplot as plt
from collections import defaultdict
import warnings
import gc

# NLP libraries
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer
from textstat import flesch_reading_ease

<h2> <center> <a href="section-two"> 2. Data understanding </a> </center> </h2>

* `train.csv` - labels and metadata for the training set, drawn from the scientific publications in the train folder
* `train` - the full text of the training set's publications in JSON format, broken into sections with section titles
* `test` - the full text of the test set's publications in JSON format, broken into sections with section titles
* `sample_submission.csv` - a sample submission file in the correct format

In [None]:
DIR_TRAIN = "../input/coleridgeinitiative-show-us-the-data/train/"
DIR_TEST = "../input/coleridgeinitiative-show-us-the-data/test/"

DIR_TRAIN_CSV = "../input/coleridgeinitiative-show-us-the-data/train.csv"
train_csv = pd.read_csv(DIR_TRAIN_CSV)
warnings.filterwarnings("ignore")

In [None]:
train_csv.head(5)

<h3>  2.1 Data description </h3>

The train_csv file contains five columns : 

`Id` - publication id; note that there are multiple rows for some training documents, indicating multiple mentioned datasets ;

`pub_title` - title of the publication (a small number of publications have the same title) ;

`dataset_title` - the title of the dataset that is mentioned within the publication ;

`dataset_label` - a portion of the text that indicates the dataset ;

`cleaned_label` - the dataset_label, as passed through the clean_text function from the Evaluation page
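As a rough illustration of how a `cleaned_label` relates to a `dataset_label`, here is a minimal sketch of a normalization step. It assumes the cleaning lowercases the text and replaces every run of non-alphanumeric characters with a single space. `normalize_label` is a hypothetical name and the input string is only illustrative; the authoritative `clean_text` definition is the one on the competition's Evaluation page.

```python
import re

def normalize_label(label):
    # Assumption: lowercase, then collapse every run of characters that are
    # not ASCII letters or digits into one space, and trim both ends.
    return re.sub(r"[^A-Za-z0-9]+", " ", str(label).lower()).strip()

# Illustrative label string, not taken from train.csv.
example = normalize_label("National Education Longitudinal Study (NELS:88)")
print(example)
```

Normalizing both predictions and labels in the same way makes matching insensitive to case and punctuation.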

I'm adding a `text` column for each row corresponding to the full text : 

In [None]:
train_csv['text'] = train_csv.apply(lambda x : pd.read_json(DIR_TRAIN + x['Id'] + ".json")['text'].str.cat(sep=' '), axis = 1)
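
To make the concatenation concrete, here is a small sketch on an in-memory document instead of a file. It assumes each publication is a JSON list of sections with a `section_title` and a `text` field. The `text` field is used by the cell above, while the `section_title` name and the sample content are assumptions made for illustration.

```python
import json

# Hypothetical publication with two sections (made-up content).
raw = json.dumps([
    {"section_title": "Abstract", "text": "We study reading scores."},
    {"section_title": "Data", "text": "We use a national survey."},
])

sections = json.loads(raw)
# Same idea as .str.cat(sep=' '): join the section texts with a single space.
full_text = " ".join(section["text"] for section in sections)
print(full_text)
```

Because the join discards section boundaries, any analysis that needs them (for example, looking only at a "Data" section) would have to work on the sections before this step.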

In [None]:
train_csv.describe()

<h3> 2.2 Publication Ids with multiple dataset titles  </h3>

> **Information** : Some publication Ids mention multiple dataset titles. How many are there ?


In [None]:
group_pub_dataset_title = train_csv.groupby('Id').count()[['dataset_title']].sort_values(by = "dataset_title", ascending = False)
id_multiple_dataset = group_pub_dataset_title[group_pub_dataset_title['dataset_title'] >1][['dataset_title']].reset_index()

In [None]:
plt.figure(figsize=(16, 6))
sn.barplot(x = id_multiple_dataset['dataset_title'].iloc[:20],
          y  = id_multiple_dataset['Id'].iloc[:20])

plt.title("Number of dataset titles per publication Id", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("")
plt.xlabel("Count", fontsize=14)

For example, the publication Id "84ed3c4c-f57b-440c-8062-b8dff66a8421" appears twice in train_csv, with different dataset titles:

In [None]:
train_csv[train_csv.duplicated(subset=['Id'])]
train_csv[train_csv['Id'] == "84ed3c4c-f57b-440c-8062-b8dff66a8421"]
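
The grouping logic can be checked on a toy table. This is a minimal sketch with made-up ids and titles standing in for `train_csv`; it uses `nunique()` rather than `count()`, which is a design choice so that a repeated (Id, title) row would not be counted twice.

```python
import pandas as pd

# Toy stand-in for train_csv (made-up ids and titles).
toy = pd.DataFrame({
    "Id": ["a", "a", "b", "c", "c", "c"],
    "dataset_title": ["D1", "D2", "D1", "D1", "D2", "D3"],
})

# Number of distinct dataset titles per publication id.
titles_per_id = toy.groupby("Id")["dataset_title"].nunique()

# Keep only publications that mention more than one dataset title.
multi = titles_per_id[titles_per_id > 1].sort_values(ascending=False)
print(multi)
```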

<h3> 2.3 Publication titles with multiple dataset titles  </h3>

> **Information** : Some publication titles mention multiple dataset titles. How many are there ?


In [None]:
group_pub_dataset_title = train_csv.groupby('pub_title').count()[['dataset_title']].sort_values(by = "dataset_title", ascending = False)
pub_title_multiple_dataset = group_pub_dataset_title[group_pub_dataset_title['dataset_title'] >1][['dataset_title']].reset_index()

In [None]:
plt.figure(figsize=(16, 6))
sn.barplot(x = pub_title_multiple_dataset['dataset_title'].iloc[:20],
          y  = pub_title_multiple_dataset['pub_title'].iloc[:20])

plt.title("Number of dataset titles per publication title", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("")
plt.xlabel("Count", fontsize=14)

<h3> 2.4 Publication titles with different publication Ids </h3>

> **Information** : Some publication titles are shared by different publication Ids, meaning that two different publications have the same title!

Here are the first five publication titles that have more than one publication Id. There are 45 in total.

In [None]:
group_pub_title = train_csv.drop_duplicates("Id").groupby('pub_title').count()
group_pub_title[group_pub_title['Id'] >1][['Id']].head(5)

For example, for the publication title "A quantitative examination of lightning as a predictor of peak winds in tropical cyclones":

In [None]:
train_csv[train_csv['pub_title'] == "A quantitative examination of lightning as a predictor of peak winds in tropical cyclones"]

<h3> 2.5 Dataset titles with different dataset labels  </h3>

> **Information** : How many dataset labels are there per dataset title ?

In [None]:
dataset_title_multiple_label = train_csv.drop_duplicates('dataset_label').groupby('dataset_title').count()[['dataset_label']].sort_values(by = 'dataset_label', ascending = False).reset_index()

In [None]:
plt.figure(figsize=(16, 6))
sn.barplot(y = dataset_title_multiple_label['dataset_title'].iloc[:20],
          x  = dataset_title_multiple_label['dataset_label'].iloc[:20])

plt.title("Number of dataset labels per dataset title", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("")
plt.xlabel("Count", fontsize=14)
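
One subtlety when counting labels per title: de-duplicating on `dataset_label` across the whole table keeps a label only for the first title it appears under. The sketch below uses made-up titles and labels, and assumes (hypothetically) that one label is shared by two titles, to compare that approach with counting distinct labels inside each title.

```python
import pandas as pd

# Made-up data: the label "SA" appears under both titles.
toy = pd.DataFrame({
    "dataset_title": ["Survey A", "Survey A", "Survey B", "Survey B"],
    "dataset_label": ["SA", "Survey A", "SA", "Survey B"],
})

# Global de-duplication keeps the shared label "SA" only for the first title.
after_global_dedup = (toy.drop_duplicates("dataset_label")
                         .groupby("dataset_title")["dataset_label"].count())

# Counting distinct labels inside each title keeps "SA" for both titles.
per_title = toy.groupby("dataset_title")["dataset_label"].nunique()

print(after_global_dedup.to_dict())
print(per_title.to_dict())
```

If no label is shared between titles, both approaches give the same counts.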

<h2> <center> <a href="section-three"> 3. EDA for text publications </a> </center> </h2>
<h3> 3.1 Number of words in text publications  </h3>

> **Information** : How many words are there in texts ?

I took a sample of 1,000 texts, representative of the distribution of the number of words per text. Some texts have more than 80,000 words, so on the full data the distribution around the mean is hard to see in the plot.

In [None]:
#train_csv['text_splitted'] = train_csv['text'].str.split()
#train_csv['nb_words'] = train_csv['text_splitted'].apply(len)

In [None]:
# Count words (whitespace-separated tokens), not characters, for each unique text
nb_words = pd.Series(train_csv['text'].unique()).str.split().str.len()
# Plot a random sample of at most 1000 texts
nb_words_sample = nb_words.sample(n=min(1000, len(nb_words)), random_state=0)

plt.figure(figsize=(16, 6))
sn.distplot(nb_words_sample, kde=True)

plt.title("Sample of the distribution of number of words by text", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Density")
plt.xlabel("Number of words", fontsize=14)


On average, each text has about 5,000 words. The distribution is right-skewed: many texts also have between 5,000 and 8,000 words.
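
When counting words, it is easy to mix up the length of a string (characters) with the number of tokens it contains. A quick sketch on a single sentence shows the difference; splitting on whitespace is an assumption about what counts as a word here.

```python
text = "Publicly funded data are used to serve science and society."

n_chars = len(text)          # length of the string in characters
n_words = len(text.split())  # number of whitespace-separated tokens

print(n_chars, n_words)
```

On a pandas Series, the word count corresponds to `.str.split().str.len()`, whereas `.apply(len)` on raw strings gives the character count.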

<h3> 3.2 Mean word length in text publications</h3>

> **Information** : What is the mean word length in the texts ?

In [None]:
#train_csv['avg_length_word'] = train_csv['text_splitted'].apply(lambda x : [len(i) for i in x]).map(lambda x: np.mean(x))

In [None]:
plt.figure(figsize=(16, 6))

sn.distplot(pd.Series(train_csv['text'].unique()).apply(lambda x : x.split()).apply(lambda x : [len(i) for i in x]).map(lambda x: np.mean(x)), kde=True)

plt.title("Average word length in text", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Density")
plt.xlabel("Average word length", fontsize=14)

Are there really words around 49 characters long?
What are these words?

In [None]:
long_word_length = pd.Series(train_csv['text'].unique()).apply(lambda x : x.split())
def get_long_length(row):
    for x in row:
        if len(x)>40:
            return x
long_word_length.apply(get_long_length).unique()

These are just web addresses, plus one token made up of many stars.
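
To check this kind of observation more systematically, long tokens can be flagged as URL-like or not. This is a minimal sketch: the `long_tokens` helper, the URL pattern, and the sample string are all assumptions made for illustration. Unlike the cell above, it returns every long token rather than only the first one per text.

```python
import re

# Assumption: a token is "URL-like" if it starts with http(s):// or www.
URL_PATTERN = re.compile(r"^(https?://|www\.)", re.IGNORECASE)

def long_tokens(text, min_len=40):
    # Return every whitespace-separated token longer than min_len,
    # paired with a flag telling whether it looks like a URL.
    return [(tok, bool(URL_PATTERN.match(tok)))
            for tok in text.split() if len(tok) > min_len]

# Made-up sample containing one long URL and one run of stars.
sample = ("See https://example.org/a/very/long/path/to/some/dataset/page.html "
          "and " + "*" * 45 + " for details.")
found = long_tokens(sample)
for token, is_url in found:
    print(is_url, token)
```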

<h3> 3.3 Most frequent words in text publications </h3>

> **Information** : What are the most frequent words in text publications ?

In [None]:
# Use a separate name so the imported nltk `stopwords` module is not overwritten,
# and a set so that membership tests are fast
stop_words = set(stopwords.words('english'))

def preprocess(sentence):
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    cleantext = re.sub(r'<.*?>', '', sentence)    # strip HTML tags
    rem_url = re.sub(r'http\S+', '', cleantext)   # strip URLs
    rem_num = re.sub(r'[0-9]+', '', rem_url)      # strip digits
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    # keep words longer than 2 characters that are not stopwords
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    return " ".join(filtered_words)


In [None]:
train_csv['clean_text'] = train_csv['text'].map(lambda s:preprocess(s))
mostwords_in_text=defaultdict(int)
def get_mostwords_in_text(row):
    for word in row.split():
        mostwords_in_text[word] += 1
pd.Series(train_csv['clean_text'].unique()).apply(get_mostwords_in_text)
mostwords_in_text = dict(sorted(mostwords_in_text.items(), key=lambda x: x[1], reverse = True))
mostwords_in_text = pd.DataFrame.from_dict(mostwords_in_text, orient = 'index').reset_index()
mostwords_in_text.columns = ['mostword', 'count']

In [None]:
plt.figure(figsize=(16, 6))
sn.barplot(x = mostwords_in_text['count'].iloc[:20], 
           y = mostwords_in_text['mostword'].iloc[:20])

plt.title("Most frequent words in text", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Words")
plt.xlabel("Count", fontsize=14)
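
The same word counting can be written more compactly with `collections.Counter` from the standard library. This is a sketch of an alternative on two made-up, already-cleaned texts, not a change to the cells above.

```python
from collections import Counter

# Made-up, already-cleaned texts.
clean_texts = ["data survey education data", "survey data health"]

word_counts = Counter()
for text in clean_texts:
    # update() adds one count per token in the list.
    word_counts.update(text.split())

# most_common(k) returns the k most frequent (word, count) pairs, sorted.
top_words = word_counts.most_common(2)
print(top_words)
```

`most_common` returns the pairs already sorted by count, so no separate sorting step is needed.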

<h3> 3.4 Ngrams (bigram and trigram) for text publications </h3>

> **Information** : N-grams are simply contiguous sequences of n words, for example “riverbank” or “The three musketeers”. If the number of words is two, it is called a bigram; for three words it is called a trigram, and so on. Looking at the most frequent n-grams can give us a better understanding of the context in which a word was used.
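
As a small illustration of the definition, here is a sketch that builds n-grams from one sentence in plain Python. The `ngrams` helper is a hypothetical name, and whitespace tokenization is an assumption.

```python
def ngrams(text, n):
    # Slide a window of n tokens over the whitespace-separated words.
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the three musketeers ride again"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence of k words yields k - n + 1 n-grams, so this five-word sentence gives four bigrams and three trigrams.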

In [None]:
def get_top_ngram(corpus, n=1):
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) 
                  for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:10]

In [None]:
bigram_text = get_top_ngram(train_csv['clean_text'].unique(), 2)
bigram_in_text = pd.DataFrame.from_dict(dict(bigram_text), orient = 'index').reset_index()
bigram_in_text.columns = ['bigram', 'count']

plt.figure(figsize=(16, 6))
sn.barplot(x = bigram_in_text['count'].iloc[:20], 
           y = bigram_in_text['bigram'].iloc[:20])

plt.title("Bigrams in text", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Bigrams")
plt.xlabel("Count", fontsize=14)

In [None]:
"""
trigram_text = get_top_ngram(train_csv['clean_text'].unique(), 3)
trigram_in_text = pd.DataFrame.from_dict(dict(trigram_text), orient = 'index').reset_index()
trigram_in_text.columns = ['trigram', 'count']

plt.figure(figsize=(16, 6))
sn.barplot(x = trigram_in_text['count'].iloc[:20], 
           y = trigram_in_text['trigram'].iloc[:20])

plt.title("Trigrams in text", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Trigrams")
plt.xlabel("Count", fontsize=14)

"""

<h3> 3.5 Wordclouds for text publications </h3>

> **Information** : A wordcloud is a great way to represent text data. The size and color of each word that appears in the wordcloud indicate its frequency or importance.

In [None]:
stopwords = set(STOPWORDS)
wordcloud = WordCloud(background_color='black',
                      stopwords=stopwords,
                      max_words=100,
                      max_font_size=30,
                      scale=3,
                      random_state=1)
   
# Join the texts explicitly: str() on a large array is abbreviated with '...'
wordcloud=wordcloud.generate(' '.join(train_csv['text'].unique()))

In [None]:
fig = plt.figure(1, figsize=(12, 12))
plt.axis('off')
plt.imshow(wordcloud)
plt.show()
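
One detail worth checking when a wordcloud is built from the string form of an array: NumPy abbreviates the printed form of arrays with more than 1000 elements (its default print threshold), so most entries never reach the wordcloud. The sketch below uses made-up strings to compare `str(array)` with an explicit join.

```python
import numpy as np

# 2000 made-up "texts", more than NumPy's default print threshold of 1000.
texts = np.array([f"text{i}" for i in range(2000)], dtype=object)

as_repr = str(texts)         # abbreviated: only the first and last few items
as_joined = " ".join(texts)  # every item, separated by spaces

print("..." in as_repr)
print("text1000" in as_repr, "text1000" in as_joined)
```

For small arrays, such as a few dozen unique titles, `str(array)` is not abbreviated, but it still includes brackets and quotes, so the join remains the cleaner input.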

<h3> 3.6 Text complexity in text publications </h3>

> **Information** : How readable (or difficult to read) is the text, and what type of reader can fully understand it ? Do we need a college degree to understand the message, or can a first-grader clearly see what the point is ?

In [None]:
train_csv['text_readable'] = train_csv['text'].apply(lambda x : flesch_reading_ease(x))


In [None]:
plt.figure(figsize=(16, 6))
sn.distplot(train_csv['text_readable'].iloc[5000:10000], kde=True)

plt.title("How readable are text publications ? (based on 5000 text samples)", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Density")
plt.xlabel("Text readability score", fontsize=14)

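The mean readability score is around 30. On the Flesch reading-ease scale, scores from 30 to 50 correspond to college-level text and scores below 30 to college-graduate level, so these scientific publications are difficult to read.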
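
For reference, the Flesch reading-ease score is computed from the average sentence length and the average number of syllables per word. The sketch below applies the standard formula to counts given by hand; `textstat` estimates these counts from the text itself, so its values may differ slightly. The function names and the example counts are assumptions made for illustration.

```python
def flesch_reading_ease_from_counts(n_words, n_sentences, n_syllables):
    # Standard Flesch reading-ease formula: higher scores mean easier text.
    return (206.835
            - 1.015 * (n_words / n_sentences)
            - 84.6 * (n_syllables / n_words))

def reading_level(score):
    # Conventional interpretation bands for the score.
    bands = [
        (90, "5th grade"),
        (80, "6th grade"),
        (70, "7th grade"),
        (60, "8th to 9th grade"),
        (50, "10th to 12th grade"),
        (30, "college"),
    ]
    for lower_bound, level in bands:
        if score >= lower_bound:
            return level
    return "college graduate"

# Made-up counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease_from_counts(100, 5, 150)
print(round(score, 3), reading_level(score))
```

Longer sentences and longer words both lower the score, which is why dense scientific prose tends to land in the lowest bands.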

<h2> <center> <a href="section-four"> 4. EDA for dataset titles </a> </center> </h2>
<h3> 4.1 Number of words in dataset titles  </h3>

> **Information** : How many words are there in dataset titles ?


In [None]:
plt.figure(figsize=(16, 6))
# Count words (whitespace-separated tokens), not characters, in each unique title
sn.distplot(pd.Series(train_csv['dataset_title'].unique()).str.split().str.len(), kde=True)

plt.title("Distribution of number of words by dataset titles", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Density")
plt.xlabel("Number of words", fontsize=14)


On average, dataset titles have about 5 words.

<h3> 4.2 Mean word length in dataset titles</h3>

> **Information** : What is the mean word length in dataset titles ?

In [None]:
#train_csv['avg_length_word_title'] = train_csv['title_splitted'].apply(lambda x : [len(i) for i in x]).map(lambda x: np.mean(x))
plt.figure(figsize=(16, 6))
sn.distplot(pd.Series(train_csv['dataset_title'].unique()).apply(lambda x : x.split()).apply(lambda x : [len(i) for i in x]).map(lambda x: np.mean(x)), kde=True)

plt.title("Average word length in dataset titles", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Density")
plt.xlabel("Average dataset title word length", fontsize=14)

<h3> 4.3 Most frequent words in dataset titles </h3>

> **Information** : What are the most frequent words in dataset titles ?

In [None]:
train_csv['clean_dataset_title'] = train_csv['dataset_title'].map(lambda s:preprocess(s))
mostwords_in_dataset_title=defaultdict(int)
def get_mostwords_in_text(row):
    for word in row.split():
        mostwords_in_dataset_title[word] += 1
pd.Series(train_csv['clean_dataset_title'].unique()).apply(get_mostwords_in_text)
mostwords_in_dataset_title = dict(sorted(mostwords_in_dataset_title.items(), key=lambda x: x[1], reverse = True))
mostwords_in_dataset_title = pd.DataFrame.from_dict(mostwords_in_dataset_title, orient = 'index').reset_index()
mostwords_in_dataset_title.columns = ['mostword', 'count']

In [None]:
plt.figure(figsize=(16, 6))
sn.barplot(x = mostwords_in_dataset_title['count'].iloc[:20], 
           y = mostwords_in_dataset_title['mostword'].iloc[:20])

plt.title("Most frequent words in dataset titles", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Words")
plt.xlabel("Count", fontsize=14)

<h3> 4.4 Ngrams (bigram and trigram) for dataset titles </h3>

> **Information** : N-grams are simply contiguous sequences of n words, for example “riverbank” or “The three musketeers”. If the number of words is two, it is called a bigram; for three words it is called a trigram, and so on. Looking at the most frequent n-grams can give us a better understanding of the context in which a word was used.

In [None]:
bigram_dataset_title = get_top_ngram(train_csv['dataset_title'].unique(), 2)
bigram_dataset_title = pd.DataFrame.from_dict(dict(bigram_dataset_title), orient = 'index').reset_index()
bigram_dataset_title.columns = ['bigram', 'count']

plt.figure(figsize=(16, 6))
sn.barplot(x = bigram_dataset_title['count'].iloc[:20], 
           y = bigram_dataset_title['bigram'].iloc[:20])

plt.title("Bigrams in dataset titles", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Bigrams")
plt.xlabel("Count", fontsize=14)

In [None]:
trigram_dataset_title = get_top_ngram(train_csv['dataset_title'].unique(), 3)
trigram_dataset_title  = pd.DataFrame.from_dict(dict(trigram_dataset_title), orient = 'index').reset_index()
trigram_dataset_title.columns = ['trigram', 'count']

plt.figure(figsize=(16, 6))
sn.barplot(x = trigram_dataset_title['count'].iloc[:20], 
           y = trigram_dataset_title['trigram'].iloc[:20])

plt.title("Trigrams in dataset titles", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Trigrams")
plt.xlabel("Count", fontsize=14)

<h3> 4.5 Wordclouds for dataset titles </h3>

> **Information** : A wordcloud is a great way to represent text data. The size and color of each word that appears in the wordcloud indicate its frequency or importance.

In [None]:
stopwords = set(STOPWORDS)
wordcloud = WordCloud(background_color='black',
                      stopwords=stopwords,
                      max_words=100,
                      max_font_size=30,
                      scale=3,
                      random_state=1)
   
# Join the titles explicitly so brackets and quotes from str(array) are not part of the input
wordcloud=wordcloud.generate(' '.join(train_csv['dataset_title'].unique()))

In [None]:
fig = plt.figure(1, figsize=(12, 12))
plt.axis('off')
plt.imshow(wordcloud)
plt.show()

<h3> 4.6 Text complexity in dataset titles </h3>

> **Information** : How readable (or difficult to read) is the dataset title, and what type of reader can fully understand it ? Do we need a college degree to understand the message, or can a first-grader clearly see what the point is ?

In [None]:
# Keep the scores in their own Series: there is one score per unique title, not one per row of train_csv
dataset_title_readable = pd.Series(train_csv['dataset_title'].unique()).apply(flesch_reading_ease)
plt.figure(figsize=(16, 6))
sn.distplot(dataset_title_readable, kde=True)

plt.title("How readable are dataset titles ? ", fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Density")
plt.xlabel("Dataset title readability score", fontsize=14)
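
A note on storing per-title scores: assigning a Series computed on unique titles to a DataFrame column aligns on the index, not on the title. The sketch below uses a made-up table and title length as a stand-in score to compare that with mapping each row to the score of its own title.

```python
import pandas as pd

# Made-up table: two unique titles spread over five rows.
df = pd.DataFrame({"dataset_title": ["ab", "ab", "abcd", "abcd", "abcd"]})

# Stand-in score (title length), one value per unique title, indexed 0 and 1.
unique_scores = pd.Series(df["dataset_title"].unique()).str.len()

# Assignment aligns on the index: rows 0 and 1 get values, the rest become NaN.
df["score_by_position"] = unique_scores

# Mapping title -> score gives every row the score of its own title.
score_by_title = dict(zip(df["dataset_title"].unique(), unique_scores))
df["score_by_title"] = df["dataset_title"].map(score_by_title)

print(df)
```

In the position-based column, row 1 receives the score of the second unique title even though its own title is the first one, so the values are misaligned as well as incomplete.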

<h3> Thank you for reading my notebook. I hope you enjoyed it. </h3>


TO BE CONTINUED...

**Credits :** 

* https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools (very good tutorial on NLP data analysis)