<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Imports-and-Defaults" data-toc-modified-id="Imports-and-Defaults-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Imports and Defaults</a></span></li></ul></li><li><span><a href="#Identification" data-toc-modified-id="Identification-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Identification</a></span><ul class="toc-item"><li><span><a href="#Importing-search-results" data-toc-modified-id="Importing-search-results-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Importing search results</a></span></li><li><span><a href="#Dropping-Records-without-a-DOI" data-toc-modified-id="Dropping-Records-without-a-DOI-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Dropping Records without a DOI</a></span></li><li><span><a href="#Dropping-Duplicates" data-toc-modified-id="Dropping-Duplicates-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Dropping Duplicates</a></span></li></ul></li><li><span><a href="#Screening" data-toc-modified-id="Screening-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Screening</a></span><ul class="toc-item"><li><span><a href="#Restricting-to-journal-articles" data-toc-modified-id="Restricting-to-journal-articles-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Restricting to journal articles</a></span></li><li><span><a href="#Removing-records-with-missing-Journal-or-Title" data-toc-modified-id="Removing-records-with-missing-Journal-or-Title-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Removing records with missing Journal or Title</a></span></li><li><span><a href="#Removing-incomplete-Google-Scholar-Journal-Names-that-can't-be-matched-and-completed" data-toc-modified-id="Removing-incomplete-Google-Scholar-Journal-Names-that-can't-be-matched-and-completed-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Removing incomplete Google Scholar Journal Names that can't be matched and completed</a></span></li><li><span><a 
href="#Preparation-for-Merging-Remaining-Records" data-toc-modified-id="Preparation-for-Merging-Remaining-Records-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Preparation for Merging Remaining Records</a></span><ul class="toc-item"><li><span><a href="#Resetting-very-short-abstracts" data-toc-modified-id="Resetting-very-short-abstracts-2.4.1"><span class="toc-item-num">2.4.1&nbsp;&nbsp;</span>Resetting very short abstracts</a></span></li><li><span><a href="#Dummy-for-Source" data-toc-modified-id="Dummy-for-Source-2.4.2"><span class="toc-item-num">2.4.2&nbsp;&nbsp;</span>Dummy for Source</a></span></li></ul></li><li><span><a href="#Building-Dataframes-of-Unique-Records" data-toc-modified-id="Building-Dataframes-of-Unique-Records-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Building Dataframes of Unique Records</a></span><ul class="toc-item"><li><span><a href="#Records-that-have-DOI-and-abstract" data-toc-modified-id="Records-that-have-DOI-and-abstract-2.5.1"><span class="toc-item-num">2.5.1&nbsp;&nbsp;</span>Records that have DOI and abstract</a></span></li><li><span><a href="#Records-that-have-DOI-but-no-abstract" data-toc-modified-id="Records-that-have-DOI-but-no-abstract-2.5.2"><span class="toc-item-num">2.5.2&nbsp;&nbsp;</span>Records that have DOI but no abstract</a></span></li></ul></li><li><span><a href="#Saving-New-Dataframes" data-toc-modified-id="Saving-New-Dataframes-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Saving New Dataframes</a></span></li></ul></li><li><span><a href="#Scraping" data-toc-modified-id="Scraping-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Scraping</a></span><ul class="toc-item"><li><span><a href="#Formatting-Vagaries" data-toc-modified-id="Formatting-Vagaries-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Formatting Vagaries</a></span></li><li><span><a href="#Scraping-Function" data-toc-modified-id="Scraping-Function-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Scraping 
Function</a></span></li></ul></li></ul></div>

### Imports and Defaults

In [1]:
from IPython.display import display, HTML, Markdown as md
display(HTML("""<style>.container { width:75% !important; } p, ul {max-width: 40em;} .rendered_html table { margin-left: 0; } .output_subarea.output_png { display: flex; justify-content: center;}</style>"""))

In [2]:
# Basics
import numpy as np 
import pandas as pd 

#String cleaning and processing
import re
import string

import os
pd.options.mode.chained_assignment = None

import json

In [3]:
# Visualisation
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams

%matplotlib inline
rcParams['figure.figsize'] = 15, 10
rcParams['axes.titlesize'] = 20
rcParams['axes.labelsize'] = 'large'
rcParams['xtick.labelsize'] = 10
rcParams['ytick.labelsize'] = 10
rcParams['lines.linewidth'] = 2
rcParams['font.size'] = 18

In [4]:
from langdetect import detect #language detection

## Identification

Publish or Perish software (via either the desktop GUI or the command line) was used to search four research databases (Google Scholar, Scopus, PubMed and CrossRef). The searches were run year by year to maximise the number of records returned, and the yearly results were then joined into four separate datasets, one per database.

### Importing search results

In [5]:
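def loadJSONfiles(path, source, startYear, endYear):
    """Load the yearly JSON exports for one source into a single list of records."""
    JSONobject = []
    for year in range(startYear, endYear + 1):
        # Files are expected to be named <source>-<year>.json inside `path`
        filename = os.path.join(path, f'{source}-{year}.json')
        if os.path.exists(filename):
            # utf-8-sig drops a leading byte-order mark if the file has one
            with open(filename, encoding='utf-8-sig') as f:
                newfile = json.load(f)
            JSONobject = JSONobject + newfile
        else:
            print("Can't find", filename)
    return JSONobject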

In [6]:
def get_df(jsonfile, source):
    df_dict = {'publication':[], 'title':[], 'authors':[],'doi':[], 
               'year':[], 'cites':[],'type':[], 'abstract':[], 'article_url':[], 'fulltext_url':[]}
    for record in jsonfile:
        df_dict['publication'].append(record.get('source'))
        df_dict['year'].append(record.get('year'))
        df_dict['doi'].append(record.get('doi'))
        df_dict['title'].append(record.get('title'))
        df_dict['abstract'].append(record.get('abstract'))
        df_dict['authors'].append(record.get('authors'))
        df_dict['cites'].append(record.get('cites'))
        df_dict['type'].append(record.get('type'))
        df_dict['article_url'].append(record.get('article_url'))
        df_dict['fulltext_url'].append(record.get('fulltext_url'))
    
    df = pd.DataFrame.from_dict(df_dict)
    df.fillna(np.nan, inplace=True)
    df['abstractLength'] = [len(x) if isinstance(x, str) else np.nan for x in df['abstract']]
    df[source]=np.int8(1)
    df.reset_index(drop=True, inplace=True)
    return df

In [7]:
def jsonFilesToDF(path, source, startYear, endYear):
    jsonFile = loadJSONfiles(path,source, startYear, endYear)
    df = get_df(jsonFile, source)
    return df

In [8]:
crossref = jsonFilesToDF('POPresults/','crossref', 2010, 2021)
gscholar = jsonFilesToDF('POPresults/','gscholar', 2010, 2021)
pubmed = jsonFilesToDF('POPresults/','pubmed', 2010, 2021)
scopus = jsonFilesToDF('POPresults/','scopus', 2010, 2021)

In [9]:
crossref.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     3909 non-null   object 
 1   title           4000 non-null   object 
 2   authors         3776 non-null   object 
 3   doi             4000 non-null   object 
 4   year            4000 non-null   int64  
 5   cites           4000 non-null   int64  
 6   type            4000 non-null   object 
 7   abstract        689 non-null    object 
 8   article_url     4000 non-null   object 
 9   fulltext_url    3280 non-null   object 
 10  abstractLength  689 non-null    float64
 11  crossref        4000 non-null   int8   
dtypes: float64(1), int64(2), int8(1), object(8)
memory usage: 347.8+ KB


In [10]:
gscholar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11968 entries, 0 to 11967
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     9968 non-null   object 
 1   title           11968 non-null  object 
 2   authors         11968 non-null  object 
 3   doi             2347 non-null   object 
 4   year            11155 non-null  float64
 5   cites           11968 non-null  int64  
 6   type            5280 non-null   object 
 7   abstract        10640 non-null  object 
 8   article_url     10760 non-null  object 
 9   fulltext_url    6669 non-null   object 
 10  abstractLength  10640 non-null  float64
 11  gscholar        11968 non-null  int8   
dtypes: float64(2), int64(1), int8(1), object(8)
memory usage: 1.0+ MB


In [11]:
pubmed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4497 entries, 0 to 4496
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     4497 non-null   object 
 1   title           4494 non-null   object 
 2   authors         4463 non-null   object 
 3   doi             4124 non-null   object 
 4   year            4494 non-null   float64
 5   cites           4497 non-null   int64  
 6   type            4497 non-null   object 
 7   abstract        3866 non-null   object 
 8   article_url     0 non-null      float64
 9   fulltext_url    0 non-null      float64
 10  abstractLength  3866 non-null   float64
 11  pubmed          4497 non-null   int8   
dtypes: float64(4), int64(1), int8(1), object(6)
memory usage: 391.0+ KB


In [12]:
scopus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2400 entries, 0 to 2399
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     2400 non-null   object 
 1   title           2400 non-null   object 
 2   authors         2398 non-null   object 
 3   doi             2358 non-null   object 
 4   year            2400 non-null   int64  
 5   cites           2400 non-null   int64  
 6   type            2399 non-null   object 
 7   abstract        0 non-null      float64
 8   article_url     549 non-null    object 
 9   fulltext_url    0 non-null      float64
 10  abstractLength  0 non-null      float64
 11  scopus          2400 non-null   int8   
dtypes: float64(3), int64(2), int8(1), object(6)
memory usage: 208.7+ KB


In [13]:
total = len(crossref)+len(scopus)+len(pubmed)+len(gscholar)
print('Records identified from databases:', total)

Records identified from databases: 22865


### Dropping Records without a DOI

In [14]:
crossref = crossref[crossref.doi.notna()]
gscholar = gscholar[gscholar.doi.notna()]
pubmed = pubmed[pubmed.doi.notna()]
scopus = scopus[scopus.doi.notna()]

In [15]:
oldtotal = total
total = len(crossref)+len(scopus)+len(pubmed)+len(gscholar)
print('Records left after dropping records without DOI:', total)
print('Records removed:', oldtotal-total)

Records left after dropping records without DOI: 12829
Records removed: 10036
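
The running tally printed after each step follows the same pattern, so it could be wrapped in a small helper to keep the messages consistent. A minimal sketch, in which `report_step` is a hypothetical name that does not appear in the notebook and plain lists stand in for the four dataframes:

```python
def report_step(label, frames, previous_total):
    """Print the number of records left after a step and return the new total.

    `frames` is any iterable of sized objects, for example the four dataframes.
    """
    total = sum(len(frame) for frame in frames)
    print(f'Records left after {label}:', total)
    print('Records removed:', previous_total - total)
    return total


# Toy usage: lists stand in for the dataframes
total = report_step('dropping records without DOI',
                    [[1, 2], [3], [], [4, 5, 6]],
                    previous_total=10)
```

Returning the new total lets each step pass it on as `previous_total` for the next one.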


In [16]:
crossref['doi'] = crossref.doi.str.lower() #standardise case of doi entries
gscholar['doi'] = gscholar.doi.str.lower() #standardise case of doi entries
pubmed['doi'] = pubmed.doi.str.lower() #standardise case of doi entries
scopus['doi'] = scopus.doi.str.lower() #standardise case of doi entries

crossref['publication'] = crossref.publication.str.lower() #standardise publication names
gscholar['publication'] = gscholar.publication.str.lower() #standardise publication names
pubmed['publication'] = pubmed.publication.str.lower() #standardise publication names
scopus['publication'] = scopus.publication.str.lower() #standardise publication names

### Dropping Duplicates

In [17]:
crossref.drop_duplicates(subset=['doi'], inplace=True)
gscholar.drop_duplicates(subset=['doi'], inplace=True)
pubmed.drop_duplicates(subset=['doi'], inplace=True)
scopus.drop_duplicates(subset=['doi'], inplace=True)

In [18]:
oldtotal = total
total = len(crossref)+len(scopus)+len(pubmed)+len(gscholar)
print('Records left after dropping duplicates:', total)
print('Records removed:', oldtotal-total)

Records left after dropping duplicates: 12412
Records removed: 417


In [19]:
crossref.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3993 entries, 0 to 3999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     3902 non-null   object 
 1   title           3993 non-null   object 
 2   authors         3769 non-null   object 
 3   doi             3993 non-null   object 
 4   year            3993 non-null   int64  
 5   cites           3993 non-null   int64  
 6   type            3993 non-null   object 
 7   abstract        687 non-null    object 
 8   article_url     3993 non-null   object 
 9   fulltext_url    3274 non-null   object 
 10  abstractLength  687 non-null    float64
 11  crossref        3993 non-null   int8   
dtypes: float64(1), int64(2), int8(1), object(8)
memory usage: 378.2+ KB


In [20]:
gscholar.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2317 entries, 0 to 11942
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     2256 non-null   object 
 1   title           2317 non-null   object 
 2   authors         2317 non-null   object 
 3   doi             2317 non-null   object 
 4   year            2308 non-null   float64
 5   cites           2317 non-null   int64  
 6   type            468 non-null    object 
 7   abstract        2303 non-null   object 
 8   article_url     2317 non-null   object 
 9   fulltext_url    1439 non-null   object 
 10  abstractLength  2303 non-null   float64
 11  gscholar        2317 non-null   int8   
dtypes: float64(2), int64(1), int8(1), object(8)
memory usage: 219.5+ KB


In [21]:
pubmed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3744 entries, 0 to 4496
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     3744 non-null   object 
 1   title           3743 non-null   object 
 2   authors         3727 non-null   object 
 3   doi             3744 non-null   object 
 4   year            3744 non-null   float64
 5   cites           3744 non-null   int64  
 6   type            3744 non-null   object 
 7   abstract        3254 non-null   object 
 8   article_url     0 non-null      float64
 9   fulltext_url    0 non-null      float64
 10  abstractLength  3254 non-null   float64
 11  pubmed          3744 non-null   int8   
dtypes: float64(4), int64(1), int8(1), object(6)
memory usage: 354.7+ KB


In [22]:
scopus.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2358 entries, 0 to 2399
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     2358 non-null   object 
 1   title           2358 non-null   object 
 2   authors         2356 non-null   object 
 3   doi             2358 non-null   object 
 4   year            2358 non-null   int64  
 5   cites           2358 non-null   int64  
 6   type            2357 non-null   object 
 7   abstract        0 non-null      float64
 8   article_url     545 non-null    object 
 9   fulltext_url    0 non-null      float64
 10  abstractLength  0 non-null      float64
 11  scopus          2358 non-null   int8   
dtypes: float64(3), int64(2), int8(1), object(6)
memory usage: 223.4+ KB


## Screening

### Restricting to journal articles

In [23]:
crossref.type.value_counts()

journal-article        3415
book-chapter            395
posted-content           53
dataset                  35
proceedings-article      32
book                     18
reference-entry          13
peer-review              13
component                 8
report                    4
other                     3
monograph                 3
reference-book            1
Name: type, dtype: int64

In [24]:
gscholar.type.value_counts()

HTML        401
CITATION     45
PDF          15
BOOK          7
Name: type, dtype: int64

In [25]:
pubmed.type.value_counts()

Journal Article                      2850
Case Reports                          458
Letter                                156
Comparative Study                      82
Editorial                              50
Evaluation Study                       36
English Abstract                       29
News                                   20
Comment                                14
Historical Article                     13
Clinical Trial                          7
Biography                               6
Published Erratum                       4
Review                                  4
Introductory Journal Article            3
Clinical Study                          3
Clinical Trial, Phase I                 1
Corrected and Republished Article       1
Clinical Trial, Phase III               1
Clinical Trial, Phase II                1
Controlled Clinical Trial               1
Congress                                1
Interview                               1
Clinical Trial Protocol           

In [26]:
scopus.type.value_counts()

Article             1792
Review               460
Letter                22
Note                  19
Short Survey          19
Conference Paper      17
Book Chapter          14
Editorial             10
Erratum                2
Book                   1
Data Paper             1
Name: type, dtype: int64

In [27]:
crossref = crossref[crossref.type =='journal-article']
gscholar = gscholar[gscholar.type.isin(['PDF', 'HTML'])]
pubmed = pubmed[pubmed.type =='Journal Article']
scopus = scopus[scopus.type =='Article']

In [28]:
oldtotal = total
total = len(crossref)+len(scopus)+len(pubmed)+len(gscholar)
print('Records left after dropping non-journal articles:', total)
print('Records removed:', oldtotal-total)

Records left after dropping non-journal articles: 8473
Records removed: 3939


### Removing records with missing Journal or Title

In [29]:
crossref = crossref[crossref.publication.notna() & crossref.title.notna()]
gscholar = gscholar[gscholar.publication.notna() & gscholar.title.notna()]
pubmed = pubmed[pubmed.publication.notna() & pubmed.title.notna()]
scopus = scopus[scopus.publication.notna() & scopus.title.notna()]

# Also drop any titles that contain no letters (e.g. titles composed of just numbers)
crossref = crossref[crossref.title.str.lower().str.islower()] 
gscholar = gscholar[gscholar.title.str.lower().str.islower()] 
pubmed = pubmed[pubmed.title.str.lower().str.islower()] 
scopus = scopus[scopus.title.str.lower().str.islower()] 
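
The `str.lower().str.islower()` filter above keeps a title only if it contains at least one cased character. The sketch below applies the same test to a few made-up titles with plain Python strings; `has_cased_letter` is a hypothetical helper name. Note that a title written entirely in a script without letter case would also fail the test, not just purely numeric titles.

```python
def has_cased_letter(title):
    """Return True if the title contains at least one cased character."""
    # Lower-casing first means any cased character makes islower() True;
    # strings with no cased characters at all give False.
    return title.lower().islower()


# Made-up titles for illustration
examples = ['Lyme disease in 2010', 'LYME', '2010', '12-34 (5)', '莱姆病']
for title in examples:
    print(repr(title), has_cased_letter(title))
```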

In [30]:
oldtotal = total
total = len(crossref)+len(scopus)+len(pubmed)+len(gscholar)
print('Records left after dropping records with missing journal or article names:', total)
print('Records removed:', oldtotal-total)

Records left after dropping records with missing journal or article names: 8457
Records removed: 16


In [31]:
crossref.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3414 entries, 4 to 3999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     3414 non-null   object 
 1   title           3414 non-null   object 
 2   authors         3305 non-null   object 
 3   doi             3414 non-null   object 
 4   year            3414 non-null   int64  
 5   cites           3414 non-null   int64  
 6   type            3414 non-null   object 
 7   abstract        589 non-null    object 
 8   article_url     3414 non-null   object 
 9   fulltext_url    2961 non-null   object 
 10  abstractLength  589 non-null    float64
 11  crossref        3414 non-null   int8   
dtypes: float64(1), int64(2), int8(1), object(8)
memory usage: 323.4+ KB


In [32]:
gscholar.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 402 entries, 28 to 11928
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     402 non-null    object 
 1   title           402 non-null    object 
 2   authors         402 non-null    object 
 3   doi             402 non-null    object 
 4   year            402 non-null    float64
 5   cites           402 non-null    int64  
 6   type            402 non-null    object 
 7   abstract        402 non-null    object 
 8   article_url     402 non-null    object 
 9   fulltext_url    402 non-null    object 
 10  abstractLength  402 non-null    float64
 11  gscholar        402 non-null    int8   
dtypes: float64(2), int64(1), int8(1), object(8)
memory usage: 38.1+ KB


In [33]:
pubmed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2849 entries, 0 to 4496
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     2849 non-null   object 
 1   title           2849 non-null   object 
 2   authors         2837 non-null   object 
 3   doi             2849 non-null   object 
 4   year            2849 non-null   float64
 5   cites           2849 non-null   int64  
 6   type            2849 non-null   object 
 7   abstract        2720 non-null   object 
 8   article_url     0 non-null      float64
 9   fulltext_url    0 non-null      float64
 10  abstractLength  2720 non-null   float64
 11  pubmed          2849 non-null   int8   
dtypes: float64(4), int64(1), int8(1), object(6)
memory usage: 269.9+ KB


In [34]:
scopus.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1792 entries, 4 to 2399
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     1792 non-null   object 
 1   title           1792 non-null   object 
 2   authors         1791 non-null   object 
 3   doi             1792 non-null   object 
 4   year            1792 non-null   int64  
 5   cites           1792 non-null   int64  
 6   type            1792 non-null   object 
 7   abstract        0 non-null      float64
 8   article_url     355 non-null    object 
 9   fulltext_url    0 non-null      float64
 10  abstractLength  0 non-null      float64
 11  scopus          1792 non-null   int8   
dtypes: float64(3), int64(2), int8(1), object(6)
memory usage: 169.8+ KB


### Removing incomplete Google Scholar Journal Names that can't be matched and completed

In [35]:
# DOI prefix: everything before the first '/'
gscholar['doiJournal'] = gscholar.doi.str.split('/').str[0]

In [36]:
incompleteJournalNames = [name for name in gscholar.publication.unique() if '…' in name]
completeJournalNames = [name for name in gscholar.publication.unique() if '…' not in name]

In [37]:
incompleteJournalNames

['bmc research …',
 'journal of …',
 'bmc infectious …',
 'jrsm short …',
 'otolaryngology–head and …',
 'arthritis research & …',
 'experimental and applied …',
 'international …',
 'molecular …',
 '… journal of rare …',
 'acta veterinaria …',
 'environmental health and preventive …',
 'parasites & …',
 'archivum immunologiae et …',
 '…',
 'european journal of …',
 'world journal of …',
 'bmc veterinary …',
 'journal of ophthalmic inflammation and …',
 'european journal of clinical microbiology & infectious …',
 'planta …',
 'climate change adaptation in developed …',
 '… england journal of …',
 'circulation: heart …',
 'evolution: education and …',
 'bmc …',
 'new frontiers of molecular …',
 'arthritis …',
 'bmc public …',
 'new england journal …',
 'journal of osteopathic …',
 'bmc medical …',
 'bmc family …',
 'theoretical biology and medical …',
 'indian journal of psychological …',
 'annals of the american …',
 'archives of …',
 'global pediatric …',
 'netherlands heart …',
 'jou

In [38]:
journalNamesDict = {}

In [39]:
# Map each complete journal name to the DOI prefix of its first record
for name in completeJournalNames:
    journalNamesDict[name] = gscholar[gscholar.publication == name].doiJournal.unique()[0]

In [40]:
len(gscholar[gscholar.publication.str.contains('…')])

248

In [41]:
for index, row in gscholar.iterrows():
    if '…' in row['publication']:
        name = row['publication']
        options = []
        if name.startswith('…') and name.endswith('…'):
            # Truncated at both ends: the fragment may sit anywhere in the full name
            newname = name[1:-1]
            if len(newname) > 1:
                for key in journalNamesDict:
                    if (newname in key) and (journalNamesDict[key] == row['doiJournal']):
                        options.append(key)
                # Only complete the name when exactly one candidate matches
                if len(options) == 1:
                    gscholar.loc[index, 'publication'] = options[0]
        elif name.startswith('…'):
            # Truncated at the start: the full name must end with the fragment
            newname = name[1:]
            for key in journalNamesDict:
                if key.endswith(newname) and (journalNamesDict[key] == row['doiJournal']):
                    options.append(key)
            if len(options) == 1:
                gscholar.loc[index, 'publication'] = options[0]
        elif name.endswith('…'):
            # Truncated at the end: the full name must start with the fragment
            newname = name[:-1]
            for key in journalNamesDict:
                if key.startswith(newname) and (journalNamesDict[key] == row['doiJournal']):
                    options.append(key)
            if len(options) == 1:
                gscholar.loc[index, 'publication'] = options[0]
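
The matching rule in the loop above can also be written as a standalone function, which makes it easier to try on individual names. This is a sketch only: `complete_journal_name` is a hypothetical helper, and the lookup table and its DOI prefixes are made up for illustration.

```python
def complete_journal_name(name, doi_prefix, known_journals):
    """Return the completed journal name, or None if there is no unique match.

    `known_journals` maps complete journal names to DOI prefixes.
    """
    if name.startswith('…') and name.endswith('…'):
        fragment = name[1:-1]
        if len(fragment) <= 1:
            return None
        matches = lambda key: fragment in key
    elif name.startswith('…'):
        fragment = name[1:]
        matches = lambda key: key.endswith(fragment)
    elif name.endswith('…'):
        fragment = name[:-1]
        matches = lambda key: key.startswith(fragment)
    else:
        return name  # not truncated, nothing to complete
    # A candidate must match the fragment and share the record's DOI prefix
    options = [key for key, prefix in known_journals.items()
               if matches(key) and prefix == doi_prefix]
    return options[0] if len(options) == 1 else None


# Hypothetical lookup table; the DOI prefixes are invented
known = {
    'bmc public health': '10.0001',
    'bmc infectious diseases': '10.0001',
    'journal of rheumatology': '10.0002',
}
print(complete_journal_name('bmc public …', '10.0001', known))  # unique match
print(complete_journal_name('bmc …', '10.0001', known))         # ambiguous
print(complete_journal_name('journal of …', '10.0009', known))  # prefix differs
```

Only the first call returns a completed name. The other two return `None`, which corresponds to the records that stay truncated and are later removed.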

In [42]:
# Number of records still truncated, which will be removed
len(gscholar[gscholar.publication.str.contains('…')])

118

In [43]:
gscholar=gscholar[~gscholar.publication.str.contains('…')]

In [44]:
gscholar.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 284 entries, 52 to 11928
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   publication     284 non-null    object 
 1   title           284 non-null    object 
 2   authors         284 non-null    object 
 3   doi             284 non-null    object 
 4   year            284 non-null    float64
 5   cites           284 non-null    int64  
 6   type            284 non-null    object 
 7   abstract        284 non-null    object 
 8   article_url     284 non-null    object 
 9   fulltext_url    284 non-null    object 
 10  abstractLength  284 non-null    float64
 11  gscholar        284 non-null    int8   
 12  doiJournal      284 non-null    object 
dtypes: float64(2), int64(1), int8(1), object(9)
memory usage: 29.1+ KB


In [45]:
gscholar.drop(columns='doiJournal', inplace=True)

In [46]:
oldtotal = total
total = len(crossref)+len(scopus)+len(pubmed)+len(gscholar)
print('Records left after dropping truncated Google Scholar journal names:', total)
print('Records removed:', oldtotal-total)

Records left after dropping truncated Google Scholar journal names: 8339
Records removed: 118


### Preparation for Merging Remaining Records

#### Resetting very short abstracts

Some abstracts appear truncated; the one very short PubMed abstract is an exception. An arbitrary threshold of 300 characters is therefore set, and Crossref, Google Scholar and Scopus abstracts shorter than this are reset to null. Hopefully these will be properly populated by the scraping process.
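
Because the 300-character threshold is arbitrary, it may help to see how many abstracts different cut-offs would reset. A small sketch on toy data, assuming the `abstractLength` column built earlier; `count_short_abstracts` is a hypothetical helper and the lengths are invented:

```python
import numpy as np
import pandas as pd


def count_short_abstracts(df, thresholds):
    """Return {threshold: number of abstracts shorter than it}."""
    lengths = df['abstractLength']
    # Comparisons with NaN are False, so missing abstracts are never counted
    return {t: int((lengths < t).sum()) for t in thresholds}


# Invented abstract lengths, including one missing abstract
toy = pd.DataFrame({'abstractLength': [120.0, 250.0, 310.0, 1500.0, np.nan]})
print(count_short_abstracts(toy, [200, 300, 400]))
```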

In [47]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
crossref.loc[crossref.abstractLength < 300, ['abstract', 'abstractLength']] = np.nan

In [48]:
gscholar.loc[gscholar.abstractLength < 300, ['abstract', 'abstractLength']] = np.nan

In [49]:
scopus.loc[scopus.abstractLength < 300, ['abstract', 'abstractLength']] = np.nan

#### Dummy for Source

In [50]:
# Datasets already contain a dummy variable column flagging where they were obtained from. E.g pubmed['pubmed']=1
# This aids matching columns in preparation for concatenation
pubmed[['gscholar','crossref','scopus']]=0
gscholar[['pubmed','crossref','scopus']]=0
crossref[['pubmed','gscholar','scopus']]=0
scopus[['pubmed','gscholar','crossref']]=0

### Building Dataframes of Unique Records

#### Records that have DOI and abstract

In [51]:
len(pubmed[pubmed.doi.notna() & pubmed.abstract.notna()])

2720

PubMed results are used as the starting point because their data are relatively complete. We take the 2720 PubMed records that have both a DOI and an abstract, then add the Crossref, Scopus and Google Scholar records that have an abstract and a DOI not already included. Using the DOI as a unique identifier prevents duplicates from entering the new collated dataset as we build it.
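
The DOI-based de-duplication described above amounts to an anti-join followed by a concatenation. A minimal sketch on toy data, assuming each input has already been de-duplicated by DOI; the helper name `add_new_dois` and the DOIs are hypothetical:

```python
import pandas as pd


def add_new_dois(base, candidates):
    """Append the rows of `candidates` whose DOI is not already in `base`."""
    new_rows = candidates[~candidates['doi'].isin(base['doi'])]
    return pd.concat([base, new_rows], ignore_index=True)


# Invented records: '10.1/b' appears in both sources
base = pd.DataFrame({'doi': ['10.1/a', '10.1/b'], 'source': ['pubmed', 'pubmed']})
extra = pd.DataFrame({'doi': ['10.1/b', '10.1/c'], 'source': ['crossref', 'crossref']})

merged = add_new_dois(base, extra)
print(merged)
```

The record that is already present keeps the values from `base`, so the order in which sources are added decides which copy survives.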

In [52]:
## Splitting Datasets by abstract and doi availability
crossrefToAdd = crossref[crossref.doi.notna() & crossref.abstract.notna()]
crossrefRemaining = crossref[~(crossref.doi.notna() & crossref.abstract.notna())]
gscholarToAdd = gscholar[gscholar.doi.notna() & gscholar.abstract.notna()]
gscholarRemaining = gscholar[~(gscholar.doi.notna() & gscholar.abstract.notna())]
scopusToAdd = scopus[scopus.doi.notna() & scopus.abstract.notna()]
scopusRemaining = scopus[~(scopus.doi.notna() & scopus.abstract.notna())]

In [53]:
# Concatenating unique-DOI records that contain abstracts
collated = pubmed[pubmed.doi.notna() & pubmed.abstract.notna()]
collated = pd.concat([collated, crossrefToAdd[~crossrefToAdd['doi'].isin(pubmed.doi)]])
collated = pd.concat([collated, scopusToAdd[~scopusToAdd['doi'].isin(collated.doi)]])
collated = pd.concat([collated, gscholarToAdd[~gscholarToAdd['doi'].isin(collated.doi)]])

In [54]:
#add language of abstracts, titles and journal names
collated['publicationLanguage'] = collated.publication.apply(detect)
collated['titleLanguage'] = collated.title.apply(detect)
collated['abstractLanguage'] = collated.abstract.apply(detect)

#reset index
collated.reset_index(drop = True, inplace=True)

In [55]:
collated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2962 entries, 0 to 2961
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   publication          2962 non-null   object 
 1   title                2962 non-null   object 
 2   authors              2958 non-null   object 
 3   doi                  2962 non-null   object 
 4   year                 2962 non-null   float64
 5   cites                2962 non-null   int64  
 6   type                 2962 non-null   object 
 7   abstract             2962 non-null   object 
 8   article_url          242 non-null    object 
 9   fulltext_url         210 non-null    object 
 10  abstractLength       2962 non-null   float64
 11  pubmed               2962 non-null   int64  
 12  gscholar             2962 non-null   int64  
 13  crossref             2962 non-null   int64  
 14  scopus               2962 non-null   int64  
 15  publicationLanguage  2962 non-null   o

#### Records that have DOI but no abstract

Next we build a dataframe of records that have unique DOIs but no abstract. This will be the list of abstracts we will attempt to scrape and add to the dataframe of existing abstracts.

In [56]:
scraped = pubmed[pubmed.doi.notna() & pubmed.abstract.isna()]
crossrefToAdd = crossrefRemaining[~(crossrefRemaining['doi'].isin(collated.doi) | crossrefRemaining['doi'].isin(scraped.doi))]
scraped = pd.concat([scraped, crossrefToAdd])
scopusToAdd = scopusRemaining[~(scopusRemaining['doi'].isin(collated.doi) | scopusRemaining['doi'].isin(scraped.doi))]
scraped = pd.concat([scraped, scopusToAdd])
gscholarToAdd = gscholarRemaining[~(gscholarRemaining['doi'].isin(collated.doi) | gscholarRemaining['doi'].isin(scraped.doi))]
scraped = pd.concat([scraped, gscholarToAdd])

In [57]:
#add language of abstracts, titles and journal names
scraped['publicationLanguage'] = scraped.publication.apply(detect)
scraped['titleLanguage'] = scraped.title.apply(detect)
scraped['abstractLanguage'] = np.nan

#Sort by number of times journal appears in dataset
scraped = scraped.iloc[scraped.groupby('publication').publication.transform('size').mul(-1).argsort(kind='mergesort')]

# reset index
scraped.reset_index(drop=True, inplace=True)
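
The sorting line in this cell is dense: `transform('size')` gives every row the number of rows that share its journal, multiplying by -1 makes the largest groups sort first, and the stable `mergesort` keeps the original order among ties. A small sketch of the same steps on toy data (the journal names and titles are placeholders):

```python
import pandas as pd

demo = pd.DataFrame({'publication': ['b', 'a', 'c', 'a', 'b', 'a'],
                     'title': ['t0', 't1', 't2', 't3', 't4', 't5']})

# Number of rows per journal, repeated on every row of that journal
group_size = demo.groupby('publication').publication.transform('size')

# Negate so the largest groups come first; mergesort is stable, so ties keep input order
order = group_size.mul(-1).argsort(kind='mergesort')
demo_sorted = demo.iloc[order.to_numpy()].reset_index(drop=True)

print(demo_sorted.publication.tolist())
```

Note that journals with the same count are not grouped together by this sort; their rows simply stay in input order.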

In [58]:
scraped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2601 entries, 0 to 2600
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   publication          2601 non-null   object 
 1   title                2601 non-null   object 
 2   authors              2494 non-null   object 
 3   doi                  2601 non-null   object 
 4   year                 2601 non-null   float64
 5   cites                2601 non-null   int64  
 6   type                 2601 non-null   object 
 7   abstract             0 non-null      object 
 8   article_url          2106 non-null   object 
 9   fulltext_url         1703 non-null   object 
 10  abstractLength       0 non-null      float64
 11  pubmed               2601 non-null   int64  
 12  gscholar             2601 non-null   int64  
 13  crossref             2601 non-null   int64  
 14  scopus               2601 non-null   int64  
 15  publicationLanguage  2601 non-null   o

In [59]:
oldtotal = total
total = len(collated) + len(scraped)
print('Records left after merging records:', total)
print("Records with abstracts:", len(collated))
print("Abstracts to seek to retrieve via scraping:", len(scraped))
print('Records removed:', oldtotal-total)

Records left after merging records: 5563
Records with abstracts: 2962
Abstracts to seek to retrieve via scraping: 2601
Records removed: 2776


In [60]:
scraped.publication.value_counts()

ticks and tick-borne diseases        58
clinical infectious diseases         52
parasites & vectors                  47
plos one                             45
médecine et maladies infectieuses    38
                                     ..
child & family social work            1
anatolian journal of psychiatry       1
arthritis care and research           1
eye (basingstoke)                     1
journal of rheumatology               1
Name: publication, Length: 1056, dtype: int64

### Saving New Dataframes

In [61]:
collated.to_csv('collated.csv')
scraped.to_csv('scraped.csv')

## Scraping

In [62]:
import requests                 # How Python gets the webpages
from bs4 import BeautifulSoup   # Creates structured, searchable object
import urllib                   # useful for cleaning/processing URLs

### Formatting Vagaries

Each of the main sources of articles (e.g. Science Direct, plos.org) tags its abstract text in a different way.

For example:
* Science Direct (Ticks and Tick-borne Diseases): ``<h2 class="section-title u-h3 u-margin-l-top u-margin-xs-bottom">Abstract</h2><div id="abst0005">``
* PLoS ONE: ``<h2>Abstract</h2><div class="abstract-content">``
* Vector-Borne and Zoonotic Diseases: ``<h2>Abstract</h2></div><div class="abstractSection abstractInFull">``
* Parasites & Vectors: ``<h2 class="c-article-section__title js-section-title js-c-reading-companion-sections-item" id="Abs1">Abstract</h2><div class="c-article-section__content" id="Abs1-content">``
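
Despite the differences, the four examples share one shape: an `Abstract` heading followed closely by a `<div>` whose first attribute identifies the abstract container. A minimal sketch of reading that attribute with plain string searches, using the four snippets listed above as input. The helper name `div_after_abstract` is an illustration and not part of the notebook:

```python
SNIPPETS = {
    'Science Direct': '<h2 class="section-title u-h3 u-margin-l-top u-margin-xs-bottom">Abstract</h2><div id="abst0005">',
    'PLoS ONE': '<h2>Abstract</h2><div class="abstract-content">',
    'Vector-Borne and Zoonotic Diseases': '<h2>Abstract</h2></div><div class="abstractSection abstractInFull">',
    'Parasites & Vectors': '<h2 class="c-article-section__title js-section-title js-c-reading-companion-sections-item" id="Abs1">Abstract</h2><div class="c-article-section__content" id="Abs1-content">',
}

def div_after_abstract(html):
    """Return (attribute name, attribute value) of the first <div> after '>Abstract<'."""
    heading = html.find('>Abstract<')
    if heading == -1:
        return None
    div_start = html.find('<div', heading)
    if div_start == -1:
        return None
    tag = html[div_start:html.find('>', div_start)]   # e.g. '<div id="abst0005"'
    if '=' not in tag or tag.count('"') < 2:
        return None                                    # <div> without a quoted attribute
    name = tag[tag.find(' ') + 1:tag.find('=')]
    value = tag.split('"')[1]
    return name, value

for site, html in SNIPPETS.items():
    print(site, div_after_abstract(html))
```

The resulting name and value pair is the kind of selector that can be handed to an HTML parser to pull out the abstract container.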

In addition, some sites, notably Science Direct/Elsevier, use JavaScript to redirect away from the page that the GET request lands on. The *Requests* package does not execute JavaScript, so it never follows this redirect. In these cases, additional handling is required to extract the actual URL, visit it and extract the abstract text.
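
A sketch of that extra handling, assuming the landing page embeds the percent-encoded target URL in a quoted attribute somewhere after the word `redirect`. The sample page below is made up for illustration and is not real Science Direct markup, and `extract_redirect_url` is a hypothetical helper name:

```python
from urllib.parse import unquote

# Hypothetical landing page: the real target is percent-encoded inside an attribute
landing_page = (b'<html><body><input type="hidden" name="redirectURL" '
                b'value="https%3A%2F%2Fexample.org%2Farticle%2F123%3Fvia%3Dihub">'
                b'</body></html>')

def extract_redirect_url(page):
    """Return the decoded URL that follows the first 'redirect' marker, or None."""
    marker = page.find(b'redirect')
    if marker == -1:
        return None
    start = page.find(b'http', marker)
    if start == -1:
        return None
    end = page.find(b'"', start)
    if end == -1:
        return None
    return unquote(page[start:end].decode('utf-8'))

print(extract_redirect_url(landing_page))
```

The decoded URL can then be fetched with a second `requests.get` call and searched for the abstract heading in the usual way.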

### Scraping Function

In [73]:
def scrapeAbstracts(doi):
    headers = {'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
              }
    url = 'http://dx.doi.org/' + doi
#     print(url)
    try:
        response = requests.get(url) 
    except requests.exceptions.RequestException as e:
        return 'Error'
        
    if(response.status_code!=200):
        return str(response.status_code)+ ' response'
    
    page = response.content
    captalised = False
    abstractHeadingStart = page.find(b'>Abstract<')
    if abstractHeadingStart==-1: 
        abstractHeadingStart = page.find(b'>ABSTRACT<')
        if abstractHeadingStart==-1:
            redirect = page.find(b'redirect')
            if redirect ==-1:
                return 'No abstract found'
            else:
                URLstart = page.find(b'http', redirect)
                URLend = page.find(b'"', URLstart)
                URLencoded = page[URLstart:URLend]
                URLdecoded = URLencoded.decode('UTF-8')
                redirectURL = urllib.parse.unquote(URLdecoded)
                redirectURLshort = redirectURL[:redirectURL.find('?')]
#                 print(redirectURLshort)
                try:
                    response = requests.get(redirectURL, headers=headers)
                except requests.exceptions.RequestException as e:
                    return 'Error after redirect'
                if(response.status_code!=200):
                    return 'No response'
                page = response.content
                abstractHeadingStart = page.find(b'>Abstract<')
                if abstractHeadingStart==-1: 
                    return 'No abstract found'
        else:
            captalised=True
#             print(captalised)
    if (captalised):
        lastMentionofAbstract = page.rfind(b'>ABSTRACT<')
        if abstractHeadingStart != lastMentionofAbstract:
    #         print('test')
            if page.find(b'>ABSTRACT</h')> abstractHeadingStart:
                abstractHeadingStart = page.find(b'>ABSTRACT</h')
        divTagStart = page.find(b'<div', abstractHeadingStart+len('>ABSTRACT<'))
        
    else:
        lastMentionofAbstract = page.rfind(b'>Abstract<')
        if abstractHeadingStart != lastMentionofAbstract:
    #         print('test')
            if page.find(b'>Abstract</h')> abstractHeadingStart:
                abstractHeadingStart = page.find(b'>Abstract</h')
        divTagStart = page.find(b'<div', abstractHeadingStart+len('>Abstract<'))
        if divTagStart - abstractHeadingStart > 200:
            divTagStart = page.rfind(b'<div', 0, abstractHeadingStart)
    #         print(page[abstractHeadingStart:abstractHeadingend])
#     print(page[abstractHeadingStart:])


    try:        
        
        divTagEnd = page.find(b'>', divTagStart)
        divTag = page[divTagStart:divTagEnd]
        divTagType = divTag[divTag.find(b' ')+1:divTag.find(b'=')].decode()
        divTagAtrr = divTag[divTag.find(b'"')+1:]
        divTagAtrr = divTagAtrr[:divTagAtrr.find(b'"')]
        
#         print(divTag)
#         print(divTagType)
#         print(divTagAtrr)
        scraping = BeautifulSoup(page, "html") 
        text = scraping.find("div", attrs={divTagType: divTagAtrr})
        for subTag in text.contents[:-1]:
            if subTag.name is not None and subTag.name.startswith("h"):
                subTag.string = subTag.string + '.'

    except AttributeError as error:
        headingTag = page[page.rfind(b'<',0, abstractHeadingStart):abstractHeadingStart]
        headingTag = headingTag[:3]

        abstractStart = page.find(b'>', abstractHeadingStart+len('>Abstract<'))+1
        abstractEnd = page.find(headingTag, abstractStart)  # was `headingtag`, which raised NameError

        scraping = BeautifulSoup(page[abstractStart:abstractEnd], "html") 
        text = scraping.get_text(strip=True)
        return text
    except TypeError as error:
        return 'TypeError'
#         print(divTag)
#         print(divTagType)
#         print(divTagAtrr)
#     print('returns')
    return text.get_text(separator = ' ', strip=True)
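
`scrapeAbstracts` reports failures by returning short status strings (`'Error'`, `'No abstract found'`, `'404 response'` and so on) in place of an abstract, so its results need separating before any analysis. One possible sketch: the helper name `is_failure` is an assumption for illustration and does not appear in the notebook, and the sample abstract text is invented:

```python
import re

# Status strings returned by scrapeAbstracts instead of an abstract
FAILURES = {'Error', 'Error after redirect', 'No response',
            'No abstract found', 'TypeError'}

def is_failure(result):
    """True if `result` is one of the scraper's status strings rather than abstract text."""
    return result in FAILURES or re.fullmatch(r'\d{3} response', result) is not None

# Made-up results: one abstract-like string and two status strings
results = ['Lyme disease is a tick-borne infection.', '404 response', 'No abstract found']
print([is_failure(r) for r in results])
```

Returning status strings keeps the function easy to `apply` over a column, at the cost of needing a check like this afterwards.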

In [72]:
print('There are', scraped.publication.nunique(), 'different publications in the scraping dataset.')
print(sum(scraped.publication.value_counts()==1), "of them appear only once.")
print("Another", sum(scraped.publication.value_counts()==2), "appear just twice.")

There are 1056 different publications in the scraping dataset.
661 of them appear only once.
Another 169 appear just twice.


In [67]:
scraped.publication.value_counts()[:10]

ticks and tick-borne diseases         58
clinical infectious diseases          52
parasites & vectors                   47
plos one                              45
médecine et maladies infectieuses     38
parasites and vectors                 32
the american journal of medicine      29
emerging infectious diseases          28
revue francophone des laboratoires    27
vector-borne and zoonotic diseases    21
Name: publication, dtype: int64