In [46]:
import pickle
from crossref.restful import Works, Etiquette
import pandas as pd
from scihub_upgraded import SciHub

## Part 1: Creating the new meta_df from CrossRef

In [2]:
with open("doilist", "rb") as fp:
    doi_list = pickle.load(fp)

In [3]:
cet_list = doi_list[:20000]
sum_list = doi_list[20000:40000]
ogi_list = doi_list[40000:]
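
The three-way split above can also be written with a small helper. This is only a sketch: the name `split_at` and the toy DOI strings are illustrative and not part of the original notebook; only the boundaries (20000, 40000) and the total of 64,119 DOIs come from the cells in this notebook.

```python
def split_at(items, boundaries):
    """Split `items` into consecutive chunks at the given index boundaries."""
    edges = [0, *boundaries, len(items)]
    return [items[start:stop] for start, stop in zip(edges, edges[1:])]


# Toy stand-in for doi_list with the same total length as in the notebook
toy_dois = [f"10.0000/example.{i}" for i in range(64119)]
cet_chunk, sum_chunk, ogi_chunk = split_at(toy_dois, [20000, 40000])
print(len(cet_chunk), len(sum_chunk), len(ogi_chunk))
```

Because the last chunk runs to the end of the list, it absorbs the remainder, which is why the third share is larger than the other two.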

In [5]:
my_etiquette = Etiquette('Analysing Publishing Delay in Academic Journals', 'v2.0', 'https://github.com/Spidey0023/THEsis-Codes', 'oguzkokes@gmail.com')

In [9]:
ogi_works = Works(etiquette=my_etiquette)

In [10]:
# Note: despite its name, meta_dict is a list holding one CrossRef record (dict) per DOI
meta_dict = []
for doi in ogi_list:
    meta_dict.append(ogi_works.doi(doi))
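
A long fetch loop like the one above stops at the first network error and silently stores whatever comes back. A minimal sketch of a more defensive loop follows. It assumes that a failed lookup shows up either as an exception or as a `None` return; `fetch_all` and `fake_fetch` are hypothetical names, and in the notebook the role of `fetch` would be played by `ogi_works.doi`.

```python
import time


def fetch_all(dois, fetch, retries=2, pause=0.0):
    """Fetch metadata for each DOI, recording failures instead of stopping.

    `fetch` is any callable that takes a DOI and returns a record or None.
    """
    records, failed = [], []
    for doi in dois:
        record = None
        for attempt in range(retries + 1):
            try:
                record = fetch(doi)
                break
            except Exception:
                # Wait before retrying; give up after the last attempt
                if attempt < retries:
                    time.sleep(pause)
        if record is None:
            failed.append(doi)
        else:
            records.append(record)
    return records, failed


# Hypothetical offline stand-in for the network call
def fake_fetch(doi):
    if doi.endswith("missing"):
        return None
    return {"DOI": doi}


records, failed = fetch_all(["10.1/a", "10.1/missing", "10.1/b"], fake_fetch)
print(len(records), failed)
```

Keeping the failed DOIs in a separate list makes it possible to retry only those later, rather than repeating the whole run.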

In [11]:
len(meta_dict)

24119

In [12]:
with open("ogi_dict","wb") as fp:
    pickle.dump(meta_dict, fp)

In [13]:
meta_df = pd.DataFrame(meta_dict)

In [17]:
%store meta_df

Stored 'meta_df' (DataFrame)


In [14]:
meta_df.to_csv("ogi_meta.csv")

In [31]:
meta_df.columns

Index(['indexed', 'reference-count', 'publisher', 'issue', 'content-domain',
       'short-container-title', 'published-print', 'DOI', 'type', 'created',
       'page', 'source', 'is-referenced-by-count', 'title', 'prefix', 'volume',
       'author', 'member', 'container-title', 'original-title', 'language',
       'deposited', 'score', 'subtitle', 'short-title', 'issued',
       'references-count', 'journal-issue', 'URL', 'relation', 'ISSN',
       'issn-type', 'subject', 'published', 'license', 'abstract',
       'update-policy', 'published-online', 'reference', 'link',
       'alternative-id', 'update-to', 'funder', 'published-other', 'archive',
       'assertion', 'article-number', 'editor', 'translator', 'accepted',
       'clinical-trial-number'],
      dtype='object')

Thanks to Ceto & Summan, CrossRef meta_df extraction was completed in just 3 hours. 

The dataset can now be merged (pd.concat) and then analysed. The final aim is to match ISSN & DOI and identify the journals with "OK" dates.

Then the plan is to: 

1) Give a second chance to all CR_failed and date_failed journals.

2) Finalize the journal pool for Q1 journals

3) Calculate the sample count for OK journals.

4) Create a new pipeline for CR meta and SciHub date extraction

5) Create the 100K Q1 article dataset

6) Analyse the dataset

7) Repeat the data extraction processes for Q2-4 journals

8) Create a "custom" dataset for all Q1-4 journals

9) Repeat the analysis for the all-quartile (Q1-4) dataset

10) Write two articles, one per dataset

11) PROFIT!

## Part 2: Analysing meta_df

In [3]:
with open("ceto_dict", "rb") as fp:
    ceto_meta = pickle.load(fp)

In [4]:
with open("sum_dict", "rb") as fp:
    sum_meta = pickle.load(fp)

In [7]:
with open("ogi_dict", "rb") as fp:
    ogi_meta = pickle.load(fp)

In [8]:
ceto_df = pd.DataFrame(ceto_meta)
sum_df = pd.DataFrame(sum_meta)
ogi_df = pd.DataFrame(ogi_meta)

In [9]:
ceto_df.shape

(20000, 51)

In [10]:
sum_df.shape

(20000, 50)

In [11]:
ogi_df.shape

(24119, 51)

In [13]:
meta_df = pd.concat([ceto_df,sum_df, ogi_df], ignore_index=True)

In [14]:
meta_df.shape

(64119, 51)

In [73]:
meta_df.loc[0:10,'published']

0      {'date-parts': [[2010, 9, 17]]}
1       {'date-parts': [[2011, 6, 1]]}
2      {'date-parts': [[2012, 4, 23]]}
3       {'date-parts': [[2013, 3, 6]]}
4      {'date-parts': [[2014, 9, 16]]}
5      {'date-parts': [[2015, 5, 20]]}
6       {'date-parts': [[2016, 6, 7]]}
7     {'date-parts': [[2017, 11, 20]]}
8      {'date-parts': [[2018, 9, 21]]}
9      {'date-parts': [[2019, 10, 8]]}
10      {'date-parts': [[2020, 3, 5]]}
Name: published, dtype: object

In [74]:
meta_df['published'].map(len).value_counts()

1    64119
Name: published, dtype: int64

In [69]:
meta_df['relation'][meta_df['relation'].map(len)==1]

85       {'has-review': [{'id-type': 'doi', 'id': '10.3...
86       {'has-review': [{'id-type': 'doi', 'id': '10.3...
90       {'has-preprint': [{'id-type': 'doi', 'id': '10...
136      {'has-preprint': [{'id-type': 'doi', 'id': '10...
249      {'is-translation-of': [{'id-type': 'doi', 'id'...
                               ...                        
63876    {'has-preprint': [{'id-type': 'doi', 'id': '10...
63948    {'has-review': [{'id-type': 'doi', 'id': '10.2...
63949    {'has-review': [{'id-type': 'doi', 'id': '10.2...
63963    {'has-preprint': [{'id-type': 'doi', 'id': '10...
64031    {'has-review': [{'id-type': 'doi', 'id': '10.3...
Name: relation, Length: 376, dtype: object

In [56]:
meta_df.loc[10,'deposited']

{'date-parts': [[2021, 6, 30]],
 'date-time': '2021-06-30T03:23:21Z',
 'timestamp': 1625023401000}

In [58]:
meta_df['score'].value_counts()

1    64119
Name: score, dtype: int64

In [21]:
meta_df.columns

Index(['indexed', 'reference-count', 'publisher', 'issue', 'license',
       'content-domain', 'short-container-title', 'published-print', 'DOI',
       'type', 'created', 'page', 'source', 'is-referenced-by-count', 'title',
       'prefix', 'volume', 'author', 'member', 'published-online', 'reference',
       'container-title', 'original-title', 'language', 'link', 'deposited',
       'score', 'subtitle', 'short-title', 'issued', 'references-count',
       'journal-issue', 'URL', 'relation', 'ISSN', 'issn-type', 'subject',
       'published', 'funder', 'archive', 'alternative-id', 'update-policy',
       'assertion', 'abstract', 'update-to', 'article-number',
       'published-other', 'clinical-trial-number', 'editor', 'accepted',
       'translator'],
      dtype='object')

In [75]:
meta_df_col_keep = ['reference-count', 'publisher','published-print','DOI','is-referenced-by-count', 'title','author','published-online', 'reference',
'container-title', 'language','issued', 'references-count','ISSN', 'subject', 'published']

meta_df_col_drop = ['indexed',  'issue', 'license', 'content-domain', 'short-container-title',  
       'type', 'created', 'page', 'source', 'prefix', 'volume', 'member', 
       'original-title', 'link', 'deposited', 'score', 'subtitle', 'short-title',
       'journal-issue', 'URL', 'relation', 'issn-type', 
       'funder', 'archive', 'alternative-id', 'update-policy',
       'assertion', 'abstract', 'update-to', 'article-number',
       'published-other', 'clinical-trial-number', 'editor', 'accepted',
       'translator']

In [132]:
meta_df["references-count"].equals(meta_df["reference-count"])

meta_df.drop("references-count",axis=1, inplace=True)

In [76]:
meta_df_full = meta_df.copy()

%store meta_df_full

Stored 'meta_df_full' (DataFrame)


In [133]:
#meta_df.drop(meta_df_col_drop, axis=1, inplace=True)

%store meta_df

meta_df.to_csv("meta_df.csv")

Stored 'meta_df' (DataFrame)


In [None]:
%store -r meta_df

%store -r retr_complete_one

%store -r sh_comp_one_df
%store -r unpy_comp_one_df

In [137]:
# Wrap bare ISSN strings in a list so that every Issn cell is a list
issn_is_str = retr_complete_one.Issn.map(type) == str
retr_complete_one.loc[issn_is_str, "Issn"] = retr_complete_one.loc[issn_is_str, "Issn"].map(lambda x: [x])
retr_complete_one["Year"] = retr_complete_one.Year.astype(int)

meta_df["ISSN"] = meta_df.ISSN.map(lambda x: [iss.replace("-","") for iss in x])
meta_df["Year"] = meta_df.published.map(lambda x: x["date-parts"][0][0])
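
The two `map` calls above exist because the two tables store identifiers differently: CrossRef returns hyphenated ISSNs (for example `1094-7159`), while the journal table holds them without hyphens, and the publication year sits inside a nested `date-parts` list. A plain-Python sketch of both conversions is shown here; the function names are illustrative, and the sample record reuses values that appear in this notebook's output.

```python
def normalise_issns(issns):
    """Strip hyphens so '1094-7159' and '10947159' compare equal."""
    return [issn.replace("-", "") for issn in issns]


def publication_year(published):
    """Return the year from a CrossRef-style {'date-parts': [[y, m, d]]} dict."""
    return published["date-parts"][0][0]


record = {"ISSN": ["1094-7159"], "published": {"date-parts": [[2019, 4]]}}
print(normalise_issns(record["ISSN"]), publication_year(record["published"]))
```

Note that `date-parts` may contain only a year and month, so only the first element is safe to rely on.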

In [150]:
meta_df.ISSN.map(len).value_counts()

2    42356
1    21698
3       65
Name: ISSN, dtype: int64

In [151]:
retr_complete_one.Issn.map(len).value_counts()

1    57454
2    16917
3       50
Name: Issn, dtype: int64

In [167]:
def doi_retr(retrrow, artcldf):
    issn = retrrow.Issn
    year = retrrow.Year
    doi = artcldf[(artcldf.ISSN.map(lambda x: any(iss in x for iss in issn))) & (artcldf.Year == year)]["DOI"].tolist()
    return doi
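
`doi_retr` scans the whole article table once for every journal-year row. An alternative, sketched below under the assumption that ISSNs are already normalised, is to build a lookup keyed by (ISSN, year) once and then query it per row. The names `build_issn_year_index` and `lookup_dois` and all identifiers in the toy data are made up for illustration.

```python
from collections import defaultdict


def build_issn_year_index(articles):
    """Map each (issn, year) pair to the DOIs published under it."""
    index = defaultdict(list)
    for art in articles:
        for issn in art["ISSN"]:
            index[(issn, art["Year"])].append(art["DOI"])
    return index


def lookup_dois(issns, year, index):
    """Collect DOIs matching any of the journal's ISSNs in the given year."""
    found = []
    for issn in issns:
        for doi in index.get((issn, year), []):
            if doi not in found:  # an article may be listed under two ISSNs
                found.append(doi)
    return found


# Toy article records; ISSNs and DOIs are invented
articles = [
    {"ISSN": ["12345678", "87654321"], "Year": 2017, "DOI": "10.0000/toy.1"},
    {"ISSN": ["11112222"], "Year": 2010, "DOI": "10.0000/toy.2"},
]
index = build_issn_year_index(articles)
print(lookup_dois(["12345678", "87654321"], 2017, index))
print(lookup_dois(["11112222"], 2011, index))
```

The result has the same shape as the output of `doi_retr`: a list of DOIs per journal-year, which is empty when nothing matches.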


In [168]:
doi_retr_trial2 = retr_complete_one.apply(lambda x: doi_retr(x,meta_df), axis=1, result_type="reduce")

In [171]:
doi_retr_trial2.map(len).value_counts()

1    64076
0    10329
2       16
dtype: int64

In [177]:
retr_complete_one["DOI"] = doi_retr_trial2

In [179]:
%store retr_complete_one

Stored 'retr_complete_one' (DataFrame)


In [180]:
retr_complete_one.to_csv("retr_complete_one.csv")

In [178]:
retr_complete_one

Unnamed: 0,Issn,Year,Total_Docs,Sample_Count,CrossRef_retr,DOI
0,[00011541],2010,294,1,True,[10.1002/aic.12400]
1,[00011541],2011,315,1,True,[10.1002/aic.12671]
2,[00011541],2012,347,1,True,[10.1002/aic.13810]
3,[00011541],2013,422,1,True,[10.1002/aic.14056]
4,[00011541],2014,359,1,True,[10.1002/aic.14601]
...,...,...,...,...,...,...
74416,"[8756758X, 14602695]",2017,170,1,True,[10.1111/ffe.12617]
74417,"[8756758X, 14602695]",2018,199,1,True,[10.1111/ffe.12803]
74418,"[8756758X, 14602695]",2019,210,1,True,[10.1111/ffe.13083]
74419,"[8756758X, 14602695]",2020,221,1,True,[10.1111/ffe.13260]


In [172]:
meta_df.shape

(64119, 16)

## Part 3: Small Adjustments on meta_df

As can be seen below, the new meta_df performed well above expectations. However, a small number of DOIs still need manual matching, and some journals have "double" articles.

In this step meta_df will be finalized so that one last run can be made for all fails. These fails include:

1- No article DOI info from CrossRef
2- Failed metadata from Unpywall API (now replaced with CR Rest API)
3- "cant_read_pdf" error (PDF Miner)
4- Can't find dates (a final chance for each ISSN-year pairing)

To do this:

    1- meta_df needs to have exactly 64,119 DOIs - DONE!

    2- Pull new articles + DOIs for each failed (10k new articles) -> CrossRef run

    3- Find & scrape PDFs for each -> scihub_upgraded

    4- Compare & merge results -> retr_complete_one



In [3]:
%store -r meta_df
%store -r retr_complete_one

In [20]:
meta_dois = meta_df.DOI.tolist()

retr_OK_dois = [val for sublist in retr_complete_one.DOI.tolist() for val in sublist]

unmacthed_dois = set(meta_dois).difference(retr_OK_dois)
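
The set difference above only checks one direction (DOIs in meta_df that no journal row picked up). A small sketch that checks both directions is given here with toy values; the variable names and DOI strings are placeholders, not data from this project.

```python
from itertools import chain

meta = ["10.1/a", "10.1/b", "10.1/c"]
per_journal = [["10.1/a"], [], ["10.1/b", "10.1/x"]]

# Flatten the per-row DOI lists into a single set
matched = set(chain.from_iterable(per_journal))
unmatched = set(meta) - matched    # in the metadata but assigned to no journal row
unexpected = matched - set(meta)   # assigned to a row but absent from the metadata
print(sorted(unmatched), sorted(unexpected))
```

An empty `unexpected` set confirms that every DOI attached to a journal row really comes from the metadata table.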


In [44]:
# This is the final number of DOIs we want to reach in the retr_complete_one["DOI"] column:
len(meta_dois)

64119

In [25]:
#The current situation:
retr_complete_one.DOI.map(len).value_counts()

1    64076
0    10329
2       16
Name: DOI, dtype: int64

In [24]:
print(f"Total unmatched: {len(unmacthed_dois)}")

Total unmatched: 23


So, we need to deal with:

* Duplicate DOIs in retr_ (16) +4   FIXED!
* Unmatched DOIs (23) FIXED!
* Jrnls with CR_retr == True but doi == 0 (76?) FIXED!
* Unpy missing & failed articles (49) FIXED!


In [55]:
duplicate_doi_retr = retr_complete_one.copy()

In [130]:
#1 Duplicate DOIs:
duplicate_doi_retr[duplicate_doi_retr.DOI.map(len)==2]

Unnamed: 0,Issn,Year,Total_Docs,Sample_Count,CrossRef_retr,DOI


In [112]:
#duplicate_doi_retr[duplicate_doi_retr.Issn.map(lambda x: "3502" in x)]

duplicate_doi_retr.loc[3502,"DOI"]
duplicate_doi_retr.loc[10173,"DOI"]
duplicate_doi_retr.loc[5363,"DOI"] = [duplicate_doi_retr.loc[5364,"DOI"][0]]
duplicate_doi_retr.loc[5364,"DOI"] = [duplicate_doi_retr.loc[5364,"DOI"][1]]


['10.1080/00210862.2014.881083', '10.1080/00210862.2014.1000629']

In [61]:
[duplicate_doi_retr.loc[43482,"DOI"][0]]

['10.1093/neuros/nyx039']

In [208]:
# Convert back to List:
doi_is_str = duplicate_doi_retr.DOI.map(type) == str
duplicate_doi_retr.loc[doi_is_str, "DOI"] = duplicate_doi_retr.loc[doi_is_str, "DOI"].map(lambda x: [x])

In [212]:
duplicate_doi_retr[duplicate_doi_retr.Issn.map(lambda x: "15251403" in x)]

Unnamed: 0,Issn,Year,Total_Docs,Sample_Count,CrossRef_retr,DOI
43576,[15251403],2011,208,1,False,[]
43577,[15251403],2014,294,1,False,[]
43578,[15251403],2015,275,1,True,[10.1111/ner.12267]
43579,[15251403],2016,170,1,True,[10.1111/ner.12397]
43580,[15251403],2017,111,1,True,[10.1111/ner.12716]
43581,[15251403],2018,115,1,True,[10.1111/ner.12890]
43582,[15251403],2019,152,1,True,[10.1111/ner.12939]
43583,[15251403],2020,253,1,True,[10.1111/ner.13100]


In [210]:
meta_df[meta_df.DOI == '10.1111/ner.12939']

Unnamed: 0,reference-count,publisher,published-print,DOI,is-referenced-by-count,title,author,published-online,reference,container-title,language,issued,ISSN,subject,published
39116,51,Elsevier BV,"{'date-parts': [[2019, 4]]}",10.1111/ner.12939,7,[Spinal Cord Stimulation Infection Rate and In...,"[{'given': 'David A.', 'family': 'Provenzano',...",,"[{'key': '10.1111/ner.12939_bib1', 'doi-assert...",[Neuromodulation: Technology at the Neural Int...,en,"{'date-parts': [[2019, 4]]}",[1094-7159],"[Anesthesiology and Pain Medicine, Neurology (...","{'date-parts': [[2019, 4]]}"


In [207]:
duplicate_doi_retr.loc[32,"DOI"] = "10.5465/amj.2019.0156"

In [133]:
#Unmatched DOIs:
unmacthed_dois

{'10.1002/acr.24479',
 '10.1002/stem.283',
 '10.1016/j.prosdent.2020.09.020',
 '10.1016/j.psychsport.2020.101780',
 '10.1037/hea0001031',
 '10.1070/sm9367',
 '10.1080/09739572.2020.1689694',
 '10.1086/675642',
 '10.1086/682227',
 '10.1086/685489',
 '10.1088/1367-2630/abd50e',
 '10.1109/tmm.2020.3044452',
 '10.1111/ner.12267',
 '10.1111/ner.12397',
 '10.1111/ner.12716',
 '10.1111/ner.12890',
 '10.1111/ner.12939',
 '10.1111/ner.13100',
 '10.1145/3379463',
 '10.25222/larr.258',
 '10.25222/larr.377',
 '10.3917/afco.239.0035',
 '10.5465/amj.2019.0156'}

In [215]:
duplicate_doi_retr[(duplicate_doi_retr.CrossRef_retr == False) & (duplicate_doi_retr.DOI.map(len)!=0)]

Unnamed: 0,Issn,Year,Total_Docs,Sample_Count,CrossRef_retr,DOI
43576,[15251403],2011,208,1,False,[]
43577,[15251403],2014,294,1,False,[]


In [41]:
len(retr_complete_one[retr_complete_one.CrossRef_retr==True])

64168

In [229]:
duplicate_doi_retr.loc[43577,"DOI"]

[]

In [51]:
43+64076

64119

In [230]:
duplicate_doi_retr.DOI.map(len).value_counts()

1    64119
0    10302
Name: DOI, dtype: int64

## Part 4: Merging meta_df & sh_comp_one_df & unpy_comp_one_df ->  retr_comp_one -> FIN

In [47]:
%store -r duplicate_doi_retr
%store -r meta_df
%store -r sh_comp_one_df
%store -r unpy_comp_one_df

In [4]:
duplicate_doi_retr["DOI"] = duplicate_doi_retr.DOI.map(lambda x: x[0] if len(x)>0 else "")

In [48]:
# Combine sh + unpy with meta_df to create -> date_df

# 1- Scihub merge:
date_df = pd.merge(meta_df,sh_comp_one_df[["doi","Scihub_results"]], how="left", left_on="DOI", right_on="doi")
date_df.rename(columns={"Scihub_results":"Results"}, inplace=True)
date_df.drop("doi", axis=1, inplace=True)

# 2- Unpy merge
date_df.set_index("DOI", inplace=True)
unpy_comp_one_df.set_index("doi", inplace=True)
date_df["Results"] = date_df["Results"].fillna(unpy_comp_one_df["Unpy_results"])

# 3 - Fixing Unpy_df.drop(52626, inplace=True) from part3d_Unpy_one_get_dates:
sh = SciHub()
tek_date = sh.get_dates("10.1186/1755-8794-4-73")
date_df.at["10.1186/1755-8794-4-73","Results"] = tek_date
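
The merge-then-fill pattern in the cell above relies on `fillna` aligning a Series by index, which is why both frames are re-indexed by DOI first. The sketch below reproduces the idea on toy frames: the column names mirror the notebook, the values are made up, and it assumes each DOI appears only once in the fallback table.

```python
import pandas as pd

# Toy frames; column names mirror the notebook, values are placeholders
meta = pd.DataFrame({"DOI": ["d1", "d2", "d3"]})
scihub = pd.DataFrame({"doi": ["d1"], "Scihub_results": ["sh-dates"]})
unpy = pd.DataFrame({"doi": ["d2"], "Unpy_results": ["unpy-dates"]})

# Step 1: a left merge keeps every metadata row, matched or not
merged = meta.merge(scihub, how="left", left_on="DOI", right_on="doi")
merged = merged.rename(columns={"Scihub_results": "Results"}).drop(columns="doi")

# Step 2: fillna with a Series aligns on the index, so key both sides by DOI
merged = merged.set_index("DOI")
fallback = unpy.set_index("doi")["Unpy_results"]  # non-mutating, safe to re-run
merged["Results"] = merged["Results"].fillna(fallback)
print(merged["Results"].tolist())
```

Using `unpy.set_index("doi")` without `inplace=True` leaves the original frame untouched, so the cell can be run again without a missing-column error.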




In [48]:
# Combine duplicate_doi_retr with date_df -> back to retr_complete_one & is ready for second run
retr_complete_one = duplicate_doi_retr.merge(date_df["Results"], how="left", left_on="DOI", right_index=True)

In [58]:
# Create mask for second run:
# Conditions:
# 1- CrossRef == False
cr_mask = retr_complete_one.CrossRef_retr == False

# 2- Results.map(type) == str
#retr_complete_one[retr_complete_one.Results.map(type)==str].Results.value_counts()
str_mask = retr_complete_one.Results.map(type)==str

# 3- Results -> no_date_found
def no_date_finder(res):
    # A two-element list marks a "no date found" result
    return type(res) == list and len(res) == 2

no_date_mask = retr_complete_one.Results.map(no_date_finder)
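
The three masks can be read as a single per-row rule. The sketch below states that rule in plain Python, using the assumptions implied by the masks: a string result is an error message such as "cant_read_pdf", and a two-element list marks a "no date found" outcome. The function name and the placeholder list contents are illustrative.

```python
def needs_second_run(crossref_ok, result):
    """Return True when a journal-year row should be retried."""
    if not crossref_ok:
        return True  # no article DOI was retrieved from CrossRef
    if isinstance(result, str):
        return True  # the date extraction returned an error string
    # Assumption from the mask above: a two-element list means no date was found
    return isinstance(result, list) and len(result) == 2


rows = [
    (False, None),                           # CrossRef retrieval failed
    (True, "cant_read_pdf"),                 # error string
    (True, ["placeholder", "placeholder"]),  # two-element list
    (True, ["a", "b", "c"]),                 # any other shape is kept
]
print([needs_second_run(ok, res) for ok, res in rows])
```

A row is retried if any one of the three conditions holds, which matches the `|` combination used to build the second-run frame.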

In [60]:
second_run_df = retr_complete_one[(cr_mask) | (str_mask) |(no_date_mask)]

READY FOR ROUND 2!

But before round 2, will simplify the folders & files & variables to avoid confusion later on.


FINAL NOTE:

As it stands, the next notebook should need only 2 variables:

* retr_complete_one -> Scimago jrnl data combined with CR_retr + DOI + Results column
* date_df -> combination of meta_df + sh_comp_one_df + unpy_comp_one_df -> has metadata + Date column

These 2 dataframes will also be saved as csv & pickle for backup!

And in case of a problem we have several backup variables:

For retr_complete_one:
* duplicate_doi_retr -> df with minor issues of earlier retr_comp_one fixed
* retr_issnupdated -> an earlier & unfinished version 

For date_df:
* meta_df -> CR metadata column (thanks to Ceto & Summan)
* meta_df_full -> meta_df with all 51 columns

    Partial backups:
* sh_comp_one_df -> artcl Unpy metadata + dates from Scihub 
* unpy_comp_one_df -> artcl Unpy metadata + dates from direct url + dates from Scihub for direct url fails

    Deprecated backups:
* artcl_df -> metadata df retrieved from Unpy (PUBLISHED COLUMN IS WRONG!)
* doi_complete_one -> list that only contains DOIs for complete_one_run





In [84]:
%store retr_complete_one
%store date_df
%store second_run_df

Stored 'retr_complete_one' (DataFrame)


In [55]:
with open("date_df","wb") as fp:
    pickle.dump(date_df, fp)

In [56]:
with open("retr_complete_one","wb") as fp:
    pickle.dump(retr_complete_one, fp)

In [62]:
with open("second_run_df","wb") as fp:
    pickle.dump(second_run_df, fp)
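
Since these pickles are the only backups, it can be worth confirming that each file reads back correctly right after writing it. A minimal sketch with a hypothetical `backup` helper and a toy object follows; the file name and contents are placeholders.

```python
import pickle
import tempfile
from pathlib import Path


def backup(obj, path):
    """Pickle `obj` to `path` and confirm that it reads back equal."""
    path = Path(path)
    with path.open("wb") as fp:
        pickle.dump(obj, fp)
    with path.open("rb") as fp:
        restored = pickle.load(fp)
    return restored == obj


# Toy object written to a temporary directory so nothing real is overwritten
with tempfile.TemporaryDirectory() as tmp:
    ok = backup({"10.1/a": ["2019-04-01"]}, Path(tmp) / "toy_backup")
print(ok)
```

For a DataFrame the `==` comparison returns a frame rather than a single boolean, so the check would need `restored.equals(obj)` instead.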