
## Sorting references and finding Digital Object Identifiers (DOIs) for the publications reported as part of the second grant supporting the Oxford Biomedical Research Centre (OxBRC2)

###  The process makes use of resources kindly made available by:

 Crossref (https://www.crossref.org/),   
 and the Python library 'habanero' (v0.6, https://habanero.readthedocs.io/en/latest/index.html)
        
<br>
<p><font size=4 color=red>&#9888; Please do not run this notebook repeatedly unless needed, as it will put unnecessary strain on the API servers &#9888;</font></p> 
<p> It will also take a long time to run. </p>
<p> You could run this with a subset of the data, as described below. </p>

<br>


### Set up your workspace by importing the packages you need



In [1]:
# Handles our data 
import pandas as pd

# Helpful for managing calls to the CrossRef API
from habanero import Crossref, exceptions

# We need to limit our usage of the API to be polite
import time

# 'xlrd' is also likely a dependency (recent pandas versions read .xlsx files with 'openpyxl' instead).
# Either can be installed with a magic command, e.g. '%conda install xlrd', in a new code cell if needed

### First bring in the existing list of publications (using only the columns needed)
An Excel file can be imported with the pandas Python library, and we will select only the columns we need

In [2]:
#%conda install xlrd

In [3]:
df1 = pd.read_excel('./Source_files/Source_OxBRC2_publications.xlsx',
                    sheet_name='current_list', usecols=[0,1,2,3])
df1.shape

(2378, 4)

In [4]:
#Now we can look at the top couple of lines of the file we have just imported
df1.head(2)

Unnamed: 0,ID,DOI,complete,csv_post_title
0,1125,10.1186/s12881-014-0095-4,"&amp; , Fenwick AL, Goos JAC, Rankin J, Lord ...",Apparently synonymous substitutions in FGFR2 a...
1,1996,10.1183/13993003.00321-2016,", Pattinson KT, Turner MR. A wider pathologic...",A wider pathological network underlying breath...


In [5]:
# We can count how many rows already have a DOI (notna() keeps the rows where DOI is present)

df1.loc[df1.DOI.notna()].shape


(1442, 4)

### This shows that 1442 of the 2378 publications have DOIs (936 do not)
### _However, we don't know whether any of them are correct_, so it is best to acquire the top match from CrossRef for every publication
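
DOIs are case-insensitive, so an existing DOI and the one returned by CrossRef can differ in capitalisation (or carry a `https://doi.org/` prefix) and still point to the same record. Below is a minimal sketch of how the two could be compared once the lookups are done. The helper names `normalise_doi` and `doi_status` are illustrative and not part of the original notebook.

```python
import math

# Prefixes that are sometimes stored in front of the bare DOI
PREFIXES = ("https://doi.org/", "http://doi.org/",
            "https://dx.doi.org/", "http://dx.doi.org/", "doi:")

def normalise_doi(doi):
    """Return a lower-cased DOI without a URL prefix, or None if it is missing."""
    if doi is None or (isinstance(doi, float) and math.isnan(doi)):
        return None
    doi = str(doi).strip().lower()
    for prefix in PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
    return doi or None

def doi_status(existing, found):
    """Classify a pair of DOIs as 'no match', 'new', 'agree' or 'differ'."""
    existing, found = normalise_doi(existing), normalise_doi(found)
    if found is None:
        return "no match"   # CrossRef returned nothing usable
    if existing is None:
        return "new"        # the list had no DOI, CrossRef suggests one
    return "agree" if existing == found else "differ"

print(doi_status("10.1158/1055-9965.EPI-15-0363", "10.1158/1055-9965.epi-15-0363"))  # agree
print(doi_status(float("nan"), "10.7554/elife.03821"))  # new
```

Rows classed as 'differ' are the ones worth checking by hand, since either the listed DOI or the top CrossRef match could be the wrong one.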


In [6]:
# Set up a Crossref client to make the API calls
cr = Crossref()

In [7]:
# Set a mailto address; adding your e-mail will allow better reporting of issues.
# Assign the result back to cr so that later calls actually use the address
cr = Crossref(mailto = "your e-mail here")
cr

< Crossref 
URL: https://api.crossref.org
KEY: None
MAILTO: <your e-mail here
ADDITIONAL UA STRING: None
>

In [28]:
# Try a single title first (row 12), asking only for the title and DOI of the top hit
cr.works(
    query_title=df1.csv_post_title[12],
    select=['title', 'DOI'],
    filter={'from_pub_date': '2011-01-01',
            'until_pub_date': '2018-12-31',
            'type': 'journal-article'},
    limit=1,
)

{'status': 'ok',
 'message-type': 'work-list',
 'message-version': '1.0.0',
 'message': {'facets': {},
  'total-results': 1772384,
  'items': [{'title': ['HIV-1 DNA predicts disease progression and post-treatment virological control'],
    'DOI': '10.7554/elife.03821'}],
  'items-per-page': 1,
  'query': {'start-index': 0, 'search-terms': None}}}
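
The response is a nested dictionary: the hits sit under `message` then `items`, and each item's `title` is a list rather than a plain string. The sketch below shows one way to pull out the title and DOI of the top hit. The function name `top_hit` is an assumption made for illustration, and the dictionary is a trimmed copy of the output shown above.

```python
def top_hit(response):
    """Return (title, doi) for the first item in a works response, or (None, None)."""
    items = response.get("message", {}).get("items", [])
    if not items:
        return None, None
    first = items[0]
    # 'title' is a list of strings; fall back to [None] if it is absent or empty
    titles = first.get("title") or [None]
    return titles[0], first.get("DOI")

# Trimmed copy of the response printed above
response = {
    'status': 'ok',
    'message-type': 'work-list',
    'message': {
        'total-results': 1772384,
        'items': [{'title': ['HIV-1 DNA predicts disease progression and post-treatment virological control'],
                   'DOI': '10.7554/elife.03821'}],
        'items-per-page': 1,
    },
}

title, doi = top_hit(response)
print(doi)
```

Note that `total-results` is very large even though only one item is returned, which suggests the title query matches loosely; the top hit is therefore a best guess rather than a guaranteed match.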

In [11]:
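def CrossRefDoiTop1(df):
    """Return the top CrossRef hit (title and DOI) for one row's title, or None on failure."""
    try:
        search = cr.works(
            query_title=df.csv_post_title,
            select=['title', 'DOI'],
            filter={'from_pub_date': '2011-01-01',
                    'until_pub_date': '2018-12-31',
                    'type': 'journal-article'},
            limit=1,
        )
        return search['message']['items'][0]
    except Exception:
        # Failed request or no hit: return a real None (not the string 'None')
        # so that an isna() check on the results can find the gap
        return None
    finally:
        # Wait one second after every call, successful or not, to be polite to the API
        time.sleep(1)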

### The API request can be tested on a small subset of the titles in the list

#### Click on the triangle to unfold the code


<details><summary>Steps to run through with a subset of the dataframe</summary><br>

```python
# We can also take a short section of the dataframe, to check things are working
df1mini = df1.iloc[10:15]

# Then we can assign the API response to each row as a new column 
df1check= df1mini.assign(crossref_API_out =df1mini.apply(CrossRefDoiTop1, axis=1))

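# and then convert this column of JSON-like dicts into a new dataframe (that can be joined or merged with existing data)
# (pd.json_normalize is the top-level name from pandas 1.0; older versions expose it as pd.io.json.json_normalize)
newdata = pd.json_normalize(df1check.crossref_API_out.tolist())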

# have a look
newdata
```

</details>
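
Once the API column has been flattened, the new columns need to be lined up with the original rows again. The following is a rough sketch of that join, assuming pandas 1.0 or later. The miniature dataframe `df_small`, its placeholder titles and DOIs, and the column names `crossref_title` and `crossref_DOI` are all hypothetical.

```python
import pandas as pd

# Hypothetical miniature of df1check: two rows, each with a looked-up top hit
df_small = pd.DataFrame(
    {
        "ID": [1, 2],
        "DOI": ["10.0000/placeholder-a", None],
        "crossref_API_out": [
            {"title": ["Placeholder title A"], "DOI": "10.0000/placeholder-a"},
            {"title": ["Placeholder title B"], "DOI": "10.0000/placeholder-b"},
        ],
    },
    index=[10, 11],  # mimics the row labels kept by df1.iloc[10:15]
)

# Flatten the dicts into columns; the result gets a fresh 0..n-1 index
flat = pd.json_normalize(df_small["crossref_API_out"].tolist())

# Restore the original row labels so the join lines up row by row
flat.index = df_small.index

# Avoid clashing with the existing DOI column and unwrap the title list
flat = flat.rename(columns={"title": "crossref_title", "DOI": "crossref_DOI"})
flat["crossref_title"] = flat["crossref_title"].apply(lambda t: t[0] if t else None)

merged = df_small.drop(columns="crossref_API_out").join(flat)
print(merged)
```

Resetting the index on `flat` matters because a subset such as `df1.iloc[10:15]` keeps labels 10 to 14, while the flattened frame starts again at 0.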

###  For the main dataframe, we are going to split the data into three sections.  This is partly to ensure that, if the long process of collecting data is interrupted, we will not need to rerun everything.

In [18]:
# Cut the main dataframe (df1) into 3 sections
df1x1 = df1.iloc[0:801]
df1x2 = df1.iloc[801:1601]
df1x3 = df1.iloc[1601:]  # everything from row 1601 to the end
df1x3.head(2)

Unnamed: 0,ID,DOI,complete,csv_post_title
1601,839,10.4137/BMI.S16553,"Patel S, Murphy D, Haralambieva E, Abdulla ZA,...",Increased expression of phosphorylated FADD in...
1602,2411,10.1016/j.omtn.2016.12.006,"Patricio MI, Barnard AR, Orlans HO, McClements...",Inclusion of the Woodchuck Hepatitis Virus Pos...
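
The same idea of splitting the work can be generalised so that an interrupted run picks up where it left off. The sketch below is only an illustration under stated assumptions: the function `process_in_chunks`, the `chunk_N.pkl` file names and the chunk size are hypothetical, and pickle files are used instead of CSV so that the dictionaries in the API column survive a round trip unchanged.

```python
import os
import pandas as pd

def process_in_chunks(df, lookup, out_dir, chunk_size=800):
    """Apply `lookup` to each row in chunks, saving each chunk and reusing saved ones."""
    parts = []
    for n, start in enumerate(range(0, len(df), chunk_size), start=1):
        path = os.path.join(out_dir, f"chunk_{n}.pkl")
        if os.path.exists(path):
            # This chunk was finished in an earlier run: load it, make no API calls
            part = pd.read_pickle(path)
        else:
            chunk = df.iloc[start:start + chunk_size]
            part = chunk.assign(crossref_API_out=chunk.apply(lookup, axis=1))
            part.to_pickle(path)
        parts.append(part)
    return pd.concat(parts)
```

With the notebook's own names this might be called as `process_in_chunks(df1, CrossRefDoiTop1, './checkpoints')`, where the output directory is assumed to exist already.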


In [19]:
# get information on the top hit in CrossRef for each title in the section of the dataframe (and save this as a csv file)

df1x1out = df1x1.assign(crossref_API_out =df1x1.apply(CrossRefDoiTop1, axis=1))
df1x1out.to_csv('bodDois1x1out.csv')

In [20]:
df1x1out.head()

Unnamed: 0,ID,DOI,complete,csv_post_title,crossref_API_out
0,1125,10.1186/s12881-014-0095-4,"&amp; , Fenwick AL, Goos JAC, Rankin J, Lord ...",Apparently synonymous substitutions in FGFR2 a...,{'title': ['Apparently synonymous substitution...
1,1996,10.1183/13993003.00321-2016,", Pattinson KT, Turner MR. A wider pathologic...",A wider pathological network underlying breath...,{'title': ['A wider pathological network under...
2,506,,"Adib-Samii P, Rost N, Traylor M, Devan W, ...",17q25 Locus is associated with white matter hy...,{'title': ['17q25 Locus Is Associated With Whi...
3,1430,10.1093/annonc/mdu449,"and I. Tomlinson*, Findlay JM, Middleton MR, ...",A systematic review and meta-analysis of somat...,{'title': ['A systematic review and meta-analy...
4,848,,"Dichgans M, Malik R, KÃ•_nig IR, Rosand J,...",Shared genetic susceptibility to ischemic stro...,{'title': ['Shared Genetic Susceptibility to I...


In [21]:
df1x2out = df1x2.assign(crossref_API_out =df1x2.apply(CrossRefDoiTop1, axis=1))
df1x2out.to_csv('bodDois1x2out.csv')

In [22]:
df1x2out.head()

Unnamed: 0,ID,DOI,complete,csv_post_title,crossref_API_out
801,969,10.2337/dc13-2539,"Guest JF, Panca M, Sladkevicius E, Taheri S, S...",Clinical outcomes and cost-effectiveness of co...,{'title': ['Obstructive sleep apnea in childre...
802,1767,10.1182/blood-2015-05-647578,"GuiÃ¨ze R, Robbe P, Clifford R, de Guibert S, ...",Presence of multiple recurrent mutations confe...,{'title': ['Presence of multiple recurrent mut...
803,1935,10.1038/nature14347,"Gundem G, Van Loo P, Kremeyer B, Alexandrov LB...",The evolutionary history of lethal metastatic ...,{'title': ['Treatment of metastatic prostate c...
804,2060,10.1093/jac/dkw177,"Guo Q, Tomich AD, McElheny CL, Cooper VS, Stoe...",Glutathione-S-transferase FosA6 of Klebsiella ...,{'title': ['Glutathione-S-transferase FosA6 of...
805,1690,10.1158/1055-9965.EPI-15-0363,"Guo X, Long J, Zeng C, Michailidou K, Ghoussai...",Fine-Scale Mapping of the 4q24 Locus Identifie...,{'title': ['Fine-Scale Mapping of the 4q24 Loc...
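
In the rows above, the top hit for rows 801 and 803 has a title that clearly differs from the title that was searched for, so a returned DOI cannot be accepted without a check. One simple way to flag such rows is a string-similarity score between the two titles. This sketch uses the standard library's `difflib`; the function names and the 0.9 threshold are assumptions chosen for illustration, not values taken from the notebook.

```python
from difflib import SequenceMatcher

def title_similarity(searched, returned):
    """Similarity ratio in [0, 1], ignoring case and repeated whitespace."""
    a = " ".join(searched.lower().split())
    b = " ".join(returned.lower().split())
    return SequenceMatcher(None, a, b).ratio()

def looks_like_same_paper(searched, returned, threshold=0.9):
    """True when the two titles are similar enough to trust the match."""
    return title_similarity(searched, returned) >= threshold

# Titles that differ only in capitalisation count as the same paper
print(looks_like_same_paper("17q25 Locus is associated with white matter",
                            "17q25 Locus Is Associated With White Matter"))
```

Rows that fall below the threshold would still need a manual look, because a low score can also come from subtitles or encoding differences rather than a wrong match.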


In [23]:
df1x3out = df1x3.assign(crossref_API_out =df1x3.apply(CrossRefDoiTop1, axis=1))
df1x3out.to_csv('bodDois1x3out.csv')

In [24]:
# Bring these smaller dataframes together 

df1x_out = pd.concat([df1x1out, df1x2out, df1x3out])

In [27]:
#Save the work so we don't need to get the data again

df1x_out.to_csv('./A1out_CrossRef_BRC_Bodlist_BODandCrossRefDois_Oct19.csv')
df1x_out.to_json('./A1out_CrossRef_BRC_Bodlist_BODandCrossRefDois_Oct19.json', orient='records', lines=True)
df1x_out.crossref_API_out.to_json('./A1out_CrossRef_API_data_alone_Oct19.json', orient='index')
df1x_out.head()


Unnamed: 0,ID,DOI,complete,csv_post_title,crossref_API_out
0,1125,10.1186/s12881-014-0095-4,"&amp; , Fenwick AL, Goos JAC, Rankin J, Lord ...",Apparently synonymous substitutions in FGFR2 a...,{'title': ['Apparently synonymous substitution...
1,1996,10.1183/13993003.00321-2016,", Pattinson KT, Turner MR. A wider pathologic...",A wider pathological network underlying breath...,{'title': ['A wider pathological network under...
2,506,,"Adib-Samii P, Rost N, Traylor M, Devan W, ...",17q25 Locus is associated with white matter hy...,{'title': ['17q25 Locus Is Associated With Whi...
3,1430,10.1093/annonc/mdu449,"and I. Tomlinson*, Findlay JM, Middleton MR, ...",A systematic review and meta-analysis of somat...,{'title': ['A systematic review and meta-analy...
4,848,,"Dichgans M, Malik R, KÃ•_nig IR, Rosand J,...",Shared genetic susceptibility to ischemic stro...,{'title': ['Shared Genetic Susceptibility to I...


In [26]:
#and check we don't have any gaps in our data from CrossRef
df1x_out[df1x_out.crossref_API_out.isna()]

Unnamed: 0,ID,DOI,complete,csv_post_title,crossref_API_out
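
An empty result here only rules out true missing values. A failed lookup may instead be recorded as a placeholder such as the string `'None'`, which `isna()` does not detect. The sketch below shows a stricter check that treats anything other than a dictionary carrying a DOI as a failure. The function name `failed_lookups` and the small demonstration frame are hypothetical.

```python
import pandas as pd

def failed_lookups(df, column="crossref_API_out"):
    """Return the rows whose lookup result is not a dict containing a DOI."""
    ok = df[column].apply(lambda v: isinstance(v, dict) and bool(v.get("DOI")))
    return df[~ok]

# Hypothetical demonstration: one good hit, one placeholder string, one missing value
demo = pd.DataFrame({
    "ID": [1, 2, 3],
    "crossref_API_out": [{"title": ["Placeholder"], "DOI": "10.0000/placeholder"},
                         "None",
                         None],
})

print(failed_lookups(demo)["ID"].tolist())  # [2, 3]
```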
