# Extra analysis: not needed for the current paper

### D8 - Dividing up the list of Digital Object Identifiers (DOIs) into groups for use as a starting point for the GUI program VOSviewer. This is another method for producing author-association networks.



Networks were also explored in the program VOSviewer (version 1.6.11, https://www.vosviewer.com/) for comparison. For VOSviewer, each list of DOIs was imported via the Crossref DOI resource. Then networks were created with fractional counting of co-authorship, with no exclusion of papers with large numbers of authors. Additionally, a thesaurus file was constructed to aid with aggregation of records where authors have multiple names or initials that are recorded inconsistently. No restriction was made on the minimum number of publications for inclusion. 
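
As a rough illustration of the thesaurus step, the sketch below writes a tab-separated text file with the `label` and `replace by` columns that VOSviewer thesaurus files use. The author name variants and the output file name are hypothetical placeholders, not entries from the thesaurus built for this analysis.

```python
import csv

# Hypothetical name variants mapped to one preferred form (illustration only).
name_variants = {
    "smith, j": "smith, j.a.",
    "smith, john a": "smith, j.a.",
    "garcia, m": "garcia, m.l.",
}

def write_thesaurus(variants, path):
    """Write a tab-separated thesaurus with 'label' and 'replace by' columns."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["label", "replace by"])
        for label, replacement in sorted(variants.items()):
            writer.writerow([label, replacement])

write_thesaurus(name_variants, "thesaurus_authors_example.txt")
```

Keeping the variants in a dictionary makes it easy to add further spellings as they turn up in the Crossref records.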


## Dividing DOIs for each publication by:

- research group
- type of group (established Themes and new Working Groups)
- stage of the OxBRC2 project

In [3]:
# Python library for data handling and import/export
import pandas as pd


In [4]:
df_in = pd.read_csv('./C2out_for_app.csv', index_col=['publication_date'], parse_dates=True)
df_in.sort_index(inplace=True)
df_in.head()

publication_date,Unnamed: 0,ID,DOI,times_cited_CrossRef,times_cited_Dimensions,relative_citation_ratio,field_citation_ratio,number_of_authors,research_group,group_type
2011-04-12 10:57:25+00:00,862,741,10.1111/j.1460-9592.2011.03591.x,14.0,15.0,0.83,3.47,1.0,cardiovascular,Theme
2011-07-25 13:24:12+00:00,446,31,10.1016/j.neurobiolaging.2011.05.018,105.0,144.0,5.34,28.61,6.0,dementia and cerebrovascular disease,Theme
2011-08-22 17:08:59+00:00,1300,101,10.1037/a0024992,49.0,64.0,2.09,11.4,5.0,cognitive health,Working Group
2011-09-17 03:56:19+00:00,719,184,10.1016/j.neuroimage.2011.09.010,48.0,49.0,1.97,,1.0,cardiovascular,Theme
2011-10-05 06:03:22+00:00,1449,238,10.1002/hbm.21402,86.0,90.0,3.65,21.32,6.0,functional neurosciences and imaging,Theme


In [5]:
df_in.DOI.value_counts()

10.1038/leu.2015.129                    3
10.1016/j.neurobiolaging.2012.07.011    3
10.1016/j.cortex.2012.04.011            3
10.1523/jneurosci.4437-12.2013          3
10.1038/ng.3304                         3
                                       ..
10.3233/jpd-140523                      1
10.1016/j.ijcard.2014.09.025            1
10.2196/mhealth.3568                    1
10.1167/iovs.12-10037                   1
10.2217/fon.14.222                      1
Name: DOI, Length: 2364, dtype: int64
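
The counts above show that some DOIs appear up to three times, presumably because one paper can be attributed to more than one research group. A minimal sketch for listing such DOIs together with their groups follows; the frame `toy` and its DOIs are invented and stand in for `df_in`.

```python
import pandas as pd

# Toy stand-in for df_in; DOIs and group assignments are invented.
toy = pd.DataFrame({
    "DOI": ["10.1/a", "10.1/a", "10.1/b", "10.1/c", "10.1/c", "10.1/c"],
    "research_group": ["blood", "cancer", "blood", "diabetes", "cancer", "blood"],
})

# keep=False marks every occurrence of a repeated DOI, not just the later ones.
repeated = toy[toy.duplicated(subset="DOI", keep=False)]

# For each repeated DOI, collect the sorted list of groups it is attributed to.
groups_per_doi = repeated.groupby("DOI")["research_group"].apply(sorted)
print(groups_per_doi)
```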

In [6]:
# Research group label for every row; used below to loop over the distinct groups
group_outlist = pd.Series(df_in.research_group.values.tolist())
group_outlist.unique()

array(['cardiovascular', 'dementia and cerebrovascular disease',
       'cognitive health', 'functional neurosciences and imaging',
       'infection', 'genomic medicine', 'cancer',
       'translational physiology', 'immunity and inflammation',
       'diabetes', 'vaccines', 'blood',
       'biomedical informatics and technology',
       'pathology and bioresources', 'surgical innovation and evaluation',
       'other brc funded work', 'ethics',
       'patient and public involvement', 'molecular diagnostics',
       'prevention and population care',
       'research education and training', 'transplantation',
       'health economics'], dtype=object)

In [7]:
group_type_list = pd.Series(df_in.group_type.values.tolist())
group_type_list.unique()

array(['Theme', 'Working Group', 'Other'], dtype=object)

In [8]:
for item in group_outlist.unique():
    selected = df_in[df_in.research_group==item]['DOI']
    selected.to_csv('./author_networks/'+str(item)+'_all_DOIs.csv', index=False, header=False)

In [9]:
# dropna() removes rows with a missing value in any column, not only rows with missing dates
df_timed = df_in.dropna()
for item in group_type_list.unique():
    selected = df_timed[df_timed.group_type==item]['DOI']
    selected.to_csv('./author_networks/'+str(item)+'_all_DOIs.csv', index=False, header=False)

## Next, we divide our lists into the three thirds of the OxBRC2 funding period.

This sorting and division will never be completely accurate, because publication dates and DOI creation dates differ, and because publishers record these dates inconsistently or omit them.

Occasionally we identify gaps in the author data for certain DOIs, and we have clearly identified uncertainty in group memberships.

In [10]:
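# Keep only rows with a valid publication date; truncate() needs a sorted index and includes both end points
df_dated = df_in[df_in.index.notna()].sort_index()
df_start = df_dated.truncate(after='2013-11-22 23:59:59+00:00')
df_mid = df_dated.truncate(before='2013-11-22 23:59:59+00:00', after='2015-07-09 23:59:59+00:00')
# Note: rows dated between 2015-07-09 23:59:59 and 2015-07-15 23:59:59 fall into neither df_mid nor df_end
df_end = df_dated.truncate(before='2015-07-15 23:59:59+00:00')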
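
Because `truncate` keeps both end points, a row dated exactly on a shared boundary would land in two slices, and rows dated between two different boundaries land in none. One alternative, sketched here on a toy frame with invented dates rather than the real data, is to use each cut-off once with boolean masks so that every dated row falls into exactly one period. The cut-offs `cut_1` and `cut_2` are placeholders.

```python
import pandas as pd

# Toy stand-in for df_in with a timezone-aware DatetimeIndex (dates are invented).
dates = pd.to_datetime(
    ["2012-01-10", "2013-11-22", "2014-06-01", "2015-07-12", "2016-02-03"], utc=True
)
toy = pd.DataFrame({"DOI": ["10.1/a", "10.1/b", "10.1/c", "10.1/d", "10.1/e"]}, index=dates)

# Placeholder cut-offs; each boundary is used once, so there is no gap or overlap.
cut_1 = pd.Timestamp("2013-11-22 23:59:59", tz="UTC")
cut_2 = pd.Timestamp("2015-07-09 23:59:59", tz="UTC")

dated = toy[toy.index.notna()]
toy_start = dated[dated.index <= cut_1]
toy_mid = dated[(dated.index > cut_1) & (dated.index <= cut_2)]
toy_end = dated[dated.index > cut_2]

# Every dated row belongs to exactly one period.
assert len(toy_start) + len(toy_mid) + len(toy_end) == len(dated)
```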

In [11]:
df_start.count()

Unnamed: 0                 832
ID                         832
DOI                        832
times_cited_CrossRef       832
times_cited_Dimensions     832
relative_citation_ratio    811
field_citation_ratio       794
number_of_authors          830
research_group             832
group_type                 832
dtype: int64

In [12]:
df_mid.count()

Unnamed: 0                 829
ID                         829
DOI                        829
times_cited_CrossRef       829
times_cited_Dimensions     829
relative_citation_ratio    809
field_citation_ratio       797
number_of_authors          829
research_group             829
group_type                 829
dtype: int64

In [13]:
df_end.count()

Unnamed: 0                 829
ID                         829
DOI                        829
times_cited_CrossRef       829
times_cited_Dimensions     829
relative_citation_ratio    808
field_citation_ratio       791
number_of_authors          829
research_group             829
group_type                 829
dtype: int64

In [14]:
df_in.groupby('research_group')['number_of_authors'].describe(percentiles=[0.1,0.5,0.9]).round(1)

research_group,count,mean,std,min,10%,50%,90%,max
biomedical informatics and technology,62.0,18.0,90.8,2.0,4.0,6.0,10.0,721.0
blood,181.0,13.8,11.4,1.0,3.0,10.0,25.0,74.0
cancer,136.0,11.9,11.6,1.0,3.0,10.0,20.5,110.0
cardiovascular,270.0,17.1,40.8,1.0,3.0,8.0,19.1,445.0
cognitive health,132.0,6.2,3.7,2.0,3.0,5.0,10.9,31.0
dementia and cerebrovascular disease,150.0,16.0,19.9,1.0,4.0,8.5,47.0,127.0
diabetes,131.0,42.0,80.8,1.0,3.0,12.0,89.0,485.0
ethics,10.0,7.7,10.0,1.0,1.0,2.0,22.6,28.0
functional neurosciences and imaging,180.0,7.4,4.0,1.0,3.0,7.0,13.0,29.0
genomic medicine,241.0,41.9,74.7,1.0,5.0,16.0,108.0,496.0


In [15]:
Author_stats_table = pd.DataFrame([df_start.number_of_authors.describe(percentiles=[0.1,0.5,0.9]),
                                   df_mid.number_of_authors.describe(percentiles=[0.1,0.5,0.9]),
                                   df_end.number_of_authors.describe(percentiles=[0.1,0.5,0.9]),
                                   df_in.number_of_authors.describe(percentiles=[0.1,0.5,0.9])],
                                  index=['Start', 'Mid', 'End','All']).round(2)

In [16]:
Author_stats_table

,count,mean,std,min,10%,50%,90%,max
Start,830.0,14.26,26.55,1.0,3.0,8.0,23.0,322.0
Mid,829.0,18.93,49.27,1.0,3.0,9.0,29.2,679.0
End,829.0,22.82,100.88,1.0,3.0,9.0,28.0,2467.0
All,2497.0,18.63,66.55,1.0,3.0,9.0,27.0,2467.0


---
## We can divide the dataframes for the 'start', 'mid' and 'end' of BRC2 into one DOI list per research group or group type.

--- 

In [17]:
group_outlist

0                             cardiovascular
1       dementia and cerebrovascular disease
2                           cognitive health
3                             cardiovascular
4       functional neurosciences and imaging
                        ...                 
2498    functional neurosciences and imaging
2499                          cardiovascular
2500               immunity and inflammation
2501               immunity and inflammation
2502          prevention and population care
Length: 2503, dtype: object

In [18]:
#start of BRC2

for item in group_outlist.unique():
    selected = df_start[df_start.research_group==item]['DOI']
    selected.to_csv('./author_networks/'+str(item)+'_start_DOIs.csv', index=False, header=False)

In [19]:
# middle of BRC2

for item in group_outlist.unique():
    selected = df_mid[df_mid.research_group==item]['DOI']
    selected.to_csv('./author_networks/'+str(item)+'_mid_DOIs.csv', index=False, header=False)

In [20]:
# end of BRC2

for item in group_outlist.unique():
    selected = df_end[df_end.research_group==item]['DOI']
    selected.to_csv('./author_networks/'+str(item)+'_end_DOIs.csv', index=False, header=False)

## Again, we can divide the DOIs into those from Themes, Working Groups and Others, for the start, middle and end of the grant

In [21]:
group_type_list.unique()

array(['Theme', 'Working Group', 'Other'], dtype=object)

In [22]:
#start of BRC2

for item in group_type_list.unique():
    selected = df_start[df_start.group_type==item]['DOI']
    selected.to_csv('./author_networks/'+str(item)+'_start_DOIs.csv', index=False, header=False)

# middle of BRC2

for item in group_type_list.unique():
    selected = df_mid[df_mid.group_type==item]['DOI']
    selected.to_csv('./author_networks/'+str(item)+'_mid_DOIs.csv', index=False, header=False)

# end of BRC2

for item in group_type_list.unique():
    selected = df_end[df_end.group_type==item]['DOI']
    selected.to_csv('./author_networks/'+str(item)+'_end_DOIs.csv', index=False, header=False)
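
The export loops above all follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that, not the code used to produce the files; the function name `write_doi_lists`, the toy frames and the output folder `author_networks_example` are all hypothetical.

```python
from pathlib import Path

import pandas as pd

def write_doi_lists(frames, column, out_dir):
    """Write one headerless DOI list per (period, value of `column`)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders
    written = []
    for period, frame in frames.items():
        for value, subset in frame.groupby(column):
            path = out_dir / f"{value}_{period}_DOIs.csv"
            subset["DOI"].to_csv(path, index=False, header=False)
            written.append(path)
    return written

# Toy frames standing in for df_start and df_end (values are invented).
toy_start = pd.DataFrame({"DOI": ["10.1/a", "10.1/b"], "group_type": ["Theme", "Other"]})
toy_end = pd.DataFrame({"DOI": ["10.1/c"], "group_type": ["Theme"]})

paths = write_doi_lists(
    {"start": toy_start, "end": toy_end}, "group_type", "author_networks_example"
)
```

Unlike the loops above, this version only writes a file when a group has at least one DOI in the period, so no empty files are produced.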

## Finally, ensure we have a list of all of the DOIs for the VOSviewer network of the entire OxBRC2


In [23]:
all_DOIs = df_in.drop_duplicates(subset=['DOI'],keep='first')['DOI']
all_DOIs.to_csv('./author_networks/all_DOIs.csv', index=False, header=False)

In [24]:
len(all_DOIs)

2365
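
This count (2365) is one higher than the number of distinct DOIs reported by `value_counts()` earlier (2364). One possible explanation, which has not been checked against the data, is a row with a missing DOI: `value_counts()` ignores missing values by default, whereas `drop_duplicates()` keeps one of them. The toy example below reproduces that behaviour with invented DOIs.

```python
import pandas as pd

# Toy frame with one repeated DOI and one missing DOI (values are invented).
toy = pd.DataFrame({"DOI": ["10.1/a", "10.1/a", "10.1/b", None]})

n_value_counts = len(toy["DOI"].value_counts())               # missing values are skipped
n_drop_duplicates = len(toy.drop_duplicates(subset=["DOI"]))  # one missing value is kept
n_clean = toy["DOI"].dropna().nunique()                       # distinct, non-missing DOIs

print(n_value_counts, n_drop_duplicates, n_clean)
```

If that is the cause, calling `dropna()` on the DOI column before export would keep a blank line out of `all_DOIs.csv`.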