
# A Consortia of Confusion?


>Many papers these days, especially in medical research and clinical trials, are truly massive collaborative efforts.
This can mean a long list of people who should be acknowledged, although the level of contribution can vary, as can the ability of publishers to handle this kind of collaboration.
This is understandable, but it leaves another issue that needs acknowledging when building association networks.

>The involvement of research groupings in authorship lists creates a few issues with accuracy:
    - In many cases the membership of the group or consortium is unpublished
    - The level of direct contribution of groups to publications may vary greatly
    - Some members of the research group may also be listed separately as an author
    
However, author lists also have issues with deduplication and incorrect merging of authors that may cause more error than groups.

### <div class="alert-success"> It may be interesting to explore the formation of such groups as an indicator of maturity in a research area, and how this might modify the reproducibility of findings or effects on treatment/policy in such areas. </div>

## A look at the list of individual author nodes to check terminology for groups 

In [1]:
import pandas as pd

# to acquire and display formatted references from Crossref
from habanero import cn
from IPython.display import display, Markdown

### Now we bring in our additional information about particular authors (Nodes)

In [2]:
author_nodes = pd.read_csv('./D3out_Author_nodes_processed.csv',index_col=[0])
author_nodes.shape #.columns

(20225, 10)

In [3]:
author_nodes.head()

Unnamed: 0,author,DOI,DOI_count,CR_citations,primary_affiliation,research_group,group_type,primary_group,primary_type,Ox_author
0,A. Abdel-Gadir,['10.1016/j.jcmg.2015.11.008'],1,80,{nan},['cardiovascular'],['Theme'],cardiovascular,Theme,0
1,A. Abe,['10.1080/15548627.2015.1100356'],1,3007,{nan},['immunity and inflammation'],['Theme'],immunity and inflammation,Theme,0
2,A. Abizaid,['10.1016/j.ijcard.2013.03.064'],1,1,{nan},['cardiovascular'],['Theme'],cardiovascular,Theme,0
3,A. Abubakar,['10.1371/journal.pone.0113360'],1,34,{nan},['translational physiology'],['Theme'],translational physiology,Theme,0
4,A. Abulí,"['10.1136/gutjnl-2011-300537', '10.1186/1471-2...",2,106,{nan},"['genomic medicine', 'genomic medicine']","['Theme', 'Theme']",genomic medicine,Theme,0


In [4]:
# Number of authors with the term 'Committee'

author_nodes[author_nodes.author.str.contains('Committee', regex=True, case=False)]['author'].count()

6

In [5]:
# Number of authors with the term 'Consortium'

author_nodes[author_nodes.author.str.contains('Consortium', regex=True, case=False)]['author'].count()

120

In [6]:
# Number of authors with the term 'Group'

author_nodes[author_nodes.author.str.contains('Group', regex=True, case=False)]['author'].count()

37

In [7]:
# Number of authors with the term 'Team'

author_nodes[author_nodes.author.str.contains('Team', regex=True, case=False)]['author'].count()

2
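The four separate counts above can overlap, since a name such as "Consortium Writing Group" would be counted under two terms. The sketch below is one possible way to tally each term and also count every group-like name only once. The `toy_nodes` frame and its names are hypothetical stand-ins for `author_nodes`, and plain substring matching may still catch unrelated names that merely contain a term.

```python
import pandas as pd

# Hypothetical stand-in for author_nodes; only the 'author' column is needed here.
toy_nodes = pd.DataFrame({
    "author": [
        "A. Abe",
        "C4D Consortium",
        "the SLI Consortium",
        "IST Collaborative Group",
        "Example Steering Committee",
        "Example Consortium Writing Group",
    ]
})

group_terms = ["Committee", "Consortium", "Group", "Team"]

# One case-insensitive count per term, mirroring the individual cells.
term_counts = {
    term: int(toy_nodes["author"].str.contains(term, case=False, regex=False).sum())
    for term in group_terms
}

# A single alternation pattern flags each name once, even if several terms match.
any_term = toy_nodes["author"].str.contains("|".join(group_terms), case=False, regex=True)
n_group_like = int(any_term.sum())

print(term_counts)
print(n_group_like)
```

Comparing the sum of the per-term counts with `n_group_like` gives a quick indication of how many names match more than one term.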

In [8]:
# A look at some of the Consortia

author_nodes[author_nodes.author.str.contains('Consortium', regex=True, case=False)]

Unnamed: 0,author,DOI,DOI_count,CR_citations,primary_affiliation,research_group,group_type,primary_group,primary_type,Ox_author
2008,Asian Genetic Epidemiology Network Type 2 Diab...,['10.1038/ng.2897'],1,669,{nan},['diabetes'],['Theme'],diabetes,Theme,0
2009,Australian Asthma Genetics Consortium (AAGC),['10.1038/ng.2694'],1,155,{nan},['immunity and inflammation'],['Theme'],immunity and inflammation,Theme,0
3767,C4D Consortium,['10.1182/blood-2012-06-436188'],1,54,{nan},['cardiovascular'],['Theme'],cardiovascular,Theme,0
3768,CARDIOGENICS Consortium,"['10.1038/ng.2480', '10.1182/blood-2012-06-436...",2,1084,{nan},"['diabetes', 'cardiovascular']","['Theme', 'Theme']",diabetes,Theme,0
3769,CARDIoGRAM Consortium,"['10.1182/blood-2012-06-436188', '10.1038/ng.2...",2,466,{nan},"['cardiovascular', 'cardiovascular']","['Theme', 'Theme']",cardiovascular,Theme,0
...,...,...,...,...,...,...,...,...,...,...
20200,the ReproGen Consortium,['10.1093/hmg/ddu150'],1,57,{nan},['genomic medicine'],['Theme'],genomic medicine,Theme,0
20201,the SLI Consortium,['10.1038/ejhg.2014.296'],1,31,{nan},['immunity and inflammation'],['Theme'],immunity and inflammation,Theme,0
20204,the UK Brain Expression consortium,['10.1186/gb-2013-14-7-r75'],1,157,{nan},['diabetes'],['Theme'],diabetes,Theme,0
20205,the Wellcome Trust Case Control Consortium 2,['10.1161/strokeaha.115.009387'],1,10,{nan},['dementia and cerebrovascular disease'],['Theme'],dementia and cerebrovascular disease,Theme,0


---
### Following the process of counting authors in each of the CrossRef entries for OxBRC2, we can look at those DOIs that have no associated authors, returning to our earlier list of DOIs, as these won't feature in our author list.
---

In [9]:
df= pd.read_csv('./C1in.csv',index_col=[0])
df.shape

(2365, 17)

In [10]:
df.head(2)

Unnamed: 0,finaldoi_lower,ID,complete,FinalDOI,api_add,doi,Dim_times_cited,recent_citations,relative_citation_ratio,field_citation_ratio,license,pub_date_CR_API,CR_times_cited,authors_CR,year,month,auth_number
0,10.1186/s12881-014-0095-4,1125,"&amp; , fenwick al, goos jac, rankin j, lord h...",10.1186/s12881-014-0095-4,"{'doi': '10.1186/s12881-014-0095-4', 'times_ci...",10.1186/s12881-014-0095-4,7.0,4.0,0.24,0.78,This data has been sourced via the Dimensions ...,2014-08-30 14:03:56+00:00,5.0,"[{'given': 'Aimee L', 'family': 'Fenwick', 'se...",2014.0,8.0,10.0
1,10.1183/13993003.00321-2016,1996,", pattinson kt, turner mr. a wider pathologica...",10.1183/13993003.00321-2016,"{'doi': '10.1183/13993003.00321-2016', 'times_...",10.1183/13993003.00321-2016,4.0,3.0,0.57,0.99,This data has been sourced via the Dimensions ...,2016-06-01 01:53:39+00:00,4.0,"[{'given': 'Kyle T.S.', 'family': 'Pattinson',...",2016.0,6.0,2.0


In [11]:
df2 = df[df.auth_number.isna()]
df2.shape

(6, 17)

In [12]:
ids=pd.Series(df2.FinalDOI.values)
ids

0    10.3978/j.issn.2225-319X.2014.05.14
1          10.1016/s0140-6736(12)60768-5
2    10.3978/j.issn.2305-5839.2015.09.12
3                    10.1056/nejmx120009
4                   10.1002/cyto.b.21165
5              10.5083/ejcm.20424884.147
dtype: object

In [13]:

display(Markdown('## Publications without authors? Empty author fields '))
display(Markdown('---'))

for doi in ids:
    try:
        # Ask Crossref for an APA-formatted reference for this DOI
        print(cn.content_negotiation(ids=doi, format="text", style='apa'))
    except Exception:
        # The lookup failed or returned nothing usable for this DOI
        print('No data returned')
    display(Markdown('---'))


## Publications without authors? Empty author fields 

---

No data returned


---

The benefits and harms of intravenous thrombolysis with recombinant tissue plasminogen activator within 6 h of acute ischaemic stroke (the third international stroke trial [IST-3]): a randomised controlled trial. (2012). The Lancet, 379(9834), 2352–2363. doi:10.1016/s0140-6736(12)60768-5



---

No data returned


---

The Perpetual Challenge of Infectious Diseases. (2012). New England Journal of Medicine, 366(9), 868–868. doi:10.1056/nejmx120009



---

No data returned


---

No data returned


---
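Since four of the six lookups returned nothing, it can help to record which DOIs failed rather than only printing a message. The following is a minimal sketch under stated assumptions: `collect_references` and `stub_fetch` are hypothetical helpers, and the stub replaces the real Crossref call so that the example runs offline.

```python
from typing import Callable, Dict, Iterable, Optional


def collect_references(
    dois: Iterable[str], fetch: Callable[[str], str]
) -> Dict[str, Optional[str]]:
    """Return {doi: formatted reference or None}; one failure does not stop the loop."""
    results: Dict[str, Optional[str]] = {}
    for doi in dois:
        try:
            results[doi] = fetch(doi)
        except Exception as err:  # unlike a bare except, this leaves Ctrl-C working
            print(f"No data returned for {doi} ({type(err).__name__})")
            results[doi] = None
    return results


def stub_fetch(doi: str) -> str:
    """Hypothetical offline stand-in for a Crossref lookup."""
    if doi == "10.1056/nejmx120009":
        return "stub reference for 10.1056/nejmx120009"
    raise ValueError("no record")


refs = collect_references(["10.1056/nejmx120009", "10.1002/cyto.b.21165"], stub_fetch)
missing = [doi for doi, ref in refs.items() if ref is None]
print(missing)
```

In the notebook itself, `fetch` could be `lambda doi: cn.content_negotiation(ids=doi, format="text", style="apa")`, the same call used in the cell above.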

Further examination of this small number of publications showed that author lists were simply missing, either just this field 

The author counts can be updated below if desired. The count of 1237 authors for one publication is a rare case where the full membership of a group/consortium was listed in the supplementary data.

In [14]:
# Note: df3 is a second name for df, not a copy, so the updates below also change df
df3=df

df3.loc[1826,'auth_number']=8
df3.loc[936,'auth_number']=4
df3.loc[976,'auth_number']=1237
df3.loc[1828,'auth_number']=6
df3.loc[1580,'auth_number']=3
df3.loc[1289,'auth_number']=2


df3[df3.auth_number.isna()]

Unnamed: 0,finaldoi_lower,ID,complete,FinalDOI,api_add,doi,Dim_times_cited,recent_citations,relative_citation_ratio,field_citation_ratio,license,pub_date_CR_API,CR_times_cited,authors_CR,year,month,auth_number
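As an alternative to one `.loc` assignment per row, the manual counts could be kept in a single mapping and applied only where `auth_number` is missing. This is a sketch on a hypothetical three-row frame (`toy`), not the notebook's data; it relies on `Series.fillna` aligning a Series of replacement values by index.

```python
import pandas as pd

# Hypothetical miniature of df: index labels play the role of the row labels used above.
toy = pd.DataFrame({"auth_number": [10.0, None, None]}, index=[0, 936, 976])

# Manually checked author counts, keyed by row label.
manual_counts = {936: 4, 976: 1237}

# Work on a copy so the original frame keeps its gaps for comparison.
patched = toy.copy()
patched["auth_number"] = patched["auth_number"].fillna(pd.Series(manual_counts))

# Rows still lacking an author count after patching.
remaining = patched[patched["auth_number"].isna()]
print(len(remaining))
```

Because `fillna` touches only missing entries, a count that was already present (row `0` here) is left unchanged even if the mapping happened to contain it.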


In [15]:
df.loc[976]

finaldoi_lower                                 10.1016/s0140-6736(12)60768-5
ID                                                                        53
complete                   ist collaborative group, sandercock p, wardlaw...
FinalDOI                                       10.1016/s0140-6736(12)60768-5
api_add                    {'doi': '10.1016/s0140-6736(12)60768-5', 'time...
doi                                            10.1016/s0140-6736(12)60768-5
Dim_times_cited                                                          813
recent_citations                                                         163
relative_citation_ratio                                                32.35
field_citation_ratio                                                   134.1
license                    This data has been sourced via the Dimensions ...
pub_date_CR_API                                    2012-05-30 15:47:05+00:00
CR_times_cited                                                           688