## Plotting data to look at the impact of the BRC Oxford 2 (Apr 2012 to Apr 2017) - continued

### In this case we are interested in the possible differences in output between those research areas that have continued funding in Oxford, following on from BRC1 (known as research 'Themes'), when compared to research areas and groups with funding for the first time as part of BRC2 (Working groups, WGs).

### The notebook contains four sections:
>   1. Having a look at distribution of publications reported by one or more research groups
>   2. Data processing to expand the list so each publication has a row for each research group involved (**introducing some duplicates**)
>   3. Analysis and plotting by research group and group type (Themes/Working Groups/Others)
>   4. Further plots, not used in paper.
---
At the end of the notebook are some additional plots that give other options for exploring the data (not used in the paper).



In [1]:
# Set up the tools needed first

# dataframes and calculations
import pandas as pd

#  hvplot, built on bokeh and holoviews, great for simple quick plotting from pandas
import hvplot.pandas

# our plotting options
import holoviews as hv

# and we also want a library for controlling our colour palettes
import colorcet as cc


# for geometric means (avoid excessive influence of outliers on FCR mean)
import scipy.stats as st

hv.extension('bokeh')

---
## Section 1 - bring in details of research groups and the types of research group, and combine these details with the metrics we have for each publication/DOI:
    - Themes, 
    - Working Groups, and 
    - Other BRC-supported work

In [2]:
df_in  = pd.read_csv('C1in.csv', index_col=['pub_date_CR_API'], parse_dates=True)
df_in.head(2)


pub_date_CR_API,Unnamed: 0,finaldoi_lower,ID,complete,FinalDOI,api_add,doi,Dim_times_cited,recent_citations,relative_citation_ratio,field_citation_ratio,license,CR_times_cited,authors_CR,year,month,auth_number
2014-08-30 14:03:56+00:00,0,10.1186/s12881-014-0095-4,1125,"&amp; , fenwick al, goos jac, rankin j, lord h...",10.1186/s12881-014-0095-4,"{'doi': '10.1186/s12881-014-0095-4', 'times_ci...",10.1186/s12881-014-0095-4,7.0,4.0,0.24,0.78,This data has been sourced via the Dimensions ...,5.0,"[{'given': 'Aimee L', 'family': 'Fenwick', 'se...",2014.0,8.0,10.0
2016-06-01 01:53:39+00:00,1,10.1183/13993003.00321-2016,1996,", pattinson kt, turner mr. a wider pathologica...",10.1183/13993003.00321-2016,"{'doi': '10.1183/13993003.00321-2016', 'times_...",10.1183/13993003.00321-2016,4.0,3.0,0.57,0.99,This data has been sourced via the Dimensions ...,4.0,"[{'given': 'Kyle T.S.', 'family': 'Pattinson',...",2016.0,6.0,2.0


In [3]:
#  Checking our index column is in a datetime format
df_in.index.dtype


datetime64[ns, UTC]

In [4]:
#  Checking the shape of our dataframe

df_in.shape

(2365, 17)

###  Now we will bring in a list of DOIs claimed by each research group (Theme / Working Group / Other)
We can clean the incoming DOIs by stripping leading and trailing white space and full stops, and convert them to lower case in a new column, to improve matching.


In [5]:
df_theme = pd.read_csv('./Source_files/C2in_theme_match_2365_BRC2_DOIs.csv', index_col=0)
df_theme['finaldoi_lower'] =df_theme.FinalDOI.str.lower().str.strip(' .')
df_theme.head()                


ID,FinalDOI,Themes,finaldoi_lower
1,10.1002/mrm.24260,Cardiovascular,10.1002/mrm.24260
2,10.1073/pnas.1117412109,Cardiovascular,10.1073/pnas.1117412109
3,10.1016/j.jcmg.2011.10.007,Cardiovascular,10.1016/j.jcmg.2011.10.007
4,10.1007/s00702-011-0732-4,Dementia and Cerebrovascular disease,10.1007/s00702-011-0732-4
5,10.1080/13803395.2012.672966,Dementia and Cerebrovascular disease,10.1080/13803395.2012.672966
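
The cleaning step can be pictured on plain strings. The sketch below is only an illustration: `normalise_doi` and the example DOIs are hypothetical, and it assumes the only problems to fix are letter case and stray spaces or full stops at either end.

```python
def normalise_doi(raw_doi: str) -> str:
    """Lower-case a DOI and strip spaces and full stops from both ends.

    Mirrors the pandas chain .str.lower().str.strip(' .') on a single string.
    """
    # str.strip(' .') removes any run of the listed characters from each end;
    # characters inside the DOI (including its internal dots) are untouched.
    return raw_doi.lower().strip(' .')


# Hypothetical raw entries, made up for illustration
examples = [' 10.1002/MRM.24260', '10.1073/pnas.1117412109. ', '10.1016/J.JCMG.2011.10.007']
cleaned = [normalise_doi(doi) for doi in examples]
print(cleaned)
```

Because `strip` only works on the ends of the string, a DOI with an embedded space or a different prefix (for example a leading `https://doi.org/`) would still fail to match and would need separate handling.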


### This data can then be merged with the existing dataframe of metrics, remembering that more than one research group might claim a single DOI (in which case the `Themes` entry is a comma-separated list of groups).

In [6]:
df= df_in.reset_index().merge(df_theme,how='left',on='finaldoi_lower')
df.set_index(['pub_date_CR_API'], inplace=True)
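
A left merge keeps every row of the metrics table, even when no research group claimed the DOI. As a hedged illustration, using small made-up frames (`metrics` and `claims` are hypothetical stand-ins, not the notebook's data), the `indicator=True` option of `merge` can flag which rows found a match:

```python
import pandas as pd

# Hypothetical stand-ins for the metrics table and the table of group claims
metrics = pd.DataFrame({
    'finaldoi_lower': ['10.1/a', '10.1/b', '10.1/c'],
    'field_citation_ratio': [0.78, 0.99, 2.15],
})
claims = pd.DataFrame({
    'finaldoi_lower': ['10.1/a', '10.1/b'],
    'Themes': ['Genomic Medicine', 'Cardiovascular, Diabetes'],
})

# how='left' keeps every metrics row; indicator=True adds a '_merge' column
# recording whether a matching claim was found ('both') or not ('left_only').
merged = metrics.merge(claims, how='left', on='finaldoi_lower', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'finaldoi_lower'].tolist()
print(len(merged), unmatched)
```

Listing the unmatched DOIs in this way is one possible check that the lower-casing and stripping step caught every formatting difference.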

In [7]:
df_theme.Themes.count()

2365

In [8]:
# split multiple entries in Themes string into items in a list

df2 = df.assign(Themes_split = df.Themes.str.lower().str.split(',').to_list())

df2.head(2)

pub_date_CR_API,Unnamed: 0,finaldoi_lower,ID,complete,FinalDOI_x,api_add,doi,Dim_times_cited,recent_citations,relative_citation_ratio,field_citation_ratio,license,CR_times_cited,authors_CR,year,month,auth_number,FinalDOI_y,Themes,Themes_split
2014-08-30 14:03:56+00:00,0,10.1186/s12881-014-0095-4,1125,"&amp; , fenwick al, goos jac, rankin j, lord h...",10.1186/s12881-014-0095-4,"{'doi': '10.1186/s12881-014-0095-4', 'times_ci...",10.1186/s12881-014-0095-4,7.0,4.0,0.24,0.78,This data has been sourced via the Dimensions ...,5.0,"[{'given': 'Aimee L', 'family': 'Fenwick', 'se...",2014.0,8.0,10.0,10.1186/s12881-014-0095-4,Genomic Medicine,[genomic medicine]
2016-06-01 01:53:39+00:00,1,10.1183/13993003.00321-2016,1996,", pattinson kt, turner mr. a wider pathologica...",10.1183/13993003.00321-2016,"{'doi': '10.1183/13993003.00321-2016', 'times_...",10.1183/13993003.00321-2016,4.0,3.0,0.57,0.99,This data has been sourced via the Dimensions ...,4.0,"[{'given': 'Kyle T.S.', 'family': 'Pattinson',...",2016.0,6.0,2.0,10.1183/13993003.00321-2016,Translational Physiology,[translational physiology]


In [9]:
# now we can check the number of items in each list

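def theme_count(row):
    # number of research groups claiming this publication (one row of the dataframe)
    try:
        return len(row.Themes_split)
    except TypeError:
        # Themes_split is NaN (not a list) when no research group matched this DOI
        return None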

In [10]:
df2['theme_count']= df2.apply(theme_count, axis=1)

# percentage of publications claimed by 1, 2 or 3 research groups
(df2.theme_count.value_counts()/df2.theme_count.value_counts().sum() )*100

1    94.672304
2     4.820296
3     0.507400
Name: theme_count, dtype: float64


### How many publications were registered with more than one research group over the years?

### -> around 5% of all publications were the work of more than 1 research group


In [11]:
themes_per_DOI = pd.DataFrame(df2.groupby('year').theme_count.value_counts())

themes_per_DOI.stack().reset_index().hvplot.bar(x='year', y='0',
                                               by='theme_count',
                                                stacked=True, cmap='cubehelix_r',
                                               legend='top_right').opts(title='Research Groups reporting each publication',
                                                                                   ylabel='Number of Publications')

## Section 2 - expand the list to give a DOI/publication entry for each research group reporting it
<div class="alert alert-warning">
&#9888; This will give repeat data for some DOIs, due to the small number of publications shared by more than one research group.
</div>

In [12]:
# Expand (explode) dataframe so that each research group claiming a DOI has its own row entry for this DOI
df_long =df2.reset_index().explode('Themes_split')

In [13]:
# lowercase conversion and remove any white space, to allow correct matching to incoming group type data (theme_key)

df_long['Theme'] = df_long['Themes_split'].str.strip(' ').str.lower()
df_long.shape

(2503, 23)
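
The effect of the expansion on summary statistics can be seen on a toy example. This is a sketch with made-up values (`pubs`, `long_df` and the `fcr` figures are hypothetical); it shows why overall figures are better computed on one row per DOI.

```python
import pandas as pd

# Hypothetical publications; the second is claimed by two groups
pubs = pd.DataFrame({
    'doi': ['10.1/a', '10.1/b', '10.1/c'],
    'fcr': [1.0, 9.0, 2.0],
    'Themes': ['Cancer', 'Cardiovascular, Diabetes', 'Vaccines'],
})

# Split the comma-separated string into a list, then give each list item its own row
pubs['Theme'] = pubs['Themes'].str.lower().str.split(',')
long_df = pubs.explode('Theme')
long_df['Theme'] = long_df['Theme'].str.strip()

# The shared publication now appears twice, so a naive mean over-weights it
mean_with_repeats = long_df['fcr'].mean()

# Keeping one row per DOI restores the per-publication view
mean_per_publication = long_df.drop_duplicates(subset='doi')['fcr'].mean()

print(len(long_df), mean_with_repeats, mean_per_publication)
```

Per-group statistics are a different case: there the repeated rows are intended, because each group should be credited with every publication it claims.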

In [14]:
df_long.dropna().head(2)

Unnamed: 0.1,pub_date_CR_API,Unnamed: 0,finaldoi_lower,ID,complete,FinalDOI_x,api_add,doi,Dim_times_cited,recent_citations,...,CR_times_cited,authors_CR,year,month,auth_number,FinalDOI_y,Themes,Themes_split,theme_count,Theme
0,2014-08-30 14:03:56+00:00,0,10.1186/s12881-014-0095-4,1125,"&amp; , fenwick al, goos jac, rankin j, lord h...",10.1186/s12881-014-0095-4,"{'doi': '10.1186/s12881-014-0095-4', 'times_ci...",10.1186/s12881-014-0095-4,7.0,4.0,...,5.0,"[{'given': 'Aimee L', 'family': 'Fenwick', 'se...",2014.0,8.0,10.0,10.1186/s12881-014-0095-4,Genomic Medicine,genomic medicine,1,genomic medicine
1,2016-06-01 01:53:39+00:00,1,10.1183/13993003.00321-2016,1996,", pattinson kt, turner mr. a wider pathologica...",10.1183/13993003.00321-2016,"{'doi': '10.1183/13993003.00321-2016', 'times_...",10.1183/13993003.00321-2016,4.0,3.0,...,4.0,"[{'given': 'Kyle T.S.', 'family': 'Pattinson',...",2016.0,6.0,2.0,10.1183/13993003.00321-2016,Translational Physiology,translational physiology,1,translational physiology


In [15]:
# we can now bring in a list of research groups and if they are Themes, Working Groups (WG), or other publications.

df_theme_key = pd.read_csv('./Source_files/C2in_theme_WG_key.csv',index_col=0)
df_theme_key

Theme,Theme_or_WG
immunity and inflammation,Theme
cardiovascular,Theme
genomic medicine,Theme
vaccines,Theme
blood,Theme
functional neurosciences and imaging,Theme
dementia and cerebrovascular disease,Theme
cancer,Theme
diabetes,Theme
cognitive health,Working Group


In [16]:
# we can then merge these two dataframes, so that we have a single row for each DOI claimed by any research group
# (with duplicate rows for multiple research groups involved)
# The duplications need careful treatment as grouped statistics will be affected by these additional rows.

df_long2 =df_long.merge(df_theme_key, on='Theme', how='outer')#.drop_duplicates()
df_long2[df_long2.duplicated(subset=['finaldoi_lower','Themes_split' ])]

Unnamed: 0.1,pub_date_CR_API,Unnamed: 0,finaldoi_lower,ID,complete,FinalDOI_x,api_add,doi,Dim_times_cited,recent_citations,...,authors_CR,year,month,auth_number,FinalDOI_y,Themes,Themes_split,theme_count,Theme,Theme_or_WG


In [17]:
#which columns are available to us
df_long2.columns


Index(['pub_date_CR_API', 'Unnamed: 0', 'finaldoi_lower', 'ID', 'complete',
       'FinalDOI_x', 'api_add', 'doi', 'Dim_times_cited', 'recent_citations',
       'relative_citation_ratio', 'field_citation_ratio', 'license',
       'CR_times_cited', 'authors_CR', 'year', 'month', 'auth_number',
       'FinalDOI_y', 'Themes', 'Themes_split', 'theme_count', 'Theme',
       'Theme_or_WG'],
      dtype='object')

In [18]:
# just take the columns we will need for further analysis
df_long2_out = df_long2.reindex(columns=['pub_date_CR_API','ID', 'doi', 'CR_times_cited',
       'Dim_times_cited','relative_citation_ratio', 'field_citation_ratio', 
       'auth_number', 'Theme', 'Theme_or_WG'])

# and rename these columns
df_long2_out.columns=[ 'publication_date','ID', 'DOI', 'times_cited_CrossRef', 'times_cited_Dimensions',
                      'relative_citation_ratio', 'field_citation_ratio','number_of_authors',
                      'research_group', 'group_type']

In [19]:
df_long2_out.head()


Unnamed: 0,publication_date,ID,DOI,times_cited_CrossRef,times_cited_Dimensions,relative_citation_ratio,field_citation_ratio,number_of_authors,research_group,group_type
0,2014-08-30 14:03:56+00:00,1125,10.1186/s12881-014-0095-4,5.0,7.0,0.24,0.78,10.0,genomic medicine,Theme
1,2012-12-09 20:07:21+00:00,266,10.1038/ng.2492,166.0,178.0,5.58,46.47,16.0,genomic medicine,Theme
2,2012-08-10 18:49:10+00:00,168,10.1002/ajmg.a.35558,24.0,25.0,0.65,2.15,13.0,genomic medicine,Theme
3,2015-01-24 04:46:41+00:00,1920,10.1016/j.fertnstert.2014.12.123,22.0,27.0,1.72,12.25,8.0,genomic medicine,Theme
4,2012-10-25 17:38:36+00:00,175,10.1371/journal.pgen.1003025,42.0,53.0,1.69,4.55,4.0,genomic medicine,Theme


In [20]:
df_long2_out.to_csv('./C2out_for_app.csv')
df_long2_out.shape


(2503, 10)

In [21]:
df_long2_out.DOI.value_counts()

10.1016/j.jacc.2015.07.059              3
10.1523/jneurosci.4437-12.2013          3
10.1016/j.biopsych.2013.04.015          3
10.1016/j.cortex.2012.04.011            3
10.1016/j.neurobiolaging.2012.07.011    3
                                       ..
10.1212/wnl.0000000000000942            1
10.1097/01.qai.0000429258.85522.91      1
10.1016/s1470-2045(15)00091-1           1
10.1136/annrheumdis-2015-207544         1
10.1080/13825585.2014.894958            1
Name: DOI, Length: 2364, dtype: int64

---

## Section 3 - Analyse and plot the metrics we have for each publication/DOI by research groups and the type of research group:
    - Themes, 
    - Working Groups, and 
    - Other BRC-supported work

---

In [22]:
# reimport data exported above (this isn't necessary for this notebook, but makes other work, including the app, easier)

df_with_group = pd.read_csv('./C2out_for_app.csv',
                            index_col=['publication_date'],
                            parse_dates=True)
df_with_group.dropna().head(2)

publication_date,Unnamed: 0,ID,DOI,times_cited_CrossRef,times_cited_Dimensions,relative_citation_ratio,field_citation_ratio,number_of_authors,research_group,group_type
2014-08-30 14:03:56+00:00,0,1125,10.1186/s12881-014-0095-4,5.0,7.0,0.24,0.78,10.0,genomic medicine,Theme
2012-12-09 20:07:21+00:00,1,266,10.1038/ng.2492,166.0,178.0,5.58,46.47,16.0,genomic medicine,Theme


In [23]:
df_with_group.dtypes

Unnamed: 0                   int64
ID                           int64
DOI                         object
times_cited_CrossRef       float64
times_cited_Dimensions     float64
relative_citation_ratio    float64
field_citation_ratio       float64
number_of_authors          float64
research_group              object
group_type                  object
dtype: object

## Overall values for Field Citation Ratio (FCR) and Relative Citation Ratio (RCR) for OxBRC2

### Already calculated in C1 - and &#9888; need to use data without duplicated rows

In [24]:
BRC2_mean_FCR = df['field_citation_ratio'].mean()
BRC2_gmean_FCR = st.gmean(df['field_citation_ratio'].dropna().values+1)-1
BRC2_median_FCR = df['field_citation_ratio'].median()
BRC2_count_FCR = df['field_citation_ratio'].count() 

print('An overall field citation ratio (FCR) of ')
print (BRC2_mean_FCR.round(2),' ,(mean)',
       BRC2_gmean_FCR.round(2),' ,(geometric mean)',
       BRC2_median_FCR.round(2),'(median) from data for',BRC2_count_FCR,' publications')

text1 = 'geometric mean FCR: ' + str(BRC2_gmean_FCR.round(2)) +', (n= ' + str(BRC2_count_FCR) + ')'
text2 = 'median FCR: ' + str(BRC2_median_FCR.round(2)) +', (n= ' + str(BRC2_count_FCR) + ')'

An overall field citation ratio (FCR) of 
15.71  ,(mean) 7.12  ,(geometric mean) 6.75 (median) from data for 2259  publications
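
The geometric mean used here is computed on FCR + 1 and then shifted back by 1, so that publications with an FCR of zero can be included. A minimal standard-library sketch of the same calculation follows; `shifted_geometric_mean` and the sample values are hypothetical and only illustrate how the three summary statistics react to a single outlier.

```python
import math
import statistics


def shifted_geometric_mean(values, shift=1.0):
    """Geometric mean of (value + shift), minus shift, so zeros are allowed."""
    logs = [math.log(v + shift) for v in values]
    return math.exp(sum(logs) / len(logs)) - shift


# Hypothetical FCR values with one highly cited outlier
fcr_values = [0.0, 2.0, 5.0, 8.0, 120.0]

print('mean:   ', round(statistics.mean(fcr_values), 2))
print('median: ', round(statistics.median(fcr_values), 2))
print('geomean:', round(shifted_geometric_mean(fcr_values), 2))
```

With skewed data like this, the arithmetic mean is pulled towards the outlier, while the shifted geometric mean stays much closer to the median, which matches the pattern in the overall figures above.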


## Calculations of median and geomean for citation ratios (FCR and RCR), grouped by type of research group (Themes = established, Working Group = new, Other = one-off project or core)

In [25]:
Theme_mean_FCR = df_long2[df_long2['Theme_or_WG']=='Theme']['field_citation_ratio'].mean()
Theme_gmean_FCR = st.gmean(df_long2[df_long2['Theme_or_WG']=='Theme']['field_citation_ratio'].dropna().values+1)-1
Theme_median_FCR = df_long2[df_long2['Theme_or_WG']=='Theme']['field_citation_ratio'].median()
Theme_count_FCR = df_long2[df_long2['Theme_or_WG']=='Theme']['field_citation_ratio'].count() 

print('An overall field citation ratio (FCR) of ')
print (Theme_mean_FCR.round(2),' ,(mean)',
       Theme_gmean_FCR.round(2),' ,(geometric mean)',
       Theme_median_FCR.round(2),'(median) from data for',Theme_count_FCR,' publications for research Themes')

text1 = 'geometric mean FCR: ' + str(Theme_gmean_FCR.round(2)) +', (n= ' + str(Theme_count_FCR) + ')'
text2 = 'median FCR: ' + str(Theme_median_FCR.round(2)) +', (n= ' + str(Theme_count_FCR) + ')'

An overall field citation ratio (FCR) of 
16.59  ,(mean) 7.31  ,(geometric mean) 6.99 (median) from data for 2140  publications for research Themes


In [26]:
WG_mean_FCR = df_long2[df_long2['Theme_or_WG']=='Working Group']['field_citation_ratio'].mean()
WG_gmean_FCR = st.gmean(df_long2[df_long2['Theme_or_WG']=='Working Group']['field_citation_ratio'].dropna().values+1)-1
WG_median_FCR = df_long2[df_long2['Theme_or_WG']=='Working Group']['field_citation_ratio'].median()
WG_count_FCR = df_long2[df_long2['Theme_or_WG']=='Working Group']['field_citation_ratio'].count() 

print('An overall field citation ratio (FCR) of ')
print (WG_mean_FCR.round(2),' ,(mean)',
       WG_gmean_FCR.round(2),' ,(geometric mean)',
       WG_median_FCR.round(2),'(median) from data for',WG_count_FCR,' publications for Working Groups')

text1 = 'geometric mean FCR: ' + str(WG_gmean_FCR.round(2)) +', (n= ' + str(WG_count_FCR) + ')'
text2 = 'median FCR: ' + str(WG_median_FCR.round(2)) +', (n= ' + str(WG_count_FCR) + ')'

An overall field citation ratio (FCR) of 
11.59  ,(mean) 6.27  ,(geometric mean) 5.66 (median) from data for 221  publications for Working Groups


In [27]:
Other_mean_FCR = df_long2[df_long2['Theme_or_WG']=='Other']['field_citation_ratio'].mean()
Other_gmean_FCR = st.gmean(df_long2[df_long2['Theme_or_WG']=='Other']['field_citation_ratio'].dropna().values+1)-1
Other_median_FCR = df_long2[df_long2['Theme_or_WG']=='Other']['field_citation_ratio'].median()
Other_count_FCR = df_long2[df_long2['Theme_or_WG']=='Other']['field_citation_ratio'].count() 

print('An overall field citation ratio (FCR) of ')
print (Other_mean_FCR.round(2),' ,(mean)',
       Other_gmean_FCR.round(2),' ,(geometric mean)',
       Other_median_FCR.round(2),'(median) from data for',Other_count_FCR,' publications for Other research areas')

text1 = 'geometric mean FCR: ' + str(Other_gmean_FCR.round(2)) +', (n= ' + str(Other_count_FCR) + ')'
text2 = 'median FCR: ' + str(Other_median_FCR.round(2)) +', (n= ' + str(Other_count_FCR) + ')'

An overall field citation ratio (FCR) of 
14.39  ,(mean) 7.52  ,(geometric mean) 7.34 (median) from data for 31  publications for Other research areas


In [28]:
def geomean_from_zero(df=df, column='field_citation_ratio', decimals=2):
    # geometric mean of (value + 1), minus 1, so that zero values can be included
    plus_ones = df[column] + 1
    gmean_out = (st.gmean(plus_ones.dropna())-1).round(decimals)
    return gmean_out
 

In [29]:
grouped = df_with_group.groupby('research_group')

grouped.field_citation_ratio.agg(['median', 'mean', 'count']).sort_values('median', ascending=False).round(2).tail(5)

research_group,median,mean,count
other brc funded work,5.13,15.04,15
molecular diagnostics,5.03,25.37,30
research education and training,4.19,4.19,2
patient and public involvement,4.11,5.22,28
vaccines,4.01,7.77,213


In [30]:
gdata={'FCR_geomean':grouped.apply(geomean_from_zero),
       'RCR_geomean':grouped.apply(geomean_from_zero,column='relative_citation_ratio')}

grouped_geomeans = pd.DataFrame(data=gdata)
grouped_geomeans.sort_values('FCR_geomean', ascending=False)

research_group,FCR_geomean,RCR_geomean
health economics,14.2,2.87
diabetes,11.91,2.58
infection,9.19,2.8
prevention and population care,9.16,2.55
biomedical informatics and technology,9.15,1.82
genomic medicine,8.73,2.15
dementia and cerebrovascular disease,8.26,1.94
functional neurosciences and imaging,8.01,2.05
molecular diagnostics,7.8,1.74
cardiovascular,7.7,1.89


## We can look at citations or FCR by research group (bottom 5 by median FCR shown here)

- indicating that all 23 research groupings have FCR values above 3 

# Scatter plots to show as much information as possible about OxBRC2

## Starting with a scatter plot of all publications over the period of BRC2, coloured by Research Group,

###  with size of the points representing the number of authors

In [31]:
def monthlyGeomean(df, freq = 'M'):
    df['plus_ones'] = df.field_citation_ratio +1
    monthly_gmeans = df.dropna().groupby(pd.Grouper(freq= freq))['plus_ones'].apply(st.gmean)-1
    return monthly_gmeans

In [32]:
month_gmean_FCR = monthlyGeomean(df_with_group, 'M')
month_gmean_FCR.head()



publication_date
2011-04-30 00:00:00+00:00     3.47
2011-05-31 00:00:00+00:00      NaN
2011-06-30 00:00:00+00:00      NaN
2011-07-31 00:00:00+00:00    28.61
2011-08-31 00:00:00+00:00    11.40
Freq: M, Name: plus_ones, dtype: float64
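
The `NaN` entries correspond to calendar months in which no publication with a complete record falls, so there is nothing to average. The following standard-library sketch reproduces that behaviour on made-up data; `shifted_gmean`, `records` and the month list are hypothetical, and it is not the pandas `Grouper` implementation used in the cell above.

```python
import math
from collections import defaultdict
from datetime import date


def shifted_gmean(values, shift=1.0):
    """exp(mean(log(v + shift))) - shift, or NaN when there are no values."""
    if not values:
        return math.nan
    return math.exp(sum(math.log(v + shift) for v in values) / len(values)) - shift


# Hypothetical (publication date, FCR) pairs; nothing was published in May 2011
records = [(date(2011, 4, 12), 3.0), (date(2011, 4, 28), 8.0), (date(2011, 6, 3), 0.0)]

by_month = defaultdict(list)
for pub_date, fcr in records:
    by_month[(pub_date.year, pub_date.month)].append(fcr)

# Walk every calendar month in the range, including the empty one
months = [(2011, 4), (2011, 5), (2011, 6)]
monthly = {month: shifted_gmean(by_month.get(month, [])) for month in months}
print(monthly)
```

When the monthly series is drawn as a line, gaps of this kind simply appear as breaks in the curve.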

In [33]:
all_scatter = df_with_group.dropna().hvplot.scatter(y='field_citation_ratio',
                                                    size ='number_of_authors',
                                                    c='research_group',
                                                    xlabel='Publication date (DOI created)',
                                                    ylabel='Field Citation Ratio (FCR)',
                                                    hover_cols=['DOI'],
                                                    alpha=0.5)



final_scatter =(all_scatter\
                *hv.Text(x= pd.Timestamp('2017-09-01T00'), y=12, text='Geomean FCR',
                     halign='center', fontsize=10).opts(color='darkblue')\
                *hv.HLine(y=BRC2_gmean_FCR.round(2) , label='gmean FCR').opts(color='darkblue', 
                                                                       line_width=3, line_dash='dashed')\
                *hv.Text(x=pd.Timestamp('2017-9'),y=1.5,text='FCR=1',fontsize=10)\
                *hv.HLine(y=1).opts(line_dash='dotted', color='black')\
                *month_gmean_FCR.truncate(after='2017-04').hvplot.line(color='darkblue', line_width=3))

final_scatter.opts(width=900,height=400,show_legend=False,
                   yticks=([(1, '1'),(10, '10'),(100, '100'),(1000,'1000')]),
                   ylim=(0.5,1200),logy=True,xlim=(pd.Timestamp('2012'),pd.Timestamp('2018')), xrotation=25,
                   toolbar='above')

In [34]:
hv.save(final_scatter, './Figures/FCR_scatter_color.html')
hv.save(final_scatter.opts(toolbar=None ), './Figures/FCR_scatter_color.png', dpi=1200)

## A second scatter plot with coloured rows for research groups.

#### Each point is a publication, with its x-axis position based on the date the associated DOI was first created and the size of the point (bubble) reflecting the publication's current Field Citation Ratio (FCR).

In [35]:


RG_bubble =(df_with_group.dropna().hvplot.scatter(y='research_group',c='research_group',
                                     size ='field_citation_ratio',scale=1.2, cmap='category20',
                                     xlabel='Publication date (DOI created)',ylabel='Research Group',
                                     hover_cols=['DOI','ID'],legend=False)).sort('research_group', reverse=True)

RG_bubble.opts(width=900, height=400,  xlim=(pd.Timestamp('2011-10'),pd.Timestamp('2017-10')),
                  xrotation=25, xticks=None, toolbar='above',tools=['hover','tap'])

In [36]:
hv.save(RG_bubble.opts(toolbar=None), './Figures/research_group_bubble.png', dpi=1200)
hv.save(RG_bubble, './Figures/research_group_bubble.html')

##  Authors per publication - by research group

In [37]:
author_box= (df_with_group.hvplot.box(by=['group_type','research_group'],color='group_type',
                                                   cmap=['#1f77b4','#ff7f0e', '#2ca02c'],
                                                   y='number_of_authors').opts(ylim=(0,100), height=450, width=600,
                                                        xrotation=85, show_frame=False,
                                                        xlabel= '',yticks=(0,9,25,50,75,100), ylabel='Number of Authors',
                                                        show_legend=False, padding=0.8))

             
author_box2= (df_with_group.hvplot.box(by=['group_type'],color='group_type',
                                                   cmap=['#1f77b4','#ff7f0e', '#2ca02c'],
                                                   y='number_of_authors').opts(ylim=(0,100), height=450, width=200,
                                                        xrotation=85, show_frame=False,
                                                        xlabel= '', ylabel='',yaxis=None,
                                                        show_legend=False, padding=0.8))

#hvplot.save(author_box, 'Author_boxplot.html')
#hvplot.save(author_box, 'Author_boxplot.png')

hvplot.save(author_box +author_box2, 'Author_boxplots.svg')
hvplot.save((author_box +author_box2).opts(toolbar=None), 'Author_boxplots.png')

(author_box +author_box2).opts(width=900)

In [38]:
df_with_group.groupby('group_type')['number_of_authors'].median().round(2)

group_type
Other            7.0
Theme            9.0
Working Group    5.0
Name: number_of_authors, dtype: float64

## Stacked graph of cumulative publications over OxBRC2, by type of research group

In [39]:
group_type_counts = df_with_group.groupby(['group_type']).resample('M').count()['DOI']

group_type_counts.head()

# We then need to unpack this so the divisions are separate columns, so we can then do a cumulative count
group_stack =group_type_counts.unstack('group_type', fill_value=0)

group_stack_plot =group_stack.cumsum().hvplot.area(stacked=True,cmap=['#2ca02c','#1f77b4','#ff7f0e'], alpha=0.8)


group_stack_plot.opts(xlim=(pd.Timestamp('2011-04-01T00:00'),pd.Timestamp('2017-03-31T00:00')),
                      title = 'Number of publications (cumulative), by Research Group Type',
                      toolbar=None, legend_position='top_left', height=250,width=900,
                      yticks=(0,1000,2000),
                      xlabel='Date', ylabel='Cumulative publications')


In [40]:
hv.save(group_stack_plot.opts(toolbar=None),'./Figures/grouped_pub_plot.png', dpi=1200)
hv.save(group_stack_plot,'./Figures/grouped_pub_plot.html')
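
The reshaping behind the stacked area plot (count per month and group type, fill missing combinations with zero, then take a running total) can be sketched without pandas. The rows below are made up, and the variable names are hypothetical; the comments note which pandas step each part stands in for.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical (month, group_type) pairs, one per publication row
rows = [('2012-04', 'Theme'), ('2012-04', 'Working Group'), ('2012-05', 'Theme'),
        ('2012-06', 'Theme'), ('2012-06', 'Theme')]

months = sorted({month for month, _ in rows})
group_types = sorted({group for _, group in rows})
counts = Counter(rows)

# One list of monthly counts per group type, with 0 where a group published nothing
# (the role of unstack(fill_value=0)), then a running total (the role of cumsum()).
cumulative = {
    group: list(accumulate(counts[(month, group)] for month in months))
    for group in group_types
}
print(cumulative)
```

Filling the gaps with zero matters: without it, a group with no publications in a month would have a missing value rather than a flat segment in its cumulative curve.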

---
---
---


## Section 4 - Other plots, not used in the paper

---

The figures below might be of interest

## Stacked graph of monthly publications by research group

In [41]:
# monthly output from each research group
RG_counts = df_with_group.groupby(['research_group']).resample('M').count()['DOI']
RG_counts.head()


# We then need to unpack this so the research groups are separate columns, so we can then do a cumulative count
count_stack = RG_counts.unstack('research_group', fill_value=0)
pub_timeline = count_stack.hvplot.area(by='research_group', stacked=True, height=600, width=1000, legend='left',
                        xlim=(pd.Timestamp('2011-10'),pd.Timestamp('2017-10')))

pub_timeline

In [42]:
# Uncomment to save 

# hvplot.save(pub_timeline, './Figures/publications_over_time.svg')

## Stacked graph of cumulative publications over OxBRC2, by research group

In [43]:
count_stack_plot =count_stack.cumsum().hvplot.area(stacked=True,cmap='category20')

count_stack_plot.opts(xlim=(pd.Timestamp('2012-01-01T00:00'),pd.Timestamp('2017-03-31T00:00')),
                      title = 'Number of publications, by Research Group',
                      xlabel='Date', ylabel='Cumulative publications',
                      toolbar=None, legend_position='left', height=400, width=900)

In [44]:
# Uncomment to save 

#hvplot.save(count_stack_plot.opts(toolbar=None),'./Figures/Theme_pub_plot.png')
#hvplot.save(count_stack_plot,'./Figures/Theme_pub_plot.svg')


## A violin plot for Field Citation Ratio (FCR) by research group (top5 FCR shown here)

In [45]:

violin1 =(df_with_group[df_with_group.group_type =='Theme'].dropna().sort_values(by='research_group').hvplot.violin(y='field_citation_ratio',by='research_group',
                       ylim=(0,100), color='#1f77b4').opts(height=450, width=500,ylabel= 'Field Citation Ratio (FCR)',
                                                          xlabel='',title='Field Citation Ratio: Themes',xrotation=80)
          *hv.HLine(y=1))

violin2 =(df_with_group[df_with_group.group_type =='Working Group'].dropna().sort_values(by='research_group').hvplot.violin(y='field_citation_ratio',by='research_group',
                       ylim=(0,100)
                    ,color='#ff7f0e').opts(height=450, width=300, title=' Working Groups',xrotation=80,xlabel='', yaxis=None)
          *hv.HLine(y=1))

violin3 =(df_with_group[df_with_group.group_type =='Other'].dropna().sort_values(by='research_group').hvplot.violin(y='field_citation_ratio',by='research_group',
                       ylim=(0,100)
                    ,color='#2ca02c').opts(height=450, width=150, title='Other Work',xrotation=80,xlabel='', yaxis=None)
          *hv.HLine(y=1))

In [46]:
final_violins =((violin1+ violin2+ violin3)).opts(toolbar=None)
final_violins

In [47]:
# Uncomment to save 

#hv.save(final_violins.opts(toolbar=None), './Figures/FCR_violins.png')
#hv.save(final_violins, './Figures/FCR_violins.html')

## The relationships between citations, citation ratios, and numbers of authors per publication can be examined briefly 

In [48]:
# plot CrossRef citations against FCR
bubble1 = df_with_group.hvplot.scatter(x='times_cited_CrossRef',y='field_citation_ratio',aggregator='mean',
                                   by='research_group', size ='number_of_authors',cmap='category20',legend=False,
                                   logx=False, logy=False, height=400, width=400, xlabel = 'Number of Citations',
                                   ylabel = 'Field Citation Ratio (FCR)', title='Citations vs FCR')

# or plot CrossRef citations against number of authors for each research group
bubble2 = df_with_group.hvplot.scatter(x='times_cited_CrossRef',y='number_of_authors',aggregator='mean', by='research_group',
                                   size ='field_citation_ratio', hover_cols=['number_of_authors'],legend=False,
                                   logx=False, logy=False, height=400, width=400,
                                   xlabel = 'Number of Citations', ylabel = 'Number of Authors',
                                   title='Citations vs Number of Authors')
bubbles = (bubble1 + bubble2)
# bubbles.opts(toolbar=None)

In [49]:
x= df_with_group.dropna().times_cited_CrossRef
y= df_with_group.dropna().field_citation_ratio

slope, intercept, r_value, p_value, std_err = st.linregress(x=x,y=y)
reg_curve1 = slope*x+intercept

bubble1_labels = bubble1 *hv.Curve((x,reg_curve1))*hv.Text(3000,100,
                                                           'r = '+str(r_value.round(4)),
                                                           fontsize=18)*hv.Text(3000,200, 'slope= '+str(slope.round(4)),
                                                                                fontsize=18)
#bubble1_labels
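
For readers unfamiliar with the quantities returned by a simple linear regression, the sketch below computes the least-squares slope, intercept and Pearson correlation coefficient by hand. `simple_linregress` is a hypothetical helper written for illustration, with made-up data; it is not the SciPy implementation and omits the p-value and standard error.

```python
import math


def simple_linregress(xs, ys):
    """Least-squares slope, intercept and Pearson r for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_value = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r_value


# Made-up points lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope_demo, intercept_demo, r_demo = simple_linregress(xs, ys)
print(slope_demo, intercept_demo, r_demo)
```

A value of r close to 1 or -1 indicates a strong linear relationship, while the slope only says how steep the fitted line is, so the two should be read together.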

In [50]:
x2= df_with_group.dropna().times_cited_CrossRef
y2= df_with_group.dropna().number_of_authors

slope2, intercept2, r_value2, p_value2, std_err2 = st.linregress(x=x2,y= y2)
reg_curve2 = slope2*x2+intercept2

bubble2_labels =bubble2 *hv.Curve((x2,reg_curve2))*hv.Text(1500,1500, 'r = '+str(r_value2.round(4)),
                                                           fontsize=18)*hv.Text(1500,1700, 'slope= '+str(slope2.round(4)), fontsize=18)

#bubble2_labels

In [51]:
bubbles_labels = bubble1_labels + bubble2_labels
bubbles_labels

In [52]:
# uncomment to save
#hvplot.save(bubbles_labels.opts(toolbar=None), './Figures/Theme_scatters.png')
#hvplot.save(bubbles_labels, './Figures/Theme_scatters.svg')

In [53]:
df_helpers =pd.DataFrame(data=[df_with_group.groupby('research_group')['ID'].count(),
                                      df_with_group.groupby('research_group')['number_of_authors'].mean().round(2)]).transpose()
df_helpers.columns=('total_publications', 'average_authorship')
df_helpers

research_group,total_publications,average_authorship
biomedical informatics and technology,62.0,17.95
blood,181.0,13.75
cancer,136.0,11.91
cardiovascular,272.0,17.05
cognitive health,132.0,6.2
dementia and cerebrovascular disease,150.0,16.04
diabetes,131.0,42.0
ethics,10.0,7.7
functional neurosciences and imaging,180.0,7.43
genomic medicine,241.0,41.88


In [54]:
bubble3 = df_helpers.hvplot.scatter(x='total_publications', y='average_authorship',
                                           by='research_group', cmap='category20',legend=False)
#bubble3

In [55]:
x3= df_helpers.total_publications
y3= df_helpers.average_authorship

slope, intercept, r_value, p_value, std_err = st.linregress(x=x3,y=y3)
# build the regression line from x3 (the per-group totals), not x from In [49]
reg_curve3 = slope*x3+intercept

bubble3_labels = bubble3 *hv.Curve((x3,reg_curve3)).opts(line_alpha=0.1)*hv.Text(250,50,
                                                           'r = '+str(r_value.round(4)),
                                                           fontsize=14)*hv.Text(250,40, 'slope= '+str(slope.round(4)),
                                                                                fontsize=14)

In [56]:
bubble3_labels.opts(xlim=(0,400),ylim=(0,60))

## There does not appear to be a clear link between the average number of authors per publication in a research group and its volume of publications during OxBRC2
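One way to put a number on this is Pearson's r between the two columns of `df_helpers`. The sketch below is illustrative only: it uses just the ten rows displayed in the output above (the full table may contain more groups), and `pearson_r` is a hypothetical helper written with the standard library rather than the `st.linregress` call used in the notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Values copied from the ten df_helpers rows shown above
total_publications = [62, 181, 136, 272, 132, 150, 131, 10, 180, 241]
average_authorship = [17.95, 13.75, 11.91, 17.05, 6.2, 16.04, 42.0, 7.7, 7.43, 41.88]

r = pearson_r(total_publications, average_authorship)
print("r =", round(r, 4))
```

A value of r close to 0 would be consistent with the reading above; with so few groups, any such estimate should be treated with caution.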

---

---

---

# Additional plots for the paper ...

### These are minor changes, focussing on the two main types of research group, Themes and Working Groups


----
First we can focus on the research outputs from research Themes and Working Groups (select all except the 'Other' group types)

In [57]:
# Keep every row except the 'Other' group type (selected with '!='), then drop rows with missing values
df_with_group_focus = df_with_group[df_with_group.group_type!='Other'].dropna()

In [58]:
group_type_counts_focus = df_with_group[df_with_group.group_type!='Other'].groupby(['group_type']).resample('M').count()['DOI']



# We then need to unstack this so each group type is a separate column, so we can then do a cumulative count
group_stack_focus =group_type_counts_focus.unstack('group_type', fill_value=0)

group_stack_plot_focus =group_stack_focus.cumsum().hvplot.area(stacked=True,cmap=['#1f77b4','#ff7f0e'], alpha=0.8)


group_stack_plot_focus.opts(xlim=(pd.Timestamp('2011-04-01T00:00'),pd.Timestamp('2017-03-31T00:00')),
                      title = 'Number of publications (cumulative), by Research Group Type',
                      toolbar=None, legend_position='top_left', height=250,width=900,
                      yticks=(0,1000,2000),
                      xlabel='Date', ylabel='Cumulative publications')
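The count, unstack and cumulative-sum steps in the cell above can be sketched on a toy example. This is a minimal illustration using only the standard library, not the pandas pipeline itself; the `records`, `months` and `group_types` values are hypothetical. Months with no publications for a group count as zero, which mirrors what `fill_value=0` does in `unstack`.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical (month, group_type) pairs, one per publication row
records = [
    ("2012-04", "Theme"), ("2012-04", "Theme"), ("2012-04", "Working Group"),
    ("2012-05", "Theme"),
    ("2012-07", "Working Group"), ("2012-07", "Working Group"),
]

months = ["2012-04", "2012-05", "2012-06", "2012-07"]  # every month on the x axis
group_types = ["Theme", "Working Group"]

# Monthly count per group type; a Counter returns 0 for missing (month, group) pairs
counts = Counter(records)

# One "column" per group type, holding the running total across months
cumulative = {
    group: list(accumulate(counts[(month, group)] for month in months))
    for group in group_types
}

for group, totals in cumulative.items():
    print(group, totals)
```

Stacking these per-group running totals on top of each other gives the kind of area chart drawn by `hvplot.area(stacked=True)`.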


In [59]:
hv.save(group_stack_plot_focus.opts(toolbar=None),'./Figures/grouped_pub_plot_focus.png', dpi=1200)
hv.save(group_stack_plot_focus,'./Figures/grouped_pub_plot_focus.html')

##  Authors per publication - by research group - Focus on Themes and Working Groups

In [101]:
df_with_group.research_group = df_with_group.research_group.str.capitalize()

author_box_focus= (df_with_group[df_with_group.group_type!='Other'].hvplot.box(by=['group_type','research_group'],color='group_type',
                                                   cmap=['#1f77b4','#ff7f0e', '#2ca02c'],
                                                   y='number_of_authors').opts(ylim=(0,100), height=450, width=600,
                                                        xrotation=85, show_frame=False,
                                                        xlabel= '',yticks=(0,9,25,50,75,100), ylabel='Number of Authors',
                                                        show_legend=False, padding=0.8))

             
author_box2_focus= (df_with_group[df_with_group.group_type!='Other'].hvplot.box(by=['group_type'],color='group_type',
                                                   cmap=['#1f77b4','#ff7f0e', '#2ca02c'],
                                                   y='number_of_authors').opts(ylim=(0,100), height=450, width=200,
                                                        xrotation=85, show_frame=False,
                                                        xlabel= '', ylabel='',yaxis=None,
                                                        show_legend=False, padding=0.8))

#hvplot.save(author_box, 'Author_boxplot.html')
#hvplot.save(author_box, 'Author_boxplot.png')

hvplot.save(author_box_focus +author_box2_focus, './Figures/Author_boxplots_focus.svg')
hvplot.save((author_box_focus +author_box2_focus).opts(toolbar=None), './Figures/Figure2.png')
hvplot.save((author_box_focus +author_box2_focus), './Figures/Author_boxplots_focus.png')

(author_box_focus +author_box2_focus).opts(width=900)


## Starting with a scatter plot of all publications over the period of BRC2, coloured by research group type,

###  with size of the points representing the number of authors

In [61]:
month_agg_FCR_focus = df_with_group_focus.groupby(pd.Grouper(freq='M')).field_citation_ratio.median()
month_agg_FCR_Theme = df_with_group_focus[df_with_group_focus.group_type=='Theme'].groupby(pd.Grouper(freq='M')).field_citation_ratio.median()
month_agg_FCR_WG = df_with_group_focus[df_with_group_focus.group_type=='Working Group'].groupby(pd.Grouper(freq='M')).field_citation_ratio.median()
month_agg_FCR_WG.head()

publication_date
2011-08-31 00:00:00+00:00    11.4
2011-09-30 00:00:00+00:00     NaN
2011-10-31 00:00:00+00:00     NaN
2011-11-30 00:00:00+00:00     NaN
2011-12-31 00:00:00+00:00     NaN
Freq: M, Name: field_citation_ratio, dtype: float64

In [62]:
month_gmean_FCR_focus = monthlyGeomean(df_with_group_focus, 'M')
month_gmean_FCR_focus.head()



publication_date
2011-04-30 00:00:00+00:00     3.47
2011-05-31 00:00:00+00:00      NaN
2011-06-30 00:00:00+00:00      NaN
2011-07-31 00:00:00+00:00    28.61
2011-08-31 00:00:00+00:00    11.40
Freq: M, Name: plus_ones, dtype: float64
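`monthlyGeomean` is defined earlier in the notebook and is not shown in this section. The result is named `plus_ones`, and its value for August 2011 (11.40) matches the Working Group median for that month, so it appears to take the geometric mean of FCR + 1 and then subtract 1 again. The sketch below illustrates that idea under this assumption; `shifted_geomean` and the example values are hypothetical. Adding 1 before taking logs keeps publications with an FCR of 0 from breaking the calculation.

```python
import math

def shifted_geomean(values, shift=1.0):
    """Geometric mean of (value + shift), shifted back afterwards (hypothetical helper)."""
    if not values:
        return float("nan")  # an empty month has no defined mean
    log_mean = sum(math.log(v + shift) for v in values) / len(values)
    return math.exp(log_mean) - shift

# Hypothetical FCR values for one month, including a zero and one large outlier
fcr_values = [0.0, 1.0, 3.0, 100.0]

arithmetic_mean = sum(fcr_values) / len(fcr_values)
geo_mean = shifted_geomean(fcr_values)

print("arithmetic mean:", round(arithmetic_mean, 2))
print("shifted geometric mean:", round(geo_mean, 2))
```

The single large value pulls the arithmetic mean up far more than the shifted geometric mean, which is the reason given at the top of the notebook for using geometric means for FCR.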

In [63]:
(month_agg_FCR_Theme.dropna().hvplot.line(color='darkblue', line_width=5)
 * month_agg_FCR_WG.dropna().hvplot.line(color='Orange', line_width=5))

In [81]:
all_scatter_focus = df_with_group_focus.dropna().hvplot.scatter(y='field_citation_ratio',
                                                                size ='number_of_authors',
                                                                c='group_type',
                                                                scale=2,
                                                                xlabel='Publication date (DOI created)',
                                                                ylabel='Field Citation Ratio (FCR)',
                                                                hover_cols=['DOI'],
                                                                alpha=0.6)



final_scatter_focus =(all_scatter_focus\
                *hv.Text(x= pd.Timestamp('2017-09-01T00'),
                         y=16,
                         text='Geomean FCR',
                         halign='center',
                         fontsize=20).opts(color='darkblue')\
                *hv.HLine(y=BRC2_gmean_FCR.round(2) ,
                          label='gmean FCR').opts(color='darkblue',
                                                  line_width=3,
                                                  line_dash='dashed')\
                *hv.Text(x=pd.Timestamp('2017-9'),
                         y=1.5,
                         text='FCR=1',
                         fontsize=20)
                *hv.HLine(y=1).opts(line_dash='dotted',
                                    color='black')\
                *month_gmean_FCR_focus.truncate(after='2017-04').hvplot.line(color='darkblue', 
                                                                             line_width=3))

final_scatter_focus.opts(width=1800,
                         height=800,
                         show_legend=False,
                         yticks=([(1, '1'),(10, '10'),(100, '100'),(1000,'1000')]),
                         ylim=(0.5,1200),
                         logy=True,xlim=(pd.Timestamp('2012'),pd.Timestamp('2018')),
                         xrotation=25,
                         fontsize= {
                         'labels': '200%', 
                         'ticks': '180%'},
                         toolbar='above')

In [83]:
hv.save(final_scatter_focus.opts(toolbar=None), './Figures/final_scatter_focus.png', dpi=1200)
hv.save(final_scatter_focus.opts(toolbar=None), './Figures/Figure5.png', dpi=2400)
hv.save(final_scatter_focus, './Figures/final_scatter_focus.html')

In [66]:
df_with_group_focus

publication_date,Unnamed: 0,ID,DOI,times_cited_CrossRef,times_cited_Dimensions,relative_citation_ratio,field_citation_ratio,number_of_authors,research_group,group_type,plus_ones
2014-08-30 14:03:56+00:00,0,1125,10.1186/s12881-014-0095-4,5.0,7.0,0.24,0.78,10.0,genomic medicine,Theme,1.78
2012-12-09 20:07:21+00:00,1,266,10.1038/ng.2492,166.0,178.0,5.58,46.47,16.0,genomic medicine,Theme,47.47
2012-08-10 18:49:10+00:00,2,168,10.1002/ajmg.a.35558,24.0,25.0,0.65,2.15,13.0,genomic medicine,Theme,3.15
2015-01-24 04:46:41+00:00,3,1920,10.1016/j.fertnstert.2014.12.123,22.0,27.0,1.72,12.25,8.0,genomic medicine,Theme,13.25
2012-10-25 17:38:36+00:00,4,175,10.1371/journal.pgen.1003025,42.0,53.0,1.69,4.55,4.0,genomic medicine,Theme,5.55
...,...,...,...,...,...,...,...,...,...,...,...
2014-07-10 15:39:54+00:00,2497,1068,10.1097/mej.0000000000000166,5.0,5.0,0.34,1.53,6.0,biomedical informatics and technology,Theme,2.53
2015-08-12 14:41:09+00:00,2498,1567,10.1186/s12911-015-0186-y,20.0,25.0,2.26,7.84,6.0,biomedical informatics and technology,Theme,8.84
2013-12-02 09:45:56+00:00,2499,1271,10.1007/s40279-013-0128-8,99.0,118.0,5.10,24.12,5.0,biomedical informatics and technology,Theme,25.12
2013-08-29 03:42:52+00:00,2501,619,10.1128/cvi.00427-13,52.0,55.0,2.31,7.48,16.0,research education and training,Working Group,8.48


In [90]:
df_with_group_focus.research_group = df_with_group_focus.research_group.str.capitalize()

RG_bubble_focus = df_with_group_focus.hvplot.scatter(y = 'research_group',
                                                     c='group_type',
                                                     size ='field_citation_ratio',
                                                     cmap=['#1f77b4','#ff7f0e'],
                                                     scale = 2.4,
                                                     xlabel='Publication date (DOI created)',
                                                     ylabel='Research Themes and Working Groups',
                                                     hover_cols=['DOI','ID'],
                                                     legend=False).sort('group_type').sort('research_group', reverse=True)

RG_bubble_focus.opts(width=2400,
                     height=800,
                     xlim=(pd.Timestamp('2011-10'),pd.Timestamp('2017-10')),
                     xrotation=25,
                     fontsize= {
                         'labels': '200%', 
                         'ticks': '180%', 
                     },
                     xticks=None,
                     toolbar='above',
                     tools=['hover','tap'])

In [91]:
hv.save(RG_bubble_focus.opts(toolbar=None), './Figures/research_group_bubble_focus.png', dpi=1200)
hv.save(RG_bubble_focus, './Figures/research_group_bubble_focus.html')
hv.save(RG_bubble_focus.opts(toolbar=None), './Figures/Figure1.png', dpi=2400)

In [92]:

RG_bubble_focus2 = df_with_group_focus.hvplot.scatter(y = 'research_group',
                                                      c='group_type',
                                                      size ='field_citation_ratio',
                                                      scale=1.2,
                                                      cmap=['#ff7f0e','#1f77b4'],
                                                      xlabel='Publication date (DOI created)',
                                                      ylabel='Research Group',
                                                      hover_cols=['DOI','ID'],
                                                      legend=False).sort(['group_type','research_group'], reverse=True)

RG_bubble_focus2.opts(width=900,
               height=400,
               xlim=(pd.Timestamp('2011-10'),pd.Timestamp('2017-10')),
               xrotation=25,
               xticks=None,
               toolbar='above',
               tools=['hover','tap'])

In [93]:
hv.save(RG_bubble_focus2.opts(toolbar=None), './Figures/research_group_bubble_focus2.png', dpi=1200)
hv.save(RG_bubble_focus2, './Figures/research_group_bubble_focus2.html')