In [None]:
# Title: Evaluating institutional open access performance: Methodology, challenges and assessment

**Abstract**

The role of open access has rapidly become central to the global research community (and beyond). This is increasingly so with organisations and governments mandating serious considerations for publishing in open access forms (e.g., Plan S). However, tracking the progress of open access at the institutional level remains a difficult problem, with most bibliographic databases lacking exhaustive open access and affiliation labelling, using non-standardised metadata formats, and having potentially significant differences in coverage. This necessitates methodologies integrating diverse data sources to provide more complete evidence for institutional open access performance. In this study, we build the first comprehensive and reproducible data workflow capable of capturing institutional open access data. The workflow combines digital object identifiers (DOIs) affiliated to individual institutions from Microsoft Academic, Web of Science, and Scopus. Subsequently, we normalise publication dates using Crossref metadata and query each DOI’s open access status through Unpaywall. We use this methodology to produce various open access scores for the Top 1000 universities in the Times Higher Education World University Rankings 2019, and supplement this list with additional universities from countries with a low presence in our data set. Analysis of the resulting data highlights the existence of different open access paths that universities take, as well as regional differences across the globe, arising from varying policies and infrastructure. We also present the top 100 performing universities in each of the categories of total open access, open access (gold) publishing and repository-mediated (green) open access percentages. The presence of African, Asian and Latin American universities in the top 100 shows encouraging progress in those regions.

<b>Keywords:</b> Open Access; Evaluation Framework; Unpaywall; Microsoft Academic; Web of Science; Scopus.

# Introduction

Open access (OA) is a policy aspiration for research funders, organisations, and communities globally. While there is substantial disagreement on the best route to achieve open access, the idea that wider availability of research outputs should be a goal is broadly shared. Over the past decade, there has been a massive increase in the volume of publications available in open access, mediated through publishers and through a wide range of repositories. Piwowar et al. (2018), a large-scale peer reviewed study of global open access trends, showed that the percentage of open access articles per year was about 45% in 2015, compared to around 5% before 1990, and a more recent projection suggests that 44% of all outputs ever published will be freely accessible in 2025 (Piwowar et al., 2019).

This massive increase has been driven in large part by policy initiatives of both funders and research organisations. Medical research funders such as the Wellcome Trust and Medical Research Council in the UK and the National Institutes of Health in the US led a wide range of funder policy interventions. Universities such as Harvard, Liege, Southampton and others developed local policies and infrastructures that became more widely adopted. Plan S, led by a coalition of funders[^1] and announced in 2019, has as its goal the complete conversion of scholarly publishing to immediate open access. This is the most ambitious, and therefore the most controversial, policy initiative to date, with questions raised about the approach (Rabesandratana, 2019; Haug, 2019; Barbour and Nicholls, 2019), implementation details (McNutt, 2019; Gómez-Fernández, 2019; Brainard, 2019; Agustini and Berk, 2019), and unintended side effects for existing programs outside North America and Northwestern Europe (Debat and Babini, 2019; Aguado-López and Becerril-García, 2019).

[^1]: See https://www.coalition-s.org/

Despite the scale of these interventions and the apparent success in driving change, at least in some areas, there is limited comparative and quantitative research about which policy interventions have been the most successful. The landscape also lacks a framework or theory of change through which to analyse the observed effects of interventions. In part this is due to a historical lack of high-quality data on open access, the heterogeneous nature of the global scholarly publishing endeavour, and the consequent lack of any baseline against which to make comparisons.

Early examples of such critical work on evaluating open access and the effectiveness of policy showed a correlation between mandate strength and rate of deposit in repositories (Gargouri et al., 2012). An important recent example is reported by Larivière and Sugimoto (2018). They show a link between the monitoring of policy and its effectiveness, describing strong performance by articles funded by the Wellcome Trust, UK Medical Research Councils and National Institutes of Health, alongside the UK Research Councils and Gates Foundations. These are all funders that have implemented monitoring, and in some cases sanctions for non-compliance. By comparison, open access for works funded by Canadian funders, which do not monitor compliance, was shown to lag substantially even when disciplinary effects were taken into account.

There is a need for critical, inclusive and diverse evaluation of open access performance that can address regional and political differences. For example, the SciELO project (originally from Brazil, but now covering 14 countries, mostly Latin American) has successfully implemented an electronic open access publishing model for journals, which has resulted in a surge in journal-mediated (gold) open access (Packer, 2009; Wang et al., 2018). In contrast, the open access policies adopted by UK Research and Innovation[^2] require repository deposit of research articles for eligibility in the next national research evaluation exercise. Given the lower cost of repository-mediated (green) open access, and the high correlation between mandate strength and rate of deposit into institutional repositories, this is likely to encourage more repository than journal-mediated open access.

[^2]: See https://www.ukri.org/funding/information-for-award-holders/open-access/

Recent work by Iyandemye and Thomas (2019) showed that, for biomedical research, there was a greater level of open access for articles published from countries with a lower GDP. This effect persisted when articles with only authors from a specific region were considered. Levels of open access were particularly high for sub-Saharan Africa. They report no clear relationship between the number of policies and open access to biomedical literature, although they do not examine the details of those policies. This provides evidence of national or regional effects on publication cultures that lead to open access.

Meanwhile, Siler et al. (2018) demonstrated an institutional effect on choices around open access. They showed that, for the field of Global Health and a set of institutions identified through Web of Science (WoS), lower-ranked institutions are more likely to publish in closed outlets. They suggest that this is due to the cost of article processing charges (APCs). This shows the importance of considering institutional context when examining open access performance, and potentially when considering implementation pathways.

## Change at the institutional level

We have argued (Montgomery et al., 2018) that the key to understanding and guiding the cultural changes that underpin a transition to openness, including open access to scholarly outputs, is analysis at the level of research institutions. While funders, national governments, and research communities create the environments in which researchers operate, it is within their professional spaces that choices around communication, and their links to career progression and job security are strongest. Therefore, analysis of how external policy leads to change at the level of universities is critical. However, providing accurate and reliable data on open access at the university level is a challenge.

The most comprehensive work on open access at the university level currently available is that included in the Leiden Ranking provided by the Centre for Science and Technology Studies (CWTS), accompanied by the preprint Robinson-Garcia et al. (2019). This utilises an internal Web of Science database and data from Unpaywall[^3] to provide an estimate of open access over a range of timeframes. These data have highlighted the broad effects of funder policies (notably the performance of UK universities in response to Research Council and Funding Council policies) while also providing standout examples from regions that are less expected (for instance Bilkent University in Turkey).

[^3]: See https://unpaywall.org/

Given the policy drivers towards open access it is perhaps inevitable that rankings will start to incorporate information on open access and universities may be judged on their performance. One concern is the existing disciplinary bias in large bibliographic sources. For example, the coverages of Web of Science (WoS) and Scopus were shown to be biased toward the sciences and the English language (Mongeon and Paul-Hus, 2016). We have already described how conventional approaches to defining evaluation frameworks based on single sources of output data can provide misleading results (Huang et al., 2019). In a companion white paper to this article we provide more details of these issues with a sensitivity analysis of the data presented here (Huang et al., 2020a). If we are to make valid comparisons of universities across countries, regions and funders to examine the effectiveness of open access policy implementation there is a critical need for evaluation frameworks that provide fairer, more diverse, and more meaningful measurement of open access performance.

## Challenges in evaluating institutions

Building a robust open access evaluation framework at the institutional level comes with a number of challenges. As mentioned earlier, there is an issue of coverage of research outputs by different bibliographic sources. Each data source comes with its own limitations and biases. In addition to the issue of coverage, there are a number of fundamental challenges in defining such a framework. These include:

1.	What is a university?
2.	What is the set of objects we should look at to determine open access performance?
3.	Do we care about absolute numbers or proportions?
4.	What are the issues surrounding data completeness and quality for both sets of objects we examine and the data on open access performance?

Our pragmatic assessment is that any evaluation framework should be tied to explicit policy goals and be shaped to deliver them. Following from our work on open knowledge institutions (Montgomery et al., 2018), our goals in conducting an evaluation exercise and developing the framework are as follows:

1.	Maximising the amount of research content that is accessible to the widest range of users, in the first instance focusing on existing formal research content for which metadata quality is sufficiently high to enable analysis 
2.	Developing an evaluation framework that drives an elevation of open access and open science issues to a strategic issue for all research-intensive universities
3.	Developing a framework that is sensitive to and can support universities taking a diversity of approaches and routes towards delivering on those goals

In terms of a pragmatic approach to delivering on these we therefore intend to:

1.	Focus on research intensive institutions, using existing rankings as a sample set
2.	Seek to maximise the set of objects which we can collect and track while connecting them to institutions (i.e., favour recall over precision)
3.	Focus on proportions of open access as a performance indicator rather than absolute numbers
4.	Publicly report on the details of performance for high performing institutions (and provide strategic data on request to others)
5.	Report on the diversity of paths being taken to deliver overall access by a diverse group of universities
6.	Develop methodology that is capable of identifying which policy interventions have made a difference to outcome measures and any ‘signature’ of those effects

# Results

Based on a large scale sensitivity analysis of our full data set we have developed a workflow and analysis procedure for quantifying and comparing open access performance at the university level. We first present a summary of this workflow, followed by the main findings resulting from our analysis. We report a top 100 of global universities based on percentage of overall open access, open access publishing (‘gold open access’) and repository mediated open access (‘green open access’). We then show global trends on open access, the effect of selected policy interventions, and how different institutional choices can be visualised and analysed.

## A reproducible workflow to evaluate open access performance at the institutional level

We developed a reproducible workflow capable of quantifying a wide range of open access characteristics at the institutional level. The overall workflow is shown diagrammatically in Figure 1. This includes a mapping of open access definitions and the Unpaywall information we used to construct them. Briefly, we gather output metadata from searches in Microsoft Academic[^4] (Sinha et al., 2015; Wang et al., 2019), Web of Science and Scopus, for each university. From this full set we gather the corresponding DOIs from the metadata of each output. These are filtered down to Crossref DOIs by matching against a Crossref snapshot (see Supplementary Methodology for snapshot identity). These are then matched against an Unpaywall snapshot (see Supplementary Methodology for snapshot identity) for their corresponding open access status. Detailed discussions of various open access definitions, data sources, and technical details of the data infrastructure can be found in the Supplementary Methodology.
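
The steps above can be sketched in a few lines of Python. This is a minimal illustration only: the DOIs are made up, the two lookup tables merely stand in for the Crossref and Unpaywall snapshots, and the names `source_dois`, `crossref_dois`, `unpaywall_is_oa` and `oa_percentage` are hypothetical. The published workflow is described in the article as a set of SQL queries (see Supplementary Methodology), not as this function.

```python
# Toy DOI lists for one hypothetical university from each bibliographic source.
source_dois = {
    "microsoft_academic": ["10.1000/a1", "10.1000/A2", "10.1000/a3"],
    "web_of_science": ["10.1000/a2", "10.1000/a4"],
    "scopus": ["10.1000/a1", "10.1000/a4", "10.5555/not-crossref"],
}

# Stand-ins for the Crossref and Unpaywall snapshots (illustrative values only).
crossref_dois = {"10.1000/a1", "10.1000/a2", "10.1000/a3", "10.1000/a4"}
unpaywall_is_oa = {
    "10.1000/a1": True,
    "10.1000/a2": False,
    "10.1000/a3": True,
    "10.1000/a4": False,
}


def oa_percentage(source_dois, crossref_dois, unpaywall_is_oa):
    """Union the sources, keep Crossref DOIs, then look up open access status."""
    # DOIs are case-insensitive, so normalise case before taking the union.
    combined = {doi.lower() for dois in source_dois.values() for doi in dois}
    in_crossref = combined & crossref_dois
    # Only DOIs found in the Unpaywall stand-in receive an open access status.
    matched = [doi for doi in in_crossref if doi in unpaywall_is_oa]
    if not matched:
        return 0, float("nan")
    n_open = sum(unpaywall_is_oa[doi] for doi in matched)
    return len(matched), 100 * n_open / len(matched)


total, percent_oa = oa_percentage(source_dois, crossref_dois, unpaywall_is_oa)
print(total, percent_oa)
```

Taking the union before filtering reflects the stated preference for recall over precision: an output indexed by only one of the three sources still counts towards the institution's total.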

[^4]: See https://aka.ms/msracad


#### Figure 1: Workflow of data collection and mapping of open access definitions to Unpaywall metadata.

$$ $$

![test](images/oa_article_figure_1.bmp)

As we have noted previously (Huang et al., 2019) there are significant levels of sensitivity associated to the choices in bibliographic data sources when they are used to create a ranking. For this analysis we therefore chose to combine all three datasets (i.e., Microsoft Academic, Web of Science and Scopus). At the same time, we also use Crossref's ‘issued date’ field as the standardised approach for determining publication year for each output. In the companion white paper (Huang et al., 2020a) we provide a comprehensive sensitivity analysis for these choices. There are also changes based on which specific Unpaywall snapshot is used. This is partly due to real changes (e.g. release of works from repositories after embargo) and due to changes within the Unpaywall data system (examples include changes in upstream data sources such as journal inclusion or exclusion in the Directory of Open Access Journals (DOAJ), and internal changes such as improved repository calling or wider journal coverage). As this is a product of gradually improving systems underpinning Unpaywall we use the most recent available snapshot to provide the most up to date data in a reproducible and identifiable form.
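
As an illustration of the publication-year normalisation, the sketch below reads the year from a Crossref-style record. It assumes the record follows the Crossref REST API JSON layout, in which `issued` holds a `date-parts` list whose first entry begins with the year. The helper name `publication_year` and the sample records are hypothetical; the article's own pipeline works from a Crossref snapshot rather than from this function.

```python
from typing import Optional


def publication_year(record: dict) -> Optional[int]:
    """Return the year from a Crossref-style 'issued' field, or None if absent."""
    date_parts = record.get("issued", {}).get("date-parts", [])
    # 'date-parts' is a list of [year, month, day] lists; month and day are optional.
    if date_parts and date_parts[0] and date_parts[0][0] is not None:
        return int(date_parts[0][0])
    return None


# Invented sample records covering a full date, a year-only date and a missing date.
records = [
    {"DOI": "10.1000/a1", "issued": {"date-parts": [[2017, 3, 1]]}},
    {"DOI": "10.1000/a2", "issued": {"date-parts": [[2016]]}},
    {"DOI": "10.1000/a3", "issued": {"date-parts": [[None]]}},
]
years = [publication_year(record) for record in records]
print(years)
```

Using a single field for every output keeps the year assignment consistent even when the three bibliographic sources disagree on the publication date.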

Briefly, it is our view that to provide a robust assessment of open access performance the following criteria must be met:

1.	The set of outputs included in each category (here institutions) and a complete traceable description of how they were collected must be transparently described. Provided here by a description of the data sources and the procedures used to collect DOIs for each institution (see Supplementary Methodology).
2.	A clearly defined, open and auditable data source on open access status. Provided here by a defined and identified Unpaywall snapshot (see Supplementary Methodology).
3.	A clearly defined and implementable description of how open access status data is interpreted. Provided here in Figure 1 and in Supplementary Methodology in the form of the SQL query used to establish open access status categories for each DOI.
4.	Provision of derived data and analysis in auditable form. Provided here as the derived data as open data (Huang et al., 2020b), code for the analysis of derived data as Jupyter notebooks (Huang et al., 2020c), and upstream data analysis in the form of SQL queries used (Huang et al., 2020b).

We have limited our data sharing in two ways. Firstly, we do not provide the full list of DOIs obtained from each source, due to Terms of Service restrictions. Secondly, we have not identified institutions individually except for those that fall within the top 100 globally for total open access, journal-mediated, or repository-mediated open access. The full dataset containing derived data for all institutions is available in anonymised form (Huang et al., 2020b). 

In [None]:
import pandas as pd
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")

from analysis import charts

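# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6; try the new name first
try:
    plt.style.use('seaborn-v0_8-white')
except OSError:
    plt.style.use('seaborn-white')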
sns.set_context('paper')

In [None]:
full = pd.read_csv('https://zenodo.org/record/3693222/files/institutional_oa_evaluation_2020_full_paper_dataset_2020_02_12.csv?download=1')

In [None]:
named = pd.read_csv('https://zenodo.org/record/3693222/files/institutional_oa_evaluation_2020_named_unis_dataset_2020_02_12.csv?download=1')

In [None]:
## Helper functions ##

# Data cleanup required, mainly on country names #
def clean_geo_names(df):
    country_clean = { "country" : {
        "United Kingdom of Great Britain and Northern Ireland" : "United Kingdom",
        "Iran (Islamic Republic of)" : "Iran",
        "Korea, Republic of" : "South Korea",
        "Taiwan, Province of China" : "Taiwan"
                              }
                    }
    df.replace(to_replace = country_clean, inplace=True)

    df.loc[df.country.isin(['Canada', 'United States of America']), 'region'] = 'North America'
    df.replace('Americas', 'Latin America', inplace=True)
    return df

# Creating nice column names for graphing
def nice_column_names(df):
    cols = [
        ('Open Access (%)', 'percent_oa'),
        ('Total Green OA (%)', 'percent_green'),
        ('Total Gold OA (%)', 'percent_gold'),
        ('Gold DOAJ (%)', 'percent_gold_just_doaj'),
        ('Green in Institutional Repository (%)', 'percent_in_home_repo'),
        ('Hybrid OA (%)', 'percent_hybrid'),
        ('Total Publications', 'total'),
        ('Change in Open Access (%)', 'total_oa_pc_change'),
        ('Change in Green OA (%)', 'green_pc_change'),
        ('Change in Gold OA (%)', 'gold_pc_change'),
        ('Change in Total Publications (%)', 'total_pc_change'),        
        ('Year of Publication', 'published_year'),
        ('University Name', 'name'),
        ('Region', 'region'),
        ('Country', 'country'),
            ]
    for col in cols:
        if col[1] in df.columns.values:
            df[col[0]] = df[col[1]]

    return df

# Function for creating percent_changes year on year
def calculate_pc_change(df, columns, 
              id_column='grid_id', 
              year_column='published_year',
              column_name_add='_pc_change'):
    df = df.sort_values(year_column, ascending=True)
    for column in columns:
        new_column_name = column + column_name_add
        df[new_column_name] = list(df.groupby(id_column)[column].pct_change()*100)   
    return df

# Function for calculating confidence intervals
def calculate_confidence_interval(df, columns,
                                  total_column='total',
                                  column_name_add='_err'):
    for column in columns:
        new_column_name = column + column_name_add
        # Half-width in percentage points: 3.43 * sqrt(p * (1 - p) / n)
        proportion = df[column] / 100
        df[new_column_name] = 100 * 3.43 * (
            proportion * (1 - proportion) / df[total_column]
        ) ** 0.5
    return df

In [None]:
# Do the data cleanup and a few calculations for graphing
clean_geo_names(full)
clean_geo_names(named)
full = calculate_confidence_interval(full,
                                 ['percent_gold', 
                                  'percent_green', 
                                  'percent_oa'])
named = calculate_pc_change(named, 
              ['gold', 
               'green', 
               'total_oa', 
               'total'])
named = calculate_confidence_interval(named,
                                 ['percent_gold', 
                                  'percent_green', 
                                  'percent_oa'])
full = nice_column_names(full)
named = nice_column_names(named)

## Top 100 global universities in terms of total open access, gold open access and green open access

In Figure 2, we present the top 100 universities in each of the categories of total open access, gold open access and green open access for publications assigned to the year 2017 (see Section 3.1 for a discussion on conditions for inclusion; and see Supplementary Figures 1 and 2 for equivalent plots for 2016 and 2018). This is, to our knowledge, the first set of university rankings that provides a confidence interval on the quantitative variable being ranked and compensates for the multiple comparisons effect. Across this top 100, the statistical difference between universities at 95% confidence shows that a simple numerical ranking cannot be justified. The high performance of a number of Latin American and African universities, together with a number of Indonesian universities, particularly with respect to open access publishing (i.e., gold open access), is also striking. For Latin America this is sensitive to our use of Microsoft Academic as a data source (Huang et al., 2020a), showing the importance of an inclusive approach. The outcomes for Indonesian universities are also consistent with the latest report on country-level analysis (Van Noorden, 2019). These suggest that the narrative of Europe and the USA driving a publishing-dominated approach to open access misses a substantial part of the full global picture.
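
For reference, the interval half-width computed by this notebook's `calculate_confidence_interval` helper can be reproduced for a single university as below. This is a sketch of that calculation only: the multiplier 3.43 is copied from the helper and its derivation is not given in this section, the function name `interval_half_width` is hypothetical, and the example values (40% open access over 1,000 outputs) are invented for illustration.

```python
import math


def interval_half_width(percent: float, total: int, multiplier: float = 3.43) -> float:
    """Half-width, in percentage points, of a normal-approximation interval."""
    p = percent / 100
    return 100 * multiplier * math.sqrt(p * (1 - p) / total)


# Invented example: 40% open access measured over 1,000 outputs.
half_width = interval_half_width(40.0, 1000)
print(round(half_width, 2))  # 5.31
```

With these invented numbers the interval runs from roughly 34.7% to 45.3%, so two universities a few percentage points apart would have overlapping intervals and could not be ordered with confidence.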

The highest performers in terms of repository-mediated open access (i.e., green open access) are dominated by UK universities. This is not surprising given the power of the open access mandate associated with the Research Excellence Framework to drive university behaviour. It is perhaps interesting that few US universities appear in this group (with CalTech and MIT the exceptions). This suggests that while the National Institutes of Health mandate has been very effective at driving open access to the biomedical literature limited inroads have been made into other disciplines in the US context, despite the White House memorandum. As was seen in the Leiden Ranking, Bilkent University from Turkey also emerges as a stand-out performer in repository-mediated open access. 

<div style="page-break-after: always; visibility: hidden"> 
\pagebreak 
</div>

#### Figure 2: Top 100 universities in terms of performance in proportions of total open access, open access publishing (gold OA) and repository-mediated open access (green OA) for 2017.

In [None]:
#from matplotlib.patches import Patch
from matplotlib.lines import Line2D
legend_elements = [Line2D([0], [0], color='orange', lw=8, label='Asia'),
                   Line2D([0], [0], color='limegreen', lw=8, label='Europe'),
                   Line2D([0], [0], color='dodgerblue', lw=8, label='North America'),
                   Line2D([0], [0], color='brown', lw=8, label='Latin America'),
                   Line2D([0], [0], color='magenta', lw=8, label='Africa'),
                   Line2D([0], [0], color='red', lw=8, label='Oceania')]
# Create the figure
fig, ax = plt.subplots(figsize=(16,0.7))
ax.legend(handles=legend_elements, loc='lower center', frameon=True, ncol=6)
plt.axis('off')
plt.show()
params = [
            {
            'chart_class': charts.ConfidenceIntervalRank,
            'rankcol': 'Open Access (%)',
            'errorcol': 'percent_oa_err',
            'filter_name': 'published_year',
            'filter_value': 2017
            },
            {
            'chart_class': charts.ConfidenceIntervalRank,
            'rankcol': 'Total Gold OA (%)',
            'errorcol': 'percent_gold_err',
            'filter_name': 'published_year',
            'filter_value': 2017
            },
            {
            'chart_class': charts.ConfidenceIntervalRank,
            'rankcol': 'Total Green OA (%)',
            'errorcol': 'percent_green_err',
            'filter_name': 'published_year',
            'filter_value': 2017
            }
]
figdata = named[(named.percent_green_err<17)&
                              (named.total*named.percent_green/100>5)&
                              ((named.total*(1-named.percent_green/100)>5))&
                              (named.percent_gold_err<17)&
                              (named.total*named.percent_gold/100>5)&
                              ((named.total*(1-named.percent_gold/100)>5))&
                              (named.percent_oa_err<17)&
                              (named.total*named.percent_oa/100>5)&
                              ((named.total*(1-named.percent_oa/100)>5))]
figure2 = charts.Layout(figdata, params)
figure2.process_data()
figure2.plot(wspace=1.36);



##  The global picture and its evolution
To examine the global picture for the full set of 1,207 universities and to interrogate different paths to open access we plot the overall level of repository mediated (green) and publisher mediated (gold) open access for each university over time. 

The levels of total open access, gold open access and green open access for 1,207 universities for publications in 2017, grouped by country, can be found in Supplementary Figure 3 (with comparable figures for other years given in Supplementary Figures 4 and 5). The countries are ordered by the median total open access percentage for each country. Amongst countries with a large number of universities in the dataset the UK is a clear leader, with Indonesia, Brazil, Colombia, the Netherlands, and Switzerland showing a strong performance. Corresponding results grouped by regions of these universities are also provided in Supplementary Figures 6 to 8.

Consistent with previously reported results there are high performing universities in Latin America (i.e., Peru, Costa Rica, Colombia, Chile, Brazil) and Uganda as well as a range of European countries. Latin American countries owe their performance in large part to open access journals (i.e., gold open access) whereas European countries see a more significant contribution from repository based open access (i.e., green open access). Many countries have universities that are high performers in terms of the proportion of open access, while overall country performance can be linked to policy mandates and infrastructure provision.

An alternate view is shown in Figure 3 plotting total levels of open access publishing (gold) versus total levels of repository open access (green) for each of the 1,207 universities as a scatter plot. This plot is helpful to understand regional variation in the paths taken towards open access. Points are coloured by region as per Figure 2. Figure 3 presents the results for 2017 (with changes over time shown in the animated version), with stationary time-stamped versions given in Supplementary Figure 9. 
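
Because an output can be open through both routes, the gold and green percentages plotted against each other need not add up to the total open access percentage. The toy example below makes this concrete; the output identifiers, the set names and the `percent` helper are all hypothetical and are not taken from the dataset.

```python
# Hypothetical output identifiers for one university.
all_outputs = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}
gold = {"a", "b", "c", "d"}   # open via the publisher
green = {"c", "d", "e"}       # open via a repository


def percent(subset, whole):
    """Share of `whole` that falls in `subset`, as a percentage."""
    return 100 * len(subset) / len(whole)


percent_gold = percent(gold, all_outputs)        # 40.0
percent_green = percent(green, all_outputs)      # 30.0
# Total open access counts each output once, so it uses the union of the two sets.
percent_oa = percent(gold | green, all_outputs)  # 50.0
print(percent_gold, percent_green, percent_oa)
```

Here outputs "c" and "d" are open through both routes, so total open access is 50% rather than the 70% obtained by adding the two axes.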

Overall universities in Oceania (Australia and New Zealand) and North America (Canada and the US) lag behind comparators in Europe (on repository-mediated open access) and Latin America (on gold open access). Asian universities are highly diverse. As seen in Figure 2 there are some high performers in the top 100s, particularly for open access publishing, but many also lag. Africa is also highly diverse but with a skew towards high performance, with an emphasis on open access publishing.

Latin American institutions show high levels of open access publishing throughout the period illustrated. This is due to substantial infrastructure investments in systems like Redalyc, SciELO and others in Latin America starting in the 1990s. Another effect of policy intervention may be visible in the time series as North American universities shift to the right, showing an increase in repository-mediated open access from 2007-2010, possibly in response to the National Institutes of Health public access policy. This is overtaken by a substantial shift right by European, mostly UK, universities in 2015 following UK funding council policies requiring repository-mediated open access for inclusion of research outputs in the Research Excellence Framework. 

<div style="page-break-after: always; visibility: hidden"> 
\pagebreak 
</div>

#### Figure 3: Open access publishing (gold OA) vs repository-mediated open access (green OA) by institution for 2017 (and 2007-2018 for animated version). Each point plotted is a university, with size indicating the number of outputs analysed and colour showing the region. Articles can be open access through both publishing and repository routes so x and y values do not sum to give total open access.

In [None]:
sns.set_context("paper",rc={"legend.fontsize":16,"axes.labelsize":16},font_scale=2)

In [None]:
figdata = full[(full.percent_green_err<17)&
                            (full.total*full.percent_green/100>5)&
                              ((full.total*(1-full.percent_green/100)>5))&
                              (full.percent_gold_err<17)&
                              (full.total*full.percent_gold/100>5)&
                              ((full.total*(1-full.percent_gold/100)>5))&
                              (full.percent_oa_err<17)&
                              (full.total*full.percent_oa/100>5)&
                              ((full.total*(1-full.percent_oa/100)>5))]
figure3 = charts.ScatterPlot(figdata, 
                                   'Total Green OA (%)', 
                                   'Total Gold OA (%)', 
                                   'Year of Publication', 2017,
                                   hue_column='Region', 
                                   size_column='Total Publications')
figure3.process_data()
figure3.plot(xlim=(0,119), ylim=(0,100), figsize=(10,8));

In [None]:
#an animated version of green OA versus gold OA for all years from 2005 to 2018
figure3ani = charts.ScatterPlot(figdata, 
                                'Total Green OA (%)', 
                                'Total Gold OA (%)', 
                                'Year of Publication', (2006,2019),
                                hue_column='Region',
                                size_column='Total Publications')
figure3ani.process_data()
figure3ani.animate(xlim=(0,119), ylim=(0,100), figsize=(10,8))

## The effects of policy interventions

The previous figures give some indication of signals of policy interventions. If our goal is to provide data on the effectiveness of policy then our analysis should be capable of identifying the effects of policy change. We tested our ability to detect the effects of four aspects of policy implementation. In 2012 the UK Research Councils, following the Finch Report, provided additional funding to individual universities to support open access publishing. The amount of additional funding was related to existing research council funding, not to the number of outputs of that university.

In Figure 4a we show the annual change in open access publishing (‘pure’ open access plus hybrid) for three UK universities amongst those with the largest additional funding and three amongst those with significantly less additional funding (Lawson, 2018). In either 2012 or 2013 we see a jump in open access publishing across all the universities. When we separate out publishing in journals listed in the Directory of Open Access Journals, no clear trend is depicted by the size of the jumps. In contrast, there is a clearer link between the scale of funding and the changes in hybrid open access publishing (see Supplementary Figures 10a and 10b). This shows an effect of the funding and policy. However, as the additional funding tails off in 2015 the scale of growth falls back.
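
The year-on-year percentage change that this notebook's `calculate_pc_change` helper computes with pandas can be illustrated for a single university as follows. This is a plain-Python sketch with invented counts; the function name `percent_change` is hypothetical, and we assume, rather than know from this section, that Figure 4a plots a quantity of this kind.

```python
def percent_change(counts_by_year):
    """Percentage change of each year's count relative to the previous available year."""
    years = sorted(counts_by_year)
    changes = {}
    for previous, current in zip(years, years[1:]):
        before = counts_by_year[previous]
        if before:
            changes[current] = 100 * (counts_by_year[current] - before) / before
        else:
            # A zero baseline has no defined percentage change.
            changes[current] = float("nan")
    return changes


# Invented counts of gold open access outputs for one university.
gold_counts = {2011: 200, 2012: 250, 2013: 300}
changes = percent_change(gold_counts)
print(changes)  # {2012: 25.0, 2013: 20.0}
```

The first year has no entry because there is no earlier year to compare against.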

Figure 4b shows the growth of content in UK university repositories from 2000 to 2017, in contrast to two universities from other regions. In 2015 deposit of a research output in a repository became a requirement for eligibility for inclusion in the UK Research Excellence Framework. This policy shift was profound because it relates to an assessment exercise and funding which covers all disciplinary areas and all universities. It is unique globally in terms of both its reach and its effectiveness. The dominance of the top 100 for both overall open access and repository-mediated open access by UK universities, as well as the approach to 100% coverage being made by such a large number of universities, is driven in large part by that policy intervention.

Figure 4c focuses on the take-up of hybrid open access publishing options in the Netherlands following deals with Springer in 2014 and Wiley in 2016. The consistent dip in hybrid adoption in the Netherlands to 2014 does not have an obvious explanation, except perhaps that researchers were waiting to see the result of negotiations. Across the Netherlands, levels of publishing in hybrid open access journals show a sharp increase from 2014 onwards, with a less pronounced effect (smoother increases) for publishing in pure open access journals (see Supplementary Figures 10c and 10d).

Finally, in Figure 4d we show the effect of subtle differences in policy relating to acceptable embargo periods. UK Research and Funding Council policies have been aggressive in reducing embargo lengths, mandating six months for STEM subjects and twelve months for HSS subjects. The effect of embargoes can be seen in data for repository-mediated open access as a dip in the most recent years of publication. Using Unpaywall data from late 2019, we see a dip in repository-mediated open access performance for UK universities in 2018 but a limited effect on 2017. By comparison, for three of the highest performing US universities we see an extended dip in performance, indicative of an acceptance of longer embargoes.

<div style="page-break-after: always; visibility: hidden"> 
\pagebreak 
</div>

#### Figure 4: Monitoring the effect of policy interventions for selected groups of universities. Subfigure A represents the annual change in percentage (rolling current year percentage minus the previous year percentage) of gold OA for six UK universities. The first three universities are those with larger additional funding, in contrast to the last three universities, which received less additional funding. Subfigure B represents the annual percentage of green OA through the institutional repositories for four UK universities in contrast to two universities from elsewhere. Subfigure C records the annual percentages of hybrid OA at six universities in the Netherlands. Subfigure D represents three pairs of UK versus US universities, grouped roughly by percentages of total OA. The annual percentages of total green OA are depicted for each university.

In [None]:
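# Sort so each institution's years are consecutive, then take year-on-year differences.
# Grouping by grid_id keeps the first year of one institution from being
# differenced against the last year of the previous institution.
named = named.sort_values(['grid_id', 'published_year'])
by_uni = named.groupby('grid_id')
named['Change in % Gold'] = by_uni['percent_gold'].diff()
named['Change in % Gold DOAJ'] = by_uni['percent_gold_just_doaj'].diff()
named['Change in % Hybrid'] = by_uni['percent_hybrid'].diff()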

In [None]:
sns.set_context("paper",rc={"legend.fontsize":10,"axes.labelsize":10},font_scale=1)

In [None]:
plots = [
            {
            'year_range': (2008,2018),
            'unis': [
                'grid.83440.3b', # University College London 
                'grid.5335.0', # University of Cambridge 
                'grid.8756.c', # University of Glasgow
                'grid.6571.5', # Loughborough University
                'grid.11914.3c', # University of St Andrews
                'grid.11201.33', #Plymouth University
                    ],
            'y_column': 'Change in % Gold',
            'markerline' : 2012
            },
            {
            'year_range': (2008,2018),
            'unis': [
                'grid.6571.5', # Loughborough University
                'grid.5337.2', # University of Bristol
                'grid.7445.2', # Imperial College
                'grid.83440.3b', # University College London
                'grid.5170.3', # TU Denmark
                'grid.20861.3d', # CalTech
                    ],
            'y_column': 'Green in Institutional Repository (%)',
            'markerline' : 2015
            },
            {
            'year_range': (2008,2018),
            'unis': [    
                'grid.5132.5', # Leiden University
                'grid.4830.f', # University of Groningen
                'grid.5477.1', # Utrecht University
                'grid.5590.9', # Radboud University Nijmegen
                'grid.12380.38', # VU University Amsterdam
                #'grid.6852.9', # Eindhoven University of Technology
                    ],
            'y_column' : 'Hybrid OA (%)',
            'markerline' : 2014
            },
            {
            'year_range': (2012,2019),
            'unis': [
                'grid.7445.2', # Imperial College
                'grid.20861.3d', # CalTech
                'grid.5335.0', # University of Cambridge 
                'grid.21107.35', # Johns Hopkins
                'grid.9759.2', # University of Kent
                'grid.205975.c' # UC, Santa Cruz
                    ],
            'y_column': 'Total Green OA (%)',
            'markerline' : 2017,
            'ylim' : (50,70)
            }
]
figure4 = charts.TimePlotLayout(named, plots)
figure4.process_data()
fig = figure4.plot(figsize=(15,10), 
             wspace=0.3, 
             ylabel_adjustment=0.025, 
             panel_labels=True, 
             panellable_adjustment=0.02)

axes = fig.axes
for ax in axes[0:6]:
    ax.set_ylim(-1,7)
for ax in axes[6:12]:
    ax.set_ylim(0,70)
for ax in axes[12:17]:
    ax.set_ylim(0,19)
for ax in axes[17:19]:
    ax.set_ylim(43,72)
for ax in axes[19:21]:
    ax.set_ylim(40,65)
for ax in axes[21:23]:
    ax.set_ylim(37,66)

## Different institutional paths towards open access

In both Figure 3 and Figure 4 we see evidence of different paths towards open access, emphasising publishing or repository-mediated routes, depending on the context and resources. The idea of mapping these paths is shown explicitly for a subset of universities in Figure 5, where the levels of open access for each university are plotted over time.

Figure 5 shows the paths taken by two sets of UK universities. Three examples are shown of universities that received substantial funding from the UK research councils for open access publishing. An alternate route, emphasising repository-mediated open access, is seen for three universities that received less funding. While UK universities have a strong position on open access across the board, this plot shows that they have applied differing strategies to achieve it.

In contrast, the Latin American institutions already have high levels of open access publishing at our earliest time point. This is due to substantial infrastructure investments in systems like Redalyc, SciELO and others in Latin America starting in the 1990s. Our data suggest a fall in overall open access amongst Latin American universities from 2012 onwards, which we ascribe to increased pressure to publish in "international" journals, which are often subscription-based and for which Latin American scholars are reluctant or unable to pay hybrid article processing charges (APCs).

<div style="page-break-after: always; visibility: hidden"> 
\pagebreak 
</div>

#### Figure 5: Comparing different paths to open access (gold OA versus green OA ) for a selected set of universities.

In [None]:
sns.set_context("paper",rc={"legend.fontsize":16,"axes.labelsize":16},font_scale=1)
comparison = ['grid.83440.3b', # University College London 
              'grid.5335.0', # University of Cambridge 
              'grid.8756.c', # University of Glasgow
              'grid.6571.5', # Loughborough University
              'grid.11914.3c', # University of St Andrews
              'grid.11201.33', #Plymouth University
              'grid.11899.38', # University of Sao Paulo
              'grid.410543.7', # Sao Paulo State University
              #'grid.9486.3', # National Autonomous University of Mexico
              'grid.411221.5' # Universidade Federal de Pelotas           
             ]

colorpalette_sel_uni=[
                'green',
                'red',
                'maroon',
                'royalblue',
                'darkviolet',
                'darkorange',
                'grey',
                'blue',
                #'pink',
                'lightcoral'
]

figure5 = charts.TimePath(named, (2007,2018), 
                   comparison, 
                   'Total Green OA (%)', 'Total Gold OA (%)',
                   hue_column='University Name')

figure5.process_data()
figure5.plot(xlim=(0,140), ylim=(0,100), figsize=(10,8), colorpalette=colorpalette_sel_uni)

In [None]:
#animated version of comparison of selected universities over time.
figure5.animate(xlim=(0,140), ylim=(0,100), figsize=(10,8), colorpalette=colorpalette_sel_uni)

# Discussion

## Implications for evaluating open access and limitations

Previous work has been mostly limited to one-off evaluations and provided a limited basis for longitudinal analysis. Our analysis process includes automated approaches for collecting the outputs related to specific universities, and the analysis of those outputs. Currently the addition of new universities and the updating of large data sources are partly manual, but we expect to automate this in the near future. Along with the nature of this article, this will provide an updatable report and longitudinal dataset that can provide a consistent and growing evidence source for open access policy and implementation analysis.

While it is clear (Huang et al., 2020a) that our analysis has limitations in its capacity to provide a consistent estimate of open access status across all universities, our approach does provide a reproducible and transparent view of overall global performance. There are challenges to be addressed with respect to small universities and research organisations, and we have taken a necessarily subjective view of which institutions to include. We use the Šidák correction to control the familywise error rate in multiple comparisons (this essentially results in individual confidence intervals and margins of error being evaluated at 99.95%). Institutions with a margin of error greater than 17, for any of total open access, gold open access or green open access, were also removed from the data used to generate various figures. This is in addition to the conventional conditions for normal approximation (see Huang et al., 2020a). Our approach systematically leaves out most universities with a very small number of outputs (i.e., fewer than 100 outputs), and universities with very extreme open access proportions and a relatively small number of outputs. This essentially leaves out universities for which we have very little confidence in the corresponding data, or for which the number of outputs is too small to make any comparable judgements. This is also roughly in line with our intended focus on research-intensive universities, but with a largely inclusive approach. While these small institutions will be interesting to look at, they require a different approach to analysis in our view. The use of alternative methodologies for multiple comparisons and for quantifying the degree of dissimilarity between universities' open access scores also requires further study. The filter described above was applied to Figures 2 and 3 in the main text and their corresponding figures in the Supplementary Figures. However, data for the full set of universities are presented in Supplementary Figures 3 to 8.
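
As an illustration of the filtering logic described in this paragraph, the following is a minimal sketch rather than the code used to produce our figures. The function names, the familywise level of 0.05, the number of comparisons (100), and the minimum count of 5 for the normal approximation are all assumptions made for the example; the exact conditions applied are those described in Huang et al. (2020a). With these illustrative values the per-comparison confidence level comes out close to the 99.95% quoted above.

```python
from math import sqrt
from statistics import NormalDist


def sidak_confidence(familywise_alpha: float, n_comparisons: int) -> float:
    """Per-comparison confidence level under the Sidak correction."""
    return (1.0 - familywise_alpha) ** (1.0 / n_comparisons)


def margin_of_error(n_oa: int, n_total: int, confidence: float) -> float:
    """Normal-approximation margin of error, in percentage points, for an OA share."""
    p = n_oa / n_total
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return 100.0 * z * sqrt(p * (1.0 - p) / n_total)


def keep_institution(n_oa: int, n_total: int, confidence: float,
                     max_moe: float = 17.0, min_count: int = 5) -> bool:
    """Return True if the estimate passes both filters.

    min_count is an assumed threshold for the usual normal-approximation
    conditions (enough OA and enough non-OA outputs); max_moe is the
    margin-of-error cut-off of 17 mentioned in the text.
    """
    if n_oa < min_count or (n_total - n_oa) < min_count:
        return False
    return margin_of_error(n_oa, n_total, confidence) <= max_moe


# Illustrative values only: familywise alpha of 0.05 over 100 comparisons.
confidence = sidak_confidence(0.05, 100)
small_uni_kept = keep_institution(40, 90, confidence)     # few outputs
large_uni_kept = keep_institution(400, 1000, confidence)  # many outputs
print(round(confidence, 4), small_uni_kept, large_uni_kept)
```

In this sketch the check would be applied separately to the total, gold and green open access counts, and an institution would be retained only if all three pass.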

We have used multiple sources of bibliographic information with the goal of gaining a more inclusive view of research outputs. Despite this, there are still limitations in the coverage of these data sources, and a likely bias towards STEM disciplines. In addition, the focus of Unpaywall on analysis of outputs with Crossref DOIs means that we are missing outputs for disciplines and output types where the use of DOIs is limited (such as the humanities and books). Furthermore, due to the nature of this work and to limitations on the use of Web of Science and Scopus APIs, we have collected data from these two sources over an extended period of time. Changes in these datasets will not be transparently reflected in our final analysis, although we expect such changes to be small. For other data sources we are able to precisely define the data dump used for our analysis, supporting reproducibility as well as modified analyses.

## Requirements and approaches for improving open access evaluation

There have been many differing assessments of open access performance over the past 10-15 years. Many of the differences between these have been driven by details in the approach. This, combined with limited attention to reproducibility, has led to confusion and a lack of clarity on the rate and degree of progress to open access (Green, 2019). As noted above, we believe that a minimum standard should be set in providing assessments of open access to support evidence-based policy making and implementation.

1.	The set of outputs included in each category (here institutions) and a clear description of how they were collected must be provided. 
2.	A clearly defined, open and auditable data source on open access status.
3.	A clearly defined and implementable description of how open access status data is interpreted.
4.	Provision of derived data and analysis in auditable form.

With such a minimum standard in hand we can clearly identify approaches to improve the quality, transparency and reproducibility of open access performance assessments. There is significant opportunity for improving the data sources on sets of outputs and how they can be grouped (e.g. by people, groups, discipline, funder, organisation, country etc). Improvements to institutional identifier systems such as the Research Organisation Registry, increased completeness of metadata records, particularly that provided by publishers via Crossref on affiliation, ORCIDs and funders, and enhancing the coverage of open access status data (for instance by incorporating data from CORE and BASE), will all enhance coverage. There are also opportunities to expand the coverage both of output types and geographies by incorporating a wider range of bibliographic data sources as inputs.

Broadly speaking, we would advocate for the evidence base for open access policy and implementation to be built on open and transparent data. While the majority of sources we have used in this report are open (Microsoft Academic, Crossref, Unpaywall), we have elected to supplement these with data from two proprietary sources, Web of Science and Scopus. Our argument for taking this approach has been laid out separately (Huang et al., 2019; 2020a). Here we note that our sensitivity analysis (Huang et al., 2019; 2020a) shows that, where the goal is to describe sector-wide trends and movements, the difference between using the open Microsoft Academic data alone and incorporating proprietary data is modest.

## Implications for policy intervention and implementation

Our results have significant implications for the details of policy interventions. Firstly, we have demonstrated the ability to detect signals of policy interventions in the behaviour of institutions. We see clear effects and results arising from the efforts of national funders and policy makers, particularly in the United Kingdom. The combined policy change and funding provided by the UK Research Councils in 2012 is associated with an increase in the level of open access publishing (gold open access), and the level of increase appears to be associated with the level of funding provided. Similarly, the requirement for outputs to be deposited in a repository for eligibility for the 2021 Research Excellence Framework is associated with a substantial increase in repository-mediated open access from around 2015.

Our results may also have implications for deciding on the effectiveness of directly funding open access publishing. It is perhaps surprising to some readers that the overall levels of open access publishing (i.e., gold open access) in the UK are not higher. Specific funders, most notably the Wellcome Trust, have achieved very high levels of open access for articles from research they support through the provision of funding for open access publication. In addition, the UK Research Councils invested significant funds, as well as political capital, in supporting gold open access. However, these have not translated to the expected levels of open access publishing across the output of UK institutions. Indeed, the majority of gains over the past five years would appear to have come from repository-mediated open access. 

Alongside this, the continued leadership of Latin American institutions on open access publishing levels is the continuation of a trend set more than a decade ago through the provision of publishing infrastructures. Taken alongside the clear rise in hybrid open access in the Netherlands in response to publish-and-read agreements, this suggests that increasing levels of open access publishing through article processing charges is potentially expensive compared to the costs of providing infrastructure. In that sense, the increased levels of access delivered through repositories, particularly in northwestern Europe, are substantially cheaper. 

Another interesting natural experiment is how the strength of funder actions is associated with overall change in levels of open access. In the Netherlands and the UK in particular, but also in the US, where funder policies have moved from encouragement, to mandates, to monitoring with sanctions for non-compliance there are substantial shifts in overall levels of open access. By contrast, in countries where policy remains effectively at the recommendation level, such as Australia, levels of open access lag significantly.

Perhaps most interesting in light of the debates surrounding Plan S is the evidence that it is in Latin America and Africa that levels of open access publishing are at their highest. As noted above, in Latin America this is a strong signal of the effectiveness of infrastructures such as SciELO in supporting uptake of open access practices. In the case of Africa there may be effects of funder requirements (with funders such as the Gates Foundation and Wellcome Trust that have strong open access requirements playing a significant role) as well as disciplinary spread. In both cases we are likely to have a limited view of the full diversity of research outputs due to their poor capture in information systems from the North Atlantic.


## Implications for reaching "100% open access"

What can our results tell us about the feasibility of reaching 100% open access? One signal we see across our dataset is that of saturation. In the animated version of Figure 3 (or the version over several years in Supplementary Figures) there is a clear signal of saturation with respect to open access publishing (gold open access) for European and North American universities. With few exceptions, institutions do not achieve levels of gold open access greater than 40% and this level is stable from 2014-2018. Similarly in Figure 4 we see evidence of shifts in response to stimuli (funding and policy interventions) which then stabilise. Even those UK universities with very high levels of repository open access (green) see a slowing down of the rise in levels a few years after the Research Excellence Framework policy intervention.

These signals suggest that the last few percent may be very difficult, and possibly expensive, to achieve. There will always be areas and cases where open access is challenging. Achieving '100%' may require a tighter definition of what should be in scope. For those areas where we see signals of saturation much lower than 100%, these are likely signals of the complexity of the system, and of large categories of outputs where open access is harder to achieve, or where the motivation of institutions (including authors, libraries, and other support staff) to achieve it is lower. These are most likely to be disciplinary differences. 

This suggests that making the next step change in open access practice requires a different approach. For those disciplines where practice is not changing, we need a deeper understanding of the barriers, and this will require further research and analysis. It will also require improved data sources, as those areas with the lowest open access, humanities and social sciences, have outputs that are the least well represented in data. 

Levels of open access by university, even for the best in the world, appear to fall short of those achieved by specific funders with strong policy and implementation programs. This likely reflects the greater complexity and diversity in an institutional research portfolio. However, the clear success of specific universities in efficiently achieving high levels of open access also illustrates that it is possible to make a significant difference through thoughtful support for action and culture change.

# Conclusion

The evidence base for policy development and implementation for open access has been hampered by a lack of consistency in analysis results and of clarity on how those results were obtained. In particular, it has been challenging to provide longitudinal and transparent results to monitor the effects of policy and support interventions. While not all readers will agree with the choices we have made in implementing an analysis process, we aimed to provide sufficient transparency and reproducibility to allow for replication, critique and alternative approaches to this analysis. This can underpin a higher quality of debate and policy development globally, and aid in learning from successes in other regions.

Our analysis of open access performance by research-intensive universities highlights the importance of robust policy and support in driving change. Geographies where there is a long history of infrastructure provision, such as Latin America, show very high levels of open access publishing. The United Kingdom is a particular case where there are consistently very high levels of open access, particularly that provided by repositories, in response to a strong and well supported policy environment. We also see that different institutions may choose to take different paths to delivering open access depending on resources, culture and systems in place.

The value of analysis at the level of universities is that we gain a picture of open access performance across a diverse research ecosystem. We see differences across countries and regions, and differences between universities within countries. Overall we see that there are multiple paths towards improving access, and that different paths may be more or less appropriate in different contexts. Most importantly, while further research is needed to unpick the details of the differences in open access provision, we hope we have provided a framework for enabling that longitudinal analysis to be taken forward and used wherever it is needed.

# Acknowledgements 

This work was funded by the Research Office of Curtin University through a strategic grant, the Curtin University Faculty of Humanities, and the School of Media, Creative Arts and Social Enquiry.

# References

1. Aguado-López, E., & Becerril-García, A. (2019, November 6). Latin America’s longstanding open access ecosystem could be undermined by proposals from the Global North [LSE Latin America and Caribbean]. https://blogs.lse.ac.uk/latamcaribbean/2019/11/06/latin-americas-longstanding-open-access-ecosystem-could-be-undermined-by-proposals-from-the-global-north/

1. Agustini, B., & Berk, M. (2019). The open access mandate: Be careful what you wish for. Australian & New Zealand Journal of Psychiatry, 53(11), 1044–1046. https://doi.org/10.1177/0004867419864436

1. Barbour, G., & Nicholls, S. (2019). Open Access: Should one model ever fit all? Australian Quarterly, 90(3), 3–9. https://www.jstor.org/stable/26687171

1. Brainard, J. (2019). Scientific societies worry about threat from Plan S. Science, 363(6425), 332–333. https://doi.org/10.1126/science.363.6425.332

1. Debat, H., & Babini, D. (2019). Plan S in Latin America: A precautionary note. PeerJ Preprints, 7, e27834v2. https://doi.org/10.7287/peerj.preprints.27834v2

1. Gargouri, Y., Larivière, V., Gingras, Y., Brody, T., Carr, L., & Harnad, S. (2012). Testing the Finch Hypothesis on Green OA Mandate Ineffectiveness. ArXiv, 1210.8174. https://arxiv.org/abs/1210.8174

1. Gómez-Fernández, J. C. (2019). Plan S for publishing science in an open access way: Not everyone is likely to be happy. Biophysical Reviews, 11, 841–842. https://doi.org/10.1007/s12551-019-00604-4

1. Green, T. (2019). Is open access affordable? Why current models do not work and why we need internet‐era transformation of scholarly communications. Learned Publishing, 32(1), 13–25. https://doi.org/10.1002/leap.1219

1. Haug, C. J. (2019). No Free Lunch—What Price Plan S for Scientific Publishing? The New England Journal of Medicine, 380, 1181–1185. https://doi.org/10.1056/NEJMms1900864

1. Huang, C.-K., Neylon, C., Brookes-Kenworthy, C., Hosking, R., Montgomery, L., Wilson, K., & Ozaygen, A. (2019). Comparison of bibliographic data sources: Implications for the robustness of university rankings. BioRxiv, 750075. https://doi.org/10.1101/750075

1. Huang, C.-K., Neylon, C., Hosking, R., Montgomery, L., Wilson, K., Ozaygen, A., & Brookes-Kenworthy, C. (2020a). Evaluating institutional open access performance: Sensitivity analysis. https://doi.org/10.5281/zenodo.3696857

1. Huang, C.-K., Neylon, C., Hosking, R., Montgomery, L., Wilson, K., Ozaygen, A., & Brookes-Kenworthy, C. (2020b). Data and Intermediate Queries for: Evaluating institutional open access performance: Methodology, challenges and assessment. https://doi.org/10.5281/zenodo.3693221

1. Huang, C.-K., Neylon, C., Hosking, R., Brookes-Kenworthy, C., Montgomery, L., Wilson, K., & Ozaygen, A. (2020c). Jupyter Notebooks for the article: Evaluating institutional open access performance: Methodology, challenges and assessment. #location forthcoming#.

1. Iyandemye, J., & Thomas, M. P. (2019). Low income countries have the highest percentages of open access publication: A systematic computational analysis of the biomedical literature. PLOS One, 14(7), e0220229. https://doi.org/10.1371/journal.pone.0220229

1. Larivière, V., & Sugimoto, C. R. (2018). Do authors comply when funders enforce open access to research? Nature, 562, 483–486. https://doi.org/10.1038/d41586-018-07101-w

1. Lawson, S. (2018). RCUK open access block grant allocation 2013-18. Figshare. https://figshare.com/articles/RCUK_open_access_block_grant_allocation_2013-17/4047315

1. McNutt, M. (2019). Opinion: “Plan S” falls short for society publishers—And for the researchers they serve. Proceedings of the National Academy of Sciences, 116(7), 2400–2403. https://doi.org/10.1073/pnas.1900359116

1. Mongeon, P., & Paul-Hus, A. (2016). The journal coverage of Web of Science and Scopus: A comparative analysis. Scientometrics, 106(1), 213–228. https://doi.org/10.1007/s11192-015-1765-5

1. Montgomery, L., Hartley, J., Neylon, C., Gillies, M., Gray, E., Herrmann-Pillath, C., Huang, C.-K. (Karl), Leach, J., Potts, J., Ren, X., Skinner, K., Sugimoto, C. R., & Wilson, K. (2018). Open Knowledge Institutions: Reinventing Universities. MIT Press Work in Progress. https://wip.mitpress.mit.edu/oki

1. Packer, A. L. (2009). The SciELO Open Access: A Gold Way from the South. Canadian Journal of Higher Education, 39(3), 111–126. http://journals.sfu.ca/cjhe/index.php/cjhe/article/view/479

1. Piwowar, H., Priem, J., Larivière, V., Alperin, J. P., Matthias, L., Norlander, B., Farley, A., West, J., & Haustein, S. (2018). The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles. PeerJ, e4375. https://doi.org/10.7717/peerj.4375

1. Piwowar, H., Priem, J., & Orr, R. (2019). The Future of OA: A large-scale analysis projecting Open Access publication and readership. BioRxiv. https://doi.org/10.1101/795310

1. Rabesandratana, T. (2019). The world debates open-access mandates. Science, 363(6422), 11–12. https://doi.org/10.1126/science.363.6422.11

1. Robinson-Garcia, N., Costas, R., & van Leeuwen, T. N. (2019). Indicators of Open Access for universities. ArXiv, 1906.03840. https://arxiv.org/abs/1906.03840

1. Siler, K., Haustein, S., Smith, E., Larivière, V., & Alperin, J. P. (2018). Authorial and institutional stratification in open access publishing: The case of global health research. PeerJ, 6, e4269. https://doi.org/10.7717/peerj.4269

1. Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.-J., & Wang, K. (2015). An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. http://dx.doi.org/10.1145/2740908.2742839

1. Van Noorden, R. (2019, May 15). Indonesia tops open-access publishing charts. Nature News. http://doi.org/10.1038/d41586-019-01536-5

1. Wang, X., Cui, Y., Xu, S., & Hu, Z. (2018). The state and evolution of Gold open access: A country and discipline level analysis. Aslib Journal of Information Management, 70(5), 573–584. https://doi.org/10.1108/AJIM-02-2018-0023

1. Wang, K., Shen, Z., Huang, C., Wu, C.-H., Eide, D., Dong, Y., Qian, J., Kanakia, A., Chen, A., & Rogahn, R. (2019). A Review of Microsoft Academic Services for Science of Science Studies. Frontiers in Big Data. https://doi.org/10.3389/fdata.2019.00045


In [None]:
# code for converting to PDF in command prompt
# jupyter nbconvert --to pdf --TemplateExporter.exclude_input=True ##notebook_file_name##.ipynb