 ![FREYA Logo](https://github.com/datacite/pidgraph-notebooks-python/blob/master/images/freya_200x121.png?raw=true) | [FREYA](https://www.project-freya.eu/en) WP2 [User Story 4](https://github.com/datacite/pidgraph-notebooks-python/issues/8) | As a funder I want to see how many of the research outputs funded by me have an open license enabling reuse, so that I am sure I properly support Open Science. 
 :------------- | :------------- | :-------------

Funders that support open research are interested in monitoring the extent of open access given to the outputs of the grants they award, both while a grant is active and retrospectively. <p />
This notebook uses the [DataCite GraphQL API](https://api.datacite.org/graphql) to retrieve and report license types of outputs of the following funders to date:
 - [DFG (Deutsche Forschungsgemeinschaft, Germany)](https://doi.org/10.13039/501100001659)
 - [ANR (Agence Nationale de la Recherche, France)](https://doi.org/10.13039/501100001665)
 - [SNF (Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung, Switzerland)](https://doi.org/10.13039/501100001711)

**Goal**: By the end of this notebook you should be able to:
- Retrieve all outputs and associated licenses for three different funders; 
- Plot an interactive bar plot showing for each funder the proportion of outputs issued under a given license type;
- Plot interactive bar plots showing for each funder the proportion of outputs issued under a given license type, faceted by output type and year of its publication respectively.

## Install libraries and prepare GraphQL client

In [None]:
%%capture
# Install required Python packages
!pip install gql requests pandas plotly

In [None]:
# Prepare the GraphQL client
import requests
from IPython.display import display, Markdown
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport

_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)

client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)

## Define and run GraphQL query
Define the GraphQL query to find the outputs and associated licenses for three funders: [DFG (Deutsche Forschungsgemeinschaft, Germany)](https://doi.org/10.13039/501100001659), [ANR (Agence Nationale de la Recherche, France)](https://doi.org/10.13039/501100001665) and [SNF (Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung, Switzerland)](https://doi.org/10.13039/501100001711), up to 200 works per funder (the `maxWorks` value set in the query parameters).

In [None]:
# Generate the GraphQL query: find all outputs and their associated licenses (where available) 
# for three different funders, identified by query1, query2 and query3.
query_params = {
    "query1" : "https://doi.org/10.13039/501100001659",
    "query2" : "https://doi.org/10.13039/501100001665",
    "query3" : "https://doi.org/10.13039/501100001711",
    "maxWorks" : 200
}

funderId2Acronym = {
    "https://doi.org/10.13039/501100001659" : "DFG",
    "https://doi.org/10.13039/501100001665" : "ANR",
    "https://doi.org/10.13039/501100001711" : "SNF"
}

query = gql("""query getGrantOutputsForFundersById(
    $query1: ID!,
    $query2: ID!,
    $query3: ID!,
    $maxWorks: Int!
    )
{
query1: funder(id: $query1) {
  name
  id
  works(first: $maxWorks) {
      totalCount
      licenses {
        id
        title
        count
      }        
      nodes {
        id

        titles {
          title
        }      
        types {
          resourceType
        }
        dates {
          date
          dateType
        }
        versionOfCount
        rights {
          rights
          rightsIdentifier
          rightsUri
        }
        fundingReferences {
          funderIdentifier
          funderName
          awardNumber
          awardTitle
        }
      }
    }
  },
query2: funder(id: $query2) {
  name
  id
  works(first: $maxWorks) {
      totalCount
      licenses {
        id
        title
        count
      }       
      nodes {
        id

        titles {
          title
        }      
        types {
          resourceType
        }
        dates {
          date
          dateType
        }
        versionOfCount
        rights {
          rights
          rightsIdentifier
          rightsUri
        }
        fundingReferences {
          funderIdentifier
          funderName
          awardNumber
          awardTitle
        }
      }
    }
  },  
query3: funder(id: $query3) {
  name
  id
  works(first: $maxWorks) {
      totalCount
      licenses {
        id
        title
        count
      }       
      nodes {
        id
        titles {
          title
        }      
        types {
          resourceType
        }
        dates {
          date
          dateType
        }
        versionOfCount
        rights {
          rights
          rightsIdentifier
          rightsUri
        }
        fundingReferences {
          funderIdentifier
          funderName
          awardNumber
          awardTitle
        }
      }
    }
  }  
}
""")

Run the query defined above via the GraphQL client.

In [None]:
# The gql client expects the variables as a dict, so pass query_params directly
data = client.execute(query, variable_values=query_params)
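
The query in this section spells out the same selection once per funder, and the three blocks differ only in their alias and variable name. As a sketch of one alternative (not how this notebook builds its query), the same text could be generated from a single shared selection. The names `build_funder_query`, `WORKS_SELECTION` and `query_text` are hypothetical, and the selection is abbreviated to the fields that the plots in this notebook read.

```python
# Selection shared by every funder (abbreviated; an assumption of this sketch).
WORKS_SELECTION = """
  name
  id
  works(first: $maxWorks) {
    totalCount
    licenses { id title count }
    nodes {
      id
      types { resourceType }
      dates { date dateType }
      versionOfCount
      rights { rights rightsIdentifier rightsUri }
    }
  }
"""


def build_funder_query(aliases):
    """Return GraphQL query text with one aliased funder lookup per alias."""
    # One ID variable per alias, e.g. "$query1: ID!,"
    variables = "\n".join(f"    ${alias}: ID!," for alias in aliases)
    # One aliased funder block per alias, all sharing the same selection
    blocks = "\n".join(
        f"{alias}: funder(id: ${alias}) {{{WORKS_SELECTION}}}" for alias in aliases
    )
    return (
        "query getGrantOutputsForFundersById(\n"
        f"{variables}\n"
        "    $maxWorks: Int!\n"
        ")\n{\n"
        f"{blocks}\n"
        "}\n"
    )


query_text = build_funder_query(["query1", "query2", "query3"])
print(query_text)
```

The resulting string could be wrapped in `gql(...)` in place of the literal query text; adding a fourth funder would then only need another alias and a matching entry in `query_params`.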

## Display bar plot of number of outputs per license type and funder
Plot an interactive bar plot showing the proportion of outputs issued under a given license type, for each funder.
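
Before plotting, the license facets returned for each funder are collapsed into a few categories: versioned CC BY identifiers keep their own bar segment and everything else is pooled as "other". The following is a minimal sketch of that grouping step in isolation; `group_license_counts` and the sample facet list are hypothetical, and the identifiers only imitate the form matched by the pattern `cc-by-\d+`.

```python
import re
from collections import Counter


def group_license_counts(licenses):
    """Sum output counts per license category, pooling non CC BY ids as 'other'."""
    counts = Counter()
    for lic in licenses:
        # A missing id is treated like any other non-matching license
        license_id = lic["id"] or "Not available"
        if re.search(r"cc-by-\d+", license_id) is None:
            license_id = "other"
        # Accumulate, because several ids can fall into the same category
        counts[license_id] += lic["count"]
    return dict(counts)


# Hypothetical facet data, for illustration only
sample = [
    {"id": "cc-by-4.0", "count": 12},
    {"id": "cc-by-nc-4.0", "count": 5},
    {"id": "cc0-1.0", "count": 3},
    {"id": None, "count": 1},
]
grouped = group_license_counts(sample)
print(grouped)
```

Note that the pattern requires a digit directly after `cc-by-`, so an identifier such as `cc-by-nc-4.0` is counted under "other" in this sketch.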

In [None]:
import plotly.io as pio
import plotly.express as px
from IPython.display import IFrame
import pandas as pd
from operator import itemgetter
import re

xstr1 = lambda s: 'Not available' if s is None else str(s)

# Adapted from: https://stackoverflow.com/questions/58766305/is-there-any-way-to-implement-stacked-or-grouped-bar-charts-in-plotly-express
def px_stacked_bar(df, color_name='License Type', y_name='Metrics', **pxargs):
    idx_col = df.index.name
    m = pd.melt(df.reset_index(), id_vars=idx_col, var_name=color_name, value_name=y_name)
    # For Plotly colour sequences see: https://plotly.com/python/discrete-color/     
    return px.bar(m, x=idx_col, y=y_name, color=color_name, **pxargs, 
                  color_discrete_sequence=px.colors.qualitative.Pastel1)
 

# Map each license type to a dict that in turn maps the position of the output's bar in plot 
# to the count of outputs corresponding to that license type.
licenseType2Pos2Count = {}
    
# Populate license type counts per funder
queries = ['query1', 'query2', 'query3']
# labels contains funder labels in bar plot - each bar corresponds to a single funder
labels = {}
pos = 0
for query in queries:
    funder = data[query]
    labels[pos] = funderId2Acronym[funder['id']]
    for license in funder['works']['licenses']:
        licenseId = xstr1(license['id'])
        outputCount = license['count']
        # Keep versioned CC BY licenses (e.g. cc-by-4.0); group everything else as "other"
        if re.search(r'cc-by-\d+', licenseId) is None:
            licenseId = "other"
        if licenseId not in licenseType2Pos2Count:
            licenseType2Pos2Count[licenseId] = {}
            for pos1 in range(0, len(queries)):
                # Initialise license's counts for each funder
                licenseType2Pos2Count[licenseId][pos1] = 0
        # Accumulate rather than overwrite: several license ids can be grouped under "other"
        licenseType2Pos2Count[licenseId][pos] += outputCount
    pos += 1
        
# Create stacked bar plot
x_name = "Funders"
dfDict = {x_name: labels}

for license in licenseType2Pos2Count:
    dfDict[license] = licenseType2Pos2Count[license]

df = pd.DataFrame(dfDict)
fig = px_stacked_bar(df.set_index(x_name), y_name = "Output Counts")

# Set plot background to transparent
fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)'
})

# Write interactive plot out to html file
pio.write_html(fig, file='out.html')

# Display plot from the saved html file
display(Markdown("<br />License types of each funder's outputs to date, shown as a stacked bar plot - one bar per funder:"))
IFrame(src="./out.html", width=500, height=500)
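
The stacked bars above show absolute output counts. Since funders differ in how many outputs they have, comparing proportions can be easier when the counts are first normalised per funder. A minimal sketch under that assumption follows; `license_shares` is a hypothetical helper and the counts are made up.

```python
def license_shares(counts):
    """Convert license -> count into license -> share of the funder's total."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for a funder without any counted outputs
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


# Hypothetical counts for one funder, for illustration only
shares = license_shares({"cc-by-4.0": 30, "other": 10})
print(shares)
```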

## Plot output counts per license type, funder and year
Plot an interactive bar plot showing for each funder the proportion of outputs published in a given year under a given license type.
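
Each output can carry several dates, so a publication year has to be chosen per output. The rule used in this notebook prefers an `Issued` date and falls back to a `Created` date only when no year has been picked yet. A minimal sketch of that rule as a standalone function, with `pick_year` as a hypothetical name and made-up dates:

```python
def pick_year(dates):
    """Return the year of the 'Issued' date, else of the first 'Created' date, else None."""
    year = None
    for entry in dates:
        candidate = entry["date"].split("-")[0]
        if entry["dateType"] == "Issued":
            # An Issued date always takes precedence
            year = candidate
        elif entry["dateType"] == "Created" and year is None:
            # A Created date is only a fallback
            year = candidate
    return year


# Hypothetical date lists, for illustration only
created_then_issued = [
    {"date": "2018-05-01", "dateType": "Created"},
    {"date": "2019-01-10", "dateType": "Issued"},
]
print(pick_year(created_then_issued))
```

Outputs without an `Issued` or `Created` date yield `None` here, which matches how such outputs end up under a missing year in the counts.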

In [None]:
import plotly.express as px
import re

xstr = lambda s: 'General' if s is None else str(s)

# Populate license type counts per funder
queries = ['query1', 'query2', 'query3']
funder2outputType2year2licenceType2outputCount = {}
# funderAcronym2Name is needed for the plot legend - as funder names are too long to be shown in the plot itself
funderAcronym2Name = {}

# Collect license type counts data into funder2outputType2year2licenceType2outputCount
for query in queries:
    funder = data[query]  
    funderAcronym = funderId2Acronym[funder['id']]
    funderAcronym2Name[funderAcronym] = funder['name']
    if funderAcronym not in funder2outputType2year2licenceType2outputCount:
        funder2outputType2year2licenceType2outputCount[funderAcronym] = {}       
    for node in funder['works']['nodes']:
        if node['versionOfCount'] > 0:
            # If the current output is a version of another one, exclude it
            continue      
            
        # Retrieve output type         
        resource_type = xstr(node['types']['resourceType'])
        if resource_type not in funder2outputType2year2licenceType2outputCount[funderAcronym]:
            funder2outputType2year2licenceType2outputCount[funderAcronym][resource_type] = {}    
            
        # Retrieve output year         
        year = None
        for date_dict in node['dates']:
            y = date_dict['date'].split('-')[0]
            if year is None:
                if date_dict['dateType'] in ['Issued', 'Created']:
                    year = y
            else:
                if date_dict['dateType'] in ['Issued']:
                    year = y
        if year not in funder2outputType2year2licenceType2outputCount[funderAcronym][resource_type]:
            funder2outputType2year2licenceType2outputCount[funderAcronym][resource_type][year] = {}                     
             
        # Retrieve license types         
        if len(node['rights']) == 0:
            node['rights'].append({'rightsIdentifier': 'Not available'})
        for rights in node['rights']:
            rightsIdentifier = rights['rightsIdentifier']
            if rightsIdentifier is None:
                continue
            # Group low frequency rightsIdentifiers into "other"
            if re.search(r'cc-by-\d+|Not available', rightsIdentifier) is None:
                rightsIdentifier = "other"
            if rightsIdentifier not in funder2outputType2year2licenceType2outputCount[funderAcronym][resource_type][year]:
                funder2outputType2year2licenceType2outputCount[funderAcronym][resource_type][year][rightsIdentifier] = 0
            funder2outputType2year2licenceType2outputCount[funderAcronym][resource_type][year][rightsIdentifier] += 1

# Populate data structures for faceted stacked bar plot
funders, outputTypes, outputYears, licenseTypes, outputCounts  = ({}, {}, {}, {}, {})
pos = 0
for funder in funder2outputType2year2licenceType2outputCount:
    for outputType in funder2outputType2year2licenceType2outputCount[funder]:     
        for outputYear in funder2outputType2year2licenceType2outputCount[funder][outputType]:       
            for rightsIdentifier in funder2outputType2year2licenceType2outputCount[funder][outputType][outputYear]:
                funders[pos] = funder
                outputTypes[pos] = outputType                   
                licenseTypes[pos] = rightsIdentifier
                outputYears[pos] = outputYear                 
                outputCounts[pos] = funder2outputType2year2licenceType2outputCount[funder][outputType][outputYear][rightsIdentifier]
                pos += 1
dfDict = {"Funder": funders, "Output Type": outputTypes, "Year": outputYears, "License": licenseTypes, "Output Count": outputCounts}
df1 = pd.DataFrame(dfDict)

# Create funders legend
tableBody=""
for funderAcronym in funderAcronym2Name:
    tableBody += "%s | %s\n" % (funderAcronym, funderAcronym2Name[funderAcronym])

fig1 = px.bar(df1, x="Year", y="Output Count", color="License", barmode="stack", facet_row="Funder", text="Output Type")

fig1.update_traces(texttemplate='%{text}', textposition='inside')
fig1.update_layout(uniformtext_minsize=6, uniformtext_mode='hide')

# Write interactive plot out to html file
pio.write_html(fig1, file='out1.html')

# Display plot from the saved html file
markDownContent="<br />The plot below shows counts per year of each funder's outputs to date corresponding to a given license type (maximum %d outputs per funder)." + \
"<br />**Note**: each bar's section corresponds to a different output type (where possible, output types are shown within bar plots)." + \
"<br />Full information is shown when you mouse-over a section of a bar." + \
"<br />"
display(Markdown(markDownContent % query_params['maxWorks']))
display(Markdown("| Acronym | Funder Name|\n|---|---|\n%s" % tableBody))

IFrame(src="./out1.html", width=1000, height=1000)

## Plot output counts per license type, funder and output type
Plot interactive bar plots showing for each funder the proportion of outputs of a given type published under a given license type.
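
Both faceted plots read from the same long-format table, with one row per combination of funder, output type, year and license. A minimal sketch of flattening the nested counts into such rows is shown below. It uses plain Python so that it runs without pandas; `flatten_counts` is a hypothetical helper and the counts are made up.

```python
def flatten_counts(nested):
    """Turn funder -> output type -> year -> license -> count into a list of row dicts."""
    rows = []
    for funder, by_type in nested.items():
        for output_type, by_year in by_type.items():
            for year, by_license in by_year.items():
                for license_id, count in by_license.items():
                    rows.append({
                        "Funder": funder,
                        "Output Type": output_type,
                        "Year": year,
                        "License": license_id,
                        "Output Count": count,
                    })
    return rows


# Hypothetical counts, for illustration only
nested = {
    "DFG": {"Dataset": {"2019": {"cc-by-4.0": 3, "other": 1}}},
    "SNF": {"Text": {"2020": {"Not available": 2}}},
}
rows = flatten_counts(nested)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would give a table with the same column names that the `px.bar` calls in this notebook refer to.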

In [None]:
fig2 = px.bar(df1, x="Output Type", y="Output Count", color="License", barmode="stack",
              facet_row="Funder", text="Year")
fig2.update_traces(texttemplate='%{text}', textposition='inside')
fig2.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')

# Write interactive plot out to html file
pio.write_html(fig2, file='out2.html')

# Display plot from the saved html file
markDownContent="<br />The plot below shows counts (per output type) of each funder's outputs to date corresponding to a given license type (maximum %d outputs per funder)." + \
"<br />**Note**: each bar section corresponds to a different publication year of an output (where possible, publication years are shown within bar plots)." + \
"<br />Full information is shown when you mouse-over a section of a bar." + \
"<br />"
display(Markdown(markDownContent % query_params['maxWorks']))
display(Markdown("| Acronym | Funder Name|\n|---|---|\n%s" % tableBody))

IFrame(src="./out2.html", width=1000, height=1000)