# Metrics in Context | A Proof of Concept

### Describing Citation Data with [Frictionless](https://frictionlessdata.io/)

*Asura Enkhbayar, 04.02.2021*

---

This notebook presents a proof of concept for a standardized approach to providing provenance information for citation data and, more broadly, scholarly metrics. We will take a bibliometric dataset consisting of four spreadsheets and build a Frictionless [Data Package](https://specs.frictionlessdata.io/data-package/) that describes both the internal structure of the spreadsheets and their provenance. Along the way, I will introduce a new conceptual framework for scholarly metrics step by step and gradually incorporate the dataset into the final data package.

*Note: For a broader overview of the project please refer to the [README](https://github.com/Bubblbu/metrics-in-context) in the project repository.*

<img style="float: center;" width="25%" src="../../materials/assets/fosdem2021.jpg">

Let's quickly import a few libraries and functions:

In [123]:
from pprint import pprint
from pathlib import Path
import yaml

import pandas as pd
from frictionless import describe_resource, Schema, Package

# directories
input_files = Path("./input_files")
schemas = Path("./schemas")

## The Data

For the purposes of this notebook, I will not go into the nitty-gritty details of the dataset. Of course, capturing exactly those details is one of the goals of creating a Frictionless Data Package for scholarly metrics. For now, I will simply provide some high-level explanations for the dataset and a cursory description of each spreadsheet.

The dataset we are working with is provided by [Scite.ai](https://scite.ai/). The particular sample was created by querying Pubmed for articles on Amyotrophic lateral sclerosis (ALS). Scite is one of the newer data sources that not only provide the classic citation link between two documents ("A cites B"), but also attempt to extract what has been established as citation contexts ("A mentions B 3 times in the introduction"). Using the data they provide, I have derived four CSV files:

- `article_metadata.csv`: Article metadata for each DOI like title, authors, journal, etc.
- `mentions.csv`: The individual traced citations between documents and context information such as a text snippet or the mentioning section.
- `citation_metrics.csv`: This spreadsheet contains metrics for every individual citation link derived from its source article such as the total number of outgoing references.
- `article_metrics.csv`: A table with article-level metrics derived from the aforementioned traces.

In bibliometrics and research assessment we often focus on the last file, the citation counts and related metrics for the individual articles. In the rest of this document, we will work our way up from the actual citation in a document to the final citation count and attempt to model and capture each step on our way. 

## Assembling A Metrics Data Package

The conceptual framework introduced throughout the next sections is part of a broader doctoral research project (mine *cough*) and is built around the shift from representation towards performance. This shift, which can also be understood as the move from object-centrism to process-centrism, impinges on scholarly metrics in the form of attention towards processes and practices rather than outputs and metrics. But what do these processes look like in detail, and how can we start to systematically describe them?

The usual story goes something like this: 

> A *citation* links two scholarly texts expressed through an in-text mention and a bibliographic reference.
> This state is then captured by a *citation link* in their databases.
> This data then enables the creation of *citation metrics*.

I would like to retell this story while introducing a new framework and vocabulary for metrics provenance.  

### 1. The Citational Event

> A *citation* links two *scholarly texts* expressed through an in-text mention and a bibliographic reference.

Let's go to the very beginning of any citation count, h-index, or JIF. The citation itself is typically understood as a directional binary relation between two peer-reviewed research articles. However, not only are the kinds of documents and outputs changing (datasets, software, social media), but indexing technology is also improving and providing more than a simple binary state, as the dataset we are using shows. To accommodate these changes, I suggest thinking of *citational events* and their *contexts* as our basic analytic unit. One important difference between this model and the traditional conceptualization of the citation is that we never observe a direct relation between documents, datasets, or any other outputs. These links are always mediated by citational events.

![citational_event](../../materials/assets/citational_event.png)

To avoid falling into the trap of never-ending re-definitions of citable items (e.g. articles, preprints, datasets, software, ...), I propose to re-conceptualize the citation as any form of written or spoken statement that references another one. This occurrence of a discursive act linking to another one can then be understood as an interdiscursive event (Nakassis, 2013). The material manifestations of those acts (written text, published datasets, software) are the contexts of those particular events. I've talked enough about things that bald French men usually talk about, so, without further ado, let's dive into implementing said citational event:

```yaml
type: citational
description: A textual reference from one scholarly document to another scholarly document.
source_event:
  activity: scholarly writing
target_event:
  activity: scholarly writing
```

This YAML object is a model of the event that Scite captures. Moving forward, we will construct several of these provenance schemas to describe the processes of scholarly metrics. Here, `type` denotes that Scite captures *citational* events (events that are not citational include views, downloads, or bookmarks). Both `source_event` and `target_event` are specified by the processes that lead to their creation. In our case, both the citing and the cited events are caused by scholarly writing. This specification does *not* say anything about the kinds of outputs that are considered. For these, we can now move on to...

*Note: This might seem like a very lean definition of the fundamental entity that is being captured. This is on purpose as we will continue to explore the complexity through the additional processes of becoming the final metric.*

*Another note: The event provenance schema models the actual citational event, i.e., the citations happening in scholarly articles. Thus, there is no data representation (a CSV file) that captures these acts of citing.*

#### Contexts

Events do not occur in a vacuum. Citational events happen in their contexts, which we might be more familiar with as research articles, preprints, tweets, news articles, datasets, and many more things. These contexts have a major impact on how the final scholarly metrics look: Web of Science indexes a particular subset of the scientific literature, Google Scholar indexes all the pages that they are aware of, and Scite.ai can work with the documents that they access via publisher agreements or because of open licenses. In the following provenance schema, we will attempt to capture these fundamental differences:

*Note: This is a quite rudimentary example for the dataset at hand. It is mostly speculative (input highly welcome, Scite.ai folks!)*

```yaml
contexts:
    - type: peer-reviewed articles
      coverage:
        - publishers sharing fulltexts with Scite
        - Pubmed Open Access Subset
        - OA articles accessed through Unpaywall
      identified_by:
        - DOI
    - type: preprints
      coverage:
        - bioRxiv
        - medRxiv
      identified_by:
        - DOI
```

As we can see, a data source can contain multiple types of contexts, e.g., preprints and peer-reviewed articles, which might be similar documents but run on very different social and technical infrastructures. More importantly, as this example shows, schemas should be as precise and extensive as possible about the concrete coverage of each context type. Doing so also makes the value of these provenance schemas easier to imagine: context schemas from different data sources could be compared with computational, visual, or purely textual approaches.
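
To make the computational comparison a little more tangible, here is a minimal sketch in plain Python. The Scite schema is copied from the YAML above; the second data source, the helper names `coverage_by_type` and `compare_contexts`, and the report layout are all assumptions invented for illustration and not part of Frictionless or Scite.

```python
# Context schema for Scite, copied from the YAML above.
scite_contexts_prov = {
    "contexts": [
        {
            "type": "peer-reviewed articles",
            "coverage": [
                "publishers sharing fulltexts with Scite",
                "Pubmed Open Access Subset",
                "OA articles accessed through Unpaywall",
            ],
            "identified_by": ["DOI"],
        },
        {
            "type": "preprints",
            "coverage": ["bioRxiv", "medRxiv"],
            "identified_by": ["DOI"],
        },
    ]
}

# Hypothetical second data source, invented purely for illustration.
other_contexts_prov = {
    "contexts": [
        {
            "type": "peer-reviewed articles",
            "coverage": ["Pubmed Open Access Subset"],
            "identified_by": ["DOI", "PMID"],
        }
    ]
}


def coverage_by_type(prov):
    """Map each context type to the set of coverage entries listed for it."""
    return {c["type"]: set(c.get("coverage", [])) for c in prov["contexts"]}


def compare_contexts(a, b):
    """Per context type, report what is shared and what is unique to each source."""
    cov_a, cov_b = coverage_by_type(a), coverage_by_type(b)
    report = {}
    for ctx_type in sorted(set(cov_a) | set(cov_b)):
        in_a = cov_a.get(ctx_type, set())
        in_b = cov_b.get(ctx_type, set())
        report[ctx_type] = {
            "shared": sorted(in_a & in_b),
            "only_a": sorted(in_a - in_b),
            "only_b": sorted(in_b - in_a),
        }
    return report


comparison = compare_contexts(scite_contexts_prov, other_contexts_prov)
for ctx_type, diff in comparison.items():
    print(ctx_type, diff)
```

Because the coverage entries are free text, a comparison like this only works if data sources agree on how they name things, which is one more argument for a shared vocabulary.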

**Adding our first Frictionless Data Resource**

Now that we have defined the events and contexts in Scite.ai, we can start loading the first CSV: `input_files/article_metadata.csv` which is pretty straightforward article metadata. Let's see how we can combine real data and our provenance schemas using Frictionless. Frictionless provides the high-level function `describe_resource` which sets up a `Resource` object in addition to doing some initial processing.

In [124]:
scite_contexts = describe_resource("input_files/article_metadata.csv")

In the next steps, we will set some basic properties to describe the context resource.

In [125]:
scite_contexts.name = "scite-contexts"
scite_contexts.profile = "mic-contexts" # this profile does not exist yet
scite_contexts.description = "Context metadata from Scite.ai | articles identified by DOI"

Frictionless uses [Table Schemas](https://specs.frictionlessdata.io/table-schema/) to describe the structure of tabular data. In this case, an initial schema has already been inferred and set up by `describe_resource`, which we can now expand manually. For instance, the spreadsheet contains one article per row, identified by the `doi` column and described by the `title`, `authors`, `journal`, `type`, and `year` columns. Accordingly, we should set `doi` as the primary key.

In [126]:
scite_contexts.schema.primary_key = "doi"
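
Declaring a primary key is a claim about the data: every row has a `doi` and no `doi` appears twice. As a rough sketch of what that claim means, the following standard-library snippet checks it on a few made-up rows. The sample CSV and the helper `primary_key_violations` are hypothetical and stand in for `article_metadata.csv`; this is not how Frictionless implements its own validation.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for article_metadata.csv.
sample_csv = """doi,title,year
10.1234/example-a,First placeholder title,2016
10.1234/example-b,Second placeholder title,2017
10.1234/example-a,Duplicate of the first DOI,2018
"""


def primary_key_violations(rows, key):
    """Return key values that are missing (empty) or occur more than once."""
    counts = Counter(row.get(key) or "" for row in rows)
    return sorted(value for value, n in counts.items() if value == "" or n > 1)


rows = list(csv.DictReader(io.StringIO(sample_csv)))
violations = primary_key_violations(rows, "doi")
print(violations)
```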

Next, we will add a new property `prov` which we will populate with the previously created schema for Scite contexts. To do so, we will use the `Schema` class and directly load the YAML file from disk.

In [127]:
scite_contexts['prov'] = Schema("schemas/contexts.yaml")

Finally, our new data resource containing contexts looks like this:

In [128]:
pprint(scite_contexts)

{'compression': 'no',
 'compressionPath': '',
 'control': {'newline': ''},
 'description': 'Context metadata from Scite.ai | articles identified by DOI',
 'dialect': {'quoteChar': '"'},
 'encoding': 'utf-8',
 'format': 'csv',
 'hashing': 'md5',
 'name': 'scite-contexts',
 'path': 'input_files/article_metadata.csv',
 'profile': 'mic-contexts',
 'prov': {'contexts': [{'coverage': ['publishers sharing fulltexts with Scite',
                                     'Pubmed Open Access Subset',
                                     'OA articles accessed through Unpaywall'],
                        'identified_by': ['DOI'],
                        'type': 'peer-reviewed articles'},
                       {'coverage': ['bioRxiv', 'medRxiv'],
                        'identified_by': ['DOI'],
                        'type': 'preprints'}]},
 'query': {},
 'schema': {'fields': [{'name': 'doi', 'type': 'string'},
                       {'name': 'title', 'type': 'string'},
                       {'name'

It might not look like much so far, but we have successfully loaded a spreadsheet containing article metadata into a data structure which provides mechanisms to simultaneously account for:

1. The internal structure and logics of the raw data using `schema`
2. The kind of contexts that are captured using the new `prov` property, i.e., **we are doing provenance!** 

### 2. Tracing Events and Contexts

> This state is then captured by a *citation link* in their databases.

The processes involved in the capturing of citations might be the most overlooked and undertheorized aspect of modern citations. Citation theory, citation-based indicators and metrics, and the responsible use of research metrics are all areas that get their FAIR share of attention. But what about the tracing of citational events and their way into our databases of knowledge?

To answer this question, I want to introduce *processes of tracing* which create an imprint or *trace* of a citational event in a database. Citation indexes are the obvious example of institutions that trace citational events by systematically registering citing and cited documents. In recent years, however, with the rapid growth of citation data providers, the landscape of citation tracing methods has also been growing and becoming increasingly complex.

One of the reasons why processes of tracing haven't been scrutinized in detail yet might be that most of them are still black boxes. However, we can still attempt to identify these black boxes and label them as such. The value of this exercise becomes especially clear once we start to apply the same vocabulary of tracing processes to newer open data sources such as [OpenCitations](http://opencitations.net/index/coci), which in turn should be transparent. For the current purposes, I will again provide a speculative and quite rudimentary example of a tracing pipeline:

```yaml
event: schemas/event.yaml
contexts: schemas/contexts.yaml
tracing:
  pipeline:
    - citation_extraction: text-processing/ML
    - citation_reference_matching: text-processing/ML
    - reference_document_matching: ID matching
```

This basic example of a provenance schema for Scite traces references the two previously developed models for *events* and *contexts*. This is a mandatory condition for any data resource that contains traces, as the kind of events and the concrete manifestation of their contexts are crucial information about the origin of the dataset. Furthermore, this schema should also provide insights into the actual tracing process, which will typically be a pipeline of individual steps. In this case, I have only included three steps: the extraction of citation statements, the matching of in-text mentions with the entries in the bibliography, and the matching of references with other documents in that database.
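
Since the pipeline is just a list of single-key mappings, it is easy to walk programmatically. The sketch below uses the schema from above as a Python dict and flags the steps we consider black boxes. Which methods count as opaque is my own assumption (the set `OPAQUE_METHODS`), and both helper functions are hypothetical names introduced only for this example.

```python
# The tracing schema from above as a Python dict.
traces_prov = {
    "event": "schemas/event.yaml",
    "contexts": "schemas/contexts.yaml",
    "tracing": {
        "pipeline": [
            {"citation_extraction": "text-processing/ML"},
            {"citation_reference_matching": "text-processing/ML"},
            {"reference_document_matching": "ID matching"},
        ]
    },
}

# Assumption for illustration: methods we treat as black boxes.
OPAQUE_METHODS = {"text-processing/ML"}


def pipeline_steps(prov):
    """Flatten the list of single-key mappings into (step, method) pairs."""
    steps = []
    for entry in prov["tracing"]["pipeline"]:
        for step, method in entry.items():
            steps.append((step, method))
    return steps


def black_boxes(prov, opaque=OPAQUE_METHODS):
    """Names of the pipeline steps whose method is labelled as opaque."""
    return [step for step, method in pipeline_steps(prov) if method in opaque]


for step, method in pipeline_steps(traces_prov):
    print(f"{step}: {method}")
print("black boxes:", black_boxes(traces_prov))
```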

**Our second Frictionless Data Resource**

Once again, we can now use `describe_resource` to load the CSV with the actual in-text mention data from Scite: `input_files/mentions.csv`, which contains one in-text mention per line, identified by an `id`. Each row also specifies the `source` and `target` DOI as well as the name of the `section` the target was mentioned in.

In [129]:
scite_traces = describe_resource("input_files/mentions.csv")

We will once again set some basic properties like a name, a description, and a profile, as well as a primary key on `id`. In addition, we can designate both `source` and `target` as foreign keys, as each DOI is also in the contexts resource, which contains all metadata.

In [130]:
# Some general properties
scite_traces.name = "scite-traces"
scite_traces.profile = "mic-traces"
scite_traces.description = "Traced citing-cited article pairs for each mention"

# Extend the table schema
scite_traces.schema.primary_key = "id"
scite_traces.schema.foreign_keys.append(
    {"fields": ["source"], "reference": {"resource": "scite-contexts", "fields": ["doi"]}}
)
scite_traces.schema.foreign_keys.append(
    {"fields": ["target"], "reference": {"resource": "scite-contexts", "fields": ["doi"]}}
)
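
A foreign key states that every `source` and `target` value must also appear in the `doi` column of the contexts resource. The following sketch spells out that check in plain Python on made-up rows. The sample data and the helper `dangling_references` are hypothetical and only illustrate the idea; they are not part of the Frictionless API.

```python
# Made-up rows standing in for the contexts and traces resources.
contexts_rows = [
    {"doi": "10.1234/example-a"},
    {"doi": "10.1234/example-b"},
]
mentions_rows = [
    {"id": 1, "source": "10.1234/example-a", "target": "10.1234/example-b", "section": "introduction"},
    {"id": 2, "source": "10.1234/example-b", "target": "10.1234/example-c", "section": "methods"},
]


def dangling_references(rows, fields, referenced_rows, referenced_field):
    """Return (row id, field, value) for every value missing from the referenced column."""
    known = {row[referenced_field] for row in referenced_rows}
    return [
        (row["id"], field, row[field])
        for row in rows
        for field in fields
        if row[field] not in known
    ]


missing = dangling_references(mentions_rows, ["source", "target"], contexts_rows, "doi")
print(missing)
```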

The only thing missing is the provenance information. Just as we did last time, we are going to use the `prov` property to add our experimental trace schema.

In [131]:
scite_traces['prov'] = Schema("schemas/traces.yaml")

The final data resource containing Scite traces looks like this:

In [132]:
pprint(scite_traces)

{'compression': 'no',
 'compressionPath': '',
 'control': {'newline': ''},
 'description': 'Traced citing-cited article pairs for each mention',
 'dialect': {},
 'encoding': 'utf-8',
 'format': 'csv',
 'hashing': 'md5',
 'name': 'scite-traces',
 'path': 'input_files/mentions.csv',
 'profile': 'mic-traces',
 'prov': {'contexts': 'schemas/contexts.yaml',
          'event': 'schemas/event.yaml',
          'tracing': {'pipeline': [{'citation_extraction': 'text-processing/ML'},
                                   {'citation_reference_matching': 'text-processing/ML'}]}},
 'query': {},
 'schema': {'fields': [{'name': 'field1', 'type': 'integer'},
                       {'name': 'id', 'type': 'integer'},
                       {'name': 'source', 'type': 'string'},
                       {'name': 'target', 'type': 'string'},
                       {'name': 'section', 'type': 'string'}],
            'foreignKeys': [{'fields': ['source'],
                             'reference': {'fields': ['doi'

This time our data resource already contains considerably more complexity:
    
1. The internal structure of the CSV and its columns is no longer limited to this one resource. We have added foreignKeys which reference rows in the `scite_contexts` resource.
2. The provenance information is also relational. Not only have we defined the kind of captured events (`schemas/event.yaml`, which we wrote earlier) but also the contexts in which we are actually looking for these events (`schemas/contexts.yaml`). Finally, we have started to describe parts of the tracing pipeline in very rudimentary terms.

**Look mum, more provenance!**

### 3. Patterns and Metrics I

> This data then enables the creation of *citation metrics*.

So far, we have modelled the underlying event in Scite, described the kinds of contexts that host these events, and defined how these events and contexts are transformed into data traces. The last missing step is now to close the gap between these traces and the metrics that we are so familiar with. In the framework that I am presenting, *processes of patterning* take traces (i.e., "data imprints" of events) and rearrange them into interesting patterns that are meaningful to us. This means that all the metrics we are familiar with are patterns derived from traces.

Datasets containing scholarly metrics can be organized in various ways. Individual tables could represent different data sources or different context types. Commonly, however, scholarly metrics are presented as tidy data, i.e., rows represent individual articles by DOI and columns are the measurements or metrics. Thus, we will now introduce a new option in the provenance schema to be able to describe individual fields in the dataset:

```yaml
fields:
  - mentions:
      name: mentions
      description: the number of times that the target was mentioned by the source
      resources:
        - scite_traces
  - norm_mentions:
      name: norm_mentions
      description: number of mentions normalized by total outgoing mentions of the source
      resources:
        - scite_traces
```

The CSV we are loading now is `input_files/citation_metrics.csv` which provides metrics for each pair of citing and cited articles. However, in contrast to the traditional citation count which would assign a value of 1 to each unique pair, in this case we are actually counting the number of mentions. In order to achieve that, we obviously require the `scite_traces` resource that we created earlier (which in turn contains the full provenance information up to this point).
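
To illustrate how such a pattern is derived from traces, here is a minimal sketch using only the standard library. It assumes that `norm_mentions` is the number of mentions for a pair divided by the total number of mentions made by the same source, which is my reading of the field description above. The trace rows are made up, and `citation_patterns_from_traces` is a hypothetical helper.

```python
from collections import Counter

# Made-up traces: one (source, target) tuple per in-text mention.
traces = [
    ("10.1234/example-a", "10.1234/example-b"),
    ("10.1234/example-a", "10.1234/example-b"),
    ("10.1234/example-a", "10.1234/example-c"),
    ("10.1234/example-b", "10.1234/example-c"),
]


def citation_patterns_from_traces(traces):
    """Count mentions per (source, target) pair and normalize by the source's total."""
    pair_counts = Counter(traces)
    outgoing = Counter(source for source, _ in traces)
    return [
        {
            "source": source,
            "target": target,
            "mentions": n,
            "norm_mentions": n / outgoing[source],
        }
        for (source, target), n in sorted(pair_counts.items())
    ]


patterns = citation_patterns_from_traces(traces)
for row in patterns:
    print(row)
```

A traditional citation count would collapse the first two traces into a single link with the value 1, whereas `mentions` keeps the information that the target was mentioned twice.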

**The third Frictionless Data Resource**

Let's load the respective file for this provenance schema and repeat the previous exercises:

In [133]:
citation_patterns = describe_resource("input_files/citation_metrics.csv")

# Some general properties
citation_patterns.name = "citation-patterns"
citation_patterns.profile = "mic-patterns"
citation_patterns.description = "Patterns for unique source and target pairs"

# Extend the table schema
citation_patterns.schema.primary_key = "id"
citation_patterns.schema.foreign_keys.append(
    {"fields": ["source"], "reference": {"resource": "scite-contexts", "fields": ["doi"]}}
)
citation_patterns.schema.foreign_keys.append(
    {"fields": ["target"], "reference": {"resource": "scite-contexts", "fields": ["doi"]}}
)

# Adding a constraint to the normalized mention which can't be smaller than 0
citation_patterns.schema.get_field("norm_mentions").constraints = {"minimum": 0}

# Add provenance schema
citation_patterns['prov'] = Schema("schemas/citation_patterns.yaml")

Our first pattern resource then looks as follows:

In [134]:
citation_patterns

{'name': 'citation-patterns',
 'profile': 'mic-patterns',
 'path': 'input_files/citation_metrics.csv',
 'scheme': 'file',
 'format': 'csv',
 'hashing': 'md5',
 'encoding': 'utf-8',
 'compression': 'no',
 'compressionPath': '',
 'control': {'newline': ''},
 'dialect': {},
 'query': {},
 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
   {'name': 'source', 'type': 'string'},
   {'name': 'target', 'type': 'string'},
   {'name': 'mentions', 'type': 'integer'},
   {'name': 'norm_mentions', 'type': 'number', 'constraints': {'minimum': 0}}],
  'primaryKey': 'id',
  'foreignKeys': [{'fields': ['source'],
    'reference': {'resource': 'scite-contexts', 'fields': ['doi']}},
   {'fields': ['target'],
    'reference': {'resource': 'scite-contexts', 'fields': ['doi']}}]},
 'stats': {'hash': 'c55ab9fb2f445aa8893787b03bcb829f',
  'bytes': 2330,
  'fields': 5,
  'rows': 31},
 'description': 'Patterns for unique source and target pairs',
 'prov': {'fields': [{'mentions': {'name': 'mentions',
 

We have constructed our first data resource that does not directly derive from the captured events, as we used the Scite traces as input variables. Once again, we achieve both goals of describing the internal structure of the table at hand while providing its provenance information.

### 4. Patterns and Metrics II

Finally, we are now ready to assemble the final data resource that everyone has been waiting for... Article-level metrics! Everyone wants to know how often their most recent paper has been cited, and in this case we can also look at the number of in-text mentions. So, without further ado, I present the final provenance schema for article-level patterns:

```yaml
fields:
  - mentions_agg:
      - name: mentions_agg
      - description: the total number of mentions aggregated by target articles
      - resource:
        - citation_patterns
  - refs_agg:
      - name: refs_agg
      - description: the classic citation count. the number of unique citing articles for each cited article
      - resource:
        - citation_patterns
```

We are now introducing our final two patterns. `mentions_agg` is the total number of in-text mentions for each article. In contrast, `refs_agg` is the traditional citation count, i.e., the plain number of articles that mentioned the target one. Note that the input resource for both fields is not of the type `mic-traces`, like our Scite trace data resource, but of the type `mic-patterns`. Patterns can take either traces or other patterns as input data in order to produce new patterns. The only important thing for us is to maintain the chain of provenance information.
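
The step from citation-level to article-level patterns can be sketched in a few lines of plain Python. The input rows are made up and the helper `article_patterns_from_citation_patterns` is a hypothetical name. The sketch assumes that each row represents a unique source and target pair, as in the citation patterns resource.

```python
from collections import defaultdict

# Made-up rows standing in for the citation patterns resource.
citation_patterns_rows = [
    {"source": "10.1234/example-a", "target": "10.1234/example-c", "mentions": 3},
    {"source": "10.1234/example-b", "target": "10.1234/example-c", "mentions": 1},
    {"source": "10.1234/example-a", "target": "10.1234/example-b", "mentions": 2},
]


def article_patterns_from_citation_patterns(rows):
    """Aggregate per target: total mentions and number of unique citing articles."""
    mentions_agg = defaultdict(int)
    citing = defaultdict(set)
    for row in rows:
        mentions_agg[row["target"]] += row["mentions"]
        citing[row["target"]].add(row["source"])
    return {
        doi: {"mentions_agg": mentions_agg[doi], "refs_agg": len(citing[doi])}
        for doi in sorted(mentions_agg)
    }


article_level = article_patterns_from_citation_patterns(citation_patterns_rows)
for doi, metrics in article_level.items():
    print(doi, metrics)
```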

**The fourth and final Frictionless Data Resource**

This final resource is pretty much the same as the previous one. The one difference is that the primary key of the table is conceptually also a foreign key, although the corresponding declaration is commented out in the cell below.

In [135]:
article_patterns = describe_resource("input_files/article_metrics.csv")

# Some general properties
article_patterns.name = "article-patterns"
article_patterns.profile = "mic-patterns"
article_patterns.description = "Patterns for individual articles"

# Extend the table schema
article_patterns.schema.primary_key = "doi"
# article_patterns.schema.foreign_keys.append(
#     {"fields": ["doi"], "reference": {"resource": "scite-contexts", "fields": ["doi"]}}
# )

# Adding constraints to the aggregated counts, which can't be smaller than 0
article_patterns.schema.get_field("mentions_agg").constraints = {"minimum": 0}
article_patterns.schema.get_field("refs_agg").constraints = {"minimum": 0}

# Add provenance schema
article_patterns['prov'] = Schema("schemas/article_patterns.yaml")

pprint(article_patterns)

{'compression': 'no',
 'compressionPath': '',
 'control': {'newline': ''},
 'description': 'Patterns for individual articles',
 'dialect': {},
 'encoding': 'utf-8',
 'format': 'csv',
 'hashing': 'md5',
 'name': 'article-patterns',
 'path': 'input_files/article_metrics.csv',
 'profile': 'mic-patterns',
 'prov': {'fields': [{'mentions_agg': [{'name': 'mentions_agg'},
                                       {'description': 'the total number of '
                                                       'mentions aggregated by '
                                                       'target articles'},
                                       {'resource': ['citation_patterns']}]},
                     {'refs_agg': [{'name': 'refs_agg'},
                                   {'description': 'the classic citation '
                                                   'count. the number of '
                                                   'unique citing articles for '
                                

We have done it! We are now proud owners of article-level metrics as a Frictionless data resource **with provenance information attached!** 

## Assembling the Final Package

The last step is to use the four resources that we created so far and combine them into a *Frictionless Data Package*. The function call is pretty straightforward:

In [136]:
scite = Package(resources=[scite_contexts, scite_traces, citation_patterns, article_patterns])

We can now look at the data table in each resource as follows:

In [137]:
scite.get_resource("article-patterns").to_pandas().dataframe

| doi | mentions_agg | refs_agg |
| --- | --- | --- |
| 10.1186/s13023-016-0444-9 | 14 | 8 |
| 10.1212/wnl.0000000000004179 | 15 | 6 |
| 10.3892/etm.2018.6726 | 2 | 1 |
| 10.3390/md15040089 | 13 | 7 |
| 10.1007/s12035-016-0271-y | 15 | 9 |


We can also very quickly produce an overview of the provenance information contained in the data package:

In [138]:
for r in scite["resources"]:
    print(f"=== Data Resource: {r['name']}")
    pprint(r["prov"])
    print("")

=== Data Resource: scite-contexts
{'contexts': [{'coverage': ['publishers sharing fulltexts with Scite',
                            'Pubmed Open Access Subset',
                            'OA articles accessed through Unpaywall'],
               'identified_by': ['DOI'],
               'type': 'peer-reviewed articles'},
              {'coverage': ['bioRxiv', 'medRxiv'],
               'identified_by': ['DOI'],
               'type': 'preprints'}]}

=== Data Resource: scite-traces
{'contexts': 'schemas/contexts.yaml',
 'event': 'schemas/event.yaml',
 'tracing': {'pipeline': [{'citation_extraction': 'text-processing/ML'},
                          {'citation_reference_matching': 'text-processing/ML'}]}}

=== Data Resource: citation-patterns
{'fields': [{'mentions': {'description': 'the number of times that the target '
                                         'was mentioned by the source',
                          'name': 'mentions',
                          'resources': ['scite_trac

## Conclusions

This notebook presented a proof of concept for the systematic and programmatic description of provenance information for metrics using Frictionless. Using the features built into the Frictionless toolkit, namely data packages, data resources, and table schemas, we were able to provide information about the internal logic of the complete dataset. In addition, I introduced the concepts of citational events, contexts, traces, and patterns in order to model the underlying provenance of the same dataset. By extending the individual data resources with a `prov` property, I have attempted to extend the Frictionless toolkit to accommodate provenance.

Some practical benefits of this approach for the research community:

- Separating provenance information from the data tables allows us to model (or even require) the former even if the end user does not have access to the raw data.
- The introduction of contexts, traces, and patterns could contribute to the development of a best practice for bibliometric research projects and scholarly metrics in general.

Benefits for practitioners and society in general:

- By creating provenance schemas for individual data sources (and publishing them openly) we create a publicly available, standardized resource which provides metadata on data sources. Furthermore, we are able to account for black boxes as we can model the individual processes of metrics independently.

Finally, a few potential extensions for a Frictionless Metrics Data Package that might be fun to explore:

- Automatically emphasize black-boxes in the data processing pipelines or missing provenance information for patterns/metrics
- Compare metrics in a dataset and automatically point out incommensurable fields
- Create visualizations of these provenance pipelines
- Automatically retrieve metadata and citations from open services (Crossref/COCI)
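
As a very rough sketch of the first idea, the snippet below walks a package descriptor represented as a plain dict and lists the resources that carry no provenance information. The descriptor is made up (the third resource is invented and deliberately lacks a `prov` entry), and `provenance_gaps` is a hypothetical helper rather than an existing Frictionless function.

```python
def provenance_gaps(package_descriptor):
    """List the names of resources with no `prov` entry or an empty one."""
    return [
        resource.get("name", "<unnamed>")
        for resource in package_descriptor.get("resources", [])
        if not resource.get("prov")
    ]


# Made-up descriptor: the third resource deliberately lacks provenance.
example_package = {
    "resources": [
        {"name": "scite-contexts", "prov": {"contexts": [{"type": "preprints"}]}},
        {"name": "scite-traces", "prov": {"event": "schemas/event.yaml"}},
        {"name": "undocumented-metrics"},
    ]
}

gaps = provenance_gaps(example_package)
print(gaps)
```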

## References

Nakassis, C. V. (2013). Citation and Citationality. Signs and Society, 1(1), 51–77. https://doi.org/10.1086/670165