# Metrics in Context | A Proof of Concept

### Describing Citation Data with [Frictionless](https://frictionlessdata.io/)

*Asura Enkhbayar, 04.02.2021*

---

This notebook presents a proof of concept for a standardized approach to citation data and, more broadly, scholarly metrics. To do so, we will take a bibliometric dataset consisting of four spreadsheets and build a Frictionless [Data Package](https://specs.frictionlessdata.io/data-package/) using a systematic approach to describe each table's provenance. I will first introduce the dataset, give a quick overview, and situate it in the context of traditional bibliometric work. Then, I will introduce a new conceptual framework for scholarly metrics step by step and gradually incorporate the dataset into the final data package. I will close with a few concluding remarks and a brief discussion of the potential and limitations of this approach.

*Note: For a broader overview of the project please refer to the [README](https://github.com/Bubblbu/metrics-in-context) in the project repository.*

<img style="float: center;" width="25%" src="../../materials/assets/fosdem2021.jpg">

## The Data

Let's quickly import a few libraries and functions and set up our folder structure.

In [38]:
from pprint import pprint
from pathlib import Path
import yaml

import pandas as pd
from frictionless import describe_schema, Resource, Package

# Directories
input_files = Path("./input_files")
schemas = Path("./schemas")

For the purpose of this particular notebook, I will not go into the nitty-gritty details of this dataset. Of course, providing that kind of information would be part of the aims of developing the Frictionless Data Package for scholarly metrics, as these details make up provenance information. For now, I will simply describe the spreadsheets at hand and provide a cursory description of their origins.

The dataset we are working with consists of four spreadsheets that were created based on data provided by [Scite.ai](https://scite.ai/). The sample was created by querying PubMed for recent articles on amyotrophic lateral sclerosis (ALS). Scite is one of the newer data sources that not only provide the classic citation link between two documents but also attempt to extract what have become established as citation contexts. While the former can be understood as the datafication of bibliographies and reference lists, context-aware citations trace each individual in-text mention of articles.

The four CSVs are:

- `contexts.csv`: Article metadata for each DOI.
- `traces.csv`: The individual traced citations between documents and context information such as a text snippet or the mentioning section.
- `citations.csv`: This spreadsheet contains metrics for every individual citation link derived from its source article such as the total number of outgoing references.
- `articles.csv`: A table with article-level metrics derived from the aforementioned traces.

Very often, bibliometric researchers and practitioners will only get to see the final article-level metrics, i.e., the citation counts for DOIs. In the next sections, we will gradually go through these spreadsheets and describe how citation counts become what they are.
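
To make this chain concrete, here is a minimal sketch of how individual traces collapse into citation links and then into article-level counts. The rows and DOIs are invented for illustration and do not come from the Scite sample; only the column names `source` and `target` follow the citation-level table, and the variable names are my own.

```python
from collections import Counter

# Hypothetical traces: one row per in-text mention (DOIs are invented)
traces = [
    {"source": "10.1000/a", "target": "10.1000/x", "section": "Introduction"},
    {"source": "10.1000/a", "target": "10.1000/x", "section": "Discussion"},
    {"source": "10.1000/a", "target": "10.1000/y", "section": "Methods"},
    {"source": "10.1000/b", "target": "10.1000/x", "section": "Introduction"},
]

# Citation links: one entry per (source, target) pair with its number of mentions
mentions = Counter((t["source"], t["target"]) for t in traces)

# Article level: incoming links and incoming mentions per cited DOI
refs_per_article = Counter(target for _, target in mentions)
mentions_per_article = Counter(t["target"] for t in traces)

print(dict(mentions))
print(dict(refs_per_article))
print(dict(mentions_per_article))
```

In this toy example the article `10.1000/x` ends up with a count of 2 or 3, depending on whether links or mentions are counted. This is the kind of difference that stays invisible when only the final number is shared.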

## A Conceptual Framework for Scholarly Metrics

Here, I want to briefly introduce the conceptual framework used to systematize scholarly metrics. This framework is part of a broader doctoral research project and is built around the philosophical shift from object-centrism to process-centrism. This shift from representation towards performance can also be expressed for scholarly metrics by emphasising the processes and practices that lead to outputs and citation counts. But what do these processes look like, and how can we start to systematically describe them?

The usual story goes something like this: A *citation* links two scholarly texts expressed through an in-text mention and a bibliographic reference. This state is then captured by a *citation link* in their databases. This data then enables the creation of *citation metrics*. I would like to retell this story equipped with a new vocabulary for provenance, paying attention to each step in the individuation of the citation count.

### The Citational Event

> A *citation* links two *scholarly texts* expressed through an in-text mention and a bibliographic reference.

Firstly, I propose to question the idea of the citation as one consistent concept. It is typically understood as a directional binary property between two peer-reviewed research articles. However, more and more citation databases are using text-processing methods to extract the individual in-text mentions with their contexts, and the types of documents hosting citations are changing as well (e.g., datasets, software, mentions in social media). To accommodate these changes, I suggest thinking in terms of *citational events* and their *contexts*.

The typical citation is a statement in a peer-reviewed scholarly article referencing another piece of scholarly writing. Some of the new social (and technical) challenges around citations stem from changing citational practices such as the citing of preprints or datasets. So far, these changes have been pragmatically addressed by introducing norms and dedicated infrastructure (e.g., the push for DOIs for datasets and initiatives that track their usage), but we are lacking the conceptual language to capture all these forms of citations.

To avoid falling into the trap of never-ending definitions, I propose to re-conceptualize the citation as an interdiscursive event (Nakassis, 2013) which links two discursive acts. For the present purpose it will suffice to understand *discursive acts* as some form of written or spoken statement. Modality, syntax, and even semantics become part of the wider context of that particular citational event. I've talked enough about things that bald French men usually talk about, thus, without further ado, let's dive into implementing said citational event:

```yaml
type: citational
description: A textual reference from one scholarly document to another scholarly document.
source_event:
  activity: scholarly writing
target_event:
  activity: scholarly writing
```

Scite captures *citational* events. Other services also capture events which are not citational, such as views, downloads, or bookmarks. Throughout this document, technical objects will always be accompanied by free-text descriptions. Finally, we also attempt to describe the source and target events individually. In our case, both the citing and the cited events are caused by scholarly writing. Altmetrics would consider source events caused by social media activities referencing scholarly texts.
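
The same descriptor can be mirrored in code. The following is a minimal sketch using dataclasses; the class names `Event` and `EventSide` are hypothetical, and the social media variant (including its description and activity label) is an illustration of the altmetrics case rather than part of the Scite data.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class EventSide:
    activity: str  # the activity that causes this side of the event


@dataclass(frozen=True)
class Event:
    type: str
    description: str
    source_event: EventSide
    target_event: EventSide


# The citational event as described in the YAML excerpt
citational = Event(
    type="citational",
    description="A textual reference from one scholarly document to another scholarly document.",
    source_event=EventSide(activity="scholarly writing"),
    target_event=EventSide(activity="scholarly writing"),
)

# Hypothetical altmetric variant: only the source side changes
social_mention = replace(
    citational,
    description="A social media post referencing a scholarly document.",
    source_event=EventSide(activity="social media activity"),
)

print(asdict(citational))
print(asdict(social_mention)["source_event"])
```

Keeping the two sides of the event separate means that a new kind of source activity can be described without touching the target side.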

*Note: This might seem like a very lean definition of the fundamental entity that is being captured. This is on purpose as we will continue to explore the complexity through the additional processes of becoming the final metric.*

**Contexts**

These citational events do not occur in a vacuum. As indicated earlier, they typically happen in contexts, which are the things we usually pay most attention to. The research articles, conference papers, literature reviews, white papers, preprints, and other formats that nowadays host citations of all kinds are the typical unit of analysis when it comes to scholarly metrics. However, with this small commitment to the event as the core of this framework, I am hoping to shift attention to the processes and practices leading to citations rather than their manifested forms. Still, we obviously need to talk about and specify these concrete outputs that we love and hate so much. The following YAML excerpt is an example of a list of possible contexts for Scite:

```yaml
- type: peer-reviewed articles
  coverage:
    - publishers sharing fulltexts with Scite
    - OA articles accessed through Unpaywall
    - Pubmed Open Access Subset
  identified_by:
    - DOI
- type: preprints
  coverage:
    - bioRxiv
    - medRxiv
  identified_by:
    - DOI
```
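
A description like this is only useful if it stays well-formed. As a minimal sketch, the check below verifies that every context carries the three keys used in the excerpt. The function name `check_context` is hypothetical, and the contexts are written as Python literals rather than parsed from YAML to keep the example free of dependencies.

```python
REQUIRED_KEYS = {"type", "coverage", "identified_by"}

contexts = [
    {
        "type": "peer-reviewed articles",
        "coverage": [
            "publishers sharing fulltexts with Scite",
            "OA articles accessed through Unpaywall",
            "Pubmed Open Access Subset",
        ],
        "identified_by": ["DOI"],
    },
    {
        "type": "preprints",
        "coverage": ["bioRxiv", "medRxiv"],
        "identified_by": ["DOI"],
    },
]


def check_context(context):
    """Return a list of problems found in one context description."""
    missing = sorted(REQUIRED_KEYS - set(context))
    problems = [f"missing key: {key}" for key in missing]
    for key in ("coverage", "identified_by"):
        value = context.get(key)
        # Present keys must hold a non-empty list
        if key in context and not (isinstance(value, list) and value):
            problems.append(f"{key} should be a non-empty list")
    return problems


for context in contexts:
    print(context["type"], check_context(context))
```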

### The Tracing of Events and Contexts

> This state is then captured by a *citation link* in their databases.

How are events and their contexts turned into traces?

**Traces**

```yaml
event: schemas/event.yaml
tracing:
  source:

  target:
    contexts:
      - type: peer-reviewed articles
        coverage:
          - publishers sharing fulltexts with Scite
          - OA articles accessed through Unpaywall
          - Pubmed Open Access Subset
        identified_by:
          - DOI
      - type: preprints
        coverage:
          - bioRxiv
          - medRxiv
        identified_by:
          - DOI
```

### Patterns and Metrics

> This data then enables the creation of *citation metrics*.



**Patterns**

All metrics are patterns, but not all patterns are metrics.

## Implementation

### Events

### Contexts

### Traces

### Patterns

#### Citation level patterns

This table contains citation patterns aggregated at the level of individual citations.

Each table will thus contain:

- A unique ID
- A source DOI for the citing article
- A target DOI for the cited article
- One or more patterns
    - Each pattern also contains the form of aggregation and input traces/patterns
        - Optional for each input trace/pattern: a data resource
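
The nested list above translates into one small record per pattern. Here is a minimal sketch of a helper that builds such a record. The function name `pattern_provenance`, the `inputs` key, and the example description of `mentions` are assumptions made for illustration, not a fixed part of the specification.

```python
def pattern_provenance(description, aggregation, by, inputs):
    """Describe how a pattern was derived.

    inputs is a list of (name, resource) pairs. The resource may be None,
    because pointing to a data resource is optional.
    """
    return {
        "type": "pattern",
        "prov": {
            "operation": {
                "description": description,
                "type": aggregation,
                "by": by,
                "inputs": [
                    {"name": name, **({"resource": resource} if resource else {})}
                    for name, resource in inputs
                ],
            }
        },
    }


# Hypothetical provenance for the `mentions` column of the citation-level table
mentions_prov = pattern_provenance(
    description="Number of in-text mentions per citation link",
    aggregation="count",
    by=["source", "target"],
    inputs=[("traces", "traces.csv")],
)
print(mentions_prov["prov"]["operation"]["inputs"])
```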

In [31]:
# Assumes the CSV lives in the input_files directory defined above
f = input_files / "citation_patterns.csv"
cit_patterns = pd.read_csv(f)
cp_schema = describe_schema(f)
cp_schema.fields

[{'name': 'source', 'type': 'string'},
 {'name': 'target', 'type': 'string'},
 {'name': 'total_source_mentions', 'type': 'integer'},
 {'name': 'total_source_refs', 'type': 'integer'},
 {'name': 'mentions', 'type': 'integer'},
 {'name': 'norm_refs', 'type': 'number'},
 {'name': 'norm_mentions', 'type': 'number'},
 {'name': 'mentions_per_ref', 'type': 'number'},
 {'name': 'wf1', 'type': 'number'},
 {'name': 'wf2', 'type': 'number'},
 {'name': 'wf3', 'type': 'number'}]

In [29]:
cp_schema.name = "citation_patterns"
# TODO: add a free-text description of this table
cp_schema.description = ""

# Show the schema as it currently stands
cp_schema

{'fields': [{'name': 'source', 'type': 'string'},
  {'name': 'target', 'type': 'string'},
  {'name': 'total_source_mentions', 'type': 'integer'},
  {'name': 'total_source_refs', 'type': 'integer'},
  {'name': 'mentions', 'type': 'integer'},
  {'name': 'norm_refs', 'type': 'number'},
  {'name': 'norm_mentions', 'type': 'number'},
  {'name': 'mentions_per_ref', 'type': 'number'},
  {'name': 'wf1', 'type': 'number'},
  {'name': 'wf2', 'type': 'number'},
  {'name': 'wf3', 'type': 'number'}]}

#### Article level patterns

This table contains citation patterns aggregated on the article level.

Each table will thus contain:

- A unique DOI
- One or more patterns for each DOI
    - Each pattern also contains the form of aggregation and input traces/patterns
        - Optional for each input trace/pattern: a data resource

In [6]:
f = "article_patterns.csv"
article_patterns = pd.read_csv(f)
ap_schema = describe_schema(f)

In [7]:
# Set primary key
ap_schema.primary_key = "doi"
ap_schema.get_field("doi").title = "DOI of citing article"

# First pattern: mentions_agg
mentions_agg = ap_schema.get_field("mentions_agg")
mentions_agg.title = "Aggregated mentions"
mentions_agg.description = "Sum of all incoming mentions for this DOI"
mentions_agg.type = "integer"
mentions_agg.missing_values = ["", "n/a", "NaN"]
mentions_agg.mic = {
    "type": "pattern",
    "prov": {
        "operation": {
            "description": "Aggregation of all citations by DOI",
            "type": "aggregation",
            "by": "doi",
            "resource": ""
        }
    }
}

# Second pattern: refs_agg
refs_agg = ap_schema.get_field("refs_agg")
refs_agg.title = "Aggregated references"
refs_agg.description = "Sum of all incoming references for this DOI"
refs_agg.type = "integer"
refs_agg.missing_values = ["", "n/a", "NaN"]
refs_agg.mic = {
    "type": "pattern",
    "prov": {
        "operation": {
            "type": "aggregation",
            "by": "doi",
            "input": ""
        }
    }
}

# Add third pattern which is an average

In [None]:
# Resource specs
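resource_desc = {
    "profile": "pattern",
    # Resource names must be lowercase slugs; the readable label goes into "title"
    "name": "article_patterns",
    "title": "Article-level metrics",
    # "path" points to a file on disk (assumed to live in input_files);
    # "data" is reserved for inline data
    "path": str(input_files / "article_patterns.csv"),
    "schema": ap_schema
}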
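
How such a resource might end up in a package descriptor can be sketched with plain dictionaries that follow the Data Package layout (`name` and `resources` at the top, then `name`, `path`, and `schema` per resource). The package name and the shortened field list are illustrative, and `mic` is a custom, non-standard property. With Frictionless available, the `Package` class imported at the top would presumably take the place of the hand-written dictionary.

```python
import json

# Hand-written stand-in for the article-level schema (only two fields shown)
article_schema = {
    "primaryKey": "doi",
    "fields": [
        {"name": "doi", "type": "string"},
        {
            "name": "mentions_agg",
            "type": "integer",
            "title": "Aggregated mentions",
            # Custom property carrying the provenance of this pattern
            "mic": {
                "type": "pattern",
                "prov": {"operation": {"type": "aggregation", "by": "doi"}},
            },
        },
    ],
}

package = {
    "name": "metrics-in-context-poc",  # hypothetical package name
    "resources": [
        {
            "name": "article_patterns",
            "title": "Article-level metrics",
            "path": "article_patterns.csv",
            "schema": article_schema,
        }
    ],
}

# Serialize the descriptor, as it would appear in a datapackage.json file
descriptor = json.dumps(package, indent=2)
print(descriptor)
```

Because the schema travels inside the descriptor, the provenance of `mentions_agg` can be read and discussed even when the CSV itself is not shared.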

## Discussion

Benefits:

- Provides a logical structure for bibliometric datasets
- By splitting data resources and schemas, it is possible to discuss provenance information without having actual data from any prior stages and processes
- Frictionless enables the addition of logic and functions to the data packages, which opens the door for extensions:
    - Automatically retrieve metadata and citations from open services (Crossref/COCI) including their provenance schemas of course
    - Compare metrics in a dataset and automatically point out incommensurable fields
    - Automatically emphasize black-boxes in the data processing pipelines
    - Create visualizations of these provenance pipelines
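
As a sketch of what automatically emphasizing black boxes could look like, the function below walks a schema and flags every pattern whose provenance names no input. The function name `find_black_boxes`, the shortened schema, and the `citations.csv` input are assumptions for illustration. The check accepts both the `input` and the `resource` key, since both appear in the field descriptions.

```python
def find_black_boxes(schema):
    """Return the names of pattern fields whose provenance names no input."""
    flagged = []
    for field in schema.get("fields", []):
        mic = field.get("mic")
        if not mic or mic.get("type") != "pattern":
            continue  # not a pattern, nothing to check
        operation = mic.get("prov", {}).get("operation", {})
        source = operation.get("input") or operation.get("resource")
        if not source:
            flagged.append(field["name"])
    return flagged


# Shortened, hand-written schema with one undocumented and one documented pattern
schema = {
    "fields": [
        {"name": "doi", "type": "string"},
        {
            "name": "mentions_agg",
            "mic": {
                "type": "pattern",
                "prov": {"operation": {"type": "aggregation", "by": "doi", "resource": ""}},
            },
        },
        {
            "name": "refs_agg",
            "mic": {
                "type": "pattern",
                "prov": {"operation": {"type": "aggregation", "by": "doi", "input": "citations.csv"}},
            },
        },
    ]
}

print(find_black_boxes(schema))
```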

Drawbacks:

- 

## References

Nakassis, C. V. (2013). Citation and Citationality. Signs and Society, 1(1), 51–77. https://doi.org/10.1086/670165