# Metrics in Context | A Proof of Concept

### Describing Citation Data with [Frictionless](https://frictionlessdata.io/)

*Asura Enkhbayar, 04.02.2021*

---

This notebook presents a proof of concept for a standardized approach to citation data and, more broadly, scholarly metrics. To do so, we will take a bibliometric dataset consisting of four spreadsheets and build a Frictionless [Data Package](https://specs.frictionlessdata.io/data-package/), using a systematic approach to describe each table's provenance. I will first introduce the dataset, give a quick overview, and situate it in the context of traditional bibliometric work. Then, I will introduce a new conceptual framework for scholarly metrics step by step and gradually incorporate the dataset into the final data package. Finally, I will offer a few concluding remarks and a brief discussion of the potential and limitations of this approach.

*Note: For a broader overview of the project please refer to the [README](https://github.com/Bubblbu/metrics-in-context) in the project repository.*

<img style="float: center;" width="25%" src="../../materials/assets/fosdem2021.jpg">

## Bibliometric Data

The dataset used in this particular notebook is slightly different from more traditional citation datasets. 

The available dataset consists of four spreadsheets:

1. `contexts.csv`: The contexts of the captured events, which typically include article metadata.
2. `traces.csv`: The individual traced citational links between documents.
3. `citations.csv`: Patterns derived from the traces at the level of each individual citation pair.
4. `articles.csv`: A table with article-level metrics.
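
As a rough sketch of where this is heading, the four tables could be listed as resources in a Data Package descriptor. The example below uses a plain dictionary rather than the `frictionless` library. The package name is a placeholder chosen for illustration, and deriving each resource name from the file stem is an assumption; only the `name`, `path`, and `description` properties are used.

```python
import json
from pathlib import Path

# File names and descriptions follow the list above
tables = {
    "contexts.csv": "Contexts of the captured events (typically article metadata)",
    "traces.csv": "Individual traced citational links between documents",
    "citations.csv": "Patterns derived from the traces for each citation pair",
    "articles.csv": "Article-level metrics",
}

package = {
    "name": "metrics-in-context-poc",  # hypothetical package name
    "resources": [
        {
            "name": Path(path).stem,  # assumption: resource name = file stem
            "path": path,
            "description": description,
        }
        for path, description in tables.items()
    ],
}

print(json.dumps(package, indent=2))
```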

## A Conceptual Framework for Scholarly Metrics

Here, I briefly introduce the conceptual framework used to systematize scholarly metrics. This framework is part of a broader doctoral research project and is built around the philosophical shift from object-centrism to process-centrism. This shift from representation towards performance can also be expressed for scholarly metrics by emphasising the processes and practices that lead to outputs and citation counts. But what do these processes look like, and how can we start to describe them systematically?

The usual story goes something like this: A *citation* links two scholarly texts expressed through an in-text mention and a bibliographic reference. This state is then captured by a *citation link* in their databases. This data then enables the creation of *citation metrics*. I would like to retell this story with some new vocabulary and some closer attention paid to the individual steps:

### Citational Events and Their Contexts

> A *citation* links two *scholarly texts* expressed through an in-text mention and a bibliographic reference.

Firstly, I propose to question the idea of the citation as a single consistent concept. It is typically understood as a directional binary property between two peer-reviewed research articles. However, more and more citation databases use text-processing methods to extract individual in-text mentions together with their contexts. Furthermore, the types of documents hosting citations are changing (e.g., datasets, software, mentions in social media). To accommodate these changes, I suggest thinking in terms of *citational events* and their *contexts*.

**Citational Events**

The typical citation is a statement in a peer-reviewed scholarly article referencing another piece of scholarly writing. Some of the new social (and technical) challenges around citations stem from changing citational practices such as the citing of preprints or datasets. So far, these changes have been addressed pragmatically through the introduction of norms and dedicated infrastructure (e.g., the push for DOIs for datasets and initiatives that track their usage), but we lack the conceptual language to capture all these forms of citation.

Thus, I propose understanding the citation as an interdiscursive event (Nakassis, 2013).

**Contexts**



### The Tracing of Events

> This state is then captured by a *citation link* in their databases.

How are events and their contexts turned into traces?

**Traces**



### Patterns and Metrics

> This data then enables the creation of *citation metrics*.



**Patterns**

All metrics are patterns, but not all patterns are metrics.
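
The three steps above can be illustrated with a toy example. This is only a sketch of the vocabulary, not the project's implementation: the `Event` and `Trace` classes, the context labels, the tracer name, and the placeholder DOIs are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A citational event: one in-text mention of a target in a source."""
    source: str   # DOI of the citing document
    target: str   # DOI of the cited document
    context: str  # where the mention occurs (hypothetical labels)


@dataclass(frozen=True)
class Trace:
    """A record of an event as captured by some database."""
    event: Event
    tracer: str  # hypothetical name of the capturing database


# Toy events with placeholder DOIs
events = [
    Event("10.1/a", "10.1/x", "introduction"),
    Event("10.1/a", "10.1/x", "discussion"),
    Event("10.1/b", "10.1/x", "methods"),
    Event("10.1/b", "10.1/y", "methods"),
]

# Tracing: in this sketch every event is simply recorded by one database
traces = [Trace(event=e, tracer="example-db") for e in events]

# Pattern: number of mentions per (source, target) pair
mentions_per_pair = Counter((t.event.source, t.event.target) for t in traces)

# Pattern commonly used as a metric: number of citing documents per target
citation_counts = Counter(target for (_, target) in mentions_per_pair)

print(mentions_per_pair)
print(citation_counts)
```

Here the per-pair mention count is a pattern that is rarely reported as a metric, while the per-target count corresponds to the familiar citation count.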

## Implementation

In [1]:
from pprint import pprint

import pandas as pd
from frictionless import describe_schema

### Events

### Contexts

### Traces

### Patterns

#### Citation level patterns

This table contains citation patterns aggregated at the level of individual citations.

Each table will thus contain:

- A unique ID
- A source DOI for the citing article
- A target DOI for the cited article
- One or more patterns
    - Each pattern also contains the form of aggregation and input traces/patterns
        - Optional for each input trace/pattern: a data resource
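
The structure listed above could be modelled roughly as follows. This is a sketch under assumptions: the class and attribute names (`PatternProvenance`, `Pattern`, `CitationRecord`, `inputs`, `resource`) are hypothetical and only mirror the bullet points, and the ID, DOIs, and values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatternProvenance:
    aggregation: str                # form of aggregation, e.g. "count"
    inputs: List[str]               # names of the input traces/patterns
    resource: Optional[str] = None  # optional data resource holding the inputs


@dataclass
class Pattern:
    name: str
    value: float
    prov: PatternProvenance


@dataclass
class CitationRecord:
    id: str
    source: str  # DOI of the citing article
    target: str  # DOI of the cited article
    patterns: List[Pattern] = field(default_factory=list)


# One placeholder record with a single pattern
record = CitationRecord(
    id="c1",
    source="10.1/a",
    target="10.1/x",
    patterns=[
        Pattern(
            name="mentions",
            value=3,
            prov=PatternProvenance(
                aggregation="count",
                inputs=["traces"],
                resource="traces.csv",
            ),
        )
    ],
)

print(record)
```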

In [31]:
f = data_dir / "citation_patterns.csv"
cit_patterns = pd.read_csv(f)
cp_schema = describe_schema(f)
cp_schema.fields

[{'name': 'source', 'type': 'string'},
 {'name': 'target', 'type': 'string'},
 {'name': 'total_source_mentions', 'type': 'integer'},
 {'name': 'total_source_refs', 'type': 'integer'},
 {'name': 'mentions', 'type': 'integer'},
 {'name': 'norm_refs', 'type': 'number'},
 {'name': 'norm_mentions', 'type': 'number'},
 {'name': 'mentions_per_ref', 'type': 'number'},
 {'name': 'wf1', 'type': 'number'},
 {'name': 'wf2', 'type': 'number'},
 {'name': 'wf3', 'type': 'number'}]

In [29]:
cp_schema.name = "citation_patterns"
# Description taken from the section text above
cp_schema.description = "Citation patterns aggregated at the level of individual citations"

# Display the schema descriptor
cp_schema

{'fields': [{'name': 'source', 'type': 'string'},
  {'name': 'target', 'type': 'string'},
  {'name': 'total_source_mentions', 'type': 'integer'},
  {'name': 'total_source_refs', 'type': 'integer'},
  {'name': 'mentions', 'type': 'integer'},
  {'name': 'norm_refs', 'type': 'number'},
  {'name': 'norm_mentions', 'type': 'number'},
  {'name': 'mentions_per_ref', 'type': 'number'},
  {'name': 'wf1', 'type': 'number'},
  {'name': 'wf2', 'type': 'number'},
  {'name': 'wf3', 'type': 'number'}]}

#### Article level patterns

This table contains citation patterns aggregated on the article level.

Each table will thus contain:

- A unique DOI
- One or more patterns for each DOI
    - Each pattern also contains the form of aggregation and input traces/patterns
        - Optional for each input trace/pattern: a data resource
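
To make the aggregation concrete, the sketch below derives article-level patterns from a handful of citation-level rows using only the standard library. The rows and DOIs are invented, and two assumptions are made about the column semantics: `mentions_agg` sums the `mentions` of all citation pairs pointing at a DOI, and `refs_agg` counts each incoming citation pair as one reference.

```python
from collections import defaultdict

# Toy citation-level rows (source DOI, target DOI, in-text mentions)
citation_rows = [
    {"source": "10.1/a", "target": "10.1/x", "mentions": 2},
    {"source": "10.1/b", "target": "10.1/x", "mentions": 1},
    {"source": "10.1/b", "target": "10.1/y", "mentions": 4},
]

# Aggregate by the cited (target) DOI
article_level = defaultdict(lambda: {"mentions_agg": 0, "refs_agg": 0})
for row in citation_rows:
    agg = article_level[row["target"]]
    agg["mentions_agg"] += row["mentions"]  # sum of incoming mentions
    agg["refs_agg"] += 1                    # one reference per citation pair

for doi, agg in sorted(article_level.items()):
    print(doi, agg)
```

With pandas, a `groupby` on the target column followed by an aggregation would express the same operation.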

In [6]:
f = "article_patterns.csv"
article_patterns = pd.read_csv(f)
ap_schema = describe_schema(f)

In [7]:
# Set primary key
ap_schema.primary_key = "doi"
ap_schema.get_field("doi").title = "DOI of citing article"

# First pattern: mentions_agg
mentions_agg = ap_schema.get_field("mentions_agg")
mentions_agg.title = "Aggregated mentions"
mentions_agg.description = "Sum of all incoming mentions for this DOI"
mentions_agg.type = "integer"
mentions_agg.missing_values = ["", "n/a", "NaN"]
mentions_agg.mic = {
    "type": "pattern",
    "prov": {
        "operation": {
            "description": "Aggregation of all citations by DOI",
            "type": "aggregation",
            "by": "doi",
            "resource": ""
        }
    }
}

# Second pattern: refs_agg
refs_agg = ap_schema.get_field("refs_agg")
refs_agg.title = "Aggregated references"
refs_agg.description = "Sum of all incoming references for this DOI"
refs_agg.type = "integer"
refs_agg.missing_values = ["", "n/a", "NaN"]
refs_agg.mic = {
    "type": "pattern",
    "prov": {
        "operation": {
            "type": "aggregation",
            "by": "doi",
            "input": ""
        }
    }
}

# Add third pattern which is an average

In [None]:
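# Resource specs
# In the Data Resource spec, `path` points to a file while `data` holds inline data;
# `name` is a machine-readable identifier and `title` the human-readable label.
resource_desc = {
    "profile": "pattern",
    "name": "article_patterns",
    "title": "Article-level metrics",
    "path": str(data_dir / "article_patterns.csv"),
    "schema": ap_schema
}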
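
Because the two field definitions above share the same shape, the provenance block could be generated by a small helper and the result serialized as JSON. The sketch below works on plain dictionaries rather than `frictionless` objects. The helper name `pattern_field` and its parameters are hypothetical, the `mic` layout follows the cells above, and `primaryKey` is the Table Schema property for the primary key.

```python
import json


def pattern_field(name, title, description, by, input_name=""):
    """Build a field descriptor with a custom `mic` provenance block."""
    return {
        "name": name,
        "title": title,
        "description": description,
        "type": "integer",
        "mic": {
            "type": "pattern",
            "prov": {
                "operation": {
                    "type": "aggregation",
                    "by": by,
                    "input": input_name,  # left empty, as in the cells above
                }
            },
        },
    }


fields = [
    {"name": "doi", "type": "string"},
    pattern_field("mentions_agg", "Aggregated mentions",
                  "Sum of all incoming mentions for this DOI", by="doi"),
    pattern_field("refs_agg", "Aggregated references",
                  "Sum of all incoming references for this DOI", by="doi"),
]

resource = {
    "name": "article_patterns",
    "path": "article_patterns.csv",
    "schema": {"fields": fields, "primaryKey": "doi"},
}

print(json.dumps(resource, indent=2))
```

Keeping the descriptor as plain JSON-compatible data makes it easy to inspect and to check that the custom `mic` properties survive a round trip through serialization.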

## References

Nakassis, C. V. (2013). Citation and Citationality. Signs and Society, 1(1), 51–77. https://doi.org/10.1086/670165