# Metrics in Context | Describing citation data from Scite.ai with Frictionless

> This notebook presents a proof of concept for a standardized approach to citation data and, more broadly, scholarly metrics.

**Table of Contents**

- [Introduction](#Introduction)
- [A Conceptual Framework for Scholarly Metrics](#A-Conceptual-Framework-for-Scholarly-Metrics)
- [Data](#Data)
- [Implementation](#Implementation)
    - [Events and Contexts](#Events-and-Contexts)
    - [Traces](#Traces)
    - [Patterns](#Patterns)

## Introduction

## A Conceptual Framework for Scholarly Metrics

## Data

The available dataset consists of four spreadsheets:

1. `contexts.csv`: The contexts of the events captured which typically include article metadata. 
2. `traces.csv`: The individual traced citational links between documents.
3. `citations.csv`: This table contains patterns derived from the traces on the level of each individual citation pair.
4. `articles.csv`: A table with article-level metrics.

## Implementation

In [1]:
from pprint import pprint

import pandas as pd
from frictionless import describe_schema

### Events and Contexts

### Traces

### Patterns

#### Citation level patterns

This table contains citation patterns aggregated of individual citations.

Each table will thus contain:

- A unique ID
- A source DOI for the citing article
- A target DOI for the cited article
- One or more patterns
    - Each pattern also contain the form of aggregation and input traces/patterns
        - Optional for each input trace/pattern: a data resource

In [31]:
f = data_dir / "citation_patterns.csv"
cit_patterns = pd.read_csv(f)
cp_schema = describe_schema(f)
cp_schema.fields

[{'name': 'source', 'type': 'string'},
 {'name': 'target', 'type': 'string'},
 {'name': 'total_source_mentions', 'type': 'integer'},
 {'name': 'total_source_refs', 'type': 'integer'},
 {'name': 'mentions', 'type': 'integer'},
 {'name': 'norm_refs', 'type': 'number'},
 {'name': 'norm_mentions', 'type': 'number'},
 {'name': 'mentions_per_ref', 'type': 'number'},
 {'name': 'wf1', 'type': 'number'},
 {'name': 'wf2', 'type': 'number'},
 {'name': 'wf3', 'type': 'number'}]

In [29]:
cp_schema.name = "citation_patterns"
cp_schema.description = 

ap_schema.primary_key = "doi"
ap_schema.get_field("doi").title = "DOI of citing article"

{'fields': [{'name': 'source', 'type': 'string'},
  {'name': 'target', 'type': 'string'},
  {'name': 'total_source_mentions', 'type': 'integer'},
  {'name': 'total_source_refs', 'type': 'integer'},
  {'name': 'mentions', 'type': 'integer'},
  {'name': 'norm_refs', 'type': 'number'},
  {'name': 'norm_mentions', 'type': 'number'},
  {'name': 'mentions_per_ref', 'type': 'number'},
  {'name': 'wf1', 'type': 'number'},
  {'name': 'wf2', 'type': 'number'},
  {'name': 'wf3', 'type': 'number'}]}

### Article level patterns

This table contains citation patterns aggregated on the article level.

Each table will thus contain:

- A unique DOI
- One or more patterns for each DOI
    - Each pattern also contain the form of aggregation and input traces/patterns
        - Optional for each input trace/pattern: a data resource

In [6]:
f = "article_patterns.csv"
article_patterns = pd.read_csv(f)
ap_schema = describe_schema(f)

In [7]:
# Set primary key
ap_schema.primary_key = "doi"
ap_schema.get_field("doi").title = "DOI of citing article"

# First pattern: mentions_agg
mentions_agg = ap_schema.get_field("mentions_agg")
mentions_agg.title = "Aggregated mentions"
mentions_agg.description = "Sum of all incoming mentions for this DOI"
mentions_agg.type = "integer"
mentions_agg.missing_values = ["", "n/a", "NaN"]
mentions_agg.mic = {
    "type": "pattern",
    "prov": {
        "operation": {
            "description": "Aggregation of all citations by DOI"
            "type": "aggregation",
            "by": "doi",
            "resource": ""
        }
    }
}

# Second pattern: refs_agg
refs_agg = ap_schema.get_field("refs_agg")
refs_agg.title = "Aggregated references"
refs_agg.description = "Sum of all incoming references for this DOI"
refs_agg.type = "integer"
refs_agg.missing_values = ["", "n/a", "NaN"]
refs_agg.mic = {
    "type": "pattern",
    "prov": {
        "operation": {
            "type": "aggregation",
            "by": "doi",
            "input": ""
        }
    }
}

# Add third pattern which is an average

SyntaxError: invalid syntax (<ipython-input-7-41b5306200e4>, line 16)

In [None]:
# Resource specs
resource_desc = {
    "profile": "pattern",
    "name": "Article-level metrics",
    "data": data_dir / "article_patterns.csv",
    "schema": ap_schema
}

In [14]:
metadata = pd.read_csv(data_dir / "event_contexts.csv")
traces = pd.read_csv(data_dir / "citation_traces.csv")


In [32]:
f = data_dir / "event_contexts.csv"
event_contexts = describe(f)["schema"]

In [33]:
event_contexts

{'fields': [{'name': 'doi', 'type': 'string'},
  {'name': 'slug', 'type': 'string'},
  {'name': 'type', 'type': 'string'},
  {'name': 'title', 'type': 'string'},
  {'name': 'abstract', 'type': 'string'},
  {'name': 'authors', 'type': 'string'},
  {'name': 'keywords', 'type': 'string'},
  {'name': 'year', 'type': 'number'},
  {'name': 'shortJournal', 'type': 'string'},
  {'name': 'publisher', 'type': 'string'},
  {'name': 'issue', 'type': 'string'},
  {'name': 'volume', 'type': 'integer'},
  {'name': 'page', 'type': 'string'},
  {'name': 'retracted', 'type': 'boolean'},
  {'name': 'memberId', 'type': 'integer'},
  {'name': 'issns', 'type': 'string'},
  {'name': 'editorialNotices', 'type': 'string'},
  {'name': 'journalSlug', 'type': 'string'},
  {'name': 'journal', 'type': 'string'},
  {'name': 'rwStatus', 'type': 'any'}]}

In [20]:
desc = describe(data_dir / "citation_traces.csv")
trace_schema = desc["schema"]

{'fields': [{'name': 'id', 'type': 'integer'},
  {'name': 'source', 'type': 'string'},
  {'name': 'target', 'type': 'string'}]}

In [28]:
trace_schema.get_field("id").title = "Unique ID for a source-target pair"
trace_schema.get_field("source").title = "DOI of citing article"
trace_schema.get_field("source").type = "string"
trace_schema.get_field("target").title = "DOI of cited article"
trace_schema.get_field("target").type = "string"

In [29]:
trace_schema

{'fields': [{'name': 'id',
   'type': 'integer',
   'title': 'Unique ID for a source-target pair'},
  {'name': 'source', 'type': 'string', 'title': 'DOI of citing article'},
  {'name': 'target', 'type': 'string', 'title': 'DOI of cited article'}]}

In [30]:
desc = describe(data_dir / "citation_patterns.csv")
cit_patterns = desc["schema"]

In [31]:
cit_patterns

{'fields': [{'name': 'source', 'type': 'string'},
  {'name': 'target', 'type': 'string'},
  {'name': 'total_source_mentions', 'type': 'integer'},
  {'name': 'total_source_refs', 'type': 'integer'},
  {'name': 'mentions', 'type': 'integer'},
  {'name': 'norm_refs', 'type': 'number'},
  {'name': 'norm_mentions', 'type': 'number'},
  {'name': 'mentions_per_ref', 'type': 'number'},
  {'name': 'wf1', 'type': 'number'},
  {'name': 'wf2', 'type': 'number'},
  {'name': 'wf3', 'type': 'number'}]}

In [43]:
citations = {
    "name": "scholarly citation",
    "local": "scholarly article",
    "reference": "mentions",
    "remote": "scholarly article"
}

In [45]:
citation_links = {
    "name": "citation link",
    "description": "All references deposited to COCI through the I4OC including a long list of publishers (notably, Elsevier is not contributing its reference data)" \
                    "These references are then matched and aggregated for articles.",
    "tracked_event": citations,
    "coverage": "COCI publishers",
    
}

In [46]:
pattern = {
    "name": "aggregation",
    "description": "Local "
    "trace": citation_links
}

In [52]:
pprint(schema.get_field("citation_count")["pattern"])

{'name': 'aggregation',
 'trace': {'description': 'All references deposited to COCI through the I4OC '
                          'including a long list of publishers (notably, '
                          'Elsevier is not contributing its reference '
                          'data)These references are then matched and '
                          'aggregated for articles.',
           'name': 'citation link',
           'tracked_event': {'local': 'scholarly article',
                             'name': 'scholarly citation',
                             'reference': 'mentions',
                             'remote': 'scholarly article'}}}


#### 1.2 Multiple Citation Counts

Describe a dataset with several citation counts provided from Crossref, Web of Science, and Google Scholar.

### 2. Altmetrics

#### 2.1 Facebook shares from Altmetric.com

Describe a dataset of aggregated Facebook share counts provided by Altmetric

#### 2.2 Public and Private Facebook Shares

Describe a dataset with Altmetric Facebook counts and private Facebook share numbers aggregated by our method.

### 3. References

#### 3.1 Basic reference data

Describe a dataset of in-text references provided by Scite.ai.

#### 3.2 References and Citations

Describe a dataset consisting of citation counts derived from a reference data.