# Which citation styles do we have in Crossref data?

Dominika Tkaczyk

5.09.2019

In this notebook I use the style classifier to find out which styles are present in the Crossref collection.

In [1]:
import warnings
warnings.simplefilter('ignore')

import json
import pandas as pd

from styleclass.train import get_default_model
from styleclass.classify import classify

First I need to read the default model:

In [2]:
model = get_default_model()

*model* contains the complete trained style classifier. It can be used to infer the citation style of a new reference string.

Let's read a sample of reference strings drawn from the Crossref API:

In [3]:
with open('data/sample.txt', 'r') as file:
    data = file.readlines()
strings = [d.strip() for d in data]

Finally, I will assign a citation style to each reference string:

In [4]:
styles = classify(strings, *model)
existing_styles = pd.DataFrame({'string': strings, 'style': styles})

Let's look at the fraction of each style in our dataset:

In [5]:
styles_distr = existing_styles.groupby(['style']).size().reset_index(name='counts') 
styles_distr['fraction'] = styles_distr['counts'] / len(strings)
styles_distr = styles_distr.sort_values(by='counts', ascending=False).reset_index(drop=True)
styles_distr

|   | style | counts | fraction |
|---|-------|--------|----------|
| 0 | springer-basic-author-date | 6460 | 0.294011 |
| 1 | apa | 3078 | 0.140087 |
| 2 | unknown | 2363 | 0.107546 |
| 3 | springer-lecture-notes-in-computer-science | 1952 | 0.08884 |
| 4 | vancouver | 1656 | 0.075369 |
| 5 | american-institute-of-physics | 1143 | 0.052021 |
| 6 | bmc-bioinformatics | 1051 | 0.047834 |
| 7 | harvard3 | 784 | 0.035682 |
| 8 | acm-sig-proceedings | 760 | 0.034589 |
| 9 | ieee | 567 | 0.025806 |
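
As a quick sanity check on the table above, the sketch below recovers the sample size implied by the first row and adds up the rows that are shown. It relies only on the numbers printed in the table and on the fact that each fraction was computed as `counts / len(strings)`. The variable names (`rows`, `estimated_size`, `shown_count`, `shown_fraction`) are my own and do not come from the notebook.

```python
# Rows copied from the table above: (style, counts, fraction).
rows = [
    ("springer-basic-author-date", 6460, 0.294011),
    ("apa", 3078, 0.140087),
    ("unknown", 2363, 0.107546),
    ("springer-lecture-notes-in-computer-science", 1952, 0.08884),
    ("vancouver", 1656, 0.075369),
    ("american-institute-of-physics", 1143, 0.052021),
    ("bmc-bioinformatics", 1051, 0.047834),
    ("harvard3", 784, 0.035682),
    ("acm-sig-proceedings", 760, 0.034589),
    ("ieee", 567, 0.025806),
]

# fraction = counts / len(strings), so counts / fraction estimates len(strings).
# The largest row is used because rounding of the printed fraction hurts it least.
_, top_count, top_fraction = rows[0]
estimated_size = round(top_count / top_fraction)

# How much of the sample do the rows shown in the table account for?
shown_count = sum(count for _, count, _ in rows)
shown_fraction = sum(fraction for _, _, fraction in rows)

print("estimated sample size:", estimated_size)
print("strings in rows shown:", shown_count)
print("fraction in rows shown:", round(shown_fraction, 4))
```

With these numbers the sample comes out at roughly 22 thousand strings, and the ten rows shown account for about 90% of it.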


The most popular style is *springer-basic-author-date* (29%), followed by *apa* (14%) and *springer-lecture-notes-in-computer-science* (9%). In addition, 11% of the strings are classified as "unknown", which makes it the third-largest group overall. Let's see a sample of those strings:

In [6]:
for i, s in enumerate(existing_styles.loc[existing_styles['style'] == 'unknown']
                      .sample(10, random_state=10)['string']):
    print('['+str(i)+']', s)

[0] Giddens, Anthony (1988), Die Konstitution der Gesellschaft. Grundzüge einer Theorie der Strukturierung, Frankfurt/M./New York: Campus.
[1] C-H-Bestimmung: cand. chem. K. Hess.
[2] Muckerheide J. The health effects of low-level radiation: Science, data and corrective action 1995 26 34
[3] World Bank 2010 How firms in eastern and central Europe are performing in the post-financial crisis world
[4] Conover, B. (1943): The races of the Knot(C. canutus). — Condor 45, p. 226–228.
[5] ASEAN Secretariat (2009a) ASEAN Political…
[6] Austin, A. L. sr. andAustin, A. L. jr. (1931): Food poisoning in shore Birds. — Auk 48, p. 195–197.
[7] MOBIL Chemical. June 22 1970. June 22, Technical Bulletin Method R‐36
[8] H. A.Sauerand S. S.Flaschen: Proc. 7th Electronic Components Symposium, Washington, 1955, (Engineering Publ . New York , 1956) p. 41 .
[9] 1998.Belfast Telegraph, 23 May
