# Which citation styles do we have in Crossref data?

Dominika Tkaczyk

2.11.2018

In this notebook I use the style classifier to find out which citation styles are present in Crossref data and how they are distributed.

In [1]:
import sys
sys.path.append('..')

%matplotlib inline

import warnings
warnings.simplefilter('ignore')

import json
import pandas as pd
import re

from data_utils import add_noise, clean_data, read_ref_strings_data, remove_technical_parts, rearrange_tokens
from features import get_features, select_features_chi2
from sklearn.linear_model import LogisticRegression



## Training

First, I will train the classifier. To do that, I have to read the training data first:

In [2]:
dataset = read_ref_strings_data('../data/ref_strings/')
print('Dataset size: {}'.format(dataset.shape[0]))
dataset.head()

Dataset size: 85000


| | doi | string | style |
|---|---|---|---|
| 0 | 10.1016/s0002-9394(14)70125-4 | [1]LEE, S.-H. and TSENG, S.C.G. 1997. Amniotic... | acm-sig-proceedings |
| 1 | 10.1016/0920-9964(95)95073-i | [1]Scheffer, R. et al. 1995. History of premor... | acm-sig-proceedings |
| 2 | 10.1075/cilt.97.22vek | [1]Vekerdi, J. 1993. 4. Word formation in Gips... | acm-sig-proceedings |
| 3 | 10.1080/19761597.2013.810947 | [1]Kang, J. et al. 2013. Determinants of succe... | acm-sig-proceedings |
| 4 | 10.1016/0378-1119(79)90090-8 | [1]Wickens, M.P. et al. 1979. Restriction map ... | acm-sig-proceedings |


Cleaning and preprocessing the data (more about this procedure can be found [here]()):

In [3]:
# Clean the dataset, then preprocess every reference string
dataset = clean_data(dataset)
dataset['string'] = dataset['string'].apply(remove_technical_parts)
dataset['string'] = dataset['string'].apply(add_noise)
# Build examples of the "unknown" class from 5000 sampled strings with rearranged tokens
dataset_unknown = dataset.sample(5000)
dataset_unknown['string'] = dataset_unknown['string'].apply(rearrange_tokens)
dataset_unknown['style'] = 'unknown'
dataset = pd.concat([dataset, dataset_unknown])
print('Dataset size: {}'.format(dataset.shape[0]))

Dataset size: 87834
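
The idea behind the "unknown" class can be illustrated with a minimal sketch. The implementation of `rearrange_tokens` is not shown in this notebook, so the helper below (`shuffle_tokens`, a hypothetical name) is only an assumption about what such a function might do: it randomly reorders the whitespace-separated tokens, so that the string no longer follows any known style. The reference string is invented for the example.

```python
import random


def shuffle_tokens(reference, rng):
    """Return the reference with its whitespace-separated tokens in random order.

    Hypothetical stand-in for rearrange_tokens; the real function may differ.
    """
    tokens = reference.split()
    rng.shuffle(tokens)
    return ' '.join(tokens)


rng = random.Random(0)  # fixed seed so the example is reproducible
# Invented reference string, used only for illustration
original = 'Smith, J. and Doe, A. 2010. An example title. Journal of Examples. 12, 3 (2010), 45-67.'
scrambled = shuffle_tokens(original, rng)
print(scrambled)
```

The scrambled string keeps exactly the same tokens as the original, so the classifier has to rely on their order and the punctuation around them rather than on vocabulary alone.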


Training the classification model (more about the parameters and the training can be found [here]()):

In [4]:
count_vectorizer, tfidf_transformer, features = get_features(dataset['string'], nfeatures=5000,
                                                             feature_selector=select_features_chi2, ngrams=(2, 4))
model = LogisticRegression(random_state=0).fit(features, dataset['style'])

*model* now holds the trained style classifier. Together with *count_vectorizer* and *tfidf_transformer*, it can be used to infer the citation style of a new reference string.

## Classifying real reference strings

Let's read a sample of metadata records drawn from the Crossref API:

In [5]:
with open('../data/samples/sample-10000.json', 'r') as file:
    data = json.loads(file.read())['sample']

Next, I iterate over all unstructured reference strings found in the records and assign a citation style (or "unknown") to each of them:

In [6]:
strings = []
styles = []
probabilities = []
for d in data:
    for r in d.get('reference', []):
        if 'unstructured' in r:
            # Remove URLs, DOIs and "doi:" prefixes, which are not part of the citation style
            r['unstructured'] = re.sub(r'http[^ ]+', '', r['unstructured']).strip()
            r['unstructured'] = re.sub(r'(?<!\d)10\.\d{4,9}/[-\._;\(\)/:a-zA-Z0-9]+', '', r['unstructured'])
            r['unstructured'] = re.sub(r'doi:?', '', r['unstructured']).strip()
            # Skip strings that are too short to classify
            if len(r['unstructured']) < 11:
                continue
            # Reuse the fitted vectorizer and transformer to extract features
            _, _, test_features = get_features([r['unstructured']], count_vectorizer=count_vectorizer,
                                               tfidf_transformer=tfidf_transformer)
            prediction = model.predict(test_features)
            # Keep the probability of the most likely style
            probabilities.append(max(model.predict_proba(test_features)[0]))
            strings.append(r['unstructured'])
            styles.append(prediction[0])
existing_styles = pd.DataFrame({'string': strings, 'style': styles})
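
The cleaning step in the loop above can be checked in isolation. The sketch below wraps the same three substitutions in a helper (`strip_identifiers`, a hypothetical name that does not exist in the project) and applies it to an invented reference string:

```python
import re

# Same DOI pattern as in the loop above, compiled once
DOI_PATTERN = re.compile(r'(?<!\d)10\.\d{4,9}/[-\._;\(\)/:a-zA-Z0-9]+')


def strip_identifiers(text):
    """Remove URLs, DOIs and "doi:" prefixes from a reference string."""
    text = re.sub(r'http[^ ]+', '', text).strip()
    text = DOI_PATTERN.sub('', text)
    text = re.sub(r'doi:?', '', text).strip()
    return text


# Invented reference string, used only for illustration
example = 'Doe, J. (2010). An example title. Journal of Examples, 12(3), 45-67. doi:10.1234/example.5678'
cleaned = strip_identifiers(example)
print(cleaned)
# prints: Doe, J. (2010). An example title. Journal of Examples, 12(3), 45-67.
```

Note that the `doi:?` substitution is case-sensitive and not anchored, so it would also remove the letters "doi" inside a longer lowercase word. That is acceptable for a quick analysis, but worth keeping in mind.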

## The distribution of the styles

Let's look at the fraction of each style in our dataset:

In [7]:
styles_distr = existing_styles.groupby(['style']).size().reset_index(name='counts') 
styles_distr['fraction'] = styles_distr['counts'] / len(strings)
styles_distr = styles_distr.sort_values(by='counts', ascending=False).reset_index(drop=True)
styles_distr

| | style | counts | fraction |
|---|---|---|---|
| 0 | springer-basic-author-date | 6303 | 0.289886 |
| 1 | apa | 3081 | 0.141701 |
| 2 | unknown | 2753 | 0.126615 |
| 3 | springer-lecture-notes-in-computer-science | 1997 | 0.091846 |
| 4 | vancouver | 1911 | 0.08789 |
| 5 | american-institute-of-physics | 947 | 0.043554 |
| 6 | bmc-bioinformatics | 873 | 0.040151 |
| 7 | acm-sig-proceedings | 808 | 0.037161 |
| 8 | harvard3 | 752 | 0.034586 |
| 9 | elsevier-with-titles | 589 | 0.027089 |
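
The *fraction* column is simply each style's count divided by the total number of classified strings. As a rough cross-check that does not depend on pandas, the same computation can be sketched with the standard library. The label list below is toy data, not the real predictions:

```python
from collections import Counter

# Toy list of predicted labels, standing in for the real `styles` list
predicted_styles = ['apa', 'apa', 'vancouver', 'unknown', 'apa', 'vancouver']

counts = Counter(predicted_styles)
total = len(predicted_styles)

# (style, count, fraction) tuples, most frequent style first
distribution = [(style, count, count / total) for style, count in counts.most_common()]
for style, count, fraction in distribution:
    print('{:<10} {:>3} {:.3f}'.format(style, count, fraction))
```

Because every string receives exactly one label, the fractions always sum to 1.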


The most popular style is *springer-basic-author-date* (29%), followed by *apa* (14%) and *springer-lecture-notes-in-computer-science* (9%). In addition, 13% of the strings are classified as "unknown", which makes it the third largest group. Let's see a sample of those strings:

In [8]:
for i, s in enumerate(existing_styles.loc[existing_styles['style'] == 'unknown'].sample(10, random_state=10)['string']):
    print('['+str(i)+']', s)

[0] Blair G.P., “The Basic Design of Two-Stroke Engines”, SAE, in Chapter 2 ‘The Theory of Unsteady Gas Flow Through Engines’, 2nd Edition, in 1995, pp. 141.
[1] F. Tang andJ. O. Henningsen:IEEE J. Quantum Electron.,22, 2084 (1986).
[2] ASEAN Secretariat (2013a) Opening Remarks by H.E. Le Luong Minh Secretary General of ASEAN at The ASEAN-UN Workshop Lessons Learned and Best Practices in Conflict Prevention and Preventive Diplomacy, 5 April 2013,  /news/item/opening-remarks-by-he-le-luong-minh-secretary-gen-eral-of-asean-at-the-asean-un-workshop-lessons-learned-and-best-practices-in-conflict-prevention-and-preventive-diplomacy date accessed 7 August 2014.
[3] Wassom J. S. Mutagenesis and teratogenesis. Symposium on the Handling of Toxicological Information May 27-28 1976 Natl. Tech. Information Services PB-283-164.
[4] Myers , J. S. Albany, Western Australia, West. Aust. Geol. Surv., 1∶1 000 000 Geol. Ser., Explanatory Notes 9 1995
[5] Littwin , W. 1935 Forsch. Staat. Obs., Danzig
[6] 