# Which citation styles do we have in Crossref data?

Dominika Tkaczyk

16.11.2018

In this notebook I use the style classifier to find out which styles are present in the Crossref collection.

In [1]:
import sys
sys.path.append('..')

%matplotlib inline

import warnings
warnings.simplefilter('ignore')

import json
import pandas as pd
import re

from config import *
from dataset import read_ref_strings_data, generate_unknown
from features import get_features, select_features_chi2
from sklearn.linear_model import LogisticRegression
from train import clean, train



## Training

First, I train the classifier, which requires reading the training data:

In [2]:
dataset = read_ref_strings_data('../data/dataset/')
print('Dataset size: {}'.format(dataset.shape[0]))
dataset.head()

Dataset size: 85000


| | doi | style | string |
|---|---|---|---|
| 0 | 10.1016/s0002-9394(14)70125-4 | acm-sig-proceedings | [1]LEE, S.-H. and TSENG, S.C.G. 1997. Amniotic... |
| 1 | 10.1016/0920-9964(95)95073-i | acm-sig-proceedings | [1]Scheffer, R. et al. 1995. History of premor... |
| 2 | 10.1075/cilt.97.22vek | acm-sig-proceedings | [1]Vekerdi, J. 1993. 4. Word formation in Gips... |
| 3 | 10.1080/19761597.2013.810947 | acm-sig-proceedings | [1]Kang, J. et al. 2013. Determinants of succe... |
| 4 | 10.1016/0378-1119(79)90090-8 | acm-sig-proceedings | [1]Wickens, M.P. et al. 1979. Restriction map ... |


Cleaning and preprocessing the data (more about this procedure can be found [here](https://github.com/CrossRef/citation-style-classifier/blob/master/analyses/citation_style_classification.ipynb)):

In [3]:
dataset = clean(dataset, random_state=0)
dataset_unknown = generate_unknown(dataset, 5000, random_state=0)
dataset = pd.concat([dataset, dataset_unknown])
print('Dataset size: {}'.format(dataset.shape[0]))

Dataset size: 87834


Training the classification model (more about the parameters and the training can be found [here](https://github.com/CrossRef/citation-style-classifier/blob/master/analyses/citation_style_classification.ipynb)):

In [4]:
count_vectorizer, tfidf_transformer, model = train(dataset, random_state=0)

Together, *count_vectorizer*, *tfidf_transformer* and *model* form the complete trained style classifier: the first two turn a reference string into features, and *model* infers its citation style from them.

## Classifying real reference strings

Let's read a sample of metadata records drawn from the Crossref API:

In [5]:
with open('../data/samples/sample-10K.json', 'r') as file:
    data = json.load(file)['sample']
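
For orientation, here is a toy version of the structure that this loader and the next cell rely on. Only the top-level `sample` key and the per-record `reference` list with optional `unstructured` entries come from the notebook; the DOIs, the `key` fields and the reference text are invented for illustration.

```python
import json

# Hypothetical miniature of the sample file; all values are invented.
raw = '''
{"sample": [
  {"DOI": "10.5555/example.1",
   "reference": [
     {"key": "ref1",
      "unstructured": "Doe, J. (2001). An example title. Journal of Examples, 1(2), 3-4."},
     {"key": "ref2", "DOI": "10.5555/example.2"}
   ]},
  {"DOI": "10.5555/example.3"}
]}
'''

records = json.loads(raw)['sample']

# Collect only the free-text references. Records without a 'reference'
# list and references without an 'unstructured' field are skipped.
unstructured = [
    ref['unstructured']
    for record in records
    for ref in record.get('reference', [])
    if 'unstructured' in ref
]
print(len(records), len(unstructured))
```

In this toy input, only one of the two references carries free text, so only that one would reach the classifier.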

Next, I iterate over all unstructured reference strings found in the records and assign a citation style (or "unknown") to each of them:

In [6]:
strings = []
styles = []
probabilities = []
for d in data:
    for r in d.get('reference', []):
        if 'unstructured' in r:
            # Strip URLs, bare DOIs and "doi:" prefixes before classification.
            r['unstructured'] = re.sub(r'http[^ ]+', '', r['unstructured']).strip()
            r['unstructured'] = re.sub(r'(?<!\d)10\.\d{4,9}/[-\._;\(\)/:a-zA-Z0-9]+', '', r['unstructured'])
            r['unstructured'] = re.sub(r'doi:?', '', r['unstructured']).strip()
            # Skip strings too short to be a meaningful reference.
            if len(r['unstructured']) < 11:
                continue
            _, _, test_features = get_features([r['unstructured']], count_vectorizer=count_vectorizer,
                                               tfidf_transformer=tfidf_transformer)
            prediction = model.predict(test_features)
            probabilities.append(max(model.predict_proba(test_features)[0]))
            strings.append(r['unstructured'])
            styles.append(prediction[0])
existing_styles = pd.DataFrame({'string': strings, 'style': styles})
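
The clean-up at the top of the loop can be pulled into a small helper, which makes it easy to try on individual strings. This is a sketch that mirrors the three substitutions and the length check from the cell above; the function name `strip_identifiers`, its `min_length` parameter and the example strings are mine, not part of the repository.

```python
import re

# Same three patterns as in the loop, written as raw strings.
URL_RE = re.compile(r'http[^ ]+')
DOI_RE = re.compile(r'(?<!\d)10\.\d{4,9}/[-\._;\(\)/:a-zA-Z0-9]+')
DOI_PREFIX_RE = re.compile(r'doi:?')


def strip_identifiers(ref, min_length=11):
    """Remove URLs, DOIs and 'doi:' prefixes; return None if too little is left."""
    ref = URL_RE.sub('', ref).strip()
    ref = DOI_RE.sub('', ref)
    ref = DOI_PREFIX_RE.sub('', ref).strip()
    return ref if len(ref) >= min_length else None


examples = [
    'Smith, J. 2010. A title. doi:10.1000/xyz123',
    'Kang, J. et al. 2013. Title. https://doi.org/10.1080/19761597.2013.810947',
    'See http://example.org',
]
for example in examples:
    print(repr(strip_identifiers(example)))
```

Note that the last pattern is case-sensitive and unanchored, so an upper-case `DOI:` prefix survives, while a literal `doi` inside a longer token would be removed.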

## The distribution of the styles

Let's look at the fraction of each style in our dataset:

In [7]:
styles_distr = existing_styles.groupby(['style']).size().reset_index(name='counts') 
styles_distr['fraction'] = styles_distr['counts'] / len(strings)
styles_distr = styles_distr.sort_values(by='counts', ascending=False).reset_index(drop=True)
styles_distr

| | style | counts | fraction |
|---|---|---|---|
| 0 | springer-basic-author-date | 6276 | 0.288645 |
| 1 | apa | 3086 | 0.141931 |
| 2 | unknown | 2873 | 0.132134 |
| 3 | springer-lecture-notes-in-computer-science | 1949 | 0.089638 |
| 4 | vancouver | 1943 | 0.089362 |
| 5 | american-institute-of-physics | 957 | 0.044014 |
| 6 | bmc-bioinformatics | 902 | 0.041485 |
| 7 | acm-sig-proceedings | 787 | 0.036196 |
| 8 | harvard3 | 746 | 0.03431 |
| 9 | ieee | 570 | 0.026215 |
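
The fractions are computed against `len(strings)`, that is, all classified strings, so they can be used to sanity-check the listing. The snippet below copies the counts from the table and recovers the implied total from the first row. Treating the difference as strings assigned to styles not listed here is my reading of the numbers, not something the notebook states.

```python
# Counts copied from the table above.
counts = {
    'springer-basic-author-date': 6276,
    'apa': 3086,
    'unknown': 2873,
    'springer-lecture-notes-in-computer-science': 1949,
    'vancouver': 1943,
    'american-institute-of-physics': 957,
    'bmc-bioinformatics': 902,
    'acm-sig-proceedings': 787,
    'harvard3': 746,
    'ieee': 570,
}
top_fraction = 0.288645  # fraction reported for springer-basic-author-date

# fraction = count / total, so total is approximately count / fraction.
implied_total = round(counts['springer-basic-author-date'] / top_fraction)
listed_total = sum(counts.values())
unlisted = implied_total - listed_total

print(implied_total, listed_total, unlisted)
print('share not listed: {:.1%}'.format(unlisted / implied_total))
```

With these values, the ten listed rows account for 20,089 of about 21,743 strings, leaving roughly 7.6% outside the listing.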


The most popular style is *springer-basic-author-date* (29%), followed by *apa* (14%), with *springer-lecture-notes-in-computer-science* and *vancouver* at about 9% each. Another 13% of the strings were classified as "unknown". Let's see a sample of those strings:

In [8]:
for i, s in enumerate(existing_styles.loc[existing_styles['style'] == 'unknown'].sample(10, random_state=10)['string']):
    print('['+str(i)+']', s)

[0] High-Density 25/100 Gigabit Ethernet StrataXGS Tomahawk Ethernet Switch Series.
[1] Englund, E. J., and Sparks, A. R., 1988, Geo-EAS (Geostatistical Environmental Assessment Software) User's Guide, EPA/600/4-88/033: U.S. EPA, Las Vegas, 174 p.
[2] Catel, Dtsch. med. Wschr.1933 1689;1935, 985
[3] Alternatives to Traditional Transportation Fuels.” (1994) Energy Information Administration (EIA) Report. Washington DC: U.S. Department of Energy, June.
[4] Electric Properties of Polymers [in Russian], 2nd ed., Leningrad (1977).
[5] ABAQUS/Standard User Manual, Version 5.8. Hibbit, Carlsson and Sorensen Inc., Pawtucket RI, USA, 1998.
[6] Dee, D. P., and Coauthors, 2011, “The ERA-Interim reanalysis: Configuration and performance of the data assimilation system”, Quart. J. Roy. Meteor. Soc., 137, 553–597,   .
[7] <a target="_blank" href='
[8] 1(h). Carl Casper , "Method of Manufacturing Melting Crucibles," Ger. Pat. 210,085, 1907 .
[9] Brown, D. H. (1947): Food-washing Habits of Waders. — B