The input data consists of two collections, `signatures` and `records`.

In [12]:
import pandas as pd
import re

from json import load
from unidecode import unidecode

In [19]:
with open("/data01/beard/clustering/signatures.json") as f:
    signatures = load(f)
# Index signatures by their signature_id
signatures = {s["signature_id"]: s for s in signatures}

In [20]:
with open("/data01/beard/clustering/records.json") as f:
    records = load(f)
# Index records by their publication_id
records = {r["publication_id"]: r for r in records}

In [9]:
data = [len(signatures), len(records)]
index = ['Signatures',  # Total number of signatures
         'Records',  # Total number of records
        ]
d = {'Count': data}
df = pd.DataFrame(data=d, index=index)
df

,Count
Signatures,8958311
Records,1137394


Example of an arbitrarily chosen record `r` and a signature `s`, where `s` belongs to one of the contributors of `r`. Both are represented as JSON objects:

In [10]:
records['1262925']

{u'authors': [u'Ellis, John', u'Mustafayev, Azar', u'Olive, Keith A'],
 u'publication_id': u'1262925',
 u'title': u'Resurrecting No-Scale Supergravity Phenomenology',
 u'year': u'2010'}

In [11]:
signatures['1262925_Ellis, John_4155994']

{u'author_affiliation': u'CERN',
 u'author_inspire_id': u'00146525',
 u'author_name': u'Ellis, John',
 u'publication_id': u'1262925',
 u'signature_id': u'1262925_Ellis, John_4155994'}
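
The link between the two structures can be sketched as follows. This is a minimal illustration that reuses the two example objects shown above as inline literals; the helper name `record_for` is hypothetical and not part of the original notebook.

```python
# Sample data copied from the example outputs above.
records = {
    "1262925": {
        "authors": ["Ellis, John", "Mustafayev, Azar", "Olive, Keith A"],
        "publication_id": "1262925",
        "title": "Resurrecting No-Scale Supergravity Phenomenology",
        "year": "2010",
    }
}
signatures = {
    "1262925_Ellis, John_4155994": {
        "author_affiliation": "CERN",
        "author_inspire_id": "00146525",
        "author_name": "Ellis, John",
        "publication_id": "1262925",
        "signature_id": "1262925_Ellis, John_4155994",
    }
}


def record_for(signature, records):
    """Hypothetical helper: return the record a signature belongs to."""
    return records[signature["publication_id"]]


s = signatures["1262925_Ellis, John_4155994"]
r = record_for(s, records)

# The signature's author name should be one of the record's authors.
is_contributor = s["author_name"] in r["authors"]
print("%s: contributor=%s" % (r["title"], is_contributor))
```

The lookup relies only on the `publication_id` field that both structures share.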

Brief statistics for `records` and `signatures` (see `predicted-clusters_analyzer` for more signature statistics):

In [18]:
# Count entries with a non-empty value, without building intermediate lists
r_title = sum(1 for r in records.itervalues() if r["title"])
r_year = sum(1 for r in records.itervalues() if r["year"])
s_affiliation = sum(1 for s in signatures.itervalues() if s["author_affiliation"])

data = [r_title, r_year, s_affiliation]
index = ['Records having title',
         'Records having year',
         'Signatures having affiliation'
        ]
d = {'Count': data}
df = pd.DataFrame(data=d, index=index)
df

,Count
Records having title,1118457
Records having year,1134291
Signatures having affiliation,3605596
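
To put these counts in proportion, the sketch below converts them into percentages of the corresponding totals. The numbers are copied from the two tables above; the variable names are illustrative only.

```python
# Counts copied from the two tables above.
n_signatures = 8958311
n_records = 1137394
counts = {
    "Records having title": (1118457, n_records),
    "Records having year": (1134291, n_records),
    "Signatures having affiliation": (3605596, n_signatures),
}

# Share of the corresponding total, in percent.
coverage = {
    label: 100.0 * count / total
    for label, (count, total) in counts.items()
}

for label in sorted(coverage):
    print("%-30s %6.2f%%" % (label, coverage[label]))
```

With these counts, roughly 98% of records have a title, nearly all have a year, and about 40% of signatures carry an affiliation.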


While running the disambiguation on the signature set, we found author full names that are invalid. The names are stored in the MARC 21 field `100__a`. The regular expression applied below may not cover every invalid case. Each invalid name is reported as a pair `(recid, full_name)`: the ID of the record the author contributed to, followed by the name itself.

In [15]:
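# Valid names start with a letter and have at least three letters in the leading run of letters and separators
p = re.compile(r"^([a-zA-Z]+[-',.~\s]*[a-zA-Z]*){3,}", flags=re.UNICODE)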

# http://fabzter.com/blog/remove-nonspacing-characters-text-python
print(unidecode(u"áéíóú äëïöü ø ñÑ û"))  # Sanity check: accented characters are transliterated to ASCII

# Collect signatures whose transliterated author name does not match the pattern
invalid_pairs = []  # (publication_id, author_name) pairs
for signature in signatures.itervalues():
    author_name = signature["author_name"]

    if not p.match(unidecode(author_name)):
        invalid_pairs.append((signature["publication_id"], author_name))

print(len(invalid_pairs))

aeiou aeiou o nN u
5015
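
To make the behaviour of the pattern more concrete, here is a small self-contained sketch that applies it to a few names. Apart from `Ellis, John`, the sample names are hypothetical. The helper `strip_accents` is an assumption used as a standard-library stand-in for `unidecode`: it only drops combining marks and does not transliterate characters such as `ø`, so it is not equivalent to the call used in the notebook.

```python
import re
import unicodedata

# Same pattern as in the cell above.
p = re.compile(r"^([a-zA-Z]+[-',.~\s]*[a-zA-Z]*){3,}", flags=re.UNICODE)


def strip_accents(text):
    """Stand-in for unidecode (assumption): drop combining marks only."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def is_valid(name):
    """Return True if the accent-stripped name matches the pattern."""
    return p.match(strip_accents(name)) is not None


# Hypothetical sample names, except for the first one.
samples = [
    u"Ellis, John",         # at least three leading letters
    u"M\u00fcller, Hans",   # matches only after accents are stripped
    u"Li",                  # fewer than three letters
    u"X, Y",                # fewer than three letters
    u"123 Collaboration",   # does not start with a letter
    u", John",              # does not start with a letter
    u"Ellis, John 123",     # trailing digits are not inspected
]
results = {name: is_valid(name) for name in samples}
for name in samples:
    print("%r -> %s" % (name, "valid" if results[name] else "invalid"))
```

Because the pattern is anchored only at the start of the string, a name such as `Ellis, John 123` still counts as valid, which is one way invalid cases can slip through.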
