<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Identifier-lookalikes" data-toc-modified-id="Identifier-lookalikes-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Identifier lookalikes</a></span><ul class="toc-item"><li><span><a href="#Config" data-toc-modified-id="Config-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Config</a></span></li></ul></li><li><span><a href="#Analysing-the-values" data-toc-modified-id="Analysing-the-values-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Analysing the values</a></span></li><li><span><a href="#Frequencies-of-identifier-values" data-toc-modified-id="Frequencies-of-identifier-values-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Frequencies of identifier values</a></span></li></ul></div>

## Identifier lookalikes

We extract identifier like values from NARCIS documents.

**N.B.:** 
See [helloNarcis](helloNarcis.ipynb) for the provenance of the data that we work with,
and how to set up your system to run these notebooks locally.

An identifier lookalike is a value that satisfies at least one of the following constraints:

* **num** it consists completely of the numerals `0..9`,
* **protocol** it starts with a sequence of ascii letters, then comes a `:`, 
  and then an ascii letter or `/`
* **web** it contains a `.`, immediately surrounded by ascii lowercase letters
  
Before inspecting values, we split it on whitespace.

In [1]:
import collections
import os
import re

from utils import readData, writeData, getTopKeys

### Config

Specify the location of the database dump here; also the local location of this notebook.

In [2]:
REPO_DIR = os.path.expanduser('~/github/Dans-labs/narcis-explore')

TEMP_NAME = '_temp'
TEMP_DIR = f'{REPO_DIR}/{TEMP_NAME}'

RESULT_NAME = 'results'
RESULT_DIR = f'{REPO_DIR}/{RESULT_NAME}'

IDENT_NAME = 'identifiers.tsv'
IDENT_FILE = f'{TEMP_DIR}/{IDENT_NAME}'

FIELDS_NAME = 'fields.tfx'
FIELDS_FILE = f'{TEMP_DIR}/{FIELDS_NAME}'

DOC_NAME = 'docId.tfx'
DOC_FILE = f'{TEMP_DIR}/{DOC_NAME}'
KEY_NAME = 'keys.tfx'
KEY_FILE = f'{TEMP_DIR}/{KEY_NAME}'
VAL_NAME = 'values.tfx'
VAL_FILE = f'{TEMP_DIR}/{VAL_NAME}'

for outDir in (TEMP_DIR, RESULT_DIR):
    os.makedirs(outDir, exist_ok=True)

The patterns to detect identifiers with.

In [3]:
protocolPat = re.compile('^[a-zA-Z]+:[a-zA-Z/]')
webPat = re.compile('[a-z]\.[a-z]')

## Analysing the values

We walk through all distinct values and collect the ones that look like an identifier or the
numerical part of it.

In [7]:
values = readData(VAL_FILE)

In [10]:
identifiers = set()

for (i, val) in enumerate(values):
    if (
        type(val) is int or
        type(val) is str and (
            val.isnumeric() or
            protocolPat.match(val) or
            webPat.match(val)
        )
    ):
        identifiers.add(i)
print(f'{len(identifiers)} identifier candidates out of {len(values)} values')

6801 identifier candidates out of 12600 values


## Frequencies of identifier values

Some links point to generic things, such as schemas, vocabularies. 
Those we see over and over again.
Let's harvest them.

First we load all the document data.

In [11]:
fields = readData(FIELDS_FILE)

Now we pick the identifiers and store them in a counter.

In [12]:
identFreq = collections.Counter()
for pairs in fields:
    for (key, value) in pairs.items():
        if value in identifiers:
            identFreq[value] += 1
print(f'{len(identFreq)} identifiers counted')

6801 identifiers counted


Show the distribution.

In [13]:
identDist = collections.Counter()
for (ident, freq) in identFreq.items():
    identDist[freq] += 1

for (freq, n) in sorted(identDist.items(), key=lambda x: (-x[1], x[0])):
    print(f'{n:>7} identifiers with frequency {freq:>7}')

   5650 identifiers with frequency       1
   1073 identifiers with frequency       2
     13 identifiers with frequency       4
      8 identifiers with frequency       8
      7 identifiers with frequency       6
      4 identifiers with frequency      12
      4 identifiers with frequency     998
      3 identifiers with frequency      14
      3 identifiers with frequency      16
      3 identifiers with frequency     100
      2 identifiers with frequency      10
      2 identifiers with frequency      22
      2 identifiers with frequency      80
      2 identifiers with frequency      90
      2 identifiers with frequency      91
      2 identifiers with frequency    1000
      1 identifiers with frequency       3
      1 identifiers with frequency       5
      1 identifiers with frequency      13
      1 identifiers with frequency      17
      1 identifiers with frequency      34
      1 identifiers with frequency      45
      1 identifiers with frequency      52
      1 ide

Now we select the link values that occur more than a certain number of times.

In [17]:
freqThreshold = 100

frequentValues = collections.defaultdict(collections.Counter)

for (val, freq) in identFreq.items():
    if freq >= freqThreshold:
        frequentValues[val] = freq

In [22]:
len(frequentValues)

504

In [19]:
for (val, freq) in sorted(frequentValues.items(), key=lambda x: (-x[1], x[0])):
    print(f'{freq:>9} x {values[val]}')

     1000 x info:eu-repo/semantics/descriptiveMetadata
     1000 x info:eu-repo/semantics/humanStartPage
      998 x 1
      998 x http://dare.uva.nl/cgi/arno/oai/aup
      998 x 36
      998 x http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-6.xsd
      985 x info:eu-repo/semantics/book
      687 x info:eu-repo/semantics/objectFile
      686 x http://purl.org/eprint/accessRights/OpenAccess
      162 x 2006
      138 x 2005
      105 x 2007
      100 x 4775
      100 x 4781
      100 x 4782
