<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Hello-NARCIS" data-toc-modified-id="Hello-NARCIS-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Hello NARCIS</a></span></li><li><span><a href="#Preparation" data-toc-modified-id="Preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Config" data-toc-modified-id="Config-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Config</a></span></li><li><span><a href="#Start-Docker" data-toc-modified-id="Start-Docker-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Start Docker</a></span></li><li><span><a href="#Connect-to-MongoDB" data-toc-modified-id="Connect-to-MongoDB-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Connect to MongoDB</a></span></li></ul></li><li><span><a href="#Key-value-generation" data-toc-modified-id="Key-value-generation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Key value generation</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Load data</a></span><ul class="toc-item"><li><span><a href="#Top-level-fields" data-toc-modified-id="Top-level-fields-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Top level fields</a></span></li></ul></li><li><span><a href="#Inventory-of-keys" data-toc-modified-id="Inventory-of-keys-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Inventory of keys</a></span><ul class="toc-item"><li><span><a href="#Nested-keys" data-toc-modified-id="Nested-keys-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Nested keys</a></span></li></ul></li><li><span><a href="#Results" data-toc-modified-id="Results-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Results</a></span></li><li><span><a href="#Appendix" data-toc-modified-id="Appendix-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Appendix</a></span><ul class="toc-item"><li><span><a href="#Use-variety" 
data-toc-modified-id="Use-variety-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Use <em>variety</em></a></span></li><li><span><a href="#Preparation" data-toc-modified-id="Preparation-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Run-variety" data-toc-modified-id="Run-variety-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Run <em>variety</em></a></span></li><li><span><a href="#Results" data-toc-modified-id="Results-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Results</a></span></li></ul></li></ul></div>

## Hello NARCIS

We explore a dump of the [NARCIS](https://www.narcis.nl) data.
The dump was taken by
[Emil Bode](https://dans.knaw.nl/nl/over/organisatie-beleid/medewerkers/bode)
in December 2017.
He has prepared it as MongoDB database contents in a Docker container.
This container is obtainable from DataVerse:
[NARCIS metadata in nldidlnorm format](http://hdl.handle.net/10411/FTZVH4).
Next to the dump is a *readme* that tells you exactly how to query the data locally.

The purpose of this notebook is an initial scan of the full database contents.
The database is called `NARCIS`, and in it is just one collection, `Dec2017`, with over 1.6 million
documents.

A document is a dictionary of keys and values, where values may be numbers, strings, dates, but also lists
of values and also dictionaries of keys and values.
The nesting of lists and dictionaries can be arbitrarily deep.

Documents do not have to conform to any schema. Most likely, there are a few dominant schemata,
and there may well be outliers.

We assume no previous knowledge of NARCIS, so we want to explore the data from scratch.

This notebook performs a first step:
* we convert the documents into sets of key values, and write them economically to disk.

## Preparation

We assume that you have a directory `local` under your home directory, with a directory `narcis` in it.
We assume that you have downloaded the dump from DataVerse and placed it under this `narcis` directory.

If you have it in a different location on your system, you can adapt the notebook to your own situation
by editing some values in the config section below.

You need to have [Docker](https://www.docker.com/get-docker) installed on your system.

In [1]:
import collections
import os
from pymongo import MongoClient
from bson.objectid import ObjectId

from utils import readData, writeData, getTopKeys

### Config

Specify the location of the database dump here; also the local location of this notebook.

In [2]:
DUMP_DIR = os.path.expanduser('~/local/narcis/MongoDB-NARCIS-Dec17')

REPO_DIR = os.path.expanduser('~/github/Dans-labs/narcis-explore')

TEMP_NAME = '_temp'
TEMP_DIR = f'{REPO_DIR}/{TEMP_NAME}'

RESULT_NAME = 'results'
RESULT_DIR = f'{REPO_DIR}/{RESULT_NAME}'

FIELDS_VAR_NAME = 'fields-variety.txt'
FIELDS_VAR_FILE = f'{RESULT_DIR}/{FIELDS_VAR_NAME}'

FIELDS_NAME = 'fields.tfx'
FIELDS_FILE = f'{TEMP_DIR}/{FIELDS_NAME}'
DOC_NAME = 'docId.tfx'
DOC_FILE = f'{TEMP_DIR}/{DOC_NAME}'
KEY_NAME = 'keys.tfx'
KEY_FILE = f'{TEMP_DIR}/{KEY_NAME}'
VAL_NAME = 'values.tfx'
VAL_FILE = f'{TEMP_DIR}/{VAL_NAME}'

FIELD_FREQ_NAME = 'fieldFreqs.tsv'
# the frequency list is a result file: the GitHub link further down points into results/
FIELD_FREQ_FILE = f'{RESULT_DIR}/{FIELD_FREQ_NAME}'

for outDir in (TEMP_DIR, RESULT_DIR):
    os.makedirs(outDir, exist_ok=True)

If we generate large files, we do so in a temporary directory `_temp`, in the repository itself.
The repository lists `_temp` in its `.gitignore` file, so this output will not be sent to GitHub.

Result files will be generated in `results`, also inside the repo. These files will be sent to GitHub.

### Start Docker

Below are the magic commands to start and stop the relevant Docker container.
They are shell commands, not Python commands.

Run the next cell if you want to stop and remove a running `NARCIS` container:

In [3]:
!docker stop NARCIS
!docker rm NARCIS

NARCIS
NARCIS


In the next cell a Docker container is started.

In [4]:
!docker run --name NARCIS -v {DUMP_DIR}:/data/db -p 27019:27017 -d mongo --logpath /data/db/log.log

e6307c30296a19f7905dccb0ae0425d48588ac4ce08376f5516b9d491bac21e8


### Connect to MongoDB
Now there is a MongoDB instance behind port 27019 that we can connect to.

We make the connection and get some very basic statistics about the contents of the database.

In [5]:
client = MongoClient('mongodb://localhost:27019/')
client.list_database_names()  # database_names() was removed in PyMongo 4

['NARCIS', 'admin', 'config', 'local']

Navigate to the Dec2017 collection

In [6]:
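DBN = client.NARCIS
# list_collection_names() and count_documents() replace collection_names()
# and count(), which were removed in PyMongo 4
print(DBN.list_collection_names())
DBND = DBN.Dec2017
DBND.count_documents({})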

['Dec2017']


1622397

## Key value generation

We want to treat a document as a generator of key-value pairs, 
where keys are hierarchical keys and values are scalar values.
We convert scalar values to strings, and escape newlines.

We then write the key value pairs to file.
If we just write them naively as tuples `(_id, hkey, value)` where
`hkey` is the hierarchical key within the document, 
we end up with a file of 5GB.

In order to save space we leave out the `_id`, but instead we insert a blank line between docs.

We make a mapping between the doc-ids and positive numbers.

Furthermore, instead of writing out keys in full, we map them to positive numbers.

We do the same with values.

We save the mappings from numbers to document ids, keys and values as single column files.
The value on line *n* is the value to which number *n* is mapped.

In [7]:
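def kvPairs(key, val):
    # key is a tuple of path components leading to val
    if isinstance(val, list):
        # list members are addressed by their 1-based position
        for (i, v) in enumerate(val):
            yield from kvPairs(key + (str(i + 1),), v)
    elif isinstance(val, dict):
        for (k, v) in val.items():
            yield from kvPairs(key + (k,), v)
    else:
        # scalar: convert to string and escape newlines
        yield (key, str(val).replace('\n', '\\n'))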
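
To see what the flattening does, here is a small sketch that runs a copy of `kvPairs` (written with `yield from`, same behaviour) on a toy document. The document is hypothetical and not taken from NARCIS. Note that the empty list under `Keywords` produces no pair at all.

```python
def kvPairs(key, val):
    """Yield (key tuple, string value) for every scalar found inside val."""
    if type(val) is list:
        for (i, v) in enumerate(val):
            yield from kvPairs(key + (str(i + 1),), v)
    elif type(val) is dict:
        for (k, v) in val.items():
            yield from kvPairs(key + (k,), v)
    else:
        yield (key, str(val).replace('\n', '\\n'))


# Hypothetical toy document, for illustration only
doc = {
    'ID': ['rec-1'],
    'Keywords': [],
    'meta': {'title': ['line one\nline two'], 'year': 2017},
}

pairs = list(kvPairs((), doc))
for (key, value) in pairs:
    print('.'.join(key), '=>', value)
```

List members get their 1-based position as an extra key component, and the newline inside the title is written as the two characters `\n`.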

We exclude some information from the data: some top-level keys and their values can be ignored.

In [8]:
excludeKeys = set('''
    setSpec
'''.strip().split())

Here we apply the filtering.
We also treat the `_id` field in a special way.

In [9]:
def docPairs(pairs):
    docId = None
    filteredPairs = []
    for (key, value) in pairs:
        if key[0] in excludeKeys:
            pass
        elif key == ('_id',):
            docId = value
        else:
            hKey = '.'.join(key)
            filteredPairs.append((hKey, value))
    return (docId, filteredPairs)
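
As a quick check of the filtering, the sketch below feeds a copy of `docPairs` a handful of hand-made pairs in the shape that `kvPairs` produces. The pairs and the id are hypothetical.

```python
excludeKeys = {'setSpec'}


def docPairs(pairs):
    """Split off the document id and drop pairs under excluded top-level keys."""
    docId = None
    filteredPairs = []
    for (key, value) in pairs:
        if key[0] in excludeKeys:
            pass
        elif key == ('_id',):
            docId = value
        else:
            filteredPairs.append(('.'.join(key), value))
    return (docId, filteredPairs)


# Hypothetical input pairs, for illustration only
pairs = [
    (('_id',), 'example-id'),
    (('setSpec', '1'), 'dropped'),
    (('ID', '1'), 'rec-1'),
    (('meta', 'title', '1'), 'A title'),
]

(docId, filtered) = docPairs(pairs)
print(docId)
print(filtered)
```

The `_id` is returned separately, the `setSpec` pair disappears, and the remaining keys are joined with dots.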

This is the big loop where all documents are converted into key value pairs which get written to file.

It takes half an hour.

In [10]:
i = 0
j = 0
chunk = 100000
limit = -1
nkv = 0

keys = []
keysIndex = {}
values = []
valuesIndex = {}
docIds = []
fields = []

for (docNumber, doc) in enumerate(DBND.find()):
    j += 1
    i += 1
    (docId, pairs) = docPairs(kvPairs((), doc))
    docIds.append(docId)
    theFields = {}
    for (key, value) in pairs:
        keyNumber = keysIndex.get(key, None)
        if keyNumber is None:
            keyNumber = len(keys)
            keysIndex[key] = keyNumber
            keys.append(key)
        valNumber = valuesIndex.get(value, None)
        if valNumber is None:
            valNumber = len(values)
            valuesIndex[value] = valNumber
            values.append(value)
        theFields[keyNumber] = valNumber
        nkv += 1
    fields.append(theFields)
    if j == chunk:
        j = 0
        print(f'\t{i:>7} docs => {nkv:>9} pairs')
    if limit > 0 and i >= limit:
        break
print(f'\t{i:>7} docs => {nkv:9} pairs')

	 100000 docs =>   9366214 pairs
	 200000 docs =>  18518217 pairs
	 300000 docs =>  27784517 pairs
	 400000 docs =>  39483221 pairs
	 500000 docs =>  49967908 pairs
	 600000 docs =>  61319571 pairs
	 700000 docs =>  71707654 pairs
	 800000 docs =>  83571814 pairs
	 900000 docs =>  95045898 pairs
	1000000 docs => 109394801 pairs
	1100000 docs => 120065283 pairs
	1200000 docs => 129697097 pairs
	1300000 docs => 141362774 pairs
	1400000 docs => 152096820 pairs
	1500000 docs => 164118222 pairs
	1600000 docs => 177606033 pairs
	1622397 docs => 180762634 pairs
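
The numbering scheme of the loop above can be seen in miniature in the following sketch. The helper `numberOf` and the two toy documents are assumptions made for illustration; the logic mirrors the `keysIndex` / `valuesIndex` bookkeeping in the loop.

```python
def numberOf(item, items, index):
    """Return the number of item, handing out the next free number if it is new."""
    number = index.get(item)
    if number is None:
        number = len(items)
        index[item] = number
        items.append(item)
    return number


keys, keysIndex = [], {}
values, valuesIndex = [], {}
fields = []

# Hypothetical documents, already flattened to (key, value) pairs
docs = [
    [('ID.1', 'rec-1'), ('access.1', 'open')],
    [('ID.1', 'rec-2'), ('access.1', 'open')],
]

for pairs in docs:
    fields.append({
        numberOf(k, keys, keysIndex): numberOf(v, values, valuesIndex)
        for (k, v) in pairs
    })

print(keys)
print(values)
print(fields)

# Decoding: look the numbers up again in the two lists
decoded = [{keys[k]: values[v] for (k, v) in doc.items()} for doc in fields]
print(decoded)
```

Every distinct key and value string is stored once; the documents themselves shrink to small dictionaries of integers, which is where the space saving comes from.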


Write data to disk in a way that facilitates quick reuse: pickled and gzipped.

In [11]:
writeData(KEY_FILE, keys)

In [12]:
writeData(VAL_FILE, values)

In [13]:
writeData(DOC_FILE, docIds)

In [14]:
writeData(FIELDS_FILE, fields)
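
The helpers `writeData` and `readData` live in `utils` and are not shown in this notebook. A minimal sketch of what pickled and gzipped storage could look like follows; the names `writeDataSketch` and `readDataSketch` are hypothetical and the real helpers may differ.

```python
import gzip
import os
import pickle
import tempfile


def writeDataSketch(path, data):
    """Pickle data and write it gzip-compressed to path."""
    with gzip.open(path, 'wb') as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)


def readDataSketch(path):
    """Read back what writeDataSketch has written."""
    with gzip.open(path, 'rb') as fh:
        return pickle.load(fh)


# Round trip in a throw-away directory
with tempfile.TemporaryDirectory() as tmpDir:
    path = os.path.join(tmpDir, 'demo.tfx')
    original = [{0: 0, 1: 1}, {0: 2, 1: 1}]
    writeDataSketch(path, original)
    restored = readDataSketch(path)

print(restored == original)
```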

## Load data

Further processing can now start by reading the files just generated.

### Top level fields
We count how often each top-level field occurs.
We make separate counts for empty and non-empty values.

In [15]:
keys = readData(KEY_FILE)

In [16]:
values = readData(VAL_FILE)

In [17]:
docIds = readData(DOC_FILE)

In [18]:
fields = readData(FIELDS_FILE)

In [19]:
topKeys = getTopKeys(keys)
sorted(set(topKeys.values()))

['DAI',
 'GlobalIDs',
 'ID',
 'Journal',
 'Keywords',
 'NumberofIDs',
 'access',
 'date_harv',
 'date_header',
 'date_orig',
 'filenr',
 'nldidlnorm',
 'originURL']
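
`getTopKeys` also comes from `utils`. Judging by how its result is used here, it maps each key number to the first component of the hierarchical key. A possible implementation, under that assumption and with the hypothetical name `getTopKeysSketch`:

```python
def getTopKeysSketch(keys):
    """Map each key number to the top-level component of its dotted key."""
    return {number: key.split('.', 1)[0] for (number, key) in enumerate(keys)}


# A few keys in the dotted form used in this notebook
keys = [
    'ID.1',
    'nldidlnorm.1.Item.Descriptor.Statement.type.1',
    'DAI.1',
]

topKeys = getTopKeysSketch(keys)
topNames = sorted(set(topKeys.values()))
print(topNames)
```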

## Inventory of keys

The next step is to have a closer look at the top-level keys of all the documents.

In [20]:
allTopKeys = collections.Counter()

for pairs in fields:
    myTopKeys = set(topKeys[key] for key in pairs)
    # count every top-level key once per document
    allTopKeys.update(myTopKeys)
print(f'There are {len(allTopKeys)} top keys')

There are 13 top keys


Here are the keys and how often they occur:

In [21]:
for (key, amount) in sorted(allTopKeys.items(), key=lambda x: (-x[1], x[0])):
    print(f'{key:<20} {amount:>7}x')

GlobalIDs            1622397x
ID                   1622397x
NumberofIDs          1622397x
access               1622397x
date_harv            1622397x
date_header          1622397x
date_orig            1622397x
filenr               1622397x
nldidlnorm           1622397x
originURL            1622397x
Journal              1124764x
DAI                  1003441x
Keywords              746235x


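Ten of the thirteen top-level keys occur in every document.
The other three, `Journal`, `DAI` and `Keywords`, occur in roughly 69%, 62% and 46% of the documents.
A key is only counted for a document when it leads to at least one scalar value there:
`kvPairs` yields nothing for an empty list or dictionary.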
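
The counts printed above can be turned into shares of the whole collection with a few lines. The numbers are copied from the output of the previous cell.

```python
total = 1622397  # number of documents in Dec2017

# Top-level keys that do not occur in every document
partial = {'Journal': 1124764, 'DAI': 1003441, 'Keywords': 746235}

shares = {key: round(100 * n / total, 1) for (key, n) in partial.items()}
for (key, share) in shares.items():
    print(f'{key:<10} {share:>5.1f}%')
```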

### Nested keys

We want to explore a distribution of all keys, also the ones that occur
(deeply) nested in documents.

In [22]:
nestedKeys = collections.Counter()

for pairs in fields:
    for key in pairs:
        nestedKeys[key] += 1
print(f'{len(nestedKeys)} distinct keys encountered')

1569 distinct keys encountered


We list the keys in order of frequency, most frequent ones first.
The result is written to `fieldFreqs.tsv`.

In [23]:
with open(FIELD_FREQ_FILE, 'w') as fh:
    for (k, n) in sorted(nestedKeys.items(), key=lambda x: (-x[1], x[0])):
        fh.write(f'{n:>7}\t{keys[k]}\n')

See them on GitHub: [nested keys overview](https://github.com/Dans-labs/narcis-explore/blob/master/results/fieldFreqs.tsv)

In [24]:
!head -20 {FIELD_FREQ_FILE}

1622397	ID.1
1622397	filenr.1
1622397	date_header.1
1622397	date_harv.1
1622397	date_orig.1
1622397	GlobalIDs.1.1
1622397	NumberofIDs.1
1622397	originURL.1
1622397	access.1
1622397	nldidlnorm.1.Descriptor.Statement.Identifier.1
1622397	nldidlnorm.1.Descriptor.Statement._attrs.1
1622397	nldidlnorm.1.Descriptor1.Statement.modified.1
1622397	nldidlnorm.1.Descriptor1.Statement._attrs.1
1622397	nldidlnorm.1.Component.Resource.1
1622397	nldidlnorm.1.Component.Resource.2
1622397	nldidlnorm.1.Item.Descriptor.Statement.type.1
1622397	nldidlnorm.1.Item.Descriptor.Statement._attrs.1
1622397	nldidlnorm.1.Item.Component.Resource.mods.name.role.roleTerm.text.1
1622397	nldidlnorm.1.Item.Component.Resource.mods.name.role.roleTerm._attrs.1
1622397	nldidlnorm.1.Item.Component.Resource.mods.titleInfo.title.1


In [25]:
!tail -20 {FIELD_FREQ_FILE}

      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name7.namePart.text.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name7.namePart._attrs.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name7.namePart1.text.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name7.namePart1._attrs.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name7._attrs.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name8.namePart.text.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name8.namePart._attrs.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name8.namePart1.text.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name8.namePart1._attrs.1
      1	nldidlnorm.1.Item.Component.Resource.mods.relatedItem.name8._attrs.1
      1	nldidlnorm.1.Item.Component.Resource.mods.name6.affiliation9.1
      1	nldidlnorm.1.Item.Component.Resource.mods.name6.affiliation0.1
      1	

## Results
The products of this notebook are the files

* `fields.tfx` all key value pairs of all documents,
  where keys and values are replaced by their numerical index
* `docId.tfx` single column file containing all document ids in order of their index number
* `keys.tfx` single column file containing all keys in order of their index number
* `values.tfx` single column file containing all values in order of their index number

In other notebooks we will use these files, and conduct further explorations.

## Appendix

The rest of this notebook is an alternative approach that we will not build on.

### Use *variety*

Normally, you can speed up data processing by having the database use its query engine to its full potential.
Using `mapReduce` inside MongoDB as indicated on
[Stack Overflow](https://stackoverflow.com/questions/2298870/mongodb-get-names-of-all-keys-in-collection)
could do the trick. But the accepted answer there does not handle nested keys.
So we have to write a MongoDB function for mapReduce, which means writing something in JavaScript.

It turns out that somebody has already done this, and made a nice library out of it:
[variety](https://github.com/variety/variety). 
This is a tool to explore the dominant schema of a MongoDB collection and its outlier documents.

However, the performance is much worse than that of the previous method:
going this way takes a whopping 80 minutes, against half an hour for the loop above.
We show how to do it and record the results, but we do not recommend it.

If you want to try it yourself, there is extra preparation to do.

### Preparation

You must have *MongoDB* and *Node* installed.

Install *variety*:

```
npm install variety-cli -g
```

### Run *variety*
This is a JavaScript program; we need Node to run it.
The next cell takes well over an hour!

In [None]:
!variety --port=27019 NARCIS/Dec2017

### Results

We have copied the output into the file `fields-variety.txt`, so it does not get inadvertently lost.
After saving the output, we discarded it from the cell.

The results are pleasing to the eye (if your screen is wide enough) but
they lend themselves less well to further processing.

In [64]:
!head -20 {FIELDS_VAR_FILE}

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| key                                                                                       | types                           | occurrences | percents                 |
| ----------------------------------------------------------------------------------------- | ------------------------------- | ----------- | ------------------------ |
| DAI                                                                                       | Array                           |     1622397 | 100.00000000000000000000 |
| GlobalIDs                                                                                 | Array                           |     1622397 | 100.00000000000000000000 |
| ID                                                                                        | Array                           |     1622397 | 100.0000

In [65]:
!tail -40 {FIELDS_VAR_FILE}

| nldidlnorm.XX.Item.Component.Resource.mods.relatedItem4.titleInfo.title                   | Array                           |           1 |   0.00006163719484195299 |
| nldidlnorm.XX.Item.Component.Resource.mods.relatedItem5                                   | Object                          |           1 |   0.00006163719484195299 |
| nldidlnorm.XX.Item.Component.Resource.mods.relatedItem5._attrs                            | Array                           |           1 |   0.00006163719484195299 |
| nldidlnorm.XX.Item.Component.Resource.mods.relatedItem5.titleInfo                         | Object                          |           1 |   0.00006163719484195299 |
| nldidlnorm.XX.Item.Component.Resource.mods.relatedItem5.titleInfo.title                   | Array                           |           1 |   0.00006163719484195299 |
| nldidlnorm.XX.Item.Component.Resource.mods.subject0.topic0                                | Array                           |           1 |   0.0000