# DBLP dataset
---

*Source: https://www.aminer.org/citation*

We use the `V13` dataset. 
> The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources.

We build the citation graph, where each node represents a paper and a directed link between two nodes represents a citation between the corresponding papers. Moreover, we use the words in the abstract of each paper as a vector of binary attributes for the corresponding node.

**Libraries**

In [1]:
from IPython.display import SVG

from collections import defaultdict, Counter
import ijson
import json
import numpy as np
import pickle
from scipy import sparse
from tqdm import tqdm

import nltk

from sknetwork.data import from_edge_list, Bunch, save
from sknetwork.topology import get_connected_components
from sknetwork.visualization import svg_graph

## 1. Load data

* `dblpv13.json` needs preprocessing before it can be parsed. We use `sed 's/NumberInt([0-9]*)/"ok"/g' dblpv13.json > dblpv14.json` to replace the `NumberInt(...)` values, which are not valid JSON and make the parser fail.
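
The same substitution can be sketched in Python, which avoids shell quoting issues. This is a minimal illustration on a made-up record, not the command used to produce `dblpv14.json`; the helper name `replace_number_int` and the sample fields are hypothetical.

```python
import json
import re

# Matches the NumberInt(...) wrapper, which is not valid JSON.
NUMBER_INT = re.compile(r'NumberInt\([0-9]*\)')


def replace_number_int(line, placeholder='"ok"'):
    """Replace every NumberInt(...) occurrence with a JSON-safe placeholder."""
    return NUMBER_INT.sub(placeholder, line)


# Hypothetical record fragment, for illustration only.
sample = '{"year": NumberInt(2015), "n_citation": NumberInt(3)}'
cleaned = replace_number_int(sample)
record = json.loads(cleaned)  # parses once the wrappers are gone
```

As with the `sed` command, the integer values themselves are discarded; a capture group could keep them if they were needed later.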

In [2]:
path = '/Users/simondelarue/Downloads/dblpv14.json'

In [5]:
%%time

categories = ['computer science']#, 'pattern recognition']
articles = {}
cpt = 0

with open(path, 'rb') as f:
    for record in tqdm(ijson.items(f, 'item')):
        # Add articles only if they have abstract + title + references + belong to defined category
        required = ('fos', 'abstract', 'title', 'references')
        if all(key in record for key in required) and len(record['title']) > 0 and len(record['abstract']) > 0:
            for cat in record['fos']:
                if cat.lower() in categories:
                    articles[record['_id']] = {'name': record['title'],
                                      'abstract': record['abstract'],
                                      'categories': record['fos'],
                                      'references': record['references']}
                    cpt += 1
                    break

5354309it [20:39, 4318.03it/s] 

CPU times: user 20min 11s, sys: 25 s, total: 20min 36s
Wall time: 20min 40s





In [6]:
%%time
with open('data/dblpv13_raw.json', 'w') as f:
    json.dump(articles, f)

CPU times: user 26.8 s, sys: 4.28 s, total: 31.1 s
Wall time: 32.2 s


In [7]:
print(f'Number of raw articles: {len(articles)}')

Number of raw articles: 2121958


For each article, we go through its references and check that each one is valid, i.e. that it belongs to the registered articles. References that are not valid (because they are not in the `Computer Science` field, or because their title or abstract is empty) are removed from the list of references. We end up with a dictionary of articles that:
* belong to the `Computer Science` field
* have a non-empty title
* have a non-empty abstract
* only have references that satisfy the three previous criteria
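
The filtering rule above can be illustrated on a toy dictionary. The ids and titles below are made up, and `toy_articles` is only a stand-in for the real `articles` dictionary.

```python
# Toy stand-in for the `articles` dictionary; ids and titles are made up.
toy_articles = {
    'a': {'name': 'Paper A', 'references': ['b', 'x', 'c']},
    'b': {'name': 'Paper B', 'references': ['y']},
    'c': {'name': 'Paper C', 'references': []},
}

# Keep a reference only if it is itself a registered article.
for values in toy_articles.values():
    values['references'] = [ref for ref in values['references'] if ref in toy_articles]

kept = {key: values['references'] for key, values in toy_articles.items()}
```

Here `x` and `y` are dropped because they are not keys of `toy_articles`, so every remaining reference points to a registered paper.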

In [67]:
%%time

# Remove non-valid articles from references
for article, values in tqdm(articles.items()):
    # Keep only the references that are themselves registered articles
    values['references'] = [ref for ref in values['references'] if ref in articles]

100%|███████████████████████████████████████████████████████| 2121958/2121958 [00:13<00:00, 152128.46it/s]

CPU times: user 10.9 s, sys: 2.39 s, total: 13.3 s
Wall time: 14 s





In [73]:
%%time
with open('data/dblpv13.json', 'w') as f:
    json.dump(articles, f)

CPU times: user 24.4 s, sys: 4.11 s, total: 28.5 s
Wall time: 30 s


In [2]:
%%time
with open('data/dblpv13.json', 'r') as f:
    articles = json.load(f)

CPU times: user 13.2 s, sys: 23.4 s, total: 36.6 s
Wall time: 51.2 s


## Build graph dataset  

### Adjacency matrix 

We build the edge list and pass it to the `from_edge_list` function of scikit-network.
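
To make the target structure concrete, here is a minimal sketch of how an edge list of paper ids maps to a sparse adjacency matrix. It uses SciPy directly on made-up ids and does not reproduce scikit-network's implementation; in particular, how `from_edge_list` handles edge direction is worth checking in the scikit-network documentation.

```python
import numpy as np
from scipy import sparse

# Toy edge list of (citing paper, cited paper) pairs; ids are made up.
toy_edges = [('p1', 'p2'), ('p1', 'p3'), ('p3', 'p2')]

# Map each paper id to a row/column index, in sorted order.
toy_names = sorted({node for edge in toy_edges for node in edge})
index = {name: i for i, name in enumerate(toy_names)}

rows = np.array([index[src] for src, _ in toy_edges])
cols = np.array([index[dst] for _, dst in toy_edges])
data = np.ones(len(toy_edges), dtype=np.int64)

# Entry (i, j) is 1 when paper i cites paper j.
n = len(toy_names)
toy_adjacency = sparse.csr_matrix((data, (rows, cols)), shape=(n, n))
```

Row `i` of `toy_adjacency` lists the papers cited by `toy_names[i]`, and the matrix is not symmetric because citations are directed.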

In [88]:
%%time
edges = []
for article, values in articles.items():
    for ref in values.get('references'):
        edges.append((article, ref))

CPU times: user 3.06 s, sys: 2.91 s, total: 5.97 s
Wall time: 9.52 s


In [89]:
%%time
adjacency = from_edge_list(edges)

CPU times: user 32.4 s, sys: 28.1 s, total: 1min
Wall time: 1min 16s


In [90]:
adjacency

{'names': array(['53e99784b7602d9701f3e15d', '53e99784b7602d9701f3f411',
        '53e99784b7602d9701f3f5fe', ..., '60754485e4510cd7c836e05d',
        '60754495e4510cd7c83703fa', '607544aae4510cd7c8371f24'],
       dtype='<U24'),
 'adjacency': <2058901x2058901 sparse matrix of type '<class 'numpy.int64'>'
 	with 39034212 stored elements in Compressed Sparse Row format>}

In [91]:
# Save adjacency matrix
dataset = Bunch()
dataset.adjacency = adjacency.adjacency
dataset.names = adjacency.names

meta = Bunch()
meta.name = "DBLP articles V13"
meta.description = 'Links connecting DBLP articles in `Computer Science` if one article cites the other (directed)'
meta.source = 'https://www.aminer.org/citation'
meta.date = 'November 2022'
dataset.meta = meta

In [92]:
with open('data/DBLP', 'bw') as f:
    pickle.dump(dataset, f)

In [3]:
%%time
# Load dataset
with open('data/DBLP', 'br') as f:
    dataset = pickle.load(f)

CPU times: user 637 µs, sys: 126 ms, total: 127 ms
Wall time: 129 ms


In [4]:
print(f'Number of nodes: {dataset.adjacency.shape[0]}')
print(f'Number of edges: {dataset.adjacency.nnz}')

Number of nodes: 2058901
Number of edges: 39034212


In [5]:
dataset.adjacency.shape

(2058901, 2058901)

In [6]:
len(dataset.names)

2058901

### Biadjacency matrix

In [7]:
import re

In [8]:
articles.get('53e99cf5b7602d97025ace63').get('abstract')

'We describe a new approach to the visual recognition of cursive handwriting. An effort is made to attain human- like performance by using a method based on pictorial alignment and on a model of the process of handwriting. The alignment approach permits recognition of character instances that appear embedded in connected strings. A system embodying this approach has been implemented and tested on five different word sets. The performance was stable both across words and across writers. The system exhibited a substantial ability to interpret cursive connected strings without recourse to lexical knowledge. The interpretation of cursive connected handwriting is considerably more difficult than the reading of printed text. This difficulty may be the reason for the relative lack of attention to the problem of reading cursive script within the field of computational vision. The present article describes progress made toward understanding and solving this problem. We identify and discuss two 

In [9]:
def tokenize_text(text):
    """ Merge words that have been cut + remove punctuation + tokenize """
    res = re.sub(r'- ', '', text)
    res_without_punc = re.sub(r'[^\w\s]', '', res)
    tokens = nltk.word_tokenize(res_without_punc)
    return tokens
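
The effect of the two regular expressions in `tokenize_text` can be checked on a short sentence. The sketch below is an illustration only: `normalize_text` is a hypothetical helper, and a plain whitespace split stands in for `nltk.word_tokenize`, so the resulting tokens may differ from NLTK's on other inputs.

```python
import re


def normalize_text(text):
    """Apply the two regex steps of `tokenize_text`, without tokenizing."""
    merged = re.sub(r'- ', '', text)       # 'human- like' -> 'humanlike'
    return re.sub(r'[^\w\s]', '', merged)  # drop punctuation


sample = 'An effort is made to attain human- like performance.'
normalized = normalize_text(sample)
# Whitespace split is a simplified stand-in for nltk.word_tokenize.
tokens = normalized.split()

# Hyphens that are not followed by a space are removed as punctuation.
compound = normalize_text('state-of-the-art')
```

Note that tokens keep their original case, so `As` and `as` end up as two distinct words in the vocabulary.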

In [24]:
%%time

# Create indexed list of words from abstracts
# Result is a list of lists containing word indexes for each abstract
words2idx = {}
idx = 0
abstracts_tokens = []
n = dataset.adjacency.shape[0]

for i in range(n):
    article_id = dataset.names[i]
    abstract = articles.get(article_id).get('abstract')
    tokens = tokenize_text(abstract)
    abstract_tokens = []
    for tk in tokens:
        if tk not in words2idx:
            words2idx[tk] = idx
            idx += 1
        abstract_tokens.append(words2idx.get(tk))
    abstracts_tokens.append(abstract_tokens)

CPU times: user 10min 24s, sys: 15.7 s, total: 10min 40s
Wall time: 10min 49s


In [25]:
print(len(words2idx))
print(len(abstracts_tokens))
print(dataset.adjacency.shape)

1821694
2058901
(2058901, 2058901)


In [26]:
%%time
idx2words = {v: k for k, v in words2idx.items()}

CPU times: user 89.3 ms, sys: 151 ms, total: 241 ms
Wall time: 313 ms


In [39]:
%%time

# Initialize biadjacency matrix
x, y = dataset.adjacency.shape
biadjacency = sparse.lil_matrix((x, len(words2idx)), dtype=np.int16)
biadjacency.shape

CPU times: user 7.35 s, sys: 29.6 s, total: 37 s
Wall time: 1min 1s


(2058901, 1821694)

In [29]:
%%time
biadjacency_cp = biadjacency.copy()

CPU times: user 4.5 s, sys: 18.8 s, total: 23.3 s
Wall time: 37.7 s


**Fill biadjacency matrix**

In [40]:
%%time

for r in tqdm(range(dataset.adjacency.shape[0])):
    
    # Extract word counts for each abstract
    abst = abstracts_tokens[r]
    if len(abst) > 0:
        cnt_abstract_tokens = Counter(abstracts_tokens[r])
        dst, cnt = zip(*cnt_abstract_tokens.items())
        l_dst, l_cnt = list(dst), list(cnt)

        # Fill node features with word counts (BoW)
        biadjacency[r, l_dst] = l_cnt

100%|████████████████████████████████████████████████████████| 2058901/2058901 [01:42<00:00, 20171.46it/s]

CPU times: user 1min 36s, sys: 2.69 s, total: 1min 39s
Wall time: 1min 42s
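
One possible alternative to the row-by-row assignment above is to collect `(row, column, count)` triplets and build the matrix in a single call. The sketch below runs on toy data and has not been benchmarked on the DBLP data; `toy_abstracts_tokens` and `n_words` are made-up stand-ins for `abstracts_tokens` and `len(words2idx)`.

```python
from collections import Counter

import numpy as np
from scipy import sparse

# Toy stand-in for `abstracts_tokens`: one list of word indices per abstract.
toy_abstracts_tokens = [[0, 1, 1, 2], [], [2, 3, 2, 2]]
n_words = 4

rows, cols, counts = [], [], []
for r, token_ids in enumerate(toy_abstracts_tokens):
    # One triplet per distinct word of abstract r; empty abstracts add nothing.
    for word_id, count in Counter(token_ids).items():
        rows.append(r)
        cols.append(word_id)
        counts.append(count)

# Build in COO format, then convert to CSR for row slicing.
toy_biadjacency = sparse.coo_matrix(
    (np.array(counts, dtype=np.int16), (rows, cols)),
    shape=(len(toy_abstracts_tokens), n_words),
).tocsr()
```

The result is directly in CSR format, so no separate conversion from the List of Lists format is needed afterwards.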





In [41]:
biadjacency.shape

(2058901, 1821694)
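
The fill loop above stores word counts, whereas the introduction describes binary attributes. If a binary matrix is wanted, one way to derive it is to keep the sparsity pattern and set every stored value to 1. This is a sketch on a toy matrix, not a step taken in this notebook; `toy_counts` is a made-up stand-in for `biadjacency` after conversion to CSR.

```python
import numpy as np
from scipy import sparse

# Toy count matrix standing in for `biadjacency` (rows: papers, columns: words).
toy_counts = sparse.csr_matrix(
    np.array([[1, 2, 0], [0, 0, 0], [0, 3, 1]], dtype=np.int16)
)

# Copy the sparsity pattern and set every stored count to 1.
toy_binary = toy_counts.copy()
toy_binary.data = np.ones_like(toy_binary.data)
```

Working on a copy leaves the count matrix untouched, so both versions stay available.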

In [27]:
# Count nodes whose abstract produced no tokens
res = 0
for val in abstracts_tokens:
    if len(val) == 0:
        res += 1
res

113

In [32]:
biadjacency = biadjacency_cp.tocsr()

In [42]:
biadjacency[0, :]

<1x1821694 sparse matrix of type '<class 'numpy.int16'>'
	with 96 stored elements in List of Lists format>

In [44]:
biadjacency_cp.nnz

190753627

In [37]:
len(Counter(abstracts_tokens[0]))

96

In [40]:
for k, v in idx2words.items():
    print(k, v)
    break

0 As


In [49]:
len(dblp_cs.get('names_col'))

1821694

In [46]:
dblp_cs['biadjacency'] = biadjacency

In [47]:
dblp_cs

{'adjacency': <2058901x2058901 sparse matrix of type '<class 'numpy.int64'>'
 	with 39034212 stored elements in Compressed Sparse Row format>,
 'names': array(['53e99784b7602d9701f3e15d', '53e99784b7602d9701f3f411',
        '53e99784b7602d9701f3f5fe', ..., '60754485e4510cd7c836e05d',
        '60754495e4510cd7c83703fa', '607544aae4510cd7c8371f24'],
       dtype='<U24'),
 'meta': {'name': 'DBLP articles V13',
  'description': 'Links connecting DBLP articles in `Computer Science` if one article cites the other (directed)',
  'source': 'https://www.aminer.org/citation',
  'date': 'November 2022'},
 'biadjacency': <2058901x1821694 sparse matrix of type '<class 'numpy.int16'>'
 	with 190753627 stored elements in List of Lists format>,
 'names_col': array(['As', 'process', 'variations', ..., 'preconditionerrnfor',
        'partitioningrnand', 'SIMPLEtype'], dtype='<U506')}

In [31]:
dataset.names_col = words

In [32]:
dataset

{'adjacency': <2058901x2058901 sparse matrix of type '<class 'numpy.int64'>'
 	with 39034212 stored elements in Compressed Sparse Row format>,
 'names': array(['53e99784b7602d9701f3e15d', '53e99784b7602d9701f3f411',
        '53e99784b7602d9701f3f5fe', ..., '60754485e4510cd7c836e05d',
        '60754495e4510cd7c83703fa', '607544aae4510cd7c8371f24'],
       dtype='<U24'),
 'meta': {'name': 'DBLP articles V13',
  'description': 'Links connecting DBLP articles in `Computer Science` if one article cites the other (directed)',
  'source': 'https://www.aminer.org/citation',
  'date': 'November 2022'},
 'names_col': array(['As', 'process', 'variations', ..., 'preconditionerrnfor',
        'partitioningrnand', 'SIMPLEtype'], dtype='<U506')}

In [29]:
%%time
# Words
words = np.array(list(idx2words.values()))

CPU times: user 392 ms, sys: 326 ms, total: 717 ms
Wall time: 797 ms


In [23]:
words[dblp.biadjacency[0, :].indices]

NameError: name 'words' is not defined

In [45]:
dataset.biadjacency = biadjacency

In [46]:
dataset.keys()

dict_keys(['adjacency', 'names', 'meta', 'biadjacency'])

In [15]:
%%time
# Load dataset
with open('data/DBLP_biadj', 'br') as f:
    dblp = pickle.load(f)

CPU times: user 1.87 ms, sys: 425 ms, total: 427 ms
Wall time: 614 ms


In [33]:
dblp.keys()

dict_keys(['adjacency', 'names', 'meta', 'biadjacency'])

In [34]:
dblp_cs = dblp.copy()

In [36]:
dblp_cs.names_col = words

AttributeError: 'dict' object has no attribute 'names_col'

In [37]:
dblp_cs['names_col'] = words

In [38]:
dblp_cs

{'adjacency': <2058901x2058901 sparse matrix of type '<class 'numpy.int64'>'
 	with 39034212 stored elements in Compressed Sparse Row format>,
 'names': array(['53e99784b7602d9701f3e15d', '53e99784b7602d9701f3f411',
        '53e99784b7602d9701f3f5fe', ..., '60754485e4510cd7c836e05d',
        '60754495e4510cd7c83703fa', '607544aae4510cd7c8371f24'],
       dtype='<U24'),
 'meta': {'name': 'DBLP articles V13',
  'description': 'Links connecting DBLP articles in `Computer Science` if one article cites the other (directed)',
  'source': 'https://www.aminer.org/citation',
  'date': 'November 2022'},
 'biadjacency': <2058901x2058901 sparse matrix of type '<class 'numpy.int16'>'
 	with 190753627 stored elements in Compressed Sparse Row format>,
 'names_col': array(['As', 'process', 'variations', ..., 'preconditionerrnfor',
        'partitioningrnand', 'SIMPLEtype'], dtype='<U506')}

## Save data

In [None]:
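dataset = Bunch()
dataset.adjacency = adjacency.adjacency  # `adjacency` is the Bunch returned by from_edge_list
dataset.biadjacency = biadjacency
dataset.names = adjacency.names
dataset.names_col = words  # array of words, indexed like the biadjacency columns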

meta = Bunch()
meta.name = "DBLP_cs articles V13"
meta.description = 'Links connecting DBLP articles in `Computer Science` if one article cites the other (directed)'
meta.source = 'https://www.aminer.org/citation'
meta.date = 'November 2022'
dataset.meta = meta

In [47]:
%%time
with open('data/DBLP_cs', 'bw') as f:
    pickle.dump(dataset, f)

CPU times: user 98.7 ms, sys: 1.49 s, total: 1.59 s
Wall time: 3.89 s


In [50]:
%%time
with open('data/DBLP_cs', 'bw') as f:
    pickle.dump(dblp_cs, f)

CPU times: user 4.19 s, sys: 5.37 s, total: 9.56 s
Wall time: 13.8 s
