# OEIS Analysis: Graph of Authors from Comments
## Author: Paula Mihalcea
#### Università degli Studi di Firenze

## Contents

1. [Introduction](#introduction)
2. [Requirements](#requirements)
3. [Dataset](#dataset)
4. [Author parsing](#author-parsing)
5. [Graph building](#graph-building)
6. [Maximal cliques](#maximal-cliques)
    - [Definitions](#definitions)
    - [The Bron-Kerbosch algorithm](#the-bron-kerbosch-algorithm)
        - [Bron-Kerbosch with Tomita pivoting](#bron-kerbosch-with-tomita-pivoting)
        - [Complexity](#complexity)
7. [Python maximal cliques algorithms](#python-maximal-cliques-algorithms)
8. [Testing](#testing)
9. [References](#references)
10. [License](#license)

## Introduction

**[OEIS](https://oeis.org/)** is the online encyclopaedia of **integer sequences**. It lists *over 340,000* number sequences in lexicographic order, such as the [prime numbers](http://oeis.org/A000040) or the [Fibonacci sequence](http://oeis.org/A000045), and has been easing the work of countless researchers since 1964, its foundation year.

The OEIS is made of a series of **JSON files**, one for each integer sequence. Given their regular, human-readable format, these files can be easily manipulated in order to analyze many of their aspects further. Indeed, each page of the OEIS lists not only the integers of the corresponding sequence, but also additional information such as formulas, references, links and comments.

This work aims to create, step-by-step, a **[Python 3](https://www.python.org/)** script capable of loading these files and parsing their content in order to build a **graph** where:
- **nodes** represent all unique **authors** that can be found in each comment of every sequence;
- **edges** link two authors who have **commented on the same sequence**.

Three algorithms are then implemented in order to find:
- a maximal clique;
- a list of all maximal cliques;
- the maximum clique.

The library of choice for creating the graph is **[NetworkX](https://networkx.org/)**, a fast Python module for the creation, manipulation, and study of the structure of complex networks. Other data science packages such as [NumPy](https://numpy.org/) and [Matplotlib](https://matplotlib.org/) are also used, together with modules from the Python standard library such as [itertools](https://docs.python.org/3/library/itertools.html), [os](https://docs.python.org/3/library/os.html), [random](https://docs.python.org/3/library/random.html), and [sys](https://docs.python.org/3/library/sys.html).

## Requirements

Before starting, a series of packages must be installed in order for the subsequent code to be executable. The simplest way is to use [`pip`](https://pypi.org/project/pip/), a package manager for Python callable from the system terminal.

The commands needed for this operation are listed in the following cell; the Jupyter magic function [`%%cmd`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-bash) (`%%bash` for Unix users) at the beginning allows the cell to be used as a terminal. Make sure to follow the recommended install order, as it helps avoid errors that are sometimes caused by mismatched package versions.

In [None]:
%%cmd

pip install numpy
pip install networkx
pip install matplotlib
pip install tqdm
# TODO

The freshly installed modules can now be used by simply importing them:

In [None]:
import itertools as its
import json
import networkx as nx
import numpy as np
import os
import re
import sys
import timeit
import tqdm
# TODO add other imports

## Dataset

Having installed the required packages, we can now proceed with analyzing the dataset.

The raw OEIS sequence files can be found in [`data/sequences`](./data/sequences/). We can start by writing a function capable of opening one of them using the [JSON package](https://docs.python.org/3/library/json.html) available in Python, and use it to load a file's content as a Python [dict](https://docs.python.org/3/library/stdtypes.html#mapping-types-dict), then print it:

In [None]:
def load_json(file_path, print_content=False):
    try:
        file = open(file_path, 'r')
    except OSError:
        print('Could not open file: ' + file_path + ', exiting program.')
        sys.exit()
    with file:
        raw_data = json.load(file)
        if print_content:
            print('File ' + file_path.split('/')[-1] + ' contents:')
            print()
            print(json.dumps(raw_data, indent=True))
            print()
            print('The \'json\' Python module returns a dictionary, which can be confirmed by invoking the \'type\' function on the loaded data: ' + str(type(raw_data)) + '.')
            print('This dictionary\'s keys are: ' + str(raw_data.keys()).replace('dict_keys([', '').replace('])', '') + '.')
        return raw_data

Note that this function correctly **handles input/output errors**, and can be used to **return a file's content** as a Python **dictionary** even without printing it, by either omitting the `print_content` argument or setting it to `False` - a feature which will soon come in handy.

We can thus view the first JSON file and its keys:

In [None]:
file = load_json('data/sequences/A000001.json', True)

As mentioned before, each sequence file contains additional information, specifically:
- a simple `greeting`;
- a `query`, containing the sequence's ID;
- `count`;
- `start`;
- `results`, which contains a list with another dictionary as its first element.

It can be seen from this file's content that the most relevant information is actually found in the **`results` sub-dictionary**, which can be easily accessed with:

In [None]:
results = file.get('results')

if results:
    print(json.dumps(results[0], indent=True))

Again, there are many different keys, among which we can find the one which is relevant to this project: the `comment` key containing a list of **comments** with their **authors**:

In [None]:
comment_list = results[0].get('comment')

if comment_list:
    print(json.dumps(comment_list, indent=True))
else:
    print('No "comments" subsection found.' + '\n')
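The nesting described above can be sketched on a small stand-in dictionary. The top-level keys are the ones listed in the text; all values are invented for illustration, and the helper `get_comments()` is a hypothetical name, not part of the project code.

```python
import json

# Hypothetical, heavily trimmed stand-in for one sequence file.
# Only the keys named in the text are reproduced; the values are invented.
sample_text = '''
{
  "greeting": "hello",
  "query": "id:A000001",
  "count": 1,
  "start": 0,
  "results": [
    {"comment": ["First remark. - _Author One_", "Second remark. - _Author Two_"]}
  ]
}
'''


def get_comments(raw_data):
    """Return the comments of the first result, or an empty list if there are none."""
    results = raw_data.get('results') or []
    if not results:
        return []
    return results[0].get('comment') or []


raw_data = json.loads(sample_text)
comments = get_comments(raw_data)
print(len(comments))
print(get_comments({'results': None}))
```

Returning an empty list in both failure cases (no results, no `comment` key) lets the caller loop over the result without extra checks.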

## Author parsing

Now that we know where to find the authors' names, we can proceed with building a function to parse all of them from a given file.

### Regular expressions
The most efficient way of doing this is to use a **regular expression** (also known as *regex*), a sequence of characters that specifies a *search pattern*\[[1](https://en.wikipedia.org/wiki/Regular_expression)\].

We must first identify the ways in which the names have been written; by analyzing some comments, **six main patterns** have been identified, along with the **four regular expressions** needed to match them:
1. *"\_Name Surname\_"* `(?<=_)[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?(?=_)`
2. *"\[Name Surname\]"* and *"\[Surnamea, Surnameb\]"* `(?<=\[)[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?(?=\])`
3. *"- Name, Surname ( "* and *"\- Name Surname, "* `(?<=- )[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?(?= \(|, )`
4. *"(Name Surname,"* `(?<=\()[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?(?=,)`

In spite of their apparent complexity, the meaning of these patterns is quite simple and can be easily debugged with tools like [Regex101](https://regex101.com/). Each of them matches only strings that:
- begin with certain characters `_`, `[`, `- `, `(`,
    - followed by a capital letter `[A-Z]`,
        - not followed by another capital letter `(?![A-Z])`,
    - followed by at least two more characters `{2,}?`
        - on the condition that none of them belongs to a list of forbidden symbols `[^0-9+\(\)\[\]\{\}\\\/_:;""]` (where `^` acts as a negation operator),
- end with certain characters `_`, `]`, ` (` or `, `, `,`.

`(?<=)` and `(?=)` indicate that the matched strings should be preceded or followed (respectively) by the character(s) to the right of the `=` symbol, while `(?!)` indicates that the match must *not* be followed by what comes after the `!` symbol.

Escaping certain characters distinguishes them from a regex special symbol (e.g. `\(\)` matches the string *()*, while `()` is an empty regex group); whitespaces are simply represented by... a whitespace (` `).

By combining these four expressions with the OR character (`|`) we can create the following regular expression to match all six patterns at once in Python:

`(?<=_)[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?(?=_)|(?<=\[)[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?(?=\])|(?<=- )[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?(?= \(|, )|(?<=\()[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?(?=,)`
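As a quick check of how the combined expression behaves, the sketch below runs it on three invented comment strings, one for each of the most common delimiters. It assumes the negative lookahead is written as `(?![A-Z])`, the form that corresponds to "not followed by another capital letter"; the sample comments and the helper `extract_names()` are hypothetical.

```python
import re

# Shared core: a capital letter not followed by another capital letter,
# then at least two characters outside the forbidden class (lazy match).
CORE = r'[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?'

# (lookbehind, lookahead) pairs for the four delimiter styles.
DELIMITERS = [
    (r'(?<=_)', r'(?=_)'),
    (r'(?<=\[)', r'(?=\])'),
    (r'(?<=- )', r'(?= \(|, )'),
    (r'(?<=\()', r'(?=,)'),
]
PATTERN = re.compile('|'.join(start + CORE + end for start, end in DELIMITERS))


def extract_names(comment):
    """Return the sorted names found in one comment, splitting 'A, B' matches."""
    return sorted(n for match in PATTERN.findall(comment) for n in match.split(', '))


# Invented comments, used only to exercise the pattern.
print(extract_names('Number of groups of order n. - _N. J. A. Sloane_, Jan 01 2000'))
print(extract_names('See [Conway, Sloane] for a proof.'))
print(extract_names('a(n) is even for n > 2. - John Smith, May 05 2010'))
```

The second string shows why a match may need to be split on `', '`: the bracket pattern returns both surnames as a single string.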

#### About this method's completeness
It should be noted that these expressions do not find all the authors present in the comments, because names are not written consistently across all sequences. One might argue that it would be sufficient to find every pattern in use in order to get all the names; even if this were feasible (we do not know how many patterns there are), the problem would remain, because certain patterns also match formulas and other unrelated data, making them unusable for retrieving only names.

The definitive solution would be either to collect the names manually, or to allow the matching of extraneous data and remove it from the list of names afterwards; both approaches would take too long, though, and go beyond the purpose of this project.
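Both limitations can be reproduced on invented strings. In the sketch below (same regular expression as above, with the lookahead assumed to be `(?![A-Z])`), the first comment contains a name without any of the expected delimiters and is missed, while the second contains an ordinary word that happens to fit the parenthesis pattern and is returned as if it were a name.

```python
import re

CORE = r'[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?'
DELIMITERS = [
    (r'(?<=_)', r'(?=_)'),
    (r'(?<=\[)', r'(?=\])'),
    (r'(?<=- )', r'(?= \(|, )'),
    (r'(?<=\()', r'(?=,)'),
]
PATTERN = re.compile('|'.join(start + CORE + end for start, end in DELIMITERS))

# Invented examples: a name with no delimiter, and a non-name after "(".
missed = PATTERN.findall('Suggested by John Smith.')
spurious = PATTERN.findall('True for all k (Here, k is prime).')

print(missed)    # []
print(spurious)  # ['Here']
```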

### The parsing function
The parsing function takes as input the **raw data** read by the JSON library and returns a **set of all author names** present in the comments of the loaded file (or `None` if there are none).

Basically, after preparing the regex pattern (`re.compile()`), for each `comment` in the non-empty `comment_list` the function gets a list of the authors' names using Python's [`re`](https://docs.python.org/3/library/re.html) package for regular expressions, and uses it to update the set of unique authors called `authors` (which contains all names found in the file). The list comprehension in the `update` method is needed because a single match returned by `findall()` can contain several comma-separated names (as in *"\[Surnamea, Surnameb\]"*), which are split and flattened into one list.

In [None]:
def parse_authors_from_comments(raw_data):
    # Regex pattern: a shared core wrapped by (lookbehind, lookahead) delimiter pairs
    common_pattern = r'[A-Z](?![A-Z])[^0-9+\(\)\[\]\{\}\\\/_:;""]{2,}?'
    pattern_list = [(r'(?<=_)', r'(?=_)'), (r'(?<=\[)', r'(?=\])'), (r'(?<=- )', r'(?= \(|, )'), (r'(?<=\()', r'(?=,)')]
    pattern = re.compile('|'.join([start + common_pattern + end for start, end in pattern_list]))

    # Comment parsing (some files have no results or no comments)
    results = raw_data.get('results')
    comment_list = results[0].get('comment') if results else None
    if comment_list:
        authors = set()
        for comment in comment_list:
            # A single match such as "Surnamea, Surnameb" holds several names
            authors.update([n for names in pattern.findall(comment) for n in names.split(', ')])
        return authors
    return None

Some observations:
- the regex pattern is initially split into its **subpatterns** for better readability and to avoid repetitions;
- this pattern has been carefully written so as to **not return empty matches**, normally generated by *capturing groups* (groups of characters between round parentheses) and for which additional `if`s would have been needed, resulting in a more complicated list comprehension;
- **some sequences do not contain comments**, hence the check on `comment_list`;
- a **set** has been chosen for the `authors` variable in order to **exclude duplicate names**, since the data needed for the project only concerns the presence or absence of a given author in the comments of a sequence, not all of their occurrences. Python's [`set`](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset) data structure stores items in a hash table, without duplicating them.

## Graph building

We can now proceed by parsing the authors from all OEIS sequences in the `data/sequences` directory and building their graph using the NetworkX library, optionally saving it to disk to avoid having to reload all the JSON files every time.

Considering that each **node** of the graph should contain the **name of a single author** (without duplicates), we only need to:
1. Add each author of each sequence as a node;
2. Add edges between all pairs of authors who have commented on the same sequence.

By repeating this procedure for every file in the `data/sequences` directory we get a graph of all authors, where people who have commented on the same sequence are connected by an edge.

The creation of such a graph is quite simple with the NetworkX library, since we only need to:
 - parse each sequence file;
 - extract its authors;
 - add them as nodes;
 - create a list of all possible pairs of authors in each sequence;
 - add an edge for each pair.

Since the first two operations have already been implemented in the previous steps (see the `parse_authors_from_comments()` function), the remaining ones take just a couple of lines of code, knowing that **NetworkX does not complain when adding existing nodes or edges**: we do not need to check every time if a given author has already been inserted or if a certain edge already exists, because the library will *not* duplicate them\[[2](https://networkx.org/documentation/stable/reference/classes/graph.html)\]. In fact, we can skip the `add_nodes_from()` call, since NetworkX automatically inserts missing nodes when adding edges connecting them. The only consequence is that an author who never shares a sequence with anyone else is left out of the graph, which is harmless here because such an isolated node cannot belong to any clique with more than one node.

The best way to compute all author pairs for each sequence is given by the [itertools](https://docs.python.org/3/library/itertools.html) library, which implements efficient looping.
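A minimal sketch of the pairing step, using only the standard library: `itertools.combinations` yields every unordered pair exactly once, and a plain set of `frozenset` pairs stands in for the graph's edge set to show that pairs repeated across sequences are not duplicated. The author names and the two "sequences" are invented.

```python
import itertools as its

# Invented author sets for two sequences; "B" and "C" commented on both.
authors_per_sequence = [
    {'Author A', 'Author B', 'Author C'},
    {'Author B', 'Author C', 'Author D'},
]

edges = set()
for authors in authors_per_sequence:
    # n authors produce n * (n - 1) / 2 unordered pairs
    pairs = list(its.combinations(sorted(authors), 2))
    # frozenset makes ('B', 'C') and ('C', 'B') the same edge
    edges.update(frozenset(pair) for pair in pairs)

print(len(edges))  # 5: the pair (Author B, Author C) is stored only once
```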

Some notes about this function:
- all it needs as input arguments is the **path** of the directory containing the JSON files and a **boolean flag** specifying whether the resulting graph should also be saved to disk (instead of simply returned), optionally along with a name for the newly created JSON graph file (`comments_authors_graph.json` by default);
- it begins by normalizing the directory path (appending a trailing slash if missing) and creating the necessary variables, among which:
    - a list of all JSON files in the given directory (using [`os.listdir()`](https://docs.python.org/3/library/os.html#os.listdir));
    - an empty NetworkX graph `g`;
    - a [tqdm](https://tqdm.github.io/) progress bar, only needed to visualize the overall progress of the parsing process.

In [None]:
def build_graph_from_directory(dir_path, save=False, filename='comments_authors_graph'):
    if dir_path[-1] != '/':
        dir_path += '/'
    file_list = [json_file for json_file in os.listdir(dir_path) if json_file.endswith('.json')]

    # Prepare variables
    g = nx.Graph()
    progress_bar = tqdm.tqdm(total=len(file_list))

    # Parse all JSON files
    for f in file_list:
        progress_bar.set_description('Parsing file {}'.format(f))
        file_path = dir_path + f
        raw_data = load_json(file_path)

        authors = parse_authors_from_comments(raw_data)
        if authors:
            # g.add_nodes_from(authors)
            g.add_edges_from(list(its.combinations(authors, 2)))
        progress_bar.update(1)

    # Save graph
    if save:
        with open(dir_path.split('/')[0] + '/' + filename + '.json', 'w') as out_file:
            json.dump(nx.readwrite.json_graph.node_link_data(g), out_file)

    return g

The graph can thus be created by running:

In [None]:
g = build_graph_from_directory('data/sequences', save=True)

and later retrieved by simply loading it with the function:

In [None]:
def load_json_graph(file_path):
    with open(file_path) as file:
        return nx.readwrite.json_graph.node_link_graph(json.load(file))

In [None]:
g = load_json_graph('data/comments_authors_graph.json')

All variable names are lowercase with words separated by underscores, in compliance with the PEP 8 (Python Enhancement Proposal 8) style guide\[[3](https://www.python.org/dev/peps/pep-0008/#function-and-variable-names)\].

## Maximal cliques

### Definitions
Let $G = (\mathcal{V}, \mathcal{E})$ be an undirected graph where $\mathcal{V}$ is the set of all nodes and $\mathcal{E}$ the set of all edges.

A **clique** of $G$ is a **complete subgraph**\[[4](https://mathworld.wolfram.com/Clique.html)\], i.e. a subgraph in which every pair of distinct vertices is connected by a unique edge\[[5](https://en.wikipedia.org/wiki/Complete_graph)\].

A **maximal clique** is a clique that cannot be extended by including one more adjacent vertex, meaning it is not a subset of a larger clique. The **maximum clique** in a graph (i.e. the clique of largest size) is always maximal, while the converse does not hold\[[6](https://mathworld.wolfram.com/MaximalClique.html)\].
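The three definitions can be checked mechanically on a toy graph. The sketch below assumes the graph is given as a plain dictionary of neighbor sets rather than a NetworkX object; the graph and the helper names `is_clique()` and `is_maximal_clique()` are illustrative only.

```python
from itertools import combinations

# Toy graph (invented): a triangle 1-2-3 plus the extra edge 3-4.
ADJ = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}


def is_clique(adj, nodes):
    """Every pair of distinct nodes must be joined by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def is_maximal_clique(adj, nodes):
    """A clique is maximal if no outside vertex is adjacent to all its members."""
    nodes = set(nodes)
    if not is_clique(adj, nodes):
        return False
    return not any(nodes <= adj[w] for w in adj if w not in nodes)


print(is_clique(ADJ, {1, 2}))             # True
print(is_maximal_clique(ADJ, {1, 2}))     # False: vertex 3 extends it
print(is_maximal_clique(ADJ, {1, 2, 3}))  # True, and it is also the maximum clique
print(is_maximal_clique(ADJ, {3, 4}))     # True, although it is not the maximum clique
```

The last two lines illustrate the remark above: `{3, 4}` is maximal but smaller than `{1, 2, 3}`, so a maximal clique is not necessarily a maximum one.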

### The Bron-Kerbosch algorithm
In order to find one or all maximal cliques in our graph, as well as the maximum clique, we can proceed by implementing the **Bron-Kerbosch algorithm**\[[7](https://dl.acm.org/doi/10.1145/362342.362367)\], designed by its Dutch namesakes in 1973 and widely used in its variants for finding many types of communities (subsets of nodes more densely connected than the rest of the network) in graphs.

This algorithm solves the following **core problem**\[[8](https://e-l.unifi.it/course/view.php?id=20118)\]:

> Given three sets $R$, $P$ and $X$, find all maximal cliques that include:
> - all of the vertices in $R$;
> - some of the vertices in $P$;
> - none of the vertices in $X$.
>
> Assuming that $P \cap X = \emptyset$, and for each $v \in P \cup X$, $R \cup \{v\}$ is a clique, i.e. $v \in \mathcal{N}(R)$.

To find all maximal cliques of graph $G$, we only need to initially set $R = \emptyset$, $P = \mathcal{V}$ and $X = \emptyset$. After running the algorithm described in the pseudocode below, whenever $P = \emptyset$ and $X = \emptyset$ then there are no further elements that can be added to $R$, so $R$ is a maximal clique and the algorithm outputs it.

```
Bron-Kerbosch(R, P, X):
    if P and X are both empty:
        report R as a maximal clique
    for each vertex v in P:
        Bron-Kerbosch(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
        P = P \ {v}
        X = X ⋃ {v}
```

The set $X$ of vertices avoids listing cliques that have already been found, meaning that Bron-Kerbosch **lists each solution exactly once**.

It should be noted, however, that we often lose time going into dead ends, since every time $P = \emptyset$ and $X \neq \emptyset$ we backtrack without producing anything.
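The pseudocode translates almost line by line into Python. The following sketch works on a toy graph stored as a dictionary of neighbor sets (an assumption made to keep it free of dependencies) and also counts the dead ends, i.e. the calls that reach an empty $P$ with a non-empty $X$.

```python
# Toy graph (invented): a triangle 1-2-3 plus the extra edge 3-4.
ADJ = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}


def bron_kerbosch(adj):
    """Return (maximal cliques, number of dead ends) for the basic algorithm."""
    cliques = []
    dead_ends = 0

    def expand(r, p, x):
        nonlocal dead_ends
        if not p:
            if not x:
                cliques.append(r)   # P and X both empty: R is maximal
            else:
                dead_ends += 1      # P empty, X non-empty: nothing to report
            return
        for v in sorted(p):         # sorted() fixes the visiting order
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}             # v has been fully explored...
            x = x | {v}             # ...so it is excluded from now on

    expand(set(), set(adj), set())
    return cliques, dead_ends


cliques, dead_ends = bron_kerbosch(ADJ)
print([sorted(c) for c in cliques])  # [[1, 2, 3], [3, 4]]
print(dead_ends)                     # 3
```

Even on four vertices, three branches are explored only to discover that the clique built so far is contained in one already reported.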

The following image illustrates how this algorithm works on a toy example.

![](/bk_example.png)

#### Bron-Kerbosch with Tomita pivoting
The number of bad cases (where $P = \emptyset$ and $X \neq \emptyset$) can be reduced by choosing a pivot vertex $u$ from $P \cup X$: any maximal clique must then include either $u$ or one of its non-neighbors, or otherwise the clique could be augmented by adding $u$ to it. Hence, only $u$ and its non-neighbors need to be tested as the choices for the vertex $v$ that is added to $R$ in each recursive call to the algorithm.

A simple, effective way to choose the pivot is **Tomita pivoting**\[[9](https://www.sciencedirect.com/science/article/pii/S0304397506003586)\]:

> The pivot $u \in P \cup X$ is the node having the most neighbors in $P$

```
Bron-Kerbosch-Tomita-Pivoting(R, P, X):
    if P and X are both empty:
        report R as a maximal clique
    choose pivot vertex u in P ⋃ X such that it is the node having the most neighbors in P
    for each vertex v in P \ N(u):
        Bron-Kerbosch-Tomita-Pivoting(R ⋃ {v}, P ⋂ N(v), X ⋂ N(v))
        P = P \ {v}
        X = X ⋃ {v}
```
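A possible Python rendering of the pivoting variant is sketched below, again on a toy graph stored as a dictionary of neighbor sets (an assumption, not the project's data structure). It counts the recursive calls to make the saving visible.

```python
# Toy graph (invented): a triangle 1-2-3 plus the extra edge 3-4.
ADJ = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}


def bron_kerbosch_pivot(adj):
    """Return (maximal cliques, number of recursive calls) using Tomita pivoting."""
    cliques = []
    calls = 0

    def expand(r, p, x):
        nonlocal calls
        calls += 1
        if not p and not x:
            cliques.append(r)
            return
        # Tomita pivot: the vertex of P ∪ X with the most neighbors in P
        u = max(sorted(p | x), key=lambda n: len(p & adj[n]))
        # Only u and its non-neighbors are candidates for extending R
        for v in sorted(p - adj[u]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques, calls


cliques, calls = bron_kerbosch_pivot(ADJ)
print([sorted(c) for c in cliques])  # [[1, 2, 3], [3, 4]]
print(calls)                         # 5
```

On this graph the pivoted search needs 5 calls and never hits a dead end, whereas the unpivoted algorithm makes 10 calls when the vertices are visited in increasing order.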

#### Complexity
The worst-case analysis for the Bron-Kerbosch algorithm with a pivot strategy matches the bound in \[[10](https://link.springer.com/article/10.1007/BF02760024)\], which proves that any $n$-vertex graph has at most $3^{\frac{n}{3}}$ maximal cliques.

Although other algorithms for solving the maximal clique problem yield better results on certain types of input, it has been frequently reported that in practice the Bron-Kerbosch algorithm, with its $O(3^{\frac{n}{3}})$ **worst-case time complexity**, is more **efficient** than its alternatives.
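The $3^{\frac{n}{3}}$ bound is attained by the graphs usually named after Moon and Moser: the vertices are split into $\frac{n}{3}$ groups of three, and two vertices are adjacent exactly when they belong to different groups. A maximal clique then picks one vertex per group, giving $3^{\frac{n}{3}}$ choices. The sketch below builds such graphs as dictionaries of neighbor sets and counts their maximal cliques with a plain Bron-Kerbosch generator; the function names are illustrative.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of the graph (basic Bron-Kerbosch)."""
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        for v in sorted(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(frozenset(), frozenset(adj), frozenset())


def moon_moser_graph(k):
    """Graph on n = 3k vertices: adjacent iff the vertices lie in different triples."""
    n = 3 * k
    return {v: {w for w in range(n) if w // 3 != v // 3} for v in range(n)}


for k in (1, 2, 3):
    n = 3 * k
    count = sum(1 for _ in maximal_cliques(moon_moser_graph(k)))
    print(n, count, 3 ** k)  # the last two numbers coincide
```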

## Python maximal cliques algorithms
As stated in the introduction, the aim of this project is to build **three algorithms** to find:
1. a maximal clique;
2. a list of all maximal cliques;
3. the maximum clique.

### Algorithm 1: finding a maximal clique
The easiest way to perform this operation is to implement the efficient Bron-Kerbosch algorithm in such a way that it stops after having found a single maximal clique, which can be done with the following code:

In [None]:
def find_a_maximal_clique(g, tomita=True, print_result=True):
    # nx.DiGraph subclasses nx.Graph, hence the additional is_directed() check
    if not isinstance(g, nx.Graph) or g.is_directed():
        print('The provided graph is not a valid NetworkX undirected graph.')
        return None

    def bron_kerbosch(r, p, x):
        if not p and not x:
            # r is maximal; only non-trivial cliques (more than 2 nodes) are accepted
            return r if len(r) > 2 else None
        for v in {*p}:
            neighbors = {*g.neighbors(v)}
            clique = bron_kerbosch(r | {v}, p & neighbors, x & neighbors)
            if clique:
                return clique  # Stop at the first maximal clique found
            p = p - {v}
            x = x | {v}
        return None

    def bron_kerbosch_tomita_pivoting(r, p, x):
        if not p and not x:
            return r if len(r) > 2 else None
        # Pivot: the node of P ∪ X with the most neighbors in P
        u = max(p | x, key=lambda n: len(p & {*g.neighbors(n)}))
        for v in p - {*g.neighbors(u)}:
            neighbors = {*g.neighbors(v)}
            clique = bron_kerbosch_tomita_pivoting(r | {v}, p & neighbors, x & neighbors)
            if clique:
                return clique  # Stop at the first maximal clique found
            p = p - {v}
            x = x | {v}
        return None

    if g.nodes:
        # Initialization
        r = {*()}
        p = {*g.nodes}
        x = {*()}

        # Bron-Kerbosch algorithm
        if tomita:
            clique = bron_kerbosch_tomita_pivoting(r, p, x)
        else:
            clique = bron_kerbosch(r, p, x)

        # Printing
        if print_result:
            print(clique)

        return clique
    else:
        print('The provided graph is empty.')
        return None

The function `find_a_maximal_clique()` takes as input a NetworkX graph `g` and, optionally, two boolean flags for using Tomita pivoting (`tomita`, `True` by default) and printing the clique found (`print_result`, also `True` by default).

It checks whether the provided graph is a NetworkX undirected graph, and then defines the two versions of the Bron-Kerbosch algorithm. They are a fairly literal implementation of the pseudocode listed above, except that the search stops as soon as a clique is returned and that only non-trivial maximal cliques are returned, i.e. only those with more than 2 nodes. The $R$, $P$ and $X$ sets are implemented using Python's efficient [`set`](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset); in particular, empty sets are created with the unpacking expression `{*()}`\[[11](https://www.python.org/dev/peps/pep-0448/)\], which in the measurement below runs faster than the equivalent `set()` constructor:

In [8]:
number = 1000
print('Empty literal execution time: {} s.'.format((timeit.timeit('{*()}', number=number)) / number))
print('Set constructor execution time: {} s.'.format((timeit.timeit('set()', number=number)) / number))

Empty literal execution time: 7.419999997182458e-08 s.
Set constructor execution time: 1.5360000003283857e-07 s.


## Testing
This project has been created and successfully tested on the following platform:

- **Motherboard:** MSI Nightblade X2
- **CPU:** Intel Core i7-6700K @ 4.01 GHz, 4 cores / 8 threads
- **GPU:** AMD Radeon RX VEGA64 8GB
- **RAM:** 16 GB DDR4 @ 2133 MHz
- **SSD:** Samsung SSD 850 EVO 500 GB (540/520 MB/s r/w)
- **HDD:** WD Blue 3 TB (7200 rpm, 180/220 MB/s r/w)
- **OS:** Windows 10 Pro x64 1909
- **IDE:** PyCharm Professional 2021.1
- **Python:** 3.8

## References
\[1\] Wikipedia, **Regular expression**, https://en.wikipedia.org/wiki/Regular_expression

\[2\] NetworkX, **Graph - Undirected graphs with self loops**, https://networkx.org/documentation/stable/reference/classes/graph.html

\[3\] Guido van Rossum, Barry Warsaw, Nick Coghlan, **PEP 8 -- Style Guide for Python Code**, https://www.python.org/dev/peps/pep-0008/#function-and-variable-names

\[4\] Wolfram MathWorld, **Clique**, https://mathworld.wolfram.com/Clique.html

\[5\] Wikipedia, **Complete graph**, https://en.wikipedia.org/wiki/Complete_graph

\[6\] Wolfram MathWorld, **Maximal Clique**, https://mathworld.wolfram.com/MaximalClique.html

\[7\] Coen Bron, Joep Kerbosch, **Algorithm 457: finding all cliques of an undirected graph**, Communications of the ACM, vol. 16, issue 9 (Sept. 1973), pp 575–577, https://dl.acm.org/doi/10.1145/362342.362367

\[8\] Andrea Marino, **Finding Graph Patterns**, from the "Advanced Algorithms and Graph Mining" course at the Università degli Studi di Firenze, 2021, https://e-l.unifi.it/course/view.php?id=20118

\[9\] Etsuji Tomita, Akira Tanaka, Haruhisa Takahashi, **The worst-case time complexity for generating all maximal cliques and computational experiments**, Theoretical Computer Science, 363 (1): 28–42, 2006, https://www.sciencedirect.com/science/article/pii/S0304397506003586

\[10\] J. W. Moon, L. Moser, **On cliques in graphs**, Israel Journal of Mathematics, 3: 23–28, 1965, https://link.springer.com/article/10.1007/BF02760024

\[11\] Joshua Landau, **PEP 448 -- Additional Unpacking Generalizations**, https://www.python.org/dev/peps/pep-0448/

## License
This work is licensed under a [Creative Commons “Attribution-NonCommercial-ShareAlike 4.0 International”](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) license. More details are available in the [LICENSE.md](./LICENSE.md) file.