# Erdos-computation with plain Python

In [None]:
from erdos1_plainpython import load_ndjson, attach_authors, build_coauthorship_matrix, erdos


## Data loading

Load the datasets from NDJSON files and inspect the first examples. The dataset consists of two entities: scientific publications (aka **Articles**) and their **Authors**. It is built from all publications in MDPI's Plant journal and their authors. The many-to-many relationship between Authors and Articles is modelled with a product entity: **Authorship**.

![Data model](images/data_model.drawio.png)

Articles have the following attributes:
  * **title:** The title of the publication.
  * **publication_date:** The date when the open-access publication was first published on mdpi.com.
  * **doi:** The Digital Object Identifier of the publication. The DOI uniquely identifies every scientific publication. 

Authors have the following attributes:
  * **id:** A unique, sequential identifier used as the row and column index of the 2D co-authorship matrix. The matrix is the input to Dijkstra's shortest path algorithm.
  * **lastname:** The last name of the author.
  * **given_names:** All given names of the author, separated by spaces.
  * **orcid:** A unique identifier, intended to identify scholars and their publication lists.
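
NDJSON stores one JSON object per line. As a rough illustration of what a loader for such files might look like, here is a minimal sketch using only the standard library. The name `load_ndjson_sketch` is hypothetical; the actual `load_ndjson` in `erdos1_plainpython` may be implemented differently.

```python
import json
from pathlib import Path


def load_ndjson_sketch(path):
    """Hypothetical stand-in for load_ndjson: parse one JSON object per line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

Each returned record is a plain `dict`, so the attributes listed above would be read as, for example, `record["doi"]` or `record["orcid"]`, assuming the keys match the attribute names.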


In [None]:
articles = load_ndjson("esp2025_articles.ndjson")

# Display the first two articles. Each article has the following properties.
articles[:2]

In [None]:
authors = load_ndjson("esp2025_authors.ndjson")

# Display the first two authors. Each author has the following properties.
authors[:2]

In [None]:
authorships = load_ndjson("esp2025_authorships.ndjson")

# Display some authorship relationships. Author 0000-0001-8056-7215 appears as an author of two distinct articles.
authorships[3:5]

## Resolving many-to-many relationship

For our purpose, it's more convenient to attach all authors of a publication to the publication itself. We can achieve this in three steps.
* First, we build a mapping from article DOI to all authorship relations for the article.
* Second, we build a mapping from author ORCID to author objects.
* Finally, we iterate over all articles, fetch all authorship relations and attach the authors to a new property called `authors`.
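
The three steps above can be sketched as follows. This is only an illustration, not the code behind `attach_authors`: the function name `attach_authors_sketch` is hypothetical, and it assumes that each authorship record carries the article DOI under `"doi"` and the author ORCID under `"orcid"`. The toy records are made up for the example.

```python
from collections import defaultdict


def attach_authors_sketch(articles, authors, authorships):
    # Step 1: article DOI -> authorship records (assumed key: "doi").
    by_doi = defaultdict(list)
    for rel in authorships:
        by_doi[rel["doi"]].append(rel)

    # Step 2: author ORCID -> author record (assumed key: "orcid").
    by_orcid = {author["orcid"]: author for author in authors}

    # Step 3: copy each article and attach its resolved authors.
    fat = []
    for article in articles:
        rels = by_doi.get(article["doi"], [])
        resolved = [by_orcid[r["orcid"]] for r in rels if r["orcid"] in by_orcid]
        fat.append({**article, "authors": resolved})
    return fat


# Made-up toy data, not taken from the real dataset.
toy_articles = [{"doi": "10.0000/example-a"}, {"doi": "10.0000/example-b"}]
toy_authors = [{"id": 0, "orcid": "X"}, {"id": 1, "orcid": "Y"}]
toy_authorships = [
    {"doi": "10.0000/example-a", "orcid": "X"},
    {"doi": "10.0000/example-a", "orcid": "Y"},
    {"doi": "10.0000/example-b", "orcid": "X"},
]
toy_fat = attach_authors_sketch(toy_articles, toy_authors, toy_authorships)
```

The sketch copies each article instead of mutating it, so the original `articles` list stays unchanged. Authors come out in the order of the authorship records, which need not match the order printed on the paper.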

In [None]:
fat_articles = attach_authors(articles, authors, authorships)

In [None]:
a = fat_articles

In [None]:
# Let's look at an example. Please note that the order of authors may differ from the paper.

## Build co-authorship matrix


In [None]:
coauthorship = build_coauthorship_matrix(fat_articles)

In [None]:
assert coauthorship[1, 2] == 0  # Author 1 and 2 are not coauthors

In [None]:
assert coauthorship[1, 4998] == 1 # Author 1 and 4998 wrote a publication together
assert coauthorship[4998, 1] == 1 # The matrix is symmetric: the same pair in reverse order
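
A symmetric 0/1 structure like the one checked above could be built roughly as follows. This is a sketch under assumptions, not the actual `build_coauthorship_matrix`: the name `build_coauthorship_sketch` is hypothetical, and the "matrix" here is a `defaultdict` keyed by `(row, column)` tuples, chosen only because it supports the `matrix[i, j]` indexing used in the assertions. The real function may return a different container.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthorship_sketch(fat_articles):
    matrix = defaultdict(int)  # pairs that were never set read as 0
    for article in fat_articles:
        # Deduplicate the author ids of one article before pairing them.
        ids = sorted({author["id"] for author in article["authors"]})
        for i, j in combinations(ids, 2):
            matrix[i, j] = 1  # mark the pair in both directions
            matrix[j, i] = 1  # so the matrix stays symmetric
    return matrix


# Made-up toy data: authors 0 and 1 share one article, 1 and 2 another.
toy_fat_articles = [
    {"authors": [{"id": 0}, {"id": 1}]},
    {"authors": [{"id": 1}, {"id": 2}]},
]
toy_matrix = build_coauthorship_sketch(toy_fat_articles)
```

With this toy input, `toy_matrix[0, 1]` and `toy_matrix[1, 0]` are both 1, while `toy_matrix[0, 2]` is 0 because authors 0 and 2 never share an article.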

## Find shortest path

In [None]:
authors[4], authors[990]

In [None]:
erdos(coauthorship, 4, 990)
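
For readers curious how such a shortest-path query can work, here is a minimal sketch of Dijkstra's algorithm with every co-authorship link counted as one step. The name `erdos_sketch` is hypothetical, and so are the input format (a dict keyed by `(i, j)` pairs) and the return value (the number of steps, or `None` if no path exists). The actual `erdos` function may take and return something different.

```python
import heapq


def erdos_sketch(matrix, source, target):
    """Dijkstra with unit edge weights over a dict keyed by (i, j) pairs."""
    # Build adjacency lists from the pairs marked with 1.
    neighbors = {}
    for (i, j), flag in matrix.items():
        if flag:
            neighbors.setdefault(i, []).append(j)

    dist = {source: 0}
    queue = [(0, source)]  # min-heap of (distance, author id)
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for nxt in neighbors.get(node, ()):
            nd = d + 1
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    return None  # target is not reachable from source


# Made-up toy graph: 0 - 1 - 2, plus an isolated author 3.
toy_pairs = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1}
toy_distance = erdos_sketch(toy_pairs, 0, 2)  # -> 2
```

Because all links have the same weight, a plain breadth-first search would give the same distances; Dijkstra is shown here because it is the algorithm named in the data description.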