Merge pull request #787 from aaxelb/eng-2100--docs
[ENG-2100] docs: delete old, add new
aaxelb committed Jul 27, 2021
2 parents 20b6e20 + 6ff012f commit 019dbe3
Showing 37 changed files with 313 additions and 2,456 deletions.
106 changes: 106 additions & 0 deletions ARCHITECTURE.md
@@ -0,0 +1,106 @@
# Architecture of SHARE/Trove

This document is a starting point and reference to familiarize yourself with this codebase.

## Bird's eye view
In short, SHARE/Trove takes metadata records (in any supported input format),
ingests them, and makes them available in any supported output format.
```
            ┌───────────────────────────────────────────┐
            │                  Ingest                   │
            │                                  ┌──────┐ │
            │ ┌─────────────────────────┐   ┌──►Format├─┼────┐
            │ │        Normalize        │   │  └──────┘ │    │
            │ │                         │   │           │    ▼
┌───────┐   │ │ ┌─────────┐  ┌────────┐ │   │  ┌──────┐ │ save as
│Harvest├─┬─┼─┼─►Transform├──►Regulate├─┼─┬─┼──►Format├─┼─┬─►FormattedMetadataRecord
└───────┘ │ │ │ └─────────┘  └────────┘ │ │ │  └──────┘ │ │
          │ │ │                         │ │ .           │ │  ┌───────┐
          │ │ └─────────────────────────┘ │ .           │ └──►Indexer│
          │ │                             │ .           │    └───────┘
          │ └─────────────────────────────┼─────────────┘ some formats also
          │                               │               indexed separately
          ▼                               ▼
       save as                         save as
      RawDatum                     NormalizedData
```
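
The flow in the diagram can be sketched in a few lines of Python. This is a toy illustration only: the function names, the formatter registry, and the fields on each class are assumptions made for the example and do not mirror the actual code in `share/`. Only the stage names (`RawDatum`, `NormalizedData`, `FormattedMetadataRecord`) come from this document.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-ins for the three record types named in the diagram.
# The fields are assumptions for illustration, not the real model fields.


@dataclass(frozen=True)
class RawDatum:
    source: str
    identifier: str
    body: str


@dataclass(frozen=True)
class NormalizedData:
    suid: tuple
    data: dict


@dataclass(frozen=True)
class FormattedMetadataRecord:
    suid: tuple
    record_format: str
    formatted: str


def transform(raw: RawDatum) -> dict:
    # Toy transformer: treat the whole raw body as the title.
    return {"title": raw.body}


def regulate(data: dict) -> dict:
    # Toy regulator: one rule applied to every datum, whatever its source.
    return {key: value.strip() for key, value in data.items()}


# Hypothetical registry of output formats (name -> formatter function).
FORMATTERS: Dict[str, Callable[[dict], str]] = {
    "plain_title": lambda data: data["title"],
    "shouted_title": lambda data: data["title"].upper(),
}


def ingest(raw: RawDatum) -> tuple:
    """Normalize (transform, then regulate), then format into every output format."""
    suid = (raw.source, raw.identifier)
    normalized = NormalizedData(suid=suid, data=regulate(transform(raw)))
    formatted = [
        FormattedMetadataRecord(suid, name, formatter(normalized.data))
        for name, formatter in FORMATTERS.items()
    ]
    return normalized, formatted
```

Each stage returns a new object rather than modifying its input, which matches the "save as" arrows in the diagram: the output of every stage is stored as its own record.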

## Code map

A brief look at important areas of code as they happen to exist now.

### Static configuration

`share/schema/` describes the "normalized" metadata schema/format that all
metadata records are converted into when ingested.

`share/sources/` describes a starting set of metadata sources that the system
could harvest metadata from -- these will be put in the database and can be
updated or added to over time.

`project/settings.py` describes system-level settings which can be set by
environment variables (and their default values), as well as settings
which cannot.
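
As a rough sketch of that pattern (not the contents of `project/settings.py`), a settings module can read some values from the environment with defaults and hard-code others. The setting names `EXAMPLE_DEBUG`, `EXAMPLE_PAGE_SIZE`, and `APP_LABEL` are hypothetical and chosen only for this example.

```python
import os
from typing import Mapping


def load_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Build a settings dict from environment variables, with defaults.

    All setting names here are hypothetical examples.
    """
    return {
        # Overridable from the environment; defaults to False.
        "DEBUG": environ.get("EXAMPLE_DEBUG", "false").lower() in {"1", "true", "yes"},
        # Overridable from the environment; defaults to 25.
        "PAGE_SIZE": int(environ.get("EXAMPLE_PAGE_SIZE", "25")),
        # Fixed: not configurable from the environment.
        "APP_LABEL": "share",
    }
```

Passing the environment in as a mapping keeps the function easy to exercise with a plain dict.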

`share/models/` describes the data layer using the [Django](https://www.djangoproject.com/) ORM.

`share/subjects.yaml` describes the "central taxonomy" of subjects allowed
in `Subject.name` fields of `NormalizedData`.

### Harvest and ingest

`share/harvest/` and `share/harvesters/` describe how metadata records
are pulled from other metadata repositories.

`share/transform/` and `share/transformers/` describe how raw data (possibly
in any format) are transformed to the "normalized" schema.

`share/regulate/` describes rules which are applied to every normalized datum,
regardless of where or what format it originally came from.

`share/metadata_formats/` describes how a normalized datum can be formatted
into any supported output format.

`share/tasks/` runs the harvest/ingest pipeline and stores each task's status
(including debugging info, if errored) as a `HarvestJob` or `IngestJob`.

### Outward-facing views

`share/search/` describes how the search indexes are structured, managed, and
updated when new metadata records are introduced -- this provides a view for
discovering items by arbitrary search criteria.

`share/oaipmh/` describes the [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html)
view for harvesting metadata from SHARE/Trove in bulk.
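
For readers unfamiliar with OAI-PMH, a harvesting request is an HTTP GET with a `verb` and a few standard query arguments. The sketch below only builds such a URL. The `verb`, `metadataPrefix`, and `from` arguments come from the OAI-PMH specification, while the base URL is a placeholder and the helper name is made up for this example; check the deployment for the actual endpoint and supported metadata prefixes.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# The base URL is a placeholder, not a confirmed SHARE/Trove endpoint.
url = list_records_url("https://example.org/oai-pmh/", from_date="2021-07-01")
```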

`api/` describes a mostly REST-ful API that's useful for inspecting records for
a specific item of interest.

### Internals

`share/admin/` is a Django app for administrative access to the SHARE database
and pipeline logs.

`osf_oauth2_adapter/` is a Django app to support logging in to SHARE via OSF.

### Testing

`tests/` are tests.

## Cross-cutting concerns

### Immutable metadata

Metadata records at all stages of the pipeline (`RawDatum`, `NormalizedData`,
`FormattedMetadataRecord`) should be considered immutable -- any updates
result in a new record being created, not an old record being altered.

Multiple records which describe the same item/object are grouped by a
"source-unique identifier" or "suid" -- essentially a two-tuple
`(source, identifier)` that uniquely and persistently identifies an item in
the source repository. In most outward-facing views, default to showing only
the most recent record for each suid.
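
A minimal sketch of these two ideas, assuming a simplified record with a creation timestamp (the `Record` class and its fields are hypothetical, not the actual Django models): records are frozen once created, and a view picks the newest record per suid.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Tuple

# A suid as described above: (source, identifier).
Suid = Tuple[str, str]


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Record:
    suid: Suid
    created: datetime
    title: str


def latest_per_suid(records: Iterable[Record]) -> Dict[Suid, Record]:
    """Return only the most recently created record for each suid."""
    latest: Dict[Suid, Record] = {}
    for record in records:
        current = latest.get(record.suid)
        if current is None or record.created > current.created:
            latest[record.suid] = record
    return latest
```

An "update" in this model is a new `Record` with the same suid and a later `created` value; older records stay untouched and remain available for inspection.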

## Why this?
This document is inspired by [this writeup](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html)
and [this example architecture document](https://github.com/rust-analyzer/rust-analyzer/blob/d7c99931d05e3723d878bea5dc26766791fa4e69/docs/dev/architecture.md).
72 changes: 4 additions & 68 deletions CONTRIBUTING.md
@@ -1,71 +1,7 @@
# CONTRIBUTING

## Style Guide

In the following templates, `TYPE` may be any of `Task`, `Bug`, `Feature`, `Improvement`, or `Quick`.

### Commit Messages

Commit messages should be formatted as:

```
[SHARE-###][TYPE] Brief description
* More details about the code changes.
* Formatted as a bulleted list
* If you have a really long line, wrap it
  at 80 characters and line up with the first
  letter, not the bullet point.
```

Here are some excellent commit messages, for reference.
* https://github.com/CenterForOpenScience/SHARE/commit/0fe503f0dc5f90da366246086ae76ee5281843cf
* https://github.com/CenterForOpenScience/SHARE/commit/226bac6a9010cde6aed7ac037c9186ac889b5132
* https://github.com/CenterForOpenScience/SHARE/commit/0e02dbb9d06920623e0dfb6a32fd1b38771de74b

### Pull Requests

Titles should be formatted as `[SHARE-###][TYPE] Brief description`

Here are some excellent pull requests, for reference.
* https://github.com/CenterForOpenScience/SHARE/pull/658
* https://github.com/CenterForOpenScience/SHARE/pull/642

### Code

#### Docstrings

Python docstrings should follow the [Google docstring style guide](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html).

To easily distinguish them, docstrings should use triple double-quotes, `"""`, and large strings should use triple single-quotes, `'''`

## Reporting Issues

If you find a bug in osf.io or would like to propose a new feature, please file an issue report in CenterForOpenScience/osf.io. Below we have some information on how to best report the issue, but if you’re short on time or new to this, don’t worry! We really want to know about the problem, so go ahead and report it. If you do this a lot, or you just want to know how to make it easier for us to find and fix the problem, keep reading.

If you would like to report a security issue, please email contact@cos.io for instructions on how to report the security issue. Do not include details of the issue in that email.

### Quick link
[Submit an issue](https://github.com/CenterForOpenScience/SHARE/issues/new?body=Steps%0A-------%0A1.%20%0A%0AExpected%0A------------%0A%0AActual%0A--------%0A)
using that link and you will have a handy template to save you a little time in your issue reporting.

### How to make the best issue
--------------------------

First, please make sure that the issue has not already been reported by searching through the issue archives.

When submitting an issue, be as descriptive as possible:
* What you did (step by step)
* Where does this happen on SHARE?
* What you expected
* What actually happened
* Check the JavaScript console in the browser (e.g. In Chrome go to View → Developer → JavaScript console) and report errors
* If it's an issue with staging, report whether or not it also occurs on production
* If an error was generated, report what time it occurred, and the specific URL.
* Potential causes
* Suggest a solution
* What will it look like when this issue is resolved?

Include pictures (e.g., in OSX press Cmd+Shift+4 to draw a box to screenshot)

TODO: how do we want to guide community contributors?

For now, if you're interested in contributing to SHARE/Trove, feel free to
[open an issue on github](https://github.com/CenterForOpenScience/SHARE/issues)
and start a conversation.
87 changes: 10 additions & 77 deletions README.md
@@ -1,93 +1,26 @@
# SHARE v2
# SHARE/Trove

SHARE is creating a free, open dataset of research (meta)data.

> **Note**: SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews.
[![Coverage Status](https://coveralls.io/repos/github/CenterForOpenScience/SHARE/badge.svg?branch=develop)](https://coveralls.io/github/CenterForOpenScience/SHARE?branch=develop)
[![Gitter](https://badges.gitter.im/CenterForOpenScience/SHARE.svg)](https://gitter.im/CenterForOpenScience/SHARE)

## Technical Documentation
## Documentation

http://share-research.readthedocs.io/en/latest/index.html
### What is this?
see [WHAT-IS-THIS-EVEN.md](./WHAT-IS-THIS-EVEN.md)

### How can I use it?
see [how-to/use-the-api.md](./how-to/use-the-api.md)

## On the OSF
### How do I navigate this codebase?
see [ARCHITECTURE.md](./ARCHITECTURE.md)

https://osf.io/sdxvj/
### How do I run a copy locally?
see [how-to/run-locally.md](./how-to/run-locally.md)


## Get involved

We'll be expanding this section in the near future, but, beyond using our API for your own purposes, harvesters are a great way to get started. You can find a few that we have in our list [here](https://github.com/CenterForOpenScience/SHARE/issues/510).

## Setup for testing
It is useful to set up a [virtual environment](http://virtualenvwrapper.readthedocs.io/en/latest/install.html) to ensure [python3](https://www.python.org/downloads/) is your designated version of python and make the python requirements specific to this project.

    mkvirtualenv share -p `which python3.6`
    workon share

Once in the `share` virtual environment, install the necessary requirements, then setup SHARE.

    pip install -Ur requirements.txt
    python setup.py develop
    pyenv rehash  # Only necessary when using pyenv to manage virtual environments

`docker-compose` assumes [Docker](https://www.docker.com/) is installed and running. Running `./bootstrap.sh` will create and provision the database. If there are any SHARE containers running, make sure to stop them before bootstrapping using `docker-compose stop`.

    docker-compose build web
    docker-compose run --rm web ./bootstrap.sh

## Run
Run the API server

    # In docker
    docker-compose up -d web

    # Locally
    sharectl server

Setup Elasticsearch

    sharectl search setup

Run Celery

    # In docker
    docker-compose up -d worker

    # Locally
    sharectl worker -B

## Populate with data
This is particularly applicable to running [ember-share](https://github.com/CenterForOpenScience/ember-share), an interface for SHARE.

Harvest data from providers, for example

    sharectl harvest com.nature
    sharectl harvest com.peerj.preprints

    # Harvests may be scheduled to run asynchronously using the schedule command
    sharectl schedule org.biorxiv.html

    # Some sources provide thousands of records per day
    # --limit can be used to set a maximum number of records to gather
    sharectl harvest org.crossref --limit 250

If the Celery worker is running, new data will automatically be indexed every couple minutes.

Alternatively, data may be explicitly indexed using `sharectl`

    sharectl search
    # Forcefully re-index all data
    sharectl search --all

## Building docs

    cd docs/
    pip install -r requirements.txt
    make watch

## Running Tests

### Unit test suite
42 changes: 42 additions & 0 deletions WHAT-IS-THIS-EVEN.md
@@ -0,0 +1,42 @@
# "What is this, even?"

Imagine a vast, public library full of the outputs and results of some scientific
research -- shelves full of articles, preprints, datasets, data analysis plans,
and so on.

You can think of SHARE/Trove as that library's card catalog.

## "...What is a card catalog?"

A [card catalog](https://en.wikipedia.org/wiki/Card_catalog) is that weird, cool cabinet you might see at the front of a
library with a bunch of tiny drawers full of index cards -- each index card
contains information about some item on the library shelves.

The card catalog is where you go when you want to:
- locate a specific item in the library
- discover items related to a specific topic, author, or other keywords
- make a new item easily discoverable by others

## "OK but what 'library' is this?"
As of July 2021, SHARE/Trove contains metadata on over 4.5 million items originating from:
- [OSF](https://osf.io) (including OSF-hosted Registries and Preprint Providers)
- [REPEC](http://repec.org)
- [arXiv](https://arxiv.org)
- [ClinicalTrials.gov](https://clinicaltrials.gov)
- ...and more!

Updates from OSF are reflected within seconds, while updates from third-party sources are
harvested once daily.

## "How can I use it?"

You can search the full SHARE/Trove catalog at
[share.osf.io/discover](https://share.osf.io/discover).

Other search pages can also be built on SHARE/Trove, showing only a specific
collection of items. For example, [OSF Preprints](https://osf.io/preprints/discover)
and [OSF Registries](https://osf.io/registries/discover) show only preprints
and registrations, respectively, which are hosted on OSF infrastructure.

To learn about using the API (instead of a user interface), see
[USING-THE-API.md](./USING-THE-API.md)
1 change: 0 additions & 1 deletion bootstrap.sh

This file was deleted.
