Merge pull request #787 from aaxelb/eng-2100--docs

[ENG-2100] docs: delete old, add new
CenterForOpenScience · Jul 27, 2021 · 019dbe3
2 parents 20b6e20 + 6ff012f
commit 019dbe3

Showing 37 changed files with 313 additions and 2,456 deletions.
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -0,0 +1,106 @@
+# Architecture of SHARE/Trove
+
+This document is a starting point and reference to familiarize yourself with this codebase.
+
+## Bird's eye view
+In short, SHARE/Trove takes metadata records (in any supported input format),
+ingests them, and makes them available in any supported output format.
+```
+            ┌───────────────────────────────────────────┐
+            │                  Ingest                   │
+            │                                  ┌──────┐ │
+            │ ┌─────────────────────────┐   ┌──►Format├─┼────┐
+            │ │        Normalize        │   │  └──────┘ │    │
+            │ │                         │   │           │    ▼
+┌───────┐   │ │ ┌─────────┐  ┌────────┐ │   │  ┌──────┐ │    save as
+│Harvest├─┬─┼─┼─►Transform├──►Regulate├─┼─┬─┼──►Format├─┼─┬─►FormattedMetadataRecord
+└───────┘ │ │ │ └─────────┘  └────────┘ │ │ │  └──────┘ │ │
+          │ │ │                         │ │ .           │ │  ┌───────┐
+          │ │ └─────────────────────────┘ │ .           │ └──►Indexer│
+          │ │                             │ .           │    └───────┘
+          │ └─────────────────────────────┼─────────────┘  some formats also
+          │                               │                indexed separately
+          ▼                               ▼
+        save as                         save as
+        RawDatum                        NormalizedData
+```
+
+## Code map
+
+A brief look at important areas of code as they happen to exist now.
+
+### Static configuration
+
+`share/schema/` describes the "normalized" metadata schema/format that all
+metadata records are converted into when ingested.
+
+`share/sources/` describes a starting set of metadata sources that the system
+could harvest metadata from -- these will be put in the database and can be
+updated or added to over time.
+
+`project/settings.py` describes system-level settings which can be set by
+environment variables (and their default values), as well as settings
+which cannot.
+
+`share/models/` describes the data layer using the [Django](https://www.djangoproject.com/) ORM.
+
+`share/subjects.yaml` describes the "central taxonomy" of subjects allowed
+in `Subject.name` fields of `NormalizedData`.
+
+### Harvest and ingest
+
+`share/harvest/` and `share/harvesters/` describe how metadata records
+are pulled from other metadata repositories.
+
+`share/transform/` and `share/transformers/` describe how raw data (possibly
+in any format) are transformed to the "normalized" schema.
+
+`share/regulate/` describes rules which are applied to every normalized datum,
+regardless of where or what format it originally came from.
+
+`share/metadata_formats/` describes how a normalized datum can be formatted
+into any supported output format.
+
+`share/tasks/` runs the harvest/ingest pipeline and stores each task's status
+(including debugging info, if errored) as a `HarvestJob` or `IngestJob`.
+
+### Outward-facing views
+
+`share/search/` describes how the search indexes are structured, managed, and
+updated when new metadata records are introduced -- this provides a view for
+discovering items by arbitrary search criteria.
+
+`share/oaipmh/` describes the [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html)
+view for harvesting metadata from SHARE/Trove in bulk.
+
+`api/` describes a mostly REST-ful API that's useful for inspecting records for
+a specific item of interest.
+
+### Internals
+
+`share/admin/` is a Django app for administrative access to the SHARE database
+and pipeline logs.
+
+`osf_oauth2_adapter/` is a Django app to support logging in to SHARE via OSF.
+
+### Testing
+
+`tests/` are tests.
+
+## Cross-cutting concerns
+
+### Immutable metadata
+
+Metadata records at all stages of the pipeline (`RawDatum`, `NormalizedData`,
+`FormattedMetadataRecord`) should be considered immutable -- any updates 
+result in a new record being created, not an old record being altered.
+
+Multiple records which describe the same item/object are grouped by a
+"source-unique identifier" or "suid" -- essentially a two-tuple
+`(source, identifier)` that uniquely and persistently identifies an item in
+the source repository. In most outward-facing views, default to showing only
+the most recent record for each suid.
+
+## Why this?
+inspired by [this writeup](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html)
+and [this example architecture document](https://github.com/rust-analyzer/rust-analyzer/blob/d7c99931d05e3723d878bea5dc26766791fa4e69/docs/dev/architecture.md)
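The "Immutable metadata" section of ARCHITECTURE.md above can be illustrated with a minimal sketch. It is not taken from the SHARE/Trove codebase: `Record`, `RecordStore`, and their methods are hypothetical names, and an in-memory dictionary stands in for the Django models. It only shows the idea that an update appends a new record under the same `(source, identifier)` suid, and that a view would default to the most recent one.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple

# A "suid" as described in the doc: a (source, identifier) two-tuple.
Suid = Tuple[str, str]


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Record:
    """Hypothetical stand-in for RawDatum / NormalizedData / FormattedMetadataRecord."""
    suid: Suid
    payload: str
    created: datetime


class RecordStore:
    """Hypothetical append-only store; not the project's actual data layer."""

    def __init__(self) -> None:
        self._by_suid: Dict[Suid, List[Record]] = {}

    def add(self, source: str, identifier: str, payload: str) -> Record:
        # An "update" never alters an old record; it appends a new one.
        record = Record((source, identifier), payload, datetime.now(timezone.utc))
        self._by_suid.setdefault(record.suid, []).append(record)
        return record

    def latest(self, source: str, identifier: str) -> Optional[Record]:
        # What an outward-facing view would show by default.
        records = self._by_suid.get((source, identifier))
        return records[-1] if records else None

    def history(self, source: str, identifier: str) -> List[Record]:
        # Every record ever stored for this suid, oldest first.
        return list(self._by_suid.get((source, identifier), []))


store = RecordStore()
first = store.add("example-source", "item-1", "title: Draft")
store.add("example-source", "item-1", "title: Final")
print(store.latest("example-source", "item-1").payload)
print(len(store.history("example-source", "item-1")))
```

Both versions stay in the history, and `first` still holds its original payload after the second `add`.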
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,71 +1,7 @@
 # CONTRIBUTING
 
-## Style Guide
-
-In the following templates, `TYPE` may be any of `Task`, `Bug`, `Feature`, `Improvement`, or `Quick`.
-
-### Commit Messages
-
-Commit messages should be formatted as:
-
-```
-[SHARE-###][TYPE] Brief description
-
-  * More details about the code changes.
-  * Formatted as a bulleted list
-  * If you have a really long line, wrap it
-    at 80 characters and line up with the first
-    letter, not the bullet point.
-```
-
-Here are some excellent commit messages, for reference.
-* https://github.com/CenterForOpenScience/SHARE/commit/0fe503f0dc5f90da366246086ae76ee5281843cf
-* https://github.com/CenterForOpenScience/SHARE/commit/226bac6a9010cde6aed7ac037c9186ac889b5132
-* https://github.com/CenterForOpenScience/SHARE/commit/0e02dbb9d06920623e0dfb6a32fd1b38771de74b
-
-### Pull Requests
-
-Titles should be formatted as `[SHARE-###][TYPE] Brief description`
-
-Here are some excellent pull requests, for reference.
-* https://github.com/CenterForOpenScience/SHARE/pull/658
-* https://github.com/CenterForOpenScience/SHARE/pull/642
-
-### Code
-
-#### Docstrings
-
-Python docstrings should follow the [Google docstring style guide](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html).
-
-To easily distinguish them, docstrings should use triple double-quotes, `"""`, and large strings should use triple single-quotes, `'''`
-
-## Reporting Issues
-
-If you find a bug in osf.io or would like to propose a new feature, please file an issue report in CenterForOpenScience/osf.io. Below we have some information on how to best report the issue, but if you’re short on time or new to this, don’t worry! We really want to know about the problem, so go ahead and report it. If you do this a lot, or you just want to know how to make it easier for us to find and fix the problem, keep reading.
-
-If you would like to report a security issue, please email contact@cos.io for instructions on how to report the security issue. Do not include details of the issue in that email.
-
-### Quick link
-[Submit an issue](https://github.com/CenterForOpenScience/SHARE/issues/new?body=Steps%0A-------%0A1.%20%0A%0AExpected%0A------------%0A%0AActual%0A--------%0A)
-using that link and you will have a handy template to save you a little time in your issue reporting.
-
-### How to make the best issue
---------------------------
-
-First, please make sure that the issue has not already been reported by searching through the issue archives.
-
-When submitting an issue, be as descriptive as possible:
-* What you did (step by step)
-    * Where does this happen on SHARE?
-* What you expected
-* What actually happened
-    * Check the JavaScript console in the browser (e.g. In Chrome go to View → Developer → JavaScript console) and report errors
-    * If it's an issue with staging, report whether or not it also occurs on production
-    * If an error was generated, report what time it occurred, and the specific URL.
-* Potential causes
-* Suggest a solution
-    * What will it look like when this issue is resolved?
-
-Include pictures (e.g., in OSX press Cmd+Shift+4 to draw a box to screenshot)
-
+TODO: how do we want to guide community contributors?
 
+For now, if you're interested in contributing to SHARE/Trove, feel free to
+[open an issue on github](https://github.com/CenterForOpenScience/SHARE/issues)
+and start a conversation.
diff --git a/README.md b/README.md
@@ -1,93 +1,26 @@
-# SHARE v2
+# SHARE/Trove
 
 SHARE is creating a free, open dataset of research (meta)data.
 
 > **Note**: SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews.
 
 [![Coverage Status](https://coveralls.io/repos/github/CenterForOpenScience/SHARE/badge.svg?branch=develop)](https://coveralls.io/github/CenterForOpenScience/SHARE?branch=develop)
-[![Gitter](https://badges.gitter.im/CenterForOpenScience/SHARE.svg)](https://gitter.im/CenterForOpenScience/SHARE)
 
-## Technical Documentation
+## Documentation
 
-http://share-research.readthedocs.io/en/latest/index.html
+### What is this?
+see [WHAT-IS-THIS-EVEN.md](./WHAT-IS-THIS-EVEN.md)
 
+### How can I use it?
+see [how-to/use-the-api.md](./how-to/use-the-api.md)
 
-## On the OSF
+### How do I navigate this codebase?
+see [ARCHITECTURE.md](./ARCHITECTURE.md)
 
-https://osf.io/sdxvj/
+### How do I run a copy locally?
+see [how-to/run-locally.md](./how-to/run-locally.md)
 
 
-## Get involved
-
-We'll be expanding this section in the near future, but, beyond using our API for your own purposes, harvesters are a great way to get started. You can find a few that we have in our list [here](https://github.com/CenterForOpenScience/SHARE/issues/510).
-
-## Setup for testing
-It is useful to set up a [virtual environment](http://virtualenvwrapper.readthedocs.io/en/latest/install.html) to ensure [python3](https://www.python.org/downloads/) is your designated version of python and make the python requirements specific to this project.
-
-    mkvirtualenv share -p `which python3.6`
-    workon share
-
-Once in the `share` virtual environment, install the necessary requirements, then setup SHARE.
-
-    pip install -Ur requirements.txt
-    python setup.py develop
-    pyenv rehash  # Only necessary when using pyenv to manage virtual environments
-
-`docker-compose` assumes [Docker](https://www.docker.com/) is installed and running. Running `./bootstrap.sh` will create and provision the database. If there are any SHARE containers running, make sure to stop them before bootstrapping using `docker-compose stop`.
-
-    docker-compose build web
-    docker-compose run --rm web ./bootstrap.sh
-
-## Run
-Run the API server
-
-    # In docker
-    docker-compose up -d web
-
-    # Locally
-    sharectl server
-
-Setup Elasticsearch
-
-    sharectl search setup
-
-Run Celery
-
-    # In docker
-    docker-compose up -d worker
-
-    # Locally
-    sharectl worker -B
-
-## Populate with data
-This is particularly applicable to running [ember-share](https://github.com/CenterForOpenScience/ember-share), an interface for SHARE.
-
-Harvest data from providers, for example
-
-    sharectl harvest com.nature
-    sharectl harvest com.peerj.preprints
-
-    # Harvests may be scheduled to run asynchronously using the schedule command
-    sharectl schedule org.biorxiv.html
-
-    # Some sources provide thousands of records per day
-    # --limit can be used to set a maximum number of records to gather
-    sharectl harvest org.crossref --limit 250
-
-If the Celery worker is running, new data will automatically be indexed every couple minutes.
-
-Alternatively, data may be explicitly indexed using `sharectl`
-
-    sharectl search
-    # Forcefully re-index all data
-    sharectl search --all
-
-## Building docs
-
-    cd docs/
-    pip install -r requirements.txt
-    make watch
-
 ## Running Tests
 
 ### Unit test suite

diff --git a/WHAT-IS-THIS-EVEN.md b/WHAT-IS-THIS-EVEN.md
@@ -0,0 +1,42 @@
+# "What is this, even?"
+
+Imagine a vast, public library full of the outputs and results of some scientific
+research -- shelves full of articles, preprints, datasets, data analysis plans,
+and so on.
+
+You can think of SHARE/Trove as that library's card catalog.
+
+## "...What is a card catalog?"
+
+A [card catalog](https://en.wikipedia.org/wiki/Card_catalog) is that weird, cool cabinet you might see at the front of a
+library with a bunch of tiny drawers full of index cards -- each index card
+contains information about some item on the library shelves.
+
+The card catalog is where you go when you want to:
+- locate a specific item in the library
+- discover items related to a specific topic, author, or other keywords
+- make a new item easily discoverable by others
+
+## "OK but what 'library' is this?"
+As of July 2021, SHARE/Trove contains metadata on over 4.5 million items originating from:
+- [OSF](https://osf.io) (including OSF-hosted Registries and Preprint Providers)
+- [REPEC](http://repec.org)
+- [arXiv](https://arxiv.org)
+- [ClinicalTrials.gov](https://clinicaltrials.gov)
+- ...and more!
+
+Updates from OSF are reflected within seconds, while updates from third-party sources are
+harvested once daily.
+
+## "How can I use it?"
+
+You can search the full SHARE/Trove catalog at
+[share.osf.io/discover](https://share.osf.io/discover).
+
+Other search pages can also be built on SHARE/Trove, showing only a specific
+collection of items. For example, [OSF Preprints](https://osf.io/preprints/discover)
+and [OSF Registries](https://osf.io/registries/discover) show only preprints
+and registrations, respectively, which are hosted on OSF infrastructure.
+
+To learn about using the API (instead of a user interface), see
+[USING-THE-API.md](./USING-THE-API.md)
diff --git a/bootstrap.sh b/bootstrap.sh