Remove 'btrix' from harvest names and commands
Why these changes are being introduced:
Using 'harvest' reinforces that this application is more than just a browsertrix
web crawl; it also includes a metadata parsing process. This naming is also more
in line with the OAI-PMH harvester's conventions.
ghukill committed Sep 28, 2023
1 parent fe2102c commit aa0f7a2
Showing 22 changed files with 107 additions and 107 deletions.
14 changes: 7 additions & 7 deletions Dockerfile
@@ -26,14 +26,14 @@ RUN pip3 install --upgrade pip \

# NOTE: /app is already used by browsertrix-crawler
# Setup python virtual environment
WORKDIR /btrixharvest
COPY Pipfile /btrixharvest/Pipfile
WORKDIR /browsertrix-harvester
COPY Pipfile /browsertrix-harvester/Pipfile
RUN pipenv install --python 3.11

# Copy full browsertrix-harvester app
COPY pyproject.toml /btrixharvest/
COPY docker-entrypoint.sh /btrixharvest/
COPY browsertrix_harvester/ /btrixharvest/browsertrix_harvester/
COPY tests/ /btrixharvest/tests/
COPY pyproject.toml /browsertrix-harvester/
COPY docker-entrypoint.sh /browsertrix-harvester/
COPY harvester/ /browsertrix-harvester/harvester/
COPY tests/ /browsertrix-harvester/tests/

ENTRYPOINT ["/btrixharvest/docker-entrypoint.sh"]
ENTRYPOINT ["/browsertrix-harvester/docker-entrypoint.sh"]
8 changes: 4 additions & 4 deletions Makefile
@@ -17,7 +17,7 @@ update: install ## Update all Python dependencies

### Test commands ###
test: ## Run tests and print a coverage report
pipenv run coverage run --source=browsertrix_harvester -m pytest -vv
pipenv run coverage run --source=harvester -m pytest -vv
pipenv run coverage report -m

coveralls: test
@@ -50,7 +50,7 @@ ruff-apply:

# CLI commands
shell:
pipenv run btrixharvest-dockerized shell
pipenv run harvest-dockerized shell

# Docker commands
build-docker:
@@ -59,9 +59,9 @@ build-docker:
# Test crawl commands
# local docker container crawl
test-harvest-local:
pipenv run btrixharvest-dockerized --verbose harvest \
pipenv run harvest-dockerized --verbose harvest \
--crawl-name="homepage" \
--config-yaml-file="/btrixharvest/tests/fixtures/lib-website-homepage.yaml" \
--config-yaml-file="/browsertrix-harvester/tests/fixtures/lib-website-homepage.yaml" \
--metadata-output-file="/crawls/collections/homepage/homepage.xml" \
--num-workers 4 \
--btrix-args-json='{"--maxPageLimit":"15"}'
4 changes: 2 additions & 2 deletions Pipfile
@@ -30,5 +30,5 @@ pandas-stubs = "*"
python_version = "3.11"

[scripts]
btrixharvest = "python -c \"from browsertrix_harvester.cli import main; main()\""
btrixharvest-dockerized = "docker run -it -v $PWD/output/crawls:/crawls browsertrix-harvester-dev:latest"
harvest = "python -c \"from harvester.cli import main; main()\""
harvest-dockerized = "docker run -it -v $PWD/output/crawls:/crawls browsertrix-harvester-dev:latest"
26 changes: 13 additions & 13 deletions README.md
@@ -8,7 +8,7 @@ See [architecture docs](docs/architecture.md).

## Development

**NOTE**: When performing web crawls, this application invokes browsertrix-crawler. While theoretically possible to install browsertrix-crawler on your local machine as a callable binary, this application is oriented around running only inside of a Docker container where it is already installed. For this reason, the pipenv convenience command `btrixharvest-dockerized` has been created (more on this below).
**NOTE**: When performing web crawls, this application invokes browsertrix-crawler. While theoretically possible to install browsertrix-crawler on your local machine as a callable binary, this application is oriented around running only inside of a Docker container where it is already installed. For this reason, the pipenv convenience command `harvest-dockerized` has been created (more on this below).
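
For reference, per the `[scripts]` section of the Pipfile in this commit, `harvest-dockerized` expands to roughly the following (the image tag and volume mount are taken verbatim from that Pipfile entry):

```shell
# what `pipenv run harvest-dockerized --help` runs under the hood,
# per the Pipfile [scripts] entry in this commit
docker run -it \
  -v $PWD/output/crawls:/crawls \
  browsertrix-harvester-dev:latest \
  --help
```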

### Build Application

@@ -20,11 +20,11 @@ See [architecture docs](docs/architecture.md).
- Build docker image: `make build-docker`
- builds local image `browsertrix-harvester-dev:latest`
- To run the app:
- Non-Dockerized: `pipenv run btrixharvest --help`
- Non-Dockerized: `pipenv run harvest --help`
- works locally for many things, but will throw error for actions that perform crawls
- Dockerized: `pipenv run btrixharvest-dockerized --help`
- Dockerized: `pipenv run harvest-dockerized --help`
- provides full functionality by running as a docker container
- points back to the pipenv command `btrixharvest`
- points back to the pipenv command `harvest`

### Testing and Linting

@@ -44,7 +44,7 @@ SENTRY_DSN=None # If set to a valid Sentry DSN, enables Sentry exception monitori

### Main
```shell
pipenv run btrixharvest
pipenv run harvest
```

```text
@@ -60,7 +60,7 @@ Commands:

### Shell environment
```shell
pipenv run btrixharvest shell
pipenv run harvest shell
```
```text
Usage: -c shell [OPTIONS]
@@ -73,7 +73,7 @@ Options:

### Parse URL content from crawl
```shell
pipenv run btrixharvest parse-url-content
pipenv run harvest parse-url-content
```
```text
Usage: -c parse-url-content [OPTIONS]
@@ -96,7 +96,7 @@ This is the primary command for this application. This performs a web crawl, th
**NOTE:** if neither `--wacz-output-file` nor `--metadata-output-file` is set, a crawl will be performed, but nothing will persist outside the container after it completes.

```shell
pipenv run btrixharvest-dockerized harvest
pipenv run harvest-dockerized harvest
```
```text
Usage: -c harvest [OPTIONS]
@@ -130,11 +130,11 @@ Options:
#### Configuration YAML

There are a couple of options for providing a file for the required `--config-yaml-file` argument:
* 1- add to, or reuse files from, the local directory `browsertrix_harvester/crawl_configs`
* on image rebuild, this file will be available in the container at `/btrixharvest/browsertrix_harvester/crawl_configs`
* 1- add to, or reuse files from, the local directory `harvester/crawl_configs`
  * on image rebuild, this file will be available in the container at `/browsertrix-harvester/harvester/crawl_configs`
* 2- provide an S3 file URI

At the time of harvest, for either local or remote files, the application copies the provided file to `/btrixharvest/crawl-config.yaml` inside the container.
At the time of harvest, for either local or remote files, the application copies the provided file to `/browsertrix-harvester/crawl-config.yaml` inside the container.
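
A minimal sketch of that copy step, assuming `smart_open` (already imported by the CLI) is used to read both local paths and `s3://` URIs; the helper name is illustrative, not the application's actual function:

```python
import shutil

import smart_open  # reads local paths and s3:// URIs through one interface

CONTAINER_CONFIG_PATH = "/browsertrix-harvester/crawl-config.yaml"


def copy_config_yaml(config_yaml_file: str) -> None:
    """Copy a local or S3-hosted crawl config into the container."""
    with smart_open.open(config_yaml_file, "rb") as source, open(
        CONTAINER_CONFIG_PATH, "wb"
    ) as target:
        shutil.copyfileobj(source, target)
```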

## Extracted Metadata

@@ -201,8 +201,8 @@ make build-docker
```shell
make test-harvest-local
```
* Performs a crawl using the config YAML `/btrixharvest/tests/fixtures/lib-website-homepage.yaml`
* Metadata is written to `/crawls/collections/homepage/homepage.xml` in the container, which is mounted and available in the local `output/` folder
* Performs a crawl using the in-container config YAML `/browsertrix-harvester/tests/fixtures/lib-website-homepage.yaml`
* Metadata is written to `/crawls/collections/homepage/homepage.xml` in the container, which is mounted and available in the local `output/` folder

### Remote Test Crawl

2 changes: 1 addition & 1 deletion bin/test-harvest-ecs.sh
@@ -10,4 +10,4 @@ aws ecs run-task \
--launch-type="FARGATE" \
--region us-east-1 \
--network-configuration '{"awsvpcConfiguration": {"subnets": ["subnet-0488e4996ddc8365b","subnet-022e9ea19f5f93e65"], "securityGroups": ["sg-044033bf5f102c544"]}}' \
--overrides '{"containerOverrides": [ {"name":"broswertrix-harvester", "command": ["--verbose", "harvest", "--crawl-name", "'"$CRAWL_NAME"'", "--config-yaml-file", "/btrixharvest/tests/fixtures/lib-website-homepage.yaml", "--metadata-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'"$CRAWL_NAME"'.xml", "--wacz-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'"$CRAWL_NAME"'.wacz", "--num-workers", "2"]}]}'
--overrides '{"containerOverrides": [ {"name":"browsertrix-harvester", "command": ["--verbose", "harvest", "--crawl-name", "'"$CRAWL_NAME"'", "--config-yaml-file", "/browsertrix-harvester/tests/fixtures/lib-website-homepage.yaml", "--metadata-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'"$CRAWL_NAME"'.xml", "--wacz-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'"$CRAWL_NAME"'.wacz", "--num-workers", "2"]}]}'
1 change: 0 additions & 1 deletion browsertrix_harvester/__init__.py

This file was deleted.

51 changes: 0 additions & 51 deletions browsertrix_harvester/crawl_configs/lib-website-config.yaml

This file was deleted.

2 changes: 1 addition & 1 deletion docs/architecture.md
@@ -13,7 +13,7 @@ NOTE: this is different from other python CLI apps, which generally use `python:

## Web Crawls

Any actions that trigger a browsertrix web crawl will not work outside of a container context. A decorator `browsertrix_harvester.utils.require_container` has been created that can be used to decorate functions or methods that should not run outside of a container context. This decorator looks for EITHER of the following conditions to be true:
Any actions that trigger a browsertrix web crawl will not work outside of a container context. A decorator `harvester.utils.require_container` has been created that can be used to decorate functions or methods that should not run outside of a container context. This decorator looks for EITHER of the following conditions to be true:
* the file `/.dockerenv` exists; indicates locally running container
* the env var `AWS_EXECUTION_ENV` is set; indicates Fargate ECS task
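
A minimal sketch of what such a decorator could look like, consistent with the imports shown in `harvester/utils.py` later in this diff (the actual implementation may differ):

```python
import functools
import os
from collections.abc import Callable
from typing import Any, TypeVar

from harvester.exceptions import RequiresContainerContextError

ReturnType = TypeVar("ReturnType")


def require_container(func: Callable[..., ReturnType]) -> Callable[..., ReturnType]:
    """Block calls to the decorated function outside a container context."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> ReturnType:
        in_docker = os.path.exists("/.dockerenv")  # locally running container
        in_ecs = os.getenv("AWS_EXECUTION_ENV") is not None  # Fargate ECS task
        if not (in_docker or in_ecs):
            raise RequiresContainerContextError
        return func(*args, **kwargs)

    return wrapper
```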

1 change: 1 addition & 0 deletions harvester/__init__.py
@@ -0,0 +1 @@
"""harvester package."""
8 changes: 4 additions & 4 deletions browsertrix_harvester/cli.py → harvester/cli.py
@@ -1,4 +1,4 @@
"""browsertrix_harvester.cli"""
"""harvester.cli"""
# ruff: noqa: FBT001, ARG001

import logging
@@ -9,9 +9,9 @@
import click
import smart_open # type: ignore[import]

from browsertrix_harvester.config import configure_logger, configure_sentry
from browsertrix_harvester.crawl import Crawler
from browsertrix_harvester.parse import CrawlParser
from harvester.config import configure_logger, configure_sentry
from harvester.crawl import Crawler
from harvester.parse import CrawlParser

logger = logging.getLogger(__name__)

2 changes: 1 addition & 1 deletion browsertrix_harvester/config.py → harvester/config.py
@@ -15,7 +15,7 @@ def configure_logger(logger: logging.Logger, verbose: bool) -> str:
)
logger.setLevel(logging.DEBUG)
for handler in logging.root.handlers:
handler.addFilter(logging.Filter("browsertrix_harvester"))
handler.addFilter(logging.Filter("harvester"))
else:
logging.basicConfig(
format="%(asctime)s %(levelname)s %(name)s.%(funcName)s(): %(message)s"
8 changes: 4 additions & 4 deletions browsertrix_harvester/crawl.py → harvester/crawl.py
@@ -1,4 +1,4 @@
"""browsertrix_harvester.crawl"""
"""harvester.crawl"""

import json
import logging
@@ -8,16 +8,16 @@

import smart_open # type: ignore[import]

from browsertrix_harvester.exceptions import ConfigYamlError
from browsertrix_harvester.utils import require_container
from harvester.exceptions import ConfigYamlError
from harvester.utils import require_container

logger = logging.getLogger(__name__)


class Crawler:
"""Class that manages browsertrix crawls."""

DOCKER_CONTAINER_CONFIG_YAML_FILEPATH = "/btrixharvest/crawl-config.yaml"
DOCKER_CONTAINER_CONFIG_YAML_FILEPATH = "/browsertrix-harvester/crawl-config.yaml"

# ruff: noqa: FBT001, FBT002
def __init__(
51 changes: 51 additions & 0 deletions harvester/crawl_configs/lib-website-config.yaml
@@ -0,0 +1,51 @@
generateCDX: true
generateWACZ: true
text: true
# prevent PAGES from getting crawled; scoping
exclude:
- ".*lib.mit.edu/search/.*"
- ".*mit.primo.exlibrisgroup.com/.*"
# prevent RESOURCES / ASSETS from getting retrieved; URL requests
blockRules:
- ".*googlevideo.com.*"
- ".*cdn.pw-60-mitlib-wp-network.pantheonsite.io/media/.*"
- "\\.(jpg|png)$"
depth: 1
scopeType: "domain"
seeds:
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/exhibits/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/exhibits/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/mithistory/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/mithistory/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/news/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/news/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/scholarly/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/scholarly/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/150books/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/150books/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/giving/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/giving/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/music-oral-history/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/music-oral-history/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/docs/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/docs/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/data-management/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/data-management/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/akdc/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/akdc/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/mit-reads/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/mit-reads/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/pomeroy/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/pomeroy/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/mit-and-slavery/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/mit-and-slavery/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/creos/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/creos/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/distinctive-collections/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/distinctive-collections/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/about/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/about/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/opendata/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/opendata/sitemap.xml
File renamed without changes.
4 changes: 2 additions & 2 deletions browsertrix_harvester/parse.py → harvester/parse.py
@@ -1,4 +1,4 @@
"""browsertrix_harvester.parse"""
"""harvester.parse"""
# ruff: noqa: N813

import gzip
@@ -19,7 +19,7 @@
from warcio.recordloader import ArcWarcRecord # type: ignore[import]
from yake import KeywordExtractor # type: ignore[import]

from browsertrix_harvester.exceptions import WaczFileDoesNotExist
from harvester.exceptions import WaczFileDoesNotExist

logger = logging.getLogger(__name__)

4 changes: 2 additions & 2 deletions browsertrix_harvester/utils.py → harvester/utils.py
@@ -1,11 +1,11 @@
"""browsertrix_harvester.utils"""
"""harvester.utils"""
# ruff: noqa: ANN401

import os
from collections.abc import Callable
from typing import Any, TypeVar

from browsertrix_harvester.exceptions import RequiresContainerContextError
from harvester.exceptions import RequiresContainerContextError

ReturnType = TypeVar("ReturnType")

4 changes: 2 additions & 2 deletions tests/conftest.py
@@ -4,8 +4,8 @@
import pytest
from click.testing import CliRunner

from browsertrix_harvester.crawl import Crawler
from browsertrix_harvester.parse import CrawlParser
from harvester.crawl import Crawler
from harvester.parse import CrawlParser


@pytest.fixture(autouse=True)
6 changes: 3 additions & 3 deletions tests/fixtures/lib-website-homepage.yaml
@@ -8,12 +8,12 @@ exclude:
# prevent RESOURCES / ASSETS from getting retrieved; URL requests
blockRules:
- ".*googlevideo.com.*"
- ".*cdn.libraries.mit.edu/media/.*"
- ".*cdn.pw-60-mitlib-wp-network.pantheonsite.io/media/.*"
- "\\.(jpg|png)$"
depth: 1
maxPageLimit: 20
timeout: 10
scopeType: "domain"
seeds:
- url: https://libraries.mit.edu/sitemap.xml
sitemap: https://libraries.mit.edu/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml