Remove 'btrix' from harvest names and commands
Why these changes are being introduced:
Using 'harvest' reinforces that this application is more than just a browsertrix
web crawl; it also includes a metadata parsing process. This naming is also more
in line with the OAI-PMH harvester's conventions.
ghukill committed Sep 28, 2023
1 parent fe2102c commit aa0f7a2
Showing 22 changed files with 107 additions and 107 deletions.
14 changes: 7 additions & 7 deletions Dockerfile
@@ -26,14 +26,14 @@ RUN pip3 install --upgrade pip \

# NOTE: /app is already used by browsertrix-crawler
# Setup python virtual environment
WORKDIR /btrixharvest
COPY Pipfile /btrixharvest/Pipfile
WORKDIR /browsertrix-harvester
COPY Pipfile /browsertrix-harvester/Pipfile
RUN pipenv install --python 3.11

# Copy full browsertrix-harvester app
COPY pyproject.toml /btrixharvest/
COPY docker-entrypoint.sh /btrixharvest/
COPY browsertrix_harvester/ /btrixharvest/browsertrix_harvester/
COPY tests/ /btrixharvest/tests/
COPY pyproject.toml /browsertrix-harvester/
COPY docker-entrypoint.sh /browsertrix-harvester/
COPY harvester/ /browsertrix-harvester/harvester/
COPY tests/ /browsertrix-harvester/tests/

ENTRYPOINT ["/btrixharvest/docker-entrypoint.sh"]
ENTRYPOINT ["/browsertrix-harvester/docker-entrypoint.sh"]
8 changes: 4 additions & 4 deletions Makefile
@@ -17,7 +17,7 @@ update: install ## Update all Python dependencies

### Test commands ###
test: ## Run tests and print a coverage report
pipenv run coverage run --source=browsertrix_harvester -m pytest -vv
pipenv run coverage run --source=harvester -m pytest -vv
pipenv run coverage report -m

coveralls: test
@@ -50,7 +50,7 @@ ruff-apply:

# CLI commands
shell:
pipenv run btrixharvest-dockerized shell
pipenv run harvest-dockerized shell

# Docker commands
build-docker:
@@ -59,9 +59,9 @@ build-docker:
# Test crawl commands
# local docker container crawl
test-harvest-local:
pipenv run btrixharvest-dockerized --verbose harvest \
pipenv run harvest-dockerized --verbose harvest \
--crawl-name="homepage" \
--config-yaml-file="/btrixharvest/tests/fixtures/lib-website-homepage.yaml" \
--config-yaml-file="/browsertrix-harvester/tests/fixtures/lib-website-homepage.yaml" \
--metadata-output-file="/crawls/collections/homepage/homepage.xml" \
--num-workers 4 \
--btrix-args-json='{"--maxPageLimit":"15"}'
4 changes: 2 additions & 2 deletions Pipfile
@@ -30,5 +30,5 @@ pandas-stubs = "*"
python_version = "3.11"

[scripts]
btrixharvest = "python -c \"from browsertrix_harvester.cli import main; main()\""
btrixharvest-dockerized = "docker run -it -v $PWD/output/crawls:/crawls browsertrix-harvester-dev:latest"
harvest = "python -c \"from harvester.cli import main; main()\""
harvest-dockerized = "docker run -it -v $PWD/output/crawls:/crawls browsertrix-harvester-dev:latest"
26 changes: 13 additions & 13 deletions README.md
@@ -8,7 +8,7 @@ See [architecture docs](docs/architecture.md).

## Development

**NOTE**: When performing web crawls, this application invokes browsertrix-crawler. While theoretically possible to install browsertrix-crawler on your local machine as a callable binary, this application is oriented around running only inside of a Docker container where it is already installed. For this reason, the pipenv convenience command `btrixharvest-dockerized` has been created (more on this below).
**NOTE**: When performing web crawls, this application invokes browsertrix-crawler. While theoretically possible to install browsertrix-crawler on your local machine as a callable binary, this application is oriented around running only inside of a Docker container where it is already installed. For this reason, the pipenv convenience command `harvest-dockerized` has been created (more on this below).
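
For reference, per the `[scripts]` section of the Pipfile in this commit, `harvest-dockerized` expands to roughly the following (the image tag and volume mount are taken verbatim from that Pipfile entry):

```shell
# what `pipenv run harvest-dockerized --help` runs under the hood,
# per the Pipfile [scripts] entry in this commit
docker run -it \
  -v $PWD/output/crawls:/crawls \
  browsertrix-harvester-dev:latest \
  --help
```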

### Build Application

@@ -20,11 +20,11 @@ See [architecture docs](docs/architecture.md).
- Build docker image: `make build-docker`
- builds local image `browsertrix-harvester-dev:latest`
- To run the app:
- Non-Dockerized: `pipenv run btrixharvest --help`
- Non-Dockerized: `pipenv run harvest --help`
- works locally for many things, but will throw error for actions that perform crawls
- Dockerized: `pipenv run btrixharvest-dockerized --help`
- Dockerized: `pipenv run harvest-dockerized --help`
- provides full functionality by running as a docker container
- points back to the pipenv command `btrixharvest`
- points back to the pipenv command `harvest`

### Testing and Linting

@@ -44,7 +44,7 @@ SENTRY_DSN=None # If set to a valid Sentry DSN, enables Sentry exception monitori

### Main
```shell
pipenv run btrixharvest
pipenv run harvest
```

```text
@@ -60,7 +60,7 @@ Commands:

### Shell environment
```shell
pipenv run btrixharvest shell
pipenv run harvest shell
```
```text
Usage: -c shell [OPTIONS]
@@ -73,7 +73,7 @@ Options:

### Parse URL content from crawl
```shell
pipenv run btrixharvest parse-url-content
pipenv run harvest parse-url-content
```
```text
Usage: -c parse-url-content [OPTIONS]
@@ -96,7 +96,7 @@ This is the primary command for this application. This performs a web crawl, th
**NOTE:** if neither `--wacz-output-file` nor `--metadata-output-file` is set, a crawl will be performed, but nothing will persist outside the container after it completes.

```shell
pipenv run btrixharvest-dockerized harvest
pipenv run harvest-dockerized harvest
```
```text
Usage: -c harvest [OPTIONS]
@@ -130,11 +130,11 @@ Options:
#### Configuration YAML

There are a couple of options for providing a file for the required `--config-yaml-file` argument:
* 1- add to, or reuse files from, the local directory `browsertrix_harvester/crawl_configs`
* on image rebuild, this file will be available in the container at `/btrixharvest/browsertrix_harvester/crawl_configs`
* 1- add to, or reuse files from, the local directory `harvester/crawl_configs`
  * on image rebuild, this file will be available in the container at `/browsertrix-harvester/harvester/crawl_configs`
* 2- provide an S3 file URI

At the time of harvest, for either local or remote files, the application copies the provided file to `/btrixharvest/crawl-config.yaml` inside the container.
At the time of harvest, for either local or remote files, the application copies the provided file to `/browsertrix-harvester/crawl-config.yaml` inside the container.
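
A minimal sketch of that copy step, assuming `smart_open` (already imported by the CLI) is used to read both local paths and `s3://` URIs; the helper name is illustrative, not the application's actual function:

```python
import shutil

import smart_open  # reads local paths and s3:// URIs through one interface

CONTAINER_CONFIG_PATH = "/browsertrix-harvester/crawl-config.yaml"


def copy_config_yaml(config_yaml_file: str) -> None:
    """Copy a local or S3-hosted crawl config into the container."""
    with smart_open.open(config_yaml_file, "rb") as source, open(
        CONTAINER_CONFIG_PATH, "wb"
    ) as target:
        shutil.copyfileobj(source, target)
```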

## Extracted Metadata

@@ -201,8 +201,8 @@ make build-docker
```shell
make test-harvest-local
```
* Performs a crawl using the config YAML `/btrixharvest/tests/fixtures/lib-website-homepage.yaml`
* Metadata is written to `/crawls/collections/homepage/homepage.xml` in the container, which is mounted and available in the local `output/` folder
* Performs a crawl using the in-container config YAML `/browsertrix-harvester/tests/fixtures/lib-website-homepage.yaml`
* Metadata is written to `/crawls/collections/homepage/homepage.xml` in the container, which is mounted and available in the local `output/` folder

### Remote Test Crawl

2 changes: 1 addition & 1 deletion bin/test-harvest-ecs.sh
@@ -10,4 +10,4 @@ aws ecs run-task \
--launch-type="FARGATE" \
--region us-east-1 \
--network-configuration '{"awsvpcConfiguration": {"subnets": ["subnet-0488e4996ddc8365b","subnet-022e9ea19f5f93e65"], "securityGroups": ["sg-044033bf5f102c544"]}}' \
--overrides '{"containerOverrides": [ {"name":"broswertrix-harvester", "command": ["--verbose", "harvest", "--crawl-name", "'"$CRAWL_NAME"'", "--config-yaml-file", "/btrixharvest/tests/fixtures/lib-website-homepage.yaml", "--metadata-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'"$CRAWL_NAME"'.xml", "--wacz-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'"$CRAWL_NAME"'.wacz", "--num-workers", "2"]}]}'
--overrides '{"containerOverrides": [ {"name":"browsertrix-harvester", "command": ["--verbose", "harvest", "--crawl-name", "'"$CRAWL_NAME"'", "--config-yaml-file", "/browsertrix-harvester/tests/fixtures/lib-website-homepage.yaml", "--metadata-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'"$CRAWL_NAME"'.xml", "--wacz-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'"$CRAWL_NAME"'.wacz", "--num-workers", "2"]}]}'
1 change: 0 additions & 1 deletion browsertrix_harvester/__init__.py

This file was deleted.

51 changes: 0 additions & 51 deletions browsertrix_harvester/crawl_configs/lib-website-config.yaml

This file was deleted.

2 changes: 1 addition & 1 deletion docs/architecture.md
@@ -13,7 +13,7 @@ NOTE: this is different from other python CLI apps, which generally use `python:

## Web Crawls

Any actions that trigger a browsertrix web crawl will not work outside of a container context. A decorator `browsertrix_harvester.utils.require_container` has been created that can be used to decorate functions or methods that should not run outside of a container context. This decorator looks for EITHER of the following conditions to be true:
Any actions that trigger a browsertrix web crawl will not work outside of a container context. A decorator `harvester.utils.require_container` has been created that can be used to decorate functions or methods that should not run outside of a container context. This decorator looks for EITHER of the following conditions to be true:
* the file `/.dockerenv` exists; indicates locally running container
* the env var `AWS_EXECUTION_ENV` is set; indicates Fargate ECS task
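
A minimal sketch of what such a decorator could look like, consistent with the imports shown in `harvester/utils.py` later in this diff (the actual implementation may differ):

```python
import functools
import os
from collections.abc import Callable
from typing import Any, TypeVar

from harvester.exceptions import RequiresContainerContextError

ReturnType = TypeVar("ReturnType")


def require_container(func: Callable[..., ReturnType]) -> Callable[..., ReturnType]:
    """Block calls to the decorated function outside a container context."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> ReturnType:
        in_docker = os.path.exists("/.dockerenv")  # locally running container
        in_ecs = os.getenv("AWS_EXECUTION_ENV") is not None  # Fargate ECS task
        if not (in_docker or in_ecs):
            raise RequiresContainerContextError
        return func(*args, **kwargs)

    return wrapper
```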

1 change: 1 addition & 0 deletions harvester/__init__.py
@@ -0,0 +1 @@
"""harvester package."""
8 changes: 4 additions & 4 deletions browsertrix_harvester/cli.py → harvester/cli.py
@@ -1,4 +1,4 @@
"""browsertrix_harvester.cli"""
"""harvester.cli"""
# ruff: noqa: FBT001, ARG001

import logging
@@ -9,9 +9,9 @@
import click
import smart_open # type: ignore[import]

from browsertrix_harvester.config import configure_logger, configure_sentry
from browsertrix_harvester.crawl import Crawler
from browsertrix_harvester.parse import CrawlParser
from harvester.config import configure_logger, configure_sentry
from harvester.crawl import Crawler
from harvester.parse import CrawlParser

logger = logging.getLogger(__name__)

2 changes: 1 addition & 1 deletion browsertrix_harvester/config.py → harvester/config.py
@@ -15,7 +15,7 @@ def configure_logger(logger: logging.Logger, verbose: bool) -> str:
)
logger.setLevel(logging.DEBUG)
for handler in logging.root.handlers:
handler.addFilter(logging.Filter("browsertrix_harvester"))
handler.addFilter(logging.Filter("harvester"))
else:
logging.basicConfig(
format="%(asctime)s %(levelname)s %(name)s.%(funcName)s(): %(message)s"
8 changes: 4 additions & 4 deletions browsertrix_harvester/crawl.py → harvester/crawl.py
@@ -1,4 +1,4 @@
"""browsertrix_harvester.crawl"""
"""harvester.crawl"""

import json
import logging
@@ -8,16 +8,16 @@

import smart_open # type: ignore[import]

from browsertrix_harvester.exceptions import ConfigYamlError
from browsertrix_harvester.utils import require_container
from harvester.exceptions import ConfigYamlError
from harvester.utils import require_container

logger = logging.getLogger(__name__)


class Crawler:
"""Class that manages browsertrix crawls."""

DOCKER_CONTAINER_CONFIG_YAML_FILEPATH = "/btrixharvest/crawl-config.yaml"
DOCKER_CONTAINER_CONFIG_YAML_FILEPATH = "/browsertrix-harvester/crawl-config.yaml"

# ruff: noqa: FBT001, FBT002
def __init__(
51 changes: 51 additions & 0 deletions harvester/crawl_configs/lib-website-config.yaml
@@ -0,0 +1,51 @@
generateCDX: true
generateWACZ: true
text: true
# prevent PAGES from getting crawled; scoping
exclude:
- ".*lib.mit.edu/search/.*"
- ".*mit.primo.exlibrisgroup.com/.*"
# prevent RESOURCES / ASSETS from getting retrieved; URL requests
blockRules:
- ".*googlevideo.com.*"
- ".*cdn.pw-60-mitlib-wp-network.pantheonsite.io/media/.*"
- "\\.(jpg|png)$"
depth: 1
scopeType: "domain"
seeds:
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/exhibits/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/exhibits/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/mithistory/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/mithistory/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/news/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/news/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/scholarly/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/scholarly/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/150books/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/150books/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/giving/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/giving/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/music-oral-history/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/music-oral-history/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/docs/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/docs/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/data-management/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/data-management/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/akdc/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/akdc/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/mit-reads/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/mit-reads/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/pomeroy/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/pomeroy/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/mit-and-slavery/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/mit-and-slavery/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/creos/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/creos/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/distinctive-collections/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/distinctive-collections/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/about/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/about/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/opendata/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/opendata/sitemap.xml
File renamed without changes.
4 changes: 2 additions & 2 deletions browsertrix_harvester/parse.py → harvester/parse.py
@@ -1,4 +1,4 @@
"""browsertrix_harvester.parse"""
"""harvester.parse"""
# ruff: noqa: N813

import gzip
@@ -19,7 +19,7 @@
from warcio.recordloader import ArcWarcRecord # type: ignore[import]
from yake import KeywordExtractor # type: ignore[import]

from browsertrix_harvester.exceptions import WaczFileDoesNotExist
from harvester.exceptions import WaczFileDoesNotExist

logger = logging.getLogger(__name__)

4 changes: 2 additions & 2 deletions browsertrix_harvester/utils.py → harvester/utils.py
@@ -1,11 +1,11 @@
"""browsertrix_harvester.utils"""
"""harvester.utils"""
# ruff: noqa: ANN401

import os
from collections.abc import Callable
from typing import Any, TypeVar

from browsertrix_harvester.exceptions import RequiresContainerContextError
from harvester.exceptions import RequiresContainerContextError

ReturnType = TypeVar("ReturnType")

4 changes: 2 additions & 2 deletions tests/conftest.py
@@ -4,8 +4,8 @@
import pytest
from click.testing import CliRunner

from browsertrix_harvester.crawl import Crawler
from browsertrix_harvester.parse import CrawlParser
from harvester.crawl import Crawler
from harvester.parse import CrawlParser


@pytest.fixture(autouse=True)
6 changes: 3 additions & 3 deletions tests/fixtures/lib-website-homepage.yaml
@@ -8,12 +8,12 @@ exclude:
# prevent RESOURCES / ASSETS from getting retrieved; URL requests
blockRules:
- ".*googlevideo.com.*"
- ".*cdn.libraries.mit.edu/media/.*"
- ".*cdn.pw-60-mitlib-wp-network.pantheonsite.io/media/.*"
- "\\.(jpg|png)$"
depth: 1
maxPageLimit: 20
timeout: 10
scopeType: "domain"
seeds:
- url: https://libraries.mit.edu/sitemap.xml
sitemap: https://libraries.mit.edu/sitemap.xml
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml