Water Use (#195)
* Update IngestAPI with changes to accommodate usage (#91)

* Improve logging configuration

* Update IngestAPI to accept and store allocation messages with usage

* Message format tweaks

* Make Metered Allocations nullable in API

* Add sourceId to ingested water allocation

* Update water allocation consumer (#92)

* Improve logging configuration

* Update IngestAPI to accept and store allocation messages with usage

* Message format tweaks

* Make Metered Allocations nullable in API

* Add sourceId to ingested water allocation

* Remove unused code

* Updated WaterAllocationMessage and Consumer to handle allocations by consent (rather than area)

* Fix typo

* Ensure Cache manifest for new PlanLimit queries is periodically updated

* Fix failing tests by truncating datetime before comparison to avoid precision differences.

* Update GeoJsonControllerIntegrationTest to reflect new plan limits controller, and remove redundant test

* Changes from PR review

* Ensure only Rainfall measurements are used in the Rainfall queries (#96)

* Manager - Kafka Consumer for Observations (#97)

* Manager - Hilltop Crawler Changes

* Shuffle files around

* martin/aggregate water use (#95)

* add water use daily aggregation sql view

* optimise sql query performance

* Adding Error handling to Observations Consumer

---------

Co-authored-by: Martin Peak <martin.peak@gw.govt.nz>

* Hilltop Crawler (#98)

* Manager - Hilltop Crawler Changes

* Shuffle files around

* Crawler - Updates

* Integration testing

* Improve DB naming

* Big refactor of test setup

* Correct time to wait between quiet polls

* Add global error handling around task

* Add check to make sure Virtual Measurement matches Measurement Name

* Add a bunch to the readme

* Add views to aggregate daily allocations and usage by area (#100)

* Add views to aggregate consent allocations and usage by day

* Start writing tests

* Allow easier setup and assertion against test data, start adding more tests

* Add tests for various allocation scenarios

* Add failing test for aggregating data across different areas

* Simplify views

* Splitting out the calculation of daily usage by area

* Add more assertions for aggregation of different areas

* water_allocation.meter should correlate with observation_site.name

* Format code

* Feedback from PR review

---------

Co-authored-by: Steve Mosley <steve@starter4ten.com>

* Remove unnecessary `yield`

* Adding index on next_fetch_at column

* Add some more readme details about the task queue

* Adding notes about partition sizes

* Improve note about how scheduling tasks works

---------

Co-authored-by: Vimal Jobanputra <vim@noodle.io>

* Fixup the path being passed to archive build reports (#99)

* Adding in sample data scripts and way to run from gradle (#101)

* Adding in sample data scripts and way to run from gradle

* Update so source id is unique per row

* Update to insert observation data for 50 sites

* Adding in a simple script to add time to the observed data

* Tweak how loading is done to avoid the function

* Update the refresh to be relative to current date

* Adding in transactions for the scripts

* Bump msw from 1.2.1 to 1.3.2 in /packages/PlanLimitsUI (#108)

Bumps [msw](https://github.com/mswjs/msw) from 1.2.1 to 1.3.2.
- [Release notes](https://github.com/mswjs/msw/releases)
- [Changelog](https://github.com/mswjs/msw/blob/main/CHANGELOG.md)
- [Commits](mswjs/msw@v1.2.1...v1.3.2)

---
updated-dependencies:
- dependency-name: msw
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump net.logstash.logback:logstash-logback-encoder (#107)

Bumps [net.logstash.logback:logstash-logback-encoder](https://github.com/logfellow/logstash-logback-encoder) from 7.3 to 7.4.
- [Release notes](https://github.com/logfellow/logstash-logback-encoder/releases)
- [Commits](logfellow/logstash-logback-encoder@logstash-logback-encoder-7.3...logstash-logback-encoder-7.4)

---
updated-dependencies:
- dependency-name: net.logstash.logback:logstash-logback-encoder
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add query and controller endpoint for usage data (#102)

* Weekly overview table (#124)

* Minor code cleanups/comments

* Move non-component files into lib, and move sidebar into a subdir

* Remove redundant files

* Extract Sidebar components

* Wire up basic loading mechanism

* Copy across button and indicator components from Interrain

* Extract hook to load and transform data, start formatting usage table

* Add GroundWater and populate tooltips with the right data

* Take into account WaterTakeFilter when showing usage table

* Clean up some naming

* Ensure WaterAllocationMessage messages match IngestAPI types

* Added fields to water usage API required by the front-end

* Camelise API properties

* Ensure usage dates are sorted

* Better handling of errors loading usage data

* Feedback from code review

* Moving WaterAllocation View to a Materialised View (#125)

* Moving WaterAllocation View to a Materialised View

* Remove spots

* Updates to allow water_allocation_and_usage_by_area to be refreshed

* Fix up permission issue with materialized view

* Turn effective_daily_consents back into a regular migration so we can ensure it is run before water_allocation_and_usage_by_area

* Add materialized_views_role definition to LocalInfrastructure init script

* Ensure materialized view is refreshed before running tests

* Ensure materialized_views_role has the same permissions as eop_manager_app_user

* Ensure view is materialized after generating test data

* Remove old version of DB config

- Has been superseded by LocalInfra folder

* Limit materialized views role to Read only

---------

Co-authored-by: Vim <vim@noodle.io>

* Update plan limit libs (#127)

* Update react-router-dom

* Update Vite and associated packages

* Update typescript and related packages, sort dev and main packages

* Update linting and formatting libs

And remove pretty-quick which is not compatible with prettier 3: prettier/pretty-quick#164

* Update testing libs

* Update autoprefixer, remove unused web-vitals

* Update tailwind, postcss and related libs

* Update data/state libraries

* npm audit fix

* Update react-map-gl and maplibre-gl

* Update MSW

* Moving the materialized view permissions into main migrations (#129)

* Move permissions required for materialized_views to migration

* Add RESET ROLE to script

- Otherwise the connection goes back to the pool with the wrong role,
  and causes pain for later threads.
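
As an illustration of the fix, a minimal sketch assuming Spring's `JdbcTemplate`; the function name is hypothetical, while the role and view names are the ones used in these commits:

```kotlin
import org.springframework.jdbc.core.JdbcTemplate

// Refresh the view under the dedicated role, then always RESET ROLE so the
// pooled connection is handed back clean; otherwise a later borrower of the
// same connection silently runs with the wrong role.
fun refreshUsageView(jdbc: JdbcTemplate) {
  try {
    jdbc.execute("SET ROLE materialized_views_role")
    jdbc.execute("REFRESH MATERIALIZED VIEW water_allocation_and_usage_by_area")
  } finally {
    jdbc.execute("RESET ROLE")
  }
}
```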

* Fix permissions so app can query usage view (#137)

* Config ErrorHandlingDeserializer to DLQ unparsable messages (#134)
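
A minimal sketch of that approach with Spring Kafka, as an illustration only (not the project's actual configuration; topic and type names are assumptions):

```kotlin
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.kafka.core.KafkaTemplate
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer
import org.springframework.kafka.listener.DefaultErrorHandler
import org.springframework.kafka.support.serializer.ErrorHandlingDeserializer
import org.springframework.kafka.support.serializer.JsonDeserializer

@Configuration
class KafkaErrorHandlingConfig {

  // Wrap the real deserializer so a poison message surfaces as a handled
  // deserialization error instead of crashing the consumer loop.
  fun consumerOverrides(): Map<String, Any> =
      mapOf(
          ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG to ErrorHandlingDeserializer::class.java,
          ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS to JsonDeserializer::class.java.name)

  // Records that still fail after retries are published to "<topic>.DLT" by default.
  @Bean
  fun errorHandler(template: KafkaTemplate<Any, Any>) =
      DefaultErrorHandler(DeadLetterPublishingRecoverer(template))
}
```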

* Re-apply changes from PR #130 (#138)

* Changes for Water Use View (#139)

* Changes for Water Use View

- Move data to NZST timezone when aggregating daily
- Fix issue with Water Meter Volume: previously we were treating it as an l/s value when it is actually the raw
  m^3 used since the last data point, which makes the aggregation a simple sum (see the sketch below)
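
A minimal sketch of that aggregation rule, assuming readings arrive as timestamped m^3 values (names and types here are illustrative):

```kotlin
import java.time.Instant
import java.time.LocalDate
import java.time.ZoneOffset

// NZST as a fixed UTC+12 offset (deliberately ignoring daylight saving).
val NZST: ZoneOffset = ZoneOffset.ofHours(12)

// Each Water Meter Volume reading is the raw m^3 used since the previous data
// point, so daily usage is just the sum of the readings in each NZST day.
fun dailyUsage(readings: List<Pair<Instant, Double>>): Map<LocalDate, Double> =
    readings
        .groupBy({ it.first.atOffset(NZST).toLocalDate() }, { it.second })
        .mapValues { (_, volumes) -> volumes.sum() }
```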

* Update test with new data

* Weekly usage graph (#148)

* WIP: Detailed usage page with initial weekly usage

* WIP: Group heatmap into regions and order

* WIP: Tidy up presentation and data

* Use full path in link to detailed usage page to fix AWS Amplify weirdness

* Display tweaks

* Add a link back to the limits viewer

* Error handling and code cleanup

* Fix bug formatting date in tooltip

* Make UsageDisplayGroups part of council data

* Crawler fix start of month (#149)

* Tweak to limit amount of history kept.

* Extend measurement list so it will be one month ahead if there are recent
measurements

* Add new fetch task that is constantly refreshing the latest data

* Refactor to make comparison more explicit

* Update README for recent changes

* Use correct allocation measure (#151)

* Bump @typescript-eslint/parser in /packages/PlanLimitsUI (#140)

Bumps [@typescript-eslint/parser](https://github.com/typescript-eslint/typescript-eslint/tree/HEAD/packages/parser) from 6.9.0 to 6.10.0.
- [Release notes](https://github.com/typescript-eslint/typescript-eslint/releases)
- [Changelog](https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/parser/CHANGELOG.md)
- [Commits](https://github.com/typescript-eslint/typescript-eslint/commits/v6.10.0/packages/parser)

---
updated-dependencies:
- dependency-name: "@typescript-eslint/parser"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump maplibre-gl from 3.5.1 to 3.5.2 in /packages/PlanLimitsUI (#132)

Bumps [maplibre-gl](https://github.com/maplibre/maplibre-gl-js) from 3.5.1 to 3.5.2.
- [Release notes](https://github.com/maplibre/maplibre-gl-js/releases)
- [Changelog](https://github.com/maplibre/maplibre-gl-js/blob/main/CHANGELOG.md)
- [Commits](maplibre/maplibre-gl-js@v3.5.1...v3.5.2)

---
updated-dependencies:
- dependency-name: maplibre-gl
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Daily Usage Graph (#158)

* Extract results to separate components

* Display daily usage

* Use the same color scheme for daily usage as weekly usage

* Format data in tooltip

* Add a key to the daily usage chart

* Add legend to weekly usage

* Tweaks based on feedback

* Fix rounding issue

* Test alternate visualisation type

* Add a key to the heatmapped daily data

* Code tidy-ups

* Remove TimeRange graph and tidy up code

* Remove unused renderCell Override

* Ensure top axis is not generated for weekly graphs other than first

* Copy tweaks

* Water meter reading view bugfix (#159)

* fix for bug in daily usage calc

* remove date filter.

no longer needed as filtering occurs on a downstream view

* Handle missing data in the UI (#165)

* Handle missing data for Usage table

* Handle missing data in Detailed Daily usage graph

* Display tweaks

* Handle missing and show more detailed info for weekly usage

* Formatting tweaks

* Only show extended tooltip data when the d key is held down

* Remove console.log

* Formatting tweaks

* More explicit handling of 0 vs NULL values when displaying usage

* Always show daily usage table in weekly tooltip

* Show CMU if CMSU is not present

* Handle missing data in view (#173)

* Update field names to match updated view

* Don't default to 0's where NULLs are present in Usage and Allocation data

* Update view with showing allocation vs usage

* Update view tests

---------

Co-authored-by: Vim <vim@noodle.io>

* Add overview docs (#174)

* Add some docs

* Docs tweaks

* Remove log error message about rack assigner

* Adding date check back in because of how it affects materializing the view (#161)

* Lower number of consumers for Kafka to hopefully stabilise things

* Update Jackson max message size

* Crawler handle large messages (#225)

* Skip large messages from Hilltop

- Some messages will cause issues later in processing because of their
  size

* Fixup off by one day error in test

* Update manager so observation updated_at is only changed if the value
changes

- This is better information, but also should improve loading
  performance

* TEAMU-876 - Show what percentage of allocation is measured (#232)

* Update Ingest API for new message format

* Update manager with new fields

* Update Plan Limits with new data

* Adding in Measured vs Total Allocation for usage

* https://gwrc.atlassian.net/browse/TEAMU-929: Daily Aggregated Observations, multiple values for same day (#235)

* Update Ingest API for new message format

* Update manager with new fields

* Update Plan Limits with new data

* Adding in Measured vs Total Allocation for usage

* added the .sql file which removes the daily duplicates for each site

* fix the where clause position

* drop the view and create it as there are structural changes

* drop the view and create it as there are structural changes

* drop the cascaded view and recreate them

* removed the drop command

* removed the drop command

* changes to ALTER view

* changes to ALTER view

* added drop and removed alter

* added both view scripts in one migration file

* added both view scripts in one migration file

* added both view scripts in one migration file

* added both views in the same migration file and corrected the SQL query

* keeping the naming conventions the same

* Adjusting roles to allow removing the materialised view

---------

Co-authored-by: Steve Mosley <steve@starter4ten.com>

* TEAMU-1016-Fix-leap-year-issue (#243)

* TEAMU-1016 Add FE and BE unit tests

* TEAMU-1016 Add FE and BE unit tests

* TEAMU-1016 Improve BE exception handling

* TEAMU-1016 Tweak start scripts

* TEAMU-1016 Check error message in test

* TEAMU-1016 Lint fixes

* TEAMU-1016 Spelling error

* TEAMU-1016 Linting and batect run

* TEAMU-1016 Resolve merge conflict in readme

---------

Co-authored-by: wtcarter-gw <wayne.carter@gw.govt.nz>

* TEAMU-940 Add new migration 41 replacing MAX and MIN with last and first (#248)

* TEAMU-940 Add new migration 41

* TEAMU-940 Correct migration 41 as per failing test

* TEAMU-940 Apply review feedback and add some tests

* TEAMU-940 Augment tests

* TEAMU-940 Spotless linting

---------

Co-authored-by: wtcarter-gw <wayne.carter@gw.govt.nz>

* fix to exclude all combined_meters (#249)

* fix to exclude all combined_meters

* Remove the extra space

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Vimal Jobanputra <vim@noodle.io>
Co-authored-by: Martin Peak <martin.peak@gw.govt.nz>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Swapna Josmi Sam <56238575+swapnasam@users.noreply.github.com>
Co-authored-by: Wayne Carter <wayne@ednastreet.nz>
Co-authored-by: wtcarter-gw <wayne.carter@gw.govt.nz>
7 people committed Mar 25, 2024
1 parent 94a4f3b commit 9b8a459
Showing 159 changed files with 15,729 additions and 8,017 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/hilltop-crawler.yml
@@ -51,7 +51,7 @@ jobs:
if: always()
with:
name: reports
- path: format('{0}/build/reports', env.WORKING_DIR)
+ path: ${{ env.WORKING_DIR }}/build/reports

- name: Configure AWS credentials
if: ${{ github.actor != 'dependabot[bot]' }}
2 changes: 1 addition & 1 deletion .github/workflows/ingest-api.yml
@@ -51,7 +51,7 @@ jobs:
if: always()
with:
name: reports
- path: format('{0}/build/reports', env.WORKING_DIR)
+ path: ${{ env.WORKING_DIR }}/build/reports

- name: Configure AWS credentials
if: ${{ github.actor != 'dependabot[bot]' }}
2 changes: 1 addition & 1 deletion .github/workflows/manager.yml
@@ -51,7 +51,7 @@ jobs:
if: always()
with:
name: reports
- path: format('{0}/build/reports', env.WORKING_DIR)
+ path: ${{ env.WORKING_DIR }}/build/reports

- name: Configure AWS credentials
if: ${{ github.actor != 'dependabot[bot]' }}
13 changes: 13 additions & 0 deletions .run/Plan Limits UI.run.xml
@@ -0,0 +1,13 @@
<component name="ProjectRunConfigurationManager">
<configuration default="false" name="Plan Limits UI" type="js.build_tools.npm">
<package-json value="$PROJECT_DIR$/packages/PlanLimitsUI/package.json" />
<command value="run" />
<scripts>
<script value="dev" />
</scripts>
<arguments value="-- --open" />
<node-interpreter value="project" />
<envs />
<method v="2" />
</configuration>
</component>
10 changes: 8 additions & 2 deletions README.md
@@ -35,7 +35,7 @@ Include diagrams in the site:

### How to Run the application locally

- To run Ha Kākano locally, you will need to start the shared infrastructure services and then start the individual applications.
+ To run He Kākano locally, you will need to start the shared infrastructure services and then start the individual applications.

#### Shared Infrastructure

@@ -59,4 +59,10 @@ _And from a new shell session in the same folder_

#### Application Packages

- Having started the shared infrastructure, you will find specific run instructions for each application package in its own README.md file. For example, [here are the run instructions for the Plan Limits UI](packages/PlanLimitsUI/README.md).
+ Having started the shared infrastructure, you will find specific run instructions for each application package in its own README.md file. For example, [here are the run instructions for the Plan Limits UI](./packages/PlanLimitsUI/README.md).

## `start.sh` convenience script

To simplify running EOP locally, a `start.sh` script has been provided in the root of the repository. This script will start the shared infrastructure services and then the application component that you name as an argument.

For example, to start the Plan Limits UI application, you would run `./start.sh PlanLimitsUI`.
108 changes: 108 additions & 0 deletions packages/HilltopCrawler/README.md
@@ -0,0 +1,108 @@
# Hilltop Crawler

## Description

This app extracts observation data from Hilltop servers and publishes it to a Kafka `observations` topic for other
applications to consume. Which Hilltop servers and which measurement types to extract are configured in a database
table, which can be added to while the system is running.

The app intends to keep all versions of data pulled from Hilltop, allowing downstream systems to keep track of when
data has been updated in Hilltop. Note that because Hilltop does not expose when changes were made, we can only
capture when the crawler first saw the data change. This is also intended to allow us to update some parts of the
system without having to re-read all of the data from Hilltop.

## Process

Overall, this is built around the Hilltop API, which exposes three levels of GET API request:

* SITES_LIST — List of Sites, includes names and locations
* MEASUREMENTS_LIST — Per site, list of measurements available for that site, including details about the first and last
observation date
* MEASUREMENT_DATA / MEASUREMENT_DATA_LATEST — Per site, per measurement, the timeseries observed data either historical or most recent

The app crawls these by reading each level to decide what to read from the next level, i.e., the SITES_LIST tells the
app which sites to call MEASUREMENTS_LIST for, which in turn tells the app which measurements to call
MEASUREMENT_DATA for. Requests for MEASUREMENT_DATA are split into monthly chunks to avoid issues with too much data
being returned in one request. If a site appears to be reporting data frequently, then a MEASUREMENT_DATA_LATEST task
will be queued up to get data for the most recent 35 days.
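
For illustration, a sketch of the three request levels as URLs. The helper names are hypothetical (not the crawler's actual functions) and assume the common Hilltop query-parameter conventions:

```kotlin
// Level 1: all sites exposed by an .hts endpoint.
fun siteListUrl(htsUrl: String) = "$htsUrl?Service=Hilltop&Request=SiteList"

// Level 2: the measurements available at one site.
fun measurementsListUrl(htsUrl: String, site: String) =
    "$htsUrl?Service=Hilltop&Request=MeasurementList&Site=$site"

// Level 3: the observed timeseries, fetched in monthly chunks.
fun measurementDataUrl(htsUrl: String, site: String, measurement: String, from: String, to: String) =
    "$htsUrl?Service=Hilltop&Request=GetData&Site=$site&Measurement=$measurement&From=$from&To=$to"
```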

The app keeps a list of API requests that it will keep up to date by calling each API on a schedule. This list is
stored in the `hilltop_fetch_tasks` table and works like a task queue. Each time a request is made, the result is used
to determine when next to schedule the task. For example, for MEASUREMENT_DATA_LATEST, if the last observation was
recent, then a refresh should be attempted soon; if it was a long way in the past, it should be refreshed less often.

The next schedule time is implemented as a random time within an interval, to add some jitter between task requeue
times and hopefully spread out both the tasks and the load on the servers we are calling.
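
A minimal sketch of the jittered requeue calculation (the "between 1.0x and 1.5x of the base interval" rule here is illustrative, not the app's exact algorithm):

```kotlin
import java.time.Duration
import java.time.Instant
import kotlin.random.Random

// Requeue somewhere between 1.0x and 1.5x of the base interval so tasks that
// were created together drift apart instead of re-polling a server in lockstep.
fun nextFetchAt(now: Instant, base: Duration): Instant =
    now.plus(base).plusMillis(Random.nextLong(0, base.toMillis() / 2))
```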

The task queue also keeps metadata about the previous history of tasks; this is not used by the app, but allows
engineers to monitor how the system is working.

This process is built around three main components:

* Every hour, monitor the configuration table
  * Read the configuration table
  * From the configuration, add any new tasks to the task queue

* Continuously monitor the task queue
  * Read the next task to work on
  * Fetch data from Hilltop for that task
  * If the result is valid and not the same as the previous version
    * Queue any new tasks from the result
    * Send the result to the Kafka `hilltop_raw` topic (note the Kafka topic does not distinguish between
      MEASUREMENT_DATA / MEASUREMENT_DATA_LATEST; it calls them both MEASUREMENT_DATA)
  * Requeue the task for some time in the future, based on the type of request

* Kafka streams component
  * Monitor the stream
  * For each message, map it to either a `site_details` or an `observations` message

Currently, the Manager component listens to the `observations` topic and stores the data from that in a DB table.

## Task Queue

The task queue is currently a custom implementation on top of a single table `hilltop_fetch_tasks`.

This is a slightly specialized queue where:

* Each task has a scheduled run time, backed by the `next_fetch_at` column
* Each time a task runs, it is re-queued for some time in the future
* The same task can be added multiple times; the queue relies on Postgres `ON CONFLICT DO NOTHING` to avoid storing
  duplicates

The implementation relies on the Postgres `SKIP LOCKED` feature to allow multiple worker threads to pull from the queue
at the same time without getting the same task.

See this [reference](https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/) for
discussion about the `SKIP LOCKED` query.
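
A minimal sketch of the enqueue/dequeue pair, assuming Spring's `JdbcTemplate`; the column names follow the description above, but the exact schema and function names are illustrative:

```kotlin
import org.springframework.jdbc.core.JdbcTemplate

// Idempotent enqueue: a unique constraint on the task plus ON CONFLICT DO
// NOTHING means the same task can be offered many times but is stored once.
fun enqueueTask(jdbc: JdbcTemplate, sourceId: Int, type: String, url: String) {
  jdbc.update(
      """
      INSERT INTO hilltop_fetch_tasks (source_id, request_type, url, next_fetch_at)
      VALUES (?, ?, ?, now())
      ON CONFLICT DO NOTHING
      """,
      sourceId, type, url)
}

// Dequeue: run inside a transaction so the row stays locked while it is worked
// on; SKIP LOCKED makes other workers skip the row instead of blocking on it.
fun nextTaskId(jdbc: JdbcTemplate): Int? =
    jdbc.query(
            """
            SELECT id FROM hilltop_fetch_tasks
            WHERE next_fetch_at <= now()
            ORDER BY next_fetch_at
            LIMIT 1
            FOR UPDATE SKIP LOCKED
            """) { rs, _ -> rs.getInt("id") }
        .firstOrNull()
```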

The queue implementation is fairly simple for this specific use. If it becomes more of a generic work queue, then
moving to a standard implementation such as Quartz might be worthwhile.

## Example Configuration

These are a couple of insert statements that are not stored in migration scripts, so they are not applied on developer
machines by default.

GW Water Use

```sql
INSERT INTO hilltop_sources (council_id, hts_url, configuration)
VALUES (9, 'https://hilltop.gw.govt.nz/WaterUse.hts',
'{ "measurementNames": ["Water Meter Volume", "Water Meter Reading"] }');
```

GW Rainfall

```sql
INSERT INTO hilltop_sources (council_id, hts_url, configuration)
VALUES (9, 'https://hilltop.gw.govt.nz/merged.hts', '{ "measurementNames": ["Rainfall"] }');
```

### TODO / Improvements

* The algorithm for determining the next time to schedule an API refresh could be improved; something could be built
  from the previous history based on how often data is unchanged.
* To avoid hammering Hilltop there is rate limiting using a "token bucket" library; currently this uses one bucket for
  all requests. It could be split to use one bucket per server.
* Each time the latest measurement API is called, we will receive data that has already been seen and processed
  previously and store it again in the `hilltop_raw` topic. This seems wasteful, and there are options for cleaning up
  when a record just supersedes the previous one. But cleaning up could mean losing some knowledge about when an
  observation was first seen.
19 changes: 4 additions & 15 deletions packages/HilltopCrawler/build.gradle.kts
@@ -41,10 +41,14 @@ dependencies {
implementation("org.flywaydb:flyway-core:10.6.0")
implementation("org.flywaydb:flyway-database-postgresql:10.6.0")
implementation("io.github.microutils:kotlin-logging-jvm:3.0.5")
implementation("org.apache.kafka:kafka-streams")
implementation("com.bucket4j:bucket4j-core:8.3.0")

testImplementation("org.springframework.boot:spring-boot-starter-test")
testImplementation("org.springframework.kafka:spring-kafka-test")
testImplementation("io.kotest:kotest-assertions-core:5.8.0")
testImplementation("io.kotest:kotest-assertions-json:5.8.0")
testImplementation("org.mockito.kotlin:mockito-kotlin:5.1.0")
}

// Don't repackage build in a "-plain" Jar
@@ -70,21 +74,6 @@ configure<com.diffplug.gradle.spotless.SpotlessExtension> {
kotlinGradle { ktfmt() }
}

val dbConfig =
mapOf(
"url" to
"jdbc:postgresql://${System.getenv("CONFIG_DATABASE_HOST") ?: "localhost"}:5432/eop_test",
"user" to "postgres",
"password" to "password")

flyway {
url = dbConfig["url"]
user = dbConfig["user"]
password = dbConfig["password"]
schemas = arrayOf("hilltop_crawler")
locations = arrayOf("filesystem:./src/main/resources/db/migration")
}

testlogger {
showStandardStreams = true
showPassedStandardStreams = false
@@ -1,33 +1,21 @@
package nz.govt.eop.hilltop_crawler

import java.security.MessageDigest
import com.fasterxml.jackson.core.StreamReadConstraints
import org.springframework.boot.autoconfigure.SpringBootApplication
import org.springframework.boot.autoconfigure.jackson.Jackson2ObjectMapperBuilderCustomizer
import org.springframework.boot.context.properties.EnableConfigurationProperties
import org.springframework.boot.runApplication
import org.springframework.context.annotation.Bean
import org.springframework.http.converter.json.Jackson2ObjectMapperBuilder
import org.springframework.kafka.annotation.EnableKafka
import org.springframework.scheduling.annotation.EnableScheduling

@SpringBootApplication
@EnableScheduling
@EnableKafka
@EnableConfigurationProperties(ApplicationConfiguration::class)
class Application {

@Bean
fun jsonCustomizer(): Jackson2ObjectMapperBuilderCustomizer {
return Jackson2ObjectMapperBuilderCustomizer { _: Jackson2ObjectMapperBuilder -> }
}
}
class Application {}

fun main(args: Array<String>) {
System.setProperty("com.sun.security.enableAIAcaIssuers", "true")

StreamReadConstraints.overrideDefaultStreamReadConstraints(
StreamReadConstraints.builder().maxStringLength(50_000_000).build())

runApplication<Application>(*args)
}

fun hashMessage(message: String) =
MessageDigest.getInstance("SHA-256").digest(message.toByteArray()).joinToString("") {
"%02x".format(it)
}
@@ -0,0 +1,34 @@
package nz.govt.eop.hilltop_crawler

import java.util.concurrent.TimeUnit
import nz.govt.eop.hilltop_crawler.api.requests.buildSiteListUrl
import nz.govt.eop.hilltop_crawler.db.DB
import nz.govt.eop.hilltop_crawler.db.HilltopFetchTaskType
import org.springframework.context.annotation.Profile
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

/**
* This task is responsible for triggering the first fetch task for each source stored in the DB.
*
* It makes sure any new rows added to the DB will start to be pulled from within an hour.
*
* Each time it runs, it will create the initial fetch task for each source found in the DB. This
* relies on the task queue (via "ON CONFLICT DO NOTHING") making sure that duplicate tasks will not
* be created.
*/
@Profile("!test")
@Component
class CheckForNewSourcesTask(val db: DB) {

@Scheduled(fixedDelay = 1, timeUnit = TimeUnit.HOURS)
fun triggerSourcesTasks() {
val sources = db.listSources()

sources.forEach {
db.createFetchTask(
DB.HilltopFetchTaskCreate(
it.id, HilltopFetchTaskType.SITES_LIST, buildSiteListUrl(it.htsUrl)))
}
}
}
@@ -1,3 +1,6 @@
package nz.govt.eop.hilltop_crawler

const val HILLTOP_RAW_DATA_TOPIC_NAME = "hilltop.raw"
const val OUTPUT_DATA_TOPIC_NAME = "observations"

const val MAX_RESPONSE_SIZE = 20_000_000
