[pull] master from mozilla:master #11

Open
wants to merge 79 commits into
base: master
5d9eadd
Bump jedis from 3.5.0 to 3.5.1 (#1546)
dependabot[bot] Jan 25, 2021
3360d4d
Scrube more unwanted apps
scholtzan Jan 25, 2021
75032e0
Bump pytest from 6.2.1 to 6.2.2 in /ingestion-edge (#1548)
dependabot[bot] Jan 26, 2021
bdd46b2
Bump jose4j from 0.7.4 to 0.7.5 (#1550)
dependabot[bot] Jan 27, 2021
5af390a
Drop glean-js-tmp in pipeline (#1552)
fbertsch Jan 29, 2021
2ea55dd
Bump jose4j from 0.7.5 to 0.7.6 (#1553)
dependabot[bot] Jan 29, 2021
2767c6c
Bump org.everit.json.schema from 1.12.1 to 1.12.2 (#1551)
dependabot[bot] Jan 29, 2021
054e80a
Bump maven-checkstyle-plugin from 3.1.1 to 3.1.2
dependabot[bot] Feb 1, 2021
d46ade9
Bump checkstyle from 8.39 to 8.40 (#1555)
dependabot[bot] Feb 1, 2021
4fb9201
Bump google-cloud-pubsublite from 0.8.0 to 0.9.0 (#1556)
dependabot[bot] Feb 2, 2021
6e3628a
Bump google-cloud-kms from 1.40.5 to 1.40.6 (#1559)
dependabot[bot] Feb 8, 2021
4401223
Bump google-cloud-pubsublite from 0.9.0 to 0.10.0 (#1558)
dependabot[bot] Feb 8, 2021
56bf105
Bug 1684980 - Ignore non-Glean user agents from firefox-desktop (#1557)
acmiyaguchi Feb 8, 2021
74a3db6
Fix #1560 - Handle null user agent strings while scrubbing (#1561)
acmiyaguchi Feb 9, 2021
b3140c5
Bump spotless-maven-plugin from 2.7.0 to 2.8.0 (#1564)
dependabot[bot] Feb 10, 2021
500cc2d
Bump libraries-bom from 16.3.0 to 16.4.0 (#1563)
dependabot[bot] Feb 10, 2021
5301912
Bug 1691807 - Allow decryption of objects in rally meta pings (#1562)
acmiyaguchi Feb 11, 2021
2dbd690
Bump junit from 4.13.1 to 4.13.2 (#1565)
dependabot[bot] Feb 16, 2021
74f38b4
Ignore org-privacywall-browser pings
fbertsch Feb 16, 2021
c4c713c
Bump sanic from 20.12.1 to 20.12.2 in /ingestion-edge (#1567)
dependabot[bot] Feb 17, 2021
c145de6
Bump spotless-maven-plugin from 2.8.0 to 2.8.1 (#1568)
dependabot[bot] Feb 17, 2021
5429bb2
Bump spring-core from 5.3.3 to 5.3.4 (#1569)
dependabot[bot] Feb 17, 2021
107d9ca
Add --pubsubIdAttribute beam option for dedup when reading pubsub (#1…
relud Feb 17, 2021
324ddef
Bump google-cloud-monitoring from 2.0.0 to 2.0.1 in /ingestion-edge (…
dependabot[bot] Feb 22, 2021
6a09bb9
Bump beam-sdks-java-bom from 2.27.0 to 2.28.0
dependabot[bot] Feb 23, 2021
c357cb0
Bump mockito-core from 3.7.7 to 3.8.0 (#1573)
dependabot[bot] Feb 23, 2021
0860f4b
Bump google-cloud-kms from 1.40.6 to 1.40.7 (#1574)
dependabot[bot] Feb 23, 2021
5b5ef9b
Bump libraries-bom from 16.4.0 to 18.0.0 (#1575)
dependabot[bot] Feb 25, 2021
2c494f4
Update docs about deduplication (#1577)
jklukas Feb 25, 2021
1d160f1
Remove decoder uri dedup option from top-level build script (#1576)
whd Feb 25, 2021
a5ce94d
Bump google-cloud-kms from 1.40.7 to 1.40.8 (#1581)
dependabot[bot] Feb 26, 2021
9828740
Bump google-cloud-pubsublite from 0.10.0 to 0.11.0 (#1582)
dependabot[bot] Feb 26, 2021
0cfad7e
Bump aiohttp from 3.7.3 to 3.7.4 in /ingestion-edge (#1580)
dependabot[bot] Mar 1, 2021
41445eb
Bump pytest-sanic from 1.6.2 to 1.7.0 in /ingestion-edge (#1584)
dependabot[bot] Mar 1, 2021
4d3713f
Bump checkstyle from 8.40 to 8.41 (#1585)
dependabot[bot] Mar 1, 2021
a14c81a
Update transitive dependencies to fix ingestion-edge-release (#1586)
relud Mar 1, 2021
e6475b1
Bump google-cloud-pubsublite from 0.11.0 to 0.11.1 (#1587)
dependabot[bot] Mar 2, 2021
97619c8
Bump libraries-bom from 18.0.0 to 18.1.0 (#1588)
dependabot[bot] Mar 4, 2021
e7ee21c
Bump libraries-bom from 18.1.0 to 19.0.0 (#1589)
dependabot[bot] Mar 8, 2021
4e6c1ee
[Bug 1695728] Mark `org-mozilla-ios-lockbox-credentialprovider` as Un…
scholtzan Mar 8, 2021
c3dbc61
Bump spotless-maven-plugin from 2.8.1 to 2.9.0 (#1591)
dependabot[bot] Mar 9, 2021
5dc4b43
Create contextual services pipeline (#1579)
BenWu Mar 9, 2021
5e98232
Bump aiohttp from 3.7.4 to 3.7.4.post0 in /ingestion-edge (#1590)
dependabot[bot] Mar 9, 2021
b59526e
Propagate contextual services opts (#1593)
whd Mar 10, 2021
1ea77ec
Update pain_points.md (#1594)
jklukas Mar 10, 2021
95096cc
Change allowedDocTypes to string value (#1595)
BenWu Mar 10, 2021
608089c
Change allowedDocType setter type (#1596)
BenWu Mar 10, 2021
c05fbab
Remove support for redis deduplication (#1583)
relud Mar 11, 2021
37a77e8
Bump google-cloud-kms from 1.40.8 to 1.41.0 (#1597)
dependabot[bot] Mar 12, 2021
82c6e04
Bump google-cloud-monitoring from 2.0.1 to 2.1.0 in /ingestion-edge (…
dependabot[bot] Mar 12, 2021
0088d65
Bug 1697602 Drop AET messages (#1600)
jklukas Mar 15, 2021
a5ea135
Bump log4j.version from 2.14.0 to 2.14.1 (#1601)
dependabot[bot] Mar 15, 2021
943b5de
Bump pip-tools from 5.5.0 to 6.0.0 in /ingestion-edge (#1602)
dependabot[bot] Mar 15, 2021
26b5376
Bump google-cloud-kms from 1.41.0 to 1.41.1 (#1603)
dependabot[bot] Mar 16, 2021
aa67b1b
Bump pip-tools from 6.0.0 to 6.0.1 in /ingestion-edge (#1604)
dependabot[bot] Mar 16, 2021
ee3784e
Disable okhttp redirects for ctxsvc (#1606)
BenWu Mar 17, 2021
4e0d68e
Bug 1698576 - Ignore com-zibb-browser
acmiyaguchi Mar 17, 2021
8760be3
Bug 1698579 - Ignore org-mozilla-firefox-betb
acmiyaguchi Mar 17, 2021
66fe360
Bug 1698574 - Ignore org-mozilla-allanchain-firefox
acmiyaguchi Mar 17, 2021
e1c97fd
Remove Memorystore and AET from architecture overview (#1610)
jklukas Mar 18, 2021
46ad670
Bump spring-core from 5.3.4 to 5.3.5 (#1605)
dependabot[bot] Mar 18, 2021
f96fb3b
Update rendered docs in line with bigquery-etl (#1611)
jklukas Mar 19, 2021
5f75f19
Bump guava from 30.1-jre to 30.1.1-jre (#1616)
dependabot[bot] Mar 23, 2021
0b2c534
Bump google-cloud-pubsublite from 0.11.1 to 0.12.0 (#1612)
dependabot[bot] Mar 23, 2021
8f505da
Bump libraries-bom from 19.0.0 to 19.2.1 (#1615)
dependabot[bot] Mar 23, 2021
f853869
Bump pytest-mypy from 0.8.0 to 0.8.1 in /ingestion-edge (#1614)
dependabot[bot] Mar 23, 2021
36c9477
Bug 1697602 Remove AET support (#1599)
jklukas Mar 23, 2021
ada88f7
Use maven.compiler.release to build for java 8 from jdk 9+ (#1619)
relud Mar 24, 2021
f42ec4b
Bug 1697342 - Support Glean.js pings in Rally decoder (#1617)
acmiyaguchi Mar 30, 2021
16a93ee
Bump google-cloud-pubsublite from 0.12.0 to 0.13.0 (#1628)
dependabot[bot] Mar 31, 2021
7d5fab2
Bump checkstyle from 8.41 to 8.41.1 (#1625)
dependabot[bot] Mar 31, 2021
288af7a
Bump google-cloud-monitoring from 2.1.0 to 2.2.1 in /ingestion-edge (…
dependabot[bot] Mar 31, 2021
35a19a2
Bump pyyaml from 5.3.1 to 5.4 in /ingestion-edge (#1622)
dependabot[bot] Mar 31, 2021
cebe636
Support workload identity in flush manager service (#1630)
whd Apr 2, 2021
313b169
Bump pytest from 6.2.2 to 6.2.3 in /ingestion-edge (#1632)
dependabot[bot] Apr 5, 2021
87c0eeb
Bump gunicorn from 20.0.4 to 20.1.0 in /ingestion-edge (#1626)
dependabot[bot] Apr 5, 2021
3b88239
Bump google-cloud-pubsublite from 0.13.0 to 0.13.1 (#1631)
dependabot[bot] Apr 5, 2021
70b8960
Bump flogger-system-backend from 0.5.1 to 0.6 (#1629)
dependabot[bot] Apr 5, 2021
89659fc
Work around bug that was breaking q.empty() (#1634)
relud Apr 5, 2021
7 changes: 6 additions & 1 deletion .circleci/config.yml
@@ -28,7 +28,12 @@ jobs:
       - checkout
       - run:
           name: Install dependencies
-          command: sudo pip install mkdocs markdown-include
+          command: |
+            sudo pip install \
+              mkdocs \
+              mkdocs-material \
+              markdown-include \
+              mkdocs-awesome-pages-plugin
       - add_ssh_keys:
           fingerprints:
             - "84:b0:66:dd:ec:68:b1:45:9d:5d:66:fd:4a:4f:1b:57"
6 changes: 6 additions & 0 deletions GRAVEYARD.md
@@ -1,5 +1,11 @@
 # Link to code removed from this repository
 
+- 2021-03-12 (available until [commit `c05fbab`](https://github.com/mozilla/gcp-ingestion/commit/c05fbabcb44c5a8a290be311a87951728cf587b6))
+  - Remove support for Account Ecosystem Telemetry (AET);
+    see [Bug 1697602](https://bugzilla.mozilla.org/show_bug.cgi?id=1697602)
+  - Also removes `ParseLogEntry` which transforms a Google Cloud Logging
+    (`Stackdriver`) `LogEntry` message into one compatible with structured
+    ingestion, which could be relevant for future use cases.
 - 2020-06-01 (available until [commit `f1d6464`](https://github.com/mozilla/gcp-ingestion/commit/f1d646442b8c1fcd63202ebca91363979b5b2ae2))
   - Remove `DeduplicateByDocumentId` transform, which was intended for use with
     the backfill from `heka` data, but did not perform well and was never used
3 changes: 2 additions & 1 deletion README.md
@@ -21,6 +21,7 @@ There are currently four components:
 For more information, see [the documentation](https://mozilla.github.io/gcp-ingestion).
 
 Java 11 support is a work in progress for the Beam Java SDK, so this project requires
-Java 8 and will likely fail to compile using newer versions of the JDK.
+Java 8. Maven has been configured to compile for Java 8 when using newer versions of the
+JDK, but support is only guaranteed for JDK 8.
 To manage multiple local JDKs, consider [jenv](https://www.jenv.be/) and the
 `jenv enable-plugin maven` command.
17 changes: 7 additions & 10 deletions docs/architecture/decoder_service_specification.md
@@ -6,6 +6,8 @@ in the Structured Ingestion pipeline.
 ## Data Flow
 
 1. Consume messages from Google Cloud PubSub raw topic
+1. Deduplicate message by `uri` (which generally contains `docId`)
+   - Disabled for stub-installer, which does not include a UUID in the URI
 1. Perform GeoIP lookup and drop `x_forwarded_for` and `remote_addr` and
    optionally `geo_city` based on population
 1. Parse the `uri` attribute to determine document type, etc.
@@ -16,16 +18,12 @@ in the Structured Ingestion pipeline.
 1. Validate the schema of the body
 1. Extract user agent information and drop `user_agent`
 1. Add metadata fields to message
-1. Deduplicate message by `docId`
-   - Generate `docId` for submission types that don't have one
 1. Write message to PubSub decoded topic based on `namespace` and `docType`
 
 ### Implementation
 
 The above steps will be executed as a single Apache Beam job that can accept
 either a streaming input from PubSub or a batch input from Cloud Storage.
-Message deduplication will be done by checking for the presence of ids as keys
-in Cloud Memory Store (managed Redis).
 
 ### Decoding Errors
 
@@ -118,13 +116,12 @@ method, to ensure ack'd messages are fully delivered.
 
 ### Deduplication
 
-Each `docId` will be allowed through "at least once", and only be
+Each `uri` will be allowed through "at least once", and only be
 rejected as a duplicate if we have completed delivery of a message with the
-same `docId`. Duplicates will be considered errors and sent to the error topic.
+same `uri`. We assume that each `uri` contains a UUID that uniquely identifies
+the document.
 "Exactly once" semantics can be applied to derived data sets using SQL in
 BigQuery, and GroupByKey in Beam and Spark.
 
-Note that deduplication is only provided with a "best effort" quality of service.
-In the ideal case, we hold 24 hours of history for seen document IDs, but that
-buffer is allowed to degrade to a shorter time window when the pipeline is under
-high load.
+Note that deduplication is only provided with a "best effort" quality of service
+using a 10 minute window.
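
The deduplication rule in the hunk above (a `uri` passes at least once and is rejected only after a message with the same `uri` has completed delivery, within a 10 minute window) can be sketched in plain Python. This is only an illustration under stated assumptions: the class `WindowedDeduplicator`, its method names, and the example URI are hypothetical, and an in-memory dictionary stands in for whatever mechanism the pipeline actually uses. It is not the project's implementation.

```python
import time
from collections import OrderedDict
from typing import Optional


class WindowedDeduplicator:
    """Best-effort deduplication keyed on `uri` (hypothetical helper).

    A uri counts as a duplicate only if delivery of a message with the same
    uri completed less than `window_seconds` ago.
    """

    def __init__(self, window_seconds: float = 600.0) -> None:
        self.window_seconds = window_seconds
        # uri -> time at which delivery completed, oldest entry first
        self._delivered: "OrderedDict[str, float]" = OrderedDict()

    def _expire(self, now: float) -> None:
        # Drop entries that have aged out of the window.
        while self._delivered:
            uri, delivered_at = next(iter(self._delivered.items()))
            if now - delivered_at < self.window_seconds:
                break
            self._delivered.popitem(last=False)

    def is_duplicate(self, uri: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        self._expire(now)
        return uri in self._delivered

    def mark_delivered(self, uri: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._delivered[uri] = now
        self._delivered.move_to_end(uri)


dedup = WindowedDeduplicator(window_seconds=600)
uri = "/submit/example/doc/1/00000000-0000-0000-0000-000000000000"  # made-up URI

first_seen = dedup.is_duplicate(uri, now=0)      # nothing delivered yet
in_flight = dedup.is_duplicate(uri, now=1)       # still not delivered, so allowed
dedup.mark_delivered(uri, now=2)
after_delivery = dedup.is_duplicate(uri, now=3)  # delivered 1 second ago
after_window = dedup.is_duplicate(uri, now=602)  # entry has aged out
```

Because a `uri` is only recorded once delivery completes, two copies that are in flight at the same time both pass, which is why the guarantee is "at least once" rather than "exactly once".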
8 changes: 0 additions & 8 deletions docs/architecture/diagram.mmd
@@ -9,21 +9,16 @@ graph TD
 
 f1(Producers) --> k1(Ingestion Edge)
 k1 --> p1(Raw Topic)
-k1 --> p1aet(AET Raw Topic)
 k1 --> p99(...)
 
 subgraph Pipeline Family: Structured
 p1 --> k2(Raw Sink)
 k2 --> c1(BigQuery)
 p1 --> d2(Decoder)
-p1aet --> d2aet(AET Decoder)
-d2aet -. Decrypt keys on startup .- m2(Cloud KMS)
 d2 --> p2(Decoded Topic)
 d2 -.-> p2err(Error Topic)
 p2err --> d3err(Error Sink)
 d3err --> berr(BigQuery)
-d2aet --> p2
-d2aet -.-> p2err
 p2 --> d3(Live Sink)
 d3 --> b1(BigQuery)
 p2 --> k3(Decoded Sink)
@@ -37,9 +32,6 @@ subgraph .
 p99 --> d99(...)
 end
 
-d5 -. Mark document as seen .-> m1
-m1(Cloud Memorystore) -. Document already seen? .-> d2
-
 subgraph Colors
 d(Dataflow jobs are green)
 k(Kubernetes services are magenta)
14 changes: 7 additions & 7 deletions docs/architecture/diagram.svg
33 changes: 3 additions & 30 deletions docs/architecture/overview.md
@@ -15,12 +15,8 @@ This document specifies the architecture for GCP Ingestion as a whole.
   `BigQuery`
 - The Dataflow `Decoder` job decodes messages from the PubSub `Raw Topic` to
   the PubSub `Decoded Topic`
-  - Also checks for existence of `document_id`s in
-    `Cloud Memorystore` in order to deduplicate messages
-- The Dataflow `AET Decoder` job provides all the functionality of the `Decoder`
-  with additional decryption handling for Account Ecosystem Telemetry pings
-- The Dataflow `Republisher` job reads messages from the PubSub `Decoded Topic`,
-  marks them as seen in `Cloud Memorystore` and republishes them to various
+- The Dataflow `Republisher` job reads messages from the PubSub `Decoded Topic`
+  and republishes them to various
   lower volume derived topics including `Monitoring Sample Topics` and
   `Per DocType Topics`
 - The Kubernetes `Decoded Sink` job copies messages from the PubSub `Decoded Topic`
@@ -70,37 +66,14 @@ This document specifies the architecture for GCP Ingestion as a whole.
 1. Produce `normalized_` variants of select attributes
 1. Inject `normalized_` attributes at the top level and other select
    attributes into a nested `metadata` top level key in `payload`
-- Should deduplicate messages based on the `document_id` attribute using
-  `Cloud MemoryStore`
+- Should deduplicate messages based on the `uri` attribute
 - Must ensure at least once delivery, so deduplication is only "best effort"
-- Should delay deduplication to the latest possible stage of the pipeline
-  to minimize the time window between an ID being marked as seen in
-  `Republisher` and it being checked in `Decoder`
 - Must send messages rejected by transforms to a configurable error destination
 - Must allow error destination in BigQuery
 
-### AET Decoder
-
-The AET (Account Ecosystem Telemetry) Decoder is a modified version of the
-Decoder with the following properties:
-
-- The raw topic that feeds the AET Decoder must not be sent anywhere else;
-  the AET Decoder needs to either successfully decrypt or sanitize all AET
-  identifiers
-- Must load private keys from an encrypted blob in GCS
-- Must call Cloud KMS at startup to decrypt keys and store these only in memory
-- Must remove or redact all AET `ecosystem_anon_id` values from the payload before
-  passing to any durable output, including errors
-- Must have access restricted to a limited set of operators to avoid exposing private keys
-- Encrypted fields must be JOSE JWE objects in Compact Serialization form
-
 ### Republisher
 
 - Must copy messages from PubSub topics to PubSub topics
-- Must attempt to publish the `document_id` of each consumed message to
-  `Cloud MemoryStore`
-- ID publishing should be "best effort" but must not prevent the message
-  proceeding to further steps in case of errors reaching `Cloud MemoryStore`
 - Must ack messages read from PubSub after they are delivered to all
   matching destinations
 - Must not ack messages read from PubSub before they are delivered to all
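
The Republisher ack requirement kept by this hunk (ack a message only after it has been delivered to all matching destinations) can be sketched as follows. This is a minimal illustration, assuming a hypothetical `republish` function and destinations modeled as `(matches, publish)` pairs; it does not use the Pub/Sub client library and is not the project's code.

```python
from typing import Callable, Dict, Iterable, Tuple

Message = Dict[str, str]
# Hypothetical destination model: a predicate plus a publish callable.
Destination = Tuple[Callable[[Message], bool], Callable[[Message], None]]


def republish(
    message: Message,
    destinations: Iterable[Destination],
    ack: Callable[[Message], None],
) -> int:
    """Publish to every matching destination, then ack.

    If any publish raises, the exception propagates and ack is never called,
    so the message stays unacked and can be redelivered.
    """
    delivered = 0
    for matches, publish in destinations:
        if matches(message):
            publish(message)
            delivered += 1
    ack(message)
    return delivered


sent = []
acked = []
destinations = [
    (lambda m: m["doc_type"] == "main", lambda m: sent.append(("per_doctype", m["uri"]))),
    (lambda m: True, lambda m: sent.append(("sample", m["uri"]))),
]
message = {"uri": "/submit/example", "doc_type": "main"}  # made-up message
count = republish(message, destinations, acked.append)
```

Redelivery after a failed publish can produce duplicates downstream, which is consistent with the "at least once" delivery wording elsewhere in this diff.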
8 changes: 4 additions & 4 deletions docs/architecture/pain_points.md
@@ -45,8 +45,8 @@ is to specify mapping at template creation time.
 Does not use standard client library.
 
 Does not expose an output of delivered messages, which is needed for at least
-once delivery with deduplication. Current workaround is to get delivered
-messages from a subscription to the output PubSub topic.
+once delivery with deduplication. Current workaround is to use the deduplication
+available via `PubsubIO.read()`.
 
 Uses HTTPS JSON API, which increases message payload size vs protobuf by 25%
 for base64 encoding and causes some messages to exceed the 10MB request size
@@ -57,9 +57,9 @@ limit that otherwise would not.
 ## Templates
 
 Does not support repeated parameters via `ValueProvider<List<...>>`, as
-described in [Dataflow Java SDK #632]
+described in [Dataflow Java SDK #632].
 
-[googlecloudplatform/dataflowjavasdk#632]: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/632
+[dataflow java sdk #632]: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/632
 
 # PubSub
 