Feature/opos #81
Conversation
Interesting read! I've made some comments aimed at improving the document and clarifying some points.
> There should be full transparency over what is being scheduled to be run on
> orchestrated probes. That is to say that it should be possible for anybody to
> inspect the log of all the instructions and determine which ones have affected
> their probe.
The possibility for anybody to inspect seems to contradict "either publicly or privately depending on what is safer for the user", doesn't it?
> Components:
>
> **OPOS Client**: This is generally an instance of ooniprobe (or
> ooniprobe-mobile) that is subscribing to the OPOC event feed.
Shouldn't this be "OPOS event feed" rather than "OPOC"?
> on and their progress.
>
> **OPOS Event feed**: This is the backend component responsible for emitting
> events to clients and informing them of what they should be doing.
I don't completely get the difference between the scheduler and the feed, and specifically how they will be implemented in practice: is the scheduler a daemon and the feed a database (or a queue)?
Or is the feed the output of the scheduler?
As a matter of fact the "scheduler" was added after the "event feed". Originally I had excluded the "scheduler" from the architecture, as I was interested in documenting only what is exposed, and the "scheduler" is an architectural construct that is transparent to the user.
However, in specifying this I felt the need for this architectural construct, so I added it.
To exemplify this in a more concrete way, a possible implementation would be:
- OPOS Event feed: a websockets interface that clients subscribe to in order to receive notices of what they should do
- OPOS Scheduler: the engine responsible for knowing which clients should be notified of events
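To make the feed/scheduler split concrete, here is a minimal in-memory sketch. All names (`EventFeed`, `Scheduler`, the `country` attribute) are hypothetical; a real deployment would replace the in-memory inboxes with the websockets transport described above.

```python
class EventFeed:
    """Transport-facing side: clients subscribe, the feed delivers events."""

    def __init__(self):
        self.inboxes = {}  # client_id -> list of delivered events

    def subscribe(self, client_id):
        self.inboxes.setdefault(client_id, [])

    def publish(self, client_id, event):
        # Only subscribed clients receive events.
        if client_id in self.inboxes:
            self.inboxes[client_id].append(event)


class Scheduler:
    """Engine that decides which subscribed clients an event targets."""

    def __init__(self, feed, probes):
        self.feed = feed
        self.probes = probes  # client_id -> attributes, e.g. {"country": "IT"}

    def schedule(self, event, country):
        # Resolve the targeting constraint, then hand delivery to the feed.
        for client_id, attrs in self.probes.items():
            if attrs.get("country") == country:
                self.feed.publish(client_id, event)
```

The design point this illustrates: the scheduler knows *who* should be notified, while the feed only knows *how* to reach subscribers.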
> When somebody with proper access permissions adds an event to via the **OPOS
> Management Interface** all the subscribed probes that are affected by that
> event will be notified and will begin performing the specified action.
In this description, it would be useful to clarify the role of the scheduler (when does it kick in?)
I added a note about this below.
> it on github.com) to avoid it being blocked hence being resilient to censorship.
>
> * It SHOULD be possible to know how many *active* **OPOS Clients** there are at
>   a given time.
I am not convinced that this requirement is simple to implement with a pub-sub scheme. I mean: when using WebSockets (or perhaps we should consider HTTP/2?), you can keep a persistent connection and thus enumerate your clients (even anonymously, if they pass through some collateral freedom thingy). I do not easily see how you can enumerate clients with pub-sub and low latency.
That is why I mentioned this as a SHOULD and not as a MUST. It's something I would like to have the power to know, because it's very useful, but it's not a strict requirement.
I agree that depending on the specific implementation there will be different ways to do this that can be more or less accurate. Worst case scenario, we can do something similar to how Tor counts users: we estimate the average number of requests each client makes per day, count the total number of requests, and divide by that average.
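The Tor-style estimate mentioned above is just a division; the numbers below are made up for illustration only.

```python
def estimate_active_clients(total_requests_per_day, avg_requests_per_client_per_day):
    """Estimate the client count from aggregate request volume, in the spirit
    of Tor's user counting: total requests divided by the expected
    per-client request rate."""
    return round(total_requests_per_day / avg_requests_per_client_per_day)


# e.g. 144,000 requests/day with clients polling every 10 minutes (144/day):
estimate_active_clients(144_000, 144)  # -> 1000
```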
> * The start time of the job in ISO 8601 (YYYY-MM-DDThh:mm:ssTZ, where TZ is the timezone of `Z` for UTC).
>   An empty start time means start immediately.
>
> * The run interval, defined following the ["Duration" component of the ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) standard.
I'd in any case explain the semantics of this field rather than only redirecting to the standard.
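To explain the field's semantics rather than only linking the standard, a run interval like `P1DT12H` means "every one and a half days". A sketch of how such a duration string could be interpreted (only the day/time subset is handled; the `Y` and `M` date parts are omitted since they don't map to a fixed number of seconds):

```python
import re
from datetime import timedelta

DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(value):
    """Parse a subset of ISO 8601 durations (days and time components only)."""
    match = DURATION_RE.match(value)
    if not match or value == "P":
        raise ValueError("unsupported duration: %r" % value)
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)


parse_duration("P1DT12H")  # run interval of one and a half days
```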
> The Management Interface will assign to a given job a unique id (hereafter
> referred to as `job_id`) that can be used to unschedule jobs.
>
> It should be possible for somebody interacting with the Management Interface to
Here I'd say "an authorised operator"
> ### 4.1.1 Run test
>
> This action is used to instruct a probe to run a certain OONI test once. For
> sake of clarity we omit the common elements from the above format.
Suggestion: can we design a format by which we can omit common fields unless necessary to override some defaults, such that the JSON that will travel on the wire is actually the one in here?
Unrelated: It would help to clarify in some way between which two entities this message will travel.
> RUNNING --> COMPLETED;
> COMPLETED --> [*];
> @enduml
> )
Is this a picture?
Ah, never mind: it is a picture (I just checked by viewing the document)
Yeah it's a UML chart.
> # Links
>
> These
I think we can omit this "these"
> * The country of a probe
>
> * The Network of a probe
bikeshedding: I don't think we should capitalize network here
> Management Interface** all the subscribed probes that are affected by that
> event will be notified and will begin performing the specified action.
>
> The **OPOS Scheudler** is the scheduling engine responsible for figuring out
Typo here
> # 3.0 Transport protocol requirements
>
> We use the term "Transport" loosely here to mean the protocol that is being
I don't think Transport should be capitalised here (and let's paint the bike shed ooni-blue)
> The base data format is the following:
>
> ```json
bikeshedding: if we declare this as json, the comment (which is not json) will be badly coloured. I think it's better either to have valid JSON or not to declare this as JSON.
Yes you are right, I guess I should declare it as hjson (https://hjson.org/).
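An alternative to switching the fence to hjson is to keep valid-JSON-plus-comments in the spec and strip the comments before parsing. A sketch of that approach (the `strip_line_comments` helper is hypothetical; note it must not treat the `//` inside URL strings as a comment):

```python
import json


def strip_line_comments(text):
    """Remove //-style line comments while preserving // inside JSON strings."""
    out_lines = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        result = []
        i = 0
        while i < len(line):
            ch = line[i]
            if in_string:
                result.append(ch)
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            else:
                if ch == '"':
                    in_string = True
                    result.append(ch)
                elif ch == "/" and i + 1 < len(line) and line[i + 1] == "/":
                    break  # rest of the line is a comment
                else:
                    result.append(ch)
            i += 1
        out_lines.append("".join(result))
    return "\n".join(out_lines)


annotated = '''
{
    "url": "https://torproject.org/",  // the // in the URL must survive
    "delay": 60  // upper bound in seconds
}
'''
data = json.loads(strip_line_comments(annotated))
```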
> },
> "probe_asn": "",
> "probe_id": "",
> "probe_family": "",
What is this family?
I guess it's a bit confusing to include something that is not yet implemented in ooniprobe here, but my reasoning was that at some point we may want to group ooniprobes together by family, so that you can say "the ooniprobes run by ORGX" and there are 10 of them, each with their own ID.
> `filter`:
> The `where` clause in the filter definition is an implementation of the [loopback
> style filters](https://github.com/strongloop/loopback-filters), but with "in"
> in place of "inq" for clarity.
I know I originally advocated for `in` rather than `inq`. Now, I'm wondering whether keeping `inq` would be better, because we would adhere to the standard and have one less diff-y thing to remember.
Yeah, that was also my reasoning for using `inq` in the beginning. I don't have a strong opinion either way (maybe we can say that both are supported?)
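Accepting both spellings is cheap to implement. A sketch of a minimal loopback-style `where` matcher that treats `in` and `inq` as synonyms (only equality and the in/inq operator are handled; the field names `probe_cc`/`probe_asn` are illustrative):

```python
def matches(where, probe):
    """Evaluate a minimal loopback-style `where` clause against probe attributes.

    Only equality and the in/inq operator are sketched; the real loopback
    filter language supports many more operators.
    """
    for field, condition in where.items():
        value = probe.get(field)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op in ("in", "inq"):  # accept both spellings
                    if value not in operand:
                        return False
                else:
                    raise ValueError("unsupported operator: %r" % op)
        elif value != condition:
            return False
    return True


# A probe in Italy on AS30722 matches a filter targeting several countries:
where = {"probe_cc": {"in": ["IT", "GR"]}, "probe_asn": "AS30722"}
```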
> * **COMPLETED**: when a job has finished executing it enters the COMPLETED state.
>
> # Links
I'd perhaps link inline rather than at the bottom, so it's more clear what refers to what
I am just going to drop the links that are not mapped to anything in particular. The reason I put chronos there originally was that it was part of the inspiration for the scheduler format.
I don't want to prevent us from moving forward with this specification more than necessary.
I feel like most of my comments can be addressed also at a later time, when we start making a prototype.
opos/Measurements-and-url-policy.md (outdated)

> Measurements should be scheduled via OPOS and/or any other system run by OONI
> to gather network measurement data.
>
> 1. To the extent that it is possible we will always do what is best for users of OONI.
I guess the policy should mention that users can opt out of OPOS and that it will not be enabled by default.
I think that this is a bit out of scope for the policy document, as this defines the governance and rules for when OPOS is in place. Whether it is in place for all users, opt-in, disableable or whatnot is an implementation detail.
BTW I think we should have it enabled by default for all new users and prompt old users to enable it.
> BTW I think we should have it enabled by default for all new users and prompt old users to enable it.
Yes, I agree with this
opos/Measurements-and-url-policy.md (outdated)

> # Measurements and URL policy
>
> This document explains the policy we follow when considering what URLs and
> Measurements should be scheduled via OPOS and/or any other system run by OONI
The OPOS initials could be expanded here.
> measurement must be signed off by at least one other person.
>
> 7. Anybody in the governance roster can nominate somebody else to join the
>    roster. Their inclusion shall be discussed and once there is consensus they
I guess the inclusion of people or entities must (and not shall) be discussed beforehand.
I think shall and must are synonyms in this context, though shall sounds a bit softer to me.
opos/Measurements-and-url-policy.md (outdated)

> a. Have a legittimate reason to do so (ex. you do research on censorship)
>
> b. Be a respected member of the OONI community
We should be very careful with people using the OPOS not only for (obvious) reasons of malice but also for reasons of experimentation failure, e.g. research on a new censorship analysis that could de-anonymize users.
I assume that the bigger this governance roster becomes, the harder it will be to control OPOS usage.
> The target latency for the OPOS is of around 1 minute. This means that from the
> time a new instruction is inserted into the OPOS to the moment that the
> affected probes receive it 1 minute should pass.
I find the latency window quite short; is there any reasoning behind this latency?
By short you mean it's too little? 1 minute of latency is actually fairly high, even with long polling you are going to achieve latencies that are less than 1 minute.
By short I mean too little; given that there are a number of probes on very slow links, getting a new instruction per minute can be too much.
Even if the link is slow (in terms of throughput), they will most likely have a latency < 60 seconds.
If they do have a latency > 60 seconds, TCP sockets will time out before they are established and bumping the time up will not help much there.
In which circumstances do you imagine there being a > 60 seconds latency with OPOS?
I imagine that in most cases the probes will not have completed the scheduled job in less than 60 seconds. I can't think of many good reasons for having such a short polling interval (for ooni-probes).
The 1 minute latency is from when they receive it; that doesn't mean it will take them less than 1 minute to complete it.
Anyways I don't want to fixate on this as in the end this is going to be an implementation detail. The purpose of specifying this is that the order of magnitude we are targeting is not seconds, but rather minutes and not hours.
Would it make you more comfortable if I were to say "minutes" instead of "1 minute"?
Yeah, minutes rather than one minute is reasonable as well. In any case, the probe's ability to refuse running a job is IMO our best line of defense against over-scheduling.
Sure, that makes sense and it's definitely not a blocker for the OPOS spec.
Thanks for working on this @hellais.
> * The country of a probe
>
> * The Network of a probe
Some useful granular items that could be added:
- Installed tests: list of installed tests per probe.
- Allowed tests: list of permitted tests to be run per probe.
- Type of collector (onion, https, cloudfront) used to submit reports
- Privacy options enabled per probe (includeip, includeasn, includecountry)
- Size of measurement quota (limit, free %)
I don't think these should go inside of the high-level design goals. The purpose of this section is to set the expectations as to what broad type of problem this system is trying to solve (schedule measurements based on network and location of a probe); the specific requirements, which are due to implementation details, are too low level for this section.
These are some important properties that need to be enumerated in advance, otherwise we may "violate" some users' options.
Some possible scenarios:
- Schedule tests that are not installed by probes.
- Schedule tests that are not "allowed" to run, e.g. the http header field manipulation test.
- Schedule a test with a long list of experiments that requires a decent amount of disk space on probes reaching their disk quota capacity.
> Some possible scenarios:
> - Schedule tests that are not installed by probes.
> - Schedule tests that are not "allowed" to run ie: the http header field manipulation test.
> - Schedule a test with a long list of experiments that require a decent amount of disk space to probes reaching their disk quota capacity.
I think these are implementation details that don't necessarily have to live in this document, or at least not in this section. As an operator I don't care to specify these options, and the orchestrator should be smart enough to figure out which of the available probes is capable of running my measurements.
I can add a note about the fact that certain attributes of the probe need to be sent to the OPOS in order for it to be able to resolve these constraints, in `4.0 Jobs`.
> A user interested in seeing what has been scheduled in the past can visit the
> **OPOS Event log** and see a history of what has happenned.
A note should be added with the default log retention period.
There is no log retention date. We should store all logs forever.
There are a number of probes that cannot handle storing all logs forever.
Perhaps it makes sense to add a default log retention plan rather than end up with probes running out of disk space.
> There are a number of probes that cannot handle storing all logs forever.

The logs are not stored by the probes, but by the probe orchestration system.
For privacy reasons it would be better if these logs are stored only on the probe(s) that performed the scheduled tests and are not stored in a service online.
> For privacy reasons it would be better if these logs are stored only on the probe(s) that performed the scheduled tests and are not stored in a service online.

But as you pointed out before, probes can have limited disk space available and can't store them all.
Moreover, there is nothing stopping somebody from subscribing to the event feed as every possible probe type and logging all the messages.
I think the assumption is that it's safe to log all the scheduled events as they are being broadcast publicly.
What exactly is the privacy implication of making the scheduled events publicly known?
> example by specifying `"delay": 60` the action will be triggerred after a
> number of seconds going from `0-60`.
>
> `schedule`: The scheduling for the job, in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format. Consists of 3 parts separated by /:
extra /
The `/` character is the separator for the 3 parts. I should probably wrap it in backticks though.
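Assuming the three parts follow the ISO 8601 repeating-interval layout (repeat count, start time, duration), splitting on `/` is enough to recover them. A sketch (the `parse_schedule` helper and the "empty repeat count means repeat forever" convention are illustrative assumptions):

```python
def parse_schedule(schedule):
    """Split an ISO 8601 repeating interval into (repetitions, start, duration).

    An `R` with no number (e.g. "R/.../...") is read as "repeat forever",
    returned here as None.
    """
    repeat, start, duration = schedule.split("/")
    if not repeat.startswith("R"):
        raise ValueError("schedule must start with an R component")
    count = int(repeat[1:]) if len(repeat) > 1 else None
    return count, start, duration


# Run once a day, forever, starting at midnight UTC on 2017-01-01:
parse_schedule("R/2017-01-01T00:00:00Z/P1D")
```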
> "url": "https://torproject.org/"
> }
> }
> }
Isn't a `job_id` needed to schedule a test?
No, as is mentioned below:

> The Management Interface will assign to a given job a unique id (hereafter referred to as `job_id`) that can be used to unschedule jobs.
> The OPOS Scheduler keeps track of the lifecycle of a job. With respect to an
> **OPOS Client** a job can be in one of the following states:
>
> * **READY**: this is the state in which a job is when it is first added via the OPOS Management Interface.
For consistency this can be: this is the state a job is in when [...]
> progress of any given task.
>
> * **COMPLETED**: when a job has finished executing it enters the COMPLETED state.
For consistency, the REJECTED, RUNNING and COMPLETED states can start with: this is the state a job is in when [...]
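The job lifecycle under discussion can be sketched as a small state machine. The state names come from the spec; the exact transition set is an assumption based on the UML chart quoted earlier (READY can go to RUNNING or REJECTED, RUNNING to COMPLETED):

```python
from enum import Enum


class JobState(Enum):
    READY = "READY"          # just added via the OPOS Management Interface
    REJECTED = "REJECTED"    # the client declined to run the job
    RUNNING = "RUNNING"      # the client is executing the job
    COMPLETED = "COMPLETED"  # the job finished executing


# Allowed transitions, assumed from the UML chart in the spec:
TRANSITIONS = {
    JobState.READY: {JobState.RUNNING, JobState.REJECTED},
    JobState.RUNNING: {JobState.COMPLETED},
    JobState.REJECTED: set(),    # terminal
    JobState.COMPLETED: set(),   # terminal
}


def advance(current, target):
    """Move a job to `target`, refusing transitions the chart does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError("illegal transition %s -> %s" % (current.value, target.value))
    return target
```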
opos/Measurements-and-url-policy.md (outdated)

> roster. Their inclusion shall be discussed and once there is consensus they
> are added. To be included in the roster you need to:
>
> a. Have a legittimate reason to do so (ex. you do research on censorship)
Typo: s/legittimate/legitimate
> which clients to notify of jobs and how.
>
> A user interested in seeing what has been scheduled in the past can visit the
> **OPOS Event log** and see a history of what has happenned.
Typo: s/happenned/happened
> by specifying the optional `delay` key.
>
> We use JSON serialization for representing the message structures inside of
> this document, however if due to other contraints JSON is not an adequate
Typo: s/contraints/constraints
> you can say: `"platform": "macos"` or `"platform": ["macos", "android"]`.
>
> `delay`: is an upper bound on the number of seconds to delay the action by. For
> example by specifying `"delay": 60` the action will be triggerred after a
Typo: s/triggerred/triggered
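The `delay` semantics quoted above amount to client-side jitter. A sketch (the spec only fixes the upper bound; drawing uniformly from `[0, delay]` is an assumption about the distribution):

```python
import random


def jitter_seconds(delay):
    """Pick the actual wait before triggering the action.

    Drawing uniformly from [0, delay] spreads out the load of many probes
    starting the same job at the same moment.
    """
    return random.uniform(0, delay)


wait = jitter_seconds(60)  # somewhere between 0 and 60 seconds
```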
> * The run interval, defined following the ["Duration" component of the ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) standard.
>   Durations are indicated as a string in the format
>   `P[n]Y[n]M[n]DT[n]H[n]M[n]S`, where `[n]` is a number representing a value
>   for the follwoing date or time element.
Typo: s/follwoing/following
> orchestrated probes. That is to say that all measurements scheduled via OPOS
> must be recorded in an append only log so that it's possible, upon request,
> for a user to learn what jobs have affected their probe.
How, and from which entity, can a probe user request the backlog of past scheduled tasks?
> OPOS Event log: Through this it is possible to inspect the history of
> instrumented events.
Based on feedback from @anadahz & @bassosimone.
I think all feedback should have been addressed. Can I proceed with merging this?
Hit it!
This branch includes work in progress on the probe orchestration system.