
Feature/opos #81

Merged: hellais merged 18 commits from feature/opos into master on Feb 7, 2017

Conversation

@hellais (Member) commented on Jan 9, 2017:

This branch includes work in progress on the probe orchestration system.

@bassosimone (Contributor) left a comment:

Interesting read! I've made some comments aimed at improving the document and clarifying some points.

There should be full transparency over what is being scheduled to be run on
orchestrated probes. That is to say that it should be possible for anybody to
inspect the log of all the instructions and determine which ones have affected
their probe.

Contributor:

The possibility for anybody to inspect seems to contradict "either publicly or privately depending on what is safer for the user", doesn't it?

Components:

**OPOS Client**: This is generally an instance of ooniprobe (or
ooniprobe-mobile) that is subscribing to the OPOC event feed.

Contributor:

Shouldn't this be "OPOS event feed" rather than "OPOC"?

on and their progress.

**OPOS Event feed**: This is the backend component responsible for emitting
events to clients and informing them of what they should be doing.

Contributor:

I don't completely get the difference between the scheduler and the feed, and specifically how they will be implemented in practice: is the scheduler a daemon and the feed a database (or a queue)?

Contributor:

Or is the feed the output of the scheduler?

Member Author:

As a matter of fact, the "scheduler" was added after the "event feed". Originally I had excluded the "scheduler" from the architecture, as I was interested in documenting only what is exposed, and the "scheduler" is in a way an architectural construct that is transparent to the user.
However, in specifying this I felt the need for this architectural construct, so I added it.

To exemplify this in a more concrete way, a possible implementation of this would be:

* OPOS Event feed: a websockets interface that clients subscribe to in order to receive notices of what they should do
* OPOS Scheduler: the engine responsible for knowing which clients should be notified of events
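
To make that concrete, here is a minimal sketch of the client side of such a subscription, assuming a hypothetical `wss://opos.example/events` endpoint and JSON event payloads (both are illustrative, not part of the spec):

```python
import asyncio
import json

import websockets  # third-party: pip install websockets


async def subscribe(feed_url: str = "wss://opos.example/events"):
    # Keep a persistent connection to the (hypothetical) OPOS Event feed and
    # react to each event that the Scheduler decides this probe should receive.
    async with websockets.connect(feed_url) as ws:
        async for raw in ws:
            event = json.loads(raw)
            print("received event:", event.get("action"), event.get("job_id"))
            # ... hand the event off to the local ooniprobe runner here ...


if __name__ == "__main__":
    asyncio.run(subscribe())
```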


When somebody with proper access permissions adds an event via the **OPOS
Management Interface** all the subscribed probes that are affected by that
event will be notified and will begin performing the specified action.

Contributor:

In this description, it would be useful to clarify the role of the scheduler (when does it kick in?).

Member Author:

I added a note about this below.

it on github.com) to avoid it being blocked hence being resilient to censorship.

* It SHOULD be possible to know how many *active* **OPOS Clients** there are at
a given time.

Contributor:

I am not convinced that this requirement is simple to implement with a pub-sub scheme. I mean: when using WebSockets (or perhaps we should consider HTTP/2?), you can keep a persistent connection and thus you can enumerate your clients (even anonymously, if they pass through some collateral freedom thingy). I do not easily see how you can enumerate clients with pub-sub and low latency.

Member Author:

That is why I mentioned this as a SHOULD and not as a MUST. It's something I would like to have the power to know, because it's very useful, but it's not a strict requirement.

I agree that depending on the specific implementation there will be different ways to do this that can be more or less accurate. Worst case, we can do something similar to how Tor counts users: estimate the average number of requests each client makes per day, count the total number of requests, and divide the total by that average.
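
A back-of-the-envelope version of that estimate (all numbers are made up for illustration):

```python
# Worst-case active-client estimate, Tor-style: divide the observed request
# volume by the average number of requests a single client is expected to make.
requests_seen_per_day = 120_000        # observed by the backend (illustrative)
avg_requests_per_client_per_day = 24   # e.g. one poll per hour (assumption)

estimated_active_clients = requests_seen_per_day / avg_requests_per_client_per_day
print(estimated_active_clients)  # -> 5000.0
```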

* The start time of the job in ISO 8601 (YYYY-MM-DDThh:mm:ssTZ, where TZ is the timezone or `Z` for UTC).
An empty start time means start immediately.

* The run interval, defined following the ["Duration" component of the ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) standard.

Contributor:

I'd in any case explain the semantics of this field rather than only redirecting to the standard.
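
For instance, the semantics could be spelled out with a couple of sample values (these are illustrative, not taken from the spec):

```python
# Illustrative ISO 8601 values for the two fields discussed above.
start_time = "2017-02-01T00:00:00Z"  # start at midnight UTC on Feb 1st 2017 ("" means start immediately)
run_interval = "P1DT12H"             # re-run every 1 day and 12 hours
# Other examples: "P7D" -> every 7 days, "PT30M" -> every 30 minutes.
```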

The Management Interface will assign to a given job a unique id (hereafter
referred to as `job_id`) that can be used to unschedule jobs.

It should be possible for somebody interacting with the Management Interface to

Contributor:

Here I'd say "an authorised operator"

### 4.1.1 Run test

This action is used to instruct a probe to run a certain OONI test once. For
sake of clarity we omit the common elements from the above format.

Contributor:

Suggestion: can we design a format by which we can omit common fields unless necessary to override some defaults, such that the JSON that will travel on the wire is actually the one in here?

Contributor:

Unrelated: it would help to clarify in some way between which two entities this message will travel.

RUNNING --> COMPLETED;
COMPLETED --> [*];
@enduml
)

Contributor:

Is this a picture?

Contributor:

Ah, never mind: it is a picture (I just checked by viewing the document)

Member Author:

Yeah it's a UML chart.


# Links

These

Contributor:

I think we can omit this "these"


* The country of a probe

* The Network of a probe

Contributor:

bikeshedding: I don't think we should capitalize network here

Management Interface** all the subscribed probes that are affected by that
event will be notified and will begin performing the specified action.

The **OPOS Scheudler** is the scheduling engine responsible for figuring out

Contributor:

Typo here


# 3.0 Transport protocol requirements

We use the term "Transport" loosely here to mean the protocol that is being

Contributor:

I don't think Transport should be capitalised here (and let's paint the bike shed ooni-blue)


The base data format is the following:

```json

Contributor:

bikeshedding: if we declare this as json, the comment (which is not json) will end up badly coloured. I think it's better either to have valid json or to not declare this as json.

Member Author:

Yes you are right, I guess I should declare it as hjson (https://hjson.org/)

},
"probe_asn": "",
"probe_id": "",
"probe_family": "",

Contributor:

What is this family?

Member Author:

I guess it's a bit confusing to include something that is not yet implemented in ooniprobe here, but my reasoning behind it was that at some point we may want to group ooniprobes together by family, so that you can say "the ooniprobes run by ORGX" and there are 10 of them, each with its own ID.

`filter`:
The `where` clause in the filter definition is an implementation of the [loopback
style filters](https://github.com/strongloop/loopback-filters), but with "in"
in place of "inq" for clarity.

Contributor:

I know I originally advocated for in rather than inq. Now, I'm wondering whether keeping inq would be better because we adhere to the standard and have one less diff-y thing to remember.

Member Author:

Yeah, that was also my reasoning for using inq in the beginning. I don't have a strong opinion either way (maybe we can say that both are supported?)
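
To make the discussion concrete, an illustrative `filter` in the loopback style could look like the following; the field names reuse ones mentioned elsewhere in this document, and whether `in`, `inq` or both end up being accepted is still open:

```json
{
  "where": {
    "platform": {"in": ["macos", "android"]},
    "probe_asn": "AS3269"
  }
}
```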

* **COMPLETED**: when a job has finished executing it enters the COMPLETED state.


# Links

Contributor:

I'd perhaps link inline rather than at the bottom, so it's clearer what refers to what.

Member Author:

I am just going to drop the links that are not mapped to anything in particular. The reason why I originally put chronos there was that it provided part of the inspiration for the scheduler format.

@bassosimone (Contributor) left a comment:

I don't want to prevent us from moving forward with this specification more than necessary.

I feel like most of my comments can also be addressed at a later time, when we start making a prototype.

Measurements should be scheduled via OPOS and/or any other system run by OONI
to gather network measurement data.

1. To the extent that it is possible we will always do what is best for users of OONI.

Contributor:

I guess it should be mentioned in the policy that users can opt out of OPOS and that it will not be enabled by default.

Member Author:

I think that this is a bit out of scope for the policy document, as it defines the governance and rules for when OPOS is in place. Whether it is in place for all users, opt-in, disableable or whatnot is an implementation detail.

BTW I think we should have it enabled by default for all new users and prompt old users to enable it.

Contributor:

> BTW I think we should have it enabled by default for all new users and prompt old users to enable it.

Yes, I agree with this

# Measurements and URL policy

This document explains the policy we follow when considering what URLs and
Measurements should be scheduled via OPOS and/or any other system run by OONI

Contributor:

The OPOS initials could be expanded here.

measurement must be signed off by at least one other person.

7. Anybody in the governance roster can nominate somebody else to join the
roster. Their inclusion shall be discussed and once there is consensus they

Contributor:

I guess the inclusion of people or entities must (and not shall) be discussed beforehand.

Member Author:

I think shall and must are synonyms in this context, though shall sounds a bit softer to me.


a. Have a legittimate reason to do so (ex. you do research on censorship)

b. Be a respected member of the OONI community

Contributor:

We should be very careful with people using the OPOS, not only for (obvious) reasons of malice but also for reasons of experimentation failure, e.g. research on a new censorship analysis that could de-anonymize users.

I assume that the bigger this governance roster becomes, the harder it will be to control OPOS usage.

The target latency for the OPOS is of around 1 minute. This means that from the
time a new instruction is inserted into the OPOS to the moment that the
affected probes receive it 1 minute should pass.

@anadahz (Contributor) commented on Feb 3, 2017:

I find the latency window quite short; is there any reasoning behind this latency?

Member Author:

By short you mean it's too little? 1 minute of latency is actually fairly high; even with long polling you are going to achieve latencies of less than 1 minute.

Contributor:

By short I mean too little: given that there are a number of probes on very slow links, getting a new instruction per minute can be too much.

Member Author:

Even if the link is slow (in terms of throughput), they will most likely have a latency of < 60 seconds.
If they do have a latency of > 60 seconds, TCP sockets will time out before they are established, and bumping the time up will not help much there.
In which circumstances do you imagine there being a > 60 second latency with OPOS?

@anadahz (Contributor) commented on Feb 7, 2017:

I imagine that in most cases the probes will not have completed the scheduled job in less than 60 seconds. I can't think of many good reasons for having such a short (for ooni-probes) polling interval.

Member Author:

The 1 minute latency is until they receive it; that doesn't mean it will take them less than 1 minute to complete it.

Member Author:

Anyway, I don't want to fixate on this, as in the end it is going to be an implementation detail. The purpose of specifying this is that the order of magnitude we are targeting is not seconds, but rather minutes, and not hours.

Would it make you more comfortable if I were to say "minutes" instead of "1 minute"?

Contributor:

Yeah, minutes rather than one minute is reasonable as well. In any case, the probe's ability to refuse running a job is IMO our best line of defense against over-scheduling.

Contributor:

Sure, that makes sense and it's definitely not a blocker for the OPOS spec.
Thanks for working on this, @hellais.

* The country of a probe

* The Network of a probe

Contributor:

Some useful granular items that could be added:

* Installed tests: list of installed tests per probe.
* Allowed tests: list of permitted tests to be run per probe.
* Type of collector (onion, https, cloudfront) used to submit reports
* Privacy options enabled per probe (includeip, includeasn, includecountry)
* Size of measurement quota (limit, free %)

Member Author:

I don't think these should go inside the high-level design goals. The purpose of this section is to set expectations as to the broad type of problem this system is trying to solve (scheduling measurements based on the network and location of a probe); the specific requirements (due to implementation details) needed to satisfy a job are too low level, for this section at least.

Contributor:

These are some important properties that need to be enumerated in advance, otherwise we may "violate" some users' options.

Some possible scenarios:

* Scheduling tests that are not installed on probes.
* Scheduling tests that are not "allowed" to run, i.e. the http header field manipulation test.
* Scheduling a test with a long list of experiments that requires a decent amount of disk space on probes that are reaching their disk quota capacity.

Member Author:

> Some possible scenarios:
> * Scheduling tests that are not installed on probes.
> * Scheduling tests that are not "allowed" to run, i.e. the http header field manipulation test.
> * Scheduling a test with a long list of experiments that requires a decent amount of disk space on probes that are reaching their disk quota capacity.

I think these are implementation details that don't necessarily have to live in this document, or at least not in this section. As an operator I don't care to specify these options; the orchestrator should be smart enough to figure out which of the available probes are capable of running my measurements.

I can add a note in 4.0 Jobs about the fact that certain attributes of the probe need to be sent to the OPOS in order for it to be able to resolve these constraints.
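
As a sketch of what such a note might end up describing, the probe attributes sent to the OPOS could look something like this (only `probe_asn`, `probe_id`, `probe_family` and `platform` appear elsewhere in this document; the remaining keys are hypothetical illustrations of the constraints discussed above):

```json
{
  "probe_asn": "AS3269",
  "probe_id": "",
  "probe_family": "",
  "platform": "android",
  "installed_tests": ["web_connectivity", "http_header_field_manipulation"],
  "allowed_tests": ["web_connectivity"],
  "free_disk_quota": "80%"
}
```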


A user interested in seeing what has been scheduled in the past can visit the
**OPOS Event log** and see a history of what has happenned.

Contributor:

A note should be added with the default log retention period.

Member Author:

There is no log retention date. We should store all logs forever.

Contributor:

There are a number of probes that cannot handle storing all logs forever.
Perhaps it makes sense to add a default log retention plan rather than end up having probes run out of disk space.

Member Author:

> There are a number of probes that cannot handle storing all logs forever.

The logs are not stored by the probes, but by the probe orchestration system.

Contributor:

For privacy reasons it would be better if these logs were stored only on the probe(s) that performed the scheduled tests and not stored in an online service.

Member Author:

> For privacy reasons it would be better if these logs were stored only on the probe(s) that performed the scheduled tests and not stored in an online service.

But as you pointed out before, probes can have limited disk space available, so they can't store them all.

Moreover, there is nothing stopping somebody from subscribing to the event feed as every possible probe type and logging all the messages.
I think the assumption is that it's safe to log all the scheduled events, as they are being broadcast publicly.

What exactly is the privacy implication of making the scheduled events publicly known?

example by specifying `"delay": 60` the action will be triggerred after a
number of seconds going from `0-60`.

`schedule`: The scheduling for the job, in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format. Consists of 3 parts separated by /:

Contributor:

extra /

@hellais (Member Author) commented on Feb 3, 2017:

The / character is the separator for the 3 parts. I should probably wrap it in backticks though.
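
An example value might help; using the ISO 8601 repeating-interval form `R[n]/<start>/<duration>`, a `schedule` could look like this (values are illustrative):

```python
# "R" with no count means repeat indefinitely; the three /-separated parts are
# the repetition count, the start time, and the duration between runs.
schedule = "R/2017-02-01T00:00:00Z/P1D"                # every day, starting Feb 1st 2017 UTC
schedule_ten_times = "R10/2017-02-01T00:00:00Z/PT6H"   # 10 runs, 6 hours apart
```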

"url": "https://torproject.org/"
}
}
}

@anadahz (Contributor) commented on Feb 3, 2017:

Isn't a job_id needed to schedule a test?

Member Author:

No, as is mentioned below:

> The Management Interface will assign to a given job a unique id (hereafter referred to as job_id) that can be used to unschedule jobs.

The OPOS Scheduler keeps track of the lifecycle of a job. With respect to an
**OPOS Client** a job can be in one of the following states:

* **READY**: this is the state in which a job is when it is first added via the OPOS Management Interface.

Contributor:

For consistency this can be: "this is the state a job is in when [...]"

progress of any given task.

* **COMPLETED**: when a job has finished executing it enters the COMPLETED state.

Contributor:

For consistency, the REJECTED, RUNNING, and COMPLETED states can start with: "this is the state a job is in when [...]"
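
For reference, a rough sketch of the job lifecycle as it emerges from the fragments quoted in this PR; the transition map is an assumption inferred from the quoted UML fragment and state descriptions, not something the spec defines here:

```python
from enum import Enum


class JobState(Enum):
    READY = "READY"          # just added via the OPOS Management Interface
    REJECTED = "REJECTED"    # the probe refused to run the job
    RUNNING = "RUNNING"      # the probe is executing the job
    COMPLETED = "COMPLETED"  # the job has finished executing


# Assumed transitions (not authoritative).
TRANSITIONS = {
    JobState.READY: {JobState.RUNNING, JobState.REJECTED},
    JobState.RUNNING: {JobState.COMPLETED},
}
```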

roster. Their inclusion shall be discussed and once there is consensus they
are added. To be included in the roster you need to:

a. Have a legittimate reason to do so (ex. you do research on censorship)

Contributor:

Typo: s/legittimate/legitimate

which clients to notify of jobs and how.

A user interested in seeing what has been scheduled in the past can visit the
**OPOS Event log** and see a history of what has happenned.

Contributor:

Typo: s/happenned/happened

by specifying the optional `delay` key.

We use JSON serialization for representing the message structures inside of
this document, however if due to other contraints JSON is not an adequate

Contributor:

Typo: s/contraints/constraints

you can say: `"platform": "macos"` or `"platform": ["macos", "android"]`.

`delay`: is an upper bound on the number of seconds to delay the action by. For
example by specifying `"delay": 60` the action will be triggerred after a

Contributor:

Typo: s/triggerred/triggered

* The run interval, defined following the ["Duration" component of the ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) standard.
Durations are indicated as a string in the format
`P[n]Y[n]M[n]DT[n]H[n]M[n]S`, where `[n]` is a number representing a value
for the follwoing date or time element.

Contributor:

Typo: s/follwoing/following

orchestrated probes. That is to say that all measurements scheduled via OPOS
must be recorded in an append only log so that it's possible, upon request,
for a user to learn what jobs have affected their probe.

Contributor:

How, and from which entity, can a probe user request a backlog of the past scheduled tasks?

Member Author:

> OPOS Event log: Through this it is possible to inspect the history of instrumented events.

@hellais (Member Author) commented on Feb 7, 2017:

I think all feedback should have been addressed. Can I proceed with merging this?

@bassosimone (Contributor) commented:

Hit it!

hellais merged commit f449f26 into master on Feb 7, 2017
hellais deleted the feature/opos branch on February 7, 2017 17:11