Doc issues, or maybe I'm dumb #248

Closed
dhalperi opened this issue Mar 13, 2019 · 10 comments

@dhalperi

dhalperi commented Mar 13, 2019

I am finally picking up DD/TD because I want to play with Monoids. New to both Rust and DD.

Going through the mdbook, but at HEAD instead of version 0.7, and am able to explain many differences between my output and the docs.

Things I can't explain, so far:

  1. Step 1, Write a Program: we put the code in src/main.rs, but the suggested command is cargo run --example hello. At both HEAD and v0.7, I get an error that there is no such example. Am I missing something, or is this just a bug?
    (The same problem occurs for all the other commands in cells I've seen.)

  2. Increase the Scale - Don't we need to change the code? The page seems to just have two run commands (that fail for me because of --example hello), with no code changes. I did understand that we probably add --release to get it to go faster.

    Is there some magic linkage to https://github.com/TimelyDataflow/timely-dataflow/blob/master/examples/hello.rs that I've messed up in my setup? How does it know which version of things to run?

I'm guessing these are just omissions/bugs, but I am open to being completely wrong given that I am new to a lot here.

@dhalperi
Author

For posterity:

➜  lca git:(master) ✗ cat Cargo.toml 
[package]
name = "lca"
version = "0.1.0"
authors = ["Daniel Halperin <daniel@halper.in>"]
edition = "2018"

[dependencies]
timely = { git = "https://github.com/TimelyDataflow/timely-dataflow" }
differential-dataflow = { git = "https://github.com/TimelyDataflow/differential-dataflow" }
#timely = "0.7"
#differential-dataflow = "0.7"

@dhalperi
Author

➜  lca git:(master) ✗ ls *
Cargo.lock Cargo.toml

src:
main.rs

target:
debug   release

@dhalperi
Author

dhalperi commented Mar 13, 2019

  3. Consolidate vs Distinct?

    The Concat docs mention that concatenating manages with itself reversed will have duplicate entries for (0, 0). This is, I think, one element with multiplicity 2.

    This exact same example is used on the Consolidate docs, which says "does nothing to the collection except ensure that each element occurs with only one count". Is this "one count" as in "we need to know what the multiplicity of the element is authoritatively" or as in "multiplicity [count] of one"? I initially assumed the latter, which is basically just Distinct in SQL terms at least.

    Then, this text confused me mildly because I thought Distinct would have had a simpler explanation: "We might see two copies of the same element, (0, 0)... This is because for reasons of efficiency, operators like map and concat do not work too hard to "consolidate" the changes that they produce. This is rarely a problem, but it can nonetheless be helpful to consolidate a collection before inspecting it, to ensure that you see the most concise version."

    Finally, there's a separate Distinct operator, which is explained in exactly the semantics I would have expected - set vs bag semantics.

    So how do Distinct and Consolidate differ?

    Maybe Consolidate is actually the former - something about tuples having a single count - but I can't figure out how to inspect the count. So how would it be reflected other than as the same tuple printed twice? [The count doesn't seem to be in the (data, time, diff) triple mentioned in the very first example]

@dhalperi
Author

  4. In the first Arrangement Example, is query backwards?

    Specifically, if knows holds the pairs of the relation and query is (query_id, source) [here, I'm assuming source is in the relation and query_id is an int, nonce, etc., not the same type], IIUC we can't join query and knows without flipping query.
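
To make the concern above concrete, here is a minimal standard-library sketch of a key-based join. It does not use Differential Dataflow; join_by_key and flip are hypothetical helpers, and the shapes assumed here (knows as (source, target), query as (query_id, source)) come only from the description in this comment. Because query_id has a different type from source, the unflipped query does not even type-check as a join partner for knows.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Joins two relations on their first field (the key), the way a
/// key-based join pairs up records. Hypothetical helper, not a library API.
fn join_by_key<K, A, B>(left: &[(K, A)], right: &[(K, B)]) -> Vec<(K, A, B)>
where
    K: Copy + Eq + Hash + Ord,
    A: Copy + Ord,
    B: Copy + Ord,
{
    // Index the right-hand relation by key.
    let mut index: HashMap<K, Vec<B>> = HashMap::new();
    for &(key, val) in right {
        index.entry(key).or_default().push(val);
    }
    // Probe the index with each left-hand record.
    let mut out = Vec::new();
    for &(key, a) in left {
        if let Some(bs) = index.get(&key) {
            for &b in bs {
                out.push((key, a, b));
            }
        }
    }
    out.sort(); // deterministic order for inspection
    out
}

/// Swaps (query_id, source) into (source, query_id) so that `source`
/// becomes the join key. Assumed shapes; see the text above.
fn flip(queries: &[(u64, u32)]) -> Vec<(u32, u64)> {
    queries
        .iter()
        .map(|&(query_id, source)| (source, query_id))
        .collect()
}
```

With knows = [(1, 2), (1, 3), (2, 3)] and query = [(100, 1)], calling join_by_key(&flip(&query), &knows) yields the two records reachable from source 1, each tagged with query id 100.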

@frankmcsherry
Member

These are super helpful. I'm on the road at the moment (4am locally) but I'll start working through these! Some are doc issues, some are "no, that should just work!".

@comnik
Member

comnik commented Mar 13, 2019

Hi! Chiming in here, as Frank is supposed to be on a beach somewhere right now.

  1. This indeed looks like an error in the docs. It is probably easiest to add your own example to the examples/ folder in the Timely repo and run cargo run --example <your_example>. Cargo has first-class support for these examples. You can of course also create an examples/ folder in your own repository and run them the same way.

  2. This is also not on you: there is indeed no scale parameter in the current hello example. You could try the bfs example like so: cargo run --example bfs 100 1000, where the first argument is the number of nodes and the second is the number of edges in a randomly generated graph.

In any case, --release does make a significant difference to both performance and, unfortunately, compile times.

To increase the "scale" as in "number of workers", no code change is necessary! This is controlled by the -w parameter seen in action in chapter 2.2.

@comnik
Member

comnik commented Mar 13, 2019

  3. Consolidate vs Distinct – a fan favourite! First of all, the Distinct operator does what you would think. It ensures that there is at most one of every datapoint in the collection.

Consolidate is more interesting, and somewhat unique to Differential. Assume a collection of edges. At t_0, we introduce an edge (a, b), which is represented as ((a, b), t_0, +1). Then at t_1 we retract that edge again: ((a, b), t_1, -1). Now, logically the edges collection is empty, but physically, there are two tuples lying around. This can be problematic in iterative dataflows, but also leads to potentially confusing outputs for clients. It's like asking "Is b reachable from a?" and Differential responding "Yes... but no!" when you'd prefer a simple "no".

To a first approximation, Consolidate ensures that the physical representation corresponds to the logical representation.

By a single count, we refer to the number of diffs we have to look at to get the full picture of the collection at a point in time, not the magnitude of the diffs themselves.

Another example would be adding the same tuple multiple times. This might lead to a physical representation of ((a, b), t_0, 1), ((a, b), t_0, 1) when you'd like to see a concise ((a, b), t_0, 2). Consolidate does this for you.
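
The difference can be sketched with plain standard-library Rust, without Differential Dataflow itself. Everything here is a simplified model and an assumption: Update, consolidate, accumulate and distinct are hypothetical names, elements are pairs of integers, and times are plain integers. In this model, consolidation only merges updates that share both data and time; how the real operator treats updates at different times is not modeled.

```rust
use std::collections::BTreeMap;

/// One update in the (data, time, diff) form discussed in this thread.
type Update = ((u32, u32), u64, i64);

/// Model of consolidation: sum the diffs of updates that share the same
/// data and time, and drop entries whose summed diff is zero.
fn consolidate(updates: &[Update]) -> Vec<Update> {
    let mut sums: BTreeMap<((u32, u32), u64), i64> = BTreeMap::new();
    for &(data, time, diff) in updates {
        *sums.entry((data, time)).or_insert(0) += diff;
    }
    sums.into_iter()
        .filter(|&(_, diff)| diff != 0)
        .map(|((data, time), diff)| (data, time, diff))
        .collect()
}

/// The logical collection at `as_of`: the multiplicity of each element
/// after adding up all diffs at times <= `as_of`.
fn accumulate(updates: &[Update], as_of: u64) -> BTreeMap<(u32, u32), i64> {
    let mut counts = BTreeMap::new();
    for &(data, time, diff) in updates {
        if time <= as_of {
            *counts.entry(data).or_insert(0) += diff;
        }
    }
    counts.retain(|_, count| *count != 0);
    counts
}

/// Model of distinct (set semantics): every element present with a
/// positive multiplicity is reported with multiplicity exactly 1.
fn distinct(counts: &BTreeMap<(u32, u32), i64>) -> BTreeMap<(u32, u32), i64> {
    counts
        .iter()
        .filter(|(_, count)| **count > 0)
        .map(|(data, _)| (*data, 1))
        .collect()
}
```

In this model, two copies of ((0, 0), 0, 1) consolidate to a single update with diff 2, so the multiplicity is preserved but reported once, whereas distinct reports the same element with multiplicity 1. An insertion at time 0 followed by a retraction at time 1 stays as two updates here, while accumulate at time 1 shows the logical collection is empty.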

@frankmcsherry
Member

I believe TimelyDataflow/differential-dataflow#158 should address these questions. Sorry for the confusion!

(NB: it is technically a different repo; which was the source of my initial confusion. you were right about all reported issues).

Also, thanks very much to @comnik for stepping in with explanations!

@frankmcsherry
Member

Also, if it turns out I was overly optimistic and the fixes don't address core ambiguities please do feel free to say as much. :) While they might be clearer now, the goal for sure is to make them clearer for future folks as well!

@dhalperi
Author

(NB: it is technically a different repo; which was the source of my initial confusion. you were right about all reported issues).

Sorry -- I had 3 different mdbooks and 2 different GitHub repos open and created the issue in the wrong one :)
