
Refactor ThroughputModifications #14

Merged: 17 commits into dev, Aug 6, 2018

Conversation

David-Durst (Owner):

This PR mainly reworks ThroughputModifications.hs so that others can understand it. @kayvonf, this is the PR you are looking for.

This PR also reorders the arguments to ReduceOp and renames them, so ReduceOp 4 1 now means 4 tokens are reduced together (called numTokens) with 1 token received per clock (called parallelism or par).
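
To make the new naming concrete, here is a minimal standalone sketch (illustrative only; the Op type below is a stand-in, not the IR's actual definition):

-- Illustrative stand-in for the IR's ReduceOp, not the real type.
data Op = ReduceOp Int Int  -- ReduceOp numTokens par

-- Clocks per full reduction, assuming par evenly divides numTokens:
clocksPerReduce :: Op -> Int
clocksPerReduce (ReduceOp numTokens par) = numTokens `div` par

-- clocksPerReduce (ReduceOp 4 1) == 4  (1 token per clock)
-- clocksPerReduce (ReduceOp 4 4) == 1  (fully parallel)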

@akeley98 - I had to reorder the args for reduce in the simulator. Please confirm I did it the correct way.

@akeley98 (Collaborator):

The simulator's ReduceOp still seems okay. All tests still pass.

@akeley98 (Collaborator) left a review:

Simulator ReduceOp still works correctly.

1. Leaf, non-modifiable rate - these ops are sped up and slowed down by wrapping
them in a parent, modifiable rate op such as map and underutil.

2. Leaf, modifiable rate - these ops are sped up and slowed down by trying to
Collaborator:

"trying to adjust their rate" --> trying to adjust their rate via changes to rate parameters. Since not all ops in this category permit all possible rates, it may not be possible to speed up/slow down these ops in a cases.

<Meta: I'm thinking we could be a little more precise about use of the term "adjust their rate" in these comments, since we also talk about changing the rate of a non-modifiable rate op. I get what we mean, but we might confuse someone if the context is less clear.>

David-Durst (Owner, Author):

I agree that rate is not the best term. I've switched to the term "throughput parameter". I introduce the term at https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files#diff-a4e9be19161290150b0e024cdc796b07R7 and https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files#diff-a4e9be19161290150b0e024cdc796b07R33. Thoughts? I believe this clears up the confusion.

I've also decreased the confusion by renaming the categories indirectly and directly scalable. See https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files#diff-a4e9be19161290150b0e024cdc796b07R7 for their definitions.
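
Roughly, the two categories behave like this sketch (illustrative only; these helper names are made up here, and the real definitions are in the linked diff):

-- Illustrative only; these helpers are hypothetical, not the file's code.
-- Indirectly scalable: wrap the op in a modifiable rate parent.
speedUpIndirectly mult op = (MapOp mult op, mult)
-- Directly scalable: adjust the op's own throughput parameter,
-- e.g. the par argument of ReduceOp.
speedUpDirectly mult (ReduceOp numTokens par) =
  (ReduceOp numTokens (par * mult), mult)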

-- ASSUMPTION: the user has specified a type for these ops and passes will not
-- change the type because that would change the semantics of the
-- program. For example, one could speed up an Add T_Int by making it an
-- Add $ T_Array 2 T_Int, but that would change the program's meaning.
Collaborator:

Unclear what you mean here by "change the program's meaning", since by some definition parallelizing the program also changes its meaning, in that the input and output types of the program are now different. Similarly, wrapping an ADD in a map vs. defining add polymorphically to work on T_Array 2 T_Int conceptually produces the same set of singleton values.

Specifically, I think what this comment is trying to say is that we're making the assumption that basic combinational ops are defined on a few base types, and that more complex circuits are created from compositions of these ops. If we want an adder to operate on an array of size N, well, that requires a program that is map N (add).

I think this is a good assumption for the following reason: while the meaning of invoking ADD on array data types is clear, the meaning of invoking ADD on tuples is less so. (Do we field-wise add the tuple's fields?) As such, I think it's intuitive to say that ADD is only defined on simple base types (it's not polymorphic on arrays), and that adding more complex data types is expressed in the language via maps (to add arrays) and, for tuples, via circuits with the appropriate tuple extractions and creations.
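
For instance, under this assumption an N-wide add is written as a composition rather than a polymorphic add (a sketch; Add T_Int stands in for the IR's integer add op, as in the quoted comment above):

-- Sketch: an N-wide adder as map N (add), not a polymorphic Add.
addArrayN n = MapOp n (Add T_Int)  -- adds length-n int arrays elementwise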

David-Durst (Owner, Author):

I think we mean the same thing, but have different definitions of base types. I'm including homogeneous arrays in my list of base types. I'm doing this because there are certain applications where nested arrays of bits are the basic unit of data. For example, operating over 3-channel, 8-bit-depth image data means that the basic, inseparable unit of data is Bit[8][3]. Speed up and slow down can make the pipeline handle more or fewer of these inseparable units per clock, but they cannot change the unit itself.

Aetherling’s type system has no tuple. Therefore, the question of what types an Add can be over doesn’t involve tuples. The implementation of Add is well-defined for all types currently allowed by the Aetherling Haskell IR.
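
For example, under this definition the pixel unit might be modeled like this (a sketch; I'm assuming a T_Bit base type alongside the T_Int and T_Array seen above):

-- Sketch: an 8-bit, 3-channel pixel as one inseparable base token.
pixelType = T_Array 3 (T_Array 8 T_Bit)
-- Speed up / slow down may change how many such tokens flow per clock,
-- but must never split the token itself.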

-- mapping over a MemRead/MemWrite is banking it, and wrapping
-- the type with an array is making a single memory read/write more
-- data per clock.
attemptSpeedUp requestedMult op@(MemRead _) = (MapOp requestedMult op, requestedMult)
@kayvonf (Collaborator), Aug 1, 2018:

#braindump as this review caused me to think about how we treat memories.

TLDR 1: I think memRd and memWr should be leaf modifiable rate operators given their current definition.

TLDR 2: in the comment on line 102 above, "banking" should be replaced by "gathering".

=====================

We currently consider MemRead and MemWrite to be stateful ops (as given in Operations/Properties.hs). These ops are stateful because they are currently defined to internally contain an address generator. And we know that we cannot throw a map around a stateful op without changing the program's behavior (recall the linebuffer example -- each copy of the linebuffer now gets a different substream of non-consecutive elements).

It's possible, and perhaps preferable, to think of a memRd unit as two units that are composed via a composeSeq: (1) a stateful address generator that emits a stream of addresses of a specified length (for simplicity, let's consider this stream a bunch of integers: 0 1 2 3 4..., or, if there was some form of stride functionality in the address generator, 0 4 8 12 16...), and (2) a stateless unit MemAccess that actually accesses memory at the incoming address, emitting the value at that address. So now consider the following:

KayvonMemRead = AddrGen N 1 |>>=| MemAccess

(where AddrGen N 1 is an address generator generating N elements at throughput 1, and MemAccess is a stateless memory access unit.)

Now let's consider what speedup k KayvonMemRead should do:

Let's consider how to speed up AddrGen N 1.

Option 1: If I make it an op that speeds up itself, I'd just change it to: AddrGen N k.
I'd expect the semantics are for it to emit vectors of addresses: for k=2, [0 1] [2 3] [4 5] etc. Since the addresses are consecutive, executing the subsequent memory access just needs a wider memory system.

Alternative 2: Consider throwing a map around AddrGen N 1 to speed it up: map k AddrGen N 1 would yield k concurrent streams of size N, which is clearly wrong. That's k times more elements being generated.

Alternative 3: Throwing a map around the op, and also dividing the resulting stream length is interesting to consider:
speedup k AddrGen N 1 = map k AddrGen (N/k) 1, but it yields the following pattern for k=2:
[0 2] [1 3] [2 4]...

This results in N total elements in a stream of length N/2, with element type int[2]. One might think that regardless of whether I choose option 1 or alternative 3 in how to speed up address generation, the stateless memAccess could be sped up as (map k MemAccess). However, this is not true.

To me (map k MemAccess) models k copies of a basic memory access unit. The meaning here is that the HW could make k accesses to k independent addresses at a time, since there is nothing known about the addresses being provided as input. The resulting memory itself would have to be able to arbitrate these k simultaneous requests.

However, with knowledge of the incoming stream, very different, and far more efficient, implementations of map k MemAccess could be built. For example, if it was known that the address stream was that produced by Option 1 above, then the memory need only handle one bulk access at a time, with that access being to an array of consecutive values. Let's call this parallel memory access operator MemAccessArray k.

Therefore...
map k MemAccess would be implemented as a special unit MemAccessArray k if the incoming stream had consecutive addresses in each token. This would be implemented via a banked SRAM, or just a wider memory port.

Otherwise...
map k MemAccess would be equivalent to MemAccessGather k, a much more complex hardware unit.

So in other words, I feel like memRd needs to be a leaf modifiable rate operator. Or, if it were expressed as a composeSeq of my hypothetical AddrGen and MemAccess/MemAccessArray/MemAccessGather ops, then all these ops need to be modifiable rate, and there's the equivalence:

memRd k = AddrGen N k |>>=| MemArray Rd k
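
A standalone sketch of this hypothetical decomposition (none of these constructors exist in the current IR; the names follow the ones proposed above):

-- Hypothetical ops only; nothing here is in the current IR.
data HypMemOp
  = AddrGen Int Int               -- AddrGen n k: emit n addresses, k per clock
  | MemAccessArray Int            -- one bulk access to k consecutive addresses
  | MemAccessGather Int           -- k arbitrary accesses, arbitrated memory
  | ComposeSeq HypMemOp HypMemOp  -- stand-in for |>>=|

hypMemRead :: Int -> Int -> HypMemOp
hypMemRead n k = ComposeSeq (AddrGen n k) (MemAccessArray k)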

David-Durst (Owner, Author):

This will require a significant simulator change. I will delay this change until later.

David-Durst (Owner, Author):

See #17 for progress on this comment.

-- for these operations, the user has not specified the type separately from the
-- array length. Therefore, speedup may modify the meaning of the program
-- by changing the types of these ops.
attemptSpeedUp requestedMult op@(ArrayReshape _ _) =
Collaborator:

For some ops, it makes sense to add a comment intuitively describing the meaning of speeding up the op.

For ArrayReshape, this just means that the resulting sped-up op is now reshaping K input arrays at once. For this reason, it's just like any other leaf, non-modifiable rate op.

-- by changing the types of these ops.
attemptSpeedUp requestedMult op@(ArrayReshape _ _) =
  (MapOp requestedMult op, requestedMult)
attemptSpeedUp requestedMult op@(DuplicateOutputs _ _) =
Collaborator:

This is probably not the right place to put this suggestion, but we could consider whether we should change the name of Duplicate to Replicate, since it's not necessarily 2x.

-- internal state and mapping over it will cause it to behave unintuitively.
-- For example, a LineBuffer [2] [3] [300] T_Int will read in 2 ints per clock
-- and emit two overlapping windows of consecutive ints.
-- A Map 2 (LineBuffer [1] [3] [300] T_Int) will emit two windows, but the
Collaborator:

We should be careful here. There is an assumption in the comment about the behavior of prior pipeline stages: that the stream of tokens arriving at the line buffer was not reshuffled accordingly. (This is a reasonable assumption, but it's not necessarily true that the stream arriving at the linebuffer is the same stream as before the speedup, just vectorized.)

David-Durst (Owner, Author):

My design for speedup and slowdown actually does assume that the streams are not modified except for the number of bits processed per clock. I've made this assumption explicit: https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files#diff-a4e9be19161290150b0e024cdc796b07R5


-- is this an ok inverse of our decision to only underutil for slowdown of seq operators?
-- should this try to speed up children before itself?
attemptSpeedUp throughMult (MapOp par innerOp) | isComb innerOp =
Collaborator:

It would be nice to add the intuitive description as a comment here: speeding up a map by multiplier M is just multiplying the width parameter of the map.

Collaborator:

We should also note that the current implementation does not try to recursively speed up the child node. It just parallelizes in a data-parallel (input-token-parallel) way.

David-Durst (Owner, Author):

I also note the speeding up of the rate before speeding up the children at https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files/76c0a4810fb37bdc063954ba40df55af21cf309b#diff-a4e9be19161290150b0e024cdc796b07R44.

The philosophy for these types of nodes is not to totally ignore the child nodes, but to speed them up only if you can't speed up the parent. Since you can always speed up the map op by multiplying its parameter, the map's children are never sped up.

attemptSpeedUp throughMult (MapOp par innerOp) | isComb innerOp =
  (MapOp (par*throughMult) innerOp, throughMult)
attemptSpeedUp throughMult (MapOp par innerOp) =
  let (spedUpInnerOp, actualMult) = attemptSpeedUp throughMult innerOp
Collaborator:

This seems to be the case where the innerOp IS NOT parallelizable via a map. But the guard checks to see if innerOp is combinational. An innerOp can definitely be non-combinational, but also parallelizable. In fact, innerOp is already being used in a map, SO IT MUST BE PARALLELIZABLE. As a result, I think we can drop the second case of speeding up a map. I think only the first one is necessary.

In other words, what is a situation where map k1 innerOp is valid, but map k1*m innerOp would not be? I can't think of one.

David-Durst (Owner, Author):

  1. I'm not using isComb any more. I'm now using hasInternalState (https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files/76c0a4810fb37bdc063954ba40df55af21cf309b#diff-a4e9be19161290150b0e024cdc796b07R190). I believe that if something has internal state (which is different from being sequential), then it can't be sped up. For example, a register that delays is sequential, but doesn't really have state. (It actually does have state, so hasInternalState will need to be renamed, but the idea is that there is no complex state that must be managed.)

  2. I disagree with your philosophy. It's always valid to map over any op. The outcome may be unintuitive, but you can always make a bunch of copies of an op. Since mapping over any op is well-defined, the user should be allowed to apply map to any op. However, changing the rate is not well-defined. Unlike for other maps, changing the parallelism parameter of a map over an element with complex internal state will reorder the outputs. Thus, this is invalid.
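
To make the reordering argument concrete, consider this standalone sketch (not IR code): round-robin a stream across k copies of a stateful running-sum op; changing k changes the merged output stream.

-- Standalone sketch, not IR code.
roundRobin :: Int -> [a] -> [[a]]
roundRobin k xs = [[x | (i, x) <- zip [0 ..] xs, i `mod` k == j] | j <- [0 .. k - 1]]

runningSums :: Int -> [Int] -> [[Int]]
runningSums k xs = map (scanl1 (+)) (roundRobin k xs)

-- runningSums 1 [1,2,3,4] == [[1,3,6,10]]
-- runningSums 2 [1,2,3,4] == [[1,4],[2,6]]  -- merged round-robin: [1,2,4,6]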

@David-Durst merged commit 93f4e4f into dev on Aug 6, 2018.