
Refactor ThroughputModifications #14

Merged: 17 commits into dev, Aug 6, 2018

Conversation

David-Durst (Owner):

This PR mainly reworks ThroughputModifications.hs so that others can understand it. @kayvonf, this is the PR you are looking for.

This PR also reorders the arguments to ReduceOp and renames them, so ReduceOp 4 1 now means 4 tokens are reduced together (called numTokens) with 1 token received per clock (called parallelism or par).
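
To make the new naming concrete, here is a minimal standalone sketch (illustrative only; the Op type below is a stand-in, not the IR's actual definition):

-- Illustrative stand-in for the IR's ReduceOp, not the real type.
data Op = ReduceOp Int Int  -- ReduceOp numTokens par

-- Clocks per full reduction, assuming par evenly divides numTokens:
clocksPerReduce :: Op -> Int
clocksPerReduce (ReduceOp numTokens par) = numTokens `div` par

-- clocksPerReduce (ReduceOp 4 1) == 4  (1 token per clock)
-- clocksPerReduce (ReduceOp 4 4) == 1  (fully parallel)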

@akeley98 - I had to reorder the args for reduce in the simulator. Please confirm I did it the correct way.

@akeley98 (Collaborator):

The simulator's ReduceOp still seems okay. All tests still pass.

@akeley98 (Collaborator) left a review:

Simulator ReduceOp still works correctly.

1. Leaf, non-modifiable rate - these ops are sped up and slowed down by wrapping
them in a parent, modifiable rate op such as map and underutil.

2. Leaf, modifiable rate - these ops are sped up and slowed down by trying to
Collaborator:

"trying to adjust their rate" --> trying to adjust their rate via changes to rate parameters. Since not all ops in this category permit all possible rates, it may not be possible to speed up/slow down these ops in a cases.

<Meta: I'm thinking we could be a little more precise about use of the term "adjust their rate" in these comments, since we also talk about changing the rate of a non-modifiable rate op. I get what we mean, but we might confuse someone if the context is less clear.>

David-Durst (Owner, Author):

I agree that rate is not the best term. I've switched to the term "throughput parameter". I introduce the term at https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files#diff-a4e9be19161290150b0e024cdc796b07R7 and https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files#diff-a4e9be19161290150b0e024cdc796b07R33. Thoughts? I believe this clears up the confusion.

I've also decreased the confusion by renaming the categories indirectly and directly scalable. See https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files#diff-a4e9be19161290150b0e024cdc796b07R7 for their definitions.
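
Roughly, the two categories behave like this sketch (illustrative only; these helper names are made up here, and the real definitions are in the linked diff):

-- Illustrative only; these helpers are hypothetical, not the file's code.
-- Indirectly scalable: wrap the op in a modifiable rate parent.
speedUpIndirectly mult op = (MapOp mult op, mult)
-- Directly scalable: adjust the op's own throughput parameter,
-- e.g. the par argument of ReduceOp.
speedUpDirectly mult (ReduceOp numTokens par) =
  (ReduceOp numTokens (par * mult), mult)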

-- ASSUMPTION: the user has specified a type for these ops and passes will not
-- change the type because that would change the semantics of the
-- program. For example, one could speed up an Add T_Int by making it an
-- Add $ T_Array 2 T_Int, but that would change the program's meaning.
Collaborator:

Unclear what you mean here by "change the program's meaning", since by some definition parallelizing the program also changes its meaning, in that the input and output types of the program are now different. Similarly, wrapping an ADD in a map vs. defining add polymorphically to work on T_Array 2 T_Int conceptually produces the same set of singleton values.

Specifically, I think what this comment is trying to say is that we're making the assumption that basic combinational ops are defined on a few base types, and that more complex circuits are created from compositions of these ops. If we want an adder to operate on an array of size N, well, that requires a program that is map N (add).

I think this is a good assumption for the following reason: while the meaning of invoking ADD on array data types is clear, the meaning of invoking ADD on tuples is less so. (Do we field-wise add the tuple's fields?) As such, I think it's intuitive to say that ADD is only defined on simple base types (it's not polymorphic on arrays), and that adding more complex data types is expressed in the language via maps (to add arrays) and, for tuples, via circuits with the appropriate tuple extractions and creations.
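
For instance, under this assumption an N-wide add is written as a composition rather than a polymorphic add (a sketch; Add T_Int stands in for the IR's integer add op, as in the quoted comment above):

-- Sketch: an N-wide adder as map N (add), not a polymorphic Add.
addArrayN n = MapOp n (Add T_Int)  -- adds length-n int arrays elementwise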

David-Durst (Owner, Author):

I think we mean the same thing, but have different definitions of base types. I'm including homogeneous arrays in my list of base types. I'm doing this because there are certain applications where nested arrays of bits are the basic unit of data. For example, operating over 3-channel, 8-bit-depth image data means that the basic, inseparable unit of data is Bit[8][3]. Speed up and slow down can make the pipeline handle more or fewer of these inseparable units per clock, but they cannot change the unit itself.

Aetherling’s type system has no tuple. Therefore, the question of what types an Add can be over doesn’t involve tuples. The implementation of Add is well-defined for all types currently allowed by the Aetherling Haskell IR.
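
For example, under this definition the pixel unit might be modeled like this (a sketch; I'm assuming a T_Bit base type alongside the T_Int and T_Array seen above):

-- Sketch: an 8-bit, 3-channel pixel as one inseparable base token.
pixelType = T_Array 3 (T_Array 8 T_Bit)
-- Speed up / slow down may change how many such tokens flow per clock,
-- but must never split the token itself.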

-- mapping over a MemRead/MemWrite is banking it, and wrapping
-- the type with an array is making a single memory read/write more
-- data per clock.
attemptSpeedUp requestedMult op@(MemRead _) = (MapOp requestedMult op, requestedMult)
@kayvonf (Collaborator), Aug 1, 2018:

#braindump as this review caused me to think about how we treat memories.

TLDR 1: I think memRd and memWr should be leaf modifiable rate operators given their current definition.

TLDR 2: in the comment on line 102 above, "banking" should be replaced by "gathering".

=====================

We currently consider MemRead and MemWrite to be stateful ops (as given in Operations/Properties.hs). These ops are stateful because they are currently defined to internally contain an address generator. And we know that we cannot throw a map around a stateful op without changing the program's behavior (recall the linebuffer example -- each copy of the linebuffer now gets a different substream of non-consecutive elements).

It's possible, and perhaps preferable, to think of a memRd unit as two units that are composed via a composeSeq: (1) a stateful address generator that emits a stream of addresses of a specified length (for simplicity, let's consider this stream a bunch of integers: 0 1 2 3 4..., or, if there was some form of stride functionality in the address generator, 0 4 8 12 16...), and (2) a stateless unit MemAccess that actually accesses memory at the incoming address, emitting the value at that address. So now consider the following:

KayvonMemRead = AddrGen N 1 |>>=| MemAccess

(where AddrGen N 1 is an address generator generating N elements at throughput 1, and MemAccess is a stateless memory access unit.)

Now let's consider what speedup k KayvonMemRead should do:

Let's consider how to speed up AddrGen N 1.

Option 1: If I make it an op that speeds up itself, I'd just change it to: AddrGen N k.
I'd expect the semantics are for it to emit vectors of addresses: for k=2, [0 1] [2 3] [4 5] etc. Since the addresses are consecutive, executing the subsequent memory access just needs a wider memory system.

Alternative 2: Consider throwing a map around AddrGen N 1 to speed it up: map k AddrGen N 1 would yield k concurrent streams of size N, which is clearly wrong. That's k times more elements being generated.

Alternative 3: Throwing a map around the op, and also dividing the resulting stream length is interesting to consider:
speedup k AddrGen N 1 = map k AddrGen (N/k) 1, but it yields the following pattern for k=2:
[0 2] [1 3] [2 4]...

This results in N total elements in a stream of length N/2, with element type int[2]. One might think that regardless of whether I choose option 1 or alternative 3 in how to speed up address generation, the stateless memAccess could be sped up as (map k MemAccess). However, this is not true.

To me (map k MemAccess) models k copies of a basic memory access unit. The meaning here is that the HW could make k accesses to k independent addresses at a time, since there is nothing known about the addresses being provided as input. The resulting memory itself would have to be able to arbitrate these k simultaneous requests.

However, with knowledge of the incoming stream, very different, and far more efficient, implementations of map k MemAccess could be built. For example, if it was known that the address stream was that produced by Option 1 above, then the memory need only handle one bulk access at a time, with that access being to an array of consecutive values. Let's call this parallel memory access operator MemAccessArray k.

Therefore...
map k MemAccess would be implemented as a special unit MemAccessArray k if the incoming stream had consecutive addresses in each token. This would be implemented via a banked SRAM, or just a wider memory port.

Otherwise...
map k MemAccess would be equivalent to MemAccessGather k, a much more complex hardware unit.

So in other words, I feel like memRd needs to be a leaf modifiable rate operator. Or, if it were expressed as a composeSeq of my hypothetical AddrGen and MemAccess/MemAccessArray/MemAccessGather ops, then all these ops need to be modifiable rate, and there's the equivalence:

memRd k = AddrGen N k |>>=| MemArray Rd k
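
A standalone sketch of this hypothetical decomposition (none of these constructors exist in the current IR; the names follow the ones proposed above):

-- Hypothetical ops only; nothing here is in the current IR.
data HypMemOp
  = AddrGen Int Int               -- AddrGen n k: emit n addresses, k per clock
  | MemAccessArray Int            -- one bulk access to k consecutive addresses
  | MemAccessGather Int           -- k arbitrary accesses, arbitrated memory
  | ComposeSeq HypMemOp HypMemOp  -- stand-in for |>>=|

hypMemRead :: Int -> Int -> HypMemOp
hypMemRead n k = ComposeSeq (AddrGen n k) (MemAccessArray k)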

David-Durst (Owner, Author):

This will require a significant simulator change. I will delay this change until later.

David-Durst (Owner, Author):

See #17 for progress on this comment.

-- for these operations, the user has not specified the type separately from the
-- array length. Therefore, speedup may modify the meaning of the program
-- by changing the types of these ops.
attemptSpeedUp requestedMult op@(ArrayReshape _ _) =
Collaborator:

For some ops, it makes sense to add a comment intuitively describing the meaning of speeding up the op.

For ArrayReshape, this just means that the resulting sped-up op is now reshaping K input arrays at once. For this reason, it's just like any other leaf, non-modifiable rate op.

-- by changing the types of these ops.
attemptSpeedUp requestedMult op@(ArrayReshape _ _) =
  (MapOp requestedMult op, requestedMult)
attemptSpeedUp requestedMult op@(DuplicateOutputs _ _) =
Collaborator:

This is probably not the right place to put this suggestion, but we could consider whether we should change the name of Duplicate to Replicate, since it's not necessarily 2x.

-- internal state and mapping over it will cause it to behave unintuitively.
-- For example, a LineBuffer [2] [3] [300] T_Int will read in 2 ints per clock
-- and emit two overlapping windows of consecutive ints.
-- A Map 2 (LineBuffer [1] [3] [300] T_Int) will emit two windows, but the
Collaborator:

We should be careful here. There is an assumption in the comment about the behavior of prior pipeline stages: that the stream of tokens arriving at the line buffer was not reshuffled accordingly. (This is a reasonable assumption, but it's not necessarily true that the stream arriving at the linebuffer is the same stream as before the speedup, just vectorized.)

David-Durst (Owner, Author):

My design for speedup and slowdown actually does assume that the streams are not modified except for the number of bits processed per clock. I've made this assumption explicit: https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files#diff-a4e9be19161290150b0e024cdc796b07R5


-- is this an ok inverse of our decision to only underutil for slowdown of seq operators?
-- should this try to speed up children before itself?
attemptSpeedUp throughMult (MapOp par innerOp) | isComb innerOp =
Collaborator:

It would be nice to add the intuitive description as a comment here: speeding up a map by multiplier M is just multiplying the width parameter of the map.

Collaborator:

We should also note that the current implementation does not try to recursively speed up the child node. It just parallelizes in a data-parallel (input-token-parallel) way.

David-Durst (Owner, Author):

I also note the speeding up of the rate before speeding up the children at https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files/76c0a4810fb37bdc063954ba40df55af21cf309b#diff-a4e9be19161290150b0e024cdc796b07R44.

The philosophy for these types of nodes is not to totally ignore the child nodes, but to speed them up only if you can't speed up the parent. Since you can always speed up the map op by multiplying its parameter, the map's children are never sped up.

attemptSpeedUp throughMult (MapOp par innerOp) | isComb innerOp =
  (MapOp (par*throughMult) innerOp, throughMult)
attemptSpeedUp throughMult (MapOp par innerOp) =
  let (spedUpInnerOp, actualMult) = attemptSpeedUp throughMult innerOp
Collaborator:

This seems to be the case where the innerOp IS NOT parallelizable via a map. But the guard checks to see if innerOp is combinational. An innerOp can definitely be non-combinational, but also parallelizable. In fact, innerOp is already being used in a map, SO IT MUST BE PARALLELIZABLE. As a result, I think we can drop the second case of speeding up a map. I think only the first one is necessary.

In other words, what is a situation where map k1 innerOp is valid, but map k1*m innerOp would not be? I can't think of one.

David-Durst (Owner, Author):

  1. I'm not using isComb any more. I'm now using hasInternalState (https://github.com/David-Durst/aetherlingHaskellIR/pull/14/files/76c0a4810fb37bdc063954ba40df55af21cf309b#diff-a4e9be19161290150b0e024cdc796b07R190). I believe that if something has internal state (which is different from being sequential), then it can't be sped up. For example, a register that delays is sequential, but doesn't really have state. (It actually does have state, so hasInternalState will need to be renamed, but the idea is that there is no complex state that must be managed.)

  2. I disagree with your philosophy. It's always valid to map over any op. The outcome may be unintuitive, but you can always make a bunch of copies of an op. Since mapping over any op is well-defined, the user should be allowed to apply map to any op. However, changing the rate is not well-defined. Unlike for other maps, changing the parallelism parameter of a map over an element with complex internal state will reorder the outputs. Thus, this is invalid.
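
To make the reordering argument concrete, consider this standalone sketch (not IR code): round-robin a stream across k copies of a stateful running-sum op; changing k changes the merged output stream.

-- Standalone sketch, not IR code.
roundRobin :: Int -> [a] -> [[a]]
roundRobin k xs = [[x | (i, x) <- zip [0 ..] xs, i `mod` k == j] | j <- [0 .. k - 1]]

runningSums :: Int -> [Int] -> [[Int]]
runningSums k xs = map (scanl1 (+)) (roundRobin k xs)

-- runningSums 1 [1,2,3,4] == [[1,3,6,10]]
-- runningSums 2 [1,2,3,4] == [[1,4],[2,6]]  -- merged round-robin: [1,2,4,6]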

@David-Durst merged commit 93f4e4f into dev on Aug 6, 2018.