Cascading Change Log
2.2.1
Updated Hadoop platform to fail during planning if "mapred.job.tracker" is not set.
Updated c.t.h.Hfs to improve duplicate identifier check performance. @gianm
Fixed issue where resolved fields were not properly presented to c.t.MultiSinkTap child c.t.Tap and c.s.Scheme
instances preventing header information from being written in the case of TextDelimited files.
Fixed issue where c.s.u.DelimitedParser parsing more fields than were declared could cause
a j.l.ArrayIndexOutOfBoundsException.
Fixed issue where a race condition could cause an NPE between c.c.Cascade#start() and Cascade#stop().
2.2.0
Fixed issue where c.p.CoGroup in local mode did not properly handle joins where the grouping j.u.Comparator
did not treat null values as equal. SQL semantics expect null values to not be equivalent. c.p.HashJoin
does not support non-equality between null and will issue a warning.
Updated c.p.a.AggregateBy sub-classes to pass 0 as default threshold value to allow the system default value
to be honored.
Added c.o.a.MaxValue and c.o.a.MinValue c.o.Aggregator sub-classes to replace c.o.a.Max and c.o.a.Min classes
respectively. MaxValue and MinValue rely on the values compared to be j.l.Comparable types resulting in a simpler
implementation and support for max/min of non numeric types.
Fixed issue where c.o.t.DateParser would drop incoming Tuples if the argument was null.
Fixed issue where c.t.Hasher was not honored during grouping in local mode.
Updated c.t.h.GlobHfs to use fewer resources when deriving member identifiers.
Updated c.t.h.HadoopTapPlatformTest to skip the c.t.h.Dfs test if HDFS filesystem is unavailable on the current
configuration.
Fixed issue where c.t.h.Hfs#resourceExists() could fail if the identifier represented a file globbing pattern.
Changed regex j.u.r.Pattern builder methods on c.s.u.DelimitedParser from static to instance methods.
Updated c.t.TupleEntry to issue a warning if an "unmodifiable" c.t.Tuple is set via #setTuple() on a "modifiable"
TupleEntry instance. This typically is an indicator the Tuple instance is about to be cached and/or modified at a
later point. Unmodifiable, system created, Tuples should never be cached.
Added c.t.TupleEntry#selectInto() to provide a more efficient way to copy values from one c.t.Tuple into another.
Added c.t.TupleEntry#selectTupleCopy() and #selectEntryCopy() methods to always provide a modifiable and cacheable
instance.
Fixed issue where c.t.TupleEntry#selectTuple() and #selectEntry() could return an unmodifiable or un-cacheable
c.t.Tuple or TupleEntry depending on the given c.t.Fields selector.
Fixed issue where c.t.MultiSourceTap could keep too many open resources if #openForRead() is called directly.
Fixed issue where c.o.Buffer#flush() was never called.
Fixed issue where an exception at #close() on step state reader could mask more prominent errors.
Fixed issue where the c.t.TupleEntryCollector was not set to "null" on the c.o.OperationCall before
c.o.Operation#cleanup() was called to prevent the method from emitting values during cleanup. See Operation#flush().
Use "cascading.compatibility.retain.collector" to disable.
Fixed issue where c.f.h.ProcessFlow would not honor c.f.FlowListener instances. Currently does not support
the #onThrowable event.
Updated c.p.a.Unique to use c.o.b.FirstNBuffer to improve performance.
Added c.o.b.FirstNBuffer to provide a faster implementation of returning the first N tuples encountered in a grouping.
Updated junit to version 4.11.
Updated default Apache Hadoop support to version 1.1.x. Ended support for 0.20.2.
Updated c.f.FlowDef to accept classpath elements that allow for pipe assemblies to load additional resources
from the current context j.l.ClassLoader.
Updated error messages in c.t.Fields, delegate property initialization to c.f.Flow sub-classes. @fderose
Removed Hadoop oro dependency from build and test runtime classpaths to stop transient build failures.
Added ability to pass System level properties into platform level property sets to override defaults during testing.
Fixed issue where c.t.l.FileTap#getFullIdentifier() was not returning the fully qualified path.
Added c.t.h.HfsProps to localize optional Hadoop HDFS specific properties, specifically provides properties for
enabling the combining of small files into larger splits.
Updated c.t.h.Hfs to allow for smaller files to be combined into fewer splits, thus fewer map tasks. @sjlee
Updated c.p.SubAssembly to support setting local and step properties via the c.p.ConfigDef.
Updated c.o.Buffer to allow implementations to disable nulling of non-grouping fields after the arguments iterator
has completed. This simplifies appending aggregated fields to the incoming tuple stream.
Updated c.t.Fields to return appending value when calling Fields#append on Fields.NONE and optimized Fields#subtract
when subtracting Fields.NONE.
Added c.f.AssemblyPlanner interface to allow for platform independent generative c.f.Flow planning.
Fixed issue in local mode where an OOME could cause a cascading set of additional OOMEs making the jvm unstable.
Updated c.f.s.MemoryCoGroupGate and c.f.l.s.LocalGroupByGate to drain internal collections when pipelining
tuples downstream in the pipeline.
Added c.t.h.BigDecimalSerialization to allow Hadoop to serialize and deserialize j.m.BigDecimal instances.
Updated slf4j to version 1.7.2.
Added coercion support for j.m.BigDecimal.
Added c.p.PlatformSuite annotation allowing a c.PlatformTestCase sub-class to be marked as being a JUnit suite
of tests accessible, by default, via a static "suite" method.
Updated provided c.s.Scheme subclasses to honor field type information.
Updated c.o.expression, c.o.aggregator, and c.p.assembly operations to honor field type information.
Updated c.o.Identity and c.p.a.Coerce to use field type information during coercion.
Added c.t.t.CoercibleType interface to allow for customization of individual field data types and formats. Also
added the c.t.t.DateType implementation for managing string formatted dates to and from a long timestamp.
Updated c.p.Splice to fail during planning if grouping or merging fields do not share the same field types, unless
the field in question has a j.u.Comparator to handle the incompatible comparisons.
Fixed issue where a c.p.CoGroup join on Fields.NONE would fail during planning.
Updated c.p.a.Unique to optionally filter out null values.
Added c.o.e.ScriptFunction, ScriptTupleFunction, and c.o.e.ScriptFilter operations to allow for more expressive
Java scripts.
Added "test.platform.includes" system property so tests can be limited to specified platforms.
Added c.p.a.MaxBy and c.p.a.MinBy c.p.a.AggregateBy sub-classes to perform max and min, respectively.
Updated c.p.a.SumBy and c.p.a.AverageBy to honor result fields type declaration by coercing the result to the
declared type.
Updated c.p.a.CountBy to count all value occurrences, non-null values, or only null values, within a grouping. Using
grouping Fields.NONE provides an efficient count for a set of columns. Counting distinct values is not supported.
Updated c.t.Fields to accept type information and to propagate type values along with fields.
Updated c.s.l.TextDelimited and c.s.h.TextDelimited to take c.s.u.DelimitedParser on the constructor to allow
for overriding parsing behavior. DelimitedParser now takes a c.s.u.FieldTypeResolver to allow for field name
permutations during source and sink, and type inference from field names.
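Illustration for the 2.2.0 c.p.CoGroup null handling entry above. This is a minimal sketch in plain Java, not
Cascading code: the class NullJoinSketch and its innerJoin helper are hypothetical names used only to show how the
choice of grouping j.u.Comparator decides whether two null keys are joined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Cascading.
final class NullJoinSketch {

    // Treats two null keys as equal (compare(null, null) == 0).
    static final Comparator<String> NULLS_EQUAL =
        Comparator.nullsFirst(Comparator.naturalOrder());

    // SQL-like behavior: a null key never matches anything, not even another null.
    // Only suitable for equality tests, since it is not a consistent ordering.
    static final Comparator<String> SQL_NULLS = (a, b) -> {
        if (a == null || b == null) {
            return -1;
        }
        return a.compareTo(b);
    };

    // Naive nested-loop inner join over key lists, emitting "lhs=rhs" for each match.
    static List<String> innerJoin(List<String> lhs, List<String> rhs, Comparator<String> keys) {
        List<String> out = new ArrayList<>();
        for (String l : lhs) {
            for (String r : rhs) {
                if (keys.compare(l, r) == 0) {
                    out.add(l + "=" + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", null);
        List<String> right = Arrays.asList("a", null);
        System.out.println(innerJoin(left, right, NULLS_EQUAL));
        System.out.println(innerJoin(left, right, SQL_NULLS));
    }
}
```

With NULLS_EQUAL the null keys pair up, while with SQL_NULLS only the non-null keys join.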
2.1.6
Updated c.p.SubAssembly to throw UnsupportedOperationException on #getConfigDef() and #getStepConfigDef() calls.
Fixed issue where join field level c.t.Hasher instances were not honored during a c.p.HashJoin.
Fixed issue where a j.l.StackOverflowError would be thrown if the Hadoop mapred.input.format.class property
was not set.
Updated c.t.Fields#size() to return Fields.NONE on size == 0, instead of failing.
Fixed issue where Fields.REPLACE on an incoming Fields.UNKNOWN could result in a
java.lang.ArrayIndexOutOfBoundsException during runtime.
Updated c.s.h.HadoopStepStats counter caching strategy to make a final attempt even if max timeouts have been
met. Added "cascading.step.counter.timeout" property to allow tuning of timeout period.
2.1.5
Updated c.t.h.u.BytesComparator to implement c.t.Hasher as a convenience.
Fixed issue where c.c.CascadeListener was receiving null as the c.c.Cascade parameter.
2.1.4
Added ability to capture frameworks used in an application via c.p.AppProps.
Restored platform test compatibility with Cascading 2.0.x via return of c.p.PlatformRunner.Platform annotation
and deprecated c.t.LocalPlatform and c.t.HadoopPlatform platform implementations.
2.1.3
Fix for extra trailing ']' in c.t.Tap#toString().
Fix for c.f.FlowProcess#getNumProcessSlices() incorrectly returning zero in local mode, should be 1.
Fix for c.p.a.AggregateBy not honoring the global system property threshold value if not overridden on the ctor.
Fix for NPE if c.f.FlowProcess returns null config.
Fixed issue where a c.f.FlowStep would attempt to detect if it should be skipped regardless of whether the "runID"
had been set or not on the c.f.Flow enabling restartable flows.
2.1.2
Fixed issue where c.f.FlowProcess#openForWrite on Hadoop would re-use the existing o.a.h.m.OutputCollector instance
as that used in the current task.
Fixed issue where fetching remote Hadoop counter values could block indefinitely. Fetching remote counters is now
serialized across jobs to prevent deadlocks inside the Hadoop API and counter values are now cached with a final
refresh on job completion.
Fixed issue where NPE could be thrown by c.s.CascadingStats#getCounterValue if given counter had no value.
2.1.1
Fixed issue where c.s.h.TextDelimited would not honor charsetName.
Fixed issue where c.t.BaseTemplateTap would lose parent fields if they were declared as Fields.ALL.
Fixed issue where c.t.Fields#append would not include current Fields instance when appending an array of Fields
instances.
Fixed issue where subsequent c.p.Merge pipes in a pipeline path would obscure prior Merges preventing a c.t.Tap
insertion during planning resulting in a missing Tap configuration resource property.
Fixed NPE with c.s.l.TextDelimited when line after header was null.
Fix for c.s.u.DelimitedParser not fully honoring the default strict parsing policy. This resolution may cause
some text delimited files to fail if they have arbitrary numbers of fields.
Added quote and delimiter getters to c.s.l.TextDelimited and c.s.h.TextDelimited.
Fixed issue where a c.f.FlowStep being skipped was not considered successful after 2.0.7 merge.
2.1.0
Added c.t.t.FileType interface to mark specific platform c.t.Tap classes as representing a file like interface.
Fixed issue where c.p.a.Coerce would coerce a null value to 0 if the coerce type was a j.l.Number
instead of a numeric primitive, or false if the coerce type was j.l.Boolean instead of boolean.
Fixed issue where c.s.u.DelimitedParser did not honor the number of fields found in a text delimited file header.
Fixed issue where c.t.Tap#openForWrite did not honor the c.t.SinkMode#REPLACE setting.
Added version update check to print out latest available release. Use system property cascading.update.skip=true
to disable.
Updated all tuple stream permutations to minimize new c.t.Tuple instantiations and maximize upstream Tuple reuse.
Updated janino to version 2.6.1.
Updated c.s.l.TextLine, c.s.l.TextDelimited, c.s.h.TextLine, and c.s.h.TextDelimited to encode/decode any supported
j.n.c.Charset.
Fixed issue where c.o.t.DateParser may throw an NPE if the value to be parsed was null.
Added c.p.Props#buildProperties( Iterable<Map.Entry<String, String>> defaultProperties ) to allow for re-using
existing o.a.h.m.JobConf instances as default properties.
Added c.p.a.FirstBy partial aggregator to allow for capturing first seen c.t.Tuple in a Tuple stream. Argument
c.t.Fields j.u.Comparators are honored for secondary sorting.
Updated c.p.a.AggregateBy to honor argumentField c.t.Fields j.u.Comparator instances for secondary sorting.
Updated c.o.a.First to accumulate the first N seen c.t.Tuple instances.
Added support for c.c.CascadeListener on c.c.Cascade instances.
Updated c.p.j.InnerJoin.JoinIterator and sub-classes to re-use c.t.Tuple instances.
Added support for restartable checkpoint c.f.Flow instances by providing a runID to identify run attempts.
Updated build and tests to simplify development of alternative planners.
2.0.8
Updated c.m.CascadingServices to more robustly load optional services. Service agent jar may now be optionally defined
in a cascading-service.properties file from the CLASSPATH with the "cascading.management.service.jar" property.
2.0.7
Fixed issue where c.t.Tap instances were not presented resolved c.t.Fields instances in local mode during planning.
Fixed issue where Hadoop forgets the completion status of a past job during very long running c.f.Flows and
throws an NPE when queried for the result.
2.0.6
Added "cascading.step.display.id.truncate" property to allow simple truncation of flow and step ID values in
the step display name.
Fixed issue where attempting to iterate the left most side of a join more than once would silently fail on the
Hadoop platform.
Fixed issue where step state was not properly removed from the Hadoop distributed cache during cleanup.
Fixed issue where c.f.Flow#writeStepsDot() would fail if a Flow c.f.FlowStep had multiple sinks.
Fix for c.t.h.i.MultiInputFormat throwing j.l.ArrayIndexOutOfBoundsException when there aren't any
actual o.a.h.m.FileInputFormat input paths.
Fix for c.t.h.i.MultiInputFormat throwing j.l.IllegalStateException on an empty child o.a.h.m.InputSplit array.
Fix for j.l.IndexOutOfBoundsException thrown on an empty c.c.Cascade.
Fix for c.t.c.SpillableProps#SPILL_COMPRESS not being honored if set to false.
2.0.5
Updated c.f.p.ElementGraphException messages to name disconnected elements.
Properly scope c.t.Tap properties to c.f.l.LocalFlowStep and then pass them to source/sink stages in
c.f.l.s.LocalStepStreamGraph. @mrwalker
Fix for c.s.u.DelimitedParser to support delimiter as last char in quoted field.
Fix for c.o.f.UnGroup constructor failing against correct constructor values.
Added missing setter methods on c.p.AppProps for application jar path and class values.
Fix for possible NPE when debug logging is enabled during planning.
Improved error message when Hadoop serializer for a given type cannot be found in some cases.
2.0.4
Removed remnant log4j dependency in c.t.h.i.MultiInputSplit.
Fixed issue where c.t.Tap may fail resolving outgoing fields.
Added missing #equals() method to c.t.TupleEntry that will honor field j.u.Comparator instances.
Fixed issue where c.f.s.SparseTupleComparator would not properly sort with re-ordered sort fields.
Fixed issue where c.t.TupleEntryChainIterator#hasNext() would fail if called more than once.
Updated c.t.h.Hfs internal methods call #getPath() instead of #getIdentifier() so sub-classes can override.
Updated the #verify() methods on c.s.l.TextLine and c.s.h.TextLine to be protected.
2.0.3
Fixed issue where the c.f.p.FlowPlanner would allow declared fields in a checkpoint c.t.Tap instance.
Fixed issue where c.f.Flow#writeStepsDot() would fail if the Flow was planned by the local mode planner.
Added c.f.h.u.ObjectSerializer to allow for custom state serializers. To override the default
c.f.h.u.JavaObjectSerializer, specify the name of a class that implements ObjectSerializer (and optionally
implements o.a.h.c.Configurable) via the "cascading.util.serializer" property. @sritchie
2.0.2
Added cascading.version property to Hadoop job configuration.
Removed tests for deprecated method c.t.Tuple#parse().
Fixed error message in c.s.u.DelimitedParser where parsed value was not being reported.
Updated c.s.h.TextLine and c.s.l.TextLine to ignore planner presented fields to allow instances to be re-used.
Changed c.t.c.SpillableTupleList to use j.u.LinkedList to reduce memory footprint when backing a
c.t.c.SpillableTupleMap.
Fixed issue where c.p.Merge into the streamed side of a c.p.HashJoin would produce an incorrect plan.
Fixed issue where c.p.CoGroup was not properly resolving fields from immediate prior c.p.Every pipes.
2.0.1
Changed c.s.h.TextDelimited to use fully qualified path when reading headers so that the filesystem scheme
will be inherited.
Removed redundant property value kept by c.t.h.i.MultiInputSplit to reduce input split serialized size.
Updated commit and rollback functionality in c.f.BaseFlow and c.f.p.BaseFlowStep to fail the c.f.Flow on a
c.t.Tap#commitResource failure and to call Tap#rollbackResource on subsequent tap instances. Note this isn't
intended to provide a 2PC type transactional functionality.
Updated dependency to Hadoop 1.0.3
2.0.0
Added c.p.Checkpoint pipe to force any supported planners to persist the tuple stream at that location. If bound to
a checkpoint c.t.Tap via the c.f.FlowDef, this data will not be cleaned up after the c.f.Flow completes. This pipe
is useful in conjunction with a c.p.HashJoin to minimize replicated data.
Added c.t.l.TemplateTap for local mode. Refactored out c.t.BaseTemplateTap to simplify support for additional
platforms.
Added c.t.l.StdIn, StdOut, and StdErr local mode c.t.Tap types.
Changed c.f.h.HadoopFlowStep to save step state to the Hadoop distributed cache if larger than Short.MAX_VALUE.
Fixed issue where a null value was printed as "null" in c.o.r.RegexMatcher, c.o.r.RegexFilter, c.o.a.AssertGroupBase,
and c.o.t.FieldJoiner.
Updated dependency to Hadoop 1.0.2.
Changed c.s.h.TextDelimited and c.s.l.TextDelimited to optionally read the field names from the header during
planning if skipHeaders or hasHeaders is set to true and if Fields.ALL or Fields.UNKNOWN is declared on the
constructor.
Changed the planner and added new methods to c.s.Scheme so that field names can be retrieved after a proper
configuration has been built, but before the planner resolves fields internally. This is useful for reading field
names from a header of a text file, or meta-data in a binary file. These methods are optional.
Fixed issue where any c.p.Splice following a c.p.Merge may be unable to resolve the tuple stream branch.
Added support for c.p.ConfigDef on c.p.Pipe and c.t.Tap classes to allow for process and pipe/tap level
property values. The process level allows a Pipe or Tap to set c.f.FlowStep specific properties.
Added c.p.Props base and sub-classes to simplify managing Cascading and Hadoop related properties.
Added c.m.UnitOfWorkSpawnStrategy interface to allow for pluggable thread management services. Also added
c.m.UnitOfWorkExecutorStrategy class as the default implementation.
Added typed set and add methods to c.t.Tuple and c.t.TupleEntry.
Changed packages for many internal types to simplify documentation.
Changed c.f.Flow and c.f.FlowStep to interfaces to hide internal only methods.
Added support for trapping actual raw input data as read by a c.s.Scheme during processing by allowing
c.t.TupleException to accept a payload c.t.Tuple instance with the data to be trapped. Updated c.s.h.TextDelimited
and c.s.l.TextDelimited to provide a proper payload when sourcing and parsing text.
Fixed issue where a c.p.GroupBy following a c.p.Every could not see result Aggregator fields from the Every instance.
Changed c.s.h.TextDelimited and c.s.l.TextDelimited to optionally write headers if writeHeaders or hasHeaders
is set to true. If Fields.ALL or Fields.UNKNOWN is declared, during sinking the field names will be resolved
at runtime.
Added the c.t.TupleCollectionFactory and c.t.TupleMapFactory interfaces and relevant implementations to allow
custom c.t.Spillable types to be plugged into a given execution. Spillable types are used to back in memory
collections to disk to improve scalability of c.p.CoGroup and c.p.HashJoin pipes on different platforms.
Fixed issue where a c.s.Scheme was not seeing properly resolved fields if they were not declared in the Scheme
instance. This allows a Scheme declared to sink c.t.Fields#ALL to see the actual field names during the
Scheme#sinkPrepare() and Scheme#sink() methods.
Changed c.t.TupleEntrySchemeSelector#prepare method to protected; it is now called lazily internally during
the first add method. This should simplify custom c.t.Tap development and allows for lazy setting of resolved
sink fields.
Fixed issue where the grouping Tuple resulting from a c.p.CoGroup did not properly reflect all the current
grouping keys and field names. This fix allows a c.o.Aggregator or c.o.Buffer to see which fields are null, if any,
during an "outer" join type. The resultGroupFields parameter now must reflect all joined fields as well.
Fixed issue where a c.p.GroupBy merge of branches with the same names threw an NPE.
Fixed issue where c.p.a.AggregateBy.AveragePartials functor was using fixed declared fields.
Added the "cascading.aggregateby.threshold" property so that a default threshold can be set for the
c.p.a.AggregateBy sub-assemblies.
Added the c.m.UnitOfWork interface to give c.f.Flow and c.c.Cascade a common contract.
Changed c.t.h.TupleSerialization#setSerializations() to force TupleSerialization and o.a.h.i.s.WritableSerialization
to be first in the "io.serializations" list.
Added support for properties scoped at the pipe or process scope. Process scope properties will be inherited by
the current job if any.
Added c.t.SpillableTupleMap to allow durable groups during asymmetrical joins.
Changed c.t.SpillableTupleList to implement j.u.Collection and c.t.Spillable interfaces.
Renamed the c.p.Group class to c.p.Splice and created a c.p.Group interface. c.p.GroupBy, CoGroup, Merge, and HashJoin
are all c.p.Splice types. Only GroupBy and CoGroup are c.p.Group types.
Moved all "joiners" to c.p.joiner package from c.p.cogroup as they are now shared with the c.p.HashJoin pipe.
Added c.p.HashJoin pipe to join two or more streams by a common key value without blocking/accumulating the largest
data stream. This differs from c.p.CoGroup in that there is no grouping or sorting, and on the MapReduce platform,
no Reduce task. This is commonly known as an asymmetrical or replicated join.
Changed c.t.h.TupleSerialization#setSerializations() to always include o.a.h.i.s.WritableSerialization as some
Hadoop versions do not include it if omitted. WritableSerialization is required by c.t.h.MultiInputSplit.
Added c.p.Merge pipe to create a union of multiple tuple streams. This differs from c.p.GroupBy in that there
is no grouping or sorting, and on the MapReduce platform, no Reduce task.
Added c.t.Tap#commitResource() and #rollbackResource() to allow the underlying resource to be notified that write
processing has successfully completed or has failed, respectively, so that any additional cleanup or processing may
be completed.
Added c.t.Hasher to allow any field level Comparators to have hashCode generation delegated to them for their
respective c.t.Tuple element/field value.
Added c.f.FlowStepStrategy interface to allow customization of c.f.p.FlowStep configuration information.
Changed c.f.Flow to lazily test child source taps for modified time to reduce file meta-data queries.
Changed c.t.CompositeTap#getChildTaps to return an j.u.Iterator to allow for lazy resolution of child tap instances.
Added "cascading.default.comparator" property to allow for a default j.u.Comparator class to be set and used
if no Comparator is returned by the c.t.Comparison interface or set on a c.t.Fields instance.
See c.t.h.TupleSerialization for the static accessor.
Changed planner to allow traps to be re-used across any branches. Prior planner would throw an error.
Changed c.f.Flow to delete traps during the same conditions a sink will be deleted before execution.
Fixed issue where the c.t.h.TemplateTap would not properly remove Hadoop temporary directories on completion.
Changed the behavior of traps to capture operation argument values instead of all the incoming values so that it is
simpler to identify the values causing the failure and reduce the data stored in the trap and log files, which
record a truncated stringified version of the argument tuple.
Updated c.f.h.MapReduceFlow to allow source/sink/trap create methods to be overridden by a sub-class in order
to support path identifiers not compatible with the Hadoop FS.
Changed c.f.FlowProcess increment methods to take a long instead of int type.
Fixed issue where the c.t.h.TemplateTap would not properly handle pathFields value if set to c.t.Fields.ALL.
Fixed issue where the c.p.a.AB.CompositeFunction was not getting flushed when planned into a reduce task.
Renamed c.p.a.Shape to c.p.a.Retain, as it retains given fields, and created c.p.a.Discard to perform the opposite
function of discarding given fields.
Added c.o.NoOp operation to allow fields to be dropped from a stream when used with c.t.Fields.SWAP.
Added c.t.Fields.NONE to denote no fields in a c.t.Tuple.
Added shutdown hook for the c.c.Cascade class so during jvm shutdown #stop() will be called forcing proper state
change.
Changed #stop() to push from c.c.Cascade down through c.f.Flow and c.f.p.FlowStep instances.
Added new JUnit runner for injecting platform dependencies into c.t.PlatformTestCase subclasses. Subclasses should
use c.t.PlatformRunner.Platform Annotation to specify relevant c.t.TestPlatform instances.
Changed test and assertion helper methods on c.t.CascadingTestCase to be static to remove the subclassing requirement.
Upgraded to support JUnit 4.8.x.
Changed license from GPLv3 to APLv2.
Changed c.f.Flow to prevent #complete() from returning while #stop() is executing. Should prevent certain kinds
of race conditions when a shutdown hook is used, from a different thread, to stop running flows.
Added support for gradle.
Renamed c.f.FlowSkipIfSinkStale to FlowSkipIfSinkNotStale to match the semantics.
Added support for c.f.Flow tags via the c.f.FlowDef class.
Added c.f.FlowDef to allow for creating flow definitions via a fluent builder interface.
Added STARTED and SUBMITTED status to c.s.CascadingStats to properly track when a job is submitted vs when it actually begins
processing after being queued.
Added management interfaces for capturing detailed statistics.
Decoupled core from Apache Hadoop, removed stack based streaming model. Use c.f.h.HadoopFlowConnector to plan
Hadoop specific flows.
Implemented 'local' mode to support independent processing of complex processes in memory. Use
c.f.l.LocalFlowConnector for local mode specific flows.
Updated and simplified c.t.Tap and c.s.Scheme interfaces. Changes are not backwards compatible with 1.x releases.
Implemented new pipelining infrastructure to support more complex streaming topologies.
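Illustration for the 2.0.0 c.p.HashJoin entry above. This is a minimal sketch in plain Java of the general
asymmetrical (replicated) join idea, not Cascading's implementation: the class ReplicatedJoinSketch and the
hashJoin helper are hypothetical. The smaller input is held in an in-memory index while the larger input is
streamed past it once, so no grouping or sorting of the large side is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Cascading.
final class ReplicatedJoinSketch {

    // Rows are {key, value} pairs. Returns "key,streamedValue,smallValue" for each inner-join match.
    static List<String> hashJoin(Iterable<String[]> streamed, List<String[]> small) {
        // Accumulate only the small side, indexed by join key.
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : small) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        // Stream the large side once, probing the index; it is never buffered or sorted.
        List<String> out = new ArrayList<>();
        for (String[] row : streamed) {
            List<String> matches = index.get(row[0]);
            if (matches == null) {
                continue; // inner join: unmatched streamed rows are dropped
            }
            for (String match : matches) {
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> large = Arrays.asList(
            new String[]{"1", "x"}, new String[]{"2", "y"}, new String[]{"3", "z"});
        List<String[]> small = Arrays.asList(
            new String[]{"1", "A"}, new String[]{"3", "C"}, new String[]{"3", "D"});
        System.out.println(hashJoin(large, small));
    }
}
```

The trade-off is memory: the indexed side must fit in memory (or spill to disk), which is why this style of join
suits one large stream joined against smaller ones.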
1.2.6
Fixed bug in TupleEntry#selectInteger() and marked it as deprecated.
1.2.5
Removed accidental SLF4J dependencies.
Fixed bug where ISE was thrown if c.f.Flow#stop() was called immediately after #start().
1.2.4
Added info logging of current split input path with a task, if any.
Fixed bug in c.o.f.And, c.o.f.Or, and c.o.f.Xor where the sub-select of arguments was not honored.
Added info log message when writing "direct" to a filesystem, bypassing the temporary folder removing the need to
rename the output file to its target location.
Fixed bug where if all paths that match a glob pattern are empty, an exception is not thrown causing Hadoop to throw
a java.lang.ArrayIndexOutOfBoundsException.
Updated planner to issue an error message if a tail c.p.Pipe instance does not properly bind to a c.t.Tap instance.
1.2.3
Added c.f.Flow#setMaxConcurrentSteps to set the maximum number of steps that can be submitted concurrently.
Fixed bug where NPE was thrown when c.c.CascadeConnector tried to unwind nested c.t.MultiSourceTap instances.
Fixed bug where c.t.Fields#append() would fail when appending unordered selectors.
Updated c.f.FlowProcess to include #isCounterStatusInitialized() to test if the underlying reporting framework
is initialized.
Updated c.f.FlowProcess#keepAlive() method to fail silently if the underlying reporting framework is not initialized.
Updated error message thrown by c.f.FlowStep when unable to find c.t.Tap or c.p.Pipe instances in the flow plan due
to a Class serialized field not implementing #hashCode() or #equals() and relying on object identity.
Added error message explaining the Hadoop mapred.jobtracker.completeuserjobs.maximum property needs to be increased
when dealing with large numbers of jobs. Also caching success value to lower chance of failure.
Fixed bug in c.t.GlobHfs where #equals() and #hashCode() were not consistent between calls.
1.2.2
Fixed bug where OOME caught from within the source c.t.Tap was not being re-thrown properly.
Added #getMapProgress() and #getReduceProgress() to c.f.h.HadoopStepStats.
Fixed NPE with some invocations of c.t.TupleEntry ctor.
Fixed bug where if an operation declared it returned Fields.ARGS and the argument selector used positions, the
outgoing values may merge incorrectly.
1.2.1
Changed info message to not announce ambiguous source trap if none has been set.
Fixed bug where if the c.o.Function result c.t.Tuple was passed immediately to a c.p.Group, it may become modified.
Fixed bug where c.t.TupleEntryIterator#hasNext() failed if called again after returning false.
Fixed issue where reduce task may fail with an OOME during sorting.
1.2.0
Added c.p.a.AverageBy sub-assembly for optimizing averaging processes.
Added c.p.c.GroupClosure#getFlowProcess method to allow c.p.c.Joiner implementations access to current
properties and counters.
Added c.s.CascadingStats methods for accessing available counter groups and names.
Added c.s.WritableSequenceFile as a convenience for reading/writing sequence files holding custom Hadoop
Writable types in either the key, value, or key and value positions.
Added retrieve/publish support to the Conjars repo via Ivy.
Added the c.p.a.AggregateBy class to encapsulate parallel partial Function aggregations and their reduce
side Aggregator. This is a superior alternative to so-called MapReduce Combiners. See javadoc for details.
Changed c.o.Debug to print the number of tuples encountered on #cleanup().
Changed c.s.TextDelimited to always return the expected number of fields even if they are not parsed from
the current line and strict is false, unless Fields.ALL or Fields.UNKNOWN is declared.
Added c.p.a.SumBy sub-assembly for optimizing summing processes.
Added c.p.a.CountBy sub-assembly for optimizing counting processes.
Added c.s.CascadingStats.Status.Skipped state so skipped c.f.Flow instances can be identified.
Added c.f.Flow#setSubmitPriority() to allow for custom order of Flows.
Fixed bug where c.t.MultiSourceTap#pathExists() would return true if one of the child paths was missing.
Changed c.c.CascadeConnector to fail if it detects cycles in the set of given c.f.Flow instances to manage.
Disabled Hadoop warning about not using "options parser".
Added #isSource() and #isSink() methods to c.s.Scheme so that some Scheme instances can report they are either
sink or source only.
Added c.t.Fields#merge() method to allow simple merging of Fields instances while discarding duplicate names and
positions.
Added convenience methods on c.c.CascadeConnector#connect() and c.f.FlowConnector#connect() to accept
j.u.Collection<Flow> and j.u.Collection<Pipe> arguments, respectively.
Added Riffle support via the new c.f.ProcessFlow wrapper class. Riffle allows for non-Cascading jobs and/or
sets of iterative Flows to participate in a c.c.Cascade.
Changed c.c.Cascade instances to disable parallel execution if more than one Flow is a local only job.
Added c.c.Cascade#setMaxConcurrentFlows() property that limits the number of concurrently running Flows.
Added c.c.Cascade#writeDOT method for visualizing the dependencies between flow instances.
Added c.p.a.Unique sub-assembly for optimizing de-duping processes.
Changed c.s.TextDelimited to accept Fields.ALL or Fields.UNKNOWN for arbitrarily sized or unknown records.
Changed c.t.MultiSourceTap to support #openForRead().
Added c.t.Comparison and c.t.StreamComparator interfaces which allow for custom types to be
lazily deserialized during sort comparisons.
Added support for lazy deserialization during c.t.Tuple comparisons while shuffle sorting.
1.1.3
Added publishing of artifacts to the conjars.org jar repo via Ivy.
Added method c.s.CascadingStats#getCurrentDuration to return the current execution duration whether or not the
process/work is finished.
Fixed issues where c.t.Fields#getIndex may return invalid results if accessed from multiple threads simultaneously.
Fixed NPE when attempting to increment a counter before the first map/reduce invocation. Now throws a more
informative ISE message.
Fixed possible NPE when accessing counters via c.f.h.HadoopStepStats.
Fixed bug in c.s.TextDelimited where some unquoted empty values would not be properly parsed.
Added c.f.FlowStep#setName() method to allow override of MR job names. Use in conjunction with
FlowStep#containsPipeNamed() to find appropriate steps.
Fixed bug where c.f.MultiMapReducePlanner did not detect a split after a c.p.GroupBy or c.p.CoGroup where
one or more of the immediate pipes is an c.p.Every instance. An Each split is allowed.
Fixed c.t.TupleEntry#set method so that it may take a c.t.Fields instance for a field name.
Fixed NPE in c.t.TempHfs when parent c.f.Flow is used in a Cascade under certain conditions.
Fixed bug where mixed absolute and relative paths did not result in a proper topological sort when used
in a c.c.Cascade.
Fixed bug where a c.c.Cascade of c.f.Flow and c.f.MapReduceFlow instances did not properly sort topologically.
Added c.c.Cascade#writeDOT method to simplify debugging Cascade instances.
1.1.2
Fixed bug preventing c.s.TextDelimited schemes from being used with a c.t.TemplateTap.
Updated c.t.Scheme base class to force Fields.ALL source declaration to Fields.UNKNOWN, and to force Fields.UNKNOWN
sink declaration to Fields.ALL.
Fixed bug where if null was passed to c.s.TextLine sinkCompression, the behavior would be undefined.
Added back c.t.Tuple#add( Comparable ) to remain backwards compatible with 1.0.
Fixed bug preventing Fields.ALL selector in c.p.Every when incoming positions are used instead of field names
and the given aggregator declares field names.
Fixed bug that prevented the configured codecs from loading for co-group spills.
Fixed bug where c.s.TextDelimited would fail on delimiters that are also regex special characters.
Fixed random j.u.ConcurrentModificationException error when running in Hadoop local mode by synchronizing
the c.f.s.StackElement#closeTraps method.
Fixed missing property values when stored in a nested j.u.Properties object.
Fixed NPE when counter group does not exist yet when querying c.s.FlowStats#getCounterValue.
1.1.1
Fixed bug where some unsafe operations followed by named c.p.Pipe instances were not considered during planning.
Removed imports for SLF4J and replaced with Apache Log4j in c.s.TextDelimited.
Fixed bug where c.t.Fields.SWAP did not properly resolve when following a c.p.Every pipe.
1.1.0
Fixed bug where a c.t.Fields instance can be marked as ordered when modified via #set call.
Changed c.p.CoGroup to detect self-joins and optimize for them.
Changed trap handling to include failures from source and sink c.t.Tap instances. The source Tap will inherit
the assembly head trap and the sink will inherit the assembly tail trap.
Deprecated c.t.Tuple#parse(). It does not properly handle null values or types other than primitives.
Changed c.f.s.StackElement to log a warning for each trap captured. This includes a truncated print of the offending
c.t.TupleEntry and the thrown exception and stack trace. Traps being for exceptional cases, logging exceptions is a
reasonable response.
Changed map and reduce operation stack so that collected c.t.Tuple instances do not remain 'unmodifiable' after
being collected via the c.t.TupleEntryCollector.
Added #getArgumentFields() to c.o.OperationCall for all operations.
Added support for custom EMR properties used for task attempt temporary path management on some filesystems.
Changed c.t.TemplateTap to support an openTapsThreshold value. The default open taps is 300. After the threshold
is met, 10% of the least recently used open taps will be closed.
Changed c.t.Fields#setComparator to accept Fields instances as the fieldName argument.
Only the first field name or position is considered.
Changed c.t.TupleEntry 'get as type' accessors to now also accept c.t.Fields instances as the fieldName argument. Only
the first field name or position is considered.
Updated janino to 2.5.16.
Updated jgrapht to 0.8.1.
Changed c.f.s.FlowMapperStack to source key/value pairs once, instead of per branch.
Changed c.f.FlowPlanner to fail if not all sources or sinks are bound to heads or tails, respectively.
Changed c.t.TupleOutputStream to look up tuple element writers by Class identity.
Added j.b.ConstructorProperties annotation to relevant class constructors.
Added new convenience method c.p.Pipe#names to return an array of all the pipe names in an assembly. This supports
the dynamic creation of traps from opaque assemblies.
Added new c.s.Scheme type c.s.TextDelimited to allow native support for delimited text files.
Added optimization during CoGrouping where the left-most (LHS) pipe is never accumulated; instead the values iterator
is used directly. This allows for the most dense values to be on the LHS, and the most sparse to be on the
RHS of the join.
Added new counters for tuple spills and reads. Also logs grouping after first spill.
Added compression of object serialization and deserialization, on by default. This improves reliability
of very large jobs with very large numbers of input files.
Fixed bad cast of j.l.Error when caught in map/reduce pipeline stack.
Added c.t.Fields#rename to simplify Fields instance manipulations.
Added support for resultGroupFields in c.p.CoGroup. This allows the outgoing grouping fields to be set.
Added c.t.h.BytesSerialization and c.t.h.BytesComparator to allow for c.t.Tuple instances
to hold raw byte arrays (byte[]), and allow joining, grouping, and secondary sorting.
Changed c.t.Tuple and underlying framework to support j.l.Object instead of j.l.Comparable. Note that
Tuple#get() returns Comparable to maintain backwards compatibility.
Added support for custom j.u.Comparator instances to control the grouping and sort orders in c.p.CoGroup and
c.p.GroupBy via the c.t.Fields class.
Added support for planner managed debugging levels via the c.o.DebugLevel enum. Now c.o.Debug operations
can be planned out at runtime in the same manner as c.o.Assertion operations.
Refactored xpath operations to re-use j.x.p.DocumentBuilder instances.
Refactored fields resolver framework to emit consistent error messages across all field resolution types.
Fixed bug where c.t.Tuples would fail when coercing non-standard java types or primitives.
Fixed bug where c.t.Tap instances that returned true for #isWriteDirect() were not properly being initialized
when used as a sink.
Added GUID-like ID values to c.f.Flow and c.c.Cascade instances.
Refactored reduce side grouping and co-grouping operations to remove redundant code calls.
Added ability to capture Hadoop specific job details like task start and stop times, and all available counter values.
Added accessor for increment counters on c.s.CascadingStats. This allows applications to pull aggregate counter
values from c.c.Cascade, c.f.Flow, or c.f.FlowSteps.
Added c.t.GlobHfs c.t.Tap type that accepts Hadoop style globbing syntax. This allows multiple files that match
a given pattern to be used as the sources to a Flow.
Added c.o.s.State and c.o.s.Counter helper operations that respectively set 'state' and increment counters.
Added c.f.FlowProcess#setStatus method to allow for text status messages to be posted.
Added c.o.a.AssertNotEquals assertion type.
Removed planner restriction that traps must not cross map/reduce boundaries. This allows for a single c.t.Tap
trap to be used across a whole branch, regardless of underlying topology.
Added new c.t.Fields field set type named Fields.SWAP. Can only be used as a result selector. Specifies operation
results will replace the argument fields. The remaining input fields will remain intact.
Deprecated c.t.SinkMode#APPEND and replaced with c.t.SinkMode#UPDATE.
Added c.t.MultiSinkTap to allow for simultaneous writes to multiple unique locations.
Added support for compression of c.t.SpillableTupleList by default in order to speed up c.p.CoGrouping operations
where there are very large numbers of values per grouping key.
Added c.o.f.SetValue function for setting values based on the result of a c.o.Filter instance.
Added support for configuring polling interval of job status via c.f.h.MultiMapReducePlanner.
Added c.f.h.MultiMapReducePlanner optimization to detect 'equivalent' adjacent c.t.Tap instances in a c.f.Flow.
This can drastically reduce the number of jobs when there are intermediate sinks between pipe assemblies.
If the taps are not compatible, a job will be inserted to convert the temp tap data to the sink format.
Added support for 'safe' c.o.Operations. By default Operations are safe, that is, they have no side-effects, or
if they do, they are idempotent. Non-safe operations are treated differently by the c.f.h.MultiMapReducePlanner.
Added new c.t.Fields field set type named Fields.REPLACE. Can only be used as a result selector. Specifies the
operation results will replace values in fields with the same names. That is, inline values can be replaced in a
single c.p.Each or c.p.Every. It is especially useful when used with Fields.ARGS as the operation field declaration.
Fix for case where one side of a branch multiplexed in a mapper could step on c.t.Tuple values before being
handed to the next branch. The previous fix was only for CoGroup; this one supports GroupBy merges.
1.0.18
Changed c.t.Tuple#print to not quote null elements to distinguish between 'null' Strings and null values.
Changed planner exception messages to quote head and tail names.
Changed log messages to info when hdfs client finalizer hook cannot be found.
Fix for NPE in c.t.h.MultiInputFormat during certain testing scenarios. Also changed proportioning to honor
suggested numSplits value.
Fix for temp files starting with underscores (_) causing them to be ignored.
Fix for mixed types in properties object causing ClassCastExceptions.
Fix for case where one side of a branch multiplexed in a mapper could step on c.t.Tuple values before being
handed to the next branch.
Fix for edge case where Cascading jars are stored in Hadoop classpath and deserialization of c.f.Flow fails.
Fix for bad cast of j.l.Error when caught in map/reduce pipeline stack.
Fix for bug when selecting positional Fields from positional Fields.
Fix for case when an c.o.Aggregator#start is called when there are no values to iterate across in current grouping.
1.0.17
Changed behavior when cleaning temp files that allows shutdown to continue even if an exception is thrown
during temp file delete.
Fixed bug where c.f.FlowProcess#openTapForRead() included current input file values in iterator.
Fix for intermediate temp files not being cleaned up on c.f.Flow#stop().
Fixed bug where NPE is thrown if all hadoop default properties are not available.
1.0.16
Fixed bug where in some instances o.a.h.m.JobConf hangs when instantiated during co-grouping.
Fixed bug in c.CascadingTestCase#invokeBuffer where the output collector was not properly being set. Added
new methods on #invokeBuffer and #invokeAggregator to take a grouping c.t.TupleEntry.
1.0.15
Fixed bug where c.t.Fields did not check for a null field name or position on the ctor.
Fixed bug in c.u.Util#join() methods where if the first value was empty, the delimiter was not properly applied.
Fixed issue in c.t.h.FSDigestOutputStream where seek() now must be implemented with modern versions of Hadoop.
1.0.14
Fixed bug in planner where JGraphT sometimes returns null instead of an empty List.
Fixed bug in c.o.x.XPathParser that prevented use of multiple xpath expressions.
Added configuration property allowing job polling interval to be configured per c.f.Flow via
Flow#setJobPollingInterval().
Updated ant build to not hard-code hadoop/lib sub-dir names.
1.0.13
Fixed bug where non-String j.u.Properties values were not being copied to the internal o.a.h.m.JobConf instance.
Fixed bug where custom serializations were not recognized during co-grouping spills inside c.t.SpillableTupleList.
1.0.12
Fixed bug where the c.f.FlowPlanner did not detect that tails were not bound to sinks, or that some tail references
were missing.
Fixed j.u.ConcurrentModificationException when using a c.c.CascadeConnector on c.f.Flows using a c.t.MultiSink
c.t.Tap.
Fixed bug where c.f.s.StackException was being wrapped preventing failures within sink c.t.Tap instances from
causing the c.f.Flow to fail. This mainly affected Flows using traps.
1.0.11
Added clearer error message when c.t.Tap is used as both source and sink in a given Flow.
Demoted all DEBUG related c.t.Tuple#print() calls to TRACE.
Fixed NPE when planner finds inconsistencies with c.t.Tap and c.p.Pipe names.
1.0.10
Updated planner error messages when field name collisions are detected.
Fixed issue where temporary paths were not getting deleted consistently.
1.0.9
Fixed issue where reverse ordering a c.p.GroupBy was not possible when sortFields were not given.
Changed c.f.s.StackElement#close() behavior to close elements from the top of the stack.
1.0.8
Fixed bug where Hadoop FS shutdown hooks prevented cleanup of c.f.Flow intermediate files.
Fixed bug where c.t.MultiTap was not accounted for when planning a c.c.Cascade.
Fixed bug where operations in the default package caused NPE when calculating the stacktrace.
Added c.f.StepCounters enum and now increment the counters Tuples_Read, Tuples_Written, Tuples_Trapped.
Fixes for instabilities when using traps in some instances.
Workaround for bug in o.a.h.f.s.NativeS3FileSystem where a null is returned when getting a FileStatus array
in some cases.