Update #1
Commits on Jan 23, 2016
[SPARK-11137][STREAMING] Make StreamingContext.stop() exception-safe
Make StreamingContext.stop() exception-safe Author: jayadevanmurali <jayadevan.m@tcs.com> Closes apache#10807 from jayadevanmurali/branch-0.1-SPARK-11137.
Commit: 5f56980
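A minimal sketch of the exception-safe pattern this change adopts (the helper and step bodies below are illustrative placeholders, not the patch's code): each shutdown step runs even if an earlier one throws.
```scala
import scala.util.control.NonFatal

// Run a shutdown step, logging a non-fatal failure instead of letting it
// abort the rest of stop().
def tryLogNonFatalError(step: => Unit): Unit =
  try step catch {
    case NonFatal(e) => println(s"Ignoring non-fatal error during stop(): $e")
  }

def stop(): Unit = {
  tryLogNonFatalError { /* stop the JobScheduler */ }
  tryLogNonFatalError { /* stop the underlying SparkContext, if owned */ }
}
```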
[SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons
This pull request implements strength reduction for comparing integral expressions and decimal literals, which is more common now because we switch to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule with some refactoring to simplify the control flow. I also moved the DecimalPrecision rule into its own file due to the growing size. Author: Reynold Xin <rxin@databricks.com> Closes apache#10882 from rxin/SPARK-12904-1.
Commit: 423783a
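As a worked example of the idea (a toy model, not the optimizer rule itself): comparing an integer column with a non-integral decimal literal can be rewritten into a cheaper integral comparison, e.g. `intCol > 1.5` becomes `intCol >= 2`.
```scala
// Toy AST; names are illustrative.
sealed trait Expr
case class GtDecimal(col: String, lit: BigDecimal) extends Expr // col > lit
case class GeInt(col: String, lit: Int) extends Expr            // col >= lit

def strengthReduce(e: Expr): Expr = e match {
  // No integer lies strictly between 1.5 and 2, so `col > 1.5` <=> `col >= 2`.
  case GtDecimal(c, d) if !d.isWhole =>
    GeInt(c, d.setScale(0, BigDecimal.RoundingMode.CEILING).toInt)
  case other => other
}

// strengthReduce(GtDecimal("intCol", BigDecimal("1.5"))) == GeInt("intCol", 2)
```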
[STREAMING][MINOR] Scaladoc + logs
Found while doing code review Author: Jacek Laskowski <jacek@japila.pl> Closes apache#10878 from jaceklaskowski/streaming-scaladoc-logs-tiny-fixes.
Commit: cfdcef7
Commits on Jan 24, 2016
[SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT build
ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive). This patch attempts to improve the isolation of these tests in order to address this issue. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.
Commit: f400460
[SPARK-10498][TOOLS][BUILD] Add requirements.txt file for dev python tools
Minor since so few people use them, but it would probably be good to have a requirements file for our python release tools for easier setup (also version pinning). cc JoshRosen who looked at the original JIRA. Author: Holden Karau <holden@us.ibm.com> Closes apache#10871 from holdenk/SPARK-10498-add-requirements-file-for-dev-python-tools.
Commit: a834001
[SPARK-12120][PYSPARK] Improve exception message when failing to initialize HiveContext in PySpark
davies Mind to review? This is the error message after this PR:
```
15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
/Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
  warnings.warn("You must build Spark with Hive. "
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
    return DataFrameReader(self)
  File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
    self._jreader = sqlContext._ssql_ctx.read()
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
    raise e
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
	at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
	at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
	at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:214)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:745)
```
Author: Jeff Zhang <zjffdu@apache.org> Closes apache#10126 from zjffdu/SPARK-12120.
Commit: e789b1d
Commits on Jan 25, 2016
[SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows
When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`. Author: Cheng Lian <lian@databricks.com> Closes apache#10886 from liancheng/spark-12624.
Commit: 3327fd2
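A sketch of the kind of guard involved (the real check lives in the Java-to-Python row conversion; names here are illustrative):
```scala
// Fail early with an actionable message instead of letting an
// ArrayIndexOutOfBoundsException escape from deep inside the conversion.
def checkRowLength(values: Seq[Any], fieldNames: Seq[String]): Unit =
  require(values.length == fieldNames.length,
    s"Row has ${values.length} values but the schema declares " +
    s"${fieldNames.length} fields: ${fieldNames.mkString(", ")}")
```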
[SPARK-12901][SQL] Refactor options for JSON and CSV datasource (not case class and same format).
https://issues.apache.org/jira/browse/SPARK-12901 This PR refactors the options in the JSON and CSV datasources. In more detail:
1. `JSONOptions` uses the same format as `CSVOptions`.
2. They are no longer case classes.
3. `CSVRelation` does not have to be serializable (it was `with Serializable`, but I removed that).
Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#10895 from HyukjinKwon/SPARK-12901.
Commit: 3adebfc
[SPARK-12932][JAVA API] improved error message for java type inference failure
Author: Andy Grove <andygrove73@gmail.com> Closes apache#10865 from andygrove/SPARK-12932.
Commit: d8e4805
[SPARK-12755][CORE] Stop the event logger before the DAG scheduler
[SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped. This contribution is my original work, and I license this work to the Spark project under the project's open source license. Author: Michael Allman <michael@videoamp.com> Closes apache#10700 from mallman/stop_event_logger_first.
Commit: 4ee8191
[SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions
Update user guide for RFormula feature interactions. Meanwhile we also update other new features such as supporting string label in Spark 1.6. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10222 from yanboliang/spark-11965.
Commit: dd2325d
Closes apache#9046 Closes apache#8532 Closes apache#10756 Closes apache#8960 Closes apache#10485 Closes apache#10467
Commit: ef8fb36
[SPARK-12149][WEB UI] Executor UI improvement suggestions - Color UI
Added color coding to the Executors page for Active Tasks, Failed Tasks, Completed Tasks and Task Time. Active Tasks is shaded blue, with its range based on the percentage of total cores used. Failed Tasks is shaded red, ranging over the first 10% of total tasks failed. Completed Tasks is shaded green, ranging over 10% of total tasks including failed and active tasks, but only when there are active or failed tasks on that executor. Task Time is shaded red when GC Time goes over 10% of total time, with its range directly corresponding to the percent of total time. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes apache#10154 from ajbozarth/spark12149.
Commit: c037d25
[SPARK-12902] [SQL] visualization for generated operators
This PR brings back visualization for generated operators; they look like: ![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png) ![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png) Note: SQL metrics are not supported right now, because they are very slow; they will be supported once we have batch mode. Author: Davies Liu <davies@databricks.com> Closes apache#10828 from davies/viz_codegen.
Commit: 7d877c3
Commit: 00026fa (no commit message)
[SPARK-12975][SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns
When users are using `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of partitioning columns. For example,
```
df.write
  .format(source)
  .partitionBy("i")
  .bucketBy(8, "i", "k")
  .saveAsTable("bucketed_table")
```
However, in the above case, adding column `i` into `bucketBy` is useless. It is just wasting extra CPU when reading or writing bucket tables. Thus, like Hive, we can issue an exception and let users do the change. Also added a test case for checking if the information of `sortBy` and `bucketBy` columns is correctly saved in the metastore table. Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes apache#10891 from gatorsmile/commonKeysInPartitionByBucketBy.
Commit: 9348431
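The shape of the new validation, as a sketch (Spark's real check sits in the DataFrameWriter path; names here are illustrative):
```scala
// Reject bucketing columns that already appear among partitioning columns,
// since bucketing by them adds cost without adding information.
def assertDisjoint(partitionCols: Seq[String], bucketCols: Seq[String]): Unit = {
  val overlap = bucketCols.filter(partitionCols.contains)
  require(overlap.isEmpty,
    s"Bucketing columns should not be part of partitioning columns: ${overlap.mkString(", ")}")
}
```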
[SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark
```PCAModel``` can output ```explainedVariance``` at Python side. cc mengxr srowen Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10830 from yanboliang/spark-12905.
Commit: dcae355
[SPARK-12934][SQL] Count-min sketch serialization
This PR adds serialization support for `CountMinSketch`. A version number is added to version the serialized binary format. Author: Cheng Lian <lian@databricks.com> Closes apache#10893 from liancheng/cms-serialization.
Commit: 6f0f1d9
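A sketch of what version-prefixed serialization looks like (the field layout below is assumed, not `CountMinSketch`'s actual binary format): writing the version word first lets future readers dispatch on it.
```scala
import java.io.{DataOutputStream, OutputStream}

object SketchFormat {
  val Version1 = 1

  // Hypothetical layout: version, depth, width, then the counter table.
  def write(out: OutputStream, depth: Int, width: Int,
            table: Array[Array[Long]]): Unit = {
    val dos = new DataOutputStream(out)
    dos.writeInt(Version1) // version first, so old readers can fail cleanly
    dos.writeInt(depth)
    dos.writeInt(width)
    for (row <- table; cell <- row) dos.writeLong(cell)
    dos.flush()
  }
}
```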
Commits on Jan 26, 2016
[SPARK-12879] [SQL] improve the unsafe row writing framework
As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more doc to it and make it easier to use. This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip the row size update. Then we can apply this technique to more places easily. A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:
**old version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:          Avg Time(ms)  Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                      2616.04         102.61         1.00 X
single nullable long             3032.54          88.52         0.86 X
primitive types                  9121.05          29.43         0.29 X
nullable primitive types        12410.60          21.63         0.21 X
```
**new version**
```
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
unsafe projection:          Avg Time(ms)  Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                      1533.34         175.07         1.00 X
single nullable long             2306.73         116.37         0.66 X
primitive types                  8403.93          31.94         0.18 X
nullable primitive types        12448.39          21.56         0.12 X
```
For a single non-nullable long (the best case), we can have about a 1.7x speedup. Even if it's nullable, we can still have a 1.3x speedup. For other cases it's not such a boost, as the saved operations take only a small proportion of the whole process. The benchmark code is included in this PR. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10809 from cloud-fan/unsafe-projection.
Commit: be375fc
[SPARK-12936][SQL] Initial bloom filter implementation
This PR adds an initial implementation of bloom filter in the newly added sketch module. The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java). Some difference from the design doc: * expose `bitSize` instead of `sizeInBytes` to user. * always need the `expectedInsertions` parameter when create bloom filter. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10883 from cloud-fan/bloom-filter.
Commit: 109061f
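For reference, the standard sizing math behind `expectedInsertions` and the bit size (the same formulas Guava's `BloomFilter` uses; background, not the patch's code):
```scala
// m = ceil(-n * ln(p) / (ln 2)^2) bits for n insertions at false-positive
// probability p; k = round(m/n * ln 2) hash functions.
def optimalNumOfBits(expectedInsertions: Long, fpp: Double): Long =
  math.ceil(-expectedInsertions * math.log(fpp) / (math.log(2) * math.log(2))).toLong

def optimalNumOfHashFunctions(expectedInsertions: Long, numBits: Long): Int =
  math.max(1, math.round(numBits.toDouble / expectedInsertions * math.log(2)).toInt)

// e.g. one million insertions at 3% fpp needs ~7.3 million bits (~0.9 MB)
// and 5 hash functions.
```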
[SPARK-12934] use try-with-resources for streams
liancheng please take a look Author: tedyu <yuzhihong@gmail.com> Closes apache#10906 from tedyu/master.
Commit: fdcc351
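The Java fix uses try-with-resources; the comparable Scala idiom is a small loan helper (a sketch — this helper is illustrative, not code from the patch):
```scala
// Close the resource whether or not the body throws.
def withResource[R <: AutoCloseable, A](resource: R)(body: R => A): A =
  try body(resource) finally resource.close()

// usage: the stream is closed even if read() throws
// val firstByte = withResource(new java.io.FileInputStream("data.bin"))(_.read())
```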
[SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer
Add Python API for ml.feature.QuantileDiscretizer. One open question: do we want to re-use the java model, create a new model, or use a different wrapper around the java model? cc brkyvz & mengxr Author: Holden Karau <holden@us.ibm.com> Closes apache#10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
Commit: b66afde
[SPARK-12834] Change ser/de of JavaArray and JavaList
https://issues.apache.org/jira/browse/SPARK-12834 We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` in Python side. However, there is no need to transform them in such an inefficient way. Instead of it, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue to Ser/De Scala Array as I said in https://issues.apache.org/jira/browse/SPARK-12780 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10772 from yinxusen/SPARK-12834.
Commit: ae47ba7
[SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in PySpark for now
I saw several failures from recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored and we will fix the flakyness in SPARK-10086. gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"? cc: jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes apache#10909 from mengxr/SPARK-10086.
Commit: 27c910f
[SQL][MINOR] A few minor tweaks to CSV reader.
This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc. Author: Reynold Xin <rxin@databricks.com> Closes apache#10919 from rxin/csv-minor.
Commit: d54cfed
[SPARK-12937][SQL] bloom filter serialization
This PR adds serialization support for BloomFilter. A version number is added to version the serialized binary format. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10920 from cloud-fan/bloom-filter.
Commit: 6743de3
[SPARK-12961][CORE] Prevent snappy-java memory leak
JIRA: https://issues.apache.org/jira/browse/SPARK-12961 To prevent a memory leak in snappy-java, just call the method once and cache the result. After the library releases a new version, we can remove this object. JoshRosen Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#10875 from viirya/prevent-snappy-memory-leak.
Commit: 5936bf9
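The workaround's shape, as a sketch (the exact snappy-java call being cached is an assumption here; treat the details as illustrative):
```scala
// A lazy val makes the underlying native call happen exactly once per JVM;
// every later use reads the cached result instead of re-invoking it.
object SnappyVersion {
  lazy val version: String = org.xerial.snappy.Snappy.getNativeLibraryVersion
}
```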
[SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable. CC rxin pwendell for API change; tdas since it also touches streaming. Author: Sean Owen <sowen@cloudera.com> Closes apache#10413 from srowen/SPARK-3369.
Commit: 649e9d0
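On the Scala side the contract has always been `Iterator => Iterator`, which the Java API now mirrors; for example:
```scala
import org.apache.spark.rdd.RDD

// The partition function consumes an Iterator and must produce an Iterator
// (not an Iterable) — here, one partial sum per partition.
def partitionSums(rdd: RDD[Int]): RDD[Int] =
  rdd.mapPartitions(iter => Iterator.single(iter.sum))
```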
[SPARK-10911] Executors should System.exit on clean shutdown.
Call system.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441. Author: zhuol <zhuol@yahoo-inc.com> Closes apache#9946 from zhuoliu/10911.
Commit: ae0309a (authored by zhuol, committed by Tom Graves on Jan 26, 2016)
[SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format
This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024), which in turn prevents such tables from being queried in SparkSQL. Author: Sameer Agarwal <sameer@databricks.com> Closes apache#10826 from sameeragarwal/skip-hive-metadata.
Commit: 08c781c
[SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions for streaming-akka project
Since `actorStream` is an external project, we should add the linking and deploying instructions for it. A follow up PR of apache#10744 Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#10856 from zsxwing/akka-link-instruction.
Commit: cbd507d
[SPARK-11923][ML] Python API for ml.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-11923 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10186 from yinxusen/SPARK-11923.
Commit: 8beab68
[SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer other than its parent class
https://issues.apache.org/jira/browse/SPARK-12952 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10863 from yinxusen/SPARK-12952.
Commit: fbf7623
[SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in dev/run-tests
This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies. This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure. Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after. In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10885 from JoshRosen/SPARK-8725.
Commit: ee74498
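dev/run-tests implements this in Python; the sketch below shows the same Kahn-style idea in Scala with illustrative module names:
```scala
// Order modules so every module's dependencies come before it.
def topoSort(dependsOn: Map[String, Set[String]]): List[String] = {
  val remaining = scala.collection.mutable.Map(dependsOn.toSeq: _*)
  val order = scala.collection.mutable.ListBuffer.empty[String]
  while (remaining.nonEmpty) {
    val ready = remaining.collect {
      case (m, deps) if deps.forall(order.contains) => m
    }.toList.sorted
    require(ready.nonEmpty, "dependency cycle among modules")
    order ++= ready
    ready.foreach(remaining.remove)
  }
  order.toList
}

// topoSort(Map("catalyst" -> Set(), "sql" -> Set("catalyst"), "hive" -> Set("sql")))
// => List("catalyst", "sql", "hive")
```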
[SQL] Minor Scaladoc format fix
Otherwise the `^` character is always marked as error in IntelliJ since it represents an unclosed superscript markup tag. Author: Cheng Lian <lian@databricks.com> Closes apache#10926 from liancheng/agg-doc-fix.
Commit: 83507fe
[SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark
The environment variable ADD_FILES was created for adding python files on the spark context to be distributed to executors (SPARK-865); this is deprecated now. Users are encouraged to use --py-files for adding python files. Author: Jeff Zhang <zjffdu@apache.org> Closes apache#10913 from zjffdu/SPARK-12993.
Commit: 19fdb21
[SPARK-10509][PYSPARK] Reduce excessive param boiler plate code
The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh). Author: Holden Karau <holden@us.ibm.com> Closes apache#10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
Commit: eb91729
Commits on Jan 27, 2016
[SPARK-12614][CORE] Don't throw non fatal exception from ask
Right now RpcEndpointRef.ask may throw exception in some corner cases, such as calling ask after stopping RpcEnv. It's better to avoid throwing exception from RpcEndpointRef.ask. We can send the exception to the future for `ask`. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#10568 from zsxwing/send-ask-fail.
Commit: 22662b2
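The essence of the change, sketched (simplified; the real code routes through the RPC environment): failures are delivered through the returned Future instead of thrown from `ask` itself.
```scala
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

def ask(sendMessage: () => Unit): Future[Any] = {
  val promise = Promise[Any]()
  try sendMessage()
  catch {
    // e.g. RpcEnv already stopped: fail the future, don't throw to the caller
    case NonFatal(e) => promise.tryFailure(e)
  }
  promise.future
}
```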
[SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and Add LibSVMOutputWriter
The behavior of LibSVMRelation is not changed except adding LibSVMOutputWriter:
* Partition is still not supported
* Multiple input paths are not supported
Author: Jeff Zhang <zjffdu@apache.org> Closes apache#9595 from zjffdu/SPARK-11622.
Commit: 1dac964
[SPARK-12854][SQL] Implement complex types support in ColumnarBatch
This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays; there is a simple mapping between the richer Catalyst types and these two. Strings are treated as an array of bytes. ColumnarBatch will contain a column for each node of the schema; non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child per field; arrays are internal nodes with one child. Structs just contain nullability; arrays contain offsets and lengths into the child array. This structure can handle arbitrary nesting. It has the key property that we maintain the columnar layout throughout and that primitive types are only stored in the leaf nodes and contiguous across rows. For example, if the schema is
```
array<array<int>>
```
there are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively. As part of this, this patch adds append APIs in addition to the put APIs (e.g. putLong(rowid, v) vs appendLong(v)). These APIs are necessary when the batch contains variable-length elements; the vectors are not fixed length and will grow as necessary. This should make usage a lot simpler for the writer. Author: Nong Li <nong@databricks.com> Closes apache#10820 from nongli/spark-12854.
Commit: 5551273
[SPARK-7780][MLLIB] intercept in logisticregressionwith lbfgs should not be regularized
The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib since the majority of users are still using the MLlib api. Note that both of them do feature scaling to improve convergence, and the only difference is the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. Previously partially reviewed at apache#6386 (comment); re-opening for dbtsai to review. Author: Holden Karau <holden@us.ibm.com> Author: Holden Karau <holden@pigscanfly.ca> Closes apache#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
Commit: b72611f
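The distinction, as a sketch (the coefficient layout here — intercept stored last — is an assumption for illustration): the L2 penalty sums over the feature weights but skips the intercept.
```scala
// Penalize only the feature weights; the intercept encodes the class prior
// and should not be shrunk toward zero.
def l2Penalty(coefficientsWithInterceptLast: Array[Double], regParam: Double): Double = {
  val weights = coefficientsWithInterceptLast.dropRight(1)
  0.5 * regParam * weights.map(w => w * w).sum
}
```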
[SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR
Add ```covar_samp``` and ```covar_pop``` for SparkR. Should we also provide a ```cov``` alias for ```covar_samp```? There is a ```cov``` implementation in stats.R which masks ```stats::cov``` already, but this may be a breaking API change. cc sun-rui felixcheung shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10829 from yanboliang/spark-12903.
Commit: e7f9199
[SPARK-12935][SQL] DataFrame API for Count-Min Sketch
This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs. Author: Cheng Lian <lian@databricks.com> Closes apache#10911 from liancheng/cms-df-api.
Commit: ce38a35
[SPARK-12728][SQL] Integrates SQL generation with native view
This PR is a follow-up of PR apache#10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical. In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we fail to map the plan to SQL, we fall back to the original native view approach. One important issue this PR fixes is that we can now use CTEs when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, the HiveQL parser doesn't allow a CTE to appear as a subquery. Namely, something like this is disallowed:
```sql
SELECT n FROM (
  WITH w AS (SELECT 1 AS n)
  SELECT * FROM w
) v
```
This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during the analysis phase, thus there won't be CTE expressions in the generated SQL query string). Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes apache#10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
Commit: 58f5d8c
[SPARK-12967][NETTY] Avoid NettyRpc error message during sparkContext shutdown
If there's an RPC issue while sparkContext is alive but stopped (which would happen only when executing SparkContext.stop), log a warning instead. This is a common occurrence. vanzin Author: Nishkam Ravi <nishkamravi@gmail.com> Author: nishkamravi2 <nishkamravi@gmail.com> Closes apache#10881 from nishkamravi2/master_netty.
Commit: bae3c9a
[SPARK-12780] Inconsistency returning value of ML python models' properties
https://issues.apache.org/jira/browse/SPARK-12780 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10724 from yinxusen/SPARK-12780.
Commit: 4db255c
[SPARK-12983][CORE][DOC] Correct metrics.properties.template
There are some typos or plain unintelligible sentences in the metrics template. Author: BenFradet <benjamin.fradet@gmail.com> Closes apache#10902 from BenFradet/SPARK-12983.
Commit: 90b0e56
[SPARK-1680][DOCS] Explain environment variables for running on YARN in cluster mode
JIRA 1680 added a property called spark.yarn.appMasterEnv. This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables Author: Andrew <weiner.andrew.j@gmail.com> Closes apache#10869 from weineran/branch-yarn-docs.
Commit: 093291c
[SPARK-13023][PROJECT INFRA] Fix handling of root module in modules_to_test()
There's a minor bug in how we handle the `root` module in the `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends on `build` (since every test needs to run on any build test), we now need to check for the presence of root in `modules_to_test` instead of `changed_modules`. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10933 from JoshRosen/build-module-fix.
Commit: 41f0c85
[SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works. Author: Jason Lee <cjlee@us.ibm.com> Closes apache#8969 from jasoncl/SPARK-10847.
Commit: edd4737
[SPARK-12895][SPARK-12896] Migrate TaskMetrics to accumulators
The high-level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:
**SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.
**SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.
While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here. Note: this was once part of apache#10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master. Author: Andrew Or <andrew@databricks.com> Closes apache#10835 from andrewor14/task-metrics-use-accums.
Commit: 87abcf7
[SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract
Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition needs to match its position in the partitions array. If a custom RDD implementation violates this contract, then Spark has the potential to become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html In order to guard against this infinite loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs whose `partitions` violate the API contract. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10932 from JoshRosen/SPARK-13021.
Commit: 32f7411
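A sketch of the fail-fast check (the exact exception type and message are Spark's; the check itself is just the contract quoted above):
```scala
import org.apache.spark.Partition

// Every partition must report the index matching its position in the array.
def verifyPartitions(partitions: Array[Partition]): Unit =
  partitions.zipWithIndex.foreach { case (part, i) =>
    require(part.index == i,
      s"partitions($i).index == ${part.index}, violating the RDD.partitions contract")
  }
```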
[SPARK-12938][SQL] DataFrame API for Bloom filter
This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs. This PR also adds two specialized `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10937 from cloud-fan/bloom-filter.
Commit: 680afab
[SPARK-12865][SPARK-12866][SQL] Migrate SparkSQLParser/ExtendedHiveQlParser commands to new Parser
This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). This also improves the current SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands; this PR respects these commands (and passes them on to Hive). This PR and apache#10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst. The PR is marked WIP as long as it doesn't pass all tests. cc rxin viirya winningsix (this touches apache#10144) Author: Herman van Hovell <hvanhovell@questtec.nl> Closes apache#10905 from hvanhovell/SPARK-12866.
Commit: ef96cd3
[HOTFIX] Fix Scala 2.11 compilation
by explicitly marking annotated parameters as vals (SI-8813). Caused by apache#10835. Author: Andrew Or <andrew@databricks.com> Closes apache#10955 from andrewor14/fix-scala211.
Commit: d702f0c
[SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBatch.Row
These two classes became identical as the implementation progressed. Author: Nong Li <nong@databricks.com> Closes apache#10952 from nongli/spark-13045.
Commit: 4a09123
Commits on Jan 28, 2016
Provide same info as in spark-submit --help
This is stated for --packages and --repositories but not for --jars, so people expect a standard java classpath to work, with expansion and a different delimiter than a comma. Currently this is only stated in the --help for spark-submit: "Comma-separated list of local jars to include on the driver and executor classpaths." Author: James Lohse <jimlohse@users.noreply.github.com> Closes apache#10890 from jimlohse/patch-1.
Commit: c220443
[SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch
This PR is a follow-up of apache#10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`. Author: Cheng Lian <lian@databricks.com> Closes apache#10968 from liancheng/cms-specialized.
Commit: 415d0a8
[SPARK-12926][SQL] SQLContext to display warning message when non-sql configs are being set
Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work, e.g. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done. Author: Tejas Patil <tejasp@fb.com> Closes apache#10849 from tejasapatil/SPARK-12926.
Commit: 6768039
[SPARK-13031] [SQL] cleanup codegen and improve test coverage
1. Enable whole-stage codegen during tests even if there is only one operator that supports it.
2. Split doProduce() into two APIs: upstream() and doProduce().
3. Generate a prefix for the fresh names of each operator.
4. Pass UnsafeRow to the parent directly (avoid getters and creating UnsafeRow again).
5. Fix bugs and tests.
Author: Davies Liu <davies@databricks.com> Closes apache#10944 from davies/gen_refactor.
Commit: cc18a71
[SPARK-9835][ML] Implement IterativelyReweightedLeastSquares solver
Implement an ```IterativelyReweightedLeastSquares``` solver for GLM. I consider it a solver rather than an estimator; it is only used internally, so I keep it ```private[ml]```. There are two limitations in the current implementation compared with R:
* It cannot support a tuple as the response for the ```Binomial``` family, such as the following code:
```
glm(cbind(using, notUsing) ~ age + education + wantsMore, family = binomial)
```
* It does not support ```offset```.
Because ```RFormula``` does not support a tuple as label or the ```offset``` keyword, I simplified the implementation. Adding support for these two functions is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistical summary for IRLS. The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM). Please focus on the main structure and overpass minor issues/docs that I will update later. Any comments and opinions will be appreciated. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10639 from yanboliang/spark-9835.
Commit: df78a93
[SPARK-12401][SQL] Add integration tests for postgres enum types
We can handle postgresql-specific enum types as strings in jdbc. So, we should just add tests and close the corresponding JIRA ticket. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes apache#10596 from maropu/AddTestsInIntegration.
Commit: abae889
[SPARK-12749][SQL] add json option to parse floating-point types as DecimalType
I tried to add this via the `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success. Added a test for non-complex types. Should I add a test for complex types? Author: Brandon Bradley <bradleytastic@gmail.com> Closes apache#10936 from blbradley/spark-12749.
Commit: 3a40c0e
Commits on Jan 29, 2016
[SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet
JIRA: https://issues.apache.org/jira/browse/SPARK-11955 Currently we simply skip pushdowning filters in parquet if we enable schema merging. However, we can actually mark particular fields in merging schema for safely pushdowning filters in parquet. Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#9940 from viirya/safe-pushdown-parquet-filters.
Commit: 4637fc0
Revert "[SPARK-13031] [SQL] cleanup codegen and improve test coverage"
This reverts commit cc18a71.
Commit: b9dfdcc
[SPARK-12968][SQL] Implement command to set current database
JIRA: https://issues.apache.org/jira/browse/SPARK-12968 Implement command to set current database. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes apache#10916 from viirya/ddl-use-database.
Commit: 66449b8
[SPARK-13067] [SQL] workaround for a weird scala reflection problem
A simple workaround to avoid getting parameter types when convert a logical plan to json. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10970 from cloud-fan/reflection.
Commit: 721ced2
[SPARK-13050][BUILD] Scalatest tags fail build with the addition of the sketch module
A dependency on the spark test tags was left out of the sketch module pom file, causing builds to fail when test tags were used. This dependency is found in the pom file of every other module in spark. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes apache#10954 from ajbozarth/spark13050.
Commit: 8d3cc3d
[SPARK-13031][SQL] cleanup codegen and improve test coverage
1. Enable whole-stage codegen during tests even if there is only one operator that supports it.
2. Split doProduce() into two APIs: upstream() and doProduce().
3. Generate a prefix for the fresh names of each operator.
4. Pass UnsafeRow to the parent directly (avoid getters and creating UnsafeRow again).
5. Fix bugs and tests.
This PR re-opens apache#10944 and fixes the bug. Author: Davies Liu <davies@databricks.com> Closes apache#10977 from davies/gen_refactor.
Commit: 55561e7
[SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example
* Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
* Make ```LinearRegression``` support ```save/load``` as an example.
After this is merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#10469 from yanboliang/spark-11939.
Commit: e51b6ea
[SPARK-10873] Support column sort and search for History Server.
[SPARK-10873] Support column sort and search for History Server using jQuery DataTable and REST API. Before this commit, the history server generated hard-coded html and could not support search; also, sorting was disabled if there was any application that has more than one attempt. Supporting search and sort (over all applications rather than the 20 entries in the current page) in any case will greatly improve the user experience.
1. Create the historypage-template.html for displaying application information in datatables.
2. historypage.js uses jQuery to access the data from the /api/v1/applications REST API, and uses DataTable to display each application's information. For an application that has more than one attempt, the RowsGroup is used to merge such entries while at the same time supporting sort and search.
3. "duration" and "lastUpdated" rest API are added to application's "attempts".
4. External javascript and css files for datatables, RowsGroup and jquery plugins are added with licenses clarified.
Snapshots of how it looks now: History page view: ![historypage](https://cloud.githubusercontent.com/assets/11683054/12184383/89bad774-b55a-11e5-84e4-b0276172976f.png) Search: ![search](https://cloud.githubusercontent.com/assets/11683054/12184385/8d3b94b0-b55a-11e5-869a-cc0ef0a4242a.png) Sort by started time: ![sort-by-started-time](https://cloud.githubusercontent.com/assets/11683054/12184387/8f757c3c-b55a-11e5-98c8-577936366566.png) Author: zhuol <zhuol@yahoo-inc.com> Closes apache#10648 from zhuoliu/10873.
Commit: e4c1162 (authored by zhuol, committed by Tom Graves on Jan 29, 2016)
[SPARK-13072] [SQL] simplify and improve murmur3 hash expression codegen
Simplify (remove several unnecessary local variables in) the generated code of the hash expression, and avoid the null check if possible. Generated code comparison for `hash(int, double, string, array<string>)`:
**before:**
```
public UnsafeRow apply(InternalRow i) {
  /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
  int value1 = 42;
  /* input[0, int] */
  int value3 = i.getInt(0);
  if (!false) {
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
  }
  /* input[1, double] */
  double value5 = i.getDouble(1);
  if (!false) {
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
  }
  /* input[2, string] */
  boolean isNull6 = i.isNullAt(2);
  UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
  if (!isNull6) {
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
  }
  /* input[3, array<int>] */
  boolean isNull8 = i.isNullAt(3);
  ArrayData value9 = isNull8 ? null : (i.getArray(3));
  if (!isNull8) {
    int result10 = value1;
    for (int index11 = 0; index11 < value9.numElements(); index11++) {
      if (!value9.isNullAt(index11)) {
        final int element12 = value9.getInt(index11);
        result10 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element12, result10);
      }
    }
    value1 = result10;
  }
}
```
**after:**
```
public UnsafeRow apply(InternalRow i) {
  /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
  int value1 = 42;
  /* input[0, int] */
  int value3 = i.getInt(0);
  value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
  /* input[1, double] */
  double value5 = i.getDouble(1);
  value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
  /* input[2, string] */
  boolean isNull6 = i.isNullAt(2);
  UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
  if (!isNull6) {
    value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
  }
  /* input[3, array<int>] */
  boolean isNull8 = i.isNullAt(3);
  ArrayData value9 = isNull8 ? null : (i.getArray(3));
  if (!isNull8) {
    for (int index10 = 0; index10 < value9.numElements(); index10++) {
      final int element11 = value9.getInt(index10);
      value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element11, value1);
    }
  }
  rowWriter14.write(0, value1);
  return result12;
}
```
Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10974 from cloud-fan/codegen.
Commit: c5f745e
[SPARK-12656] [SQL] Implement Intersect with Left-semi Join
Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins). After a search, I found one of the mainstream RDBMS did the same. In their query explain, Intersect is replaced by Left-semi Join. Left-semi Join could help outer-join elimination in Optimizer, as shown in the PR: apache#10566 Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes apache#10630 from gatorsmile/IntersectBySemiJoin.
Commit: 5f686cc
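Illustratively, for two DataFrames that share columns `a` and `b` (an assumption for this sketch), the equivalent plan is a null-safe left-semi join followed by distinct — a hand-written sketch, not the optimizer rule itself:
```scala
import org.apache.spark.sql.DataFrame

// df1 INTERSECT df2, expressed as a distinct left-semi join with
// null-safe equality (<=>) so NULLs compare equal, as INTERSECT requires.
def intersectViaSemiJoin(df1: DataFrame, df2: DataFrame): DataFrame =
  df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b"), "leftsemi")
     .distinct()
```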
[SPARK-12818] Polishes spark-sketch module
Fixes various minor code and Javadoc styling issues. Author: Cheng Lian <lian@databricks.com> Closes apache#10985 from liancheng/sketch-polishing.
Commit: 2b027e9
[SPARK-13055] SQLHistoryListener throws ClassCastException
This is an existing issue uncovered recently by apache#10835. The reason for the exception was because the `SQLHistoryListener` gets all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener gets the `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, then it proceeds to cast the Int to a Long, which fails. The fix is to mark accumulators representing SQL metrics using some internal metadata. Then we can identify which ones are SQL metrics and only process those in the `SQLHistoryListener`. Author: Andrew Or <andrew@databricks.com> Closes apache#10971 from andrewor14/fix-sql-history.
Commit: e38b0ba
Commits on Jan 30, 2016
[SPARK-13076][SQL] Rename ClientInterface -> HiveClient
And ClientWrapper -> HiveClientImpl. I have some followup pull requests to introduce a new internal catalog, and I think this new naming reflects better the functionality of the two classes. Author: Reynold Xin <rxin@databricks.com> Closes apache#10981 from rxin/SPARK-13076.
Commit: 2cbc412
[SPARK-13096][TEST] Fix flaky verifyPeakExecutionMemorySet
Previously we would assert things before all events are guaranteed to have been processed. To fix this, just block until all events are actually processed, i.e. until the listener queue is empty. https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/79/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling/ Author: Andrew Or <andrew@databricks.com> Closes apache#10990 from andrewor14/accum-suite-less-flaky.
Commit: e6ceac4 (committed by Andrew Or on Jan 30, 2016)
[SPARK-13088] Fix DAG viz in latest version of chrome
Apparently chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: andrewor14/dagre-d3@7d6c000, which is taken from the fix in the main repo: dagrejs/dagre-d3@1ef067f Upstream issue: dagrejs/dagre-d3#202 Author: Andrew Or <andrew@databricks.com> Closes apache#10986 from andrewor14/fix-dag-viz.
Commit: 70e69fc (committed by Andrew Or on Jan 30, 2016)
[SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics
This issue is causing tests to fail consistently in master with Hadoop 2.6 / 2.7. This is because for Hadoop 2.5+ we overwrite existing values of `InputMetrics#bytesRead` in each call to `HadoopRDD#compute`. In the case of coalesce, e.g.
```
sc.textFile(..., 4).coalesce(2).count()
```
we will call `compute` multiple times in the same task, overwriting `bytesRead` values from previous calls to `compute`. For a regression test, see `InputOutputMetricsSuite.input metrics for old hadoop with coalesce`. I did not add a new regression test because it's impossible without significant refactoring; there's a lot of existing duplicate code in this corner of Spark. This was caused by apache#10835. Author: Andrew Or <andrew@databricks.com> Closes apache#10973 from andrewor14/fix-input-metrics-coalesce.
Andrew Or committed Jan 30, 2016
Commit: 12252d1
-
[SPARK-12914] [SQL] generate aggregation with grouping keys
This PR adds support for grouping keys in generated TungstenAggregate. Spilling and performance improvements for BytesToBytesMap will be done in a followup PR. Author: Davies Liu <davies@databricks.com> Closes apache#10855 from davies/gen_keys.
Commit: e6a02c6
-
[SPARK-13098] [SQL] remove GenericInternalRowWithSchema
This class is only used for serialization of Python DataFrames. However, we don't require an internal row there, so `GenericRowWithSchema` can also do the job. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10992 from cloud-fan/python.
Commit: dab246f
-
[SPARK-6363][BUILD] Make Scala 2.11 the default Scala version
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds). The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance). After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10608 from JoshRosen/SPARK-6363.
Commit: 289373b
-
[SPARK-13100][SQL] Improve the performance of the stringToDate method in DateTimeUtils.scala
In JDK 1.7, TimeZone.getTimeZone() is synchronized, so use an instance variable to hold a GMT TimeZone object instead of instantiating one every time. Author: wangyang <wangyang@haizhi.com> Closes apache#10994 from wangyang1992/datetimeUtil.
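A minimal sketch of the caching pattern described (field and object names assumed):

```scala
import java.util.TimeZone

object DateTimeUtilsSketch {
  // Fetched once and reused; TimeZone.getTimeZone is synchronized on JDK 7,
  // so calling it per conversion serializes hot paths across threads.
  private val gmt: TimeZone = TimeZone.getTimeZone("GMT")

  def gmtOffsetMillis(timestamp: Long): Int = gmt.getOffset(timestamp)

  def main(args: Array[String]): Unit =
    println(gmtOffsetMillis(System.currentTimeMillis())) // 0 for GMT
}
```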
Commit: de28371
Commits on Jan 31, 2016
-
[SPARK-13070][SQL] Better error message when Parquet schema merging fails
Make sure we throw better error messages when Parquet schema merging fails. Author: Cheng Lian <lian@databricks.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#10979 from viirya/schema-merging-failure-message.
Commit: a1303de
-
[SPARK-12689][SQL] Migrate DDL parsing to the newly absorbed parser
JIRA: https://issues.apache.org/jira/browse/SPARK-12689 DDLParser processes three commands: createTable, describeTable and refreshTable. This patch migrates the three commands to the newly absorbed parser. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes apache#10723 from viirya/migrate-ddl-describe.
Commit: 0e6d92d
-
[SPARK-13049] Add First/last with ignore nulls to functions.scala
This PR adds the ability to specify the ```ignoreNulls``` option in the functions DSL, e.g.: ```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other")))``` This PR is somewhere between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport to 1.6. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes apache#10957 from hvanhovell/SPARK-13049.
Commit: 5a8b978
Commits on Feb 1, 2016
-
[SPARK-13093] [SQL] improve null check in nullSafeCodeGen for unary, binary and ternary expression
The current implementation is sub-optimal:
- If an expression is always nullable, e.g. `Unhex`, we can still remove the null check for children that are not nullable.
- If an expression has some non-nullable children, we can still remove the null check for those children and keep it for the others.
This PR improves this by making the null check elimination more fine-grained. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10987 from cloud-fan/null-check.
Commit: c1da4d4
-
[SPARK-6847][CORE][STREAMING] Fix stack overflow issue when updateStateByKey is followed by a checkpointed dstream
Add a local property that indicates whether to checkpoint all RDDs that are marked with the checkpoint flag, and enable it in Streaming. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#10934 from zsxwing/recursive-checkpoint.
Commit: 6075573
-
[SPARK-12989][SQL] Delaying Alias Cleanup after ExtractWindowExpressions
JIRA: https://issues.apache.org/jira/browse/SPARK-12989 In the rule `ExtractWindowExpressions`, we simply replace aliases by the corresponding attributes. However, this will cause an issue exposed by the following case:

```scala
val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
  .withColumn("Data", struct("A", "B", "C"))
  .drop("A")
  .drop("B")
  .drop("C")

val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
data.select($"*", max("num").over(winSpec) as "max").explain(true)
```

In this case, both `Data.A` and `Data.B` are aliases in `WindowSpecDefinition`. If we replace these alias expressions by their alias names, we are unable to know what they are, since they will not be put in `missingExpr` either. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes apache#10963 from gatorsmile/seletStarAfterColDrop.
Commit: 33c8a49
-
[SPARK-12705][SPARK-10777][SQL] Analyzer Rule ResolveSortReferences
JIRA: https://issues.apache.org/jira/browse/SPARK-12705

**Scope:** This PR is a general fix for sorting reference resolution when the child's `outputSet` does not have the order-by attributes (called *missing attributes*):
- UnaryNode support is limited to `Project`, `Window`, `Aggregate`, `Distinct`, `Filter`, `RepartitionByExpression`.
- We will not try to resolve the missing references inside a subquery, unless the outputSet of this subquery contains them.

**General Reference Resolution Rules:**
- Jump over nodes of the following types: `Distinct`, `Filter`, `RepartitionByExpression`. There is no need to add missing attributes, because their `outputSet` is decided by their `inputSet`, which is the `outputSet` of their children.
- Group-by expressions in `Aggregate`: missing order-by attributes are not allowed to be added to group-by expressions, since that would change the query result; thus, RDBMSs do not allow it either.
- Aggregate expressions in `Aggregate`: if the group-by expressions in `Aggregate` contain the missing attributes but the aggregate expressions do not, just add them to the aggregate expressions. This resolves the analysis exceptions thrown by three TPC-DS queries.
- `Project` and `Window` are special: we just need to add the missing attributes to their `projectList`.

**Implementation:**
1. Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes.
2. Traverse the whole tree in a post-order manner to add the found missing order-by attributes to a node if its `inputSet` contains them.
3. If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that come from the same node.

**Risk:** Low. This rule is triggered iff ```!s.resolved && child.resolved``` is true; thus, very few cases are affected.

Author: gatorsmile <gatorsmile@gmail.com> Closes apache#10678 from gatorsmile/sortWindows.
Commit: 8f26eb5
-
[DOCS] Fix the jar location of datanucleus in sql-programming-guide.md
ISTM `lib` is better because `datanucleus` jars are located in `lib` for release builds. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes apache#10901 from maropu/DocFix.
Commit: da9146c
-
[ML][MINOR] Invalid MulticlassClassification reference in ml-guide
In [ml-guide](https://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation), there is an invalid reference to the `MulticlassClassificationEvaluator` apidoc. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.MultiClassClassificationEvaluator Author: Lewuathe <lewuathe@me.com> Closes apache#10996 from Lewuathe/fix-typo-in-ml-guide.
Commit: 711ce04
-
[SPARK-12463][SPARK-12464][SPARK-12465][SPARK-10647][MESOS] Fix zookeeper dir with mesos conf and add docs
Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings. Author: Timothy Chen <tnachen@gmail.com> Closes apache#10057 from tnachen/fix_mesos_dir.
Commit: 51b03b7
-
[SPARK-12265][MESOS] Spark calls System.exit inside driver instead of throwing exception
This takes over apache#10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user. Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com> Author: Iulian Dragos <jaguarul@gmail.com> Closes apache#10921 from dragos/pr/10729.
Commit: a41b68b
-
[SPARK-12979][MESOS] Don’t resolve paths on the local file system in Mesos scheduler
The driver filesystem is likely different from where the executors will run, so resolving paths (and symlinks, etc.) will lead to invalid paths on executors. Author: Iulian Dragos <jaguarul@gmail.com> Closes apache#10923 from dragos/issue/canonical-paths.
Commit: c9b89a0
-
[SPARK-13043][SQL] Implement remaining catalyst types in ColumnarBatch.
This includes: float, boolean, short, decimal and calendar interval. Decimal is mapped to long or byte array depending on the size and calendar interval is mapped to a struct of int and long. The only remaining type is map. The schema mapping is straightforward but we might want to revisit how we deal with this in the rest of the execution engine. Author: Nong Li <nong@databricks.com> Closes apache#10961 from nongli/spark-13043.
Commit: 064b029
-
Fix for [SPARK-12854][SQL] Implement complex types support in ColumnarBatch
Fixes build for Scala 2.11. Author: Jacek Laskowski <jacek@japila.pl> Closes apache#10946 from jaceklaskowski/SPARK-12854-fix.
Commit: a2973fe
-
[SPARK-13078][SQL] API and test cases for internal catalog
This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper). I took a look at Hive's internal metastore interface/implementation, and then created this API based on it. I believe this is the minimal set needed in order to achieve all the needed functionality. Author: Reynold Xin <rxin@databricks.com> Closes apache#10982 from rxin/SPARK-13078.
Commit: be7a2fc
Commits on Feb 2, 2016
-
[SPARK-12637][CORE] Print stage info of finished stages properly
Improve printing of StageInfo in onStageCompleted. See also apache#10585. Author: Sean Owen <sowen@cloudera.com> Closes apache#10922 from srowen/SPARK-12637.
Commit: 715a19d
-
[SPARK-12790][CORE] Remove HistoryServer old multiple files format
Removed the isLegacyLogDirectory code path and updated tests. cc andrewor14 Author: felixcheung <felixcheung_m@hotmail.com> Closes apache#10860 from felixcheung/historyserverformat.
Commit: 0df3cfb
-
[SPARK-13130][SQL] Make codegen variable names easier to read
1. Use lower case.
2. Change long prefixes to something shorter (in this case I am changing only one: TungstenAggregate -> agg).
Author: Reynold Xin <rxin@databricks.com> Closes apache#11017 from rxin/SPARK-13130.
Commit: 0fff5c6
-
Commit: b8666fd
-
[SPARK-13087][SQL] Fix group by function for sort based aggregation
It is not valid to call `toAttribute` on a `NamedExpression` unless we know for sure that the child produced that `NamedExpression`. The current code worked fine when the grouping expressions were simple, but when they were a derived value this blew up at execution time. Author: Michael Armbrust <michael@databricks.com> Closes apache#11013 from marmbrus/groupByFunction-master.
Commit: 22ba213
-
[SPARK-10820][SQL] Support for the continuous execution of structured queries
This is a follow-up to 9aadcff that extends Spark SQL to allow users to _repeatedly_ optimize and execute structured queries. A `ContinuousQuery` can be expressed using SQL, DataFrames or Datasets. The purpose of this PR is only to add some initial infrastructure which will be extended in subsequent PRs.

## User-facing API
- `sqlContext.streamFrom` and `df.streamTo` return builder objects that are analogous to the `read/write` interfaces already available for executing queries in a batch-oriented fashion.
- `ContinuousQuery` provides an interface for interacting with a query that is currently executing in the background.

## Internal Interfaces
- `StreamExecution` - executes streaming queries in micro-batches

The following are currently internal, but public APIs will be provided in a future release.
- `Source` - an interface for providers of continually arriving data. A source must have a notion of an `Offset` that monotonically tracks what data has arrived. For fault tolerance, a source must be able to replay data given a start offset.
- `Sink` - an interface that accepts the results of a continuously executing query. Also responsible for tracking the offset that should be resumed from in the case of a failure.

## Testing
- `MemoryStream` and `MemorySink` - simple implementations of source and sink that keep all data in memory and have methods for simulating durability failures
- `StreamTest` - a framework for performing actions and checking invariants on a continuous query

Author: Michael Armbrust <michael@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Josh Rosen <rosenville@gmail.com> Closes apache#11006 from marmbrus/structured-streaming.
Commit: 12a20c1
-
[SPARK-13094][SQL] Add encoders for seq/array of primitives
Author: Michael Armbrust <michael@databricks.com> Closes apache#11014 from marmbrus/seqEncoders.
Commit: 29d9218
-
[SPARK-13114][SQL] Add a test for tokens more than the fields in schema
https://issues.apache.org/jira/browse/SPARK-13114 This PR adds a test for tokens more than the fields in schema. Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#11020 from HyukjinKwon/SPARK-13114.
Commit: b938301
-
[SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to consistent format
Part of the task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the clustering module. Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
Commit: cba1d6b
-
[SPARK-13056][SQL] map column would throw NPE if value is null
Jira: https://issues.apache.org/jira/browse/SPARK-13056 Create a map like `{"a": "somestring", "b": null}` and run a query like `SELECT col["b"] FROM t1`; an NPE would be thrown. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes apache#10964 from adrian-wang/npewriter.
Commit: 358300c
-
[SPARK-12711][ML] ML StopWordsRemover does not protect itself from column name duplication
Fixes the problem and verifies the fix with a test suite. Also adds an optional nullable (Boolean) parameter to SchemaUtils.appendColumn and deduplicates the SchemaUtils.appendColumn functions. Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com> Closes apache#10741 from grzegorz-chilkiewicz/master.
Commit: b1835d7
-
[SPARK-13138][SQL] Add "logical" package prefix for ddl.scala
ddl.scala is defined in the execution package, and yet its references to "UnaryNode" and "Command" are logical. This was fairly confusing when I was trying to understand the DDL code. Author: Reynold Xin <rxin@databricks.com> Closes apache#11021 from rxin/SPARK-13138.
Commit: 7f6e3ec
-
[SPARK-12913] [SQL] Improve performance of stat functions
As benchmarked and discussed here: https://github.com/apache/spark/pull/10786/files#r50038294, thanks to codegen the declarative aggregate function can be much faster than the imperative one. Author: Davies Liu <davies@databricks.com> Closes apache#10960 from davies/stddev.
Commit: be5dd88
-
[SPARK-13121][STREAMING] java mapWithState mishandles scala Option
Already merged into the 1.6 branch; this PR commits the same change to master. Author: Gabriele Nizzoli <mail@nizzoli.net> Closes apache#11028 from gabrielenizzoli/patch-1.
Commit: d0df2ca
-
[DOCS] Update StructType.scala
The example will throw an error like `<console>:20: error: not found: value StructType`. Need to add this line: `import org.apache.spark.sql.types._` Author: Kevin (Sangwoo) Kim <sangwookim.me@gmail.com> Closes apache#10141 from swkimme/patch-1.
Commit: b377b03
Commits on Feb 3, 2016
-
[SPARK-13150] [SQL] disable two flaky tests
Author: Davies Liu <davies@databricks.com> Closes apache#11037 from davies/disable_flaky.
Commit: 6de6a97
-
[SPARK-13020][SQL][TEST] fix random generator for map type
When we generate a map, we first randomly pick a length, then create a seq of key-value pairs with the expected length, and finally call `toMap`. However, `toMap` removes all duplicated keys, which makes the actual map size much smaller than expected. This PR fixes the problem by putting keys in a set first, to guarantee we have enough distinct keys to build a map of the expected length. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10930 from cloud-fan/random-generator.
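A sketch of the fix's idea, assuming plain Int keys for illustration:

```scala
import scala.util.Random

object RandomMapSketch {
  // Draw keys into a Set until we have exactly n distinct ones, so the final
  // toMap cannot shrink below the expected length by deduplicating keys.
  def randomMap(n: Int, rnd: Random): Map[Int, Int] = {
    val keys = scala.collection.mutable.Set.empty[Int]
    while (keys.size < n) keys += rnd.nextInt()
    keys.toSeq.map(k => k -> rnd.nextInt()).toMap
  }

  def main(args: Array[String]): Unit =
    println(randomMap(10, new Random(42)).size) // always 10
}
```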
Commit: 672032d
-
[SPARK-12992] [SQL] Update parquet reader to support more types when decoding to ColumnarBatch
This patch implements support for more types when doing the vectorized decode. There are a few more types remaining but they should be very straightforward after this. This code has a few copy-and-paste pieces but they are difficult to eliminate due to performance considerations. Specifically, this patch adds support for:
- String, Long, Byte types
- Dictionary encoding for those types
Author: Nong Li <nong@databricks.com> Closes apache#10908 from nongli/spark-12992.
Commit: 21112e8
-
[SPARK-13122] Fix race condition in MemoryStore.unrollSafely()
https://issues.apache.org/jira/browse/SPARK-13122 A race condition can occur in MemoryStore's unrollSafely() method if two threads that return the same value for currentTaskAttemptId() execute this method concurrently. This change makes the operation of reading the initial amount of unroll memory used, performing the unroll, and updating the associated memory maps atomic in order to avoid this race condition. Initial proposed fix wraps all of unrollSafely() in a memoryManager.synchronized { } block. A cleaner approach might be introduce a mechanism that synchronizes based on task attempt ID. An alternative option might be to track unroll/pending unroll memory based on block ID rather than task attempt ID. Author: Adam Budde <budde@amazon.com> Closes apache#11012 from budde/master.
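A heavily simplified, hypothetical model of the race and the chosen fix: the read-unroll-update sequence becomes one atomic section under a shared lock:

```scala
object UnrollRaceSketch {
  // Hypothetical shared bookkeeping, keyed by task attempt id.
  private val unrollMemoryByTask = scala.collection.mutable.Map[Long, Long]()
  private val memoryManager = new Object // lock shared by all tasks

  // The read-modify-write over the shared map must be atomic: two threads
  // reporting the same attempt id could otherwise interleave between the
  // read and the update and corrupt the accounting.
  def unrollSafely(taskAttemptId: Long, bytes: Long): Unit = memoryManager.synchronized {
    val before = unrollMemoryByTask.getOrElse(taskAttemptId, 0L) // read
    unrollMemoryByTask(taskAttemptId) = before + bytes           // update
  }

  def main(args: Array[String]): Unit = {
    val threads = (1 to 4).map(_ => new Thread(() => unrollSafely(0L, 100L)))
    threads.foreach(_.start()); threads.foreach(_.join())
    println(unrollMemoryByTask(0L)) // always 400 with the lock held
  }
}
```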
Adam Budde authored and Andrew Or committed Feb 3, 2016
Commit: ff71261
-
[SPARK-12951] [SQL] support spilling in generated aggregate
This PR adds spilling support for generated TungstenAggregate. If spilling happens, it's not that bad to fall back to the iterator-based sort-merge-aggregate (not generated). The changes will be covered by TungstenAggregationQueryWithControlledFallbackSuite. Author: Davies Liu <davies@databricks.com> Closes apache#10998 from davies/gen_spilling.
Commit: 99a6e3c
-
[SPARK-12732][ML] bug fix in linear regression train
Fixed the bug in linear regression train for the case when the target variable is constant. The two cases for `fitIntercept=true` or `fitIntercept=false` should be treated differently. Author: Imran Younus <iyounus@us.ibm.com> Closes apache#10702 from iyounus/SPARK-12732_bug_fix_in_linear_regression_train.
Commit: 0557146
-
[SPARK-7997][CORE] Add rpcEnv.awaitTermination() back to SparkEnv
`rpcEnv.awaitTermination()` was not added in apache#10854 because some Streaming Python tests hung forever. This patch fixes the hang and adds rpcEnv.awaitTermination() back to SparkEnv. Previously, the Streaming Kafka Python tests shut down the ZooKeeper server before stopping StreamingContext. Then, when stopping StreamingContext, KafkaReceiver may hang due to https://issues.apache.org/jira/browse/KAFKA-601; hence, some threads of RpcEnv's Dispatcher cannot exit and rpcEnv.awaitTermination hangs. The patch just changed the shutdown order to fix it. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11031 from zsxwing/awaitTermination.
Commit: 335f10e
-
[SPARK-13147] [SQL] improve readability of generated code
1. Try to avoid the suffix (unique id).
2. Remove the comment if there is no code generated.
3. Re-arrange the order of functions.
4. Drop the new line for inlined blocks.
Author: Davies Liu <davies@databricks.com> Closes apache#11032 from davies/better_suffix.
Commit: e86f8f6
-
[SPARK-12957][SQL] Initial support for constraint propagation in SparkSQL
Based on the semantics of a query, we can derive a number of data constraints on the output of each (logical or physical) operator. For instance, if a filter defines `'a > 10`, we know that the output data of this filter satisfies two constraints:
1. `'a > 10`
2. `isNotNull('a)`
This PR proposes a possible way of keeping track of these constraints and propagating them in the logical plan, which can then help us build more advanced optimizations (such as pruning redundant filters, optimizing joins, among others). We define constraints as a set of (implicitly conjunctive) expressions. For e.g., if a filter operator has constraints = `Set('a > 10, 'b < 100)`, it's implied that the outputs satisfy both individual constraints (i.e., `'a > 10` AND `'b < 100`). Design Document: https://docs.google.com/a/databricks.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit?usp=sharing Author: Sameer Agarwal <sameer@databricks.com> Closes apache#10844 from sameeragarwal/constraints.
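A toy model of the propagation rule for filters, using strings in place of Catalyst expressions:

```scala
object ConstraintSketch {
  // Constraints are a set of implicitly conjunctive predicates.
  type Constraints = Set[String]

  // A filter's output satisfies everything its child satisfies, plus its own
  // predicate, plus the implied non-null constraint on the filtered column.
  def filterConstraints(child: Constraints, predicate: String, column: String): Constraints =
    child + predicate + s"isNotNull($column)"

  def main(args: Array[String]): Unit =
    println(filterConstraints(Set.empty, "a > 10", "a")) // Set(a > 10, isNotNull(a))
}
```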
Commit: 138c300
-
[SPARK-12739][STREAMING] Details of batch in Streaming tab uses two Duration columns
I have clearly prefixed the two 'Duration' columns in the 'Details of Batch' Streaming tab as 'Output Op Duration' and 'Job Duration'. Author: Mario Briggs <mario.briggs@in.ibm.com> Author: mariobriggs <mariobriggs@in.ibm.com> Closes apache#11022 from mariobriggs/spark-12739.
Commit: e9eb248
-
[SPARK-12798] [SQL] generated BroadcastHashJoin
A row from the stream side could match multiple rows on the build side; the loop over these matched rows should not be interrupted when emitting a row, so we buffer the output rows in a linked list and check the termination condition in the producer loop (for example, Range or Aggregate). Author: Davies Liu <davies@databricks.com> Closes apache#10989 from davies/gen_join.
Commit: c4feec2
-
[SPARK-13157] [SQL] Support any kind of input for SQL commands.
The ```SparkSqlLexer``` currently swallows characters which have not been defined in the grammar. This causes problems with SQL commands, such as: ```add jar file:///tmp/ab/TestUDTF.jar```. In this example the `````` is swallowed. This PR adds an extra Lexer rule to handle such input, and makes a tiny modification to the ```ASTNode```. cc davies liancheng Author: Herman van Hovell <hvanhovell@questtec.nl> Closes apache#11052 from hvanhovell/SPARK-13157.
Commit: 9dd2741
-
[SPARK-3611][WEB UI] Show number of cores for each executor in application web UI
Added a Cores column in the Executors UI. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes apache#11039 from ajbozarth/spark3611.
Commit: 3221edd
Commits on Feb 4, 2016
-
[SPARK-13166][SQL] Remove DataStreamReader/Writer
They seem redundant and we can simply use DataFrameReader/Writer. The new usage looks like:

```scala
val df = sqlContext.read.stream("...")
val handle = df.write.stream("...")
handle.stop()
```

Author: Reynold Xin <rxin@databricks.com> Closes apache#11062 from rxin/SPARK-13166.
Commit: 915a753
-
[SPARK-13131] [SQL] Use best and average time in benchmark
Best time is more stable than average time; also added a column for nanoseconds per row (which could be used to estimate the contribution of each component in a query). Having best and average time together gives more information (we can see a kind of variance). Rate, time per row and relative are all calculated using best time. The result looks like this:

```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
rang/filter/sum:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
rang/filter/sum codegen=false      14332 / 16646         36.0          27.8       1.0X
rang/filter/sum codegen=true         845 /   940        620.0           1.6      17.0X
```

Author: Davies Liu <davies@databricks.com> Closes apache#11018 from davies/gen_bench.
Commit: de09145
-
[SPARK-13152][CORE] Fix task metrics deprecation warning
Make an internal non-deprecated version of incBytesRead and incRecordsRead so we don't have unnecessary deprecation warnings in our build. Right now incBytesRead and incRecordsRead are marked as deprecated and for internal use only. We should make private[spark] versions which are not deprecated and switch to those internally, so as to not clutter up the warning messages when building. cc andrewor14 who did the initial deprecation Author: Holden Karau <holden@us.ibm.com> Closes apache#11056 from holdenk/SPARK-13152-fix-task-metrics-deprecation-warnings.
Commit: a8e2ba7
-
[SPARK-13079][SQL] Extend and implement InMemoryCatalog
This is a step towards consolidating `SQLContext` and `HiveContext`. This patch extends the existing Catalog API added in apache#10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`. The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested. About 200 lines are test code. Author: Andrew Or <andrew@databricks.com> Closes apache#11069 from andrewor14/catalog.
Commit: a648311
-
[SPARK-12828][SQL] add natural join support
Jira: https://issues.apache.org/jira/browse/SPARK-12828 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes apache#10762 from adrian-wang/naturaljoin.
Commit: 0f81318
-
[ML][DOC] fix wrong api link in ml onevsrest
Minor fix for the API link in ml onevsrest. Author: Yuhao Yang <hhbyyh@gmail.com> Closes apache#11068 from hhbyyh/onevsrestDoc.
Commit: c2c956b
-
[SPARK-13113] [CORE] Remove unnecessary bit operation when decoding page number
JIRA: https://issues.apache.org/jira/browse/SPARK-13113 Since we shift the bits right anyway, the bitwise AND operation is unnecessary. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#11002 from viirya/improve-decodepagenumber.
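A sketch of the simplification; the 13/51-bit split is assumed here for illustration:

```scala
object PageDecodeSketch {
  val PAGE_NUMBER_BITS = 13
  val OFFSET_BITS = 64 - PAGE_NUMBER_BITS // 51

  // A logical right shift by 51 already discards the offset bits, so masking
  // the upper 13 bits first adds nothing.
  def decodePageNumber(pagePlusOffset: Long): Int =
    (pagePlusOffset >>> OFFSET_BITS).toInt

  def main(args: Array[String]): Unit = {
    val addr = (5L << OFFSET_BITS) | 12345L // page 5, offset 12345
    println(decodePageNumber(addr)) // 5
  }
}
```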
Commit: d390871
-
[SPARK-12828][SQL] Natural join follow-up
This is a small addendum to apache#10762 to make the code more robust against future changes. Author: Reynold Xin <rxin@databricks.com> Closes apache#11070 from rxin/SPARK-12828-natural-join.
Commit: dee801a
-
[SPARK-12330][MESOS] Fix mesos coarse mode cleanup
In the current implementation the mesos coarse scheduler does not wait for the mesos tasks to complete before ending the driver. This causes a race where the task has to finish cleaning up before the mesos driver terminates it with a SIGINT (and SIGKILL after 3 seconds if the SIGINT doesn't work). This PR causes the mesos coarse scheduler to wait for the mesos tasks to finish (with a timeout defined by `spark.mesos.coarse.shutdown.ms`) This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0) With this patch the mesos coarse scheduler terminates properly, the executors clean up, and the tasks are reported as `FINISHED` in the Mesos console (as opposed to `KILLED` in < 1.6 or `FAILED` in 1.6 and later) Author: Charles Allen <charles@allen-net.com> Closes apache#10319 from drcrallen/SPARK-12330.
Commit: 2eaeafe
-
[SPARK-13164][CORE] Replace deprecated synchronized buffer in core
Building with Scala 2.11 results in the warning: "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative." Investigation shows we are already using ConcurrentLinkedQueue in other locations, so switch our uses of SynchronizedBuffer to ConcurrentLinkedQueue. Author: Holden Karau <holden@us.ibm.com> Closes apache#11059 from holdenk/SPARK-13164-replace-deprecated-synchronized-buffer-in-core.
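A small sketch of the replacement pattern (names illustrative):

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

object QueueMigrationSketch {
  def main(args: Array[String]): Unit = {
    // Deprecated pattern being replaced:
    //   new ArrayBuffer[String] with SynchronizedBuffer[String]
    val events = new ConcurrentLinkedQueue[String]()
    events.add("taskStart")
    events.add("taskEnd")
    // Unlike the old implicit Seq view, this snapshot does not track later
    // updates, so convert only once writers are done.
    val snapshot: Seq[String] = events.asScala.toSeq
    println(snapshot)
  }
}
```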
Commit: 62a7c28
-
[SPARK-13162] Standalone mode does not respect initial executors
Currently the Master would always set an application's initial executor limit to infinity. If the user specified `spark.dynamicAllocation.initialExecutors`, the config would not take effect. This is similar to apache#11047 but for standalone mode. Author: Andrew Or <andrew@databricks.com> Closes apache#11054 from andrewor14/standalone-da-initial.
Andrew Or committed Feb 4, 2016
Commit: 4120bcb
-
[SPARK-13053][TEST] Unignore tests in InternalAccumulatorSuite
These were ignored because they are incorrectly written; they don't actually trigger stage retries, which is what the tests are testing. These tests are now rewritten to induce stage retries through fetch failures. Note: there were 2 tests before and now there's only 1. What happened? It turns out that the case where we only resubmit a subset of the original missing partitions is very difficult to simulate in tests without potentially introducing flakiness. This is because the `DAGScheduler` removes all map outputs associated with a given executor when this happens, and we will need multiple executors to trigger this case, and sometimes the scheduler still removes map outputs from all executors. Author: Andrew Or <andrew@databricks.com> Closes apache#10969 from andrewor14/unignore-accum-test.
Andrew Or committed Feb 4, 2016
Commit: 15205da
-
MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:
Closes apache#7971 (requested by yhuai)
Closes apache#8539 (requested by srowen)
Closes apache#8746 (requested by yhuai)
Closes apache#9288 (requested by andrewor14)
Closes apache#9321 (requested by andrewor14)
Closes apache#9935 (requested by JoshRosen)
Closes apache#10442 (requested by andrewor14)
Closes apache#10585 (requested by srowen)
Closes apache#10785 (requested by srowen)
Closes apache#10832 (requested by andrewor14)
Closes apache#10941 (requested by marmbrus)
Closes apache#11024 (requested by andrewor14)
Andrew Or committed Feb 4, 2016
Commit: 085f510
-
[SPARK-13168][SQL] Collapse adjacent repartition operators
Spark SQL should collapse adjacent `Repartition` operators and only keep the last one. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#11064 from JoshRosen/collapse-repartition.
Commit: 33212cb
-
[SPARK-12330][MESOS][HOTFIX] Rename timeout config
The config already describes time and accepts a general format that is not restricted to ms. This commit renames the internal config to use a format that's consistent in Spark.
Andrew Or committed Feb 4, 2016
Commit: c756bda
-
[SPARK-13079][SQL] InMemoryCatalog follow-ups
This patch incorporates review feedback from apache#11069, which is already merged. Author: Andrew Or <andrew@databricks.com> Closes apache#11080 from andrewor14/catalog-follow-ups.
Commit: bd38dd6
-
[SPARK-13195][STREAMING] Fix NoSuchElementException when a state is not set but timeoutThreshold is defined
Check the state's existence before calling get. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11081 from zsxwing/SPARK-13195.
Commit: 8e2f296
-
[HOTFIX] Fix style violation caused by c756bda
Andrew Or committed Feb 4, 2016
Commit: 7a4b37f
Commits on Feb 5, 2016
-
[SPARK-13052] waitingApps metric doesn't show the number of apps currently in the WAITING state
Author: Raafat Akkad <raafat.akkad@gmail.com> Closes apache#10959 from RaafatAkkad/master.
Commit: 6dbfc40
-
[SPARK-12850][SQL] Support Bucket Pruning (Predicate Pushdown for Bucketed Tables)
JIRA: https://issues.apache.org/jira/browse/SPARK-12850 This PR is to support bucket pruning when the predicates are `EqualTo`, `EqualNullSafe`, `IsNull`, `In`, and `InSet`. As in Hive, in this PR the bucket pruning works when the bucketing key has one and only one column. So far, I have not found a way to verify how many buckets are actually scanned; however, I did verify it while debugging. Could you provide a suggestion on how to do it properly? Thank you! cloud-fan yhuai rxin marmbrus BTW, we can add more cases to support complex predicates including `Or` and `And`. Please let me know if I should do it in this PR. Maybe we also need to add test cases to verify that bucket pruning works well for each data type. Author: gatorsmile <gatorsmile@gmail.com> Closes apache#10942 from gatorsmile/pruningBuckets.
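A toy sketch of the pruning idea; the hash below is assumed for illustration (Spark applies its own hashing to the bucketing column), but the selection logic is the same:

```scala
object BucketPruningSketch {
  def bucketFor(key: Int, numBuckets: Int): Int =
    ((key.hashCode % numBuckets) + numBuckets) % numBuckets // keep non-negative

  // For `WHERE bucketCol = literal`, only one bucket can contain matches.
  def bucketsToScan(literal: Int, numBuckets: Int): Set[Int] =
    Set(bucketFor(literal, numBuckets))

  def main(args: Array[String]): Unit =
    println(bucketsToScan(42, 8)) // scan 1 of 8 buckets instead of all 8
}
```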
Commit: e3c75c6
-
[SPARK-13208][CORE] Replace use of Pairs with Tuple2s
Another trivial deprecation fix for Scala 2.11 Author: Jakob Odersky <jakob@odersky.com> Closes apache#11089 from jodersky/SPARK-13208.
Commit: 352102e
-
[SPARK-13187][SQL] Add boolean/long/double options in DataFrameReader/Writer
This patch adds option functions for boolean, long, and double types. This makes it slightly easier for Spark users to specify options without turning them into strings. Using the JSON data source as an example. Before this patch:

```scala
sqlContext.read.option("primitivesAsString", "true").json("/path/to/json")
```

After this patch:

```scala
sqlContext.read.option("primitivesAsString", true).json("/path/to/json")
```

Author: Reynold Xin <rxin@databricks.com> Closes apache#11072 from rxin/SPARK-13187.
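A simplified stand-in for `DataFrameReader` showing how such typed overloads can simply stringify; a sketch, not the actual implementation:

```scala
object ReaderOptionSketch {
  class Reader {
    private val opts = scala.collection.mutable.Map[String, String]()
    def option(key: String, value: String): Reader = { opts(key) = value; this }
    def option(key: String, value: Boolean): Reader = option(key, value.toString) // added
    def option(key: String, value: Long): Reader = option(key, value.toString)    // added
    def option(key: String, value: Double): Reader = option(key, value.toString)  // added
    override def toString: String = opts.toString
  }

  def main(args: Array[String]): Unit =
    println(new Reader().option("primitivesAsString", true).option("samplingRatio", 0.5))
}
```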
Commit: 82d84ff
-
[SPARK-13166][SQL] Rename DataStreamReaderWriterSuite to DataFrameReaderWriterSuite
A follow-up PR for apache#11062 because it didn't rename the test suite. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11096 from zsxwing/rename.
Commit: 7b73f17
-
[SPARK-12939][SQL] migrate encoder resolution logic to Analyzer
https://issues.apache.org/jira/browse/SPARK-12939 Now we will catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it. Also update `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve the value deserializer (the `child.output` contains both grouping keys and values, which may mess things up if they have same-name attributes). End-to-end tests are added. Follow-ups:
* remove encoders from typed aggregate expression.
* completely remove resolve/bind in `ExpressionEncoder`
Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10852 from cloud-fan/bug.
Commit: 1ed354a
-
[SPARK-13214][DOCS] update dynamicAllocation documentation
Author: Bill Chambers <bill@databricks.com> Closes apache#11094 from anabranch/dynamic-docs.
Bill Chambers authored and Andrew Or committed Feb 5, 2016
Commit: 66e1383
-
[SPARK-13002][MESOS] Send initial request of executors for dyn allocation
Fix for [SPARK-13002](https://issues.apache.org/jira/browse/SPARK-13002) about the initial number of executors when running with dynamic allocation on Mesos. Instead of fixing it just for the Mesos case, the change is made in `ExecutorAllocationManager`. It already drives the number of executors running on Mesos, just not the initial value. The `None` and `Some(0)` are internal details of the computation of resources to be reserved, in the Mesos backend scheduler. `executorLimitOption` has to be initialized correctly, otherwise the Mesos backend scheduler will either create too many executors at launch, or not create any executors and not be able to recover from this state. Removed the 'special case' description in the doc; it was not totally accurate, and is not needed anymore. This doesn't fix the same problem visible with Spark standalone; there is no straightforward way to send the initial value in standalone mode. Somebody knowing this part of the YARN support should review this change. Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes apache#11047 from skyluc/issue/initial-dyn-alloc-2.
Luc Bourlier authored and Andrew Or committed Feb 5, 2016
Commit: 0bb5b73
-
[SPARK-13215] [SQL] remove fallback in codegen
Since we removed the configuration for codegen, we rely heavily on codegen (TungstenAggregate also requires the generated MutableProjection to update UnsafeRow), so we should remove the fallback, which could confuse users; see the discussion in SPARK-13116. Author: Davies Liu <davies@databricks.com> Closes apache#11097 from davies/remove_fallback.
Commit: 875f507
Commits on Feb 6, 2016
-
[SPARK-13171][CORE] Replace future calls with Future
Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11. Also works with 2.10 Author: Jakob Odersky <jakob@odersky.com> Closes apache#11085 from jodersky/SPARK-13171.
Commit: 6883a51
-
Commit: 4f28291
-
[SPARK-5865][API DOC] Add doc warnings for methods that return local data structures
rxin srowen I worked out a note message for the rdd.take function; please help to review. If it's fine, I can apply it to all the other functions later. Author: Tommy YU <tummyyu@163.com> Closes apache#10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
Commit: 81da3be
Commits on Feb 7, 2016
-
[SPARK-13132][MLLIB] cache standardization param value in LogisticRegression
Cache the value of the standardization Param in LogisticRegression, rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-Newton optimizer. Also, fix Param#toString to cache the stringified representation, rather than re-interpolating it on every call, so any other implementations that have similar repeated access patterns will see a benefit. This change improves training times for one of my test sets from ~7m30s to ~4m30s. Author: Gary King <gary@idibon.com> Closes apache#11027 from idigary/spark-13132-optimize-logistic-regression.
Commit: bc8890b
-
[SPARK-10963][STREAMING][KAFKA] make KafkaCluster public
Author: cody koeninger <cody@koeninger.org> Closes apache#9007 from koeninger/SPARK-10963.
Commit: 140ddef
Commits on Feb 8, 2016
-
[SPARK-12986][DOC] Fix pydoc warnings in mllib/regression.py
I have fixed the warnings by running "make html" under "python/docs/". They are caused by not having blank lines around indented paragraphs. Author: Nam Pham <phamducnam@gmail.com> Closes apache#11025 from nampham2/SPARK-12986.
Commit: edf4a0e
-
[SPARK-8964] [SQL] Use Exchange to perform shuffle in Limit
This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen.

At a high level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit` declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`.

In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic.

In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similarly to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it.

Author: Josh Rosen <joshrosen@databricks.com> Closes apache#7334 from JoshRosen/remove-copy-in-limit.
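A toy sketch of the rewrite, with stand-in node types rather than real SparkPlan operators:

```scala
object LimitPlanningSketch {
  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class LocalLimit(n: Int, child: Plan) extends Plan      // per-partition limit
  case class ExchangeToOnePartition(child: Plan) extends Plan  // shuffle to one partition
  case class GlobalLimit(n: Int, child: Plan) extends Plan     // final limit
  case class CollectLimit(n: Int, child: Plan) extends Plan    // take()-style fast path

  // A limit in the middle of a plan expands to local -> shuffle -> global;
  // a limit at the root keeps the old executeTake-style fast path.
  def planLimit(n: Int, child: Plan, isRoot: Boolean): Plan =
    if (isRoot) CollectLimit(n, child)
    else GlobalLimit(n, ExchangeToOnePartition(LocalLimit(n, child)))

  def main(args: Array[String]): Unit = {
    println(planLimit(100, Scan("t"), isRoot = false))
    println(planLimit(100, Scan("t"), isRoot = true))
  }
}
```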
Commit: 06f0df6
-
[SPARK-13101][SQL] nullability of array type element should not fail analysis of encoder
Nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatched nullability, we should pass analysis and add a runtime null check. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#11035 from cloud-fan/ignore-nullability.
Commit: 8e4d15f
-
[SPARK-13210][SQL] catch OOM when allocate memory and expand array
There is a bug when we try to grow the buffer: an OOM is wrongly ignored (the assert is also skipped by the JVM), and then we try to grow the array again; this triggers spilling, which frees the current page, so the record we just inserted becomes invalid. The root cause is that the JVM has less free memory than the MemoryManager thought, so it will OOM when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling. Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing). Author: Davies Liu <davies@databricks.com> Closes apache#11095 from davies/fix_expand.
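A hypothetical sketch of the catch-and-spill idea; the real code allocates pages through the memory manager rather than plain arrays:

```scala
object GrowArraySketch {
  // The JVM may have less free memory than the memory manager believes, so an
  // allocation can throw even when the bookkeeping says it should succeed.
  def allocate(bytes: Int, spill: () => Unit): Array[Byte] =
    try new Array[Byte](bytes)
    catch {
      case _: OutOfMemoryError =>
        spill()                // free memory by spilling before retrying
        new Array[Byte](bytes) // retry the allocation once
    }

  def main(args: Array[String]): Unit =
    println(allocate(1024, () => println("spilling...")).length) // 1024
}
```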
Commit: 37bc203
-
[SPARK-13095] [SQL] improve performance for broadcast join with dimension table
This PR improves the performance of broadcast joins with dimension tables, which are common in data warehouses. If the join key can fit in a long, we will use a special API `get(Long)` to get the rows from HashedRelation. If the HashedRelation has only unique keys, we will use a special API `getValue(Long)` or `getValue(InternalRow)`. If the keys fit within a long and are also dense, we will use an array of UnsafeRow instead of a hash map. TODO: will do cleanup. Author: Davies Liu <davies@databricks.com> Closes apache#11065 from davies/gen_dim.
Commit: ff0af0d
Commits on Feb 9, 2016
-
[SPARK-10620][SPARK-13054] Minor addendum to apache#10835
Additional changes to apache#10835, mainly related to style and visibility. This patch also adds back a few deprecated methods for backward compatibility. Author: Andrew Or <andrew@databricks.com> Closes apache#10958 from andrewor14/task-metrics-to-accums-followups.
Commit: eeaf45b
-
[SPARK-12992] [SQL] Support vectorized decoding in UnsafeRowParquetRecordReader
WIP: running tests. Code needs a bit of cleanup. This patch completes the vectorized decoding with the goal of passing the existing tests. There are still more patches needed to support the rest of the format spec, even just for flat schemas. This patch adds a new flag to enable the vectorized decoding. Tests were updated to try both modes where applicable. Once this is working well, we can remove the previous code path. Author: Nong Li <nong@databricks.com> Closes apache#11055 from nongli/spark-12992-2.
Commit: 3708d13
-
[SPARK-13176][CORE] Use native file linking instead of external process ln
Since Spark requires at least JRE 1.7, it is safe to use built-in java.nio.Files. Author: Jakob Odersky <jakob@odersky.com> Closes apache#11098 from jodersky/SPARK-13176.
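For reference, the built-in call that replaces shelling out to `ln` (a self-contained sketch):

```scala
import java.nio.file.{Files, Path}

object LinkSketch {
  def main(args: Array[String]): Unit = {
    val existing: Path = Files.createTempFile("spark-link-demo", ".txt")
    val link: Path = existing.resolveSibling("spark-link-demo-hardlink")
    Files.deleteIfExists(link)
    Files.createLink(link, existing) // hard link via java.nio, no subprocess
    println(Files.isSameFile(link, existing)) // true
  }
}
```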
Commit: f9307d8
-
[SPARK-13165][STREAMING] Replace deprecated synchronizedBuffer in streaming
Building with Scala 2.11 results in the warning: "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative." We already use ConcurrentLinkedQueue elsewhere, so let's replace it. Some notes about how behaviour differs, for reviewers: the Seq from a SynchronizedBuffer that was implicitly converted would continue to receive updates; however, when we do the same conversion explicitly on the ConcurrentLinkedQueue, this isn't the case. Hence some of the (internal & test) APIs are changed to pass an Iterable. toSeq is safe to use if there are no more updates. Author: Holden Karau <holden@us.ibm.com> Author: tedyu <yuzhihong@gmail.com> Closes apache#11067 from holdenk/SPARK-13165-replace-deprecated-synchronizedBuffer-in-streaming.
Commit: 159198e
-
[SPARK-13201][SPARK-13200] Deprecation warning cleanups: KMeans & MFDataGenerator
KMeans: Make a private non-deprecated version of the setRuns API so that we can call it from the Python API without deprecation warnings in our own build. Also use it internally when called from train, and add a logWarning for non-1 values. MFDataGenerator: Apparently we are calling round on an integer, which in Scala 2.11 now results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere. I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way. Author: Holden Karau <holden@us.ibm.com> Closes apache#11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
Configuration menu - View commit details
-
Copy full SHA for ce83fe9 - Browse repository at this point
Copy the full SHA ce83fe9View commit details -
[SPARK-13040][DOCS] Update JDBC deprecated SPARK_CLASSPATH documentation
Update the JDBC documentation based on http://stackoverflow.com/a/30947090/219530, since SPARK_CLASSPATH is deprecated. That is also how it actually worked: it didn't work with SPARK_CLASSPATH or --jars alone. This resolves https://issues.apache.org/jira/browse/SPARK-13040. Author: Sebastián Ramírez <tiangolo@gmail.com> Closes apache#10948 from tiangolo/patch-docs-jdbc.
Commit: c882ec5
[SPARK-13177][EXAMPLES] Update ActorWordCount example to not directly use the low-level linked list, as it is deprecated
Author: sachin aggarwal <different.sachin@gmail.com> Closes apache#11113 from agsachin/master.
Commit: d9ba4d2
[SPARK-13086][SHELL] Use the Scala REPL settings, to enable things like `-i file`
Now:
```
$ bin/spark-shell -i test.scala
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/01/29 17:37:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/29 17:37:39 INFO Main: Created spark context..
Spark context available as sc (master = local[*], app id = local-1454085459000).
16/01/29 17:37:39 INFO Main: Created sql context..
SQL context available as sqlContext.
Loading test.scala...
hello

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
```
Author: Iulian Dragos <jaguarul@gmail.com> Closes apache#10984 from dragos/issue/repl-eval-file.
Commit: e30121a
[SPARK-13170][STREAMING] Investigate replacing SynchronizedQueue as it is deprecated
Replace SynchronizedQueue with synchronized access to a plain Queue. Author: Sean Owen <sowen@cloudera.com> Closes apache#11111 from srowen/SPARK-13170.
Commit: 68ed363
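A minimal sketch of the replacement pattern:
```scala
import scala.collection.mutable.Queue

// Plain Queue guarded by explicit synchronization, replacing the
// deprecated SynchronizedQueue mixin.
val queue = new Queue[Int]()

def enqueue(x: Int): Unit = queue.synchronized { queue += x }

def dequeueOption(): Option[Int] = queue.synchronized {
  if (queue.nonEmpty) Some(queue.dequeue()) else None
}
```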
[SPARK-12807][YARN] Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
Patch to: 1. shade Jackson 2.x (core, databind, annotations) in the spark-yarn-shuffle JAR; 2. use maven-antrun to verify the JAR has the renamed classes. Being Maven-based, the verification phase may not kick in on an SBT/Jenkins build; it will on a `mvn install`. Author: Steve Loughran <stevel@hortonworks.com> Closes apache#10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.
Commit: 34d0b70
[SPARK-13189] Cleanup build references to Scala 2.10
Author: Luciano Resende <lresende@apache.org> Closes apache#11092 from lresende/SPARK-13189.
Commit: 2dbb916
[SPARK-12888] [SQL] [FOLLOW-UP] benchmark the new hash expression
Adds the benchmark results as comments. The codegen version is slower than the interpreted version for the `simple` case because of three reasons: 1. the codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153); 2. the codegen version writes the hash value to a row first and then reads it out (I tried to create a `GenerateHasher` that generates code returning the hash value directly and got about a 60% speed-up for the `simple` case; is it worth it?); 3. the row in the `simple` case has only one int field, so the runtime reflection may be removed by branch prediction, which makes the interpreted version faster. The `array` case is also slow for similar reasons, e.g. array elements are of the same type, so the interpreted version can probably get rid of runtime reflection via branch prediction. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10917 from cloud-fan/hash-benchmark.
Commit: 7fe4fe6
Commits on Feb 10, 2016
[SPARK-13245][CORE] Call shuffleMetrics methods only in one thread for ShuffleBlockFetcherIterator
Call shuffleMetrics's incRemoteBytesRead and incRemoteBlocksFetched when polling FetchResult from `results`, so that shuffleMetrics is always used from one thread. Also fix a race condition that could cause a memory leak. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11138 from zsxwing/SPARK-13245.
Commit: fae830d
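A conceptual sketch of the single-consumer pattern, with illustrative names rather than the actual ShuffleBlockFetcherIterator code:
```scala
import java.util.concurrent.LinkedBlockingQueue

// Metrics are updated in the single thread that polls results, not in the
// Netty callback threads that enqueue them, so the metric objects are
// never touched concurrently.
trait ShuffleReadMetrics {
  def incRemoteBytesRead(bytes: Long): Unit
  def incRemoteBlocksFetched(blocks: Int): Unit
}

case class FetchResult(blockId: String, size: Long)

def nextResult(results: LinkedBlockingQueue[FetchResult],
               metrics: ShuffleReadMetrics): FetchResult = {
  val r = results.take()             // consumer side: always the same thread
  metrics.incRemoteBytesRead(r.size)
  metrics.incRemoteBlocksFetched(1)
  r
}
```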
[SPARK-12950] [SQL] Improve lookup of BytesToBytesMap in aggregate
This PR improves the lookup of BytesToBytesMap by: 1. generating code to calculate the hash code of the grouping keys; 2. not using MemoryLocation, instead fetching the baseObject and offset for key and value directly (removing the indirection). Author: Davies Liu <davies@databricks.com> Closes apache#11010 from davies/gen_map.
Commit: 0e5ebac
[SPARK-10524][ML] Use the soft prediction to order categories' bins
JIRA: https://issues.apache.org/jira/browse/SPARK-10524 Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#8734 from viirya/dt-soft-centroids.
Commit: 9267bc6
[SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
Current plan:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```
This patch enables the plan below:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes apache#10427 from maropu/RemoveFilterInJdbcScan.
Commit: 6f710f9
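A simplified sketch of the `unhandledFilters` contract that this change implements for JDBCRelation (the per-filter logic here is illustrative; the real code decides per filter whether it can be compiled to a WHERE clause):
```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Report the filters the relation could NOT push down, so Spark only
// re-applies those on top of the scan instead of re-filtering everything.
def unhandledFilters(filters: Array[Filter]): Array[Filter] =
  filters.filterNot {
    case EqualTo(_, _) => true  // handled by the JDBC source
    case _             => false // everything else stays as a Spark Filter
  }
```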
[SPARK-13149][SQL] Add FileStreamSource
`FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers the files in each batch and stores them in metadata files so that they can be recovered on restart. The metadata files are stored in the file system; a further PR will clean them up periodically. This is based on the initial work from marmbrus. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#11034 from zsxwing/stream-df-file-source.
Commit: b385ce3
[SPARK-11565] Replace deprecated DigestUtils.shaHex call
Author: Gábor Lipták <gliptak@gmail.com> Closes apache#9532 from gliptak/SPARK-11565.
Commit: 9269036
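The fix is essentially a rename; a sketch, assuming commons-codec on the classpath (as it is for Spark):
```scala
import org.apache.commons.codec.digest.DigestUtils

// The deprecated call and its drop-in replacement produce the same digest:
// DigestUtils.shaHex("some input")   // deprecated
DigestUtils.sha1Hex("some input")     // same SHA-1 hex string
```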
[SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts
Author: Jon Maurer <tritab@gmail.com> Author: Jonathan Maurer <jmaurer@Jonathans-MacBook-Pro.local> Closes apache#10789 from tritab/cmd_updates.
Commit: 2ba9b6a
[SPARK-13203] Add scalastyle rule banning use of mutable.SynchronizedBuffer
andrewor14 Please take a look. Author: tedyu <yuzhihong@gmail.com> Closes apache#11134 from tedyu/master.
Commit: e834e42
[SPARK-9307][CORE][SPARK] Logging: Make it either stable or private
Make Logging private[spark]. Pretty much all there is to it. Author: Sean Owen <sowen@cloudera.com> Closes apache#11103 from srowen/SPARK-9307.
Commit: c0b71e0
[SPARK-5095][MESOS] Support launching multiple mesos executors in coarse-grained mesos mode
This is the next iteration of tnachen's previous PR: apache#4027. In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone. This PR implements that resolution. It implements two high-level features, which are co-dependent and so are implemented together: Mesos support for spark.executor.cores, and multiple executors per slave. We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite (https://github.com/typesafehub/mesos-spark-integration-tests), which passes for this PR. The contribution is my original work and I license the work to the project under the project's open source license. Author: Michael Gummelt <mgummelt@mesosphere.io> Closes apache#10993 from mgummelt/executor_sizing.
Michael Gummelt authored and Andrew Or committed on Feb 10, 2016
Commit: 80cb963
[SPARK-13254][SQL] Fix planning of TakeOrderedAndProject operator
The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / apache#7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators. This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy. In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path. /cc davies and marmbrus for review. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#11145 from JoshRosen/take-ordered-and-project-fix.
Commit: 5cf2059
[SPARK-13163][WEB UI] Column width on new History Server DataTables not getting set correctly
The column width for the new DataTables now adjusts for the current page rather than being hard-coded for the entire table's data. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes apache#11057 from ajbozarth/spark13163.
Commit: 39cc620
[SPARK-13126] fix the right margin of history page.
The right margin of the history page is a little bit off; a simple fix for that issue. Author: zhuol <zhuol@yahoo-inc.com> Closes apache#11029 from zhuoliu/13126.
zhuol authored and Tom Graves committed on Feb 10, 2016
Commit: 4b80026
Commit: ce3bdae
[SPARK-13057][SQL] Add benchmark code and performance results for implemented compression schemes for InMemoryRelation
This PR adds benchmark code for in-memory cache compression to make future development and discussion smoother. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes apache#10965 from maropu/ImproveColumnarCache.
Commit: 5947fa8
[SPARK-12414][CORE] Remove closure serializer
Remove the spark.closure.serializer option and always use JavaSerializer. CC andrewor14 rxin: I see there's a discussion in the JIRA, but I thought I'd offer this for a look at what the change would be. Author: Sean Owen <sowen@cloudera.com> Closes apache#11150 from srowen/SPARK-12414.
Commit: 29c5473
Commits on Feb 11, 2016
[SPARK-13146][SQL] Management API for continuous queries
### Management API for Continuous Queries
**API for getting status of each query**
- Whether active or not
- Unique name of each query
- Status of the sources and sinks
- Exceptions

**API for managing each query**
- Immediately stop an active query
- Waiting for a query to be terminated, correctly or with error

**API for managing multiple queries**
- Listing all active queries
- Getting an active query by name
- Waiting for any one of the active queries to be terminated

**API for listening to query life cycle events**
- ContinuousQueryListener API for query start, progress and termination events

Author: Tathagata Das <tathagata.das1565@gmail.com> Closes apache#11030 from tdas/streaming-df-management-api.
Commit: 0902e20
[SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator Author: raela <raela@databricks.com> Closes apache#11158 from raelawang/master.
Commit: 719973b
[SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions
Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`. This is OK for normal query execution, since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings, since expression IDs are erased. Here's an example Spark 1.6.0 snippet for illustration:
```scala
sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
```
The above code produces the following resolved plan:
```
== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
   +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
      +- Subquery t
         +- Project [id#46L AS a#47L,id#46L AS b#48L]
            +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
```
Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs. The solution is to automatically add the expression IDs into the attribute names for the Aliases and AttributeReferences that are generated by the Analyzer in SQL generation. This PR also resolves another issue: users could use the same names as the internally generated ones, and such duplicate names should not cause name ambiguity; when resolving a column, Catalyst should not pick the column that is internally generated. Could you review the solution? marmbrus liancheng I did not set the newly added flag for all the aliases and attribute references generated by Analyzers; please let me know if I should. Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes apache#11050 from gatorsmile/namingConflicts.
Commit: 663cc40
[SPARK-13205][SQL] SQL Generation Support for Self Join
This PR addresses two issues:
- Self join does not work in SQL generation.
- When creating new instances of `LogicalRelation`, `metastoreTableIdentifier` is lost.

liancheng Could you please review the code changes? Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes apache#11084 from gatorsmile/selfJoinInSQLGen.
Commit: 0f09f02
[SPARK-12706] [SQL] grouping() and grouping_id()
grouping() returns whether a column is aggregated or not; grouping_id() returns the aggregation levels. grouping()/grouping_id() can be used with window functions, but do not yet work in having/sort clauses; that will be fixed by another PR. The GROUPING__ID/grouping_id() in Hive is wrong (according to its docs), and we also did it wrongly; this PR changes that to match the behavior in most databases (and the Hive docs). Author: Davies Liu <davies@databricks.com> Closes apache#10677 from davies/grouping.
Commit: b5761d1
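As a hedged illustration of the semantics, assuming a `sqlContext` with a `courseSales` table registered (the table and column names are borrowed from Spark's test data; `WITH ROLLUP` is one way to produce multiple aggregation levels):
```scala
// grouping(col) is 0 when col participates in the current grouping level
// and 1 when it is rolled up; grouping_id(...) packs those bits together.
sqlContext.sql("""
  SELECT course, year,
         grouping(course), grouping(year), grouping_id(course, year),
         sum(earnings)
  FROM courseSales
  GROUP BY course, year WITH ROLLUP
""").show()
```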
[SPARK-13234] [SQL] remove duplicated SQL metrics
For lots of SQL operators we have metrics for both input and output, but the number of input rows is always exactly the number of output rows of the child, so we only need metrics for output rows. After the performance improvements from whole-stage codegen, the overhead of SQL metrics is no longer trivial, so we should avoid it where unnecessary. This PR removes all the SQL metrics for number of input rows, adds an SQL metric for number of output rows to all LeafNodes, and removes the SQL metrics from operators that have the same number of input and output rows (for example Projection, which may not need them). The new SQL UI will look like: ![metrics](https://cloud.githubusercontent.com/assets/40902/12965227/63614e5e-d009-11e5-88b3-84fea04f9c20.png) Author: Davies Liu <davies@databricks.com> Closes apache#11163 from davies/remove_metrics.
Commit: 8f744fe
[SPARK-13276][SQL] Catch bad characters at the end of a Table Identifier/Expression string
The parser currently parses the following strings without a hitch:
* Table Identifier:
  * `a.b.c` should fail, but results in the following table identifier: `a.b`
  * `table!#` should fail, but results in the following table identifier: `table`
* Expression:
  * `1+2 r+e` should fail, but results in the following expression: `1 + 2`

This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing. cc cloud-fan (we discussed this in apache#10649) jayadevanmurali (this causes your PR apache#11051 to fail) Author: Herman van Hovell <hvanhovell@questtec.nl> Closes apache#11159 from hvanhovell/SPARK-13276.
Commit: 1842c55
[SPARK-13235][SQL] Removed an Extra Distinct from the Plan when Using Union in SQL
Currently, the parser adds two `Distinct` operators to the plan when we use `Union` or `Union Distinct` in SQL. This PR removes the extra `Distinct` from the plan. For example, before the fix, the following query has a plan with two `Distinct`:
```scala
sql("select * from t0 union select * from t0").explain(true)
```
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_2
   +- 'Distinct
      +- 'Project [unresolvedalias(*,None)]
         +- 'Subquery u_1
            +- 'Distinct
               +- 'Union
                  :- 'Project [unresolvedalias(*,None)]
                  :  +- 'UnresolvedRelation `t0`, None
                  +- 'Project [unresolvedalias(*,None)]
                     +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Subquery u_2
   +- Distinct
      +- Project [id#16L]
         +- Subquery u_1
            +- Distinct
               +- Union
                  :- Project [id#16L]
                  :  +- Subquery t0
                  :     +- Relation[id#16L] ParquetRelation
                  +- Project [id#16L]
                     +- Subquery t0
                        +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#16L], [id#16L]
+- Aggregate [id#16L], [id#16L]
   +- Union
      :- Project [id#16L]
      :  +- Relation[id#16L] ParquetRelation
      +- Project [id#16L]
         +- Relation[id#16L] ParquetRelation
```
After the fix, the plan is changed and no longer has the extra `Distinct`:
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_1
   +- 'Distinct
      +- 'Union
         :- 'Project [unresolvedalias(*,None)]
         :  +- 'UnresolvedRelation `t0`, None
         +- 'Project [unresolvedalias(*,None)]
            +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#17L]
+- Subquery u_1
   +- Distinct
      +- Union
         :- Project [id#16L]
         :  +- Subquery t0
         :     +- Relation[id#16L] ParquetRelation
         +- Project [id#16L]
            +- Subquery t0
               +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#17L], [id#17L]
+- Union
   :- Project [id#16L]
   :  +- Relation[id#16L] ParquetRelation
   +- Project [id#16L]
      +- Relation[id#16L] ParquetRelation
```
Author: gatorsmile <gatorsmile@gmail.com> Closes apache#11120 from gatorsmile/unionDistinct.
Commit: e88bff1
[SPARK-13270][SQL] Remove extra new lines in whole stage codegen and include pipeline plan in comments
Author: Nong Li <nong@databricks.com> Closes apache#11155 from nongli/spark-13270.
Commit: 18bcbbd
[SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template
spark-env.sh.template contains multi-byte characters; this PR removes them. Author: Sasaki Toru <sasakitoa@nttdata.co.jp> Closes apache#11149 from sasakitoa/remove_multibyte_in_sparkenv.
Commit: c2f21d8
[SPARK-13074][CORE] Add JavaSparkContext.getPersistentRDDs method
getPersistentRDDs() is a useful SparkContext API for getting cached RDDs, but JavaSparkContext does not have it. This adds a simple getPersistentRDDs() that returns a java.util.Map<Integer, JavaRDD> for Java users. Author: Junyang <fly.shenjy@gmail.com> Closes apache#10978 from flyjy/master.
Commit: f9ae99f
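A hedged usage sketch from Scala, assuming an existing `JavaSparkContext` (the exact return-type details are as described in the commit message above):
```scala
import org.apache.spark.api.java.JavaSparkContext

// Cache an RDD, then look it up through the new Java-friendly accessor.
def cachedRddCount(jsc: JavaSparkContext): Int = {
  val rdd = jsc.parallelize(java.util.Arrays.asList(1, 2, 3)).cache()
  rdd.count()
  // java.util.Map[Integer, JavaRDD[_]] keyed by RDD id:
  jsc.getPersistentRDDs.size()
}
```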
[SPARK-13124][WEB UI] Fixed CSS and JS issues caused by addition of jQuery DataTables
Made sure the old tables continue to use the old CSS and the new DataTables use the new CSS. Also fixed it so the Safari Web Inspector doesn't throw errors on the new DataTables pages. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes apache#11038 from ajbozarth/spark13124.
Commit: 13c17cb
[STREAMING][TEST] Fix flaky streaming.FailureSuite
Under some corner cases, the test suite failed to shut down the SparkContext, causing cascaded failures. This fix does two things:
- Makes sure no SparkContext is active after every test
- Makes sure StreamingContext is always shut down (preventing leaking of StreamingContexts as well, just in case)

Author: Tathagata Das <tathagata.das1565@gmail.com> Closes apache#11166 from tdas/fix-failuresuite.
Commit: 219a74a
[SPARK-13277][SQL] ANTLR ignores other rule using the USING keyword
JIRA: https://issues.apache.org/jira/browse/SPARK-13277
There is an ANTLR warning during compilation:
```
warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7:
Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
```
This patch fixes it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#11168 from viirya/fix-parser-using.
Commit: e31c807
[SPARK-12982][SQL] Add table name validation in temp table registration
Adds table name validation at temp table creation. Author: jayadevanmurali <jayadevan.m@tcs.com> Closes apache#11051 from jayadevanmurali/branch-0.2-SPARK-12982.
Commit: 0d50a22
[SPARK-13279] Remove O(n^2) operation from scheduler.
This commit removes an unnecessary duplicate check in addPendingTask that meant that scheduling a task set took time proportional to (# tasks)^2. Author: Sital Kedia <skedia@fb.com> Closes apache#11167 from sitalkedia/fix_stuck_driver and squashes the following commits: 3fe1af8 [Sital Kedia] [SPARK-13279] Remove unnecessary duplicate check in addPendingTask function
Commit: 50fa6fd
Revert "[SPARK-13279] Remove O(n^2) operation from scheduler."
This reverts commit 50fa6fd.
Commit: c86009c
[SPARK-13265][ML] Refactoring of basic ML import/export for other file systems besides HDFS
jkbradley I tried to improve the model-export function. Under Spark 1.6 we couldn't export a model to S3, so it should support S3 besides HDFS. Can you review it when you have time? Thanks! Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes apache#11151 from yu-iskw/SPARK-13265.
Commit: efb65e0
[SPARK-11515][ML] QuantileDiscretizer should take random seed
cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes apache#9535 from yu-iskw/SPARK-11515.
Commit: 574571c
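A hedged usage sketch (the column names are made up): with a seed parameter, the sampling-based bin computation becomes reproducible across runs.
```scala
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBucket")
  .setNumBuckets(4)
  .setSeed(42L)   // the parameter added by this change
// val bucketed = discretizer.fit(df).transform(df)
```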
[SPARK-13037][ML][PYSPARK] PySpark ml.recommendation support export/import
PySpark ml.recommendation now supports export/import. Author: Kai Jiang <jiangkai@gmail.com> Closes apache#11044 from vectorijk/spark-13037.
Commit: c8f667d
[MINOR][ML][PYSPARK] Cleanup test cases of clustering.py
Test cases should be removed from the docstrings of the ```setXXX``` functions, otherwise they become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode). cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10975 from yanboliang/clustering-cleanup.
Commit: 2426eb3
[SPARK-13035][ML][PYSPARK] PySpark ml.clustering support export/import
PySpark ml.clustering support export/import. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10999 from yanboliang/spark-13035.
Commit: 30e0095
Commits on Feb 12, 2016
[SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw an error
The Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality. In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false

cc holdenk Author: sethah <seth.hendrickson16@gmail.com> Closes apache#10962 from sethah/SPARK-13047.
Commit: b354673
[SPARK-12765][ML][COUNTVECTORIZER] Fix CountVectorizer.transform's lost transformSchema
https://issues.apache.org/jira/browse/SPARK-12765 Author: Liu Xiang <lxmtlab@gmail.com> Closes apache#10720 from sloth2012/sloth.
Commit: a525704
[SPARK-12915][SQL] Add SQL metrics of numOutputRows for whole stage codegen
This PR adds SQL metrics (numOutputRows) for generated operators (same as non-generated); the cost is about 0.2 nanoseconds per row. <img width="806" alt="gen metrics" src="https://cloud.githubusercontent.com/assets/40902/12994694/47f5881e-d0d7-11e5-9d47-78229f559ab0.png"> Author: Davies Liu <davies@databricks.com> Closes apache#11170 from davies/gen_metric.
Commit: b10af5e
[SPARK-13277][BUILD] Follow-up: ANTLR warnings are treated as build errors
It is possible to create faulty but legal ANTLR grammars: ANTLR will produce warnings but also a valid, compilable parser. This PR makes sure we treat such warnings as build errors. cc rxin / viirya Author: Herman van Hovell <hvanhovell@questtec.nl> Closes apache#11174 from hvanhovell/ANTLR-warnings-as-errors.
Commit: 8121a4b
[SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false)
https://issues.apache.org/jira/browse/SPARK-12746 Author: Earthson Lu <Earthson.Lu@gmail.com> Closes apache#10697 from Earthson/SPARK-12746.
Commit: 5f1c359
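A small sketch of the relaxed check, using a hypothetical helper rather than the actual Spark matcher: a schema that merely allows nulls should also accept data whose elements are declared non-nullable.
```scala
import org.apache.spark.sql.types.{ArrayType, DoubleType}

// Illustrative compatibility check, not the real Spark implementation.
def acceptsArray(expected: ArrayType, actual: ArrayType): Boolean =
  expected.elementType == actual.elementType &&
    (expected.containsNull || !actual.containsNull)

acceptsArray(ArrayType(DoubleType, containsNull = true),
             ArrayType(DoubleType, containsNull = false)) // true after the fix
```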
[SPARK-13153][PYSPARK] ML persistence failed when handling a parameter with no default value
Fix this defect by checking whether a default value exists. yanboliang Please help to review. Author: Tommy YU <tummyyu@163.com> Closes apache#11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
Commit: d3e2e20
[SPARK-7889][WEBUI] HistoryServer updates UI for incomplete apps
When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger *filesize*. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI. https://issues.apache.org/jira/browse/SPARK-7889 Author: Steve Loughran <stevel@hortonworks.com> Author: Imran Rashid <irashid@cloudera.com> Closes apache#11118 from squito/SPARK-7889-alternate.
Commit: a2c7dcf
[SPARK-6166] Limit number of in flight outbound requests
This JIRA is related to apache#5852. Had to do some minor rework and testing to make sure it works with the current version of Spark. Author: Sanket <schintap@untilservice-lm> Closes apache#10838 from redsanket/limit-outbound-connections.
Commit: 894921d
[SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means
Add Python API for spark.ml bisecting k-means. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10889 from yanboliang/spark-12974.
Commit: a183dda
[SPARK-13154][PYTHON] Add linting for pydocs
We should have lint rules using Sphinx to automatically catch the pydoc issues that are sometimes introduced. Right now ./dev/lint-python will skip building the docs if Sphinx isn't present, but it might make sense to fail hard; it's just a matter of whether we want to insist that all PySpark developers have Sphinx present. Author: Holden Karau <holden@us.ibm.com> Closes apache#11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
Commit: 64515e5
[SPARK-12705] [SQL] push missing attributes for Sort
The current implementation of ResolveSortReferences can only push one missing attribute into its child, so it failed to analyze TPCDS Q98, which has two missing attributes (one from Window, another from Aggregate). Author: Davies Liu <davies@databricks.com> Closes apache#11153 from davies/resolve_sort.
Commit: 5b805df
[SPARK-13282][SQL] LogicalPlan toSql should just return a String
Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for comprehensions everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not). Author: Reynold Xin <rxin@databricks.com> Closes apache#11171 from rxin/SPARK-13282.
Commit: c4d5ad8
[SPARK-13260][SQL] count(*) does not work with CSV data source
https://issues.apache.org/jira/browse/SPARK-13260 This is a quick fix for `count(*)`. When `requiredColumns` is empty, the datasource currently returns `sqlContext.sparkContext.emptyRDD[Row]`, which loses the count. Just like the JSON datasource, this PR lets the CSV datasource count the rows without parsing each set of tokens. Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#11169 from HyukjinKwon/SPARK-13260.
Commit: ac7d6af
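A simplified sketch of the idea, with a hypothetical `buildScan` helper (the real CSV datasource code is more involved):
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// With no required columns, still emit one (empty) row per input record so
// count(*) is preserved, without tokenizing each line.
def buildScan(requiredColumns: Array[String], lines: RDD[String]): RDD[Row] =
  if (requiredColumns.isEmpty) {
    lines.map(_ => Row.empty)
  } else {
    lines.map(line => Row.fromSeq(line.split(",").toSeq)) // naive parse, for the sketch only
  }
```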
[SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop
PySpark now supports ```covar_samp``` and ```covar_pop```. cc rxin davies marmbrus Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10876 from yanboliang/spark-12962.
Commit: 90de6b2
[SPARK-12630][PYSPARK][DOC] PySpark classification parameter desc to consistent format
Part of the task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module. Author: vijaykiran <mail@vijaykiran.com> Author: Bryan Cutler <cutlerb@gmail.com> Closes apache#11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
Commit: 42d6568
[SPARK-5095] Fix style in mesos coarse grained scheduler code
andrewor14 This addressed your style comments from apache#10993 Author: Michael Gummelt <mgummelt@mesosphere.io> Closes apache#11187 from mgummelt/fix_mesos_style.
Michael Gummelt authored and Andrew Or committed on Feb 12, 2016
Commit: 38bc601
[SPARK-5095] remove flaky test
Overrode the start() method, which was previously starting a thread causing a race condition. I believe this should fix the flaky test. Author: Michael Gummelt <mgummelt@mesosphere.io> Closes apache#11164 from mgummelt/fix_mesos_tests.
Michael Gummelt authored and Andrew Or committed on Feb 12, 2016
Commit: 62b1c07
Commits on Feb 13, 2016
[SPARK-13293][SQL] generate Expand
Expand suffers from creating the UnsafeRow from the same input multiple times; with codegen, it only needs to copy some of the columns. After this, we can see a 3X improvement (from 43 seconds to 13 seconds) on a TPCDS query (Q67) that has eight columns in a Rollup. Ideally, we could mask some of the columns based on a bitmask; I'd leave that for the future, because currently Aggregation (50 ns) is much slower than just copying the variables (1-2 ns). Author: Davies Liu <davies@databricks.com> Closes apache#11177 from davies/gen_expand.
Commit: 2228f07
[SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft Windows
Due to being on a Windows platform, I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK. Is it worth considering also including this fix in any future 1.5.x releases (if any)? I confirm this is my own original work and license it to the Spark project under its open source license. Author: markpavey <mark.pavey@thefilter.com> Closes apache#11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.
Commit: 374c4b2
[SPARK-12363][MLLIB] Remove setRuns and fix PowerIterationClustering failed test
JIRA: https://issues.apache.org/jira/browse/SPARK-12363 This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0; setting `TripletFields.All` in `mapTriplets` makes it work. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes apache#10539 from viirya/fix-poweriter.
Commit: e3441e3
Commits on Feb 14, 2016
Commit: 610196f
[SPARK-13172][CORE][SQL] Stop using RichException.getStackTrace, as it is deprecated
Replace `getStackTraceString` with `Utils.exceptionString`. Author: Sean Owen <sowen@cloudera.com> Closes apache#11182 from srowen/SPARK-13172.
Commit: 388cd9e
[SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions.
This pull request has the following changes:
1. Moved UserDefinedFunction into the expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
2. Moved UserDefinedPythonFunction into the execution.python package, so we don't have a random private class in the top-level sql package.
3. Moved everything in execution/python.scala into the newly created execution.python package.

Most of the diffs are just straight copy-paste. Author: Reynold Xin <rxin@databricks.com> Closes apache#11181 from rxin/SPARK-13296.
Commit: 354d4c2
[SPARK-13300][DOCUMENTATION] Added pygments.rb dependency
Looks like the pygments.rb gem is also required for the jekyll build to work; at least on Ubuntu/RHEL I could not build without this dependency, so I added it to the steps. Author: Amit Dev <amitdev@gmail.com> Closes apache#11180 from amitdev/master.
Commit: 331293c
[SPARK-13278][CORE] Launcher fails to start with JDK 9 EA
See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme. Author: Claes Redestad <claes.redestad@gmail.com> Closes apache#11160 from cl4es/master.
Commit: 22e9723
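A hedged sketch of the parsing concern (an illustrative helper, not the actual launcher code): pre-JEP-223 code assumed "1.x" version strings, while JDK 9 EA reports values like "9-ea" or "9".
```scala
// Tolerant major-version parse covering both schemes.
def majorJavaVersion(spec: String): Int = {
  val v = spec.takeWhile(c => c.isDigit || c == '.')
  val parts = v.split('.')
  if (parts.head == "1") parts(1).toInt // "1.8.0_45" -> 8
  else parts.head.toInt                 // "9-ea" -> 9
}

majorJavaVersion("1.8.0_45") // 8
majorJavaVersion("9-ea")     // 9
```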
Commits on Feb 15, 2016
[SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.stringToDate method to improve performance
The java `Calendar` object is expensive to create. I have a subquery like `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0))`. The table stores `visitDate` as String type and has 3 billion records. A `Calendar` object is created every time `DateTimeUtils.stringToDate` is called. By reusing the `Calendar` object, I saw about a 20-second performance improvement for this stage. Author: Carson Wang <carson.wang@intel.com> Closes apache#11090 from carsonwang/SPARK-13185.
Commit: 7cb4d74
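A minimal sketch of the reuse pattern, with made-up helper names (the actual `DateTimeUtils` internals differ):
```scala
import java.util.{Calendar, TimeZone}

// Reuse one Calendar per thread instead of allocating one per call.
val localCalendar = new ThreadLocal[Calendar] {
  override def initialValue(): Calendar =
    Calendar.getInstance(TimeZone.getTimeZone("GMT"))
}

def daysSinceEpoch(year: Int, month: Int, day: Int): Int = {
  val c = localCalendar.get()
  c.clear()                    // reset any state left from the previous use
  c.set(year, month - 1, day)  // Calendar months are zero-based
  (c.getTimeInMillis / (24L * 60 * 60 * 1000)).toInt
}
```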
[SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN
This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:
- If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
- If a limit is on top of an `OUTER JOIN`, then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it, and if neither input is limited then we will limit the input which is estimated to be larger.

These optimizations were proposed previously by gatorsmile in apache#10451 and apache#10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits, so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In apache#7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting. When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from apache#10451; see that patch for additional discussion. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#11121 from JoshRosen/limit-pushdown-2.
Commit: a8bbc4f
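A hedged sketch of what the rule does, assuming two existing DataFrames `df1` and `df2` (the rewrite shown in comments is conceptual, not the actual rule code):
```scala
// Conceptual shape of the UNION ALL case:
//   Limit(n, Union(a, b)) => Limit(n, Union(LocalLimit(n, a), LocalLimit(n, b)))
// In the 1.6-era DataFrame API this corresponds to:
val limited = df1.unionAll(df2).limit(10)
// After optimization, each union child computes at most 10 rows per
// partition locally before the single global limit is applied.
```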
[SPARK-12995][GRAPHX] Remove deprecated APIs from Pregel
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes apache#10918 from maropu/RemoveDeprecateInPregel.
Commit: 56d4939
[SPARK-13312][MLLIB] Update java train-validation-split example in ml-guide
Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312. This contribution is my original work and I license the work to this project. Author: JeremyNixon <jnixon2@gmail.com> Closes apache#11199 from JeremyNixon/update_train_val_split_example.
Commit: adb5483
Commits on Feb 16, 2016
[SPARK-13097][ML] Binarizer allowing Double AND Vector input types
This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type. A use case for this enhancement is for when a user wants to Binarize many similar feature columns at once using the same threshold value (for example a binary threshold applied to many pixels in an image). This contribution is my original work and I license the work to the project under the project's open source license. viirya mengxr Author: seddonm1 <seddonm1@gmail.com> Closes apache#10976 from seddonm1/master.
Commit: cbeb006
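A hedged usage sketch (the column names are made up): the same threshold applied to a whole Vector column at once, now that Vector input is accepted.
```scala
import org.apache.spark.ml.feature.Binarizer

val binarizer = new Binarizer()
  .setInputCol("pixels")        // a Vector column, accepted after this change
  .setOutputCol("binaryPixels")
  .setThreshold(0.5)
// val result = binarizer.transform(imagesDF)
```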
[SPARK-13018][DOCS] Replace example code in mllib-pmml-model-export.md using include_example
https://issues.apache.org/jira/browse/SPARK-13018 The example code in the user guide is embedded in the markdown and hence not easy to test; it would be nice to test it automatically. This JIRA is to discuss options to automate example-code testing and see what we can do in Spark 1.6. The goal is to move the actual example code to spark/examples and test compilation in Jenkins builds; then, in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g. called include_example. `{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}` Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala`, pick the code blocks marked "example", and replace the code block in `{% highlight %}` in the markdown. See more sub-tasks in the parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 Author: Xin Ren <iamshrek@126.com> Closes apache#11126 from keypointt/SPARK-13018.
Commit: e4675c2
[SPARK-13221] [SQL] Fixing GroupingSets when Aggregate Functions Containing GroupBy Columns
Using GroupingSets will generate a wrong result when aggregate functions contain GroupBy columns. This PR fixes it. Since the code changes are very small, maybe we can also merge it to 1.6. For example, the following query returns a wrong result:
```scala
sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
  " grouping sets((), (course), (course, earnings))" +
  " order by course, sum").show()
```
Before the fix, the results are like:
```
[null,null]
[Java,null]
[Java,20000.0]
[Java,30000.0]
[dotNET,null]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
```
After the fix, the results become correct:
```
[null,113000.0]
[Java,20000.0]
[Java,30000.0]
[Java,50000.0]
[dotNET,5000.0]
[dotNET,10000.0]
[dotNET,48000.0]
[dotNET,63000.0]
```
UPDATE: This PR also deprecated the external column GROUPING__ID. Author: gatorsmile <gatorsmile@gmail.com> Closes apache#11100 from gatorsmile/groupingSets.
Commit: fee739f
Correct SparseVector.parse documentation
There's a small error in the SparseVector.parse docstring: it says the method returns a DenseVector rather than a SparseVector. Author: Miles Yucht <miles@databricks.com> Closes apache#11213 from mgyucht/fix-sparsevector-docs.
Commit: 827ed1c
[SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general
This documents the implementation of ALS in `spark.ml` with example code in Scala, Java and Python. Author: BenFradet <benjamin.fradet@gmail.com> Closes apache#10411 from BenFradet/SPARK-12247.
Commit: 00c72d2
[SPARK-12976][SQL] Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange
Add `LazilyGenerateOrdering` to support generated ordering for the `RangePartitioner` of `Exchange`, instead of `InterpretedOrdering`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes apache#10894 from ueshin/issues/SPARK-12976.
Commit: 19dc69d
[SPARK-13280][STREAMING] Use a better logger name for FileBasedWriteAheadLog
The new logger name is under the org.apache.spark namespace. The detection of the caller name was also enhanced a bit to ignore some common things that show up in the call stack. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#11165 from vanzin/SPARK-13280.
Marcelo Vanzin committed on Feb 16, 2016
Commit: c7d00a2
[SPARK-13308] ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases
ManagedBuffers that are passed to `OneToOneStreamManager.registerStream` need to be freed by the manager once it's done using them. However, the current code only frees them in certain error cases and not during typical operation. This isn't a major problem today, but it will cause memory leaks after we implement better locking / pinning in the BlockManager (see apache#10705). This patch modifies the relevant network code so that the ManagedBuffers are freed as soon as the messages containing them are processed by the lower-level Netty message-sending code. /cc zsxwing for review. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#11193 from JoshRosen/add-missing-release-calls-in-network-layer.
Commit: 5f37aad