Update #1

Merged: 822 commits merged on Feb 16, 2016
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Jan 23, 2016

  1. [SPARK-11137][STREAMING] Make StreamingContext.stop() exception-safe

    Make StreamingContext.stop() exception-safe
    
    Author: jayadevanmurali <jayadevan.m@tcs.com>
    
    Closes apache#10807 from jayadevanmurali/branch-0.1-SPARK-11137.
    jayadevanmurali authored and srowen committed Jan 23, 2016
    Full SHA: 5f56980
  2. [SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons

    This pull request implements strength reduction for comparing integral expressions and decimal literals, which is more common now because we switched to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule, with some refactoring to simplify the control flow. I also moved the DecimalPrecision rule into its own file due to its growing size.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#10882 from rxin/SPARK-12904-1.
    rxin committed Jan 23, 2016
    Full SHA: 423783a
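
    A minimal sketch of the strength-reduction idea described above, using plain Scala values rather than Catalyst expressions; the helper names are made up for illustration and cover only non-integral literals:

    ```scala
    // Comparing an Int to a non-integral fractional literal can stay in the
    // integral domain: for any Int i,  i > 1.5  <=>  i >= 2  (ceil of the literal)
    // and  i < 1.5  <=>  i <= 1  (floor of the literal), which avoids promoting
    // every row's value to a decimal at comparison time.
    object StrengthReductionSketch extends App {
      def gtNonIntegralLit(i: Int, lit: Double): Boolean = i >= math.ceil(lit).toInt
      def ltNonIntegralLit(i: Int, lit: Double): Boolean = i <= math.floor(lit).toInt

      assert((-5 to 5).forall(i => (i > 1.5) == gtNonIntegralLit(i, 1.5)))
      assert((-5 to 5).forall(i => (i < 1.5) == ltNonIntegralLit(i, 1.5)))
    }
    ```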
  3. [STREAMING][MINOR] Scaladoc + logs

    Found while doing code review
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes apache#10878 from jaceklaskowski/streaming-scaladoc-logs-tiny-fixes.
    jaceklaskowski authored and rxin committed Jan 23, 2016
    Full SHA: cfdcef7

Commits on Jan 24, 2016

  1. [SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT build

    ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive).
    
    This patch attempts to improve the isolation of these tests in order to address this issue.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.
    JoshRosen committed Jan 24, 2016
    Full SHA: f400460
  2. [SPARK-10498][TOOLS][BUILD] Add requirements.txt file for dev python tools
    
    Minor since so few people use them, but it would probably be good to have a requirements file for our python release tools for easier setup (also version pinning).
    
    cc JoshRosen who looked at the original JIRA.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#10871 from holdenk/SPARK-10498-add-requirements-file-for-dev-python-tools.
    holdenk authored and JoshRosen committed Jan 24, 2016
    Full SHA: a834001
  3. [SPARK-12120][PYSPARK] Improve exception message when failing to initialize HiveContext in PySpark
    
    davies Mind reviewing?
    
    This is the error message after this PR
    
    ```
    15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
    /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
      warnings.warn("You must build Spark with Hive. "
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
        return DataFrameReader(self)
      File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
        self._jreader = sqlContext._ssql_ctx.read()
      File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
        raise e
    py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
    : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
    	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
    	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
    	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
    	at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
    	at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
    	at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
    	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
    	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
    	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    	at py4j.Gateway.invoke(Gateway.java:214)
    	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    	at py4j.GatewayConnection.run(GatewayConnection.java:209)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes apache#10126 from zjffdu/SPARK-12120.
    zjffdu authored and JoshRosen committed Jan 24, 2016
    Full SHA: e789b1d

Commits on Jan 25, 2016

  1. [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows
    
    When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10886 from liancheng/spark-12624.
    liancheng authored and yhuai committed Jan 25, 2016
    Full SHA: 3327fd2
  2. [SPARK-12901][SQL] Refactor options for JSON and CSV datasource (not case class and same format).
    
    https://issues.apache.org/jira/browse/SPARK-12901
    This PR refactors the options in JSON and CSV datasources.
    
    In more details,
    
    1. `JSONOptions` uses the same format as `CSVOptions`.
    2. They are no longer case classes.
    3. `CSVRelation` no longer has to be serializable (it was `with Serializable`, but I removed that).
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#10895 from HyukjinKwon/SPARK-12901.
    HyukjinKwon authored and rxin committed Jan 25, 2016
    Full SHA: 3adebfc
  3. [SPARK-12932][JAVA API] improved error message for java type inference failure
    
    Author: Andy Grove <andygrove73@gmail.com>
    
    Closes apache#10865 from andygrove/SPARK-12932.
    andygrove authored and srowen committed Jan 25, 2016
    Full SHA: d8e4805
  4. [SPARK-12755][CORE] Stop the event logger before the DAG scheduler

    [SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped.
    
    This contribution is my original work, and I license this work to the Spark project under the project's open source license.
    
    Author: Michael Allman <michael@videoamp.com>
    
    Closes apache#10700 from mallman/stop_event_logger_first.
    Michael Allman authored and srowen committed Jan 25, 2016
    Full SHA: 4ee8191
  5. [SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions
    
    Update the user guide for RFormula feature interactions. Meanwhile, we also update the guide for other new features, such as string label support in Spark 1.6.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10222 from yanboliang/spark-11965.
    yanboliang authored and mengxr committed Jan 25, 2016
    Full SHA: dd2325d
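
    A short, hedged example of the RFormula interaction syntax covered by this guide update; the column names and the DataFrame `df` are assumptions for illustration:

    ```scala
    import org.apache.spark.ml.feature.RFormula

    // "a:b" adds the interaction of columns a and b to the feature vector, in
    // addition to the main effects; a string label column is indexed automatically.
    val formula = new RFormula()
      .setFormula("label ~ a + b + a:b")
      .setFeaturesCol("features")
      .setLabelCol("indexedLabel")

    // Assuming `df` is an existing DataFrame with columns label, a and b:
    // val output = formula.fit(df).transform(df)
    ```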
  6. Closes apache#10879

    Closes apache#9046
    Closes apache#8532
    Closes apache#10756
    Closes apache#8960
    Closes apache#10485
    Closes apache#10467
    mengxr committed Jan 25, 2016
    Full SHA: ef8fb36
  7. [SPARK-12149][WEB UI] Executor UI improvement suggestions - Color UI

    Added color coding to the Executors page for Active Tasks, Failed Tasks, Completed Tasks and Task Time.
    
    Active Tasks is shaded blue, with its range based on the percentage of total cores used.
    Failed Tasks is shaded red, ranging over the first 10% of total tasks failed.
    Completed Tasks is shaded green, ranging over 10% of total tasks including failed and active tasks, but only when there are active or failed tasks on that executor.
    Task Time is shaded red when GC Time goes over 10% of total time, with its range directly corresponding to the percent of total time.
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#10154 from ajbozarth/spark12149.
    ajbozarth authored and Tom Graves committed Jan 25, 2016
    Full SHA: c037d25
  8. [SPARK-12902] [SQL] visualization for generated operators

    This PR brings back visualization for generated operators, which look like:
    
    ![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png)
    
    ![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png)
    
    Note: SQL metrics are not supported right now, because they are very slow, will be supported once we have batch mode.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10828 from davies/viz_codegen.
    Davies Liu authored and davies committed Jan 25, 2016
    Full SHA: 7d877c3
  9. Full SHA: 00026fa
  10. [SPARK-12975][SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns
    
    When users are using `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of partitioning columns. For example,
    ```
            df.write
              .format(source)
              .partitionBy("i")
              .bucketBy(8, "i", "k")
              .saveAsTable("bucketed_table")
    ```
    However, in the above case, adding column `i` to `bucketBy` is useless. It just wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we can issue an exception and let users make the change.
    
    Also added a test case for checking whether the information about the `sortBy` and `bucketBy` columns is correctly saved in the metastore table.
    
    Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks!
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#10891 from gatorsmile/commonKeysInPartitionByBucketBy.
    gatorsmile authored and marmbrus committed Jan 25, 2016
    Full SHA: 9348431
  11. [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark

    ```PCAModel``` can output ```explainedVariance``` on the Python side.
    
    cc mengxr srowen
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10830 from yanboliang/spark-12905.
    yanboliang authored and mengxr committed Jan 25, 2016
    Full SHA: dcae355
  12. [SPARK-12934][SQL] Count-min sketch serialization

    This PR adds serialization support for `CountMinSketch`.
    
    A version number is added to version the serialized binary format.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10893 from liancheng/cms-serialization.
    liancheng authored and rxin committed Jan 25, 2016
    Full SHA: 6f0f1d9

Commits on Jan 26, 2016

  1. [SPARK-12879] [SQL] improve the unsafe row writing framework

    As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more documentation to it and make it easier to use.
    
    This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip updating the row size. Then we can apply this technique to more places easily.
    
    a local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:
    **old version**
    ```
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    -------------------------------------------------------------------------------
    single long                             2616.04           102.61         1.00 X
    single nullable long                    3032.54            88.52         0.86 X
    primitive types                         9121.05            29.43         0.29 X
    nullable primitive types               12410.60            21.63         0.21 X
    ```
    
    **new version**
    ```
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    -------------------------------------------------------------------------------
    single long                             1533.34           175.07         1.00 X
    single nullable long                    2306.73           116.37         0.66 X
    primitive types                         8403.93            31.94         0.18 X
    nullable primitive types               12448.39            21.56         0.12 X
    ```
    
    For a single non-nullable long (the best case), we get about a 1.7x speedup. Even when it is nullable, we still get a 1.3x speedup. For other cases the boost is smaller, as the saved operations account for only a small proportion of the whole process. The benchmark code is included in this PR.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10809 from cloud-fan/unsafe-projection.
    cloud-fan authored and davies committed Jan 26, 2016
    Full SHA: be375fc
  2. [SPARK-12936][SQL] Initial bloom filter implementation

    This PR adds an initial implementation of bloom filter in the newly added sketch module.  The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java).
    
    Some differences from the design doc:
    
    * Expose `bitSize` instead of `sizeInBytes` to the user.
    * The `expectedInsertions` parameter is always required when creating a bloom filter.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10883 from cloud-fan/bloom-filter.
    cloud-fan authored and rxin committed Jan 26, 2016
    Full SHA: 109061f
  3. [SPARK-12934] use try-with-resources for streams

    liancheng please take a look
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes apache#10906 from tedyu/master.
    tedyu authored and liancheng committed Jan 26, 2016
    Full SHA: fdcc351
  4. [SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer
    
    Add Python API for ml.feature.QuantileDiscretizer.
    
    One open question: Do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?
    cc brkyvz & mengxr
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
    holdenk authored and jkbradley committed Jan 26, 2016
    Full SHA: b66afde
  5. [SPARK-12834] Change ser/de of JavaArray and JavaList

    https://issues.apache.org/jira/browse/SPARK-12834
    
    We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with ser/de of Scala Array, as I noted in https://issues.apache.org/jira/browse/SPARK-12780
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#10772 from yinxusen/SPARK-12834.
    yinxusen authored and jkbradley committed Jan 26, 2016
    Full SHA: ae47ba7
  6. [SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in PySpark for now
    
    I saw several failures from recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored and we will fix the flakiness in SPARK-10086.
    
    gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"?
    
    cc: jkbradley
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#10909 from mengxr/SPARK-10086.
    mengxr committed Jan 26, 2016
    Full SHA: 27c910f
  7. [SQL][MINOR] A few minor tweaks to CSV reader.

    This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#10919 from rxin/csv-minor.
    rxin committed Jan 26, 2016
    Full SHA: d54cfed
  8. [SPARK-12937][SQL] bloom filter serialization

    This PR adds serialization support for BloomFilter.
    
    A version number is added to version the serialized binary format.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10920 from cloud-fan/bloom-filter.
    cloud-fan authored and rxin committed Jan 26, 2016
    Full SHA: 6743de3
  9. [SPARK-12961][CORE] Prevent snappy-java memory leak

    JIRA: https://issues.apache.org/jira/browse/SPARK-12961
    
    To prevent a memory leak in snappy-java, just call the method once and cache the result. After the library releases a new version, we can remove this cached object.
    
    JoshRosen
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#10875 from viirya/prevent-snappy-memory-leak.
    viirya authored and srowen committed Jan 26, 2016
    Full SHA: 5936bf9
  10. [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
    
    Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
    
    CC rxin pwendell for API change; tdas since it also touches streaming.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#10413 from srowen/SPARK-3369.
    srowen committed Jan 26, 2016
    Full SHA: 649e9d0
  11. [SPARK-10911] Executors should System.exit on clean shutdown.

    Call System.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441.
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes apache#9946 from zhuoliu/10911.
    zhuol authored and Tom Graves committed Jan 26, 2016
    Full SHA: ae0309a
  12. [SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format
    
    This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes apache#10826 from sameeragarwal/skip-hive-metadata.
    sameeragarwal authored and yhuai committed Jan 26, 2016
    Full SHA: 08c781c
  13. [SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions for streaming-akka project
    
    Since `actorStream` is an external project, we should add the linking and deploying instructions for it.
    
    A follow up PR of apache#10744
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#10856 from zsxwing/akka-link-instruction.
    zsxwing authored and tdas committed Jan 26, 2016
    Full SHA: cbd507d
  14. [SPARK-11923][ML] Python API for ml.feature.ChiSqSelector

    https://issues.apache.org/jira/browse/SPARK-11923
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#10186 from yinxusen/SPARK-11923.
    yinxusen authored and jkbradley committed Jan 26, 2016
    Full SHA: 8beab68
  15. [SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer rather than its parent class
    
    https://issues.apache.org/jira/browse/SPARK-12952
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#10863 from yinxusen/SPARK-12952.
    yinxusen authored and jkbradley committed Jan 26, 2016
    Full SHA: fbf7623
  16. [SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in dev/run-tests
    
    This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies.  This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure
    
    Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after.
    
    In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10885 from JoshRosen/SPARK-8725.
    JoshRosen committed Jan 26, 2016
    Full SHA: ee74498
  17. [SQL] Minor Scaladoc format fix

    Otherwise the `^` character is always marked as error in IntelliJ since it represents an unclosed superscript markup tag.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10926 from liancheng/agg-doc-fix.
    liancheng committed Jan 26, 2016
    Full SHA: 83507fe
  18. [SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark

    The environment variable ADD_FILES was created for adding Python files to the Spark context to be distributed to executors (SPARK-865); it is deprecated now. Users are encouraged to use --py-files to add Python files.
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes apache#10913 from zjffdu/SPARK-12993.
    zjffdu authored and rxin committed Jan 26, 2016
    Full SHA: 19fdb21
  19. [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code

    The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
    holdenk authored and jkbradley committed Jan 26, 2016
    Full SHA: eb91729

Commits on Jan 27, 2016

  1. [SPARK-12614][CORE] Don't throw non fatal exception from ask

    Right now RpcEndpointRef.ask may throw an exception in some corner cases, such as calling ask after stopping RpcEnv. It's better to avoid throwing exceptions from RpcEndpointRef.ask; we can send the exception to the future returned by `ask` instead.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#10568 from zsxwing/send-ask-fail.
    zsxwing committed Jan 27, 2016
    Full SHA: 22662b2
  2. [SPARK-11622][MLLIB] Make LibSVMRelation extend HadoopFsRelation and add LibSVMOutputWriter
    
    The behavior of LibSVMRelation is not changed, except for adding LibSVMOutputWriter:
    * Partitioning is still not supported
    * Multiple input paths are not supported
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes apache#9595 from zjffdu/SPARK-11622.
    zjffdu authored and mengxr committed Jan 27, 2016
    Full SHA: 1dac964
  3. [SPARK-12854][SQL] Implement complex types support in ColumnarBatch

    This patch adds support for complex types for ColumnarBatch. ColumnarBatch supports structs
    and arrays. There is a simple mapping between the richer catalyst types to these two. Strings
    are treated as an array of bytes.
    
    ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist
    of just leaf nodes. Structs represent an internal node with one child for each field. Arrays
    are internal nodes with one child. Structs just contain nullability. Arrays contain offsets
    and lengths into the child array. This structure is able to handle arbitrary nesting. It has
    the key property that we maintain columnar throughout and that primitive types are only stored
    in the leaf nodes and contiguous across rows. For example, if the schema is
    ```
    array<array<int>>
    ```
    There are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively.
    
    As part of this, this patch adds append APIs in addition to the Put APIs (e.g. putLong(rowid, v)
    vs appendLong(v)). These APIs are necessary when the batch contains variable length elements.
    The vectors are not fixed length and will grow as necessary. This should make the usage a lot
    simpler for the writer.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#10820 from nongli/spark-12854.
    nongli authored and rxin committed Jan 27, 2016
    Full SHA: 5551273
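
    A tiny, self-contained Scala model of the offsets/lengths layout described above for `array<array<int>>`; it only illustrates the columnar nesting idea and is not the actual ColumnarBatch code:

    ```scala
    // Three "columns": the outer array node, the inner array node, and the int leaf.
    // Each array level stores (offset, length) pairs into its child; only the leaf
    // stores the actual int data, contiguously across all rows.
    object NestedArrayLayoutSketch extends App {
      // Row 0: [[1, 2], [3]]   Row 1: [[4, 5, 6]]
      val leafInts     = Array(1, 2, 3, 4, 5, 6)   // leaf column, contiguous ints
      val innerOffsets = Array(0, 2, 3)            // inner arrays -> leaf positions
      val innerLengths = Array(2, 1, 3)
      val outerOffsets = Array(0, 2)               // rows -> inner array positions
      val outerLengths = Array(2, 1)

      def row(r: Int): Seq[Seq[Int]] =
        (0 until outerLengths(r)).map { i =>
          val inner = outerOffsets(r) + i
          leafInts.slice(innerOffsets(inner), innerOffsets(inner) + innerLengths(inner)).toSeq
        }

      assert(row(0) == Seq(Seq(1, 2), Seq(3)))
      assert(row(1) == Seq(Seq(4, 5, 6)))
    }
    ```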
  4. [SPARK-7780][MLLIB] Intercept in LogisticRegressionWithLBFGS should not be regularized
    
    The intercept in Logistic Regression represents a prior on categories and should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization.
    The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib since the majority of users are still using the MLlib API.
    Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
    
    Previously partially reviewed at apache#6386 (comment) re-opening for dbtsai to review.
    
    Author: Holden Karau <holden@us.ibm.com>
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes apache#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
    holdenk authored and DB Tsai committed Jan 27, 2016
    Full SHA: b72611f
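
    For reference, a hedged sketch of the objective the ML implementation minimizes (labels y_i in {-1, +1}); the L2 penalty applies to the weights w but not to the intercept b:

    ```
    \min_{w,\, b} \;\; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \, (w^\top x_i + b)\right)\right) \;+\; \lambda \, \lVert w \rVert_2^2
    ```

    The older Updater-based path effectively penalized b as well, which is what the commit describes as causing poor training accuracy under regularization.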
  5. [SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR

    Add ```covar_samp``` and ```covar_pop``` for SparkR.
    Should we also provide a ```cov``` alias for ```covar_samp```? There is a ```cov``` implementation in stats.R which already masks ```stats::cov```, but this may introduce a breaking API change.
    
    cc sun-rui felixcheung shivaram
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10829 from yanboliang/spark-12903.
    yanboliang authored and shivaram committed Jan 27, 2016
    Full SHA: e7f9199
  6. [SPARK-12935][SQL] DataFrame API for Count-Min Sketch

    This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10911 from liancheng/cms-df-api.
    liancheng authored and rxin committed Jan 27, 2016
    Full SHA: ce38a35
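
    A hedged usage sketch for the Count-Min Sketch integration, using the sketch-module API directly; the DataFrame-level call is shown commented out because its exact signature may differ in this snapshot, and the column name and parameters are illustrative:

    ```scala
    import org.apache.spark.util.sketch.CountMinSketch

    // Create a sketch with relative error eps = 0.01, confidence 0.99 and a fixed
    // seed, add a few items, then query approximate frequencies.
    val cms = CountMinSketch.create(0.01, 0.99, 42)
    Seq("a", "b", "a", "c", "a").foreach(item => cms.add(item))
    val approxCountOfA: Long = cms.estimateCount("a")   // ~3 for this tiny input

    // DataFrame-level entry point added by this PR (illustrative):
    // val sketch = df.stat.countMinSketch("item", 0.01, 0.99, 42)
    ```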
  7. [SPARK-12728][SQL] Integrates SQL generation with native view

    This PR is a follow-up of PR apache#10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.
    
    In this PR, a new SQL option `spark.sql.nativeView.canonical` is added.  When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we failed to map the plan to SQL, we fallback to the original native view approach.
    
    One important issue this PR fixes is that, now we can use CTE when defining a view.  Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`.  However, HiveQL parser doesn't allow CTE appearing as a subquery.  Namely, something like this is disallowed:
    
    ```sql
    SELECT n
    FROM (
      WITH w AS (SELECT 1 AS n)
      SELECT * FROM w
    ) v
    ```
    
    This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during analysis phase, thus there won't be CTE expressions in the generated SQL query string).
    
    Author: Cheng Lian <lian@databricks.com>
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes apache#10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
    liancheng authored and yhuai committed Jan 27, 2016
    Full SHA: 58f5d8c
  8. [SPARK-12967][NETTY] Avoid NettyRpc error message during sparkContext shutdown
    
    If there's an RPC issue while sparkContext is alive but stopped (which would happen only when executing SparkContext.stop), log a warning instead. This is a common occurrence.
    
    vanzin
    
    Author: Nishkam Ravi <nishkamravi@gmail.com>
    Author: nishkamravi2 <nishkamravi@gmail.com>
    
    Closes apache#10881 from nishkamravi2/master_netty.
    nishkamravi2 authored and zsxwing committed Jan 27, 2016
    Full SHA: bae3c9a
  9. [SPARK-12780] Inconsistent return values of ML Python models' properties
    
    https://issues.apache.org/jira/browse/SPARK-12780
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#10724 from yinxusen/SPARK-12780.
    yinxusen authored and jkbradley committed Jan 27, 2016
    Full SHA: 4db255c
  10. [SPARK-12983][CORE][DOC] Correct metrics.properties.template

    There are some typos or plain unintelligible sentences in the metrics template.
    
    Author: BenFradet <benjamin.fradet@gmail.com>
    
    Closes apache#10902 from BenFradet/SPARK-12983.
    BenFradet authored and srowen committed Jan 27, 2016
    Full SHA: 90b0e56
  11. [SPARK-1680][DOCS] Explain environment variables for running on YARN in cluster mode
    
    JIRA 1680 added a property called spark.yarn.appMasterEnv.  This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables
    
    Author: Andrew <weiner.andrew.j@gmail.com>
    
    Closes apache#10869 from weineran/branch-yarn-docs.
    weineran authored and srowen committed Jan 27, 2016
    Full SHA: 093291c
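
    A small hedged example of the configuration pattern this doc change explains; the variable names and values are placeholders:

    ```scala
    import org.apache.spark.SparkConf

    // In yarn-cluster mode the driver runs inside the Application Master, so its
    // environment variables are set via spark.yarn.appMasterEnv.<NAME> rather than
    // the usual client-side environment.
    val conf = new SparkConf()
      .set("spark.yarn.appMasterEnv.JAVA_HOME", "/usr/lib/jvm/java-8")  // placeholder path
      .set("spark.yarn.appMasterEnv.MY_SETTING", "some-value")          // placeholder variable
    ```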
  12. [SPARK-13023][PROJECT INFRA] Fix handling of root module in modules_to_test()
    
    There's a minor bug in how we handle the `root` module in the `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends on `build` (since every test needs to run on any build test), we now need to check for the presence of root in `modules_to_test` instead of `changed_modules`.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10933 from JoshRosen/build-module-fix.
    JoshRosen committed Jan 27, 2016
    Full SHA: 41f0c85
  13. [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
    
    The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.
    
    Author: Jason Lee <cjlee@us.ibm.com>
    
    Closes apache#8969 from jasoncl/SPARK-10847.
    jasoncl authored and yhuai committed Jan 27, 2016
    Full SHA: edd4737
  14. [SPARK-12895][SPARK-12896] Migrate TaskMetrics to accumulators

    The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:
    
    **SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.
    
    **SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.
    
    While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here.
    
    Note: This was once part of apache#10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10835 from andrewor14/task-metrics-use-accums.
    Andrew Or authored and JoshRosen committed Jan 27, 2016
    Full SHA: 87abcf7
  15. [SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract
    
    Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition needs to match its position in the partitions array.
    
    If a custom RDD implementation violates this contract, then Spark has the potential to become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html
    
    In order to guard against this infinite loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs' whose `partitions` violate the API contract.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10932 from JoshRosen/SPARK-13021.
    JoshRosen authored and yhuai committed Jan 27, 2016
    Full SHA: 32f7411
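
    A small sketch of the invariant described above, written against the public RDD API; it illustrates the contract rather than reproducing the actual check added by this patch:

    ```scala
    import org.apache.spark.rdd.RDD

    // Contract: for every x, rdd.partitions(x).index == x, i.e. each Partition's
    // reported index must equal its position in the partitions array.
    def satisfiesPartitionContract(rdd: RDD[_]): Boolean =
      rdd.partitions.zipWithIndex.forall { case (p, i) => p.index == i }
    ```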
  16. [SPARK-12938][SQL] DataFrame API for Bloom filter

    This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs.
    
    This PR also adds two specialized `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10937 from cloud-fan/bloom-filter.
    cloud-fan authored and rxin committed Jan 27, 2016
    Full SHA: 680afab
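
    A hedged usage sketch for the Bloom filter integration, using the sketch-module API directly; the DataFrame-level call is commented out since its exact signature may differ here, and the names and sizes are illustrative:

    ```scala
    import org.apache.spark.util.sketch.BloomFilter

    // Size the filter for ~1 million expected items at a 3% false-positive rate,
    // insert a few items, then membership-test (no false negatives, some false positives).
    val bf = BloomFilter.create(1000000L, 0.03)
    bf.put("alice")
    bf.put("bob")
    val maybePresent: Boolean = bf.mightContain("alice")   // true

    // DataFrame-level entry point (illustrative):
    // val filter = df.stat.bloomFilter("userId", 1000000L, 0.03)
    ```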
  17. [SPARK-12865][SPARK-12866][SQL] Migrate SparkSQLParser/ExtendedHiveQlParser commands to new Parser
    
    This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). This also improves the current SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands, this PR respects these commands (and passes them on to Hive).
    
    This PR and apache#10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst.
    
    The PR is marked WIP as long as it doesn't pass all tests.
    
    cc rxin viirya winningsix (this touches apache#10144)
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#10905 from hvanhovell/SPARK-12866.
    hvanhovell authored and rxin committed Jan 27, 2016
    Full SHA: ef96cd3
  18. [HOTFIX] Fix Scala 2.11 compilation

    by explicitly marking annotated parameters as vals (SI-8813).
    
    Caused by apache#10835.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10955 from andrewor14/fix-scala211.
    Andrew Or authored and yhuai committed Jan 27, 2016
    Full SHA: d702f0c
  19. [SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBatch.Row
    
    These two classes became identical as the implementation progressed.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#10952 from nongli/spark-13045.
    nongli authored and davies committed Jan 27, 2016
    Full SHA: 4a09123

Commits on Jan 28, 2016

  1. Provide same info as in spark-submit --help

    This is stated for --packages and --repositories. Without stating it for --jars, people expect a standard Java classpath to work, with expansion and a different delimiter than a comma. Currently this is only stated in the --help for spark-submit: "Comma-separated list of local jars to include on the driver and executor classpaths."
    
    Author: James Lohse <jimlohse@users.noreply.github.com>
    
    Closes apache#10890 from jimlohse/patch-1.
    jimlohse authored and srowen committed Jan 28, 2016
    Full SHA: c220443
  2. [SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch
    
    This PR is a follow-up of apache#10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10968 from liancheng/cms-specialized.
    liancheng authored and rxin committed Jan 28, 2016
    Full SHA: 415d0a8
  3. [SPARK-12926][SQL] SQLContext to display warning message when non-sql configs are being set
    
    Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work. eg. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done.
    
    Author: Tejas Patil <tejasp@fb.com>
    
    Closes apache#10849 from tejasapatil/SPARK-12926.
    tejasapatil authored and marmbrus committed Jan 28, 2016
    Full SHA: 6768039
  4. [SPARK-13031] [SQL] cleanup codegen and improve test coverage

    1. enable whole stage codegen during tests even when there is only one operator that supports it.
    2. split doProduce() into two APIs: upstream() and doProduce()
    3. generate prefix for fresh names of each operator
    4. pass UnsafeRow to parent directly (avoid getters and create UnsafeRow again)
    5. fix bugs and tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10944 from davies/gen_refactor.
    Davies Liu authored and davies committed Jan 28, 2016
    Full SHA: cc18a71
  5. [SPARK-9835][ML] Implement IterativelyReweightedLeastSquares solver

    Implement the ```IterativelyReweightedLeastSquares``` solver for GLMs. I consider it a solver rather than an estimator; it is only used internally, so I keep it ```private[ml]```.
    There are two limitations in the current implementation compared with R:
    * It can not support ```Tuple``` as response for ```Binomial``` family, such as the following code:
    ```
    glm( cbind(using, notUsing) ~  age + education + wantsMore , family = binomial)
    ```
    * It does not support ```offset```.
    
    Because ```RFormula``` does not support ```Tuple``` as label or the ```offset``` keyword, I simplified the implementation. Adding support for these two features is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistical summary for IRLS.
    The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM).
    Please focus on the main structure and overlook minor issues/docs that I will update later. Any comments and opinions will be appreciated.
    
    cc mengxr jkbradley
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10639 from yanboliang/spark-9835.
    yanboliang authored and mengxr committed Jan 28, 2016
    Full SHA: df78a93
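
    For context, a textbook statement of one IRLS step for a GLM with link g (linear predictor eta_i = x_i^T beta, mean mu_i = g^{-1}(eta_i)); the solver's exact form may differ in details such as offsets and instance weights:

    ```
    z_i = \eta_i + (y_i - \mu_i)\, g'(\mu_i), \qquad
    w_i = \frac{1}{\operatorname{Var}(\mu_i)\, g'(\mu_i)^2}, \qquad
    \beta^{(t+1)} = \arg\min_{\beta} \sum_i w_i \left(z_i - x_i^\top \beta\right)^2
    ```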
  6. [SPARK-12401][SQL] Add integration tests for postgres enum types

    We can handle PostgreSQL-specific enum types as strings in JDBC.
    So, we should just add tests and close the corresponding JIRA ticket.
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10596 from maropu/AddTestsInIntegration.
    maropu authored and liancheng committed Jan 28, 2016
    Full SHA: abae889
  7. [SPARK-12749][SQL] add json option to parse floating-point types as DecimalType
    
    I tried to add this via `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success.
    
    Added test for non-complex types. Should I add a test for complex types?
    
    Author: Brandon Bradley <bradleytastic@gmail.com>
    
    Closes apache#10936 from blbradley/spark-12749.
    blbradley authored and rxin committed Jan 28, 2016
    Full SHA: 3a40c0e

Commits on Jan 29, 2016

  1. [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushing down filters in Parquet
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-11955
    
    Currently we simply skip pushing down filters in Parquet if we enable schema merging.
    
    However, we can actually mark particular fields in the merging schema so that filters can safely be pushed down in Parquet.
    
    Author: Liang-Chi Hsieh <viirya@appier.com>
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#9940 from viirya/safe-pushdown-parquet-filters.
    viirya authored and liancheng committed Jan 29, 2016
    Full SHA: 4637fc0
  2. Full SHA: b9dfdcc
  3. [SPARK-12968][SQL] Implement command to set current database

    JIRA: https://issues.apache.org/jira/browse/SPARK-12968
    
    Implement command to set current database.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Liang-Chi Hsieh <viirya@appier.com>
    
    Closes apache#10916 from viirya/ddl-use-database.
    viirya authored and rxin committed Jan 29, 2016
    Full SHA: 66449b8
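
    A minimal usage sketch of the new command, assuming an existing `sqlContext`; the database and table names are placeholders:

    ```scala
    // Switch the session's current database; unqualified table names then
    // resolve against it.
    sqlContext.sql("USE my_database")
    sqlContext.sql("SELECT * FROM some_table").show()
    ```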
  4. [SPARK-13067] [SQL] workaround for a weird scala reflection problem

    A simple workaround to avoid getting parameter types when converting a logical plan to JSON.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10970 from cloud-fan/reflection.
    cloud-fan authored and davies committed Jan 29, 2016
    Full SHA: 721ced2
  5. [SPARK-13050][BUILD] Scalatest tags fail build with the addition of the sketch module
    
    A dependency on the spark test tags was left out of the sketch module pom file causing builds to fail when test tags were used. This dependency is found in the pom file for every other module in spark.
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#10954 from ajbozarth/spark13050.
    ajbozarth authored and liancheng committed Jan 29, 2016
    Full SHA: 8d3cc3d
  6. [SPARK-13031][SQL] cleanup codegen and improve test coverage

    1. enable whole stage codegen during tests even when there is only one operator that supports it.
    2. split doProduce() into two APIs: upstream() and doProduce()
    3. generate prefix for fresh names of each operator
    4. pass UnsafeRow to parent directly (avoid getters and create UnsafeRow again)
    5. fix bugs and tests.
    
    This PR re-open apache#10944 and fix the bug.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10977 from davies/gen_refactor.
    Davies Liu authored and rxin committed Jan 29, 2016
    Full SHA: 55561e7
  7. [SPARK-13032][ML][PYSPARK] PySpark support for model export/import, taking LinearRegression as an example
    
    * Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
    * Make ```LinearRegression``` support ```save/load``` as an example. After this is merged, the work for other transformers/estimators will be easy; then we can list and distribute the tasks to the community.
    
    cc mengxr jkbradley
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes apache#10469 from yanboliang/spark-11939.
    yanboliang authored and jkbradley committed Jan 29, 2016
    Full SHA: e51b6ea
  8. [SPARK-10873] Support column sort and search for History Server.

    [SPARK-10873] Support column sort and search for History Server using jQuery DataTable and REST API. Before this commit, the history server generated hard-coded HTML and could not support search; also, sorting was disabled if any application had more than one attempt. Supporting search and sort (over all applications rather than the 20 entries on the current page) will greatly improve the user experience.
    
    1. Create the historypage-template.html for displaying application information in datables.
    2. historypage.js uses jQuery to access the data from /api/v1/applications REST API, and use DataTable to display each application's information. For application that has more than one attempt, the RowsGroup is used to merge such entries while at the same time supporting sort and search.
    3. "duration" and "lastUpdated" rest API are added to application's "attempts".
    4. External JavaScript and CSS files for DataTables, RowsGroup and jQuery plugins are added, with licenses clarified.
    
    Snapshots for how it looks like now:
    
    History page view:
    ![historypage](https://cloud.githubusercontent.com/assets/11683054/12184383/89bad774-b55a-11e5-84e4-b0276172976f.png)
    
    Search:
    ![search](https://cloud.githubusercontent.com/assets/11683054/12184385/8d3b94b0-b55a-11e5-869a-cc0ef0a4242a.png)
    
    Sort by started time:
    ![sort-by-started-time](https://cloud.githubusercontent.com/assets/11683054/12184387/8f757c3c-b55a-11e5-98c8-577936366566.png)
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes apache#10648 from zhuoliu/10873.
    zhuol authored and Tom Graves committed Jan 29, 2016
    Full SHA: e4c1162
  9. [SPARK-13072] [SQL] simplify and improve murmur3 hash expression codegen

    Simplify the generated code of the hash expression (remove several unnecessary local variables) and avoid null checks where possible.
    
    generated code comparison for `hash(int, double, string, array<string>)`:
    **before:**
    ```
      public UnsafeRow apply(InternalRow i) {
        /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
        int value1 = 42;
        /* input[0, int] */
        int value3 = i.getInt(0);
        if (!false) {
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
        }
        /* input[1, double] */
        double value5 = i.getDouble(1);
        if (!false) {
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
        }
        /* input[2, string] */
        boolean isNull6 = i.isNullAt(2);
        UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
        if (!isNull6) {
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
        }
        /* input[3, array<int>] */
        boolean isNull8 = i.isNullAt(3);
        ArrayData value9 = isNull8 ? null : (i.getArray(3));
        if (!isNull8) {
          int result10 = value1;
          for (int index11 = 0; index11 < value9.numElements(); index11++) {
            if (!value9.isNullAt(index11)) {
              final int element12 = value9.getInt(index11);
              result10 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element12, result10);
            }
          }
          value1 = result10;
        }
      }
    ```
    **after:**
    ```
      public UnsafeRow apply(InternalRow i) {
        /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
        int value1 = 42;
        /* input[0, int] */
        int value3 = i.getInt(0);
        value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
        /* input[1, double] */
        double value5 = i.getDouble(1);
        value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
        /* input[2, string] */
        boolean isNull6 = i.isNullAt(2);
        UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
    
        if (!isNull6) {
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
        }
    
        /* input[3, array<int>] */
        boolean isNull8 = i.isNullAt(3);
        ArrayData value9 = isNull8 ? null : (i.getArray(3));
        if (!isNull8) {
          for (int index10 = 0; index10 < value9.numElements(); index10++) {
            final int element11 = value9.getInt(index10);
            value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element11, value1);
          }
        }
    
        rowWriter14.write(0, value1);
        return result12;
      }
    ```
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10974 from cloud-fan/codegen.
    cloud-fan authored and davies committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    c5f745e View commit details
    Browse the repository at this point in the history
  10. [SPARK-12656] [SQL] Implement Intersect with Left-semi Join

    Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
    
    After a search, I found that at least one mainstream RDBMS does the same: in its query explain output, Intersect is replaced by a left-semi join. Left-semi joins can also help outer-join elimination in the optimizer, as shown in PR apache#10566.
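
    A minimal sketch of the equivalence, assuming a SQLContext named `sqlContext` is in scope (the tables and column names below are made up for illustration):
    
    ```scala
    import sqlContext.implicits._
    
    val df1 = Seq((1, "x"), (2, "y"), (2, "y")).toDF("a", "b")
    val df2 = Seq((2, "y"), (3, "z")).toDF("a", "b")
    
    // INTERSECT ...
    val viaIntersect = df1.intersect(df2)
    
    // ... behaves like a distinct LEFT SEMI join on null-safe equality of all columns.
    val viaSemiJoin = df1
      .join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b"), "leftsemi")
      .distinct()
    ```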
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes apache#10630 from gatorsmile/IntersectBySemiJoin.
    gatorsmile authored and rxin committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    5f686cc View commit details
    Browse the repository at this point in the history
  11. [SPARK-12818] Polishes spark-sketch module

    Fixes various minor code and Javadoc styling issues.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10985 from liancheng/sketch-polishing.
    liancheng committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    2b027e9 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13055] SQLHistoryListener throws ClassCastException

    This is an existing issue uncovered recently by apache#10835. The exception occurred because the `SQLHistoryListener` gets all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener gets `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, then proceeds to cast the Int to a Long, which fails.
    
    The fix is to mark accumulators representing SQL metrics using some internal metadata. Then we can identify which ones are SQL metrics and only process those in the `SQLHistoryListener`.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10971 from andrewor14/fix-sql-history.
    Andrew Or authored and zsxwing committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    e38b0ba View commit details
    Browse the repository at this point in the history

Commits on Jan 30, 2016

  1. [SPARK-13076][SQL] Rename ClientInterface -> HiveClient

    And ClientWrapper -> HiveClientImpl.
    
    I have some followup pull requests to introduce a new internal catalog, and I think this new naming reflects better the functionality of the two classes.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#10981 from rxin/SPARK-13076.
    rxin committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    2cbc412 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13096][TEST] Fix flaky verifyPeakExecutionMemorySet

    Previously we would assert things before all events are guaranteed to have been processed. To fix this, just block until all events are actually processed, i.e. until the listener queue is empty.
    
    https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/79/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling/
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10990 from andrewor14/accum-suite-less-flaky.
    Andrew Or committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    e6ceac4 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13088] Fix DAG viz in latest version of chrome

    Apparently chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: andrewor14/dagre-d3@7d6c000, which is taken from the fix in the main repo: dagrejs/dagre-d3@1ef067f
    
    Upstream issue: dagrejs/dagre-d3#202
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10986 from andrewor14/fix-dag-viz.
    Andrew Or committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    70e69fc View commit details
    Browse the repository at this point in the history
  4. [SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics

    This issue is causing tests to fail consistently in master with Hadoop 2.6 / 2.7. This is because for Hadoop 2.5+ we overwrite existing values of `InputMetrics#bytesRead` in each call to `HadoopRDD#compute`. In the case of coalesce, e.g.
    ```
    sc.textFile(..., 4).coalesce(2).count()
    ```
    we will call `compute` multiple times in the same task, overwriting `bytesRead` values from previous calls to `compute`.
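
    A tiny sketch of the difference (illustrative only, not the actual InputMetrics code): setting the value loses the bytes read by earlier splits in the same task, while incrementing accumulates them.
    
    ```scala
    var bytesRead = 0L
    def setBytesRead(v: Long): Unit = { bytesRead = v }    // overwrites earlier splits
    def incBytesRead(v: Long): Unit = { bytesRead += v }   // keeps the running total
    ```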
    
    For a regression test, see `InputOutputMetricsSuite.input metrics for old hadoop with coalesce`. I did not add a new regression test because it's impossible without significant refactoring; there's a lot of existing duplicate code in this corner of Spark.
    
    This was caused by apache#10835.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10973 from andrewor14/fix-input-metrics-coalesce.
    Andrew Or committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    12252d1 View commit details
    Browse the repository at this point in the history
  5. [SPARK-12914] [SQL] generate aggregation with grouping keys

    This PR add support for grouping keys for generated TungstenAggregate.
    
    Spilling and performance improvements for BytesToBytesMap will be done by followup PR.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10855 from davies/gen_keys.
    Davies Liu authored and davies committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    e6a02c6 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13098] [SQL] remove GenericInternalRowWithSchema

    This class is only used for serialization of Python DataFrame. However, we don't require internal row there, so `GenericRowWithSchema` can also do the job.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10992 from cloud-fan/python.
    cloud-fan authored and davies committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    dab246f View commit details
    Browse the repository at this point in the history
  7. [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version

    This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
    
    The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
    
    After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10608 from JoshRosen/SPARK-6363.
    JoshRosen authored and rxin committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    289373b View commit details
    Browse the repository at this point in the history
  8. [SPARK-13100][SQL] improving the performance of stringToDate method i…

    …n DateTimeUtils.scala
    
    In JDK 1.7, TimeZone.getTimeZone() is synchronized, so use an instance variable to hold a GMT TimeZone object instead of instantiating it every time.
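
    A rough sketch of the idea (the object and method names below are illustrative, not the actual DateTimeUtils code):
    
    ```scala
    import java.util.{Calendar, TimeZone}
    
    object DateParsing {
      // TimeZone.getTimeZone is synchronized on JDK 7, so hold a single GMT
      // instance instead of looking it up on every call.
      private val gmt: TimeZone = TimeZone.getTimeZone("GMT")
    
      def newGmtCalendar(): Calendar = Calendar.getInstance(gmt)
    }
    ```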
    
    Author: wangyang <wangyang@haizhi.com>
    
    Closes apache#10994 from wangyang1992/datetimeUtil.
    wangyang authored and rxin committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    de28371 View commit details
    Browse the repository at this point in the history

Commits on Jan 31, 2016

  1. [SPARK-13070][SQL] Better error message when Parquet schema merging f…

    …ails
    
    Make sure we throw better error messages when Parquet schema merging fails.
    
    Author: Cheng Lian <lian@databricks.com>
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#10979 from viirya/schema-merging-failure-message.
    liancheng authored and rxin committed Jan 31, 2016
    Configuration menu
    Copy the full SHA
    a1303de View commit details
    Browse the repository at this point in the history
  2. [SPARK-12689][SQL] Migrate DDL parsing to the newly absorbed parser

    JIRA: https://issues.apache.org/jira/browse/SPARK-12689
    
    DDLParser processes three commands: createTable, describeTable and refreshTable.
    This patch migrates the three commands to the newly absorbed parser.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Liang-Chi Hsieh <viirya@appier.com>
    
    Closes apache#10723 from viirya/migrate-ddl-describe.
    viirya authored and rxin committed Jan 31, 2016
    Configuration menu
    Copy the full SHA
    0e6d92d View commit details
    Browse the repository at this point in the history
  3. [SPARK-13049] Add First/last with ignore nulls to functions.scala

    This PR adds the ability to specify the ```ignoreNulls``` option in the functions DSL, e.g.:
    ```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other")))```
    
    This PR is somewhere between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport it to 1.6.
    
    cc yhuai
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#10957 from hvanhovell/SPARK-13049.
    hvanhovell authored and rxin committed Jan 31, 2016
    Configuration menu
    Copy the full SHA
    5a8b978 View commit details
    Browse the repository at this point in the history

Commits on Feb 1, 2016

  1. [SPARK-13093] [SQL] improve null check in nullSafeCodeGen for unary, …

    …binary and ternary expression
    
    The current implementation is sub-optimal:
    
    * If an expression is always nullable, e.g. `Unhex`, we can still remove null check for children if they are not nullable.
    * If an expression has some non-nullable children, we can still remove null check for these children and keep null check for others.
    
    This PR improves this by making the null check elimination more fine-grained.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10987 from cloud-fan/null-check.
    cloud-fan authored and davies committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    c1da4d4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-6847][CORE][STREAMING] Fix stack overflow issue when updateSta…

    …teByKey is followed by a checkpointed dstream
    
    Add a local property to indicate whether to checkpoint all RDDs that are marked with the checkpoint flag, and enable it in Streaming.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#10934 from zsxwing/recursive-checkpoint.
    zsxwing authored and Andrew Or committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    6075573 View commit details
    Browse the repository at this point in the history
  3. [SPARK-12989][SQL] Delaying Alias Cleanup after ExtractWindowExpressions

    JIRA: https://issues.apache.org/jira/browse/SPARK-12989
    
    In the rule `ExtractWindowExpressions`, we simply replace aliases by the corresponding attributes. However, this causes an issue exposed by the following case:
    
    ```scala
    val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
      .withColumn("Data", struct("A", "B", "C"))
      .drop("A")
      .drop("B")
      .drop("C")
    
    val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
    data.select($"*", max("num").over(winSpec) as "max").explain(true)
    ```
    In this case, both `Data.A` and `Data.B` are aliases in `WindowSpecDefinition`. If we replace these alias expressions by their alias names, we can no longer tell what they refer to, since they are not put in `missingExpr` either.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes apache#10963 from gatorsmile/seletStarAfterColDrop.
    gatorsmile authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    33c8a49 View commit details
    Browse the repository at this point in the history
  4. [SPARK-12705][SPARK-10777][SQL] Analyzer Rule ResolveSortReferences

    JIRA: https://issues.apache.org/jira/browse/SPARK-12705
    
    **Scope:**
    This PR is a general fix for sort reference resolution when the child's `outputSet` does not contain the order-by attributes (called *missing attributes*):
      - UnaryNode support is limited to `Project`, `Window`, `Aggregate`, `Distinct`, `Filter`, `RepartitionByExpression`.
      - We will not try to resolve the missing references inside a subquery, unless the outputSet of this subquery contains it.
    
    **General Reference Resolution Rules:**
      - Jump over nodes of the following types: `Distinct`, `Filter`, `RepartitionByExpression`. There is no need to add missing attributes, because their `outputSet` is determined by their `inputSet`, which is the `outputSet` of their children.
      - Group-by expressions in `Aggregate`: missing order-by attributes are not allowed to be added into group-by expressions, since doing so would change the query result; RDBMSs do not allow it either.
      - Aggregate expressions in `Aggregate`: if the group-by expressions in `Aggregate` contain the missing attributes but the aggregate expressions do not, just add them to the aggregate expressions. This resolves the AnalysisExceptions thrown by the three TPC-DS queries.
      - `Project` and `Window` are special. We just need to add the missing attributes to their `projectList`.
    
    **Implementation:**
      1. Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes.
      2. Traverse the whole tree in a post-order manner to add the found missing order-by attributes to the node if their `inputSet` contains the attributes.
      3. If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that are from the same node.
    
    **Risk:**
    Low. This rule is triggered only if ```!s.resolved && child.resolved``` is true, so very few cases are affected.
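
    A hypothetical query of the shape this rule targets (assumes `sqlContext` with implicits in scope): the sort refers to a column that the child `Project` no longer outputs, so the analyzer must push the missing attribute down and project it away afterwards.
    
    ```scala
    import sqlContext.implicits._
    
    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    
    // "age" is dropped by the select, but the orderBy still references it.
    val sorted = people.select($"name").orderBy($"age".desc)
    ```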
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#10678 from gatorsmile/sortWindows.
    gatorsmile authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    8f26eb5 View commit details
    Browse the repository at this point in the history
  5. [DOCS] Fix the jar location of datanucleus in sql-programming-guid.md

    It seems to me that `lib` is better, because the `datanucleus` jars are located in `lib` for release builds.
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10901 from maropu/DocFix.
    maropu authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    da9146c View commit details
    Browse the repository at this point in the history
  6. [ML][MINOR] Invalid MulticlassClassification reference in ml-guide

    In [ml-guide](https://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation), there is an invalid reference to the `MulticlassClassificationEvaluator` API doc.
    
    https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.MultiClassClassificationEvaluator
    
    Author: Lewuathe <lewuathe@me.com>
    
    Closes apache#10996 from Lewuathe/fix-typo-in-ml-guide.
    Lewuathe authored and mengxr committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    711ce04 View commit details
    Browse the repository at this point in the history
  7. [SPARK-12463][SPARK-12464][SPARK-12465][SPARK-10647][MESOS] Fix zooke…

    …eper dir with mesos conf and add docs.
    
    Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings.
    
    Author: Timothy Chen <tnachen@gmail.com>
    
    Closes apache#10057 from tnachen/fix_mesos_dir.
    tnachen authored and Andrew Or committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    51b03b7 View commit details
    Browse the repository at this point in the history
  8. [SPARK-12265][MESOS] Spark calls System.exit inside driver instead of…

    … throwing exception
    
    This takes over apache#10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user.
    
    Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com>
    Author: Iulian Dragos <jaguarul@gmail.com>
    
    Closes apache#10921 from dragos/pr/10729.
    nraychaudhuri authored and Andrew Or committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    a41b68b View commit details
    Browse the repository at this point in the history
  9. [SPARK-12979][MESOS] Don’t resolve paths on the local file system in …

    …Mesos scheduler
    
    The driver filesystem is likely different from where the executors will run, so resolving paths (and symlinks, etc.) will lead to invalid paths on executors.
    
    Author: Iulian Dragos <jaguarul@gmail.com>
    
    Closes apache#10923 from dragos/issue/canonical-paths.
    dragos authored and Andrew Or committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    c9b89a0 View commit details
    Browse the repository at this point in the history
  10. [SPARK-13043][SQL] Implement remaining catalyst types in ColumnarBatch.

    This includes: float, boolean, short, decimal and calendar interval.
    
    Decimal is mapped to long or byte array depending on the size and calendar
    interval is mapped to a struct of int and long.
    
    The only remaining type is map. The schema mapping is straightforward but
    we might want to revisit how we deal with this in the rest of the execution
    engine.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#10961 from nongli/spark-13043.
    nongli authored and rxin committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    064b029 View commit details
    Browse the repository at this point in the history
  11. Fix for [SPARK-12854][SQL] Implement complex types support in Columna…

    …rBatch
    
    Fixes build for Scala 2.11.
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes apache#10946 from jaceklaskowski/SPARK-12854-fix.
    jaceklaskowski authored and rxin committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    a2973fe View commit details
    Browse the repository at this point in the history
  12. [SPARK-13078][SQL] API and test cases for internal catalog

    This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper).
    
    I took a look at Hive's internal metastore interface/implementation, and then created this API based on it. I believe this is the minimal set needed in order to achieve all the needed functionality.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#10982 from rxin/SPARK-13078.
    rxin committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    be7a2fc View commit details
    Browse the repository at this point in the history

Commits on Feb 2, 2016

  1. [SPARK-12637][CORE] Print stage info of finished stages properly

    Improve printing of StageInfo in onStageCompleted
    
    See also apache#10585
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#10922 from srowen/SPARK-12637.
    srowen authored and Andrew Or committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    715a19d View commit details
    Browse the repository at this point in the history
  2. [SPARK-12790][CORE] Remove HistoryServer old multiple files format

    Removed isLegacyLogDirectory code path and updated tests
    andrewor14
    
    Author: felixcheung <felixcheung_m@hotmail.com>
    
    Closes apache#10860 from felixcheung/historyserverformat.
    felixcheung authored and Andrew Or committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    0df3cfb View commit details
    Browse the repository at this point in the history
  3. [SPARK-13130][SQL] Make codegen variable names easier to read

    1. Use lower case
    2. Change long prefixes to something shorter (in this case I am changing only one: TungstenAggregate -> agg).
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11017 from rxin/SPARK-13130.
    rxin committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    0fff5c6 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    b8666fd View commit details
    Browse the repository at this point in the history
  5. [SPARK-13087][SQL] Fix group by function for sort based aggregation

    It is not valid to call `toAttribute` on a `NamedExpression` unless we know for sure that the child produced that `NamedExpression`.  The current code worked fine when the grouping expressions were simple, but when they were a derived value this blew up at execution time.
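
    A hypothetical query shape that exercises this path (assumes `sqlContext` with implicits in scope): the grouping expression is a function of a column, so its `NamedExpression` is produced by the `Aggregate` itself rather than by the child plan.
    
    ```scala
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._
    
    val df = Seq(("alice", 1), ("bob", 2), ("charlie", 3)).toDF("name", "id")
    
    // Grouping on a derived value rather than a plain column.
    val counts = df.groupBy(length($"name")).count()
    ```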
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#11013 from marmbrus/groupByFunction-master.
    marmbrus authored and yhuai committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    22ba213 View commit details
    Browse the repository at this point in the history
  6. [SPARK-10820][SQL] Support for the continuous execution of structured…

    … queries
    
    This is a follow up to 9aadcff that extends Spark SQL to allow users to _repeatedly_ optimize and execute structured queries.  A `ContinuousQuery` can be expressed using SQL, DataFrames or Datasets.  The purpose of this PR is only to add some initial infrastructure which will be extended in subsequent PRs.
    
    ## User-facing API
    
    - `sqlContext.streamFrom` and `df.streamTo` return builder objects that are analogous to the `read/write` interfaces already available to executing queries in a batch-oriented fashion.
    - `ContinuousQuery` provides an interface for interacting with a query that is currently executing in the background.
    
    ## Internal Interfaces
     - `StreamExecution` - executes streaming queries in micro-batches
    
    The following are currently internal, but public APIs will be provided in a future release.
     - `Source` - an interface for providers of continually arriving data.  A source must have a notion of an `Offset` that monotonically tracks what data has arrived.  For fault tolerance, a source must be able to replay data given a start offset.
     - `Sink` - an interface that accepts the results of a continuously executing query.  Also responsible for tracking the offset that should be resumed from in the case of a failure.
    
    ## Testing
     - `MemoryStream` and `MemorySink` - simple implementations of source and sink that keep all data in memory and have methods for simulating durability failures
     - `StreamTest` - a framework for performing actions and checking invariants on a continuous query
    
    Author: Michael Armbrust <michael@databricks.com>
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Author: Josh Rosen <rosenville@gmail.com>
    
    Closes apache#11006 from marmbrus/structured-streaming.
    marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    12a20c1 View commit details
    Browse the repository at this point in the history
  7. [SPARK-13094][SQL] Add encoders for seq/array of primitives

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#11014 from marmbrus/seqEncoders.
    marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    29d9218 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13114][SQL] Add a test for tokens more than the fields in schema

    https://issues.apache.org/jira/browse/SPARK-13114
    
    This PR adds a test for tokens more than the fields in schema.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#11020 from HyukjinKwon/SPARK-13114.
    HyukjinKwon authored and rxin committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    b938301 View commit details
    Browse the repository at this point in the history
  9. [SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to cons…

    …istent format
    
    Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the clustering module.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes apache#10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
    BryanCutler authored and mengxr committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    cba1d6b View commit details
    Browse the repository at this point in the history
  10. [SPARK-13056][SQL] map column would throw NPE if value is null

    Jira:
    https://issues.apache.org/jira/browse/SPARK-13056
    
    Create a map like `{ "a": "somestring", "b": null }` and run a query like `SELECT col["b"] FROM t1;`. An NPE would be thrown.
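
    A hypothetical reproduction of the problem (assumes `sqlContext` with implicits in scope; the table name is made up):
    
    ```scala
    import sqlContext.implicits._
    
    val df = Seq(Tuple1(Map("a" -> "somestring", "b" -> null))).toDF("col")
    df.registerTempTable("t1")
    
    // Before the fix, reading the null-valued key threw a NullPointerException.
    sqlContext.sql("""SELECT col["b"] FROM t1""").show()
    ```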
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes apache#10964 from adrian-wang/npewriter.
    adrian-wang authored and marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    358300c View commit details
    Browse the repository at this point in the history
  11. [SPARK-12711][ML] ML StopWordsRemover does not protect itself from co…

    …lumn name duplication
    
    Fixes the problem and verifies the fix with a test suite.
    Also adds an optional `nullable` (Boolean) parameter to `SchemaUtils.appendColumn`
    and deduplicates the `SchemaUtils.appendColumn` functions.
    
    Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>
    
    Closes apache#10741 from grzegorz-chilkiewicz/master.
    grzegorz-chilkiewicz authored and jkbradley committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    b1835d7 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13138][SQL] Add "logical" package prefix for ddl.scala

    ddl.scala is defined in the execution package, and yet its references to "UnaryNode" and "Command" are logical ones. This was fairly confusing when I was trying to understand the DDL code.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11021 from rxin/SPARK-13138.
    rxin committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    7f6e3ec View commit details
    Browse the repository at this point in the history
  13. [SPARK-12913] [SQL] Improve performance of stat functions

    As benchmarked and discussed here: https://github.com/apache/spark/pull/10786/files#r50038294, thanks to codegen, a declarative aggregate function can be much faster than an imperative one.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10960 from davies/stddev.
    Davies Liu authored and davies committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    be5dd88 View commit details
    Browse the repository at this point in the history
  14. [SPARK-13121][STREAMING] java mapWithState mishandles scala Option

    Already merged into the 1.6 branch; this PR commits the same change to master.
    
    Author: Gabriele Nizzoli <mail@nizzoli.net>
    
    Closes apache#11028 from gabrielenizzoli/patch-1.
    gabrielenizzoli authored and zsxwing committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    d0df2ca View commit details
    Browse the repository at this point in the history
  15. [DOCS] Update StructType.scala

    The example will throw an error like:
    <console>:20: error: not found: value StructType
    
    We need to add this line:
    import org.apache.spark.sql.types._
    
    Author: Kevin (Sangwoo) Kim <sangwookim.me@gmail.com>
    
    Closes apache#10141 from swkimme/patch-1.
    swkimme authored and marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    b377b03 View commit details
    Browse the repository at this point in the history

Commits on Feb 3, 2016

  1. [SPARK-13150] [SQL] disable two flaky tests

    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11037 from davies/disable_flaky.
    Davies Liu authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    6de6a97 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13020][SQL][TEST] fix random generator for map type

    When we generate a map, we first randomly pick a length, then create a seq of key-value pairs with the expected length, and finally call `toMap`. However, `toMap` removes all duplicated keys, which makes the actual map size much smaller than expected.
    
    This PR fixes the problem by putting keys in a set first, to guarantee we have enough distinct keys to build a map of the expected length.
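
    A small illustration of the issue and the fix (not the actual RandomDataGenerator code):
    
    ```scala
    import scala.util.Random
    
    // Naive generation undershoots the requested size: toMap drops duplicate keys.
    val pairs = Seq(1 -> "a", 1 -> "b", 2 -> "c")
    assert(pairs.toMap.size == 2)   // asked for 3 entries, got 2
    
    // Collecting distinct keys in a set first guarantees the expected map size.
    def randomMap(n: Int): Map[Int, String] = {
      val keys = scala.collection.mutable.Set.empty[Int]
      while (keys.size < n) keys += Random.nextInt()
      keys.toSeq.map(k => k -> Random.nextString(5)).toMap
    }
    ```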
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10930 from cloud-fan/random-generator.
    cloud-fan authored and yhuai committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    672032d View commit details
    Browse the repository at this point in the history
  3. [SPARK-12992] [SQL] Update parquet reader to support more types when …

    …decoding to ColumnarBatch.
    
    This patch implements support for more types when doing the vectorized decode. There are
    a few more types remaining but they should be very straightforward after this. This code
    has a few copy and paste pieces but they are difficult to eliminate due to performance
    considerations.
    
    Specifically, this patch adds support for:
      - String, Long, Byte types
      - Dictionary encoding for those types.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#10908 from nongli/spark-12992.
    nongli authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    21112e8 View commit details
    Browse the repository at this point in the history
  4. [SPARK-13122] Fix race condition in MemoryStore.unrollSafely()

    https://issues.apache.org/jira/browse/SPARK-13122
    
    A race condition can occur in MemoryStore's unrollSafely() method if two threads that
    return the same value for currentTaskAttemptId() execute this method concurrently. This
    change makes the operation of reading the initial amount of unroll memory used, performing
    the unroll, and updating the associated memory maps atomic in order to avoid this race
    condition.
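
    A toy sketch of the shape of the fix (not the actual MemoryStore code): reading the current value and updating the shared map happen inside one synchronized block, so two tasks reporting the same attempt id cannot interleave.
    
    ```scala
    import scala.collection.mutable
    
    val lock = new Object
    val unrollMemoryByTask = mutable.Map.empty[Long, Long]
    
    def reserveUnrollMemory(taskAttemptId: Long, bytes: Long): Unit = lock.synchronized {
      val current = unrollMemoryByTask.getOrElse(taskAttemptId, 0L)
      unrollMemoryByTask(taskAttemptId) = current + bytes
    }
    ```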
    
    The initially proposed fix wraps all of unrollSafely() in a memoryManager.synchronized { } block. A cleaner approach might be to introduce a mechanism that synchronizes based on task attempt ID. An alternative option might be to track unroll/pending unroll memory based on block ID rather than task attempt ID.
    
    Author: Adam Budde <budde@amazon.com>
    
    Closes apache#11012 from budde/master.
    Adam Budde authored and Andrew Or committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    ff71261 View commit details
    Browse the repository at this point in the history
  5. [SPARK-12951] [SQL] support spilling in generated aggregate

    This PR add spilling support for generated TungstenAggregate.
    
    If spilling happens, falling back to the iterator-based sort-merge aggregate (not generated) is not that bad.
    
    The changes will be covered by TungstenAggregationQueryWithControlledFallbackSuite
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10998 from davies/gen_spilling.
    Davies Liu authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    99a6e3c View commit details
    Browse the repository at this point in the history
  6. [SPARK-12732][ML] bug fix in linear regression train

    Fixed the bug in linear regression train for the case when the target variable is constant. The two cases for `fitIntercept=true` or `fitIntercept=false` should be treated differently.
    
    Author: Imran Younus <iyounus@us.ibm.com>
    
    Closes apache#10702 from iyounus/SPARK-12732_bug_fix_in_linear_regression_train.
    iyounus authored and DB Tsai committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    0557146 View commit details
    Browse the repository at this point in the history
  7. [SPARK-7997][CORE] Add rpcEnv.awaitTermination() back to SparkEnv

    `rpcEnv.awaitTermination()` was not added in apache#10854 because some Streaming Python tests hung forever.
    
    This patch fixed the hung issue and added rpcEnv.awaitTermination() back to SparkEnv.
    
    Previously, the Streaming Kafka Python tests shut down the ZooKeeper server before stopping the StreamingContext. When stopping the StreamingContext, KafkaReceiver may then hang due to https://issues.apache.org/jira/browse/KAFKA-601; hence, some threads of RpcEnv's Dispatcher cannot exit and rpcEnv.awaitTermination hangs. The patch just changes the shutdown order to fix this.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11031 from zsxwing/awaitTermination.
    zsxwing authored and rxin committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    335f10e View commit details
    Browse the repository at this point in the history
  8. [SPARK-13147] [SQL] improve readability of generated code

    1. try to avoid the suffix (unique id)
    2. remove the comment if there is no code generated.
    3. re-arrange the order of functions
    4. drop the new line for inlined blocks.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11032 from davies/better_suffix.
    Davies Liu authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    e86f8f6 View commit details
    Browse the repository at this point in the history
  9. [SPARK-12957][SQL] Initial support for constraint propagation in Spar…

    …kSQL
    
    Based on the semantics of a query, we can derive a number of data constraints on the output of each (logical or physical) operator. For instance, if a filter defines `'a > 10`, we know that the output data of this filter satisfies 2 constraints:
    
    1. `'a > 10`
    2. `isNotNull('a)`
    
    This PR proposes a possible way of keeping track of these constraints and propagating them in the logical plan, which can then help us build more advanced optimizations (such as pruning redundant filters, optimizing joins, among others). We define constraints as a set of (implicitly conjunctive) expressions. For example, if a filter operator has constraints = `Set('a > 10, 'b < 100)`, it is implied that the outputs satisfy both individual constraints (i.e., `'a > 10` AND `'b < 100`).
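
    A toy sketch of the concept outside Catalyst (the expression classes below are made up for illustration):
    
    ```scala
    sealed trait Expr
    case class Attr(name: String) extends Expr
    case class GreaterThan(attr: Attr, value: Int) extends Expr
    case class IsNotNull(attr: Attr) extends Expr
    
    // A filter's condition implies a set of (conjunctive) constraints on its output,
    // e.g. 'a > 10 implies both 'a > 10 and isNotNull('a).
    def constraintsOf(condition: Expr): Set[Expr] = condition match {
      case gt @ GreaterThan(a, _) => Set(gt, IsNotNull(a))
      case other                  => Set(other)
    }
    ```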
    
    Design Document: https://docs.google.com/a/databricks.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit?usp=sharing
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes apache#10844 from sameeragarwal/constraints.
    sameeragarwal authored and marmbrus committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    138c300 View commit details
    Browse the repository at this point in the history
  10. [SPARK-12739][STREAMING] Details of batch in Streaming tab uses two D…

    …uration columns
    
    I have clearly prefixed the two 'Duration' columns in the 'Details of Batch' Streaming tab as 'Output Op Duration' and 'Job Duration'.
    
    Author: Mario Briggs <mario.briggs@in.ibm.com>
    Author: mariobriggs <mariobriggs@in.ibm.com>
    
    Closes apache#11022 from mariobriggs/spark-12739.
    mariobriggs authored and zsxwing committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    e9eb248 View commit details
    Browse the repository at this point in the history
  11. [SPARK-12798] [SQL] generated BroadcastHashJoin

    A row from the stream side could match multiple rows on the build side, so the loop over these matched rows should not be interrupted when emitting a row. We therefore buffer the output rows in a linked list and check the termination condition in the producer loop (for example, Range or Aggregate).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10989 from davies/gen_join.
    Davies Liu authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    c4feec2 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13157] [SQL] Support any kind of input for SQL commands.

    The ```SparkSqlLexer``` currently swallows characters which have not been defined in the grammar. This causes problems with SQL commands, such as: ```add jar file:///tmp/ab/TestUDTF.jar```. In this example the `````` is swallowed.
    
    This PR adds an extra Lexer rule to handle such input, and makes a tiny modification to the ```ASTNode```.
    
    cc davies liancheng
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#11052 from hvanhovell/SPARK-13157.
    hvanhovell authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    9dd2741 View commit details
    Browse the repository at this point in the history
  13. [SPARK-3611][WEB UI] Show number of cores for each executor in applic…

    …ation web UI
    
    Added a Cores column in the Executors UI
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#11039 from ajbozarth/spark3611.
    ajbozarth authored and zsxwing committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    3221edd View commit details
    Browse the repository at this point in the history

Commits on Feb 4, 2016

  1. [SPARK-13166][SQL] Remove DataStreamReader/Writer

    They seem redundant and we can simply use DataFrameReader/Writer. The new usage looks like:
    
    ```scala
    val df = sqlContext.read.stream("...")
    val handle = df.write.stream("...")
    handle.stop()
    ```
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11062 from rxin/SPARK-13166.
    rxin authored and marmbrus committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    915a753 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13131] [SQL] Use best and average time in benchmark

    Best time is more stable than average time; this also adds a column for nanoseconds per row (which can be used to estimate the contribution of each component in a query).
    
    Having best time and average time together gives more information (we can see some of the variance).
    
    Rate, time per row, and relative are all calculated using the best time.
    
    The result looks like this:
    ```
    Intel(R) Core(TM) i7-4558U CPU  2.80GHz
    rang/filter/sum:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    rang/filter/sum codegen=false          14332 / 16646         36.0          27.8       1.0X
    rang/filter/sum codegen=true              845 /  940        620.0           1.6      17.0X
    ```
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11018 from davies/gen_bench.
    Davies Liu authored and davies committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    de09145 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13152][CORE] Fix task metrics deprecation warning

    Make an internal non-deprecated version of incBytesRead and incRecordsRead so we don't have unnecessary deprecation warnings in our build.
    
    Right now incBytesRead and incRecordsRead are marked as deprecated and for internal use only. We should make private[spark] versions which are not deprecated and switch to those internally so as to not clutter up the warning messages when building.
    
    cc andrewor14 who did the initial deprecation
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#11056 from holdenk/SPARK-13152-fix-task-metrics-deprecation-warnings.
    holdenk authored and Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    a8e2ba7 View commit details
    Browse the repository at this point in the history
  4. [SPARK-13079][SQL] Extend and implement InMemoryCatalog

    This is a step towards consolidating `SQLContext` and `HiveContext`.
    
    This patch extends the existing Catalog API added in apache#10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`. The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested.
    
    About 200 lines are test code.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#11069 from andrewor14/catalog.
    Andrew Or authored and rxin committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    a648311 View commit details
    Browse the repository at this point in the history
  5. [SPARK-12828][SQL] add natural join support

    Jira:
    https://issues.apache.org/jira/browse/SPARK-12828
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes apache#10762 from adrian-wang/naturaljoin.
    adrian-wang authored and rxin committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    0f81318 View commit details
    Browse the repository at this point in the history
  6. [ML][DOC] fix wrong api link in ml onevsrest

    minor fix for api link in ml onevsrest
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes apache#11068 from hhbyyh/onevsrestDoc.
    hhbyyh authored and mengxr committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    c2c956b View commit details
    Browse the repository at this point in the history
  7. [SPARK-13113] [CORE] Remove unnecessary bit operation when decoding p…

    …age number
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13113
    
    As we shift the bits right, it looks like the bitwise AND operation is unnecessary.
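
    An illustration of the observation, using a 13-bit page number over a 51-bit offset as an assumed layout: after the unsigned right shift, the low-order offset bits are already gone, so the preceding AND with the page mask is redundant.
    
    ```scala
    val OFFSET_BITS = 51
    val PAGE_MASK   = ((1L << 13) - 1) << OFFSET_BITS
    
    def decodePageNumberOld(address: Long): Int =
      ((address & PAGE_MASK) >>> OFFSET_BITS).toInt
    
    def decodePageNumberNew(address: Long): Int =
      (address >>> OFFSET_BITS).toInt   // same result, without the bitwise AND
    ```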
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#11002 from viirya/improve-decodepagenumber.
    viirya authored and davies committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    d390871 View commit details
    Browse the repository at this point in the history
  8. [SPARK-12828][SQL] Natural join follow-up

    This is a small addendum to apache#10762 to make the code more robust again future changes.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11070 from rxin/SPARK-12828-natural-join.
    rxin committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    dee801a View commit details
    Browse the repository at this point in the history
  9. [SPARK-12330][MESOS] Fix mesos coarse mode cleanup

    In the current implementation the mesos coarse scheduler does not wait for the mesos tasks to complete before ending the driver. This causes a race where the task has to finish cleaning up before the mesos driver terminates it with a SIGINT (and SIGKILL after 3 seconds if the SIGINT doesn't work).
    
    This PR causes the mesos coarse scheduler to wait for the mesos tasks to finish (with a timeout defined by `spark.mesos.coarse.shutdown.ms`)
    
    This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0)
    
    With this patch the mesos coarse scheduler terminates properly, the executors clean up, and the tasks are reported as `FINISHED` in the Mesos console (as opposed to `KILLED` in < 1.6 or `FAILED` in 1.6 and later)
    
    Author: Charles Allen <charles@allen-net.com>
    
    Closes apache#10319 from drcrallen/SPARK-12330.
    drcrallen authored and Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    2eaeafe View commit details
    Browse the repository at this point in the history
  10. [SPARK-13164][CORE] Replace deprecated synchronized buffer in core

    Building with Scala 2.11 produces the warning: "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative." Investigation shows we are already using ConcurrentLinkedQueue in other locations, so this switches our uses of SynchronizedBuffer to ConcurrentLinkedQueue.
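
    A minimal sketch of the substitution (illustrative only):
    
    ```scala
    import java.util.concurrent.ConcurrentLinkedQueue
    
    // Deprecated in Scala 2.11:
    //   val events = new scala.collection.mutable.ArrayBuffer[String]
    //       with scala.collection.mutable.SynchronizedBuffer[String]
    
    // Preferred replacement:
    val events = new ConcurrentLinkedQueue[String]()
    events.add("stage submitted")
    events.add("stage completed")
    ```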
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#11059 from holdenk/SPARK-13164-replace-deprecated-synchronized-buffer-in-core.
    holdenk authored and Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    62a7c28 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13162] Standalone mode does not respect initial executors

    Currently the Master would always set an application's initial executor limit to infinity. If the user specified `spark.dynamicAllocation.initialExecutors`, the config would not take effect. This is similar to apache#11047 but for standalone mode.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#11054 from andrewor14/standalone-da-initial.
    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    4120bcb View commit details
    Browse the repository at this point in the history
  12. [SPARK-13053][TEST] Unignore tests in InternalAccumulatorSuite

    These were ignored because they are incorrectly written; they don't actually trigger stage retries, which is what the tests are testing. These tests are now rewritten to induce stage retries through fetch failures.
    
    Note: there were 2 tests before and now there's only 1. What happened? It turns out that the case where we only resubmit a subset of the original missing partitions is very difficult to simulate in tests without potentially introducing flakiness. This is because the `DAGScheduler` removes all map outputs associated with a given executor when this happens, we would need multiple executors to trigger this case, and sometimes the scheduler still removes map outputs from all executors.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10969 from andrewor14/unignore-accum-test.
    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    15205da View commit details
    Browse the repository at this point in the history
  13. MAINTENANCE: Automated closing of pull requests.

    This commit exists to close the following pull requests on Github:
    
    Closes apache#7971 (requested by yhuai)
    Closes apache#8539 (requested by srowen)
    Closes apache#8746 (requested by yhuai)
    Closes apache#9288 (requested by andrewor14)
    Closes apache#9321 (requested by andrewor14)
    Closes apache#9935 (requested by JoshRosen)
    Closes apache#10442 (requested by andrewor14)
    Closes apache#10585 (requested by srowen)
    Closes apache#10785 (requested by srowen)
    Closes apache#10832 (requested by andrewor14)
    Closes apache#10941 (requested by marmbrus)
    Closes apache#11024 (requested by andrewor14)
    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    085f510 View commit details
    Browse the repository at this point in the history
  14. [SPARK-13168][SQL] Collapse adjacent repartition operators

    Spark SQL should collapse adjacent `Repartition` operators and only keep the last one.
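
    A small sketch of the pattern this rule collapses (assumes `sqlContext` with implicits in scope):
    
    ```scala
    import sqlContext.implicits._
    
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
    
    // Adjacent repartitions: only the last one determines the output partitioning,
    // so the optimizer can keep just the repartition(8).
    val collapsed = df.repartition(4).repartition(8)
    ```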
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#11064 from JoshRosen/collapse-repartition.
    JoshRosen authored and Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    33212cb View commit details
    Browse the repository at this point in the history
  15. [SPARK-12330][MESOS][HOTFIX] Rename timeout config

    The config already describes time and accepts a general format
    that is not restricted to ms. This commit renames the internal
    config to use a format that's consistent in Spark.
    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    c756bda View commit details
    Browse the repository at this point in the history
  16. [SPARK-13079][SQL] InMemoryCatalog follow-ups

    This patch incorporates review feedback from apache#11069, which is already merged.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#11080 from andrewor14/catalog-follow-ups.
    Andrew Or authored and rxin committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    bd38dd6 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13195][STREAMING] Fix NoSuchElementException when a state is n…

    …ot set but timeoutThreshold is defined
    
    Check the state's existence before calling get.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11081 from zsxwing/SPARK-13195.
    zsxwing committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    8e2f296 View commit details
    Browse the repository at this point in the history
  18. [HOTFIX] Fix style violation caused by c756bda

    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    7a4b37f View commit details
    Browse the repository at this point in the history

Commits on Feb 5, 2016

  1. [SPARK-13052] waitingApps metric doesn't show the number of apps curr…

    …ently in the WAITING state
    
    Author: Raafat Akkad <raafat.akkad@gmail.com>
    
    Closes apache#10959 from RaafatAkkad/master.
    RaafatAkkad authored and Andrew Or committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    6dbfc40 View commit details
    Browse the repository at this point in the history
  2. [SPARK-12850][SQL] Support Bucket Pruning (Predicate Pushdown for Buc…

    …keted Tables)
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-12850
    
    This PR is to support bucket pruning when the predicates are `EqualTo`, `EqualNullSafe`, `IsNull`, `In`, and `InSet`.
    
    As in Hive, the bucket pruning in this PR works only when the bucketing key has exactly one column.
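
    A hypothetical setup where this pruning applies, assuming the bucketed-write API on `DataFrameWriter` and a `sqlContext` with implicits in scope (the table and column names are made up):
    
    ```scala
    import sqlContext.implicits._
    
    val df = Seq((42, "click"), (7, "view")).toDF("user_id", "event")
    
    // Write a table bucketed by a single column.
    df.write.bucketBy(8, "user_id").sortBy("user_id").saveAsTable("events_bucketed")
    
    // With an equality predicate on the bucketing key, only the bucket that can
    // contain user_id = 42 needs to be scanned.
    sqlContext.sql("SELECT * FROM events_bucketed WHERE user_id = 42")
    ```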
    
    So far, I have not found a way to verify how many buckets are actually scanned. However, I did verify it while debugging. Could you suggest how to do this properly? Thank you! cloud-fan yhuai rxin marmbrus
    
    BTW, we can add more cases to support complex predicate including `Or` and `And`. Please let me know if I should do it in this PR.
    
    Maybe we also need to add test cases to verify if bucket pruning works well for each data type.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#10942 from gatorsmile/pruningBuckets.
    gatorsmile authored and rxin committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    e3c75c6 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13208][CORE] Replace use of Pairs with Tuple2s

    Another trivial deprecation fix for Scala 2.11
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes apache#11089 from jodersky/SPARK-13208.
    jodersky authored and rxin committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    352102e View commit details
    Browse the repository at this point in the history
  4. [SPARK-13187][SQL] Add boolean/long/double options in DataFrameReader…

    …/Writer
    
    This patch adds `option` functions for boolean, long, and double types. This makes it slightly easier for Spark users to specify options without turning them into strings. Using the JSON data source as an example:
    
    Before this patch:
    ```scala
    sqlContext.read.option("primitivesAsString", "true").json("/path/to/json")
    ```
    
    After this patch:
    ```scala
    sqlContext.read.option("primitivesAsString", true).json("/path/to/json")
    ```
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11072 from rxin/SPARK-13187.
    rxin committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    82d84ff View commit details
    Browse the repository at this point in the history
  5. [SPARK-13166][SQL] Rename DataStreamReaderWriterSuite to DataFrameRea…

    …derWriterSuite
    
    A follow up PR for apache#11062 because it didn't rename the test suite.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11096 from zsxwing/rename.
    zsxwing committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    7b73f17 View commit details
    Browse the repository at this point in the history
  6. [SPARK-12939][SQL] migrate encoder resolution logic to Analyzer

    https://issues.apache.org/jira/browse/SPARK-12939
    
    Now we will catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it. Also update `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve the value deserializer (the `child.output` contains both the grouping key and the values, which may mess things up if they have same-name attributes). End-to-end tests are added.
    
    follow-ups:
    
    * remove encoders from typed aggregate expression.
    * completely remove resolve/bind in `ExpressionEncoder`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10852 from cloud-fan/bug.
    cloud-fan authored and marmbrus committed Feb 5, 2016
    Commit 1ed354a
  7. [SPARK-13214][DOCS] update dynamicAllocation documentation

    Author: Bill Chambers <bill@databricks.com>
    
    Closes apache#11094 from anabranch/dynamic-docs.
    Bill Chambers authored and Andrew Or committed Feb 5, 2016
    Commit 66e1383
  8. [SPARK-13002][MESOS] Send initial request of executors for dyn alloca…

    …tion
    
    Fix for [SPARK-13002](https://issues.apache.org/jira/browse/SPARK-13002) about the initial number of executors when running with dynamic allocation on Mesos.
    Instead of fixing it just for the Mesos case, the change is made in `ExecutorAllocationManager`. It already drives the number of executors running on Mesos, just not the initial value.
    
    The `None` and `Some(0)` are internal details of the computation of resources to reserve in the Mesos backend scheduler. `executorLimitOption` has to be initialized correctly, otherwise the Mesos backend scheduler will either create too many executors at launch, or not create any executors and be unable to recover from this state.
    
    Removed the 'special case' description in the doc. It was not totally accurate, and is not needed anymore.
    
    This doesn't fix the same problem visible with Spark standalone. There is no straightforward way to send the initial value in standalone mode.
    
    Somebody knowing this part of the yarn support should review this change.
    
    Author: Luc Bourlier <luc.bourlier@typesafe.com>
    
    Closes apache#11047 from skyluc/issue/initial-dyn-alloc-2.
    Luc Bourlier authored and Andrew Or committed Feb 5, 2016
    Commit 0bb5b73
  9. [SPARK-13215] [SQL] remove fallback in codegen

    Since we removed the configuration for codegen, we now rely heavily on codegen (TungstenAggregate also requires the generated MutableProjection to update UnsafeRow), so we should remove the fallback, which could confuse users; see the discussion in SPARK-13116.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11097 from davies/remove_fallback.
    Davies Liu authored and davies committed Feb 5, 2016
    Commit 875f507

Commits on Feb 6, 2016

  1. [SPARK-13171][CORE] Replace future calls with Future

    Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11.
    Also works with 2.10
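    For reference, a minimal sketch of the substitution, assuming the deprecated `scala.concurrent.future { ... }` calls are swapped for the equivalent `Future { ... }`:
    
    ```scala
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    
    // Before (deprecated in Scala 2.11): future { 1 + 1 }
    // After:
    val result = Future { 1 + 1 }
    ```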
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes apache#11085 from jodersky/SPARK-13171.
    jodersky authored and rxin committed Feb 6, 2016
    Commit 6883a51
  2. Commit 4f28291
  3. [SPARK-5865][API DOC] Add doc warnings for methods that return local …

    …data structures
    
    rxin srowen
    I worked out the note message for the rdd.take function; please help to review.
    
    If it's fine, I can apply it to all the other functions later.
    
    Author: Tommy YU <tummyyu@163.com>
    
    Closes apache#10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
    Wenpei authored and srowen committed Feb 6, 2016
    Commit 81da3be

Commits on Feb 7, 2016

  1. [SPARK-13132][MLLIB] cache standardization param value in LogisticReg…

    …ression
    
    Cache the value of the standardization Param in LogisticRegression, rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-Newton optimizer.
    
    Also, fix Param#toString to cache the stringified representation, rather than re-interpolating it on every call, so any other implementations that have similar repeated access patterns will see a benefit.
    
    This change improves training times for one of my test sets from ~7m30s to ~4m30s.
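    A simplified, hypothetical sketch of the pattern (not the actual LogisticRegression internals): hoist the repeated lookup out of the per-index, per-iteration loop and read it once.
    
    ```scala
    // Illustrative only: `fetchParam` stands in for reading a value from
    // the ParamMap; the point is that it is read once, outside the hot loop.
    def fetchParam(name: String): Boolean = true    // hypothetical lookup
    val numFeatures = 100                           // hypothetical size
    
    val standardize = fetchParam("standardization") // cached once
    (0 until numFeatures).foreach { i =>
      if (standardize) { /* per-feature standardization work */ }
    }
    ```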
    
    Author: Gary King <gary@idibon.com>
    
    Closes apache#11027 from idigary/spark-13132-optimize-logistic-regression.
    idigary authored and srowen committed Feb 7, 2016
    Commit bc8890b
  2. [SPARK-10963][STREAMING][KAFKA] make KafkaCluster public

    Author: cody koeninger <cody@koeninger.org>
    
    Closes apache#9007 from koeninger/SPARK-10963.
    koeninger authored and srowen committed Feb 7, 2016
    Commit 140ddef

Commits on Feb 8, 2016

  1. [SPARK-12986][DOC] Fix pydoc warnings in mllib/regression.py

    I have fixed the warnings by running "make html" under "python/docs/". They are caused by not having blank lines around indented paragraphs.
    
    Author: Nam Pham <phamducnam@gmail.com>
    
    Closes apache#11025 from nampham2/SPARK-12986.
    nampham2 authored and mengxr committed Feb 8, 2016
    Commit edf4a0e
  2. [SPARK-8964] [SQL] Use Exchange to perform shuffle in Limit

    This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen.
    
    At a high-level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit`'s declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`.
    
    In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast-path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic.
    
    In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similar to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it.
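    A hedged illustration of the two paths described above (assuming a `sqlContext`, as elsewhere in these messages): `explain()` should show the take()-style collect path for a terminal limit, and the local/global split when the limit feeds other operators.
    
    ```scala
    val df = sqlContext.range(0, 1000)
    
    df.limit(100).explain()                         // terminal limit: CollectLimit-style path
    df.limit(100).groupBy("id").count().explain()   // mid-plan limit: LocalLimit -> shuffle -> GlobalLimit
    ```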
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#7334 from JoshRosen/remove-copy-in-limit.
    JoshRosen authored and davies committed Feb 8, 2016
    Commit 06f0df6
  3. [SPARK-13101][SQL] nullability of array type element should not fail …

    …analysis of encoder
    
    nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatched nullability, we should pass analysis and add a runtime null check.
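    A rough sketch of the situation with made-up names, assuming a running `sc`/`sqlContext`: the data's array element type is nullable while the encoder's `Array[Int]` element is not; after this change the mismatch no longer fails analysis and is checked at runtime instead.
    
    ```scala
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import sqlContext.implicits._
    
    case class Rec(values: Array[Int])              // non-nullable elements in the encoder
    
    val schema = StructType(Seq(
      StructField("values", ArrayType(IntegerType, containsNull = true))))
    val df = sqlContext.createDataFrame(
      sc.parallelize(Seq(Row(Seq(1, 2, 3)))), schema)
    
    val ds = df.as[Rec]                             // previously failed analysis
    ```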
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#11035 from cloud-fan/ignore-nullability.
    cloud-fan authored and marmbrus committed Feb 8, 2016
    Commit 8e4d15f
  4. [SPARK-13210][SQL] catch OOM when allocate memory and expand array

    There is a bug when we try to grow the buffer: the OOM is wrongly ignored (the assert is also skipped by the JVM), and when we then try to grow the array again, that triggers spilling which frees the current page, so the record we just inserted becomes invalid.
    
    The root cause is that the JVM has less free memory than the MemoryManager thought, so it can OOM when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling.
    
    Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11095 from davies/fix_expand.
    Davies Liu authored and JoshRosen committed Feb 8, 2016
    Commit 37bc203
  5. [SPARK-13095] [SQL] improve performance for broadcast join with dimen…

    …sion table
    
    This PR improves the performance of broadcast joins with dimension tables, which are common in data warehouses.
    
    If the join key can fit in a long, we will use a special API `get(Long)` to get the rows from HashedRelation.
    
    If the HashedRelation only has unique keys, we will use a special API `getValue(Long)` or `getValue(InternalRow)`.
    
    If the keys fit within a long and are also dense, we will use an array of UnsafeRow instead of a hash map.
    
    TODO: will do cleanup
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11065 from davies/gen_dim.
    Davies Liu authored and davies committed Feb 8, 2016
    Commit ff0af0d

Commits on Feb 9, 2016

  1. [SPARK-10620][SPARK-13054] Minor addendum to apache#10835

    Additional changes to apache#10835, mainly related to style and visibility. This patch also adds back a few deprecated methods for backward compatibility.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10958 from andrewor14/task-metrics-to-accums-followups.
    Andrew Or authored and JoshRosen committed Feb 9, 2016
    Commit eeaf45b
  2. [SPARK-12992] [SQL] Support vectorized decoding in UnsafeRowParquetRe…

    …cordReader.
    
    WIP: running tests. Code needs a bit of clean up.
    
    This patch completes the vectorized decoding with the goal of passing the existing tests. There are still more patches to come to support the rest of the format spec, even just for flat schemas.
    
    This patch adds a new flag to enable the vectorized decoding. Tests were updated
    to try with both modes where applicable.
    
    Once this is working well, we can remove the previous code path.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#11055 from nongli/spark-12992-2.
    nongli authored and davies committed Feb 9, 2016
    Commit 3708d13
  3. [SPARK-13176][CORE] Use native file linking instead of external proce…

    …ss ln
    
    Since Spark requires at least JRE 1.7, it is safe to use the built-in java.nio.file.Files API.
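    A minimal sketch of the JDK 7+ replacement for shelling out to `ln`; the paths are illustrative:
    
    ```scala
    import java.nio.file.{Files, Paths}
    
    // Symbolic-link equivalent of `ln -s target link`
    Files.createSymbolicLink(Paths.get("/tmp/spark-link"), Paths.get("/tmp/spark-target"))
    ```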
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes apache#11098 from jodersky/SPARK-13176.
    jodersky authored and srowen committed Feb 9, 2016
    Commit f9307d8
  4. [SPARK-13165][STREAMING] Replace deprecated synchronizedBuffer in str…

    …eaming
    
    Building with Scala 2.11 results in the warning "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative" - we already use ConcurrentLinkedQueue elsewhere, so let's replace it.
    
    Some notes about how behaviour is different for reviewers:
    The Seq from a SynchronizedBuffer that was implicitly converted would continue to receive updates - however when we do the same conversion explicitly on the ConcurrentLinkedQueue this isn't the case. Hence changing some of the (internal & test) APIs to pass an Iterable. toSeq is safe to use if there are no more updates.
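    A small sketch of the pattern, assuming a test that collects results updated from another thread:
    
    ```scala
    import java.util.concurrent.ConcurrentLinkedQueue
    import scala.collection.JavaConverters._
    
    val results = new ConcurrentLinkedQueue[String]()
    results.add("batch-1")                          // written from other threads
    
    // Unlike the old implicit SynchronizedBuffer view, this snapshot does not
    // keep receiving updates, so take it only once updates have stopped.
    val snapshot: Seq[String] = results.asScala.toSeq
    ```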
    
    Author: Holden Karau <holden@us.ibm.com>
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes apache#11067 from holdenk/SPARK-13165-replace-deprecated-synchronizedBuffer-in-streaming.
    holdenk authored and srowen committed Feb 9, 2016
    Commit 159198e
  5. [SPARK-13201][SPARK-13200] Deprecation warning cleanups: KMeans & MFD…

    …ataGenerator
    
    KMeans:
    Make a private non-deprecated version of setRuns API so that we can call it from the PythonAPI without deprecation warnings in our own build. Also use it internally when being called from train. Add a logWarning for non-1 values
    
    MFDataGenerator:
    Apparently we are calling round on an integer which now in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere.
    
    I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
    holdenk authored and srowen committed Feb 9, 2016
    Commit ce83fe9
  6. [SPARK-13040][DOCS] Update JDBC deprecated SPARK_CLASSPATH documentation

    Update JDBC documentation based on http://stackoverflow.com/a/30947090/219530 as SPARK_CLASSPATH is deprecated.
    
    Also, that's how it actually worked; it didn't work with SPARK_CLASSPATH or --jars alone.
    
    This would solve issue: https://issues.apache.org/jira/browse/SPARK-13040
    
    Author: Sebastián Ramírez <tiangolo@gmail.com>
    
    Closes apache#10948 from tiangolo/patch-docs-jdbc.
    tiangolo authored and srowen committed Feb 9, 2016
    Commit c882ec5
  7. [SPARK-13177][EXAMPLES] Update ActorWordCount example to not directly…

    … use low level linked list as it is deprecated.
    
    Author: sachin aggarwal <different.sachin@gmail.com>
    
    Closes apache#11113 from agsachin/master.
    agsachin authored and srowen committed Feb 9, 2016
    Commit d9ba4d2
  8. [SPARK-13086][SHELL] Use the Scala REPL settings, to enable things li…

    …ke `-i file`.
    
    Now:
    
    ```
    $ bin/spark-shell -i test.scala
    NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/01/29 17:37:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/01/29 17:37:39 INFO Main: Created spark context..
    Spark context available as sc (master = local[*], app id = local-1454085459000).
    16/01/29 17:37:39 INFO Main: Created sql context..
    SQL context available as sqlContext.
    Loading test.scala...
    hello
    
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
          /_/
    
    Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ```
    
    Author: Iulian Dragos <jaguarul@gmail.com>
    
    Closes apache#10984 from dragos/issue/repl-eval-file.
    dragos authored and srowen committed Feb 9, 2016
    Commit e30121a
  9. [SPARK-13170][STREAMING] Investigate replacing SynchronizedQueue as i…

    …t is deprecated
    
    Replace SynchronizedQueue with synchronized access to a plain Queue.
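    A hedged sketch of the replacement pattern:
    
    ```scala
    import scala.collection.mutable
    
    val queue = new mutable.Queue[Int]()
    
    // All reads and writes go through the queue's own lock.
    queue.synchronized { queue.enqueue(1) }
    val next = queue.synchronized { if (queue.nonEmpty) Some(queue.dequeue()) else None }
    ```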
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#11111 from srowen/SPARK-13170.
    srowen committed Feb 9, 2016
    Commit 68ed363
  10. [SPARK-12807][YARN] Spark External Shuffle not working in Hadoop clus…

    …ters with Jackson 2.2.3
    
    Patch to
    
    1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation
    2. Use maven antrun to verify the JAR has the renamed classes
    
    Being Maven-based, I don't know if the verification phase kicks in on an SBT/jenkins build. It will on a `mvn install`
    
    Author: Steve Loughran <stevel@hortonworks.com>
    
    Closes apache#10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.
    steveloughran authored and Marcelo Vanzin committed Feb 9, 2016
    Commit 34d0b70
  11. [SPARK-13189] Cleanup build references to Scala 2.10

    Author: Luciano Resende <lresende@apache.org>
    
    Closes apache#11092 from lresende/SPARK-13189.
    lresende authored and JoshRosen committed Feb 9, 2016
    Commit 2dbb916
  12. [SPARK-12888] [SQL] [FOLLOW-UP] benchmark the new hash expression

    Adds the benchmark results as comments.
    
    The codegen version is slower than the interpreted version for the `simple` case because of 3 reasons:
    
    1. The codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
    2. The codegen version will write the hash value to a row first and then read it out. I tried to create a `GenerateHasher` that can generate code to return the hash value directly and got about a 60% speedup for the `simple` case; is it worth it?
    3. The row in the `simple` case only has one int field, so the runtime reflection may be removed because of branch prediction, which makes the interpreted version faster.
    
    The `array` case is also slow for similar reasons, e.g. the array elements are of the same type, so the interpreted version can probably get rid of runtime reflection by branch prediction.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10917 from cloud-fan/hash-benchmark.
    cloud-fan authored and davies committed Feb 9, 2016
    Commit 7fe4fe6

Commits on Feb 10, 2016

  1. [SPARK-13245][CORE] Call shuffleMetrics methods only in one thread fo…

    …r ShuffleBlockFetcherIterator
    
    Call shuffleMetrics's incRemoteBytesRead and incRemoteBlocksFetched when polling FetchResult from `results` so as to always use shuffleMetrics in one thread.
    
    Also fix a race condition that could cause a memory leak.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11138 from zsxwing/SPARK-13245.
    zsxwing committed Feb 10, 2016
    Commit fae830d
  2. [SPARK-12950] [SQL] Improve lookup of BytesToBytesMap in aggregate

    This PR improves the lookup of BytesToBytesMap by:
    
    1. Generating code to calculate the hash code of the grouping keys.
    
    2. Not using MemoryLocation; instead fetching the baseObject and offset for key and value directly (removing the indirection).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11010 from davies/gen_map.
    Davies Liu authored and davies committed Feb 10, 2016
    Commit 0e5ebac
  3. [SPARK-10524][ML] Use the soft prediction to order categories' bins

    JIRA: https://issues.apache.org/jira/browse/SPARK-10524
    
    Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Liang-Chi Hsieh <viirya@appier.com>
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes apache#8734 from viirya/dt-soft-centroids.
    viirya authored and jkbradley committed Feb 10, 2016
    Commit 9267bc6
  4. [SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removi…

    …ng unnecessary Spark Filter
    
    Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
    
    Current plan:
    ```
    == Optimized Logical Plan ==
    Project [col0#0,col1#1]
    +- Filter (col0#0 = xxx)
       +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
    
    == Physical Plan ==
    +- Filter (col0#0 = xxx)
       +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
    ```
    
    This patch enables a plan below;
    ```
    == Optimized Logical Plan ==
    Project [col0#0,col1#1]
    +- Filter (col0#0 = xxx)
       +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
    
    == Physical Plan ==
    Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
    ```
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10427 from maropu/RemoveFilterInJdbcScan.
    maropu authored and yhuai committed Feb 10, 2016
    Commit 6f710f9
  5. [SPARK-13149][SQL] Add FileStreamSource

    `FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers the files in each batch and stores them in metadata files so that they can be recovered when restarting. The metadata files are stored in the file system. There will be a further PR to clean up the metadata files periodically.
    
    This is based on the initial work from marmbrus.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11034 from zsxwing/stream-df-file-source.
    zsxwing authored and tdas committed Feb 10, 2016
    Commit b385ce3
  6. [SPARK-11565] Replace deprecated DigestUtils.shaHex call

    Author: Gábor Lipták <gliptak@gmail.com>
    
    Closes apache#9532 from gliptak/SPARK-11565.
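    A one-line sketch of the substitution, assuming commons-codec's non-deprecated equivalent:
    
    ```scala
    import org.apache.commons.codec.digest.DigestUtils
    
    // DigestUtils.shaHex(...) is deprecated; sha1Hex computes the same digest.
    val digest = DigestUtils.sha1Hex("some input string")
    ```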
    gliptak authored and srowen committed Feb 10, 2016
    Commit 9269036
  7. [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts

    Author: Jon Maurer <tritab@gmail.com>
    Author: Jonathan Maurer <jmaurer@Jonathans-MacBook-Pro.local>
    
    Closes apache#10789 from tritab/cmd_updates.
    tritab authored and srowen committed Feb 10, 2016
    Commit 2ba9b6a
  8. [SPARK-13203] Add scalastyle rule banning use of mutable.Synchronized…

    …Buffer
    
    andrewor14
    Please take a look
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes apache#11134 from tedyu/master.
    tedyu authored and srowen committed Feb 10, 2016
    Commit e834e42
  9. [SPARK-9307][CORE][SPARK] Logging: Make it either stable or private

    Make Logging private[spark]. Pretty much all there is to it.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#11103 from srowen/SPARK-9307.
    srowen committed Feb 10, 2016
    Commit c0b71e0
  10. [SPARK-5095][MESOS] Support launching multiple mesos executors in coa…

    …rse grained mesos mode.
    
    This is the next iteration of tnachen's previous PR: apache#4027
    
    In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone.  This PR implements that resolution.
    
    This PR implements two high-level features.  These two features are co-dependent, so they're implemented both here:
    - Mesos support for spark.executor.cores
    - Multiple executors per slave
    
    We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite: https://github.com/typesafehub/mesos-spark-integration-tests, which passes for this PR.
    
    The contribution is my original work and I license the work to the project under the project's open source license.
    
    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes apache#10993 from mgummelt/executor_sizing.
    Michael Gummelt authored and Andrew Or committed Feb 10, 2016
    Commit 80cb963
  11. [SPARK-13254][SQL] Fix planning of TakeOrderedAndProject operator

    The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / apache#7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.
    
    This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy.
    
    In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.
    
    /cc davies and marmbrus for review.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#11145 from JoshRosen/take-ordered-and-project-fix.
    JoshRosen committed Feb 10, 2016
    Commit 5cf2059
  12. [SPARK-13163][WEB UI] Column width on new History Server DataTables n…

    …ot getting set correctly
    
    The column width for the new DataTables now adjusts for the current page rather than being hard-coded for the entire table's data.
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#11057 from ajbozarth/spark13163.
    ajbozarth authored and Tom Graves committed Feb 10, 2016
    Commit 39cc620
  13. [SPARK-13126] fix the right margin of history page.

    The right margin of the history page is a little bit off. A simple fix for that issue.
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes apache#11029 from zhuoliu/13126.
    zhuol authored and Tom Graves committed Feb 10, 2016
    Commit 4b80026
  14. Commit ce3bdae
  15. [SPARK-13057][SQL] Add benchmark codes and the performance results fo…

    …r implemented compression schemes for InMemoryRelation
    
    This PR adds benchmark code for in-memory cache compression to make future development and discussions smoother.
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10965 from maropu/ImproveColumnarCache.
    maropu authored and rxin committed Feb 10, 2016
    Commit 5947fa8
  16. [SPARK-12414][CORE] Remove closure serializer

    Remove spark.closure.serializer option and use JavaSerializer always
    
    CC andrewor14 rxin I see there's a discussion in the JIRA but just thought I'd offer this for a look at what the change would be.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#11150 from srowen/SPARK-12414.
    srowen authored and rxin committed Feb 10, 2016
    Commit 29c5473

Commits on Feb 11, 2016

  1. [SPARK-13146][SQL] Management API for continuous queries

    ### Management API for Continuous Queries
    
    **API for getting status of each query**
    - Whether active or not
    - Unique name of each query
    - Status of the sources and sinks
    - Exceptions
    
    **API for managing each query**
    - Immediately stop an active query
    - Waiting for a query to be terminated, correctly or with error
    
    **API for managing multiple queries**
    - Listing all active queries
    - Getting an active query by name
    - Waiting for any one of the active queries to be terminated
    
    **API for listening to query life cycle events**
    - ContinuousQueryListener API for query start, progress and termination events.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes apache#11030 from tdas/streaming-df-management-api.
    tdas authored and zsxwing committed Feb 11, 2016
    Commit 0902e20
  2. [SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API

    Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator
    
    Author: raela <raela@databricks.com>
    
    Closes apache#11158 from raelawang/master.
    raelawang authored and rxin committed Feb 11, 2016
    Commit 719973b
  3. [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Nam…

    …e Ambiguity Caused by Internally Generated Expressions
    
    Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.
    
    This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased.
    
    Here's an example Spark 1.6.0 snippet for illustration:
    ```scala
    sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
    sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
    ```
    The above code produces the following resolved plan:
    ```
    == Analyzed Logical Plan ==
    _c0: bigint
    Project [_c0#101L]
    +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
       +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
          +- Subquery t
             +- Project [id#46L AS a#47L,id#46L AS b#48L]
                +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
    ```
    Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.
    
    The solution is to automatically add the expression IDs into the attribute name for the Alias and AttributeReferences that are generated by Analyzer in SQL Generation.
    
    In this PR, it also resolves another issue. Users could use the same name as the internally generated names. The duplicate names should not cause name ambiguity. When resolving the column, Catalyst should not pick the column that is internally generated.
    
    Could you review the solution? marmbrus liancheng
    
    I did not set the newly added flag for all the alias and attribute reference generated by Analyzers. Please let me know if I should do it? Thank you!
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#11050 from gatorsmile/namingConflicts.
    gatorsmile authored and liancheng committed Feb 11, 2016
    Commit 663cc40
  4. [SPARK-13205][SQL] SQL Generation Support for Self Join

    This PR addresses two issues:
      - Self join does not work in SQL Generation
      - When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost.
    
    liancheng Could you please review the code changes? Thank you!
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#11084 from gatorsmile/selfJoinInSQLGen.
    gatorsmile authored and liancheng committed Feb 11, 2016
    Commit 0f09f02
  5. [SPARK-12706] [SQL] grouping() and grouping_id()

    grouping() returns whether a column is aggregated or not; grouping_id() returns the aggregation levels.
    
    grouping()/grouping_id() can be used with window functions, but do not work in the having/sort clauses; that will be fixed by another PR.
    
    The GROUPING__ID/grouping_id() in Hive is wrong (according to the docs), and we also did it wrongly; this PR changes that to match the behavior in most databases (and the Hive docs).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10677 from davies/grouping.
    Davies Liu authored and davies committed Feb 11, 2016
    Commit b5761d1
  6. [SPARK-13234] [SQL] remove duplicated SQL metrics

    For lots of SQL operators, we have metrics for both input and output. Since the number of input rows should be exactly the number of output rows of the child, we only need metrics for output rows.
    
    After we improved performance using whole stage codegen, the overhead of SQL metrics is not trivial anymore, so we should avoid it where it's not necessary.
    
    This PR removes all the SQL metrics for the number of input rows, adds an SQL metric for the number of output rows to every LeafNode, and removes the SQL metrics from operators that have the same number of rows for input and output (for example Projection, where we may not need them).
    
    The new SQL UI will looks like:
    
    ![metrics](https://cloud.githubusercontent.com/assets/40902/12965227/63614e5e-d009-11e5-88b3-84fea04f9c20.png)
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11163 from davies/remove_metrics.
    Davies Liu authored and davies committed Feb 11, 2016
    Commit 8f744fe
  7. [SPARK-13276] Catch bad characters at the end of a Table Identifier/E…

    …xpression string
    
    The parser currently parses the following strings without a hitch:
    * Table Identifier:
      * `a.b.c` should fail, but results in the following table identifier `a.b`
      * `table!#` should fail, but results in the following table identifier `table`
    * Expression
      * `1+2 r+e` should fail, but results in the following expression `1 + 2`
    
    This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing.
    
    cc cloud-fan (we discussed this in apache#10649) jayadevanmurali (this causes your PR apache#11051 to fail)
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#11159 from hvanhovell/SPARK-13276.
    hvanhovell committed Feb 11, 2016
    Commit 1842c55
  8. [SPARK-13235][SQL] Removed an Extra Distinct from the Plan when Using…

    … Union in SQL
    
    Currently, the parser added two `Distinct` operators in the plan if we are using `Union` or `Union Distinct` in the SQL. This PR is to remove the extra `Distinct` from the plan.
    
    For example, before the fix, the following query has a plan with two `Distinct`
    ```scala
    sql("select * from t0 union select * from t0").explain(true)
    ```
    ```
    == Parsed Logical Plan ==
    'Project [unresolvedalias(*,None)]
    +- 'Subquery u_2
       +- 'Distinct
          +- 'Project [unresolvedalias(*,None)]
             +- 'Subquery u_1
                +- 'Distinct
                   +- 'Union
                      :- 'Project [unresolvedalias(*,None)]
                      :  +- 'UnresolvedRelation `t0`, None
                      +- 'Project [unresolvedalias(*,None)]
                         +- 'UnresolvedRelation `t0`, None
    
    == Analyzed Logical Plan ==
    id: bigint
    Project [id#16L]
    +- Subquery u_2
       +- Distinct
          +- Project [id#16L]
             +- Subquery u_1
                +- Distinct
                   +- Union
                      :- Project [id#16L]
                      :  +- Subquery t0
                      :     +- Relation[id#16L] ParquetRelation
                      +- Project [id#16L]
                         +- Subquery t0
                            +- Relation[id#16L] ParquetRelation
    
    == Optimized Logical Plan ==
    Aggregate [id#16L], [id#16L]
    +- Aggregate [id#16L], [id#16L]
       +- Union
          :- Project [id#16L]
          :  +- Relation[id#16L] ParquetRelation
          +- Project [id#16L]
             +- Relation[id#16L] ParquetRelation
    ```
    After the fix, the plan is changed without the extra `Distinct` as follows:
    ```
    == Parsed Logical Plan ==
    'Project [unresolvedalias(*,None)]
    +- 'Subquery u_1
       +- 'Distinct
          +- 'Union
             :- 'Project [unresolvedalias(*,None)]
             :  +- 'UnresolvedRelation `t0`, None
             +- 'Project [unresolvedalias(*,None)]
               +- 'UnresolvedRelation `t0`, None
    
    == Analyzed Logical Plan ==
    id: bigint
    Project [id#17L]
    +- Subquery u_1
       +- Distinct
          +- Union
            :- Project [id#16L]
            :  +- Subquery t0
            :     +- Relation[id#16L] ParquetRelation
            +- Project [id#16L]
              +- Subquery t0
              +- Relation[id#16L] ParquetRelation
    
    == Optimized Logical Plan ==
    Aggregate [id#17L], [id#17L]
    +- Union
      :- Project [id#16L]
      :  +- Relation[id#16L] ParquetRelation
      +- Project [id#16L]
        +- Relation[id#16L] ParquetRelation
    ```
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#11120 from gatorsmile/unionDistinct.
    gatorsmile authored and hvanhovell committed Feb 11, 2016
    Commit e88bff1
  9. [SPARK-13270][SQL] Remove extra new lines in whole stage codegen and …

    …include pipeline plan in comments.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#11155 from nongli/spark-13270.
    nongli authored and rxin committed Feb 11, 2016
    Commit 18bcbbd
  10. [SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.temp…

    …late
    
    In spark-env.sh.template, there are multi-byte characters, this PR will remove it.
    
    Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
    
    Closes apache#11149 from sasakitoa/remove_multibyte_in_sparkenv.
    sasakitoa authored and srowen committed Feb 11, 2016
    Commit c2f21d8
  11. [SPARK-13074][CORE] Add JavaSparkContext. getPersistentRDDs method

    The "getPersistentRDDs()" is a useful API of SparkContext to get cached RDDs. However, the JavaSparkContext does not have this API.
    
    Add a simple getPersistentRDDs() to get java.util.Map<Integer, JavaRDD> for Java users.
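    For reference, a sketch of the existing Scala-side API that the new Java accessor mirrors, assuming a running SparkContext `sc`:
    
    ```scala
    val rdd = sc.parallelize(1 to 10).cache()
    rdd.count()                        // materialize so the RDD is registered as persistent
    
    val cached = sc.getPersistentRDDs  // Map[Int, RDD[_]] keyed by RDD id
    ```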
    
    Author: Junyang <fly.shenjy@gmail.com>
    
    Closes apache#10978 from flyjy/master.
    jyssky authored and srowen committed Feb 11, 2016
    Commit f9ae99f
  12. [SPARK-13124][WEB UI] Fixed CSS and JS issues caused by addition of J…

    …Query DataTables
    
    Made sure the old tables continue to use the old css and the new DataTables use the new css. Also fixed it so the Safari Web Inspector doesn't throw errors when on the new DataTables pages.
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#11038 from ajbozarth/spark13124.
    ajbozarth authored and Tom Graves committed Feb 11, 2016
    Commit 13c17cb
  13. [STREAMING][TEST] Fix flaky streaming.FailureSuite

    Under some corner cases, the test suite failed to shutdown the SparkContext causing cascaded failures. This fix does two things
    - Makes sure no SparkContext is active after every test
    - Makes sure StreamingContext is always shutdown (prevents leaking of StreamingContexts as well, just in case)
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes apache#11166 from tdas/fix-failuresuite.
    tdas authored and zsxwing committed Feb 11, 2016
    Commit 219a74a
  14. [SPARK-13277][SQL] ANTLR ignores other rule using the USING keyword

    JIRA: https://issues.apache.org/jira/browse/SPARK-13277
    
    There is an ANTLR warning during compilation:
    
        warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7:
        Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3
    
        As a result, alternative(s) 3 were disabled for that input
    
    This patch is to fix it.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#11168 from viirya/fix-parser-using.
    viirya authored and hvanhovell committed Feb 11, 2016
    Commit e31c807
  15. [SPARK-12982][SQL] Add table name validation in temp table registration

    Add the table name validation at the temp table creation
    
    Author: jayadevanmurali <jayadevan.m@tcs.com>
    
    Closes apache#11051 from jayadevanmurali/branch-0.2-SPARK-12982.
    jayadevanmurali authored and hvanhovell committed Feb 11, 2016
    Commit 0d50a22
  16. [SPARK-13279] Remove O(n^2) operation from scheduler.

    This commit removes an unnecessary duplicate check in addPendingTask that meant
    that scheduling a task set took time proportional to (# tasks)^2.
    
    Author: Sital Kedia <skedia@fb.com>
    
    Closes apache#11167 from sitalkedia/fix_stuck_driver and squashes the following commits:
    
    3fe1af8 [Sital Kedia] [SPARK-13279] Remove unnecessary duplicate check in addPendingTask function
    Sital Kedia authored and kayousterhout committed Feb 11, 2016
    Commit 50fa6fd
  17. Revert "[SPARK-13279] Remove O(n^2) operation from scheduler."

    This reverts commit 50fa6fd.
    rxin committed Feb 11, 2016
    Commit c86009c
  18. [SPARK-13265][ML] Refactoring of basic ML import/export for other fil…

    …e system besides HDFS
    
    jkbradley I tried to improve the function that exports a model. When I tried to export a model to S3 under Spark 1.6, it was not possible. So, it should support S3 besides HDFS. Can you review it when you have time? Thanks!
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes apache#11151 from yu-iskw/SPARK-13265.
    yu-iskw authored and mengxr committed Feb 11, 2016
    Commit efb65e0
  19. [SPARK-11515][ML] QuantileDiscretizer should take random seed

    cc jkbradley
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes apache#9535 from yu-iskw/SPARK-11515.
    yu-iskw authored and mengxr committed Feb 11, 2016
    Commit 574571c
  20. [SPARK-13037][ML][PYSPARK] PySpark ml.recommendation support export/i…

    …mport
    
    PySpark ml.recommendation support export/import.
    
    Author: Kai Jiang <jiangkai@gmail.com>
    
    Closes apache#11044 from vectorijk/spark-13037.
    vectorijk authored and mengxr committed Feb 11, 2016
    Commit c8f667d
  21. [MINOR][ML][PYSPARK] Cleanup test cases of clustering.py

    Test cases should be removed from the annotations of the ```setXXX``` functions, otherwise they will become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).
    cc mengxr jkbradley
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10975 from yanboliang/clustering-cleanup.
    yanboliang authored and mengxr committed Feb 11, 2016
    Commit 2426eb3
  22. [SPARK-13035][ML][PYSPARK] PySpark ml.clustering support export/import

    PySpark ml.clustering support export/import.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10999 from yanboliang/spark-13035.
    yanboliang authored and mengxr committed Feb 11, 2016
    Commit 30e0095

Commits on Feb 12, 2016

  1. [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw a…

    …n error
    
    Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
    
    In Python:
    ```python
    from pyspark.ml.classification import NaiveBayes
    nb = NaiveBayes()
    print nb.hasParam("smoothing")
    print nb.hasParam("notAParam")
    ```
    produces:
    > True
    > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
    
    However, in Scala:
    ```scala
    import org.apache.spark.ml.classification.NaiveBayes
    val nb  = new NaiveBayes()
    nb.hasParam("smoothing")
    nb.hasParam("notAParam")
    ```
    produces:
    > true
    > false
    
    cc holdenk
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes apache#10962 from sethah/SPARK-13047.
    sethah authored and mengxr committed Feb 12, 2016
    Commit b354673
  2. [SPARK-12765][ML][COUNTVECTORIZER] fix CountVectorizer.transform's lo…

    …st transformSchema
    
    https://issues.apache.org/jira/browse/SPARK-12765
    
    Author: Liu Xiang <lxmtlab@gmail.com>
    
    Closes apache#10720 from sloth2012/sloth.
    sloth2012 authored and mengxr committed Feb 12, 2016
    Commit a525704
  3. [SPARK-12915][SQL] add SQL metrics of numOutputRows for whole stage c…

    …odegen
    
    This PR adds SQL metrics (numOutputRows) for generated operators (same as non-generated); the cost is about 0.2 nanoseconds per row.
    
    <img width="806" alt="gen metrics" src="https://cloud.githubusercontent.com/assets/40902/12994694/47f5881e-d0d7-11e5-9d47-78229f559ab0.png">
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11170 from davies/gen_metric.
    Davies Liu authored and rxin committed Feb 12, 2016
    Commit b10af5e
  4. [SPARK-13277][BUILD] Follow-up ANTLR warnings are treated as build er…

    …rors
    
    It is possible to create faulty but legal ANTLR grammars. ANTLR will produce warnings but also a valid compilable parser. This PR makes sure we treat such warnings as build errors.
    
    cc rxin / viirya
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#11174 from hvanhovell/ANTLR-warnings-as-errors.
    hvanhovell authored and rxin committed Feb 12, 2016
    Commit 8121a4b
  5. [SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, …

    …false)
    
    https://issues.apache.org/jira/browse/SPARK-12746
    
    Author: Earthson Lu <Earthson.Lu@gmail.com>
    
    Closes apache#10697 from Earthson/SPARK-12746.
    Earthson authored and mengxr committed Feb 12, 2016
    Commit 5f1c359
  6. [SPARK-13153][PYSPARK] ML persistence failed when handle no default v…

    …alue parameter
    
    Fix this defect by checking whether a default value exists or not.
    
    yanboliang Please help to review.
    
    Author: Tommy YU <tummyyu@163.com>
    
    Closes apache#11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
    Wenpei authored and mengxr committed Feb 12, 2016
    Commit d3e2e20
  7. [SPARK-7889][WEBUI] HistoryServer updates UI for incomplete apps

    When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available.  It does this by checking if a version of the app has been loaded with a larger *filesize*.  If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.
    
    https://issues.apache.org/jira/browse/SPARK-7889
    
    Author: Steve Loughran <stevel@hortonworks.com>
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes apache#11118 from squito/SPARK-7889-alternate.
    steveloughran authored and squito committed Feb 12, 2016
    Commit a2c7dcf
  8. [SPARK-6166] Limit number of in flight outbound requests

    This JIRA is related to
    apache#5852
    Had to do some minor rework and testing to make sure it works with the current version of Spark.
    
    Author: Sanket <schintap@untilservice-lm>
    
    Closes apache#10838 from redsanket/limit-outbound-connections.
    Sanket authored and zsxwing committed Feb 12, 2016
    Commit 894921d
  9. [SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means

    Add Python API for spark.ml bisecting k-means.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10889 from yanboliang/spark-12974.
    yanboliang authored and mengxr committed Feb 12, 2016
    Commit a183dda
  10. [SPARK-13154][PYTHON] Add linting for pydocs

    We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.
    
    Right now ./dev/lint-python will skip building the docs if sphinx isn't present - but it might make sense to fail hard - just a matter of if we want to insist all PySpark developers have sphinx present.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
    holdenk authored and mengxr committed Feb 12, 2016
    Commit 64515e5
  11. [SPARK-12705] [SQL] push missing attributes for Sort

    The current implementation of ResolveSortReferences can only push one missing attribute into its child, so it failed to analyze TPC-DS Q98, because there are two missing attributes in that query (one from Window, another from Aggregate).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11153 from davies/resolve_sort.
    Davies Liu authored and davies committed Feb 12, 2016
    Commit 5b805df
  12. [SPARK-13282][SQL] LogicalPlan toSql should just return a String

    Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for comprehension everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11171 from rxin/SPARK-13282.
    rxin committed Feb 12, 2016
    Commit c4d5ad8
  13. [SPARK-13260][SQL] count(*) does not work with CSV data source

    https://issues.apache.org/jira/browse/SPARK-13260
    This is a quicky fix for `count(*)`.
    
    When the `requiredColumns` is empty, currently it returns `sqlContext.sparkContext.emptyRDD[Row]` which does not have the count.
    
    Just like the JSON data source, this PR lets the CSV data source count the rows without parsing each set of tokens.
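
    A rough sketch of the idea under assumed names (this is not the actual CSV relation code): when no columns are required, emit one empty row per input record instead of tokenizing each line.

    ```scala
    import org.apache.spark.sql.Row

    // Illustrative only: `lines` stands in for the raw CSV lines of one partition.
    def toRows(lines: Iterator[String], requiredColumns: Seq[String]): Iterator[Row] =
      if (requiredColumns.isEmpty) {
        lines.map(_ => Row.empty)                    // enough to answer count(*), no parsing
      } else {
        lines.map(line => Row(line.split(","): _*))  // naive tokenization, just for the sketch
      }
    ```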
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#11169 from HyukjinKwon/SPARK-13260.
    HyukjinKwon authored and rxin committed Feb 12, 2016
    ac7d6af
  14. [SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop

    Add PySpark support for `covar_samp` and `covar_pop`.
    
    cc rxin davies marmbrus
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10876 from yanboliang/spark-12962.
    yanboliang authored and davies committed Feb 12, 2016
    90de6b2
  15. [SPARK-12630][PYSPARK] [DOC] PySpark classification parameter desc to…

    … consistent format
    
    Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.
    
    Author: vijaykiran <mail@vijaykiran.com>
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes apache#11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
    vijaykiran authored and mengxr committed Feb 12, 2016
    42d6568
  16. [SPARK-5095] Fix style in mesos coarse grained scheduler code

    andrewor14 This addresses your style comments from apache#10993.
    
    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes apache#11187 from mgummelt/fix_mesos_style.
    Michael Gummelt authored and Andrew Or committed Feb 12, 2016
    38bc601
  17. [SPARK-5095] remove flaky test

    Overrode the start() method, which was previously starting a thread causing a race condition. I believe this should fix the flaky test.
    
    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes apache#11164 from mgummelt/fix_mesos_tests.
    Michael Gummelt authored and Andrew Or committed Feb 12, 2016
    62b1c07

Commits on Feb 13, 2016

  1. [SPARK-13293][SQL] generate Expand

    Expand suffers from creating the UnsafeRow from the same input multiple times; with codegen, it only needs to copy some of the columns.
    
    After this, we can see a 3X improvement (from 43 seconds to 13 seconds) on a TPCDS query (Q67) that has eight columns in its Rollup.
    
    Ideally, we could mask some of the columns based on a bitmask; I'd leave that for the future, because currently Aggregation (50 ns) is much slower than just copying the variables (1-2 ns).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11177 from davies/gen_expand.
    Davies Liu authored and rxin committed Feb 13, 2016
    2228f07
  2. [SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft…

    … Windows
    
    Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.
    
    Is it worth considering also including this fix in any future 1.5.x releases (if any)?
    
    I confirm this is my own original work and license it to the Spark project under its open source license.
    
    Author: markpavey <mark.pavey@thefilter.com>
    
    Closes apache#11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.
    markpavey authored and srowen committed Feb 13, 2016
    374c4b2
  3. [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering f…

    …ailed test
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-12363
    
    This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph were not the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.
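
    A hedged GraphX sketch (not the PowerIterationClustering code itself) of what passing `TripletFields.All` means: it declares that the map function reads the vertex attributes, so they are not dropped by the triplet-field optimization.

    ```scala
    import org.apache.spark.graphx.{Graph, TripletFields}

    // Illustrative edge normalization over a Graph[Double, Double]: the function reads
    // srcAttr, so the triplet fields it needs must be requested explicitly.
    def normalizeEdges(graph: Graph[Double, Double]): Graph[Double, Double] =
      graph.mapTriplets(
        t => t.attr / math.max(t.srcAttr, Double.MinPositiveValue),
        TripletFields.All)
    ```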
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#10539 from viirya/fix-poweriter.
    viirya authored and mengxr committed Feb 13, 2016
    e3441e3

Commits on Feb 14, 2016

  1. Closes apache#11185

    rxin committed Feb 14, 2016
    610196f
  2. [SPARK-13172][CORE][SQL] Stop using RichException.getStackTrace it is…

    … deprecated
    
    Replace `getStackTraceString` with `Utils.exceptionString`
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#11182 from srowen/SPARK-13172.
    srowen authored and rxin committed Feb 14, 2016
    388cd9e
  3. [SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions.

    This pull request has the following changes:
    
    1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
    
    2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
    
    3. Move everything in execution/python.scala into the newly created execution.python package.
    
    Most of the diffs are just straight copy-paste.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11181 from rxin/SPARK-13296.
    rxin committed Feb 14, 2016
    354d4c2
  4. [SPARK-13300][DOCUMENTATION] Added pygments.rb dependancy

    It looks like the pygments.rb gem is also required for the Jekyll build to work. At least on Ubuntu/RHEL I could not build the docs without this dependency, so I added it to the setup steps.
    
    Author: Amit Dev <amitdev@gmail.com>
    
    Closes apache#11180 from amitdev/master.
    amitdev authored and srowen committed Feb 14, 2016
    331293c
  5. [SPARK-13278][CORE] Launcher fails to start with JDK 9 EA

    See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme.
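
    A hedged sketch of why JEP 223 matters here: pre-9 JDKs report versions like `1.8.0_66`, while JDK 9 EA reports e.g. `9-ea`, so parsing that assumes a leading `1.` breaks. This is illustrative only, not the launcher's actual code:

    ```scala
    // Returns the feature version for both the old ("1.8.0_66") and the JEP 223
    // ("9-ea", "9.0.1") version string formats.
    def majorJavaVersion(version: String): Int = {
      val numeric = version.takeWhile(c => c.isDigit || c == '.')
      val parts = numeric.split('.').filter(_.nonEmpty).map(_.toInt)
      if (parts.head == 1) parts(1) else parts.head
    }

    assert(majorJavaVersion("1.8.0_66") == 8)
    assert(majorJavaVersion("9-ea") == 9)
    ```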
    
    Author: Claes Redestad <claes.redestad@gmail.com>
    
    Closes apache#11160 from cl4es/master.
    cl4es authored and srowen committed Feb 14, 2016
    22e9723

Commits on Feb 15, 2016

  1. [SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.StringToDat…

    …e method to improve performance
    
    The java `Calendar` object is expensive to create. I have a subquery like this `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0)`
    
    The table stores `visitDate` as String type and has 3 billion records. A `Calendar` object is created every time `DateTimeUtils.stringToDate` is called. By reusing the `Calendar` object, I saw a performance improvement of about 20 seconds for this stage.
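
    A minimal sketch of the reuse pattern (the actual change lives in `DateTimeUtils`; the object and method names below are made up for illustration):

    ```scala
    import java.util.{Calendar, TimeZone}

    object DateParseSketch {
      // One Calendar per thread, created lazily and then reused across calls.
      private val localCalendar = new ThreadLocal[Calendar] {
        override def initialValue(): Calendar = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
      }

      def daysSinceEpoch(year: Int, month: Int, day: Int): Int = {
        val c = localCalendar.get()
        c.clear()                    // wipe state left over from the previous call
        c.set(year, month - 1, day)  // Calendar months are zero-based
        (c.getTimeInMillis / (24L * 60 * 60 * 1000)).toInt
      }
    }
    ```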
    
    Author: Carson Wang <carson.wang@intel.com>
    
    Closes apache#11090 from carsonwang/SPARK-13185.
    carsonwang authored and rxin committed Feb 15, 2016
    7cb4d74
  2. [SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN

    This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:
    
    - If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
    - If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger.
    
    These optimizations were proposed previously by gatorsmile in apache#10451 and apache#10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In apache#7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting.
    
    When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from apache#10451; see that patch for additional discussion.
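
    A hedged illustration of the `UNION ALL` case (the table names are invented and assumed to be registered in the current SQLContext): after the rule fires, each child of the union only needs to produce at most 10 rows locally before the final global limit is applied.

    ```scala
    val limited = sqlContext.sql(
      """SELECT id FROM events_2015
        |UNION ALL
        |SELECT id FROM events_2016
        |LIMIT 10""".stripMargin)
    limited.explain()  // the optimized plan should show local limits below the Union
    ```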
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#11121 from JoshRosen/limit-pushdown-2.
    JoshRosen authored and rxin committed Feb 15, 2016
    a8bbc4f
  3. [SPARK-12995][GRAPHX] Remove deprecate APIs from Pregel

    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10918 from maropu/RemoveDeprecateInPregel.
    maropu authored and srowen committed Feb 15, 2016
    56d4939
  4. [SPARK-13312][MLLIB] Update java train-validation-split example in ml…

    …-guide
    
    Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312.
    
    This contribution is my original work and I license the work to this project.
    
    Author: JeremyNixon <jnixon2@gmail.com>
    
    Closes apache#11199 from JeremyNixon/update_train_val_split_example.
    JeremyNixon authored and srowen committed Feb 15, 2016
    adb5483

Commits on Feb 16, 2016

  1. [SPARK-13097][ML] Binarizer allowing Double AND Vector input types

    This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type.
    
    A use case for this enhancement is when a user wants to binarize many similar feature columns at once using the same threshold value (for example, a binary threshold applied to many pixels in an image).
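
    A short hedged example of the extended usage (the data are made up, and this assumes the Binarizer setters of this era, `setInputCol`/`setOutputCol`/`setThreshold`):

    ```scala
    import org.apache.spark.ml.feature.Binarizer
    import org.apache.spark.mllib.linalg.Vectors

    // A Vector input column instead of the previously required Double column.
    val df = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(0.1, 0.8, 0.2)),
      (1, Vectors.dense(0.6, 0.3, 0.9))
    )).toDF("id", "features")

    val binarizer = new Binarizer()
      .setInputCol("features")
      .setOutputCol("binarized_features")
      .setThreshold(0.5)

    binarizer.transform(df).show(false)  // each vector element becomes 0.0 or 1.0
    ```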
    
    This contribution is my original work and I license the work to the project under the project's open source license.
    
    viirya mengxr
    
    Author: seddonm1 <seddonm1@gmail.com>
    
    Closes apache#10976 from seddonm1/master.
    seddonm1 authored and mengxr committed Feb 16, 2016
    cbeb006
  2. [SPARK-13018][DOCS] Replace example code in mllib-pmml-model-export.m…

    …d using include_example
    
    Replace example code in mllib-pmml-model-export.md using include_example
    https://issues.apache.org/jira/browse/SPARK-13018
    
    The example code in the user guide is embedded in the markdown and hence is not easy to test. It would be nice to test it automatically. This JIRA is to discuss options for automating example code testing and see what we can do in Spark 1.6.
    
    Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
    `{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}`
    Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala`, pick the code blocks marked "example", and replace the code block in `{% highlight %}` in the markdown.
    
    See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
    
    Author: Xin Ren <iamshrek@126.com>
    
    Closes apache#11126 from keypointt/SPARK-13018.
    keypointt authored and mengxr committed Feb 16, 2016
    e4675c2
  3. [SPARK-13221] [SQL] Fixing GroupingSets when Aggregate Functions Cont…

    …aining GroupBy Columns
    
    Using GroupingSets generates a wrong result when aggregate functions contain GROUP BY columns.
    
    This PR fixes it. Since the code changes are very small, maybe we can also merge it into 1.6.
    
    For example, the following query returns a wrong result:
    ```scala
    sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
         " grouping sets((), (course), (course, earnings))" +
         " order by course, sum").show()
    ```
    Before the fix, the results are like
    ```
    [null,null]
    [Java,null]
    [Java,20000.0]
    [Java,30000.0]
    [dotNET,null]
    [dotNET,5000.0]
    [dotNET,10000.0]
    [dotNET,48000.0]
    ```
    After the fix, the results become correct:
    ```
    [null,113000.0]
    [Java,20000.0]
    [Java,30000.0]
    [Java,50000.0]
    [dotNET,5000.0]
    [dotNET,10000.0]
    [dotNET,48000.0]
    [dotNET,63000.0]
    ```
    
    UPDATE:  This PR also deprecated the external column: GROUPING__ID.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#11100 from gatorsmile/groupingSets.
    gatorsmile authored and davies committed Feb 16, 2016
    fee739f
  4. Correct SparseVector.parse documentation

    There's a small error in the SparseVector.parse docstring: it says the method returns a DenseVector rather than a SparseVector, which is incorrect.
    
    Author: Miles Yucht <miles@databricks.com>
    
    Closes apache#11213 from mgyucht/fix-sparsevector-docs.
    mgyucht authored and srowen committed Feb 16, 2016
    827ed1c
  5. [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collabora…

    …tive filtering in general
    
    This documents the implementation of ALS in `spark.ml`, with example code in Scala, Java and Python.
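
    A condensed, hedged Scala snippet in the spirit of what the guide documents (column names are illustrative and `ratings` is assumed to be a DataFrame with matching columns):

    ```scala
    import org.apache.spark.ml.recommendation.ALS

    // Collaborative filtering with explicit ratings; the column names are assumptions.
    val als = new ALS()
      .setMaxIter(10)
      .setRank(10)
      .setRegParam(0.1)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")

    val model = als.fit(ratings)       // ratings: DataFrame(userId, movieId, rating)
    model.transform(ratings).show(5)   // adds a "prediction" column
    ```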
    
    Author: BenFradet <benjamin.fradet@gmail.com>
    
    Closes apache#10411 from BenFradet/SPARK-12247.
    BenFradet authored and srowen committed Feb 16, 2016
    00c72d2
  6. [SPARK-12976][SQL] Add LazilyGenerateOrdering and use it for RangePar…

    …titioner of Exchange.
    
    Add `LazilyGenerateOrdering` so that the `RangePartitioner` of `Exchange` uses generated ordering instead of `InterpretedOrdering`.
    
    Author: Takuya UESHIN <ueshin@happy-camper.st>
    
    Closes apache#10894 from ueshin/issues/SPARK-12976.
    ueshin authored and JoshRosen committed Feb 16, 2016
    19dc69d
  7. [SPARK-13280][STREAMING] Use a better logger name for FileBasedWriteA…

    …headLog.
    
    The new logger name is under the org.apache.spark namespace.
    The detection of the caller name was also enhanced a bit to ignore
    some common things that show up in the call stack.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#11165 from vanzin/SPARK-13280.
    Marcelo Vanzin committed Feb 16, 2016
    c7d00a2
  8. [SPARK-13308] ManagedBuffers passed to OneToOneStreamManager need to …

    …be freed in non-error cases
    
    ManagedBuffers that are passed to `OneToOneStreamManager.registerStream` need to be freed by the manager once it's done using them. However, the current code only frees them in certain error-cases and not during typical operation. This isn't a major problem today, but it will cause memory leaks after we implement better locking / pinning in the BlockManager (see apache#10705).
    
    This patch modifies the relevant network code so that the ManagedBuffers are freed as soon as the messages containing them are processed by the lower-level Netty message sending code.
    
    /cc zsxwing for review.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#11193 from JoshRosen/add-missing-release-calls-in-network-layer.
    JoshRosen authored and zsxwing committed Feb 16, 2016
    5f37aad