Update #1

Merged: 822 commits merged on Feb 16, 2016
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Jan 23, 2016

  1. [SPARK-11137][STREAMING] Make StreamingContext.stop() exception-safe

    Make StreamingContext.stop() exception-safe
    
    Author: jayadevanmurali <jayadevan.m@tcs.com>
    
    Closes apache#10807 from jayadevanmurali/branch-0.1-SPARK-11137.
    jayadevanmurali authored and srowen committed Jan 23, 2016
    Full SHA: 5f56980
  2. [SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons

    This pull request implements strength reduction for comparing integral expressions and decimal literals, which is more common now because we switched to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule, with some refactoring to simplify the control flow. I also moved the DecimalPrecision rule into its own file due to its growing size.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#10882 from rxin/SPARK-12904-1.
    rxin committed Jan 23, 2016
    Full SHA: 423783a
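
    A minimal sketch of the strength-reduction idea described above, using plain Scala values rather than Catalyst expressions; the helper names are made up for illustration and cover only non-integral literals:

    ```scala
    // Comparing an Int to a non-integral fractional literal can stay in the
    // integral domain: for any Int i,  i > 1.5  <=>  i >= 2  (ceil of the literal)
    // and  i < 1.5  <=>  i <= 1  (floor of the literal), which avoids promoting
    // every row's value to a decimal at comparison time.
    object StrengthReductionSketch extends App {
      def gtNonIntegralLit(i: Int, lit: Double): Boolean = i >= math.ceil(lit).toInt
      def ltNonIntegralLit(i: Int, lit: Double): Boolean = i <= math.floor(lit).toInt

      assert((-5 to 5).forall(i => (i > 1.5) == gtNonIntegralLit(i, 1.5)))
      assert((-5 to 5).forall(i => (i < 1.5) == ltNonIntegralLit(i, 1.5)))
    }
    ```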
  3. [STREAMING][MINOR] Scaladoc + logs

    Found while doing code review
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes apache#10878 from jaceklaskowski/streaming-scaladoc-logs-tiny-fixes.
    jaceklaskowski authored and rxin committed Jan 23, 2016
    Full SHA: cfdcef7

Commits on Jan 24, 2016

  1. [SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT build

    ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive).
    
    This patch attempts to improve the isolation of these tests in order to address this issue.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.
    JoshRosen committed Jan 24, 2016
    Full SHA: f400460
  2. [SPARK-10498][TOOLS][BUILD] Add requirements.txt file for dev python tools
    
    Minor since so few people use them, but it would probably be good to have a requirements file for our python release tools for easier setup (also version pinning).
    
    cc JoshRosen who looked at the original JIRA.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#10871 from holdenk/SPARK-10498-add-requirements-file-for-dev-python-tools.
    holdenk authored and JoshRosen committed Jan 24, 2016
    Full SHA: a834001
  3. [SPARK-12120][PYSPARK] Improve exception message when failing to initialize HiveContext in PySpark
    
    davies Mind reviewing?
    
    This is the error message after this PR
    
    ```
    15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
    /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
      warnings.warn("You must build Spark with Hive. "
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
        return DataFrameReader(self)
      File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
        self._jreader = sqlContext._ssql_ctx.read()
      File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
        raise e
    py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
    : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
    	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
    	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
    	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
    	at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
    	at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
    	at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
    	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
    	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
    	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    	at py4j.Gateway.invoke(Gateway.java:214)
    	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    	at py4j.GatewayConnection.run(GatewayConnection.java:209)
    	at java.lang.Thread.run(Thread.java:745)
    ```
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes apache#10126 from zjffdu/SPARK-12120.
    zjffdu authored and JoshRosen committed Jan 24, 2016
    Full SHA: e789b1d

Commits on Jan 25, 2016

  1. [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows
    
    When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10886 from liancheng/spark-12624.
    liancheng authored and yhuai committed Jan 25, 2016
    Full SHA: 3327fd2
  2. [SPARK-12901][SQL] Refactor options for JSON and CSV datasource (not case class and same format).
    
    https://issues.apache.org/jira/browse/SPARK-12901
    This PR refactors the options in JSON and CSV datasources.
    
    In more details,
    
    1. `JSONOptions` uses the same format as `CSVOptions`.
    2. They are no longer case classes.
    3. `CSVRelation` no longer has to be serializable (it was `with Serializable`, but I removed that).
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#10895 from HyukjinKwon/SPARK-12901.
    HyukjinKwon authored and rxin committed Jan 25, 2016
    Full SHA: 3adebfc
  3. [SPARK-12932][JAVA API] improved error message for java type inference failure
    
    Author: Andy Grove <andygrove73@gmail.com>
    
    Closes apache#10865 from andygrove/SPARK-12932.
    andygrove authored and srowen committed Jan 25, 2016
    Full SHA: d8e4805
  4. [SPARK-12755][CORE] Stop the event logger before the DAG scheduler

    [SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped.
    
    This contribution is my original work, and I license this work to the Spark project under the project's open source license.
    
    Author: Michael Allman <michael@videoamp.com>
    
    Closes apache#10700 from mallman/stop_event_logger_first.
    Michael Allman authored and srowen committed Jan 25, 2016
    Full SHA: 4ee8191
  5. [SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions
    
    Update the user guide for RFormula feature interactions. Meanwhile, we also update the guide for other new features, such as string label support in Spark 1.6.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10222 from yanboliang/spark-11965.
    yanboliang authored and mengxr committed Jan 25, 2016
    Full SHA: dd2325d
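
    A short, hedged example of the RFormula interaction syntax covered by this guide update; the column names and the DataFrame `df` are assumptions for illustration:

    ```scala
    import org.apache.spark.ml.feature.RFormula

    // "a:b" adds the interaction of columns a and b to the feature vector, in
    // addition to the main effects; a string label column is indexed automatically.
    val formula = new RFormula()
      .setFormula("label ~ a + b + a:b")
      .setFeaturesCol("features")
      .setLabelCol("indexedLabel")

    // Assuming `df` is an existing DataFrame with columns label, a and b:
    // val output = formula.fit(df).transform(df)
    ```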
  6. Closes apache#10879

    Closes apache#9046
    Closes apache#8532
    Closes apache#10756
    Closes apache#8960
    Closes apache#10485
    Closes apache#10467
    mengxr committed Jan 25, 2016
    Full SHA: ef8fb36
  7. [SPARK-12149][WEB UI] Executor UI improvement suggestions - Color UI

    Added color coding to the Executors page for Active Tasks, Failed Tasks, Completed Tasks and Task Time.
    
    Active Tasks is shaded blue, with its range based on the percentage of total cores used.
    Failed Tasks is shaded red, ranging over the first 10% of total tasks failed.
    Completed Tasks is shaded green, ranging over 10% of total tasks including failed and active tasks, but only when there are active or failed tasks on that executor.
    Task Time is shaded red when GC Time goes over 10% of total time, with its range directly corresponding to the percent of total time.
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#10154 from ajbozarth/spark12149.
    ajbozarth authored and Tom Graves committed Jan 25, 2016
    Full SHA: c037d25
  8. [SPARK-12902] [SQL] visualization for generated operators

    This PR brings back visualization for generated operators, which look like:
    
    ![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png)
    
    ![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png)
    
    Note: SQL metrics are not supported right now, because they are very slow, will be supported once we have batch mode.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10828 from davies/viz_codegen.
    Davies Liu authored and davies committed Jan 25, 2016
    Full SHA: 7d877c3
  9. Full SHA: 00026fa
  10. [SPARK-12975][SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns
    
    When users are using `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of partitioning columns. For example,
    ```
            df.write
              .format(source)
              .partitionBy("i")
              .bucketBy(8, "i", "k")
              .saveAsTable("bucketed_table")
    ```
    However, in the above case, adding column `i` to `bucketBy` is useless. It just wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we can issue an exception and let users make the change.
    
    Also added a test case for checking whether the information about the `sortBy` and `bucketBy` columns is correctly saved in the metastore table.
    
    Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks!
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#10891 from gatorsmile/commonKeysInPartitionByBucketBy.
    gatorsmile authored and marmbrus committed Jan 25, 2016
    Full SHA: 9348431
  11. [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark

    ```PCAModel``` can output ```explainedVariance``` on the Python side.
    
    cc mengxr srowen
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10830 from yanboliang/spark-12905.
    yanboliang authored and mengxr committed Jan 25, 2016
    Full SHA: dcae355
  12. [SPARK-12934][SQL] Count-min sketch serialization

    This PR adds serialization support for `CountMinSketch`.
    
    A version number is added to version the serialized binary format.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10893 from liancheng/cms-serialization.
    liancheng authored and rxin committed Jan 25, 2016
    Full SHA: 6f0f1d9

Commits on Jan 26, 2016

  1. [SPARK-12879] [SQL] improve the unsafe row writing framework

    As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more documentation to it and make it easier to use.
    
    This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip updating the row size. Then we can apply this technique to more places easily.
    
    a local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:
    **old version**
    ```
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    -------------------------------------------------------------------------------
    single long                             2616.04           102.61         1.00 X
    single nullable long                    3032.54            88.52         0.86 X
    primitive types                         9121.05            29.43         0.29 X
    nullable primitive types               12410.60            21.63         0.21 X
    ```
    
    **new version**
    ```
    Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
    unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    -------------------------------------------------------------------------------
    single long                             1533.34           175.07         1.00 X
    single nullable long                    2306.73           116.37         0.66 X
    primitive types                         8403.93            31.94         0.18 X
    nullable primitive types               12448.39            21.56         0.12 X
    ```
    
    For a single non-nullable long (the best case), we get about a 1.7x speedup. Even when it is nullable, we still get a 1.3x speedup. For other cases the boost is smaller, as the saved operations account for only a small proportion of the whole process. The benchmark code is included in this PR.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10809 from cloud-fan/unsafe-projection.
    cloud-fan authored and davies committed Jan 26, 2016
    Full SHA: be375fc
  2. [SPARK-12936][SQL] Initial bloom filter implementation

    This PR adds an initial implementation of bloom filter in the newly added sketch module.  The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java).
    
    Some differences from the design doc:
    
    * Expose `bitSize` instead of `sizeInBytes` to the user.
    * The `expectedInsertions` parameter is always required when creating a bloom filter.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10883 from cloud-fan/bloom-filter.
    cloud-fan authored and rxin committed Jan 26, 2016
    Full SHA: 109061f
  3. [SPARK-12934] use try-with-resources for streams

    liancheng please take a look
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes apache#10906 from tedyu/master.
    tedyu authored and liancheng committed Jan 26, 2016
    Full SHA: fdcc351
  4. [SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer
    
    Add Python API for ml.feature.QuantileDiscretizer.
    
    One open question: Do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?
    cc brkyvz & mengxr
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
    holdenk authored and jkbradley committed Jan 26, 2016
    Full SHA: b66afde
  5. [SPARK-12834] Change ser/de of JavaArray and JavaList

    https://issues.apache.org/jira/browse/SPARK-12834
    
    We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with ser/de of Scala Array, as I noted in https://issues.apache.org/jira/browse/SPARK-12780
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#10772 from yinxusen/SPARK-12834.
    yinxusen authored and jkbradley committed Jan 26, 2016
    Full SHA: ae47ba7
  6. [SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in PySpark for now
    
    I saw several failures from recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored and we will fix the flakiness in SPARK-10086.
    
    gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"?
    
    cc: jkbradley
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#10909 from mengxr/SPARK-10086.
    mengxr committed Jan 26, 2016
    Full SHA: 27c910f
  7. [SQL][MINOR] A few minor tweaks to CSV reader.

    This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#10919 from rxin/csv-minor.
    rxin committed Jan 26, 2016
    Full SHA: d54cfed
  8. [SPARK-12937][SQL] bloom filter serialization

    This PR adds serialization support for BloomFilter.
    
    A version number is added to version the serialized binary format.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10920 from cloud-fan/bloom-filter.
    cloud-fan authored and rxin committed Jan 26, 2016
    Full SHA: 6743de3
  9. [SPARK-12961][CORE] Prevent snappy-java memory leak

    JIRA: https://issues.apache.org/jira/browse/SPARK-12961
    
    To prevent a memory leak in snappy-java, just call the method once and cache the result. After the library releases a new version, we can remove this cached object.
    
    JoshRosen
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#10875 from viirya/prevent-snappy-memory-leak.
    viirya authored and srowen committed Jan 26, 2016
    Full SHA: 5936bf9
  10. [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
    
    Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
    
    CC rxin pwendell for API change; tdas since it also touches streaming.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#10413 from srowen/SPARK-3369.
    srowen committed Jan 26, 2016
    Full SHA: 649e9d0
  11. [SPARK-10911] Executors should System.exit on clean shutdown.

    Call System.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441.
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes apache#9946 from zhuoliu/10911.
    zhuol authored and Tom Graves committed Jan 26, 2016
    Full SHA: ae0309a
  12. [SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format
    
    This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL.
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes apache#10826 from sameeragarwal/skip-hive-metadata.
    sameeragarwal authored and yhuai committed Jan 26, 2016
    Full SHA: 08c781c
  13. [SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions for streaming-akka project
    
    Since `actorStream` is an external project, we should add the linking and deploying instructions for it.
    
    A follow up PR of apache#10744
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#10856 from zsxwing/akka-link-instruction.
    zsxwing authored and tdas committed Jan 26, 2016
    Full SHA: cbd507d
  14. [SPARK-11923][ML] Python API for ml.feature.ChiSqSelector

    https://issues.apache.org/jira/browse/SPARK-11923
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#10186 from yinxusen/SPARK-11923.
    yinxusen authored and jkbradley committed Jan 26, 2016
    Full SHA: 8beab68
  15. [SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer rather than its parent class
    
    https://issues.apache.org/jira/browse/SPARK-12952
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#10863 from yinxusen/SPARK-12952.
    yinxusen authored and jkbradley committed Jan 26, 2016
    Full SHA: fbf7623
  16. [SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in dev/run-tests
    
    This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies.  This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure
    
    Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after.
    
    In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10885 from JoshRosen/SPARK-8725.
    JoshRosen committed Jan 26, 2016
    Full SHA: ee74498
  17. [SQL] Minor Scaladoc format fix

    Otherwise the `^` character is always marked as error in IntelliJ since it represents an unclosed superscript markup tag.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10926 from liancheng/agg-doc-fix.
    liancheng committed Jan 26, 2016
    Full SHA: 83507fe
  18. [SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark

    The environment variable ADD_FILES was created for adding Python files to the Spark context to be distributed to executors (SPARK-865); it is deprecated now. Users are encouraged to use --py-files to add Python files.
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes apache#10913 from zjffdu/SPARK-12993.
    zjffdu authored and rxin committed Jan 26, 2016
    Full SHA: 19fdb21
  19. [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code

    The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
    holdenk authored and jkbradley committed Jan 26, 2016
    Full SHA: eb91729

Commits on Jan 27, 2016

  1. [SPARK-12614][CORE] Don't throw non fatal exception from ask

    Right now RpcEndpointRef.ask may throw an exception in some corner cases, such as calling ask after stopping RpcEnv. It's better to avoid throwing exceptions from RpcEndpointRef.ask; we can send the exception to the future returned by `ask` instead.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#10568 from zsxwing/send-ask-fail.
    zsxwing committed Jan 27, 2016
    Full SHA: 22662b2
  2. [SPARK-11622][MLLIB] Make LibSVMRelation extend HadoopFsRelation and add LibSVMOutputWriter
    
    The behavior of LibSVMRelation is not changed, except for adding LibSVMOutputWriter:
    * Partitioning is still not supported
    * Multiple input paths are not supported
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes apache#9595 from zjffdu/SPARK-11622.
    zjffdu authored and mengxr committed Jan 27, 2016
    Full SHA: 1dac964
  3. [SPARK-12854][SQL] Implement complex types support in ColumnarBatch

    This patch adds support for complex types for ColumnarBatch. ColumnarBatch supports structs
    and arrays. There is a simple mapping between the richer catalyst types to these two. Strings
    are treated as an array of bytes.
    
    ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist
    of just leaf nodes. Structs represent an internal node with one child for each field. Arrays
    are internal nodes with one child. Structs just contain nullability. Arrays contain offsets
    and lengths into the child array. This structure is able to handle arbitrary nesting. It has
    the key property that we maintain columnar throughout and that primitive types are only stored
    in the leaf nodes and contiguous across rows. For example, if the schema is
    ```
    array<array<int>>
    ```
    There are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively.
    
    As part of this, this patch adds append APIs in addition to the Put APIs (e.g. putLong(rowid, v)
    vs appendLong(v)). These APIs are necessary when the batch contains variable length elements.
    The vectors are not fixed length and will grow as necessary. This should make the usage a lot
    simpler for the writer.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#10820 from nongli/spark-12854.
    nongli authored and rxin committed Jan 27, 2016
    Full SHA: 5551273
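
    A tiny, self-contained Scala model of the offsets/lengths layout described above for `array<array<int>>`; it only illustrates the columnar nesting idea and is not the actual ColumnarBatch code:

    ```scala
    // Three "columns": the outer array node, the inner array node, and the int leaf.
    // Each array level stores (offset, length) pairs into its child; only the leaf
    // stores the actual int data, contiguously across all rows.
    object NestedArrayLayoutSketch extends App {
      // Row 0: [[1, 2], [3]]   Row 1: [[4, 5, 6]]
      val leafInts     = Array(1, 2, 3, 4, 5, 6)   // leaf column, contiguous ints
      val innerOffsets = Array(0, 2, 3)            // inner arrays -> leaf positions
      val innerLengths = Array(2, 1, 3)
      val outerOffsets = Array(0, 2)               // rows -> inner array positions
      val outerLengths = Array(2, 1)

      def row(r: Int): Seq[Seq[Int]] =
        (0 until outerLengths(r)).map { i =>
          val inner = outerOffsets(r) + i
          leafInts.slice(innerOffsets(inner), innerOffsets(inner) + innerLengths(inner)).toSeq
        }

      assert(row(0) == Seq(Seq(1, 2), Seq(3)))
      assert(row(1) == Seq(Seq(4, 5, 6)))
    }
    ```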
  4. [SPARK-7780][MLLIB] Intercept in LogisticRegressionWithLBFGS should not be regularized
    
    The intercept in Logistic Regression represents a prior on categories and should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization.
    The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib since the majority of users are still using the MLlib API.
    Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
    
    Previously partially reviewed at apache#6386 (comment) re-opening for dbtsai to review.
    
    Author: Holden Karau <holden@us.ibm.com>
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes apache#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
    holdenk authored and DB Tsai committed Jan 27, 2016
    Full SHA: b72611f
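
    For reference, a hedged sketch of the objective the ML implementation minimizes (labels y_i in {-1, +1}); the L2 penalty applies to the weights w but not to the intercept b:

    ```
    \min_{w,\, b} \;\; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \, (w^\top x_i + b)\right)\right) \;+\; \lambda \, \lVert w \rVert_2^2
    ```

    The older Updater-based path effectively penalized b as well, which is what the commit describes as causing poor training accuracy under regularization.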
  5. [SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR

    Add ```covar_samp``` and ```covar_pop``` for SparkR.
    Should we also provide a ```cov``` alias for ```covar_samp```? There is a ```cov``` implementation in stats.R which already masks ```stats::cov```, but this may introduce a breaking API change.
    
    cc sun-rui felixcheung shivaram
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10829 from yanboliang/spark-12903.
    yanboliang authored and shivaram committed Jan 27, 2016
    Full SHA: e7f9199
  6. [SPARK-12935][SQL] DataFrame API for Count-Min Sketch

    This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10911 from liancheng/cms-df-api.
    liancheng authored and rxin committed Jan 27, 2016
    Full SHA: ce38a35
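
    A hedged usage sketch for the Count-Min Sketch integration, using the sketch-module API directly; the DataFrame-level call is shown commented out because its exact signature may differ in this snapshot, and the column name and parameters are illustrative:

    ```scala
    import org.apache.spark.util.sketch.CountMinSketch

    // Create a sketch with relative error eps = 0.01, confidence 0.99 and a fixed
    // seed, add a few items, then query approximate frequencies.
    val cms = CountMinSketch.create(0.01, 0.99, 42)
    Seq("a", "b", "a", "c", "a").foreach(item => cms.add(item))
    val approxCountOfA: Long = cms.estimateCount("a")   // ~3 for this tiny input

    // DataFrame-level entry point added by this PR (illustrative):
    // val sketch = df.stat.countMinSketch("item", 0.01, 0.99, 42)
    ```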
  7. [SPARK-12728][SQL] Integrates SQL generation with native view

    This PR is a follow-up of PR apache#10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.
    
    In this PR, a new SQL option `spark.sql.nativeView.canonical` is added.  When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we failed to map the plan to SQL, we fallback to the original native view approach.
    
    One important issue this PR fixes is that, now we can use CTE when defining a view.  Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`.  However, HiveQL parser doesn't allow CTE appearing as a subquery.  Namely, something like this is disallowed:
    
    ```sql
    SELECT n
    FROM (
      WITH w AS (SELECT 1 AS n)
      SELECT * FROM w
    ) v
    ```
    
    This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during analysis phase, thus there won't be CTE expressions in the generated SQL query string).
    
    Author: Cheng Lian <lian@databricks.com>
    Author: Yin Huai <yhuai@databricks.com>
    
    Closes apache#10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
    liancheng authored and yhuai committed Jan 27, 2016
    Full SHA: 58f5d8c
  8. [SPARK-12967][NETTY] Avoid NettyRpc error message during sparkContext shutdown
    
    If there's an RPC issue while sparkContext is alive but stopped (which would happen only when executing SparkContext.stop), log a warning instead. This is a common occurrence.
    
    vanzin
    
    Author: Nishkam Ravi <nishkamravi@gmail.com>
    Author: nishkamravi2 <nishkamravi@gmail.com>
    
    Closes apache#10881 from nishkamravi2/master_netty.
    nishkamravi2 authored and zsxwing committed Jan 27, 2016
    Full SHA: bae3c9a
  9. [SPARK-12780] Inconsistent return values of ML Python models' properties
    
    https://issues.apache.org/jira/browse/SPARK-12780
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#10724 from yinxusen/SPARK-12780.
    yinxusen authored and jkbradley committed Jan 27, 2016
    Full SHA: 4db255c
  10. [SPARK-12983][CORE][DOC] Correct metrics.properties.template

    There are some typos or plain unintelligible sentences in the metrics template.
    
    Author: BenFradet <benjamin.fradet@gmail.com>
    
    Closes apache#10902 from BenFradet/SPARK-12983.
    BenFradet authored and srowen committed Jan 27, 2016
    Full SHA: 90b0e56
  11. [SPARK-1680][DOCS] Explain environment variables for running on YARN in cluster mode
    
    JIRA 1680 added a property called spark.yarn.appMasterEnv.  This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables
    
    Author: Andrew <weiner.andrew.j@gmail.com>
    
    Closes apache#10869 from weineran/branch-yarn-docs.
    weineran authored and srowen committed Jan 27, 2016
    Full SHA: 093291c
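
    A small hedged example of the configuration pattern this doc change explains; the variable names and values are placeholders:

    ```scala
    import org.apache.spark.SparkConf

    // In yarn-cluster mode the driver runs inside the Application Master, so its
    // environment variables are set via spark.yarn.appMasterEnv.<NAME> rather than
    // the usual client-side environment.
    val conf = new SparkConf()
      .set("spark.yarn.appMasterEnv.JAVA_HOME", "/usr/lib/jvm/java-8")  // placeholder path
      .set("spark.yarn.appMasterEnv.MY_SETTING", "some-value")          // placeholder variable
    ```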
  12. [SPARK-13023][PROJECT INFRA] Fix handling of root module in modules_to_test()
    
    There's a minor bug in how we handle the `root` module in the `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends on `build` (since every test needs to run on any build test), we now need to check for the presence of root in `modules_to_test` instead of `changed_modules`.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10933 from JoshRosen/build-module-fix.
    JoshRosen committed Jan 27, 2016
    Full SHA: 41f0c85
  13. [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
    
    The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.
    
    Author: Jason Lee <cjlee@us.ibm.com>
    
    Closes apache#8969 from jasoncl/SPARK-10847.
    jasoncl authored and yhuai committed Jan 27, 2016
    Full SHA: edd4737
  14. [SPARK-12895][SPARK-12896] Migrate TaskMetrics to accumulators

    The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:
    
    **SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.
    
    **SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.
    
    While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here.
    
    Note: This was once part of apache#10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10835 from andrewor14/task-metrics-use-accums.
    Andrew Or authored and JoshRosen committed Jan 27, 2016
    Full SHA: 87abcf7
  15. [SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract
    
    Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition needs to match its position in the partitions array.
    
    If a custom RDD implementation violates this contract, then Spark has the potential to become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html
    
    In order to guard against this infinite loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs' whose `partitions` violate the API contract.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10932 from JoshRosen/SPARK-13021.
    JoshRosen authored and yhuai committed Jan 27, 2016
    Full SHA: 32f7411
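
    A small sketch of the invariant described above, written against the public RDD API; it illustrates the contract rather than reproducing the actual check added by this patch:

    ```scala
    import org.apache.spark.rdd.RDD

    // Contract: for every x, rdd.partitions(x).index == x, i.e. each Partition's
    // reported index must equal its position in the partitions array.
    def satisfiesPartitionContract(rdd: RDD[_]): Boolean =
      rdd.partitions.zipWithIndex.forall { case (p, i) => p.index == i }
    ```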
  16. [SPARK-12938][SQL] DataFrame API for Bloom filter

    This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs.
    
    This PR also adds two specialized `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10937 from cloud-fan/bloom-filter.
    cloud-fan authored and rxin committed Jan 27, 2016
    Full SHA: 680afab
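
    A hedged usage sketch for the Bloom filter integration, using the sketch-module API directly; the DataFrame-level call is commented out since its exact signature may differ here, and the names and sizes are illustrative:

    ```scala
    import org.apache.spark.util.sketch.BloomFilter

    // Size the filter for ~1 million expected items at a 3% false-positive rate,
    // insert a few items, then membership-test (no false negatives, some false positives).
    val bf = BloomFilter.create(1000000L, 0.03)
    bf.put("alice")
    bf.put("bob")
    val maybePresent: Boolean = bf.mightContain("alice")   // true

    // DataFrame-level entry point (illustrative):
    // val filter = df.stat.bloomFilter("userId", 1000000L, 0.03)
    ```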
  17. [SPARK-12865][SPARK-12866][SQL] Migrate SparkSQLParser/ExtendedHiveQlParser commands to new Parser
    
    This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). This also improves the current SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands, this PR respects these commands (and passes them on to Hive).
    
    This PR and apache#10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst.
    
    The PR is marked WIP as long as it doesn't pass all tests.
    
    cc rxin viirya winningsix (this touches apache#10144)
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#10905 from hvanhovell/SPARK-12866.
    hvanhovell authored and rxin committed Jan 27, 2016
    Full SHA: ef96cd3
  18. [HOTFIX] Fix Scala 2.11 compilation

    by explicitly marking annotated parameters as vals (SI-8813).
    
    Caused by apache#10835.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10955 from andrewor14/fix-scala211.
    Andrew Or authored and yhuai committed Jan 27, 2016
    Full SHA: d702f0c
  19. [SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBatch.Row
    
    These two classes became identical as the implementation progressed.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#10952 from nongli/spark-13045.
    nongli authored and davies committed Jan 27, 2016
    Full SHA: 4a09123

Commits on Jan 28, 2016

  1. Provide same info as in spark-submit --help

    This is stated for --packages and --repositories. Without stating it for --jars, people expect a standard Java classpath to work, with expansion and a different delimiter than a comma. Currently this is only stated in the --help for spark-submit: "Comma-separated list of local jars to include on the driver and executor classpaths."
    
    Author: James Lohse <jimlohse@users.noreply.github.com>
    
    Closes apache#10890 from jimlohse/patch-1.
    jimlohse authored and srowen committed Jan 28, 2016
    Full SHA: c220443
  2. [SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch
    
    This PR is a follow-up of apache#10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10968 from liancheng/cms-specialized.
    liancheng authored and rxin committed Jan 28, 2016
    Full SHA: 415d0a8
  3. [SPARK-12926][SQL] SQLContext to display warning message when non-sql configs are being set
    
    Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work. eg. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done.
    
    Author: Tejas Patil <tejasp@fb.com>
    
    Closes apache#10849 from tejasapatil/SPARK-12926.
    tejasapatil authored and marmbrus committed Jan 28, 2016
    Full SHA: 6768039
  4. [SPARK-13031] [SQL] cleanup codegen and improve test coverage

    1. enable whole stage codegen during tests even when there is only one operator that supports it.
    2. split doProduce() into two APIs: upstream() and doProduce()
    3. generate prefix for fresh names of each operator
    4. pass UnsafeRow to parent directly (avoid getters and create UnsafeRow again)
    5. fix bugs and tests.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10944 from davies/gen_refactor.
    Davies Liu authored and davies committed Jan 28, 2016
    Full SHA: cc18a71
  5. [SPARK-9835][ML] Implement IterativelyReweightedLeastSquares solver

    Implement the ```IterativelyReweightedLeastSquares``` solver for GLMs. I consider it a solver rather than an estimator; it is only used internally, so I keep it ```private[ml]```.
    There are two limitations in the current implementation compared with R:
    * It can not support ```Tuple``` as response for ```Binomial``` family, such as the following code:
    ```
    glm( cbind(using, notUsing) ~  age + education + wantsMore , family = binomial)
    ```
    * It does not support ```offset```.
    
    Because ```RFormula``` does not support ```Tuple``` as label or the ```offset``` keyword, I simplified the implementation. Adding support for these two features is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistical summary for IRLS.
    The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM).
    Please focus on the main structure and overlook minor issues/docs that I will update later. Any comments and opinions will be appreciated.
    
    cc mengxr jkbradley
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10639 from yanboliang/spark-9835.
    yanboliang authored and mengxr committed Jan 28, 2016
    Full SHA: df78a93
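
    For context, a textbook statement of one IRLS step for a GLM with link g (linear predictor eta_i = x_i^T beta, mean mu_i = g^{-1}(eta_i)); the solver's exact form may differ in details such as offsets and instance weights:

    ```
    z_i = \eta_i + (y_i - \mu_i)\, g'(\mu_i), \qquad
    w_i = \frac{1}{\operatorname{Var}(\mu_i)\, g'(\mu_i)^2}, \qquad
    \beta^{(t+1)} = \arg\min_{\beta} \sum_i w_i \left(z_i - x_i^\top \beta\right)^2
    ```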
  6. [SPARK-12401][SQL] Add integration tests for postgres enum types

    We can handle PostgreSQL-specific enum types as strings in JDBC.
    So, we should just add tests and close the corresponding JIRA ticket.
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10596 from maropu/AddTestsInIntegration.
    maropu authored and liancheng committed Jan 28, 2016
    Full SHA: abae889
  7. [SPARK-12749][SQL] add json option to parse floating-point types as DecimalType
    
    I tried to add this via `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success.
    
    Added test for non-complex types. Should I add a test for complex types?
    
    Author: Brandon Bradley <bradleytastic@gmail.com>
    
    Closes apache#10936 from blbradley/spark-12749.
    blbradley authored and rxin committed Jan 28, 2016
    Full SHA: 3a40c0e

Commits on Jan 29, 2016

  1. [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushing down filters in Parquet
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-11955
    
    Currently we simply skip pushing down filters in Parquet if we enable schema merging.
    
    However, we can actually mark particular fields in the merging schema so that filters can safely be pushed down in Parquet.
    
    Author: Liang-Chi Hsieh <viirya@appier.com>
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#9940 from viirya/safe-pushdown-parquet-filters.
    viirya authored and liancheng committed Jan 29, 2016
    Full SHA: 4637fc0
  2. Full SHA: b9dfdcc
  3. [SPARK-12968][SQL] Implement command to set current database

    JIRA: https://issues.apache.org/jira/browse/SPARK-12968
    
    Implement command to set current database.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Liang-Chi Hsieh <viirya@appier.com>
    
    Closes apache#10916 from viirya/ddl-use-database.
    viirya authored and rxin committed Jan 29, 2016
    Full SHA: 66449b8
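
    A minimal usage sketch of the new command, assuming an existing `sqlContext`; the database and table names are placeholders:

    ```scala
    // Switch the session's current database; unqualified table names then
    // resolve against it.
    sqlContext.sql("USE my_database")
    sqlContext.sql("SELECT * FROM some_table").show()
    ```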
  4. [SPARK-13067] [SQL] workaround for a weird scala reflection problem

    A simple workaround to avoid getting parameter types when converting a logical plan to JSON.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10970 from cloud-fan/reflection.
    cloud-fan authored and davies committed Jan 29, 2016
    Full SHA: 721ced2
  5. [SPARK-13050][BUILD] Scalatest tags fail build with the addition of the sketch module
    
    A dependency on the spark test tags was left out of the sketch module pom file causing builds to fail when test tags were used. This dependency is found in the pom file for every other module in spark.
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#10954 from ajbozarth/spark13050.
    ajbozarth authored and liancheng committed Jan 29, 2016
    Full SHA: 8d3cc3d
  6. [SPARK-13031][SQL] cleanup codegen and improve test coverage

    1. enable whole stage codegen during tests even when there is only one operator that supports it.
    2. split doProduce() into two APIs: upstream() and doProduce()
    3. generate prefix for fresh names of each operator
    4. pass UnsafeRow to parent directly (avoid getters and create UnsafeRow again)
    5. fix bugs and tests.
    
    This PR re-open apache#10944 and fix the bug.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10977 from davies/gen_refactor.
    Davies Liu authored and rxin committed Jan 29, 2016
    Full SHA: 55561e7
  7. [SPARK-13032][ML][PYSPARK] PySpark support for model export/import, taking LinearRegression as an example
    
    * Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
    * Make ```LinearRegression``` support ```save/load``` as an example. After this is merged, the work for other transformers/estimators will be easy; then we can list and distribute the tasks to the community.
    
    cc mengxr jkbradley
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes apache#10469 from yanboliang/spark-11939.
    yanboliang authored and jkbradley committed Jan 29, 2016
    Full SHA: e51b6ea
  8. [SPARK-10873] Support column sort and search for History Server.

    [SPARK-10873] Support column sort and search for History Server using jQuery DataTable and REST API. Before this commit, the history server generated hard-coded HTML and could not support search; also, sorting was disabled if any application had more than one attempt. Supporting search and sort (over all applications rather than the 20 entries on the current page) will greatly improve the user experience.
    
    1. Create the historypage-template.html for displaying application information in datables.
    2. historypage.js uses jQuery to access the data from /api/v1/applications REST API, and use DataTable to display each application's information. For application that has more than one attempt, the RowsGroup is used to merge such entries while at the same time supporting sort and search.
    3. "duration" and "lastUpdated" rest API are added to application's "attempts".
    4. External JavaScript and CSS files for DataTables, RowsGroup and jQuery plugins are added, with licenses clarified.
    
    Snapshots for how it looks like now:
    
    History page view:
    ![historypage](https://cloud.githubusercontent.com/assets/11683054/12184383/89bad774-b55a-11e5-84e4-b0276172976f.png)
    
    Search:
    ![search](https://cloud.githubusercontent.com/assets/11683054/12184385/8d3b94b0-b55a-11e5-869a-cc0ef0a4242a.png)
    
    Sort by started time:
    ![sort-by-started-time](https://cloud.githubusercontent.com/assets/11683054/12184387/8f757c3c-b55a-11e5-98c8-577936366566.png)
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes apache#10648 from zhuoliu/10873.
    zhuol authored and Tom Graves committed Jan 29, 2016
    Full SHA: e4c1162
  9. [SPARK-13072] [SQL] simplify and improve murmur3 hash expression codegen

    Simplify the generated code of the hash expression (remove several unnecessary local variables) and avoid null checks where possible.
    
    generated code comparison for `hash(int, double, string, array<string>)`:
    **before:**
    ```
      public UnsafeRow apply(InternalRow i) {
        /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
        int value1 = 42;
        /* input[0, int] */
        int value3 = i.getInt(0);
        if (!false) {
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
        }
        /* input[1, double] */
        double value5 = i.getDouble(1);
        if (!false) {
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
        }
        /* input[2, string] */
        boolean isNull6 = i.isNullAt(2);
        UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
        if (!isNull6) {
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
        }
        /* input[3, array<int>] */
        boolean isNull8 = i.isNullAt(3);
        ArrayData value9 = isNull8 ? null : (i.getArray(3));
        if (!isNull8) {
          int result10 = value1;
          for (int index11 = 0; index11 < value9.numElements(); index11++) {
            if (!value9.isNullAt(index11)) {
              final int element12 = value9.getInt(index11);
              result10 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element12, result10);
            }
          }
          value1 = result10;
        }
      }
    ```
    **after:**
    ```
      public UnsafeRow apply(InternalRow i) {
        /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
        int value1 = 42;
        /* input[0, int] */
        int value3 = i.getInt(0);
        value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
        /* input[1, double] */
        double value5 = i.getDouble(1);
        value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
        /* input[2, string] */
        boolean isNull6 = i.isNullAt(2);
        UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
    
        if (!isNull6) {
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
        }
    
        /* input[3, array<int>] */
        boolean isNull8 = i.isNullAt(3);
        ArrayData value9 = isNull8 ? null : (i.getArray(3));
        if (!isNull8) {
          for (int index10 = 0; index10 < value9.numElements(); index10++) {
            final int element11 = value9.getInt(index10);
            value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element11, value1);
          }
        }
    
        rowWriter14.write(0, value1);
        return result12;
      }
    ```
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10974 from cloud-fan/codegen.
    cloud-fan authored and davies committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    c5f745e View commit details
    Browse the repository at this point in the history
  10. [SPARK-12656] [SQL] Implement Intersect with Left-semi Join

    Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
    
    After a search, I found that at least one mainstream RDBMS does the same: in its query explain output, Intersect is replaced by a left-semi join. Left-semi joins can also help outer-join elimination in the optimizer, as shown in PR apache#10566.
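
    A minimal sketch of the equivalence, assuming a SQLContext named `sqlContext` is in scope (the tables and column names below are made up for illustration):
    
    ```scala
    import sqlContext.implicits._
    
    val df1 = Seq((1, "x"), (2, "y"), (2, "y")).toDF("a", "b")
    val df2 = Seq((2, "y"), (3, "z")).toDF("a", "b")
    
    // INTERSECT ...
    val viaIntersect = df1.intersect(df2)
    
    // ... behaves like a distinct LEFT SEMI join on null-safe equality of all columns.
    val viaSemiJoin = df1
      .join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b"), "leftsemi")
      .distinct()
    ```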
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes apache#10630 from gatorsmile/IntersectBySemiJoin.
    gatorsmile authored and rxin committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    5f686cc View commit details
    Browse the repository at this point in the history
  11. [SPARK-12818] Polishes spark-sketch module

    Fixes various minor code and Javadoc styling issues.
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes apache#10985 from liancheng/sketch-polishing.
    liancheng committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    2b027e9 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13055] SQLHistoryListener throws ClassCastException

    This is an existing issue uncovered recently by apache#10835. The exception occurred because the `SQLHistoryListener` gets all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener gets `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, then proceeds to cast the Int to a Long, which fails.
    
    The fix is to mark accumulators representing SQL metrics using some internal metadata. Then we can identify which ones are SQL metrics and only process those in the `SQLHistoryListener`.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10971 from andrewor14/fix-sql-history.
    Andrew Or authored and zsxwing committed Jan 29, 2016
    Configuration menu
    Copy the full SHA
    e38b0ba View commit details
    Browse the repository at this point in the history

Commits on Jan 30, 2016

  1. [SPARK-13076][SQL] Rename ClientInterface -> HiveClient

    And ClientWrapper -> HiveClientImpl.
    
    I have some followup pull requests to introduce a new internal catalog, and I think this new naming reflects better the functionality of the two classes.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#10981 from rxin/SPARK-13076.
    rxin committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    2cbc412 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13096][TEST] Fix flaky verifyPeakExecutionMemorySet

    Previously we would assert things before all events are guaranteed to have been processed. To fix this, just block until all events are actually processed, i.e. until the listener queue is empty.
    
    https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/79/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling/
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10990 from andrewor14/accum-suite-less-flaky.
    Andrew Or committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    e6ceac4 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13088] Fix DAG viz in latest version of chrome

    Apparently chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: andrewor14/dagre-d3@7d6c000, which is taken from the fix in the main repo: dagrejs/dagre-d3@1ef067f
    
    Upstream issue: dagrejs/dagre-d3#202
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10986 from andrewor14/fix-dag-viz.
    Andrew Or committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    70e69fc View commit details
    Browse the repository at this point in the history
  4. [SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics

    This issue is causing tests to fail consistently in master with Hadoop 2.6 / 2.7. This is because for Hadoop 2.5+ we overwrite existing values of `InputMetrics#bytesRead` in each call to `HadoopRDD#compute`. In the case of coalesce, e.g.
    ```
    sc.textFile(..., 4).coalesce(2).count()
    ```
    we will call `compute` multiple times in the same task, overwriting `bytesRead` values from previous calls to `compute`.
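
    A tiny sketch of the difference (illustrative only, not the actual InputMetrics code): setting the value loses the bytes read by earlier splits in the same task, while incrementing accumulates them.
    
    ```scala
    var bytesRead = 0L
    def setBytesRead(v: Long): Unit = { bytesRead = v }    // overwrites earlier splits
    def incBytesRead(v: Long): Unit = { bytesRead += v }   // keeps the running total
    ```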
    
    For a regression test, see `InputOutputMetricsSuite.input metrics for old hadoop with coalesce`. I did not add a new regression test because it's impossible without significant refactoring; there's a lot of existing duplicate code in this corner of Spark.
    
    This was caused by apache#10835.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10973 from andrewor14/fix-input-metrics-coalesce.
    Andrew Or committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    12252d1 View commit details
    Browse the repository at this point in the history
  5. [SPARK-12914] [SQL] generate aggregation with grouping keys

    This PR add support for grouping keys for generated TungstenAggregate.
    
    Spilling and performance improvements for BytesToBytesMap will be done by followup PR.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10855 from davies/gen_keys.
    Davies Liu authored and davies committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    e6a02c6 View commit details
    Browse the repository at this point in the history
  6. [SPARK-13098] [SQL] remove GenericInternalRowWithSchema

    This class is only used for serialization of Python DataFrame. However, we don't require internal row there, so `GenericRowWithSchema` can also do the job.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10992 from cloud-fan/python.
    cloud-fan authored and davies committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    dab246f View commit details
    Browse the repository at this point in the history
  7. [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version

    This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
    
    The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
    
    After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#10608 from JoshRosen/SPARK-6363.
    JoshRosen authored and rxin committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    289373b View commit details
    Browse the repository at this point in the history
  8. [SPARK-13100][SQL] improving the performance of stringToDate method i…

    …n DateTimeUtils.scala
    
    In JDK 1.7, TimeZone.getTimeZone() is synchronized, so use an instance variable to hold a GMT TimeZone object instead of instantiating it every time.
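
    A rough sketch of the idea (the object and method names below are illustrative, not the actual DateTimeUtils code):
    
    ```scala
    import java.util.{Calendar, TimeZone}
    
    object DateParsing {
      // TimeZone.getTimeZone is synchronized on JDK 7, so hold a single GMT
      // instance instead of looking it up on every call.
      private val gmt: TimeZone = TimeZone.getTimeZone("GMT")
    
      def newGmtCalendar(): Calendar = Calendar.getInstance(gmt)
    }
    ```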
    
    Author: wangyang <wangyang@haizhi.com>
    
    Closes apache#10994 from wangyang1992/datetimeUtil.
    wangyang authored and rxin committed Jan 30, 2016
    Configuration menu
    Copy the full SHA
    de28371 View commit details
    Browse the repository at this point in the history

Commits on Jan 31, 2016

  1. [SPARK-13070][SQL] Better error message when Parquet schema merging f…

    …ails
    
    Make sure we throw better error messages when Parquet schema merging fails.
    
    Author: Cheng Lian <lian@databricks.com>
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#10979 from viirya/schema-merging-failure-message.
    liancheng authored and rxin committed Jan 31, 2016
    Configuration menu
    Copy the full SHA
    a1303de View commit details
    Browse the repository at this point in the history
  2. [SPARK-12689][SQL] Migrate DDL parsing to the newly absorbed parser

    JIRA: https://issues.apache.org/jira/browse/SPARK-12689
    
    DDLParser processes three commands: createTable, describeTable and refreshTable.
    This patch migrates the three commands to the newly absorbed parser.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Liang-Chi Hsieh <viirya@appier.com>
    
    Closes apache#10723 from viirya/migrate-ddl-describe.
    viirya authored and rxin committed Jan 31, 2016
    Configuration menu
    Copy the full SHA
    0e6d92d View commit details
    Browse the repository at this point in the history
  3. [SPARK-13049] Add First/last with ignore nulls to functions.scala

    This PR adds the ability to specify the ```ignoreNulls``` option in the functions DSL, e.g.:
    ```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other")))```
    
    This PR is somewhere between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport it to 1.6.
    
    cc yhuai
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#10957 from hvanhovell/SPARK-13049.
    hvanhovell authored and rxin committed Jan 31, 2016
    Configuration menu
    Copy the full SHA
    5a8b978 View commit details
    Browse the repository at this point in the history

Commits on Feb 1, 2016

  1. [SPARK-13093] [SQL] improve null check in nullSafeCodeGen for unary, …

    …binary and ternary expression
    
    The current implementation is sub-optimal:
    
    * If an expression is always nullable, e.g. `Unhex`, we can still remove null check for children if they are not nullable.
    * If an expression has some non-nullable children, we can still remove null check for these children and keep null check for others.
    
    This PR improves this by making the null check elimination more fine-grained.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10987 from cloud-fan/null-check.
    cloud-fan authored and davies committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    c1da4d4 View commit details
    Browse the repository at this point in the history
  2. [SPARK-6847][CORE][STREAMING] Fix stack overflow issue when updateSta…

    …teByKey is followed by a checkpointed dstream
    
    Add a local property to indicate whether to checkpoint all RDDs that are marked with the checkpoint flag, and enable it in Streaming.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#10934 from zsxwing/recursive-checkpoint.
    zsxwing authored and Andrew Or committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    6075573 View commit details
    Browse the repository at this point in the history
  3. [SPARK-12989][SQL] Delaying Alias Cleanup after ExtractWindowExpressions

    JIRA: https://issues.apache.org/jira/browse/SPARK-12989
    
    In the rule `ExtractWindowExpressions`, we simply replace aliases by the corresponding attributes. However, this causes an issue exposed by the following case:
    
    ```scala
    val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
      .withColumn("Data", struct("A", "B", "C"))
      .drop("A")
      .drop("B")
      .drop("C")
    
    val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
    data.select($"*", max("num").over(winSpec) as "max").explain(true)
    ```
    In this case, both `Data.A` and `Data.B` are aliases in `WindowSpecDefinition`. If we replace these alias expressions by their alias names, we can no longer tell what they refer to, since they are not put in `missingExpr` either.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    Author: xiaoli <lixiao1983@gmail.com>
    Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
    
    Closes apache#10963 from gatorsmile/seletStarAfterColDrop.
    gatorsmile authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    33c8a49 View commit details
    Browse the repository at this point in the history
  4. [SPARK-12705][SPARK-10777][SQL] Analyzer Rule ResolveSortReferences

    JIRA: https://issues.apache.org/jira/browse/SPARK-12705
    
    **Scope:**
    This PR is a general fix for sort reference resolution when the child's `outputSet` does not contain the order-by attributes (called *missing attributes*):
      - UnaryNode support is limited to `Project`, `Window`, `Aggregate`, `Distinct`, `Filter`, `RepartitionByExpression`.
      - We will not try to resolve the missing references inside a subquery, unless the outputSet of this subquery contains it.
    
    **General Reference Resolution Rules:**
      - Jump over nodes of the following types: `Distinct`, `Filter`, `RepartitionByExpression`. There is no need to add missing attributes, because their `outputSet` is determined by their `inputSet`, which is the `outputSet` of their children.
      - Group-by expressions in `Aggregate`: missing order-by attributes are not allowed to be added into group-by expressions, since doing so would change the query result; RDBMSs do not allow it either.
      - Aggregate expressions in `Aggregate`: if the group-by expressions in `Aggregate` contain the missing attributes but the aggregate expressions do not, just add them to the aggregate expressions. This resolves the AnalysisExceptions thrown by the three TPC-DS queries.
      - `Project` and `Window` are special. We just need to add the missing attributes to their `projectList`.
    
    **Implementation:**
      1. Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes.
      2. Traverse the whole tree in a post-order manner to add the found missing order-by attributes to the node if their `inputSet` contains the attributes.
      3. If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that are from the same node.
    
    **Risk:**
    Low. This rule is triggered only if ```!s.resolved && child.resolved``` is true, so very few cases are affected.
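
    A hypothetical query of the shape this rule targets (assumes `sqlContext` with implicits in scope): the sort refers to a column that the child `Project` no longer outputs, so the analyzer must push the missing attribute down and project it away afterwards.
    
    ```scala
    import sqlContext.implicits._
    
    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    
    // "age" is dropped by the select, but the orderBy still references it.
    val sorted = people.select($"name").orderBy($"age".desc)
    ```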
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#10678 from gatorsmile/sortWindows.
    gatorsmile authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    8f26eb5 View commit details
    Browse the repository at this point in the history
  5. [DOCS] Fix the jar location of datanucleus in sql-programming-guid.md

    It seems to me that `lib` is better, because the `datanucleus` jars are located in `lib` for release builds.
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10901 from maropu/DocFix.
    maropu authored and marmbrus committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    da9146c View commit details
    Browse the repository at this point in the history
  6. [ML][MINOR] Invalid MulticlassClassification reference in ml-guide

    In [ml-guide](https://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation), there is an invalid reference to the `MulticlassClassificationEvaluator` API doc.
    
    https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.MultiClassClassificationEvaluator
    
    Author: Lewuathe <lewuathe@me.com>
    
    Closes apache#10996 from Lewuathe/fix-typo-in-ml-guide.
    Lewuathe authored and mengxr committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    711ce04 View commit details
    Browse the repository at this point in the history
  7. [SPARK-12463][SPARK-12464][SPARK-12465][SPARK-10647][MESOS] Fix zooke…

    …eper dir with mesos conf and add docs.
    
    Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings.
    
    Author: Timothy Chen <tnachen@gmail.com>
    
    Closes apache#10057 from tnachen/fix_mesos_dir.
    tnachen authored and Andrew Or committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    51b03b7 View commit details
    Browse the repository at this point in the history
  8. [SPARK-12265][MESOS] Spark calls System.exit inside driver instead of…

    … throwing exception
    
    This takes over apache#10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user.
    
    Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com>
    Author: Iulian Dragos <jaguarul@gmail.com>
    
    Closes apache#10921 from dragos/pr/10729.
    nraychaudhuri authored and Andrew Or committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    a41b68b View commit details
    Browse the repository at this point in the history
  9. [SPARK-12979][MESOS] Don’t resolve paths on the local file system in …

    …Mesos scheduler
    
    The driver filesystem is likely different from where the executors will run, so resolving paths (and symlinks, etc.) will lead to invalid paths on executors.
    
    Author: Iulian Dragos <jaguarul@gmail.com>
    
    Closes apache#10923 from dragos/issue/canonical-paths.
    dragos authored and Andrew Or committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    c9b89a0 View commit details
    Browse the repository at this point in the history
  10. [SPARK-13043][SQL] Implement remaining catalyst types in ColumnarBatch.

    This includes: float, boolean, short, decimal and calendar interval.
    
    Decimal is mapped to long or byte array depending on the size and calendar
    interval is mapped to a struct of int and long.
    
    The only remaining type is map. The schema mapping is straightforward but
    we might want to revisit how we deal with this in the rest of the execution
    engine.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#10961 from nongli/spark-13043.
    nongli authored and rxin committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    064b029 View commit details
    Browse the repository at this point in the history
  11. Fix for [SPARK-12854][SQL] Implement complex types support in Columna…

    …rBatch
    
    Fixes build for Scala 2.11.
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes apache#10946 from jaceklaskowski/SPARK-12854-fix.
    jaceklaskowski authored and rxin committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    a2973fe View commit details
    Browse the repository at this point in the history
  12. [SPARK-13078][SQL] API and test cases for internal catalog

    This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper).
    
    I took a look at Hive's internal metastore interface/implementation, and then created this API based on it. I believe this is the minimal set needed in order to achieve all the needed functionality.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#10982 from rxin/SPARK-13078.
    rxin committed Feb 1, 2016
    Configuration menu
    Copy the full SHA
    be7a2fc View commit details
    Browse the repository at this point in the history

Commits on Feb 2, 2016

  1. [SPARK-12637][CORE] Print stage info of finished stages properly

    Improve printing of StageInfo in onStageCompleted
    
    See also apache#10585
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#10922 from srowen/SPARK-12637.
    srowen authored and Andrew Or committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    715a19d View commit details
    Browse the repository at this point in the history
  2. [SPARK-12790][CORE] Remove HistoryServer old multiple files format

    Removed isLegacyLogDirectory code path and updated tests
    andrewor14
    
    Author: felixcheung <felixcheung_m@hotmail.com>
    
    Closes apache#10860 from felixcheung/historyserverformat.
    felixcheung authored and Andrew Or committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    0df3cfb View commit details
    Browse the repository at this point in the history
  3. [SPARK-13130][SQL] Make codegen variable names easier to read

    1. Use lower case
    2. Change long prefixes to something shorter (in this case I am changing only one: TungstenAggregate -> agg).
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11017 from rxin/SPARK-13130.
    rxin committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    0fff5c6 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    b8666fd View commit details
    Browse the repository at this point in the history
  5. [SPARK-13087][SQL] Fix group by function for sort based aggregation

    It is not valid to call `toAttribute` on a `NamedExpression` unless we know for sure that the child produced that `NamedExpression`.  The current code worked fine when the grouping expressions were simple, but when they were a derived value this blew up at execution time.
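
    A hypothetical query shape that exercises this path (assumes `sqlContext` with implicits in scope): the grouping expression is a function of a column, so its `NamedExpression` is produced by the `Aggregate` itself rather than by the child plan.
    
    ```scala
    import org.apache.spark.sql.functions._
    import sqlContext.implicits._
    
    val df = Seq(("alice", 1), ("bob", 2), ("charlie", 3)).toDF("name", "id")
    
    // Grouping on a derived value rather than a plain column.
    val counts = df.groupBy(length($"name")).count()
    ```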
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#11013 from marmbrus/groupByFunction-master.
    marmbrus authored and yhuai committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    22ba213 View commit details
    Browse the repository at this point in the history
  6. [SPARK-10820][SQL] Support for the continuous execution of structured…

    … queries
    
    This is a follow up to 9aadcff that extends Spark SQL to allow users to _repeatedly_ optimize and execute structured queries.  A `ContinuousQuery` can be expressed using SQL, DataFrames or Datasets.  The purpose of this PR is only to add some initial infrastructure which will be extended in subsequent PRs.
    
    ## User-facing API
    
    - `sqlContext.streamFrom` and `df.streamTo` return builder objects that are analogous to the `read/write` interfaces already available to executing queries in a batch-oriented fashion.
    - `ContinuousQuery` provides an interface for interacting with a query that is currently executing in the background.
    
    ## Internal Interfaces
     - `StreamExecution` - executes streaming queries in micro-batches
    
    The following are currently internal, but public APIs will be provided in a future release.
     - `Source` - an interface for providers of continually arriving data.  A source must have a notion of an `Offset` that monotonically tracks what data has arrived.  For fault tolerance, a source must be able to replay data given a start offset.
     - `Sink` - an interface that accepts the results of a continuously executing query.  Also responsible for tracking the offset that should be resumed from in the case of a failure.
    
    ## Testing
     - `MemoryStream` and `MemorySink` - simple implementations of source and sink that keep all data in memory and have methods for simulating durability failures
     - `StreamTest` - a framework for performing actions and checking invariants on a continuous query
    
    Author: Michael Armbrust <michael@databricks.com>
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Author: Josh Rosen <rosenville@gmail.com>
    
    Closes apache#11006 from marmbrus/structured-streaming.
    marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    12a20c1 View commit details
    Browse the repository at this point in the history
  7. [SPARK-13094][SQL] Add encoders for seq/array of primitives

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#11014 from marmbrus/seqEncoders.
    marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    29d9218 View commit details
    Browse the repository at this point in the history
  8. [SPARK-13114][SQL] Add a test for tokens more than the fields in schema

    https://issues.apache.org/jira/browse/SPARK-13114
    
    This PR adds a test for tokens more than the fields in schema.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#11020 from HyukjinKwon/SPARK-13114.
    HyukjinKwon authored and rxin committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    b938301 View commit details
    Browse the repository at this point in the history
  9. [SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to cons…

    …istent format
    
    Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the clustering module.
    
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes apache#10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
    BryanCutler authored and mengxr committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    cba1d6b View commit details
    Browse the repository at this point in the history
  10. [SPARK-13056][SQL] map column would throw NPE if value is null

    Jira:
    https://issues.apache.org/jira/browse/SPARK-13056
    
    Create a map like `{ "a": "somestring", "b": null }` and run a query like `SELECT col["b"] FROM t1;`. An NPE would be thrown.
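
    A hypothetical reproduction of the problem (assumes `sqlContext` with implicits in scope; the table name is made up):
    
    ```scala
    import sqlContext.implicits._
    
    val df = Seq(Tuple1(Map("a" -> "somestring", "b" -> null))).toDF("col")
    df.registerTempTable("t1")
    
    // Before the fix, reading the null-valued key threw a NullPointerException.
    sqlContext.sql("""SELECT col["b"] FROM t1""").show()
    ```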
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes apache#10964 from adrian-wang/npewriter.
    adrian-wang authored and marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    358300c View commit details
    Browse the repository at this point in the history
  11. [SPARK-12711][ML] ML StopWordsRemover does not protect itself from co…

    …lumn name duplication
    
    Fixes the problem and verifies the fix with a test suite.
    Also adds an optional `nullable` (Boolean) parameter to `SchemaUtils.appendColumn`
    and deduplicates the `SchemaUtils.appendColumn` functions.
    
    Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>
    
    Closes apache#10741 from grzegorz-chilkiewicz/master.
    grzegorz-chilkiewicz authored and jkbradley committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    b1835d7 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13138][SQL] Add "logical" package prefix for ddl.scala

    ddl.scala is defined in the execution package, and yet its references to "UnaryNode" and "Command" are logical ones. This was fairly confusing when I was trying to understand the DDL code.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11021 from rxin/SPARK-13138.
    rxin committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    7f6e3ec View commit details
    Browse the repository at this point in the history
  13. [SPARK-12913] [SQL] Improve performance of stat functions

    As benchmarked and discussed here: https://github.com/apache/spark/pull/10786/files#r50038294, thanks to codegen, a declarative aggregate function can be much faster than an imperative one.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10960 from davies/stddev.
    Davies Liu authored and davies committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    be5dd88 View commit details
    Browse the repository at this point in the history
  14. [SPARK-13121][STREAMING] java mapWithState mishandles scala Option

    Already merged into the 1.6 branch; this PR commits the same change to master.
    
    Author: Gabriele Nizzoli <mail@nizzoli.net>
    
    Closes apache#11028 from gabrielenizzoli/patch-1.
    gabrielenizzoli authored and zsxwing committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    d0df2ca View commit details
    Browse the repository at this point in the history
  15. [DOCS] Update StructType.scala

    The example will throw an error like:
    <console>:20: error: not found: value StructType
    
    We need to add this line:
    import org.apache.spark.sql.types._
    
    Author: Kevin (Sangwoo) Kim <sangwookim.me@gmail.com>
    
    Closes apache#10141 from swkimme/patch-1.
    swkimme authored and marmbrus committed Feb 2, 2016
    Configuration menu
    Copy the full SHA
    b377b03 View commit details
    Browse the repository at this point in the history

Commits on Feb 3, 2016

  1. [SPARK-13150] [SQL] disable two flaky tests

    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11037 from davies/disable_flaky.
    Davies Liu authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    6de6a97 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13020][SQL][TEST] fix random generator for map type

    When we generate a map, we first randomly pick a length, then create a seq of key-value pairs with the expected length, and finally call `toMap`. However, `toMap` removes all duplicated keys, which makes the actual map size much smaller than expected.
    
    This PR fixes the problem by putting keys in a set first, to guarantee we have enough distinct keys to build a map of the expected length.
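
    A small illustration of the issue and the fix (not the actual RandomDataGenerator code):
    
    ```scala
    import scala.util.Random
    
    // Naive generation undershoots the requested size: toMap drops duplicate keys.
    val pairs = Seq(1 -> "a", 1 -> "b", 2 -> "c")
    assert(pairs.toMap.size == 2)   // asked for 3 entries, got 2
    
    // Collecting distinct keys in a set first guarantees the expected map size.
    def randomMap(n: Int): Map[Int, String] = {
      val keys = scala.collection.mutable.Set.empty[Int]
      while (keys.size < n) keys += Random.nextInt()
      keys.toSeq.map(k => k -> Random.nextString(5)).toMap
    }
    ```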
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10930 from cloud-fan/random-generator.
    cloud-fan authored and yhuai committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    672032d View commit details
    Browse the repository at this point in the history
  3. [SPARK-12992] [SQL] Update parquet reader to support more types when …

    …decoding to ColumnarBatch.
    
    This patch implements support for more types when doing the vectorized decode. There are
    a few more types remaining but they should be very straightforward after this. This code
    has a few copy and paste pieces but they are difficult to eliminate due to performance
    considerations.
    
    Specifically, this patch adds support for:
      - String, Long, Byte types
      - Dictionary encoding for those types.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#10908 from nongli/spark-12992.
    nongli authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    21112e8 View commit details
    Browse the repository at this point in the history
  4. [SPARK-13122] Fix race condition in MemoryStore.unrollSafely()

    https://issues.apache.org/jira/browse/SPARK-13122
    
    A race condition can occur in MemoryStore's unrollSafely() method if two threads that
    return the same value for currentTaskAttemptId() execute this method concurrently. This
    change makes the operation of reading the initial amount of unroll memory used, performing
    the unroll, and updating the associated memory maps atomic in order to avoid this race
    condition.
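
    A toy sketch of the shape of the fix (not the actual MemoryStore code): reading the current value and updating the shared map happen inside one synchronized block, so two tasks reporting the same attempt id cannot interleave.
    
    ```scala
    import scala.collection.mutable
    
    val lock = new Object
    val unrollMemoryByTask = mutable.Map.empty[Long, Long]
    
    def reserveUnrollMemory(taskAttemptId: Long, bytes: Long): Unit = lock.synchronized {
      val current = unrollMemoryByTask.getOrElse(taskAttemptId, 0L)
      unrollMemoryByTask(taskAttemptId) = current + bytes
    }
    ```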
    
    The initially proposed fix wraps all of unrollSafely() in a memoryManager.synchronized { } block. A cleaner approach might be to introduce a mechanism that synchronizes based on task attempt ID. An alternative option might be to track unroll/pending unroll memory based on block ID rather than task attempt ID.
    
    Author: Adam Budde <budde@amazon.com>
    
    Closes apache#11012 from budde/master.
    Adam Budde authored and Andrew Or committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    ff71261 View commit details
    Browse the repository at this point in the history
  5. [SPARK-12951] [SQL] support spilling in generated aggregate

    This PR add spilling support for generated TungstenAggregate.
    
    If spilling happens, falling back to the iterator-based sort-merge aggregate (not generated) is not that bad.
    
    The changes will be covered by TungstenAggregationQueryWithControlledFallbackSuite
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10998 from davies/gen_spilling.
    Davies Liu authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    99a6e3c View commit details
    Browse the repository at this point in the history
  6. [SPARK-12732][ML] bug fix in linear regression train

    Fixed the bug in linear regression train for the case when the target variable is constant. The two cases for `fitIntercept=true` or `fitIntercept=false` should be treated differently.
    
    Author: Imran Younus <iyounus@us.ibm.com>
    
    Closes apache#10702 from iyounus/SPARK-12732_bug_fix_in_linear_regression_train.
    iyounus authored and DB Tsai committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    0557146 View commit details
    Browse the repository at this point in the history
  7. [SPARK-7997][CORE] Add rpcEnv.awaitTermination() back to SparkEnv

    `rpcEnv.awaitTermination()` was not added in apache#10854 because some Streaming Python tests hung forever.
    
    This patch fixed the hung issue and added rpcEnv.awaitTermination() back to SparkEnv.
    
    Previously, the Streaming Kafka Python tests shut down the ZooKeeper server before stopping the StreamingContext. When stopping the StreamingContext, KafkaReceiver may then hang due to https://issues.apache.org/jira/browse/KAFKA-601; hence, some threads of RpcEnv's Dispatcher cannot exit and rpcEnv.awaitTermination hangs. The patch just changes the shutdown order to fix this.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11031 from zsxwing/awaitTermination.
    zsxwing authored and rxin committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    335f10e View commit details
    Browse the repository at this point in the history
  8. [SPARK-13147] [SQL] improve readability of generated code

    1. try to avoid the suffix (unique id)
    2. remove the comment if there is no code generated.
    3. re-arrange the order of functions
    4. drop the new line for inlined blocks.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11032 from davies/better_suffix.
    Davies Liu authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    e86f8f6 View commit details
    Browse the repository at this point in the history
  9. [SPARK-12957][SQL] Initial support for constraint propagation in Spar…

    …kSQL
    
    Based on the semantics of a query, we can derive a number of data constraints on the output of each (logical or physical) operator. For instance, if a filter defines `'a > 10`, we know that the output data of this filter satisfies 2 constraints:
    
    1. `'a > 10`
    2. `isNotNull('a)`
    
    This PR proposes a possible way of keeping track of these constraints and propagating them in the logical plan, which can then help us build more advanced optimizations (such as pruning redundant filters, optimizing joins, among others). We define constraints as a set of (implicitly conjunctive) expressions. For example, if a filter operator has constraints = `Set('a > 10, 'b < 100)`, it is implied that the outputs satisfy both individual constraints (i.e., `'a > 10` AND `'b < 100`).
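
    A toy sketch of the concept outside Catalyst (the expression classes below are made up for illustration):
    
    ```scala
    sealed trait Expr
    case class Attr(name: String) extends Expr
    case class GreaterThan(attr: Attr, value: Int) extends Expr
    case class IsNotNull(attr: Attr) extends Expr
    
    // A filter's condition implies a set of (conjunctive) constraints on its output,
    // e.g. 'a > 10 implies both 'a > 10 and isNotNull('a).
    def constraintsOf(condition: Expr): Set[Expr] = condition match {
      case gt @ GreaterThan(a, _) => Set(gt, IsNotNull(a))
      case other                  => Set(other)
    }
    ```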
    
    Design Document: https://docs.google.com/a/databricks.com/document/d/1WQRgDurUBV9Y6CWOBS75PQIqJwT-6WftVa18xzm7nCo/edit?usp=sharing
    
    Author: Sameer Agarwal <sameer@databricks.com>
    
    Closes apache#10844 from sameeragarwal/constraints.
    sameeragarwal authored and marmbrus committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    138c300 View commit details
    Browse the repository at this point in the history
  10. [SPARK-12739][STREAMING] Details of batch in Streaming tab uses two D…

    …uration columns
    
    I have clearly prefixed the two 'Duration' columns in the 'Details of Batch' Streaming tab as 'Output Op Duration' and 'Job Duration'.
    
    Author: Mario Briggs <mario.briggs@in.ibm.com>
    Author: mariobriggs <mariobriggs@in.ibm.com>
    
    Closes apache#11022 from mariobriggs/spark-12739.
    mariobriggs authored and zsxwing committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    e9eb248 View commit details
    Browse the repository at this point in the history
  11. [SPARK-12798] [SQL] generated BroadcastHashJoin

    A row from the stream side could match multiple rows on the build side, so the loop over these matched rows should not be interrupted when emitting a row. We therefore buffer the output rows in a linked list and check the termination condition in the producer loop (for example, Range or Aggregate).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10989 from davies/gen_join.
    Davies Liu authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    c4feec2 View commit details
    Browse the repository at this point in the history
  12. [SPARK-13157] [SQL] Support any kind of input for SQL commands.

    The ```SparkSqlLexer``` currently swallows characters which have not been defined in the grammar. This causes problems with SQL commands, such as: ```add jar file:///tmp/ab/TestUDTF.jar```. In this example the `````` is swallowed.
    
    This PR adds an extra Lexer rule to handle such input, and makes a tiny modification to the ```ASTNode```.
    
    cc davies liancheng
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#11052 from hvanhovell/SPARK-13157.
    hvanhovell authored and davies committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    9dd2741 View commit details
    Browse the repository at this point in the history
  13. [SPARK-3611][WEB UI] Show number of cores for each executor in applic…

    …ation web UI
    
    Added a Cores column in the Executors UI
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#11039 from ajbozarth/spark3611.
    ajbozarth authored and zsxwing committed Feb 3, 2016
    Configuration menu
    Copy the full SHA
    3221edd View commit details
    Browse the repository at this point in the history

Commits on Feb 4, 2016

  1. [SPARK-13166][SQL] Remove DataStreamReader/Writer

    They seem redundant and we can simply use DataFrameReader/Writer. The new usage looks like:
    
    ```scala
    val df = sqlContext.read.stream("...")
    val handle = df.write.stream("...")
    handle.stop()
    ```
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11062 from rxin/SPARK-13166.
    rxin authored and marmbrus committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    915a753 View commit details
    Browse the repository at this point in the history
  2. [SPARK-13131] [SQL] Use best and average time in benchmark

    Best time is more stable than average time; this also adds a column for nanoseconds per row (which can be used to estimate the contribution of each component in a query).
    
    Having best time and average time together gives more information (we can see some of the variance).
    
    Rate, time per row, and relative are all calculated using the best time.
    
    The result looks like this:
    ```
    Intel(R) Core(TM) i7-4558U CPU  2.80GHz
    rang/filter/sum:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    rang/filter/sum codegen=false          14332 / 16646         36.0          27.8       1.0X
    rang/filter/sum codegen=true              845 /  940        620.0           1.6      17.0X
    ```
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11018 from davies/gen_bench.
    Davies Liu authored and davies committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    de09145 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13152][CORE] Fix task metrics deprecation warning

    Make an internal non-deprecated version of incBytesRead and incRecordsRead so we don't have unnecessary deprecation warnings in our build.
    
    Right now incBytesRead and incRecordsRead are marked as deprecated and for internal use only. We should make private[spark] versions which are not deprecated and switch to those internally so as to not clutter up the warning messages when building.
    
    cc andrewor14 who did the initial deprecation
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#11056 from holdenk/SPARK-13152-fix-task-metrics-deprecation-warnings.
    holdenk authored and Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    a8e2ba7 View commit details
    Browse the repository at this point in the history
  4. [SPARK-13079][SQL] Extend and implement InMemoryCatalog

    This is a step towards consolidating `SQLContext` and `HiveContext`.
    
    This patch extends the existing Catalog API added in apache#10982 to include methods for handling table partitions. In particular, a partition is identified by `PartitionSpec`, which is just a `Map[String, String]`. The Catalog is still not used by anything yet, but its API is now more or less complete and an implementation is fully tested.
    
    About 200 lines are test code.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#11069 from andrewor14/catalog.
    Andrew Or authored and rxin committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    a648311 View commit details
    Browse the repository at this point in the history
  5. [SPARK-12828][SQL] add natural join support

    Jira:
    https://issues.apache.org/jira/browse/SPARK-12828
    
    Author: Daoyuan Wang <daoyuan.wang@intel.com>
    
    Closes apache#10762 from adrian-wang/naturaljoin.
    adrian-wang authored and rxin committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    0f81318 View commit details
    Browse the repository at this point in the history
  6. [ML][DOC] fix wrong api link in ml onevsrest

    minor fix for api link in ml onevsrest
    
    Author: Yuhao Yang <hhbyyh@gmail.com>
    
    Closes apache#11068 from hhbyyh/onevsrestDoc.
    hhbyyh authored and mengxr committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    c2c956b View commit details
    Browse the repository at this point in the history
  7. [SPARK-13113] [CORE] Remove unnecessary bit operation when decoding p…

    …age number
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-13113
    
    As we shift the bits right, it looks like the bitwise AND operation is unnecessary.
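
    An illustration of the observation, using a 13-bit page number over a 51-bit offset as an assumed layout: after the unsigned right shift, the low-order offset bits are already gone, so the preceding AND with the page mask is redundant.
    
    ```scala
    val OFFSET_BITS = 51
    val PAGE_MASK   = ((1L << 13) - 1) << OFFSET_BITS
    
    def decodePageNumberOld(address: Long): Int =
      ((address & PAGE_MASK) >>> OFFSET_BITS).toInt
    
    def decodePageNumberNew(address: Long): Int =
      (address >>> OFFSET_BITS).toInt   // same result, without the bitwise AND
    ```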
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#11002 from viirya/improve-decodepagenumber.
    viirya authored and davies committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    d390871 View commit details
    Browse the repository at this point in the history
  8. [SPARK-12828][SQL] Natural join follow-up

    This is a small addendum to apache#10762 to make the code more robust again future changes.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11070 from rxin/SPARK-12828-natural-join.
    rxin committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    dee801a View commit details
    Browse the repository at this point in the history
  9. [SPARK-12330][MESOS] Fix mesos coarse mode cleanup

    In the current implementation the mesos coarse scheduler does not wait for the mesos tasks to complete before ending the driver. This causes a race where the task has to finish cleaning up before the mesos driver terminates it with a SIGINT (and SIGKILL after 3 seconds if the SIGINT doesn't work).
    
    This PR causes the mesos coarse scheduler to wait for the mesos tasks to finish (with a timeout defined by `spark.mesos.coarse.shutdown.ms`)
    
    This PR also fixes a regression caused by [SPARK-10987] whereby submitting a shutdown causes a race between the local shutdown procedure and the notification of the scheduler driver disconnection. If the scheduler driver disconnection wins the race, the coarse executor incorrectly exits with status 1 (instead of the proper status 0)
    
    With this patch the mesos coarse scheduler terminates properly, the executors clean up, and the tasks are reported as `FINISHED` in the Mesos console (as opposed to `KILLED` in < 1.6 or `FAILED` in 1.6 and later)
    
    Author: Charles Allen <charles@allen-net.com>
    
    Closes apache#10319 from drcrallen/SPARK-12330.
    drcrallen authored and Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    2eaeafe View commit details
    Browse the repository at this point in the history
  10. [SPARK-13164][CORE] Replace deprecated synchronized buffer in core

    Building with Scala 2.11 produces the warning: "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative." Investigation shows we are already using ConcurrentLinkedQueue in other locations, so this switches our uses of SynchronizedBuffer to ConcurrentLinkedQueue.
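
    A minimal sketch of the substitution (illustrative only):
    
    ```scala
    import java.util.concurrent.ConcurrentLinkedQueue
    
    // Deprecated in Scala 2.11:
    //   val events = new scala.collection.mutable.ArrayBuffer[String]
    //       with scala.collection.mutable.SynchronizedBuffer[String]
    
    // Preferred replacement:
    val events = new ConcurrentLinkedQueue[String]()
    events.add("stage submitted")
    events.add("stage completed")
    ```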
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#11059 from holdenk/SPARK-13164-replace-deprecated-synchronized-buffer-in-core.
    holdenk authored and Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    62a7c28 View commit details
    Browse the repository at this point in the history
  11. [SPARK-13162] Standalone mode does not respect initial executors

    Currently the Master would always set an application's initial executor limit to infinity. If the user specified `spark.dynamicAllocation.initialExecutors`, the config would not take effect. This is similar to apache#11047 but for standalone mode.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#11054 from andrewor14/standalone-da-initial.
    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    4120bcb View commit details
    Browse the repository at this point in the history
  12. [SPARK-13053][TEST] Unignore tests in InternalAccumulatorSuite

    These were ignored because they are incorrectly written; they don't actually trigger stage retries, which is what the tests are testing. These tests are now rewritten to induce stage retries through fetch failures.
    
    Note: there were 2 tests before and now there's only 1. What happened? It turns out that the case where we only resubmit a subset of the original missing partitions is very difficult to simulate in tests without potentially introducing flakiness. This is because the `DAGScheduler` removes all map outputs associated with a given executor when this happens, we would need multiple executors to trigger this case, and sometimes the scheduler still removes map outputs from all executors.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10969 from andrewor14/unignore-accum-test.
    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    15205da View commit details
    Browse the repository at this point in the history
  13. MAINTENANCE: Automated closing of pull requests.

    This commit exists to close the following pull requests on Github:
    
    Closes apache#7971 (requested by yhuai)
    Closes apache#8539 (requested by srowen)
    Closes apache#8746 (requested by yhuai)
    Closes apache#9288 (requested by andrewor14)
    Closes apache#9321 (requested by andrewor14)
    Closes apache#9935 (requested by JoshRosen)
    Closes apache#10442 (requested by andrewor14)
    Closes apache#10585 (requested by srowen)
    Closes apache#10785 (requested by srowen)
    Closes apache#10832 (requested by andrewor14)
    Closes apache#10941 (requested by marmbrus)
    Closes apache#11024 (requested by andrewor14)
    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    085f510 View commit details
    Browse the repository at this point in the history
  14. [SPARK-13168][SQL] Collapse adjacent repartition operators

    Spark SQL should collapse adjacent `Repartition` operators and only keep the last one.
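
    A small sketch of the pattern this rule collapses (assumes `sqlContext` with implicits in scope):
    
    ```scala
    import sqlContext.implicits._
    
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
    
    // Adjacent repartitions: only the last one determines the output partitioning,
    // so the optimizer can keep just the repartition(8).
    val collapsed = df.repartition(4).repartition(8)
    ```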
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#11064 from JoshRosen/collapse-repartition.
    JoshRosen authored and Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    33212cb View commit details
    Browse the repository at this point in the history
  15. [SPARK-12330][MESOS][HOTFIX] Rename timeout config

    The config already describes time and accepts a general format
    that is not restricted to ms. This commit renames the internal
    config to use a format that's consistent in Spark.
    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    c756bda View commit details
    Browse the repository at this point in the history
  16. [SPARK-13079][SQL] InMemoryCatalog follow-ups

    This patch incorporates review feedback from apache#11069, which is already merged.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#11080 from andrewor14/catalog-follow-ups.
    Andrew Or authored and rxin committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    bd38dd6 View commit details
    Browse the repository at this point in the history
  17. [SPARK-13195][STREAMING] Fix NoSuchElementException when a state is n…

    …ot set but timeoutThreshold is defined
    
    Check the state's existence before calling get.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11081 from zsxwing/SPARK-13195.
    zsxwing committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    8e2f296 View commit details
    Browse the repository at this point in the history
  18. [HOTFIX] Fix style violation caused by c756bda

    Andrew Or committed Feb 4, 2016
    Configuration menu
    Copy the full SHA
    7a4b37f View commit details
    Browse the repository at this point in the history

Commits on Feb 5, 2016

  1. [SPARK-13052] waitingApps metric doesn't show the number of apps curr…

    …ently in the WAITING state
    
    Author: Raafat Akkad <raafat.akkad@gmail.com>
    
    Closes apache#10959 from RaafatAkkad/master.
    RaafatAkkad authored and Andrew Or committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    6dbfc40 View commit details
    Browse the repository at this point in the history
  2. [SPARK-12850][SQL] Support Bucket Pruning (Predicate Pushdown for Buc…

    …keted Tables)
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-12850
    
    This PR is to support bucket pruning when the predicates are `EqualTo`, `EqualNullSafe`, `IsNull`, `In`, and `InSet`.
    
    As in Hive, the bucket pruning in this PR works only when the bucketing key has exactly one column.
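
    A hypothetical setup where this pruning applies, assuming the bucketed-write API on `DataFrameWriter` and a `sqlContext` with implicits in scope (the table and column names are made up):
    
    ```scala
    import sqlContext.implicits._
    
    val df = Seq((42, "click"), (7, "view")).toDF("user_id", "event")
    
    // Write a table bucketed by a single column.
    df.write.bucketBy(8, "user_id").sortBy("user_id").saveAsTable("events_bucketed")
    
    // With an equality predicate on the bucketing key, only the bucket that can
    // contain user_id = 42 needs to be scanned.
    sqlContext.sql("SELECT * FROM events_bucketed WHERE user_id = 42")
    ```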
    
    So far, I have not found a way to verify how many buckets are actually scanned. However, I did verify it while debugging. Could you suggest how to do this properly? Thank you! cloud-fan yhuai rxin marmbrus
    
    BTW, we can add more cases to support complex predicate including `Or` and `And`. Please let me know if I should do it in this PR.
    
    Maybe we also need to add test cases to verify if bucket pruning works well for each data type.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#10942 from gatorsmile/pruningBuckets.
    gatorsmile authored and rxin committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    e3c75c6 View commit details
    Browse the repository at this point in the history
  3. [SPARK-13208][CORE] Replace use of Pairs with Tuple2s

    Another trivial deprecation fix for Scala 2.11
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes apache#11089 from jodersky/SPARK-13208.
    jodersky authored and rxin committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    352102e View commit details
    Browse the repository at this point in the history
  4. [SPARK-13187][SQL] Add boolean/long/double options in DataFrameReader…

    …/Writer
    
    This patch adds `option` functions for boolean, long, and double types. This makes it slightly easier for Spark users to specify options without turning them into strings. Using the JSON data source as an example:
    
    Before this patch:
    ```scala
    sqlContext.read.option("primitivesAsString", "true").json("/path/to/json")
    ```
    
    After this patch:
    ```scala
    sqlContext.read.option("primitivesAsString", true).json("/path/to/json")
    ```
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11072 from rxin/SPARK-13187.
    rxin committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    82d84ff View commit details
    Browse the repository at this point in the history
  5. [SPARK-13166][SQL] Rename DataStreamReaderWriterSuite to DataFrameRea…

    …derWriterSuite
    
    A follow up PR for apache#11062 because it didn't rename the test suite.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11096 from zsxwing/rename.
    zsxwing committed Feb 5, 2016
    Configuration menu
    Copy the full SHA
    7b73f17 View commit details
    Browse the repository at this point in the history
  6. [SPARK-12939][SQL] migrate encoder resolution logic to Analyzer

    https://issues.apache.org/jira/browse/SPARK-12939
    
    Now we will catch `ObjectOperator` in `Analyzer` and resolve the `fromRowExpression/deserializer` inside it. Also update `MapGroups` and `CoGroup` to pass in `dataAttributes`, so that we can correctly resolve the value deserializer (the `child.output` contains both the grouping key and the values, which may mess things up if they have same-name attributes). End-to-end tests are added.
    
    follow-ups:
    
    * remove encoders from typed aggregate expression.
    * completely remove resolve/bind in `ExpressionEncoder`
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10852 from cloud-fan/bug.
    cloud-fan authored and marmbrus committed Feb 5, 2016
    Commit 1ed354a
  7. [SPARK-13214][DOCS] update dynamicAllocation documentation

    Author: Bill Chambers <bill@databricks.com>
    
    Closes apache#11094 from anabranch/dynamic-docs.
    Bill Chambers authored and Andrew Or committed Feb 5, 2016
    Commit 66e1383
  8. [SPARK-13002][MESOS] Send initial request of executors for dyn alloca…

    …tion
    
    Fix for [SPARK-13002](https://issues.apache.org/jira/browse/SPARK-13002) about the initial number of executors when running with dynamic allocation on Mesos.
    Instead of fixing it just for the Mesos case, the change is made in `ExecutorAllocationManager`. It already drives the number of executors running on Mesos, just not the initial value.
    
    The `None` and `Some(0)` are internal details of the computation of resources to reserve in the Mesos backend scheduler. `executorLimitOption` has to be initialized correctly, otherwise the Mesos backend scheduler will either create too many executors at launch, or not create any executors and be unable to recover from this state.
    
    Removed the 'special case' description in the doc. It was not totally accurate, and is not needed anymore.
    
    This doesn't fix the same problem visible with Spark standalone. There is no straightforward way to send the initial value in standalone mode.
    
    Somebody knowing this part of the yarn support should review this change.
    
    Author: Luc Bourlier <luc.bourlier@typesafe.com>
    
    Closes apache#11047 from skyluc/issue/initial-dyn-alloc-2.
    Luc Bourlier authored and Andrew Or committed Feb 5, 2016
    Commit 0bb5b73
  9. [SPARK-13215] [SQL] remove fallback in codegen

    Since we removed the configuration for codegen, we now rely heavily on codegen (TungstenAggregate also requires the generated MutableProjection to update UnsafeRow), so we should remove the fallback, which could confuse users; see the discussion in SPARK-13116.
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11097 from davies/remove_fallback.
    Davies Liu authored and davies committed Feb 5, 2016
    Commit 875f507

Commits on Feb 6, 2016

  1. [SPARK-13171][CORE] Replace future calls with Future

    Trivial search-and-replace to eliminate deprecation warnings in Scala 2.11.
    Also works with 2.10
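    For reference, a minimal sketch of the substitution, assuming the deprecated `scala.concurrent.future { ... }` calls are swapped for the equivalent `Future { ... }`:
    
    ```scala
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    
    // Before (deprecated in Scala 2.11): future { 1 + 1 }
    // After:
    val result = Future { 1 + 1 }
    ```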
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes apache#11085 from jodersky/SPARK-13171.
    jodersky authored and rxin committed Feb 6, 2016
    Commit 6883a51
  2. Commit 4f28291
  3. [SPARK-5865][API DOC] Add doc warnings for methods that return local …

    …data structures
    
    rxin srowen
    I worked out the note message for the rdd.take function; please help to review.
    
    If it's fine, I can apply it to all the other functions later.
    
    Author: Tommy YU <tummyyu@163.com>
    
    Closes apache#10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
    Wenpei authored and srowen committed Feb 6, 2016
    Commit 81da3be

Commits on Feb 7, 2016

  1. [SPARK-13132][MLLIB] cache standardization param value in LogisticReg…

    …ression
    
    Cache the value of the standardization Param in LogisticRegression, rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-Newton optimizer.
    
    Also, fix Param#toString to cache the stringified representation, rather than re-interpolating it on every call, so any other implementations that have similar repeated access patterns will see a benefit.
    
    This change improves training times for one of my test sets from ~7m30s to ~4m30s.
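    A simplified, hypothetical sketch of the pattern (not the actual LogisticRegression internals): hoist the repeated lookup out of the per-index, per-iteration loop and read it once.
    
    ```scala
    // Illustrative only: `fetchParam` stands in for reading a value from
    // the ParamMap; the point is that it is read once, outside the hot loop.
    def fetchParam(name: String): Boolean = true    // hypothetical lookup
    val numFeatures = 100                           // hypothetical size
    
    val standardize = fetchParam("standardization") // cached once
    (0 until numFeatures).foreach { i =>
      if (standardize) { /* per-feature standardization work */ }
    }
    ```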
    
    Author: Gary King <gary@idibon.com>
    
    Closes apache#11027 from idigary/spark-13132-optimize-logistic-regression.
    idigary authored and srowen committed Feb 7, 2016
    Commit bc8890b
  2. [SPARK-10963][STREAMING][KAFKA] make KafkaCluster public

    Author: cody koeninger <cody@koeninger.org>
    
    Closes apache#9007 from koeninger/SPARK-10963.
    koeninger authored and srowen committed Feb 7, 2016
    Commit 140ddef

Commits on Feb 8, 2016

  1. [SPARK-12986][DOC] Fix pydoc warnings in mllib/regression.py

    I have fixed the warnings by running "make html" under "python/docs/". They are caused by not having blank lines around indented paragraphs.
    
    Author: Nam Pham <phamducnam@gmail.com>
    
    Closes apache#11025 from nampham2/SPARK-12986.
    nampham2 authored and mengxr committed Feb 8, 2016
    Commit edf4a0e
  2. [SPARK-8964] [SQL] Use Exchange to perform shuffle in Limit

    This patch changes the implementation of the physical `Limit` operator so that it relies on the `Exchange` operator to perform data movement rather than directly using `ShuffledRDD`. In addition to improving efficiency, this lays the necessary groundwork for further optimization of limit, such as limit pushdown or whole-stage codegen.
    
    At a high-level, this replaces the old physical `Limit` operator with two new operators, `LocalLimit` and `GlobalLimit`. `LocalLimit` performs per-partition limits, while `GlobalLimit` applies the final limit to a single partition; `GlobalLimit`'s declares that its `requiredInputDistribution` is `SinglePartition`, which will cause the planner to use an `Exchange` to perform the appropriate shuffles. Thus, a logical `Limit` appearing in the middle of a query plan will be expanded into `LocalLimit -> Exchange to one partition -> GlobalLimit`.
    
    In the old code, calling `someDataFrame.limit(100).collect()` or `someDataFrame.take(100)` would actually skip the shuffle and use a fast-path which used `executeTake()` in order to avoid computing all partitions in case only a small number of rows were requested. This patch preserves this optimization by treating logical `Limit` operators specially when they appear as the terminal operator in a query plan: if a `Limit` is the final operator, then we will plan a special `CollectLimit` physical operator which implements the old `take()`-based logic.
    
    In order to be able to match on operators only at the root of the query plan, this patch introduces a special `ReturnAnswer` logical operator which functions similar to `BroadcastHint`: this dummy operator is inserted at the root of the optimized logical plan before invoking the physical planner, allowing the planner to pattern-match on it.
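    A hedged illustration of the two paths described above (assuming a `sqlContext`, as elsewhere in these messages): `explain()` should show the take()-style collect path for a terminal limit, and the local/global split when the limit feeds other operators.
    
    ```scala
    val df = sqlContext.range(0, 1000)
    
    df.limit(100).explain()                         // terminal limit: CollectLimit-style path
    df.limit(100).groupBy("id").count().explain()   // mid-plan limit: LocalLimit -> shuffle -> GlobalLimit
    ```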
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#7334 from JoshRosen/remove-copy-in-limit.
    JoshRosen authored and davies committed Feb 8, 2016
    Commit 06f0df6
  3. [SPARK-13101][SQL] nullability of array type element should not fail …

    …analysis of encoder
    
    nullability should only be considered as an optimization rather than part of the type system, so instead of failing analysis for mismatched nullability, we should pass analysis and add a runtime null check.
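    A rough sketch of the situation with made-up names, assuming a running `sc`/`sqlContext`: the data's array element type is nullable while the encoder's `Array[Int]` element is not; after this change the mismatch no longer fails analysis and is checked at runtime instead.
    
    ```scala
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import sqlContext.implicits._
    
    case class Rec(values: Array[Int])              // non-nullable elements in the encoder
    
    val schema = StructType(Seq(
      StructField("values", ArrayType(IntegerType, containsNull = true))))
    val df = sqlContext.createDataFrame(
      sc.parallelize(Seq(Row(Seq(1, 2, 3)))), schema)
    
    val ds = df.as[Rec]                             // previously failed analysis
    ```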
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#11035 from cloud-fan/ignore-nullability.
    cloud-fan authored and marmbrus committed Feb 8, 2016
    Commit 8e4d15f
  4. [SPARK-13210][SQL] catch OOM when allocate memory and expand array

    There is a bug when we try to grow the buffer: the OOM is wrongly ignored (the assert is also skipped by the JVM), and when we then try to grow the array again, that triggers spilling which frees the current page, so the record we just inserted becomes invalid.
    
    The root cause is that the JVM has less free memory than the MemoryManager thought, so it can OOM when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling.
    
    Also, we should not grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11095 from davies/fix_expand.
    Davies Liu authored and JoshRosen committed Feb 8, 2016
    Commit 37bc203
  5. [SPARK-13095] [SQL] improve performance for broadcast join with dimen…

    …sion table
    
    This PR improves the performance of broadcast joins with dimension tables, which are common in data warehouses.
    
    If the join key can fit in a long, we will use a special API `get(Long)` to get the rows from HashedRelation.
    
    If the HashedRelation only has unique keys, we will use a special API `getValue(Long)` or `getValue(InternalRow)`.
    
    If the keys fit within a long and are also dense, we will use an array of UnsafeRow instead of a hash map.
    
    TODO: will do cleanup
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11065 from davies/gen_dim.
    Davies Liu authored and davies committed Feb 8, 2016
    Commit ff0af0d

Commits on Feb 9, 2016

  1. [SPARK-10620][SPARK-13054] Minor addendum to apache#10835

    Additional changes to apache#10835, mainly related to style and visibility. This patch also adds back a few deprecated methods for backward compatibility.
    
    Author: Andrew Or <andrew@databricks.com>
    
    Closes apache#10958 from andrewor14/task-metrics-to-accums-followups.
    Andrew Or authored and JoshRosen committed Feb 9, 2016
    Commit eeaf45b
  2. [SPARK-12992] [SQL] Support vectorized decoding in UnsafeRowParquetRe…

    …cordReader.
    
    WIP: running tests. Code needs a bit of clean up.
    
    This patch completes the vectorized decoding with the goal of passing the existing tests. There are still more patches to come to support the rest of the format spec, even just for flat schemas.
    
    This patch adds a new flag to enable the vectorized decoding. Tests were updated
    to try with both modes where applicable.
    
    Once this is working well, we can remove the previous code path.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#11055 from nongli/spark-12992-2.
    nongli authored and davies committed Feb 9, 2016
    Commit 3708d13
  3. [SPARK-13176][CORE] Use native file linking instead of external proce…

    …ss ln
    
    Since Spark requires at least JRE 1.7, it is safe to use the built-in java.nio.file.Files API.
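    A minimal sketch of the JDK 7+ replacement for shelling out to `ln`; the paths are illustrative:
    
    ```scala
    import java.nio.file.{Files, Paths}
    
    // Symbolic-link equivalent of `ln -s target link`
    Files.createSymbolicLink(Paths.get("/tmp/spark-link"), Paths.get("/tmp/spark-target"))
    ```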
    
    Author: Jakob Odersky <jakob@odersky.com>
    
    Closes apache#11098 from jodersky/SPARK-13176.
    jodersky authored and srowen committed Feb 9, 2016
    Commit f9307d8
  4. [SPARK-13165][STREAMING] Replace deprecated synchronizedBuffer in str…

    …eaming
    
    Building with Scala 2.11 results in the warning "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative" - we already use ConcurrentLinkedQueue elsewhere, so let's replace it.
    
    Some notes about how behaviour is different for reviewers:
    The Seq from a SynchronizedBuffer that was implicitly converted would continue to receive updates - however when we do the same conversion explicitly on the ConcurrentLinkedQueue this isn't the case. Hence changing some of the (internal & test) APIs to pass an Iterable. toSeq is safe to use if there are no more updates.
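    A small sketch of the pattern, assuming a test that collects results updated from another thread:
    
    ```scala
    import java.util.concurrent.ConcurrentLinkedQueue
    import scala.collection.JavaConverters._
    
    val results = new ConcurrentLinkedQueue[String]()
    results.add("batch-1")                          // written from other threads
    
    // Unlike the old implicit SynchronizedBuffer view, this snapshot does not
    // keep receiving updates, so take it only once updates have stopped.
    val snapshot: Seq[String] = results.asScala.toSeq
    ```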
    
    Author: Holden Karau <holden@us.ibm.com>
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes apache#11067 from holdenk/SPARK-13165-replace-deprecated-synchronizedBuffer-in-streaming.
    holdenk authored and srowen committed Feb 9, 2016
    Commit 159198e
  5. [SPARK-13201][SPARK-13200] Deprecation warning cleanups: KMeans & MFD…

    …ataGenerator
    
    KMeans:
    Make a private non-deprecated version of setRuns API so that we can call it from the PythonAPI without deprecation warnings in our own build. Also use it internally when being called from train. Add a logWarning for non-1 values
    
    MFDataGenerator:
    Apparently we are calling round on an integer which now in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere.
    
    I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
    holdenk authored and srowen committed Feb 9, 2016
    Commit ce83fe9
  6. [SPARK-13040][DOCS] Update JDBC deprecated SPARK_CLASSPATH documentation

    Update JDBC documentation based on http://stackoverflow.com/a/30947090/219530 as SPARK_CLASSPATH is deprecated.
    
    Also, that's how it actually worked; it didn't work with SPARK_CLASSPATH or --jars alone.
    
    This would solve issue: https://issues.apache.org/jira/browse/SPARK-13040
    
    Author: Sebastián Ramírez <tiangolo@gmail.com>
    
    Closes apache#10948 from tiangolo/patch-docs-jdbc.
    tiangolo authored and srowen committed Feb 9, 2016
    Commit c882ec5
  7. [SPARK-13177][EXAMPLES] Update ActorWordCount example to not directly…

    … use low level linked list as it is deprecated.
    
    Author: sachin aggarwal <different.sachin@gmail.com>
    
    Closes apache#11113 from agsachin/master.
    agsachin authored and srowen committed Feb 9, 2016
    Commit d9ba4d2
  8. [SPARK-13086][SHELL] Use the Scala REPL settings, to enable things li…

    …ke `-i file`.
    
    Now:
    
    ```
    $ bin/spark-shell -i test.scala
    NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel).
    16/01/29 17:37:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/01/29 17:37:39 INFO Main: Created spark context..
    Spark context available as sc (master = local[*], app id = local-1454085459000).
    16/01/29 17:37:39 INFO Main: Created sql context..
    SQL context available as sqlContext.
    Loading test.scala...
    hello
    
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
          /_/
    
    Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
    Type in expressions to have them evaluated.
    Type :help for more information.
    ```
    
    Author: Iulian Dragos <jaguarul@gmail.com>
    
    Closes apache#10984 from dragos/issue/repl-eval-file.
    dragos authored and srowen committed Feb 9, 2016
    Commit e30121a
  9. [SPARK-13170][STREAMING] Investigate replacing SynchronizedQueue as i…

    …t is deprecated
    
    Replace SynchronizedQueue with synchronized access to a plain Queue.
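    A hedged sketch of the replacement pattern:
    
    ```scala
    import scala.collection.mutable
    
    val queue = new mutable.Queue[Int]()
    
    // All reads and writes go through the queue's own lock.
    queue.synchronized { queue.enqueue(1) }
    val next = queue.synchronized { if (queue.nonEmpty) Some(queue.dequeue()) else None }
    ```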
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#11111 from srowen/SPARK-13170.
    srowen committed Feb 9, 2016
    Commit 68ed363
  10. [SPARK-12807][YARN] Spark External Shuffle not working in Hadoop clus…

    …ters with Jackson 2.2.3
    
    Patch to
    
    1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation
    2. Use maven antrun to verify the JAR has the renamed classes
    
    Being Maven-based, I don't know if the verification phase kicks in on an SBT/jenkins build. It will on a `mvn install`
    
    Author: Steve Loughran <stevel@hortonworks.com>
    
    Closes apache#10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.
    steveloughran authored and Marcelo Vanzin committed Feb 9, 2016
    Commit 34d0b70
  11. [SPARK-13189] Cleanup build references to Scala 2.10

    Author: Luciano Resende <lresende@apache.org>
    
    Closes apache#11092 from lresende/SPARK-13189.
    lresende authored and JoshRosen committed Feb 9, 2016
    Commit 2dbb916
  12. [SPARK-12888] [SQL] [FOLLOW-UP] benchmark the new hash expression

    Adds the benchmark results as comments.
    
    The codegen version is slower than the interpreted version for the `simple` case because of 3 reasons:
    
    1. The codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
    2. The codegen version will write the hash value to a row first and then read it out. I tried to create a `GenerateHasher` that can generate code to return the hash value directly and got about a 60% speedup for the `simple` case; is it worth it?
    3. The row in the `simple` case only has one int field, so the runtime reflection may be removed because of branch prediction, which makes the interpreted version faster.
    
    The `array` case is also slow for similar reasons, e.g. the array elements are of the same type, so the interpreted version can probably get rid of runtime reflection by branch prediction.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes apache#10917 from cloud-fan/hash-benchmark.
    cloud-fan authored and davies committed Feb 9, 2016
    Commit 7fe4fe6

Commits on Feb 10, 2016

  1. [SPARK-13245][CORE] Call shuffleMetrics methods only in one thread fo…

    …r ShuffleBlockFetcherIterator
    
    Call shuffleMetrics's incRemoteBytesRead and incRemoteBlocksFetched when polling FetchResult from `results` so as to always use shuffleMetrics in one thread.
    
    Also fix a race condition that could cause a memory leak.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11138 from zsxwing/SPARK-13245.
    zsxwing committed Feb 10, 2016
    Commit fae830d
  2. [SPARK-12950] [SQL] Improve lookup of BytesToBytesMap in aggregate

    This PR improves the lookup of BytesToBytesMap by:
    
    1. Generating code to calculate the hash code of the grouping keys.
    
    2. Not using MemoryLocation; instead fetching the baseObject and offset for key and value directly (removing the indirection).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11010 from davies/gen_map.
    Davies Liu authored and davies committed Feb 10, 2016
    Commit 0e5ebac
  3. [SPARK-10524][ML] Use the soft prediction to order categories' bins

    JIRA: https://issues.apache.org/jira/browse/SPARK-10524
    
    Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Liang-Chi Hsieh <viirya@appier.com>
    Author: Joseph K. Bradley <joseph@databricks.com>
    
    Closes apache#8734 from viirya/dt-soft-centroids.
    viirya authored and jkbradley committed Feb 10, 2016
    Commit 9267bc6
  4. [SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removi…

    …ng unnecessary Spark Filter
    
    Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
    
    Current plan:
    ```
    == Optimized Logical Plan ==
    Project [col0#0,col1#1]
    +- Filter (col0#0 = xxx)
       +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
    
    == Physical Plan ==
    +- Filter (col0#0 = xxx)
       +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
    ```
    
    This patch enables a plan below;
    ```
    == Optimized Logical Plan ==
    Project [col0#0,col1#1]
    +- Filter (col0#0 = xxx)
       +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})
    
    == Physical Plan ==
    Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
    ```
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10427 from maropu/RemoveFilterInJdbcScan.
    maropu authored and yhuai committed Feb 10, 2016
    Commit 6f710f9
  5. [SPARK-13149][SQL] Add FileStreamSource

    `FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers the files in each batch and stores them in metadata files so that they can be recovered when restarting. The metadata files are stored in the file system. There will be a further PR to clean up the metadata files periodically.
    
    This is based on the initial work from marmbrus.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes apache#11034 from zsxwing/stream-df-file-source.
    zsxwing authored and tdas committed Feb 10, 2016
    Commit b385ce3
  6. [SPARK-11565] Replace deprecated DigestUtils.shaHex call

    Author: Gábor Lipták <gliptak@gmail.com>
    
    Closes apache#9532 from gliptak/SPARK-11565.
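    A one-line sketch of the substitution, assuming commons-codec's non-deprecated equivalent:
    
    ```scala
    import org.apache.commons.codec.digest.DigestUtils
    
    // DigestUtils.shaHex(...) is deprecated; sha1Hex computes the same digest.
    val digest = DigestUtils.sha1Hex("some input string")
    ```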
    gliptak authored and srowen committed Feb 10, 2016
    Commit 9269036
  7. [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts

    Author: Jon Maurer <tritab@gmail.com>
    Author: Jonathan Maurer <jmaurer@Jonathans-MacBook-Pro.local>
    
    Closes apache#10789 from tritab/cmd_updates.
    tritab authored and srowen committed Feb 10, 2016
    Commit 2ba9b6a
  8. [SPARK-13203] Add scalastyle rule banning use of mutable.Synchronized…

    …Buffer
    
    andrewor14
    Please take a look
    
    Author: tedyu <yuzhihong@gmail.com>
    
    Closes apache#11134 from tedyu/master.
    tedyu authored and srowen committed Feb 10, 2016
    Commit e834e42
  9. [SPARK-9307][CORE][SPARK] Logging: Make it either stable or private

    Make Logging private[spark]. Pretty much all there is to it.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#11103 from srowen/SPARK-9307.
    srowen committed Feb 10, 2016
    Commit c0b71e0
  10. [SPARK-5095][MESOS] Support launching multiple mesos executors in coa…

    …rse grained mesos mode.
    
    This is the next iteration of tnachen's previous PR: apache#4027
    
    In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone.  This PR implements that resolution.
    
    This PR implements two high-level features.  These two features are co-dependent, so they're implemented both here:
    - Mesos support for spark.executor.cores
    - Multiple executors per slave
    
    We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite: https://github.com/typesafehub/mesos-spark-integration-tests, which passes for this PR.
    
    The contribution is my original work and I license the work to the project under the project's open source license.
    
    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes apache#10993 from mgummelt/executor_sizing.
    Michael Gummelt authored and Andrew Or committed Feb 10, 2016
    Commit 80cb963
  11. [SPARK-13254][SQL] Fix planning of TakeOrderedAndProject operator

    The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / apache#7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.
    
    This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy.
    
    In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.
    
    /cc davies and marmbrus for review.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#11145 from JoshRosen/take-ordered-and-project-fix.
    JoshRosen committed Feb 10, 2016
    Commit 5cf2059
  12. [SPARK-13163][WEB UI] Column width on new History Server DataTables n…

    …ot getting set correctly
    
    The column width for the new DataTables now adjusts for the current page rather than being hard-coded for the entire table's data.
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#11057 from ajbozarth/spark13163.
    ajbozarth authored and Tom Graves committed Feb 10, 2016
    Commit 39cc620
  13. [SPARK-13126] fix the right margin of history page.

    The right margin of the history page is a little bit off. A simple fix for that issue.
    
    Author: zhuol <zhuol@yahoo-inc.com>
    
    Closes apache#11029 from zhuoliu/13126.
    zhuol authored and Tom Graves committed Feb 10, 2016
    Commit 4b80026
  14. Commit ce3bdae
  15. [SPARK-13057][SQL] Add benchmark codes and the performance results fo…

    …r implemented compression schemes for InMemoryRelation
    
    This PR adds benchmark code for in-memory cache compression to make future development and discussions smoother.
    
    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10965 from maropu/ImproveColumnarCache.
    maropu authored and rxin committed Feb 10, 2016
    Commit 5947fa8
  16. [SPARK-12414][CORE] Remove closure serializer

    Remove spark.closure.serializer option and use JavaSerializer always
    
    CC andrewor14 rxin I see there's a discussion in the JIRA but just thought I'd offer this for a look at what the change would be.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#11150 from srowen/SPARK-12414.
    srowen authored and rxin committed Feb 10, 2016
    Commit 29c5473

Commits on Feb 11, 2016

  1. [SPARK-13146][SQL] Management API for continuous queries

    ### Management API for Continuous Queries
    
    **API for getting status of each query**
    - Whether active or not
    - Unique name of each query
    - Status of the sources and sinks
    - Exceptions
    
    **API for managing each query**
    - Immediately stop an active query
    - Waiting for a query to be terminated, correctly or with error
    
    **API for managing multiple queries**
    - Listing all active queries
    - Getting an active query by name
    - Waiting for any one of the active queries to be terminated
    
    **API for listening to query life cycle events**
    - ContinuousQueryListener API for query start, progress and termination events.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes apache#11030 from tdas/streaming-df-management-api.
    tdas authored and zsxwing committed Feb 11, 2016
    Commit 0902e20
  2. [SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API

    Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator
    
    Author: raela <raela@databricks.com>
    
    Closes apache#11158 from raelawang/master.
    raelawang authored and rxin committed Feb 11, 2016
    Commit 719973b
  3. [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Nam…

    …e Ambiguity Caused by Internally Generated Expressions
    
    Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.
    
    This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased.
    
    Here's an example Spark 1.6.0 snippet for illustration:
    ```scala
    sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
    sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
    ```
    The above code produces the following resolved plan:
    ```
    == Analyzed Logical Plan ==
    _c0: bigint
    Project [_c0#101L]
    +- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
       +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
          +- Subquery t
             +- Project [id#46L AS a#47L,id#46L AS b#48L]
                +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
    ```
    Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.
    
    The solution is to automatically add the expression IDs into the attribute name for the Alias and AttributeReferences that are generated by Analyzer in SQL Generation.
    
    In this PR, it also resolves another issue. Users could use the same name as the internally generated names. The duplicate names should not cause name ambiguity. When resolving the column, Catalyst should not pick the column that is internally generated.
    
    Could you review the solution? marmbrus liancheng
    
    I did not set the newly added flag for all the alias and attribute reference generated by Analyzers. Please let me know if I should do it? Thank you!
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#11050 from gatorsmile/namingConflicts.
    gatorsmile authored and liancheng committed Feb 11, 2016
    Commit 663cc40
  4. [SPARK-13205][SQL] SQL Generation Support for Self Join

    This PR addresses two issues:
      - Self join does not work in SQL Generation
      - When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost.
    
    liancheng Could you please review the code changes? Thank you!
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#11084 from gatorsmile/selfJoinInSQLGen.
    gatorsmile authored and liancheng committed Feb 11, 2016
    Commit 0f09f02
  5. [SPARK-12706] [SQL] grouping() and grouping_id()

    grouping() returns whether a column is aggregated or not; grouping_id() returns the aggregation levels.
    
    grouping()/grouping_id() can be used with window functions, but do not work in the having/sort clauses; that will be fixed by another PR.
    
    The GROUPING__ID/grouping_id() in Hive is wrong (according to the docs), and we also did it wrongly; this PR changes that to match the behavior in most databases (and the Hive docs).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#10677 from davies/grouping.
    Davies Liu authored and davies committed Feb 11, 2016
    Commit b5761d1
  6. [SPARK-13234] [SQL] remove duplicated SQL metrics

    For lots of SQL operators, we have metrics for both input and output. Since the number of input rows should be exactly the number of output rows of the child, we only need metrics for output rows.
    
    After we improved performance using whole stage codegen, the overhead of SQL metrics is not trivial anymore, so we should avoid it where it's not necessary.
    
    This PR removes all the SQL metrics for the number of input rows, adds an SQL metric for the number of output rows to every LeafNode, and removes the SQL metrics from operators that have the same number of rows for input and output (for example Projection, where we may not need them).
    
    The new SQL UI will looks like:
    
    ![metrics](https://cloud.githubusercontent.com/assets/40902/12965227/63614e5e-d009-11e5-88b3-84fea04f9c20.png)
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11163 from davies/remove_metrics.
    Davies Liu authored and davies committed Feb 11, 2016
    Commit 8f744fe
  7. [SPARK-13276] Catch bad characters at the end of a Table Identifier/E…

    …xpression string
    
    The parser currently parses the following strings without a hitch:
    * Table Identifier:
      * `a.b.c` should fail, but results in the following table identifier `a.b`
      * `table!#` should fail, but results in the following table identifier `table`
    * Expression
      * `1+2 r+e` should fail, but results in the following expression `1 + 2`
    
    This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing.
    
    cc cloud-fan (we discussed this in apache#10649) jayadevanmurali (this causes your PR apache#11051 to fail)
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#11159 from hvanhovell/SPARK-13276.
    hvanhovell committed Feb 11, 2016
    Commit 1842c55
  8. [SPARK-13235][SQL] Removed an Extra Distinct from the Plan when Using…

    … Union in SQL
    
    Currently, the parser added two `Distinct` operators in the plan if we are using `Union` or `Union Distinct` in the SQL. This PR is to remove the extra `Distinct` from the plan.
    
    For example, before the fix, the following query has a plan with two `Distinct`
    ```scala
    sql("select * from t0 union select * from t0").explain(true)
    ```
    ```
    == Parsed Logical Plan ==
    'Project [unresolvedalias(*,None)]
    +- 'Subquery u_2
       +- 'Distinct
          +- 'Project [unresolvedalias(*,None)]
             +- 'Subquery u_1
                +- 'Distinct
                   +- 'Union
                      :- 'Project [unresolvedalias(*,None)]
                      :  +- 'UnresolvedRelation `t0`, None
                      +- 'Project [unresolvedalias(*,None)]
                         +- 'UnresolvedRelation `t0`, None
    
    == Analyzed Logical Plan ==
    id: bigint
    Project [id#16L]
    +- Subquery u_2
       +- Distinct
          +- Project [id#16L]
             +- Subquery u_1
                +- Distinct
                   +- Union
                      :- Project [id#16L]
                      :  +- Subquery t0
                      :     +- Relation[id#16L] ParquetRelation
                      +- Project [id#16L]
                         +- Subquery t0
                            +- Relation[id#16L] ParquetRelation
    
    == Optimized Logical Plan ==
    Aggregate [id#16L], [id#16L]
    +- Aggregate [id#16L], [id#16L]
       +- Union
          :- Project [id#16L]
          :  +- Relation[id#16L] ParquetRelation
          +- Project [id#16L]
             +- Relation[id#16L] ParquetRelation
    ```
    After the fix, the plan is changed without the extra `Distinct` as follows:
    ```
    == Parsed Logical Plan ==
    'Project [unresolvedalias(*,None)]
    +- 'Subquery u_1
       +- 'Distinct
          +- 'Union
             :- 'Project [unresolvedalias(*,None)]
             :  +- 'UnresolvedRelation `t0`, None
             +- 'Project [unresolvedalias(*,None)]
               +- 'UnresolvedRelation `t0`, None
    
    == Analyzed Logical Plan ==
    id: bigint
    Project [id#17L]
    +- Subquery u_1
       +- Distinct
          +- Union
            :- Project [id#16L]
            :  +- Subquery t0
            :     +- Relation[id#16L] ParquetRelation
            +- Project [id#16L]
              +- Subquery t0
              +- Relation[id#16L] ParquetRelation
    
    == Optimized Logical Plan ==
    Aggregate [id#17L], [id#17L]
    +- Union
      :- Project [id#16L]
      :  +- Relation[id#16L] ParquetRelation
      +- Project [id#16L]
        +- Relation[id#16L] ParquetRelation
    ```
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#11120 from gatorsmile/unionDistinct.
    gatorsmile authored and hvanhovell committed Feb 11, 2016
    Commit e88bff1
  9. [SPARK-13270][SQL] Remove extra new lines in whole stage codegen and …

    …include pipeline plan in comments.
    
    Author: Nong Li <nong@databricks.com>
    
    Closes apache#11155 from nongli/spark-13270.
    nongli authored and rxin committed Feb 11, 2016
    Commit 18bcbbd
  10. [SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.temp…

    …late
    
    In spark-env.sh.template, there are multi-byte characters, this PR will remove it.
    
    Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
    
    Closes apache#11149 from sasakitoa/remove_multibyte_in_sparkenv.
    sasakitoa authored and srowen committed Feb 11, 2016
    Commit c2f21d8
  11. [SPARK-13074][CORE] Add JavaSparkContext. getPersistentRDDs method

    The "getPersistentRDDs()" is a useful API of SparkContext to get cached RDDs. However, the JavaSparkContext does not have this API.
    
    Add a simple getPersistentRDDs() to get java.util.Map<Integer, JavaRDD> for Java users.
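    For reference, a sketch of the existing Scala-side API that the new Java accessor mirrors, assuming a running SparkContext `sc`:
    
    ```scala
    val rdd = sc.parallelize(1 to 10).cache()
    rdd.count()                        // materialize so the RDD is registered as persistent
    
    val cached = sc.getPersistentRDDs  // Map[Int, RDD[_]] keyed by RDD id
    ```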
    
    Author: Junyang <fly.shenjy@gmail.com>
    
    Closes apache#10978 from flyjy/master.
    jyssky authored and srowen committed Feb 11, 2016
    Commit f9ae99f
  12. [SPARK-13124][WEB UI] Fixed CSS and JS issues caused by addition of J…

    …Query DataTables
    
    Made sure the old tables continue to use the old css and the new DataTables use the new css. Also fixed it so the Safari Web Inspector doesn't throw errors when on the new DataTables pages.
    
    Author: Alex Bozarth <ajbozart@us.ibm.com>
    
    Closes apache#11038 from ajbozarth/spark13124.
    ajbozarth authored and Tom Graves committed Feb 11, 2016
    Commit 13c17cb
  13. [STREAMING][TEST] Fix flaky streaming.FailureSuite

    Under some corner cases, the test suite failed to shutdown the SparkContext causing cascaded failures. This fix does two things
    - Makes sure no SparkContext is active after every test
    - Makes sure StreamingContext is always shutdown (prevents leaking of StreamingContexts as well, just in case)
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes apache#11166 from tdas/fix-failuresuite.
    tdas authored and zsxwing committed Feb 11, 2016
    Commit 219a74a
  14. [SPARK-13277][SQL] ANTLR ignores other rule using the USING keyword

    JIRA: https://issues.apache.org/jira/browse/SPARK-13277
    
    There is an ANTLR warning during compilation:
    
        warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7:
        Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3
    
        As a result, alternative(s) 3 were disabled for that input
    
    This patch is to fix it.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes apache#11168 from viirya/fix-parser-using.
    viirya authored and hvanhovell committed Feb 11, 2016
    Commit e31c807
  15. [SPARK-12982][SQL] Add table name validation in temp table registration

    Add the table name validation at the temp table creation
    
    Author: jayadevanmurali <jayadevan.m@tcs.com>
    
    Closes apache#11051 from jayadevanmurali/branch-0.2-SPARK-12982.
    jayadevanmurali authored and hvanhovell committed Feb 11, 2016
    Commit 0d50a22
  16. [SPARK-13279] Remove O(n^2) operation from scheduler.

    This commit removes an unnecessary duplicate check in addPendingTask that meant
    that scheduling a task set took time proportional to (# tasks)^2.
    
    Author: Sital Kedia <skedia@fb.com>
    
    Closes apache#11167 from sitalkedia/fix_stuck_driver and squashes the following commits:
    
    3fe1af8 [Sital Kedia] [SPARK-13279] Remove unnecessary duplicate check in addPendingTask function
    Sital Kedia authored and kayousterhout committed Feb 11, 2016
    Commit 50fa6fd
  17. Revert "[SPARK-13279] Remove O(n^2) operation from scheduler."

    This reverts commit 50fa6fd.
    rxin committed Feb 11, 2016
    Commit c86009c
  18. [SPARK-13265][ML] Refactoring of basic ML import/export for other fil…

    …e system besides HDFS
    
    jkbradley I tried to improve the function that exports a model. When I tried to export a model to S3 under Spark 1.6, it was not possible. So, it should support S3 besides HDFS. Can you review it when you have time? Thanks!
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes apache#11151 from yu-iskw/SPARK-13265.
    yu-iskw authored and mengxr committed Feb 11, 2016
    Commit efb65e0
  19. [SPARK-11515][ML] QuantileDiscretizer should take random seed

    cc jkbradley
    
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    
    Closes apache#9535 from yu-iskw/SPARK-11515.
    yu-iskw authored and mengxr committed Feb 11, 2016
    Commit 574571c
  20. [SPARK-13037][ML][PYSPARK] PySpark ml.recommendation support export/i…

    …mport
    
    PySpark ml.recommendation support export/import.
    
    Author: Kai Jiang <jiangkai@gmail.com>
    
    Closes apache#11044 from vectorijk/spark-13037.
    vectorijk authored and mengxr committed Feb 11, 2016
    Commit c8f667d
  21. [MINOR][ML][PYSPARK] Cleanup test cases of clustering.py

    Test cases should be removed from the annotations of the ```setXXX``` functions, otherwise they will become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).
    cc mengxr jkbradley
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10975 from yanboliang/clustering-cleanup.
    yanboliang authored and mengxr committed Feb 11, 2016
    Commit 2426eb3
  22. [SPARK-13035][ML][PYSPARK] PySpark ml.clustering support export/import

    PySpark ml.clustering support export/import.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10999 from yanboliang/spark-13035.
    yanboliang authored and mengxr committed Feb 11, 2016
    Commit 30e0095

Commits on Feb 12, 2016

  1. [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw a…

    …n error
    
    Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
    
    In Python:
    ```python
    from pyspark.ml.classification import NaiveBayes
    nb = NaiveBayes()
    print nb.hasParam("smoothing")
    print nb.hasParam("notAParam")
    ```
    produces:
    > True
    > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
    
    However, in Scala:
    ```scala
    import org.apache.spark.ml.classification.NaiveBayes
    val nb  = new NaiveBayes()
    nb.hasParam("smoothing")
    nb.hasParam("notAParam")
    ```
    produces:
    > true
    > false
    
    cc holdenk
    
    Author: sethah <seth.hendrickson16@gmail.com>
    
    Closes apache#10962 from sethah/SPARK-13047.
    sethah authored and mengxr committed Feb 12, 2016
    Commit b354673
  2. [SPARK-12765][ML][COUNTVECTORIZER] fix CountVectorizer.transform's lo…

    …st transformSchema
    
    https://issues.apache.org/jira/browse/SPARK-12765
    
    Author: Liu Xiang <lxmtlab@gmail.com>
    
    Closes apache#10720 from sloth2012/sloth.
    sloth2012 authored and mengxr committed Feb 12, 2016
    Commit a525704
  3. [SPARK-12915][SQL] add SQL metrics of numOutputRows for whole stage c…

    …odegen
    
    This PR adds SQL metrics (numOutputRows) for generated operators (same as non-generated); the cost is about 0.2 nanoseconds per row.
    
    <img width="806" alt="gen metrics" src="https://cloud.githubusercontent.com/assets/40902/12994694/47f5881e-d0d7-11e5-9d47-78229f559ab0.png">
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11170 from davies/gen_metric.
    Davies Liu authored and rxin committed Feb 12, 2016
    Commit b10af5e
  4. [SPARK-13277][BUILD] Follow-up ANTLR warnings are treated as build er…

    …rors
    
    It is possible to create faulty but legal ANTLR grammars. ANTLR will produce warnings but also a valid compilable parser. This PR makes sure we treat such warnings as build errors.
    
    cc rxin / viirya
    
    Author: Herman van Hovell <hvanhovell@questtec.nl>
    
    Closes apache#11174 from hvanhovell/ANTLR-warnings-as-errors.
    hvanhovell authored and rxin committed Feb 12, 2016
    Commit 8121a4b
  5. [SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, …

    …false)
    
    https://issues.apache.org/jira/browse/SPARK-12746
    
    Author: Earthson Lu <Earthson.Lu@gmail.com>
    
    Closes apache#10697 from Earthson/SPARK-12746.
    Earthson authored and mengxr committed Feb 12, 2016
    Commit 5f1c359
  6. [SPARK-13153][PYSPARK] ML persistence failed when handle no default v…

    …alue parameter
    
    Fix this defect by checking whether a default value exists or not.
    
    yanboliang Please help to review.
    
    Author: Tommy YU <tummyyu@163.com>
    
    Closes apache#11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
    Wenpei authored and mengxr committed Feb 12, 2016
    Commit d3e2e20
  7. [SPARK-7889][WEBUI] HistoryServer updates UI for incomplete apps

    When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available.  It does this by checking if a version of the app has been loaded with a larger *filesize*.  If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.
    
    https://issues.apache.org/jira/browse/SPARK-7889
    
    Author: Steve Loughran <stevel@hortonworks.com>
    Author: Imran Rashid <irashid@cloudera.com>
    
    Closes apache#11118 from squito/SPARK-7889-alternate.
    steveloughran authored and squito committed Feb 12, 2016
    Commit a2c7dcf
  8. [SPARK-6166] Limit number of in flight outbound requests

    This JIRA is related to
    apache#5852
    Had to do some minor rework and testing to make sure it works with the current version of Spark.
    
    Author: Sanket <schintap@untilservice-lm>
    
    Closes apache#10838 from redsanket/limit-outbound-connections.
    Sanket authored and zsxwing committed Feb 12, 2016
    Commit 894921d
  9. [SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means

    Add Python API for spark.ml bisecting k-means.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10889 from yanboliang/spark-12974.
    yanboliang authored and mengxr committed Feb 12, 2016
    Commit a183dda
  10. [SPARK-13154][PYTHON] Add linting for pydocs

    We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.
    
    Right now ./dev/lint-python will skip building the docs if sphinx isn't present - but it might make sense to fail hard - just a matter of if we want to insist all PySpark developers have sphinx present.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes apache#11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
    holdenk authored and mengxr committed Feb 12, 2016
    Commit 64515e5
  11. [SPARK-12705] [SQL] push missing attributes for Sort

    The current implementation of ResolveSortReferences can only push one missing attribute into its child, so it failed to analyze TPC-DS Q98, because there are two missing attributes in that query (one from Window, another from Aggregate).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11153 from davies/resolve_sort.
    Davies Liu authored and davies committed Feb 12, 2016
    Commit 5b805df
  12. [SPARK-13282][SQL] LogicalPlan toSql should just return a String

    Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for comprehension everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11171 from rxin/SPARK-13282.
    rxin committed Feb 12, 2016
    Commit c4d5ad8
  13. [SPARK-13260][SQL] count(*) does not work with CSV data source

    https://issues.apache.org/jira/browse/SPARK-13260
    This is a quicky fix for `count(*)`.
    
    When the `requiredColumns` is empty, currently it returns `sqlContext.sparkContext.emptyRDD[Row]` which does not have the count.
    
    Just like the JSON data source, this PR lets the CSV data source count the rows without parsing each set of tokens.
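
    A rough sketch of the idea under assumed names (this is not the actual CSV relation code): when no columns are required, emit one empty row per input record instead of tokenizing each line.

    ```scala
    import org.apache.spark.sql.Row

    // Illustrative only: `lines` stands in for the raw CSV lines of one partition.
    def toRows(lines: Iterator[String], requiredColumns: Seq[String]): Iterator[Row] =
      if (requiredColumns.isEmpty) {
        lines.map(_ => Row.empty)                    // enough to answer count(*), no parsing
      } else {
        lines.map(line => Row(line.split(","): _*))  // naive tokenization, just for the sketch
      }
    ```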
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes apache#11169 from HyukjinKwon/SPARK-13260.
    HyukjinKwon authored and rxin committed Feb 12, 2016
    ac7d6af
  14. [SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop

    Add PySpark support for `covar_samp` and `covar_pop`.
    
    cc rxin davies marmbrus
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes apache#10876 from yanboliang/spark-12962.
    yanboliang authored and davies committed Feb 12, 2016
    90de6b2
  15. [SPARK-12630][PYSPARK] [DOC] PySpark classification parameter desc to…

    … consistent format
    
    Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.
    
    Author: vijaykiran <mail@vijaykiran.com>
    Author: Bryan Cutler <cutlerb@gmail.com>
    
    Closes apache#11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
    vijaykiran authored and mengxr committed Feb 12, 2016
    42d6568
  16. [SPARK-5095] Fix style in mesos coarse grained scheduler code

    andrewor14 This addresses your style comments from apache#10993.
    
    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes apache#11187 from mgummelt/fix_mesos_style.
    Michael Gummelt authored and Andrew Or committed Feb 12, 2016
    38bc601
  17. [SPARK-5095] remove flaky test

    Overrode the start() method, which was previously starting a thread causing a race condition. I believe this should fix the flaky test.
    
    Author: Michael Gummelt <mgummelt@mesosphere.io>
    
    Closes apache#11164 from mgummelt/fix_mesos_tests.
    Michael Gummelt authored and Andrew Or committed Feb 12, 2016
    62b1c07

Commits on Feb 13, 2016

  1. [SPARK-13293][SQL] generate Expand

    Expand suffers from creating the UnsafeRow from the same input multiple times; with codegen, it only needs to copy some of the columns.
    
    After this, we can see a 3X improvement (from 43 seconds to 13 seconds) on a TPCDS query (Q67) that has eight columns in its Rollup.
    
    Ideally, we could mask some of the columns based on a bitmask; I'd leave that for the future, because currently Aggregation (50 ns) is much slower than just copying the variables (1-2 ns).
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes apache#11177 from davies/gen_expand.
    Davies Liu authored and rxin committed Feb 13, 2016
    2228f07
  2. [SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft…

    … Windows
    
    Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.
    
    Is it worth considering also including this fix in any future 1.5.x releases (if any)?
    
    I confirm this is my own original work and license it to the Spark project under its open source license.
    
    Author: markpavey <mark.pavey@thefilter.com>
    
    Closes apache#11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.
    markpavey authored and srowen committed Feb 13, 2016
    374c4b2
  3. [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering f…

    …ailed test
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-12363
    
    This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph were not the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.
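
    A hedged GraphX sketch (not the PowerIterationClustering code itself) of what passing `TripletFields.All` means: it declares that the map function reads the vertex attributes, so they are not dropped by the triplet-field optimization.

    ```scala
    import org.apache.spark.graphx.{Graph, TripletFields}

    // Illustrative edge normalization over a Graph[Double, Double]: the function reads
    // srcAttr, so the triplet fields it needs must be requested explicitly.
    def normalizeEdges(graph: Graph[Double, Double]): Graph[Double, Double] =
      graph.mapTriplets(
        t => t.attr / math.max(t.srcAttr, Double.MinPositiveValue),
        TripletFields.All)
    ```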
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#10539 from viirya/fix-poweriter.
    viirya authored and mengxr committed Feb 13, 2016
    e3441e3

Commits on Feb 14, 2016

  1. Closes apache#11185

    rxin committed Feb 14, 2016
    610196f
  2. [SPARK-13172][CORE][SQL] Stop using RichException.getStackTrace it is…

    … deprecated
    
    Replace `getStackTraceString` with `Utils.exceptionString`
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#11182 from srowen/SPARK-13172.
    srowen authored and rxin committed Feb 14, 2016
    388cd9e
  3. [SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions.

    This pull request has the following changes:
    
    1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
    
    2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
    
    3. Move everything in execution/python.scala into the newly created execution.python package.
    
    Most of the diffs are just straight copy-paste.
    
    Author: Reynold Xin <rxin@databricks.com>
    
    Closes apache#11181 from rxin/SPARK-13296.
    rxin committed Feb 14, 2016
    354d4c2
  4. [SPARK-13300][DOCUMENTATION] Added pygments.rb dependancy

    It looks like the pygments.rb gem is also required for the Jekyll build to work. At least on Ubuntu/RHEL I could not build the docs without this dependency, so I added it to the setup steps.
    
    Author: Amit Dev <amitdev@gmail.com>
    
    Closes apache#11180 from amitdev/master.
    amitdev authored and srowen committed Feb 14, 2016
    331293c
  5. [SPARK-13278][CORE] Launcher fails to start with JDK 9 EA

    See http://openjdk.java.net/jeps/223 for more information about the JDK 9 version string scheme.
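
    A hedged sketch of why JEP 223 matters here: pre-9 JDKs report versions like `1.8.0_66`, while JDK 9 EA reports e.g. `9-ea`, so parsing that assumes a leading `1.` breaks. This is illustrative only, not the launcher's actual code:

    ```scala
    // Returns the feature version for both the old ("1.8.0_66") and the JEP 223
    // ("9-ea", "9.0.1") version string formats.
    def majorJavaVersion(version: String): Int = {
      val numeric = version.takeWhile(c => c.isDigit || c == '.')
      val parts = numeric.split('.').filter(_.nonEmpty).map(_.toInt)
      if (parts.head == 1) parts(1) else parts.head
    }

    assert(majorJavaVersion("1.8.0_66") == 8)
    assert(majorJavaVersion("9-ea") == 9)
    ```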
    
    Author: Claes Redestad <claes.redestad@gmail.com>
    
    Closes apache#11160 from cl4es/master.
    cl4es authored and srowen committed Feb 14, 2016
    22e9723

Commits on Feb 15, 2016

  1. [SPARK-13185][SQL] Reuse Calendar object in DateTimeUtils.StringToDat…

    …e method to improve performance
    
    The java `Calendar` object is expensive to create. I have a subquery like this `SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01')>=0 AND datediff(UV.visitDate, '2015-01-01')<=0)`
    
    The table stores `visitDate` as String type and has 3 billion records. A `Calendar` object is created every time `DateTimeUtils.stringToDate` is called. By reusing the `Calendar` object, I saw a performance improvement of about 20 seconds for this stage.
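
    A minimal sketch of the reuse pattern (the actual change lives in `DateTimeUtils`; the object and method names below are made up for illustration):

    ```scala
    import java.util.{Calendar, TimeZone}

    object DateParseSketch {
      // One Calendar per thread, created lazily and then reused across calls.
      private val localCalendar = new ThreadLocal[Calendar] {
        override def initialValue(): Calendar = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
      }

      def daysSinceEpoch(year: Int, month: Int, day: Int): Int = {
        val c = localCalendar.get()
        c.clear()                    // wipe state left over from the previous call
        c.set(year, month - 1, day)  // Calendar months are zero-based
        (c.getTimeInMillis / (24L * 60 * 60 * 1000)).toInt
      }
    }
    ```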
    
    Author: Carson Wang <carson.wang@intel.com>
    
    Closes apache#11090 from carsonwang/SPARK-13185.
    carsonwang authored and rxin committed Feb 15, 2016
    7cb4d74
  2. [SPARK-12503][SPARK-12505] Limit pushdown in UNION ALL and OUTER JOIN

    This patch adds a new optimizer rule for performing limit pushdown. Limits will now be pushed down in two cases:
    
    - If a limit is on top of a `UNION ALL` operator, then a partition-local limit operator will be pushed to each of the union operator's children.
    - If a limit is on top of an `OUTER JOIN` then a partition-local limit will be pushed to one side of the join. For `LEFT OUTER` and `RIGHT OUTER` joins, the limit will be pushed to the left and right side, respectively. For `FULL OUTER` join, we will only push limits when at most one of the inputs is already limited: if one input is limited we will push a smaller limit on top of it and if neither input is limited then we will limit the input which is estimated to be larger.
    
    These optimizations were proposed previously by gatorsmile in apache#10451 and apache#10454, but those earlier PRs were closed and deferred for later because at that time Spark's physical `Limit` operator would trigger a full shuffle to perform global limits so there was a chance that pushdowns could actually harm performance by causing additional shuffles/stages. In apache#7334, we split the `Limit` operator into separate `LocalLimit` and `GlobalLimit` operators, so we can now push down only local limits (which don't require extra shuffles). This patch is based on both of gatorsmile's patches, with changes and simplifications due to partition-local-limiting.
    
    When we push down the limit, we still keep the original limit in place, so we need a mechanism to ensure that the optimizer rule doesn't keep pattern-matching once the limit has been pushed down. In order to handle this, this patch adds a `maxRows` method to `SparkPlan` which returns the maximum number of rows that the plan can compute, then defines the pushdown rules to only push limits to children if the children's maxRows are greater than the limit's maxRows. This idea is carried over from apache#10451; see that patch for additional discussion.
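
    A hedged illustration of the `UNION ALL` case (the table names are invented and assumed to be registered in the current SQLContext): after the rule fires, each child of the union only needs to produce at most 10 rows locally before the final global limit is applied.

    ```scala
    val limited = sqlContext.sql(
      """SELECT id FROM events_2015
        |UNION ALL
        |SELECT id FROM events_2016
        |LIMIT 10""".stripMargin)
    limited.explain()  // the optimized plan should show local limits below the Union
    ```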
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#11121 from JoshRosen/limit-pushdown-2.
    JoshRosen authored and rxin committed Feb 15, 2016
    a8bbc4f
  3. [SPARK-12995][GRAPHX] Remove deprecate APIs from Pregel

    Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
    
    Closes apache#10918 from maropu/RemoveDeprecateInPregel.
    maropu authored and srowen committed Feb 15, 2016
    56d4939
  4. [SPARK-13312][MLLIB] Update java train-validation-split example in ml…

    …-guide
    
    Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312.
    
    This contribution is my original work and I license the work to this project.
    
    Author: JeremyNixon <jnixon2@gmail.com>
    
    Closes apache#11199 from JeremyNixon/update_train_val_split_example.
    JeremyNixon authored and srowen committed Feb 15, 2016
    adb5483

Commits on Feb 16, 2016

  1. [SPARK-13097][ML] Binarizer allowing Double AND Vector input types

    This enhancement extends the existing SparkML Binarizer [SPARK-5891] to allow Vector in addition to the existing Double input column type.
    
    A use case for this enhancement is when a user wants to binarize many similar feature columns at once using the same threshold value (for example, a binary threshold applied to many pixels in an image).
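
    A short hedged example of the extended usage (the data are made up, and this assumes the Binarizer setters of this era, `setInputCol`/`setOutputCol`/`setThreshold`):

    ```scala
    import org.apache.spark.ml.feature.Binarizer
    import org.apache.spark.mllib.linalg.Vectors

    // A Vector input column instead of the previously required Double column.
    val df = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(0.1, 0.8, 0.2)),
      (1, Vectors.dense(0.6, 0.3, 0.9))
    )).toDF("id", "features")

    val binarizer = new Binarizer()
      .setInputCol("features")
      .setOutputCol("binarized_features")
      .setThreshold(0.5)

    binarizer.transform(df).show(false)  // each vector element becomes 0.0 or 1.0
    ```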
    
    This contribution is my original work and I license the work to the project under the project's open source license.
    
    viirya mengxr
    
    Author: seddonm1 <seddonm1@gmail.com>
    
    Closes apache#10976 from seddonm1/master.
    seddonm1 authored and mengxr committed Feb 16, 2016
    cbeb006
  2. [SPARK-13018][DOCS] Replace example code in mllib-pmml-model-export.m…

    …d using include_example
    
    Replace example code in mllib-pmml-model-export.md using include_example
    https://issues.apache.org/jira/browse/SPARK-13018
    
    The example code in the user guide is embedded in the markdown and hence is not easy to test. It would be nice to test it automatically. This JIRA is to discuss options for automating example code testing and see what we can do in Spark 1.6.
    
    Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
    `{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}`
    Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala`, pick the code blocks marked "example", and replace the code block in `{% highlight %}` in the markdown.
    
    See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
    
    Author: Xin Ren <iamshrek@126.com>
    
    Closes apache#11126 from keypointt/SPARK-13018.
    keypointt authored and mengxr committed Feb 16, 2016
    e4675c2
  3. [SPARK-13221] [SQL] Fixing GroupingSets when Aggregate Functions Cont…

    …aining GroupBy Columns
    
    Using GroupingSets generates a wrong result when aggregate functions contain GROUP BY columns.
    
    This PR fixes it. Since the code changes are very small, maybe we can also merge it into 1.6.
    
    For example, the following query returns a wrong result:
    ```scala
    sql("select course, sum(earnings) as sum from courseSales group by course, earnings" +
         " grouping sets((), (course), (course, earnings))" +
         " order by course, sum").show()
    ```
    Before the fix, the results are like
    ```
    [null,null]
    [Java,null]
    [Java,20000.0]
    [Java,30000.0]
    [dotNET,null]
    [dotNET,5000.0]
    [dotNET,10000.0]
    [dotNET,48000.0]
    ```
    After the fix, the results become correct:
    ```
    [null,113000.0]
    [Java,20000.0]
    [Java,30000.0]
    [Java,50000.0]
    [dotNET,5000.0]
    [dotNET,10000.0]
    [dotNET,48000.0]
    [dotNET,63000.0]
    ```
    
    UPDATE:  This PR also deprecated the external column: GROUPING__ID.
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes apache#11100 from gatorsmile/groupingSets.
    gatorsmile authored and davies committed Feb 16, 2016
    fee739f
  4. Correct SparseVector.parse documentation

    There's a small error in the SparseVector.parse docstring: it says the method returns a DenseVector rather than a SparseVector, which is incorrect.
    
    Author: Miles Yucht <miles@databricks.com>
    
    Closes apache#11213 from mgyucht/fix-sparsevector-docs.
    mgyucht authored and srowen committed Feb 16, 2016
    827ed1c
  5. [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collabora…

    …tive filtering in general
    
    This documents the implementation of ALS in `spark.ml`, with example code in Scala, Java and Python.
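
    A condensed, hedged Scala snippet in the spirit of what the guide documents (column names are illustrative and `ratings` is assumed to be a DataFrame with matching columns):

    ```scala
    import org.apache.spark.ml.recommendation.ALS

    // Collaborative filtering with explicit ratings; the column names are assumptions.
    val als = new ALS()
      .setMaxIter(10)
      .setRank(10)
      .setRegParam(0.1)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")

    val model = als.fit(ratings)       // ratings: DataFrame(userId, movieId, rating)
    model.transform(ratings).show(5)   // adds a "prediction" column
    ```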
    
    Author: BenFradet <benjamin.fradet@gmail.com>
    
    Closes apache#10411 from BenFradet/SPARK-12247.
    BenFradet authored and srowen committed Feb 16, 2016
    00c72d2
  6. [SPARK-12976][SQL] Add LazilyGenerateOrdering and use it for RangePar…

    …titioner of Exchange.
    
    Add `LazilyGenerateOrdering` so that the `RangePartitioner` of `Exchange` uses generated ordering instead of `InterpretedOrdering`.
    
    Author: Takuya UESHIN <ueshin@happy-camper.st>
    
    Closes apache#10894 from ueshin/issues/SPARK-12976.
    ueshin authored and JoshRosen committed Feb 16, 2016
    19dc69d
  7. [SPARK-13280][STREAMING] Use a better logger name for FileBasedWriteA…

    …headLog.
    
    The new logger name is under the org.apache.spark namespace.
    The detection of the caller name was also enhanced a bit to ignore
    some common things that show up in the call stack.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes apache#11165 from vanzin/SPARK-13280.
    Marcelo Vanzin committed Feb 16, 2016
    c7d00a2
  8. [SPARK-13308] ManagedBuffers passed to OneToOneStreamManager need to …

    …be freed in non-error cases
    
    ManagedBuffers that are passed to `OneToOneStreamManager.registerStream` need to be freed by the manager once it's done using them. However, the current code only frees them in certain error-cases and not during typical operation. This isn't a major problem today, but it will cause memory leaks after we implement better locking / pinning in the BlockManager (see apache#10705).
    
    This patch modifies the relevant network code so that the ManagedBuffers are freed as soon as the messages containing them are processed by the lower-level Netty message sending code.
    
    /cc zsxwing for review.
    
    Author: Josh Rosen <joshrosen@databricks.com>
    
    Closes apache#11193 from JoshRosen/add-missing-release-calls-in-network-layer.
    JoshRosen authored and zsxwing committed Feb 16, 2016
    5f37aad