# Spark Data Operations

This notebook augments the video `2.5 - Spark Data Operations`.

## Types

DataFrame operations can be broken down across three distinct types:

- __Transformation__  
  Transformations define how the data should change and return a new
  DataFrame, but they are lazy: Spark holds off on executing them until it
  encounters an action.

- __Action__  
  Actions are the operations that make Spark do the work, and they are
  'blocking'. Performing an action means that Spark actually executes all
  Transformations that have been collected up until that point and then
  returns (or writes out) the result.

- __Property__  
  Properties can be a bit odd, which is why they get their own category. A
  property does not fit in either the Transformation or Action category,
  because it can behave like either one depending on context.  
  
  A property exposes general information about the DataFrame (or other such
  classes). Properties are generally not callable, but they can be used inside
  operations or to reach helper classes (`DataFrame.na`, for example, returns
  a `DataFrameNaFunctions` object). Using them inside an operation does not
  always break Spark's laziness, but it can.  
  
  For example, consider `DataFrame.columns`:
  - Behaving like a Transformation:  
  ```python
  # Each drop is a lazy transformation; the loop only reads the column names
  for column_name in df.columns:
      df = df.drop(column_name)
  ``` 
  When looping through the columns of a DataFrame using `DataFrame.columns`,
  the property behaves like a Transformation.  
 
  - Behaving like an Action:  
  When you use the same `DataFrame.columns` after performing operations and
  print it with `print(df.columns)`, it behaves more like an Action, because
  Spark has to work out which columns the DataFrame has at that point before
  your code can continue.
  
  Note that a property does not always behave the same way; its behavior
  depends heavily on the context in which it is used. In short, when using
  properties, test your code thoroughly so that you avoid breaking Spark's
  laziness wherever possible.
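
The difference between the first two types can be sketched with a toy model in plain Python. This is not how Spark is implemented; `LazyFrame`, `run_log`, and the sample rows are made-up names used only for illustration. The sketch simply mirrors the behavior described above: transformations are recorded, and an action runs everything recorded so far.

```python
# Toy model of lazy evaluation in plain Python. This is NOT Spark code:
# LazyFrame, run_log and the sample rows are hypothetical, for illustration.

run_log = []  # records the name of every step that actually executes


class LazyFrame:
    """Collects transformations and runs them only when an action is called."""

    def __init__(self, rows, plan=()):
        self._rows = rows  # source data: a list of dicts
        self._plan = plan  # pending transformations, in order

    def _with_step(self, name, func):
        # Transformations never touch the data; they only extend the plan.
        return LazyFrame(self._rows, self._plan + ((name, func),))

    def filter(self, predicate):
        return self._with_step(
            "filter", lambda rows: [r for r in rows if predicate(r)]
        )

    def drop(self, column):
        return self._with_step(
            "drop",
            lambda rows: [{k: v for k, v in r.items() if k != column} for r in rows],
        )

    def collect(self):
        # Action: blocking, executes every collected transformation in order.
        rows = self._rows
        for name, func in self._plan:
            run_log.append(name)
            rows = func(rows)
        return rows

    def count(self):
        # Action: the whole plan has to run before the answer is known.
        return len(self.collect())


people = LazyFrame([
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 27},
    {"name": "Cid", "age": 45},
])

over_30 = people.filter(lambda r: r["age"] > 30).drop("age")
steps_run_before_action = list(run_log)  # still empty: nothing has executed

result = over_30.collect()  # the action triggers filter, then drop
steps_run_after_action = list(run_log)
```

In PySpark the same pattern appears as, for example, `df.filter(...).drop(...)` followed by `.collect()`: only the final call makes Spark do any work.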


## All Operations Categorized

The official PySpark documentation does not categorize DataFrame operations by the types described above. For this reason, I have created the following table as a guide. Every operation name is a link: clicking it scrolls down to the matching entry in the official PySpark documentation, a copy of which is included further down in this notebook.
  

| Operation                                                                                        | Type           | Short Description                                                                                                                                                                   | VersionAdded |
|:-------------------------------------------------------------------------------------------------|:---------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------|
| <a href="#pyspark.sql.DataFrame.agg">agg</a>                                                     | Transformation | Aggregate on the entire :class:`DataFrame` without groups (shorthand for ``df.groupBy().agg()``).                                                                                   | 1.3          |
| <a href="#pyspark.sql.DataFrame.alias">alias</a>                                                 | Transformation | Returns a new :class:`DataFrame` with an alias set.                                                                                                                                 | 1.3          |
| <a href="#pyspark.sql.DataFrame.approxQuantile">approxQuantile</a>                               | Action         | Calculates the approximate quantiles of numerical columns of a DataFrame.                                                                                                           | 2.0          |
| <a href="#pyspark.sql.DataFrame.cache">cache</a>                                                 | Transformation | Persists the :class:`DataFrame` with the default storage level (``MEMORY_AND_DISK``).                                                                                                | 1.3          |
| <a href="#pyspark.sql.DataFrame.checkpoint">checkpoint</a>                                       | Transformation | Returns a checkpointed version of this Dataset.                                                                                                                                     | 2.1          |
| <a href="#pyspark.sql.DataFrame.coalesce">coalesce</a>                                           | Transformation | Returns a new :class:`DataFrame` that has exactly `numPartitions` partitions.                                                                                                       | 1.4          |
| <a href="#pyspark.sql.DataFrame.collect">collect</a>                                             | Action         | Returns all the records as a list of :class:`Row`.                                                                                                                                   | 1.3          |
| <a href="#pyspark.sql.DataFrame.colRegex">colRegex</a>                                           | Transformation | Selects column based on the column name specified as a regex and returns it as :class:`Column`.                                                                                     | 2.3          |
| <a href="#pyspark.sql.DataFrame.columns">columns</a>                                             | Property       | Returns all column names as a list.                                                                                                                                                 | 1.3          |
| <a href="#pyspark.sql.DataFrame.corr">corr</a>                                                   | Action         | Calculates the correlation of two columns of a DataFrame as a double value.                                                                                                         | 1.4          |
| <a href="#pyspark.sql.DataFrame.count">count</a>                                                 | Action         | Returns the number of rows in this :class:`DataFrame`.                                                                                                                               | 1.3          |
| <a href="#pyspark.sql.DataFrame.cov">cov</a>                                                     | Action         | Calculate the sample covariance for the given columns, specified by their names, as a double value. :func:`DataFrame.cov` and :func:`DataFrameStatFunctions.cov` are aliases. | 1.4          |
| <a href="#pyspark.sql.DataFrame.createGlobalTempView">createGlobalTempView</a>                   | Transformation | Creates a global temporary view with this DataFrame.                                                                                                                                 | 2.1          |
| <a href="#pyspark.sql.DataFrame.createOrReplaceGlobalTempView">createOrReplaceGlobalTempView</a> | Transformation | Creates or replaces a global temporary view using the given name.                                                                                                                   | 2.2          |
| <a href="#pyspark.sql.DataFrame.createOrReplaceTempView">createOrReplaceTempView</a>             | Transformation | Creates or replaces a local temporary view with this DataFrame.                                                                                                                     | 2.0          |
| <a href="#pyspark.sql.DataFrame.createTempView">createTempView</a>                               | Transformation | Creates a local temporary view with this DataFrame.                                                                                                                                 | 2.0          |
| <a href="#pyspark.sql.DataFrame.crossJoin">crossJoin</a>                                         | Transformation | Returns the cartesian product with another :class:`DataFrame`.                                                                                                                       | 2.1          |
| <a href="#pyspark.sql.DataFrame.crosstab">crosstab</a>                                           | Transformation | Computes a pair-wise frequency table of the given columns.                                                                                                                           | 1.4          |
| <a href="#pyspark.sql.DataFrame.cube">cube</a>                                                   | Transformation | Create a multi-dimensional cube for the current :class:`DataFrame` using the specified columns, so we can run aggregation on them.                                        | 1.4          |
| <a href="#pyspark.sql.DataFrame.describe">describe</a>                                           | Transformation | Computes basic statistics for numeric and string columns.                                                                                                                           | 1.3.1        |
| <a href="#pyspark.sql.DataFrame.distinct">distinct</a>                                           | Transformation | Returns a new :class:`DataFrame` containing the distinct rows in this :class:`DataFrame`.                                                                                           | 1.3          |
| <a href="#pyspark.sql.DataFrame.drop">drop</a>                                                   | Transformation | Returns a new :class:`DataFrame` that drops the specified column. This is a no-op if schema doesn't contain the given column name(s).                                             | 1.4          |
| <a href="#pyspark.sql.DataFrame.drop_duplicates">drop_duplicates</a>                             | Transformation | :func:`drop_duplicates` is an alias for :func:`dropDuplicates`.                                                                                                                     | 1.4          |
| <a href="#pyspark.sql.DataFrame.dropDuplicates">dropDuplicates</a>                               | Transformation | Return a new :class:`DataFrame` with duplicate rows removed, optionally only considering certain columns.                                                                           | 1.4          |
| <a href="#pyspark.sql.DataFrame.dropna">dropna</a>                                               | Transformation | Returns a new :class:`DataFrame` omitting rows with null values. :func:`DataFrame.dropna` and :func:`DataFrameNaFunctions.drop` are aliases of each other.                    | 1.3.1        |
| <a href="#pyspark.sql.DataFrame.dtypes">dtypes</a>                                               | Property       | Returns all column names and their data types as a list.                                                                                                                             | 1.3          |
| <a href="#pyspark.sql.DataFrame.exceptAll">exceptAll</a>                                         | Transformation | Return a new :class:`DataFrame` containing rows in this :class:`DataFrame` but not in another :class:`DataFrame` while preserving duplicates.                                      | 2.4          |
| <a href="#pyspark.sql.DataFrame.explain">explain</a>                                             | Transformation | Prints the (logical and physical) plans to the console for debugging purpose.                                                                                                       | 1.3          |
| <a href="#pyspark.sql.DataFrame.fillna">fillna</a>                                               | Transformation | Replace null values, alias for ``na.fill()``.                                                                                                                                       | 1.3.1        |
| <a href="#pyspark.sql.DataFrame.filter">filter</a>                                               | Transformation | Filters rows using the given condition.                                                                                                                                             | 1.3          |
| <a href="#pyspark.sql.DataFrame.first">first</a>                                                 | Action         | Returns the first row as a :class:`Row`.                                                                                                                                             | 1.3          |
| <a href="#pyspark.sql.DataFrame.foreach">foreach</a>                                             | Action         | Applies the ``f`` function to all :class:`Row` of this :class:`DataFrame`.                                                                                                           | 1.3          |
| <a href="#pyspark.sql.DataFrame.foreachPartition">foreachPartition</a>                           | Action         | Applies the ``f`` function to each partition of this :class:`DataFrame`.                                                                                                             | 1.3          |
| <a href="#pyspark.sql.DataFrame.freqItems">freqItems</a>                                         | Transformation | Finding frequent items for columns, possibly with false positives.                                                                                                                   | 1.4          |
| <a href="#pyspark.sql.DataFrame.groupby">groupby</a>                                             | Transformation | :func:`groupby` is an alias for :func:`groupBy`.                                                                                                                                     | 1.4          |
| <a href="#pyspark.sql.DataFrame.groupBy">groupBy</a>                                             | Transformation | Groups the :class:`DataFrame` using the specified columns, so we can run aggregation on them.                                                                                       | 1.3          |
| <a href="#pyspark.sql.DataFrame.head">head</a>                                                   | Action         | Returns the first ``n`` rows.                                                                                                                                                       | 1.3          |
| <a href="#pyspark.sql.DataFrame.hint">hint</a>                                                   | Transformation | Specifies some hint on the current DataFrame.                                                                                                                                       | 2.2          |
| <a href="#pyspark.sql.DataFrame.intersect">intersect</a>                                         | Transformation | Return a new :class:`DataFrame` containing rows only in both this frame and another frame.                                                                                           | 1.3          |
| <a href="#pyspark.sql.DataFrame.intersectAll">intersectAll</a>                                   | Transformation | Return a new :class:`DataFrame` containing rows in both this dataframe and other dataframe while preserving duplicates.                                                              | 2.4          |
| <a href="#pyspark.sql.DataFrame.isLocal">isLocal</a>                                             | Property       | Returns ``True`` if the :func:`collect` and :func:`take` methods can be run locally (without any Spark executors).                                                                  | 1.3          |
| <a href="#pyspark.sql.DataFrame.isStreaming">isStreaming</a>                                     | Property       | Returns true if this :class:`Dataset` contains one or more sources that continuously return data as it arrives.                                                                      | 2.0          |
| <a href="#pyspark.sql.DataFrame.join">join</a>                                                   | Transformation | Joins with another :class:`DataFrame`, using the given join expression.                                                                                                             | 1.3          |
| <a href="#pyspark.sql.DataFrame.limit">limit</a>                                                 | Transformation | Limits the result count to the number specified.                                                                                                                                     | 1.3          |
| <a href="#pyspark.sql.DataFrame.localCheckpoint">localCheckpoint</a>                             | Transformation | Returns a locally checkpointed version of this Dataset.                                                                                                                             | 2.3          |
| <a href="#pyspark.sql.DataFrame.na">na</a>                                                       | Property       | Returns a :class:`DataFrameNaFunctions` for handling missing values.                                                                                                                 | 1.3.1        |
| <a href="#pyspark.sql.DataFrame.orderBy">orderBy</a>                                             | Transformation | Returns a new :class:`DataFrame` sorted by the specified column(s).                                                                                                                 | 1.3          |
| <a href="#pyspark.sql.DataFrame.persist">persist</a>                                             | Transformation | Sets the storage level to persist the contents of the :class:`DataFrame` across operations after the first time it is computed.                                                   | 1.3          |
| <a href="#pyspark.sql.DataFrame.printSchema">printSchema</a>                                     | Action         | Prints out the schema in the tree format.                                                                                                                                           | 1.3          |
| <a href="#pyspark.sql.DataFrame.randomSplit">randomSplit</a>                                     | Transformation | Randomly splits this :class:`DataFrame` with the provided weights.                                                                                                                   | 1.4          |
| <a href="#pyspark.sql.DataFrame.rdd">rdd</a>                                                     | Property       | Returns the content as an :class:`pyspark.RDD` of :class:`Row`.                                                                                                                     | 1.3          |
| <a href="#pyspark.sql.DataFrame.registerTempTable">registerTempTable</a>                         | Transformation | Registers this DataFrame as a temporary table using the given name.                                                                                                                 | 1.3          |
| <a href="#pyspark.sql.DataFrame.repartition">repartition</a>                                     | Transformation | Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.                                                | 1.3          |
| <a href="#pyspark.sql.DataFrame.repartitionByRange">repartitionByRange</a>                       | Transformation | Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The resulting DataFrame is range partitioned.                                                | 2.4          |
| <a href="#pyspark.sql.DataFrame.replace">replace</a>                                             | Transformation | Returns a new :class:`DataFrame` replacing a value with another value. :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are aliases of each other.          | 1.4          |
| <a href="#pyspark.sql.DataFrame.rollup">rollup</a>                                               | Transformation | Create a multi-dimensional rollup for the current :class:`DataFrame` using the specified columns, so we can run aggregation on them.                                        | 1.4          |
| <a href="#pyspark.sql.DataFrame.sample">sample</a>                                               | Transformation | Returns a sampled subset of this :class:`DataFrame`.                                                                                                                                 | 1.3          |
| <a href="#pyspark.sql.DataFrame.sampleBy">sampleBy</a>                                           | Transformation | Returns a stratified sample without replacement based on the fraction given on each stratum.                                                                                         | 1.5          |
| <a href="#pyspark.sql.DataFrame.schema">schema</a>                                               | Property       | Returns the schema of this :class:`DataFrame` as a :class:`pyspark.sql.types.StructType`.                                                                                           | 1.3          |
| <a href="#pyspark.sql.DataFrame.select">select</a>                                               | Transformation | Projects a set of expressions and returns a new :class:`DataFrame`.                                                                                                                 | 1.3          |
| <a href="#pyspark.sql.DataFrame.selectExpr">selectExpr</a>                                       | Transformation | Projects a set of SQL expressions and returns a new :class:`DataFrame`.                                                                                                             | 1.3          |
| <a href="#pyspark.sql.DataFrame.show">show</a>                                                   | Action         | Prints the first ``n`` rows to the console.                                                                                                                                         | 1.3          |
| <a href="#pyspark.sql.DataFrame.sort">sort</a>                                                   | Transformation | Returns a new :class:`DataFrame` sorted by the specified column(s).                                                                                                                 | 1.3          |
| <a href="#pyspark.sql.DataFrame.sortWithinPartitions">sortWithinPartitions</a>                   | Transformation | Returns a new :class:`DataFrame` with each partition sorted by the specified column(s).                                                                                             | 1.6          |
| <a href="#pyspark.sql.DataFrame.stat">stat</a>                                                   | Property       | Returns a :class:`DataFrameStatFunctions` for statistic functions.                                                                                                                   | 1.4          |
| <a href="#pyspark.sql.DataFrame.storageLevel">storageLevel</a>                                   | Property       | Get the :class:`DataFrame`'s current storage level.                                                                                                                                 | 2.1          |
| <a href="#pyspark.sql.DataFrame.subtract">subtract</a>                                           | Transformation | Return a new :class:`DataFrame` containing rows in this frame but not in another frame.                                                                                             | 1.3          |
| <a href="#pyspark.sql.DataFrame.summary">summary</a>                                             | Transformation | Computes specified statistics for numeric and string columns.                                                                                                                       | 2.3          |
| <a href="#pyspark.sql.DataFrame.take">take</a>                                                   | Action         | Returns the first ``num`` rows as a :class:`list` of :class:`Row`.                                                                                                                   | 1.3          |
| <a href="#pyspark.sql.DataFrame.toDF">toDF</a>                                                   | Transformation | Returns a new :class:`DataFrame` with new specified column names.                                                                                                                    | 1.3          |
| <a href="#pyspark.sql.DataFrame.toJSON">toJSON</a>                                               | Transformation | Converts a :class:`DataFrame` into a :class:`RDD` of string                                                                                                                         | 1.3          |
| <a href="#pyspark.sql.DataFrame.toLocalIterator">toLocalIterator</a>                             | Action         | Returns an iterator that contains all of the rows in this :class:`DataFrame`. The iterator will consume as much memory as the largest partition in this DataFrame.                   | 2.0          |
| <a href="#pyspark.sql.DataFrame.toPandas">toPandas</a>                                           | Action         | Returns the contents of this :class:`DataFrame` as Pandas ``pandas.DataFrame``.                                                                                                     | 1.3          |
| <a href="#pyspark.sql.DataFrame.union">union</a>                                                 | Transformation | Return a new :class:`DataFrame` containing union of rows in this and another frame.                                                                                                 | 2.0          |
| <a href="#pyspark.sql.DataFrame.unionAll">unionAll</a>                                           | Transformation | Return a new :class:`DataFrame` containing union of rows in this and another frame.                                                                                                 | 1.3          |
| <a href="#pyspark.sql.DataFrame.unionByName">unionByName</a>                                     | Transformation | Returns a new :class:`DataFrame` containing union of rows in this and another frame.                                                                                                 | 2.3          |
| <a href="#pyspark.sql.DataFrame.unpersist">unpersist</a>                                         | Transformation | Marks the :class:`DataFrame` as non-persistent, and remove all blocks for it from memory and disk.                                                                                   | 1.3          |
| <a href="#pyspark.sql.DataFrame.where">where</a>                                                 | Transformation | :func:`where` is an alias for :func:`filter`.                                                                                                                                       | 1.3          |
| <a href="#pyspark.sql.DataFrame.withColumn">withColumn</a>                                       | Transformation | Returns a new :class:`DataFrame` by adding a column or replacing the existing column that has the same name.                                                                        | 1.3          |
| <a href="#pyspark.sql.DataFrame.withColumnRenamed">withColumnRenamed</a>                         | Transformation | Returns a new :class:`DataFrame` by renaming an existing column. This is a no-op if schema doesn't contain the given column name.                                                | 1.3          |
| <a href="#pyspark.sql.DataFrame.withWatermark">withWatermark</a>                                 | Transformation | Defines an event time watermark for this :class:`DataFrame`. A watermark tracks a point in time before which we assume no more late data is going to arrive.                         | 2.1          |
| <a href="#pyspark.sql.DataFrame.write">write</a>                                                 | Action         | Interface for saving the content of the non-streaming :class:`DataFrame` out into external storage.                                                                                 | 1.4          |
| <a href="#pyspark.sql.DataFrame.writeStream">writeStream</a>                                     | Action         | Interface for saving the content of the streaming :class:`DataFrame` out into external storage.                                                                                     | 2.0          |

## Copy of PySpark Official Documentation

This copy is included only to augment the official documentation and to give the links in the table above something to point to. PySpark's official documentation (version 2.4.3) can be found [here](https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.DataFrame).

<dl class="class">
<dt id="pyspark.sql.DataFrame">
<em class="property">class </em><code class="descclassname">pyspark.sql.</code><code class="descname">DataFrame</code><span class="sig-paren">(</span><em>jdf</em>, <em>sql_ctx</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame" title="Permalink to this definition">¶</a></dt>
<dd><p>A distributed collection of data grouped into named columns.</p>
<p>A <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> is equivalent to a relational table in Spark SQL,
and can be created using various functions in <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal notranslate"><span class="pre">SparkSession</span></code></a>:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">people</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s2">"..."</span><span class="p">)</span>
</pre></div>
</div>
<p>Once created, it can be manipulated using the various domain-specific-language
(DSL) functions defined in: <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>, <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a>.</p>
<p>To select a column from the data frame, use the apply method:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">ageCol</span> <span class="o">=</span> <span class="n">people</span><span class="o">.</span><span class="n">age</span>
</pre></div>
</div>
<p>A more concrete example:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># To create DataFrame using SparkSession</span>
<span class="n">people</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s2">"..."</span><span class="p">)</span>
<span class="n">department</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">read</span><span class="o">.</span><span class="n">parquet</span><span class="p">(</span><span class="s2">"..."</span><span class="p">)</span>

<span class="n">people</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">people</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">30</span><span class="p">)</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">department</span><span class="p">,</span> <span class="n">people</span><span class="o">.</span><span class="n">deptId</span> <span class="o">==</span> <span class="n">department</span><span class="o">.</span><span class="n">id</span><span class="p">)</span> \
  <span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">department</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s2">"gender"</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s2">"salary"</span><span class="p">:</span> <span class="s2">"avg"</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">:</span> <span class="s2">"max"</span><span class="p">})</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>

<dl class="method">
<dt id="pyspark.sql.DataFrame.agg">
<code class="descname">agg</code><span class="sig-paren">(</span><em>*exprs</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.agg"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.agg" title="Permalink to this definition">¶</a></dt>
<dd><p>Aggregate on the entire <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> without groups
(shorthand for <code class="docutils literal notranslate"><span class="pre">df.groupBy().agg()</span></code>).</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s2">"age"</span><span class="p">:</span> <span class="s2">"max"</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(max(age)=5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">functions</span> <span class="k">as</span> <span class="n">F</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">min</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(min(age)=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
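<p>The dictionary form maps a column name to the name of an aggregate function, and the result column is named <code>func(column)</code>. The block below is a minimal plain-Python model of that behavior, not Spark code: <code>agg_model</code>, <code>AGGREGATES</code> and the sample rows are hypothetical names chosen for illustration, and only a few aggregate names are covered.</p>

```python
# Plain-Python model of df.agg({"age": "max"}); no Spark is involved.
rows = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]

# Hypothetical lookup table covering only a few aggregate names.
AGGREGATES = {"max": max, "min": min, "sum": sum}


def agg_model(rows, exprs):
    """Apply each named aggregate to one column across all rows (no grouping)."""
    result = {}
    for column, func_name in exprs.items():
        # Skip missing values before aggregating (an assumption of this model).
        values = [row[column] for row in rows if row[column] is not None]
        result[f"{func_name}({column})"] = AGGREGATES[func_name](values)
    return result


print(agg_model(rows, {"age": "max"}))  # {'max(age)': 5}
```

<p>Because there is no grouping key, the whole input collapses to a single result row, which is why the documented examples return a one-element list.</p>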

<dl class="method">
<dt id="pyspark.sql.DataFrame.alias">
<code class="descname">alias</code><span class="sig-paren">(</span><em>alias</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.alias"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.alias" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> with an alias set.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>alias</strong> – string, an alias name to be set for the DataFrame.</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="n">col</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df_as1</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"df_as1"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df_as2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"df_as2"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">joined_df</span> <span class="o">=</span> <span class="n">df_as1</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df_as2</span><span class="p">,</span> <span class="n">col</span><span class="p">(</span><span class="s2">"df_as1.name"</span><span class="p">)</span> <span class="o">==</span> <span class="n">col</span><span class="p">(</span><span class="s2">"df_as2.name"</span><span class="p">),</span> <span class="s1">'inner'</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">joined_df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"df_as1.name"</span><span class="p">,</span> <span class="s2">"df_as2.name"</span><span class="p">,</span> <span class="s2">"df_as2.age"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Bob', name='Bob', age=5), Row(name='Alice', name='Alice', age=2)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.approxQuantile">
<code class="descname">approxQuantile</code><span class="sig-paren">(</span><em>col</em>, <em>probabilities</em>, <em>relativeError</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.approxQuantile"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.approxQuantile" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the approximate quantiles of numerical columns of a
DataFrame.</p>
<p>The result of this algorithm has the following deterministic bound:
If the DataFrame has N elements and if we request the quantile at
probability <cite>p</cite> up to error <cite>err</cite>, then the algorithm will return
a sample <cite>x</cite> from the DataFrame so that the <em>exact</em> rank of <cite>x</cite> is
close to (p * N). More precisely,</p>
<blockquote>
<div><p>floor((p - err) * N) &lt;= rank(x) &lt;= ceil((p + err) * N).</p>
</div></blockquote>
<p>This method implements a variation of the Greenwald-Khanna
algorithm (with some speed optimizations). The algorithm was first
presented in <a class="reference external" href="http://dx.doi.org/10.1145/375663.375670">Space-efficient Online Computation of Quantile Summaries</a>
by Greenwald and Khanna.</p>
<p>Note that null values will be ignored in numerical columns before calculation.
For columns only containing null values, an empty list is returned.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>col</strong> – str, list.
Can be a single column name, or a list of names for multiple columns.</p></li>
<li><p><strong>probabilities</strong> – a list of quantile probabilities.
Each number must belong to [0, 1].
For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.</p></li>
<li><p><strong>relativeError</strong> – The relative target precision to achieve
(&gt;= 0). If set to zero, the exact quantiles are computed, which
could be very expensive. Note that values greater than 1 are
accepted but give the same result as 1.</p></li>
</ul>
</dd>
<dt class="field-even">Returns</dt>
<dd class="field-even"><p>The approximate quantiles at the given probabilities. If
the input <cite>col</cite> is a string, the output is a list of floats. If the
input <cite>col</cite> is a list or tuple of strings, the output is also a
list, but each element in it is a list of floats, i.e., the output
is a list of lists of floats.</p>
</dd>
</dl>
<div class="versionchanged">
<p><span class="versionmodified changed">Changed in version 2.2: </span>Added support for multiple columns.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.0.</span></p>
</div>
</dd></dl>
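<p>The rank bound stated above can be checked by hand. The following is a small plain-Python sketch of that inequality only; it does not implement the Greenwald-Khanna algorithm. The helper names <code>rank_bounds</code> and <code>within_bound</code> are hypothetical, ranks are assumed to be 1-based, and the values are assumed to be distinct.</p>

```python
import math


def rank_bounds(p, err, n):
    """Return the (low, high) ranks allowed for probability p with error err."""
    low = math.floor((p - err) * n)
    high = math.ceil((p + err) * n)
    # Clamp to ranks that can exist among n elements (an assumption of this sketch).
    return max(low, 0), min(high, n)


def within_bound(sorted_values, x, p, err):
    """Check whether the exact rank of x satisfies the documented bound."""
    n = len(sorted_values)
    rank = sorted_values.index(x) + 1  # 1-based rank; assumes distinct values
    low, high = rank_bounds(p, err, n)
    return low <= rank <= high


values = list(range(1, 1001))  # 1000 distinct values, already sorted

print(rank_bounds(0.5, 0.125, len(values)))     # (375, 625)
print(within_bound(values, 600, 0.5, 0.125))    # True
print(within_bound(values, 700, 0.5, 0.125))    # False
```

<p>With <code>p = 0.5</code> and <code>err = 0.125</code> over 1000 elements, any value whose exact rank lies between 375 and 625 is an acceptable answer for the median, which shows how a larger error widens the set of acceptable results.</p>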

<dl class="method">
<dt id="pyspark.sql.DataFrame.cache">
<code class="descname">cache</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.cache"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cache" title="Permalink to this definition">¶</a></dt>
<dd><p>Persists the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> with the default storage level (<code class="xref py py-class docutils literal notranslate"><span class="pre">MEMORY_AND_DISK</span></code>).</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The default storage level has changed to <code class="xref py py-class docutils literal notranslate"><span class="pre">MEMORY_AND_DISK</span></code> to match Scala in 2.0.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.checkpoint">
<code class="descname">checkpoint</code><span class="sig-paren">(</span><em>eager=True</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.checkpoint"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.checkpoint" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a checkpointed version of this DataFrame. Checkpointing can be used to truncate the
logical plan of this DataFrame, which is especially useful in iterative algorithms where the
plan may grow exponentially. It will be saved to files inside the checkpoint
directory set with <code class="xref py py-class docutils literal notranslate"><span class="pre">SparkContext.setCheckpointDir()</span></code>.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>eager</strong> – Whether to checkpoint this DataFrame immediately</p>
</dd>
</dl>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.1.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.coalesce">
<code class="descname">coalesce</code><span class="sig-paren">(</span><em>numPartitions</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.coalesce"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.coalesce" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> that has exactly <cite>numPartitions</cite> partitions.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>numPartitions</strong> – int, to specify the target number of partitions</p>
</dd>
</dl>
<p>Similar to coalesce defined on an <code class="xref py py-class docutils literal notranslate"><span class="pre">RDD</span></code>, this operation results in a
narrow dependency, e.g. if you go from 1000 partitions to 100 partitions,
there will not be a shuffle, instead each of the 100 new partitions will
claim 10 of the current partitions. If a larger number of partitions is requested,
it will stay at the current number of partitions.</p>
<p>However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can call repartition(). This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">coalesce</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">1</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
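<p>The narrow-dependency behavior described for <code>coalesce</code> can be pictured with plain Python lists. This is only a model under simplifying assumptions: <code>coalesce_model</code> is a hypothetical helper, and the even, index-based grouping used here stands in for whatever assignment Spark actually chooses.</p>

```python
def coalesce_model(partitions, num_partitions):
    """Merge whole partitions into at most num_partitions groups (no shuffle)."""
    current = len(partitions)
    if num_partitions >= current:
        # Requesting more partitions leaves the partitioning unchanged.
        return partitions
    merged = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        # Each old partition is handed over whole to exactly one new partition,
        # so no row is redistributed individually.
        merged[index * num_partitions // current].extend(part)
    return merged


partitions = [[i] for i in range(1000)]  # 1000 single-row partitions
merged = coalesce_model(partitions, 100)

print(len(merged), len(merged[0]))            # 100 10
print(len(coalesce_model(partitions, 2000)))  # 1000
```

<p>Going from 1000 to 100 partitions, each new partition claims 10 of the old ones, and asking for 2000 leaves the count at 1000, matching the description above.</p>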

<dl class="method">
<dt id="pyspark.sql.DataFrame.colRegex">
<code class="descname">colRegex</code><span class="sig-paren">(</span><em>colName</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.colRegex"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.colRegex" title="Permalink to this definition">¶</a></dt>
<dd><p>Selects column based on the column name specified as a regex and returns it
as <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a>.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>colName</strong> – string, column name specified as a regex.</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s2">"a"</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s2">"b"</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="s2">"c"</span><span class="p">,</span>  <span class="mi">3</span><span class="p">)],</span> <span class="p">[</span><span class="s2">"Col1"</span><span class="p">,</span> <span class="s2">"Col2"</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">colRegex</span><span class="p">(</span><span class="s2">"`(Col1)?+.+`"</span><span class="p">))</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+</span>
<span class="go">|Col2|</span>
<span class="go">+----+</span>
<span class="go">|   1|</span>
<span class="go">|   2|</span>
<span class="go">|   3|</span>
<span class="go">+----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.3.</span></p>
</div>
</dd></dl>
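<p>The pattern <code>(Col1)?+.+</code> in the example above uses a possessive quantifier: the optional group consumes <code>Col1</code> without giving it back, so the trailing <code>.+</code> has nothing left to match for a column named exactly <code>Col1</code>. The sketch below reproduces that selection in plain Python with a negative lookahead instead, because Python's <code>re</code> module only accepts possessive quantifiers from version 3.11. It assumes the whole column name must match the pattern; <code>select_by_regex</code> is a hypothetical helper.</p>

```python
import re

# Lookahead rewrite of the documented pattern (Col1)?+.+ : match any
# non-empty name that is not exactly "Col1". Assumes full-name matching.
PATTERN = re.compile(r"(?!Col1$).+")


def select_by_regex(columns):
    """Return the column names that fully match PATTERN."""
    return [name for name in columns if PATTERN.fullmatch(name)]


print(select_by_regex(["Col1", "Col2"]))  # ['Col2']
```

<p>This agrees with the documented output, where only <code>Col2</code> is selected.</p>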

<dl class="method">
<dt id="pyspark.sql.DataFrame.collect">
<code class="descname">collect</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.collect"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.collect" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns all the records as a list of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal notranslate"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice'), Row(age=5, name='Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.columns">
<code class="descname">columns</code><a class="headerlink" href="#pyspark.sql.DataFrame.columns" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns all column names as a list.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">columns</span>
<span class="go">['age', 'name']</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.corr">
<code class="descname">corr</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em>, <em>method=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.corr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.corr" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the correlation of two columns of a DataFrame as a double value.
Currently only supports the Pearson Correlation Coefficient.
<a class="reference internal" href="#pyspark.sql.DataFrame.corr" title="pyspark.sql.DataFrame.corr"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrame.corr()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.corr" title="pyspark.sql.DataFrameStatFunctions.corr"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrameStatFunctions.corr()</span></code></a> are aliases of each other.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>col1</strong> – The name of the first column</p></li>
<li><p><strong>col2</strong> – The name of the second column</p></li>
<li><p><strong>method</strong> – The correlation method. Currently only supports “pearson”</p></li>
</ul>
</dd>
</dl>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
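<p>For reference, the Pearson correlation coefficient can be written out in a few lines of plain Python. This is a sketch of the textbook formula on in-memory lists, not a description of how Spark computes it; <code>pearson</code> and the sample data are hypothetical.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)


ages = [2, 5, 7, 10]
scores = [2 * a + 1 for a in ages]        # perfectly linear in ages
reversed_scores = [-s for s in scores]    # perfectly anti-correlated

print(pearson(ages, scores))
print(pearson(ages, reversed_scores))
```

<p>A perfectly linear relationship gives a coefficient of 1 (or -1 when the slope is negative), up to floating-point rounding.</p>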

<dl class="method">
<dt id="pyspark.sql.DataFrame.count">
<code class="descname">count</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.count"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.count" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the number of rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">2</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.cov">
<code class="descname">cov</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.cov"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cov" title="Permalink to this definition">¶</a></dt>
<dd><p>Calculates the sample covariance for the given columns, specified by their names, as a
double value. <a class="reference internal" href="#pyspark.sql.DataFrame.cov" title="pyspark.sql.DataFrame.cov"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrame.cov()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.cov" title="pyspark.sql.DataFrameStatFunctions.cov"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrameStatFunctions.cov()</span></code></a> are aliases.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>col1</strong> – The name of the first column</p></li>
<li><p><strong>col2</strong> – The name of the second column</p></li>
</ul>
</dd>
</dl>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
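<p>"Sample" covariance means the sum of products of deviations is divided by <code>n - 1</code> rather than <code>n</code>. A minimal plain-Python sketch of that formula follows; <code>sample_cov</code> and the input lists are hypothetical and no Spark is involved.</p>

```python
def sample_cov(xs, ys):
    """Sample covariance of two equally long numeric sequences (divides by n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample covariance needs at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return co_dev / (n - 1)


xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(sample_cov(xs, ys))
```

<p>For these inputs the deviation products sum to 10, so the result is 10 / 3; dividing by <code>n</code> instead would give the population covariance, 2.5.</p>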

<dl class="method">
<dt id="pyspark.sql.DataFrame.createGlobalTempView">
<code class="descname">createGlobalTempView</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.createGlobalTempView"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.createGlobalTempView" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a global temporary view with this DataFrame.</p>
<p>The lifetime of this temporary view is tied to this Spark application.
Throws <code class="xref py py-class docutils literal notranslate"><span class="pre">TempTableAlreadyExistsException</span></code> if the view name already exists in the
catalog.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createGlobalTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"select * from global_temp.people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createGlobalTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>  <span class="c1"># doctest: +IGNORE_EXCEPTION_DETAIL</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="c">...</span>
<span class="gr">AnalysisException</span>: <span class="n">u"Temporary table 'people' already exists;"</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropGlobalTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.1.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.createOrReplaceGlobalTempView">
<code class="descname">createOrReplaceGlobalTempView</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.createOrReplaceGlobalTempView"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.createOrReplaceGlobalTempView" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates or replaces a global temporary view using the given name.</p>
<p>The lifetime of this temporary view is tied to this Spark application.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceGlobalTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">createOrReplaceGlobalTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"select * from global_temp.people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df3</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropGlobalTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.2.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.createOrReplaceTempView">
<code class="descname">createOrReplaceTempView</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.createOrReplaceTempView"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.createOrReplaceTempView" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates or replaces a local temporary view with this DataFrame.</p>
<p>The lifetime of this temporary view is tied to the <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal notranslate"><span class="pre">SparkSession</span></code></a>
that was used to create this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">createOrReplaceTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df3</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"select * from people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df3</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.0.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.createTempView">
<code class="descname">createTempView</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.createTempView"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.createTempView" title="Permalink to this definition">¶</a></dt>
<dd><p>Creates a local temporary view with this DataFrame.</p>
<p>The lifetime of this temporary view is tied to the <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal notranslate"><span class="pre">SparkSession</span></code></a>
that was used to create this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.
Throws <code class="xref py py-class docutils literal notranslate"><span class="pre">TempTableAlreadyExistsException</span></code> if the view name already exists in the
catalog.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"select * from people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">createTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>  <span class="c1"># doctest: +IGNORE_EXCEPTION_DETAIL</span>
<span class="gt">Traceback (most recent call last):</span>
<span class="c">...</span>
<span class="gr">AnalysisException</span>: <span class="n">u"Temporary table 'people' already exists;"</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.0.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.crossJoin">
<code class="descname">crossJoin</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.crossJoin"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.crossJoin" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the cartesian product with another <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>other</strong> – Right side of the cartesian product.</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice'), Row(age=5, name='Bob')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"height"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Tom', height=80), Row(name='Bob', height=85)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">crossJoin</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"height"</span><span class="p">))</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"name"</span><span class="p">,</span> <span class="s2">"height"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice', height=80), Row(age=2, name='Alice', height=85),</span>
<span class="go"> Row(age=5, name='Bob', height=80), Row(age=5, name='Bob', height=85)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.1.</span></p>
</div>
</dd></dl>
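
As a rough illustration of what "cartesian product" means here, the sketch below pairs every left row with every right row in plain Python. It does not use Spark: `left_rows`, `right_rows`, and `cross_join` are hypothetical stand-ins for `df`, `df2.select("height")`, and `DataFrame.crossJoin`, chosen to mirror the doctest above.

```python
from itertools import product

# Plain-Python stand-ins (assumption) for the rows of `df` and `df2.select("height")`.
left_rows = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]
right_rows = [{"height": 80}, {"height": 85}]


def cross_join(left, right):
    """Pair every left row with every right row (cartesian product)."""
    # Merging dicts is only safe here because the column names do not overlap.
    return [{**l, **r} for l, r in product(left, right)]


joined = cross_join(left_rows, right_rows)
print(len(joined))  # 4 rows: 2 left rows x 2 right rows
```

The output size is the product of the two input sizes, which is why a cross join on large DataFrames can become expensive quickly.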

<dl class="method">
<dt id="pyspark.sql.DataFrame.crosstab">
<code class="descname">crosstab</code><span class="sig-paren">(</span><em>col1</em>, <em>col2</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.crosstab"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.crosstab" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes a pair-wise frequency table of the given columns. Also known as a contingency
table. The number of distinct values for each column should be less than 1e4. At most 1e6
non-zero pair frequencies will be returned.
The first column of each row will be the distinct values of <cite>col1</cite> and the column names
will be the distinct values of <cite>col2</cite>. The name of the first column will be <cite>$col1_$col2</cite>.
Pairs that have no occurrences will have zero as their counts.
<a class="reference internal" href="#pyspark.sql.DataFrame.crosstab" title="pyspark.sql.DataFrame.crosstab"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrame.crosstab()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.crosstab" title="pyspark.sql.DataFrameStatFunctions.crosstab"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrameStatFunctions.crosstab()</span></code></a> are aliases.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>col1</strong> – The name of the first column. Distinct items will make the first item of
each row.</p></li>
<li><p><strong>col2</strong> – The name of the second column. Distinct items will make the column names
of the DataFrame.</p></li>
</ul>
</dd>
</dl>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
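
The reference entry above has no example, so here is a minimal plain-Python sketch of the layout it describes: one output row per distinct `col1` value, one output column per distinct `col2` value, a first column named `col1_col2`, and zero for pairs that never occur. The `crosstab` helper and the sample `rows` are hypothetical and only model the described result shape; they are not Spark's implementation, and the sorting is there only to make the sketch deterministic.

```python
from collections import Counter

# Hypothetical sample data (not taken from the Spark docs).
rows = [
    {"name": "Alice", "age": 2},
    {"name": "Alice", "age": 5},
    {"name": "Bob", "age": 5},
    {"name": "Bob", "age": 5},
]


def crosstab(rows, col1, col2):
    """Build a contingency table as a list of dicts, one per distinct col1 value."""
    pair_counts = Counter((row[col1], row[col2]) for row in rows)
    row_keys = sorted({k1 for k1, _ in pair_counts})
    col_keys = sorted({k2 for _, k2 in pair_counts})
    first_col = f"{col1}_{col2}"  # name of the first output column
    table = []
    for k1 in row_keys:
        out = {first_col: k1}
        for k2 in col_keys:
            # A Counter returns 0 for pairs that never occurred.
            out[str(k2)] = pair_counts[(k1, k2)]
        table.append(out)
    return table


table = crosstab(rows, "name", "age")
for line in table:
    print(line)
```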

<dl class="method">
<dt id="pyspark.sql.DataFrame.cube">
<code class="descname">cube</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.cube"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.cube" title="Permalink to this definition">¶</a></dt>
<dd><p>Create a multi-dimensional cube for the current <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> using
the specified columns, so we can run aggregation on them.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">cube</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| name| age|count|</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| null|null|    2|</span>
<span class="go">| null|   2|    1|</span>
<span class="go">| null|   5|    1|</span>
<span class="go">|Alice|null|    1|</span>
<span class="go">|Alice|   2|    1|</span>
<span class="go">|  Bob|null|    1|</span>
<span class="go">|  Bob|   5|    1|</span>
<span class="go">+-----+----+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
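
The seven rows in the output above come from aggregating over every subset of the cube columns, with `null` marking a column that was rolled up. The following plain-Python sketch reproduces those counts under that reading; `cube_count` and `rows` are hypothetical names, and the sketch would conflate a genuine null in the data with a rolled-up column, which a real engine has to distinguish.

```python
from collections import Counter
from itertools import combinations

# Plain-Python stand-in (assumption) for the two rows of `df`.
rows = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]


def cube_count(rows, cols):
    """Count rows for every subset of `cols`; rolled-up columns become None."""
    counts = Counter()
    for size in range(len(cols) + 1):
        for kept in combinations(cols, size):
            for row in rows:
                key = tuple(row[c] if c in kept else None for c in cols)
                counts[key] += 1
    return counts


result = cube_count(rows, ["name", "age"])
print(len(result))  # 7 groups, matching the 7 rows shown above
```

With `n` cube columns there are `2**n` subsets to aggregate over, so the number of result groups grows quickly as columns are added.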

<dl class="method">
<dt id="pyspark.sql.DataFrame.describe">
<code class="descname">describe</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.describe"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.describe" title="Permalink to this definition">¶</a></dt>
<dd><p>Computes basic statistics for numeric and string columns.</p>
<p>This includes count, mean, stddev, min, and max. If no columns are
given, this function computes statistics for all numerical or string columns.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">([</span><span class="s1">'age'</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+------------------+</span>
<span class="go">|summary|               age|</span>
<span class="go">+-------+------------------+</span>
<span class="go">|  count|                 2|</span>
<span class="go">|   mean|               3.5|</span>
<span class="go">| stddev|2.1213203435596424|</span>
<span class="go">|    min|                 2|</span>
<span class="go">|    max|                 5|</span>
<span class="go">+-------+------------------+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+------------------+-----+</span>
<span class="go">|summary|               age| name|</span>
<span class="go">+-------+------------------+-----+</span>
<span class="go">|  count|                 2|    2|</span>
<span class="go">|   mean|               3.5| null|</span>
<span class="go">| stddev|2.1213203435596424| null|</span>
<span class="go">|    min|                 2|Alice|</span>
<span class="go">|    max|                 5|  Bob|</span>
<span class="go">+-------+------------------+-----+</span>
</pre></div>
</div>
<p>Use summary for expanded statistics and control over which statistics to compute.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.1.</span></p>
</div>
</dd></dl>
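
The `stddev` value in the example output is consistent with the sample standard deviation (dividing by `n - 1`), not the population one. The short check below recomputes the `age` column of that output with Python's `statistics` module; `ages` and `summary` are hypothetical names used only for this sketch.

```python
import statistics

# The two `age` values from the example DataFrame above.
ages = [2, 5]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "stddev": statistics.stdev(ages),  # sample standard deviation (n - 1)
    "min": min(ages),
    "max": max(ages),
}

for name, value in summary.items():
    print(name, value)
```

The population standard deviation of the same two values (`statistics.pstdev`) would be 1.5, which does not match the output shown.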

<dl class="method">
<dt id="pyspark.sql.DataFrame.distinct">
<code class="descname">distinct</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.distinct"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.distinct" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> containing the distinct rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">distinct</span><span class="p">()</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">2</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.drop">
<code class="descname">drop</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.drop"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.drop" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> that drops the specified column.
This is a no-op if schema doesn’t contain the given column name(s).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>cols</strong> – a string name of the column to drop, or a
<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a> to drop, or a list of string names of the columns to drop.</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s1">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Alice'), Row(name='Bob')]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Alice'), Row(name='Bob')]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s1">'inner'</span><span class="p">)</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, height=85, name='Bob')]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s1">'inner'</span><span class="p">)</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob', height=85)]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="s1">'name'</span><span class="p">,</span> <span class="s1">'inner'</span><span class="p">)</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s1">'age'</span><span class="p">,</span> <span class="s1">'height'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.dropDuplicates">
<code class="descname">dropDuplicates</code><span class="sig-paren">(</span><em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.dropDuplicates"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.dropDuplicates" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> with duplicate rows removed,
optionally only considering certain columns.</p>
<p>For a static batch <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>, it just drops duplicate rows. For a streaming
<a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>, it will keep all data across triggers as intermediate state to drop
duplicate rows. You can use <a class="reference internal" href="#pyspark.sql.DataFrame.withWatermark" title="pyspark.sql.DataFrame.withWatermark"><code class="xref py py-func docutils literal notranslate"><span class="pre">withWatermark()</span></code></a> to limit how late the duplicate data can
be, and the system will limit the state accordingly. In addition, data that is too late (older than
the watermark) will be dropped to avoid any possibility of duplicates.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.drop_duplicates" title="pyspark.sql.DataFrame.drop_duplicates"><code class="xref py py-func docutils literal notranslate"><span class="pre">drop_duplicates()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.dropDuplicates" title="pyspark.sql.DataFrame.dropDuplicates"><code class="xref py py-func docutils literal notranslate"><span class="pre">dropDuplicates()</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="k">import</span> <span class="n">Row</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span> \
<span class="gp">... </span>    <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> \
<span class="gp">... </span>    <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">),</span> \
<span class="gp">... </span>    <span class="n">Row</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s1">'Alice'</span><span class="p">,</span> <span class="n">age</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">80</span><span class="p">)])</span><span class="o">.</span><span class="n">toDF</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">|  5|    80|Alice|</span>
<span class="go">| 10|    80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dropDuplicates</span><span class="p">([</span><span class="s1">'name'</span><span class="p">,</span> <span class="s1">'height'</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">|  5|    80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
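
To make the effect of `subset` concrete, the sketch below deduplicates plain Python dicts by a key built from the chosen columns. `drop_duplicates` and `rows` are hypothetical stand-ins that mirror the batch example above. Which of several duplicate rows survives (here, the first one seen) is an artifact of this sketch and should not be read as a guarantee about Spark.

```python
# Plain-Python stand-in (assumption) for the three rows of `df` above.
rows = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]


def drop_duplicates(rows, subset=None):
    """Keep one row per distinct combination of the `subset` columns."""
    seen = set()
    kept = []
    for row in rows:
        # With no subset, every column takes part in the comparison.
        cols = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


all_cols = drop_duplicates(rows)
by_name_height = drop_duplicates(rows, ["name", "height"])
print(len(all_cols), len(by_name_height))  # 2 1
```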

<dl class="method">
<dt id="pyspark.sql.DataFrame.drop_duplicates">
<code class="descname">drop_duplicates</code><span class="sig-paren">(</span><em>subset=None</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.drop_duplicates" title="Permalink to this definition">¶</a></dt>
<dd><p><a class="reference internal" href="#pyspark.sql.DataFrame.drop_duplicates" title="pyspark.sql.DataFrame.drop_duplicates"><code class="xref py py-func docutils literal notranslate"><span class="pre">drop_duplicates()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.dropDuplicates" title="pyspark.sql.DataFrame.dropDuplicates"><code class="xref py py-func docutils literal notranslate"><span class="pre">dropDuplicates()</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.dropna">
<code class="descname">dropna</code><span class="sig-paren">(</span><em>how='any'</em>, <em>thresh=None</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.dropna"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.dropna" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> omitting rows with null values.
<a class="reference internal" href="#pyspark.sql.DataFrame.dropna" title="pyspark.sql.DataFrame.dropna"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrame.dropna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.drop" title="pyspark.sql.DataFrameNaFunctions.drop"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrameNaFunctions.drop()</span></code></a> are aliases of each other.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>how</strong> – ‘any’ or ‘all’.
If ‘any’, drop a row if it contains any nulls.
If ‘all’, drop a row only if all its values are null.</p></li>
<li><p><strong>thresh</strong> – int, default None
If specified, drop rows that have fewer than <cite>thresh</cite> non-null values.
This overrides the <cite>how</cite> parameter.</p></li>
<li><p><strong>subset</strong> – optional list of column names to consider.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">drop</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10|    80|Alice|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.1.</span></p>
</div>
</dd></dl>
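
The interaction between `how`, `thresh`, and `subset` can be sketched in plain Python as follows. This models only the rules listed in the parameter descriptions above, with `None` standing in for null; the `dropna` helper and the sample `rows` are hypothetical and are not the contents of `df4`.

```python
# Hypothetical sample data; None plays the role of a null value.
rows = [
    {"name": "Alice", "age": 10, "height": 80},
    {"name": "Bob", "age": 5, "height": None},
    {"name": "Tom", "age": None, "height": None},
    {"name": None, "age": None, "height": None},
]


def dropna(rows, how="any", thresh=None, subset=None):
    """Filter rows by null count, following the documented parameter rules."""
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        non_null = sum(row[c] is not None for c in cols)
        if thresh is not None:
            # `thresh` takes precedence over `how`.
            keep = non_null >= thresh
        elif how == "any":
            keep = non_null == len(cols)
        elif how == "all":
            keep = non_null > 0
        else:
            raise ValueError("how must be 'any' or 'all'")
        if keep:
            kept.append(row)
    return kept


print(len(dropna(rows)))             # 1: only the fully populated row
print(len(dropna(rows, how="all")))  # 3: only the all-null row is dropped
print(len(dropna(rows, thresh=2)))   # 2: rows with at least 2 non-null values
```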

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.dtypes">
<code class="descname">dtypes</code><a class="headerlink" href="#pyspark.sql.DataFrame.dtypes" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns all column names and their data types as a list.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">dtypes</span>
<span class="go">[('age', 'int'), ('name', 'string')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.exceptAll">
<code class="descname">exceptAll</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.exceptAll"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.exceptAll" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> containing rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> but
not in another <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> while preserving duplicates.</p>
<p>This is equivalent to <cite>EXCEPT ALL</cite> in SQL.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df1</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span>
<span class="gp">... </span>        <span class="p">[(</span><span class="s2">"a"</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s2">"a"</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s2">"a"</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s2">"a"</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="p">(</span><span class="s2">"b"</span><span class="p">,</span>  <span class="mi">3</span><span class="p">),</span> <span class="p">(</span><span class="s2">"c"</span><span class="p">,</span> <span class="mi">4</span><span class="p">)],</span> <span class="p">[</span><span class="s2">"C1"</span><span class="p">,</span> <span class="s2">"C2"</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([(</span><span class="s2">"a"</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="p">(</span><span class="s2">"b"</span><span class="p">,</span> <span class="mi">3</span><span class="p">)],</span> <span class="p">[</span><span class="s2">"C1"</span><span class="p">,</span> <span class="s2">"C2"</span><span class="p">])</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df1</span><span class="o">.</span><span class="n">exceptAll</span><span class="p">(</span><span class="n">df2</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+---+</span>
<span class="go">| C1| C2|</span>
<span class="go">+---+---+</span>
<span class="go">|  a|  1|</span>
<span class="go">|  a|  1|</span>
<span class="go">|  a|  2|</span>
<span class="go">|  c|  4|</span>
<span class="go">+---+---+</span>
</pre></div>
</div>
<p>Also as standard in SQL, this function resolves columns by position (not by name).</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.4.</span></p>
</div>
</dd></dl>
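
"Preserving duplicates" means the rows are treated as a multiset: each row in the other DataFrame cancels at most one matching row in this one. A plain-Python sketch of that behaviour, using `collections.Counter` on tuples that mirror the example above, is shown here. `except_all` is a hypothetical helper, and the sorting is only there to make the printed order deterministic.

```python
from collections import Counter

# Tuples mirroring the rows of `df1` and `df2` in the example above.
df1_rows = [("a", 1), ("a", 1), ("a", 1), ("a", 2), ("b", 3), ("c", 4)]
df2_rows = [("a", 1), ("b", 3)]


def except_all(left, right):
    """Multiset difference: each right row removes one matching left row."""
    remaining = Counter(left) - Counter(right)  # non-positive counts are dropped
    return sorted(remaining.elements())


result = except_all(df1_rows, df2_rows)
print(result)  # [('a', 1), ('a', 1), ('a', 2), ('c', 4)]
```

Because the rows are compared as positional tuples, the sketch also reflects the note above that columns are resolved by position rather than by name.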

<dl class="method">
<dt id="pyspark.sql.DataFrame.explain">
<code class="descname">explain</code><span class="sig-paren">(</span><em>extended=False</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.explain"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.explain" title="Permalink to this definition">¶</a></dt>
<dd><p>Prints the (logical and physical) plans to the console for debugging purposes.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>extended</strong> – boolean, default <code class="docutils literal notranslate"><span class="pre">False</span></code>. If <code class="docutils literal notranslate"><span class="pre">False</span></code>, prints only the physical plan.</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">explain</span><span class="p">()</span>
<span class="go">== Physical Plan ==</span>
<span class="go">Scan ExistingRDD[age#0,name#1]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">explain</span><span class="p">(</span><span class="kc">True</span><span class="p">)</span>
<span class="go">== Parsed Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Analyzed Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Optimized Logical Plan ==</span>
<span class="gp">...</span>
<span class="go">== Physical Plan ==</span>
<span class="gp">...</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.fillna">
<code class="descname">fillna</code><span class="sig-paren">(</span><em>value</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.fillna"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.fillna" title="Permalink to this definition">¶</a></dt>
<dd><p>Replace null values, alias for <code class="docutils literal notranslate"><span class="pre">na.fill()</span></code>.
<a class="reference internal" href="#pyspark.sql.DataFrame.fillna" title="pyspark.sql.DataFrame.fillna"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrame.fillna()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.fill" title="pyspark.sql.DataFrameNaFunctions.fill"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrameNaFunctions.fill()</span></code></a> are aliases of each other.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>value</strong> – int, long, float, string, bool or dict.
Value to replace null values with.
If the value is a dict, then <cite>subset</cite> is ignored and <cite>value</cite> must be a mapping
from column name (string) to replacement value. The replacement value must be
an int, long, float, boolean, or string.</p></li>
<li><p><strong>subset</strong> – optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if <cite>value</cite> is a string, and subset contains a non-string column,
then the non-string column is simply ignored.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">(</span><span class="mi">50</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-----+</span>
<span class="go">|age|height| name|</span>
<span class="go">+---+------+-----+</span>
<span class="go">| 10|    80|Alice|</span>
<span class="go">|  5|    50|  Bob|</span>
<span class="go">| 50|    50|  Tom|</span>
<span class="go">| 50|    50| null|</span>
<span class="go">+---+------+-----+</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df5</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">(</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+-------+-----+</span>
<span class="go">| age|   name|  spy|</span>
<span class="go">+----+-------+-----+</span>
<span class="go">|  10|  Alice|false|</span>
<span class="go">|   5|    Bob|false|</span>
<span class="go">|null|Mallory| true|</span>
<span class="go">+----+-------+-----+</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">fill</span><span class="p">({</span><span class="s1">'age'</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span> <span class="s1">'name'</span><span class="p">:</span> <span class="s1">'unknown'</span><span class="p">})</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+------+-------+</span>
<span class="go">|age|height|   name|</span>
<span class="go">+---+------+-------+</span>
<span class="go">| 10|    80|  Alice|</span>
<span class="go">|  5|  null|    Bob|</span>
<span class="go">| 50|  null|    Tom|</span>
<span class="go">| 50|  null|unknown|</span>
<span class="go">+---+------+-------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.1.</span></p>
</div>
</dd></dl>
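
The type-matching rule described for `value` and `subset` may be easier to follow in plain Python. The sketch below is a simplified model, not Spark's implementation. Rows are plain dicts, `schema` is a hypothetical mapping from column name to a coarse kind (`"numeric"`, `"string"`, `"boolean"`), and `fill_nulls` is an illustrative helper name. The data mirrors the `df4` example above.

```python
def _kind_of(value):
    """Classify a scalar replacement value.

    bool is checked before int because bool is a subclass of int in Python.
    """
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "numeric"
    if isinstance(value, str):
        return "string"
    raise TypeError("unsupported replacement value: %r" % (value,))


def fill_nulls(rows, schema, value, subset=None):
    """Model of fillna: replace None in columns whose kind matches `value`."""
    if isinstance(value, dict):
        # A dict maps column name -> replacement; `subset` is ignored.
        replacements = dict(value)
    else:
        kind = _kind_of(value)
        columns = schema if subset is None else subset
        # Columns whose kind does not match the value are silently skipped.
        replacements = {c: value for c in columns if schema[c] == kind}
    return [
        {c: replacements.get(c, v) if v is None else v for c, v in row.items()}
        for row in rows
    ]


# Hypothetical stand-in for df4 from the example above.
schema = {"age": "numeric", "height": "numeric", "name": "string"}
rows = [
    {"age": 10, "height": 80, "name": "Alice"},
    {"age": 5, "height": None, "name": "Bob"},
    {"age": None, "height": None, "name": "Tom"},
    {"age": None, "height": None, "name": None},
]

filled = fill_nulls(rows, schema, 50)  # name stays None: it is a string column
by_column = fill_nulls(rows, schema, {"age": 50, "name": "unknown"})
```

In this model a scalar `50` only touches the numeric columns, while the dict form fills exactly the columns it names and leaves `height` untouched.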

<dl class="method">
<dt id="pyspark.sql.DataFrame.filter">
<code class="descname">filter</code><span class="sig-paren">(</span><em>condition</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.filter"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.filter" title="Permalink to this definition">¶</a></dt>
<dd><p>Filters rows using the given condition.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.where" title="pyspark.sql.DataFrame.where"><code class="xref py py-func docutils literal notranslate"><span class="pre">where()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.filter" title="pyspark.sql.DataFrame.filter"><code class="xref py py-func docutils literal notranslate"><span class="pre">filter()</span></code></a>.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>condition</strong> – a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a> of <a class="reference internal" href="#pyspark.sql.types.BooleanType" title="pyspark.sql.types.BooleanType"><code class="xref py py-class docutils literal notranslate"><span class="pre">types.BooleanType</span></code></a>
or a string of SQL expression.</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">&gt;</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice')]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="s2">"age &gt; 3"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="s2">"age = 2"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
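
As with a SQL `WHERE` clause, a row is kept only when the condition evaluates to true, so rows for which the condition is null are dropped as well. The following plain-Python sketch models that behaviour with `None` standing in for null. `filter_rows` and `age_greater_than` are hypothetical helper names, and the row with a missing age is added for illustration only.

```python
def filter_rows(rows, condition):
    """Keep rows whose condition is exactly True; False and None are dropped."""
    return [row for row in rows if condition(row) is True]


def age_greater_than(threshold):
    """Build a condition that yields None (null) when the age is missing."""
    def condition(row):
        if row["age"] is None:
            return None
        return row["age"] > threshold
    return condition


people = [
    {"age": 2, "name": "Alice"},
    {"age": 5, "name": "Bob"},
    {"age": None, "name": "Tom"},  # hypothetical row with a null age
]

older = filter_rows(people, age_greater_than(3))
```

Only Bob survives the filter: Alice fails the comparison and Tom's comparison is null.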

<dl class="method">
<dt id="pyspark.sql.DataFrame.first">
<code class="descname">first</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.first"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.first" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the first row as a <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal notranslate"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">Row(age=2, name='Alice')</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.foreach">
<code class="descname">foreach</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.foreach"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.foreach" title="Permalink to this definition">¶</a></dt>
<dd><p>Applies the <code class="docutils literal notranslate"><span class="pre">f</span></code> function to all <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal notranslate"><span class="pre">Row</span></code></a> of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<p>This is a shorthand for <code class="docutils literal notranslate"><span class="pre">df.rdd.foreach()</span></code>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">person</span><span class="p">):</span>
<span class="gp">... </span>    <span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">foreach</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.foreachPartition">
<code class="descname">foreachPartition</code><span class="sig-paren">(</span><em>f</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.foreachPartition"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.foreachPartition" title="Permalink to this definition">¶</a></dt>
<dd><p>Applies the <code class="docutils literal notranslate"><span class="pre">f</span></code> function to each partition of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<p>This is a shorthand for <code class="docutils literal notranslate"><span class="pre">df.rdd.foreachPartition()</span></code>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">people</span><span class="p">):</span>
<span class="gp">... </span>    <span class="k">for</span> <span class="n">person</span> <span class="ow">in</span> <span class="n">people</span><span class="p">:</span>
<span class="gp">... </span>        <span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">foreachPartition</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
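
The difference between `foreach` and `foreachPartition` lies in how often `f` is called and what it receives. The plain-Python sketch below models a DataFrame as a list of partitions. The function names and the partition layout are assumptions for illustration. On a real cluster the function runs on the executors, so the order across partitions is not guaranteed.

```python
def foreach(partitions, f):
    """Call f once per row."""
    for partition in partitions:
        for row in partition:
            f(row)


def foreach_partition(partitions, f):
    """Call f once per partition, passing an iterator over its rows."""
    for partition in partitions:
        f(iter(partition))


# Hypothetical layout: three rows spread over two partitions.
partitions = [
    [{"name": "Alice"}],
    [{"name": "Bob"}, {"name": "Tom"}],
]

row_calls = []
foreach(partitions, lambda person: row_calls.append(person["name"]))

partition_calls = []
foreach_partition(
    partitions,
    lambda people: partition_calls.append([p["name"] for p in people]),
)
```

Here `f` runs three times under `foreach` and twice under `foreach_partition`, which is why the per-partition form is a common choice when `f` needs costly setup that can be shared by all rows in a partition.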

<dl class="method">
<dt id="pyspark.sql.DataFrame.freqItems">
<code class="descname">freqItems</code><span class="sig-paren">(</span><em>cols</em>, <em>support=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.freqItems"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.freqItems" title="Permalink to this definition">¶</a></dt>
<dd><p>Finds frequent items for columns, possibly with false positives, using the
frequent element count algorithm described in
<a class="reference external" href="http://dx.doi.org/10.1145/762471.762473">http://dx.doi.org/10.1145/762471.762473</a>, proposed by Karp, Schenker, and Papadimitriou.
<a class="reference internal" href="#pyspark.sql.DataFrame.freqItems" title="pyspark.sql.DataFrame.freqItems"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrame.freqItems()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions.freqItems" title="pyspark.sql.DataFrameStatFunctions.freqItems"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrameStatFunctions.freqItems()</span></code></a> are aliases.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>cols</strong> – Names of the columns to calculate frequent items for as a list or tuple of
strings.</p></li>
<li><p><strong>support</strong> – The frequency with which to consider an item ‘frequent’. Default is 1%.
The support must be greater than 1e-4.</p></li>
</ul>
</dd>
</dl>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
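
To see where the false positives come from, here is a minimal sketch of a counter-based frequent-element algorithm in the spirit of the paper cited above (this family is often called Misra-Gries). It is an illustration under stated assumptions, not Spark's implementation, whose details may differ. `frequent_items` is a hypothetical helper that works on a plain Python sequence.

```python
import math


def frequent_items(items, support):
    """Return candidates that may occur in more than `support` of the items.

    Every item whose relative frequency exceeds `support` is guaranteed to be
    in the result, but the result may also contain less frequent items.
    """
    max_counters = math.ceil(1 / support) - 1
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < max_counters:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


truly_frequent = frequent_items(list("aaaaaabbbc"), 0.5)  # 'a' is 60% of items
false_positive = frequent_items(list("abc"), 0.5)         # nothing exceeds 50%
```

In the second call the sketch still reports `'c'`, although no item exceeds the 50% threshold. A single pass with bounded memory cannot rule such candidates out, which is why the result should be treated as exploratory.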

<dl class="method">
<dt id="pyspark.sql.DataFrame.groupBy">
<code class="descname">groupBy</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.groupBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.groupBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Groups the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> using the specified columns,
so we can run aggregation on them. See <a class="reference internal" href="#pyspark.sql.GroupedData" title="pyspark.sql.GroupedData"><code class="xref py py-class docutils literal notranslate"><span class="pre">GroupedData</span></code></a>
for all the available aggregate functions.</p>
<p><a class="reference internal" href="#pyspark.sql.DataFrame.groupby" title="pyspark.sql.DataFrame.groupby"><code class="xref py py-func docutils literal notranslate"><span class="pre">groupby()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal notranslate"><span class="pre">groupBy()</span></code></a>.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>cols</strong> – list of columns to group by.
Each element should be a column name (string) or an expression (<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a>).</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">()</span><span class="o">.</span><span class="n">avg</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(avg(age)=3.5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s1">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'age'</span><span class="p">:</span> <span class="s1">'mean'</span><span class="p">})</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">)</span><span class="o">.</span><span class="n">avg</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(name='Alice', avg(age)=2.0), Row(name='Bob', avg(age)=5.0)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">groupBy</span><span class="p">([</span><span class="s1">'name'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">])</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">[Row(name='Alice', age=2, count=1), Row(name='Bob', age=5, count=1)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
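
Conceptually, `groupBy` partitions the rows by the values of the key columns, and the aggregate then runs once per group. Calling it with no columns yields a single group that covers the whole DataFrame. The plain-Python sketch below models this for an average. `group_avg` is a hypothetical helper, and the rows mirror the `df` used in the examples above.

```python
from collections import defaultdict


def group_avg(rows, keys, value):
    """Average `value` per distinct combination of the `keys` columns."""
    groups = defaultdict(list)
    for row in rows:
        group_key = tuple(row[k] for k in keys)  # () when no keys are given
        groups[group_key].append(row[value])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


people = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]

overall = group_avg(people, [], "age")        # one group for all rows
by_name = group_avg(people, ["name"], "age")  # one group per name
```

The empty key list collapses everything into one group, matching the `df.groupBy().avg()` result of 3.5 shown above.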

<dl class="method">
<dt id="pyspark.sql.DataFrame.groupby">
<code class="descname">groupby</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.groupby" title="Permalink to this definition">¶</a></dt>
<dd><p><a class="reference internal" href="#pyspark.sql.DataFrame.groupby" title="pyspark.sql.DataFrame.groupby"><code class="xref py py-func docutils literal notranslate"><span class="pre">groupby()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.groupBy" title="pyspark.sql.DataFrame.groupBy"><code class="xref py py-func docutils literal notranslate"><span class="pre">groupBy()</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.head">
<code class="descname">head</code><span class="sig-paren">(</span><em>n=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.head"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.head" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the first <code class="docutils literal notranslate"><span class="pre">n</span></code> rows.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This method should only be used if the resulting array is expected
to be small, as all the data is loaded into the driver’s memory.</p>
</div>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>n</strong> – int, optional. Number of rows to return. If omitted, only the first row is returned.</p>
</dd>
<dt class="field-even">Returns</dt>
<dd class="field-even"><p>If n is given, return a list of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal notranslate"><span class="pre">Row</span></code></a>, even when n is 1.
If n is omitted, return a single Row.</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="go">Row(age=2, name='Alice')</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="go">[Row(age=2, name='Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
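
The doctest above shows that `head()` and `head(1)` return different types. A plain-Python sketch of that rule, with a list standing in for the DataFrame, makes the distinction explicit. This `head` function is an illustrative model only.

```python
def head(rows, n=None):
    """Model of the return-type rule: a single row without n, a list with n."""
    if n is None:
        # No argument: return the first row itself, or None if there is none.
        return rows[0] if rows else None
    # Explicit n: always return a list, even for n == 1.
    return rows[:n]


people = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]

single = head(people)       # a row
as_list = head(people, 1)   # a one-element list
```

Code that indexes into the result should therefore pass `n` explicitly if it expects a list.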

<dl class="method">
<dt id="pyspark.sql.DataFrame.hint">
<code class="descname">hint</code><span class="sig-paren">(</span><em>name</em>, <em>*parameters</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.hint"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.hint" title="Permalink to this definition">¶</a></dt>
<dd><p>Specifies some hint on the current DataFrame.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>name</strong> – A name of the hint.</p></li>
<li><p><strong>parameters</strong> – Optional parameters.</p></li>
</ul>
</dd>
<dt class="field-even">Returns</dt>
<dd class="field-even"><p><a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a></p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">hint</span><span class="p">(</span><span class="s2">"broadcast"</span><span class="p">),</span> <span class="s2">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+---+------+</span>
<span class="go">|name|age|height|</span>
<span class="go">+----+---+------+</span>
<span class="go">| Bob|  5|    85|</span>
<span class="go">+----+---+------+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.2.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.intersect">
<code class="descname">intersect</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.intersect"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.intersect" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> containing only the rows that appear in
both this frame and another frame.</p>
<p>This is equivalent to <cite>INTERSECT</cite> in SQL.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.intersectAll">
<code class="descname">intersectAll</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.intersectAll"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.intersectAll" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> containing rows in both this dataframe and other
dataframe while preserving duplicates.</p>
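<p>This is equivalent to <cite>INTERSECT ALL</cite> in SQL.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span>df1 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3), ("c", 4)], ["C1", "C2"])
<span class="gp">&gt;&gt;&gt; </span>df2 = spark.createDataFrame([("a", 1), ("a", 1), ("b", 3)], ["C1", "C2"])
</pre></div>
</div>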
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df1</span><span class="o">.</span><span class="n">intersectAll</span><span class="p">(</span><span class="n">df2</span><span class="p">)</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s2">"C1"</span><span class="p">,</span> <span class="s2">"C2"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+---+</span>
<span class="go">| C1| C2|</span>
<span class="go">+---+---+</span>
<span class="go">|  a|  1|</span>
<span class="go">|  a|  1|</span>
<span class="go">|  b|  3|</span>
<span class="go">+---+---+</span>
</pre></div>
</div>
<p>Also as standard in SQL, this function resolves columns by position (not by name).</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.4.</span></p>
</div>
</dd></dl>
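
The difference between `intersect` and `intersectAll` is the difference between set and multiset intersection. The plain-Python sketch below models rows as tuples and uses `collections.Counter` for the multiset case. The function names are illustrative, and the data is the `df1`/`df2` content from the example above.

```python
from collections import Counter


def intersect(left, right):
    """Set semantics: each common row appears once."""
    return sorted(set(left) & set(right))


def intersect_all(left, right):
    """Multiset semantics: a row appears min(count_left, count_right) times."""
    return sorted((Counter(left) & Counter(right)).elements())


df1_rows = [("a", 1), ("a", 1), ("b", 3), ("c", 4)]
df2_rows = [("a", 1), ("a", 1), ("b", 3)]

distinct_common = intersect(df1_rows, df2_rows)
all_common = intersect_all(df1_rows, df2_rows)
```

The row `("a", 1)` occurs twice on both sides, so it appears twice in `all_common` but only once in `distinct_common`.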

<dl class="method">
<dt id="pyspark.sql.DataFrame.isLocal">
<code class="descname">isLocal</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.isLocal"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.isLocal" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns <code class="docutils literal notranslate"><span class="pre">True</span></code> if the <a class="reference internal" href="#pyspark.sql.DataFrame.collect" title="pyspark.sql.DataFrame.collect"><code class="xref py py-func docutils literal notranslate"><span class="pre">collect()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrame.take" title="pyspark.sql.DataFrame.take"><code class="xref py py-func docutils literal notranslate"><span class="pre">take()</span></code></a> methods can be run locally
(without any Spark executors).</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.isStreaming">
<code class="descname">isStreaming</code><a class="headerlink" href="#pyspark.sql.DataFrame.isStreaming" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns true if this <code class="xref py py-class docutils literal notranslate"><span class="pre">Dataset</span></code> contains one or more sources that continuously
return data as it arrives. A <code class="xref py py-class docutils literal notranslate"><span class="pre">Dataset</span></code> that reads data from a streaming source
must be executed as a <code class="xref py py-class docutils literal notranslate"><span class="pre">StreamingQuery</span></code> using the <code class="xref py py-func docutils literal notranslate"><span class="pre">start()</span></code> method in
<code class="xref py py-class docutils literal notranslate"><span class="pre">DataStreamWriter</span></code>.  Methods that return a single answer (e.g., <a class="reference internal" href="#pyspark.sql.DataFrame.count" title="pyspark.sql.DataFrame.count"><code class="xref py py-func docutils literal notranslate"><span class="pre">count()</span></code></a> or
<a class="reference internal" href="#pyspark.sql.DataFrame.collect" title="pyspark.sql.DataFrame.collect"><code class="xref py py-func docutils literal notranslate"><span class="pre">collect()</span></code></a>) will throw an <code class="xref py py-class docutils literal notranslate"><span class="pre">AnalysisException</span></code> when there is a streaming
source present.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Evolving</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.0.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.join">
<code class="descname">join</code><span class="sig-paren">(</span><em>other</em>, <em>on=None</em>, <em>how=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.join"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.join" title="Permalink to this definition">¶</a></dt>
<dd><p>Joins with another <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>, using the given join expression.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>other</strong> – Right side of the join</p></li>
<li><p><strong>on</strong> – a string for the join column name, a list of column names,
a join expression (Column), or a list of Columns.
If <cite>on</cite> is a string or a list of strings indicating the name of the join column(s),
the column(s) must exist on both sides, and this performs an equi-join.</p></li>
<li><p><strong>how</strong> – str, default <code class="docutils literal notranslate"><span class="pre">inner</span></code>. Must be one of: <code class="docutils literal notranslate"><span class="pre">inner</span></code>, <code class="docutils literal notranslate"><span class="pre">cross</span></code>, <code class="docutils literal notranslate"><span class="pre">outer</span></code>,
<code class="docutils literal notranslate"><span class="pre">full</span></code>, <code class="docutils literal notranslate"><span class="pre">full_outer</span></code>, <code class="docutils literal notranslate"><span class="pre">left</span></code>, <code class="docutils literal notranslate"><span class="pre">left_outer</span></code>, <code class="docutils literal notranslate"><span class="pre">right</span></code>, <code class="docutils literal notranslate"><span class="pre">right_outer</span></code>,
<code class="docutils literal notranslate"><span class="pre">left_semi</span></code>, and <code class="docutils literal notranslate"><span class="pre">left_anti</span></code>.</p></li>
</ul>
</dd>
</dl>
<p>The following performs a full outer join between <code class="docutils literal notranslate"><span class="pre">df</span></code> and <code class="docutils literal notranslate"><span class="pre">df2</span></code>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df2</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="s1">'outer'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df2</span><span class="o">.</span><span class="n">height</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name=None, height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="s1">'name'</span><span class="p">,</span> <span class="s1">'outer'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'name'</span><span class="p">,</span> <span class="s1">'height'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Tom', height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">cond</span> <span class="o">=</span> <span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">name</span> <span class="o">==</span> <span class="n">df3</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">==</span> <span class="n">df3</span><span class="o">.</span><span class="n">age</span><span class="p">]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df3</span><span class="p">,</span> <span class="n">cond</span><span class="p">,</span> <span class="s1">'outer'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df3</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Alice', age=2), Row(name='Bob', age=5)]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df2</span><span class="p">,</span> <span class="s1">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df2</span><span class="o">.</span><span class="n">height</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Bob', height=85)]</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">df4</span><span class="p">,</span> <span class="p">[</span><span class="s1">'name'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">])</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Bob', age=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
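
The `how` values are easier to compare side by side on a tiny dataset. The sketch below is a plain-Python model of an equi-join on `name`, not Spark code: the helper `join_on_name` is hypothetical, it covers only a handful of join types, and it assumes the join key is never null. The rows mirror the `df` and `df2` used in the examples above.

```python
# Plain-Python model of join semantics; this is NOT how Spark executes a join.
df = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]
df2 = [{"name": "Tom", "height": 80}, {"name": "Bob", "height": 85}]


def join_on_name(left, right, how="inner"):
    """Model an equi-join on the shared `name` column (hypothetical helper)."""
    right_by_name = {}
    for row in right:
        right_by_name.setdefault(row["name"], []).append(row)
    left_names = {row["name"] for row in left}

    if how == "left_semi":  # left rows with a match, left columns only
        return [row for row in left if row["name"] in right_by_name]
    if how == "left_anti":  # left rows without a match, left columns only
        return [row for row in left if row["name"] not in right_by_name]

    result = []
    for row in left:
        matches = right_by_name.get(row["name"], [])
        for match in matches:
            result.append({**row, **match})
        # Unmatched left rows survive only in left and full outer joins.
        if not matches and how in ("left", "outer"):
            result.append({**row, "height": None})
    if how == "outer":
        # Unmatched right rows survive only in a full outer join.
        for row in right:
            if row["name"] not in left_names:
                result.append({"age": None, **row})
    return result


for how in ("inner", "left", "outer", "left_semi", "left_anti"):
    print(how, [row["name"] for row in join_on_name(df, df2, how)])
```

Only `outer` keeps both Alice (no height) and Tom (no age), which matches the second example above; `left_semi` and `left_anti` never add columns from the right side.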

<dl class="method">
<dt id="pyspark.sql.DataFrame.limit">
<code class="descname">limit</code><span class="sig-paren">(</span><em>num</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.limit"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.limit" title="Permalink to this definition">¶</a></dt>
<dd><p>Limits the result count to the number specified.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">limit</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">limit</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.localCheckpoint">
<code class="descname">localCheckpoint</code><span class="sig-paren">(</span><em>eager=True</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.localCheckpoint"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.localCheckpoint" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a locally checkpointed version of this DataFrame. Checkpointing can be used to
truncate the logical plan of this DataFrame, which is especially useful in iterative
algorithms where the plan may grow exponentially. Local checkpoints are stored in the
executors using the caching subsystem and are therefore not reliable.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>eager</strong> – Whether to checkpoint this DataFrame immediately</p>
</dd>
</dl>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Experimental</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.3.</span></p>
</div>
</dd></dl>

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.na">
<code class="descname">na</code><a class="headerlink" href="#pyspark.sql.DataFrame.na" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions" title="pyspark.sql.DataFrameNaFunctions"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrameNaFunctions</span></code></a> for handling missing values.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.1.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.orderBy">
<code class="descname">orderBy</code><span class="sig-paren">(</span><em>*cols</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.orderBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> sorted by the specified column(s).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>cols</strong> – list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a> or column names to sort by.</p></li>
<li><p><strong>ascending</strong> – boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, length of the list must equal length of the <cite>cols</cite>.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s2">"age"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">asc</span><span class="p">(</span><span class="s2">"age"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice'), Row(age=5, name='Bob')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="s2">"age"</span><span class="p">),</span> <span class="s2">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">([</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"name"</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
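
The two-row `df` cannot show what a per-column `ascending` list does, because it never produces a tie. The following plain-Python sketch models the ordering only (it is not Spark code): `order_by` is a hypothetical helper and the `Amy` row is invented to create a tie on `age`.

```python
rows = [
    {"age": 2, "name": "Alice"},
    {"age": 5, "name": "Bob"},
    {"age": 5, "name": "Amy"},  # hypothetical extra row, added to show the tie-break
]


def order_by(rows, cols, ascending):
    """Sort by several columns, each with its own direction (hypothetical helper)."""
    if len(cols) != len(ascending):
        raise ValueError("ascending must have one entry per column")
    result = list(rows)
    # Python's sort is stable, so sorting by the last column first and the
    # first column last produces a multi-column ordering.
    for col, asc in reversed(list(zip(cols, ascending))):
        result.sort(key=lambda row: row[col], reverse=not asc)
    return result


# Same arguments as the last example above: age descending, then name ascending.
ordered = order_by(rows, ["age", "name"], ascending=[0, 1])
print([(row["age"], row["name"]) for row in ordered])
# [(5, 'Amy'), (5, 'Bob'), (2, 'Alice')]
```

The entries of `ascending` line up positionally with `cols`, which is why the two lists must have the same length.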

<dl class="method">
<dt id="pyspark.sql.DataFrame.persist">
<code class="descname">persist</code><span class="sig-paren">(</span><em>storageLevel=StorageLevel(True</em>, <em>True</em>, <em>False</em>, <em>False</em>, <em>1)</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.persist"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.persist" title="Permalink to this definition">¶</a></dt>
<dd><p>Sets the storage level to persist the contents of the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> across
operations after the first time it is computed. This can only be used to assign
a new storage level if the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> does not have a storage level set yet.
If no storage level is specified, it defaults to <code class="xref py py-class docutils literal notranslate"><span class="pre">MEMORY_AND_DISK</span></code>.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The default storage level has changed to <code class="xref py py-class docutils literal notranslate"><span class="pre">MEMORY_AND_DISK</span></code> to match Scala in 2.0.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.printSchema">
<code class="descname">printSchema</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.printSchema"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.printSchema" title="Permalink to this definition">¶</a></dt>
<dd><p>Prints out the schema in tree format.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">printSchema</span><span class="p">()</span>
<span class="go">root</span>
<span class="go"> |-- age: integer (nullable = true)</span>
<span class="go"> |-- name: string (nullable = true)</span>
<span class="go">&lt;BLANKLINE&gt;</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.randomSplit">
<code class="descname">randomSplit</code><span class="sig-paren">(</span><em>weights</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.randomSplit"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.randomSplit" title="Permalink to this definition">¶</a></dt>
<dd><p>Randomly splits this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> with the provided weights.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>weights</strong> – list of doubles as weights with which to split the DataFrame. Weights will
be normalized if they don’t sum up to 1.0.</p></li>
<li><p><strong>seed</strong> – The seed for sampling.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">splits</span> <span class="o">=</span> <span class="n">df4</span><span class="o">.</span><span class="n">randomSplit</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span> <span class="mi">24</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">splits</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">1</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">splits</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">3</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
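
The weights are relative proportions, not exact row counts, which is why `[1.0, 2.0]` on four rows gives a 1/3 split in the example above rather than a guaranteed ratio. The sketch below is a plain-Python model under the assumption that each row is assigned independently by a uniform random draw; `normalize` and `random_split` are hypothetical helpers, and Python's `random` module will not reproduce the rows Spark picks for a given seed.

```python
import random
from itertools import accumulate


def normalize(weights):
    """Scale the weights so that they sum to 1.0."""
    total = sum(weights)
    return [w / total for w in weights]


def random_split(rows, weights, seed=None):
    """Assign every row to exactly one split, with probability given by its weight."""
    bounds = list(accumulate(normalize(weights)))  # cumulative upper bounds
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        # Pick the first split whose upper bound exceeds the draw; fall back to
        # the last split in case rounding leaves the final bound just below 1.0.
        index = next((i for i, ub in enumerate(bounds) if draw < ub), len(bounds) - 1)
        splits[index].append(row)
    return splits


print(normalize([1.0, 2.0]))  # one third and two thirds
small, large = random_split(range(1000), [1.0, 2.0], seed=24)
print(len(small), len(large))  # roughly 1:2, but not exactly
```

Passing the same `seed` gives the same assignment on the same input, and every row lands in exactly one split.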

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.rdd">
<code class="descname">rdd</code><a class="headerlink" href="#pyspark.sql.DataFrame.rdd" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the content as a <a class="reference internal" href="pyspark.html#pyspark.RDD" title="pyspark.RDD"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyspark.RDD</span></code></a> of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal notranslate"><span class="pre">Row</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.registerTempTable">
<code class="descname">registerTempTable</code><span class="sig-paren">(</span><em>name</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.registerTempTable"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.registerTempTable" title="Permalink to this definition">¶</a></dt>
<dd><p>Registers this DataFrame as a temporary table using the given name.</p>
<p>The lifetime of this temporary table is tied to the <a class="reference internal" href="#pyspark.sql.SparkSession" title="pyspark.sql.SparkSession"><code class="xref py py-class docutils literal notranslate"><span class="pre">SparkSession</span></code></a>
that was used to create this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">registerTempTable</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s2">"select * from people"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">sorted</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span> <span class="o">==</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">df2</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">spark</span><span class="o">.</span><span class="n">catalog</span><span class="o">.</span><span class="n">dropTempView</span><span class="p">(</span><span class="s2">"people"</span><span class="p">)</span>
</pre></div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Deprecated in 2.0; use createOrReplaceTempView instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.repartition">
<code class="descname">repartition</code><span class="sig-paren">(</span><em>numPartitions</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.repartition"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.repartition" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> partitioned by the given partitioning expressions. The
resulting DataFrame is hash partitioned.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>numPartitions</strong> – can be an int to specify the target number of partitions or a Column.
If it is a Column, it will be used as the first partitioning column. If not specified,
the default number of partitions is used.</p>
</dd>
</dl>
<div class="versionchanged">
<p><span class="versionmodified changed">Changed in version 1.6: </span>Added optional arguments to specify the partitioning columns. Also made numPartitions
optional if partitioning columns are specified.</p>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">union</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="s2">"age"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">|  5|  Bob|</span>
<span class="go">|  5|  Bob|</span>
<span class="go">|  2|Alice|</span>
<span class="go">|  2|Alice|</span>
<span class="go">+---+-----+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">|  2|Alice|</span>
<span class="go">|  5|  Bob|</span>
<span class="go">|  2|Alice|</span>
<span class="go">|  5|  Bob|</span>
<span class="go">+---+-----+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">7</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">repartition</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">|  5|  Bob|</span>
<span class="go">|  5|  Bob|</span>
<span class="go">|  2|Alice|</span>
<span class="go">|  2|Alice|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
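
"Hash partitioned" means that a row's partition is derived from a hash of its partitioning columns, so equal keys always end up together. A minimal plain-Python sketch of that idea follows; `hash_partition` is a hypothetical helper, and `zlib.crc32` is only a stand-in for whatever hash function Spark uses internally.

```python
import zlib


def hash_partition(rows, num_partitions, key):
    """Place each row in the partition chosen by hash(key) mod num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is an assumption made for determinism; it is not Spark's hash.
        digest = zlib.crc32(repr(row[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


# Same shape as df.union(df) in the example above.
rows = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}] * 2
parts = hash_partition(rows, 7, "age")
print([len(p) for p in parts])
```

With only two distinct ages, at most two of the seven partitions receive rows. Asking for more partitions does not spread the data any further when the key has few distinct values.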

<dl class="method">
<dt id="pyspark.sql.DataFrame.repartitionByRange">
<code class="descname">repartitionByRange</code><span class="sig-paren">(</span><em>numPartitions</em>, <em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.repartitionByRange"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.repartitionByRange" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> partitioned by the given partitioning expressions. The
resulting DataFrame is range partitioned.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>numPartitions</strong> – can be an int to specify the target number of partitions or a Column.
If it is a Column, it will be used as the first partitioning column. If not specified,
the default number of partitions is used.</p>
</dd>
</dl>
<p>At least one partition-by expression must be specified.
When no explicit sort order is specified, “ascending nulls first” is assumed.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">repartitionByRange</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">2</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">|  2|Alice|</span>
<span class="go">|  5|  Bob|</span>
<span class="go">+---+-----+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">repartitionByRange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">rdd</span><span class="o">.</span><span class="n">getNumPartitions</span><span class="p">()</span>
<span class="go">1</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">data</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">repartitionByRange</span><span class="p">(</span><span class="s2">"age"</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">|  2|Alice|</span>
<span class="go">|  5|  Bob|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.4.0.</span></p>
</div>
</dd></dl>
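
Range partitioning differs from hash partitioning in that each partition holds a contiguous interval of key values. The plain-Python sketch below models the idea only: `range_partition` is a hypothetical helper that derives its boundaries from the full sorted key list, which is an assumption made for simplicity and not necessarily how Spark chooses its boundaries.

```python
from bisect import bisect_right


def range_partition(rows, num_partitions, key):
    """Split rows into contiguous key ranges, in ascending key order."""
    partitions = [[] for _ in range(num_partitions)]
    if not rows:
        return partitions
    keys = sorted(row[key] for row in rows)
    # Boundaries taken at evenly spaced positions in the sorted keys.
    bounds = [keys[(i * len(keys)) // num_partitions] for i in range(1, num_partitions)]
    for row in rows:
        # Keys below the first boundary go to partition 0, and so on upwards.
        partitions[bisect_right(bounds, row[key])].append(row)
    return partitions


ages = [{"age": a} for a in (5, 1, 4, 2, 6, 3)]
parts = range_partition(ages, 3, "age")
print([sorted(row["age"] for row in p) for p in parts])
# [[1, 2], [3, 4], [5, 6]]
```

Because the partitions are ordered by key, every value in one partition is smaller than every value in the next.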

<dl class="method">
<dt id="pyspark.sql.DataFrame.replace">
<code class="descname">replace</code><span class="sig-paren">(</span><em>to_replace</em>, <em>value=&lt;no value&gt;</em>, <em>subset=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.replace"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.replace" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> replacing a value with another value.
<a class="reference internal" href="#pyspark.sql.DataFrame.replace" title="pyspark.sql.DataFrame.replace"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrame.replace()</span></code></a> and <a class="reference internal" href="#pyspark.sql.DataFrameNaFunctions.replace" title="pyspark.sql.DataFrameNaFunctions.replace"><code class="xref py py-func docutils literal notranslate"><span class="pre">DataFrameNaFunctions.replace()</span></code></a> are
aliases of each other.
Values <cite>to_replace</cite> and <cite>value</cite> must have the same type and can only be numerics, booleans,
or strings. <cite>value</cite> can be None. When replacing, the new value will be cast
to the type of the existing column.
For numeric replacements, all values to be replaced should have a unique
floating point representation. In case of conflicts (for example with <cite>{42: -1, 42.0: 1}</cite>),
an arbitrary replacement will be used.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>to_replace</strong> – bool, int, long, float, string, list or dict.
Value to be replaced.
If the value is a dict, then <cite>value</cite> is ignored or can be omitted, and <cite>to_replace</cite>
must be a mapping between a value and a replacement.</p></li>
<li><p><strong>value</strong> – bool, int, long, float, string, list or None.
The replacement value must be a bool, int, long, float, string or None. If <cite>value</cite> is a
list, <cite>value</cite> should be of the same length and type as <cite>to_replace</cite>.
If <cite>value</cite> is a scalar and <cite>to_replace</cite> is a sequence, then <cite>value</cite> is
used as a replacement for each item in <cite>to_replace</cite>.</p></li>
<li><p><strong>subset</strong> – optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if <cite>value</cite> is a string, and subset contains a non-string column,
then the non-string column is simply ignored.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+-----+</span>
<span class="go">| age|height| name|</span>
<span class="go">+----+------+-----+</span>
<span class="go">|  20|    80|Alice|</span>
<span class="go">|   5|  null|  Bob|</span>
<span class="go">|null|  null|  Tom|</span>
<span class="go">|null|  null| null|</span>
<span class="go">+----+------+-----+</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'Alice'</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+----+</span>
<span class="go">| age|height|name|</span>
<span class="go">+----+------+----+</span>
<span class="go">|  10|    80|null|</span>
<span class="go">|   5|  null| Bob|</span>
<span class="go">|null|  null| Tom|</span>
<span class="go">|null|  null|null|</span>
<span class="go">+----+------+----+</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">({</span><span class="s1">'Alice'</span><span class="p">:</span> <span class="kc">None</span><span class="p">})</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+----+</span>
<span class="go">| age|height|name|</span>
<span class="go">+----+------+----+</span>
<span class="go">|  10|    80|null|</span>
<span class="go">|   5|  null| Bob|</span>
<span class="go">|null|  null| Tom|</span>
<span class="go">|null|  null|null|</span>
<span class="go">+----+------+----+</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df4</span><span class="o">.</span><span class="n">na</span><span class="o">.</span><span class="n">replace</span><span class="p">([</span><span class="s1">'Alice'</span><span class="p">,</span> <span class="s1">'Bob'</span><span class="p">],</span> <span class="p">[</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">],</span> <span class="s1">'name'</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+------+----+</span>
<span class="go">| age|height|name|</span>
<span class="go">+----+------+----+</span>
<span class="go">|  10|    80|   A|</span>
<span class="go">|   5|  null|   B|</span>
<span class="go">|null|  null| Tom|</span>
<span class="go">|null|  null|null|</span>
<span class="go">+----+------+----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
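
The scalar, list, and dict forms of `to_replace` described above can be modelled without a Spark session. The following is a minimal plain-Python sketch of those rules, not PySpark's implementation: `replace_values` and `df4_rows` are hypothetical names that mirror the `df4` example, and rows are assumed to be plain dicts. Type casting and the numeric-conflict case are not modelled.

```python
def replace_values(rows, to_replace, value=None, subset=None):
    """Plain-Python model of the replace() rules (hypothetical helper, not the PySpark API)."""
    # Normalise the accepted argument shapes into one {old: new} mapping.
    if isinstance(to_replace, dict):
        mapping = dict(to_replace)  # `value` is ignored when a dict is given
    elif isinstance(to_replace, (list, tuple)):
        if isinstance(value, (list, tuple)):
            if len(value) != len(to_replace):
                raise ValueError("value must have the same length as to_replace")
            mapping = dict(zip(to_replace, value))
        else:
            mapping = {old: value for old in to_replace}  # scalar used for every item
    else:
        mapping = {to_replace: value}

    # A single column name is accepted as shorthand for a one-element subset.
    if isinstance(subset, str):
        subset = [subset]

    result = []
    for row in rows:
        new_row = {}
        for name, cell in row.items():
            in_scope = subset is None or name in subset
            # None cells are never matched; other cells are swapped when they appear as a key.
            if in_scope and cell is not None and cell in mapping:
                new_row[name] = mapping[cell]
            else:
                new_row[name] = cell
        result.append(new_row)
    return result


# Sample data mirroring the df4 example (assumed here, not read from Spark).
df4_rows = [
    {"age": 10, "height": 80, "name": "Alice"},
    {"age": 5, "height": None, "name": "Bob"},
    {"age": None, "height": None, "name": "Tom"},
    {"age": None, "height": None, "name": None},
]

print(replace_values(df4_rows, 10, 20)[0])
print([r["name"] for r in replace_values(df4_rows, ["Alice", "Bob"], ["A", "B"], "name")])
```

The helper builds new row dicts instead of mutating its input, which loosely matches the way DataFrame operations return a new DataFrame.
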

<dl class="method">
<dt id="pyspark.sql.DataFrame.rollup">
<code class="descname">rollup</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.rollup"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.rollup" title="Permalink to this definition">¶</a></dt>
<dd><p>Create a multi-dimensional rollup for the current <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> using
the specified columns, so we can run aggregation on them.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">rollup</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"age"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| name| age|count|</span>
<span class="go">+-----+----+-----+</span>
<span class="go">| null|null|    2|</span>
<span class="go">|Alice|null|    1|</span>
<span class="go">|Alice|   2|    1|</span>
<span class="go">|  Bob|null|    1|</span>
<span class="go">|  Bob|   5|    1|</span>
<span class="go">+-----+----+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>
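
To see where the `null` rows in the rollup output come from, it can help to compute the same counts by hand. This is a plain-Python sketch of the grouping logic only, under the assumption that rows are dicts; `rollup_count` and `people` are hypothetical names, and Spark's execution is not modelled.

```python
from collections import Counter


def rollup_count(rows, cols):
    """Plain-Python model of rollup(*cols).count() (hypothetical helper, not the PySpark API)."""
    counts = Counter()
    for row in rows:
        key = tuple(row[c] for c in cols)
        # Count the row once per hierarchy level: all columns, then each shorter
        # prefix, down to the grand total. Dropped columns are shown as None.
        for keep in range(len(cols), -1, -1):
            counts[key[:keep] + (None,) * (len(cols) - keep)] += 1
    return dict(counts)


# Sample data mirroring the two-row df used in the examples (assumed).
people = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]
print(rollup_count(people, ["name", "age"]))
```

Each input row contributes to one group per level, so the `(None, None)` entry is the grand total and each `(name, None)` entry is a per-name subtotal. Because subtotals are marked with None, this sketch cannot tell them apart from genuine None values in the data.
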

<dl class="method">
<dt id="pyspark.sql.DataFrame.sample">
<code class="descname">sample</code><span class="sig-paren">(</span><em>withReplacement=None</em>, <em>fraction=None</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.sample"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sample" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a sampled subset of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>withReplacement</strong> – Sample with replacement or not (default False).</p></li>
<li><p><strong>fraction</strong> – Fraction of rows to generate, range [0.0, 1.0].</p></li>
<li><p><strong>seed</strong> – Seed for sampling (default a random seed).</p></li>
</ul>
</dd>
</dl>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This is not guaranteed to provide exactly the fraction specified of the total
count of the given <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p><cite>fraction</cite> is required; <cite>withReplacement</cite> and <cite>seed</cite> are optional.</p>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">4</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">fraction</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">4</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">withReplacement</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">fraction</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">1</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="mf">1.0</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">fraction</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="kc">False</span><span class="p">,</span> <span class="n">fraction</span><span class="o">=</span><span class="mf">1.0</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="go">10</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.sampleBy">
<code class="descname">sampleBy</code><span class="sig-paren">(</span><em>col</em>, <em>fractions</em>, <em>seed=None</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.sampleBy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sampleBy" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a stratified sample without replacement based on the
fraction given on each stratum.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>col</strong> – column that defines strata</p></li>
<li><p><strong>fractions</strong> – sampling fraction for each stratum. If a stratum is not
specified, we treat its fraction as zero.</p></li>
<li><p><strong>seed</strong> – random seed</p></li>
</ul>
</dd>
<dt class="field-even">Returns</dt>
<dd class="field-even"><p>a new DataFrame that represents the stratified sample</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="n">col</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">dataset</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">((</span><span class="n">col</span><span class="p">(</span><span class="s2">"id"</span><span class="p">)</span> <span class="o">%</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"key"</span><span class="p">))</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sampled</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">sampleBy</span><span class="p">(</span><span class="s2">"key"</span><span class="p">,</span> <span class="n">fractions</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mi">1</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">},</span> <span class="n">seed</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">sampled</span><span class="o">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s2">"key"</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="s2">"key"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|key|count|</span>
<span class="go">+---+-----+</span>
<span class="go">|  0|    5|</span>
<span class="go">|  1|    9|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.5.</span></p>
</div>
</dd></dl>
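
The per-stratum rule above (each stratum has its own fraction, and unspecified strata are treated as zero) can be sketched in plain Python. This is an illustrative model, not PySpark's code: `sample_by` is a hypothetical helper, rows are assumed to be dicts, and Python's `random` module stands in for Spark's random number generator, so the counts will not match the Spark output shown above.

```python
import random
from collections import Counter


def sample_by(rows, col, fractions, seed=None):
    """Plain-Python model of stratified sampling without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    # Keep each row independently with the probability given for its stratum;
    # strata that are missing from `fractions` get probability 0.
    return [row for row in rows if rng.random() < fractions.get(row[col], 0.0)]


# Same shape as the example above: keys 0, 1, 2 spread over 100 rows.
dataset = [{"key": i % 3} for i in range(100)]
sampled = sample_by(dataset, "key", fractions={0: 0.1, 1: 0.2}, seed=0)
print(Counter(row["key"] for row in sampled))
```

Because every row is an independent draw, the sample size per stratum only approximates `fraction * stratum_size`, and key `2` never appears because it has no entry in `fractions`.
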

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.schema">
<code class="descname">schema</code><a class="headerlink" href="#pyspark.sql.DataFrame.schema" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the schema of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> as a <a class="reference internal" href="#pyspark.sql.types.StructType" title="pyspark.sql.types.StructType"><code class="xref py py-class docutils literal notranslate"><span class="pre">pyspark.sql.types.StructType</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">schema</span>
<span class="go">StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.select">
<code class="descname">select</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.select"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.select" title="Permalink to this definition">¶</a></dt>
<dd><p>Projects a set of expressions and returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>cols</strong> – list of column names (string) or expressions (<a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a>).
If one of the column names is ‘*’, that column is expanded to include all columns
in the current DataFrame.</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'*'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice'), Row(age=5, name='Bob')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'name'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Alice', age=2), Row(name='Bob', age=5)]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">name</span><span class="p">,</span> <span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s1">'age'</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(name='Alice', age=12), Row(name='Bob', age=15)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.selectExpr">
<code class="descname">selectExpr</code><span class="sig-paren">(</span><em>*expr</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.selectExpr"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.selectExpr" title="Permalink to this definition">¶</a></dt>
<dd><p>Projects a set of SQL expressions and returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.</p>
<p>This is a variant of <a class="reference internal" href="#pyspark.sql.DataFrame.select" title="pyspark.sql.DataFrame.select"><code class="xref py py-func docutils literal notranslate"><span class="pre">select()</span></code></a> that accepts SQL expressions.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">selectExpr</span><span class="p">(</span><span class="s2">"age * 2"</span><span class="p">,</span> <span class="s2">"abs(age)"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.show">
<code class="descname">show</code><span class="sig-paren">(</span><em>n=20</em>, <em>truncate=True</em>, <em>vertical=False</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.show"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.show" title="Permalink to this definition">¶</a></dt>
<dd><p>Prints the first <code class="docutils literal notranslate"><span class="pre">n</span></code> rows to the console.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>n</strong> – Number of rows to show.</p></li>
<li><p><strong>truncate</strong> – If set to True (the default), strings longer than 20 characters are truncated.
If set to a number greater than one, long strings are truncated to length <code class="docutils literal notranslate"><span class="pre">truncate</span></code>
and cells are right-aligned.</p></li>
<li><p><strong>vertical</strong> – If set to True, print output rows vertically (one line
per column value).</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span>
<span class="go">DataFrame[age: int, name: string]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">|  2|Alice|</span>
<span class="go">|  5|  Bob|</span>
<span class="go">+---+-----+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">truncate</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="go">+---+----+</span>
<span class="go">|age|name|</span>
<span class="go">+---+----+</span>
<span class="go">|  2| Ali|</span>
<span class="go">|  5| Bob|</span>
<span class="go">+---+----+</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">vertical</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="go">-RECORD 0-----</span>
<span class="go"> age  | 2</span>
<span class="go"> name | Alice</span>
<span class="go">-RECORD 1-----</span>
<span class="go"> age  | 5</span>
<span class="go"> name | Bob</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.sort">
<code class="descname">sort</code><span class="sig-paren">(</span><em>*cols</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.sort"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sort" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> sorted by the specified column(s).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>cols</strong> – list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a> or column names to sort by.</p></li>
<li><p><strong>ascending</strong> – boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, length of the list must equal length of the <cite>cols</cite>.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="s2">"age"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">age</span><span class="o">.</span><span class="n">desc</span><span class="p">())</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="k">import</span> <span class="o">*</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">asc</span><span class="p">(</span><span class="s2">"age"</span><span class="p">))</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice'), Row(age=5, name='Bob')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">(</span><span class="n">desc</span><span class="p">(</span><span class="s2">"age"</span><span class="p">),</span> <span class="s2">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">orderBy</span><span class="p">([</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"name"</span><span class="p">],</span> <span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=5, name='Bob'), Row(age=2, name='Alice')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
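
The `ascending` list pairs one flag with each sort column. The following plain-Python sketch shows one way to obtain such a mixed-direction ordering, using repeated stable sorts; it is an illustration of the semantics, not how Spark sorts. `sort_rows` and `rows` are hypothetical names, rows are assumed to be dicts, and None values are not handled.

```python
def sort_rows(rows, cols, ascending=True):
    """Plain-Python model of multi-column sorting (hypothetical helper, not the PySpark API)."""
    # A single flag applies to every column.
    if isinstance(ascending, (bool, int)):
        ascending = [ascending] * len(cols)
    if len(ascending) != len(cols):
        raise ValueError("ascending must have one entry per sort column")

    result = list(rows)
    # Python's sort is stable, so sorting by the least significant column first
    # and the most significant column last yields the combined ordering.
    for col, asc in reversed(list(zip(cols, ascending))):
        result.sort(key=lambda row: row[col], reverse=not asc)
    return result


# Sample rows (assumed): two people share an age so the second key matters.
rows = [
    {"age": 2, "name": "Alice"},
    {"age": 5, "name": "Bob"},
    {"age": 5, "name": "Ann"},
]
print(sort_rows(rows, ["age", "name"], ascending=[0, 1]))
```

With `ascending=[0, 1]`, rows are ordered by `age` descending and ties are broken by `name` ascending, so Ann comes before Bob and Alice comes last.
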

<dl class="method">
<dt id="pyspark.sql.DataFrame.sortWithinPartitions">
<code class="descname">sortWithinPartitions</code><span class="sig-paren">(</span><em>*cols</em>, <em>**kwargs</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.sortWithinPartitions"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.sortWithinPartitions" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> with each partition sorted by the specified column(s).</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>cols</strong> – list of <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a> or column names to sort by.</p></li>
<li><p><strong>ascending</strong> – boolean or list of boolean (default True).
Sort ascending vs. descending. Specify list for multiple sort orders.
If a list is specified, length of the list must equal length of the <cite>cols</cite>.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">sortWithinPartitions</span><span class="p">(</span><span class="s2">"age"</span><span class="p">,</span> <span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+---+-----+</span>
<span class="go">|age| name|</span>
<span class="go">+---+-----+</span>
<span class="go">|  2|Alice|</span>
<span class="go">|  5|  Bob|</span>
<span class="go">+---+-----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.6.</span></p>
</div>
</dd></dl>

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.stat">
<code class="descname">stat</code><a class="headerlink" href="#pyspark.sql.DataFrame.stat" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a <a class="reference internal" href="#pyspark.sql.DataFrameStatFunctions" title="pyspark.sql.DataFrameStatFunctions"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrameStatFunctions</span></code></a> for statistic functions.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.storageLevel">
<code class="descname">storageLevel</code><a class="headerlink" href="#pyspark.sql.DataFrame.storageLevel" title="Permalink to this definition">¶</a></dt>
<dd><p>Get the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>’s current storage level.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">storageLevel</span>
<span class="go">StorageLevel(False, False, False, False, 1)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">cache</span><span class="p">()</span><span class="o">.</span><span class="n">storageLevel</span>
<span class="go">StorageLevel(True, True, False, True, 1)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span><span class="o">.</span><span class="n">persist</span><span class="p">(</span><span class="n">StorageLevel</span><span class="o">.</span><span class="n">DISK_ONLY_2</span><span class="p">)</span><span class="o">.</span><span class="n">storageLevel</span>
<span class="go">StorageLevel(True, False, False, False, 2)</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.1.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.subtract">
<code class="descname">subtract</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.subtract"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.subtract" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> containing rows in this frame
but not in another frame.</p>
<p>This is equivalent to <cite>EXCEPT DISTINCT</cite> in SQL.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
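
Since no example accompanies this method, here is a plain-Python sketch of the `EXCEPT DISTINCT` behaviour it describes: rows of the left input that do not occur in the right input, with duplicates removed. `subtract_rows` is a hypothetical helper and rows are assumed to be hashable tuples; the sketch keeps first-seen order, whereas Spark makes no ordering guarantee.

```python
def subtract_rows(left, right):
    """Plain-Python model of EXCEPT DISTINCT (hypothetical helper, not the PySpark API)."""
    exclude = set(right)
    seen = set()
    result = []
    for row in left:
        # Drop rows that occur in `right`, and keep only the first copy of each remaining row.
        if row not in exclude and row not in seen:
            seen.add(row)
            result.append(row)
    return result


left = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]
right = [("b", 2)]
print(subtract_rows(left, right))
```

The duplicate `("a", 1)` appears only once in the result, which is the "distinct" part of the SQL equivalent.
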

<dl class="method">
<dt id="pyspark.sql.DataFrame.summary">
<code class="descname">summary</code><span class="sig-paren">(</span><em>*statistics</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.summary"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.summary" title="Permalink to this definition">¶</a></dt>
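<dd><p>Computes specified statistics for numeric and string columns. Available statistics are:</p>
<ul class="simple">
<li><p>count</p></li>
<li><p>mean</p></li>
<li><p>stddev</p></li>
<li><p>min</p></li>
<li><p>max</p></li>
<li><p>arbitrary approximate percentiles specified as a percentage (e.g., 75%)</p></li>
</ul>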
<p>If no statistics are given, this function computes count, mean, stddev, min,
approximate quartiles (percentiles at 25%, 50%, and 75%), and max.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This function is meant for exploratory data analysis, as we make no
guarantee about the backward compatibility of the schema of the resulting DataFrame.</p>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+------------------+-----+</span>
<span class="go">|summary|               age| name|</span>
<span class="go">+-------+------------------+-----+</span>
<span class="go">|  count|                 2|    2|</span>
<span class="go">|   mean|               3.5| null|</span>
<span class="go">| stddev|2.1213203435596424| null|</span>
<span class="go">|    min|                 2|Alice|</span>
<span class="go">|    25%|                 2| null|</span>
<span class="go">|    50%|                 2| null|</span>
<span class="go">|    75%|                 5| null|</span>
<span class="go">|    max|                 5|  Bob|</span>
<span class="go">+-------+------------------+-----+</span>
</pre></div>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"count"</span><span class="p">,</span> <span class="s2">"min"</span><span class="p">,</span> <span class="s2">"25%"</span><span class="p">,</span> <span class="s2">"75%"</span><span class="p">,</span> <span class="s2">"max"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+---+-----+</span>
<span class="go">|summary|age| name|</span>
<span class="go">+-------+---+-----+</span>
<span class="go">|  count|  2|    2|</span>
<span class="go">|    min|  2|Alice|</span>
<span class="go">|    25%|  2| null|</span>
<span class="go">|    75%|  5| null|</span>
<span class="go">|    max|  5|  Bob|</span>
<span class="go">+-------+---+-----+</span>
</pre></div>
</div>
<p>To do a summary for specific columns first select them:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"name"</span><span class="p">)</span><span class="o">.</span><span class="n">summary</span><span class="p">(</span><span class="s2">"count"</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+-------+---+----+</span>
<span class="go">|summary|age|name|</span>
<span class="go">+-------+---+----+</span>
<span class="go">|  count|  2|   2|</span>
<span class="go">+-------+---+----+</span>
</pre></div>
</div>
<p>See also <code class="docutils literal notranslate"><span class="pre">describe()</span></code> for basic statistics.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.3.0.</span></p>
</div>
</dd></dl>
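
To make the non-percentile statistics concrete, the following plain-Python sketch recomputes them for the `age` column used in the examples above. It is a model, not Spark code: `summarize` is a hypothetical helper, and it assumes `stddev` is the sample standard deviation, which is consistent with the value 2.12... shown for the ages 2 and 5. Percentiles are left out because Spark computes them approximately.

```python
import statistics


def summarize(values):
    """Compute count, mean, sample stddev, min and max for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


ages = [2, 5]
age_summary = summarize(ages)
for name, value in age_summary.items():
    print(name, value)
```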

<dl class="method">
<dt id="pyspark.sql.DataFrame.take">
<code class="descname">take</code><span class="sig-paren">(</span><em>num</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.take"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.take" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the first <code class="docutils literal notranslate"><span class="pre">num</span></code> rows as a <code class="xref py py-class docutils literal notranslate"><span class="pre">list</span></code> of <a class="reference internal" href="#pyspark.sql.Row" title="pyspark.sql.Row"><code class="xref py py-class docutils literal notranslate"><span class="pre">Row</span></code></a>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="go">[Row(age=2, name='Alice'), Row(age=5, name='Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.toDF">
<code class="descname">toDF</code><span class="sig-paren">(</span><em>*cols</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.toDF"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toDF" title="Permalink to this definition">¶</a></dt>
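<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> with the specified new column names.</p>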
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><p><strong>cols</strong> – the new column names, each given as a string</p>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">toDF</span><span class="p">(</span><span class="s1">'f1'</span><span class="p">,</span> <span class="s1">'f2'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(f1=2, f2='Alice'), Row(f1=5, f2='Bob')]</span>
</pre></div>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.toJSON">
<code class="descname">toJSON</code><span class="sig-paren">(</span><em>use_unicode=True</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.toJSON"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toJSON" title="Permalink to this definition">¶</a></dt>
<dd><p>Converts a <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> into an <code class="xref py py-class docutils literal notranslate"><span class="pre">RDD</span></code> of strings.</p>
<p>Each row is turned into a JSON document as one element in the returned RDD.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">toJSON</span><span class="p">()</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="go">'{"age":2,"name":"Alice"}'</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
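
The "one JSON document per row" behavior of `toJSON` can be mimicked with the standard library. This is a plain-Python sketch rather than Spark code: rows are modeled as dictionaries, and compact separators are assumed only so that the strings match the format shown in the example above.

```python
import json

# Each dictionary stands in for one Row of the DataFrame.
rows = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]

# One JSON string per row, mirroring one element per row in the returned RDD.
json_documents = [json.dumps(row, separators=(",", ":")) for row in rows]
print(json_documents[0])
```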

<dl class="method">
<dt id="pyspark.sql.DataFrame.toLocalIterator">
<code class="descname">toLocalIterator</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.toLocalIterator"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toLocalIterator" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns an iterator that contains all of the rows in this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>.
The iterator will consume as much memory as the largest partition in this DataFrame.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">toLocalIterator</span><span class="p">())</span>
<span class="go">[Row(age=2, name='Alice'), Row(age=5, name='Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.0.</span></p>
</div>
</dd></dl>
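
The memory note for `toLocalIterator` follows from rows being handed over one partition at a time. The plain-Python sketch below models that idea only; it is not how Spark is implemented. The names `make_partition`, `local_iterator` and `loaded` are hypothetical, and each "partition" is simply a function that returns a list of rows when called.

```python
loaded = []  # records which partitions have been materialized so far


def make_partition(index, rows):
    """Return a loader that materializes one partition on demand."""
    def load():
        loaded.append(index)
        return rows
    return load


def local_iterator(partition_loaders):
    """Yield rows partition by partition, loading each one only when needed."""
    for load in partition_loaders:
        yield from load()


partitions = [
    make_partition(0, [(2, "Alice")]),
    make_partition(1, [(5, "Bob"), (7, "Carol")]),
]

iterator = local_iterator(partitions)
first_row = next(iterator)
loaded_after_first = list(loaded)  # only partition 0 has been loaded here
remaining_rows = list(iterator)
print(first_row, loaded_after_first, remaining_rows)
```

Because only one partition is held at a time, the peak memory in this model is bounded by the largest partition, which matches the statement above.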

<dl class="method">
<dt id="pyspark.sql.DataFrame.toPandas">
<code class="descname">toPandas</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.toPandas"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.toPandas" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns the contents of this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> as a pandas <code class="docutils literal notranslate"><span class="pre">pandas.DataFrame</span></code>.</p>
<p>This is only available if pandas is installed.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This method should only be used if the resulting pandas DataFrame is expected
to be small, as all the data is loaded into the driver’s memory.</p>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Usage with spark.sql.execution.arrow.enabled=True is experimental.</p>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">toPandas</span><span class="p">()</span>  <span class="c1"># doctest: +SKIP</span>
<span class="go">   age   name</span>
<span class="go">0    2  Alice</span>
<span class="go">1    5    Bob</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.union">
<code class="descname">union</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.union"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.union" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> containing union of rows in this and another frame.</p>
<p>This is equivalent to <cite>UNION ALL</cite> in SQL. To do a SQL-style set union
(that does deduplication of elements), use this function followed by <a class="reference internal" href="#pyspark.sql.DataFrame.distinct" title="pyspark.sql.DataFrame.distinct"><code class="xref py py-func docutils literal notranslate"><span class="pre">distinct()</span></code></a>.</p>
<p>As is standard in SQL, this function resolves columns by position (not by name).</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.0.</span></p>
</div>
</dd></dl>
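
The difference between `UNION ALL` and a deduplicating set union can be shown with a plain-Python model. This is a sketch, not Spark code: rows are tuples, and `union_all` and `distinct` are hypothetical helpers standing in for `union()` and `distinct()`.

```python
def union_all(left, right):
    """Concatenate rows and keep duplicates, like UNION ALL."""
    return list(left) + list(right)


def distinct(rows):
    """Drop duplicate rows while keeping first-seen order."""
    return list(dict.fromkeys(rows))


first = [(2, "Alice"), (5, "Bob")]
second = [(5, "Bob"), (7, "Carol")]

combined = union_all(first, second)   # (5, "Bob") appears twice
deduplicated = distinct(combined)     # set-style union
print(len(combined), len(deduplicated))
```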

<dl class="method">
<dt id="pyspark.sql.DataFrame.unionAll">
<code class="descname">unionAll</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.unionAll"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.unionAll" title="Permalink to this definition">¶</a></dt>
<dd><p>Return a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> containing union of rows in this and another frame.</p>
<p>This is equivalent to <cite>UNION ALL</cite> in SQL. To do a SQL-style set union
(that does deduplication of elements), use this function followed by <a class="reference internal" href="#pyspark.sql.DataFrame.distinct" title="pyspark.sql.DataFrame.distinct"><code class="xref py py-func docutils literal notranslate"><span class="pre">distinct()</span></code></a>.</p>
<p>As is standard in SQL, this function resolves columns by position (not by name).</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Deprecated in 2.0, use <a class="reference internal" href="#pyspark.sql.DataFrame.union" title="pyspark.sql.DataFrame.union"><code class="xref py py-func docutils literal notranslate"><span class="pre">union()</span></code></a> instead.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.unionByName">
<code class="descname">unionByName</code><span class="sig-paren">(</span><em>other</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.unionByName"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.unionByName" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> containing union of rows in this and another frame.</p>
<p>This is different from both <cite>UNION ALL</cite> and <cite>UNION DISTINCT</cite> in SQL. To do a SQL-style set
union (that does deduplication of elements), use this function followed by <a class="reference internal" href="#pyspark.sql.DataFrame.distinct" title="pyspark.sql.DataFrame.distinct"><code class="xref py py-func docutils literal notranslate"><span class="pre">distinct()</span></code></a>.</p>
<p>The difference between this function and <a class="reference internal" href="#pyspark.sql.DataFrame.union" title="pyspark.sql.DataFrame.union"><code class="xref py py-func docutils literal notranslate"><span class="pre">union()</span></code></a> is that this function
resolves columns by name (not by position):</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df1</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]],</span> <span class="p">[</span><span class="s2">"col0"</span><span class="p">,</span> <span class="s2">"col1"</span><span class="p">,</span> <span class="s2">"col2"</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df2</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">([[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]],</span> <span class="p">[</span><span class="s2">"col1"</span><span class="p">,</span> <span class="s2">"col2"</span><span class="p">,</span> <span class="s2">"col0"</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">df1</span><span class="o">.</span><span class="n">unionByName</span><span class="p">(</span><span class="n">df2</span><span class="p">)</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="go">+----+----+----+</span>
<span class="go">|col0|col1|col2|</span>
<span class="go">+----+----+----+</span>
<span class="go">|   1|   2|   3|</span>
<span class="go">|   6|   4|   5|</span>
<span class="go">+----+----+----+</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.3.</span></p>
</div>
</dd></dl>
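
Resolving columns by position versus by name can be contrasted in a plain-Python model that reuses the data from the example above. This is a sketch under stated assumptions, not Spark code: a frame is modeled as a list of column names plus a list of row tuples, and `union_by_position` and `union_by_name` are hypothetical helpers.

```python
def union_by_position(left_cols, left_rows, right_cols, right_rows):
    """Append rows as-is; only the number of columns has to match."""
    if len(left_cols) != len(right_cols):
        raise ValueError("frames must have the same number of columns")
    return left_cols, left_rows + right_rows


def union_by_name(left_cols, left_rows, right_cols, right_rows):
    """Reorder the right-hand rows to the left-hand column order, then append."""
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError("frames must have the same column names")
    order = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in order) for row in right_rows]
    return left_cols, left_rows + reordered


cols1, rows1 = ["col0", "col1", "col2"], [(1, 2, 3)]
cols2, rows2 = ["col1", "col2", "col0"], [(4, 5, 6)]

_, by_position = union_by_position(cols1, rows1, cols2, rows2)
_, by_name = union_by_name(cols1, rows1, cols2, rows2)
print(by_position)
print(by_name)
```

In the positional result the value 4 lands under `col0` even though it came from `col1`, which is the kind of silent mismatch that resolving by name avoids.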

<dl class="method">
<dt id="pyspark.sql.DataFrame.unpersist">
<code class="descname">unpersist</code><span class="sig-paren">(</span><em>blocking=False</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.unpersist"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.unpersist" title="Permalink to this definition">¶</a></dt>
<dd><p>Marks the <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> as non-persistent and removes all blocks for it from
memory and disk.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The default value of <cite>blocking</cite> changed to False in 2.0 to match Scala.</p>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.where">
<code class="descname">where</code><span class="sig-paren">(</span><em>condition</em><span class="sig-paren">)</span><a class="headerlink" href="#pyspark.sql.DataFrame.where" title="Permalink to this definition">¶</a></dt>
<dd><p><a class="reference internal" href="#pyspark.sql.DataFrame.where" title="pyspark.sql.DataFrame.where"><code class="xref py py-func docutils literal notranslate"><span class="pre">where()</span></code></a> is an alias for <a class="reference internal" href="#pyspark.sql.DataFrame.filter" title="pyspark.sql.DataFrame.filter"><code class="xref py py-func docutils literal notranslate"><span class="pre">filter()</span></code></a>.</p>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>

<dl class="method">
<dt id="pyspark.sql.DataFrame.withColumn">
<code class="descname">withColumn</code><span class="sig-paren">(</span><em>colName</em>, <em>col</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.withColumn"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.withColumn" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> by adding a column or replacing the
existing column that has the same name.</p>
<p>The column expression must be an expression over this DataFrame; attempting to add
a column from some other dataframe will raise an error.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>colName</strong> – string, name of the new column.</p></li>
<li><p><strong>col</strong> – a <a class="reference internal" href="#pyspark.sql.Column" title="pyspark.sql.Column"><code class="xref py py-class docutils literal notranslate"><span class="pre">Column</span></code></a> expression for the new column.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s1">'age2'</span><span class="p">,</span> <span class="n">df</span><span class="o">.</span><span class="n">age</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
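
The "add or replace" behavior of `withColumn` can be modeled in plain Python. This is a sketch, not Spark code: rows are dictionaries, the column expression is modeled as an ordinary function of the row, and `with_column` is a hypothetical helper.

```python
def with_column(rows, name, expression):
    """Return new rows with `name` added, or replaced if it already exists."""
    return [{**row, name: expression(row)} for row in rows]


people = [{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}]

added = with_column(people, "age2", lambda row: row["age"] + 2)
replaced = with_column(people, "age", lambda row: row["age"] * 10)
print(added[0])
print(replaced[0])
```

As with the real method, the input rows are left untouched and a new collection is returned.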

<dl class="method">
<dt id="pyspark.sql.DataFrame.withColumnRenamed">
<code class="descname">withColumnRenamed</code><span class="sig-paren">(</span><em>existing</em>, <em>new</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.withColumnRenamed"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.withColumnRenamed" title="Permalink to this definition">¶</a></dt>
<dd><p>Returns a new <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> by renaming an existing column.
This is a no-op if schema doesn’t contain the given column name.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>existing</strong> – string, name of the existing column to rename.</p></li>
<li><p><strong>new</strong> – string, new name of the column.</p></li>
</ul>
</dd>
</dl>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">df</span><span class="o">.</span><span class="n">withColumnRenamed</span><span class="p">(</span><span class="s1">'age'</span><span class="p">,</span> <span class="s1">'age2'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
<span class="go">[Row(age2=2, name='Alice'), Row(age2=5, name='Bob')]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.3.</span></p>
</div>
</dd></dl>
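
The no-op case of `withColumnRenamed` is easy to overlook, because a misspelled column name does not raise an error. The plain-Python sketch below models this on a list of column names; `with_column_renamed` is a hypothetical helper, not Spark code.

```python
def with_column_renamed(columns, existing, new):
    """Rename `existing` to `new`; return the names unchanged if it is absent."""
    return [new if name == existing else name for name in columns]


columns = ["age", "name"]
renamed = with_column_renamed(columns, "age", "age2")
unchanged = with_column_renamed(columns, "height", "size")  # no such column
print(renamed, unchanged)
```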

<dl class="method">
<dt id="pyspark.sql.DataFrame.withWatermark">
<code class="descname">withWatermark</code><span class="sig-paren">(</span><em>eventTime</em>, <em>delayThreshold</em><span class="sig-paren">)</span><a class="reference internal" href="https://spark.apache.org/docs/2.4.3/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.withWatermark"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#pyspark.sql.DataFrame.withWatermark" title="Permalink to this definition">¶</a></dt>
<dd><p>Defines an event time watermark for this <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>. A watermark tracks a point
in time before which we assume no more late data is going to arrive.</p>
<dl class="simple">
<dt>Spark will use this watermark for several purposes:</dt><dd><ul class="simple">
<li><p>To know when a given time window aggregation can be finalized and thus can be emitted
when using output modes that do not allow updates.</p></li>
<li><p>To minimize the amount of state that we need to keep for on-going aggregations.</p></li>
</ul>
</dd>
</dl>
<p>The current watermark is computed by looking at the <cite>MAX(eventTime)</cite> seen across
all of the partitions in the query minus a user specified <cite>delayThreshold</cite>.  Due to the cost
of coordinating this value across partitions, the actual watermark used is only guaranteed
to be at least <cite>delayThreshold</cite> behind the actual event time.  In some cases we may still
process records that arrive more than <cite>delayThreshold</cite> late.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters</dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>eventTime</strong> – the name of the column that contains the event time of the row.</p></li>
<li><p><strong>delayThreshold</strong> – the minimum delay to wait for late data to arrive,
expressed as an interval (e.g. “1 minute” or “5 hours”) relative to the
latest record that has been processed.</p></li>
</ul>
</dd>
</dl>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Evolving.</p>
</div>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">sdf</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="s1">'name'</span><span class="p">,</span> <span class="n">sdf</span><span class="o">.</span><span class="n">time</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="s1">'timestamp'</span><span class="p">))</span><span class="o">.</span><span class="n">withWatermark</span><span class="p">(</span><span class="s1">'time'</span><span class="p">,</span> <span class="s1">'10 minutes'</span><span class="p">)</span>
<span class="go">DataFrame[name: string, time: timestamp]</span>
</pre></div>
</div>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.1.</span></p>
</div>
</dd></dl>
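
The watermark arithmetic described above (`MAX(eventTime)` minus `delayThreshold`) can be worked through in plain Python. This is a simplified sketch, not Spark code: `current_watermark` and `is_late` are hypothetical helpers and the timestamps are made up. A real streaming query coordinates the watermark across partitions and, as noted above, may still process some records that arrive more than `delayThreshold` late.

```python
from datetime import datetime, timedelta


def current_watermark(event_times, delay_threshold):
    """Latest event time seen so far minus the allowed delay."""
    return max(event_times) - delay_threshold


def is_late(event_time, watermark):
    """A record older than the watermark is considered too late in this model."""
    return event_time < watermark


seen = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 25),
    datetime(2024, 1, 1, 12, 10),
]
watermark = current_watermark(seen, timedelta(minutes=10))

old_record_is_late = is_late(datetime(2024, 1, 1, 12, 5), watermark)
recent_record_is_late = is_late(datetime(2024, 1, 1, 12, 20), watermark)
print(watermark, old_record_is_late, recent_record_is_late)
```

With a latest event time of 12:25 and a 10 minute threshold, the watermark sits at 12:15, so a record stamped 12:05 falls behind it while one stamped 12:20 does not.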

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.write">
<code class="descname">write</code><a class="headerlink" href="#pyspark.sql.DataFrame.write" title="Permalink to this definition">¶</a></dt>
<dd><p>Interface for saving the content of the non-streaming <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> out into external
storage.</p>
<dl class="field-list simple">
<dt class="field-odd">Returns</dt>
<dd class="field-odd"><p><a class="reference internal" href="#pyspark.sql.DataFrameWriter" title="pyspark.sql.DataFrameWriter"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrameWriter</span></code></a></p>
</dd>
</dl>
<div class="versionadded">
<p><span class="versionmodified added">New in version 1.4.</span></p>
</div>
</dd></dl>

<dl class="attribute">
<dt id="pyspark.sql.DataFrame.writeStream">
<code class="descname">writeStream</code><a class="headerlink" href="#pyspark.sql.DataFrame.writeStream" title="Permalink to this definition">¶</a></dt>
<dd><p>Interface for saving the content of the streaming <a class="reference internal" href="#pyspark.sql.DataFrame" title="pyspark.sql.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> out into external
storage.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Evolving.</p>
</div>
<dl class="field-list simple">
<dt class="field-odd">Returns</dt>
<dd class="field-odd"><p><code class="xref py py-class docutils literal notranslate"><span class="pre">DataStreamWriter</span></code></p>
</dd>
</dl>
<div class="versionadded">
<p><span class="versionmodified added">New in version 2.0.</span></p>
</div>
</dd></dl>

</dd></dl>