# Overview of Sorting Data

We can use **orderBy** or **sort** to sort the data in a DataFrame.

We can perform composite sorting by passing multiple columns or expressions. By default, data is sorted in ascending order; we can switch to descending order by calling **desc()** on the column.

If the sort column contains *NULL* values, they come first in an ascending sort by default. We can move them to the very end with functions such as **asc_nulls_last()**.

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession. \
    builder. \
    enableHiveSupport(). \
    appName(f'evivancovid | Python - Basic Transformations'). \
    master('yarn'). \
    getOrCreate()

In [5]:
employees = [(1, "Scott", "Tiger", 1000.0, 10,
                      "united states", "+1 123 456 7890", "123 45 6789"
                     ),
                     (2, "Henry", "Ford", 1250.0, None,
                      "India", "+91 234 567 8901", "456 78 9123"
                     ),
                     (3, "Nick", "Junior", 750.0, '',
                      "united KINGDOM", "+44 111 111 1111", "222 33 4444"
                     ),
                     (4, "Bill", "Gomes", 1500.0, 2,
                      "AUSTRALIA", "+61 987 654 3210", "789 12 6118"
                     )
                ]

employeesDF = spark. \
    createDataFrame(employees,
                    schema="""employee_id INT, first_name STRING, 
                    last_name STRING, salary FLOAT, bonus STRING, nationality STRING,
                    phone_number STRING, ssn STRING"""
                   )

employeesDF.show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|    2|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+



In [12]:
from pyspark.sql.functions import col, upper, when, expr

In [7]:
when?

Signature: when(condition, value)
Docstring:
Evaluates a list of conditions and returns one of multiple possible result expressions.
If :func:`pyspark.sql.Column.otherwise` is not invoked, None is returned for unmatched
conditions.

.. versionadded:: 1.4.0

Parameters
----------
condition : :class:`~pyspark.sql.Column`
    a boolean :class:`~pyspark.sql.Column` expression.
value :
    a literal value, or a :class:`~pyspark.sql.Column` expression.

>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]

>>> df.select(when(df.age == 2, df.age + 1).alias("age")).collect()
[Row(age=3), Row(age=None)]
File:      /opt/spark3/python/pyspark/sql/functions.py
Type:      function


### Get employees data in ascending order by nationality. Rows for United States should always come first.

In [15]:
# DataFrame Style

employeesDF. \
    withColumn('sort_column', when(upper(col('nationality')) == 'UNITED STATES', 0).otherwise(1)). \
    orderBy('sort_column', 'nationality'). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|sort_column|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|          0|
|          4|      Bill|    Gomes|1500.0|    2|     AUSTRALIA|+61 987 654 3210|789 12 6118|          1|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|          1|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|          1|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+-----------+



In [16]:
# SQL Style

employeesDF. \
    withColumn('sort_column', expr("""CASE WHEN upper(nationality) = 'UNITED STATES' THEN 0 ELSE 1 END""")). \
    orderBy('sort_column', 'nationality'). \
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|sort_column|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+-----------+
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|          0|
|          4|      Bill|    Gomes|1500.0|    2|     AUSTRALIA|+61 987 654 3210|789 12 6118|          1|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|          1|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|          1|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+-----------+



### Sort the data in employeesDF by bonus. Data should be sorted numerically, and null and empty values should come at the end.

In [21]:
c = col('X')

In [23]:
help(c.asc_nulls_last)

Help on method _ in module pyspark.sql.column:

_() method of pyspark.sql.column.Column instance
    Returns a sort expression based on ascending order of the column, and null values
    appear after non-null values.
    
    .. versionadded:: 2.4.0
    
    Examples
    --------
    >>> from pyspark.sql import Row
    >>> df = spark.createDataFrame([('Tom', 80), (None, 60), ('Alice', None)], ["name", "height"])
    >>> df.select(df.name).orderBy(df.name.asc_nulls_last()).collect()
    [Row(name='Alice'), Row(name='Tom'), Row(name=None)]



In [26]:
employeesDF.orderBy(
    employeesDF.bonus.cast('int').asc_nulls_last()).\
    show()

+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|bonus|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+
|          4|      Bill|    Gomes|1500.0|    2|     AUSTRALIA|+61 987 654 3210|789 12 6118|
|          1|     Scott|    Tiger|1000.0|   10| united states| +1 123 456 7890|123 45 6789|
|          3|      Nick|   Junior| 750.0|     |united KINGDOM|+44 111 111 1111|222 33 4444|
|          2|     Henry|     Ford|1250.0| null|         India|+91 234 567 8901|456 78 9123|
+-----------+----------+---------+------+-----+--------------+----------------+-----------+

