## Sorting Data in a Spark DataFrame

* Sort data in ascending order by a specific column.
* Sort data in descending order by a specific column.
* Deal with nulls while sorting the data (placing the null values at the beginning or at the end).
* Sort a DataFrame using multiple columns (composite sorting). This includes sorting in ascending order by the first column and then in descending order by the second column, as well as vice versa.
* Perform prioritized sorting. For example, let's say we want USA at the top and the rest of the countries in ascending order by their respective names.
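
The prioritized-sorting idea in the last bullet can be modelled in plain Python before turning to Spark. This is only a sketch: the `countries` list is made-up sample data, and the two-part sort key is one possible way to express the rule.

```python
# Hypothetical sample data, not part of the notebook's datasets.
countries = ["India", "USA", "Brazil", "Canada", "Australia"]

# First key part: False for USA and True for everything else, so USA sorts first.
# Second key part: the country name, which orders the remaining countries A to Z.
prioritized = sorted(countries, key=lambda name: (name != "USA", name))

print(prioritized)
```

In PySpark the same two-part key would typically be a `when(...).otherwise(...)` expression followed by the name column, which is how the prioritized sorting part of this notebook handles course levels.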

In [2]:
from pyspark.sql import *
from pyspark.sql.functions import *
import datetime

In [3]:
spark = SparkSession.builder.appName('SortingDataSpark').getOrCreate()

In [4]:
users = [
            {
                "id": 1,
                "first_name": "Pheobe",
                "last_name": "Buffay",
                "phone_numbers": Row(mobile= "82349238942", home= "2348910249", office= "8273929", shop=None),
                "courses": [1, 3, 5, 7],
                "email": "pheobebuffay@abc.com",
                "is_customer": True,
                "amount_paid": 1000.55,
                "customer_from": datetime.date(2021, 1, 13),
                "last_updated_ts": datetime.datetime(2021, 2, 10, 1, 15, 0)
            },
            {
                "id": 2,
                "first_name": "Joey",
                "last_name": "Tribbiani",
                "phone_numbers": Row(mobile= "82349238942", home= "2348910249", office= None, shop=None),
                "courses": [2, 4, 5],
                "email": "joey@abc.com",
                "is_customer": True,
                "amount_paid": 900.0,
                "customer_from": datetime.date(2021, 2, 14),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 3, 33, 0)
            },
            {
                "id": 3,
                "first_name": "Monica",
                "last_name": "Geller",
                "phone_numbers": Row(mobile= None, home= None, office= None, shop=None),
                "courses": [2],
                "email": "monica@abc.com",
                "is_customer": True,
                "amount_paid": 1000.90,
                "customer_from": datetime.date(2021, 2, 22),
                "last_updated_ts": datetime.datetime(2021, 2, 28, 7, 33, 0)
            },
            {
                "id": 4,
                "first_name": "Ross",
                "last_name": "Geller",
                "phone_numbers": Row(mobile= "82349238942", home= None, office= None, shop=None),
                "courses": [],
                "email": "ross@abc.com",
                "is_customer": True,
                "amount_paid": 1200.55,
                "customer_from": datetime.date(2021, 1, 19),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 1, 10, 0)
            },
            {
                "id": 5,
                "first_name": "Rachel",
                "last_name": "Green",
                "phone_numbers": Row(mobile= "82349238942", home= "2348910249", office= "8273929", shop= "5343434654"),
                "courses": [3],
                "email": "rachel@abc.com",
                "is_customer": True,
                "amount_paid": None,
                "customer_from": datetime.date(2021, 2, 24),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 3, 33, 0)
            },
            {
                "id": 6,
                "first_name": "Chandler",
                "last_name": "Bing",
                "phone_numbers": Row(mobile= "8273929", home= None, office= None, shop=None),
                "courses": [2, 4],
                "email": "bing@abc.com",
                "is_customer": True,
                "amount_paid": 1000.80,
                "customer_from": None,
                "last_updated_ts": datetime.datetime(2021, 2, 25, 7, 33, 0)
            }
        ]

In [5]:
usersDF = spark.createDataFrame(users)

usersDF.printSchema()



root
 |-- amount_paid: double (nullable = true)
 |-- courses: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- customer_from: date (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- is_customer: boolean (nullable = true)
 |-- last_name: string (nullable = true)
 |-- last_updated_ts: timestamp (nullable = true)
 |-- phone_numbers: struct (nullable = true)
 |    |-- mobile: string (nullable = true)
 |    |-- home: string (nullable = true)
 |    |-- office: string (nullable = true)
 |    |-- shop: string (nullable = true)



In [6]:
help(usersDF.sort)

Help on method sort in module pyspark.sql.dataframe:

sort(*cols, **kwargs) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` sorted by the specified column(s).
    
    :param cols: list of :class:`Column` or column names to sort by.
    :param ascending: boolean or list of boolean (default ``True``).
        Sort ascending vs. descending. Specify list for multiple sort orders.
        If a list is specified, length of the list must equal length of the `cols`.
    
    >>> df.sort(df.age.desc()).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> df.sort("age", ascending=False).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> df.orderBy(df.age.desc()).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> from pyspark.sql.functions import *
    >>> df.sort(asc("age")).collect()
    [Row(age=2, name='Alice'), Row(age=5, name='Bob')]
    >>> df.orderBy(desc("age"), "name").collect()
    

In [7]:
help(usersDF.orderBy)

Help on method sort in module pyspark.sql.dataframe:

sort(*cols, **kwargs) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` sorted by the specified column(s).
    
    :param cols: list of :class:`Column` or column names to sort by.
    :param ascending: boolean or list of boolean (default ``True``).
        Sort ascending vs. descending. Specify list for multiple sort orders.
        If a list is specified, length of the list must equal length of the `cols`.
    
    >>> df.sort(df.age.desc()).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> df.sort("age", ascending=False).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> df.orderBy(df.age.desc()).collect()
    [Row(age=5, name='Bob'), Row(age=2, name='Alice')]
    >>> from pyspark.sql.functions import *
    >>> df.sort(asc("age")).collect()
    [Row(age=2, name='Alice'), Row(age=5, name='Bob')]
    >>> df.orderBy(desc("age"), "name").collect()
    

In [8]:
usersDF.show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|    1000.55|[1, 3, 5, 7]|   2021-01-13|pheobebuffay@abc.com|    Pheobe|  1|       true|   Buffay|2021-02-10 01:15:00|[82349238942, 234...|
|      900.0|   [2, 4, 5]|   2021-02-14|        joey@abc.com|      Joey|  2|       true|Tribbiani|2021-02-18 03:33:00|[82349238942, 234...|
|     1000.9|         [2]|   2021-02-22|      monica@abc.com|    Monica|  3|       true|   Geller|2021-02-28 07:33:00|               [,,,]|
|    1200.55|          []|   2021-01-19|        ross@abc.com|      Ross|  4|       true|   Geller|2021-02-18 01:10:00|    [82349238942,,,]|
|       null|       

Sort users data in ascending order by **first_name**

In [9]:
usersDF.sort('first_name').show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|     1000.8|      [2, 4]|         null|        bing@abc.com|  Chandler|  6|       true|     Bing|2021-02-25 07:33:00|        [8273929,,,]|
|      900.0|   [2, 4, 5]|   2021-02-14|        joey@abc.com|      Joey|  2|       true|Tribbiani|2021-02-18 03:33:00|[82349238942, 234...|
|     1000.9|         [2]|   2021-02-22|      monica@abc.com|    Monica|  3|       true|   Geller|2021-02-28 07:33:00|               [,,,]|
|    1000.55|[1, 3, 5, 7]|   2021-01-13|pheobebuffay@abc.com|    Pheobe|  1|       true|   Buffay|2021-02-10 01:15:00|[82349238942, 234...|
|       null|       

In [10]:
usersDF.sort(col('first_name')).show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|     1000.8|      [2, 4]|         null|        bing@abc.com|  Chandler|  6|       true|     Bing|2021-02-25 07:33:00|        [8273929,,,]|
|      900.0|   [2, 4, 5]|   2021-02-14|        joey@abc.com|      Joey|  2|       true|Tribbiani|2021-02-18 03:33:00|[82349238942, 234...|
|     1000.9|         [2]|   2021-02-22|      monica@abc.com|    Monica|  3|       true|   Geller|2021-02-28 07:33:00|               [,,,]|
|    1000.55|[1, 3, 5, 7]|   2021-01-13|pheobebuffay@abc.com|    Pheobe|  1|       true|   Buffay|2021-02-10 01:15:00|[82349238942, 234...|
|       null|       

Sort users data in ascending order by **customer_from**

In [11]:
usersDF.sort(col('customer_from')).show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|     1000.8|      [2, 4]|         null|        bing@abc.com|  Chandler|  6|       true|     Bing|2021-02-25 07:33:00|        [8273929,,,]|
|    1000.55|[1, 3, 5, 7]|   2021-01-13|pheobebuffay@abc.com|    Pheobe|  1|       true|   Buffay|2021-02-10 01:15:00|[82349238942, 234...|
|    1200.55|          []|   2021-01-19|        ross@abc.com|      Ross|  4|       true|   Geller|2021-02-18 01:10:00|    [82349238942,,,]|
|      900.0|   [2, 4, 5]|   2021-02-14|        joey@abc.com|      Joey|  2|       true|Tribbiani|2021-02-18 03:33:00|[82349238942, 234...|
|     1000.9|       

Sort users data in ascending order by **number of enrolled courses**

In [12]:
# Use size() for array type
usersDF.sort(size('courses')).show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|    1200.55|          []|   2021-01-19|        ross@abc.com|      Ross|  4|       true|   Geller|2021-02-18 01:10:00|    [82349238942,,,]|
|       null|         [3]|   2021-02-24|      rachel@abc.com|    Rachel|  5|       true|    Green|2021-02-18 03:33:00|[82349238942, 234...|
|     1000.9|         [2]|   2021-02-22|      monica@abc.com|    Monica|  3|       true|   Geller|2021-02-28 07:33:00|               [,,,]|
|     1000.8|      [2, 4]|         null|        bing@abc.com|  Chandler|  6|       true|     Bing|2021-02-25 07:33:00|        [8273929,,,]|
|      900.0|   [2, 

In [13]:
usersDF. \
select(['id', 'courses']). \
withColumn('no_of_courses', size('courses')). \
sort(size('courses')).show()

+---+------------+-------------+
| id|     courses|no_of_courses|
+---+------------+-------------+
|  4|          []|            0|
|  5|         [3]|            1|
|  3|         [2]|            1|
|  6|      [2, 4]|            2|
|  2|   [2, 4, 5]|            3|
|  1|[1, 3, 5, 7]|            4|
+---+------------+-------------+



Sort users data in descending order by **first_name**

In [14]:
usersDF.sort(col('first_name'), ascending=False).show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|    1200.55|          []|   2021-01-19|        ross@abc.com|      Ross|  4|       true|   Geller|2021-02-18 01:10:00|    [82349238942,,,]|
|       null|         [3]|   2021-02-24|      rachel@abc.com|    Rachel|  5|       true|    Green|2021-02-18 03:33:00|[82349238942, 234...|
|    1000.55|[1, 3, 5, 7]|   2021-01-13|pheobebuffay@abc.com|    Pheobe|  1|       true|   Buffay|2021-02-10 01:15:00|[82349238942, 234...|
|     1000.9|         [2]|   2021-02-22|      monica@abc.com|    Monica|  3|       true|   Geller|2021-02-28 07:33:00|               [,,,]|
|      900.0|   [2, 

Sort users data in descending order by **customer_from**

In [15]:
usersDF.sort(col('customer_from'), ascending=False).show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|       null|         [3]|   2021-02-24|      rachel@abc.com|    Rachel|  5|       true|    Green|2021-02-18 03:33:00|[82349238942, 234...|
|     1000.9|         [2]|   2021-02-22|      monica@abc.com|    Monica|  3|       true|   Geller|2021-02-28 07:33:00|               [,,,]|
|      900.0|   [2, 4, 5]|   2021-02-14|        joey@abc.com|      Joey|  2|       true|Tribbiani|2021-02-18 03:33:00|[82349238942, 234...|
|    1200.55|          []|   2021-01-19|        ross@abc.com|      Ross|  4|       true|   Geller|2021-02-18 01:10:00|    [82349238942,,,]|
|    1000.55|[1, 3, 

In [16]:
usersDF.sort(usersDF['first_name'].desc()).show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|    1200.55|          []|   2021-01-19|        ross@abc.com|      Ross|  4|       true|   Geller|2021-02-18 01:10:00|    [82349238942,,,]|
|       null|         [3]|   2021-02-24|      rachel@abc.com|    Rachel|  5|       true|    Green|2021-02-18 03:33:00|[82349238942, 234...|
|    1000.55|[1, 3, 5, 7]|   2021-01-13|pheobebuffay@abc.com|    Pheobe|  1|       true|   Buffay|2021-02-10 01:15:00|[82349238942, 234...|
|     1000.9|         [2]|   2021-02-22|      monica@abc.com|    Monica|  3|       true|   Geller|2021-02-28 07:33:00|               [,,,]|
|      900.0|   [2, 

In [17]:
# Use the desc() function from pyspark.sql.functions
usersDF.sort(desc('first_name')).show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|    1200.55|          []|   2021-01-19|        ross@abc.com|      Ross|  4|       true|   Geller|2021-02-18 01:10:00|    [82349238942,,,]|
|       null|         [3]|   2021-02-24|      rachel@abc.com|    Rachel|  5|       true|    Green|2021-02-18 03:33:00|[82349238942, 234...|
|    1000.55|[1, 3, 5, 7]|   2021-01-13|pheobebuffay@abc.com|    Pheobe|  1|       true|   Buffay|2021-02-10 01:15:00|[82349238942, 234...|
|     1000.9|         [2]|   2021-02-22|      monica@abc.com|    Monica|  3|       true|   Geller|2021-02-28 07:33:00|               [,,,]|
|      900.0|   [2, 

Sort users data in descending order by **number of enrolled courses**

In [18]:
# Use size() for array type
usersDF.sort(size('courses'), ascending=False).show()

+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|amount_paid|     courses|customer_from|               email|first_name| id|is_customer|last_name|    last_updated_ts|       phone_numbers|
+-----------+------------+-------------+--------------------+----------+---+-----------+---------+-------------------+--------------------+
|    1000.55|[1, 3, 5, 7]|   2021-01-13|pheobebuffay@abc.com|    Pheobe|  1|       true|   Buffay|2021-02-10 01:15:00|[82349238942, 234...|
|      900.0|   [2, 4, 5]|   2021-02-14|        joey@abc.com|      Joey|  2|       true|Tribbiani|2021-02-18 03:33:00|[82349238942, 234...|
|     1000.8|      [2, 4]|         null|        bing@abc.com|  Chandler|  6|       true|     Bing|2021-02-25 07:33:00|        [8273929,,,]|
|       null|         [3]|   2021-02-24|      rachel@abc.com|    Rachel|  5|       true|    Green|2021-02-18 03:33:00|[82349238942, 234...|
|     1000.9|       

In [19]:
usersDF. \
select('id', 'courses'). \
withColumn('no_of_courses', size('courses')). \
sort('no_of_courses', ascending=False).show()

+---+------------+-------------+
| id|     courses|no_of_courses|
+---+------------+-------------+
|  1|[1, 3, 5, 7]|            4|
|  2|   [2, 4, 5]|            3|
|  6|      [2, 4]|            2|
|  5|         [3]|            1|
|  3|         [2]|            1|
|  4|          []|            0|
+---+------------+-------------+



In [20]:
usersDF. \
select('id', 'courses'). \
withColumn('no_of_courses', size('courses')). \
sort(desc('no_of_courses')).show()

+---+------------+-------------+
| id|     courses|no_of_courses|
+---+------------+-------------+
|  1|[1, 3, 5, 7]|            4|
|  2|   [2, 4, 5]|            3|
|  6|      [2, 4]|            2|
|  3|         [2]|            1|
|  5|         [3]|            1|
|  4|          []|            0|
+---+------------+-------------+
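
Users 3 and 5 both have one course, and the two outputs above list them in a different order. A sort on `no_of_courses` alone does not say how ties are arranged, so adding a second key such as `id` is one way to make the result repeatable. The plain-Python sketch below models that idea on the same `id` and `courses` values; it illustrates the ordering and is not Spark code.

```python
# (id, courses) pairs copied from the users data above.
users = [(1, [1, 3, 5, 7]), (2, [2, 4, 5]), (3, [2]), (4, []), (5, [3]), (6, [2, 4])]

# Sort by number of courses descending, then by id ascending to break ties.
# Negating the length turns an ascending sort into a descending one for that key part.
ranked = sorted(users, key=lambda user: (-len(user[1]), user[0]))

for user_id, courses in ranked:
    print(user_id, courses, len(courses))
```

In PySpark, the corresponding call would presumably be `sort(desc('no_of_courses'), 'id')`, which mixes a `desc()` expression with a plain column name.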



#### Dealing with Nulls

Sort the data in ascending order by **customer_from**

In [21]:
cf = col('customer_from')

In [None]:
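# In a notebook, type `cf.` and press Tab to list the methods available on a Column
# (for example asc_nulls_first, asc_nulls_last, desc_nulls_first, desc_nulls_last).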

In [23]:
# Null values come first (default for ascending order)
usersDF. \
select('id', 'customer_from'). \
orderBy('customer_from'). \
show()

+---+-------------+
| id|customer_from|
+---+-------------+
|  6|         null|
|  1|   2021-01-13|
|  4|   2021-01-19|
|  2|   2021-02-14|
|  3|   2021-02-22|
|  5|   2021-02-24|
+---+-------------+



In [24]:
# Nulls at the end
usersDF. \
select('id', 'customer_from'). \
orderBy(usersDF['customer_from'].asc_nulls_last()). \
show()

+---+-------------+
| id|customer_from|
+---+-------------+
|  1|   2021-01-13|
|  4|   2021-01-19|
|  2|   2021-02-14|
|  3|   2021-02-22|
|  5|   2021-02-24|
|  6|         null|
+---+-------------+



In [25]:
# Use Column Type
usersDF. \
select('id', 'customer_from'). \
orderBy(col('customer_from').asc_nulls_last()). \
show()

+---+-------------+
| id|customer_from|
+---+-------------+
|  1|   2021-01-13|
|  4|   2021-01-19|
|  2|   2021-02-14|
|  3|   2021-02-22|
|  5|   2021-02-24|
|  6|         null|
+---+-------------+



Sort the data in descending order by **customer_from**

In [26]:
# Descending order: nulls come last by default
usersDF. \
select('id', 'customer_from'). \
orderBy(col('customer_from').desc()). \
show()

+---+-------------+
| id|customer_from|
+---+-------------+
|  5|   2021-02-24|
|  3|   2021-02-22|
|  2|   2021-02-14|
|  4|   2021-01-19|
|  1|   2021-01-13|
|  6|         null|
+---+-------------+



In [27]:
usersDF. \
select('id', 'customer_from'). \
orderBy(col('customer_from').desc_nulls_first()). \
show()

+---+-------------+
| id|customer_from|
+---+-------------+
|  6|         null|
|  5|   2021-02-24|
|  3|   2021-02-22|
|  2|   2021-02-14|
|  4|   2021-01-19|
|  1|   2021-01-13|
+---+-------------+
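
The four null-ordering variants shown above can be summarised with a small plain-Python model. This is a sketch under stated assumptions: `sort_ids` is a hypothetical helper written for this illustration, and the rows are the `id` and `customer_from` values from the users data. It is not how Spark implements the sort.

```python
import datetime

# (id, customer_from) pairs copied from the users data above.
rows = [
    (1, datetime.date(2021, 1, 13)),
    (2, datetime.date(2021, 2, 14)),
    (3, datetime.date(2021, 2, 22)),
    (4, datetime.date(2021, 1, 19)),
    (5, datetime.date(2021, 2, 24)),
    (6, None),
]


def sort_ids(rows, descending=False, nulls_first=True):
    """Hypothetical helper: return ids ordered by date, with None first or last."""
    # None cannot be compared with a date in Python, so split the rows first.
    nulls = [row for row in rows if row[1] is None]
    dated = sorted(
        (row for row in rows if row[1] is not None),
        key=lambda row: row[1],
        reverse=descending,
    )
    ordered = nulls + dated if nulls_first else dated + nulls
    return [row[0] for row in ordered]


print(sort_ids(rows))                                      # models asc() / asc_nulls_first()
print(sort_ids(rows, nulls_first=False))                   # models asc_nulls_last()
print(sort_ids(rows, descending=True, nulls_first=False))  # models desc() / desc_nulls_last()
print(sort_ids(rows, descending=True))                     # models desc_nulls_first()
```

The four printed lists follow the same id order as the four Spark outputs above: nulls first for plain ascending, nulls last for plain descending, and the explicit variants overriding those defaults.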



#### Composite Sorting

In [28]:
courses = [{'course_id': 1,
             'course_name': '2020 Complete Python Bootcamp: From Zero to Hero in Python',
             'suitable_for': 'Beginner',
             'enrollment': 1100093,
             'stars': 4.6,
             'number_of_ratings': 318066},
           {'course_id': 4,
             'course_name': 'Angular - The Complete Guide (2020 Edition)',
             'suitable_for': 'Intermediate',
             'enrollment': 422557,
             'stars': 4.6,
             'number_of_ratings': 129984},
           {'course_id': 12,
             'course_name': 'Automate the Boring Stuff with Python Programming',
             'suitable_for': 'Advanced',
             'enrollment': 692617,
             'stars': 4.6,
             'number_of_ratings': 70508},
            {'course_id': 10,
             'course_name': 'Complete C# Unity Game Developer 2D',
             'suitable_for': 'Beginner',
             'enrollment': 364934,
             'stars': 4.6,
             'number_of_ratings': 78989},
            {'course_id': 5,
             'course_name': 'Java Programming Masterclass for Software Developers',
             'suitable_for': 'Beginner',
             'enrollment': 596726,
             'stars': 4.6,
             'number_of_ratings': 182997},
            {'course_id': 15,
             'course_name': 'Learn Python Programming',
             'suitable_for': 'Advanced',
             'enrollment': 240790,
             'stars': 4.5,
             'number_of_ratings': 58677},
             {'course_id': 3,
             'course_name': 'Machine Learning',
             'suitable_for': 'Intermediate',
             'enrollment': 692812,
             'stars': 4.5,
             'number_of_ratings': 132228},
             {'course_id': 14,
             'course_name': 'Modern React with PHP',
             'suitable_for': 'Intermediate',
             'enrollment': 203214,
             'stars': 4.7,
             'number_of_ratings': 60835},
             {'course_id': 8,
             'course_name': 'Python for Data Science',
             'suitable_for': 'Intermediate',
             'enrollment': 387789,
             'stars': 4.6,
             'number_of_ratings': 87403},
            {'course_id': 19,
             'course_name': 'Unreal Engine C++ Developer: Learn C++ and Make Video Games',
             'suitable_for': 'Advanced',
             'enrollment': 229005,
             'stars': 4.5,
             'number_of_ratings': 45860},
            {'course_id': 17,
             'course_name': 'iOS 13 & Swift 5 - The Complete iOS App Development Bootcamp',
             'suitable_for': 'Advanced',
             'enrollment': 179598,
             'stars': 4.8,
             'number_of_ratings': 49972}
           ]

In [29]:
coursesDF = spark.createDataFrame([Row(**course) for course in courses])

In [30]:
coursesDF.printSchema()

root
 |-- course_id: long (nullable = true)
 |-- course_name: string (nullable = true)
 |-- suitable_for: string (nullable = true)
 |-- enrollment: long (nullable = true)
 |-- stars: double (nullable = true)
 |-- number_of_ratings: long (nullable = true)



* Sort courses in ascending order by **suitable_for** and then in ascending order by **enrollment**.

In [31]:
coursesDF.sort(col('suitable_for'), col('enrollment')).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|       17|iOS 13 & Swift 5 ...|    Advanced|    179598|  4.8|            49972|
|       19|Unreal Engine C++...|    Advanced|    229005|  4.5|            45860|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        4|Angular - The Com

In [32]:
coursesDF.sort(coursesDF['suitable_for'], coursesDF['enrollment']).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|       17|iOS 13 & Swift 5 ...|    Advanced|    179598|  4.8|            49972|
|       19|Unreal Engine C++...|    Advanced|    229005|  4.5|            45860|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        4|Angular - The Com

In [33]:
coursesDF.sort(['suitable_for', 'enrollment']).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|       17|iOS 13 & Swift 5 ...|    Advanced|    179598|  4.8|            49972|
|       19|Unreal Engine C++...|    Advanced|    229005|  4.5|            45860|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        4|Angular - The Com

* Sort courses in ascending order by **suitable_for** and then in descending order by **enrollment**
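
To make the expected result concrete, here is a plain-Python model of this mixed-direction sort. It is a sketch, not Spark code: it relies on Python's stable `sorted()` and sorts twice, first by the secondary key (descending) and then by the primary key (ascending). The tuples hold `course_id`, `suitable_for` and `enrollment` from the courses data above.

```python
# (course_id, suitable_for, enrollment) triples copied from the courses data above.
courses = [
    (1, "Beginner", 1100093), (4, "Intermediate", 422557), (12, "Advanced", 692617),
    (10, "Beginner", 364934), (5, "Beginner", 596726), (15, "Advanced", 240790),
    (3, "Intermediate", 692812), (14, "Intermediate", 203214), (8, "Intermediate", 387789),
    (19, "Advanced", 229005), (17, "Advanced", 179598),
]

# Pass 1: order by the secondary key, enrollment, from highest to lowest.
by_enrollment = sorted(courses, key=lambda course: course[2], reverse=True)

# Pass 2: order by the primary key, suitable_for, A to Z.
# sorted() is stable, so rows with the same level keep their pass-1 order.
ordered = sorted(by_enrollment, key=lambda course: course[1])

for course in ordered:
    print(course)
```

The two-pass approach also works when the descending key is not numeric, which is where negating the value is not an option.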

In [34]:
coursesDF.sort(col('suitable_for'), col('enrollment').desc()).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       19|Unreal Engine C++...|    Advanced|    229005|  4.5|            45860|
|       17|iOS 13 & Swift 5 ...|    Advanced|    179598|  4.8|            49972|
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        3|    Machine Learning|Intermediate|    692812|  4.5|           132228|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|        8|Python for Data S

In [35]:
coursesDF.sort(coursesDF['suitable_for'], coursesDF['enrollment'].desc()).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       19|Unreal Engine C++...|    Advanced|    229005|  4.5|            45860|
|       17|iOS 13 & Swift 5 ...|    Advanced|    179598|  4.8|            49972|
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        3|    Machine Learning|Intermediate|    692812|  4.5|           132228|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|        8|Python for Data S

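* Sort courses in ascending order by **suitable_for** and then in descending order by **number_of_ratings**

In [37]: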
coursesDF.sort('suitable_for', desc('number_of_ratings')).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       17|iOS 13 & Swift 5 ...|    Advanced|    179598|  4.8|            49972|
|       19|Unreal Engine C++...|    Advanced|    229005|  4.5|            45860|
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        3|    Machine Learning|Intermediate|    692812|  4.5|           132228|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|        8|Python for Data S

In [38]:
coursesDF.sort(['suitable_for', 'number_of_ratings'], ascending=[1, 0]).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       17|iOS 13 & Swift 5 ...|    Advanced|    179598|  4.8|            49972|
|       19|Unreal Engine C++...|    Advanced|    229005|  4.5|            45860|
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        3|    Machine Learning|Intermediate|    692812|  4.5|           132228|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|        8|Python for Data S

#### Prioritized Sorting of a Spark DataFrame

* Make sure the data is sorted in a custom order by level and then in descending numeric order by number of ratings.
* All the beginner level courses should come first, followed by intermediate level and then by advanced level.
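
The two requirements above amount to mapping each level to a rank and sorting on that rank before the rating count. A plain-Python model of that ordering might look like the following; `LEVEL_RANK` is a hypothetical name introduced for this sketch, and the tuples hold `course_id`, `suitable_for` and `number_of_ratings` from the courses data.

```python
# (course_id, suitable_for, number_of_ratings) triples copied from the courses data.
courses = [
    (1, "Beginner", 318066), (4, "Intermediate", 129984), (12, "Advanced", 70508),
    (10, "Beginner", 78989), (5, "Beginner", 182997), (15, "Advanced", 58677),
    (3, "Intermediate", 132228), (14, "Intermediate", 60835), (8, "Intermediate", 87403),
    (19, "Advanced", 45860), (17, "Advanced", 49972),
]

# Hypothetical rank table: lower rank sorts earlier; any other level falls back to 2.
LEVEL_RANK = {"Beginner": 0, "Intermediate": 1}

# Key part 1: the level rank (custom order). Key part 2: negated ratings (descending).
ordered = sorted(
    courses,
    key=lambda course: (LEVEL_RANK.get(course[1], 2), -course[2]),
)

for course in ordered:
    print(course)
```

The fallback rank of 2 plays the same role as an `otherwise(2)` or `ELSE 2` branch: every level that is not listed explicitly is grouped at the end.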

In [39]:
help(when)

Help on function when in module pyspark.sql.functions:

when(condition, value)
    Evaluates a list of conditions and returns one of multiple possible result expressions.
    If :func:`Column.otherwise` is not invoked, None is returned for unmatched conditions.
    
    :param condition: a boolean :class:`Column` expression.
    :param value: a literal value, or a :class:`Column` expression.
    
    >>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
    [Row(age=3), Row(age=4)]
    
    >>> df.select(when(df.age == 2, df.age + 1).alias("age")).collect()
    [Row(age=3), Row(age=None)]
    
    .. versionadded:: 1.4



In [40]:
# Chain when() calls instead of nesting them inside otherwise(); the result is the same
coursesLevel = when(col('suitable_for') == 'Beginner', 0). \
when(col('suitable_for') == 'Intermediate', 1). \
otherwise(2)

In [42]:
coursesDF. \
orderBy(coursesLevel, col('number_of_ratings').desc()).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        3|    Machine Learning|Intermediate|    692812|  4.5|           132228|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       17|iOS 13 & Swift 5 

In [43]:
# SQL Style Syntax
coursesLevel = expr(""" \
                    CASE \
                    WHEN suitable_for = 'Beginner' THEN 0 \
                    WHEN suitable_for = 'Intermediate' THEN 1 \
                    ELSE 2 \
                    END""")

coursesDF.sort(coursesLevel, col('number_of_ratings').desc()).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        5|Java Programming ...|    Beginner|    596726|  4.6|           182997|
|       10|Complete C# Unity...|    Beginner|    364934|  4.6|            78989|
|        3|    Machine Learning|Intermediate|    692812|  4.5|           132228|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|       17|iOS 13 & Swift 5 