* ### String Manipulation Functions
  * Case Conversion - `lower`, `upper`
  * Getting Length - `length`
  * Extracting substrings - `substring`, `split`
  * Trimming - `trim`, `ltrim`, `rtrim`
  * Padding - `lpad`, `rpad`
  * Concatenating strings - `concat`, `concat_ws`
* ### Date Manipulation Functions
  * Getting current date and time - `current_date`, `current_timestamp`
  * Date Arithmetic - `date_add`, `date_sub`, `datediff`, `months_between`, `add_months`, `next_day`
  * Beginning and Ending Date or Time - `last_day`, `trunc`, `date_trunc`
  * Formatting Date - `date_format`
  * Extracting Information - `dayofyear`, `dayofmonth`, `dayofweek`, `year`, `month`
* ### Aggregate Functions
  * `count`, `countDistinct`
  * `sum`, `avg`
  * `min`, `max`
* ### Other Functions
  * `CASE` and `WHEN`
  * `CAST` for type casting
  * Functions to manage special types such as `ARRAY`, `MAP`, `STRUCT` type columns
  * Many others

In [0]:
# Reading data

orders = spark.read.csv(
    '/public/retail_db/orders',
    schema='order_id INT, order_date STRING, order_customer_id INT, order_status STRING'
)

In [0]:
# These functions live in the pyspark.sql.functions module.
# Note: a wildcard import shadows Python built-ins such as sum, min and max.
from pyspark.sql.functions import *

In [0]:
# Use Python's built-in help() function to read the documentation of any of these functions
help(concat_ws)

Help on function concat_ws in module pyspark.sql.functions.builtin:

concat_ws(sep: str, *cols: 'ColumnOrName') -> pyspark.sql.column.Column
    Concatenates multiple input string columns together into a single string column,
    using the given separator.
    
    .. versionadded:: 1.5.0
    
    .. versionchanged:: 3.4.0
        Supports Spark Connect.
    
    Parameters
    ----------
    sep : str
        words separator.
    cols : :class:`~pyspark.sql.Column` or str
        list of columns to work on.
    
    Returns
    -------
    :class:`~pyspark.sql.Column`
        string of concatenated words.
    
    Examples
    --------
    >>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
    >>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
    [Row(s='abcd-123')]



In [0]:
# It's important to remember that some of these functions don't accept plain column names (strings).
# They need a Column object, created with col() or lit(), as an argument.

# For example, concat() reads a plain string as a column name, so a literal value needs lit():
orders.select(concat('order_id', lit(' abc '), 'order_date')).show(truncate=False)

+-----------------------------------+
|concat(order_id,  abc , order_date)|
+-----------------------------------+
|1 abc 2013-07-25 00:00:00.0        |
|2 abc 2013-07-25 00:00:00.0        |
|3 abc 2013-07-25 00:00:00.0        |
|4 abc 2013-07-25 00:00:00.0        |
|5 abc 2013-07-25 00:00:00.0        |
|6 abc 2013-07-25 00:00:00.0        |
|7 abc 2013-07-25 00:00:00.0        |
|8 abc 2013-07-25 00:00:00.0        |
|9 abc 2013-07-25 00:00:00.0        |
|10 abc 2013-07-25 00:00:00.0       |
|11 abc 2013-07-25 00:00:00.0       |
|12 abc 2013-07-25 00:00:00.0       |
|13 abc 2013-07-25 00:00:00.0       |
|14 abc 2013-07-25 00:00:00.0       |
|15 abc 2013-07-25 00:00:00.0       |
|16 abc 2013-07-25 00:00:00.0       |
|17 abc 2013-07-25 00:00:00.0       |
|18 abc 2013-07-25 00:00:00.0       |
|19 abc 2013-07-25 00:00:00.0       |
|20 abc 2013-07-25 00:00:00.0       |
+-----------------------------------+
only showing top 20 rows



In [0]:
# .alias() is a method of Column objects, so a plain column name (string) must be wrapped in col() first:
orders.select(col('order_id').alias('order_id_alias')).show()

+--------------+
|order_id_alias|
+--------------+
|             1|
|             2|
|             3|
|             4|
|             5|
|             6|
|             7|
|             8|
|             9|
|            10|
|            11|
|            12|
|            13|
|            14|
|            15|
|            16|
|            17|
|            18|
|            19|
|            20|
+--------------+
only showing top 20 rows

