## Create spark dataframes to select and rename columns

#### Index

[1. Create single column spark dataframe using list](#first) <br>
[2. Create multi column spark dataframe using list](#second) <br>
[3. Overview of Row](#third) <br>
[4. Convert list of list into spark dataframe using Row](#fourth) <br>
[5. Convert list of tuples into spark dataframe using Row](#fifth) <br>
[6. Convert list of dicts into spark dataframe using Row](#sixth) <br>

In [1]:
# Import necessary libraries
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType, StringType
import pandas as pd
import datetime

In [2]:
# Initiate spark session
spark = SparkSession \
        .builder \
        .appName('CreateSparkDF') \
        .getOrCreate()

In [3]:
users = [
            {
                "id": 1,
                "first_name": "Pheobe",
                "last_name": "Buffay",
                "phone_numbers": Row(mobile= "82349238942", home= "2348910249", office= "8273929", shop=None),
                "courses": [1, 3, 5, 7],
                "email": "pheobebuffay@abc.com",
                "is_customer": True,
                "amount_paid": 1000.55,
                "customer_from": datetime.date(2021, 1, 13),
                "last_updated_ts": datetime.datetime(2021, 2, 10, 1, 15, 0)
            },
            {
                "id": 2,
                "first_name": "Joey",
                "last_name": "Tribbiani",
                "phone_numbers": Row(mobile= "82349238942", home= "2348910249", office= None, shop=None),
                "courses": [2, 4, 5],
                "email": "joey@abc.com",
                "is_customer": True,
                "amount_paid": 900.0,
                "customer_from": datetime.date(2021, 2, 14),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 3, 33, 0)
            },
            {
                "id": 3,
                "first_name": "Monica",
                "last_name": "Geller",
                "phone_numbers": Row(mobile= None, home= None, office= None, shop=None),
                "courses": [2],
                "email": "monica@abc.com",
                "is_customer": True,
                "amount_paid": 1000.90,
                "customer_from": datetime.date(2021, 2, 22),
                "last_updated_ts": datetime.datetime(2021, 2, 28, 7, 33, 0)
            },
            {
                "id": 4,
                "first_name": "Ross",
                "last_name": "Geller",
                "phone_numbers": Row(mobile= "82349238942", home= None, office= None, shop=None),
                "courses": [],
                "email": "ross@abc.com",
                "is_customer": True,
                "amount_paid": 1200.55,
                "customer_from": datetime.date(2021, 1, 19),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 1, 10, 0)
            },
            {
                "id": 5,
                "first_name": "Rachel",
                "last_name": "Green",
                "phone_numbers": Row(mobile= "82349238942", home= "2348910249", office= "8273929", shop= "5343434654"),
                "courses": [3],
                "email": "rachel@abc.com",
                "is_customer": True,
                "amount_paid": None,
                "customer_from": datetime.date(2021, 2, 24),
                "last_updated_ts": datetime.datetime(2021, 2, 18, 3, 33, 0)
            },
            {
                "id": 6,
                "first_name": "Chandler",
                "last_name": "Bing",
                "phone_numbers": Row(mobile= "8273929", home= None, office= None, shop=None),
                "courses": [2, 4],
                "email": "bing@abc.com",
                "is_customer": True,
                "amount_paid": 1000.80,
                "customer_from": datetime.date(2021, 2, 22),
                "last_updated_ts": datetime.datetime(2021, 2, 25, 7, 33, 0)
            }
        ]

In [4]:
# Disable Arrow-based columnar data transfers between pandas and Spark
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)

In [5]:
user_df = spark.createDataFrame(pd.DataFrame(users))

In [6]:
user_df.show()

+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
| id|first_name|last_name|       phone_numbers|     courses|               email|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
|  1|    Pheobe|   Buffay|[82349238942, 234...|[1, 3, 5, 7]|pheobebuffay@abc.com|       true|    1000.55|   2021-01-13|2021-02-10 01:15:00|
|  2|      Joey|Tribbiani|[82349238942, 234...|   [2, 4, 5]|        joey@abc.com|       true|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|    Monica|   Geller|               [,,,]|         [2]|      monica@abc.com|       true|     1000.9|   2021-02-22|2021-02-28 07:33:00|
|  4|      Ross|   Geller|    [82349238942,,,]|          []|        ross@abc.com|       true|    1200.55|   2021-01-19|2021-02-18 01:10:00|
|  5|    Rachel|    

In [7]:
user_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: struct (nullable = true)
 |    |-- mobile: string (nullable = true)
 |    |-- home: string (nullable = true)
 |    |-- office: string (nullable = true)
 |    |-- shop: string (nullable = true)
 |-- courses: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- email: string (nullable = true)
 |-- is_customer: boolean (nullable = true)
 |-- amount_paid: double (nullable = true)
 |-- customer_from: date (nullable = true)
 |-- last_updated_ts: timestamp (nullable = true)



#### Overview of Narrow and Wide Transformations

* The following functions are narrow transformations. Narrow transformations don't result in shuffling; they are also known as row-level transformations.
    * df.select
    * df.filter
    * df.withColumn
    * df.withColumnRenamed
    * df.drop
* The following functions are wide transformations.
    * df.distinct
    * df.intersect, df.subtract and similar set operations (`df.union` on its own only appends rows and does not shuffle)
    * df.join or any join operation
    * df.groupBy
    * df.sort or df.orderBy
* Any function that results in shuffling is a wide transformation. Wide transformations operate on groups of records that share a key.
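
One way to check whether a given transformation is wide is to look for a shuffle in its physical plan. The sketch below is an illustration rather than part of the original notebook: `plan_has_shuffle` is a hypothetical helper, and it assumes that a shuffle shows up as an `Exchange` operator in the text printed by `df.explain()` (broadcast exchanges are skipped because they do not repartition the data).

```python
import io
from contextlib import redirect_stdout


def plan_has_shuffle(df) -> bool:
    """Return True if the plan printed by df.explain() contains a shuffle.

    Assumption: a shuffle appears in the plan text as an `Exchange` operator.
    `BroadcastExchange` copies a small table to the executors and is ignored.
    """
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain()  # PySpark prints the physical plan to stdout
    return any(
        "Exchange" in line and "BroadcastExchange" not in line
        for line in buffer.getvalue().splitlines()
    )
```

With the `user_df` defined earlier, `plan_has_shuffle(user_df.select('id'))` would be expected to return `False` and `plan_has_shuffle(user_df.groupBy('last_name').count())` to return `True`, although the exact plan text can differ between Spark versions.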

### df.select

* `select()` can pick a single column, multiple columns, columns chosen by index, all columns from a list, or nested columns from a DataFrame.
* `select()` is a transformation, so it returns a new DataFrame with the selected columns.
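
Nested columns are mentioned above, so here is a minimal sketch of how a struct field such as `phone_numbers.mobile` can be selected with a dotted path. The helper names `struct_field_paths` and `select_phone_numbers` are hypothetical and only serve the illustration; the column names come from the `users` data.

```python
def struct_field_paths(struct_col, fields):
    """Build dotted paths such as 'phone_numbers.mobile' for select()."""
    return [f"{struct_col}.{field}" for field in fields]


def select_phone_numbers(df):
    """Select id plus two fields nested inside the phone_numbers struct."""
    paths = struct_field_paths("phone_numbers", ["mobile", "home"])
    # Same as df.select('id', 'phone_numbers.mobile', 'phone_numbers.home').
    # Each nested field comes back as a column named after the leaf field.
    return df.select("id", *paths)
```

`select_phone_numbers(user_df).show()` should list the columns `id`, `mobile` and `home`; `user_df.select('phone_numbers.*')` expands every field of the struct instead.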

In [8]:
help(user_df.select)

Help on method select in module pyspark.sql.dataframe:

select(*cols) method of pyspark.sql.dataframe.DataFrame instance
    Projects a set of expressions and returns a new :class:`DataFrame`.
    
    :param cols: list of column names (string) or expressions (:class:`Column`).
        If one of the column names is '*', that column is expanded to include all columns
        in the current :class:`DataFrame`.
    
    >>> df.select('*').collect()
    [Row(age=2, name='Alice'), Row(age=5, name='Bob')]
    >>> df.select('name', 'age').collect()
    [Row(name='Alice', age=2), Row(name='Bob', age=5)]
    >>> df.select(df.name, (df.age + 10).alias('age')).collect()
    [Row(name='Alice', age=12), Row(name='Bob', age=15)]
    
    .. versionadded:: 1.3



In [9]:
user_df.select('*').show()

+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
| id|first_name|last_name|       phone_numbers|     courses|               email|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
|  1|    Pheobe|   Buffay|[82349238942, 234...|[1, 3, 5, 7]|pheobebuffay@abc.com|       true|    1000.55|   2021-01-13|2021-02-10 01:15:00|
|  2|      Joey|Tribbiani|[82349238942, 234...|   [2, 4, 5]|        joey@abc.com|       true|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|    Monica|   Geller|               [,,,]|         [2]|      monica@abc.com|       true|     1000.9|   2021-02-22|2021-02-28 07:33:00|
|  4|      Ross|   Geller|    [82349238942,,,]|          []|        ross@abc.com|       true|    1200.55|   2021-01-19|2021-02-18 01:10:00|
|  5|    Rachel|    

In [10]:
user_df.select('id', 'first_name', 'last_name').show()

+---+----------+---------+
| id|first_name|last_name|
+---+----------+---------+
|  1|    Pheobe|   Buffay|
|  2|      Joey|Tribbiani|
|  3|    Monica|   Geller|
|  4|      Ross|   Geller|
|  5|    Rachel|    Green|
|  6|  Chandler|     Bing|
+---+----------+---------+



In [11]:
user_df.select(['id', 'first_name', 'last_name']).show()

+---+----------+---------+
| id|first_name|last_name|
+---+----------+---------+
|  1|    Pheobe|   Buffay|
|  2|      Joey|Tribbiani|
|  3|    Monica|   Geller|
|  4|      Ross|   Geller|
|  5|    Rachel|    Green|
|  6|  Chandler|     Bing|
+---+----------+---------+



In [12]:
# Defining alias to the dataframe
user_df.alias('u').select('u.*').show()

+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
| id|first_name|last_name|       phone_numbers|     courses|               email|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
|  1|    Pheobe|   Buffay|[82349238942, 234...|[1, 3, 5, 7]|pheobebuffay@abc.com|       true|    1000.55|   2021-01-13|2021-02-10 01:15:00|
|  2|      Joey|Tribbiani|[82349238942, 234...|   [2, 4, 5]|        joey@abc.com|       true|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|    Monica|   Geller|               [,,,]|         [2]|      monica@abc.com|       true|     1000.9|   2021-02-22|2021-02-28 07:33:00|
|  4|      Ross|   Geller|    [82349238942,,,]|          []|        ross@abc.com|       true|    1200.55|   2021-01-19|2021-02-18 01:10:00|
|  5|    Rachel|    

In [13]:
user_df.select(col('id'), 'first_name', 'last_name').show()

+---+----------+---------+
| id|first_name|last_name|
+---+----------+---------+
|  1|    Pheobe|   Buffay|
|  2|      Joey|Tribbiani|
|  3|    Monica|   Geller|
|  4|      Ross|   Geller|
|  5|    Rachel|    Green|
|  6|  Chandler|     Bing|
+---+----------+---------+



In [14]:
user_df.select('id', 'first_name', 'last_name', concat('first_name', lit(', '), 'last_name').alias('full_name')).show()

+---+----------+---------+---------------+
| id|first_name|last_name|      full_name|
+---+----------+---------+---------------+
|  1|    Pheobe|   Buffay| Pheobe, Buffay|
|  2|      Joey|Tribbiani|Joey, Tribbiani|
|  3|    Monica|   Geller| Monica, Geller|
|  4|      Ross|   Geller|   Ross, Geller|
|  5|    Rachel|    Green|  Rachel, Green|
|  6|  Chandler|     Bing| Chandler, Bing|
+---+----------+---------+---------------+



### df.selectExpr 

* It performs the same operations as `select()`, but takes SQL expression strings.
* `selectExpr()` is a way to combine SQL-like syntax with the DataFrame API.
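
Because `selectExpr()` takes plain strings, the expressions can be assembled with ordinary Python before they are handed to Spark. A small sketch follows; `cast_exprs` and `select_with_casts` are hypothetical helper names, and the column names are assumed to be simple identifiers that need no backtick quoting.

```python
def cast_exprs(type_by_column):
    """Turn {'id': 'STRING'} into ['CAST(id AS STRING) AS id']."""
    return [
        f"CAST({name} AS {sql_type}) AS {name}"
        for name, sql_type in type_by_column.items()
    ]


def select_with_casts(df, type_by_column):
    """Project only the listed columns, each cast to the requested SQL type."""
    # selectExpr takes the expressions as separate string arguments.
    return df.selectExpr(*cast_exprs(type_by_column))
```

For example, `select_with_casts(user_df, {"id": "STRING", "amount_paid": "DECIMAL(10,2)"})` passes two `CAST(...) AS ...` strings to `selectExpr`.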

In [15]:
help(user_df.selectExpr)

Help on method selectExpr in module pyspark.sql.dataframe:

selectExpr(*expr) method of pyspark.sql.dataframe.DataFrame instance
    Projects a set of SQL expressions and returns a new :class:`DataFrame`.
    
    This is a variant of :func:`select` that accepts SQL expressions.
    
    >>> df.selectExpr("age * 2", "abs(age)").collect()
    [Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]
    
    .. versionadded:: 1.3



In [16]:
user_df.selectExpr('*').show()

+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
| id|first_name|last_name|       phone_numbers|     courses|               email|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
|  1|    Pheobe|   Buffay|[82349238942, 234...|[1, 3, 5, 7]|pheobebuffay@abc.com|       true|    1000.55|   2021-01-13|2021-02-10 01:15:00|
|  2|      Joey|Tribbiani|[82349238942, 234...|   [2, 4, 5]|        joey@abc.com|       true|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|    Monica|   Geller|               [,,,]|         [2]|      monica@abc.com|       true|     1000.9|   2021-02-22|2021-02-28 07:33:00|
|  4|      Ross|   Geller|    [82349238942,,,]|          []|        ross@abc.com|       true|    1200.55|   2021-01-19|2021-02-18 01:10:00|
|  5|    Rachel|    

In [17]:
user_df.alias('u').selectExpr('u.*').show()

+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
| id|first_name|last_name|       phone_numbers|     courses|               email|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
|  1|    Pheobe|   Buffay|[82349238942, 234...|[1, 3, 5, 7]|pheobebuffay@abc.com|       true|    1000.55|   2021-01-13|2021-02-10 01:15:00|
|  2|      Joey|Tribbiani|[82349238942, 234...|   [2, 4, 5]|        joey@abc.com|       true|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|    Monica|   Geller|               [,,,]|         [2]|      monica@abc.com|       true|     1000.9|   2021-02-22|2021-02-28 07:33:00|
|  4|      Ross|   Geller|    [82349238942,,,]|          []|        ross@abc.com|       true|    1200.55|   2021-01-19|2021-02-18 01:10:00|
|  5|    Rachel|    

In [18]:
user_df.selectExpr('id', 'first_name', 'last_name').show()

+---+----------+---------+
| id|first_name|last_name|
+---+----------+---------+
|  1|    Pheobe|   Buffay|
|  2|      Joey|Tribbiani|
|  3|    Monica|   Geller|
|  4|      Ross|   Geller|
|  5|    Rachel|    Green|
|  6|  Chandler|     Bing|
+---+----------+---------+



In [19]:
user_df.selectExpr(['id', 'first_name', 'last_name']).show()

+---+----------+---------+
| id|first_name|last_name|
+---+----------+---------+
|  1|    Pheobe|   Buffay|
|  2|      Joey|Tribbiani|
|  3|    Monica|   Geller|
|  4|      Ross|   Geller|
|  5|    Rachel|    Green|
|  6|  Chandler|     Bing|
+---+----------+---------+



In [20]:
# Use SQL-like syntax to perform the transformation
user_df.selectExpr('id', 'first_name', 'last_name', "concat(first_name, ', ', last_name) AS full_name").show()

+---+----------+---------+---------------+
| id|first_name|last_name|      full_name|
+---+----------+---------+---------------+
|  1|    Pheobe|   Buffay| Pheobe, Buffay|
|  2|      Joey|Tribbiani|Joey, Tribbiani|
|  3|    Monica|   Geller| Monica, Geller|
|  4|      Ross|   Geller|   Ross, Geller|
|  5|    Rachel|    Green|  Rachel, Green|
|  6|  Chandler|     Bing| Chandler, Bing|
+---+----------+---------+---------------+



In [21]:
# Run SQL queries in Spark

# Register the dataframe as a temporary view
user_df.createOrReplaceTempView('users')

In [22]:
spark.sql("""SELECT id, first_name, last_name, concat(first_name, ', ', last_name) AS full_name FROM users""").show()

+---+----------+---------+---------------+
| id|first_name|last_name|      full_name|
+---+----------+---------+---------------+
|  1|    Pheobe|   Buffay| Pheobe, Buffay|
|  2|      Joey|Tribbiani|Joey, Tribbiani|
|  3|    Monica|   Geller| Monica, Geller|
|  4|      Ross|   Geller|   Ross, Geller|
|  5|    Rachel|    Green|  Rachel, Green|
|  6|  Chandler|     Bing| Chandler, Bing|
+---+----------+---------+---------------+



### col()

* Returns a Column based on the given column name.

In [46]:
help(col)

Help on function col in module pyspark.sql.functions:

col(col)
    Returns a :class:`Column` based on the given column name.
    
    .. versionadded:: 1.3



In [24]:
# Return column type object
user_df['id'], type(user_df['id'])

(Column<b'id'>, pyspark.sql.column.Column)

In [25]:
col('id'), type(col('id'))

(Column<b'id'>, pyspark.sql.column.Column)

In [30]:
# Applied to whole column
user_df.select((col('id') * 5).alias('Multiple of 5')).show()

+-------------+
|Multiple of 5|
+-------------+
|            5|
|           10|
|           15|
|           20|
|           25|
|           30|
+-------------+



In [32]:
# This fails: `u` is only an alias name known to Spark, not a Python variable in this session
user_df.alias('u').select(u['id'], col('first_name'), 'last_name').show()

NameError: name 'u' is not defined

In [33]:
# This works (here the alias u is passed inside a string)
user_df.alias('u').select('u.id', col('first_name'), 'last_name').show()

+---+----------+---------+
| id|first_name|last_name|
+---+----------+---------+
|  1|    Pheobe|   Buffay|
|  2|      Joey|Tribbiani|
|  3|    Monica|   Geller|
|  4|      Ross|   Geller|
|  5|    Rachel|    Green|
|  6|  Chandler|     Bing|
+---+----------+---------+



In [34]:
# This won't work: selectExpr only takes column names or SQL-style expressions as strings, not Column objects
user_df.selectExpr(col('id'), 'first_name', 'last_name').show()

TypeError: Column is not iterable

In [35]:
# Runs effortlessly
user_df.selectExpr('id', 'first_name', 'last_name').show()

+---+----------+---------+
| id|first_name|last_name|
+---+----------+---------+
|  1|    Pheobe|   Buffay|
|  2|      Joey|Tribbiani|
|  3|    Monica|   Geller|
|  4|      Ross|   Geller|
|  5|    Rachel|    Green|
|  6|  Chandler|     Bing|
+---+----------+---------+



In [41]:
user_df.alias('u').selectExpr("u.id", 'first_name', 'last_name', "concat(u.first_name, ' ', u.last_name)").show()

+---+----------+---------+--------------------------------+
| id|first_name|last_name|concat(first_name,  , last_name)|
+---+----------+---------+--------------------------------+
|  1|    Pheobe|   Buffay|                   Pheobe Buffay|
|  2|      Joey|Tribbiani|                  Joey Tribbiani|
|  3|    Monica|   Geller|                   Monica Geller|
|  4|      Ross|   Geller|                     Ross Geller|
|  5|    Rachel|    Green|                    Rachel Green|
|  6|  Chandler|     Bing|                   Chandler Bing|
+---+----------+---------+--------------------------------+



In [42]:
user_df.createOrReplaceTempView('users')

In [45]:
spark.sql("""
    SELECT id, first_name, last_name,
    concat(u.first_name, ' ', u.last_name) AS full_name
    FROM users AS u
""").show()

+---+----------+---------+--------------+
| id|first_name|last_name|     full_name|
+---+----------+---------+--------------+
|  1|    Pheobe|   Buffay| Pheobe Buffay|
|  2|      Joey|Tribbiani|Joey Tribbiani|
|  3|    Monica|   Geller| Monica Geller|
|  4|      Ross|   Geller|   Ross Geller|
|  5|    Rachel|    Green|  Rachel Green|
|  6|  Chandler|     Bing| Chandler Bing|
+---+----------+---------+--------------+



In [47]:
cols = ['id', 'first_name', 'last_name']
user_df.select(*cols).show()  # select() accepts strings, a list of strings, or unpacked (*cols) arguments

+---+----------+---------+
| id|first_name|last_name|
+---+----------+---------+
|  1|    Pheobe|   Buffay|
|  2|      Joey|Tribbiani|
|  3|    Monica|   Geller|
|  4|      Ross|   Geller|
|  5|    Rachel|    Green|
|  6|  Chandler|     Bing|
+---+----------+---------+



There are quite a few functions available on the Column type:
* `cast` (can be used on all important dataframe functions such as **select**, **filter**, **groupBy**, **orderBy** etc)
* `asc`, `desc` (typically used as part of **sort** or **orderBy**)
* `contains` (typically used as part of **filter** or **where**)
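
The three Column methods listed above can be combined in one chain. The following is a sketch only: `customers_by_amount` is a hypothetical helper, and it assumes a dataframe with the `id`, `email` and `amount_paid` columns of `user_df`. Columns are taken with `df['name']`, which returns the same kind of Column object as `col('name')`.

```python
def customers_by_amount(df, email_fragment):
    """Filter on email, sort by amount_paid descending, and cast id to string."""
    return (
        # contains: keep rows whose email includes the given fragment
        df.filter(df["email"].contains(email_fragment))
        # desc: highest amount_paid first
        .orderBy(df["amount_paid"].desc())
        # cast: turn the long id into a string, keeping the column name
        .select(df["id"].cast("string").alias("id"), "email", "amount_paid")
    )
```

A call such as `customers_by_amount(user_df, 'abc.com').printSchema()` should then report `id` as a string column.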

In [49]:
user_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: struct (nullable = true)
 |    |-- mobile: string (nullable = true)
 |    |-- home: string (nullable = true)
 |    |-- office: string (nullable = true)
 |    |-- shop: string (nullable = true)
 |-- courses: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- email: string (nullable = true)
 |-- is_customer: boolean (nullable = true)
 |-- amount_paid: double (nullable = true)
 |-- customer_from: date (nullable = true)
 |-- last_updated_ts: timestamp (nullable = true)



In [50]:
user_df.select(
    col('id'),
    date_format('customer_from', 'yyyyMMdd').alias('customer_from')
).show()

+---+-------------+
| id|customer_from|
+---+-------------+
|  1|     20210113|
|  2|     20210214|
|  3|     20210222|
|  4|     20210119|
|  5|     20210224|
|  6|     20210222|
+---+-------------+



In [51]:
# NOTE: customer_from converted to string type
user_df.select(
    col('id'),
    date_format('customer_from', 'yyyyMMdd').alias('customer_from')
).printSchema()

root
 |-- id: long (nullable = true)
 |-- customer_from: string (nullable = true)



In [52]:
# NOTE: cast customer_from to integer type
user_df.select(
    col('id'),
    date_format('customer_from', 'yyyyMMdd').cast('int').alias('customer_from')
).printSchema()

root
 |-- id: long (nullable = true)
 |-- customer_from: integer (nullable = true)



In [53]:
# OR
cols = [col('id'), date_format('customer_from', 'yyyyMMdd').cast('int').alias('customer_from')]
user_df.select(*cols).show()

+---+-------------+
| id|customer_from|
+---+-------------+
|  1|     20210113|
|  2|     20210214|
|  3|     20210222|
|  4|     20210119|
|  5|     20210224|
|  6|     20210222|
+---+-------------+



In [59]:
user_df.select(*cols).dtypes

[('id', 'bigint'), ('customer_from', 'int')]

### concat()

* Concatenates multiple columns into a single column. It works with strings as well as binary and compatible array columns.

In [54]:
help(concat)

Help on function concat in module pyspark.sql.functions:

concat(*cols)
    Concatenates multiple input columns together into a single column.
    The function works with strings, binary and compatible array columns.
    
    >>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
    >>> df.select(concat(df.s, df.d).alias('s')).collect()
    [Row(s='abcd123')]
    
    >>> df = spark.createDataFrame([([1, 2], [3, 4], [5]), ([1, 2], None, [3])], ['a', 'b', 'c'])
    >>> df.select(concat(df.a, df.b, df.c).alias("arr")).collect()
    [Row(arr=[1, 2, 3, 4, 5]), Row(arr=None)]
    
    .. versionadded:: 1.5



In [55]:
full_name = concat(col('first_name'), lit(', '), col('last_name'))

In [56]:
full_name_alias = full_name.alias('full_name')

In [58]:
user_df.select('id', full_name_alias).show()

+---+---------------+
| id|      full_name|
+---+---------------+
|  1| Pheobe, Buffay|
|  2|Joey, Tribbiani|
|  3| Monica, Geller|
|  4|   Ross, Geller|
|  5|  Rachel, Green|
|  6| Chandler, Bing|
+---+---------------+



### lit()

In [62]:
help(lit)

Help on function lit in module pyspark.sql.functions:

lit(col)
    Creates a :class:`Column` of literal value.
    
    >>> df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)
    [Row(height=5, spark_user=True)]
    
    .. versionadded:: 1.3



In [64]:
user_df.createOrReplaceTempView('users')

In [66]:
spark.sql("""
        SELECT id, amount_paid + 25 AS amount_paid
        FROM users
""").show()

+---+-----------+
| id|amount_paid|
+---+-----------+
|  1|    1025.55|
|  2|      925.0|
|  3|     1025.9|
|  4|    1225.55|
|  5|        NaN|
|  6|     1025.8|
+---+-----------+



In [70]:
user_df.selectExpr('id', "(amount_paid + 25) AS amount_paid").show()

+---+-----------+
| id|amount_paid|
+---+-----------+
|  1|    1025.55|
|  2|      925.0|
|  3|     1025.9|
|  4|    1225.55|
|  5|        NaN|
|  6|     1025.8|
+---+-----------+



In [72]:
# This works: Spark converts the Python literal 25 into a Column automatically
user_df.select("id", col("amount_paid") + 25).show()

+---+------------------+
| id|(amount_paid + 25)|
+---+------------------+
|  1|           1025.55|
|  2|             925.0|
|  3|            1025.9|
|  4|           1225.55|
|  5|               NaN|
|  6|            1025.8|
+---+------------------+



In [75]:
# This does not give the intended result: the Python string "amount_paid" is treated as a string literal, not a column name, so every row is null
user_df.select("id", "amount_paid" + lit(25)).show()

+---+------------------+
| id|(25 + amount_paid)|
+---+------------------+
|  1|              null|
|  2|              null|
|  3|              null|
|  4|              null|
|  5|              null|
|  6|              null|
+---+------------------+



In [76]:
# Best option is to make sure both of them are column type object to perform addition
user_df.select("id", col("amount_paid") + lit(25)).show()

+---+------------------+
| id|(amount_paid + 25)|
+---+------------------+
|  1|           1025.55|
|  2|             925.0|
|  3|            1025.9|
|  4|           1225.55|
|  5|               NaN|
|  6|            1025.8|
+---+------------------+



In [77]:
lit(25)

Column<b'25'>

### Rename spark columns or expressions

There are multiple ways to rename Spark DataFrame columns or expressions.

* Using `alias` as part of `select`
* Using `withColumn` on top of the dataframe
* Using `withColumnRenamed` on top of the dataframe
* Typically `withColumn` is used to perform row-level transformations and give the result a name. If the name matches an existing column, that column is replaced with the new one.
* If we just want to rename a column, it is better to use `withColumnRenamed`.
* If we want to apply any transformation, we need to use either `select` or `withColumn`.
* We can rename a batch of columns at once using `toDF`, which assigns the new names by position; changing the column order is done with `select`.
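
`toDF` appears in the list above but is not demonstrated in the cells that follow, so here is a minimal sketch. The helper names `rename_with_prefix` and `reorder` are hypothetical. The sketch relies on `toDF` assigning the new names by position (one name per existing column) and uses `select` for reordering.

```python
def rename_with_prefix(df, prefix):
    """Rename every column by prepending prefix, keeping the column order."""
    new_names = [f"{prefix}{name}" for name in df.columns]
    # toDF assigns names by position, so it needs exactly one name per column.
    return df.toDF(*new_names)


def reorder(df, ordered_names):
    """Return the same columns in a new order; this is a projection via select."""
    return df.select(*ordered_names)
```

For instance, `rename_with_prefix(user_df.select('id', 'first_name', 'last_name'), 'user_')` should yield the columns `user_id`, `user_first_name` and `user_last_name`.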

In [78]:
# Using select and alias to create new column
user_df.select('id', 'first_name', 'last_name',
              concat('first_name', lit(', '), 'last_name').alias('full_name')
              ).show()

+---+----------+---------+---------------+
| id|first_name|last_name|      full_name|
+---+----------+---------+---------------+
|  1|    Pheobe|   Buffay| Pheobe, Buffay|
|  2|      Joey|Tribbiani|Joey, Tribbiani|
|  3|    Monica|   Geller| Monica, Geller|
|  4|      Ross|   Geller|   Ross, Geller|
|  5|    Rachel|    Green|  Rachel, Green|
|  6|  Chandler|     Bing| Chandler, Bing|
+---+----------+---------+---------------+



* Add another column named `course_count` that contains the number of courses each user is enrolled in.

In [84]:
# Using withColumn
user_df.select('id', 'first_name', 'last_name'). \
              withColumn('full_name', concat('first_name', lit(', '), 'last_name')
              ).show()

+---+----------+---------+---------------+
| id|first_name|last_name|      full_name|
+---+----------+---------+---------------+
|  1|    Pheobe|   Buffay| Pheobe, Buffay|
|  2|      Joey|Tribbiani|Joey, Tribbiani|
|  3|    Monica|   Geller| Monica, Geller|
|  4|      Ross|   Geller|   Ross, Geller|
|  5|    Rachel|    Green|  Rachel, Green|
|  6|  Chandler|     Bing| Chandler, Bing|
+---+----------+---------+---------------+



In [87]:
# Using withColumn with an existing column name replaces that column's data (misleading here: first_name now holds the full name)
user_df.select('id', 'first_name', 'last_name'). \
              withColumn('first_name', concat('first_name', lit(', '), 'last_name')
              ).show()

+---+---------------+---------+
| id|     first_name|last_name|
+---+---------------+---------+
|  1| Pheobe, Buffay|   Buffay|
|  2|Joey, Tribbiani|Tribbiani|
|  3| Monica, Geller|   Geller|
|  4|   Ross, Geller|   Geller|
|  5|  Rachel, Green|    Green|
|  6| Chandler, Bing|     Bing|
+---+---------------+---------+



In [79]:
user_df.show()

+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
| id|first_name|last_name|       phone_numbers|     courses|               email|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+---------+--------------------+------------+--------------------+-----------+-----------+-------------+-------------------+
|  1|    Pheobe|   Buffay|[82349238942, 234...|[1, 3, 5, 7]|pheobebuffay@abc.com|       true|    1000.55|   2021-01-13|2021-02-10 01:15:00|
|  2|      Joey|Tribbiani|[82349238942, 234...|   [2, 4, 5]|        joey@abc.com|       true|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|    Monica|   Geller|               [,,,]|         [2]|      monica@abc.com|       true|     1000.9|   2021-02-22|2021-02-28 07:33:00|
|  4|      Ross|   Geller|    [82349238942,,,]|          []|        ross@abc.com|       true|    1200.55|   2021-01-19|2021-02-18 01:10:00|
|  5|    Rachel|    

In [89]:
# Using withColumn
user_df.select('id', 'courses'). \
              withColumn('course_count', size('courses')).show()

+---+------------+------------+
| id|     courses|course_count|
+---+------------+------------+
|  1|[1, 3, 5, 7]|           4|
|  2|   [2, 4, 5]|           3|
|  3|         [2]|           1|
|  4|          []|           0|
|  5|         [3]|           1|
|  6|      [2, 4]|           2|
+---+------------+------------+



* Rename `id` to `user_id`
* Rename `first_name` to `user_first_name`
* Rename `last_name` to `user_last_name`

In [96]:
help(user_df.withColumnRenamed)

Help on method withColumnRenamed in module pyspark.sql.dataframe:

withColumnRenamed(existing, new) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` by renaming an existing column.
    This is a no-op if schema doesn't contain the given column name.
    
    :param existing: string, name of the existing column to rename.
    :param new: string, new name of the column.
    
    >>> df.withColumnRenamed('age', 'age2').collect()
    [Row(age2=2, name='Alice'), Row(age2=5, name='Bob')]
    
    .. versionadded:: 1.3



In [99]:
user_df.select('id', 'first_name', 'last_name'). \
              withColumnRenamed('id', 'user_id'). \
              withColumnRenamed('first_name', 'user_first_name'). \
              withColumnRenamed('last_name', 'user_last_name').show()

+-------+---------------+--------------+
|user_id|user_first_name|user_last_name|
+-------+---------------+--------------+
|      1|         Pheobe|        Buffay|
|      2|           Joey|     Tribbiani|
|      3|         Monica|        Geller|
|      4|           Ross|        Geller|
|      5|         Rachel|         Green|
|      6|       Chandler|          Bing|
+-------+---------------+--------------+
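
When many columns need renaming, the chain of `withColumnRenamed` calls can be driven from a dictionary. This is a sketch under the assumption that each key is an existing column name; `rename_columns` is a hypothetical helper, not something provided by PySpark.

```python
from functools import reduce


def rename_columns(df, mapping):
    """Apply withColumnRenamed once per (old, new) pair in mapping."""
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        mapping.items(),
        df,  # start from the original dataframe; each step returns a new one
    )
```

Calling `rename_columns(user_df.select('id', 'first_name', 'last_name'), {'id': 'user_id', 'first_name': 'user_first_name', 'last_name': 'user_last_name'})` should produce the same column names as the chained version. Recent Spark releases also provide `withColumnsRenamed`, which accepts such a dictionary directly.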

