What a column is and how to reference it
How to create a column expression

In [1]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

from lib.logger import Log4j

spark = SparkSession \
            .builder \
            .master("local[3]") \
            .appName("With Column Demo") \
            .getOrCreate()

logger = Log4j(spark)
logger.info("Starting HelloSparkSQL")

In [2]:
data = [
    ('James','','Smith','1991-04-01','M',3000),
    ('Michael','Rose','','2000-05-19','M',4000),
    ('Robert','','Williams','1978-09-05','M',4000),
    ('Maria','Anne','Jones','1967-12-01','F',4000),
    ('Jen','Mary','Brown','1980-02-17','F',-1)
]

In [3]:
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)

In [4]:
df.show(5)
df.printSchema()

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)



We can reference a column in three ways, as shown below: column("name"), col("name"), or the DataFrame attribute df.name. This is how we refer to individual columns.

In [8]:
# Explicit imports avoid shadowing Python builtins such as sum, min, and max
from pyspark.sql.functions import col, column, concat

In [9]:
df.select(column("firstname"),col("lastname"),df.gender).show(5)

+---------+--------+------+
|firstname|lastname|gender|
+---------+--------+------+
|    James|   Smith|     M|
|  Michael|        |     M|
|   Robert|Williams|     M|
|    Maria|   Jones|     F|
|      Jen|   Brown|     F|
+---------+--------+------+



How to create column expressions. There are two types:
1. String/SQL expressions
2. Column object expressions

1. String/SQL Expression
Let's refer to the columns by name, using strings.
The cells after this one combine firstname, middlename, and lastname to form a full name.

In [10]:
df.select("firstname","middlename","lastname","dob","gender").show()

+---------+----------+--------+----------+------+
|firstname|middlename|lastname|       dob|gender|
+---------+----------+--------+----------+------+
|    James|          |   Smith|1991-04-01|     M|
|  Michael|      Rose|        |2000-05-19|     M|
|   Robert|          |Williams|1978-09-05|     M|
|    Maria|      Anne|   Jones|1967-12-01|     F|
|      Jen|      Mary|   Brown|1980-02-17|     F|
+---------+----------+--------+----------+------+



Column Object Expression

In [11]:
from pyspark.sql.functions import concat

# Build the full name from Column objects and give the result a readable name
df.select(concat(df.firstname, df.middlename, df.lastname).alias("fullName"), "dob", "gender").show()

+--------------+----------+------+
|      fullName|       dob|gender|
+--------------+----------+------+
|    JamesSmith|1991-04-01|     M|
|   MichaelRose|2000-05-19|     M|
|RobertWilliams|1978-09-05|     M|
|MariaAnneJones|1967-12-01|     F|
|  JenMaryBrown|1980-02-17|     F|
+--------------+----------+------+



String Expression: the same concat, with the columns passed by name as strings

In [12]:
df.select(concat("firstname","middlename","lastname"),"dob","gender").show()

+---------------------------------------+----------+------+
|concat(firstname, middlename, lastname)|       dob|gender|
+---------------------------------------+----------+------+
|                             JamesSmith|1991-04-01|     M|
|                            MichaelRose|2000-05-19|     M|
|                         RobertWilliams|1978-09-05|     M|
|                         MariaAnneJones|1967-12-01|     F|
|                           JenMaryBrown|1980-02-17|     F|
+---------------------------------------+----------+------+



In [13]:
from pyspark.sql.functions import col

# withColumn returns a new DataFrame, so chain the casts to keep both
df.withColumn("salary", col("salary").cast("integer")) \
  .withColumn("dob", col("dob").cast("date"))

DataFrame[firstname: string, middlename: string, lastname: string, dob: date, gender: string, salary: int]

In [14]:
spark.stop()