PySpark date and timestamp functions are supported on both DataFrames and SQL queries, and they work much like their traditional SQL counterparts. Dates and times matter a great deal when you use PySpark for ETL. Most of these functions accept a Date type, a Timestamp type, or a String as input. If a String is used, it should be in a default format that can be cast to a date.

    DateType default format is yyyy-MM-dd
    TimestampType default format is yyyy-MM-dd HH:mm:ss.SSSS
    These functions return null if the input is a string that cannot be cast to Date or Timestamp.
    
PySpark SQL provides several date and timestamp functions, so it is worth getting to know them. Prefer these functions over writing your own user-defined functions (UDFs): the built-ins are compile-time safe, handle null, and perform better than PySpark UDFs. If your PySpark application is performance-critical, avoid custom UDFs wherever possible, as their performance is not guaranteed.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

In [14]:
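from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, DateType, IntegerType, TimestampType, StringType

# Explicit schema for the tab-separated ratings file; date1 holds Unix epoch seconds as a string
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("movie_id", IntegerType(), True),
    StructField("rating", IntegerType(), True),
    StructField("date1", StringType(), True)
  ])
# An explicit schema is supplied, so the inferSchema option is not needed
uDataDF = spark.read.format("csv").option("header", "false").schema(schema).load("u.data.csv", sep='\t')
# from_unixtime() renders epoch seconds as a 'yyyy-MM-dd HH:mm:ss' string in the session time zone
uDataDF = uDataDF.withColumn("date2", from_unixtime(col("date1")))
# date is a string column, so orderBy("date") sorts it lexicographically (MM/dd/yyyy), not chronologically
uDataDF = uDataDF.withColumn("date", date_format(col("date2"), "MM/dd/yyyy")).orderBy("date")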

In [15]:
uDataDF.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- movie_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- date1: string (nullable = true)
 |-- date2: string (nullable = true)
 |-- date: string (nullable = true)



In [16]:
uDataDF.show()

+-------+--------+------+---------+-------------------+----------+
|user_id|movie_id|rating|    date1|              date2|      date|
+-------+--------+------+---------+-------------------+----------+
|     66|     181|     5|883601425|1998-01-01 02:20:25|01/01/1998|
|      6|      69|     3|883601277|1998-01-01 02:17:57|01/01/1998|
|      6|     357|     4|883602422|1998-01-01 02:37:02|01/01/1998|
|      6|     517|     4|883602212|1998-01-01 02:33:32|01/01/1998|
|     66|     298|     4|883601324|1998-01-01 02:18:44|01/01/1998|
|      6|      86|     3|883603013|1998-01-01 02:46:53|01/01/1998|
|     66|     258|     4|883601089|1998-01-01 02:14:49|01/01/1998|
|     66|       1|     3|883601324|1998-01-01 02:18:44|01/01/1998|
|      6|      98|     5|883600680|1998-01-01 02:08:00|01/01/1998|
|      6|     492|     5|883601089|1998-01-01 02:14:49|01/01/1998|
|     66|     877|     1|883601089|1998-01-01 02:14:49|01/01/1998|
|     66|       7|     3|883601355|1998-01-01 02:19:15|01/01/1
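
The date2 values above depend on the Spark session time zone, because from_unixtime() renders epoch seconds in that zone. As a sanity check outside Spark, the sketch below converts the first epoch value from the output using only Python's standard library. The UTC+05:30 offset is an assumption inferred from the displayed values; the notebook itself does not state the time zone.

```python
from datetime import datetime, timedelta, timezone

epoch_seconds = 883601425  # date1 value from the first row of the output

# The same instant expressed in UTC.
utc_time = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# Assumed session time zone (UTC+05:30), inferred from the date2 column.
assumed_zone = timezone(timedelta(hours=5, minutes=30))
local_time = utc_time.astimezone(assumed_zone)

utc_text = utc_time.strftime("%Y-%m-%d %H:%M:%S")
local_text = local_time.strftime("%Y-%m-%d %H:%M:%S")

print(utc_text)    # 1997-12-31 20:50:25
print(local_text)  # 1998-01-01 02:20:25
```

The second value matches the date2 column, so the same file processed in a session with a different time zone would show different date2 and date values.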

In [17]:
data=[["1","2020-02-01"],["2","2019-03-01"],["3","2021-03-01"]]
df=spark.createDataFrame(data,["id","input"])
df.show()

+---+----------+
| id|     input|
+---+----------+
|  1|2020-02-01|
|  2|2019-03-01|
|  3|2021-03-01|
+---+----------+



In [19]:

df1 = df.withColumn("current_date",current_date())

df1.show()

+---+----------+------------+
| id|     input|current_date|
+---+----------+------------+
|  1|2020-02-01|  2022-01-21|
|  2|2019-03-01|  2022-01-21|
|  3|2021-03-01|  2022-01-21|
+---+----------+------------+



In [31]:
df1.printSchema()

root
 |-- id: string (nullable = true)
 |-- input: string (nullable = true)
 |-- current_date: date (nullable = false)



In [26]:
# date_format(): render a date (or a date string) as a string in the given pattern
df2 = df.select(col("*"), 
    date_format(col("input"), "MM-dd-yyyy").alias("date_format") 
  )
df2.show()

+---+----------+-----------+
| id|     input|date_format|
+---+----------+-----------+
|  1|2020-02-01| 02-01-2020|
|  2|2019-03-01| 03-01-2019|
|  3|2021-03-01| 03-01-2021|
+---+----------+-----------+



In [27]:
df2.printSchema()

root
 |-- id: string (nullable = true)
 |-- input: string (nullable = true)
 |-- date_format: string (nullable = true)



to_date()

The example below converts a string in yyyy-MM-dd format to a DateType column using to_date(). You can also pass a pattern to parse strings written in other formats. PySpark supports patterns based on Java's DateTimeFormatter.

In [28]:
df3 = df.select(col("input"), 
    to_date(col("input"), "yyyy-MM-dd").alias("to_date") 
  )
df3.show()

+----------+----------+
|     input|   to_date|
+----------+----------+
|2020-02-01|2020-02-01|
|2019-03-01|2019-03-01|
|2021-03-01|2021-03-01|
+----------+----------+



In [29]:
df3.printSchema()

root
 |-- input: string (nullable = true)
 |-- to_date: date (nullable = true)



In [32]:
# datediff(): number of days from the second argument to the first
df4 = df.select(col("input"), 
    datediff(current_date(),col("input")).alias("datediff")  
  )

In [33]:
df4.show()

+----------+--------+
|     input|datediff|
+----------+--------+
|2020-02-01|     721|
|2019-03-01|    1058|
|2021-03-01|     327|
+----------+--------+



In [34]:
df4.printSchema()

root
 |-- input: string (nullable = true)
 |-- datediff: integer (nullable = true)
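
The numbers in the datediff column can be reproduced with plain date arithmetic. The sketch below uses only Python's standard library. The reference date 2022-01-22 is an assumption: it is the date that reproduces the output, and it matches the current_timestamp() result shown later in the notebook, which suggests the session ran past midnight after the current_date() cell.

```python
from datetime import date

# Assumed value of current_date() when the datediff cell was executed.
run_date = date(2022, 1, 22)

inputs = ["2020-02-01", "2019-03-01", "2021-03-01"]

# Days elapsed from each input date to the run date (end minus start).
day_diffs = {text: (run_date - date.fromisoformat(text)).days for text in inputs}

for text, days in day_diffs.items():
    print(text, days)
```

Because the result depends on the day the code runs, rerunning the notebook today yields larger numbers.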



In [38]:
df5 = df3.select(col("to_date"), 
    datediff(current_date(),col("to_date")).alias("datediff")  
  )

In [39]:
df5.show()

+----------+--------+
|   to_date|datediff|
+----------+--------+
|2020-02-01|     721|
|2019-03-01|    1058|
|2021-03-01|     327|
+----------+--------+



In [40]:
df5.printSchema()

root
 |-- to_date: date (nullable = true)
 |-- datediff: integer (nullable = true)



In [43]:
# months_between()
# The below example returns the months between two dates using months_between().

df6 = df3.select(col("to_date"), 
    months_between(current_date(),col("to_date")).alias("months_between")  
  )
df6.show()

+----------+--------------+
|   to_date|months_between|
+----------+--------------+
|2020-02-01|   23.67741935|
|2019-03-01|   34.67741935|
|2021-03-01|   10.67741935|
+----------+--------------+



In [44]:
df6.printSchema()

root
 |-- to_date: date (nullable = true)
 |-- months_between: double (nullable = true)
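
The fractional part in the output comes from how months_between() treats partial months: when the two dates do not fall on the same day of the month (and are not both month ends), the remaining days are divided by 31, and the result is rounded to 8 decimal places by default. The function below is a simplified standard-library sketch of that rule for plain dates; the name months_between_sketch is hypothetical, it ignores time of day, and the run date 2022-01-22 is an assumption that reproduces the output above.

```python
import calendar
from datetime import date


def is_month_end(d):
    # True when d is the last day of its month.
    return d.day == calendar.monthrange(d.year, d.month)[1]


def months_between_sketch(end, start):
    # Whole months from the year and month fields alone.
    whole_months = (end.year - start.year) * 12 + (end.month - start.month)
    # Same day of month, or both month ends: the result is a whole number.
    if end.day == start.day or (is_month_end(end) and is_month_end(start)):
        return float(whole_months)
    # Otherwise the day difference counts as a fraction of a 31-day month.
    return round(whole_months + (end.day - start.day) / 31, 8)


run_date = date(2022, 1, 22)  # assumed value of current_date()
results = [
    months_between_sketch(run_date, date(2020, 2, 1)),
    months_between_sketch(run_date, date(2019, 3, 1)),
    months_between_sketch(run_date, date(2021, 3, 1)),
]
print(results)
```

For the first row this is 23 whole months plus 21/31 of a month, which gives 23.67741935.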



In [46]:
# add_months(), date_add(), date_sub()
# Here we add and subtract days and months from a given input date.

df.select(col("input"), 
    add_months(col("input"),3).alias("add_months"), 
    add_months(col("input"),-3).alias("sub_months"), 
    date_add(col("input"),4).alias("date_add"), 
    date_add(col("input"),-4).alias("date_minus"),  # a negative day count subtracts days
    date_sub(col("input"),4).alias("date_sub")
  ).show()

+----------+----------+----------+----------+----------+----------+
|     input|add_months|sub_months|  date_add|date_minus|  date_sub|
+----------+----------+----------+----------+----------+----------+
|2020-02-01|2020-05-01|2019-11-01|2020-02-05|2020-01-28|2020-01-28|
|2019-03-01|2019-06-01|2018-12-01|2019-03-05|2019-02-25|2019-02-25|
|2021-03-01|2021-06-01|2020-12-01|2021-03-05|2021-02-25|2021-02-25|
+----------+----------+----------+----------+----------+----------+



In [47]:
df7 = df.select(col("input"), 
    add_months(col("input"),3).alias("add_months"), 
    add_months(col("input"),-3).alias("sub_months"), 
    date_add(col("input"),4).alias("date_add"), 
    date_add(col("input"),-4).alias("date_minus"),  # distinct alias avoids a duplicate column name
    date_sub(col("input"),4).alias("date_sub")
  )

In [48]:
df7.printSchema()

root
 |-- input: string (nullable = true)
 |-- add_months: date (nullable = true)
 |-- sub_months: date (nullable = true)
 |-- date_add: date (nullable = true)
 |-- date_minus: date (nullable = true)
 |-- date_sub: date (nullable = true)



In [51]:
# year(), month(), next_day(), weekofyear()

df8 = df.select(col("input"), 
     year(col("input")).alias("year"), 
     month(col("input")).alias("month"), 
     next_day(col("input"),"Sunday").alias("next_day"), 
     weekofyear(col("input")).alias("weekofyear") 
  )
df8.show()

+----------+----+-----+----------+----------+
|     input|year|month|  next_day|weekofyear|
+----------+----+-----+----------+----------+
|2020-02-01|2020|    2|2020-02-02|         5|
|2019-03-01|2019|    3|2019-03-03|         9|
|2021-03-01|2021|    3|2021-03-07|         9|
+----------+----+-----+----------+----------+



In [52]:
df8.printSchema()

root
 |-- input: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- next_day: date (nullable = true)
 |-- weekofyear: integer (nullable = true)



In [53]:
# dayofweek(), dayofmonth(), dayofyear()
df.select(col("input"),  
     dayofweek(col("input")).alias("dayofweek"), 
     dayofmonth(col("input")).alias("dayofmonth"), 
     dayofyear(col("input")).alias("dayofyear"), 
  ).show()

+----------+---------+----------+---------+
|     input|dayofweek|dayofmonth|dayofyear|
+----------+---------+----------+---------+
|2020-02-01|        7|         1|       32|
|2019-03-01|        6|         1|       60|
|2021-03-01|        2|         1|       60|
+----------+---------+----------+---------+
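
The dayofweek values follow Spark's convention of 1 = Sunday through 7 = Saturday, which differs from Python's ISO numbering (1 = Monday through 7 = Sunday), and weekofyear() from the earlier cell uses ISO 8601 week numbers. The sketch below reproduces the values from both outputs with the standard library only; the helper name spark_style_parts is hypothetical.

```python
from datetime import date


def spark_style_parts(d):
    return {
        # ISO weekday is 1 (Monday) .. 7 (Sunday); shift it to 1 (Sunday) .. 7 (Saturday).
        "dayofweek": d.isoweekday() % 7 + 1,
        "dayofmonth": d.day,
        # Day number within the year, starting at 1.
        "dayofyear": d.timetuple().tm_yday,
        # ISO 8601 week number.
        "weekofyear": d.isocalendar()[1],
    }


parts = {
    text: spark_style_parts(date.fromisoformat(text))
    for text in ["2020-02-01", "2019-03-01", "2021-03-01"]
}

for text, values in parts.items():
    print(text, values)
```

For example, 2020-02-01 was a Saturday, so dayofweek is 7, and it is day 32 of the year.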



In [54]:
# current_timestamp()
# The following timestamp functions can be used in SQL and on DataFrames. Let’s learn them with examples.

data=[["1","02-01-2020 11 01 19 06"],["2","03-01-2019 12 01 19 406"],["3","03-01-2021 12 01 19 406"]]
df2=spark.createDataFrame(data,["id","input"])
df2.show(truncate=False)

+---+-----------------------+
|id |input                  |
+---+-----------------------+
|1  |02-01-2020 11 01 19 06 |
|2  |03-01-2019 12 01 19 406|
|3  |03-01-2021 12 01 19 406|
+---+-----------------------+



In [55]:
df2.printSchema()

root
 |-- id: string (nullable = true)
 |-- input: string (nullable = true)



In [56]:
df9 = df2.select(current_timestamp().alias("current_timestamp")
  )
df9.show(1,truncate=False)

+-----------------------+
|current_timestamp      |
+-----------------------+
|2022-01-22 01:07:58.603|
+-----------------------+
only showing top 1 row



In [57]:
df9.printSchema()

root
 |-- current_timestamp: timestamp (nullable = false)



In [58]:
# to_timestamp()
# Converts a timestamp string to TimestampType using the given pattern.

df2.select(col("input"), 
    to_timestamp(col("input"), "MM-dd-yyyy HH mm ss SSS").alias("to_timestamp") 
  ).show(truncate=False)

+-----------------------+-----------------------+
|input                  |to_timestamp           |
+-----------------------+-----------------------+
|02-01-2020 11 01 19 06 |2020-02-01 11:01:19.06 |
|03-01-2019 12 01 19 406|2019-03-01 12:01:19.406|
|03-01-2021 12 01 19 406|2021-03-01 12:01:19.406|
+-----------------------+-----------------------+



In [60]:
df2.select(col("input"), 
    to_timestamp(col("input"), "MM-dd-yyyy HH mm ss SSS").alias("to_timestamp") 
  ).printSchema()

root
 |-- input: string (nullable = true)
 |-- to_timestamp: timestamp (nullable = true)



In [62]:
# hour(), minute() and second()
data=[["1","2020-02-01 11:01:19.06"],["2","2019-03-01 12:01:19.406"],["3","2021-03-01 12:01:19.406"]]
df3=spark.createDataFrame(data,["id","input"])

df3.select(col("input"), 
    hour(col("input")).alias("hour"), 
    minute(col("input")).alias("minute"),
    second(col("input")).alias("second") 
  ).show(truncate=False)

+-----------------------+----+------+------+
|input                  |hour|minute|second|
+-----------------------+----+------+------+
|2020-02-01 11:01:19.06 |11  |1     |19    |
|2019-03-01 12:01:19.406|12  |1     |19    |
|2021-03-01 12:01:19.406|12  |1     |19    |
+-----------------------+----+------+------+



https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html