**Handling the Non-Standard Date Format in PySpark:**

In PySpark, the reader option `option("dateFormat", "<pattern>")` specifies the pattern used to parse date columns when you load data into a DataFrame. Without it, PySpark expects dates in its default format, so date strings written in any other pattern are not parsed correctly.

Here's how you might use option("dateFormat", "") when loading a dataset (like a CSV or JSON) that contains date values:

**Example 1: Reading a CSV file with a custom date format**

Let's say your CSV file has a column of dates in the format MM-dd-yyyy, and you want to load the data into a DataFrame.

In [20]:
! cat /Users/sugumarsrinivasan/Documents/data/orders_sample2.csv

1,07-27-2013,30265,CLOSED
2,11-25-2013,20386,CLOSED
3,01-15-2014,15768,COMPLETE
4,07-14-2014,27181,PROCESSING
5,03-08-2014,12448,COMPLETE
6,08-20-2014,49340,CLOSED
7,09-12-2014,13801,PROCESSING
8,04-23-2014,28523,PENDING_PAYMENT
9,07-01-2014,26329,CLOSED
10,07-29-2013,38797,COMPLETE


In [None]:
import findspark
findspark.init()
import getpass
from pyspark.sql import SparkSession

username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config("spark.sql.catalogImplementation", "hive"). \
    config("spark.sql.warehouse.dir",f"/Users/{username}/Documents/data/warehouse"). \
    enableHiveSupport(). \
    master("local"). \
    getOrCreate()

In [2]:
orders_schema = ("order_id integer, order_date date, customer_id long, order_status string")

In [17]:
df = spark.read \
.format("csv") \
.option("dateFormat","MM-dd-yyyy") \
.schema(orders_schema) \
.load("/Users/sugumarsrinivasan/Documents/data/orders_sample2.csv")

**Explanation:**

-   option("dateFormat", "MM-dd-yyyy"): Tells PySpark to parse date columns using the pattern MM-dd-yyyy.
-   schema(orders_schema): Applies the schema defined in orders_schema instead of inferring one. Because order_date is declared as date there, the dateFormat pattern is used to parse that column.

In [18]:
df.show()

+--------+----------+-----------+---------------+
|order_id|order_date|customer_id|   order_status|
+--------+----------+-----------+---------------+
|       1|2013-07-27|      30265|         CLOSED|
|       2|2013-11-25|      20386|         CLOSED|
|       3|2014-01-15|      15768|       COMPLETE|
|       4|2014-07-14|      27181|     PROCESSING|
|       5|2014-03-08|      12448|       COMPLETE|
|       6|2014-08-20|      49340|         CLOSED|
|       7|2014-09-12|      13801|     PROCESSING|
|       8|2014-04-23|      28523|PENDING_PAYMENT|
|       9|2014-07-01|      26329|         CLOSED|
|      10|2013-07-29|      38797|       COMPLETE|
+--------+----------+-----------+---------------+



In [7]:
df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: date (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- order_status: string (nullable = true)



**Notes:**

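-   The date format string must follow Spark's datetime pattern syntax. Spark 3.0 and later use patterns based on Java's DateTimeFormatter, while Spark 2.x used Java's SimpleDateFormat. Common date patterns include: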

    -   yyyy-MM-dd for a date like 2024-12-27
    -   MM/dd/yyyy for a date like 12/27/2024
    -   yyyy-MM-dd'T'HH:mm:ss for a timestamp like 2024-12-27T14:30:00

-   You can also specify additional options like timestampFormat if you are dealing with timestamp columns instead of just dates.
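
As a minimal sketch of combining both options, the helper below collects the reader options in a dictionary and passes them with `options(**...)`. Only `dateFormat` and `timestampFormat` come from the text above; the function name `read_events`, the `created_at` column, the schema string, and the file layout are hypothetical.

```python
# Reader options for a hypothetical CSV file whose rows look like:
#   1,07-27-2013,2013-07-27T14:30:00
CSV_OPTIONS = {
    "dateFormat": "MM-dd-yyyy",                  # applied to date columns
    "timestampFormat": "yyyy-MM-dd'T'HH:mm:ss",  # applied to timestamp columns
}

# Hypothetical schema with one date column and one timestamp column.
EVENTS_SCHEMA = "order_id integer, order_date date, created_at timestamp"


def read_events(spark, path):
    """Read a CSV file, parsing dates and timestamps with CSV_OPTIONS.

    `spark` is an existing SparkSession; `path` is an assumed file location.
    """
    return (
        spark.read
        .format("csv")
        .options(**CSV_OPTIONS)
        .schema(EVENTS_SCHEMA)
        .load(path)
    )
```

With the session created earlier, calling `read_events(spark, path).printSchema()` on such a file should report `order_date` as `date` and `created_at` as `timestamp`, since both types are declared in the schema.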

**Use cases:**

-   This is useful when the date format in your data does not match the default yyyy-MM-dd format used by PySpark, and you want to make sure that the dates are parsed correctly.

-   It also helps with large datasets in non-standard date formats, because the dates are parsed during the read itself and no extra transformation step is needed after loading.

If you prefer not to rely on the `dateFormat` reader option, or you need further transformations or custom parsing, you can instead load the date column as a string and convert it after loading.

**Using the `to_date()` and `to_timestamp()` Functions:**

PySpark provides the `to_date()` and `to_timestamp()` functions to convert a string column into a date or timestamp type. You can specify the custom format directly within these functions.

**Example: Using to_date for Date Conversion**

Let's say you have a string column order_date with the format MM-dd-yyyy, and you want to convert it into a proper DateType column.

In [2]:
orders_schema = ("order_id integer, order_date string, customer_id long, order_status string")

In [3]:
df = spark.read \
.format("csv") \
.schema(orders_schema) \
.load("/Users/sugumarsrinivasan/Documents/data/orders_sample2.csv")

In [4]:
df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- order_status: string (nullable = true)



In [5]:
df.show()

+--------+----------+-----------+---------------+
|order_id|order_date|customer_id|   order_status|
+--------+----------+-----------+---------------+
|       1|07-27-2013|      30265|         CLOSED|
|       2|11-25-2013|      20386|         CLOSED|
|       3|01-15-2014|      15768|       COMPLETE|
|       4|07-14-2014|      27181|     PROCESSING|
|       5|03-08-2014|      12448|       COMPLETE|
|       6|08-20-2014|      49340|         CLOSED|
|       7|09-12-2014|      13801|     PROCESSING|
|       8|04-23-2014|      28523|PENDING_PAYMENT|
|       9|07-01-2014|      26329|         CLOSED|
|      10|07-29-2013|      38797|       COMPLETE|
+--------+----------+-----------+---------------+



In [9]:
# Import only the function used below instead of a wildcard import.
from pyspark.sql.functions import to_date

In [19]:
# Add a new column 'order_date_new' that holds the parsed date; the original string column is kept.

new_df = df.withColumn("order_date_new", to_date("order_date", "MM-dd-yyyy"))

In [20]:
new_df.show()

+--------+----------+-----------+---------------+--------------+
|order_id|order_date|customer_id|   order_status|order_date_new|
+--------+----------+-----------+---------------+--------------+
|       1|07-27-2013|      30265|         CLOSED|    2013-07-27|
|       2|11-25-2013|      20386|         CLOSED|    2013-11-25|
|       3|01-15-2014|      15768|       COMPLETE|    2014-01-15|
|       4|07-14-2014|      27181|     PROCESSING|    2014-07-14|
|       5|03-08-2014|      12448|       COMPLETE|    2014-03-08|
|       6|08-20-2014|      49340|         CLOSED|    2014-08-20|
|       7|09-12-2014|      13801|     PROCESSING|    2014-09-12|
|       8|04-23-2014|      28523|PENDING_PAYMENT|    2014-04-23|
|       9|07-01-2014|      26329|         CLOSED|    2014-07-01|
|      10|07-29-2013|      38797|       COMPLETE|    2013-07-29|
+--------+----------+-----------+---------------+--------------+



In [21]:
new_df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_date_new: date (nullable = true)



In [16]:
# Overwrite the existing 'order_date' column with the parsed date, without adding a new column to the DataFrame.

new_df = df.withColumn("order_date", to_date("order_date", "MM-dd-yyyy"))

In [17]:
new_df.show()

+--------+----------+-----------+---------------+
|order_id|order_date|customer_id|   order_status|
+--------+----------+-----------+---------------+
|       1|2013-07-27|      30265|         CLOSED|
|       2|2013-11-25|      20386|         CLOSED|
|       3|2014-01-15|      15768|       COMPLETE|
|       4|2014-07-14|      27181|     PROCESSING|
|       5|2014-03-08|      12448|       COMPLETE|
|       6|2014-08-20|      49340|         CLOSED|
|       7|2014-09-12|      13801|     PROCESSING|
|       8|2014-04-23|      28523|PENDING_PAYMENT|
|       9|2014-07-01|      26329|         CLOSED|
|      10|2013-07-29|      38797|       COMPLETE|
+--------+----------+-----------+---------------+



In [18]:
new_df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: date (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- order_status: string (nullable = true)



**Summary:**

-   to_date() and to_timestamp() are the primary functions for handling non-standard date and timestamp formats by specifying the format directly.

By using one or more of these methods, you can flexibly handle dates and timestamps in any format without relying solely on the option("dateFormat", "") approach when loading the data.
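
The examples above only exercise `to_date()`. As a rough sketch of the timestamp counterpart, the helper below builds a `to_timestamp` SQL expression and applies it with `selectExpr`, which is a swapped-in alternative to the `withColumn` call used earlier. The helper names and the `order_ts` column in the usage note are hypothetical, and the sketch assumes a pattern without quoted literals.

```python
def to_timestamp_expr(source_col, target_col, pattern):
    """Build a SQL expression that parses `source_col` with `pattern`."""
    if "'" in pattern:
        # Quoted literals such as 'T' would need escaping inside the SQL string.
        raise ValueError("pattern must not contain single quotes")
    return f"to_timestamp({source_col}, '{pattern}') AS {target_col}"


def add_parsed_timestamp(df, source_col, target_col, pattern):
    """Return `df` with an extra timestamp column parsed from a string column."""
    # "*" keeps every existing column; the expression appends the parsed one.
    return df.selectExpr("*", to_timestamp_expr(source_col, target_col, pattern))
```

For a DataFrame with a string column `order_ts` holding values such as `07-27-2013 14:30:00`, the call would be `add_parsed_timestamp(df, "order_ts", "order_ts_parsed", "MM-dd-yyyy HH:mm:ss")`.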



**Handling Inconsistent Data Types in Spark DataFrames: Dealing with Type Mismatch in Columns**

When you try to load a dataset into a Spark DataFrame with a column that has inconsistent types (e.g., a column defined as LongType but containing some rows with string values), Spark will encounter a problem while reading that data.

Here's what typically happens:

**1. Type Mismatch (Error or Null values):**

-   If you specify the schema explicitly (e.g., using LongType for customer_id), Spark will attempt to parse each value as LongType.

    -   In the default PERMISSIVE mode, a customer_id value that cannot be converted to a long integer is set to null, and the rest of the row is kept.
    -   In FAILFAST mode, the same value causes the read to fail with an exception as soon as the malformed row is parsed.

**2. Schema Inference (DataFrame without explicit schema):**

-   If you rely on schema inference instead of specifying the schema manually (for CSV files, by setting option("inferSchema", "true"); without it, every CSV column is read as a string), Spark infers each column's type from the data it reads.

    -   If some values in a column cannot be read as numbers, Spark infers StringType so that every value fits. This is less restrictive, but the intended type consistency is no longer enforced.
    -   In this case, your customer_id column will be treated as StringType instead of LongType.
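
A minimal sketch of the inference case, assuming a headerless CSV file laid out like the orders samples in this section. `inferSchema` is the CSV reader option that turns inference on; the helper names, `ORDER_COLUMNS`, and the digits-only check are hypothetical additions for illustration.

```python
# Column names for the headerless orders files (Spark would otherwise
# name them _c0, _c1, ...).
ORDER_COLUMNS = ["order_id", "order_date", "customer_id", "order_status"]

# SQL predicate matching customer_id values that are not made up of digits only.
NON_NUMERIC_CUSTOMER_ID = (
    "customer_id IS NOT NULL "
    "AND NOT (CAST(customer_id AS STRING) RLIKE '^[0-9]+$')"
)


def read_with_inferred_schema(spark, path):
    """Read a CSV file and let Spark infer the column types from the data."""
    df = (
        spark.read
        .format("csv")
        .option("inferSchema", "true")
        .load(path)
    )
    # Rename the positional columns so later code can refer to them by name.
    return df.toDF(*ORDER_COLUMNS)


def non_numeric_customer_ids(df):
    """Return the rows whose customer_id is not a plain run of digits."""
    return df.filter(NON_NUMERIC_CUSTOMER_ID)
```

Calling `printSchema()` on the result for a file that contains values such as `unknown` should show customer_id as string, and `non_numeric_customer_ids(df).show()` lists the rows responsible.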

**3. Handling Data Conversion Errors:**

-   You can use some strategies to handle type conversion errors or unexpected data types:

    -   Using option("mode", "DROPMALFORMED"): This will drop any rows that cannot be parsed correctly based on the schema.
    -   Using option("mode", "PERMISSIVE") (default behavior): Spark keeps every row and sets the fields it cannot parse to null.
    -   Using option("mode", "FAILFAST"): This will cause the job to fail immediately upon encountering any corrupt record.
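
PERMISSIVE mode can also keep the raw text of each malformed row instead of only nulling the bad field. The sketch below uses the reader option `columnNameOfCorruptRecord` together with an extra string column in the schema; `_corrupt_record` is Spark's default name for that column. The helper names and the decision to cache the result are assumptions made for illustration.

```python
# Orders schema plus one extra string column that receives the raw text of
# any row Spark could not parse.
ORDERS_SCHEMA_WITH_CORRUPT = (
    "order_id integer, order_date string, customer_id long, "
    "order_status string, _corrupt_record string"
)


def read_with_corrupt_column(spark, path):
    """Read a CSV file in PERMISSIVE mode, keeping malformed rows as raw text."""
    df = (
        spark.read
        .format("csv")
        .schema(ORDERS_SCHEMA_WITH_CORRUPT)
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .load(path)
    )
    # Caching is a precaution: Spark rejects queries on the raw file that
    # reference only the corrupt record column.
    return df.cache()


def split_good_and_bad(df):
    """Split a DataFrame into cleanly parsed rows and malformed rows."""
    bad = df.filter("_corrupt_record IS NOT NULL")
    good = df.filter("_corrupt_record IS NULL").drop("_corrupt_record")
    return good, bad
```

The malformed rows can then be inspected or written to a separate location, while the clean rows continue through the pipeline with the intended types.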

**Example:**

If you're loading data from a CSV file and you specify LongType for the customer_id column, you could see the following behavior:

In [31]:
! cat /Users/sugumarsrinivasan/Documents/data/orders_sample3.csv

1,2013-07-27,30265,CLOSED
2,2013-11-25,20386,CLOSED
3,2014-01-21,15768,COMPLETE
4,2014-07-04,27181,PROCESSING
5,2014-03-08,unknown,COMPLETE
6,2014-07-20,49340,CLOSED
7,2013-12-14,13801,PROCESSING
8,2014-04-23,error,PENDING_PAYMENT
9,2014-01-07,26329,CLOSED
10,2013-07-29,38797,COMPLETE


In [22]:
orders_schema = 'order_id integer, order_date string, customer_id long, order_status string'

In [23]:
df = spark.read \
.format("csv") \
.schema(orders_schema) \
.load("/Users/sugumarsrinivasan/Documents/data/orders_sample3.csv")

In [24]:
df.show()

+--------+----------+-----------+---------------+
|order_id|order_date|customer_id|   order_status|
+--------+----------+-----------+---------------+
|       1|2013-07-27|      30265|         CLOSED|
|       2|2013-11-25|      20386|         CLOSED|
|       3|2014-01-21|      15768|       COMPLETE|
|       4|2014-07-04|      27181|     PROCESSING|
|       5|2014-03-08|       NULL|       COMPLETE|
|       6|2014-07-20|      49340|         CLOSED|
|       7|2013-12-14|      13801|     PROCESSING|
|       8|2014-04-23|       NULL|PENDING_PAYMENT|
|       9|2014-01-07|      26329|         CLOSED|
|      10|2013-07-29|      38797|       COMPLETE|
+--------+----------+-----------+---------------+



In [25]:
df = spark.read \
.format("csv") \
.schema(orders_schema) \
.option("mode","dropmalformed") \
.load("/Users/sugumarsrinivasan/Documents/data/orders_sample3.csv")

In [26]:
df.show()

+--------+----------+-----------+------------+
|order_id|order_date|customer_id|order_status|
+--------+----------+-----------+------------+
|       1|2013-07-27|      30265|      CLOSED|
|       2|2013-11-25|      20386|      CLOSED|
|       3|2014-01-21|      15768|    COMPLETE|
|       4|2014-07-04|      27181|  PROCESSING|
|       6|2014-07-20|      49340|      CLOSED|
|       7|2013-12-14|      13801|  PROCESSING|
|       9|2014-01-07|      26329|      CLOSED|
|      10|2013-07-29|      38797|    COMPLETE|
+--------+----------+-----------+------------+



In [27]:
df = spark.read \
.format("csv") \
.schema(orders_schema) \
.option("mode","failfast") \
.load("/Users/sugumarsrinivasan/Documents/data/orders_sample3.csv")

In [None]:
df.show()

#output: org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION] Malformed records are detected in record parsing: [5,2014-03-08,null,COMPLETE].
#Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'. 

If there are non-numeric values in the customer_id column, the outcome depends on the mode: with PERMISSIVE, the customer_id value is set to null and the row is kept; with DROPMALFORMED, the whole row is dropped; with FAILFAST, the read fails with an exception.

**Summary:**

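-   If the data is inconsistent (some rows are strings while others are valid long numbers), the result depends on how you read it. With an explicit schema, Spark sets the bad values to null, drops the rows, or fails, according to the mode. With schema inference, Spark reads the column as StringType.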

-   It's important to handle these issues by specifying an appropriate schema or setting the correct error-handling mode when reading the data.