**Inferring the Schema:**

Inferring the schema in PySpark lets you determine the structure of the data (i.e., column names and types) automatically instead of declaring it by hand. How this works depends on the format: for text formats such as CSV and JSON, PySpark derives the types by scanning the data, while a self-describing format such as Parquet already carries its schema in the file.

Here’s a guide to inferring the schema in PySpark for common data formats:

**1. Inferring Schema from CSV File:**

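When reading a CSV file in PySpark, you can have the column types inferred automatically by setting the inferSchema option to True. PySpark then makes an extra pass over the data (the whole file by default, or only a fraction of it if samplingRatio is set) to determine the data type of each column. Without this option, every column is read as a string.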
```
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("InferSchemaExample").getOrCreate()

# Read CSV file and infer schema
df = spark.read.option("inferSchema", "true").csv("path_to_your_file.csv", header=True)

# Show inferred schema
df.printSchema()

# Show data
df.show()
```

**Explanation:**
-   header=True: Treats the first row as the header (column names).
-   inferSchema=True: Infers the column types based on the data in the CSV file.

**2. Inferring Schema from JSON File:**

In the case of JSON, PySpark can infer the schema directly from the data, as JSON is a semi-structured format.

```
# Read JSON file and infer schema
df_json = spark.read.json("path_to_your_file.json")

# Show inferred schema
df_json.printSchema()

# Show data
df_json.show()
```

**3. Inferring Schema from Parquet File:**
Parquet is a columnar format that already contains schema information, so you don’t need to use inferSchema. PySpark will automatically read the schema when loading Parquet files.

```
# Read Parquet file
df_parquet = spark.read.parquet("path_to_your_file.parquet")

# Show the schema stored in the Parquet file
df_parquet.printSchema()

# Show data
df_parquet.show()
```

**Optimizing the inferSchema Process:**

Inferring the schema can be costly on large datasets since PySpark needs to scan the data to determine the types. To improve performance:

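-   Consider inferring from a sample of the data with samplingRatio (e.g., 0.1 for a 10% sample). Types that appear only in rows outside the sample can be missed, so check the result.
-   Use a self-describing format such as Parquet when possible, as it stores the schema in the file. JSON does not store a schema; PySpark still has to scan the data to infer one.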

```
# Example of inferring schema with a sampling ratio
df_sampled = spark.read \
.format("csv") \
.option("header","true") \
.option("inferSchema", "true") \
.option("samplingRatio", 0.1) \
.load("path_to_your_file.csv")

# Show schema and data
df_sampled.printSchema()
df_sampled.show()
```

**Conclusion**

-   **CSV:** Use inferSchema=True to automatically detect data types.
-   **JSON:** Schema is inferred directly from the data, no need for inferSchema.
-   **Parquet:** No need for schema inference, as schema is stored in the file itself.
-   **Custom Schema:** Use StructType if you want to define a schema explicitly.


**Schema Enforcement:**

Schema enforcement in PySpark refers to the process of ensuring that the data in a DataFrame conforms to a predefined structure or schema. This schema defines the column names, data types, and the structure of the data. Schema enforcement is important because it helps ensure data integrity, consistency, and efficient processing when working with large datasets.

**Why is Schema Enforcement Important?**

-   **1. Data Quality:** It ensures that the data adheres to the expected types and formats, preventing errors during computations or transformations.

-   **2. Performance:** Supplying the schema up front lets PySpark skip the extra pass over the data that inference requires, and correctly typed columns avoid later casts in joins, aggregations, and filters.

-   **3. Data Compatibility:** When working with multiple sources, enforcing a schema ensures that the data from different sources aligns properly.

-   **4. Validation:** It can help catch any discrepancies early in the data processing pipeline, preventing bad data from being processed further.


In [None]:
import findspark
findspark.init()
import getpass
from pyspark.sql import SparkSession

username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config("spark.sql.catalogImplementation", "hive"). \
    config("spark.sql.warehouse.dir",f"/Users/{username}/Documents/data/warehouse"). \
    enableHiveSupport(). \
    master("local"). \
    getOrCreate()

In [11]:
df = spark.read \
.format("csv") \
.load("/Users/sugumarsrinivasan/Documents/data/orders.csv")

In [12]:
df.show(5)

+---+--------------------+-----+----------+
|_c0|                 _c1|  _c2|       _c3|
+---+--------------------+-----+----------+
|  1|2013-07-27 00:00:...|30265|    CLOSED|
|  2|2013-11-25 00:00:...|20386|    CLOSED|
|  3|2014-01-21 00:00:...|15768|  COMPLETE|
|  4|2014-07-04 00:00:...|27181|PROCESSING|
|  5|2014-03-08 00:00:...|12448|  COMPLETE|
+---+--------------------+-----+----------+
only showing top 5 rows



In [13]:
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)



**Defining the schema using the DDL method:**

In [5]:
orders_schema = ("order_id integer, order_date date, customer_id long, order_status string")

In [6]:
df = spark.read \
.format("csv") \
.schema(orders_schema) \
.load("/Users/sugumarsrinivasan/Documents/data/orders.csv")

In [8]:
df.show(5)

+--------+----------+-----------+------------+
|order_id|order_date|customer_id|order_status|
+--------+----------+-----------+------------+
|       1|2013-07-27|      30265|      CLOSED|
|       2|2013-11-25|      20386|      CLOSED|
|       3|2014-01-21|      15768|    COMPLETE|
|       4|2014-07-04|      27181|  PROCESSING|
|       5|2014-03-08|      12448|    COMPLETE|
+--------+----------+-----------+------------+
only showing top 5 rows



In [9]:
df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: date (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- order_status: string (nullable = true)



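If a value cannot be parsed as the type declared in the schema, PySpark does not raise an error by default; it loads that value as null. Here order_status is deliberately declared as long instead of string: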

In [14]:
orders_schema = ("order_id integer, order_date date, customer_id long, order_status long")

In [15]:
df = spark.read \
.format("csv") \
.schema(orders_schema) \
.load("/Users/sugumarsrinivasan/Documents/data/orders.csv")

In [16]:
df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: date (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- order_status: long (nullable = true)



In [17]:
df.show(5)

+--------+----------+-----------+------------+
|order_id|order_date|customer_id|order_status|
+--------+----------+-----------+------------+
|       1|2013-07-27|      30265|        NULL|
|       2|2013-11-25|      20386|        NULL|
|       3|2014-01-21|      15768|        NULL|
|       4|2014-07-04|      27181|        NULL|
|       5|2014-03-08|      12448|        NULL|
+--------+----------+-----------+------------+
only showing top 5 rows



**Defining the schema using the StructType method:**

In [21]:
from pyspark.sql.types import StructType, StructField, LongType, DateType, StringType

orders_schema = StructType([
    StructField("order_id", LongType()),
    StructField("order_date",DateType()),
    StructField("customer_id",LongType()),
    StructField("order_status",StringType())
])

In [22]:
df = spark.read \
.format("csv") \
.schema(orders_schema) \
.load("/Users/sugumarsrinivasan/Documents/data/orders.csv")

In [23]:
df.show(5)

+--------+----------+-----------+------------+
|order_id|order_date|customer_id|order_status|
+--------+----------+-----------+------------+
|       1|2013-07-27|      30265|      CLOSED|
|       2|2013-11-25|      20386|      CLOSED|
|       3|2014-01-21|      15768|    COMPLETE|
|       4|2014-07-04|      27181|  PROCESSING|
|       5|2014-03-08|      12448|    COMPLETE|
+--------+----------+-----------+------------+
only showing top 5 rows



In [24]:
df.printSchema()

root
 |-- order_id: long (nullable = true)
 |-- order_date: date (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- order_status: string (nullable = true)

