## **Working with JSON Files in PySpark**

1. Reading JSON file in PySpark
2. Reading from Multiline JSON File
3. Reading from Multiple Files at a Time
4. Reading all files in a directory
5. Reading files with a user-specified custom schema
6. Reading File using PySpark SQL
7. Write PySpark DataFrame to JSON file

=======================================================================================================================================

# Reading JSON File in PySpark

PySpark reads JSON files into a DataFrame with `spark.read.json()`. By default it expects JSON Lines input, that is, one complete JSON object per line, and it infers the schema by scanning the data.

### Steps to Read a JSON File:
1. Ensure the JSON file is stored in a path accessible to your Spark session.
2. Use `spark.read.json()` to load the JSON file into a DataFrame.

### Example Code:
```python
# Import necessary libraries
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("Read JSON").getOrCreate()

# File path
json_file_path = "path/to/your/jsonfile.json"

# Reading JSON file into DataFrame
df = spark.read.json(json_file_path)

# Display DataFrame
df.show()

# Print schema
df.printSchema()
```


In [1]:
# initial setup 
import time
from pyspark.sql import SparkSession

spark = (SparkSession
        .builder
        .master("local[3]")
        .appName("my_spark_app")
        .getOrCreate()
        )
spark

In [2]:
df = spark.read.json("./source/json/employee_1.json")
df.show()

+--------------------+------------+-----------------------+----------+-----------------+--------------------+-----+--------------------------+--------------+------------+---------+-----------+----------+----------------------+---------+------------+-----------+-------+-----------------+--------+---------+-----+------------------+--------------------+
|             ADEmail|BusinessUnit|Current Employee Rating|       DOB|   DepartmentType|            Division|EmpID|EmployeeClassificationType|EmployeeStatus|EmployeeType| ExitDate|  FirstName|GenderCode|JobFunctionDescription| LastName|LocationCode|MaritalDesc|PayZone|Performance Score|RaceDesc|StartDate|State|        Supervisor|               Title|
+--------------------+------------+-----------------------+----------+-----------------+--------------------+-----+--------------------------+--------------+------------+---------+-----------+----------+----------------------+---------+------------+-----------+-------+-----------------+-------

In [3]:
df = spark.read.format("json").load("./source/json/employee_1.json")
df.show()

+--------------------+------------+-----------------------+----------+-----------------+--------------------+-----+--------------------------+--------------+------------+---------+-----------+----------+----------------------+---------+------------+-----------+-------+-----------------+--------+---------+-----+------------------+--------------------+
|             ADEmail|BusinessUnit|Current Employee Rating|       DOB|   DepartmentType|            Division|EmpID|EmployeeClassificationType|EmployeeStatus|EmployeeType| ExitDate|  FirstName|GenderCode|JobFunctionDescription| LastName|LocationCode|MaritalDesc|PayZone|Performance Score|RaceDesc|StartDate|State|        Supervisor|               Title|
+--------------------+------------+-----------------------+----------+-----------------+--------------------+-----+--------------------------+--------------+------------+---------+-----------+----------+----------------------+---------+------------+-----------+-------+-----------------+-------

=======================================================================================================================================

# Reading from Multiline JSON File in PySpark

In a multiline JSON file, a single JSON object (or an array of objects) spans several lines, whereas by default Spark expects one object per line. To read such files, pass `multiLine=True` to `spark.read.json()`, or set the equivalent `.option("multiline", True)` on the reader.

### Steps to Read a Multiline JSON File:
1. Ensure your JSON file contains records in a multiline format.
2. Use `spark.read.json()` with the `multiLine=True` option.
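
To make the two layouts concrete, the following sketch uses only Python's standard library (no Spark) to write the same records once as JSON Lines, which `spark.read.json()` expects by default, and once as a single pretty-printed array, which needs the multiline option. The records and file names are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical records, loosely modeled on the employee data in this notebook
records = [
    {"empid": "SJ011MS", "profile": {"department": "Finance"}},
    {"empid": "MJ012KS", "profile": {"department": "Marketing"}},
]

out_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(out_dir, "employees.jsonl")
multiline_path = os.path.join(out_dir, "employees_multiline.json")

# JSON Lines: one complete JSON object per line
with open(jsonl_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Multiline JSON: one pretty-printed array whose objects span several lines
with open(multiline_path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

with open(jsonl_path, encoding="utf-8") as f:
    jsonl_lines = f.read().splitlines()
with open(multiline_path, encoding="utf-8") as f:
    multiline_lines = f.read().splitlines()

# Every JSON Lines row parses on its own; a single row of the multiline file does not
parsed_rows = [json.loads(line) for line in jsonl_lines]
print(len(jsonl_lines), len(multiline_lines))
```

In the first file each line is a self-contained record, which is what lets a reader split the input line by line. In the second file no individual line is valid JSON, so the whole file has to be parsed as one document.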

### Example Code:
```python
# File path
multiline_json_file_path = "path/to/your/multiline_jsonfile.json"

# Reading multiline JSON file into DataFrame
df = spark.read.json(multiline_json_file_path, multiLine=True)

# Display DataFrame
df.show()

# Print schema
df.printSchema()
```


In [4]:
df = (spark.read
      .format("json")
      .option("multiline",True)
      .load("./source/json/multiline_employee.json"))

df.show()

+-------+--------------------+--------------------+
|  empid|            personal|             profile|
+-------+--------------------+--------------------+
|SJ011MS|{{New York, 10038...|{Finance, Deputy ...|
|MJ012KS|{{Los Angeles, 90...|{Marketing, Senio...|
|BK013LP|{{Chicago, 60601,...|{Operations, Dire...|
|RJ014MT|{{Austin, 73301, ...|{Human Resources,...|
|TM015QP|{{Seattle, 98101,...|     {IT, Team Lead}|
|KT016NL|{{Phoenix, 85001,...|{Strategy, Senior...|
|SP017JL|{{Denver, 80201, ...|{Sales, Vice Pres...|
|LA018GV|{{Atlanta, 30301,...|{Project Manageme...|
|MB019WR|{{Dallas, 75201, ...| {Research, Analyst}|
|EW020TH|{{Boston, 02108, ...|{Data Science, Da...|
|JB021CV|{{San Diego, 9210...|{Logistics, Logis...|
|LH022MV|{{Miami, 33101, F...|{Customer Service...|
|AB023NF|{{San Francisco, ...|{Infrastructure, ...|
|GW024PT|{{Houston, 77001,...|{Administration, ...|
|CW025GK|{{Portland, 97201...|{Legal, Legal Adv...|
+-------+--------------------+--------------------+



In [5]:
df.printSchema()

root
 |-- empid: string (nullable = true)
 |-- personal: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- postalcode: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |    |-- streetaddress: string (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- profile: struct (nullable = true)
 |    |-- department: string (nullable = true)
 |    |-- designation: string (nullable = true)



=======================================================================================================================================

# Reading from Multiple Files at a Time in PySpark

PySpark can read several JSON files in one call and return them as a single DataFrame, which is useful when a dataset is split across multiple files.

### Ways to Read Multiple Files:
1. Specify multiple file paths as a list.
2. Use wildcard characters to match file patterns in a directory.
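
A wildcard path selects files by name pattern. Spark resolves such patterns through the Hadoop filesystem API, whose glob syntax is similar but not identical to Python's, so the standard-library sketch below only approximates which files a pattern such as `employee_*.json` would pick up. All file names are hypothetical.

```python
import glob
import os
import tempfile

# Create a scratch directory with a few empty, hypothetically named files
data_dir = tempfile.mkdtemp()
for name in ["employee_1.json", "employee_2.json", "employee_3.json", "departments.json"]:
    open(os.path.join(data_dir, name), "w").close()

# "*" matches any run of characters within a single file name
pattern = os.path.join(data_dir, "employee_*.json")
matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)  # ['employee_1.json', 'employee_2.json', 'employee_3.json']
```

Checking a pattern this way before handing it to the reader can help confirm that unrelated files in the same directory are not swept in.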

### Example Code:
```python
# File paths
file_paths = ["path/to/jsonfile1.json", "path/to/jsonfile2.json"]

# Reading multiple JSON files into a single DataFrame
df = spark.read.json(file_paths)

# Display DataFrame
df.show()

# Print schema
df.printSchema()
```


In [6]:
df = (spark.read
      .format("json")
      .load(["./source/json/employee_1.json","./source/json/employee_2.json"]))
df.count()

1999

=======================================================================================================================================

# Reading All Files in a Directory in PySpark

PySpark provides a convenient way to read all files from a directory at once, especially useful when dealing with datasets stored across multiple files.

### Steps to Read All Files from a Directory:
1. Identify the path of the directory that contains the JSON files.
2. Pass that directory path to `spark.read.json()` to load every file in it.

### Example Code:
```python
# Directory path containing JSON files
directory_path = "path/to/directory/"

# Reading all JSON files from the directory into a single DataFrame
df = spark.read.json(directory_path)

# Display DataFrame
df.show()

# Print schema
df.printSchema()
```


In [7]:
df = (spark.read
      .format("json")
      .load("./source/json/employee_*.json"))
df.count()

3000

=======================================================================================================================================

# Reading Files with Custom Schema in PySpark

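Supplying a schema explicitly skips Spark's schema inference pass and fixes the column names and types up front, which helps with large inputs and with JSON files whose structure is complex or inconsistent. Fields declared in the schema but absent from the data are read as null.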

### Example:
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define custom schema
schema = StructType([
    StructField("empid", StringType(), True),
    StructField("personal", StructType([
        StructField("name", StringType(), True),
        StructField("gender", StringType(), True),
        StructField("age", IntegerType(), True)
    ])),
    StructField("profile", StructType([
        StructField("designation", StringType(), True),
        StructField("department", StringType(), True)
    ]))
])

# Read JSON files with custom schema
df = spark.read.json("path/to/files", schema=schema)

# Show DataFrame
df.show()
```


In [8]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

schema = StructType([
    StructField("EmpID", IntegerType(), True),
    StructField("FirstName", StringType(), True),
    StructField("LastName", StringType(), True),
    StructField("StartDate", StringType(), True),
    StructField("ExitDate", StringType(), True),
    StructField("Title", StringType(), True),
    StructField("Supervisor", StringType(), True),
    StructField("ADEmail", StringType(), True),
    StructField("BusinessUnit", StringType(), True),
    StructField("EmployeeStatus", StringType(), True),
    StructField("EmployeeType", StringType(), True),
    StructField("PayZone", StringType(), True),
    StructField("EmployeeClassificationType", StringType(), True),
    StructField("DepartmentType", StringType(), True),
    StructField("Division", StringType(), True),
    StructField("DOB", StringType(), True),
    StructField("State", StringType(), True),
    StructField("JobFunctionDescription", StringType(), True),
    StructField("GenderCode", StringType(), True),
    StructField("LocationCode", IntegerType(), True),
    StructField("RaceDesc", StringType(), True),
    StructField("MaritalDesc", StringType(), True),
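    # Note: the source columns appear as "Performance Score" and "Current Employee Rating"
    # (with spaces) in the df.show() output earlier, so the two underscored names below
    # match no field in the data and will be read as null.
    StructField("Performance_Score", StringType(), True),
    StructField("Current_Employee_Rating", IntegerType(), True)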
])


In [9]:
df = (spark.read
      .format("json")
      .schema(schema)
      .load("./source/json/employee_*.json"))
df.count()

3000

In [10]:
df.printSchema()

root
 |-- EmpID: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- StartDate: string (nullable = true)
 |-- ExitDate: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Supervisor: string (nullable = true)
 |-- ADEmail: string (nullable = true)
 |-- BusinessUnit: string (nullable = true)
 |-- EmployeeStatus: string (nullable = true)
 |-- EmployeeType: string (nullable = true)
 |-- PayZone: string (nullable = true)
 |-- EmployeeClassificationType: string (nullable = true)
 |-- DepartmentType: string (nullable = true)
 |-- Division: string (nullable = true)
 |-- DOB: string (nullable = true)
 |-- State: string (nullable = true)
 |-- JobFunctionDescription: string (nullable = true)
 |-- GenderCode: string (nullable = true)
 |-- LocationCode: integer (nullable = true)
 |-- RaceDesc: string (nullable = true)
 |-- MaritalDesc: string (nullable = true)
 |-- Performance_Score: string (nullable = true)
 |-- Current_Em

=======================================================================================================================================

# Reading File using PySpark SQL

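Spark SQL can query JSON data in two ways: register a DataFrame as a temporary view and query the view, or query the file directly by prefixing a backtick-quoted path with `json.` in the `FROM` clause.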

### Example:
```python
# Load the JSON file into a DataFrame
df = spark.read.format("json").load("path/to/jsonfile.json")

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("json_table")

# Perform SQL queries on the table
result = spark.sql("SELECT empid, personal.name, profile.designation FROM json_table")

# Show the result
result.show()
```


In [11]:
df = (spark.sql("SELECT * FROM json.`./source/json/employee_1.json`;"))
df.show()

+--------------------+------------+-----------------------+----------+-----------------+--------------------+-----+--------------------------+--------------+------------+---------+-----------+----------+----------------------+---------+------------+-----------+-------+-----------------+--------+---------+-----+------------------+--------------------+
|             ADEmail|BusinessUnit|Current Employee Rating|       DOB|   DepartmentType|            Division|EmpID|EmployeeClassificationType|EmployeeStatus|EmployeeType| ExitDate|  FirstName|GenderCode|JobFunctionDescription| LastName|LocationCode|MaritalDesc|PayZone|Performance Score|RaceDesc|StartDate|State|        Supervisor|               Title|
+--------------------+------------+-----------------------+----------+-----------------+--------------------+-----+--------------------------+--------------+------------+---------+-----------+----------+----------------------+---------+------------+-----------+-------+-----------------+-------

In [12]:
df = (spark.read
      .format("json")
      .schema(schema)
      .load("./source/json/employee_*.json"))
df.count()

3000

=======================================================================================================================================

# Write PySpark DataFrame to JSON File

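To write a PySpark DataFrame as JSON, use the DataFrame's `write` attribute (a `DataFrameWriter`) with the `json` format. The target path is a directory: Spark writes one part file per partition into it.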

### Example:
```python
# Sample DataFrame
data = [("SJ011MS", {"name": "Smith Jones", "age": 28}, {"designation": "Deputy General", "department": "Finance"})]
df = spark.createDataFrame(data, ["empid", "personal", "profile"])

# Write the DataFrame as JSON; Spark creates a directory at this path
# containing one part file per partition
df.write.json("path/to/output.json")

# Overwrite the output directory if it already exists
df.write.json("path/to/output/partitioned", mode="overwrite")
```


In [13]:
df.write.format("json").mode("overwrite").save("./Sinck/json/employee")

=======================================================================================================================================

# Options while Writing JSON Files

When writing a DataFrame to JSON files in PySpark, you can specify various options to control how the files are written. Below are some commonly used options:

### 1. `path`
- **Description**: Specifies the output location. Spark creates a directory at this path and writes one part file per partition into it, even when the path ends in `.json`.
- **Example**: `df.write.json("path/to/output.json")`
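
Because Spark treats the target path as a directory and writes `part-*` files into it (alongside marker files such as `_SUCCESS`), code outside Spark has to read the directory rather than a single file. The helper below, `read_json_output_dir`, is a hypothetical standard-library sketch of that. It assumes uncompressed output in the default one-object-per-line layout, and the demo directory is built by hand to mimic Spark's naming.

```python
import glob
import json
import os
import tempfile


def read_json_output_dir(output_dir):
    """Collect the records from every part file in a Spark-style JSON output directory."""
    records = []
    # Only part files hold data; marker files such as _SUCCESS do not match the pattern
    for part_path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(part_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():  # tolerate blank lines
                    records.append(json.loads(line))
    return records


# Demo: hand-build a directory that mimics the layout (file names are illustrative)
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "part-00000.json"), "w", encoding="utf-8") as f:
    f.write('{"empid": "SJ011MS"}\n{"empid": "MJ012KS"}\n')
with open(os.path.join(demo_dir, "part-00001.json"), "w", encoding="utf-8") as f:
    f.write('{"empid": "BK013LP"}\n')
open(os.path.join(demo_dir, "_SUCCESS"), "w").close()

rows = read_json_output_dir(demo_dir)
print([row["empid"] for row in rows])  # ['SJ011MS', 'MJ012KS', 'BK013LP']
```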

---

### 2. `mode`
- **Description**: Defines the behavior when the output path already exists.
  - `error` / `errorifexists` (default): Raises an error if the path already exists.
  - `overwrite`: Replaces the existing data.
  - `append`: Adds new files alongside the existing ones.
  - `ignore`: Leaves the existing data untouched and silently skips the write.
- **Example**: `df.write.json("path/to/output", mode="overwrite")`
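
The save modes, including the default `errorifexists`, come down to a small decision on whether the target path already exists. The function below, `plan_write`, is a plain-Python model of that decision written for illustration; it is not Spark's implementation and its return strings are made up.

```python
def plan_write(mode, path_exists):
    """Return the action a writer would take for a given save mode (simplified model)."""
    if mode in ("error", "errorifexists"):
        if path_exists:
            raise FileExistsError("target path already exists")
        return "write"
    if mode == "overwrite":
        return "replace existing data, then write" if path_exists else "write"
    if mode == "append":
        return "add new files next to existing ones" if path_exists else "write"
    if mode == "ignore":
        return "skip" if path_exists else "write"
    raise ValueError(f"unknown save mode: {mode}")


# With an existing target, only these three modes proceed without raising
for mode in ("overwrite", "append", "ignore"):
    print(mode, "->", plan_write(mode, path_exists=True))
```

When the path does not exist yet, every mode simply writes, so the choice only matters on reruns of a job.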

---

### 3. `compression`
- **Description**: Specifies the compression codec to use when writing the JSON files (e.g., "gzip", "snappy").
- **Example**: `df.write.json("path/to/output.json", compression="gzip")`

---

### 4. `dateFormat`
- **Description**: Specifies the format for date columns when writing JSON files.
- **Example**: `df.write.json("path/to/output.json", dateFormat="yyyy-MM-dd")`

---

### 5. `timestampFormat`
- **Description**: Specifies the format for timestamp columns when writing JSON files.
- **Example**: `df.write.json("path/to/output.json", timestampFormat="yyyy-MM-dd HH:mm:ss")`
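
Spark's `dateFormat` and `timestampFormat` take Java-style pattern letters (`yyyy`, `MM`, `dd`, `HH`, `mm`, `ss`), not Python `strftime` codes. As a rough guide to what the two patterns above render, the sketch below formats a made-up timestamp with the `strftime` equivalents. The helper `to_strftime` is hypothetical and its mapping covers only these six tokens.

```python
from datetime import datetime

# Equivalent strftime codes for the pattern letters used above (partial mapping)
SPARK_TO_STRFTIME = {
    "yyyy": "%Y",  # four-digit year
    "MM": "%m",    # two-digit month
    "dd": "%d",    # two-digit day of month
    "HH": "%H",    # hour, 00-23
    "mm": "%M",    # minute
    "ss": "%S",    # second
}


def to_strftime(spark_pattern):
    """Translate a pattern built only from the tokens above into strftime codes."""
    result = spark_pattern
    for spark_token, py_token in SPARK_TO_STRFTIME.items():
        result = result.replace(spark_token, py_token)
    return result


sample = datetime(2024, 3, 5, 14, 7, 9)  # made-up value
print(sample.strftime(to_strftime("yyyy-MM-dd")))           # 2024-03-05
print(sample.strftime(to_strftime("yyyy-MM-dd HH:mm:ss")))  # 2024-03-05 14:07:09
```

Note the case sensitivity: `MM` is the month while `mm` is the minute, which is a common source of silently wrong output.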

---

### 6. `lineSep`
- **Description**: Specifies the character sequence to use as a line separator between JSON objects.
- **Example**: `df.write.json("path/to/output.json", lineSep="\n")`

---

### 7. `encoding`
- **Description**: Specifies the character encoding to use when writing the JSON files.
- **Example**: `df.write.json("path/to/output.json", encoding="utf-8")`

