<h1><center>Reading and Writing Excel files in PySpark</center></h1>
<hr><hr><hr>

- Medium article with instructions for setting up the `com.crealytics` spark-excel jar file and working with Excel data in local PySpark:
    - `https://medium.com/@amitjoshi7/how-to-read-excel-files-using-pyspark-in-databricks-637bb21b90be`

- Scala and Spark versions installed on my system (written as `scala:spark`): \
    `2.12:3.3.2`


### All the jar files downloaded below must be placed inside the `jars` folder of the Spark installation directory on the local machine:
-------------------------------------------------------------------------------------------------------------------------------------------------
- Jar file required for handling Excel files with PySpark on Spark `3.3.2`:
    - `spark-excel_2.12-3.3.2_0.19.0.jar`

#### Required links for setup:
- Download page for `spark-excel_2.12-3.3.2_0.19.0.jar`: \
    `https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12/3.3.2_0.19.0`
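
The jar name above appears to follow the usual Maven pattern `spark-excel_<scala version>-<spark version>_<plugin version>.jar`. As a rough sketch, the file name and the matching Maven coordinate can be assembled from the three version numbers. The helper names `excel_jar_name` and `excel_maven_coordinate` are made up for this note and are not part of spark-excel.

```python
def excel_jar_name(scala_version, spark_version, plugin_version):
    """Build the jar file name for a given Scala / Spark / spark-excel version."""
    return f"spark-excel_{scala_version}-{spark_version}_{plugin_version}.jar"


def excel_maven_coordinate(scala_version, spark_version, plugin_version):
    """Build the group:artifact:version coordinate for the same jar."""
    return f"com.crealytics:spark-excel_{scala_version}:{spark_version}_{plugin_version}"


# Versions used in this notebook: Scala 2.12, Spark 3.3.2, spark-excel 0.19.0
print(excel_jar_name("2.12", "3.3.2", "0.19.0"))
print(excel_maven_coordinate("2.12", "3.3.2", "0.19.0"))
```

The coordinate form is what Spark's `spark.jars.packages` setting expects, which is an alternative to copying the jar by hand. That route needs network access to a Maven repository when the session first starts.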


In [7]:
pip show ipynbname

Name: ipynbname
Version: 2024.1.0.0
Summary: Simply returns either notebook filename or the full path to the notebook when run from Jupyter notebook in browser.
Home-page: https://github.com/msm1089/ipynbname
Author: Mark McPherson
Author-email: msm1089@yahoo.co.uk
License: MIT
Location: e:\programs & codes\apache_spark\_spark_venv\lib\site-packages
Requires: ipykernel
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [8]:
import os
import ipynbname

notebook_name = ipynbname.name()

print(os.getcwd())
print("SPARK Home:--", os.getenv("SPARK_HOME"))
print("Notebook name:  ", notebook_name)

E:\Programs & Codes\apache_spark\my_pyspark_notebooks\000-pyspark-practice
SPARK Home:-- D:\Softwares\Apache_Spark\spark
Notebook name:   003-reading_writing_excel_files


In [5]:
import findspark
findspark.init()

In [13]:
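# Build the jar path from SPARK_HOME; os.environ[...] raises KeyError if the variable is not set
excel_jar_path = os.path.join(os.environ["SPARK_HOME"], "jars", "spark-excel_2.12-3.3.2_0.19.0.jar")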

print( excel_jar_path )

D:\Softwares\Apache_Spark\spark\jars\spark-excel_2.12-3.3.2_0.19.0.jar
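
Printing the path does not confirm that the jar is actually there, and a missing jar usually only surfaces later when the Excel data source fails to load. A minimal sketch of a pre-flight check, assuming the jar sits in `<SPARK_HOME>/jars` as described above; `find_excel_jars` is a hypothetical helper written for this note.

```python
import os
from pathlib import Path


def find_excel_jars(spark_home):
    """Return the names of spark-excel jars found in <spark_home>/jars, sorted."""
    jars_dir = Path(spark_home) / "jars"
    if not jars_dir.is_dir():
        # No jars folder at all: nothing to report
        return []
    return sorted(p.name for p in jars_dir.glob("spark-excel_*.jar"))


# Only run the check when SPARK_HOME is set in this environment
spark_home = os.getenv("SPARK_HOME")
if spark_home:
    print(find_excel_jars(spark_home))
```

An empty list means the jar still has to be copied into the `jars` folder. More than one entry suggests mixed versions, which is worth cleaning up before starting the session.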


In [15]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars", excel_jar_path)
    .appName( notebook_name )
    .getOrCreate()
)

spark

In [20]:
excel_file_path = "./data/JGEC_central_database.xlsx"

In [21]:
# Read an entire sheet
jgec_df = (
    spark.read
    .format("com.crealytics.spark.excel")
    .option("header", True)
    .option("inferSchema", True)
    .load( excel_file_path )
)

In [17]:
jgec_df.show()

+------+----+--------------------+----------+------+--------------------+--------+----------+----+--------------------+--------------------+--------------------+-------------+--------------------+-------------+---------------+--------------------+--------------------+----------+
|  date|dept|               email|graduation|hostel|            hostelID|hostelNo|hostelPaid|  id|                name|           paperCode|           paperName|         phNo|            receipts|        regNo|         rollNo|          tutionFees|            tutionID|tutionPaid|
+------+----+--------------------+----------+------+--------------------+--------+----------+----+--------------------+--------------------+--------------------+-------------+--------------------+-------------+---------------+--------------------+--------------------+----------+
|  NULL| CSE|jm2019@cse.jgec.a...|        UG|   YES|                NULL|       3|        NO|14.0|    Jyotirmoy Mondal|CS 801A\r\nCS-802...|Cryptography and ...

In [18]:
jgec_df.printSchema()

root
 |-- date: string (nullable = true)
 |-- dept: string (nullable = true)
 |-- email: string (nullable = true)
 |-- graduation: string (nullable = true)
 |-- hostel: string (nullable = true)
 |-- hostelID: string (nullable = true)
 |-- hostelNo: string (nullable = true)
 |-- hostelPaid: string (nullable = true)
 |-- id: double (nullable = true)
 |-- name: string (nullable = true)
 |-- paperCode: string (nullable = true)
 |-- paperName: string (nullable = true)
 |-- phNo: double (nullable = true)
 |-- receipts: string (nullable = true)
 |-- regNo: double (nullable = true)
 |-- rollNo: double (nullable = true)
 |-- tutionFees: string (nullable = true)
 |-- tutionID: string (nullable = true)
 |-- tutionPaid: string (nullable = true)
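
In the schema above, `id`, `phNo`, `regNo` and `rollNo` come back as `double`, most likely because Excel stores numeric cells as floating-point values. If whole numbers are wanted, one option is to cast those columns after loading. The sketch below only builds the SQL expressions, so it runs without a Spark session; `long_cast_exprs` is a hypothetical helper, and the choice of `BIGINT` is an assumption about what these columns should hold.

```python
def long_cast_exprs(dtypes, columns):
    """Build selectExpr strings that cast the named columns to BIGINT.

    dtypes  : list of (column_name, type_string) pairs, the shape returned by DataFrame.dtypes
    columns : names of the columns to cast; all other columns pass through unchanged
    """
    targets = set(columns)
    exprs = []
    for name, _type in dtypes:
        if name in targets:
            exprs.append(f"CAST(`{name}` AS BIGINT) AS `{name}`")
        else:
            exprs.append(f"`{name}`")
    return exprs


# Small stand-in for jgec_df.dtypes, so the example runs on its own
sample_dtypes = [("dept", "string"), ("id", "double"), ("phNo", "double")]
print(long_cast_exprs(sample_dtypes, ["id", "phNo"]))
```

With the DataFrame from this notebook, the call would look like `jgec_df.selectExpr(*long_cast_exprs(jgec_df.dtypes, ["id", "phNo", "regNo", "rollNo"]))`.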



In [34]:
# Read a specific cell range from a named sheet:

jgec_df_partial = (
    spark.read
    .format("com.crealytics.spark.excel")
    .option("header", True)
    .option("inferSchema", True)
    .option("dataAddress", "'JGEC db sheet'!A1:E10" )
    .option("treatEmptyValuesAsNulls", "true")  # treat empty cells as nulls
    .load( excel_file_path )
)

In [35]:
jgec_df_partial.show(truncate=False)

+------+----+-----------------------------+----------+------+
|date  |dept|email                        |graduation|hostel|
+------+----+-----------------------------+----------+------+
|NULL  |CSE |jm2019@cse.jgec.ac.in        |UG        |YES   |
|NULL  |ECE |sm2083@ece.jgec.ac.in        |UG        |NO    |
|NULL  |EE  |ud2005@ee.jgec.ac.in         |UG        |NO    |
|NULL  |CSE |ak2087@cse.jgec.ac.in        |UG        |NO    |
|NULL  |EE  |mahadev10.02.1997@gmail.com  |UG        |NO    |
|NULL  |EE  |ss2058@ee.jgec.ac.in         |UG        |YES   |
|3/2/20|ME  |sd2028@me.jgec.ac.in         |UG        |YES   |
|NULL  |IT  |dipshichakraborty29@gmail.com|UG        |NO    |
|NULL  |EE  |sadhukalyanshis@gmail.com    |PG        |NO    |
+------+----+-----------------------------+----------+------+
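
The `dataAddress` value used above combines a quoted sheet name, an exclamation mark, and a cell range. Quoting is easy to get wrong when the sheet name contains spaces, so a small helper can assemble the string. This is a sketch only: `data_address` is a hypothetical function, and it deliberately rejects sheet names containing a single quote rather than guess how they should be escaped.

```python
def data_address(sheet_name, start_cell, end_cell=None):
    """Build a dataAddress string such as 'My Sheet'!A1:E10."""
    if "'" in sheet_name:
        # Escaping rules for quotes inside sheet names are not covered here
        raise ValueError("sheet names containing a single quote are not supported")
    cells = start_cell if end_cell is None else f"{start_cell}:{end_cell}"
    return f"'{sheet_name}'!{cells}"


print(data_address("JGEC db sheet", "A1", "E10"))
```

The result can be passed straight to `.option("dataAddress", ...)` on either the reader or the writer.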



In [45]:
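# Write the DataFrame to the sheet 'new_sheet' of a separate workbook, within the range A2:Z100.
# This saves to a new file, not to the source workbook. According to the spark-excel README,
# adding .mode("append") is the way to write an extra sheet into an existing workbook.

(
    jgec_df_partial.write
    .format("com.crealytics.spark.excel")
    .option("header", True)
    .option("dataAddress", "'new_sheet'!A2:Z100")
    .save( "./data/jgec_saved_by_pyspark.xlsx" )
)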