## PySpark Interview Questions â€“ Dataset Recap & Exercises

### ðŸ“‚ Dataset Recap

**customers** â†’ customer_id, name, email, city, signup_date  
**products** â†’ product_id, name, category, price  
**orders** â†’ order_id, customer_id, order_date, status  
**order_items** â†’ item_id, order_id, product_id, quantity  
**payments** â†’ payment_id, order_id, amount, payment_date, payment_method  

---

#### ðŸ”¹ Beginner / Core Questions

1. Read all five CSV files into PySpark DataFrames with headers and schema inference.  
2. Show the schema of each DataFrame.  
3. Get the total number of customers, products, and orders.  
4. List all distinct product categories.  
5. Find the top 5 cities with the most customers.  

---

#### ðŸ”¹ Joins & Aggregations

6. Join orders with customers â†’ Get a DataFrame of orders with customer names.  
7. Find the total sales per product category.  
8. Find the average order value per customer.  
9. Get the top 3 customers by total spending.  
10. Find the top 5 products by total quantity sold.  

---

#### ðŸ”¹ Date & Time Functions

11. Find the number of orders placed each month.  
12. Get customers who signed up in 2023 but have not placed any orders.  
13. Find the first order date for each customer.  

---

#### ðŸ”¹ Window Functions

14. For each customer, rank their orders by order_date.  
15. For each product, find the top 2 orders with the highest quantity.  
16. Calculate the running total of payments per customer.  

---

#### ðŸ”¹ Null Handling & Data Cleaning

17. Find all rows where email is null in customers.  
18. Replace null payment_method values with `"Unknown"`.  
19. Drop any orders that have no items in order_items.  

---

#### ðŸ”¹ Performance / Advanced

20. Explain the difference between `repartition()` and `coalesce()`.  
21. If one table is very small (like products), how can you optimize joins with large tables?  
22. How does caching help if the same DataFrame is reused multiple times?  
23. Whatâ€™s the difference between wide and narrow transformations? Give examples.  

---

In [2]:
from pyspark.sql import SparkSession
# use yarn as resource manager
spark = SparkSession.builder.appName('Test App').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/01/19 14:36:43 WARN Utils: Your hostname, Harishankars-MacBook-Air.local, resolves to a loopback address: 127.0.0.1; using 192.168.0.114 instead (on interface en0)
26/01/19 14:36:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/19 14:36:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/01/19 14:36:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
spark

In [3]:
master = spark.conf.get('spark.master')
master

'local[*]'

In [8]:
cust_df = spark.read.csv(r'/Users/harishankargiri/Desktop/Data Engineering/PySpark/PractiseByGPT/customers.csv', header=True, inferSchema=True)
item_df = spark.read.csv(r'/Users/harishankargiri/Desktop/Data Engineering/PySpark/PractiseByGPT/items.csv', header=True, inferSchema=True)
order_df = spark.read.csv(r'/Users/harishankargiri/Desktop/Data Engineering/PySpark/PractiseByGPT/orders.csv', header=True, inferSchema=True)
payment_df = spark.read.csv(r'/Users/harishankargiri/Desktop/Data Engineering/PySpark/PractiseByGPT/payments.csv', header=True, inferSchema=True)
product_df = spark.read.csv(r'/Users/harishankargiri/Desktop/Data Engineering/PySpark/PractiseByGPT/products.csv', header=True, inferSchema=True)

In [5]:
cust_df.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- signup_date: date (nullable = true)



In [6]:
order_df.count()

20

In [7]:
product_df.printSchema()

root
 |-- product_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- price: integer (nullable = true)



In [8]:
# List all distinct product categories.
product_df.select('category').distinct().show()

+-----------+
|   category|
+-----------+
|Electronics|
|   Clothing|
|      Books|
|  Furniture|
+-----------+



In [51]:
cust_df.show(5)

+-----------+----------+---------+-----------+
|customer_id| cust_name|     city|signup_date|
+-----------+----------+---------+-----------+
|          1|Customer_1|Bangalore| 2023-07-16|
|          2|Customer_2|   Mumbai| 2023-01-12|
|          3|Customer_3|  Chennai| 2023-03-05|
|          4|Customer_4|Bangalore| 2023-01-07|
|          5|Customer_5|  Kolkata| 2023-02-01|
+-----------+----------+---------+-----------+
only showing top 5 rows


In [9]:
# Find the top 5 cities with the most customers.
from pyspark.sql.functions import col, desc
cust_df.groupBy(col('city')).count().withColumnRenamed('count', 'no_of_cust').orderBy(desc('no_of_cust')).limit(5).show()

+---------+----------+
|     city|no_of_cust|
+---------+----------+
|Bangalore|         6|
|  Kolkata|         5|
|  Chennai|         3|
|   Mumbai|         3|
|    Delhi|         3|
+---------+----------+



In [10]:
# Join orders with customers â†’ Get a DataFrame of orders with customer names.
cust_order_df = cust_df.join(order_df, how='inner', on='customer_id')
cust_order_df.show(5)

+-----------+----------+---------+-----------+--------+----------+---------+
|customer_id|      name|     city|signup_date|order_id|order_date|   status|
+-----------+----------+---------+-----------+--------+----------+---------+
|          2|Customer_2|   Mumbai| 2023-01-12|    1014|2023-12-08|  Shipped|
|          5|Customer_5|  Kolkata| 2023-02-01|    1017|2023-09-15|  Pending|
|          6|Customer_6|Bangalore| 2023-08-02|    1008|2023-08-25|  Pending|
|          6|Customer_6|Bangalore| 2023-08-02|    1001|2023-10-05|Cancelled|
|          7|Customer_7|Bangalore| 2023-07-30|    1020|2023-03-28|Delivered|
+-----------+----------+---------+-----------+--------+----------+---------+
only showing top 5 rows


In [11]:
# Find the total sales per product category.
from pyspark.sql.functions import sum as _sum
product_df.groupBy(col('category')).agg(_sum('price').alias('total_sales')).show()

+-----------+-----------+
|   category|total_sales|
+-----------+-----------+
|Electronics|      93799|
|   Clothing|     307004|
|      Books|     227885|
|  Furniture|     424193|
+-----------+-----------+



In [12]:
cust_df = cust_df.withColumnRenamed('name', 'cust_name')
cust_order_amount_df = cust_df.join(order_df, how='inner', on='customer_id')\
                              .join(order_item_df, how='inner', on='order_id')\
                              .join(product_df, how='inner', on='product_id')

In [13]:
cust_order_amount_df = cust_order_amount_df.withColumnRenamed('name', 'product_name')
cust_order_amount_df.printSchema()

root
 |-- product_id: integer (nullable = true)
 |-- order_id: integer (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- cust_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- signup_date: date (nullable = true)
 |-- order_date: date (nullable = true)
 |-- status: string (nullable = true)
 |-- order_item_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- price: integer (nullable = true)



In [14]:
# Find the average order value per customer.
filtered_cust_order_amount_df = cust_order_amount_df.select('customer_id', 'cust_name', 'product_name', 'price')
from pyspark.sql.functions import avg
filtered_cust_order_amount_df.groupBy('customer_id').agg(avg('price').alias('average_order_value')).orderBy('customer_id').show()

+-----------+-------------------+
|customer_id|average_order_value|
+-----------+-------------------+
|          2|            32583.0|
|          5|            77375.0|
|          6|            53233.0|
|          7|            47073.5|
|          8|            43770.0|
|          9|            30207.5|
|         10|            57087.5|
|         11|            60402.0|
|         12|            53973.5|
|         13|            47292.5|
|         15|           26884.75|
|         16|            51340.5|
|         18|            57723.0|
+-----------+-------------------+



In [15]:
# Get the top 3 customers by total spending.
filtered_cust_order_amount_df.groupBy('customer_id').agg(_sum('price').alias('total_order_amount')).orderBy(desc('total_order_amount')).limit(3).show()

+-----------+------------------+
|customer_id|total_order_amount|
+-----------+------------------+
|         12|            323841|
|         11|            241608|
|         18|            230892|
+-----------+------------------+



In [16]:
cust_order_amount_df.show(3)

+----------+--------+-----------+----------+-------+-----------+----------+-------+-------------+--------+------------+---------+-----+
|product_id|order_id|customer_id| cust_name|   city|signup_date|order_date| status|order_item_id|quantity|product_name| category|price|
+----------+--------+-----------+----------+-------+-----------+----------+-------+-------------+--------+------------+---------+-----+
|       114|    1014|          2|Customer_2| Mumbai| 2023-01-12|2023-12-08|Shipped|           28|       3|  Product_14|Furniture|59786|
|       112|    1014|          2|Customer_2| Mumbai| 2023-01-12|2023-12-08|Shipped|           27|       3|  Product_12|    Books| 5380|
|       102|    1017|          5|Customer_5|Kolkata| 2023-02-01|2023-09-15|Pending|           34|       4|   Product_2|Furniture|77711|
+----------+--------+-----------+----------+-------+-----------+----------+-------+-------------+--------+------------+---------+-----+
only showing top 3 rows


In [17]:
#Find the top 5 products by total quantity sold.
prduct_order_df = order_item_df.join(product_df, how='left', on='product_id')
prduct_order_df.groupBy('name').agg(_sum('quantity').alias('total_quantity_sold')).orderBy(desc('total_quantity_sold')).limit(5).show()

+----------+-------------------+
|      name|total_quantity_sold|
+----------+-------------------+
|Product_15|                 16|
| Product_2|                 13|
| Product_5|                 13|
|Product_19|                 10|
|Product_17|                 10|
+----------+-------------------+



In [18]:
#Find the number of orders placed each month.
from pyspark.sql.functions import day, month, year

order_df.groupBy(month('order_date')).count().show()

+-----------------+-----+
|month(order_date)|count|
+-----------------+-----+
|               12|    2|
|                6|    1|
|                3|    4|
|                9|    2|
|                4|    4|
|                8|    2|
|               10|    2|
|               11|    1|
|                2|    2|
+-----------------+-----+



In [19]:
cust_df.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- cust_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- signup_date: date (nullable = true)



In [20]:
#Get customers who signed up in 2023 but have not placed any orders.
cust_df.join(order_df, how='leftanti', on='customer_id').filter(year('signup_date') == 2023).show()

+-----------+-----------+---------+-----------+
|customer_id|  cust_name|     city|signup_date|
+-----------+-----------+---------+-----------+
|          1| Customer_1|Bangalore| 2023-07-16|
|          3| Customer_3|  Chennai| 2023-03-05|
|          4| Customer_4|Bangalore| 2023-01-07|
|         14|Customer_14|    Delhi| 2023-06-03|
|         17|Customer_17|    Delhi| 2023-06-21|
|         19|Customer_19|  Kolkata| 2023-06-22|
|         20|Customer_20|    Delhi| 2023-09-15|
+-----------+-----------+---------+-----------+



In [21]:
order_df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_date: date (nullable = true)
 |-- status: string (nullable = true)



In [22]:
#Find the first order date for each customer.
from pyspark.sql.functions import min as _min
order_df.groupBy('customer_id').agg(_min(col('order_date'))).show()

+-----------+---------------+
|customer_id|min(order_date)|
+-----------+---------------+
|         12|     2023-02-28|
|         13|     2023-03-29|
|          6|     2023-08-25|
|         16|     2023-08-01|
|          5|     2023-09-15|
|         15|     2023-06-27|
|          9|     2023-04-05|
|          8|     2023-02-20|
|          7|     2023-03-28|
|         10|     2023-04-17|
|         11|     2023-03-12|
|          2|     2023-12-08|
|         18|     2023-04-11|
+-----------+---------------+



In [23]:
#For each customer, rank their orders by order_date.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy('customer_id').orderBy('order_date')
order_df.withColumn(
    'ranking',
     F.rank().over(windowSpec)
).show()

+--------+-----------+----------+---------+-------+
|order_id|customer_id|order_date|   status|ranking|
+--------+-----------+----------+---------+-------+
|    1014|          2|2023-12-08|  Shipped|      1|
|    1017|          5|2023-09-15|  Pending|      1|
|    1008|          6|2023-08-25|  Pending|      1|
|    1001|          6|2023-10-05|Cancelled|      2|
|    1020|          7|2023-03-28|Delivered|      1|
|    1004|          8|2023-02-20|Cancelled|      1|
|    1016|          8|2023-04-28|Delivered|      2|
|    1012|          9|2023-04-05|Delivered|      1|
|    1015|         10|2023-04-17|Delivered|      1|
|    1002|         11|2023-03-12|  Pending|      1|
|    1019|         11|2023-11-09|  Shipped|      2|
|    1013|         12|2023-02-28|Cancelled|      1|
|    1010|         12|2023-03-13|  Pending|      2|
|    1009|         12|2023-09-12|Delivered|      3|
|    1011|         13|2023-03-29|Delivered|      1|
|    1007|         15|2023-06-27|  Pending|      1|
|    1006|  

In [24]:
# For each product, find the top 2 orders with the highest quantity.
product_df.printSchema()

root
 |-- product_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- price: integer (nullable = true)



In [25]:
#Calculate the running total of payments per customer.
cust_payment_df = order_df.join(payment_df, 'order_id', 'inner')
cust_payment_df.show()

+--------+-----------+----------+---------+----------+------------+------+
|order_id|customer_id|order_date|   status|payment_id|payment_type|amount|
+--------+-----------+----------+---------+----------+------------+------+
|    1001|          6|2023-10-05|Cancelled|      5001|        Card| 25658|
|    1002|         11|2023-03-12|  Pending|      5002|      Wallet| 81505|
|    1003|         18|2023-04-11|  Pending|      5003|         UPI| 84427|
|    1004|          8|2023-02-20|Cancelled|      5004|      Wallet| 43819|
|    1005|         18|2023-10-17|Delivered|      5005|  NetBanking| 74728|
|    1006|         15|2023-12-31|  Pending|      5006|         UPI| 48462|
|    1007|         15|2023-06-27|  Pending|      5007|         UPI| 40836|
|    1008|          6|2023-08-25|  Pending|      5008|         UPI| 74756|
|    1009|         12|2023-09-12|Delivered|      5009|  NetBanking| 37376|
|    1010|         12|2023-03-13|  Pending|      5010|         UPI| 79036|
|    1011|         13|202

In [26]:
#running total find
windowSpec = Window.partitionBy('customer_id').orderBy('amount')
cust_payment_df.withColumn(
    'running_total',
    F.sum('amount').over(windowSpec)
).select('customer_id', 'amount', 'running_total').show()

+-----------+------+-------------+
|customer_id|amount|running_total|
+-----------+------+-------------+
|          2| 94866|        94866|
|          5| 69613|        69613|
|          6| 25658|        25658|
|          6| 74756|       100414|
|          7| 73679|        73679|
|          8| 43819|        43819|
|          8| 77109|       120928|
|          9| 16963|        16963|
|         10| 65080|        65080|
|         11| 10752|        10752|
|         11| 81505|        92257|
|         12| 37376|        37376|
|         12| 40016|        77392|
|         12| 79036|       156428|
|         13| 54744|        54744|
|         15| 40836|        40836|
|         15| 48462|        89298|
|         16| 50575|        50575|
|         18| 74728|        74728|
|         18| 84427|       159155|
+-----------+------+-------------+



In [27]:
cust_null_df = spark.read.csv('/Users/harishankargiri/Desktop/Data Engineering/PySpark/PractiseByGPT/customers_with_nulls.csv', header=True, inferSchema=True)
cust_null_df.count()

20

In [28]:
cust_null_df.filter(col('email').isNull()).show()

+-----------+---------------+-----+----------+-------------------+--------------+
|customer_id|           name|email|     phone|            address|payment_method|
+-----------+---------------+-----+----------+-------------------+--------------+
|          6|  Emily Johnson| NULL|9883781479| 54 Main St, City 6|   Credit Card|
|          7|   David Wilson| NULL|9773993630|113 Main St, City 8|    Debit Card|
|         10|   Olivia Davis| NULL|9655822483| 33 Main St, City 7|           UPI|
|         18|Charlotte Lewis| NULL|9744099132| 14 Main St, City 8|        PayPal|
+-----------+---------------+-----+----------+-------------------+--------------+



In [None]:
# Replace null payment_method values with "Unknown". 
cust_null_df = cust_null_df.withColumn(
    'payment_method_new',
    F.when(col('payment_method').isNull(), 'Unknown')
    .otherwise(col('payment_method'))
)

In [30]:
cust_null_df = cust_null_df.drop('payment_method').withColumnRenamed('payment_method_new', 'payment_method')
cust_null_df.show()

+-----------+---------------+--------------------+----------+-------------------+--------------+
|customer_id|           name|               email|     phone|            address|payment_method|
+-----------+---------------+--------------------+----------+-------------------+--------------+
|          1|       John Doe|   john1@hotmail.com|      NULL| 95 Main St, City 8|           UPI|
|          2|     Jane Smith|   jane2@outlook.com|9558765521| 22 Main St, City 3|       Unknown|
|          3|   Robert Brown| robert3@hotmail.com|9456041729|177 Main St, City 5|   Net Banking|
|          4|    Linda White|  linda4@hotmail.com|9792231526| 78 Main St, City 3|       Unknown|
|          5|  Michael Green|michael5@outlook.com|9536496600| 28 Main St, City 5|           UPI|
|          6|  Emily Johnson|                NULL|9883781479| 54 Main St, City 6|   Credit Card|
|          7|   David Wilson|                NULL|9773993630|113 Main St, City 8|    Debit Card|
|          8|  Sophia Miller| 

In [36]:
mydata = [
    (1, 'x', 23),
    (2, 'y', 32)
]
mycol = ['id', 'name', 'age']

mydf = spark.createDataFrame(mydata, mycol)
mydf.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



In [39]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

myschema = StructType([
    StructField('id', IntegerType(), True),
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])

mydf2 = spark.createDataFrame(mydata, myschema)
mydf2.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)



In [9]:
cust_df.printSchema()
order_df.printSchema()
item_df.printSchema()
payment_df.printSchema()
product_df.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- signup_date: date (nullable = true)

root
 |-- order_id: integer (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- order_date: date (nullable = true)
 |-- status: string (nullable = true)

root
 |-- order_item_id: integer (nullable = true)
 |-- order_id: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)

root
 |-- payment_id: integer (nullable = true)
 |-- order_id: integer (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- amount: integer (nullable = true)

root
 |-- product_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- price: integer (nullable = true)



In [6]:
order_item_df.show(5)

26/01/19 14:45:12 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 11)
org.apache.spark.SparkException: [FAILED_READ_FILE.FILE_NOT_EXIST] Encountered error while reading file file:///Users/harishankargiri/Desktop/Data%20Engineering/PySpark/PractiseByGPT/order_items.csv. File does not exist. It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. SQLSTATE: KD001
	at org.apache.spark.sql.errors.QueryExecutionErrors$.fileNotExistError(QueryExecutionErrors.scala:831)
	at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2$.attachFilePath(FileDataSourceV2.scala:140)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:142)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.proce

Py4JJavaError: An error occurred while calling o42.showString.
: org.apache.spark.SparkException: [FAILED_READ_FILE.FILE_NOT_EXIST] Encountered error while reading file file:///Users/harishankargiri/Desktop/Data%20Engineering/PySpark/PractiseByGPT/order_items.csv. File does not exist. It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. SQLSTATE: KD001
	at org.apache.spark.sql.errors.QueryExecutionErrors$.fileNotExistError(QueryExecutionErrors.scala:831)
	at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2$.attachFilePath(FileDataSourceV2.scala:140)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:142)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:402)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
	at org.apache.spark.scheduler.Task.run(Task.scala:147)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:842)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2484)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2505)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2524)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:544)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:497)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:58)
	at org.apache.spark.sql.classic.Dataset.collectFromPlan(Dataset.scala:2244)
	at org.apache.spark.sql.classic.Dataset.$anonfun$head$1(Dataset.scala:1379)
	at org.apache.spark.sql.classic.Dataset.$anonfun$withAction$2(Dataset.scala:2234)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:654)
	at org.apache.spark.sql.classic.Dataset.$anonfun$withAction$1(Dataset.scala:2232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$8(SQLExecution.scala:162)
	at org.apache.spark.sql.execution.SQLExecution$.withSessionTagsApplied(SQLExecution.scala:268)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$7(SQLExecution.scala:124)
	at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
	at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:112)
	at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:106)
	at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:111)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:124)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:291)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:123)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:77)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:233)
	at org.apache.spark.sql.classic.Dataset.withAction(Dataset.scala:2232)
	at org.apache.spark.sql.classic.Dataset.head(Dataset.scala:1379)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2810)
	at org.apache.spark.sql.classic.Dataset.getRows(Dataset.scala:339)
	at org.apache.spark.sql.classic.Dataset.showString(Dataset.scala:375)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.io.FileNotFoundException: File file:/Users/harishankargiri/Desktop/Data Engineering/PySpark/PractiseByGPT/order_items.csv does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:917)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1238)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:907)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:189)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:572)
	at org.apache.hadoop.fs.ChecksumFileSystem.lambda$openFileWithOptions$0(ChecksumFileSystem.java:1100)
	at org.apache.hadoop.util.LambdaUtils.eval(LambdaUtils.java:52)
	at org.apache.hadoop.fs.ChecksumFileSystem.openFileWithOptions(ChecksumFileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem$FSDataInputStreamBuilder.build(FileSystem.java:4952)
	at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:100)
	at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.$anonfun$_iterator$2(HadoopFileLinesReader.scala:66)
	at org.apache.spark.util.SparkErrorUtils.tryInitializeResource(SparkErrorUtils.scala:59)
	at org.apache.spark.util.SparkErrorUtils.tryInitializeResource$(SparkErrorUtils.scala:56)
	at org.apache.spark.util.Utils$.tryInitializeResource(Utils.scala:99)
	at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
	at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.$anonfun$readFile$1(CSVDataSource.scala:105)
	at org.apache.spark.TaskContextImpl.createResourceUninterruptibly(TaskContextImpl.scala:332)
	at org.apache.spark.util.Utils$.$anonfun$createResourceUninterruptiblyIfInTaskThread$1(Utils.scala:3097)
	at scala.Option.map(Option.scala:242)
	at org.apache.spark.util.Utils$.createResourceUninterruptiblyIfInTaskThread(Utils.scala:3096)
	at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.readFile(CSVDataSource.scala:105)
	at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:147)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:230)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:289)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext0(FileScanRDD.scala:131)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:140)
	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:402)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:901)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:901)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:338)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
	at org.apache.spark.scheduler.Task.run(Task.scala:147)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	... 1 more
