In [None]:
'''
For someone with 10+ years of experience, when asked how you optimized a PySpark application, 
you'll want to highlight both technical optimizations and strategic decisions that improved performance and resource efficiency. 
Here's a structured way to answer, showcasing advanced knowledge:
'''

### 1. Optimized Joins and Data Shuffling
'''
   - Broadcast Joins: In cases where one dataset was significantly smaller, I used broadcast joins to reduce the shuffle overhead.
By broadcasting the smaller dataset, I avoided expensive data shuffling across the cluster. For instance, in one project where we had a 
large transactional dataset and a small lookup table, broadcasting improved join performance significantly.

   - Partitioning and Coalescing: To reduce shuffle operations, I ensured that both datasets involved in the join were 
repartitioned on the join key, minimizing network traffic. 
I also used coalesce, which merges partitions without a full shuffle, to reduce the number of partitions when the data size was small.
'''

### 2. Caching and Persistence
'''
   - Data Caching: I identified stages of the pipeline where intermediate DataFrames were reused multiple times and used the `.cache()` or 
   `.persist()` methods with an appropriate storage level (e.g., MEMORY_AND_DISK). This avoided re-computation and significantly improved 
   performance in iterative algorithms.

   - Clearing Cached Data: After these DataFrames were no longer needed, I always made sure to release the memory by using `unpersist()`, 
   preventing memory leakage.
'''
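
'''
A minimal sketch of the persist-then-release pattern described above. The `persisted` helper is hypothetical (it is not part of PySpark); 
it only assumes that the object passed in exposes `persist(storage_level)` and `unpersist()`, as a PySpark DataFrame does.
'''

```python
from contextlib import contextmanager


@contextmanager
def persisted(df, storage_level=None):
    """Persist `df` for the duration of a `with` block, then unpersist it."""
    if storage_level is None:
        # Imported lazily so the helper can be defined without Spark installed.
        from pyspark import StorageLevel

        storage_level = StorageLevel.MEMORY_AND_DISK
    df.persist(storage_level)
    try:
        yield df
    finally:
        # Runs even if the block raises, so the cached data is always released.
        df.unpersist()


# Usage (assumes `features_df` is an existing DataFrame reused by several actions):
# with persisted(features_df) as cached:
#     total_rows = cached.count()
#     distinct_rows = cached.distinct().count()
```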

### 3. Tuning Cluster Resources
'''
   - Executor and Driver Tuning: By monitoring the Spark UI and logs, I found resource bottlenecks. I adjusted the number of cores and memory 
   allocated to executors and the driver. For large datasets, increasing the number of executor cores and adjusting spark.executor.memory 
   improved parallelism and reduced task duration.
   - Dynamic Allocation: In scenarios where workload fluctuated, I enabled dynamic resource allocation. This allowed Spark to scale the number 
   of executors up and down, optimizing resource utilization without over-provisioning.
'''
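
'''
A hedged sketch of enabling dynamic allocation when building a session. The helper names `dynamic_allocation_conf` and `build_session` 
are hypothetical, the executor bounds are illustrative, and the sketch assumes Spark 3.0+ with shuffle tracking rather than an external 
shuffle service.
'''

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Return Spark settings that enable dynamic executor allocation."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Dynamic allocation needs either an external shuffle service or
        # shuffle tracking; this sketch assumes shuffle tracking.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    }


def build_session(app_name, conf):
    """Create (or reuse) a SparkSession with the given settings applied."""
    from pyspark.sql import SparkSession  # lazy import: requires PySpark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (bounds are examples, not recommendations):
# spark = build_session("etl-job", dynamic_allocation_conf(2, 20))
```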

### 4. Improved Data Partitioning
'''
   - Optimized Partition Size: I ensured that data partitions were optimally sized by controlling the number of partitions using 
   `repartition()` or by setting the `spark.sql.shuffle.partitions` to a reasonable value, based on the dataset size and cluster resources. 
   For example, I reduced the default 200 partitions in shuffle operations to a more suitable number for small to medium datasets.

   - Skewed Data Handling: I encountered skewed data in certain joins where a few keys had disproportionately large data. I applied salting 
   techniques to break down these large keys into smaller chunks and reduce the processing time on those specific partitions.
'''
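
'''
One way to pick a shuffle partition count from the data size, as described above. This is a rough sketch: the helper name is hypothetical 
and the 128 MiB target per partition is an assumed rule of thumb, not a value taken from this notebook. Tune it for your cluster.
'''

```python
import math


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2):
    """Estimate a shuffle partition count so each partition is near the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Always keep at least one partition, even for tiny or empty inputs.
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


# Usage (assumes an active SparkSession named `spark`):
# n = suggest_shuffle_partitions(10 * 1024**3)  # about 10 GiB of shuffle data
# spark.conf.set("spark.sql.shuffle.partitions", str(n))
```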

### 5. Efficient File Formats
'''
   - Parquet and ORC Formats: I converted large data tables to columnar file formats like Parquet and ORC, which are highly optimized for 
   read-heavy workloads. Using these formats reduced I/O overhead and allowed for faster data scans, especially when used with predicate
   pushdown and column pruning.
   - Compression: I also enabled Snappy compression for Parquet files, which provided a good balance between compression ratio and 
   decompression speed, further improving the overall application performance.
'''

### 6. SQL Query Optimizations
'''
   - Predicate Pushdown: I ensured that Spark was pushing down filters to the underlying data source. For instance, when reading from databases 
   or Parquet files, I applied `filter()` early so predicates reached the source and `select()` so unused columns were pruned, limiting the 
   amount of data read into Spark and reducing both I/O and processing time.

   - Avoiding Python UDFs: Instead of using row-at-a-time Python UDFs (User Defined Functions), which are slow because data is serialized 
   between the JVM and the Python worker, I used built-in Spark SQL functions whenever possible and vectorized pandas UDFs where custom 
   logic was unavoidable.
'''


### 7. Optimizing Spark Configurations
'''
   - Adaptive Query Execution (AQE): I enabled AQE in certain workloads to allow Spark to dynamically optimize joins and shuffles at 
   runtime based on statistics. This helped in handling changing data characteristics, especially with large and dynamic datasets.
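   - Tuning Shuffle Settings: I fine-tuned shuffle parameters like `spark.sql.shuffle.partitions` (the number of shuffle partitions) and 
   `spark.reducer.maxSizeInFlight` (how much map output each reduce task fetches at once) to optimize shuffle behavior; right-sizing the 
   partition count in particular reduced the amount of data spilled to disk during joins and aggregations.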
'''

### 8. Monitoring and Debugging
'''
   - Spark UI Analysis: I frequently used the Spark UI to monitor job stages, tasks, and executor performance. By identifying 
   stragglers (slow tasks) and bottlenecks, I was able to refine my Spark configurations and partitioning strategies to improve overall 
   efficiency.
   - Driver and Executor Logs: I analyzed log4j logs for detailed execution patterns and to identify specific tasks that were causing delays, 
   such as long-running GC cycles, insufficient memory, or incorrect configuration of resources.
'''

### 9. Batch vs Streaming Optimization
'''
   - Batch Jobs: For large batch jobs, I made sure to pipeline stages effectively and reduced unnecessary intermediate data storage by 
   chaining transformations when appropriate.
   - Structured Streaming: For streaming applications, I optimized watermarking and checkpointing configurations to balance latency and 
   fault tolerance while ensuring low overhead.
'''

### 10. ETL Pipeline Optimization
'''
   - Incremental Loads: I avoided processing entire datasets repeatedly by implementing incremental loads. This approach allowed for efficient 
   ETL processing, where only the new or modified data was processed, drastically reducing processing times.
   
   - Concurrency and Parallelism: For complex ETL pipelines, I made use of Spark's ability to run independent tasks concurrently by 
   setting up parallel ETL jobs and tuning the task scheduling behavior.

'''
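
'''
A minimal sketch of an incremental load driven by a stored high-water mark. Everything named here is an assumption for illustration: 
the JSON state file, the `updated_at` timestamp column, the Parquet source, and the helper names. Real pipelines often keep this state 
in a metadata table instead.
'''

```python
import json
from pathlib import Path

EPOCH = "1970-01-01 00:00:00"


def read_watermark(state_file, default=EPOCH):
    """Return the timestamp of the last processed row, or `default` on first run."""
    path = Path(state_file)
    if not path.exists():
        return default
    return json.loads(path.read_text())["last_loaded"]


def write_watermark(state_file, value):
    """Store the new high-water mark after a successful load."""
    Path(state_file).write_text(json.dumps({"last_loaded": value}))


def load_increment(spark, source_path, watermark, ts_col="updated_at"):
    """Read only rows whose timestamp is newer than the stored watermark."""
    from pyspark.sql import functions as F  # lazy import: requires PySpark

    newer_than_watermark = F.col(ts_col) > F.lit(watermark).cast("timestamp")
    return spark.read.parquet(source_path).filter(newer_than_watermark)


# Usage (paths are hypothetical; assumes an active SparkSession named `spark`):
# watermark = read_watermark("state/orders.json")
# new_rows = load_increment(spark, "data/orders", watermark)
# ... transform and write new_rows ...
# latest = new_rows.agg({"updated_at": "max"}).first()[0]
# if latest is not None:
#     write_watermark("state/orders.json", str(latest))
```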

### Summary:
'''
By emphasizing a combination of technical tuning (broadcast joins, partitioning, caching), strategic decisions (file formats, incremental loads),
and monitoring tools (Spark UI, logs), you demonstrate that you not only have a deep understanding of PySpark optimization but also the 
experience to identify and address bottlenecks in real-world applications.
Feel free to tailor these examples to your own experiences with specific details from projects you’ve worked on!
'''

In [None]:
#Optimizing PySpark applications is critical to improving performance, reducing execution time, and minimizing resource usage. 
#Below are several techniques and best practices for optimizing PySpark:

### 1. DataFrame Operations Optimizations
'''
- Use DataFrames over RDDs: DataFrame queries are optimized by Catalyst (Spark's query optimizer) and executed by the Tungsten engine, while 
RDD operations run as written, with no query optimization. Prefer DataFrames over RDDs unless you have a very specific use case.

- Avoid User-Defined Functions (UDFs): Python UDFs are opaque to the optimizer and slow down performance. Whenever possible, use built-in 
PySpark functions or native SQL expressions. UDFs require serializing data between the JVM and Python, which is slow.

- Columnar Operations: Leverage PySpark's DataFrame API, which allows you to work on columns efficiently. Built-in methods like 
`.select()`, `.filter()`, `.withColumn()`, etc., are optimized.
'''
# Assumes `df` is an existing DataFrame with columns "column1" and "column2".
filtered_df = df.select("column1", "column2").filter(df["column1"] > 100)

### 2. Caching and Persistence
'''
- Cache Intermediate Data: If you're going to reuse a DataFrame multiple times, persist or cache it in memory to avoid recomputation. 
Use `.cache()` or `.persist()`, but ensure you unpersist (`df.unpersist()`) the data after usage to release memory.

- Select Correct Storage Levels: If your data doesn't fit into memory, choose a storage level based on your workload needs, e.g.:
1. MEMORY_AND_DISK
2. MEMORY_ONLY
3. DISK_ONLY

'''
df.cache()

   

### 3. Partitioning and Shuffling
'''
- Partitioning Data: Use `.repartition()` to increase or decrease the number of partitions based on your data size; it performs a full shuffle. 
For large datasets, increase the number of partitions so that each task handles a manageable amount of data. Use `.coalesce()` to reduce 
partitions without a full shuffle when needed (for example, after a heavy shuffle or a selective filter leaves many small partitions).

- Avoid Unnecessary Shuffles: Shuffling is one of the most expensive operations in Spark. Minimize operations that trigger a shuffle, 
such as:
1. groupBy()
2. join()
3. distinct()
Use broadcast joins (discussed below) to optimize joins.

'''   
df = df.repartition(50, "column")  # Repartition based on a column
df = df.coalesce(10)  # Reduce partitions after shuffling

### 4. Optimize Joins
'''
- Broadcast Joins: If one of the datasets in your join is small enough to fit into memory, use a broadcast join to avoid shuffling the 
larger dataset. Spark will broadcast the smaller dataset to all worker nodes.
- Skewed Data Handling: If one side of your join has skewed data, consider using salting techniques to distribute the data evenly 
across partitions. Salting involves artificially adding a random key to break large partitions into smaller chunks.
'''

from pyspark.sql.functions import broadcast
df_join = large_df.join(broadcast(small_df), "common_column")

   
### 5. Predicate Pushdown
'''
- Filter Early: Use filtering as early as possible to reduce the amount of data processed. PySpark can push down filters to the source 
system (e.g., Parquet, ORC) for optimized I/O.

- Optimize Data Source Reads: Ensure that your data source supports predicate pushdown (e.g., Parquet, ORC). PySpark will apply filters 
before reading the data, reducing the amount of data loaded into memory.

'''

df.filter(df["age"] > 30).select("name", "age")


### 6. Serialization and Deserialization Optimizations
'''
- Use Efficient Serialization: Switch to Kryo serialization instead of the default Java serializer. Kryo is generally faster and more compact. 
Registering your classes with Kryo is optional but makes the serialized output smaller. Note that this setting applies to JVM objects; 
Python objects in PySpark are still pickled.
- Avoid Large Objects: Minimize the size of objects sent between the driver and workers, especially when using actions like `.collect()`, 
as they can lead to high serialization costs.

'''
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)


### 7. Use Efficient File Formats
'''
- Parquet and ORC: Use columnar formats like Parquet or ORC for both reading and writing data. These formats are optimized for 
performance and offer better compression and faster query times compared to CSV or JSON.

- Compression: Choose an efficient compression format, such as Snappy or Zlib, to reduce I/O overhead without significantly 
increasing CPU usage.
'''
df.write.format("parquet").option("compression", "snappy").save("s3://your-path")

### 8. Avoid Wide Transformations
'''
- Narrow Transformations: Operations like map(), filter(), and select() are narrow transformations, which don’t require data to move across 
partitions. Prioritize these wherever possible.
- Wide Transformations: Operations like 
  1. groupByKey(),
  2. join(), and 
  3. reduceByKey() 
are wide transformations and cause data to shuffle across the network. Minimize wide transformations and, where one is needed, prefer 
reduceByKey() over groupByKey(), because it combines values within each partition before shuffling.
'''
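
'''
To illustrate why reduceByKey() usually shuffles less than groupByKey(), here is a small sketch. `count_by_key` shows the RDD call itself; 
`shuffled_record_counts` is a hypothetical plain-Python model of how many records each approach sends across the network, not Spark's 
actual implementation.
'''

```python
from operator import add


def count_by_key(pairs_rdd):
    """Sum the values for each key; partial sums are built inside each partition first."""
    return pairs_rdd.reduceByKey(add)


def shuffled_record_counts(partitions):
    """Model the records shuffled by groupByKey vs reduceByKey.

    `partitions` is a list of partitions, each a list of (key, value) pairs.
    Returns a tuple (group_by_key_records, reduce_by_key_records).
    """
    # groupByKey has no map-side combine: every record crosses the network.
    group_by_key = sum(len(part) for part in partitions)
    # reduceByKey pre-aggregates: one record per distinct key per partition.
    reduce_by_key = sum(len({key for key, _ in part}) for part in partitions)
    return group_by_key, reduce_by_key


# Usage (assumes an existing SparkContext named `sc`):
# pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
# totals = count_by_key(pairs).collect()
```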

### 9. Memory Management
'''
- Executor Memory Tuning: Allocate enough memory for Spark executors by tuning the `spark.executor.memory` setting. Too little memory causes 
frequent garbage collection and spills, while too much can lead to long GC pauses and wasted cluster resources.

- Off-Heap Memory Tuning: You can enable off-heap memory to store data outside of the JVM heap, reducing the pressure on the garbage collector.

spark.executor.memoryOverhead=1024
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=4g
   
'''   
# Shell command (run outside Python): spark-submit --executor-memory 8G --driver-memory 4G


### 10. Garbage Collection Tuning
'''
- GC Tuning: By adjusting the garbage collection strategy and tuning the heap size, you can reduce the time spent on GC. 
Use a collector such as G1GC for large heap sizes and enable GC logging for deeper analysis.

--conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35"
   
'''

### 11. Avoid Collecting Data to the Driver
'''
   - Limit Data Transfer to Driver: Avoid using `.collect()`, `.toPandas()`, and other actions that pull large amounts of data back to the driver. 
   This can overwhelm the driver's memory. Use `show()` or `take()` to sample small amounts of data.
'''
df.show(10)  # Displays only 10 rows instead of collecting everything


### 12. Adaptive Query Execution (AQE)
'''
 - Enable Adaptive Query Execution (AQE): AQE dynamically optimizes query plans based on runtime statistics, such as joining strategies and 
 partition sizes. This can lead to significant performance improvements.
'''
# Assumes an active SparkSession named `spark`.
spark.conf.set("spark.sql.adaptive.enabled", "true")


### 13. Broadcast Variables
'''
   - Use Broadcast Variables: For small datasets or static lookup tables, broadcast variables reduce communication overhead by sending a 
   read-only copy of the data to all worker nodes.
'''
broadcast_var = sc.broadcast(lookup_table)

### 14. Parallelism
'''
   - Increase Parallelism: Spark automatically assigns a default level of parallelism, but for larger datasets, increasing the parallelism can 
   improve performance. Use `spark.default.parallelism` for RDD operations and `spark.sql.shuffle.partitions` for DataFrame and SQL queries:

spark.default.parallelism=200
spark.sql.shuffle.partitions=200


'''   

### 15. Avoid Skewed Data
'''
   - Skewed Data Handling: If you have unevenly distributed data (e.g., one key having a large number of rows), it can cause performance 
   bottlenecks. Use salting techniques or custom partitioning strategies to distribute the data evenly.
'''
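
'''
A hedged sketch of the salting technique mentioned above, applied to a join. The function names and the `salt` column are hypothetical, 
and the sketch assumes neither DataFrame already has a column named `salt`. Note the trade-off: the smaller side is replicated 
`num_salts` times, so keep that number modest.
'''

```python
def salt_values(num_salts):
    """Return the salt values 0 .. num_salts - 1."""
    if num_salts < 1:
        raise ValueError("num_salts must be at least 1")
    return list(range(num_salts))


def salted_join(large_df, small_df, key, num_salts=8):
    """Inner-join two DataFrames on `key`, spreading hot keys over `num_salts` buckets."""
    from pyspark.sql import functions as F  # lazy import: requires PySpark

    # Skewed side: give every row a random salt in [0, num_salts).
    salted_large = large_df.withColumn(
        "salt", F.floor(F.rand() * num_salts).cast("int")
    )
    # Other side: replicate each row once per salt value so that every
    # (key, salt) pair on the skewed side still finds its match.
    salts = F.array(*[F.lit(s) for s in salt_values(num_salts)])
    salted_small = small_df.withColumn("salt", F.explode(salts))
    # Joining on (key, salt) splits each hot key across several partitions.
    return salted_large.join(salted_small, on=[key, "salt"], how="inner").drop("salt")


# Usage (assumes existing DataFrames that share a "customer_id" column):
# joined = salted_join(transactions_df, customers_df, "customer_id", num_salts=16)
```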

### Summary of Key Optimizations:
'''
1. Use DataFrames instead of RDDs and minimize UDFs.
2. Cache and persist frequently reused data.
3. Optimize joins using broadcast and avoid unnecessary shuffles.
4. Use efficient file formats like Parquet/ORC and enable predicate pushdown.
5. Tune memory and GC settings to balance performance.
6. Leverage Adaptive Query Execution (AQE) for dynamic query optimizations.
'''

#By following these optimization strategies, your PySpark jobs will run faster and more efficiently, making the most out of your resources on 
# AWS EMR or any Spark-based platform.