In [None]:
'''
These questions focus on advanced PySpark concepts such as:
- performance optimization,
- complex data transformations,
- large-scale data processing,
- architectural decision-making.
'''

In [None]:
### 1. **Optimization of PySpark Jobs**
# Q) You are processing a large dataset in PySpark that involves multiple joins and aggregations. The job is slow and often times out. 
# What steps would you take to optimize the performance of the PySpark job?
'''
- Enable **predicate pushdown** when reading data.
- Repartition the data intelligently based on join keys.
- Use **broadcast joins** when one of the tables is small.
- Cache intermediate DataFrames if they are used multiple times.
- Avoid actions such as `count()` or `collect()` unless absolutely necessary: each one triggers a full job, and `collect()` pulls every row to the driver.
- Set proper **shuffle partitions** based on cluster size (`spark.sql.shuffle.partitions`).
- Use **filter** and **select** to reduce the size of DataFrames early in the process.

'''
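# Example: optimizing joins with a broadcast hint
from pyspark.sql.functions import broadcast

# Ship the small table to every executor so the large table is not shuffled.
# No repartition after the join: it would only add an extra shuffle before show().
large_df.join(broadcast(small_df), 'join_key', 'inner').show()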
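
# A rough sizing sketch for `spark.sql.shuffle.partitions`, written in plain Python.
# The 128 MiB target per partition is a common rule of thumb, not a Spark requirement,
# and `suggest_shuffle_partitions` is a hypothetical helper, not a Spark API.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes: int, total_cores: int,
                               target_partition_bytes: int = 128 * 1024**2) -> int:
    """Suggest a shuffle partition count from data volume and cluster cores."""
    # Enough partitions that each one is roughly target_partition_bytes
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    # Never fewer partitions than cores, so no core sits idle
    needed = max(by_size, total_cores)
    # Round up to a multiple of the core count so task waves are full
    return math.ceil(needed / total_cores) * total_cores


# Assumed example: 200 GiB shuffled on a cluster with 64 executor cores
suggested = suggest_shuffle_partitions(200 * 1024**3, total_cores=64)
print(suggested)
```

# The result could be applied with `spark.conf.set("spark.sql.shuffle.partitions", suggested)`.
# With adaptive query execution enabled, Spark may also coalesce small shuffle partitions on its own.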


In [None]:
### 2. **Handling Skewed Data**
# Q) You notice that one of the partitioned columns in your data has highly skewed data, leading to slow tasks in your Spark job. 
# How would you handle data skewness in PySpark?
'''
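- **Salting**: append an artificial suffix to hot keys so their rows spread over several partitions; in a join, replicate the other side once per suffix so the keys still match.
- Use **random (round-robin) repartitioning** to even out partition sizes when the next step does not depend on the key.
- Use a **broadcast join** when the other side of the join is small enough to broadcast, which avoids shuffling the skewed side at all.
- Increase the **shuffle partitions** to shrink the average partition; this alone does not split a single hot key.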

'''  
from pyspark.sql.functions import expr, monotonically_increasing_id

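# Example of salting: append a suffix 0-9 to the join key so rows for a hot key spread over 10 salted keys.
# In a join, the other side must be replicated once per suffix (0-9) so the salted keys still match.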
df = df.withColumn("salted_key", expr("concat(join_key, '_', (monotonically_increasing_id() % 10))"))
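
# The effect of salting can be illustrated without a cluster. The sketch below is a plain-Python
# simulation, assuming 8 partitions, 10 salt values, and a CRC32-based stand-in for the partitioner
# (Spark itself uses a different hash); the key names and row counts are made up for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed number of shuffle partitions
NUM_SALTS = 10       # assumed number of salt values


def partition_of(key: str) -> int:
    # Deterministic stand-in for a hash partitioner (not Spark's actual hash)
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


rng = random.Random(0)
# 10,000 rows for one hot key plus 100 rows each for 20 ordinary keys
keys = ["hot"] * 10_000 + [f"key_{i}" for i in range(20) for _ in range(100)]

# Without salting, every "hot" row lands in the same partition
unsalted = Counter(partition_of(k) for k in keys)
# With salting, each row gets a random suffix, so "hot" splits into up to 10 keys
salted = Counter(partition_of(f"{k}_{rng.randrange(NUM_SALTS)}") for k in keys)

print("largest partition without salting:", max(unsalted.values()))
print("largest partition with salting:   ", max(salted.values()))
```

# The row count is unchanged; only the largest partition shrinks, which is what shortens the slowest task.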


In [None]:
### 3. **Data Pipeline Design**
#   - **Question:** You are tasked with designing an ETL pipeline to process a daily batch of 1 TB of JSON files stored in S3 and 
# load the processed data into a Redshift table. Explain the design of your pipeline using PySpark and include optimizations for both 
# compute and storage.
'''
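     - Use **S3 Select**, where available, to filter rows on the S3 side before they reach Spark.
     - Lay out the S3 prefixes by date so that **partition pruning** skips irrelevant files. **Predicate pushdown** is limited for JSON and helps mainly once the data is in Parquet or ORC; **dynamic partition pruning** applies when a partitioned table is joined to a filtered table.
     - Use **incremental loads** (process only the new day's files) and watermarking to handle late data.
     - Use **DataFrame caching** only for data that is reused, and repartition so the output files are evenly sized for a parallel Redshift load.
     - Stage the output as compressed columnar files (Parquet or ORC) in S3 before loading into Redshift, for space and performance efficiency.
     - Use the Redshift **COPY command** or **AWS Glue** for an efficient load.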
''' 
df = spark.read.json('s3://bucket/data/')
# Prunes partitions only if the files sit under date=YYYY-MM-DD/ prefixes; otherwise Spark filters after a full scan
df_filtered = df.filter(df['date'] == '2024-09-13')
df_filtered.write.parquet('/path/to/save')
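
# The final load step can be sketched as a Redshift COPY statement built in Python. The table name,
# S3 prefix, and IAM role ARN below are hypothetical placeholders, and `build_copy_statement` is an
# illustrative helper; running the statement requires a Redshift connection that is not shown here.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a COPY statement that loads Parquet files from S3 into a Redshift table."""
    # Values are interpolated directly, so they must come from trusted configuration
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Placeholder values for illustration only
copy_sql = build_copy_statement(
    table="analytics.daily_events",
    s3_prefix="s3://bucket/staged/date=2024-09-13/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-load",
)
print(copy_sql)
```

# COPY reads the staged files in parallel, which is one reason to write several evenly sized files rather than one large file.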

In [None]:
### 4. **Streaming and Batch Integration**
# Explain how you would integrate both batch and real-time streaming data pipelines in PySpark. For example, consider you have to 
# process clickstream data (streaming) along with daily user profile updates (batch).
'''
    - Use **Structured Streaming** to handle real-time clickstream data.
    - Batch user profile updates are processed separately and joined with the streaming data.
    - Utilize **watermarking** to handle late data in the streaming pipeline.
    - Output both streams and batch results in the same storage format (e.g., Parquet) for consistency.

'''

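# Example: streaming and batch join (broker address, topic, and JSON field are placeholders)
from pyspark.sql.functions import col, get_json_object
streaming_df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "host:9092").option("subscribe", "clickstream").load()
    .select(get_json_object(col("value").cast("string"), "$.user_id").alias("user_id")))  # Kafka rows carry key/value bytes, so extract the join column first
batch_df = spark.read.parquet("/path/to/batch")
result_df = streaming_df.join(batch_df, "user_id", "left_outer")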
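
# How a watermark decides which late events to keep can be sketched in plain Python. This is a
# simplified model, assuming the watermark is the maximum event time seen in earlier batches minus
# the allowed delay; `filter_late_events` is a hypothetical helper, not part of Structured Streaming.

```python
from datetime import datetime, timedelta


def filter_late_events(batches, delay):
    """Return, per batch, the event times that are not older than the watermark."""
    max_event_time = None
    kept_per_batch = []
    for batch in batches:
        # The watermark for this batch comes from events seen in earlier batches
        watermark = None if max_event_time is None else max_event_time - delay
        kept_per_batch.append(
            [t for t in batch if watermark is None or t >= watermark]
        )
        # Advance the maximum using every event seen, including late ones
        for t in batch:
            if max_event_time is None or t > max_event_time:
                max_event_time = t
    return kept_per_batch


def at(hour, minute):
    return datetime(2024, 9, 13, hour, minute)


batches = [[at(12, 0), at(12, 5)], [at(11, 50), at(12, 6)]]
kept = filter_late_events(batches, delay=timedelta(minutes=10))
# Second batch: watermark is 12:05 - 10 min = 11:55, so the 11:50 event is too late
print(kept[1])
```

# In Structured Streaming the delay is declared with `withWatermark("event_time", "10 minutes")`.
# It matters for stateful operations such as windowed aggregations and stream-stream joins.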


In [None]:
### 5. **Advanced Window Functions**
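   - **Question:** You have a dataset that tracks stock prices in real time, and you need to calculate a rolling 7-day average price for each stock symbol. How would you implement this using PySpark?
   - **Expected Solution:**
     Use **Window functions** partitioned by symbol and ordered by date. Note that `rowsBetween(-6, 0)` averages the last 7 rows, which equals 7 days only when each symbol has exactly one row per day. For a true calendar window, order by the date cast to epoch seconds and use `rangeBetween(-6 * 86400, 0)`.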
     ```python
     from pyspark.sql.window import Window
     from pyspark.sql.functions import avg

     window_spec = Window.partitionBy('symbol').orderBy('date').rowsBetween(-6, 0)
     df.withColumn('7_day_avg_price', avg('price').over(window_spec)).show()
     ```
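
The difference between a row-based and a time-based window can be checked on a tiny series. The following is a plain-Python reference sketch, assuming one `(date, price)` list for a single symbol sorted by date; the function names and sample prices are illustrative and not part of PySpark.

```python
from datetime import date, timedelta


def rolling_avg_rows(series, n=7):
    """Average over the current row and the n-1 rows before it (like rowsBetween)."""
    out = []
    for i in range(len(series)):
        window = [price for _, price in series[max(0, i - n + 1): i + 1]]
        out.append(sum(window) / len(window))
    return out


def rolling_avg_days(series, days=7):
    """Average over rows whose date falls in the last `days` calendar days (like rangeBetween)."""
    out = []
    for current, _ in series:
        start = current - timedelta(days=days - 1)
        window = [price for d, price in series if start <= d <= current]
        out.append(sum(window) / len(window))
    return out


# Sample data with a gap: no rows between 2 January and 10 January
prices = [(date(2024, 1, 1), 10.0), (date(2024, 1, 2), 20.0), (date(2024, 1, 10), 30.0)]
by_rows = rolling_avg_rows(prices)   # [10.0, 15.0, 20.0]
by_days = rolling_avg_days(prices)   # [10.0, 15.0, 30.0]
print(by_rows, by_days)
```

On 10 January the row-based window still includes the two prices from early January, while the time-based window contains only that day's price.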

In [None]:

### 6. **Fault Tolerance and Error Handling in PySpark**
   - **Question:** Describe how you would handle fault tolerance and ensure data consistency in a large-scale PySpark application that processes real-time sensor data.
   - **Expected Solution:**
     - Enable **checkpointing** for long-running streaming jobs to ensure fault tolerance.
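     - Handle **bad records** in file-based ingestion, for example with the `badRecordsPath` option (a Databricks feature) or, in open-source Spark, `PERMISSIVE` mode with `columnNameOfCorruptRecord`.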
     - Implement **idempotent** writes using unique identifiers for each record.
     - Use **exactly-once semantics** with Structured Streaming and transactional sinks like Delta Lake.
     ```python
     streaming_df.writeStream \
       .format("delta") \
       .option("checkpointLocation", "/path/to/checkpoint") \
       .start("/path/to/output")
     ```
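
Idempotent writes can be illustrated with a small in-memory model. This is a sketch, assuming each record carries a unique `event_id` field (a hypothetical name); `IdempotentSink` stands in for a keyed upsert target and is not a Spark or Delta Lake class.

```python
class IdempotentSink:
    """In-memory stand-in for a sink that upserts records by a unique key."""

    def __init__(self):
        self.rows = {}

    def write_batch(self, records):
        # Upsert by event_id: replaying a batch overwrites rows instead of duplicating them
        for record in records:
            self.rows[record["event_id"]] = record


batch = [
    {"event_id": "s1-001", "sensor": "s1", "value": 21.5},
    {"event_id": "s2-001", "sensor": "s2", "value": 19.0},
]

sink = IdempotentSink()
sink.write_batch(batch)
state_after_first_write = dict(sink.rows)

# Simulate a retry after a failure: the same batch is delivered again
sink.write_batch(batch)
print(len(sink.rows))
```

Because the second write leaves the state unchanged, a job that restarts from its checkpoint and replays a batch does not create duplicates.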


In [None]:

### 7. **Handling Large Data with PySpark and Partitioning Strategies**
   - **Question:** You have a dataset that exceeds 10 TB, and you need to efficiently store and process it using PySpark. How would you design the storage layout and partitioning strategy to optimize read and write performance?
   - **Expected Solution:**
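     - Partition the data on low-cardinality columns that queries commonly filter on (e.g., `date`, `region`); high-cardinality partition columns produce a very large number of small files.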
     - Use **bucketing** on frequently used join columns to reduce shuffle.
     - Store the data in **Parquet/ORC** format with snappy compression for efficient reads and writes.
     ```python
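     # bucketBy() is only supported with saveAsTable(); calling save() raises an AnalysisException.
     # 'events_bucketed' is a placeholder table name.
     df.write \
       .partitionBy('date', 'region') \
       .bucketBy(10, 'user_id') \
       .format('parquet') \
       .option('path', '/path/to/save') \
       .saveAsTable('events_bucketed')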
     ```
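
A quick estimate shows why partition cardinality matters. The sketch below is plain arithmetic under stated assumptions: 365 distinct dates, 5 regions, 1,000,000 users, every partition and bucket combination populated, and one file per combination. `estimate_layout` is a hypothetical helper, and real file counts are usually higher because each writing task emits its own files.

```python
import math


def estimate_layout(total_bytes, partition_cardinalities, num_buckets=1):
    """Estimate directory count, a lower bound on file count, and average file size."""
    # One directory per combination of partition column values
    directories = math.prod(partition_cardinalities)
    # At least one file per bucket in each directory
    min_files = directories * num_buckets
    return {
        "directories": directories,
        "min_files": min_files,
        "avg_file_bytes": total_bytes / min_files,
    }


TEN_TIB = 10 * 1024**4

# Partition by date and region, bucket by user_id into 10 buckets
by_date_region = estimate_layout(TEN_TIB, [365, 5], num_buckets=10)
# Partition directly by user_id instead (assumed 1,000,000 distinct users)
by_user = estimate_layout(TEN_TIB, [1_000_000])

print(by_date_region["min_files"], round(by_date_region["avg_file_bytes"] / 1024**2), "MiB")
print(by_user["min_files"], round(by_user["avg_file_bytes"] / 1024**2), "MiB")
```

Under these assumptions, partitioning by `user_id` yields a million directories with files of only a few MiB each, while the date and region layout keeps files in the hundreds of MiB.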

In [None]:
### 8. **Custom UDFs and UDAFs**
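   - **Question:** Write a custom PySpark UDF that takes a string column and returns the reverse of each string. How would you handle performance issues with UDFs?
   - **Expected Solution:**
     - Prefer a built-in function when one exists (here `pyspark.sql.functions.reverse`), since built-ins run in the JVM without Python serialization overhead.
     - Otherwise use **pandas UDFs** (vectorized UDFs), which process a batch of rows at a time, for performance improvement.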
     ```python
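     import pandas as pd
     from pyspark.sql.functions import pandas_udf, udf
     from pyspark.sql.types import StringType

     # Row-at-a-time Python UDF; guard against NULL inputs, which arrive as None
     @udf(returnType=StringType())
     def reverse_string(s):
         return s[::-1] if s is not None else None

     df.withColumn('reversed', reverse_string(df['column'])).show()

     # Pandas UDF: receives a whole pandas Series per batch (requires pyarrow)
     @pandas_udf(StringType())
     def reverse_string_pandas(s: pd.Series) -> pd.Series:
         # Vectorized string slicing; missing values stay missing
         return s.str[::-1]

     df.withColumn('reversed', reverse_string_pandas(df['column'])).show()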
     ```

In [None]:
### 9. **Cluster Resource Management and Tuning**
   - **Question:** You are working on a PySpark job in a YARN cluster that frequently runs out of memory. What configurations and tuning steps would you apply to manage memory and resources effectively?
   - **Expected Solution:**
     - Increase **executor memory** and **driver memory** based on the job's requirements.
     - Tune **number of cores** and **executors** to balance parallelism and resource usage.
     - Set **spark.memory.fraction** and **spark.memory.storageFraction** to optimize memory for execution and storage.
     - Use **dynamic allocation** to optimize resource usage across different stages.
     ```python
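     from pyspark.sql import SparkSession

     # Example values only; size them to the workload and cluster.
     # Dynamic allocation on YARN also needs the external shuffle service or shuffle tracking enabled.
     spark = SparkSession.builder \
       .config("spark.executor.memory", "8g") \
       .config("spark.executor.cores", "4") \
       .config("spark.dynamicAllocation.enabled", "true") \
       .getOrCreate()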
     ```
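
How these settings divide an executor's memory can be worked through numerically. The sketch below uses Spark's documented defaults (300 MiB reserved, `spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5, and an overhead of 10% of executor memory with a 384 MiB minimum). `executor_memory_breakdown` is a hypothetical helper, and the figures are approximate because the usable JVM heap is slightly smaller than the configured value.

```python
def executor_memory_breakdown(executor_memory_mb, memory_fraction=0.6,
                              storage_fraction=0.5, overhead_factor=0.10):
    """Approximate how an executor's memory is split, in MiB."""
    reserved_mb = 300  # fixed amount Spark reserves inside the heap
    # Off-heap overhead requested from YARN on top of the heap
    overhead_mb = max(384, int(executor_memory_mb * overhead_factor))
    # Unified region shared by execution (shuffles, joins) and storage (cache)
    unified_mb = (executor_memory_mb - reserved_mb) * memory_fraction
    # Portion of the unified region protected from eviction by execution
    storage_mb = unified_mb * storage_fraction
    return {
        "container_mb": executor_memory_mb + overhead_mb,
        "overhead_mb": overhead_mb,
        "unified_mb": unified_mb,
        "storage_mb": storage_mb,
        "execution_mb": unified_mb - storage_mb,
    }


# Breakdown for spark.executor.memory = 8g
breakdown = executor_memory_breakdown(8 * 1024)
for name, value in breakdown.items():
    print(f"{name}: {value:.0f}")
```

Python worker processes run outside the JVM heap, so on YARN their memory counts against the overhead portion of the container. For a PySpark job that YARN kills for exceeding memory limits, raising `spark.executor.memoryOverhead` is often more effective than raising the heap alone.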

These questions require deeper knowledge of PySpark internals, data engineering architecture, performance tuning, and real-world problem-solving.