In [None]:
'''
This notebook focuses on advanced PySpark topics:
- performance optimization
- complex data transformations
- large-scale data processing
- architectural decision-making
'''

In [None]:
### 1. **Optimization of PySpark Jobs**
# Q) You are processing a large dataset in PySpark that involves multiple joins and aggregations. The job is slow and often times out. 
# What steps would you take to optimize the performance of the PySpark job?
'''
- Enable **predicate pushdown** when reading data.
- Repartition the data intelligently based on join keys.
- Use **broadcast joins** when one of the tables is small.
- Cache intermediate DataFrames if they are used multiple times.
- Avoid using `count()` or `collect()` unless absolutely necessary.
- Set proper **shuffle partitions** based on cluster size (`spark.sql.shuffle.partitions`).
- Use **filter** and **select** to reduce the size of DataFrames early in the process.

'''
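# Example: optimizing joins with a broadcast hint
from pyspark.sql.functions import broadcast

# Broadcasting ships small_df to every executor, so large_df is joined without a shuffle
large_df.join(broadcast(small_df), 'join_key', 'inner').show()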
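
The advice on `spark.sql.shuffle.partitions` can be made concrete with a little arithmetic. The sketch below is plain Python, not a Spark API: `suggest_shuffle_partitions` is a hypothetical helper, and the 128 MiB target per partition is a common rule of thumb assumed here, not a Spark default.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes, total_cores,
                               target_partition_bytes=128 * 1024**2):
    """Rough partition count: enough partitions to reach the target size,
    rounded up to a multiple of the core count so each wave of tasks is full."""
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores

# Example: about 500 GiB shuffled on a cluster with 200 cores in total
n = suggest_shuffle_partitions(500 * 1024**3, total_cores=200)
print(n)
```

The result could then be applied with `spark.conf.set("spark.sql.shuffle.partitions", n)`. On Spark 3.x with adaptive query execution enabled, small shuffle partitions may be coalesced automatically, so a hand-tuned value matters less there.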


In [None]:
### 2. **Handling Skewed Data**
# Q) You notice that one of the partitioned columns in your data has highly skewed data, leading to slow tasks in your Spark job. 
# How would you handle data skewness in PySpark?
'''
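- **Salting**: append a random suffix to the hot keys so their rows spread across partitions; the other side of the join must then be replicated once per salt value.
- Use **round-robin repartitioning** (`repartition(n)` with no column) ahead of stages that do not depend on the key, to even out partition sizes.
- Use a **broadcast join** when the other table is small enough, which avoids shuffling the skewed table at all.
- Increase the **shuffle partitions**, and on Spark 3.x enable adaptive skew-join handling (`spark.sql.adaptive.skewJoin.enabled`).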

'''  
from pyspark.sql.functions import expr, monotonically_increasing_id

# Example of salting
df = df.withColumn("salted_key", expr("concat(join_key, '_', (monotonically_increasing_id() % 10))"))
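
To see why salting helps, the following plain-Python simulation (no Spark involved) hashes a skewed key set into partitions with and without a salt. The partition count, the number of salts, and the `zlib.crc32` stand-in for Spark's hash partitioner are all assumptions made for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 10

def partition_of(key):
    # Deterministic stand-in for a hash partitioner
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

random.seed(0)
# 90% of the rows share one hot key; the rest are spread over 100 other keys
keys = ["hot"] * 9000 + [f"k{i % 100}" for i in range(1000)]

# Without salt, every "hot" row lands in the same partition
plain = Counter(partition_of(k) for k in keys)
# With salt, "hot" becomes up to NUM_SALTS distinct keys
salted = Counter(partition_of(f"{k}_{random.randrange(NUM_SALTS)}") for k in keys)

print("largest partition without salt:", max(plain.values()))
print("largest partition with salt:   ", max(salted.values()))
```

The largest partition shrinks because the hot key no longer hashes to a single place. In a real join, the price is that the non-skewed side has to be duplicated once per salt value so that every salted key still finds its match.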


In [None]:
### 3. **Data Pipeline Design**
#   - **Question:** You are tasked with designing an ETL pipeline to process a daily batch of 1 TB of JSON files stored in S3 and 
# load the processed data into a Redshift table. Explain the design of your pipeline using PySpark and include optimizations for both 
# compute and storage.
'''
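     - On platforms that support it (e.g., Amazon EMR), use **S3 Select** to filter data while reading large JSON files.
     - Lay out the input by date prefix so Spark can apply **partition pruning**; **predicate pushdown** is most effective once the data is in a columnar format such as Parquet or ORC.
     - Use **incremental loads** and watermarking to handle late data.
     - Cache DataFrames only when they are reused, and repartition so the output files are evenly sized for a parallel Redshift load.
     - Stage the processed data as compressed Parquet or ORC before loading into Redshift for space and performance efficiency.
     - Load with the Redshift **COPY** command (directly or through **AWS Glue**) rather than row-by-row inserts.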
''' 
df = spark.read.json('s3://bucket/data/')
# This filters rows after they are read; files are only pruned if the data is laid out as .../date=2024-09-13/
df_filtered = df.filter(df['date'] == '2024-09-13')
df_filtered.write.parquet('/path/to/save')

In [None]:
### 4. **Streaming and Batch Integration**
# Explain how you would integrate both batch and real-time streaming data pipelines in PySpark. For example, consider you have to 
# process clickstream data (streaming) along with daily user profile updates (batch).
'''
    - Use **Structured Streaming** to handle real-time clickstream data.
    - Batch user profile updates are processed separately and joined with the streaming data.
    - Utilize **watermarking** to handle late data in the streaming pipeline.
    - Output both streams and batch results in the same storage format (e.g., Parquet) for consistency.

'''

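# Example: stream-static join (broker, topic, and JSON field name are placeholders)
from pyspark.sql.functions import col, get_json_object
kafka_df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "clickstream").load()
# Kafka delivers the payload as bytes in `value`, so extract user_id before joining
streaming_df = kafka_df.select(get_json_object(col("value").cast("string"), "$.user_id").alias("user_id"))
batch_df = spark.read.parquet("/path/to/batch")
result_df = streaming_df.join(batch_df, "user_id", "left_outer")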


In [None]:
### 5. **Advanced Window Functions**
   - **Question:** You have a dataset that tracks stock prices in real-time, and you need to calculate a rolling 7-day average price for each stock symbol. How would you implement this using PySpark?
   - **Expected Solution:**
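     Use **Window functions** with a time-based range frame, so that the window spans 7 calendar days rather than 7 rows.
     ```python
     from pyspark.sql.window import Window
     from pyspark.sql.functions import avg, col

     # Order by epoch seconds so rangeBetween covers the last 7 calendar days, not the last 7 rows
     seconds_in_day = 86400
     window_spec = Window.partitionBy('symbol') \
         .orderBy(col('date').cast('timestamp').cast('long')) \
         .rangeBetween(-6 * seconds_in_day, 0)
     df.withColumn('7_day_avg_price', avg('price').over(window_spec)).show()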
     ```
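
A row-based frame such as `rowsBetween(-6, 0)` counts rows, while a range-based frame counts values of the ordering column. The two only agree when there is exactly one row per day. The plain-Python sketch below (made-up prices, hypothetical helper names) shows how they diverge when days are missing.

```python
from datetime import date, timedelta

# (trading day, price) for one symbol; note the gap between Jan 3 and Jan 10
prices = [
    (date(2024, 1, 1), 10.0),
    (date(2024, 1, 2), 12.0),
    (date(2024, 1, 3), 14.0),
    (date(2024, 1, 10), 20.0),
]

def avg_last_n_rows(rows, i, n=7):
    # Row-based frame: the current row and up to n-1 preceding rows
    window = [p for _, p in rows[max(0, i - n + 1): i + 1]]
    return sum(window) / len(window)

def avg_last_n_days(rows, i, days=7):
    # Range-based frame: rows whose date falls within the last `days` calendar days
    start = rows[i][0] - timedelta(days=days - 1)
    window = [p for d, p in rows[: i + 1] if d >= start]
    return sum(window) / len(window)

last = len(prices) - 1
print(avg_last_n_rows(prices, last))  # averages all four rows
print(avg_last_n_days(prices, last))  # only the Jan 10 row falls in Jan 4..Jan 10
```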

In [None]:

### 6. **Fault Tolerance and Error Handling in PySpark**
   - **Question:** Describe how you would handle fault tolerance and ensure data consistency in a large-scale PySpark application that processes real-time sensor data.
   - **Expected Solution:**
     - Enable **checkpointing** for long-running streaming jobs to ensure fault tolerance.
     - Handle **bad records** using the `badRecordsPath` option (a Databricks feature) or `PERMISSIVE` mode with `columnNameOfCorruptRecord` in file-based data ingestion.
     - Implement **idempotent** writes using unique identifiers for each record.
     - Use **exactly-once semantics** with Structured Streaming and transactional sinks like Delta Lake.
     ```python
     streaming_df.writeStream \
       .format("delta") \
       .option("checkpointLocation", "/path/to/checkpoint") \
       .start("/path/to/output")
     ```


In [None]:

### 7. **Handling Large Data with PySpark and Partitioning Strategies**
   - **Question:** You have a dataset that exceeds 10 TB, and you need to efficiently store and process it using PySpark. How would you design the storage layout and partitioning strategy to optimize read and write performance?
   - **Expected Solution:**
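     - Partition the data on low-to-moderate cardinality columns that queries filter on (e.g., `date`, `region`); high-cardinality partition columns produce a large number of small files.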
     - Use **bucketing** on frequently used join columns to reduce shuffle.
     - Store the data in **Parquet/ORC** format with snappy compression for efficient reads and writes.
     ```python
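     # bucketBy is only supported together with saveAsTable; the table name is a placeholder
     df.write \
       .partitionBy('date', 'region') \
       .bucketBy(10, 'user_id') \
       .format('parquet') \
       .option('path', '/path/to/save') \
       .saveAsTable('events_bucketed')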
     ```

In [None]:
### 8. **Custom UDFs and UDAFs**
   - **Question:** Write a custom PySpark UDF that takes a string column and returns the reverse of each string. 
How would you handle performance issues with UDFs?
   - **Expected Solution:**
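     - Prefer a built-in function where one exists (here `pyspark.sql.functions.reverse`), since built-ins run in the JVM without Python serialization.
     - Otherwise use **pandas UDFs** (vectorized UDFs) for better performance than row-at-a-time UDFs.
     ```python
     from pyspark.sql.functions import udf
     from pyspark.sql.types import StringType

     @udf(returnType=StringType())
     def reverse_string(s):
         # Null values arrive as None and must be passed through
         return s[::-1] if s is not None else None

     df.withColumn('reversed', reverse_string(df['column'])).show()

     # Pandas UDF: receives a whole pandas Series per batch
     import pandas as pd
     from pyspark.sql.functions import pandas_udf

     @pandas_udf(StringType())
     def reverse_string_pandas(s: pd.Series) -> pd.Series:
         return s.str[::-1]

     df.withColumn('reversed', reverse_string_pandas(df['column'])).show()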
     ```

In [None]:
### 9. **Cluster Resource Management and Tuning**
   - **Question:** You are working on a PySpark job in a YARN cluster that frequently runs out of memory. What configurations and tuning steps would you apply to manage memory and resources effectively?
   - **Expected Solution:**
     - Increase **executor memory** and **driver memory** based on the job's requirements.
     - Tune **number of cores** and **executors** to balance parallelism and resource usage.
     - Set **spark.memory.fraction** and **spark.memory.storageFraction** to optimize memory for execution and storage.
     - Use **dynamic allocation** to optimize resource usage across different stages.
     ```python
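     from pyspark.sql import SparkSession

     # Values are illustrative; dynamic allocation on YARN also needs the external shuffle service
     spark = SparkSession.builder \
       .config("spark.executor.memory", "8g") \
       .config("spark.executor.cores", "4") \
       .config("spark.dynamicAllocation.enabled", "true") \
       .config("spark.shuffle.service.enabled", "true") \
       .getOrCreate()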
     ```
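
To reason about what an `8g` executor actually gets, the arithmetic behind these settings can be sketched in plain Python. The helper name is hypothetical; the constants are the documented defaults as I understand them (300 MB reserved heap, `spark.memory.fraction` 0.6, `spark.memory.storageFraction` 0.5, memory overhead of 10% with a 384 MB floor) and may differ on a given cluster or Spark version.

```python
def executor_memory_breakdown(executor_memory_mb, memory_fraction=0.6,
                              storage_fraction=0.5, overhead_factor=0.10):
    reserved_mb = 300  # heap reserved by Spark before the unified region is sized
    overhead_mb = max(384, int(executor_memory_mb * overhead_factor))
    usable_mb = executor_memory_mb - reserved_mb
    unified_mb = usable_mb * memory_fraction
    return {
        # What YARN must grant per executor: heap plus off-heap overhead
        "yarn_container_mb": executor_memory_mb + overhead_mb,
        # Shared execution + storage region
        "unified_mb": unified_mb,
        # Portion of the unified region protected for cached data
        "storage_mb": unified_mb * storage_fraction,
        # Remainder of the heap for user data structures
        "user_mb": usable_mb * (1 - memory_fraction),
    }

breakdown = executor_memory_breakdown(8 * 1024)
for name, mb in breakdown.items():
    print(f"{name}: {mb:.0f} MB")
```

A container request noticeably larger than `spark.executor.memory` is a frequent reason for YARN killing executors, so the overhead term is worth checking before raising the heap size alone.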

These questions require deeper knowledge of PySpark internals, data engineering architecture, performance tuning, and real-world problem-solving.

In [None]:
Here are some advanced **PySpark problems** for a **senior data engineer** to solve. These tasks involve performance optimization, data partitioning, window functions, and stream processing. They are designed to test real-world problem-solving skills.

### 1. **Large Dataset Join Optimization**
   - **Problem:** You have two large datasets: `user_logs` (10 billion rows) and `user_profiles` (10 million rows). The two DataFrames need to be joined on the `user_id` column. The join operation is slow, and you're running into memory issues.
     - **Task:**
       - Write a PySpark solution to optimize this join.
       - Hint: Consider using **broadcast joins** or repartitioning strategies.
     ```python
     from pyspark.sql.functions import broadcast

     # Example: broadcast the smaller dataset; this assumes user_profiles
     # (10 million rows) fits in driver and executor memory
     result = user_logs.join(broadcast(user_profiles), "user_id", "inner")
     result.show()
     ```

### 2. **Handling Skewed Data in a Join**
   - **Problem:** You are working with two datasets that need to be joined. One of the datasets, `sales_data`, is highly skewed, with 90% of the rows having the same value for the `region` column.
     - **Task:**
       - Write a solution to handle this skew in PySpark.
       - Hint: Use techniques like **salting** to distribute the skewed data across multiple partitions.
     ```python
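     from pyspark.sql.functions import array, col, concat_ws, explode, floor, lit, rand
     NUM_SALTS = 10
     # Salt the skewed side: append a random suffix 0..9 to each region value
     salted_sales = sales_data.withColumn(
         "salted_region", concat_ws("_", col("region"), floor(rand() * NUM_SALTS).cast("string")))
     # Replicate the other side once per salt so every salted key still finds its match
     salted_other = other_df \
         .withColumn("salt", explode(array(*[lit(i) for i in range(NUM_SALTS)]))) \
         .withColumn("salted_region", concat_ws("_", col("region"), col("salt").cast("string"))) \
         .drop("region", "salt")
     result = salted_sales.join(salted_other, "salted_region", "inner")
     result.show()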
     ```

### 3. **Rolling Window Calculation on Time Series Data**
   - **Problem:** You are given a dataset `stock_prices` with columns `symbol`, `date`, and `price`. You are required to calculate the 30-day rolling average price for each stock symbol.
     - **Task:**
       - Write a PySpark solution that computes the rolling average using **Window functions**.
     ```python
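     from pyspark.sql.window import Window
     from pyspark.sql.functions import avg, col

     # Range frame over epoch seconds: the current row plus the preceding 29 calendar days
     window_spec = Window.partitionBy('symbol') \
         .orderBy(col('date').cast('timestamp').cast('long')) \
         .rangeBetween(-29 * 86400, 0)
     result = stock_prices.withColumn('30_day_avg_price', avg('price').over(window_spec))
     result.show()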
     ```

### 4. **Processing Real-Time Streaming Data**
   - **Problem:** You are tasked with processing real-time clickstream data from a Kafka source. Each event contains a `user_id`, `timestamp`, and `action`. You need to write a PySpark Structured Streaming job to:
     - Keep only the events where `action = 'login'`.
     - Group the data by `user_id` and calculate the total number of logins per 5-minute window.
     - Output the result to a Parquet file.
     - **Task:**
       - Implement this streaming job using PySpark.
     ```python
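     from pyspark.sql import SparkSession
     from pyspark.sql.functions import col, from_json, window
     from pyspark.sql.types import StringType, StructField, StructType, TimestampType

     spark = SparkSession.builder.appName("StreamingApp").getOrCreate()

     # Schema of the JSON payload in the Kafka message value (field types are assumptions)
     event_schema = StructType([
         StructField("user_id", StringType()),
         StructField("timestamp", TimestampType()),
         StructField("action", StringType()),
     ])

     # Read from Kafka source
     df = spark.readStream.format("kafka") \
       .option("kafka.bootstrap.servers", "localhost:9092") \
       .option("subscribe", "clickstream") \
       .load()

     # Kafka delivers the value as bytes: cast it to a string, then parse the JSON into columns
     df_parsed = df.select(from_json(col("value").cast("string"), event_schema).alias("json")) \
       .select("json.user_id", "json.timestamp", "json.action")

     # Keep only login actions
     df_filtered = df_parsed.filter(col('action') == 'login')

     # Append-mode aggregation requires a watermark; the 10-minute lateness bound is an assumption
     result = df_filtered \
       .withWatermark("timestamp", "10 minutes") \
       .groupBy(window(col("timestamp"), "5 minutes"), col("user_id")) \
       .count()

     # Write the result to Parquet (the file sink supports append mode only)
     result.writeStream \
       .outputMode("append") \
       .format("parquet") \
       .option("path", "/path/to/output") \
       .option("checkpointLocation", "/path/to/checkpoint") \
       .start()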
     ```
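
It can help to see how a timestamp maps to a 5-minute tumbling window. As a simplified plain-Python sketch (the function name is hypothetical), a tumbling window with no offset is aligned to the Unix epoch, and each event belongs to the half-open interval that starts at the nearest lower multiple of the window length.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 5 * 60

def tumbling_window(ts):
    """Return the [start, end) window that contains the timestamp."""
    epoch = int(ts.timestamp())
    start = epoch - epoch % WINDOW_SECONDS
    to_dt = lambda s: datetime.fromtimestamp(s, tz=timezone.utc)
    return to_dt(start), to_dt(start + WINDOW_SECONDS)

start, end = tumbling_window(datetime(2024, 9, 13, 12, 7, 30, tzinfo=timezone.utc))
print(start.time(), end.time())
```

An event at 12:07:30 is therefore counted in the 12:05 to 12:10 window, and an event at exactly 12:10:00 would fall into the next one.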

### 5. **Handling Large Files with Efficient Partitioning**
   - **Problem:** You have a large dataset (5 TB of Parquet files) stored in S3, and you are tasked with optimizing the reading process for analytical queries on specific columns like `region` and `date`.
     - **Task:**
       - Design an efficient PySpark read operation for this large dataset.
       - Apply a partitioning strategy to optimize future reads based on `region` and `date`.
     ```python
     # Read from S3 and partition the dataset by region and date
     df = spark.read.parquet("s3://bucket/large_dataset/")
     
     # Partition the dataset
     df.write.partitionBy("region", "date").parquet("s3://bucket/optimized_dataset/")
     
     # Optimize reading by selecting specific partitions
     optimized_df = spark.read.parquet("s3://bucket/optimized_dataset/") \
       .filter("region = 'US' AND date = '2024-09-13'")
     optimized_df.show()
     ```
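
Partition pruning works because `partitionBy` writes one directory per partition value, named `column=value`. The plain-Python sketch below mimics that layout to show which directories a filter on `region` and `date` would need to touch; the helper names and the sample values are assumptions for illustration.

```python
def partition_path(root, **partition_values):
    """Build a Hive-style path such as root/region=US/date=2024-09-13."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([root.rstrip("/")] + parts)

def prune(paths, **wanted):
    """Keep only the directories whose partition values match the filter."""
    required = {f"{col}={val}" for col, val in wanted.items()}
    return [p for p in paths if required <= set(p.split("/"))]

root = "s3://bucket/optimized_dataset/"
all_paths = [
    partition_path(root, region=r, date=d)
    for r in ("US", "EU")
    for d in ("2024-09-12", "2024-09-13")
]

selected = prune(all_paths, region="US", date="2024-09-13")
print(selected)
```

Only one of the four directories has to be listed and read, which is the effect the filter in the read example relies on.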

### 6. **ETL Pipeline for Incremental Data**
   - **Problem:** You need to build an ETL pipeline that reads daily JSON files from an S3 bucket, processes the data, and writes it to a Redshift table. The pipeline should only process new data (incremental loads).
     - **Task:**
       - Implement a PySpark solution to read and process the new files and append the result to Redshift.
       - Ensure that old data is not reprocessed.
     ```python
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("ETL Pipeline").getOrCreate()

     # Read the latest JSON files
     df = spark.read.json("s3://bucket/daily_data/2024-09-13/")

     # Apply transformations (e.g., filter, aggregation)
     transformed_df = df.filter(df['event'] == 'purchase')

     # Write to Redshift
     transformed_df.write \
       .format("jdbc") \
       .option("url", "jdbc:redshift://your-redshift-url") \
       .option("dbtable", "public.purchases") \
       .option("user", "username") \
       .option("password", "password") \
       .mode("append") \
       .save()
     ```
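
The example reads a fixed date prefix, so the "only new data" requirement still needs some bookkeeping. One possible approach, sketched here in plain Python with hypothetical names, is to keep a manifest of the date prefixes already loaded and process only the difference.

```python
def select_new_partitions(available, processed):
    """Return date prefixes present in the source but not yet loaded, oldest first."""
    return sorted(set(available) - set(processed))

# Date prefixes found under s3://bucket/daily_data/ (listing them is out of scope here)
available = ["2024-09-11", "2024-09-12", "2024-09-13"]
# Prefixes already loaded, e.g. read from a small manifest table or file (an assumption)
processed = {"2024-09-11", "2024-09-12"}

for day in select_new_partitions(available, processed):
    path = f"s3://bucket/daily_data/{day}/"
    print("would load", path)
    # Record success only after the Redshift write has committed
    processed.add(day)
```

Updating the manifest after the write, not before, means a failed run is retried on the next execution instead of being silently skipped.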

### 7. **Custom Aggregation with UDAFs**
   - **Problem:** You need to write a custom **User Defined Aggregate Function (UDAF)** in PySpark to calculate the median value for a given column in a DataFrame.
     - **Task:**
       - Write a PySpark solution using a UDAF to compute the median value of a column.
     ```python
     import pandas as pd
     from pyspark.sql.functions import pandas_udf

     # PySpark has no Python UserDefinedAggregateFunction class (that API exists only in Scala/Java).
     # A grouped-aggregate pandas UDF (Series -> scalar) plays the same role in Python.
     @pandas_udf("double")
     def median_udaf(values: pd.Series) -> float:
         # Exact median of all values in the group
         return float(values.median())

     df = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)], ["value"])
     df.select(median_udaf(df["value"]).alias("median")).show()

     # For very large groups, the built-in SQL function percentile_approx(value, 0.5)
     # avoids loading a whole group into memory
     ```

### 8. **Managing Data Lineage and Metadata**
   - **Problem:** You need to ensure full data lineage and metadata tracking for your PySpark ETL pipelines.
     - **Task:**
       - Write a PySpark solution to track metadata (e.g., processing time, source file, transformation steps) for every job run and store it in a separate metadata table.
     ```python
     from pyspark.sql.functions import current_timestamp, input_file_name

     df = spark.read.csv("s3://bucket/data/")
     df = df.withColumn("processing_time", current_timestamp()) \
            .withColumn("source_file", input_file_name())

     # Save data and metadata
     df.write.parquet("/path/to/processed_data")

     metadata_df = df.select("processing_time", "source_file").distinct()
     metadata_df.write.mode("append").parquet("/path/to/metadata")
     ```

These problems focus on real-world PySpark challenges, covering performance, partitioning, custom UDFs, streaming, and ETL pipeline designs, allowing you to demonstrate advanced skills in data engineering.