## 1) **SQL — Find 2nd highest salary per department**

Problem: table `employees(emp_id, dept, salary)`. Return the 2nd highest salary per `dept`.
Solution (standard, window approach):

```sql
SELECT dept, salary
FROM (
  SELECT dept, salary,
         ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
  FROM employees
) t
WHERE rn = 2;
```

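Notes: `ROW_NUMBER()` breaks ties arbitrarily, so if two employees share the top salary, `rn = 2` returns that same top salary. Use `DENSE_RANK()` instead of `ROW_NUMBER()` if you want ties to be handled (i.e., second distinct salary), and add `DISTINCT` so the salary is returned once even when several employees share it.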
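
To see the tie-handling difference on real rows, the query can be run against an in-memory SQLite database from Python. This is a minimal sketch: the sample rows are made up for illustration, the `second_highest` helper is hypothetical, and it assumes the bundled SQLite is 3.25 or newer (the first version with window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, "eng", 300), (2, "eng", 300), (3, "eng", 200),  # tie at the top
        (4, "ops", 150), (5, "ops", 100),
    ],
)

def second_highest(rank_fn):
    """Run the 2nd-highest query with the given ranking function.

    rank_fn is only ever one of two hard-coded names below, so
    formatting it into the SQL string is safe in this sketch.
    """
    sql = f"""
        SELECT DISTINCT dept, salary
        FROM (
          SELECT dept, salary,
                 {rank_fn}() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
          FROM employees
        ) t
        WHERE rnk = 2
        ORDER BY dept
    """
    return conn.execute(sql).fetchall()

by_row_number = second_highest("ROW_NUMBER")
by_dense_rank = second_highest("DENSE_RANK")
print(by_row_number)
print(by_dense_rank)
```

With the tie in `eng`, `ROW_NUMBER()` reports 300 as the "second" salary, while `DENSE_RANK()` reports 200, the second distinct value.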



## 2) **Spark / PySpark — remove duplicates and write partitioned Delta**

Task: read large JSON/Parquet, dedupe by `id`, write to Delta partitioned by `year` and `month` with schema enforcement.

PySpark snippet:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month

spark = SparkSession.builder.appName("dedupe_write").getOrCreate()

df = spark.read.parquet("s3://bucket/path/")  # or json/csv
# assume df has timestamp column 'ts' and unique id 'id'
df = df.withColumn("year", year(col("ts"))).withColumn("month", month(col("ts")))

# dedupe keeping latest record per id using window
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc

w = Window.partitionBy("id").orderBy(desc("ts"))
df_dedup = df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")

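(df_dedup
  .write
  .format("delta")
  .mode("overwrite")              # or "append" depending on use-case
  .partitionBy("year", "month")
  # Delta enforces the existing table schema on write by default; setting
  # overwriteSchema=true would replace the schema instead of enforcing it.
  .save("/mnt/delta/table_name"))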
```

Optimization hints: repartition by partition keys before write to avoid small files; use `coalesce`/`repartition` tuned to executor cores; enable dynamic partition overwrite if doing incremental partition updates.
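
When unit-testing the dedupe step, it can help to have a plain-Python reference for "keep the latest record per `id`" to compare against the Spark output on a small sample. The sketch below is an illustration only: `dedupe_latest` and the sample rows are hypothetical, and on equal timestamps it keeps the first record seen, whereas `row_number()` picks one of the tied rows arbitrarily.

```python
from datetime import datetime

def dedupe_latest(records, key="id", order_by="ts"):
    """Keep, for each key, the record with the greatest order_by value."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        # Strict '>' means the first-seen record wins on a timestamp tie.
        if current is None or rec[order_by] > current[order_by]:
            latest[rec[key]] = rec
    # Sort by key so the result is deterministic and easy to compare.
    return sorted(latest.values(), key=lambda r: r[key])

rows = [
    {"id": 1, "ts": datetime(2024, 1, 5), "v": "old"},
    {"id": 1, "ts": datetime(2024, 2, 1), "v": "new"},
    {"id": 2, "ts": datetime(2024, 1, 9), "v": "only"},
]
deduped = dedupe_latest(rows)
print([r["v"] for r in deduped])
```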

## 3) **System design / pipeline question — design ingestion for PB-scale historical + streaming**

Short bullets you can speak to in an interview:

* Bronze (raw) landing: Parquet/Avro on S3 with time-based partitions.
* Ingestion: bulk backfill via Spark clusters (EMR/Databricks) using parallel reads; streaming via Kafka → Structured Streaming.
* Processing: Use Delta Lake for ACID, schema evolution; apply schema enforcement and compaction (OPTIMIZE / ZORDER).
* Orchestration: Airflow / Databricks Jobs for batch; EventBridge/Lambda/Step Functions for triggers.
* Cost/perf: auto-scaling clusters, spot instances for heavy batch; data skipping & partition pruning; cache hot tables.
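
One concrete detail worth being able to sketch is what "time-based partitions" look like as object keys. The example below assumes a Hive-style `key=value` layout under a `bronze/` prefix; the `bronze_key` helper, the prefix, and the day-level granularity are illustrative assumptions, not a fixed convention.

```python
from datetime import datetime, timezone

def bronze_key(source: str, event_time: datetime, filename: str) -> str:
    """Build a Hive-style, time-partitioned object key for the raw layer."""
    # Normalize to UTC so one event always maps to the same partition.
    t = event_time.astimezone(timezone.utc)
    return (
        f"bronze/{source}/year={t.year:04d}/month={t.month:02d}/"
        f"day={t.day:02d}/{filename}"
    )

key = bronze_key(
    "orders",
    datetime(2024, 3, 7, 23, 30, tzinfo=timezone.utc),
    "part-0000.parquet",
)
print(key)
```

Query engines that understand this layout can prune whole prefixes when a filter names `year`, `month`, or `day`, which is the partition pruning mentioned under cost/perf.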

## 4) **PySpark code problem — count unique users per 1-hour window from Kafka**

Core idea (Structured Streaming):

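```python
from pyspark.sql.functions import approx_count_distinct, col, from_json, window

# Assumes an existing SparkSession named `spark`.
schema = "user_id STRING, event_time TIMESTAMP, action STRING"

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "host:port")
       .option("subscribe", "topic")
       .load())

events = raw.select(from_json(col("value").cast("string"), schema).alias("j")).select("j.*")

# watermark + window
# Exact countDistinct is not supported on streaming DataFrames,
# so use the approximate version.
agg = (events
       .withWatermark("event_time", "10 minutes")
       .groupBy(window(col("event_time"), "1 hour"))
       .agg(approx_count_distinct("user_id").alias("unique_users")))

# "complete" mode keeps all aggregate state, so the watermark could never
# drop old windows; "update" emits changed windows only.
query = (agg.writeStream.format("console")   # for a Delta sink, use "append"
         .outputMode("update")
         .start())
```

Talk about watermarking, late data, and checkpointing (set `checkpointLocation` on the sink so the query can recover its state after a restart).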
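
To make watermarking and late data concrete, here is a simplified pure-Python model of a 1-hour tumbling window with a 10-minute lateness threshold. It is a sketch under loud assumptions: the watermark advances after every event (Spark advances it between micro-batches), an event is dropped once its window end is at or behind the watermark, counts are exact sets rather than an approximate sketch, and `unique_users_per_window` and the sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)
LATENESS = timedelta(minutes=10)

def window_start(ts):
    """Start of the 1-hour tumbling window containing ts."""
    return ts.replace(minute=0, second=0, microsecond=0)

def unique_users_per_window(events):
    """events: iterable of (user_id, event_time) in arrival order."""
    users = defaultdict(set)
    max_event_time = None
    dropped = []
    for user_id, ts in events:
        if max_event_time is None or ts > max_event_time:
            max_event_time = ts
        watermark = max_event_time - LATENESS
        start = window_start(ts)
        # The window is considered closed once the watermark passes its end.
        if start + WINDOW <= watermark:
            dropped.append((user_id, ts))
            continue
        users[start].add(user_id)
    counts = {start: len(ids) for start, ids in users.items()}
    return counts, dropped

def t(hour, minute):
    return datetime(2024, 1, 1, hour, minute)

events = [
    ("a", t(10, 5)), ("b", t(10, 40)), ("a", t(10, 50)),
    ("c", t(11, 5)),   # watermark is now 10:55
    ("d", t(10, 58)),  # late, but the 10:00 window is still open
    ("e", t(11, 20)),  # watermark is now 11:10
    ("f", t(10, 59)),  # too late: the 10:00 window closed at 11:00
]
counts, dropped = unique_users_per_window(events)
print(counts, dropped)
```

Event `d` arrives out of order but is still counted, while `f` arrives after the watermark has passed the end of its window and is discarded.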

## 5) **Performance tuning question — large join taking too long**

Answer outline to give in interview:

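* Check for key skew: salt the hot keys, or broadcast the smaller table so the large side is not shuffled.
* Use a broadcast join if one table is small enough to fit in executor memory (`broadcast(df)`).
* Pre-partition or bucket both tables on the join keys so the join does not need an extra shuffle; `repartition` is itself a shuffle, so it pays off mainly when the partitioned result is reused.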
* Persist intermediate results if reused.
* Use column pruning and `select` only required columns.
* Use proper file formats (Parquet/ORC) and partitioning; tune shuffle partitions (`spark.sql.shuffle.partitions`).
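
The salting idea from the first bullet can be shown without Spark. In this minimal sketch the large side gets a salt appended to each key and the small side is replicated once per salt, so a hot key is spread over several buckets while the join result stays the same. The `salted_join` helper, the sample data, and the round-robin salt are assumptions for illustration (a Spark job would typically use a random salt), and the small side is assumed to have unique keys.

```python
from collections import Counter

NUM_SALTS = 4

def salted_join(large, small, num_salts=NUM_SALTS):
    """Inner-join large to small on key, spreading each key over salts."""
    # Small side: replicate every row once per salt value.
    small_index = {}
    for key, value in small:
        for salt in range(num_salts):
            small_index[(key, salt)] = value

    joined = []
    bucket_sizes = Counter()
    for i, (key, payload) in enumerate(large):
        # Large side: round-robin salt keeps this example deterministic.
        salted_key = (key, i % num_salts)
        bucket_sizes[salted_key] += 1
        if salted_key in small_index:
            joined.append((key, payload, small_index[salted_key]))
    return joined, bucket_sizes

large = [("hot", i) for i in range(8)] + [("cold", 100)]
small = [("hot", "H"), ("cold", "C")]

joined, bucket_sizes = salted_join(large, small)

# Unsalted reference join for comparison.
lookup = dict(small)
plain = [(k, p, lookup[k]) for k, p in large if k in lookup]
print(max(bucket_sizes.values()), len(joined))
```

Without salting, all 8 `hot` rows would land in one bucket; with 4 salts the largest bucket holds 2 rows, at the cost of replicating the small table 4 times.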


## 6) **Behavioral / case question — explain a project where you improved cost or performance**

Structure your answer: Situation → Task → Action → Result (quantify). Example quick bullet:

* S: ETL job took 6 hours and cost $X/day.
* T: Reduce runtime and cost.
* A: Rewrote join strategy, reduced data scanned via partition pruning, switched to spot instances for batch.
* R: Runtime dropped to 45 minutes, cost cut by 60%.