### Company-specific notes are kept in this same folder

Congrats on clearing Round 1! For Round 2, expect deeper technical questions and problem-solving scenarios. Here’s what to prepare:

### **1. AWS Glue**
- **How does AWS Glue work?**
  - AWS Glue is a serverless ETL service that automates data preparation. It uses crawlers to detect schema and generates PySpark scripts for transformation.

- **How do you optimize Glue jobs?**
  - Use DynamicFrame over DataFrame for schema evolution.
  - Enable job bookmarks for incremental processing.
  - Partition data for parallelism.
  - Use S3 storage classes wisely (e.g., Standard vs. Intelligent-Tiering).

- **What are Glue job types?**
  - **Spark ETL Jobs** (batch processing)
  - **Streaming Jobs** (real-time ingestion)
  - **Python Shell Jobs** (lighter workloads)

- **What’s the difference between Glue DynamicFrame and DataFrame?**
  - **DynamicFrame:** Self-describing records, so no schema is required up front; handles semi-structured data and inconsistent types, and supports AWS Glue transformations.
  - **DataFrame:** Uses standard PySpark functions, better for performance tuning.


### **2. PySpark**
- **How do you handle large datasets efficiently in PySpark?**
  - Use **broadcast joins** for small lookup tables.
  - Optimize **shuffle operations** (reduce unnecessary shuffling).
  - Cache frequently accessed DataFrames.
  - Set proper partitioning (`repartition()` vs `coalesce()`).

- **Explain Window Functions in PySpark.**
  - Used for ranking, aggregations, running totals, etc.
  - Example:
    ```python
    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number

    window_spec = Window.partitionBy("category").orderBy("price")
    df = df.withColumn("rank", row_number().over(window_spec))
    ```

- **How do you handle skewed data in PySpark?**
  - **Salting:** Introduce random keys to balance partitions.
  - **Bucketing:** Store data in predefined hash buckets.
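  - **Skew Join Handling:** On Spark 3.x, enable Adaptive Query Execution (`spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`) so oversized partitions are split at join time.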
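
To make the salting idea concrete, here is a minimal plain-Python simulation rather than Spark code. The partition count, salt count, key names, and the CRC32-based partitioner are all assumptions chosen for illustration; in a real Spark job the salt is usually a random number appended to the join key, and the other side of the join has to be replicated once per salt value.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of shuffle partitions
NUM_SALTS = 8       # hypothetical number of distinct salt values


def partition_of(key: str) -> int:
    """Stand-in for a hash partitioner: map a key to a partition id."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def salted_key(key: str, row_index: int) -> str:
    """Append a salt to the key. Round-robin is used here instead of a
    random salt so the simulation is deterministic."""
    return f"{key}_{row_index % NUM_SALTS}"


# Skewed input: one hot key owns 90% of the rows.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

# Rows per partition without and with salting.
plain = Counter(partition_of(k) for k in keys)
salted = Counter(partition_of(salted_key(k, i)) for i, k in enumerate(keys))

print("largest partition without salting:", max(plain.values()))
print("largest partition with salting:   ", max(salted.values()))
```

Without salting, every row for the hot key lands in one partition; with salting, those rows are spread over several salted keys, so the largest partition shrinks.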



### **3. SQL**
- **How do you optimize a slow-running SQL query?**
  - Use indexes effectively.
  - Avoid SELECT *; specify required columns.
  - Use proper join strategies (HASH JOIN vs. MERGE JOIN).
  - Partition large tables.
  - Analyze query execution plans (`EXPLAIN`).

- **Write a SQL query to get the second-highest salary.**
  ```sql
  SELECT MAX(salary) FROM employees 
  WHERE salary < (SELECT MAX(salary) FROM employees);
  ```

- **How do you handle duplicate records in SQL?**
  - Using `DISTINCT`
  - Using `GROUP BY`
  - Using `ROW_NUMBER()` and deleting duplicates.
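
The `ROW_NUMBER()` approach can be sketched with Python's built-in `sqlite3` module, assuming a SQLite build with window-function support (3.25 or later). The `employees` columns and sample rows are hypothetical; the query keeps the lowest `id` in each duplicate group and deletes the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO employees (name, email) VALUES (?, ?)",
    [
        ("Ann", "ann@example.com"),
        ("Ann", "ann@example.com"),  # duplicate of the first row
        ("Bob", "bob@example.com"),
        ("Ann", "ann@example.com"),  # another duplicate
    ],
)

# Number the rows inside each (name, email) group, then delete every row
# except the first one in its group.
conn.execute(
    """
    DELETE FROM employees
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY name, email ORDER BY id
                   ) AS rn
            FROM employees
        ) AS ranked
        WHERE rn > 1
    )
    """
)

remaining = conn.execute(
    "SELECT id, name, email FROM employees ORDER BY id"
).fetchall()
print(remaining)
```

The same pattern carries over to other engines, though the exact `DELETE` syntax (for example, deleting through a CTE) varies by database.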

### **4. Python**
- **How do you handle memory issues in Python when processing large datasets?**
  - Use **generators** instead of lists.
  - Process data in **chunks** using `pandas.read_csv(path, chunksize=5000)`, which returns an iterator of DataFrames instead of loading the whole file.
  - Use `multiprocessing` for parallel processing.
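
As a rough sketch of the generator and chunking points, the example below streams a CSV in fixed-size chunks using only the standard library. The `read_in_chunks` helper, the column names, and the in-memory sample file are assumptions for illustration; with pandas, the `chunksize` argument of `read_csv` plays the same role.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List


def read_in_chunks(file_obj, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows, so only one chunk is held
    in memory at a time."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Small in-memory stand-in for a large CSV file (10 data rows).
sample = io.StringIO(
    "id,amount\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 11))
)

total = 0
chunk_sizes = []
for chunk in read_in_chunks(sample, chunk_size=4):
    chunk_sizes.append(len(chunk))
    total += sum(int(row["amount"]) for row in chunk)

print(chunk_sizes, total)
```

Because the aggregate is updated chunk by chunk, peak memory depends on `chunk_size` rather than on the file size.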

- **What is the difference between deep copy and shallow copy?**
  - **Shallow Copy (`copy.copy()`):** Creates a new object but references the original nested objects.
  - **Deep Copy (`copy.deepcopy()`):** Recursively copies all objects.
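
A short example of the difference, using a hypothetical nested dictionary: mutating the nested list is visible through the shallow copy but not through the deep copy, while rebinding a top-level key affects neither copy.

```python
import copy

original = {"name": "job-1", "tags": ["etl", "daily"]}

shallow = copy.copy(original)   # new dict, but shares the nested list
deep = copy.deepcopy(original)  # new dict with its own copy of the list

original["tags"].append("glue")  # mutate the nested object
original["name"] = "job-2"       # rebind a top-level key

print(shallow["tags"])  # shares the list with original
print(deep["tags"])     # independent list
print(shallow["name"])  # top-level rebinding does not affect the copy
```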

- **How does Python manage memory?**
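  - CPython relies mainly on **reference counting**: an object is freed as soon as its reference count drops to zero. A separate cyclic **garbage collector (GC)** reclaims objects caught in reference cycles.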
  - `gc.collect()` manually triggers garbage collection.
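
The two mechanisms can be observed with a small experiment. This sketch assumes CPython (other interpreters may free objects at different times), and the `Node` class is a hypothetical helper; weak references are used to check whether an object is still alive without keeping it alive.

```python
import gc
import weakref


class Node:
    """Plain object that can point at another Node."""

    def __init__(self):
        self.other = None


gc.disable()  # stop the cyclic collector from running on its own during the demo

# Case 1: no cycle. Reference counting frees the object as soon as the
# last reference is deleted.
a = Node()
a_ref = weakref.ref(a)
del a
freed_by_refcount = a_ref() is None

# Case 2: two objects referencing each other. Their reference counts never
# reach zero, so they stay alive until the cyclic collector runs.
x, y = Node(), Node()
x.other, y.other = y, x
x_ref = weakref.ref(x)
del x, y
alive_before_collect = x_ref() is not None
gc.collect()
freed_by_gc = x_ref() is None

gc.enable()
print(freed_by_refcount, alive_before_collect, freed_by_gc)
```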

### **5. Scenario-Based Questions**
- **You have a Glue job processing daily 100GB CSV files. How do you optimize it?**
  - Use **columnar formats** (Parquet/ORC) instead of CSV.
  - Partition data properly.
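  - Use **pushdown predicates** (the `push_down_predicate` argument of `glueContext.create_dynamic_frame.from_catalog()`) to prune partitions before data is read, rather than filtering after the load.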
  - Enable **AWS Glue Workflows** for dependency management.

- **How do you troubleshoot a slow Glue job?**
  - Check **Spark UI logs** for skewed partitions.
  - Enable job metrics and review the job's **CloudWatch metrics and logs** to see whether time is going to reads, shuffles, or S3 writes.
  - Tune memory and executor settings.

- **How do you design an ETL pipeline for real-time data ingestion?**
  - Use **AWS Kinesis / Kafka** for ingestion.
  - Process data with **Glue Streaming / Spark Structured Streaming**.
  - Store transformed data in **S3 + Athena / Redshift** for querying.

Would you like help with mock interviews or more scenario-based questions? 🚀