In [None]:
### 1. **DataFrame Operations**
# - **Question:** Given a DataFrame with columns `employee_id`, `name`, `age`, and `salary`, write PySpark code to:
#   - Keep only the employees older than 30.
#   - Find the average salary of those employees.
   
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()

# Sample DataFrame
data = [
    (1, 'John', 28, 3000),
    (2, 'Jane', 35, 5000),
    (3, 'Sam', 32, 4500),
    (4, 'Linda', 29, 3500),
]

df = spark.createDataFrame(data, ['employee_id', 'name', 'age', 'salary'])

# Filter employees older than 30 and calculate average salary
result = df.filter(df.age > 30).agg(avg('salary').alias('average_salary'))
result.show()

In [None]:
### 2. **GroupBy and Aggregation**
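# - **Question:** Given a DataFrame with columns `department`, `employee_id`, and `salary`, write a PySpark query to calculate the total
#   salary for each department.
from pyspark.sql.functions import sum as sum_

# Assumes `df` has the `department`, `employee_id`, and `salary` columns described in the question.
# Alias the aggregated column itself; calling .alias() on the result would name the DataFrame, not the column.
df.groupBy('department').agg(sum_('salary').alias('total_salary')).show()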
 

In [None]:
### 3. **Handling Missing Data**
   - **Question:** Given a DataFrame with missing values in multiple columns, how would you:
     - Drop rows where any column has null values?
     - Replace null values in a specific column, say `age`, with the average age?
   - **Expected Solution:**
     ```python
     from pyspark.sql.functions import col, mean

     # Drop rows with any null values
     df_cleaned = df.na.drop()

     # Replace null values in 'age' column with the average age
     avg_age = df.select(mean(col('age'))).first()[0]
     df_filled = df.na.fill({'age': avg_age})
     ```

In [None]:
### 4. **Window Functions**
   - **Question:** Given a DataFrame containing `department`, `employee_id`, and `salary`, write PySpark code to add a column that ranks employees in each department based on their salary.
   - **Expected Solution:**
     ```python
     from pyspark.sql.window import Window
     from pyspark.sql.functions import rank

     window_spec = Window.partitionBy('department').orderBy(df['salary'].desc())

     df.withColumn('rank', rank().over(window_spec)).show()
     ```

In [None]:
### 5. **Joins**
   - **Question:** Given two DataFrames, `employees` and `departments`, write PySpark code to perform an inner join on a common column `department_id`.
   - **Expected Solution:**
     ```python
     employees.join(departments, on='department_id', how='inner').show()
     ```

In [None]:

### 6. **File Handling**
   - **Question:** Write PySpark code to load a CSV file into a DataFrame, perform a transformation, and write the result back as a Parquet file.
   - **Expected Solution:**
     ```python
     # Read CSV
     df = spark.read.csv('/path/to/file.csv', header=True, inferSchema=True)

     # Perform a transformation (e.g., filter records)
     df_filtered = df.filter(df['salary'] > 4000)

     # Write the result to Parquet
     df_filtered.write.parquet('/path/to/output')
     ```

In [None]:
### 7. **Optimization and Performance**
   - **Question:** How can you optimize PySpark jobs for better performance? Write code to cache a DataFrame and explain its impact.
   - **Expected Solution:**
     ```python
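     # Mark the DataFrame for caching (lazy: nothing is stored until an action runs)
     df_cached = df.cache()

     # Build on the cached DataFrame and trigger an action
     df_filtered = df_cached.filter(df_cached['salary'] > 4000)
     df_filtered.show()

     # Explanation: cache() is lazy. The data is stored (in memory, spilling to disk if it does not fit) when an action
     # first computes it, and later actions on the same DataFrame reuse the stored data instead of recomputing it.
     # Caching pays off only when a DataFrame is reused; call unpersist() once it is no longer needed.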
     ```

In [None]:
### 8. **Error Handling**
   - **Question:** How would you handle errors or bad records in a PySpark job when reading data from a file?
   - **Expected Solution:**
     ```python
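     # Note: badRecordsPath is a Databricks option. Open-source Spark controls malformed-row handling
     # through the reader's `mode` option instead (PERMISSIVE, DROPMALFORMED, or FAILFAST).
     df = spark.read.option("badRecordsPath", "/path/to/error").csv('/path/to/file.csv')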
     ```