### Task 1: Handling Schema Mismatches Using Spark
**Description**: Use Apache Spark to address schema mismatches by transforming data to match
the expected schema.

**Steps**:
1. Create Spark session
2. Load dataframe
3. Define the expected schema
4. Handle schema mismatches
5. Show corrected data
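
Steps 3 to 5 above can be sketched in a few lines. This is a minimal illustration written with pandas rather than Spark, so that it runs without a Spark installation; in PySpark the casting step would typically go through a column `cast` instead. The column names, the `EXPECTED_SCHEMA` mapping, and the `align_to_schema` helper are all hypothetical and not part of the task.

```python
import numpy as np
import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype
EXPECTED_SCHEMA = {"id": "Int64", "name": "string", "salary": "float64"}


def align_to_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Return a new DataFrame whose columns and dtypes follow `schema`."""
    out = pd.DataFrame(index=df.index)
    for column, dtype in schema.items():
        if column in df.columns:
            values = df[column]
        else:
            # Column missing from the input: fill it with nulls
            values = pd.Series(np.nan, index=df.index, dtype="float64")
        if dtype.lower().startswith(("int", "float")):
            # Values that cannot be parsed as numbers become NaN
            values = pd.to_numeric(values, errors="coerce")
        out[column] = values.astype(dtype)
    # Columns not listed in the schema (such as 'extra') are dropped
    return out


# Incoming data: numbers stored as strings, 'name' missing, one extra column
incoming = pd.DataFrame({
    "id": ["1", "2", "x"],
    "salary": ["50000", "54000.5", "62000"],
    "extra": [1, 2, 3],
})

fixed = align_to_schema(incoming, EXPECTED_SCHEMA)
print(fixed.dtypes)
print(fixed)
```

Coercing unparseable values to nulls keeps the load from failing on a single bad row, at the cost of hiding the bad value; a stricter pipeline might reject such rows instead.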

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values
data = {
    'Age': [25, 27, np.nan, 29, 30],
    'Salary': [50000, 54000, 58000, np.nan, 62000],
    'Department': ['HR', 'Finance', 'HR', np.nan, 'IT']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Show original DataFrame
print("Original DataFrame:\n", df)

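# --- Mean Imputation (only for numerical columns) ---
# fit_transform returns a 2-D array of shape (n, 1); ravel() flattens it to 1-D
mean_imputer = SimpleImputer(strategy='mean')
df['Age'] = mean_imputer.fit_transform(df[['Age']]).ravel()
df['Salary'] = mean_imputer.fit_transform(df[['Salary']]).ravel()

# --- Median Imputation ---
# Note: 'Age' has no missing values left after the mean step, so this call
# changes nothing. To impute with the median, run it instead of the mean step.
median_imputer = SimpleImputer(strategy='median')
df['Age'] = median_imputer.fit_transform(df[['Age']]).ravel()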

# --- Mode Imputation (categorical or numerical) ---
mode_imputer = SimpleImputer(strategy='most_frequent')
df['Department'] = mode_imputer.fit_transform(df[['Department']]).ravel()

# Show the DataFrame after imputation
print("\nDataFrame after Imputation:\n", df)


Original DataFrame:
     Age   Salary Department
0  25.0  50000.0         HR
1  27.0  54000.0    Finance
2   NaN  58000.0         HR
3  29.0      NaN        NaN
4  30.0  62000.0         IT

DataFrame after Imputation:
      Age   Salary Department
0  25.00  50000.0         HR
1  27.00  54000.0    Finance
2  27.75  58000.0         HR
3  29.00  56000.0         HR
4  30.00  62000.0         IT


### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and pandas to detect incomplete data in an ETL process and fill
missing values with estimates.

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes
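
One possible way to work through the three steps above is sketched here. The `records` data and its column names are made up for illustration, and using the median for numeric columns and the most frequent value for the others is an assumed choice of estimate, not one the task prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical output of an extract stage, with gaps
records = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "amount": [250.0, np.nan, 310.0, 400.0],
    "region": ["North", "South", None, "South"],
})

# 1. Detect: count missing values per column
missing_before = records.isna().sum()

# 2. Fill: median for numeric columns, most frequent value for the rest
filled = records.copy()
for column in filled.columns:
    if filled[column].isna().any():
        if pd.api.types.is_numeric_dtype(filled[column]):
            estimate = filled[column].median()
        else:
            estimate = filled[column].mode().iloc[0]
        filled[column] = filled[column].fillna(estimate)

# 3. Report: how many values were filled in each column
report = pd.DataFrame({
    "missing_before": missing_before,
    "missing_after": filled.isna().sum(),
})
report["filled"] = report["missing_before"] - report["missing_after"]
print(report)
```

Working on a copy keeps the original `records` available, so the report can compare the data before and after the fill.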

In [2]:
# Write your code from here