### Task 1: Handling Schema Mismatches using Spark
**Description**: Use Apache Spark to address schema mismatches by transforming data to match
the expected schema.

**Steps**:
1. Create a Spark session
2. Load the data into a DataFrame
3. Define the expected schema
4. Handle schema mismatches
5. Show the corrected data

In [5]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Function to impute missing values
def impute_missing_values(df):
    # Check if required columns exist in the dataframe
    required_columns = ['Age', 'Salary', 'Department']
    for col in required_columns:
        if col not in df.columns:
            raise ValueError(f"DataFrame must contain the column '{col}'.")

    # Check if 'Age' and 'Salary' columns contain numeric data
    if not pd.api.types.is_numeric_dtype(df['Age']):
        raise ValueError("'Age' column should contain numeric data.")
    if not pd.api.types.is_numeric_dtype(df['Salary']):
        raise ValueError("'Salary' column should contain numeric data.")
    
    # Impute missing values in 'Age' using the median strategy
    # (fit_transform returns a 2-D array, so flatten it before assigning)
    median_imputer = SimpleImputer(strategy='median')
    df['Age'] = median_imputer.fit_transform(df[['Age']]).ravel()

    # Impute missing values in 'Salary' using the mean strategy
    mean_imputer = SimpleImputer(strategy='mean')
    df['Salary'] = mean_imputer.fit_transform(df[['Salary']]).ravel()

    # Impute missing values in 'Department' using the mode (most frequent) strategy
    mode_imputer = SimpleImputer(strategy='most_frequent')
    df['Department'] = mode_imputer.fit_transform(df[['Department']]).ravel()

    return df

# Sample dataset with missing values
data = {
    'Age': [25, 27, np.nan, 29, 30],
    'Salary': [50000, 54000, 58000, np.nan, 62000],
    'Department': ['HR', 'Finance', 'HR', np.nan, 'IT']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Show original DataFrame
print("Original DataFrame:")
print(df)

# Call the function to impute missing values
df = impute_missing_values(df)

# Show the DataFrame after imputation
print("\nDataFrame after Imputation:")
print(df)


Original DataFrame:
    Age   Salary Department
0  25.0  50000.0         HR
1  27.0  54000.0    Finance
2   NaN  58000.0         HR
3  29.0      NaN        NaN
4  30.0  62000.0         IT

DataFrame after Imputation:
    Age   Salary Department
0  25.0  50000.0         HR
1  27.0  54000.0    Finance
2  28.0  58000.0         HR
3  29.0  56000.0         HR
4  30.0  62000.0         IT


### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and Pandas to detect incomplete data in an ETL process and fill
missing values with estimates.

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes
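
One possible way to approach these three steps with Pandas is sketched below. The choice of estimates is an assumption, since the task does not specify them: numeric columns are filled with the median and other columns with the most frequent value. The function names `detect_incomplete` and `fill_missing` and the `extracted` sample frame are hypothetical; the sample values reuse the dataset from the earlier cell.

```python
import numpy as np
import pandas as pd

def detect_incomplete(df):
    """Step 1: count missing values per column."""
    return df.isna().sum()

def fill_missing(df):
    """Steps 2 and 3: fill gaps on a copy and record what was changed.

    Assumes every column has at least one non-missing value.
    """
    filled = df.copy()
    changes = []
    for col in filled.columns:
        missing = int(filled[col].isna().sum())
        if missing == 0:
            continue
        if pd.api.types.is_numeric_dtype(filled[col]):
            # Assumed estimate for numeric data: the column median
            fill_value, method = filled[col].median(), "median"
        else:
            # Assumed estimate for non-numeric data: the most frequent value
            fill_value, method = filled[col].mode().iloc[0], "mode"
        filled[col] = filled[col].fillna(fill_value)
        changes.append(
            {"column": col, "filled": missing, "method": method, "value": fill_value}
        )
    report = pd.DataFrame(changes, columns=["column", "filled", "method", "value"])
    return filled, report

# Hypothetical extract stage output with gaps
extracted = pd.DataFrame({
    "Age": [25, 27, np.nan, 29, 30],
    "Salary": [50000, 54000, 58000, np.nan, 62000],
    "Department": ["HR", "Finance", "HR", np.nan, "IT"],
})

missing_counts = detect_incomplete(extracted)
filled, report = fill_missing(extracted)

print("Missing values per column:")
print(missing_counts)
print("\nChange report:")
print(report)
print("\nData after filling:")
print(filled)
```

Returning the report as its own DataFrame keeps the audit trail separate from the data, so it can be logged or stored alongside the load step.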

In [6]:
# Write your code from here