### Task 1: Handling Schema Mismatches using Spark
**Description**: Use Apache Spark to address schema mismatches by transforming data to match
the expected schema.

**Steps**:
1. Create Spark session
2. Load dataframe
3. Define the expected schema
4. Handle schema mismatches
5. Show corrected data

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col  # needed for the column casts in Step 4
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

# Step 1: Create Spark session
spark = SparkSession.builder \
    .appName("Schema Mismatches Handling") \
    .getOrCreate()

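# Step 2: Load DataFrame (in-memory sample rows stand in for a real source such as a CSV file)
# Salary values are written as floats because DoubleType does not accept Python ints
data = [
    (25, 50000.0, 'HR'),
    (27, 54000.0, 'Finance'),
    (None, 58000.0, 'HR'),
    (29, None, 'Finance'),
    (30, 62000.0, 'IT'),
    (None, None, None)
]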

# Define the schema explicitly
schema = StructType([
    StructField("Age", IntegerType(), True),
    StructField("Salary", DoubleType(), True),
    StructField("Department", StringType(), True)
])

# Create DataFrame using the schema
df = spark.createDataFrame(data, schema)

# Step 3: Define the expected schema (identical to the input schema here, so the casts in Step 4 leave the data unchanged)
expected_schema = StructType([
    StructField("Age", IntegerType(), True),
    StructField("Salary", DoubleType(), True),
    StructField("Department", StringType(), True)
])

# Step 4: Handle schema mismatches by casting columns to the expected data types
df_corrected = df.select(
    col("Age").cast(IntegerType()).alias("Age"),
    col("Salary").cast(DoubleType()).alias("Salary"),
    col("Department").cast(StringType()).alias("Department")
)

# Show the schema after casting and the corrected data
df_corrected.printSchema()
df_corrected.show()
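
Before casting, it can help to see which columns actually differ from the expected schema. The following is a minimal sketch in plain Python, not part of the original exercise: `find_schema_mismatches` is a hypothetical helper, and the two dictionaries are made-up inputs that map column names to type names. In Spark, a mapping of this shape could likely be built with `dict(df.dtypes)`, since `DataFrame.dtypes` returns `(name, type)` pairs.

```python
def find_schema_mismatches(actual, expected):
    """Compare two {column: type-name} mappings and describe the differences."""
    mismatches = {}
    for name, expected_type in expected.items():
        if name not in actual:
            # Column required by the expected schema but absent from the data
            mismatches[name] = ("missing", expected_type)
        elif actual[name] != expected_type:
            # Column present, but with a different type than expected
            mismatches[name] = (actual[name], expected_type)
    # Columns present in the data but absent from the expected schema
    for name in actual:
        if name not in expected:
            mismatches[name] = (actual[name], "unexpected")
    return mismatches


# Hypothetical input where every column was read as a string (e.g. from a CSV)
actual_types = {"Age": "string", "Salary": "string", "Department": "string"}
expected_types = {"Age": "int", "Salary": "double", "Department": "string"}

mismatches = find_schema_mismatches(actual_types, expected_types)
print(mismatches)
# {'Age': ('string', 'int'), 'Salary': ('string', 'double')}
```

Only the columns listed in the result would need a cast; columns that already match can be selected unchanged.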



### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and Pandas to detect incomplete data in an ETL process and fill
missing values with estimates.

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes

In [None]:
import pandas as pd
import numpy as np

# Sample data with missing values
data = {
    'Age': [25, 27, np.nan, 29, 30],
    'Salary': [50000, 54000, 58000, np.nan, 62000],
    'Department': ['HR', 'Finance', 'HR', np.nan, 'IT']
}

# Load the data into a pandas DataFrame
df = pd.DataFrame(data)

# Show original data
print("Original DataFrame:")
print(df)

# Step 1: Detect incomplete data by counting missing values per column
missing_before = df.isnull().sum()
print("\nMissing values per column before imputation:")
print(missing_before)

# Step 2: Fill missing values
# Impute the numerical columns 'Age' and 'Salary' with their column means
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Impute the categorical column 'Department' with its mode (most frequent value)
df['Department'] = df['Department'].fillna(df['Department'].mode()[0])

print("\nDataFrame after Imputation:")
print(df)

# Step 3: Report the changes: how many values were filled in each column
missing_after = df.isnull().sum()
missing_report = (missing_before - missing_after).to_dict()

print("\nFilled Values Report:")
print(missing_report)
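
The detect, fill, and report steps can also be wrapped in one reusable function. The sketch below is one possible arrangement rather than a prescribed solution: `impute_with_report` is a hypothetical helper name, and the choice of mean for numeric columns and mode for all other columns is an assumption carried over from the task description.

```python
import numpy as np
import pandas as pd


def impute_with_report(frame):
    """Return an imputed copy of `frame` and a per-column count of filled values."""
    result = frame.copy()
    filled = {}
    for column in result.columns:
        n_missing = int(result[column].isnull().sum())
        if n_missing == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(result[column]):
            fill_value = result[column].mean()
        else:
            modes = result[column].mode()
            fill_value = modes[0] if not modes.empty else np.nan
        if pd.isna(fill_value):
            continue  # column is entirely missing, so no estimate is available
        result[column] = result[column].fillna(fill_value)
        filled[column] = n_missing
    return result, filled


# Same sample values as in the cell above
sample = pd.DataFrame({
    'Age': [25, 27, np.nan, 29, 30],
    'Salary': [50000, 54000, 58000, np.nan, 62000],
    'Department': ['HR', 'Finance', 'HR', np.nan, 'IT'],
})

clean, filled = impute_with_report(sample)
print(clean)
print(filled)
```

Because the function works on a copy, the original frame keeps its gaps, which makes it easy to compare the data before and after imputation.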
