### Task 1: Handling Schema Mismatches using Spark
**Description**: Use Apache Spark to address schema mismatches by transforming data to match
the expected schema.

**Steps**:
1. Create Spark session
2. Load dataframe
3. Define the expected schema
4. Handle schema mismatches
5. Show corrected data
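
Step 4 depends on knowing which columns actually differ from the expected schema. The following is a minimal sketch of that comparison in plain Python. It assumes the actual schema is available as the list of `(name, type)` pairs that PySpark's `DataFrame.dtypes` returns. The helper name `plan_schema_fixes` and the sample values are hypothetical, chosen only for illustration:

```python
from typing import Dict, List, Tuple


def plan_schema_fixes(
    actual: List[Tuple[str, str]],
    expected: Dict[str, str],
) -> Dict[str, List[str]]:
    """Compare actual (name, type) pairs with an expected name -> type mapping."""
    actual_types = dict(actual)
    return {
        # Present in both, but with a different type: needs a cast.
        "cast": [c for c, t in expected.items()
                 if c in actual_types and actual_types[c] != t],
        # Expected but absent: would need to be added, e.g. as a null column.
        "missing": [c for c in expected if c not in actual_types],
        # Present but not expected: candidates for dropping.
        "extra": [c for c in actual_types if c not in expected],
    }


# Stand-in for df_raw.dtypes: every column was inferred as a string.
actual_dtypes = [("age", "string"), ("id", "string"), ("name", "string")]
expected_types = {"id": "int", "name": "string", "age": "int"}

plan = plan_schema_fixes(actual_dtypes, expected_types)
print(plan)
```

Keeping the comparison separate from the casting makes it possible to log or review the plan before any column is modified.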

In [None]:
# Step 1: Create Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SchemaMismatchHandling") \
    .getOrCreate()

# Step 2: Load dataframe with schema mismatch (e.g., incorrect data types or missing fields)
data = [
    {"id": "1", "name": "Alice", "age": "25"},  # id and age arrive as strings but should be ints
    {"id": "2", "name": "Bob", "age": "30"},
    {"id": "3", "name": "Charlie", "age": None},  # missing age
]

df_raw = spark.createDataFrame(data)
print("Original DataFrame with schema mismatch:")
df_raw.printSchema()
df_raw.show()

# Step 3: Define the expected schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

expected_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

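# Step 4: Handle schema mismatches by casting columns to the expected types
from pyspark.sql.functions import col

df_corrected = df_raw \
    .withColumn("id", col("id").cast("int")) \
    .withColumn("age", col("age").cast("int"))

# Step 5: Show corrected data
print("Corrected DataFrame:")
df_corrected.printSchema()
df_corrected.show()
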

### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and Pandas to detect incomplete data in an ETL process and fill
missing values with estimates.

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes
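
Detection in step 1 can go beyond a per-column count. The sketch below computes the share of missing values per column and selects the rows that have at least one gap, using a small made-up frame (`sample` is hypothetical data, not the ETL input used in this task):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two columns.
sample = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Alice", None, "Charlie"],
    "age": [25, np.nan, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isna().mean() * 100

# Rows that have at least one missing field.
incomplete_rows = sample[sample.isna().any(axis=1)]

print(missing_pct.round(1))
print(incomplete_rows)
```

A percentage is often easier to compare against a quality threshold than a raw count, and the row-level view shows which records would be affected by any fill strategy.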

In [None]:
import pandas as pd
import numpy as np

# Sample ETL input: DataFrame with missing values
data = {
    "customer_id": [101, 102, 103, 104, 105],
    "name": ["Alice", "Bob", "Charlie", None, "Eve"],
    "age": [25, np.nan, 30, 22, np.nan],
    "purchase_amount": [100.5, 250.0, None, 175.0, 200.0]
}

df = pd.DataFrame(data)
print("Original Data with Missing Values:")
print(df)

# Step 1: Detect incomplete data
missing_report = df.isnull().sum()
print("\nMissing Values Report:")
print(missing_report)

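# Step 2: Fill missing values (mean for numerical columns, placeholder for strings)
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which is deprecated chained assignment in recent pandas versions.
df["name"] = df["name"].fillna("Unknown")
df["age"] = df["age"].fillna(df["age"].mean())
df["purchase_amount"] = df["purchase_amount"].fillna(df["purchase_amount"].mean())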

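# Step 3: Report changes (values filled per column, then the cleaned data)
filled_counts = missing_report - df.isnull().sum()
print("\nValues Filled per Column:")
print(filled_counts)
print("\nData After Filling Missing Values:")
print(df)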
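
One way to make the fill and the report a single reusable step is sketched below. It assumes the fill values are supplied as a column-to-value dictionary. The helper `fill_with_report` and the `sample` frame are hypothetical names and data introduced for illustration:

```python
import numpy as np
import pandas as pd


def fill_with_report(frame: pd.DataFrame, fill_values: dict):
    """Return a filled copy of `frame` plus a per-column summary of what changed."""
    before = frame.isna().sum()
    # fillna with a dict fills each listed column with its own value
    # and returns a new DataFrame, leaving `frame` untouched.
    filled = frame.fillna(value=fill_values)
    after = filled.isna().sum()
    report = pd.DataFrame({
        "missing_before": before,
        "missing_after": after,
        "values_filled": before - after,
    })
    return filled, report


sample = pd.DataFrame({
    "name": ["Alice", None, "Eve"],
    "age": [20.0, np.nan, 30.0],
})

# Same strategy as above: placeholder for strings, mean for numbers.
fills = {"name": "Unknown", "age": sample["age"].mean()}
filled, report = fill_with_report(sample, fills)
print(report)
```

Returning a copy rather than modifying the input keeps the original data available for auditing, which is usually worth the extra memory in an ETL step.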
