### Task 1: Handling Schema Mismatches using Spark
**Description**: Use Apache Spark to address schema mismatches by transforming data to match
the expected schema.

**Steps**:
1. Create Spark session
2. Load dataframe
3. Define the expected schema
4. Handle schema mismatches
5. Show corrected data

In [3]:
# Write your code from here


### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and pandas to detect incomplete (missing) data in an ETL process and
fill the missing values with estimates.

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes
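
The three steps above could be sketched roughly as follows. This is a minimal illustration, not a prescribed solution. The helper name `fill_incomplete`, the sample data, and the choice of estimates (median for numeric columns, most frequent value otherwise) are all assumptions made for the example.

```python
import pandas as pd


def fill_incomplete(df):
    """Detect missing values, fill them with simple estimates, and report the changes.

    Hypothetical helper: the estimation strategy is an assumption for illustration.
    """
    # Step 1: detect incomplete data by counting missing values per column
    missing_before = df.isna().sum()

    # Work on a copy so the input frame is left untouched
    filled = df.copy()
    for column in filled.columns:
        if missing_before[column] == 0:
            continue
        if pd.api.types.is_numeric_dtype(filled[column]):
            # Step 2a: numeric columns are filled with the column median
            estimate = filled[column].median()
        else:
            # Step 2b: other columns are filled with the most frequent value
            mode = filled[column].mode(dropna=True)
            estimate = mode.iloc[0] if not mode.empty else "unknown"
        filled[column] = filled[column].fillna(estimate)

    # Step 3: report how many values were filled in each column
    report = pd.DataFrame({
        "missing_before": missing_before,
        "missing_after": filled.isna().sum(),
    })
    report["filled"] = report["missing_before"] - report["missing_after"]
    return filled, report


# Hypothetical sample data standing in for one ETL batch
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Ana", None, "Ana", "Raj"],
    "age": [30.0, None, 40.0, 50.0],
    "salary": [1000.0, 2000.0, None, 4000.0],
})

filled, report = fill_incomplete(sample)
print(report)
print(filled)
```

The median is used here because it is less sensitive to outliers than the mean. A real pipeline might prefer group-wise estimates or interpolation, depending on the data.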

In [4]:
# Task 1 solution: handle schema mismatches with Spark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pyspark.sql.functions import col, lit

def main():
    # Step 1: Create Spark session
    spark = SparkSession.builder.appName("SchemaMismatchHandling").getOrCreate()

    # Step 2: Load dataframe (example CSV with schema issues)
    df = spark.read.option("header", True).csv("input_data.csv")
    print("Original DataFrame schema:")
    df.printSchema()
    print("Original data:")
    df.show()

    # Step 3: Define expected schema
    expected_schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("salary", DoubleType(), True)
    ])

    # Step 4: Handle schema mismatches
    # Cast existing columns to the expected types; add missing columns as typed nulls
    for field in expected_schema.fields:
        if field.name not in df.columns:
            # Add the missing column, filled with nulls of the expected type
            df = df.withColumn(field.name, lit(None).cast(field.dataType))
        else:
            # Cast the existing column to the expected type
            # (unparseable values become null unless ANSI mode is enabled)
            df = df.withColumn(field.name, col(field.name).cast(field.dataType))

    # Remove any extra columns not in expected schema
    df = df.select([field.name for field in expected_schema.fields])

    # Step 5: Show corrected data
    print("Corrected DataFrame schema:")
    df.printSchema()
    print("Corrected data:")
    df.show()

    spark.stop()

if __name__ == "__main__":
    main()

