### Task 1: Handling Schema Mismatches using Spark
**Description**: Use Apache Spark to address schema mismatches by transforming data to match
the expected schema.

**Steps**:
1. Create Spark session
2. Load dataframe
3. Define the expected schema
4. Handle schema mismatches
5. Show corrected data

In [1]:
# Write your code from here
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col

# Step 1: Create Spark session
spark = SparkSession.builder \
    .appName("Schema Mismatch Handling") \
    .getOrCreate()

# Step 2: Load DataFrame (Assume it has mismatched schema)
data = [
    ("Alice", "30", "50000.0"),      # age and salary are numeric but stored as strings
    ("Bob", "25", "not_available"),  # salary is not a valid number
    ("Charlie", None, "45000.50"),   # age is missing
]

# Inferred schema (all fields are strings)
df = spark.createDataFrame(data, ["name", "age", "salary"])
print("Original DataFrame with mismatched schema:")
df.show()

# Step 3: Define expected schema
expected_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True)
])

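# Step 4: Handle schema mismatches by casting each column to its expected type.
# With ANSI mode off, values that cannot be parsed (e.g. "not_available") become null;
# with ANSI mode on, the cast raises an error instead.
df_corrected = df.select(
    [col(f.name).cast(f.dataType).alias(f.name) for f in expected_schema.fields]
)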

print("Corrected DataFrame with expected schema:")
df_corrected.show()

# Optional: Validate schema
print("Corrected DataFrame Schema:")
df_corrected.printSchema()


JAVA_HOME is not set


PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
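
The error above means PySpark could not start a JVM, so the cell never reached Step 1. A minimal pre-flight check is sketched below, assuming the Spark launcher looks for `$JAVA_HOME/bin/java` first and then for `java` on `PATH`. The helper name `find_java` is hypothetical and is not part of PySpark.

```python
import os
import shutil


def find_java():
    """Return the path to a `java` executable, or None if none is found.

    Hypothetical helper: checks $JAVA_HOME/bin first, then PATH.
    """
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        # A JDK keeps its launcher under $JAVA_HOME/bin
        candidate = shutil.which("java", path=os.path.join(java_home, "bin"))
        if candidate:
            return candidate
    # Fall back to whatever `java` is on PATH
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("No Java runtime found: install a JDK and set JAVA_HOME "
          "before creating the SparkSession.")
else:
    print(f"Java found at {java_path}")
```

Running this before `SparkSession.builder...getOrCreate()` turns the gateway failure into a readable message about the missing JDK.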

### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and Pandas to detect incomplete data in an ETL process and fill
missing values with estimates.

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes

In [None]:
# Write your code from here
import pandas as pd
import numpy as np

# Sample ETL input data with missing values
data = {
    'customer_id': [101, 102, 103, 104, 105],
    'age': [25, np.nan, 30, np.nan, 45],
    'income': [50000, 60000, None, 75000, None]
}

df = pd.DataFrame(data)
print("Original Data (with missing values):")
print(df)

# Step 1: Detect incomplete data
missing_summary = df.isnull().sum()
print("\nMissing Values Summary:")
print(missing_summary)

# Step 2: Fill missing values (estimate using mean or median)
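# Assign the result back: calling fillna(inplace=True) on a selected column is
# deprecated in recent pandas and may not update the original DataFrame.
df['age'] = df['age'].fillna(df['age'].mean())
df['income'] = df['income'].fillna(df['income'].median())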

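# Step 3: Report changes (how many values were filled in each column)
print("\nValues Filled per Column:")
print(missing_summary - df.isnull().sum())
print("\nData After Filling Missing Values:")
print(df)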
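
The three steps can also be wrapped in a reusable function that leaves the input untouched and returns a per-column report. This is only a sketch: the function name `fill_and_report`, the `strategies` mapping, and the report column names are assumptions made for illustration, not part of the task.

```python
import numpy as np
import pandas as pd


def fill_and_report(frame, strategies):
    """Fill missing values per column and return (filled_frame, report).

    `strategies` maps a column name to "mean" or "median" (hypothetical API).
    """
    filled = frame.copy()  # leave the caller's DataFrame unchanged
    rows = []
    for column, strategy in strategies.items():
        n_missing = int(filled[column].isna().sum())
        if strategy == "mean":
            estimate = filled[column].mean()
        elif strategy == "median":
            estimate = filled[column].median()
        else:
            raise ValueError(f"Unsupported strategy: {strategy}")
        filled[column] = filled[column].fillna(estimate)
        rows.append({"column": column, "strategy": strategy,
                     "filled_count": n_missing, "fill_value": estimate})
    return filled, pd.DataFrame(rows)


sample = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "age": [25, np.nan, 30, np.nan, 45],
    "income": [50000, 60000, None, 75000, None],
})
filled, report = fill_and_report(sample, {"age": "mean", "income": "median"})
print(report)
```

Keeping the estimate and the fill count in one table makes the imputation auditable: a downstream ETL step can log `report` alongside the cleaned data.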
